
Beyond linearity

Part IV: Basis expansion methods

• So far, the majority of methods/models we have seen are linear

• In general, it is highly unlikely that the true underlying f(·) is linear
  - either on the original scale, or on some transformed scale

• Linear models can serve as good approximations
  - locally, any function can be well approximated by a line
  - convenient analytically
  - interpretable

• Here we would like to move beyond linearity

Reading
  - HTF: Chapters 5, 8
  - RWC: Chapters 3, 4, 5

S. Haneuse; Biostat/Stat 572

Hormone replacement therapy over time

BCSC data

• Rates of hormone replacement therapy (HRT) use and cancer, between 1997 and 2004

• Data obtained from the Breast Cancer Surveillance Consortium
  - an on-going study of mammography in the US
  - collected from 7 sites around the country

• Rates based on data from women aged 50-69
  - age- and site-adjusted
  - n = 84, month-specific values

[Figure: rates of HRT use, invasive cancer, and DCIS over calendar time, 1997-2004]

Linear regression

• Consider the task of describing the trend in HRT use over time

• Although the structure is evidently non-linear, we could start by considering a linear model

• Let xi denote the date of the i-th observation, and yi the corresponding observed rate of HRT use

• The form of the linear regression model is

    yi = β0 + β1 xi + εi,

  with the corresponding fit shown next

[Figure: HRT use (per 1,000) against calendar date, with the linear regression fit overlaid]
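As an aside (not part of the original slides), the least squares fit of this simple linear model has a closed form; a minimal Python sketch, using hypothetical data:

```python
def ols_simple(x, y):
    """Closed-form OLS for the model y = b0 + b1*x + error."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx              # slope: Sxy / Sxx
    b0 = ybar - b1 * xbar       # the fitted line passes through (xbar, ybar)
    return b0, b1

# Data generated exactly from the line y = 2 + 3x are recovered without error
b0, b1 = ols_simple([0.0, 1.0, 2.0, 3.0], [2.0, 5.0, 8.0, 11.0])
```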


• The systematic component of this model consists of two functions:

    h0(x) = 1
    h1(x) = x

• These two functions are referred to as basis functions, and form the basis of the model

• The basis corresponds to the columns of the design matrix

        [ 1  x1 ]
    X = [ :  :  ]
        [ 1  xn ]

• After standardizing, so that x ∈ [0, 1], the basis can be displayed graphically

[Figure: the two basis functions, h0(x) = 1 and h1(x) = x, on [0, 1]]

• The core idea of Part IV is to augment/replace the input vector by considering transformations of the X-space via families of basis functions

Polynomial basis expansion

• A straightforward way of augmenting the design matrix is to consider higher-degree polynomial terms

• A p-th degree polynomial basis corresponds to the basis functions

    hj(x) = x^j,   j = 0, . . ., p

  and yields the design matrix

        [ 1  x1  . . .  x1^p ]
    X = [ :  :          :    ]
        [ 1  xn  . . .  xn^p ]

[Figure: quadratic regression fit (p = 2) to the HRT data]
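As a sketch (not from the slides), building the polynomial design matrix amounts to mapping each observation to the row (1, x, . . ., x^p):

```python
def poly_design(x, p):
    """Design matrix for a p-th degree polynomial basis h_j(x) = x**j."""
    return [[xi ** j for j in range(p + 1)] for xi in x]

# Each row is (1, x, x^2) for the quadratic basis
X = poly_design([0.0, 0.5, 1.0], p=2)
```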


[Figure: cubic regression fit (p = 3) and 10th-degree polynomial fit to the HRT data]

[Figure: the quadratic and cubic regression bases, on [0, 1]]

Piecewise polynomials

• The polynomial basis imposes a single structure throughout the X-space

• We could relax this by partitioning the X-space into a series of disjoint intervals, and adopting a polynomial structure within each

• For the HRT data, we could partition the [1997, 2004] interval into three sub-intervals by defining two split points

    ξ = (ξ1, ξ2) = (2000, 2002.5)


Discrete piecewise polynomials

• Given ξ, the simplest model would assume f(X) is piecewise constant
  - estimation reduces to calculating the sample mean within each of the three intervals

• This model can be represented with three basis functions:

    h0*(x) = 1[x < ξ1]
    h1*(x) = 1[ξ1 ≤ x < ξ2]
    h2*(x) = 1[x ≥ ξ2]

  so that

    f(x) = Σ_{j=0}^{2} βj hj*(x)

[Figure: piecewise constant fit to the HRT data, with the knots ξ1 and ξ2 marked]
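Because the indicator basis functions are non-overlapping, least squares estimation indeed reduces to interval-specific sample means; a Python sketch (the function name and toy data are ours, not the slides'):

```python
def piecewise_constant_fit(x, y, knots):
    """Least-squares piecewise-constant fit: the coefficient for each
    interval is simply the sample mean of y within that interval."""
    def interval(xi):
        # intervals: (-inf, k1), [k1, k2), ..., [kK, +inf)
        for j, k in enumerate(knots):
            if xi < k:
                return j
        return len(knots)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(interval(xi), []).append(yi)
    return {j: sum(v) / len(v) for j, v in groups.items()}

# means[j] is the sample mean of y within interval j
means = piecewise_constant_fit([1.0, 2.0, 5.0, 6.0, 9.0],
                               [10.0, 20.0, 30.0, 50.0, 80.0],
                               knots=[4.0, 8.0])
```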

• An alternative representation is given by

    h0*(x) = 1
    h1*(x) = 1[ξ1 ≤ x < ξ2]
    h2*(x) = 1[x ≥ ξ2]

• We can relax this model by allowing the piecewise components to be linear
  - requires three additional basis functions

    h00(x) = 1,   h10(x) = 1[ξ1 ≤ x < ξ2],   h20(x) = 1[x ≥ ξ2]

  and

    h01(x) = x,   h11(x) = 1[ξ1 ≤ x < ξ2] x,   h21(x) = 1[x ≥ ξ2] x

[Figure: piecewise linear fit to the HRT data, with the knots ξ1 and ξ2 marked]


• Can extend this to arbitrary degrees, where we define, for j = 0, . . ., p,

    h0j(x) = x^j
    h1j(x) = 1[ξ1 ≤ x < ξ2] x^j
    h2j(x) = 1[x ≥ ξ2] x^j

• Given K interior split points, or knots, this results in

    (K + 1) × (p + 1)

  parameters

[Figure: piecewise cubic fit to the HRT data, with the knots ξ1 and ξ2 marked]

Continuous piecewise polynomials

• Typically, we don't expect the true underlying function to be disjoint at a set of (essentially) arbitrary knots

• Ensure continuity by imposing constraints at the knot locations

• For the piecewise linear model there are two constraints:

    at ξ1:  β00 + ξ1 β01 = β10 + ξ1 β11
    at ξ2:  β10 + ξ2 β11 = β20 + ξ2 β21

  - 6 parameters − 2 constraints = 4 free parameters

[Figure: continuous piecewise linear fit to the HRT data, with the knots ξ1 and ξ2 marked]
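As a numerical check (hypothetical coefficient values, not from the slides), the constraint at ξ1 can be used to solve for the right-hand intercept, forcing the two linear pieces to meet at the knot:

```python
def constrained_intercept(b00, b01, b11, knot):
    """Solve the continuity constraint b00 + knot*b01 = b10 + knot*b11 for b10."""
    return b00 + knot * (b01 - b11)

# hypothetical values for the left piece and the right-hand slope
b00, b01, b11, knot = 1.0, 2.0, 5.0, 3.0
b10 = constrained_intercept(b00, b01, b11, knot)

left_at_knot = b00 + knot * b01     # left piece evaluated at the knot
right_at_knot = b10 + knot * b11    # right piece evaluated at the knot
```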


• Alternatively, we can use a basis which directly incorporates the constraints:

    h0(x) = 1,           h1(x) = x,
    h2(x) = (x − ξ1)+,   h3(x) = (x − ξ2)+

  - t+ denotes the positive part of t

• After standardizing x, this basis can be represented graphically

[Figure: the four basis functions h0, . . ., h3 on [0, 1]]

• Q: Can we intuitively explain how this representation achieves the goal?

Truncated-power basis

• We might also be interested in imposing a certain degree of smoothness at the knot locations

• Achieved by incorporating additional constraints at the knot locations
  - the 1st constraint ensures continuity of f(·)
  - the 2nd constraint ensures continuity of the first derivative f′(·)
  - the 3rd constraint ensures continuity of the second derivative f″(·)
  - etc.

• For a polynomial of degree p, imposing K × p constraints yields a function, f(·), which has continuous derivatives, wrt x, up to order p − 1

• Total degrees of freedom are:

    (K + 1) × (p + 1) − K × p = K + p + 1

• Build on the representation for the continuous piecewise linear model, and increase the order of the local polynomial

• General form of the truncated-power basis:
  - degree-p spline, with knots ξk, k = 1, . . ., K

    hj(x) = x^j,              j = 0, . . ., p
    hp+k(x) = (x − ξk)+^p,    k = 1, . . ., K

  - K + p + 1 basis functions

[Figure: discrete piecewise linear fit to the HRT data, df = 6]
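A Python sketch (function name ours) of the general truncated-power basis; for p = 3 and K = 3 knots it produces the expected K + p + 1 = 7 basis function values:

```python
def truncated_power_basis(x, knots, p):
    """Evaluate the degree-p truncated power basis at a point x:
    h_j(x) = x**j for j = 0..p, then (x - knot)_+**p for each knot.
    Returns K + p + 1 values."""
    pos = lambda t: t if t > 0 else 0.0
    return [x ** j for j in range(p + 1)] + [pos(x - k) ** p for k in knots]

knots = [0.25, 0.5, 0.75]
basis = truncated_power_basis(0.6, knots, p=3)
# K + p + 1 = 3 + 3 + 1 = 7 basis functions; knots above x contribute 0
```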


[Figure: linear spline fit (df = 4) and quadratic spline fit (df = 5) to the HRT data]

[Figure: cubic spline fit (df = 6) to the HRT data]

B-splines

• The form of the truncated power basis leads to numerical instability
  - powers of large numbers lead to rounding problems

• B-splines provide an alternative, computationally convenient, and equivalent representation of the truncated power basis
  - equivalent in that the two bases span the same set of functions

• The B-spline representation is via a series of polynomial basis functions which have local support

• Two key aspects of their definition are
  - they are locally defined, and hence require an augmentation of the knot sequence
  - a recursive relation is used to build up the degree of the polynomial functions


• To help understand the representation, we can consider the cubic spline (p = 3) on the unit interval, x ∈ [0, 1], with K = 3 interior knots

    ξ = (0.25, 0.50, 0.75)

• Initially, consider the truncated power basis of degree 0
  - i.e., a piecewise constant function

• The degree-0 B-spline representation is a series of locally constant functions over the support of X
  - requires augmenting ξ with boundary knots, say, ξ0 = 0 and ξK+1 = ξ4 = 1, to give

    ξ* = (0, 0.25, 0.50, 0.75, 1)

  - the boundary knots define the support over which the spline is evaluated

[Figure: B-spline basis functions of degree 0, df = 4]

• More formally, the B-spline basis functions of degree 0 are

    Bi,0(x) = 1 if x ∈ [ξi, ξi+1), and 0 otherwise,   i = 0, . . ., K

  - K + p + 1 = 4 basis functions

• Note that each basis function 'spans' two knots
  - the boundary knots are required to ensure all of the locally-defined Bi,0(·) are well-defined

• The construction of the B-spline representation of a linear spline is obtained by taking a weighted average of the Bi,0(·) functions:

    Bi,1(x) = [(x − ξi)/(ξi+1 − ξi)] Bi,0(x) + [(ξi+2 − x)/(ξi+2 − ξi+1)] Bi+1,0(x)

  - the two components are defined on the intervals [ξi, ξi+1) and [ξi+1, ξi+2), respectively

• We see that each component of Bi,1(·) is the product of two functions of x
  - a locally constant function
  - a locally linear function

• Hence each product is locally linear, as is the overall basis function Bi,1(·)
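A minimal Python sketch (function names ours) of this first step of the recursion, using the augmented knot sequence ξ* = (0, 0.25, 0.50, 0.75, 1); each Bi,1 is the familiar 'hat' function, equal to 0 at ξi and ξi+2 and peaking at 1 at ξi+1:

```python
def B0(x, i, knots):
    """Degree-0 B-spline: indicator of the interval [knots[i], knots[i+1])."""
    return 1.0 if knots[i] <= x < knots[i + 1] else 0.0

def B1(x, i, knots):
    """Degree-1 B-spline: the weighted average of two adjacent B0's."""
    left = (x - knots[i]) / (knots[i + 1] - knots[i]) * B0(x, i, knots)
    right = (knots[i + 2] - x) / (knots[i + 2] - knots[i + 1]) * B0(x, i + 1, knots)
    return left + right

# the augmented knot sequence from the slide
knots = [0.0, 0.25, 0.5, 0.75, 1.0]
```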


[Figure: B-spline basis functions of degree 1, df = 5]

• From the definition, we can see that each Bi,1(·) is defined over an interval which 'spans' three knots

• To ensure that each Bi,1(·) is well-defined, we are again required to further augment the knot sequence with two additional boundary knots, ξ−1 and ξK+2

• As such, K + p + 1 = 5 basis functions are defined
  - index i = −1, . . ., K

• By convention, knots beyond the boundary knots are set equal to ξ0 or ξK+1, as appropriate
  - in the present example, this gives ξ* = (0, 0, 0.25, 0.50, 0.75, 1, 1)
  - to avoid division by zero, we adopt the convention: Bi,0(x) = 0 if ξi = ξi+1

• The construction of the B-spline representation of a quadratic spline is again based on a recursion relation, and also requires further augmentation of the knot sequence: ξ−2 and ξK+3

• The form of each degree-2 basis function is obtained by taking a weighted average of the Bi,1(·) functions:

    Bi,2(x) = [(x − ξi)/(ξi+2 − ξi)] Bi,1(x) + [(ξi+3 − x)/(ξi+3 − ξi+1)] Bi+1,1(x)

  for i = −2, . . ., K

• Each component of Bi,2(·) is the product of two locally linear functions
  - defined on the intervals [ξi, ξi+2) and [ξi+1, ξi+3), respectively

• Hence each product is locally quadratic, as is the overall basis function
  - defined on the interval [ξi, ξi+3)

[Figure: B-spline basis functions of degree 2, df = 6]


• Finally, the construction of the B-spline representation of a cubic spline is again based on a recursion relation, and additional knots: ξ−3 and ξK+4

• The form of each degree-3 basis function is obtained by taking a weighted average of the Bi,2(·) functions:

    Bi,3(x) = [(x − ξi)/(ξi+3 − ξi)] Bi,2(x) + [(ξi+4 − x)/(ξi+4 − ξi+1)] Bi+1,2(x)

  for i = −3, . . ., K

• Each component of Bi,3(·) is the product of a locally linear and a locally quadratic function
  - defined on the intervals [ξi, ξi+3) and [ξi+1, ξi+4), respectively

• Hence each product is locally cubic, as is the overall basis function
  - defined on the interval [ξi, ξi+4)

[Figure: B-spline basis functions of degree 3, df = 7]

• Summary: degree-p B-spline
  - define an augmented knot sequence:

    ξ* = (ξ−p, . . ., ξ0, ξ, ξK+1, . . ., ξK+p+1)

  - for i = −p, . . ., K + p, let

    Bi,0(x) = 1 if x ∈ [ξi, ξi+1), and 0 otherwise,

    where, by convention, Bi,0(x) = 0 if ξi = ξi+1

  - the i-th B-spline basis function of degree j, j = 1, . . ., p, is given by

    Bi,j(x) = [(x − ξi)/(ξi+j − ξi)] Bi,j−1(x) + [(ξi+j+1 − x)/(ξi+j+1 − ξi+1)] Bi+1,j−1(x)

    for i = −p, . . ., K + p − j
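The recursion above can be coded directly; the following Python sketch (ours, not from the slides) evaluates all degree-p B-spline basis functions at a point, with the boundary knots repeated by the convention above, and illustrates that there are K + p + 1 of them and that they are non-negative and sum to one:

```python
def bspline_basis(x, interior, p, lo=0.0, hi=1.0):
    """Evaluate all degree-p B-spline basis functions at x, using the
    augmented knot sequence (boundary knots repeated p + 1 times)."""
    knots = [lo] * (p + 1) + list(interior) + [hi] * (p + 1)
    # degree 0: indicators, with the convention B_{i,0} = 0 on repeated knots
    B = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
         for i in range(len(knots) - 1)]
    for j in range(1, p + 1):
        nxt = []
        for i in range(len(B) - 1):
            left = 0.0
            if knots[i + j] != knots[i]:
                left = (x - knots[i]) / (knots[i + j] - knots[i]) * B[i]
            right = 0.0
            if knots[i + j + 1] != knots[i + 1]:
                right = ((knots[i + j + 1] - x)
                         / (knots[i + j + 1] - knots[i + 1]) * B[i + 1])
            nxt.append(left + right)
        B = nxt
    return B

# cubic spline on [0, 1] with K = 3 interior knots, as in the running example
vals = bspline_basis(0.6, interior=[0.25, 0.5, 0.75], p=3)
```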


R code

• B-spline basis representations of the truncated power basis can be generated using bs() in the splines library

##
library(splines)
X <- seq(from=0, to=1, length=100)
knots <- c(0.25, 0.50, 0.75)

## Generate the B-spline design matrix
##
bs1 <- bs(X, degree=1, knots=knots, intercept=T)
bs2 <- bs(X, degree=2, knots=knots, intercept=T)
bs3 <- bs(X, degree=3, knots=knots, intercept=T)

## Can also specify the degrees of freedom, in which case
## bs() uses the quantiles of X to choose df-degree-1 (internal) knots
##
bs3alt <- bs(X, degree=3, df=7, intercept=T)

## bs3: 3 + 3 + 1 = 7 basis functions of degree 3
##              1       2       3       4       5       6       7
##  [1,] 1.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
##  [2,] 0.88362 0.11398 0.00239 0.00001 0.00000 0.00000 0.00000
##  [3,] 0.77664 0.21396 0.00931 0.00009 0.00000 0.00000 0.00000
##  [4,] 0.67866 0.30064 0.02041 0.00030 0.00000 0.00000 0.00000
##  [5,] 0.58929 0.37470 0.03531 0.00070 0.00000 0.00000 0.00000
##  [6,] 0.50813 0.43683 0.05366 0.00137 0.00000 0.00000 0.00000
##  [7,] 0.43479 0.48774 0.07509 0.00237 0.00000 0.00000 0.00000
##  [8,] 0.36887 0.52811 0.09925 0.00377 0.00000 0.00000 0.00000
##  [9,] 0.30997 0.55864 0.12576 0.00563 0.00000 0.00000 0.00000
## [10,] 0.25770 0.58002 0.15427 0.00801 0.00000 0.00000 0.00000
## [11,] 0.21167 0.59293 0.18441 0.01099 0.00000 0.00000 0.00000
## [12,] 0.17147 0.59808 0.21582 0.01463 0.00000 0.00000 0.00000
## [13,] 0.13671 0.59615 0.24814 0.01900 0.00000 0.00000 0.00000
## [14,] 0.10700 0.58785 0.28100 0.02415 0.00000 0.00000 0.00000
## [15,] 0.08194 0.57385 0.31404 0.03017 0.00000 0.00000 0.00000
## [16,] 0.06113 0.55486 0.34690 0.03710 0.00000 0.00000 0.00000
## [17,] 0.04419 0.53156 0.37922 0.04503 0.00000 0.00000 0.00000
## [18,] 0.03070 0.50466 0.41063 0.05401 0.00000 0.00000 0.00000

## bs() default has intercept=F since modeling functions in R typically
## include the intercept by default
## - the following are equivalent in that they produce the same fit
##
fit1 <- lm(Y ~ bs(X, degree=3, knots=knots))
fit2 <- lm(Y ~ bs(X, degree=3, knots=knots, intercept=T) - 1)

Question: How would you report the association between X and Y from such a model?

Natural cubic splines

• A potential difficulty with the use of polynomials is that they are unstable as the degree increases
  - the variability of predictions can increase substantially at the boundary of the X-range

• The figure on the next slide shows pointwise standard errors for predictions based on various model fits
  - X is uniformly distributed on [0, 1]
  - Y was obtained from a linear model with an additive error term
  - sample size of n = 50

• The cubic spline, with knots ξ = (0.25, 0.50, 0.75), is highly variable for X close to either boundary


[Figure: pointwise standard errors for the cubic spline (df = 6), natural cubic spline (df = 4), global quadratic (df = 3), and global linear (df = 2) fits]

• Natural cubic splines aim to reduce this uncertainty by imposing linearity in the tails, beyond the boundary knots
  - impose two constraints at each boundary: f″(x) = f‴(x) = 0
  - frees up 4 degrees of freedom

• Given K knots, a natural cubic spline has K basis functions:

    N0(x) = 1
    N1(x) = x
    N1+k(x) = dk(x) − dK−1(x)

  where

    dk(x) = [(x − ξk)+³ − (x − ξK)+³] / (ξK − ξk)
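A quick numerical illustration (the helper names d and N are ours): beyond the boundary knot ξK each dk(x) is quadratic with the same leading coefficient, so N1+k(x) = dk(x) − dK−1(x) is linear there, and its second difference on an equally spaced grid vanishes:

```python
def d(x, knot_k, knot_K):
    """d_k(x) from the natural cubic spline basis above (sketch)."""
    pos = lambda t: max(t, 0.0)
    return (pos(x - knot_k) ** 3 - pos(x - knot_K) ** 3) / (knot_K - knot_k)

def N(x, knots, k):
    """N_{1+k}(x) = d_k(x) - d_{K-1}(x); knots[-1] is the boundary knot."""
    return d(x, knots[k - 1], knots[-1]) - d(x, knots[-2], knots[-1])

knots = [0.25, 0.5, 0.75]
# evaluate N_2 on an equally spaced grid beyond the last knot;
# linearity implies the second difference is (numerically) zero
a, b, c = (N(x, knots, 1) for x in (0.8, 0.9, 1.0))
second_diff = a - 2 * b + c
```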

[Figure: the natural cubic spline basis functions, with knots ξ1, ξ2, ξ3]

R code

• Natural cubic splines also have a B-spline representation
  - Chambers and Hastie (1992) Statistical Models in S, Chapter 7: Generalized additive models

• In R, they are also implemented in the splines package:

##
library(splines)
X <- seq(from=0, to=1, length=100)
knots <- c(0.25, 0.50, 0.75)
##
ncs <- ns(X, knots=knots, intercept=T)


[Figure: B-spline representation of the natural cubic spline]

Smoothing splines

• Each of the methods we have seen so far moves beyond linearity via an explicit transformation of X

    X −→ X*(ξ, p)

• The X* may then be included in any analysis framework

• Fixed-knot splines are often referred to as regression splines

• 'Complexity' is controlled by two quantities:
  - the number and location of the knots, ξ = (ξ1, . . ., ξK)
  - the degree, p

• Smoothing splines are an alternative basis method, which uses a penalty term to control complexity

• Consider estimation of f(·) via minimization of the following penalized residual sum of squares:

    RSS(f, λ) = Σ_{i=1}^{n} {yi − f(xi)}² + λ ∫ [f″(x)]² dx

  - f″(·) is the second derivative
  - restrict attention to the class of functions with two continuous derivatives

• The first term measures 'closeness' to the data, while the second term penalizes curvature in the function

• λ ∈ (0, ∞) is a tuning or smoothing parameter, which controls complexity by establishing a trade-off between the two terms

• λ = 0:
  - f(·) can be any function which interpolates the data
  - very rough

• λ = ∞:
  - the simple least squares fit, since no 2nd derivative can be tolerated
  - very smooth

• The solution is a natural cubic spline with knots at the unique values of xi, i = 1, . . ., n
  - n knots ⇒ n parameters
  - the penalty term translates into regularization (shrinkage) of the corresponding coefficients


• Since the solution is a natural spline, it can be written as

    f(x) = Σ_{j=0}^{n−1} Nj(x) θj

  - Nj(·) is the j-th natural cubic spline basis function

• So the minimization criterion reduces to

    RSS(f, λ) = (y − Nθ)ᵀ(y − Nθ) + λ θᵀΩn θ

  where

    {N}i,j = Nj(xi)
    {Ωn}j,k = ∫ Nj″(t) Nk″(t) dt

• The solution has the same form as that of ridge regression

    θ̂ = (NᵀN + λΩn)⁻¹Nᵀy

• The fitted values are obtained simply as

    ŷ = N(NᵀN + λΩn)⁻¹Nᵀy = Sλ y

  - Sλ is an n × n smoother matrix

• We can take the trace of this matrix as a definition of the effective degrees of freedom of a smoothing spline

    dfλ = trace(Sλ)

  - this aids in understanding/specifying the extent of smoothness

R code

• Smoothing splines are implemented in the stats package
  - one of the core R packages

• smooth.spline():

##
args(smooth.spline)
function (x, y = NULL, w = NULL, df, spar = NULL, cv = FALSE,
    all.knots = FALSE, nknots = NULL, keep.data = TRUE, df.offset = 0,
    penalty = 1, control.spar = list())

  - employs a B-spline representation of the underlying cubic spline
  - spar is a one-to-one function of λ
  - alternatively, one can specify the effective degrees of freedom: df
  - help(smooth.spline) will provide more information

HRT data

##
> library(foreign)
> raw <- read.csv("HTraw.csv")
> names(raw) <- c("date", "cancer", "dcis", "ht")

## Setting df = 2 gives (approximately) the linear model
##
> smooth.spline(raw$date, raw$ht, all.knots=T, cv=T, df=2)

Smoothing Parameter  spar= 1.499966  lambda= 86.90966
Equivalent Degrees of Freedom (Df): 2.002377
Penalized Criterion: 318103.3
PRESS: 4014.238

## yhat is obtained straightforwardly:
##
> yhat <- fitted(smooth.spline(raw$date, raw$ht, all.knots=T, cv=T, df=2))


## Increase the flexibility of the model
## Degrees of freedom: 5
##
> smooth.spline(raw$date, raw$ht, all.knots=T, cv=T, df=5)

Smoothing Parameter  spar= 0.9165963  lambda= 0.005298357
Equivalent Degrees of Freedom (Df): 5.000601
Penalized Criterion: 21654.40
PRESS: 290.1555

## Degrees of freedom: 50
##
> smooth.spline(raw$date, raw$ht, all.knots=T, cv=T, df=50)

Smoothing Parameter  spar= 0.3071915  lambda= 2.095859e-07
Equivalent Degrees of Freedom (Df): 50.00616
Penalized Criterion: 1429.207
PRESS: 105.7059

[Figure: smoothing spline fits to the HRT data with df = 2 (linear regression), 5, 10, and 50]

## Automatic selection of degree/lambda
##
## Cross-validation
##
> smooth.spline(raw$date, raw$ht, all.knots=T, cv=T)

Smoothing Parameter  spar= 0.4309523  lambda= 1.644301e-06
Equivalent Degrees of Freedom (Df): 30.99007
Penalized Criterion: 2859.563
PRESS: 90.05885

## Generalized cross-validation
##
> smooth.spline(raw$date, raw$ht, all.knots=T)

Smoothing Parameter  spar= 0.4433290  lambda= 2.020224e-06
Equivalent Degrees of Freedom (Df): 29.50564
Penalized Criterion: 3016.396
GCV: 85.32268

[Figure: smoothing spline fits to the HRT data chosen by CV (df = 31.0) and GCV (df = 29.5)]


Simulation study

• Examine the bias-variance trade-off, under squared-error loss, for smoothing splines as a function of λ

• Assume

    Y = f(X) + ε

  - f(·) is the fitted function obtained by setting λ = 10
  - ε ∼ (0, σ²), where σ² = 10.5

• Generate R = 10,000 datasets under this mechanism
  - n = 84 observations
  - X-values taken to be those of the HRT data

• For each dataset, fit a smoothing spline model with λ ∈ {2, . . ., 50}, giving estimates

    f̂λ^(r)(x),   r = 1, . . ., R

• Monte Carlo estimate of the pointwise mean squared error:

    MSEλ(x) = (1/R) Σ_{r=1}^{R} [ f̂λ^(r)(x) − f(x) ]²

[Figure: four simulated datasets, plotted against calendar date]

[Figure: pointwise mean squared error against calendar date, for df = 2, 5, 12, 30, 50]


[Figure: bias-variance trade-off; pointwise bias and variance against calendar date, for df = 2, 5, 12, 30, 50]

• We could also summarize across X by evaluating the Monte Carlo estimate of the mean integrated squared error

    MISEλ = (1/R) Σ_{r=1}^{R} ∫ [ f̂λ^(r)(x) − f(x) ]² dx

[Figure: Monte Carlo estimate of the MISE as a function of the effective degrees of freedom]

Projection vs. shrinkage

• Two main classes of spline models, both of which result in linear estimators

• Truncated power basis and natural cubic spline:
  - for a given knot sequence ξ, the fitted values have the form

    ŷ = Bξ(BξᵀBξ)⁻¹Bξᵀy = Hξ y

• Smoothing spline:
  - for smoothing parameter λ > 0, the fitted values have the form

    ŷ = N(NᵀN + λΩn)⁻¹Nᵀy = Sλ y


Similarities and differences

• Both Hξ and Sλ are n × n symmetric, positive semidefinite matrices

• Hξ has rank M, the number of basis functions, while Sλ has rank n

• In both cases, the trace can be used to obtain the degrees of freedom

    dfξ = trace(Hξ) = M
    dfλ = trace(Sλ) ≤ n

• For Hξ, this is reasonable since the trace gives the dimension of the projection space, which is the number of basis functions and, hence, the number of estimated parameters

• For Sλ, we can motivate the use of the trace by initially considering its Reinsch form:

    Sλ = N(NᵀN + λΩn)⁻¹Nᵀ = (I + λK)⁻¹

  where K = [Nᵀ]⁻¹ Ωn [N]⁻¹ does not depend on λ

• This leads to the following eigen-decomposition

    Sλ = Σ_{k=1}^{n} ρk(λ) uk ukᵀ

  where

    ρk(λ) = 1 / (1 + λ dk)

  and dk is the corresponding eigenvalue of K

• Using this, we can write

    ŷ = Sλ y = Σ_{k=1}^{n} ρk(λ) uk ⟨uk, y⟩

• So Sλ works by decomposing y w.r.t. the basis {uk; k = 1, . . ., n}, and differentially shrinking the contributions using the ρk(λ)

• Note that for λ > 0

    ρk(λ) < 1

• Referred to as a shrinkage smoother

• By contrast, Hξ has M eigenvalues equal to 1, and the rest are equal to 0
  - embedded within a saturated model where, for example, we set

    ξ* = {xi; i = 1, . . ., n − 2},

  we can view the regression spline approach in terms of components which are either left alone or shrunk to zero

• Referred to as a projection smoother
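A tiny Python sketch (hypothetical eigenvalues) of the shrinkage factors; rougher eigendirections (larger dk) are shrunk more aggressively, while dk = 0 leaves the corresponding (unpenalized) component untouched:

```python
def shrink_factors(eigenvalues, lam):
    """Shrinkage factors rho_k(lambda) = 1 / (1 + lambda * d_k)."""
    return [1.0 / (1.0 + lam * dk) for dk in eigenvalues]

# hypothetical eigenvalues d_k of K, ordered from smoothest to roughest
d = [0.0, 0.1, 1.0, 10.0]
rho = shrink_factors(d, lam=2.0)
```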


Nonparametric logistic regression

• Consider logistic regression with a single continuous input X

    logit(px) = f(x)

  where px = Pr(Y = 1 | X = x)

• We can construct a penalized log-likelihood criterion as

    lP(f; λ) = Σ_{i=1}^{n} [ yi log(pxi) + (1 − yi) log(1 − pxi) ] − (λ/2) ∫ [f″(x)]² dx

  - the log-likelihood measures 'closeness' of f(·) to the observed data
  - the penalty term is defined in terms of the smoothness of f(·)

• The smoothing parameter λ establishes the trade-off between the two competing forces

• As before, we find that the optimal fλ(·) is a natural cubic spline with knots at the {xi; i = 1, . . ., n}

• The functional form of the solution is therefore

    fλ(x) = Σ_{j=0}^{n−1} Nj(x) θj

• The first and second derivatives of lP(f; λ), wrt θ, are

    ∂lP(f; λ)/∂θ = Nᵀ(y − p) − λΩθ

&262 S. Haneuse; Biostat/Stat 572 % &263 S. Haneuse; Biostat/Stat 572 % 'and $ ' $ ∂2l (f; λ) Predicted probability of LBW, as a function of mothers’ age P = −N T W N − λΩ ∂θ∂θT

0.10 X ! p is the n-vector with elements pxi

0.08 X ! W is a diagonal matrix of weights pxi (1 − pxi ) X X X 0.06 X X X • The solution requires an iterative algorithm, iterating between updates for X X X yHat

θ and computing new weights 0.04 X X ! the update for θ is X crude X 0.02 df = 1 df = 3

new T Ω −1 T old −1 0.00 df = 5 θ = (N W N − λ ) N W (Nθ + W (y − p)) df = 10

• The following is based on the King county birth weight data 15 20 25 30 35 40 45 Age, yrs


General definition of penalized splines

• We can imbed the methods we have seen into a broader class of regularization methods

• By combining the ideas of regression splines with those of smoothing splines, consider obtaining an estimate of f(·) via minimization of the criterion

    Σ_{i=1}^{n} L(yi, f(xi)) + λJ(f)

  over H, the space of functions over which J(f) is defined

• Ruppert, Wand and Carroll (2003) define a general penalized spline for a continuous outcome as

    f(x) = Bξ(x)ᵀβ + ε,

  where an estimate of β is obtained via minimization of a penalized generalized least squares criterion:

    Σ_{i=1}^{n} {yi − f(xi)}² + λ βᵀDβ

  for some symmetric positive definite matrix D and λ > 0

Assessing associations

• In this class we have been (generally) motivated by the desire to eventually perform prediction for independent data

• It is often the case that interest lies in characterizing the association between X and Y, but without the restrictions imposed by ordinary parametric models

• Natural follow-up questions might include
  - what is the uncertainty associated with the point estimate f̂(x)?
  - is there an association? i.e., is f(x) = c ∀ x?
  - is f(·) linear or nonlinear?
  - is the dip apparent in f̂ 'really there'?

• To address these questions, we require:
  - valid standard error estimation
  - valid hypothesis testing

• Mixed effects models provide a framework within which penalized splines may be represented

• This framework presents a series of advantages that make their use attractive
  - theory for estimation/inference is well established
  - automatic smoothing
  - availability of software


Linear mixed models

• Consider the general linear mixed effects model:

    y = Xβ + Zb + ε,

  where

    E[b] = 0,   E[ε] = 0

  and

    COV[b] = G,   COV[ε] = R

• Estimation of the unknown quantities can proceed by deriving an estimator for β and then taking the best linear unbiased predictor (BLUP) of b

• Let

    V = ZGZᵀ + R

• The solutions are given by

    β̃ = (XᵀV⁻¹X)⁻¹XᵀV⁻¹y

  and

    b̃ = GZᵀV⁻¹(y − Xβ̃)

• In practice, the parameters in G and R would need to be estimated

• These estimators can also be derived if we are willing to make additional distributional assumptions

    y | b ∼ Normal(Xβ + Zb, R)
    b ∼ Normal(0, G)

  and maximize the corresponding likelihood wrt (β, b)

• This leads to the general estimation criterion

    (y − Xβ − Zb)ᵀ R⁻¹ (y − Xβ − Zb) + bᵀG⁻¹b

• Hence, estimation of (β, b) involves generalized least squares with a penalty term

• In the special case where

    R = σε² I   and   G = σb² I,

  the fitting criterion simplifies to

    (1/σε²) (y − Xβ − Zb)ᵀ(y − Xβ − Zb) + (1/σb²) bᵀb


The linear spline model

• Consider the linear spline model based on knots ξ = (ξ1, . . ., ξK):

    f(X) = β0 + β1 X + Σ_{k=1}^{K} bk (X − ξk)+ + ε,

  where the ε ∼ (0, σε²)

• Given a sample of size n, define the two design matrices

        [ 1  x1 ]            [ (x1 − ξ1)+  . . .  (x1 − ξK)+ ]
    X = [ :  :  ]   and  Z = [ :                  :           ]
        [ 1  xn ]            [ (xn − ξ1)+  . . .  (xn − ξK)+ ]

• For an appropriate choice of D, and dividing by σε², the penalized spline fitting criterion above can be written as

    (1/σε²) (y − Xβ − Zb)ᵀ(y − Xβ − Zb) + (λ/σε²) bᵀb

• This equals the BLUP criterion above on treating the b as a set of random effects with

    COV[b] = σb² I   and   σb² = σε²/λ

• This provides a mixed model representation of the regression spline

Exploiting the BLUP representation

• The choice of D which led to the above representation was

    D = [ 0_{2×2}  0_{2×K} ]
        [ 0_{K×2}  I_{K×K} ]

  - the penalty is imposed on the knot-specific regression parameters

• For fixed λ, the penalized least squares solution is

    β̂* = (CᵀC + λD)⁻¹Cᵀy,

  where C = [X, Z]

• Returning to the mixed model representation, and re-arranging, we can see that the smoothing parameter in the penalized linear spline is

    λ = σε²/σb²

• Mixed modeling software, such as PROC MIXED in SAS and lme in R, estimates σb² and σε², typically via ML or REML

• From these we can obtain

    λ̂ = σ̂ε²/σ̂b²

  - hence, we can use standard mixed modeling software to provide an automatic, likelihood-based approach to smoothing

• Q: How does this approach for selecting the degree of smoothing compare with those we have already seen?
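A self-contained Python sketch (ours; hand-rolled linear algebra, toy data) of this penalized least squares solution for a linear spline with one knot; with λ = 0 it reduces to ordinary least squares, while a large λ shrinks the knot coefficient toward zero:

```python
def solve(A, b):
    """Solve A x = b via Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[c])]
    return [M[r][n] / M[r][r] for r in range(n)]

def penalized_fit(x, y, knots, lam):
    """Minimize ||y - C beta||^2 + lam * beta' D beta, where
    C = [1, x, (x - knot)_+, ...] and D penalizes only the knot terms."""
    pos = lambda t: max(t, 0.0)
    C = [[1.0, xi] + [pos(xi - k) for k in knots] for xi in x]
    q = 2 + len(knots)
    CtC = [[sum(row[a] * row[b] for row in C) for b in range(q)]
           for a in range(q)]
    for j in range(2, q):          # add lam * D: identity on the knot terms only
        CtC[j][j] += lam
    Cty = [sum(row[a] * yi for row, yi in zip(C, y)) for a in range(q)]
    return solve(CtC, Cty)

# Hypothetical data generated exactly from a linear spline with one knot
knots = [0.5]
beta_true = [1.0, 2.0, -3.0]
xs = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
ys = [1.0 + 2.0 * xi - 3.0 * max(xi - 0.5, 0.0) for xi in xs]
beta_hat = penalized_fit(xs, ys, knots, lam=0.0)      # OLS: recovers the truth
beta_shrunk = penalized_fit(xs, ys, knots, lam=1e6)   # knot coefficient shrunk
```

This mirrors the structure of the R code on the next slide, where C = cbind(X, Z) and D = diag(c(0, 0, rep(1, K))).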


• Returning to the BCSC HRT data

## Read in the data and construct the design matrices
library(foreign)
raw <- read.csv("HTraw.csv")
##
K     <- 10
knots <- (seq(from=1997, to=2004, length=(K+2)))[-c(1,K+2)]
##
X <- cbind(1, raw$date)
Z <- outer(raw$date, knots, "-")
Z <- Z * (Z > 0)
C <- cbind(X, Z)
D <- diag(c(0, 0, rep(1, K)))
##
newDates <- seq(from=1997, to=2004, length=500)
newX     <- cbind(1, newDates)
newZ     <- outer(newDates, knots, "-")
newC     <- cbind(newX, newZ * (newZ > 0))

## Penalized least squares fit and effective degrees of freedom
lambda  <- 0
betaHat <- solve((t(C) %*% C) + (lambda * D)) %*% t(C) %*% raw$ht
dof     <- sum(diag(C %*% solve((t(C) %*% C) + (lambda * D)) %*% t(C)))
yHat    <- newC %*% betaHat

## Plot the fit, shading the regions between adjacent knots
tempCol <- rep(c("lightgrey", "white"), (K/2+1))
postscript("NiceFigure.ps", height=10, width=10)
par(cex=1.25)
plot(raw$date, raw$ht, axes=F, ylim=c(140, 550), type="n",
     xlab="Calendar date", ylab="HRT use (per 1,000)")
title(main=paste("lambda = 0; effective degrees of freedom = ", round(dof, 1)),
      col.main="blue", font.main=1, cex.main=1.5)
##
polygon(c(1997, 1997, knots[1], knots[1]), c(150,550,550,150),
        col=tempCol[1], border=NA)
for(k in 2:K)
{
  polygon(c(knots[k-1], knots[k-1], knots[k], knots[k]),
          c(150,550,550,150), col=tempCol[k], border=NA)
}
polygon(c(knots[K], knots[K], 2004, 2004), c(150,550,550,150),
        col=tempCol[(K+1)], border=NA)
##
points(raw$date, raw$ht, cex=1.5, col="black", pch=1)
axis(1, at=seq(from=1997, to=2004, by=1))
axis(2, at=seq(from=150, to=550, by=100))
##
lines(newX[,2], yHat, col="red", lwd=4)
dev.off()

[Figure: penalized linear spline fit; lambda = 0, effective degrees of freedom = 12; HRT use (per 1,000) versus calendar date, 1997–2004]


[Figure: penalized linear spline fit; lambda = 100, effective degrees of freedom = 2.6; HRT use (per 1,000) versus calendar date, 1997–2004]

## Likelihood-based smoothing via the mixed model representation
library(nlme)
n       <- length(raw$ht)
newData <- data.frame(Y=raw$ht, X=raw$date, Z=Z, id=1:n)
##
formA    <- as.formula(paste("Y ~ X", paste(" + Z.", 1:K, sep="", collapse=""), "| id"))
newGData <- groupedData(formA, data=newData)
##
formB <- as.formula(paste("~ -1", paste(" + Z.", 1:K, sep="", collapse="")))
fit0  <- lme(fixed=Y ~ X, random=pdIdent(formB), data=newGData)
##
sigmaSq.eps <- fit0$sigma^2
sigmaSq.b   <- sigmaSq.eps * exp(2 * unlist(fit0$modelStruct))
lambda      <- sigmaSq.eps / sigmaSq.b

lambda
> 1.194208

[Figure: penalized linear spline fit; lambda = 1.19, effective degrees of freedom = 5.9; HRT use (per 1,000) versus calendar date, 1997–2004]

[Figure: penalized linear spline fit; lambda = 4.2, effective degrees of freedom = 5.7; HRT use (per 1,000) versus calendar date, 1997–2004]


• We can examine the impact of increasing the degree of the truncated power basis

• For example, consider the penalized cubic spline

f(X) = β_0 + β_1 X + Σ_{k=1}^{K} b_k (X − ξ_k)_+^3 + ε

• By adjusting the Z design matrix, we can again use the mixed model representation

[Figure: linear and cubic penalized spline fits to the BCSC HRT data; linear: df = 12, cubic: df = 12; linear: df = 5.9, cubic: df = 4.1; HRT use (per 1,000) versus calendar date, 1997–2004]
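The change of basis above amounts to cubing the truncated-line columns of Z; a minimal sketch (illustrative x values and knots, Python rather than R):

```python
import numpy as np

# Sketch: the cubic truncated power basis replaces (x - xi_k)_+ with
# (x - xi_k)_+^3; the rest of the mixed-model machinery is unchanged.
x = np.linspace(1997, 2004, 84)
K = 10
knots = np.linspace(1997, 2004, K + 2)[1:-1]

Z_lin = np.maximum(x[:, None] - knots, 0.0)   # linear basis: (x - xi_k)_+
Z_cub = Z_lin ** 3                             # cubic basis:  (x - xi_k)_+^3
```

Replacing Z with Z_cub in C = [X, Z] gives the cubic fit under the same penalized least squares or mixed-model machinery.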

Other frameworks?

• Generalized additive models

E[Y | X_1, X_2, . . . , X_p] = β_0 + Σ_{k=1}^{p} f_k(X_k)

• Each of the f_k(·) is permitted to be flexible

• Regression splines with estimation via least squares

• Smoothing splines with estimation via the backfitting algorithm
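A minimal backfitting sketch (simulated data; the crude boxcar smoother stands in for a spline smoother and, like all names here, is purely illustrative):

```python
import numpy as np

# Backfitting sketch for the additive model E[Y] = b0 + f1(X1) + f2(X2):
# cycle through the components, smoothing the partial residuals for each.
rng = np.random.default_rng(2)
n = 200
x1, x2 = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)
y = np.sin(np.pi * x1) + x2 ** 2 + rng.normal(scale=0.2, size=n)

def smooth(x, r, width=0.2):
    # boxcar smoother: local mean of residuals r within a window around each x
    return np.array([r[np.abs(x - xi) < width].mean() for xi in x])

b0 = y.mean()
f1 = np.zeros(n)
f2 = np.zeros(n)
for _ in range(20):                      # iterate until (approximate) convergence
    f1 = smooth(x1, y - b0 - f2)
    f1 -= f1.mean()                      # center each f_k for identifiability
    f2 = smooth(x2, y - b0 - f1)
    f2 -= f2.mean()

fitted = b0 + f1 + f2
```

Each pass smooths the partial residuals against one covariate while holding the other component fixed; centering each f_k keeps the intercept identifiable.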

• For more details see Maggie and Ting (group 7)
