Moving Beyond Linearity

David J. Hessen

Utrecht University

March 7, 2019

Statistical learning

Supervised learning

- a single response y
- multiple predictors x1, . . . , xp

Multiple regression (interval response): y = f(x1, . . . , xp) + ε

Binary logistic regression: π = exp{f(x1, . . . , xp)} / [1 + exp{f(x1, . . . , xp)}]

The assumption of linearity: f(x1, . . . , xp) = β0 + β1x1 + ... + βpxp


Why this inflexible (very restrictive) approach?

- If linearity is true, then there is no bias and no more flexible method competes → the variance of the estimator of f(x1, . . . , xp) will be smaller
- Often the linearity assumption is good enough
- Very interpretable


What can be done when linearity is not good enough?
1. Polynomial regression
2. Piecewise polynomials
3. Regression splines
4. Smoothing splines
5. Local regression
6. Generalized additive models

Modeling approaches 1 to 5 are presented for the relationship between response y and a single predictor x

Polynomial regression

linear function: f(x) = β0 + β1x
quadratic function: f(x) = β0 + β1x + β2x^2
cubic function: f(x) = β0 + β1x + β2x^2 + β3x^3
. . .
degree-d polynomial: f(x) = β0 + β1x + β2x^2 + ... + βdx^d

It’s just the standard linear model

f(x1, . . . , xd) = β0 + β1x1 + β2x2 + ... + βdxd

where x1 = x, x2 = x^2, . . . , xd = x^d


- The coefficients β0, β1, . . . , βd can easily be estimated using least squares
- The interest is more in the fitted value
  f̂(x) = β̂0 + β̂1x + β̂2x^2 + ... + β̂dx^d
  than in the coefficient estimates β̂0, β̂1, . . . , β̂d
- Usually, either d is fixed at 3 or 4, or cross-validation is used to choose d
- Especially near the boundary of x, the polynomial can become overly flexible (bad for extrapolation)
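A minimal sketch of such a fit in R, assuming the Wage data from the ISLR package (the data set underlying the figures in these slides):

    library(ISLR)                                            # provides the Wage data (assumption)
    fit <- lm(wage ~ poly(age, 4, raw = TRUE), data = Wage)  # degree-4 polynomial via least squares
    coef(summary(fit))                                       # the estimates β̂0, ..., β̂4 and their SEs

With raw = TRUE the basis consists of the monomials x, x^2, x^3, x^4, so the coefficients match the βj above; the default orthogonal basis yields the same fitted values.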


[Figure: Degree-4 Polynomial. Left panel: wage versus age with the fitted curve and 95% confidence band. Right panel: Pr(Wage > 250 | Age) versus age from the logistic fit, with its confidence band.]


- In the left-hand panel of the figure, the solid blue curve is given by
  f̂(x) = β̂0 + β̂1x + β̂2x^2 + β̂3x^3 + β̂4x^4
  and the pair of dotted blue curves indicates an estimated 95% confidence interval given by
  f̂(x) ± 2 · se{f̂(x)}
- In the right-hand panel, the solid blue curve is given by
  π̂(y > 250 | x) = exp{f̂(x)} / [1 + exp{f̂(x)}] = sigm{f̂(x)}
  and the pair of dotted blue curves indicates an estimated 95% confidence interval given by
  sigm[f̂(x) ± 2 · se{f̂(x)}]
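A sketch of how those bands can be computed in R (again assuming the ISLR Wage data; the age grid is illustrative):

    library(ISLR)
    grid <- data.frame(age = 18:80)                        # illustrative grid of x-values

    # left-hand panel: fitted values plus/minus 2 standard errors
    fit <- lm(wage ~ poly(age, 4), data = Wage)
    pr  <- predict(fit, newdata = grid, se.fit = TRUE)
    band <- cbind(pr$fit - 2 * pr$se.fit, pr$fit + 2 * pr$se.fit)

    # right-hand panel: compute the band on the linear-predictor scale,
    # then map it through the sigmoid
    lfit <- glm(I(wage > 250) ~ poly(age, 4), family = binomial, data = Wage)
    lp   <- predict(lfit, newdata = grid, se.fit = TRUE)   # default type = "link"
    sigm <- function(eta) exp(eta) / (1 + exp(eta))
    pband <- sigm(cbind(lp$fit - 2 * lp$se.fit, lp$fit + 2 * lp$se.fit))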

Piecewise polynomials

A step function (a piecewise polynomial of order zero) can be used to avoid imposing a global structure

Cutpoints or knots c1, c2, . . . , cK in the range of x are chosen and are used to create K + 1 dummy variables

C0(x) = I(x < c1)
C1(x) = I(c1 ≤ x < c2)
. . .
CK−1(x) = I(cK−1 ≤ x < cK)
CK(x) = I(x ≥ cK)

Least squares is used to fit

f(x) = β0 + β1C1(x) + β2C2(x) + ... + βK CK (x)
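In R, cut() builds such dummy variables directly; a sketch using the knots of the next slide (35, 50, 65) on the assumed ISLR Wage data:

    library(ISLR)
    fit <- lm(wage ~ cut(age, breaks = c(-Inf, 35, 50, 65, Inf)), data = Wage)
    coef(fit)   # intercept = mean wage in the first bin; the rest are offsets from it

Note that cut() produces right-closed intervals (a, b] rather than the left-closed intervals written above; for these data the fit is the same apart from cases exactly at a knot.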


Knots: c1 = 35, c2 = 50, c3 = 65

[Figure: Piecewise Constant fit with knots at 35, 50, and 65. Left panel: wage versus age; right panel: Pr(Wage > 250 | Age) versus age.]


- Step functions are easy to work with
- Interactions can easily be created and are easy to interpret
- The choice of the knots can be problematic
- Piecewise constant functions can miss the action

A piecewise polynomial of degree-d

f(x) = β00 + β10x + β20x^2 + ... + βd0x^d   if x < c1
       β01 + β11x + β21x^2 + ... + βd1x^d   if c1 ≤ x < c2
       . . .
       β0K + β1Kx + β2Kx^2 + ... + βdKx^d   if x ≥ cK

Usually, separate low-degree polynomials are fitted over different regions of x
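A sketch of fitting separate cubics over two regions in R (a single illustrative knot at age 50 is assumed here, not taken from the slide; ISLR Wage data assumed):

    library(ISLR)
    fit.lo <- lm(wage ~ poly(age, 3), data = Wage, subset = age < 50)   # region x < c1
    fit.hi <- lm(wage ~ poly(age, 3), data = Wage, subset = age >= 50)  # region x ≥ c1
    # nothing forces the two fits to meet at the knot, so the curve may jump at 50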


[Figure: Four fits of wage on age: Piecewise Cubic, Continuous Piecewise Cubic, Cubic Spline, and Linear Spline.]

Regression splines

A degree-d spline is a piecewise degree-d polynomial with continuity in derivatives up to degree d − 1 at each knot (d constraints per knot)

A great reduction in complexity compared to a piecewise polynomial

The number of parameters of a degree-d piecewise polynomial is
(K + 1)(d + 1) = Kd + K + d + 1   (# intervals × # coefficients)

The number of constraints is Kd   (# knots × # constraints per knot)

The number of parameters of a degree-d spline is therefore K + d + 1

For example, a cubic spline (d = 3) with K = 3 knots has 4 × 4 = 16 piecewise coefficients and 3 × 3 = 9 constraints, leaving 3 + 3 + 1 = 7 free parameters

Consequently, var{f̂(x)} is smaller for a degree-d spline


A degree-d spline can be represented by the linear model

f(x) = β0 + β1b1(x) + ... + βdbd(x) + βd+1bd+1(x) + ... + βd+Kbd+K(x)

where

b1(x) = x
b2(x) = x^2
. . .
bd(x) = x^d
bd+k(x) = (x − ck)^d if x > ck, and 0 otherwise, for k = 1, . . . , K

are basis functions
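In R, splines::bs() constructs a basis spanning the same space (internally via B-splines rather than truncated powers); a sketch with the knots from the earlier slide, Wage data assumed:

    library(ISLR); library(splines)
    fit <- lm(wage ~ bs(age, knots = c(35, 50, 65), degree = 3), data = Wage)
    dim(bs(Wage$age, knots = c(35, 50, 65)))  # n rows, K + d = 6 basis columns (plus intercept in the model)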


A linear spline

f(x) = β0 + β1b1(x) + β2b2(x) + ... + β1+Kb1+K(x)

where

b1(x) = x
b1+k(x) = x − ck if x > ck, and 0 otherwise, for k = 1, . . . , K


A quadratic spline

f(x) = β0 + β1b1(x) + β2b2(x) + β3b3(x) + ... + β2+Kb2+K(x)

where

b1(x) = x
b2(x) = x^2
b2+k(x) = (x − ck)^2 if x > ck, and 0 otherwise, for k = 1, . . . , K


A cubic spline

f(x) = β0 + β1b1(x) + β2b2(x) + β3b3(x) + β4b4(x) + ... + β3+Kb3+K(x)

where

b1(x) = x
b2(x) = x^2
b3(x) = x^3
b3+k(x) = (x − ck)^3 if x > ck, and 0 otherwise, for k = 1, . . . , K


Unfortunately, a spline can have high variance at the outer range of x

A natural spline is forced to be linear in the extreme regions of x

[Figure: Natural Cubic Spline versus Cubic Spline fits of wage on age.]


Where should the K knots be placed?

- More knots in regions where f(x) is expected to change more rapidly, and fewer knots in regions where f(x) is expected to be more stable
- At uniform quantiles of the data (see the sketch below)
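A sketch of the quantile-based placement with splines::ns(): requesting a number of degrees of freedom places the interior knots at uniform quantiles automatically (Wage data assumed):

    library(ISLR); library(splines)
    fit <- lm(wage ~ ns(age, df = 4), data = Wage)   # natural cubic spline
    attr(ns(Wage$age, df = 4), "knots")              # 3 interior knots at the quartiles of age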


Knot placement at the 25th, 50th, and 75th percentiles of x

[Figure: Natural Cubic Spline fit with knots at the 25th, 50th, and 75th percentiles of age. Left panel: wage versus age; right panel: Pr(Wage > 250 | Age) versus age.]


How many knots should be used?

- Try out which number produces the best-looking curve
- Cross-validation for different numbers of knots → the value of K that gives the smallest overall cross-validated RSS is chosen (a sketch follows)
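A minimal 10-fold cross-validation sketch over the degrees of freedom of a natural spline, where each df value corresponds to a number of interior knots (the fold assignment and the df range 3 to 10 are illustrative; Wage data assumed):

    library(ISLR); library(splines)
    set.seed(1)
    fold <- sample(rep(1:10, length.out = nrow(Wage)))   # random fold labels
    cv.rss <- sapply(3:10, function(df)
      sum(sapply(1:10, function(f) {
        fit <- lm(wage ~ ns(age, df = df), data = Wage[fold != f, ])
        sum((Wage$wage[fold == f] - predict(fit, Wage[fold == f, ]))^2)
      })))
    (3:10)[which.min(cv.rss)]   # df with the smallest cross-validated RSS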

Smoothing splines

Let (x1, y1),..., (xn, yn) be the observed sample data

A smoothing spline is a function f(x) that minimizes

Σ_{i=1}^{n} {yi − f(xi)}^2 + λ ∫ {f''(x)}^2 dx

for a fixed nonnegative tuning or smoothing parameter λ

The first term is the RSS and tries to make f(x) match the data

The second term is a roughness penalty and controls how wiggly f(x) is

- The smaller λ, the more wiggly f(x), eventually interpolating the yi when λ = 0
- As λ → ∞, the function f(x) becomes linear


Remarkably, a smoothing spline is a natural cubic spline with knots at the unique values of x1, . . . , xn

- Smoothing splines avoid the knot-selection issue, leaving a single λ to be chosen
- Instead of λ, the effective degrees of freedom dfλ (the number of parameters minus the number of constraints) can be specified (as λ increases from 0 to ∞, dfλ decreases from n to 2)
- The value of either λ or dfλ that minimizes the cross-validated RSS can be chosen (see the sketch below)
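In R, stats::smooth.spline() supports both routes; a sketch matching the next figure (Wage data assumed):

    library(ISLR)
    fit1 <- smooth.spline(Wage$age, Wage$wage, df = 16)    # fix the effective df directly
    fit2 <- smooth.spline(Wage$age, Wage$wage, cv = TRUE)  # choose λ by leave-one-out CV
    fit2$df   # about 6.8 effective degrees of freedom, as in the figure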


[Figure: Smoothing Spline fits of wage on age with 16 degrees of freedom and with 6.8 degrees of freedom chosen by LOOCV.]

Local regression

Algorithm for local regression at x0
1. Gather the fraction s = k/n of cases whose x-values are closest to x0
2. Assign a weight wi to each of the k cases
3. Minimize the weighted sum of squares
   Σ_{i=1}^{k} wi {yi − f(xi)}^2
   with respect to the parameters of f(x)
4. The fitted value at x0 is given by f̂(x0)


[Figure: Local Regression illustrated on simulated data.]


Choices to be made

- The weights w1, . . . , wk
- The form of f(x)
- The span s

The span s plays a role like λ in a smoothing spline: the smaller s, the more wiggly the fit

Cross-validation can be used to choose s, or s can be specified directly (see the sketch below)
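A sketch with stats::loess(), which implements local regression using tricube weights and a local polynomial for f(x) (spans taken from the next figure; Wage data assumed):

    library(ISLR)
    fit1 <- loess(wage ~ age, span = 0.2, data = Wage)  # small span: wiggly fit
    fit2 <- loess(wage ~ age, span = 0.7, data = Wage)  # large span: smooth fit
    predict(fit2, data.frame(age = 40))                 # fitted value at x0 = 40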


[Figure: Local Regression fits of wage on age with span 0.2 (16.4 degrees of freedom) and span 0.7 (5.3 degrees of freedom).]

Generalized additive models

A flexible way to predict response y from multiple predictors x1, . . . , xp

A natural way to extend the standard multiple linear model

f(x1, . . . , xp) = β0 + β1x1 + ... + βpxp

in order to allow for non-linear relationships, is to write

f(x1, . . . , xp) = β0 + f1(x1) + ... + fp(xp)

where f1(x1), . . . , fp(xp) are (smooth) non-linear functions
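A sketch of such a fit with the gam package used in the ISLR labs (mgcv offers a similar interface); the smoothing-spline df values are illustrative:

    library(ISLR); library(gam)
    fit <- gam(wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
    plot(fit, se = TRUE)   # one panel per fitted function, as in the next figure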


Prediction of y = wage from x1 = year, x2 = age, and x3 = education

[Figure: Fitted functions f1(year), f2(age), and f3(education) for the Wage data.]

f(x1, x2, x3) = β0 + f1(x1) + f2(x2) + f3(x3), where f1(x1) and f2(x2) are natural splines, and f3(x3) is a piecewise constant function (education is qualitative)


Prediction of y = wage from x1 = year, x2 = age, and x3 = education

[Figure: Fitted functions f1(year), f2(age), and f3(education) for the Wage data, now with smoothing splines for year and age.]

f(x1, x2, x3) = β0 + f1(x1) + f2(x2) + f3(x3), where f1(x1) and f2(x2) are smoothing splines, and f3(x3) is a piecewise constant function

Conclusion

- Many approaches to account for non-linearities
- Cross-validation can help in making the different choices
- All approaches discussed can be applied using R
- Book: An Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani
