
Moving Beyond Linearity

Chapter 7 – Part I

1 Moving Beyond Linearity

• Polynomial Regression • Basis Functions • Piecewise Constant Basis and Step Functions

2 Moving Beyond Linearity

• Truth is never exactly linear. • A linear model can often be a good approximation. • But when it is not, we need methods that handle non-linearity. • Want flexibility, but keep some interpretability if possible.

• First consider single predictor, then generalize to multiple predictors.

3 Polynomial Regression

• Replace simple linear regression

y_i = β_0 + β_1 x_i + ε_i

with polynomial regression

y_i = β_0 + β_1 x_i + β_2 x_i^2 + ... + β_d x_i^d + ε_i

• Higher degree polynomials give a more flexible fit.
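As a minimal sketch (synthetic data and plain NumPy least squares, not the software used in the text), a degree-d polynomial regression is just linear regression on the powers of x:

```python
import numpy as np

# Hypothetical one-predictor data; replace with the real predictor and response.
rng = np.random.default_rng(0)
x = rng.uniform(18, 80, size=200)
y = 50 + 2.0 * x - 0.02 * x**2 + rng.normal(0, 5, size=200)

d = 4                                          # polynomial degree
X = np.vander(x, N=d + 1, increasing=True)     # columns: 1, x, x^2, ..., x^d
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least squares coefficients

# Fitted curve over a grid of x values
x0 = np.linspace(x.min(), x.max(), 100)
X0 = np.vander(x0, N=d + 1, increasing=True)
y_hat = X0 @ beta
```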

4 Polynomial Regression

5 Polynomial Regression

• Where do the confidence intervals come from? • For any given x_0, we need Var(f̂(x_0)), where f̂(x_0) = β̂_0 + β̂_1 x_0 + β̂_2 x_0^2 + ... + β̂_d x_0^d.

• Then we use f̂(x_0) ± 2·SE(f̂(x_0)).

• Have the variances and covariances of the estimated coefficients. • Recall the rules for variances of sums of random variables:

• Use them repeatedly on the terms of f̂(x_0), or use the matrix form: Var(c^T β̂) = c^T Var(β̂) c, with c = (1, x_0, x_0^2, ..., x_0^d)^T.
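Continuing the hypothetical NumPy sketch above (so X, beta, x0, X0 are the illustrative objects defined there), the pointwise standard errors follow directly from Var(c^T β̂) = c^T Var(β̂) c:

```python
import numpy as np

n, p = X.shape
resid = y - X @ beta
sigma2 = resid @ resid / (n - p)               # estimate of the error variance
cov_beta = sigma2 * np.linalg.inv(X.T @ X)     # estimated Var(beta_hat)

# Row-wise quadratic form c^T Var(beta_hat) c for each row c = (1, x0, ..., x0^d) of X0
var_fit = np.einsum("ij,jk,ik->i", X0, cov_beta, X0)
se_fit = np.sqrt(var_fit)

y_hat = X0 @ beta
upper = y_hat + 2 * se_fit                     # approximate pointwise 95% band
lower = y_hat - 2 * se_fit
```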

6 Polynomial Regression

• Seems to be a distinct split in wages. • Most are under 250 (thousand dollars), a few above. • Instead of predicting wage itself, estimate the probability that wage > 250.

7 Polynomial Regression

• Fit logistic regression.

Pr(Wage_i > 250 | Age_i) = Pr(z_i = 1 | x_i), where z_i = I(Wage_i > 250) and the log odds are a polynomial in x_i = Age_i.

• Example uses 4th degree.
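A minimal scikit-learn sketch of this kind of fit, using synthetic stand-ins for the Wage data rather than the book's actual data or software (note that sklearn's LogisticRegression applies mild L2 regularization by default, unlike an unpenalized GLM fit):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins: an age predictor and a binary indicator z_i = I(wage_i > 250)
rng = np.random.default_rng(1)
age = rng.uniform(18, 80, size=500)
p_true = 1 / (1 + np.exp(-(-6 + 0.1 * age)))   # made-up true probability of a high wage
z = rng.binomial(1, p_true)

# Degree-4 polynomial in age inside a logistic regression
model = make_pipeline(
    PolynomialFeatures(degree=4, include_bias=False),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
model.fit(age.reshape(-1, 1), z)

grid = np.linspace(age.min(), age.max(), 100).reshape(-1, 1)
prob = model.predict_proba(grid)[:, 1]         # estimated Pr(wage > 250 | age)
```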

8 Basis Functions

• Polynomial regression is a special case of the basis function approach. • Idea: Construct a set of fixed basis functions b_1(X), b_2(X), ..., b_K(X) to apply to the predictor variable, X.

• Fit the regression:

y_i = β_0 + β_1 b_1(x_i) + β_2 b_2(x_i) + ... + β_K b_K(x_i) + ε_i

9 Local Basis Functions

• Polynomial basis functions have problems: every observation influences the fit everywhere.

• The polynomial basis leads to global fitting. • Local fitting can be better.

10 Piecewise Constant Basis: Step Functions

• Simplest local basis creates a step function. • Function is constant on each disjoint interval.

b_0(x_i) = I(x_i < c_1),

b_1(x_i) = I(c_1 ≤ x_i < c_2),

...

b_{K−1}(x_i) = I(c_{K−1} ≤ x_i < c_K),

b_K(x_i) = I(x_i ≥ c_K)

11 Piecewise Constant Basis: Step Functions

• Fit regression using the constant basis functions:

y_i = β_0 + β_1 b_1(x_i) + β_2 b_2(x_i) + ... + β_K b_K(x_i) + ε_i

• Omit indicator for first interval. Interpretation: • All basis functions in the model are zero for values in the first interval, so β_0 is the mean response there and each β_j is the shift of interval j relative to the first.

• What is the predicted value? • Can also do logistic regression if the response is binary. (A short sketch follows.)
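A minimal NumPy sketch of the step-function fit, with made-up cutpoints (the column for the first interval is omitted, so its mean is absorbed by the intercept):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(18, 80, size=300)
y = np.where(x < 35, 60, np.where(x < 65, 110, 90)) + rng.normal(0, 10, size=300)

cuts = np.array([35.0, 50.0, 65.0])   # hypothetical knots c_1 < c_2 < c_3 (K = 3)

# Interval index 0..K for each observation, then indicator columns for intervals 1..K
interval = np.digitize(x, cuts)
X = np.column_stack(
    [np.ones_like(x)] + [(interval == j).astype(float) for j in range(1, len(cuts) + 1)]
)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta[0]: mean response in the first interval; beta[j]: shift for the j-th interval
```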

12 Piecewise Constant Basis: Step Functions

13 Piecewise Constant Basis: Step Functions

• Benefits: • Simple and easy to interpret. • The fit in one region is unaffected by data far away.

• Drawbacks: • Discontinuous jumps at the cutpoints. • Constant within each interval, so it can miss trends unless there are natural breakpoints. • Must choose the number and location of cutpoints.

14 Polynomials and Piecewise Constants

• Polynomials have the advantage of being smooth, with no jumps. • Piecewise constant basis functions give a local instead of a global fit. • How about combining the two?

More next time!

15 Moving Beyond Linearity: Regression Splines

Chapter 7 – Part II

16 Moving Beyond Linearity: Regression Splines

• Regression Splines • Piecewise Polynomial Basis • Splines • Choosing Number and Location of Knots

17 Piecewise Polynomials

• Generalize piecewise constant approach. • Instead of constant between knots, fit polynomials.

• Idea: Instead of one overall polynomial, fit separate low-degree polynomials between the knots.

18 Piecewise Polynomials

• Example: Piecewise cubic with a single cutpoint at c = 50.

• Looks strange because of jump at Age = 50.

19 Piecewise Polynomials

• Fit a piecewise cubic constrained to be continuous at the knot. • Much better. • Still odd with a non-differentiable point at Age = 50. • Creates a sharp corner. • For “maximum smoothness” at the knots for a cubic, want: continuity of the function and of its first and second derivatives at each knot.

20 Splines

21 Spline Basis and Regression Splines

• How do we implement the constraints needed for the splines? • In general, with K knots, splines have d + K + 1 degrees of freedom. • Unconstrained would have (K + 1)(d + 1) = K d + d + K + 1. • But, have d constraints for each of the K knots. • So left with df = d + K + 1. • Cubic: d = 3, so df = K + 4 --- Linear: d = 1, so df = K + 2 • We can write a set of d + K + 1 basis functions, known as a spline basis. • Then fit, for example using cubic splines:

y_i = β_0 + β_1 b_1(x_i) + β_2 b_2(x_i) + ... + β_{K+3} b_{K+3}(x_i) + ε_i

• Using spline basis for regression, called regression splines.

22 Linear Spline Basis

• How can we write a spline basis? • Note: first term is intercept, only need d + K basis functions.

• Consider linear splines. • K knots give K + 1 intervals. • A linear spline is fully determined by: 1. The intercept. 2. The slope on each of the K + 1 intervals. • Equivalently, can use the differences in slope between successive intervals.

23 Linear Spline Basis

• So, fit:

y_i = β_0 + β_1 b_1(x_i) + β_2 b_2(x_i) + β_3 b_3(x_i) + ... + β_{K+1} b_{K+1}(x_i) + ε_i

with

b_1(x_i) = x_i

b_2(x_i) = (x_i − c_1)_+

b_3(x_i) = (x_i − c_2)_+

...

b_{K+1}(x_i) = (x_i − c_K)_+

• Known as the truncated power basis of order 1, where (z)_+ = max(z, 0).
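A short NumPy sketch of the order-1 truncated power basis, with hypothetical knots (real knot locations would come from the data):

```python
import numpy as np

def linear_spline_basis(x, knots):
    """Columns 1, x, (x - c_1)_+, ..., (x - c_K)_+ of the truncated power basis."""
    cols = [np.ones_like(x), x]
    cols += [np.clip(x - c, 0.0, None) for c in knots]   # (x - c_k)_+ = max(x - c_k, 0)
    return np.column_stack(cols)

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(0, 0.2, size=200)

knots = np.array([2.5, 5.0, 7.5])                # K = 3 hypothetical knots
X = linear_spline_basis(x, knots)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # K + 2 = 5 coefficients, incl. intercept
```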

24 Cubic Spline Basis

y_i = β_0 + β_1 b_1(x_i) + β_2 b_2(x_i) + β_3 b_3(x_i) + β_4 b_4(x_i) + ... + β_{K+3} b_{K+3}(x_i) + ε_i

with

b_1(x_i) = x_i

b_2(x_i) = x_i^2

b_3(x_i) = x_i^3

b_4(x_i) = (x_i − c_1)_+^3

...

b_{K+3}(x_i) = (x_i − c_K)_+^3
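The same idea for a cubic spline, sketched with knots placed at quantiles of x (anticipating the knot-placement discussion later); this truncated power basis is numerically crude, and software typically uses an equivalent but better-conditioned basis:

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Columns 1, x, x^2, x^3, (x - c_1)_+^3, ..., (x - c_K)_+^3."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - c, 0.0, None) ** 3 for c in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(0, 0.2, size=200)

knots = np.quantile(x, [0.25, 0.5, 0.75])        # K = 3 knots at the quartiles
X = cubic_spline_basis(x, knots)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # df = K + 4 coefficients, incl. intercept
```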

25 Cubic Spline Basis

• Why are cubic splines the most popular degree? • Low order polynomial reduces overfitting problems. • Visual smoothness.

• Problem: The fit in the first and last intervals, beyond the boundary knots, is often highly variable if a full cubic is allowed there. • Natural cubic splines: constrain the fit to be linear beyond the boundary knots, stabilizing the estimates at the edges.

26 Natural Cubic Splines

27 Natural Cubic Splines

28 Choosing Number and Location of Knots

• Once again, have to decide on tuning parameter.

• Ideally, place more knots in regions where the function varies most, and fewer in regions of lower variation. • Not really a feasible option without knowing the function in advance.

29 Choosing Number and Location of Knots

• Placing knots at quantiles of data then determines locations once number is specified. • Can tune on number of knots, K. • Common to use cross-validation. • In software, can specify df desired for regression spline. • Software will place knots at quantiles.

30 Example: Natural Cubic Spline vs. Polynomial

31 Regression Splines

• Regression splines use a spline basis to fit the regression. • Can also be used in logistic regression problems. • More stable than a global polynomial fit. • Choose the knots or, equivalently, the degrees of freedom. • Next time we will consider an alternative to regression splines, known as smoothing splines.

More next time!

32 Moving Beyond Linearity: Smoothing Splines

Chapter 7 – Part III

33 Moving Beyond Linearity: Smoothing Splines

• Smoothing Splines • Mathematical Formulation • Choosing the Smoothing Parameter

34 Smoothing Splines

• Goal is to find a function g that makes the errors small, i.e. we want RSS = Σ_i (y_i − g(x_i))^2 to be small. • With no restrictions, we can simply make g interpolate the data and get a perfect fit. • Idea: Make g smooth. • Polynomials and cubic splines accomplished this task. • Instead, formulate the problem mathematically. • How to measure smoothness?

35 Smoothing Splines

• Consider the squared second derivative at a specific point t, given by g″(t)^2. • Want a good fit, i.e. small RSS. • But also smoothness, i.e. small g″(t)^2 for all t, or at least on average.

• Penalization, or regularization, approach. • Find g to minimize

Σ_i (y_i − g(x_i))^2 + λ ∫ g″(t)^2 dt

36 Smoothing Splines

• Find g to minimize Σ_i (y_i − g(x_i))^2 + λ ∫ g″(t)^2 dt. • Again it is RSS plus a penalty term, with a choice of penalty parameter λ ≥ 0. • Here, the penalty term is called a roughness penalty. • With no penalty, the fit can interpolate the data exactly. • As the penalty parameter grows, the second derivative is forced towards zero, so the fit approaches a straight line.

• Solution is called smoothing spline.

37 Smoothing Splines: Solution

• For every value of the penalty parameter, a nice mathematical result shows: the minimizer is a natural cubic spline with knots at the unique values x_1, ..., x_n (a shrunken version of the regression spline fit with those knots).

• No penalty gives df = n ---- an infinite penalty gives a linear fit, so df = 2.

• What is the true effective df for a particular choice of penalty?

38 Smoothing Splines: Degrees of Freedom

• Consider the fitted values at data points arising from smoothing spline. • This is a vector of length n.

• Denote them by ĝ_λ = (ĝ_λ(x_1), ĝ_λ(x_2), ..., ĝ_λ(x_n)). • Note this depends on the choice of tuning parameter λ. • It can be shown that ĝ_λ = S_λ y, where S_λ is an n × n smoother matrix that is a known function of the natural cubic spline basis and the tuning parameter. • Fitted values are the smoother matrix times the response. • For linear regression the analogous matrix is called the hat matrix. • Effective degrees of freedom: df_λ = trace(S_λ).

39 Choosing the Tuning Parameter

• For smoothing splines, we do not choose knots. • Instead, choose the penalty parameter. • Equivalently, can specify the effective degrees of freedom. • This parallels ridge and lasso, where the amount of shrinkage can be expressed as a fraction of the full least squares estimate. • Cross-validation can be used for tuning. • A common approach for smoothing splines is Leave-One-Out CV.

40 Choosing the Tuning Parameter

• Why Leave-One-Out CV (LOOCV)? • Recall, for linear regression, LOOCV has the closed form

CV_(n) = (1/n) Σ_i [(y_i − ŷ_i) / (1 − h_i)]^2,

where h_i is the i-th diagonal element (leverage) of the hat matrix. • The same shortcut applies to regression splines fit with basis functions. • For smoothing splines:

CV(λ) = (1/n) Σ_{i=1}^n [(y_i − ĝ_λ(x_i)) / (1 − {S_λ}_{ii})]^2
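A small sketch of these quantities, assuming you already have the n × n smoother matrix S_lam for a given λ as a NumPy array (e.g. built from a natural cubic spline basis by your software of choice):

```python
import numpy as np

def smoothing_spline_summaries(S_lam, y):
    """Fitted values, effective df, and LOOCV score from a precomputed smoother matrix."""
    fitted = S_lam @ y                                  # g_hat_lambda = S_lambda y
    df = np.trace(S_lam)                                # effective degrees of freedom
    diag = np.diag(S_lam)
    cv = np.mean(((y - fitted) / (1.0 - diag)) ** 2)    # LOOCV shortcut formula
    return fitted, df, cv
```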

41 Choosing the Tuning Parameter

42 Smoothing Splines

• Smoothing splines are an alternative to regression splines. • Avoid the choice of number and location of knots. • Instead choose the degree of smoothness. • Theoretically much easier to study their properties. • Problem: If n is extremely large, the n × n smoother matrix is large. • Computation may be difficult. • Next time, will discuss local methods quite different from splines.

More next time!

43 Moving Beyond Linearity: Local Regression

Chapter 7 – Part IV

44 Moving Beyond Linearity: Local Regression

• Local Regression • K-Nearest Neighbors Regression • Local Linear Regression

45 Local Regression: KNN Regression

• Recall: K-Nearest Neighbors (KNN) Classification. • For each point, consider its K nearest neighbors.

• Can use the same idea for KNN regression. • For each point x_0, we predict by averaging the responses of its K nearest neighbors:

f̂(x_0) = (1/K) Σ_{x_i ∈ N_K(x_0)} y_i

(A short sketch follows.)
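A minimal scikit-learn sketch with synthetic one-predictor data (the same code works with several predictors):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=200).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(0, 0.2, size=200)

knn = KNeighborsRegressor(n_neighbors=9).fit(x, y)   # K = 9 neighbors
grid = np.linspace(0, 10, 100).reshape(-1, 1)
y_hat = knn.predict(grid)                            # average response of the 9 nearest points
```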

46 Local Regression: KNN Regression

47 Local Regression: KNN Regression

48 Local Regression: KNN Regression

49 Local Regression: KNN Regression

50 Local Regression: KNN Regression

51 Curse of Dimensionality for KNN

52 Local Linear Regression

• Instead of averaging neighbors, use linear regression in local neighborhoods. • Combines K-nearest neighbors and regression. • Idea: Any smooth function can be approximated by a linear one over a small range. • At each point, we predict using a linear regression that weights only nearby points. • Requires a choice of weight function, also called a kernel function. • Local regression is sometimes known as locally weighted regression, kernel-weighted regression, or kernel smoothing.

53 Local Linear Regression

54 Local Linear Regression

• Algorithm as in LOESS (LOcal regrESSion, Cleveland, 1979)

55 Local Linear Regression

• For the fit at a given point x_0, need to choose:

1. The span: the fraction of points treated as neighbors of x_0. 2. The weight (kernel) function applied to those neighbors.

• For LOWESS, the kernel is chosen as the tricube weight:

K(x_i, x_0) = (1 − (|x_i − x_0| / max_{j ∈ Neighborhood} |x_j − x_0|)^3)^3

• The span is often chosen via cross-validation. • Other types of local regression are also possible. • Entire textbooks are devoted to the subject. (A sketch follows.)
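A NumPy sketch of one local linear fit with the tricube kernel, under the assumptions above (hypothetical data, a fixed span, and nothing beyond ordinary weighted least squares; production LOESS implementations add robustness iterations and other refinements):

```python
import numpy as np

def tricube(u):
    """Tricube weights: (1 - |u|^3)^3 for |u| < 1, else 0."""
    u = np.abs(u)
    return np.where(u < 1, (1 - u**3) ** 3, 0.0)

def local_linear_fit(x, y, x0, span=0.3):
    """Fitted value at x0 from a weighted least squares line over the nearest span-fraction of points."""
    n = len(x)
    k = max(2, int(np.ceil(span * n)))               # neighborhood size from the span
    dist = np.abs(x - x0)
    idx = np.argsort(dist)[:k]                       # the k nearest points
    w = tricube(dist[idx] / dist[idx].max())         # tricube kernel weights
    X = np.column_stack([np.ones(k), x[idx] - x0])   # local intercept and slope, centered at x0
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y[idx])
    return beta[0]                                   # intercept = fitted value at x0

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 10, size=200))
y = np.sin(x) + rng.normal(0, 0.2, size=200)
fit = np.array([local_linear_fit(x, y, x0) for x0 in np.linspace(0, 10, 100)])
```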

56 Local Linear Regression


57 Local Regression

• KNN regression averages the observations within a neighborhood for prediction. • Can also use kernel weights. • Equivalent to using a degree-zero polynomial in LOWESS. • Advantages of local regression:

• Drawback: Curse of dimensionality. • Next we consider the most common approach to extending the univariate smoothers to multiple predictors.

More next time!

58 Generalized Additive Models

Chapter 7 – Part V

59 Generalized Additive Models

• Generalized Additive Models (GAMs) • GAMs for regression. • GAMs for classification. • Advantages and drawbacks of GAMs.

60 Generalized Additive Models

• Local methods, such as KNN regression or classification, naturally apply to any number of predictors. • Although problematic in high dimensions. • Regression splines and smoothing splines do not extend as directly.

• Possible to do multivariate basis functions.

• Common Approach: Generalized Additive Models

61 GAMs for Regression

• Generalized Additive Models extend linear regression via:

y_i = β_0 + f_1(x_i1) + f_2(x_i2) + ... + f_p(x_ip) + ε_i

where each f_j is a smooth, possibly nonlinear, function of a single predictor.

62 Example: GAM for Wage Data

• For wage data:

• 3 predictors: year and age (continuous); education (categorical).

63 Example: GAM for Wage Data

64 Example: GAM for Wage Data

• Can use smoothing splines or local regression for each continuous predictor's function, instead of basis functions. • But fitting is then more complex. • Backfitting: repeatedly update each f_j by smoothing the partial residuals (the response minus the current fits of all the other functions) against x_j, cycling until the estimates stop changing. (A sketch follows.)
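A bare-bones backfitting sketch (hypothetical names; the plug-in smoother here is a crude running mean purely for illustration, where real GAM software would use smoothing splines or local regression):

```python
import numpy as np

def backfit_additive(X, y, smoother, n_iter=20):
    """Backfitting: X is (n, p); 'smoother' maps (x_j, partial residual) -> fitted values."""
    n, p = X.shape
    alpha = y.mean()                     # intercept
    f = np.zeros((n, p))                 # current estimates of each f_j at the data points
    for _ in range(n_iter):
        for j in range(p):
            partial = y - alpha - f.sum(axis=1) + f[:, j]   # residuals excluding f_j
            f[:, j] = smoother(X[:, j], partial)            # smooth them against x_j
            f[:, j] -= f[:, j].mean()    # center each function for identifiability
    return alpha, f

def running_mean_smoother(x, r, k=15):
    """Toy univariate smoother: running mean of r in x-sorted order."""
    order = np.argsort(x)
    smoothed = np.convolve(r[order], np.ones(k) / k, mode="same")
    out = np.empty_like(r)
    out[order] = smoothed
    return out
```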

65 GAMs for Classification

• Can use GAMs within logistic regression.

• Model the log odds as a GAM:

log( p(X) / (1 − p(X)) ) = β_0 + f_1(X_1) + f_2(X_2) + ... + f_p(X_p), where p(X) = Pr(Y = 1 | X).

• Regression splines are most common. • This is just logistic regression with basis functions as predictors (see the sketch below).
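A minimal sketch of that last point (synthetic data, a hand-rolled cubic truncated power basis, and scikit-learn's LogisticRegression, which applies mild L2 regularization by default): a logistic GAM built from regression splines is simply a logistic regression on spline-basis columns for each continuous predictor.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

def cubic_spline_cols(x, knots):
    """Cubic truncated power basis columns (no intercept): x, x^2, x^3, (x - c_k)_+^3."""
    cols = [x, x**2, x**3] + [np.clip(x - c, 0.0, None) ** 3 for c in knots]
    return np.column_stack(cols)

# Synthetic stand-ins for two continuous predictors and a binary response
rng = np.random.default_rng(7)
x1, x2 = rng.uniform(0, 10, size=(2, 500))
p_true = 1 / (1 + np.exp(-(np.sin(x1) + 0.3 * (x2 - 5))))
z = rng.binomial(1, p_true)

# Spline-basis columns per predictor (knots at quartiles), then a single logistic fit
X = np.column_stack([
    cubic_spline_cols(x1, np.quantile(x1, [0.25, 0.5, 0.75])),
    cubic_spline_cols(x2, np.quantile(x2, [0.25, 0.5, 0.75])),
])
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, z)
```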

66 GAMs for Classification

• For wage data to predict high earners:

67 GAMs for Classification

• Looking at data, find no individuals with wages above $250,000 in first education category, i.e. education level below HS graduate.

68 Generalized Additive Models

• GAMs allow a nonlinear fit in the multiple regression setting. • Advantages: 1. Allows a nonlinear fit, which can give more accurate predictions. 2. The additive model allows us to examine the effect of each predictor separately. 3. Can specify the smoothness of each function. 4. Avoids the curse of dimensionality. • Problems: 1. Potentially many tuning parameters. 2. The assumption of additivity does not include interactions.

69 Tuning Parameters in GAMs

• For the Wage data, used 4 basis functions on one predictor and 5 on another. • How to choose this? • Ideally, we fit all possible combinations of integer pairs. • Not possible in dimension more than 2 or so. • Possible solutions: 1. 2.

• Both solutions reduce the p-dimensional tuning problem to a sequence of solutions to consider.

70 Interactions in Nonlinear Regression

• To include interaction terms 1. Direct approach: Can include products of basis functions into regression. • May be lots of terms. 2. Can fit local regression on pairs, or triplets of predictors. Use these as the blocks in additive model. Use backfitting block by block. • Includes interactions within blocks, but additive across variable blocks. 3. Bivariate smoothing splines (thin-plate splines). More complex. 4. More flexible approaches, such as tree-based approaches. • Will begin discussion next time.

More next time!
