
Chapter 5

Nonparametric Regression

Contents

1 Introduction
  1.1 Example: Prestige data
  1.2 Smoothers
2 Local polynomial regression: Lowess
  2.1 Window-span
  2.2 Inference
3 Kernel smoothing
4 Splines
  4.1 Number and position of knots
  4.2 Smoothing splines
5 Penalized splines

1 Introduction

A linear model is desirable because it is simple to fit, it is easy to understand, and there are many techniques for testing its assumptions. However, in many cases data are not linearly related, and in those cases we should not use linear regression.

The traditional regression model fits

y = f(Xβ) + ε,

where β = (β1, . . . , βp)' is a vector of parameters to be estimated and X is the matrix of predictor variables. The function f(·), relating the average value of the response y to the predictors, is specified in advance, as it is in a linear regression model. But in some situations the structure of the data is so complicated that it is very difficult to find a function that estimates the relationship correctly. A solution is nonparametric regression. The general nonparametric regression model is written in a similar manner, but f is left unspecified:

y = f(X) + ε = f(x1, . . . , xp) + ε.

Most nonparametric regression methods assume that f(·) is a smooth, continuous function and that εi ∼ NID(0, σ²). An important special case of the general nonparametric model is nonparametric simple regression, where there is only one predictor:

y = f(x) + ε.

Nonparametric simple regression is often called scatterplot smoothing, because an important application is tracing a smooth curve through a scatterplot of y against x to display the underlying structure of the data.

1.1 Example: Prestige data

The data set contains data on prestige and other characteristics of 102 occupations in Canada in 1970. The variables in the data set are

• prestige: Average prestige rating, from 0 to 100
• income: Average occupational income, in dollars
• education: Average years of schooling
• type: A factor with three levels:
  – bc (blue collar)
  – wc (white collar)
  – prof (professional and managerial)

Fitting a linear model between income and prestige gives the result shown in Figure 1. A linear model is clearly not appropriate for these data, and it would also be difficult to find a parametric nonlinear model that fits the data correctly.


Figure 1: Linear model for prestige data
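As a point of reference, a plot like Figure 1 can be produced along the following lines (a sketch; it uses the Prestige data from the car package, which is also used in the examples below):

library(car)                             # provides the Prestige data set
data(Prestige)
attach(Prestige)
plot(income, prestige, xlab="Average Income", ylab="Prestige")
abline(lm(prestige ~ income), lwd=2)     # least-squares line as in Figure 1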

1.2 Smoothers

A smoother is a tool for summarizing the trend of a response variable y as a function of one or more predictor variables x. Since it is an estimate of the trend, it is less variable than y itself, which is why it is called a smoother (in this respect, even linear regression is a smoother).

There are several types of nonparametric regression, but all of them have in common that they rely on the data to specify the form of the model: the curve at any point depends only on the observations at that point and some specified neighboring points. Some of the nonparametric regression techniques are:

1. Locally weighted regression smoother, lowess.

2. Kernels

3. Splines

4. Penalized splines

2 Local polynomial regression: Lowess

The idea of local linear regression was proposed by Cleveland (1979). We try to fit the model

yi = f(xi) + εi

The steps to fit a lowess smoother are:

1. Define the window width (m): the window encloses the m closest x-neighbors of each data observation. For this example we use m = 50, i.e., for each data point we select its 50 nearest neighbors (a window including the 50 nearest x-neighbors of x(80) is shown in Figure 2(a)).

Figure 2: Lowess smoother: (a) window of the 50 nearest x-neighbors of x(80), (b) tricube weights within the window, (c) locally weighted regression line in the neighborhood of x(80), (d) the complete lowess curve.

2. Weighting the data: We use a kernel weight function to give the greatest weight to observations that are closest to the observation of interest x0. In practice, the tricube weight function is usually used:

W(z) = (1 − |z|³)³   for |z| < 1
     = 0             for |z| ≥ 1

where zi = (xi − x0)/h, and h is the half-width of the window. Notice that observations more than h away from x0 receive a weight of 0. It is typical to adjust h so that each window includes a fixed proportion s of the data; s is called the span of the smoother. Figure 2(b) shows the tricube weights for observations in this neighborhood.

3. Locally weighted regression: We then fit a polynomial regression centered at x0, using only the nearest-neighbor observations and minimizing the weighted residual sum of squares (a small R sketch illustrating steps 1-3 follows this list). Typically a local linear or local quadratic regression is used, but higher-order polynomials are also possible:

yi = b0 + b1(xi − x0) + b2(xi − x0)² + . . . + bp(xi − x0)^p + ei

From this regression we calculate the fitted value corresponding to x0 and plot it on the scatterplot. Figure 2(c) shows the locally weighted regression line fit to the data in the neighborhood of x0; the fitted value ŷ|x(80) is represented in this graph as a larger solid dot.

4. Nonparametric curve: Steps 1-3 are repeated for each observation in the data. Therefore, there is a separate local regression for each value of x, and the fitted value from each of these regressions is plotted on the scatterplot. The fitted values are connected, producing the nonparametric curve (see Figure 2(d)).
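The following sketch makes steps 1-3 concrete for a single focal value x0 (illustrative only; local.fit is a made-up helper, and the Prestige data are assumed to be attached as in the code further below). The built-in lowess() used next automates this for every observation:

tricube <- function(z) ifelse(abs(z) < 1, (1 - abs(z)^3)^3, 0)   # W(z)

local.fit <- function(x0, x, y, m = 50) {
  h <- sort(abs(x - x0))[m]          # step 1: half-width = distance to the m-th nearest neighbor
  w <- tricube((x - x0) / h)         # step 2: tricube weights, zero outside the window
  fit <- lm(y ~ x, weights = w)      # step 3: locally weighted (linear) regression
  predict(fit, newdata = data.frame(x = x0))   # fitted value at x0
}

# Example: fitted value at the 80th ordered income
# x0 <- sort(income)[80]
# local.fit(x0, income, prestige)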

In R we can do this easily:

library(car)
data(Prestige)
attach(Prestige)
plot(income, prestige, xlab="Average Income", ylab="Prestige", main="(d)")
lines(lowess(income, prestige, f=0.5, iter=0), lwd=2)

In nonparametric regression there are no parameter estimates; our interest is in the fitted curve, i.e., we focus on how well the estimated curve represents the population curve.

The assumptions under the lowess model are much less restrictive than those of the linear model: no strong global assumptions are made about µ; however, we assume that locally, around a point x0, µ can be approximated by a polynomial function. Still, the errors εi are assumed independent with mean 0. Finally, a number of choices (the span, the type of polynomial and the type of weight function) affect the trade-off between the bias and the variance of the fitted curve.

2.1 Window-span

Recall that the span s is the proportion of cases included in each window across the range of x. The size of s has an important effect on the curve. A span that is too small (meaning insufficient data fall within the window) produces a curve characterized by a lot of noise; in other words, it results in a large variance. If the span is too large, the regression will be oversmoothed and the local polynomial may not fit the data well; this can result in a loss of important information, and the fit will have a large bias.

We may choose the span in different ways:

1. Constant bandwidth: h is constant, i.e., a constant range of x is used to find the observations for each local regression. This works satisfactorily if the distribution of x is uniform and/or the sample size is large. However, if x has a non-uniform distribution it can fail to capture the true trend, because some local neighborhoods may contain no cases or too few cases. This is particularly problematic in the boundary regions.

2. Nearest-neighbor bandwidth: This method overcomes the sparse-data problem. The span s is chosen so that each local neighborhood always contains a specified proportion of the observations. Typically this is done by trial and error, changing the span until most of the roughness has been removed from the curve. The span s = 0.5 is generally a good starting point. In the function loess the default span is s = 0.75 (for lowess the default is f = 2/3).


Figure 3: Effect of the span (s = 0.1, 0.37, 0.63, 0.9) on the fitted curve

Figure 3 shows the effects of 4 different values of the span on the fitted curve for the prestige data.
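A figure like Figure 3 can be produced by overlaying lowess fits with different spans, along these lines (a sketch, assuming the Prestige data are attached; colours and legend placement are arbitrary choices):

plot(income, prestige, xlab="Average Income", ylab="Prestige")
spans <- c(0.1, 0.37, 0.63, 0.9)
for (i in seq_along(spans))
  lines(lowess(income, prestige, f=spans[i], iter=0), col=i, lwd=2)
legend("bottomright", legend=paste("s =", spans), col=seq_along(spans), lwd=2)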

2.2 Inference

Degrees of freedom

The concept of degrees of freedom for nonparametric regression models is not as intuitive as for linear models, since there are no estimated parameters. However, the degrees of freedom for a nonparametric model are a generalization of the number of parameters in a linear model. In a linear model the number of d.f. is equal to the number of estimated parameters, and this coincides with:

• Rank(H), where H is the hat matrix

• trace(H) = trace(HH') = trace(2H − HH')

Analogous degrees of freedom for nonparametric models are obtained by substituting for the hat matrix H the smoother matrix S, which plays the same role, i.e., it transforms y into ŷ (although it is not idempotent). The approximate degrees of freedom are defined in several ways:

• trace(S)

• trace(SS')

• trace(2S − SS')

see Hastie and Tibshirani (1990) for more details. The residual degrees of freedom are defined as dfRES = n − df, and the estimated error variance is S² = Σ_i e_i² / dfRES. Unlike the linear case, the d.f. are not necessarily whole numbers.
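In R these quantities can be read off a loess fit roughly as follows (a sketch assuming the Prestige data are attached; the component names trace.hat and n are taken from R's loess object and should be checked against your version):

lo1 <- loess(prestige ~ income, span=0.7, degree=1)
df.mod <- lo1$trace.hat                 # approximate model d.f., trace(S)
df.res <- lo1$n - df.mod                # residual degrees of freedom, n - df
s2 <- sum(residuals(lo1)^2) / df.res    # estimated error variance S^2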

Confidence band for the curve

Assuming normally distributed errors, an approximate 95% confidence band for the population regression µ|xi is

ŷi ± 2 √V̂(ŷi).

Since ŷ = Sy,

V(ŷ) = σ²SS'  ⇒  V̂(ŷi) = σ̂²(SS')_{ii}.

ord=order(income)   # ordering of the x values, needed below for plotting lines
lo1=loess(prestige~income, span=0.7,degree=1)
plot(income, prestige, xlab="Average Income", ylab="Prestige",ylim=c(10,100))
lines(income[ord],lo1$fitted[ord])
lo2=predict(lo1,se=T)
lines(income[ord],lo1$fitted[ord]-2*lo2$se.fit[ord],lty=2,col=4)
lines(income[ord],lo1$fitted[ord]+2*lo2$se.fit[ord],lty=2,col=4)


Figure 4: Fitted curve and confidence band for prestige data

Hypothesis test

F-tests comparing the residual sums of squares of nested models can be carried out in much the same way as for linear models (these tests are only approximate, because of the approximation involved in the d.f.).

Perhaps the most useful aspect of nonparametric regression is that it allows us to test for nonlinearity by contrasting it with the linear regression model; these models are nested, since a linear model is a special case of the general relationship. The F-test for testing

H0 : Model M0

H1 : Model M1

takes the form

F0 = [(RSS0 − RSS1)/(trace(S) − 2)] / [RSS1/(n − trace(S))],

where RSS0 and RSS1 are the residual sums of squares for models M0 and M1, respectively, and trace(S) is the approximate d.f. of the nonparametric model.

We construct the following function in R:

anova.loess <- function(model0, model1){
  # residual sums of squares of the two fits
  RSS0 = sum(model0$resid^2)
  RSS1 = sum(model1$resid^2)
  # approximate degrees of freedom and sample size
  df0 = model0$trace
  df1 = model1$trace
  n = model0$n
  # approximate F-test and p-value
  Ftest = ((RSS0-RSS1)/(df1-df0))/(RSS1/(n-df1))
  p = 1-(pf(Ftest,df1,n-df1))
  output = cbind(Ftest,p)
  output
}

lo0=loess(prestige~income, span=100,degree=1)   # span large enough to give an essentially linear fit
anova.loess(lo0,lo1)

        Ftest            p
[1,] 8.478734 2.604558e-06

Therefore, there is a statistically significant departure from linearity.

3 Kernel smoothing

A kernel smoother uses weights that decrease in a smooth fashion as we move away from the point of interest x0. The weight for the j-th point in the estimate at x0 is defined by

S0j = (c0/λ) d((x0 − xj)/λ),

where d(t) is a decreasing function of |t|, λ is the window width (or bandwidth), and c0 is a constant such that the weights sum to 1. Typically d(·) is the standard normal density, giving the standard Gaussian kernel smoother. These smoothers generally perform worse at the endpoints than the lowess (see Figure 5).

plot(income, prestige, xlab="Average Income", ylab="Prestige")
lines(ksmooth(income[ord], prestige[ord], x.points=unique(income[ord]),
      "normal", bandwidth=6000))


Figure 5: Fitted curve using Gaussian kernel smoother
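To make the weight formula above concrete, a Gaussian kernel smoother can be hand-rolled as follows (an illustrative sketch only; gauss.smooth is a made-up helper, and ksmooth() above is the standard tool):

gauss.smooth <- function(x, y, lambda, x.points = sort(unique(x))) {
  sapply(x.points, function(x0) {
    w <- dnorm((x0 - x) / lambda)   # d(.) is the standard normal density
    w <- w / sum(w)                 # c0: rescale so the weights sum to 1
    sum(w * y)                      # weighted average = fitted value at x0
  })
}
# lines(sort(unique(income)), gauss.smooth(income, prestige, lambda=2000), col=2)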

4 Splines

Splines are piecewise polynomial functions that are constrained to join at points called knots; the knots divide the range of x into regions. Splines require consideration of three elements:

1. Degree of the polynomial

2. Number of knots

3. Location of the knots

Although many different configurations are possible, a popular choice consists of piecewise cubic polynomials constrained to be continuous and to have continuous first and second derivatives at the knots (these constraints force the polynomials to join smoothly at these points, see Figure 6).

A cubic spline with two knots c1 and c2 is as follows:

y = β0 + β1x + β2x² + β3x³ + β4(x − c1)+³ + β5(x − c2)+³,

Figure 6: A series of piecewise-cubic polynomials, with increasing orders of continuity

where (u)+ = u if u > 0 and 0 otherwise. If there are k knots, the function requires k + 4 regression coefficients.
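As an illustration, the cubic spline above can be fitted directly with lm() using this truncated-power basis (a sketch; the knots c1, c2 at the terciles of income are an arbitrary choice, and pos() is a hypothetical helper for (u)+; the Prestige data are assumed attached):

pos <- function(u) ifelse(u > 0, u, 0)                     # (u)+
c1 <- quantile(income, 1/3); c2 <- quantile(income, 2/3)   # two illustrative knots
cubspl <- lm(prestige ~ income + I(income^2) + I(income^3) +
             I(pos(income - c1)^3) + I(pos(income - c2)^3))
length(coef(cubspl))   # k + 4 = 6 coefficients for k = 2 knots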

Natural cubic splines extend the regular cubic spline by constraining the fit to be linear beyond the boundary knots; therefore, they require k + 2 parameters. This is because the requirement that the derivative is continuous at c1 and ck is not imposed, thus removing parameters at each end of the data:

y = β0 + β1x + β2(x − c1)+³ + β3(x − c2)+³.

4.1 Number and position of knots

Cubic splines are fixed-knot splines; therefore we need to select the number and placement of the knots. The number of knots is more important than where they are placed, and placing the knots at evenly spaced quantiles of the explanatory variable's distribution works well.

Typically a choice of 3 ≤ k ≤ 7 works well. For large sample sizes (n ≥ 100) with a continuous response variable, k = 5 usually provides a good compromise between flexibility and loss of precision. For small sample sizes (n ≤ 30), k = 3 is a good starting point. The Akaike Information Criterion (AIC) can be used to select k.

library(splines)
natspl=lm(prestige~ns(income,df=5))
plot(income, prestige, xlab="Average Income", ylab="Prestige")
lines(income[ord],natspl$fitted[ord])


Figure 7: Natural cubic spline with 4 knots at suitable quantiles of income.
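For the AIC-based choice of k mentioned above, something along these lines can be used (a sketch; the mapping from k to the df argument of ns() is an assumption and should be adapted to the definition of k in use):

library(splines)
aic.k <- sapply(3:7, function(k) AIC(lm(prestige ~ ns(income, df = k))))
names(aic.k) <- 3:7
aic.k            # choose the k with the smallest AIC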

4.2 Smoothing splines

Smoothing splines are the solution to the problem of minimizing the penalized residual sum of squares

RSS(f, λ) = Σ_{i=1}^{n} (yi − f(xi))² + λ ∫_{x1}^{xn} [f''(x)]² dx.

The first term measures closeness to the data, and the second term penalizes curvature in the function. Here, λ is the smoothing parameter, and it establishes the trade-off between the bias and the variance of the fitted curve. If λ = 0, the curve interpolates the data; if λ → ∞, the second derivative is constrained to 0, i.e., we have a least squares fit (we are fitting a line).

Smoothing splines are natural cubic splines with knots at all the unique values of x. This might seem an over-parameterized model; however, the penalty term ensures that the coefficients are shrunk towards linearity, limiting the number of degrees of freedom used. The smoothing spline estimator is linear in the sense that there are basis functions hi such that

f̂λ(x) = Σ_{i=1}^{n} hi(x) yi.

We can rewrite the residual sum of squares as

RSS(θ, λ) = (y − hθ)'(y − hθ) + λθ'Ωθ.

The solution is

θ̂ = (h'h + λΩ)⁻¹ h'y.

The parallel with ridge regression is obvious: remember that in ridge regression, the larger λ is, the smaller the coefficients become; the same happens with smoothing splines. The fitted smoothing spline is given by

f̂(x) = Σ_j hj(x) θ̂j.

The smoothing spline estimator creates a new problem: how do we determine the appropriate value of the smoothing parameter λ for a given data set?

Smoothing parameter selection

The “best” choice of smoothing parameter is one that minimizes the average mean squared error

MSE(λ) = n⁻¹ Σ_{i=1}^{n} E(f̂λ(xi) − f(xi))².

We must estimate MSE(λ) in order to get an estimate of λ. Another quantity, which differs from MSE only by a constant function of σ², is the average predicted squared error

PSE(λ) = n⁻¹ Σ_{i=1}^{n} E(yi* − f̂λ(xi))²,

where yi* is a new observation at xi, that is, yi* = f(xi) + εi*. It is easy to show that PSE = MSE + σ².

1. Cross-validation

Cross-validation works by leaving points (xi, yi) out one at a time and estimating the smooth at xi based on the remaining n − 1 points. We construct the cross-validation sum of squares

CV(λ) = n⁻¹ Σ_{i=1}^{n} (yi − f̂λ^{(−i)}(xi))²,

where f̂λ^{(−i)}(xi) indicates the fit at xi computed by leaving out the ith data point. We use CV for smoothing parameter selection as follows: compute CV(λ) for a number of values of λ and select the value of λ which minimizes CV(λ). This procedure is justified by the fact that E(CV(λ)) ≈ PSE(λ); see chapter 3 of Hastie and Tibshirani (1990) for details.

In the case of a linear smoother:

CV(λ) = n⁻¹ Σ_{i=1}^{n} [(yi − f̂λ(xi)) / (1 − S_{ii}(λ))]²,

where S(λ) is the hat matrix of the model, i.e., ŷ = Sy.

spline1=smooth.spline(income,prestige,cv=TRUE)
> spline1$cv.crit
[1] 127.4208
> spline1$lambda
[1] 0.01474169
spline1$fit
lines(spline1$x,spline1$y,col=2)


Figure 8: Natural cubic spline with 4 knots at suitable quantiles of income (black) and smoothing spline with smoothing parameter selected by cross-validation (red).

2. Generalized cross-validation

Until recently it was not known how to compute the diagonal elements of S for a smoothing spline in an efficient way, and this led to generalized cross-validation (GCV), which replaces S_{ii} by its average value trace(S)/n, which is easier to compute:

GCV(λ) = n⁻¹ Σ_{i=1}^{n} [(yi − f̂λ(xi)) / (1 − trace(S)/n)]².

In most cases CV and GCV behave in a similar way; however, in some situations CV tends to undersmooth the data compared to GCV.

spline2=smooth.spline(income,prestige)
lines(spline2$x,spline2$y,col=4)

3. Choose the degrees of freedom

Another possibility is to choose the degrees of freedom, approximated by trace(S), and the appropriate λ for those d.f. will then be selected.
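In R this can be done through the df argument of smooth.spline(), for example (a sketch; df = 5 is an arbitrary illustrative choice):

spline3=smooth.spline(income, prestige, df=5)   # fix the effective d.f., trace(S)
spline3$lambda                                  # the smoothing parameter implied by df = 5
lines(spline3$x, spline3$y, col=3)              # add the curve to the existing plot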


Figure 9: Natural cubic spline with 4 knots at suitable quantiles of income (black), smoothing spline with smoothing parameter selected by cross-validation (red), smoothing spline with smoothing parameter selected by generalized cross-validation (blue).

5 Penalized splines

B-splines are local basis functions, consisting of polynomial segments of low degree, commonly quadratic or cubic. The positions where the segments join are called the knots. B-splines have local support and are thus suitable for smoothing and interpolating data with complex patterns. They also reduce smoothing to linear regression, with large advantages when one needs standard errors, builds semiparametric models or works with non-normal data. Unfortunately, control over smoothness is limited: one can only change the number and positions of the knots. If there are no reasons to assume that smoothness is non-uniform, the knots will be equally spaced and the only tuning parameter is their (discrete) number. B-splines also have difficulties with sparse data: some of the basis functions may have little or no support, and so their coefficients may be unstable or even impossible to estimate.

Eilers and Marx (1996) proposed the P-spline recipe: 1) use a (quadratic or cubic) B-spline basis with a large number of knots, say 10 to 50; 2) introduce a penalty on (second or third order) differences of the B-spline coefficients; 3) minimize the resulting penalized least squares function; 4) tune smoothness with the weight of the penalty, using cross-validation or AIC to determine the optimal weight.

Let y and x, each vectors of length n, represent the observed and explanatory variables, respectively. Once a set of knots is chosen, the B-spline basis B follows from x. If there are m basis functions then B is n × m. In the case of normally distributed observations the model is y = Ba + ε, with independent errors ε. Non-normal data are discussed in Marx and Eilers (1999). In the case of B-spline regression the sum of squares of residuals ||y − Ba||² is minimized, the normal equations B'Bâ = B'y are obtained, and the explicit solution â = (B'B)⁻¹B'y results.

The penalized spline (P-spline) approach minimizes the penalized least squares function

S = ||y − Ba||² + λ||Dd a||²,                                             (1)

where Dd is a matrix that forms differences of order d, i.e. Dd a = Δᵈ a. Examples of this matrix for d = 1 and d = 2 are:

  D1 = ⎡ −1   1   0   0 ⎤       D2 = ⎡ 1  −2   1   0   0 ⎤
       ⎢  0  −1   1   0 ⎥            ⎢ 0   1  −2   1   0 ⎥                (2)
       ⎣  0   0  −1   1 ⎦            ⎣ 0   0   1  −2   1 ⎦

Minimizing S leads to the penalized normal equations

(B'B + λDd'Dd) â = B'y,                                                  (3)

from which an explicit solution for â follows immediately. An important point is that the size of the system of equations is equal to the number of B-spline functions in the basis, independent of the number of observations.

The positive parameter λ determines the influence of the penalty. If λ is zero, we are back to B-spline regression; increasing λ makes â, and hence μ̂ = Bâ, smoother. Eilers and Marx (1996) show that, as λ → ∞, μ̂ approaches the best fitting (in the least squares sense) polynomial of degree d − 1.

It is easy to see that

ŷ = B(B'B + λDd'Dd)⁻¹B'y = Hy.                                           (4)

The matrix H is commonly called the hat matrix; it is an extremely useful tool. It first shows that, for a given λ, the smoother is linear. Further, the trace of H gives a measure of the effective dimension of the model (Hastie and Tibshirani, 1990, p. 52):

ED = tr(H) = tr(B(B'B + λDd'Dd)⁻¹B') = tr((B'B + λDd'Dd)⁻¹B'B).          (5)

ED ranges from m, the size of the B-spline basis, when λ = 0, to d, the order of the differences, when λ becomes very large. Note that H is not a projection matrix, i.e., H² ≠ H, as in standard linear regression: smoothing ŷ again does not reproduce ŷ.

A common way to optimize a smoothing parameter is leave-one-out cross-validation. Imagine leaving observation i out of the data, fitting the model and computing an interpolated value ŷ_{-i} and the deletion residual yi − ŷ_{-i}. Repeating this for all observations, we arrive at the cross-validation sum of squares CV = Σ_i (yi − ŷ_{-i})² (Hastie and Tibshirani, 1990, p. 43). If followed literally, this would be a rather expensive recipe, certainly when we have many observations. However we can show that

yi − ŷ_{-i} = (yi − ŷi) / (1 − h_{ii}),   i = 1, . . . , n.              (6)

source("Pspline.R")
library(gam)
ps1=pspline.fit(prestige,income,ps.intervals=10,degree=3,order=2)

Figure 10: Penalized spline with a cubic B-spline basis with 10 knots, second order penalty and λ = 10
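Because pspline.fit() comes from the external Pspline.R script, it may be helpful to see the recipe written out directly from equations (1)-(3). The sketch below is an illustrative reimplementation under those equations (pspline.sketch is a made-up name, and the equally spaced knot construction is an assumption), not the function used above:

library(splines)

pspline.sketch <- function(x, y, nseg = 10, degree = 3, order = 2, lambda = 10) {
  # cubic B-spline basis on equally spaced knots covering the range of x
  xl <- min(x); xr <- max(x)
  dx <- (xr - xl) / nseg
  knots <- seq(xl - degree * dx, xr + degree * dx, by = dx)
  B <- splineDesign(knots, x, ord = degree + 1, outer.ok = TRUE)
  # difference penalty matrix D_d of order d
  D <- diff(diag(ncol(B)), differences = order)
  # penalized normal equations (3): (B'B + lambda D'D) a = B'y
  a <- solve(t(B) %*% B + lambda * t(D) %*% D, t(B) %*% y)
  list(coef = a, fitted = as.vector(B %*% a))
}

# Example (assuming the Prestige data are attached as above):
# fit <- pspline.sketch(income, prestige, nseg=10, degree=3, order=2, lambda=10)
# ord <- order(income)
# plot(income, prestige, xlab="Average Income", ylab="Prestige")
# lines(income[ord], fit$fitted[ord], col=2)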

Bibliography

Cleveland, W. (1979). Robust locally-weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74:829–836.

Eilers, P. and Marx, B. (1996). Flexible smoothing with B-splines and penalties. Statistical Science, 11:89–121.

Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. Monographs on Statistics and Applied Probability. Chapman and Hall, London.

Marx, B. and Eilers, P. (1999). Generalized linear regression on sampled signals and curves: A p-spline approach. Technometrics, 41:1–13.
