Chapter 5 Nonparametric Regression
Contents

5 Nonparametric Regression
  1 Introduction
    1.1 Example: Prestige data
    1.2 Smoothers
  2 Local polynomial regression: Lowess
    2.1 Window-span
    2.2 Inference
  3 Kernel smoothing
  4 Splines
    4.1 Number and position of knots
    4.2 Smoothing splines
  5 Penalized splines

1 Introduction

A linear model is desirable because it is simple to fit, it is easy to understand, and there are many techniques for testing its assumptions. However, in many cases data are not linearly related, and we should not use linear regression. The traditional nonlinear regression model fits

y = f(X, β) + ε

where β = (β1, . . . , βp)′ is a vector of parameters to be estimated and X is the matrix of predictor variables. The function f(·), relating the average value of the response y to the predictors, is specified in advance, as it is in a linear regression model. But in some situations the structure of the data is so complicated that it is very difficult to find a function that estimates the relationship correctly. A solution is nonparametric regression.

The general nonparametric regression model is written in a similar manner, but f is left unspecified:

y = f(X) + ε = f(x1, . . . , xp) + ε

Most nonparametric regression methods assume that f(·) is a smooth, continuous function and that εi ∼ NID(0, σ²). An important case of the general nonparametric model is nonparametric simple regression, where we only have one predictor:

y = f(x) + ε

Nonparametric simple regression is often called scatterplot smoothing, because an important application is tracing a smooth curve through a scatterplot of y against x to display the underlying structure of the data.

1.1 Example: Prestige data

The data set contains data on prestige and some characteristics of 102 occupations in Canada in 1970. The variables in the data set are

• prestige: average prestige, rating from 0 to 100
• income: average occupational income, in dollars
• education: average years of schooling
• type: a categorical variable with three levels:
  – bc (blue collar)
  – wc (white collar)
  – prof (professional and managerial)

Fitting a linear model between income and prestige would give the result shown in Figure 1. A linear model is clearly not appropriate for the data; it would also be difficult to find a nonlinear model that fits the data correctly.

[Figure 1: Linear model for prestige data (prestige against average income).]
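As a quick illustration, the linear fit of Figure 1 can be reproduced from the Prestige data in the car package (the data used throughout this chapter); this is only a sketch to make the example reproducible:

library(car)
data(Prestige)
# Scatterplot of prestige against average income with the least-squares line of Figure 1
plot(Prestige$income, Prestige$prestige,
     xlab = "Average Income", ylab = "Prestige")
abline(lm(prestige ~ income, data = Prestige), lwd = 2)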
1.2 Smoothers

A smoother is a tool for summarizing the trend of a response variable y as a function of one or more predictor variables x. Since it is an estimate of the trend, it is less variable than y itself; that is why it is called a smoother (in this respect, even linear regression is a smoother). There are several types of nonparametric regression, but all of them have in common that they rely on the data to specify the form of the model: the curve at any point depends only on the observations at that point and some specified neighboring points. Some of the nonparametric regression techniques are:

1. Locally weighted regression smoother, lowess
2. Kernels
3. Splines
4. Penalized splines

2 Local polynomial regression: Lowess

The idea of local linear regression was proposed by Cleveland (1979). We try to fit the model

yi = f(xi) + εi

The steps to fit a lowess smoother are:

1. Define the window width (m): the window encloses the closest neighbors to each data observation. For this example we use m = 50, i.e., for each data point we select the 50 nearest neighbors (a window including the 50 nearest x-neighbors of x(80) is shown in Figure 2(a)).

[Figure 2: Lowess smoother. Panels: (a) the window around x(80), (b) the tricube weights in that window, (c) the locally weighted fit in the window, (d) the complete lowess curve; prestige (and tricube weight) against average income.]

2. Weighting the data: we use a kernel weight function to give the greatest weight to observations that are closest to the observation of interest x0. In practice, the tricube weight function is usually used:

W(z) = (1 − |z|³)³  for |z| < 1
W(z) = 0            for |z| ≥ 1

where zi = (xi − x0)/h and h is the half-width of the window. Notice that observations more than h away from x0 receive a weight of 0. It is typical to adjust h so that each local regression includes a fixed proportion of the data, s, which is called the span of the smoother. Figure 2(b) shows the tricube weights for observations in this neighborhood.

3. Locally weighted least squares: we then fit a polynomial regression at x0 by weighted least squares, using only the nearest-neighbor observations, minimizing the weighted residual sum of squares. Typically a local linear or local quadratic regression is used, but higher-order polynomials are also possible:

yi = a + b1(xi − x0) + b2(xi − x0)² + . . . + bp(xi − x0)^p + ei

From this regression we calculate the fitted value corresponding to x0 and plot it on the scatterplot. Figure 2(c) shows the locally weighted regression line fit to the data in the neighborhood of x0; the fitted value ŷ|x(80) is represented in this graph as a larger solid dot (a by-hand sketch of one such local fit follows the R code below).

4. Nonparametric curve: steps 1-3 are repeated for each observation in the data. Therefore, there is a separate local regression for each value of x, and the fitted value from these regressions is plotted on the scatterplot. The fitted values are connected, producing the nonparametric curve (see Figure 2(d)).

In R we can do this easily:

library(car)
data(Prestige)
attach(Prestige)
plot(income, prestige, xlab="Average Income", ylab="Prestige", main="(d)")
lines(lowess(income, prestige, f=0.5, iter=0), lwd=2)
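To make steps 1-3 concrete, the following sketch carries out a single local fit by hand at the focal value x(80), using m = 50 nearest neighbors and the tricube weights defined above. It is only an illustration of the idea, not the algorithm that lowess itself runs:

library(car)
data(Prestige)
x <- Prestige$income
y <- Prestige$prestige
x0 <- sort(x)[80]                             # focal value x(80)
m <- 50                                       # window width: 50 nearest neighbors
d <- abs(x - x0)                              # distances from x0
h <- sort(d)[m]                               # half-width of the window
z <- d / h
w <- ifelse(abs(z) < 1, (1 - abs(z)^3)^3, 0)  # tricube weights W(z)
fit <- lm(y ~ I(x - x0), weights = w)         # local linear fit by weighted least squares
yhat0 <- coef(fit)[1]                         # fitted value at x0 (the intercept)

Repeating this for every observation and connecting the fitted values reproduces, up to implementation details, the curve drawn by lowess above.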
In nonparametric regression we have no parameter estimates; our interest is in the fitted curve, i.e., we focus on how well the estimated curve represents the population curve. The assumptions under the lowess model are much less restrictive than the assumptions for the linear model: no strong global assumptions are made about µ; however, we assume that locally around a point x0, µ can be approximated by a polynomial function. Still, the errors εi are assumed independent and randomly distributed with mean 0. Finally, a number of choices (the span, the type of polynomial, and the type of weight function) affect the trade-off between the bias and the variance of the fitted curve.

2.1 Window-span

Recall that the span s is the proportion of the cases across the range of x included in each window. The size of s has an important effect on the curve. A span that is too small (meaning that insufficient data fall within the window) produces a curve characterized by a lot of noise; in other words, it results in a large variance. If the span is too large, the regression will be oversmoothed and the local polynomial may not fit the data well; this might result in the loss of important information, and the fit will have a large bias. We may choose the span in different ways:

1. Constant bandwidth: h is constant, i.e., a constant range of x is used to find the observations for the local regression. This works satisfactorily if the distribution of x is uniform and/or with large sample sizes. However, if x has a non-uniform distribution, it fails to capture the true trend, because some local neighborhoods may have no or too few cases. This is particularly problematic at the boundary regions.

2. Nearest-neighbor bandwidth: this method overcomes the sparse-data problem. The span s is chosen so that each local neighborhood always contains a specified proportion of the observations. Typically this is done by trial and error, changing the span until we have removed most of the roughness in the curve. The span s = 0.5 is usually a good starting point. In the function lowess, the default span is s = 0.75.

[Figure 3: Effect of the span on the fitted curve (s = 0.1, 0.37, 0.63, 0.9); prestige against average income.]

Figure 3 shows the effect of four different values of the span on the fitted curve for the prestige data.

2.2 Inference

Degrees of freedom

The concept of degrees of freedom for nonparametric regression models is not as intuitive as for linear models, since there are no estimated parameters. However, the degrees of freedom for a nonparametric model are a generalization of the number of parameters in a parametric model. In a linear model the number of degrees of freedom is equal to the number of estimated parameters, and this coincides with:

• Rank(H), the rank of the hat matrix
• Trace(H) = trace(HH′) = trace(2H − HH′)

Analogous degrees of freedom for nonparametric models are obtained by substituting for the hat matrix H the smoother matrix S, which plays the same role, i.e., it transforms y into ŷ (although it is not idempotent). The approximate degrees of freedom are defined in several ways:

• Trace(S)
• Trace(SS′)
• Trace(2S − SS′)

See Hastie and Tibshirani (1990) for more details. The residual degrees of freedom are defined as dfRES = n − df, and the estimated error variance is S² = Σi ei² / dfRES. Unlike the linear case, the d.f. of a nonparametric regression need not be whole numbers.
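These definitions can be made concrete for any linear smoother by constructing its smoother matrix S explicitly. The sketch below does so for a simple Nadaraya-Watson kernel smoother rather than for lowess (whose smoother matrix is less convenient to write down); the Gaussian kernel and the bandwidth h = 2000 are arbitrary choices made only for illustration:

library(car)
data(Prestige)
x <- Prestige$income
y <- Prestige$prestige
n <- length(y)
h <- 2000                                            # bandwidth, an arbitrary illustrative value
K <- outer(x, x, function(a, b) dnorm((a - b) / h))  # Gaussian kernel weights
S <- K / rowSums(K)                                  # smoother matrix: yhat = S %*% y
yhat <- S %*% y
df1 <- sum(diag(S))                                  # trace(S)
df2 <- sum(diag(S %*% t(S)))                         # trace(S S')
df3 <- sum(diag(2 * S - S %*% t(S)))                 # trace(2S - S S')
dfres <- n - df1                                     # residual d.f., taking df = trace(S)
s2 <- sum((y - yhat)^2) / dfres                      # estimated error variance S^2

Here dfres and s2 correspond to dfRES and S² as defined in this section, with trace(S) playing the role of the model degrees of freedom.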