Nonparametric Regression 10/36-702 Larry Wasserman

1 Introduction

Now we focus on the following problem: given a sample $(X_1,Y_1),\ldots,(X_n,Y_n)$, where $X_i\in\mathbb{R}^d$ and $Y_i\in\mathbb{R}$, estimate the regression function
$$m(x) = E(Y\mid X = x) \qquad (1)$$
without making parametric assumptions (such as linearity) about the regression function $m(x)$. Estimating $m$ is called nonparametric regression or smoothing. We can write $Y = m(X) + \epsilon$ where $E(\epsilon) = 0$. This follows since $\epsilon = Y - m(X)$ and $E(\epsilon) = E(E(\epsilon\mid X)) = E(m(X) - m(X)) = 0$.

A related problem is nonparametric prediction. Given a pair (X,Y ), we want to predict Y from X. The optimal predictor (under squared error loss) is the regression function m(X). Hence, estimating m is of interest for its own sake and for the purposes of prediction.

Example 1 Figure 1 shows data on bone mineral density. The plots show the relative change in bone density over two consecutive visits, for men and women. The smooth estimates of the regression functions suggest that a growth spurt occurs two years earlier for females. In this example, $Y$ is the change in bone mineral density and $X$ is age.

Example 2 Figure 2 shows an analysis of some diabetes data from Efron, Hastie, Johnstone and Tibshirani (2004). The outcome $Y$ is a measure of disease progression after one year. We consider four covariates (ignoring, for now, six other variables): age, bmi (body mass index), and two variables representing blood serum measurements. A nonparametric regression model in this case takes the form
$$Y = m(x_1,x_2,x_3,x_4) + \epsilon. \qquad (2)$$
A simpler, but less general, model is the additive model

$$Y = m_1(x_1) + m_2(x_2) + m_3(x_3) + m_4(x_4) + \epsilon. \qquad (3)$$

Figure 2 shows the four estimated functions $\hat m_1$, $\hat m_2$, $\hat m_3$ and $\hat m_4$.

Notation. We use $m(x)$ to denote the regression function. Often we assume that $X_i$ has a density denoted by $p(x)$. The support of the distribution of $X_i$ is denoted by $\mathcal{X}$. We assume that $\mathcal{X}$ is a compact subset of $\mathbb{R}^d$. Recall that the trace of a square matrix $A$ is denoted by $\mathrm{tr}(A)$ and is defined to be the sum of the diagonal elements of $A$.

Figure 1: Bone Mineral Density Data. Relative change in BMD versus age, shown in separate panels for females and males.

2 The Bias–Variance Tradeoff

Let $\hat m(x)$ be an estimate of $m(x)$. The pointwise risk (or pointwise squared error) is
$$R(m(x),\hat m(x)) = E\big(\hat m(x) - m(x)\big)^2. \qquad (4)$$
The predictive risk is
$$R(m,\hat m) = E\big((Y - \hat m(X))^2\big) \qquad (5)$$
where $(X,Y)$ denotes a new observation. It follows that
$$R(m,\hat m) = \sigma^2 + \int E\big(m(x) - \hat m(x)\big)^2\,dP(x) \qquad (6)$$
$$= \sigma^2 + \int b_n^2(x)\,dP(x) + \int v_n(x)\,dP(x) \qquad (7)$$
where $b_n(x) = E(\hat m(x)) - m(x)$ is the bias and $v_n(x) = \mathrm{Var}(\hat m(x))$ is the variance.

The estimator $\hat m$ typically involves smoothing the data in some way. The main challenge is to determine how much smoothing to do. When the data are oversmoothed, the bias term

Figure 2: Diabetes Data. Estimated additive components for age, bmi, map and tc.

is large and the variance is small. When the data are undersmoothed the opposite is true. This is called the bias–variance tradeoff. Minimizing risk corresponds to balancing bias and variance.

An estimator $\hat m$ is consistent if
$$\|\hat m - m\| \stackrel{P}{\to} 0. \qquad (8)$$
The minimax risk over a set of functions $\mathcal{M}$ is
$$R_n(\mathcal{M}) = \inf_{\hat m}\sup_{m\in\mathcal{M}} R(m,\hat m) \qquad (9)$$
and an estimator is minimax if its risk is equal to the minimax risk. We say that $\hat m$ is rate optimal if
$$R(m,\hat m) \asymp R_n(\mathcal{M}). \qquad (10)$$
Typically the minimax rate is of the form $n^{-2\beta/(2\beta+d)}$ for some $\beta > 0$.

3 The Kernel Estimator

The simplest nonparametric estimator is the kernel estimator. The word “kernel” is often used in two different ways. Here we are referring to smoothing kernels. Later we will discuss Mercer kernels, which are a distinct (but related) concept.

A one-dimensional smoothing kernel is any smooth, symmetric function $K$ such that $K(x)\ge 0$ and
$$\int K(x)\,dx = 1, \qquad \int xK(x)\,dx = 0 \qquad\text{and}\qquad \sigma_K^2 \equiv \int x^2K(x)\,dx > 0. \qquad (11)$$

Let $h > 0$ be a positive number, called the bandwidth. The Nadaraya–Watson kernel estimator is defined by
$$\hat m(x) \equiv \hat m_h(x) = \frac{\sum_{i=1}^n Y_i\,K\!\left(\frac{\|x - X_i\|}{h}\right)}{\sum_{i=1}^n K\!\left(\frac{\|x - X_i\|}{h}\right)} = \sum_{i=1}^n Y_i\,\ell_i(x) \qquad (12)$$
where $\ell_i(x) = K(\|x - X_i\|/h)\big/\sum_j K(\|x - X_j\|/h)$.

Thus $\hat m(x)$ is a local average of the $Y_i$'s. It can be shown that the optimal kernel is the Epanechnikov kernel. But, as with density estimation, the choice of kernel $K$ is not too important. Estimates obtained by using different kernels are usually numerically very similar. This observation is confirmed by theoretical calculations which show that the risk is very insensitive to the choice of kernel. What matters much more is the choice of bandwidth $h$, which controls the amount of smoothing. Small bandwidths give very rough estimates while larger bandwidths give smoother estimates.
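
To make this concrete, here is a minimal sketch of the Nadaraya–Watson estimator of equation (12) in Python; the Gaussian kernel, the bandwidth value and the simulated data are illustrative choices of mine, not part of the notes.

```python
import numpy as np

def nw_estimate(x, X, Y, h):
    """Nadaraya-Watson estimate of m(x): a locally weighted average of the Y_i."""
    dists = np.linalg.norm(X - x, axis=1) / h      # ||x - X_i|| / h
    w = np.exp(-0.5 * dists ** 2)                  # Gaussian kernel weights
    if w.sum() == 0.0:
        return 0.0                                 # convention when no mass is nearby
    ell = w / w.sum()                              # the weights ell_i(x); they sum to 1
    return float(np.sum(ell * Y))                  # sum_i ell_i(x) * Y_i

# Toy illustration on simulated data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
Y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(200)
print(nw_estimate(np.array([0.5]), X, Y, h=0.05))
```

The array `ell` computed inside the function contains exactly the weights $\ell_i(x)$ appearing in equation (12).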

The kernel estimator can be derived by minimizing the localized squared error

$$\sum_{i=1}^n K\left(\frac{x - X_i}{h}\right)\big(c - Y_i\big)^2. \qquad (13)$$

A simple calculation shows that this is minimized by the kernel estimator $c = \hat m(x)$ as given in equation (12).
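
For completeness, here is the simple calculation, written as a one-line sketch with the shorthand $K_i = K\big((x - X_i)/h\big)$:
$$\frac{\partial}{\partial c}\sum_{i=1}^n K_i\,(c - Y_i)^2 = 2\sum_{i=1}^n K_i\,(c - Y_i) = 0 \quad\Longrightarrow\quad c = \frac{\sum_{i=1}^n K_iY_i}{\sum_{i=1}^n K_i} = \hat m(x).$$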

Kernel regression and kernel density estimation are related. Let $\hat p(x,y)$ be the kernel density estimator and define
$$\hat m(x) = \hat E(Y\mid X = x) = \int y\,\hat p(y\mid x)\,dy = \frac{\int y\,\hat p(x,y)\,dy}{\hat p(x)} \qquad (14)$$
where $\hat p(x) = \int\hat p(x,y)\,dy$. Then $\hat m(x)$ is the Nadaraya–Watson kernel regression estimator. Homework: Prove this.

4 Analysis of the Kernel Estimator

The bias-variance decomposition has some surprises with important practical consequences. Let’s start with the one dimensional case.

Theorem 3 Suppose that:

1. $X \sim P_X$ and $P_X$ has density $p$.

2. $x$ is in the interior of the support of $P_X$.

3. $\lim_{|u|\to\infty} uK(u) = 0$.

4. $E(Y^2) < \infty$.

5. $p(x) > 0$ where $p$ is the density of $P_X$.

6. $m$ and $p$ have two bounded, continuous derivatives.

7. $h \to 0$ and $nh \to \infty$ as $n \to \infty$.

Then
$$E\big(\hat m(x) - m(x)\big)^2 = \underbrace{\frac{h^4}{4}\left(m''(x) + 2\,\frac{m'(x)p'(x)}{p(x)}\right)^2\mu_2^2(K)}_{\text{bias}} \;+\; \underbrace{\frac{1}{nh}\,\frac{\sigma^2(x)}{p(x)}\,\|K\|_2^2}_{\text{variance}} \;+\; r(x) \qquad (15)$$
where $\mu_2(K) = \int t^2K(t)\,dt$, $\sigma^2(x) = \mathrm{Var}(Y\mid X = x)$ and $r(x) = o(\text{bias} + \text{variance})$.

Proof Outline. We can write $\hat m(x) = \hat a(x)/\hat p(x)$ where $\hat a(x) = n^{-1}\sum_i Y_iK_h(x,X_i)$, $\hat p(x) = n^{-1}\sum_i K_h(x,X_i)$ and $K_h(x,y) = h^{-1}K((x-y)/h)$. Note that $\hat p(x)$ is just the kernel density estimator of the true density $p(x)$. Recall (from 36-705) that $E[\hat p(x)] = p(x) + (h^2/2)p''(x)\mu_2(K)$. Then
$$\hat m(x) - m(x) = \frac{\hat a(x)}{\hat p(x)} - m(x) = \left(\frac{\hat a(x)}{\hat p(x)} - m(x)\right)\left(\frac{\hat p(x)}{p(x)} + 1 - \frac{\hat p(x)}{p(x)}\right) = \frac{\hat a(x) - m(x)\hat p(x)}{p(x)} + \frac{(\hat m(x) - m(x))(p(x) - \hat p(x))}{p(x)}.$$

It can be shown that the dominant term is the first term. Now, since $Y_i = m(X_i) + \epsilon_i$,
$$E[\hat a(x)] = \frac{1}{h}E\left[Y_iK\!\left(\frac{X_i - x}{h}\right)\right] = h^{-1}\int m(u)K((x-u)/h)p(u)\,du = \int m(x+th)K(t)p(x+th)\,dt.$$
So
$$E[\hat a(x)] \approx \int\left(m(x) + thm'(x) + \frac{t^2h^2}{2}m''(x)\right)\left(p(x) + thp'(x) + \frac{t^2h^2}{2}p''(x)\right)K(t)\,dt.$$

Recall that $\int tK(t)\,dt = 0$ and $\int t^3K(t)\,dt = 0$. So
$$E[\hat a(x)] \approx p(x)m(x) + \frac{h^2}{2}m(x)p''(x)\mu_2(K) + h^2m'(x)p'(x)\mu_2(K) + \frac{h^2}{2}m''(x)p(x)\mu_2(K)$$
and $m(x)\hat p(x) \approx m(x)p(x) + m(x)p''(x)(h^2/2)\mu_2(K)$, and so the mean of the first term is
$$\frac{h^2\mu_2(K)}{2}\left(\frac{2m'(x)p'(x)}{p(x)} + m''(x)\right).$$
Squaring this and keeping the dominant terms (of order $O(h^2)$) gives the stated bias. The variance term is calculated similarly. $\square$

The bias term has two disturbing properties: it is large if $p'(x)$ is non-zero and it is large if $p(x)$ is small. The dependence of the bias on the density $p(x)$ is called design bias. It can also be shown that the bias is large if $x$ is close to the boundary of the support of $p$. This is called boundary bias. Before we discuss how to fix these problems, let us state the multivariate version of this theorem.

Theorem 4 Let $\hat m$ be the multivariate kernel regression estimator with bandwidth matrix $H$. Thus
$$\hat m(x) = \frac{\sum_i Y_i\,K_H(X_i - x)}{\sum_i K_H(X_i - x)}$$
where $H$ is a symmetric positive definite bandwidth matrix and $K_H(y) = [\det(H)]^{-1}K(H^{-1}y)$. Then
$$E(\hat m(x)) - m(x) \approx \mu_2(K)\,\frac{\nabla m(x)^T HH^T\nabla p(x)}{p(x)} + \frac{1}{2}\,\mu_2(K)\,\mathrm{tr}\big(H^Tm''(x)H\big)$$
and
$$\mathrm{Var}(\hat m(x)) \approx \frac{1}{n\det(H)}\,\|K\|_2^2\,\frac{\sigma^2(x)}{p(x)}.$$

The proof is the same except that we use multivariate Taylor expansions. Now suppose that H = hI. Then we see that the risk (mean squared error) is

$$C_1h^4 + \frac{C_2}{nh^d}.$$

This is minimized by $h \asymp \left(\frac{C_2}{n}\right)^{1/(4+d)}$. The resulting risk is $R_n \asymp \left(\frac{1}{n}\right)^{4/(4+d)}$. The main assumption was bounded, continuous second derivatives. If we repeat the analysis with bounded, continuous derivatives of order $\beta$ (and we use an appropriate kernel), we get
$$R_n \asymp \left(\frac{1}{n}\right)^{2\beta/(2\beta+d)}. \qquad (16)$$

Later we shall see that this is the minimax rate.

Remark: Note that this also gives the rate for the integrated risk $E\big[\int(\hat m(x) - m(x))^2\,dx\big]$.

How do we reduce the design bias and the boundary bias? The answer is to make a small modification to the kernel estimator. We discuss this in the next section.

5 Local Polynomial Estimators

Recall that the kernel estimator can be derived by minimizing the localized squared error

$$\sum_{i=1}^n K\left(\frac{x - X_i}{h}\right)\big(c - Y_i\big)^2. \qquad (17)$$
To reduce the design bias and the boundary bias we simply replace the constant $c$ with a polynomial. In fact, it is enough to use a polynomial of order 1; in other words, we fit a local linear estimator instead of a local constant. The idea is that, for $u$ near $x$, we can write $m(u) \approx \beta_0(x) + \beta_1(x)(u - x)$. We define $\hat\beta(x) = (\hat\beta_0(x),\hat\beta_1(x))$ to minimize
$$\sum_{i=1}^n K\left(\frac{x - X_i}{h}\right)\Big(Y_i - \beta_0(x) - \beta_1(x)(X_i - x)\Big)^2.$$

Then $\hat m(u) \approx \hat\beta_0(x) + \hat\beta_1(x)(u - x)$. In particular, $\hat m(x) = \hat\beta_0(x)$. The minimizer is easily seen to be
$$\hat\beta(x) = (\hat\beta_0(x),\hat\beta_1(x)) = (X^TWX)^{-1}X^TWY$$

where $Y = (Y_1,\ldots,Y_n)^T$,
$$X = \begin{pmatrix}1 & X_1 - x\\ 1 & X_2 - x\\ \vdots & \vdots\\ 1 & X_n - x\end{pmatrix}, \qquad W = \begin{pmatrix}K_h(x - X_1) & 0 & \cdots & 0\\ 0 & K_h(x - X_2) & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & K_h(x - X_n)\end{pmatrix}.$$

Then $\hat m(x) = \hat\beta_0(x)$. For the multivariate version we minimize
$$\sum_{i=1}^n K_H(x - X_i)\Big(Y_i - \beta_0(x) - \beta_1^T(x)(X_i - x)\Big)^2.$$
The minimizer is
$$\hat\beta(x) = (\hat\beta_0(x),\hat\beta_1(x)) = (X^TWX)^{-1}X^TWY$$
where $Y = (Y_1,\ldots,Y_n)^T$,
$$X = \begin{pmatrix}1 & (X_1 - x)^T\\ 1 & (X_2 - x)^T\\ \vdots & \vdots\\ 1 & (X_n - x)^T\end{pmatrix}, \qquad W = \begin{pmatrix}K_H(x - X_1) & 0 & \cdots & 0\\ 0 & K_H(x - X_2) & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & K_H(x - X_n)\end{pmatrix}.$$

Then $\hat m(x) = \hat\beta_0(x)$. Now comes the amazing part.

Theorem 5 The local linear estimator has risk
$$\frac{1}{4}\,\mu_2^2(K)\big(\mathrm{tr}(H^Tm''(x)H)\big)^2 + \frac{1}{n\det(H)}\,\|K\|_2^2\,\frac{\sigma^2(x)}{p(x)}. \qquad (18)$$

The proof uses the same techniques as before but is much more involved. The key thing to note is that the rate is the same but the design bias (the dependence of the bias on $p'(x)$) has disappeared.

It can also be shown that the boundary bias disappears. In fact, the kernel estimator has squared bias of order h2 near the boundary of the support while the local linear estimator has squared bias of order h4 near the boundary of the support. To the best of my knowledge, there is no simple way to get rid of these biases from RKHS estimators and other penalization- based estimators.

Remark: If you want to reduce bias at maxima and minima of the regression function, then use local quadratic regression.
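
Here is a minimal sketch of the one-dimensional local linear fit at a single point, implementing the weighted least squares formula $\hat\beta(x) = (X^TWX)^{-1}X^TWY$ from the previous section; the Gaussian kernel, bandwidth and simulated data are my own illustrative choices.

```python
import numpy as np

def local_linear(x, X, Y, h):
    """Local linear estimate of m(x): weighted least squares fit of a line at x.

    Uses a Gaussian kernel; x is a scalar, X and Y are 1-d arrays."""
    w = np.exp(-0.5 * ((x - X) / h) ** 2)             # kernel weights K_h(x - X_i)
    Xmat = np.column_stack([np.ones_like(X), X - x])  # design matrix rows [1, X_i - x]
    W = np.diag(w)
    # beta_hat = (X^T W X)^{-1} X^T W Y ; the estimate m_hat(x) is beta_0
    beta = np.linalg.solve(Xmat.T @ W @ Xmat, Xmat.T @ W @ Y)
    return beta[0]

# Toy check on simulated data.
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 1, 300))
Y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(300)
print(local_linear(0.25, X, Y, h=0.05))
```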

Example 6 (LIDAR) These are data from a light detection and ranging (LIDAR) experiment. LIDAR is used to monitor pollutants. The response is the log of the ratio of light received from two lasers. The frequency of one laser is the resonance frequency of mercury while the second has a different frequency. Figure 3 shows the 221 observations. The top left plot shows the data and the fitted function using local linear regression. The cross-validation curve (not shown) has a well-defined minimum at $h \approx 37$ corresponding to 9 effective degrees of freedom. The fitted function uses this bandwidth. The top right plot shows the residuals. There is clear heteroscedasticity (nonconstant variance). The bottom left plot shows the estimate of $\sigma(x) = \sqrt{\mathrm{Var}(Y\mid X = x)}$. This estimate was obtained by smoothing the squared residuals $(Y_i - \hat m(X_i))^2$. The bands in the lower right plot are $\hat m(x) \pm 2\hat\sigma(x)$. As expected, there is much greater uncertainty for larger values of the covariate.

Figure 3: The LIDAR data from Example 6. Top left: data and the fitted function using local linear regression with $h \approx 37$ (chosen by cross-validation). Top right: the residuals. Bottom left: estimate of $\sigma(x)$. Bottom right: variability bands.

6 Linear Smoothers

Kernel estimators and local polynomial estimators are examples of linear smoothers.

Definition: An estimator $\hat m$ of $m$ is a linear smoother if, for each $x$, there is a vector $\ell(x) = (\ell_1(x),\ldots,\ell_n(x))^T$ such that
$$\hat m(x) = \sum_{i=1}^n\ell_i(x)\,Y_i = \ell(x)^TY \qquad (19)$$
where $Y = (Y_1,\ldots,Y_n)^T$.

For kernel estimators, $\ell_i(x) = \dfrac{K(\|x - X_i\|/h)}{\sum_{j=1}^nK(\|x - X_j\|/h)}$. For local linear estimators, we can deduce the weights from the expression for $\hat\beta(x)$. Here is an interesting fact: the following estimators are linear smoothers: Gaussian process regression, splines, RKHS estimators.

Example 7 You should not confuse linear smoothers with linear regression. In linear regression we assume that $m(x) = x^T\beta$. In fact, linear regression is a special case of linear smoothing. If $\hat\beta$ denotes the least squares estimator then $\hat m(x) = x^T\hat\beta = x^T(X^TX)^{-1}X^TY = \ell(x)^TY$ where $\ell(x)^T = x^T(X^TX)^{-1}X^T$.

Define the vector of fitted values $\hat Y = (\hat m(X_1),\ldots,\hat m(X_n))^T$. It follows that $\hat Y = LY$ where
$$L = \begin{pmatrix}\ell(X_1)^T\\ \ell(X_2)^T\\ \vdots\\ \ell(X_n)^T\end{pmatrix} = \begin{pmatrix}\ell_1(X_1) & \ell_2(X_1) & \cdots & \ell_n(X_1)\\ \ell_1(X_2) & \ell_2(X_2) & \cdots & \ell_n(X_2)\\ \vdots & \vdots & & \vdots\\ \ell_1(X_n) & \ell_2(X_n) & \cdots & \ell_n(X_n)\end{pmatrix}. \qquad (20)$$

The matrix $L$ defined in (20) is called the smoothing matrix. The $i$-th row of $L$ is called the effective kernel for estimating $m(X_i)$. We define the effective degrees of freedom by
$$\nu = \mathrm{tr}(L). \qquad (21)$$

The effective degrees of freedom behave very much like the number of parameters in a linear regression model.

Remark. The weights in all the smoothers we will use have the property that, for all $x$, $\sum_{i=1}^n\ell_i(x) = 1$. This implies that the smoother preserves constants.
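
The smoothing matrix and the effective degrees of freedom are easy to compute for a kernel smoother. The following sketch (Gaussian kernel and simulated design points, both my own choices) builds $L$, checks that its rows sum to 1, and computes $\nu = \mathrm{tr}(L)$:

```python
import numpy as np

def smoothing_matrix(X, h):
    """Smoothing matrix L for a Nadaraya-Watson smoother with a Gaussian kernel.

    Row i contains the weights ell_1(X_i), ..., ell_n(X_i)."""
    D = (X[:, None] - X[None, :]) / h        # pairwise scaled differences
    K = np.exp(-0.5 * D ** 2)                # kernel weights
    L = K / K.sum(axis=1, keepdims=True)     # normalize each row so it sums to 1
    return L

rng = np.random.default_rng(2)
X = np.sort(rng.uniform(0, 1, 100))
L = smoothing_matrix(X, h=0.1)
print(np.allclose(L.sum(axis=1), 1.0))       # the smoother preserves constants
print("effective degrees of freedom:", np.trace(L))   # nu = tr(L)
```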

7 Penalized Regression

Another smoothing method is penalized regression (or regularized regression) where $\hat m$ is defined to be the minimizer of
$$\sum_{i=1}^n(Y_i - \hat m(X_i))^2 + \lambda J(\hat m) \qquad (22)$$
where $\lambda \ge 0$ and $J(\hat m)$ is a penalty (or regularization) term. A popular choice of $J$ is
$$J(g) = \int(g''(x))^2\,dx.$$

To find the minimizer of (22) we need to use cubic splines. Let $(a,b)$ be an interval and let $x_1,\ldots,x_k$ be $k$ points such that $a < x_1 < \cdots < x_k < b$. A continuous function $f$ on $(a,b)$ is a cubic spline with knots $\{x_1,\ldots,x_k\}$ if $f$ is a cubic polynomial over each of the intervals $(x_1,x_2), (x_2,x_3),\ldots$ and $f$ has continuous first and second derivatives at the knots.

Theorem 8 Let $\hat m$ be the minimizer of (22) where $J(g) = \int(g''(x))^2\,dx$. Then $\hat m$ is a cubic spline with knots at the points $X_1,\ldots,X_n$.

According to this result, the minimizer $\hat m$ of (22) is contained in $\mathcal{M}_n$, the set of all cubic splines with knots at $\{X_1,\ldots,X_n\}$. However, we still have to find which function in $\mathcal{M}_n$ is the minimizer.

Define $B_1(x) = 1$, $B_2(x) = x$, $B_3(x) = x^2$, $B_4(x) = x^3$ and
$$B_j(x) = (x - X_{j-4})_+^3, \qquad j = 5,\ldots,n+4.$$

It can be shown that $B_1,\ldots,B_{n+4}$ form a basis for $\mathcal{M}_n$. (In practice, another basis for $\mathcal{M}_n$ called the B-spline basis is used since it has better numerical properties.) Thus, every $g\in\mathcal{M}_n$ can be written as $g(x) = \sum_{j=1}^N\beta_jB_j(x)$ for some coefficients $\beta_1,\ldots,\beta_N$, where $N = n+4$. If we substitute $\hat m(x) = \sum_{j=1}^N\beta_jB_j(x)$ into (22), the minimization problem becomes: find $\beta = (\beta_1,\ldots,\beta_N)$ to minimize
$$(Y - B\beta)^T(Y - B\beta) + \lambda\beta^T\Omega\beta \qquad (23)$$
where $Y = (Y_1,\ldots,Y_n)$, $B_{ij} = B_j(X_i)$ and $\Omega_{jk} = \int B_j''(x)B_k''(x)\,dx$. The solution is
$$\hat\beta = (B^TB + \lambda\Omega)^{-1}B^TY$$
and hence
$$\hat m(x) = \sum_j\hat\beta_jB_j(x) = \ell(x)^TY$$
where $\ell(x)^T = b(x)^T(B^TB + \lambda\Omega)^{-1}B^T$ and $b(x) = (B_1(x),\ldots,B_N(x))^T$. Hence, the spline smoother is another example of a linear smoother.

The parameter $\lambda$ is a smoothing parameter. As $\lambda\to 0$, $\hat m$ tends to the interpolating function satisfying $\hat m(X_i) = Y_i$. As $\lambda\to\infty$, $\hat m$ tends to the least squares linear fit. For RKHS regression, we minimize
$$\sum_{i=1}^n(Y_i - \hat m(X_i))^2 + \lambda\|m\|_K^2 \qquad (24)$$
where $\|m\|_K$ is the RKHS norm for some Mercer kernel $K$. In the homework, you proved that the estimator is $\hat m(x) = \sum_j\hat\alpha_jK(x,X_j)$ where
$$\hat\alpha = (\mathbb{K} + \lambda I)^{-1}Y$$
and $\mathbb{K}_{jk} = K(X_j,X_k)$. Note that RKHS regression is just another linear smoother; a small computational sketch appears after the list below. In my opinion, RKHS regression has several disadvantages:

1. There are two parameters to pick: the bandwidth in the kernel and $\lambda$.

2. It is very hard to derive the bias and variance of the estimator.

3. It is not obvious how to study and fix the design bias and boundary bias.

4. The kernel and local linear estimators are local averages. This makes them easy to understand. The RKHS estimator is the solution to a constrained minimization. This makes it much less intuitive.
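
As a sketch of the RKHS computation described above, the following code forms the Gram matrix and solves $\hat\alpha = (\mathbb{K} + \lambda I)^{-1}Y$; the Gaussian Mercer kernel, its scale parameter `gamma` and the toy data are my own choices, not part of the notes.

```python
import numpy as np

def rkhs_fit(X, Y, lam, gamma=10.0):
    """Solve alpha_hat = (K + lam * I)^{-1} Y for a Gaussian Mercer kernel."""
    D2 = (X[:, None] - X[None, :]) ** 2
    K = np.exp(-gamma * D2)                          # Gram matrix K_{jk} = K(X_j, X_k)
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), Y)
    return alpha

def rkhs_predict(x, X, alpha, gamma=10.0):
    """m_hat(x) = sum_j alpha_j K(x, X_j)."""
    return float(np.sum(alpha * np.exp(-gamma * (x - X) ** 2)))

rng = np.random.default_rng(3)
X = np.sort(rng.uniform(0, 1, 150))
Y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(150)
alpha = rkhs_fit(X, Y, lam=0.1)
print(rkhs_predict(0.5, X, alpha))
```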

8 Basis Functions and Dictionaries

Suppose that
$$m\in L_2(a,b) = \left\{g: [a,b]\to\mathbb{R}:\ \int_a^bg^2(x)\,dx < \infty\right\}.$$

Let $\phi_1,\phi_2,\ldots$ be an orthonormal basis for $L_2(a,b)$. This means that $\int\phi_j^2(x)\,dx = 1$, $\int\phi_j(x)\phi_k(x)\,dx = 0$ for $j\ne k$, and the only function $b(x)$ such that $\int b(x)\phi_j(x)\,dx = 0$ for all $j$ is $b(x) = 0$. It follows that any $m\in L_2(a,b)$ can be written as

$$m(x) = \sum_{j=1}^\infty\beta_j\phi_j(x)$$
where $\beta_j = \int m(x)\phi_j(x)\,dx$. For $[a,b] = [0,1]$, an example is the cosine basis
$$\phi_0(x) = 1, \qquad \phi_j(x) = \sqrt{2}\cos(\pi jx),\quad j = 1,2,\ldots$$

To use a basis for nonparametric regression, we regress $Y$ on the first $J$ basis functions and we treat $J$ as a smoothing parameter. In other words we take $\hat m(x) = \sum_{j=1}^J\hat\beta_j\phi_j(x)$ where $\hat\beta = (B^TB)^{-1}B^TY$ and $B_{ij} = \phi_j(X_i)$. It follows that $\hat m(x)$ is a linear smoother. See Chapters 7 and 8 of Wasserman (2006) for theoretical properties of orthogonal function smoothers.
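
A minimal sketch of regression on the first $J$ cosine basis functions follows; the simulated data are my own example.

```python
import numpy as np

def basis_matrix(x, J):
    """Design matrix B with columns phi_0, ..., phi_{J-1} of the cosine basis on [0, 1]."""
    cols = [np.ones_like(x)]                                   # phi_0(x) = 1
    cols += [np.sqrt(2) * np.cos(np.pi * j * x) for j in range(1, J)]
    return np.column_stack(cols)

def basis_regression(x_grid, X, Y, J):
    """Least squares fit on the first J cosine basis functions; J is the smoothing parameter."""
    B = basis_matrix(X, J)
    beta, *_ = np.linalg.lstsq(B, Y, rcond=None)               # beta_hat = (B^T B)^{-1} B^T Y
    return basis_matrix(x_grid, J) @ beta

rng = np.random.default_rng(4)
X = np.sort(rng.uniform(0, 1, 300))
Y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(300)
print(basis_regression(np.linspace(0, 1, 5), X, Y, J=10))
```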

It is not necessary to use orthogonal functions for smoothing. Let $\mathcal{D} = \{\psi_1,\ldots,\psi_N\}$ be any collection of functions, called a dictionary. The collection $\mathcal{D}$ could be very large. For example, $\mathcal{D}$ might be the union of several different bases. The smoothing problem is to decide which functions in $\mathcal{D}$ to use for approximating $m$. One way to approach this problem is to use the lasso: regress $Y$ on $\mathcal{D}$ using an $\ell_1$ penalty.

9 Choosing the Smoothing Parameter

The estimators depend on the bandwidth $h$. Let $R(h)$ denote the risk of $\hat m_h$ when bandwidth $h$ is used. We will estimate $R(h)$ and then choose $h$ to minimize this estimate. As with linear regression, the training error

$$\tilde R(h) = \frac{1}{n}\sum_{i=1}^n(Y_i - \hat m_h(X_i))^2 \qquad (25)$$
is biased downwards. We will estimate the risk using cross-validation.

9.1 Leave-One-Out Cross-Validation

The leave-one-out cross-validation score is defined by

$$\hat R(h) = \frac{1}{n}\sum_{i=1}^n\big(Y_i - \hat m_{(-i)}(X_i)\big)^2 \qquad (26)$$
where $\hat m_{(-i)}$ is the estimator obtained by omitting the $i$-th pair $(X_i,Y_i)$, that is, $\hat m_{(-i)}(x) = \sum_{j=1}^n Y_j\,\ell_{j,(-i)}(x)$ and
$$\ell_{j,(-i)}(x) = \begin{cases} 0 & \text{if } j = i,\\[4pt] \dfrac{\ell_j(x)}{\sum_{k\ne i}\ell_k(x)} & \text{if } j \ne i. \end{cases} \qquad (27)$$

Theorem 9 Let $\hat m$ be a linear smoother. Then the leave-one-out cross-validation score $\hat R(h)$ can be written as
$$\hat R(h) = \frac{1}{n}\sum_{i=1}^n\left(\frac{Y_i - \hat m_h(X_i)}{1 - L_{ii}}\right)^2 \qquad (28)$$
where $L_{ii} = \ell_i(X_i)$ is the $i$-th diagonal element of the smoothing matrix $L$.

The smoothing parameter $h$ can then be chosen by minimizing $\hat R(h)$. An alternative is generalized cross-validation in which each $L_{ii}$ in equation (28) is replaced with its average $n^{-1}\sum_{i=1}^nL_{ii} = \nu/n$ where $\nu = \mathrm{tr}(L)$ is the effective degrees of freedom. (Note that $\nu$ depends on $h$.) Thus, we minimize
$$\mathrm{GCV}(h) = \frac{\tilde R(h)}{(1 - \nu/n)^2}. \qquad (29)$$
Usually, GCV and cross-validation are very similar. Using the approximation $(1-x)^{-2}\approx 1 + 2x$ we see that
$$\mathrm{GCV}(h)\approx\frac{1}{n}\sum_{i=1}^n(Y_i - \hat m(X_i))^2 + \frac{2\nu\hat\sigma^2}{n}\equiv C_p \qquad (30)$$
where $\hat\sigma^2 = n^{-1}\sum_{i=1}^n(Y_i - \hat m(X_i))^2$. Equation (30) is the nonparametric version of the $C_p$ that we saw in linear regression.
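
Here is a small sketch that computes the leave-one-out score via the shortcut (28) and the GCV score (29) for a Nadaraya–Watson smoother; the Gaussian kernel, the candidate bandwidths and the toy data are my own choices.

```python
import numpy as np

def loocv_and_gcv(X, Y, h):
    """Leave-one-out CV via the shortcut (28) and GCV (29) for a kernel smoother."""
    D = (X[:, None] - X[None, :]) / h
    K = np.exp(-0.5 * D ** 2)
    L = K / K.sum(axis=1, keepdims=True)                  # smoothing matrix
    resid = Y - L @ Y
    Lii = np.diag(L)
    loocv = np.mean((resid / (1.0 - Lii)) ** 2)           # equation (28)
    nu = np.trace(L)                                      # effective degrees of freedom
    gcv = np.mean(resid ** 2) / (1.0 - nu / len(X)) ** 2  # equation (29)
    return loocv, gcv

rng = np.random.default_rng(5)
X = np.sort(rng.uniform(0, 1, 200))
Y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(200)
for h in [0.01, 0.05, 0.2]:
    print(h, loocv_and_gcv(X, Y, h))
```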

Example 10 (Doppler function) Let

$$m(x) = \sqrt{x(1-x)}\,\sin\!\left(\frac{2.1\pi}{x + .05}\right), \qquad 0\le x\le 1, \qquad (31)$$
which is called the Doppler function. This function is difficult to estimate and provides a good test case for nonparametric regression methods. The function is spatially inhomogeneous, which means that its smoothness (second derivative) varies over $x$. The function is plotted


Figure 4: The Doppler function estimated by local linear regression. The function (top left), the data (top right), the cross-validation score versus effective degrees of freedom (bottom left), and the fitted function (bottom right).

in the top left plot of Figure 4. The top right plot shows 1000 data points simulated from $Y_i = m(i/n) + \sigma\epsilon_i$ with $\sigma = .1$ and $\epsilon_i\sim N(0,1)$. The bottom left plot shows the cross-validation score versus the effective degrees of freedom using local linear regression. The minimum occurred at 166 degrees of freedom corresponding to a bandwidth of .005. The fitted function is shown in the bottom right plot. The fit has high effective degrees of freedom and hence the fitted function is very wiggly. This is because the estimate is trying to fit the rapid fluctuations of the function near $x = 0$. If we used more smoothing, the right-hand side of the fit would look better at the cost of missing the structure near $x = 0$. This is always a problem when estimating spatially inhomogeneous functions.

9.2 Data Splitting

As with density estimation, stronger guarantees can be made using a data splitting version of cross-validation. Suppose the data are $(X_1,Y_1),\ldots,(X_{2n},Y_{2n})$. Now randomly split the data into two halves that we denote by
$$\mathcal{D} = \Big\{(\tilde X_1,\tilde Y_1),\ldots,(\tilde X_n,\tilde Y_n)\Big\}$$
and
$$\mathcal{E} = \Big\{(X_1^*,Y_1^*),\ldots,(X_n^*,Y_n^*)\Big\}.$$

Construct regression estimators $\mathcal{M} = \{m_1,\ldots,m_N\}$ from $\mathcal{D}$. Define the risk estimator
$$\hat R(m_j) = \frac{1}{n}\sum_{i=1}^n|Y_i^* - m_j(X_i^*)|^2.$$
Finally, let $\hat m = \mathrm{argmin}_{m\in\mathcal{M}}\,\hat R(m)$.

Theorem 11 Let $m_*\in\mathcal{M}$ minimize $\|m_j - m\|_P^2$. There exists $C > 0$ such that
$$E\big(\|\hat m - m\|_P^2\big) \le 2\,E\big(\|m_* - m\|_P^2\big) + \frac{C\log N}{n}.$$

10 Minimax Properties

Consider the nonparametric regression model

$$Y_i = m(X_i) + \epsilon_i$$
where $X_1,\ldots,X_n\sim P$ and $E(\epsilon_i) = 0$.

Theorem 12 Let $\mathcal{P}$ denote all distributions for $X$. Then
$$\inf_{\hat m}\ \sup_{m\in\Sigma(\beta,L),\,P\in\mathcal{P}}E\|\hat m - m\|_P^2\ \ge\ \frac{C}{n^{2\beta/(2\beta+d)}}. \qquad (32)$$

This theorem will be proved when we cover minimax theory. It follows that kernel regression is minimax for β = 1. If we were to impose stronger assumptions, then the rate of convergence for kernel regression would be n−4/(4+d) and hence would be minimax for β = 2.

11 Additive Models

Interpreting and visualizing a high-dimensional fit is difficult. As the number of covariates increases, the computational burden becomes prohibitive. A practical approach is to use an additive model. An additive model is a model of the form

$$Y = \alpha + \sum_{j=1}^dm_j(x_j) + \epsilon \qquad (33)$$

The Backfitting Algorithm

Initialization: set $\hat\alpha = \bar Y$ and set initial guesses for $\hat m_1,\ldots,\hat m_d$. Now iterate the following steps until convergence. For $j = 1,\ldots,d$ do:

• Compute $\tilde Y_i = Y_i - \hat\alpha - \sum_{k\ne j}\hat m_k(X_{ik})$, $i = 1,\ldots,n$.

• Apply a smoother to $\tilde Y$ on $X_j$ to obtain $\hat m_j$.

• Set $\hat m_j(x)\leftarrow\hat m_j(x) - n^{-1}\sum_{i=1}^n\hat m_j(X_{ij})$.

• end do.

Figure 5: Backfitting.

where $m_1,\ldots,m_d$ are smooth functions. The model (33) is not identifiable since we can add any constant to $\alpha$ and subtract the same constant from one of the $m_j$'s without changing the regression function. This problem can be fixed in a number of ways; the simplest is to set $\hat\alpha = \bar Y$ and then regard the $m_j$'s as deviations from $\bar Y$. In this case we require that $\sum_{i=1}^n\hat m_j(X_{ij}) = 0$ for each $j$.

There is a simple algorithm called backfitting for turning any one-dimensional regression smoother into a method for fitting additive models. This is essentially a coordinate descent, Gauss-Seidel algorithm. See Figure 5.
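
A minimal sketch of the backfitting algorithm in Figure 5, using a Nadaraya–Watson smoother for each coordinate; the kernel, bandwidth, number of iterations and simulated data are my own choices.

```python
import numpy as np

def kernel_smooth(x_eval, x, y, h):
    """Nadaraya-Watson smooth of y on x, evaluated at the points x_eval."""
    K = np.exp(-0.5 * ((x_eval[:, None] - x[None, :]) / h) ** 2)
    return (K * y).sum(axis=1) / K.sum(axis=1)

def backfit(X, Y, h=0.1, n_iter=20):
    """Backfitting for an additive model; returns alpha_hat and the fitted values m_j(X_ij)."""
    n, d = X.shape
    alpha = Y.mean()                               # alpha_hat = Ybar
    m = np.zeros((n, d))                           # m[i, j] stores m_j(X_ij)
    for _ in range(n_iter):
        for j in range(d):
            resid = Y - alpha - m.sum(axis=1) + m[:, j]        # partial residuals Y_tilde
            m[:, j] = kernel_smooth(X[:, j], X[:, j], resid, h)
            m[:, j] -= m[:, j].mean()              # center so sum_i m_j(X_ij) = 0
    return alpha, m

rng = np.random.default_rng(6)
X = rng.uniform(-1, 1, size=(300, 3))
Y = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(300)
alpha, m = backfit(X, Y)
print(alpha)
```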

Example 13 This example involves three covariates and one response variable. The data are 48 rock samples from a petroleum reservoir, the response is permeability (in milli-Darcies) and the covariates are: the area of pores (in pixels out of 256 by 256), perimeter in pixels and shape (perimeter/$\sqrt{\text{area}}$). The goal is to predict permeability from the three covariates. We fit the additive model

$$\text{permeability} = m_1(\text{area}) + m_2(\text{perimeter}) + m_3(\text{shape}) + \epsilon.$$

We could scale each covariate to have the same variance and then use a common bandwidth for each covariate. Instead, we perform cross-validation to choose a bandwidth $h_j$ for covariate $x_j$ during each iteration of backfitting. The bandwidths and the function estimates converged rapidly. The estimates of $m_1$, $m_2$ and $m_3$ are shown in Figure 6. $\bar Y$ was added to each function before plotting it.

Figure 6: The rock data. The plots show $\hat m_1$, $\hat m_2$, and $\hat m_3$ for the additive model $Y = \hat m_1(X_1) + \hat m_2(X_2) + \hat m_3(X_3) + \epsilon$. The panels plot log permeability against area, perimeter and shape.

12 SpAM

Ravikumar, Lafferty, Liu and Wasserman (2007) introduced a sparse version of additive models called SpAM (Sparse Additive Models). This is a functional version of the grouped lasso (Yuan and Lin 2006) and is closely related to the COSSO (Lin and Zhang 2006).

We form an additive model
$$Y_i = \alpha + \sum_{j=1}^d\beta_jg_j(X_{ij}) + \epsilon_i \qquad (34)$$
with the identifiability conditions $\int g_j(x_j)\,dP(x_j) = 0$ and $\int g_j^2(x_j)\,dP(x_j) = 1$. Further, we impose the sparsity condition $\sum_{j=1}^d|\beta_j|\le L_n$ and the smoothness condition $g_j\in\mathcal{S}_j$ where $\mathcal{S}_j$ is some class of smooth functions. While this optimization problem makes plain the role of $\ell_1$ regularization of $\beta$ to achieve sparsity, it is convenient to re-express the model as
$$\min_{m_j\in\mathcal{H}_j}\ E\Big(Y - \sum_{j=1}^dm_j(X_j)\Big)^2 \quad\text{subject to}\quad \sum_{j=1}^d\sqrt{E(m_j^2(X_j))}\le L,\qquad E(m_j) = 0,\ j = 1,\ldots,d.$$

The Lagrangian for the optimization problem is

$$\mathcal{L}(f,\lambda,\mu) = \frac{1}{2}E\Big(Y - \sum_{j=1}^dm_j(X_j)\Big)^2 + \lambda\sum_{j=1}^d\sqrt{E(m_j^2(X_j))} + \sum_j\mu_jE(m_j). \qquad (35)$$

Theorem 14 The minimizers $m_1,\ldots,m_d$ of (35) satisfy
$$m_j = \left[1 - \frac{\lambda}{\sqrt{E(P_j^2)}}\right]_+P_j \quad\text{a.e.} \qquad (36)$$
where $[\cdot]_+$ denotes the positive part, and $P_j = E[R_j\mid X_j]$ denotes the projection of the residual $R_j = Y - \sum_{k\ne j}m_k(X_k)$ onto $\mathcal{H}_j$.

To solve this problem, we insert sample estimates into the population algorithm, as in standard backfitting. We estimate the projection $P_j$ by smoothing the residuals:

$$\hat P_j = S_jR_j \qquad (37)$$
where $S_j$ is a linear smoother, such as a local linear or kernel smoother. Let
$$\hat s_j = \frac{1}{\sqrt{n}}\|\hat P_j\|_2 = \sqrt{\mathrm{mean}(\hat P_j^2)} \qquad (38)$$

be the estimate of $\sqrt{E(P_j^2)}$. We have thus derived the SpAM backfitting algorithm given in Figure 7.

We choose $\lambda$ by minimizing an estimate of the risk. Let $\nu_j$ be the effective degrees of freedom for the smoother on the $j$-th variable, that is, $\nu_j = \mathrm{trace}(S_j)$ where $S_j$ is the smoothing matrix for the $j$-th dimension. Also let $\hat\sigma^2$ be an estimate of the variance. Define the total effective degrees of freedom
$$\mathrm{df}(\lambda) = \sum_j\nu_j\,I\big(\|\hat m_j\|\ne 0\big). \qquad (39)$$
Two estimates of risk are

$$C_p = \frac{1}{n}\sum_{i=1}^n\Big(Y_i - \sum_{j=1}^d\hat m_j(X_{ij})\Big)^2 + \frac{2\hat\sigma^2}{n}\,\mathrm{df}(\lambda) \qquad (40)$$
and
$$\mathrm{GCV}(\lambda) = \frac{\frac{1}{n}\sum_{i=1}^n\big(Y_i - \sum_j\hat m_j(X_{ij})\big)^2}{(1 - \mathrm{df}(\lambda)/n)^2}. \qquad (41)$$

SpAM Backfitting Algorithm

Input: Data (Xi,Yi), regularization parameter λ.

Initialize $\hat m_j = 0$, for $j = 1,\ldots,p$. Iterate until convergence:

For each $j = 1,\ldots,p$:

(1) Compute the residual: $R_j = Y - \sum_{k\ne j}\hat m_k(X_k)$;

(2) Estimate $P_j = E[R_j\mid X_j]$ by smoothing: $\hat P_j = S_jR_j$;

(3) Estimate the norm: $\hat s_j^2 = \frac{1}{n}\sum_{i=1}^n\hat P_j^2(i)$;

(4) Soft-threshold: $\hat m_j = [1 - \lambda/\hat s_j]_+\,\hat P_j$;

(5) Center: $\hat m_j\leftarrow\hat m_j - \mathrm{mean}(\hat m_j)$.

Output: Component functions $\hat m_j$ and estimator $\hat m(X_i) = \sum_j\hat m_j(X_{ij})$.

Figure 7: The SpAM backfitting algorithm. The first two steps in the iterative algorithm are the usual backfitting procedure; the remaining steps carry out functional soft thresholding.
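
Here is a minimal sketch of one run of this algorithm, using a Nadaraya–Watson smoother for $S_j$; the kernel, bandwidth, stopping rule and simulated data are my own choices, not part of the notes.

```python
import numpy as np

def nw_smooth(x, y, h=0.1):
    """Nadaraya-Watson smooth of y on x, evaluated at the points x (plays the role of S_j)."""
    K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return (K * y).sum(axis=1) / K.sum(axis=1)

def spam_backfit(X, Y, lam, n_iter=20, h=0.1):
    """SpAM backfitting: smooth the partial residuals, then soft-threshold each component."""
    n, p = X.shape
    m = np.zeros((n, p))
    for _ in range(n_iter):
        for j in range(p):
            R = Y - Y.mean() - m.sum(axis=1) + m[:, j]   # residual R_j
            P = nw_smooth(X[:, j], R, h)                 # P_hat_j = S_j R_j
            s = np.sqrt(np.mean(P ** 2))                 # s_hat_j
            m[:, j] = (max(0.0, 1.0 - lam / s) * P) if s > 0 else 0.0   # soft threshold
            m[:, j] -= m[:, j].mean()                    # center
    return m

rng = np.random.default_rng(7)
X = rng.uniform(-1, 1, size=(150, 10))
Y = -2 * np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + rng.standard_normal(150)
m = spam_backfit(X, Y, lam=0.3)
print([round(float(np.sqrt(np.mean(m[:, j] ** 2))), 3) for j in range(10)])  # component norms
```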

The first is Cp and the second is generalized cross validation but with degrees of freedom defined by df(λ). A proof that these are valid estimates of risk is not currently available. Thus, these should be regarded as heuristics.

Synthetic Data. We generated n = 150 observations from the following 200-dimensional additive model:

$$Y_i = m_1(x_{i1}) + m_2(x_{i2}) + m_3(x_{i3}) + m_4(x_{i4}) + \epsilon_i \qquad (42)$$
$$m_1(x) = -2\sin(2x),\quad m_2(x) = x^2 - \tfrac13,\quad m_3(x) = x - \tfrac12,\quad m_4(x) = e^{-x} + e^{-1} - 1$$
and $m_j(x) = 0$ for $j\ge 5$, with noise $\epsilon_i\sim N(0,1)$. These data therefore have 196 irrelevant dimensions.

The results of applying SpAM with the plug-in bandwidths are summarized in Figure 8. The top-left plot in Figure 8 shows regularization paths as a function of the parameter $\lambda$; each curve is a plot of $\|\hat m_j\|_2$ versus $\lambda$ for a particular variable $X_j$. The estimates are generated efficiently over a sequence of $\lambda$ values by “warm starting” $\hat m_j(\lambda_t)$ at the previous value $\hat m_j(\lambda_{t-1})$. The top-center plot shows the $C_p$ statistic as a function of $\lambda$. The top-right plot compares the empirical probability of correctly selecting the true four variables as a function of sample size $n$, for $p = 128$ and $p = 256$. This behavior suggests the same threshold phenomenon that was proven for the lasso.

Figure 8: (Simulated data) Upper left: the empirical $\ell_2$ norm of the estimated components $\|\hat m_j\|_2$ plotted against the regularization parameter $\lambda$. Upper center: the $C_p$ score against $\lambda$; the dashed vertical line corresponds to the value of $\lambda$ with the smallest $C_p$ score. Upper right: the proportion of 200 trials in which the correct relevant variables are selected, as a function of sample size $n$, for $p = 128$ and $p = 256$. Lower (from left to right): estimated (solid lines) versus true (dashed lines) additive component functions for the first 6 dimensions; the remaining components are zero.

Boston Housing. The Boston housing data were collected to study house values in the suburbs of Boston. There are 506 observations with 10 covariates. The dataset has been studied by many authors with various transformations proposed for different covariates. To explore the sparsistency properties of our method, we added 20 irrelevant variables. Ten of them are randomly drawn from Uniform(0, 1), the remaining ten are a random permutation of the original ten covariates. The model is

$$Y = \alpha + m_1(\mathrm{crim}) + m_2(\mathrm{indus}) + m_3(\mathrm{nox}) + m_4(\mathrm{rm}) + m_5(\mathrm{age}) + m_6(\mathrm{dis}) + m_7(\mathrm{tax}) + m_8(\mathrm{ptratio}) + m_9(\mathrm{b}) + m_{10}(\mathrm{lstat}) + \epsilon. \qquad (43)$$

The result of applying SpAM to this 30-dimensional dataset is shown in Figure 9. SpAM identifies 6 nonzero components. It correctly zeros out both types of irrelevant variables. From the full solution path, the important variables are seen to be rm, lstat, ptratio, and crim. The importance of the variables nox and b is borderline. These results are basically consistent with those obtained by other authors. However, using $C_p$ as the selection criterion, the variables indus, age, dis, and tax are estimated to be irrelevant, a result not seen in other studies.

13 Partitions and Trees

Simple and interpretable estimators can be derived by partitioning the range of X. Let Πn = {A1,...,AN } be a partition of X and define

$$\hat m(x) = \sum_{j=1}^N\bar Y_j\,I(x\in A_j)$$
where $\bar Y_j = n_j^{-1}\sum_{i=1}^nY_iI(X_i\in A_j)$ is the average of the $Y_i$'s in $A_j$ and $n_j = \#\{X_i\in A_j\}$. (We define $\bar Y_j$ to be 0 if $n_j = 0$.) The simplest partition is based on cubes. Suppose that $\mathcal{X} = [0,1]^d$. Then we can partition $\mathcal{X}$ into $N = k^d$ cubes with sides of length $h = 1/k$. Thus, $N = (1/h)^d$. The smoothing parameter is $h$.
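A minimal sketch of the partition (regressogram) estimator on $[0,1]^d$ with cubes of side length $h$; the simulated data are my own example.

```python
import numpy as np

def partition_estimate(x, X, Y, h):
    """Partition (regressogram) estimate on [0, 1]^d with cubes of side length h."""
    bins_x = np.floor(np.asarray(x, dtype=float) / h).astype(int)   # cube containing x
    bins_X = np.floor(X / h).astype(int)                            # cube for each X_i
    in_cube = np.all(bins_X == bins_x, axis=1)
    return float(Y[in_cube].mean()) if in_cube.any() else 0.0       # Ybar_j, or 0 if empty

rng = np.random.default_rng(8)
X = rng.uniform(0, 1, size=(500, 2))
Y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] + 0.1 * rng.standard_normal(500)
print(partition_estimate([0.3, 0.7], X, Y, h=0.2))
```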

Figure 9: (Boston housing) Left: the empirical norm of the estimated components plotted against the regularization parameter $\lambda$; the dashed vertical line corresponds to the value of $\lambda$ with the best $C_p$ score. Center: the $C_p$ score against $\lambda$. Right: additive fits for four relevant variables.

Theorem 15 Let $\hat m(x)$ be the partition estimator. Suppose that
$$m\in\mathcal{M} = \Big\{m:\ |m(x) - m(z)|\le L\|x - z\|\ \text{ for all } x,z\in\mathbb{R}^d\Big\} \qquad (44)$$
and that $\mathrm{Var}(Y\mid X = x)\le\sigma^2<\infty$ for all $x$. Then
$$E\|\hat m - m\|_P^2\ \le\ c_1h^2 + \frac{c_2}{nh^d}. \qquad (45)$$
Hence, if $h\asymp n^{-1/(d+2)}$ then
$$E\|\hat m - m\|_P^2\ \le\ \frac{c}{n^{2/(d+2)}}. \qquad (46)$$

The proof is virtually identical to the proof of Theorem 17.

A regression tree is a partition estimator of the form

$$\hat m(x) = \sum_{m=1}^Mc_m\,I(x\in R_m) \qquad (47)$$

where c1, . . . , cM are constants and R1,...,RM are disjoint rectangles that partition the space of covariates and whose sides are parallel to the coordinate axes. The model is fit in a greedy, recursive manner that can be represented as a tree; hence the name.

Denote a generic covariate value by $x = (x_1,\ldots,x_j,\ldots,x_d)$. The covariate for the $i$-th observation is $X_i = (X_{i1},\ldots,X_{ij},\ldots,X_{id})$. Given a covariate $j$ and a split point $s$ we define the rectangles $R_1 = R_1(j,s) = \{x: x_j\le s\}$ and $R_2 = R_2(j,s) = \{x: x_j > s\}$ where, in this expression, $x_j$ refers to the $j$-th covariate, not the $j$-th observation. Then we take $c_1$ to be the average of all the $Y_i$'s such that $X_i\in R_1$ and $c_2$ to be the average of all the $Y_i$'s such that $X_i\in R_2$. Notice that $c_1$ and $c_2$ minimize the sums of squares $\sum_{X_i\in R_1}(Y_i - c_1)^2$ and $\sum_{X_i\in R_2}(Y_i - c_2)^2$. The choice of which covariate $x_j$ to split on and which split point $s$ to use is based on minimizing the residual sum of squares. The splitting process is then repeated on each rectangle $R_1$ and $R_2$. Figure 10 shows a simple example of a regression tree; also shown are the corresponding rectangles. The function estimate $\hat m$ is constant over the rectangles. Generally one first grows a very large tree, then the tree is pruned to form a subtree by collapsing regions together. The size of the tree is a tuning parameter and is usually chosen by cross-validation.
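
The greedy split search described above can be sketched as follows (exhaustive search over observed split points; the simulated data are my own example):

```python
import numpy as np

def best_split(X, Y):
    """Greedy search for the covariate j and split point s minimizing the residual sum of squares."""
    n, d = X.shape
    best = (None, None, np.inf)            # (covariate index, split point, RSS)
    for j in range(d):
        for s in np.unique(X[:, j])[:-1]:  # candidate split points (drop the max)
            left = X[:, j] <= s
            right = ~left
            rss = ((Y[left] - Y[left].mean()) ** 2).sum() + \
                  ((Y[right] - Y[right].mean()) ** 2).sum()
            if rss < best[2]:
                best = (j, s, rss)
    return best

rng = np.random.default_rng(9)
X = rng.uniform(0, 1, size=(200, 2))
Y = np.where(X[:, 0] < 0.5, 1.0, 3.0) + 0.1 * rng.standard_normal(200)
print(best_split(X, Y))                    # should split on covariate 0 near 0.5
```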

Example 16 Figure 11 shows a tree for the rock data. Notice that the variable shape does not appear in the tree. This means that the shape variable was never the optimal covariate to split on in the algorithm. The result is that the tree only depends on area and peri. This illustrates an important feature of tree regression: it automatically performs variable selection in the sense that a covariate $x_j$ will not appear in the tree if the algorithm finds that the variable is not important.

[Figure 10, tree diagram: the root splits on $X_1 < 50$. If $X_1 \ge 50$ the prediction is $c_3$; otherwise the tree splits on $X_2 < 100$, giving predictions $c_1$ and $c_2$. The lower panel shows the corresponding rectangles $R_1$, $R_2$, $R_3$ in the $(X_1,X_2)$ plane.]

Figure 10: A regression tree for two covariates $X_1$ and $X_2$. The function estimate is $\hat m(x) = c_1I(x\in R_1) + c_2I(x\in R_2) + c_3I(x\in R_3)$ where $R_1$, $R_2$ and $R_3$ are the rectangles shown in the lower plot.

[Figure 11, tree diagram: successive splits on area (at 1403, 1068 and 3967) and peri (at .1949 and .1991), with leaf predictions ranging from 7.746 to 8.985.]

Figure 11: Regression tree for the rock data.

14 Confidence Bands

To get an idea of the precision of the estimate $\hat m(x)$, one may compute a confidence band for $m(x)$. First, we get the squared residuals $r_i = (Y_i - \hat m(X_i))^2$. If we regress the $r_i$'s on the $X_i$'s (using kernel regression for example) then we get an estimate $\hat\sigma^2(x)$. We then call $\hat m(x)\pm 2\hat\sigma(x)$ a variability band. It is not really a confidence band but it does give an idea of the variability of the estimator.
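
A minimal sketch of the variability band construction: smooth the data, smooth the squared residuals to get $\hat\sigma^2(x)$, and form $\hat m(x)\pm 2\hat\sigma(x)$. The Gaussian kernel, bandwidth and simulated heteroscedastic data are my own choices.

```python
import numpy as np

def nw_smooth(x_eval, x, y, h):
    """Nadaraya-Watson smooth of y on x, evaluated at x_eval."""
    K = np.exp(-0.5 * ((x_eval[:, None] - x[None, :]) / h) ** 2)
    return (K * y).sum(axis=1) / K.sum(axis=1)

def variability_band(grid, X, Y, h):
    """m_hat(x) +/- 2 * sigma_hat(x): smooth Y, then smooth the squared residuals."""
    m_hat = nw_smooth(grid, X, Y, h)
    r = (Y - nw_smooth(X, X, Y, h)) ** 2                 # squared residuals
    sigma2_hat = np.clip(nw_smooth(grid, X, r, h), 0, None)
    half_width = 2 * np.sqrt(sigma2_hat)
    return m_hat - half_width, m_hat + half_width

rng = np.random.default_rng(10)
X = np.sort(rng.uniform(0, 1, 300))
Y = np.sin(2 * np.pi * X) + (0.05 + 0.2 * X) * rng.standard_normal(300)
lo, hi = variability_band(np.linspace(0.05, 0.95, 5), X, Y, h=0.05)
print(np.round(lo, 2), np.round(hi, 2))
```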

There are two reasons why this is not a confidence band. First, we have to account for uniformity over $x$. This can be done using the bootstrap. A more serious issue is bias. Note that
$$\frac{\sqrt{n}(\hat m(x) - m(x))}{\hat\sigma(x)} = \frac{\sqrt{n}(\hat m(x) - \bar m(x))}{\hat\sigma(x)} + \frac{\sqrt{n}(\bar m(x) - m(x))}{\hat\sigma(x)}$$
where $\bar m(x) = E[\hat m(x)]$. Under weak conditions, the first term satisfies a central limit theorem. But the second term is the ratio of bias and variance. If we choose the bandwidth optimally, these terms are balanced. Hence, the second term does not go to 0 as $n\to\infty$. This means that the band is not centered at $m(x)$. To deal with this, we can estimate the bias and subtract the bias estimate. The resulting estimator undersmooths: it reduces bias but increases variance. The result will be an asymptotically valid confidence interval centered around a sub-optimal estimator.

The bottom line is this: in nonparametric estimation you can have optimal prediction or valid inferences but not both.

Appendix: Finite Sample Bounds for Kernel Regression

This analysis is from Györfi, Kohler, Krzyżak and Walk (2002). For simplicity, we will use the spherical kernel $K(\|x\|) = I(\|x\|\le 1)$; the results can be extended to other kernels. Hence,
$$\hat m(x) = \frac{\sum_{i=1}^nY_i\,I(\|X_i - x\|\le h)}{\sum_{i=1}^nI(\|X_i - x\|\le h)} = \frac{\sum_{i=1}^nY_i\,I(\|X_i - x\|\le h)}{n\,P_n(B(x,h))}$$

where $P_n$ is the empirical measure and $B(x,h) = \{u:\|x - u\|\le h\}$. If the denominator is 0 we define $\hat m(x) = 0$. Let
$$\mathcal{M} = \Big\{m:\ |m(x) - m(z)|\le L\|x - z\|\ \text{ for all } x,z\in\mathbb{R}^d\Big\}. \qquad (48)$$

The proof of the following theorem is from Chapter 5 of Györfi, Kohler, Krzyżak and Walk (2002).

Theorem 17 Suppose that the distribution of $X$ has compact support and that $\mathrm{Var}(Y\mid X = x)\le\sigma^2<\infty$ for all $x$. Then
$$E\|\hat m - m\|_P^2\ \le\ c_1h^2 + \frac{c_2}{nh^d}. \qquad (49)$$
Hence, if $h\asymp n^{-1/(d+2)}$ then
$$E\|\hat m - m\|_P^2\ \le\ \frac{c}{n^{2/(d+2)}}. \qquad (50)$$

Note that the rate $n^{-2/(d+2)}$ is slower than the pointwise rate $n^{-4/(4+d)}$ because we have made weaker assumptions.

Proof. Let
$$m_h(x) = \frac{\sum_{i=1}^nm(X_i)\,I(\|X_i - x\|\le h)}{n\,P_n(B(x,h))}.$$

Let $A_n = \{P_n(B(x,h)) > 0\}$. When $A_n$ is true,
$$E\Big((\hat m_h(x) - m_h(x))^2\ \Big|\ X_1,\ldots,X_n\Big) = \frac{\sum_{i=1}^n\mathrm{Var}(Y_i\mid X_i)\,I(\|X_i - x\|\le h)}{n^2P_n^2(B(x,h))}\ \le\ \frac{\sigma^2}{nP_n(B(x,h))}.$$

Since $m\in\mathcal{M}$, we have that $|m(X_i) - m(x)|\le L\|X_i - x\|\le Lh$ for $X_i\in B(x,h)$ and hence
$$|m_h(x) - m(x)|^2\ \le\ L^2h^2 + m^2(x)\,I_{A_n^c}(x).$$

Therefore,
$$E\int(\hat m_h(x) - m(x))^2\,dP(x) = E\int(\hat m_h(x) - m_h(x))^2\,dP(x) + E\int(m_h(x) - m(x))^2\,dP(x)$$
$$\le\ E\int\frac{\sigma^2}{nP_n(B(x,h))}\,I_{A_n}(x)\,dP(x) + L^2h^2 + \int m^2(x)\,E\big(I_{A_n^c}(x)\big)\,dP(x). \qquad (51)$$

To bound the first term, let $Z = nP_n(B(x,h))$ and note that $Z\sim\mathrm{Binomial}(n,q)$ where $q = P(X\in B(x,h))$. Now,
$$E\left(\frac{I(Z>0)}{Z}\right)\ \le\ E\left(\frac{2}{1+Z}\right) = \sum_{k=0}^n\frac{2}{k+1}\binom{n}{k}q^k(1-q)^{n-k} = \frac{2}{(n+1)q}\sum_{k=0}^n\binom{n+1}{k+1}q^{k+1}(1-q)^{n-k}$$
$$\le\ \frac{2}{(n+1)q}\sum_{k=0}^{n+1}\binom{n+1}{k}q^k(1-q)^{n-k+1} = \frac{2}{(n+1)q}\big(q + (1-q)\big)^{n+1} = \frac{2}{(n+1)q}\ \le\ \frac{2}{nq}.$$
Therefore,
$$E\int\frac{\sigma^2\,I_{A_n}(x)}{nP_n(B(x,h))}\,dP(x)\ \le\ 2\sigma^2\int\frac{dP(x)}{nP(B(x,h))}.$$
We may choose points $z_1,\ldots,z_M$ such that the support of $P$ is covered by $\bigcup_{j=1}^MB(z_j,h/2)$ where $M\le c_1/h^d$. Thus,
$$\int\frac{dP(x)}{nP(B(x,h))}\ \le\ \sum_{j=1}^M\int\frac{I(x\in B(z_j,h/2))}{nP(B(x,h))}\,dP(x)\ \le\ \sum_{j=1}^M\int\frac{I(x\in B(z_j,h/2))}{nP(B(z_j,h/2))}\,dP(x)\ \le\ \frac{M}{n}\ \le\ \frac{c_1}{nh^d}.$$
The third term in (51) is bounded by
$$\int m^2(x)\,E\big(I_{A_n^c}(x)\big)\,dP(x)\ \le\ \sup_xm^2(x)\int\big(1 - P(B(x,h))\big)^n\,dP(x)\ \le\ \sup_xm^2(x)\int e^{-nP(B(x,h))}\,dP(x)$$
$$=\ \sup_xm^2(x)\int e^{-nP(B(x,h))}\,\frac{nP(B(x,h))}{nP(B(x,h))}\,dP(x)\ \le\ \sup_xm^2(x)\,\sup_u\big(ue^{-u}\big)\int\frac{dP(x)}{nP(B(x,h))}\ \le\ \sup_xm^2(x)\,\sup_u\big(ue^{-u}\big)\,\frac{c_1}{nh^d} = \frac{c_2}{nh^d}. \qquad \square$$

27 15 References

Györfi, Kohler, Krzyżak and Walk (2002). A Distribution-Free Theory of Nonparametric Regression. Springer.

Härdle, Müller, Sperlich and Werwatz (2004). Nonparametric and Semiparametric Models. Springer.

Wasserman, L. (2006). All of Nonparametric Statistics. Springer.
