
STAT 535: Statistical Machine Learning, Autumn 2019
Lecture 3: Regression: Nonparametric approaches
Instructor: Yen-Chi Chen

Reference: Section 6 of All of Nonparametric Statistics by Larry Wasserman.

3.1 Introduction

Let $(X_1,Y_1), \cdots, (X_n,Y_n)$ be a bivariate random sample. In regression analysis, we are often interested in the regression function $m(x) = E(Y|X=x)$. Sometimes, we will write

$$Y_i = m(X_i) + \epsilon_i,$$

where $\epsilon_i$ is a mean-$0$ noise. The simple linear model assumes that $m(x) = \beta_0 + \beta_1 x$, where $\beta_0$ and $\beta_1$ are the intercept and slope parameters. In the first part of the lecture, we will talk about methods that directly estimate the regression function $m(x)$ without imposing any parametric form on $m(x)$. This approach is called nonparametric regression.

3.2 Regressogram (Binning)

We start with a very simple but extremely popular method. This method is called the regressogram, but people often call it the binning approach. You can view it as

regressogram = regression + histogram.

For simplicity, we assume that the covariates Xi’s are from a distribution over [0, 1]. Similar to the histogram, we first choose M, the number of bins. Then we partition the interval [0, 1] into M equal-width bins:

$$B_1 = \left[0, \tfrac{1}{M}\right),\ B_2 = \left[\tfrac{1}{M}, \tfrac{2}{M}\right),\ \cdots,\ B_{M-1} = \left[\tfrac{M-2}{M}, \tfrac{M-1}{M}\right),\ B_M = \left[\tfrac{M-1}{M}, 1\right].$$

When $x \in B_\ell$, we estimate $m(x)$ by

$$\hat{m}_M(x) = \frac{\sum_{i=1}^n Y_i I(X_i \in B_\ell)}{\sum_{i=1}^n I(X_i \in B_\ell)} = \text{average of the responses whose covariates are in the same bin as } x.$$
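To make the construction concrete, here is a minimal NumPy sketch of the regressogram; the simulated data, variable names, and the choice $M\approx n^{1/3}$ (motivated by the rate analysis later in this section) are illustrative assumptions, not part of the notes.

```python
import numpy as np

def regressogram(x, X, Y, M):
    """Regressogram estimate of m(x) with M equal-width bins on [0, 1]."""
    x = np.atleast_1d(x)
    # Bin index (0, ..., M-1) of each query point and each observation.
    bin_x = np.minimum((x * M).astype(int), M - 1)
    bin_X = np.minimum((X * M).astype(int), M - 1)
    m_hat = np.full(len(x), np.nan)
    for k, b in enumerate(bin_x):
        in_bin = (bin_X == b)
        if in_bin.any():
            # Average of the responses whose covariates are in the same bin as x.
            m_hat[k] = Y[in_bin].mean()
    return m_hat

# Illustration with simulated data:
rng = np.random.default_rng(0)
n = 500
X = rng.uniform(0, 1, n)
Y = np.sin(2 * np.pi * X) + 0.3 * rng.normal(size=n)
print(regressogram(np.linspace(0, 1, 11), X, Y, M=int(n ** (1 / 3))))
```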

Theorem 3.1 Assume that the PDF of $X$ satisfies $p(x) \ge p_0 > 0$ for all $x \in [0,1]$ and $E(Y^2|X=x) < \infty$. Then

$$\mathrm{bias}(\hat{m}_M(x)) = O\left(\frac{1}{M}\right), \quad \mathrm{Var}(\hat{m}_M(x)) = O\left(\frac{M}{n}\right).$$


Proof: Suppose that $x$ belongs to bin $B_\ell$. Let $\mu_M(x) = E(Y_i I(X_i \in B_\ell))$ and $q_M(x) = E(I(X_i \in B_\ell))$ and let $\mu(x) = m(x)\cdot p(x)$.

Using $\hat{m}_M(x) = \frac{\hat{\mu}_M(x)}{\hat{q}_M(x)}$, where

$$\hat{\mu}_M(x) = \frac{1}{n}\sum_{i=1}^n Y_i I(X_i \in B_\ell), \quad \hat{q}_M(x) = \frac{1}{n}\sum_{i=1}^n I(X_i \in B_\ell),$$

the difference can be decomposed into

$$\hat{m}_M(x) - m(x) = \underbrace{\frac{\hat{\mu}_M(x)}{\hat{q}_M(x)} - \frac{\mu_M(x)}{q_M(x)}}_{\sim\ \text{variance}} + \underbrace{\frac{\mu_M(x)}{q_M(x)} - \frac{\mu(x)}{p(x)}}_{\sim\ \text{bias}}.$$

Bias. Since we have $M$ bins, the width of each bin is $1/M$. Thus, it is easy to see that $M\cdot q_M(x)$ can be viewed as a density histogram estimator of $p(x)$. Therefore, by the theory of the histogram, the bias will be $p(x) - M\cdot q_M(x) = O(1/M)$. Similarly, one can show that $M\cdot\mu_M(x)$ can be viewed as an estimator of $\mu(x)$ and $\mu(x) - M\cdot\mu_M(x) = O(1/M)$. Using the fact that $\frac{1}{1+\epsilon} = 1 - \epsilon + O(\epsilon^2)$ when $\epsilon\to 0$, we conclude that the bias part satisfies

$$\frac{\mu_M(x)}{q_M(x)} - \frac{\mu(x)}{p(x)} = \frac{M\mu_M(x)}{Mq_M(x)} - \frac{\mu(x)}{p(x)} = \frac{\mu(x) + O(1/M)}{p(x) + O(1/M)} - \frac{\mu(x)}{p(x)} = \frac{O(1/M)}{p(x)} + \frac{\mu(x)}{p^2(x)}\,O(1/M) = O(1/M).$$

Variance. For the variance part, it is easy to see that $E(\hat{\mu}_M(x)) = \mu_M(x)$ and $E(\hat{q}_M(x)) = q_M(x)$. Also, it is easy to see that (using the same derivation as for the histogram)

$$\mathrm{Var}(\hat{\mu}_M(x)) = O\left(\frac{1}{Mn}\right), \quad \mathrm{Var}(\hat{q}_M(x)) = O\left(\frac{1}{Mn}\right).$$

Thus,

$$\mathrm{Var}(M\hat{\mu}_M(x)) = O\left(\frac{M}{n}\right), \quad \mathrm{Var}(M\hat{q}_M(x)) = O\left(\frac{M}{n}\right).$$

Let $\Delta_\mu(x) = M\hat{\mu}_M(x) - M\mu_M(x) = O_P(\sqrt{M/n})$ and $\Delta_q(x) = M\hat{q}_M(x) - Mq_M(x) = O_P(\sqrt{M/n})$. Then

$$\frac{\hat{\mu}_M(x)}{\hat{q}_M(x)} - \frac{\mu_M(x)}{q_M(x)} = \frac{M\hat{\mu}_M(x)}{M\hat{q}_M(x)} - \frac{M\mu_M(x)}{Mq_M(x)} = \frac{M\mu_M(x) + \Delta_\mu(x)}{Mq_M(x) + \Delta_q(x)} - \frac{M\mu_M(x)}{Mq_M(x)}$$

$$= \frac{M\mu_M(x) + \Delta_\mu(x)}{Mq_M(x)} - \frac{M\mu_M(x)}{M^2q_M^2(x)}\Delta_q(x) - \frac{M\mu_M(x)}{Mq_M(x)} + \text{smaller order terms}$$

$$\approx \frac{1}{Mq_M(x)}\Delta_\mu(x) - \frac{M\mu_M(x)}{M^2q_M^2(x)}\Delta_q(x) \approx \frac{1}{p(x)}\Delta_\mu(x) - \frac{\mu(x)}{p^2(x)}\Delta_q(x).$$

Note that the last $\approx$ sign is due to the bias analysis. Thus, the variance of $\frac{\hat{\mu}_M(x)}{\hat{q}_M(x)}$ equals the variance of $\frac{1}{p(x)}\Delta_\mu(x) - \frac{\mu(x)}{p^2(x)}\Delta_q(x)$, which is of rate $O(M/n)$.

Therefore, the MSE and MISE will be of rate

$$\mathrm{MSE} = O\left(\frac{1}{M^2}\right) + O\left(\frac{M}{n}\right), \quad \mathrm{MISE} = O\left(\frac{1}{M^2}\right) + O\left(\frac{M}{n}\right),$$

leading to the optimal number of bins $M^* \asymp n^{1/3}$ and the optimal convergence rate $O(n^{-2/3})$, the same as the histogram.

Similar to the histogram, the regressogram has a slower convergence rate compared to many other competitors (we will introduce several other candidates). However, they (histogram and regressogram) are still very popular because the construction of the estimator is very simple and intuitive; practitioners with little mathematical training can easily master these approaches. Note that if we assume that the response variable $Y$ is bounded, you can construct a concentration bound similar to the one for the histogram and obtain the rate under the $L_\infty$ metric.

3.3 Kernel Regression

Given a point $x_0$, assume that we are interested in the value $m(x_0)$. Here is a simple method to estimate that value. When $m$ is smooth, an observation $X_i \approx x_0$ implies $m(X_i) \approx m(x_0)$. Thus, the response value $Y_i = m(X_i) + \epsilon_i \approx m(x_0) + \epsilon_i$. Using this observation, to reduce the noise $\epsilon_i$, we can use the sample average. Thus, an estimator of $m(x_0)$ is the average of those responses whose covariates are close to $x_0$. To make it more concrete, let $h > 0$ be a threshold. The above procedure suggests using

$$\hat{m}_{\mathrm{loc}}(x_0) = \frac{\sum_{i:|X_i - x_0|\le h} Y_i}{n_h(x_0)} = \frac{\sum_{i=1}^n Y_i I(|X_i - x_0|\le h)}{\sum_{i=1}^n I(|X_i - x_0|\le h)}, \quad (3.1)$$

where $n_h(x_0)$ is the number of observations whose covariate satisfies $|X_i - x_0|\le h$. This estimator, $\hat{m}_{\mathrm{loc}}$, is called the local average estimator. Indeed, to estimate $m(x)$ at any given point $x$, we are using a local average as an estimator.

The local average estimator can be rewritten as

$$\hat{m}_{\mathrm{loc}}(x_0) = \frac{\sum_{i=1}^n Y_i I(|X_i - x_0|\le h)}{\sum_{i=1}^n I(|X_i - x_0|\le h)} = \sum_{i=1}^n \frac{I(|X_i - x_0|\le h)}{\sum_{\ell=1}^n I(|X_\ell - x_0|\le h)}\cdot Y_i = \sum_{i=1}^n W_i(x_0) Y_i, \quad (3.2)$$

where

$$W_i(x_0) = \frac{I(|X_i - x_0|\le h)}{\sum_{\ell=1}^n I(|X_\ell - x_0|\le h)} \quad (3.3)$$

is a weight for each observation. Note that $\sum_{i=1}^n W_i(x_0) = 1$ and $W_i(x_0) \ge 0$ for all $i = 1,\cdots,n$; this implies that the $W_i(x_0)$'s are indeed weights. Equation (3.2) shows that the local average estimator can be written as a weighted average estimator, so the $i$-th weight $W_i(x_0)$ determines the contribution of the response $Y_i$ to the estimator $\hat{m}_{\mathrm{loc}}(x_0)$.

In constructing the local average estimator, we are placing a hard threshold on the neighboring points: those within a distance $h$ are given equal weight, but those outside the threshold $h$ are ignored completely. This hard thresholding leads to an estimator that is not continuous.

To avoid this problem, we consider another construction of the weights. Ideally, we want to give more weight to those observations that are close to $x_0$, and we want the weight to be 'smooth'. The Gaussian function $G(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}$ seems to be a good candidate. We now use the Gaussian function to construct an estimator. We first construct the weight

$$W_i^G(x_0) = \frac{G\left(\frac{x_0 - X_i}{h}\right)}{\sum_{\ell=1}^n G\left(\frac{x_0 - X_\ell}{h}\right)}.$$

The quantity $h > 0$ plays a role similar to the threshold in the local average, but now it acts as the smoothing bandwidth of the Gaussian. After constructing the weight, our new estimator is

$$\hat{m}_G(x_0) = \sum_{i=1}^n W_i^G(x_0) Y_i = \sum_{i=1}^n \frac{G\left(\frac{x_0 - X_i}{h}\right)}{\sum_{\ell=1}^n G\left(\frac{x_0 - X_\ell}{h}\right)} Y_i = \frac{\sum_{i=1}^n Y_i G\left(\frac{x_0 - X_i}{h}\right)}{\sum_{\ell=1}^n G\left(\frac{x_0 - X_\ell}{h}\right)}. \quad (3.4)$$

This new estimator has a weight that changes more smoothly than the local average and is smooth as we desire. Comparing equations (3.1) and (3.4), one may notice that these local estimators are all of a similar form:

$$\hat{m}_h(x_0) = \frac{\sum_{i=1}^n Y_i K\left(\frac{x_0 - X_i}{h}\right)}{\sum_{\ell=1}^n K\left(\frac{x_0 - X_\ell}{h}\right)} = \sum_{i=1}^n W_i^K(x_0) Y_i, \quad W_i^K(x_0) = \frac{K\left(\frac{x_0 - X_i}{h}\right)}{\sum_{\ell=1}^n K\left(\frac{x_0 - X_\ell}{h}\right)}, \quad (3.5)$$

where $K$ is some function. When $K$ is a Gaussian, we obtain estimator (3.4); when $K$ is a uniform over $[-1,1]$, we obtain the local average (3.1). The estimator in equation (3.5) is called the kernel regression estimator or Nadaraya-Watson estimator¹. The function $K$ plays a similar role as the kernel function in the KDE and thus it is also called the kernel function. The quantity $h > 0$ is similar to the smoothing bandwidth in the KDE, so it is also called the smoothing bandwidth.
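A minimal sketch of the Nadaraya-Watson estimator (3.5) with a Gaussian kernel; the simulated data and the bandwidth value are arbitrary choices for illustration.

```python
import numpy as np

def nw_regression(x0, X, Y, h):
    """Nadaraya-Watson kernel regression m_hat_h evaluated at the points x0."""
    x0 = np.atleast_1d(x0)
    # Gaussian kernel weights K((x0 - X_i)/h); the kernel's normalizing
    # constant cancels in the ratio, so it is omitted.
    W = np.exp(-0.5 * ((x0[:, None] - X[None, :]) / h) ** 2)
    return (W @ Y) / W.sum(axis=1)

# Usage with simulated data:
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, 300)
Y = np.sin(2 * np.pi * X) + 0.2 * rng.normal(size=300)
print(nw_regression(np.linspace(0, 1, 5), X, Y, h=0.05))
```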

3.3.1 Theory

Now we study some statistical properties of the estimator $\hat{m}_h$. Suppose that we are interested in $m(x)$ over a compact interval $K\subset\mathbb{R}$.

Theorem 3.2 Assume that

• $\inf_{x\in K} p(x) > 0$ and $p(x)$ has bounded second derivatives.

• $E(Y^2|X=x) < \infty$ and $m(x)$ has bounded third derivatives.

Then

$$\mathrm{bias}(\hat{m}_h(x)) = \frac{h^2}{2}\mu_K\left(m''(x) + 2\frac{m'(x)p'(x)}{p(x)}\right) + o(h^2),$$

$$\mathrm{Var}(\hat{m}_h(x)) = \frac{\sigma^2\cdot\sigma_K^2}{p(x)}\cdot\frac{1}{nh} + o\left(\frac{1}{nh}\right),$$

where $\mu_K = \int x^2K(x)\,dx$ is the same constant of the kernel function as in the KDE, $\sigma^2 = \mathrm{Var}(\epsilon_i)$ is the error variance of the regression model, and $\sigma_K^2 = \int K^2(x)\,dx$ is a constant of the kernel function (the same as in the KDE).

¹ https://en.wikipedia.org/wiki/Kernel_regression

The bias has two components: a curvature component $m''(x)$ and a design component $\frac{m'(x)p'(x)}{p(x)}$. The curvature component is similar to the one in the KDE: when the regression function curves a lot, kernel smoothing will smooth out the structure, introducing some bias. The second component, also known as the design bias, is a new component compared to the bias in the KDE. This component depends on the density of the covariate $p(x)$. Note that in some studies, we can choose the values of the covariates, so the density $p(x)$ is also called the design (this is why it is known as the design bias).

The expression of the variance tells us the possible sources of variability. First, the variance increases when $\sigma^2$ increases. This makes perfect sense because $\sigma^2$ is the noise level; when the noise level is large, we expect the estimation error to increase. Second, the density of the covariate $p(x)$ is inversely related to the variance. This is also very reasonable because when $p(x)$ is large, there tend to be more points around $x$, increasing the size of the sample that we are averaging over. Last, the convergence rate is $O\left(\frac{1}{nh}\right)$, which is the same as the KDE.

MSE and MISE. Using the expressions of the bias and variance, the MSE at a point $x$ is

$$\mathrm{MSE}(\hat{m}_h(x)) = \frac{h^4}{4}\mu_K^2\left(m''(x) + 2\frac{m'(x)p'(x)}{p(x)}\right)^2 + \frac{\sigma^2\cdot\sigma_K^2}{p(x)}\cdot\frac{1}{nh} + o(h^4) + o\left(\frac{1}{nh}\right)$$

and the MISE is

$$\mathrm{MISE}(\hat{m}_h) = \frac{h^4}{4}\mu_K^2\int\left(m''(x) + 2\frac{m'(x)p'(x)}{p(x)}\right)^2 dx + \frac{\sigma^2\cdot\sigma_K^2}{nh}\int\frac{1}{p(x)}\,dx + o(h^4) + o\left(\frac{1}{nh}\right). \quad (3.6)$$

Optimizing the major components in equation (3.6) (the AMISE), we obtain the optimal value of the smoothing bandwidth

$$h_{\mathrm{opt}} = C^*\cdot n^{-1/5},$$

where $C^*$ is a constant depending on $p$ and $K$.

3.3.2 Cross-Validation

The smoothing bandwidth $h$ has to be chosen to construct our estimator. The theory suggests that we choose it to be $h_{\mathrm{opt}} = C^*\cdot n^{-1/5}$, but this involves an unknown quantity $C^*$. So in practice, how can we choose it? The good news is that, unlike the density estimation problem, there is a simple approach to choosing $h$: cross-validation (CV)².

Before we discuss the details of CV, we first introduce the predictive risk. Let $\hat{m}_h$ be the kernel regression estimator based on the $n$ observations. Let $(X_{n+1}, Y_{n+1})$ be a new observation (from the same population). We define the predictive risk of our regression estimator as

$$R(h) = E\left[(Y_{n+1} - \hat{m}_h(X_{n+1}))^2\right]. \quad (3.7)$$

Namely, the quantity $R(h)$ is the expected squared error of predicting the next observation using the kernel regression. CV is a collection of approaches that try to estimate the predictive risk $R(h)$ using a data-splitting approach. A classical version of CV is the leave-one-out cross-validation (LOO-CV):

$$\hat{R}(h) = \frac{1}{n}\sum_{i=1}^n (Y_i - \hat{m}_{h,-i}(X_i))^2,$$

² https://en.wikipedia.org/wiki/Cross-validation_(statistics)

where $\hat{m}_{h,-i}(X_i)$ is the kernel regression estimator constructed using all observations except the $i$-th observation $(X_i, Y_i)$. Namely, LOO-CV leaves each observation out one at a time, uses the remaining observations to train the estimator, and evaluates the quality of the estimator using the left-out observation. The main reason for such a procedure is to make sure we do not use the data twice.

Another popular version of CV is the K-fold CV. We randomly split the data into $K$ equal-size groups. Each time we leave out one group and use the other $K-1$ groups to construct our estimator. Then we use the left-out group to evaluate the risk. Repeat this procedure many times and take the average as the risk estimator $\hat{R}(h)$.

K-fold Cross-Validation.

1. Randomly split $D = \{(X_1,Y_1),\cdots,(X_n,Y_n)\}$ into $K$ groups: $D_1,\cdots,D_K$.

2. For the $\ell$-th group, construct the estimator $\hat{m}_h^{(\ell)}$ using all the data except the $\ell$-th group.

3. Evaluate the error by
$$\hat{R}^{(\ell)}(h) = \frac{1}{n_\ell}\sum_{(X_i,Y_i)\in D_\ell}(Y_i - \hat{m}_h^{(\ell)}(X_i))^2.$$

4. Compute the average error
$$\hat{R}(h) = \frac{1}{K}\sum_{\ell=1}^K \hat{R}^{(\ell)}(h).$$

5. Repeat the above 4 steps $N$ times, leading to $N$ average errors

$$\hat{R}^{*(1)}(h), \cdots, \hat{R}^{*(N)}(h).$$

6. Estimate $R(h)$ via
$$\hat{R}^*(h) = \frac{1}{N}\sum_{\ell=1}^N \hat{R}^{*(\ell)}(h).$$
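A minimal sketch of steps 1-4 above for choosing the smoothing bandwidth $h$ of a kernel regression; the generic `fit_predict` argument and the commented usage (which reuses the `nw_regression` sketch from Section 3.3) are assumptions of this example.

```python
import numpy as np

def kfold_cv(X, Y, fit_predict, h_grid, K=5, seed=0):
    """Estimate the predictive risk R(h) by K-fold CV for each h in h_grid.

    fit_predict(h, X_train, Y_train, X_test) must return predictions on X_test
    from an estimator trained on (X_train, Y_train) with tuning parameter h.
    """
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(X)) % K          # step 1: random split into K groups
    risks = []
    for h in h_grid:
        errs = []
        for k in range(K):                       # steps 2-3: train on K-1 groups, test on the rest
            test = (folds == k)
            pred = fit_predict(h, X[~test], Y[~test], X[test])
            errs.append(np.mean((Y[test] - pred) ** 2))
        risks.append(np.mean(errs))              # step 4: average error over the K folds
    return np.array(risks)

# Possible usage (assuming the nw_regression sketch from Section 3.3):
# h_grid = np.linspace(0.02, 0.3, 15)
# risks = kfold_cv(X, Y, lambda h, Xtr, Ytr, Xte: nw_regression(Xte, Xtr, Ytr, h), h_grid)
# h_star = h_grid[np.argmin(risks)]
```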

The CV provides a simple approach to estimating the predictive error. To choose the smoothing bandwidth, we pick

$$h^* = \mathrm{argmin}_{h>0}\,\hat{R}(h).$$

In practice, we apply the CV to various values of $h$ and choose the one with the minimal predictive risk. Generally, we will plot $\hat{R}(h)$ versus $h$ and determine if the minimal value makes sense. Sometimes there might be no well-defined minimum value (like a flat region).

Why do we want to split the data into two parts and construct the estimator on one part and evaluate the risk on the other part? The main reason is to obtain a reliable estimate of the predictive risk. If we use the same set of data to construct our estimator and evaluate the errors, the estimated predictive risk will be smaller than the actual predictive risk. To see this, consider the local average estimator with $h \approx 0$. When $h$ is very small, $\hat{m}_{\mathrm{loc}}(X_i) = Y_i$ because the neighborhood only contains this single observation. In this case, the estimated predictive risk will be $\sum_{i=1}^n (Y_i - Y_i)^2 = 0$. This is related to the so-called overfitting³.

Cross-validation is a very popular and common approach to choosing a tuning parameter.

³ https://en.wikipedia.org/wiki/Overfitting

Other tuning parameters, such as the number of basis functions and the penalization parameter $\lambda$ (to be introduced shortly), can all be chosen by minimizing the cross-validation error.

3.3.3 Uncertainty and Confidence Intervals

How do we assess the quality of our estimator $\hat{m}_h(x)$? We can use the bootstrap to do it. In this case, the empirical bootstrap, the residual bootstrap, and the wild bootstrap can all be applied, but note that each of them relies on slightly different assumptions. Let $(X_1^*, Y_1^*), \cdots, (X_n^*, Y_n^*)$ be the bootstrap sample. Applying the bootstrap sample to equation (3.5), we obtain a bootstrap kernel regression estimator, denoted as $\hat{m}_h^*$. Now repeating the bootstrap procedure $B$ times yields

$$\hat{m}_h^{*(1)}, \cdots, \hat{m}_h^{*(B)},$$

$B$ bootstrap kernel regression estimators. Then we can estimate the variance of $\hat{m}_h(x)$ by the sample variance

$$\widehat{\mathrm{Var}}_B(\hat{m}_h(x)) = \frac{1}{B-1}\sum_{\ell=1}^B\left(\hat{m}_h^{*(\ell)}(x) - \bar{m}_{h,B}^*(x)\right)^2, \quad \bar{m}_{h,B}^*(x) = \frac{1}{B}\sum_{\ell=1}^B \hat{m}_h^{*(\ell)}(x).$$
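A minimal sketch of the empirical bootstrap version of this variance estimate, in which we resample the pairs $(X_i, Y_i)$; the generic `fit` argument (e.g. the Nadaraya-Watson sketch above with a fixed bandwidth) is an assumption of this example.

```python
import numpy as np

def bootstrap_variance(x, X, Y, fit, B=200, seed=0):
    """Empirical-bootstrap estimate of Var(m_hat(x)) at the points x.

    fit(x, X, Y) must return the regression estimate at the points x.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    x = np.atleast_1d(x)
    boot = np.empty((B, len(x)))
    for b in range(B):
        idx = rng.integers(0, n, n)          # resample the pairs with replacement
        boot[b] = fit(x, X[idx], Y[idx])     # bootstrap kernel regression m_hat^{*(b)}
    return boot.var(axis=0, ddof=1)          # sample variance over the B replicates

# Possible usage with the Nadaraya-Watson sketch and a fixed bandwidth h = 0.1:
# v = bootstrap_variance([0.3, 0.5, 0.7], X, Y, lambda x, X, Y: nw_regression(x, X, Y, 0.1))
```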

Similarly, we can estimate the MSE as we did in Lectures 5 and 6. However, when using the bootstrap to estimate the uncertainty, one has to be very careful, because when $h$ is either too small or too large, the bootstrap estimate may fail to converge to its target. When we choose $h = O(n^{-1/5})$, the bootstrap estimate of the variance is consistent, but the bootstrap estimate of the MSE might not be consistent. The main reason is that it is easier for the bootstrap to estimate the variance than the bias. When we choose $h$ in such a way, both the bias and the variance contribute a lot to the MSE, so we cannot ignore the bias. However, in this case, the bootstrap cannot estimate the bias consistently, so the estimate of the MSE is not consistent.

Confidence interval. To construct a confidence interval of $m(x)$, we will use the following property of the kernel regression.

Theorem 3.3 Assume the conditions in Theorem 3.2. Then

$$\sqrt{nh}\left(\hat{m}_h(x) - E(\hat{m}_h(x))\right) \overset{D}{\to} N\left(0, \frac{\sigma^2\cdot\sigma_K^2}{p(x)}\right),$$

$$\frac{\hat{m}_h(x) - E(\hat{m}_h(x))}{\sqrt{\mathrm{Var}(\hat{m}_h(x))}} \overset{D}{\to} N(0,1).$$

The variance depends on three quantities: $\sigma^2$, $\sigma_K^2$, and $p(x)$. The quantity $\sigma_K^2$ is known because it is just a characteristic of the kernel function. The density of the covariates $p(x)$ can be estimated using a KDE. So what remains unknown is the noise level $\sigma^2$. The good news is that we can estimate it using the residuals. Recall that the residuals are

$$e_i = Y_i - \hat{Y}_i = Y_i - \hat{m}_h(X_i).$$

When $\hat{m}_h \approx m$, the residual becomes an approximation to the noise $\epsilon_i$. The quantity $\sigma^2 = \mathrm{Var}(\epsilon_1)$, so we can use the sample variance of the residuals to estimate it (note that the average of the residuals is 0):

$$\hat{\sigma}^2 = \frac{1}{n - 2\nu + \tilde{\nu}}\sum_{i=1}^n e_i^2, \quad (3.8)$$

where $\nu$ and $\tilde{\nu}$ are quantities acting as degrees of freedom, which we will explain later. Thus, a $1-\alpha$ CI can be constructed using

$$\hat{m}_h(x) \pm z_{1-\alpha/2}\,\frac{\hat{\sigma}\cdot\sigma_K}{\sqrt{nh\,\hat{p}_n(x)}},$$

where $\hat{p}_n(x)$ is the KDE of the covariates. However, note that this confidence interval is a pointwise confidence interval, i.e., we are controlling the type-1 error at a given point $x$. Also, this confidence interval has the limitation that the bias may decrease our coverage: the construction shows that our confidence interval has the right coverage for $E(\hat{m}_h(x))$, not the true regression function $m(x)$. Thus, we often need to undersmooth the data a bit to obtain proper coverage. One possible solution to this problem is to use the debiased estimator; see the following paper for more information:

Cheng, G., & Chen, Y. C. (2019). Nonparametric inference via bootstrapping the debiased estimator. Electronic Journal of Statistics, 13(1), 2194-2256.

3.3.4 Relation to KDE

Many theoretical results of the KDE apply to nonparametric regression. For instance, we can generalize the MISE to other types of error measurement between $\hat{m}_h$ and $m$. We can also use derivatives of $\hat{m}_h$ as estimators of the corresponding derivatives of $m$. Moreover, when we have a multivariate covariate, we can use either a radial basis kernel or a product kernel to generalize the kernel regression to the multivariate case.

The KDE and the kernel regression have a very interesting relationship. Using the given bivariate random sample $(X_1,Y_1), \cdots, (X_n,Y_n)$, we can estimate the joint PDF $p(x,y)$ as

$$\hat{p}_n(x,y) = \frac{1}{nh^2}\sum_{i=1}^n K\left(\frac{X_i - x}{h}\right)K\left(\frac{Y_i - y}{h}\right).$$

This joint density estimator also leads to a marginal density estimator of X:

$$\hat{p}_n(x) = \int \hat{p}_n(x,y)\,dy = \frac{1}{nh}\sum_{i=1}^n K\left(\frac{X_i - x}{h}\right).$$

Theorem 3.4 The conditional expectation of Y given X = x implied by the 2D KDE is the same as the kernel regression estimator.

Proof: Recall that the regression function is the conditional expectation

$$m(x) = E(Y|X=x) = \int y\,p(y|x)\,dy = \int y\,\frac{p(x,y)}{p(x)}\,dy = \frac{\int y\,p(x,y)\,dy}{p(x)}.$$

Replacing $p(x,y)$ and $p(x)$ by their corresponding estimators $\hat{p}_n(x,y)$ and $\hat{p}_n(x)$, we obtain an estimate of

$m(x)$ as

$$\hat{m}_n(x) = \frac{\int y\,\hat{p}_n(x,y)\,dy}{\hat{p}_n(x)} = \frac{\int y\,\frac{1}{nh^2}\sum_{i=1}^n K\left(\frac{X_i-x}{h}\right)K\left(\frac{Y_i-y}{h}\right)dy}{\frac{1}{nh}\sum_{i=1}^n K\left(\frac{X_i-x}{h}\right)} = \frac{\sum_{i=1}^n K\left(\frac{X_i-x}{h}\right)\cdot\int y\,K\left(\frac{Y_i-y}{h}\right)\frac{dy}{h}}{\sum_{i=1}^n K\left(\frac{X_i-x}{h}\right)} = \frac{\sum_{i=1}^n Y_i K\left(\frac{X_i-x}{h}\right)}{\sum_{i=1}^n K\left(\frac{X_i-x}{h}\right)} = \hat{m}_h(x).$$

Note that when $K(x)$ is symmetric, $\int y\,K\left(\frac{Y_i-y}{h}\right)\frac{dy}{h} = Y_i$. Namely, we may understand the kernel regression as the estimator obtained by plugging the KDE of the joint PDF into the conditional expectation.
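The identity in Theorem 3.4 can also be checked numerically. The sketch below, with a Gaussian kernel and a crude Riemann sum for the $y$-integral, reproduces the Nadaraya-Watson estimate at a point; the simulated data and the evaluation point are arbitrary.

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def cond_mean_from_2d_kde(x, X, Y, h, y_grid):
    """E(Y | X = x) implied by the 2D KDE: int y p_n(x, y) dy / p_n(x)."""
    kx = gauss((X - x) / h)                     # K((X_i - x)/h)
    # Joint KDE p_n(x, y) evaluated along the y grid.
    p_xy = np.mean(kx[None, :] * gauss((Y[None, :] - y_grid[:, None]) / h), axis=1) / h ** 2
    p_x = np.mean(kx) / h                       # marginal KDE of X at x
    dy = y_grid[1] - y_grid[0]
    return np.sum(y_grid * p_xy) * dy / p_x     # Riemann sum for the y-integral

def nadaraya_watson(x, X, Y, h):
    w = gauss((x - X) / h)
    return np.sum(w * Y) / np.sum(w)

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, 200)
Y = np.sin(2 * np.pi * X) + 0.2 * rng.normal(size=200)
y_grid = np.linspace(Y.min() - 1, Y.max() + 1, 4000)   # wide grid so the y-integral is accurate
# The two numbers agree up to numerical-integration error:
print(cond_mean_from_2d_kde(0.5, X, Y, 0.1, y_grid), nadaraya_watson(0.5, X, Y, 0.1))
```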

3.4 Local Polynomial Regression

The kernel regression estimator has the limitation that it suffers a lot from boundary bias, i.e., when $x$ is close to the boundary of the support of $p(x)$, the bias will be very large. To address this issue, we may use a modified estimator called the local polynomial regression (LPR). LPR starts with the following localized least squares problem. Suppose that we want to estimate the regression function $m(x)$ at a point $x$. Consider fitting the following local linear function

$$\mathrm{LPR}(\beta_0,\beta_1;x) = \sum_{i=1}^n K\left(\frac{x-X_i}{h}\right)(Y_i - \beta_0 - \beta_1 X_i)^2,$$

where $K(\cdot)$ is the usual kernel function. This is fitting a weighted linear regression where the observations close to $x$ are given higher weights. Let $\hat{\beta}_0(x), \hat{\beta}_1(x)$ be the minimizer of $\mathrm{LPR}(\beta_0,\beta_1;x)$. Then the estimator $\hat{\beta}_0(x)$ is called the local linear smoother and is a consistent estimator of $m(x)$, the regression function. There is a closed form for the local linear smoother. Define the diagonal matrix

$$W(x) = \mathrm{Diag}\left(K\left(\frac{x-X_1}{h}\right), \cdots, K\left(\frac{x-X_n}{h}\right)\right) \in \mathbb{R}^{n\times n}$$

and the matrix $\mathbb{X}\in\mathbb{R}^{n\times 2}$,

$$\mathbb{X} = (1_n, X),$$

where $X$ is the column vector of $X_1,\cdots,X_n$ and $1_n$ is a column vector of 1's. Let $Y$ be the column vector of $Y_1,\cdots,Y_n$. Using the same derivation as in linear regression, you can show that the local linear smoother $\hat{\beta}_0(x)$ is

$$\hat{\beta}_0(x) = e_1^T(\mathbb{X}^T W(x)\mathbb{X})^{-1}\mathbb{X}^T W(x)Y,$$

where $e_1^T = (1, 0)$ is a $1\times 2$ vector. We write $\hat{m}_{LL}(x) = \hat{\beta}_0(x)$. We can generalize the local linear smoother to higher order polynomials. For instance, we can fit the $q$-th order polynomial

$$\mathrm{LPR}(\beta_0,\beta_1,\cdots,\beta_q;x) = \sum_{i=1}^n K\left(\frac{x-X_i}{h}\right)(Y_i - \beta_0 - \beta_1 X_i - \cdots - \beta_q X_i^q)^2.$$

Fitting a higher order polynomial is often used to estimate derivatives of the regression function. In fact, if we want to consistently estimate $m^{(\beta)}$, the $\beta$-th derivative of $m(x)$, then we will fit a $(\beta+1)$-th order polynomial. Using the same derivation as for the local linear smoother, you can obtain a closed form for the estimator.
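A minimal sketch of the local linear smoother via the closed form $\hat{\beta}_0(x) = e_1^T(\mathbb{X}^TW(x)\mathbb{X})^{-1}\mathbb{X}^TW(x)Y$. The sketch uses the equivalent parametrization $\beta_0 + \beta_1(X_i - x)$ (centering the covariate at $x$), so that the intercept is directly the fitted value at $x$; the Gaussian kernel and the simulated data are illustrative choices.

```python
import numpy as np

def local_linear(x0, X, Y, h):
    """Local linear smoother m_hat_LL evaluated at the points x0 (Gaussian kernel)."""
    out = []
    for x in np.atleast_1d(x0):
        w = np.exp(-0.5 * ((x - X) / h) ** 2)          # diagonal of W(x)
        D = np.column_stack([np.ones_like(X), X - x])  # design matrix, covariate centered at x
        A = D.T @ (w[:, None] * D)                     # X^T W(x) X  (2 x 2)
        b = D.T @ (w * Y)                              # X^T W(x) Y
        out.append(np.linalg.solve(A, b)[0])           # e_1^T (X^T W X)^{-1} X^T W Y
    return np.array(out)

# Usage with simulated data; compared to the local average / NW estimator, the
# boundary points x = 0 and x = 1 are handled with much less bias.
rng = np.random.default_rng(3)
X = rng.uniform(0, 1, 300)
Y = np.sin(2 * np.pi * X) + 0.2 * rng.normal(size=300)
print(local_linear([0.0, 0.5, 1.0], X, Y, h=0.1))
```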

Theorem 3.5 (Fan (1992)) Suppose that $Y_i = m(X_i) + \sigma(X_i)\epsilon_i$ and $X_i\in K$ with $E(\epsilon_i) = 0$, $\mathrm{Var}(\epsilon_i) = 1$, and $X_1,\cdots,X_n\sim p$, and that we are interested in a point $x$ in the interior of $K$. Assume the following:

• p(x) > 0.

• $p$, $m''$, and $\sigma$ are continuous in a neighborhood of $x$.

• h → 0, nh → ∞.

Then the local linear smoother satisfies

$$\mathrm{bias}(\hat{m}_{LL}(x)) = \frac{h^2}{2}m''(x)\mu_K + o(h^2), \quad \mathrm{Var}(\hat{m}_{LL}(x)) = \frac{\sigma^2(x)}{p(x)\,nh}\sigma_K^2 + o\left(\frac{1}{nh}\right).$$

Namely, the local linear smoother does not suffer from the design bias. The above theorem is from the following paper:

Fan, J. (1992). Design-adaptive nonparametric regression. Journal of the American Statistical Association, 87(420), 998-1004.

3.5 Linear Smoother

Now we are going to introduce a very important notion called the linear smoother. Linear smoothers form a collection of regression estimators with many nice properties. A linear smoother is an estimator of the regression function of the form

$$\hat{m}(x) = \sum_{i=1}^n \ell_i(x) Y_i, \quad (3.9)$$

where $\ell_i(x)$ is some function depending on $X_1,\cdots,X_n$ but not on any of $Y_1,\cdots,Y_n$. The residual for the $j$-th observation can be written as

$$e_j = Y_j - \hat{m}(X_j) = Y_j - \sum_{i=1}^n \ell_i(X_j) Y_i.$$

Let $e = (e_1,\cdots,e_n)^T$ be the vector of residuals and define an $n\times n$ matrix $L$ with $L_{ij} = \ell_j(X_i)$:

$$L = \begin{pmatrix} \ell_1(X_1) & \ell_2(X_1) & \ell_3(X_1) & \cdots & \ell_n(X_1)\\ \ell_1(X_2) & \ell_2(X_2) & \ell_3(X_2) & \cdots & \ell_n(X_2)\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ \ell_1(X_n) & \ell_2(X_n) & \ell_3(X_n) & \cdots & \ell_n(X_n)\end{pmatrix}.$$

Then the vector of predicted values $\hat{Y} = (\hat{Y}_1,\cdots,\hat{Y}_n)^T = LY$, where $Y = (Y_1,\cdots,Y_n)^T$ is the vector of observed $Y_i$'s, and

$$e = Y - \hat{Y} = Y - LY = (I - L)Y.$$

Example: Linear Regression. For linear regression, let $\mathbb{X}$ denote the data matrix (the first column is all 1's and the second column is $X_1,\cdots,X_n$). We know that $\hat{\beta} = (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T Y$ and $\hat{Y} = \mathbb{X}\hat{\beta} = \mathbb{X}(\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T Y$. This implies that the matrix $L$ is

$$L = \mathbb{X}(\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T,$$

which is also the projection matrix in linear regression. Thus, linear regression is a linear smoother.

Example: Regressogram. The regressogram is also a linear smoother. Let $B_1,\cdots,B_M$ be the bins of the covariate and let $B(x)$ be the bin that $x$ belongs to. Then

$$\ell_j(x) = \frac{I(X_j\in B(x))}{\sum_{i=1}^n I(X_i\in B(x))}.$$

Example: Kernel Regression. As you may expect, the kernel regression is also a linear smoother. Recall from equation (3.5)

$$\hat{m}_h(x_0) = \frac{\sum_{i=1}^n Y_i K\left(\frac{x_0-X_i}{h}\right)}{\sum_{\ell=1}^n K\left(\frac{x_0-X_\ell}{h}\right)} = \sum_{i=1}^n W_i^K(x_0) Y_i, \quad W_i^K(x_0) = \frac{K\left(\frac{x_0-X_i}{h}\right)}{\sum_{\ell=1}^n K\left(\frac{x_0-X_\ell}{h}\right)},$$

so

$$\ell_j(x) = \frac{K\left(\frac{x-X_j}{h}\right)}{\sum_{\ell=1}^n K\left(\frac{x-X_\ell}{h}\right)}.$$

Example: Local Linear Smoother. You can easily show that the LPR is a linear smoother. In particular, the local linear smoother has the vector $\ell(x) = (\ell_1(x),\cdots,\ell_n(x))^T$ given by

$$\ell(x)^T = e_1^T(\mathbb{X}^T W(x)\mathbb{X})^{-1}\mathbb{X}^T W(x).$$

3.5.1 Variance of Linear Smoother

The linear smoother has an unbiased estimator of the underlying noise level $\sigma^2$ under the fixed design, i.e., when the covariates are non-random. Recall that the noise level is $\sigma^2 = \mathrm{Var}(\epsilon_i)$. We need a standard fact about the covariance of a linear transformation. For a fixed matrix $A$ and a random vector $X$,

$$\mathrm{Cov}(AX) = A\,\mathrm{Cov}(X)\,A^T.$$

Thus, the covariance matrix of the residual vector is

$$\mathrm{Cov}(e) = \mathrm{Cov}((I-L)Y) = (I-L)\,\mathrm{Cov}(Y)\,(I-L)^T.$$

Because the noises $\epsilon_1,\cdots,\epsilon_n$ are IID, $\mathrm{Cov}(Y) = \sigma^2 I_n$, where $I_n$ is the $n\times n$ identity matrix. This implies

$$\mathrm{Cov}(e) = (I-L)\,\mathrm{Cov}(Y)\,(I-L)^T = \sigma^2(I - L - L^T + LL^T).$$

Now taking the trace on both sides,

$$\mathrm{Tr}(\mathrm{Cov}(e)) = \sum_{i=1}^n \mathrm{Var}(e_i) = \sigma^2\,\mathrm{Tr}(I - L - L^T + LL^T) = \sigma^2(n - 2\nu + \tilde{\nu}),$$

where $\nu = \mathrm{Tr}(L)$ and $\tilde{\nu} = \mathrm{Tr}(LL^T)$. Because the squared residual is approximately $\mathrm{Var}(e_i)$, we have

$$\sum_{i=1}^n e_i^2 \approx \sum_{i=1}^n \mathrm{Var}(e_i) = \sigma^2(n - 2\nu + \tilde{\nu}).$$

Thus, we can estimate $\sigma^2$ by

$$\hat{\sigma}^2 = \frac{1}{n - 2\nu + \tilde{\nu}}\sum_{i=1}^n e_i^2, \quad (3.10)$$

which is what we did in equation (3.8). The quantity $\nu$ is called the degree of freedom. In the linear regression case, $\nu = \tilde{\nu} = p + 1$, where $p$ is the number of covariates, so the variance estimator is $\hat{\sigma}^2 = \frac{1}{n-p-1}\sum_{i=1}^n e_i^2$. If you have learned the variance estimator of linear regression, you should be familiar with this estimator. The degree of freedom $\nu$ is easy to interpret in linear regression, and the power of equation (3.10) is that it works for every linear smoother as long as the errors $\epsilon_i$ are IID. So it shows how we can define an effective degree of freedom for other, more complicated regression estimators.
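A minimal sketch of equation (3.10) for the kernel regression: build the smoothing matrix $L$ with $L_{ij} = \ell_j(X_i)$, read off $\nu = \mathrm{Tr}(L)$ and $\tilde{\nu} = \mathrm{Tr}(LL^T)$, and plug in the residuals. The Gaussian kernel, the bandwidth, and the simulated data are arbitrary choices for illustration.

```python
import numpy as np

def nw_smoother_matrix(X, h):
    """Smoothing matrix L of the Nadaraya-Watson estimator: L[i, j] = l_j(X_i)."""
    K = np.exp(-0.5 * ((X[:, None] - X[None, :]) / h) ** 2)
    return K / K.sum(axis=1, keepdims=True)    # each row sums to 1

def noise_level(X, Y, h):
    """Estimate sigma^2 via (3.10) using the residuals of the linear smoother."""
    L = nw_smoother_matrix(X, h)
    e = Y - L @ Y                              # residual vector e = (I - L) Y
    nu = np.trace(L)                           # nu = Tr(L), the degrees of freedom
    nu_tilde = np.trace(L @ L.T)               # nu_tilde = Tr(L L^T)
    return np.sum(e ** 2) / (len(Y) - 2 * nu + nu_tilde)

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, 300)
Y = np.sin(2 * np.pi * X) + 0.2 * rng.normal(size=300)
print(noise_level(X, Y, h=0.1))                # should be close to 0.2**2 = 0.04
```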

3.6 Basis Approach

Recall that we observe pairs $(X_1,Y_1),\cdots,(X_n,Y_n)$ and we are interested in the regression function $m(x) = E(Y_1|X_1 = x)$. In this section, we will make the following two assumptions:

• $Y_i = m(X_i) + \sigma\cdot\epsilon_i$, where $\epsilon_i\sim N(0,1)$ is the noise. Moreover, $\epsilon_1,\cdots,\epsilon_n$ are IID.

• $X_i = \frac{i}{n}$. Namely, the covariates form a uniform grid over $[0,1]$ and are non-random.

Similar to the basis approach for the density estimation problem, where we approximate the density function by a sum of basis functions with coefficients, we will approximate the regression function by a basis expansion:

$$m(x) = \sum_{j=1}^\infty \theta_j\phi_j(x),$$

where $\{\phi_1,\phi_2,\cdots\}$ is an orthonormal basis and $\theta_1,\theta_2,\cdots$ are the coefficients. Again, here we consider the cosine basis:

$$\phi_1(x) = 1, \quad \phi_j(x) = \sqrt{2}\cos((j-1)\pi x),\ j = 2,3,\cdots.$$

As is done in the density estimation, we will use only the top $M$ basis functions to form our estimator. Namely,

$$\hat{m}_M(x) = \sum_{j=1}^M \hat{\theta}_j\phi_j(x),$$

for some coefficient estimates $\hat{\theta}_1,\cdots,\hat{\theta}_M$. Again, $M$ is the tuning parameter of our estimator. Here is a simple choice of the coefficient estimates that we will be using:

$$\hat{\theta}_j = \frac{1}{n}\sum_{i=1}^n Y_i\phi_j(X_i) = \frac{1}{n}\sum_{i=1}^n Y_i\phi_j\left(\frac{i}{n}\right).$$
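A minimal sketch of the cosine-basis estimator with these coefficient estimates, assuming the fixed uniform design $X_i = i/n$; the simulated responses and the choice $M\approx n^{1/5}$ (anticipating the MISE analysis below) are illustrative.

```python
import numpy as np

def cosine_basis(x, M):
    """First M cosine basis functions at x: phi_1 = 1, phi_j = sqrt(2) cos((j-1) pi x)."""
    x = np.atleast_1d(x)
    j = np.arange(1, M + 1)
    phi = np.sqrt(2) * np.cos((j[None, :] - 1) * np.pi * x[:, None])
    phi[:, 0] = 1.0                                # phi_1(x) = 1
    return phi                                     # shape (len(x), M)

def basis_regression(x, Y, M):
    """m_hat_M(x) = sum_j theta_hat_j phi_j(x), theta_hat_j = (1/n) sum_i Y_i phi_j(i/n)."""
    n = len(Y)
    design = np.arange(1, n + 1) / n               # fixed design X_i = i/n
    theta_hat = cosine_basis(design, M).T @ Y / n  # coefficient estimates
    return cosine_basis(x, M) @ theta_hat

# Illustration with simulated data and M ~ n^{1/5}:
rng = np.random.default_rng(5)
n = 500
Y = np.sin(2 * np.pi * np.arange(1, n + 1) / n) + 0.3 * rng.normal(size=n)
print(basis_regression(np.linspace(0, 1, 5), Y, M=max(3, int(n ** 0.2))))
```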

To determine the tuning parameter $M$, we analyze the MISE. We start by analyzing the bias and variance of $\hat{\theta}_j$.

3.6.1 Asymptotic theory

Asymptotic normality. Note that the estimator can be rewritten as

$$\hat{m}_M(x) = \sum_{j=1}^M \hat{\theta}_j\phi_j(x) = \sum_{j=1}^M \frac{1}{n}\sum_{i=1}^n Y_i\phi_j\left(\frac{i}{n}\right)\phi_j(x) = \frac{1}{n}\sum_{i=1}^n Y_i\sum_{j=1}^M\phi_j\left(\frac{i}{n}\right)\phi_j(x).$$

Thus, for $M$ fixed, we have

$$\sqrt{n}\left(\hat{m}_M(x) - E(\hat{m}_M(x))\right) \overset{D}{\to} N(0, \sigma_M^2)$$

for some $\sigma_M^2$. Note that our analysis later will show that

$$E(\hat{m}_M(x)) = \sum_{j=1}^M\theta_j\phi_j(x), \quad \sigma_M^2 = \sigma^2\sum_{j=1}^M\phi_j^2(x).$$

Bias.

$$\begin{aligned}
\mathrm{bias}(\hat{\theta}_j) &= E(\hat{\theta}_j) - \theta_j\\
&= E\left(\frac{1}{n}\sum_{i=1}^n Y_i\phi_j\left(\frac{i}{n}\right)\right) - \theta_j\\
&= \frac{1}{n}\sum_{i=1}^n E\left(Y_i\,\Big|\,X_i = \frac{i}{n}\right)\phi_j\left(\frac{i}{n}\right) - \theta_j\\
&= \frac{1}{n}\sum_{i=1}^n m\left(\frac{i}{n}\right)\phi_j\left(\frac{i}{n}\right) - \theta_j\\
&= \frac{1}{n}\sum_{i=1}^n m\left(\frac{i}{n}\right)\phi_j\left(\frac{i}{n}\right) - \int_0^1 m(x)\phi_j(x)\,dx.
\end{aligned}$$

Namely, the bias is the difference between the actual integral and a discretized version of the integral. We know that when $n$ is large, the two are almost the same, so we can ignore the bias. Thus, we will write $\mathrm{bias}(\hat{\theta}_j) = 0$

for simplicity.

Variance.

$$\begin{aligned}
\mathrm{Var}(\hat{\theta}_j) &= \mathrm{Var}\Bigg(\frac{1}{n}\sum_{i=1}^n\underbrace{\left(m\left(\frac{i}{n}\right) + \sigma\cdot\epsilon_i\right)}_{=Y_i}\phi_j\left(\frac{i}{n}\right)\Bigg)\\
&= \frac{1}{n^2}\sum_{i=1}^n\phi_j^2\left(\frac{i}{n}\right)\mathrm{Var}(\epsilon_i)\,\sigma^2\\
&= \frac{\sigma^2}{n^2}\sum_{i=1}^n\phi_j^2\left(\frac{i}{n}\right).
\end{aligned}$$

Note that $\frac{1}{n}\sum_{i=1}^n\phi_j^2\left(\frac{i}{n}\right) \approx \int_0^1\phi_j^2(x)\,dx = 1$. For simplicity, we just write

$$\mathrm{Var}(\hat{\theta}_j) = \frac{\sigma^2}{n}.$$

MISE. To analyze the MISE, we first note that the bias of $\hat{m}_M(x)$ is

$$\mathrm{bias}(\hat{m}_M(x)) = E(\hat{m}_M(x)) - m(x) = \sum_{j=1}^M\theta_j\phi_j(x) - \sum_{j=1}^\infty\theta_j\phi_j(x) = -\sum_{j=M+1}^\infty\theta_j\phi_j(x).$$

This further implies that the integrated squared bias is

$$\int_0^1 \mathrm{bias}^2(\hat{m}_M(x))\,dx = \int_0^1\sum_{j=M+1}^\infty\theta_j\phi_j(x)\sum_{\ell=M+1}^\infty\theta_\ell\phi_\ell(x)\,dx = \sum_{j=M+1}^\infty\sum_{\ell=M+1}^\infty\theta_j\theta_\ell\underbrace{\int_0^1\phi_j(x)\phi_\ell(x)\,dx}_{=I(j=\ell)} = \sum_{j=M+1}^\infty\theta_j^2.$$

Again, if we assume that $m$ satisfies $\int_0^1|m''(x)|^2\,dx < \infty$, we have

$$\sum_{j=M+1}^\infty\theta_j^2 = O(M^{-4}).$$

Now we turn to the variance.

$$\begin{aligned}
\mathrm{Var}(\hat{m}_M(x)) &= \mathrm{Var}\left(\sum_{j=1}^M\hat{\theta}_j\phi_j(x)\right)\\
&= \sum_{j=1}^M\mathrm{Var}(\hat{\theta}_j)\phi_j^2(x) + \sum_{j\ne k}\mathrm{Cov}(\hat{\theta}_j,\hat{\theta}_k)\phi_j(x)\phi_k(x)\\
&= \frac{\sigma^2}{n}\sum_{j=1}^M\phi_j^2(x) + \sum_{j\ne k}\mathrm{Cov}(\hat{\theta}_j,\hat{\theta}_k)\phi_j(x)\phi_k(x).
\end{aligned}$$

After integration, $\int_0^1\phi_j(x)\phi_k(x)\,dx = 0$ for $j\ne k$, so the integrated variance is

$$\int_0^1\mathrm{Var}(\hat{m}_M(x))\,dx = \frac{\sigma^2}{n}\sum_{j=1}^M\int_0^1\phi_j^2(x)\,dx = \frac{\sigma^2 M}{n} = O\left(\frac{M}{n}\right).$$

Recalling that the MISE is just the sum of the integrated squared bias and the integrated variance, we obtain

$$\mathrm{MISE}(\hat{m}_M) = \int_0^1\mathrm{bias}^2(\hat{m}_M(x))\,dx + \int_0^1\mathrm{Var}(\hat{m}_M(x))\,dx = O(M^{-4}) + O\left(\frac{M}{n}\right).$$

Thus, the optimal choice is $M^* \asymp n^{1/5}$.

3.6.2 Basis approach as a linear smoother

The basis estimator is another linear smoother. To see this, we use the following expansion:

$$\hat{m}_M(x) = \sum_{j=1}^M\hat{\theta}_j\phi_j(x) = \sum_{j=1}^M\frac{1}{n}\sum_{i=1}^n Y_i\phi_j(X_i)\phi_j(x) = \sum_{i=1}^n\left(\sum_{j=1}^M\frac{1}{n}\phi_j(X_i)\phi_j(x)\right)Y_i = \sum_{i=1}^n\ell_i(x)Y_i,$$

where $\ell_i(x) = \sum_{j=1}^M\frac{1}{n}\phi_j(X_i)\phi_j(x)$. Recall from the linear smoother theory that we can estimate $\sigma^2$ using the residuals and the degrees of freedom:

$$\hat{\sigma}^2 = \frac{1}{n-2\nu+\tilde{\nu}}\sum_{i=1}^n e_i^2,$$

where $e_i = \hat{Y}_i - Y_i = \hat{m}_M(X_i) - Y_i$ and $\nu, \tilde{\nu}$ are the degrees of freedom (see Section 3.5.1). With this variance estimator, the fact that $\mathrm{Var}(\hat{m}_M(x)) = \frac{\sigma^2}{n}\sum_{j=1}^M\phi_j^2(x)$, and the asymptotic normality, we can construct a confidence interval (band) of $m$ using

$$\hat{m}_M(x) \pm z_{1-\alpha/2}\sqrt{\frac{\hat{\sigma}^2}{n}\sum_{j=1}^M\phi_j^2(x)}.$$

Note that this confidence interval is valid for $E(\hat{m}_M(x)) = \sum_{j=1}^M\theta_j\phi_j(x)$, not the actual $m(x)$. The difference between them is the bias of our estimator.

3.7 Regression Tree

In this section, we assume that the covariate may have multiple dimensions, i.e., $x = (x_1,\cdots,x_d)$, and our data are $(X_1,Y_1),\cdots,(X_n,Y_n)\sim P$ for some CDF $P$. Again, we are interested in the regression function $m(x) = E(Y_1|X_1 = x)$. A regression tree constructs an estimator of the form

$$m(x) = \sum_{\ell=1}^M c_\ell I(x\in R_\ell),$$

where the $R_\ell$'s form a rectangular partition of the covariate space. Here is an example of a regression tree and its splits. In this example, there are two covariates (namely, $d = 2$) and we have 3 regions $R_1, R_2, R_3$:

R1 = {(x1, x2): x1 < 10, x2 < 5},R2 = {(x1, x2): x1 < 10, x2 ≥ 5},R3 = {(x1, x2): x1 ≥ 10}.

[Figure: a tree that first splits on $x_1 < 10$ versus $x_1 \ge 10$ (the latter giving $R_3$) and then splits on $x_2 < 5$ versus $x_2 \ge 5$ (giving $R_1$ and $R_2$), together with the corresponding rectangular partition of the $(x_1, x_2)$ plane.]

A regression tree estimator will predict the same value of the response Y within the same area of the covariate. Namely, m(x) will be the same when x is within the same area.

To use a regression tree, there are $2M$ quantities to be determined: the regions $R_1,\cdots,R_M$ and the predicted values $c_1,\cdots,c_M$. When $R_1,\cdots,R_M$ are given, $c_1,\cdots,c_M$ can simply be estimated by the average within each region, i.e.,

$$\hat{c}_\ell = \frac{\sum_{i=1}^n Y_i I(X_i\in R_\ell)}{\sum_{i=1}^n I(X_i\in R_\ell)}.$$

Thus, the difficult part is the determination of $R_1,\cdots,R_M$. Unfortunately, there is no simple closed-form solution for these regions; we only have a procedure for computing them. Here is what we do in practice. Let $X_{ij}$ be the $j$-th coordinate of the $i$-th observation $X_i$.

1. For a given j, we define

$$R_a(j, s) = \{x: x_j < s\}, \quad R_b(j, s) = \{x: x_j \ge s\}.$$

2. Find $c_a$ and $c_b$ that minimize

$$\sum_{X_i\in R_a}(Y_i - c_a)^2, \quad \sum_{X_i\in R_b}(Y_i - c_b)^2,$$

respectively.

3. Compute the score

$$S(j, s) = \sum_{X_i\in R_a}(Y_i - c_a)^2 + \sum_{X_i\in R_b}(Y_i - c_b)^2.$$

4. Change $s$ and repeat the same calculation until we find the minimizer of $S(j, s)$; denote the minimal score by $S^*(j)$.

5. Compute the score $S^*(j)$ for $j = 1,\cdots,d$.

6. Pick the dimension (coordinate) and the corresponding split point $s$ that has the minimal score $S^*(j)$. Partition the space into two parts according to this split.

7. Repeat the above procedure for each partition until a certain stopping criterion is satisfied.

Using the above procedure, we will eventually end up with a collection of rectangular partitions $\hat{R}_1,\cdots,\hat{R}_M$. Then the final estimator is

$$\hat{m}(x) = \sum_{\ell=1}^M\hat{c}_\ell I(x\in\hat{R}_\ell).$$
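A minimal sketch of the greedy single-split search in steps 1-6 above; growing a full tree applies it recursively to each resulting region, and everything here (names, candidate split points) is an illustrative assumption.

```python
import numpy as np

def best_split(X, Y):
    """Search over coordinates j and split points s for the minimizer of
    S(j, s) = sum_{X_i in R_a}(Y_i - c_a)^2 + sum_{X_i in R_b}(Y_i - c_b)^2,
    where c_a, c_b are the region averages. X has shape (n, d)."""
    best_j, best_s, best_score = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j])[1:]:          # candidate split points along coordinate j
            left = X[:, j] < s                    # R_a(j, s); its complement is R_b(j, s)
            c_a, c_b = Y[left].mean(), Y[~left].mean()
            score = np.sum((Y[left] - c_a) ** 2) + np.sum((Y[~left] - c_b) ** 2)
            if score < best_score:                # keep the minimizer of S(j, s)
                best_j, best_s, best_score = j, s, score
    return best_j, best_s, best_score

# Growing a tree repeats best_split on each resulting region (step 7) until a
# stopping criterion, e.g. the penalized score discussed next, stops improving.
```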

For the stopping criterion, sometimes people pick a number $M$ so that the splitting procedure stops as soon as we obtain $M$ regions. However, such a choice of $M$ is rather arbitrary. A popular alternative is to stop the splitting based on minimizing a score that balances the fitting quality and the complexity of the tree. For instance, we may stop splitting if the following score is no longer decreasing:

$$C_{\lambda,n}(M) = \frac{1}{n}\sum_{i=1}^n(Y_i - \hat{m}(X_i))^2 + \lambda M,$$

where $\lambda > 0$ is a tuning parameter that determines the 'penalty' for having a complex tree. In the next lecture, we will talk more about this type of penalty tuning parameter.

Cross-validation. How do we choose the tuning parameter $\lambda$? There is a simple approach called cross-validation⁴ that can compute a good choice of this quantity. Not only $\lambda$; other tuning parameters, such as the number of basis functions $M$, the smoothing bandwidth $h$, and the bin size $b$, can be chosen using cross-validation.

Remark.

• Interpretation. A powerful feature of the regression tree is that it is easy to interpret. Even without much training, a practitioner can use the output from a regression tree very easily. A limitation of the regression tree is that it partitions the space of covariates into rectangular regions, which may be unrealistic for the actual regression model.

⁴ https://en.wikipedia.org/wiki/Cross-validation_(statistics)

• MARS (multivariate adaptive regression splines). The regression tree has another limitation: it predicts the same value within the same region, which creates a jump at the boundary of two consecutive regions. There is a modified regression tree called MARS (multivariate adaptive regression splines) that allows a continuous (and possibly smooth) change across regions. See https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_splines.

3.8 Penalized regression

In the regression tree, we talked about the case where we want to select the number of leaves $M$ based on the following criterion:

$$C_{\lambda,n}(M) = \underbrace{\frac{1}{n}\sum_{i=1}^n(Y_i - \hat{m}(X_i))^2}_{\text{fitting to the data}} + \underbrace{\lambda M}_{\text{penalty on the complexity}}. \quad (3.11)$$

It turns out that this type of criterion is very general in regression analysis, because we want to avoid the problem of overfitting. Overfitting means fitting a model that is too complex to the data, so that although the fitted curve is close to most of the observations, the actual prediction is very bad. For instance, the following picture shows the fitted results using a smoothing/cubic spline (here the quantity spar is related to $\lambda$):

This data is generated from a sine function plus a small noise. When $\lambda$ is too small (orange curve), we fit a very complicated model to the data, which does not capture the right structure. On the other hand, when $\lambda$ is too large (green curve), we fit a model that is too simple (a straight line), which is also bad at predicting the actual outcome. When $\lambda$ is too small, it is called overfitting (orange curve), whereas when $\lambda$ is too large, it is called underfitting (green curve). In fact, overfitting is similar to undersmoothing and underfitting is similar to oversmoothing. In regression analysis, people prefer to use overfitting and underfitting to describe the outcome, while in density estimation, people prefer to use undersmoothing and oversmoothing.

Finding a regression estimator using a criterion consisting of a fit to the data plus a penalty on the complexity is called penalized regression. In the case of the regression tree, let

$$\mathcal{M}_{\mathrm{Tree}} = \{\text{all possible regression trees}\}$$

be the collection of all possible regression trees. We can rewrite equation (3.11) as

$$\hat{m}_{\mathrm{Tree}} = \mathrm{argmin}_{m\in\mathcal{M}_{\mathrm{Tree}}}\,\frac{1}{n}\sum_{i=1}^n(Y_i - m(X_i))^2 + P_\lambda(m),$$

where $P_\lambda(m) = \lambda\times(\text{number of regions in } m)$. Thus, with the penalty on the number of regions, the regression tree is a penalized regression approach. Any penalized regression approach admits an abstract expression of the form:

$$\hat{m} = \mathrm{argmin}_{m\in\mathcal{M}}\,\frac{1}{n}\sum_{i=1}^n(Y_i - m(X_i))^2 + P_\lambda(m), \quad (3.12)$$

where $\mathcal{M}$ is a collection of regression estimators, $P_\lambda(m)$ is the amount of penalty imposed on a regression estimator $m\in\mathcal{M}$, and $\lambda$ is a tuning parameter that determines the amount of penalty. A penalized regression always has a fitting part (e.g., $\frac{1}{n}\sum_{i=1}^n(Y_i - m(X_i))^2$) and a penalty part (also called the regularization part) $P_\lambda(m)$. The fitting part makes sure the model fits the data well, while the penalty part guarantees that the model is not too complex. Thus, penalized regression often leads to a simple model with a good fit to the data.

3.9 Spline

The smoothing spline is a famous example of a penalized regression method. Here we consider the case of univariate regression (i.e., the covariate $X$ is univariate or, equivalently, $d = 1$) and focus on the case where the covariates belong to $[0,1]$. Namely, our data are $(X_1,Y_1),\cdots,(X_n,Y_n)$ with $X_i\in[0,1]\subset\mathbb{R}$ for each $i$.

Let $\mathcal{M}_2$ denote the collection of all univariate functions with a second derivative on $[0,1]$. The cubic (smoothing) spline is the estimator

$$\hat{m} = \mathrm{argmin}_{m\in\mathcal{M}_2}\,\frac{1}{n}\sum_{i=1}^n(Y_i - m(X_i))^2 + \lambda\int_0^1|m''(x)|^2\,dx. \quad (3.13)$$

In the cubic spline, the penalty function is $\lambda\int_0^1|m''(x)|^2\,dx$, which imposes a restriction on the smoothness: the curve $m(x)$ cannot change too drastically, otherwise the second derivative will be large. Thus, the cubic spline leads to a smooth curve that still fits the data well.

Why is the estimator $\hat{m}$ called a cubic spline? Because it turns out that $\hat{m}$ is a piecewise polynomial function (a spline) of degree 3. Namely, there exist knots $\tau_1 < \cdots < \tau_K$ such that for $x\in(\tau_k,\tau_{k+1})$,

$$\hat{m}(x) = \gamma_{0,k} + \gamma_{1,k}x + \gamma_{2,k}x^2 + \gamma_{3,k}x^3,$$

for some $\gamma_{0,k},\cdots,\gamma_{3,k}$, with the restriction that $\hat{m}(x)$ has continuous second derivatives at each knot. In the case of the cubic spline, it turns out that the knots are just the data points.

The representation of a cubic spline is often done using basis functions. Here we will introduce a simple basis called the truncated power basis. Let $X_{(1)} < X_{(2)} < \cdots < X_{(n)}$ be the order statistics of $X_1,\cdots,X_n$. In the cubic spline, the knots are

$$\tau_1 = X_{(1)},\ \tau_2 = X_{(2)},\ \cdots,\ \tau_n = X_{(n)}.$$

The truncated power basis uses a collection of functions

$$h_1(x) = 1,\ h_2(x) = x,\ h_3(x) = x^2,\ h_4(x) = x^3,$$

and

$$h_j(x) = (x - \tau_{j-4})_+^3,\ j = 5, 6, \cdots, n+4,$$

where $(x)_+ = \max\{x, 0\}$. Then the estimator $\hat{m}$ can be written as

$$\hat{m}(x) = \sum_{j=1}^{n+4}\hat{\beta}_j h_j(x),$$

for some properly chosen $\hat{\beta}_j$.

How do we compute $\hat{\beta}_1,\cdots,\hat{\beta}_{n+4}$? They should be chosen using equation (3.13). Here is how we compute them. Define an $n\times(n+4)$ matrix $H$ such that

$$H_{ij} = h_j(X_i)$$

and an $(n+4)\times(n+4)$ matrix $\Omega$ with

$$\Omega_{ij} = \int_0^1 h_i''(x)h_j''(x)\,dx.$$

In this case, we write $m(x) = \sum_{j=1}^{n+4}\beta_j h_j(x)$, so the criterion on the right-hand side of (3.13) becomes

$$\begin{aligned}
\frac{1}{n}\sum_{i=1}^n(Y_i - m(X_i))^2 + \lambda\int_0^1|m''(x)|^2\,dx
&= \frac{1}{n}\sum_{i=1}^n\left(Y_i - \sum_{j=1}^{n+4}\beta_j h_j(X_i)\right)^2 + \lambda\int_0^1\left(\sum_{j=1}^{n+4}\beta_j h_j''(x)\right)\left(\sum_{\ell=1}^{n+4}\beta_\ell h_\ell''(x)\right)dx\\
&= \|Y - H\beta\|^2 + \lambda\beta^T\Omega\beta = R_n(\beta),
\end{aligned}$$

where $Y = (Y_1,\cdots,Y_n)$ and $\beta = (\beta_1,\cdots,\beta_{n+4})$. Thus,

$$\hat{\beta} = \mathrm{argmin}_\beta R_n(\beta) = (H^TH + \lambda\Omega)^{-1}H^TY.$$
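A minimal sketch of this computation with the truncated power basis; here $\Omega$ is approximated by a Riemann sum of the second derivatives on a fine grid, which is just one convenient (and numerically naive) way to obtain it, and the simulated data and $\lambda$ are illustrative. (As noted in the remarks below, the truncated power basis is ill-conditioned for large $n$; B-splines are preferred in practice.)

```python
import numpy as np

def tp_basis(x, knots):
    """Truncated power basis h_1, ..., h_{n+4} evaluated at x."""
    x = np.atleast_1d(x)
    poly = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])
    trunc = np.clip(x[:, None] - knots[None, :], 0, None) ** 3    # (x - tau_j)_+^3
    return np.hstack([poly, trunc])

def tp_basis_dd(x, knots):
    """Second derivatives of the truncated power basis."""
    x = np.atleast_1d(x)
    zero = np.zeros_like(x)
    poly = np.column_stack([zero, zero, 2 * np.ones_like(x), 6 * x])
    trunc = 6 * np.clip(x[:, None] - knots[None, :], 0, None)     # 6 (x - tau_j)_+
    return np.hstack([poly, trunc])

def smoothing_spline(x, X, Y, lam, n_grid=2000):
    knots = np.sort(X)                                 # knots at the data points
    H = tp_basis(X, knots)                             # H_ij = h_j(X_i)
    grid = np.linspace(0, 1, n_grid)
    Hdd = tp_basis_dd(grid, knots)
    Omega = (Hdd.T @ Hdd) / n_grid                     # Omega_ij ~ int h_i'' h_j'' dx
    beta = np.linalg.solve(H.T @ H + lam * Omega, H.T @ Y)
    return tp_basis(x, knots) @ beta

rng = np.random.default_rng(6)
X = np.sort(rng.uniform(0, 1, 30))                     # small n: this basis is ill-conditioned
Y = np.sin(2 * np.pi * X) + 0.2 * rng.normal(size=30)
print(smoothing_spline(np.linspace(0, 1, 5), X, Y, lam=1e-3))
```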

Given a point $x$, let $H(x) = (h_1(x), h_2(x),\cdots,h_{n+4}(x))$ be an $(n+4)$-dimensional vector. Then the predicted value $\hat{m}(x)$ has a simple form:

$$\hat{m}(x) = H^T(x)\hat{\beta} = H^T(x)(H^TH + \lambda\Omega)^{-1}H^TY = \sum_{i=1}^n\ell_i(x)Y_i,$$

where

$$\ell_i(x) = H^T(x)(H^TH + \lambda\Omega)^{-1}H^Te_i,$$

with $e_i = (0,\cdots,0,1,0,\cdots,0)$ being the unit vector with a 1 in the $i$-th coordinate. Therefore, the cubic spline is again a linear smoother. Note that when the sample size $n$ is large, the spline estimator behaves like a kernel regression in the sense that

$$\ell_i(x) \approx \frac{1}{n\,p(X_i)h(X_i)}K\left(\frac{X_i - x}{h(X_i)}\right)$$

with

$$h(x) = \left(\frac{\lambda}{np(x)}\right)^{1/4}, \quad K(x) = \frac{1}{2}\exp\left(-\frac{|x|}{\sqrt{2}}\right)\sin\left(\frac{|x|}{\sqrt{2}} + \frac{\pi}{4}\right).$$

This is formally stated in the following paper:

Silverman, B. W. (1984). Spline smoothing: the equivalent variable kernel method. The Annals of Statistics, 12(3), 898-916.

Remark.

• Regression spline. In the case where we use the spline basis to do regression but without a penalty and with a smaller number of knots (and we allow the knots to be at non-data points), the resulting estimator is called a regression spline. Namely, a regression spline is an estimator of the form $\hat{m}(x) = \sum_{j=1}^M\hat{\beta}_j h_j(x)$, where $\hat{\beta}_1,\cdots,\hat{\beta}_M$ are determined by minimizing

2 n  M  1 X X Y − β h (X ) . n  i j j i  i=1 j=1

Using our notations, the regression spline can be written as

$$\hat{m}(x) = H^T(x)\hat{\beta} = H^T(x)(H^TH)^{-1}H^TY.$$

• B-spline basis. There are other bases that can be used to construct a spline estimator. One of the most famous is the B-spline basis. This basis is defined recursively, so we will not go into the details here. If you are interested, you can check https://cran.r-project.org/web/packages/crs/vignettes/spline_primer.pdf. The advantage of using a B-spline basis is computational.

• M-th order spline. There are higher-order splines. If we modify the optimization criterion to

$$\frac{1}{n}\sum_{i=1}^n(Y_i - m(X_i))^2 + \lambda\int_0^1|m^{(\beta)}(x)|^2\,dx,$$

where $m^{(\beta)}$ denotes the $\beta$-th derivative, then the estimator is called a $(\beta+1)$-th order spline. As you may expect, we can construct a truncated power basis using polynomials up to order $\beta+1$. Namely, we will use $1, x, x^2,\cdots,x^{\beta+1}$ and knots to construct the basis.