Nonparametric Regression Handout
Damon Clark (ECO245)
April 2014

This handout is designed to walk you through the important aspects of nonparametric regression using the same example throughout. It obviously doesn't substitute for a more rigorous treatment of non-parametric regression. For that, I would recommend starting with Chapter 9 of Cameron and Trivedi (the best textbook treatment in my view) and following the references in there.

1: Motivation

Suppose:

1. We have data on Y and W.
2. We know the data were generated by the model Y = g(W) + u, with E[u | W] = 0.
3. We don't know the precise form of g(.) (e.g., whether it's g(W) = 1 + 2W or g(W) = ln(W), etc.).
4. We want to estimate (for example) E[Y | W = 1] = g(1).

How do we go about getting an estimate of g(1)?

1.1: Example

In general we don't know g(.), but to consider the performance of different strategies, let's suppose g(W) = ln(W), such that g(1) = 0. So the question is how close different estimates of g(1) are to the true value (= 0).

2: Parametric methods

Suppose we decide to proxy the unknown g(.) with a low-order polynomial (e.g., 4th order). Then we could get an estimate, call it \hat{g}(1), in one of two equivalent ways:

1. Regress Y on W, W^2, W^3, W^4, then use \hat{g}(1) = \hat{\beta}_0 + \hat{\beta}_1 \cdot 1 + \hat{\beta}_2 \cdot 1^2 + \hat{\beta}_3 \cdot 1^3 + \hat{\beta}_4 \cdot 1^4.
2. Define V = W - 1, then regress Y on V, V^2, V^3, V^4, then use \hat{g}(1) = \hat{\beta}_0.

2.1: How well does the parametric model do?

Look at the fitted values from one simulation:

clear
set seed 123456
set obs 10000
**** SPECIFY DGP ****
** W
gen w=100*(runiform())
assert w>0 & w<100
** g(W) FUNCTION
gen g=ln(w)
** ERROR TERM
gen u=rnormal(0,1)
** OUTCOME
gen y=g+u
**** ESTIMATE USING PARAMETRIC MODEL ****
gen w2=w*w
gen w3=w2*w
gen w4=w3*w
reg y w w2 w3 w4
predict yhat
graph twoway (scatter yhat w), yline(0) xline(1)

It doesn't look good: this model can't capture the curvature in ln(W) around 1, so we would expect \hat{g}(1) > 0 (i.e., too large).

2.2: Monte Carlo Analysis

The above analysis suggests the parametric model will give bad estimates of g(1). But that was for one draw of the data. Perhaps if we take many draws, hence generate many estimates, these will look OK "on average". We use Monte Carlo exercises to perform these types of analyses. The following code generates 100 estimates of g(1):

capture program drop cbrm_sim
program define cbrm_sim, rclass
syntax [, obs(integer 1)]
drop _all
set obs `obs'
** DGP
gen u=rnormal(0,1)
gen w=100*runiform()
gen y=ln(w)+u
** PARAMETRIC REGRESSION MODEL
gen v=w-1
gen v2=v*v
gen v3=v2*v
gen v4=v3*v
reg y v v2 v3 v4
gen g_para=_b[_cons]
end

simulate g_para=g_para, seed(123456) reps(100): cbrm_sim, obs(10000)
graph twoway (kdensity g_para)

The estimates are centered around 0.5 (roughly) and the distribution lies well to the right of the true value (0). So these parametric estimates do badly.

3: Non-parametric methods: discrete W

Suppose W is discrete, e.g., W \in {1, 2, ..., 100}. Then a good estimate of g(1) will be \bar{Y}(1), the average Y among observations with W = 1. We can compute \bar{Y}(1) using one of two equivalent methods (a short sketch verifying the equivalence appears below, just before the Monte Carlo in 3.1):

1. Calculate

\bar{Y}(1) = \frac{\sum_i Y_i 1(W_i = 1)}{\sum_i 1(W_i = 1)}

2. Regress Y on D2, ..., D100, where Dj is a dummy that takes value 1 if W = j, zero otherwise. Then \hat{\beta}_0 = \bar{Y}(1). It's also true that \hat{\beta}_j = \bar{Y}(j) - \bar{Y}(1).

How do we know that \bar{Y}(1) will be a good estimate?

1. Because we can write E[Y | W] such that every value of E[Y | W] is associated with a parameter (i.e., the model is saturated). Hence E[Y | W = 1] = \beta_0, and so \hat{\beta}_0 must be a good estimate (by the good properties of OLS).
2. We can do a Monte Carlo exercise.
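To make the equivalence concrete, here is a minimal sketch (mine, not from the handout) that computes \bar{Y}(1) both ways on one draw of discrete-W data. It uses Stata's factor-variable syntax (ib1.w) as a shortcut for building the 99 dummies by hand (Section 3.1 does that explicitly); the scalar names ybar1 and b0 are mine:

clear
set seed 123456
set obs 10000
** DISCRETE DGP (AS IN SECTION 3.1 BELOW)
gen u=rnormal(0,1)
gen w=ceil(100*runiform())
gen y=ln(w)+u
** METHOD 1: CELL MEAN OF Y AMONG OBSERVATIONS WITH W=1
su y if w==1
scalar ybar1=r(mean)
** METHOD 2: CONSTANT FROM THE SATURATED DUMMY REGRESSION (BASE CATEGORY W=1)
reg y ib1.w
scalar b0=_b[_cons]
** THE TWO SHOULD AGREE UP TO MACHINE PRECISION
assert abs(ybar1-b0)<1e-6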
3.1: Monte Carlo

Let's change the DGP so that W takes on values 1, 2, ..., 100 (i.e., is discrete):

capture program drop cbrm_sim
program define cbrm_sim, rclass
syntax [, obs(integer 1)]
drop _all
set obs `obs'
** DGP
gen u=rnormal(0,1)
gen w=ceil(100*runiform())
unique w
assert r(sum)==100
gen y=ln(w)+u
** NONPARAMETRIC (SATURATED) REGRESSION MODEL
forvalues i=2(1)100 {
gen d`i'=1 if w==`i'
recode d`i' .=0
}
reg y d2-d100
gen g_nonpara_discrete=_b[_cons]
end

simulate g_nonpara_discrete=g_nonpara_discrete, seed(123456) reps(100): cbrm_sim, obs(10000)
graph twoway (kdensity g_nonpara_discrete)

As expected, the estimates are centered around zero. Hence with discrete W, it's easy to get "good" nonparametric estimates: just saturate the model.

4: Non-parametric methods: continuous W

It's harder to get good estimates of g(1) when W is continuous. Many possible approaches:

1. Local polynomial regression (locpoly in Stata): use a weighted average of observations in a bin of a certain size around the point of interest.
2. Nearest-neighbor smoothing (lowess/ksm in Stata): use a weighted average of a certain number of observations around the point of interest.
3. Series methods (use all data).

Focus on 1 (most commonly used).

4.1: Local Polynomial Regression

Basic idea: for values of W close to the point of interest W_0, g(W) can be approximated via a Taylor expansion:

g(W) \approx g(W_0) + g^{(1)}(W_0)(W - W_0) + \frac{g^{(2)}(W_0)}{2!}(W - W_0)^2 + \ldots + \frac{g^{(p)}(W_0)}{p!}(W - W_0)^p
     = \beta_0 + \beta_1 (W - W_0) + \beta_2 (W - W_0)^2 + \ldots + \beta_p (W - W_0)^p

So in my example:

g(W) = \beta_0 + \beta_1 (W - 1) + \beta_2 (W - 1)^2 + \ldots + \beta_p (W - 1)^p

Now our estimate of g(1) will involve (a) estimating the parameters \beta_0, \beta_1, etc., and (b) using \hat{g}(1) = \hat{\beta}_0. But how do we estimate the parameters?

1. An obvious approach is to choose the parameters that minimize the following function (which is the sum of squared residuals, hence the parameters will be the coefficients from "regress Y on (W - 1), (W - 1)^2, etc."):

\sum_i [Y_i - \hat{\beta}_0 - \hat{\beta}_1 (W_i - 1) - \hat{\beta}_2 (W_i - 1)^2 - \ldots - \hat{\beta}_p (W_i - 1)^p]^2

2. More generally, we can choose the parameters that minimize the following function:

\sum_i [Y_i - \hat{\beta}_0 - \hat{\beta}_1 (W_i - 1) - \hat{\beta}_2 (W_i - 1)^2 - \ldots - \hat{\beta}_p (W_i - 1)^p]^2 K_i

where K_i is a kernel function that controls how much weight is given to each observation i.

There are two choices involved in doing this exercise:

1. The order of the polynomial p. E.g., do we use only (W_i - 1) and (W_i - 1)^2, or do we also use (W_i - 1)^3?
2. The K_i function (i.e., "kernel function", aka "kernel"). Note:
(a) The kernel controls which observations are used and how they are weighted.
(b) The kernel must satisfy some properties (for nonparametric estimators to have some "good" properties).
(c) The kernel generally depends on h (labeled the "bandwidth"), which often determines which observations are actually used.
(d) The simplest kernel is the "rectangular kernel", aka "uniform kernel":

K_i = \frac{1}{2} 1\left( \left| \frac{W_i - W_0}{h} \right| < 1 \right)

In effect, this says "use all observations within distance h of W_0 and weight them equally".
(e) A popular alternative kernel is the "Gaussian kernel":

K_i = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} \left( \frac{W_i - W_0}{h} \right)^2}

In effect, this says "use all observations of W but give more weight to those closest to W_0".

In what follows we'll restrict attention to the rectangular kernel. We'll consider various bandwidths. We'll consider two orders (the most common choices; a short sketch constructing both kernel weights by hand follows this list):

1. Order=0 (aka "Kernel Regression", aka "Nadaraya-Watson Regression").
2. Order=1 (aka "Local Linear Regression").
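To make the weighting concrete, here is a minimal sketch (mine, not from the handout) that builds both kernel weights by hand at W_0 = 1 with h = 0.1. It assumes the continuous-W variables w and y from the earlier DGP are in memory; the names k_rect and k_gauss are mine:

** SKETCH: KERNEL WEIGHTS BY HAND AT W0=1, h=0.1 (ILLUSTRATIVE ONLY)
local W0=1
local h=0.1
** RECTANGULAR: WEIGHT 1/2 INSIDE THE WINDOW, 0 OUTSIDE
gen k_rect=0.5*(abs((w-`W0')/`h')<1)
** GAUSSIAN: POSITIVE WEIGHT EVERYWHERE, DECLINING WITH DISTANCE FROM W0
gen k_gauss=normalden((w-`W0')/`h')
** WITH ORDER=0, THE ESTIMATE IS THE KERNEL-WEIGHTED MEAN OF Y
su y [aweight=k_rect]

With the rectangular weights, this weighted mean is just the simple average of Y within the window, which is exactly the kernel regression estimator discussed next.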
5: Kernel Regression

When order=0 and kernel=rectangular, the LPR estimator minimizes:

\sum_i (Y_i - \beta_0)^2 \frac{1}{2} 1\left( \left| \frac{W_i - W_0}{h} \right| < 1 \right)

The value of \hat{\beta}_0 that minimizes this is the simple average of the Y's within h of W_0.

5.1: Example: h=0.1

When h = 0.1, then for a single draw of the data, \hat{g}_{KR}(1) is the average Y among observations for which W \in [0.9, 1.1].

The following code shows how I drew this graph and confirms that the KR estimate can be obtained by taking means or by using Stata's nonparametric regression command (locpoly):

clear
set seed 123456
set obs 10000
**** SPECIFY DGP ****
** W
gen w=100*(runiform())
assert w>0 & w<100
** g(W) FUNCTION
gen g=ln(w)
** ERROR TERM
gen u=rnormal(0,1)
** OUTCOME
gen y=g+u
** CALCULATE KR ESTIMATE BY HAND (h=0.1) ***
reg y if w>0.9 & w<1.1
gen double g_nonpara_kr1=round(_b[_cons], 0.01)
su g_nonpara_kr1
assert r(mean)==0.16
* GRAPH
predict y_pred
graph twoway (scatter y w) (line y_pred w if w>0.9 & w<1.1) if w>=0.5 & w<=1.5, xline(0.9 1.1, lpattern(dash)) xline(1) note(LLR estimate=`r(mean)')
** CALCULATE KR ESTIMATE OTHER WAYS (h=0.1) ***
* USING MEAN
su y if w>0.9 & w<1.1
gen double g_nonpara_kr2=round(r(mean),0.01)
assert g_nonpara_kr2==0.16
* USING STATA NONPARA REGRESSION COMMAND
moreobs 1
recode w .=1
gen c=1 if w==1
locpoly y w, at(c) degree(0) width(0.1) rectangle adoonly gen(g_nonpara_kr3) nogr
assert round(g_nonpara_kr3,0.01)==0.16 if c==1

5.2: Performance of this KR estimator

In the case above, the KR estimate is biased up (because the true value is 0); the sketch below checks how the estimator behaves across many draws.
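Following the Monte Carlo pattern used earlier in the handout, here is a minimal sketch (mine, not from the handout) that simulates the h = 0.1 KR estimate of g(1) across 100 draws; the program name kr_sim is mine:

capture program drop kr_sim
program define kr_sim, rclass
syntax [, obs(integer 1)]
drop _all
set obs `obs'
** DGP (SAME AS ABOVE)
gen u=rnormal(0,1)
gen w=100*runiform()
gen y=ln(w)+u
** KR ESTIMATE AT W0=1: MEAN OF Y WITHIN h=0.1 OF 1
su y if w>0.9 & w<1.1
return scalar g_kr=r(mean)
end

simulate g_kr=r(g_kr), seed(123456) reps(100): kr_sim, obs(10000)
** THE TRUE VALUE IS 0; MASS TO THE RIGHT OF THE LINE INDICATES UPWARD BIAS
graph twoway (kdensity g_kr), xline(0)

Comparing the resulting density to the xline(0) reference shows how the h = 0.1 estimates are distributed around the true value.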