American Economic Association

Nonparametric Regression Techniques in Economics Author(s): Adonis Yatchew Source: Journal of Economic Literature, Vol. 36, No. 2 (Jun., 1998), pp. 669-721 Published by: American Economic Association Stable URL: http://www.jstor.org/stable/2565120 . Accessed: 19/08/2011 15:48

Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at . http://www.jstor.org/page/info/about/policies/terms.jsp

JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide of content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms of scholarship. For more information about JSTOR, please contact [email protected]

American Economic Association is collaborating with JSTOR to digitize, preserve and extend access to Journal of Economic Literature.

http://www.jstor.org Journal of Economic Literature Vol. XXXVI (June 1998) pp. 669-721

NonparametricRegression Techniques in Economics


1. Introduction on an infinite number of unknown pa- rameters. (Nonparametric regression 1.1 Setting the Stage typically assumes little else about the shape of the regression function beyond 1.1.1 Benefits of Nonparametric Esti- some degree of smoothness.) Yet, we mation. If economics is the dismal sci- will argue that it is because such tools ence, then is its ill-fated lead to more durable inferences that offspring. The limited number of strong they will become an enduring-even in- implications derived from economic dispensable-element of every econo- theory, the all but complete inability to mist's tool kit, in much the same way perform controlled , the that linear and are paucity of durable empirical results, the today. After all, reliable empirical re- errant predictions, the bottomless res- sults are essential to the formulation of ervoir of difficult to measure but poten- sound policy. tially relevant variables-these do not From the point of view of a pure set a pleasant working environment for analyst, the added value of nonparamet- the empirical analyst in economics. ric techniques consists in their ability to When one contrasts the overall qual- deliver estimators and inference proce- ity of economic data and empirical in- dures that are less dependent on func- ferences with the plethora of sophisti- tional form assumptions. They are also cated econometric theory already useful for exploratory data analysis and available, it would appear difficult to as a supplement to parametric proce- justify learning (or teaching) techniques dures. If one has reservations about a where the regression function depends particular parametric form, specifica- 1 Department of Economics, University of tion tests against nonparametric alter- Toronto. The author is grateful to Andrey Feuer- natives can provide reassurance. verg;er, Zvi Griliches, James MacKinnon, Angelo From the point of view of the econo- Me ino, John Pencavel and particularly to Frank Wolak for their patient reading and insightful mist/econometrician, such techniques comments. The and econometrics litera- possess additional appeal in that most ture on nonparametric regression is massive and implications of economic theory are inevitably many interesting and important results have been given little or no exposure in this intro- nonparametric. Typically, theoretical duction to the subiect. The author regrets this arguments exclude or include variables, state of affairs and hopes that his lacunae will be they imply monotonicity, concavity, or corrected in complementary and more artful re- views by other writers. The author may be con- homogeneity of various sorts, or they tacted at [email protected] embody more comDlex structure such as 669 670 Journal of Economic Literature, Vol. XXXVI (June 1998) the implications of the maximization hy- because the bound on the first derivative pothesis. They almost never imply a implies that If(xt)-f(xti1) I < L Ixt - xt- I .) specific functional form (the pure quan- As long as z is not perfectly correlated tity theory of mioney equation being one with x, the ordinary estima- exception). This paper will therefore fo- tor of f3using the differenced data, that is cus some considerable attention on con- strained nonparametric regression esti- X(yt - yt-i) (Zt - Zt-i) = Xzzt)2 (1.2) and Pdiff mation testing. y(Zt - Zt_1)2 In some cases, the researcher may feel comfortable with a particular para- has the approximate distribu- metric form for one portion of the re- tion gression function, but less confident about the shape of another portion. fdiff NNf(Vj PI 1 ) 1.3 Such varying prior beliefs call for com- bining parametric and nonparametric where y2 is the conditional of z techniques to yield semiparametric re- given x.2 gression models (these have been stud- We have applied this estimator to ied extensively). Inclusion of the non- data on the costs of distributing elec- parametric component may avoid tricity (Figure 1). Factors which can inconsistent estimation which could re- influence distribution costs include cus- sult from incorrect parameterization. tomer density (greater distance be- 1.1.2 An Appetizer for the Reluctant tween customers increases costs), the Palate. For those who have never con- remaining life of physical plant (newer sidered using a nonparametric regres- assets require less maintenance), and sion technique, we suggest the follow- wage rates. These are among the "z" ing elementary procedure, much of variables which appear parametrically which can be implemented in any stan- in the model. However, the nature of dard econometric package. Suppose you scale economies is unclear-the effect are given data (yi,zi,x1). .. (yT,zT,xT) on the on costs could be constant, declining model y = zf3 +f(x) + e where for simplic- to an asymptote or even U-shaped. ity all variables are assumed to be sca- Indeed, important regulatory decisions lars. The Ct are i.i.d. with 0 and (such as merger policy) may involve variance (y2 given (z,x). The x's are determining the minimum efficient drawn from a distribution with support, scale. say, the unit interval. Most important, In the left panel, we estimate the cost the data are rearranged so that function parametrically, incorporating a XI < ... < XT. All that is known about f is quadratic term for the scale variable "x" that its first derivative is bounded by a which we measure using the number of constant, say L. Suppose that we first customers of each utility. In the right difference to obtain panel, we report the coefficient esti- mates of the z variables after applying yt - yt-1 = (Zt - Zt-1)f the differencing estimator (1.2), suit- + (f(Xt) -f(xt-I)) + ?t-t - (1.1) ably generalized to allow for vector z. There is moderate change in coefficients. As sample size increases-packing the unit interval with x's-the typical difference 2 Throughout the paper, the symbol - will de- note that a has the indicated ap- xt - xt-1 shrinks at a rate of about IIT so proximatedistribution and the symbol_ will indi- that f(xti1) tends to cancel f(xt). (This is cate approximateequality. Yatchew: Nonparametric Regression Techniques in Economics 671

Figure 1. Returns to Scale in Electricity Distribution

Model: Restructuring,vertical unbundling, and deregulation of electricity industries have led to reexaminationof scale economies in electricity distribution.The of small (usually municipal) distributorsthat exist in the U.S., Norway, New Zealand, Germany and Canadais at issue. The objective is to assess scale economies. The model is given by y = z i + f(x) + ? where y is variable costs of distributionper customer (COST PER CUST). The vector z includes: customer density (ruraldistribution is more expensive) as measured by average distance between customers (DIST); load per customer (LOAD); local generation per customer (GEN); remaining life of assets (LIFE) - older plants require more maintenance; assets per customer (ASSETS);and a proxy for labour wage rates (WAGE). The scale of operation, x, is measured by the number of customers served (CUST). The scale effect may be nonlinear, (e.g., decreasing to an asymptote or U-shaped), hence the inclusion of a nonparametriccomponentf (x). Data: 265 municipal distributorsin Ontario, Canadawhich vary in size from tiny ones serving 200 customers to the largest serving 218,000 (a 'customer' can be a household, or a commercial, industrialor governmental entity).


PARAMETRIC:Quadratic scale effect SEMIPARAMETRIC:Smooth scale effect y= a + Z3 +y1x +y2x2 + ? Y = z 3 +f(x) + ? OLS Differencing Estimator1 Coefficient StandardError Coefficient StandardError oc 15.987 37.002 DIST 3.719 1.248 ADIST 2.568 1.560 LOAD 1.920 .661 ALOAD .437 .912 GEN -.051 .023 AGEN .0005 .032 LIFE -5.663 .798 ALIFE -4.470 1.070 ASSETS .037 .005 AASSETS .030 .0072 WAGE .003 .0007 AWAGE .003 .00086 71 -.00154 .00024 72 .1x10-7 .14x10-8

R2 .45


, 200 Kernel w34 ------Quadratic 0 E- 150- Q ~~~~~~~~~~0 = 100 o 0 C0 05 DCo ? E

C/) 0%00Qx 800 0 80 0 00

0~~~~~~~ 500 1000 5000 0 1000500 100

1 aarodrds htx < .t. . XT,.0 othe al-0 vaibe difeecedAw= w0 - , foloe by0 orinr lesqurs Standarderor ~ mutple by XTS as per eqato (13) DeenenA- vaial is COS PE CUST. 500 1000 5000 10000 50000 100000 CUSTOMERS ? ? 1Data reordered so that x, ... XT, then all variables differenced A wt = Wt- Wtl1, followed by ordinaryleast squares. Standarderrors multiplied by Vfi75, as per equation (1.3). Dependent variable is A COST PER CUST. 672 Tournal of Economic Literature. Vol. XXXVI (Tune 1998)

Though the "differencing estimator" vations. For the approach described of f3 does not require estimation of f, here, the implicit estimate of f(xt) is one can visually assess whether there is f(xt-i). a potentially interesting relationship em- 1.1.3 Objectives of the Paper. Why bodied in f by producing a scatter-plot have nonparametric regression methods of yt - Ztdiff against xt as we have done not yet infiltrated the applied literature in Figure 1. In addition, we ,plot two to the degree one might expect, particu- curves-the quadratic estimate y 2x+ 72x2, larly since other nonparametric tech- (it does not look quadratic because x is niques are used without hesitation? Af- scaled logarithmically) and a nonpara- ter all, anyone using the central limit metric estimate using the "kernel" tech- theorem to do inference on a mean nique, which we will later discuss in de- from an unknown distribution has en- tail. The two estimates are strikingly gaged in semiparametric inference. The similar, suggesting that the quadratic same is true if the mean is a linear func- specification is adequate (we will later tion of some explanatory variables, but provide formal tests of this proposi- the distribution of the residuals remains tion). Note further that the standard er- unknown, Even the , which rors of the differencing estimator are we teach in introductory courses, is a larger than the pure parametric ones, as nonparametric estimator of the underly- one would expect from (1.3). ing density. For the most efficient nonparametric True, nonparametric regression has estimator, the 1.5 factor in equation not thus far uncovered any particularly (1.3) is replaced by 1, thus the relative startling empirical phenomena that efficiency of this differencing estimator were hitherto inaccessible to parametric is 66.7 percent (=1/1.5). Later we will practitioners.3 But this does not explain outline how asymptotic efficiency can the absence of greater use in explora- be achieved by using higher order dif- tory data analysis, as a confirmatory ferences. tool, or as a supplement to the standard Our analysis of this partial linear parametric fare. Several factors have in- model has been divided into two parts: fluenced this relatively slow penetration first we analyzed the parametric portion rate at the same time that there has of the model, all of which can be done been an explosion of theoretical results in a standard econometric package; in this area. then we estimated the nonparametric First, nonparametric regression tech- portion of the model (estimators for niques are theoretically more complex which are widely available in statistical than the usual tool kit of linear and non- packages such as S-Plus). This modular linear parametric modelling methods. approach will be a theme of the paper, Second, nonparametric regression since it permits one to use existing soft- techniques are computationally inten- ware and to adapt a variety of paramet- sive and they require large (in some ric or purely nonparametric procedures cases astronomically large) data sets, to this setting. since relationships are "discovered" We make one final observation on the by examining nearby observations. differencing idea which we exploit in ("Nearby"better not be too far away.) this paper because of its simplicity. Nonparametric procedures estimate the 3 Some may be surprised by the apparent ab- sence of scale economies in electricity distribution value of the regression function at a (Figure 1). After all, the so-called "wires business" given point by using neighboring obser- is a natural monopoly. Yatchew: Nonparametric Regression Techniques in Economnics 673

Third, a unified framework for con- However, for the interested applied strained estimation and testing of eco- economist, there is no need to wait for nomic models using nonparametric re- future software developments. Many of gression is still in the incipient stage. As the procedures outlined in this paper a consequence, software which will han- can be implemented using off-the-shelf dle such procedures in an automated software, in particular S-Plus (see for fashion does not yet exist. example, William Venables and Brian A central objective of this paper is to Ripley 1994), and XploTe (see Wolf- demonstrate that these barriers are in- gang Hardle, Sigbert Klinke, and Ber- deed substantially lower than might first win Turlach 1995). appear. We comment on each in turn. Third is the issue of constrained esti- First, we deal with the issue of mation and hypothesis testing. In a theoretical sophistication. Nonparamet- parametric setting, constraints can be ric regression typically involves either imposed on parameters relatively easily, local averaging or some form of least and many hypotheses can then be tested squares estimation. Both ideas are fa- (for example, by comparing restricted miliar from parametric modelling, and and unrestricted sums of squared re- indeed a significant portion of the the- siduals) . ory of nonparametric regression in- In a nonparametric setting, imposing volves straightforward extenision of constraints on the estimator is often results and techniques familiar to more difficult. However, if a restricted parametric practitioners. Unfortunately, estimator is obtained, one can examine nonparametric methods also have criti- the estimated residuals to see whether cal elements that are not present in the they constitute pure error or whether pure parametric setting. Most impor- they are related to the explanatory vari- tant among these are the "curse of di- ables. If it is the latter, then this sug- mensionality" and the need to select the gests invalidity of the constraints. (For- value of a "smoothing parameter." Tak- mal testing can proceed by performing ing as a premise that a technique is un- a regression of the estimated residuals likely to be used (or worse, unlikely to on all the explanatory variables.) be used correctly) without a rudimen- In summary, our overarching objec- tary understanding and intuitive grasp tive is to increase the accessibility of of the basic theory, we set out, wher- nonparametric techniques to econo- ever possible, simple arguments sup- mists. To pursue this main goal, our porting the theoretical propositions. three subsidiary objectives are to pro- Second is the issue of computation. vide implementation details for a few The precipitous drops in computing relatively simple nonparametric regres- costs, data storage, and even data col- sion estimation and inference tech- lection (the latter through the prolifera- niques, to summarize central theoret- tion of automated tech- ical results which we hope will make nologies) are effectively eliminating this the techniques more intelligible, and to as a barrier to the use of nonparametric outline procedures for constrained esti- regression techniques. We believe that mation and hypothesis testing. the forthcoming industry standard (be it 1.1.4 Charting the Terrain Ahead. a local averaging estimator, a least The structure and contenit of the paper squares estimator, or a hybrid) will be is thus driven by the objectives men- coupled with computer-intensive infer- tioned above. The remainder of the In- ence techniques, such as the bootstrap. troduction, entitled Background and 674 Journal of Economic Literature, Vol. XXXVI (June 1998)

Overview, categorizes various kinds of cusses constrained estimation and hy- regression models, from parametric at pothesis testing. the simple extreme, through semipara- In the fifth section of the paper, Ex- metric models, to various kinds of non- tensions and Technical Details, we col- parametric models which may incorpo- lect a variety of topics. The partial lin- rate additional structure (such as ear model is revisited yet again, this additive separability or monotonicity). time to demonstrate that the parametric The section then discusses the "curse of and nonparametric portions can be ana- dimensionality" a phenomenon which lyzed separately. Thus, a variety of has no counterpart in a parametric set- purely nonparametric procedures can ting. Following this, two naive nonpara- be grafted onto the nonparametric por- metric regression estimators are intro- tion of the partial . The duced-the first based on local role of various constraints implied by averaging, the second on least squares. economic theory is discussed. The sec- The section closes with a glimpse of es- tion also includes a note on available sential theoretical results. Upon com- computer software, and directions for pletion of the Background and Over- further reading. view, the reader should have an Section 6 presents our conclusions. appreciation of the range of models un- We will from time to time make use der study, the two most basic estimation of a number of mathematical ideas deal- principles, the theoretical challenges, ing with sequences of numbers and of and a minimal set of theoretical results. random variables. We have summarized The second section of the paper de- these in Appendix A. Throughout the tails two nonparametric regression esti- paper, equations of particular signifi- mators-the kernel estimator which is cance or usefulness are indicated by based on the principle of local averag- equation numbers set in boldface type. ing; and nonparametric least squares, which is closely related to spline esti- 1.2 Background and Overview mation. The section also summarizes their main statistical properties. 1.2.1 Categorization of Models. Con- The third section focuses on The Par- sider the model y =f(x) + s where ? is tial Linear Model which we believe to i.i.d. with mean 0 and variance ag given be the most likely "entry level" model x. Iff is known only to lie in a family of for economists. This section describes smooth functions -3 then the model is three distinct estimators. In addition to nonparametric. If f satisfies some addi- the partial linear model, the class of tional properties (such as monotonicity, semiparametric specifications includes concavity, homogeneity or symmetry) other important subclasses such as and hence lies in 3 c 3, we will say that partial parametric and index models. the model is constrained nonparamet- Despite the prominence of index mod- ric. If we can partition x into two sub- els in the econometric literature, we sets xa and Xb,such that f is of the form have not included them in this paper, fa(Xa)+fb(xb), then it is called additively essentially because of space limitations. separable. A foothold into that literature may be Next we turn to semiparametric mod- established by reading Thomas Stoker els. In this paper we will focus on the (1991). See also James L. Powell (1994) partial linear model already introduced. and references therein. (See Figure 2 for categorization of vari- The fourth section of the paper dis- ous kinds of regression models.) Yatchew: Nonparametric Regression Techniques in Economics 675

Figure 2. Categorizationof Regression Functions

Parametric- Linear Semiparametric- PartialLinear

Linear y =x + (pictured above) PartialLinear y = z,B +f(x) + e fe 3 (pictured above) Nonlinear y = g (X;j)+c, g known PartialParametnrc y = g (z;4) +f(x)+ F. g known,fe 3 Index Models y =f (z) + e, fe 3

Additively Separable Nonparametric

I -

Smooth y =f(xa)+fb(xb)+ , fa+fb e Smooth y =fx) + , fe 3 (pictured above) Smooth and Constrainedy = f (x) + F, fe .3

3 is a smooth family of functions. Z is a smooth family with additionalconstraints such as monotonicity,concavity, symmetryor other constraints.

1.2.2 The Curse of Dimensionality jective is to approximate a function f. If and the Need for Large Data Sets. In it is known to be linear in one variable, comparison to parametric estimation, two observations are sufficient to deter- nonparametric procedures can impose mine the entire function; three are suf- enormous data requirements. To gain ficient iff is linear in two variables. Iff an appreciation of the problem as well is of the form g(x;,) where g is known as remedies for it, we begin with a de- and : is an unknown k-dimensional vec- terministic framework. Suppose the ob- tor, then k judiciously selected points 676 Journal of Economic Literature, Vol. XXXVI (June 1998) are usually sufficient to solve for P. No linear form z, +f(x) (the function f is further observations on the function are unknown except for a derivative necessary. bound). In this case, we need two evalu- Let us turn to the pure nonparamet- ations along the z axis to completely de- ric case. Suppose f, defined on the termine P (see the "Semiparametric" unit interval, is known only to have a surface in Figure 2). Furthermore, T first derivative, bounded by L (i.e., equidistant evaluations along the x axis supxF_[o,l]If'I

Figure 3 Naive Local Averaging

0~~~~~~~~~~~~~~ 0 1.0 0iRA s X 0 0 0 ~~~~~~~~~0 0.5 - _ 0 0 0 ~~~~~~~~~~~~0 00 0 .0.0 ~~000 0 0 .00 0. 0 0 0 0 1 1.0 | Bandwidth= .40 | o? |o0 ,o

0 0 0~~~~~~~~~~~~~~~~~~~~~ 1.0 X 0~~~~~~~~~~~~~~~~~~| 0 -1.0 Bandwvidth= .15 0 o _

0.0 0.2 0.4 0.6 0.8 1.0 x100 -1.0~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1.0~~~~~~~~~~~~~~~~ 0.5~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

0 0~~~~~~-

0 0 0. 0.0. 0. 1.

I (x T r 00 0

0 0~~~~~ 0

-1.0~~ 00 0 + 01 1.0 N , T 0 O - a a o

0.5dsof the indicated width. 0.0 0.2 0.4 0.6 0.8 1.0 xlOO

bandsBofdthedidicatedwidth Yatchew: Nonparametric Regression Techniques in Economics 679

Figure 4 Bias-VarianceTradeoff

1.0 -




0.0 0.2 0.4 0.6 0.8 1.0 x 1.0 -

0.5 ~~~~Bandwidth=.15


0.0 0.2 0.4 0.6 0.8 1.0 x 1.0 - Bandwidth=.40 0.5 -_i


s ~~~~/ 1 l l l -0.5 -i

0.0 0.2 0.4 0.6 0.8 1.0 x

- f (x) true regressionfunction E [f (x)] expectation of the estimator E [f(x)]?2[Varf (x)]1/2 ?2 x

- DATA GENERATING MECHANISM Yt = Xt cos(4ntxt) + -t et N (0,09) xte [0,1], T=100 the example of Figure 3). In the first cides almost perfectly with the true re- panel, local averaging is taking place us- gression function (the dashed line). The ing just 5 percent of the data at each broken lines on either side correspond to point (since the bandwidth is .05 and 100 two times the standard errors of the esti- x's are uniformly spaced on the interval mator at each point- 2(Var[fx)])/2. In [0,1]). The solid line is E[tx)] and the the second panel the bandwidth is sub- estimator exhibits little bias-it coin- stantially broader; we are now averaging 680 Journal of Economic Literature, Vol. XXXVI (June 1998) about 15 percent of the data at each point. The standard error curves are min T X(yt - yt)2 t tighter but some bias has been intro- A1.,T A A - = duced. The E [Aftx)]no longer coincides Yt YS < L s,t 1, ... ,T. (I. 9) perfectly with the true regression curve. Xt-XS In the third panel, averaging is taking place over 40 percent of the data. The Here At is the estimate of f at xt andf is standard error curves are even tighter, a piecewise linear function joining the but now there is substantial bias, particu- Yt with slope not exceeding the deriva- larly at the peaks and valleys of the true tive bound L. Under general conditions regression function. The expectation of this estimator will be consistent. Fur- the estimator E[fAx)] is fairly flat, while thermore, adding monotonicity or con- the true regression function undulates cavity constraints, at least at the points widely around it. where we have data, is straightforward. A more general formulation of local As additional structure is imposed, the averaging estimators modifies (1.6) as estimator becomes smoother and its fit follows: to the true regression function improves (see Figure 5). T 1.2.4 A Bird's-Eye View of Important f(xo) = Wwt(x0)yt. (1.8) Theoretical Results. The non/semipara- metric literature contains a large num- The estimate of the regression function ber of theoretical results. Here we at xo is a weighted sum of the yt where summarize, in crude form, the main the weights wt(xo) depend on xo. (A categories of results that are of par- number of local averaging estimators ticular interest to the applied re- can be put in this form including ker- searcher. nel, nearest neighbor and regresso- Computability of Estimators. Our gram.) Since one would expect that ob- preliminary exposition of local averag- servations close to xo would have condi- ing estimators suggests that their com- tional means similar tof(xo), it is natural putation is generally straightforward. to assign higher weights to these obser- The naive optimization estimator con- vations and lower weights to those that sidered above can also be calculated are farther away. Local averaging estima- easily, even with additional constraints tors have the advantage that as long as on the regression function. What is the weights are known, or can be easily more surprising is that estimators which calculated, f is also easy to calculate. minimize the sum of squared residuals The disadvantage of such estimators is over (fairly general) infinite dimen- that it is often difficult to impose addi- sional classes of smooth functions tional structure on the estimating func- can be obtained by solving finite di- tionf. mensional (often quadratic) optimiza- Optimization estimators, on the other tion problems. (See Sections 2.1 and hand, are more amenable to incor- 2.2.) porating additional structure. As a Consistency. In nonparametric re- prelude to our later discussion, consider gression, smoothness conditions (in par- the following naive estimator. Given ticular, bounds on derivatives), play a data (yI,xi) ... (yT,XT) on yt =f(xt) + et, central role in assuring consistency of where xte [0,1], suppose If'i< L and we the estimator. They are also critical in solve determining the rate of convergence as Yatchew: Nonparametric Regression Techniques in Economics 681

Figure 5 Naive NonparametricLeast Squares

DATA GENERATING MECHANISM Yt =Xt + E t - N (0,,04) xtE [0,1]

SMOOTHNESS:min < 3 s,t= TX(yt-M)2 xt x < 1,..,25

0 0

1.0 0


z 0.5- 0 0 0

0 -0 0 0.0 0 - ---f(x) =x 0.0 _smooth'_estimate0 0


0.2 0.4 0.6 0.8 1.0 x

SMOOTHNESSAND MONOTONICITY:additional constraints y < y' for all x, < xt 0~~~~~~ 1.0 0 ?

- 0.5 0 0~~~~~~ 0 0 x 0.52 0 0 0 - -0* 0 -1~~

0 0.0 ,[I_ smooth' and monotone estimate


0.2 0.4 0.6 0.8 1.0 x

- - X MONOTONICITYAND constraintss < Xs Xr + Xt r = SMOOTHNESS, CONCAVITY:additional Y - rs,t 1,... T, xr.

1.0 0 00 O

0 0 ~0

0.5- 0 0 0

0 0 0

0.0 0 - - - - f(x) =x ?? 0 ?-'0smooth'- ~~~0 ~~ ~ monotone and concave estimate


0.2 0.4 0.6 0.8 1.0 x

The simulations were performed using GAMS -General Algebraic Modelling System (A. Brooke, D. Kendrick, and A. Meerhaus 1992). 682 Journal of Economic Literature, Vol. XXXVI (June 1998) well as certain distributional results.4 least squares estimators can be con- With sufficient smoothness, derivatives structed which achieve the optimal rate of the regression function can be esti- of convergence (see Sections 2.1 and 2.2.). mated consistently, sometimes by dif- Rate of convergence also plays an im- ferentiating the estimator of the func- portant role in certain test procedures. tion itself. (See Sections 2.1 and 2.2.) If the model is additively separable Rate of Convergence. How quickly or partially linear, then the rate of con- does one "discover" the true regression vergence of the optimal estimator de- function? In a parametric setting, the pends on the nonparametric component rate at which the variance of estimators of the model with the highest dimen- goes to zero is typically 1/T.5 It does sion (Stone 1985, 1986). For example, not depend on the number of explana- for the additively separable model tory variables. For nonparametric esti- y =fa(Xa) +fb(xb) + E where Xa, Xb are sca- mators, convergence slows dramatically lars, the convergence rate is the same as as the number of explanatory variables if the regression function were a non- increases (recall our earlier discussion parametric function of one variable. of the curse of dimensionality), but this The same is true for the partial linear is ameliorated somewhat if the function model y = zI3+f(x) + E where x and z are is differentiable. The optimal rate at scalars. Estimators of E can be con- which a nonparametricestimator can con- structed for which the variance shrinks verge to the true regression function is at the parametric rate 1/T and which given by (see CharlesJ. Stone 1980, 1982) are asymptotically normal. We have al- ready seen a simple differencing esti- mator with this property (see Sections 3 [f(X) -f(X)]2 dx= OPT2m/(2n + ( 0) d)j and 4.3 for further discussion). For the hybrid regression function wheremn equals the degree of differenti- f(z, Xa, Xb, xc) = Z4 + fa(Xa) + fb(xb) + fc(xc) of and is the dimension of ability f d x. where Xa, Xb, xc are of dimension da, For a twice differentiable function of db,dc, respectively, the optimal rate of one variable, (1.10) implies an optimal convergence for the regression as a rate of convergence of Op(TV5)(a case whole is the same as for a nonparamet- which will recur repeatedly); for a func- ric regression model with number of tion of two variables it is Op(lT2/3). variables equal to max {da,db, dc}. Local averaging and nonparametric Constraints such as monotonicity or 4 For example, in proving these results for mini- concavity do not enhance the (large mrization estimators, smoothness is used to ensure sample) rate of convergence if enough that uniform (over classes of functions) laws of smoothness is imposed on the model large numbers and uniform central limit theorems apply; see Richard Dudley (1984), David Pollard (see, Section 4.4). They can improve (1984), and Donald W.K. Andrews (1994a). performance of the estimator if strong 5 In the i.i.d. setting, if y = ,t + E, Var(y-) = G2/T smoothness assumptions are not made hence g - = Op(T-'2). For the model y = + r3x+ ? we have: or if the dataset is of moderate size (recall Figure 5). J@t+ rx - Q - r3x)2dx = (o - oc)2Jdx + (,B - ,)2Jx2dx Bias-Variance Trade-Off. By increas-

+ 2(ox - f- )Jxdx = 0p(1/T), ing the number of observations over which averaging is taking place, one can since A, are unbiased and Var(&),Var(f) and reduce the variance of a local averaging Cov(&r3) converge to 0 at 1/T. The same rate of convergence usually applies to general parametric estimator. But as progressively less forms of the regression function. similar observations are introduced, the Yatchew: Nonparametric Regression Techniques in Economics 683 estimator generally becomes more bi- regression estimation techniques are ased. The objective is to minimize the based upon principles which should be mean squared error (variance plus bias familiar to the applied economist-in squared). For nonparametric estimators particular, the notion of a local average which achieve optimal rates of conver- and the principle of least squares. We gence, the square of the bias and the have also introduced naive implementa- variance converge to zero at the same tions of each of these principles for rate (see Section 2.1 below). (In para- models with a single explanatory vari- metric settings the former converges to able. zero much more quickly than the lat- Of course, very few relationships of in- ter.) Unfortunately, this property com- terest to economists are so simple. Both plicates the construction of confidence principles have natural implementations intervals and test procedures. in the presence of multiple explanatory Asymptotic Distributions of Estima- variables. Unfortunately, our ability to tors. For a wide variety of nonparamet- accurately estimate the relationship de- ric estimators, the estimate of the teriorates as the number of such vari- regression function at a point is ap- ables increases. This "curse of dimen- proximately normally distributed. The sionality" canlbe mitigated somewhat by joint distribution at a collection of introducing additional structure such as points is joint normally distributed, as a semiparametric specification, by as- are various functionals such as the aver- suming higher order differentiability, or age sum of squared residuals. (See Sec- by imposing additive separability. tions 2.1 and 2.2 below.) How Much to Smooth. Smoothness 2. Nonparametric Regression parameters such as the bandwidth can 2.1 Kernel Estimators be selected optimally by choosing the value which minimizes out-of-sample 2.1.1 Estimation. We continue with prediction error. The technique, known our nonparametric regression model as cross-validation, will be discussed be- y =f(x) + , where for the time being x is low (see Section 2.3). a scalar. A conceptually convenient way Testing Procedures. A variety of to construct local averaging weights for specification tests of parametric or substitution into (1.8) is to use a uni- semiparametric null hypotheses against modal function centered at zero, which nonparametric or semiparametric alter- declines in either direction at a rate natives are available. Some nonparamet- controlled by a . Natural ric tests of significance are also avail- candidates for such functions, which are able. There are also tests of additive commonly known as kernels, are prob- separability, monotonicity, homogene- ability density functions. Let K be a ity, concavity and maximization hypo- bounded function which integrates to theses. The validity of bootstrap infer- one and is symmetric around zero. De- ence procedures has been proved in a fine the weights to be A number of cases. fairly unified testing I theory can be constructed using condi- K tXj tional moment tests. wt(xo)= (2.1) 1.3 Reprise TK Xt XoJ In this introductory part of the paper, we have argued that nonparametric 684 Journal of Economic Literature, Vol. XXXVI (June 1998)

The shape of the weights (which, by con- the uniform kernel, the denominator of struction, sum to one) is determined by (2.2) will be about 1 and the estimator K, while their magnitude is controlled by fxO) becomes a simple average of the yt X which is known as the bandwidth. A over the neighborhood N(xo). (For ex- large value of X results in greater weight ample, if the bandwidth equals .2, one being put on observations that are far is averaging about .2T observations or from x,. Using (1.8) the nonparametric about 20 percent of the data to obtain regression function estimator (first sug- the estimate.) Thus, we have gested by E.A. Nadaraya 1964 and G.S.

Watson 1964) becomes f(xo) -Xyt N(x0) x o -,T K yt _1 1 XT~ LXY et fix0)= (2.2) XT Xf(xt) +

Nff(x0) XTN(Xt-X)(23 fN(xo ) XTEXKI Xt XoJ A variety of other kernels are available (see Figure 6). Generally, selection of +f (Xo) (Xt - Xo) + 2T3t. the kernel is less important than selec- 2TN(Xo) N(x) tion of the bandwidth over which obser- vations are averaged. The simplest is the + 1 ~~~~2E 2kTJ: ( +t T et. uniform kernel (a.k.a. rectangular or box -ff(xo) + '/2f t(Xo)7T E (Xt - Xo) kernel), which takes a value of 1 on N(xf) [- 1/2,1/2] and 0 elsewhere. We focus on it next as it will provide us with the clear- N(xN) est insights. 2.1.2 Uniform Kernel, x's Uniformly Distributed on [0,1]. In order to garner We have applied a second order Taylor some intuition, in addition to working series.7 Next, we rewrite (2.3) as 8 with a uniform kernel, assume for the moment that x is uniformly distributed on the unit interval. If we draw T obser- 7In particular, f(xt) =f(x,) + f '(xo)(xt - xo) + 1/2f'(xo)(xt - xO)2 + O(Xt - xO)2. We are obviously as- vations on x, then the proportion of ob- suming second order derivatives exist. servations falling in an interval of width 8 Consider a random variable Uz which is uni- X will be approximately X and the num- formly distributed on an interval of width X cen- tered at 0. Then Var(Ux) = X2/12. To obtain (2.4) ber will be approximately XT.6 I More formally, define the neighbor- from (2.3) it is useful to think of Y (Xt- xo) xtSN(Xo) hood around xo as N(xo) = fxt XtE [Xo - as an average of XT such random variables, in A/2, xo+ A/2]} and note that there are which case the average has mean 0 and variance roughly AT observations in N(xo). For X/12T. Thus, the second term in the third line of (2.3) converges to 0 fast enough so that for 6 Two conditions are imposed on X. The first is our purposes we can set it to 0. Next, think of X - 0 which ensures that averaging takes I place I (Xt- xo)2 as an estimator of the variance of over a shrinking bandwidth, thus eventually elimi- xteN(xo) nating bias. The second is XT -> which ensures that the number of observations being averaged Us. The estimator converges quickly enough so grows, which allows the variance of the estimate to that the second term of the last equation of (2.3) ecline to 0. can be approximatedby 1/24f "(xo)X2. Yatchew: Nonparametric Regression Techniques in Economics 685

Figure 6. AlternativeKernel Functions

Kernel Domain Kernel Domain

Uniform' 1 u E [-'/2, /21 Triangular 2(1 - 2juj) u E [-/2,/2]

Quartic 15 (1 - 4U2)2 E Epanechnikov 3 (1 - 4uj2) u 8 U [-'/2,'/2] 2 E [-'/2,/2W

Normal exp( U2) uE (-oo,oo)

*Also known as the 'box'or 'rectangular'kernel.

ables so that its variance is 62/XT and we f(x) _f(x) +24- 2f '"(xo) + X E (2.4) have: N(xo)

The last term is an average of about XT J(Xo) =f(Xo) + 0(k2) + ?p (XT) (2.5) independent and identical random vari- 686 Journal of Economic Literature, Vol. XXXVI (June 1998)

The bias E(f(x,) -f(x0)) is approximated by the second term of (2.4) and the (AT)1/2(XO) -f(xO) Var(f(x0)) is approximately Y?JXTso that the mean squared error (the sum of the - "(Xo) -N(O,a?) (2.9) bias squared and the variance) at a point 2f xo is If we select X optimally, say, A==-1/5 then (XT)/2X2= 1 and the construction of E[f(xo) -f(xo)] = O(X4) TO J (2.6) a for f(xo) is compli- cated by the presence of the term involv- The approximation embodied in (2.4) ing f"(x0) (which would need to be esti- yields dividends immediately. As long as mated). However, if we select X to go X -* 0 -* and AT oo, the second and third to zero faster than T-r15 (for example, terms go to zero and we have a consis- X = T-1/4), then (XT)?/2X2-> 0 and (2.9) be- tent estimator. comes (XT)(/2(xo)-f(xA)) N(0y2). Intui- - 0 The rate at which fixo) -f(xo) de- tively, the bandwidth is shrinking fast pends on which of the second or third enough so that the bias is small relative terms in (2.4) converge to zero more to the variance (see (2.5) above). In this slowly. Optimality is achieved when the case, a 95 percent confidence interval for bias squared and the variance shrink to f(xo) is approximatelyAix0) ? 1.966/(XT)?2. zero at the same rate. Using (2.5) one Let us pause for a moment. In this can see that this occurs if 0(X2) = section, we have illustrated three essen- Op((TX)-?/2) which implies that optimal- tial results for a simple kernel estima- ity can be achieved by choosing X= tor: that it is consistent; that by averag- O(T-1/5). In this case ing over a neighborhood which shrinks at an appropriate rate it achieves the optimal rate of convergence, and that it Axo) f(xo) + T2/) PT2/5) (27) is asymptotically normal. 2.1.3 General Kernel, x's Distributed Equivalently, we could have solved for with General Distribution p(x). For a the optimal rate using (2.6). Setting general kernel and assuming that the x's = Q(X4) O(1/XT) and solving we again ob- are random with probability density tain A= O(rl/5). Substituting into (2.6) p(x), the Nadaraya-Watson kernel esti- yields a rate of convergence for the mean mator (2.2) is consistent. The numera- squared error at a point of - xo E[f(xO) tor converges to f(xo)p(xo) and the de- = O(T4/5). This in turn f(x0)]2 underpins nominator converges to p(xo). the following: The rate of convergence is optimized if X = O(T'15) in which case the inte- f[(x) -f(X)]2dx = Op{T 4/5 (2.8) grated squared error converges at the optimal rate Op(T-4/5) as in (2.8). Confi- a rather pleasant result in that it satisfies dence intervals may be constructed us- Stone's optimal rate of convergence, ing: equation (1.10) above (m=2, d=1). Applying a central limit theorem9 to - 1/2 aKk2 the last term of (2.4), we have: (AXT)/2f(xo) -f(to) f (Xo)

9One must be a little bit careful because the +2fo(x)PVo) -NkN0, b (2.10) rnumberof Ct being summed is random. See, for p(xo))) p(xo)) example, Robert Serfling (1980, p. 32). Yatchew: Nonparametric Regression Techniques in Economics 687 where p(.) is the density of x and 1995). Also displayed is an estimated linear regression function which falls aK=u2K(u)du bK=K2(u)du (2.11) within the nonparametric confidence If the bandwidth converges to zero at band at all but extreme levels of income. the optimal rate, then estimates of the 2.1.4 Kernel Estimation of Functions first two derivatives off, the density of x of Several Variables. In economics, it is and its first derivative, as well as the rarely the case that one is interested in variance of the residual must all be ob- a function of a single variable. Even if tained in order to construct a confidence we are comfortable incorporating most interval which in large samples will have of our explanatory variables parametri- the correct coverage probability.10 cally (for example, within a partial lin- Again, the bias term in (2.10) can be ear model), there may be more than made to disappear asymptotically by one variable entering nonparametri- permitting the bandwidth to shrink at a cally. The effect of geographic location rate that is faster than the optimal rate. (a two-dimensional variable) provides one Averaging over a narrower range re- such example (see Figure 11 below). duces the bias but increases the vari- Suppose then that f is a function of ance of the estimator. two variables. We are given data An alternative, which takes account (yi,xI) ... (YT,XT) where Xt = (Xtl,Xt2) on of both bias and variance terms without the model yt =f(xtl,xt2) +Et and we will requiring their calculation, is based on assume f is' a function on the unit the bootstrap. Figure 7 provides imple- square [0,1]2. We want to estimatef(xo) mentation details for both asymptotic by averaging nearby observations; in and bootstrap confidence intervals. particular we will average observations Thus far we have discussed confi- falling in a square of dimension X -AX dence intervals for f at a point. A more which is centered at xo. If the xt are interesting graphic for nonparametric drawn from a uniform distribution on estimation is a confidence band or rib- the unit square, there will be (approxi- bon around the estimated function. The mately) X2T observations in the neigh- plausibility of an alternative specifica- borhood defined by N(xo)= {Xt I Xtl E xol tion (such as a parametric estimate, a _ X/2, Xt2E Xo2? X/2}. (For example, any monotone, or concave estimate) could square with sides A,= .5 has area .25 and then be assessed by superimposing the will capture about 25 percent of the ob- latter on the graph to see if it falls servations.) Consider then: within the band.11 A 1 Figure 8 provides nonparamnetricker- f(xo) = X2T I yt nel estimates of the Engel curve relating N(xo) expenditure on food to income. The 90 uniform confidence band was con- 1 1 percent - X f(xt) + 2I t structed using the reguncb function in 2TN(xo) 2TN(Xo) (2.12) XploRe (Hardle, Klinke and Turlach

10 See Hardle and Linton (1994, p. 2310). The ffi(xo) + O(2) + 1 t normality result in (2.10) extends to the case +2TN(xo) where one is interested in the joint distribution of estimates at a vector of points. 11See Hardle and Oliver Linton (1994, p. 2317), + + Peter Hall (1993), and Randall Eubank and Paul =f(Xo) O(X2) Op{X T/ Speckman (1993). 688 Journal of Economic Literature, Vol. XXXVI (June 1998)

Figure 7. Confidence Intervals for Kernel Estimators- Implementation

ASYMPTOTIC CONFIDENCE INTERVAL ATf (X0) 1. Select X so that T"15X -4 0, e.g. X = 0 (T-1"4).This ensures that the bias term does not appear in the limiting distribution(2.10). 2. Select a kernel K and calculate bK = fK2(u)du. For the uniform kernel on [ 1/2,1/2] bK = 1. 3. Estimatef using the Nadaraya-Watsonestimator (2.2). 4. Calculate 62= 1/T E (yt-f(xt)2. 5. Estimate p (xo)using denominator of (2.2). If the uniform kernel is used, p3(xo) equals the numver of xt in the intervalxo?+X/2 divided by X. 6. Calculate the confidence interval atf (xo)using f(xo) ?1.96 bK)e p5(x0)XT

BOOTSTRAP CONFIDENCE INTERVAL* ATf (X0) 1. Estimatef using the optimal bandwith X = 0 (T-1"5),call this estimatefx, then calculate the residuals gt = yt-f(xt). 2. Re-estimatef using a wider bandwidth, say X, (which will result in some oversmoothing)and call this estimatef. Resample the estimated residuals gt using the 'wild'bootstrap to obtain bootstrap residualsqB and construct a bootstrap dataset yt =fx (xe) + ?, t=.T. 3. Estimatef (x0)using the bootstrap data and X = 0 (T-"15)to obtainf B (xo).Repeat the many times and obtain the .025 and .975 quantiles of the distributionoffB (xo).The result yields a 95% confindence intervalfor f (xo) which has the correct coverage probabilityin large sampites.

*For the theory underlyingthis bootstrap methodology see Hardle (1990, Th. 4.2.2, and pp. 106-7, 247). See also Appendix B of this paper.

What we have done here is to mimic the using a kernel K then the estimator in reasoning in equations (2.3)-(2.5) but (2.2) becomes this time for the bivariate case. (We have T d assumed that f is twice differentiable, Kt 1 xoij but have spared the reader the Taylor A XdT~~j=! series expansion.) However, there is a = subtle difference. The bias term is still f(xo) . (2.13) 1 proportional to X2, but the variance term T HK CXoj j is now OA(I/XT?)rather than Op(1/(XT)?') since we are averaging approximatelyX2T values ofEt, Again, if K is the uniform kernel on [-1/2, Hence for consistency, we now need 1/2], then the product of the kernels ->O0 and AT'/2 -> o. As before, conver- (hence the term product kernel) is gence of Afxo) tof(xo) is fastest when the one only if xtic [xoi- X/2, xoi + X/2] for bias and variance terms go to zero at i= 1,... ,d, that is only if xt falls in the d- the same rate, that is when X = O(T-1/6). dimensional cube centred at xo with sides The second and third terms of the last of length B. line of (2.12) are then Op(T-'/3) and Above we have introduced a simple J[f(x) -f(x)]2dx = Op(T-2/3) which is opti- kernel estimator for functions of several mal (see (1.10) above). variables which averages observations More generally, if the xt are d-dimen- over a cube centered at xo. A multitude sional with probability density p(x) de- of variations and alternatives exists. For fined on the unit cube in Rd, and we are example, one could select different Yatchew: Nonparametric Regression Techniques in Economics 689

Figutre 8. Engel Curve Estimation

MODEL:y =f(X) + ? where y is expendituireon food, x is income (units are in 104 Canadiandollars). DATA:the data consist of a sample of 1058 two-parent households with two children below 16 years of age. Source, 1992 Survey of Family Expenditures, Household Survey Division, StatisticsCanada.



1.0 .


0.6 .-.

0.4- - ...... , '


0.0 2.0 4.0 6.0 8.0 10.0

Linear Model and Kernel Estimate with 90% Uniform Confidence Band

Estimation and graphicsproduced using XploRe (Hairdle,Klinke, and Turlach 1995). Uniform confidence band use routine reguncb.

bandwidths for each dimension so that 2.2 Nonparametric Least Squares 12 averaging would take place over rec- tangular cubes rather than over per- 2.2.1 Estimation. In order to imple- fect cubes. Or, one might select differ- ment nonparametric least squares esti- ent kernels for each dimension. Still mators, we will need a tractable way to more generally, one could average over impose constraints on various order de- non-rectangular regions such as spheres rivatives. Let Cn be the set of functions or ellipsoids. For details in a multivari- 12 S line function estimation, which is closely ate setting, see for related to nonparametric least squares, is a widely used technique. We begin with an exposition of example, David Scott (1992, pp.149- the latter, then- explain its relationship to the 55). former. 690 Journal of Economic Literature, Vol. XXXVI (June 1998)

that have continuous derivatives up to 1 min - [y - Rc]' [y - Rc] s.t. c'Rc < L (2.16) order m (for expositional purposes we c T restrict these functions to the unit in- terval). A measure of smoothness that is where y is the Txl vector of observations particularly convenient is given by the on the dependent variable and R is a TxT Sobolev norm: matrix that is computable from xi, ..., XT. Note that even though one is estimating T parameters to fit T observations, the lIfIf Sob= [f 2 + (f ')2 + (f ")2 parameters are constrained so that there 1/2 is no immediate reason to expect perfect + ... + (f(m))2] (2.14) fit.14 Repeating the main point: the infi- where (in) denotes the mth order deriva- nite dimensional nonparametric least tive. A small value of the norm implies squares problem (2.15) may be solved that neither the function, nor any of its by solving the finite dimensional opti- derivatives up to order m can be too mization problem (2.16). Furthermore, large over a significant portion of the do- if x is a vector, the Sobolev norm (2.14) main. Indeed, bounding this norm im- generalizes to include various order plies that all lower order derivatives are partial derivatives. The optimization bounded in supnorm.13 (Recall from problem has the same quadratic struc- Section 1.2 and Figure 5 that even ture as in the one-dimensional case bounding the first derivative produces a above, and the functions r.17...IrXT as consistent nonparametric least squares well as the matrix R are directly com- estimator.) putable from the data xi, ..., XT. Suppose we take our estimating set 3 Our Returns to Scale in Electricity to be the set of functions in Cm for Distribution example, which we con- which the square of the Sobolev norm is tinue in Figure 10 below, illustrates a bounded by say L, that is, 3Z= {fe Cnl, nonparametric least squares estimator 2Iflob < L}. The task of finding the of f. We have imposed a fourth order function in 3 that best fits the data Sobolev norm (m=4 in equation [2.14]), would appear to be daunting. After all, which implies at least three bounded 3 is an infinite dimensional family. derivatives. As one can see, the non- What is remarkable is that the solutionf parametric least squares estimate is that satisfies very similar to the quadratic estimate of the scale effect. s= min4 , [yt-f(xt)] 2.2.2 Properties of Estimator15 The f t main statistical properties of the proce- dure are these: f is a consistent estima- s.t. 2If ob < L (2.15) tor of f, indeed low order derivatives of can be obtained by minimizing a quad- f consistently estimate the correspond- ratic objective function subject to a ing derivatives of f. The rate at which f quadratic constraint. The solution is of 14 the form f = 41 'trx,where rX., rXT are The rx, are called representor functions and R, the matrix of inner products of the rX,,the rep- functions computable from x, XT and resentor matrix. See Grace Wahba (1990) for de- e = (1, C--, T) is obtained by solving tails. An efficient algorithm for solving the optimi- zation problem in (2.16) may be found in Gene 13 That is, there exist constants Lz such that for Golub and Charles Van Loan (1989, p. 564). all frZ, sup, IJi) < L ,i = 1, ..., in - 1, where 3 is de- 15 These results may be proved using empirical fined momentarily. This result flows from the processes theory as discussed in Dudley (1984) Sobolev Imbedding Theorem. and Pollard (1984). Yatchew: Nonparametric Regression Techniques in Economics 691 converges to f satisfies equation (1.10). uses it in solving (2.19), the resulting f These optimal convergence results are will be identical.17 useful in producing consistent tests of a broad range of hypotheses. 2.3 Selection of Smoothing Parameter18 The average minimum sum of 2.3.1 Kernel Estimation. We now turn squared residuals s2 is a consistent esti- to selection of smoothing parameters mator of the residual variance y2 . Fur- for kernel estimators. If the bandwidth thermore, in large samples, S2 is indis- X is too large, then oversmoothing will tinguishable from the true average sum exacerbate bias and eliminate important of squared residuals in the sense that features of the regression function. Se- I lection of a value of X that is too small V/2(s2- , 2) - 0 (2.17) will cause the estimator to track the current data too closely, thus impairing in probability. Next, since T'/2(1/T I?t2 the prediction accuracy of the esti- -2) -4 N(0,Vai(?2)) (just apply an ordi- mated regression function when applied nary ), equation to new data (recall Figures 3 and 4). To (2.17) implies obtain a good estimate of f we would - 7T'/2(S2- (y2) N(0,Var(?2)) (2.18) like to select X to minimize the mean As we shall see below, this result lies integrated squared error: at the heart of demonstrating that non- 2 parametric least squares can be used to MISE(X)= EJ Lf(x;X)-f(x)j dx (2.20) produce T'/2 consistent normal estima- tors in the partial linear model. where we write Ax;X) to explicitly denote Finally, the estimator being consid- that the kernel estimator depends on the ered here is closely related to spline choice of X. Of course we do not observe function estimators. Assume for the mo- f so the MISE cannot be minimized di- ment > 0 is a given constant16 and fl rectly. Nor will selecting X by minimiz- consider the penalized least squares ing the estimate of the residual variance problem: I 2 62(X = [yt_( A; min - [yt-f(xt)] + IIf1ISob* (2.19) tt= 1 (2.21)

The criterion function trades off fidelity lead to a useful result-the minimum of to the data against smoothness of the (2.21) occurs when X is reduced to the function f. There is a penalty for select- point where the data are fit perfectly. ing functions which fit the data ex- However, this idea can be modified to tremely well but as a consequence are produce useful results. Consider a slight very rough (recall that the Sobolev norm variation on (2.21) known as the cross- measures the smoothness of a function validation function: and its derivatives). A larger il results in a smoother function being selected. CV(X)= T E [yt -f-t(Xt;X)] (2.22) If one solves (2.15), our nonparamet- t= 1 ric least squares problem; takes the La- 17 In their simplest incarnation, spline estima- grangian multiplier, say il associated tors use f(f ")2 as the measure of smoothness. See with the smoothness constraint; then Eubank (1988) and Wahba (1990). 18Cross-validation was first proposed for the 16 Actually, it is selected using cross-validation, kernel estimator by Clark (1975) and for spline a procedure which we discuss shortly. estimation by Wahba and Wold (1975). 692 Journal of Economic Literature, Vol. XXXVI (June 1998)

The only difference between (2.21) and - (2.22) is that the kernel estimator is sub- mnin , [yi -f(xi)] scripted with a curious "-t" which is used to denote that f-t is obtained by omnit- s.t. If| 2

Figutre 9. Selection of SmoothlinigParameters



36.0 - A

V 1 34.0-

32.0 -

30.0 -

1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

CV[,ll, <410-l,


95.0 -

94.0 -

A ? 93.0- v\/

,; 92.0-

& ; 91.0-

90.0 -

89.0 -

12.0 14.0 16.0 18.0 20.0

SOBLSQCVL[,ll, <'10-1>

- Data generating mechanism: yt =Xt + Et, Et N (0,.01), t=1,...25, where xt are equally spaced on the interval [0,11. Kernel cross-validationperformned using XploRe function regcvl, see Hardle, Klinke, and Turlach(1995, p. 88). Nonparametric least squares cross-validationperformned using Fortran code wvitteniby the author. 694 Journal of Economic Literature, Vol. XXXVI (June 1998) are available for kernel regression.21 parametric least squares or spline esti- However, by simply trying different val- mators is considerably more difficult ues for the smoothing parameter and and we have merely asserted them. visually examining the resulting esti- Such estimators are consistent and mate of the* regression function, one achieve optimal rates of convergence. can often obtain a useful indication of They also have considerable appeal whether one is over- or under-smooth- since it is relatively easy to impose addi- ing. tional structure on such estimators. (We Furthermore, cross-validation can be will devote attention to constrained es- automated relatively easily. The kernel timation and hypothesis testing in Sec- cross-validation function in Figure 9 tion 4 below.) was obtained using the regcvl in XploRe. Both local averaging and least S-Plus uses cross-validation to produce squares estimation require the selection its spline estimates, and other auto- of a smoothing parameter. In kernel es- mated procedures are also available.22 timation, it is the neighborhood (or bandwidth or window) over which aver- 2.4 Reprise aging is to take place. In nonparametric least squares it is the diameter of the This portion of the paper focussed ball of functions over which estimation on developing the kernel estimator (an is to take place. Selection of the example of a local averaging estimator) smoothing parameter is performed by and on a nonparametric least squares trying different values and selecting the estimator closely related to spline esti- one that minimizes (out-of-sample) pre- mation. The statistical properties of the diction error, a technique known as former are relatively easy to derive us- cross-validation. ing basic techniques. For example, it is It should be emphasized that in this straightforward to demonstrate that ker- section we have introduced but two nel estimators are consistent and ap- nonparametric regression estimators. A proximately normally distributed. If the wide variety of others exists (see for ex- estimator is to achieve the optimal rate ample Hardle 1990, Ch.3 and Jianqing of convergence, then its bias squared Fan and Irene Gijbels 1996, Ch. 2 for and variance must go to zero at the overviews). same rate, which complicates the con- struction of confidence intervals. De- 3. The Partial Linear Model spite the somewhat burdensome nota- tion, for example, equation (2.13), 3.1. Estimation kernel estimation of functions of several 3.1.1 Introduction. Given i.i.d. data variables is conceptually straightfor- (y1x1,Z1),..i ) . (yT,xT,zT) consider the semi- ward-for example, with a uniform parametric regression model which was kernel, averages are taken over a discussed in the opening section of the square (or cube) rather than over an in- paper: terval. Derivation of the properties of non- y = Z +f(x) + ?, (3.1) = = 21 For "rules of thumb" in a kernel density set- where E(y I z,x) zIP+f(x), 62 Var[y z,x]. ting, see Scott (1992, ch. 6). For alternatives to The function f is not known to lie in a cross-validation in nonparametric regression set- particular parametric family. An early ting, see for example, Simonoff (1996, p. 197 and references therein). and important application of this model 22 See Venables and Ripley (1994, p. 250). was that of Robert Engle, Clive Granger, Yatchew: Nonparametric Regression Techniques in Economics 695

John Rice, and Andrew Weiss (1986), estimator (2.15) can be applied equation who used it to study the impact of by equation. The sample weather on electricity demand. su2=fUA2/T, sv2= IA2/T are Tl/2 consistent Peter Robinson's (1988) influential asymptotically normal estimators of paper demonstrates that the parameter the corresponding population variances a can be estimated at parametric rates, y2, (using equation [2.18]). It can also that is, - , = Op(77?2),despite the pres- be demonstrated that StlV = IUA At/T is a ence of the nonparametric function f. T'/2 consistent asymptotically normal Specifically, Robinson rewrites (3.1) estimator of alv. In summary, the conditioning on x: sample analogue to (3.4), that is, the ma- trix of estimated variances and covari- y - E(y I x) = y - E(z I x)f -f(x) ances, is T'/2 consistent asymptotically = [z-E(z Ix)]p+?. (3.2) normal. Now = , so that it is fairly If E(y Ix) and E(z Ix) are known, then or- v3 straightforward to show that its sample dinary least squares on (3.2) yields an es- analogue , = is also T1/2consistent timate of 3 which is asymptotically nor- sUV/sS asymptotically normal. Furthermore, its mal with variance y2/Ty2 where (y2 is the variance is given by the same variance of the z conditional on x. Of 6y2/T(y2, variance attained by the Robinson esti- course, the regression functions E(y x) I mator. and E(z x) are generally not even known I Thus, inference may be conducted to have particular parametric forms. Ro- using - Alternatively, the binson then produces nonparametric N(p,62/T(y2). bootstrap may be used to obtain stan- (kernel) estimators of E(y x) and E(z x) I I dard errors and critical values (see that converge sufficiently quickly so that Mammen and Van de Geer 1995, and their substitution in the OLS estimator Yatchew and Bos 1997).24 does not affect its asymptotic distri- 3.1.3 The Differencing Estimnator bution.23 Revisited25. In the introduction to this 3.1.2 Nonparametric Least Squares. paper, we outlined an estimator of f3 Returning to (3.1), consider the condi- which involves reordering the data so tional distribution of y,z Ix where all that the x's are in increasing order, variables are scalars: then differencing to remove the non- z =E(z Ix)+u =g(x)+u parametric effect. The estimator is y=E(y Ix)+v=h(x)+v given by: = + + (g(x)03+f(x)) (up C) (3.3) 5dzff X(yt - yt-1) (Zt - Zt-1)

To simplify exposition we assume both (zt- Zt_1)2 conditional models are homoscedastic so that 24 We note that one can perform semiparametric least squares on the model (3.1) by minimizing the sum of squared residuals with respect to P and f subject to a smoothness constraint on f, but the CV L22j + (3.4) .U> v.] (T2P2 resulting estimator of P would not in general con- verge at T-V2.See Hung Chen (1988). 25 The idea of differencing to remove a nonpara- Under sufficient smoothness assump- metric effect in pure nonparametric models has tions, the nonparametric least squares been used by John Rice (1984), Yatchew (1988), Peter Hall, J.W. Kay, and D.M. Titterington 23 For general results of this nature, see Whit- (1990), and others. Powell (1987) and Hyungtaik ney Newey (1994). See also Linton (1995b) who Ahn and Powell (1993), among others, use the analyzes higher order properties of :. idea in the partial linear model. 696 Journal of Economic Literature, Vol. XXXVI (June 1998)

We asserted that Jif- N(P,1.52/ToU), ance 6csy2cs2.Thus, the ratio is asymptoti- which has 66.7 percent (1/1.5) efficiency cally normal with mean zero and vari- relative to the efficient estimator (for ex- ance given by 602y2 /(2oy)2 =1.5- 2/o ample, Robinson 1988). In this section We now have an explanation for the we sketch out the main idea of the proof 1.5 factor in the introduction to the of this result and provide an estimator paper (which comes as no surprise to based on higher order differencing practitioners). By differenc- which is asymptotically efficient. ing we have introduced an MA(1) Using (3.3) to substitute for yt and zt process in the residuals. Applying in IAdiffand then rearranging terms we to the differ- have enced data is, as a result, less efficient. (But beware, do not attempt to apply T'12(5ditff- generalized least squares for then you will reverse the beneficial effects of For/2 Y(xt) -ffxt-i) _t+ - Et-1) differencing.) Thus the simplicity of dif- ferencing is purchased at the price * 9 (Xt-1) + Ut - Ut-1)1 Wxt) of some lost efficiency relative to the kernel-based procedures or the non- - - / D,(gxO g(xt_1)+ ut ut_1)2 (3.6) parametric least squares procedures For simplicity, consider the case where above. the xt are equally spaced on the unit in- Fortunately, efficiency can be im- terval. We assume that first derivatives proved substantially by using higher or- of f and g are bounded by L . Then: der differences (see Yatchew 1997). Fix the order of differencing m. Consider now the following generalization of the T1'/24 X(f(xt) - g(xt-)) -f(xt-1))(g(xt) estimator in (3.5):

T'/2 L2_ < T XL2lxt - xt_112=L2 (3.7) in in and E,jyt_ -f2dzt-j t j= f j=O FT'/21 ut(f(xt) Var [ T -f(xt-i))j ~diff 2 (3.10) in = Xt)-f(Xt_1)2< T2 (3.8) E Idjzt j J=0 Thus (3.7) and (3.8) converge to 0. Using similar arguments one can show that all whlere do,...,dm are differencing weights terms in (3.6) which involve satisfying the conditions lYondi= O and (f(xt)-f(xtij)) or (g(xt)-g(xtij)) converge l;ond = 1. The first condition ensures to 0 sufficiently quickly so that (3.6) is that differencing removes the nonpara- approximately metric effect in large samples. The second ensures that the variance of TT/2'(Ip( )t - Et-1)(ut - ut-1) the residual: in the transformed model - = is the same as in the original model. T112(Rdiff P) 1 4X(ut - ut-i)2 (39) (Thus, the weights for the simplest dif- ferencing estimator (3.5) which were The denominator converges to 2oy and do= 1, di = -1, would be normalized to the numerator has mean zero and vari- do= .7071 di=-.7071. ) Yatchew: Nonparametric Regression Techniques in Economnics 697


1 (0.7071,-0.7071) 2 (0.8090, -0.5000, -0.3090) 3 (0.8582, -0.3832, -0.2809, -0.1942) 4 (0.8873, -0.3099, -0.2464, -0.1901, -0.1409) 5 (0.9064, -0.2600, -0.2167, -0.1774, -0.1420, -0.1103) 6 (0.9200, -0.2238, -0.1925, -0.1635, -0.1369, -0.1126, -0.0906) 7 (0.9302, -0.1965, -0.1728, -0.1506, -0.1299, -0.1107, -0.0930, 0.0768) 8 (0.9380, -0.1751, -0.1565, -0.1389, -0.1224, -0.1069, -0.0925, -0.0791, -0.0666) 9 (0.9443, -0.1578, -0.1429, -0.1287, -0.1152, -0.1025, -0.0905, -0.0792, -0.0687, -0.0588) 10 (0.9494, -0.1437,-0.1314,-0.1197, -0.1085, -0.0978, -0.0877, -0.0782,-0.0691, -0.0606, -0.0527)

In contrastto those in Hall et al (1990), the aboveoptimal weight sequencesdecline in absolutevalue towardszero.

If the weights do0...d. are chosen op- Define the following: timally, it can be shown that Sdiff= 4x(Ay t - AztjdifJ2 (3.13) fd ff- N j( (1 +2+ 4 (3.11) A -AZ'AZ (3.14) By increasing the order of differencing T from 1 to 2 to 3, the efficiency of the Then sdiff converges to E and lu,diff con- estimator relative to the Robinson proce- verges to 1,, the conditional dure improves from 66.7 percent matrix of z given x. If x is a vector, then re- (=1/1.5) to 80 percent (=1/1.25) to 85.7 ordering so that xt and xt-1 are close is not percent (= 1/1.167). Optimal differencing unique. Nevertheless, for reasonable order- weights do not have analytic expres- ing rules, the procedure works as long as sions but are tabulated (up to m = 10) in the dimension of x does not exceed 3. Table 1. 3.1.4 Examples. We have already in- Suppose z the parametric variable is troduced an example on Returns to a vector, but x the nonparametric vari- Scale in Electricity Distribution (Fig- able continues to be a scalar. Fix the ure 1). We extend it here to include order of differencing m. Select the dif- second order optimal differencing and ferencing weights do0..dm optimally (as Hal White's (1985) indicated above). Define Ay to be the consistent standard errors (Figure 10). (T-m)xl vector whose elements are As an additional example with a two [Ay]t = YjZodjyt y and AZ to be the dimensional nonparamnetric variable x, (T-m)xp matrix with entries [AZ]ti= we consider Hedonic Prices of Housing 1el= djzt i. Then, Attributes (Figure 11). Parametric variables include lot size, area of living space and presence of various fdiff = [AZ'AZL'AZ'Ay amenities. The location effect, which NI,{1 + 2 (l31) J has no natural parametric specification, is incorporated nonparametrically. 698 Journal of Economic Literature, Vol. XXXVI (June 1998)

Figure 10. Returns to Scale in Electricity Distribution (Colntinued)


PARAMETRIC:Quadratic scale effect SEMIPARAMETRIC:Smooth scale effect y=a + Zf +71X +y2X2 +8 E Y = a + Z +f(X) + ? OLS First Order Differencing 2 Second Order OptimalDifferencing3 Coeff SE Coeff SE HCSE Coeff SE oc 15.987 37.002 DIST 3.719 1.248 ADIST 2.568 1.560 1.890 ADIST 3.998 1.377 LOAD 1.920 .661 ALOAD .437 .912 .854 ALOAD .562 .814 GEN -.051 .023 AGEN .0005 .032 .030 AGEN -.004 .028 LIFE -5.663 .798 ALIFE -4.470 1.072 .899 ALIFE -5.107 .955 AsSETS .037 .005 AAsSETS .030 .0072 .006 AAsSETS .030 .0064 WAGE .003 .0007 AWAGE .003 .00086 .00074 AWAGE .003 .00078 'Yl -.00154 .00024 2 lX.1X10-7 .14X10-8 R2 .45 S,L 1259.751' 1249 7 2 1255.4 s~~~~~~diff129.sdiff 15.


, 200 - Kernel __ ~ ~ ~ ---Nonparmn LS i H^ 150 1 A- _ _ 0 H~~150 - - - - ~SplineQuadratic /1

= 1000- o 0 o 0 U ~ ? ? 0 0 b ~~~~~~~~o

V 0 0 O 0 0 0 0 ? Z 0 00000~~~?cOO o%0 0 00 00 %Co0 0008a ? 8 ? 0 ?

0 o 0 00 X0 G 0 0 0- 0' '

_ I ~ ~I l I I I 500 1 0 5000 10000 50000 100000 CUSTOMERS

S22 Under the null hypothesisthat scale has no effect (fis constant),sres = 1508.5. 2Data reordered so that x1 < ... < XT, (T=265). Then all variablesdifferenced A wt = wt - wt1l, followed by ordinaryleast squares. SE's obtained by multiplying OLS standarderrors by x'T5, (set mn=1in equation [3.12]). HCSE's are heteroscedasticityconsistent standarderrors. Dependent variableis A COSTPER CUST.3Data reordered so that x1 < ... < XT, then all variablesdifferenced A wt = .809wt- .5wt - .309wt-2, followed by ordinaryleast squares. OLS standarderrors multipliedby V12 (set m = 2 in equation [3.12]). Dependent variableis A COSTPER CUST. 4Kerneland spline estimates obtained using ksmoothand snooth.spline functions in S-Plus. Dependent variableis Yt- Zt diff where diff is calculatedusing first order differencing. Nonparametricleast squares applied using fourth order Sobolev norm. The quadraticestimate is from the parametricmodel above.

3.2 Reprise y = zf3+f(x) + e. Three distinct estima- tors of f3 have been discussed-the first In this portion of the paper, we have two involve initial nonparametric esti- focussed on the partial linear model mation using either kernel or nonpara- Yatchew: Nonparametric Regression Techniques in Economics 699

Figure 11. Hedonic Plices of Housing Attributes

The semiparametricmodel was estimated by Michael Ho (1995) using semniparametricleast squares;the dependent variable y is SALEPRICE; z variablesin clude lot size (LOTAREA),square footage of housing (USESPC),number of bedrooms (NRBED), averageneighborhood income (AVGINC),distance to highway (DHwY),presence of garage (GRGE),fireplace (FRPLC),or luxuryappointments (Lux). A criticaldeterminant of price is locationwhich has no naturalparametric specification (indeed most urban areashave multi-modallocation premnia),thus the inclusion of a nonparametricfunctionf (xl,x2) where x1,x2are location coordinates.The data consist of 92 detached homes in the Ottawaarea which sold during 1987.


PARAMETRIC:Linear location effects SEMIPARAMETRIC:Smooth location effect y= C+Z+71+1y+Y2X2+Y y=Z+f(Xl,X2) + OLS First Order Differencing 2 Second Order OptimalDifferencing3 Coeff S E Coeff S E HCSE Coeff S E oc 74.0 18.0 FRPLC 11.7 6.2 AFRPLC 11.3 6.0 5.7 AFRPLC 8.4 6.0 GRGE 11.8 5.1 AGRGE 2.5 5.5 4.1 AGRGE 7.3 5.1 Lux 60.7 10.5 ALux 55.3 10.8 15.4 ALux 52.6 10.3 AVGINC .478 .22 AAVGINC .152 .35 .22 AAVGINC .10 .29 DHWY -15.3 6.7 ADHWY -5.0 15.9 9.2 ADHWY -1.4 11.2 LOTAREA 3.2 2.3 ALOTAREA 4.1 2.3 2.0 ALOTAREA 5.6 2.2 NRBED 6.6 4.9 ANRBED 3.3 4.9 3.6 ANRBED 2.5 4.5 USESPC 21.1 11.0 AUSESPC 36.5 12.0 9.3 AUSESPC 36.5 11.1 'Yl 7.5 2.2 7'2 -3.2 2.5 R2 .62 424.3' Sdiff 309.2 Sdiff 324.8


1 Under the null that location has no effect, (f is constant),s2 =507.4. 2Data reordered using nearest neighbor algorithm for (x1t,x2t),t= 1,...,T (T=92). Then all variablesdifferenced A wt = wt - wt-1,followed by ordinaryleast squares. Standard errorsmultiplied by Vi7f (set m=1 in equation [3.12]). HCSE's are heteroscedasticityconsistent standarderrors. Dependent variableis A SALEPRICE. 3Data reordered accordingto nearest neighbor algorithm,then all variables differenced A wt = .809wt- .5wtvl- .309wt-2,followed by ordinaryleast squares. OLS standarderrors multiplied by V12 (set m = 2 in equation [3.12]). Dependent variableis A SALEPRICE. 4Smoothed estimate obtained using loess function in - S-Plus. Dependent variableis Yt Zt diffwhere ldifcis calculatedusing first order differencing. 700 Journal of Economic Literature, Vol. XXXVI (June 1998)

metric least squares methods. The third geneity of degree zero), that a propor- estimator, which is based on differenc- tionate increase in all inputs will in- ing, circumvents this initial step. Each crease output by the same proportion estimator has a variance that shrinks to (constant returns to scale or, equiva- zero at 1/T (indeed, each is T'/2-consis- lently, homogeneity of degree one), that tent) as well as being asymptotically the effect of one explanatory variable normal. The relative efficiency of the does not depend on the level of another differencing estimator improves with (additive separability), that the relation- higher order of differencing, if "opti- ship possesses certain curvature proper- mal" differencing weights are used. ties such as concavity or convexity or The partial linear model discussed that observed consumption patterns re- above is part of the much larger class of sult from optimization of utility subject semiparametric models. The most to a budget constraint (the maximiza- closely related are partial parametric tion hypothesis). models, y = g(z;f) +f(x) + F, where ,u is a Empirical investigation is then re- known function but f and f3 remain un- quired to assess whether one or another known. Each of the techniques de- variable is significant or whether a par- scribed above may be used to obtain T'/2 ticular property holds. In parametric re- consistent estimators in partial paramet- gression modelling, a functional form is ric models. selected and properties are tested by An important class of semiparametric imposing restrictions on the parame- models is index models which are of the ters. However, rejection of a hypothesis form y =f(xf3) +F; x and f3 are vectors. may be a consequence of the specific Both f and 3 are unknown (x: is the 'in- functional form that has been selected dex', in this case linear); T'/2 consistent (but not implied by economic theory). estimation of ,B is also possible (see Thus, while the translog production Hidehiko Ichiinura 1993, as well as function is richer and more flexible Roger Klein and Richard Spady 1993). than the Cobb-Douglas, it may not Other important references on semi- capture all the interesting features of parametric models include Stoker the production process and may indeed (1991), Gary Chamberlain (1992), An- lead to incorrect rejection of restric- drews (1994b), Newey (1994), Linton tions. Nonparametric procedures, on the (1995a), and Joel Horowitz (1996). other hand, provide both richer families of functions and more robust tests for 4. Constrained Estimation and assessing the implications of economic Hypothesis Testing theory. Within this framework it is also possible to test whether a 4.1 Introduction specific para- metric form is adequate. Economic theory rarely dictates a In the following sections, we there- specific functional form. Instead, it fore focus on the imposition of addi- typically specifies a collection of poten- tional constraints on nonparametric re- tially related variables and general func- gression estimation, such as separability tional properties of the relationship. and monotonicity, and on testing of hy- For example, economic theory may im- potheses, particularly specification and ply that the impact of a given variable is significance 26 positive or negative (monotonicity), that 26We will not consider tests on the stochastic doubling of prices and incomes should structure of the residuals, such as heteroscedastic- not alter consumption patterns (homo- ity or . Yatchew: Nonparametric Regression Techniques in Economics 701

However, before proceeding, we pro- provides a fit strikingly similar to all vide some standardized notation (the three nonparamnetricestimates (indeed, ideas are illustrated graphically in Fig- it tracks the nonparametric least ure 12). Begin with the true model: squares estimate very closely). This is y =f(X) + E. (4.1) reinforced by the observation that the estimate of the residual variance does We will maintain that f lies in the set 3 not decrease markedly as one moves which is a smooth set of functions. We from the completely will want to estimate f subject to con- (S2es = 1259.75) to the differencing esti- straints of the form f E Z c 3 where the mator (sd2if=1249.7). Our objective will be set 3 combines smoothness with addi- to parlay these casual observations into tional functional properties. We denote a formal test of the quadratic specifica- the restricted estimator as fres with cor- tion.28 responding estimated residual variance: 4.2.1 A Simple Differencing Test of s2 IILS t il\2. Specification. To motivate our first and 8res jres(Xt) (.(4.2) T simplest test we return to the idea of Our general null and alternative hy- differencing, but this time in a purely potheses will be of the form: nonparametric setting. Suppose one is given data (yi,xl)... (yT,,xT) on the model Ho: fe Z (4.3) y =f(x) +F- where all variables are sca- lars, Et are i.i.d. with mean 0, variance oTF, and independent of x. The x's are We will assume that Aresconverges to drawn from a distribution with support, some functionf in the restricted set S.27 say the unit interval, and once again When the null hypothesis is true,f andf we have rearranged the data so that are identical and the re- (since fe 3), x1 < ... < XT. If we first difference to ob- stricted estimator converges tof=f. tain yt - yt-i =f(xt) -f(xt-i) + Et - Et-i and One final important notational con- define vention. Since certain tests will depend on the difference between the true re- 8kff= 2TX (yt-yt-i)2, (4.5) gression_function and the closest func- tion in 3, we will reserve special nota- then as the typical distance between xt tion for it. In particular: and xt-i goes to zero, s2i converges to GE. Suppose under the null, the regres- fA =f-f (4.4) sion function is hypothesized to be quad- If the null hypothesis is true, fA = 0. ratic and we obtain an estimate of the variance 4.2 Specification Tests A sres =2.d (yt - A- AL -2X)2. To fix an objective, let us return mo- (4.6) mentarily to Figure 10, Returns to Scale If the null hypothesis is true, then in Electricity Distribution. Four esti- mates of the scale effect are illustrated: Tl ^ 2 N(0 1). (4.7) kernel, nonparametric least squares, sdiff spline, and a simple quadratic model. Inspection suggests that the quadratic 28 There is a huge literature on specification testing. See James MacKinnon (1992) and White 27 For example, f could be the closest function in (1994). In this section we focus specifically on 3 tof in the sense thatf satisfies minf(g -_f )2 d.x. tests where the alternative involves a nonparamet- grZ ric component to the regression function. 702 Journal of Economic Literature, Vol. XXXVI (June 1998)

Figure 12. Constrained and Unconstrained Estimation and Testing





3 is the unrestricted set of functions, Z is the restricted set of functions. Let f be the closest function in Z3to the true regression functionf. If Ho is true, thenf lies in 3 andf =f If Ho is false, then the differencefA =f -f?# 0.

If the null is false, V grows large.29 only in a pure nonparametric setting, but The test procedure may be applied not also to the partial linear model 29 Higher order differencing may also be used y = zf +f(x) + c. To test the quadratic to obtain s2 . See also Plosser, Schwert, and specification for the scale effect in elec- White (1982) who propose specification tests that tricity distribution costs, obtain s2 by are based on differencing but that do not involve reordering of the data. regressing y on z, x and x2. ObtainSdiff Yatchew: Nonparametric Regression Techniques in Economics 703

Figure 13. Differencing Test of Specification - Implementation

HYPOTHESES: given data (y1,xl). .(YT,XT) on the model y =f(x) + ?, test parametricnull against nonparametricalternative. Below, the parametricnull is a quadraticfunction.

STATITIC: T"2(S2(2 - s2 ) ST A T S TI C V = T 112 res s0ffd iff - N (0,1I) u n d e r H0


1. Reorder data so that x1 < ... < XT I 2. Calculates2= I (Yt-Yt_1)2 3. Perform restricted regression of y on x and x2 to obtain jo + i Xt + Y2Xt2. I 4. CalculateS2 = 1 - esT (YYt YoY' X07t-Y2X). X Xt2)2 5. Calculate V and perform one-sided test comparing to critical value from a N(O,1)

Applicationto Returns to Scale in Electricity Distribution (See Figure 10). The model is y = z +f(x) + ? and we wish to test a quadraticspecification for f. First, obtain a T1"2consistent estimator of f such as the differencing estimator of equation (3.12). Calculate yt-ztp and apply the above procedures to this newly defined dependent variable. Using S-Plus we obtain V = .131 which supports a quadraticmodel or the scale effect. To test significance of the nonparametricvarable specify a constant function forf. In this case we obtain V = 3.37 so that the scale variableis significant.

_ _~~~~~~~~~~~~~~~~~~~(_6 _2 6_ CtC Notes: The derivationof the distributionof V is straightforward.In large samples, Sjff- 2T (_t- t-1)2 4-T t f- I and S2 1 E p2 Hence, the numeratorof V is approximatelyT112 T E ?tet wichwl using a central hmit theorem is (O,-o4)re The denominatorconverges 2 toT s N(0,U4). The denominatorconverges to a2. Test may be genieralizedusing higherwhich order usigr differencing centimi to theoremobtain 2df by applying equation (3.13). The value parametric null. This test may be moti- of the test is V=.131 in which vated by writing case the quadratic model appears to be adequate. (See Figure 13 for an imple- y -flx) =f(x) -f(x) + E =fA(x) + E. (4.8) mentation summary.) We assume that the restricted regression A test of significance of the nonpara- estimator fres estimates f consistently metric variable x is a special case of the and note that if the null hypothesis is above procedure-in this case the null true, that is, if f E Z3then fA = O. Thus, if hypothesis is that f is a constant func- we do an "auxiliary"regression of the esti- tion. Regressing y only on z we obtain mated residuals yt -fres(xt) on xt to esti- S2es = 1508.5 in which case V=3.37 so mate fA and perform a significance test, that scale is a statistically significant then we will have a test of the null hy- factor in the costs of distributing elec- pothesisfe (.30 tricity. In fact, the moment condition that The procedure described above may we will consider involves a slight vari- be applied if x is two- or three-dimen- ation on this idea. Consider the follow- sional, but more sophisticated rules for ing: reordering the data so that the x's are close, are required. The test is gen- Eg,x[(y -f (x))fA(x)px(x)] erally not valid if the dimension of x ex- = Exf A (x)px(x) ? 0 (4.9) ceeds 3. 4.2.2 A Conditional Momnent Test of 30 Specification. Qi Li (1994) and John Xu This idea, of course, is not new and has been exploited extensively for purposes of specification Zheng (1996) propose a conditional mo- testing in parametric regression models (see for ment test of specification against a example, MacKinnon 1992). 704 Journal of Economic Literature, Vol. XXXVI (June 1998) where px(x) is the density of x. Recall The test generalizes readily to the that the numerator of the kernel estima- case where x is a vector and the boot- tor in equation (2.2) consistently esti- strap may be used to obtain critical val- mates the product of the regression ues. function and the density of x. Thus, a 4.2.3 Other Specification Tests. Bier- sample analogue for the left expression ens (1990) considers a moment condi- of (4.9) may be obtained by calculating tion of the form

U= (yt -fres(Xt)) K>xT(ys EF,x[e'rx(y-f(x))

= Ex[erx(ftx) -f (x))] (4.13) -fres(xs))K r tjjj (4.10) which for (almost) any real number t does not equal 0 if the null hypothesis is To interpret this expression, note that false. (The expressions equal 0 if the null the term in square brackets may be is true since f=f.) Bierens proposes a thought of as an estimator of fA(xt)px(xt). test based on a sample analogue of the If the null hypothesis is true, then left expression. Hardle and Mammen (1993) base their T2I/2U N(O, 2a4 I p2(x)f K2(u)), (4.11) specification test on the integrated squared a2 difference IT = I (fres(X) - funr(X))2 dx where where = Var(U) = 2a4 f p2(x)f K2(u)/XT2 A may be estimated using funr, the unrestricted estimator, is a kernel estimator of f. The restricted estimator 2 A2 E (yt -fre.s(Xt))2(ys fres is (for technical reasons a smoothed version of) the parametric estimator of = A~2T4 ?f_e(- KKX XtJ(4.12) f. In simulations, Hardle and Mammen t~ t find that the nornmal approximation to the distribution of their test statistic is so that U/a N(0,1). Both U and au are substantially inferior to bootstrapping relatively straightforward quadratic forms the critical values of the test statistic. which involve the residuals from the They denmonstrate that the "wild" boot- parametric regression. strap (see Appendix B) yields a test pro- Returning to our example on Returns cedure that has correct asymptotic size to Scale in Electricity Distribution, under the null and is consistent under Figure 14 plots the estimated residu- the alternative. (They also demnonstrate als from the regression of y on z, x and that conventional bootstrap procedures x2. Casual observation suggests that there fail.) The test can be applied to circum- is no further relationship between y stances where x is a vector and where E and x. is heteroscedastic. Testing the quadratic specification Yongmiao Hong and White (1995) formally, we obtain U/u =.168 so that propose tests based on series expan- the null hypothesis is not rejected. As sions, in particular the flexible Fourier with the differencing test, we can per- form (Ronald Gallant 1981). The unre- form a nonparametric test of signifi- stricted regression model is given by cance of the x variable by inserting a constant function forf in the restricted f(x) = 6o + 61x + 62x2 model. In this case, U/8u= 2.82, indi- ?lT cating significance of the scale variable. + jyicos(Jx)X + y2jsin(jx) (4.14) j=1 Yatchew: Nonparametric Regression Techniques in Economics 705

Figure 14. Conditional Moment Test of Specification- Implementation

HYPOTHESES: given data (Y1,X1)... (YT,XT)on the model y =f(x) + ?, test parametricnull against nonparametricalternative. Below, the parametricnull is a quadraticfunction.

STATISTIC: We implement using the uniform kernel so that fK2 = 1. With only slight abuse of notation, let Kstbe the st-th entry of the kernel matrixdefined below.

- - - UYt-i0-1Xt-i2Xt = , YoE YIX, ,12 4J2X)St 2

TEST PROCEDURE 2 ~~~~~~2 1. Perform (restricted) regression y on x and x2 to obtain o - Yi St - Y2Xtr 2. Calculate the kernel matrix Kstas follows: Kst= 1 if Ix, - xtI < k/2 s?t (note that diagonal elements Ktt =0) Kst = 0 othervise 3. Calculate U 4. Define (u = Var (U) = 2 J4fp2(X)/kT2and estimate it using - au = T422 S (Yt - Yo - 71 Xt - 72 Xt)2(yS -70 - y1x5 - 2 x2 (Note that since we are using the uniform kernel, K2 = K,t) 5. Perform a one sided test comparing U/lu to the criticalvalue from the N (0,1).

APPLICATION TO RETURNS TO SCALE IN ELECTRICITY DISTRIBUTION: (See Figure 10). Under the alternative,the model is y = z4 +f (x) + ? which we wish to test againsty = 4 + Yo + YI x + y2x2 + ?. Heuristically,we want to perform a nonparametricregression of the estimated residuals from the latter on x. Visual examinationof these residuals (below) suggests no relationshipwith x. Implementing in S-Plus we obtain U/du= .168 indicating acceptance of a quadraticmodel for the scale effect. To test significance, specify a constant function forf In this case we obtain U/lu = 2.82 so that the scale variableis significant.


150- 0

100 - 0 0 0 0

o0 0 0 0 - 0 (b0 00 0 0 50 50 0 ~~0o0 0 ? ? (o? o 0 0 0 0 0 050 - O 0 0 0 00 0 00 0c o06b0 0 0 000 0 0 O O 0 00 - 00?o o ?2 o~Q ooX 0?9 o Go0 00 8000000 o0 a 0 000 Oo00 fP00o0 0 0 00 ~ OOoO 0a 000 0 ooo' 0 00 00 0Io 0ooo 0 0 000 0 000 0 500 0 0 00 00 0000 0 00 0 0 0 00 0 _ _ 00 -50 0 0o 0 00 00 0000%00 000 00 0 00 0 0

-100 0 500 .1000 5000 10000 50000 100000 CUSTOMERS where the number of unknown coeffi- ing tested. The test statistic is based on cients nT= 3 + 2n* increases with sample the difference between s2 the estimate size. The rate at which nT is permitted to of the residual variance obtained from grow depends on the null hypothesis be- (4.14), and s2S which is obtained by esti- 706 Journal of EconomnicLiterature, Vol. XXXVI (June 1998) mating the parametric model under the between a more rapid rate of conver- null. gence and a loss in the richness of the Other procedures not discussed here set of functions that can be approxi- but worthy of note include those of mated.32 For example, if xa and Xb are A. Azzalini, Adrian Bowman and Hardle scalars, then the optimal rate of conver- (1989), Eubank and Clifford Spiegel- gence that a nonparametric estimator man (1990), B.J. Lee (1991), Jeffrey can achieve is the same as if f were a Wooldridge (1992), Pedro Gozalo function of only one variable. On the (1993), Yoon-Jae Whang and Andrews other hand, the additive model does (1993), and Horowitz and HWrdle not encompass such relatively common (1994). specifications as the multiplicative model 4.2.4 Significance Tests. The simplest f(xa,Xb) = Xa ' Xb. nonparametric significance tests are An important and useful feature of those where the null hypothesis is of additive models is the ease with which the form f(x) = g1;g is a constant. In this the effects of different variables may be case, the null model is entirely paramet- summarized. Consider, for example, per ric and so all specification testing meth- capita gasoline consumption as a func- odologies described above immediately tion of per capita disposable income yield tests of significance of this kind. If and the price of gasoline. If the two ef- x is a vector, then this null hypothesis fects are indeed additive, then they may corresponds to testing the joint signifi- be represented on separate graphs. As cance of all the explanatory variables.31 may be seen in Figure 15, price has a What is more challenging (and much strong negative effect on demand, while more useful) is the derivation of tests of income a strong positive effect. (The significance for a subset of the explana- data are from Jeffrey Simonoff 1996, tory variables. In this case, the null hy- Appendix A.) pothesis involves a nonparametric com- 4.3.1 A General Estimation Algo- ponent and so the above tests can no rithm-Backfitting. A powerful and longer be directly applied. (For exam- general algorithm used to estimate ad- ple, suppose one is estimating the ditively separable models is motivated model y =f(xl,x2) + F and the null hy- by the observation that pothesis is that y =f(xi) + e.) However, the conditional moment test may be ex- E[y-fa(Xa) I Xb] =fb(xb)

tended to this case. (See, for example, and E[y-fb(Xb) I Xa] =fa(xa). (4.15) Fan and Li 1996.) If f is a good estimate Offa then we may 4.3 Additive Separability estimate fbA by nonparametric regres- The curse of dimensionality which sion of y -fa(xa) on Xb. (A parallel argu- haunts nonparametric estimation has ment holds for estimation Offa.) Begin- focussed attention on improving the rate ning with these observations, the algo- of convergence of nonparametric esti- rithm in Table 2 has been widely mators. Additive models which in the studied. The initial estimates fa may simplest case are of the form f(Xa,Xb) = be set to zero or to the estimates from a parametric procedure (such as a linear fa(Xa) +fb(xb) provide a useful compromise regression). 31 See Stoker (1989) who, in addition to propos- ing tests of significance, develops tests of symme- 32 See for example, Stone (1985, 1986); Andreas try and homogeneity, and general tests of additive Buja, Trevor Hastie, and Robert Tibshirani (1987); derivative constraints. and Hastie and Tibshirani (1987, 1990). Yatchew: Nonparametric Regression Techniques in Economics 707

Figure 15. Gasoline Consumption

MODEL: given the model y =.fa(xa)+fb(xb) + 6, where y is per capita gasoline consumption, Xa, Xb are per capita income and the real price of gasoline. Data are from Simonoff (1996). The procedure uses the back-fittingalgoiithm applying spline estimation to each component. Estimiationis performed using the gam function in S-Plus. Partialresiduals are given by Yt-fb(xbt) and yt -fa(xat) respectively. Bands represent point-wise standard errors.


l l l l l l l l l l l l l l l ll l l l l l111 1 11 1 1


4'~~~~~~~~~~*_o? ' *~~~~~~~~~~~~~~~~C

The procedure may be generalized in mated regression function equals the the obvious fashion to additively separa- rate of convergence of the component ble models with more than two additive with the largest number of explanatory terms where each term may be a func- variables. tion of several variables. Parametric In performing the component non- components may also be included. As- parametric regressions, a variety of suming that optimal nonparametric esti- techniques may be used, including ker- mators are applied to each component, nel and spline estimation. Indeed, the the rate of convergence of the esti- algorithm is particularly versatile in that 708 Journal of EconomnicLiterature, Vol. XXXVI (June 1998)

TABLE 2 4.4 Monotonicity THE BACKFITTINGALGORITHM Economnic theory often leads to pre- dictions that a certain effect should be Initialization Select initial estimatesf , fj positive or negative. (Indeed, one of the Iteration Obtainf by nonparainetiic most common complaints expressed by regression of y -f 'b-(xb) cni xa Obtainf by nonpaarametric students in econometrics is that despite ' regression of y -f (xa) on xb all their efforts, coefficient estimates Convergence Continue iteration until there is are of the wrong sign.) The imposition little change in individual and testing of inequality constraints on function estimates coefficients of parametric models is a well-studied problem (see Frank Wolak Note: See Hastie and Tibshirani (1990, Ch. 4 and 1989 and references therein). The in- references therein). position of monotonicity restrictions in nonparanietric regression has also been studied extensively. The isotonic regres- different techniques may be selected sion literature, in the simplest case, for different components. For example, considers least squares regression sub- fa may be estimated using kernel regres- ject only to monotonicity constraints: sion and fb using nonparametric least that is, given data (yixi) ... (YT,XT) on the squares (or even nonparainetric least model yt =f(xt) + ?t, solve squares subject to constraints). The al- gorithm is available in S-Plus using _ the function mi-inT1' (yt- t)2 gain (generalized additive (Tt model) which we have applied in Figure 15. Ys At for xs 2. mination of the closest set of mono- If the true mean is, say, 1.5, then as tonic points to the smoothed points. sample size increases, the probability That is, given data (yi,xi),...., (yT,xT), let that the unconstrained estimator equals (IJx)O.(, f(T,xT) be the set of points ob- the constrained estimator goes to one. tained by applying a kernel estimator, Thus, the constraint becomes nonbind- then solve ing. m In nonparametric regression, an Al (y-t)2F analogous result holds. If the true func- gL1,.,,OTT t tion is strictly monotone (that is, if the YS-?t for x,

Each summation will have approximately

A TI(ytztf3)xt kT terms, so that in each case we are cal- 6= culating a simple average. Recall that the 1x2 term involving the second derivative cor- T1 1 responds to the bias (see Section 2.1), the next corresponds to the variance TEX? A TEXZ term, and the last is the term arising out =+ 1 + P-). 1 of the two-stage estimation procedure. T X2 Txt2 Following (2.5),

+ Op('/2) + Op(/2) (1) (5.1) f(XO) -fl(xo) =+Op(l) 0P(1)' = O(X2) + Op((kT)-'/2) + Op(f/2)Op(1) There are two random terms, each of = O(T-2/5) + Op(T-2/5) + Op(T'1/2)0p(i) the same order (the first is due to the if X = O(T-1/5). (5.3) variation in C; the second results from using D instead of and if one esti- ,B) Consistency of the kernel estimator is mates then one is in- Var(8) using c/Xx, unaffected, since all three terms con- the first one. corporating only Indeed, verge to zero (we continue to require two-stage estimation in purely paramet- k -4O, T X-oo). The optimal rate of ric models often involves adjusting sec- convergence is unaffected-k = O(T-1/5) ond standard errors based on a stage still minimizes the rate at which the first-stage procedure. (sum of the) three terms converge to The partial linear model not suf- does zero. The order of each of the first two fer from this additional burden. Essen- terms is Op(T-275), while the third termn tially, the reason is that the parametric converges to zero more quickly and inde- portion of the model more converges pendently of A. Confidence intervals may quickly than the nonparametric portion also be constructed as before (see 2.9), so that of the analysis nonparametric since portion can proceed as if the parametric portion were known. To see this a little more precisely, suppose we perform a (U)/2(f(Xo) -f(xo)) kernel regression of Yt - z4J on xt where = O((QT)'/2X2) + Op(l) + Op(kl/2) (5.4) we assume uniformly distributed x's and = = the uniform kernel. Then, recalling 0(1) + Op(l) + Op(T-1110) if X O(T-1"5) (2.3) we have and the third term goes to zero, albeit slowly. If the optimal bandwidth k= f( ) E Yt zt O(T-1/5) is selected, then confidence in- N(x0) tervals must correct for a bias term. Returning to nonparametric least squares estimation (Section 2.2), if we = f(xt)+ Xt+ )Zt1 A - kTN~(X7 kTN(X%) kTN(Xo) regress y* = yt zt4 on xt, the estimator f remains consistent and its rate of con- -f(xo) + 1/2f"t(xo) AT X(Xt - Xo)2 vergence is unchanged. N(xo) The practical point is that we can separate the analysis of the parametric + Xett +(- T) XZt. (5.2) portion of the model from the analysis (xo) N(x,) of the nonparametric portion. Given a 712 Journal of Economic Literature, Vol. XXXVI (June 1998)

T'l/2-consistent estimate of 1, we may tion. Other restrictions which may be construct the new dependent variable, driven by considerations of economic A yt= - Zt4, set aside the original yt, and theory and which enhance convergence analyze the data (y*,xt) as if they came rates are additive separability and semi- from the pure nonparametric model parametric modelling. Monotonicity yt =f(xt) + Et. None of the large sample and concavity restrictions do not en- properties that we have discussed will hance the (large sample) rate of conver- be affected. This holds true regardless gence if sufficient smoothness is im- of the dimension of the parametric vari- posed, but are beneficial in small able z. samples. Alternatively, their presence Where does this leave us? Since ft de- can be used to reduce the dependency pends on A, variation in the latter will on smoothness assumptions. affect variation in the former. The argu- Figure 16 illustrates the conse- ments in this section merely state that quences of imposing progressively more in large samples, the impact of the vari- stringent restrictions on a model which, ation in 3 is small relative to the impact unbeknownst to the investigator, is lin- of the other terms in equation (5.2). If ear in one variable. The benefits of we want to obtain better approxima- learning that the model is a function of tions to our sampling distributions (for one variable rather than two are evi- example by using the bootstrap), then dent. The expected mean squared error we may not want to ignore the term as- (EMSE), given fixed sample size, de- 9 ~~~A sociated with prior estimation of , the clines by more than 40 percent as one last term in (5.2) and (5.3). moves from the smooth two-dimen- sional to a smooth one-dimensional 5.2 The Value of Constrained Estimation model. This observation underscores Since economic theory rarely pro- the importance of powerful significance vides parametric functional forms, ex- tests for nonparametric models. As ex- ploratory data analysis and testing pected, separability and partial linearity which rationalizes a specific parametric can also substantially improve the accu- regression function is particularly bene- racy of the estimator. ficial. In this connection we have out- 5.3 Computer Software and Further lined two specification tests and refer- Reading enced a variety of others. Even though parametric specification Extensive software already in exist- is not its forte, economic theory does ence can perform many nonparametric play a role in producing other valuable procedures automatically. Most of the restrictions on the regression function. kernel estimation in this paper was per- By specifying which variables are poten- formed in S-Plus, a statistical program- tially relevant to an equation, and ex- ming language with extensive graphical cluding myriad others from considera- capabilities. S-Plus also performs a vari- tion, rate of convergence is improved. ety of other nonparametric techniques, (Exclusion restrictions may come dis- including spline and nearest neighbor guised, for example, as homogeneity of estimation, and the estimation of addi- degree zero.) The imposition of ex- tively separable models. Within S-Plus, clusion restrictions on either local av- it is straightforward to implement the eraging or minimization estimators is differencing estimator, confidence in- straightforward-one simply reduces the tervals for nonparametric function esti- dimensionality of the regression func- mators, and bootstrapping procedures, Yatchew: Nonparametric Regression Techniques in Economics 713

Figure 16. Constrained Estimation - Simulated Expected Mean Squared Error

SIMULATED EMSE: E [f (x( tt)gxb))2]


0.035 -

0.030 - =25

0.025 - SMOOTH 2 DIM 0.020- SEPARABLE

SMOOTH 0.015 - 1 DIM MONOTONE ,A---- .. LINEAI" 0.010 -

0.005 T=100 A& -

0.000 -

DATAGENERATING MECHANISM: y xI + ), xS E[1,2], N (0,G2 = .25), || | ob= + 1)dx1 = 3.33

ESTIMATED MODELS y = 11()+ , lIlsob? 10.0

SMOOTH - 2 DIm W'(.) =f(xj,x2) SEPARABLE t(t) =f1(XI) +f2(X2) SMOOTH- 1 DIM t() =f(xl) MONOTONE t(.) =f(Xd),f(lXd)

Sobolev smoothness norms are of fourth order. In each case, 1000 replicationswere performned as well as the tests we have discussed. software are of course available, but An excellent reference is Venables and this is not the place to start.) Ripley (1994) 38 Other useful refer- XploRe is a relatively newer package ences include Phil Spector (1994) and with considerable nonparametric and John Chambers and Hastie (1993). semiparametric capabilities and good (Complete reference manuals for the graphical functions. Hardle, Klinke, and Turlach (1995) provide a fine introduc- 38The volume of Hacrdle (1990), which focusses specifically on nonparametric techniques, contains tion to its capabilities. XploRe performs numerous implementation algorithms. a variety of nonparametric regression 714 Journal of Economic Literature, Vol. XXXVI (June 1998) procedures such as kernel, nearest (1994) in the Handbook of Econo- neighbor, and . The metrics series provide fine overviews of logical learning sequence is to begin various aspects related to nonparamet- with S-Plus prior to attempting to use ric/semiparametric estimation and test- XploRe. ing. See also Linton (1995a) who re- The nonparametric least squares pro- views semiparametric estimation. Fan cedures described in this paper require and Gijbels (1996) provide a fine intro- two steps. First, one must calculate the duction to local polynomial modelling, a representor matrices that enter into the technique which has recently been gain- optimization problem. (The author uses ing popularity. Fortran for this purpose.) The resulting constrained optimization problems them- 6. Conclusions selves may be solved in GAMS-General Algebraic Modelling System (Brooke, The averaging of identically distrib- Kendrick and Meerhaus 1992). This uted objects to estimate the mean of an software is designed to solve a wide unknown distribution involves working range of linear, nonlinear and integer in a nonparametric setting. So does the programming problems and is suffi- use of sample moments to estimate ciently versatile so that bootstrap re- population moments (method of mo- sampling can be fully automated. It is ment estimation), the application of the not specifically designed for statistical central limit theorem to do inference use and presently has no graphic capa- on a characteristic of a population, or bilities. the construction of a histogram. All We note that the computational bur- these are familiar to most applied den of bootstrap techniques will not economists. Parametric estimation such limit their use in nonparametric estima- as ordinary least squares is also widely tion, since the bootstrap is particularly applied. And we routinely examine amenable to parallel computation. Thus, residuals, often subjecting them to estimation could proceed simultane- diagnostic tests or auxiliary regres- ously on individual bootstrap samples. sions to see if anything else is going There exist a number of papers and on that is not being captured by the monographs that survey various aspects model. of the nonparametric/semiparametric Nonparametric regression techniques literatures. Hardle's (1990) book pro- draw upon many of these ideas. When vides a good introductory treatment identical objects are not available, one of nonparametric regression. Wahba's looks at similar objects to draw infer- (1990) monograph is a standard refer- ences. Thus kernel estimation averages ence on spline estimation. Miguel Del- similarly distributed objects to estimate gado and Robinson (1992) provide a the collection of conditional means valuable survey of semiparametric re- known as the regression function. Non- gression. See also Stoker (1991). Aman parametric least squares (which is Ullah and Hrishikesh Vinod (1993) and closely related to spline estimation) Theo Gasser, Joachim Engel and Burk- minimizes the sum of squared residuals hardt Seifert (1993) provide useful sur- but searches over a richer set of func- veys of nonparametric regression in the tions (typically a set of smooth func- Handbook of Statistics series. Papers by tions) than the family of straight lines. Andrews (1994a), Hardle and Linton Tests on residuals (for example, the (1994), Matzkin (1994), and Powell conditional moment specification test Yatchew: Nonparametric Regression Techniques in Economics 715 we have described) provide a means of ing over "neighbors" will produce much assessing a broad range of hypotheses less accurate estimates in higher dimen- such as whether the sign of the slope of sions. a relationship changes or whether the In a sense, nonparametric regression relationship is additive, concave, or ho- is truly empirical in spirit because data mothetic. must be experienced pretty much every- But there are also profound differ- where. (The word empiricism descends ences between conventional and non- from a word which means experience.) parametric regression techniques. A Interpolation is only deemed reliable common question asked in introductory among close neighbors, and extrapola- econometrics courses is how one would tion outside the observed domain is distribute x's on the unit interval if one considered entirely speculative. wanted to learn as much as possible Averaging similar (but not identically about a linear relationship. The answer distributed) observations introduces is that one wants to maximize the vari- bias. One is tempted to increase the ation of the x's and this is done by di- number of observations being averaged viding the x's between the end points. in order to reduce variance, but this in- The further apart they are, the better.39 creases bias as progressively less similar In paramnetric regression, knowing the observations are added to the mix. The relationship accurately at a relatively balance is struck when the increase in small number of points allows one to in- bias (squared) is offset by the reduction fer the parameters and then to use the in variance. model equation to interpolate or ex- A true sceptic would claim that since trapolate elsewhere. Not so for non- economic theory yields little in the way parametric regression. Knowing the of specific functional forms, all relation- conditional means at the end points re- ships and all variables should be inves veals little of what transpires in be- tigated nonparametrically. (Symmetric tween. Accurate knowledge of the rela- treatment of various x's also has an tionship must be acquired in small aesthetic, even egalitarian, appeal.) For neighborhoods covering the whole do- applied economists this is neither rea- main, and so the x's are best distributed sonable nor practical. The potential throughout the unit interval. (-In this relevance of numerous variables would business, one learns about an individual conspire with the curse of dimensional- by looking at close neighbors.) ity to lead to a state of nihilism-noth- Estimation problems emerge as the ing of significance could be said about number of explanatory variables in- anything. Fortunately, we often have creases (the so-called curse of dimen- much stronger priors about the influ- sionality). For a fixed number of obser- ence of some variables than others, and vations, "neighbors" are much further so semiparametric specifications such as apart in higher dimensions than in the partial linear model provide a prom- lower. For example, 100 x's dispersed in ising route. The "curse" also increases the unit cube are more than 20 times the importance of procedures which farther apart than on the unit interval would confirm a parametric model (Figure 17). If similarity declines pro- (specification tests) or reduce the num- portionately with distance, then averag- ber of explanatory variables (signifi- cance tests). This paper introduces procedures 39For the model y = cc + xA3 + c, Var(,) = y2/y (Xt - -y)2. that are relatively simple. In some 716 Journal of Economic Literature, Vol. XXXVI (June 1998)

Figure 17. The Curse of Dimensionality

Typical Distances Between Neighbors

0.215 0.20-

0.15 -

0.1 0.10 -

0.05- 0.01 0.0 __ 1 2 3 Dimension of x

On the uilit interval, 100 equally spaced x'swill be a distance of only .01 from each other. Allow the x's to spread out on the unit square and the distance to the nearest neighbor increases to .1(= 1/1001/2). Distribute them uniformlyin the three- dimensional unit cube and the distance increases further to .215(= 1/1001/3). cases, more powerful but more complex AZZALINI A.; BOWMAN, ADRIAN AND HARDLE, WOLFGANG. "On the Use of Nonparametric techniques are available. But increased Regression for Model Checking," Biometrika, use of even simple nonparametric tech- 1989, 76, pp. 1-11. niques is part of an endogenous process BARLOW, RICHARD E.; BARTHOLOMEW, D.J., BREMNER, J.M. AND BRUNK, H.D. Statistical where greater use will cause software inference under order restrictions, New York: developers to add more of them to John Wiley, 1972. econometric packages. This in turn will BARRY, DANIEL. "Testing for Additivity of a Re- gression Function," Annals Statist., 1993, 21, drive further use and further software pp. 235-54. development. However, the ultimate BERAN, RUDY AND DUCHARME, GILLES. Asymp- test of nonparametric regression tech- totic theory for bootstrap methods in statistics, Centre de Recherches Mathematiques, Univer- niques resides in their ability to dis- site de Montreal, 1991. cover new and unusual relationships. BIERENS, HERMAN. "A Consistent Conditional Moment Test of Functional Form," Econo- metrica, 1990, 58, pp. 1443-58. REFERENCES BROOKE, A.; KENDRICK, D. AND MEERAUS, A. AHN, HYUNGTAIK AND POWELL, JAMES. "Semi- GAMS, Redwood City, : Scientific parametric Estimation of Censored Selection Press, 1992. Models with a Nonparametric Selection Mecha- BUJA, ANDREAS; HASTIE, TREVOR AND TIBSHI- nism,"J. Econometrics, 1993, 58, pp. 2-29. RANI, ROBERT. "Linear Smoothers and Additive AFRIAT, SYDNEY. "The Construction of a Utility Models, Annals Statist., 1989, 17, pp. 453-555 Function from Expenditure Data," Int. Econ. (with discussion). Rev., 1967, 8, pp. 66-77. CHAMBERLAIN, GARY. "Efficiency Bounds for ANDREWS, DONALD W.K. "Empirical Process Semiparametric Regression," , Methods in Econometrics," in Handbook of 1992, 60, pp. 576-96. econometrics, Vol. IV. Eds.: R. ENGLE AND D. CHAMBERS, JOHN M. AND HASTIE, TREVOR. Sta- MCFADDEN. Amsterdam: North Holland, tistical models in S, New York: Chapman and 1994a, pp. 2247-94, Hall, 1993. --."Asymn ptotics for Semiparametric Econo- CHEN HUNG. "Convergence Rates for Parametric metric Models via Stochastic Equicontinuity," Components in a Partly Linear Model," Annals Economnetrica, 1994b, 62, pp. 43-72. Statist., 1988, 16, pp. 136-46. Yatchew: Nonparametric Regression Techniques in Economics 717

CLARK, R. "A Calibration Curve for Radiocarbon sion Function Models," Econometric Theory, Dates," Antiquity, 1975, 49, pp. 251-66. 1993, 9, pp. 451-77. DELGADO, MIGUEL AND ROBINSON, PETER M. GOZALO PEDRO L. AND LINTON, OLIVER. "Test- "Nonparametric and Semiparametric Methods ing Additivity in Generalized Nonparametric for Economic Research," J. Econ. Surveys, Regression Models," manuscript, 1997. 1992, 6(3). HALL, PETER. The bootstrap and edgeworth ex- DUDLEY, RICHARD M. "A Course on Empirical pansion, New York: Springer Verlag, 1992. Processes," Lecture notes in mathematics, Ecole . "On Edgeworth Expansions and Bootstrap d'Ete de Probabilit6s de Saint-Flour XII-1982, Confidence Bands in Nonparametric Curve Es- New York: Springer-Verlag, 1984. timiation," J. Royal Statist. Soc. B, 1993, 55, pp. EFRON, BRADLEY. "Bootstrap Methods: Another 291-304. Look at the Jackknife," Annals Statist., 1979, 7, HALL, PETER; KAY, J.W. AND TITTERINGTON, pp. 1-26. D .M. "Asymptotically Optimal Difference- EFRON, BRADLEY AND TIBSHIRANI, ROBERT J. based Estimation of Variance in Nonparametric An introduction to the bootstrap, New Regression," Biornetrika,1990, 77, pp. 521-28. York/London: Chapman & Hall, 1993. HARDLE, WOLFGANG. Appliednonparametric re- ENGLE, ROBERT; GRANGER, CLIVE W.J., RICE, gression, Monograph Se- JOHN AND WEISS, ANDREW. "Semiparametric ries, 19, University Press, 1990. Estimates of the Relation between Weather and HARDLE WOLFGANG; HALL, PETER AND MAR- Electricity Sales," J. Amer. Statist. Assoc., 1986, RON, JAMES S. "How Far Are Automatically 81, pp. 310-20. Chosen Regression Smoothing Parameters from EPSTEIN, LARRY AND YATCHEW, ADONIS. "Non- Their Optimum?" J. Amer. Statist. Assoc., 1988, parametric Hypothesis Testing Procedures and 83, pp. 86-99 (with discussion). Applications to Demand Analysis," J. Econo- HARDLE, WOLFGANG; KLINKE, SIGBERT AND metrics, 1985, 30, pp. 150-69. TURLACH, BERWIN. XploRe:An interactivesta- EUBANK, RANDALL. Spline smoothing and non- tistical computing environment, New York: parametric regression, New York: Marcel Dek- Springer-Verlag, 1995. ker, 1988. HARDLE, WOLFGANG AND LINTON, OLIVER. "Ap- EUBANK, RANDALL; HART, JEFFREY D., SIMP- plied Nonparametric Methods," in the Hand- SON, D.G. AND STEFANSKI, LEONARD A. "Test- book of econometrics, Vol. IV. Eds.: 'R. ENGLE ing for Additivity in Nonparametric Regres- AND D. MCFADDEN. Amsterdam: North Hol- sion," Annals Statist., 1995, 23, pp. 1896-920. land, 1994, pp. 2297-334. EUBANK, RANDALL AND SPECKMAN, PAUL. "Con- HARDLE, WOLFGANG AND MAMMEN, ENNO. fidence Bands in Nonparametric Regression," J. "Comparing Nonparametric vs Parametric Re- Amer. Statist. Assoc., 1993, 88, pp. 1287-301. gression Fits," Annals Statist., 1993, 21, pp. EUBANK, RANDALL AND SPIEGELMAN, CLIFFORD 1926-47. H. "Testing the of a Linear HARDLE, WOLFGANG AND MARRON, JAMES S. Model Via Nonparametric Regression Tech- "Optimal Bandwidth Selection in Nonparamet- niques," J. Amer. Statist. Assoc., 1990, 85, pp. ric Regression Estimation," Annals Statist., 387-92. 1985, 13, pp. 1465-81. FAN, JIANQING AND GIJBELS, IRENE. Local poly- HART, JEFFREY D. Nonparametric smoothing and nomial modelling and its applications, New lack-of-fit tests. New York: Springer, 1997. York/London: Chapman & Hall, 1996. HASTIE, TREVOR J. AND TIBSHIRANI, ROBERT. FAN YANQIN AND LI, QI. "Consistent Model "Generalized Additive Models: Some Applica- Specification Tests: Omitted Variables and tions," J. Amer. Statist. Assoc., 1987, 82, pp. Semiparametric Functional Forms," Econo- 371-86. metrica, 1996, 64, pp. 865-90. - . Generalized additive models, London: GALLANT, A. RONALD. "Unbiased Determination Chapman and Hall, 1990. of Production Technologies," J. Econometrics, HAUSMAN, JERRY AND NEWEY, WHITNEY. "Non- 1981, 20, pp. 285- 323. parametric Estimation of Exact Consumer Sur- GASSER, THEO; ENGEL, JOACHIM AND SEIFERT, plus and Deadweight Loss," Economnetrica, BURKHARDT. "Nonparametric Function Esti- 1995, 63, pp. 1445-76. mation," in Handbook of statistics, Vol. 11, Ed.: Ho, MICHAEL. Essays on the housing market, un- C.R. RAO. 1993, pp. 423-65. published Ph.D. dissertation, University of GOLDMAN, STEVEN M. AND RUUD, PAUL. "Non- Toronto, 1995. parametric Multivariate Regression Subject to HONG, YONGMIAO AND WHITE, HALBERT. "Con- Monotonicity and Convexity Constraints," sistent Specification Testing via Nonparametric manuscript, University of California, Berkeley, Series Regression," Econometrica, 1995, 63, pp. 1992. 1133-60. GOLUB, GENE H. AND VAN LOAN, CHARLES. Ma- HOROWITZ, JOEL. "Semiparametric Estimation of trix computations, Baltimore: Johns Hopkins a Regression Model with an Unknown Transfor- University Press, 1989. mation of the Dependent Variable," Econo- GOZALO, PEDRO L. "A Consistent Specification metrica, 1996, 64, pp. 103-38. Test for Nonparametric Estimation of Regres- HoROWITZ, JOEL AND HARDLE, WOLFGANG. 718 Journal of Economic Literature, Vol. XXXVI (June 1998)

"Testing a Parametric Model against a Semi- NADARAYA, E.A. "On Estimating Regression," parametric Alternative," Econometric Theory, Theory Prob. Applic., 1964, 10, pp. 186-90. 1994, 10, 821-48. NEWEY, WHITNEY. "The Asymptotic Variance of ICHIMURA, HIDEHIKO. "Semiparametric Least Semiparametric Estimators," Econometrica, Squares (SLS) and Weighted SLS Estimation of 1994, 62, pp. 1349-82. Single-Index Models," J. Econometrics, 1993, PINKSE, COENRAAD AND ROBINSON, PETER M. 58, pp. 71-120. "Pooling Nonparametric Estimates of Regres- KLEIN, ROGER AND SPADY, RICHARD. "An Effi- sion Functions with Similar Shape," in Ad- cient Semiparametric Estimator for Binary Re- vances in econometrics and quantitative eco- sponse Models," Econometrica, 1993, 61, pp. nomics. Eds.: G.S. MADDALA, P.C.B PHILLIPS, 387-422. AND T.N. SRINIVASAN. Oxford: Blackwell, 1995, LEE, B.J. "A Model Specification Test against the pp. 172-97. Nonparametric Alternative," Department of PLOSSER, CHARLES G.; SCHWERT, WILLIAM AND Economics, University of Colorado, manuscript, WHITE, HALBERT. "Differencing as a Test of 1991. Specification," Int. Econ. Rev. 1982, 23, pp. LEPAGE, RAOUL AND BILLARD, LYNNE. Explor- 535-52. ing the limits of the bootstrap, New York:John POLLARD, DAVID. Convergence of stochastic pro- Wiley, 1992. cesses, New York: Springer-Verlag, 1984. LI, KER-CHAU. "AsymptoticOptimality of CL and POWELL, JAMES L. "Semiparametric Estimation of Generalized Cross-Validation in Ridge Regres- Bivariate Latent Variable Models," working pa- sion with Application to Spline Smoothing,"An- per 8704, Social Systems Research Institute of nals Statist., 1986, 14, pp. 1101-12. Wisconsin, University of Wisconsin, Madison, "AsymptoticOptimality for Cp, CL, Cross- 1987. Validation and Generalized Cross-Validation: - ."Estimation of Semiparametric Models," in Discrete Index Set," Annals Statist., 1987, 15, Handbook of econometrics, Vol. IV. Eds.: R. pp. 958-75. ENGLE AND D. MCFADDEN. Amsterdam: LI, QI. "Some Simple Consistent Tests for a Para- North Holland, 1994, pp. 2444-523. metric Regression Function versus Semipara- RAMSAY, JAMES 0. "Monotone Regression Splines metric or Nonparametric Alternatives,"Depart- in Action," Statist. Sci. 1988, 3, pp. 425-61. ment of Economics, University of Guelph, RAMSAY, JAMES 0. AND SILVERMAN, BERNARD. manuscript, 1994. Functional data analysis, New York: Springer, LINTON, OLIVER. "Estimation in Semiparametric 1997. Models: A Review," in Advances in econo- RICE JOHN. "Bandwidth Choice for Nonparamet- metrics and quantitative economics, essays in ric Regression," Annals Statist., 1984, 12, pp. honor of Professor C.R. Rao. Eds.: G.S. MAD- 1215-30. DALA, P.C.B. PHILLIPS, T.N. SRINIVASAN. ROBERTSON, TIM; WRIGHT, F.T. AND DYKSTRA, 1995a, pp. 146-71. R.L. Order restricted , New ."Second Order Approximation in the Par- York: John Wiley, 1988. tially Linear Regression Model," Econometrica, ROBINSON, PETER M. "Root-N-Consistent Semi- 1995b, 63, pp. 1079-112. parametric Regression," Econometrica, 1988, MAcKINNON, JAMES. "Model Specification Tests 56, pp. 931-54. and Artificial Regressions," J. Econ. Lit., 1992, SCOTT, DAVID W. Multivariate density estimation, 30, pp. 102-46. New York: John Wiley, 1992. MAMMEN, ENNO. "Estimating a Smooth Mono- SERFLING, ROBERT. Approximation theorems of tone Regression Function," Annals Statist., , New York: John Wiley, 1991, 19, pp. 724-40. 1980. -. When does bootstrap work? New York: SHAO, JUN AND Tu, DONGSHENG. The jackknife Springer Verlag, 1992. and bootstrap, New York: Springer-Verlag, 1995. MAMMEN, ENNO AND VAN DE GEER, SARA. "Pe- SIMONOFF, JEFFREY. Smoothing methods in sta- nalized Quasi-Likelihood Estimation in Partial tistics, New York: Springer-Verlag, 1996. Linear Models," manuscript, Humboldt-Uni- SPECTOR, PHIL. An introduction to S and S-plus, versitet Berlin and University of Leiden, 1995. Belmont, California: Duxbury Press, 1994. MATZKIN, ROSA L. "Restrictions of Economic STOKER, THOMAS M. "Tests of Additive Deriva- Theory in Nonparametric Methods," in Hand- tive Constraints,"Rev. Econ. Stud. 1989, 56, pp. book of econometrics, Vol. IV. Eds.: R. ENGLE 535-52. AND D. McFADDEN. Amsterdam: North Hol- . Lectures on semiparametric econometrics, land, 1994, pp. 2524-59. CORE Foundation Louvain-La-Neuve, 1991. MUKARJEE, HARI. "Monotone Nonparametric STONE, CHARLES J. "Optimal Rates of Conver- Regression," Annals Statist., 1988, 16, pp. 741- gence for Nonparametric Estimators," Annals 50. Statist. 1980, 8, pp. 1348-60. MUKARJEE, HARI AND STERN, STEVEN. "Feasible -" "Optimal Global Rates of Convergence for Nonparametric Estimation of Multi-argument Nonparametric Regression," Annals Statist., Monotone Functions," J. Amer. Statist. Assoc., 1982, 10, pp. 1040-53. 1994, 89, pp. 77-80. ."Additive Regression and Other Nonpara- Yatchew: Nonparametric Regression Techniques in Economics 719

metric Models," Annals Statist., 1985, 13, pp. WHITE, HALBERT. Asymptotic theory for econo- 689-705. metricians, New York: Academic Press, 1985. ."The Dimensionality Reduction Principle . Estimation, inference and specification for Generalized Additive Models," Annals of analysis, Cambridge University Press, 1994. Statistics, 1986, 14, pp. 590-606. WOLAK, FRANK A. "Testing Inequality Constraints ULLAH, AMAN AND VINOD, HRISHIKESH D. in Linear Econometric Models," J. Econo- "General Nonparametric Regression Estimation. mnetrics, 1989, 41, pp. 205-235. and Testing in Econometrics," in Handbook of WONG, WING HUNG. "On Constrained Multivari- statistics, Vol. 11. Ed.: C.R. RAO. 1993, pp. 85- ate Splines and Their Approximations," Numer. 116. Math., 1984, 43, pp. 141-52. UTRERAS, FLORENCIO. "Smoothing Noisy Data WOOLDRIDGE, JEFFREY M. "A Test for Func- under Monotonicity Constraints: Existence, tional Form against Nonparametric Alterna- Characterization and Convergence Rates," Nu- tives," Econometric Theory, 1992, 8, pp. 452- mer. Math., 1984, 47, pp. 611-25. 75. VAN DE GEER, SARA. "Estimating a Regression WRIGHT, IAN AND WEGMAN, EDWARD. (1980): Function,"Annals Statist. 1990, 18, pp. 907-924. "Isotonic, Convex and Related Splines," Annals VARIAN, HAL R. "Nonparametric Analysis of Opti- Statist., 1980, 8, pp. 1023-35. mizing Behavior with Measurement Error," J. WU, CHIEN-Fu JEFF. "Jackknife, Bootstrap and Econometrics, 1985, 30, pp. 445-58. Other Resampling Methods in Regression -. "Goodnessof Fit in Optimizing Models,"J. Analysis" Annals Statist., 1986, 14, pp. 1261- Econometrics, 1990, 46, pp. 125-40. 350 (with discussion). VENABLES, WILLIAM AND RIPLEY, BRIAN. Mod- YATCHEW, ADONIS. "Some Tests of Nonparamet- ern applied statistics with S-Plus, New York: ric Regressions Models," in Dynamic econo- Springer-Verlag, 1994. metric modelling, Proceedings of the third inter- VILLALOBOS, MIGUEL AND WAHBA, GRACE. "In- national symposium in economic theory and equality-Constrained Multivariate Smoothing econometrics. Eds.: W. BARNETT, E. BERNDT Splines with Application to the Estimation of AND H. WHITE. Cambridge University Press, Posterior Probabilities," J. Am. Statist. Assoc., 1988, pp. 121-135. 1987, 82, pp. 239-48. - ."Nonparametric Regression Model Tests WAHBA, GRACE. Spline models for observational Based on Least Squares," Econometric Theory, data, CBMS-NSF Regional Conference Series 1992, 8, pp. 435-451. in Applied Mathematics, #59, Society for Indus- - ."An Elementary Estimator of the Partial trial and Applied Mathematics, 1990. Linear Model," Econ. Lett., 1997, 57, pp. 135-43. WAHBA, GRACE AND WOLD, S. "A Completely . "A Nonparamnetric Differencing Test of Automatic French Curve: Fitting Spline Func- Equality of Regression Functions," manuscript, tions by Cross-Validation," Communications in 1998. statistics, Series A, 1995, 4, pp. 1-17. YATCHEW, ADONIS AND Bos, LEN. "Nonparamet- WATSON, G. S. "Smooth ," ric Regression and Testing in Economic Mod- Sankhya, Series A, 1964, 26, pp. 359-72. els,"J. Quant. Econ., 1997,13, pp. 81-131. WHANG, YOON-JAE AND ANDREWS, DONALD ZHENG, JOHN XU. "A Consistent Test of Func- W.K. "Tests of Specification for Parametric and tional Form via Nonparametric Estimation Semiparametric Models," J. Econometrics, Techniques," J. Econometrics, 1996, 75, pp. 1993, 57, pp. 277-318. 263-89.


Some Useful Mathematical Background 3 T-115= O(T-1/5) since T'15aT converges to 3 and hence is a bounded sequence. Suppose aT,T = 1,...,o is a sequence of A sequence is 0(1) if it is bounded. numbers. Then the sequence aT is of Now suppose aT,T= 1,...,oo is a se- smaller order than the sequence Tr, quence of random variables. Then, written aT= O(T-r) if TraT converges to aT = Op(T-r) if TraT converges in prob- zero. For example, if aT= T-1/4 then ability to zero. For example, let aT = E= aT = o(T-1/5) since T1'5 T-1/4 _ O. A se- 1/ETxle1?t where et are independently quence is o(1) if it converges to 0. and identically distributed with mean The sequence aT is the same order as zero and variance y2. Then E(?T) = 0, the sequence T-r, written aT =O(T-r) if Var(?T) = (52/T. Since the mean is 0 and TraT is a bounded sequence. For ex- the variance converges to 0, the se- ample, the sequence aT - 7 T-114+ quence TT converges in probability to 0 720 Journal of Economic Literature, Vol. XXXVI (June 1998) and is op(l). Furthermore, for any r < 1/2, on T observations YT. Then YT= OP(gy + = op(T-r) since Var(TrT`) = o?/Tl-2r ?T) = ,uC+ OP(ET) = 0(1) + Op(T-7?) and converges to 0. T12(YT - Ay) = T'/2ET = Op(l). Next, write aT = Op(T-r) if TraT is Let ?T be a sequence of real numbers bounded in probability.1 For example, converging to zero.2 Typically we con- suppose ThaT converges to a random sider sequences of the form ?T=T-r variable with finite mean and variance, where 0 < r < 1. Let aT= ITMTEt/lTT be then aT= Op(T-r). Thus, using the cen- the average of the first kTT = Tl-r values tral limit theorem T"1?2T converges to a of -t. For example, if T= 100, r= 1/5 N(O, 52) in which case -ET= Op(T-'/2) and then we are averaging the first 39.8 - T'1/2T- Op(l). 40 observations. Then, E[aT] = 0 and Suppose yt = ty + Et where gty is a con- VarfaT] = 6?2 /T = y2/T1. Hence, aT = stant and define the sample mean based OP((XTT)-'/2) = Op(T-'/2(1-r)).

APPENDIX B Nonparametric Regression and the bootstrap requires resampling many Bootstrap times, calculations need not be done se- rially, but can be done contemporane- Bootstrap procedures, which were in- ously, making the bootstrap particularly troduced by Efron (1979), are simula- suitable to parallel processing. tiorn-based techniques which provide, The monograph by Efron and Tibshi- among other things, estimates of vari- rani (1993) contains a readable intro- ability, confidence intervals, and critical duction to the bootstrap. Beran and values for test procedures. In part, be- Ducharme (1991) provide an approach- cause the simulated sampling distri- able treatment of the large sample va- butions are constructed from correctly lidity of the bootstrap. Hall (1992) dem- sized samples, bootstrap techniques are onstrates that the bootstrap often yields often more accurate than those based superior finite sample performance on asymptotic distribution theory. In than the usual asymptotic distribution maniy circumstances they are also sim- theory. Shao and Tu (1995) provide a pler to implement. survey of recent developments. The fundamental idea is to create Suppose we have data (yi,x) ... replications by treating the existing data (YT,XT) on the model y =f(x) + ? where f may or set (say of size T) as a population from may not lie in a parametric family. A which samples (of size T) are obtained. joint resampling methodology involves In the bootstrap world, sampling from drawing i.i.d. observations with replace- data becomes the the original data-gen- ment from the original collection of or- erating mechanism. Variation in esti- dered pairs. mates results from the fact that upon Residual resampling, on the other selection, each data point is replaced hand, proceeds as follows. First, f is es- within the population. Although the timated using, say, Af The estimated re- siduals are assembled and recentered by 1 That is, for any 8> 0, no matter how small, subtracting off their mean to produce there exists a constant A8 and a point in the t - sequence T8, such that for all T> T8, ?t= -f(Xt) T I(Ys -f(xs))/T. One then Prob[I aT I > A8] < 8. samples independently from these to 2 Although throughout the paper the bandwidth X depends on T, we have suppressed the T sub- construct a bootstrap data set: (yB,x),. script. (YT,XT) where ytj =f(xt) + AtB. The "B" Su- Yatchew: Nonparametric Regression Techniques in Economnics 721 perscript signifies a bootstrap observa- The random variable 0t has the proper- tion. Statistics that are of interest are then ties E(ot) = 0, E(o2) - At2, E(o3)t . One computed using the bootstrap dataset. then draws from this distribution to ob- An alternative residual resampling tain F_B for t= 1, ..., T. The bootstrap methodology, known as the "wild" boot- data set (yB,X,), ..., (yB,xT) is then con- strap, is useful in heteroscedastic and in structed, where yt =-frxt)+ -tB, and statis- certain nonparametric regression set- tics of interest are calculated. See Wu tings. In this case, for each residual (1986) and Hardle (1990, pp. 106-8, F=yt -(Xt) estimated from the original 247). data one creates a two-point distri- Hardle (1990) discusses various appli- bution for a random variable, say ot, cations of the bootstrap in a nonpara- with the metric setting and provides algorithms. See also Hall (1992, pp. 224-34), Hall cot Prob(wt) (1993), Mammen (1992), LePage and Billard (1992), Hardle and Mammeri ?_ (I - 5)/2 (5 + 5-)/IO (1993), Li (1995), and Yatchew and Bos (1997).