THE DELTA METHOD FOR NONPARAMETRIC KERNEL FUNCTIONALS

Yacine Aït-Sahalia¹, Graduate School of Business, University of Chicago, 1101 E. 58th St., Chicago, IL 60637

January 1992 Revised, August 1994

¹ This paper is part of my Ph.D. dissertation at the MIT Department of Economics. I am very indebted to Richard Dudley, Jerry Hausman and Whitney Newey for helpful suggestions and advice. The comments of Peter Robinson and three anonymous referees have considerably improved this paper. I am also grateful to seminar participants at Berkeley, British Columbia, Carnegie-Mellon, Chicago, Columbia, Harvard, Northwestern, Michigan, MIT, Princeton, Stanford, Washington University, Wharton, Yale and the Yale-NSF Conference on Asymptotics for Infinite Dimensional Parameters, as well as Emmanuel Guerre, Pascal Massart and Jens Praestgaard for stimulating conversations. Part of this research was conducted while I was visiting CREST-INSEE in Paris, whose hospitality is gratefully acknowledged. Financial support from Ecole Polytechnique and MIT Fellowships is also gratefully acknowledged. All errors are mine.

by Yacine Aït-Sahalia

Abstract

This paper provides, under weak conditions, a generalized delta method for functionals of nonparametric kernel density estimators, based on (possibly) dependent and (possibly) multivariate data. Generalized derivatives are allowed to permit the inclusion of virtually any functional, global or pointwise, explicitly or implicitly defined. It is shown that forming the estimator with dependent data modifies the asymptotic distribution only if the functional is more irregular than some threshold level. Consistent variance estimators and rates of convergence are derived. Many examples are provided.

Keywords: Delta Method, Nonparametric Kernel Estimation, Dependent Data, Rate of Convergence, Functional Differentiation, Generalized Functions.

1. Introduction

The delta method is a simple and widely used tool to derive the asymptotic distribution of nonlinear functionals of an estimator. Most parametric estimators converge at the familiar root-n rate, and so do functionals of them. When the estimator is nonparametric, however, some functionals will converge at a rate slower than root-n while others will retain the root-n rate. The slower-than-root-n functionals require some form of smoothing to be estimated, the most popular being the kernel method. A delta method has long been available for the class of root-n functionals that can be estimated without smoothing (von Mises [1947], Reeds [1976], Huber [1981], Dudley [1990]).

In all cases, the essence of the delta method is a first order Taylor expansion of the functional. The problem is that the slower-than-root-n functionals are not differentiable in the usual sense. Therefore the examples of slower-than-root-n functionals studied in the literature have been tackled without the systematic "plug-and-play" feature that made the delta method attractive in the settings where it was available. This paper proposes a simple delta method that also covers slower-than-root-n functionals, under conditions that equal and often relax those used in the previous "case-by-case" work. To address the problem of non-differentiability, the paper allows generalized functions as functional derivatives. An example of a generalized function is the Dirac delta function, as are its derivatives. With generalized functions, the familiar delta method approach based on differentiating the functional is shown to be easily implemented for non-trivial examples. The results are valid even when the data are serially correlated, with independent data as a special case.

The main contribution of the paper is to show how to systematically linearize slower-than-root-n functionals, and then to provide a general yet simple result yielding their asymptotic distribution. The purpose of the examples is two-fold: first, some classical examples (regression function, etc.) are included to show how the method of this paper significantly beats the previous approaches; second, new distributions are derived for cases where they were not previously available (dependent censored least absolute deviation, quantiles, mode, stochastic differential equations, etc.). The paper is organized as follows: Section 2 derives the generalized delta method. Consistent estimators of the asymptotic variances are proposed in Section 3. Section 4 discusses the rates of convergence of the estimators. Section 5 illustrates the application of the result through many examples. Section 6 concludes. Proofs are in the Appendix.

2. The Delta Method with Generalized Derivatives

2.1 Assumptions

Consider R^d-valued random variables X_1, X_2, ..., X_n, identically distributed with unknown density function f(·) and associated cumulative distribution function F(x) ≡ ∫_{−∞}^{x} f(t) dt, where x ≡ (x_1, ..., x_d). The following regularity conditions are imposed:

Assumption A1: The sequence {X_i} is a strictly stationary β-mixing sequence satisfying: k^δ β_k → 0 as k → ∞, for some fixed δ > 1.

β_k = 0 for all k ≥ 1 corresponds to the independence case. As long as β_k → 0 as k → ∞, the sequence is said to be absolutely regular.

Assumption A2: The density function f(·) is continuously differentiable on R^d up to order s. Its successive derivatives are bounded and in L²(R^d).

Let C^s be the space of density functions satisfying A2. To estimate the density function f(·), a Parzen-Rosenblatt kernel function K(·) will be used. The kernel will be required to satisfy:

Assumption A3: (i) K is an even function integrating to one; (ii) The kernel is of order r = s, an even integer:
1) ∀λ ∈ N^d, 1 ≤ λ_1 + ... + λ_d ≤ r − 1: ∫_{−∞}^{+∞} x_1^{λ_1} ··· x_d^{λ_d} K(x) dx = 0;
2) ∃λ ∈ N^d, λ_1 + ... + λ_d = r: ∫_{−∞}^{+∞} x_1^{λ_1} ··· x_d^{λ_d} K(x) dx ≠ 0;
3) ∫_{−∞}^{+∞} |x|^r |K(x)| dx < +∞.
(iii) K is continuously differentiable up to order s + d on R^d, and its derivatives of order up to s are in L²(R^d).

The last assumption indicates how the bandwidth hn in the kernel density estimator should be chosen. The statement of the assumption depends upon an exponent parameter e>0 and an integer m, 0≤m

Assumption A4(e,m): As n → ∞: n^{1/2} h_n^{e} + (n h_n^{2m+1})^{−1/2} → 0.

Each specific application requires that A4(e,m) be satisfied for some particular e and m. Let f̂_n(x) ≡ (1/(n h_n^d)) Σ_{i=1}^{n} K((x − X_i)/h_n) denote the kernel density estimator. Finally, construct the kernel cumulative distribution function (KCDF) estimator: F̂_n(x) ≡ ∫_{−∞}^{x} f̂_n(t) dt. Consider also the empirical cumulative distribution function (ECDF): F_n(x) ≡ (1/n) Σ_{i=1}^{n} 1(X_i ≤ x).
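These three estimators can be sketched in a few lines of code. This is a minimal illustration assuming a Gaussian kernel (a second-order kernel, so it matches A3 only when r = s = 2); the function names are ours, not the paper's.

```python
import math

def gaussian_kernel(u):
    # Even kernel integrating to one (order r = 2).
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kernel_density(x, data, h):
    # Parzen-Rosenblatt estimator: f_hat(x) = (1/(n h)) sum K((x - X_i)/h).
    n = len(data)
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (n * h)

def kernel_cdf(x, data, h):
    # KCDF F_hat(x) = integral of f_hat up to x; for a Gaussian kernel this
    # is an average of Gaussian cdfs Phi((x - X_i)/h).
    n = len(data)
    return sum(0.5 * (1.0 + math.erf((x - xi) / (h * math.sqrt(2.0))))
               for xi in data) / n

def empirical_cdf(x, data):
    # ECDF F_n(x) = (1/n) sum 1(X_i <= x).
    return sum(1.0 for xi in data if xi <= x) / len(data)
```

Note that the KCDF is smooth in x while the ECDF is a step function; this difference is exactly what Section 2.4 exploits.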

2.2. Functional Differentiation

Consider a functional Φ[·] defined on an open subset of C^s with the L² norm, and taking values in R. Say that Φ is L(2,m)-differentiable at F in C^s if it admits a first order Taylor expansion:

Φ[F + H] = Φ[F] + Φ^{(1)}[F](H) + R_Φ[F, H], with R_Φ[F, H] = O(‖H‖²_{L(2,m)}),

where Φ^{(1)}[F](·) is a continuous linear (in H) functional and ‖·‖_{L(2,m)} is the sum of the L² norms of all the derivatives of H up to order m. If this holds uniformly in H in any compact subset K of C^s, and |Φ^{(1)}[G](H)| ≤ C(K) ‖H‖_{L(2,m)}, then Φ is said to be L(2,m)-Hadamard-differentiable at F. In what follows it will always implicitly be assumed that the linear term Φ^{(1)}[F](·) is not degenerate. If it were, then the asymptotic distribution would be given by a term of higher order in the Taylor expansion.

By the Riesz Representation Theorem (see e.g., Schwartz [1966]), there exists a distribution ϕ[F]: R^d → R such that Φ^{(1)}[F](H) = ∫_{−∞}^{+∞} ϕ[F](x) dH(x). Call ϕ[F](·) the functional derivative² of Φ at F. The standard delta method is applicable only if ϕ[F](·) is a regular function, i.e., at least cadlag (right-continuous with left limits). For some functionals Φ, the functional derivative will indeed be a regular function. For example, let Φ[F] ≡ ∫_{−∞}^{+∞} f(x)² dx. Then:

Φ[F + H] = ∫_{−∞}^{+∞} {f(x) + h(x)}² dx = ∫_{−∞}^{+∞} f(x)² dx + 2 ∫_{−∞}^{+∞} f(x) h(x) dx + ∫_{−∞}^{+∞} h(x)² dx

so R_Φ[F, H] = ∫_{−∞}^{+∞} h(x)² dx and Φ^{(1)}[F](H) = ∫_{−∞}^{+∞} ϕ[F](x) dH(x) = 2 ∫_{−∞}^{+∞} f(x) h(x) dx. Thus its functional derivative is ϕ[F](·) = 2 f(·), a function in C^s.
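The expansion above can be checked numerically. The sketch below (illustrative, not from the paper) discretizes Φ[F] = ∫ f² by a Riemann sum and verifies that, after removing the linear term 2∫fh, the remainder equals ∫h² exactly, i.e., is O(‖H‖²).

```python
import math

def phi(dens, grid, dx):
    # Phi[F] = integral of f(x)^2 dx, approximated by a Riemann sum.
    return sum(dens(x) ** 2 for x in grid) * dx

dx = 0.001
grid = [-8.0 + i * dx for i in range(16001)]
f = lambda x: math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)  # base density
g = lambda x: x * math.exp(-x * x)                               # perturbation direction
eps = 1e-3
h = lambda x: eps * g(x)

linear = 2.0 * sum(f(x) * h(x) for x in grid) * dx               # <2f, h>
remainder = phi(lambda x: f(x) + h(x), grid, dx) - phi(f, grid, dx) - linear
int_h_sq = sum(h(x) ** 2 for x in grid) * dx
# For this quadratic functional the remainder is exactly the squared term.
```

Because the functional is quadratic, the remainder coincides with ∫h² to machine precision; for a general functional it would only be bounded by a constant times ‖H‖².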

Unfortunately, many functionals of interest in econometrics do not have "regular" functional derivatives; that is, ϕ[F](·) will not be a cadlag function. Instead, it will be a

² Although ϕ[F] is not unique, the respective asymptotic distributions given by the Theorem are independent of the choice of ϕ[F]. One way to make the representation unique would be to impose ∫_{−∞}^{+∞} ϕ[F](x) dF(x) = 0. Any ϕ[F] given by the Riesz Theorem will be called "the" derivative, even though "a" derivative would be more appropriate.

generalized function. The existing delta method cannot treat such functionals. The main point of the paper is that the same familiar delta approach will work provided that one includes generalized functions as functional derivatives. This method turns out to be very simple as well as powerful. Many examples will be provided below. To get the flavor of the result immediately, consider the hazard rate functional Φ[F] ≡ f(y)/(1 − F(y)) evaluated at some

y. It will be shown below that it is differentiable in the extended sense of this paper, with functional derivative: ϕ[F](x) = [1/(1 − F(y))] δ_{(y)}(x) + [f(y)/(1 − F(y))²] 1(y ≥ x). This functional

derivative is a linear combination of a Dirac mass at y (a generalized function) and an indicator function (a regular function). The asymptotic distribution of the kernel estimator of the hazard rate will be driven by the Dirac term in the linear expansion: only the most unsmooth term counts.

2.3. Generalized Functions

The concept of generalized function, or distribution, was formally introduced by Schwartz [1954,1966]. Simply put, any function g, no matter how unsmooth, can be differentiated. Its derivative g^{(1)} is defined by its inner products against smooth functions f: ∫_{−∞}^{+∞} g^{(1)}(x) f(x) dx ≡ −∫_{−∞}^{+∞} g(x) f^{(1)}(x) dx in the univariate case,

where f^{(1)} is a standard derivative. Of course, by integration by parts, this reduces to the common definition of differentiability if g turns out to be a regular function.

For example, a Dirac function at 0 is defined by ∫_{−∞}^{+∞} δ_{(0)}(x) f(x) dx = f(0), and its derivative is given by ∫_{−∞}^{+∞} δ^{(1)}_{(0)}(x) f(x) dx = −∫_{−∞}^{+∞} δ_{(0)}(x) f^{(1)}(x) dx = −f^{(1)}(0). Successive

differentiation of the Dirac function is possible up to the number of derivatives that the functions f admit, yielding ∫_{−∞}^{+∞} δ^{(q)}_{(0)}(x) f(x) dx = (−1)^q ∫_{−∞}^{+∞} δ_{(0)}(x) f^{(q)}(x) dx = (−1)^q f^{(q)}(0).
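Numerically, δ^{(1)} can be approximated by differentiating a narrow Gaussian approximation to the Dirac mass; pairing it with a smooth f then recovers −f^{(1)}(0), as in the display above. A hypothetical sketch (all names ours):

```python
import math

def dirac_approx_deriv(x, h):
    # Derivative of a Gaussian approximation to the Dirac mass at 0.
    dens = math.exp(-0.5 * (x / h) ** 2) / (h * math.sqrt(2.0 * math.pi))
    return -(x / h ** 2) * dens

def pair_with(f, h, half_width=1.0, dx=2e-5):
    # Riemann-sum pairing <delta_h^{(1)}, f>; tends to -f'(0) as h -> 0.
    n = int(2 * half_width / dx)
    return sum(dirac_approx_deriv(-half_width + i * dx, h)
               * f(-half_width + i * dx) for i in range(n + 1)) * dx

f = lambda x: math.sin(x) + 2.0   # f'(0) = 1, so the pairing approaches -1
```

The additive constant 2 contributes nothing, since the derivative of a Dirac mass pairs to zero against constants.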

Besides Dirac functions, many other generalized functions can be constructed; see Schwartz [1954,1966] or Zemanian [1965] for examples.

This paper allows functional derivatives ϕ[F](·) to be generalized functions. I will

define an increasing sequence of spaces of generalized functions. Each space will contain functions of a given "level of unsmoothness." It will first be shown that the asymptotic distribution of the plug-in depends on the particular space containing ϕ[F](·), and that the more unsmooth ϕ[F](·) is, the slower the rate of convergence. Furthermore, when ϕ[F](·) is more unsmooth than cadlag (i.e., when it is a generalized function instead of a regular function), it will be shown that constructing the estimator on correlated data does not affect its asymptotic variance.

Start by defining the space C^{−1} of bounded cadlag functions from [0,1]^d to R. C^{−1} contains all the usual spaces C^0, C^1, etc. of continuous, continuously once-differentiable functions, etc. The regular functions are the elements of C^{−1}. Now define C^{−2} to be the space of linear combinations of Dirac functions and functions of C^{−1}, C^{−3} to be the space of linear combinations of derivatives of Dirac functions and functions of C^{−2}, etc. When ϕ[F](·) belongs to the generalized function space C^{−q}, q ≥ 2, but not to the space immediately smaller, C^{−q+1}, write ϕ[F] ∈ C^{−q} \ C^{−q+1}. q can readily be interpreted as an "order of unsmoothness" of ϕ[F](·). Moving up the following scale, the functions become

more unsmooth, and conversely:

    DIFFERENTIATE ↑

        C^{−3}  }
        C^{−2}  }  generalized functions
        C^{−1}  }
        C^{0}   }  regular functions
        C^{1}   }

    INTEGRATE ↓

2.4. The Generalized Delta Method

The result is stated in dimension d = 1. An extension to the multivariate case is provided in the Appendix. ϕ[F] ∈ C^{−q} \ C^{−q+1} has the form:

ϕ[F](x) = Σ_{l=1}^{L} α_l[F](x) δ^{(q−2)}_{(y_l)}(x) + B[F](x), where each y_l is a fixed point, α_l[F](·) ∈ C^{−1} and B[F](·) ∈ C^{−q+1} (see Section 5 for examples).

The following delta method characterizes the asymptotic distribution of the plug-in functional as a function of the particular space C^{−q} in which the functional derivative ϕ[F](·) lies:

Theorem: Suppose that Φ is L(2,m)-Hadamard-differentiable at the true cdf F with functional derivative ϕ[F](·). Then under A1–A3:

(i) If ϕ[F] ∈ C^{−1}, then under A4(r,m): n^{1/2} {Φ(F̂_n) − Φ(F)} →^d N(0, V_Φ[F]) with asymptotic variance:
V_Φ[F] = VAR(ϕ[F](x_t)) + 2 Σ_{k=1}^{+∞} COV(ϕ[F](x_t), ϕ[F](x_{t+k}))

(ii) If ϕ[F] ∈ C^{−q} \ C^{−q+1} for some q ∈ [2,s], then under A4(r+1/2, m): h_n^{(2q−3)/2} n^{1/2} {Φ(F̂_n) − Φ(F)} →^d N(0, V_Φ[F]) with asymptotic variance:

V_Φ[F] = [∫_{−∞}^{+∞} K^{(q−2)}(x)² dx] Σ_{l=1}^{L} {α_l[F](y_l)}² f(y_l). The asymptotic variance is the same whether or not the data are serially dependent.

When the functional derivative is a regular function (that is, ϕ[F] ∈ C^{−1}), the result does not depend on how smooth it is beyond being cadlag. On the other hand, when it is a generalized function, the result depends on the exact degree of unsmoothness of the functional derivative (belonging to the space C^{−q}, but not the one immediately smoother, C^{−q+1}). Many (but not all, e.g., the integrated squared density) of the functionals with regular derivatives can be estimated without smoothing. In that case, the result is exactly the same whether a kernel or empirical cdf is plugged in, and there is no reason to smooth.

For functionals with unsmooth derivatives, however, smoothing is essential, as the plug-in cannot even be defined at the empirical cdf. And the asymptotic distribution is driven exclusively by the "most unsmooth" component of the functional derivative: the smoother component B[F](·) ∈ C^{−q+1} of ϕ[F] ∈ C^{−q} \ C^{−q+1} does not appear in the asymptotic variance. Such a functional (asymptotically) behaves essentially like a linear combination of the density or its derivatives that are not integrated upon. When ϕ[F] ∈ C^{−q} \ C^{−q+1}, it can also be noted that not only does the asymptotic variance contain no time-series term, but it also has no cross-covariances across the L terms in the functional derivative. For example, it is known that the kernel density estimators evaluated at a point y_1 and at a different point y_2 are asymptotically uncorrelated.

This brings the following remark. The slower-than-root-n functionals have a "local" character, such as the density evaluated at a point, or the mode of the density function. Consider for example the density function f(·) and the local functional Φ_y[F] ≡ f(y) (real-valued) as opposed to the global functional Φ[F] ≡ f(·) (C^s-valued).

Drawing from the experience of root-n functionals, it may be tempting to try to obtain weak convergence to a Gaussian process of Φ[F] ≡ f(·). Unfortunately, no such result holds for

slower-than-root-n functionals. Indeed, if a limiting process existed for the normalized kernel density estimator, this process would have to take independent values W(t) and W(s) for every t ≠ s.

The delta method derived here has an intuitive duality interpretation. The asymptotic distribution of an unsmooth functional Φ is driven by the inner product ∫_{−∞}^{+∞} ϕ[F](x) dH(x).

When ϕ[F] is a generalized function (in C^{−q}, q ≥ 2), then H must belong to C^{q−1}. Therefore one needs a sufficiently regular nonparametric estimator of the unknown cdf to plug in as H = F̂_n − F. This is the role played by the kernel smoothing. If one uses the empirical distribution F_n instead of the KCDF F̂_n, then H = F_n − F will be in C^{−1} only, and therefore the only functionals that can be plugged in must have derivatives in C^{−1}.

3. Consistent Estimation of the Asymptotic Variances

The asymptotic variances given by the delta method can be consistently estimated in each case:

(i) If ϕ[F] ∈ C^{−1}, then under A4(r,m) and the technical regularity condition A5 (given in the Appendix, and designed to guarantee that the truncated sum in the variance estimator will effectively approximate the infinite sum), the asymptotic variance V_Φ[F] can be consistently estimated by:

V̂_n ≡ (1/n) Σ_{i=1}^{n} ϕ[F̂_n](x_i)² − [(1/n) Σ_{i=1}^{n} ϕ[F̂_n](x_i)]²
  + 2 Σ_{k=1}^{G_n} (1 − k/(G_n + 1)) { [1/(n − k)] Σ_{i=1}^{n−k} ϕ[F̂_n](x_i) ϕ[F̂_n](x_{i+k}) − [(1/n) Σ_{i=1}^{n} ϕ[F̂_n](x_i)]² }

where G_n is a truncation lag chosen such that lim_{n→∞} G_n = +∞ and G_n = O(n^{1/3}). This is an

estimator of the spectral density at zero (see Newey-West [1987], Robinson [1989,1991]).

The choice of the truncation lag Gn and the Bartlett kernel is subject to the same provisions and can be improved upon as in Andrews [1991] in the parametric case.
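The case (i) variance estimator can be sketched in code (a hypothetical helper, not from the paper; the Bartlett weights 1 − k/(G_n + 1) and the truncation lag are as described above, with the mean subtracted once up front). The input is the sequence of evaluated derivatives ϕ[F̂_n](x_i):

```python
def newey_west_variance(scores, G):
    # Long-run variance of the scores phi[F_hat](x_i): sample variance plus
    # Bartlett-weighted autocovariances up to the truncation lag G.
    n = len(scores)
    mean = sum(scores) / n
    v = sum((s - mean) ** 2 for s in scores) / n
    for k in range(1, G + 1):
        weight = 1.0 - k / (G + 1.0)     # Bartlett kernel
        cov = sum((scores[i] - mean) * (scores[i + k] - mean)
                  for i in range(n - k)) / (n - k)
        v += 2.0 * weight * cov
    return v
```

With G = 0 this reduces to the i.i.d. sample variance; with negatively autocorrelated scores the time-series correction lowers the estimate, as it should.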

(ii) If ϕ[F] ∈ C^{−q} \ C^{−q+1}, then under A4(r+1/2, m) the asymptotic variance V_Φ[F] can be consistently estimated by: V̂_n ≡ [∫_{−∞}^{+∞} K^{(q−2)}(x)² dx] Σ_{l=1}^{L} {α_l[F̂_n](y_l)}² f̂_n(y_l).

The appropriate estimate of the asymptotic variance makes it possible to construct confidence intervals on Φ[F̂_n] and carry out tests of general hypotheses regarding F. For example, to test the hypothesis H_0: Φ[F] = 0 versus H_1: Φ[F] ≠ 0, one could simply use the following Wald-type test:

W_n ≡ λ(n) Φ(F̂_n)′ V̂_n^{−1} Φ(F̂_n) →^d χ²[1] under H_0, where λ(n) ≡ n in case (i) and λ(n) ≡ n h_n^{(2q−3)} in case (ii).

4. Rates of Convergence

The speed of decrease of the bandwidth to zero as the sample size increases is constrained by A4. The bandwidth can be chosen within the bounds allowed by A4 in order to generate the fastest possible rate of convergence β (the speed of convergence being n^{−β}).

(i) If ϕ[F] ∈ C^{−1}, then the plug-in will converge at rate β = 1/2. The root-n rate is

achieved by kernel plug-ins under A3 no matter how hn is chosen within A4, and will produce an asymptotic distribution centered at zero.

(ii) If ϕ[F] ∈ C^{−q} \ C^{−q+1} for some q ∈ [2,s], then the rate of convergence is at best β = (r − q + 2)/(2r + 1). It can be achieved by kernel plug-ins under A3 when choosing

h_n of the order n^{−α}, with α = 1/(2r + 1). The resulting asymptotic distribution of the plug-in will not be centered at zero. For any ε > 0, the rate of convergence β − ε can however be achieved with a resulting asymptotic distribution centered at zero by choosing h_n of the order n^{−α}, with α = 1/(2r + 1) + ε/(q − 3/2). This choice is admissible under

A4(r+1/2,m). Given the optimal rates of Stone [1980] and Goldstein and Messer [1992], it therefore turns out that kernel-type estimators can achieve the optimal rate (but if one insists on the optimal rate, then the limiting distribution is not centered at zero).
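As a quick arithmetic check of the rate formula above, a sketch using exact fractions (with β = (r − q + 2)/(2r + 1) and α = 1/(2r + 1) as reconstructed in the text; the function names are ours):

```python
from fractions import Fraction

def best_rate(r, q):
    # Fastest achievable rate n^{-beta} when phi[F] lies in C^{-q} \ C^{-q+1}
    # and the kernel has order r: beta = (r - q + 2) / (2r + 1).
    return Fraction(r - q + 2, 2 * r + 1)

def bandwidth_exponent(r):
    # h_n of order n^{-alpha} with alpha = 1/(2r + 1) attains that rate.
    return Fraction(1, 2 * r + 1)
```

Setting q = 2 (e.g., the density at a point) recovers the familiar r/(2r + 1) rate, and larger q (more unsmooth derivatives) slows the rate, as the Theorem predicts.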

5. Examples and Applications

Classical examples as well as new distributions are provided in this section, both to show how the method very easily yields classical results and to provide new results.

Example 1: Ordinary Least Squares

The following trivial example illustrates the method in a very simple case, recovering the asymptotic distribution of classical parametric estimators. Consider a simple linear model: y_t = x_t β + ε_t, E[ε_t | x_t] = 0. Although at first sight a quintessentially parametric model, the linear regression model in fact makes no assumptions whatsoever regarding the distribution of the disturbances (other than uncorrelatedness with the regressors). In that sense, the OLS estimator can be treated as a nonparametric estimator. OLS estimates the functional:

β = E[XY]/E[X²] ≡ Φ(F) = [∫_{−∞}^{+∞} ∫_{−∞}^{+∞} x y f(x,y) dx dy] / [∫_{−∞}^{+∞} ∫_{−∞}^{+∞} x² f(x,y) dx dy]

by plugging-into this expression the empirical cdf Fn :

n n βà ≡ Φ()= 112 OLSF n∑∑yx t t xt n t ==1 n t 1

Now compute the functional derivative of Φ:

Φ[F + H] = [∫∫ xy{f + h}] / [∫∫ x²{f + h}] = [∫∫ xyf] / [∫∫ x²f] + [∫∫ xyh] / [∫∫ x²f] − [∫∫ xyf][∫∫ x²h] / [∫∫ x²f]² + O(‖H‖²_{L(2,1)})
= Φ[F] + ∫∫ ϕ[F](x,y) h(x,y) dx dy + O(‖H‖²_{L(2,1)})

So the functional F ↦ Φ(F) = β is L(2,1)-Hadamard-differentiable at F, and its derivative is ϕ[F](u,v) = uv/E[X²] − E[XY] u²/{E[X²]}² ∈ C^{−1}. Theorem (i) gives the asymptotic

distribution of the plug-in (using either the empirical or the kernel estimator of F) for ϕ[F] ∈ C^{−1}: n^{1/2} {Φ(F_n) − Φ(F)} →^d N(0, V_Φ[F]), with asymptotic variance given by:

V_Φ[F] = VAR(ϕ[F](y_t, x_t)) + 2 Σ_{k=1}^{+∞} COV(ϕ[F](y_t, x_t), ϕ[F](y_{t+k}, x_{t+k}))

Replacing the functional derivative ϕ[F] by its expression and y_t − x_t β by ε_t, it is easy to check that this expression is equal to the classical OLS asymptotic variance:
V_OLS ≡ (E[x_t²])^{−2} { E[ε_t² x_t²] + 2 Σ_{k=1}^{+∞} E[ε_t ε_{t+k} x_t x_{t+k}] }

Example 2: Least Absolute Deviations

Consider again the simple linear model y_t = x_t β + ε_t where x is K-variate and β is an

unknown K-dimensional parameter vector. The identification assumption now consists of {ε_t / t = 1, ..., T} being independent of {x_t / t = 1, ..., T} and having zero

median. Let fε be the marginal density of the disturbances ε. The least absolute deviation

(LAD) estimator is defined by (with the minimum taken over a compact set): β̂_LAD ≡ argmin_β (1/T) Σ_{t=1}^{T} |y_t − x_t β|.

T 1 − βà ≡ The first order condition is: ∑ sign() yt x t LAD x t 0, where sign(a) -1 if T t =1 a<0 and 1 if a 0. Because the ε process has zero median and is independent of the x process, E[] sign() y− xβ x = 0. Let Φ()F ≡β (this is a RK-valued functional), where F is the joint cdf of (x,y), be defined by E[] sign() y− xΦ() F x = 0. This provides an example βà = Φ() of an implicitly defined functional. By construction LADF n .Its asymptotic variance was obtained by Pollard [1990] and Phillips [1991].

Here the functional derivative is easy to compute:

ϕ_LAD[F](x, y) = [1/(2 f_ε(0))] A^{−1} sign(y − xβ) x′ ∈ C^{−1}, where A ≡ E[x_t′ x_t], so:

12 ββà − →d ()()−2 −−11 nNfAVA{}LAD ()02, ε 0 , where:

V ≡ VAR(sign(ε_t) x_t) + Σ_{k=1}^{+∞} { COV(sign(ε_t) x_t, sign(ε_{t+k}) x_{t+k}) + COV(sign(ε_{t+k}) x_{t+k}, sign(ε_t) x_t) }

Example 3: Censored Least Absolute Deviations

Powell [1984] extended the LAD estimator to the case where the dependent variable is censored, i.e., only y_t ≡ max{0, x_t β + ε_t} is observed. This case is typical of situations arising in a labor supply context. In that case, consider the CLAD estimator:

T ββà ≡−1 CLADargminβ ∑ yx t max{}0 , t T t =1

Regularity conditions guaranteeing the identifiability of β and the existence and uniqueness (asymptotically) of a solution are given by Powell [1984], and are also assumed here. The population first order condition is E[sign(y − max{0, xβ}) 1(xβ > 0) x] = 0; let Φ(F) ≡ β so that β̂_CLAD = Φ(F_n). Now: ϕ_CLAD[F](x, y) = [1/(2 f_ε(0))] B^{−1} sign(y − max{0, xβ}) 1(xβ > 0) x′ ∈ C^{−1}, with B ≡ E[1(x_t β > 0) x_t′ x_t]. Thus n^{1/2} {β̂_CLAD − β} →^d N(0, (2 f_ε(0))^{−2} B^{−1} W B^{−1}), where:
W ≡ VAR(sign(ε_t) 1(x_t β > 0) x_t) + Σ_{k=1}^{+∞} { COV(sign(ε_t) 1(x_t β > 0) x_t, sign(ε_{t+k}) 1(x_{t+k} β > 0) x_{t+k}) + COV(sign(ε_{t+k}) 1(x_{t+k} β > 0) x_{t+k}, sign(ε_t) 1(x_t β > 0) x_t) }

This asymptotic distribution for dependent data appears to be new.

Example 4: Integrated Functionals

Consider next the family of real-valued functionals of the following form, where ω(·) is a trimming function: Φ(F) ≡ ∫_{−∞}^{+∞} ω(x) ψ(x, F^{(1)}(x), F^{(2)}(x), ..., F^{(m)}(x)) dx. This class

includes the information matrix giving the asymptotic variance of maximum likelihood estimators, the entropy measure, the average derivative estimators of Powell, Stock and Stoker [1989] and Robinson [1989], the integral of the squared density, etc. The functional derivative is:
ϕ[F](x) = Σ_{q=1}^{m} (−1)^{q−1} (∂^{q−1}/∂x^{q−1})[ω(x) (∂ψ/∂F^{(q)})(x, F^{(1)}(x), ..., F^{(m)}(x))] ∈ C^{−1},
so the plug-in will converge at rate root-n and have an asymptotic distribution sensitive to dependent data.

Example 5: Pointwise Estimation

Consider the classical example Φ_q[F] ≡ F^{(q)}(y), a derivative of the cdf evaluated at y. Then if q = 0, ϕ_0[F](x) = 1(y ≥ x) ∈ C^{−1}, while for q ≥ 1, ϕ_q[F](x) = δ^{(q−1)}_{(y)}(x) ∈ C^{−q−1} \ C^{−q}, hence:
h_n^{(2q−1)/2} n^{1/2} {F̂_n^{(q)}(y) − F^{(q)}(y)} →^d N(0, [∫_{−∞}^{+∞} |K^{(q−1)}(x)|² dx] f(y)). The

extension to multivariate data is immediate given the multivariate result in the Appendix.

Example 6: Smooth Quantiles

Take Φ[F] ≡ F^{−1}(y) for some y. In the independent case, smooth estimation of

quantiles has been studied, e.g., by Parzen [1979] and Silverman and Young [1987]. Here the functional derivative can be computed as: ϕ[F](x) = −1(F^{−1}(y) ≥ x)/F^{(1)}(F^{−1}(y)) ∈ C^{−1},

so the plug-in will converge at rate root-n and its asymptotic variance will have time-dependent terms. Letting f_k be the joint density of observations at lag k, the asymptotic variance is:

−1 −1  +∞ Fy()Fy()  −+{} −  yy()12∑ ∫ ∫ fstfsftdsdtk (,)()()   =  []=  k 1 −∞ −∞  VFΦ 2 ()FFy()11[]− ()

This result also appears to be new. Weak convergence of the quantile process to a Gaussian process is proved using the same method (see Aït-Sahalia [1993]).
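In practice the smooth quantile is obtained by inverting the KCDF, which for a Gaussian kernel is continuous and strictly increasing, so bisection applies. An illustrative sketch (the Gaussian kernel and names are our assumptions):

```python
import math

def kernel_cdf(x, data, h):
    # KCDF with a Gaussian kernel: average of Gaussian cdfs.
    return sum(0.5 * (1.0 + math.erf((x - xi) / (h * math.sqrt(2.0))))
               for xi in data) / len(data)

def smooth_quantile(y, data, h, lo=-50.0, hi=50.0, tol=1e-10):
    # F_hat^{-1}(y) by bisection on [lo, hi].
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kernel_cdf(mid, data, h) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Unlike inversion of the ECDF, this returns a value that varies smoothly with the data, which is what makes the generalized delta method applicable.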

Example 7: Mode

The mode of a unimodal univariate density, studied by Parzen [1962] for i.i.d. data, can be obtained from the following functional: Φ(F) ≡ [F^{(2)}]^{−1}(0), that is, the point at

which the derivative of the density is zero. The functional derivative can be computed here:

ϕ[F](x) = −[1/F^{(3)}([F^{(2)}]^{−1}(0))] δ^{(1)}_{([F^{(2)}]^{−1}(0))}(x) ∈ C^{−3} \ C^{−2}, so it follows that:

 +∞ −    FF()121[][]() ( ) 0 3212/()()Ã 21−−− 21→d  () 1() 2  hnnn{}[ F]()00 [ F ]() N 0 , ∫ | Kxdx |  2 .  ()321 ()−   −∞  ()FF[][]()0 

This result appears to be new for dependent data.
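The mode plug-in is simply the argmax of the kernel density estimate; a grid-search sketch (Gaussian kernel assumed, names ours):

```python
import math

def kde(x, data, h):
    # Gaussian kernel density estimate at x.
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data) \
        / (len(data) * h * math.sqrt(2.0 * math.pi))

def mode_estimate(data, h, lo, hi, steps=2001):
    # Plug-in mode: the grid point maximizing f_hat.
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda x: kde(x, data, h))
```

The grid spacing bounds the resolution of the estimate, so in practice it should be taken small relative to the bandwidth.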

Example 8: Hazard Rate

Consider Φ[F] ≡ F^{(1)}(y)/(1 − F(y)) for some fixed y. Its kernel estimation has been studied by Roussas [1990]. Hazard rates are typically useful in unemployment studies. Here the derivative can easily be computed:

ϕ[F](x) = [1/(1 − F(y))] δ_{(y)}(x) + [F^{(1)}(y)/(1 − F(y))²] 1(y ≥ x) ∈ C^{−2} \ C^{−1}, and therefore:

 +∞    fyà () fy()  () fy() hn12/ 12 n −  →d NKxdx0 ,| 1() | 2  . n − à −  ∫ ()− 2   1 Fyn () 1 Fy()  −∞  1 Fy() 

Example 9: Regression Function

The Nadaraya-Watson method relies on a kernel plug-in to estimate Φ[F] ≡ E[Z | Y = y] = [∫_{−∞}^{+∞} z f(y,z) dz] / [∫_{−∞}^{+∞} f(y,z) dz]. The asymptotic distribution of

Robinson [1983] and Bierens [1985] can be recovered by computing the functional

derivative: ϕ[F](w) = [(z − a[F])/b[F]] δ_{(y)}(w) ∈ C^{−2} \ C^{−1}, where a[F] ≡ E[Z | Y = y] and b[F] ≡ ∫_{−∞}^{+∞} f(y,z) dz. Hence for ε ≡ Z − E[Z | Y = y] and k regressors in Y:

+∞ 2  2  EYyKwdw[]ε | = {}[]()() ∫ l k //212{}ΦΦÃ − →d  −∞  hnn ( Fn )() F N 0 , +∞   fyzdz(),   ∫−∞ 


6. Conclusions

This paper has extended the delta method to nonparametric estimators of unsmooth functionals. The regularity conditions are simple, easily verifiable, and generally equal or beat the conditions used in case-by-case studies. Generalized derivatives were allowed to permit the inclusion of virtually any functional, global or pointwise, explicitly or implicitly defined. It was found that both the rate of convergence to the asymptotic distribution and the asymptotic variance are functions of the unsmoothness of the functional derivative. Basing the estimator on dependent data modifies the asymptotic distribution only if the functional is more irregular than some threshold level (cadlag). New functional derivatives were computed for a variety of practical estimators used in econometrics, and used to obtain their asymptotic distributions straightforwardly.

Compared to the case-by-case approach, the generalized delta method has another advantage. It isolates the computation of the functional derivative, which is computed once and for all. When considering dependent sequences, or nonparametric estimation strategies other than kernel-based ones, the exact same functional derivative will be needed. The kernel results of this paper could therefore potentially be extended to cover other nonparametric methods. Many popular nonparametric procedures for density estimation are indeed of the form f̂_n(u) ≡ (1/n) Σ_{i=1}^{n} K_n(u, x_i). For example, the kernel method sets K_n(u, x_i) ≡ (1/h_n^d) K((u − x_i)/h_n) with a fixed function K(·), while the orthogonal function method is based on K_n(u, x_i) ≡ Σ_{j=1}^{h_n^{−1}} p_j(u) p_j(x_i), where {p_j} is a system of orthogonal functions in L²(R^d).
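The common form f̂_n(u) = (1/n) Σ K_n(u, x_i) can be made concrete: the sketch below instantiates K_n for the kernel case and for a cosine orthogonal series on [0,1] (an illustrative basis choice; names ours).

```python
import math

def kernel_Kn(u, x, h):
    # Kernel choice: K_n(u, x) = (1/h) K((u - x)/h) with a Gaussian K.
    return math.exp(-0.5 * ((u - x) / h) ** 2) / (h * math.sqrt(2.0 * math.pi))

def cosine_Kn(u, x, J):
    # Orthogonal-series choice on [0,1]: sum_j p_j(u) p_j(x), with
    # p_0 = 1 and p_j(t) = sqrt(2) cos(j pi t).
    return 1.0 + sum(2.0 * math.cos(j * math.pi * u) * math.cos(j * math.pi * x)
                     for j in range(1, J + 1))

def density_estimate(u, data, Kn):
    # Common form f_hat(u) = (1/n) sum_i K_n(u, x_i), for either choice.
    return sum(Kn(u, xi) for xi in data) / len(data)
```

The same plug-in structure, and hence the same functional derivative, applies to both choices of K_n; only the smoothing device differs.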

References

Aït-Sahalia, Y., [1993], "Nonparametric Functional Estimation with Applications to Financial Models," Ph.D. Thesis, MIT, May.
Andrews, D.W.K., [1991], "Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimation," Econometrica, Vol. 59, No. 3, 817-858.
Arcones, M.A., and Yu, B., [1992], "Central Limit Theorems for Empirical and U-Processes of Stationary Mixing Sequences," MSRI Mimeo, U.C. Berkeley.
Bierens, H.J., [1985], "Kernel Estimators of Regression Functions," in Bewley, T.F., ed., Advances in Econometrics, Fifth World Congress, Vol. I, Econometric Society Monographs, Cambridge University Press, Cambridge, England.

Billingsley, P., [1968], Convergence of Probability Measures, Wiley, New York.
Dudley, R.M., [1990], "Nonlinear Functionals of Empirical Measures and the Bootstrap," in Probability in Banach Spaces 7, ed. E. Eberlein et al., Progress in Probability 21, 63-82, Birkhäuser, Boston.

Fernholz, L.T., [1983], Von Mises Calculus for Statistical Functionals, Lecture Notes in Statistics 19, Springer-Verlag.
Gill, R.D., [1989], "Non- and Semi-parametric Maximum Likelihood Estimators and the von Mises Method (Part 1)," Scandinavian Journal of Statistics, Vol. 16, 97-128.
Goldstein, L. and Messer, P., [1992], "Optimal Plug-In Estimators for Nonparametric Functional Estimation," Annals of Statistics, Vol. 20, 1306-1328.

Györfi, L., Härdle, W., Sarda, P. and Vieu, P., [1989], Nonparametric Curve Estimation from Time Series, Lecture Notes in Statistics 60, Springer-Verlag.
Masry, E., [1989], "Nonparametric Estimation of Conditional Probability Densities and Expectations of Stationary Processes: Strong Consistency and Rates," Stochastic Processes and their Applications, 32, 109-127.
von Mises, R., [1947], "On the Asymptotic Distribution of Differentiable Statistical Functions," Annals of Mathematical Statistics, Vol. 18, 309-348.
Newey, W.K. and West, K.D., [1987], "A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix," Econometrica, Vol. 55, No. 3, 703-708.
Pakes, A. and Pollard, D., [1989], "Simulation and the Asymptotics of Optimization Estimators," Econometrica, Vol. 57, No. 5, 1025-1057.
Parzen, E., [1962], "On Estimation of a Probability Density Function and the Mode," Annals of Mathematical Statistics, 33, 1065-1076.

------, [1979], "Nonparametric Statistical Data Modeling," Journal of the American Statistical Association, 74, 105-131.

Phillips, P.C.B., [1991], "A Shortcut to LAD Estimator Asymptotics," Econometric Theory, 7, 450-463.
Pollard, D., [1990], "Asymptotics for Least Absolute Deviation Estimator," Econometric Theory, 6.
------, [1984], Convergence of Stochastic Processes, Springer, New York.
Powell, J.L., [1984], "Least Absolute Deviations Estimation for the Censored Regression Model," Journal of Econometrics, Vol. 25, 303-325.
Powell, J.L., Stock, J.H. and Stoker, T.M., [1989], "Semiparametric Estimation of Index Coefficients," Econometrica, Vol. 57, No. 6, 1403-1430.

Reeds, J.A. III, [1976], "On the Definition of Von Mises Functionals," Ph.D. Thesis, Harvard University, Department of Statistics.
Robinson, P.M., [1991], "Automatic Frequency Domain Inference on Semiparametric and Nonparametric Models," Econometrica, Vol. 59, No. 5, 1329-1363.
------, [1989], "Hypothesis Testing in Semiparametric and Nonparametric Models for Econometric Time Series," Review of Economic Studies, Vol. 56, 511-534.
------, [1988], "Root-N Consistent Semiparametric Regression," Econometrica, Vol. 56, 931-954.
------, [1984], "Robust Nonparametric Autoregression," in Robust and Nonlinear Time Series Analysis, Franke, Härdle and Martin eds., Lecture Notes in Statistics 26, Springer-Verlag, Heidelberg.
------, [1983], "Nonparametric Estimators for Time Series," Journal of Time Series Analysis, 4, 185-207.

Rosenblatt, M., [1991], Stochastic Curve Estimation, NSF-CBMS Regional Conference Series in Probability and Statistics, Vol. 3, Institute of Mathematical Statistics, Hayward.

------, [1971], "Curve Estimates," Annals of Mathematical Statistics, Vol. 42, 1815-1842.

Roussas, G., [1969], "Nonparametric Estimation in Markov Processes," Annals of the Institute of Statistical Mathematics, Vol. 21, 73-87.

------, [1989], "Hazard Rate Estimation Under Dependence Conditions," Journal of Statistical Planning and Inference, 22, 81-93.

Schwartz, L., [1954, 1966], Theory of Distributions, Hermann, Paris.

Stone, C.J., [1980], "Optimal Convergence Rates for Nonparametric Estimators," Annals of Statistics, Vol. 8, No. 6, 1348-1360.

Zemanian, A.H., [1965], Distribution Theory and Transform Analysis, McGraw-Hill, New York.

Appendix

The statement of the Theorem in the multivariate case is the following:

(i) If $\varphi[F] \in C^{-1}$, the result reads the same in dimension $d$;

(ii) If $\varphi[F] \in C^{-q-1} \setminus C^{-q}$: let
$$\varphi[F](x) = \sum_{\ell=1}^{L} \alpha_{\ell}[F](x)\,\partial^{\Delta_{\ell}}\delta_{y^{\ell}}(x^{\ell}) + B[F](x)$$
where $|\Delta_{\ell}| = q - 2$, $\alpha_{\ell}[F](\cdot) \in C^{-1}$ and $B[F](\cdot) \in C^{-q+1}$. Here $y^{\ell} \in \mathbb{R}^{d(\ell)}$ contains $d(\ell)$ components, and $x = (x^{\ell}, x^{-\ell})$ is partitioned accordingly. The maximal number of variables affected by the Dirac mass is $d^{*} \equiv \max\{d(\ell) \,/\, \ell \in \{1, \ldots, L\}\}$, and is attained at $\ell$ in $L^{*} \equiv \{\ell \in \{1, \ldots, L\} \,/\, d(\ell) = d^{*}\}$.

Then under A4(r+d*/2, m):
$$h_{n}^{(d^{*}+2q-4)/2}\, n^{1/2}\left\{\Phi(\hat{F}_{n}) - \Phi(F)\right\} \xrightarrow{d} N\left(0, V_{\Phi}[F]\right)$$
where:
$$V_{\Phi}[F] = \sum_{\ell \in L^{*}} \left[\int_{-\infty}^{+\infty} \left(\partial^{\Delta_{\ell}} K^{\ell}(u^{\ell})\right)^{2} du^{\ell}\right] \int_{-\infty}^{+\infty} \left\{\alpha_{\ell}[F](y^{\ell}, t^{-\ell})\right\}^{2} f(y^{\ell}, t^{-\ell})\, dt^{-\ell}$$
and $K^{\ell}(\cdot) \equiv \int_{-\infty}^{+\infty} K(\cdot, v^{-\ell})\, dv^{-\ell}$.

Proof of Theorem: The following two lemmas will be used:

Lemma 1: Under A1-A4(·,0), $\hat{A}_{n} \equiv n^{1/2}\left(\hat{F}_{n} - E[\hat{F}_{n}]\right)$ converges in law to a Gaussian $C^{0}$-stochastic process $G$ in the space $\left(C^{0}, \|\cdot\|_{L(\infty,0)}\right)$, with finite-dimensional covariances given below. If A4(·,0) is replaced by the more stringent requirement A4(r,0), then the preceding statements hold for the centered process $\hat{A}_{n} \equiv n^{1/2}\left(\hat{F}_{n} - F\right)$ instead of $\hat{A}_{n} \equiv n^{1/2}\left(\hat{F}_{n} - E[\hat{F}_{n}]\right)$. The covariance kernel of the generalized Brownian Bridge $G \equiv \tilde{B} \circ F$ is given by (where $F_{k}$ is the joint cdf of observations at lag $k$):

$$E\left[\tilde{B}(F(s))\,\tilde{B}(F(t))\right] = \min\left(F(s), F(t)\right) - F(s)F(t) + \sum_{k=1}^{+\infty} \left\{F_{k}(s,t) + F_{k}(t,s) - 2F(s)F(t)\right\}.$$

Lemma 2 (Bounds for Remainder Term): Under A1-A3, for $q = 0, \ldots, d+s$:

$$\left\|\hat{F}_{n} - E[\hat{F}_{n}]\right\|_{L(2,q)} = O_{p}\left(n^{-1/2} h_{n}^{-q}\right) \quad \text{and} \quad \left\|E[\hat{F}_{n}] - F\right\|_{L(2,q)} = O\left(h_{n}^{r-(q-d)}\right).$$

To prove Lemma 1, show that the class of functions $\Gamma \equiv \left\{W_{(y,h)} \,/\, y \in \mathbb{R}^{d},\, h \in \mathbb{R}_{+}^{*}\right\}$, where $W_{(y,h)}(\cdot) \equiv \int_{-\infty}^{y} \frac{1}{h^{d}}\, K\!\left(\frac{t - \cdot}{h}\right) dt$, forms a subgraph VC class. But such a class is a Euclidean class (from Lemma (2.12) in Pakes and Pollard [1989]); conclude with Theorem 1 of Arcones and Yu [1992]. Alternatively one could use the U-statistics approach of Robinson [1989]. Lemma 2 is easy given A2. Details are in Aït-Sahalia [1993].

(i) Consider now the first part of the Theorem. By differentiability of the functional $\Phi$ at $F$:
$$n^{1/2}\left\{\Phi[\hat{F}_{n}] - \Phi[F]\right\} = \int_{-\infty}^{+\infty} \varphi[F](x)\, d\hat{A}_{n}(x) + R_{\Phi}[F, \hat{A}_{n}], \quad \text{where} \quad R_{\Phi}[F, \hat{A}_{n}] = O\left(n^{1/2} \left\|\hat{F}_{n} - F\right\|_{L(\infty,m)}^{2}\right).$$

First, $R_{\Phi}[F, \hat{A}_{n}] = o_{p}(1)$ follows from Lemma 2 since
$$n^{1/2} \left\|\hat{F}_{n} - F\right\|_{L(\infty,m)}^{2} = O_{p}\left(n^{-1/2} h_{n}^{-2m} + n^{1/2} h_{n}^{2(r-(m-d))}\right)$$
is $o_{p}(1)$ under A4(r,m) as $r > 2(m-d)$. Then by Slutsky's Theorem, the distribution of $n^{1/2}\left\{\Phi[\hat{F}_{n}] - \Phi[F]\right\}$ is given by that of $\int_{-\infty}^{+\infty} \varphi[F](x)\, d\hat{A}_{n}(x)$.

But by the continuous mapping theorem (e.g., Proposition 9.3.7 in Dudley [1989]), $\int_{-\infty}^{+\infty} \varphi[F](x)\, d\hat{A}_{n}(x)$ converges in law to $\int_{-\infty}^{+\infty} \varphi[F](x)\, d\tilde{B}(F(x))$, since from Lemma 1 $\hat{A}_{n}$ converges in law to the process $G \equiv \tilde{B} \circ F$. $\int_{-\infty}^{+\infty} \varphi[F](x)\, d\tilde{B}(F(x))$ is the Itô integral of the real-valued, non-random function $\varphi[F]$ with respect to the Gaussian stochastic process $\tilde{B} \circ F$ and is therefore normally distributed. The asymptotic variance of the generic Itô integral $\int_{-\infty}^{+\infty} \omega(x)\, d\tilde{B}(F(x))$ can be computed as:

$$E\left[\left(\int_{-\infty}^{+\infty} \omega(x)\, d\tilde{B}(F(x))\right)^{2}\right] = E\left[\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} \omega(x)\,\omega(y)\, d\tilde{B}(F(y))\, d\tilde{B}(F(x))\right] = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} \frac{\partial^{d}\omega(x)}{\partial x_{1} \cdots \partial x_{d}}\, \frac{\partial^{d}\omega(y)}{\partial y_{1} \cdots \partial y_{d}}\, E\left[\tilde{B}(F(y))\,\tilde{B}(F(x))\right] dy\, dx$$

Thus:

$$V_{\Phi}[F] = E\left[\varphi[F](x_{t})^{2}\right] - \left(E\left[\varphi[F](x_{t})\right]\right)^{2} + 2\sum_{k=1}^{+\infty} \left\{E\left[\varphi[F](x_{t})\,\varphi[F](x_{t+k})\right] - E\left[\varphi[F](x_{t})\right] E\left[\varphi[F](x_{t+k})\right]\right\}$$
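This variance is the long-run variance of the influence series $\varphi[F](x_{t})$: the marginal variance plus twice the sum of its autocovariances. As an illustration of the formula (ours, with hypothetical parameter values, not part of the original argument), take the simplest root-$n$ functional $\Phi(F) = E[x_{t}]$, whose influence function is $\varphi[F](x) = x - E[x_{t}]$, on a stationary AR(1) process; the formula then collapses to the familiar long-run variance $\sigma_{x}^{2}(1+\rho)/(1-\rho)$:

```python
import numpy as np

# Monte Carlo check of V_Phi[F] = var + 2*sum of autocovariances for the
# root-n functional Phi(F) = E[x_t] on a stationary AR(1) process
# x_t = rho*x_{t-1} + eps_t (all parameter values are illustrative).
rng = np.random.default_rng(1)
rho, n, n_rep = 0.5, 2000, 800

def ar1_path(n, rho, rng, burn=200):
    eps = rng.standard_normal(n + burn)
    x = np.empty(n + burn)
    x[0] = eps[0] / np.sqrt(1 - rho ** 2)   # draw x_0 from the stationary law
    for t in range(1, n + burn):
        x[t] = rho * x[t - 1] + eps[t]
    return x[burn:]

# Variance of the scaled functional n^{1/2}*(sample mean - 0) across replications
means = [np.sqrt(n) * ar1_path(n, rho, rng).mean() for _ in range(n_rep)]
v_mc = np.var(means)

# The formula: gamma_0 + 2*sum_{k>=1} gamma_k with gamma_k = sig2_x * rho^k
sig2_x = 1 / (1 - rho ** 2)
v_formula = sig2_x * (1 + rho) / (1 - rho)   # = 4 for rho = 0.5
print(v_formula, v_mc)
```

With $\rho = 0.5$ the formula gives 4, three times the marginal variance $\gamma_{0} = 4/3$: dropping the lag-$k$ terms, as one would for independent data, badly understates the variance here.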

(ii) Consider now the case where:

$$\varphi[F](x) = \alpha_{\ell}[F](x)\,\partial^{\Delta_{\ell}}\delta_{y^{\ell}}(x^{\ell}) \in C^{-q-1} \setminus C^{-q}, \quad \text{for } |\Delta_{\ell}| = q - 2,\ 2 \le q \le s.$$

The remainder term in the expansion of the functional is bounded as in (i). Then by Slutsky's Theorem the asymptotic distribution is given by the linear term (scaled for now at the rate $n^{1/2}$):

$$\int_{-\infty}^{+\infty} \alpha_{\ell}[F](x)\,\partial^{\Delta_{\ell}}\delta_{y^{\ell}}(x^{\ell})\, d\hat{A}_{n}(x)$$

$$= \int_{-\infty}^{+\infty} \left[\frac{1}{h_{n}^{d}} \int_{-\infty}^{+\infty} \partial^{\Delta_{\ell}}\left\{\alpha_{\ell}[F](x^{\ell}, x^{-\ell})\, K\!\left(\frac{x^{\ell} - t^{\ell}}{h_{n}}, \frac{x^{-\ell} - t^{-\ell}}{h_{n}}\right)\right\}\bigg|_{x^{\ell} = y^{\ell}} dx^{-\ell}\right] a_{n}(t)\, dt$$

$$+\; n^{1/2} \left[\int_{-\infty}^{+\infty} \left[\frac{1}{h_{n}^{d}} \int_{-\infty}^{+\infty} \partial^{\Delta_{\ell}}\left\{\alpha_{\ell}[F](x^{\ell}, x^{-\ell})\, K\!\left(\frac{x^{\ell} - t^{\ell}}{h_{n}}, \frac{x^{-\ell} - t^{-\ell}}{h_{n}}\right)\right\}\bigg|_{x^{\ell} = y^{\ell}} dx^{-\ell}\right] f(t)\, dt - \int_{-\infty}^{+\infty} \partial^{\Delta_{\ell}}\left\{\alpha_{\ell}[F](x^{\ell}, x^{-\ell})\, f(x^{\ell}, x^{-\ell})\right\}\bigg|_{x^{\ell} = y^{\ell}} dx^{-\ell}\right]$$

$$= \int_{-\infty}^{+\infty} \left[\frac{1}{h_{n}^{d}} \int_{-\infty}^{+\infty} \partial^{\Delta_{\ell}}\left\{\alpha_{\ell}[F](x^{\ell}, x^{-\ell})\, K\!\left(\frac{x^{\ell} - t^{\ell}}{h_{n}}, \frac{x^{-\ell} - t^{-\ell}}{h_{n}}\right)\right\}\bigg|_{x^{\ell} = y^{\ell}} dx^{-\ell}\right] a_{n}(t)\, dt + O_{p}\left(n^{1/2} h_{n}^{r-(q-2)}\right)$$

Write the leading term as $\int_{-\infty}^{+\infty} \omega_{n}(t)\, a_{n}(t)\, dt$, where $a_{n}(t)\, dt \equiv dA_{n}(t)$ and $A_{n} \equiv n^{1/2}(\hat{F}_{n} - F)$. Then let $\nu_{n}(t) \equiv h_{n}^{(d(\ell)+2|\Delta_{\ell}|)/2}\, \omega_{n}(t)$. Next show that:

+∞ 2   ()d()l +22l∆ / ω () ()  Eh ∫ n nn tdat −∞  

+∞ +∞  2   2  ()l ∆ ll ll−−− ll l =  []()∂αKudu() ()   {}[] Fytfytdto(),,() + (1) ∫∫l l −∞  −∞ 

But:

$$\int_{-\infty}^{+\infty} \omega_{n}(x)\, dA_{n}(x) = \frac{1}{n^{1/2}} \sum_{i=1}^{n} \left[\omega_{n}(x_{i}) - \int_{-\infty}^{+\infty} \omega_{n}(x)\, f(x)\, dx\right] = \frac{1}{n^{1/2}} \sum_{i=1}^{n} \left\{\omega_{n}(x_{i}) - E[\omega_{n}(x)]\right\}, \quad \text{so:}$$

$$E\left[\left(\int_{-\infty}^{+\infty} \omega_{n}(t)\, dA_{n}(t)\right)^{2}\right] = \int_{-\infty}^{+\infty} \omega_{n}(t)^{2}\, f(t)\, dt - \left(\int_{-\infty}^{+\infty} \omega_{n}(t)\, f(t)\, dt\right)^{2} + 2\sum_{k=1}^{n-1} \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} \left\{f_{k}(t,s) - f(s)f(t)\right\} \omega_{n}(t)\, \omega_{n}(s)\, dt\, ds$$

−+()l∆ −()l The first of the three terms above is Oh()dd2 , while the other two are n −+()l∆ −()l oh()dd2 . In particular the "time-series" term containing the sum over time lags is n of lower order than the first term. The computations are very similar for all three terms; for example for the first term: 24

$$\int_{-\infty}^{+\infty} \omega_{n}(t)^{2}\, f(t)\, dt = \frac{1}{h_{n}^{2d}} \int_{-\infty}^{+\infty} \left[\int_{-\infty}^{+\infty} \partial^{\Delta_{\ell}}\left\{\alpha_{\ell}[F](x^{\ell}, x^{-\ell})\, K\!\left(\frac{x^{\ell} - t^{\ell}}{h_{n}}, \frac{x^{-\ell} - t^{-\ell}}{h_{n}}\right)\right\}\bigg|_{x^{\ell} = y^{\ell}} dx^{-\ell}\right]^{2} f(t)\, dt$$

$$= \frac{1}{h_{n}^{2d}} \int_{-\infty}^{+\infty} \left[\int_{-\infty}^{+\infty} \sum_{\Theta_{\ell} \le \Delta_{\ell}} \partial^{\Theta_{\ell}}\left\{\alpha_{\ell}[F](x^{\ell}, x^{-\ell})\right\}\, \partial^{\Delta_{\ell} - \Theta_{\ell}} K\!\left(\frac{x^{\ell} - t^{\ell}}{h_{n}}, \frac{x^{-\ell} - t^{-\ell}}{h_{n}}\right)\bigg|_{x^{\ell} = y^{\ell}} dx^{-\ell}\right]^{2} f(t)\, dt$$

Because
$$\partial^{\Delta_{\ell} - \Theta_{\ell}} K\!\left(\frac{x^{\ell} - t^{\ell}}{h_{n}}, \frac{x^{-\ell} - t^{-\ell}}{h_{n}}\right) = \frac{1}{h_{n}^{|\Delta_{\ell} - \Theta_{\ell}|}} \left(\partial^{\Delta_{\ell} - \Theta_{\ell}} K\right)\!\left(\frac{x^{\ell} - t^{\ell}}{h_{n}}, \frac{x^{-\ell} - t^{-\ell}}{h_{n}}\right),$$
the term of highest order corresponds to $\Theta_{\ell} = 0$, and is therefore given by:

$$\frac{1}{h_{n}^{2d}}\, \frac{1}{h_{n}^{2|\Delta_{\ell}|}} \int_{-\infty}^{+\infty} \left[\int_{-\infty}^{+\infty} \alpha_{\ell}[F](y^{\ell}, x^{-\ell})\, \left(\partial^{\Delta_{\ell}} K\right)\!\left(\frac{y^{\ell} - t^{\ell}}{h_{n}}, \frac{x^{-\ell} - t^{-\ell}}{h_{n}}\right) dx^{-\ell}\right]^{2} f(t)\, dt$$

$$= \frac{1}{h_{n}^{d(\ell)+2|\Delta_{\ell}|}} \left[\int_{-\infty}^{+\infty} \left(\partial^{\Delta_{\ell}} K^{\ell}(u^{\ell})\right)^{2} du^{\ell}\right] \int_{-\infty}^{+\infty} \left\{\alpha_{\ell}[F](y^{\ell}, t^{-\ell})\right\}^{2} f(y^{\ell}, t^{-\ell})\, dt^{-\ell} + o\left(\frac{1}{h_{n}^{d(\ell)+2|\Delta_{\ell}|}}\right)$$

Hence:

$$E\left[\left(\int_{-\infty}^{+\infty} \nu_{n}(t)\, dA_{n}(t)\right)^{2}\right] = \left[\int_{-\infty}^{+\infty} \left(\partial^{\Delta_{\ell}} K^{\ell}(u^{\ell})\right)^{2} du^{\ell}\right] \int_{-\infty}^{+\infty} \left\{\alpha_{\ell}[F](y^{\ell}, t^{-\ell})\right\}^{2} f(y^{\ell}, t^{-\ell})\, dt^{-\ell} + o(1) \equiv V_{\Phi}[F] + o(1)$$

Therefore
$$\int_{-\infty}^{+\infty} \nu_{n}(x)\, dA_{n}(x) = \frac{1}{n^{1/2}} \sum_{i=1}^{n} \left\{\nu_{n}(x_{i}) - E[\nu_{n}(x)]\right\} \xrightarrow{d} N\left(0, V_{\Phi}[F]\right)$$

by the Central Limit Theorem. But the first order Taylor expansion of the functional yields:

$$n^{1/2}\left\{\Phi(\hat{F}_{n}) - \Phi(F)\right\} = \int_{-\infty}^{+\infty} \omega_{n}(t)\, dA_{n}(t) + O_{p}\left(n^{1/2} h_{n}^{r-(q-2)}\right) + O_{p}\left(n^{1/2} \left\|\hat{F}_{n} - F\right\|_{L(\infty,m)}^{2}\right)$$

Suppose that $d(\ell) = d^{*}$ (when there is more than one $\ell$, the only terms that matter are those corresponding to $\ell \in L^{*}$). Then under Assumption A4(r+d*/2, m), $h_{n}^{(d(\ell)+2|\Delta_{\ell}|)/2}\, O_{p}\left(n^{1/2} h_{n}^{r-(q-2)}\right) = o_{p}(1)$ and the remainder term is also $o_{p}(1)$. Therefore:

$$h_{n}^{(d(\ell)+2|\Delta_{\ell}|)/2}\, n^{1/2}\left\{\Phi(\hat{F}_{n}) - \Phi(F)\right\} = \int_{-\infty}^{+\infty} \nu_{n}(t)\, a_{n}(t)\, dt + o_{p}(1)$$

$$\xrightarrow{d} N\left(0, \left[\int_{-\infty}^{+\infty} \left(\partial^{\Delta_{\ell}} K^{\ell}(u^{\ell})\right)^{2} du^{\ell}\right] \int_{-\infty}^{+\infty} \left\{\alpha_{\ell}[F](y^{\ell}, t^{-\ell})\right\}^{2} f(y^{\ell}, t^{-\ell})\, dt^{-\ell}\right)$$

Under Assumption A4(r,m) only, this will be the asymptotic distribution of $h_{n}^{(d(\ell)+2|\Delta_{\ell}|)/2}\, n^{1/2}\left\{\Phi(\hat{F}_{n}) - \Phi(E[\hat{F}_{n}])\right\}$ instead of $h_{n}^{(d(\ell)+2|\Delta_{\ell}|)/2}\, n^{1/2}\left\{\Phi(\hat{F}_{n}) - \Phi(F)\right\}$. Indeed: $\Phi(E[\hat{F}_{n}]) - \Phi(F) = O\left(h_{n}^{r-(q-2)}\right)$. Under A4(r+d*/2, m), this asymptotic bias term, once multiplied by $h_{n}^{(d(\ell)+2|\Delta_{\ell}|)/2}\, n^{1/2}$, is $o(1)$.

The absence of a covariance at the same order $O\left(h_{n}^{-(d(\ell)+2|\Delta_{\ell}|)}\right)$ between terms associated with different $\ell$ yields:

$$h_{n}^{(d^{*}+2(q-2))/2}\, n^{1/2}\left\{\Phi(\hat{F}_{n}) - \Phi(F)\right\} \xrightarrow{d} N\left(0, \sum_{\ell \in L^{*}} \left[\int_{-\infty}^{+\infty} \left(\partial^{\Delta_{\ell}} K^{\ell}(u^{\ell})\right)^{2} du^{\ell}\right] \int_{-\infty}^{+\infty} \left\{\alpha_{\ell}[F](y^{\ell}, t^{-\ell})\right\}^{2} f(y^{\ell}, t^{-\ell})\, dt^{-\ell}\right)$$

If present, the term $B[F](\cdot) \in C^{-q+1}$ only contributes terms of higher order in powers of $h_{n}^{-1}$ and therefore does not change the asymptotic distribution.
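To make the nonstandard rate concrete, here is a small simulation of our own (the parameter choices are illustrative and not from the paper): for the pointwise density $\Phi(F) = f(y)$ with $d = d^{*} = 1$ and $|\Delta_{\ell}| = 0$, so that $q = 2$ in the notation above, the scaling $h_{n}^{(d^{*}+2q-4)/2} n^{1/2}$ reduces to $(n h_{n})^{1/2}$ and the asymptotic variance is $\left[\int K^{2}\right] f(y)$. The sketch checks the variance of the scaled kernel estimator against this limit for iid standard normal data and a Gaussian kernel.

```python
import numpy as np

# Check that Var[(n*h)^{1/2} * (fhat(y) - f(y))] is close to f(y) * int K^2
# for the kernel density estimator at a point (iid data, Gaussian kernel K).
rng = np.random.default_rng(0)
n, y, n_rep = 4000, 0.0, 500
h = n ** (-1 / 5)                      # illustrative bandwidth choice

def kde_at(x, data, h):
    """Gaussian-kernel density estimate at the point x."""
    u = (x - data) / h
    return np.exp(-0.5 * u ** 2).sum() / (len(data) * h * np.sqrt(2 * np.pi))

f_y = 1 / np.sqrt(2 * np.pi)           # true N(0,1) density at y = 0
stats = [np.sqrt(n * h) * (kde_at(y, rng.standard_normal(n), h) - f_y)
         for _ in range(n_rep)]
v_mc = np.var(stats)

v_theory = f_y / (2 * np.sqrt(np.pi))  # f(y) * int K^2 for the Gaussian kernel
print(v_theory, v_mc)
```

At this moderate sample size the simulated variance typically sits somewhat below the limit, since finite-sample correction terms vanish only as $h_{n} \to 0$; this slow, bandwidth-driven convergence is exactly what the rate results describe.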

Estimation of the Asymptotic Variance: The additional technical regularity condition is A5: Let $V_{i} \equiv V[F]_{i} \equiv \varphi[F](x_{i})$ and $\hat{V}_{i} \equiv V[\hat{F}_{n}]_{i} \equiv \varphi[\hat{F}_{n}](x_{i})$. Assume:

(1) $E\left[\left|V_{i}\right|^{3+\delta}\right] < \infty$ where $\delta > (3+\varepsilon)(3/2+\varepsilon)$ for some $\varepsilon > 0$;

(2) $E\left[\sup_{G \in N} \left|V[G]_{i}\right|^{2}\right] < \infty$ where $N$ is a neighborhood of the true cdf $F$;

(3) $E\left[\left|V[G]_{i} - V[H]_{i}\right|^{2}\right] \le C \left\|G - H\right\|_{L(\infty,m)}^{2}$ for $G$ and $H$ in $N$.
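Under A5, the natural variance estimator replaces $V_{i} = \varphi[F](x_{i})$ by $\hat{V}_{i} = \varphi[\hat{F}_{n}](x_{i})$ and truncates the autocovariance sum in the expression for $V_{\Phi}[F]$. A minimal sketch of this estimator (our construction; the function name and the Bartlett weighting, as in Newey and West [1987], are our choices, not the paper's):

```python
import numpy as np

def longrun_variance(vhat, q_n):
    """Plug-in estimate of V_Phi: ghat_0 + 2*sum_{k=1}^{q_n} w_k * ghat_k,
    with Bartlett weights w_k = 1 - k/(q_n+1) to keep the estimate >= 0."""
    v = np.asarray(vhat, dtype=float)
    v = v - v.mean()                         # center the influence series
    n = len(v)
    out = v @ v / n                          # ghat_0
    for k in range(1, q_n + 1):
        out += 2 * (1 - k / (q_n + 1)) * (v[k:] @ v[:-k]) / n
    return out

# Usage on a simulated AR(1) influence series with rho = 0.5, for which the
# true long-run variance is sig2_x*(1+rho)/(1-rho) = 4.
rng = np.random.default_rng(2)
rho, n = 0.5, 20000
v = np.empty(n)
v[0] = rng.standard_normal() / np.sqrt(1 - rho ** 2)
for t in range(1, n):
    v[t] = rho * v[t - 1] + rng.standard_normal()
vhat_phi = longrun_variance(v, q_n=int(n ** (1 / 3)))
print(vhat_phi)
```

For consistency the truncation lag must grow with the sample size, but slowly; $q_{n} \propto n^{1/3}$ is a common choice for the Bartlett kernel.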