
Nonparametric Methods

Franco Peracchi
University of Rome “Tor Vergata” and EIEF
January 2011

Contents

1 Density estimators
   1.1 Empirical densities
   1.2 The kernel method
   1.3 Statistical properties of the kernel method
   1.4 Other methods for univariate density estimation
   1.5 Multivariate density estimators
   1.6 Stata commands

2 Linear nonparametric regression estimators
   2.1 Polynomial regressions
   2.2 Regression splines
   2.3 The kernel method
   2.4 The nearest neighbor method
   2.5 Cubic smoothing splines
   2.6 Statistical properties of linear smoothers
   2.7 Average derivatives
   2.8 Methods for high-dimensional data
   2.9 Stata commands

3 Distribution function and quantile function estimators
   3.1 The distribution function and the quantile function
   3.2 The empirical distribution function
   3.3 The empirical quantile function
   3.4 Conditional df and qf
   3.5 Estimating the conditional qf
   3.6 Estimating the conditional df
   3.7 Relationships between the two approaches
   3.8 Stata commands

1 Density estimators

Let the data Z_1,...,Z_n be a sample from the distribution of a random vector Z. We are interested in the general problem of estimating the distribution of Z nonparametrically, that is, without restricting it to belong to a known parametric family.

We begin with the problem of how to estimate nonparametrically the density function of Z.

Although the distribution function (df) and the density function are equivalent ways of representing the distribution of Z, there may be advantages in analyzing a density:

• The graph of a density may be easier to interpret if one is interested in aspects such as symmetry or multimodality.

• Estimates of certain population parameters, such as the mode, are more easily obtained from an estimate of the density.

Use of nonparametric density estimates

Nonparametric density estimates may be used for:

• Exploratory data analysis.

• Estimating qualitative features of a distribution (e.g. unimodality, skewness, etc.).

• Specification and testing of parametric models.

• If Z = (X, Y), they may be used to construct a nonparametric estimate of the conditional mean function (CMF) of Y given X,

  μ(x) = E(Y | X = x) = ∫ y f(x, y)/f_X(x) dy,

  where f_X(x) = ∫ f(x, y) dy. In fact, given an estimate f̂(x, y) of the joint density f(x, y) of (X, Y), the analogy principle suggests estimating μ(x) by

  μ̂(x) = ∫ y f̂(x, y)/f̂_X(x) dy,   (1)

  where f̂_X(x) = ∫ f̂(x, y) dy.

• More generally, many statistical problems involve estimation of a population parameter that can be represented as a statistical functional T(f) of the population density f. In all these cases, if f̂ is a reasonable estimate of f, then a reasonable estimate of θ = T(f) is θ̂ = T(f̂).

1.1 Empirical densities

We begin with the simplest problem of estimating univariate densities, and discuss the problem of estimating multivariate densities later in Section 1.5.

Thus, let Z_1,...,Z_n be a sample from the distribution of a univariate rv Z with df F and density function f. What methods are appropriate for estimating f depends on the nature of the rv Z.

Discrete Z

If Z is discrete with a distribution that assigns positive probability mass to the points z_1, z_2, ..., then the probability f_j = Pr{Z = z_j} may be estimated by

  f̂_j = n^{−1} Σ_{i=1}^n 1{Z_i = z_j},   j = 1, 2, ...,

called the empirical probability that Z = z_j.

This is just the sample average of n iid binary rv's X_1,...,X_n, where X_i = 1{Z_i = z_j} has a Bernoulli distribution with mean f_j and variance f_j(1 − f_j). Thus, f̂_j is unbiased for f_j and its sampling variance is

  Var f̂_j = f_j(1 − f_j)/n.

It is easy to verify that f̂_j is √n-consistent for f_j and asymptotically normal, with asymptotic variance AV(f̂_j) = f_j(1 − f_j), which can be estimated consistently by ÂV(f̂_j) = f̂_j(1 − f̂_j).

Notice that a symmetric asymptotic confidence interval for f_j, e.g.

  CI_.95(f_j) = f̂_j ± 1.96 √(f̂_j(1 − f̂_j)/n),

is problematic because it may contain values less than zero or greater than one.

Continuous Z

The previous approach breaks down when Z is continuous. Notice however that, if z is any point in a sufficiently small interval (a, b] then, by the definition of probability density,

  Pr{a < Z ≤ b} = F(b) − F(a) ≈ (b − a) f(z).

This suggests estimating f(z) by the empirical density

  f̂(z) = (1/(n(b − a))) Σ_{i=1}^n 1{a < Z_i ≤ b}.   (2)

The numerator of (2) is the sum of n iid rv's X_i = 1{a < Z_i ≤ b}, each with mean

  E X_i = F(b) − F(a)

and variance

  Var X_i = [F(b) − F(a)][1 − F(b) + F(a)].

Thus,

  E f̂(z) = [F(b) − F(a)]/(b − a),

so f̂(z) is generally biased for f(z), although its bias is negligible if b − a is sufficiently small. Further,

  Var f̂(z) = [F(b) − F(a)][1 − F(b) + F(a)] / [n(b − a)²]
            = (1/(n(b − a))) [F(b) − F(a)]/(b − a) − (1/n) {[F(b) − F(a)]/(b − a)}²,

which shows that, unless n(b − a) → ∞, letting b − a → 0 reduces the bias of f̂(z) but also increases its sampling variance.

This trade-off between bias (systematic error) and sampling variance (ran- dom error) is typical of nonparametric estimation problems with n fixed.

The histogram estimator

A histogram is a classical example of empirical density. Let (a_0, b_0] be an interval that contains the range of the data and partition (a_0, b_0] into J bins (c_{j−1}, c_j] of equal width h = (b_0 − a_0)/J, where

  c_j = c_{j−1} + h = a_0 + jh,   j = 1, ..., J − 1,

with c_0 = a_0 and c_J = b_0. The ith sample point belongs to the jth bin if c_{j−1} < Z_i ≤ c_j.

A histogram estimator of f(z) is the fraction of observations falling in the bin containing z divided by the bin width h. Formally, if (c_{j−1}, c_j] is the bin containing z, then the histogram estimator of f(z) is

  f̃(z) = (1/(nh)) Σ_{i=1}^n 1{c_{j−1} < Z_i ≤ c_j}.

This is also the estimator of f(z′) for any other point z′ such that c_{j−1} < z′ ≤ c_j.

It is easy to verify that the function f̃ is a proper density, that is, f̃(z) ≥ 0 and ∫ f̃(z) dz = 1.
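As an illustration (not part of the original notes), here is a minimal Python/NumPy sketch of the histogram estimator f̃(z); the function name, the mixture example, and the bin settings are assumptions chosen to mirror Figure 1.

```python
import numpy as np

def histogram_density(data, z, a0, b0, J):
    """Histogram estimator of f(z): fraction of observations in the bin
    containing z, divided by the bin width h = (b0 - a0) / J."""
    h = (b0 - a0) / J
    edges = a0 + h * np.arange(J + 1)                      # c_0, ..., c_J
    # index j of the bin (c_{j-1}, c_j] that contains z
    j = min(max(int(np.ceil((z - a0) / h)), 1), J)
    in_bin = (data > edges[j - 1]) & (data <= edges[j])
    return in_bin.mean() / h

# example: 200 draws from the mixture .6*N(0,1) + .4*N(4,4) used in Figure 1
rng = np.random.default_rng(0)
mix = np.where(rng.random(200) < 0.6, rng.normal(0, 1, 200), rng.normal(4, 2, 200))
print(histogram_density(mix, z=0.0, a0=-5, b0=10, J=25))
```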

Figure 1: Histogram estimates of density. 200 observations from the Gaussian mixture .6 N(0, 1) + .4 N(4, 4); the panels show the true density and histogram estimates with 5, 25, and 50 bins.

Drawbacks of the histogram estimator

Although widely used, the histogram estimator has several drawbacks:

• The results depend on the choice of the range (a0,b0].

• Given (a_0, b_0], the results also depend on the number of bins J or, equivalently, on the bin width h. For example, given the data, increasing J (reducing h) tends to give a histogram that is only informative about the location of the distinct sample points. By contrast, reducing J (increasing h) eventually leads to a completely uninformative rectangle. It is intuitively clear, however, that J may safely be increased if the sample size n also increases.

• The fact that the bin width h is kept fixed over the range of the data may lead to loss of detail where the data are most concentrated. If h is reduced to deal with this problem, then spurious noise may appear in regions where data are sparse.

• The histogram is a step function with jumps at the end of each bin. Thus, it is impossible to incorporate prior information on the degree of smoothness of a density.

• The use of this method may also create difficulties when estimates of the derivatives of the density are needed as input to other statistical procedures.

Sampling properties of the histogram estimator

Let Z_1,...,Z_n be a sample from a continuous distribution on the interval (0, 1], with df F and density f. Put a_0 = 0, b_0 = 1, and partition the interval (0, 1] into J bins of equal width h = 1/J. If X_j denotes the number of observations falling in the jth bin and z is any point in the jth bin, then the histogram estimator of f(z) is f̃(z) = X_j/(nh).

The random J-vector (X_1,...,X_J) has a multinomial distribution with index n and parameter π = (π_1,...,π_J), where

  π_j = F(c_j) − F(c_{j−1}) = F(c_j) − F(c_j − h),   j = 1, ..., J.

Hence

  Pr{X_1 = x_1, ..., X_J = x_J} = [n!/(x_1! ··· x_J!)] ∏_{j=1}^J π_j^{x_j},

where x_j = 0, 1, ..., n. By the properties of the multinomial distribution, X_j has mean equal to nπ_j and variance equal to nπ_j(1 − π_j). Hence

  E f̃(z) = π_j/h,   Var f̃(z) = π_j(1 − π_j)/(nh²).

Thus, the histogram estimator is biased for f(z) in general and its bias is just the error made in approximating f(z) by π_j/h.

Now let the number of bins J = J_n increase with the sample size n or, equivalently, the bin width h_n = 1/J_n decrease with n. In this case

  [F(z) − F(z − h_n)]/h_n → f(z),

provided that h_n → 0 as n → ∞. Let {(a_{j_n−1}, a_{j_n}]} denote the sequence of bins that contain the point z and let π_{j_n} = F(a_{j_n}) − F(a_{j_n−1}). Then

  E f̃(z) = π_{j_n}/h_n → f(z)

as h_n → 0 with n → ∞. Further,

  Var f̃(z) = (1/(nh_n)) (π_{j_n}/h_n) − (1/n) (π_{j_n}/h_n)².

Thus, for f̃(z) to be consistent, we need not only that h_n → 0 as n → ∞, but also that nh_n → ∞ or, equivalently, that J_n/n → 0 as n → ∞. That is, h_n must go to zero or, equivalently, J must increase with n, but not too fast.

1.2 The kernel method

The histogram method is a useful tool for exploratory data analysis, but has some undesirable features, such as:

• the need to choose a partition of the range of Z into cells,

• the fact that the resulting estimates of f are not smooth.

We now present a related method that tries to overcome these two problems.

Consider the empirical density (2). Putting a = z − h and b = z + h, where h is a small positive number, gives

  f̃(z) = (1/(2nh)) Σ_{i=1}^n 1{z − h < Z_i ≤ z + h}.

This is just the fraction of sample points falling in the interval (z − h, z + h] divided by the length 2h of the interval.

Notice that z − h < Z_i ≤ z + h if and only if z belongs to an interval of width 2h around Z_i. Thus, f̃(z) may be constructed as follows:

1. Place a “box” of width equal to 2h, height equal to (2nh)^{−1}, and area equal to n^{−1} around each observation Z_i.

2. Add up the heights of the “boxes” that contain the point z; the result is equal to f˜(z).

If we constructed a histogram of constant cell width 2h having z as the center of one of its cells, then f̃(z) would coincide with the histogram estimate.

An advantage over the histogram method is that there is no need to partition the range of Z into cells. However, f˜(z) still has the following drawbacks:

• the dependence of the estimate on the constant h,

• the fact that f˜ is a step function with jumps at the points z = Zi ± h.

Smooth kernel density estimates

It is simple to modify f̃ in order to obtain estimates of f that are smooth. Notice that f̃(z) may also be written

  f̃(z) = (1/(nh)) Σ_{i=1}^n w((z − Z_i)/h),

where

  w(u) = 1/2 if −1 ≤ u < 1, and 0 otherwise,   (3)

is a symmetric bounded non-negative function that integrates to one and corresponds to the density of a uniform distribution on the interval [−1, 1]. Thus, f̃ is not smooth because it is a sum of step functions. Since the sum of smooth functions is itself smooth, replacing w by a smooth function K gives an estimate of f which inherits all the continuity and differentiability properties of K. This leads to the class of estimates of f(z) of the form

  f̂(z) = (1/(nh)) Σ_{i=1}^n K((z − Z_i)/h),

where K is a bounded function called the kernel and h is a positive constant called the bandwidth. An estimate in this class is called a kernel estimate of f(z).

Remarks:

• If K integrates to one, then f̂ also integrates to one; if K is a proper density then so is f̂; and if K is differentiable up to a given order, then so is f̂.

• The bandwidth h controls the degree of smoothness or regularity of a density estimate. Small values of h tend to produce estimates that are irregular, while large values of h correspond to very smooth estimates.

• The fact that the bandwidth is independent of the point where the density is estimated is a nuisance, for it tends to produce spurious effects in regions where data are sparse. If a large enough bandwidth is chosen to eliminate this phenomenon, we may end up losing important detail in regions where the data are more concentrated.
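The following minimal Python sketch (an addition, not from the notes) implements the kernel estimate f̂(z) = (1/(nh)) Σ_i K((z − Z_i)/h) for the uniform kernel (3) and for a Gaussian kernel; the function names, the evaluation grid, and the bandwidth are illustrative assumptions.

```python
import numpy as np

def uniform_kernel(u):
    # w(u) = 1/2 on [-1, 1), 0 otherwise (equation (3))
    return 0.5 * ((u >= -1) & (u < 1))

def gaussian_kernel(u):
    # standard normal density
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kernel_density(data, grid, h, kernel=gaussian_kernel):
    """fhat(z) = (nh)^{-1} sum_i K((z - Z_i)/h), evaluated on a grid of points."""
    u = (grid[:, None] - data[None, :]) / h      # (#grid, n) matrix of arguments
    return kernel(u).sum(axis=1) / (len(data) * h)

rng = np.random.default_rng(0)
z = rng.normal(size=200)
grid = np.linspace(-4, 4, 81)
fhat = kernel_density(z, grid, h=0.5)            # swap in uniform_kernel to compare
```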

Figure 2: Uniform kernel density estimates. 200 observations from .6 N(0, 1) + .4 N(4, 4); the panels compare the estimated and true densities for bandwidths h = .5, 1, and 1.5.

Example 1 If K = φ, where φ denotes the density of the N(0, 1) distribution, then the kernel estimate of f is continuous and has continuous derivatives of every order.

Such an estimate may be viewed as the average of n Gaussian densities, each centered about one of the observations and with common variance equal to the squared bandwidth h2.

Unlike the uniform kernel (3), which takes a constant positive value in the interval [Z_i − h, Z_i + h) and vanishes outside this interval, the Gaussian kernel is always positive, assumes its maximum when z = Z_i and tends to zero as |z − Z_i| → ∞.

Hence, while the uniform kernel estimate of f(z) is based only on the observations that are within distance h from the evaluation point z and assigns them a constant weight, the Gaussian kernel estimate is based on all the observations but assigns them a weight that declines exponentially as the distance from the evaluation point increases.

Figure 3: Gaussian kernel density estimates. 200 observations from .6 N(0, 1) + .4 N(4, 4); the panels compare the estimated and true densities for bandwidths h = .25, .50, and 1.00.

Extensions

Given a kernel estimator f̂ of f, one may easily estimate other aspects of the distribution of Z.

Example 2 A natural estimator of the mode of Z is the mode of f̂.

Example 3 If Z_1,...,Z_n is a sample from the distribution of Z, then the df of Z,

  F(z) = ∫_{−∞}^z f(u) du,

may be estimated nonparametrically by

  F̂(z) = ∫_{−∞}^z f̂(u) du = n^{−1} Σ_{i=1}^n K̄((z − Z_i)/h),

where K̄(u) = ∫_{−∞}^u K(v) dv. If K corresponds to a proper density, then K̄ is the associated df, and so F̂ is itself a proper df.

Example 4 An alternative characterization of the distribution of a continuous non-negative rv Z is the hazard function

  λ(z) = lim_{δ→0+} Pr{z < Z ≤ z + δ | Z > z}/δ = f(z)/(1 − F(z)),

where F is the df of Z. If Z_1,...,Z_n is a sample from the distribution of Z, then a nonparametric estimator of λ(z) is

  λ̂(z) = f̂(z)/(1 − F̂(z)),

where f̂ and F̂ are nonparametric estimators of f and F, respectively.
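A minimal Python sketch (not in the original notes) of Examples 3 and 4 with a Gaussian kernel, so that K̄ is the standard normal df; SciPy's erf function and the exponential example data are assumptions used only for illustration.

```python
import numpy as np
from scipy.special import erf

def gaussian_kernel_df(u):
    # Kbar(u): df of the standard Gaussian kernel
    return 0.5 * (1.0 + erf(u / np.sqrt(2.0)))

def kernel_df(data, z, h):
    """Fhat(z) = n^{-1} sum_i Kbar((z - Z_i)/h)  (Example 3)."""
    return gaussian_kernel_df((z - data) / h).mean()

def kernel_hazard(data, z, h):
    """lambdahat(z) = fhat(z) / (1 - Fhat(z))  (Example 4)."""
    u = (z - data) / h
    fhat = (np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)).mean() / h
    return fhat / (1.0 - kernel_df(data, z, h))

rng = np.random.default_rng(0)
durations = rng.exponential(scale=2.0, size=500)   # non-negative rv
print(kernel_df(durations, z=2.0, h=0.4), kernel_hazard(durations, z=2.0, h=0.4))
```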

1.3 Statistical properties of the kernel method

In evaluating the statistical accuracy of a density estimator f̂, it is important to distinguish between local and global properties:

• the first refer to the accuracy of fˆ(z) as an estimator of f(z), the value of f at a given point z,

• the second refer to the degree of statistical “closeness” between the two functions fˆ and f.

To stress its dependence on the bandwidth h, a kernel density estimator will henceforth be denoted by f̂_h.

Local properties

The kernel density estimator of f at a point z may be written as the sample average

  f̂_h(z) = n^{−1} Σ_{i=1}^n K_i(z),

where

  K_i(z) = (1/h) K((z − Z_i)/h)

is a nonlinear transformation of the Z_i. A natural measure of local accuracy is therefore its mean squared error (MSE)

  MSE[f̂_h(z)] = E[f̂_h(z) − f(z)]² = Var f̂_h(z) + [Bias f̂_h(z)]².

If the data Z_1,...,Z_n are a sample from the distribution of a continuous (univariate) rv Z, then

  E f̂_h(z) = (1/h) E K((z − Z)/h),   Var f̂_h(z) = (1/(nh²)) Var K((z − Z)/h).

Thus, the estimator f̂_h(z) is biased for f(z) in general. For h fixed, its bias does not depend on the sample size n, whereas its sampling variance tends to zero as n increases.

By imposing the following additional assumptions about the density f and the kernel K we are able to study in more detail the sampling properties of f̂_h(z).

Assumption 1 (i) The density f is twice continuously differentiable. (ii) The kernel K satisfies

  ∫ K(u) du = 1,   ∫ u K(u) du = 0,   0 < ∫ u² K(u) du < ∞.

If K is a nonnegative function, then Assumption 1 requires K to be the density of some probability distribution with zero mean and finite positive variance, e.g. the uniform or the Gaussian.

Bias

The bias of f̂_h(z) is

  Bias f̂_h(z) = E f̂_h(z) − f(z) = (1/h) E K((z − Z)/h) − f(z).   (4)

After a change of variable from x to u = (z − x)/h we have

  Bias f̂_h(z) = (1/h) ∫ K((z − x)/h) f(x) dx − f(z) = ∫ K(u)[f(z − hu) − f(z)] du.

Because f is twice differentiable (Assumption 1), a 2nd order Taylor expansion of f(z − hu) about h = 0 gives

  f(z − hu) − f(z) = −hu f′(z) + (1/2) h²u² f″(z) + O(h³).

If h is sufficiently small, then

  Bias f̂_h(z) ≈ −h f′(z) ∫ u K(u) du + (1/2) m₂ h² f″(z) = (1/2) m₂ h² f″(z),   (5)

where m₂ = ∫ u² K(u) du and we used the fact that K has mean zero.

Thus, for h sufficiently small, the bias of f̂_h(z) is O(h²) and depends on:

• the degree of curvature f″(z) of the density at the point z,

• the amount of smoothing of the data through the bandwidth h and the spread m₂ of the kernel.

In particular:

• Bias f̂_h(z) ≈ 0 when f″(z) = 0, that is, when the density is linear in a neighborhood of z.

• Bias f̂_h(z) → 0 as h → 0.

The bias of f̂_h(z) depends indirectly on the sample size n if the bandwidth h decreases with n.

Variance

First notice that

  Var f̂_h(z) = (1/(nh²)) Var K((z − Z)/h)
             = (1/(nh²)) E[K((z − Z)/h)²] − (1/(nh²)) [E K((z − Z)/h)]².

From (4) and the fact that Bias f̂_h(z) = O(h²) we have

  E K((z − Z)/h) = h[f(z) + Bias f̂_h(z)] = h[f(z) + O(h²)].

Hence, after a change of variable from x to u = (z − x)/h,

  Var f̂_h(z) = (1/(nh²)) ∫ K((z − x)/h)² f(x) dx − (1/n)[f(z) + Bias f̂_h(z)]²
             = (1/(nh)) ∫ K(u)² f(z − hu) du − (1/n)[f(z) + O(h²)]².

Taking a 1st order Taylor expansion of f(z − hu) around h = 0 gives

  Var f̂_h(z) ≈ (1/(nh)) ∫ K(u)² [f(z) − hu f′(z)] du + O(n⁻¹)
             = (f(z)/(nh)) ∫ K(u)² du − (f′(z)/n) ∫ u K(u)² du + O(n⁻¹).

Separating the term of order O((nh)⁻¹) from that of order O(n⁻¹) we have that, for h sufficiently small,

  Var f̂_h(z) ≈ (f(z)/(nh)) ∫ K(u)² du + O(n⁻¹).

Remarks

• If the sample size is fixed, increasing the bandwidth reduces the sampling variability of f̂_h(z) but, from (11), it also increases its bias.

• If the sample size increases and smaller bandwidths are chosen for larger n, then both the bias and the sampling variance of f̂_h(z) may be reduced.

• Consistency of f̂_h(z) for f(z) requires not only h → 0 as n → ∞, but also nh → ∞.

To conclude, in large samples and for h sufficiently small,

  MSE f̂_h(z) ≈ (f(z)/(nh)) ∫ K(u)² du + (1/4) m₂² h⁴ f″(z)².   (6)

Global properties

Common measures of distance between two functions f and g are:

• the L₁ distance

  d₁(f, g) = ∫_{−∞}^{∞} |f(z) − g(z)| dz,

• the L₂ distance

  d₂(f, g) = [∫_{−∞}^{∞} (f(z) − g(z))² dz]^{1/2},

• the L∞ or uniform distance

  d∞(f, g) = sup_{−∞<z<∞} |f(z) − g(z)|.

In the case of the L₂ distance, a global measure of performance of f̂_h as an estimator of f is the mean integrated squared error (MISE)

  MISE(f̂_h) = E d₂(f̂_h, f)² = E ∫_{−∞}^{∞} [f̂_h(z) − f(z)]² dz,

where the expectation is with respect to the joint distribution of Z_1,...,Z_n. Under appropriate regularity conditions

  MISE(f̂_h) = ∫_{−∞}^{∞} E[f̂_h(z) − f(z)]² dz = ∫_{−∞}^{∞} MSE[f̂_h(z)] dz,

that is, the MISE is equal to the integrated MSE. Hence, in large samples and for h sufficiently small, one may approximate the MISE by integrating (6) under the further assumption that the second derivative of f satisfies R(f) = ∫ f″(z)² dz < ∞. Our global measure of performance is therefore

  MISE(f̂_h) ≈ (1/(nh)) ∫ K(u)² du + (1/4) m₂² h⁴ R(f).   (7)

Optimal choice of bandwidth

For a fixed sample size n and a fixed kernel function K, an optimal bandwidth may be chosen by minimizing MISE(f̂_h) with respect to h. As an approximation to this problem, consider minimizing the right-hand side of (7) with respect to h:

  min_h (1/(nh)) ∫ K(u)² du + (1/4) m₂² h⁴ R(f).

The first-order condition for this problem is

  0 = −(1/(nh²)) ∫ K(u)² du + m₂² h³ R(f).

Assuming that 0 < R(f) < ∞ and solving for h gives the optimal bandwidth

  h* = [∫ K(u)² du / (m₂² R(f))]^{1/5} n^{−1/5}.   (8)

Remarks:

• h* converges to zero as n → ∞, but at the rather slow rate of n^{−1/5}.

• Smaller values of h* are appropriate for densities that are more wiggly, that is, such that R(f) is high, or kernels that are more spread out, that is, such that m₂ is high.

Example 5 If f is a N(μ, σ²) density and K is a standard Gaussian kernel, then

  R(f) = σ⁻⁵ ∫ φ″(z)² dz = [3/(8√π)] σ⁻⁵

and

  ∫ K(u)² du = 1/(2√π).

The optimal bandwidth in this case is

  h* = [(1/(2√π)) (8√π/3)]^{1/5} σ n^{−1/5} = (4/3)^{1/5} σ n^{−1/5} ≈ 1.059 σ n^{−1/5}.
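The rule-of-thumb bandwidth of Example 5 is easy to compute; the sketch below is an illustrative Python version (not from the notes), with σ estimated by the sample standard deviation.

```python
import numpy as np

def silverman_bandwidth(data):
    """Rule-of-thumb bandwidth h = (4/3)^{1/5} * sigma_hat * n^{-1/5} ~ 1.059 sigma_hat n^{-1/5}
    (Example 5), exact for Gaussian data when a Gaussian kernel is used."""
    n = len(data)
    sigma_hat = data.std(ddof=1)
    return (4.0 / 3.0) ** 0.2 * sigma_hat * n ** (-0.2)

rng = np.random.default_rng(0)
print(silverman_bandwidth(rng.normal(size=200)))
```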

Optimal choice of kernel function

Substituting the optimal bandwidth h* into (7) gives

  MISE(f̂_{h*}) ≈ (5/4) C(K) R(f)^{1/5} n^{−4/5},

where

  C(K) = m₂^{2/5} [∫ K(u)² du]^{4/5}.

Given the optimal bandwidth, the MISE depends on the choice of kernel function only through the term C(K).

If we confine attention to kernels that correspond to densities of distributions with mean zero and unit variance, then an optimal kernel K* may be obtained by minimizing

  C(K) = [∫ K(u)² du]^{4/5}

under the side conditions

  K(u) ≥ 0 for all u,   ∫ K(u) du = 1,   ∫ u K(u) du = 0,   ∫ u² K(u) du = 1.

The resulting optimal kernel is

  K*(u) = (3/4)(1 − u²) 1{|u| ≤ 1},

called the Epanechnikov kernel.

The efficiency loss from using suboptimal kernels, however, is modest. For example, using the uniform kernel w only implies an efficiency loss of about 7% with respect to K*. Thus, “it is perfectly legitimate, and indeed desirable, to base the choice of kernels on other considerations, for example the degree of differentiability or the computational effort involved” (Silverman 1986, p. 43).

It turns out that the crucial choice is not what kernel to use, but rather how much to smooth. This choice partly depends on the purpose of the analysis.

Figure 4: The uniform, Gaussian and Epanechnikov kernels, plotted against u on [−2, 2].

Automatic bandwidth selection

It is often convenient to be able to rely on some procedure for choosing the bandwidth automatically rather than subjectively. This is especially important when a density estimate is used as an input to other statistical procedures.

• The simplest approach consists of choosing h = 1.059 σ̂ n^{−1/5}, where σ̂ is some estimate of the standard deviation of the data. This approach works reasonably well for Gaussian kernels and data that are not too far from Gaussian.

• A second approach is based on formula (8) and chooses

  h = [∫ K(u)² du / (m₂² R(f̃))]^{1/5} n^{−1/5},

where f˜ is a preliminary kernel estimate of f based on some initial bandwidth choice.

• A third approach starts from the observation that the MISE of a kernel density estimator fˆh, based on a given kernel K, may be decomposed as

  MISE(f̂_h) = E ∫_{−∞}^{∞} [f̂_h(z)² + f(z)² − 2 f̂_h(z) f(z)] dz = M(h) + ∫ f(z)² dz,

where

  M(h) = E ∫ f̂_h(z)² dz − 2 E ∫ f̂_h(z) f(z) dz.

Hence, the MISE of fˆh depends on the choice of the bandwidth only through the term M(h). Thus, minimizing the MISE with respect to h is equivalent to minimizing the function M with respect to h. Because such a function is unknown, this approach suggests minimizing with respect to h an unbiased estimator of the function M.

Cross-validation

To construct an unbiased estimator of M(h), notice first that ∫ f̂_h(z)² dz is unbiased for E ∫ f̂_h(z)² dz.

Consider next the estimator of f(z) obtained by excluding the ith observation,

  f̂_(i)(z) = (1/((n − 1)h)) Σ_{j≠i} K((z − Z_j)/h),   i = 1,...,n.   (9)

If the data are a sample from the distribution of Z, then

  E_n [n^{−1} Σ_{i=1}^n f̂_(i)(Z_i)] = E_n f̂_(i)(Z_i) = E_(i) ∫ f̂_(i)(z) f(z) dz = E_n ∫ f̂_h(z) f(z) dz,

where E_n denotes expectations with respect to the joint distribution of Z_1,...,Z_n, E_(i) denotes expectations with respect to the joint distribution of Z_1,...,Z_{i−1}, Z_{i+1},...,Z_n, and we used the fact that E_n f̂_h(z) = h^{−1} E K((z − Z)/h) does not depend on n.

Thus, an unbiased estimator of M(h) is the cross-validation criterion

  M̂(h) = ∫ f̂_h(z)² dz − (2/n) Σ_{i=1}^n f̂_(i)(Z_i).   (10)

The cross-validation procedure minimizes M̂(h) with respect to h.
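For a Gaussian kernel the integral ∫ f̂_h(z)² dz has a closed form, so the criterion (10) can be evaluated exactly. The following Python sketch (an addition to the notes, valid under that Gaussian-kernel assumption) computes M̂(h) and minimizes it over a grid of bandwidths; the grid and the simulated data are illustrative.

```python
import numpy as np

def lscv_criterion(data, h):
    """Least-squares cross-validation criterion Mhat(h) of equation (10),
    for a Gaussian kernel (the integral of fhat^2 then has a closed form)."""
    n = len(data)
    d = data[:, None] - data[None, :]                        # pairwise differences
    # integral of fhat_h(z)^2 dz
    int_f2 = np.exp(-d**2 / (4 * h**2)).sum() / (2 * np.sqrt(np.pi) * n**2 * h)
    # leave-one-out estimates fhat_(i)(Z_i), equation (9)
    K = np.exp(-d**2 / (2 * h**2)) / np.sqrt(2 * np.pi)
    loo = (K.sum(axis=1) - K[0, 0]) / ((n - 1) * h)          # drop the diagonal term K(0)
    return int_f2 - 2 * loo.mean()

rng = np.random.default_rng(0)
z = rng.normal(size=200)
hs = np.linspace(0.1, 1.0, 19)
h_cv = hs[np.argmin([lscv_criterion(z, h) for h in hs])]     # cross-validated bandwidth
```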

Asymptotic justification for cross-validation

Let

  I_CV(Z_1,...,Z_n) = ∫ [f̂_{h*}(z) − f(z)]² dz

denote the integrated squared error of the density estimate obtained using the bandwidth h* that minimizes the cross-validation criterion M̂(h). Also let

  I*(Z_1,...,Z_n) = min_h ∫ [f̂_h(z) − f(z)]² dz

denote the integrated squared error when h is chosen optimally for the given sample.

Stone (1984) showed that, if f is bounded and some other mild regularity conditions hold, then

  Pr{ lim_{n→∞} I_CV(Z_1,...,Z_n) / I*(Z_1,...,Z_n) = 1 } = 1.

Thus, cross-validation achieves the best possible choice of smoothing parame- ter, in the sense of minimizing the integrated squared error for a given sample.

The main drawback of cross-validation is that the resulting kernel density estimates tend to be highly variable and to undersmooth the data.

Asymptotic properties of kernel density estimators

Let f̂_{n,h}(z) denote the dependence of a kernel density estimate on both the sample size n and the bandwidth h. Because h typically also depends on n, we use f̂_n(z) as a shorthand for f̂_{n,h_n}(z). We now provide sufficient conditions for the sequence {f̂_n(z)} to be consistent for f(z) and asymptotically normal.

We again rely on the fact that, if Z_1,...,Z_n is a sample from the distribution of a continuous rv Z, then f̂_n(z) may be represented as the sample average of n iid rv's,

  f̂_n(z) = n^{−1} Σ_{i=1}^n K_in(z),

where K_in(z) = h_n^{−1} K((z − Z_i)/h_n).

Consistency

Convergence in probability follows immediately from the fact that if the kernel K is the density of a distribution with zero mean and finite positive variance m₂ = ∫ u² K(u) du then, from our previous results,

  E K_in(z) = f(z) + (1/2) m₂ h_n² f″(z) + O(h_n³)   (11)

and

  Var K_in(z) = (1/h_n) f(z) ∫ K(u)² du − f′(z) ∫ u K(u)² du + O(h_n).   (12)

The only technical problem is the behavior of hn as a function of n.

Theorem 1 Let {Z_i} be a sequence of iid continuous rv's with twice continuously differentiable density f, and suppose that the sequence {h_n} of bandwidths and the kernel function K satisfy:

(i) h_n → 0;

(ii) nh_n → ∞;

(iii) ∫ K(u) du = 1, ∫ u K(u) du = 0, and m₂ = ∫ u² K(u) du is finite and positive.

Then f̂_n(z) →p f(z) for every z in the support of f.

Asymptotic normality

The next theorem could in principle be used to construct approximate confidence intervals for f(z) based on the normal distribution.

Theorem 2 In addition to the assumptions of Theorem 1, suppose that nh_n⁵ → λ, where 0 ≤ λ < ∞. Then

  √(nh_n) [f̂_n(z) − f(z)] ⇒ N( (1/2) √λ b(z), f(z) ∫ K(u)² du )

for every z in the support of f, where b(z) = m₂ f″(z).

Remarks:

• Although consistent, fˆn(z) is generally asymptotically biased for f(z).

• There are three cases when no asymptotic bias arises:

– when h_n is chosen such that λ = 0;

– when f″(z) = 0, that is, the density f is linear at the point z;

– when m2 = 0, in which case the kernel K may assume negative values and therefore the estimate fˆn may fail to be a proper density.

• When λ > 0, the additional assumption in Theorem 2 implies that h_n = O(n^{−1/5}), as for the optimal bandwidth. In this case, letting h_n = c n^{−1/5}, where c > 0, gives √λ = c^{5/2}, √(nh_n) = c^{1/2} n^{2/5}, and therefore

  n^{2/5} [f̂_n(z) − f(z)] ⇒ N( (1/2) c² b(z), c⁻¹ f(z) ∫ K(u)² du ).

Thus, f̂_n(z) converges to its asymptotic normal distribution at the rate of n^{2/5}, which is slower than the rate n^{1/2} achieved in regular parametric problems.

• When λ = 0 (no asymptotic bias), the bandwidth tends to zero at a faster rate than O(n^{−1/5}) but, in this case, the convergence of f̂_n(z) to its asymptotic normal distribution is slower than n^{2/5}.

1.4 Other methods for univariate density estimation

The nearest neighbor method

One of the problems with the kernel method is the fact that the bandwidth is independent of the point at which the density is evaluated. This tends to produce too much smoothing in some regions of the data and too little in others.

For any point of evaluation z, let d₁(z) ≤ d₂(z) ≤ ... ≤ d_n(z) be the distances (arranged in increasing order) between z and each of the n data points. The kth nearest neighbor estimate of f(z) is defined as

  f̂(z) = k / (2n d_k(z)),   k < n.   (13)

The motivation for this estimate is the fact that, if h is sufficiently small, then we would expect a fraction of the observations approximately equal to 2h f(z) to fall in a small interval [z − h, z + h] around z. Since the interval [z − d_k(z), z + d_k(z)] contains by definition exactly k observations, we have

  k/n = 2 d_k(z) f(z).

Solving for f(z) gives the density estimate (13).

Unlike the uniform kernel estimator, which is based on the number of observations falling in a box of fixed width centered at the point z, the nearest neighbor estimator is inversely proportional to the width of the box needed to exactly contain the k observations nearest to z. The smaller is the density of the data around z, the larger is this width.

The number k regulates the degree of smoothness of the estimator, with larger values of k corresponding to smoother estimates. The fraction λ = k/n of sample points in each neighborhood is called the span.
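A minimal Python sketch (not part of the notes) of the kth nearest neighbor estimate (13); the function name and the choice k = 20 are assumptions.

```python
import numpy as np

def knn_density(data, z, k):
    """kth nearest neighbor estimate fhat(z) = k / (2 n d_k(z)), equation (13)."""
    dist = np.sort(np.abs(data - z))
    return k / (2.0 * len(data) * dist[k - 1])   # dist[k-1] is d_k(z)

rng = np.random.default_rng(0)
z = rng.normal(size=200)
print(knn_density(z, 0.0, k=20))
```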

Problems with the nearest neighbor method

• f̂ does not integrate to one and so it is not a proper density.

• Although continuous, f̂ is not smooth because its derivatives are discontinuous at all points of the form (Z_[i] + Z_[i+k])/2.

One way of overcoming the second problem is to notice that f̂(z) may alternatively be represented as

  f̂(z) = (1/(n d_k(z))) Σ_{i=1}^n w((z − Z_i)/d_k(z)),

where w is the uniform kernel. Thus, the nearest neighbor estimator may be regarded as a uniform kernel density estimator with a varying bandwidth d_k(z).

The lack of smoothness of the nearest neighbor estimator may therefore be overcome by considering a generalized nearest neighbor estimate of the form

  f̃(z) = (1/(n d_k(z))) Σ_{i=1}^n K((z − Z_i)/d_k(z)).

If K is a smooth kernel and the bandwidth d_k(z) is continuously differentiable, then f̃ is a smooth estimator of f.

Nonparametric ML

The log-likelihood of a sample Z_1,...,Z_n from a distribution with strictly positive density f₀ is defined as

  L(f₀) = c + Σ_{i=1}^n ln f₀(Z_i),

where c is an arbitrary constant. One may then think of estimating f₀ by

  f̂ = argmax_{f∈F} L(f),

where F is some family of densities.

• When F = {f(z; θ), θ ∈ Θ} is a known parametric family of densities, one obtains the classical ML estimator

  f̂(z) = f(z, θ̂),

where θ̂ = argmax_{θ∈Θ} L(θ) and L(θ) = c + Σ_{i=1}^n ln f(Z_i, θ).

• When F includes all strictly positive densities on the real line, one can show that the nonparametric ML estimator is

  f̂(z) = n^{−1} Σ_{i=1}^n δ(z − Z_i),

where

  δ(u) = ∞ if u = 0, and 0 otherwise,

is the Dirac delta function. Thus, f̂ is a function equal to infinity at each of the sample points and equal to zero otherwise (why is this unsurprising?).

Maximum penalized likelihood

Being infinitely irregular, f̂ provides an unsatisfactory solution to the problem of estimating f₀. If one does not want to make parametric assumptions, an alternative is to introduce a penalty for lack of smoothness and then maximize the log likelihood function subject to this constraint.

To quantify the degree of smoothness of a density f consider again the functional R(f) = ∫ f″(z)² dz. If f is wiggly, then R(f) is large. If the support of the distribution is an interval and f is linear or piecewise linear, then R(f) = 0.

One may then consider maximizing the penalized sample log-likelihood

  L_λ(f) = Σ_{i=1}^n ln f(Z_i) − λ R(f),

where λ > 0 represents the trade-off between smoothness and fidelity to the data. A maximum penalized likelihood density estimator f̃ maximizes L_λ over the class of densities with continuous 1st derivative and square integrable 2nd derivative. The smaller is λ, the rougher in terms of R(f̂) is the maximum penalized likelihood estimator.

Advantages of this method:

• It makes explicit two conflicting goals in density estimation:

  – maximizing fidelity to the data, represented here by the term Σ_i ln f(Z_i),

  – avoiding densities that are too wiggly, represented here by the term −λR(f).

• The method places density estimation within the context of a unified approach to curve estimation.

• The method can be given a nice Bayesian interpretation.

Its main disadvantage is that the resulting estimate f̃ is defined only implicitly, as the solution to a maximization problem.

1.5 Multivariate density estimators

Let Z_1,...,Z_n be a sample from the distribution of a random m-vector Z with density function f(z) = f(z_1,...,z_m). How can we estimate f nonparametrically?

Multivariate kernel density estimators

The natural starting point is the following multivariate generalization of the univariate kernel density estimator:

  f̂(z) = f̂(z_1,...,z_m) = (1/(n h_1 ··· h_m)) Σ_{i=1}^n K_m((z_1 − Z_i1)/h_1, ..., (z_m − Z_im)/h_m),   (14)

where K_m: R^m → R is a multivariate kernel.

Example 6 An example of multivariate kernel is the product kernel

  K_m(u_1,...,u_m) = ∏_{j=1}^m K(u_j),

where K is a univariate kernel.

We now consider consistency and asymptotic normality of a special class of multivariate kernel density estimators of the form

  f̂(z) = (1/(n h^m)) Σ_{i=1}^n K_m((z_1 − Z_i1)/h, ..., (z_m − Z_im)/h).

This type of estimator is a special case of (14) obtained by using the same bandwidth h for each component of Z. It may be appropriate when the data have been previously rescaled using some preliminary estimate of scale.

Consistency

Theorem 3 Let {Z_i} be a sequence of iid continuous random m-vectors with a twice continuously differentiable density f, and suppose that the sequence {h_n} of bandwidths and the multivariate kernel K_m: R^m → R satisfy:

(i) h_n → 0;

(ii) n h_n^m → ∞;

(iii) ∫ K_m(u) du = 1, ∫ u K_m(u) du = 0, and ∫ uu′ K_m(u) du = M₂, a finite m × m matrix.

Then f̂_n(z) →p f(z) for every z in the support of f.

Asymptotic normality

Theorem 4 In addition to the assumptions of Theorem 3, suppose that h_n² (n h_n^m)^{1/2} → λ, where 0 ≤ λ < ∞. Then

  (n h_n^m)^{1/2} [f̂_n(z) − f(z)] ⇒ N( (1/2) λ b(z), f(z) ∫ K_m(u)² du )

for every z in the support of f, where b(z) = tr[f″(z) M₂].

Remarks:

• The speed of convergence of f̂_n(z) to its asymptotically normal distribution is inversely related to the dimension m of Z. When m = 1, we have the results in Theorem 2.

• When λ > 0, the additional assumption in Theorem 4 implies that h_n = O(n^{−1/(m+4)}). In this case, putting h_n = c n^{−1/(m+4)}, where c > 0, gives λ = c^{(m+4)/2} and (n h_n^m)^{1/2} = c^{m/2} n^{2/(m+4)}, and therefore

  n^{2/(m+4)} [f̂_n(z) − f(z)] ⇒ N( (1/2) c² b(z), c^{−m} f(z) ∫ K_m(u)² du ).

• When λ = 0, the bandwidth tends to zero at a faster rate than O(n−1/(m+4)) but the convergence of fˆn(z) to its asymptotically normal distribution is slower than the optimal rate.

The curse-of-dimensionality problem

Although conceptually straightforward, extending univariate nonparametric methods to multivariate settings is problematic for at least two reasons:

• How to represent the results of a nonparametric analysis involving two or more variables is a neglected but practically important problem.

• When one-dimensional nonparametric methods are generalized to higher dimensions, their statistical properties deteriorate very rapidly because of the so-called curse-of-dimensionality problem, namely the fact that the volume of data required to maintain a tolerable degree of statis- tical precision grows much faster than the number of variables under examination.

For these reasons, simple generalizations of one-dimensional methods to the case of more than two or three variables tend to produce results that are difficult to represent and are too irregular, unless the size of the available data is very large.

Example 7 Consider a multivariate histogram constructed for a sample from the distribution of a random m-vector Z with a uniform distribution on the m-dimensional unit hypercube, that is, whose components are iid as U(0, 1).

If we partition the unit hypercube into hypercubical cells of side equal to h, then each cell contains on average only a fraction h^m of the data. Assume that at least 30 observations per cell are needed for a tolerably accurate histogram estimate. Then an adequate sample should have a number of observations at least equal to n* = 30 h^{−m}.

The table below shows some calculations for m = 1,...,5 and h = .10 and h = .05.

                            Number m of variables in Z
                      1        2         3          4           5
  h = .10
    number of cells   10       100       1,000      10,000      100,000
    n*                300      3,000     30,000     300,000     3,000,000
  h = .05
    number of cells   20       400       8,000      160,000     3,200,000
    n*                600      12,000    240,000    4,800,000   96,000,000

Leaving aside the nontrivial problem of how it could be represented, a 5-dimensional histogram is likely to be estimated too imprecisely to be of practical use, unless the sample size is truly gigantic.

Projection pursuit

One method for nonparametric estimation of multivariate densities is projection pursuit (PP), introduced by Friedman, Stuetzle and Schroeder (1984).

This method assumes that the density of a random m-vector Z may well be approximated by a density of the form

  f*(z) = f₀(z) ∏_{j=1}^J f_j(α_j′ z),   (15)

where f₀ is a known m-variate density, α_j is a vector with unit norm, α_j′ z = Σ_{h=1}^m α_{jh} z_h is a linear combination or “projection” of the variables in Z, and f_1,...,f_J are smooth univariate functions, called ridge functions.

The choice of f₀ is left to the investigator and should reflect prior information available about the problem.

The estimation algorithm determines the number J of terms and the vectors α_j in (15), and constructs nonparametric estimates of the ridge functions f_1,...,f_J by minimizing a suitably chosen index of goodness-of-fit, or projection index. The curse-of-dimensionality problem is bypassed by using linear projections and nonparametric estimates of univariate ridge functions.

As a by-product of this method, the graphical information provided by the shape of the estimated ridge functions may be useful for exploring and interpreting the multivariate distribution of the data.

The PP method may be regarded as a generalization of the principal components method, where multidimensional data are projected linearly onto subspaces of much smaller dimension, chosen to minimize the variance of the projections.

Semi-nonparametric methods

Gallant and Nychka (1987) proposed to approximate the m-variate density f(z_1,...,z_m) by a Hermite polynomial expansion.

In the bivariate case (m = 2), their approximation is

  f*(z_1, z_2) = (1/ψ_K) τ_K(z_1, z_2)² φ(z_1) φ(z_2),

where

  τ_K(z_1, z_2) = Σ_{h,k=0}^K τ_{hk} z_1^h z_2^k

is a polynomial of order K in z_1 and z_2, and

  ψ_K = ∫∫ τ_K(z_1, z_2)² φ(z_1) φ(z_2) dz_1 dz_2

is a normalization factor to ensure that f* integrates to one.

The class of densities that can be approximated in this way includes densities with arbitrary skewness and kurtosis, but excludes violently oscillatory densities or densities with too fat or too thin tails.

Since the polynomial expansion above is invariant to multiplication of τ = (τ_00, τ_01, ..., τ_KK) by a scalar, some normalization is needed.

After setting τ_00 = 1, expanding the square of the polynomial and rearranging terms gives

  f*(z_1, z_2) = (1/ψ_K) [ Σ_{h,k=0}^{2K} γ_{hk} z_1^h z_2^k ] φ(z_1) φ(z_2),

with

  γ_{hk} = Σ_{r=a_h}^{b_h} Σ_{s=a_k}^{b_k} τ_{rs} τ_{h−r,k−s},

where

  a_h = max(0, h − K),   b_h = min(h, K),
  a_k = max(0, k − K),   b_k = min(k, K).

1.6 Stata commands

We now briefly review the commands for histogram and kernel density estimation available in Stata, version 11.

These include the histogram and the kdensity commands. Both commands estimate univariate densities.

The histogram command

The basic syntax is:

  histogram varname [if] [in] [weight] [, [continuous_opts | discrete_opts] options]

where:

• varname is the name of a continuous variable, unless the discrete option is specified.

• continuous_opts includes:

  – bin(#): sets the number of bins to #,
  – width(#): sets the width of bins to #,
  – start(#): sets the lower limit of the first bin to # (the default is the observed minimum value of varname).

  bin() and width() are alternatives. If neither is specified, results are the same as if bin(k) had been specified, with

    k = min{√n, 10 · ln(n)/ln(10)},

  where n is the (weighted) number of observations.

• discrete_opts includes:

  – discrete: specifies that the data are discrete and you want each unique value of varname to have its own bin (bar of histogram).

• options includes:

  – density: draws as density (the default),
  – fraction: draws as fractions,
  – frequency: draws as frequencies,
  – percent: draws as percentages,
  – addlabels: adds height labels to bars,
  – normal: adds a normal density to the graph,
  – kdensity: adds a kernel density estimate to the graph.

The kdensity command

The basic syntax is:

  kdensity varname [if] [in] [weight] [, options]

where options includes:

• kernel(kernel): specifies the kernel function. The available kernels include epanechnikov (the default), biweight, cosine, gaussian, parzen, rectangle, and triangle.

• bwidth(#): specifies the half-width of the kernel. If not specified, the half-width calculated and used corresponds to h = 1.059 σ̂ n^{−1/5}, which would minimize the MISE if the data were Gaussian and a Gaussian kernel were used.

• generate(newvar_x newvar_d): stores the estimation points in newvar_x and the density estimate in newvar_d.

• n(#): estimates density using # points. The default is min(n, 50), where n is the number of observations in memory.

• at(var x): estimates the density using the values specified by var x.

• nograph: suppresses graph.

• normal: adds a normal density to the graph.

• student(#): adds a Student’s t density with # degrees of freedom to the graph.

2 Linear nonparametric regression estimators

We now assume that the data (X1,Y1),...,(Xn,Yn) are a sample from the joint distribution of (X, Y ), for which the conditional mean function (CMF)

μ(x)=E(Y | X = x) is well defined, and consider how to estimate μ(x) nonparametrically.

This is a more general problem than it may look at first. For example, the problem of nonparametrically estimating the conditional variance function (CVF)

  σ²(x) = Var(Y | X = x) = E(Y² | X = x) − [E(Y | X = x)]²

reduces to the problem of nonparametrically estimating the CMF of Y and of Y². If Y is a binary 0-1 variable, then the problem of nonparametrically estimating its conditional probability function Pr{Y = 1 | X = x} reduces to the problem of estimating the CMF of 1{Y = 1}.

We restrict attention to the class of nonparametric estimators of μ(x) that are linear, that is, of the form

  μ̂(x) = Σ_{j=1}^n S_j(x) Y_j,   (16)

where the weight S_j(x) assigned to Y_j depends only on the X_i's and the evaluation point x, not on the Y_i's.

Linearity of this class of estimators is important because:

• It simplifies considerably the task of evaluating the statistical prop- erties of these estimators.

• It enables one to compare estimates produced by methods that may apparently be quite far from each other.

• It lowers the computational burden relative to nonlinear estimators.

Regression smoothers

It is useful to represent a linear nonparametric regression estimator as a linear regression smoother.

Let Y be the n-vector of observations on the outcome variable and let X be the matrix of n observations on the k covariates. A regression smoother is a way of using Y and X to produce a vector μ̂ = (μ̂_1,...,μ̂_n) of fitted values that is less variable than Y itself.

A regression smoother is linear if it can be represented as μ̂ = SY, where S = [s_ij] is an n × n smoother matrix that depends on X but not on Y. Thus, the class of linear regression smoothers coincides with the class of linear predictors of Y. If μ̂ is a linear regression smoother, then its ith element is a weighted average

  μ̂_i = Σ_{j=1}^n s_ij Y_j

of the elements of Y, and the generic element s_ij of the matrix S represents the weight assigned to the jth element of Y in the construction of μ̂_i.

Linear nonparametric regression estimates are linear smoothers with s_ij = S_j(X_i). A parametric example of a linear smoother is the vector of OLS fitted values μ̂ = Xβ̂ = SY, where β̂ = (X′X)^{−1}X′Y is the OLS estimate and the smoother matrix S = X(X′X)^{−1}X′ is symmetric idempotent.

A smoother matrix is not necessarily symmetric or idempotent. A matrix S is said to preserve the constant if Sι = ι, where ι is a vector of ones. If a smoother matrix S preserves the constant, then

  n^{−1} Σ_{i=1}^n μ̂_i = n^{−1} ι_n′ SY = n^{−1} ι_n′ Y = Ȳ,

where Ȳ is the sample mean of the Y_i. For example, the OLS smoother matrix preserves the constant if the design matrix X contains a column of ones.

We now present other examples of linear regression smoothers. For ease of presentation, we assume that X is a scalar rv.

2.1 Polynomial regressions

A polynomial regression represents a parsimonious and relatively flexible way of approximating an unspecified CMF.

A k-degree polynomial regression approximates μ(x) by a function of the form

  m(x) = β₀ + β₁x + ··· + β_k x^k,   β_k ≠ 0.

A polynomial regression estimated by OLS gives a linear regression smoother defined by a symmetric idempotent smoother matrix.

Polynomials are frequently used because they can be evaluated, integrated, differentiated, etc., very easily. For example, if m(x) is a k-degree polynomial, then

  m′(x) = β₁ + 2β₂x + ··· + kβ_k x^{k−1}

is a (k − 1)-degree polynomial and

  ∫ m(x) dx = α + β₀x + (1/2)β₁x² + ··· + (1/(k + 1))β_k x^{k+1}

is a (k + 1)-degree polynomial.

However, if the CMF is very irregular,evenonsmall regions of the approxi- mation range, then a polynomial approximation tends to be poor everywhere.
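A minimal Python sketch (not from the notes) of a k-degree polynomial regression viewed as a linear smoother with S = X(X′X)^{−1}X′; the simulated data mimic the design E(Y) = 1 − X + exp(−50(X − .5)²) used in Figure 5, and the noise level is an assumption.

```python
import numpy as np

def polynomial_smoother(x, y, degree):
    """OLS fit of a k-degree polynomial regression; returns the fitted values
    muhat = S y and the (symmetric idempotent) smoother matrix S."""
    X = np.vander(x, degree + 1, increasing=True)        # columns 1, x, ..., x^k
    S = X @ np.linalg.solve(X.T @ X, X.T)                # S = X (X'X)^{-1} X'
    return S @ y, S

rng = np.random.default_rng(0)
x = rng.uniform(size=200)
y = 1 - x + np.exp(-50 * (x - 0.5) ** 2) + 0.1 * rng.normal(size=200)
muhat, S = polynomial_smoother(x, y, degree=3)
```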

Figure 5: Polynomial regression estimates. 200 observations from E(Y) = 1 − X + exp(−50(X − .5)²); the panels show the data and linear, quadratic, and cubic polynomial fits against the true CMF.

2.2 Regression splines

One way of avoiding global dependence on local properties of the function μ(x) is to consider piecewise polynomial functions.

A regression spline approximates μ(x) by a smooth piecewise polynomial function and therefore represents a very flexible way of approximating μ(x). The simplest example is linear splines.

Linear splines

Select J distinct points or knots, c₁ < ··· < c_J.

A linear spline is a continuous piecewise linear function of the form

  m(x) = α_j + β_j x,   c_{j−1} < x ≤ c_j.

For m to be continuous at the first knot c1, the model parameters must satisfy

  α₁ + β₁c₁ = α₂ + β₂c₁,

which implies that α₂ = α₁ + (β₁ − β₂)c₁. On the interval (c₀, c₂], we therefore have

  m(x) = α₁ + β₁x if c₀ < x ≤ c₁,   and   m(x) = α₁ + (β₁ − β₂)c₁ + β₂x if c₁ < x ≤ c₂.

A more compact representation of m on (c₀, c₂] is

  m(x) = α + βx + γ₁(x − c₁)₊,   c₀ < x ≤ c₂,

where (u)₊ = u · 1{u > 0} denotes the positive part of u.

Repeating this argument for all other knots, a linear spline may be represented as

  m(x) = α + βx + Σ_{j=1}^J γ_j (x − c_j)₊,

where γ_j = β_{j+1} − β_j.
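The representation above can be estimated by OLS on the basis {1, x, (x − c_j)₊}. The following Python sketch (an addition, with knots chosen arbitrarily and simulated data) illustrates this.

```python
import numpy as np

def linear_spline_fit(x, y, knots):
    """OLS fit of the linear spline m(x) = a + b x + sum_j g_j (x - c_j)_+ ."""
    plus = np.clip(x[:, None] - np.asarray(knots)[None, :], 0, None)   # (x - c_j)_+
    X = np.column_stack([np.ones_like(x), x, plus])                    # basis matrix
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ coef                                                    # fitted values

rng = np.random.default_rng(0)
x = rng.uniform(size=200)
y = 1 - x + np.exp(-50 * (x - 0.5) ** 2) + 0.1 * rng.normal(size=200)
muhat = linear_spline_fit(x, y, knots=[0.25, 0.5, 0.75])
```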

Remarks

• The number of free parameters in the model is only J + 2, less than the number 2(J + 1) of parameters of an unrestricted piecewise linear function. The difference 2(J + 1) − (J + 2) is equal to the number of constraints that must be imposed to ensure continuity.

• Although continuous, a linear spline is not differentiable, for its derivative is a step function with jumps at c₁,...,c_J.

• This problem may be avoided by considering higher-order (quadratic, cubic, etc.) smooth piecewise polynomial functions.

Cubic splines

A cubic spline is a twice continuously differentiable piecewise cubic function. Given J distinct knots c₁ < ··· < c_J, a cubic spline may be represented as

  m(x) = β₀ + β₁x + β₂x² + β₃x³ + Σ_{j=1}^J γ_j (x − c_j)₊³.   (17)

Such a function has the following properties:

• m is a cubic polynomial in each subinterval [c_j, c_{j+1}];

• m is twice continuously differentiable;

• the third derivative of m is a step function with jumps at c₁,...,c_J.

Remarks:

• A natural cubic spline is a cubic spline that is forced to be linear outside the boundary knots c₀ and c_{J+1}. The number of free parameters in this case is equal to J + 2.

• The representation (17) lends itself directly to estimation by OLS.

• In general, given the sequence of knots and the degree of the polynomial, a regression spline may be estimated by an OLS regression of Y on an appropriate set of vectors that represent a basis for the selected family of piecewise polynomial functions evaluated at the sample values of X.

• Regression splines estimated by OLS give linear regression smoothers defined by symmetric idempotent smoother matrices. The number and position of the knots control the flexibility and smoothness of the approximation.

Problems with regression splines:

• How to select the number and the position of the knots.

• Considerable increase in the degree of complexity of the problem when X is multivariate.

Figure 6: Splines vs. polynomial regression estimates. 200 observations from E(Y) = 1 − X + exp(−50(X − .5)²); the panels compare polynomial and spline fits (linear, quadratic, cubic) with the true CMF.

2.3 The kernel method

The CMF may be written

  μ(x) = ∫ y f(x, y)/f_X(x) dy = c(x)/f_X(x),

where f_X(x) = ∫ f(x, y) dy is the marginal density of X and c(x) = ∫ y f(x, y) dy.

Given nonparametric estimates ĉ(x) and f̂_X(x) of c(x) and f_X(x), a nonparametric estimate of μ(x) is

  μ̂(x) = ĉ(x)/f̂_X(x).   (18)

An advantage of the kernel method is that, under certain conditions, such an estimate may be computed directly from estimates of the marginal density fX (x) without the need of estimating the joint density f(x, y).

The Nadaraya-Watson estimator

Consider the bivariate kernel density estimate

  f̂(x, y) = (1/(n h_X h_Y)) Σ_{i=1}^n K*((x − X_i)/h_X, (y − Y_i)/h_Y),

where K*: R² → R. If h_X = h_Y = h and f̂_X(x) is the estimate of f_X(x) based on the kernel K(x) = ∫ K*(x, y) dy, then

  ∫ f̂(x, y) dy = f̂_X(x) = (1/(nh)) Σ_{i=1}^n K((x − X_i)/h).

If, in addition, K*(x, y) is such that ∫ y K*(x, y) dy = 0, then

  ĉ(x) = ∫ y f̂(x, y) dy = (1/(nh)) Σ_{i=1}^n Y_i K((x − X_i)/h).

Substituting these two expressions into (18) gives the Nadaraya-Watson estimator

  μ̂(x) = [Σ_{i=1}^n K((x − X_i)/h)]^{−1} Σ_{i=1}^n K((x − X_i)/h) Y_i.   (19)

Thus, μ̂(x) is a weighted average of the Y_j with nonnegative weights

  w_j(x) = K((x − X_j)/h) / Σ_{i=1}^n K((x − X_i)/h),   j = 1,...,n,

which add up to one. The Nadaraya-Watson estimator is therefore a linear smoother whose smoother matrix S = [w_j(X_i)] depends on the sample values X_i, the selected kernel K, and the bandwidth h.

If the kernel corresponds to a unimodal density with mode at zero, then the closer X_j is to x, the bigger is the weight assigned to Y_j in forming μ̂(x). If the kernel vanishes outside the interval [−1, 1], then only values of Y for which |x − X_j| ≤ h contribute to μ̂(x).
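A minimal Python sketch (not in the notes) of the Nadaraya-Watson estimator (19) with a Gaussian kernel; the evaluation grid, bandwidth, and simulated data are illustrative assumptions.

```python
import numpy as np

def nadaraya_watson(x, y, grid, h):
    """Nadaraya-Watson estimator (19) with a Gaussian kernel:
    muhat(x0) = sum_i K((x0 - X_i)/h) Y_i / sum_i K((x0 - X_i)/h)."""
    u = (grid[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * u**2)                  # unnormalized Gaussian kernel weights
    return (K * y[None, :]).sum(axis=1) / K.sum(axis=1)

rng = np.random.default_rng(0)
x = rng.uniform(size=200)
y = 1 - x + np.exp(-50 * (x - 0.5) ** 2) + 0.1 * rng.normal(size=200)
muhat = nadaraya_watson(x, y, grid=np.linspace(0, 1, 51), h=0.05)
```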

Example 8 A family of kernels with bounded support which contains many kernels used in practice is

  K_α(u) = c_α (1 − u²)^α 1{0 ≤ |u| ≤ 1},   α = 0, 1, 2, ...,

where the constant c_α, chosen such that K_α is a proper density, is

  c_α = 2^{−2α−1} Γ(2α + 2) Γ(α + 1)^{−2}

and Γ(z) is the gamma function.

• If α = 0, we obtain the uniform kernel.

• If α = 1, we obtain the Epanechnikov kernel.

• If α = 2, we obtain the quartic kernel

  K(u) = (15/16)(1 − u²)² 1{0 ≤ |u| ≤ 1}.

• If α → ∞, then K_α converges to the Gaussian kernel.

Figure 7: Kernel regression estimates. 200 observations from E(Y) = 1 − X + exp(−50(X − .5)²); the panels show Gaussian kernel fits with h = .05 and .1 and Epanechnikov kernel fits with h = .1 and .2 against the true CMF.

2.4 The nearest neighbor method

This method suggests considering the X_i that are closest to the evaluation point x and estimating μ(x) by averaging the corresponding values of Y.

A kth nearest neighborhood of x, denoted by O_k(x), consists of the k points that are closest to x. The corresponding nearest neighbor estimate of μ(x) is

  μ̂(x) = (1/k) Σ_{j∈O_k(x)} Y_j = Σ_{j=1}^n w_j(x) Y_j,

where

  w_j(x) = 1/k if j ∈ O_k(x), and 0 otherwise.

Thus, μ̂(x) is just a running mean or moving average of the Y_i.

This estimate is clearly a linear smoother, and its degree of smoothness depends on the value of k or, equivalently, on the span λ = k/n of the neighborhood.

If X is a vector, then the choice of the metric is of some importance. Two possibilities are:

• the Euclidean distance

  d(x, x′) = [(x − x′)′(x − x′)]^{1/2},

• the Mahalanobis distance

  d(x, x′) = [(x − x′)′ Ω̂^{−1} (x − x′)]^{1/2},

where Ω̂ is a pd estimate of the covariance matrix of the covariates.

Notice that both the nearest neighbor estimator μ̂(x) and the Nadaraya-Watson estimator solve the following problem:

  min_{c∈R} Σ_{i=1}^n w_i(x)(Y_i − c)².

The two estimators correspond to different choices of weights.

Locally linear fitting

Using a running mean may give severe biases near the boundaries of the data, where neighborhoods tend to be highly asymmetric.

A way of reducing this bias is to replace the running mean by a running line

  μ̂(x) = α̂(x) + β̂(x)x,   (20)

where α̂(x) and β̂(x) are the intercept and the slope of the OLS line computed using only the k points (k > 2) in the neighborhood O_k(x).

Computation of (20) for the n sample points may be based on the recursive formulae for OLS and only requires a number of operations of order O(n).

Notice that α̂(x) and β̂(x) in (20) solve the locally weighted LS problem

  min_{(α,β)∈R²} Σ_{i=1}^n w_i(x)(Y_i − α − βX_i)²,

where the weight w_i(x) is equal to one if the ith point belongs to O_k(x) and to zero otherwise.

Running lines tend to produce ragged curves, but the method can be improved replacing OLS by locally weighted LS, with weights that decline as the distance of X_i from x increases.

Lowess

An example of locally linear fitting is the lowess (LOcally WEighted Scatterplot Smoother) estimator introduced by Cleveland (1979), which is computed as follows:

Algorithm 1

(1) Identify the kth nearest neighborhood O_k(x) of x.

(2) Compute the distance Δ(x) = max_{i∈O_k(x)} d(X_i, x) of the point in O_k(x) that is farthest from x.

(3) Assign weights w_i(x) to each point in O_k(x) according to the tricube function

  w_i(x) = W(d(X_i, x)/Δ(x)),

where

  W(u) = (1 − u³)³ if 0 ≤ u < 1, and 0 otherwise.

(4) Compute μ̂(x) as the predicted value of Y corresponding to x from a WLS regression that uses the observations in O_k(x) and the weights defined in step (3).

Notice that the weight w_i(x) is maximum when X_i = x, decreases as the distance of X_i from x increases, and becomes zero if X_i is the kth nearest neighbor of x.
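The following Python sketch (an addition, not Cleveland's implementation) carries out steps (1)-(4) of Algorithm 1 at a single evaluation point x0; the span and simulated data are assumptions.

```python
import numpy as np

def lowess_at(x, y, x0, span=0.4):
    """One pass of Algorithm 1: evaluate the lowess fit at the point x0."""
    k = max(int(span * len(x)), 2)
    dist = np.abs(x - x0)
    idx = np.argsort(dist)[:k]                    # (1) kth nearest neighborhood of x0
    delta = dist[idx].max()                       # (2) distance of the farthest neighbor
    w = (1 - (dist[idx] / delta) ** 3) ** 3       # (3) tricube weights
    X = np.column_stack([np.ones(k), x[idx]])     # local linear design
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y[idx])   # (4) WLS coefficients
    return beta[0] + beta[1] * x0

rng = np.random.default_rng(0)
x = rng.uniform(size=200)
y = 1 - x + np.exp(-50 * (x - 0.5) ** 2) + 0.1 * rng.normal(size=200)
print(lowess_at(x, y, x0=0.5, span=0.4))
```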

Figure 8: Lowess estimates. 200 observations from E(Y) = 1 − X + exp(−50(X − .5)²); the panels show lowess fits with spans .2, .4, .6, and .8 against the true CMF.

2.5 Cubic smoothing splines

The regression analogue of the maximum penalized likelihood problem is the problem of finding a smooth function that best interpolates the observations on Y without fluctuating too wildly.

This problem may be formalized as follows:

  min_{m∈C²[a,b]} Q(m) = Σ_{i=1}^n [Y_i − m(X_i)]² + λ ∫_a^b m″(u)² du,   (21)

where C²[a, b] is the class of functions defined on the closed interval [a, b] that have continuous 1st derivative and integrable 2nd derivative, a and b are the limits of a finite interval that contains the observed values of X, and λ ≥ 0 is a fixed parameter.

The residual sum of squares in the functional Q measures fidelity to the data, whereas the second term penalizes for excessive fluctuations of the estimated CMF.

A solution μ̂ to problem (21) exists, is unique, and can be represented as a natural cubic spline with at most J = n − 2 interior points corresponding to distinct sample values of X. Such a solution is called a cubic smoothing spline.

Larger values of λ tend to produce solutions that are smoother, whereas smaller values tend to produce solutions that are more wiggly. In particular:

• When λ →∞, problem (21) reduces to the OLS problem.

• When λ = 0, there is no penalty for curvature and the solution μ̂ is a twice differentiable function that exactly interpolates the data.

Smoothing splines and penalized LS

Knowledge of the form of the solution to (21) makes it possible to define its smoother matrix. If we consider the representation of μ̂ as a cubic spline, then the required number of basis functions is J + 4 = n − 2 + 4 = n + 2. The solution to problem (21) may therefore be written

  μ̂(x) = Σ_{j=1}^{n+2} α_j B_j(x),

where α₁,...,α_{n+2} are coefficients to be determined and B₁(x),...,B_{n+2}(x) is a set of twice differentiable basis functions.

Define the n × (n + 2) matrix B = [B_ij], where B_ij = B_j(X_i) is the value of the jth basis function at the point X_i. Notice that B does not have full column rank. Thus

  Σ_{i=1}^n [Y_i − μ̂(X_i)]² = Σ_{i=1}^n [Y_i − Σ_{j=1}^{n+2} α_j B_j(X_i)]² = (Y − Bα)′(Y − Bα),

where α = (α₁,...,α_{n+2})′. Also define the (n + 2) × (n + 2) matrix Ω = [Ω_rs], where

  Ω_rs = ∫_a^b B_r″(x) B_s″(x) dx.

Then

  ∫_a^b m″(u)² du = Σ_{r=1}^{n+2} Σ_{s=1}^{n+2} α_r α_s Ω_rs = α′Ωα.

Problem (21) is then equivalent to the penalized LS problem

  min_α (Y − Bα)′(Y − Bα) + λ α′Ωα.

Thus, the original problem (21) is reduced to the much simpler problem of minimizing a quadratic form in the (n + 2)-dimensional vector α.

The penalized LS estimator

The solution to the penalized least squares problem exists and must satisfy the normal equation

  0 = −2B′(Y − Bα̃) + 2λΩα̃.

If B′B + λΩ is a nonsingular matrix, then

  α̃ = (B′B + λΩ)^{−1} B′Y.

Because α̃ formally coincides with a ridge-regression estimate, it also has a nice Bayesian interpretation.

Since μ̂ = Bα̃, the smoother matrix of a cubic smoothing spline is

  S = B(B′B + λΩ)^{−1} B′,

and is symmetric but not idempotent.
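As an illustration of the penalized LS formula α̃ = (B′B + λΩ)^{−1}B′Y (not from the notes), the Python sketch below uses a truncated-power cubic basis with a handful of knots and a numerically approximated penalty matrix Ω, rather than the exact natural-spline basis with knots at the data points; all names, knot choices, and λ are assumptions.

```python
import numpy as np

def penalized_spline(x, y, knots, lam, a=0.0, b=1.0):
    """Penalized LS fit alpha = (B'B + lam*Omega)^{-1} B'y for a truncated-power
    cubic basis; Omega approximates the integrals of products of 2nd derivatives."""
    def basis(t):
        t = np.atleast_1d(t)
        cols = [np.ones_like(t), t, t**2, t**3]
        cols += [np.clip(t - c, 0, None) ** 3 for c in knots]
        return np.column_stack(cols)

    def basis_dd(t):                       # second derivatives of the basis functions
        t = np.atleast_1d(t)
        cols = [np.zeros_like(t), np.zeros_like(t), 2 * np.ones_like(t), 6 * t]
        cols += [6 * np.clip(t - c, 0, None) for c in knots]
        return np.column_stack(cols)

    B = basis(x)
    grid = np.linspace(a, b, 2001)
    D = basis_dd(grid)
    Omega = (D.T @ D) * (grid[1] - grid[0])          # Riemann-sum approximation of Omega
    alpha = np.linalg.solve(B.T @ B + lam * Omega, B.T @ y)
    return B @ alpha                                  # fitted values muhat = B alpha

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(size=200))
y = 1 - x + np.exp(-50 * (x - 0.5) ** 2) + 0.1 * rng.normal(size=200)
muhat = penalized_spline(x, y, knots=np.linspace(0.1, 0.9, 9), lam=1e-4)
```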

61 2.6 Statistical properties of linear smoothers The smoother matrix of a linear smoother typically depends on the sample val- ues of X and a parameter (or set of parameters) which regulates the amount of smoothing of the data. To emphasize this dependence, we denote the smoother matrix by Sλ =[sij], where λ is the smoothing parameter. We adopt the convention that larger values of λ correspond to more smoothing.

Let the data be a sample from the distribution of (X, Y ), suppose that the CMF μ(x)=E(Y | X = x)andtheCVFσ2(x)=Var(Y | X = x) are both well defined, and let µ and Σ denote, respectively, an n-vector with generic element 2 2 μi = μ(Xi) and a diagonal n × n matrix with generic element σi = σ (Xi).

For a linear smoother µˆ = SλY, with generic elementμ ˆi =ˆμ(Xi), we have  E(µˆ | X)=Sλµ and Var(µˆ | X)=SλΣSλ .Thebias of µˆ is therefore

Bias(µˆ | X)=E(µˆ | X) − µ =(Sλ − In)µ.

Acasewhenµˆ is unbiased is when µ = Xβ and Sλ is such that SλX = X.

Given an estimate Σˆ of the matrix Σ, we may estimate the sampling variance of  µ by SλΣˆSλ . The result may then be used to construct pointwise standard error bands for the regression estimates.

If the smoother is approximately unbiased, then these standard error bands also represent pointwise confidence intervals. An alternative way of constructing confidence intervals is the bootstrap.

If the observations are conditionally homoskedastic, that is, σ²(x) = σ² for all x, then Σ may be estimated by σ̂² I_n, where σ̂² = n^{-1} Σ_i (Y_i − μ̂_i)². The sampling variance of μ̂_i may then be estimated by σ̂² Σ_{j=1}^n s_ij².
As a global measure of accuracy, consider the average conditional MSE
AMSE(λ) = n^{-1} Σ_{i=1}^n E(μ̂_i − μ_i)² = n^{-1} [tr(S_λ Σ S_λ') + μ'(S_λ − I_n)'(S_λ − I_n)μ].
When the data are conditionally homoskedastic, this reduces to
AMSE(λ) = n^{-1} [σ² tr(S_λ S_λ') + μ'(S_λ − I_n)'(S_λ − I_n)μ].

62 Cross-validation
In general, increasing the amount of smoothing tends to increase the bias and to reduce the sampling variance of a smoother. The AMSE provides a summary of this trade-off. Since the AMSE depends on unknown parameters, one way of choosing the smoothing parameter optimally is to minimize an unbiased estimate of the AMSE. The following relationship links the AMSE to the average mean squared error of prediction or average predictive risk
n^{-1} Σ_{i=1}^n E(Y_i − μ̂_i)² = n^{-1} Σ_{i=1}^n σ_i² + AMSE(λ).
Thus, minimizing an unbiased estimate of the AMSE is equivalent to minimizing an unbiased estimate of the average predictive risk. An approximately unbiased estimator of the predictive risk is the cross-validation criterion
CV(λ) = n^{-1} Σ_{i=1}^n [Y_i − μ̂_(i)(X_i)]²,
where μ̂_(i)(X_i) denotes the value of the smoother at the point X_i, computed by excluding the ith observation, and Y_i − μ̂_(i)(X_i) denotes the ith predicted residual. Minimizing CV(λ) represents an automatic method for choosing the smoothing parameter λ.

Cross-validation requires computing the sequence μ̂_(1)(X_1),...,μ̂_(n)(X_n). This is very simple if the smoother matrix preserves the constant. Because Σ_j s_ij = 1 in this case, μ̂_(i)(X_i) may be computed by setting the weight of the ith observation equal to zero and then dividing all the other weights by 1 − s_ii in order for them to add up to one. This gives
μ̂_(i)(X_i) = Σ_{j≠i} [s_ij/(1 − s_ii)] Y_j = (Σ_{j=1}^n s_ij Y_j − s_ii Y_i)/(1 − s_ii) = (μ̂_i − s_ii Y_i)/(1 − s_ii).

Hence, the ith predicted residual is Y_i − μ̂_(i)(X_i) = Û_i/(1 − s_ii), where Û_i = Y_i − μ̂_i, and the cross-validation criterion becomes
CV(λ) = n^{-1} Σ_{i=1}^n [Û_i/(1 − s_ii)]²,
which coincides with the result obtained for OLS.
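A minimal sketch of this shortcut, assuming a linear smoother matrix whose rows sum to one is available as a NumPy array (the smoothing-spline S from the sketch above could be reused):

```python
import numpy as np

def cv_criterion(S, Y):
    """Leave-one-out CV computed from the fitted values and the hat diagonal,
    valid for linear smoothers whose rows sum to one."""
    mu_hat = S @ Y
    s_ii = np.diag(S)
    loo_residuals = (Y - mu_hat) / (1.0 - s_ii)
    return np.mean(loo_residuals**2)

# usage (hypothetical helper): scan a grid of smoothing parameters and pick the minimizer
# lam_grid = np.logspace(-6, 0, 25)
# cv_values = [cv_criterion(smoother_matrix(lam), Y) for lam in lam_grid]
```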

63 Equivalent kernels and equivalent degrees of freedom
Using the representation (16), it is easy to compare linear smoothers obtained from different methods.

The set of weights S_1(x),...,S_n(x) assigned to the sample values of Y in the construction of μ̂(x) is called the equivalent kernel evaluated at the point x. The ith row of the smoother matrix S_λ is just the equivalent kernel evaluated at the point X_i. A comparison of the equivalent kernels is therefore a simple and effective way of comparing different linear smoothers.

A synthetic index of the amount of smoothing of the data is the effective number of parameters defined, by analogy with OLS, as

k_* = tr S_λ.

The number df = n − k∗ is called equivalent degrees of freedom.

The effective number of parameters is equal to the sum of the eigenvalues of the matrix S_λ. If S_λ is symmetric and idempotent (as in the case of regression splines), then k_* = rank S_λ.

The effective number of parameters depends mainly on the value of the smooth- ing parameter λ, while the configuration of the predictors tends to have a much smaller effect. As λ increases, we usually observe a decrease in k∗ and a corresponding increase in df .

64 Asymptotic properties
We now present some asymptotic properties of regression smoothers. For simplicity, we confine our attention to estimators based on the kernel method.

Given a sample of size n from the distribution of (X, Y), where X is a random k-vector and Y is a scalar rv, a Nadaraya-Watson estimator of the CMF of Y is
μ̂_n(x) = [Σ_{i=1}^n K((x − X_i)/h_n)]^{-1} Σ_{i=1}^n K((x − X_i)/h_n) Y_i.
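A minimal NumPy sketch of this estimator for a single covariate, using a Gaussian kernel (an illustrative choice; any kernel satisfying the conditions of Theorem 5 below would do) and an arbitrary bandwidth:

```python
import numpy as np

def nadaraya_watson(x0, X, Y, h):
    """Nadaraya-Watson estimate of E(Y | X = x0) with a Gaussian kernel."""
    u = (x0 - X) / h
    w = np.exp(-0.5 * u**2)          # unnormalized Gaussian kernel weights
    return np.sum(w * Y) / np.sum(w)

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, 400)
Y = np.sin(X) + rng.normal(scale=0.2, size=400)

grid = np.linspace(-2, 2, 9)
fit = [nadaraya_watson(x0, X, Y, h=0.3) for x0 in grid]   # h chosen by eye here
print(np.round(fit, 2))
```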

Theorem 5 Suppose that the sequence {h_n} of bandwidths and the kernel K: ℝ^k → ℝ satisfy:

(i) hn → 0;

(ii) n h_n^k → ∞;
(iii) ∫ K(u) du = 1, ∫ u K(u) du = 0, and ∫ u u' K(u) du is a finite k × k matrix.
Then μ̂_n(x) →_p μ(x) for every x in the support of X.

Theorem 6 In addition to the assumptions of Theorem 5, suppose that

h_n² (n h_n^k)^{1/2} → λ, where 0 ≤ λ < ∞. If the CMF μ(x) and the CVF σ²(x) are smooth and X has a twice continuously differentiable density f_X(x), then
(n h_n^k)^{1/2} [μ̂_n(x) − μ(x)] ⇒ N( λ b(x)/f_X(x), [σ²(x)/f_X(x)] ∫ K(u)² du )
for every x in the support of X, where
b(x) = lim_{n→∞} h_n^{-(k+2)} E [μ(X_i) − μ(x)] K((x − X_i)/h_n).

Further, (n h_n^k)^{1/2}[μ̂_n(x) − μ(x)] and (n h_n^k)^{1/2}[μ̂_n(x') − μ(x')] are asymptotically independent for x ≠ x'.

65 Remarks

• μ̂_n(x) is asymptotically biased for μ(x), unless h_n is chosen such that λ = 0. In this case, however, the convergence of μ̂_n(x) to its limiting distribution is slower than in the case when λ > 0.

• The convergence rate of μ̂_n(x) is inversely related to the number k of covariates, reflecting the curse-of-dimensionality problem. For k fixed, the maximal rate is achieved when h_n = c n^{-1/(k+4)}, for some c > 0. In this case, λ = c^{(k+4)/2} and therefore
n^{2/(k+4)} [μ̂_n(x) − μ(x)] ⇒ N( c² b(x)/f(x), c^{-k} [σ²(x)/f(x)] ∫ K(u)² du ).

Even when there is a single covariate (k = 1), the fastest convergence rate is only equal to n^{-2/5} and is lower than the rate of n^{-1/2} achieved by a regular estimator of μ(x) in a parametric context.

These problems have generated two active lines of research:

• One seeks ways of eliminating the asymptotic bias of μ̂_n(x) while retaining the maximal convergence rate of n^{-2/(k+4)}. One possibility (Schucany and Sommers 1977, Bierens 1987) is to consider a weighted average of two estimators with different bandwidths chosen such that the resulting estimator is asymptotically centered at μ(x).

• The other seeks ways of improving the speed of convergence of μ̂_n(x) which, from the proof of Theorem 6, depends on the convergence rate of q̂_3(x). If the kernel K is chosen such that ∫ u u' K(u) du = 0, allowing therefore for negative values of K, then lim_{n→∞} h_n^{-3} q̂_3(x) < ∞ and so the optimal convergence rate becomes n^{-3/(k+6)}. More generally, if one chooses a kernel with zero moments up to order m, then the optimal bandwidth becomes h_n = c n^{-1/(k+2m)}, and the maximal convergence rate becomes n^{-m/(k+2m)} which, as m → ∞, tends to the rate n^{-1/2} typical of a parametric estimator.

66 Tests of parametric models
Nonparametric regression may be used to test the validity of a parametric model. For this purpose one may employ both informal methods, especially graphical ones, and more formal tests.

To illustrate, consider testing for the goodness-of-fit of a linear regression model. Let [X, Y] be a sample of size n from the distribution of (X, Y), and assume for simplicity that Y | X = x ∼ N(μ(x), σ²). The null hypothesis H_0: μ(x) = α + βx specifies the CMF as linear in x, whereas the alternative hypothesis specifies μ(x) as a smooth nonlinear function of x.

Let Û = MY be the OLS residual vector and Ũ = (I_n − S_λ)Y be the residual vector associated with some other linear smoother μ̂ = S_λ Y. Because H_0 is nested within the alternative, an F-type test rejects H_0 for large values of the statistic
F = (Û'Û − Ũ'Ũ) / (Ũ'Ũ) = (Y'MY − Y'M_*Y) / (Y'M_*Y),
where M_* = (I_n − S_λ)'(I_n − S_λ). For general types of linear smoother, the distribution of F under H_0 is difficult to derive as it depends on the unknown parameters α and β.

Azzalini and Bowman (1993) suggest replacing H_0 by the hypothesis that E Û = 0. The alternative is now that E Û is a smooth function of the covariates. This leads to a test that rejects H_0 for large values of the statistic

F_* = (Û'Û − Û'M_*Û) / (Û'M_*Û).

It can be shown that F∗ depends only on the data and the value of the smooth- ing parameter λ. Rejecting H0 for large values of F∗ is equivalent to rejecting for small values of the statistic

ξ = 1/(F_* + 1) = (Û'M_*Û) / (Û'Û),
which has the same form as the Durbin–Watson test statistic.
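A minimal sketch of the F_* and ξ statistics, assuming the response Y, a design matrix X (including a constant) and a linear smoother matrix S_lam (e.g. a kernel or spline smoother matrix) are already available; the construction of S_lam and the reference distribution of the test are left out here.

```python
import numpy as np

def azzalini_bowman_stats(Y, X, S_lam):
    """Compute F* and xi from the OLS residuals and a linear smoother matrix."""
    n = len(Y)
    H = X @ np.linalg.solve(X.T @ X, X.T)        # OLS hat matrix
    U_hat = Y - H @ Y                            # OLS residuals
    M_star = (np.eye(n) - S_lam).T @ (np.eye(n) - S_lam)
    rss_smooth = U_hat @ M_star @ U_hat
    F_star = (U_hat @ U_hat - rss_smooth) / rss_smooth
    xi = rss_smooth / (U_hat @ U_hat)            # equals 1 / (F* + 1)
    return F_star, xi
```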

67 2.7 Average derivatives

Suppose that the CMF is sufficiently smooth and write it as μ(x) = c(x)/f_X(x), where c(x) = ∫ y f(x, y) dy. Then its slope is

μ'(x) = [c'(x) f_X(x) − c(x) f_X'(x)] / f_X(x)² = [c'(x) − μ(x) f_X'(x)] / f_X(x),
and its average slope or average derivative is
δ = E μ'(X) = ∫ μ'(x) f_X(x) dx, (22)
where the expectation is with respect to the marginal distribution of X. The average derivative provides a convenient way of summarizing the CMF through a finite dimensional parameter (a scalar parameter if X is unidimensional).

Notice that, if μ(x) is linear, that is, μ(x) = α + βx, then

μ'(x) = β = E μ'(X).

This is no longer true if μ(x) is nonlinear.

Under regularity conditions, integrating by parts gives the following equivalent representation of the average derivative

δ = E s_X(X) μ(X), (23)

where s_X(x) = −f_X'(x)/f_X(x) is the score of the marginal density of X.

A third representation follows from the fact that, by the law of iterated expectations,

δ = E s_X(X) μ(X) = E[s_X(X) E(Y | X)] = E s_X(X) Y, (24)
where the last expectation is with respect to the joint distribution of (X, Y).

These three equivalent representations provide alternative ways of estimating the average derivative.

68 Estimation of the average derivative
If the kernel K is differentiable, an analog estimator of the average derivative δ based on (22) is
δ̂_1 = n^{-1} Σ_{i=1}^n D_i μ̂'(X_i),
where D_i = 1{f̂_X(X_i) > c_n} for some trimming constant c_n that goes to zero as n → ∞. The lower bound on f̂_X is introduced to avoid erratic behavior when the value of f̂_X(x) is very small.

An analog estimator based on (23) is

δ̂_2 = n^{-1} Σ_{i=1}^n D_i ŝ_X(X_i) μ̂(X_i),

where ŝ_X(x) = −f̂_X'(x)/f̂_X(x) is the estimated score.

Another analog estimator, based on (24), is

δ̂_3 = n^{-1} Σ_{i=1}^n D_i ŝ_X(X_i) Y_i.
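A minimal sketch of δ̂_3 for a single covariate, assuming a Gaussian kernel density estimate for f̂_X and its derivative; the bandwidth, the trimming constant and the data-generating process are illustrative choices.

```python
import numpy as np

def average_derivative_3(X, Y, h, c_n=1e-2):
    """Estimate delta = E[s_X(X) Y] with a Gaussian-kernel score estimate."""
    U = (X[:, None] - X[None, :]) / h              # pairwise standardized differences
    phi = np.exp(-0.5 * U**2) / np.sqrt(2*np.pi)
    f_hat = phi.mean(axis=1) / h                   # kernel density estimate at each X_i
    fprime_hat = (-U * phi).mean(axis=1) / h**2    # derivative of the density estimate
    score_hat = -fprime_hat / f_hat
    D = f_hat > c_n                                # trimming indicator
    return np.mean(D * score_hat * Y)

rng = np.random.default_rng(2)
X = rng.normal(size=1000)
Y = 1 + 2*X + rng.normal(size=1000)                # linear CMF, so delta should be near 2
print(average_derivative_3(X, Y, h=0.4))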

69 2.8 Methods for high-dimensional data We now review a few nonparametric approaches that try to overcome some of the problems encountered in trying to extend univariate nonparametric meth- ods to multivariate situations.

Projection pursuit regression
The projection pursuit (PP) method was originally introduced in the regression context by Friedman and Stuetzle (1981). Specifically, given a rv Y and a random k-vector X, they proposed to approximate the CMF of Y by a function of the form
m(x) = α + Σ_{j=1}^p m_j(β_j'x), (25)
where β_j is a vector with unit norm that specifies a particular direction in ℝ^k, β_j'x = Σ_{i=1}^k β_ij x_i is a linear combination or projection of the variables in X, and m_1,...,m_p are smooth univariate “ridge functions” to be determined empirically.

Remarks:

• By representing the CMF as a sum of arbitrary functions of linear combinations of the elements of x, PP regression allows for quite complex patterns of interaction among the covariates.

• Single-index models may be viewed as special cases corresponding to p = 1, and linear regression as a special case corresponding to p =1 and m1(u)=u.

• The curse-of-dimensionality problem is bypassed by using linear projections and nonparametric estimates of univariate ridge functions.

70 The PP algorithm Algorithm 2

(1) Given estimates β̂_j of the projection directions and estimates m̂_j of the first h − 1 ridge functions in (25), compute the approximation errors
r_i = Ỹ_i − Σ_{j=1}^{h−1} m̂_j(β̂_j'X_i), i = 1,...,n,

where Ỹ_i = Y_i − Ȳ.
(2) Given a vector b ∈ ℝ^k such that ‖b‖ = 1, construct a linear smoother m̂(b'X_i) based on the errors r_i and compute the RSS
S(b) = Σ_{i=1}^n [r_i − m̂(b'X_i)]².

(3) Determine b_* and the function m̂_* for which S(b) is minimized.

(4) Insert b_* and the function m̂_* as the hth terms into (25).
(5) Iterate (1)–(4) until the decrease in the RSS becomes negligible.

As stressed by Huber (1985), PP regression “emerges as the most powerful method yet invented to lift one-dimensional statistical techniques to higher dimensions”. However, it is not without problems:
• Interpretability of the individual terms of (25) is difficult when p > 1.
• PP regression has problems in dealing with highly nonlinear structures. In fact, there exist functions that cannot be represented as a finite sum of ridge functions (e.g. μ(x) = exp(x_1 x_2)).
• The sampling theory of PP regression is still lacking.
• The choice of the amount of smoothing in constructing the nonparametric estimates of the ridge functions is delicate.
• PP regression tends to be computationally expensive.

We now discuss an approach that may be regarded as a simplification of PP regression.

71 Additive regression An important property of the linear regression model

m(x) = α + Σ_{j=1}^k β_j x_j
is additivity. If there is no functional relation between the elements of x, then this property makes it possible to “separate” the effect of the different covariates and to interpret β_j as the partial derivative of μ(x) with respect to the jth covariate.

An additive regression model approximates the CMF of Y by a function of the form
m(x) = α + Σ_{j=1}^k m_j(x_j), (26)
where m_1,...,m_k are univariate smooth functions, one for each covariate. The linear regression model corresponds to m_j(u) = β_j u for all j or, equivalently, m_j'(u) = β_j.

Remarks:

• The additive model (26) is a special case of the PP regression model (25) corresponding to p = k and β_j = e_j, the jth unit vector.

• Because m_j'(x_j) = ∂m(x)/∂x_j, additive regression retains the interpretability of the effect of the individual covariates.

• Interactions between a categorical covariate and the continuous ones may be modeled by estimating separate versions of the model for each value of the categorical variable.

• Interactions between continuous covariates may be modelled by creating compound variables, such as products of pairs.

72 Characterization of additive regression models
The following theorem characterizes an additive regression model as the solution to a projection problem. In turn, this leads to a general iterative procedure for estimating additive regression models.

Theorem 7 Let Z = (X_1,...,X_k, Y) be a random (k+1)-vector with finite second moments. Let M be the Hilbert space of zero-mean functions of Z, with the inner product defined as ⟨ψ | φ⟩ = E ψ(Z)φ(Z). For any j = 1,...,k, let M_j be the subspace of zero-mean square integrable functions of X_j and let M^a = M_1 + ··· + M_k be the subspace of zero-mean square integrable additive functions of X = (X_1,...,X_k). Then a solution to the following problem
min_{m ∈ M^a} E[Y − m(X)]²
exists, is unique and is of the form
m_*(X) = Σ_{j=1}^k m_j(X_j),
where m_1,...,m_k are univariate functions such that
m_j(X_j) = E[Y − Σ_{h≠j} m_h(X_h) | X_j], j = 1,...,k. (27)

Denote now by S_j a smoother matrix for univariate smoothing on the jth covariate. Because the univariate smoother S_j Y is the sample analogue of the conditional mean E(Y | X_j), the analogy principle suggests the following equation system as the sample counterpart of (27)
m_j = S_j(Ỹ − Σ_{h≠j} m_h), j = 1,...,k,
where m_j = (m_j(X_1),...,m_j(X_n))' and Ỹ is the n-vector of deviations of Y_i from its sample mean. This results in the system of nk × nk linear equations

⎡ I_n  S_1  ···  S_1 ⎤ ⎛ m_1 ⎞   ⎛ S_1 Ỹ ⎞
⎢ S_2  I_n  ···  S_2 ⎥ ⎜ m_2 ⎟ = ⎜ S_2 Ỹ ⎟   (28)
⎢  ⋮    ⋮    ⋱    ⋮  ⎥ ⎜  ⋮  ⎟   ⎜   ⋮   ⎟
⎣ S_k  S_k  ···  I_n ⎦ ⎝ m_k ⎠   ⎝ S_k Ỹ ⎠

If (m̂_1,...,m̂_k) is a solution to the above system and we put α̂ = Ȳ, then an estimate of μ = (μ(X_1),...,μ(X_n))' is μ̂ = α̂ ι_n + Σ_{j=1}^k m̂_j.

73 The backfitting algorithm
One way of solving system (28) is the Gauss–Seidel method. This method solves iteratively for each vector m_j from the relationship
m_j = S_j(Ỹ − Σ_{h≠j} m_h),
using the latest values of {m_h, h ≠ j} at each step. The process is repeated for j = 1,...,k, 1,...,k,..., until convergence. The result is the backfitting algorithm:

Algorithm 3

(1) Compute the univariate smoother matrices S1,...,Sk.

(2) Initialize the algorithm by setting m_j = m_j^{(0)}, j = 1,...,k.
(3) Cycle for j = 1,...,k:
m_j = S_j(Ỹ − Σ_{h≠j} m_h),
where the smoother matrix S_j for univariate smoothing on the jth covariate is applied to the residual vector Ỹ − Σ_{h≠j} m_h obtained from the previous step.

(4) Iterate step (3) until the changes in the individual functions become negligible.

The smoother matrices S_j can be computed using any univariate linear smoothers and do not change through the backfitting algorithm. To ensure that Ỹ − Σ_{h≠j} m_h has mean zero at every stage, S_j may be replaced by the centered smoother matrix S_j(I_n − n^{-1} ι_n ι_n').

A possible starting point for the algorithm is the fitted value of Y from an OLS regression of Y on X. In this case m_j^{(0)} = X_j β̂_j, where X_j and β̂_j denote respectively the jth column of X and the jth element of the OLS coefficient vector.

It can be shown that convergence of the backfitting algorithm is guaranteed in the case of smoothing splines or when the chosen linear smoother is a projection operator, that is, defined by a symmetric idempotent smoother matrix.
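A minimal NumPy sketch of the backfitting cycle for k = 2 covariates, using centered Nadaraya-Watson smoother matrices; the choice of univariate smoother, the bandwidths, the zero starting values and the data-generating process are all illustrative assumptions.

```python
import numpy as np

def nw_smoother_matrix(x, h):
    """Row-normalized Gaussian-kernel smoother matrix for one covariate."""
    W = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return W / W.sum(axis=1, keepdims=True)

def backfit(Y, covariates, bandwidths, n_iter=50):
    """Backfitting (Gauss-Seidel) for an additive model with NW smoothers."""
    n = len(Y)
    Y_tilde = Y - Y.mean()
    center = np.eye(n) - np.ones((n, n)) / n                   # I_n - n^{-1} iota iota'
    S = [nw_smoother_matrix(x, h) @ center                     # centered smoother matrices
         for x, h in zip(covariates, bandwidths)]
    m = [np.zeros(n) for _ in covariates]                      # m_j^{(0)} = 0
    for _ in range(n_iter):
        for j in range(len(m)):
            partial_residual = Y_tilde - sum(m[h] for h in range(len(m)) if h != j)
            m[j] = S[j] @ partial_residual
    return Y.mean() + sum(m), m                                # fitted mu_hat and components

rng = np.random.default_rng(3)
n = 300
X1, X2 = rng.uniform(-2, 2, n), rng.uniform(-2, 2, n)
Y = np.sin(X1) + 0.5 * X2**2 + rng.normal(scale=0.3, size=n)
mu_hat, components = backfit(Y, [X1, X2], bandwidths=[0.3, 0.3])
print(np.corrcoef(mu_hat, np.sin(X1) + 0.5 * X2**2)[0, 1])    # should be close to 1
```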

74 Partially linear models A partially linear model is a model of the form

E(Y | X, W)=m1(X)+m2(W ),

where m_1(x) = β'x and m_2 is an unknown univariate smooth function.

For simplicity (and without loss of generality), suppose that X is a scalar rv and write the model as

Y = βX + m_2(W) + U, (29)
where E(U | X, W) = 0. If E X m_2(W) = 0, then β may be estimated consistently and efficiently by an OLS regression of Y on X. This is typically the case when X has mean zero and is independent of W. In general, however, X and m_2(W) are correlated, and so a regression of Y on X gives inconsistent estimates of β.

Notice however that

E(Y | W) = E[E(Y | X, W) | W] = β E(X | W) + m_2(W). (30)

Subtracting (30) from (29) gives

Y − E(Y | W )=β[X − E(X | W )] + U.

If Ŷ = PY is a linear regression smoother of Y on W and X̂ = SX is a linear regression smoother of X on W, then β may simply be estimated by an OLS regression of Y − Ŷ = (I_n − P)Y on X − X̂ = (I_n − S)X.

This method, proposed by Robinson (1988), may be viewed as a generalization of the method for partitioned regressions.

75 The backfitting algorithm for partially linear models

Now consider applying the backfitting algorithm. Because in this case m_1 is linear, two different smoothers may be employed:

• An OLS smoother, with smoother matrix S_1 = X(X'X)^{-1}X', which produces estimates of the form m̂_1 = Xβ̂.

• A nonparametric smoother with smoother matrix S_2, constructed from W = (W_1,...,W_n), which produces estimates of the vector m_2 = (m_2(W_1),...,m_2(W_n))'.

In this case, the steps of the backfitting algorithm are

m̂_1 = X(X'X)^{-1}X'(Y − m̂_2) = Xβ̂,

mˆ 2 = S2(Y − mˆ 1)=S2(Y − Xβˆ).

Premultiplying the first of these two equations by X' and substituting the expression for m̂_2 from the second equation gives

X'm̂_1 = X'(Y − m̂_2) = X'(I_n − S_2)Y + X'S_2 Xβ̂ = X'Xβ̂,
from which
X'(I_n − S_2)Y = X'(I_n − S_2)Xβ̂.
Solving for β̂ we then get

β̂ = [X'(I_n − S_2)X]^{-1} X'(I_n − S_2)Y.

In this case, no iteration is necessary.

This method may also be viewed as a generalization of the method for partitioned regressions.

If S_2 is a projection matrix (symmetric and idempotent), then the backfitting algorithm coincides with the method proposed by Robinson (1988).
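A minimal NumPy sketch of the closed-form estimator β̂ = [X'(I_n − S_2)X]^{-1} X'(I_n − S_2)Y, with S_2 a Nadaraya-Watson smoother matrix on W; the kernel, the bandwidth and the data-generating process are illustrative choices.

```python
import numpy as np

def nw_smoother_matrix(w, h):
    """Row-normalized Gaussian-kernel smoother matrix based on W."""
    K = np.exp(-0.5 * ((w[:, None] - w[None, :]) / h) ** 2)
    return K / K.sum(axis=1, keepdims=True)

def partially_linear_beta(Y, X, W, h=0.3):
    """Closed-form estimator of beta in Y = beta*X + m2(W) + U."""
    n = len(Y)
    S2 = nw_smoother_matrix(W, h)
    A = np.eye(n) - S2
    X = X.reshape(n, -1)                           # allow scalar or vector X
    return np.linalg.solve(X.T @ A @ X, X.T @ A @ Y)

rng = np.random.default_rng(4)
n = 500
W = rng.uniform(-2, 2, n)
X = 0.5 * W + rng.normal(size=n)                   # X correlated with m2(W)
Y = 1.5 * X + np.sin(2 * W) + rng.normal(scale=0.3, size=n)
print(partially_linear_beta(Y, X, W))              # should be close to 1.5
```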

76 2.9 Stata commands We now briefly review the Stata commands for linear nonparametric regression estimation.

These include the lowess and lpoly commands, respectively for lowess and kernel-weighted local polynomial smoothing. Both commands consider the case of a single regressor.

77 The lowess command This is a computationally intensive command and may therefore take a long time to run on a slow computer. Lowess calculations on 1,000 observations, for instance, require performing 1,000 regressions.

The basic syntax is: lowess yvar xvar [if][in][weight][,options] where options includes:

• mean: running-mean smooth (default is running-line least squares).

• noweight: suppresses weighted regressions (default is tricube weighting function).

• bwidth(#): span of the data (default is 0.80). Centered subsets containing a fraction bwidth() of the observations are used for calculating smoothed values for each point in the data except for end points, where smaller uncentered subsets are used.

• logit: transforms the dependent variable to logits and adjusts the smoothed mean to equal the mean of yvar.

• adjust: adjusts smoothed mean to equal mean of yvar.

• generate(newvar): creates newvar containing the smoothed values of yvar.

• nograph: suppresses graph. This option is often used with the generate() option.

78 The lpoly command The basic syntax is: lpoly yvar xvar [if][in][weight][,options] where options includes:

• kernel(kernel): specifies the kernel function. The available kernels include epanechnikov (default), biweight, cosine, gaussian, parzen, rectangle, and triangle.

• bwidth(#): specifies the half-width of the kernel. If bwidth() is not specified, a “rule-of-thumb” bandwidth is calculated and used. A local variable bandwidth may be specified in varname, in conjunction with an explicit smoothing grid using the at() option.

• degree(#): specifies the degree of the polynomial smooth. The default is degree(0) corresponding to a running mean.

• generate([newvar_x] newvar_s): stores the smoothing grid in newvar_x and the smoothed points in newvar_s. If at() is not specified, then both newvar_x and newvar_s must be specified. Otherwise, only newvar_s is to be specified.

• n(#): obtains the smooth at # points. The default is min(n, 50).

• at(varname): specifies a variable that contains the values at which the smooth should be calculated. By default, the smoothing is done on an equally spaced grid.

• nograph: suppresses graph. This option is often used with the generate() option.

• noscatter: suppresses scatterplot only. This option is useful when the number of resulting points would be so large as to clutter the graph.

• pwidth(#): specifies pilot bandwidth for standard error calculation.

79 3 Distribution function and quantile function estimators

The distribution function and the quantile function are equivalent ways of characterizing the probability distribution of a univariate rv Z.

3.1 The distribution function and the quantile function
The distribution function
The distribution function (df) of Z is a real-valued function defined on ℝ by

F(z) = Pr{Z ≤ z}, z ∈ ℝ.

A df F is called absolutely continuous if it is differentiable almost everywhere and satisfies
F(z) = ∫_{−∞}^z F'(t) dt.

Any function f such that F(z) = ∫_{−∞}^z f(t) dt is called a density for F. Any such density must agree with F' except possibly on a set with measure zero. If f is continuous at z_0, then f(z_0) = F'(z_0).

80 Properties of the distribution function
• F is nondecreasing, right-continuous, and satisfies F(−∞) = 0 and F(∞) = 1.

• For any z, Pr{Z = z} = F (z+) − F (z−).

where F(z+) = lim_{h↓0} F(z + h) and F(z−) = lim_{h↓0} F(z − h).

• If F is continuous at z,then

F (z+) = F (z−)=F (z),

and so Pr{Z = z} =0.

• If F is continuous, the rv F(Z) is distributed as U(0, 1).

• Given c

81 Quantiles

For any p ∈ (0, 1), a pth quantile of a rv Z is a number Q_p ∈ ℝ such that

Pr{Z < Q_p} ≤ p and Pr{Z > Q_p} ≤ 1 − p, (31)
or, equivalently, Pr{Z ≤ Q_p} ≥ p and Pr{Z ≥ Q_p} ≥ 1 − p.

A quantile corresponding to p = .50 is called a median of Z and will be denoted by ζ. Quantiles corresponding to p = .10,.25,.75 and .90 are called, respectively, lower deciles, lower quartiles, upper quartiles and upper deciles.

It can be shown that a pth quantile of Z is any solution to the problem

min_{c ∈ ℝ} E ρ_p(Z − c), (32)
where ρ_p denotes the asymmetric absolute loss function
ρ_p(v) = [p − 1{v < 0}] v = p|v| if v ≥ 0, and (1 − p)|v| if v < 0,
and 1{A} denotes the indicator function of the event A.

If p = .50, then ρ_p(v) = |v|/2 is proportional to the familiar symmetric absolute loss function and the corresponding minimum expected loss (E|Z − ζ|)/2 is proportional to the mean absolute deviation of Z from its median.
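A quick numerical check of (32), assuming a Gaussian sample: minimizing the empirical asymmetric absolute loss over c recovers (approximately) the pth sample quantile.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def rho(v, p):
    """Asymmetric absolute loss rho_p(v) = [p - 1{v < 0}] v."""
    return (p - (v < 0)) * v

rng = np.random.default_rng(5)
Z = rng.normal(size=1000)
for p in (0.25, 0.50, 0.75):
    res = minimize_scalar(lambda c: np.mean(rho(Z - c, p)), bounds=(-4, 4), method="bounded")
    print(p, round(res.x, 3), round(np.quantile(Z, p), 3))   # the two should agree closely
```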

82 Figure 9: The asymmetric absolute loss function p(v).

p = .25 p = .50 p = .75


83 The quantile function
The quantile function (qf) of a rv Z is a real-valued function defined on (0, 1) by
Q(p) = inf {z ∈ ℝ: F(z) ≥ p}. (33)

If Q_p is unique, then Q(p) = Q_p. While the df F maps ℝ into [0, 1], the qf Q maps (0, 1) into ℝ.

Properties of the quantile function
• Q is nondecreasing and left-continuous.

• Q(F(z)) ≤ z and F(Q(p)) ≥ p. Thus, F(z) ≥ p iff z ≥ Q(p). This justifies calling Q the inverse of F and using the notation Q(p) = F^{-1}(p). Also,
F(z) = sup {p ∈ [0, 1]: Q(p) ≤ z}. (34)

• Q is continuous iff F is strictly increasing.

• Q is strictly increasing iff F is continuous, in which case

Q(p) = inf {z ∈ ℝ: F(z) = p}, F(Q(p)) = p.

Further, if F is continuous, then Z has the same distribution as Q(U), where U ∼U(0, 1) (quantile transformation).

84 Figure 10: Quantile functions of some continuous distributions.

(Panels: Uniform, Chi-square(1), Exponential, Gumbel, Gaussian, Laplace, Logistic, Lognormal, Weibull with alpha=2, gamma=1; each quantile function plotted against p.)

85 Other properties of the quantile function • If the first two moments of Z exist, then

E Z = ∫_0^1 Q(p) dp, Var Z = ∫_0^1 Q(p)² dp − [∫_0^1 Q(p) dp]².

• If Z has a continuous and strictly positive density f in a neighborhood of Q(p), then the derivative of Q at p exists and is equal to
Q'(p) = 1/f(Q(p)).

The slope Q' is known as the sparsity function or quantile-density function, whereas the composition f ∘ Q is known as the density-quantile function. The quantile-density function is well defined whenever the density is strictly positive, and is strictly positive whenever the density is bounded.

• By differentiating the identity ln Q'(p) = − ln f(Q(p)), one also sees that the score function s = −f'/f must satisfy

s(Q(p)) = Q''(p)/[Q'(p)]² = − (d/dp) [1/Q'(p)] = − (d/dp) f(Q(p)).

• If the function g is monotonically increasing and left-continuous, then the qf of the rv g(Z) is equal to g ∘ Q. In particular, if g(z) = α + βz, with β ≥ 0, then the qf of g(Z) is equal to α + βQ(p).

• Given c

86 Summaries of a distribution
Center, spread, symmetry and (to a lesser extent) kurtosis of a distribution are intuitive but somewhat vague concepts.

If Z is a rv with finite 4th moment, then it is customary to define:

• Center in terms of the mean μ =EZ.

• Spread in terms of the variance σ2 =E(Z − μ)2.

• Symmetry in terms of the 3rd moment E(Z − μ)3.

• Kurtosis in terms of the 4th moment E(Z − μ)4.

An alternative is to use quantile-based summaries of a distribution. Thus, we may alternatively define:

• Center in terms of the median ζ = Q(.50).

• Spread in terms of either the interquartile range IQR = Q(.75) − Q(.25) or the interdecile range IDR = Q(.90) − Q(.10). The IQR (IDR) is the length of the interval of values on which the central 50% (80%) of the probability mass of Z is concentrated.

• Symmetry in terms of either the difference

[Q(.90) − Q(.50)] − [Q(.50) − Q(.10)]

or the ratio
[Q(.90) − Q(.50)] / [Q(.50) − Q(.10)],
or analogous measures with Q(.90) and Q(.10) replaced by Q(.75) and Q(.25).

• Kurtosis (or, more precisely, tail weight) in terms of either the difference IDR − IQR or the ratio IDR/IQR.

87 3.2 The empirical distribution function

Let Z1,...,Zn be a sample from the distribution of a rv Z, and consider the problem of estimating the df F of Z when F is not restricted to a known parametric family.

The key is the fact that, for any z,

F(z) = Pr{Z ≤ z} = E 1{Z ≤ z}, that is, F(z) is just the mean of the Bernoulli rv 1{Z ≤ z}. This suggests estimating F(z) by its sample counterpart

F̂(z) = n^{-1} Σ_{i=1}^n 1{Z_i ≤ z}, (35)
namely the fraction of sample points such that Z_i ≤ z. Viewed as a function defined on ℝ, F̂ is called the empirical distribution function (edf).
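A one-line NumPy sketch of (35), evaluated on a grid of points:

```python
import numpy as np

def edf(Z, z_grid):
    """Empirical df: fraction of sample points Z_i <= z, for each z in z_grid."""
    return np.mean(Z[None, :] <= np.asarray(z_grid)[:, None], axis=1)

Z = np.random.default_rng(6).normal(size=100)
print(edf(Z, [-1.0, 0.0, 1.0]))   # roughly (0.16, 0.50, 0.84) for N(0,1) data
```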

In fact, Fˆ is the df of a discrete probability measure, called the empirical measure, which assigns to a set A a probability equal to the fraction of sample points contained in A. In particular, each distinct sample point receives probability equal to 1/n, whereas each sample point repeated m ≤ n times receives probability equal to m/n.

Suppose that all observations are distinct, and let Z_[1] < ··· < Z_[n] denote the ordered sample values (the order statistics). Then F̂ is a step function that is equal to 0 for z < Z_[1], jumps by 1/n at each order statistic, and is equal to 1 for z ≥ Z_[n].

This shows that the edf contains all the information carried by the sample, except the order in which the observations have been drawn or are arranged.

88 Figure 11: Empirical distribution function Fˆ for a sample of 100 observations from a N (0, 1) distribution.

(Shows the population df and the empirical df plotted against z.)

89 The Riemann-Stieltjes integral In what follows we shall sometimes use the notation of the Riemann-Stieltjes integral (see e.g. Apostol 1974, Chpt. 7).

Given a df F and an integrable function g, we define
E_F g(Z) = ∫ g(z) dF(z) = ∫ g(z) f(z) dz
if F is continuous with probability density function f(z), and
E_F g(Z) = ∫ g(z) dF(z) = Σ_j g(z_j) f(z_j)
if F is discrete with probability mass function f(z_j).

Because the edf Fˆ is discrete, with probability mass function that assigns probability mass 1/n to each distinct sample point, we have

E_F̂ g(Z) = ∫ g(z) dF̂(z) = n^{-1} Σ_{i=1}^n g(Z_i).

90 The empirical process

Let Z_1,...,Z_n be a sample from the distribution of a rv Z with df F, let θ = T(F) = ∫ g(z) dF(z) be the population parameter of interest, let F̂ be the edf, and let

θ̂ = T(F̂) = ∫ g(z) dF̂(z) = n^{-1} Σ_{i=1}^n g(Z_i)
be a “plug-in” or bootstrap estimator of θ.

The estimation error associated with θ̂ is
θ̂ − θ = ∫ g(z) dF̂(z) − ∫ g(z) dF(z) = ∫ g(z) d[F̂(z) − F(z)].

Thus, the estimation error depends on the difference between the edf Fˆ and the population df F .

In asymptotic analysis, attention typically focuses on the rescaled estimation error
√n (θ̂ − θ) = ∫ g(z) dp_n(z),
where
p_n(z) = √n [F̂(z) − F(z)].

Notice that p_n(z) is a function defined on ℝ for any given sample, but is a rv for any given z. The collection

p_n = {p_n(z), z ∈ ℝ}
of all such rv's is therefore a stochastic process with index set ℝ. This process is called the empirical process.

91 Figure 12: Realization of the empirical process pn for a sample of n = 100 observations from a N (0, 1) distribution.


92 Finite sample properties

The key is the fact that F̂(z) is just the average of n iid rv's 1{Z_i ≤ z} that have a common Bernoulli distribution with parameter F(z).

Theorem 8 If Z1,...,Zn is a sample from a distribution with df F , then for all z:

(i) nFˆ(z) ∼ Bi(n, F (z));

(ii) E Fˆ(z)=F (z);

(iii) Var Fˆ(z)=n−1F (z)[1 − F (z)];

(iv) Cov[F̂(z), F̂(z')] = n^{-1} F(z)[1 − F(z')] for any z' > z.

Remarks:

• For any z, F̂(z) is unbiased for F(z).

• Because of the correlation between F̂(z) and F̂(z'), care is needed in drawing inference about the shape of F from small samples.

• As n → ∞, the sampling variance of F̂(z) vanishes, so F̂(z) is consistent for F(z). Further, as n → ∞, the covariance between F̂(z) and F̂(z') also vanishes.

• Being the average of n iid rv’s with finite variance, Fˆ(z) also satisfies the standard CLT and is therefore asymptotically normal.

93 Asymptotic properties

We now consider the properties of a sequence {F̂_n(z)} of edf's indexed by the sample size n. As for density estimation, we distinguish between local properties (valid at a finite set of z values) and global properties.

Given J distinct points
−∞ < z_1 < ··· < z_J < ∞,
let F̂_n = (F̂_n(z_1),...,F̂_n(z_J))', F = (F_1,...,F_J)' with F_j = F(z_j), and p_n = √n (F̂_n − F).

As n → ∞ we have:
• F̂_n →_{a.s.} F,
• p_n ⇒ N_J(0, Σ), where Σ is the J × J matrix with generic element
σ_jk = min(F_j, F_k) − F_j F_k, j,k = 1,...,J.

94 The Glivenko-Cantelli theorem

As a measure of distance between the edf Fˆn and the population df F consider the Kolmogorov-Smirnov statistic

D_n = sup_{−∞ < z < ∞} |F̂_n(z) − F(z)|.

Since Fˆn depends on the data, Dn is itself a rv.

It would be nice if D_n → 0, in some appropriate sense, as n → ∞. In fact, one can show that D_n →_{a.s.} 0 as n → ∞, that is,

Pr{ lim_{n→∞} D_n = 0 } = 1,
or, equivalently, the event that D_n does not converge to zero as n → ∞ occurs with zero probability under repeated sampling.

This fundamental result, known as the Glivenko-Cantelli theorem, implies that the entire probabilistic structure of Z can almost certainly be uncovered from the sample data provided that n is large enough. Vapnik (1995) calls this “the most important result in the foundation of statistics”.

95 Figure 13: Empirical distribution function Fˆ for a sample of 1, 000 observations from a N (0, 1) distribution.

(Shows the population df and the empirical df plotted against z.)

96 Multivariate and conditional edf The definition of edf and its sampling properties are easily generalized to the multivariate case.

If Z_1,...,Z_n is a sample from the distribution of a random m-vector Z, then the m-variate edf is defined as
F̂(z_1,...,z_m) = n^{-1} Σ_{i=1}^n 1{Z_i1 ≤ z_1,...,Z_im ≤ z_m} = n^{-1} Σ_{i=1}^n 1{Z_i1 ≤ z_1} ··· 1{Z_im ≤ z_m}.

Although conceptually straightforward, multivariate df estimates suffer from the curse-of-dimensionality problem and are difficult to represent graphically unless m is small, say equal to 2 or 3.

Example 9 If m = 2 and Z_i = (X_i, Y_i), then the bivariate edf is defined as

F̂(x, y) = n^{-1} Σ_{i=1}^n 1{X_i ≤ x, Y_i ≤ y} = n^{-1} Σ_{i=1}^n 1{X_i ≤ x} 1{Y_i ≤ y}. 2

97 3.3 The empirical quantile function We now consider the problem of estimating the qf of a rv Z given a sample Z1,...,Zn from its distribution.

Sample quantiles
A sample analogue of (31) is called a sample quantile. Thus, for any p ∈ (0, 1), a pth sample quantile is a number Q̂_p ∈ ℝ such that

n^{-1} Σ_{i=1}^n 1{Z_i < Q̂_p} ≤ p, n^{-1} Σ_{i=1}^n 1{Z_i > Q̂_p} ≤ 1 − p,
that is, about np of the observations are smaller than Q̂_p and about n(1 − p) are greater. Equivalently, we have the condition

Σ_{i=1}^n 1{Z_i < Q̂_p} ≤ np ≤ Σ_{i=1}^n 1{Z_i ≤ Q̂_p}, (36)
that is, np cannot be smaller than the number of observations strictly less than Q̂_p and cannot be greater than the number of observations less than or equal to Q̂_p.

A pth sample quantile is unique if np is not an integer, and lies in a closed interval defined by two adjacent order statistics if np is an integer.

Example 10 A sample quantile corresponding to p = 1/2 is called a sample median and will be denoted by ζ̂. In fact, if Z_[1] ≤ ··· ≤ Z_[n] are the sample order statistics, then:

• when n is odd (n/2 is not an integer), ζ̂ = Z_[k], where k = (n + 1)/2,

• when n is even (n/2 is an integer), ζ̂ is any point in the closed interval [Z_[k], Z_[k+1]], where k = n/2. 2

98 Asymmetric LAD
It can be shown that a pth sample quantile is also a solution to the sample analogue of (32), namely
min_{c ∈ ℝ} E_F̂ ρ_p(Z − c) = n^{-1} Σ_{i=1}^n ρ_p(Z_i − c), (37)
where ρ_p(v) = [p − 1{v < 0}] v is the asymmetric absolute loss function. Problem (37) is called an asymmetric least absolute deviations (ALAD) problem.

Notice that, since the function ρ_p(v) is convex but not differentiable at v = 0, the ALAD objective function L(c) = n^{-1} Σ_{i=1}^n ρ_p(Z_i − c) is itself convex but not differentiable at all c that correspond to a sample value. However, we can always define its left and right derivatives (subgradients)
L'(c−) = n^{-1} Σ_{i=1}^n [1{Z_i < c} − p], L'(c+) = n^{-1} Σ_{i=1}^n [1{Z_i ≤ c} − p],
and a pth sample quantile is any point c such that L'(c−) ≤ 0 ≤ L'(c+), which is just condition (36).

99 Figure 14: LAD objective function for different sample sizes (n = 4 and n = 5), plotted against c.

100 Figure 15: Subgradients L'(c−) and L'(c+) of the LAD objective function for n = 4, plotted against c.

101 Figure 16: Subgradients L'(c−) and L'(c+) of the LAD objective function for n = 5, plotted against c.

102 The empirical quantile function
The empirical quantile function (eqf) is a real-valued function defined on (0, 1) by
Q̂(p) = inf {z ∈ ℝ: F̂(z) ≥ p}.
Thus, the eqf is just the sample analogue of (33) and satisfies all properties of a qf. In particular, it is the left-continuous inverse of the right-continuous edf F̂.

Notice that Q̂(p) = Q̂_p when np is not an integer. Further, there is a close association between the values of the eqf and the sample order statistics, for
Q̂(p) = Z_[i], for (i − 1)/n < p ≤ i/n, i = 1,...,n.
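A minimal NumPy sketch of the eqf based on the order-statistic characterization above:

```python
import numpy as np

def eqf(Z, p):
    """Empirical quantile function: Q_hat(p) = Z_[i] for (i-1)/n < p <= i/n."""
    Z_sorted = np.sort(Z)
    n = len(Z_sorted)
    i = int(np.ceil(n * p))          # smallest i with i/n >= p
    return Z_sorted[i - 1]

Z = np.random.default_rng(7).normal(size=100)
print(eqf(Z, 0.5), np.median(Z))     # close, but the eqf does not interpolate
```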

The rescaled difference
q_n(p) = √n [Q̂(p) − Q(p)]
is a function on (0, 1) for any given sample, and is a rv for any given p. The collection q_n = {q_n(p), 0 < p < 1} of all such rv's is a stochastic process called the sample quantile process.

103 Figure 17: Empirical quantile function Qˆ for a sample of 100 observations from a N (0, 1) distribution.

(Shows the population qf and the empirical qf plotted against p.)

104 Figure 18: Realization of the sample quantile process qn for a sample of n = 100 observations from a N (0, 1) distribution.


105 Finite sample properties Because sample quantiles essentially coincide with sample order statistics, we present a result on the df and the density of the sample order statistic Z[i], corresponding to the sample quantile Qˆ(i/n).

Theorem 9 Let Z_1,...,Z_n be a sample from a continuous distribution with df F and density f. Then the df of Z_[i], 1 ≤ i ≤ n, is
G(z) = Pr{Z_[i] ≤ z} = Σ_{k=i}^n C(n, k) F(z)^k [1 − F(z)]^{n−k}
and its density is
g(z) = G'(z) = n C(n−1, i−1) f(z) F(z)^{i−1} [1 − F(z)]^{n−i},
where C(n, k) denotes the binomial coefficient.

Interest often centers not just on a single quantile, but on a finite number of them. This is the case when we seek a detailed description of the shape of a probability distribution, or we are interested in constructing some L-estimate, that is, a linear combination of sample quantiles, such as the sample IQR, the sample IDR, or a trimmed mean.

Sample quantiles corresponding to different values of p are dependent. Hence, from a practical point of view, what matters is their joint distribution. The next result presents the joint density of two sample order statistics, Z_[i] and Z_[j], corresponding respectively to the sample quantiles Q̂(i/n) and Q̂(j/n).

Theorem 10 Let Z_1,...,Z_n be a sample from a continuous distribution with df F and density f. Then the joint density of Z_[i] and Z_[j], 1 ≤ i < j ≤ n, is

g(u, v) = [n! / ((i − 1)! (j − 1 − i)! (n − j)!)] f(u) f(v) F(u)^{i−1} [F(v) − F(u)]^{j−1−i} [1 − F(v)]^{n−j}.

106 Asymptotic properties Since the exact sampling properties of a finite set of sample quantiles are somewhat complicated, we now consider their asymptotic properties.

Let Q̂_n be a solution to (37) (to simplify notation we henceforth drop the reference to p) and again assume that Z_1,...,Z_n is a sample from a continuous distribution with df F and density f, which is finite and strictly positive everywhere on ℝ. Under this assumption, Q = F^{-1}(p) is the unique solution to (32) for any 0 < p < 1.

The sample objective function
L_n(c) = n^{-1} Σ_{i=1}^n ρ_p(Z_i − c)
can be shown to converge almost surely as n → ∞, uniformly on ℝ, to the population objective function

L(c) = E ρ_p(Z − c).

Thus, as n → ∞, Q̂_n →_{a.s.} Q for any 0 < p < 1.

107 Asymptotic normality of a sample quantile
Because the ALAD objective function L_n(c) = n^{-1} Σ_{i=1}^n ρ_p(Z_i − c) is not differentiable, we cannot derive asymptotic normality of sample quantiles by taking a 2nd-order Taylor expansion of L_n.

There are various approaches to the problem. A simple one is based on the fact that the subgradient of L_n,
L_n'(c) = n^{-1} Σ_{i=1}^n [1{Z_i < c} − p],
is nondecreasing in c. Next notice that, for any δ, √n (Q̂_n − Q) < δ iff Q̂_n < Q + δ/√n and, up to ties, this event occurs iff the fraction of observations below Q + δ/√n is at least p, that is,
Pr{√n (Q̂_n − Q) < δ} = Pr{W̄_n > 0} + o(1),
with W̄_n = n^{-1} Σ_{i=1}^n W_i, where W_i = 1{Z_i < Q + δ/√n} − p. The W_i are iid with mean F(Q + δ/√n) − p ≈ f(Q) δ/√n and variance converging to p(1 − p). Applying a CLT to the standardized average gives
Pr{√n (Q̂_n − Q) < δ} → Φ(δ/ω),
where ω = [p(1 − p)]^{1/2}/f(Q). Hence
√n (Q̂_n − Q) ⇒ N(0, ω²).

108 Asymptotic joint normality of sample quantiles
Given J distinct values

0 < p_1 < ··· < p_J < 1, let Q̂_n = (Q̂_n(p_1),...,Q̂_n(p_J))', Q = (Q_1,...,Q_J)' with Q_j = Q(p_j), and q_n = √n (Q̂_n − Q).

Extending the previous argument one can show that, if F possesses a contin- uous density f which is positive and finite at Q1,...,QJ ,then

qn ⇒NJ (0, Ω) as n →∞, where Ω is the J × J matrix with generic element

ω_jk = [min(p_j, p_k) − p_j p_k] / [f(Q_j) f(Q_k)], j,k = 1,...,J. (38)

109 Bahadur representation

Bahadur (1966) and Kiefer (1967) established the close link between the sample quantile process q_n(p) = √n [Q̂_n(p) − Q(p)] and the empirical process p_n(z) = √n [F̂_n(z) − F(z)].

They showed that, under regularity conditions, with probability tending to one as n →∞,

Q̂_n(p) − Q(p) = n^{-1} Σ_{i=1}^n [p − 1{Z_i ≤ Q(p)}] / f(Q(p)) + R_n,
uniformly for p ∈ [ε, 1 − ε] and some ε > 0, where the remainder R_n is O_p(n^{-3/4} (ln ln n)^{3/4}). Multiplying both sides by √n gives
√n [Q̂_n(p) − Q(p)] = √n [F(Q(p)) − F̂_n(Q(p))] / f(Q(p)) + R_n',

where the remainder R_n' is O_p(n^{-1/4} (ln ln n)^{3/4}). Thus,

q_n(p) = − p_n(Q(p)) / f(Q(p)) + o_p(1).

Multivariate normality of sample quantiles follows immediately from this rep- resentation.

110 Estimating the asymptotic variance of sample quantiles
From (38), a pth sample quantile has asymptotic variance

AV(Q̂_n(p)) = p(1 − p) / f(Q(p))².

In particular, a sample median ζ̂_n has asymptotic variance
AV(ζ̂_n) = 1 / [4 f(ζ)²],
where ζ = Q(.50).

In practice, the asymptotic variance of sample quantiles has to be estimated from the data. This requires estimating the density f at Q(p).

As a simple consistent estimator of f(Q(p))^{-1}, Cox & Hinkley (1974) suggest

[Z_([np]+h_n) − Z_([np]−h_n)] / (2 h_n/n),
where [x] denotes the integer part of x and h_n is an integer bandwidth such that h_n/n goes to zero as n → ∞.
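A minimal sketch of this sparsity estimator and the resulting standard error of a sample quantile; the integer offset h is an illustrative choice.

```python
import numpy as np

def quantile_std_error(Z, p, h=10):
    """SE of the p-th sample quantile via the Cox-Hinkley sparsity estimate."""
    Z_sorted = np.sort(Z)
    n = len(Z_sorted)
    k = int(np.floor(n * p))                       # [np], integer part
    sparsity = (Z_sorted[min(k + h, n) - 1] - Z_sorted[max(k - h, 1) - 1]) / (2 * h / n)
    avar = p * (1 - p) * sparsity**2               # p(1-p)/f(Q(p))^2
    return np.sqrt(avar / n)

Z = np.random.default_rng(8).normal(size=2000)
print(quantile_std_error(Z, 0.5))                  # compare with sqrt((pi/2)/2000) ~ 0.028
```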

111 Asymptotic relative efficiency of the median to the mean
We may use result (38) to compare the asymptotic properties of the sample mean Z̄_n and the sample median ζ̂_n for a random sample from a rv Z with a unimodal distribution that is symmetric about μ, with variance 0 < σ² < ∞ and density f. Under these assumptions
√n (Z̄_n − μ) ⇒ N(0, σ²), √n (ζ̂_n − μ) ⇒ N(0, 1/[4 f(μ)²]).

Because the two estimators are asymptotically normal with the same asymptotic mean, their comparison may be based on the ratio of their asymptotic variances
ARE(ζ̂_n, Z̄_n) = AV(Z̄_n) / AV(ζ̂_n) = 4 σ² f(μ)²,
called the asymptotic relative efficiency (ARE) of the sample median to the sample mean.

Because Var Z̄_n ≈ n^{-1} AV(Z̄_n) and Var ζ̂_n ≈ n^{-1} AV(ζ̂_n), the ARE is equal to the ratio of the sample sizes that the sample mean and the sample median respectively need to attain the same level of precision (inverse of the sampling variance). Thus, if 100 observations are needed for the sample median to attain a given level of precision, then the sample mean would need about 100 × ARE observations.
The ARE of the sample median increases with the peakedness f(μ) of the density f at μ. It is easy to verify that
ARE(ζ̂_n, Z̄_n) = 2/π ≈ .64 if Z is Gaussian, π²/12 ≈ .82 if Z is logistic, and 2 if Z is double exponential.
The table below shows the ARE of the sample median for t distributions with m ≥ 3 dof.

m     3     4     5     8     ∞
ARE   1.62  1.12  .96   .80   .64
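As a quick check of the Gaussian entry (the m = ∞ limit): for Z ∼ N(μ, σ²),

$$ f(\mu) = \frac{1}{\sigma\sqrt{2\pi}}, \qquad \mathrm{ARE}(\hat{\zeta}_n, \bar{Z}_n) = 4\sigma^2 f(\mu)^2 = \frac{4\sigma^2}{2\pi\sigma^2} = \frac{2}{\pi} \approx .64. $$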

112 3.4 Conditional df and qf
Consider a random vector (X, Y), where X is ℝ^k-valued and Y is real-valued. The conditional distribution of Y given X = x may equivalently be characterized through either:

• The conditional df, a real-valued function defined on ℝ × ℝ^k by

F (y | x)=Pr{Y ≤ y | X = x}.

• The conditional qf (or quantile regression function), a real-valued function defined on (0, 1) × ℝ^k by

Q(p | x) = inf {y ∈ ℝ: F(y | x) ≥ p}.

For any fixed x, F(y | x) and Q(p | x) satisfy all properties of a df and a qf, respectively. In particular, for any fixed x, the following monotonicity or no-crossing properties must be satisfied:

• if y' > y, then F(y' | x) ≥ F(y | x),

• if p' > p, then Q(p' | x) ≥ Q(p | x).

To stress the relationship between the conditional qf and the conditional df, the notation Q(p | x) = F^{-1}(p | x) will also be used.

113 Conditional and marginal df's and qf's
The Law of Iterated Expectations implies that
E Y = E μ(X) = ∫ μ(x) dH(x),
where H is the df of X. Thus, the difference μ_1 − μ_2 in the unconditional mean of Y for two sub-populations may be decomposed as follows
μ_1 − μ_2 = ∫ [μ_1(x) − μ_2(x)] dH_1(x) + ∫ μ_2(x) [dH_1(x) − dH_2(x)],
where the 1st term on the rhs reflects differences in the CMF of the two subpopulations, while the 2nd term reflects differences in their composition.

When the two CMF's are linear, that is, μ_j(x) = β_j'x, j = 1, 2, we have the so-called Blinder-Oaxaca decomposition

μ_1 − μ_2 = (β_1 − β_2)'μ_{X,1} + β_2'(μ_{X,1} − μ_{X,2}),
where μ_{X,j} denotes the mean of X in group j = 1, 2.

In the case of a df we have F(y) = ∫ F(y | x) dH(x), so differences in the marginal distribution of Y for two sub-populations may be decomposed as follows
F_1(y) − F_2(y) = ∫ [F_1(y | x) − F_2(y | x)] dH_1(x) + ∫ F_2(y | x) [dH_1(x) − dH_2(x)], (39)
where the 1st term on the rhs reflects differences in the conditional df of the two sub-populations, while the 2nd term reflects differences in their composition.

Simple decompositions of this kind are not available for quantiles. Machado & Mata (2005) and Melly (2005) propose methods essentially based on inversion of (39). Recently, Firpo, Fortin & Lemieux (2007) have proposed an approximate method based on the recentered influence function.

114 Estimation when X is discrete
The methods reviewed so far can easily be extended to estimation of the conditional df and qf when the covariate vector X has a discrete distribution.

All that is needed is to partition the data according to the values of X. The conditional df and the regression qf corresponding to each of these values may then be estimated by treating the relevant subset of observations as if they were a separate sample.

Example 12 When X is a discrete rv and x is one of its possible values, the conditional df of Y given X = x is defined as

F(y | x) = Pr{Y ≤ y | X = x} = Pr{Y ≤ y, X = x} / Pr{X = x}.

If O(x)={i: Xi = x} is the set of sample points such that Xi = x and n(x) is their number, then the sample counterpart of Pr{X = x} is the fraction n(x)/n of sample points such that Xi = x. Hence, if n(x) > 0, a reasonable estimate of F (y | x) is the fraction of sample points in O(x) such that Yi ≤ y

F̂(y | x) = n(x)^{-1} Σ_{i=1}^n 1{Y_i ≤ y, X_i = x} = n(x)^{-1} Σ_{i ∈ O(x)} 1{Y_i ≤ y}. (40)

Viewed as a function of y for x fixed, Fˆ(y | x) is called the conditional edf of Yi given Xi = x. 2

For these estimates to make sense, the number of data points corresponding to each of the possible values of X ought to be sufficiently large. This approach breaks down, therefore, when either X is continuous or X is discrete but its support contains too many points relative to the available sample size.

In what follows, we begin with the problem of estimating the conditional qf Q(p | x) when X is a continuous random k-vector. In the last two decades, this problem has received considerable attention in econometrics.

115 3.5 Estimating the conditional qf

Let (X_1,Y_1),...,(X_n,Y_n) be a sample from the joint distribution of (X, Y). The conditional qf Q(p | x), viewed as a function of x for a given p, may be obtained by solving the problem

min_{c ∈ C} E ρ_p(Y − c(X)), 0 < p < 1, (41)
where the minimization is over a suitable class C of functions of x.

For a given p ∈ (0, 1), this suggests estimating Q(p | x) by choosing a function of x, out of a suitable family C_* ⊂ C, so as to solve the sample analogue of (41)

min_{c ∈ C_*} n^{-1} Σ_{i=1}^n ρ_p(Y_i − c(X_i)), 0 < p < 1.

To recover Q(p | x), now viewed as a function of both p and x, it is customary to select J distinct values

0 < p_1 < ··· < p_J < 1

and then estimate J distinct functions Q_1,...,Q_J, each defined on ℝ^k, with Q_j(x) = Q(p_j | x). There may be as many such values as one wishes. By suitably choosing their number and position, one may get a reasonably accurate description of Q(p | x).

Given an estimate Qˆ(p | x)ofQ(p | x), F (y | x) may be estimated by inversion

Fˆ(y | x)=sup{p ∈ (0, 1): Qˆ(p | x) ≤ y}.

This is a proper df iff Qˆn is a proper qf.

116 Linear quantile regression
In the original approach of Koenker & Bassett (1978), C_* is the class of linear functions of x, that is, c(x) = b'x. In this case, the sample counterpart of problem (41) is the ALAD problem

min_{b ∈ ℝ^k} n^{-1} Σ_{i=1}^n ρ_p(Y_i − b'X_i), 0 < p < 1. (42)

If β̂(p) is a solution to (42), then an estimate of Q(p | x) is

Q̂(p | x) = β̂(p)'x.

By suitably redefining the elements of the covariate vector X_i, one may easily generalize problem (42) to cases in which C_* is a class of functions that depend linearly on a finite dimensional parameter vector, such as the class of polynomial functions of X of a given degree.

117 Computational aspects of linear quantile regression
The lack of smoothness of the ALAD objective function implies that gradient methods cannot be employed to solve (42).

An ALAD estimate may however be computed efficiently using linear programming methods (Barrodale & Roberts 1973, Koenker & d'Orey 1987). This is because the ALAD problem (42) may be reformulated as the following linear program
min_{β ∈ ℝ^k, U^+, U^-} p ι'U^+ + (1 − p) ι'U^-
subject to Y = Xβ + U^+ − U^-, U^+ ≥ 0, U^- ≥ 0,
where ι is the n-vector of ones, and U^+ and U^- are n-vectors with generic elements equal to U_i^+ = max(0, Y_i − β'X_i) and U_i^- = −min(0, Y_i − β'X_i), respectively.

Simpler but cruder algorithms based on iterative WLS may also be employed.
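A minimal sketch of the primal linear program above using scipy.optimize.linprog; the data-generating process is purely illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def quantile_regression_lp(X, Y, p):
    """Solve min p*1'u+ + (1-p)*1'u-  s.t.  X beta + u+ - u- = Y,  u+, u- >= 0."""
    n, k = X.shape
    c = np.concatenate([np.zeros(k), p * np.ones(n), (1 - p) * np.ones(n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * k + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=Y, bounds=bounds, method="highs")
    return res.x[:k]                               # beta_hat(p)

rng = np.random.default_rng(9)
n = 200
x = rng.uniform(0, 2, n)
X = np.column_stack([np.ones(n), x])
Y = 1 + 2 * x + rng.normal(scale=0.5, size=n)
for p in (0.25, 0.50, 0.75):
    print(p, np.round(quantile_regression_lp(X, Y, p), 2))   # slopes near 2, intercepts shift with p
```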

The dual of the above linear program is

max_{δ ∈ [p−1, p]^n} Y'δ subject to X'δ = 0
or, setting α = δ + (1 − p)ι_n,

max_{α ∈ [0,1]^n} Y'α subject to X'α = (1 − p) X'ι_n.

The solution α̂_p to the dual problem connects quantile regression to regression rank scores, the regression generalization of classical rank scores proposed by Gutenbrunner & Jurečková (1992).

118 Asymptotics for linear quantile regression estimators

Let β̂_n = (β̂_n1,...,β̂_nJ), where β̂_nj = β̂_n(p_j), be a kJ-vector of linear quantile regression estimators, and consider the following assumptions:

Assumption 2 (X_1,Y_1),...,(X_n,Y_n) is a random sample from the joint distribution of (X, Y).
Assumption 3 E XX' = P, a finite pd matrix.
Assumption 4 The conditional df of Y given X = x satisfies Pr{Y ≤ y | X = x} = F(y − β'x), where F has a continuous density f that is strictly positive at F^{-1}(p) for all p such that 0 < p < 1.

Assumption 4 is equivalent to the assumption that Y = β'X + U, where U is distributed independently of X with df F (location shift model). Under Assumptions 2–4,
√n (β̂_n − β) ⇒ N(0, Ω ⊗ P^{-1}), (43)
where Ω is the J × J matrix with generic element

ω_jk = [min(p_j, p_k) − p_j p_k] / [f(Q(p_j)) f(Q(p_k))], j,k = 1,...,J,
where Q = F^{-1} is the qf associated with F. Suppose, in addition, that F has finite variance σ². Then, under the above assumptions, result (43) implies that the ARE of a linear quantile regression estimator β̂_n(p) relative to the OLS estimator β̃_n is equal to
ARE(β̂_n(p), β̃_n) = σ² f(Q(p))² / [p(1 − p)].

In practice, to construct estimates of the asymptotic variance of linear quantile regression estimators, one needs estimates of the density of the quantile regression errors U_i = Y_i − X_i'β.

119 Extensions Koenker & Portnoy (1987) provide a uniform Bahadur representation for linear quantile regression estimators.

They show that, under Assumptions 2–4, with probability tending to one as n → ∞,
√n [β̂_n(p) − β(p)] = P^{-1} n^{-1/2} Σ_{i=1}^n X_i [p − 1{U_i ≤ Q(p)}] / f(Q(p)) + o_p(1), (44)
uniformly for p ∈ [ε, 1 − ε] and some ε > 0.

Among other things, the asymptotically linear representation (44) helps explain both the asymptotic normality and the good robustness properties of linear quantile regression estimators with respect to outliers in the Y-space (although not to outliers in the X-space).

Both properties follow from the fact that the influence function of β̂(p) is equal to
P^{-1} X_i [p − 1{U_i ≤ Q(p)}] / f(Q(p)),
which is a bounded function of U_i (although not of X_i).

120 Drawbacks of linear quantile regression estimators
Although increasingly used in empirical work to describe the conditional distribution of an outcome of interest, linear quantile regression estimators have several drawbacks.

Some have to do with their specific properties, especially their behavior under heteroskedasticity. The relevant issues are:

• the validity of the linearity assumption,

• the form of the asymptotic variance of linear quantile regression estimators when the linearity assumption does not hold,

• how to consistently estimate this asymptotic variance.

Others have to do with general properties of quantiles, especially:

• the difficulty of imposing the no-crossing condition,

• the difficulty of generalizing quantiles to the case when Y is vector valued,

• the complicated relationship between marginal and conditional quantiles.

121 Behavior under heteroskedasticity
To illustrate the problem, let X be a scalar rv and suppose first that Y = μ(X) + U, where the regression error U is independent (not just mean independent) of X with continuous df F. Then F(y | x) = F(y − μ(x)). By definition, the pth conditional quantile of Y satisfies F(Q(p | x) | x) = p. Hence Q(p | x) = μ(x) + Q(p), where Q(p) = F^{-1}(p) is the pth quantile of F. Thus, for any p' ≠ p,
Q(p' | x) − Q(p | x) = Q(p') − Q(p) for all x,
that is, the distance between any pair of conditional quantiles is independent of x. If μ(x) = α + βx, then Q(p | x) = [α + Q(p)] + βx, that is, the conditional quantiles of Y are a family of parallel lines with common slope β.
Now suppose that Y is conditionally heteroskedastic, that is, Y = μ(X) + σ(X) U, where the function σ(·) is strictly positive. Then
F(y | x) = F((y − μ(x))/σ(x)),
and so Q(p | x) = μ(x) + σ(x) Q(p). In this case:
• even if μ(x) is linear in x, conditional quantiles need not be linear in x,
• the distance between any pair of them depends on x:
Q(p' | x) − Q(p | x) = σ(x) [Q(p') − Q(p)].

Partial exceptions are the cases when:
• μ(x) is linear and F is symmetric about zero, implying that the conditional median Q(.5 | x) (but not other quantiles) is linear in x,
• σ(x) is linear in x, implying that conditional quantiles are linear in x, although no longer with a common slope.

122 Figure 19: Quantiles of Y | X = x ∼N(μ(x),σ2(x)) when μ(x)=1+x and either σ2(x) = 1 (homoskedasticity) or σ2(x)=1+(2x + .5)2 (heteroskedas- ticity).

(Two panels, homoskedasticity and heteroskedasticity, with quantiles plotted against x.)

123 Remarks • Linear quantile regressions may be a poor approximation when data are conditionally heteroskedastic and the square root of the condi- tional variance function is far from being linear in x.

• In the general case when μ(x) is an arbitrary function, the conditional quantiles of Y are of the form

Q(p | x)=μ(x)+σ(x)Q(p).

In the absence of prior information, it is impossible to determine whether nonlinearity of Q(p | x) is due to nonlinearity of μ(x) or heteroskedasticity.

• From the practical point of view, heteroskedasticity implies that estimated linear quantile regressions may cross each other, thus violating a fundamental property of quantiles and complicating the interpretation of the results of a statistical analysis.

• It also implies that the asymptotic variance matrix of β̂ is no longer given by (43). The block corresponding to the asymptotic covariance between β̂_j and β̂_k is instead equal to

[min(p_j, p_k) − p_j p_k] B_j^{-1} P B_k^{-1},

where
B_j = E f(X'β_j | X) XX'
and f(y | x) denotes the conditional density of Y given X = x. An analog estimator of B_j is not easy to obtain, as it requires estimating the conditional density f(y | x), so bootstrap methods are typically used to carry out inference in this case.

124 Figure 20: Scatterplot of the data and estimated quantiles for a random sample of 200 observations from the model Y | X ∼N(μ(X),σ2(X)), with X ∼N(0, 1), μ(X)=1+X and either σ2(X) = 1 (homoskedasticity) or σ2(X)=1+(2X + .5)2 (heteroskedasticity).

(Four panels: scatterplots and estimated quantiles under homoskedasticity and heteroskedasticity, plotted against x.)

Interpreting linear quantile regressions
Consider the ordinary (mean) regression model Y = β'X + U, where U and X are uncorrelated. In this model, β'X may equivalently be interpreted as the best linear predictor of Y given X, or as the best linear approximation to the conditional mean μ(X) = E(Y | X), that is,

β = argmin_{b ∈ ℝ^k} E(Y − b'X)² = argmin_{b ∈ ℝ^k} E_X [μ(X) − b'X]²,
where the first expectation is with respect to the joint distribution of X and Y, while the second is with respect to the marginal distribution of X.

Angrist, Chernozhukov and Fernandez-Val (2006) show that a similar interpretation is available for linear quantile regression. More precisely, they show that if β(p) is the slope of a linear quantile regression, then

β(p) = argmin_{b ∈ ℝ^k} E ρ_p(Y − b'X) = argmin_{b ∈ ℝ^k} E_X w(X, b) [Q(p | X) − b'X]²,
with
w(x, b) = ∫_0^1 (1 − u) f(Q(p | x) + uΔ(x, b) | x) du,
where Δ(x, b) = b'x − Q(p | x) and f(y | x) is the conditional density of Y given X = x. Thus, linear quantile regression minimizes a weighted quadratic measure of discrepancy between the population conditional qf and its best linear approximation. The weighting function, however, is not easily interpretable.

For an alternative proof of this result, see Theorem 2.7 in Koenker (2005).

126 Nonparametric quantile regression estimators To overcome the problems arising with the linearity assumption, several non- parametric quantile regression estimators have been proposed. These include:

• kernel and nearest neighbor methods (Chaudhuri 1991),

• regression splines with a fixed number of knots (Hendricks & Koenker 1992),

• locally linear fitting (Yu & Jones 1998),

• smoothing splines and penalized likelihood (Koenker, Ng and Portnoy 1994).

In all these cases, the family of functions C∗ is left essentially unrestricted, except for smoothness.

Drawbacks
• Curse-of-dimensionality problem: It is not clear how to generalize these estimators to cases when there are more than two or three covariates.

• Because all these estimators are nonlinear, it is hard to represent them in ways that facilitate comparisons. For example, it is not clear how to generalize the concepts of equivalent kernel and equivalent degrees of freedom that prove so useful for linear smoothers.

127 3.6 Estimating the conditional df
If the interest is not merely in a few quantiles but in the whole conditional distribution of Y given a random k-vector X, why not estimate the conditional df F(y | x) directly?

Given a sample (X1,Y1),...,(Xn,Yn) from the joint distribution of (X, Y ), Stone (1977) suggested estimating F (y | x) nonparametrically by a nearest- neighbor estimator of the form

\[
\hat F(y \mid x) = \sum_{i=1}^{n} W_i(x)\, 1\{Y_i \le y\}, \qquad -\infty < y < \infty.
\]

Viewed as a function of y for given x, F̂(y | x) is a proper df if:

• W_i(x) ≥ 0, i = 1,...,n,

• \sum_{i=1}^{n} W_i(x) = 1.

Stute (1986) studied the properties of two estimators of this type:

• The first is based on weights of the form
\[
W_i(x) = \frac{1}{n h_n}\, K\!\left( \frac{\hat H(x) - \hat H(X_i)}{h_n} \right),
\]

where Ĥ(x) is the edf of the X's, K is a smooth kernel with bounded support and h_n is the bandwidth. For any x, F̂(y | x) is not a proper df, as the kernel weights do not add up to one.

• The second, denoted by F̂*(y | x), is a proper df, for it uses the normalized weights W*_i(x) = W_i(x) / \sum_j W_j(x).

Estimators of this kind tend to do rather poorly when the data are sparse, which is typically the case when k is greater than 2 or 3.
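For concreteness, here is a minimal sketch of an estimator of the second (proper-df) type, with normalized kernel weights built from the edf of the X's. The Epanechnikov kernel, the bandwidth and the simulated design are my own illustrative choices, not taken from Stute (1986).

```python
# Illustrative sketch of a proper-df conditional df estimator with normalized
# kernel weights built from the edf of the X's (in the spirit of the second
# estimator above). Kernel, bandwidth and data are my own choices.
import numpy as np

def ecdf(values, points):
    """Empirical df of `values` evaluated at `points`."""
    s = np.sort(values)
    return np.searchsorted(s, points, side="right") / len(s)

def cond_df_hat(y0, x0, X, Y, h):
    """Estimate F(y0 | x0) with Epanechnikov weights on the edf scale."""
    H = ecdf(X, np.append(X, x0))              # edf at each X_i and at x0 (last entry)
    u = (H[-1] - H[:-1]) / h
    w = np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)   # bounded-support kernel
    w = w / w.sum()                            # normalized weights -> proper df in y
    return np.sum(w * (Y <= y0))

rng = np.random.default_rng(2)
n = 1000
X = rng.normal(size=n)
Y = 1 + X + rng.normal(size=n)
# At x0 = 0 the true value is F(0 | 0) = Phi(-1), roughly 0.16.
print(round(cond_df_hat(0.0, 0.0, X, Y, h=0.1), 3))
```

With one covariate this works reasonably well; as noted above, performance deteriorates quickly as k grows.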

Avoiding the curse of dimensionality

We now describe a simple semi-nonparametric method that appears to perform well even when k is large relative to the available data. The basic idea is to partition the range of Y into J + 1 intervals defined by a set of “knots” −∞ < y_1 < ··· < y_J < ∞ and to model the conditional df at each knot, F_j(x) = F(y_j | x), j = 1,...,J.

If the conditional distribution of Y is continuous with support on the whole real line then, at any x in the support of X, the sequence of functions {Fj(x)} must satisfy the following conditions:

\[
0 < F_j(x) < 1, \qquad j = 1, \ldots, J, \tag{45}
\]

\[
0 < F_1(x) < F_2(x) < \cdots < F_J(x) < 1. \tag{46}
\]

One way of automatically imposing the nonnegativity condition (45) is to model not Fj(x) directly, but rather the conditional log odds

\[
\eta_j(x) = \ln \frac{F_j(x)}{1 - F_j(x)}.
\]

Given an estimate η̂_j(x) of η_j(x), one may then estimate F_j(x) by

\[
\hat F_j(x) = \frac{\exp \hat\eta_j(x)}{1 + \exp \hat\eta_j(x)}.
\]

Estimation

By the analogy principle, η_j may be estimated by maximizing the sample log likelihood
\[
L(\eta) = \sum_{i=1}^{n} \Big[ 1\{Y_i \le y_j\}\, \eta(X_i) - \ln\big(1 + \exp \eta(X_i)\big) \Big]
\]
over a suitable family H* of functions of x.

This entails fitting J separate logistic regressions, one for each threshold value y_j. Boundedness of 1{Y_i ≤ y_j} ensures good robustness properties with respect to outliers in the Y-space.
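A minimal sketch of this estimation step, under a simple linear-index (logit) specification of the kind discussed in Example 13 below: J separate logit fits of 1{Y_i ≤ y_j} on X_i, inverted through the logistic transform. The knots, the data-generating process and the use of statsmodels are my own illustrative assumptions.

```python
# Illustrative sketch: J separate logit fits of 1{Y_i <= y_j} on X_i, inverted
# to estimates of F(y_j | x). Knots, data and the linear-index specification
# are my own assumptions; statsmodels is assumed to be available.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 2000
x = rng.normal(size=n)
y = 1 + x + rng.normal(size=n)
X = sm.add_constant(x)
knots = np.quantile(y, [0.1, 0.25, 0.5, 0.75, 0.9])     # y_1 < ... < y_J

def cond_df_at(x0):
    """Return (F_hat_1(x0), ..., F_hat_J(x0)) from the J separate logit fits."""
    x0 = np.array([1.0, x0])
    F = []
    for yj in knots:
        d = (y <= yj).astype(float)                      # binary response 1{Y_i <= y_j}
        eta = np.asarray(sm.Logit(d, X).fit(disp=0).params) @ x0
        F.append(np.exp(eta) / (1 + np.exp(eta)))        # invert the log-odds transform
    return np.array(F)

print(cond_df_at(0.0).round(3))    # estimated F(y_j | x = 0) at the five knots
```

Each fit uses the full sample, and the resulting F̂_j(x) need not be monotone in j; that issue is taken up below.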

Alternative specifications of H∗ correspond to alternative estimation methods. As for quantile regressions, one may distinguish between:

• parametric methods (e.g. logit, probit),

• nonparametric methods with H∗ unrestricted,

• nonparametric methods with H∗ restricted (such as projection pur- suit, additive or locally-linear modeling).

Example 13 The simplest case is when η(y | x) = θ(y)'x.

For j = 1,...,J, let θ̂_jn be the logit estimator of θ_j = θ(y_j) and let η̂_jn(x) = θ̂'_jn x be the implied estimator of η_j(x). Also let
\[
F(x) = \begin{pmatrix} F_1(x) \\ \vdots \\ F_J(x) \end{pmatrix}, \qquad
\theta = \begin{pmatrix} \theta_1 \\ \vdots \\ \theta_J \end{pmatrix}, \qquad
\eta(x) = \begin{pmatrix} \eta_1(x) \\ \vdots \\ \eta_J(x) \end{pmatrix} = (I_J \otimes x')\, \theta,
\]

where I_J is the J × J unit matrix. Finally, let F̂_n(x), θ̂_n and η̂_n(x) = (I_J ⊗ x') θ̂_n be the estimators of F(x), θ and η(x), respectively.

Under mild regularity conditions, as n → ∞,
\[
\sqrt{n}\, (\hat\theta_n - \theta) \Rightarrow \mathcal{N}_{kJ}(0, \mathcal{I}^{-1}),
\]
where $\mathcal{I} = \mathrm{E}[V(X) \otimes XX']$ and V(x) is the J × J matrix with generic element
\[
V_{jk}(x) = \min\big(F_j(x), F_k(x)\big) - F_j(x)\, F_k(x).
\]
Hence, as n → ∞,
\[
\sqrt{n}\, \big[\hat\eta_n(x) - \eta(x)\big] \Rightarrow \mathcal{N}_J\big(0, A(x)\big),
\]
where $A(x) = (I_J \otimes x')\, \mathcal{I}^{-1}\, (I_J \otimes x)$. Therefore, as n → ∞,
\[
\sqrt{n}\, \big[\hat F_n(x) - F(x)\big] \Rightarrow \mathcal{N}_J\big(0, \Sigma(x)\big),
\]
where Σ(x) = V(x) A(x) V(x). □

Figure 21: Estimated conditional df’s of log monthly earnings (Peracchi 2002).

[Figure 21 shows one panel per country — Austria, Belgium, Denmark, Finland, France, Germany, Greece, Ireland, Italy, Luxembourg, Netherlands, Portugal, Spain, UK — with experience on the horizontal axis and estimated probability on the vertical axis.]

Imposing monotonicity

While our approach automatically imposes the nonnegativity condition (45), it does not guarantee that the monotonicity condition (46) also holds.

Since η_j(x) is strictly increasing in F_j(x), monotonicity is equivalent to the condition that −∞ < η_1(x) < ··· < η_J(x) < ∞ for all x in the support of X.

If ηj(x)=γj + μj(x), then monotonicity holds if

γ_j > γ_{j−1},  μ_j(x) ≥ μ_{j−1}(x).

One case where these two conditions are satisfied is the ordered logit model. This model is restrictive, however, for it implies that changes in the covariate vector X affect the conditional distribution of Y only through a location shift.

An alternative is to model F_1(x) and the conditional probabilities or discrete hazards
\[
\lambda_j(x) = \Pr\{Y \le y_j \mid Y > y_{j-1}, X = x\}
             = \frac{S_{j-1}(x) - S_j(x)}{S_{j-1}(x)}, \qquad j = 2, \ldots, J,
\]
where S_j(x) = 1 − F_j(x) is the survivor function evaluated at y_j. Using the recursion
\[
S_j(x) = [1 - \lambda_j(x)]\, S_{j-1}(x), \qquad j = 2, \ldots, J,
\]
we get
\[
S_k(x) = S_1(x) \prod_{j=2}^{k} [1 - \lambda_j(x)],
\]
that is,
\[
F_k(x) = 1 - [1 - F_1(x)] \prod_{j=2}^{k} [1 - \lambda_j(x)].
\]

If F_1(x) and the λ_j(x) are modeled so as to guarantee that 0 < F_1(x) < 1 and 0 < λ_j(x) < 1, then the implied F_j(x) automatically satisfy both the nonnegativity condition (45) and the monotonicity condition (46).
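A sketch of this construction (my own illustration, with arbitrary knots and data, assuming statsmodels): F_1(x) is modeled by a logit for 1{Y ≤ y_1}, each hazard λ_j(x) by a logit fitted on the subsample with Y > y_{j−1}, and the F_j(x) are recovered through the recursion above, which makes them nondecreasing in j by construction.

```python
# Illustrative sketch of the hazard-based construction: a logit for F_1(x) and
# a logit for each discrete hazard lambda_j(x) on the subsample with Y > y_{j-1},
# combined through F_k(x) = 1 - [1 - F_1(x)] prod_{j=2}^k [1 - lambda_j(x)].
# Knots, data and the logit specification are my own assumptions.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 3000
x = rng.normal(size=n)
y = 1 + x + rng.normal(size=n)
X = sm.add_constant(x)
knots = np.quantile(y, [0.2, 0.4, 0.6, 0.8])

def logistic(z):
    return np.exp(z) / (1 + np.exp(z))

def monotone_cond_df(x0):
    x0 = np.array([1.0, x0])
    d1 = (y <= knots[0]).astype(float)
    F1 = logistic(np.asarray(sm.Logit(d1, X).fit(disp=0).params) @ x0)
    F, S = [F1], 1 - F1                                  # S_1(x) = 1 - F_1(x)
    for j in range(1, len(knots)):
        at_risk = y > knots[j - 1]                       # condition on Y > y_{j-1}
        d = (y[at_risk] <= knots[j]).astype(float)
        lam = logistic(np.asarray(sm.Logit(d, X[at_risk]).fit(disp=0).params) @ x0)
        S = S * (1 - lam)                                # S_j(x) = [1 - lambda_j(x)] S_{j-1}(x)
        F.append(1 - S)
    return np.array(F)

print(monotone_cond_df(0.0).round(3))    # nondecreasing in j by construction
```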

Figure 22: Estimated conditional df’s of log monthly earnings imposing monotonicity (Peracchi 2002).

[Figure 22 has the same layout as Figure 21: one panel per country, with experience on the horizontal axis and estimated probability on the vertical axis.]

Extensions: autoregressive models

Consider the discrete-time univariate AR(1) process

\[
Y_t = \rho Y_{t-1} + \sigma U_t,
\]
where |ρ| < 1, σ > 0 and the {U_t} are iid with zero mean and marginal df G. The conditional df of Y_t given Y_{t−1} = x is
\[
F(y \mid x) = \Pr\{Y_t \le y \mid Y_{t-1} = x\} = G\!\left( \frac{y - \rho x}{\sigma} \right). \tag{47}
\]

By the stationarity assumption, this is also the conditional df of Yt+h given Yt+h−1 = x for any |h| =0, 1, 2,....

The assumptions implicit in (47) are strong. As an alternative, one may retain the assumption that F(y | x) is time-invariant and apply the results of the previous section by letting X_t = Y_{t−1}.

Extension to a univariate pth-order autoregression is straightforward. In this case, what is assumed to be time-invariant is

F(y | x) = Pr{Y_t ≤ y | X_t = x}, where X_t = (Y_{t−1},...,Y_{t−p}) and x = (x_1,...,x_p).
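As an illustration of the autoregressive extension (my own sketch, with made-up parameter values), one can simulate a Gaussian AR(1), take X_t = Y_{t−1} as the covariate, and compare a logit-based estimate of F(y | x) at one knot with the parametric expression (47).

```python
# Illustrative sketch of the autoregressive extension: simulate a Gaussian
# AR(1), use X_t = Y_{t-1} as covariate, and compare a logit-based estimate
# of F(y | x) at one knot with the parametric expression (47), G = Phi.
# All parameter values and the knot are arbitrary choices of mine.
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(5)
T, rho, sigma = 5000, 0.6, 1.0
y = np.zeros(T)
for t in range(1, T):
    y[t] = rho * y[t - 1] + sigma * rng.normal()

Y, Xlag = y[1:], y[:-1]                      # response and lagged covariate
X = sm.add_constant(Xlag)

y_knot, x0 = 0.5, 1.0
theta = np.asarray(sm.Logit((Y <= y_knot).astype(float), X).fit(disp=0).params)
eta = theta @ np.array([1.0, x0])
F_logit = np.exp(eta) / (1 + np.exp(eta))
F_param = norm.cdf((y_knot - rho * x0) / sigma)   # equation (47)

print(round(F_logit, 3), round(F_param, 3))       # should be broadly similar
```

The two numbers need not coincide, since the linear-logit specification is only an approximation to the Gaussian conditional log odds.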

3.7 Relationships between the two approaches

How does the approach outlined in the previous section relate to modeling the family of conditional quantiles? We address two questions:

• What restrictions on the family of conditional quantiles are implied by specific assumptions on the family of conditional log odds?

• What restrictions on the family of conditional log odds are implied by specific assumptions on the family of conditional quantiles?

From log odds to quantiles

If F(y | x) is continuous, the fact that F(Q(p | x) | x) = p implies
\[
\eta\big(Q(p \mid x) \mid x\big) = \ln \frac{p}{1-p}, \qquad p \in (0, 1).
\]

Suppose that η(y | x) is some known continuously differentiable function of (x, y) and η_y(y | x) is strictly positive for every (x, y), with subscripts denoting partial derivatives. By the implicit function theorem applied to
\[
\eta(y \mid x) = \ln \frac{p}{1-p},
\]
the conditional qf is unique and continuously differentiable in x with derivative
\[
Q_x(p \mid x) = - \left. \frac{\eta_x(y \mid x)}{\eta_y(y \mid x)} \right|_{y = Q(p \mid x)}
              = - \left. \frac{F_x(y \mid x)}{f(y \mid x)} \right|_{y = Q(p \mid x)}.
\]

Thus, Q_x(p | x) and η_x(Q(p | x) | x) have opposite sign.

Next notice that conditional quantiles are linear in x iff the partial derivative Q_x(p | x) does not depend on x. If the conditional log odds are linear in x, that is, η(y | x) = γ(y) + δ(y)'x, then a quantile regression is linear in x iff
\[
Q_x(p \mid x) = - \left. \frac{\delta(y)}{\gamma'(y) + \delta'(y)'x} \right|_{y = Q(p \mid x)}
\]
does not depend on x. Sufficient conditions are: (i) γ(y) is linear in y, and (ii) δ(y) does not depend on y (as with the ordered logit model).

From quantiles to log odds

If Q(p | x) is a known continuously differentiable function of (p, x) such that Q(p | x) = y, where y is a fixed number, then

F (Q(p | x)) = F (y | x).

By the chain rule,

\[
F_x(y \mid x) = - Q_x(p \mid x)\, f(y \mid x) \Big|_{p = F(y \mid x)}
\]
or, equivalently,
\[
\eta_x(y \mid x) = - Q_x(p \mid x)\, \eta_y(y \mid x) \Big|_{p = F(y \mid x)}.
\]

Clearly, conditional log odds are linear iff η_x(y | x) does not depend on x. If conditional quantiles are linear, that is, Q(p | x) = α(p) + x'β(p), then conditional log odds are linear iff

\[
\eta_x(y \mid x) = -\beta(p)\, \eta_y(y \mid x) \Big|_{p = F(y \mid x)}
\]
does not depend on x. Sufficient conditions are: (i) β(p) does not depend on p, and (ii) η_y(y | x) does not depend on x.

Example 14 Let Y = α + βX + U, where U has a logistic distribution with mean zero and qf Q. Then Q(p | x) = α(p) + βx, where α(p) = α + Q(p), whereas
\[
F(y \mid x) = \frac{\exp(y - \alpha - \beta x)}{1 + \exp(y - \alpha - \beta x)}.
\]
Hence η(y | x) = γ(y) + δ(y)x, where δ(y) = −β and γ(y) = y − α, is also linear in x. Notice that, except for the sign, Q(p | x) and η(y | x), viewed as functions of x, have the same slope. □
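A quick numerical check of Example 14 (my own sketch, assuming numpy and statsmodels): with standard logistic errors, the quantile-regression slope should be close to β, while the slope of the fitted conditional log odds at any threshold should be close to −β.

```python
# Quick numerical check of Example 14 (my own sketch): with standard logistic
# errors, the quantile-regression slope is about beta, while the slope of the
# fitted conditional log odds at a threshold is about -beta.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n, alpha, beta = 20000, 1.0, 2.0
x = rng.normal(size=n)
u = rng.logistic(size=n)                     # standard logistic, mean zero
y = alpha + beta * x + u
X = sm.add_constant(x)

qr_slope = np.asarray(sm.QuantReg(y, X).fit(q=0.3).params)[1]                           # ~ beta
logit_slope = np.asarray(sm.Logit((y <= 1.0).astype(float), X).fit(disp=0).params)[1]   # ~ -beta

print(round(qr_slope, 2), round(logit_slope, 2))
```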

3.8 Stata commands

We now briefly review the available Stata commands.

These include the qreg command for linear conditional quantile estimation, the associated iqreg, sqreg and bsqreg commands, and the post-estimation tools in qreg postestimation.

At the moment, Stata only offers the cumul command for estimating univariate distribution functions and has no command for estimating conditional df’s.

The qreg command

The basic syntax is:

qreg depvar indepvars [if] [in] [weight] [, qreg options]

where qreg options includes:

• quantile(#): specifies the quantile to be estimated and should be a number between 0 and 1, exclusive. Numbers larger than 1 are interpreted as percentages. The default value of 0.5 corresponds to the median.

• level(#): sets the confidence level. The default is level(95) (95 percent).

• wlsiter(#): specifies the number of WLS iterations that will be attempted before the linear programming iterations are started. The default value is 1. If there are convergence problems, increasing this number should help.

Notice that the standard errors produced by qreg are based on the homoskedasticity assumption and should not be trusted.

Other commands

The iqreg, sqreg and bsqreg commands all assume linearity of conditional quantiles but estimate the variance matrix of the estimators (VCE) via the bootstrap. Their syntax is similar to that of qreg. For example,

iqreg depvar indepvars [if] [in] [weight] [, iqreg options]

Remarks:

• iqreg estimates IQR regressions, i.e. regressions of the difference in quantiles. The available options include:

– quantiles(# #): specifies the quantiles to be compared. The first number must be less than the second, and both should be between 0 and 1, exclusive. Numbers larger than 1 are interpreted as percentages. Not specifying this option is equivalent to specifying quantiles(.25 .75), meaning the IQR.
– reps(#): specifies the number of bootstrap replications to be used to obtain an estimate of the VCE. The default is reps(20), arguably a little small.

• sqreg estimates simultaneous-quantile regression, essentially using the algorithm in Koenker & d’Orey (1987). It produces the same coefficients as qreg for each quantile. Bootstrap estimation of the VCE includes between-quantile blocks. Thus, one can test and construct confidence intervals comparing coefficients describing different quantiles. The available options include:

– quantiles(# [# [# ...]]): specifies the quantiles to be estimated and should contain numbers between 0 and 1, exclusive. Numbers larger than 1 are interpreted as percentages. The default value of 0.5 corresponds to the median.
– reps(#): same as above.

• bsqreg is equivalent to sqreg with one quantile.

Post-estimation tools

The following postestimation commands are available for qreg, iqreg, sqreg and bsqreg:

• estat: variance matrix and estimation sample summary,

• estimates: cataloging estimation results,

• lincom: point estimates, standard errors, testing, and inference for linear combinations of the coefficients,

• linktest: link test for model specification,

• margins: marginal means, predictive margins, marginal effects, and average marginal effects,

• nlcom: point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients,

• predict: predictions, residuals, influence statistics, and other diagnostic measures,

• predictnl: point estimates, standard errors, testing, and inference for generalized predictions,

• test: Wald tests of simple and composite linear hypotheses,

• testnl: Wald tests of nonlinear hypotheses.

References

Angrist J., Chernozhukov V. and Fernández-Val I. (2006) Quantile Regression under Misspecification, with an Application to the U.S. Wage Structure. Econometrica, 74: 539–563.

Apostol T.M. (1974) Mathematical Analysis, 2nd Ed., Addison-Wesley, Reading, MA.

Barrodale I. and Roberts F. (1973) An Improved Algorithm for Discrete l1 Linear Approximation. SIAM Journal of Numerical Analysis, 10: 839–848.

Bassett G. and Koenker R. (1978) The Asymptotic Distribution of the Least Absolute Error Estimator. Journal of the American Statistical Association, 73: 618–622.

Bassett G. and Koenker R. (1982) An Empirical Quantile Function for Linear Models with IID Errors. Journal of the American Statistical Association, 77: 407–415.

Bickel P. (1982) On Adaptive Estimation. Annals of Statistics, 10: 647–671.

Bierens H.J. (1987) Kernel Estimators of Regression Functions. In Bewley T.F. (ed.) Advances in Econometrics, Fifth World Congress, Vol. 1, pp. 99–144, Cambridge University Press, New York.

Bloomfield P. and Steiger W. (1983) Least Absolute Deviations: Theory, Applications, Algorithms, Birkhäuser, Boston, MA.

Blundell R. and Duncan A. (1998) Kernel Regression in Empirical Microeconomics. Journal of Human Resources, 33: 62–87.

Buchinsky M. (1998) Recent Advances in Quantile Regression Models: A Practical Guideline for Empirical Research. Journal of Human Resources, 33: 88–126.

Chauduri P. (1991) Nonparametric estimates of Regression Quantiles and Their Local Bahadur Representation. Annals of Statistics, 2: 760–777.

Cleveland W.S. (1979) Robust Locally Weighted Regression and Smoothing Scatterplots. Journal of the American Statistical Association, 74: 829–836.

Cleveland W.S. and Devlin S.J. (1988) Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting. Journal of the American Statistical Association, 93: 596–610.

De Angelis D., Hall P. and Young G.A. (1993) Analytical and Bootstrap Approximations to Estimator Distributions in L1 Regression. Journal of the American Statistical Association, 88: 1310–1316.

de Boor C. (1978) A Practical Guide to Splines, Springer, New York.

Engle R.F., Granger C.W.J., Rice J.A. and Weiss A. (1986) Semiparametric Estimates of the Relationship Between Weather and Electricity Sales. Journal of the American Statistical Association, 81: 310–320.

Eubank R.L. (1988) Spline Smoothing and Nonparametric Regression, Dekker, New York.

Eubank R.L. and Spiegelman C.H. (1990) Testing the Goodness of Fit of a Linear Model Via Nonparametric Regression Techniques. Journal of the American Statistical Association, 85: 387–392.

Fan J. (1992) Design-adaptive Nonparametric Regression. Journal of the American Statistical Association, 87: 998–1004.

Fan J. and Gijbels I. (1996) Local Polynomial Modelling and Its Applications, Chapman and Hall, London.

Fan J., Heckman N.E. and Wand M.P. (1995) Local Polynomial Regression for Generalized Linear Models and Quasi-likelihood Functions. Journal of the American Statistical Association, 90: 141–150.

Firpo S., Fortin N. and Lemieux T. (2009) Unconditional Quantile Regressions. Econometrica, 77: 953–973.

Friedman J.H. and Stuetzle W. (1981) Projection Pursuit Regression. Journal of the American Statistical Association, 76: 817–823.

Friedman J.H., Stuetzle W. and Schroeder A. (1984) Projection Pursuit Density Estimation. Journal of the American Statistical Association, 79: 599–608.

Gallant A.R. and Nychka D.W. (1987) Semi-Nonparametric Maximum Likelihood Estimation. Econometrica, 55: 363–390.

Good I.J. and Gaskins R.A. (1971) Nonparametric Roughness Penalties for Probability Densities. Biometrika, 58: 255–277.

Green P.J. and Silverman B.W. (1994) Nonparametric Regression and Generalized Linear Models, Chapman and Hall, London.

Gutenbrunner C. and Jurečková J. (1992) Regression Rank Scores and Regression Quantiles. Annals of Statistics, 20: 305–330.

Hall P., Wolff R.C.L. and Yao Q. (1999) Methods for estimating a Conditional Distribution Function. Journal of the American Statistical Association, 94: 154–163.

Härdle W. (1990) Applied Nonparametric Regression, Cambridge University Press, New York.

Härdle W. and Linton O. (1994) Applied Nonparametric Methods. In Engle R.F. and McFadden D.L. (eds.) Handbook of Econometrics, Vol. 4, pp. 2297–2339, North-Holland, Amsterdam.

Härdle W. and Stoker T. (1989) Investigating Smooth Multiple Regression by the Method of Average Derivatives. Journal of the American Statistical Association, 84: 986–995.

Hastie T.J. and Loader C.L. (1993) Local Regression: Automatic Kernel Carpentry (with discussion). Statistical Science, 8: 120–143.

Hastie T.J. and Tibshirani R.J. (1990) Generalized Additive Models, Chapman and Hall, London.

Hendricks W. and Koenker R. (1992) Hierarchical Spline Models for Conditional Quantiles and the Demand for Electricity. Journal of the American Statistical Association, 87: 58–68.


Horowitz J.L. (1998b) Semiparametric Methods in Econometrics, Springer, New York.

Koenker, R. (2005) Quantile Regression, Cambridge University Press, New York.

Koenker R. and Bassett G. (1978) Regression Quantiles. Econometrica, 46: 33–50.

Koenker R. and Bassett G. (1982) Robust Tests for Heteroskedasticity Based on Regression Quantiles. Econometrica, 50: 43–61.

Koenker R. and d’Orey V. (1987) Computing Regression Quantiles. Applied Statistics, 36: 383–393.

Koenker R. and Machado J.A.F. (1999) Goodness of Fit and Related Inference Processes for Quantile Regression. Journal of the American Statistical Association, 94: 1296–1310.

Koenker R., Ng P. and Portnoy S. (1994) Quantile Smoothing Splines. Biometrika, 81: 673–680.

Machado J.A.F. and Mata J. (2005) Counterfactual Decomposition of Changes in Wage Distributions Using Quantile Regression. Journal of Applied Econometrics, 20: 445–465.

Marron J.S. and Nolan D. (1988) Canonical Kernels for Density Estimation. Statistics and Probability Letters, 7: 195–199.

Melly B. (2005) Decomposition of Differences in Distribution Using Quantile Regression. Labour Economics, 12: 577–590.

Pagan A.R. and Ullah A. (1999) Nonparametric Econometrics, Cambridge University Press, New York.

Peracchi F. (2002) On Estimating Conditional Quantiles and Distribution Functions. Computational Statistics and Data Analysis, 38: 433–447.

Pollard D. (1991) Asymptotics for Least Absolute Deviation Regression Estimators. Econometric Theory, 7: 186–199.

Portnoy S. and Koenker R. (1997) The Gaussian Hare and the Laplacian Tortoise: Computability for Squared-Error versus Absolute-Error Estimators (with discussion). Statistical Science, 12: 279–300.

Robinson P. (1988) Root-N Consistent Semiparametric Regression. Econometrica, 56: 931–954.

Rosenblatt M. (1956) Remarks on Some Nonparametric Estimates of a Density Function. Annals of Mathematical Statistics, 27: 832–837.

Ruppert D., Wand M.P. and Carroll R.J. (2003) Semiparametric Regression, Cambridge University Press, New York.

Sheather S.C. and Jones M.C. (1991) A Reliable Data-Based Bandwidth Selection Method for Kernel Density Estimation. Journal of the Royal Statistical Society, Series B, 53: 683–690.

Schlossmacher E.G. (1973) An Iterative Technique for Absolute Deviation Curve Fitting. Journal of the American Statistical Association, 68: 857–865.

Schucany W.R. and Sommers J.P. (1977) Improvement of Kernel Type Density Estimators. Journal of the American Statistical Association, 72: 420–423.

Silverman B.W. (1986) Density Estimation for Statistics and Data Analysis, Chapman and Hall, New York.

Stoker T.M. (1986) Consistent Estimation of Scaled Coefficients. Econometrica, 54: 1461–1481.

Stone C.J. (1977) Consistent Nonparametric Regression (with discussion). Annals of Statistics, 5: 595–645.

Stone C.J. (1982) Optimal Global Rates of Convergence for Nonparametric Regression. Annals of Statistics, 10: 1040–1053.

Stone C.J. (1984) An Asymptotically Optimal Window Selection Rule for Kernel Density Estimates. Annals of Statistics, 12: 1285–1297.

Stone C.J. (1985) Additive Regression and Other Nonparametric Models. Annals of Statistics, 13: 689–705.

Tibshirani R. and Hastie T. (1987) Local Likelihood Estimation. Journal of the American Statistical Association, 82: 559–567.

Vapnik V.N. (1995) The Nature of Statistical Learning Theory, Springer, New York.

Wahba G. (1990) Spline Models for Observational Data, SIAM, Philadelphia, PA.

Yatchew A. (1998) Nonparametric Regression Techniques in Economics. Journal of Economic Literature, 36: 669–721.

Yatchew A. (2003) Semiparametric Regression for the Applied Econometrician, Cambridge University Press, New York.

Yu K. and Jones M.C. (1998) Local Linear Quantile Regression. Journal of the American Statistical Association, 93: 228–237.
