Quantile Regression

QUANTILE REGRESSION ROGER KOENKER Abstract. Classical least squares regression maybeviewed as a natural way of extending the idea of estimating an unconditional mean parameter to the problem of estimating conditional mean functions; the crucial link is the formulation of an optimization problem that encompasses b oth problems. Likewise, quantile regression o ers an extension of univariate quantile estimation to estimation of conditional quantile functions via an optimization of a piecewise linear ob jective function in the residuals. Median regression minimizes the sum of absolute residuals, an idea intro duced by Boscovich in the 18th century. The asymptotic theory of quantile regression closely parallels the theory of the univariate sample quantiles. Computation of quantile regression estimators may b e formulated as a linear program- ming problem and eciently solved by simplex or barrier metho ds. A close link to rank based inference has b een forged from the theory of the dual regression quantile pro cess, or regression rankscore pro cess. Quantile regression is a statistical technique intended to estimate, and conduct inference ab out, conditional quantile functions. Just as classical linear regression metho ds based on minimizing sums of squared residuals enable one to estimate mo dels for conditional mean functions, quantile regression metho ds o er a mechanism for estimating mo dels for the conditional median function, and the full range of other conditional quantile functions. By supplementing the estimation of conditional mean functions with techniques for estimating an entire family of conditional quantile functions, quantile regression is capable of providing a more complete statistical analysis of the sto chastic relationships among random variables. Quantile regression has b een used in a broad range of application settings. Reference growth curves for childrens' heightandweighthave a long history in p ediatric medicine; quantile regression metho ds may b e used to estimate upp er and lower quantile reference curves as a function of age, sex, and other covariates without imp osing stringent parametric assumptions on the relationships among these curves. Quantile regression metho ds have b een widely used in economics to study determinents of wages, discrimination e ects, and trends in income inequality. Several recent studies have mo deled the p erformance of public scho ol students on standardized exams as a function of so cio-economic characteristics like their parents' income and educational attainment, and p olicy variables likeclass size, scho ol exp enditures, and teacher quali cations. It seems rather implausible that suchcovariate e ects should all act so as to shift the entire distribution of test results by a xed amount. It is of obvious interest to know whether p olicy interventions alter p erformance of the strongest students in the same way that weaker students are a ected. Such questions are naturally investigated within the quantile regression framework. In ecology, theory often suggests how observable covariates a ect limiting sustainable p opulation sizes, and quantile regression has b een used to directly estimate mo dels for upp er quantiles of the conditional distribution rather than inferring such relationships from mo dels based on conditional central tendency. In survival analysis, and event history analysis more generally, there is often also a Version: Octob er 25, 2000. This article has b een prepared for the statistics section of the International Encyclo- pedia of the Social Sciences edited by Stephen Fienb erg and Jay Kadane. The researchwas partially supp orted by NSF grant SBR-9617206. 1 2 ROGER KOENKER desire to fo cus attention on particular segments of the conditional distribution, for example survival prosp ects of the oldest-old, without the imp osition of global distributional assumptions. 1. Quantiles, Ranks and Optimization Wesay that a student scores at the th quantile of a standardized exam if he p erforms better than the prop ortion ,andworse than the prop ortion 1 , of the reference group of students. Thus, half of the students p erform b etter than the median student, and half p erform worse. Similarly,the quartiles divide the p opulation into four segments with equal prop ortions of the p opulation in each segment. The quintiles divide the p opulation into 5 equal segments; the deciles into 10 equal parts. The quantile, or p ercentile, refers to the general case. More formally,anyrealvalued random variable, Y ,maybecharacterized by its distribution function, F y = ProbY y while for any0< <1, Q = inffy : F y g is called the th quantile of X . The median, Q1=2, plays the central role. Like the distribution function, the quantile function provides a complete characterization of the random variable, Y. The quantiles may b e formulated as the solution to a simple optimization problem. For any 0 < <1, de ne the piecewise linear \check function", u= u I u<0 illustrated in Figure 1. ρτ (u) τ−1 τ Figure 1. Quantile Regression Function ^ Minimizing the exp ectation of Y with resp ect to yields solutions, , the smallest of whichisQ de ned ab ove. The sample analogue of Q , based on a random sample, fy ; :::; y g,of Y 's, is called the th 1 n sample quantile, and may b e found by solving, n X min y ; i 2R i=1 QUANTILE REGRESSION 3 While it is more common to de ne the sample quantiles in terms of the order statistics, y 1 y ::: y , constituting a sorted rearrangement of the original sample, their formulation as a 2 n minimization problem has the advantage that it yields a natural generalization of the quantiles to the regression context. Just as the idea of estimating the unconditional mean, viewed as the minimizer, X 2 ^ = argmin y i j 2 R 0 can b e extended to estimation of the linear conditional mean function E Y jX = x= x by solving, X 0 2 ^ p = argmin y x ; i j i 2 R 0 the linear conditional quantile function, Q jX = x= x , can b e estimated by solving, Y i X 0 ^ p = argmin y x : i j 2 R The median case, =1=2, which is equivalent to minimizing the sum of absolute values of the residuals has a long history. In the mid-18th century Boscovich prop osed estimating a bivariate linear mo del for the ellipticity of the earth by minimizing the sum of absolute values of residuals sub ject to the condition that the mean residual to ok the value zero. Subsequentwork by Laplace characterized Boscovich's estimate of the slop e parameter as a weighted median and derived its asymptotic distribution. F.Y. Edgeworth seems to have b een the rst to suggest a general formulation of median regression involving a multivariate vector of explanatory variables, a technique he called the \plural median". The extension to quantiles other than the median was intro duced in Ko enker and Bassett 1978. 2. An Example To illustrate the approachwemay consider an analysis of a simple rst order autoregressivemodel for maximum daily temp erature in Melb ourne, Australia. The data are taken from Hyndman, Bashtannyk, and Grunwald 1996. In Figure 2 weprovide a scatter plot of 10 years of daily temp erature data: to day's maximum daily temp erature is plotted against yesterday's maximum. Our rst observation from the plot is that there is a strong tendency for data to cluster along the dashed 45 degree line implying that with high probabilitytoday's maximum is near yesterday's maximum. But closer examination of the plot reveals that this impression is based primarily on the left side of the plot where the central tendency of the scatter follows the 45 degree line very closely. On the right side, however, corresp onding to summer conditions, the pattern is more complicated. There, it app ears that either there is another hot day, falling again along the 45 degree line, or there is a dramatic co oling o . But a mild co oling o app ears to b e more rare. In the language of conditional densities, if to dayishot,tomorrow's temp erature app ears to b e bimo dal with one mo de roughly centered at to day's maximum, and the other mo de centered at ab out 20 . Several estimated quantile regression curves have b een sup erimp osed on the scatterplot. Each curve is sp eci ed as a linear B-spline. Under winter conditions these curves are bunched around the 45 degree line, however in the summer it app ears that the upp er quantile curves are bunched around the 45 degree line and around 20 .Intheintermediate temp eratures the spacing of the quantile curves is somewhat greater indicating lower probability of this temp erature range. This impression is strengthened by considering a sequence of density plots based on the quantile regression estimates. Given a family of reasonably densely spaced estimated conditional quantile functions, it is straightforward to estimate the conditional density of the resp onse at various values of the conditioning covariate. In Figure 3 we illustrate this approach with several of density estimates based on the Melb ourne data. Conditioning on a low previous day temp erature weseeanice 4 ROGER KOENKER . .. .. .... .. .. .. .. .. ... .... ... ...... .. .... .. ....... .. .. ... ... .. ..... .. ... ... .... .. .. ... .. .. ......... .. ...... .. ....... ..................... .. .. .... ... .. .. ........... .. .... .. .................. .. ..... .. ................ .... .. ........ .. ..... ........ ........ .......... .... ... .. ... .. .............. ............. ........... ... .. ..................................... .... ... .. .....................................

Load more