Notes on Median and Quantile Regression
James L. Powell
Department of Economics
University of California, Berkeley

Conditional Median Restrictions and Least Absolute Deviations

It is well known that the expected value of a random variable $Y$ minimizes the expected squared deviation between $Y$ and a constant; that is,
$$\mu_Y \equiv E[Y] = \arg\min_c \, E(Y - c)^2,$$
assuming $E\|Y\|^2$ is finite. (In fact, it is only necessary to assume $E\|Y\|$ is finite, if the minimand is normalized by subtracting $Y^2$, i.e.,
$$\mu_Y \equiv E[Y] = \arg\min_c \, E[(Y - c)^2 - Y^2] = \arg\min_c \, [c^2 - 2cE[Y]],$$
and this normalization has no effect on the solution if the stronger condition holds.) It is less well known that a median of $Y$, defined as any number $\eta_Y$ for which
$$\Pr\{Y \leq \eta_Y\} \geq \frac{1}{2} \quad\text{and}\quad \Pr\{Y \geq \eta_Y\} \geq \frac{1}{2},$$
minimizes an expected absolute deviation criterion,
$$\eta_Y = \arg\min_c \, E[\,|Y - c| - |Y|\,],$$
though the solution to this minimization problem need not be unique. When the c.d.f. $F_Y$ is strictly increasing everywhere (i.e., $Y$ is continuously distributed with positive density), uniqueness is not an issue, and $\eta_Y = F_Y^{-1}(1/2)$. In this case, the first-order condition for minimization of $E[\,|Y - c| - |Y|\,]$ is
$$0 = -E[\mathrm{sgn}(Y - c)],$$
for $\mathrm{sgn}(u)$ the "sign" (or "signum") function
$$\mathrm{sgn}(u) \equiv 1 - 2 \cdot 1\{u < 0\},$$
here defined to be right-continuous.

Thus, just as least squares (LS) estimation is the natural generalization of the sample mean to estimation of regression coefficients, least absolute deviations (LAD) estimation is the generalization of the sample median to the linear regression context. For the linear structural equation
$$y_i = x_i'\beta_0 + \varepsilon_i,$$
if the error terms $\varepsilon_i$ are assumed to have a (unique) conditional median of zero given the regressors $x_i$, i.e.,
$$E[\mathrm{sgn}(\varepsilon_i) \mid x_i] = 0, \qquad E[\mathrm{sgn}(\varepsilon_i - c) \mid x_i] \neq 0 \text{ if } c = c(x_i) \neq 0,$$
then the true regression coefficients $\beta_0$ are identified by
$$\beta_0 = \arg\min_\beta \, E[\,|y_i - x_i'\beta| - |\varepsilon_i|\,],$$
and, given an i.i.d. sample of size $n$ from this model, a natural sample analogue of $\beta_0$ is
$$\hat{\beta} \equiv \arg\min_\beta \frac{1}{n}\sum_{i=1}^n |y_i - x_i'\beta| \equiv \arg\min_\beta \, S_n(\beta),$$
a slightly nonstandard extremum estimator (because the minimand is not twice continuously differentiable for all $\beta$).
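Despite its kinks, the LAD minimand can be minimized exactly by recasting the problem as a linear program, splitting each residual into its positive and negative parts. The sketch below is an illustration, not part of the notes: the function name `lad`, the simulated design, and the use of numpy/scipy are all choices made here. Since scaling the objective by $1/n$ does not change the minimizer, the LP minimizes the unnormalized sum.

```python
import numpy as np
from scipy.optimize import linprog

def lad(y, X):
    """LAD fit: minimize sum_i |y_i - x_i'beta| by linear programming.

    Each residual is written as u_i - v_i with u_i, v_i >= 0, and the
    objective sum_i (u_i + v_i) is minimized subject to X beta + u - v = y.
    """
    n, p = X.shape
    c = np.concatenate([np.zeros(p), np.ones(2 * n)])    # beta enters at zero cost
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])         # X beta + u - v = y
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)  # beta free; u, v >= 0
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:p]

# Illustration with Cauchy errors: the conditional median is zero, but no
# moments of the errors exist, the heavy-tailed setting where LAD is attractive.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta0 = np.array([1.0, 2.0])
y = X @ beta0 + rng.standard_cauchy(n)
print(lad(y, X))   # roughly [1.0, 2.0]
```

This LP reformulation is the standard computational route to LAD estimates, which is one reason the estimator is practical even though its criterion is not smooth.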
Consistency of LAD

Demonstration of consistency of $\hat{\beta}$ is straightforward, because the LAD minimand $S_n(\beta)$ is clearly continuous in $\beta$ with probability one; in fact, $S_n(\beta)$ is convex in $\beta$, so consistency follows if $S_n$ can be shown to converge pointwise to a function that is uniquely minimized at the true value $\beta_0$. (Typically we need to show uniform convergence, but pointwise convergence of convex functions implies their uniform convergence on compact subsets.) To prove consistency, we need to impose some conditions on the model; here are some conditions that will suffice:

A1. The data $\{(y_i, x_i')'\}_{i=1}^n$ are independent and identically distributed across $i$.

A2. The regressors have bounded second moment, i.e., $E[\|x_i\|^2] < \infty$.

A3. The error terms $\varepsilon_i$ are continuously distributed given $x_i$, with conditional density $f(\varepsilon \mid x_i)$ satisfying the conditional median restriction, i.e.,
$$\int_{-\infty}^{0} f(\lambda \mid x_i)\, d\lambda = \frac{1}{2}.$$

A4. The regressors and error density satisfy a "local identification" condition, namely, the matrix
$$C \equiv E[f(0 \mid x_i)\, x_i x_i']$$
is positive definite.

Note that moments of $y_i$ or $\varepsilon_i$ need not exist under these assumptions, which is why LAD estimation is attractive for heavy-tailed error distributions. Condition A4 combines a "unique median" assumption (implied by positivity of the conditional density $f(\varepsilon \mid x_i)$ at $\varepsilon = 0$) with the usual full-rank assumption on the second moments of the regressors.

Imposing these conditions, the first step in the consistency proof is to calculate the probability limit of the minimand. To avoid assuming $E[|y_i|] < \infty$, it is convenient to normalize the minimand $S_n(\beta)$ by subtracting off its value at the true parameter $\beta_0$, which clearly does not affect the minimizing value $\hat{\beta}$. That is,
$$\hat{\beta} = \arg\min_\beta \, [S_n(\beta) - S_n(\beta_0)] = \arg\min_\beta \frac{1}{n}\sum_{i=1}^n [\,|y_i - x_i'\beta| - |y_i - x_i'\beta_0|\,] = \arg\min_\beta \frac{1}{n}\sum_{i=1}^n [\,|\varepsilon_i - x_i'\delta| - |\varepsilon_i|\,],$$
where $\delta \equiv \beta - \beta_0$. But since
$$-\|x_i\| \cdot \|\delta\| \;\leq\; |\varepsilon_i - x_i'\delta| - |\varepsilon_i| \;\leq\; \|x_i\| \cdot \|\delta\|$$
by the triangle and Cauchy-Schwarz inequalities, the normalized minimand is a sample average of i.i.d. random variables with finite first (and even second) moments under condition A2, so by Khintchine's Law of Large Numbers,
$$S_n(\beta) - S_n(\beta_0) \stackrel{p}{\to} \bar{S}(\delta) \equiv E[S_n(\beta) - S_n(\beta_0)]$$
$$= E[\,|\varepsilon_i - x_i'\delta| - |\varepsilon_i|\,]$$
$$= E\left[(\varepsilon_i - x_i'\delta)\,\mathrm{sgn}\{\varepsilon_i - x_i'\delta\} - \varepsilon_i\,\mathrm{sgn}\{\varepsilon_i\}\right]$$
$$= E\left[(\varepsilon_i - x_i'\delta)\left(\mathrm{sgn}\{\varepsilon_i - x_i'\delta\} - \mathrm{sgn}\{\varepsilon_i\}\right)\right]$$
$$= E\left[2\int_{x_i'\delta}^{0} [\lambda - x_i'\delta]\, f(\lambda \mid x_i)\, d\lambda\right],$$
where the second-to-last equality uses the fact that $E[(x_i'\delta)\,\mathrm{sgn}\{\varepsilon_i\}] = E[\,E[(x_i'\delta)\,\mathrm{sgn}\{\varepsilon_i\} \mid x_i]\,] = 0$. (The integral in the last equality is well defined for both positive and negative values of $x_i'\delta$, under the standard convention $\int_a^b dF = -\int_b^a dF$.)

By inspection, the limit $\bar{S}(\delta)$ equals zero at $\delta = \beta - \beta_0 = 0$, and is non-negative otherwise (since the sign of the integrand is the same as the sign of the lower limit $x_i'\delta$). Furthermore, since $S_n(\beta) - S_n(\beta_0)$ is convex for all $n$, so is its probability limit $\bar{S}(\beta - \beta_0)$; thus, if $\beta = \beta_0$ is a unique local minimizer, it is also a global minimizer, implying consistency of $\hat{\beta}$. But by Leibnitz' rule,
$$\frac{\partial \bar{S}(\delta)}{\partial \delta} = -2E\left[x_i \cdot \int_{x_i'\delta}^{0} f(\lambda \mid x_i)\, d\lambda\right], \qquad \frac{\partial \bar{S}(0)}{\partial \delta} = 0,$$
and
$$\frac{\partial^2 \bar{S}(\delta)}{\partial \delta\, \partial \delta'} = 2E[x_i x_i' \cdot f(x_i'\delta \mid x_i)], \qquad \frac{\partial^2 \bar{S}(0)}{\partial \delta\, \partial \delta'} = 2E[x_i x_i' \cdot f(0 \mid x_i)] \equiv 2C,$$
which is positive definite by condition A4. So $\delta = 0 = \beta - \beta_0$ is indeed a unique local (and global) minimizer of $\bar{S}(\delta) = \bar{S}(\beta - \beta_0)$, and thus
$$\hat{\beta} \stackrel{p}{\to} \beta_0. \qquad \blacksquare$$

To generalize this consistency result to nonlinear median regression models
$$y_i = g(x_i, \beta_0) + \varepsilon_i, \qquad 0 = E[\mathrm{sgn}\{\varepsilon_i\} \mid x_i],$$
the regularity conditions would have to be strengthened, since convexity of the corresponding LAD minimand $n^{-1}\sum_i |y_i - g(x_i, \beta)|$ in $\beta$ is no longer assured. Standard conditions would include the assumption that the LAD criterion is minimized over a compact parameter space $B$ (and not over all of $\mathbb{R}^p$), and a uniform Lipschitz continuity condition on the median regression function $g(x_i, \beta)$ would typically be imposed, with the Lipschitz term assumed to have finite moments. Finally, the identification condition A4 would have to be strengthened to a global identification condition, such as:

A4'. For some $\tau > 0$, the conditional density satisfies $f(\lambda \mid x_i) > \tau$ if $|\lambda| < \tau$, and $\Pr\{|g(x_i, \beta) - g(x_i, \beta_0)| \geq \tau\} > 0$ if $\beta \neq \beta_0$.
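Because the nonlinear LAD criterion is no longer convex, it is typically minimized by numerical search over the compact set $B$. The sketch below is an aside, not from the notes: the exponential median-regression function `g`, the box $B = [-2, 2]^2$ standing in for the compact parameter space, and the use of scipy are all hypothetical choices.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical median-regression function g(x, b); the exponential form is
# illustrative only.
def g(x, b):
    return np.exp(b[0] + b[1] * x)

def lad_criterion(b, y, x):
    # Nonlinear LAD minimand: (1/n) sum_i |y_i - g(x_i, b)|.
    return np.mean(np.abs(y - g(x, b)))

rng = np.random.default_rng(1)
n = 1000
x = rng.uniform(-1.0, 1.0, n)
beta0 = np.array([0.5, 1.0])
y = g(x, beta0) + rng.standard_cauchy(n)   # conditional median zero

# Compact parameter space B = [-2, 2]^2; Nelder-Mead tolerates the kinks
# (bound support assumes SciPy >= 1.7). Since the criterion need not be
# convex, a multi-start search would guard against local minima in practice.
B = [(-2.0, 2.0), (-2.0, 2.0)]
res = minimize(lad_criterion, x0=np.zeros(2), args=(y, x),
               method="Nelder-Mead", bounds=B)
print(res.x)   # roughly [0.5, 1.0]
```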
Asymptotic Normality of LAD

Returning to the linear LAD estimator: while demonstration of consistency of $\hat{\beta}$ involves routine application of asymptotic arguments for extremum estimators, demonstration of $\sqrt{n}$-consistency and asymptotic normality is complicated by the fact that the LAD criterion $S_n(\beta)$ is not continuously differentiable in $\beta$.

For comparison, consider the "standard" theory for extremum estimators, where the estimator $\hat{\theta}$ is defined to minimize (or maximize) a twice-differentiable criterion, e.g.,
$$\hat{\theta} = \arg\min_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^n \rho(z_i, \theta),$$
and (for large $n$) to satisfy a first-order condition for an interior minimum,
$$0 = \frac{1}{n}\sum_{i=1}^n \frac{\partial \rho(z_i, \hat{\theta})}{\partial \theta} \equiv \frac{1}{n}\sum_{i=1}^n \psi_i(\hat{\theta}),$$
assuming consistency of $\hat{\theta}$ for an (interior) parameter $\theta_0$ has been established. The true value $\theta_0$ satisfies the corresponding population first-order condition
$$0 = E[\psi_i(\theta_0)];$$
derivation of the asymptotic distribution of $\hat{\theta}$ is based upon a Taylor's series expansion of the sample first-order condition for $\hat{\theta}$ around $\hat{\theta} = \theta_0$:
$$0 \equiv \frac{1}{n}\sum_{i=1}^n \psi_i(\hat{\theta}) = \frac{1}{n}\sum_{i=1}^n \psi_i(\theta_0) + \left[\frac{1}{n}\sum_{i=1}^n \frac{\partial \psi_i(\theta_0)}{\partial \theta'}\right](\hat{\theta} - \theta_0) + o_p(\|\hat{\theta} - \theta_0\|), \qquad (*)$$
which is solved for $\hat{\theta}$ to yield the asymptotic linearity expression
$$\hat{\theta} = \theta_0 + H_0^{-1}\,\frac{1}{n}\sum_{i=1}^n \psi_i(\theta_0) + o_p\!\left(\frac{1}{\sqrt{n}}\right),$$
where
$$H_0 \equiv -E\left[\frac{\partial \psi_i(\theta_0)}{\partial \theta'}\right] = -E\left[\frac{\partial^2 \rho(z_i, \theta_0)}{\partial \theta\, \partial \theta'}\right]$$
is minus one times the expected Hessian of the original minimand. From the linearity expression, it follows that
$$\sqrt{n}(\hat{\theta} - \theta_0) \stackrel{d}{\to} N(0, H_0^{-1} V_0 H_0^{-1}),$$
where $V_0$ is the asymptotic covariance matrix of the sample average of $\psi_i(\theta_0)$, i.e.,
$$\frac{1}{\sqrt{n}}\sum_{i=1}^n \psi_i(\theta_0) \stackrel{d}{\to} N(0, V_0),$$
which is established by appeal to a suitable central limit theorem.

In the LAD case (where $\theta \equiv \beta$), the criterion function
$$\rho(z_i, \theta) = |y_i - x_i'\beta|$$
is not continuously differentiable at values of $\beta$ for which $y_i = x_i'\beta$; furthermore, the (discontinuous) subgradient
$$\frac{\partial \rho(z_i, \theta)}{\partial \theta} = -\mathrm{sgn}\{y_i - x_i'\beta\}\, x_i$$
itself has a derivative that is identically zero wherever it is defined. Thus the Taylor's expansion (*) is not applicable to this problem, even though an approximate first-order condition
$$\frac{1}{n}\sum_{i=1}^n \mathrm{sgn}\{y_i - x_i'\hat{\beta}\}\, x_i = o_p\!\left(\frac{1}{\sqrt{n}}\right)$$
can be established for this problem.
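The approximate first-order condition is easy to check numerically. The sketch below is again an illustration, not from the notes: statsmodels' QuantReg at $q = 0.5$ is used as an off-the-shelf LAD fit (median regression is exactly LAD), and the design and seed are arbitrary.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 2000
X = sm.add_constant(rng.standard_normal(n))    # intercept plus one regressor
y = X @ np.array([1.0, 2.0]) + rng.standard_normal(n)

# Median regression (q = 0.5) is the LAD estimator.
beta_hat = sm.QuantReg(y, X).fit(q=0.5).params

# The averaged signed regressors (1/n) sum_i sgn(y_i - x_i'beta_hat) x_i are
# not exactly zero (the criterion is kinked at beta_hat, where roughly p
# residuals are zero), but their entries are small, on the order of p/n,
# consistent with the o_p(1/sqrt(n)) bound in the text. Note np.sign(0) = 0,
# whereas the notes' right-continuous sgn sets sgn(0) = 1; the difference is
# at most p terms of size 1/n.
print(X.T @ np.sign(y - X @ beta_hat) / n)
```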