Quantile Regression for Correlated Observations
Li Chen¹, Lee-Jen Wei², and Michael I. Parzen³

¹ Division of Biostatistics, University of Minnesota, Minneapolis, MN 55414
² Department of Biostatistics, Harvard University, Boston, MA 02115
³ Graduate School of Business, University of Chicago, Chicago, IL 60637

Abstract

In this paper, we consider the problem of regression analysis for data which consist of a large number of independent small groups or clusters of correlated observations. Instead of using the standard mean regression, we regress various percentiles of each marginal response variable on its covariates to obtain a more accurate assessment of the covariate effect. Our inference procedures are derived using the generalized estimating equations approach. The new proposal is robust and can be easily implemented. Graphical and numerical methods for checking the adequacy of the fitted quantile regression model are also proposed. The new methods are illustrated with an animal study in toxicology.

Key Words: Estimating equations; Gaussian process; Linear programming; Omnibus test; Resampling method

1 Introduction

Although quite a few useful parametric and semi-parametric regression methods are available for analyzing correlated observations, they can only be used to evaluate the covariate effect on the mean of the response variable (Laird and Ware, 1982; Liang and Zeger, 1986). To obtain a global picture of the covariate effect on the distribution of the response variable, one may use the quantile regression model. Specifically, let $\tau$ be a constant between 0 and 1, $Y$ be the response variable, and $x$ be the corresponding $(p+1) \times 1$ covariate vector. Given $x$, let the $100\tau$th percentile of $Y$ be $\beta_\tau^{T} x$, where $\beta_\tau$ is an unknown $(p+1) \times 1$ parameter vector that may depend on $\tau$.
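As an illustration of how $\beta_\tau$ can vary with $\tau$, consider a linear location-scale model. This model is an assumption introduced here only for illustration (the symbols $\gamma$, $\delta$, $\varepsilon$ and $Q_\varepsilon$ are not part of the paper's notation), and the methods in this paper do not require it. Suppose $Y = \gamma^{T}x + (\delta^{T}x)\,\varepsilon$, where $\varepsilon$ is independent of $x$ with $100\tau$th percentile $Q_\varepsilon(\tau)$ and $\delta^{T}x > 0$. Then the $100\tau$th percentile of $Y$ given $x$ is

```latex
\[
\gamma^{T}x + (\delta^{T}x)\,Q_\varepsilon(\tau)
  = \{\gamma + \delta\,Q_\varepsilon(\tau)\}^{T}x ,
\qquad\text{so that}\qquad
\beta_\tau = \gamma + \delta\,Q_\varepsilon(\tau).
\]
```

If only the intercept component of $\delta$ is nonzero (homoskedastic errors), only the intercept of $\beta_\tau$ changes with $\tau$. Otherwise the slope components change with $\tau$ as well, which is the kind of covariate effect a mean regression cannot reveal.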
Inference procedures for $\beta_\tau$ with a set of properly chosen $\tau$'s would provide much more information about the effect of $x$ on $Y$ than their counterparts based on the usual mean regression model (Mosteller and Tukey, 1977). For independent observations, inference procedures for $\beta_\tau$ have been proposed, for example, by Bassett and Koenker (1978, 1982), Koenker and Bassett (1978, 1982) and Parzen et al. (1994). When $\tau = 1/2$, which corresponds to the median regression model, the celebrated $L_1$ estimator, which minimizes the sum of the absolute residuals, is consistent for $\beta_{0.5}$ (Bloomfield and Steiger, 1983).

Recently, Jung (1996) proposed an interesting quasi-likelihood equation approach for median regression models with dependent observations. However, his method assumes a known relationship between the median and the density function of the response variable. The variance estimate of his estimator for the regression parameter appears to be rather sensitive to this assumption. Moreover, Jung's optimal estimating equations may have multiple roots and, therefore, the estimator for $\beta_\tau$ may not be well-defined.

In this paper, we present a simple and robust procedure to make inferences about $\beta_\tau$ without imposing any parametric assumption on the density function of the response variable or on the dependence structure among the correlated observations. Furthermore, our estimating functions are monotonic component-wise, and the resulting estimator for the regression parameter can be easily obtained through well-established linear programming techniques. The new proposal is illustrated with an animal study in toxicology.

2 Inferences for Regression Parameters

In this section, we derive regression methods for analyzing data that consist of a large number of independent small groups or clusters of correlated observations.
Let $Y_{ij}$ be the continuous response variable for the $j$th measurement in the $i$th cluster, where $i = 1, \ldots, n$ and $j = 1, \ldots, K_i$, and $K_i$ is relatively small with respect to $n$. Let $x_{ij}$ be the corresponding covariate vector. Furthermore, assume that the $100\tau$th percentile of $Y_{ij}$ is $\beta_\tau^{T} x_{ij}$. The observations within each cluster may be dependent, but $(Y_{ij}, x_{ij})$ and $(Y_{i'j'}, x_{i'j'})$ are independent when $i \neq i'$. Note that the distribution function $F_{\tau ij}(\cdot)$ of the error term $(Y_{ij} - \beta_\tau^{T} x_{ij})$ is completely unspecified and may involve $x_{ij}$.

Suppose that we are interested in $\beta_\tau$ for a particular $\tau$. If all the observations $\{(Y_{ij}, x_{ij})\}$ are mutually independent, the following estimating functions are often used to make inferences about $\beta_\tau$:

\[
W_\tau(\beta) = n^{-1/2} \sum_{i=1}^{n} \sum_{j=1}^{K_i} x_{ij} \left\{ I(Y_{ij} - \beta^{T} x_{ij} \le 0) - \tau \right\}, \tag{1}
\]

where $I(\cdot)$ is the indicator function. For the aforementioned correlated observations, (1) are estimating functions based on the “independence working model” (Liang and Zeger, 1986), and the expected value of $W_\tau(\beta_\tau)$ is 0. Therefore, a solution $\hat\beta_\tau$ to the equations $W_\tau(\beta) = 0$ would be a reasonable estimate for $\beta_\tau$. The consistency of $\hat\beta_\tau$ can be easily established using arguments similar to those for the case of independent observations. In practice, $\hat\beta_\tau$ can be obtained by minimizing

\[
\sum_{i=1}^{n} \sum_{j=1}^{K_i} \rho_\tau(Y_{ij} - \beta^{T} x_{ij}), \tag{2}
\]

where $\rho_\tau(v)$ is $\tau v$ if $v > 0$, and $(\tau - 1)v$ if $v \le 0$ (Koenker and Bassett, 1978). This optimization problem can be handled by linear programming techniques (Barrodale and Roberts, 1973). An efficient algorithm developed by Koenker and D'Orey (1987) is available in S-PLUS to obtain a minimizer $\hat\beta_\tau$ for (2). Using an argument similar to that given in Chamberlain (1994) for the case of independent observations, one can show that for the present case, the distribution of $n^{1/2}(\hat\beta_\tau - \beta_\tau)$ converges to a normal distribution as $n \to \infty$.
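The minimization of (2) can be sketched as a small linear program. The block below is a minimal illustration, not the Koenker and D'Orey (1987) algorithm: it assumes SciPy's general-purpose `linprog` solver, synthetic clustered data generated only for this example, and hypothetical helper names (`fit_quantile`, `check_loss`, `W`). Each residual is split into its positive and negative parts so that the check function becomes a linear cost.

```python
import numpy as np
from scipy.optimize import linprog


def fit_quantile(X, y, tau):
    """Minimize sum_k rho_tau(y_k - x_k' beta), the objective in (2), as an LP."""
    n_obs, p = X.shape
    # Decision variables: beta (free), then the positive and negative
    # parts of each residual, r_plus >= 0 and r_minus >= 0.
    cost = np.concatenate([np.zeros(p),
                           tau * np.ones(n_obs),
                           (1.0 - tau) * np.ones(n_obs)])
    # Constraint: X beta + r_plus - r_minus = y.
    A_eq = np.hstack([X, np.eye(n_obs), -np.eye(n_obs)])
    bounds = [(None, None)] * p + [(0, None)] * (2 * n_obs)
    res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:p]


def check_loss(X, y, beta, tau):
    """Objective (2): rho_tau(v) = tau*v if v > 0, (tau - 1)*v otherwise."""
    v = y - X @ beta
    return np.sum(np.where(v > 0, tau * v, (tau - 1.0) * v))


def W(X, y, beta, tau, n_clusters):
    """Estimating function (1) under the independence working model."""
    ind = (y - X @ beta <= 0).astype(float)
    return X.T @ (ind - tau) / np.sqrt(n_clusters)


# Synthetic clustered data (illustrative only): a shared random effect b_i
# induces within-cluster correlation; the true median is 1 + 2 * dose.
rng = np.random.default_rng(0)
n = 200                                        # number of clusters
sizes = rng.integers(2, 5, size=n)             # cluster sizes K_i in {2, 3, 4}
cluster = np.repeat(np.arange(n), sizes)
dose = np.repeat(rng.uniform(0, 1, size=n), sizes)
b = np.repeat(rng.normal(0, 0.5, size=n), sizes)
y = 1.0 + 2.0 * dose + b + rng.normal(0, 0.5, size=cluster.size)
X = np.column_stack([np.ones_like(dose), dose])

beta_hat = fit_quantile(X, y, tau=0.5)
print("beta_hat:", beta_hat)
print("W(beta_hat):", W(X, y, beta_hat, 0.5, n))
```

Because $W_\tau(\beta)$ is a step function of $\beta$, it is generally not exactly zero at the minimizer; it is only as close to zero as the jumps contributed by the few observations with zero residual allow. Note also that the cluster structure plays no role in the point estimate; it matters only for the variance.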
The covariance matrix of this limiting normal distribution is $A_\tau^{-1}(\beta_\tau)\,\mathrm{var}\{W_\tau(\beta_\tau)\}\,\{A_\tau^{T}(\beta_\tau)\}^{-1}$, where $A_\tau(\beta)$ is the expected value of the derivative of $W_\tau(\beta)$ with respect to $\beta$. For the heteroskedastic quantile regression model considered here, it is difficult to estimate the covariance matrix because $A_\tau(\beta)$ may involve the unknown underlying density functions. Complicated and subjective nonparametric functional estimates are needed to estimate the variance directly.

Recently, Parzen et al. (1994) developed a general resampling method which can be used to approximate the distribution of $(\hat\beta_\tau - \beta_\tau)$ without involving any complicated and subjective nonparametric functional estimation. To apply this resampling method to the case with correlated observations, let

\[
U_\tau = n^{-1/2} \sum_{i=1}^{n} \left[ \sum_{j=1}^{K_i} x_{ij} \left\{ I(y_{ij} - \tilde\beta_\tau^{T} x_{ij} \le 0) - \tau \right\} \right] Z_i,
\]

where $\{Z_i, i = 1, \ldots, n\}$ is a random sample from the standard normal population, and $y$ and $\tilde\beta_\tau$ are the observed values of $Y$ and $\hat\beta_\tau$, respectively. Note that the only random component of $U_\tau$ is $Z_i$. It is straightforward to show that the unconditional distribution of $W_\tau(\beta_\tau)$ and the conditional distribution of $U_\tau$ converge to the same limiting distribution. Let $w_\tau(\beta)$ be the observed $W_\tau(\beta)$. Define a random vector $\beta^*_\tau$ such that $w_\tau(\beta^*_\tau) = -U_\tau$. Then, the unconditional distribution of $(\hat\beta_\tau - \beta_\tau)$ can be approximated by the conditional distribution of $(\beta^*_\tau - \tilde\beta_\tau)$. The adequacy of using the distribution of $(\beta^*_\tau - \tilde\beta_\tau)$ to approximate the unconditional distribution of $(\hat\beta_\tau - \beta_\tau)$ has been addressed by Parzen et al. (1994) through extensive simulation studies. Furthermore, the distribution of $\beta^*_\tau$ can be estimated using a large random sample $\{u_{\tau m}, m = 1, \ldots, M\}$ generated from $U_\tau$. For each realized $u_{\tau m}$, we obtain a solution $\beta^*_{\tau m}$ by solving the equation $w_\tau(\beta^*_{\tau m}) = -u_{\tau m}$, $m = 1, \ldots, M$.
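Generating realizations of $U_\tau$ can be sketched as follows. This is a minimal illustration on synthetic data, with hypothetical helper names (`cluster_scores`, `draw_U`) and with a placeholder value standing in for the observed estimate $\tilde\beta_\tau$. The key point it shows is that one multiplier $Z_i$ is shared by all observations in cluster $i$, so that, given the data, $U_\tau$ has covariance $n^{-1}\sum_i S_i S_i^{T}$, where $S_i$ denotes the bracketed within-cluster sum (the symbol $S_i$ is introduced here for convenience).

```python
import numpy as np


def cluster_scores(X, y, cluster, beta, tau):
    """Row i holds S_i = sum_j x_ij {I(y_ij - beta' x_ij <= 0) - tau}."""
    ind = (y - X @ beta <= 0).astype(float)
    contrib = X * (ind - tau)[:, None]
    labels, idx = np.unique(cluster, return_inverse=True)
    S = np.zeros((labels.size, X.shape[1]))
    np.add.at(S, idx, contrib)          # sum contributions within each cluster
    return S


def draw_U(S, M, rng):
    """M realizations of U_tau = n^{-1/2} sum_i S_i Z_i, with Z_i iid N(0, 1)."""
    n = S.shape[0]
    Z = rng.standard_normal((M, n))     # one multiplier per cluster, not per observation
    return Z @ S / np.sqrt(n)


# Synthetic clustered data, for illustration only.
rng = np.random.default_rng(1)
n, tau = 100, 0.5
sizes = rng.integers(2, 5, size=n)
cluster = np.repeat(np.arange(n), sizes)
dose = np.repeat(rng.uniform(0, 1, size=n), sizes)
b = np.repeat(rng.normal(0, 0.5, size=n), sizes)
y = 1.0 + 2.0 * dose + b + rng.normal(0, 0.5, size=cluster.size)
X = np.column_stack([np.ones_like(dose), dose])

beta_tilde = np.array([1.0, 2.0])       # placeholder for the observed estimate
S = cluster_scores(X, y, cluster, beta_tilde, tau)
U = draw_U(S, M=20000, rng=rng)

cond_cov = S.T @ S / n                  # exact conditional covariance of U_tau
emp_cov = U.T @ U / U.shape[0]          # Monte Carlo approximation from the draws
print(cond_cov)
print(emp_cov)
```

Drawing an independent multiplier for every observation instead would discard the within-cluster dependence and would target the covariance appropriate for independent data.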
The covariance matrix of $\hat\beta_\tau$ can then be estimated from the empirical distribution of $\{\beta^*_{\tau m}, m = 1, \ldots, M\}$, for example, by $\sum_{m=1}^{M} (\beta^*_{\tau m} - \tilde\beta_\tau)(\beta^*_{\tau m} - \tilde\beta_\tau)^{T} / M$. The standard bootstrap method can also be used for estimating the variance of the regression parameter estimates. However, as far as we know, there is no analytical proof that the bootstrap method is valid for the general quantile regression model with correlated observations.

In order to use existing statistical software (for example, Koenker and D'Orey, 1987) to solve an equation of the form $w_\tau(\beta) = u$ (in the resampling step, $u = -u_{\tau m}$), one may artificially create an extra data point $(y^*, x^*)$, where $x^*$ is $n^{1/2}u/\tau$ and $y^*$ is an extremely large number such that $I(y^* - \beta^{T} x^* \le 0)$ is always 0. Let $w^*_\tau(\beta) = w_\tau(\beta) + n^{-1/2} x^* \{I(y^* - \beta^{T} x^* \le 0) - \tau\}$. Since the added term equals $-u$, solving the equation $w_\tau(\beta) = u$ is equivalent to solving the equation $w^*_\tau(\beta) = 0$.

To illustrate the above method, we use an animal study in developmental toxicity evaluation of dietary di(2-ethylhexyl)phthalate (DEHP), a widely used plasticizing agent, in timed-pregnant mice (Tyl et al., 1988). DEHP was administered in the diet on days 6 through 15 of gestation with dose levels of 0, 44, 91, 191 and 292 (mg/kg/day). On the 17th gestational day, the maternal animals were sacrificed and all the fetuses were examined. One of the major outcomes for the study is the fetal body weight. The investigators would like to know whether DEHP has a negative effect on the fetal body weight. Since the sex of the fetus is expected to be correlated with the weight, an adjustment for this covariate in the analysis is needed. Here, the litter is the cluster and each live fetus is a member of the cluster. Furthermore, $Y_{ij}$ is the weight and $x_{ij}$ is a $3 \times 1$ vector, where the first component is one, the second is the dose level, and the third is the sex indicator for the fetus.
For the animal study data, there are a total of 108 clusters, and the cluster sizes range from 2 to 16. With the aforementioned quantile regression, estimates for $\beta_\tau$ and the corresponding estimated standard errors obtained based on the estimating functions (1) are reported in the third and fourth columns in Table 1.
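Returning to the computational device described earlier in this section, the artificial data point $(y^*, x^*)$ can be sketched in code. The block below is an illustration on synthetic data (not the DEHP data), using SciPy's `linprog` as a stand-in for the quantile regression software; the helper names, the fixed shift `u`, and the default scale chosen for $y^*$ are all assumptions made for this example. It appends one row with $x^* = n^{1/2}u/\tau$ and a very large $y^*$, refits, and compares $w_\tau$ at the new solution with the target $u$.

```python
import numpy as np
from scipy.optimize import linprog


def fit_quantile(X, y, tau):
    """Minimize the check-function objective (2) by linear programming."""
    n_obs, p = X.shape
    cost = np.concatenate([np.zeros(p), tau * np.ones(n_obs),
                           (1.0 - tau) * np.ones(n_obs)])
    A_eq = np.hstack([X, np.eye(n_obs), -np.eye(n_obs)])
    bounds = [(None, None)] * p + [(0, None)] * (2 * n_obs)
    res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:p]


def w(X, y, beta, tau, n):
    """Observed estimating function w_tau(beta) from (1)."""
    ind = (y - X @ beta <= 0).astype(float)
    return X.T @ (ind - tau) / np.sqrt(n)


def solve_shifted(X, y, tau, u, n, y_star=None):
    """Approximately solve w_tau(beta) = u by appending the point (y*, x*)."""
    x_star = np.sqrt(n) * u / tau
    if y_star is None:
        y_star = 1e4 * (1.0 + np.abs(y).max())   # "extremely large" (assumed scale)
    beta = fit_quantile(np.vstack([X, x_star]), np.append(y, y_star), tau)
    # The device requires the artificial indicator I(y* - beta' x* <= 0) to be 0.
    if y_star - beta @ x_star <= 0:
        raise ValueError("y_star is too small for this shift")
    return beta


# Synthetic clustered data, for illustration only.
rng = np.random.default_rng(2)
n, tau = 100, 0.5
sizes = rng.integers(2, 5, size=n)
dose = np.repeat(rng.uniform(0, 1, size=n), sizes)
b = np.repeat(rng.normal(0, 0.5, size=n), sizes)
y = 1.0 + 2.0 * dose + b + rng.normal(0, 0.5, size=dose.size)
X = np.column_stack([np.ones_like(dose), dose])

u = np.array([1.5, -1.0])                # a fixed target, standing in for -u_{tau m}
beta_hat = fit_quantile(X, y, tau)       # solves w_tau(beta) = 0 approximately
beta_star = solve_shifted(X, y, tau, u, n)

print("w at beta_hat :", w(X, y, beta_hat, tau, n))
print("w at beta_star:", w(X, y, beta_star, tau, n), "target:", u)
```

In a full resampling run, `solve_shifted` would be called once for each realized $u_{\tau m}$ with `u = -u_tau_m`, and the resulting $\beta^*_{\tau m}$ would be combined into the empirical covariance estimate. As with the unshifted fit, the match between $w_\tau(\beta^*)$ and $u$ is only approximate because $w_\tau$ is a step function.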