On Small Area Prediction Interval Problems
ASA Section on Survey Research Methods

Snigdhansu Chatterjee, Parthasarathi Lahiri, Huilin Li
University of Minnesota, University of Maryland, University of Maryland

Abstract

Empirical best linear unbiased prediction (EBLUP) method uses a linear mixed model in combining information from different sources of information. This method is particularly useful in small area problems. The variability of an EBLUP is measured by the mean squared prediction error (MSPE), and interval estimates are generally constructed using estimates of the MSPE. Such methods have shortcomings like undercoverage, excessive length and lack of interpretability. We propose a resampling driven approach, and obtain coverage accuracy of $O(d^3 n^{-3/2})$, where $d$ is the number of parameters and $n$ the number of observations. Simulation results demonstrate the superiority of our method over the existing ones.

Keywords: Predictive distribution, prediction interval, general linear mixed model, bootstrap, coverage accuracy.

1 Introduction

Large scale sample surveys are usually designed to produce reliable estimates of various characteristics of interest for large geographic areas. However, for effective planning of health, social and other services, and for apportioning government funds, there is a growing demand to produce similar estimates for smaller geographic areas and for other sub-populations. To meet this demand, it is necessary to supplement the survey data with other relevant information that is often obtained from different administrative and census records. In many small area applications, mixed linear models are now routinely used in combining information from various sources and explaining different sources of errors. These models incorporate area specific random effects which explain the "between small area variations", not otherwise explained by the fixed effects part of the model.

For a good review on small area research, the readers are referred to the review papers by Rao (1986, 1999, 2001, 2004), Ghosh and Rao (1994), Pfeffermann (2002), Marker (1999), and the book by Rao (2003), among others. Point prediction using the empirical best linear unbiased predictor (EBLUP) and the associated mean square prediction error (MSPE) estimation have been discussed extensively in the small area literature. But little advancement has been made in interval prediction problems. Needless to say, interval prediction is a useful data analysis tool, since it integrates the concepts of both point prediction and hypothesis testing in an effective manner. Prediction intervals are useful in small area studies in many ways. For example, prediction intervals may help establish if different counties have similar resources and needs, or if different ethnic or other sub-population groups are equally exposed to a particular disease.

In the small area context, prediction intervals are often produced using the standard EBLUP $\pm z_{\alpha/2}\sqrt{mspe}$ rule, where $mspe$ is an estimate of the true MSPE of the EBLUP and $z_{\alpha/2}$ is the $100(1 - \alpha/2)$th percentile of the standard normal distribution. These prediction intervals are asymptotically correct, in the sense that the coverage probability converges to $1 - \alpha$ for large sample size $n$. However, they are not efficient, in the sense that they have either an under-coverage or an over-coverage problem for small $n$, depending on the particular choice of the MSPE estimator. In statistical terms, the coverage error of such an interval is of the order $O(n^{-1})$, which is not accurate enough for most applications of small area studies, many of which involve small $n$. If the naive $mspe$, i.e. the one that does not account for the uncertainties involving the model parameter estimation, is used, the resulting prediction interval suffers from an under-coverage problem. On the other hand, if the Prasad-Rao (1990) second-order unbiased MSPE estimator is used, the interval suffers from the over-coverage and over-lengthening problem.

For a general mixed linear model, Jeske and Harville (1988) proposed a prediction interval for a mixed effect. But their method does not address the complexity introduced by the estimation of the variance components. In particular, they did not study the effect of estimated unknown variance components on the accuracy of the coverage error of their proposed interval.

Jiang and Zhang (2002) used a distribution-free method for constructing prediction intervals for a future observation under a non-Gaussian linear mixed model, based on the theory developed by Jiang (1998). This technique does not employ any area specific information and can be useful in constructing intervals when there is no survey data on the response variable. Jiang and Zhang (2002) proposed another method which can be applied to the situation when the sample size is large within each area. This is a technique of first obtaining the EBLUP for the random effects and the residuals. Then, under conditions sufficient to imply that the number of times each random effect is repeated (i.e., the number of observations in each small area) tends to infinity, the empirical distributions of the random effects as well as the residuals converge appropriately. This technique fails when we do not have large samples for each small area, a situation that is common in many small area applications.

Other than the papers cited above, research on small area prediction intervals is limited except for some special cases of the Fay-Herriot model (see Fay and Herriot 1979), a well-celebrated mixed regression model. Varieties of Bayesian and empirical Bayes methods are used in the Fay-Herriot model for such intervals. Because of the extensive use of the Fay-Herriot model and its particular cases in small area interval estimation, in Section 2 we review different approaches to interval estimation for this model.

In order to study the MSPE of EBLUP predictors of linear combinations of fixed and random effects, Das, Jiang and Rao (2004) considered the following general linear mixed model

$$Y = X\beta + Zv + e_n, \qquad (1)$$

where $Y \in \mathbb{R}^n$ is a vector of observed responses, $X_{n \times p}$ and $Z_{n \times q}$ are known matrices, and $v$ and $e_n$ are independent random variables with dispersion matrices $D(\psi)$ and $R_n(\psi)$ respectively. Here $\beta \in \mathbb{R}^p$ and $\psi \in \mathbb{R}^k$ are fixed parameters. The mixed ANOVA model, longitudinal models including the Fay-Herriot model and the nested error regression model are special cases of the above framework.

In this paper we address the problem of obtaining prediction intervals of a mixed effect for the above general linear mixed model. Our approach is to employ parametric bootstrap. This results in a higher order coverage accuracy of the interval compared to the existing methods, most of which are applicable only to special cases like the Fay-Herriot model. It has been long recognized that health, economic activity and other measures of human well-being depend on a number of exogenous and endogenous factors, many of which must be measured at the individual level and incorporated in the model. In statistical terms, this translates to high dimensionality of $\beta$ and $\psi$. In order to address the dimension asymptotics aspect of small area prediction, we allow the parameter dimension $d = p + k$ to grow with sample size $n$, and obtain coverage accuracy of the order $O(d^3 n^{-3/2})$.

Traditional small area interval estimates achieve $O(n^{-1})$ coverage accuracy for the Fay-Herriot model or its particular cases, for fixed values of $p$ and $k$. As stated earlier, the $O(n^{-1})$ rate is typically not adequate for related follow up applications. Hence, calibrations and corrections of these intervals have been proposed, some of which are discussed in Section 2.

[...] produce per-capita income for small places (population less than 1,000), Fay and Herriot (1979) considered the following aggregate level model and used an empirical Bayes method which combines survey data from the U.S. Current Population Survey with various administrative and census data. Their model is now widely used in the small-area literature.

The Fay-Herriot Model:

1. Conditional on $\theta = (\theta_1, \ldots, \theta_n)^T$, $Y = (Y_1, \ldots, Y_n)^T$ follows an $n$-variate normal distribution with mean $\theta$ and dispersion matrix $D$ with diagonal entries $D_{ii} = \sigma_i^2$ and off-diagonal entries 0. The sequence $\{\sigma_i\}$ is a known sequence.
   Here (and in the sequel) all vectors are taken to be column vectors; for any vector (matrix) $a$ ($A$), the notation $a^T$ ($A^T$) denotes its transpose.

2. The variable $\theta$ follows an $n$-variate normal distribution with mean given by $X\beta$ for a known $n \times p$ matrix $X$ and unknown but fixed vector $\beta \in \mathbb{R}^p$. The dispersion matrix is given by $\tau^2 I_n$, where the matrix $I_n$ is the $n$ dimensional identity matrix and $\tau$ is an unknown constant.

We are interested in obtaining a prediction interval for $\theta_i = x_i^T \beta + v_i$. There are several options for constructing such intervals: one may use only the Level 1 model for the observed data, or only the Level 2 model for the borrowed strength component, or a combination of both.

The interval for $\theta_i$ based only on the Level 1 model is given by $I_i^D(\alpha)$: $Y_i \pm z_{\alpha/2}\sigma_i$, where $z_{\alpha/2}$ is the upper $\alpha/2$ point of $N(0, 1)$. Obviously, for this interval, the coverage probability is $1 - \alpha$. However, it is not efficient, since its average length is too large to make any reasonable conclusion. This is due to the high variability of the point predictor $Y_i$.
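As a rough numerical check of the Level 1 interval $Y_i \pm z_{\alpha/2}\sigma_i$, the following minimal sketch simulates data from the two-level Fay-Herriot model and records the empirical coverage and average length of the interval. It is an illustration only, not the authors' simulation design: the function names, the intercept-only mean ($x_i = 1$), and the values of `beta`, `tau`, and the $\sigma_i$ are all assumptions chosen for convenience.

```python
import random
from statistics import NormalDist


def level1_interval(y_i, sigma_i, alpha=0.05):
    """Direct (Level 1 only) interval Y_i +/- z_{alpha/2} * sigma_i."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # upper alpha/2 point of N(0, 1)
    return y_i - z * sigma_i, y_i + z * sigma_i


def simulate_coverage(n=20, reps=2000, beta=1.0, tau=0.5, alpha=0.05, seed=0):
    """Empirical coverage and mean length of the Level 1 interval.

    Assumed setup (hypothetical, for illustration): x_i = 1 for every area,
    and known sampling standard deviations sigma_i = 1.0 + 0.1 * i.
    """
    rng = random.Random(seed)
    sigmas = [1.0 + 0.1 * i for i in range(n)]
    covered, total_length = 0, 0.0
    for _ in range(reps):
        for sigma_i in sigmas:
            theta_i = beta + rng.gauss(0.0, tau)  # Level 2: theta_i = x_i' beta + v_i
            y_i = rng.gauss(theta_i, sigma_i)     # Level 1: Y_i given theta_i
            lo, hi = level1_interval(y_i, sigma_i, alpha)
            covered += lo <= theta_i <= hi
            total_length += hi - lo
    m = reps * n
    return covered / m, total_length / m


coverage, mean_length = simulate_coverage()
print(f"coverage = {coverage:.3f}, mean length = {mean_length:.2f}")
```

Under these assumptions the empirical coverage should sit close to the nominal $1 - \alpha$, while the average length equals $2 z_{\alpha/2}$ times the mean of the $\sigma_i$. When the $\sigma_i$ are large relative to $\tau$, that length is wide compared with the spread of the $\theta_i$ themselves, which is the inefficiency noted in the text.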
For clarity of our proposed methodology and technical conditions, and also for easy use in future applications, in Subsection 2.3 we discuss our proposed prediction interval in detail for the Fay-Herriot model.

An interval based only on Level 2 ignores the crucial area specific data that is modeled in Level 1, and hence falls short on two counts: it fails to be relevant to the specific small area under consideration, and it fails to achieve sufficient coverage