Small-Area Estimation: Theory and Practice
Total Page:16
File Type:pdf, Size:1020Kb
Section on Survey Research Methods Small-Area Estimation: Theory and Practice Michael Hidiroglou Statistical Innovation and Research Division, Statistics Canada, 16 th Floor Section D, R.H. Coats Building, Tunney's Pasture, Ottawa, Ontario, K1A 0T6, Canada Abstract This paper draws on the small area methodology discussed in Rao (2003) and illustrates how some of Small area estimation (SAE) was first studied at the estimators have been used in practice on a number Statistics Canada in the seventies. Small area estimates of surveys at Statistics Canada. It is structured as have been produced using administrative files or follows. Section 2 provides a summary of the primary surveys enhanced with administrative auxiliary data uses of small area estimates as criteria for computing since the early eighties. In this paper we provide a them. Section 3 defines the notation, provides a summary of existing procedures for producing official number of typical direct estimators, and indirect small-area estimates at Statistics Canada, as well as a estimators used in small area estimation. Section 4 summary of the ongoing research. The use of these provides four examples that reflect the diverse uses of techniques is provided for a number of applications at small area estimations at Statistics Canada Statistics Canada that include: the estimation of health statistics; the estimation of average weekly earnings; 2. Primary uses and Criteria for SAE Production the estimation of under-coverage in the census; and the estimation of unemployment rates. We also highlight One of the primary objectives for producing small area problems for producing small-area estimates for estimates is provide summary statistics to central or business surveys. local governments so that they can plan for immediate or future resource allocation. Typical small area estimates include Employment indicators (employed KEY WORDS: Small Area, Official Statistics, Fay- and unemployed), Health indicators (drug use, alcohol Herriot use) and Business indicators such as average salary. 1. Introduction The production of small area estimates depends on a number of factors. What the demand for such Small domain or area refers to a population for which statistics? What is the commitment and will of the reliable statistics of interest cannot be produced due to agency to support methodological, systems, and certain limitations of the available data. Examples of subject matter staff. How much methodology and domains include a geographical region (e.g. a subject matter expertise exist within the agency. How province, county, municipality, etc.), a demographic well correlated are existing auxiliary data with the group (e.g. age x sex), a demographic group within a variables of interest? Is the survey sample size large geographic region. The demand for such data small enough to allow reliable estimates by using both the areas has greatly increased during the past few years survey data and the existing auxiliary data? How much (Brackstone, 1987). This increase is due to the bias are the agency and clients willing to tolerate with usefulness of these data in government policy and the estimates,: what are the consequences for making program development, allocation of various funds and incorrect decisions? The size of the small areas in regional planning. terms of the number of the units that belong to them is also an important consideration. Small areas that are A number of national and regional statistical agencies, too small may results in confidentiality breeches. including Statistics Canada, have introduced programs Furthermore, small area estimates may be quite aimed at producing estimates for small areas to meet different from statistics based on local knowledge. the new demand. Available data to produce such estimates are based on surveys that are not designed for these levels. However, if administrative sources 3. SAE Estimators have data at the small area level, and that they are well correlated with variables of interest at the 3.1 Introduction corresponding level, several procedures are available to estimates various parameters of interest for these A survey population U consists of N distinct elements lower levels. (or ultimate units) identified through the labels j = 1, . , N. A sample s is selected from U with probability 3445 Section on Survey Research Methods p(s), and the probability of including the j-th element in Rao (2003). We will confine ourselves to just a few in the sample is π j . The design weight for each of them that include the synthetic estimator, and the more well-known composite estimators. selected unit js∈ is defined as w = 1/π . Suppose jj Ui denotes a domain (or subpopulation) of interest. 3. 2 Direct Estimation Denote as ssUii=∩ the part of the sample s that Let wj be the design weight associated with js∈ . falls in domainUi .The realized sample size of si is a The Horvitz-Thompson is the simplest direct estimator. random variable ni , where 0 ≤≤nNii. Auxiliary data x If the small area total Yi is to be estimated for small will either be known at the element level x j for js∈ area Ui , then the corresponding Horvitz-Thompson or for each small area i as totals Xxij= ∑ or ˆ estimator is given by YwyiHT, = ∑ j j provided jU∈ i js∈ i that the realized sample size n is non-zero. means XXiii= / N . i The problem is to estimate the domain total Auxiliary information can be available either at the population level or at the domain level. If it available Yyij= ∑ or the domain mean YYNiii= / , jU∈ i at the population level, then we used the Generalized where Ni , the number of elements in Ui may or may Regression Estimator (GREG) given by not be known. We define y to be y if jU∈ , and 0 ˆˆˆβ%%β ij j i YYi,,, GR=+−X ′′ i GREG() i HTX HT i , GREG where otherwise. An indicator variable aij is similarly m ˆ defined: it is equal to one if jU∈ i and 0 otherwise. Xx′ = ∑∑ j , XxHT′′= ∑ k/π k , and jU∈ i s Note that Y can be written asYy==∑∑ ya. i=1 i iijjij β% jU∈∈ jU iGREG, is the set of regression coefficient obtained by Small area estimation is categorized into two types of regressing y on x . That is estimators: direct and indirect estimators. A direct ij j estimator is one that uses values of the variable of ⎛⎞−1 % wwyjjjxx′ jjij x interest, y, only from the sample units in the domain of β = ⎜⎟∑∑, i, GREG ⎜⎟ss interest. However, a major disadvantage of such ⎝⎠ccjj estimators is that unacceptably large standard errors may where c is a specified constant ( c >0 ). result: this is especially true if the sample size within j j the domain is small or nil. An indirect estimator uses values of the variable of interest from a domain and/or The straight GREG is estimator is not efficient, and it time period other than the domain and time period of is better to use regression estimators that use auxiliary interest. Three types of indirect estimators can be data available as close possible to the small areas of identified. A domain indirect estimator uses values of interest. One such estimator is the domain–specific the variable of interest from another domain but not GREG that uses auxiliary data at the domain level. It is * ββˆˆˆˆ from another time period. A time indirect estimator given by YYXiGR,,,,,=+−X i′′ iGREG iHT iHT iGREG uses values of the variable of interest from another () −1 time period but not from another domain. An estimator βˆ ⎛⎞ where iGREG, = ⎜⎟∑∑wc jjjxx′ / j wyc jjj x / j. that is both domain and time indirect uses values of the ⎝⎠ssii variable of interest from another domain and another time period. An estimator that is approximately p-unbiased as the overall sample size increases but uses y-values outside An alternative is to use estimators that borrow strength the domain is the modified direct estimator given by across small areas, by modeling dependent on ˆˆˆββˆˆ independent variables across a number of small areas: YYXi,,, SR=+−X i′′ GREG() i HT i HT GREG where they are called indirect estimators. Indirect estimators −1 will be quite good (i.e.: indirectly increase the effective βˆ = ∑∑wcxx′ / wyc x / .This GREG()ss jjj j jjj j sample size and thus decrease the standard error) if the estimator is also referred to in Woodruff (1966), and models obtained across small areas still hold at the Battese, Harter, and Fuller (1988) as the “survey small area level. Departures from the model will result regression estimator”. in unknown biases. There is a wide variety of indirect estimators available, and a good summary is provided 3446 Section on Survey Research Methods Hidiroglou and Patak (2004) compared a number of composite estimator is most insensitive when the mean the direct estimators. One of their conclusions was that square errors of the two component estimators do not the direct estimators would be best if the domains of differ greatly. Simple weighting factors for the interest coincided as closely as possible with the composite estimators that depend on the realized design strata. domain size were given by Drew, Singh and Choudhry (1982), and by Hidiroglou and Särndal (1985) 3.2 Indirect Estimation Small area estimators are split into two main types, Some of the most widely used indirect estimators have depending on how models are applied to the data been the synthetic estimator, the regression-adjusted within the small areas: these two types are known as synthetic, the composite estimator, and the sample- area level and unit level. Small area estimators are dependent estimator. based on area level computations if models link small area means of interest (y) to area-specific auxiliary The synthetic estimator uses reliable information of a variables (such as x sample means).