Section on Survey Research Methods

Small-Area Estimation: Theory and Practice

Michael Hidiroglou Statistical Innovation and Research Division, Canada, 16 th Floor Section D, R.H. Coats Building, Tunney's Pasture, Ottawa, Ontario, K1A 0T6, Canada

Abstract This paper draws on the small area methodology discussed in Rao (2003) and illustrates how some of Small area estimation (SAE) was first studied at the estimators have been used in practice on a number Statistics Canada in the seventies. Small area estimates of surveys at Statistics Canada. It is structured as have been produced using administrative files or follows. Section 2 provides a summary of the primary surveys enhanced with administrative auxiliary data uses of small area estimates as criteria for computing since the early eighties. In this paper we provide a them. Section 3 defines the notation, provides a summary of existing procedures for producing official number of typical direct estimators, and indirect small-area estimates at Statistics Canada, as well as a estimators used in small area estimation. Section 4 summary of the ongoing research. The use of these provides four examples that reflect the diverse uses of techniques is provided for a number of applications at small area estimations at Statistics Canada Statistics Canada that include: the estimation of health statistics; the estimation of average weekly earnings; 2. Primary uses and Criteria for SAE Production the estimation of under-coverage in the ; and the estimation of unemployment rates. We also highlight One of the primary objectives for producing small area problems for producing small-area estimates for estimates is provide summary statistics to central or business surveys. local governments so that they can plan for immediate or future resource allocation. Typical small area estimates include Employment indicators (employed KEY WORDS: Small Area, Official Statistics, Fay- and unemployed), Health indicators (drug use, alcohol Herriot use) and Business indicators such as average salary.

1. Introduction The production of small area estimates depends on a number of factors. What the demand for such Small domain or area refers to a population for which statistics? What is the commitment and will of the reliable statistics of interest cannot be produced due to agency to support methodological, systems, and certain limitations of the available data. Examples of subject matter staff. How much methodology and domains include a geographical region (e.g. a subject matter expertise exist within the agency. How province, county, municipality, etc.), a demographic well correlated are existing auxiliary data with the group (e.g. age x sex), a demographic group within a variables of interest? Is the survey sample size large geographic region. The demand for such data small enough to allow reliable estimates by using both the areas has greatly increased during the past few years survey data and the existing auxiliary data? How much (Brackstone, 1987). This increase is due to the bias are the agency and clients willing to tolerate with usefulness of these data in government policy and the estimates,: what are the consequences for making program development, allocation of various funds and incorrect decisions? The size of the small areas in regional planning. terms of the number of the units that belong to them is also an important consideration. Small areas that are A number of national and regional statistical agencies, too small may results in confidentiality breeches. including Statistics Canada, have introduced programs Furthermore, small area estimates may be quite aimed at producing estimates for small areas to meet different from statistics based on local knowledge. the new demand. Available data to produce such estimates are based on surveys that are not designed for these levels. However, if administrative sources 3. SAE Estimators have data at the small area level, and that they are well correlated with variables of interest at the 3.1 Introduction corresponding level, several procedures are available to estimates various parameters of interest for these A survey population U consists of N distinct elements lower levels. (or ultimate units) identified through the labels j = 1, . . . , N. A sample s is selected from U with probability

3445 Section on Survey Research Methods

p(s), and the probability of including the j-th element in Rao (2003). We will confine ourselves to just a few in the sample is π j . The design weight for each of them that include the synthetic estimator, and the more well-known composite estimators. selected unit jsis defined as w 1/ . Suppose ∈ jj= π Ui denotes a domain (or subpopulation) of interest. 3. 2 Direct Estimation

Denote as ssUii=∩ the part of the sample s that Let wj be the design weight associated with js∈ . falls in domainUi .The realized sample size of si is a The Horvitz-Thompson is the simplest direct estimator. random variable ni , where 0 ≤≤nNii. Auxiliary data x If the small area total Yi is to be estimated for small will either be known at the element level x j for js∈ area Ui , then the corresponding Horvitz-Thompson or for each small area i as totals Xxij= ∑ or ˆ estimator is given by YwyiHT, = ∑ j j provided jU∈ i js∈ i that the realized sample size n is non-zero. means XXiii= / N . i

The problem is to estimate the domain total Auxiliary information can be available either at the population level or at the domain level. If it available Yyij= ∑ or the domain mean YYNiii= / , jU∈ i at the population level, then we used the Generalized where Ni , the number of elements in Ui may or may Regression Estimator (GREG) given by not be known. We define y to be y if jU∈ , and 0 ˆˆˆβ%%β ij j i YYi,,, GR=+−X ′′ i GREG() i HTX HT i , GREG where otherwise. An indicator variable aij is similarly m ˆ defined: it is equal to one if jU∈ i and 0 otherwise. Xx′ = ∑∑ j , XxHT′′= ∑ k/π k , and jU∈ i s Note that Y can be written asYy==∑∑ ya. i=1 i iijjij β% jU∈∈ jU iGREG, is the set of regression coefficient obtained by Small area estimation is categorized into two types of regressing y on x . That is estimators: direct and indirect estimators. A direct ij j estimator is one that uses values of the variable of ⎛⎞−1 % wwyjjjxx′ jjij x interest, y, only from the sample units in the domain of β = ⎜⎟∑∑, i, GREG ⎜⎟ss interest. However, a major disadvantage of such ⎝⎠ccjj estimators is that unacceptably large standard errors may where c is a specified constant ( c >0 ). result: this is especially true if the sample size within j j the domain is small or nil. An indirect estimator uses values of the variable of interest from a domain and/or The straight GREG is estimator is not efficient, and it time period other than the domain and time period of is better to use regression estimators that use auxiliary interest. Three types of indirect estimators can be data available as close possible to the small areas of identified. A domain indirect estimator uses values of interest. One such estimator is the domain–specific the variable of interest from another domain but not GREG that uses auxiliary data at the domain level. It is * ββˆˆˆˆ from another time period. A time indirect estimator given by YYXiGR,,,,,=+−X i′′ iGREG iHT iHT iGREG uses values of the variable of interest from another () −1 time period but not from another domain. An estimator βˆ ⎛⎞ where iGREG, = ⎜⎟∑∑wc jjjxx′ / j wyc jjj x / j. that is both domain and time indirect uses values of the ⎝⎠ssii variable of interest from another domain and another time period. An estimator that is approximately p-unbiased as the overall sample size increases but uses y-values outside An alternative is to use estimators that borrow strength the domain is the modified direct estimator given by across small areas, by modeling dependent on ˆˆˆββˆˆ independent variables across a number of small areas: YYXi,,, SR=+−X i′′ GREG() i HT i HT GREG where they are called indirect estimators. Indirect estimators −1 will be quite good (i.e.: indirectly increase the effective βˆ = ∑∑wcxx′ / wyc x / .This GREG()ss jjj j jjj j sample size and thus decrease the standard error) if the estimator is also referred to in Woodruff (1966), and models obtained across small areas still hold at the Battese, Harter, and Fuller (1988) as the “survey small area level. Departures from the model will result regression estimator”. in unknown biases. There is a wide variety of indirect estimators available, and a good summary is provided

3446 Section on Survey Research Methods

Hidiroglou and Patak (2004) compared a number of composite estimator is most insensitive when the mean the direct estimators. One of their conclusions was that square errors of the two component estimators do not the direct estimators would be best if the domains of differ greatly. Simple weighting factors for the interest coincided as closely as possible with the composite estimators that depend on the realized design strata. domain size were given by Drew, Singh and Choudhry (1982), and by Hidiroglou and Särndal (1985) 3.2 Indirect Estimation Small area estimators are split into two main types, Some of the most widely used indirect estimators have depending on how models are applied to the data been the synthetic estimator, the regression-adjusted within the small areas: these two types are known as synthetic, the composite estimator, and the sample- area level and unit level. Small area estimators are dependent estimator. based on area level computations if models link small area means of interest (y) to area-specific auxiliary The synthetic estimator uses reliable information of a variables (such as x sample means). They are based on direct estimator for a large area that spans several unit level computations if the models link unit values small areas, and this information is used to obtain an of interest to unit-specific auxiliary variables. Area indirect estimator for a small area. It is assumed that based small area estimators are computed if the unit the small areas have the same characteristics as the level area data are not available. They can also be large area: Gonzalez (1978) provides a good account computed if the unit level data are available by how these estimators were obtained, and used to obtain summarizing them at the appropriate area level. unemployment statistics at levels lower than those planned in the survey design. The National Center for 3.2.1 Area Model Health Statistics (1968) in the United States pioneered the use of synthetic estimation for developing state One of the most widely used area based level small estimates of disability and other health characteristics area estimator was given by Fay and Herriot (1979) from the National Health Interview Survey (NHIS). small. Population totals ( Yyij= ∑ ) or means Sample sizes in most states were too small to provide jU∈ i reliable direct state estimates. (YYNiii= / ), where Ni is the number of elements in

small area Ui , can be estimated. The Fay-Herriot Levy (1971) used mortality data to compute average methodology is usually presented as an estimator of relative errors of synthetic estimates for States. He the small area population mean Y for a given small used the regression-adjusted synthetic estimator to ac- i count for local variation by combining area-specific area Ui where im=…1, , . The Fay-Herriot estimator covariates with the synthetic estimator. These for small area Ui is a linear combination of a direct covariates attempted to attenuate the magnitude of ˆ potential relative bias associated with the synthetic estimator (say YiDIR, ) and a synthetic estimator estimator. ˆ (say YiSYN, ). The direct estimator of the population The potential bias associated with indirect estimators ˆ ˆˆ mean Y is given by YYNiDIR,,,= iDIR/ iDIR where Yˆ can be attenuated by combining them with the i iINDIR, ˆ % ˆ % YwyiDIR, = ∑ j j and NwiDIR, = ∑ j . The ˆ js∈ js∈ direct estimators YiDIR, via a weighted average. The i i % resulting combined estimator is given by weight wj associated with the j-th unit can be the % design weight wj (i.e. wwjj= ) or a final weight that ˆˆ ˆ YYi,, COMB=+−φφ i i DIR(1 i ) Y i , INDIR reflects any adjustment (i.e.: non-response, calibration, or a product thereof) made to the design weight. * where φφii ()01≤≤. The optimal φi is determined by The synthetic portion is estimated as the product of a ˆ minimizing the MSE of YiCOMB, . The resulting given auxiliary population mean row-vector composite estimator has a mean square error which is (say Zziji′′= ∑ / N ) for the i-th small area of smaller than that of either component estimator. jU∈ i ˆ Schaible (1978) noted that the composite estimator is interest times an estimated regression vector (say β , insensitive to poor estimates of the optimum weight. FH This insensitivity depends on the relative sizes of the where FH stands for Fay-Herriot). The auxiliary data mean square errors of the component estimators. The row-vector z′j is known for all units in the population

3447 Section on Survey Research Methods

βˆ and ˆ EXP wy w is the simple estimator small areaUi . The regression vector FH is computed ψ ijjj= ∑∑ jU jU across a number of small areas in such a way that the ∈∈ii ˆ of the mean involving the design weights wj . The model linking the variable of interest (the meanY ) iDIR, computations required to obtain the normal regression auxiliary data also holds at the small area level. The estimator do not involve estimating any variance Fay-Herriot estimator of a given population mean Yi is components. estimated as: % ˆ % 3.2.2 Unit Model YY1 Z′β (3.1) iFH,,=+−γγ i iDIR() i i FH The unit model originates with Battese, Harter and The two components (direct estimator and synthetic Fuller (1988). They used the nested error regression estimator) of (3.1) are weighted γ i and ()1−γ i where model to estimate county crop areas using sample % survey data in conjunction with satellite information. 22 β γ ivvi=+σσψ/(). The regression vector FH and γ i Their model is given by depend on the population variance ψ of the direct iDIR, β yveij=++x ij′ i ij (3.5) ˆ 2 estimator YiDIR, and the model varianceσ v . Although iid iid 22 ˆ where they assumed that veivije~(0,σσ ); ~(0, ) , the sampling variance of Y is easy to compute, it iDIR, i=1,…,m and jn= 1,K , . The small areas of interest in may be unstable if the domain sizes are small. This is i Battese, Harter and Fuller (1988) were 12 counties repaired with a smoothing of the estimated variances. (m=12) in North-Central Iowa. Each county was divided ˆ We denote the smoothed variances as ψ iDIR, . The into area segments and the areas under corn and 2 ˆ soybeans were ascertained for a sample of segments by estimated model variance σˆ and β are computed v FH interviewing farm operators. The number of sampled recursively. Details of the required computations for segments in a county ni ranged from 1 to 6. Auxiliary obtaining ˆ 2 can be found in of Rao (2003, pp. 118- σ v data were in the form of numbers of pixels (a term used 119). The estimated regression vector βˆ and the for "picture elements" of about 0.45 hectares) classified FH as corn and soybeans were also obtained for all the area ˆ factor γ i are given by: segments, including the sampled segments, in each −1 ⎡⎤ˆ county using LANDSAT satellite readings. ⎡⎤DDZZ′ Z′ Y βˆ ⎢⎥∑∑ii ⎢⎥iiDIR, (3.2) FH = 22 ⎢⎥ii11ˆˆˆˆ⎢⎥ The resulting sample mean using (3.5) is given by ⎣⎦==ψσiDIR,,++ v⎣⎦ ψσ iDIR v i yve=++x′ β (3.6) and ii.. ii . where y , x′ and e are the means of the associated n γˆˆ=+σψˆˆ22/ σ (3.3) ii.. i. i iviDIRv(), (y, x) observations and e-residuals. Battese et al. (1988)’s objective was to estimate the conditional respectively. population mean given the realized cluster (county) ˆ The Fay-Herriot estimator Y can also be expressed effect. Under the assumption of model (3.5), the iFH, conditional population mean is given by as: β Yvii..=+X ′ i (3.7) ˆˆβˆˆˆ β FH YYiFH,,=+Z i′′ FHγ i iDIR −Z i FH . (3.4) where Y , X ′ are the population means of the () ii.. associated Ni observations ( yij, x ij )in the i-th This form of the Fay-Herriot estimator is very similar % sampled cluster Ui . The corresponding predictor yi. to the “normal” regression estimator for the county mean crop area per segment is X β% v% ii′..+ ni ˆˆˆˆ 1 % YY=+−Z′′β Z β (3.5) where vn% =−− ∑ yx′ β γ with i,, REG i REG() i EXP i REG i. i() ij ij i j=1 given in Cochran (1977), where the estimated regression vector is given by −1 ⎛⎞mmnnii −1 β% (3.8) ⎛⎞⎛⎞DD BHF=−⎜⎟∑∑xx ij ij′′γγ i xx i.. i ∑∑ x ijyy ij − i x i .. i ˆ EXP ()() β ˆˆˆ ⎝⎠ij==11 ij == 11 REG= ⎜⎟⎜⎟∑∑ZZ i′′ i/ψψψ i,, EXP Z i i/ i EXP ⎝⎠⎝⎠ii==11

3448 Section on Survey Research Methods

22 12−1 and n− . γσσivvie=+() σ 2 2 2 2 Replacing σ and σ by σˆ and σˆ , we obtain a v e v e β β% The resulting best linear unbiased prediction (BLUP) survey-weighted estimator of (say YR ). The % estimator is yy% =+−γγXx′′β for the i- ˆ i,... BHF i i() i i i BHF resulting “pseudo-EBLUP” estimator YiPR, is given by th small area. However, the variance components ˆ YyX′′βˆˆˆ x β . Note that the self- 2 2 i, PR=+ i PRγ iw() iw − iw PR σ v and σ e are not known. Battese et al (1988) use the well-known-method of fitting-of-constants to estimate benchmarking property means that the sum of the them. The resulting estimator of the i-th area sample estimated small area totals is equal to the direct mean is known as the EBLUP estimator, because the estimator of the overall total Y. That is, variance components were estimated. m ˆ ˆ ′βˆ ∑ NYiiPR, =+ Y wXX − w w Prasad and Rao (1990) derived an approximation to i=1 () om()−1 for the model based mean squared error of the Battese-Harter-Fuller estimator, and also obtained its ˆˆmmnˆ i % where YNYYwyw==∑∑∑ i i, PR w ij ij and estimator to om()−1 as well. Prasad-Rao (1999) were iij===111 ˆ the first to include the survey weights in the unit level Xw is similarly defined. model: they labelled their estimator as a pseudo- 4. Applications EBLUP estimator of the small area mean Yi . The

Prasad-Rao estimator of Yi is given by 4.1 Canadian Community Health Survey: Area model % % Yy=+X′′βˆ γ −x β (3.9) The Canadian Community Health Survey CCHS is a i, PR i PR iw() iw iw PR cross-sectional health survey carried out by Statistics Canada since 2001.The survey operates on a two-year 222% 2 collection cycle. The first year of the survey cycle where γσσσiw=+ v/ v e∑ w j with ()js∈ i "x.1" is a large sample (130,000 persons), general % % *** ywyiw= ∑ j j ; wwij= ij/ ∑ w ij and wij are population health survey, designed to provide reliable js∈ i js∈ i estimates at the health region (sub-provincial areas β% calibrated weights, and PR is given by defined in terms of Census results), provincial and national levels. This portion of the survey collects −1 information related to health status, health care ⎛⎞mm β% utilization and health determinants for the Canadian PR= ⎜⎟∑∑γγ iwxx iw iw′ iw x iwy iw (3.10) ⎝⎠ii==11 population. The second year of the survey cycle "x.2" has a smaller sample (30,000 persons) and is designed Prasad and Rao (1999) also provided model based to provide provincial and national level results on expressions for the MSE of their estimator when it specific focused health topics. included the estimated variance components σ 2 andσ 2 . The CCHS is based on a multiple frame (two frames) v e sampling design of that uses. The first one, used as the

primary frame, is the area frame designed for the The sum of small area estimates do not necessarily add Canadian Labour Force Survey. This survey is up to the corresponding direct estimator. You-and Rao basically a two-stage stratified design that uses (2002) proposed an estimator of β that ensures self- probability proportional to size without replacement at benchmarking of the small area estimates to the each stage. Face to face interviews take place with corresponding direct estimator. Their estimator is individuals selected from that frame. The second frame given by uses a list frame of telephone numbers in some of the Health Regions for cost reasons. Individuals selected βˆˆβ in that frame are interviewed by telephone. Yyi, YR=+X i′′ YRγ iw iw −x iw YR (3.10) () where The area frame uses the Labour Force Frame. This resulting sample is a two-stage stratified cluster. −1 ⎛⎞mmnnii Sampling in that frame is carried out in three steps. β% ⎜⎟∑∑wwy%%xx x′ ∑∑ x x . YR=− ij ij() ijγγ iw iiw.. ij() ij − iw iiw ij Firstly, a list of the dwellings that were or had been in ⎝⎠ij==11 ij == 11

3449 Section on Survey Research Methods

scope to the Labour Force sample is identified. Secondly, a sample of dwellings was selected from this list. The households in the selected dwellings then formed the sample of households. The majority (88%) of the targeted sample was selected from the area frame. Lastly, respondents are randomly selected from households in this frame. Although a single individual is normally randomly selected from each household, the requirement to over sample youths results in a second member of a number of households to be selected as well. Face-to-face interviews are carried out with the selected respondents.

The telephone frame is mainly based on a stratified version (Health Regions) of the Canada Phone directory. Simple random sampling takes place within Figure 4.1: Health areas in British Columbia each of the resulting strata. Random digit dialling is carried out in five HRs and the three Territories. ψˆ DIR denotes the estimated variance for pˆ DIR under the r,a r,a sampling design, the associated estimated design effect The direct estimator of a population total Yi for a given is given by deffDIRˆ DIR /ˆˆ p DIR1 p DIR / n . The ˆ DIR% * r ,a=−ψ r ,a() r ,a() r ,a r ,a domain i is given by Ywyijj= ∑ where js∈ i smoothed design effect over all I=200 domains is * w% represents the overall weight that incorporates the DIR j given by def= ∑ deffDIR / I . The estimated multiple frame nature of the sampling design, non- i i ˆ DIR ˆ DIR response adjustments at each stage, where appropriate, coefficient of variation, cv() pr,a , for pi for a given and the calibration (age groups 12 to 19, 20 to 29, 30 DIR to 44, 45 to 64 and 65 or older for each sex within each domain i is defˆˆ pDIR1 p DIR / n / ˆp DIR . ()r,a()− r,a r,a r,a health region and province). More details of this The common mean model is the simplest one that can sampling design are available in Béland (2002). be implemented using the Fay-Herriot (1979)

methodology. This model assumes that the proportion Estimates of various population parameters can be of alcohol abuse is the same within each of the twenty produced for different domains. In the present Health Regions for a given age-sex group: that is, the example, taken from Hidiroglou, Singh and Hamel (2007), our parameter of interest is the proportion of linking model is given by Pr,a=+β aν r,a where Pr,a is alcohol abuser within the previously stated domains the unknown population proportion of interest, belonging to the province of British Columbia using and βa is the common mean across the health regions the two year (2000-2001) CCHS sample. The for the a-th age-sex group. The corresponding associated sample had 18,302 observations with sampling model is given by pPeˆ DIR =+. The domain sample size ranging from 20 to 238 for the 200 r,a r,a r,a domains. Figure 4.1 provides an idea of how the resulting small area estimate for the ra-th domain is Health Regions are delineated in British Columbia. ˆˆEBLUPˆˆ DIR ˆ given by ppr ,a=+−γ r ,a r ,a()1 γβ r ,a r ,a , where

σˆ 2 The i-th domain is a cross-classification of health γˆ = v (see Rao 2003, p. 116). The r,a % DIR 2 regions r ( r,,=…1 20) and age-sex groups a ψ r,a+σˆ v ( a,,=…1 10). The direct estimator of proportion of ppˆˆDIR1 DIR DIR DIR DIR r,a()− r,a DIR DIR DIR % % ˆˆ ψ r,a term given by ψ r,a = def alcohol abuse is given by pY/Nˆ r,a= r,a r,a nr,a ˆ DIR% * where Nwr,a= ∑ j . Given that, for domain ra, is obtained using the smoothed design effect js∈ r,a DIR def∑ deffDIR / I over the I=200 domains. The = i i 2 σˆ v term is obtained from the Fay-Herriot methodology: computational details for 2 estimatingσˆ v can be found in Rao (2003, p. 118). The ˆ EBLUP estimated coefficient of variation cv() pr,a for

3450 Section on Survey Research Methods

estimators could be unacceptably large measures of mse pˆ EBLUP ˆ EBLUP ()r,a error. pr,a is given by EBLUP , where ˆpr,a Rubin et al. (2007) investigated whether Small Area σψˆ 2 % DIR mseˆ p EBLUP = vr,a represents the estimated Estimation (SAE) procedures could be used to produce ()r,a % DIR 2 ψ r,a+σˆ v estimates for AWE with reasonably good estimated EBLUP mean squared errors for lower levels, namely industry leading term of MSEˆ p . Figure 4.2 is a graph ()r,a groups at the North American Industry Classification between the estimated coefficients of variation System (NAICS4) level 4 and geography at the level of resulting for the direct and indirect estimation province, that is, the "NAICS 4 x province" domains. The Average Weekly Earnings for a population

domain i (Ui ) is given by

YEy/Eiijijij= ∑∑ jU∈∈ii jU Est where y is the average weekly earnings and E is cv% ij ij the average number of employees within the j-th establishment within that domain.

A Monte Carlo study was carried out to evaluate the properties of the GREG estimator and a number of Observed Proportion SAE estimators. The y-values for the population used for the study were created for twelve months for Figure 4.2: Estimated coefficients of variation for the twelve months representing the January to December direct (blue) and EBLUP (red) estimators of proportion 2005 calendar year. In sample y-values were kept as is,

and the kept as is and the y-values for the out-of- 4.2 Canadian Survey of Employment Payroll and sample units were synthesized using the nearest Hours: Unit model neighbour using the average number of employment

and average monthly earnings (available for the whole The Canadian Survey of Employment, Payrolls and population). Some 100o samples were then Hours (SEPH) collects and publishes on a monthly independently sampled from each of the twelve basis, estimates of payrolls, employment, paid hours generated populations, preserving the longitudinal and earnings at detailed industrial and geography aspect of SEPH (i.e.: sample rotation of one-twelfth of levels. Estimators for average weekly earnings (AWE) the sample on a monthly basis). Summary statistics have been produced since the early nineties by SEPH. (r) These estimates have been produced via the based on the specific estimators, yi,EST , used of the i- generalized regression (GREG) estimator using a th small area (i=1,…,I ) computed from the Monte combination of survey and payroll deduction Carlo, included the average relative bias (ARB), (administrative) data provided to Statistics Canada by IR 11 (r) the Canada Tax department. The GREG estimator is ∑∑yYi,EST− i , and the average root IRY() approximately design unbiased (ADU). ir==11i relative mean square error (ARMSE) , 05. SEPH is currently being redesigned to redefine 11IR⎛⎞2 ⎜⎟(r) primary domains of interest, as well as incorporate ∑∑⎜⎟()yYi,EST− i . improvements on the use of the administrative data. IRYir==11⎝⎠i The resulting sample, estimated to be between 11,00 to 20,00 establishments (depending on budget Estimators considered in the Rubin et al. (2007) constraints) will be allocated to the newly defined simulation included the GREG, the Prasad-Rao (1999) strata, defined as cross-classifications of geography pseudo-EBLUP unit level, and the You-Rao (2002) (provinces) and industry (NAICS3), so that the pseudo-EBLUP area level SAE estimators given in resulting GREG estimates for AWE satisfy coefficients Section 3.0. The GREG estimator is given by of variation. The design strata are also referred to %%βˆˆβ yEwEyi,GREG=+∑∑ ijx ij′′ ij ij ij −x ij (4.1) model groups since the GREG estimators are Usii() computed at these levels as well. Estimates below this with xij′ = ()1,x ij . Here xij is the average monthly level can be obtained using domain estimation. As the sample associated will be relatively small (or non- earnings associated with the j-th sampled βˆ existent), the reliability associated with the GREG establishment within domain Ui , and is the

3451 Section on Survey Research Methods

regression estimator resulting from the (Undercount) and the gross number of persons iid erroneously included in the final Census count yex βˆ e~(,0)2 /E model ij=+ ij′ ij , with ijσ e ij . (Overcount). The sample size of the RRC is designed to produce reliable direct estimates for the provinces Figure 3 and 4 provide the ARB and ARMSE (including the two Territories),and eight age - sex respectively for construction domains in Canada for groups, with age categories are less than 19, 20 to 2005. The GREG estimator has the smallest ARB 29, 30 to 44, and 45 and over at the national level. amongst the three estimators. The Prasad-Rao (1999) The cross tabulation of these two marginal tabulations is the best estimator in terms of ARMSE. This is results in m= 96 (12*8) cells. These cells are reasonable on account that the You-Rao (2002) considered as small areas because they have too few estimator loses efficiency on account of its observations to sustain reliable direct estimates. The benchmarking property. objective is to use small area techniques to improve the reliability of the cell estimates. Dick (1995) applied the Fay-Herriot methodology for this purpose.

For the i-th cell (small area), we define the following quantities. The true (but unknown) Census count is

denoted asTi , and the corresponding observed Census

count as Ci . This means that the difference (TCii − ) is the missed unknown net undercoverage count

( M i ). This net undercoverage count is estimated by ˆ the RRC for the i-th small area is M i . The true count Figure 4.3: Average absolute relative bias for Ti can be expressed as the product of the observed construction domains in Canada for 2005 count C and the true adjustment factor i θiiiiii =+()MC / C = TC/ . The true adjustment factor can be estimated directly as yMCCˆ / . However, the direct estimator iiii =+() ˆ M i may not be reliable. The problem is cast into a Fay-Herriot context as follows.

The sampling model can be written as yeiii =+θ

where we assume that Eepi()= 0 and Figure 4.4: Average relative root mean square error Ve ψ , where ψ is assumed to be for construction domains in Canada for 2005 pi()= i i known. The linking model is given by β 4.3 Canadian Census of Population under coverage θνii =+z′ i, where zi is a set of auxiliary iid 2 The Census of Canada is conducted every five years. variables, and viv~(0,σ ) . One objective is to provide the Population Estimates Program with accurate baseline counts of the number The resulting Fay-Herriot estimator is given of persons by age and sex for specified geographic as θγˆˆ=+−z′′β ˆ y z β ˆ where areas. However, not all persons are correctly iFH, i FH i() i i FH enumerated. Two errors that occur are undercoverage - 22 γˆivvi=+σσψˆˆ/(). exclusion of eligible persons - and over coverage - erroneous inclusion of persons. This undercoverage The sampling variances are not known, but can be varies between 2 and 3 %. estimated as from ˆ v y given the sampling plan ψ i = ()i A special survey, known as the Reverse Record Check for the RRC. As these variances are for domains, they (RRC), with a sample size of 60,000 persons, estimates will be tend to be variable. Dick (1995) smoothed them the net number of persons missed by the Census. This net number combines two types of coverage errors: the gross number of persons missed by the Census

3452 Section on Survey Research Methods

by using log v(MCˆ ) log where it is ()iii=+αβ() + η iid 2 assumed that ηi N ()0,ζ .

The smoothed estimate of variance for the i-th small area is v(% MCˆ ) expˆ ˆ log . Hence, the ii=+()αβ () smoothed variance of yMC1/ˆ is iii=+ % % ˆ 2 Figure 4.6: Comparison of Direct and FH CVs ψ iii= v(MC ) / . (Source: You and Dick 2004)

Replacing the unknown ψ by ψ% leads to i i In terms of the CV comparison given in figure 4.6, the %% % 2 % z β % y z β % β HB approach achieves a large CV reduction when the θγiFH, =+− i′′ FH i() i i FH where σ v and FH sample sizes are small. As sample size increases, the are solved iteratively using the algorithm given in the CV reduction decreases. As the sample size increases, appendix. the CVs of the direct and HB estimates quite similar.

State which variables used and for Census (2011). For 4.4 Labour Force Survey further details see You, Rao and Dick (2002)

Unemployment rates are produced on a monthly The above methodology was used to estimate the 2001 basis in Canada by the Labour Force Survey (LFS). Canadian Census undercoverage. The final z-variables used in the linking model (4.3) were Yukon, Nunavet, The LFS samples some 53,000 households based on a Male 20 to 29, Male 30 to 44, Female 20 to 29, British stratified multi-stage design. The survey reduces Colombia renters, Ontario renters and North West response burden by having one-sixth of its sample Territories renters. replaced each month. For a detailed description of the LFS design, see Gambino, Singh, Dufour, Figure 1 displays the direct and FH estimates of Kennedy and Lindeyer (1998). The published undercoverage ratios by the domain sample sizes. provincial and national estimates unemployment Figure 2 displays the corresponding coefficients of rates are a key indicator of economic performance in variation (CV) of the direct and FH HB estimates. Canada.

Unemployment rates at levels lower than the provincial level are also of great interest. For instance, the unemployment rates for Census Metropolitan Areas (CMAs, i.e., cities with Population more than 100,000) and Census Agglomerations (CAs, i.e., other urban centers) receive scrutiny at local governments. However, many of the CAs do not have a large enough sample to produce adequate direct estimates. Their estimates need to be produced using SAE Figure 4.5: Comparison of Direct and HB Estimates techniques. You, Rao and Gambino (2003) used a (Source: You and Dick 2004) cross-sectional and time series model to estimate

unemployment for such small areas: their methodology Figure 4.5 supports the conclusion that the FH borrowed strength both across time and small areas. approach leads to smoothed estimates, particularly for the domains with relatively small sample sizes. When Let y denote the direct LFS estimate of the true sample size is small, some direct net undercoverage it θit estimates are negative due to the fact that the unemployment rate of the ith CA (small area) at overcoverage estimates are larger then the time t, for i =1, ..., m, t =1, ..., T, where m is the total undercoverage estimates. The FH method “corrected” number of CAs and T is the (current) time of the negative values. All the FH net undercoverage interest. Assume that the sampling model is estimates are positive.

3453 Section on Survey Research Methods

yit=+θ it e it , i = 1 ,...,m; t = 1 ,...,T

0,9 where eit ’s are sampling errors. Since the CAs can 0,8 Direct Est Fay-Herriot Space-Time 0,7 be treated as strata, the eit ’s are uncorrelated between 0,6 themselves for a given time period t. However, the 0,5 rotation results in a significant level of overlap for the 0,4 Est. cv Est. sampled households. This is reflected in the linking 0,3 0,2 model given by θβit=++xvu it′ i it where the error structure of the u ’s is assumed to follow an AR(1) 0,1 it 0 iid process, represented as uu ; 0, 2 Figure 4.8: Comparison of coefficients of variation of it=+ i,1 t− εε it it () σ unemployment rates using Direct, Fay-Herriot, and

space-time estimates for June 1999 The error structure of the e ’s is assumed known, and it as this is not the case, the sample based estimates need Acknowledgements: The author would like to to be smoothed. You, Rao and Gambino (2003) used acknowledge Jon Rao, Peter Dick and Susana the Hierarchical Bayes (HB) procedure to estimate the Rubin-Bleuer. required parameters in the error and linking equation. They compared numerically three estimators of the References unemployment rates in June 1999. These estimators were the direct estimator (Direct Est), a small area Australian Bureau of Statistics (2006). A Guide to estimator based only on the current cross-sectional Small Area Estimation - Version 1.1. Internal data (the Fay-Herriot), and one using both the cross- ABS document. sectional and longitudinal data (Space-time). Battese, G.E., Harter, R.M., Fuller, W.A. (1988). An Error-Components Model for Prediction of Crop Figure 4.7 displays these LFS estimates for the June Areas Using Survey and Satellite Data, Journal of 1999 unemployment rates for the 62 CAs across the American Statistical Association, 83, 28-36. Canada. The 62 CAs appear in the order of Brackstone, G. J. (1987). Small area data: policy issues population size with the smallest CA (Dawson and technical challenges. In R. Platek, J. N. K. Rao, Creek, BC, population is 10,107) on the left and the C. E. Sarndal, and M. P. Singh, eds., Small Area largest CA (Toronto, Ont., population is 3,746,123) Statistics, pp. 3-20. John Wiley & Sons, New York. on the right. The Fay-Herriot model tends to shrink Béland, Yves Canadian Community Health Survey the estimates towards the average of the (2002). Methodological overview. Health report, Statistics Canada, Catalogue no. 82-003-XPE unemployment rates. The space-time model leads to (0030182-003-XIE.pdf), Vol. 13, No. 3, ISSN moderate smoothing of the direct LFS estimates. For 0840-6529. the CAs with large population sizes and therefore Dick, P. (1995). Modelling Net Undercoverage in the large sample sizes, the direct estimates and the HB 1991 Canadian Census, , estimates are very close to each other; for smaller 21, 45-54. CAs, the direct and HB estimates differ substantially Drew, D., Singh, M.P., and Choudhry, G.H. (1982). for some regions. Evaluation of Small Area Estimation Techniques for the Canadian Labour Force Survey, Survey Methodology, 8, 17-47. 16 Direct Est Fay-Herriot Space-Time Fay, R.E. and Herriot, R.A. (1979). Estimation of 14 Income for Small Places: An Application of 12 James-Stein Procedures to Census Data. Journal 10 of the American Statistical Association, 74, 269- 8 6 277. 4 Fuller, W.A. (1999). Environmental Surveys Over Unemployment rate(%) Unemployment 2 Time, Journal of Agricultural, Biological and 0 Environmental Statistics, 4, 331-345. Gambino, J.G., Singh, M.P., Dufour, J., Kennedy, B. Figure 4.7: Comparison of unemployment rates using and Lindeyer, J. (1998). Methodology of the Direct, Fay-Herriot, and space-time for June 1999 Canadian Labour Force Survey, Statistics Canada, Catalogue No. 71-526.

3454 Section on Survey Research Methods

Gonzalez, M.E., and Hoza, C. (1978), Small-Area Singh, M.P., Gambino, J., Mantel, H.J. (1994). Issues Estimation with Application to Unemployment and Strategies for Small Area Data, Survey and Housing Estimates, Journal of the American Methodology, 20, 3-22. Statistical Association, 73, 7-15. Woodruff, R.S. (1966), Use of a Regression Technique Hidiroglou M.A. and Singh A., and Hamel M. (2007). to Produce Area Breakdowns of the Monthly some thoughts on small area estimation for the National Estimates of Retail Trade, Journal of the Canadian community health survey (CCHS). American Statistical Association, 61, 496-504. Internal Statistics Canada document. You, Y., and Rao, J.N.K. (2002). A Pseudo-Empirical Hidiroglou, M.A. and Särndal, C.E., (1985). Small Best Linear Unbiased Prediction Approach to Domain Estimation: A Conditional Analysis, Small Area Estimation Using Survey Weights, Proceedings of the Social Statistics Section, Canadian Journal of Statistics, 30, 431-439. American Statistical Association, 147-158. You, Y., Rao, J.N.K. and Dick, J.P. (2002) Hidiroglou, M.A. and Patak, Z. (2004). Domain Benchmarking hierarchical Bayes small area estimation using . Survey estimators with application in census Methodology, 30, 67-78. undercoverage estimation. Proceedings of the Levy, P.S. (1971). The Use of Mortality Data in Survey Methods Section 2002, Statistical Society Evaluating Synthetic Estimates, Proceedings of of Canada, 81 - 86. the Social Statistics Section, American Statistical You, Y, Rao, J.N.K., and Gambino, J.G. (2003). Association, pp. 328-331. Model-based unemployment rate estimation for Prasad, N.G.N., and Rao, J.N.K. (1990), The the Canadian Labour Force Survey: A hierarchical Estimation of the Mean Squared Error of Small- Bayes approach, Survey Methodology, 29, 25-32. Area Estimators,. Journal of the American You, Y and Dick, P. (2004). Hierarchical Bayes Statistical Association, 85, 163-171. Small Area Inference to the 2001 Census Undercoverage Estimation. Proceedings of the Prasad, N.G.N. and Rao, J.N.K. (1999). On robust ASA Section on Government Statistics, 1836- small area estimation using a simple random 1840. effects model. Survey Methodology, 25, 67-72 . Rao, J.N.K. (2003). Small Area Estimation. New York: . Wiley. Rao, J.N.K. and Choudhry, H. (1995). Small Area Estimation: Overview and Empirical study. Business Survey Methods, Edited by Cox, Binder, Chinnappa, Christianson, Colledge, Kott, Chapter 27. Rubin-Bleuer, S., Godbout S and Morin Y (2007). Evaluation of small domain estimators for the Canadian Survey of Employment, Payrolls and Hours. Paper presented at the third International Conference of Establishment Surveys July 2007 Statistical of Society Meetings. Schaible, W.A. (1978). Choosing Weights for Composite Estimators for Small Area Statistics, Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 741-746. Singh A.C. and Verret F. (2006). Mixed Linear Nonlinear Aggregate level and Matt Type for formulas? Models for Small Area Estimation for Binary count data from Surveys. Proceedings of the Statistics Canada Symposium. Singh, A.C. (2006). Some problems and proposed solutions in developing a small area estimation product for clients. ASA Proc. Surv. Res. Meth. Sec.

3455 Random effects model Section on Survey Research Methods

Appendix: Fay-Herriot computational summary Description Computation gY where Y is the small area population mean for small area; 1. Model a smooth function of Yi θii= () i i-th i=1,…,m ˆ ˆ 2. Direct estimate of θi θˆ = gY where Y is the observed direct estimate ii( ) i 3. Auxiliary data K ′ ziiipi= ()zz12,,, z 4. Linking model: Connect the 2 2 θi ′β θii=+z v i; vi i.i.d under model ()0,σ v ; σ v =model variance 5. Sampling model ˆ θθiii=+e ;sampling errors ei independent Eepii()θ = 0 and sampling

varianceVepii()θψ= i (assumed known) 6. Combine 6 and 7 ˆ β θii=++z ′ ve ii: Fay-Herriot model 2 Method of moments: 7. Estimation of σ v m 2 222% 2 Solve hmpσθσψσ=−∑ ˆ z ′β / +=− for σ via iteration ()viiviv( ()) () v i=1 21()rr+ 2 () ⎡⎤mph2 h′ 2() rconstraining to 21()r + 0 , σσvv=+−−⎣⎦() σ v* () σ v σ v ≥

m 2 2 22% where h′ σθ=−∑ ˆ −z ′β / ψσ + is an approximation to the * ()viiiv( ) () i=1 2 derivative of h()σ v . (see p. 118, Rao (2003))

8. Optimal model-based Fay-Herriot FH θγθγˆˆ=+−ˆˆ1 z ′′β ˆˆˆˆˆ =+z β γθ ˆ −z ′β =+z ′β vˆ where βˆ is the estimator iiiiiiiiiii() ( ) weighted least squares estimator of β . Now ⎡⎤⎡⎤mm−1 ββˆˆ% ˆˆˆ22′ 2 ==()σψσθψσviiiviiiv⎢⎥⎢⎥∑∑zz// ( + ) z ( + )where ⎣⎦⎣⎦ii==11 ˆ ˆˆ22 γ iviv=+σψσ/ () (see p. 116 Rao (2003)) 2 9. MSE of θˆ Leading term of MSEˆˆ E where the expectation is with iFH, ()(θθθiFH,,=− iFH i ) 2 respect to the Fay-Herriot model; see step 8; g1iv()σγψ= ii shows the ˆ ˆ −1 efficiency of θiFH, over direct estimator θi is γ i for large number of areas 22 m. If γσψσiviv=+=/ ()1/2 , then efficiency is 200% or gain in efficiency is 100%. 10. Scenarios for large efficiency 2 Sampling variance ψ i large or model variance σ v small relative to ψ i gains 11. Nearly unbiased estimator of mse ˆ : See equation (7.1.26), p. 129, Rao (2003); easily ()θiFH, MSE θˆ ()iFH, programmable

ˆ −1 12. Estimation of small area mean Yi Ygˆˆ K iFH,,,==()()θθ iFH iFH 2 ˆ ˆ ⎡⎤ˆˆ 13. MSE estimator of YiFH, mse Y= K′ θθ mse ; may not be nearly unbiased. ()iFH,,,⎣⎦() iFH () iFH Empirical Bayes (EB) and hierarchical Bayes( HB) methods are better suited for handling non-linear cases , K ˆ , see p. 133 , Rao (2003) ()θiFH,

3456