Catalogue no. 12-001-XIE

Survey Methodology

June 2006 How to obtain more information

Specific inquiries about this product and related or services should be directed to: Business Survey Methods Division, Statistics Canada, Ottawa, Ontario, K1A 0T6 (telephone: 1 800 263-1136).

For information on the wide range of data available from Statistics Canada, you can contact us by calling one of our toll-free numbers. You can also contact us by e-mail or by visiting our website.

National inquiries line 1 800 263-1136 National telecommunications device for the hearing impaired 1 800 363-7629 Depository Services Program inquiries 1 800 700-1033 Fax line for Depository Services Program 1 800 889-9734 E-mail inquiries [email protected] Website www.statcan.ca

Information to access the product

This product, catalogue no. 12-001-XIE, is available for free. To obtain a single issue, visit our website at www.statcan.ca and select Our Products and Services.

Standards of service to the public

Statistics Canada is committed to serving its clients in a prompt, reliable and courteous manner and in the official language of their choice. To this end, the Agency has developed standards of service that its employees observe in serving its clients. To obtain a copy of these service standards, please contact Statistics Canada toll free at 1 800 263-1136. The service standards are also published on www.statcan.ca under About Statistics Canada > Providing services to Canadians. Statistics Canada Business Survey Methods Division

June 2006

Published by authority of the Minister responsible for Statistics Canada

© Minister of Industry, 2006

All rights reserved. The content of this electronic publication may be reproduced, in whole or in part, and by any means, without further permission from Statistics Canada, subject to the following conditions: that it be done solely for the purposes of private study, research, criticism, review or newspaper summary, and/or for non-commercial purposes; and that Statistics Canada be fully acknowledged as follows: Source (or “Adapted from”, if appropriate): Statistics Canada, year of publication, name of product, catalogue number, volume and issue numbers, reference period and page(s). Otherwise, no part of this publication may be reproduced, stored in a retrieval system or transmitted in any form, by any means—electronic, mechanical or photocopy—or for any purposes without prior written permission of Licensing Services, Client Services Division, Statistics Canada, Ottawa, Ontario, Canada K1A 0T6.

July 2006 Catalogue no. 12-001-XIE ISSN 1492-0921 Frequency: semi-annual Ottawa

Cette publication est disponible en français sur demande (no 12-001-XIF au catalogue).

Note of appreciation

Canada owes the success of its statistical system to a long-standing partnership between Statistics Canada, the citizens of Canada, its businesses, governments and other institutions. Accurate and timely statistical information could not be produced without their continued cooperation and goodwill. 4 Longford: Sample Size Calculation for Small•Area Estimation Vol. 32, No. 1, pp. 87­96 Stati stics Canada, Catalogue No. 12­001

Sample Size Calculation for Small­Area Estimation

Nicholas Tibor Longford 1

Abstract We describe a general approach to setting the sampling design in surveys that are planned for making inferences about small areas (sub­domains). The approach requires a specification of the inferential priorities for the areas. Sample size allocation schemes are derived first for the direct estimator and then for composite and empirical Bayes estimators. The methods are illustrated on an example of planning a survey of the population of Switzerland and estimating the mean or proportion of a variable for each of its 26 cantons.

Key Words: Efficiency; Inferential priority; Sample size allocation; Small­area estimation.

1. Introduction designed for estimating national quantities but, sometimes almost as an afterthought, are used for inferences about Sampling design is a key device for efficient estimation small areas. This would be appropriate if the sampling and other forms of inference about a large population when designs optimal for small­area and national inferences were the resources available do not permit collecting the relevant similar. We illustrate in this paper that this is not the case information from every member of the population. In this and that sampling design can be effectively targeted for context, efficiency is interpreted as the optimal combination small­area estimation, taking into account the goal of of a sampling design and an estimator of a population quan­ efficient estimation of national quantities. To avoid the tity θ. By optimum we understand minimum mean squared trivial case, we assume that the areas have unequal popu­ error, although the development presented in this paper can lation sizes. We apply the methods to the problem of be adapted for other criteria. The pool of the possible planning inferences about the 26 cantons of Switzerland; sampling designs is delimited by the resources, and these their population sizes range from 15,000 (Appenzell­ are usually expressed in terms of a fixed sample size. This is Innerrhoden) to 1.23 million (Zürich). The population of not always appropriate because the designs may not entail Switzerland is 7.26 million. identical average costs per subject. However, within a Literature on the subject of planning surveys for small­ limited range of designs, this issue can be ignored. area estimation is rather sparse. An important contribution is The problem of setting the sampling design for the Singh, Gambino and Mantel (1994). In one of the ap­ purpose of efficient estimation of a single quantity is well proaches they discuss, the planned sample size for the understood, and solutions are available for many commonly Canadian Labor Force Survey is split into two parts. One encountered settings. Most of them involve a univariate part is allocated optimally for the purpose of national constrained optimisation problem. Setting the sampling (domain) estimation and the remainder optimally for small­ design for estimating several quantities represents a quan­ area estimation. For the latter goal, equal subsample sizes tum leap in complexity, because the problem involves are allocated to each area when the areas have equal within­ several factors, typically one for each quantity. It is essential area variances, the finite population correction can be to optimise the design simultaneously for all the factors, ignored and the areas have equal survey costs per subject, because the goals of efficient inference about the target but also when the targets of inference are area­level means. quantities may be in conflict. For example, in small­area When the targets are population totals, equal allocation to estimation, a more generous allocation of the sample size to the areas is not efficient, because it handicaps estimation for one area has to be compensated by a less generous allo­ more populous areas. Even when proportions or rates (per­ cation to one or several other areas. centages) are estimated, the within­area variances depend on Small­area statistics have become an important research the population proportion, although the dependence is weak topic in survey methods in the last few decades (Fay and when all the proportions are distant from zero and unity. For Herriot 1979; Platek, Rao, Särndal and Singh 1987; Ghosh more recent developments in sampling design for small­area and Rao (1994), Longford 1999; and Rao 2003), stimulated estimation, see Marker (2001). by increasing interest of government agencies, the adver­ The next section describes the proposed approach based tising and marketing industry and the financial and in­ on minimising the weighted sum of the sampling variances surance sector. At present, many large­scale surveys are (mean squared errors) of the planned estimators, with the

1. Nicholas Tibor Longford, Departament d’Económia i Empresa, Universitat Pompeu Fabra, Ramón Trias Fargas 25­27, 08005 Barcelona, Spain. E­mail: [email protected]. Survey Methodology, June 2006 5

weights specified to reflect the inferential priorities. It is ∂v P d = const. applied first to direct estimation of the area­level quantities. d ∂n d Then it is extended to incorporate the goal of national estimation, and, finally, to composite estimation in section An analytical expression for the optimal subsample sizes 2 3. The concluding section 4 contains a discussion. nd cannot be obtained in general, but when vd = σ d / nd , The remainder of this section introduces the notation as in simple random sampling within areas, the solution is used in the rest of the paper. We assume that area­level proportional to σ d Pd , that is,

population quantities θd ,d = 1, ..., D , are estimated by ˆ † σ d P d θ d with respective mean squared errors (MSE) vd that are n = n . d σ P + ... + σ P functions of the within­area subsample sizes nd; v d = 1 1 D D v( n ). , n d d The overall sample size is denoted by and is 2 2 assumed to be fixed. The population sizes are denoted by When the within­area variances σ d coincide, σ1 = ... = 2 2 N (overall) and N (for area d ). For brevity, we denote σD = σ , this simplifies further; the optimal sample sizes d 2 Τ are proportional to Pd and do not depend on σ . = n (n1 , ..., n D ) . Most population quantities θ are functions of a single variable, such as its mean, total, and the In most contexts, it is difficult to elicit a suitable set of priorities Pd , and so it is more constructive to propose a like. The variable may be recorded in the survey directly, or Τ constructed from one or several such variables. Although convenient parametric class of priorities P = (P1 , ..., PD ) our development is not restricted to such quantities, the and illustrate their impact on the sample size allocation. We q motivation is more straightforward with them. An estimator propose the priorities Pd = N d for 0 ≤ q ≤ 2. For q = 0, inference is equally important for every area. With in­ of θ d is said to be direct if it is a function of only the creasing q, relatively greater importance is ascribed to variable concerned on subjects in area . d 2 more populous areas. When vd = σ / nd , the optimal We assume that each direct estimator considered is † unbiased. This is not particularly restrictive, as most direct sample size allocation for q = 2, nd = n Nd / N , is propor­ estimators are naive estimators or are closely related to tional to the population sizes in the areas, and so the same them. We assume that the sample sizes for the areas are sampling design is optimal for national and area­level under the control of the survey designer. This is the case in inferences. For q > 2 the sample size allocation is even stratified sampling designs in which the strata coincide with more generous to the most populous areas, at the expense of the areas. In section 4, we discuss sampling designs in less populous areas. As this is counterintuitive in the context which such control cannot be exercised; they are of small­area estimation, the choice of an exponent q > 2 is particularly relevant for divisions of the country into many probably never appropriate. A negative priority exponent q (hundreds of) areas. would be suitable for a survey that aims to focus on the least populous areas. Of course, such a design is very inefficient for estimating the national quantity θ, especially when the 2. Optimal Design for Direct Estimation areas have widely dispersed population sizes.

We resolve the conflict between the goals of efficient The inferential priorities Pd may be functions of quantities other than N . For example, the sizes of certain estimation of the area­level quantities θ d by choosing the d area­level sampling design that minimises the weighted sum subpopulations of focal interest, such as an ethnic minority of the sampling variances (MSEs), in the area, may be used instead of Nd , Pd may be defined D differently in the country’s regions, or the formula for them minn ∑ Pd v d , (1) may be overriden for one or a few areas. d = 1 In some publications of survey analyses, an estimate is subject to the constraint of fixed overall sample size reported only when it is based on a sufficiently large sample Τ n = n 1D; 1 D is the vector of unities of length . D The size or its coefficient of variation (the ratio of the estimated coefficients Pd are called inferential priorities. Greater standard error and the estimate) is smaller than a specified value of Pd (in relation to the values Pd ′ ,) d′ ≠ d implies a threshold. If a ‘penalty’ for not reporting a quantity is greater urgency to reduce vd , because the contribution of specified, it can be incorporated in the definition of the area d to the sum in (1) in magnified more than for the inferential priorities. The difficulty that may arise is that the other areas. objective function in (1) is discontinuous and the standard The optimisation problem in (1) is solved by the method approaches to its optimisation are no longer applicable. The of Lagrange multipliers, or simply by substituting penalty has to be set with care. If it is too low it is

n1 = n − n2 −... − nD , so that the problem then involves ineffective; if it is set too high the solution will prefer − 1 D functionally unrelated variables. The solution reporting estimates for as ma ny areas as possible, but each satisfies the condition with sample size or precision that narrowly exceeds the set

Statistics Canada, Catalogue No. 12-001 6 Longford: Sample Size Calculation for Small•Area Estimation

threshold. See Marker (2001) for an alternative approach to efficiency of the national estimator. Consider the stratified this problem. estimator Figure 1 illustrates the impact of the priority exponent q D on the sample size allocation for a survey planned in ˆ 1 ˆ θ = ∑ N d θ d Switzerland, with the aim of estimating the population N d =1 means of a variable in its 26 cantons, assuming a common 2 ˆ within­canton variance σ . The planned overall sample size of the national mean θ of a variable, where θ d are is = 10,000 n . The curves in either panel connect the unbiased estimators of the within­canton means of the same optimal sample sizes for each exponent ; q they are drawn variable. Assuming stratified sampling with simple random ˆ on the linear scale (on the left) and on the log scale (on the sampling within strata (cantons), with θ d set to the within­ right). The population sizes are marked on the horizontal bar stratum sample means, at the bottom of each plot. On the log scale, the curves are 1 D N 2 linear. The log scale is useful also because the population var (θˆ ) = d (1− f )σ 2 , 2 ∑ d d sizes of the cantons are more evenly distributed on it. N d =1 nd For = 0, q each canton is allocated the same sample size, 10,000/26= 385, and for = 2 q the allocation is where fd = nd / N d is the finite population correction. proportional to the canton’s population size. For inter­ Figure 2 displays the function that relates the standard ˆ mediate values of , q sample sizes of the least populous error var (θ ) to the priority exponent q, calculated 2 cantons are boosted in relation to proportional allocation assuming σ = 100. The standard error is a decreasing ( = 2), q at the expense of reduced allocation to the most function of q ; it decreases more steeply at q = 0 than at populous cantons. The subsample sizes depend very little on q = 2, where it is quite flat. For q = 2, the goals of canton­ q for cantons with population of about 250,000, level and national estimation are in accord, and ˆ ˆ approximately 3% of the national population size. var (θ) = 0.100. For q = 0, var (θ) = 0.143; in this setting, optimality of the small­area estimation exerts a 2.1 The Priority for National Estimation considerable toll on national estimation, equivalent to As the canton­level subsample sizes differ from the halving the sample size (0.143/ 0.100 B 2). For negative proportional allocation for priority exponent < 2, q optimal q, the toll is even greater. canton­level estimation is accompanied by a loss of 00 0

q= 2 2,

0 q= 1.75 000 50 , 1 1,

)

q= 1.5

ns

) o

500

nt ns

a

o q = 0

q= 1.25

0 (c nt

a e q = 0.25 00

z , (c i

1 q= 1 s

e

q = 0.5 z e 200 i l

s

p

q= 0.75

e

q = 0.75 m l

a p

m q= 0.5 q = 1

a ubs 100

S

q= 0.25 q = 1.25 ubs 500

S

q= 0

50 q = 1.5

q = 1.75

q = 2 0 20

0 500 1,000 10 20 50 200 500 Population (in ’000) Population (in ’000) Log scale Figure 1. The sample size allocation to the Swiss cantons for a range of priority exponents q. The population sizes of the cantons are marked on the horizontal bar at the bottom of each plot.

Statistics Canada, Catalogue No. 12-001

Survey Methodology, June 2006 7

2 2 where Pd′ = Pd + GP+ Nd / N . The optimal sample sizes for the areas are

* σ d P d′ nd = n . σ1 P1′ + ... + σ D PD ′

This corresponds to an adjustment of the priorities Pd by 2 2 Standard error GP+ Nd / N . Note that this adjustment is neither additive nor multiplicative. The priority is boosted more for the more populous areas. As a consequence, the area­level subsample

0.00 0.05 0.10 sizes are dispersed more when the relative priority for 0.0 0.5 1.0 1.5 2.0 national estimation is incorporated and the area­level Priority exponent q priorities are unchanged. The finite population correction Figure 2. The standard error of the national estimator θˆ of the has no impact on n* because it reduces each sampling mean of a variable, as a function of the exponent q d for priorities of the canton­level estimation. variance vd and v by a quantity that does not depend on n. The priority G can be set by insisting that the loss of efficiency in estimating the national quantity θ does not Thus, the need for efficiency of the national estimator exceed a given percentage or that at most a few (or none) of

can be addressed by increasing the priority exponent. For the absolute differences |Pd′ − Pd | or log­ratios |log (Pd′ / Pd ) | example, the parties with rival inferential interests may are very large. However, the analytical problem is simple to negotiate about how much loss in efficiency of θˆ can be solve, so the survey management can be presented by the afforded, and the priority exponent would then be set to sampling designs that are optimal for a range of values of G . match this loss. Alternatively, this loss may be considered The dependence of the subsample size on the exponent by applying the optimal design for area­level estimation. If q and relative priority G is plotted in Figure 3 for the least it is regarded as excessive, q is increased until a balance is and most populous cantons, Appenzell­Innerrhoden and struck between the losses of efficiency for national and Zürich, in the respective panels A and C. Panels B and D small­area estimation. plot the same curves as A and C, respectively, on the log An unsatisfactory feature of these approaches is that they scale. Ignoring the goal of national estimation corresponds compromise the original purpose of the priorities − P to to G = 0 and ignoring the goal of small­area estimation to reflect the relative importance of the inferences about the very large values of G . Throughout, we assume that distinct small areas. This drawback is addressed by n = 10,000 and σ2 = 100, common to all cantons. ˆ associating θ with a priority, denoted by , G relative to For each exponent q < 2, the sample­size curve nd (G ) small­area estimation, and considering optimal estimation of decreases for the less populous and increases for the more

the set of D area­level targets θ d together with the national populous cantons toward the proportional representation

target θ. Thus, we minimise the objective function nd = nNd / N , which corresponds to q = 2. On the linear scale, the increase is quite rapid for Zürich for small q and D G , whereas the reduction for Appenzell­Innerrhoden is ∑ Pd v d()(), n d + GP+ v n d =1 more gradual. As the relative priority G is reduced, the excess sample size is re­distributed from Zürich (and a few ˆ Τ where =var ( θ ) v and P+ = P 1 D . The factor P+ is other populous cantons) to several less populous cantons. introduced to ameliorate the effect of the absolute sizes of Figure 4 plots the ‘national’ standard error var (θˆ )

Pd and the number of areas on the relative priority . G The under the optimal sample allocation for an array of values of ˆ priorities Pd can be interpreted only by their relative sizes, q and G . The diagram shows that the standard error of θ as, for any constant c> 0, Pd and cPd correspond to is reduced radically by a small increase of G in the vicinity identical sets of priorities for small­area estimation in (1). of G = 0, whereas for larger values of G it is affected only When the sampling design within each area is simple slightly. For each G , higher priority exponent q is random and θˆ is the standard stratified estimator, the associated with higher precision of θˆ . minimum is attained when

2 P d′ σd 2 = const, nd

Statistics Canada, Catalogue No. 12-001

8 Longford: Sample Size Calculation for Small•Area Estimation

Appenzell­Innerrhoden Log scale

q = 0 q = 0

q = 0.2

q = 0.4 q = 0.2 q = 0.6

q = 0.8 q = 0.4 q = 1 100 200 300 q = 0.6 q = 1.2 Sample size Sample size q = 0.8 q = 1.4

q = 1 q = 1.6

100 200 300 q = 1.2 q = 1.4 q = 1.8 q = 1.6 q = 1.8 q = 2 q = 2 0 100 200 300 400 0.5 2.0 5.0 20.0 100.0 500.0 Relative priority G Relative priority G A B

Zurich Log scale

q = 2 q = 2

q = 1.8 q = 1.8 q = 1.6

q = 1.6 q = 1.4

q = 1.2 q = 1.4 q = 1 q = 1.2 q = 0.8 q = 1

Sample size Sample size q = 0.6 q = 0.8

q = 0.6 q = 0.4 q = 0.4 q = 0.2 q = 0.2 q = 0 q = 0 500 1,000 1,500 500 1,000 1,500 0 100 200 300 400 0.5 2.0 5.0 20.0 100.0 500.0 Relative priority G Relative priority G C D ˆ Figure 3. The optimal sample sizes for the direct estimator θ d for combinations of priority exponents q and relative priorities G for the least and most populous cantons.

q = 0 3. Composite Estimation

4 0.1

The resources available for the conduct of a survey are q = 0.2

used most effectively by the optimal combination of a r .13

0 sampling design and estimator(s), and so the sampling

q = 0.4

erro

design and (the selection of) the estimator should be, in

rd

a q = 0.6 d

.12 ideal circumstances, optimised simultaneously. This prob­ 0

tan

S q = 0.8 lem is difficult to solve formally in most settings, although

q = 1 some estimators are more efficient than their competitors in .11

0 q = 1.2 a wide range of designs. Composite estimators (Longford q = 1.4 q = 1.6 q = 1.8 1999, 2004) are one such class. They are convex combina­ .10

0 tions of the direct small­area and national estimators, 0 100 200 300 400 Relative priority G % ˆ ˆ θd =(1 −bd ) θ d + b d θ , (2) Figure 4. The standard error of the national estimator for the allocation that is optimal under an array of priorities given by q and . G

Statistics Canada, Catalogue No. 12-001

Survey Methodology, June 2006 9

with area­specific coefficients bd that are estimates of the This equation does not have a convenient closed­form % optimum. The composition θ d exploits the similarity of the solution, but iterative schemes can be applied to solve it. areas; it is particularly effective when the areas have a small The value of n1 determines the remaining sample sizes nd , 2 − 1 2 between­area variance σB = D ∑ d (θd − θ ) , where and so optimisation corresponds to a one­dimensional − 1 θ = D ∑ d θ d . This variance is defined over the D pop­ search. If the provisional sample sizes n based on a set Τ ulation quantities θ d and is unaffected by the sampling value of n1 are too large, n 1D > n, n1 is reduced and the 2 design. In practice, σ B has to be estimated. When planning other sample sizes nd are calculated by solving (4). Note 2 2 a survey, estimates from other surveys of the same or a that the solution depends on the variances σ d and σ B . The related population have to be used, and the uncertainty problem is simplified somewhat when the areas have a 2 2 2 2 about σ B addressed. This can be done by sensitivity anal­ common variance σ = σ1 = ... = σ D . Then the solution of ysis, exploring the optimal designs for a range of plausible (4) depends on the variances only through the ratio 2 2 2 2 values of σ B . ω = σB / σ because σ is a multiplicative factor and has If the deviations ∆d = θd − θ were known the optimal no impact on the optimisation. * 2 coefficient bd in (2) would be, approximately, bd = σ d / By way of an example, suppose q = 1 and G = 10 in 2 2 (σd + nd ∆ d ). As ∆ d is not known (otherwise θ d would be planning a survey of the population of Switzerland with 2 estimated with high precision by θ + ∆ d ), we replace ∆ d n = 10,000, and ω = 0.10 is assumed. As the initial 2 by its average over the areas, equal to σ B , yielding the solution, we use the allocation optimal for direct estimation 2 2 coefficient bd =1/(1+ nd ωd ), where ωd = σB / σ d is the with the same values of q and G . One iteration updates the 2 variance ratio. The variance σ B also has to be estimated, sample size for each canton and, within it, the updating for but when there are many areas it is estimated with precision all but the arbitrarily selected reference canton d = 1 is also 2 much higher than most ∆ d are. iterative. The reference canton’s provisional subsample size If the coefficients bd are estimated with sufficient preci­ determines the current value of the constant on the right­ % sion the composite estimator θ d is more efficient than the hand side of (4). Then equation (4) is solved, iteratively, for ˆ ˆ two constituent estimators θ d and θ. Ignoring the un­ each canton d = 2, ..., D, using the Newton method. In the certainty about the within­ and between­area variances, as application, the number of these iterations was in single well as the national mean θ and the correlation between the digits for each canton. Finally, the subsample size for the national and area­level (direct) estimators, the average MSE reference canton is adjusted by the 1/ D ­multiple of the % of θ d is difference between the current total of the subsample sizes and the target total n. The updating of the cantons is itself 2 % σ B iterated, but only a few iterations are required to achieve aMSE(θd ) = , (3) 1 + nd ωd convergence; for example, all the changes in the subsample sizes were smaller than 1.0 after three iterations, and smaller 2 where ‘aMSE’ denotes the MSE in which ∆ d is replaced by than 0.01 after eight iterations. The convergence is fast 2 σ B , its average over the areas. The aMSE in (3) is also an because the starting solution is close to the optimum; the approximation to the conditional variance of the EBLUP largest difference between the two subsample sizes is for estimator of the area­level mean based on the two­level Zürich, 20.0 (from 1199.5 at the start to 1219.5 after eight (empirical Bayes) model (Longford 1993, Goldstein 1995, iterations). For Appenzell­Innerrhoden, the sample size is Marker 1999, and Rao 2003). See Ghosh and Rao (1994) reduced from 81.6 to 73.4. Change by less than unity takes for an authoritative review of application of these models to place for five cantons with population sizes in the range small­area estimation. 228,000–278,000. Note that the subsample sizes would in For the composite estimators of the area­level means, we practice be rounded, and possibly adjusted further to search for the sample allocation that minimises the objective conform with various survey management constraints. function No priority for national estimation D % ∑ Pd aMSE (θd ) + GP+ v. If national estimation has no priority, G = 0, equation (4) d = 1 has the explicit solution The solution satisfies the condition q / 2 * nω + D N d 1 q 2 2 2 n = − , N σ ω N σ d (q ) d B d + GP d d = const. ωU ω 2 + 2 2 (4) (1+ nd ωd ) N nd

Statistics Canada, Catalogue No. 12-001 10 Longford: Sample Size Calculation for Small•Area Estimation

(q ) q / 2 q / 2 where UNN=1 +... + D . This allocation is related to using different line types, is represented for a canton by a † the allocation nd , d= 1, ..., D , that is optimal for direct graph of the optimal sample size as a function of the estimation of θ d by the identity variance ratio ω . The limit of this function for ω→ +∞, equal to the sample size optimal for direct estimation, is 1  DN q / 2  n* = n † + d − 1 . marked by a bar at the right­hand margin of the panel. For d d  (q )  ω  U  ω= 0, the sampling design optimal for estimation of the national mean θ is obtained. Panels A and B at the top are Hence, when q > 0, the allocation optimal for composite for the overall sample size = 10,000 n and panels C and D estimation is more dispersed than for direct estimation. The for = 1,000. n (q) 2/ q break­even population size is NT = (U / D ) ; areas The diagram shows that the optimal sample sizes are * * with population sizes Nd < N T have smaller subsample nearly constant in the range ω∈( ω , + ∞ ); ω increases sizes for composite than for direct estimation, and areas with , qG and 1/ . n This is a consequence of the relatively with greater population sizes have greater subsample sizes. large sample size , n which ensures that the subsamples of * (For q = 0, nd ≡ n / D ). The amount of extra dispersion is most cantons are too large for any substantial borrowing of inversely proportional to ω. strength across the cantons to take place, unless the cantons For ω = 0, the equations for the optimal sampling design are very similar ().ω< ω* Most shrinkage coefficients θ lead to a singularity. In this case, each d is estimated bd =+1/(1 nd ω ) are very small. When = 10,000 n is efficiently by the national estimator θˆ , and so the design planned, for small values of ω, the optimal sample size optimal for composite estimation coincides with the design increases steeply for the least populous canton and drops * that is optimal for the national estimator (nd = nNd / N ). precipitously for the most populous canton. Dispersion of For q > 0, the optimal allocation yields negative sample the optimal sample sizes increases with q and , G * sizes nd when converging to the optimal allocation for estimating the θ, 2 / q national mean which corresponds to ω= 0. In contrast, U (q )  the optimal sample sizes are discontinuous at ω= 0 when N d <   . (5) nω + D  = 0; G the solutions diverge to −∞ for the least populous cantons. This (formal) solution is not meaningful. A negative In panels C and D, for = 1,000, n the dependence of the solution should come as no surprise because the aMSE in sample sizes on ω persists over a wider range of ω

(3) is an analytical function for nd > −1/ ωd . For small because there is a greater scope for borrowing strength ω> 0, the aMSE is a shallow decreasing function of the across the cantons with the smaller sample sizes. The * sample size nd . A negative nd indicates that a (small) optimal sample sizes are not monotone functions of ω ; for canton is not worth sampling because of its low inferential the least populous cantons there is a dip at small values of

priority Pd . Although additional sample size for a more ω . The dip is more pronounced for small G and large , q populous canton d′ may yield a smaller reduction of aMSE that is, when the disparities of the cantons’ priorities are than it would for a small canton , d its impact is magnified greater and inference about the national mean is relatively

by the larger priority Pd ′ . unimportant. This phenomenon, somewhat exaggerated by the log­scale of the vertical axis, is similar to the case Positive priority for the national mean discussed for = 0. G Because of the disparity in the

priorities Pd , a small reduction of aMSE for a more The aMSE in (3) ignores the uncertainty about the populous canton is preferred to a greater reduction for a less national mean θ, and this becomes acute when one of the populous canton. The dip is present also when = 10,000, n cantons is not represented in the sample. This deficiency of but it is so shallow and narrow as to be invisible with the (3) can be compensated for by setting the relative priority resolution of the graph. Note that the horizontal axes in G to a positive value. panels C and D have three times wider range of values of ω Figure 5 summarises the impact of the relative priority than in panels A and B. G and the priority exponent q on the optimal sample sizes In the context of the planned survey, it was agreed that of the least and most populous cantons, together with canton ω is unlikely to be smaller than 0.05. Therefore, the sample Thurgau which has the 13 th (median) largest population size, size calculations could be based on the direct estimator. 228,000. Each setting of , q indicated in the title, and , G

Statistics Canada, Catalogue No. 12-001

Survey Methodology, June 2006 11

Priority q = 0.5; n=10,000 Priority q = 1; n=10,000

0 0 ,00 ,00 2 2 Zurich

Zurich 0 Thurgau 0 0 0 5 5 e e z z i i

s s Thurgau

e e pl pl

Appenzell­Innerrhodden

0 m m 0 0 a a 0 1 S 1 S

0 0 Appenzell­Innerrhodden

5 5

G = 1

0 0 G = 10 2

2

G = 100

0 0

1 1

0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.1 0.2 0.3 0.4 0.5

Variance ratio Variance ratio

A B

Priority q = 0.5; n=1,000 Priority q = 1; n=1,000

0 0

0 0

1 1

Zurich Zurich

Thurgau 0 0

5 5

e e z z i i s s

Thurgau 0 0 e e 2 2 pl pl

m m 0 0 a a

1 Appenzell­Innerrhodden 1 S S

5 5 Appenzell­Innerrhodden

G = 1

2 G = 10 2 G = 100

1 1

0.0 0.5 1.0 1.5 0.0 0.5 1.0 1.5 Variance ratio Variance ratio C D Figure 5.The sample sizes optimal for composite estimation of the population means for three cantons for a range of variance ratios ω, priority exponents = 0.5 q and = 1.0 q and relative priorities = 1, 10 G and 100. The overall sample sizes are 10,000 (panels A and B) and 1,000 (panels C and D).

4. Discussion in particular. The method can be extended to more complex estimators, but then the values of further parameters are The method described in this paper identifies the optimal required. A more constructive approach regards the optimal design for the artificial setting of stratified sampling with sampling design for the simplified setting as an approx­ simple random sampling within homoscedastic strata. imation to the sampling design that is optimal for the more Specifying the priorities for small­area and national esti­ realistic setting. Even if the optimal sampling design were mation is a key element of the method. In practice, the identified, it could not be implemented literally, because of priorities may be difficult to agree on, and some of the imperfections in the sampling frame and (possibly) infor­ assumptions made may be problematic, the assumptions of mative and unevenly distributed nonresponse. However, the equal within­stratum variances and simple random sampling approach can be applied, in principle, to any small­area

Statistics Canada, Catalogue No. 12-001

12 Longford: Sample Size Calculation for Small•Area Estimation

estimator that has an analytical expression for the exact or For this, an analytically simple solution that can be executed approximate MSE. This includes all estimators based on many times, for a range of settings, is preferred to a more empirical Bayes models, to which the composite estimator complex solution, the properties of which are more difficult is closely related. Sampling weights can be incorporated in to explore. sample size calculation if they, or their within­area Multivariate composite estimators exploit the similarity distributions, are known, subject to some approximation, in not only across areas, but also across (auxiliary) variables, advance. Sample size calculation for a single (national) time, subpopulations, and the like (Longford 1999 and quantity entails the same problem. 2005). The aMSEs of these estimators depend on the scaled Although the numerical solution of the problem for variance matrix Ω , the multivariate counterpart of ω . composite estimation with a positive priority G is simple Sample size calculation for this method is difficult to and involves no convergence problems, it is advantageous implement directly because both variances and covariances to have an analytical solution, so that a range of scenarios in Ω are essential to the efficiency of the estimators. A can be explored. The proximity of the solutions for the more constructive approach matches the matrix Ω with a direct and composite estimation suggests that the allocation ratio ω that can be interpreted as the similarity of the areas optimal for direct estimation may be close to optimum also after adjusting for the auxiliary information, as in empirical for composite estimation with realistic values of ω, say, Bayes methods. ω> 0.05. When control over the sample sizes allocated to the areas Various management and organisational constraints are is not possible sample size calculation is still meaningful as another obstacle to the literal implementation of an analyt­ a guide for how the sample sizes should be allocated on ically derived sampling design. In household surveys, it is average. In general, a unit reduction of the sample size is often preferable to assign an (almost) full quota of addresses associated with greater loss of precision than a unit increase. to each interviewer, and so sample sizes that are multiples of Therefore, designs in which the sampling (replication)

the quota are preferred. These and numerous other con­ variance of the subsample sizes nd ( d fixed) is smaller are straints can be incorporated in the optimization problem, better suited for small­area estimation. In designs with large although they are often difficult to quantify or the designer clusters, such variances are large because, at an extreme, an may not be aware of them because of imperfect communi­ area may not be represented in the survey in some cation. Improvisation, after obtaining the sampling design replications and may be over­represented several times in that is optimal for a simpler setting, may be more practical. others. Using smaller clusters is in general preferable for Also, priorities, or expert opinion about them, may change small­area estimation if this does not inflate the survey costs over time, even while the survey is being conducted and and a fixed overall sample size can be maintained. analysed. Estimates that are associated with standard errors or coefficients of variation greater than a specified threshold are often excluded from analysis reports. Intention to do this Acknowledgements can be reflected in sample size calculation by regarding θˆ

as the estimator of θ d , that is, by setting the associated I am grateful to the Deputy Editor and referees for 2 ˆ MSE to the corresponding aMSEσB + var ( θ ) or to suggesting several improvements but mainly for leading me another (large) constant. to discover an error in an earlier version of the manuscript. Although we propose a particular class of priorities for Discussions with the Polish team in the EURAREA project the small areas, no conceptual difficulties arise when are acknowledged. another class is used instead. It may depend on several population quantities, not only the population size. In principle, the priorities can also be set for the areas individ­ References ually, although this is practical only when the number of Fay, R.E., and Herriot, R.A. (1979). Estimates of income for small areas is small. The formula­based and individually set places: An application of James­Stein procedures to data. priorities can be combined by adjusting the priorities, such Journal of the American Statistical Association, 74, 269­277. as PN= q , for a few areas to reflect their exceptional role d d Ghosh, M., and Rao, J.N.K. (1994). Small area estimation: An in the analysis. appraisal. Statistical Science, 9, 55­93. Sensitivity analysis, exploring how the sampling design is changed as a result of altered input, is essential for Goldstein, H. (1995). Multilevel Statistical Models. Second Edition. understanding the impact of uncertainty about the estimated Edward Arnold, London, UK. parameters (the variance ratio ω in particular) and the Longford, N.T. (1993). Random Coefficient Models. Oxford arbitrariness, however limited, in how the priorities are set. University Press, Oxford.

Statistics Canada, Catalogue No. 12-001 Survey Methodology, June 2006 13

Longford, N.T. (1999). Multivariate shrinkage estimation of small­ Marker, D.A. (2001). Producing small area estimates from national area means and proportions. Journal of the Royal Statistical surveys: methods for minimizing use of indirect estimators. Society, Series A, 162, 227­245. Survey Methodology, 27, 183­188. Longford, N.T. (2004). Missing data and small area estimation in the Platek, R., Rao, J.N.K., Särndal, C.­E. and Singh, M.P. (Eds.) (1987). UK Labour Force Survey. Journal of the Royal Statistical Society, Small Area Statistics. New York: John Wiley & Sons. Series A, 167, 341­373. Rao, J.N.K. (2003). Small Area Estimation. New York: John Wiley & Longford, N.T. (2005). Missing Data and Small­Area Estimation. Sons, Inc. Modern Analytical Equipment for the Survey Statistician. Springer­Verlag, New York. Singh, M.P., Gambino, J. and Mantel, H.J. (1994). Issues and strategies for small area data. Survey Methodology, 20, 3­22. Marker, D.A. (1999). Organization of small area estimators using a generalized framework. Journal of Official Statistics, 15, 1­24.

Statistics Canada, Catalogue No. 12-001