11 Sample Surveys II: Stratification
MAS/MASOR/MASSM - Statistical Analysis - Autumn Term

11.1 Introduction

Stratified Random Sampling is characterized by the following:

• the population is divided into a number of parts called strata, say into $K$ strata;
• a simple random sample is drawn independently from each part;
• unbiased estimates of the population mean are obtained using an appropriately weighted sum of the sample means from each stratum.

Stratification is employed for a number of reasons:

i) differences between the population strata means do not make a contribution to the sampling error of the estimate $\bar{y}_{st}$. Sampling error arises solely from variations among the sampling units that are in the same stratum. Thus, if we form strata in an appropriate way, then we can expect a gain in precision over simple random sampling.

ii) we can choose the size of the sample taken from each stratum: sampling fractions need not be the same in each stratum. This gives scope to allocate sampling resources efficiently within strata.

11.2 Notation

Let $Y_{ij}$ be the value of the variable $Y$ for the $j$-th member of the $i$-th stratum in the population. Let $y_{ij}$ be the value of the variable $y$ for the $j$-th member taken from the $i$-th stratum from within the sample. Also, let $N_i$ be the total number of sampling units in the $i$-th stratum, in which case the total population size is given by $N = \sum_{i=1}^{K} N_i$.
The following table summarizes some of the other quantities of interest:

| Quantity | Population | Sample |
|---|---|---|
| Size | $N$ | $n$ |
| Strata sizes | $N_i$ | $n_i$ |
| Strata weights | $W_i = N_i/N$ | $w_i = n_i/n$ |
| Total | $T = \sum_{i=1}^{K} \sum_{j=1}^{N_i} Y_{ij}$ | $t = \sum_{i=1}^{K} \sum_{j=1}^{n_i} y_{ij}$ |
| Mean | $\bar{Y} = T/N$ | $\bar{y} = t/n$ |
| Strata totals | $T_i = \sum_{j=1}^{N_i} Y_{ij}$ | $t_i = \sum_{j=1}^{n_i} y_{ij}$ |
| Strata means | $\bar{Y}_i = T_i/N_i$ | $\bar{y}_i = t_i/n_i$ |
| Variances | $S^2 = \frac{\sum_{i=1}^{K} \sum_{j=1}^{N_i} (Y_{ij} - \bar{Y})^2}{N-1}$ | $s^2 = \frac{\sum_{i=1}^{K} \sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2}{n-1}$ |
| Strata variances | $S_i^2 = \frac{\sum_{j=1}^{N_i} (Y_{ij} - \bar{Y}_i)^2}{N_i - 1}$ | $s_i^2 = \frac{\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2}{n_i - 1}$ |

Also, define $f_i = n_i/N_i$ to be the sampling fraction for the $i$-th stratum, and $f = n/N$ to be the overall sampling fraction.

11.3 Design Efficiency of Stratified Sampling

Define

\[ \bar{y}_{st} = \sum_{i=1}^{K} W_i \bar{y}_i \]

to be the stratified (simple) sample mean. By exploiting results derived under simple random sampling, and noting the independence of the samples drawn from different strata, we deduce that

\[ E[\bar{y}_{st}] = \sum_{i=1}^{K} W_i E[\bar{y}_i] = \sum_{i=1}^{K} W_i \bar{Y}_i = \frac{1}{N} \sum_{i=1}^{K} N_i \bar{Y}_i = \frac{1}{N} \sum_{i=1}^{K} \sum_{j=1}^{N_i} Y_{ij} = \bar{Y} \tag{1} \]

and

\[ \mathrm{var}(\bar{y}_{st}) = \sum_{i=1}^{K} W_i^2 \, \mathrm{var}(\bar{y}_i) = \sum_{i=1}^{K} W_i^2 \frac{S_i^2}{n_i} (1 - f_i) \tag{2} \]

by recalling that, under simple random sampling (SRS),

\[ \mathrm{var}(\bar{y}) = \frac{S^2}{n} (1 - f). \]

Compare this with the overall sample average:

\[ \bar{y}' = \sum_{i=1}^{K} w_i \bar{y}_i = \frac{1}{n} \sum_{i=1}^{K} n_i \bar{y}_i. \]

In general, this will differ from $\bar{y}_{st}$ unless $n_i/N_i = n/N$, i.e. $n_i \propto N_i$ for $i = 1, \ldots, K$: the sample sizes are proportional to the stratum sizes, which is proportional allocation (see later); such an allocation also ensures that $\bar{y}'$ is unbiased for $\bar{Y}$.

The design efficiency is determined by comparing $\mathrm{var}(\bar{y}_{st})$ with $\mathrm{var}(\bar{y})$. Indeed, define

\[ \mathrm{Deff} = \mathrm{var}(\bar{y}_{st}) / \mathrm{var}(\bar{y}). \]

If the within-stratum sampling fractions are equal to each other, then the gain in efficiency depends on how small the $\{S_i^2\}$ are in comparison to $S^2$.
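To make the distinction between $\bar{y}_{st}$ and the overall sample average $\bar{y}'$ concrete, the following is a minimal Python sketch of the definitions above and of the variance formula (2). The function names and the two-stratum data are hypothetical, chosen only for illustration; they do not come from the notes, and the stratum variances $S_i^2$ are assumed to be known.

```python
from statistics import mean


def stratified_mean(samples, stratum_sizes):
    """Stratified sample mean: sum of W_i * ybar_i with W_i = N_i / N."""
    N = sum(stratum_sizes)
    return sum((N_i / N) * mean(y_i) for y_i, N_i in zip(samples, stratum_sizes))


def overall_sample_mean(samples):
    """Overall sample average: sum of w_i * ybar_i with w_i = n_i / n."""
    n = sum(len(y_i) for y_i in samples)
    return sum(sum(y_i) for y_i in samples) / n


def var_stratified_mean(S2, stratum_sizes, sample_sizes):
    """Equation (2): sum of W_i^2 * (S_i^2 / n_i) * (1 - f_i), f_i = n_i / N_i.

    S2 holds the population stratum variances S_i^2, assumed known here.
    """
    N = sum(stratum_sizes)
    return sum(
        (N_i / N) ** 2 * (S2_i / n_i) * (1 - n_i / N_i)
        for S2_i, N_i, n_i in zip(S2, stratum_sizes, sample_sizes)
    )


# Hypothetical population with two strata of sizes N_1 = 60 and N_2 = 40.
stratum_sizes = [60, 40]

# Non-proportional allocation (n_1 = 2, n_2 = 4): the two averages differ.
unequal = [[2.0, 4.0], [10.0, 12.0, 14.0, 16.0]]
print(stratified_mean(unequal, stratum_sizes), overall_sample_mean(unequal))

# Proportional allocation (n_1 = 3, n_2 = 2): the two averages coincide.
proportional = [[2.0, 4.0, 3.0], [10.0, 16.0]]
print(stratified_mean(proportional, stratum_sizes), overall_sample_mean(proportional))
```

With the non-proportional allocation the overall average is pulled towards the over-sampled second stratum, whereas the stratified mean reweights each stratum mean by its population share $W_i$.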
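Indeed, when the sampling fractions are equal, $f_i = f$ for all $i$ (see later), we have

\[ \mathrm{var}(\bar{y}_{st}) = \frac{1 - f}{n} \sum_{i=1}^{K} \frac{N_i}{N} S_i^2 \]

and so

\[ \mathrm{var}(\bar{y}) - \mathrm{var}(\bar{y}_{st}) = \frac{(1 - f)}{n} \left\{ S^2 - \frac{1}{N} \sum_{i=1}^{K} N_i S_i^2 \right\}. \]

However,

\[ S^2 = \frac{1}{N-1} \left\{ \sum_{i=1}^{K} \sum_{j=1}^{N_i} (Y_{ij} - \bar{Y})^2 \right\} = \frac{1}{N-1} \left\{ \sum_{i=1}^{K} \sum_{j=1}^{N_i} (Y_{ij} - \bar{Y}_i + \bar{Y}_i - \bar{Y})^2 \right\} = \sum_{i=1}^{K} \left( \frac{N_i - 1}{N - 1} \right) S_i^2 + \sum_{i=1}^{K} \frac{N_i}{N - 1} (\bar{Y}_i - \bar{Y})^2, \]

where the cross term vanishes because $\sum_{j=1}^{N_i} (Y_{ij} - \bar{Y}_i) = 0$ within each stratum. If the $\{N_i\}$ are sufficiently large, then $\frac{N_i - 1}{N - 1} \approx \frac{N_i}{N - 1} \approx \frac{N_i}{N}$, and so

\[ S^2 \approx \frac{1}{N} \left\{ \sum_{i=1}^{K} N_i S_i^2 + \sum_{i=1}^{K} N_i (\bar{Y}_i - \bar{Y})^2 \right\}. \]

Thus

\[ \mathrm{var}(\bar{y}) - \mathrm{var}(\bar{y}_{st}) \approx \frac{(1 - f)}{nN} \sum_{i=1}^{K} N_i (\bar{Y}_i - \bar{Y})^2 = \frac{(1 - f)}{n} \sum_{i=1}^{K} W_i (\bar{Y}_i - \bar{Y})^2, \]

which is strictly positive unless the $\{\bar{Y}_i\}$ are all equal to each other. Thus (under the proviso that the $\{N_i\}$ are sufficiently large), the stratified sample mean is usually more efficient than the simple random sample mean. A more precise analysis can be found in the next section.

11.4 Choice of Strata fractions

There are a number of possible strategies available for selecting the strata sampling fractions. We outline a few of these below, along with their relative efficiencies.

11.4.1 Equal Allocation

Set $n_i = n/K$, for $i = 1, \ldots, K$, i.e. equal 'case loads': this particular mechanism may be considered to be administratively convenient.

11.4.2 Proportional Allocation

Set $n_i$ proportional to $N_i$, i.e. $n_i \propto N_i$, which implies that $f_i = \frac{n_i}{N_i} = \frac{n}{N} = f$, and so $n_i = n \times W_i$, $i = 1, \ldots, K$. Denote the corresponding estimate of $\bar{Y}$ by $\bar{y}_{st,p}$. Then

\[ \mathrm{var}(\bar{y}_{st,p}) = \sum_{i=1}^{K} W_i^2 \frac{S_i^2}{n_i} (1 - f_i) = \frac{(1 - f)}{n} \sum_{i=1}^{K} W_i S_i^2 = \frac{1 - f}{n} S_w^2, \]

where $S_w^2 = \sum_{i=1}^{K} W_i S_i^2$ is the average within-strata variance. Then

\[ \mathrm{Deff}(\bar{y}_{st,p}) = \mathrm{var}(\bar{y}_{st,p}) / \mathrm{var}(\bar{y}) = \frac{S_w^2}{n} (1 - f) \Big/ \frac{S^2}{n} (1 - f) = S_w^2 / S^2. \]

This shows that efficiency is achieved upon the basis of homogeneity within strata.

11.4.3 Optimum Allocation based on Cost

Problem: Minimize the variance of the estimate subject to a cost constraint.

Let $c_0$ be some fixed cost, and define $c_i$ to be the cost per sampled unit in stratum $i$. Then the total cost satisfies

\[ c_0 + \sum_{i=1}^{K} n_i c_i \le C \tag{3} \]

where $C$ is the available budget.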
Thus we seek to minimize

\[ \mathrm{var}(\bar{y}_{st}) = \sum_{i=1}^{K} \frac{W_i^2 S_i^2 (1 - f_i)}{n_i} \]

subject to the constraint (3). To solve this problem, form the Lagrangian:

\[ L(n_1, \ldots, n_K) = \mathrm{var}(\bar{y}_{st}) + \lambda \left( c_0 + \sum_{i=1}^{K} n_i c_i - C \right). \]

Now

\[ \frac{\partial L}{\partial n_j} = -\frac{W_j^2 S_j^2}{n_j^2} + \lambda c_j \]

which, upon equating to zero, yields

\[ \lambda = \frac{W_j^2 S_j^2 / c_j}{n_j^2}. \]

This implies that

\[ n_j \propto \frac{W_j S_j}{\sqrt{c_j}}. \]

Actually, if the costs across all the strata were equal, then the optimal allocation would be such that $n_j \propto W_j S_j$: we denote the resulting estimate by $\bar{y}_{st,o}$. It can be shown that

\[ \mathrm{var}(\bar{y}_{st,o}) = \frac{1}{n} \left( \sum_{i=1}^{K} W_i S_i \right)^2 - \frac{1}{N} \sum_{i=1}^{K} W_i S_i^2. \]

11.4.4 Disproportionate sampling

Disproportionate sampling is particularly worthwhile when the variable being measured has a highly skewed or asymmetrical distribution. It is often the case that such populations contain a few sampling units that have large values for this variable and many units with small values. Examples include: family income, house prices, cases tried by courts, sales turnover. We explore this allocation rule via the following example.

Example 11.1 (An application of 'Disproportionate Sampling')

Consider a population of 1019 tertiary colleges and universities in the U.S.A. Data on student enrollment for a certain academic year were as follows, stratified by size of institution.

| Stratum $i$ | No. of students per institution | No. of institutions $N_i$ | Stratum total $T_i$ | Stratum mean $\bar{Y}_i$ | Stratum std. dev. $S_i$ |
|---|---|---|---|---|---|
| 1 | < 1000 | 661 | 292671 | 443 | 236 |
| 2 | 1000-3000 | 205 | 345302 | 1684 | 625 |
| 3 | 3000-10000 | 122 | 672728 | 5514 | 2008 |
| 4 | ≥ 10000 | 31 | 573693 | 18506 | 10023 |
| Total | | $N = 1019$ | $T = 1884394$ | | |

These data might be used to plan a sample designed to give a quick estimate of total registration in some future year.

Assume that the sampling costs per unit are the same in all strata and that we are aiming at a total sample size of $n = \sum_{i=1}^{4} n_i = 250$. Then, we could attempt to invoke the optimal allocation rule, i.e. that $n_j \propto N_j S_j$.
Under this rule, $n_4$ would be

\[ 250 \times \frac{31 \times 10023}{(661 \times 236) + (205 \times 625) + (122 \times 2008) + (31 \times 10023)} \approx 92. \]

However, $N_4 = 31$: thus our 'disproportionate' allocation procedure says that we take a 100% sample from stratum 4 and then use the rule $n_i \propto N_i S_i$ to distribute the remaining sample over the other strata.

| Stratum $i$ | $n_i$ | Sampling rate $f_i$ |
|---|---|---|
| 1 | 65 | 10% |
| 2 | 53 | 26% |
| 3 | 101 | 83% |
| 4 | 31 | 100% |
| Total | 250 | |

11.5 Estimation of Totals

A number of unbiased estimators for $T$ can be formulated: these differ in their variability according to the sampling scheme from which they are generated.

Under simple random sampling, we use $\hat{T} = N\bar{y}$, and so

\[ E[\hat{T}] = N \times E[\bar{y}] = N\bar{Y} = T, \qquad \mathrm{var}(\hat{T}) = N^2 (1 - f) S^2 / n. \]

Under stratified random sampling with proportional allocation, we use

\[ \hat{T}_{st,p} = N \bar{y}_{st,p}. \]

Now

\[ E[\hat{T}_{st,p}] = E[N \bar{y}_{st,p}] = N \times E[\bar{y}_{st,p}] = N\bar{Y} = T \]

using (1), which is still applicable under proportional allocation.
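As a numerical sketch tying Example 11.1 to the estimation of totals, the Python code below reproduces the capped allocation $n_i \propto N_i S_i$ and compares the variance of $\hat{T} = N\bar{y}$ under simple random sampling with that of $N\bar{y}_{st}$ under the allocation of the example. Several things here are assumptions rather than statements from the notes: the function names are hypothetical; the tabulated $S_i$ are treated as exact and as using the $N_i - 1$ divisor; $S^2$ is rebuilt from the stratum summaries via the within/between decomposition of Section 11.3; and the variance of $N\bar{y}_{st}$ is taken to be $N^2$ times equation (2).

```python
from math import sqrt

# Example 11.1 data: stratum sizes N_i, totals T_i and standard deviations S_i.
N_i = [661, 205, 122, 31]
T_i = [292671, 345302, 672728, 573693]
S_i = [236, 625, 2008, 10023]


def capped_allocation(N_i, S_i, n):
    """Allocate n_i proportional to N_i * S_i, taking a 100% sample from any
    stratum whose share would exceed N_i and re-allocating the remainder."""
    alloc = [0] * len(N_i)
    free = list(range(len(N_i)))  # strata not yet fully enumerated
    remaining = n
    while True:
        total = sum(N_i[i] * S_i[i] for i in free)
        raw = {i: remaining * N_i[i] * S_i[i] / total for i in free}
        capped = [i for i in free if raw[i] > N_i[i]]
        if not capped:
            break
        for i in capped:
            alloc[i] = N_i[i]
            remaining -= N_i[i]
            free.remove(i)
    for i in free:
        alloc[i] = round(raw[i])  # rounded shares need not sum to n in general
    return alloc


def var_total_srs(N_i, T_i, S_i, n):
    """var(T_hat) = N^2 (1 - f) S^2 / n, with S^2 rebuilt from stratum
    summaries (assumption: the S_i use the N_i - 1 divisor)."""
    N = sum(N_i)
    Y_bar = sum(T_i) / N
    within = sum((Nk - 1) * Sk ** 2 for Nk, Sk in zip(N_i, S_i))
    between = sum(Nk * (Tk / Nk - Y_bar) ** 2 for Nk, Tk in zip(N_i, T_i))
    S2 = (within + between) / (N - 1)
    return N ** 2 * (1 - n / N) * S2 / n


def var_total_stratified(N_i, S_i, n_i):
    """N^2 times equation (2): sum of N_i^2 S_i^2 (1 - f_i) / n_i."""
    return sum(
        Nk ** 2 * Sk ** 2 * (1 - nk / Nk) / nk
        for Nk, Sk, nk in zip(N_i, S_i, n_i)
    )


n_i = capped_allocation(N_i, S_i, 250)
print("allocation:", n_i)
print("s.e. of total, SRS:       ", round(sqrt(var_total_srs(N_i, T_i, S_i, 250))))
print("s.e. of total, stratified:", round(sqrt(var_total_stratified(N_i, S_i, n_i))))
```

For these data the allocation comes out as (65, 53, 101, 31), matching the table in Example 11.1, and the standard error of the estimated total under this allocation is several times smaller than under simple random sampling with the same $n = 250$. Stratum 4 contributes nothing to the stratified variance because it is sampled in full.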