
Stat472/572 : Theory and Practice Instructor: Yan Lu

Chapter 3: Stratified Sampling

Example: 1000 males and 100 females in the population.

• Now take an SRS of size 55 from the population. We might get a sample with no females.
-- Most people would not consider such a sample to be representative of the population, since men and women might respond differently on the item of interest

• With a stratified sample, we can take 50 males and 5 females
-- a sample with no or few females cannot be selected; we are protected from the possibility of obtaining a really bad sample
-- increases the precision of the estimators

Stratified Sampling

• Divide population into H subpopulations, called strata. The strata do not overlap and they constitute the whole population

• Each sampling unit belongs to exactly one stratum

• Draw an independent probability sample from each stratum

• Pool the information to obtain overall population estimates

Figure 1: Stratification

Example 3.2: Agriculture (Refer to Example 2.5)

• In Example 2.5, we generated a random sample. But some areas were overrepresented, and others not represented at all

• Part of the large variability arises because counties in the western United States are larger, and thus tend to have larger values of y, than counties in the eastern United States

• Taking a stratified sample can provide some balance in the sample on the stratifying variable

• We use the four regions of the United States, Northeast (NE), North Central (NC), South (S), and West (W), as strata, and sample about 10% of the counties in each stratum.

Figure 2: Boxplot of data from Example 3.2. The thick line for each region is the median of the sample data from that region; the other horizontal lines in the boxes are the 25th and 75th percentiles. The Northeast region has a relatively small median and small variance; the West region, however, has a much higher median and variance. The distribution of farm acreage appears to be positively skewed in each of the regions.
[Boxplot: x-axis Region (NC, NE, S, W); y-axis Millions of Acres]

Stratum          # of counties in stratum    # of counties in sample
Northeast                  220                         21
North Central             1054                        103
South                     1382                        135
West                       422                         41
Total                     3078                        300

Table 1: Summary statistics for each stratum

region           stratum size   sample size   average       variance
Northeast             220            21        97,629.8       7,647,472,708
North Central        1054           103       300,504.2      29,618,183,543
South                1382           135       211,315.0      53,587,487,856
West                  422            41       662,295.5     396,185,950,266

• We took an SRS in each stratum. For the Northeast region,

\[ \hat{t}_1 = (220)(97{,}629.81) = 21{,}478{,}558.2 \]

\[ \hat{V}(\hat{t}_1) = (220)^2 \left(1 - \frac{21}{220}\right) \frac{7{,}647{,}472{,}708}{21} = 1.594316 \times 10^{13} \]

Table 2: Estimates of the total number of farm acres and estimated variance of the total for each of the four strata

region           estimated total     estimated variance of the total
Northeast          21,478,558.2        1.59432 × 10^13
North Central     316,731,379.4        2.88232 × 10^14
South             292,037,390.8        6.84076 × 10^14
West              279,488,706.1        1.55365 × 10^15
Total             909,736,034.4        2.5419 × 10^15

Table 3: Comparison between SRS and stratified random sampling for agriculture data

design            sample size    \(\hat{t}\)      SE
SRS                   300        916,927,110      58,169,381
Stratification        300        909,736,034      50,417,248

• Observations within many strata tend to be more homogeneous than observations in the population as a whole. Reduction in variance in the individual strata often leads to a reduced variance for the population estimate

• \[ \frac{\text{estimated variance from stratified sample, with } n = 300}{\text{estimated variance from SRS, with } n = 300} = \frac{2.5419 \times 10^{15}}{3.3837 \times 10^{15}} = 0.75 \]

• If these were the population variances, we would expect that we would need only (300)(0.75) = 225 observations with a stratified sample to obtain the same precision as from an SRS of 300 observations.
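The arithmetic behind Tables 1 to 3 can be checked with a short script. This is only a sketch: the dictionary layout and function names are illustrative choices, the inputs are the rounded stratum summaries from Table 1 (with 1054 counties in North Central, so the stratum sizes sum to 3078), and the results therefore agree with the tables only up to rounding.

```python
from math import sqrt

# Stratum summaries from Table 1 (rounded means, so totals match Table 2
# only approximately).
strata = {
    "Northeast":     {"N": 220,  "n": 21,  "mean": 97_629.8,  "var": 7_647_472_708},
    "North Central": {"N": 1054, "n": 103, "mean": 300_504.2, "var": 29_618_183_543},
    "South":         {"N": 1382, "n": 135, "mean": 211_315.0, "var": 53_587_487_856},
    "West":          {"N": 422,  "n": 41,  "mean": 662_295.5, "var": 396_185_950_266},
}


def stratum_total(s):
    """Estimated stratum total: N_h * ybar_h."""
    return s["N"] * s["mean"]


def stratum_total_variance(s):
    """Estimated variance of the stratum total: N_h^2 (1 - n_h/N_h) s_h^2 / n_h."""
    fpc = 1 - s["n"] / s["N"]
    return s["N"] ** 2 * fpc * s["var"] / s["n"]


t_hat_str = sum(stratum_total(s) for s in strata.values())
v_hat_str = sum(stratum_total_variance(s) for s in strata.values())
se_str = sqrt(v_hat_str)

# SRS standard error taken from Table 3.
v_hat_srs = 58_169_381 ** 2
variance_ratio = v_hat_str / v_hat_srs

print(f"t_hat_str = {t_hat_str:,.0f}, SE = {se_str:,.0f}, ratio = {variance_ratio:.2f}")
```

The ratio of about 0.75 is what motivates the statement that roughly 225 stratified observations would match the precision of an SRS of 300.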

Comments:

• Reduce variability by eliminating possible bad samples

• May want data of known precision for subgroups

• Lower cost, convenient

• Usually reduces variability when estimating quantities for the whole population

Theory of Stratified Sampling:

strata         1          2          ···    H
popn size      \(N_1\)    \(N_2\)    ···    \(N_H\)      \(\sum_{h=1}^{H} N_h = N\)
sample size    \(n_1\)    \(n_2\)    ···    \(n_H\)      \(\sum_{h=1}^{H} n_h = n\)
popn total     \(t_1\)    \(t_2\)    ···    \(t_H\)

• Take an SRS of size \(n_h\) from stratum \(h\)

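• \( t = t_1 + t_2 + \cdots + t_H \)

• \[ \hat{t}_{str} = \hat{t}_1 + \hat{t}_2 + \cdots + \hat{t}_H = N_1\bar{y}_1 + N_2\bar{y}_2 + \cdots + N_H\bar{y}_H \]

\[ \hat{V}(\hat{t}_{str}) = \hat{V}(\hat{t}_1) + \hat{V}(\hat{t}_2) + \cdots + \hat{V}(\hat{t}_H) = \sum_{h=1}^{H} \left(1 - \frac{n_h}{N_h}\right) N_h^2 \frac{s_h^2}{n_h} \]

• \[ \bar{y}_{str} = \frac{\hat{t}_{str}}{N} = \frac{\sum_{h=1}^{H} \hat{t}_h}{N} = \frac{\sum_{h=1}^{H} N_h \bar{y}_h}{N} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h \]

a weighted average of the stratum means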

• Confidence intervals for stratified samples
-- If either (1) the sample sizes within each stratum are large
-- or (2) the sampling design has a large number of strata,

then according to the central limit theorem (Krewski and Rao 1981), an approximate 100(1 − α)% confidence interval for the population mean \(\bar{y}_U\) is

\[ \bar{y}_{str} \pm z_{\alpha/2}\, SE(\bar{y}_{str}) \]

Some survey software packages use the percentile of a t distribution with n − H degrees of freedom rather than the percentile of the normal distribution
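As an illustration of the normal-based interval, the sketch below applies it to the stratified estimate from the agriculture example (Table 3), converting the total to a mean by dividing by N = 3078. The helper name `normal_ci` is an illustrative choice, and the t-based variant with n − H degrees of freedom is not shown because it would need a t quantile from an external library.

```python
from statistics import NormalDist

# Stratified estimate of the total and its SE, from Table 3.
N = 3078
t_hat_str = 909_736_034
se_t_hat = 50_417_248

# Mean per county and its standard error.
y_bar_str = t_hat_str / N
se_y_bar = se_t_hat / N


def normal_ci(estimate, se, alpha=0.05):
    """Approximate 100(1 - alpha)% CI: estimate +/- z_{alpha/2} * SE."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * se, estimate + z * se


lower, upper = normal_ci(y_bar_str, se_y_bar)
print(f"ybar_str = {y_bar_str:,.1f}, 95% CI = ({lower:,.1f}, {upper:,.1f})")
```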

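Population quantities and sample quantities (\(y_{hj}\): value of the \(j\)th unit in stratum \(h\))

Population quantity                                                              Sample quantity
\( t_h = \sum_{j=1}^{N_h} y_{hj} \)                                              \( \hat{t}_h = \frac{N_h}{n_h} \sum_{j \in S_h} y_{hj} = N_h \bar{y}_h \)
\( t = \sum_{h=1}^{H} t_h \)                                                     \( \hat{t}_{str} = \sum_{h=1}^{H} \hat{t}_h = \sum_{h=1}^{H} N_h \bar{y}_h \)
\( \bar{y}_{hU} = \frac{1}{N_h} \sum_{j=1}^{N_h} y_{hj} \)                       \( \bar{y}_h = \frac{1}{n_h} \sum_{j \in S_h} y_{hj} \)
\( \bar{y}_U = \frac{t}{N} = \frac{1}{N} \sum_{h=1}^{H} \sum_{j=1}^{N_h} y_{hj} \)   \( \bar{y}_{str} = \frac{\hat{t}_{str}}{N} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h \)
\( S_h^2 = \sum_{j=1}^{N_h} \frac{(y_{hj} - \bar{y}_{hU})^2}{N_h - 1} \)         \( s_h^2 = \sum_{j \in S_h} \frac{(y_{hj} - \bar{y}_h)^2}{n_h - 1} \)

\[ \hat{t}_{str} = \hat{t}_1 + \hat{t}_2 + \cdots + \hat{t}_H = N_1\bar{y}_1 + N_2\bar{y}_2 + \cdots + N_H\bar{y}_H \]

\[ \hat{V}(\hat{t}_{str}) = \hat{V}(\hat{t}_1) + \hat{V}(\hat{t}_2) + \cdots + \hat{V}(\hat{t}_H) = \sum_{h=1}^{H} \left(1 - \frac{n_h}{N_h}\right) N_h^2 \frac{s_h^2}{n_h} \]

\[ \bar{y}_{str} = \frac{\hat{t}_{str}}{N} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h \]

\[ \hat{V}(\bar{y}_{str}) = \frac{1}{N^2} \hat{V}(\hat{t}_{str}) = \sum_{h=1}^{H} \left(1 - \frac{n_h}{N_h}\right) \left(\frac{N_h}{N}\right)^2 \frac{s_h^2}{n_h} \]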

Properties of the estimators:

• \( E[\hat{t}_{str}] = t \)

• \( E[\bar{y}_{str}] = \bar{y}_U \)

• \( \hat{V}(\hat{t}_{str}) \) is an unbiased estimator of \( V(\hat{t}_{str}) \)

• \( \hat{V}(\bar{y}_{str}) \) is an unbiased estimator of \( V(\bar{y}_{str}) \)

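\[ E[\hat{t}_{str}] = E\left[\sum_{h=1}^{H} N_h \bar{y}_h\right] = \sum_{h=1}^{H} N_h E(\bar{y}_h) = \sum_{h=1}^{H} N_h \bar{y}_{hU} = \sum_{h=1}^{H} t_h = t \]

\[ E[\bar{y}_{str}] = E\left[\frac{\hat{t}_{str}}{N}\right] = \frac{t}{N} = \bar{y}_U \]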
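Unbiasedness can also be checked by brute force on a tiny population: list every possible stratified sample, compute the estimate for each, and average. The sketch below does this for a made-up two-stratum population (the values, stratum labels, and sample sizes are all hypothetical) and also compares the design variance over all samples with the formula \( \sum_h N_h^2 (1 - n_h/N_h) S_h^2 / n_h \).

```python
from itertools import combinations, product
from statistics import mean, variance

# Hypothetical population with two strata (values chosen only for illustration).
population = {
    "A": [2.0, 4.0, 6.0, 8.0],
    "B": [10.0, 20.0, 30.0],
}
sample_sizes = {"A": 2, "B": 2}


def t_hat(sample_by_stratum):
    """Stratified estimate of the total: sum over strata of N_h * ybar_h."""
    return sum(len(population[h]) * mean(ys) for h, ys in sample_by_stratum.items())


# Every possible stratified sample: each SRS from A paired with each SRS from B.
per_stratum = [
    [(h, ys) for ys in combinations(population[h], sample_sizes[h])]
    for h in population
]
estimates = [t_hat(dict(choice)) for choice in product(*per_stratum)]

true_total = sum(sum(ys) for ys in population.values())
expected_t_hat = mean(estimates)

# Design variance: all samples are equally likely, so average the squared deviations.
design_var = mean([(e - expected_t_hat) ** 2 for e in estimates])

# Formula variance, using the population variances S_h^2 (divisor N_h - 1).
formula_var = sum(
    len(ys) ** 2 * (1 - sample_sizes[h] / len(ys)) * variance(ys) / sample_sizes[h]
    for h, ys in population.items()
)

print(true_total, expected_t_hat, design_var, formula_var)
```

Enumeration is only feasible for very small populations, but it makes the expectation in the proof concrete: the average of the estimates over all samples equals the true total.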

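Stratified sampling for proportions

Special case of the mean when

\[ y_i = \begin{cases} 1 & \text{if the unit has the characteristic} \\ 0 & \text{otherwise} \end{cases} \]

\[ \bar{y}_h = \hat{p}_h, \qquad s_h^2 = \frac{n_h}{n_h - 1} \hat{p}_h (1 - \hat{p}_h) \]

\[ \hat{p}_{str} = \sum_{h=1}^{H} \frac{N_h}{N} \hat{p}_h \]

\[ \hat{V}(\hat{p}_{str}) = \sum_{h=1}^{H} \left(1 - \frac{n_h}{N_h}\right) \left(\frac{N_h}{N}\right)^2 \frac{\hat{p}_h (1 - \hat{p}_h)}{n_h - 1} \]

\[ \hat{t}_{str} = \sum_{h=1}^{H} N_h \hat{p}_h, \qquad \hat{V}(\hat{t}_{str}) = N^2\, \hat{V}(\hat{p}_{str}) \]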

Example 3.4. The American Council of Learned Societies (ACLS) used a stratified random sample of selected ACLS societies in seven disciplines to study publication patterns and computer and library use among scholars who belong to one of the member organizations of the ACLS. The data are shown in the following table.

Discipline           Membership \(N_h\)    # mailed    valid returns \(n_h\)    female members (%)
Literature                 9100               915             636                     38
Classics                   1950               633             451                     27
Philosophy                 5500               658             481                     18
History                   10850               855             611                     19
Linguistics                2100               667             493                     36
Political Science          5500               833             575                     13
Sociology                  9000               824             588                     26
Totals                    44000              5385            3835

• Want to estimate the percentage and number of female members of the major societies in those seven disciplines

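• Ignoring the nonresponse, and assuming no duplicate memberships,

\[ \hat{p}_{str} = \sum_{h=1}^{7} \frac{N_h}{N} \hat{p}_h = \frac{9100}{44000} \times .38 + \cdots + \frac{9000}{44000} \times .26 = .2465 \]

\[ SE(\hat{p}_{str}) = \sqrt{\sum_{h=1}^{7} \left(1 - \frac{n_h}{N_h}\right) \left(\frac{N_h}{N}\right)^2 \frac{\hat{p}_h (1 - \hat{p}_h)}{n_h - 1}} = .0071 \]

The estimated total number of female members in the societies is

\[ \hat{t}_{str} = 44000 \times .2465 = 10847 \]

with

\[ SE(\hat{t}_{str}) = 44000 \times .0071 = 312 \]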
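The ACLS calculation can be reproduced directly from the table. In this sketch the tuple layout is an illustrative choice, \(n_h\) is taken to be the number of valid returns (nonresponse is ignored, as in the text), and the percentages are the rounded values from the table.

```python
from math import sqrt

# (discipline, membership N_h, valid returns n_h, proportion female p_h)
acls = [
    ("Literature",        9100,  636, 0.38),
    ("Classics",          1950,  451, 0.27),
    ("Philosophy",        5500,  481, 0.18),
    ("History",          10850,  611, 0.19),
    ("Linguistics",       2100,  493, 0.36),
    ("Political Science", 5500,  575, 0.13),
    ("Sociology",         9000,  588, 0.26),
]

N = sum(N_h for _, N_h, _, _ in acls)

# Weighted average of the stratum proportions.
p_hat_str = sum(N_h / N * p for _, N_h, _, p in acls)

# Estimated variance: sum of (1 - n_h/N_h) (N_h/N)^2 p_h (1 - p_h) / (n_h - 1).
v_hat_p = sum(
    (1 - n_h / N_h) * (N_h / N) ** 2 * p * (1 - p) / (n_h - 1)
    for _, N_h, n_h, p in acls
)
se_p = sqrt(v_hat_p)

t_hat_str = N * p_hat_str
se_t = N * se_p

print(f"p_hat = {p_hat_str:.4f}, SE = {se_p:.4f}, t_hat = {t_hat_str:.0f}, SE = {se_t:.0f}")
```

The standard error of the total computed this way can differ by about one unit from the value in the text, because the text multiplies 44000 by the standard error already rounded to .0071.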

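Review: Stratified random sampling

Strata             1          2          ···    H
Population size    \(N_1\)    \(N_2\)    ···    \(N_H\)      \(\sum_{h=1}^{H} N_h = N\)
Sample size        \(n_1\)    \(n_2\)    ···    \(n_H\)      \(\sum_{h=1}^{H} n_h = n\)
Population total   \(t_1\)    \(t_2\)    ···    \(t_H\)

Population quantities and sample quantities (\(y_{hj}\): value of the \(j\)th unit in stratum \(h\))

Population quantity                                                              Sample quantity
\( t_h = \sum_{j=1}^{N_h} y_{hj} \)                                              \( \hat{t}_h = \frac{N_h}{n_h} \sum_{j \in S_h} y_{hj} = N_h \bar{y}_h \)
\( t = \sum_{h=1}^{H} t_h \)                                                     \( \hat{t}_{str} = \sum_{h=1}^{H} \hat{t}_h = \sum_{h=1}^{H} N_h \bar{y}_h \)
\( \bar{y}_{hU} = \frac{1}{N_h} \sum_{j=1}^{N_h} y_{hj} \)                       \( \bar{y}_h = \frac{1}{n_h} \sum_{j \in S_h} y_{hj} \)
\( \bar{y}_U = \frac{t}{N} = \frac{1}{N} \sum_{h=1}^{H} \sum_{j=1}^{N_h} y_{hj} \)   \( \bar{y}_{str} = \frac{\hat{t}_{str}}{N} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h \)
\( S_h^2 = \sum_{j=1}^{N_h} \frac{(y_{hj} - \bar{y}_{hU})^2}{N_h - 1} \)         \( s_h^2 = \sum_{j \in S_h} \frac{(y_{hj} - \bar{y}_h)^2}{n_h - 1} \)

Properties of the estimators: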

• \( E[\hat{t}_{str}] = t \)

• \( E[\bar{y}_{str}] = \bar{y}_U \)

Confidence intervals for stratified samples
-- If either (1) the sample sizes within each stratum are large
-- or (2) the sampling design has a large number of strata,

then according to the central limit theorem (Krewski and Rao 1981), an approximate 100(1 − α)% confidence interval for the population mean \(\bar{y}_U\) is

\[ \bar{y}_{str} \pm z_{\alpha/2}\, SE(\bar{y}_{str}) \]

Some survey software packages use the percentile of a t distribution with n − H degrees of freedom rather than the percentile of the normal distribution

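Using Weights

Sampling weights: the number of units in the population represented by each sample member (h, j), where h indexes the stratum and j the element.

\[ \hat{t}_{str} = \sum_{h=1}^{H} N_h \bar{y}_h = \sum_{h=1}^{H} \sum_{j \in S_h} \frac{N_h}{n_h} y_{hj} = \sum_{h=1}^{H} \sum_{j \in S_h} w_{hj} y_{hj}, \qquad \text{where } w_{hj} = \frac{N_h}{n_h} \]

\[ \bar{y}_{str} = \frac{\sum_{h=1}^{H} \sum_{j \in S_h} w_{hj} y_{hj}}{\sum_{h=1}^{H} \sum_{j \in S_h} w_{hj}} \]

Example: Suppose a population has 2000 units, 1600 of them males (stratum 1) and 400 females (stratum 2). If the sample has 400 units, 200 from each stratum, then the selection probabilities \(\pi_{hj}\) and weights \(w_{hj}\) are

\[ \pi_{1j} = \frac{200}{1600} = \frac{1}{8} \quad \text{and} \quad w_{1j} = \frac{1}{\pi_{1j}} = 8 \]

\[ \pi_{2j} = \frac{200}{400} = \frac{1}{2} \quad \text{and} \quad w_{2j} = \frac{1}{\pi_{2j}} = 2 \]

• each man in the sample represents 8 men in the population

• each woman in the sample represents 2 women in the population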

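• \( \pi_{hj} = n_h / N_h \)

• \( w_{hj} = N_h / n_h \)

• \[ \hat{t}_{str} = \sum_{h=1}^{H} \hat{t}_h = \sum_{h=1}^{H} N_h \bar{y}_h = \sum_{h=1}^{H} \sum_{j \in S_h} \frac{N_h}{n_h} y_{hj} = \sum_{h=1}^{H} \sum_{j \in S_h} w_{hj} y_{hj} \]

• \[ V(\hat{t}_{str}) = \sum_{h=1}^{H} V(\hat{t}_h) = \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{S_h^2}{n_h} \]

• \[ \bar{y}_{str} = \frac{\hat{t}_{str}}{N} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h = \frac{\sum_{h=1}^{H} \sum_{j \in S_h} w_{hj} y_{hj}}{\sum_{h=1}^{H} \sum_{j \in S_h} w_{hj}} \]

• \[ V(\bar{y}_{str}) = \frac{V(\hat{t}_{str})}{N^2} = \sum_{h=1}^{H} \frac{N_h^2}{N^2} \left(1 - \frac{n_h}{N_h}\right) \frac{S_h^2}{n_h} \]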

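Comments:

• Let \(\pi_{hj}\) be the probability of selecting unit \(j\) from stratum \(h\). Then \( w_{hj} = 1/\pi_{hj} = N_h/n_h \)

• \[ \sum_{h=1}^{H} \sum_{j \in S_h} w_{hj} = \sum_{h=1}^{H} \sum_{j \in S_h} \frac{N_h}{n_h} = \sum_{h=1}^{H} N_h = N \]
-- The whole sample represents the entire population, and the sum of the weights is equal to the population size

• \( \hat{t}_{str} = \sum_{h=1}^{H} \sum_{j \in S_h} w_{hj} y_{hj} \)

• \( \bar{y}_{str} = \sum_{h=1}^{H} \sum_{j \in S_h} w_{hj} y_{hj} \Big/ \sum_{h=1}^{H} \sum_{j \in S_h} w_{hj} \)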
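A small sketch of the weighted form of the estimators, using the 1600/400 population with 200 units sampled per stratum. The y values are synthetic placeholders invented for illustration; only the stratum sizes and sample sizes come from the example.

```python
# Stratum sizes from the example; the y values below are made up.
N_h = {"male": 1600, "female": 400}
y = {
    "male": [float(i % 7) for i in range(200)],
    "female": [float(i % 3) for i in range(200)],
}

# Sampling weight w_hj = N_h / n_h, the same for every unit in a stratum.
weights = {h: N_h[h] / len(y[h]) for h in y}

# One record per sampled unit: (stratum, y_hj, w_hj).
records = [(h, y_hj, weights[h]) for h in y for y_hj in y[h]]

sum_of_weights = sum(w for _, _, w in records)
t_hat_str = sum(w * y_hj for _, y_hj, w in records)
y_bar_str = t_hat_str / sum_of_weights

# Cross-check against the stratum-by-stratum form: sum of N_h * ybar_h.
t_hat_check = sum(N_h[h] * sum(y[h]) / len(y[h]) for h in y)

print(weights, sum_of_weights, t_hat_str, y_bar_str)
```

Storing the weight on each record is how survey data files are usually laid out, which is why the weighted sums are convenient in practice.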

Back to the previous example. Suppose a population has 2000 units, 1600 of them males (stratum 1) and 400 females (stratum 2). If we randomly select 160 males from stratum 1 and 40 females from stratum 2,

\[ \pi_{1j} = \frac{160}{1600} = \frac{1}{10} \quad \text{and} \quad w_{1j} = \frac{1}{\pi_{1j}} = 10 \]

\[ \pi_{2j} = \frac{40}{400} = \frac{1}{10} \quad \text{and} \quad w_{2j} = \frac{1}{\pi_{2j}} = 10 \]

The number of sampled units in each stratum is proportional to the size of the stratum. We call this allocation method proportional allocation

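Proportional Allocation: the number of sampled units in each stratum is proportional to the size of the stratum

\[ \frac{n_h}{N_h} = \frac{n}{N}, \qquad n_h = \frac{n}{N} N_h \]

\[ \pi_{hj} = \frac{n_h}{N_h} = \frac{n}{N} \quad \text{and} \quad w_{hj} = \frac{1}{\pi_{hj}} = \frac{N}{n} \]

Sample is self-weighting:

\[ \bar{y}_{str} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h = \sum_{h=1}^{H} \frac{N_h}{N} \frac{\sum_{j \in S_h} y_{hj}}{n_h} = \sum_{h=1}^{H} \frac{1}{n} \sum_{j \in S_h} y_{hj} = \frac{1}{n} \sum_{h=1}^{H} \sum_{j \in S_h} y_{hj} = \bar{y} \]

Variances:

\[ V_{prop}(\bar{y}_{str}) = \left(1 - \frac{n}{N}\right) \frac{1}{n} \sum_{h=1}^{H} \frac{N_h}{N} S_h^2 \]

\[ V_{prop}(\hat{t}_{str}) = \left(1 - \frac{n}{N}\right) \frac{N}{n} \sum_{h=1}^{H} N_h S_h^2 \]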
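A minimal sketch of proportional allocation, applied to the four regions of the agriculture example with n = 300. The function name is illustrative, and simple rounding is an assumption: in general the rounded sizes need not sum exactly to n, although they do here.

```python
def proportional_allocation(stratum_sizes, n):
    """Allocate n_h = n * N_h / N, rounded to the nearest integer."""
    N = sum(stratum_sizes.values())
    return {h: round(n * N_h / N) for h, N_h in stratum_sizes.items()}


# Stratum sizes from the agriculture example (3078 counties in total).
regions = {"Northeast": 220, "North Central": 1054, "South": 1382, "West": 422}

allocation = proportional_allocation(regions, 300)

# With proportional allocation every weight is close to N / n.
weights = {h: regions[h] / allocation[h] for h in regions}

print(allocation)
print({h: round(w, 2) for h, w in weights.items()})
```

The allocation matches the sample sizes used in the agriculture example, and the nearly equal weights show why such a sample is (approximately) self-weighting.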

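ANOVA Table

Source             df       Sum of Squares
Between strata     H − 1    \( SSB = \sum_{h=1}^{H} \sum_{j=1}^{N_h} (\bar{y}_{hU} - \bar{y}_U)^2 = \sum_{h=1}^{H} N_h (\bar{y}_{hU} - \bar{y}_U)^2 \)
Within strata      N − H    \( SSW = \sum_{h=1}^{H} \sum_{j=1}^{N_h} (y_{hj} - \bar{y}_{hU})^2 = \sum_{h=1}^{H} (N_h - 1) S_h^2 \)
Total corrected    N − 1    \( SSTO = \sum_{h=1}^{H} \sum_{j=1}^{N_h} (y_{hj} - \bar{y}_U)^2 = (N - 1) S^2 \)

SSTO = SSB + SSW

Comparison between SRS and proportional allocation

\[ V_{prop}(\hat{t}_{str}) = V\left(\sum_{h=1}^{H} N_h \bar{y}_h\right) = \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{S_h^2}{n_h} = \sum_{h=1}^{H} \left(1 - \frac{n}{N}\right) \frac{N}{n} N_h S_h^2 = \left(1 - \frac{n}{N}\right) \frac{N}{n} \left[ SSW + \sum_{h=1}^{H} S_h^2 \right] \]

\[ V(\hat{t}_{srs}) = \left(1 - \frac{n}{N}\right) N^2 \frac{S^2}{n} = \left(1 - \frac{n}{N}\right) \frac{N^2}{n} \frac{1}{N - 1} (SSW + SSB) \approx \left(1 - \frac{n}{N}\right) \frac{N}{n} (SSW + SSB) \]

Proportional stratification is more efficient if

\[ \sum_{h=1}^{H} S_h^2 < SSB, \qquad \text{where } SSB = \sum_{h=1}^{H} N_h (\bar{y}_{hU} - \bar{y}_U)^2. \]

This is usually true, since the large population sizes of the strata will force \( N_h (\bar{y}_{hU} - \bar{y}_U)^2 > S_h^2 \)

Comments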

• In general, the variance of the estimator of t from a stratified sample with proportional allocation will be smaller than the variance of the estimator of t from SRS with the same number of observations

• The more unequal the stratum means \(\bar{y}_{hU}\) and the more homogeneous the units within each stratum, the more precision you will gain by using proportional allocation.
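The decomposition SSTO = SSB + SSW and the comparison with SRS can be checked numerically. The sketch below uses a small made-up population (all values and stratum labels are hypothetical) with n = 6, which gives the proportional allocation \(n_h = 3, 2, 1\).

```python
from statistics import mean, variance

# Hypothetical finite population, grouped by stratum.
population = {
    "small": [1.0, 2.0, 3.0, 2.0, 1.0, 3.0],
    "medium": [10.0, 12.0, 11.0, 13.0],
    "large": [30.0, 35.0],
}
n = 6  # total sample size; proportional allocation gives n_h = n * N_h / N

all_values = [v for ys in population.values() for v in ys]
N = len(all_values)
grand_mean = mean(all_values)

# ANOVA sums of squares (statistics.variance uses the N_h - 1 divisor, i.e. S_h^2).
ssb = sum(len(ys) * (mean(ys) - grand_mean) ** 2 for ys in population.values())
ssw = sum((len(ys) - 1) * variance(ys) for ys in population.values())
ssto = (N - 1) * variance(all_values)

fpc = 1 - n / N
# Variance of the estimated total under proportional allocation.
v_prop = fpc * (N / n) * sum(len(ys) * variance(ys) for ys in population.values())
# Variance of the estimated total under SRS.
v_srs = fpc * N ** 2 * variance(all_values) / n

# Condition from the text: proportional allocation wins when sum of S_h^2 < SSB.
sum_S2 = sum(variance(ys) for ys in population.values())

print(f"SSB={ssb:.2f} SSW={ssw:.2f} SSTO={ssto:.2f}")
print(f"V_prop={v_prop:.2f} V_srs={v_srs:.2f} sum S_h^2={sum_S2:.2f}")
```

Because the stratum means in this toy population are far apart, SSB dominates SSW and the stratified design is much more precise than SRS.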

Optimal Allocation

Example: Want to take a sample of American corporations to estimate the amount of trade with Europe

• The variation among large corporations would be greater than the variation among small ones —-often, large units are more variable than small units

• Need to sample a higher percentage of the large corporations

• Proportional allocation won’t work well in this situation

-- Proportional allocation has the same percentage of sampling within each stratum

-- If the variances \(S_h^2\) are similar, proportional allocation is a good choice

-- If the variances \(S_h^2\) vary substantially, we may want to take more samples from the strata with larger variances

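Cost function

\[ c = c_0 + \sum_{h=1}^{H} c_h n_h \]

where \(c_0\) is the overhead cost, such as maintaining an office, and \(c_h\) is the cost of sampling an observation in stratum \(h\)

• Want to minimize \(V(\hat{t}_{str})\) for a fixed cost \(c\), or minimize \(c\) for a fixed \(V(\hat{t}_{str})\)

\[ V(\hat{t}_{str}) = \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{S_h^2}{n_h} = \sum_{h=1}^{H} N_h^2 \frac{S_h^2}{n_h} - \sum_{h=1}^{H} N_h S_h^2 \]

-- Same as minimizing

\[ \sum_{h=1}^{H} N_h^2 \frac{S_h^2}{n_h} \]

Using a Lagrange multiplier \(\lambda\),

\[ f = \sum_{h=1}^{H} N_h^2 \frac{S_h^2}{n_h} + \lambda \left( c_0 + \sum_{h=1}^{H} c_h n_h - c \right) \]

\[ \frac{\partial f}{\partial n_h} = -\frac{N_h^2 S_h^2}{n_h^2} + \lambda c_h = 0 \quad \Longrightarrow \quad n_h = \frac{N_h S_h}{\sqrt{c_h}\sqrt{\lambda}} \]

By the fact that \(\sum_h n_h = n\) we have

\[ \frac{1}{\sqrt{\lambda}} = \frac{n}{\sum_{l=1}^{H} N_l S_l / \sqrt{c_l}} \]

\[ n_{h,opt} = n \times \left( \frac{N_h S_h / \sqrt{c_h}}{\sum_{l=1}^{H} N_l S_l / \sqrt{c_l}} \right) \]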

\[ n_{h,opt} \propto \frac{N_h S_h}{\sqrt{c_h}} \]

We take a larger sample from stratum \(h\) if

• The stratum size \(N_h\) is large

• The variability within the stratum, \(S_h\), is large

• Sampling within the stratum is inexpensive (\(c_h\) is small)
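The allocation rule \( n_h \propto N_h S_h / \sqrt{c_h} \) is easy to turn into a function. This is a sketch with hypothetical stratum sizes, standard deviations, and costs; it returns unrounded sizes and does not enforce \( n_h \le N_h \), which a real design would need to check.

```python
from math import sqrt


def optimal_allocation(N, S, c, n):
    """Unrounded n_h = n * (N_h S_h / sqrt(c_h)) / sum_l (N_l S_l / sqrt(c_l))."""
    scores = [N_h * S_h / sqrt(c_h) for N_h, S_h, c_h in zip(N, S, c)]
    total = sum(scores)
    return [n * s / total for s in scores]


# Hypothetical strata: sizes N_h and standard deviations S_h.
N = [1000, 500, 100]
S = [2.0, 5.0, 20.0]

# Equal costs in every stratum.
equal_cost = optimal_allocation(N, S, [1.0, 1.0, 1.0], n=160)

# Same strata, but sampling in the second stratum is four times as expensive.
costly_middle = optimal_allocation(N, S, [1.0, 4.0, 1.0], n=160)

print([round(x, 1) for x in equal_cost])
print([round(x, 1) for x in costly_middle])
```

Raising the cost in one stratum shifts sample away from it, and the small but highly variable third stratum receives far more than its proportional share in both cases.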

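\[ n_{h,opt} = n \times \left( \frac{N_h S_h / \sqrt{c_h}}{\sum_{l=1}^{H} N_l S_l / \sqrt{c_l}} \right) \]

Neyman allocation: the \(c_h\)'s are all equal

\[ n_{h,Neyman} = n \times \left( \frac{N_h S_h}{\sum_{l=1}^{H} N_l S_l} \right) \]

Let \( a = n \big/ \sum_{l=1}^{H} N_l S_l \), so that \( n_{h,Neyman} = a \times N_h S_h \). Then

\[ V(\hat{t}_{str,Neyman}) = \sum_{h=1}^{H} \left(1 - \frac{n_h}{N_h}\right) \frac{N_h^2 S_h^2}{n_h} = \sum_{h=1}^{H} \left(1 - \frac{a N_h S_h}{N_h}\right) \frac{N_h^2 S_h^2}{a N_h S_h} = \sum_{h=1}^{H} (1 - a S_h) \frac{N_h S_h}{a} \]

\[ = \sum_{h=1}^{H} \left(1 - \frac{n S_h}{\sum_{l=1}^{H} N_l S_l}\right) \frac{N_h S_h \sum_{l=1}^{H} N_l S_l}{n} = \frac{\sum_{h=1}^{H} N_h S_h \sum_{l=1}^{H} N_l S_l}{n} - \sum_{h=1}^{H} N_h S_h^2 \]

For proportional allocation,

\[ V(\hat{t}_{str,Prop}) = \sum_{h=1}^{H} \left(1 - \frac{n_h}{N_h}\right) \frac{N_h^2}{n_h} S_h^2 = \sum_{h=1}^{H} \left(1 - \frac{n}{N}\right) \frac{N}{n} N_h S_h^2 = \frac{N}{n} \sum_{h=1}^{H} N_h S_h^2 - \sum_{h=1}^{H} N_h S_h^2 \]

To compare the two, expand

\[ \sum_{h=1}^{H} N_h S_h \sum_{l=1}^{H} N_l S_l = \sum_{h=1}^{H} N_h^2 S_h^2 + 2 \sum_{i=1}^{H} \sum_{j>i} N_i N_j S_i S_j \]

\[ N \sum_{h=1}^{H} N_h S_h^2 = \sum_{h=1}^{H} N_h^2 S_h^2 + \sum_{i=1}^{H} \sum_{j>i} N_i N_j (S_i^2 + S_j^2) \]

Since \( S_i^2 + S_j^2 \ge 2 S_i S_j \),

\[ V(\hat{t}_{str,Neyman}) \le V(\hat{t}_{str,Prop}) \]

Relative precision of stratification and SRS

\[ V(\hat{t}_{str,Neyman}) \le V(\hat{t}_{str,Prop}) \le V_{srs}(\hat{t}) \]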
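The two closed-form variances can be compared numerically. In this sketch the stratum sizes and standard deviations are hypothetical, the function names are illustrative, and the allocations are left unrounded so that the algebraic identity holds exactly: the gap between the two variances equals \( \frac{1}{n} \sum_{i<j} N_i N_j (S_i - S_j)^2 \), which is never negative.

```python
def var_total_neyman(N, S, n):
    """(sum N_h S_h)^2 / n - sum N_h S_h^2, the variance under Neyman allocation."""
    sum_NS = sum(N_h * S_h for N_h, S_h in zip(N, S))
    sum_NS2 = sum(N_h * S_h ** 2 for N_h, S_h in zip(N, S))
    return sum_NS ** 2 / n - sum_NS2


def var_total_prop(N, S, n):
    """(N / n) sum N_h S_h^2 - sum N_h S_h^2, the variance under proportional allocation."""
    N_total = sum(N)
    sum_NS2 = sum(N_h * S_h ** 2 for N_h, S_h in zip(N, S))
    return (N_total / n) * sum_NS2 - sum_NS2


# Hypothetical strata.
N = [1000, 500, 100]
S = [2.0, 5.0, 20.0]
n = 160

v_neyman = var_total_neyman(N, S, n)
v_prop = var_total_prop(N, S, n)

# Pairwise form of the gap: sum over i < j of N_i N_j (S_i - S_j)^2, divided by n.
pairwise_gap = sum(
    N[i] * N[j] * (S[i] - S[j]) ** 2
    for i in range(len(N))
    for j in range(i + 1, len(N))
) / n

print(v_neyman, v_prop, v_prop - v_neyman, pairwise_gap)
```

The gap vanishes only when all the \(S_h\) are equal, in which case Neyman allocation reduces to proportional allocation.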

Example 3.9. Dollar stratification is often used in accounting. The recorded book amounts are used to stratify the population. For example, when auditing the loan amounts for a financial institution:

stratum 1 might consist of all loans of more than $1 million; \(S_h^2\) will be much larger in this stratum, so it needs a higher sampling fraction

stratum 2 might consist of loans between $500,000 and $999,999

···

the lowest-value stratum might consist of loans of less than $10,000

• Optimal allocation is often an efficient strategy for such a stratification

-- If the goal of the audit is to estimate the dollar discrepancy between the audited amounts and the amounts in the institution’s books, an error in the recorded amount of one of the $3,000,000 loans is likely to contribute more to the audited difference than an error in the recorded amount of one of the $3,000 loans. In a survey such as this, you may even want to use sample size \(N_1\) in stratum 1, that is, audit every loan in that stratum.

Some design issues of stratified random sampling

• Allocating observations to strata
-- Proportional allocation: \( \frac{n_h}{N_h} = \frac{n}{N} \)
-- Optimal allocation; Neyman allocation when the \(c_h\)'s are all equal:

\[ n_{h,Neyman} = n \left( \frac{N_h S_h}{\sum_{l=1}^{H} N_l S_l} \right) \]

• Sample size

• Defining strata: variables and number of strata

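Determining sample size

\[ V(\hat{t}_{str}) = \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{S_h^2}{n_h} \le \sum_{h=1}^{H} N_h^2 \frac{S_h^2}{n_h} = \frac{1}{n} \sum_{h=1}^{H} \frac{n}{n_h} N_h^2 S_h^2 = \frac{v}{n}, \qquad v = \sum_{h=1}^{H} \frac{n}{n_h} N_h^2 S_h^2 \]

• \(v\) depends on the stratum sizes \(N_h\), the variances \(S_h^2\), and the relative sample sizes \(n_h/n\)

• \(v\) can be thought of as the “average” variability per observation unit in a stratified random sample with the specified allocation

95% CI: \( \hat{t}_{str} \pm z_{\alpha/2} \sqrt{v/n} \)

Setting the margin of error \( z_{\alpha/2} \sqrt{v/n} = e \) gives \( n = z_{\alpha/2}^2\, v / e^2 \)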
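The sample size formula can be sketched as a small function. The stratum sizes, standard deviations, and margin of error below are hypothetical, and the allocation is expressed through the fractions \(n_h/n\) (proportional here). Because the formula drops the finite population correction, the result is a slightly conservative bound.

```python
from math import ceil, sqrt
from statistics import NormalDist


def required_sample_size(N, S, fractions, e, alpha=0.05):
    """Return (n, v), where n = ceil(z^2 v / e^2) and v = sum (n/n_h) N_h^2 S_h^2.

    `fractions` holds the relative sample sizes n_h / n for each stratum.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    v = sum(N_h ** 2 * S_h ** 2 / f_h for N_h, S_h, f_h in zip(N, S, fractions))
    return ceil(z ** 2 * v / e ** 2), v


# Hypothetical strata with proportional allocation fractions N_h / N.
N = [1000, 500, 100]
S = [2.0, 5.0, 20.0]
fractions = [N_h / sum(N) for N_h in N]

# Desired margin of error for the estimated total.
e = 2000.0

n_required, v = required_sample_size(N, S, fractions, e)

# Margin of error actually achieved with the rounded-up n.
z = NormalDist().inv_cdf(0.975)
achieved_margin = z * sqrt(v / n_required)

print(n_required, round(achieved_margin, 1))
```

In practice the \(S_h\) are unknown and would come from a pilot study or earlier survey, so the resulting n is a planning figure rather than an exact requirement.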

Defining Strata:

1. Variables for stratification

• Highly associated with variables of interest
-- For estimating total business expenditures on advertising, we might stratify by number of employees or size of the business and by the type of product or service
-- For farm income, we might use the size of the farm as a stratifying variable, since we expect that larger farms would have higher incomes

• Known for all sampling units in the population

2. Number of strata:

• Depends upon many factors such as the difficulty in constructing a sampling frame with stratifying information, and the cost of stratifying

• Formulas in literature

• Pilot study

• General rule: the more information you have about the population, the more strata you should use. You should use an SRS when little prior information about the target population is available.

Recall: Relative precision of stratification and SRS

\[ V(\hat{t}_{str,opt}) \le V(\hat{t}_{str,prop}) \le V_{srs}(\hat{t}) \]

1. If stratified sampling provides higher precision than SRS, why conduct an SRS?

• Stratification adds complexity to the survey, which may not be worth a small gain in precision

• Need to know which units, and how many units, belong to each stratum

2. When is stratified sampling efficient?

• SSB is large (strata means differ greatly)

• SSW is small (variability within stratum is small)

Example: National Pesticide Survey (NPS)

The US Environmental Protection Agency (EPA) sampled drinking water wells to estimate the prevalence of pesticides and nitrate between 1988 and 1990.

• Want a sample that was representative of drinking water wells in the United States

• Want to guarantee that wells in the sample would have a wide range of levels of pesticide use and susceptibility to ground-water pollution

• Want to study two categories of wells:
(1) Community water systems (CWS): systems of piped drinking water with at least 15 connections and/or 25 or more permanent residents, with at least one working well
(2) Rural domestic wells: wells supplying occupied housing in rural areas, not on government property

1. Frame issue: how many drinking water wells exist in the United States?

• For CWS, a list with addresses is in the Federal Reporting Data System (FRDS), maintained by EPA. There are approximately 51,000 CWSs.

• The 1980 census data are used to estimate the number of rural domestic wells. There are about 13 million rural domestic wells.

2. Stratification issue: EPA chose a stratified design; which variables are used to construct the strata?

• EPA developed criteria for separating the population of CWS wells and rural domestic wells into four categories of pesticide use and three relative ground-water vulnerability measures. This design ensures that the range of variability that exists nationally with respect to the agricultural use of pesticides and ground-water vulnerability is reflected in the sample of wells.

• Pesticide use obtained from
-- marketing research
-- proportion of county in agricultural use

• Ground-water vulnerability measures (by DRASTIC)

• Four categories of pesticide use (high, moderate, low, uncommon) and three categories of groundwater vulnerability (high, moderate, low) give 12 strata

Table 4: Strata for National Pesticide Survey

Stratum    pesticide use    groundwater vulnerability    number of counties
                            (estimated by DRASTIC)
 1         high             high                         106
 2         high             moderate                     234
 3         high             low                          129
 4         moderate         high                         110
 5         moderate         moderate                     204
 6         moderate         low                          267
 7         low              high                         193
 8         low              moderate                     375
 9         low              low                          404
10         uncommon         high                         186
11         uncommon         moderate                     513
12         uncommon         low                          416

3. Design considerations
-- For CWS, assume 0.5% of wells contain pesticides; choose n so that the probability of detection is 90%.
-- For rural wells, there were some subgroups of particular interest; assume a 1% rate and 97% probability of detection.
-- n = 564 public, 734 private rural wells

4. Rural wells
-- Each county (N = 3137) is categorized according to the stratification variables.
-- Sample counties.
-- Characterize pesticide use and groundwater vulnerability for subcounty areas.
-- No subcounty area selection for CWS wells

Model-based inference for stratified sampling

• The one-way ANOVA model with fixed effects provides an underlying structure for stratified sampling.

\[ y_{hj} = \mu_h + \varepsilon_{hj} \tag{1} \]

where the \(\varepsilon_{hj}\) are independent with mean 0 and variance \(\sigma_h^2\).

• The estimator of \(\mu_h\) is \(\bar{y}_h\), the sample average in stratum \(h\)

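Estimators and Properties:

• \( T_h = \sum_{j=1}^{N_h} y_{hj} \): the total in stratum \(h\)

• \( T = \sum_{h=1}^{H} T_h \): the overall total

• Note that both \(T_h\) and \(T\) are random variables

• The best linear unbiased estimator for \(T_h\) is \( \hat{T}_h = \frac{N_h}{n_h} \sum_{j \in S_h} y_{hj} \).

• \( E_M[\hat{T}_h - T_h] = 0 \)

• \( E_M[(\hat{T}_h - T_h)^2] = N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{\sigma_h^2}{n_h} \)

By the fact that observations in different strata are independent under the model,

\[ E_M[(\hat{T} - T)^2] = E_M\left[ \left\{ \sum_{h=1}^{H} (\hat{T}_h - T_h) \right\}^2 \right] = E_M\left[ \sum_{h=1}^{H} (\hat{T}_h - T_h)^2 + \sum_{h=1}^{H} \sum_{k \ne h} (\hat{T}_h - T_h)(\hat{T}_k - T_k) \right] \]

\[ = E_M\left[ \sum_{h=1}^{H} (\hat{T}_h - T_h)^2 \right] = \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{\sigma_h^2}{n_h} \]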

Comments:

• The theoretical variance \(\sigma_h^2\) can be estimated by \(s_h^2\)

• Adopting the model in (1) results in the same estimators for \(t\) and its standard error as found under randomization theory.

• If a different model is used, however, then different estimators are obtained.
