Stat472/572 Sampling: Theory and Practice Instructor: Yan Lu

Chapter 4 Ratio and Regression Estimation

Two quantities yi and xi are measured on each sample unit

• $y_i$: response variable; $x_i$: auxiliary (or subsidiary) variable

• Let $t_y = \sum_{i=1}^{N} y_i$ and $t_x = \sum_{i=1}^{N} x_i$, and let their ratio be

$$B = \frac{t_y}{t_x} = \frac{\bar{y}_U}{\bar{x}_U}$$

Example 4.1: Suppose the population consists of agricultural fields of different sizes. Let

$y_i$ = bushels of grain harvested in field $i$

$x_i$ = acreage of field $i$

Then

$B$ = average yield in bushels per acre

$\bar{y}_U$ = average yield in bushels per field

$t_y$ = total yield in bushels

If an SRS is taken, natural estimators for $B$, $t_y$, and $\bar{y}_U$ are

$$\hat{B} = \frac{\bar{y}}{\bar{x}}, \qquad \hat{t}_{yr} = \hat{B}\,t_x, \qquad \hat{\bar{y}}_r = \hat{B}\,\bar{x}_U$$

• Ratio estimation takes advantage of the correlation of $x$ and $y$ in the population; the higher the correlation, the better it works. Define the population correlation coefficient of $x$ and $y$ to be

$$R = \frac{\sum_{i=1}^{N} (x_i - \bar{x}_U)(y_i - \bar{y}_U)}{(N-1)\,S_x S_y}$$

— $S_x$ is the population standard deviation of the $x_i$'s

— $S_y$ is the population standard deviation of the $y_i$'s

— $R$ is simply the Pearson correlation coefficient of $x$ and $y$ for the $N$ units

Why use ratio estimation?

1. Want to estimate a ratio

Example: interested in the percentage of pages in Good Housekeeping magazines that contain at least one advertisement

• Take an SRS of 10 issues

• Let $x_i$ be the total number of pages in issue $i$

• Let $y_i$ be the total number of pages in issue $i$ that contain at least one advertisement

• $\hat{B} = \dfrac{\sum_{i \in S} y_i}{\sum_{i \in S} x_i}$

2. Want to estimate a population total, but population size $N$ is unknown

• $\hat{t}_y = N\bar{y}$, but $N$ is unknown

• $N = \dfrac{t_x}{\bar{x}_U}$

• $\hat{t}_y = \dfrac{t_x}{\bar{x}_U}\,\bar{y}$

• $\hat{t}_{yr} = \dfrac{t_x}{\bar{x}}\,\bar{y} = \hat{B}\,t_x$

Example: Apple Juice from Apples

For a juice company, the price they are paid for apples in large shipments is based on the amount of apple juice from the load.

• need to determine the amount of apple juice in the whole load prior to extraction.

• We can sample n apples and ﬁnd y1, ··· yn, the amount of apple juice in those apples.

• $N\bar{y}$ is hard to get in this case because $N$ is hard to count.

— The total weight of the apples is easy to get.

— We will use the relationship between the weight of the load and the weight of the apple juice one obtains.

— Let $x_i$ be the weight of apple $i$ in the sample; $\bar{x}$ is the average weight of the apples in the sample.

— The number of apples is estimated by $t_x/\bar{x}$

The total weight $t_x$ is easy to get for the entire shipment. We can thus estimate the total apple juice by:

$$\hat{t}_{yr} = \frac{\bar{y}}{\bar{x}}\,t_x.$$

Example: Want to estimate the total # of fish in a haul that are longer than 12 cm

• Take an SRS, estimate the proportion and multiply by the total # of fish $N$. But $N$ is unknown

• Take an SRS and use the fact that having a length of more than 12 cm ($y$) is related to weight ($x$): introduce an auxiliary variable $x_i$, the weight of fish $i$

• $y_i$: 1 if fish $i$ is longer than 12 cm and 0 otherwise; $x_i$: weight of fish $i$; $t_x$: total weight of the haul; $t_y = ?$

• $\hat{t}_{yr} = \hat{B}\,t_x = \dfrac{\bar{y}}{\bar{x}}\,t_x$

3. Often used to increase the precision of estimated means and totals

Let $y_i$ be the # of persons in commune $i$ and $x_i$ be the # of registered births in commune $i$. Want to estimate the # of persons in France

• Randomly select 30 communes

• Estimate 1 = $N\bar{y} = \dfrac{t_x}{\bar{x}_U}\,\bar{y}$ = (# of communes in France) × (average number of persons in the 30 communes)

• Estimate 2 = $\hat{B}\,t_x = \dfrac{\bar{y}}{\bar{x}}\,t_x = \dfrac{t_x}{\bar{x}}\,\bar{y}$

• $\bar{y}$ and $\bar{x}$ are positively correlated. The sampling distribution of $\bar{y}/\bar{x}$ will have less variability than the sampling distribution of $\bar{y}/\bar{x}_U$. As a result, ratio estimation (estimate 2) has smaller MSE, i.e. MSE(estimate 2) < MSE(estimate 1)

4. Adjust estimates from the sample so that they reflect demographic totals

Example: An SRS of 400 students taken at a university with 4000 students may contain 240 women and 160 men, with 84 of the sampled women and 40 of the sampled men planning to follow careers in teaching. Want to estimate the number of students who plan to be teachers

• Estimate 1: using only the information from the SRS,

$$N\bar{y} = 4000 \times \frac{124}{400} = 1240$$

• Estimate 2: knowing that the college has 2700 women and 1300 men, a better estimate is

$$\frac{84}{240} \times 2700 + \frac{40}{160} \times 1300 = 1270$$

— Ratio estimation is used within each gender

— In the sample, 60% are women, but 67.5% of the population are women, so we adjust the estimate of the total number of students planning a career in teaching accordingly
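The arithmetic behind the two teaching-career estimates can be sketched in a few lines of Python. The function names `srs_total` and `group_adjusted_total` are illustrative choices, not part of the course material; the numbers are the ones quoted in the example.

```python
def srs_total(N, n, count):
    """Estimate 1: expand the overall sample proportion count/n to N units."""
    return N * count / n


def group_adjusted_total(groups):
    """Estimate 2: apply each group's sample proportion to its known size.

    groups is a list of (count_in_sample, sample_size, population_size).
    """
    return sum(N_h * count / n_h for count, n_h, N_h in groups)


estimate_1 = srs_total(N=4000, n=400, count=84 + 40)
estimate_2 = group_adjusted_total([(84, 240, 2700), (40, 160, 1300)])
print(estimate_1, estimate_2)
```

Estimate 2 is larger because women, who are more likely to plan a teaching career in this sample, are underrepresented in the sample relative to the population.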

5. Adjust for nonresponse

Example: a sample of businesses. $y_i$: amount spent on health insurance by business $i$; $x_i$: # of employees in business $i$, with $x_i$ known. Want to estimate total insurance expenditures

• Estimate 1: $N\bar{y}$

— Companies with few employees are less likely to respond to the survey

— $y_i$ is proportional to $x_i$

— Estimate 1 overestimates the population total $t_y$

• Estimate 2: $\dfrac{\bar{y}}{\bar{x}}\,t_x$

— $\dfrac{t_x}{\bar{x}} < N$, since companies with many employees are more likely to respond to the survey

— Thus a ratio estimate of total health care insurance expenditures may help to compensate for the nonresponse of companies with few employees

Example: SRS from the U.S. Census of Agriculture. The file agsrs.dat contains data from an SRS of $n = 300$ of the $N = 3078$ counties. Suppose we know the population total for 1987, but only have 1992 information on the SRS of 300 counties. Want to estimate the 1992 population total $t_y$ and mean $\bar{y}_U$.

• Estimate 1: using only the SRS information from 1992,

$$\hat{t}_{y,\mathrm{srs}} = 3078\,\bar{y} = 916{,}927{,}110$$

Figure 1: The plot of acreage, 1992 vs. 1987, for an SRS of 300 counties. The line in the plot goes through the origin and has slope $\hat{B} = 0.9866$. Note that the variability about the line increases with $x$. [Plot omitted; x-axis: Millions of Acres Devoted to Farms (1987); y-axis: Millions of Acres Devoted to Farms (1992).]

• Estimate 2: ratio estimation

— $y_i$ = total acreage of farms in county $i$ in 1992

— $x_i$ = total acreage of farms in county $i$ in 1987

— For 1987, $t_x = 964{,}470{,}625$ and $\bar{x}_U = 964{,}470{,}625/3078 = 313{,}343.3$

— $\hat{B} = \dfrac{\bar{y}}{\bar{x}} = \dfrac{297{,}897.0467}{301{,}953.7233} = 0.986565$

$$\hat{\bar{y}}_r = \hat{B}\,\bar{x}_U = 309{,}133.6$$

$$\hat{t}_{yr} = \hat{B}\,t_x = 0.986565 \times 964{,}470{,}625 = 951{,}513{,}191$$
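The ratio estimates for the Census of Agriculture example can be reproduced from the quoted summary statistics alone. The following is a minimal sketch; the helper name `ratio_estimates` is an illustrative choice, and the inputs are the sample means and the 1987 total given in the example.

```python
# Summary statistics quoted in the example (SRS of n = 300 of N = 3078 counties)
N = 3078
y_bar = 297897.0467   # sample mean of 1992 acreage
x_bar = 301953.7233   # sample mean of 1987 acreage
t_x = 964470625       # known 1987 population total


def ratio_estimates(y_bar, x_bar, t_x, N):
    """Return (B_hat, ratio estimate of the mean, ratio estimate of the total)."""
    b_hat = y_bar / x_bar
    x_bar_U = t_x / N
    return b_hat, b_hat * x_bar_U, b_hat * t_x


b_hat, y_bar_r, t_yr = ratio_estimates(y_bar, x_bar, t_x, N)
t_srs = N * y_bar  # estimate that ignores the 1987 information

print(f"B_hat   = {b_hat:.6f}")
print(f"y_bar_r = {y_bar_r:,.1f}")
print(f"t_yr    = {t_yr:,.0f}")
print(f"t_srs   = {t_srs:,.0f}")
```

Because the sample mean of $x$ falls below the known population mean, the ratio estimate of the total comes out above the plain SRS estimate.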

Comments:

• When the same quantity is measured at different times, the response of interest at an earlier time often makes an excellent auxiliary variable

• $\bar{x}$ is smaller than $\bar{x}_U$. This means that our SRS of size 300 slightly underestimates the true population mean of the $x$'s

• Since $x$ and $y$ are positively correlated, we have reason to believe that $\bar{y}$ may also underestimate the population value $\bar{y}_U$

• Ratio estimation gives a more precise estimate of $\bar{y}_U$ by expanding $\bar{y}$ by the factor $\bar{x}_U/\bar{x}$

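Ratio estimators are usually biased for estimating $\bar{y}_U$ and $t_y$

• Under SRS, $\bar{y}$ is unbiased

— If we calculate $\bar{y}_S$ for each possible SRS $S$, the average of all of the sample means from the possible samples is the population mean $\bar{y}_U$

• $\hat{\bar{y}}_r = \dfrac{\bar{y}}{\bar{x}}\,\bar{x}_U$

— The estimation bias in ratio estimation arises because $\bar{y}$ is multiplied by $\bar{x}_U/\bar{x}$

— If we calculate $\hat{\bar{y}}_r$ for all possible SRSs $S$, the average of all the values of $\hat{\bar{y}}_r$ from the different samples will be close to $\bar{y}_U$, but will usually not equal $\bar{y}_U$ exactly

• For large samples, the sampling distributions of both $\bar{y}$ and $\hat{\bar{y}}_r$ are approximately normal

• Ratio estimators are biased but usually have smaller variance

Bias of $\hat{B}$

$$
\begin{aligned}
\mathrm{Bias}[\hat{B}] = E[\hat{B} - B]
&= E\left[\frac{\bar{y}}{\bar{x}} - \frac{\bar{y}_U}{\bar{x}_U}\right] \\
&= E\left[\frac{\bar{y}}{\bar{x}_U} \times \frac{\bar{x}_U}{\bar{x}} - \frac{\bar{y}_U}{\bar{x}_U}\right] \\
&= E\left[\frac{\bar{y}}{\bar{x}_U} \times \left(1 - \frac{\bar{x} - \bar{x}_U}{\bar{x}}\right) - \frac{\bar{y}_U}{\bar{x}_U}\right] \\
&= -E\left[\frac{\bar{y}\,(\bar{x} - \bar{x}_U)}{\bar{x}_U\,\bar{x}}\right] \\
&\approx \frac{1}{\bar{x}_U^2}\left[B\,V(\bar{x}) - \mathrm{Cov}(\bar{x}, \bar{y})\right] \\
&= \left(1 - \frac{n}{N}\right)\frac{1}{n\,\bar{x}_U^2}\left(B S_x^2 - R S_x S_y\right)
\end{aligned}
$$

where $R$ is the correlation between $x$ and $y$.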

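Bias of $\hat{\bar{y}}_r$

$$
\begin{aligned}
\mathrm{Bias}[\hat{\bar{y}}_r] = E[\hat{\bar{y}}_r - \bar{y}_U]
&= E[\hat{B}\bar{x}_U - B\bar{x}_U] = \bar{x}_U\,E[\hat{B} - B] \\
&\approx \frac{1}{\bar{x}_U}\left[B\,V(\bar{x}) - \mathrm{Cov}(\bar{x}, \bar{y})\right] \\
&= \left(1 - \frac{n}{N}\right)\frac{1}{n\,\bar{x}_U}\left(B S_x^2 - R S_x S_y\right)
\end{aligned}
$$

The bias of $\hat{\bar{y}}_r$ is small if

• the sample size $n$ is large

• the sampling fraction $n/N$ is large

• $S_x$ is small

• the correlation $R$ is close to 1

Note: if all $x$'s are the same value ($S_x = 0$), then the ratio estimator is the same as the SRS estimator $\bar{y}$ and the bias is zero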
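The claim that $\bar{y}$ is unbiased while the ratio estimator is not can be checked by brute force on a tiny population, averaging each estimator over every possible SRS. The population below is hypothetical (four units with $y = x + 2$, so $y$ is not proportional to $x$) and is chosen only to make the bias visible.

```python
from itertools import combinations

# Hypothetical population of N = 4 units; y = x + 2, so y is not proportional to x
x = [1, 2, 3, 4]
y = [3, 4, 5, 6]
n = 2

N = len(x)
x_bar_U = sum(x) / N
y_bar_U = sum(y) / N

srs_means = []
ratio_means = []
for sample in combinations(range(N), n):  # every possible SRS of size n
    x_bar = sum(x[i] for i in sample) / n
    y_bar = sum(y[i] for i in sample) / n
    srs_means.append(y_bar)
    ratio_means.append(y_bar / x_bar * x_bar_U)

mean_srs = sum(srs_means) / len(srs_means)
mean_ratio = sum(ratio_means) / len(ratio_means)

print("population mean        :", y_bar_U)
print("average of y_bar       :", mean_srs)
print("average of ratio est.  :", mean_ratio)
print("bias of ratio estimator:", mean_ratio - y_bar_U)
```

With $n = 2$ the bias is noticeable; per the list above, it shrinks as $n$ and $n/N$ grow, and it would vanish if every $y_i$ were exactly $Bx_i$.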

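MSE of $\hat{B}$

$$
\begin{aligned}
E[(\hat{B} - B)^2]
&= E\left[\left(\frac{\bar{y}}{\bar{x}} - B\,\frac{\bar{x}}{\bar{x}}\right)^2\right]
 = E\left[\left(\frac{\bar{y} - B\bar{x}}{\bar{x}}\right)^2\right] \\
&= E\left[\left\{\left(\frac{\bar{y} - B\bar{x}}{\bar{x}_U}\right)\left(1 - \frac{\bar{x} - \bar{x}_U}{\bar{x}}\right)\right\}^2\right] \\
&= E\left[\left(\frac{\bar{y} - B\bar{x}}{\bar{x}_U}\right)^2 + \left(\frac{\bar{y} - B\bar{x}}{\bar{x}_U}\right)^2 \times \left(-2\,\frac{\bar{x} - \bar{x}_U}{\bar{x}} + \left(\frac{\bar{x} - \bar{x}_U}{\bar{x}}\right)^2\right)\right] \\
&\approx E\left[\left(\frac{\bar{y} - B\bar{x}}{\bar{x}_U}\right)^2\right]
 = \frac{1}{\bar{x}_U^2}\,E\left[(\bar{y} - B\bar{x})^2\right]
\end{aligned}
$$

where the approximation holds because the second and third terms are negligible relative to the first term.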

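Let

• $d_i = y_i - Bx_i$

• $e_i = \hat{d}_i = y_i - \hat{B}x_i$

• $\bar{d} = \bar{y} - B\bar{x}$

$$
\begin{aligned}
MSE(\hat{B})
&\approx \frac{1}{\bar{x}_U^2}\,E\left[(\bar{y} - B\bar{x})^2\right] \\
&= \frac{1}{\bar{x}_U^2}\,E\left[\left\{\frac{1}{n}\sum_{i \in S}(y_i - Bx_i)\right\}^2\right] \\
&= \frac{1}{\bar{x}_U^2}\,V(\bar{d}) \\
&= \frac{1}{\bar{x}_U^2}\left(1 - \frac{n}{N}\right)\frac{S_d^2}{n}
\end{aligned}
$$

So

$$\widehat{MSE}(\hat{B}) \approx \frac{1}{\bar{x}^2}\left(1 - \frac{n}{N}\right)\frac{s_e^2}{n}$$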

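Variance of $\hat{B}$

In large samples, the bias of $\hat{B}$ is typically small relative to $V(\hat{B})$, so $MSE(\hat{B}) \approx V(\hat{B})$

$$V(\hat{B}) \approx \frac{1}{\bar{x}_U^2}\left(1 - \frac{n}{N}\right)\frac{S_d^2}{n}, \qquad \hat{V}(\hat{B}) \approx \frac{1}{\bar{x}^2}\left(1 - \frac{n}{N}\right)\frac{s_e^2}{n}$$

Variance of $\hat{\bar{y}}_r$

$$\hat{V}(\hat{\bar{y}}_r) = \bar{x}_U^2\,\hat{V}(\hat{B}) \approx \left(1 - \frac{n}{N}\right)\frac{s_e^2}{n}$$

The variance of $\hat{\bar{y}}_r$ is small if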

• the sample size n is large

• the sampling fraction n/N is large

• the deviations yi − Bxi are small

• the correlation R is close to 1

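Compare ratio estimation to SRS estimation

$$V(\hat{\bar{y}}_r) \approx \left(1 - \frac{n}{N}\right)\frac{S_d^2}{n}, \qquad V(\bar{y}_{srs}) = \left(1 - \frac{n}{N}\right)\frac{S_y^2}{n}$$

If $S_d^2 < S_y^2$ then $V(\hat{\bar{y}}_r) < V(\bar{y}_{srs})$, and ratio estimation is more efficient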

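$$
\begin{aligned}
(N-1)S_d^2
&= \sum_{i=1}^{N}(y_i - Bx_i)^2 = \sum_{i=1}^{N}(y_i - \bar{y}_U + \bar{y}_U - Bx_i)^2 \\
&= \sum_{i=1}^{N}(y_i - \bar{y}_U + B\bar{x}_U - Bx_i)^2 \\
&= \sum_{i=1}^{N}(y_i - \bar{y}_U)^2 + \sum_{i=1}^{N}(B\bar{x}_U - Bx_i)^2 + 2\sum_{i=1}^{N}(y_i - \bar{y}_U)(B\bar{x}_U - Bx_i) \\
&= (N-1)S_y^2 + (N-1)B^2S_x^2 - 2(N-1)B R S_x S_y
\end{aligned}
$$

where $R$ is the population correlation coefficient, $R = \dfrac{\sum_{i=1}^{N}(x_i - \bar{x}_U)(y_i - \bar{y}_U)}{(N-1)\,S_x S_y}$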

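$$
\begin{aligned}
V(\hat{\bar{y}}_r)
&\approx \left(1 - \frac{n}{N}\right)\frac{1}{n}\,\frac{\sum_{i=1}^{N}(y_i - Bx_i)^2}{N-1} \\
&= \left(1 - \frac{n}{N}\right)\frac{1}{n}\left\{S_y^2 + B^2S_x^2 - 2BRS_xS_y\right\}
\end{aligned}
$$

$$V(\bar{y}_{srs}) = \left(1 - \frac{n}{N}\right)\frac{S_y^2}{n}$$

Ratio estimation is more efficient if

$$S_y^2 + B^2S_x^2 - 2BRS_xS_y < S_y^2$$

i.e.

$$B^2S_x^2 < 2BRS_xS_y$$

and, dividing by $BS_x$ (for $B > 0$ and $S_x > 0$),

$$BS_x < 2RS_y$$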

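Substituting $B = \bar{y}_U/\bar{x}_U$:

$$\frac{\bar{y}_U}{\bar{x}_U}\,S_x < 2RS_y \iff \frac{S_x}{\bar{x}_U} < 2R\,\frac{S_y}{\bar{y}_U} \iff R > \frac{CV(\bar{x})}{2\,CV(\bar{y})}$$

where the coefficients of variation are

$$CV(\bar{y}) = \frac{\mathrm{Sd}(\bar{y})}{\bar{y}_U}, \qquad CV(\bar{x}) = \frac{\mathrm{Sd}(\bar{x})}{\bar{x}_U}$$

(the common factor $\sqrt{(1 - n/N)/n}$ in $\mathrm{Sd}(\bar{x})$ and $\mathrm{Sd}(\bar{y})$ cancels in the ratio). Usually $CV(\bar{x})$ and $CV(\bar{y})$ do not differ much in absolute value, so the condition becomes approximately:

Ratio estimation is more efficient if $R > 1/2$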
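The expansion of $S_d^2$ and the resulting efficiency condition can be checked numerically. The sketch below uses a small hypothetical population (the six $(x, y)$ pairs are made up for illustration) and compares $S_d^2$ computed directly from $d_i = y_i - Bx_i$ with the expanded form, then evaluates both versions of the condition.

```python
import math

# Hypothetical population of N = 6 units (illustrative values only)
x = [2.0, 4.0, 5.0, 7.0, 9.0, 12.0]
y = [3.0, 5.0, 8.0, 9.0, 13.0, 16.0]
N = len(x)

x_bar_U = sum(x) / N
y_bar_U = sum(y) / N
B = y_bar_U / x_bar_U

# Population standard deviations use the N - 1 divisor, as in the notes
S_x = math.sqrt(sum((a - x_bar_U) ** 2 for a in x) / (N - 1))
S_y = math.sqrt(sum((b - y_bar_U) ** 2 for b in y) / (N - 1))
R = sum((a - x_bar_U) * (b - y_bar_U) for a, b in zip(x, y)) / ((N - 1) * S_x * S_y)

# S_d^2 computed directly from d_i = y_i - B x_i, and from the expansion
S_d2_direct = sum((b - B * a) ** 2 for a, b in zip(x, y)) / (N - 1)
S_d2_expanded = S_y ** 2 + B ** 2 * S_x ** 2 - 2 * B * R * S_x * S_y

# Two equivalent statements of "ratio estimation is more efficient" (B > 0)
ratio_more_efficient = S_d2_direct < S_y ** 2
cv_condition = R > (S_x / x_bar_U) / (2 * (S_y / y_bar_U))

print(S_d2_direct, S_d2_expanded, ratio_more_efficient, cv_condition)
```

For this population the two conditions agree, as the derivation says they must whenever $B > 0$.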

Review Ratio Estimation: Inspired by regression through the origin, $y = Bx$

Two variables $y_i$ and $x_i$ are measured on each sample unit

• $y_i$: response variable, $x_i$: auxiliary variable

• $\hat{B} = \dfrac{\bar{y}}{\bar{x}}$

• $\hat{t}_{yr} = \hat{B}\,t_x$, $\hat{\bar{y}}_r = \hat{B}\,\bar{x}_U$

• Ratio estimation takes advantage of the correlation of $x$ and $y$ in the population; the higher the correlation, the better it works.

Sometimes, data appear to be evenly scattered about a straight line that does not go through the origin —the data look as though the usual straight-line regression model

y = B0 + B1x

would provide a good ﬁt

Example of ratio estimation (R handout)

• Ratio estimator

— $y_i$ = total acreage of farms in county $i$ in 1992

— $x_i$ = total acreage of farms in county $i$ in 1987

— For 1987, $t_x = 963{,}464{,}412$

— $\hat{B} = \dfrac{\bar{y}}{\bar{x}} = \dfrac{297{,}897.0467}{301{,}953.7233} = 0.986565$, and $R = 0.995806$

$$\hat{t}_{yr} = \hat{B}\,t_x = 0.986565 \times 963{,}464{,}412 = 950{,}520{,}496$$

$$SE(\hat{t}_{yr}) = 5{,}540{,}376$$

• SRS estimator

$$\hat{t}_y = N\bar{y} = 916{,}927{,}110$$

$SE(\hat{t}_y) = 58{,}169{,}381$, which is about 10.5 times as large as the SE from ratio estimation ($SE(\hat{t}_{yr}) = 5{,}540{,}376$)

• Coefficient of Variation (CV) comparison

Recall the coefficient of variation (CV): when $\bar{y}_U \neq 0$,

$$CV(\bar{y}) = \frac{\sqrt{V(\bar{y})}}{E(\bar{y})}, \qquad \widehat{CV}(\bar{y}) = \frac{SE(\bar{y})}{\bar{y}}$$

— Measure of relative variability

— Does not depend on the unit of measurement

— $CV(\hat{t}) = CV(\bar{y})$

— If the CV of $\bar{y}$ is small, that is, if $\bar{y}_U$ is estimated with high relative precision, the bias is small relative to the square root of the variance.

— A small $CV(\bar{y})$ also means that $\bar{y}$ is stable from sample to sample.

Table 1: Comparisons

                   Ratio estimation                        SRS estimation
SE of $\hat{t}$    5,540,376                               58,169,381
Estimated CV       5,540,376 / 950,520,496 = 0.0058        58,169,381 / 916,927,110 = 0.0634

• Including the 1987 information through the ratio estimator has greatly increased the precision. If all quantities to be estimated were highly correlated with the 1987 acreage, we could dramatically reduce the sample size and still obtain high precision by using ratio estimators rather than $N\bar{y}$.
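The estimated CVs in the comparison are simply each standard error divided by its estimate. A short check in Python, using the figures quoted above (the helper name `estimated_cv` is an illustrative choice):

```python
def estimated_cv(se, estimate):
    """Estimated coefficient of variation: standard error over the estimate."""
    return se / estimate


cv_ratio = estimated_cv(5_540_376, 950_520_496)   # ratio estimator of the total
cv_srs = estimated_cv(58_169_381, 916_927_110)    # SRS estimator of the total

print(f"ratio: {cv_ratio:.4f}   SRS: {cv_srs:.4f}   SE ratio: {58_169_381 / 5_540_376:.1f}")
```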

Review Regression: A person's muscle mass is expected to decrease with age. To explore this relationship in women, a nutritionist randomly selected 15 women from each 10-year age group, beginning with age 40 and ending with age 79, for a total of 60 women.

[Figure: Scatter Plot of Muscle Mass; x-axis: ex.data$age; y-axis: ex.data$mass.]

Regression Estimation

Normal error regression model:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

• $Y_i$: response of the $i$th trial

• $X_i$: a known constant, the level of the predictor variable in the $i$th trial

• $\beta_0$ and $\beta_1$: parameters

• $\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$ for $i = 1, 2, \ldots, n$

• $E(Y_i) = \beta_0 + \beta_1 X_i$

Least squares estimators:

• Consider the deviation of $Y_i$ from its expected value

$$Y_i - (\beta_0 + \beta_1 X_i)$$

• Least squares measure:

$$Q = \sum_{i=1}^{n}\left(Y_i - (\beta_0 + \beta_1 X_i)\right)^2$$

• Objective: to find estimators $b_0$ and $b_1$ for $\beta_0$ and $\beta_1$, respectively, for which $Q$ is minimized

• $\hat{\beta}_0 = b_0$, $\hat{\beta}_1 = b_1$

• Regression line: $\hat{E}(Y) = b_0 + b_1 X$
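The minimizers of $Q$ have the familiar closed form $b_1 = \sum (X_i - \bar{X})(Y_i - \bar{Y}) / \sum (X_i - \bar{X})^2$ and $b_0 = \bar{Y} - b_1\bar{X}$. The sketch below computes them and evaluates $Q$; the age and muscle-mass values are made up for illustration and are not the data behind the scatter plot.

```python
def least_squares(xs, ys):
    """Return (b0, b1) minimizing Q = sum (y - (b0 + b1 x))^2."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def Q(b0, b1, xs, ys):
    """Least squares criterion for a candidate line."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical ages and muscle-mass measurements (not the course data)
ages = [41, 47, 53, 58, 64, 69, 73, 78]
mass = [112, 104, 98, 95, 87, 80, 76, 70]

b0, b1 = least_squares(ages, mass)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, Q = {Q(b0, b1, ages, mass):.3f}")
```

Moving either coefficient away from the fitted values can only increase $Q$, which is one quick way to sanity-check a fit.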

Plot of the regression line that describes the statistical relation between muscle mass and age

[Figure: Line of Relationship; x-axis: ex.data$age; y-axis: ex.data$mass.]

Regression in Simple Random Sampling (SRS)

Want to estimate the population mean and population total

Assumptions:

• The relationship between E(y) and x is a straight line

E(y) = B0 + B1x

• The population mean of x’s, x¯U is known

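Population quantities (left) and their estimators (right):

$$B_1 = \frac{\sum_{i=1}^{N}(x_i - \bar{x}_U)(y_i - \bar{y}_U)}{\sum_{i=1}^{N}(x_i - \bar{x}_U)^2}, \qquad \hat{B}_1 = \frac{\sum_{i \in S}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i \in S}(x_i - \bar{x})^2}$$

$$B_0 = \bar{y}_U - B_1\bar{x}_U, \qquad \hat{B}_0 = \bar{y} - \hat{B}_1\bar{x}$$

• $B_1$ and $B_0$ are, respectively, the least squares regression slope and intercept calculated from all the data in the population

• The regression estimator of $\bar{y}_U$ is

$$
\begin{aligned}
\hat{\bar{y}}_{reg} &= \hat{B}_0 + \hat{B}_1\bar{x}_U \\
&= \bar{y} - \hat{B}_1\bar{x} + \hat{B}_1\bar{x}_U \\
&= \bar{y} + \hat{B}_1(\bar{x}_U - \bar{x})
\end{aligned}
$$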

Properties of the Estimators

Notation

• $d_i = y_i - (B_0 + B_1x_i)$

• $e_i = y_i - (\hat{B}_0 + \hat{B}_1x_i)$, called residuals

• $R = \dfrac{\sum_{i=1}^{N}(x_i - \bar{x}_U)(y_i - \bar{y}_U)}{(N-1)\,S_x S_y}$, the population correlation coefficient of $x$ and $y$

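Bias of $\hat{\bar{y}}_{reg}$

$$\mathrm{bias}(\hat{\bar{y}}_{reg}) = -\mathrm{Cov}(\hat{B}_1, \bar{x})$$

Proof:

$$
\begin{aligned}
\mathrm{bias}(\hat{\bar{y}}_{reg}) &= E[\hat{\bar{y}}_{reg} - \bar{y}_U] \\
&= E[\hat{B}_0 + \hat{B}_1\bar{x}_U - \bar{y}_U] \\
&= E[\bar{y} - \hat{B}_1\bar{x} + \hat{B}_1\bar{x}_U - \bar{y}_U] \\
&= E[\bar{y} - \bar{y}_U] - E[\hat{B}_1(\bar{x} - \bar{x}_U)] \\
&= -\mathrm{Cov}(\hat{B}_1, \bar{x})
\end{aligned}
$$

$\hat{\bar{y}}_{reg}$ is biased for $\bar{y}_U$

• If the regression line goes through all of the points $(x_i, y_i)$ in the population, then the bias is zero, since $\hat{B}_1 = B_1$ for every sample, so $\mathrm{Cov}(\hat{B}_1, \bar{x}) = 0$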

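MSE of $\hat{\bar{y}}_{reg}$: $MSE(\hat{\bar{y}}_{reg}) \approx \left(1 - \dfrac{n}{N}\right)\dfrac{S_d^2}{n}$

$$
\begin{aligned}
d_i &= y_i - (B_0 + B_1x_i) \\
&= y_i - (B_0 + B_1x_i - B_1\bar{x}_U + B_1\bar{x}_U) \\
&= y_i - (B_0 + B_1\bar{x}_U + B_1(x_i - \bar{x}_U)) \\
&= y_i - [\bar{y}_U + B_1(x_i - \bar{x}_U)]
\end{aligned}
$$

$$
\begin{aligned}
MSE(\hat{\bar{y}}_{reg}) &= E\left[(\hat{\bar{y}}_{reg} - \bar{y}_U)^2\right] \\
&= E\left[\left(\bar{y} + \hat{B}_1(\bar{x}_U - \bar{x}) - \bar{y}_U\right)^2\right] \\
&= E\left[\left\{\bar{y} - [\bar{y}_U + \hat{B}_1(\bar{x} - \bar{x}_U)]\right\}^2\right] \\
&\approx \mathrm{Var}(\bar{d}) \\
&= \left(1 - \frac{n}{N}\right)\frac{S_d^2}{n}
\end{aligned}
$$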

Another expression of the MSE of $\hat{\bar{y}}_{reg}$

Notice $B_1 = R \cdot \dfrac{S_y}{S_x}$ and

$$S_d^2 = \sum_{i=1}^{N}\frac{(y_i - \bar{y}_U - B_1[x_i - \bar{x}_U])^2}{N-1} = S_y^2(1 - R^2)$$

$$MSE(\hat{\bar{y}}_{reg}) \approx \left(1 - \frac{n}{N}\right)\frac{1}{n}\,S_y^2(1 - R^2)$$

$MSE(\hat{\bar{y}}_{reg})$ is small if

• $n$ is large, $n/N$ is large

• The correlation $R$ is close to either $-1$ or $+1$

Variance of $\hat{\bar{y}}_{reg}$

For large SRSs

• Bias is often negligible in large samples

• The MSE for regression estimation is approximately equal to the variance
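The identities $B_1 = R\,S_y/S_x$ and $S_d^2 = S_y^2(1 - R^2)$ used for the regression estimator's MSE are easy to confirm numerically. The sketch below does so for a small hypothetical population (values made up for illustration, with a negative correlation to show that the sign of $R$ does not matter).

```python
import math

# Hypothetical population (illustrative values only)
x = [1.0, 3.0, 4.0, 6.0, 8.0, 9.0, 11.0]
y = [12.0, 10.0, 11.0, 7.0, 6.0, 3.0, 2.0]
N = len(x)

x_bar_U = sum(x) / N
y_bar_U = sum(y) / N
S_xx = sum((a - x_bar_U) ** 2 for a in x)
S_yy = sum((b - y_bar_U) ** 2 for b in y)
S_xy = sum((a - x_bar_U) * (b - y_bar_U) for a, b in zip(x, y))

S_x = math.sqrt(S_xx / (N - 1))
S_y = math.sqrt(S_yy / (N - 1))
R = S_xy / ((N - 1) * S_x * S_y)

# Population least squares slope and intercept
B1 = S_xy / S_xx
B0 = y_bar_U - B1 * x_bar_U

# Variance of the population residuals d_i = y_i - (B0 + B1 x_i)
S_d2 = sum((b - (B0 + B1 * a)) ** 2 for a, b in zip(x, y)) / (N - 1)

print("B1 =", B1, " R*S_y/S_x =", R * S_y / S_x)
print("S_d^2 =", S_d2, " S_y^2 (1 - R^2) =", S_y ** 2 * (1 - R ** 2))
```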

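Estimator for the total $\hat{t}_{yreg}$

$$
\begin{aligned}
\hat{t}_{yreg} &= \sum_{i \in S} y_i + \sum_{i \notin S} \hat{y}_i \\
&= \sum_{i \in S} y_i + \sum_{i \notin S} (\hat{B}_0 + \hat{B}_1x_i) \\
&= \sum_{i \in S} y_i + (N - n)\hat{B}_0 + \hat{B}_1\left(t_x - \sum_{i \in S} x_i\right)
\end{aligned}
$$

If $n \ll N$,

$$
\begin{aligned}
\hat{t}_{yreg} &\approx N\hat{B}_0 + \hat{B}_1t_x \\
&= N\hat{B}_0 + \hat{B}_1N\bar{x}_U \\
&= N(\hat{B}_0 + \hat{B}_1\bar{x}_U) = N\hat{\bar{y}}_{reg}
\end{aligned}
$$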

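Confidence Intervals:

$$SE(\hat{\bar{y}}_{reg}) = \sqrt{\left(1 - \frac{n}{N}\right)\frac{s_e^2}{n}}, \qquad SE(\hat{t}_{yreg}) = N\sqrt{\left(1 - \frac{n}{N}\right)\frac{s_e^2}{n}}$$

A $100(1-\alpha)\%$ CI for $\bar{y}_U$ is

$$\hat{\bar{y}}_{reg} \pm t_{n-2,\alpha/2}\sqrt{\left(1 - \frac{n}{N}\right)\frac{s_e^2}{n}}$$

A $100(1-\alpha)\%$ approximate CI for $t_y$ is

$$\hat{t}_{yreg} \pm t_{n-2,\alpha/2}\,N\sqrt{\left(1 - \frac{n}{N}\right)\frac{s_e^2}{n}}$$

Use R output

Example 4.9 (pages 139-141): Want to estimate the number of dead trees in an area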

• divide the area into 100 square plots

• count the number of dead trees on a photograph of each plot

• photo counts can be made quickly, but sometimes a tree is misclassiﬁed or not detected

• select an SRS of 25 of the plots for ﬁeld counts of dead trees

• the population mean number of dead trees per plot from the photo count is 11.3
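A regression estimate of this kind, with its standard error and confidence interval, could be computed along the following lines. This is a sketch under stated assumptions: the photo and field counts are hypothetical (only 8 plots, not the 25 from the example), $s_e^2$ is computed with an $n - 2$ divisor to match the $t_{n-2}$ critical value, and the critical value is passed in by hand rather than looked up. Only $N = 100$ and $\bar{x}_U = 11.3$ come from the example.

```python
import math


def regression_estimate(xs, ys, x_bar_U, N, t_crit):
    """Regression estimator of the mean and total under SRS, with SE and CI.

    Assumption: s_e^2 uses an n - 2 divisor; t_crit is the t_{n-2, alpha/2}
    critical value supplied by the caller.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    b0 = y_bar - b1 * x_bar
    y_reg = b0 + b1 * x_bar_U  # equals y_bar + b1 * (x_bar_U - x_bar)
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    s_e2 = sum(e * e for e in residuals) / (n - 2)
    se = math.sqrt((1 - n / N) * s_e2 / n)
    return {
        "mean": y_reg,
        "se": se,
        "ci": (y_reg - t_crit * se, y_reg + t_crit * se),
        "total": N * y_reg,
        "se_total": N * se,
    }


# Hypothetical photo counts (x) and field counts (y) for 8 sampled plots
photo = [8, 11, 14, 9, 12, 15, 10, 13]
field = [9, 13, 15, 11, 12, 17, 10, 16]

# t_{6, 0.025} is about 2.447 for a 95% interval with n = 8
result = regression_estimate(photo, field, x_bar_U=11.3, N=100, t_crit=2.447)
print(result)
```

When the sample mean of $x$ already equals $\bar{x}_U$, the adjustment term vanishes and the regression estimate reduces to $\bar{y}$.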

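SRS and Ratio estimation using weights

SRS

• $w_i = N/n$

• $\bar{y} = \dfrac{\sum_{i \in S} w_iy_i}{\sum_{i \in S} w_i}$

• $\hat{t}_y = \sum_{i \in S} w_iy_i = N\bar{y}$

• $\sum_{i \in S} w_i = \sum_{i \in S} \dfrac{N}{n} = n \cdot \dfrac{N}{n} = N$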

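Ratio Estimation

$$
\begin{aligned}
\hat{t}_{yr} &= \frac{\bar{y}}{\bar{x}}\,t_x = \frac{\hat{t}_y}{\hat{t}_x}\,t_x \\
&= \frac{t_x}{\hat{t}_x}\sum_{i \in S} w_iy_i = \sum_{i \in S} w_i \cdot \frac{t_x}{\hat{t}_x}\,y_i \\
&= \sum_{i \in S} w_ig_iy_i \qquad (g_i = t_x/\hat{t}_x) \\
&= \sum_{i \in S} w_i^*y_i \qquad (w_i^* = w_ig_i)
\end{aligned}
$$

• $w_i^*$ depends upon values from the sample

• The weight adjustments $g_i$ calibrate the estimates on the $x$ variable. Since $\sum_{i \in S} w_ig_ix_i = t_x$, the adjusted weights force the estimated total for the $x$ variable to equal the known population total $t_x$. The factors $g_i$ are called the calibration factors.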

Example 4.6, Census of Agriculture data used in Examples 4.2 and 4.3 continued

• For each observation

$$g_i = t_x/\hat{t}_x = 964{,}470{,}625/929{,}413{,}560 = 1.037719554$$

• Since $\hat{t}_x < t_x$, each observation's sampling weight is increased by a small amount

• The sampling weight for the SRS design is

$$w_i = 3078/300 = 10.26$$

• The ratio-adjusted weight for each observation is

$$w_i^* = w_ig_i = (10.26)(1.037719554) = 10.64700262$$

• $\sum_{i \in S} w_ig_ix_i = \sum_{i \in S} 10.64700262\,x_i = 964{,}470{,}625 = t_x$

• $\sum_{i \in S} w_ig_iy_i = \sum_{i \in S} 10.64700262\,y_i = 951{,}513{,}191 = \hat{t}_{yr}$

• The adjusted weights, however, no longer sum to $N = 3078$:

$$\sum_{i \in S} w_ig_i = (300)(10.64700262) = 3194$$

• The ratio estimator is calibrated to the population total $t_x$ of the $x$ variable, but is no longer calibrated to the population size $N$.
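The weight calculations in this example can be verified directly from the quoted totals. The short Python sketch below recomputes the calibration factor and adjusted weight, and checks both calibration statements; the variable names are illustrative.

```python
N, n = 3078, 300
t_x = 964_470_625        # known 1987 population total
t_x_hat = 929_413_560    # SRS estimate of t_x quoted in the example

w = N / n                # SRS sampling weight
g = t_x / t_x_hat        # calibration factor, identical for every unit
w_star = w * g           # ratio-adjusted weight

# With equal weights, sum_i w x_i = t_x_hat, so sum_i w* x_i = g * t_x_hat
calibrated_x_total = g * t_x_hat
sum_adjusted_weights = n * w_star

print(f"w = {w}, g = {g:.9f}, w* = {w_star:.8f}")
print(f"sum of w* x_i = {calibrated_x_total:,.0f} (t_x = {t_x:,})")
print(f"sum of w*     = {sum_adjusted_weights:.1f} (N = {N})")
```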

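Regression Estimation (SRS)

$$
\begin{aligned}
\hat{t}_{yreg} &= N[\hat{B}_0 + \hat{B}_1\bar{x}_U] \\
&= N\hat{B}_0 + N\hat{B}_1\bar{x}_U \\
&= \hat{B}_0N + \hat{B}_1t_x \\
&= (\bar{y} - \hat{B}_1\bar{x})N + \hat{B}_1t_x \\
&= \hat{t}_y - \hat{B}_1\hat{t}_x + \hat{B}_1t_x = \hat{t}_y + \hat{B}_1(t_x - \hat{t}_x) \\
&= \sum_{i \in S} w_iy_i + \frac{\sum_{i \in S}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i \in S}(x_i - \bar{x})^2} \cdot (t_x - \hat{t}_x) \\
&= \sum_{i \in S} w_iy_i + \frac{\sum_{i \in S} y_i(x_i - \bar{x})}{\sum_{i \in S}(x_i - \bar{x})^2} \cdot (t_x - \hat{t}_x) \\
&= \sum_{i \in S} w_iy_i + \frac{\sum_{i \in S} w_iy_i(x_i - \bar{x})}{\sum_{i \in S} w_i(x_i - \bar{x})^2} \cdot (t_x - \hat{t}_x) \\
&= \sum_{i \in S} w_i\left[1 + \frac{(x_i - \bar{x})(t_x - \hat{t}_x)}{\sum_{j \in S} w_j(x_j - \bar{x})^2}\right]y_i \\
&= \sum_{i \in S} g_iy_i
\end{aligned}
$$

where $g_i = w_i\left[1 + \dfrac{(x_i - \bar{x})(t_x - \hat{t}_x)}{\sum_{j \in S} w_j(x_j - \bar{x})^2}\right]$ is called the g-weight.

When $y_i = x_i$,

$$
\begin{aligned}
\hat{t}_{yreg} &= \sum_{i \in S} w_i\left[1 + \frac{(x_i - \bar{x})(t_x - \hat{t}_x)}{\sum_{j \in S} w_j(x_j - \bar{x})^2}\right]x_i \\
&= \sum_{i \in S} w_ix_i + \frac{\sum_{i \in S} w_i(x_i - \bar{x})^2}{\sum_{i \in S} w_i(x_i - \bar{x})^2} \cdot (t_x - \hat{t}_x) \\
&= \hat{t}_x + (t_x - \hat{t}_x) \\
&= t_x
\end{aligned}
$$

(The numerator uses $\sum_{i \in S} w_i(x_i - \bar{x})x_i = \sum_{i \in S} w_i(x_i - \bar{x})^2$, which holds because $\sum_{i \in S} w_i(x_i - \bar{x}) = 0$.)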
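The calibration property of the regression g-weights can be checked numerically. The sketch below computes $g_i$ for an SRS and confirms that $\sum_{i \in S} g_ix_i = t_x$. The sample values and $t_x = 1130$ are hypothetical, and `regression_g_weights` is an illustrative name.

```python
def regression_g_weights(xs, N, t_x):
    """g_i = w_i * [1 + (x_i - x_bar)(t_x - t_x_hat) / sum_j w_j (x_j - x_bar)^2]."""
    n = len(xs)
    w = N / n                      # equal SRS weights
    x_bar = sum(xs) / n
    t_x_hat = w * sum(xs)
    denom = sum(w * (x - x_bar) ** 2 for x in xs)
    return [w * (1 + (x - x_bar) * (t_x - t_x_hat) / denom) for x in xs]


# Hypothetical SRS of 8 units from a population of N = 100 with t_x = 1130
xs = [10, 12, 7, 13, 13, 6, 17, 16]
N, t_x = 100, 1130

g = regression_g_weights(xs, N, t_x)
print("sum g_i x_i =", sum(gi * xi for gi, xi in zip(g, xs)), "(t_x =", t_x, ")")
print("sum g_i     =", sum(g), "(N =", N, ")")
```

In this equal-weight setting the g-weights also sum to $N$, because $\sum_{i \in S} w_i(x_i - \bar{x}) = 0$; the ratio-adjusted weights do not have this second property.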

Comparison

• Both ratio and regression estimation provide a way of using an auxiliary variable that is highly correlated with the variable of interest

• The ratio and regression estimators discussed in this Chapter are special cases of a generalized regression estimator

• Ratio estimation is especially useful in cluster sampling

• For an SRS of size n, the estimators are given in the following table

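             Estimator for Mean                               Estimator for Total                                  $e_i$

SRS          $\bar{y}$                                        $\hat{t}_y$                                          $y_i - \bar{y}$

Ratio        $\hat{B}\bar{x}_U$                               $\hat{B}t_x$                                         $y_i - \hat{B}x_i$

Regression   $\hat{B}_0 + \hat{B}_1\bar{x}_U$                 $N(\hat{B}_0 + \hat{B}_1\bar{x}_U)$                  $y_i - \hat{B}_0 - \hat{B}_1x_i$
             $= \bar{y} + \hat{B}_1(\bar{x}_U - \bar{x})$     $= N[\bar{y} + \hat{B}_1(\bar{x}_U - \bar{x})]$

Estimated variance for $\hat{\bar{y}}_U$: $\left(1 - \dfrac{n}{N}\right)\dfrac{s_e^2}{n}$

Estimated variance for $\hat{t}$: $N^2\left(1 - \dfrac{n}{N}\right)\dfrac{s_e^2}{n}$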

Estimation in Domains

• Domain: subpopulation

• Want separate estimates for subpopulations

Example: Want to estimate average income for women in an SRS.

— The number of women in the sample is a random variable

— We do not know which persons in the population belong to which domain until they are sampled. Thus, the number of persons in an SRS who fall into each domain is a random variable, with value unknown at the time the survey is designed.

• Estimating domain means is a special case of ratio estimation.

Suppose there are $D$ domains

• Ud: the index set of the units in the population that are in domain d

• Sd: the index set of the units in the sample that are in domain d, for d = 1, 2, ..., D

• Nd: the number of population units in Ud

• $n_d$: the number of sample units in $S_d$

Suppose we want to estimate the mean salary for the domain of women,

$$\bar{y}_{U_d} = \frac{\sum_{i \in U_d} y_i}{N_d} = \frac{\text{total salary for all women in population}}{\text{number of women in population}}$$

A natural estimator of $\bar{y}_{U_d}$ is

$$\bar{y}_d = \frac{\sum_{i \in S_d} y_i}{n_d} = \frac{\text{total salary for women in sample}}{\text{number of women in sample}}$$

where $n_d$ is a random variable

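Let $y_i$ be the income for person $i$,

$$x_i = \begin{cases} 1 & \text{if person } i \text{ is a woman} \\ 0 & \text{otherwise} \end{cases} \qquad u_i = x_iy_i = \begin{cases} y_i & \text{if person } i \text{ is a woman} \\ 0 & \text{otherwise} \end{cases}$$

• $t_x = \sum_{i=1}^{N} x_i = N_d$: total # of women in population, $\bar{x}_U = N_d/N$

• $t_u = \sum_{i=1}^{N} u_i$: total income for women in population

• $\bar{y}_{U_d} = t_u/t_x = B$: average income for women in population

• $\bar{y}_d = \bar{u}/\bar{x} = \hat{B}$: average income for women in the sample, where $\bar{u} = \sum_{i \in S} x_iy_i/n$ and $\bar{x} = \sum_{i \in S} x_i/n = n_d/n$

$$
\begin{aligned}
\hat{V}(\hat{B}) &= \frac{1}{\bar{x}^2}\left(1 - \frac{n}{N}\right)\frac{s_e^2}{n} \\
&= \frac{1}{\bar{x}^2}\left(1 - \frac{n}{N}\right)\frac{1}{n} \cdot \frac{1}{n-1}\sum_{i \in S}(u_i - \hat{B}x_i)^2 \\
&= \frac{1}{\bar{x}^2}\left(1 - \frac{n}{N}\right)\frac{1}{n} \cdot \frac{1}{n-1}\sum_{i \in S}(y_ix_i - \hat{B}x_i)^2 \\
&= \frac{1}{\bar{x}^2}\left(1 - \frac{n}{N}\right)\frac{1}{n} \cdot \frac{1}{n-1}\sum_{i \in S_d}(y_i - \bar{y}_d)^2 \\
&= \left(1 - \frac{n}{N}\right)\frac{n}{n_d^2} \cdot \frac{n_d - 1}{n - 1}\,s_{yd}^2 \qquad (\bar{x} = n_d/n) \\
&\approx \left(1 - \frac{n}{N}\right)\frac{s_{yd}^2}{n_d}
\end{aligned}
$$

$$SE(\bar{y}_d) \approx \sqrt{\left(1 - \frac{n}{N}\right)\frac{s_{yd}^2}{n_d}}$$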

General case of domain estimation: Estimates in different subpopulations

• The mean for a subpopulation is a ratio

• The sample size of a domain is a random variable

• $\hat{B}$ = (sum of $y_i$'s in domain) / (total # of observations in domain) = $\bar{y}_d$

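Totals in domains

• If $N_d$ is known, $\hat{t}_{yd} = N_d\bar{y}_d$

• If $N_d$ is unknown, $\hat{N}_d = N \cdot \dfrac{n_d}{n}$ and

$$\hat{t}_{yd} = N \cdot \frac{n_d}{n} \cdot \frac{\sum_{i \in S} u_i}{n_d} = N\bar{u}, \qquad SE(\hat{t}_{yd}) = N\sqrt{\left(1 - \frac{n}{N}\right)\frac{s_u^2}{n}}$$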

Example: In the SRS of size 300 from the Census of Agriculture, 39 counties are in western states. What is the estimated total number of acres devoted to farming in the West?

• Sample mean of the 39 counties: $\bar{y}_d = 598{,}680.6$

• Sample standard deviation of the 39 counties: $s_{yd} = 516{,}157.7$

• $SE(\bar{y}_d) = \sqrt{\left(1 - \dfrac{300}{3078}\right) \cdot \dfrac{300}{39} \cdot \dfrac{38}{299}} \times \dfrac{516{,}157.7}{\sqrt{39}} = 77{,}637$

• An approximate 95% confidence interval for the mean farm acreage for counties in the western United States is $[445{,}897,\ 751{,}463]$

• $\widehat{CV}(\bar{y}_d) = \dfrac{77{,}637}{598{,}681} = 0.1297$

For estimating the total number of acres devoted to farming in the West, suppose we do not know how many counties in the population are in the western United States. Define

$$x_i = \begin{cases} 1 & \text{if county } i \text{ is in western U.S.} \\ 0 & \text{otherwise} \end{cases}$$

Define $u_i = y_ix_i$; then

$$\hat{t}_{yd} = \hat{t}_u = \frac{3078}{300}\sum_{i \in S} u_i = 239{,}556{,}051$$

The standard error is

$$SE(\hat{t}_{yd}) = 3078\sqrt{1 - \frac{300}{3078}} \times \frac{273{,}005.4}{\sqrt{300}} = 46{,}090{,}460$$
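The domain standard errors for the western counties can be reproduced from the quoted summary statistics. The sketch below applies the two formulas from the notes; the function names are illustrative, and the domain total is recovered as $N\bar{u}$ using $\bar{u} = (n_d/n)\,\bar{y}_d$.

```python
import math

N, n, n_d = 3078, 300, 39
y_bar_d = 598680.6   # sample mean acreage of the 39 western counties
s_yd = 516157.7      # sample SD of acreage within the domain
s_u = 273005.4       # sample SD of u_i = y_i * x_i over all 300 counties


def domain_mean_se(N, n, n_d, s_yd):
    """SE of a domain mean, keeping the (n/n_d)((n_d - 1)/(n - 1)) factor."""
    fpc = 1 - n / N
    return math.sqrt(fpc * (n / n_d) * ((n_d - 1) / (n - 1))) * s_yd / math.sqrt(n_d)


def domain_total_se(N, n, s_u):
    """SE of the domain total N * u_bar when N_d is unknown."""
    return N * math.sqrt(1 - n / N) * s_u / math.sqrt(n)


se_mean = domain_mean_se(N, n, n_d, s_yd)
cv_mean = se_mean / y_bar_d
t_hat = N * (n_d / n) * y_bar_d   # N * u_bar, since u_bar = (n_d / n) * y_bar_d
se_total = domain_total_se(N, n, s_u)

print(f"SE(mean)  = {se_mean:,.0f}, CV = {cv_mean:.4f}")
print(f"t_hat     = {t_hat:,.0f}")
print(f"SE(total) = {se_total:,.0f}")
```

The total comes out a few acres away from the quoted 239,556,051 because $\bar{y}_d$ is rounded to one decimal place here.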

Poststratification: stratification after selection of the sample

• stratiﬁcation is a design

• poststratiﬁcation is an analysis method

Example: Want to stratify a public opinion survey according to gender of respondents. But if the poll is conducted by sampling telephone numbers, respondents cannot be put into the male or female stratum before they are contacted.

Example: Want to estimate the average amount spent on food in a month. One desirable stratification variable might be household size, since large households might be expected to have higher food bills than smaller households. From U.S. census data, the distribution of household size in the region is known:

# of persons in household     Percentage of households
1                             25.75
2                             31.17
3                             17.50
4                             15.58
5                             10.00

Poststratification:

• Take an SRS, and record the amount spent on food as well as the household size for each household in your sample

• If $n$ is large, the sample is likely to resemble a stratified sample with proportional allocation: we would expect about 26% of the sample to be one-person households, about 31% to be two-person households, and so on

• Treating the different household-size groups as different domains, we can use ratio estimation to estimate the average amount spent on groceries for each domain

Let $n_1, n_2, \ldots, n_H$ be the numbers of units sampled in the various household-size groups (domains), where each $n_h$ is random, and let $\bar{y}_1, \ldots, \bar{y}_H$ be the sample means for the groups

Let $x_{ih} = 1$ if observation $i$ is in poststratum $h$ and 0 otherwise

Let $u_{ih} = y_ix_{ih}$

$$t_{xh} = \sum_{i=1}^{N} x_{ih} = N_h, \qquad t_{uh} = \sum_{i=1}^{N} u_{ih} = \text{population total of variable } y \text{ in poststratum } h$$

For each poststratum $h$, estimate the total in the poststratum by

$$\hat{t}_{uh} = \sum_{i \in S} \frac{N}{n}\,u_{ih} \quad \text{and} \quad \hat{N}_h = \sum_{i \in S} \frac{N}{n}\,x_{ih}$$

$$\hat{t}_{uhr} = \frac{t_{xh}}{\hat{t}_{xh}} \cdot \hat{t}_{uh} = \frac{N_h}{\hat{N}_h} \cdot \hat{t}_{uh} = N_h \cdot \bar{y}_h$$

The poststratified estimator of the population total is

$$\hat{t}_{ypost} = \sum_{h=1}^{H} \hat{t}_{uhr} = \sum_{h=1}^{H} \frac{N_h}{\hat{N}_h} \cdot \hat{t}_{uh} = \sum_{h=1}^{H} N_h\bar{y}_h$$

Ratio estimation is used within each poststratum to estimate the population total in that poststratum.

The poststratified estimator of $\bar{y}_U$ is

$$\bar{y}_{post} = \sum_{h=1}^{H} \frac{N_h}{N} \cdot \bar{y}_h$$

where $N_h/N$ is known, $n_h \geq 30$, and $n$ is large. Under approximately proportional allocation,

$$\hat{V}(\bar{y}_{post}) \approx \left(1 - \frac{n}{N}\right)\sum_{h=1}^{H} \frac{N_h}{N} \cdot \frac{s_h^2}{n},$$

when the expected sample sizes in each poststratum are large.

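Ratio Estimation with Stratified Samples

Combined ratio estimator

• First the strata are combined to estimate $t_x$ and $t_y$

• Then ratio estimation is applied

$$\hat{t}_{yrc} = \hat{B}\,t_x, \quad \text{where} \quad \hat{B} = \frac{\hat{t}_{y,str}}{\hat{t}_{x,str}}$$

$$\hat{t}_{y,str} = \sum_{h=1}^{H} N_h\bar{y}_h = \sum_{h=1}^{H}\sum_{j \in S_h} w_{hj}y_{hj}$$

and

$$\hat{t}_{x,str} = \sum_{h=1}^{H} N_h\bar{x}_h = \sum_{h=1}^{H}\sum_{j \in S_h} w_{hj}x_{hj}$$

with $w_{hj} = N_h/n_h$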

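$$MSE(\hat{t}_{yrc}) \approx V(\hat{t}_{y,str} - B\hat{t}_{x,str}) = V\left[\sum_{h=1}^{H}\sum_{j \in S_h} w_{hj}(y_{hj} - Bx_{hj})\right]$$

$$
\begin{aligned}
\hat{V}(\hat{t}_{yrc}) &= \left(\frac{t_x}{\hat{t}_{x,str}}\right)^2 \hat{V}\left(\sum_{h=1}^{H}\sum_{j \in S_h} w_{hj}e_{hj}\right) \\
&= \left(\frac{t_x}{\hat{t}_{x,str}}\right)^2 \hat{V}(\hat{t}_{e,str}) \\
&= \left(\frac{t_x}{\hat{t}_{x,str}}\right)^2 \left[\hat{V}(\hat{t}_{y,str}) + \hat{B}^2\hat{V}(\hat{t}_{x,str}) - 2\hat{B}\,\widehat{\mathrm{Cov}}(\hat{t}_{y,str}, \hat{t}_{x,str})\right]
\end{aligned}
$$

where $e_{hj} = y_{hj} - \hat{B}x_{hj}$.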

Separate ratio estimator

• Ratio estimation is applied separately in each stratum

• Then the strata are combined

$$\hat{t}_{yrs} = \sum_{h=1}^{H} \hat{t}_{yhr} = \sum_{h=1}^{H} t_{xh} \cdot \frac{\hat{t}_{yh}}{\hat{t}_{xh}}$$

with

$$\hat{V}(\hat{t}_{yrs}) = \sum_{h=1}^{H} \hat{V}(\hat{t}_{yhr})$$

Comments:

• The separate ratio estimator can improve efficiency if the ratios $\hat{t}_{yh}/\hat{t}_{xh}$ vary from stratum to stratum

• The separate ratio estimator should not be used when strata sample sizes are small, because each ratio is biased and the bias can propagate through the strata

• Poststratification is a special case of the separate ratio estimator

• The combined estimator has less bias when the sample sizes in some of the strata are small

— When the ratios vary greatly from stratum to stratum, however, the combined estimator does not take advantage of the extra efficiency afforded by stratification as does the separate ratio estimator
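The difference between the combined and separate ratio estimators is easiest to see side by side. The sketch below implements both for a stratified sample stored as a list of dictionaries. The two-stratum data set is hypothetical, and the stratum totals `t_xh` are assumed known, as the separate estimator requires.

```python
def mean(values):
    return sum(values) / len(values)


def combined_ratio_total(strata, t_x):
    """Combine strata first, then apply one ratio: (t_y_str / t_x_str) * t_x."""
    t_y_str = sum(s["N_h"] * mean(s["y"]) for s in strata)
    t_x_str = sum(s["N_h"] * mean(s["x"]) for s in strata)
    return t_y_str / t_x_str * t_x


def separate_ratio_total(strata):
    """Apply a ratio within each stratum, then add: sum_h t_xh * t_yh_hat / t_xh_hat."""
    total = 0.0
    for s in strata:
        t_yh_hat = s["N_h"] * mean(s["y"])
        t_xh_hat = s["N_h"] * mean(s["x"])
        total += s["t_xh"] * t_yh_hat / t_xh_hat
    return total


# Hypothetical two-stratum sample; t_xh are assumed known stratum totals of x
strata = [
    {"N_h": 50, "t_xh": 110.0, "x": [1.0, 2.0, 3.0], "y": [3.0, 5.0, 7.0]},
    {"N_h": 30, "t_xh": 350.0, "x": [10.0, 12.0, 14.0], "y": [12.0, 13.0, 17.0]},
]
t_x = sum(s["t_xh"] for s in strata)

t_yrc = combined_ratio_total(strata, t_x)
t_yrs = separate_ratio_total(strata)
print("combined:", t_yrc, " separate:", t_yrs)
```

Here the within-stratum ratios differ (2.5 versus about 1.17), so the two estimators disagree; when the ratios are the same in every stratum, they coincide.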