<<

STAT3014/3914 Applied Stat.- C3-Ratio & reg est. 3 Ratio and regression estimators

3.1 Motivating examples Frequently, we are interested in measuring the ratio of a matched pair of variables. This occurs when the sampling unit comprises a group or cluster of individuals, and our interest is in the population per individual. For example, to estimate average income/adult in the population in a household survey, we record for the ith household (i = 1, ··· , n) the number of adults who live there, xi, and the household income, yi. Then the parameter, average income per adult in the population,

N P Y household income i R = = i=1 N total no. of adults P Xi i=1 can be estimated by the ratio estimator n P yi i=1 y¯ Rb = r = n = . P x¯ xi i=1

Relationship between estimates Ratio Mean Total R −→×X Y −→×N Y

R −→×X Y

SydU STAT3014 (2015) Second semester Dr. J. Chan 34 STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.

3.2 Two characteristics per unit in SRS

Theorem: If Xi and Yi are a pair of numerical characteristics defined on every unit of the population, andy ¯ andx ¯ are the corresponding from a SRS without replacement of size n , then " #  n  1 PN (Y − Y¯ )(X − X¯ )  n  S Cov (¯x, y¯) = 1 − i=1 i i = 1 − xy N n N − 1 N n (1) and Pn (y − y¯)(x − x¯) PN (Y − Y¯ )(X − X¯ ) E i=1 i i = i=1 i i . (2) n − 1 N − 1

Proof. Consider Ui = Xi +Yi and the corresponding sample values are ui = xi + yi. Clearly " #  n  S2  n  1 PN (X − X¯ + Y − Y¯ )2 Var (¯u) = 1 − U = 1 − i=1 i i N n N n N − 1 " #  n  1 PN (X − X¯)2 + PN (Y − Y¯ )2 + 2 PN (X − X¯)(Y − Y¯ ) = 1 − i=1 i i=1 i i=1 i i N n N − 1 " # 2 PN (X − X¯)(Y − Y¯ )  n  = Var (¯x) + Var (¯y) + i=1 i i 1 − . n N − 1 N

Since Var (u¯) = Var (¯x +y ¯) = Var (¯x) + Var (¯y) + 2Cov(¯x, y¯), (1) is proved. (2) can be proved in a similar way.

Theorem: For large sample, (a) E(r) − R ≈ 0, approximately unbiased, " # 1  n  1 PN (Y − RX )2 1  n  S2 (b) Var(r) ≈ 1 − i=1 i i = 1 − r . X¯ 2 N n N − 1 X¯ 2 N n

SydU STAT3014 (2015) Second semester Dr. J. Chan 35 STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est. Proof: (a) Recall E(y¯) = Y¯ , E(x¯) = X¯ and Var(x¯) = O(n−1) (order of n−1). Thus for large sample, y¯ E(y¯) E(r) = E ≈ = R. x¯ X¯ (b) Note that y¯ y¯ − Rx¯ r − R = − R ≈ . x¯ X¯ Thus, for large sample, 1 E(d¯2) Var(d¯) Var(r) = E[(r − R)2] ≈ E[(y¯ − Rx¯)2] = = X¯ 2 X¯ 2 X¯ 2 ¯ where d = y¯−Rx¯ is the sample mean of di = yi−Rxi, i = 1, ··· , n, drawn from the population of Di = Yi − RXi, i = 1, ··· ,N with Y¯ E(d¯) = E(y¯ − Rx¯) = E(¯y) − RE(¯x) = Y¯ − RX¯ = Y¯ − X¯ = 0. X¯

For a SRS of di,  n  S2 Var(d¯) = 1 − r N n where N N 1 X 1 X S2 = (D − D¯ )2 = (Y − RX )2. r N − 1 i N − 1 i i i=1 i=1 Hence " N # 1  n  1 1 X 1  n  S2 Var(r) ≈ 1 − (Y − RX )2 = 1 − r , X¯ 2 N n N − 1 i i X¯ 2 N n i=1

" n # 1  n  1 1 X 1  n  s2 var(r) ≈ 1 − (y − rx )2 = 1 − r X¯ 2 N n n − 1 i i X¯ 2 N n i=1

SydU STAT3014 (2015) Second semester Dr. J. Chan 36 STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est. 2 1. Ordinary: x not related to y ¯ ¯ n  sy Yb = y¯ & var(Yb) = 1 − N n y 6 yi s Solid line: yi − y¯ © y¯ s P (y −y¯)2 s 2 i i sy = n−1 s

s -x

2 2. Ratio: x positively related to y ¯ X ¯ n  sr Yb r = y¯ x¯ & var(Yb r) = 1 − N n y 6 yi  Solid line: zi = yi − rxi rxs i

rX ¯ 2 s Y P z s 2 i i 2 6 sr = n−1 < sy sy = rx 2 2 2 = sy − 2rρsˆ xsy + r sx s(a = 0,b = r) - x X

2 Calculation of sr: n 1 X s2 = (y − rx )2 r n − 1 i i i=1 n 1 X y¯ = [(y − y¯) − r(x − x¯)]2 since y¯ − rx¯ = y¯ − x¯ = 0 n − 1 i i x¯ i=1 " n n n # 1 X X X = (y − y¯)2 − 2r (x − x¯)(y − y¯) + r2 (x − x¯)2 n − 1 i i i i i=1 i=1 i=1 2 2 2 2 2 2 = sy − 2r sxy + r sx = sy − 2r ρsˆ xsy + r sx

n n n ! 1 X X X or s2 = y2 − 2r x y + r2 x2 . r n − 1 i i i i i=1 i=1 i=1

SydU STAT3014 (2015) Second semester Dr. J. Chan 37 STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est. Remark:

2 2 1. If Xi and Yi are positively related, we have sr  sy. Hence Xi can be used as an auxiliary variable which provides additional information and hence improves the precision of the estimate Y¯ . 2. When X is replaced by x if it is unknown, ordinary estimator results. 3. When ratio estimation is used, estimates of and sample size are quite sensitive to data points that do not fit the ideal pattern called influential observation. It is important to plot the data and look for these unusual data points before proceeding with an analysis. y 4. The ‘ratio of means’ Rb = x is biased and can be almost unbiased if n is large. Another ratio estimator is the ‘mean of ratios’ n N R∗ = r∗ = 1 P yi where r∗ = yi is unbiased for R∗ = 1 P yi . b n xi i xi N xi i=1 i=1 However Rb∗ gives equal weight to each cluster which may vary greatly in size. Unlike Rb∗, Rb is weighed by the cluster size which is an advantage over Rb∗.

SydU STAT3014 (2015) Second semester Dr. J. Chan 38 STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.

3.3 Ratio estimate for population mean and total The ratio estimator of the population total Y is y¯ Ybr = X = rX x¯ Similarly, the ratio estimator of population mean is y¯ Yb¯ = X¯ = rX¯ r x¯

These ratio estimates use extra information of xi, i = 1, ··· , n and the true total and mean X or X¯ , thus improving the precision of ratio estimates over the ordinary estimates Yb = Ny¯ and Yb¯ =y ¯ respectively. From the previous result,

(a) E(Yb¯ r) = XE¯ (r) ≈ XR¯ = Y¯ . Similarly E(Ybr) = XE(r) ≈ XR = Y . 2 2 ¯  n  Sr 2  n  Sr (b) Since Var(Yb r) ≈ 1 − and Var(Ybr) ≈ N 1 − , N n N n 2 2 ¯  n  sr 2  n  sr var(Yb r) = 1 − and var(Ybr) = N 1 − . N n N n

¯ The estimator r for R is generally biased , so Ybr and Yb r are also biased for Y and Y¯ respectively. Bias: y¯  Cov(r, x¯) = E(rx¯) − E(r)E(¯x) = E x¯ − E(r)E(x¯) x¯ so E(¯y) Cov(r, x¯) ρ σ σ E(r) = − = R − r,x¯ r x¯ . E(¯x) E(¯x) X¯

SydU STAT3014 (2015) Second semester Dr. J. Chan 39 STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est. Therefore for any ratio estimates, |bias r| |R − E(r)| ρ σ σ = = r,x¯ x¯ ≤ x¯ = cv(¯x) (3) σr σr X¯ X¯ since |ρr,x¯| ≤ 1. Thus if the CV(x¯) is small, the bias of Rb = r is small relative to SE(r) of Rb. But if n is small, the bias can be large.

Efficiency: The ratio estimator is more efficient than the ordinary estimator, that is var(Yb) > var(Yb r), if cv(x) ρˆ > (4) 2cv(y) where cv(y) is the sample cv for Y defined as s cv(y) = y . y Then  n  1 var(Yb) − var(Yb ) > 0 ⇒ 1 − [s2 − s2] > 0 r N n y r 2 2 2 2 ⇒ [sy − (sy − 2rρsˆ xsy + r sx)] > 0 ⇒ rsx(2ˆρsy − rsx) > 0

⇒ 2ρsˆ y − rsx > 0 since r > 0 & sx > 0 y s cv(x) y ⇒ ρˆ > x = since r = x 2sy 2cv(y) x cv(x) and the equality holds when ρˆ = . 2cv(y)

SydU STAT3014 (2015) Second semester Dr. J. Chan 40 STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est. Example: (7-11) The manager of 7-11 is interested in estimating the total sale in thousands for all of its 300 branches. From last year record, the total sale in thousands for all the 300 branches is 21300. Careful check of this year records are obtained for a SRS of 15 branches with the following results: Branch Last year sale x This year sale y Branch Last year sale x This year sale y 1 50 56 9 100 165 2 35 48 10 250 409 3 12 22 11 50 73 4 10 14 12 50 70 5 15 18 13 150 95 6 30 26 14 100 55 7 9 11 15 40 83 8 25 30

n n n n n X X 2 X X 2 X xi = 926, xi = 117400, yi = 1175, yi = 231815, xiyi = 155753 i=1 i=1 i=1 i=1 i=1 2 sy = 9983.81 The ordinary estimate of the total sale this year in thousands is 1175 Yb = Ny = 300 = 23500 15 with

r 2 s  n sy 15 9983.81 se(Yb) = N (1 − ) = 300 1 − = 7543.72. N n 300 15 The ratio estimate and its se for the total sale this year in thousands are 1175 Ybr = Xr = 21300 = 27027.54 926

SydU STAT3014 (2015) Second semester Dr. J. Chan 41 STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.

v n n n ! u  u n 1 1 X 2 X 2 X 2 se(Ybr) = Nt 1 − y − 2r xiyi + r x N n n − 1 i i i=1 i=1 i=1 s  15  1  1175 1175  = 300 1 − 231815 − 2 · · 155753 + ( )2 · 117400 300 15 × 14 926 926 = 3226.66 which is much smaller than se(Yb) = 7543.72 thousands. Read Tutorial 11 Q2a,b, Q3a,b.

3.4 Regression estimator

y y Since Yb r = XRb = X x, the line y = mx with slope m = x passes through the origin (0, 0) and (X, Yb r). However, the linear relationship between X and Y may not pass through the origin. A more general estimator, the regression estimator fits a regression line: y = A + Bx = y − Bx + Bx = y + B(x − x) (5) to the sample data where the least square estimate of B is SS PN (y − Y )(x − X) PN x y − NXY S ρS B = xy = i=1 i i = i=1 i i = xy = y . N N 2 2 SSxx P 2 P 2 S Sx i=1(xi − X) i=1 xi − NX x and A = y − Bx.

2 Note: Cov(X,Y ) = Sxy = SSxy/(N − 1), Var(X) = Sx = SSxx/(N − 1), 2 cov(X,Y ) = sxy = ssxy/(n − 1) and var(X) = sx = ssxx/(n − 1). Then the regression estimator of the population mean Y is to substitute x = X to (5) to obtain

Yb reg = y + b(X − x)

SydU STAT3014 (2015) Second semester Dr. J. Chan 42 STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est. where Pn Pn ssxy i=1(yi − y)(xi − x) i=1 xiyi − nxy sxy b = = Pn 2 = Pn 2 2 = 2 . (6) ssxx i=1(xi − x) i=1 xi − nx sx Since 0 Yb reg = y + b(X − x) ' y + B(X − x) = z 0 the sample mean of the variable zi = yi + B(X − xi), we have

E(Yb reg) ' E[y+B(X −x)]=E(y)+B[X −E(x)]=Y Approx. unbiased and

0 Var(Yb reg) ' Var(z¯ ) = Var[y + B(X − x)] = Var(y − Bx) = Var(¯y) + B2 Var(¯x) − 2B Cov(¯y, x¯) 2 2 2  n  Sy 2 Sy  n  Sx Sy  n  ρSxSy = 1 − + ρ 2 1 − − 2ρ 1 − N n Sx N n Sx N n  n  S2 = 1 − y 1 − ρ2 . N n Hence  n  s2  n  s2(1 − ρˆ2) var(Yb ) = 1 − reg = 1 − y reg N n N n 2 0 where sreg is the sample variance of zi = yi + b(X − xi). The regression estimator for the population total Y is

Ybreg = N[y + b(X − x)] and its variance estimate is

2 2 2 2  n  sreg 2  n  sy(1 − ρˆ ) var(Ybreg) = N 1 − = N 1 − N n N n

SydU STAT3014 (2015) Second semester Dr. J. Chan 43 STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est. Bias:

Bias in Yb¯ reg = E(Yb¯ reg) − Y¯ = E(¯y) + E[b(X¯ − x¯)] − Y¯ = E[b(X¯ − x¯)] = −Cov(b, x¯).

Efficiency: 1. The regression estimator is at least as efficient as the ordinary

estimator, that is var(Yb) ≥ var(Yb reg) since  n  1 var(Yb) − var(Yb ) = 1 − [s2 − s2 ] reg N n y reg  n  1 = 1 − s2ρˆ2 ≥ 0 N n y where the equality holds when ρˆ = 0, i.e. there is no association between Y and X. 2. The regression estimator is more efficient than the ratio estimator,

that is var(Yb r) ≥ var(Yb reg) unless y b = r = x in which case they are equivalent and the regression of y on x is linear through the origin and the variance of y is proportional to x.

 n  1 var(Yb ) − var(Yb ) = 1 − [s2 − s2 ] r reg N n r reg  n  1 = 1 − [s2 − 2rρsˆ s + r2s2 − s2(1 − ρˆ2)] N n y x y x y  n  1 = 1 − (r2s2 − 2rρsˆ s + s2ρˆ2) N n x x y y  n  1 = 1 − (rs − ρsˆ )2 N n x y

SydU STAT3014 (2015) Second semester Dr. J. Chan 44 STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.  n  1 = 1 − (rs − bs )2 N n x x  n  s2 = 1 − x (r − b)2 > 0 ⇒ (r − b)2 > 0 N n sxy sxy sxy where ρsˆ y = sy = = 2 sx = bsx. sxsy sx sx

3. Since Yb reg = y + b(X − x), the regression estimator adjusts the y up or down by an amount b(X − x).

(a) When the slope b = 0, the regression estimator Yb reg = y becomes the ordinary estimator Yb. y (b) When the y-intercept a = y − bx = 0 ⇔ b = x = r, the slope b becomes the ratio estimate r and the regression estimator y y y Yb = y + (X − x) = y + X − y = X = Xr = Yb reg x x x r

becomes the ratio estimator Yb r.

Example: (7-11) Estimate the total sale using the regression estimator. Solution: The regression estimate of the total sale this year in thou- sands is n X 926 1175 ss = x y − nxy = 155753 − 15 × × = 83216.33, xy i i 15 15 i=1 n 2 X 926 ss = x2 − nx2 = 117400 − 15 × = 60234.93, xx i 15 i=1 n 2 X 1175 ss = y2 − ny2 = 231815 − 15 × = 139773.33. yy i 15 i=1

SydU STAT3014 (2015) Second semester Dr. J. Chan 45 STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est. We have ss 83216.33 b = xy = = 1.3815 ssxx 60234.93 and ssxy 83216.33 ρˆ = √ = √ = 0.9069. ssxxssyy 60234.93 × 139773.33 It follows that

Ybreg = N[y + b(X − x)] 1175 21300 926 = 300 + 1.3815 − = 27340.65 15 300 15 as compared with Yb = 23500 and Ybr = 27027.54. The s.e. estimate is

r 2 2  n  sy(1 − ρˆ ) se(Ybreg) = N 1 − N n s  15  9983.81(1 − 0.90692) = 300 1 − = 3178.52 300 15 which is < se(Ybr) = 3226.66 << se(Yb) = 7543.72. This shows that the dropping of zero y-intercept assumption improves the estimate slightly. Note that the y-intercept estimate is 1175 926 a = y − bx = − 1.3815 × ≈ −6.9531 15 15 which is quite close to zero. Read Tutorial 11 Q2c,d, & 3c,d.

SydU STAT3014 (2015) Second semester Dr. J. Chan 46 STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.

3.5 The Hartley - Ross Estimator Since the ratio estimator r for R is biased, the following leads to an unbiased estimator of R. Theorem: Let Z = f(X,Y ) be a fixed function of two variables. Define Zi = f(Xi,Yi) and zi = f(xi, yi). Then  N − 1 Pn z (x − x¯) PN Z X E z¯ + i=1 i i = i=1 i i . (7) NX¯ n − 1 NX¯

Proof: The LHS is N − 1 Pn z (x − x¯) E(z¯) + E i=1 i i NX¯ n − 1 N N N N X X X X Zi(Xi − X¯ ) ZiXi − X¯ Zi ZiXi N − 1 = Z¯ + i=1 = Z¯ + i=1 i=1 = i=1 . NX¯ N − 1 NX¯ NX¯

For the problem of estimation of R from sample (xi, yi), i = 1, ··· , n, we assume Xi > 0, i = 1, ··· ,N and define the function ∗ zi = f(xi, yi) = yi/xi = ri , i = 1, ··· , n and Zi = Yi/Xi, i = 1, 2, ··· ,N, so from (7) N X Yi Xi  N − 1 n(¯y − x¯r¯∗) Xi Y¯ E r¯∗ + = i=1 = = R NX¯ n − 1 NX¯ X¯ since n n n n X X yi X X yi z (x − x¯) = (x − x¯) = y − x¯ = n(y¯ − x¯r¯∗). i i x i i x i=1 i=1 i i=1 i=1 i Thus the Hartley-Ross estimator as an unbiased estimator of R is

SydU STAT3014 (2015) Second semester Dr. J. Chan 47 STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.

N − 1 n(y¯ − r¯∗x¯) Rˆ = r¯∗ + hr NX¯ n − 1 for which we need to know X¯ (or X = NX¯ ). This estimator contains a mean of ratio estimate and an adjustment for unbiasness. The Hartley-Ross estimators for mean and total are N − 1 n(y¯ − r¯∗x¯) for the population mean: Yb¯ = X¯ r¯∗ + and hr N n − 1 ∗ ∗ n(y¯ − r¯ x¯) for the population total: Ybhr = Xr¯ + (N − 1) . n − 1

Remarks: n y¯ 1 X yi 1. So far, we have Rb = biased for R, Rb∗ = biased for R & x¯ n x i=1 i N ∗ 1 X yi unbiased for R = and Rbhr unbiased for R. Finally, could N x i=1 i we just use ¯ Rbo = y/¯ X ?

 y¯  E(y¯) Y¯ This is the ordinary estimator E = = = R which does X¯ X¯ X¯ not use the information from the sample {xi} but is unbiased for R.

2. For small samples we might expect the Hartley-Ross estimator to be better. There is no general result on the comparison of the y¯ y¯ N − 1 n(y¯ − r¯∗x¯) of r = , r = , and r =r ¯∗ + for all x¯ o X¯ hr NX¯ n − 1 sample sizes. See Cochran (2nd Ed) Theorem 6.3 §6.15.

SydU STAT3014 (2015) Second semester Dr. J. Chan 48 STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.

Summary of estimators and variance estimates based on 1 SRS

Ord. Ratio Regression Hartley-Ross

y¯ y¯ N − 1 n(¯y − r¯∗x¯) Ratio R - r¯∗ + X¯ x¯ NX¯ n − 1 1 n s2 1 n s2 (1 − ) y (1 − ) r - - X¯ 2 N n X¯ 2 N n

∗ ¯ y¯ ¯ sxy ¯ ∗ N − 1 n(¯y − r¯ x¯) Mean Y y¯ X y + 2 (X − x) Xr¯ + x¯ sx N n − 1 n s2 n s2 n s2(1 − ρˆ2) (1 − ) y (1 − ) r (1 − ) y - N n N n N n

var(Yb¯ r) < var(Yb¯ ) var(Yb¯ reg) < var(Yb¯ r) ys¯ y¯ ifρ ˆ > x equal if b = r = 2¯xsy x¯

∗ y¯ sxy ∗ n(¯y − r¯ x¯) Total Y Ny¯ X Ny + 2 (X − Nx) Xr¯ + (N − 1) x¯ sx n − 1 n s2 n s2 n s2(1 − ρˆ2) N 2(1 − ) y N 2(1 − ) r N 2(1 − ) y - N n N n N n

n 1 X s2 = ( y2 − ny¯2), y n − 1 i i=1 n n n 1 X X X y¯ s2 = ( y2 − 2r x y + r2 x2) = s2 − 2rρsˆ s + r2s2, r = , r n − 1 i i i i y x y x x¯ i=1 i=1 i=1

SydU STAT3014 (2015) Second semester Dr. J. Chan 49 STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est. Example: (7-11) Solution: The ratios and their summary are given below: 0 0 i xi yi ri = yi/xi i xi yi ri = yi/xi 1 50 56 1.120 9 100 165 1.650 2 35 48 1.371 10 250 409 1.636 3 12 22 1.833 11 50 73 1.460 4 10 14 1.400 12 50 70 1.400 5 15 18 1.200 13 150 95 0.633 6 30 26 0.867 14 100 55 0.550 7 9 11 1.222 15 40 83 2.075 8 25 30 1.200 Total 19.618 n 1 X 19.618 We have r¯∗ = r∗ = = 1.3079, x¯ = 61.7333 and y¯ = n i 15 i=1 78.3333.

The Hartley-Ross estimate of the total sale this year in thousands is ∗ ∗ n(y¯ − r¯ x¯) Ybhr = Xr¯ + (N − 1) n − 1 15[78.333 − 1.3079(61.7333)] = 21300(1.3079) + (300 − 1) 15 − 1 = 27086.9

Read Tutorial 12 Q1(a).

SydU STAT3014 (2015) Second semester Dr. J. Chan 50 STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.

Example: In a survey of family size (x1), weekly income (x2) and weekly expenditure on food (y), we want to estimate the average weekly expen- diture on food per family in the most efficient way. A simple random sample of 27 families yields the following data: X X X x1i = 109, x2i = 16277, yi = 2831, ρˆx1,y = 0.925, ρˆx2,y = 0.573 i i i

The sample matrix for y, x1 and x2 is   547.8234 26.5057 1796.5541        26.5057 1.4986 80.1595  .     1796.5541 80.1595 17967.0541

From the data X¯1 = 3.91 and X¯2 = 542.

(a) Estimate the standard errors of the ratio estimators for Y¯ using x1 and using x2. Compare the standard errors with the s.e. for the simple estimate ignoring the covariates. Which estimator has the smallest estimated s.e.? (b) Calculate the best available estimate of the average weekly expen- diture on food per family and give an approximate 95% confidence interval for this average.

Solution:

(a) The standard errors of the ratio estimators for Y¯ using x1 and using x2 are

SydU STAT3014 (2015) Second semester Dr. J. Chan 51 STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.

P i yi 2831 r1 = P = = 25.97 i x1i 109 s2 = s2 − 2r s + r2 s2 r1 y x1y x1 = 547.8234 − 2(25.97)(26.5057) + 25.972(1.4986) = 181.896 P i yi 2831 r2 = P = = 0.1739 i x2i 16277 s2 = s2 − 2r s + r2 s2 r2 y x2y x2 = 547.8234 − 2(0.1739)(1796.5541) + 0.17392(17967.0541) = 475.7895 r s2 r181.896 se(Yb¯ ) = r1 = = 2.5956 r1 n 27 r s2 r475.7895.193 se(Yb¯ ) = r2 = = 4.1978 r2 n 27 r s2 r547.8234 se(Yb¯ ) = y = = 4.5044 n 27

The first ratio estimator Yb¯ r1 has the lowest s.e. due to the higher cor-

relation ρˆy,x1 = 0.925. The second ratio estimator only has marginal improvement as the correlationρ ˆ = 0.573 is weak but y,xx √ ys¯ x 2831 · 17967.0541 ρx2,y = 0.573 > = √ = 0.4980. 2¯xsy 2 · 16277 · 547.8234 Note that fpc is ignored because the population size N is unknown. (b) The estimate of the average weekly expenditure on food per family

Yb¯ r1 = r1X¯ = 25.97(3.91) = 101.5524 95% CI for Y¯ = 101.5524 ∓ 1.96(2.5956) = (96.4651, 106.6397)

SydU STAT3014 (2015) Second semester Dr. J. Chan 52 STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.

3.6 Ratio estimate for subpopulation in poststratification

For some Cl, we want to estimate: , N , N X X X 0 X 0 0 Rl = Yi Xi = Yi Xi, = R i∈Cl i∈Cl i=1 i=1 if we define 0 0 (Yi ,Xi) = (Yi,Xi) if i ∈ Cl = (0, 0) if i∈ / Cl . 0 0 Note: X = Xl, i.e. the sum of Xi over all population equals to the sum of Xi over Cl. Hence the natural estimator of ratio and its variance estimate is n P P 0 yi yi 0 2 i∈Cj 1  n  s r0 = i=1 = = r and var(r ) ≈ 1 − rl n P l l ¯ 0 2 P 0 xi (X ) N n xi i∈C i=1 l where 0 0 (xi, yi) = (xi, yi) if i ∈ Cl = (0, 0) if i∈ / Cl, n X0 1 X 1 X X¯ 0 = can be estimated byx ¯0 = x0 = x and N n i n i i=1 i∈Cl N 2 1 X S0 = (Y 0 − R0X0)2 rl N − 1 i i i−1 can be estimated by n 2 1 X 1 X s0 = (y0 − r0x0)2 = (y − r x )2. rl n − 1 i i n − 1 i l i i=1 i∈Cl

SydU STAT3014 (2015) Second semester Dr. J. Chan 53 STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.

The ratio estimator of mean in Cl and its variance estimate are   0 2 ¯ ¯ ¯ 1 n srl Yb rl = Xl rl and var(Yb rl) ≈ 2 1 − Wl N n since X¯ 2  n  s0 2 var(Yb¯ ) = X¯ 2var(r ) = l 1 − rl rl l l X¯ 02 N n X 02 N 2  n  s0 2 1  n  s0 2 = 1 − rl = 1 − rl . 2 02 2 Nl X N n Wl N n

Similarly, the ratio estimator of total in Cl and its variance estimate are

0 2 2  n  srl Ybrl = Xl rl and var(Ybrl) ≈ N 1 − N n since 2 0 2 2 0 2 2 Xl  n  srl 02 N  n  srl var(Ybrl) = X var(rl) = 0 1 − = X 0 1 − . l X¯ 2 N n X 2 N n Note that these estimators correspond to method 1 in Section 1.5 for poststratification and nl does not come into any of these calculations. Read Tutorial 12 Q1b,c.

SydU STAT3014 (2015) Second semester Dr. J. Chan 54 STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.

3.7 Ratio Estimation for Stratified SRS

In a stratified SRS, a SRS of a specified sample size nl is taken in each of the L strata with known size Nl, e.g. the 6 states of Australia. There are two types of ratio estimates depending on the order of taking ratio and summing over strata.

1. Take ratios rl =y ¯l/x¯l first and sum over Ybl = Xlrl to obtain Rbs = PL l=1 Ybl/X.

2. Sum over Ybl and Xbl first to obtain Yb and Xb and then take ratio Rbc = Y/b Xb.

3.7.1 The ‘Separate’ Ratio Estimate

L P Suppose the stratum totals Xl, l = 1, ··· ,L are known so X = Xl l=1 is known also. Then

L L L Yb 1 X 1 X 1 X ¯ Rbs = = Ybl = Xl rl = WlXl rl X X X X¯ l=1 l=1 l=1

Xl N Nl Xl 1 y¯l since = = WlX¯l and rl = . Then X X N Nl X¯ x¯l L L L X Xl X Xl Yl 1 X Y E(Rbs) = E(rl) ≈ = Yl = = R, X X Xl X X l=1 l=1 l=1

L   2 Yl 1 X nl s since E(r ) ≈ R = and var(R ) = W 2 1 − srl l l bs 2 l Xl X¯ Nl nl l=1

SydU STAT3014 (2015) Second semester Dr. J. Chan 55 STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est. where " nl nl nl # 2 2 2 2 1 X 2 X 2 X 2 s = s −2rlsx y +r s = y − 2rl xilyil + r x . rl yl l l l xl n − 1 il l il l i=1 i=1 i=1 Similarly the separate ratio estimate for the mean is

L L   2 ¯ X ¯ ¯ X 2 nl ssrl Yb st,s = WlXl rl and var(Yb st,s) = Wl 1 − Nl nl l=1 l=1

Bias: For large stratum sample sizes, rl will be approximately unbiased for Rl and var(rl) will approximate Var(rl) reasonably well. For moderate and small samples, bias is important, and we should con- sider it here. We know that in a single stratum

|bias rl| σx¯l ≤ ¯ = cv(¯xl) σrl Xl

Consider the bias of Rbs :

|bias (Rbs)| = E(Rbs − R) L ! L X Xl X Xl = E (r − R ) = E(r − R ) X l l X l l l=1 l=1 L L X Xl X Xl = |bias rl| ≤ max |bias rl| X l X l=1 l=1   σx¯lσrl ≤ max |bias rl| ≤ max l l X¯l Hence   max σrl   √ max σrl   |bias (Rbs)| l σx¯l l σx¯l ≤ max ¯ ≤ L   max ¯ s.e.R s.e.(R ) l Xl min σr l Xl bs bs l l

SydU STAT3014 (2015) Second semester Dr. J. Chan 56 STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est. since v v u L  2 u L uX Xl uX 2 s.e.(Rbs)= t var(rl) ≥ min σrlt pl X l l=1 l=1 v u L 2 r uX  1  L 1 ≥ min σr t ≥ min σr ≥ √ min σr l l L l l L2 l l l=1 L where Xl = plX. The sum of squares of unequal proportions is higher than that from equal proportions in general. This is due to the convexity property of the function f(p) = p2. For example, when L = 2 with cases 1 1 (1 − p, p) and (2, 2), 1 1 1 (1 − p)2 + p2 − 2( )2 = 2p2 − 2p + = (2p − 1)2 ≥ 0. 2 2 2 √ ¯ Therefore the ratio on the LHS can be L times as large as the σx¯l/Xl bound on individual relative biases. Even if the biases are individually small, the overall bias can be large.

3.7.2 The ‘Combined’ Ratio Estimate It is defined as L P Wly¯l ¯ l=1 y¯st Yb Yb R = = = = bc L P x¯st Xb¯ Xb Wlx¯l l=1

SydU STAT3014 (2015) Second semester Dr. J. Chan 57 STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est. and in contrast to Rbs, it does not require the knowledge of individual Xl’s. Note that   L y¯st y¯st  1 X Y E(Rbc) = E ≈ E ≈ WlE(y¯l) = = R Approx. unbiased x¯st X¯ X¯ X l=1 Theorem: L   PNl ¯ ¯ 2 ! 1 X nl 1 [Yil − Yl − R(Xil − Xl)] Var(R ) ≈ W 2 1 − i=1 . bc 2 l X¯ Nl nl Nl − 1 l=1 Proof: First L y¯st 1 1 X Rbc − R = − R = (y¯st − Rx¯st) = Wl(y¯l − Rx¯l) x¯st x¯st x¯st l=1 L 1 X ¯ 1 ¯ 1 ¯ = Wldl = dst ≈ dst x¯st x¯st X¯ l=1 where dli = yli − Rxli, i = 1, ··· , nl estimates Dli = Yli − RXli and n 1 Xl d¯ = d . Note that typically D¯ =6 0. Hence l n il l l i=1 L   2 1 1 X nl S Var(R ) ≈ Var(d¯ ) ≈ W 2 1 − crl bc 2 st 2 l X¯ X¯ Nl nl l=1 where N N 1 Xl 1 Xl S2 = (D − D¯ )2 = [Y − Y¯ − R(X − X¯ )]2 crl N − 1 li l N − 1 li l li l l i=1 l i=1 = S2 − 2RS + R2S2 , yl xlyl xl and this can be estimated by nl 2 1 X 2 2 2 2 s = [yli − y¯l − rc(xli − x¯l)] = s − 2rcsx y + r s crl n − 1 yl l l c xl l i=1

SydU STAT3014 (2015) Second semester Dr. J. Chan 58 STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est. as compared to

" nl nl nl # 2 1 X 2 X 2 X 2 2 2 2 s = y − 2rl xliyli + r x = s −2rlsx y +r s srl n − 1 li l il yl l l l xl l i=1 i=1 i=1 for separate ratio estimator.

There is less risk of bias in Rˆc than in Rˆs. We can show that   |E(Rbc − R)| σx¯ ≤ max l l ¯ s.e.Rbc Xl in contrast to     |E(Rbs − R)| √ maxl σr σx¯ ≤ L l max l l ¯ s.e.Rbs minl σrl Xl for the separate ratio estimator Rbs. Similarly the combine ratio estimate for the mean and its variance are

L   2 ¯ ¯ ¯ X 2 nl scrl Yb st,c = XRbc and var(Yb st,c) = Wl 1 − . Nl nl l=1 and for the total are L   2 2 X 2 nl scrl Ybst,c = XRbc and var(Ybst,c) = N Wl 1 − . Nl nl l=1 Read Tutorial 12 Q2.

SydU STAT3014 (2015) Second semester Dr. J. Chan 59 STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.

Estimators and variance estimates for stratified SRS (Ch.2) Parameter Estimator Variance nl 2 1 P 2 2 Nl Ordinary/naive estimator s = ( y − nly¯ ), Wl = yl nl−1 li l N i=1 L L   2 1 X 1 X nl syl Ratio R R = W y¯ var(R ) = W 2 1 − bst l l bst 2 l X¯ X¯ Nl nl l=1 l=1 L L   s2 ¯ ¯ P ¯ X 2 nl yl Mean Y Yb st = Wly¯l var(Yb st) = Wl 1 − Nl nl l=1 l=1 L L   s2 P 2 X 2 nl yl Total Y Ybst = N Wly¯l var(Ybst) = N Wl 1 − Nl nl l=1 l=1 2 2 2 2 y¯l Separate ratio estimator ssr,l = syl − 2 rlρsˆ xlsyl + rl sxl, rl = x¯l L L   2 1 X 1 X nl ssr,l Ratio R R = W X¯ r var(R ) = W 2 1 − bst,sr l l l bst,sr 2 l X¯ X¯ Nl nl l=1 l=1 L L   s2 ¯ ¯ P ¯ ¯ X 2 nl sr,l Mean Y Yb st,sr = WlXlrl var(Yb st,sr) = Wl 1 − Nl nl l=1 l=1 L L   s2 P ¯ ¯ 2 X 2 nl sr,l Total Y Ybst,sr = N WlXlrl var(Yst,sr) = N Wl 1 − Nl nl l=1 l=1 PL 2 2 2 2 l=1 Wly¯l Combine ratio estimator scr,l = syl − 2 rcρsˆ xlsyl + rc sxl, rst,cr = PL l=1 Wlx¯l L P Wly¯l L   2 l=1 1 X nl scr,l Ratio R R = = r var(R ) = W 2 1 − bst,cr L st,cr bst,cr 2 l X¯ Nl nl P l=1 Wlx¯l l=1 L   s2 ¯ ¯ ¯ ¯ X 2 nl cr,l Mean Y Yb st,cr = Xrst,cr var(Yb st,cr) = Wl 1 − Nl nl l=1 L   s2 ¯ ¯ 2 X 2 nl cr,l Total Y Ybst,cr = NXrst,cr var(Yst,cr) = N Wl 1 − Nl nl l=1

SydU STAT3014 (2015) Second semester Dr. J. Chan 60