
Understanding Variance Estimator Bias in Stratified Two-Stage Sampling

Khoa Dong¹, Tim Trudell¹, Yang Cheng¹, Eric Slud¹,²
¹U.S. Census Bureau, ²University of Maryland

Joint Statistical Meetings, Vancouver, Canada, July 29, 2018

1 Disclaimer

This presentation is intended to inform interested parties of ongoing research and to encourage discussion of work in progress. Any views expressed on statistical, methodological, technical, or operational issues are those of the authors and not necessarily those of the U.S. Census Bureau.

2 Outline

1. Motivation
2. Overview of Current Population Survey (CPS)
3. Problem description
4. CPS variance estimation
5. Simulation results

3 Motivation

• When estimating the response rate $p$ and its variance $\mathrm{var}(p)$ for households in CPS non-self-representing (NSR) primary sampling units (PSUs), we observed unusually high values of $\mathrm{var}(p)$.
• We wanted to better understand the cause of this result.

4 Current Population Survey

• One of the oldest surveys in the U.S. (in operation since 1942)
• Measures the national unemployment rate
• Monthly survey of ~72,000 households

5 Primary Sampling Unit

• PSU: either a county or a group of contiguous counties
• Two types of PSU:
  • Self-representing (SR)
  • Non-self-representing (NSR)

6 CPS Sample Design

• Two-stage stratified sampling design for NSR PSUs:
  • First stage: select one PSU per stratum with probability proportional to size (civilian noninstitutionalized population 16+ = CNP 16+)
  • Second stage: do systematic sampling within selected PSUs
• Systematic sampling for SR PSUs

7 CPS Sample Design

• Select PSUs once every 10 years
• 852 PSUs selected in the first stage (2010 design): 506 SR and 346 NSR
• Approximately 80% of the CNP 16+ population is in SR PSUs

8 Key Labor Force Estimates

• Noninstitutionalized civilian labor force statistics:
  • Unemployment/employment levels
  • Unemployment rate
  • Labor force participation rate

9 Problem

• Estimate the monthly response rate $p$ and its variance $\mathrm{var}(p)$ for CPS households in NSR PSUs.
• The sample is at the household level: 1 record for each sampled household in each month.
• Response $y_i$ has a binary outcome: 1 = response, 0 = nonresponse.
• Time period: March 2017 – March 2018

10 NSR Household Response Rates, March 2017 – March 2018

[Figure: household response rate in NSR PSUs over the period]

11 Estimated Variance for Response and Nonresponse Rates, March 2017 – March 2018

• We expect to see $\mathrm{var}(p) = \mathrm{var}(1 - p)$, but the two estimates are NOT equal.
• Our chosen variance estimator introduces bias in some way.

12 CPS Variance Estimation

• Due to the CPS sample design, there is no direct variance estimator formula:
  • Select only one PSU per NSR stratum
  • Systematic sampling within PSU
• Currently use the balanced repeated replication (BRR) method for NSR PSUs.

13 CPS Variance Estimation

• BRR variance estimator:

$$\mathrm{var}(\hat{Y}) = \frac{1}{R(1-K)^2} \sum_{r=1}^{R} (\hat{Y}_r - \hat{Y})^2$$

where
  $\hat{Y}_r$ = the $r$-th replicate estimate of $Y$
  $\hat{Y}$ = the full-sample estimate of $Y$
  $R$ = number of replicates
  $K$ = perturbation factor, $0 \le K < 1$

• BRR requires selecting two PSUs per stratum, but CPS selects only one PSU per stratum → collapse strata to make pseudo-strata.
• These pseudo-strata should ideally contain exactly 2 perfectly matched strata.
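The BRR estimator on this slide can be written as a few lines of code. The following is a minimal sketch in plain Python; the function name `brr_variance` and the toy numbers are illustrative assumptions, not CPS production code or data.

```python
def brr_variance(full_estimate, replicate_estimates, k=0.5):
    """Return sum_r (Y_r - Y)^2 / (R * (1 - k)^2), the BRR variance with
    perturbation factor k (k = 0 gives standard half-sample BRR)."""
    if not 0.0 <= k < 1.0:
        raise ValueError("perturbation factor k must satisfy 0 <= k < 1")
    r = len(replicate_estimates)
    if r == 0:
        raise ValueError("need at least one replicate estimate")
    sum_sq = sum((y_r - full_estimate) ** 2 for y_r in replicate_estimates)
    return sum_sq / (r * (1.0 - k) ** 2)


# Made-up numbers, for illustration only (not CPS estimates).
toy_variance = brr_variance(100.0, [101.0, 99.0, 102.0, 98.0], k=0.5)
```

Dividing by $(1-K)^2$ rescales the replicate deviations, which are shrunk by the factor $(1-K)$ when the replicate weights are only partially perturbed.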

14 BRR with Pseudo-Strata

• Suppose we want to estimate a population total $Y$ using $\hat{Y} = \sum_{h=1}^{L} \hat{Y}_h$, where $L$ denotes the number of strata.

• Consider the simple case when $L$ is even: we estimate the variance of $\hat{Y}$ by combining the $L$ strata into $G$ groups of two strata each ($L = 2G$).

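15 BRR with Pseudo-Strata

• Hence,

$$\hat{Y} = \sum_{g=1}^{G} \hat{Y}_g = \sum_{g=1}^{G} (\hat{Y}_{g1} + \hat{Y}_{g2})$$

$$\mathrm{Var}(\hat{Y}) = \sum_{g=1}^{G} \left[ \mathrm{Var}(\hat{Y}_{g1}) + \mathrm{Var}(\hat{Y}_{g2}) \right] = \sum_{g=1}^{G} (\sigma_{g1}^2 + \sigma_{g2}^2)$$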

16 BRR with Pseudo-Strata

• The $r$-th replicate estimate of $Y$:

$$\hat{Y}_r = \sum_{g=1}^{G} \left[ (1 + (1-K)\delta_{gr})\,\hat{Y}_{g1} + (1 - (1-K)\delta_{gr})\,\hat{Y}_{g2} \right]$$

where $\delta_{gr} = 1$ if the first stratum in the $g$-th group is selected and $\delta_{gr} = -1$ if the second stratum in the $g$-th group is selected.

• The $\delta_{gr}$ are chosen from entries of a Hadamard matrix.
• Rows of a Hadamard matrix are mutually orthogonal:

$$\sum_{r=1}^{R} \delta_{gr}\delta_{kr} = 0 \quad (\forall\, g \ne k)$$
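To make the orthogonality property concrete, here is a small sketch that builds a Hadamard matrix with the Sylvester construction and checks that distinct rows have zero inner product. The Sylvester construction is one common choice and is an assumption here; the slides do not say which Hadamard matrix CPS uses. The function names are hypothetical.

```python
def sylvester_hadamard(order):
    """Build a +1/-1 Hadamard matrix whose order is a power of two."""
    if order < 1 or order & (order - 1):
        raise ValueError("order must be a power of two")
    h = [[1]]
    while len(h) < order:
        # Doubling step: H_2n = [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def rows_orthogonal(h):
    """True if every pair of distinct rows has zero inner product."""
    n = len(h)
    return all(
        sum(a * b for a, b in zip(h[g], h[k])) == 0
        for g in range(n)
        for k in range(g + 1, n)
    )


H8 = sylvester_hadamard(8)  # rows play the role of groups g, columns of replicates r
```

In this sketch, row $g$ of the matrix supplies the values $\delta_{g1}, \ldots, \delta_{gR}$ for pseudo-stratum $g$, so the matrix order must be at least the number of groups.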

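17 BRR with Pseudo-Strata

$$\hat{Y}_r - \hat{Y} = \sum_{g=1}^{G} (1-K)\,\delta_{gr}\,(\hat{Y}_{g1} - \hat{Y}_{g2})$$

$$(\hat{Y}_r - \hat{Y})^2 = \sum_{g=1}^{G} (1-K)^2\,\delta_{gr}^2\,(\hat{Y}_{g1} - \hat{Y}_{g2})^2 + \sum_{g=1}^{G} \sum_{k \ne g} (1-K)^2\,\delta_{gr}\delta_{kr}\,(\hat{Y}_{g1} - \hat{Y}_{g2})(\hat{Y}_{k1} - \hat{Y}_{k2})$$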

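18 BRR with Pseudo-Strata

$$\frac{1}{R(1-K)^2} \sum_{r=1}^{R} (\hat{Y}_r - \hat{Y})^2 = \frac{1}{R(1-K)^2} \sum_{r=1}^{R} \sum_{g=1}^{G} (1-K)^2 (\hat{Y}_{g1} - \hat{Y}_{g2})^2 + \frac{1}{R(1-K)^2} \sum_{g=1}^{G} \sum_{k \ne g} (1-K)^2 (\hat{Y}_{g1} - \hat{Y}_{g2})(\hat{Y}_{k1} - \hat{Y}_{k2}) \sum_{r=1}^{R} \delta_{gr}\delta_{kr}$$

• Therefore, since $\delta_{gr}^2 = 1$ and the cross terms vanish by orthogonality,

$$\mathrm{var}(\hat{Y}) = \sum_{g=1}^{G} (\hat{Y}_{g1} - \hat{Y}_{g2})^2 = \sum_{g=1}^{G} (\hat{Y}_{g1}^2 + \hat{Y}_{g2}^2 - 2\hat{Y}_{g1}\hat{Y}_{g2})$$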
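The identity on this slide can be checked numerically. The sketch below computes the BRR variance replicate by replicate and compares it with the closed form $\sum_g (\hat{Y}_{g1} - \hat{Y}_{g2})^2$. The stratum estimates are made-up numbers, the Sylvester Hadamard construction is an assumption, and the function names are hypothetical.

```python
def hadamard(order):
    """Sylvester construction; order must be a power of two."""
    if order < 1 or order & (order - 1):
        raise ValueError("order must be a power of two")
    h = [[1]]
    while len(h) < order:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def brr_from_pairs(pairs, k=0.5, n_reps=8):
    """BRR variance of the total from collapsed pairs (Y_g1, Y_g2)."""
    if len(pairs) > n_reps:
        raise ValueError("need at least as many replicates as groups")
    delta = hadamard(n_reps)  # delta[g][r] is +1 or -1
    full = sum(y1 + y2 for y1, y2 in pairs)
    sum_sq = 0.0
    for r in range(n_reps):
        replicate = sum(
            (1 + (1 - k) * delta[g][r]) * y1 + (1 - (1 - k) * delta[g][r]) * y2
            for g, (y1, y2) in enumerate(pairs)
        )
        sum_sq += (replicate - full) ** 2
    return sum_sq / (n_reps * (1 - k) ** 2)


pairs = [(10.0, 7.0), (4.0, 9.0), (6.0, 6.5)]  # hypothetical stratum estimates
closed_form = sum((y1 - y2) ** 2 for y1, y2 in pairs)
brr_value = brr_from_pairs(pairs)
```

The two quantities agree for any admissible $K$, which is why the perturbation factor does not appear in the final expression.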

19 Bias in BRR with Pseudo-Strata

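• Taking expectation (the strata are sampled independently, so $E\{\hat{Y}_{g1}\hat{Y}_{g2}\} = \mu_{g1}\mu_{g2}$):

$$E\left\{ \sum_{g=1}^{G} (\hat{Y}_{g1}^2 + \hat{Y}_{g2}^2 - 2\hat{Y}_{g1}\hat{Y}_{g2}) \right\} = \sum_{g=1}^{G} \left[ \mathrm{Var}(\hat{Y}_{g1}) + \mu_{g1}^2 + \mathrm{Var}(\hat{Y}_{g2}) + \mu_{g2}^2 - 2\mu_{g1}\mu_{g2} \right]$$

$$= \sum_{g=1}^{G} (\sigma_{g1}^2 + \sigma_{g2}^2) + \sum_{g=1}^{G} (\mu_{g1} - \mu_{g2})^2 = \mathrm{Var}(\hat{Y}) + \mathrm{Bias}^2$$

where $\sigma_{gh}^2 = \mathrm{Var}\{\hat{Y}_{gh}\}$ and $\mu_{gh} = E\{\hat{Y}_{gh}\}$.

• The bias-squared term is nonnegative, so it can only ADD to the variance estimate.
• The bias-squared term would be zero if the pair of PSUs in each group were perfectly matched.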
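A quick Monte Carlo check illustrates the decomposition above. This is only a sketch: it assumes independent, normally distributed stratum estimates with made-up means and standard deviations, which is a convenience for illustration and not a claim about the CPS estimates.

```python
import random


def mean_collapsed_estimator(mu, sigma, n_sims=20000, seed=1):
    """Monte Carlo mean of sum_g (Y_g1 - Y_g2)^2 for independent normal Y_gh."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        total += sum(
            (rng.gauss(m1, s1) - rng.gauss(m2, s2)) ** 2
            for (m1, m2), (s1, s2) in zip(mu, sigma)
        )
    return total / n_sims


# Hypothetical values: group 0 is badly matched, group 1 is perfectly matched.
mu = [(100.0, 90.0), (50.0, 50.0)]
sigma = [(3.0, 4.0), (2.0, 2.0)]

true_var = sum(s1 ** 2 + s2 ** 2 for s1, s2 in sigma)  # Var(Y-hat)
bias_sq = sum((m1 - m2) ** 2 for m1, m2 in mu)         # sum of squared mean gaps
mc_mean = mean_collapsed_estimator(mu, sigma)          # close to true_var + bias_sq
```

In this toy setup only the mismatched group contributes to the bias-squared term; the perfectly matched group contributes nothing, as the last bullet states.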

20 How Are Strata Collapsed?

• In CPS, the objective function is a function of:
  • Unemployment
  • Civilian labor force
  • Children 0-17 at or below 200% poverty level

21 Simulation Overview

• Use one month of CPS data (March 2018), which has pseudo-strata information.
• For each household, generate responses $y_i$ iid from a Bernoulli($p$) distribution for $p = 0.03, 0.06, \ldots, 0.99$.
• For each $p$:
  • Run 5,000 simulations.
  • Compute the true variance and the BRR variance.
  • Compute the bias-squared term $\sum_{g=1}^{G} (\mu_{g1} - \mu_{g2})^2$.
  • Compare the true variance with the BRR variance after adjusting for bias.

22 Simulation Computation

• Total number of households: $\hat{N} = \sum_{i=1}^{n} w_i$
• Full-sample estimated response count: $\hat{Y} = \sum_{i=1}^{n} w_i y_i$
• Replicate $r$ estimated response count: $\hat{Y}_r = \sum_{i=1}^{n} w_i y_i f_{ir}$, where $f_{ir}$ is either 1.5 or 0.5.
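The three quantities above can be sketched as follows. The record layout (weight, response, pseudo-stratum index, half) and the function names are assumptions made for illustration, and the weights are made up. The factors 1.5 and 0.5 correspond to $K = 0.5$ in the replicate formula $1 \pm (1-K)\delta_{gr}$.

```python
def replicate_factor(delta, half, k=0.5):
    """Factor f_ir: 1 + (1-k)*delta for half 1, 1 - (1-k)*delta for half 2."""
    sign = 1 if half == 1 else -1
    return 1 + sign * (1 - k) * delta


def weighted_totals(records, delta_row):
    """records: (weight, y, group, half); delta_row[g] is +1/-1 for one replicate."""
    n_hat = sum(w for w, _, _, _ in records)
    y_hat = sum(w * y for w, y, _, _ in records)
    y_rep = sum(
        w * y * replicate_factor(delta_row[g], h) for w, y, g, h in records
    )
    return n_hat, y_hat, y_rep


# Made-up household records: (weight w_i, response y_i, pseudo-stratum g, half h)
records = [
    (100.0, 1, 0, 1),
    (120.0, 0, 0, 1),
    (90.0, 1, 0, 2),
    (110.0, 1, 1, 1),
    (80.0, 1, 1, 2),
    (100.0, 0, 1, 2),
]
n_hat, y_hat, y_rep = weighted_totals(records, delta_row=[1, -1])
```

Every household in the same half of a pseudo-stratum receives the same factor in a given replicate, so the replicate total is a reweighted version of the full-sample total.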

23 Simulation Computation

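• BRR variance of $\hat{Y}$ (with $R = 160$ replicates and $K = 0.5$, so that $1/(R(1-K)^2) = 4/160$):

$$\mathrm{BRRVar}(\hat{Y}) = \frac{4}{160} \sum_{r=1}^{160} (\hat{Y}_r - \hat{Y})^2$$

• BRR variance of response rate $p$:

$$\mathrm{BRRVar}(p) = \left( \frac{1}{\hat{N}} \right)^2 \mathrm{BRRVar}(\hat{Y})$$

assuming $N = \hat{N}$ is fixed from outside knowledge.

• $\mu_{gh} = p \times N_{gh}$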
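With $\mu_{gh} = p \times N_{gh}$, the bias-squared term for the response count is $p^2 \sum_g (N_{g1} - N_{g2})^2$, and dividing by $\hat{N}^2$ puts it on the rate scale. Applying the same reasoning to the nonresponse indicator $1 - y_i$ replaces $p$ with $1 - p$, which suggests why the estimated variances of $p$ and $1 - p$ differ. The sketch below works this through under the simulation's iid Bernoulli assumption; the stratum sizes and the function name are hypothetical.

```python
def rate_bias_sq(p, stratum_sizes):
    """Bias-squared term of BRRVar for a rate when mu_gh = p * N_gh.

    stratum_sizes: list of (N_g1, N_g2) household totals per pseudo-stratum.
    """
    n_total = sum(n1 + n2 for n1, n2 in stratum_sizes)
    mismatch = sum((n1 - n2) ** 2 for n1, n2 in stratum_sizes)
    return (p ** 2) * mismatch / n_total ** 2


sizes = [(5000.0, 4000.0), (3000.0, 3500.0)]  # hypothetical N_gh values
p = 0.9
bias_response = rate_bias_sq(p, sizes)         # bias term for var(p)
bias_nonresponse = rate_bias_sq(1 - p, sizes)  # bias term for var(1 - p)
```

For $p = 0.9$ the ratio of the two bias terms is $p^2 / (1-p)^2 = 81$, so in this toy setting the variance of the nonresponse rate carries far less bias than the variance of the response rate.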

24 BRR Variance vs. True Variance

As $p$ gets close to 1, the BRR variance estimate is far different from the true variance.

[Figure: curves labeled BRRVar and TrueVar]

25 BRR Variance vs. True Variance

Not all of the bias can be explained, due to:
• Strata collapsed based on a different set of covariates
• Use of AHS MOS instead of CPS
• AHS MOS not currently updated (from 2010 design)

[Figure: curves labeled BRRVar, BRRVar - BiasSq, and TrueVar]

26 Summary

• For the NSR component:
  • CPS collapses strata to make pseudo-strata.
  • There is no perfect matching of strata → bias in the variance estimator.
  • The bias becomes large when $p$ gets close to 1.
  • A quick fix is to use $\mathrm{var}(1 - p)$ for large $p$.
• CPS is designed for civilian labor force statistics. Expect more bias when estimating the variance of other statistics.

27 Questions?

Thank You!
