A measurement error model approach to survey data integration: combining information from two surveys

Jae Kwang Kim 1

Iowa State University

2017 SAE conference, Paris July 11th, 2017

1 Joint work with Seho Park

Survey data integration

Want to combine information from multiple surveys.

Three situations:
1 Multiple samples for one target population
2 One sample each from multiple populations
3 Multiple samples from multiple populations

Small area estimation is a special case of survey data integration, in that multiple sub-populations represent multiple domains.

Kim (ISU) Survey Data Integration 7/11/2017 2 / 25

Motivation

USAID Bureau for Food Security (BFS) sponsors Food and Nutrition Technical Assistance III project (FANTA).

Key technical areas of focus are food security, maternal and child health, agriculture, and livelihoods strengthening.

Motivation

FANTA has two development projects: Feed the Future (FTF) and Food for Peace (FFP). The FFP project was conducted by ICF International, and the FTF project was conducted by UNC MEASURE. The two surveys were conducted in 2013 in selected departments of Guatemala: San Marcos, Totonicapan, Quiche, Quetzaltenango, and Huehuetenango.

Map of Guatemala

FFP and FTF Projects in Guatemala

Figure: Selected Departments in Guatemala

Overlap Area

Figure: FFP ZOI and FFP Project Implementation Area for Guatemala

Overlap Area

Table: Overlap Area: Departments and Municipalities

Department        Municipality
San Marcos        Sibinal
                  Tajumulco
Totonicapan       Momostenango
                  Santa Lucia La Reforma
Huehuetenango     Concepcion Huista
                  Jacaltenango
                  San Antonio Huista
                  Todos Santos
Quetzaltenango    San Juan Ostuncalco
Quiche            (Santa Maria) Uspantan
                  Cunen

Common Indicators

Each survey has its own indicators; 11 common indicators were chosen for study. The common items cover women’s nutritional status, children’s well-being, and the prevalence of poverty in households.

Common Indicators

Table: Common Indicators

Indicator                                     Description
Daily Per Capita Expenditures (PCE)           Average daily per capita consumption (constant 2010 USD)
Prevalence of Poverty (PP)                    Prevalence of poverty: percentage of people living on less than $1.25 USD per capita per day
Mean Depth Poverty (MDP)                      Average of the differences between total daily
Prevalence of Households with Hunger (HHS)    Prevalence of households with moderate or severe hunger
Prevalence of Underweight Women               Women eligible for BMI (not currently pregnant and not within 2 months of delivery) who have a BMI less than 18.5
Women’s Dietary Diversity Score (WDDS)        Mean number of food groups consumed by women of reproductive age (15-49 years)

Common Indicators

Table: Common Indicators (Cont’d)

Indicator                                                            Description
Prevalence of Stunted Children                                       Prevalence of stunted children under five years of age (0-59 months)
Prevalence of Wasted Children                                        Prevalence of wasted children under five years of age (0-59 months)
Prevalence of Underweight Children                                   Prevalence of underweight children under five years of age (0-59 months)
Prevalence of Children Receiving a Minimum Acceptable Diet (MAD)     Prevalence of children 6-23 months receiving a minimum acceptable diet
Prevalence of Exclusive Breastfeeding (EBF)                          Prevalence of exclusive breastfeeding of children under six months of age

Estimates from two surveys

Table: Daily Per Capita Expenditure

                        FFP/ICF                  FTF/UNC
Department           N     Mean    S.E.       N     Mean    S.E.    T-statistics
San Marcos         1419   0.558   0.014      981   1.166   0.018      -23.376
Totonicapan        1654   0.388   0.015      181   0.896   0.039       -5.505
Huehuetenango       877   0.456   0.023     1535   1.140   0.018      -30.587
Quetzaltenango      628   0.695   0.022       60   1.325   0.112      -26.179
Quiche             1288   0.382   0.015     1350   1.045   0.015      -12.179

Estimates from two surveys

Table: Prevalence of Households with Hunger (%)

                        FFP/ICF                  FTF/UNC
Department           N     Mean    S.E.       N     Mean    S.E.    T-statistics
San Marcos         1419    3.76    0.50      981   15.35    1.08       -9.733
Totonicapan        1654   11.79    0.87      181   15.01    2.72       -1.125
Huehuetenango       877    8.91    0.91     1535   15.58    0.87       -5.323
Quetzaltenango      628    6.84    0.91       60    9.94    3.96       -0.765
Quiche             1288    7.13    0.74     1350    9.73    0.77       -2.430
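The slides do not say how the T-statistics were computed. The HHS values are consistent with the usual statistic for comparing two independent estimates, the difference of the means divided by the square root of the sum of the squared standard errors. The sketch below assumes that formula; the function name `two_sample_t` is hypothetical. Because the table entries are rounded, the result is close to, but not exactly, the reported -9.733 for San Marcos.

```python
import math


def two_sample_t(mean_a, se_a, mean_b, se_b):
    """Difference of two independent estimates divided by its standard error.

    Assumed formula (not stated on the slides):
    t = (mean_a - mean_b) / sqrt(se_a^2 + se_b^2)
    """
    return (mean_a - mean_b) / math.sqrt(se_a ** 2 + se_b ** 2)


# San Marcos row of the HHS table: FFP/ICF versus FTF/UNC.
t_san_marcos = two_sample_t(3.76, 0.50, 15.35, 1.08)
# Totonicapan row of the HHS table.
t_totonicapan = two_sample_t(11.79, 0.87, 15.01, 2.72)
print(round(t_san_marcos, 2), round(t_totonicapan, 2))
```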

Data Structure

Table: Data Structure

            X     Ya    Yb
Sample A    o     o
Sample B    o           o

Goal: Synthetic data imputation

Table: Data Structure

            X     Ya    Yb
Sample A    o     o     o
Sample B    o     o     o

Methodology

Steps
1 Specify a measurement error model.
2 Derive the prediction model using Bayes theorem.
3 Parameter estimation: EM algorithm.
4 Generate imputed values from the prediction model.

Step 1: Model specification

Assume that Sample A is the gold standard. That is, Ya = Y.

Structural equation model:

Ya ∼ f1(ya | x; θ1).

From the observations in Sample A, we can perform model diagnostics.

Measurement error model:

Yb ∼ f2(yb | ya; θ2).

Assume that the measurement error is nondifferential, that is, Yb is conditionally independent of x given Ya:

f (yb | x, ya) = f (yb | ya)

For dichotomous y-variables, measurement error model becomes misclassification model.

Step 2: Prediction model

Prediction model is the model for the counterfactual outcome, conditional on the observed values.

Prediction model for Yb in sample A:

p(yb | x, ya) = f2(yb | ya).

Prediction model for Ya in sample B: Using Bayes formula, we can derive

$$
p(y_a \mid x, y_b) = \frac{f_1(y_a \mid x; \theta_1)\, f_2(y_b \mid y_a; \theta_2)}{\int f_1(y_a \mid x; \theta_1)\, f_2(y_b \mid y_a; \theta_2)\, dy_a}
$$

The prediction model can be used to obtain the best prediction of Yai for i ∈ Sb.

Step 3: Parameter estimation - EM algorithm

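E-step: compute

$$
Q_1(\theta_1 \mid \text{data}; \hat{\theta}^{(t)}) = \sum_{i \in S_a} w_{i,a} \log f_1(y_{ai} \mid x_i) + \sum_{i \in S_b} w_{i,b}\, E\{\log f_1(Y_a \mid x_i) \mid x_i, y_{bi}; \hat{\theta}^{(t)}\}
$$

$$
Q_2(\theta_2 \mid \text{data}; \hat{\theta}^{(t)}) = \sum_{i \in S_a} w_{i,a}\, E\{\log f_2(Y_b \mid y_{ai}) \mid x_i, y_{ai}; \hat{\theta}^{(t)}\} + \sum_{i \in S_b} w_{i,b}\, E\{\log f_2(y_{bi} \mid Y_a) \mid x_i, y_{bi}; \hat{\theta}^{(t)}\},
$$

where the conditional expectations are computed from the prediction model in Step 2.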

M-step: update the parameters by maximizing Q1 and Q2 wrt θ1 and θ2, respectively.
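As an illustration only, the E-step and M-step can be sketched for a normal linear specification, ya = x'β + e and yb = α0 + α1 ya + u with normal errors, which is the form used for the PCE indicator in the application part of this talk. The function name `em_normal`, the starting values, the fixed iteration count, and the simulated data are all assumptions made for this sketch; it is not the authors' implementation. Under this specification both conditional distributions are normal, so the expected sufficient statistics have closed forms.

```python
import numpy as np


def em_normal(Xa, ya, wa, Xb, yb, wb, n_iter=1000):
    """EM sketch for ya = x'beta + e, yb = a0 + a1*ya + u (normal errors)."""
    X = np.vstack([Xa, Xb])
    w = np.concatenate([wa, wb])
    sw = np.sqrt(w)
    # Starting values: weighted least squares on Sample A; a1 = 1, a0 = 0.
    swa = np.sqrt(wa)
    beta = np.linalg.lstsq(Xa * swa[:, None], ya * swa, rcond=None)[0]
    s2e = np.average((ya - Xa @ beta) ** 2, weights=wa)
    a0, a1, s2u = 0.0, 1.0, float(np.var(yb))
    for _ in range(n_iter):
        # E-step, Sample B: Ya | x, yb is normal with mean m and variance v.
        v = 1.0 / (1.0 / s2e + a1 ** 2 / s2u)
        m = v * ((Xb @ beta) / s2e + a1 * (yb - a0) / s2u)
        # E-step, Sample A: Yb | ya is normal with mean mb and variance s2u.
        mb = a0 + a1 * ya
        # Expected sufficient statistics, stacked as (Sample A, Sample B).
        e_ya = np.concatenate([ya, m])
        e_ya2 = np.concatenate([ya ** 2, m ** 2 + v])
        e_yb = np.concatenate([mb, yb])
        e_yb2 = np.concatenate([mb ** 2 + s2u, yb ** 2])
        e_yab = np.concatenate([ya * mb, m * yb])
        # M-step for theta1 = (beta, s2e): weighted least squares on E[Ya].
        beta = np.linalg.lstsq(X * sw[:, None], e_ya * sw, rcond=None)[0]
        mu = X @ beta
        s2e = np.average(e_ya2 - 2 * mu * e_ya + mu ** 2, weights=w)
        # M-step for theta2 = (a0, a1, s2u): weighted regression of Yb on Ya.
        ma = np.average(e_ya, weights=w)
        mbar = np.average(e_yb, weights=w)
        a1 = (np.average(e_yab, weights=w) - ma * mbar) / (
            np.average(e_ya2, weights=w) - ma ** 2
        )
        a0 = mbar - a1 * ma
        resid2 = (e_yb2 - 2 * a0 * e_yb - 2 * a1 * e_yab
                  + a0 ** 2 + 2 * a0 * a1 * e_ya + a1 ** 2 * e_ya2)
        s2u = np.average(resid2, weights=w)
    return {"beta": beta, "s2e": s2e, "a0": a0, "a1": a1, "s2u": s2u}


# Simulated illustration; every number below is made up for the sketch.
rng = np.random.default_rng(0)
n = 2000
Xa = np.column_stack([np.ones(n), rng.normal(0.0, 2.0, n)])
Xb = np.column_stack([np.ones(n), rng.normal(0.0, 2.0, n)])
beta_true = np.array([1.0, 1.0])
ya = Xa @ beta_true + rng.normal(0.0, 0.5, n)         # observed in Sample A
ya_latent = Xb @ beta_true + rng.normal(0.0, 0.5, n)  # unobserved in Sample B
yb = 0.3 + 0.5 * ya_latent + rng.normal(0.0, 0.5, n)  # observed in Sample B
est = em_normal(Xa, ya, np.ones(n), Xb, yb, np.ones(n))
print(est)
```

In this sketch the measurement error parameters are identified only through the covariate x, since no unit has both ya and yb observed. A convergence check on the parameter change would be a natural replacement for the fixed iteration count.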

Step 4: Best prediction

Using the measurement error model, we can predict $y_{ai}$ by $\hat{y}_{ai} = E(Y_a \mid x_i, y_{bi})$ for $i \in S_b$.

A prediction estimator of $\mu = E(Y_a)$ can be obtained by

$$
\hat{\mu}^{*} = \frac{\sum_{i \in S_a} w_{i,a}\, y_{ai} + \sum_{i \in S_b} w_{i,b}\, \hat{y}_{ai}}{\sum_{i \in S_a} w_{i,a} + \sum_{i \in S_b} w_{i,b}}
$$

Reference: Kim, Berg, and Park (2016). Statistical Matching using fractional imputation. Survey Methodology, 42, 19–40.
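The prediction estimator is a weighted mean that pools the observed values from Sample A with the predicted values for Sample B. A minimal sketch follows; the function name `prediction_estimator` and the toy inputs are hypothetical.

```python
import numpy as np


def prediction_estimator(wa, ya, wb, ya_hat):
    """Weighted mean of observed ya (Sample A) and predicted ya (Sample B)."""
    wa, ya = np.asarray(wa, dtype=float), np.asarray(ya, dtype=float)
    wb, ya_hat = np.asarray(wb, dtype=float), np.asarray(ya_hat, dtype=float)
    numerator = np.sum(wa * ya) + np.sum(wb * ya_hat)
    denominator = np.sum(wa) + np.sum(wb)
    return numerator / denominator


# Toy example: two units in Sample A, one unit in Sample B with weight 2.
mu_hat = prediction_estimator([1, 1], [1.0, 3.0], [2], [4.0])
print(mu_hat)  # (1 + 3 + 2 * 4) / (1 + 1 + 2) = 3.0
```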

Application to FANTA project

1 Model for PCE

yai = xi β + ei

ybi = α0 + α1yai + ui

where $e_i \sim N(0, \sigma_e^2)$ and $u_i \sim N(0, \sigma_u^2)$.

2 Model for HHS prevalence

yai ∼ Bernoulli(πi )

ybi ∼ Bernoulli{pyai + q(1 − yai )}

where logit(πi ) = xi β and p, q ∈ (0, 1).
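For these two models the best prediction E(Ya | x, yb) from Step 2 has a closed form. Under the normal model, Ya given (x, yb) is normal with variance v = 1 / (1/σe² + α1²/σu²) and mean v · {xβ/σe² + α1(yb − α0)/σu²}. Under the Bernoulli model, Bayes formula gives P(Ya = 1 | x, yb) directly from π, p, and q. The sketch below assumes the parameters are already estimated; the function names and the numerical inputs are hypothetical.

```python
import numpy as np


def predict_ya_normal(x_beta, yb, s2e, a0, a1, s2u):
    """E(Ya | x, yb) when ya = x'beta + e and yb = a0 + a1*ya + u."""
    v = 1.0 / (1.0 / s2e + a1 ** 2 / s2u)  # conditional variance of Ya
    return v * (x_beta / s2e + a1 * (yb - a0) / s2u)


def predict_ya_binary(pi, yb, p, q):
    """P(Ya = 1 | x, yb), with p = P(Yb=1 | Ya=1) and q = P(Yb=1 | Ya=0)."""
    like1 = np.where(yb == 1, p, 1.0 - p)  # f2(yb | ya = 1)
    like0 = np.where(yb == 1, q, 1.0 - q)  # f2(yb | ya = 0)
    return pi * like1 / (pi * like1 + (1.0 - pi) * like0)


# Normal case: equal error variances split the difference between the
# regression prediction (0) and the back-transformed report (2).
print(predict_ya_normal(x_beta=0.0, yb=2.0, s2e=1.0, a0=0.0, a1=1.0, s2u=1.0))
# Binary case: prior 0.5, reported hunger, p = 0.9 and q = 0.1.
print(predict_ya_binary(pi=0.5, yb=1, p=0.9, q=0.1))
```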

Model Diagnostics for PCE model

Figure: Two diagnostic plots. Left panel: Fitted Values vs. Residuals (x-axis: Fitted Values; y-axis: Residuals). Right panel: Normal Q-Q Plot (x-axis: Theoretical Quantiles; y-axis: Sample Quantiles).

Result: PCE Indicator

Department         FFP              FTF              Combined
San Marcos         0.558 (0.030)    1.165 (0.038)    0.563 (0.026)
Totonicapan        0.388 (0.030)    0.895 (0.085)    0.331 (0.028)
Quiche             0.382 (0.030)    1.045 (0.031)    0.396 (0.026)
Huehuetenango      0.456 (0.044)    1.140 (0.036)    0.479 (0.027)
Quetzaltenango     0.695 (0.044)    1.325 (0.232)    0.795 (0.043)

Results for HHS indicator

Department         FFP             FTF             Combined
San Marcos          3.76 (1.01)    15.35 (2.22)     3.77 (1.00)
Totonicapan        11.79 (1.70)    15.01 (6.00)    12.08 (1.60)
Quiche              7.13 (1.50)     9.73 (1.57)     7.19 (1.42)
Huehuetenango       8.91 (1.90)    15.58 (2.00)     8.75 (1.90)
Quetzaltenango      6.84 (1.80)     9.94 (8.25)     6.85 (1.70)

Concluding remark

Survey data integration using a measurement error model is considered.
Prediction of the counterfactual outcome is obtained by Bayes theorem.
Parameter estimation involves the EM algorithm.
A Bayesian approach can be developed (not discussed here).
Extension to a GLMM for the structural equation model is in progress.
