Advanced Handling of Missing

One-day Workshop

Nicole Janz

[email protected] Goals

• Discuss types of missingness

• Know advantages & disadvantages of missing data methods

• Learn multiple

• Practical: diagnose, visualize and handle missing data in R

2 Steps in the research process

1. Identify patterns of missingness for each variable

2. Why are data missing? Could this your sample?

3. How do other scholars in your field handle missingness?

4. Decide on method to handle missingness for your particular variables

5. Robustness: try different missing data methods, run your analysis, compare the results

3 ProportionsA SIMPLIFIED of missingnessBIVARIATE TEST per GUIDE variable in a table

variable nmiss n propmiss

country 0 5568 0.00000000 year 0 5568 0.00000000 UN_FDI_flow 477 5568 0.08566810 US_fdi_electrical 1896 5568 0.34051724 US_fdi_machinery 1922 5568 0.34518678 US_fdi_transport 1968 5568 0.35344828 US_fdi_mining 3908 5568 0.70186782 US_fdi_services 3955 5568 0.71030891 US_fdi_petrol 4258 5568 0.76472701 US_fdi_utilities 4984 5568 0.89511494

4 ProportionsA SIMPLIFIED of missingnessBIVARIATE TEST per GUIDE variable in a graph Proportion of missingness

Petrol/GDP Mining/GDP Other FDI/GDP Deposit./GDP Finance/GDP US FDI/GDP Wh.Trade/GDP Food/GDP Chemical/GDP Metal/GDP Transp./GDP Machinery/GDP Mosley Law Mosley Prac. Mosley Labor Electr./GDP PTS Democracy CIRI Women CIRI Phys. CIRI Emp. CIRI Worker Trade GDP p. capita Population Conflict Fariss Life exp. Inf.mort.

0.0 0.2 0.4 0.6 0.8 1.0 5 TimeA SIMPLIFIED series: number BIVARIATE of years TEST withGUIDE existing data

6 HeatmapA SIMPLIFIED per country-year BIVARIATE TEST and GUIDE variable

yellow=missing 7 WhyA SIMPLIFIEDare my data BIVARIATE missing? TEST GUIDE

Due to social/natural processes • school graduation, dropout, death • a country does not exist anymore e.g. GDR • office reclassified variables • intentional non-disclosure

Skip patterns in surveys • E.g. only married respondents are asked certain follow-up questions

Respondent refusal • income

8 WhyA SIMPLIFIEDare my data BIVARIATE missing? TEST GUIDE

variable nmiss n propmiss

US_fdi_mining 3908 5568 0.70186782 US_fdi_petrol 4258 5568 0.76472701 US_fdi_utilities 4984 5568 0.89511494

• Mining FDI is available until 1999 • Petrol FDI is available from 2000 • Utilities FDI is a new category was introduced after 2000

9 ThreeA SIMPLIFIED types of missingnessBIVARIATE TEST GUIDE

1. MCAR - Missing Completely at Random

2. MAR - Missing at Random

3. MNAR Missing not at Random

10 MCAR:A SIMPLIFIED Missing CompletelyBIVARIATE TEST at Random GUIDE

Missing value (y) neither depends on x nor y. Probability of missingness is the same for all units.

Survey respondent decides whether to answer the “earnings” question by rolling a die and refusing to answer if a “6” shows up

Some survey questions asked of a simple random sample of original sample

What to do: If data are missing completely at random, then throwing out cases with missing data does not bias your inferences

-> do listwise deletion, then run analysis 11 MAR:A SIMPLIFIED Missing at BIVARIATE Random TEST GUIDE Probability that a variable is missing depends only observed data, but not the missing data itself, or unobserved data.

If sex, race, education, and age are recorded for all the people in the survey, then “earnings” is MAR if the probability of nonresponse depends only on these variables

If men are more likely to tell you their weight than women, and we record gender, then weight is MAR.

What to do? Some say listwise deletion is fine, but only if regression controls for all variables that affect probability of missingness.

More common: use multiple imputation (MI) because listwise deletion introduces bias. 12 MNAR:A SIMPLIFIED Missing BIVARIATE not at Random TEST GUIDE (non-ignorable missingness) Missingness depends at least in part on unobserved factors. Special case: Missingness depends on variable that is missing

People with college degrees are less likely to reveal their earnings, we don’t have education data for all respondents

If a particular treatment causes discomfort, a patient is more likely to drop out of the study. We don’t have a measure for discomfort for all patients.

Respondents with high income less likely to report income.

13 MNAR:A SIMPLIFIED Missing BIVARIATE not at Random TEST GUIDE (non-ignorable missingness)

What to do? Most problematic case. Potential lurking variables are often unobserved.

MI based on auxiliary, external data e.g. estimate race based on data associated with the address of the respondent.

Try to include as many predictors as possible in a model to get MNAR closer to MAR.

14 HowA SIMPLIFIEDto distinguish BIVARIATE between TEST MNAR GUIDE and MAR?

Think about your variables and use your substantive scientific knowledge of the data and your field.

Can you collect more data that explain missingness, or is it very likely that they will remain unobserved?

What does the literature say about predictors of that particular missing variable?

15 HowA SIMPLIFIEDto distinguish BIVARIATE between TEST MAR GUIDE and MCAR?

Again, think about the data. Some indication (but no definitive answer) can be gained from two tests:

1) Little’s test for MCAR (Little 1988)

Maximum likelihood chi-square test for missing completely at

random. H0 is that the data is MCAR.

If the p value for Little's MCAR test is not significant, then the data may be assumed to be MCAR and missingness is ignorable (do listwise deletion).

mcartest in ; EM option in SPSS; in R see lab 16 HowA SIMPLIFIEDto distinguish BIVARIATE between TEST MAR GUIDE and MCAR?

2. Dummy variable approach for MCAR

create dummy variables for whether a variable is missing: 1 = missing 0 = observed

Run t-tests (continuous) and chi-square (categorical) tests between this dummy and other variables to see if the missingness is related to the values of other variables

Tests which return a finding of significance indicate MAR rather than MCAR (-> use multiple imputation)

(SPSS: MVA option, R see lab) 17 Ad-hocA SIMPLIFIED methods BIVARIATE TEST GUIDE Listwise deletion (complete case analysis) Automatically done in regression in most software; or by hand; assumes MCAR • If MAR or MNAR: introduces biased sample • reduces sample size

Pairwise deletion (available case analysis) different aspects of a problem are studied with different subsets of the data • Results between subsets not consistent / comparable • if the non-respondents differ systematically from the respondents, this will bias the available-case summaries • Potential omitted variable bias if excludes a complete variable because its high missingness

18 Ad-hocA SIMPLIFIED methods BIVARIATE TEST GUIDE Last value carried forward replace missing outcome values with pre-treatment measure • would lead to underestimates of the true treatment effect • ignores changes over time

Mean imputation easiest way to impute is to replace each NA with the • distorts distribution for this variable, e.g. underestimates sd • ignores changes over time

Filling in values manually based on case-based knowledge from other sources • time-consuming • prone to measurement error

19 SingleA SIMPLIFIED imputation BIVARIATE TEST GUIDE Impute missing values from predicted values results from regression

• the error in these cases becomes zero. However, random errors are a feature of the real world and one variable treated with single imputation will be fundamentally different from the other variables. • leads to overconfidence in our models and the coefficients upwards

20 MultipleA SIMPLIFIED Imputation BIVARIATE Techniques TEST GUIDE Multiple imputation (MI) is also based on the idea of using predicted values, but it builds in mechanisms to incorporate uncertainty about the predicted values.

MI imputes values for each missing data point, but it does so n times (usually 5). It then creates n (5) completed data sets.

The observed values remain the same, but the imputed value varies across these 5 data sets, reflecting uncertainty.

MI is much closer to reality when calculating new values.

MI is a good alternative to listwise deletion because the main assumption is that data are MAR, meaning that some other variables in the may (and should) explain why an 21 observation is missing

MultipleA SIMPLIFIED Imputation BIVARIATE Techniques TEST GUIDE amelia.pdf /web/packages/Amelia/vignettes/ cran.r-project.org

22

Details on expectation maximization (EM) algorithm, see King et al. (2001). Figure:https:// CombinationA SIMPLIFIED of BIVARIATE results TEST GUIDE

Run each analysis (e.g. regression) on all 5 imputed data sets.

Collect all 5 coefficients and standard errors (and other measures of interest), and combine them into one estimate according to Rubin’s Rule (1987):

• Estimates: average of the individual estimates

: combine between-imputation and within-imputation variance

23 See King et al. (2001). MultipleA SIMPLIFIED Imputation BIVARIATE Software TEST GUIDE

{Amelia} in R (by Gary King and collaborators)

{mi} in R (by Andrew Gelman and collaborators)

{mice} in R (by Stef van Buuren and collaborators)

SPSS (Analyze > Multiple Imputation)

STATA mi estimate

24 Social Sciences Research Methods Centre

Lab

SummarizingA SIMPLIFIED and BIVARIATE Visualizing TEST GUIDE Missingness in R

% of missingness per variable and subsets of variables

Graphical display

Using Amelia for diagnosis of missingness

26 MCARA SIMPLIFIED patterns? BIVARIATE TEST GUIDE

1) BaylorEdPsych (Little’s Test to diagnose MCAR) https://cran.r-project.org/web/packages/BaylorEdPsych/ BaylorEdPsych.pdf

2) Creating a dummy variable for missingness 0/1, then running correlations among variables

27 Ad-hocA SIMPLIFIED measures BIVARIATE in R TEST GUIDE

1) Listwise deletion, pairwise deletion

2) Carry last value forward

3) Mean imputation

4) Manually recoding particular variables

5) Replace NAs with predicted values from regression

28 ExampleA SIMPLIFIED 1 BIVARIATE TEST GUIDE Adapted from Schlomer et al. (2010)

60 clients under age 21 years at a large university counseling center were referred for counseling by the dean of students due to underage drinking violations. The counseling center randomly assigned the students to one of two treatment programs (independent variable: Group), one of which uses the harm reduction approach, and the other of which is based on a 12-step model. Participants’ self-efficacy for sobriety was measured before (covariate) and after (dependent variable) the counseling.

• 7 variations of the DV: DV with no missing; DV with 10%, 20%, and 50% MCAR, and DV with 10%, 20%, and 50% MAR 29 ExampleA SIMPLIFIED 1 BIVARIATE TEST GUIDE Adapted from Schlomer et al. (2010)

Goal: Compare biases in estimates of mean, , regression coefficient, and standard error when the DV has 20% missing at random with when the DV has 0% missing using different missing data handling techniques.

Step 1: Calculate M, SD, B, and SE with DV0Miss

Step 2: Create the target data set with DV20MAR

30 ExampleA SIMPLIFIED 1 BIVARIATE TEST GUIDE Adapted from Schlomer et al. (2010)

Describe missing patterns Summarize and Little's (1998) MCAR Dummy code visualize missingness test missingness

Ad-hoc methods

Delete listwise Carry last Substitute Recode Predict from or pairwise value forward with mean manually regression

Multiple imputation

Amelia II 31 MultipleA SIMPLIFIED Imputation BIVARIATE with Amelia TEST GUIDE II

How to run an imputation in R incl diagnostics

- run Amelia on a data set - saving an imputed data set - combining several data into an amelia object - how to deal with ordinal, nominal, natural log data - cross-section - lags and leads - overimputation - time series plots

32 ReproducibilityA SIMPLIFIED BIVARIATE TEST GUIDE

• Set seed (!!!) – for yourself and others

When you re-run Amelia after diagnostics and want to make changes, it’s best to re-use exactly what you had with minimal changes

• Work in R, not the GUI version

• Keep your Rscript well commented; make a note of sessionInfo(), especially the Amelia and R version used

33 ReproducibilityA SIMPLIFIED BIVARIATE TEST GUIDE

On 12/4/2012 5:40 AM, Nicole Janz wrote:

Dear _____, I'm a PhD student at Cambridge University, and I work on foreign investment and labor standards. I read your with great interest. I was wondering if you could make the imputation Rcode available to me? I am asking this because I am using Amelia as well, and I would like to try and replicate your imputation with the same specifications.

Hi Nicole - Thanks for the note. Unfortunately, we did this in AmeliaView, so we don't have R code available (I assume you've found the data and Stata code on my website). 34 MoreA SIMPLIFIED practical tipsBIVARIATE TEST GUIDE

• Set the seed!

• Include any variable in the analysis model in your imputation model. Maybe use auxiliary variables if they make sens.

• Include variables in the form they enter the model (lags, logs, leads, transformations).

• Don’t impute things that don’t make sense! Don’t impute decades of missing data.

• Check diagnostics

35 Literature and tutorials Amelia mailing list https://lists.gking.harvard.edu/mailman/listinfo/amelia

Tutorial for three MI software packages by Thomas Leeper http://thomasleeper.com/Rcourse/Tutorials/mi.html

MISSING VALUES ANALYSIS & DATA IMPUTATION http://www.statisticalassociates.com/missingvaluesanalysis_p.pdf

James Honaker and Gary King, What to do About Missing Values in Time Series Cross-Section Data American Journal of Vol. 54, No. 2 (April, 2010): Pp. 561-581.

Gary King, James Honaker, Anne Joseph, and Kenneth Scheve. Analyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation, American Political Science Review, Vol. 95, No. 1 (March, 2001): Pp. 49-69.

36 Literature and tutorials Andrew Gelman and Jeniffer Hill, Using Regression and Multilevel/Hierarchical Models, CHAPTER 25: Missing-data imputation. Cambridge University Press, Cambridge (2006).

Much Ado About Nothing: A Comparison of Missing Data Methods and Software to Fit Incomplete Data Regression Models www.math.smith.edu/~nhorton/muchado.pdf

Allison, Paul D. 2001. Missing Data. Sage University Papers Series on Quantitative Applications in the Social Sciences. Thousand Oaks: Sage.

Enders, Craig. 2010. Applied Missing Data Analysis. Guilford Press: New York.

Little, Roderick J., . 2002. Statistical Analysis with Missing Data. John Wiley & Sons, Inc: Hoboken.

Schafer, Joseph L., John W. Graham. 2002. “MissingData: Our View of the

State of the Art.” Psychological Methods. 37 Thank you !

Nicole Janz

www.nicolejanz.de