Help, my data has gone missing! An introduction to incomplete data

SEMINAR SERIES: HOW TO RUIN YOUR CAREFULLY PLANNED STUDY? TIPS FOR IMPROVING – SESSION 12

ACHILLEAS TSOUMANIS

2 Introduction

Missing data is a fact of life for the researcher

It occurs in almost all research fields

It occurs even in well-designed and controlled studies

Also known as item nonresponse and unit nonresponse

3 Consequences of missing data

Can cause bias in estimation (both in point estimates and in intervals)
Reduces statistical power
May reduce representativeness of samples
Can lead to invalid conclusions
May threaten validity of results
Informative missingness (can give us information / indicates what we are doing wrong)

4 Sources of missing data

Surveys
  Refusal to participate
  Refusal to provide an answer
  Fatigue
  Lack of knowledge
  Skipped question (after instructions or not)

Longitudinal studies
  Drop out
  Events related or unrelated to the study

5 More sources of missing data

Merging of data sets
  Different IDs/characteristics
Failures of
  Illegible hand-written
  Corrupted files
Random events
  Accidental misplacement or damage of a biological sample
  Broken/faulty equipment
  Lost in mail
  Bad weather

6 What do people usually do?

                        Epidemiological studies   Cohort studies       Randomized Clinical Trials
                        (2012 - 262 studies)      (2012 - 82 papers)   (2016 - 86 RCTs)
Reported missing data   54                        43                   93
Complete case analysis  81                        66                   55
Multiple imputation     8                         6                    2

Sensitivity analysis 11 16

Eekhout et al. (2012). Missing Data: A Systematic Review of How They Are Reported and Handled. 23(5), 729-732.
Karahalios et al. (2012). A review of the reporting and handling of missing data in cohort studies with repeated assessment of exposure measures. BMC Med Res Methodol 12, 96.
Fiero et al. (2016). Statistical analysis and handling of missing data in cluster randomized trials: a systematic review. Trials. 17. 1. 10.1186/s13063-016

7 Visualizing missing information

Visualizing missing data is an important tool: it gives insight and helps us understand the data.

Understand your data
+ Attrition due to social/natural processes (school graduation, dropout, death)
+ Skip pattern in survey
+ Random issues
+ Non-responders
+ Time effect
+ Clustering effect
- Inefficient in datasets with a large number of samples or variables.

8 Visualizing missing information

Figures: Heatmap; Co-occurrence plot
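Both views can be derived from a simple missingness indicator matrix. The sketch below is a minimal, dependency-free illustration rather than the plots shown in the seminar: the toy records and the variable names (`age`, `weight`, `smoker`) are invented for the example, and `None` is assumed to mark a missing value.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: None marks a missing value.
records = [
    {"age": 34, "weight": 70.5, "smoker": "no"},
    {"age": None, "weight": 82.0, "smoker": None},
    {"age": 51, "weight": None, "smoker": None},
    {"age": None, "weight": None, "smoker": "yes"},
    {"age": 45, "weight": 66.2, "smoker": "no"},
]
columns = ["age", "weight", "smoker"]

# Indicator matrix: 1 = missing, 0 = observed (the raw material for a heatmap).
indicator = [[int(row[c] is None) for c in columns] for row in records]

# Text "heatmap": one line per sample, '#' = missing, '.' = observed.
heatmap_rows = ["".join("#" if cell else "." for cell in row) for row in indicator]

# Co-occurrence: how often each pair of variables is missing in the same sample.
co_missing = Counter()
for row in indicator:
    missing_cols = [c for c, cell in zip(columns, row) if cell]
    for pair in combinations(missing_cols, 2):
        co_missing[pair] += 1

if __name__ == "__main__":
    print("\n".join(heatmap_rows))
    for pair, count in sorted(co_missing.items()):
        print(pair, count)
```

The indicator matrix can be handed to any plotting library to draw the heatmap; the pair counts are what a co-occurrence plot displays.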

9 Summarizing missing information

Simple numerical summaries are effective at identifying problematic predictors and samples when the data become too large to visually inspect. The total number or percent of missing values for both predictors and samples can be easily computed.

Percentage missing   Min.   1st Qu.   Median   Mean   3rd Qu.   Max.
By sample (row)      0.00   0.00      0.00     0.18   0.65      1.31
By item (column)     0.00   0.00      0.00     4.79   3.43      24.18
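A summary of this kind can be computed in a few lines. The sketch below assumes the data are a list of equal-length rows with `None` for missing values; the toy numbers and the function names `percent_missing` and `six_number_summary` are made up for illustration, and the quartile convention (the "inclusive" method) is one choice among several.

```python
import statistics

def percent_missing(rows):
    """Return (percent missing per row, percent missing per column)."""
    n_rows, n_cols = len(rows), len(rows[0])
    by_row = [100 * sum(v is None for v in r) / n_cols for r in rows]
    by_col = [100 * sum(r[j] is None for r in rows) / n_rows for j in range(n_cols)]
    return by_row, by_col

def six_number_summary(values):
    """Min, 1st Qu., Median, Mean, 3rd Qu., Max (quartiles via the inclusive method)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, statistics.mean(values), q3, max(values))

# Hypothetical toy data: 4 samples (rows) by 3 items (columns).
rows = [
    [1, None, 3],
    [4, 5, 6],
    [None, None, 9],
    [10, 11, 12],
]
by_row, by_col = percent_missing(rows)

if __name__ == "__main__":
    print("By sample (row):", six_number_summary(by_row))
    print("By item (column):", six_number_summary(by_col))
```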

10 Types of missing data

Understanding why data are missing is important for handling the remaining data correctly. If the missingness follows systematic patterns, the sample is no longer representative and the results are biased.

MCAR, MAR, MNAR, Censoring

11 Missing Completely At Random (MCAR)

Occurs entirely at random
No pattern

Missingness is independent of both the observed variables and the unobserved parameters of interest

Ideal
Unrealistically strong assumption
Power may be reduced in the design
Results remain unbiased

12 Missing At Random (MAR)

Occurs when the missingness is not completely random but can be fully accounted for by observed variables in the dataset.

A weaker assumption than MCAR
A more realistic assumption

MAR does not mean that the missing data can be ignored.

14 Missing Not At Random (MNAR)

Neither MAR nor MCAR

Missingness is related to unobserved data, or to the value of the missing variable itself

Worst-case scenario: the most complicated mechanism
Removing observations can produce a bias in the model.
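The difference between the three mechanisms is easiest to see in a small simulation. The sketch below is an illustration under invented assumptions (a standard-normal `x` that is always observed, an outcome `y = x + noise` with true mean 0, and simple threshold rules for MAR and MNAR); it compares the complete-case mean of `y` under each mechanism and shows that, under MAR, the observed `x` can be used to correct the estimate.

```python
import random
import statistics

random.seed(12)
N = 20_000

# Hypothetical population: x is always observed, y may go missing. True mean of y is 0.
x = [random.gauss(0, 1) for _ in range(N)]
y = [xi + random.gauss(0, 1) for xi in x]

def observed_mean(y_values, is_missing):
    """Complete-case mean of y: average only the values that were not lost."""
    return statistics.fmean(v for v, m in zip(y_values, is_missing) if not m)

# MCAR: every value has the same 50% chance of being lost.
mcar = [random.random() < 0.5 for _ in range(N)]
# MAR: missingness depends only on the observed x (y is lost whenever x > 0).
mar = [xi > 0 for xi in x]
# MNAR: missingness depends on the unobserved value itself (y is lost whenever y > 0).
mnar = [yi > 0 for yi in y]

means = {name: observed_mean(y, mask)
         for name, mask in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]}

# Under MAR the observed x accounts for the missingness: fit y ~ x on the
# complete cases, then average the fitted values over *all* x.
xs = [xi for xi, m in zip(x, mar) if not m]
ys = [yi for yi, m in zip(y, mar) if not m]
mx, my = statistics.fmean(xs), statistics.fmean(ys)
slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sum((a - mx) ** 2 for a in xs)
intercept = my - slope * mx
mar_adjusted = statistics.fmean(intercept + slope * xi for xi in x)

if __name__ == "__main__":
    for name, value in means.items():
        print(f"{name}: complete-case mean of y = {value:.3f}")
    print(f"MAR, adjusted using x: {mar_adjusted:.3f}")
```

In this setup the complete-case mean stays close to the truth only under MCAR. Under MAR it is biased until `x` is taken into account, and under MNAR no observed variable is available to repair it.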

15 Informative missingness

Not Applicable
I do not know
I prefer not to answer
I do not remember

16 Censoring

A data point isn’t missing but is also not complete.
The value of a measurement is partially known.
It can exist by design.

Left censoring – below a certain value
Interval censoring – in an interval between two values
Right censoring – above a certain value
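One way to keep the partial information in a censored value is to store it as a pair of bounds instead of a single number. The sketch below is only an illustration of that idea; the class name `Measurement`, the use of infinite bounds for "unknown", and the example readings are all assumptions made for the example.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Measurement:
    """A value known only to lie in [lower, upper]; equal bounds mean fully observed."""
    lower: float
    upper: float

    @property
    def kind(self) -> str:
        if self.lower == self.upper:
            return "observed"
        if math.isinf(self.lower) and math.isinf(self.upper):
            return "missing"            # nothing at all is known
        if math.isinf(self.lower):
            return "left-censored"      # only an upper bound is known
        if math.isinf(self.upper):
            return "right-censored"     # only a lower bound is known
        return "interval-censored"      # known to lie between two values

# Hypothetical readings.
readings = {
    "exact lab value": Measurement(3.2, 3.2),
    "below detection limit 0.5": Measurement(-math.inf, 0.5),
    "event between month 6 and month 12 visits": Measurement(6, 12),
    "still event-free at month 24": Measurement(24, math.inf),
}
kinds = {label: m.kind for label, m in readings.items()}

if __name__ == "__main__":
    for label, kind in kinds.items():
        print(f"{label}: {kind}")
```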

17 How to treat missing data

There is NO perfect way to deal with missing data.

An optimal approach should:
1. Minimize bias
2. Maximize the use of available information
3. Yield accurate estimates for uncertainty

Recover the values
Contact the participants to fill in the missing values
In in-person studies, it helps to check for missing values before the participant leaves

18 How to treat missing data: The Methods

Deletion
  Listwise
  Pairwise

Imputation
  Unconditional single imp.
  Most frequent value
  Zero or constant
  Last observation carried forward
  Next observation carried forward
  Linear interpolation
  Conditional single imp.
  Conditional multiple imp.

Other Methods
  Predictive mean matching
  Random forest
  Bayesian simulation methods
  Missing data indicator
  Hot-deck imputation
  K Nearest Neighbors
  Expectation Maximization algorithm

19 Deletion Methods

Dropping a variable or sample from the analysis (not the dataset!)
When specific samples (persons) or items (questions) account for the majority of the missing values in a dataset, drop that sample/variable from the analyses.
As a rule of thumb, when 60–70 percent of a variable's values are missing, dropping the variable should be considered.
The simplest approach
Not good with small datasets
Predictors that are known to be valuable and/or predictive of the outcome should not be removed.
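The rule of thumb can be sketched as a small helper. The example below is illustrative only: the function name `variables_to_drop`, the 0.6 default threshold (taken from the lower end of the 60–70 percent rule), the toy rows and the variable names are all assumptions. The original rows are left untouched and a separate analysis view is built, in line with dropping from the analysis and not from the dataset.

```python
def variables_to_drop(rows, columns, threshold=0.6, keep=()):
    """Names of columns whose missing fraction exceeds `threshold`, except protected ones."""
    n = len(rows)
    dropped = []
    for col in columns:
        fraction_missing = sum(r[col] is None for r in rows) / n
        if fraction_missing > threshold and col not in keep:
            dropped.append(col)
    return dropped

# Hypothetical toy data: None marks a missing value.
rows = [
    {"age": 40, "income": None, "biomarker": None},
    {"age": 52, "income": None, "biomarker": 1.7},
    {"age": None, "income": 31000, "biomarker": None},
    {"age": 35, "income": None, "biomarker": None},
    {"age": 61, "income": None, "biomarker": None},
]
columns = ["age", "income", "biomarker"]

# "biomarker" is mostly missing too, but is assumed to be a key predictor, so it is protected.
dropped = variables_to_drop(rows, columns, threshold=0.6, keep=("biomarker",))

# Build the analysis view without altering the stored dataset.
analysis_view = [{k: v for k, v in r.items() if k not in dropped} for r in rows]

if __name__ == "__main__":
    print("Dropped from analysis:", dropped)
    print("Columns still in dataset:", list(rows[0]))
```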

20 Listwise and pairwise deletion

Simple and direct
More acceptable where the underlying distributions are known
Big sample / few missing data
Removing observations artificially decreases the error estimation.
Loss of a large number of observations

21 Listwise deletion and pairwise deletion

Listwise deletion (Complete Case analysis)
Exclude all data from any participant with missing values.
N ~ 365 obs.

Pairwise deletion (Available Case analysis)
Exclude data from any participant with missing values in the variables of interest.
N ~ 900 obs.
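The two deletion rules differ only in which variables are checked for completeness. A minimal sketch, assuming rows stored as dictionaries with `None` for missing values (the toy data and variable names are invented):

```python
# Hypothetical toy data: None marks a missing value.
rows = [
    {"age": 30, "bmi": 22.0, "sbp": 118},
    {"age": 41, "bmi": None, "sbp": 125},
    {"age": None, "bmi": 27.5, "sbp": 131},
    {"age": 55, "bmi": 30.1, "sbp": None},
    {"age": 62, "bmi": 24.3, "sbp": 140},
]

def listwise(rows):
    """Complete case analysis: keep only rows with no missing value at all."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise(rows, variables):
    """Available case analysis: keep rows complete on the variables of this analysis."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

n_listwise = len(listwise(rows))
n_age_sbp = len(pairwise(rows, ["age", "sbp"]))
n_bmi_sbp = len(pairwise(rows, ["bmi", "sbp"]))

if __name__ == "__main__":
    print("Listwise N:", n_listwise)
    print("Pairwise N (age, sbp):", n_age_sbp)
    print("Pairwise N (bmi, sbp):", n_bmi_sbp)
```

Pairwise deletion keeps more rows per analysis, but each analysis may then rest on a different subset of participants.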

22 Listwise deletion and pairwise deletion

Listwise deletion (Complete Case analysis)
Pros
  Easily implemented
  Unbiased estimates if MCAR
  No loss of power if large sample, or few missing
  Same number of records for all analyses
Cons
  Biased estimates if not MCAR
  Loss of data: larger standard errors, wide confidence intervals, loss of power
  Doesn’t use all information

Pairwise deletion (Available Case analysis)
Pros
  Keeps as many cases as possible for each analysis (default option in most software)
  No big loss of power
Cons
  Cannot compare analyses – different N
  Assumes MCAR
  If there are many missing observations, the analysis will be deficient.

23 Conditional single imputation

Missing values are replaced with predictions from a regression equation
Advantages
  Easy and straightforward
  Incorporates all available information
  Can be used for both numerical and categorical data
Disadvantages
  Assumes MAR
  Standard errors underestimated
  Does not account for the uncertainty in the imputations
  Gives the researcher the impression of more power than the data really have

24 Conditional single imputation

Figures: Regression model imputation; Mean value imputation
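The contrast between the two approaches can be sketched with a single predictor. The example below is a minimal illustration with invented numbers: `x` is fully observed, `y` has gaps, and the regression is an ordinary least squares line fitted on the complete cases.

```python
import statistics

# Hypothetical data: x fully observed, y has gaps (None).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, None, 8.2, None, 11.8]

observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
mean_x = statistics.mean(xi for xi, _ in observed)
mean_y = statistics.mean(yi for _, yi in observed)

# Ordinary least squares fit of y on x, using complete cases only.
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in observed)
         / sum((xi - mean_x) ** 2 for xi, _ in observed))
intercept = mean_y - slope * mean_x

# Mean value imputation: every gap gets the same value.
y_mean_imputed = [mean_y if yi is None else yi for yi in y]

# Regression imputation: each gap gets the prediction for its own x.
y_reg_imputed = [intercept + slope * xi if yi is None else yi
                 for xi, yi in zip(x, y)]

if __name__ == "__main__":
    print("Mean imputation:      ", [round(v, 2) for v in y_mean_imputed])
    print("Regression imputation:", [round(v, 2) for v in y_reg_imputed])
```

Both variants place the imputed points exactly on a line (flat for the mean, sloped for the regression), which is why single imputation understates the variability of the data.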

25 Conditional multiple imputation (MI & MICE)

It is a popular and generally accepted method for handling missing data
Similar to single imputation
Produces more than one complete dataset

Source: Melissa Humphries: Missing data & How to deal: An overview of missing data

26 Multiple imputation
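The general idea (impute several times, analyse each completed dataset, pool the results) can be sketched for the simplest case of one incomplete variable. The code below is a simplified illustration, not MICE and not the routine of any statistical package: it assumes a linear model of `y` on a fully observed `x`, uses a bootstrap refit to reflect model uncertainty, and pools with Rubin's rules. The function names and the simulated data are invented for the example.

```python
import random
import statistics

def fit_line(pairs):
    """OLS of y on x; returns (intercept, slope, residual standard deviation)."""
    mx = statistics.mean(p[0] for p in pairs)
    my = statistics.mean(p[1] for p in pairs)
    slope = (sum((a - mx) * (b - my) for a, b in pairs)
             / sum((a - mx) ** 2 for a, _ in pairs))
    intercept = my - slope * mx
    residuals = [b - (intercept + slope * a) for a, b in pairs]
    sd = (sum(r * r for r in residuals) / (len(pairs) - 2)) ** 0.5
    return intercept, slope, sd

def multiply_impute_mean(x, y, m=20, seed=0):
    """Estimate mean(y) by multiple imputation; returns (estimate, standard error)."""
    rng = random.Random(seed)
    complete = [(a, b) for a, b in zip(x, y) if b is not None]
    estimates, variances = [], []
    for _ in range(m):
        # Refit on a bootstrap sample so each imputation reflects model uncertainty.
        boot = [rng.choice(complete) for _ in complete]
        b0, b1, sd = fit_line(boot)
        # Fill each gap with a prediction plus random residual noise.
        filled = [b if b is not None else b0 + b1 * a + rng.gauss(0, sd)
                  for a, b in zip(x, y)]
        estimates.append(statistics.mean(filled))
        variances.append(statistics.variance(filled) / len(filled))
    # Rubin's rules: pool the m analyses.
    q_bar = statistics.mean(estimates)        # pooled point estimate
    within = statistics.mean(variances)       # average within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance
    total = within + (1 + 1 / m) * between
    return q_bar, total ** 0.5

# Hypothetical data: true mean of y is 2; y is missing whenever x > 0.5 (MAR).
data_rng = random.Random(1)
x = [data_rng.gauss(0, 1) for _ in range(500)]
y_full = [2 + a + data_rng.gauss(0, 1) for a in x]
y = [b if a <= 0.5 else None for a, b in zip(x, y_full)]

estimate, std_error = multiply_impute_mean(x, y)
complete_case_mean = statistics.mean(b for b in y if b is not None)

if __name__ == "__main__":
    print(f"Complete-case mean: {complete_case_mean:.2f}")
    print(f"MI estimate: {estimate:.2f} (SE {std_error:.2f})")
```

The between-imputation variance is what single imputation leaves out; including it is how multiple imputation accounts for the uncertainty in the imputed values.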

Advantages
  Incorporates all available information
  Any kind of variables and more
  Accounts for imputation uncertainty
  Valid (unbiased) estimates both for point estimates and for standard errors (if done correctly)
  Works if the missing data are MAR or MNAR (with some extra work)
  Implemented in many standard statistical software packages (R, STATA, SPSS, SAS)
Disadvantages
  Cumbersome coding
  Room for error when specifying models
  Can only support a linear relationship among variables
  Values are predicted from other variables: no novel information is added
  Imputation of the outcome: not recommended to use MI if missing data are in the outcome

27 Pitfalls in multiple imputation analyses

Specify prediction models correctly
A BMJ article from 2007 reported the development of a tool for cardiovascular risk prediction.
In the published prediction model, cardiovascular risk was found to be unrelated to cholesterol.
In complete-case analysis there was a clear association between cholesterol and cardiovascular risk.

28 To impute or not to impute?

Change the distribution of the data, add , or less variability -> break the assumptions of the analysis methods afterwards.
Average, LOCF, fixed value imputation, most common value, etc.
Examine the distribution of the data before and after filling in missing values.
An ideal solution would yield distributions that are similar in shape.
Think of the purpose:
  If the data will simply be used to create an aesthetically pleasing visualization without "holes", it’s no big deal.
  On the other hand, if data will be used to generate official , the impact of filling in missing values must be carefully examined and clearly understood.
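Checking the distribution before and after filling in values takes only a few lines. The sketch below uses simulated data (all numbers are invented) and mean imputation as the example method; it compares the standard deviation of the observed values with that of the filled-in series.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical measurements, of which roughly 30% are lost completely at random.
original = [rng.gauss(50, 10) for _ in range(1000)]
with_gaps = [v if rng.random() > 0.3 else None for v in original]
observed = [v for v in with_gaps if v is not None]

# Fill every gap with the observed mean.
fill = statistics.mean(observed)
mean_imputed = [fill if v is None else v for v in with_gaps]

sd_before = statistics.stdev(observed)
sd_after = statistics.stdev(mean_imputed)

if __name__ == "__main__":
    print(f"SD of observed values:     {sd_before:.2f}")
    print(f"SD after mean imputation:  {sd_after:.2f}")
```

The mean is unchanged, but the spread shrinks because every imputed value sits exactly at the centre. Any analysis that relies on the variance (tests, confidence intervals, correlations) is affected.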

29 Good Practices in communicating missing data

Report the amount of missing data in your analyses.
Where complete cases and multiple imputation analyses give different results, the analyst should attempt to understand why, and this should be reported in publications.
Communicate to your audience that you have filled in missing values.
Describe the method and state any assumptions.
Address the potential impact of missing data on the findings in the Discussion section.
Perform sensitivity analyses to assess how sensitive results are to reasonable changes in the assumptions that are made.

30 Sensitivity analysis

Sensitivity analysis: the study of how the uncertainty in the output of a model can be allocated to the different sources of uncertainty in its inputs
Evaluate the robustness of the results to deviations from the MAR assumption.
Compare the results of the complete-case analysis with those of other methods.
Test the assumptions made about the missing data

In case of contradictory results, understand why.
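One way to probe deviations from the MAR assumption is a delta adjustment: values imputed under MAR are shifted by a fixed amount `delta`, and the analysis is repeated over a range of plausible shifts. This technique is not named in the slides; it is offered here only as an illustrative sketch, and the function name, the numbers and the grid of deltas are all assumptions.

```python
import statistics

def delta_adjusted_mean(observed, imputed_mar, delta):
    """Mean after shifting every MAR-based imputation by `delta` (0 means MAR holds)."""
    shifted = [v + delta for v in imputed_mar]
    return statistics.mean(list(observed) + shifted)

# Hypothetical numbers: 6 observed outcomes and 4 values imputed under MAR.
observed = [5.1, 4.8, 6.0, 5.5, 4.9, 5.3]
imputed_mar = [5.2, 5.0, 5.4, 5.1]

# Repeat the analysis for a range of departures from MAR.
results = {delta: delta_adjusted_mean(observed, imputed_mar, delta)
           for delta in (-1.0, -0.5, 0.0, 0.5, 1.0)}

if __name__ == "__main__":
    for delta, mean in results.items():
        print(f"delta = {delta:+.1f}: estimated mean = {mean:.2f}")
```

If the conclusion holds across the whole range of deltas considered plausible, it is robust to that kind of departure from MAR. If it changes, the range at which it does is worth reporting.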

31 Avoid missing data

You cannot avoid missing data
Prepare yourself:
  Design data collection to avoid missing data: minimize (non-essential) follow-up visits, use shorter questionnaires and more user-friendly case-report forms, increase sample size.
  Documentation (protocol, manual of operations, training of all parties involved: investigators, study team and participants, procedure to collect, enter, and edit data)
  A pilot study can provide invaluable information on what can go wrong.
  Engage the participants who are at the greatest risk of being lost during follow-up/dropping out.

32 Conclusions

Nearly all real-world datasets have missing values
Missing values can pose serious threats to the validity of the results
Failure to account for missing data in analyses may lead to bias and loss of precision
Action should be taken
There is no best way to deal with missing data.
It is vital to detect and understand the nature of the missing values.
Visualization provides information regarding the extent and nature of missingness
Document and communicate the amount of missing data, your approach, assumptions, method used, differences between complete case analysis and your approach, etc.

33 Conclusions - methods

Choice of method is crucial for validity of conclusions, and should be based on careful consideration of the reasons for the missing data, missing data patterns and the availability of auxiliary information.
There are many options to deal with missing data out there.
  Choose your strategy carefully.
  Some are inherently poor and should be avoided.
Multiple imputation should be done with care.
  Understand the nature of the missing data
  Set up models and predictors carefully

34 Further reading

Hyun Kang: The prevention and handling of the missing data, Korean J Anesthesiol. 2013 May; 64(5): 402-406
Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls, BMJ 2009; 338, doi: https://doi.org/10.1136/bmj.b2393
Enders, Craig. 2010. Applied Missing Data Analysis. Guilford Press: New York.
Little, Roderick J., Donald Rubin. 2002. Statistical Analysis with Missing Data. John Wiley & Sons, Inc: Hoboken.
Schafer, Joseph L., John W. Graham. 2002. “Missing Data: Our View of the State of the Art.” Psychological Methods.

35 Questions / Comments?

Next seminar is on October 1st

Pie is for birthdays, not for graphs – Simple tips for better graphs
Presenter: Meryam Krit

They say that a picture says more than a thousand words. Indeed, to properly explain all the information found in a picture, multiple paragraphs are typically needed. Similarly, in research, it’s not uncommon for specialists to do a first “read” of papers by only glancing at the figures. Good quality graphs are therefore vital to get a message across. Additionally, graphs can easily mislead readers, even when that is not the intention. In this session, we highlight some important choices leading to both visually pleasing and correct figures.
