
Help, my data has gone missing! An introduction to incomplete data
SEMINAR SERIES: HOW TO RUIN YOUR CAREFULLY PLANNED STUDY? TIPS FOR IMPROVING DATA ANALYSIS – SESSION 12
ACHILLEAS TSOUMANIS

Introduction
- Missing data is a fact of life for the researcher
- Occurs in almost all research fields
- Occurs even in well-designed and controlled studies
- Also known as item nonresponse and unit nonresponse

Consequences of missing data
- Can cause bias in estimation (both in point estimates and in intervals)
- Reduces statistical power
- May reduce representativeness of samples
- Can lead to invalid conclusions
- May threaten validity of results
- Informative missingness (can give us information / indicates what we are doing wrong)

Sources of missing data
Surveys
- Refusal to participate
- Refusal to provide an answer
- Fatigue
- Lack of knowledge
- Skipped question (whether instructed to skip or not)
Longitudinal studies
- Drop out
- Events (un)related to the study

More sources of missing data
Merging of data sets
- Different IDs/characteristics
Failures of measurement
- Illegible hand-written measurements
- Corrupted files
Random events
- Accidental misplacement or damage of a biological sample
- Broken/faulty equipment
- Questionnaires lost in mail
- Bad weather

What do people usually do?

                          Epidemiological studies   Cohort studies      Randomized clinical trials
                          (2012, 262 studies)       (2012, 82 papers)   (2016, 86 RCTs)
Reported missing data     54                        43                  93
Complete case analysis    81                        66                  55
Multiple imputation       8                         6                   2
Sensitivity analysis      11, 16 (only two values are given for this row)

Eekhout et al. (2012). Missing data: a systematic review of how they are reported and handled. Epidemiology, 23(5), 729-732.
Karahalios et al. (2012). A review of the reporting and handling of missing data in cohort studies with repeated assessment of exposure measures. BMC Med Res Methodol, 12, 96.
Fiero et al. (2016). Statistical analysis and handling of missing data in cluster randomized trials: a systematic review. Trials, 17(1).
doi: 10.1186/s13063-016

Visualizing missing information
Visualizing missing data is an important tool for gaining insight into the data and understanding it.
Understand your data
+ Attrition due to social/natural processes (school graduation, dropout, death)
+ Skip pattern in survey
+ Random data collection issues
+ Non-responders
+ Time effect
+ Clustering effect
- Inefficient in datasets with a large number of samples or variables

Visualizing missing information
[Figure panels: Heatmap; Co-occurrence plot]

Summarizing missing information
- Simple numerical summaries are effective at identifying problematic predictors and samples when the data become too large to inspect visually.
- The total number or percent of missing values for both predictors and samples can be easily computed.

Percentage missing    Min.   1st Qu.   Median   Mean   3rd Qu.   Max.
By sample (row)       0.00   0.00      0.00     0.18   0.65      1.31
By item (column)      0.00   0.00      0.00     4.79   3.43      24.18

Types of missing data
- Understanding the reasons why data are missing is important for handling the remaining data correctly.
- If there are systematic patterns, the sample is no longer representative and the results are biased.
- MCAR
- MAR
- MNAR
- Censoring

Missing Completely At Random (MCAR)
- Occurs entirely at random
- No pattern
- Missingness is independent both of observable variables and of unobservable parameters of interest
- Ideal
- Unrealistically strong assumption
- Power may be reduced in the design
- Results remain unbiased

Missing At Random (MAR)
- Occurs when the missingness is not random, but can be fully accounted for by observed variables in the dataset.
- A weaker assumption than MCAR
- A more realistic assumption
- MAR does not mean that the missing data can be ignored.
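The difference between the MCAR and MAR mechanisms described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the variables `x` and `y`, the linear relationship between them, and the missingness probabilities (30% under MCAR; 60% or 10% depending on `x` under MAR) are invented for illustration and do not come from the slides. It uses the Python standard library only.

```python
import random
import statistics

random.seed(12)

n = 20000
# Hypothetical data: x is always observed, y depends on x and will be made partly missing.
x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 + 1.5 * xi + random.gauss(0, 1) for xi in x]

# MCAR: every y value has the same 30% chance of being missing.
y_mcar = [yi if random.random() >= 0.3 else None for yi in y]

# MAR: the chance of being missing depends only on the observed x
# (assumed here: 60% when x > 0, 10% otherwise).
y_mar = [
    yi if random.random() >= (0.6 if xi > 0 else 0.1) else None
    for xi, yi in zip(x, y)
]


def complete_case_mean(values):
    """Mean of the observed values only (complete case analysis)."""
    return statistics.fmean([v for v in values if v is not None])


def regression_adjusted_mean(xs, ys):
    """Fit y ~ x on the complete cases, then average the predictions over all x."""
    pairs = [(xi, yi) for xi, yi in zip(xs, ys) if yi is not None]
    mx = statistics.fmean([p[0] for p in pairs])
    my = statistics.fmean([p[1] for p in pairs])
    sxx = sum((p[0] - mx) ** 2 for p in pairs)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pairs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return statistics.fmean([intercept + slope * xi for xi in xs])


full_mean = statistics.fmean(y)
mcar_mean = complete_case_mean(y_mcar)
mar_mean = complete_case_mean(y_mar)
mar_adjusted = regression_adjusted_mean(x, y_mar)

print(f"Mean of y, no missing data:         {full_mean:.3f}")
print(f"Complete cases under MCAR:          {mcar_mean:.3f}")
print(f"Complete cases under MAR:           {mar_mean:.3f}")
print(f"MAR, adjusted using the observed x: {mar_adjusted:.3f}")
```

In this setup the complete-case mean stays close to the full-data mean under MCAR but drifts away from it under MAR, while using the observed `x` recovers it. That is one way to read "MAR does not mean that the missing data can be ignored".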

Missing Not At Random (MNAR)
- Neither MAR nor MCAR
- Missingness is related to unobserved data, or to the value of the variable that is missing
- Worst-case scenario: the most complicated mechanism
- Removing observations can produce a bias in the model.

Informative missingness
- Not Applicable
- I do not know
- I prefer not to answer
- I do not remember

Censoring
- A data point isn’t missing but is also not complete.
- The value of a measurement is partially known.
- It can exist by design.
- Left censoring: below a certain value
- Interval censoring: in an interval between two values
- Right censoring: above a certain value

How to treat missing data
There is NO perfect way to deal with missing data. An optimal approach should:
1. Minimize bias
2. Maximize the use of available information
3. Yield accurate estimates for uncertainty
Recover the values
- Contact the participants to fill out the missing values.
- In in-person studies, checking for missing values before the participant leaves helps.

How to treat missing data: The Methods
Deletion
- Listwise
- Pairwise
Imputation
- Unconditional single imputation
  - Most frequent value
  - Zero or constant
  - Last observation carried forward
  - Next observation carried forward
  - Missing data indicator
  - Linear interpolation
- Conditional single imputation
- Conditional multiple imputation
Other methods
- Predictive mean matching
- Random forest
- Bayesian simulation methods
- Hot-deck imputation
- K Nearest Neighbors
- Expectation Maximization algorithm

Deletion Methods
Dropping a variable or sample from the analysis (not the dataset!)
- Specific samples (persons) or items (questions) may hold the majority of the missing values in a dataset.
- Drop the sample/variable from analyses.
- As a rule of thumb, when 60–70 percent of a variable's values are missing, dropping the variable should be considered.
- The simplest approach
- Not good with small datasets
- Of course, predictors that are known to be valuable and/or predictive of the outcome should not be removed.

Listwise and pairwise deletion
- Simple and direct
- More acceptable where the underlying distributions are known
- Big sample / few missing data
- Removing observations artificially decreases the error estimation.
- Loss of a large number of observations

Listwise deletion (complete case analysis)
- Exclude all data from any participant with missing values.
- N ~ 365 obs.
Pairwise deletion (available case analysis)
- Exclude data from any participant with missing values in the variables of interest.
- N ~ 900 obs.

Listwise deletion (complete case analysis)
Pros
- Easily implemented
- Unbiased estimates if MCAR
- No loss of power if the sample is large or few values are missing
- Same number of records for all analyses
Cons
- Biased estimates if not MCAR
- Loss of data: larger standard errors, wide confidence intervals, loss of power
- Doesn’t use all information
Pairwise deletion (available case analysis)
Pros
- Keeps as many cases as possible for each analysis (default option in most software)
- No big loss of power
Cons
- Cannot compare analyses: different N
- Assumes MCAR
- If there are many missing observations, the analysis will be deficient.

Conditional single imputation
Missing values are replaced with predictions from a regression equation.
Advantages
- Easy and straightforward
- Incorporates all available information
- Can be used for both numerical and categorical data
Disadvantages
- Assumes MAR
- Standard errors are underestimated
- Does not account for the uncertainty in the imputations
- Gives the researcher the feeling of more power than the data have in reality

Conditional single imputation
[Figure panels: Regression model imputation; Mean value imputation]

Conditional multiple imputation (MI & MICE)
- It is a popular and generally accepted method for handling missing data.
- Similar to single imputation
- Produces more than one complete dataset
Source: Melissa Humphries: Missing data & How to deal: An overview of missing data

Multiple imputation
Advantages
- Incorporates all available information
- Any kind of variables and more
- Accounts for imputation uncertainty
- Valid (unbiased) estimates both for point estimates and for standard errors (if done correctly)
- Works if missing data are MAR or MNAR (with some extra work)
- Implemented in many standard statistical software packages (R, STATA, SPSS, SAS)
Disadvantages
- Cumbersome coding
- Room for error when specifying models
- Can only support a linear relationship among variables
- Values are predicted from other variables: no novel information is added
- Imputation of the outcome: MI is not recommended if the missing data are in the outcome

Pitfalls in multiple imputation analyses
Specify prediction models correctly
- A BMJ article from 2007 reported the development of a tool for cardiovascular risk prediction.
- In the published prediction model, cardiovascular risk was found to be unrelated to cholesterol.
- In complete-case analysis there was a clear association between cholesterol and cardiovascular risk.

To impute or not to impute?
- Imputation can change the distribution of the data, adding skewness or reducing variability, and so break the assumptions of the analysis methods applied afterwards (e.g. average, LOCF, fixed value imputation, most common value).
- Examine the distribution of the data before and after filling in missing values.
- An ideal solution would yield distributions that are similar in shape.
- Think of the purpose: if the data will simply be used to create an aesthetically pleasing visualization without "holes", it’s no big deal.
  On the other hand, if the data will be used to generate official statistics, the impact of filling in missing values must be carefully examined and clearly understood.

Good Practices in communicating missing data
- Report the amount of missing data in your analyses.
- Where complete case and multiple imputation analyses give different results, the analyst should attempt to understand why, and this should be reported in publications.
- Communicate to your audience that you have filled in missing values.
- Describe the method and state any assumptions.
- Address the potential impact of missing data on the findings in the Discussion section.
- Perform sensitivity analyses to assess how sensitive results are to reasonable changes in the assumptions that are made.

Sensitivity analysis
- Sensitivity analysis: the study of how the uncertainty in the output of a model can be allocated to the different sources of uncertainty in its inputs.
- Evaluate the robustness of the results to deviations from the MAR assumption.
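One common way to probe deviations from the MAR assumption is a delta adjustment: impute under MAR, shift every imputed value by a fixed amount delta, and see how much the estimate moves. The slides do not name a specific technique, so the sketch below is an assumption about how such a check might look, not the presenter's method. The data, the regression model, the missingness probabilities and the grid of delta values are all hypothetical, and the imputation step is a simplified stochastic regression imputation rather than a full multiple imputation procedure (it ignores uncertainty in the regression coefficients).

```python
import random
import statistics

random.seed(30)

# Hypothetical data: x is fully observed; y is missing more often when x is large (MAR).
n = 5000
x = [random.gauss(0, 1) for _ in range(n)]
y_true = [2.0 + 1.5 * xi + random.gauss(0, 1) for xi in x]
missing = [random.random() < (0.6 if xi > 0 else 0.1) for xi in x]
frac_missing = sum(missing) / n

# Fit y ~ x by least squares on the complete cases.
obs = [(xi, yi) for xi, yi, m in zip(x, y_true, missing) if not m]
mx = statistics.fmean([p[0] for p in obs])
my = statistics.fmean([p[1] for p in obs])
sxx = sum((p[0] - mx) ** 2 for p in obs)
slope = sum((p[0] - mx) * (p[1] - my) for p in obs) / sxx
intercept = my - slope * mx
resid_sd = statistics.stdev([p[1] - (intercept + slope * p[0]) for p in obs])


def imputed_mean(delta, n_imputations=20):
    """Mean of y after stochastic regression imputation, with each imputed
    value shifted by delta, averaged over several imputed datasets."""
    means = []
    for _ in range(n_imputations):
        completed = [
            intercept + slope * xi + random.gauss(0, resid_sd) + delta if m else yi
            for xi, yi, m in zip(x, y_true, missing)
        ]
        means.append(statistics.fmean(completed))
    return statistics.fmean(means)


# delta = 0 corresponds to the MAR analysis; other values mimic MNAR departures.
deltas = [-1.0, -0.5, 0.0, 0.5, 1.0]
results = {d: imputed_mean(d) for d in deltas}

print(f"Fraction of y missing: {frac_missing:.2f}")
for d, est in results.items():
    print(f"delta = {d:+.1f}  estimated mean of y = {est:.3f}")
```

If the conclusions hold across the range of delta values considered plausible, the result can be reported as robust to those departures from MAR. If they change, that dependence is worth stating alongside the main analysis.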