IMPUTING MISSING VALUES IN TIME SERIES OF COUNT DATA USING HIERARCHICAL MODELS

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the

Graduate School of The Ohio State University

By

Clint D. Roberts, M.S.

*****

The Ohio State University

2008

Dissertation Committee:

Elizabeth Stasny, Adviser
Peter Craigmile, Adviser
Catherine Calder
Michael Maltz

Approved by

Adviser, Graduate Program in Statistics

© Copyright by

Clint D. Roberts

2008

ABSTRACT

The Uniform Crime Reports, collected by the FBI, contain monthly crime counts for each of the seven Index crimes, but for one reason or another, a police agency may miss reporting for a particular month. The data are therefore incomplete, hence the need for the development of an imputation procedure to fill in the gaps. Since the early 1960’s, an imputation technique implemented by the FBI has been used to make annual crime count estimates. This approach ignores concerns of seasonality and does not make use of the agencies’ long-term data trends. Computing power has radically improved since the 1960’s, and it is now feasible to develop a more precise imputation method that can incorporate more information into our estimation procedure. A model-based approach also has the added value of making uncertainty estimates available for the imputed data. We describe a method which uses one of three different models depending on an agency’s average monthly crime count for a particular crime. For small crime counts, we impute the mean and we assume the data are Poisson for the variance estimates. For large crime counts, we consider a time series SARIMA (Seasonal Auto-Regressive Integrated Moving Average) model. For intermediate crime counts, we use a Poisson Generalized Linear Model (GLM). Hierarchical Bayesian models are used to obtain improved imputations for missing data that borrow strength from many UCR series. Information about population growth contained in the crime counts is thought to result in improved imputations when using time series from other agencies.

Dedicated to Rachel.

Through her love she lifted me up and made me stronger than I am.

ACKNOWLEDGMENTS

First, I would like to give a special thank you to my co-advisers, Professor Elizabeth Stasny and Professor Peter Craigmile. Both of my co-advisers have given me tremendous support and helpful feedback. I thank Professor Elizabeth Stasny for introducing me to research in crime data and for her consistent encouragement. I thank Professor Peter Craigmile for challenging me to exercise my mind, and for teaching me how to keep research alive and fresh, while strengthening my knowledge of the foundations.

I am also grateful to Professor Catherine Calder for serving on my exam committee. She gave me my first real exposure to applied Bayesian analysis, and she has provided me with several helpful comments.

I thank Professor Michael Maltz. It was an honor and privilege for me to work with Dr. Maltz on a funded research project in 2005–2006. He has done much work to clean the UCR data, and to analyze the missingness in the UCR. For my research, he has provided instrumental expertise in the UCR, and great insights into imputation strategies.

The joint work of Professor Stasny, Professor Maltz, and myself, in 2005–2006, was supported in part by the American Statistical Association Committee on Law and Justice Statistics and the Bureau of Justice Statistics.

VITA

September 26, 1981 ...... Born - Bakersfield, CA

2003 ...... B.S. Statistics, California Polytechnic State University, San Luis Obispo

2005 ...... M.S. Statistics, The Ohio State University, Columbus

FIELDS OF STUDY

Major Field: Statistics

TABLE OF CONTENTS

Page

Abstract ...... ii
Dedication ...... iii
Acknowledgments ...... iv
Vita ...... v
List of Tables ...... ix
List of Figures ...... x

Chapters:

1. INTRODUCTION AND PROJECT BACKGROUND ...... 1
   1.1 Uniform Crime Reports ...... 2
       1.1.1 History ...... 2
       1.1.2 Issues with the UCR ...... 4
       1.1.3 Accessing the Data ...... 8
   1.2 Imputation Methods ...... 10
       1.2.1 Simple Imputation Methods ...... 12
       1.2.2 FBI Imputation Methods ...... 13

2. PRELIMINARY ANALYSIS ...... 17
   2.1 Time Series Analysis ...... 18
       2.1.1 Classical Decomposition Model ...... 22
       2.1.2 Time Series Models Based on Differencing ...... 26
   2.2 Explanatory Variables ...... 31
   2.3 Missingness in the UCR ...... 35
       2.3.1 Explaining Missingness ...... 36
       2.3.2 Patterns of Missingness ...... 38
       2.3.3 Missing Data Assumptions ...... 41

3. LITERATURE REVIEW ...... 43
   3.1 Poisson Regression ...... 44
       3.1.1 Modeling Rates ...... 49
       3.1.2 Over-dispersion ...... 49
   3.2 Time Series for Count Data ...... 51
       3.2.1 Observation-driven Models ...... 54
       3.2.2 Parameter-driven Models ...... 57
       3.2.3 Generalized Additive Models ...... 59
   3.3 Missing Data in Time Series ...... 61
       3.3.1 Modeling ...... 61
       3.3.2 Imputation ...... 62

4. A METHOD FOR IMPUTING ...... 64
   4.1 Imputation Procedure ...... 67
       4.1.1 The Forecast-Backcast Method ...... 67
       4.1.2 Evaluating the Combined Estimator ...... 68
       4.1.3 Imputation Algorithm ...... 74
   4.2 The Three Models ...... 77
       4.2.1 Large Crime Counts ...... 77
       4.2.2 Intermediate Crime Counts ...... 82
       4.2.3 Small Crime Counts ...... 87
   4.3 Special Cases ...... 90
       4.3.1 Covered-By Cases ...... 91
       4.3.2 Aggregation ...... 92

5. BAYESIAN MODELING ...... 98
   5.1 Bayesian Inference ...... 100
   5.2 Model Construction ...... 102
       5.2.1 Specification of Hierarchical Structure ...... 104
       5.2.2 Covariate Selection ...... 107
       5.2.3 Modeling Autocorrelation ...... 115
       5.2.4 Specification of Prior Distributions of the Parameters ...... 119
   5.3 A Proposed Model ...... 127
   5.4 Checking the Model ...... 131

6. BAYESIAN IMPUTATION ...... 136
   6.1 A Model for Missing Data ...... 137
   6.2 Missing Data Methods ...... 140
       6.2.1 Data Augmentation Algorithm ...... 140
       6.2.2 The Gibbs Sampler ...... 142
       6.2.3 Multiple Imputation ...... 143
   6.3 Evaluating the Method ...... 148
       6.3.1 Worst Case Scenario ...... 152

7. CONCLUSION AND FURTHER WORK ...... 157
   7.1 Final Remarks on Modeling ...... 158
   7.2 Final Remarks on Imputations ...... 162
   7.3 Further Work ...... 164

Bibliography ...... 173

LIST OF TABLES

Table                                                                   Page

1.1  FBI Classification of Population Groups ...... 10
1.2  Part I Offense Classifications ...... 11
1.3  FBI Imputation Procedure ...... 15
4.1  Comparison of three imputation methods in Columbus example ...... 81
4.2  Comparison of three imputation methods in Saraland example using a GLM model ...... 86
4.3  Comparison of three imputation methods in Saraland example using a SARIMA model ...... 87
6.1  Comparison of three imputation methods on 23 years of missing data ...... 152

LIST OF FIGURES

Figure                                                                  Page

2.1  Time series for larceny counts in Columbus, OH ...... 18
2.2  Time series for log larceny counts in Columbus, OH ...... 24
2.3  Classical Decomposition Algorithm showing the steps taken to remove trend and seasonality from the series of Columbus log larceny counts ...... 25
2.4  ACF and PACF for series adjusted by CDA ...... 26
2.5  ACF and PACF for MCD series of larceny counts in Columbus, OH ...... 28
2.6  Residual plots of the SARIMA model of log larceny counts in Columbus, OH ...... 30
2.7  Most correlated agencies in Iowa. Plotted are monthly NDX crimes. Pearson correlations are shown in lower diagonal ...... 34
2.8  Number of cases of different gap sizes from missing data. The horizontal axis is scaled logarithmically to highlight the shorter runs. (Produced by Michael Maltz) ...... 39
2.9  Number of cases of different gap sizes from missing data for different FBI Groups, plotted on logarithmic scales. (Produced by Michael Maltz) ...... 40
4.1  Murder, robbery, and larceny counts from Cleveland Heights, OH ...... 66
4.2  Dashed line represents average forecast-backcast MSPE of one missing value in AR(1) process. Solid line represents the MSPE from using the BLUP of Yk given all the data ...... 71
4.3  Dashed line represents average forecast-backcast MSPE of missing value in gap in AR(1) process. Solid line represents the MSPE from using the BLUP of Yk given all the data ...... 73
4.4  Incomplete series of vehicle theft counts from Allen, OH ...... 75
4.5  Series of log larceny counts in Columbus, OH, with dashed lines delineating missing period ...... 79
4.6  Forecasting/Backcasting of missing values of log larceny counts in Columbus, OH. The line represents original data, and the circles are estimated from the model. Plus signs represent 95% prediction bounds ...... 80
4.7  Forecast/Backcast of missing values of larceny counts in Columbus, OH ...... 81
4.8  Burglary counts in Saraland and Anniston, Alabama, with dashed lines delineating missing period ...... 84
4.9  Forecasting/Backcasting of missing values of burglary counts in Saraland, AL. The line represents the actual data, and circles are estimates from fitted model ...... 85
4.10 Murder counts in Waterloo, Iowa ...... 88
4.11 Larceny counts with aggregation in Mobile, AL. Imputed values are represented by points in bottom plot ...... 94
4.12 Index crime counts for Cleveland, Ohio. Index crime counts are represented by the solid line. In bottom plot, predictions based on modeling Index crime series are represented by dotted line, and predictions based on modeling individual crime series are represented by dashed line ...... 96
5.1  Diagram of Hierarchy in UCR ...... 105
5.2  First plot shows annual population estimate, and second plot shows monthly larceny counts for the same agency during the same time period ...... 109
5.3  LOESS curves of twelve crime series in MN, for each of the three property Index crimes ...... 111
5.4  Fitting mean crime on Hermite polynomials for time in Bemidji, MN ...... 113
5.5  Fitting on burglary counts in Robbinsdale, MN ...... 115
5.6  Sample ACF and sample PACF plots of the burglary counts in Robbinsdale, MN ...... 119
5.7  Sample ACF and sample PACF plots of the residuals of the model for burglary counts in Robbinsdale, MN ...... 120
5.8  Plots show correlations among alpha coefficients for the three different crimes using the thirteen cities in MN. Pearson correlations are shown in lower triangle ...... 123
5.9  Boxplots grouped by crime type show the distributions of the model coefficients for the thirteen agencies in MN. The last plot shows the ungrouped boxplots of the beta coefficients on the same scale ...... 124
5.10 Beta(5,10) ...... 126
5.11 The solid line shows posterior for the distribution of each µt (t = 1, 2, ..., 516) on burglary series for Robbinsdale, MN. (Actual values shown as circles.) ...... 132
5.12 Posterior summaries of mean alpha (µα) coefficients for each crime. X marks the estimate from the independent Poisson GLMs ...... 133
5.13 Posterior summaries of mean beta1 (µβ(1)) coefficients for each agency. X marks the estimate from the independent Poisson GLMs ...... 133
5.14 Sample ACF plot of residuals from Bayesian model, represented by boxplots ...... 134
6.1  The first plot shows the larceny crime series for St. Louis Park, Minnesota, with data from 1988 removed. The second plot shows posterior predictive distributions for the missing data with the actual removed data represented by the dotted line ...... 145
6.2  There are six sets of imputations shown here for St. Louis Park, MN. The imputed values are represented by circles, and the actual data, removed, is represented by the dotted line ...... 147
6.3  Removed data in 2000 is represented by dotted line. The second plot shows posterior distributions of the missing values for 2000 ...... 149
6.4  Predictions from the three methods in Crystal’s larceny series. Actual crime counts for 1976 are represented by the dotted line ...... 151
6.5  Imputation in Richfield, MN. Posterior means are represented by circles. 95% posterior intervals are represented by plus signs. Actual data removed is represented by dotted line. Boxplots show posterior distributions of missing values based on 2500 samples ...... 154
6.6  In the first plot, Fall crime counts are labeled by “F” and Winter crime counts are labeled by “W”. In the second plot, September crime counts are labeled by “S” and January crime counts are labeled by “J” ...... 155

CHAPTER 1

INTRODUCTION AND PROJECT BACKGROUND

The statistical definition of a census is a survey conducted on the full set of units in a population. Completeness in reporting is fundamental to this type of survey, especially when each unit’s responses are of interest. When data are missing, there are a number of ways of dealing with the problem. Our desire is to create a completed data set for public use. This requires imputation, which is the substitution of plausible values for missing values. Of course, the imputed values should not be considered truth, and accurate measures of uncertainty associated with the imputed values will provide a more honest picture of the data. This dissertation is a study of imputation methods in a monthly census where responses are in the form of counts.

Our research, in other words, is concerned with estimating missing values in multiple time series of counts. The motivating example, used throughout this report, deals with the problem of filling in the gaps in the Uniform Crime Reports (UCR).

In this chapter, we will describe the UCR data set in Section 1.1, and in Section 1.2 we will introduce some basic imputation methods along with the current imputation methods used in the UCR. In Chapter 2 we will describe our preliminary analysis of the time series data in the UCR and our exploration of the missingness in the UCR.

In Chapter 3 we will review time series models for count data, and missing data methods for time series. In Chapter 4 we describe our initial imputation procedure, which uses three different models. Later, in Chapters 5 and 6, we will take a Bayesian approach to develop an imputation method for the missing data in the UCR. Finally, we summarize our conclusions and discuss further work in Chapter 7.

1.1 Uniform Crime Reports

The Uniform Crime Reporting system in the United States is one of the largest and oldest sources of social data. It consists, in part, of monthly counts of crimes for over 18,000 police agencies throughout the country. Crime data is transmitted from police agencies to the Federal Bureau of Investigation (FBI) which publishes the compiled crime statistics annually in Crime in the United States (CIUS). The purpose of the UCR is to provide reliable crime statistics for use in our nation’s law enforcement systems [FBI, 2000].

1.1.1 History

In 1927, the International Association of Chiefs of Police (IACP) created the

Committee on Uniform Crime Records to develop a system for keeping the national crime statistics. The committee determined seven primary crimes for which they would compute crime rates: murder and non-negligent manslaughter, forcible rape, robbery, aggravated assault, burglary, larceny, and motor vehicle theft. These crimes were chosen because they are considered serious crimes that are likely to be reported by police. In addition to these seven serious crimes, known as Part I Index crimes (we will capitalize “Index” when we refer to these crimes), there are twenty-one other crimes known as Part II crimes (e.g. embezzlement, gambling, prostitution). In 1929, the committee founded the Uniform Crime Reporting program. The FBI was given the task of collecting and publishing the statistics in 1930. The first report, in January of 1930, included data from 400 cities in 43 states covering approximately 20 million people, only a portion of the total U.S. population [FBI, 2000]. When Congress passed the Anti-Arson Act of 1982, arson was added to become the eighth Part I crime, but it remains a crime for which it is difficult to acquire accurate data.

In 1985, a progressive conversion to the National Incident-Based Reporting System

(NIBRS) began. In the NIBRS, each crime is described in multiple records with more

specificity. With this enhanced UCR program, detailed information is gathered about

the incident, victim, property, offender, and arrestee [FBI, 2000]. One major problem

with the UCR that is resolved by NIBRS is the UCR’s hierarchy rule where only the

most serious crime is counted in one incident [Maltz, 1999]. The hierarchy rule was

instituted to prevent double-counting of crimes, but it actually ignores crime. For

example, a robbery resulting in death is counted only as murder. Imputing the ignored

crimes would go against UCR reporting guidelines, so we will not be concerned with

making imputations for these crimes.

For police agencies that use NIBRS, the FBI extracts summary UCR data from the NIBRS. As of September 2007, approximately 25% of the U.S. population is policed by agencies that report using NIBRS [JRSA, 2006]. Thirty-one states have been certified to report NIBRS, and according to the FBI, only five states (Alaska,

Florida, Georgia, Nevada, Wyoming) have no current plan to convert to NIBRS. Over

90% of all agencies, however, submit UCR data. Thus, despite the advantages of the new system, there will always be a need for the summary data that the UCR provides, as we now discuss.

1.1.2 Issues with the UCR

For a variety of research, planning, and administrative purposes, the UCR data is used by law enforcement officials, legislators, city planners, sociologists, and criminol- ogists. The UCR data is used to study fluctuations in crime levels and relationships between crimes rates and other variables. Upon finding explanations for the behavior of crime patterns, it is not unlikely for policy changes to be made or new programs to be implemented in an attempt to reduce crime in a particular city. Funding for resource allocation, and patrol distribution by local police departments are just two types of decisions that are made with the use of the UCR data [Lyons et al.,

2006]. There are, however, limitations in the UCR that are not always brought into consideration when interpreting the data. These misapplications can lead to flawed conclusions and inappropriate policy decisions.

Many groups with an interest in crime (e.g. news media, tourism agencies) use the UCR statistics to rank cities. The FBI publishes a warning against comparing cities and counties in this way in its annual issue of CIUS: “These rankings lead to simplistic and/or incomplete analyses which often create misleading perceptions.”

The FBI [2000] advises examining all variables (not just population) that affect crime counts before making comparisons among cities.

There are several other major issues with the UCR that cause much debate among criminologists. One, for example, is the issue of misclassification of crime types. Nolan et al. [2006] have studied the problem of misclassification in the UCR on the part of the reporting agencies, and introduce a methodology for assessing the statistical accuracy of UCR crime statistics. In this section, we discuss two issues important to understanding the UCR’s purpose and its limitations. The first issue concerns whether or not the UCR is the best measure of crime in the United States. The second issue, which relates to the focus of this dissertation, is the problem of missing gaps in the UCR.

The Two Crime Measures

One of the most misunderstood issues with national crime statistics is the dis- crepancy that exists between the UCR and the National Crime Victimization Survey

(NCVS). The UCR and the NCVS are the U.S. Department of Justice’s two national measures of crime. When annual statistics are released by the two systems, often they give similar crime estimates, but occasionally the results are in disagreement.

For example, in 2001, statistics from the UCR suggested that violent crime, after declining in the past few years, had stabilized in 2000, but statistics from the NCVS suggested that violent crime had dramatically fallen 15% in 2000 [Rand and Rennison,

2002]. When differences between these two systems occur, it causes much confusion among criminologists. Both systems measure the same general set of crimes, so there is an expectation for them to give similar statistics. There are significant differences, however, in the purposes and methods between the two programs.

The NCVS was designed in 1972 to be a gauge for UCR statistics, but also to fill in characteristics of crime that are lacking in the UCR. The NCVS is a survey conducted from a large sample of U.S. households, and it measures crime that people have expe- rienced whether or not they reported the crime to the police. The UCR, on the other hand, measures only crime reported to police agencies across the country. Victims do not report crimes to the police for various reasons including fear, embarrassment, mistrust in the police, or belief that reporting would not result in any police action.

Since it is a survey, the NCVS is subject to sampling error and to non-sampling error, like response error. Another difference is in how the UCR and NCVS count crimes.

The hierarchy rule in the UCR permits only the most serious crime in one incident to be counted. The NCVS can count each crime to each victim in the sample in a particular incident. Unlike the UCR, the NCVS does not count crimes committed against children under age 12, businesses or organizations, or people who are not permanent residents of the United States. The NCVS does not measure murder or arson occurrences. Furthermore, there are slight differences in how the two programs define some crimes. Understanding these fundamental differences between the UCR and NCVS will go a long way in reconciling the discrepancies in reported results. At the beginning of each issue of the CIUS, the FBI cautions UCR data users against comparing crime trends with statistics in NCVS. Though it may not be meaningful to compare estimates in the two systems, it can be highly valuable to use the two systems in conjunction to understand the crime problem.

Both programs are important, and when considered together they give us a fuller knowledge of the characteristics of crime in the United States. The NCVS is essential because as much as two-thirds of property crime and one half of violent crime are not reported to police. The UCR is essential because it is the only system that tells us how much crime has been reported to police throughout the country and it is the only national crime data series that provides information by state and by city each month [Rand and Rennison, 2002]. In the appendices of each publication of CIUS, the FBI gives an explanation of the necessity of both crime measures. The following is an excerpt of FBI [2000]:

Because the UCR and NCVS programs are conducted for different purposes, use different methods, and focus on somewhat different aspects of crime, the information they produce together provides a more comprehensive panorama of the Nation’s crime problem than either could produce alone.

The UCR, therefore, is a reliable and necessary measure of crime in the United States.

The enduring significance of the UCR motivates the continuing effort to improve its quality as a data source.

Gaps in the UCR

The UCR is a voluntary reporting system, so it’s not surprising that the data are incomplete. Some states have instituted laws requiring agencies within those states to provide crime reports to the FBI, but usually there is no penalty for not doing so [Maltz, 1999]. Still, most agencies comply with submitting reports to the FBI.

In 2003, UCR data were compiled from agencies representing 93% of the population in 47 states. There are many reasons why an agency may miss reporting occasional months, strings of months, or even whole years. Computer problems, budget problems, changes in record management systems, personnel shifts or shortages, natural disasters, and a number of other reasons account for the gaps in the UCR [Maltz and Targonski, 2002]. Even the rarity of crime has caused some agencies to neglect

filling out crime reports. Aside from reasons for non-reporting, cases of data entry errors or mis-reporting may occur, and incorrect data must be identified and edited or excluded.

Gaps in reporting do not prevent the UCR from being used in the media and in research studies [Maltz and Weiss, 2006]. As an example, county-level UCR data were used in a study on the effect of laws permitting the carrying of concealed weapons on homicide and other violent crimes. Soon after this study was completed, the potential effect of gaps in the data used by this study was analyzed more closely, leading to the

conclusion that the data could not support the study’s findings [Maltz and Targonski,

2002]. In this example, analyses of inadequate data were used to persuade states to

reform policies on carrying concealed weapons. Not only have policy-oriented studies

been flawed, but numerous research studies published in criminology journals have

used UCR data that is inaccurately assumed to be sufficient [Maltz and Weiss, 2006].

Clearly, this is a compelling reason for the use of missing data techniques to fill in

the gaps in the UCR.

1.1.3 Accessing the Data

Researchers have increasingly been analyzing UCR data at sub-national levels, largely due to the greater accessibility of the data [Maltz, 1999]. The UCR statistics have been available to the public since the FBI began its annual publication of CIUS in 1930, but now the data is easily accessible from the internet. UCR data sets can be downloaded at no cost from the National Archive of Criminal Justice Data

(NACJD), which is part of the Inter-University Consortium for Political and Social

Research (ICPSR) at the University of Michigan [Maltz and Weiss, 2006]. The data are also accessible from the FBI website and the Bureau of Justice Statistics (BJS) website.

Maltz and Weiss [2006] worked to provide a clean UCR dataset in Excel. These researchers were concerned with making monthly data available and usable for the criminal justice community, so that issues of seasonality could be addressed and so that changes in crime over time could be clearly seen for each jurisdiction. The data starts at 1960 because that is the earliest year for which electronic records are available. Using a set of macros in Excel, Maltz and Weiss [2006] have developed a

UCR charting utility to allow researchers to view the time series of a specified state, agency, and crime. By giving viewers time series plots of the crime data, a clear picture of the variability in the data can be observed.

In the clean dataset, each state has its own Excel workbook. In each workbook, there are several worksheets of information. Each agency is identified by its ORI (Originating Agency Identifier), which is a code consisting of seven characters beginning with the state’s postal code. Each agency also has annual population estimates and annual population group classifications (we will use capitalized “Group” to refer to this classification). Although there are conceivably many other population characteristics related to crime, population size is the only correlate of crime that CIUS utilizes [FBI, 2000]. Table 1.1 shows how the FBI classifies agencies based on their population size. Groups 6–9 represent both cities and other types of jurisdictions, including university, county, and state police agencies, as well as fish and game police, park police, and other public law enforcement organizations. Most (but not all) of these “other types” of jurisdictions are called “zero-population” agencies by the FBI, because no population is attributed to them. This is because, for example, counting the population policed by a university police department or by the state police would be tantamount to counting that population twice. Thus, the crime count for these zero-population agencies is added to the crime count for the other agencies to get the total crime count: for the city, in the case of the university police, and for the state, in the case of the state police.

Population Group               Political Label   Population
1                              City              250,000 and over
2                              City              100,000 to 249,999
3                              City              50,000 to 99,999
4                              City              25,000 to 49,999
5                              City              10,000 to 24,999
6                              City (a)          Less than 10,000
8 (Nonmetropolitan County)     County (b)        N/A
9 (Metropolitan County)        County (b)        N/A

Note: Group 7, missing from this table, consists of cities with populations under 2,500 and universities and colleges to which no population is attributed. For compilation of CIUS, Group 7 is included in Group 6.
(a) Includes universities and colleges to which no population is attributed.
(b) Includes state police to which no population is attributed.

Table 1.1: FBI Classification of Population Groups

Besides crime counts for the seven Index crimes, counts for subcategories of crimes are also available. For example, the vehicle theft Index is the sum of auto thefts, truck thefts, and thefts of other vehicles. These Part I offenses are shown in Table 1.2 as defined in the UCR handbook [FBI, 2004]. The sum of the seven Part I crimes is labeled NDX. All analyses described in this dissertation use the clean dataset Maltz has provided.

1.2 Imputation Methods

A missing datum is a month in which no counts have been provided. There are

an endless variety of possible ways to substitute a missing datum with a “plausible”

value. In an individual case, some values may seem to be more plausible than others.

Ideally, we want our imputation methods to make reasonable sense, to be simple

to implement, and to produce precise and accurate estimates of the missing values.

Label   Index Crime                        Subcategories
MUR     Criminal Homicide                  Murder and non-negligent manslaughter; Manslaughter by negligence
RPT     Forcible Rape                      Rape by force; Attempt to commit forcible rape
RBT     Robbery                            Firearm; Knife or cutting instrument; Other dangerous weapon; Strong-arm (hands, fists, feet, etc.)
AGA     Aggravated Assault                 Firearm; Knife or cutting instrument; Other dangerous weapon; Strong-arm (hands, fists, feet, etc.)
BUR     Burglary - Breaking or Entering    Forcible entry; Unlawful entry (no force); Attempted forcible entry
LAR     Larceny - Theft
VTT     Motor Vehicle Theft                Autos; Trucks and buses; Other vehicles

Table 1.2: Part I Offense Classifications

In this section, we begin considering various simple single imputation methods as a starting place for choosing a solution to our problem of imputing for missing data in the UCR. We also describe what the FBI has done to handle missing data.

Little and Rubin [2002] consider two classes of single imputation methods: those based on explicit modeling and those based on implicit modeling. Imputation based on implicit modeling is typically much simpler, and involves implementing an algorithm. Imputation based on explicit modeling usually involves imputing the mean of, or a draw from, the predictive distribution. Only single imputation methods are discussed in this section but multiple imputation will be discussed in Chapter 4.

1.2.1 Simple Imputation Methods

In our work with missing data in the UCR, we will consider imputation methods that use information from other responding agencies, that use the longitudinal structure in the data, and that use available covariates. We, therefore, consider methods in the nonresponse literature that address these issues.

In survey analysis, a common method for dealing with item non-response is hot deck imputation [Little and Rubin, 2002]. This method uses the observations from “similar” individuals in the same dataset to donate values to impute for missing observations. The most complicated part of this method is defining the criteria for the “similar” individuals. Nearest neighbor imputation is the name for using the closest hot deck donor. We must be careful with using hot deck strategies in the UCR because our goal is to create a complete crime series for each individual agency. Typical use of hot deck imputation in surveys is not necessarily concerned with providing the best estimates for individual respondents, but rather with providing an accurate picture of the population as a whole. An advantage of this method, however, is that the imputations are real possible outcomes.

In longitudinal studies, where data are missing because of dropouts, the simplest imputation method is the “last observation carried forward” (LOCF) method [Shao and Zhong, 2003], in which the observation which precedes a missing datum (or missing period of data) replaces all of the missing values following it. Like hot deck imputation, LOCF is based on a model, though implicit in its specification. This method makes the simple assumption that an individual’s outcome remains unchanged after leaving the study. Problems with this method are that the means are biased, the covariance structure is distorted, and the measures of precision are usually underestimated [Ting and Brailey, 2005]. An advantage of this method, and similarly “next observation carried backward” (NOCB), is that these methods are easy to understand and to communicate. Since modeling approaches have become available, use of LOCF has been discouraged.
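As a concrete illustration of these carried-forward and carried-backward rules, the short Python sketch below applies them to a toy monthly series with pandas; the series values and index are hypothetical, not UCR data.

```python
# A minimal sketch of LOCF and NOCB on a toy monthly count series
# (hypothetical values), using pandas' forward- and backward-fill operations.
import numpy as np
import pandas as pd

counts = pd.Series([12, 15, np.nan, np.nan, 9, 11],
                   index=pd.period_range("2000-01", periods=6, freq="M"))

locf = counts.ffill()   # last observation carried forward
nocb = counts.bfill()   # next observation carried backward
print(pd.DataFrame({"observed": counts, "LOCF": locf, "NOCB": nocb}))
```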

An imputation method that is somewhat more complicated is regression imputation. This type of imputation is based on explicit regression models of the missing values on observed values for an individual. Mean imputation is a special case of regression imputation. Generally speaking, an imputation method uses a predictive distribution to impute. This distribution can be used in different ways, but the most common ways are imputing a mean and imputing a draw. If the model uses the predicted values as imputations, then it is basically imputing a mean. On the other hand, we can use the model to obtain draws from the predictive distribution for imputation, thus performing regression imputation stochastically.
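The following Python sketch contrasts the two uses of the predictive distribution just described, imputing the predicted mean versus imputing a stochastic draw, for a simple linear regression fit on simulated data; the variable names and simulated model are our own illustrative assumptions.

```python
# Mean imputation versus a stochastic draw from the predictive distribution,
# under a simple linear regression fit to simulated data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=200)
missing = rng.random(200) < 0.2                      # flag 20% of y as missing

# Fit y ~ x on the observed cases only.
X_obs = np.column_stack([np.ones((~missing).sum()), x[~missing]])
beta, *_ = np.linalg.lstsq(X_obs, y[~missing], rcond=None)
resid_sd = np.std(y[~missing] - X_obs @ beta, ddof=2)

X_mis = np.column_stack([np.ones(missing.sum()), x[missing]])
mean_imputation = X_mis @ beta                                  # impute the mean
draw_imputation = mean_imputation + rng.normal(scale=resid_sd,  # impute a draw
                                               size=missing.sum())
```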

Our own imputation methods will build on these basic strategies. Increasing the complexity in imputation methods may result in better estimates of the missing values. We give a full discussion of our imputation methods in Chapters 4 and 6.

1.2.2 FBI Imputation Methods

In 1958, when the FBI felt they had enough coverage with the UCR to provide national-level crime estimates, they decided to use a simple imputation method to fill in the gaps in the UCR. Their primary concern has been annual crime count estimates at the state, regional, and national level. In fact, the FBI does not publish imputed data at the county or local levels [Maltz, 1999]. Gaps in the data occur at the agency level, when a police department misses reporting one or more months, but the FBI imputes only to the state, regional, or national level.

The FBI’s imputation approach is based on the current year’s reporting and involves two procedures. If three or more months’ crimes were reported during the year, the estimated crime count for the year is $C_A \cdot 12/M$, where $M$ is the number of months reported and $C_A$ is the total number of crimes reported during the $M$ months (Table 1.3). For example, if Navajo, Arizona reported two robberies during January through November in 2000, then the estimate for the year would be $2 \cdot 12/11 = 2.18$ robberies.

If fewer than three months were reported in a particular year, the estimated crime count is based on “similar” agencies in the same state. A “similar” agency is one that meets three selection criteria: it must be in the same state, the same Group (as defined in Table 1.1), and it must have reported data for all 12 months. The crime rate for these similar agencies is then computed (the total crime count is divided by the total population), and this rate is multiplied by the population of the agency needing imputation, to estimate its annual crime count ($P_A \cdot C_S/P_S$). This procedure resembles the hot deck imputation method described previously. Instead of imputing from a nearest neighbor, this method uses information from all “similar” reporting units.

If a “zero-population” police agency does not report fully for the UCR, no imputation is made by the FBI to compensate for the missing data [Maltz and Targonski, 2002]. Clearly, this can lead to underestimating the crime totals at the county and state levels.

Number of Months Reported by Agency    Formula for Imputation
0 to 2                                 P_A · C_S / P_S
3 to 11                                C_A · 12 / M

C_A, P_A: the agency’s crime count and population, respectively, for the year
C_S, P_S: the crime and population counts of similar agencies, for the year
M: the number of months reported, for the year in question

Table 1.3: FBI Imputation Procedure
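The two rules in Table 1.3 can be expressed as a small function. The Python sketch below is our own illustration of those rules; the function name and argument names are hypothetical, and the only figures reused from the text are those of the Navajo, Arizona example.

```python
# A sketch of the FBI's two annual imputation rules from Table 1.3.
def fbi_impute_annual(agency_count, months_reported,
                      agency_pop=None, similar_count=None, similar_pop=None):
    """Estimate an agency's annual crime count under the FBI rules."""
    if months_reported >= 3:
        # Three or more months reported: scale up to a full year, C_A * 12 / M.
        return agency_count * 12.0 / months_reported
    # Fewer than three months reported: apply the crime rate of "similar"
    # agencies, P_A * (C_S / P_S).
    return agency_pop * similar_count / similar_pop

# Navajo, AZ example from the text: 2 robberies reported in 11 months.
print(round(fbi_impute_annual(agency_count=2, months_reported=11), 2))  # 2.18
```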

Two drawbacks of the FBI’s imputation method are that it ignores seasonality and it only uses one year of data. The FBI’s imputation procedure for agencies that report at least three months assumes that the level of crime during the reported months is the same as the level of crime for the unreported months. This can cause significant bias, especially if there is a seasonal effect on crime in a particular area.

In the imputation procedure for agencies that report fewer than three months, there also can be bias, especially if missingness is correlated with low crime totals as one might expect.

A more appropriate imputation method might be to base the imputation on the agency’s long-term history, but the computing power needed to do this did not exist in the late 1950s and early 1960s, when the FBI method was implemented. Nor did the FBI change its method afterwards, when computing became cheaper and easier, probably preferring to maintain consistency in its reporting [Maltz, 1999].

Recognizing the need for better imputation procedures, the FBI and the BJS held a Workshop on UCR Imputation Procedures in 1997. In his report, “Bridging

Gaps in Police Crime Data,” based on discussions from the workshop, Michael Maltz describes the problems of missing data in the UCR and gives suggestions for improving the imputation procedures. In his discussion of imputation philosophy, Maltz [1999] recommends using a longitudinal estimation procedure instead of a cross-sectional one. It makes sense to base current crime predictions on past crime data for each individual agency since the autocorrelation of the data within an agency may be more significant than the cross-sectional correlation among similar agencies’ crimes.

Maltz [1999] also suggests that “an estimate for a missing month that is based on the same month last year is better than one based on the reported months for this year,” which implies that seasonality has a strong effect on crime counts for most agencies.

Interestingly enough, studies of crime seasonality are found in CIUS at the national and regional levels, but not at the local level.

We can incorporate seasonality into our crime estimates by imputing each month separately. By predicting each missing month’s value, using a model that conditions on the observed data in the UCR, we also eliminate the need for the FBI’s arbitrary three-month cut-point for imputing incomplete-reporting versus non-reporting agencies. Using both the agency’s history of data and similar agencies’ data is the driving strategy in our development of new imputation procedures for the UCR.

CHAPTER 2

PRELIMINARY ANALYSIS

Our preliminary analysis begins with a visual inspection of the time series plots of the UCR data. Practically speaking, the UCR is too big to examine all of the plots for each crime from each agency in each state, but with a sensible sampling of agencies across states, we can look at enough series to get a good idea of what to expect from the data source as a whole. An example of a time series from the UCR is the series of monthly counts of larceny from Columbus, Ohio shown in Figure

2.1. The clean UCR charting utility [Maltz and Weiss, 2006] makes the data easily accessible and viewable. Upon first inspection of a sample of crime series from the

UCR, we notice several features of the data. Since the data is indexed by time, it is not surprising that we are able to identify components commonly found in time series data, namely trends and seasonality (seen in Figure 2.1). Other features particular to the UCR, such as types of missing data and types of aggregated data, are also evident when exploring UCR crime series. Of further interest is the considerable amount of variability among crime series of different types and from different agencies.

In Section 2.1 we perform a basic time series analysis on an example from our data. The time series techniques here assume that the data are normal. This is not a reasonable assumption for most of the data in the UCR, but for series with large crime counts, the methods are appropriate. The purpose of the following classical time series analysis on UCR data is to introduce the modeling techniques commonly used for dependent data. We will use some ideas behind these strategies when we develop our time series models for count data.

Figure 2.1: Time series for larceny counts in Columbus, OH

In Section 2.2 we discuss various factors which influence crime and possible explanatory variables for inclusion in our data models. In Section 2.3 we explore the missingness in the UCR, describe its patterns, and validate the missing data assumptions necessary for our imputation techniques.

2.1 Time Series Analysis

In our study of the UCR, the purpose of time series analysis is twofold: to understand the stochastic process that is thought to generate the observed values, and then to predict missing observations, including at past and future time points. In this section, we focus on the first of these goals, that is, determining (or modeling) the stochastic process that generates the crime series in the UCR. Here we present the basics of time series modeling and we start to apply the modeling techniques to the UCR data.

Let us first take a moment to define notation. For a particular crime type and a particular agency, let $Y_t$ be the random variable representing the crime count at time $t$ and let $y_t$ be the realized value. Assume that $\{Y_t : t = 1, 2, \ldots, n\}$ are dependent random variables with a joint probability distribution. In the study of the UCR, the time indexes, $t$, represent months starting with $t = 1$ on January 1960, although this does not necessarily have to be the case. The sequence of random variables, $\{Y_1, Y_2, Y_3, \ldots\}$, indexed in time, is called a discrete-time stochastic process. Let the mean function for the process be given by
$$\mu_t = E(Y_t) \quad \text{for } t = 1, 2, 3, \ldots,$$
and the covariance function for the process be given by
$$\gamma_{t,s} = \mathrm{Cov}(Y_t, Y_s) \quad \text{for } t, s = 1, 2, 3, \ldots.$$
Then, a stochastic process is (weakly) stationary if the mean function is constant over time and the autocovariance function (ACVF), defined as $\gamma(h) = \mathrm{Cov}(Y_{t+h}, Y_t)$, is independent of $t$ for each lag $h$.

Let $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ be a white noise process. That is, $\{Z_t : t = 0, \pm 1, \pm 2, \ldots\}$ is a sequence of uncorrelated error terms with mean 0 and variance $\sigma^2$. Generally, the $\{Z_t\}$ are assumed to be normally distributed. A linear process is defined by Brockwell and Davis [2002] as a time series, $\{Y_t\}$, that can be represented by
$$Y_t = \sum_{j=-\infty}^{\infty} \psi_j Z_{t-j} \quad \text{for all } t,$$
where $\{\psi_j\}$ is a sequence of constants with $\sum_{j=-\infty}^{\infty} |\psi_j| < \infty$.

An important class of linear processes is the autoregressive moving average (ARMA) class developed by Box and Jenkins [1976]. ARMA models are very useful for stationary processes. If $\{Y_t\}$ is stationary, then an ARMA(p, q) process in Brockwell and Davis [2002] is given by
$$Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \cdots + \phi_p Y_{t-p} + Z_t + \theta_1 Z_{t-1} + \cdots + \theta_q Z_{t-q}. \qquad (2.1)$$

The model has a moving-average (MA) component and an autoregressive (AR) component. The MA component refers to the weighted average of $\{Z_t, \ldots, Z_{t-q}\}$, white noise random variables from a latent process. The AR component refers to the regression of the crime counts on themselves, the linear combination of $\{Y_{t-1}, \ldots, Y_{t-p}\}$. AR and MA parameters are represented by $\phi$ and $\theta$ respectively. The orders of the AR and MA components are $p$ and $q$ respectively.

Brockwell and Davis [2002] define causality of an ARMA(p, q) process as the condition that there exist constants $\{\psi_j\}$ such that $\sum_{j=0}^{\infty} |\psi_j| < \infty$ and
$$Y_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j} \quad \text{for all } t.$$
For the ARMA process, this condition is equivalent to the condition that
$$\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p \neq 0 \quad \text{for all } |z| \le 1.$$
This condition can easily be checked for any ARMA process. Causality becomes important for forecasting because the process should be defined independently of future $Z_t$.

Notice that the ARMA model has $p + q + 1$ parameters to be estimated (the $p$ AR parameters, the $q$ MA parameters, and the variance $\sigma^2$ of the white noise process $\{Z_t\}$). Choosing values for $p$ and $q$ is an important step in ARMA modeling, but it is often a subjective task. Examination of plots of the sample autocorrelation function (ACF) and the sample partial autocorrelation function (PACF) is one basic technique for order selection of an ARMA process. The MA(q) (ARMA(0, q)) moving average process and the AR(p) (ARMA(p, 0)) autoregressive process are relatively straightforward to identify using the sample ACF and PACF plots. Theoretically, the ACF of an MA(q) process is zero after q lags and the PACF of an AR(p) process is zero after p lags. The ACF of an AR process decays exponentially or sinusoidally, while the PACF decays exponentially or sinusoidally for an MA process. Difficulty arises in selecting the appropriate p and q values when we have an ARMA process involving both the AR and MA components. In this case, the ACF diminishes only after the first $q - p$ lags, and some trial and error is often required to determine p and q; ARMA order selection becomes more of an art than a science. Information-based criteria, such as Akaike's AIC [1973] and the BIC [1978], can be used to help select values for p and q. These criteria do not dictate order selection, but rather should be used as guidance. One general goal in model building is to keep the model as simple as possible. If possible, we should avoid large values of p and q, especially if we plan to use our model for forecasting. Because of the variance increase from estimating more parameters, and the possibility of over-fitting to the data, higher-order models tend to do poorly in forecasting.
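To make the order-selection step concrete, the Python sketch below simulates an ARMA(1,1) series and scans a small (p, q) grid by AIC using statsmodels; the grid, seed, and simulated parameters are illustrative assumptions rather than part of the UCR analysis.

```python
# AIC-guided ARMA order selection on a simulated series.
import numpy as np
from statsmodels.tsa.arima_process import arma_generate_sample
from statsmodels.tsa.arima.model import ARIMA

np.random.seed(1)
# ARMA(1,1) with phi = 0.7 and theta = 0.4 (lag polynomials include the zero lag).
y = arma_generate_sample(ar=[1, -0.7], ma=[1, 0.4], nsample=500)

best = None
for p in range(3):
    for q in range(3):
        fit = ARIMA(y, order=(p, 0, q)).fit()
        if best is None or fit.aic < best[0]:
            best = (fit.aic, p, q)
print("selected (p, q):", best[1:], "with AIC", round(best[0], 1))
```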

In our discussion of ARMA models thus far, we have assumed that our process is stationary. In many cases, the process is not stationary, but can be reduced, rather, to a stationary series of residuals which are modeled by ARMA techniques.

Trend and seasonality must be removed from the original series in order to obtain these stationary residuals. There are different ways of doing this, and we will briefly describe two methods and discuss the benefits of each.

2.1.1 Classical Decomposition Model

Basic time series elements are clearly identified in the classical decomposition model, which is a very generic form of time series model. Common to most time series models, the ultimate goal is to obtain stationary residuals that can be modeled by a stationary process such as the ARMA process. An advantage of this method is that the trend and seasonal components are estimated explicitly. This may or may not be important, depending on the application. In our situation, estimating trend and seasonality is not necessarily desirable as a long-term goal. Understanding these components, however, is beneficial in the exploratory stages of our analysis. The model, given by
$$Y_t = m_t + s_t + Z_t \quad \text{for } t = 1, 2, 3, \ldots, n, \qquad (2.2)$$
decomposes the original stochastic process, $\{Y_t\}$, into three parts. The trend component, represented by the function $m_t$, is a smooth, slowly changing function that can be modeled independently from the rest of the series. The seasonal component, $s_t$, is a periodic function of time with a specified period $d$. We assume $\sum_{t=1}^{d} s_t = 0$. The random noise component is represented by the stationary residuals, $\{Z_t\}$.

If (2.2) is appropriate, we can estimate its terms from the data via the classical decomposition algorithm (CDA). This involves first estimating $m_t$ and $s_t$, with the hope that the residuals will be a stationary process to be modeled by ARMA.
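A minimal Python sketch of this algorithm is given below, assuming the input is a complete pandas Series of monthly (log) counts; the helper name and the simulated demonstration series are our own, not code from the dissertation.

```python
# A sketch of the classical decomposition algorithm (CDA): a centered
# twelve-month moving-average filter estimates the trend m_t, and centered
# monthly means of the detrended series estimate s_t.
import numpy as np
import pandas as pd

def classical_decomposition(y, period=12):
    # Centered moving-average filter spanning one season estimates the trend.
    weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    trend = y.rolling(window=period + 1, center=True).apply(
        lambda w: np.sum(w * weights), raw=True)
    detrended = y - trend
    # Average the detrended values for each calendar month, then center the
    # twelve seasonal estimates so that they sum to zero.
    monthly = detrended.groupby(detrended.index.month).mean()
    monthly -= monthly.mean()
    seasonal = pd.Series(monthly.loc[y.index.month].values, index=y.index)
    residual = y - trend - seasonal
    return trend, seasonal, residual

# Illustrative monthly series with trend, seasonality, and noise.
idx = pd.period_range("1960-01", "2002-12", freq="M")
t = np.arange(len(idx))
y = pd.Series(7 + 0.002 * t + 0.3 * np.sin(2 * np.pi * t / 12)
              + np.random.default_rng(0).normal(0, 0.1, len(idx)), index=idx)
trend, seasonal, residual = classical_decomposition(y)
```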

22 We use UCR data from Columbus, Ohio to illustrate the use of the CDA. Larceny, or common theft, has large crime counts and no missing observations in Columbus.

Not only does the raw larceny series (See Figure 2.1) exhibit an increasing trend and seasonality, but the variance and seasonal effect also change over time. The time series for agencies reporting in the UCR data are not generally stationary, mainly because of population changes over time.

We will first make an appropriate transformation of our data. From the plot of the

original data, we noticed that the magnitude of the random noise component increases

with the crime level. This suggests that a variance-stabilizing transformation may

be needed to provide better stationarity in the residuals. Figure 2.2 shows the time

series plot of the logarithmically transformed larceny data from Columbus. The level

of noise fluctuations now seems to be more constant across time. This transformation

is also seen to lessen the severity of the extreme crime counts in the late 1990’s.

Figure 2.3 shows plots resulting from the classical decomposition algorithm. The

trend function, mt, is estimated using a smoothing algorithm with a finite moving

average filter spanning the season, which we deem to be twelve months. The trend

looks to be a fairly linear increasing function of time. When the trend is removed from

the original series, we are left with a series that should be centered around zero. The

seasonality is then estimated by calculating the average response from the detrended

data for each of the twelve months, and then each seasonal estimate is centered so

they have mean zero. It appears that the occurrence of larceny is highest in the

summer and lowest in the winter. The seasonal component is subtracted from the

series, and the resulting detrended and deseasonalized data more closely exhibit the properties of stationarity, although the variance still appears to be slightly increasing at the end of the series, and outliers are still seen in the late 1990’s.

Figure 2.2: Time series for log larceny counts in Columbus, OH

These “residuals” now can be modeled using stationary ARMA models. Figure

2.4 shows residual plots for the stationary residuals obtained from classical decomposition. The normal Q-Q plot shows heavy tails in the distribution of the residuals, which indicates a problem with normality. This is probably due to the extreme crime counts observed in the late 1990’s. From the sample ACF plot of the residuals, we see that not all of the correlation in the data has been accounted for by CDA. Specifically, the seasonality has not been sufficiently captured since the autocorrelation is high at lags 6, 12, 18, and so on. This suggests that the seasonal effect is not constant over the years, but rather evolves over time. The sample PACF plot of the residuals shows

[Figure 2.3 panels: Estimated Trend; Trend Removed; Estimated Seasonality; Trend and Seasonality Removed]

Figure 2.3: Classical Decomposition Algorithm showing the steps taken to remove trend and seasonality from the series of Columbus log larceny counts.

[Figure 2.4 panels: Residual Plot; Normal Q-Q Plot; ACF of Residuals; PACF of Residuals]

Figure 2.4: ACF and PACF for series adjusted by CDA

significant negative correlation among the closely neighboring residuals. Ignoring the

seasonal correlation in the sample ACF, we might consider modeling the residuals

as an AR(1) process, although a more complicated model would likely be needed to

reduce the series down to a white noise process.

We turn now to an alternative method for dealing with nonstationary seasonal

time series that can improve on the results from the CDA.

2.1.2 Time Series Models Based on Differencing

Another way to achieve the desired result of a stationary process is differencing the data, a common technique developed by Box and Jenkins [1976]. Lag-1 differencing is defined, for each $t$, by
$$(1 - B) Y_t = Y_t - Y_{t-1},$$
where $B$ is the backward shift operator, defined by $B Y_t = Y_{t-1}$ [Brockwell and Davis, 2002]. Differencing is used to transform a nonstationary process into a stationary process, so that we can model the stationary process by ARMA techniques. In most cases, lag-1 differencing will sufficiently eliminate linear trends, but we will still need to make corrections for seasonality.
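The Python sketch below shows lag-1 and lag-12 differencing applied to a simulated monthly log-count series standing in for a UCR series; the simulated series and names are illustrative assumptions.

```python
# Lag-1 and seasonal (lag-12) differencing of a simulated monthly log-count series.
import numpy as np
import pandas as pd

idx = pd.period_range("1960-01", "2002-12", freq="M")
t = np.arange(len(idx))
y = pd.Series(np.log(500 + 5 * t) + 0.2 * np.sin(2 * np.pi * t / 12)
              + np.random.default_rng(1).normal(0, 0.05, len(idx)), index=idx)

lag1 = y.diff(1)              # (1 - B) Y_t: removes a linear trend
lag1_12 = y.diff(1).diff(12)  # (1 - B)(1 - B^12) Y_t: also removes stable seasonality
mcd = lag1 - lag1.mean()      # mean-corrected differenced (MCD) series
```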

With our log-transformed data, we first consider the ARIMA (autoregressive integrated moving average) model [Brockwell and Davis, 2002] for nonstationary time series. The ARIMA process imposes an ARMA model on the differenced series. If the differenced series, $X_t = (1 - B)^d Y_t$, is an ARMA(p, q) process, then $Y_t$ is an ARIMA(p, d, q) process, where $d$ is the number of times the series is differenced to produce the ARMA process. Figure 2.5 shows the residual plots from the mean-corrected differenced (MCD) data. We subtract the mean from our differenced series so that our residuals are centered around zero. In the first plot, the “residuals” refers to the MCD series using lag-1 differencing. Notice how the transformation of the data improved the normality of the residuals relative to the decomposition algorithm residuals. It is obvious from the ACF plot that, in this case, no effort was made to model the seasonality in the data. Without any adjustment for seasonality, the ARIMA model is inadequate.

The SARIMA (seasonal autoregressive integrated moving average) form of time series models establishes a flexible class of nonstationary seasonal models with parameters specified by $(p, d, q) \times (P, D, Q)_s$. The first set of parameters, $(p, d, q)$, is the familiar ARIMA construction. The second set of parameters, $(P, D, Q)_s$, designates

[Figure 2.5 panels: Residual Plot; Normal Q-Q Plot; ACF of Residuals; PACF of Residuals]

Figure 2.5: ACF and PACF for MCD series of larceny counts in Columbus, OH

a seasonal supposition on the data, where $s$ is the seasonal period. Seasonality itself may exhibit local trends over time (and thus may need to be differenced out), and may also manifest itself as dependencies that we choose to model with AR and MA structural characteristics. Let $B^s$ be the lag-$s$ backward shift operator defined by $B^s Y_t = Y_{t-s}$. In the SARIMA model, the data, $Y_t$, are transformed by lag-1 differencing $d$ times and lag-$s$ differencing $D$ times. The resulting series, $X_t$, is modeled by applying ARMA sequentially and cyclically. The model from Brockwell and Davis [2002] is given by
$$X_t = (1 - B)^d (1 - B^s)^D Y_t, \quad \text{for } t \in \{1, 2, 3, \ldots\}, \text{ where}$$
$$\phi(B)\,\Phi(B^s)\, X_t = \theta(B)\,\Theta(B^s)\, Z_t, \quad \text{and } \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$
In the above equation,
$$\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p, \qquad \Phi(z) = 1 - \Phi_1 z - \cdots - \Phi_P z^P,$$
$$\theta(z) = 1 + \theta_1 z + \cdots + \theta_q z^q, \qquad \Theta(z) = 1 + \Theta_1 z + \cdots + \Theta_Q z^Q,$$
are polynomials of degrees $p$, $P$, $q$, and $Q$ respectively. The autoregressive parameters are represented by the coefficients in $\phi(z)$, and the moving average parameters are represented by the coefficients in $\theta(z)$. Similarly, the seasonal autoregressive parameters and seasonal moving average parameters are represented by the coefficients in $\Phi(z)$ and $\Theta(z)$ respectively.

We initially tried complicated models containing many terms for the UCR data. Ultimately, however, we chose much simpler models to adhere to the principle of parsimony. The sample ACF and PACF suggested that we should use SARIMA models with parameters $(1, 1, 1) \times (1, 0, 1)$ and period 12. Figure 2.6 shows plots for the residuals from the SARIMA model. The ACF and PACF plots demonstrate that the model has appropriately accounted for the dependence in the data, represented through trend and seasonality. In Figure 2.6 the points in the ACF and PACF plots fall within the bounds $\pm 1.96/\sqrt{516}$, the confidence bounds for white noise, showing that the model does capture the seasonality and trend features of the data.
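A Python sketch of fitting this SARIMA(1, 1, 1) × (1, 0, 1) specification with period 12 in statsmodels is shown below; since the UCR extract is not bundled with this text, the 516-month series is simulated to stand in for the Columbus log larceny counts.

```python
# Fitting a SARIMA(1,1,1)x(1,0,1)_12 model to a simulated monthly log-count
# series (a stand-in for the Columbus log larceny data).
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(3)
t = np.arange(516)                                   # Jan 1960 through Dec 2002
log_counts = pd.Series(6.3 + 0.003 * t + 0.2 * np.sin(2 * np.pi * t / 12)
                       + rng.normal(0, 0.1, t.size),
                       index=pd.period_range("1960-01", periods=516, freq="M"))

fit = SARIMAX(log_counts, order=(1, 1, 1),
              seasonal_order=(1, 0, 1, 12)).fit(disp=False)
print(fit.params)                 # AR, MA, seasonal AR/MA, and sigma^2 estimates
resid = fit.resid[13:]            # drop start-up residuals affected by differencing
# Sample ACF values within +/- 1.96/sqrt(n) are consistent with white noise.
```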

Except for some extreme values in the tails, the Columbus larceny series is modeled rather accurately using SARIMA because the crime counts are very large ranging from about 500 per month in 1960 to close to 4000 per month in 2000. SARIMA models work well for data that is approximately normal, but for most of the time series in the

UCR, the crime counts are much smaller and the data need to be treated as discrete.
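For readers who wish to reproduce a fit of this kind, the following is a minimal sketch (not the code used in this dissertation) of estimating a SARIMA$(1,1,1)\times(1,0,1)_{12}$ model with the statsmodels library; the series here is simulated as a stand-in for the 516 months of log larceny counts.

```python
# A minimal sketch (not the dissertation's code) of fitting the
# SARIMA(1,1,1)x(1,0,1)_12 model discussed above with statsmodels.
# The series below is simulated as a stand-in for 516 months of
# log-transformed larceny counts (1960-2002).
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)
t = np.arange(516)
log_counts = pd.Series(
    6.0 + 0.003 * t + 0.2 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.1, 516),
    index=pd.date_range("1960-01-01", periods=516, freq="MS"),
)

model = SARIMAX(log_counts, order=(1, 1, 1), seasonal_order=(1, 0, 1, 12))
result = model.fit(disp=False)

# Residual diagnostics analogous to Figure 2.6: residuals should resemble
# white noise if trend and seasonality have been captured.
print(result.summary().tables[1])
print("Lag-1 residual autocorrelation:", result.resid.autocorr(lag=1))
```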

Figure 2.6: Residual plots of the SARIMA model of log larceny counts in Columbus, OH

When it is unreasonable to assume normality in the data, SARIMA models are quite inappropriate. When our data are small counts, the Poisson distribution becomes a more acceptable distributional assumption than normality. Poisson regression models are common for independent data but are increasingly seen in applications involving dependent data. In Chapter 3, we take a closer look at some of the time series models for count data that appear in the literature.

2.2 Explanatory Variables

The time series models we have described so far make no use of covariates. If predictions were made using these models, they would only depend on the data in the crime series itself. Thus, any imputations made using predictions from these models would be based solely on the crime counts in the months prior to, and the months following the missing period. Recognizing that other information can help explain crime counts, the FBI, as we have seen, sometimes makes use of information outside of the primary data series to make predictions for missing values. Part of our exploratory data analysis involved looking for variables that might be used to help us make better estimates for missing values.

Studying the causes of crime is the main concern of many researchers in the social sciences. For our research, we are not as interested in the individual causes of crimes as we are in making the best possible predictions of crime counts, using as much information as we can. In CIUS, the FBI notes that many factors affect crime volume and type. The following is a list of factors related to crime taken from CIUS:

Population density and degree of urbanization; variations in composition of the population, particularly youth concentration; stability of population with respect to residents' mobility, commuting patterns, and transient factors; modes of transportation and highway system; economic conditions, including income, poverty level, and job availability; cultural factors and educational, recreational, and religious characteristics; family conditions with respect to divorce and family cohesiveness; climate; effective strength of law enforcement agencies; administrative and investigative emphases of law enforcement; policies of other components of the criminal justice system (i.e., prosecutorial, judicial, correctional, and probational); citizens' attitudes toward crime; crime reporting practices of the citizenry [FBI, 2000].

Many of these factors are difficult to quantify. For several of the factors that can be quantified, data are not readily available at the police agency level. Still, understanding the crime problem and its causes may help us to decide on factors which are effective and convenient for use in our models.

Restricting ourselves to information available in the UCR, we may first consider

the following classes of variables: time, location, and population. Time variables

include sequential monthly time points and cyclical time factors given by months of

the year or by seasons. Location variables are hierarchical in the order of region,

state, county, and individual agency. Population variables include annual popula-

tion, county urbanicity, and Group number (defined in Chapter 1), which also clas-

sifies zero-population agencies. Urbanicity is a measure proposed by Goodall et al.

[1998] which is calculated using the highest city population sizes in a county. Let

$\{P_1, P_2, \ldots, P_k\}$ be the population sizes for the $k$ largest places in a county of $n$ places. The authors consider the class of urbanicity measures defined by

$$U_{k,n} = 10 \times \log_2\!\left[\left(\sum_{i=1}^{k} P_i^2\right)^{1/2}\right],$$

but recommend $U_{3,n}$ as a measure of county urbanicity.
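As a small worked example (with made-up place populations, not actual Census figures), the urbanicity measure can be computed directly from its definition:

```python
# A small worked example (made-up place populations) of the urbanicity
# measure U_{k,n} defined above, using the k = 3 largest places.
import math

def urbanicity(place_populations, k=3):
    """U_{k,n} = 10 * log2( (sum of the k largest P_i^2)^(1/2) )."""
    largest = sorted(place_populations, reverse=True)[:k]
    return 10 * math.log2(math.sqrt(sum(p ** 2 for p in largest)))

# A hypothetical county whose largest places have 40,000, 12,000,
# and 5,000 residents:
print(round(urbanicity([40_000, 12_000, 5_000]), 1))
```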

If we focus on one particular time series, then crime type, region, state, and county

are all constant. These may seem to be of little use in the preliminary stage of our

analysis, but in Chapter 5 we will describe how they can be utilized to formulate a hierarchical structure that combines many series of UCR data together. Population size, Group number, and urbanicity are not constant over time, but have relatively small within-series variation compared to between-series variation. Population, Group number, and urbanicity, therefore, are not suitable variables for grouping series together, because these variables can change within a time series.

In the SARIMA time series models described previously, trend and seasonality do not enter as explanatory factors, as they would in mixed-effects models, but rather are captured through the dependence structure of the data. We have already seen in the CDA that it is possible to include trend and seasonality in a model as explanatory variables. If crime is thought to increase linearly with time, then a linear trend component can be modeled by adding a variable for time $(1, 2, 3, \ldots, n)$. A higher-order polynomial of time can model a more complicated trend pattern. We use orthogonal Hermite polynomials for time in the models we describe in Chapter 5, to remove multicollinearity. Including seasonality in the model as an explanatory variable can be done by using indicators for month or season

(using factors) or by using harmonic terms (e.g., $\sin(2\pi t/12)$, $\cos(2\pi t/12)$) as predictors. As described earlier, the FBI's imputation method uses crime rates from "similar" agencies with the same Group number and in the same state. Taking this idea of using data from similar agencies to improve estimates from our models, we considered different selection criteria for possible use in our imputation methods. We suspected that correlation across series may be a useful variable in selecting a donor agency. We studied this theory with agencies from Iowa. We wanted our donor agency to be in the same state and to have complete data. Out of Iowa's eighteen most complete agencies for the Part I crime index (NDX), which is the sum of seven Part I Index crimes, the

Pairwise Pearson correlations of the monthly NDX series (lower triangle of the scatterplot matrix in Figure 2.7):

              Des Moines   Fort Madison   Ankeny   Warren
Fort Madison     0.70
Ankeny           0.77          0.56
Warren           0.70          0.55         0.58
Davenport        0.68          0.57         0.70      0.54

Figure 2.7: Most correlated agencies in Iowa. Plotted are monthly NDX crimes. Pearson correlations are shown in lower diagonal.

scatterplot matrix of the five most correlated agencies is shown in Figure 2.7. We see that fairly strong relationships exist among the aggregate time series across different agencies in Iowa. More interesting still is the diversity in these agencies. The five most correlated agencies represent five different Group numbers. Des Moines, a Group 2 agency, represents the largest population, but Groups 5 (Fort Madison), 4 (Ankeny),

9 (Warren), and 3 (Davenport) are also represented. Warren, which is in Group 9, is

a county agency. It is probably true that actual crime levels are more similar within population Groups, but this example from Iowa suggests that Group number may not be as important for selecting predictor agencies in a model-based imputation method as correlation could be. We use this idea for donor agencies as we construct our

Poisson regression models in Chapter 4.

Another clear choice for a possible explanatory variable is geographical distance.

This idea is illustrated in the example above because Des Moines, Ankeny, and Warren are geographically close to each other, near the middle of the state, and are quite correlated with each other. Spatial variables, for inclusion in the model or for donor agency selection, potentially have an important role in imputation procedures. Nearest neighbor imputation concepts may apply, at a fundamental level, to many of the explanatory variables we have considered in this section. Geography, although not pursued in this dissertation, would be very interesting for future work in crime imputation and would require much research and investigation.

2.3 Missingness in the UCR

We have started to model complete, observed UCR data series, but recall that we are more concerned with the data that are unobserved. Before we can work toward developing an imputation strategy, we must first understand the missing data problem. The issue of missing data in the UCR is described briefly in Chapter 1, but an exploratory analysis of the missing data gives us a closer look into the origins, behaviors, and implications of the missingness.

2.3.1 Explaining Missingness

One of the initial difficulties encountered in examining missingness in the UCR data was determining whether a datum was missing or merely zero. First, not all

“missing” data in the UCR is truly missing. Data may appear to be missing because:

1. The agency had not (yet) begun reporting data to the FBI because it did not exist (or did not have a crime reporting unit) at that time.

2. The agency ceased to exist (it may have merged with another agency).

3. The agency existed, but reported its crime and arrest data through another agency (i.e., was "covered by" another agency).

4. The agency existed and reported data for that month, but the data were aggregated so that, instead of reporting on a monthly basis, the agency reported consistently (for some length of time) on a quarterly, semiannual, or annual basis.

5. The agency existed and reported monthly data in general, but missed reporting for one month and compensated for the omission by reporting, in the next month, for both months.

6. The agency had few or no crimes to report and didn't see the need to send in a report filled with zeroes.

7. The agency did not submit data for that month (a true missing datum).

The clean UCR data [Maltz and Weiss, 2006] is now coded to indicate if a UCR report is missing because it is from a nonexistent agency, covered-by another agency, aggregated (usually 3, 6 or 12 months), or truly missing.

Agencies that did not exist until after 1960 require no imputation during the

unobserved period of time at the beginning of the series. Similarly, we do not need

to impute for any months after an agency has ceased to exist, since it is impossible

to observe crime reports from a non-existent agency. If we have missing data at the

beginning or end of a series, sometimes it is difficult to determine whether the agency

existed at all during that time period, even with the “clean” data. It is for this reason

that we have decided not to impute for months that are missing at the beginning or

end of any agency’s series.

When an agency’s crime counts are included in a second agency’s reports, the second agency is said to be “covering” the first agency, and the first agency is said to be “covered by” the second agency. Since such covered-by crimes have been counted and reported, we do not need to impute them for the covered agency if we are only in- terested in aggregate crime statistics. Still, we cannot treat agencies with covered-by months or covering months the same as other agencies. If a covering agency has peri- ods of covering surrounded by periods of non-covering, then we might have a problem when we try to model the series. The months that are covering another agency will need a different model from the months that are not covering. Fortunately, covered- by agencies usually have much smaller crime counts than the covering agencies, so the covering agencies’ series are not greatly affected. We discuss solutions for dealing with the “covering” issue in Chapter 4.

Aggregated crime counts also are a type of missing data. Even if we know the total crime count for a year, we may not know the crime count for each of the twelve months separately. When a UCR series includes aggregated data, we must be careful not to treat the months with aggregated crime counts the same as months that have been

reported individually. In a time series plot, an aggregated crime count will appear like a large outlier. These extreme observations may be troublesome to include in our model estimation, even though they provide critical information about the missing values. Special techniques need to be used to deal with cases of aggregation and these are described in Chapter 4.

2.3.2 Patterns of Missingness

Missing data patterns are characterized by duration and location. To study the duration of missing periods Maltz [2006] analyzed the lengths of runs of missingness and how they vary by year, state, and by agency size and type. With regard to the time indexes (location) of missing data in the time series, Maltz [2006] has also spent much effort analyzing the reporting trends over time.

Figure 2.8, produced by Michael Maltz [2006], shows the overall pattern of missingness for all states and all years. We see from this plot that missing gaps of only one month's duration occur most often, and about 70% of the missing gaps are ten months or fewer. From an imputation standpoint, shorter runs of missingness are much easier to handle and have smaller prediction uncertainty. This graph also provides evidence that some series are entirely missing. Some agencies have never submitted

UCR data to the FBI. Any imputation for missingness of this type must be based on a procedure that borrows information from another source, for example, hot-deck techniques.

The typology of agencies, or Group classification, used by the FBI, is described in Table 1.1 in Chapter 1. As can be seen, the typology is fairly simple, and agencies

Figure 2.8: Number of cases of different gap sizes from missing data. The horizontal axis is scaled logarithmically to highlight the shorter runs. (Produced by Michael Maltz)

may change their Group designation as their population changes. As shown in Table 1.1, Groups 1–5 represent cities of different sizes, but Groups 6–9 represent both cities and other types of jurisdictions, including university, county, and state police agencies, as well as fish and game police, park police, and other public law enforcement organizations. Figure 2.9, also produced by Michael Maltz [2006], depicts the missingness patterns for the different Groups. The largest agencies (Groups 1–4) have the least missingness, while smaller agencies (Groups 5–9) have the most missingness. Interestingly, Group 1 agencies are almost entirely complete. Furthermore, the patterns of missingness are very similar for Groups 6–9, as can be seen in the log-log relationship between gap size and number of cases. One possible reason for the difference in overall missingness between large agencies and small agencies is that more effort on the part of the government is spent in encouraging large agencies to

Figure 2.9: Number of cases of different gap sizes from missing data for different FBI Groups, plotted on logarithmic scales. (Produced by Michael Maltz)

submit UCR reports. This makes sense because large agencies are responsible for a substantial contribution to the FBI’s national summary statistics on crime.

Less than one-fifth of all missing periods are single month occurrences. The large majority of cases of missingness involve two or more consecutive months of missing data. This is an important observation for developing our imputation procedure, and especially our variance estimation, because predictions made for months in the middle of a large gap will be less precise. This has never before been accounted for by the

FBI in their imputation procedure. Missing value inferences that take gap size into account are potentially very useful.

2.3.3 Missing Data Assumptions

To further understand and characterize our problem, we first consider the missing data mechanisms. There are three types of commonly-used missing data mechanisms: missing at random (MAR), missing completely at random (MCAR), and not missing at random (NMAR) [Little and Rubin, 2002]. If missingness does not depend on any data values, then the data are MCAR. If missingness only depends on observed data then the data are MAR. If missingness does depend on the missing values, then the data are NMAR. Most missing data methods require at least the assumption of MAR.

Before we continue any further in the analysis of the UCR data, we must first make a claim about the missing data mechanism and provide supporting evidence for the claim. We assume that the data are missing at random. This is a necessary assumption for many imputation methods, and it is also a very common assumption to make, despite a lack of concrete evidence. We conduct no test for the MAR mechanism, nor can we prove its validity beyond doubt, but we present here the arguments in favor of its acceptance. MAR requires that missingness is independent of the missing values, but allows missingness to depend on the observed data. Let us recall the variables to which missingness might be related. We have already seen that smaller agencies have more missingness, but agency size is an observed variable. In fact, since we have separated the data by agency, we need only impose MAR for an individual series. Within a series, missingness might be related to time or season, both of which are observed. Missingness could be related to something unobserved, such as personnel changes or system changes, but these bear little relation to crime. Perhaps missingness is related to the actual missing values themselves. The possibility that an agency does not report UCR data because it has little or no crime to report is

a threat to the MAR assumption. Considering an NMAR mechanism is an area for future research, but for now, assuming the data are MAR is a reasonable first step and an improvement over what has been done previously.

We conclude this chapter by summarizing the foundations for our imputation strategy. We will build a model of the data that uses all of the information from the data source. In particular, we will model the relationships among the groups of data and the dependency in the data exhibited over time and within seasons. Our approach to imputation is based on prediction from the data models and on the assumption that the data are missing at random.

CHAPTER 3

LITERATURE REVIEW

Our next step in the process of developing a procedure to estimate missing values in multiple time series of counts is to study prior work on these topics. With an open, yet critical, mind we can learn from others' past research. What we find in one journal article or textbook will not be the answer to all of our questions, but our goal in this chapter is to gather relevant pieces that can eventually be put together to create sensible methods for imputation.

Our imputation approach, as mentioned earlier, will first involve fitting an appropriate model, or models, to the crime series. The SARIMA model in Chapter 2 applies only to a small portion of series in the UCR, namely count data that can be modeled approximately with a Gaussian process. An important aspect of our research is in constructing time series models for count data. Not surprisingly, most developments in time series research apply specifically to data that are approximately normal. In this chapter, we review time series models for count data found in the literature. Most commonly, these models are based on Poisson regression, which we describe in Section 3.1. Following our discussion of models in Section 3.2, we review techniques for dealing with missing data in time series in Section 3.3.

3.1 Poisson Regression

To begin constructing a regression model, let us first make explicit some basic

assumptions about our data. We have already described the UCR data as monthly

crime counts. Typically, the term “count data” is reserved for discrete data from

the sample space of all non-negative integers with no finite upper limit. This defined

sample space is characteristic of Poisson data, which records the number of occur-

rences of a particular event in a given time interval. For a Poisson random variable,

we assume that successive events occur independently and at the same rate. The

probability mass function for a Poisson random variable, Y , is given by the formula:

$$P(Y = y \mid \mu) = \frac{e^{-\mu}\mu^{y}}{y!} \quad \text{for } y = 0, 1, 2, \ldots \qquad (3.1)$$

As can be seen by Eq. 3.1, the Poisson distribution is defined by one positive real- valued parameter, µ, which is the rate at which the events occur. We will use this common discrete distribution as the foundation on which to build our models for

UCR crime counts.

In Chapter 2, we introduced the SARIMA model, which is a model for Gaus- sian processes. We applied the SARIMA modeling techniques to the UCR data even though the data, being discrete, can never be exactly normal. Asymptotically, how- ever, the Poisson distribution is related to the normal distribution. As the mean of a Poisson random variable increases, its distribution becomes approximately normal

[McCullagh and Nelder, 1989, page 195]. We have seen evidence of this from our preliminary analysis of data from Columbus in Chapter 2. For a small subset of the

UCR data, when crime counts are large enough, the data is approximately normal and the SARIMA model is reasonably appropriate. In this chapter, however, we begin

to develop a more universally applicable modeling strategy for the UCR data, based on the Poisson distribution.

An important feature of the Poisson distribution is the mean-variance relationship.

Both the mean and variance are equal to the rate parameter, µ. The mean-variance relationship, E(Y ) = var(Y ), must be taken into careful consideration when we con- struct our regression model for the crime counts.

In the UCR data, we see that the variability in crime counts increases as the crime level increases. This feature of the data supports the idea of making the Poisson distribution a reasonable model for the crime counts. The phenomenon of differing variances, known as heteroskedasticity, is a common problem in regression analysis. Ordinary least squares (OLS) regression requires the assumption that the errors are independently and identically distributed normal random variables. If we assume that our data are realizations of Poisson random variables, then the model errors are not identically distributed for each observation because the variance increases with the mean. A common remedy for heteroskedasticity in OLS regression is to make a log transformation of the response data. This is not recommended for count data for two reasons. First, if $y = 0$, then $\log(y)$ is undefined. Second, we want to predict $E(y)$, but $\exp(E[\log(y)]) \neq E(y)$, even though $\exp(\log(y)) = y$. Instead, we turn to Poisson regression, which falls under a broader class of models, namely the log-linear model, a type of generalized linear model (GLM).

McCullagh and Nelder [1989] give a thorough description of GLMs, including log- linear models in Chapter 6. The purpose of the generalized linear model is to extend linear modeling to cases for which the normal distribution is not appropriate. The

GLM can be used when data are not normal, and it can be used when the relationship

between the mean and the predictors is not linear. The GLM has three parts: the

distributional assumption, the link function, and the systematic component. The

distributional assumption, (3.2), is used to specify a distributional family for the data,

and the link function, (3.3), links the mean to the systematic component, which is

the linear predictor. Let Y represent a collection of independent random variables,

$\{Y_t : t = 1, 2, \ldots, n\}$. Then a GLM is specified by

$$Y \sim f(y \mid \theta) \qquad (3.2)$$

and

$$g(E[Y]) = X\beta, \qquad (3.3)$$

where θ represents the distribution parameters, g is a known link function, X is the

design matrix, and β is a vector of regression parameters.

Now we begin to construct a Poisson regression model by specifying the three

GLM parts. Let the $\{Y_t \mid t = 1, 2, \ldots, n\}$ be random variables which are independent conditional on past and present information that we will define later. Also, let $X_{k,t}$ be the $k$th covariate at time $t$. Then the model is given by

$$Y_t \sim \text{Poisson}(\mu_t) \quad \text{for } t = 1, 2, 3, \ldots, n \qquad (3.4)$$

with

$$\log(\mu_t) = \beta_0 + \beta_1 x_{1,t} + \cdots + \beta_k x_{k,t}. \qquad (3.5)$$

The random component of the GLM, (3.4), puts the Poisson distributional assumption

on the data, and the link function, (3.5), which links the Poisson mean to the linear

predictor, is typically the natural logarithm. The interpretation of the regression

parameter, βk, involves the multiplicative effect of the explanatory variable on the

expected response. Holding all other independent variables constant, as $x_k$ increases one unit, the mean of $y$ changes by a factor of $e^{\beta_k}$. A positive value of $\beta_k$ indicates a

positive relationship between Y and Xk.
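As a concrete illustration of the Poisson GLM in (3.4)–(3.5), the following minimal sketch (assumed library and simulated data, not UCR output or code from this dissertation) fits such a model with a linear trend and the harmonic seasonal terms mentioned in Chapter 2 as the covariates $x_{k,t}$.

```python
# A minimal sketch (assumed library and simulated data, not UCR output)
# of the Poisson GLM in (3.4)-(3.5) with a linear trend and harmonic
# seasonal terms as the covariates x_{k,t}.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 240
t = np.arange(n)
X = np.column_stack([
    np.ones(n),                   # intercept (beta_0)
    t / 12.0,                     # linear trend, in years
    np.sin(2 * np.pi * t / 12),   # harmonic seasonal terms
    np.cos(2 * np.pi * t / 12),
])
true_beta = np.array([1.5, 0.05, 0.3, -0.2])
y = rng.poisson(np.exp(X @ true_beta))

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.params)          # estimates of beta
print(np.exp(fit.params))  # multiplicative effects exp(beta_k)
```

The exponentiated coefficients illustrate the multiplicative interpretation of $e^{\beta_k}$ described above.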

One method used to estimate the regression parameters of a GLM is called maxi-

mum quasi-likelihood estimation, introduced by Wedderburn [1974]. Quasi-likelihood

functions are especially useful in the cases where the likelihood functions are difficult

or impossible to construct. The advantage of quasi-likelihood estimation is that we

do not need to specify a distribution for the observations. It is enough to define the

variance of the observations as proportional to some function of the mean. After

defining this relationship, $\operatorname{var}(y_t) \propto V(\mu_t)$, the quasi-likelihood $Q(\mu_t; y_t)$ for each $t$ is defined in Wedderburn [1974] by the relation

$$\frac{\partial Q(\mu_t; y_t)}{\partial \mu_t} = \frac{y_t - \mu_t}{V(\mu_t)}. \qquad (3.6)$$

Wedderburn [1974] discovered that quasi-likelihood can be used to estimate GLMs for any choice of link functions and variance functions. For the GLM, when the likelihood is of Aitkin form [McCullagh and Nelder, 1989, Equation 2.4], the quasi-likelihood estimation method leads to the same estimates as the maximum likelihood method for independent data [Wedderburn, 1974]. For example, if we have independent Poisson data with the common mean, $\mu$, then $V(\mu) = \mu$, and the quasi-likelihood function is

$$Q(\mu; \mathbf{Y}) = \sum_{t=1}^{n} \int_{Y_t}^{\mu} \frac{Y_t - w}{w}\, dw = \left(\sum_{t} Y_t \log(\mu)\right) - n\mu. \qquad (3.7)$$

Using the corresponding score equation, we derive the maximum quasi-likelihood estimate $\hat{\mu} = (1/n)\sum_t Y_t$, which is equivalent to the maximum likelihood estimate. We can estimate the regression parameter, $\beta_k$, by solving the estimating equation

$$S(\beta_k) = \sum_{t=1}^{n} \left(\frac{\partial \mu_t}{\partial \beta_k}\right)' V(\mu_t)^{-1} \left\{Y_t - \mu_t(\beta_k)\right\} = 0. \qquad (3.8)$$

S(βk) is called a quasi-score function because its integral with respect to βk is the

quasi-likelihood function.

The Poisson regression model is useful if we wish to understand the relationship between Y and covariates X. Our data, however, have the additional characteristic of being dependent over time. Unless our model addresses the autocorrelation that exists in our data, it will most likely be inadequate. McCullagh and Nelder [1989, Sections

9.2–9.3] offer some discussion of the use of GLMs for time series. The quasi-likelihood method is particularly useful in the case of dependent data, and the quasi-likelihood estimating equations are used in the same way as in the case of independent obser- vations [McCullagh and Nelder, 1989, Section 9.3]. For example, Li [1994] uses the quasi-likelihood technique of obtaining the maximum likelihood estimates iteratively by Fisher scoring for a Poisson GLM in which µt depends on past observations. We can see, therefore, that the quasi-likelihood method is ideally suited for estimating

Poisson regression models for dependent data.

GLMs for dependent data are also described in detail in Diggle et al. [1994,

Chapters 7-10]. The authors distinguish three types of GLMs for longitudinal data: marginal models, random effects models, and transition models. Most interesting to our example, at this point, is the transition model, which is a GLM where the past outcomes are treated as predictor variables. We will describe transition models in more detail in Section 3.2.

Before we begin our review of log-linear models for dependent data, let us first discuss two additional options for the GLM. It is common, in Poisson regression, to consider situations when an offset might be useful, and situations when a dispersion parameter should be estimated. We now explain these ideas in detail.

3.1.1 Modeling Rates

When our response data have an associated index variable, we might find it appealing to model standardized rates rather than counts. For example, in the UCR, we may be interested in modeling the crime rate per 1,000 people in the population. Here, the population size is an index variable for the crime count. For Poisson regression, an offset is used to model rates. An offset is added to the model like an explanatory variable with a coefficient equal to 1. Since the coefficient is equal to 1, the index variable is directly proportional to the response variable. The following equation shows how an offset is typically added to a log-linear model as in Agresti [1996]. Retaining the same model as defined above in Section 3.1, we replace Equation 3.5 by

$$\log(\mu_t) = \log(z_t) + \beta_0 + \beta_1 x_{1,t} + \cdots + \beta_k x_{k,t},$$

or equivalently by

$$\log\!\left(\frac{\mu_t}{z_t}\right) = \beta_0 + \beta_1 x_{1,t} + \cdots + \beta_k x_{k,t}.$$

Here, log(zt) is the offset in our model, where zt is the population size in thousands

at time t. We include the offset if we believe that crimes increase at the same rate

as population size. Our model, therefore, could be designed to predict the average

number of crimes per 1,000 people in the population for a given month, t, though we

do not pursue this strategy presently.
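A minimal sketch of this idea (assumed library, simulated data) adds $\log(z_t)$ as an offset so that the fitted coefficients describe crimes per 1,000 residents:

```python
# A minimal sketch (assumed library, simulated data) of the offset idea:
# log(z_t), with z_t the population in thousands, enters with a fixed
# coefficient of 1, so the model describes crimes per 1,000 residents.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 120
t = np.arange(n)
pop_thousands = 50 + 0.1 * t                       # slowly growing population
X = sm.add_constant(np.column_stack([np.sin(2 * np.pi * t / 12),
                                     np.cos(2 * np.pi * t / 12)]))
rate_per_1000 = np.exp(0.5 + 0.2 * np.sin(2 * np.pi * t / 12))
y = rng.poisson(pop_thousands * rate_per_1000)

fit = sm.GLM(y, X, family=sm.families.Poisson(),
             offset=np.log(pop_thousands)).fit()
print(fit.params)   # beta_0 and the seasonal coefficients, on the rate scale
```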

3.1.2 Over-dispersion

The Poisson variance is equal to the mean, but in real data, this is not always the

case. If the variance exceeds the mean, we have over-dispersion, and must adjust our

model accordingly. There are a variety of reasons why we might encounter the problem

of over-dispersion. If our model does not appropriately pick up the crime pattern

over time, then we might underestimate the mean, thereby distorting the mean-

variance relationship. Over-dispersion can happen because the mean is a random

variable, not the same for every observation. As Brockwell and Davis [2002] point out,

over-dispersion can sometimes be explained by serial dependence in the data. Over-

dispersion should be investigated by examining residuals or goodness-of-fit statistics, or

even by looking at the estimated autocorrelation function.

There are a number of different ways to express over-dispersion in the model. The

two most common variance assumptions are: (i) $\operatorname{var}(y_t) = \sigma^2 \mu_t$, and (ii) $\operatorname{var}(y_t) = \mu_t + \mu_t^2/\nu$, where $\sigma^2$ and $\nu$ are constants [McCullagh and Nelder, 1989, Section 6.2].

In the first case, $\sigma^2$ is a dispersion parameter, which can be estimated using the Pearson $X^2$ in the following equation [McCullagh and Nelder, 1989], where $p$ is the number of model parameters and $\hat{\mu}_t$ is the estimated value of $y_t$:

$$\tilde{\sigma}^2 = \frac{X^2}{n - p} = \left(\frac{1}{n - p}\right)\sum_{t=1}^{n} \frac{(y_t - \hat{\mu}_t)^2}{\hat{\mu}_t}. \qquad (3.9)$$

For over-dispersed data, a scaling factor based on this estimated dispersion parameter can be used to adjust the (underestimated) standard errors of the model parameter estimates.
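The following short sketch (simulated data; the statsmodels attributes used are assumptions about that library, not material from this dissertation) computes the dispersion estimate in (3.9) for counts generated with deliberate extra-Poisson variation:

```python
# A short sketch of the dispersion estimate in (3.9).  The counts are
# simulated with extra-Poisson variation (negative binomial, variance
# mu + mu^2/5) so that over-dispersion is actually present.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
X = sm.add_constant(rng.normal(size=(n, 2)))
mu = np.exp(X @ np.array([1.0, 0.4, -0.3]))
y = rng.negative_binomial(n=5, p=5 / (5 + mu))     # mean mu, variance mu + mu^2/5

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
sigma2_tilde = np.sum((y - fit.fittedvalues) ** 2 / fit.fittedvalues) / (n - X.shape[1])
print("Pearson dispersion estimate:", sigma2_tilde)        # should exceed 1
print("statsmodels equivalent:", fit.pearson_chi2 / fit.df_resid)
```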

In the second case, the variance assumption naturally allows for a Bayesian inter- pretation of the model, and is appropriate if the degree of over-dispersion relative to

$Y_t$ depends on $\mu_t$. We can assume that the means, $\mu_t$, are random and follow a gamma distribution with mean $\mu_t$ and variance $\mu_t^2/\nu$ [Lawless, 1987]. Then the marginal distribution of the counts, $Y_t$, is the negative binomial distribution with mean $\mu_t$ and variance $\mu_t + \mu_t^2/\nu$. Conditional on $\mu_t$, the response variable has a Poisson distribution. The level of influence of the gamma variance in the Poisson mixture is seen in the index parameter $\nu$. Note that as $\nu$ goes to infinity, our model tends toward a Poisson distribution. Lawless [1987] demonstrates how to test that $Y_t$ is Poisson distributed versus a general negative binomial distribution.

Quasi-likelihood estimation methods are especially appropriate for over-dispersed data, because as noted earlier, a variety of variance-mean relations are allowed without the requirement of directly defining the distributions for the observations.

3.2 Time Series for Count Data

In general, Gaussian time series models are inappropriate for analyzing count

data. If the actual counts are not large enough to be approximated by continuous

variables, as in much of our data, then one solution is to specify the conditional

distribution of each observation by a generalized linear model. The conditioning here

is with respect to the past observations, means, or past and present covariates X.

For this section, in all the models that we discuss, we will assume that we are dealing

with small count data that is reasonably modeled by a Poisson distribution. Let

$\{Y_t\}$ be a time series of counts, let $X_t$ be a vector of covariates at time $t$, and let $D_t = \{X_1, X_2, \ldots, X_t, Y_1, Y_2, \ldots, Y_{t-1}\}$ be the past and present covariates as well as the past observations at time $t$. Let $\{Y_t \mid D_t : t = 1, 2, \ldots, n\}$ be a set of independent random variables. In most of the models we describe, the distributional assumption of the GLM is represented by

$$Y_t \mid D_t \overset{\text{indep.}}{\sim} \text{Poisson}(\mu_t), \quad \text{for } t = 1, 2, 3, \ldots, n,$$

where the Poisson probability mass function is given by (3.1).

Two classes of time series models considered in Chapter 2 are the ARIMA model

and the classical decomposition model. Now we present a broad class of models for

time series called the state-space class. State-space models consist of an observation equation and a state equation. The model defined below is taken from Brockwell and Davis [2002, Chapter 8]. Suppose that $\{W_t\}$ and $\{V_t\}$ represent uncorrelated white noise vectors. Let the observation $Y_t$ be $w$-dimensional and the state variable $X_t$ be $v$-dimensional. Also, let $\{G_t\}$ be a sequence of $w \times v$ matrices and let $\{F_t\}$ be a sequence of $v \times v$ matrices. The observation equation expresses the observation, $Y_t$, as a, typically linear, function of the state variable, $X_t$, plus noise:

$$Y_t = G_t X_t + W_t, \quad t = 1, 2, \ldots \qquad (3.10)$$

The state equation determines the state in terms of the previous state plus noise:

$$X_{t+1} = F_t X_t + V_t, \quad t = 1, 2, \ldots \qquad (3.11)$$

One advantage of the state-space approach is that we can model each component separately and then put them together to form an overall model. Both ARIMA models and classical decomposition models can be constructed using the state-space representation. Another advantage of the state-space representation is the unified estimation approach. Maximum likelihood estimates of any state-space model can be found using Kalman recursions [Brockwell and Davis, 2002]. We will see yet another benefit of using state-space models, when we discuss missing data techniques at the end of this chapter.
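As a small illustration of the observation and state equations (3.10)–(3.11), the following sketch simulates a scalar special case (a local-level model with $G_t = F_t = 1$); all values are illustrative assumptions, not anything fit to UCR data.

```python
# A minimal sketch simulating a scalar special case of (3.10)-(3.11):
# a local-level model with G_t = F_t = 1, so the observation is the
# latent state plus noise.  All values are illustrative.
import numpy as np

rng = np.random.default_rng(4)
n = 200
sigma_v, sigma_w = 0.3, 1.0      # state and observation noise std. deviations

x = np.empty(n)                   # state X_t
y = np.empty(n)                   # observation Y_t
x[0] = 0.0
for t in range(n):
    y[t] = x[t] + rng.normal(0, sigma_w)             # observation equation (3.10)
    if t + 1 < n:
        x[t + 1] = x[t] + rng.normal(0, sigma_v)     # state equation (3.11)

print(y[:5])
```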

In time series analysis there are two important types of state-space models as distinguished by Cox [1981]: “parameter-driven” and “observation-driven.” The dif- ference is found in the state equation. In a parameter-driven model, the state does not depend on the past observations, while in an observation-driven model, the state depends on past observations. In the case of count data, for both observation-driven

and parameter-driven models, the observations, conditional upon a specified mean

process, can be independent Poisson observations. One such class of dependent gen-

eralized linear models we consider are called the “generalized linear autoregressive

moving average” (GLARMA) [Davis et al., 1999] processes, which we now describe.

The GLARMA class of models uses link functions, characteristic of GLMs, to link

the mean of the data distribution to a linear predictor with both covariates and an

ARMA component. GLARMA models are GLMs that may include functions of past

observations as well as past conditional mean values. For Poisson data, the GLARMA

model will have the following form. For t =1, 2,...,n, let

$$Y_t \mid D_t \overset{\text{indep.}}{\sim} \text{Poisson}(\mu_t),$$
$$\text{where} \quad \log(\mu_t) = \alpha + x_t\beta + \sum_{i=1}^{p} \phi_i H_i(Y_{t-i}) + \sum_{i=1}^{q} \theta_i K_i(\mu_{t-i}), \qquad (3.12)$$

and Hi and Ki are known functions [Davis et al., 1999].

A GLARMA model combines elements of both parameter-driven and observation-

driven time series models. The autoregressive component corresponds to an observation-

driven model and the moving average component corresponds, in some sense, to a

parameter-driven model.

A third type of model we will discuss in this chapter is the generalized additive model, which is a time series model that is neither parameter-driven, nor observation- driven. Instead, the time dependence is modeled through smooth functions of covari- ates.

The models we describe, like most time series models, can be checked by a resid- ual analysis which includes an examination of the dependencies of the residuals. If

modeled properly, the Pearson residuals, defined in McCullagh and Nelder [1989] as

$$r_t = \frac{Y_t - \mu_t}{\sqrt{V(\mu_t)}} \quad \text{for } t = 1, 2, \ldots, n,$$

should be approximately uncorrelated.

3.2.1 Observation-driven Models

Observation-driven models, which are defined explicitly in terms of past observations, are perhaps the most intuitive way of modeling time series data. A Markov chain model for discrete data and an autoregressive model for Gaussian data (a

Markov chain model for continuous data) are both examples of observation-driven models. The examples given in this section are all autoregressive in the sense that the autocorrelation structure depends directly on past observations. Conditional expectation of the outcome given the past depends explicitly on past values. Condi- tional rather than marginal moments are modeled because the data are dependent.

Observation-driven models for longitudinal data are described in Diggle et al. [1994,

Chapter 10] as transition models.

The models we describe in this section will have the following form: For t =

1, 2,...,n, let

$$Y_t \mid D_t \overset{\text{indep.}}{\sim} \text{Poisson}(\mu_t),$$
$$\text{where} \quad \log(\mu_t) = x_t\beta + \sum_{i=1}^{k} \theta_i f_i(D_t), \qquad (3.13)$$

and $\{f_i : i = 1, 2, \ldots, k\}$ are known functions. One of the simplest observation-driven models for count data, letting $f_1(D_t) = y_{t-1}$, is given below. For $t = 1, 2, \ldots, n$, suppose that

$$Y_t \mid D_t \overset{\text{indep.}}{\sim} \text{Poisson}(\mu_t),$$
$$\text{where} \quad \log(\mu_t) = x_t\beta + \theta y_{t-1}. \qquad (3.14)$$

This model is not preferred for a variety of reasons. Zeger and Qaqish [1988] note that model 3.14, when $x_t\beta = \beta_0$, leads to a stationary process only when $\theta \le 0$. The authors also argue that under model 3.14, $\beta$ has a lack of interpretability. The rate,

µt, is only equal to exp(xtβ) when yt−1 = 0.

Instead, Zeger and Qaqish [1988] suggest using the following model: For t =

1, 2,...,n, let

$$Y_t \mid D_t \overset{\text{indep.}}{\sim} \text{Poisson}(\mu_t)$$
$$\text{with} \quad \log(\mu_t) = x_t\beta + \sum_{i=1}^{q} \theta_i \left[\log(y^*_{t-i}) - x_{t-i}\beta\right], \qquad (3.15)$$

where $x_t$ is a vector of covariates and $y^*_{t-i} = \max(y_{t-i}, c)$ for a constant $c > 0$. Under this model,

positive autocorrelation is represented by positive values of θ1, and negative auto-

correlation is represented by negative values of θ1. The parameter c determines the

probability that yt > 0 given yt−1 = 0 because

$$P(y_t > 0 \mid y_{t-1} = 0) = 1 - \exp\!\left(-c^{\theta} e^{\beta(x_t - \theta x_{t-1})}\right).$$

We can see that when $c$ gets closer to zero, $P(y_t > 0 \mid y_{t-1} = 0)$ gets closer to zero, under positive autocorrelation.

Zeger and Qaqish [1988] give three reasons for preferring the function fi(Dt) =

$\log(y^*_{t-i}) - x_{t-i}\beta$ to $f_i(D_t) = y_{t-i}$. The first is that $\beta$ can be interpreted as the proportional change in the marginal expectation of $y_t$ per unit change in $x_t$, since the marginal mean for each $t$ is approximated by $E(y_t) \approx \exp(x_t\beta)$. The second

is that positive and negative associations are possible. If fi(Dt) = yt−i, then only

negative association is possible in the absence of covariates. The third reason for

the preference is that βˆ and θˆ1 are approximately orthogonal, which results in more

accurate estimation because of the implied independence in the coefficients.
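To make the observation-driven recursion concrete, the following minimal sketch (parameter values and data are purely illustrative, not from the UCR) simulates model (3.15) with $q = 1$ and the truncation $y^*_{t-1} = \max(y_{t-1}, c)$:

```python
# A minimal sketch simulating the observation-driven model (3.15) with
# q = 1 and the truncation y*_{t-1} = max(y_{t-1}, c).  The covariate
# term x_t beta and all parameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(5)
n, theta, c = 300, 0.4, 0.5
t = np.arange(n)
xb = 1.0 + 0.002 * t                 # x_t beta: intercept plus a slow trend

y = np.empty(n)
y[0] = rng.poisson(np.exp(xb[0]))
for i in range(1, n):
    y_star = max(y[i - 1], c)
    log_mu = xb[i] + theta * (np.log(y_star) - xb[i - 1])
    y[i] = rng.poisson(np.exp(log_mu))

print(y[:12])
```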

Perhaps it makes more sense, in some situations, for the mean at time $t$, $\mu_t$, to depend on past means, rather than only on past observations. This could be viewed as both an observation- and parameter-driven approach. Li [1994] considers a modeling approach that uses an analog of the classical moving average component with a GLM. The model of Zeger and Qaqish [1988] is extended by including a moving average component. Li

[1994] effectively enlarges Dt to include µt−1,...,µt−k for some k < n. The following

model is an example: For each t =1, 2,...,n, we assume

$$Y_t \mid D_t \overset{\text{indep.}}{\sim} \text{Poisson}(\mu_t),$$
$$\text{with} \quad \log(\mu_t) = \alpha + \theta\left[\log(y^*_{t-1}/\mu_{t-1})\right], \qquad (3.16)$$

and where $y^*_{t-1}$ is defined above. Li [1994] showed that realizations of this conditional

Poisson “moving average” model have an autocorrelation function that looks similar

to that of a first-order moving average process.

As an example in application, Li [1994] tried his approach by fitting four different

GLARMA models on counts of U.S. polio cases (data in Zeger [1988]). Two models

were second-order autoregressive models, and two were second-order moving average

models. The best-fitting model, (3.17), was the second-order moving

average model with trend modeled by βt. This model captures more of the auto-

correlation in the Polio data than Zeger’s autoregressive model. For t = 1, 2,...,n,

let

Y D indep. Poisson(µ ), t| t ∼ t

with log(µt)= α + βt + θ1[log(yt−1/µt−1)] + θ2[log(yt−2/µt−2)]. (3.17)

This is an example of how higher-order GLARMA models may be appropriate in some situations, and how using past means can sometimes improve the fit over an autoregressive model.

An alternative way that both the past mean and past observation can be incor- porated into f(Dt) is suggested by Davis et al. [2003]. The authors propose a model in which the log-mean process depends linearly on scaled deviations of previous ob- servations from their means. For t =1, 2,...,n, let

$$Y_t \mid D_t \overset{\text{indep.}}{\sim} \text{Poisson}(\mu_t),$$
$$\text{with} \quad \log(\mu_t) = x_t\beta + \theta\,(y_{t-1} - \mu_{t-1})\,\mu_{t-1}^{-\lambda}, \qquad (3.18)$$

where $0 < \lambda \le 1$ is a constant. For fixed $\lambda$, conditional maximum likelihood estimation of the model is straightforward, but the interpretation of the effect of covariates on the mean response is complicated. The advantage of this model is that it is stationary, even when the serial dependence is positive. Whereas the solution to nonstationarity in Zeger and Qaqish [1988] involves mean correction, Davis et al. [2003] use standardization.

Observation-driven models are usually better at forecasting than parameter-driven models, but it is more difficult to establish the asymptotic properties of observation- driven models [Davis et al., 1999]. We now take a closer look at parameter-driven models.

3.2.2 Parameter-driven Models

In parameter-driven models, a latent process $\{\epsilon_t\}$ characterizes the over-dispersion and the autocorrelation in $\{Y_t\}$. Conditional on the $\{\epsilon_t\}$, the observed counts are independent Poisson random variables with means

$$E(Y_t \mid \epsilon_t) = \exp(x_t\beta + \epsilon_t). \qquad (3.19)$$

We assume that the process $\{\epsilon_t\}$ is a strictly stationary time series. It might be chosen to be a Gaussian autoregressive moving average process. The autocorrelation in $\{Y_t\}$ must be less than or equal to the autocorrelation in $\{\epsilon_t\}$. One advantage of the parameter-driven model is that the $\{\epsilon_t\}$ can be defined to make $\mu_t = \exp(x_t\beta)$, so that $\beta$ has the interpretation of the proportional change in the marginal expectation of $Y_t$ given a unit change in $X_t$.
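A minimal sketch (illustrative parameter values only, not a fit to any real series) of simulating from the parameter-driven specification (3.19), with a stationary Gaussian AR(1) latent process:

```python
# A minimal sketch simulating from the parameter-driven model (3.19)
# with a stationary Gaussian AR(1) latent process {epsilon_t}; the
# parameter values are illustrative only.
import numpy as np

rng = np.random.default_rng(6)
n, phi, sigma = 300, 0.7, 0.3
t = np.arange(n)
xb = 1.0 + 0.3 * np.sin(2 * np.pi * t / 12)      # x_t beta with seasonality

eps = np.empty(n)
eps[0] = rng.normal(0, sigma / np.sqrt(1 - phi ** 2))    # stationary start
for i in range(1, n):
    eps[i] = phi * eps[i - 1] + rng.normal(0, sigma)

y = rng.poisson(np.exp(xb + eps))    # conditionally independent Poisson counts
print(y[:12])
```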

Parameter-driven models have been shown to be effective in accounting for serial

correlation in time series count data. Zeger [1988] uses a parameter-driven model with

first-order autoregressive errors on a time series of counts. Comparing the parameter-

driven model to log-linear models that assume independent observations, Zeger [1988]

found, through case study and simulation study, that the parameter-driven model,

which accounts for autocorrelation, leads to more valid inferences. Models that do not

account for autocorrelation tend to underestimate the standard errors of $\hat{\beta}$, the estimate of the parameter vector $\beta$.

Davis et al. [1999] offer a good review of time series models for counts. In compar- ing observation-driven models to parameter-driven models, the authors forewarn that parameter-driven models are difficult to forecast since serial dependence is modeled by the latent process, which is unobservable. Furthermore, there is no simple interpreta- tion for the latent process. Another drawback is that parameter estimation methods are typically more computationally intensive for parameter-driven models than for observation-driven models. One advantage of parameter-driven models is that the effects of the covariates on the response variable have straightforward interpretations.

As an example of an application of a parameter-driven model, Durbin and Koop-

man [2000] study the effect of seat-belt laws on road casualties using the following

model. Let Yt represent the count of road casualties for time t. For t = 1, 2,...,n, let

$$Y_t \mid D_t \overset{\text{indep.}}{\sim} \text{Poisson}(\mu_t),$$
$$\text{with} \quad \log(\mu_t) = x_t\beta + \gamma_t + \epsilon_t, \qquad (3.20)$$

where the trend $\epsilon_t = \epsilon_{t-1} + \eta_t$ is a random walk, $x_t$ is an indicator variable for the seat-belt law, and the monthly seasonal component $\gamma_t$ is generated by $\sum_{j=0}^{11} \gamma_{t-j} = \omega_t$. The processes $\{\omega_t\}$ and $\{\eta_t\}$ are mutually independent Gaussian white noise processes. For

the regression problem, it is desirable to construct a model with the property that µt

depends on the regression parameters but not on the latent process parameters. In

Chapter 5, we will begin to look at how parameter-driven models can be very useful

for the UCR data in a Bayesian setting.

3.2.3 Generalized Additive Models

Generalized additive models (GAM), introduced by Hastie and Tibshirani [1986],

replace the linear predictor in generalized linear models with an additive predic-

tor. That is to say, generalized additive models consist of a random component, an

additive component and a link function relating these two components. The addi-

tive component is usually defined by the sum of nonparametric smooth functions

of the covariates. Then, our response variable, Y , depends on predictor variables

$(X_1, X_2, \ldots, X_p)$ in a more nonparametric way. Hastie and Tibshirani [1986] estimate the smooth functions by the local scoring algorithm using scatterplot smoothers. In

the Poisson case, a generalized additive model could look like the following: For

$t = 1, 2, \ldots, n$, let

$$Y_t \mid D_t \overset{\text{indep.}}{\sim} \text{Poisson}(\mu_t)$$
$$\text{where} \quad \log(\mu_t) = s_0 + \sum_{j=1}^{p} s_j(X_t). \qquad (3.21)$$

In the above equation, the $s_j(\cdot)$'s are smooth functions and the vector $X_t$ contains the covariates for each $t$. Advantages of generalized additive models are flexibility and improved predictive ability [Hastie and Tibshirani, 1986]. GAMs are very useful if the effects of the covariates are nonlinear. For example, trend and seasonality can be modeled by the smooth functions.
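In the spirit of (3.21), the sketch below fits a Poisson model whose predictor is a smooth function of time, approximated here by an unpenalized B-spline basis through the patsy formula interface rather than the local scoring algorithm of Hastie and Tibshirani; the library usage and data are assumptions for illustration only.

```python
# A minimal sketch in the spirit of (3.21): a Poisson model whose
# predictor is a smooth function of time, approximated here by an
# unpenalized B-spline basis (patsy's bs()) rather than the local
# scoring algorithm; library usage and data are assumptions.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 240
t = np.arange(n)
mu = np.exp(1.0 + 0.5 * np.sin(2 * np.pi * t / 120))    # slowly varying trend
df = pd.DataFrame({"y": rng.poisson(mu), "t": t})

fit = smf.glm("y ~ bs(t, df=8)", data=df,
              family=sm.families.Poisson()).fit()
print(fit.fittedvalues.head())    # smooth fitted means over time
```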

Often in practice, a mixture of the GLM and GAM will be used. A typical model will include both a linear component and an additive component. In Dominici et al.

[2000] and in Dominici et al. [2004], a log-linear generalized additive model is used to estimate the relative change in the rate of mortality associated with changes in air pollution. This model includes smooth functions of calendar time and weather as predictors to account for trend and seasonality. For t =1, 2,...,n, let

$$Y_t \mid D_t \overset{\text{indep.}}{\sim} \text{Poisson}(\mu_t)$$
$$\text{where} \quad \log(\mu_t) = x_t\beta + \sum_{j=1}^{p} s_j(Z_t) \qquad (3.22)$$

and where $\beta$ is the regression parameter for the pollution covariates $x_t$ and the $s_j(\cdot)$'s are the smooth functions to be estimated for factors, $Z_t$, of time and weather.

In both the Markov models and the smoothing models, the interpretation of β is usually the same, but the interpretations of the seasonal and trend effects may be different. The smoothing functions, however, like nuisance parameters, are mainly used to adjust for confounding factors. The model is not autoregressive but, in

Dominici et al. [2004], the autocorrelation functions showed that the GAM did in fact account for time dependence in Y. We can conclude, therefore, that observation-driven or parameter-driven models are not the only ways to sufficiently account for serial correlation in time series data.

3.3 Missing Data in Time Series

In this section, we discuss two ways of dealing with missing data in time series.

Typical goals are to estimate the model parameters and to estimate the missing values.

We present the basic modeling strategies for time series with missing observations, and we describe how imputations can be made for the missing values themselves.

3.3.1 Modeling

To deal with missing values, Brockwell and Davis [2002] use state-space models and Kalman recursions. Minimum mean squared error (MMSE) state estimates are obtained using the Kalman fixed-point smoothing algorithm. This technique involves defining a new series $\{Y^*_t\}$, which is given a state-space representation which coincides with $\{Y_t\}$ for observed times and at unobserved times takes random values from a distribution independent of $\theta$. Applying Kalman recursions will let us find the likelihood function of the observed data, from which we can derive maximum likelihood estimates of the model parameters. This technique can also be used to estimate the missing values themselves.
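A minimal sketch of this idea, assuming statsmodels' state-space machinery (the local-level specification and the `smoothed_state` attribute are assumptions about that library, not the dissertation's method): observations entered as NaN are handled by the Kalman filter, and the smoothed state over the gap provides MMSE-style imputations that use data on both sides.

```python
# A minimal sketch, assuming statsmodels' state-space machinery: NaN
# observations are handled by the Kalman filter, and the smoothed state
# over the gap provides MMSE-style imputations using data on both sides.
# (This illustrates the idea; it is not the dissertation's procedure.)
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 120
level = 5.0 + np.cumsum(rng.normal(0, 0.2, n))
y = level + rng.normal(0, 0.5, n)
y_obs = y.copy()
y_obs[40:46] = np.nan                      # a six-month gap of missing data

model = sm.tsa.UnobservedComponents(y_obs, level="local level")
result = model.fit(disp=False)

imputed = result.smoothed_state[0, 40:46]  # smoothed level at the gap
print(np.round(imputed, 2))
print(np.round(y[40:46], 2))               # the held-out true values
```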

EM Algorithm

Little and Rubin [2002] and Brockwell and Davis [2002] discuss the EM algorithm approach to fitting time series models for incomplete data. This iterative approach,

which applies most suitably to the state-space models described in Section 3.2, pro-

vides maximum likelihood estimates, with the help of Kalman recursions, assuming

the data are MAR. The E-step, or expectation step, imputes the missing observations,

$Y_{\text{mis}}$, with conditional expected values $E(Y_{\text{mis}} \mid Y_{\text{obs}}, \theta^{(t)})$. The M-step, or maximization step, calculates the maximum likelihood estimates of the model parameters based

on the filled-in data from the E-step. After the final iteration of the EM algorithm,

not only are the model parameters estimated, but the missing values are estimated

as well. Little and Rubin [2002] apply the EM algorithm to data with missing obser-

vations from both an AR(p) model and a state-space model.

3.3.2 Imputation

We have already mentioned that there are techniques to estimate the model parameters, such as Kalman recursions and the EM algorithm, which also provide estimates of the missing values themselves. Let $Y_m$ be a value missing from the time series. Let $Y_a = \{Y_1, Y_2, \ldots, Y_{m-1}\}$ represent the data prior to the missing value and let $Y_b = \{Y_{m+1}, Y_{m+2}, \ldots, Y_n\}$ represent the data following the missing value. The best interpolator for the missing value, $Y_m$, in the time series is

$$\hat{Y}_m = E(Y_m \mid Y_a, Y_b). \qquad (3.23)$$

For simple models, and simple cases of missing observations, we can use a direct approach to estimating missing values by minimizing the mean square prediction error (MSPE) [Brockwell and Davis, 2002]. In more complicated situations, we can use Kalman filtering, or if we are willing to forfeit minimum MSPE, we can adopt a simple ad hoc strategy.

Abraham [1981] proposes a simple method of imputation for time series. The data

prior to the missing value is used to create a forecast estimate and the data

following the missing value is used to create a backcast estimate. The value imputed

is a linear combination of the forecast, $\hat{Y}_m^{(f)}$, and backcast, $\hat{Y}_m^{(b)}$, in the following equation, where the weights, $w_1$ and $w_2$, are generally functions of the parameters of

the known underlying time series process.

$$\tilde{Y}_m = w_1\, E(Y_m \mid Y_a) + w_2\, E(Y_m \mid Y_b) = w_1\, \hat{Y}_m^{(f)} + w_2\, \hat{Y}_m^{(b)} \qquad (3.24)$$

The weights, $w_1$ and $w_2$, are found by minimizing the MSPE of $\tilde{Y}_m$. For multiple

missing values, the author advises that if the missing values are not adjacent, then

there should be a sufficient number of observations between the missing values to

build a time series model. Abraham [1981] showed that for any first difference model,

the estimate of the missing value is the average of the forecast and backcast. Abraham

concluded that there was little effect of the imputed values, using his technique, on the

parameter estimates and on forecasts of future observations. For some simple models

(e.g. ARMA(1,1)), Abraham’s method produces minimum MSPE interpolators. In

general, however, the MSPE of the linear combination (3.24) is always at least as

large as the MSPE for the best interpolator. In Chapter 4, we compare the MSPE

for Abraham’s interpolator with that of the best interpolator. In Chapter 4, we will

also describe and illustrate how we use Abraham’s technique to make imputations for

missing values in the UCR.

CHAPTER 4

A METHOD FOR IMPUTING

In the late 1950’s, when the FBI first developed an imputation method to deal with missing data in the UCR, computing capabilities were very limited, so the method was designed to be simple to implement. Over the last sixty years, computing power has increased significantly, but the FBI has not changed its method for imputing. Since we are no longer limited to only using one year of data in our imputations’ calculations, we might consider what benefit a more complicated modeling approach can bring to our predictions. Our goal in developing a new UCR imputation method is to improve upon the FBI’s existing method by incorporating more available information into our estimates. By using an agency’s reported crime counts prior to and following a period of non-reporting, we can provide better estimates of the missing crime counts. We describe this strategy in Section 4.1.

The UCR time series vary considerably across agencies. Some agencies have a great deal of seasonal variation in crime counts while others have none. Some are in cities with high population growth while others are in cities that have experienced population declines. More importantly from our standpoint, some have high monthly crime counts, so that the probability of a zero count is vanishingly small, while others have sparse crime counts. Because of the high degree of variation in crime counts from

jurisdiction to jurisdiction, as well as from crime to crime within a jurisdiction, we employed different methods for different situations. Figure 4.1 shows an example from

Cleveland Heights, Ohio, which illustrates the variability among series. We can see that trend and seasonality are difficult to identify in the time series for murder counts, but the time series for larceny counts shows a visible crime pattern and seasonality.

From Figure 4.1, we also notice that there is some correlation among crime series within an agency (particularly between robbery and larceny), and we will investigate this relationship further in Chapter 5.

Fundamental to our procedure is the idea of fitting models to the data. Finding that a “one-size-fits-all” approach cannot be used for all UCR series, we have devel- oped a method which uses three different models, depending on the statistics of the agency and crime in question. For series that have high crime counts (such as larceny counts for Cleveland Heights), we found that a SARIMA time series model was most useful. For series with lower crime counts, we chose models appropriate for discrete data. For series with intermediate crime counts (such as robbery counts for Cleveland

Heights), we used a Poisson regression model. Otherwise (for series such as murder counts for Cleveland Heights), we imputed the mean monthly value averaged over all available data and assumed a Poisson distribution of crime counts. We describe our three-model method in Section 4.2. In Section 4.3, we discuss how to handle special cases in the data, including “covering” jurisdictions and aggregated crime counts.


Figure 4.1: Murder, robbery, and larceny counts from Cleveland Heights, OH.

4.1 Imputation Procedure

For the most part, our goal is to develop interpolation rather than extrapolation methods. Missing data at the beginning or end of a series might be a misrepresentation of a non-existent agency, and so to avoid confusion, we only make imputations for missing data within the time series. In other words, we are focusing on using data from the early and late periods of the time series to fill in the gaps in the middle.

In this section, we develop our procedure on the basis of three criteria. Our

imputation procedure must make reasonable sense, must provide plausible estimates,

and must be practical to implement. The logic of our strategy is described in Section

4.1.1, the evaluation is described in Section 4.1.2, and the implementation is described

in Section 4.1.3.

4.1.1 The Forecast–Backcast Method

As first discussed in Section 3.3, Abraham [1981] takes a simple imputation approach that uses forecasting and backcasting estimates from time series models. Forecasting is using prior data to predict the future, and backcasting is using later data to "predict" the past. For intermittent missing values, when forecasting and backcasting are combined, Abraham's method provides estimates with lower standard errors than an estimation procedure that is based on prior or subsequent data points alone.

Let $\hat{y}_t^{(f)}$ be the forecast and $\hat{y}_t^{(b)}$ be the backcast for time $t$, with $\mathrm{s.e.}(\cdot)$ representing the standard error of the estimate. We combine the two estimates using a weighted average. Our forecast-backcast predictor (for competing forecasts) is given by
$$\tilde{y}_t = \theta\, \hat{y}_t^{(f)} + (1 - \theta)\, \hat{y}_t^{(b)} \qquad (4.1)$$
with variance
$$\mathrm{var}(\tilde{y}_t) = \theta^2\, \mathrm{var}(\hat{y}_t^{(f)}) + (1 - \theta)^2\, \mathrm{var}(\hat{y}_t^{(b)}), \qquad (4.2)$$

assuming that the forecast and backcast are independent. The parameter $\theta$ can be derived to obtain the minimum mean square prediction error (MSPE) as in Diebold [1988]. The more general forecast-backcast estimator is
$$\tilde{y}_t = w_1\, \hat{y}_t^{(f)} + w_2\, \hat{y}_t^{(b)}. \qquad (4.3)$$

For simple time series models, such as AR(1), Abraham [1981] has shown that general predictors (4.3), based on forecasts and backcasts, can have minimum MSPE. In Section 4.1.2, we evaluate the accuracy of two competing predictors for a simple case.

For other models, such as the SARIMA model in Chapter 2, finding the best linear unbiased predictor (BLUP) is feasible but complicated. Taking an ad hoc strategy, we decided to give more weight in the combined estimate to the more accurate predictor (either forecast or backcast). The weighting was chosen so that the prediction with the lower standard error would be given greater weight in the composite estimate. Hence, we let
$$\theta = \frac{\mathrm{s.e.}(\hat{y}_t^{(b)})}{\mathrm{s.e.}(\hat{y}_t^{(f)}) + \mathrm{s.e.}(\hat{y}_t^{(b)})}$$
in Equations 4.1 and 4.2.
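
As a small illustration of this weighting scheme (not taken from the dissertation's own code; the function and variable names are hypothetical), the following R sketch combines a forecast and a backcast for one missing month using the standard-error-based weight defined above.

# Combine a forecast and a backcast for one missing month, weighting by
# their standard errors as in Equations 4.1 and 4.2.
combine_fb <- function(yhat_f, se_f, yhat_b, se_b) {
  theta <- se_b / (se_f + se_b)                       # more weight to the more precise predictor
  est   <- theta * yhat_f + (1 - theta) * yhat_b      # Equation 4.1
  v     <- theta^2 * se_f^2 + (1 - theta)^2 * se_b^2  # Equation 4.2
  list(estimate = est, se = sqrt(v))
}

# Example: a forecast of 52 crimes (s.e. 6) and a backcast of 47 crimes (s.e. 4)
combine_fb(yhat_f = 52, se_f = 6, yhat_b = 47, se_b = 4)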

4.1.2 Evaluating the Combined Estimator

To evaluate the forecast-backcast predictor against the best predictor, let us consider a simple AR(1) process with parameter $\phi$. Let $\{Y_1, Y_2, Y_3, \ldots, Y_n\}$ be a time series generated from the following data model, where $\phi$ is known, $\sigma^2 = 1$, and $Y_1 \sim N(0, \sigma^2/(1-\phi^2))$ is the stationary starting condition. Let
$$Y_t = \phi Y_{t-1} + Z_t \quad \text{for } t = 2, 3, \ldots, n,$$
where $\{Z_t \mid t = 2, 3, \ldots, n\} \stackrel{\text{indep.}}{\sim} N(0, \sigma^2)$. Suppose that $Y_k$ is missing for some $1 < k < n$. Then the best linear predictor of $Y_k$ has the form $\hat{Y}_k = a(Y_{k-1} + Y_{k+1})$ [Brockwell and Davis, 2002, Section 8.6]. The predictor is derived by finding $a$ to minimize the MSPE, which is $E(Y_k - \hat{Y}_k)^2$. Recall that $\gamma(\cdot)$ is the autocovariance function, and $\gamma(0)$ represents the variance of $\{Y_t\}$, which is $\sigma^2/(1-\phi^2)$. We then have
$$\begin{aligned}
\mathrm{MSPE}(\hat{Y}_k) &= E(Y_k - a(Y_{k-1} + Y_{k+1}))^2 \\
&= \gamma(0) - 2E(aY_k(Y_{k-1} + Y_{k+1})) + E(a(Y_{k-1} + Y_{k+1}))^2 \\
&= \gamma(0) - 4a\gamma(1) + a^2(2\gamma(0) + 2\gamma(2)) \\
&= \gamma(0)\left[1 - 4a\phi + 2a^2(1+\phi^2)\right]. \qquad (4.4)
\end{aligned}$$
Taking the derivative of (4.4) with respect to $a$, and setting it equal to zero, we solve

the following equation for $a$:
$$\gamma(0)\left[-4\phi + 4a(1+\phi^2)\right] = 0,$$
to get $a = \dfrac{\phi}{1+\phi^2}$. The second derivative, $4\gamma(0)(1+\phi^2)$, is positive, so this $a$ produces the minimum MSPE, and the best linear predictor of $Y_k$ is
$$\hat{Y}_k = \frac{\phi}{1+\phi^2}(Y_{k-1} + Y_{k+1}). \qquad (4.5)$$
This predictor has MSPE found by substituting $a$ into (4.4):
$$\begin{aligned}
\mathrm{MSPE}(\hat{Y}_k) &= \gamma(0)\left[1 - 4\left(\frac{\phi}{1+\phi^2}\right)\phi + 2\left(\frac{\phi}{1+\phi^2}\right)^2 (1+\phi^2)\right] \\
&= \gamma(0)\left(\frac{(1+\phi^2) - 4\phi^2 + 2\phi^2}{1+\phi^2}\right) \\
&= \gamma(0)\left(\frac{1-\phi^2}{1+\phi^2}\right) \\
&= \frac{\sigma^2}{1+\phi^2}. \qquad (4.6)
\end{aligned}$$

Using the forecast-backcast method of Abraham [1981], our predictor of $Y_k$ has the form
$$\tilde{Y}_k = c\,(\hat{Y}_k^{(f)} + \hat{Y}_k^{(b)}) \qquad (4.7)$$
since $\phi$ and $\sigma^2$ are assumed to be known. The forecast and backcast receive the same weight because there is only one missing observation. If we reversed the order of our AR(1) time series, then the backward series would still be AR(1) with the same parameters $\phi$ and $\sigma^2$. This results in $\hat{Y}_k^{(f)}$ and $\hat{Y}_k^{(b)}$ being of the same form. The forecast will be based on the observed value preceding the missing value,
$$\hat{Y}_k^{(f)} = E(Y_k \mid Y_1, \ldots, Y_{k-1}) = \phi Y_{k-1},$$
and the backcast will be based on the observed value following the missing value,
$$\hat{Y}_k^{(b)} = E(Y_k \mid Y_{k+1}, \ldots, Y_n) = \phi Y_{k+1}.$$
The forecast-backcast predictor is found by minimizing the MSPE of the combined estimator,
$$\mathrm{MSPE}(\tilde{Y}_k) = E\left(Y_k - c\,(\phi Y_{k-1} + \phi Y_{k+1})\right)^2. \qquad (4.8)$$
Here, we recognize from (4.4) that if $c\phi = a$, then we will again have the BLUP of $Y_k$.

Now, let us consider a simpler approach. Suppose that instead of finding the forecast-backcast estimator with the minimum MSPE, we decided only to take the simple average of the forecast and backcast, still assuming that we know $\phi$ and $\sigma^2$. The MSPE found by substituting $a = \phi/2$ into (4.4) is
$$\mathrm{MSPE}(\tilde{Y}_k) = \gamma(0)\left[1 - 4\left(\frac{\phi}{2}\right)\phi + 2\left(\frac{\phi^2}{4}\right)(1+\phi^2)\right] = \gamma(0)\left(1 - \frac{3\phi^2}{2} + \frac{\phi^4}{2}\right). \qquad (4.9)$$

Figure 4.2: Dashed line represents average forecast-backcast MSPE of one missing value in AR(1) process. Solid line represents the MSPE from using the BLUP of $Y_k$ given all the data.

Figure 4.2 compares the MSPE under the AR(1) model for the BLUP of $Y_k$ given all the data with the MSPE from the average forecast-backcast method, when $\sigma^2 = 1$. We can see that the average forecast-backcast method comes close to the accuracy of the best linear predictor, in the AR(1) case, when $\phi$ is close to 0, +1, or -1.
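
The two curves in Figure 4.2 can be reproduced directly from Equations 4.6 and 4.9. The R sketch below is an illustration only (it is not the code used to draw the figure); it evaluates both expressions with $\sigma^2 = 1$ over a grid of $\phi$ values.

# MSPE of the BLUP (Equation 4.6) and of the simple average forecast-backcast
# predictor (Equation 4.9) for one missing value in an AR(1) process, sigma^2 = 1.
phi       <- seq(-0.99, 0.99, by = 0.01)
gamma0    <- 1 / (1 - phi^2)                            # process variance gamma(0)
mspe_blup <- 1 / (1 + phi^2)                            # Equation 4.6
mspe_avg  <- gamma0 * (1 - 1.5 * phi^2 + 0.5 * phi^4)   # Equation 4.9

plot(phi, mspe_avg, type = "l", lty = 2, ylim = c(0.5, 1),
     xlab = "AR(1) Parameter", ylab = "MSPE")
lines(phi, mspe_blup, lty = 1)   # solid: BLUP; dashed: average forecast-backcast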

Now suppose that we had a longer period of missingness. Let $\{Y_{k-1}, Y_k, Y_{k+1}\}$ be missing from the time series. To get the best linear predictor for $Y_k$, we find $a$ to minimize the MSPE as shown below:
$$\begin{aligned}
\mathrm{MSPE}(\hat{Y}_k) &= E(Y_k - a(Y_{k-2} + Y_{k+2}))^2 \\
&= E(Y_k^2) - 2E(aY_k(Y_{k-2} + Y_{k+2})) + E(a(Y_{k-2} + Y_{k+2}))^2 \\
&= \gamma(0) - 2a(\gamma(2) + \gamma(2)) + a^2(\gamma(0) + 2\gamma(4) + \gamma(0)) \\
&= \gamma(0) - 4a\gamma(2) + 2a^2(\gamma(0) + \gamma(4)) \\
&= \gamma(0)\left(1 - 4a\phi^2 + 2a^2(1+\phi^4)\right). \qquad (4.10)
\end{aligned}$$
Differentiating (4.10) with respect to $a$ and setting it equal to zero, we solve the equation
$$\gamma(0)\left[-4\phi^2 + 4a(1+\phi^4)\right] = 0, \qquad (4.11)$$
to get $a = \dfrac{\phi^2}{1+\phi^4}$. Since the second derivative of (4.10), $4(1+\phi^4)$, is positive, this $a$ produces the minimum MSPE. Now, the BLUP of $Y_k$ is
$$\hat{Y}_k = \frac{\phi^2}{1+\phi^4}(Y_{k-2} + Y_{k+2}) \qquad (4.12)$$
with
$$\begin{aligned}
\mathrm{MSPE}(\hat{Y}_k) &= \gamma(0)\left[1 - 4\left(\frac{\phi^2}{1+\phi^4}\right)\phi^2 + 2\left(\frac{\phi^2}{1+\phi^4}\right)^2(1+\phi^4)\right] \\
&= \gamma(0)\left(1 + \frac{-4\phi^4 + 2\phi^4}{1+\phi^4}\right) \\
&= \gamma(0)\left(1 - \frac{2\phi^4}{1+\phi^4}\right). \qquad (4.13)
\end{aligned}$$

Now, applying the average forecast-backcast method, the best linear predictor of $Y_k$ in terms of $Y_{k-2}$ is $\hat{Y}_k^{(f)} = \phi^2 Y_{k-2}$, and the best linear predictor of $Y_k$ in terms of $Y_{k+2}$ is $\hat{Y}_k^{(b)} = \phi^2 Y_{k+2}$, so the average forecast-backcast predictor is
$$\tilde{Y}_k = \frac{\phi^2}{2}(Y_{k-2} + Y_{k+2})$$
with
$$\begin{aligned}
\mathrm{MSPE}(\tilde{Y}_k) &= \gamma(0)\left[1 - 4\left(\frac{\phi^2}{2}\right)\phi^2 + 2\left(\frac{\phi^2}{2}\right)^2(1+\phi^4)\right] \\
&= \gamma(0)\left(1 - 4\frac{\phi^4}{2} + 2\frac{\phi^4}{4} + 2\frac{\phi^8}{4}\right) \\
&= \gamma(0)\left(1 - \frac{3\phi^4}{2} + \frac{\phi^8}{2}\right). \qquad (4.14)
\end{aligned}$$

Figure 4.3: Dashed line represents average forecast-backcast MSPE of missing value in gap in AR(1) process. Solid line represents the MSPE from using the BLUP of $Y_k$ given all the data.

Figure 4.3 shows the comparison between the minimum MSPE under an AR(1) model and the MSPE from the forecast-backcast method of predicting $Y_k$, when $\{Y_{k-1}, Y_k, Y_{k+1}\}$ are missing and $\sigma^2 = 1$. In both figures, when $\phi \approx 0$, the MSPE is close to $\sigma^2$. Also, in both figures, when $\phi$ moves toward +1 or -1, the MSPE decreases because of the increasing correlation between values in the process. Still, Figure 4.3 looks quite different from Figure 4.2 for intermediate values of $\phi$.

When a missing value is surrounded on both sides by observed values, then we expect our MSPE to be smaller than $\sigma^2$ because we have more information, and we have greater dependence between values closer in the process. When a missing value is in the midst of a string of missing values, then the MSPE will be higher, because we have less certainty about values more removed from observed data. Based on the above analysis, we have decided that we will use a simple combination of forecasts and backcasts. Taking a Bayesian approach in Chapter 6 will allow us to directly estimate missing values using the entire observed portion of the data series.

Another way we have tested our proposed imputation method is to simulate missingness in UCR series by deleting known data points from the time series and then using various models to estimate the deleted data. We used mean square prediction error to assess the overall accuracy of our models' predictions. We also inspected the results of different imputation procedures by looking at plots of the observed series with the imputed values. Examples of this are shown in Section 4.2.
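
The deletion-based evaluation can be summarized in a few lines of R. The sketch below is a simplified illustration under stated assumptions: the function passed as impute_series is a hypothetical stand-in for whichever imputation method is being assessed, and the toy series and gap are made up.

# Evaluate an imputation method by deleting observed values and comparing
# the imputed values with the truth.
evaluate_imputation <- function(y, gap, impute_series) {
  truth          <- y[gap]
  y_missing      <- y
  y_missing[gap] <- NA                        # artificially delete a block of months
  y_imputed      <- impute_series(y_missing)  # must return a completed series
  err            <- y_imputed[gap] - truth
  c(bias = mean(err), mspe = mean(err^2))
}

# Toy example using mean imputation as the method under study
set.seed(1)
y   <- rpois(120, lambda = 20)
gap <- 61:72
evaluate_imputation(y, gap, function(z) { z[is.na(z)] <- mean(z, na.rm = TRUE); z })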

4.1.3 Imputation Algorithm

In the previous section, we showed that for the case of an AR(1) process, the average forecast-backcast method is not the best way, in terms of MSPE, to impute for missing values. In comparing the MSPE for both methods, we have seen that they are relatively similar, especially for φ close to 0, -1, or +1. There are some difficulties in finding the BLUP for a more complicated model, like the SARIMA model we have fit to UCR data. For simplicity in implementation, we have adopted the forecast-backcast method to make our imputations. The forecast-backcast method has been shown to be better than forecasting or backcasting individually.


Figure 4.4: Incomplete series of vehicle theft counts from Allen, OH.

We now describe the general structure of the algorithms we used to make imputations in the UCR data. It is important that our algorithm be able to accommodate most series with missingness. The characteristics of missingness, therefore, must be carefully considered before constructing an imputation algorithm. Some series we have seen are impossible, even for our algorithm, to handle. These are series that are either completely missing, or are only observed for a small subset of months, that are not necessarily even consecutive. These cases might be handled by methods such as hot-deck imputation techniques from survey research, which we do not employ here.

Multivariate methods, which use data from other agencies, can also be used in these situations. Figure 4.4 shows a typical time series with intermittent periods of missingness. In this example from Allen, Ohio, we see several missing gaps of different lengths. It is for series with this appearance of missingness that we will construct our algorithm. From here, we must decide what observed data we should use to impute for each gap.

One barrier to estimating model parameters is the presence of outliers. Many outliers in the UCR are suspected of being processing errors, and many others are the result of aggregated crime counts. Processing errors must be dealt with outside of this research, but adjusting for aggregated crime counts is described in Section 4.3. The goal of our procedure is not to remove or replace outliers, but we do exclude them during the modeling step, to provide better estimates for missing values.

For each missing gap, two models are fit, provided that enough data is available surrounding the missing period. The fitted model from data prior to the gap will be used to make forecasts, and the fitted model from data following the gap will be used to make backcasts. For each missing month, we then calculate a weighted average (described in Section 4.1.1) of the two imputations for our final estimate, reducing the variance from that obtained using a single estimate.

All imputation algorithms were written in the R statistical programming language. Model parameters are estimated in R using the standard techniques discussed in Chapter 2. The exact Gaussian log-likelihood for an ARIMA model is calculated in R by using Kalman filtering on the state-space representation of the process [R Development Core Team, 2008]. The fitting of the ARIMA model in R allows for missing values. GLM models are fit in R by using Iteratively Re-weighted Least Squares after omitting missing values. This is important because it allows us to use incomplete data on either side of the gap of interest, and it prevents us from using imputed data for some gaps to fill in other gaps.
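
A stripped-down version of this forecast-backcast step for a single gap might look like the R sketch below. It is only an outline under simplifying assumptions (a fixed SARIMA order and several years of data on each side of the gap); the production algorithm described above handles many additional cases, and the function name is hypothetical.

# Impute one gap by fitting a model before and after it, then averaging
# forecasts and backcasts with standard-error weights (Section 4.1.1).
impute_gap <- function(y, gap, order = c(1, 1, 1),
                       seasonal = list(order = c(1, 0, 1), period = 12)) {
  h      <- length(gap)
  before <- y[seq_len(min(gap) - 1)]
  after  <- y[seq(max(gap) + 1, length(y))]

  # Forecast from the model fit to data before the gap
  fit_f <- arima(before, order = order, seasonal = seasonal)
  fc    <- predict(fit_f, n.ahead = h)

  # Backcast: reverse the later data, forecast, then reverse the predictions
  fit_b   <- arima(rev(after), order = order, seasonal = seasonal)
  bc      <- predict(fit_b, n.ahead = h)
  bc_pred <- rev(bc$pred)
  bc_se   <- rev(bc$se)

  theta <- bc_se / (fc$se + bc_se)   # weight from Section 4.1.1
  list(estimate = theta * fc$pred + (1 - theta) * bc_pred,
       se       = sqrt(theta^2 * fc$se^2 + (1 - theta)^2 * bc_se^2))
}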

4.2 The Three Models

We have seen throughout the course of our research that SARIMA models are appropriate for series with large crime counts. We chose to implement our imputation procedure based on the SARIMA model for series with an average monthly crime count greater than 35. If the average monthly crime count for a series was less than 35, we used a Poisson GLM to model crime counts, and we based our imputations on this model. However, we saw that for series where crime counts were very low, trend and seasonality are sometimes impossible to identify. We decided to impute the observed mean for missing values within a series that had an average monthly crime count less than

1. We then used our imputation strategy to impute for missing gaps for every agency in Iowa. Based on a visual inspection of the predictions, in comparison with observed crime counts within a series, we concluded that our procedure performed well. We do not claim that the break points of 1 and 35 are optimal, but we do suggest that they are reasonable break points for our three-model imputation method.
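
The decision rule itself is simple to state in code. The R sketch below only illustrates the break points; it returns a label for which of the three models would be used for a given series (it does not perform the imputation).

# Choose an imputation model from the observed average monthly crime count.
choose_model <- function(y) {
  m <- mean(y, na.rm = TRUE)
  if (m < 1)       "mean imputation with Poisson variance"
  else if (m < 35) "Poisson GLM"
  else             "SARIMA (1,1,1)x(1,0,1)[12]"
}

choose_model(c(0, 1, 0, 0, 2, NA, 0))   # small counts -> mean imputation
choose_model(rpois(60, 12))             # intermediate -> Poisson GLM
choose_model(rpois(60, 80))             # large counts -> SARIMA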

4.2.1 Large Crime Counts

For crime counts averaging above 35 crimes per month, we used estimates from SARIMA time series models. In our preliminary analysis of the UCR, described in Chapter 2, we ultimately chose to use the SARIMA $(1,1,1)\times(1,0,1)_{12}$ model for series with large crime counts. SARIMA models require the assumption that the data are approximately normally distributed, and with over 35 crimes per month on average, this assumption is reasonable. This model uses first-order lag-1 differencing of the data and supposes first-order AR and MA components at both lag 1 and lag 12. In this way, seasonality is allowed to change over time. We define $\{Y_t\}$ as the crime count at time $t$, $\{Z_t\} \stackrel{\text{IID}}{\sim} \text{Normal}(0, \sigma^2)$ is a random error term, and $B$ is the backward shift operator. We let
$$\phi(z) = 1 - \phi_1 z, \quad \Phi(z) = 1 - \Phi_1 z, \quad \theta(z) = 1 + \theta_1 z, \quad \Theta(z) = 1 + \Theta_1 z.$$
Then the form of the SARIMA $(1,1,1)\times(1,0,1)_{12}$ is given by Equation 4.15:
$$\phi(B)\,\Phi(B^{12})\,(1-B)\,Y_t = \theta(B)\,\Theta(B^{12})\,Z_t. \qquad (4.15)$$
This is equivalent to the following model for the lag-1 differenced data $X_t = Y_t - Y_{t-1}$:
$$X_t = \phi X_{t-1} + \Phi X_{t-12} - \phi\Phi X_{t-13} + Z_t + \theta Z_{t-1} + \Theta Z_{t-12} + \theta\Theta Z_{t-13}. \qquad (4.16)$$
The process $\{X_t\}$ is causal if and only if $\phi(z) \neq 0$ and $\Phi(z) \neq 0$ for $|z| \leq 1$ [Brockwell and Davis, 2002, Section 6.5]. Under this model, the crime count for month $t$ depends on the crime count for the previous month as well as the crime count for the same month in the previous year. The moving average of the latent process also depends on the Gaussian IID random variables at lags 1 month, 12 months, and 13 months.
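
In R this model can be fit with the built-in arima function, which computes the Gaussian likelihood by Kalman filtering and tolerates missing values in the series. The sketch below is an illustration on simulated data (the series and object names are hypothetical), not the dissertation's own fitting code.

# Fit the SARIMA (1,1,1)x(1,0,1)[12] model of Equation 4.15 to a monthly
# series of log crime counts, then forecast twelve months ahead.
set.seed(2)
lambda     <- 600 + 50 * sin(2 * pi * (1:240) / 12)     # seasonal Poisson means
log_counts <- ts(log(rpois(240, lambda)), start = c(1960, 1), frequency = 12)

fit <- arima(log_counts, order = c(1, 1, 1),
             seasonal = list(order = c(1, 0, 1), period = 12))

fcast <- predict(fit, n.ahead = 12)   # forecasts and their standard errors
fcast$pred + 1.96 * fcast$se          # upper 95% prediction bounds (log scale)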

In Chapter 2, we fit the SARIMA $(1,1,1)\times(1,0,1)_{12}$ model to the log-transformed larceny crime counts from Columbus, Ohio. A logarithmic transformation of larceny counts was used because the variance of larceny counts increased as the average monthly crime count increased. The transformed counts exhibited a more stable variance. Now we use the model to make imputations using the forecast-backcast method. Figure 4.5 shows the log larceny counts from Columbus. The section of data from December 1979 to December 1981 was removed from the time series to illustrate our imputation procedure. Examples of imputation for the missing data when forecasting, backcasting, and both are used are shown in Figure 4.6. Forecasts are less accurate near the end of the missing period and backcasts are less accurate near the beginning of the missing period, but the two are averaged to give improved estimates in Figure 4.6. The arima function in R also provides forecasting variance estimates via Kalman filtering. We use these variance estimates in our calculation of the variance for the combined estimate. The 95% prediction bounds are then calculated by $\hat{Y}_t \pm 1.96 \times \mathrm{s.e.}(\hat{Y}_t)$. We see in Figure 4.6 that the bounds increase noticeably as we move away from the data used for forecasts or backcasts, but are more stable in the estimates from the weighted combination of the forecasts and backcasts. In this example, backcasts seem to be less accurate but have lower standard errors of prediction than the forecasts.

Figure 4.5: Series of log larceny counts in Columbus, OH, with dashed lines delineating missing period.

The backcasts are less accurate because the missing crime counts appear to more closely follow the pattern in the earlier half of the series. Backcasts here have lower standard errors of prediction, it appears, because there is less unexplained variation in the latter half of the series. The imputations and confidence bounds are shown on the original scale in Figure 4.7. Table 4.1 shows a comparison of average bias and MSPE for the Columbus example. We see that the forecast-backcast estimator produces a smaller MSPE than either the forecast or backcast used individually.

Figure 4.6: Forecasting/Backcasting of missing values of log larceny counts in Columbus, OH (panels A: Forecasting; B: Backcasting; C: Forecasting and Backcasting). The line represents original data, and the circles are estimated from the model. Plus signs represent 95% prediction bounds.

Figure 4.7: Forecast/Backcast of missing values of larceny counts in Columbus, OH.

Method              BIAS     MSPE
Forecasting          0.03   0.0061
Backcasting         -0.11   0.0184
Forecast-Backcast   -0.05   0.0057

Table 4.1: Comparison of three imputation methods in Columbus example.

4.2.2 Intermediate Crime Counts

For series with between 1 and 35 crimes per month on average, we used a time series model for count data. The literature on these models is reviewed in Chapter 3. Typical practice for modeling count data is to assume that the counts follow a Poisson distribution. To account for the serial dependence in our data, we incorporate an autoregressive component into our model. We also use a seasonal effect that indicates the month of the year. When possible, we make use of complete covariate data from a "donor" series selected from "similar" agencies in the same state. Letting $Y_t$ be the crime count at month $t$, we assume that $\{Y_t \mid D_t\}$, where $D_t$ is defined in Chapter 3, are a set of independent random variables where, for each $t$,
$$Y_t \sim \text{Poisson}(\mu_t), \quad \text{with } \log(\mu_t) = \alpha + \beta \log(y_{t-1} + 1) + \gamma \log(x_t + 1) + \sum_{j=1}^{11} \eta_j m_{j,t}. \qquad (4.17)$$
Here, $x_t$ is the crime count at time $t$ from the donor agency, and $m_{j,t}$ is an indicator function for month $j$ at time $t$. Thus, the Poisson crime rate for month $t$, $\mu_t$, depends on the actual crime count in the previous month, $y_{t-1}$, the crime count from a similar agency in month $t$, $x_t$, and a mean effect of $\alpha + \eta_j$ in month $j$ ($\eta_{12} = 0$ for December). For cases in which there is no available "similar" agency with complete data, the term $\gamma \log(x_t + 1)$ is dropped from the model.
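
Equation 4.17 corresponds to an ordinary Poisson GLM fit with a log link. The R sketch below is illustrative only: the simulated data frame and its variable names (count, lag_count, donor, month) are hypothetical, and the donor term would simply be omitted from the formula when no suitable donor series exists.

# Illustrative data: a count series with its lagged value, a donor-agency
# covariate, and a month factor with December as the baseline level.
set.seed(3)
n     <- 120
donor <- rpois(n, 30)
count <- rpois(n, 8 + 0.1 * donor)
dat   <- data.frame(
  count     = count[-1],
  lag_count = count[-n],
  donor     = donor[-1],
  month     = factor(rep(month.abb, length.out = n)[-1],
                     levels = c("Dec", month.abb[1:11]))
)

# Poisson regression with log link, as in Equation 4.17
fit <- glm(count ~ log(lag_count + 1) + log(donor + 1) + month,
           family = poisson, data = dat)

# Estimated Poisson rate for one month, given its covariates
predict(fit, newdata = dat[1, ], type = "response")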

One of the questions we dealt with was how to select donor agencies for the imputation algorithm. Our definition of "similar" was different from that used by the FBI. We chose only one donor agency, selected from a number of candidate agencies in the same state, that had the least missing data and no missingness in the imputation interval in question. The same type of crime was used in the donor agency as in the agency to be imputed. The agency we selected was the one whose crime series was most highly correlated with the crime series for the agency requiring imputation.
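
One simple way to implement this selection is sketched in R below, under the assumption that the candidate series are the columns of a matrix with NA for missing months; the function and object names are hypothetical.

# Select a donor series: keep candidates that are complete over the gap,
# then take the one most correlated with the target series.
select_donor <- function(target, candidates, gap) {
  ok <- apply(candidates, 2, function(x) !any(is.na(x[gap])))
  if (!any(ok)) return(NULL)                      # no usable donor
  cors <- apply(candidates[, ok, drop = FALSE], 2,
                function(x) cor(target, x, use = "pairwise.complete.obs"))
  names(which.max(cors))                          # most correlated candidate
}

# Toy example with three candidate agencies
set.seed(4)
target <- rpois(60, 10); target[25:30] <- NA
candidates <- cbind(A = rpois(60, 10), B = rpois(60, 30), C = rpois(60, 12))
select_donor(target, candidates, gap = 25:30)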

When possible, we estimated model parameters using available data on both sides of the missing period and averaged the forecast and backcast predictions based on the different estimated models, as illustrated in the following example.

Figures 4.8 and 4.9 illustrate imputation results for burglary in Saraland, Alabama, obtained using the Poisson GLM. Anniston, Alabama's series of burglary counts was used as the donor series because both cities are in the same state, and the correlation between the reported data in both series was 0.708. The highest correlation between the crime series of any two nearly complete agencies in Alabama is about 0.90, but most correlations are less than 0.40. Thus, we judged that the correlation between crime counts in the Anniston and Saraland series is fairly strong. The similarity between crime patterns in Anniston and Saraland can be seen in the plots in Figure 4.8.

Figure 4.9 shows an example of imputation for missing data in the Saraland series when forecasting, backcasting, and both are used to obtain the imputed values. Anniston's data is used as a covariate in the Poisson regression model for prediction. For illustration purposes, we deleted crime counts from June 1974 through June 1976 from the complete Saraland series. Plot A in Figure 4.9 shows the forecasted values using data leading up to the missing months. Plot B in Figure 4.9 shows the backcasted values using data following the missing months. Plot C in Figure 4.9 shows the average of forecasts and backcasts. Low forecasts and high backcasts, caused by the fact that the missing data occur near the end of population growth for Saraland, are averaged to give good overall estimates. Table 4.2 shows that backcasting produced the best estimates in terms of average bias and MSPE. The poor forecasting resulted in pulling the composite estimates too low in 1975. Still, we can conclude that the forecast-backcast predictions performed much better than the forecast predictions alone. It is interesting, though, to see from Figure 4.9 that forecasting does much better than backcasting for the first several months in 1974. This perhaps suggests using a different weighting scheme which places more weight on forecasts for the beginning of the missing period, and more weight on backcasts for the end of the missing period. Another reason for the discrepancy between forecasts and backcasts could be the different model estimates on either side of the missing period.

Figure 4.8: Burglary counts in Saraland and Anniston, Alabama, with dashed lines delineating missing period.

Figure 4.9: Forecasting/Backcasting of missing values of burglary counts in Saraland, AL (panels A: Forecasting; B: Backcasting; C: Forecasting and Backcasting). The line represents the actual data, and circles are estimates from the fitted model.

Method              BIAS    MSPE
Forecasting          3.68   41.39
Backcasting         -0.34   20.25
Forecast-Backcast    1.55   22.84

Table 4.2: Comparison of three imputation methods in Saraland example using a GLM model.

We then tried to impute for the same missing period in Saraland, using the SARIMA model instead of the GLM. Our results, shown in Table 4.3, confirm that the Poisson GLM is the more appropriate model to use in cases with low crime counts such as these. The SARIMA model produced inferior predictions in terms of average bias and MSPE. The SARIMA predictions were primarily underestimating the true crime counts. Results such as in this example using the Saraland burglary series suggest that the SARIMA model may only be suitable for series with large crime counts, as we have already suspected.

Method              BIAS    MSPE
Forecasting          4.86   41.59
Backcasting          2.36   23.94
Forecast-Backcast    4.12   34.59

Table 4.3: Comparison of three imputation methods in Saraland example using a SARIMA model.

Using the glm function in R, we obtained variance estimates for the model parameters as well as for predictions. When we do not have crime counts missing in consecutive months, the prediction variance is straightforward, because the prediction is based on the previous or following month's observation. When we do have crime counts missing in consecutive months, we have predictions based on data that include imputed values. Our current variance estimates tend to be too small because they ignore the variability from all previous imputations for a string of missing data. Deriving the correct variance is difficult, but, taking a different approach to the problem, the Bayesian tools described in Chapters 5 and 6 are useful in handling this difficulty.

4.2.3 Small Crime Counts

Small crime counts occur very frequently in the UCR. In Ohio, for example, 34% of the crime series have less than 1 crime per month on average. Most often, small counts are associated with murder and rape.

For crimes that occur with low frequency in a particular agency, we use a simple mean imputation method. When the mean monthly crime count for the observed data is less than 1 crime per month, we use the mean to make imputations for the missing data in that series. The mean count is calculated by dividing the total count for the agency by the total number of months for which the agency provided reports. This


Figure 4.10: Murder counts in Waterloo, Iowa.

method does not account for seasonality, but with such sparse data, seasonality may be impossible to identify. An improvement over this method, which would attempt to account for any time trends, might be to calculate monthly estimates using the average over a window of months surrounding the missing month. Intuitively, this strategy makes sense, but the actual gain may be insignificant, as we would be making a bias/variance trade-off. The strategy would have less stable estimates with larger variances. In addition, the question of how many months to include in the window would almost certainly require different answers for different data series. For the variances of the imputed values, we assumed that the data came from a Poisson distribution. This model can be represented by $Y_t \sim \text{Poisson}(\mu)$, independent for $t = 1, 2, 3, \ldots, n$, where $\mu$ is the crime rate, or average number of crimes per month, and $y_t$ is the realized crime count for month $t$.

As an example of a series with low crime counts, Figure 4.10 shows the time series of monthly reported murder counts in Waterloo, Iowa. Missing data in Waterloo is evident in 1991 and part of 1992. For each month with missing murder counts, we would impute the mean of $\hat{\mu} = 109/501 = 0.218$, calculated from the observed data.

Note that if we accumulate these estimates over 12 months, we obtain an estimated 2.616 murders per year. Thus, in the case of very few crimes, we have chosen to use imputations that are reasonable in the aggregate rather than trying to assign a crime to one particular month. We made this choice to avoid spurious associations that might result from randomly assigning a crime to a particular month.
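
The Waterloo calculation amounts to a few lines of arithmetic; the R sketch below simply repeats it under the Poisson assumption of this section.

# Mean imputation for a sparse series: Waterloo murder counts,
# 109 crimes reported over 501 observed months.
total_crimes <- 109
n_months     <- 501
mu_hat       <- total_crimes / n_months   # about 0.218 crimes per month

mu_hat              # value imputed for each missing month
12 * mu_hat         # implied annual total, about 2.6 murders per year
var_hat <- mu_hat   # under the Poisson model, the variance equals the mean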

For small sample sizes, Faulkenberry [1973] provides a method for calculating the prediction interval for $Y_{n+1}$ using the conditional distribution of $Y_{n+1}$ given $T = \sum_{i=1}^{n+1} Y_i$. The statistic $T$ is sufficient for $\mu$ in the joint distribution of $\{Y_1, Y_2, Y_3, \ldots, Y_{n+1}\}$. The conditional distribution of $Y_{n+1}$, given $T = t$, is binomial with parameters $\frac{1}{n+1}$ and $t$. Let $B(i \mid t)$ be defined as the binomial probability,
$$B(i \mid t) = \frac{t!}{i!\,(t-i)!}\left(\frac{1}{n+1}\right)^i\left(\frac{n}{n+1}\right)^{t-i}.$$
For a $(100 \times \beta)\%$ interval, we can choose $[a, b]$ so that $\sum_{i=a}^{b} B(i \mid t) = \beta$. Let $W = \sum_{i=1}^{n} Y_i$. Then finding the confidence interval amounts to solving for $a$ and $b$ in
$$\sum_{i=0}^{b} B(i \mid w + b) = 1 - \frac{1-\beta}{2} \quad \text{and} \quad \sum_{i=0}^{a-1} B(i \mid w + a) = \frac{1-\beta}{2}.$$
For the Waterloo data, we obtain a 97.9% prediction interval of $[0, 1]$ for January 1991.

For large samples, this method is the same as estimating the exact Poisson prediction interval. Using the observed crime counts in the series, we estimate that for a given missing month, the probability of reporting at least 1 crime is $98/501 = 0.196$. Using the binomial$(12, 0.196)$ distribution, we can estimate the probability of observing no reported crimes for the year 1991 as 0.073.

Using the delta method, we can also calculate an approximate 95% confidence interval for $\mu$. We first estimate $\log(\mu)$ to be -1.52 with standard error 0.096, so the confidence interval for $\log(\mu)$ is [-1.709, -1.334]. Exponentiating these confidence limits gives our approximate 95% confidence interval for $\mu$: [0.181, 0.264]. This interval is much narrower than the 97.9% prediction interval, but provides inference on the mean rather than on an unknown observation.
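
Assuming the standard error of log(μ̂) is approximated by 1/√(total count) under the Poisson model, which reproduces the 0.096 quoted above up to rounding, the calculation can be written as follows; this is an illustrative sketch rather than the exact computation used in the text.

# Approximate 95% confidence interval for the Poisson rate mu,
# using the delta method on the log scale.
total_crimes <- 109
n_months     <- 501
mu_hat       <- total_crimes / n_months
se_log_mu    <- 1 / sqrt(total_crimes)    # delta-method s.e. of log(mu_hat)

ci_log <- log(mu_hat) + c(-1, 1) * 1.96 * se_log_mu
exp(ci_log)    # roughly [0.18, 0.26], close to the interval reported above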

A missing data technique that might be more appropriate in the case of small crime counts is multiple imputation. In multiple imputation, described in Little and Rubin [2002], we build sets of imputations by taking several draws from each missing value's predictive distribution. The predictive distribution can be Poisson, so that our imputed values resemble Poisson data. This happens because our imputations involve conditional draws rather than conditional means. The result is the construction of multiple completed datasets. Combining these datasets with differing imputed values will let us provide estimates of the imputation variance. By giving several plausible sets of imputed values, we can guard against treating an imputed value as a known value, and we have the advantage of calculating the imputation uncertainty.
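
A minimal sketch of this idea for a sparse series, assuming a Poisson predictive distribution with its rate estimated from the observed months (and, for simplicity, ignoring parameter uncertainty), is given below; the toy series is hypothetical.

# Multiple imputation for missing months in a low-count series:
# draw each missing value from a Poisson predictive distribution.
set.seed(5)
y <- rpois(120, 0.2)
y[c(30:35, 80)] <- NA                          # toy series with missing months

M       <- 10                                  # number of completed datasets
mu_hat  <- mean(y, na.rm = TRUE)
imputed <- replicate(M, {
  z <- y
  z[is.na(z)] <- rpois(sum(is.na(z)), mu_hat)  # conditional draws, not means
  z
})

# Between-dataset variability of the imputations for each missing month
apply(imputed[is.na(y), , drop = FALSE], 1, var)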

4.3 Special Cases

We saw in Chapter 1 that there are several types of missingness in the UCR. A datum can be missing because it is from a non-existent agency, it is from a covered-by agency, it is part of an aggregated report, or it is truly missing. Non-existent agencies shouldn't have crimes reported, so no imputations are needed, and these time periods should not affect our models. We have already described, in this chapter, methods to impute for truly missing data. In this section, we discuss ways to deal with the two special cases: covered-by agencies and aggregated reports.

4.3.1 Covered-By Cases

When an agency's crime counts are included in a second agency's reports, the second agency is said to be "covering" the first agency, and the first agency is said to be "covered by" the second agency. Since such covered-by crimes have been counted and reported, we do not need to impute them for the covered agency. More importantly, we cannot treat covering agencies the same as other agencies.

If a covering agency has periods of covering surrounded by periods of non-covering, then we might have a problem when we try to model the series. The months that are covering another agency will have a different model from the months that are not covering. For agencies with covered-by months, making the usual imputations of the missing values will not be a problem as long as the agency has a sufficient amount of observed data. For each of the covering agencies, we would need to look at plots to see how appropriate our unadjusted model would be. Since the number of agencies affected by coverings is large, it is not feasible to look at the plots of each one. We can discover, though, that most covering agencies report large crime counts and most covered-by agencies contribute small counts. In Iowa, for example, the average total for Index crimes for covered-by agencies in months when they report their own data is 15.21 crimes per month, but the average total for Index crimes for the covering agencies is 100.01 crimes per month. Under this assumption, making adjustments in the model for covering months will not significantly change our predictions.

Alternatively, Maltz and Weiss [2006] suggest that the "covered-by" problem can be reduced by combining individual agencies to calculate countywide statistics. In this solution, crime estimates are aggregated, and we learn little about crime counts for individual agencies, but the statistics that are offered can be more accurate.

4.3.2 Aggregation

The second special case of missingness occurs when a police jurisdiction reports aggregated crime totals over multiple months. During some time periods, some agencies may only submit UCR reports semi-annually or annually. In other cases, an aggregated crime count is just the result of missing the previous month.

We can still employ our imputation method to make imputations for these missing data, with some minor adjustments. Our imputed crime count for time $t$ is given by
$$\hat{y}_t = \tilde{y}_t \left(\frac{T}{\sum_{i=1}^{k} \tilde{y}_i}\right), \quad \text{for } t = 1, 2, \ldots, k, \qquad (4.18)$$
where $\tilde{y}_t$ represents the unadjusted crime count estimate at time $t$ based on the imputation methods described earlier in this chapter, and $T$ represents the reported aggregated crime count for months $t = 1, 2, \ldots, k$. Now, when we sum up our imputed values, we get the reported aggregate crime count $T$:
$$\sum_{t=1}^{k} \hat{y}_t = \sum_{t=1}^{k} \tilde{y}_t \left(\frac{T}{\sum_{i=1}^{k} \tilde{y}_i}\right) = \left(\sum_{t=1}^{k} \tilde{y}_t\right) \frac{T}{\sum_{i=1}^{k} \tilde{y}_i} = T.$$

So the total $T$ is known, but the monthly counts that sum to $T$ are the estimated values.

Thus, we can assume that the crime counts for aggregated months $t = 1, 2, \ldots, k$ follow a multinomial distribution with known parameter $T$ and unknown parameters $p_1, p_2, \ldots, p_k$. We have already seen in Equation 4.18 that $p_t$ is estimated by $\hat{p}_t = \tilde{y}_t / \sum_{i=1}^{k} \tilde{y}_i$. Now the variance of $\hat{y}_t$ can be estimated by
$$\widehat{\mathrm{Var}}(\hat{y}_t) = T \times \hat{p}_t\,(1 - \hat{p}_t).$$

This variance estimate will be smaller than usual because knowing T gives us more certainty about our estimates for the crime counts of missing months.
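
The rescaling in Equation 4.18 and the multinomial variance above translate directly into R; the sketch below uses hypothetical unadjusted estimates for a year that was reported as a single aggregated total.

# Allocate an aggregated (e.g., annual) total T across k missing months in
# proportion to the unadjusted monthly estimates y_tilde (Equation 4.18).
allocate_aggregate <- function(y_tilde, T_total) {
  p_hat <- y_tilde / sum(y_tilde)              # estimated multinomial proportions
  list(y_hat = T_total * p_hat,                # imputed counts; they sum to T
       var   = T_total * p_hat * (1 - p_hat))  # multinomial variance estimate
}

# Example: twelve unadjusted estimates and a reported annual total of 1200
y_tilde <- c(80, 85, 95, 100, 110, 120, 125, 120, 110, 100, 90, 85)
out <- allocate_aggregate(y_tilde, T_total = 1200)
sum(out$y_hat)   # equals 1200 by construction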

The top plot in Figure 4.11 shows an example of a series with aggregated crime counts. During the time periods from 1991–1997 and 1999–2002, Mobile, Alabama submitted aggregated crime counts to the FBI in annual reports each December. It is clear that the reported crime counts during this time period are far greater than the monthly crime counts submitted for the long history of the series. The bottom plot in Figure 4.11 shows the series after imputations have been made for the periods of aggregation. Now, the annual crime estimates, during these periods, are still accurate, and we have used our knowledge of seasonality to provide reasonable monthly estimates where missing.

Another type of aggregation that is commonly seen in the UCR is the aggregation of crime categories. There are three aggregated crime series that are often of interest to criminologists. The sum of the seven Part I Index crimes is called the Part I crime index. The violent crime index is the sum of crimes in the categories of murder, rape, robbery, and aggravated assault. The property crime index is the sum of crimes in the categories of burglary, larceny, and vehicle theft.

Clearly, if an agency has missed a report for one crime, then we also have a missing data problem for the crime index. In general, if an agency has missed reporting a crime count for one type of crime, then counts for all types of crime are missing for that particular month. One question we initially set out to answer concerned how we should make imputations for missing observations in the crime index series. There are three possible approaches we considered. The first method is to make imputations for each of the seven crimes separately and then sum them to get an estimate for the crime index. The second method is to make our imputation on the crime index series and then allocate this estimate to the seven crime series. The third method is to make imputations for each of the seven crimes and each of the three crime indices all separately. Advantages of the first method are that it mimics the way the crime index is created, as the sum of the seven crime counts, and it avoids the question of how an estimate of the composite crime index should be allocated to the seven individual crimes. One advantage of the third method is that estimating the variances of the imputed values is straightforward. The disadvantage of the third method is that there is some discrepancy between our estimated values for the crime index and the sum of the estimates for the individual crimes.

Figure 4.11: Larceny counts with aggregation in Mobile, AL. Imputed values are represented by points in bottom plot.

Fortunately, we found that the predictions for both crime index prediction approaches are relatively close, and we concluded that both methods produce similar estimates. Using the Cleveland, Ohio time series, we compared the two prediction estimates. The means of each crime type are preserved when aggregated, but since the crime types are not independent, the variances are not preserved. The variances for imputed crime index values should be calculated using the crime index series.

Predictions of the crime index for years 2000–2003 are based on forecasting using a model of the data from 1960–1999. The bottom plot in Figure 4.12 shows how similar the two prediction approaches for the crime index are, and shows how close they are to the true data, which was removed for the imputation procedure.

Figure 4.12: Index crime counts for Cleveland, Ohio. Index crime counts are represented by the solid line. In bottom plot, predictions based on modeling the Index crime series are represented by the dotted line, and predictions based on modeling the individual crime series are represented by the dashed line.

CHAPTER 5

BAYESIAN MODELING

Up until now, we have only considered modeling crime counts for a single type of crime reported to an individual police jurisdiction. In the previous chapter, we realized the benefit of using donor agency crime series as covariates, but there is the potential to incorporate data from more agencies into a comprehensive model. By grouping agencies, we can construct a hierarchical model. We now consider modeling crime counts for multiple types of crimes reported to multiple police agencies. Hierarchical models have the feature of allowing agencies to learn from each other when making inference on agency-level parameters.

Before we begin our new analysis of the data, let us first introduce the Bayesian approach. The fundamental difference between Bayesian analysis and classical analysis is that with classical analysis, the parameters are assumed to be fixed, but typically unknown. In Bayesian analysis, the parameters are assumed to be random variables rather than fixed quantities. This amounts to parameters following distributions in the Bayesian setting. Inferences on parameters are based on the posterior distribution of the parameters. The posterior distribution is the conditional density of the parameters given the observed data. To derive a posterior distribution, a prior distribution must initially be assumed for the parameter. The prior distribution is the parameter's marginal distribution, which is defined in terms of other parameters, known as hyperparameters. The prior distribution is often denoted by $\pi(\theta)$ for parameter $\theta$. Letting $f(y \mid \theta)$ be the distribution of the data, $Y$, given $\theta$, the posterior distribution is defined [Ghosh et al., 2006] using Bayes rule by
$$\pi(\theta \mid y) = \frac{\pi(\theta)\, f(y \mid \theta)}{\int_{\Theta} \pi(\theta')\, f(y \mid \theta')\, d\theta'}. \qquad (5.1)$$
After calculating the posterior distribution, we will be able to make inferences, even in the form of probability statements, on $\theta$.
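
As a concrete special case relevant to crime counts (a textbook example, not the UCR model itself), suppose a single agency reports counts $y_1, \ldots, y_n \stackrel{\text{iid}}{\sim} \text{Poisson}(\mu)$ and we place a conjugate Gamma$(a, b)$ prior on $\mu$. Then Equation 5.1 yields a closed-form posterior:
$$\pi(\mu \mid y) \propto \mu^{a-1} e^{-b\mu} \prod_{t=1}^{n} \frac{\mu^{y_t} e^{-\mu}}{y_t!} \propto \mu^{a + \sum_t y_t - 1}\, e^{-(b+n)\mu},$$
so that $\mu \mid y \sim \text{Gamma}\!\left(a + \sum_{t=1}^{n} y_t,\; b + n\right)$, with posterior mean $(a + \sum_t y_t)/(b + n)$. For the models we use, no such closed form is available, which is why the computational methods of Section 5.1 are needed.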

For our problem, we ultimately would like to make probability statements concerning missing values. In Bayesian analysis, these too can have posterior distributions. To predict unknown $y_m$, based on the observed data, $y_{obs}$, we can make inference using the posterior predictive distribution defined [Ghosh et al., 2006] by
$$\pi(y_m \mid y_{obs}) = \int_{\Theta} \pi(y_m \mid \theta)\, \pi(\theta \mid y_{obs})\, d\theta. \qquad (5.2)$$
One of the greatest benefits of the Bayesian approach, for our situation, is the way

that imputations can be made for the missing values. The imputations automatically take into account the uncertainty in the model parameters, through $\pi(\theta \mid y_{obs})$, as can be seen in Equation 5.2, and since we estimate a distribution for each missing value, imputation variance is provided with minimal effort. We will devote Chapter 6 to discussing Bayesian imputation, but for now we describe how we construct our Bayesian model for crime counts.

In Section 5.1, we describe the computational techniques used to find the posterior distributions of our model parameters. In Section 5.2, we describe the process by which we construct our model. We present a full model specification in Section 5.3, and we discuss ways of checking our model in Section 5.4.

5.1 Bayesian Inference

Bayesian inference on a model parameter, as mentioned previously, is based on its posterior distribution. Often, a point estimate for a parameter is taken to be the mean of its posterior distribution. Probability statements, typically in the form of posterior intervals, are constructed using quantiles from the posterior distribution. Bayesian inference can also be in the form of random draws from the posterior distribution, as in multiple imputation [Rubin, 1987]. The posterior distribution (Equation 5.1 or 5.2) is critical to Bayesian inference, but is often very difficult to obtain. For example, in

(5.1), we may assume $\pi(\theta)$ and $f(y \mid \theta)$, but the marginal distribution $f(y)$ is generally not easy to calculate. As a consequence, $\pi(\theta \mid y)$ is difficult to calculate. To get around this common problem, we use a computational technique called Markov chain Monte Carlo (MCMC) to approximate the posterior distribution.

MCMC is based on iteratively drawing samples from a Markov chain whose distributions, under a set of conditions, become increasingly close to the posterior distribution [Gelman et al., 2004, Chapter 11]. In other words, MCMC does not draw directly from $\pi(\theta \mid y)$, but instead draws from distributions that become closer to $\pi(\theta \mid y)$ after each iteration. After some finite, but unknown, number of iterations, the chain will produce random samples from the stationary distribution, which is the posterior distribution. This is called 'convergence' of the chain. After convergence, the samples are draws from the target posterior distribution. In some situations, a large number of iterations of the Markov chain is required before convergence occurs.

For complicated models, MCMC can take many hours of computing time.

Two commonly used MCMC algorithms are the Metropolis-Hastings algorithm and the Gibbs sampler. The Metropolis-Hastings algorithm is a rejection sampling algorithm, which samples from a proposal distribution and accepts the sample with some acceptance probability that changes after each sample is taken. Let the proposal distribution (or jumping distribution) at iteration $i$ be $J_i(\theta^* \mid \theta^{(i-1)})$. Then the acceptance probability at iteration $i$ is
$$r = \frac{\pi(\theta^* \mid y)\,/\,J_i(\theta^* \mid \theta^{(i-1)})}{\pi(\theta^{(i-1)} \mid y)\,/\,J_i(\theta^{(i-1)} \mid \theta^*)},$$
where $\theta^*$ is the current proposal drawn from the proposal distribution [Gelman et al., 2004]. The basic idea behind the Metropolis-Hastings algorithm is that we simulate an easy Markov chain which has the target posterior density as its stationary density [Ghosh et al., 2006, Chapter 7].

A special case of the Metropolis-Hastings algorithm is the Gibbs sampler. The Gibbs sampler is designed to sample from a joint posterior distribution when we have multiple parameters with a dependence structure. For each parameter, the Gibbs sampler samples from univariate densities, conditional on the other parameters. After iteration $i$ we take draws
$$\begin{aligned}
\theta_1^{(i+1)} &\sim \pi(\theta_1 \mid y, \theta_2^{(i)}, \ldots, \theta_p^{(i)}) \\
\theta_2^{(i+1)} &\sim \pi(\theta_2 \mid y, \theta_1^{(i+1)}, \theta_3^{(i)}, \ldots, \theta_p^{(i)}) \\
&\;\;\vdots \\
\theta_p^{(i+1)} &\sim \pi(\theta_p \mid y, \theta_1^{(i+1)}, \ldots, \theta_{p-1}^{(i+1)}).
\end{aligned}$$
Together, these densities, known as full conditionals, determine a multivariate joint distribution [Ghosh et al., 2006]. The Gibbs sampler is especially well-suited for inference with hierarchical models, which often involve many parameters.
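
To make the full-conditional idea concrete, the R sketch below runs a two-parameter Gibbs sampler for a simple conjugate model: normally distributed data with unknown mean μ and precision τ, a flat prior on μ, and a Gamma(0.01, 0.01) prior on τ. This toy example is not the UCR model; it only illustrates the alternating draws.

# Gibbs sampler for y ~ Normal(mu, 1/tau) with a flat prior on mu
# and a Gamma(0.01, 0.01) prior on the precision tau.
set.seed(6)
y    <- rnorm(100, mean = 5, sd = 2)
n    <- length(y)
ybar <- mean(y)

n_iter <- 5000
mu  <- numeric(n_iter)
tau <- numeric(n_iter)
mu[1] <- 0; tau[1] <- 1                  # initial values

for (i in 2:n_iter) {
  # Full conditional of mu given tau: Normal(ybar, 1 / (n * tau))
  mu[i]  <- rnorm(1, mean = ybar, sd = sqrt(1 / (n * tau[i - 1])))
  # Full conditional of tau given mu: Gamma(0.01 + n/2, 0.01 + sum((y - mu)^2)/2)
  tau[i] <- rgamma(1, shape = 0.01 + n / 2,
                      rate  = 0.01 + sum((y - mu[i])^2) / 2)
}

keep <- seq(1001, n_iter, by = 5)        # discard burn-in, keep every 5th draw
c(post_mean_mu = mean(mu[keep]), post_sd_mu = sd(mu[keep]))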

Since we will have multiple parameters and hyperparameters in our model, which we will describe in the next section, we will need to use the Gibbs sampler to estimate posterior distributions. BUGS (Bayesian inference Using Gibbs Sampling) [Spiegelhalter et al., 1994] is the software we use to fit our Bayesian models. WinBUGS is the Windows version of BUGS, and OpenBUGS is the open-source version of BUGS.

The BUGS software can be used easily in the R environment by calling the libraries R2WinBUGS (for WinBUGS) or BRugs (for OpenBUGS), which we have chosen to use in our analysis.

Bayesian computations with our model will be performed by OpenBUGS, using Gibbs sampling to sample from the posterior distributions of our model parameters. We only need to give a completely specified probability model, including prior distributions, and to assign initial values for each parameter in the Markov chain. OpenBUGS produces a chain of samples, with the length arbitrarily chosen, for each parameter in the model. Convergence of the MCMC must be checked, and we describe this step in Section 5.4. Even after convergence, the samples within a sequence are correlated with each other, so it may be a good idea to only keep samples at every $k$th iteration, totaling to the desired sample size [Gelman et al., 2004]. Finally, we can use these samples to make inferences on our model parameters.

5.2 Model Construction

We first consider building a model for crime counts of the three property crimes reported to the four agencies in one county. Then in Section 5.3, we describe how we can extend this model to include all counties in each state, and all states in the country, if desired. We will start with the Poisson regression model, described in Section 3.1. Now with additional indexes, let $Y_{j,k,t}$ be the crime count from agency $j$ for crime type $k$ at time $t$. Crime counts among crime series for an agency are conditionally independent given the parameters that are crime-type specific. Crime counts among agencies are conditionally independent given the agency-level parameters. For agency $j = 1, 2, \ldots, 4$, crime $k = 1, 2, 3$, and $t = 1, 2, \ldots, 516$, let
$$Y_{j,k,t} \mid \mu_{j,k,t} \stackrel{\text{indep.}}{\sim} \text{Poisson}(\mu_{j,k,t}), \quad \text{with } \log(\mu_{j,k,t}) = \alpha_{j,k} + \beta_{j,k} X_t, \qquad (5.3)$$

where $\alpha_{j,k}$ is the intercept term in the linear predictor component of the GLM for crime counts for agency $j$ and crime type $k$, and $\beta_{j,k}$ is a regression coefficient for some covariate $X_t$, which is time dependent but is independent of agency and crime. We will determine $X_t$ later. Equation 5.3 is a starting place for us to construct our Bayesian hierarchical model.

Part of our model building process includes an exploratory analysis of our data. For this, we look at complete property crime data reported to thirteen police agencies from Minnesota. We will look at Poisson regression models for each of these series individually, and summarize the results, to help us develop a comprehensive model. With four agencies from Hennepin county in Minnesota, we will construct a model, and we will form and develop imputation strategies. In Section 5.2.1, we describe how we construct a hierarchy for our model parameters. In Section 5.2.2, we explore which covariates will best help us model the trend and seasonality in our data. In Section 5.2.3, we discuss ways of modeling the autocorrelation in the observations. Finally, we discuss which prior distributions are most appropriate in Section 5.2.4.

5.2.1 Specification of Hierarchical Structure

Hierarchical models are prevalent in practice. For example, Dominici et al. [2000] combine pollution-mortality relative rates for 20 cities and model between-city variation in relative rates as a function of city-specific covariates. Hierarchical models are also used in Hay and Pettitt [2001] and in Ghosh et al. [1996], for example. Hay and Pettitt [2001] use a hierarchical model on the time series of the incidence of an infectious disease. They use a parameter-driven model in the form of a Poisson generalized linear model. Ghosh et al. [1996] take a hierarchical Bayesian approach for bivariate time series modeling of median incomes for four- and five-person families. For our situation, the advantage of a hierarchical model is that, by incorporating groups of data together, parameter inferences are computed with the pooled strength of data across groups.

The organization of the UCR dataset has a structure that is essentially hierarchical. This makes the Bayesian hierarchical modeling approach highly appropriate to use with the UCR data. Grouping variables such as county and state allow for natural groupings of the data. These grouping variables can be utilized to formulate a hierarchical structure to combine many groups of data.

In modeling the UCR data, we have some discretion in organizing the hierarchy, but it seems most natural to group agencies within a county, and group counties within a state. In our hierarchical structure for the UCR data, the lowest level represents the individual agencies. For each agency, crime counts are observed. At the next level, we group agencies by county. At the highest level, we group counties by state.

Figure 5.1 shows an example illustrating how the hierarchical structure of the UCR might be specified. Counties {Aitken, . . . , Washington} are in Minnesota. Police jurisdictions {Bloomington, . . . , Wayzata} are in the Minnesota county of Hennepin. Burglary, larceny, and vehicle theft are the property index crimes measured for every police agency.

Figure 5.1: Diagram of Hierarchy in UCR

The idea behind hierarchical models, in the context of the UCR, is that by grouping agencies together, we are able to borrow strength from all agencies to estimate agency-specific parameters. This happens because the agency-specific parameters are seen as a sample of possible parameter values from a common population distribution. We assume that there is some similarity among the parameters. We don't necessarily assume the parameters are correlated, but we assume that they come from the same parent distribution. With hierarchical models, we can exploit the assumed relationship or similarity among parameters. The advantage of this modeling approach is seen most easily when we observe rare outcomes. In this case, the smoothed parameter estimate for a unit, obtained using all units in a group, may be a more sensible estimate than the maximum likelihood estimate based only on the individual unit [Congdon, 2001]. When parameter estimates are pulled toward the group mean, this type of "shrinkage" may cause bias, and this is sometimes the disadvantage of hierarchical models, but the trade-off is reduced variance.

When we switch to a hierarchical model, we greatly increase the number of parameters in our model. The non-Bayesian models that we have studied thus far have too few parameters to accurately fit a dataset as large as the UCR. When we have added too many parameters, in our ARIMA model for example, we have noticed a tendency to produce models that fit the data well but lead to poor forecasting. In hierarchical models, we have enough parameters to fit the data well, but over-fitting is avoided by structuring dependence into the parameters [Gelman et al., 2004]. We now explain the need for additional parameters in a hierarchical model.

In Bayesian modeling, we treat the model parameters as random variables, as mentioned earlier. Consider our model in Equation 5.3. Now, instead of considering the model parameters to be fixed effects, as in previous chapters, we can treat them as random effects. For example, we could have the regression term as

$$\beta_{j,k} \stackrel{\text{indep.}}{\sim} \text{Normal}(\mu_{\beta_j}, \sigma_{\beta_j}) \quad \text{for } k = 1, 2, 3, \qquad (5.4)$$
where $\mu_{\beta_j}$ and $\sigma_{\beta_j}$ are hyperparameters. In our construction of a hierarchy, we make some assumptions about relationships among the model coefficients. This implies dependencies in the data across crimes and agencies. In Equation 5.4, we are assuming that for agency $j$, the $\beta_{j,k}$ parameters for each of the three crime series come from the same population distribution.

Hierarchical modeling depends on an exchangeability assumption among agencies. The property of exchangeability is crucial to setting up a joint probability model for the parameters. The parameters $(\beta_{j,1}, \beta_{j,2}, \beta_{j,3})$ are exchangeable in their joint distribution if the joint distribution is invariant to permutations of the indexes [Gelman et al., 2004]. In our data, if we do not know about any differences in the crimes, then we can claim that our parameters are exchangeable in the joint prior distribution. Using Equation 5.4 as an example, for agency $j$, if each $\beta_{j,k}$ is an independent sample from a common prior distribution, then we have exchangeability of crimes. In our hierarchical model, we will assume that agency-level parameters within a county are exchangeable. This means that each $\mu_{\beta_j}$ ($j = 1, 2, \ldots, J$) is a sample from the same distribution. Later, if we wish to extend our model to incorporate higher levels of the hierarchy, we could assume that county-level parameters within a state are exchangeable, and state-level parameters within the country are exchangeable.

Besides making inference on the model parameters themselves, another result of hierarchical modeling is that we will be able to estimate the variation among the model parameters for different agencies within a county, or for different counties within a state. If our model is structured properly, we also are able to obtain overall mean estimates of the parameters across groups by using the posterior distributions of the hyperparameters. Our hierarchical structure will be fully specified when we determine our prior distributions in Section 5.2.4.

5.2.2 Covariate Selection

One of the goals of our model-building process is to learn how the average crime count changes over time. The terms in our model should reveal to us something about the overall crime pattern over time for each agency, which will ultimately help us to make predictions for the missing values. If we had reason to believe that there was a linear trend in the crime counts, we would include the term $\beta_{j,k}^{(1)} t$ in the linear predictor of our regression model in Equation 5.3, where $t = 1, 2, \ldots, N$ represents time in months. Models with this trend term are seen in Gelman et al. [2004, page 99] and Li [1994] (Equation 3.17). Often in practice, however, patterns over time are more complicated, and require more sophisticated models.

Instead of using time itself in the model, we might be able to explain the crime pattern better by using a time-dependent covariate. For example, Dominici et al. [2004] use smooth functions of covariates in a Poisson generalized additive model. In the UCR, population is a time-dependent variable which we at first expected to be useful in predicting crime counts. The logic behind this idea is that as population increases, crime events also increase. More people means more perpetrators (or more victims). We have seen, unfortunately, that there are problems with using population as a covariate in our model. Population is given with the UCR data as annual estimates, yet it still has a missing data problem. Besides the occasional periods of missing population sizes, we also have definitional zero-populations. Agency population should be defined as the population covered by the police agency, but we know that some agencies are zero-population agencies (e.g., park police), as we have described in Chapter 1. Crimes are still reported to zero-population agencies, even though no population is attributed to them. To avoid double-counting of population, populations are assigned to only one agency. Another reason why a covariate for population may not be useful is that population is not always highly correlated with crime counts. Figure 5.2 shows an example where population and larceny counts are correlated in the beginning of the time series, but are uncorrelated in the last half of the time series. For this agency, the correlation between population estimates and larceny counts is 0.613. In some other cases, we have seen that as population increases, crime counts decrease. For all of these reasons, we will dismiss using agency population in our model as a predictor of crime pattern.

Figure 5.2: First plot shows annual population estimate, and second plot shows monthly larceny counts for the same agency during the same time period.

Without the use of covariates from data sources outside of the UCR, we turn again to considering functions of time as predictors of crime. Assuming that crime patterns over time are more complicated than linear or quadratic trends, we try to understand what these functions might be by first looking at scatterplot smoothers of the crime series. Locally weighted regression (LOESS) is the scatterplot smoothing method used in Figure 5.3 to show crime patterns for twelve of the thirteen complete-data agencies in our exploratory analysis. The largest agency was removed to make the plots more helpful visually. We can see from the plots that a linear trend, or even a quadratic trend, will do a poor job of capturing the observed pattern in crime over time. Note also that the ranges of the counts are different for each crime, with larceny exhibiting the greatest range and vehicle theft exhibiting the smallest range.
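As an illustration of this kind of scatterplot smoothing, the following R sketch fits a LOESS curve to a simulated monthly count series; the data frame and its column names (crime, count, t) are hypothetical stand-ins rather than the UCR files, and the span value is only a reasonable starting point.

## A minimal LOESS sketch on simulated stand-in data, not the UCR series.
set.seed(1)
crime <- data.frame(t = 1:516)
crime$count <- rpois(516, lambda = exp(2 + 0.5 * sin(2 * pi * crime$t / 120)))

fit_lo <- loess(count ~ t, data = crime, span = 0.3, degree = 2)

plot(crime$t, crime$count, pch = 16, cex = 0.4,
     xlab = "Month (t)", ylab = "Monthly count")
lines(crime$t, predict(fit_lo), lwd = 2)  # smoothed long-term crime pattern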

Time series models have two important components, trend and seasonality, as seen in the Classical Decomposition model in Chapter 2. By including Hermite polynomials for time in our regression model, we try to capture the trend, or more generally, the long-term pattern, in the data. The time series plots of our data, and corresponding

LOESS curves, suggest that a 4th degree polynomial might be useful to estimate the overall crime pattern. We use the four Hermite polynomials defined [Weisstein, 2008] as

\[
H_1(z) = 2z, \qquad
H_2(z) = 4z^2 - 2, \qquad
H_3(z) = 8z^3 - 12z, \qquad
H_4(z) = 16z^4 - 48z^2 + 12,
\]
where $z_t$ is a standardized variable for time, defined by $z_t = (t - 258.5)/149.1$, so $z_t$ ranges from $-1.727$ at $t = 1$ to $1.727$ at $t = 516$, with mean equal to zero and standard deviation equal to one. These polynomials are chosen because they are orthogonal, thereby removing multicollinearity among the explanatory variables. Together, we use these four Hermite polynomials to model the overall long-range crime pattern over time for each series and, in a sense, to represent a surrogate for population.

Figure 5.3: LOESS curves of twelve crime series in MN, for each of the three property Index crimes.

Figure 5.4 shows that a 4th degree polynomial makes a noticeable improvement over fitting

2nd or 3rd degree polynomials to the data in Bemidji, Minnesota. For example, the third plot in Figure 5.4 shows that burglary was highest around 1984, but the second plot shows that burglary was highest around 1988, when crime had already clearly declined.

Also, we see that the 4th degree polynomial estimates a more stable crime pattern for

1961–1971 than do the 2nd or 3rd degree polynomials, which inaccurately estimate a steadily increasing trend during that time period. A 5th degree polynomial, or higher, might also be appropriate. Polynomials of higher degree will provide better fits of the crime data series, but the trade-off is a loss of model simplicity. At our discretion, we have chosen to use the first four Hermite polynomials, because we believe that a 4th degree polynomial is fairly simple, yet fits the general pattern of the crime data adequately. The disadvantage of using a polynomial for time is that the curve produced is sometimes not as flexible as some smoothers, but is limited by the degree of the polynomial. We do not expect our polynomial to fit the data perfectly. We only want the polynomial to reasonably model the underlying crime pattern over time.

Figure 5.4: Fitting mean crime on Hermite polynomials for time in Bemidji, MN.
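To make the construction concrete, the following R sketch builds the standardized time variable and the four Hermite polynomial covariates from the centering and scaling constants quoted above; the object names are ours, and the correlation matrix at the end simply shows how the four trend covariates relate to one another on this design.

## Trend covariates: standardized time and the first four Hermite polynomials.
t <- 1:516
z <- (t - 258.5) / 149.1              # ranges from about -1.727 to 1.727

x1 <- 2 * z
x2 <- 4 * z^2 - 2
x3 <- 8 * z^3 - 12 * z
x4 <- 16 * z^4 - 48 * z^2 + 12

X_trend <- cbind(x1, x2, x3, x4)
round(cor(X_trend), 2)                # pairwise correlations among the trend covariates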

It remains now to discuss ways to model seasonality, specifically using explanatory

variables. In Chapter 2, we saw several options to model seasonal effects using time-

dependent variables in a regression model. Monthly effects, used in the Poisson

regression model in Chapter 4, are statistically significant for some series but not for

others. Considering monthly effects to be too refined, we have decided to use dummy

variables for season in our model. Summer (June, July, August) will be the base

season, so indicators for Fall (September, October, November), Winter (December,

January, February), and Spring (March, April, May) variables are included in the

model. These are not exact seasons, and certainly different regions of the country

will have different “seasons”, but we intend only to model the big differences

among seasons. Especially for crime types with low rates of occurrence, it makes more

sense to imagine differences in rates among seasons, rather than differences among

months. This also seems reasonable for vacation areas, such as ski resorts and beach

areas. This parametrization is certainly more parsimonious than using a dummy

variable for each month.
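The season indicators are straightforward to construct; the R sketch below assumes, for illustration, a 516-month series that begins in January, with Summer absorbed into the intercept.

## Season dummy variables with Summer (Jun, Jul, Aug) as the baseline.
month <- rep(1:12, length.out = 516)

w_fall   <- as.numeric(month %in% c(9, 10, 11))   # Sep, Oct, Nov
w_winter <- as.numeric(month %in% c(12, 1, 2))    # Dec, Jan, Feb
w_spring <- as.numeric(month %in% c(3, 4, 5))     # Mar, Apr, May

W_season <- cbind(w_fall, w_winter, w_spring)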

At this point, we update our model in Equation 5.3. Now, we include terms in the linear predictor component of our Poisson regression model to express the trend and seasonality in our data. Let the crime counts $\{Y_{j,k,t}\}$ be specified as in Equation 5.3. Then for each $j$, $k$, $t$,

\[
\log(\mu_{j,k,t}) = \alpha_{j,k} + \sum_{h=1}^{4} \beta^{(h)}_{j,k} x^{(h)}_t + \sum_{s=1}^{3} \eta^{(s)}_{j,k} w^{(s)}_t. \tag{5.5}
\]
The four Hermite polynomials for time are represented by the covariates $\{x^{(1)}_t, x^{(2)}_t, x^{(3)}_t, x^{(4)}_t\}$, and the dummy variables for season are represented by $\{w^{(1)}_t, w^{(2)}_t, w^{(3)}_t\}$. Figure 5.5 shows an example of how this model fits the time series of burglary crime counts in Robbinsdale, Minnesota. The bold line represents the fitted values of the model. Seasonal differences are evident, and the overall fit, using the 4th degree polynomial, seems to model the data rather well. Notice also that the seasonal differences in the fitted line are larger during periods of increased crime, due to the inherent mean-variance relationship in Poisson distributed data.

Figure 5.5: Fitting Poisson regression on burglary counts in Robbinsdale, MN.
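A fixed-effects version of Equation 5.5 for a single agency and crime can be fit with glm() in R. The sketch below reuses the trend covariates and season indicators from the two sketches above and simulates stand-in counts, since the Robbinsdale data cannot be reproduced here; the coefficient values are arbitrary.

## Poisson regression on the Hermite trend covariates and season dummies,
## assuming x1-x4 and w_fall, w_winter, w_spring from the sketches above.
set.seed(2)
y <- rpois(516, lambda = exp(1.5 + 0.1 * x1 + 0.2 * w_fall))   # stand-in counts

fit <- glm(y ~ x1 + x2 + x3 + x4 + w_fall + w_winter + w_spring,
           family = poisson(link = "log"))
summary(fit)           # alpha, beta^(1)-beta^(4), eta^(Fall), eta^(Winter), eta^(Spring)
mu_hat <- fitted(fit)  # fitted monthly means, analogous to the bold line in Figure 5.5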

5.2.3 Modeling Autocorrelation

In the literature, Bayesian models are constructed for dependent data in a number of ways. Taking a Bayesian approach to classical ARMA time series models,

Thompson and Miller [1986] assume the model parameters to be random. Chen and

Ibrahim [2000] incorporate a latent time series process in their model to account for dependency in the data. In general, we can take any model described in Section 3.2, as in Zeger and Qaqish [1988] or Zeger [1988], and give the parameters hierarchical prior distributions. Time series models can even be constructed as generalized additive models in the Bayesian setting, as seen, for example, in Dominici et al. [2004].

One popular Bayesian time series approach is to allow for time-varying model coefficients. Harvey [1989] defines a structural time series model where the parameters are time-varying, and Pole et al. [1994] describe this dynamic linear modeling technique.

Mixtures of ARMA and dynamic linear modeling components are also possible. Random effects models can also be constructed as Bayesian models for dependent data

[Diggle et al., 1994]. An example of a longitudinal random effects model for Poisson counts can be seen in Saphire [1991].

Our current model, in which the trend and seasonal effects are a set of fixed coefficients, will place equal weight on all observations when predicting the future

[Congdon, 2001]. Using time-varying coefficients, as in a dynamic linear model, forecasts will place more weight on recent observations, since forecasts are based on the most up-to-date coefficient values. A dynamic model may be more flexible than models for which regression parameters do not change over time, but a dynamic model often does poorly for long-term forecasts [Congdon, 2001]. We do not take this dynamic modeling approach because of the complexity already in our hierarchical model.

For example, if we are using prior distributions to pool information across agencies, as in Equation 5.4, then by assuming time-varying coefficients, we are changing our hierarchical assumptions.

The strategy we adopt involves expressing our regression model with latent stationary autocorrelated error terms. Since we have turned to Bayesian models, it is now much easier for us to consider this type of parameter-driven time series model. For example, in dealing with disease incidence, Hay and Pettitt [2001] use a parameter-driven model in the form of a Poisson generalized linear mixed model. They propose an AR(1) model for the noise term in the mean of the Poisson regression model.

Congdon [2001, Section 7.5] describes regression with autocorrelated errors in time, and suggests that the errors may follow any ARMA(p, q) process.

Now we again update our model in Equation 5.5 to include a time series component. Let the data, Y, be defined with the same specification as in Equation 5.3.

Then,
\[
\log(\mu_{j,k,t}) = \alpha_{j,k} + \sum_{h=1}^{4} \beta^{(h)}_{j,k} x^{(h)}_t + \sum_{s=1}^{3} \eta^{(s)}_{j,k} w^{(s)}_t + \epsilon_{j,k,t}, \tag{5.6}
\]
where the latent process $\{\epsilon_{j,k,t}\}$ is an autoregressive process defined for each $j$, $k$, and $t > 1$ as

\[
\epsilon_{j,k,t} \sim \mathrm{Normal}(\mu_{\epsilon_{j,k,t}}, \sigma_{\epsilon_{j,k}}),
\quad \text{where } \mu_{\epsilon_{j,k,t}} = \phi_{j,k}\, \epsilon_{j,k,t-1},
\]

so that φj,k is the autoregression parameter for the autoregressive latent process.

For time series data, we must also consider the assumptions about the initial, time

t = 1, terms, although this issue may not be so critical with such long data series.

Still, we mention that there are several approaches to dealing with this issue. Suppose

that our errors are assumed to follow an AR(1) process. Then we must make some

modification to our model to deal with time t = 1. One way is to treat ǫ1 as an extra

parameter, or missing value to estimate. Another approach is to treat ǫ1 as known

and fixed, and to specify the conditional likelihood of the data given ǫ1. Congdon

[2001] suggests that Bayesian estimation of the AR(1) error model is simplified by this strategy of conditioning on the first observation. Alternative approaches described in

Congdon [2001] are to use backcasting to estimate the initial latent data, or to use a composite parameter to model the first mean. If we assume that our error terms form an AR(1) process, then we can use the starting condition at time $t = 1$ to be

\[
\epsilon_{j,k,1} \sim \mathrm{Normal}\bigl(0, \sigma^{(1)}_{\epsilon_{j,k}}\bigr),
\quad \text{where } \sigma^{(1)}_{\epsilon_{j,k}} = \frac{\sigma_{\epsilon_{j,k}}}{\sqrt{1 - \phi_{j,k}^2}},
\]
as described in Cryer [1986, Chapter 4].
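The following R sketch simulates a latent AR(1) process with this stationary starting condition; the values of phi and sigma are arbitrary and only meant to show the recursion.

## Latent AR(1) process with a stationary start, as in Equation 5.6.
simulate_ar1 <- function(n, phi, sigma) {
  eps <- numeric(n)
  eps[1] <- rnorm(1, mean = 0, sd = sigma / sqrt(1 - phi^2))  # stationary start
  for (t in 2:n) {
    eps[t] <- rnorm(1, mean = phi * eps[t - 1], sd = sigma)
  }
  eps
}

set.seed(3)
eps <- simulate_ar1(n = 516, phi = 0.3, sigma = 0.15)
var(eps)   # close to sigma^2 / (1 - phi^2) for a long series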

Using data from Minnesota, we looked at the sample ACF and sample PACF plots of the residuals of the Poisson regression model (5.5). These models include a 4th degree polynomial for time, and dummy variables for season. To illustrate, Figure

5.6 shows the sample ACF and sample PACF plots for the burglary crime series in Robbinsdale, Minnesota (data seen in Figure 5.5). We see that there is a great amount of autocorrelation in the data, but this is not surprising. Then in Figure 5.7, we see the sample ACF and sample PACF plots of the residuals from model (5.5).

These plots suggest that we have accounted for a majority of the seasonality present in the data by including a fixed season covariate effect. In this example, then, the observed seasonal effect on crime counts over time does not significantly depart from the model’s estimates. These plots also suggest that there is some autocorrelation unaccounted for in the model, as seen in the sample PACF value around 0.3 at lag 1.

We expect that by including an autoregressive AR(1) error term in the model, we will appropriately account for the dependency in the data. This provides some validation for Equation 5.6, although we note that our AR(1) model will be fit on the log mean, not on the observed counts as implied by our residual plots.
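The residual diagnostics behind Figures 5.6 and 5.7 can be reproduced in outline with the standard acf() and pacf() functions in R, assuming fit is the Poisson GLM from the earlier sketch and y is its response.

## Sample ACF/PACF of the counts and of the Poisson GLM residuals.
par(mfrow = c(2, 2))
acf(y,  lag.max = 80, main = "Sample ACF of counts")
pacf(y, lag.max = 80, main = "Sample PACF of counts")
r <- residuals(fit, type = "pearson")
acf(r,  lag.max = 80, main = "Sample ACF of GLM residuals")
pacf(r, lag.max = 80, main = "Sample PACF of GLM residuals")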


Figure 5.6: Sample ACF and sample PACF plots of the burglary counts in Robbinsdale, MN.

5.2.4 Specification of Prior Distributions of the Parameters

Specifying a prior distribution for the parameters is a critical step in Bayesian modeling. In this section, we will discuss how we choose prior distributions for each of our model parameters. We distinguish among three types of parameters in our model: regression parameters, autocorrelation parameters, and variance parameters

Figure 5.7: Sample ACF and sample PACF plots of the residuals of the model for burglary counts in Robbinsdale, MN.

(later in Chapter 6 we will also extend this to include missing data). First, we discuss the two broad types of prior distributions: noninformative, and informative.

Prior distributions are commonly classified as noninformative or informative. A noninformative prior implies that there is no prior knowledge about the parameter.

In general, noninformative priors will be improper distributions, i.e., they do not integrate

to 1, and much research has been done on this topic of finding noninformative pri-

ors. Usually we know the support for the parameters, but noninformative priors are

useful when we do not know very much else about the parameters, apart from the

data. BUGS requires that all prior distributions be proper, so we do not consider

noninformative priors for our model.

Informative priors sometimes make use of expert knowledge or historical data.

Chen and Ibrahim [2000] illustrate how informative priors, based on historical data,

can be used in a time series model for count data. Specification of the prior dis-

tributions will depend on whether we have any prior belief about the parameters.

With very little knowledge about what these might be, we can consider vague priors.

Gelman [2006] characterizes a prior distribution as weakly informative if it is proper

yet weak with respect to prior knowledge. An alternative approach which uses data

to estimate prior parameters is called Empirical Bayes [Gelman et al., 2004].

In many regression problems, normal priors are given to the regression parameters.

Normal priors are commonly used because by using a normal approximation to the

generalized linear model likelihood we can mimic the methods developed for the

hierarchical normal linear model [Gelman et al., 2004]. We can use the following

prior distribution for a regression parameter $\beta^{(h)}_{j,k}$:
\[
\beta^{(h)}_{j,k} \sim \mathrm{Normal}\bigl(\mu_{\beta^{(h)}_j}, \sigma^2_{\beta^{(h)}_j}\bigr),
\]
with $\mu_{\beta^{(h)}_j}$ and $\sigma^2_{\beta^{(h)}_j}$ known, or following hyperprior distributions. A noninformative prior for $\beta^{(h)}_{j,k}$ may also be considered, and will result in the classical analysis of generalized linear models [Gelman et al., 2004]. A possible alternative prior for the

regression parameters is the t-distribution, which allows for heavier tails, but we have

not attempted to use prior t-distributions in our research.

As described in Section 5.2.1, we assume that the parameters in our model are related hierarchically, because they come from a common prior distribution for reporting agencies within the same grouping. To construct this hierarchy, we need to decide upon any relationships among parameters for different crimes across agencies. Using our exploratory analysis involving independent Poisson GLMs for each of the three crimes in each of thirteen jurisdictions in Minnesota, we can summarize the relationship among crime-level parameters. Figure 5.8 shows that there is strong correlation in the intercept coefficients among burglary, larceny, and vehicle theft for these jurisdictions. We will put a multivariate normal prior distribution on the intercept coefficients. Letting $\alpha_j = (\alpha_{j,1}, \alpha_{j,2}, \alpha_{j,3})$ be the vector of intercepts for the three crimes in agency $j$, we will assume $\{\alpha_j : j = 1, 2, \ldots, J\}$ are conditionally independent with

\[
\alpha_j \mid \mu_\alpha, \Sigma_\alpha \sim \mathrm{MVN}_3(\mu_\alpha, \Sigma_\alpha)
\]

for each j, where J is the number of agencies in the county.

Figure 5.9 shows distributions of the Poisson regression coefficients. We notice that

for the intercept parameter called alpha, larceny is highest, followed by burglary, then

vehicle theft. This ordering likely corresponds to the ordering of crime rates. Larceny

makes up the highest percentage of property crimes, while vehicle theft makes up the

lowest. Looking at the plots for the other regression coefficients corresponding to the

four Hermite polynomials, the betas, we see that there is much less distinction in these

parameters among crime types. We must be careful when inferring relationships from

these plots because the boxplots, most of which do overlap, group all thirteen agencies

together. Therefore, similarities (or differences) in parameters among crime types

may not be the same for every agency. As we expect, the bottom-right plot, which

Figure 5.8: Plots show correlations among alpha coefficients for the three different crimes using the thirteen cities in MN. Pearson correlations (0.96, 0.97, and 0.98) are shown in the lower triangle.

shows the distributions of each beta for all three crimes combined on the same scale, shows that coefficients for higher order terms become increasingly smaller. When specifying our prior distributions for these model parameters, we can see from Figure

5.9 that normal distributions are reasonable distributional assumptions because the boxplots are fairly symmetric about the means. Later, as we think about specifying distributions for our hyperparameters, these plots will again be helpful.



Figure 5.9: Boxplots grouped by crime type show the distributions of the model coefficients for the thirteen agencies in MN. The last plot shows the ungrouped boxplots of the beta coefficients on the same scale.

Since we are dealing with time series data in the UCR, our models will include

some type of autocorrelation parameter. In the Bayesian setting, we must decide on

a prior distribution to assign to this parameter. First, we consider the conditions for

stationarity. For an AR(1) process, stationarity is maintained when the autoregressive

parameter, $\phi$, is between $-1$ and $+1$. We can either constrain our autoregressive coefficients to be stationary by bounding the prior at $-1$ and $+1$, or we can choose unconstrained priors. Working with time series count data, Hay and Pettitt [2001], for example, use a Uniform($-1$, $1$) prior for the autoregressive parameter. Congdon [2001] notes that if we choose unconstrained priors, then we can test whether stationarity is appropriate or not. This requires changing the distribution of ǫ1 in our model

(5.6) to allow for non-stationarity. We can actually calculate the probability that an

AR(1) model is stationary. For our model, we will use an informative beta prior for

φ, with support on (0,1), based on the Empirical Bayes approach. This implies that

we are constraining the latent process to be stationary and we have knowledge that the

autocorrelation is positive. We chose a beta distribution that would put more density

away from 1. Figure 5.10 shows the probability density function of a beta(5, 10)

random variable. Putting a beta(5, 10) prior on φj,k would bring φj,k away from 1.

We have seen, through trial and error, that the φj,k parameters had a tendency to stay

close to 1, thereby picking up the trend in our data and distorting the effects of other

variables in our model. By pulling φj,k away from 1, we are allowing the Hermite

polynomials to model the trend in our data.
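For reference, the Beta(5, 10) prior can be inspected directly in R; the short sketch below plots its density and reports its mean and a central 95% interval, which makes clear how much of the prior mass sits away from 1.

## The Beta(5, 10) prior for the AR coefficients phi.
curve(dbeta(x, shape1 = 5, shape2 = 10), from = 0, to = 1,
      xlab = "phi", ylab = "Prior density")
5 / (5 + 10)                       # prior mean = 1/3
qbeta(c(0.025, 0.975), 5, 10)      # central 95% prior interval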

Now we consider prior distributions for the variance parameters in our model.

We model variance through both the variance of the regression parameters $\beta^{(h)}_{j,k}$ and the variance of the latent process $\{\epsilon_{j,k,t}\}$. If our regression parameter $\beta^{(h)}_{j,k}$ has a


Figure 5.10: Beta(5,10) probability distribution.

normal prior with mean $\mu_{\beta^{(h)}_j}$ and variance $\sigma^2_{\beta^{(h)}_j}$, then we will also want to put a prior distribution on $\sigma^2_{\beta^{(h)}_j}$. There are various noninformative and vague priors for $\sigma^2_{\beta^{(h)}_j}$ that are seen in the literature. Gelman et al. [2004] use an improper uniform density on the positive real numbers. Widely used is the proper conditionally conjugate inverse-gamma(ν, ν) distribution, with ν taking on a low value such as 0.001. For hierarchical models, Gelman [2006] suggested that the choice of noninformative prior for a variance parameter can have a large effect on inferences if the number of groups is small or the group-level variance is small. Gelman [2006] does not recommend using the inverse-gamma(ν, ν) distribution because it does not result in a proper limiting posterior distribution. As ν goes to 0, the posterior distribution becomes improper. This means that the posterior inferences could be sensitive to the choice

of ν. The inverse-gamma(ν, ν) prior performs especially poorly for small $\sigma^2_{\beta^{(h)}_j}$, since ν will be set low. For a proper prior, Gelman [2006] recommends using a uniform distribution with a very wide range for the standard deviation $\sigma_{\beta^{(h)}_j}$. For our model, we use $\sigma_{\beta^{(h)}_j} \sim \mathrm{Uniform}(0, 10)$. We have defined this distribution in terms of constant parameter values, a step in our prior specification process that we now discuss.

Once the prior distributions of the model parameters have been decided on, as well as hyperprior distributions of the hyperparameters, we still need further specification of the parameter values for the highest prior distributions. For example, if we have chosen a normal distribution for $\mu_{\beta^{(h)}_j}$, we must now specify the mean and variance of this distribution. We choose fixed values that depend on how informative we would like to be about each $\mu_{\beta^{(h)}_j}$. Our technique can be described as Empirical Bayes, since we set these parameter values based on our results from our exploratory analysis with

Minnesota agencies. Substantial caution needs to be taken in this approach, since we are effectively “double dipping” into the crime data set. Specifically, upon inspection of plots in Figure 5.9, we have knowledge about the parameter values. For example, we see that the average α coefficient for burglary is about 3, so we choose µα for

burglary to be centered at 3 in its prior distribution. We assume that the variance is

1 for each µα, and we allow for correlation among means for the different crimes.

Now that we have determined what prior distributions to use for our parameters,

we can specify our complete model. This is shown in detail in the next section.

5.3 A Proposed Model

Let us consider the three property crime series (burglary, larceny, vehicle theft)

for each of J agencies in one county in one state. We now specify a model for these

J groups of observations. Afterward, we will describe how to extend this hierarchical model to include a county level and a state level. Let $Y_{j,k,t}$ represent agency $j$'s crime count for crime type $k$ in month $t$. For $j = 1, 2, \ldots, J$, $k = 1, 2, 3$, and $t = 1, 2, \ldots, 516$, conditional on parameters $(\alpha_{j,k}, \{\beta^{(h)}_{j,k} : h = 1, 2, \ldots, 4\}, \{\eta^{(s)}_{j,k} : s = 1, 2, 3\})$ and the latent autoregressive process $\{\epsilon_{j,k,t'} : t' \leq t\}$, let

\[
Y_{j,k,t} \overset{\text{cond. indep.}}{\sim} \mathrm{Poisson}(\mu_{j,k,t}),
\quad \text{where } \log(\mu_{j,k,t}) = \alpha_{j,k} + \sum_{h=1}^{4} \beta^{(h)}_{j,k} x^{(h)}_t + \sum_{s=1}^{3} \eta^{(s)}_{j,k} w^{(s)}_t + \epsilon_{j,k,t}.
\]
The four Hermite polynomials for time are represented by the covariates $\{x^{(1)}_t, x^{(2)}_t, x^{(3)}_t, x^{(4)}_t\}$, and the dummy variables for season are represented by the covariates $\{w^{(1)}_t, w^{(2)}_t, w^{(3)}_t\}$ for Fall, Winter, and Spring respectively (Summer is contained in the intercept). The latent autoregressive process for agency $j$ and crime $k$ is represented by $\{\epsilon_{j,k,t} : t = 1, \ldots, 516\}$.

For agency $j$, the intercept coefficients for the three crimes are assumed to follow a trivariate normal prior distribution. Conditionally on $\mu_\alpha$ and $\Sigma_\alpha$, each intercept vector $\alpha_j$ ($j = 1, 2, \ldots, J$) is drawn from the same density:
\[
\alpha_j \overset{\text{indep.}}{\sim} \mathrm{MVN}_3(\mu_\alpha, \Sigma_\alpha).
\]

The hyperprior distribution of µα is assumed trivariate normal and Σα is assumed to

be inverse Wishart, which will model the covariance of intercept coefficients among

the three crimes. We assume that µα and Σα are drawn from independent hyperprior

distributions such that
\[
\mu_\alpha \sim \mathrm{MVN}_3\!\left( \begin{pmatrix} 3 \\ 4 \\ 2 \end{pmatrix},\; 5\begin{pmatrix} 1 & .5 & .5 \\ .5 & 1 & .5 \\ .5 & .5 & 1 \end{pmatrix} \right)
\quad \text{and} \quad
\Sigma_\alpha \sim \mathrm{InvWish}\!\left( 5\begin{pmatrix} 1 & .5 & .5 \\ .5 & 1 & .5 \\ .5 & .5 & 1 \end{pmatrix},\; 3 \right).
\]

The remaining regression parameters are given normal prior distributions. For each $h$, the parameters for the trend $\{\beta^{(h)}_{j,k} : j = 1, 2, \ldots, J \text{ and } k = 1, 2, 3\}$ are independent conditional on $\mu_{\beta^{(h)}_j}$ and $\sigma_{\beta^{(h)}_j}$, and the parameters for the seasons $\{\eta^{(s)}_{j,k} : j = 1, 2, \ldots, J \text{ and } k = 1, 2, 3\}$ are independent conditional on $\mu_{\eta^{(s)}_j}$ and $\sigma_{\eta^{(s)}_j}$. For $h = 1, 2, 3, 4$, $j = 1, 2, \ldots, J$, and $k = 1, 2, 3$, let
\[
\beta^{(h)}_{j,k} \overset{\text{indep.}}{\sim} \mathrm{Normal}\bigl(\mu_{\beta^{(h)}_j}, \sigma_{\beta^{(h)}_j}\bigr)
\quad \text{and} \quad
\eta^{(s)}_{j,k} \overset{\text{indep.}}{\sim} \mathrm{Normal}\bigl(\mu_{\eta^{(s)}_j}, \sigma_{\eta^{(s)}_j}\bigr).
\]

We assume that for each $h$ and $j$, $\mu_{\beta^{(h)}_j}$ is independent from $\sigma_{\beta^{(h)}_j}$. We assume a normal prior distribution for each $\mu_{\beta^{(h)}_j}$ and a uniform prior distribution for each $\sigma_{\beta^{(h)}_j}$. For $h = 1, 2, 3, 4$ and $j = 1, 2, \ldots, J$, let
\[
\mu_{\beta^{(h)}_j} \overset{\text{indep.}}{\sim} \mathrm{Normal}(0, 1)
\quad \text{and} \quad
\sigma_{\beta^{(h)}_j} \overset{\text{indep.}}{\sim} \mathrm{U}(0, 10).
\]

We assume that for each $s$ and $j$, $\mu_{\eta^{(s)}_j}$ is independent from $\sigma_{\eta^{(s)}_j}$. We assume a normal prior distribution for each $\mu_{\eta^{(s)}_j}$ and a uniform prior distribution for each $\sigma_{\eta^{(s)}_j}$. For $s = 1, 2, 3$ and $j = 1, 2, \ldots, J$, let
\[
\mu_{\eta^{(s)}_j} \overset{\text{indep.}}{\sim} \mathrm{Normal}(0, 1)
\quad \text{and} \quad
\sigma_{\eta^{(s)}_j} \overset{\text{indep.}}{\sim} \mathrm{U}(0, 10).
\]

The latent process is modeled as an AR(1) process with autoregressive parameter

$\phi_{j,k}$. Conditional on $\phi_{j,k}$, $\{\epsilon_{j,k,t'} : t' < t\}$, $\sigma_{\epsilon_{j,k}}$, and the hyperparameters, the $\{\epsilon_{j,k,t}\}$ are independent. For $j = 1, 2, \ldots, J$, $k = 1, 2, 3$, and $t = 2, 3, \ldots, 516$, let
\[
\epsilon_{j,k,t} \mid \epsilon_{j,k,t-1} \overset{\text{indep.}}{\sim} \mathrm{Normal}(\mu_{\epsilon_{j,k,t}}, \sigma_{\epsilon_{j,k}}),
\qquad \sigma_{\epsilon_{j,k}} \overset{\text{indep.}}{\sim} \mathrm{U}(0, 1),
\qquad \mu_{\epsilon_{j,k,t}} = \phi_{j,k}\, \epsilon_{j,k,t-1}.
\]

The initial starting condition for $t = 1$, $j = 1, 2, \ldots, J$, and $k = 1, 2, 3$ is
\[
\epsilon_{j,k,1} \overset{\text{indep.}}{\sim} \mathrm{Normal}\bigl(0, \sigma^{(1)}_{\epsilon_{j,k}}\bigr),
\quad \text{where } \sigma^{(1)}_{\epsilon_{j,k}} = \frac{\sigma_{\epsilon_{j,k}}}{\sqrt{1 - \phi_{j,k}^2}}.
\]
The autoregressive parameters $\phi_{j,k}$ are assumed to follow beta distributions. This implies positive autocorrelation, with a stationary process for the autoregressive component conditional on $\phi_{j,k}$. Conditional on $a_\phi$ and $b_\phi$, the $\{\phi_{j,k}\}$ are independent. For $j = 1, 2, \ldots, J$ and $k = 1, 2, 3$, let
\[
\phi_{j,k} \overset{\text{indep.}}{\sim} \mathrm{Beta}(a_\phi, b_\phi),
\quad \text{with} \quad a_\phi \sim \mathrm{Gamma}(50, 10), \quad b_\phi \sim \mathrm{Gamma}(100, 10).
\]
Now, suppose we wanted to extend the above hierarchical model to include all counties and all states. Going up a level, we can assume, for example, that the county parameter $\mu_{\beta^{(h)}_j, v}$ for county $v$ can be thought of as a sample from a common population distribution for all counties in the state. Note that our indexes should now be nested: agency is nested in county, and county is nested in state. This means that agency 1 in county 1 is different from agency 1 in county 2. We can assume that

\[
\mu_{\beta^{(h)}_j, v} \mid \theta_{1,u}, \theta_{2,u} \overset{\text{indep.}}{\sim} \mathrm{Normal}(\theta_{1,u}, \theta_{2,u}).
\]

Now θ1,u and θ2,u are state-level parameters for state u. These parameters also can be drawn from population distributions. For example,

\[
\theta_{1,u} \mid \zeta_1, \zeta_2 \overset{\text{indep.}}{\sim} \mathrm{Normal}(\zeta_1, \zeta_2).
\]

Of course, the exchangeability assumption within each agency, county, and state is important as we move up each level in the hierarchical structure. The benefit of this structure is that parameters can borrow strength across groups, and estimates will be pulled toward the overall means.
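To make the generative structure of the model concrete, the following R sketch forward-simulates monthly counts for one county from a simplified version of the specification above. The hyperparameter values are illustrative stand-ins chosen so that the simulated counts stay in a realistic range; they are not the vague priors used for fitting, and this is not the BRugs/OpenBUGS code used in our analysis.

## Forward simulation from a simplified version of the Section 5.3 model.
set.seed(4)
J <- 4; K <- 3; N <- 516
t <- 1:N; z <- (t - 258.5) / 149.1
X <- cbind(2 * z, 4 * z^2 - 2, 8 * z^3 - 12 * z, 16 * z^4 - 48 * z^2 + 12)
month <- rep(1:12, length.out = N)
W <- cbind(month %in% 9:11, month %in% c(12, 1, 2), month %in% 3:5) * 1

mu_alpha <- c(3, 4, 2)                   # rough intercept levels: BUR, LAR, VTT
Y <- array(NA_real_, dim = c(J, K, N))
for (j in 1:J) {
  for (k in 1:K) {
    alpha <- rnorm(1, mu_alpha[k], 0.5)  # agency/crime intercept
    beta  <- rnorm(4, 0, 0.02)           # Hermite trend coefficients
    eta   <- rnorm(3, 0, 0.1)            # Fall, Winter, Spring effects
    phi   <- rbeta(1, 5, 10)             # AR(1) coefficient
    sig   <- runif(1, 0, 0.2)            # innovation sd of the latent process
    eps <- numeric(N)
    eps[1] <- rnorm(1, 0, sig / sqrt(1 - phi^2))
    for (tt in 2:N) eps[tt] <- rnorm(1, phi * eps[tt - 1], sig)
    mu <- exp(alpha + X %*% beta + W %*% eta + eps)
    Y[j, k, ] <- rpois(N, mu)
  }
}
str(Y)   # J x K x N array of simulated monthly crime counts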

5.4 Checking the Model

When using any MCMC techniques, as discussed in Section 5.1, our first check, after running our model fitting program in BRugs, is always to see if our iterative

Markov chain appears to have converged to the posterior distribution. This is not a model check, but it is completely necessary before any meaningful model checks can be performed. For each parameter in the model, we can produce a trace plot which shows what value of the parameter is sampled at each iteration. Examining trace plots takes practice, and it is not always obvious what to look for in them. If the trace plot appears to wander, then the iterative procedure may still be searching for the posterior distribution. If a trace plot suggests that convergence has occurred, our samples may still be misrepresenting the correct distribution. To avoid this mistake, we employ a second check. We can set different initial values for the parameters. If two chains with different starting values appear to converge to the same distribution, we are more confident that we are sampling from the correct posterior distribution.

We also looked at the sample ACF and sample PACF plots from the sequence of posterior samples. From these plots, we may be able to assess if by thinning our sample we are drawing independent samples from the posterior distribution.
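In practice we carry out these checks with functions from the coda package in R. The sketch below assumes chain1 and chain2 are matrices of posterior draws (one column per monitored parameter) from two runs with different initial values; simulated stand-ins are used here so the sketch runs on its own.

## Convergence and mixing diagnostics with coda.
library(coda)

set.seed(5)
chain1 <- matrix(rnorm(2000), ncol = 2, dimnames = list(NULL, c("phi", "sigma")))
chain2 <- matrix(rnorm(2000), ncol = 2, dimnames = list(NULL, c("phi", "sigma")))

chains <- mcmc.list(mcmc(chain1), mcmc(chain2))
traceplot(chains)       # do the two chains mix and agree?
gelman.diag(chains)     # potential scale reduction factors near 1
autocorr.plot(chains)   # sample ACF of the draws, used to choose thinning
effectiveSize(chains)   # effective number of independent draws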

Once we have verified that we are sampling from the posterior distribution, we can now evaluate the model. First, there are several graphical checks. We can look at the data along with the fitted values to inspect visually how well the model fits the data. Figure 5.11, for example, shows the fitted values of the model, represented by posterior means (connected by lines), along with the actual observed data (circles).

The model appears to fit the data well. Another model check is to summarize posterior distributions graphically. Figures 5.12 and 5.13 show examples of these posterior


Figure 5.11: The solid line shows posterior means for the distribution of each $\mu_t$ ($t = 1, 2, \ldots, 516$) for the burglary series in Robbinsdale, MN. (Actual values shown as circles.)

summary plots by looking at the distributions of the hyperparameters for parameters

alpha ($\alpha_{j,k}$) and beta1 ($\beta^{(1)}_{j,1}$ for burglary). We can see that posterior distribution summaries are in agreement with results from our exploratory analysis of the data using independent Poisson regression models.

We can check if our model has adequately accounted for the dependence in the data by observing the sample ACF plot of the residuals. Figure 5.14 is a Bayesian

ACF plot, in the sense that each ACF value has a distribution, and is represented by a boxplot. From this plot, there does not appear to be any obvious autocorrelation


Figure 5.12: Posterior summaries of mean alpha (µα) coefficients for each crime. X marks the estimate from the independent Poisson GLMs.

Figure 5.13: Posterior summaries of mean beta1 ($\mu_{\beta^{(1)}_j}$) coefficients for each agency. X marks the estimate from the independent Poisson GLMs.

unaccounted for by the model. Together, these graphical checks suggest that our model is appropriate and useful for the UCR data.


Figure 5.14: Sample ACF plot of residuals from Bayesian model, represented by boxplots.

Besides graphical checks, there are also numerical checks one can make. Gelman et al. [2004] use discrepancy measures to assess discrepancy between the model and the data. Some discrepancy measures are the posterior predictive p-values, χ2 discrepancy, deviance, or the predictive L measure in Chen and Ibrahim [2000]. We will use numerical model checking tools for assessing our model in the next chapter.

For our problem, the most important goal is to make good estimates of the missing values. Having our main objective clearly defined helps us to determine which model checks will be most meaningful for our situation. We will evaluate our model by

assessing the accuracy of our imputations, though we do not describe these model checks in detail in this chapter. In Chapter 6, we first describe our imputation technique in the Bayesian setting. Then we will illustrate our imputation strategy using real data from the UCR and simulated missingness. Finally, we will check our model based on its ability to predict missing values.

CHAPTER 6

BAYESIAN IMPUTATION

As we have emphasized several times, our research is motivated by a real-life problem: missing data in the UCR. We ourselves have no expertise in using the data to answer criminological questions, but we know that other researchers do, yet have been restricted by the presence of missing data. Our goal has been to develop a method for filling in the missing data with plausible values, thereby facilitating the wide use of the UCR data in research. Our intent is to provide researchers a strategy for imputing missing values. Ideally, we would like to provide completely

filled-in UCR data for public use, but this step is beyond the scope of the current research goals.

Thus far, our research path has taken us to the point where we have built an adequate hierarchical model (Chapter 5) of the UCR data, at least for one county.

Now, we discuss using this model to make imputations for missing values. In the area of Bayesian inference, missing data are treated in the same way as parameters, since both are uncertain [Gelman et al., 2004]. Our imputations, or more generally, inferences concerning the missing values, will be based on the posterior distributions that we simulate.

The EM algorithm, described in Chapter 3, is an iterative procedure for comput-

ing maximum likelihood estimates when the data are incomplete. Estimates of the

missing values can be found by maximizing the joint density of the missing values

and observed values with respect to the missing values. Taking a Bayesian approach,

we are interested in finding the posterior predictive distribution (Equation 5.2) of

the missing values conditional on the observed data, rather than only the maximum

likelihood estimates of the missing values.

With a posterior distribution for each missing value, we can provide results in the

form of probability statements about the missing values. If the posterior distribution

is difficult to calculate directly, we can approximate the distribution by drawing many

samples from the distribution. Typical point estimates of missing values are means

or modes of the posterior distributions. The variance of a missing value corresponds

to the variance of its posterior predictive distribution.

Before we can find the appropriate posterior predictive distributions, we must

construct a model for the missing data (Section 6.1). After discussing our model, we

will describe the Bayesian methods used to deal with missingness in Section 6.2. We

evaluate our model and method in various ways in Section 6.3.

6.1 A Model for Missing Data

In Section 5.3, we constructed a Bayesian hierarchical model for the UCR data.

This model assumes that all data are observed. In this section, we extend the model specification in Section 5.3 to incorporate missing observations. Clarifying the missingness assumptions is a necessary step to appropriately utilize missing data methods.

First, let us define some notation. Let Yj,k,t be agency j's crime count for crime type k at time t and let Y represent the collection of all Yj,k,t. We define a new variable to indicate whether the response is observed or not. Let

\[
R_{j,k,t} = \begin{cases} 1 & \text{if } Y_{j,k,t} \text{ is observed;} \\ 0 & \text{if } Y_{j,k,t} \text{ is missing,} \end{cases}
\]
and let $R$ represent the collection of all $R_{j,k,t}$. Using the notation in Gelman et al.

[2004], let 'obs' = $\{(j, k, t) : R_{j,k,t} = 1\}$ index the observed components of $Y$. Also, let 'mis' = $\{(j, k, t) : R_{j,k,t} = 0\}$ index the missing components of $Y$. Now, $Y = \{Y_{\mathrm{obs}}, Y_{\mathrm{mis}}\}$ represents the complete set of observed response data and missing response data.

We now review the concept of missing at random (MAR), first mentioned in Chap-

ter 2. Let us assume that the elements of R follow some distribution with parameter

vector θR. The distribution of R is called the missing data mechanism or the response

mechanism. Also, let θY represent the parameter vector of the distribution for the

elements of Y . Now, the joint distribution of R and Y is

\[
f(R, Y \mid \theta_R, \theta_Y) = f(Y \mid \theta_Y)\, f(R \mid Y, \theta_R). \tag{6.1}
\]

The data are MAR if the distribution of the missing data mechanism does not depend

on the missing values [Little and Rubin, 2002], that is, if

\[
f(R \mid Y, \theta_R) = f(R \mid Y_{\mathrm{obs}}, \theta_R) \quad \text{for all } Y_{\mathrm{mis}}. \tag{6.2}
\]

In dealing with missing data, the concept of ignorability is very important. Having

an ignorable missing data mechanism basically means that we can use the observed

data to make inference on the parameters related to Y . Ignorability has two conditions

in Rubin [1976]. The first condition is that the data are MAR. The second condition

is that the parameters for Y and R are distinct. This means that the joint prior distribution of Y and R factors into independent marginal priors, so the models for Y and R are separate, with non-overlapping parameters. If the missing data mechanism is ignorable, then inferences about θY or Ymis can be made based on the posterior

distributions $\pi(\theta_Y \mid Y_{\mathrm{obs}})$ and $\pi(Y_{\mathrm{mis}} \mid Y_{\mathrm{obs}})$, respectively.

Now we will propose a model for the missing data mechanism for the UCR data.

First, let us consider the probability distribution of R. It is unlikely that every

observation has equal probability of being missing. For example, if we know that

a crime count is missing in April, then we may be more likely to believe the crime

count will be missing for May also. Furthermore, if we know that the burglary count

is missing for an agency in April, then we suspect that the agency’s larceny count

may also be missing in April. This implies that the Rj,k,t are not independent, but

neither are the Yj,k,t. What is important to MAR, however, is not that the Rj,k,t

be independent, but rather that R be independent from Ymis. We have mentioned

earlier our concern of the possibility that an agency might decide to not report ‘0’

crime counts. This would indeed make the MAR assumption invalid. In fact, any

situation of missingness depending on the missing values themselves would negate

the MAR assumption. We argue that the MAR assumption is difficult to prove or to

disprove because the missing values are unobservable. We believe that the missingness

is generally unintentional, and is caused by factors unrelated to the missing values

themselves. It is reasonable to assume that the data are MAR as we present our

imputation strategy in this chapter. For ignorability, we also assume that the prior

distributions of Y and R are distinct.

At some point, if we must assume that the data are not missing at random

(NMAR), then our missing data mechanism is non-ignorable. In this case, however, we can still make imputations. The multiple imputation paradigm, which we describe and use in the next section, does not require or assume that missingness is ignorable [Schafer, 1999]. To make imputations, we first would need to specify the missing data mechanism; i.e., the probability distribution of R. This distribution may give a lower probability of response to '0' crime counts, for instance. In any case, modeling R would certainly require more work, and the benefit of including a non-ignorable model for missing data may or may not be substantial.

6.2 Missing Data Methods

We now review Bayesian methods used to make inference with incomplete data.

We begin with the Data Augmentation (DA) algorithm, in Section 6.2.1, as we introduce the basic strategy of dealing with missing data in the Bayesian setting. In

Section 6.2.2, we describe how the Gibbs’ sampler can produce the same results as

Data Augmentation. Finally, in Section 6.2.3, we discuss multiple imputation, which is incorporated into Data Augmentation, but can also be used as a unique way to present imputation results.

6.2.1 Data Augmentation Algorithm

Data Augmentation refers generally to the strategy of augmenting the observed data to make it easier to analyze. In the missing data problem, DA refers to augmenting the observed data with the missing data. Congdon [2001, Chapter 6] observes that estimation of all missing data models is implicitly or explicitly based on augmentation of the observed data to account for the missing latent data. Tanner and Wong

[1987] showed how DA can be used to calculate posterior distributions in situations

where the joint distribution of parameters and observed data is difficult to calculate.

The DA algorithm has many similarities with the EM algorithm, an iterative

missing data method described in Chapter 2. The basic idea for an iterative missing

data method is that we alternate between filling in the missing values, and refitting

the model, until convergence. The goal of the EM algorithm is to provide maximum

likelihood estimates of the parameters, but the goal of DA is to approximate the

posterior distributions of the parameters. The EM algorithm has two steps, the E step

and M step (see Chapter 2), and the DA algorithm has two analogous steps. The I step

involves imputation and the P step involves sampling from the posterior distributions.

At each iteration, DA draws a sequence of samples from the distributions of the

parameters and missing data.

The DA algorithm is given in Tanner and Wong [1987] with the following steps.

Let Y and θY be defined as in Section 6.1 and let us assume the data are MAR.

(a) I Step (Imputation Step):
    (a1) Draw $\theta'$ from the current estimate of $\pi(\theta \mid Y_{\mathrm{obs}})$.
    (a2) Draw $Y_{\mathrm{mis}}$ with density $\pi(Y_{\mathrm{mis}} \mid Y_{\mathrm{obs}}, \theta')$.
    Repeat steps (a1) and (a2) M times.

(b) P Step (Posterior Step): Set the posterior density of $\theta$ equal to the mixed distribution, mixed over the M imputed values of $Y_{\mathrm{mis}}$.

Iterate between the I step and the P step until convergence.

In step (a), given the current guess of the posterior of θ given Yobs, we generate a

sample of M latent data patterns from the predictive distribution of Ymis given Yobs.

Step (a) involves multiple imputation, which we discuss in Section 6.2.3. Steps (a1)

and (a2) will generate a draw from the predictive density $\pi(Y_{\mathrm{mis}} \mid Y_{\mathrm{obs}}, \theta')$. Step (b) involves the computation of the posterior distribution of θ based on the augmented data sets. Step (b) updates the posterior distribution of θ, given $Y_{\mathrm{obs}}$, to be the mixture

of the M augmented data posteriors. The algorithm iterates between the imputation

step and the posterior step until convergence.

Little and Rubin [2002, Chapter 10] give a different version of DA where step (a) does not involve multiple imputation and step (b) does not involve multiple posterior sampling. Instead, the I step involves taking one draw from the predictive density of

Ymis given Yobs and the current guess of θ, and the P step involves taking one draw

from the posterior distribution of θ given Yobs and the current guess of Ymis. At its

completion, both versions of the DA algorithm will produce samples from the joint

posterior distribution of Ymis and θ given Yobs, but the second version is slower.

6.2.2 The Gibbs’ Sampler

The Data Augmentation algorithm in Little and Rubin [2002] is a special case of the Gibbs' sampler. We described the Gibbs' sampler in Chapter 5 as an MCMC method of simulating from the posterior distribution when we have multiple parameters, and the full conditionals are easier to draw from than the joint distribution. Now we are treating Ymis as another unknown parameter. The DA algorithm is essentially sampling from conditional distributions, just like Gibbs' sampler. The results from

DA can easily be obtained by the versatile Gibbs’ sampler, especially when we have multiparameter models.

To illustrate the Gibbs’ sampler, let us consider a simple example of one crime

series and one missing crime count, Ymis. Let us assume that our model has param-

eters (θ1,...,θp). In the Gibbs' sampler, at iteration i we make the following draws

[Little and Rubin, 2002]:
\[
\begin{aligned}
Y_{\mathrm{mis}}^{(i+1)} &\sim p(Y_{\mathrm{mis}} \mid Y_{\mathrm{obs}}, \theta_1^{(i)}, \ldots, \theta_p^{(i)}) \\
\theta_1^{(i+1)} &\sim p(\theta_1 \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}^{(i+1)}, \theta_2^{(i)}, \ldots, \theta_p^{(i)}) \\
\theta_2^{(i+1)} &\sim p(\theta_2 \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}^{(i+1)}, \theta_1^{(i+1)}, \theta_3^{(i)}, \ldots, \theta_p^{(i)}) \\
&\;\;\vdots \\
\theta_p^{(i+1)} &\sim p(\theta_p \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}^{(i+1)}, \theta_1^{(i+1)}, \ldots, \theta_{p-1}^{(i+1)}).
\end{aligned}
\]
After convergence of the sampling sequence, the result will be a draw from the joint distribution of $(Y_{\mathrm{mis}}, \theta_1, \ldots, \theta_p \mid Y_{\mathrm{obs}})$. OpenBUGS will incorporate a Gibbs' sampling technique to give a Markov chain of samples for each parameter and each missing value.

One strategy is to take one draw from the posterior predictive distribution of Ymis for each run of the Gibbs’ sampler. Running the Gibbs’ sampler M times will give us M independent imputations of the missing values. We will describe how these M sets of imputations can be used in the next section.
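A toy example may help fix ideas. The R sketch below runs a Gibbs' sampler for a deliberately simple model, not our UCR model: normal data with known variance 1, a vague normal prior on the mean, and a single missing observation, alternating between an I-step draw of the missing value and a P-step draw of the mean.

## Toy Gibbs' sampler treating one missing observation as an unknown.
set.seed(6)
y_obs <- rnorm(11, mean = 5, sd = 1)   # observed data
n <- length(y_obs) + 1                 # one value is missing

S <- 5000
theta <- numeric(S); y_mis <- numeric(S)
theta[1] <- mean(y_obs); y_mis[1] <- mean(y_obs)
for (s in 2:S) {
  y_mis[s] <- rnorm(1, mean = theta[s - 1], sd = 1)      # I step: draw Y_mis
  ybar <- (sum(y_obs) + y_mis[s]) / n
  post_var  <- 1 / (n + 1 / 100^2)                       # prior: Normal(0, 100^2)
  post_mean <- post_var * n * ybar
  theta[s] <- rnorm(1, post_mean, sqrt(post_var))        # P step: draw theta
}
mean(y_mis[-(1:1000)])   # posterior predictive mean of the missing value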

6.2.3 Multiple Imputation

Following successful simulation of the posterior predictive distributions of the missing values, we can now use these distributions to make imputations. If we were interested in single imputation, we could choose to take the posterior mode or mean as imputations. We could also estimate the uncertainty of the imputation from the posterior predictive distribution. Alternatively, we can consider making multiple imputations. Multiple imputation was seen in step (a) of the DA algorithm, where

$Y_{\mathrm{mis}}^{(m)} \sim \pi(Y_{\mathrm{mis}} \mid Y_{\mathrm{obs}})$, for $m = 1, 2, \ldots, M$, were M draws from the posterior predictive distribution of the missing values.

Multiple imputation (MI) [Rubin, 1987] is a technique designed for the purpose of estimating population parameters, and their variances, using incomplete data. The idea is that we create more than one set of predictions for the missing values to reflect imputation variability. We make multiple draws from the posterior predictive distribution of the missing values, use these draws as imputations, and use the M completed datasets to make inference on the population parameters. The inferences made from the M completed datasets are combined to form composite estimates of the parameters. MI improves upon single imputation because it incorporates an estimate of uncertainty due to missing values. Little and Rubin [2002, page 90] suggest that a small set of MIs (M ≤ 10) can provide very good inferences. For our problem, we are not so much concerned with estimating model parameters as with estimating the missing values themselves. MI is only justified for our situation if it can provide benefit for potential users and analysts of the UCR.

As a final product, we envision providing a number of completed UCR datasets.

Criminologists and other users of the UCR should be trained on how to perform analysis on the M completed datasets. In this way, rather than provide inferences ourselves for the missing values, we will leave inference in the hands of the researchers.

We now illustrate multiple imputation using an example from the UCR. Let us consider fitting the model specified in Section 5.3 using four agencies in Hennepin

County, Minnesota. In St. Louis Park, Minnesota, we have removed the larceny crime counts in 1988, and we use the technique of multiple imputation to make 6 completed datasets. In Figure 6.1, the first plot shows the entire crime series, with the removed year of data represented by a dotted line. The second plot shows boxplots representing the posterior predictive distributions estimated from our model for the


Figure 6.1: The first plot shows the larceny crime series for St. Louis Park, Minnesota, with data from 1988 removed. The second plot shows posterior predictive distributions for the missing data with the actual removed data represented by the dotted line.

twelve “missing” months in 1988. We can clearly see from this plot how the seasonality

in our model goes into our imputations. We are predicting the highest larceny counts

in the summer months and the lowest larceny counts in the winter months. In this

example, it is more difficult to see how trend is incorporated into our imputations

because the trend is relatively flat in 1988.

Random samples are drawn from the approximate posterior distributions to obtain the multiple imputations. Figure 6.2 shows results from multiple imputation. The evident variability among sets of imputations is an indication of the uncertainty of our predictions for the missing values. By using all six completed datasets, inferences on the model for the crime series will take into account the uncertainty in our imputed values. For example, if a researcher wanted to estimate St. Louis Park's average monthly larceny count during the 1980's, the following multiple imputation calculations could be made. For the six completed datasets, the average monthly larceny counts during the 1980's are 125.48, 125.96, 126.46, 124.39, 124.31, and 125.40. So our combined estimate is the average of these six averages, which is 125.48. The total variability associated with our combined estimate is $0.73 \times (7/6) = 0.85$, where 0.73 is the between-imputation variance and $(7/6)$ is the finite-$M$ adjustment, $(1 + \tfrac{1}{M})$. The rest of the data is a census, so there is no sampling error component of the multiple

imputation variance as there would be in typical uses. The result is helpful because

we have a variance estimate (0.85) for our estimated average monthly larceny count

during the 1980’s in St. Louis Park (125.48) via a simple calculation.
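Written as a small R function, the combining rule used above averages the M completed-data estimates and inflates the between-imputation variance by the finite-M factor. Applied to the six averages listed above, it reproduces the between-imputation variance of about 0.73 and the total of about 0.85.

## Multiple-imputation combining rule for the census setting described above.
combine_mi <- function(est) {
  M    <- length(est)
  qbar <- mean(est)          # combined point estimate
  B    <- var(est)           # between-imputation variance
  list(estimate = qbar, total_variance = B * (1 + 1 / M))
}

## six completed-data averages from the St. Louis Park example
combine_mi(c(125.48, 125.96, 126.46, 124.39, 124.31, 125.40))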



Figure 6.2: There are six sets of imputations shown here for St. Louis Park, MN. The imputed values are represented by circles, and the actual data, removed, is represented by the dotted line.

6.3 Evaluating the Method

In this section, we will evaluate our model based on the accuracy of its predictions

for missing values. Obviously, we cannot test our methods on data that is unknown,

because we will not know how accurate they are, so we must simulate missing values

in a complete data series. Graphically, we can see how close the imputed values are to

the actual values. This visual check will tell us if our imputation method appears to

be reasonable. The plots in Figures 6.1 and 6.2 show that our imputations appear to

do well. We also use data from Crystal, MN to illustrate our imputation performance,

shown in Figure 6.3. Also, we can look at plots of prediction errors for any unusual

patterns, as we would with any residual plot. The prediction error distribution is

the actual value subtracted from the posterior distribution of the missing value. For

example, in the prediction error plot in Figure 6.3, the prediction errors appear to float

around the zero line. This is an indication that our method is giving us reasonably

unbiased predictions, and that our model is doing well to predict the missing values.

Besides graphical checks, we can also do model checking using numerical discrepancy measures. We will look at several criteria for measuring the deviation between the predicted values and the actual values. We will use these discrepancy measures to compare three competing imputation methods, each based on a different model. Let $Y_{j,k,t}$ be agency $j$'s count of crime $k$ at time $t$. If we have $P$ missing values, let $p$ represent the index set $\{j, k, t\}$ of the $p$th missing value and let $\hat{Y}_p$ be a

prediction for Yp. For P predictions, the average bias (BIAS) will be estimated by

\[
\mathrm{BIAS} = \frac{1}{P} \sum_{p=1}^{P} \bigl(Y_p - \hat{Y}_p\bigr). \tag{6.3}
\]


Figure 6.3: Removed data in 2000 is represented by dotted line. The second plot shows posterior distributions of the missing values for 2000.

We will estimate mean square prediction error (MSPE) by

\[
\mathrm{MSPE} = \frac{1}{P} \sum_{p=1}^{P} \bigl(Y_p - \hat{Y}_p\bigr)^2. \tag{6.4}
\]
To estimate goodness of fit (GOF), we use the statistic defined by

\[
\mathrm{GOF} = \sum_{p=1}^{P} \frac{\bigl(Y_p - \hat{Y}_p\bigr)^2}{Y_p}. \tag{6.5}
\]
Now we consider three imputation methods. The first method uses the Bayesian hierarchical model in Section 5.3, and we impute the posterior mean for each missing value. The second method uses an independent Poisson regression model with the four Hermite polynomials for time and the three dummy variables for season. We compute predicted mean values based on this model for our imputations. The third method uses two SARIMA $(1, 1, 1) \times (1, 0, 0)_{12}$ models and makes predictions using the forecast-backcast composite estimate for the imputations. We use the three property crime series from four agencies in Hennepin County, Minnesota for our example.

We simulated missingness by randomly selecting 23 years of missing data among the

12 crime series. We then made imputations using our three imputation methods. For the Bayesian model, we ran MCMC for 60,000 iterations. After a burn-in of 10,000 iterations, we sampled every 20th iteration to get 2,500 posterior samples.
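The three discrepancy measures in Equations 6.3 through 6.5 are simple to compute; the R sketch below defines them as functions of the actual and predicted values, using illustrative stand-in numbers rather than the Hennepin County results.

## Discrepancy measures for comparing imputation methods.
bias_fn <- function(y, yhat) mean(y - yhat)
mspe_fn <- function(y, yhat) mean((y - yhat)^2)
gof_fn  <- function(y, yhat) sum((y - yhat)^2 / y)   # requires nonzero actual counts

## illustrative stand-in values
y    <- c(42, 55, 61, 70)
yhat <- c(40.1, 57.3, 59.8, 72.5)
c(BIAS = bias_fn(y, yhat), MSPE = mspe_fn(y, yhat), GOF = gof_fn(y, yhat))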

Figure 6.4 shows results from the three imputation methods used on the larceny crime series in Crystal, Minnesota. We see that the three imputation methods produce similar predictions, and all are very close to the actual crime counts. Table 6.1 shows the results of our comparison of the three imputation methods using the discrepancy statistics defined above. The average bias is very low for all three methods. Both the MSPE and GOF are lower for the independent Poisson GLM. The Bayesian method and SARIMA method have comparable results on both MSPE and GOF.

These results suggest that the hierarchical model does not show great improvement over the much simpler Poisson regression or the SARIMA time series model. We do understand, however, that these results cannot be generalized to the whole UCR, since we are only using three crimes for four agencies in one county in this particular study. It may be true, for instance, that the Poisson GLM is highly appropriate for these series, but may perform poorly in other series. Questions remain for further investigation, and we discuss these in Chapter 7.


Figure 6.4: Predictions from the three methods in Crystal’s larceny series. Actual crime counts for 1976 are represented by the dotted line.

Model                        Method                  BIAS    MSPE     GOF
Bayesian Hierarchical        Posterior Means        -0.19   125.17   1174.6
Indep. Poisson Regression    Predicted Mean Values  -0.19   118.81   1124.5
SARIMA (1,1,1) x (1,0,0)_12  Forecast-Backcast       0.08   126.20   1173.3

Table 6.1: Comparison of three imputation methods on 23 years of missing data.

6.3.1 Worst Case Scenario

One characteristic of a good imputation method is that it performs well in the most

extreme cases. We can assess how well our model performs in the worst situations by

simulating missing data for months when crime counts appear to be unpredictably

extreme. We now look at the vehicle theft crime series from Richfield, Minnesota,

shown in the first plot in Figure 6.5. We chose to remove the crime counts in 1979

from the data series, because of the unusually large range of crime counts during that

year, represented by a dotted line. The range of vehicle theft counts goes from 1 in

January to 43 in September. This happens to be the largest range within any one

year seen in the crime series. If the 1979 data were missing from the Richfield UCR

series, some of the monthly missing values would be very difficult to predict, so this

example represents a worst case scenario for imputation.

We fit our hierarchical Bayesian model, and the second plot in Figure 6.5 shows the imputed posterior means for each missing value, represented by circles, and the 2.5th and 97.5th posterior percentiles, represented by plus signs. These predictions appear to be very reasonable; if the crime counts from 1979 were truly unknown, we would feel very comfortable about our predictions. However, Figure 6.5 also shows how poor our predictions are for several months of the year when compared with the actual values. We completely miss the high true crime count for September, of course, but our predictions for January, February, and June are also poor.

For Richfield, we noticed that both larceny and burglary were also high in September 1979. Our model, however, does not model the dependence among crime counts for different crimes; instead, we only model dependence among the model parameters for different crimes. These parameters capture the overall crime pattern, but do not capture extreme or unusual crime counts. We do not see this as a weakness in our model, however, because it is too optimistic to think that we can predict every isolated spike. Our model, remember, emphasizes the basic components of overall crime trends and seasonality.

If crimes are generally higher in September, then we should predict higher crime counts for values that are missing in September. Figure 6.5 shows that we are predicting slightly higher crime counts in the Fall than in the Winter. Still, suspecting that our model was not capturing the seasonality appropriately, we reexamined the seasonality in the raw data by visual inspection. Figure 6.6 shows crime counts labeled by Fall and Winter in the first plot and by January and September in the second plot. In these plots, we see that the seasonality is not consistently strong over time. Given the weak seasonality we have observed, it is not surprising that the extremely high spike in vehicle theft in September 1979 cannot be adequately predicted. However, we do believe that there is some room for improvement. Our model in Section 5.3 does not learn about seasonality across crimes within an agency, but we discuss this idea in Chapter 7. For example, if Richfield's seasonality were seen more strongly in the larceny counts, then it is conceivable that we could make a better prediction for the vehicle theft count in September 1979.

[Figure 6.5: three panels of Richfield, MN vehicle theft counts. First panel: the full series, 1960–2000. Second panel: 1978–1981, with imputed values for 1979. Third panel: posterior distributions of the missing values, by month.]

Figure 6.5: Imputation in Richfield, MN. Posterior means are represented by circles. 95% posterior intervals are represented by plus signs. The actual data removed are represented by the dotted line. Boxplots show the posterior distributions of the missing values based on 2,500 samples.

[Figure 6.6: two scatter plots of Richfield, MN vehicle theft counts over 1960–2000, with points labeled by season in the first plot and by month in the second.]

Figure 6.6: In the first plot, Fall crime counts are labeled by "F" and Winter crime counts are labeled by "W". In the second plot, September crime counts are labeled by "S" and January crime counts are labeled by "J".

Even so, we must resign ourselves to the unfortunate conclusion that some missing crime counts will be impossible for our model to predict accurately. We will occasionally have high prediction errors just due to random variation, and some unobservable factors that cause crime will have an effect too great to go unnoticed, yet too obscure to be predicted. We are not willing to inflate the prediction variances for the sake of the worst cases; doing so would forfeit information from a good model. Instead, we argue that our methods do well, and that the variances are appropriate for the vast majority of missing gaps in the UCR.

CHAPTER 7

CONCLUSION AND FURTHER WORK

We dedicate our final chapter to looking back and to looking ahead. We will reflect upon the successes and limitations of our research, make recommendations to those in related areas of research, and propose new goals for the future.

The main accomplishment of our research is the application of statistical techniques to a practical problem. Statistics is a practical science, and is therefore advanced through application. Driven by the problem of missing data in the UCR, our research represents an innovative investigation into imputation techniques based on hierarchical models. In the end, we have made a notable improvement over the currently used imputation procedure, and we have opened the door wider for further research.

The FBI’s imputation procedure is based on simplistic, and often unrealistic, assumptions about the UCR data, and ignores several other factors which we find important. The FBI’s method implicitly assumes that crime counts for one year are independent of crime counts from previous or subsequent years. It also assumes that agencies within a population Group share the same average crime rate per capita.

Another assumption in the FBI's method is that there is no difference in monthly crime counts within a year. We remember, however, that the FBI was only concerned with making annual crime count estimates, so monthly seasonality was not important for their method. Unlike the original developers of the FBI's UCR imputation method in 1958, we are interested in monthly crime count estimates, and we have developed a strategy to impute for missing crime counts so that inferences on agency-level monthly data can be made.

We now summarize our research conclusions and present some thoughts on future research. The development of our imputation procedure has involved two important steps. In the first step, we construct models for our data. In the second step, we use our models to make imputations. We give final remarks on these two steps in Sections 7.1 and 7.2, respectively. In Section 7.3 we discuss future goals for this research.

7.1 Final Remarks on Modeling

From a modeling perspective, there are several important characteristics of the UCR data. First, the data set has a unique hierarchy that is able to organize over 650 million data points, representing a census of crime counts covering the entire country. Second, the data are in the form of crime counts ranging from zero to thousands. Third, the data are collected longitudinally at over five hundred time points, and are therefore dependent over time. Fourth, the data are multivariate, so several variables are measured by each agency. Fifth, the data set is mottled with missing gaps of variable lengths. We have constructed our model to incorporate each of these five characteristics. We have made a hierarchical model to structure the groupings of data. We have assumed the Poisson distribution for count data. We have successfully modeled the time series components of trend and seasonality in the UCR. We have also included a latent process to model the residual autocorrelation. In our multivariate response (for crime counts), since crime types within an agency are correlated, we have allowed agencies' crime series to learn from each other about the intercept parameters. Finally, we have clarified assumptions concerning the missing data.

One reason our research improves on previous studies of UCR missing data is that we have invested greater effort in justifying the assumptions about the missing data mechanism. The imputations will not be accurate unless we assume that the data are missing at random or find another model for the missing data mechanism. Having the data be missing at random is important because it allows us to make an ignorability assumption about our response indicator. Unfortunately, we cannot show that the UCR data are missing at random, but we can discuss the evidence that supports our MAR claim (see Section 6.1). Still, our claim, whether MAR or NMAR, might not have a large influence on our inferences about the missing values.

One way to test the sensitivity of the MAR assumption is to conduct an experiment in which missingness is simulated both at random and not at random, and then to compare the prediction error for the imputed values. On the other hand, if we want to assume NMAR, we have some intuition about the distribution of missingness in the UCR. If we see a missing gap in a crime series with very large crime counts, then we expect the data to be missing at random. If we see a missing gap in a crime series with very sparse crime counts, we are more likely to doubt the MAR assumption. An improvement on our model would be to construct a model for the response indicator. For example, such a model could give a higher probability of being missing for a zero crime count than for a non-zero crime count. We might also consider a model for the response indicator that depends on other factors, if we believe that those factors affect whether or not a crime count is observed. More research is needed, therefore, in strengthening the assumptions about the missing data mechanism in the UCR, but we have made progress toward this objective.
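As a rough illustration of the proposed sensitivity experiment, the following R sketch simulates one seasonal Poisson series, deletes values under a completely-random mechanism and under a count-dependent mechanism in which zero counts are more likely to be missing, and compares the MSPE of a simple seasonal-mean imputation under each mechanism. The seasonal-mean imputer and all numerical settings are illustrative stand-ins, not the methods or rates used in this dissertation.

```r
set.seed(1)
n     <- 12 * 20                               # 20 years of monthly counts
month <- rep(1:12, times = 20)
mu    <- exp(2 + 0.5 * sin(2 * pi * month / 12))
y     <- rpois(n, mu)                          # simulated "true" crime counts

miss_mar  <- rbinom(n, 1, 0.10) == 1                        # missing completely at random
miss_nmar <- rbinom(n, 1, ifelse(y == 0, 0.40, 0.05)) == 1  # zeros more likely missing

# Impute a missing month by the mean of the observed counts for that calendar month.
impute_seasonal_mean <- function(y, miss, month) {
  monthly_means <- tapply(y[!miss], month[!miss], mean)
  imp <- y
  imp[miss] <- monthly_means[as.character(month[miss])]
  imp
}

mspe_missing <- function(actual, imputed, miss) mean((actual[miss] - imputed[miss])^2)

imp_mar  <- impute_seasonal_mean(y, miss_mar,  month)
imp_nmar <- impute_seasonal_mean(y, miss_nmar, month)
c(MAR  = mspe_missing(y, imp_mar,  miss_mar),
  NMAR = mspe_missing(y, imp_nmar, miss_nmar))
```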

A cold hard truth in modeling is that there is no perfect model. Some failure is inevitable in modeling, and we review some notable weaknesses of our model here. The primary weakness is that our model has only been tested on a chosen subset of UCR data. We do not yet have solid support to recommend the implementation of our model to all counties across the country. We anticipate that our model will need to be more flexible in order to accommodate a wider variety of UCR data. What might be a very good model for a crime series in one agency may be a very poor model for a different agency.

One possible adjustment to our model is to include indicators for whether or not a model term is important. For example, the fourth Hermite polynomial might be important for some data series, but not for others. In our model, the coefficient for the fourth Hermite polynomial is represented by $\beta_{j,k}^{(4)}$ for agency $j$ and crime $k$. For agency $j = 1, 2, \ldots, J$ and crime $k = 1, 2, 3$, let

\[
\delta_{j,k}^{(4)} \sim \mathrm{Bernoulli}\bigl(\lambda_{j,k}^{(4)}\bigr), \quad \text{where } \lambda_{j,k}^{(4)} \sim \mathrm{Uniform}(0, 1).
\]

Now if we attach $\delta_{j,k}^{(4)}$ to $\beta_{j,k}^{(4)}$, then the fourth Hermite polynomial will only be included in the model with probability $\lambda_{j,k}^{(4)}$. It is also possible to learn about model complexity across crimes and agencies by fitting a logistic regression model on $\lambda_{j,k}^{(4)}$, for example.
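As a minimal prior-predictive sketch of this inclusion-indicator idea for a single agency and crime, the R code below draws λ from its Uniform(0, 1) prior, draws the indicator δ given λ, and multiplies δ into the coefficient of the fourth polynomial term. The quartic expression used for the fourth Hermite polynomial, the standardized time grid, and the normal prior on the coefficient are all illustrative assumptions, not the dissertation's specification.

```r
set.seed(1)
n_sim   <- 5000
lambda4 <- runif(n_sim)               # lambda ~ Uniform(0, 1)
delta4  <- rbinom(n_sim, 1, lambda4)  # delta | lambda ~ Bernoulli(lambda)

t_std <- seq(-2, 2, length.out = 100) # a standardized time grid (assumed)
x4    <- t_std^4 - 6 * t_std^2 + 3    # one common form of the fourth Hermite polynomial
beta4 <- rnorm(n_sim, 0, 0.1)         # an illustrative prior on the coefficient

# The quartic term contributes to the linear predictor only when delta4 == 1,
# i.e., with probability lambda4 for each simulated series.
quartic_contribution <- outer(delta4 * beta4, x4)
mean(delta4)                          # marginal inclusion probability, close to 0.5
```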

The idea is that, with a more flexible model, we will be able to model different types of series. There are some crime series that have very low crime counts. We have not tried our Bayesian model on these crime series, but we have seen in Chapter 4 that series with sparse crime counts do not always exhibit trend and seasonality.

Even if we decide that different series should have different models, we must still answer the question of how models for each agency should be related to one another. Perhaps the correlations among crimes differ among agencies. Perhaps there are spatial dependencies among agencies within a county or among counties within a state. These are questions to which we would still like to find answers.

Allowing for more flexibility in our model for different series, we may also need to make other adjustments. In 1960, the first year of the data we use, some agencies had not yet been created; other agencies ceased to exist at some point before 2002. We have seen in Chapter 4 that there are also other types of special cases in the UCR. Some agencies' crimes are reported by, or 'covered by', another agency for a period of time. Some agencies report aggregated crime counts rather than monthly crime counts. All of these are considerations that will require adjustments to our model if it is to be a truly comprehensive model of the crime counts in the UCR. Fortunately, Bayesian models are very flexible and are capable of handling diverse types of data.

In Chapter 6, we compared imputations from three models, using property crime data from four agencies in Hennepin County, Minnesota. Based on our discrepancy statistics using prediction error, we found that the independent Poisson regression model did slightly better than the Bayesian hierarchical model and the SARIMA model. We expected the Bayesian hierarchical model to perform better than the Poisson GLM because it specified hierarchical priors for pooling strength across agencies, and it included autocorrelated error terms to account for dependence among observations. This unexpected result may have occurred because of shrinkage in the parameter estimates, which is inherent in hierarchical models, or because our time series component is minimizing our seasonal effects. Another possibility is that we simply need to try our models on more series to see the actual differences among the models.

7.2 Final Remarks on Imputations

In our research, all of our imputation methods are based on models for the data. Our imputations will only be as good as our models, so the majority of our research focus has been on building good models of UCR crime counts. As we have seen in Chapter 6, our imputations do not accurately predict extreme or unusual crime counts, but this is a shortcoming that may be difficult to fix.

For classical time series models, an imputation method based on forecasting and backcasting is an intuitive way to estimate a string of missing values within a data series. When we have different patterns of missingness, implementation of the forecast-backcast method becomes troublesome, as we have seen, because we must perform forecasting and backcasting for each missing gap, sometimes based on different models, depending on which data are used. The algorithm used to perform forecast-backcast imputation can become very inefficient if the number of missing gaps is large. We must also be careful if we are using imputed values to estimate a model used to impute for other gaps. The advantage of the forecast-backcast method is that, for complicated SARIMA models, forecasts and associated standard errors are obtained automatically using the arima function in R. The disadvantage of the forecast-backcast method is that determining the weights of the forecast and backcast used to create the composite estimate is not straightforward. We used an ad hoc strategy for determining the weights: predictions with lower standard errors were given more weight in the composite estimate, but in general this does not produce the BLUP (best linear unbiased predictor).
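A minimal R sketch of this forecast-backcast strategy for a single interior gap is given below. It assumes a monthly ts object y, a vector gap of the indices of the missing months, and the SARIMA(1,1,1)×(1,0,0)₁₂ form used in our comparison; the backcast is approximated by reversing the later segment of the series before fitting, and the inverse-variance weighting is one simple version of the ad hoc weighting described above, not necessarily the exact scheme used in the dissertation.

```r
impute_gap <- function(y, gap) {
  # Segments before and after the interior gap (gaps at the ends of the series not handled).
  before <- window(y, end   = time(y)[min(gap) - 1])
  after  <- window(y, start = time(y)[max(gap) + 1])

  ord  <- c(1, 1, 1)
  sord <- list(order = c(1, 0, 0), period = 12)

  # Forecast forward across the gap from the earlier segment.
  fwd <- predict(arima(before, order = ord, seasonal = sord), n.ahead = length(gap))

  # Approximate the backcast by reversing the later segment and forecasting.
  bwd <- predict(arima(rev(as.numeric(after)), order = ord, seasonal = sord),
                 n.ahead = length(gap))
  back_pred <- rev(as.numeric(bwd$pred))
  back_se   <- rev(as.numeric(bwd$se))

  # Inverse-variance weights: lower standard error gets more weight in the composite.
  w <- (1 / fwd$se^2) / (1 / fwd$se^2 + 1 / back_se^2)
  as.numeric(w * fwd$pred + (1 - w) * back_pred)
}
```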

For observation-driven Poisson regression models, the forecast-backcast approach can be used for nonconsecutive missing crime counts, but we have seen that making imputations from these models under a Bayesian approach is much simpler and more reliable. Bayesian imputation uses the entire observed series, and possibly other series, simultaneously, whereas the forecast-backcast approach uses a model for each section of the data series.

In Chapter 6, we considered making imputations based on predicted values from independent Poisson regression models. These models, which capture dependence only through a polynomial for time and dummy variables for season, were shown to do very well in imputation for the series in Hennepin County. This result may suggest that, for some series, autoregressive error terms are not necessary if season and a polynomial for time are already in the model.
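A minimal R sketch of this style of imputation follows, assuming a data frame d with columns count (NA for the missing months), t (a standardized time index), and season (a four-level factor); these names are illustrative. Orthogonal polynomials from poly() stand in here for the Hermite polynomials used in our models; both choices give a degree-four polynomial trend.

```r
# Fit the independent Poisson regression on the observed months
# (glm drops the rows with missing counts by default).
fit <- glm(count ~ poly(t, 4) + season, family = poisson, data = d)

# Impute each missing month with its predicted mean value.
miss <- is.na(d$count)
d$count[miss] <- predict(fit, newdata = d[miss, ], type = "response")
```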

Using a Bayesian model, we can simulate a posterior distribution for each missing value, and we can use this distribution to make imputations in several different ways. We recommend using these posterior distributions to make multiple draws, that is, multiple imputations, for the missing crime counts. One advantage of providing multiple imputations to researchers is that they can make inference according to their own objectives. Another advantage of multiple imputation is that the imputations are integers and resemble actual crime counts. Of course, this means that imputations should be used with special caution to avoid treating them as actual observed values. We suggest that if multiple imputations are provided for public use, then some instruction should be given on the proper way to use the multiple completed data sets in analysis.
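In R, producing m completed data sets from stored posterior draws might look like the sketch below, where post is assumed to be a matrix of posterior draws for the missing counts (one column per missing cell) and ucr is the data frame with NA at those cells; both names are hypothetical.

```r
m     <- 5
draws <- post[sample(nrow(post), m), , drop = FALSE]   # m randomly chosen posterior draws

# Build m completed data sets, each with one draw filled in for the missing counts.
completed <- lapply(seq_len(m), function(i) {
  out <- ucr
  out$count[is.na(out$count)] <- draws[i, ]
  out
})
```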

7.3 Further Work

In this section, we propose a research agenda for the future. The first goal we describe is related to the modeling step in our imputation procedure. The second goal is related to the testing of our imputation procedure.

Further research will involve a continuation of model construction. The model in Section 5.3 is incomplete because only the intercept parameters are learning across crimes. Hierarchically, the regression parameters are pooling strength across agencies, but there is no correlation structure for regression parameters among crimes within an agency. With this in mind, we can think about changing our model in Section 5.3 in the following way.

Using simplified notation, let $\vec{x}^{(1)}$ be a vector of 1's of length $N$. The four Hermite polynomials for time are represented by the covariates $\{\vec{x}^{(2)}, \vec{x}^{(3)}, \vec{x}^{(4)}, \vec{x}^{(5)}\}$, and the dummy variables for season are represented by $\{\vec{x}^{(6)}, \vec{x}^{(7)}, \vec{x}^{(8)}\}$. The latent process for agency $j$ and crime $k$ is represented by $\{\epsilon_{j,k,t} : t = 1, \ldots, N\}$ and is specified as in Section 5.3. For agency $j = 1, 2, \ldots, J$, crime $k = 1, 2, 3$, and $t = 1, 2, \ldots, N$, let

\[
Y_{j,k,t} \mid \{\beta_{j,k}^{(p)} : p = 1, 2, \ldots, 8\},\ \{\epsilon_{j,k,t'} : t' \le t\} \ \overset{\text{cond. indep.}}{\sim}\ \mathrm{Poisson}(\mu_{j,k,t}),
\]

where

\[
\log(\mu_{j,k,t}) = \sum_{p=1}^{8} \beta_{j,k}^{(p)} x_t^{(p)} + \epsilon_{j,k,t}.
\]

For agency $j$, the regression parameters for the three crimes are assumed to follow a multivariate normal prior distribution. Let $\vec{\beta}_j^{(p)} = (\beta_{j,1}^{(p)}, \beta_{j,2}^{(p)}, \beta_{j,3}^{(p)})$. Conditionally on $\vec{\mu}_\beta^{(p)}$ and $\Sigma_\beta^{(p)}$, each $\vec{\beta}_j^{(p)}$ is drawn from the same density:

\[
\vec{\beta}_j^{(p)} \sim \mathrm{MVN}_3\bigl(\vec{\mu}_\beta^{(p)}, \Sigma_\beta^{(p)}\bigr) \quad \text{for } j = 1, 2, \ldots, J \text{ and } p = 1, 2, \ldots, 8.
\]

The hyperprior distribution of $\vec{\mu}_\beta^{(p)}$ is also multivariate normal, and $\Sigma_\beta^{(p)}$ is assumed to be inverse Wishart, which will model the covariance of model coefficients among the three crimes. For example, we might take

\[
\vec{\mu}_\beta^{(p)} \sim \mathrm{MVN}_3\!\left( \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix},\ \begin{pmatrix} 1 & .5 & .5 \\ .5 & 1 & .5 \\ .5 & .5 & 1 \end{pmatrix} \right), \quad \text{for } p = 1, 2, \ldots, 8,
\]

and

\[
\Sigma_\beta^{(p)} \sim \mathrm{InvWish}\!\left( \begin{pmatrix} 1 & .5 & .5 \\ .5 & 1 & .5 \\ .5 & .5 & 1 \end{pmatrix},\ 3 \right), \quad \text{for } p = 1, 2, \ldots, 8.
\]

learn from each other about crime trend and seasonality. We expect that this model

will be a better representation of the correlation that exists among crimes in an

agency.
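To make the hyperprior layer concrete, here is a small R sketch that draws one set of crime-specific coefficients for a single agency and polynomial term under the example hyperpriors above. It uses only base R; the inverse-Wishart draw is obtained by inverting a Wishart draw, and the helper rmvnorm1 is an illustrative function, assuming the (scale matrix, degrees of freedom) parameterization written above.

```r
set.seed(1)
# Common exchangeable correlation matrix used for both V0 (hyper-mean covariance)
# and S0 (inverse-Wishart scale) in the example above.
V0 <- S0 <- matrix(c(1, .5, .5,
                     .5, 1, .5,
                     .5, .5, 1), nrow = 3)

# One draw from MVN(mu, V) via the Cholesky factor of V.
rmvnorm1 <- function(mu, V) as.numeric(mu + t(chol(V)) %*% rnorm(length(mu)))

mu_beta    <- rmvnorm1(rep(0, 3), V0)                   # hyper-mean across the three crimes
W          <- rWishart(1, df = 3, Sigma = solve(S0))[, , 1]
Sigma_beta <- solve(W)                                  # inverse-Wishart(S0, 3) draw
beta_j     <- rmvnorm1(mu_beta, Sigma_beta)             # coefficients for crimes k = 1, 2, 3
```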

One possible reason the hierarchical model did worse than the other methods in Chapter 6 is that we have not allowed regression parameters other than the intercept to learn across crimes within an agency. Another reason might be that learning across crimes is only effective when the series needing imputation is largely missing. The FBI's method, for example, only looks to other series when the first series is mostly missing. This idea brings us to our second area for further research.

A second area of future work is the continued testing of our imputation procedure. We would like to see how our imputation method performs for all UCR series. We know, for example, that our methods have not been investigated to a great extent on series with low crime counts. We expect our method to produce reasonable predictions for the missing values, regardless of the size of the agencies' populations, but we would like to show evidence supporting the use of our method UCR-wide. We know, however, that there are greater limitations to imputation in series with small crime counts. As we have suggested earlier, we do not expect that monthly imputations will be meaningful on their own for these types of series, but only when combined to create aggregate estimates.

We also have questions concerning other factors that may affect the accuracy of our imputations. Do our imputations do better for short gaps? Do our imputations do better if we have more observed data for other agencies within the county? Will our imputations be reasonable for series that are mostly missing? To answer these questions we can design an experiment. The MSPE statistic is a possible choice for the response variable. One factor might be the gap size, with levels set at 1, 12, and 24 months, for example. Another factor might be the amount of missingness within the series. Other factors might be the amount of missingness in other crime series within the same agency, or the amount of missingness in other agencies. These experimental factors can be set at different levels, and for each treatment combination we would randomly assign missing data according to the level of missingness set for each factor. We can analyze the experimental data using analysis of variance (ANOVA), and we could publish the results, potentially with a caution to UCR users. This future research goal represents a thorough study of the quality of our imputation methods, undertaken so that the community of UCR researchers is responsibly informed.
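The skeleton of such an experiment in R might look like the sketch below. The function run_imputation, the number of replicates, and the levels chosen for the amount of missingness are hypothetical placeholders; only the gap-size levels of 1, 12, and 24 months come from the design described above.

```r
# Full-factorial design over gap size and amount of missingness, with replicates.
design <- expand.grid(gap_size    = c(1, 12, 24),    # months removed per gap
                      pct_missing = c(0.05, 0.20),   # illustrative missingness levels
                      rep         = 1:10)

# run_imputation(gap_size, pct_missing) is assumed to simulate missingness at the
# given settings, impute with the chosen method, and return the resulting MSPE.
design$mspe <- mapply(run_imputation, design$gap_size, design$pct_missing)

# Two-way ANOVA on the MSPE response.
summary(aov(mspe ~ factor(gap_size) * factor(pct_missing), data = design))
```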

The UCR is a data set that is rich with diversity and structure. The many facets of the UCR create endless opportunities for fascinating data exploration and research. One thing we have gained throughout the course of our research is a great appreciation for the UCR data set; its immensity has both humbled and captivated us. Inevitably, our vision for the future involves extracting even more information from the observed UCR data to better predict the missing data. We have begun the process of constructing a comprehensive model for crime counts in the UCR, but we believe that there are still dependencies among UCR data that have not yet been incorporated into the model. Of course, there is a point at which additional model improvement is insubstantial from a practical point of view. Conceivably, UCR imputation will evolve along with the ever-changing UCR. The UCR grows larger every year, and a prospective solution is a model that can anticipate the change and update itself. Whether or not UCR imputations are subject to regular modification, we look forward to learning about the implications this poses for studies involving UCR statistics, and we welcome the challenges it brings to future statistical research.

BIBLIOGRAPHY

B. Abraham. Missing observations in time series. Communications in Statistics, 10:1643–1653, 1981.

A. Agresti. An Introduction to Categorical Data Analysis. Wiley, 1st edition, 1996.

G. Box and G. Jenkins. Time Series Analysis: Forecasting and Control. Holden-Day, 1976.

P.J. Brockwell and R.A. Davis. Introduction to Time Series and Forecasting. 2nd edition, 2002.

M-H. Chen and J.G. Ibrahim. Bayesian predictive inference for time series count data. Biometrics, 56(3):678–685, 2000.

P. Congdon. Bayesian Statistical Modelling. Wiley, 1st edition, 2001.

D.R. Cox. Statistical analysis of time series: Some recent developments. Scandinavian Journal of Statistics, 8:93–115, 1981.

J.D. Cryer. Time Series Analysis. PWS-Kent, 1986.

R.A. Davis, W.T.M. Dunsmuir, and S.B. Streett. Observation-driven models for Poisson counts. Biometrika, 90:777–790, 2003.

R.A. Davis, W.T.M. Dunsmuir, and Y. Wang. Modelling time series of count data. Pages 63–114, 1999.

F.X. Diebold. Serial correlation and the combination of forecasts. Journal of Business and Economic Statistics, 6(1):105–111, 1988.

P.J. Diggle, K-Y. Liang, and S.L. Zeger. Analysis of Longitudinal Data. Oxford University Press, 1994.

F. Dominici, J.M. Samet, and S.L. Zeger. Combining evidence on air pollution and daily mortality from the 20 largest US cities: A hierarchical modelling strategy. Journal of the Royal Statistical Society, Series A: Statistics in Society, 163:263–284, 2000.

F. Dominici, A. Zanobetti, S.L. Zeger, J. Schwartz, and J.M. Samet. Hierarchical bivariate time series models: A combined analysis of the effects of particulate matter on morbidity and mortality. Biostatistics, 5:341–360, 2004.

J. Durbin and S.J. Koopman. Time series analysis of non-Gaussian observations based on state space models from both classical and Bayesian perspectives. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 62:3–56, 2000.

G.D. Faulkenberry. A method of obtaining prediction intervals. Journal of the American Statistical Association, 68(342):433–435, 1973.

FBI. Crime in the United States. U.S. Department of Justice, Washington, D.C., 2000.

A. Gelman. Prior distributions for variance parameters in hierarchical models. Bayesian Analysis, 1(3):515–533, 2006.

A. Gelman, J.B. Carlin, H.S. Stern, and D.B. Rubin. Bayesian Data Analysis. Chapman & Hall, 2nd edition, 2004.

J.K. Ghosh, M. Delampady, and T. Samanta. An Introduction to Bayesian Analysis. Springer, 2006.

M. Ghosh, N. Nangia, and D.H. Kim. Estimation of median income of four-person families: A Bayesian time series approach. Journal of the American Statistical Association, 91(436):1423–1431, 1996.

C.R. Goodall, K. Kafadar, and J.W. Tukey. Computing and using rural versus urban measures in statistical applications. American Statistician, 52:101–111, 1998.

A.C. Harvey. Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, 1989.

T. Hastie and R. Tibshirani. Generalized additive models. Statistical Science, 1(3):297–310, 1986.

J.L. Hay and A.N. Pettitt. Bayesian analysis of a time series of counts with covariates: An application to the control of an infectious disease. Biostatistics, 2(4):433–444, 2001.

JRSA. Status of NIBRS in the states, 2006. URL http://www.jrsa.org/ibrrc/background-status/nibrs_states.html. Accessed December 4, 2007.

J.F. Lawless. Negative binomial and mixed Poisson regression. The Canadian Journal of Statistics, 15(3):209–225, 1987.

W.K. Li. Time series models based on generalized linear models: Some further results. Biometrics, 50(2):506–511, 1994.

R.J.A. Little and D.B. Rubin. Statistical Analysis with Missing Data. Wiley-Interscience, 2nd edition, 2002.

A.M. Lyons, M.W. Packer, Jr., M.B. Thomason, J.C. Wesley, P.J. Hansen, J.H. Conklin, and D.E. Brown. Uniform Crime Report “SuperClean” data cleaning tool. Systems and Information Engineering Design Symposium, 2006.

M.D. Maltz. Bridging gaps in police crime data. Technical Report NCJ-1176365, Bureau of Justice Statistics, Office of Justice Programs, U.S. Department of Justice, Washington, D.C., 1999.

M.D. Maltz. Analysis of missingness in UCR crime data. Technical Report 215343, U.S. Department of Justice, 2006.

M.D. Maltz and J. Targonski. A note on the use of county-level crime data. Journal of Quantitative Criminology, 18:297–318, 2002.

M.D. Maltz and H.E. Weiss. Creating a UCR utility. Technical report, Final Report to the National Institute of Justice, 2006.

P. McCullagh and J.A. Nelder. Generalized Linear Models. Chapman & Hall, 2nd edition, 1989.

J. Nolan, S.M. Haas, T.K. Lester, J. Kirby, and C. Jira. Establishing the ‘statistical accuracy’ of the Uniform Crime Reports (UCR) in West Virginia. Technical report, Criminal Justice Statistical Analysis Center, 2006.

A. Pole, M. West, and J. Harrison. Applied Bayesian Forecasting and Time Series Analysis. Chapman & Hall, 1994.

R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2008. URL http://www.R-project.org. ISBN 3-900051-07-0.

M.R. Rand and C.M. Rennison. True crime stories? Accounting for differences in our national crime indicators. Chance Magazine, 15(1), 2002.

D.B. Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976.

D.B. Rubin. Multiple Imputation for Nonresponse in Surveys. Wiley, 1987.

D.G. Saphire. A longitudinal random effects model with covariates. American Statistical Association - Section, 1991.

J.L. Schafer. Multiple imputation: A primer. Statistical Methods in Medical Research, 8(3):3–15, 1999.

J. Shao and B. Zhong. Last observation carry-forward and last observation analysis. Statistics in Medicine, 22(15):2429–2441, 2003.

D. Spiegelhalter, A. Thomas, N. Best, and W. Gilks. BUGS: Bayesian Inference Using Gibbs Sampling, Version 0.30, 1994. URL citeseer.ist.psu.edu/spiegelhalter95bugs.html.

M.A. Tanner and W.H. Wong. The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82(398):528–540, 1987.

P.A. Thompson and R.B. Miller. Sampling the future: A Bayesian approach to forecasting from univariate time series models. Journal of Business and Economic Statistics, 4(4):427–436, 1986.

N. Ting and A. Brailey. Why last observation carried forward? Or why not? CT Chapter Mini Conference, ASA, 2005.

R.W.M. Wedderburn. Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika, 61(3):439–447, 1974.

E.W. Weisstein. Hermite polynomial, 2008. URL http://mathworld.wolfram.com/HermitePolynomial.html. Accessed April 10, 2008.

S.L. Zeger. A regression model for time series of counts. Biometrika, 75:621–629, 1988.

S.L. Zeger and B. Qaqish. Markov regression models for time series: A quasi-likelihood approach. Biometrics, 44:1019–1031, 1988.