IMPUTING MISSING VALUES IN TIME SERIES OF COUNT DATA USING HIERARCHICAL MODELS

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the

Graduate School of The Ohio State University

By

Clint D. Roberts, M.S.

*****

The Ohio State University

2008

Dissertation Committee:

Elizabeth Stasny, Adviser
Peter Craigmile, Adviser
Catherine Calder
Michael Maltz

Approved by

Adviser, Graduate Program in Statistics

© Copyright by

Clint D. Roberts

2008

ABSTRACT

The Uniform Crime Reports, collected by the FBI, contain monthly crime counts for each of the seven Index crimes, but for one reason or another, a police agency may miss reporting for a particular month. The data are therefore incomplete, hence the need for the development of an imputation procedure to fill in the gaps. Since the early 1960’s, an imputation technique implemented by the FBI has been used to make annual crime count estimates. This approach ignores concerns of seasonality and does not make use of the agencies’ long-term data trends. Computing power has radically improved since the 1960’s, and it is now feasible to develop a more precise imputation method that can incorporate more information into our estimation procedure. A model-based approach also has the added value of making uncertainty estimates available for the imputed data. We describe a method which uses one of three different models depending on an agency’s average monthly crime count for a particular crime. For small crime counts, we impute the mean and we assume the data are Poisson for the variance estimates. For large crime counts, we consider a time series SARIMA (Seasonal Auto-Regressive Integrated Moving Average) model. For intermediate crime counts, we use a Poisson Generalized Linear Model (GLM). Hierarchical Bayesian models are used to obtain improved imputations for missing data that borrow strength from many UCR series. Information about population growth contained in the crime counts is thought to result in improved imputations when using time series from other agencies.

Dedicated to Rachel.

Through her love she lifted me up and made me stronger than I am.

ACKNOWLEDGMENTS

First, I would like to give a special thank you to my co-advisers, Professor Elizabeth Stasny and Professor Peter Craigmile. Both of my co-advisers have given me tremendous support and helpful feedback. I thank Professor Elizabeth Stasny for introducing me to research in crime data and for her consistent encouragement. I thank Professor Peter Craigmile for challenging me to exercise my mind, and for teaching me how to keep research alive and fresh, while strengthening my knowledge of the foundations.

I am also grateful to Professor Catherine Calder for serving on my exam committee. She gave me my first real exposure to applied Bayesian analysis, and she has provided me with several helpful comments.

I thank Professor Michael Maltz. It was an honor and privilege for me to work with Dr. Maltz on a funded research project in 2005–2006. He has done much work to clean the UCR data, and to analyze the missingness in the UCR. For my research, he has provided instrumental expertise in the UCR, and great insights into imputation strategies.

The joint work of Professor Stasny, Professor Maltz, and myself, in 2005–2006, was supported in part by the American Statistical Association Committee on Law and Justice Statistics and the Bureau of Justice Statistics.

VITA

September 26, 1981 ...... Born - Bakersfield, CA

2003 ...... B.S. Statistics, California Polytechnic State University, San Luis Obispo

2005 ...... M.S. Statistics, The Ohio State University, Columbus

FIELDS OF STUDY

Major Field: Statistics

TABLE OF CONTENTS

Page

Abstract ...... ii
Dedication ...... iii
Acknowledgments ...... iv
Vita ...... v
List of Tables ...... ix
List of Figures ...... x

Chapters:

1. INTRODUCTION AND PROJECT BACKGROUND ...... 1
   1.1 Uniform Crime Reports ...... 2
       1.1.1 History ...... 2
       1.1.2 Issues with the UCR ...... 4
       1.1.3 Accessing the Data ...... 8
   1.2 Imputation Methods ...... 10
       1.2.1 Simple Imputation Methods ...... 12
       1.2.2 FBI Imputation Methods ...... 13

2. PRELIMINARY ANALYSIS ...... 17
   2.1 Time Series Analysis ...... 18
       2.1.1 Classical Decomposition Model ...... 22
       2.1.2 Time Series Models Based on Differencing ...... 26
   2.2 Explanatory Variables ...... 31
   2.3 Missingness in the UCR ...... 35
       2.3.1 Explaining Missingness ...... 36
       2.3.2 Patterns of Missingness ...... 38
       2.3.3 Missing Data Assumptions ...... 41

3. LITERATURE REVIEW ...... 43
   3.1 Poisson Regression ...... 44
       3.1.1 Modeling Rates ...... 49
       3.1.2 Over-dispersion ...... 49
   3.2 Time Series for Count Data ...... 51
       3.2.1 Observation-driven Models ...... 54
       3.2.2 Parameter-driven Models ...... 57
       3.2.3 Generalized Additive Models ...... 59
   3.3 Missing Data in Time Series ...... 61
       3.3.1 Modeling ...... 61
       3.3.2 Imputation ...... 62

4. A METHOD FOR IMPUTING ...... 64
   4.1 Imputation Procedure ...... 67
       4.1.1 The Forecast-Backcast Method ...... 67
       4.1.2 Evaluating the Combined Estimator ...... 68
       4.1.3 Imputation Algorithm ...... 74
   4.2 The Three Models ...... 77
       4.2.1 Large Crime Counts ...... 77
       4.2.2 Intermediate Crime Counts ...... 82
       4.2.3 Small Crime Counts ...... 87
   4.3 Special Cases ...... 90
       4.3.1 Covered-By Cases ...... 91
       4.3.2 Aggregation ...... 92

5. BAYESIAN MODELING ...... 98
   5.1 Bayesian Inference ...... 100
   5.2 Model Construction ...... 102
       5.2.1 Specification of Hierarchical Structure ...... 104
       5.2.2 Covariate Selection ...... 107
       5.2.3 Modeling Autocorrelation ...... 115
       5.2.4 Specification of Prior Distributions of the Parameters ...... 119
   5.3 A Proposed Model ...... 127
   5.4 Checking the Model ...... 131

6. BAYESIAN IMPUTATION ...... 136
   6.1 A Model for Missing Data ...... 137
   6.2 Missing Data Methods ...... 140
       6.2.1 Data Augmentation Algorithm ...... 140
       6.2.2 The Gibbs Sampler ...... 142
       6.2.3 Multiple Imputation ...... 143
   6.3 Evaluating the Method ...... 148
       6.3.1 Worst Case Scenario ...... 152

7. CONCLUSION AND FURTHER WORK ...... 157
   7.1 Final Remarks on Modeling ...... 158
   7.2 Final Remarks on Imputations ...... 162
   7.3 Further Work ...... 164

Bibliography ...... 173

LIST OF TABLES

Table                                                                   Page

1.1  FBI Classification of Population Groups ...... 10
1.2  Part I Offense Classifications ...... 11
1.3  FBI Imputation Procedure ...... 15
4.1  Comparison of three imputation methods in Columbus example ...... 81
4.2  Comparison of three imputation methods in Saraland example using a GLM model ...... 86
4.3  Comparison of three imputation methods in Saraland example using a SARIMA model ...... 87
6.1  Comparison of three imputation methods on 23 years of missing data ...... 152

LIST OF FIGURES

Figure                                                                  Page

2.1  Time series for larceny counts in Columbus, OH ...... 18
2.2  Time series for log larceny counts in Columbus, OH ...... 24
2.3  Classical Decomposition Algorithm showing the steps taken to remove trend and seasonality from the series of Columbus log larceny counts ...... 25
2.4  ACF and PACF for series adjusted by CDA ...... 26
2.5  ACF and PACF for MCD series of larceny counts in Columbus, OH ...... 28
2.6  Residual plots of the SARIMA model of log larceny counts in Columbus, OH ...... 30
2.7  Most correlated agencies in Iowa. Plotted are monthly NDX crimes. Pearson correlations are shown in lower diagonal ...... 34
2.8  Number of cases of different gap sizes from missing data. The horizontal axis is scaled logarithmically to highlight the shorter runs. (Produced by Michael Maltz) ...... 39
2.9  Number of cases of different gap sizes from missing data for different FBI Groups, plotted on logarithmic scales. (Produced by Michael Maltz) ...... 40
4.1  Murder, robbery, and larceny counts from Cleveland Heights, OH ...... 66
4.2  Dashed line represents average forecast-backcast MSPE of one missing value in AR(1) process. Solid line represents the MSPE from using the BLUP of Yk given all the data ...... 71
4.3  Dashed line represents average forecast-backcast MSPE of missing value in gap in AR(1) process. Solid line represents the MSPE from using the BLUP of Yk given all the data ...... 73
4.4  Incomplete series of vehicle theft counts from Allen, OH ...... 75
4.5  Series of log larceny counts in Columbus, OH, with dashed lines delineating missing period ...... 79
4.6  Forecasting/Backcasting of missing values of log larceny counts in Columbus, OH. The line represents original data, and the circles are estimated from the model. Plus signs represent 95% prediction bounds ...... 80
4.7  Forecast/Backcast of missing values of larceny counts in Columbus, OH ...... 81
4.8  Burglary counts in Saraland and Anniston, Alabama, with dashed lines delineating missing period ...... 84
4.9  Forecasting/Backcasting of missing values of burglary counts in Saraland, AL. The line represents the actual data, and circles are estimates from fitted model ...... 85
4.10 Murder counts in Waterloo, Iowa ...... 88
4.11 Larceny counts with aggregation in Mobile, AL. Imputed values are represented by points in bottom plot ...... 94
4.12 Index crime counts for Cleveland, Ohio. Index crime counts are represented by the solid line. In bottom plot, predictions based on modeling Index crime series are represented by dotted line, and predictions based on modeling individual crime series are represented by dashed line ...... 96
5.1  Diagram of Hierarchy in UCR ...... 105
5.2  First plot shows annual population estimate, and second plot shows monthly larceny counts for the same agency during the same time period ...... 109
5.3  LOESS curves of twelve crime series in MN, for each of the three property Index crimes ...... 111
5.4  Fitting mean crime on Hermite polynomials for time in Bemidji, MN ...... 113
5.5  Fitting on burglary counts in Robbinsdale, MN ...... 115
5.6  Sample ACF and sample PACF plots of the burglary counts in Robbinsdale, MN ...... 119
5.7  Sample ACF and sample PACF plots of the residuals of the model for burglary counts in Robbinsdale, MN ...... 120
5.8  Plots show correlations among alpha coefficients for the three different crimes using the thirteen cities in MN. Pearson correlations are shown in lower triangle ...... 123
5.9  Boxplots grouped by crime type show the distributions of the model coefficients for the thirteen agencies in MN. The last plot shows the ungrouped boxplots of the beta coefficients on the same scale ...... 124
5.10 Beta(5,10) ...... 126
5.11 The solid line shows posterior for the distribution of each µt (t = 1, 2, ..., 516) on burglary series for Robbinsdale, MN. (Actual values shown as circles.) ...... 132
5.12 Posterior summaries of mean alpha (µα) coefficients for each crime. X marks the estimate from the independent Poisson GLMs ...... 133
5.13 Posterior summaries of mean beta1 (µβ(1)) coefficients for each agency. X marks the estimate from the independent Poisson GLMs ...... 133
5.14 Sample ACF plot of residuals from Bayesian model, represented by boxplots ...... 134
6.1  The first plot shows the larceny crime series for St. Louis Park, Minnesota, with data from 1988 removed. The second plot shows posterior predictive distributions for the missing data with the actual removed data represented by the dotted line ...... 145
6.2  There are six sets of imputations shown here for St. Louis Park, MN. The imputed values are represented by circles, and the actual data, removed, is represented by the dotted line ...... 147
6.3  Removed data in 2000 is represented by dotted line. The second plot shows posterior distributions of the missing values for 2000 ...... 149
6.4  Predictions from the three methods in Crystal’s larceny series. Actual crime counts for 1976 are represented by the dotted line ...... 151
6.5  Imputation in Richfield, MN. Posterior means are represented by circles. 95% posterior intervals are represented by plus signs. Actual data removed is represented by dotted line. Boxplots show posterior distributions of missing values based on 2500 samples ...... 154
6.6  In the first plot, Fall crime counts are labeled by “F” and Winter crime counts are labeled by “W”. In the second plot, September crime counts are labeled by “S” and January crime counts are labeled by “J” ...... 155

CHAPTER 1

INTRODUCTION AND PROJECT BACKGROUND

The statistical definition of a census is a survey conducted on the full set of units in a population. Completeness in reporting is fundamental to this type of survey, especially when each unit’s responses are of interest. When data are missing, there are a number of ways of dealing with the problem. Our desire is to create a completed data set for public use. This requires imputation, which is the substitution of plausible values for missing values. Of course, the imputed values should not be considered truth, and accurate measures of uncertainty associated with the imputed values will provide a more honest picture of the data. This dissertation is a study of imputation methods in a monthly census where responses are in the form of counts.

Our research, in other words, is concerned with estimating missing values in multiple time series of counts. The motivating example, used throughout this report, deals with the problem of filling in the gaps in the Uniform Crime Reports (UCR).

In this chapter, we will describe the UCR data set in Section 1.1, and in Section 1.2 we will introduce some basic imputation methods along with the current imputation methods used in the UCR. In Chapter 2 we will describe our preliminary analysis of the time series data in the UCR and our exploration of the missingness in the UCR.

In Chapter 3 we will review time series models for count data, and missing data methods for time series. In Chapter 4 we describe our initial imputation procedure, which uses three different models. Later, in Chapters 5 and 6, we will take a Bayesian approach to develop an imputation method for the missing data in the UCR. Finally, we summarize our conclusions and discuss further work in Chapter 7.

1.1 Uniform Crime Reports

The Uniform Crime Reporting system in the United States is one of the largest and oldest sources of social data. It consists, in part, of monthly counts of crimes for over 18,000 police agencies throughout the country. Crime data is transmitted from police agencies to the Federal Bureau of Investigation (FBI) which publishes the compiled crime statistics annually in Crime in the United States (CIUS). The purpose of the UCR is to provide reliable crime statistics for use in our nation’s law enforcement systems [FBI, 2000].

1.1.1 History

In 1927, the International Association of Chiefs of Police (IACP) created the

Committee on Uniform Crime Records to develop a system for keeping the national crime statistics. The committee determined seven primary crimes for which they would compute crime rates: murder and non-negligent manslaughter, forcible rape, robbery, aggravated assault, burglary, larceny, and motor vehicle theft. These crimes were chosen because they are considered serious crimes that are likely to be reported by police. In addition to these seven serious crimes, known as Part I Index crimes (we will capitalize “Index” when we refer to these crimes), there are twenty-one other crimes known as Part II crimes (e.g. embezzlement, gambling, prostitution). In 1929, the committee founded the Uniform Crime Reporting program. The FBI was given the task of collecting and publishing the statistics in 1930. The first report, in January of 1930, included data from 400 cities in 43 states covering approximately 20 million people, only a portion of the total U.S. population [FBI, 2000]. When Congress passed the Anti-Arson Act of 1982, arson was added to become the eighth Part I crime, but it remains a crime for which it is difficult to acquire accurate data.

In 1985, a progressive conversion to the National Incident-Based Reporting System

(NIBRS) began. In the NIBRS, each crime is described in multiple records with more

specificity. With this enhanced UCR program, detailed information is gathered about

the incident, victim, property, offender, and arrestee [FBI, 2000]. One major problem

with the UCR that is resolved by NIBRS is the UCR’s hierarchy rule where only the

most serious crime is counted in one incident [Maltz, 1999]. The hierarchy rule was

instituted to prevent double-counting of crimes, but it actually ignores crime. For

example, a robbery resulting in death is counted only as murder. Imputing the ignored

crimes would go against UCR reporting guidelines, so we will not be concerned with

making imputations for these crimes.

For police agencies that use NIBRS, the FBI extracts summary UCR data from the NIBRS. As of September 2007, approximately 25% of the U.S. population is policed by agencies that report using NIBRS [JRSA, 2006]. Thirty-one states have been certified to report NIBRS, and according to the FBI, only five states (Alaska,

Florida, Georgia, Nevada, Wyoming) have no current plan to convert to NIBRS. Over

90% of all agencies, however, submit UCR data. Thus, despite the advantages of the new system, there will always be a need for the summary data that the UCR provides, as we now discuss.

1.1.2 Issues with the UCR

For a variety of research, planning, and administrative purposes, the UCR data is used by law enforcement officials, legislators, city planners, sociologists, and criminol- ogists. The UCR data is used to study fluctuations in crime levels and relationships between crimes rates and other variables. Upon finding explanations for the behavior of crime patterns, it is not unlikely for policy changes to be made or new programs to be implemented in an attempt to reduce crime in a particular city. Funding for resource allocation, and patrol distribution by local police departments are just two types of decisions that are made with the use of the UCR data [Lyons et al.,

2006]. There are, however, limitations in the UCR that are not always brought into consideration when interpreting the data. These misapplications can lead to flawed conclusions and inappropriate policy decisions.

Many groups with an interest in crime (e.g. news media, tourism agencies) use the UCR statistics to rank cities. The FBI publishes a warning against comparing cities and counties in this way in its annual issue of CIUS: “These rankings lead to simplistic and/or incomplete analyses which often create misleading perceptions.”

The FBI [2000] advises examining all variables (not just population) that affect crime counts before making comparisons among cities.

There are several other major issues with the UCR that cause much debate among criminologists. One, for example, is the issue of misclassification of crime types. Nolan et al. [2006] have studied the problem of misclassification in the UCR on the part of the reporting agencies, and introduce a methodology for assessing the statistical accuracy of UCR crime statistics. In this section, we discuss two issues important to understanding the UCR’s purpose and its limitations. The first issue concerns whether or not the UCR is the best measure of crime in the United States. The second issue, which relates to the focus of this dissertation, is the problem of missing gaps in the UCR.

The Two Crime Measures

One of the most misunderstood issues with national crime statistics is the dis- crepancy that exists between the UCR and the National Crime Victimization Survey

(NCVS). The UCR and the NCVS are the U.S. Department of Justice’s two national measures of crime. When annual statistics are released by the two systems, often they give similar crime estimates, but occasionally the results are in disagreement.

For example, in 2001, statistics from the UCR suggested that violent crime, after declining in the past few years, had stabilized in 2000, but statistics from the NCVS suggested that violent crime had dramatically fallen 15% in 2000 [Rand and Rennison,

2002]. When differences between these two systems occur, it causes much confusion among criminologists. Both systems measure the same general set of crimes, so there is an expectation for them to give similar statistics. There are significant differences, however, in the purposes and methods between the two programs.

The NCVS was designed in 1972 to be a gauge for UCR statistics, but also to fill in characteristics of crime that are lacking in the UCR. The NCVS is a survey conducted from a large sample of U.S. households, and it measures crime that people have expe- rienced whether or not they reported the crime to the police. The UCR, on the other hand, measures only crime reported to police agencies across the country. Victims do not report crimes to the police for various reasons including fear, embarrassment, mistrust in the police, or belief that reporting would not result in any police action.

Since it is a survey, the NCVS is subject to sampling error and to non-sampling error, like response error. Another difference is in how the UCR and NCVS count crimes.

The hierarchy rule in the UCR permits only the most serious crime in one incident to be counted. The NCVS can count each crime to each victim in the sample in a particular incident. Unlike the UCR, the NCVS does not count crimes committed against children under age 12, businesses or organizations, or people who are not permanent residents of the United States. The NCVS does not measure murder or arson occurrences. Furthermore, there are slight differences in how the two programs define some crimes. Understanding these fundamental differences between the UCR and NCVS will go a long way in reconciling the discrepancies in reported results. At the beginning of each issue of the CIUS, the FBI cautions UCR data users against comparing crime trends with statistics in NCVS. Though it may not be meaningful to compare estimates in the two systems, it can be highly valuable to use the two systems in conjunction to understand the crime problem.

Both programs are important, and when considered together they give us a fuller knowledge of the characteristics of crime in the United States. The NCVS is essential because as much as two-thirds of property crime and one half of violent crime are not reported to police. The UCR is essential because it is the only system that tells us how much crime has been reported to police throughout the country and it is the only national crime data series that provides information by state and by city each month [Rand and Rennison, 2002]. In the appendices of each publication of CIUS, the FBI gives an explanation of the necessity of both crime measures. The following is an excerpt of FBI [2000]:

Because the UCR and NCVS programs are conducted for different purposes, use different methods, and focus on somewhat different aspects of crime, the information they produce together provides a more comprehensive panorama of the Nation’s crime problem than either could produce alone.

The UCR, therefore, is a reliable and necessary measure of crime in the United States.

The enduring significance of the UCR motivates the continuing effort to improve its quality as a data source.

Gaps in the UCR

The UCR is a voluntary reporting system, so it’s not surprising that the data are incomplete. Some states have instituted laws requiring agencies within those states to provide crime reports to the FBI, but usually there is no penalty for not doing so [Maltz, 1999]. Still, most agencies comply with submitting reports to the FBI.

In 2003, UCR data were compiled from agencies representing 93% of the population in 47 states. There are many reasons why an agency may miss reporting occasional months, strings of months, or even whole years. Computer problems, budget problems, changes in record management systems, personnel shifts or shortages, natural disasters, and a number of other reasons account for the gaps in the UCR [Maltz and Targonski, 2002]. Even the rarity of crime has caused some agencies to neglect

filling out crime reports. Aside from reasons for non-reporting, cases of data entry errors or mis-reporting may occur, and incorrect data must be identified and edited or excluded.

Gaps in reporting do not prevent the UCR from being used in the media and in research studies [Maltz and Weiss, 2006]. As an example, county-level UCR data were used in a study on the effect of laws permitting the carrying of concealed weapons on homicide and other violent crimes. Soon after this study was completed, the potential effect of gaps in the data used by this study was analyzed more closely, leading to the

conclusion that the data could not support the study’s findings [Maltz and Targonski,

2002]. In this example, analyses of inadequate data were used to persuade states to

reform policies on carrying concealed weapons. Not only have policy-oriented studies

been flawed, but numerous research studies published in criminology journals have

used UCR data that is inaccurately assumed to be sufficient [Maltz and Weiss, 2006].

Clearly, this is a compelling reason for the use of missing data techniques to fill in

the gaps in the UCR.

1.1.3 Accessing the Data

Researchers have increasingly been analyzing UCR data at sub-national levels, largely due to the greater accessibility of the data [Maltz, 1999]. The UCR statistics have been available to the public since the FBI began its annual publication of CIUS in 1930, but now the data is easily accessible from the internet. UCR data sets can be downloaded at no cost from the National Archive of Criminal Justice Data

(NACJD), which is part of the Inter-University Consortium for Political and Social

Research (ICPSR) at the University of Michigan [Maltz and Weiss, 2006]. The data are also accessible from the FBI website and the Bureau of Justice Statistics (BJS) website.

Maltz and Weiss [2006] worked to provide a clean UCR dataset in Excel. These researchers were concerned with making monthly data available and usable for the criminal justice community, so that issues of seasonality could be addressed and so that changes in crime over time could be clearly seen for each jurisdiction. The data starts at 1960 because that is the earliest year for which electronic records are available. Using a set of macros in Excel, Maltz and Weiss [2006] have developed a

UCR charting utility to allow researchers to view the time series of a specified state, agency, and crime. By giving viewers time series plots of the crime data, a clear picture of the variability in the data can be observed.

In the clean dataset, each state has its own Excel workbook. In each workbook, there are several worksheets of information. Each agency is identified by its ORI (Originating Agency Identifier), which is a code consisting of seven characters beginning with the state’s postal code. Each agency also has annual population estimates and annual population group classifications (we will use capitalized “Group” to refer to this classification). Although there are conceivably many other population characteristics related to crime, population size is the only correlate of crime that CIUS utilizes [FBI, 2000]. Table 1.1 shows how the FBI classifies agencies based on their population size. Groups 6–9 represent both cities and other types of jurisdictions, including university, county, and state police agencies, as well as fish and game police, park police, and other public law enforcement organizations. Most (but not all) of these “other types” of jurisdictions are called “zero-population” agencies by the FBI, because no population is attributed to them. This is because, for example, counting the population policed by a university police department or by the state police would be tantamount to counting that population twice. Thus, the crime count for these zero-population agencies is added to the crime count for the other agencies to get the total crime count: for the city, in the case of the university police, and for the state, in the case of the state police.

Population Group               Political Label   Population
1                              City              250,000 and over
2                              City              100,000 to 249,999
3                              City              50,000 to 99,999
4                              City              25,000 to 49,999
5                              City              10,000 to 24,999
6                              City (a)          Less than 10,000
8 (Nonmetropolitan County)     County (b)        N/A
9 (Metropolitan County)        County (b)        N/A

Note: Group 7, missing from this table, consists of cities with populations under 2,500 and universities and colleges to which no population is attributed. For compilation of CIUS, Group 7 is included in Group 6.
(a) Includes universities and colleges to which no population is attributed.
(b) Includes state police to which no population is attributed.

Table 1.1: FBI Classification of Population Groups

Besides crime counts for the seven Index crimes, counts for subcategories of crimes are also available. For example, the vehicle theft Index is the sum of auto thefts, truck thefts, and thefts of other vehicles. These Part I offenses are shown in Table 1.2 as defined in the UCR handbook [FBI, 2004]. The sum of the seven Part I crimes is labeled NDX. All analyses described in this dissertation use the clean dataset Maltz has provided.

1.2 Imputation Methods

A missing datum is a month in which no counts have been provided. There are

an endless variety of possible ways to substitute a missing datum with a “plausible”

value. In an individual case, some values may seem to be more plausible than others.

Ideally, we want our imputation methods to make reasonable sense, to be simple

to implement, and to produce precise and accurate estimates of the missing values.

Label   Index Crime                        Subcategories
MUR     Criminal Homicide                  Murder and non-negligent manslaughter; Manslaughter by negligence
RPT     Forcible Rape                      Rape by force; Attempt to commit forcible rape
RBT     Robbery                            Firearm; Knife or cutting instrument; Other dangerous weapon; Strong-arm (hands, fists, feet, etc.)
AGA     Aggravated Assault                 Firearm; Knife or cutting instrument; Other dangerous weapon; Strong-arm (hands, fists, feet, etc.)
BUR     Burglary - Breaking or Entering    Forcible entry; Unlawful entry (no force); Attempted forcible entry
LAR     Larceny - Theft
VTT     Motor Vehicle Theft                Autos; Trucks and buses; Other vehicles

Table 1.2: Part I Offense Classifications

In this section, we begin considering various simple single imputation methods as a starting place for choosing a solution to our problem of imputing for missing data in the UCR. We also describe what the FBI has done to handle missing data.

Little and Rubin [2002] consider two classes of single imputation methods: those based on explicit modeling and those based on implicit modeling. Imputation based on implicit modeling is typically much simpler, and involves implementing an algorithm. Imputation based on explicit modeling usually involves imputing the mean of, or a draw from, the predictive distribution. Only single imputation methods are discussed in this section but multiple imputation will be discussed in Chapter 4.

1.2.1 Simple Imputation Methods

In our work with missing data in the UCR, we will consider imputation methods that use information from other responding agencies, that use the longitudinal structure in the data, and that use available covariates. We, therefore, consider methods in the nonresponse literature that address these issues.

In survey analysis, a common method for dealing with item non-response is hot deck imputation [Little and Rubin, 2002]. This method uses the observations from “similar” individuals in the same dataset to donate values to impute for missing observations. The most complicated part of this method is defining the criteria for the “similar” individuals. Nearest neighbor imputation is the name for using the closest hot deck donor. We must be careful with using hot deck strategies in the UCR because our goal is to create a complete crime series for each individual agency. Typical use of hot deck imputation in surveys is not necessarily concerned with providing the best estimates for individual respondents, but rather with providing an accurate picture of the population as a whole. An advantage of this method, however, is that the imputations are real possible outcomes.

In longitudinal studies, where data are missing because of dropouts, the simplest imputation method is the “last observation carried forward” (LOCF) method [Shao and Zhong, 2003], in which the observation which precedes a missing datum (or missing period of data) replaces all of the missing values following it. Like hot deck imputation, LOCF is based on a model, though implicit in its specification. This method makes the simple assumption that an individual’s outcome remains unchanged after leaving the study. Problems with this method are that the means are biased, the covariance structure is distorted, and the measures of precision are usually underestimated [Ting and Brailey, 2005]. An advantage of this method, and similarly “next observation carried backward” (NOCB), is that these methods are easy to understand and to communicate. Since modeling approaches have become available, use of LOCF has been discouraged.
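As a concrete illustration of these carried-forward and carried-backward rules, the short Python sketch below applies them to a toy monthly series with pandas; the series values and index are hypothetical, not UCR data.

```python
# A minimal sketch of LOCF and NOCB on a toy monthly count series
# (hypothetical values), using pandas' forward- and backward-fill operations.
import numpy as np
import pandas as pd

counts = pd.Series([12, 15, np.nan, np.nan, 9, 11],
                   index=pd.period_range("2000-01", periods=6, freq="M"))

locf = counts.ffill()   # last observation carried forward
nocb = counts.bfill()   # next observation carried backward
print(pd.DataFrame({"observed": counts, "LOCF": locf, "NOCB": nocb}))
```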

An imputation method that is somewhat more complicated is regression imputation. This type of imputation is based on explicit regression models of the missing values on observed values for an individual. Mean imputation is a special case of regression imputation. Generally speaking, an imputation method uses a predictive distribution to impute. This distribution can be used in different ways, but the most common ways are imputing a mean and imputing a draw. If the model uses the predicted values as imputations, then it is basically imputing a mean. On the other hand, we can use the model to obtain draws from the predictive distribution for imputation, thus performing regression imputation stochastically.
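The following Python sketch contrasts the two uses of the predictive distribution just described, imputing the predicted mean versus imputing a stochastic draw, for a simple linear regression fit on simulated data; the variable names and simulated model are our own illustrative assumptions.

```python
# Mean imputation versus a stochastic draw from the predictive distribution,
# under a simple linear regression fit to simulated data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=200)
missing = rng.random(200) < 0.2                      # flag 20% of y as missing

# Fit y ~ x on the observed cases only.
X_obs = np.column_stack([np.ones((~missing).sum()), x[~missing]])
beta, *_ = np.linalg.lstsq(X_obs, y[~missing], rcond=None)
resid_sd = np.std(y[~missing] - X_obs @ beta, ddof=2)

X_mis = np.column_stack([np.ones(missing.sum()), x[missing]])
mean_imputation = X_mis @ beta                                  # impute the mean
draw_imputation = mean_imputation + rng.normal(scale=resid_sd,  # impute a draw
                                               size=missing.sum())
```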

Our own imputation methods will build on these basic strategies. Increasing the complexity in imputation methods may result in better estimates of the missing values. We give a full discussion of our imputation methods in Chapters 4 and 6.

1.2.2 FBI Imputation Methods

In 1958, when the FBI felt they had enough coverage with the UCR to provide national-level crime estimates, they decided to use a simple imputation method to fill in the gaps in the UCR. Their primary concern has been annual crime count estimates at the state, regional, and national level. In fact, the FBI does not publish imputed data at the county or local levels [Maltz, 1999]. Gaps in the data occur at the agency level, when a police department misses reporting one or more months, but the FBI imputes only to the state, regional, or national level.

The FBI’s imputation approach is based on the current year’s reporting and involves two procedures. If three or more months’ crimes were reported during the year, the estimated crime count for the year is $C_A \cdot 12/M$, where $M$ is the number of months reported and $C_A$ is the total number of crimes reported during the $M$ months (Table 1.3). For example, if Navajo, Arizona reported two robberies during January through November in 2000, then the estimate for the year would be $2 \cdot 12/11 = 2.18$ robberies.

If fewer than three months were reported in a particular year, the estimated crime count is based on “similar” agencies in the same state. A “similar” agency is one that meets three selection criteria: it must be in the same state, the same Group (as defined in Table 1.1), and it must have reported data for all 12 months. The crime rate for these similar agencies is then computed (the total crime count is divided by the total population), and this rate is multiplied by the population of the agency needing imputation, to estimate its annual crime count ($P_A \cdot C_S/P_S$). This procedure resembles the hot deck imputation method described previously. Instead of imputing from a nearest neighbor, this method uses information from all “similar” reporting units.

If a “zero-population” police agency does not report fully for the UCR, no imputation is made by the FBI to compensate for the missing data [Maltz and Targonski, 2002]. Clearly, this can lead to underestimating the crime totals at the county and state levels.

Number of Months Reported by Agency    Formula for Imputation
0 to 2                                 P_A · C_S / P_S
3 to 11                                C_A · 12 / M

C_A, P_A: the agency’s crime count and population, respectively, for the year
C_S, P_S: the crime and population counts of similar agencies, for the year
M: the number of months reported, for the year in question

Table 1.3: FBI Imputation Procedure
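The two rules in Table 1.3 can be expressed as a small function. The Python sketch below is our own illustration of those rules; the function name and argument names are hypothetical, and the only figures reused from the text are those of the Navajo, Arizona example.

```python
# A sketch of the FBI's two annual imputation rules from Table 1.3.
def fbi_impute_annual(agency_count, months_reported,
                      agency_pop=None, similar_count=None, similar_pop=None):
    """Estimate an agency's annual crime count under the FBI rules."""
    if months_reported >= 3:
        # Three or more months reported: scale up to a full year, C_A * 12 / M.
        return agency_count * 12.0 / months_reported
    # Fewer than three months reported: apply the crime rate of "similar"
    # agencies, P_A * (C_S / P_S).
    return agency_pop * similar_count / similar_pop

# Navajo, AZ example from the text: 2 robberies reported in 11 months.
print(round(fbi_impute_annual(agency_count=2, months_reported=11), 2))  # 2.18
```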

Two drawbacks of the FBI’s imputation method are that it ignores seasonality and it only uses one year of data. The FBI’s imputation procedure for agencies that report at least three months assumes that the level of crime during the reported months is the same as the level of crime for the unreported months. This can cause significant bias, especially if there is a seasonal effect on crime in a particular area.

In the imputation procedure for agencies that report fewer than three months, there also can be bias, especially if missingness is correlated with low crime totals as one might expect.

A more appropriate imputation method might be to base the imputation on the agency’s long-term history, but the computing power needed to do this did not exist in the late 1950s and early 1960s, when the FBI method was implemented. Nor did the FBI change its method afterwards, when computing became cheaper and easier, probably preferring to maintain consistency in its reporting [Maltz, 1999].

Recognizing the need for better imputation procedures, the FBI and the BJS held a Workshop on UCR Imputation Procedures in 1997. In his report, “Bridging

Gaps in Police Crime Data,” based on discussions from the workshop, Michael Maltz describes the problems of missing data in the UCR and gives suggestions for improving the imputation procedures. In his discussion of imputation philosophy, Maltz [1999] recommends using a longitudinal estimation procedure instead of a cross-sectional one. It makes sense to base current crime predictions on past crime data for each individual agency since the autocorrelation of the data within an agency may be more significant than the cross-sectional correlation among similar agencies’ crimes.

Maltz [1999] also suggests that “an estimate for a missing month that is based on the same month last year is better than one based on the reported months for this year,” which implies that seasonality has a strong effect on crime counts for most agencies.

Interestingly enough, studies of crime seasonality are found in CIUS at the national and regional levels, but not at the local level.

We can incorporate seasonality into our crime estimates by imputing each month separately. By predicting each missing month’s value, using a model that conditions on the observed data in the UCR, we also eliminate the need for the FBI’s arbitrary three-month cut-point for imputing incomplete-reporting versus non-reporting agencies. Using both the agency’s history of data and similar agencies’ data is the driving strategy in our development of new imputation procedures for the UCR.

CHAPTER 2

PRELIMINARY ANALYSIS

Our preliminary analysis begins with a visual inspection of the time series plots of the UCR data. Practically speaking, the UCR is too big to examine all of the plots for each crime from each agency in each state, but with a sensible sampling of agencies across states, we can look at enough series to get a good idea of what to expect from the data source as a whole. An example of a time series from the UCR is the series of monthly counts of larceny from Columbus, Ohio shown in Figure

2.1. The clean UCR charting utility [Maltz and Weiss, 2006] makes the data easily accessible and viewable. Upon first inspection of a sample of crime series from the

UCR, we notice several features of the data. Since the data is indexed by time, it is not surprising that we are able to identify components commonly found in time series data, namely trends and seasonality (seen in Figure 2.1). Other features particular to the UCR, such as types of missing data and types of aggregated data, are also evident when exploring UCR crime series. Of further interest is the considerable amount of variability among crime series of different types and from different agencies.

In Section 2.1 we perform a basic time series analysis on an example from our data. The time series techniques here assume that the data are normal. This is not a reasonable assumption for most of the data in the UCR, but for series with large crime counts, the methods are appropriate. The purpose of the following classical time series analysis on UCR data is to introduce the modeling techniques commonly used for dependent data. We will use some ideas behind these strategies when we develop our time series models for count data.

Figure 2.1: Time series for larceny counts in Columbus, OH

In Section 2.2 we discuss various factors which influence crime and possible explanatory variables for inclusion in our data models. In Section 2.3 we explore the missingness in the UCR, describe its patterns, and validate the missing data assumptions necessary for our imputation techniques.

2.1 Time Series Analysis

In our study of the UCR, the purpose of time series analysis is twofold: to understand the stochastic process that is thought to generate the observed values, and then to predict missing observations, including at past and future time points. In this section, we focus on the first of these goals, that is, determining (or modeling) the stochastic process that generates the crime series in the UCR. Here we present the basics of time series modeling and we start to apply the modeling techniques to the UCR data.

Let us first take a moment to define notation. For a particular crime type and a particular agency, let $Y_t$ be the random variable representing the crime count at time $t$ and let $y_t$ be the realized value. Assume that $\{Y_t : t = 1, 2, \ldots, n\}$ are dependent random variables with a joint probability distribution. In the study of the UCR, the time indexes, $t$, represent months starting with $t = 1$ on January 1960, although this does not necessarily have to be the case. The sequence of random variables, $\{Y_1, Y_2, Y_3, \ldots\}$, indexed in time, is called a discrete-time stochastic process. Let the mean function for the process be given by
$$\mu_t = E(Y_t) \quad \text{for } t = 1, 2, 3, \ldots,$$
and the covariance function for the process be given by
$$\gamma_{t,s} = \mathrm{Cov}(Y_t, Y_s) \quad \text{for } t, s = 1, 2, 3, \ldots.$$
Then, a stochastic process is (weakly) stationary if the mean function is constant over time and the autocovariance function (ACVF), defined as $\gamma(h) = \mathrm{Cov}(Y_{t+h}, Y_t)$, is independent of $t$ for each lag $h$.

Let $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ be a white noise process. That is, $\{Z_t : t = 0, \pm 1, \pm 2, \ldots\}$ is a sequence of uncorrelated error terms with mean 0 and variance $\sigma^2$. Generally, the $\{Z_t\}$ are assumed to be normally distributed. A linear process is defined by Brockwell and Davis [2002] as a time series, $\{Y_t\}$, that can be represented by
$$Y_t = \sum_{j=-\infty}^{\infty} \psi_j Z_{t-j} \quad \text{for all } t,$$
where $\{\psi_j\}$ is a sequence of constants with $\sum_{j=-\infty}^{\infty} |\psi_j| < \infty$.

An important class of linear processes is the autoregressive moving average (ARMA) class developed by Box and Jenkins [1976]. ARMA models are very useful for stationary processes. If $\{Y_t\}$ is stationary, then an ARMA(p, q) process in Brockwell and Davis [2002] is given by
$$Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \cdots + \phi_p Y_{t-p} + Z_t + \theta_1 Z_{t-1} + \cdots + \theta_q Z_{t-q}. \qquad (2.1)$$

The model has a moving-average (MA) component and an autoregressive (AR) component. The MA component refers to the weighted average of $\{Z_t, \ldots, Z_{t-q}\}$, white noise random variables from a latent process. The AR component refers to the regression of the crime counts on themselves, the linear combination of $\{Y_{t-1}, \ldots, Y_{t-p}\}$. AR and MA parameters are represented by $\phi$ and $\theta$ respectively. The orders of the AR and MA components are $p$ and $q$ respectively.

Brockwell and Davis [2002] define causality of an ARMA(p, q) process as the condition that there exist constants $\{\psi_j\}$ such that $\sum_{j=0}^{\infty} |\psi_j| < \infty$ and
$$Y_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j} \quad \text{for all } t.$$
For the ARMA process, this condition is equivalent to the condition that
$$\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p \neq 0 \quad \text{for all } |z| \le 1.$$
This condition can easily be checked for any ARMA process. Causality becomes important for forecasting because the process should be defined independently of future $Z_t$.

Notice that the ARMA model has $p + q + 1$ parameters to be estimated (the $p$ AR parameters, the $q$ MA parameters, and the variance $\sigma^2$ of the white noise process $\{Z_t\}$). Choosing values for $p$ and $q$ is an important step in ARMA modeling, but it is often a subjective task. Examination of plots of the sample autocorrelation function (ACF) and the sample partial autocorrelation function (PACF) is one basic technique for order selection of an ARMA process. The MA(q) (ARMA(0, q)) moving average process and the AR(p) (ARMA(p, 0)) autoregressive process are relatively straightforward to identify using the sample ACF and PACF plots. Theoretically, the ACF of an MA(q) process is zero after q lags and the PACF of an AR(p) process is zero after p lags. The ACF of an AR process decays exponentially or sinusoidally, while the PACF decays exponentially or sinusoidally for an MA process. Difficulty arises in selecting the appropriate p and q values when we have an ARMA process involving both the AR and MA components. In this case, the ACF diminishes only after the first $q - p$ lags, and some trial and error is often required to determine p and q; ARMA order selection becomes more of an art than a science. Information-based criteria, such as Akaike's AIC [1973] and the BIC [1978], can be used to help select values for p and q. These criteria do not dictate order selection, but rather should be used as guidance. One general goal in model building is to keep the model as simple as possible. If possible, we should avoid large values of p and q, especially if we plan to use our model for forecasting. Because of the variance increase from estimating more parameters, and the possibility of over-fitting to the data, higher-order models tend to do poorly in forecasting.
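To make the order-selection step concrete, the Python sketch below simulates an ARMA(1,1) series and scans a small (p, q) grid by AIC using statsmodels; the grid, seed, and simulated parameters are illustrative assumptions rather than part of the UCR analysis.

```python
# AIC-guided ARMA order selection on a simulated series.
import numpy as np
from statsmodels.tsa.arima_process import arma_generate_sample
from statsmodels.tsa.arima.model import ARIMA

np.random.seed(1)
# ARMA(1,1) with phi = 0.7 and theta = 0.4 (lag polynomials include the zero lag).
y = arma_generate_sample(ar=[1, -0.7], ma=[1, 0.4], nsample=500)

best = None
for p in range(3):
    for q in range(3):
        fit = ARIMA(y, order=(p, 0, q)).fit()
        if best is None or fit.aic < best[0]:
            best = (fit.aic, p, q)
print("selected (p, q):", best[1:], "with AIC", round(best[0], 1))
```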

In our discussion of ARMA models thus far, we have assumed that our process is stationary. In many cases, the process is not stationary, but can be reduced, rather, to a stationary series of residuals which are modeled by ARMA techniques.

Trend and seasonality must be removed from the original series in order to obtain these stationary residuals. There are different ways of doing this, and we will briefly describe two methods and discuss the benefits of each.

2.1.1 Classical Decomposition Model

Basic time series elements are clearly identified in the classical decomposition model, which is a very generic form of time series model. Common to most time series models, the ultimate goal is to obtain stationary residuals that can be modeled by a stationary process such as the ARMA process. An advantage of this method is that the trend and seasonal components are estimated explicitly. This may or may not be important, depending on the application. In our situation, estimating trend and seasonality is not necessarily desirable as a long-term goal. Understanding these components, however, is beneficial in the exploratory stages of our analysis. The model, given by
$$Y_t = m_t + s_t + Z_t \quad \text{for } t = 1, 2, 3, \ldots, n, \qquad (2.2)$$
decomposes the original stochastic process, $\{Y_t\}$, into three parts. The trend component, represented by the function $m_t$, is a smooth, slowly changing function that can be modeled independently from the rest of the series. The seasonal component, $s_t$, is a periodic function of time with a specified period $d$. We assume $\sum_{t=1}^{d} s_t = 0$. The random noise component is represented by the stationary residuals, $\{Z_t\}$.

If (2.2) is appropriate, we can estimate its terms from the data via the classical decomposition algorithm (CDA). This involves first estimating $m_t$ and $s_t$, with the hope that the residuals will be a stationary process to be modeled by ARMA.
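A minimal Python sketch of this algorithm is given below, assuming the input is a complete pandas Series of monthly (log) counts; the helper name and the simulated demonstration series are our own, not code from the dissertation.

```python
# A sketch of the classical decomposition algorithm (CDA): a centered
# twelve-month moving-average filter estimates the trend m_t, and centered
# monthly means of the detrended series estimate s_t.
import numpy as np
import pandas as pd

def classical_decomposition(y, period=12):
    # Centered moving-average filter spanning one season estimates the trend.
    weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    trend = y.rolling(window=period + 1, center=True).apply(
        lambda w: np.sum(w * weights), raw=True)
    detrended = y - trend
    # Average the detrended values for each calendar month, then center the
    # twelve seasonal estimates so that they sum to zero.
    monthly = detrended.groupby(detrended.index.month).mean()
    monthly -= monthly.mean()
    seasonal = pd.Series(monthly.loc[y.index.month].values, index=y.index)
    residual = y - trend - seasonal
    return trend, seasonal, residual

# Illustrative monthly series with trend, seasonality, and noise.
idx = pd.period_range("1960-01", "2002-12", freq="M")
t = np.arange(len(idx))
y = pd.Series(7 + 0.002 * t + 0.3 * np.sin(2 * np.pi * t / 12)
              + np.random.default_rng(0).normal(0, 0.1, len(idx)), index=idx)
trend, seasonal, residual = classical_decomposition(y)
```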

22 We use UCR data from Columbus, Ohio to illustrate the use of the CDA. Larceny, or common theft, has large crime counts and no missing observations in Columbus.

Not only does the raw larceny series (See Figure 2.1) exhibit an increasing trend and seasonality, but the variance and seasonal effect also change over time. The time series for agencies reporting in the UCR data are not generally stationary, mainly because of population changes over time.

We will first make an appropriate transformation of our data. From the plot of the

original data, we noticed that the magnitude of the random noise component increases

with the crime level. This suggests that a variance-stabilizing transformation may

be needed to provide better stationarity in the residuals. Figure 2.2 shows the time

series plot of the logarithmically transformed larceny data from Columbus. The level

of noise fluctuations now seems to be more constant across time. This transformation

is also seen to lessen the severity of the extreme crime counts in the late 1990’s.

Figure 2.3 shows plots resulting from the classical decomposition algorithm. The

trend function, mt, is estimated using a smoothing algorithm with a finite moving

average filter spanning the season, which we deem to be twelve months. The trend

looks to be a fairly linear increasing function of time. When the trend is removed from

the original series, we are left with a series that should be centered around zero. The

seasonality is then estimated by calculating the average response from the detrended

data for each of the twelve months, and then each seasonal estimate is centered so

they have mean zero. It appears that the occurrence of larceny is highest in the

summer and lowest in the winter. The seasonal component is subtracted from the

series, and the resulting detrended and deseasonalized data more closely exhibit the properties of stationarity, although the variance still appears to be slightly increasing at the end of the series, and outliers are still seen in the late 1990’s.

Figure 2.2: Time series for log larceny counts in Columbus, OH

These “residuals” now can be modeled using stationary ARMA models. Figure

2.4 shows residual plots for the stationary residuals obtained from classical decomposition. The normal Q-Q plot shows heavy tails in the distribution of the residuals, which indicates a problem with normality. This is probably due to the extreme crime counts observed in the late 1990’s. From the sample ACF plot of the residuals, we see that not all of the correlation in the data has been accounted for by CDA. Specifically, the seasonality has not been sufficiently captured since the autocorrelation is high at lags 6, 12, 18, and so on. This suggests that the seasonal effect is not constant over the years, but rather evolves over time. The sample PACF plot of the residuals shows

[Figure 2.3 panels: Estimated Trend; Trend Removed; Estimated Seasonality; Trend and Seasonality Removed]

Figure 2.3: Classical Decomposition Algorithm showing the steps taken to remove trend and seasonality from the series of Columbus log larceny counts.

[Figure 2.4 panels: Residual Plot; Normal Q-Q Plot; ACF of Residuals; PACF of Residuals]

Figure 2.4: ACF and PACF for series adjusted by CDA

significant negative correlation among the closely neighboring residuals. Ignoring the

seasonal correlation in the sample ACF, we might consider modeling the residuals

as an AR(1) process, although a more complicated model would likely be needed to

reduce the series down to a white noise process.

We turn now to an alternative method for dealing with nonstationary seasonal

time series that can improve on the results from the CDA.

2.1.2 Time Series Models Based on Differencing

Another way to achieve the desired result of a stationary process is differencing the data, a common technique developed by Box and Jenkins [1976]. Lag-1 differencing is defined, for each $t$, by
$$(1 - B) Y_t = Y_t - Y_{t-1},$$
where $B$ is the backward shift operator, defined by $B Y_t = Y_{t-1}$ [Brockwell and Davis, 2002]. Differencing is used to transform a nonstationary process into a stationary process, so that we can model the stationary process by ARMA techniques. In most cases, lag-1 differencing will sufficiently eliminate linear trends, but we will still need to make corrections for seasonality.
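The Python sketch below shows lag-1 and lag-12 differencing applied to a simulated monthly log-count series standing in for a UCR series; the simulated series and names are illustrative assumptions.

```python
# Lag-1 and seasonal (lag-12) differencing of a simulated monthly log-count series.
import numpy as np
import pandas as pd

idx = pd.period_range("1960-01", "2002-12", freq="M")
t = np.arange(len(idx))
y = pd.Series(np.log(500 + 5 * t) + 0.2 * np.sin(2 * np.pi * t / 12)
              + np.random.default_rng(1).normal(0, 0.05, len(idx)), index=idx)

lag1 = y.diff(1)              # (1 - B) Y_t: removes a linear trend
lag1_12 = y.diff(1).diff(12)  # (1 - B)(1 - B^12) Y_t: also removes stable seasonality
mcd = lag1 - lag1.mean()      # mean-corrected differenced (MCD) series
```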

With our log-transformed data, we first consider the ARIMA (autoregressive integrated moving average) model [Brockwell and Davis, 2002] for nonstationary time series. The ARIMA process imposes an ARMA model on the differenced series. If the differenced series, $X_t = (1 - B)^d Y_t$, is an ARMA(p, q) process, then $Y_t$ is an ARIMA(p, d, q) process, where $d$ is the number of times the series is differenced to produce the ARMA process. Figure 2.5 shows the residual plots from the mean-corrected differenced (MCD) data. We subtract the mean from our differenced series so that our residuals are centered around zero. In the first plot, the “residuals” refers to the MCD series using lag-1 differencing. Notice how the transformation of the data improved the normality of the residuals relative to the decomposition algorithm residuals. It is obvious from the ACF plot that, in this case, no effort was made to model the seasonality in the data. Without any adjustment for seasonality, the ARIMA model is inadequate.

The SARIMA (seasonal autoregressive integrated moving average) form of time series models establishes a flexible class of nonstationary seasonal models with parameters specified by $(p, d, q) \times (P, D, Q)_s$. The first set of parameters, $(p, d, q)$, is the familiar ARIMA construction. The second set of parameters, $(P, D, Q)_s$, designates

[Figure 2.5 panels: Residual Plot; Normal Q-Q Plot; ACF of Residuals; PACF of Residuals]

Figure 2.5: ACF and PACF for MCD series of larceny counts in Columbus, OH

a seasonal supposition on the data, where $s$ is the seasonal period. Seasonality itself may exhibit local trends over time (and thus may need to be differenced out), and may also manifest itself as dependencies that we choose to model with AR and MA structural characteristics. Let $B^s$ be the lag-$s$ backward shift operator defined by $B^s Y_t = Y_{t-s}$. In the SARIMA model, the data, $Y_t$, are transformed by lag-1 differencing $d$ times and lag-$s$ differencing $D$ times. The resulting series, $X_t$, is modeled by applying ARMA sequentially and cyclically. The model from Brockwell and Davis [2002] is given by
$$X_t = (1 - B)^d (1 - B^s)^D Y_t, \quad \text{for } t \in \{1, 2, 3, \ldots\}, \text{ where}$$
$$\phi(B)\,\Phi(B^s)\, X_t = \theta(B)\,\Theta(B^s)\, Z_t, \quad \text{and } \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$
In the above equation,
$$\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p, \qquad \Phi(z) = 1 - \Phi_1 z - \cdots - \Phi_P z^P,$$
$$\theta(z) = 1 + \theta_1 z + \cdots + \theta_q z^q, \qquad \Theta(z) = 1 + \Theta_1 z + \cdots + \Theta_Q z^Q,$$
are polynomials of degrees $p$, $P$, $q$, and $Q$ respectively. The autoregressive parameters are represented by the coefficients in $\phi(z)$, and the moving average parameters are represented by the coefficients in $\theta(z)$. Similarly, the seasonal autoregressive parameters and seasonal moving average parameters are represented by the coefficients in $\Phi(z)$ and $\Theta(z)$ respectively.

We initially tried complicated models containing many terms for the UCR data. Ultimately, however, we chose much simpler models to adhere to the principle of parsimony. The sample ACF and PACF suggested that we should use SARIMA models with parameters $(1, 1, 1) \times (1, 0, 1)$ and period 12. Figure 2.6 shows plots for the residuals from the SARIMA model. The ACF and PACF plots demonstrate that the model has appropriately accounted for the dependence in the data, represented through trend and seasonality. In Figure 2.6 the points in the ACF and PACF plots fall within the bounds $\pm 1.96/\sqrt{516}$, the confidence bounds for white noise, showing that the model does capture the seasonality and trend features of the data.
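A Python sketch of fitting this SARIMA(1, 1, 1) × (1, 0, 1) specification with period 12 in statsmodels is shown below; since the UCR extract is not bundled with this text, the 516-month series is simulated to stand in for the Columbus log larceny counts.

```python
# Fitting a SARIMA(1,1,1)x(1,0,1)_12 model to a simulated monthly log-count
# series (a stand-in for the Columbus log larceny data).
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(3)
t = np.arange(516)                                   # Jan 1960 through Dec 2002
log_counts = pd.Series(6.3 + 0.003 * t + 0.2 * np.sin(2 * np.pi * t / 12)
                       + rng.normal(0, 0.1, t.size),
                       index=pd.period_range("1960-01", periods=516, freq="M"))

fit = SARIMAX(log_counts, order=(1, 1, 1),
              seasonal_order=(1, 0, 1, 12)).fit(disp=False)
print(fit.params)                 # AR, MA, seasonal AR/MA, and sigma^2 estimates
resid = fit.resid[13:]            # drop start-up residuals affected by differencing
# Sample ACF values within +/- 1.96/sqrt(n) are consistent with white noise.
```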

Except for some extreme values in the tails, the Columbus larceny series is modeled rather accurately using SARIMA because the crime counts are very large ranging from about 500 per month in 1960 to close to 4000 per month in 2000. SARIMA models work well for data that is approximately normal, but for most of the time series in the

UCR, the crime counts are much smaller and the data need to be treated as discrete.
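For readers who wish to reproduce a fit of this kind, the following is a minimal sketch (not the code used in this dissertation) of estimating a SARIMA$(1,1,1)\times(1,0,1)_{12}$ model with the statsmodels library; the series here is simulated as a stand-in for the 516 months of log larceny counts.

```python
# A minimal sketch (not the dissertation's code) of fitting the
# SARIMA(1,1,1)x(1,0,1)_12 model discussed above with statsmodels.
# The series below is simulated as a stand-in for 516 months of
# log-transformed larceny counts (1960-2002).
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)
t = np.arange(516)
log_counts = pd.Series(
    6.0 + 0.003 * t + 0.2 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.1, 516),
    index=pd.date_range("1960-01-01", periods=516, freq="MS"),
)

model = SARIMAX(log_counts, order=(1, 1, 1), seasonal_order=(1, 0, 1, 12))
result = model.fit(disp=False)

# Residual diagnostics analogous to Figure 2.6: residuals should resemble
# white noise if trend and seasonality have been captured.
print(result.summary().tables[1])
print("Lag-1 residual autocorrelation:", result.resid.autocorr(lag=1))
```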

Figure 2.6: Residual plots of the SARIMA model of log larceny counts in Columbus, OH

When it is unreasonable to assume normality in the data, SARIMA models are quite inappropriate. When our data are small counts, the Poisson distribution becomes a more acceptable distributional assumption than normality. Poisson regression models are common for independent data but are increasingly seen in applications involving dependent data. In Chapter 3, we take a closer look at some of the time series models for count data that appear in the literature.

2.2 Explanatory Variables

The time series models we have described so far make no use of covariates. If predictions were made using these models, they would only depend on the data in the crime series itself. Thus, any imputations made using predictions from these models would be based solely on the crime counts in the months prior to, and the months following the missing period. Recognizing that other information can help explain crime counts, the FBI, as we have seen, sometimes makes use of information outside of the primary data series to make predictions for missing values. Part of our exploratory data analysis involved looking for variables that might be used to help us make better estimates for missing values.

Studying the causes of crime is the main concern of many researchers in the social sciences. For our research, we are not as interested in the individual causes of crimes as we are in making the best possible predictions of crime counts, using as much information as we can. In CIUS, the FBI notes that many factors affect crime volume and type. The following is a list of factors related to crime taken from CIUS:

Population density and degree of urbanization; variations in composition of the population, particularly youth concentration; stability of population with respect to residents' mobility, commuting patterns, and transient factors; modes of transportation and highway system; economic conditions, including income, poverty level, and job availability; cultural factors and educational, recreational, and religious characteristics; family conditions with respect to divorce and family cohesiveness; climate; effective strength of law enforcement agencies; administrative and investigative emphases of law enforcement; policies of other components of the criminal justice system (i.e., prosecutorial, judicial, correctional, and probational); citizens' attitudes toward crime; crime reporting practices of the citizenry [FBI, 2000].

Many of these factors are difficult to quantify. For several of the factors that can be quantified, data are not readily available at the police agency level. Still, understanding the crime problem and its causes may help us to decide on factors which are effective and convenient for use in our models.

Restricting ourselves to information available in the UCR, we may first consider

the following classes of variables: time, location, and population. Time variables

include sequential monthly time points and cyclical time factors given by months of

the year or by seasons. Location variables are hierarchical in the order of region,

state, county, and individual agency. Population variables include annual popula-

tion, county urbanicity, and Group number (defined in Chapter 1), which also clas-

sifies zero-population agencies. Urbanicity is a measure proposed by Goodall et al.

[1998] which is calculated using the highest city population sizes in a county. Let

$\{P_1, P_2, \ldots, P_k\}$ be the population sizes for the $k$ largest places in a county of $n$ places. The authors consider the class of urbanicity measures defined by

$$U_{k,n} = 10 \times \log_2\!\left[\left(\sum_{i=1}^{k} P_i^2\right)^{1/2}\right],$$

but recommend $U_{3,n}$ as a measure of county urbanicity.
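As a small worked example (with made-up place populations, not actual Census figures), the urbanicity measure can be computed directly from its definition:

```python
# A small worked example (made-up place populations) of the urbanicity
# measure U_{k,n} defined above, using the k = 3 largest places.
import math

def urbanicity(place_populations, k=3):
    """U_{k,n} = 10 * log2( (sum of the k largest P_i^2)^(1/2) )."""
    largest = sorted(place_populations, reverse=True)[:k]
    return 10 * math.log2(math.sqrt(sum(p ** 2 for p in largest)))

# A hypothetical county whose largest places have 40,000, 12,000,
# and 5,000 residents:
print(round(urbanicity([40_000, 12_000, 5_000]), 1))
```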

If we focus on one particular time series, then crime type, region, state, and county

are all constant. These may seem to be of little use in the preliminary stage of our

analysis, but in Chapter 5 we will describe how they can be utilized to formulate a hierarchical structure that combines many series of UCR data together. Population size, Group number, and urbanicity are not constant over time, but have relatively small within-series variation compared to between-series variation. Population, Group number, and urbanicity, therefore, are not suitable variables for grouping series together, because these variables can change within a time series.

In the SARIMA time series models described previously, trend and seasonality do not enter as explanatory factors, as they would in mixed-effects models, but rather are captured through the dependence structure of the data. We have already seen in the CDA that it is possible to include trend and seasonality in a model as explanatory variables. If crime is thought to increase linearly with time, then a linear trend component can be modeled by adding a variable for time $(1, 2, 3, \ldots, n)$. A higher-order polynomial of time can model a more complicated trend pattern. We use orthogonal Hermite polynomials for time in the models we describe in Chapter 5, to remove multicollinearity. Including seasonality in the model as an explanatory variable can be done by using indicators for month or season

(using factors) or by using harmonic terms (e.g., $\sin(2\pi t/12)$, $\cos(2\pi t/12)$) as predictors. As described earlier, the FBI's imputation method uses crime rates from "similar" agencies with the same Group number and in the same state. Taking this idea of using data from similar agencies to improve estimates from our models, we considered different selection criteria for possible use in our imputation methods. We suspected that correlation across series may be a useful variable in selecting a donor agency. We studied this theory with agencies from Iowa. We wanted our donor agency to be in the same state and to have complete data. Out of Iowa's eighteen most complete agencies for the Part I crime index (NDX), which is the sum of seven Part I Index crimes, the

Pairwise Pearson correlations of the monthly NDX series (lower triangle of the scatterplot matrix in Figure 2.7):

              Des Moines   Fort Madison   Ankeny   Warren
Fort Madison     0.70
Ankeny           0.77          0.56
Warren           0.70          0.55         0.58
Davenport        0.68          0.57         0.70      0.54

Figure 2.7: Most correlated agencies in Iowa. Plotted are monthly NDX crimes. Pearson correlations are shown in lower diagonal.

scatterplot matrix of the five most correlated agencies is shown in Figure 2.7. We see that fairly strong relationships exist among the aggregate time series across different agencies in Iowa. More interesting still is the diversity in these agencies. The five most correlated agencies represent five different Group numbers. Des Moines, a Group 2 agency, represents the largest population, but Groups 5 (Fort Madison), 4 (Ankeny),

9 (Warren), and 3 (Davenport) are also represented. Warren, which is in Group 9, is

a county agency. It is probably true that actual crime levels are more similar within population Groups, but this example from Iowa suggests that Group number may not be as important for selecting predictor agencies in a model-based imputation method as correlation could be. We use this idea for donor agencies as we construct our

Poisson regression models in Chapter 4.

Another clear choice for a possible explanatory variable is geographical distance.

This idea is illustrated in the example above because Des Moines, Ankeny, and Warren are geographically close to each other, near the middle of the state, and are quite correlated with each other. Spatial variables, for inclusion in the model or for donor agency selection, potentially have an important role in imputation procedures. Nearest neighbor imputation concepts may apply, at a fundamental level, to many of the explanatory variables we have considered in this section. Geography, although not pursued in this dissertation, would be very interesting for future work in crime imputation and would require much research and investigation.

2.3 Missingness in the UCR

We have started to model complete, observed UCR data series, but recall that we are more concerned with the data that are unobserved. Before we can work toward developing an imputation strategy, we must first understand the missing data problem. The issue of missing data in the UCR is described briefly in Chapter 1, but an exploratory analysis of the missing data gives us a closer look into the origins, behaviors, and implications of the missingness.

2.3.1 Explaining Missingness

One of the initial difficulties encountered in examining missingness in the UCR data was determining whether a datum was missing or merely zero. First, not all

“missing” data in the UCR is truly missing. Data may appear to be missing because:

1. The agency had not (yet) begun reporting data to the FBI because it did not exist (or did not have a crime reporting unit) at that time.

2. The agency ceased to exist (it may have merged with another agency).

3. The agency existed, but reported its crime and arrest data through another agency (i.e., was "covered by" another agency).

4. The agency existed and reported data for that month, but the data were aggregated so that, instead of reporting on a monthly basis, the agency reported consistently (for some length of time) on a quarterly, semiannual, or annual basis.

5. The agency existed and reported monthly data in general, but missed reporting for one month and compensated for the omission by reporting, in the next month, for both months.

6. The agency had few or no crimes to report and didn't see the need to send in a report filled with zeroes.

7. The agency did not submit data for that month (a true missing datum).

The clean UCR data [Maltz and Weiss, 2006] is now coded to indicate if a UCR report is missing because it is from a nonexistent agency, covered-by another agency, aggregated (usually 3, 6 or 12 months), or truly missing.

Agencies that did not exist until after 1960 require no imputation during the

unobserved period of time at the beginning of the series. Similarly, we do not need

to impute for any months after an agency has ceased to exist, since it is impossible

to observe crime reports from a non-existent agency. If we have missing data at the

beginning or end of a series, sometimes it is difficult to determine whether the agency

existed at all during that time period, even with the “clean” data. It is for this reason

that we have decided not to impute for months that are missing at the beginning or

end of any agency’s series.

When an agency’s crime counts are included in a second agency’s reports, the second agency is said to be “covering” the first agency, and the first agency is said to be “covered by” the second agency. Since such covered-by crimes have been counted and reported, we do not need to impute them for the covered agency if we are only in- terested in aggregate crime statistics. Still, we cannot treat agencies with covered-by months or covering months the same as other agencies. If a covering agency has peri- ods of covering surrounded by periods of non-covering, then we might have a problem when we try to model the series. The months that are covering another agency will need a different model from the months that are not covering. Fortunately, covered- by agencies usually have much smaller crime counts than the covering agencies, so the covering agencies’ series are not greatly affected. We discuss solutions for dealing with the “covering” issue in Chapter 4.

Aggregated crime counts also are a type of missing data. Even if we know the total crime count for a year, we may not know the crime count for each of the twelve months separately. When a UCR series includes aggregated data, we must be careful not to treat the months with aggregated crime counts the same as months that have been

reported individually. In a time series plot, an aggregated crime count will appear like a large outlier. These extreme observations may be troublesome to include in our model estimation, even though they provide critical information about the missing values. Special techniques need to be used to deal with cases of aggregation and these are described in Chapter 4.

2.3.2 Patterns of Missingness

Missing data patterns are characterized by duration and location. To study the duration of missing periods Maltz [2006] analyzed the lengths of runs of missingness and how they vary by year, state, and by agency size and type. With regard to the time indexes (location) of missing data in the time series, Maltz [2006] has also spent much effort analyzing the reporting trends over time.

Figure 2.8, produced by Michael Maltz [2006], shows the overall pattern of missingness for all states and all years. We see from this plot that missing gaps of only one month's duration occur most often, and about 70% of the missing gaps are ten months or fewer. From an imputation standpoint, shorter runs of missingness are much easier to handle and have smaller prediction uncertainty. This graph also provides evidence that some series are entirely missing. Some agencies have never submitted

UCR data to the FBI. Any imputation for missingness of this type must be based on a procedure that borrows information from another source, for example, hot-deck techniques.

The typology of agencies, or Group classification, used by the FBI, is described in Table 1.1 in Chapter 1. As can be seen, the typology is fairly simple, and agencies

Figure 2.8: Number of cases of different gap sizes from missing data. The horizontal axis is scaled logarithmically to highlight the shorter runs. (Produced by Michael Maltz)

may change their Group designation as their population changes. As shown in Table 1.1, Groups 1–5 represent cities of different sizes, but Groups 6–9 represent both cities and other types of jurisdictions, including university, county, and state police agencies, as well as fish and game police, park police, and other public law enforcement organizations. Figure 2.9, also produced by Michael Maltz [2006], depicts the missingness patterns for the different Groups. The largest agencies (Groups 1–4) have the least missingness, while smaller agencies (Groups 5–9) have the most missingness. Interestingly, Group 1 agencies are almost entirely complete. Furthermore, the patterns of missingness are very similar for Groups 6–9, as can be seen in the log-log relationship between gap size and number of cases. One possible reason for the difference in overall missingness between large agencies and small agencies is that more effort on the part of the government is spent in encouraging large agencies to

Figure 2.9: Number of cases of different gap sizes from missing data for different FBI Groups, plotted on logarithmic scales. (Produced by Michael Maltz)

submit UCR reports. This makes sense because large agencies are responsible for a substantial contribution to the FBI’s national summary statistics on crime.

Less than one-fifth of all missing periods are single month occurrences. The large majority of cases of missingness involve two or more consecutive months of missing data. This is an important observation for developing our imputation procedure, and especially our variance estimation, because predictions made for months in the middle of a large gap will be less precise. This has never before been accounted for by the

FBI in their imputation procedure. Missing value inferences that take gap size into account are potentially very useful.

2.3.3 Missing Data Assumptions

To further understand and characterize our problem, we first consider the missing data mechanisms. There are three types of commonly-used missing data mechanisms: missing at random (MAR), missing completely at random (MCAR), and not missing at random (NMAR) [Little and Rubin, 2002]. If missingness does not depend on any data values, then the data are MCAR. If missingness only depends on observed data then the data are MAR. If missingness does depend on the missing values, then the data are NMAR. Most missing data methods require at least the assumption of MAR.

Before we continue any further in the analysis of the UCR data, we must first make a claim about the missing data mechanism and provide supporting evidence for the claim. We assume that the data are missing at random. This is a necessary assumption for many imputation methods, and it is also a very common assumption to make, despite a lack of concrete evidence. We conduct no test for the MAR mechanism, nor can we prove its validity beyond doubt, but we present here the arguments in favor of its acceptance. MAR requires that missingness is independent of the missing values, but allows missingness to depend on the observed data. Let us recall the variables to which missingness might be related. We have already seen that smaller agencies have more missingness, but agency size is an observed variable. In fact, since we have separated the data by agency, we need only impose MAR for an individual series. Within a series, missingness might be related to time or season, both of which are observed. Missingness could be related to something unobserved, such as personnel changes or system changes, but these bear little relation to crime. Perhaps missingness is related to the actual missing values themselves. The possibility that an agency does not report UCR data because it has little or no crime to report is

a threat to the MAR assumption. Considering an NMAR mechanism is an area for future research, but for now, assuming the data are MAR is a reasonable first step and an improvement over what has been done previously.

We conclude this chapter by summarizing the foundations for our imputation strategy. We will build a model of the data that uses all of the information from the data source. In particular, we will model the relationships among the groups of data and the dependency in the data exhibited over time and within seasons. Our approach to imputation is based on prediction from the data models and on the assumption that the data are missing at random.

CHAPTER 3

LITERATURE REVIEW

Our next step in the process of developing a procedure to estimate missing values in multiple time series of counts is to study prior work on these topics. With an open, yet critical, mind we can learn from others' past research. What we find in one journal article or textbook will not be the answer to all of our questions, but our goal in this chapter is to gather relevant pieces that can eventually be put together to create sensible methods for imputation.

Our imputation approach, as mentioned earlier, will first involve fitting an appropriate model, or models, to the crime series. The SARIMA model in Chapter 2 applies only to a small portion of series in the UCR, namely count data that can be modeled approximately with a Gaussian process. An important aspect of our research is in constructing time series models for count data. Not surprisingly, most developments in time series research apply specifically to data that are approximately normal. In this chapter, we review time series models for count data found in the literature. Most commonly, these models are based on Poisson regression, which we describe in Section 3.1. Following our discussion of models in Section 3.2, we review techniques for dealing with missing data in time series in Section 3.3.

3.1 Poisson Regression

To begin constructing a regression model, let us first make explicit some basic

assumptions about our data. We have already described the UCR data as monthly

crime counts. Typically, the term “count data” is reserved for discrete data from

the sample space of all non-negative integers with no finite upper limit. This defined

sample space is characteristic of Poisson data, which records the number of occur-

rences of a particular event in a given time interval. For a Poisson random variable,

we assume that successive events occur independently and at the same rate. The

probability mass function for a Poisson random variable, Y , is given by the formula:

$$P(Y = y \mid \mu) = \frac{e^{-\mu}\mu^{y}}{y!} \quad \text{for } y = 0, 1, 2, \ldots \qquad (3.1)$$

As can be seen by Eq. 3.1, the Poisson distribution is defined by one positive real- valued parameter, µ, which is the rate at which the events occur. We will use this common discrete distribution as the foundation on which to build our models for

UCR crime counts.

In Chapter 2, we introduced the SARIMA model, which is a model for Gaus- sian processes. We applied the SARIMA modeling techniques to the UCR data even though the data, being discrete, can never be exactly normal. Asymptotically, how- ever, the Poisson distribution is related to the normal distribution. As the mean of a Poisson random variable increases, its distribution becomes approximately normal

[McCullagh and Nelder, 1989, page 195]. We have seen evidence of this from our preliminary analysis of data from Columbus in Chapter 2. For a small subset of the

UCR data, when crime counts are large enough, the data is approximately normal and the SARIMA model is reasonably appropriate. In this chapter, however, we begin

to develop a more universally applicable modeling strategy for the UCR data, based on the Poisson distribution.

An important feature of the Poisson distribution is the mean-variance relationship.

Both the mean and variance are equal to the rate parameter, µ. The mean-variance relationship, E(Y ) = var(Y ), must be taken into careful consideration when we con- struct our regression model for the crime counts.

In the UCR data, we see that the variability in crime counts increases as the crime level increases. This feature of the data supports the idea of making the Poisson distribution a reasonable model for the crime counts. The phenomenon of differing variances, known as heteroskedasticity, is a common problem in regression analysis. Ordinary least squares (OLS) regression requires the assumption that the errors are independently and identically distributed normal random variables. If we assume that our data are realizations of Poisson random variables, then the model errors are not identically distributed for each observation because the variance increases with the mean. A common remedy for heteroskedasticity in OLS regression is to make a log transformation of the response data. This is not recommended for count data for two reasons. First, if $y = 0$, then $\log(y)$ is undefined. Second, we want to predict $E(y)$, but $\exp(E[\log(y)]) \neq E(y)$, even though $\exp(\log(y)) = y$. Instead, we turn to Poisson regression, which falls under a broader class of models, namely the log-linear model, a type of generalized linear model (GLM).

McCullagh and Nelder [1989] give a thorough description of GLMs, including log- linear models in Chapter 6. The purpose of the generalized linear model is to extend linear modeling to cases for which the normal distribution is not appropriate. The

GLM can be used when data are not normal, and it can be used when the relationship

between the mean and the predictors is not linear. The GLM has three parts: the

distributional assumption, the link function, and the systematic component. The

distributional assumption, (3.2), is used to specify a distributional family for the data,

and the link function, (3.3), links the mean to the systematic component, which is

the linear predictor. Let Y represent a collection of independent random variables,

$\{Y_t : t = 1, 2, \ldots, n\}$. Then a GLM is specified by

$$Y \sim f(y \mid \theta) \qquad (3.2)$$

and

$$g(E[Y]) = X\beta, \qquad (3.3)$$

where θ represents the distribution parameters, g is a known link function, X is the

design matrix, and β is a vector of regression parameters.

Now we begin to construct a Poisson regression model by specifying the three

GLM parts. Let the $\{Y_t \mid t = 1, 2, \ldots, n\}$ be random variables which are independent conditional on past and present information that we will define later. Also, let $X_{k,t}$ be the $k$th covariate at time $t$. Then the model is given by

$$Y_t \sim \text{Poisson}(\mu_t) \quad \text{for } t = 1, 2, 3, \ldots, n \qquad (3.4)$$

with

$$\log(\mu_t) = \beta_0 + \beta_1 x_{1,t} + \cdots + \beta_k x_{k,t}. \qquad (3.5)$$

The random component of the GLM, (3.4), puts the Poisson distributional assumption

on the data, and the link function, (3.5), which links the Poisson mean to the linear

predictor, is typically the natural logarithm. The interpretation of the regression

parameter, βk, involves the multiplicative effect of the explanatory variable on the

expected response. Holding all other independent variables constant, as $x_k$ increases one unit, the mean of $y$ changes by a factor of $e^{\beta_k}$. A positive value of $\beta_k$ indicates a

positive relationship between Y and Xk.
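As a concrete illustration of the Poisson GLM in (3.4)–(3.5), the following minimal sketch (assumed library and simulated data, not UCR output or code from this dissertation) fits such a model with a linear trend and the harmonic seasonal terms mentioned in Chapter 2 as the covariates $x_{k,t}$.

```python
# A minimal sketch (assumed library and simulated data, not UCR output)
# of the Poisson GLM in (3.4)-(3.5) with a linear trend and harmonic
# seasonal terms as the covariates x_{k,t}.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 240
t = np.arange(n)
X = np.column_stack([
    np.ones(n),                   # intercept (beta_0)
    t / 12.0,                     # linear trend, in years
    np.sin(2 * np.pi * t / 12),   # harmonic seasonal terms
    np.cos(2 * np.pi * t / 12),
])
true_beta = np.array([1.5, 0.05, 0.3, -0.2])
y = rng.poisson(np.exp(X @ true_beta))

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.params)          # estimates of beta
print(np.exp(fit.params))  # multiplicative effects exp(beta_k)
```

The exponentiated coefficients illustrate the multiplicative interpretation of $e^{\beta_k}$ described above.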

One method used to estimate the regression parameters of a GLM is called maxi-

mum quasi-likelihood estimation, introduced by Wedderburn [1974]. Quasi-likelihood

functions are especially useful in the cases where the likelihood functions are difficult

or impossible to construct. The advantage of quasi-likelihood estimation is that we

do not need to specify a distribution for the observations. It is enough to define the

variance of the observations as proportional to some function of the mean. After

defining this relationship, $\operatorname{var}(y_t) \propto V(\mu_t)$, the quasi-likelihood $Q(\mu_t; y_t)$ for each $t$ is defined in Wedderburn [1974] by the relation

$$\frac{\partial Q(\mu_t; y_t)}{\partial \mu_t} = \frac{y_t - \mu_t}{V(\mu_t)}. \qquad (3.6)$$

Wedderburn [1974] discovered that quasi-likelihood can be used to estimate GLMs for any choice of link functions and variance functions. For the GLM, when the likelihood is of Aitkin form [McCullagh and Nelder, 1989, Equation 2.4], the quasi-likelihood estimation method leads to the same estimates as the maximum likelihood method for independent data [Wedderburn, 1974]. For example, if we have independent Poisson data with the common mean, $\mu$, then $V(\mu) = \mu$, and the quasi-likelihood function is

$$Q(\mu; \mathbf{Y}) = \sum_{t=1}^{n} \int_{Y_t}^{\mu} \frac{Y_t - w}{w}\, dw = \left(\sum_{t} Y_t \log(\mu)\right) - n\mu. \qquad (3.7)$$

Using the corresponding score equation, we derive the maximum quasi-likelihood estimate $\hat{\mu} = (1/n)\sum_t Y_t$, which is equivalent to the maximum likelihood estimate. We can estimate the regression parameter, $\beta_k$, by solving the estimating equation

$$S(\beta_k) = \sum_{t=1}^{n} \left(\frac{\partial \mu_t}{\partial \beta_k}\right)' V(\mu_t)^{-1} \left\{Y_t - \mu_t(\beta_k)\right\} = 0. \qquad (3.8)$$

S(βk) is called a quasi-score function because its integral with respect to βk is the

quasi-likelihood function.

The Poisson regression model is useful if we wish to understand the relationship between Y and covariates X. Our data, however, have the additional characteristic of being dependent over time. Unless our model addresses the autocorrelation that exists in our data, it will most likely be inadequate. McCullagh and Nelder [1989, Sections

9.2–9.3] offer some discussion of the use of GLMs for time series. The quasi-likelihood method is particularly useful in the case of dependent data, and the quasi-likelihood estimating equations are used in the same way as in the case of independent obser- vations [McCullagh and Nelder, 1989, Section 9.3]. For example, Li [1994] uses the quasi-likelihood technique of obtaining the maximum likelihood estimates iteratively by Fisher scoring for a Poisson GLM in which µt depends on past observations. We can see, therefore, that the quasi-likelihood method is ideally suited for estimating

Poisson regression models for dependent data.

GLMs for dependent data are also described in detail in Diggle et al. [1994,

Chapters 7-10]. The authors distinguish three types of GLMs for longitudinal data: marginal models, random effects models, and transition models. Most interesting to our example, at this point, is the transition model, which is a GLM where the past outcomes are treated as predictor variables. We will describe transition models in more detail in Section 3.2.

Before we begin our review of log-linear models for dependent data, let us first discuss two additional options for the GLM. It is common, in Poisson regression, to consider situations when an offset might be useful, and situations when a dispersion parameter should be estimated. We now explain these ideas in detail.

3.1.1 Modeling Rates

When our response data have an associated index variable, we might find it appealing to model standardized rates rather than counts. For example, in the UCR, we may be interested in modeling the crime rate per 1,000 people in the population. Here, the population size is an index variable for the crime count. For Poisson regression, an offset is used to model rates. An offset is added to the model like an explanatory variable with a coefficient equal to 1. Since the coefficient is equal to 1, the index variable is directly proportional to the response variable. The following equation shows how an offset is typically added to a log-linear model as in Agresti [1996]. Retaining the same model as defined above in Section 3.1, we replace Equation 3.5 by

$$\log(\mu_t) = \log(z_t) + \beta_0 + \beta_1 x_{1,t} + \cdots + \beta_k x_{k,t},$$

or equivalently by

$$\log\!\left(\frac{\mu_t}{z_t}\right) = \beta_0 + \beta_1 x_{1,t} + \cdots + \beta_k x_{k,t}.$$

Here, log(zt) is the offset in our model, where zt is the population size in thousands

at time t. We include the offset if we believe that crimes increase at the same rate

as population size. Our model, therefore, could be designed to predict the average

number of crimes per 1,000 people in the population for a given month, t, though we

do not pursue this strategy presently.
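A minimal sketch of this idea (assumed library, simulated data) adds $\log(z_t)$ as an offset so that the fitted coefficients describe crimes per 1,000 residents:

```python
# A minimal sketch (assumed library, simulated data) of the offset idea:
# log(z_t), with z_t the population in thousands, enters with a fixed
# coefficient of 1, so the model describes crimes per 1,000 residents.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 120
t = np.arange(n)
pop_thousands = 50 + 0.1 * t                       # slowly growing population
X = sm.add_constant(np.column_stack([np.sin(2 * np.pi * t / 12),
                                     np.cos(2 * np.pi * t / 12)]))
rate_per_1000 = np.exp(0.5 + 0.2 * np.sin(2 * np.pi * t / 12))
y = rng.poisson(pop_thousands * rate_per_1000)

fit = sm.GLM(y, X, family=sm.families.Poisson(),
             offset=np.log(pop_thousands)).fit()
print(fit.params)   # beta_0 and the seasonal coefficients, on the rate scale
```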

3.1.2 Over-dispersion

The Poisson variance is equal to the mean, but in real data, this is not always the

case. If the variance exceeds the mean, we have over-dispersion, and must adjust our

model accordingly. There are a variety of reasons why we might encounter the problem

of over-dispersion. If our model does not appropriately pick up the crime pattern

over time, then we might underestimate the mean, thereby distorting the mean-

variance relationship. Over-dispersion can happen because the mean is a random

variable, not the same for every observation. As Brockwell and Davis [2002] point out,

over-dispersion can sometimes be explained by serial dependence in the data. Over-

dispersion should be investigated by examining residuals or goodness-of-fit statistics, or

even by looking at the estimated autocorrelation function.

There are a number of different ways to express over-dispersion in the model. The

two most common variance assumptions are: (i) $\operatorname{var}(y_t) = \sigma^2 \mu_t$, and (ii) $\operatorname{var}(y_t) = \mu_t + \mu_t^2/\nu$, where $\sigma^2$ and $\nu$ are constants [McCullagh and Nelder, 1989, Section 6.2].

In the first case, $\sigma^2$ is a dispersion parameter, which can be estimated using the Pearson $X^2$ in the following equation [McCullagh and Nelder, 1989], where $p$ is the number of model parameters and $\hat{\mu}_t$ is the estimated value of $y_t$:

$$\tilde{\sigma}^2 = \frac{X^2}{n - p} = \left(\frac{1}{n - p}\right)\sum_{t=1}^{n} \frac{(y_t - \hat{\mu}_t)^2}{\hat{\mu}_t}. \qquad (3.9)$$

For over-dispersed data, a scaling factor based on this estimated dispersion parameter can be used to adjust the (underestimated) standard errors of the model parameter estimates.
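The following short sketch (simulated data; the statsmodels attributes used are assumptions about that library, not material from this dissertation) computes the dispersion estimate in (3.9) for counts generated with deliberate extra-Poisson variation:

```python
# A short sketch of the dispersion estimate in (3.9).  The counts are
# simulated with extra-Poisson variation (negative binomial, variance
# mu + mu^2/5) so that over-dispersion is actually present.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
X = sm.add_constant(rng.normal(size=(n, 2)))
mu = np.exp(X @ np.array([1.0, 0.4, -0.3]))
y = rng.negative_binomial(n=5, p=5 / (5 + mu))     # mean mu, variance mu + mu^2/5

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
sigma2_tilde = np.sum((y - fit.fittedvalues) ** 2 / fit.fittedvalues) / (n - X.shape[1])
print("Pearson dispersion estimate:", sigma2_tilde)        # should exceed 1
print("statsmodels equivalent:", fit.pearson_chi2 / fit.df_resid)
```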

In the second case, the variance assumption naturally allows for a Bayesian inter- pretation of the model, and is appropriate if the degree of over-dispersion relative to

$Y_t$ depends on $\mu_t$. We can assume that the means, $\mu_t$, are random and follow a gamma distribution with mean $\mu_t$ and variance $\mu_t^2/\nu$ [Lawless, 1987]. Then the marginal distribution of the counts, $Y_t$, is the negative binomial distribution with mean $\mu_t$ and variance $\mu_t + \mu_t^2/\nu$. Conditional on $\mu_t$, the response variable has a Poisson distribution. The level of influence of the gamma variance in the Poisson mixture is seen in the index parameter $\nu$. Note that as $\nu$ goes to infinity, our model tends toward a Poisson distribution. Lawless [1987] demonstrates how to test that $Y_t$ is Poisson distributed versus a general negative binomial distribution.

Quasi-likelihood estimation methods are especially appropriate for over-dispersed data, because as noted earlier, a variety of variance-mean relations are allowed without the requirement of directly defining the distributions for the observations.

3.2 Time Series for Count Data

In general, Gaussian time series models are inappropriate for analyzing count

data. If the actual counts are not large enough to be approximated by continuous

variables, as in much of our data, then one solution is to specify the conditional

distribution of each observation by a generalized linear model. The conditioning here

is with respect to the past observations, means, or past and present covariates X.

For this section, in all the models that we discuss, we will assume that we are dealing

with small count data that is reasonably modeled by a Poisson distribution. Let

$\{Y_t\}$ be a time series of counts, let $X_t$ be a vector of covariates at time $t$, and let $D_t = \{X_1, X_2, \ldots, X_t, Y_1, Y_2, \ldots, Y_{t-1}\}$ be the past and present covariates as well as the past observations at time $t$. Let $\{Y_t \mid D_t : t = 1, 2, \ldots, n\}$ be a set of independent random variables. In most of the models we describe, the distributional assumption of the GLM is represented by

$$Y_t \mid D_t \overset{\text{indep.}}{\sim} \text{Poisson}(\mu_t), \quad \text{for } t = 1, 2, 3, \ldots, n,$$

where the Poisson probability mass function is given by (3.1).

Two classes of time series models considered in Chapter 2 are the ARIMA model

and the classical decomposition model. Now we present a broad class of models for

time series called the state-space class. State-space models consist of an observation equation and a state equation. The model defined below is taken from Brockwell and Davis [2002, Chapter 8]. Suppose that $\{W_t\}$ and $\{V_t\}$ represent uncorrelated white noise vectors. Let the observation $Y_t$ be $w$-dimensional and the state variable $X_t$ be $v$-dimensional. Also, let $\{G_t\}$ be a sequence of $w \times v$ matrices and let $\{F_t\}$ be a sequence of $v \times v$ matrices. The observation equation expresses the observation, $Y_t$, as a, typically linear, function of the state variable, $X_t$, plus noise:

$$Y_t = G_t X_t + W_t, \quad t = 1, 2, \ldots \qquad (3.10)$$

The state equation determines the state in terms of the previous state plus noise:

$$X_{t+1} = F_t X_t + V_t, \quad t = 1, 2, \ldots \qquad (3.11)$$

One advantage of the state-space approach is that we can model each component separately and then put them together to form an overall model. Both ARIMA models and classical decomposition models can be constructed using the state-space representation. Another advantage of the state-space representation is the unified estimation approach. Maximum likelihood estimates of any state-space model can be found using Kalman recursions [Brockwell and Davis, 2002]. We will see yet another benefit of using state-space models, when we discuss missing data techniques at the end of this chapter.
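As a small illustration of the observation and state equations (3.10)–(3.11), the following sketch simulates a scalar special case (a local-level model with $G_t = F_t = 1$); all values are illustrative assumptions, not anything fit to UCR data.

```python
# A minimal sketch simulating a scalar special case of (3.10)-(3.11):
# a local-level model with G_t = F_t = 1, so the observation is the
# latent state plus noise.  All values are illustrative.
import numpy as np

rng = np.random.default_rng(4)
n = 200
sigma_v, sigma_w = 0.3, 1.0      # state and observation noise std. deviations

x = np.empty(n)                   # state X_t
y = np.empty(n)                   # observation Y_t
x[0] = 0.0
for t in range(n):
    y[t] = x[t] + rng.normal(0, sigma_w)             # observation equation (3.10)
    if t + 1 < n:
        x[t + 1] = x[t] + rng.normal(0, sigma_v)     # state equation (3.11)

print(y[:5])
```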

In time series analysis there are two important types of state-space models as distinguished by Cox [1981]: “parameter-driven” and “observation-driven.” The dif- ference is found in the state equation. In a parameter-driven model, the state does not depend on the past observations, while in an observation-driven model, the state depends on past observations. In the case of count data, for both observation-driven

and parameter-driven models, the observations, conditional upon a specified mean

process, can be independent Poisson observations. One such class of dependent gen-

eralized linear models we consider are called the “generalized linear autoregressive

moving average” (GLARMA) [Davis et al., 1999] processes, which we now describe.

The GLARMA class of models uses link functions, characteristic of GLMs, to link

the mean of the data distribution to a linear predictor with both covariates and an

ARMA component. GLARMA models are GLMs that may include functions of past

observations as well as past conditional mean values. For Poisson data, the GLARMA

model will have the following form. For t =1, 2,...,n, let

$$Y_t \mid D_t \overset{\text{indep.}}{\sim} \text{Poisson}(\mu_t),$$
$$\text{where} \quad \log(\mu_t) = \alpha + x_t\beta + \sum_{i=1}^{p} \phi_i H_i(Y_{t-i}) + \sum_{i=1}^{q} \theta_i K_i(\mu_{t-i}), \qquad (3.12)$$

and Hi and Ki are known functions [Davis et al., 1999].

A GLARMA model combines elements of both parameter-driven and observation-

driven time series models. The autoregressive component corresponds to an observation-

driven model and the moving average component corresponds, in some sense, to a

parameter-driven model.

A third type of model we will discuss in this chapter is the generalized additive model, which is a time series model that is neither parameter-driven, nor observation- driven. Instead, the time dependence is modeled through smooth functions of covari- ates.

The models we describe, like most time series models, can be checked by a resid- ual analysis which includes an examination of the dependencies of the residuals. If

modeled properly, the Pearson residuals, defined in McCullagh and Nelder [1989] as

$$r_t = \frac{Y_t - \mu_t}{\sqrt{V(\mu_t)}} \quad \text{for } t = 1, 2, \ldots, n,$$

should be approximately uncorrelated.

3.2.1 Observation-driven Models

Observation-driven models, which are defined explicitly in terms of past observations, are perhaps the most intuitive way of modeling time series data. A Markov chain model for discrete data and an autoregressive model for Gaussian data (a

Markov chain model for continuous data) are both examples of observation-driven models. The examples given in this section are all autoregressive in the sense that the autocorrelation structure depends directly on past observations. Conditional expectation of the outcome given the past depends explicitly on past values. Condi- tional rather than marginal moments are modeled because the data are dependent.

Observation-driven models for longitudinal data are described in Diggle et al. [1994,

Chapter 10] as transition models.

The models we describe in this section will have the following form: For t =

1, 2,...,n, let

$$Y_t \mid D_t \overset{\text{indep.}}{\sim} \text{Poisson}(\mu_t),$$
$$\text{where} \quad \log(\mu_t) = x_t\beta + \sum_{i=1}^{k} \theta_i f_i(D_t), \qquad (3.13)$$

and $\{f_i : i = 1, 2, \ldots, k\}$ are known functions. One of the simplest observation-driven models for count data, letting $f_1(D_t) = y_{t-1}$, is given below. For $t = 1, 2, \ldots, n$, suppose that

$$Y_t \mid D_t \overset{\text{indep.}}{\sim} \text{Poisson}(\mu_t),$$
$$\text{where} \quad \log(\mu_t) = x_t\beta + \theta y_{t-1}. \qquad (3.14)$$

This model is not preferred for a variety of reasons. Zeger and Qaqish [1988] note that model 3.14, when $x_t\beta = \beta_0$, leads to a stationary process only when $\theta \le 0$. The authors also argue that under model 3.14, $\beta$ has a lack of interpretability. The rate,

µt, is only equal to exp(xtβ) when yt−1 = 0.

Instead, Zeger and Qaqish [1988] suggest using the following model: For t =

1, 2,...,n, let

$$Y_t \mid D_t \overset{\text{indep.}}{\sim} \text{Poisson}(\mu_t)$$
$$\text{with} \quad \log(\mu_t) = x_t\beta + \sum_{i=1}^{q} \theta_i \left[\log(y^*_{t-i}) - x_{t-i}\beta\right], \qquad (3.15)$$

where $x_t$ is a vector of covariates and $y^*_{t-i} = \max(y_{t-i}, c)$ for a constant $c > 0$. Under this model,

positive autocorrelation is represented by positive values of θ1, and negative auto-

correlation is represented by negative values of θ1. The parameter c determines the

probability that yt > 0 given yt−1 = 0 because

$$P(y_t > 0 \mid y_{t-1} = 0) = 1 - \exp\!\left(-c^{\theta} e^{\beta(x_t - \theta x_{t-1})}\right).$$

We can see that when $c$ gets closer to zero, $P(y_t > 0 \mid y_{t-1} = 0)$ gets closer to zero, under positive autocorrelation.

Zeger and Qaqish [1988] give three reasons for preferring the function fi(Dt) =

$\log(y^*_{t-i}) - x_{t-i}\beta$ to $f_i(D_t) = y_{t-i}$. The first is that $\beta$ can be interpreted as the proportional change in the marginal expectation of $y_t$ per unit change in $x_t$, since the marginal mean for each $t$ is approximated by $E(y_t) \approx \exp(x_t\beta)$. The second

is that positive and negative associations are possible. If fi(Dt) = yt−i, then only

negative association is possible in the absence of covariates. The third reason for

the preference is that βˆ and θˆ1 are approximately orthogonal, which results in more

accurate estimation because of the implied independence in the coefficients.
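To make the observation-driven recursion concrete, the following minimal sketch (parameter values and data are purely illustrative, not from the UCR) simulates model (3.15) with $q = 1$ and the truncation $y^*_{t-1} = \max(y_{t-1}, c)$:

```python
# A minimal sketch simulating the observation-driven model (3.15) with
# q = 1 and the truncation y*_{t-1} = max(y_{t-1}, c).  The covariate
# term x_t beta and all parameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(5)
n, theta, c = 300, 0.4, 0.5
t = np.arange(n)
xb = 1.0 + 0.002 * t                 # x_t beta: intercept plus a slow trend

y = np.empty(n)
y[0] = rng.poisson(np.exp(xb[0]))
for i in range(1, n):
    y_star = max(y[i - 1], c)
    log_mu = xb[i] + theta * (np.log(y_star) - xb[i - 1])
    y[i] = rng.poisson(np.exp(log_mu))

print(y[:12])
```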

Perhaps it makes more sense, in some situations, for the mean at time $t$, $\mu_t$, to depend on past means, rather than only on past observations. This could be viewed as both an observation- and parameter-driven approach. Li [1994] considers a modeling approach that uses an analog of the classical moving average component with a GLM. The model of Zeger and Qaqish [1988] is extended by including a moving average component. Li

[1994] effectively enlarges Dt to include µt−1,...,µt−k for some k < n. The following

model is an example: For each t =1, 2,...,n, we assume

$$Y_t \mid D_t \overset{\text{indep.}}{\sim} \text{Poisson}(\mu_t),$$
$$\text{with} \quad \log(\mu_t) = \alpha + \theta\left[\log(y^*_{t-1}/\mu_{t-1})\right], \qquad (3.16)$$

and where $y^*_{t-1}$ is defined above. Li [1994] showed that realizations of this conditional

Poisson “moving average” model have an autocorrelation function that looks similar

to that of a first-order moving average process.

As an example in application, Li [1994] tried his approach by fitting four different

GLARMA models on counts of U.S. polio cases (data in Zeger [1988]). Two models

were second-order autoregressive models, and two were second-order moving average

models. The best-fitting model, (3.17), was the second-order moving

average model with trend modeled by βt. This model captures more of the auto-

correlation in the Polio data than Zeger’s autoregressive model. For t = 1, 2,...,n,

let

Y D indep. Poisson(µ ), t| t ∼ t

with log(µt)= α + βt + θ1[log(yt−1/µt−1)] + θ2[log(yt−2/µt−2)]. (3.17)

This is an example of how higher-order GLARMA models may be appropriate in some situations, and how using past means can sometimes improve the fit over an autoregressive model.

An alternative way that both the past mean and past observation can be incor- porated into f(Dt) is suggested by Davis et al. [2003]. The authors propose a model in which the log-mean process depends linearly on scaled deviations of previous ob- servations from their means. For t =1, 2,...,n, let

$$Y_t \mid D_t \overset{\text{indep.}}{\sim} \text{Poisson}(\mu_t),$$
$$\text{with} \quad \log(\mu_t) = x_t\beta + \theta\,(y_{t-1} - \mu_{t-1})\,\mu_{t-1}^{-\lambda}, \qquad (3.18)$$

where $0 < \lambda \le 1$ is a constant. For fixed $\lambda$, conditional maximum likelihood estimation of the model is straightforward, but the interpretation of the effect of covariates on the mean response is complicated. The advantage of this model is that it is stationary, even when the serial dependence is positive. Whereas the solution to nonstationarity in Zeger and Qaqish [1988] involves mean correction, Davis et al. [2003] use standardization.

Observation-driven models are usually better at forecasting than parameter-driven models, but it is more difficult to establish the asymptotic properties of observation- driven models [Davis et al., 1999]. We now take a closer look at parameter-driven models.

3.2.2 Parameter-driven Models

In parameter-driven models, a latent process $\{\epsilon_t\}$ characterizes the over-dispersion and the autocorrelation in $\{Y_t\}$. Conditional on the $\{\epsilon_t\}$, the observed counts are independent Poisson random variables with means

$$E(Y_t \mid \epsilon_t) = \exp(x_t\beta + \epsilon_t). \qquad (3.19)$$

We assume that the process $\{\epsilon_t\}$ is a strictly stationary time series. It might be chosen to be a Gaussian autoregressive moving average process. The autocorrelation in $\{Y_t\}$ must be less than or equal to the autocorrelation in $\{\epsilon_t\}$. One advantage of the parameter-driven model is that the $\{\epsilon_t\}$ can be defined to make $\mu_t = \exp(x_t\beta)$, so that $\beta$ has the interpretation of the proportional change in the marginal expectation of $Y_t$ given a unit change in $X_t$.
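A minimal sketch (illustrative parameter values only, not a fit to any real series) of simulating from the parameter-driven specification (3.19), with a stationary Gaussian AR(1) latent process:

```python
# A minimal sketch simulating from the parameter-driven model (3.19)
# with a stationary Gaussian AR(1) latent process {epsilon_t}; the
# parameter values are illustrative only.
import numpy as np

rng = np.random.default_rng(6)
n, phi, sigma = 300, 0.7, 0.3
t = np.arange(n)
xb = 1.0 + 0.3 * np.sin(2 * np.pi * t / 12)      # x_t beta with seasonality

eps = np.empty(n)
eps[0] = rng.normal(0, sigma / np.sqrt(1 - phi ** 2))    # stationary start
for i in range(1, n):
    eps[i] = phi * eps[i - 1] + rng.normal(0, sigma)

y = rng.poisson(np.exp(xb + eps))    # conditionally independent Poisson counts
print(y[:12])
```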

Parameter-driven models have been shown to be effective in accounting for serial

correlation in time series count data. Zeger [1988] uses a parameter-driven model with

first-order autoregressive errors on a time series of counts. Comparing the parameter-

driven model to log-linear models that assume independent observations, Zeger [1988]

found, through case study and simulation study, that the parameter-driven model,

which accounts for autocorrelation, leads to more valid inferences. Models that do not

account for autocorrelation tend to underestimate the standard errors of $\hat{\beta}$, the estimate of the parameter vector $\beta$.

Davis et al. [1999] offer a good review of time series models for counts. In compar- ing observation-driven models to parameter-driven models, the authors forewarn that parameter-driven models are difficult to forecast since serial dependence is modeled by the latent process, which is unobservable. Furthermore, there is no simple interpreta- tion for the latent process. Another drawback is that parameter estimation methods are typically more computationally intensive for parameter-driven models than for observation-driven models. One advantage of parameter-driven models is that the effects of the covariates on the response variable have straightforward interpretations.

As an example of an application of a parameter-driven model, Durbin and Koop-

man [2000] study the effect of seat-belt laws on road casualties using the following

model. Let Yt represent the count of road casualties for time t. For t = 1, 2,...,n, let

$$Y_t \mid D_t \overset{\text{indep.}}{\sim} \text{Poisson}(\mu_t),$$
$$\text{with} \quad \log(\mu_t) = x_t\beta + \gamma_t + \epsilon_t, \qquad (3.20)$$

where the trend $\epsilon_t = \epsilon_{t-1} + \eta_t$ is a random walk, $x_t$ is an indicator variable for the seat-belt law, and the monthly seasonal component $\gamma_t$ is generated by $\sum_{j=0}^{11} \gamma_{t-j} = \omega_t$. The processes $\{\omega_t\}$ and $\{\eta_t\}$ are mutually independent Gaussian white noise processes. For

the regression problem, it is desirable to construct a model with the property that µt

depends on the regression parameters but not on the latent process parameters. In

Chapter 5, we will begin to look at how parameter-driven models can be very useful

for the UCR data in a Bayesian setting.

3.2.3 Generalized Additive Models

Generalized additive models (GAM), introduced by Hastie and Tibshirani [1986],

replace the linear predictor in generalized linear models with an additive predic-

tor. That is to say, generalized additive models consist of a random component, an

additive component and a link function relating these two components. The addi-

tive component is usually defined by the sum of nonparametric smooth functions

of the covariates. Then, our response variable, Y , depends on predictor variables

$(X_1, X_2, \ldots, X_p)$ in a more nonparametric way. Hastie and Tibshirani [1986] estimate the smooth functions by the local scoring algorithm using scatterplot smoothers. In

the Poisson case, a generalized additive model could look like the following: For

$t = 1, 2, \ldots, n$, let

$$Y_t \mid D_t \overset{\text{indep.}}{\sim} \text{Poisson}(\mu_t)$$
$$\text{where} \quad \log(\mu_t) = s_0 + \sum_{j=1}^{p} s_j(X_t). \qquad (3.21)$$

In the above equation, the $s_j(\cdot)$'s are smooth functions and the vector $X_t$ contains the covariates for each $t$. Advantages of generalized additive models are flexibility and improved predictive ability [Hastie and Tibshirani, 1986]. GAMs are very useful if the effects of the covariates are nonlinear. For example, trend and seasonality can be modeled by the smooth functions.
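In the spirit of (3.21), the sketch below fits a Poisson model whose predictor is a smooth function of time, approximated here by an unpenalized B-spline basis through the patsy formula interface rather than the local scoring algorithm of Hastie and Tibshirani; the library usage and data are assumptions for illustration only.

```python
# A minimal sketch in the spirit of (3.21): a Poisson model whose
# predictor is a smooth function of time, approximated here by an
# unpenalized B-spline basis (patsy's bs()) rather than the local
# scoring algorithm; library usage and data are assumptions.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 240
t = np.arange(n)
mu = np.exp(1.0 + 0.5 * np.sin(2 * np.pi * t / 120))    # slowly varying trend
df = pd.DataFrame({"y": rng.poisson(mu), "t": t})

fit = smf.glm("y ~ bs(t, df=8)", data=df,
              family=sm.families.Poisson()).fit()
print(fit.fittedvalues.head())    # smooth fitted means over time
```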

Often in practice, a mixture of the GLM and GAM will be used. A typical model will include both a linear component and an additive component. In Dominici et al.

[2000] and in Dominici et al. [2004], a log-linear generalized additive model is used to estimate the relative change in the rate of mortality associated with changes in air pollution. This model includes smooth functions of calendar time and weather as predictors to account for trend and seasonality. For t =1, 2,...,n, let

$$Y_t \mid D_t \overset{\text{indep.}}{\sim} \text{Poisson}(\mu_t)$$
$$\text{where} \quad \log(\mu_t) = x_t\beta + \sum_{j=1}^{p} s_j(Z_t) \qquad (3.22)$$

and where $\beta$ is the regression parameter for the pollution covariates $x_t$ and the $s_j(\cdot)$'s are the smooth functions to be estimated for factors, $Z_t$, of time and weather.

In both the Markov models and the smoothing models, the interpretation of β is usually the same, but the interpretations of the seasonal and trend effects may be different. The smoothing functions, however, like nuisance parameters, are mainly used to adjust for confounding factors. The model is not autoregressive but, in

Dominici et al. [2004], the autocorrelation functions showed that the GAM did in fact account for time dependence in Y. We can conclude, therefore, that observation-driven or parameter-driven models are not the only ways to sufficiently account for serial correlation in time series data.

3.3 Missing Data in Time Series

In this section, we discuss two ways of dealing with missing data in time series.

Typical goals are to estimate the model parameters and to estimate the missing values.

We present the basic modeling strategies for time series with missing observations, and we describe how imputations can be made for the missing values themselves.

3.3.1 Modeling

To deal with missing values, Brockwell and Davis [2002] use state-space models and Kalman recursions. Minimum mean squared error (MMSE) state estimates are obtained using the Kalman fixed-point smoothing algorithm. This technique involves defining a new series $\{Y^*_t\}$, which is given a state-space representation which coincides with $\{Y_t\}$ for observed times and at unobserved times takes random values from a distribution independent of $\theta$. Applying Kalman recursions will let us find the likelihood function of the observed data, from which we can derive maximum likelihood estimates of the model parameters. This technique can also be used to estimate the missing values themselves.
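A minimal sketch of this idea, assuming statsmodels' state-space machinery (the local-level specification and the `smoothed_state` attribute are assumptions about that library, not the dissertation's method): observations entered as NaN are handled by the Kalman filter, and the smoothed state over the gap provides MMSE-style imputations that use data on both sides.

```python
# A minimal sketch, assuming statsmodels' state-space machinery: NaN
# observations are handled by the Kalman filter, and the smoothed state
# over the gap provides MMSE-style imputations using data on both sides.
# (This illustrates the idea; it is not the dissertation's procedure.)
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 120
level = 5.0 + np.cumsum(rng.normal(0, 0.2, n))
y = level + rng.normal(0, 0.5, n)
y_obs = y.copy()
y_obs[40:46] = np.nan                      # a six-month gap of missing data

model = sm.tsa.UnobservedComponents(y_obs, level="local level")
result = model.fit(disp=False)

imputed = result.smoothed_state[0, 40:46]  # smoothed level at the gap
print(np.round(imputed, 2))
print(np.round(y[40:46], 2))               # the held-out true values
```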

EM Algorithm

Little and Rubin [2002] and Brockwell and Davis [2002] discuss the EM algorithm approach to fitting time series models for incomplete data. This iterative approach,

which applies most suitably to the state-space models described in Section 3.2, pro-

vides maximum likelihood estimates, with the help of Kalman recursions, assuming

the data are MAR. The E-step, or expectation step, imputes the missing observations,

$Y_{\text{mis}}$, with conditional expected values $E(Y_{\text{mis}} \mid Y_{\text{obs}}, \theta^{(t)})$. The M-step, or maximization step, calculates the maximum likelihood estimates of the model parameters based

on the filled-in data from the E-step. After the final iteration of the EM algorithm,

not only are the model parameters estimated, but the missing values are estimated

as well. Little and Rubin [2002] apply the EM algorithm to data with missing obser-

vations from both an AR(p) model and a state-space model.

3.3.2 Imputation

We have already mentioned that there are techniques to estimate the model parameters, such as Kalman recursions and the EM algorithm, which also provide estimates of the missing values themselves. Let $Y_m$ be a value missing from the time series. Let $Y_a = \{Y_1, Y_2, \ldots, Y_{m-1}\}$ represent the data prior to the missing value and let $Y_b = \{Y_{m+1}, Y_{m+2}, \ldots, Y_n\}$ represent the data following the missing value. The best interpolator for the missing value, $Y_m$, in the time series is

$$\hat{Y}_m = E(Y_m \mid Y_a, Y_b). \qquad (3.23)$$

For simple models, and simple cases of missing observations, we can use a direct approach to estimating missing values by minimizing the mean square prediction error (MSPE) [Brockwell and Davis, 2002]. In more complicated situations, we can use Kalman filtering, or if we are willing to forfeit minimum MSPE, we can adopt a simple ad hoc strategy.

Abraham [1981] proposes a simple method of imputation for time series. The data

prior to the missing value is used to create a forecast estimate and the data

following the missing value is used to create a backcast estimate. The value imputed

is a linear combination of the forecast, $\hat{Y}_m^{(f)}$, and backcast, $\hat{Y}_m^{(b)}$, in the following equation, where the weights, $w_1$ and $w_2$, are generally functions of the parameters of

the known underlying time series process.

$$\tilde{Y}_m = w_1\, E(Y_m \mid Y_a) + w_2\, E(Y_m \mid Y_b) = w_1\, \hat{Y}_m^{(f)} + w_2\, \hat{Y}_m^{(b)} \qquad (3.24)$$

The weights, $w_1$ and $w_2$, are found by minimizing the MSPE of $\tilde{Y}_m$. For multiple

missing values, the author advises that if the missing values are not adjacent, then

there should be a sufficient number of observations between the missing values to

build a time series model. Abraham [1981] showed that for any first difference model,

the estimate of the missing value is the average of the forecast and backcast. Abraham

concluded that there was little effect of the imputed values, using his technique, on the

parameter estimates and on forecasts of future observations. For some simple models

(e.g. ARMA(1,1)), Abraham’s method produces minimum MSPE interpolators. In

general, however, the MSPE of the linear combination (3.24) is always at least as

large as the MSPE for the best interpolator. In Chapter 4, we compare the MSPE

for Abraham’s interpolator with that of the best interpolator. In Chapter 4, we will

also describe and illustrate how we use Abraham’s technique to make imputations for

missing values in the UCR.

CHAPTER 4

A METHOD FOR IMPUTING

In the late 1950’s, when the FBI first developed an imputation method to deal with missing data in the UCR, computing capabilities were very limited, so the method was designed to be simple to implement. Over the last sixty years, computing power has increased significantly, but the FBI has not changed its method for imputing. Since we are no longer limited to only using one year of data in our imputations’ calculations, we might consider what benefit a more complicated modeling approach can bring to our predictions. Our goal in developing a new UCR imputation method is to improve upon the FBI’s existing method by incorporating more available information into our estimates. By using an agency’s reported crime counts prior to and following a period of non-reporting, we can provide better estimates of the missing crime counts. We describe this strategy in Section 4.1.

The UCR time series vary considerably across agencies. Some agencies have a great deal of seasonal variation in crime counts while others have none. Some are in cities with high population growth while others are in cities that have experienced population declines. More importantly from our standpoint, some have high monthly crime counts, so that the probability of a zero count is vanishingly small, while others have sparse crime counts. Because of the high degree of variation in crime counts from

jurisdiction to jurisdiction, as well as from crime to crime within a jurisdiction, we employed different methods for different situations. Figure 4.1 shows an example from

Cleveland Heights, Ohio, which illustrates the variability among series. We can see that trend and seasonality are difficult to identify in the time series for murder counts, but the time series for larceny counts shows a visible crime pattern and seasonality.

From Figure 4.1, we also notice that there is some correlation among crime series within an agency (particularly between robbery and larceny), and we will investigate this relationship further in Chapter 5.

Fundamental to our procedure is the idea of fitting models to the data. Finding that a “one-size-fits-all” approach cannot be used for all UCR series, we have devel- oped a method which uses three different models, depending on the statistics of the agency and crime in question. For series that have high crime counts (such as larceny counts for Cleveland Heights), we found that a SARIMA time series model was most useful. For series with lower crime counts, we chose models appropriate for discrete data. For series with intermediate crime counts (such as robbery counts for Cleveland

Heights), we used a Poisson regression model. Otherwise (for series such as murder counts for Cleveland Heights), we imputed the mean monthly value averaged over all available data and assumed a Poisson distribution of crime counts. We describe our three-model method in Section 4.2. In Section 4.3, we discuss how to handle special cases in the data, including “covering” jurisdictions and aggregated crime counts.


Figure 4.1: Murder, robbery, and larceny counts from Cleveland Heights, OH.

4.1 Imputation Procedure

For the most part, our goal is to develop interpolation rather than extrapolation methods. Missing data at the beginning or end of a series might be a misrepresentation of a non-existent agency, and so to avoid confusion, we only make imputations for missing data within the time series. In other words, we are focusing on using data from the early and late periods of the time series to fill in the gaps in the middle.

In this section, we develop our procedure on the basis of three criteria. Our

imputation procedure must make reasonable sense, must provide plausible estimates,

and must be practical to implement. The logic of our strategy is described in Section

4.1.1, the evaluation is described in Section 4.1.2, and the implementation is described

in Section 4.1.3.

4.1.1 The Forecast–Backcast Method

As first discussed in Section 3.3, Abraham [1981] takes a simple imputation approach that uses forecasting and backcasting estimates from time series models. Forecasting is using prior data to predict the future, and backcasting is using later data to "predict" the past. For intermittent missing values, when forecasting and backcasting are combined, Abraham's method provides estimates with lower standard errors than an estimation procedure that is based on prior or subsequent data points alone.

Let $\hat{y}_t^{(f)}$ be the forecast and $\hat{y}_t^{(b)}$ be the backcast for time $t$, with $\mathrm{s.e.}(\cdot)$ representing the standard error of the estimate. We combine the two estimates using a weighted average. Our forecast-backcast predictor (for competing forecasts) is given by
$$\tilde{y}_t = \theta\, \hat{y}_t^{(f)} + (1 - \theta)\, \hat{y}_t^{(b)} \qquad (4.1)$$
with variance
$$\mathrm{var}(\tilde{y}_t) = \theta^2\, \mathrm{var}(\hat{y}_t^{(f)}) + (1 - \theta)^2\, \mathrm{var}(\hat{y}_t^{(b)}), \qquad (4.2)$$

assuming that the forecast and backcast are independent. The parameter $\theta$ can be derived to obtain the minimum mean square prediction error (MSPE) as in Diebold [1988]. The more general forecast-backcast estimator is
$$\tilde{y}_t = w_1\, \hat{y}_t^{(f)} + w_2\, \hat{y}_t^{(b)}. \qquad (4.3)$$

For simple time series models, such as AR(1), Abraham [1981] has shown that general predictors (4.3), based on forecasts and backcasts, can have minimum MSPE. In Section 4.1.2, we evaluate the accuracy of two competing predictors for a simple case.

For other models, such as the SARIMA model in Chapter 2, finding the best linear unbiased predictor (BLUP) is feasible but complicated. Taking an ad hoc strategy, we decided to give more weight in the combined estimate to the more accurate predictor (either forecast or backcast). The weighting was chosen so that the prediction with the lower standard error would be given greater weight in the composite estimate. Hence, we let
$$\theta = \frac{\mathrm{s.e.}(\hat{y}_t^{(b)})}{\mathrm{s.e.}(\hat{y}_t^{(f)}) + \mathrm{s.e.}(\hat{y}_t^{(b)})}$$
in Equations 4.1 and 4.2.
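
As a small illustration of this weighting scheme (not taken from the dissertation's own code; the function and variable names are hypothetical), the following R sketch combines a forecast and a backcast for one missing month using the standard-error-based weight defined above.

# Combine a forecast and a backcast for one missing month, weighting by
# their standard errors as in Equations 4.1 and 4.2.
combine_fb <- function(yhat_f, se_f, yhat_b, se_b) {
  theta <- se_b / (se_f + se_b)                       # more weight to the more precise predictor
  est   <- theta * yhat_f + (1 - theta) * yhat_b      # Equation 4.1
  v     <- theta^2 * se_f^2 + (1 - theta)^2 * se_b^2  # Equation 4.2
  list(estimate = est, se = sqrt(v))
}

# Example: a forecast of 52 crimes (s.e. 6) and a backcast of 47 crimes (s.e. 4)
combine_fb(yhat_f = 52, se_f = 6, yhat_b = 47, se_b = 4)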

4.1.2 Evaluating the Combined Estimator

To evaluate the forecast-backcast predictor against the best predictor, let us consider a simple AR(1) process with parameter $\phi$. Let $\{Y_1, Y_2, Y_3, \ldots, Y_n\}$ be a time series generated from the following data model, where $\phi$ is known, $\sigma^2 = 1$, and $Y_1 \sim N(0, \sigma^2/(1-\phi^2))$ is the stationary starting condition. Let
$$Y_t = \phi Y_{t-1} + Z_t \quad \text{for } t = 2, 3, \ldots, n,$$
where $\{Z_t \mid t = 2, 3, \ldots, n\} \stackrel{\text{indep.}}{\sim} N(0, \sigma^2)$. Suppose that $Y_k$ is missing for some $1 < k < n$. Then the best linear predictor of $Y_k$ has the form $\hat{Y}_k = a(Y_{k-1} + Y_{k+1})$ [Brockwell and Davis, 2002, Section 8.6]. The predictor is derived by finding $a$ to minimize the MSPE, which is $E(Y_k - \hat{Y}_k)^2$. Recall that $\gamma(\cdot)$ is the autocovariance function, and $\gamma(0)$ represents the variance of $\{Y_t\}$, which is $\sigma^2/(1-\phi^2)$. We then have
$$\begin{aligned}
\mathrm{MSPE}(\hat{Y}_k) &= E(Y_k - a(Y_{k-1} + Y_{k+1}))^2 \\
&= \gamma(0) - 2E(aY_k(Y_{k-1} + Y_{k+1})) + E(a(Y_{k-1} + Y_{k+1}))^2 \\
&= \gamma(0) - 4a\gamma(1) + a^2(2\gamma(0) + 2\gamma(2)) \\
&= \gamma(0)\left[1 - 4a\phi + 2a^2(1+\phi^2)\right]. \qquad (4.4)
\end{aligned}$$
Taking the derivative of (4.4) with respect to $a$, and setting it equal to zero, we solve

the following equation for $a$:
$$\gamma(0)\left[-4\phi + 4a(1+\phi^2)\right] = 0,$$
to get $a = \dfrac{\phi}{1+\phi^2}$. The second derivative, $4\gamma(0)(1+\phi^2)$, is positive, so this $a$ produces the minimum MSPE, and the best linear predictor of $Y_k$ is
$$\hat{Y}_k = \frac{\phi}{1+\phi^2}(Y_{k-1} + Y_{k+1}). \qquad (4.5)$$
This predictor has MSPE found by substituting $a$ into (4.4):
$$\begin{aligned}
\mathrm{MSPE}(\hat{Y}_k) &= \gamma(0)\left[1 - 4\left(\frac{\phi}{1+\phi^2}\right)\phi + 2\left(\frac{\phi}{1+\phi^2}\right)^2 (1+\phi^2)\right] \\
&= \gamma(0)\left(\frac{(1+\phi^2) - 4\phi^2 + 2\phi^2}{1+\phi^2}\right) \\
&= \gamma(0)\left(\frac{1-\phi^2}{1+\phi^2}\right) \\
&= \frac{\sigma^2}{1+\phi^2}. \qquad (4.6)
\end{aligned}$$

Using the forecast-backcast method of Abraham [1981], our predictor of $Y_k$ has the form
$$\tilde{Y}_k = c\,(\hat{Y}_k^{(f)} + \hat{Y}_k^{(b)}) \qquad (4.7)$$
since $\phi$ and $\sigma^2$ are assumed to be known. The forecast and backcast receive the same weight because there is only one missing observation. If we reversed the order of our AR(1) time series, then the backward series would still be AR(1) with the same parameters $\phi$ and $\sigma^2$. This results in $\hat{Y}_k^{(f)}$ and $\hat{Y}_k^{(b)}$ being of the same form. The forecast will be based on the observed value preceding the missing value,
$$\hat{Y}_k^{(f)} = E(Y_k \mid Y_1, \ldots, Y_{k-1}) = \phi Y_{k-1},$$
and the backcast will be based on the observed value following the missing value,
$$\hat{Y}_k^{(b)} = E(Y_k \mid Y_{k+1}, \ldots, Y_n) = \phi Y_{k+1}.$$
The forecast-backcast predictor is found by minimizing the MSPE of the combined estimator,
$$\mathrm{MSPE}(\tilde{Y}_k) = E\left(Y_k - c\,(\phi Y_{k-1} + \phi Y_{k+1})\right)^2. \qquad (4.8)$$
Here, we recognize from (4.4) that if $c\phi = a$, then we will again have the BLUP of $Y_k$.

Now, let us consider a simpler approach. Suppose that instead of finding the forecast-backcast estimator with the minimum MSPE, we decided only to take the simple average of the forecast and backcast, still assuming that we know $\phi$ and $\sigma^2$. The MSPE found by substituting $a = \phi/2$ into (4.4) is
$$\mathrm{MSPE}(\tilde{Y}_k) = \gamma(0)\left[1 - 4\left(\frac{\phi}{2}\right)\phi + 2\left(\frac{\phi^2}{4}\right)(1+\phi^2)\right] = \gamma(0)\left(1 - \frac{3\phi^2}{2} + \frac{\phi^4}{2}\right). \qquad (4.9)$$

Figure 4.2: Dashed line represents average forecast-backcast MSPE of one missing value in AR(1) process. Solid line represents the MSPE from using the BLUP of $Y_k$ given all the data.

Figure 4.2 compares the MSPE under the AR(1) model for the BLUP of $Y_k$ given all the data with the MSPE from the average forecast-backcast method, when $\sigma^2 = 1$. We can see that the average forecast-backcast method comes close to the accuracy of the best linear predictor, in the AR(1) case, when $\phi$ is close to 0, +1, or -1.
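
The two curves in Figure 4.2 can be reproduced directly from Equations 4.6 and 4.9. The R sketch below is an illustration only (it is not the code used to draw the figure); it evaluates both expressions with $\sigma^2 = 1$ over a grid of $\phi$ values.

# MSPE of the BLUP (Equation 4.6) and of the simple average forecast-backcast
# predictor (Equation 4.9) for one missing value in an AR(1) process, sigma^2 = 1.
phi       <- seq(-0.99, 0.99, by = 0.01)
gamma0    <- 1 / (1 - phi^2)                            # process variance gamma(0)
mspe_blup <- 1 / (1 + phi^2)                            # Equation 4.6
mspe_avg  <- gamma0 * (1 - 1.5 * phi^2 + 0.5 * phi^4)   # Equation 4.9

plot(phi, mspe_avg, type = "l", lty = 2, ylim = c(0.5, 1),
     xlab = "AR(1) Parameter", ylab = "MSPE")
lines(phi, mspe_blup, lty = 1)   # solid: BLUP; dashed: average forecast-backcast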

Now suppose that we had a longer period of missingness. Let $\{Y_{k-1}, Y_k, Y_{k+1}\}$ be missing from the time series. To get the best linear predictor for $Y_k$, we find $a$ to minimize the MSPE as shown below:
$$\begin{aligned}
\mathrm{MSPE}(\hat{Y}_k) &= E(Y_k - a(Y_{k-2} + Y_{k+2}))^2 \\
&= E(Y_k^2) - 2E(aY_k(Y_{k-2} + Y_{k+2})) + E(a(Y_{k-2} + Y_{k+2}))^2 \\
&= \gamma(0) - 2a(\gamma(2) + \gamma(2)) + a^2(\gamma(0) + 2\gamma(4) + \gamma(0)) \\
&= \gamma(0) - 4a\gamma(2) + 2a^2(\gamma(0) + \gamma(4)) \\
&= \gamma(0)\left(1 - 4a\phi^2 + 2a^2(1+\phi^4)\right). \qquad (4.10)
\end{aligned}$$
Differentiating (4.10) with respect to $a$ and setting it equal to zero, we solve the equation
$$\gamma(0)\left[-4\phi^2 + 4a(1+\phi^4)\right] = 0, \qquad (4.11)$$
to get $a = \dfrac{\phi^2}{1+\phi^4}$. Since the second derivative of (4.10), $4(1+\phi^4)$, is positive, this $a$ produces the minimum MSPE. Now, the BLUP of $Y_k$ is
$$\hat{Y}_k = \frac{\phi^2}{1+\phi^4}(Y_{k-2} + Y_{k+2}) \qquad (4.12)$$
with
$$\begin{aligned}
\mathrm{MSPE}(\hat{Y}_k) &= \gamma(0)\left[1 - 4\left(\frac{\phi^2}{1+\phi^4}\right)\phi^2 + 2\left(\frac{\phi^2}{1+\phi^4}\right)^2(1+\phi^4)\right] \\
&= \gamma(0)\left(1 + \frac{-4\phi^4 + 2\phi^4}{1+\phi^4}\right) \\
&= \gamma(0)\left(1 - \frac{2\phi^4}{1+\phi^4}\right). \qquad (4.13)
\end{aligned}$$

Now, applying the average forecast-backcast method, the best linear predictor of $Y_k$ in terms of $Y_{k-2}$ is $\hat{Y}_k^{(f)} = \phi^2 Y_{k-2}$, and the best linear predictor of $Y_k$ in terms of $Y_{k+2}$ is $\hat{Y}_k^{(b)} = \phi^2 Y_{k+2}$, so the average forecast-backcast predictor is
$$\tilde{Y}_k = \frac{\phi^2}{2}(Y_{k-2} + Y_{k+2})$$
with
$$\begin{aligned}
\mathrm{MSPE}(\tilde{Y}_k) &= \gamma(0)\left[1 - 4\left(\frac{\phi^2}{2}\right)\phi^2 + 2\left(\frac{\phi^2}{2}\right)^2(1+\phi^4)\right] \\
&= \gamma(0)\left(1 - 4\frac{\phi^4}{2} + 2\frac{\phi^4}{4} + 2\frac{\phi^8}{4}\right) \\
&= \gamma(0)\left(1 - \frac{3\phi^4}{2} + \frac{\phi^8}{2}\right). \qquad (4.14)
\end{aligned}$$

Figure 4.3: Dashed line represents average forecast-backcast MSPE of missing value in gap in AR(1) process. Solid line represents the MSPE from using the BLUP of $Y_k$ given all the data.

Figure 4.3 shows the comparison between the minimum MSPE under an AR(1) model and the MSPE from the forecast-backcast method of predicting $Y_k$, when $\{Y_{k-1}, Y_k, Y_{k+1}\}$ are missing and $\sigma^2 = 1$. In both figures, when $\phi \approx 0$, the MSPE is close to $\sigma^2$. Also, in both figures, when $\phi$ moves toward +1 or -1, the MSPE decreases because of the increasing correlation between values in the process. Still, Figure 4.3 looks quite different from Figure 4.2 for intermediate values of $\phi$.

When a missing value is surrounded on both sides by observed values, then we expect our MSPE to be smaller than $\sigma^2$ because we have more information, and we have greater dependence between values closer in the process. When a missing value is in the midst of a string of missing values, then the MSPE will be higher, because we have less certainty about values more removed from observed data. Based on the above analysis, we have decided that we will use a simple combination of forecasts and backcasts. Taking a Bayesian approach in Chapter 6 will allow us to directly estimate missing values using the entire observed portion of the data series.

Another way we have tested our proposed imputation method is to simulate missingness in UCR series by deleting known data points from the time series and then using various models to estimate the deleted data. We used mean square prediction error to assess the overall accuracy of our models' predictions. We also inspected the results of different imputation procedures by looking at plots of the observed series with the imputed values. Examples of this are shown in Section 4.2.
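
The deletion-based evaluation can be summarized in a few lines of R. The sketch below is a simplified illustration under stated assumptions: the function passed as impute_series is a hypothetical stand-in for whichever imputation method is being assessed, and the toy series and gap are made up.

# Evaluate an imputation method by deleting observed values and comparing
# the imputed values with the truth.
evaluate_imputation <- function(y, gap, impute_series) {
  truth          <- y[gap]
  y_missing      <- y
  y_missing[gap] <- NA                        # artificially delete a block of months
  y_imputed      <- impute_series(y_missing)  # must return a completed series
  err            <- y_imputed[gap] - truth
  c(bias = mean(err), mspe = mean(err^2))
}

# Toy example using mean imputation as the method under study
set.seed(1)
y   <- rpois(120, lambda = 20)
gap <- 61:72
evaluate_imputation(y, gap, function(z) { z[is.na(z)] <- mean(z, na.rm = TRUE); z })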

4.1.3 Imputation Algorithm

In the previous section, we showed that for the case of an AR(1) process, the average forecast-backcast method is not the best way, in terms of MSPE, to impute for missing values. In comparing the MSPE for both methods, we have seen that they are relatively similar, especially for φ close to 0, -1, or +1. There are some difficulties in finding the BLUP for a more complicated model, like the SARIMA model we have fit to UCR data. For simplicity in implementation, we have adopted the forecast-backcast method to make our imputations. The forecast-backcast method has been shown to be better than forecasting or backcasting individually.


Figure 4.4: Incomplete series of vehicle theft counts from Allen, OH.

We now describe the general structure of the algorithms we used to make imputations in the UCR data. It is important that our algorithm be able to accommodate most series with missingness. The characteristics of missingness, therefore, must be carefully considered before constructing an imputation algorithm. Some series we have seen are impossible, even for our algorithm, to handle. These are series that are either completely missing, or are only observed for a small subset of months, that are not necessarily even consecutive. These cases might be handled by methods such as hot-deck imputation techniques from survey research, which we do not employ here.

Multivariate methods, which use data from other agencies, can also be used in these situations. Figure 4.4 shows a typical time series with intermittent periods of missingness. In this example from Allen, Ohio, we see several missing gaps of different lengths. It is for series with this appearance of missingness that we will construct our algorithm. From here, we must decide what observed data we should use to impute for each gap.

One barrier to estimating model parameters is the presence of outliers. Many outliers in the UCR are suspected of being processing errors, and many others are the result of aggregated crime counts. Processing errors must be dealt with outside of this research, but adjusting for aggregated crime counts is described in Section 4.3. The goal of our procedure is not to remove or replace outliers, but we do exclude them during the modeling step, to provide better estimates for missing values.

For each missing gap, two models are fit, provided that enough data is available surrounding the missing period. The fitted model from data prior to the gap will be used to make forecasts, and the fitted model from data following the gap will be used to make backcasts. For each missing month, we then calculate a weighted average (described in Section 4.1.1) of the two imputations for our final estimate, reducing the variance from that obtained using a single estimate.

All imputation algorithms were written in the R statistical programming language. Model parameters are estimated in R using the standard techniques discussed in Chapter 2. The exact Gaussian log-likelihood for an ARIMA model is calculated in R by using Kalman filtering on the state-space representation of the process [R Development Core Team, 2008]. The fitting of the ARIMA model in R allows for missing values. GLM models are fit in R by using Iteratively Re-weighted Least Squares after omitting missing values. This is important because it allows us to use incomplete data on either side of the gap of interest, and it prevents us from using imputed data for some gaps to fill in other gaps.
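
A stripped-down version of this forecast-backcast step for a single gap might look like the R sketch below. It is only an outline under simplifying assumptions (a fixed SARIMA order and several years of data on each side of the gap); the production algorithm described above handles many additional cases, and the function name is hypothetical.

# Impute one gap by fitting a model before and after it, then averaging
# forecasts and backcasts with standard-error weights (Section 4.1.1).
impute_gap <- function(y, gap, order = c(1, 1, 1),
                       seasonal = list(order = c(1, 0, 1), period = 12)) {
  h      <- length(gap)
  before <- y[seq_len(min(gap) - 1)]
  after  <- y[seq(max(gap) + 1, length(y))]

  # Forecast from the model fit to data before the gap
  fit_f <- arima(before, order = order, seasonal = seasonal)
  fc    <- predict(fit_f, n.ahead = h)

  # Backcast: reverse the later data, forecast, then reverse the predictions
  fit_b   <- arima(rev(after), order = order, seasonal = seasonal)
  bc      <- predict(fit_b, n.ahead = h)
  bc_pred <- rev(bc$pred)
  bc_se   <- rev(bc$se)

  theta <- bc_se / (fc$se + bc_se)   # weight from Section 4.1.1
  list(estimate = theta * fc$pred + (1 - theta) * bc_pred,
       se       = sqrt(theta^2 * fc$se^2 + (1 - theta)^2 * bc_se^2))
}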

4.2 The Three Models

We have seen throughout the course of our research that SARIMA models are appropriate for series with large crime counts. We chose to implement our imputation procedure based on the SARIMA model for series with an average monthly crime count greater than 35. If the average monthly crime count for a series was less than 35, we used a Poisson GLM to model crime counts, and we based our imputations on this model. However, we saw that for series where crime counts were very low, trend and seasonality are sometimes impossible to identify. We decided to impute the observed mean for missing values within a series that had an average monthly crime count less than

1. We then used our imputation strategy to impute for missing gaps for every agency in Iowa. Based on a visual inspection of the predictions, in comparison with observed crime counts within a series, we concluded that our procedure performed well. We do not claim that the break points of 1 and 35 are optimal, but we do suggest that they are reasonable break points for our three-model imputation method.
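
The decision rule itself is simple to state in code. The R sketch below only illustrates the break points; it returns a label for which of the three models would be used for a given series (it does not perform the imputation).

# Choose an imputation model from the observed average monthly crime count.
choose_model <- function(y) {
  m <- mean(y, na.rm = TRUE)
  if (m < 1)       "mean imputation with Poisson variance"
  else if (m < 35) "Poisson GLM"
  else             "SARIMA (1,1,1)x(1,0,1)[12]"
}

choose_model(c(0, 1, 0, 0, 2, NA, 0))   # small counts -> mean imputation
choose_model(rpois(60, 12))             # intermediate -> Poisson GLM
choose_model(rpois(60, 80))             # large counts -> SARIMA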

4.2.1 Large Crime Counts

For crime counts averaging above 35 crimes per month, we used estimates from SARIMA time series models. In our preliminary analysis of the UCR, described in Chapter 2, we ultimately chose to use the SARIMA $(1,1,1)\times(1,0,1)_{12}$ model for series with large crime counts. SARIMA models require the assumption that the data are approximately normally distributed, and with over 35 crimes per month on average, this assumption is reasonable. This model uses first-order lag-1 differencing of the data and supposes first-order AR and MA components at both lag 1 and lag 12. In this way, seasonality is allowed to change over time. We define $\{Y_t\}$ as the crime count at time $t$, $\{Z_t\} \stackrel{\text{IID}}{\sim} \text{Normal}(0, \sigma^2)$ is a random error term, and $B$ is the backward shift operator. We let
$$\phi(z) = 1 - \phi_1 z, \quad \Phi(z) = 1 - \Phi_1 z, \quad \theta(z) = 1 + \theta_1 z, \quad \Theta(z) = 1 + \Theta_1 z.$$
Then the form of the SARIMA $(1,1,1)\times(1,0,1)_{12}$ is given by Equation 4.15:
$$\phi(B)\,\Phi(B^{12})\,(1-B)\,Y_t = \theta(B)\,\Theta(B^{12})\,Z_t. \qquad (4.15)$$
This is equivalent to the following model for the lag-1 differenced data $X_t = Y_t - Y_{t-1}$:
$$X_t = \phi X_{t-1} + \Phi X_{t-12} - \phi\Phi X_{t-13} + Z_t + \theta Z_{t-1} + \Theta Z_{t-12} + \theta\Theta Z_{t-13}. \qquad (4.16)$$
The process $\{X_t\}$ is causal if and only if $\phi(z) \neq 0$ and $\Phi(z) \neq 0$ for $|z| \leq 1$ [Brockwell and Davis, 2002, Section 6.5]. Under this model, the crime count for month $t$ depends on the crime count for the previous month as well as the crime count for the same month in the previous year. The moving average of the latent process also depends on the Gaussian IID random variables at lags 1 month, 12 months, and 13 months.
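
In R this model can be fit with the built-in arima function, which computes the Gaussian likelihood by Kalman filtering and tolerates missing values in the series. The sketch below is an illustration on simulated data (the series and object names are hypothetical), not the dissertation's own fitting code.

# Fit the SARIMA (1,1,1)x(1,0,1)[12] model of Equation 4.15 to a monthly
# series of log crime counts, then forecast twelve months ahead.
set.seed(2)
lambda     <- 600 + 50 * sin(2 * pi * (1:240) / 12)     # seasonal Poisson means
log_counts <- ts(log(rpois(240, lambda)), start = c(1960, 1), frequency = 12)

fit <- arima(log_counts, order = c(1, 1, 1),
             seasonal = list(order = c(1, 0, 1), period = 12))

fcast <- predict(fit, n.ahead = 12)   # forecasts and their standard errors
fcast$pred + 1.96 * fcast$se          # upper 95% prediction bounds (log scale)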

In Chapter 2, we fit the SARIMA $(1,1,1)\times(1,0,1)_{12}$ model to the log-transformed larceny crime counts from Columbus, Ohio. A logarithmic transformation of larceny counts was used because the variance of larceny counts increased as the average monthly crime count increased. The transformed counts exhibited a more stable variance. Now we use the model to make imputations using the forecast-backcast method. Figure 4.5 shows the log larceny counts from Columbus. The section of data from December 1979 to December 1981 was removed from the time series to illustrate our imputation procedure. Examples of imputation for the missing data when forecasting, backcasting, and both are used are shown in Figure 4.6. Forecasts are less accurate near the end of the missing period and backcasts are less accurate near the beginning of the missing period, but the two are averaged to give improved estimates in Figure 4.6. The arima function in R also provides forecasting variance estimates via Kalman filtering. We use these variance estimates in our calculation of the variance for the combined estimate. The 95% prediction bounds are then calculated by $\hat{Y}_t \pm 1.96 \times \mathrm{s.e.}(\hat{Y}_t)$. We see in Figure 4.6 that the bounds increase noticeably as we move away from the data used for forecasts or backcasts, but are more stable in the estimates from the weighted combination of the forecasts and backcasts. In this example, backcasts seem to be less accurate but have lower standard errors of prediction than the forecasts.

Figure 4.5: Series of log larceny counts in Columbus, OH, with dashed lines delineating missing period.

The backcasts are less accurate because the missing crime counts appear to more closely follow the pattern in the earlier half of the series. Backcasts here have lower standard errors of prediction, it appears, because there is less unexplained variation in the latter half of the series. The imputations and confidence bounds are shown on the original scale in Figure 4.7. Table 4.1 shows a comparison of average bias and MSPE for the Columbus example. We see that the forecast-backcast estimator produces a smaller MSPE than either the forecast or backcast used individually.

Figure 4.6: Forecasting/Backcasting of missing values of log larceny counts in Columbus, OH (panels A: Forecasting; B: Backcasting; C: Forecasting and Backcasting). The line represents original data, and the circles are estimated from the model. Plus signs represent 95% prediction bounds.

Figure 4.7: Forecast/Backcast of missing values of larceny counts in Columbus, OH.

Method              BIAS     MSPE
Forecasting          0.03   0.0061
Backcasting         -0.11   0.0184
Forecast-Backcast   -0.05   0.0057

Table 4.1: Comparison of three imputation methods in Columbus example.

4.2.2 Intermediate Crime Counts

For series with between 1 and 35 crimes per month on average, we used a time series model for count data. The literature on these models is reviewed in Chapter 3. Typical practice for modeling count data is to assume that the counts follow a Poisson distribution. To account for the serial dependence in our data, we incorporate an autoregressive component into our model. We also use a seasonal effect that indicates the month of the year. When possible, we make use of complete covariate data from a "donor" series selected from "similar" agencies in the same state. Letting $Y_t$ be the crime count at month $t$, we assume that $\{Y_t \mid D_t\}$, where $D_t$ is defined in Chapter 3, are a set of independent random variables where, for each $t$,
$$Y_t \sim \text{Poisson}(\mu_t), \quad \text{with } \log(\mu_t) = \alpha + \beta \log(y_{t-1} + 1) + \gamma \log(x_t + 1) + \sum_{j=1}^{11} \eta_j m_{j,t}. \qquad (4.17)$$
Here, $x_t$ is the crime count at time $t$ from the donor agency, and $m_{j,t}$ is an indicator function for month $j$ at time $t$. Thus, the Poisson crime rate for month $t$, $\mu_t$, depends on the actual crime count in the previous month, $y_{t-1}$, the crime count from a similar agency in month $t$, $x_t$, and a mean effect of $\alpha + \eta_j$ in month $j$ ($\eta_{12} = 0$ for December). For cases in which there is no available "similar" agency with complete data, the term $\gamma \log(x_t + 1)$ is dropped from the model.
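
Equation 4.17 corresponds to an ordinary Poisson GLM fit with a log link. The R sketch below is illustrative only: the simulated data frame and its variable names (count, lag_count, donor, month) are hypothetical, and the donor term would simply be omitted from the formula when no suitable donor series exists.

# Illustrative data: a count series with its lagged value, a donor-agency
# covariate, and a month factor with December as the baseline level.
set.seed(3)
n     <- 120
donor <- rpois(n, 30)
count <- rpois(n, 8 + 0.1 * donor)
dat   <- data.frame(
  count     = count[-1],
  lag_count = count[-n],
  donor     = donor[-1],
  month     = factor(rep(month.abb, length.out = n)[-1],
                     levels = c("Dec", month.abb[1:11]))
)

# Poisson regression with log link, as in Equation 4.17
fit <- glm(count ~ log(lag_count + 1) + log(donor + 1) + month,
           family = poisson, data = dat)

# Estimated Poisson rate for one month, given its covariates
predict(fit, newdata = dat[1, ], type = "response")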

One of the questions we dealt with was how to select donor agencies for the imputation algorithm. Our definition of "similar" was different from that used by the FBI. We chose only one donor agency, selected from a number of candidate agencies in the same state, that had the least missing data and no missingness in the imputation interval in question. The same type of crime was used in the donor agency as in the agency to be imputed. The agency we selected was the one whose crime series was most highly correlated with the crime series for the agency requiring imputation.
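
One simple way to implement this selection is sketched in R below, under the assumption that the candidate series are the columns of a matrix with NA for missing months; the function and object names are hypothetical.

# Select a donor series: keep candidates that are complete over the gap,
# then take the one most correlated with the target series.
select_donor <- function(target, candidates, gap) {
  ok <- apply(candidates, 2, function(x) !any(is.na(x[gap])))
  if (!any(ok)) return(NULL)                      # no usable donor
  cors <- apply(candidates[, ok, drop = FALSE], 2,
                function(x) cor(target, x, use = "pairwise.complete.obs"))
  names(which.max(cors))                          # most correlated candidate
}

# Toy example with three candidate agencies
set.seed(4)
target <- rpois(60, 10); target[25:30] <- NA
candidates <- cbind(A = rpois(60, 10), B = rpois(60, 30), C = rpois(60, 12))
select_donor(target, candidates, gap = 25:30)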

When possible, we estimated model parameters using available data on both sides of the missing period and averaged the forecast and backcast predictions based on the different estimated models, as illustrated in the following example.

Figures 4.8 and 4.9 illustrate imputation results for burglary in Saraland, Alabama, obtained using the Poisson GLM. Anniston, Alabama's series of burglary counts was used as the donor series because both cities are in the same state, and the correlation between the reported data in both series was 0.708. The highest correlation between the crime series of any two nearly complete agencies in Alabama is about 0.90, but most correlations are less than 0.40. Thus, we judged that the correlation between crime counts in the Anniston and Saraland series is fairly strong. The similarity between crime patterns in Anniston and Saraland can be seen in the plots in Figure 4.8.

Figure 4.9 shows an example of imputation for missing data in the Saraland series when forecasting, backcasting, and both are used to obtain the imputed values. Anniston's data is used as a covariate in the Poisson regression model for prediction. For illustration purposes, we deleted crime counts from June 1974 through June 1976 from the complete Saraland series. Plot A in Figure 4.9 shows the forecasted values using data leading up to the missing months. Plot B in Figure 4.9 shows the backcasted values using data following the missing months. Plot C in Figure 4.9 shows the average of forecasts and backcasts. Low forecasts and high backcasts, caused by the fact that the missing data occur near the end of population growth for Saraland, are averaged to give good overall estimates. Table 4.2 shows that backcasting produced the best estimates in terms of average bias and MSPE. The poor forecasting resulted in pulling the composite estimates too low in 1975. Still, we can conclude that the forecast-backcast predictions performed much better than the forecast predictions alone. It is interesting, though, to see from Figure 4.9 that forecasting does much better than backcasting for the first several months in 1974. This perhaps suggests using a different weighting scheme which places more weight on forecasts for the beginning of the missing period, and more weight on backcasts for the end of the missing period. Another reason for the discrepancy between forecasts and backcasts could be the different model estimates on either side of the missing period.

Figure 4.8: Burglary counts in Saraland and Anniston, Alabama, with dashed lines delineating missing period.

Figure 4.9: Forecasting/Backcasting of missing values of burglary counts in Saraland, AL (panels A: Forecasting; B: Backcasting; C: Forecasting and Backcasting). The line represents the actual data, and circles are estimates from the fitted model.

Method              BIAS    MSPE
Forecasting          3.68   41.39
Backcasting         -0.34   20.25
Forecast-Backcast    1.55   22.84

Table 4.2: Comparison of three imputation methods in Saraland example using a GLM model.

We then tried to impute for the same missing period in Saraland, using the SARIMA model instead of the GLM. Our results, shown in Table 4.3, confirm that the Poisson GLM is the more appropriate model to use in cases with low crime counts such as these. The SARIMA model produced inferior predictions in terms of average bias and MSPE. The SARIMA predictions were primarily underestimating the true crime counts. Results such as in this example using the Saraland burglary series suggest that the SARIMA model may only be suitable for series with large crime counts, as we have already suspected.

Method              BIAS    MSPE
Forecasting          4.86   41.59
Backcasting          2.36   23.94
Forecast-Backcast    4.12   34.59

Table 4.3: Comparison of three imputation methods in Saraland example using a SARIMA model.

Using the glm function in R, we obtained variance estimates for the model parameters as well as for predictions. When we do not have crime counts missing in consecutive months, the prediction variance is straightforward, because the prediction is based on the previous or following month's observation. When we do have crime counts missing in consecutive months, we have predictions based on data that include imputed values. Our current variance estimates tend to be too small because they ignore the variability from all previous imputations for a string of missing data. Deriving the correct variance is difficult, but, taking a different approach to the problem, the Bayesian tools described in Chapters 5 and 6 are useful in handling this difficulty.

4.2.3 Small Crime Counts

Small crime counts occur very frequently in the UCR. In Ohio, for example, 34% of the crime series have less than 1 crime per month on average. Most often, small counts are associated with murder and rape.

For crimes that occur with low frequency in a particular agency, we use a simple mean imputation method. When the mean monthly crime count for the observed data is less than 1 crime per month, we use the mean to make imputations for the missing data in that series. The mean count is calculated by dividing the total count for the agency by the total number of months for which the agency provided reports. This


Figure 4.10: Murder counts in Waterloo, Iowa.

method does not account for seasonality, but with such sparse data, seasonality may be impossible to identify. An improvement over this method, which would attempt to account for any time trends, might be to calculate monthly estimates using the average over a window of months surrounding the missing month. Intuitively, this strategy makes sense, but the actual gain may be insignificant, as we would be making a bias/variance trade-off. The strategy would have less stable estimates with larger variances. In addition, the question of how many months to include in the window would almost certainly require different answers for different data series. For the variances of the imputed values, we assumed that the data came from a Poisson distribution. This model can be represented by $Y_t \sim \text{Poisson}(\mu)$, independent for $t = 1, 2, 3, \ldots, n$, where $\mu$ is the crime rate, or average number of crimes per month, and $y_t$ is the realized crime count for month $t$.

As an example of a series with low crime counts, Figure 4.10 shows the time series of monthly reported murder counts in Waterloo, Iowa. Missing data in Waterloo is evident in 1991 and part of 1992. For each month with missing murder counts, we would impute the mean of $\hat{\mu} = 109/501 = 0.218$, calculated from the observed data.

Note that if we accumulate these estimates over 12 months, we obtain an estimated 2.616 murders per year. Thus, in the case of very few crimes, we have chosen to use imputations that are reasonable in the aggregate rather than trying to assign a crime to one particular month. We made this choice to avoid spurious associations that might result from randomly assigning a crime to a particular month.
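
The Waterloo calculation amounts to a few lines of arithmetic; the R sketch below simply repeats it under the Poisson assumption of this section.

# Mean imputation for a sparse series: Waterloo murder counts,
# 109 crimes reported over 501 observed months.
total_crimes <- 109
n_months     <- 501
mu_hat       <- total_crimes / n_months   # about 0.218 crimes per month

mu_hat              # value imputed for each missing month
12 * mu_hat         # implied annual total, about 2.6 murders per year
var_hat <- mu_hat   # under the Poisson model, the variance equals the mean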

For small sample sizes, Faulkenberry [1973] provides a method for calculating the prediction interval for $Y_{n+1}$ using the conditional distribution of $Y_{n+1}$ given $T = \sum_{i=1}^{n+1} Y_i$. The statistic $T$ is sufficient for $\mu$ in the joint distribution of $\{Y_1, Y_2, Y_3, \ldots, Y_{n+1}\}$. The conditional distribution of $Y_{n+1}$, given $T = t$, is binomial with parameters $\frac{1}{n+1}$ and $t$. Let $B(i \mid t)$ be defined as the binomial probability,
$$B(i \mid t) = \frac{t!}{i!\,(t-i)!}\left(\frac{1}{n+1}\right)^i\left(\frac{n}{n+1}\right)^{t-i}.$$
For a $(100 \times \beta)\%$ interval, we can choose $[a, b]$ so that $\sum_{i=a}^{b} B(i \mid t) = \beta$. Let $W = \sum_{i=1}^{n} Y_i$. Then finding the confidence interval amounts to solving for $a$ and $b$ in
$$\sum_{i=0}^{b} B(i \mid w + b) = 1 - \frac{1-\beta}{2} \quad \text{and} \quad \sum_{i=0}^{a-1} B(i \mid w + a) = \frac{1-\beta}{2}.$$
For the Waterloo data, we obtain a 97.9% prediction interval of $[0, 1]$ for January 1991.

For large samples, this method is the same as estimating the exact Poisson prediction interval. Using the observed crime counts in the series, we estimate that for a given missing month, the probability of reporting at least 1 crime is $98/501 = 0.196$. Using the binomial$(12, 0.196)$ distribution, we can estimate the probability of observing no reported crimes for the year 1991 as 0.073.

Using the delta method, we can also calculate an approximate 95% confidence interval for $\mu$. We first estimate $\log(\mu)$ to be -1.52 with standard error 0.096, so the confidence interval for $\log(\mu)$ is [-1.709, -1.334]. Exponentiating these confidence limits gives our approximate 95% confidence interval for $\mu$: [0.181, 0.264]. This interval is much narrower than the 97.9% prediction interval, but provides inference on the mean rather than on an unknown observation.
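
Assuming the standard error of log(μ̂) is approximated by 1/√(total count) under the Poisson model, which reproduces the 0.096 quoted above up to rounding, the calculation can be written as follows; this is an illustrative sketch rather than the exact computation used in the text.

# Approximate 95% confidence interval for the Poisson rate mu,
# using the delta method on the log scale.
total_crimes <- 109
n_months     <- 501
mu_hat       <- total_crimes / n_months
se_log_mu    <- 1 / sqrt(total_crimes)    # delta-method s.e. of log(mu_hat)

ci_log <- log(mu_hat) + c(-1, 1) * 1.96 * se_log_mu
exp(ci_log)    # roughly [0.18, 0.26], close to the interval reported above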

A missing data technique that might be more appropriate in the case of small crime counts is multiple imputation. In multiple imputation, described in Little and Rubin [2002], we build sets of imputations by taking several draws from each missing value's predictive distribution. The predictive distribution can be Poisson, so that our imputed values resemble Poisson data. This happens because our imputations involve conditional draws rather than conditional means. The result is the construction of multiple completed datasets. Combining these datasets with differing imputed values will let us provide estimates of the imputation variance. By giving several plausible sets of imputed values, we can guard against treating an imputed value as a known value, and we have the advantage of calculating the imputation uncertainty.
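
A minimal sketch of this idea for a sparse series, assuming a Poisson predictive distribution with its rate estimated from the observed months (and, for simplicity, ignoring parameter uncertainty), is given below; the toy series is hypothetical.

# Multiple imputation for missing months in a low-count series:
# draw each missing value from a Poisson predictive distribution.
set.seed(5)
y <- rpois(120, 0.2)
y[c(30:35, 80)] <- NA                          # toy series with missing months

M       <- 10                                  # number of completed datasets
mu_hat  <- mean(y, na.rm = TRUE)
imputed <- replicate(M, {
  z <- y
  z[is.na(z)] <- rpois(sum(is.na(z)), mu_hat)  # conditional draws, not means
  z
})

# Between-dataset variability of the imputations for each missing month
apply(imputed[is.na(y), , drop = FALSE], 1, var)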

4.3 Special Cases

We saw in Chapter 1 that there are several types of missingness in the UCR. A datum can be missing because it is from a non-existent agency, it is from a covered-by agency, it is part of an aggregated report, or it is truly missing. Non-existent agencies shouldn't have crimes reported, so no imputations are needed, and these time periods should not affect our models. We have already described, in this chapter, methods to impute for truly missing data. In this section, we discuss ways to deal with the two special cases: covered-by agencies and aggregated reports.

4.3.1 Covered-By Cases

When an agency's crime counts are included in a second agency's reports, the second agency is said to be "covering" the first agency, and the first agency is said to be "covered by" the second agency. Since such covered-by crimes have been counted and reported, we do not need to impute them for the covered agency. More importantly, we cannot treat covering agencies the same as other agencies.

If a covering agency has periods of covering surrounded by periods of non-covering, then we might have a problem when we try to model the series. The months that are covering another agency will have a different model from the months that are not covering. For agencies with covered-by months, making the usual imputations of the missing values will not be a problem as long as the agency has a sufficient amount of observed data. For each of the covering agencies, we would need to look at plots to see how appropriate our unadjusted model would be. Since the number of agencies affected by coverings is large, it is not feasible to look at the plots of each one. We can discover, though, that most covering agencies report large crime counts and most covered-by agencies contribute small counts. In Iowa, for example, the average total for Index crimes for covered-by agencies in months when they report their own data is 15.21 crimes per month, but the average total for Index crimes for the covering agencies is 100.01 crimes per month. Under this assumption, making adjustments in the model for covering months will not significantly change our predictions.

Alternatively, Maltz and Weiss [2006] suggest that the "covered-by" problem can be reduced by combining individual agencies to calculate countywide statistics. In this solution, crime estimates are aggregated, and we learn little about crime counts for individual agencies, but the statistics that are offered can be more accurate.

4.3.2 Aggregation

The second special case of missingness occurs when a police jurisdiction reports aggregated crime totals over multiple months. During some time periods, some agencies may only submit UCR reports semi-annually or annually. In other cases, an aggregated crime count is just the result of missing the previous month.

We can still employ our imputation method to make imputations for these missing data, with some minor adjustments. Our imputed crime count for time $t$ is given by
$$\hat{y}_t = \tilde{y}_t \left(\frac{T}{\sum_{i=1}^{k} \tilde{y}_i}\right), \quad \text{for } t = 1, 2, \ldots, k, \qquad (4.18)$$
where $\tilde{y}_t$ represents the unadjusted crime count estimate at time $t$ based on the imputation methods described earlier in this chapter, and $T$ represents the reported aggregated crime count for months $t = 1, 2, \ldots, k$. Now, when we sum up our imputed values, we get the reported aggregate crime count $T$:
$$\sum_{t=1}^{k} \hat{y}_t = \sum_{t=1}^{k} \tilde{y}_t \left(\frac{T}{\sum_{i=1}^{k} \tilde{y}_i}\right) = \left(\sum_{t=1}^{k} \tilde{y}_t\right) \frac{T}{\sum_{i=1}^{k} \tilde{y}_i} = T.$$

So the total $T$ is known, but the monthly counts that sum to $T$ are the estimated values.

Thus, we can assume that the crime counts for aggregated months $t = 1, 2, \ldots, k$ follow a multinomial distribution with known parameter $T$ and unknown parameters $p_1, p_2, \ldots, p_k$. We have already seen in Equation 4.18 that $p_t$ is estimated by $\hat{p}_t = \tilde{y}_t / \sum_{i=1}^{k} \tilde{y}_i$. Now the variance of $\hat{y}_t$ can be estimated by
$$\widehat{\mathrm{Var}}(\hat{y}_t) = T \times \hat{p}_t\,(1 - \hat{p}_t).$$

This variance estimate will be smaller than usual because knowing T gives us more certainty about our estimates for the crime counts of missing months.
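
The rescaling in Equation 4.18 and the multinomial variance above translate directly into R; the sketch below uses hypothetical unadjusted estimates for a year that was reported as a single aggregated total.

# Allocate an aggregated (e.g., annual) total T across k missing months in
# proportion to the unadjusted monthly estimates y_tilde (Equation 4.18).
allocate_aggregate <- function(y_tilde, T_total) {
  p_hat <- y_tilde / sum(y_tilde)              # estimated multinomial proportions
  list(y_hat = T_total * p_hat,                # imputed counts; they sum to T
       var   = T_total * p_hat * (1 - p_hat))  # multinomial variance estimate
}

# Example: twelve unadjusted estimates and a reported annual total of 1200
y_tilde <- c(80, 85, 95, 100, 110, 120, 125, 120, 110, 100, 90, 85)
out <- allocate_aggregate(y_tilde, T_total = 1200)
sum(out$y_hat)   # equals 1200 by construction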

The top plot in Figure 4.11 shows an example of a series with aggregated crime counts. During the time periods from 1991–1997 and 1999–2002, Mobile, Alabama submitted aggregated crime counts to the FBI in annual reports each December. It is clear that the reported crime counts during this time period are far greater than the monthly crime counts submitted for the long history of the series. The bottom plot in Figure 4.11 shows the series after imputations have been made for the periods of aggregation. Now, the annual crime estimates, during these periods, are still accurate, and we have used our knowledge of seasonality to provide reasonable monthly estimates where missing.

Another type of aggregation that is commonly seen in the UCR is the aggregation of crime categories. There are three aggregated crime series that are often of interest to criminologists. The sum of the seven Part I Index crimes is called the Part I crime index. The violent crime index is the sum of crimes in the categories of murder, rape, robbery, and aggravated assault. The property crime index is the sum of crimes in the categories of burglary, larceny, and vehicle theft.

Clearly, if an agency has missed a report for one crime, then we also have a missing data problem for the crime index. In general, if an agency has missed reporting a crime count for one type of crime, then counts for all types of crime are missing for that particular month. One question we initially set out to answer concerned how we should make imputations for missing observations in the crime index series. There are three possible approaches we considered. The first method is to make imputations for each of the seven crimes separately and then sum them to get an estimate for the crime index. The second method is to make our imputation on the crime index series and then allocate this estimate to the seven crime series. The third method is to make imputations for each of the seven crimes and each of the three crime indices all separately. Advantages of the first method are that it mimics the way the crime index is created, as the sum of the seven crime counts, and it avoids the question of how an estimate of the composite crime index should be allocated to the seven individual crimes. One advantage of the third method is that estimating the variances of the imputed values is straightforward. The disadvantage of the third method is that there is some discrepancy between our estimated values for the crime index and the sum of the estimates for the individual crimes.

Figure 4.11: Larceny counts with aggregation in Mobile, AL. Imputed values are represented by points in bottom plot.

Fortunately, we found that the predictions for both crime index prediction approaches are relatively close, and we concluded that both methods produce similar estimates. Using the Cleveland, Ohio time series, we compared the two prediction estimates. The means of each crime type are preserved when aggregated, but since the crime types are not independent, the variances are not preserved. The variances for imputed crime index values should be calculated using the crime index series.

Predictions of the crime index for years 2000–2003 are based on forecasting using a model of the data from 1960–1999. The bottom plot in Figure 4.12 shows how similar the two prediction approaches for the crime index are, and shows how close they are to the true data, which was removed for the imputation procedure.

Figure 4.12: Index crime counts for Cleveland, Ohio. Index crime counts are represented by the solid line. In bottom plot, predictions based on modeling the Index crime series are represented by the dotted line, and predictions based on modeling the individual crime series are represented by the dashed line.

CHAPTER 5

BAYESIAN MODELING

Up until now, we have only considered modeling crime counts for a single type of crime reported to an individual police jurisdiction. In the previous chapter, we realized the benefit of using donor agency crime series as covariates, but there is the potential to incorporate data from more agencies into a comprehensive model. By grouping agencies, we can construct a hierarchical model. We now consider modeling crime counts for multiple types of crimes reported to multiple police agencies. Hierarchical models have the feature of allowing agencies to learn from each other when making inference on agency-level parameters.

Before we begin our new analysis of the data, let us first introduce the Bayesian approach. The fundamental difference between Bayesian analysis and classical analysis is that with classical analysis, the parameters are assumed to be fixed, but typically unknown. In Bayesian analysis, the parameters are assumed to be random variables rather than fixed quantities. This amounts to parameters following distributions in the Bayesian setting. Inferences on parameters are based on the posterior distribution of the parameters. The posterior distribution is the conditional density of the parameters given the observed data. To derive a posterior distribution, a prior distribution must initially be assumed for the parameter. The prior distribution is the parameter's marginal distribution, which is defined in terms of other parameters, known as hyperparameters. The prior distribution is often denoted by $\pi(\theta)$ for parameter $\theta$. Letting $f(y \mid \theta)$ be the distribution of the data, $Y$, given $\theta$, the posterior distribution is defined [Ghosh et al., 2006] using Bayes rule by
$$\pi(\theta \mid y) = \frac{\pi(\theta)\, f(y \mid \theta)}{\int_{\Theta} \pi(\theta')\, f(y \mid \theta')\, d\theta'}. \qquad (5.1)$$
After calculating the posterior distribution, we will be able to make inferences, even in the form of probability statements, on $\theta$.
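
As a concrete special case relevant to crime counts (a textbook example, not the UCR model itself), suppose a single agency reports counts $y_1, \ldots, y_n \stackrel{\text{iid}}{\sim} \text{Poisson}(\mu)$ and we place a conjugate Gamma$(a, b)$ prior on $\mu$. Then Equation 5.1 yields a closed-form posterior:
$$\pi(\mu \mid y) \propto \mu^{a-1} e^{-b\mu} \prod_{t=1}^{n} \frac{\mu^{y_t} e^{-\mu}}{y_t!} \propto \mu^{a + \sum_t y_t - 1}\, e^{-(b+n)\mu},$$
so that $\mu \mid y \sim \text{Gamma}\!\left(a + \sum_{t=1}^{n} y_t,\; b + n\right)$, with posterior mean $(a + \sum_t y_t)/(b + n)$. For the models we use, no such closed form is available, which is why the computational methods of Section 5.1 are needed.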

For our problem, we ultimately would like to make probability statements concerning missing values. In Bayesian analysis, these too can have posterior distributions. To predict unknown $y_m$, based on the observed data, $y_{obs}$, we can make inference using the posterior predictive distribution defined [Ghosh et al., 2006] by
$$\pi(y_m \mid y_{obs}) = \int_{\Theta} \pi(y_m \mid \theta)\, \pi(\theta \mid y_{obs})\, d\theta. \qquad (5.2)$$
One of the greatest benefits of the Bayesian approach, for our situation, is the way

that imputations can be made for the missing values. The imputations automatically take into account the uncertainty in the model parameters, through $\pi(\theta \mid y_{obs})$, as can be seen in Equation 5.2, and since we estimate a distribution for each missing value, imputation variance is provided with minimal effort. We will devote Chapter 6 to discussing Bayesian imputation, but for now we describe how we construct our Bayesian model for crime counts.

In Section 5.1, we describe the computational techniques used to find the posterior distributions of our model parameters. In Section 5.2, we describe the process by which we construct our model. We present a full model specification in Section 5.3, and we discuss ways of checking our model in Section 5.4.

5.1 Bayesian Inference

Bayesian inference on a model parameter, as mentioned previously, is based on its posterior distribution. Often, a point estimate for a parameter is taken to be the mean of its posterior distribution. Probability statements, typically in the form of posterior intervals, are constructed using quantiles from the posterior distribution. Bayesian inference can also be in the form of random draws from the posterior distribution, as in multiple imputation [Rubin, 1987]. The posterior distribution (Equation 5.1 or 5.2) is critical to Bayesian inference, but is often very difficult to obtain. For example, in

(5.1), we may assume $\pi(\theta)$ and $f(y \mid \theta)$, but the marginal distribution $f(y)$ is generally not easy to calculate. As a consequence, $\pi(\theta \mid y)$ is difficult to calculate. To get around this common problem, we use a computational technique called Markov chain Monte Carlo (MCMC) to approximate the posterior distribution.

MCMC is based on iteratively drawing samples from a Markov chain whose distributions, under a set of conditions, become increasingly close to the posterior distribution [Gelman et al., 2004, Chapter 11]. In other words, MCMC does not draw directly from $\pi(\theta \mid y)$, but instead draws from distributions that become closer to $\pi(\theta \mid y)$ after each iteration. After some finite, but unknown, number of iterations, the chain will produce random samples from the stationary distribution, which is the posterior distribution. This is called 'convergence' of the chain. After convergence, the samples are draws from the target posterior distribution. In some situations, a large number of iterations of the Markov chain is required before convergence occurs.

For complicated models, MCMC can take many hours of computing time.

Two commonly used MCMC algorithms are the Metropolis-Hastings algorithm and the Gibbs sampler. The Metropolis-Hastings algorithm is a rejection sampling algorithm, which samples from a proposal distribution and accepts the sample with some acceptance probability that changes after each sample is taken. Let the proposal distribution (or jumping distribution) at iteration $i$ be $J_i(\theta^* \mid \theta^{(i-1)})$. Then the acceptance probability at iteration $i$ is
$$r = \frac{\pi(\theta^* \mid y)\,/\,J_i(\theta^* \mid \theta^{(i-1)})}{\pi(\theta^{(i-1)} \mid y)\,/\,J_i(\theta^{(i-1)} \mid \theta^*)},$$
where $\theta^*$ is the current proposal drawn from the proposal distribution [Gelman et al., 2004]. The basic idea behind the Metropolis-Hastings algorithm is that we simulate an easy Markov chain which has the target posterior density as its stationary density [Ghosh et al., 2006, Chapter 7].

A special case of the Metropolis-Hastings algorithm is the Gibbs sampler. The Gibbs sampler is designed to sample from a joint posterior distribution when we have multiple parameters with a dependence structure. For each parameter, the Gibbs sampler samples from univariate densities, conditional on the other parameters. After iteration $i$ we take draws
$$\begin{aligned}
\theta_1^{(i+1)} &\sim \pi(\theta_1 \mid y, \theta_2^{(i)}, \ldots, \theta_p^{(i)}) \\
\theta_2^{(i+1)} &\sim \pi(\theta_2 \mid y, \theta_1^{(i+1)}, \theta_3^{(i)}, \ldots, \theta_p^{(i)}) \\
&\;\;\vdots \\
\theta_p^{(i+1)} &\sim \pi(\theta_p \mid y, \theta_1^{(i+1)}, \ldots, \theta_{p-1}^{(i+1)}).
\end{aligned}$$
Together, these densities, known as full conditionals, determine a multivariate joint distribution [Ghosh et al., 2006]. The Gibbs sampler is especially well-suited for inference with hierarchical models, which often involve many parameters.
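
To make the full-conditional idea concrete, the R sketch below runs a two-parameter Gibbs sampler for a simple conjugate model: normally distributed data with unknown mean μ and precision τ, a flat prior on μ, and a Gamma(0.01, 0.01) prior on τ. This toy example is not the UCR model; it only illustrates the alternating draws.

# Gibbs sampler for y ~ Normal(mu, 1/tau) with a flat prior on mu
# and a Gamma(0.01, 0.01) prior on the precision tau.
set.seed(6)
y    <- rnorm(100, mean = 5, sd = 2)
n    <- length(y)
ybar <- mean(y)

n_iter <- 5000
mu  <- numeric(n_iter)
tau <- numeric(n_iter)
mu[1] <- 0; tau[1] <- 1                  # initial values

for (i in 2:n_iter) {
  # Full conditional of mu given tau: Normal(ybar, 1 / (n * tau))
  mu[i]  <- rnorm(1, mean = ybar, sd = sqrt(1 / (n * tau[i - 1])))
  # Full conditional of tau given mu: Gamma(0.01 + n/2, 0.01 + sum((y - mu)^2)/2)
  tau[i] <- rgamma(1, shape = 0.01 + n / 2,
                      rate  = 0.01 + sum((y - mu[i])^2) / 2)
}

keep <- seq(1001, n_iter, by = 5)        # discard burn-in, keep every 5th draw
c(post_mean_mu = mean(mu[keep]), post_sd_mu = sd(mu[keep]))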

Since we will have multiple parameters and hyperparameters in our model, which we will describe in the next section, we will need to use the Gibbs sampler to estimate posterior distributions. BUGS (Bayesian inference Using Gibbs Sampling) [Spiegelhalter et al., 1994] is the software we use to fit our Bayesian models. WinBUGS is the Windows version of BUGS, and OpenBUGS is the open-source version of BUGS.

The BUGS software can be used easily in the R environment by calling the libraries R2WinBUGS (for WinBUGS) or BRugs (for OpenBUGS), which we have chosen to use in our analysis.

Bayesian computations with our model will be performed by OpenBUGS, using Gibbs sampling to sample from the posterior distributions of our model parameters. We only need to give a completely specified probability model, including prior distributions, and to assign initial values for each parameter in the Markov chain. OpenBUGS produces a chain of samples, with the length arbitrarily chosen, for each parameter in the model. Convergence of the MCMC must be checked, and we describe this step in Section 5.4. Even after convergence, the samples within a sequence are correlated with each other, so it may be a good idea to only keep samples at every $k$th iteration, totaling to the desired sample size [Gelman et al., 2004]. Finally, we can use these samples to make inferences on our model parameters.

5.2 Model Construction

We first consider building a model for crime counts of the three property crimes reported to the four agencies in one county. Then in Section 5.3, we describe how we can extend this model to include all counties in each state, and all states in the country, if desired. We will start with the Poisson regression model, described in Section 3.1. Now with additional indexes, let $Y_{j,k,t}$ be the crime count from agency $j$ for crime type $k$ at time $t$. Crime counts among crime series for an agency are conditionally independent given the parameters that are crime-type specific. Crime counts among agencies are conditionally independent given the agency-level parameters. For agency $j = 1, 2, \ldots, 4$, crime $k = 1, 2, 3$, and $t = 1, 2, \ldots, 516$, let
$$Y_{j,k,t} \mid \mu_{j,k,t} \stackrel{\text{indep.}}{\sim} \text{Poisson}(\mu_{j,k,t}), \quad \text{with } \log(\mu_{j,k,t}) = \alpha_{j,k} + \beta_{j,k} X_t, \qquad (5.3)$$

where $\alpha_{j,k}$ is the intercept term in the linear predictor component of the GLM for crime counts for agency $j$ and crime type $k$, and $\beta_{j,k}$ is a regression coefficient for some covariate $X_t$, which is time dependent but is independent of agency and crime. We will determine $X_t$ later. Equation 5.3 is a starting place for us to construct our Bayesian hierarchical model.

Part of our model building process includes an exploratory analysis of our data. For this, we look at complete property crime data reported to thirteen police agencies from Minnesota. We will look at Poisson regression models for each of these series individually, and summarize the results, to help us develop a comprehensive model. With four agencies from Hennepin county in Minnesota, we will construct a model, and we will form and develop imputation strategies. In Section 5.2.1, we describe how we construct a hierarchy for our model parameters. In Section 5.2.2, we explore which covariates will best help us model the trend and seasonality in our data. In Section 5.2.3, we discuss ways of modeling the autocorrelation in the observations. Finally, we discuss which prior distributions are most appropriate in Section 5.2.4.

5.2.1 Specification of Hierarchical Structure

Hierarchical models are prevalent in practice. For example, Dominici et al. [2000] combine pollution-mortality relative rates for 20 cities and model between-city variation in relative rates as a function of city-specific covariates. Hierarchical models are also used in Hay and Pettitt [2001] and in Ghosh et al. [1996], for example. Hay and Pettitt [2001] use a hierarchical model on the time series of the incidence of an infectious disease. They use a parameter-driven model in the form of a Poisson generalized linear model. Ghosh et al. [1996] take a hierarchical Bayesian approach for bivariate time series modeling of median incomes for four- and five-person families. For our situation, the advantage of a hierarchical model is that, by incorporating groups of data together, parameter inferences are computed with the pooled strength of data across groups.

The organization of the UCR dataset has a structure that is essentially hierarchical. This makes the Bayesian hierarchical modeling approach highly appropriate to use with the UCR data. Grouping variables such as county and state allow for natural groupings of the data. These grouping variables can be utilized to formulate a hierarchical structure to combine many groups of data.

In modeling the UCR data, we have some discretion in organizing the hierarchy, but it seems most natural to group agencies within a county, and group counties within a state. In our hierarchical structure for the UCR data, the lowest level represents the individual agencies. For each agency, crime counts are observed. At the next level, we group agencies by county. At the highest level, we group counties by state.

Figure 5.1 shows an example illustrating how the hierarchical structure of the UCR might be specified. Counties {Aitken, . . . , Washington} are in Minnesota. Police jurisdictions {Bloomington, . . . , Wayzata} are in the Minnesota county of Hennepin. Burglary, larceny, and vehicle theft are the property index crimes measured for every police agency.

Figure 5.1: Diagram of Hierarchy in UCR

The idea behind hierarchical models, in the context of the UCR, is that by grouping agencies together, we are able to borrow strength from all agencies to estimate agency-specific parameters. This happens because the agency-specific parameters are seen as a sample of possible parameter values from a common population distribution. We assume that there is some similarity among the parameters. We don't necessarily assume the parameters are correlated, but we assume that they come from the same parent distribution. With hierarchical models, we can exploit the assumed relationship or similarity among parameters. The advantage of this modeling approach is seen most easily when we observe rare outcomes. In this case, the smoothed parameter estimate for a unit, obtained using all units in a group, may be a more sensible estimate than the maximum likelihood estimate based only on the individual unit [Congdon, 2001]. When parameter estimates are pulled toward the group mean, this type of "shrinkage" may cause bias, and this is sometimes the disadvantage of hierarchical models, but the trade-off is reduced variance.

When we switch to a hierarchical model, we greatly increase the number of parameters in our model. The non-Bayesian models that we have studied thus far have too few parameters to accurately fit a dataset as large as the UCR. When we have added too many parameters, in our ARIMA model for example, we have noticed a tendency to produce models that fit the data well but lead to poor forecasting. In hierarchical models, we have enough parameters to fit the data well, but over-fitting is avoided by structuring dependence into the parameters [Gelman et al., 2004]. We now explain the need for additional parameters in a hierarchical model.

In Bayesian modeling, we treat the model parameters as random variables, as mentioned earlier. Consider our model in Equation 5.3. Now, instead of considering the model parameters to be fixed effects, as in previous chapters, we can treat them as random effects. For example, we could have the regression term as

$$\beta_{j,k} \stackrel{\text{indep.}}{\sim} \text{Normal}(\mu_{\beta_j}, \sigma_{\beta_j}) \quad \text{for } k = 1, 2, 3, \qquad (5.4)$$
where $\mu_{\beta_j}$ and $\sigma_{\beta_j}$ are hyperparameters. In our construction of a hierarchy, we make some assumptions about relationships among the model coefficients. This implies dependencies in the data across crimes and agencies. In Equation 5.4, we are assuming that for agency $j$, the $\beta_{j,k}$ parameters for each of the three crime series come from the same population distribution.

Hierarchical modeling depends on an exchangeability assumption among agencies. The property of exchangeability is crucial to setting up a joint probability model for the parameters. The parameters $(\beta_{j,1}, \beta_{j,2}, \beta_{j,3})$ are exchangeable in their joint distribution if the joint distribution is invariant to permutations of the indexes [Gelman et al., 2004]. In our data, if we do not know about any differences in the crimes, then we can claim that our parameters are exchangeable in the joint prior distribution. Using Equation 5.4 as an example, for agency $j$, if each $\beta_{j,k}$ is an independent sample from a common prior distribution, then we have exchangeability of crimes. In our hierarchical model, we will assume that agency-level parameters within a county are exchangeable. This means that each $\mu_{\beta_j}$ ($j = 1, 2, \ldots, J$) is a sample from the same distribution. Later, if we wish to extend our model to incorporate higher levels of the hierarchy, we could assume that county-level parameters within a state are exchangeable, and state-level parameters within the country are exchangeable.

Besides making inference on the model parameters themselves, another result of hierarchical modeling is that we will be able to estimate the variation among the model parameters for different agencies within a county, or for different counties within a state. If our model is structured properly, we also are able to obtain overall mean estimates of the parameters across groups by using the posterior distributions of the hyperparameters. Our hierarchical structure will be fully specified when we determine our prior distributions in Section 5.2.4.

5.2.2 Covariate Selection

One of the goals of our model-building process is to learn how the average crime count changes over time. The terms in our model should reveal to us something about the overall crime pattern over time for each agency, which will ultimately help us to make predictions for the missing values. If we had reason to believe that there was a linear trend in the crime counts, we would include the term $\beta_{j,k}^{(1)} t$ in the linear predictor of our regression model in Equation 5.3, where $t = 1, 2, \ldots, N$ represents time in months. Models with this trend term are seen in Gelman et al. [2004, page 99] and Li [1994] (Equation 3.17). Often in practice, however, patterns over time are more complicated, and require more sophisticated models.

Instead of using time itself in the model, we might be able to explain the crime pattern better by using a time-dependent covariate. For example, Dominici et al. [2004] use smooth functions of covariates in a Poisson generalized additive model. In the UCR, population is a time-dependent variable which we at first expected to be useful in predicting crime counts. The logic behind this idea is that as population increases, crime events also increase. More people means more perpetrators (or more victims). We have seen, unfortunately, that there are problems with using population as a covariate in our model. Population is given with the UCR data as annual estimates, yet it still has a missing data problem. Besides the occasional periods of missing population sizes, we also have definitional zero-populations. Agency population should be defined as the population covered by the police agency, but we know that some agencies are zero-population agencies (e.g., park police), as we have described in Chapter 1. Crimes are still reported to zero-population agencies, even though no population is attributed to them. To avoid double-counting of population, populations are assigned to only one agency. Another reason why a covariate for population may not be useful is that population is not always highly correlated with crime counts. Figure 5.2 shows an example where population and larceny counts are correlated in the beginning of the time series, but are uncorrelated in the last half of the time series. For this agency, the correlation between population estimates and larceny counts is 0.613. In some other cases, we have seen that as population increases, crime counts decrease. For all of these reasons, we will dismiss using agency population in our model as a predictor of crime pattern.

Figure 5.2: First plot shows annual population estimate, and second plot shows monthly larceny counts for the same agency during the same time period.

Without the use of covariates from data sources outside of the UCR, we turn again to considering functions of time as predictors of crime. Assuming that crime patterns over time are more complicated than linear or quadratic trends, we try to understand what these functions might be by first looking at scatterplot smoothers of the crime series. Locally weighted regression (LOESS) is the scatterplot smoothing method used in Figure 5.3 to show crime patterns for twelve of the thirteen complete-data agencies in our exploratory analysis. The largest agency was removed to make the plots more helpful visually. We can see from the plots that a linear trend, or even a quadratic trend, will do a poor job of capturing the observed pattern in crime over time. Note also that the ranges of the counts are different for each crime, with larceny exhibiting the greatest range and vehicle theft exhibiting the smallest range.
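As an illustration of this kind of scatterplot smoothing, the following R sketch fits a LOESS curve to a simulated monthly count series; the data frame and its column names (crime, count, t) are hypothetical stand-ins rather than the UCR files, and the span value is only a reasonable starting point.

## A minimal LOESS sketch on simulated stand-in data, not the UCR series.
set.seed(1)
crime <- data.frame(t = 1:516)
crime$count <- rpois(516, lambda = exp(2 + 0.5 * sin(2 * pi * crime$t / 120)))

fit_lo <- loess(count ~ t, data = crime, span = 0.3, degree = 2)

plot(crime$t, crime$count, pch = 16, cex = 0.4,
     xlab = "Month (t)", ylab = "Monthly count")
lines(crime$t, predict(fit_lo), lwd = 2)  # smoothed long-term crime pattern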

Time series models have two important components, trend and seasonality, as seen in the Classical Decomposition model in Chapter 2. By including Hermite polynomials for time in our regression model, we try to capture the trend, or more generally, the long-term pattern, in the data. The time series plots of our data, and corresponding

LOESS curves, suggest that a 4th degree polynomial might be useful to estimate the overall crime pattern. We use the four Hermite polynomials defined [Weisstein, 2008] as

\[
H_1(z) = 2z, \qquad
H_2(z) = 4z^2 - 2, \qquad
H_3(z) = 8z^3 - 12z, \qquad
H_4(z) = 16z^4 - 48z^2 + 12,
\]
where $z_t$ is a standardized variable for time, defined by $z_t = (t - 258.5)/149.1$, so $z_t$ ranges from $-1.727$ at $t = 1$ to $1.727$ at $t = 516$, with mean equal to zero and standard deviation equal to one. These polynomials are chosen because they are orthogonal, thereby removing multicollinearity among the explanatory variables. Together, we use these four Hermite polynomials to model the overall long-range crime pattern over time for each series and, in a sense, to represent a surrogate for population.

Figure 5.3: LOESS curves of twelve crime series in MN, for each of the three property Index crimes.

Figure 5.4 shows that a 4th degree polynomial makes a noticeable improvement over fitting

2nd or 3rd degree polynomials to the data in Bemidji, Minnesota. For example, the third plot in Figure 5.4 shows that burglary was highest around 1984, but the second plot shows that burglary was highest around 1988, when crime had already clearly declined.

Also, we see that the 4th degree polynomial estimates a more stable crime pattern for

1961–1971 than do the 2nd or 3rd degree polynomials, which inaccurately estimate a steadily increasing trend during that time period. A 5th degree polynomial, or higher, might also be appropriate. Polynomials of higher degree will provide better fits of the crime data series, but the trade-off is a loss of model simplicity. At our discretion, we have chosen to use the first four Hermite polynomials, because we believe that a 4th degree polynomial is fairly simple, yet fits the general pattern of the crime data adequately. The disadvantage of using a polynomial for time is that the curve produced is sometimes not as flexible as some smoothers, but is limited by the degree of the polynomial. We do not expect our polynomial to fit the data perfectly. We only want the polynomial to reasonably model the underlying crime pattern over time.

Figure 5.4: Fitting mean crime on Hermite polynomials for time in Bemidji, MN.
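To make the construction concrete, the following R sketch builds the standardized time variable and the four Hermite polynomial covariates from the centering and scaling constants quoted above; the object names are ours, and the correlation matrix at the end simply shows how the four trend covariates relate to one another on this design.

## Trend covariates: standardized time and the first four Hermite polynomials.
t <- 1:516
z <- (t - 258.5) / 149.1              # ranges from about -1.727 to 1.727

x1 <- 2 * z
x2 <- 4 * z^2 - 2
x3 <- 8 * z^3 - 12 * z
x4 <- 16 * z^4 - 48 * z^2 + 12

X_trend <- cbind(x1, x2, x3, x4)
round(cor(X_trend), 2)                # pairwise correlations among the trend covariates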

It remains now to discuss ways to model seasonality, specifically using explanatory

variables. In Chapter 2, we saw several options to model seasonal effects using time-

dependent variables in a regression model. Monthly effects, used in the Poisson

regression model in Chapter 4, are statistically significant for some series but not for

others. Considering monthly effects to be too refined, we have decided to use dummy

variables for season in our model. Summer (June, July, August) will be the base

season, so indicators for Fall (September, October, November), Winter (December,

January, February), and Spring (March, April, May) variables are included in the

model. These are not exact seasons, and certainly different regions of the country

will have different “seasons”, but we intend only to model the big differences

among seasons. Especially for crime types with low rates of occurrence, it makes more

sense to imagine differences in rates among seasons, rather than differences among

months. This also seems reasonable for vacation areas, such as ski resorts and beach

areas. This parametrization is certainly more parsimonious than using a dummy

variable for each month.
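The season indicators are straightforward to construct; the R sketch below assumes, for illustration, a 516-month series that begins in January, with Summer absorbed into the intercept.

## Season dummy variables with Summer (Jun, Jul, Aug) as the baseline.
month <- rep(1:12, length.out = 516)

w_fall   <- as.numeric(month %in% c(9, 10, 11))   # Sep, Oct, Nov
w_winter <- as.numeric(month %in% c(12, 1, 2))    # Dec, Jan, Feb
w_spring <- as.numeric(month %in% c(3, 4, 5))     # Mar, Apr, May

W_season <- cbind(w_fall, w_winter, w_spring)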

At this point, we update our model in Equation 5.3. Now, we include terms in the linear predictor component of our Poisson regression model to express the trend and seasonality in our data. Let the crime counts $\{Y_{j,k,t}\}$ be specified as in Equation 5.3. Then for each $j$, $k$, $t$,

\[
\log(\mu_{j,k,t}) = \alpha_{j,k} + \sum_{h=1}^{4} \beta^{(h)}_{j,k} x^{(h)}_t + \sum_{s=1}^{3} \eta^{(s)}_{j,k} w^{(s)}_t. \tag{5.5}
\]
The four Hermite polynomials for time are represented by the covariates $\{x^{(1)}_t, x^{(2)}_t, x^{(3)}_t, x^{(4)}_t\}$, and the dummy variables for season are represented by $\{w^{(1)}_t, w^{(2)}_t, w^{(3)}_t\}$. Figure 5.5 shows an example of how this model fits the time series of burglary crime counts in Robbinsdale, Minnesota. The bold line represents the fitted values of the model. Seasonal differences are evident, and the overall fit, using the 4th degree polynomial, seems to model the data rather well. Notice also that the seasonal differences in the fitted line are larger during periods of increased crime, due to the inherent mean-variance relationship in Poisson distributed data.

Figure 5.5: Fitting Poisson regression on burglary counts in Robbinsdale, MN.
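A fixed-effects version of Equation 5.5 for a single agency and crime can be fit with glm() in R. The sketch below reuses the trend covariates and season indicators from the two sketches above and simulates stand-in counts, since the Robbinsdale data cannot be reproduced here; the coefficient values are arbitrary.

## Poisson regression on the Hermite trend covariates and season dummies,
## assuming x1-x4 and w_fall, w_winter, w_spring from the sketches above.
set.seed(2)
y <- rpois(516, lambda = exp(1.5 + 0.1 * x1 + 0.2 * w_fall))   # stand-in counts

fit <- glm(y ~ x1 + x2 + x3 + x4 + w_fall + w_winter + w_spring,
           family = poisson(link = "log"))
summary(fit)           # alpha, beta^(1)-beta^(4), eta^(Fall), eta^(Winter), eta^(Spring)
mu_hat <- fitted(fit)  # fitted monthly means, analogous to the bold line in Figure 5.5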

5.2.3 Modeling Autocorrelation

In the literature, Bayesian models are constructed for dependent data in a number of ways. Taking a Bayesian approach to classical ARMA time series models,

Thompson and Miller [1986] assume the model parameters to be random. Chen and

Ibrahim [2000] incorporate a latent time series process in their model to account for dependency in the data. In general, we can take any model described in Section 3.2, as in Zeger and Qaqish [1988] or Zeger [1988], and give the parameters hierarchical prior distributions. Time series models can even be constructed as generalized additive models in the Bayesian setting, as seen, for example, in Dominici et al. [2004].

One popular Bayesian time series approach is to allow for time-varying model coefficients. Harvey [1989] defines a structural time series model where the parameters are time-varying, and Pole et al. [1994] describe this dynamic linear modeling technique.

Mixtures of ARMA and dynamic linear modeling components are also possible. Random effects models can also be constructed as Bayesian models for dependent data

[Diggle et al., 1994]. An example of a longitudinal random effects model for Poisson counts can be seen in Saphire [1991].

Our current model, in which the trend and seasonal effects are a set of fixed coefficients, will place equal weight on all observations when predicting the future

[Congdon, 2001]. Using time-varying coefficients, as in a dynamic linear model, forecasts will place more weight on recent observations, since forecasts are based on the most up-to-date coefficient values. A dynamic model may be more flexible than models for which regression parameters do not change over time, but a dynamic model often does poorly for long-term forecasts [Congdon, 2001]. We do not take this dynamic modeling approach because of the complexity already in our hierarchical model.

For example, if we are using prior distributions to pool information across agencies, as in Equation 5.4, then by assuming time-varying coefficients, we are changing our hierarchical assumptions.

The strategy we adopt involves expressing our regression model with latent stationary autocorrelated error terms. Since we have turned to Bayesian models, it is now much easier for us to consider this type of parameter-driven time series model. For example, in dealing with disease incidence, Hay and Pettitt [2001] use a parameter-driven model in the form of a Poisson generalized linear mixed model. They propose an AR(1) model for the noise term in the mean of the Poisson regression model.

Congdon [2001, Section 7.5] describes regression with autocorrelated errors in time, and suggests that the errors may follow any ARMA(p, q) process.

Now we again update our model in Equation 5.5 to include a time series component. Let the data, Y, be defined with the same specification as in Equation 5.3.

Then,
\[
\log(\mu_{j,k,t}) = \alpha_{j,k} + \sum_{h=1}^{4} \beta^{(h)}_{j,k} x^{(h)}_t + \sum_{s=1}^{3} \eta^{(s)}_{j,k} w^{(s)}_t + \epsilon_{j,k,t}, \tag{5.6}
\]
where the latent process $\{\epsilon_{j,k,t}\}$ is an autoregressive process defined for each $j$, $k$, and $t > 1$ as

\[
\epsilon_{j,k,t} \sim \mathrm{Normal}(\mu_{\epsilon_{j,k,t}}, \sigma_{\epsilon_{j,k}}),
\quad \text{where } \mu_{\epsilon_{j,k,t}} = \phi_{j,k}\, \epsilon_{j,k,t-1},
\]

so that φj,k is the autoregression parameter for the autoregressive latent process.

For time series data, we must also consider the assumptions about the initial, time

t = 1, terms, although this issue may not be so critical with such long data series.

Still, we mention that there are several approaches to dealing with this issue. Suppose

that our errors are assumed to follow an AR(1) process. Then we must make some

modification to our model to deal with time t = 1. One way is to treat ǫ1 as an extra

parameter, or missing value to estimate. Another approach is to treat ǫ1 as known

and fixed, and to specify the conditional likelihood of the data given ǫ1. Congdon

[2001] suggests that Bayesian estimation of the AR(1) error model is simplified by this strategy of conditioning on the first observation. Alternative approaches described in

Congdon [2001] are to use backcasting to estimate the initial latent data, or to use a composite parameter to model the first mean. If we assume that our error terms form an AR(1) process, then we can use the starting condition at time $t = 1$ to be

\[
\epsilon_{j,k,1} \sim \mathrm{Normal}\bigl(0, \sigma^{(1)}_{\epsilon_{j,k}}\bigr),
\quad \text{where } \sigma^{(1)}_{\epsilon_{j,k}} = \frac{\sigma_{\epsilon_{j,k}}}{\sqrt{1 - \phi_{j,k}^2}},
\]
as described in Cryer [1986, Chapter 4].
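The following R sketch simulates a latent AR(1) process with this stationary starting condition; the values of phi and sigma are arbitrary and only meant to show the recursion.

## Latent AR(1) process with a stationary start, as in Equation 5.6.
simulate_ar1 <- function(n, phi, sigma) {
  eps <- numeric(n)
  eps[1] <- rnorm(1, mean = 0, sd = sigma / sqrt(1 - phi^2))  # stationary start
  for (t in 2:n) {
    eps[t] <- rnorm(1, mean = phi * eps[t - 1], sd = sigma)
  }
  eps
}

set.seed(3)
eps <- simulate_ar1(n = 516, phi = 0.3, sigma = 0.15)
var(eps)   # close to sigma^2 / (1 - phi^2) for a long series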

Using data from Minnesota, we looked at the sample ACF and sample PACF plots of the residuals of the Poisson regression model (5.5). These models include a 4th degree polynomial for time, and dummy variables for season. To illustrate, Figure

5.6 shows the sample ACF and sample PACF plots for the burglary crime series in Robbinsdale, Minnesota (data seen in Figure 5.5). We see that there is a great amount of autocorrelation in the data, but this is not surprising. Then in Figure 5.7, we see the sample ACF and sample PACF plots of the residuals from model (5.5).

These plots suggest that we have accounted for a majority of the seasonality present in the data by including a fixed season covariate effect. In this example, then, the observed seasonal effect on crime counts over time does not significantly depart from the model’s estimates. These plots also suggest that there is some autocorrelation unaccounted for in the model, as seen in the sample PACF value around 0.3 at lag 1.

We expect that by including an autoregressive AR(1) error term in the model, we will appropriately account for the dependency in the data. This provides some validation for Equation 5.6, although we note that our AR(1) model will be fit on the log mean, not on the observed counts as implied by our residual plots.
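The residual diagnostics behind Figures 5.6 and 5.7 can be reproduced in outline with the standard acf() and pacf() functions in R, assuming fit is the Poisson GLM from the earlier sketch and y is its response.

## Sample ACF/PACF of the counts and of the Poisson GLM residuals.
par(mfrow = c(2, 2))
acf(y,  lag.max = 80, main = "Sample ACF of counts")
pacf(y, lag.max = 80, main = "Sample PACF of counts")
r <- residuals(fit, type = "pearson")
acf(r,  lag.max = 80, main = "Sample ACF of GLM residuals")
pacf(r, lag.max = 80, main = "Sample PACF of GLM residuals")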


Figure 5.6: Sample ACF and sample PACF plots of the burglary counts in Robbinsdale, MN.

5.2.4 Specification of Prior Distributions of the Parameters

Specifying a prior distribution for the parameters is a critical step in Bayesian modeling. In this section, we will discuss how we choose prior distributions for each of our model parameters. We distinguish among three types of parameters in our model: regression parameters, autocorrelation parameters, and variance parameters

Figure 5.7: Sample ACF and sample PACF plots of the residuals of the model for burglary counts in Robbinsdale, MN.

(later in Chapter 6 we will also extend this to include missing data). First, we discuss the two broad types of prior distributions: noninformative, and informative.

Prior distributions are commonly classified as noninformative or informative. A noninformative prior implies that there is no prior knowledge about the parameter.

In general, noninformative priors will be improper distributions, i.e., they do not integrate

to 1, and much research has been done on this topic of finding noninformative pri-

ors. Usually we know the support for the parameters, but noninformative priors are

useful when we do not know very much else about the parameters, apart from the

data. BUGS requires that all prior distributions be proper, so we do not consider

noninformative priors for our model.

Informative priors sometimes make use of expert knowledge or historical data.

Chen and Ibrahim [2000] illustrate how informative priors, based on historical data,

can be used in a time series model for count data. Specification of the prior dis-

tributions will depend on whether we have any prior belief about the parameters.

With very little knowledge about what these might be, we can consider vague priors.

Gelman [2006] characterizes a prior distribution as weakly informative if it is proper

yet weak with respect to prior knowledge. An alternative approach which uses data

to estimate prior parameters is called Empirical Bayes [Gelman et al., 2004].

In many regression problems, normal priors are given to the regression parameters.

Normal priors are commonly used because by using a normal approximation to the

generalized linear model likelihood we can mimic the methods developed for the

hierarchical normal linear model [Gelman et al., 2004]. We can use the following

prior distribution for a regression parameter $\beta^{(h)}_{j,k}$:
\[
\beta^{(h)}_{j,k} \sim \mathrm{Normal}\bigl(\mu_{\beta^{(h)}_j}, \sigma^2_{\beta^{(h)}_j}\bigr),
\]
with $\mu_{\beta^{(h)}_j}$ and $\sigma^2_{\beta^{(h)}_j}$ known, or following hyperprior distributions. A noninformative prior for $\beta^{(h)}_{j,k}$ may also be considered, and will result in the classical analysis of generalized linear models [Gelman et al., 2004]. A possible alternative prior for the

regression parameters is the t-distribution, which allows for heavier tails, but we have

not attempted to use prior t-distributions in our research.

As described in Section 5.2.1, we assume that the parameters in our model are related hierarchically, because they come from a common prior distribution for reporting agencies within the same grouping. To construct this hierarchy, we need to decide upon any relationships among parameters for different crimes across agencies. Using our exploratory analysis involving independent Poisson GLMs for each of the three crimes in each of thirteen jurisdictions in Minnesota, we can summarize the relationship among crime-level parameters. Figure 5.8 shows that there is strong correlation in the intercept coefficients among burglary, larceny, and vehicle theft for these jurisdictions. We will put a multivariate normal prior distribution on the intercept coefficients. Letting $\alpha_j = (\alpha_{j,1}, \alpha_{j,2}, \alpha_{j,3})$ be the vector of intercepts for the three crimes in agency $j$, we will assume $\{\alpha_j : j = 1, 2, \ldots, J\}$ are conditionally independent with

\[
\alpha_j \mid \mu_\alpha, \Sigma_\alpha \sim \mathrm{MVN}_3(\mu_\alpha, \Sigma_\alpha)
\]

for each j, where J is the number of agencies in the county.

Figure 5.9 shows distributions of the Poisson regression coefficients. We notice that

for the intercept parameter called alpha, larceny is highest, followed by burglary, then

vehicle theft. This ordering likely corresponds to the ordering of crime rates. Larceny

makes up the highest percentage of property crimes, while vehicle theft makes up the

lowest. Looking at the plots for the other regression coefficients corresponding to the

four Hermite polynomials, the betas, we see that there is much less distinction in these

parameters among crime types. We must be careful when inferring relationships from

these plots because the boxplots, most of which do overlap, group all thirteen agencies

together. Therefore, similarities (or differences) in parameters among crime types

may not be the same for every agency. As we expect, the bottom-right plot, which

Figure 5.8: Plots show correlations among alpha coefficients for the three different crimes using the thirteen cities in MN. Pearson correlations (0.96, 0.97, and 0.98) are shown in the lower triangle.

shows the distributions of each beta for all three crimes combined on the same scale, shows that coefficients for higher order terms become increasingly smaller. When specifying our prior distributions for these model parameters, we can see from Figure

5.9 that normal distributions are reasonable distributional assumptions because the boxplots are fairly symmetric about the means. Later, as we think about specifying distributions for our hyperparameters, these plots will again be helpful.



Figure 5.9: Boxplots grouped by crime type show the distributions of the model coefficients for the thirteen agencies in MN. The last plot shows the ungrouped boxplots of the beta coefficients on the same scale.

Since we are dealing with time series data in the UCR, our models will include

some type of autocorrelation parameter. In the Bayesian setting, we must decide on

a prior distribution to assign to this parameter. First, we consider the conditions for

stationarity. For an AR(1) process, stationarity is maintained when the autoregressive

parameter, $\phi$, is between $-1$ and $+1$. We can either constrain our autoregressive coefficients to be stationary by bounding the prior at $-1$ and $+1$, or we can choose unconstrained priors. Working with time series count data, Hay and Pettitt [2001], for example, use a Uniform($-1$, $1$) prior for the autoregressive parameter. Congdon [2001] notes that if we choose unconstrained priors, then we can test whether stationarity is appropriate or not. This requires changing the distribution of ǫ1 in our model

(5.6) to allow for non-stationarity. We can actually calculate the probability that an

AR(1) model is stationary. For our model, we will use an informative beta prior for

φ, with support on (0,1), based on the Empirical Bayes approach. This implies that

we are constraining the latent process to be stationary and we have knowledge that the

autocorrelation is positive. We chose a beta distribution that would put more density

away from 1. Figure 5.10 shows the probability density function of a beta(5, 10)

random variable. Putting a beta(5, 10) prior on φj,k would bring φj,k away from 1.

We have seen, through trial and error, that the φj,k parameters had a tendency to stay

close to 1, thereby picking up the trend in our data and distorting the effects of other

variables in our model. By pulling φj,k away from 1, we are allowing the Hermite

polynomials to model the trend in our data.
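For reference, the Beta(5, 10) prior can be inspected directly in R; the short sketch below plots its density and reports its mean and a central 95% interval, which makes clear how much of the prior mass sits away from 1.

## The Beta(5, 10) prior for the AR coefficients phi.
curve(dbeta(x, shape1 = 5, shape2 = 10), from = 0, to = 1,
      xlab = "phi", ylab = "Prior density")
5 / (5 + 10)                       # prior mean = 1/3
qbeta(c(0.025, 0.975), 5, 10)      # central 95% prior interval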

Now we consider prior distributions for the variance parameters in our model.

We model variance through both the variance of the regression parameters $\beta^{(h)}_{j,k}$ and the variance of the latent process $\{\epsilon_{j,k,t}\}$. If our regression parameter $\beta^{(h)}_{j,k}$ has a


Figure 5.10: Beta(5,10) probability distribution.

normal prior with mean $\mu_{\beta^{(h)}_j}$ and variance $\sigma^2_{\beta^{(h)}_j}$, then we will also want to put a prior distribution on $\sigma^2_{\beta^{(h)}_j}$. There are various noninformative and vague priors for $\sigma^2_{\beta^{(h)}_j}$ that are seen in the literature. Gelman et al. [2004] use an improper uniform density on the positive real numbers. Widely used is the proper conditionally conjugate inverse-gamma(ν, ν) distribution, with ν taking on a low value such as 0.001. For hierarchical models, Gelman [2006] suggested that the choice of noninformative prior for a variance parameter can have a large effect on inferences if the number of groups is small or the group-level variance is small. Gelman [2006] does not recommend using the inverse-gamma(ν, ν) distribution because it does not result in a proper limiting posterior distribution. As ν goes to 0, the posterior distribution becomes improper. This means that the posterior inferences could be sensitive to the choice

of ν. The inverse-gamma(ν, ν) prior performs especially poorly for small $\sigma^2_{\beta^{(h)}_j}$, since ν will be set low. For a proper prior, Gelman [2006] recommends using a uniform distribution with a very wide range for the standard deviation $\sigma_{\beta^{(h)}_j}$. For our model, we use $\sigma_{\beta^{(h)}_j} \sim \mathrm{Uniform}(0, 10)$. We have defined this distribution in terms of constant parameter values, a step in our prior specification process that we now discuss.

Once the prior distributions of the model parameters have been decided on, as well as hyperprior distributions of the hyperparameters, we still need further specification of the parameter values for the highest prior distributions. For example, if we have chosen a normal distribution for $\mu_{\beta^{(h)}_j}$, we must now specify the mean and variance of this distribution. We choose fixed values that depend on how informative we would like to be about each $\mu_{\beta^{(h)}_j}$. Our technique can be described as Empirical Bayes, since we set these parameter values based on our results from our exploratory analysis with

Minnesota agencies. Substantial caution needs to be taken in this approach, since we are effectively “double dipping” into the crime data set. Specifically, upon inspection of plots in Figure 5.9, we have knowledge about the parameter values. For example, we see that the average α coefficient for burglary is about 3, so we choose µα for

burglary to be centered at 3 in its prior distribution. We assume that the variance is

1 for each µα, and we allow for correlation among means for the different crimes.

Now that we have determined what prior distributions to use for our parameters,

we can specify our complete model. This is shown in detail in the next section.

5.3 A Proposed Model

Let us consider the three property crime series (burglary, larceny, vehicle theft)

for each of J agencies in one county in one state. We now specify a model for these

J groups of observations. Afterward, we will describe how to extend this hierarchical model to include a county level and a state level. Let $Y_{j,k,t}$ represent agency $j$'s crime count for crime type $k$ in month $t$. For $j = 1, 2, \ldots, J$, $k = 1, 2, 3$, and $t = 1, 2, \ldots, 516$, conditional on parameters $(\alpha_{j,k}, \{\beta^{(h)}_{j,k} : h = 1, 2, \ldots, 4\}, \{\eta^{(s)}_{j,k} : s = 1, 2, 3\})$ and the latent autoregressive process $\{\epsilon_{j,k,t'} : t' \leq t\}$, let

\[
Y_{j,k,t} \overset{\text{cond. indep.}}{\sim} \mathrm{Poisson}(\mu_{j,k,t}),
\quad \text{where } \log(\mu_{j,k,t}) = \alpha_{j,k} + \sum_{h=1}^{4} \beta^{(h)}_{j,k} x^{(h)}_t + \sum_{s=1}^{3} \eta^{(s)}_{j,k} w^{(s)}_t + \epsilon_{j,k,t}.
\]
The four Hermite polynomials for time are represented by the covariates $\{x^{(1)}_t, x^{(2)}_t, x^{(3)}_t, x^{(4)}_t\}$, and the dummy variables for season are represented by the covariates $\{w^{(1)}_t, w^{(2)}_t, w^{(3)}_t\}$ for Fall, Winter, and Spring respectively (Summer is contained in the intercept). The latent autoregressive process for agency $j$ and crime $k$ is represented by $\{\epsilon_{j,k,t} : t = 1, \ldots, 516\}$.

For agency $j$, the intercept coefficients for the three crimes are assumed to follow a trivariate normal prior distribution. Conditionally on $\mu_\alpha$ and $\Sigma_\alpha$, each intercept vector $\alpha_j$ ($j = 1, 2, \ldots, J$) is drawn from the same density:
\[
\alpha_j \overset{\text{indep.}}{\sim} \mathrm{MVN}_3(\mu_\alpha, \Sigma_\alpha).
\]

The hyperprior distribution of µα is assumed trivariate normal and Σα is assumed to

be inverse Wishart, which will model the covariance of intercept coefficients among

the three crimes. We assume that µα and Σα are drawn from independent hyperprior

distributions such that
\[
\mu_\alpha \sim \mathrm{MVN}_3\!\left( \begin{pmatrix} 3 \\ 4 \\ 2 \end{pmatrix},\; 5\begin{pmatrix} 1 & .5 & .5 \\ .5 & 1 & .5 \\ .5 & .5 & 1 \end{pmatrix} \right)
\quad \text{and} \quad
\Sigma_\alpha \sim \mathrm{InvWish}\!\left( 5\begin{pmatrix} 1 & .5 & .5 \\ .5 & 1 & .5 \\ .5 & .5 & 1 \end{pmatrix},\; 3 \right).
\]

The remaining regression parameters are given normal prior distributions. For each $h$, the parameters for the trend $\{\beta^{(h)}_{j,k} : j = 1, 2, \ldots, J \text{ and } k = 1, 2, 3\}$ are independent conditional on $\mu_{\beta^{(h)}_j}$ and $\sigma_{\beta^{(h)}_j}$, and the parameters for the seasons $\{\eta^{(s)}_{j,k} : j = 1, 2, \ldots, J \text{ and } k = 1, 2, 3\}$ are independent conditional on $\mu_{\eta^{(s)}_j}$ and $\sigma_{\eta^{(s)}_j}$. For $h = 1, 2, 3, 4$, $j = 1, 2, \ldots, J$, and $k = 1, 2, 3$, let
\[
\beta^{(h)}_{j,k} \overset{\text{indep.}}{\sim} \mathrm{Normal}\bigl(\mu_{\beta^{(h)}_j}, \sigma_{\beta^{(h)}_j}\bigr)
\quad \text{and} \quad
\eta^{(s)}_{j,k} \overset{\text{indep.}}{\sim} \mathrm{Normal}\bigl(\mu_{\eta^{(s)}_j}, \sigma_{\eta^{(s)}_j}\bigr).
\]

We assume that for each $h$ and $j$, $\mu_{\beta^{(h)}_j}$ is independent from $\sigma_{\beta^{(h)}_j}$. We assume a normal prior distribution for each $\mu_{\beta^{(h)}_j}$ and a uniform prior distribution for each $\sigma_{\beta^{(h)}_j}$. For $h = 1, 2, 3, 4$ and $j = 1, 2, \ldots, J$, let
\[
\mu_{\beta^{(h)}_j} \overset{\text{indep.}}{\sim} \mathrm{Normal}(0, 1)
\quad \text{and} \quad
\sigma_{\beta^{(h)}_j} \overset{\text{indep.}}{\sim} \mathrm{U}(0, 10).
\]

We assume that for each $s$ and $j$, $\mu_{\eta^{(s)}_j}$ is independent from $\sigma_{\eta^{(s)}_j}$. We assume a normal prior distribution for each $\mu_{\eta^{(s)}_j}$ and a uniform prior distribution for each $\sigma_{\eta^{(s)}_j}$. For $s = 1, 2, 3$ and $j = 1, 2, \ldots, J$, let
\[
\mu_{\eta^{(s)}_j} \overset{\text{indep.}}{\sim} \mathrm{Normal}(0, 1)
\quad \text{and} \quad
\sigma_{\eta^{(s)}_j} \overset{\text{indep.}}{\sim} \mathrm{U}(0, 10).
\]

The latent process is modeled as an AR(1) process with autoregressive parameter

$\phi_{j,k}$. Conditional on $\phi_{j,k}$, $\{\epsilon_{j,k,t'} : t' < t\}$, $\sigma_{\epsilon_{j,k}}$, and the hyperparameters, the $\{\epsilon_{j,k,t}\}$ are independent. For $j = 1, 2, \ldots, J$, $k = 1, 2, 3$, and $t = 2, 3, \ldots, 516$, let
\[
\epsilon_{j,k,t} \mid \epsilon_{j,k,t-1} \overset{\text{indep.}}{\sim} \mathrm{Normal}(\mu_{\epsilon_{j,k,t}}, \sigma_{\epsilon_{j,k}}),
\qquad \sigma_{\epsilon_{j,k}} \overset{\text{indep.}}{\sim} \mathrm{U}(0, 1),
\qquad \mu_{\epsilon_{j,k,t}} = \phi_{j,k}\, \epsilon_{j,k,t-1}.
\]

The initial starting condition for $t = 1$, $j = 1, 2, \ldots, J$, and $k = 1, 2, 3$ is
\[
\epsilon_{j,k,1} \overset{\text{indep.}}{\sim} \mathrm{Normal}\bigl(0, \sigma^{(1)}_{\epsilon_{j,k}}\bigr),
\quad \text{where } \sigma^{(1)}_{\epsilon_{j,k}} = \frac{\sigma_{\epsilon_{j,k}}}{\sqrt{1 - \phi_{j,k}^2}}.
\]
The autoregressive parameters $\phi_{j,k}$ are assumed to follow beta distributions. This implies positive autocorrelation, with a stationary process for the autoregressive component conditional on $\phi_{j,k}$. Conditional on $a_\phi$ and $b_\phi$, the $\{\phi_{j,k}\}$ are independent. For $j = 1, 2, \ldots, J$ and $k = 1, 2, 3$, let
\[
\phi_{j,k} \overset{\text{indep.}}{\sim} \mathrm{Beta}(a_\phi, b_\phi),
\quad \text{with} \quad a_\phi \sim \mathrm{Gamma}(50, 10), \quad b_\phi \sim \mathrm{Gamma}(100, 10).
\]
Now, suppose we wanted to extend the above hierarchical model to include all counties and all states. Going up a level, we can assume, for example, that the county parameter $\mu_{\beta^{(h)}_j, v}$ for county $v$ can be thought of as a sample from a common population distribution for all counties in the state. Note that our indexes should now be nested: agency is nested in county, and county is nested in state. This means that agency 1 in county 1 is different from agency 1 in county 2. We can assume that

\[
\mu_{\beta^{(h)}_j, v} \mid \theta_{1,u}, \theta_{2,u} \overset{\text{indep.}}{\sim} \mathrm{Normal}(\theta_{1,u}, \theta_{2,u}).
\]

Now θ1,u and θ2,u are state-level parameters for state u. These parameters also can be drawn from population distributions. For example,

\[
\theta_{1,u} \mid \zeta_1, \zeta_2 \overset{\text{indep.}}{\sim} \mathrm{Normal}(\zeta_1, \zeta_2).
\]

Of course, the exchangeability assumption within each agency, county, and state is important as we move up each level in the hierarchical structure. The benefit of this structure is that parameters can borrow strength across groups, and estimates will be pulled toward the overall means.
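To make the generative structure of the model concrete, the following R sketch forward-simulates monthly counts for one county from a simplified version of the specification above. The hyperparameter values are illustrative stand-ins chosen so that the simulated counts stay in a realistic range; they are not the vague priors used for fitting, and this is not the BRugs/OpenBUGS code used in our analysis.

## Forward simulation from a simplified version of the Section 5.3 model.
set.seed(4)
J <- 4; K <- 3; N <- 516
t <- 1:N; z <- (t - 258.5) / 149.1
X <- cbind(2 * z, 4 * z^2 - 2, 8 * z^3 - 12 * z, 16 * z^4 - 48 * z^2 + 12)
month <- rep(1:12, length.out = N)
W <- cbind(month %in% 9:11, month %in% c(12, 1, 2), month %in% 3:5) * 1

mu_alpha <- c(3, 4, 2)                   # rough intercept levels: BUR, LAR, VTT
Y <- array(NA_real_, dim = c(J, K, N))
for (j in 1:J) {
  for (k in 1:K) {
    alpha <- rnorm(1, mu_alpha[k], 0.5)  # agency/crime intercept
    beta  <- rnorm(4, 0, 0.02)           # Hermite trend coefficients
    eta   <- rnorm(3, 0, 0.1)            # Fall, Winter, Spring effects
    phi   <- rbeta(1, 5, 10)             # AR(1) coefficient
    sig   <- runif(1, 0, 0.2)            # innovation sd of the latent process
    eps <- numeric(N)
    eps[1] <- rnorm(1, 0, sig / sqrt(1 - phi^2))
    for (tt in 2:N) eps[tt] <- rnorm(1, phi * eps[tt - 1], sig)
    mu <- exp(alpha + X %*% beta + W %*% eta + eps)
    Y[j, k, ] <- rpois(N, mu)
  }
}
str(Y)   # J x K x N array of simulated monthly crime counts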

5.4 Checking the Model

When using any MCMC techniques, as discussed in Section 5.1, our first check, after running our model fitting program in BRugs, is always to see if our iterative

Markov chain appears to have converged to the posterior distribution. This is not a model check, but it is completely necessary before any meaningful model checks can be performed. For each parameter in the model, we can produce a trace plot which shows what value of the parameter is sampled at each iteration. Examining trace plots takes practice, and it is not always obvious what to look for in them. If the trace plot appears to wander, then the iterative procedure may still be searching for the posterior distribution. If a trace plot suggests that convergence has occurred, our samples may still be misrepresenting the correct distribution. To avoid this mistake, we employ a second check. We can set different initial values for the parameters. If two chains with different starting values appear to converge to the same distribution, we are more confident that we are sampling from the correct posterior distribution.

We also looked at the sample ACF and sample PACF plots from the sequence of posterior samples. From these plots, we may be able to assess if by thinning our sample we are drawing independent samples from the posterior distribution.
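In practice we carry out these checks with functions from the coda package in R. The sketch below assumes chain1 and chain2 are matrices of posterior draws (one column per monitored parameter) from two runs with different initial values; simulated stand-ins are used here so the sketch runs on its own.

## Convergence and mixing diagnostics with coda.
library(coda)

set.seed(5)
chain1 <- matrix(rnorm(2000), ncol = 2, dimnames = list(NULL, c("phi", "sigma")))
chain2 <- matrix(rnorm(2000), ncol = 2, dimnames = list(NULL, c("phi", "sigma")))

chains <- mcmc.list(mcmc(chain1), mcmc(chain2))
traceplot(chains)       # do the two chains mix and agree?
gelman.diag(chains)     # potential scale reduction factors near 1
autocorr.plot(chains)   # sample ACF of the draws, used to choose thinning
effectiveSize(chains)   # effective number of independent draws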

Once we have verified that we are sampling from the posterior distribution, we can now evaluate the model. First, there are several graphical checks. We can look at the data along with the fitted values to inspect visually how well the model fits the data. Figure 5.11, for example, shows the fitted values of the model, represented by posterior means (connected by lines), along with the actual observed data (circles).

The model appears to fit the data well. Another model check is to summarize posterior distributions graphically. Figures 5.12 and 5.13 show examples of these posterior


Figure 5.11: The solid line shows posterior means for the distribution of each $\mu_t$ ($t = 1, 2, \ldots, 516$) for the burglary series in Robbinsdale, MN. (Actual values shown as circles.)

summary plots by looking at the distributions of the hyperparameters for parameters

alpha ($\alpha_{j,k}$) and beta1 ($\beta^{(1)}_{j,1}$ for burglary). We can see that posterior distribution summaries are in agreement with results from our exploratory analysis of the data using independent Poisson regression models.

We can check if our model has adequately accounted for the dependence in the data by observing the sample ACF plot of the residuals. Figure 5.14 is a Bayesian

ACF plot, in the sense that each ACF value has a distribution, and is represented by a boxplot. From this plot, there does not appear to be any obvious autocorrelation


Figure 5.12: Posterior summaries of mean alpha (µα) coefficients for each crime. X marks the estimate from the independent Poisson GLMs.

Figure 5.13: Posterior summaries of mean beta1 ($\mu_{\beta^{(1)}_j}$) coefficients for each agency. X marks the estimate from the independent Poisson GLMs.

unaccounted for by the model. Together, these graphical checks suggest that our model is appropriate and useful for the UCR data.


Figure 5.14: Sample ACF plot of residuals from Bayesian model, represented by boxplots.

Besides graphical checks, there are also numerical checks one can make. Gelman et al. [2004] use discrepancy measures to assess discrepancy between the model and the data. Some discrepancy measures are the posterior predictive p-values, χ2 discrepancy, deviance, or the predictive L measure in Chen and Ibrahim [2000]. We will use numerical model checking tools for assessing our model in the next chapter.

For our problem, the most important goal is to make good estimates of the missing values. Having our main objective clearly defined helps us to determine which model checks will be most meaningful for our situation. We will evaluate our model by

assessing the accuracy of our imputations, though we do not describe these model checks in detail in this chapter. In Chapter 6, we first describe our imputation technique in the Bayesian setting. Then we will illustrate our imputation strategy using real data from the UCR and simulated missingness. Finally, we will check our model based on its ability to predict missing values.

CHAPTER 6

BAYESIAN IMPUTATION

As we have emphasized several times, our research is motivated by a real-life problem: missing data in the UCR. We ourselves have no expertise in using the data to answer criminological questions, but we know that other researchers do, yet have been restricted by the presence of missing data. Our goal has been to develop a method for filling in the missing data with plausible values, thereby facilitating the wide use of the UCR data in research. Our intent is to provide researchers a strategy for imputing missing values. Ideally, we would like to provide completely

filled-in UCR data for public use, but this step is beyond the scope of the current research goals.

Thus far, our research path has taken us to the point where we have built an adequate hierarchical model (Chapter 5) of the UCR data, at least for one county.

Now, we discuss using this model to make imputations for missing values. In the area of Bayesian inference, missing data are treated in the same way as parameters, since both are uncertain [Gelman et al., 2004]. Our imputations, or more generally, inferences concerning the missing values, will be based on the posterior distributions that we simulate.

The EM algorithm, described in Chapter 3, is an iterative procedure for comput-

ing maximum likelihood estimates when the data are incomplete. Estimates of the

missing values can be found by maximizing the joint density of the missing values

and observed values with respect to the missing values. Taking a Bayesian approach,

we are interested in finding the posterior predictive distribution (Equation 5.2) of

the missing values conditional on the observed data, rather than only the maximum

likelihood estimates of the missing values.

With a posterior distribution for each missing value, we can provide results in the

form of probability statements about the missing values. If the posterior distribution

is difficult to calculate directly, we can approximate the distribution by drawing many

samples from the distribution. Typical point estimates of missing values are means

or modes of the posterior distributions. The variance of a missing value corresponds

to the variance of its posterior predictive distribution.

Before we can find the appropriate posterior predictive distributions, we must

construct a model for the missing data (Section 6.1). After discussing our model, we

will describe the Bayesian methods used to deal with missingness in Section 6.2. We

evaluate our model and method in various ways in Section 6.3.

6.1 A Model for Missing Data

In Section 5.3, we constructed a Bayesian hierarchical model for the UCR data.

This model assumes that all data are observed. In this section, we extend the model specification in Section 5.3 to incorporate missing observations. Clarifying the missingness assumptions is a necessary step to appropriately utilize missing data methods.

First, let us define some notation. Let Yj,k,t be agency j's crime count for crime type k at time t and let Y represent the collection of all Yj,k,t. We define a new variable to indicate whether the response is observed or not. Let

\[
R_{j,k,t} = \begin{cases} 1 & \text{if } Y_{j,k,t} \text{ is observed;} \\ 0 & \text{if } Y_{j,k,t} \text{ is missing,} \end{cases}
\]
and let $R$ represent the collection of all $R_{j,k,t}$. Using the notation in Gelman et al.

[2004], let 'obs' = $\{(j, k, t) : R_{j,k,t} = 1\}$ index the observed components of $Y$. Also, let 'mis' = $\{(j, k, t) : R_{j,k,t} = 0\}$ index the missing components of $Y$. Now, $Y = \{Y_{\mathrm{obs}}, Y_{\mathrm{mis}}\}$ represents the complete set of observed response data and missing response data.

We now review the concept of missing at random (MAR), first mentioned in Chap-

ter 2. Let us assume that the elements of R follow some distribution with parameter

vector θR. The distribution of R is called the missing data mechanism or the response

mechanism. Also, let θY represent the parameter vector of the distribution for the

elements of Y . Now, the joint distribution of R and Y is

\[
f(R, Y \mid \theta_R, \theta_Y) = f(Y \mid \theta_Y)\, f(R \mid Y, \theta_R). \tag{6.1}
\]

The data are MAR if the distribution of the missing data mechanism does not depend

on the missing values [Little and Rubin, 2002], that is, if

\[
f(R \mid Y, \theta_R) = f(R \mid Y_{\mathrm{obs}}, \theta_R) \quad \text{for all } Y_{\mathrm{mis}}. \tag{6.2}
\]

In dealing with missing data, the concept of ignorability is very important. Having

an ignorable missing data mechanism basically means that we can use the observed

data to make inference on the parameters related to Y . Ignorability has two conditions

in Rubin [1976]. The first condition is that the data are MAR. The second condition

is that the parameters for Y and R are distinct. This means that the joint prior distribution of Y and R factors into independent marginal priors, so the models for Y and R are separate, with non-overlapping parameters. If the missing data mechanism is ignorable, then inferences about θY or Ymis can be made based on the posterior

distributions $\pi(\theta_Y \mid Y_{\mathrm{obs}})$ and $\pi(Y_{\mathrm{mis}} \mid Y_{\mathrm{obs}})$, respectively.

Now we will propose a model for the missing data mechanism for the UCR data.

First, let us consider the probability distribution of R. It is unlikely that every

observation has equal probability of being missing. For example, if we know that

a crime count is missing in April, then we may be more likely to believe the crime

count will be missing for May also. Furthermore, if we know that the burglary count

is missing for an agency in April, then we suspect that the agency’s larceny count

may also be missing in April. This implies that the Rj,k,t are not independent, but

neither are the Yj,k,t. What is important to MAR, however, is not that the Rj,k,t

be independent, but rather that R be independent from Ymis. We have mentioned

earlier our concern of the possibility that an agency might decide to not report ‘0’

crime counts. This would indeed make the MAR assumption invalid. In fact, any

situation of missingness depending on the missing values themselves would negate

the MAR assumption. We argue that the MAR assumption is difficult to prove or to

disprove because the missing values are unobservable. We believe that the missingness

is generally unintentional, and is caused by factors unrelated to the missing values

themselves. It is reasonable to assume that the data are MAR as we present our

imputation strategy in this chapter. For ignorability, we also assume that the prior

distributions of Y and R are distinct.

At some point, if we must assume that the data are not missing at random

(NMAR), then our missing data mechanism is non-ignorable. In this case, however, we can still make imputations. The multiple imputation paradigm, which we describe and use in the next section, does not require or assume that missingness is ignorable [Schafer, 1999]. To make imputations, we first would need to specify the missing data mechanism; i.e., the probability distribution of R. This distribution may give a lower probability of response to '0' crime counts, for instance. In any case, modeling R would certainly require more work, and the benefit of including a non-ignorable model for missing data may or may not be substantial.

6.2 Missing Data Methods

We now review Bayesian methods used to make inference with incomplete data.

We begin with the Data Augmentation (DA) algorithm, in Section 6.2.1, as we introduce the basic strategy of dealing with missing data in the Bayesian setting. In

Section 6.2.2, we describe how the Gibbs’ sampler can produce the same results as

Data Augmentation. Finally, in Section 6.2.3, we discuss multiple imputation, which is incorporated into Data Augmentation, but can also be used as a unique way to present imputation results.

6.2.1 Data Augmentation Algorithm

Data Augmentation refers generally to the strategy of augmenting the observed data to make it easier to analyze. In the missing data problem, DA refers to augmenting the observed data with the missing data. Congdon [2001, Chapter 6] observes that estimation of all missing data models is implicitly or explicitly based on augmentation of the observed data to account for the missing latent data. Tanner and Wong

[1987] showed how DA can be used to calculate posterior distributions in situations

where the joint distribution of parameters and observed data is difficult to calculate.

The DA algorithm has many similarities with the EM algorithm, an iterative

missing data method described in Chapter 2. The basic idea for an iterative missing

data method is that we alternate between filling in the missing values, and refitting

the model, until convergence. The goal of the EM algorithm is to provide maximum

likelihood estimates of the parameters, but the goal of DA is to approximate the

posterior distributions of the parameters. The EM algorithm has two steps, the E step

and M step (see Chapter 2), and the DA algorithm has two analogous steps. The I step

involves imputation and the P step involves sampling from the posterior distributions.

At each iteration, DA draws a sequence of samples from the distributions of the

parameters and missing data.

The DA algorithm is given in Tanner and Wong [1987] with the following steps.

Let Y and θY be defined as in Section 6.1 and let us assume the data are MAR.

(a) I Step (Imputation Step):
    (a1) Draw $\theta'$ from the current estimate of $\pi(\theta \mid Y_{\mathrm{obs}})$.
    (a2) Draw $Y_{\mathrm{mis}}$ with density $\pi(Y_{\mathrm{mis}} \mid Y_{\mathrm{obs}}, \theta')$.
    Repeat steps (a1) and (a2) M times.

(b) P Step (Posterior Step): Set the posterior density of $\theta$ equal to the mixed distribution, mixed over the M imputed values of $Y_{\mathrm{mis}}$.

Iterate between the I step and the P step until convergence.

In step (a), given the current guess of the posterior of θ given Yobs, we generate a

sample of M latent data patterns from the predictive distribution of Ymis given Yobs.

Step (a) involves multiple imputation, which we discuss in Section 6.2.3. Steps (a1)

and (a2) will generate a draw from the predictive density $\pi(Y_{\mathrm{mis}} \mid Y_{\mathrm{obs}}, \theta')$. Step (b) involves the computation of the posterior distribution of θ based on the augmented data sets. Step (b) updates the posterior distribution of θ, given $Y_{\mathrm{obs}}$, to be the mixture

of the M augmented data posteriors. The algorithm iterates between the imputation

step and the posterior step until convergence.

Little and Rubin [2002, Chapter 10] give a different version of DA where step (a) does not involve multiple imputation and step (b) does not involve multiple posterior sampling. Instead, the I step involves taking one draw from the predictive density of

Ymis given Yobs and the current guess of θ, and the P step involves taking one draw

from the posterior distribution of θ given Yobs and the current guess of Ymis. At its

completion, both versions of the DA algorithm will produce samples from the joint

posterior distribution of Ymis and θ given Yobs, but the second version is slower.

6.2.2 The Gibbs’ Sampler

The Data Augmentation algorithm in Little and Rubin [2002] is a special case of the Gibbs' sampler. We described the Gibbs' sampler in Chapter 5 as an MCMC method of simulating from the posterior distribution when we have multiple parameters, and the full conditionals are easier to draw from than the joint distribution. Now we are treating Ymis as another unknown parameter. The DA algorithm is essentially sampling from conditional distributions, just like Gibbs' sampler. The results from

DA can easily be obtained by the versatile Gibbs’ sampler, especially when we have multiparameter models.

To illustrate the Gibbs’ sampler, let us consider a simple example of one crime

series and one missing crime count, Ymis. Let us assume that our model has param-

eters (θ1,...,θp). In the Gibbs' sampler, at iteration i we make the following draws

[Little and Rubin, 2002]:
\[
\begin{aligned}
Y_{\mathrm{mis}}^{(i+1)} &\sim p(Y_{\mathrm{mis}} \mid Y_{\mathrm{obs}}, \theta_1^{(i)}, \ldots, \theta_p^{(i)}) \\
\theta_1^{(i+1)} &\sim p(\theta_1 \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}^{(i+1)}, \theta_2^{(i)}, \ldots, \theta_p^{(i)}) \\
\theta_2^{(i+1)} &\sim p(\theta_2 \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}^{(i+1)}, \theta_1^{(i+1)}, \theta_3^{(i)}, \ldots, \theta_p^{(i)}) \\
&\;\;\vdots \\
\theta_p^{(i+1)} &\sim p(\theta_p \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}^{(i+1)}, \theta_1^{(i+1)}, \ldots, \theta_{p-1}^{(i+1)}).
\end{aligned}
\]
After convergence of the sampling sequence, the result will be a draw from the joint distribution of $(Y_{\mathrm{mis}}, \theta_1, \ldots, \theta_p \mid Y_{\mathrm{obs}})$. OpenBUGS will incorporate a Gibbs' sampling technique to give a Markov chain of samples for each parameter and each missing value.

One strategy is to take one draw from the posterior predictive distribution of Ymis for each run of the Gibbs’ sampler. Running the Gibbs’ sampler M times will give us M independent imputations of the missing values. We will describe how these M sets of imputations can be used in the next section.
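A toy example may help fix ideas. The R sketch below runs a Gibbs' sampler for a deliberately simple model, not our UCR model: normal data with known variance 1, a vague normal prior on the mean, and a single missing observation, alternating between an I-step draw of the missing value and a P-step draw of the mean.

## Toy Gibbs' sampler treating one missing observation as an unknown.
set.seed(6)
y_obs <- rnorm(11, mean = 5, sd = 1)   # observed data
n <- length(y_obs) + 1                 # one value is missing

S <- 5000
theta <- numeric(S); y_mis <- numeric(S)
theta[1] <- mean(y_obs); y_mis[1] <- mean(y_obs)
for (s in 2:S) {
  y_mis[s] <- rnorm(1, mean = theta[s - 1], sd = 1)      # I step: draw Y_mis
  ybar <- (sum(y_obs) + y_mis[s]) / n
  post_var  <- 1 / (n + 1 / 100^2)                       # prior: Normal(0, 100^2)
  post_mean <- post_var * n * ybar
  theta[s] <- rnorm(1, post_mean, sqrt(post_var))        # P step: draw theta
}
mean(y_mis[-(1:1000)])   # posterior predictive mean of the missing value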

6.2.3 Multiple Imputation

Following successful simulation of the posterior predictive distributions of the missing values, we can now use these distributions to make imputations. If we were interested in single imputation, we could choose to take the posterior mode or mean as imputations. We could also estimate the uncertainty of the imputation from the posterior predictive distribution. Alternatively, we can consider making multiple imputations. Multiple imputation was seen in step (a) of the DA algorithm, where

$Y_{\mathrm{mis}}^{(m)} \sim \pi(Y_{\mathrm{mis}} \mid Y_{\mathrm{obs}})$, for $m = 1, 2, \ldots, M$, were M draws from the posterior predictive distribution of the missing values.

Multiple imputation (MI) [Rubin, 1987] is a technique designed for the purpose of estimating population parameters, and their variances, using incomplete data. The idea is that we create more than one set of predictions for the missing values to reflect imputation variability. We make multiple draws from the posterior predictive distribution of the missing values, use these draws as imputations, and use the M completed datasets to make inference on the population parameters. The inferences made from the M completed datasets are combined to form composite estimates of the parameters. MI improves upon single imputation because it incorporates an estimate of uncertainty due to missing values. Little and Rubin [2002, page 90] suggest that a small set of MIs (M ≤ 10) can provide very good inferences. For our problem, we are not so much concerned with estimating model parameters as with estimating the missing values themselves. MI is only justified for our situation if it can provide benefit for potential users and analysts of the UCR.

As a final product, we envision providing a number of completed UCR datasets.

Criminologists and other users of the UCR should be trained on how to perform analysis on the M completed datasets. In this way, rather than provide inferences ourselves for the missing values, we will leave inference in the hands of the researchers.

We now illustrate multiple imputation using an example from the UCR. Let us consider fitting the model specified in Section 5.3 using four agencies in Hennepin

County, Minnesota. In St. Louis Park, Minnesota, we have removed the larceny crime counts in 1988, and we use the technique of multiple imputation to make 6 completed datasets. In Figure 6.1, the first plot shows the entire crime series, with the removed year of data represented by a dotted line. The second plot shows boxplots representing the posterior predictive distributions estimated from our model for the


Figure 6.1: The first plot shows the larceny crime series for St. Louis Park, Minnesota, with data from 1988 removed. The second plot shows posterior predictive distributions for the missing data with the actual removed data represented by the dotted line.

twelve “missing” months in 1988. We can clearly see from this plot how the seasonality

in our model goes into our imputations. We are predicting the highest larceny counts

in the summer months and the lowest larceny counts in the winter months. In this

example, it is more difficult to see how trend is incorporated into our imputations

because the trend is relatively flat in 1988.

Random samples are drawn from the approximate posterior distributions to obtain the multiple imputations. Figure 6.2 shows results from multiple imputation. The evident variability among sets of imputations is an indication of the uncertainty of our predictions for the missing values. By using all six completed datasets, inferences on the model for the crime series will take into account the uncertainty in our imputed values. For example, if a researcher wanted to estimate St. Louis Park's average monthly larceny count during the 1980's, the following multiple imputation calculations could be made. For the six completed datasets, the average monthly larceny counts during the 1980's are 125.48, 125.96, 126.46, 124.39, 124.31, and 125.40. So our combined estimate is the average of these six averages, which is 125.48. The total variability associated with our combined estimate is $0.73 \times (7/6) = 0.85$, where 0.73 is the between-imputation variance and $(7/6)$ is the finite-$M$ adjustment, $(1 + \tfrac{1}{M})$. The rest of the data is a census, so there is no sampling error component of the multiple

imputation variance as there would be in typical uses. The result is helpful because

we have a variance estimate (0.85) for our estimated average monthly larceny count

during the 1980’s in St. Louis Park (125.48) via a simple calculation.
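Written as a small R function, the combining rule used above averages the M completed-data estimates and inflates the between-imputation variance by the finite-M factor. Applied to the six averages listed above, it reproduces the between-imputation variance of about 0.73 and the total of about 0.85.

## Multiple-imputation combining rule for the census setting described above.
combine_mi <- function(est) {
  M    <- length(est)
  qbar <- mean(est)          # combined point estimate
  B    <- var(est)           # between-imputation variance
  list(estimate = qbar, total_variance = B * (1 + 1 / M))
}

## six completed-data averages from the St. Louis Park example
combine_mi(c(125.48, 125.96, 126.46, 124.39, 124.31, 125.40))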



Figure 6.2: There are six sets of imputations shown here for St. Louis Park, MN. The imputed values are represented by circles, and the actual data, removed, is represented by the dotted line.

6.3 Evaluating the Method

In this section, we will evaluate our model based on the accuracy of its predictions

for missing values. Obviously, we cannot test our methods on data that is unknown,

because we will not know how accurate they are, so we must simulate missing values

in a complete data series. Graphically, we can see how close the imputed values are to

the actual values. This visual check will tell us if our imputation method appears to

be reasonable. The plots in Figures 6.1 and 6.2 show that our imputations appear to

do well. We also use data from Crystal, MN to illustrate our imputation performance,

shown in Figure 6.3. Also, we can look at plots of prediction errors for any unusual

patterns, as we would with any residual plot. The prediction error distribution is

the actual value subtracted from the posterior distribution of the missing value. For

example, in the prediction error plot in Figure 6.3, the prediction errors appear to float

around the zero line. This is an indication that our method is giving us reasonably

unbiased predictions, and that our model is doing well to predict the missing values.

Besides graphical checks, we can also do model checking using numerical discrepancy measures. We will look at several criteria for measuring the deviation between the predicted values and the actual values. We will use these discrepancy measures to compare three competing imputation methods, each based on a different model. Let $Y_{j,k,t}$ be agency $j$'s count of crime $k$ at time $t$. If we have $P$ missing values, let $p$ represent the index set $\{j, k, t\}$ of the $p$th missing value and let $\hat{Y}_p$ be a

prediction for Yp. For P predictions, the average bias (BIAS) will be estimated by

\[
\mathrm{BIAS} = \frac{1}{P} \sum_{p=1}^{P} \bigl(Y_p - \hat{Y}_p\bigr). \tag{6.3}
\]


Figure 6.3: Removed data in 2000 is represented by dotted line. The second plot shows posterior distributions of the missing values for 2000.

We will estimate mean square prediction error (MSPE) by

\[
\mathrm{MSPE} = \frac{1}{P} \sum_{p=1}^{P} \bigl(Y_p - \hat{Y}_p\bigr)^2. \tag{6.4}
\]
To estimate goodness of fit (GOF), we use the statistic defined by

\[
\mathrm{GOF} = \sum_{p=1}^{P} \frac{\bigl(Y_p - \hat{Y}_p\bigr)^2}{Y_p}. \tag{6.5}
\]
Now we consider three imputation methods. The first method uses the Bayesian hierarchical model in Section 5.3, and we impute the posterior mean for each missing value. The second method uses an independent Poisson regression model with the four Hermite polynomials for time and the three dummy variables for season. We compute predicted mean values based on this model for our imputations. The third method uses two SARIMA $(1, 1, 1) \times (1, 0, 0)_{12}$ models and makes predictions using the forecast-backcast composite estimate for the imputations. We use the three property crime series from four agencies in Hennepin County, Minnesota for our example.

We simulated missingness by randomly selecting 23 years of missing data among the

12 crime series. We then made imputations using our three imputation methods. For the Bayesian model, we ran MCMC for 60,000 iterations. After a burn-in of 10,000 iterations, we sampled every 20th iteration to get 2,500 posterior samples.
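The three discrepancy measures in Equations 6.3 through 6.5 are simple to compute; the R sketch below defines them as functions of the actual and predicted values, using illustrative stand-in numbers rather than the Hennepin County results.

## Discrepancy measures for comparing imputation methods.
bias_fn <- function(y, yhat) mean(y - yhat)
mspe_fn <- function(y, yhat) mean((y - yhat)^2)
gof_fn  <- function(y, yhat) sum((y - yhat)^2 / y)   # requires nonzero actual counts

## illustrative stand-in values
y    <- c(42, 55, 61, 70)
yhat <- c(40.1, 57.3, 59.8, 72.5)
c(BIAS = bias_fn(y, yhat), MSPE = mspe_fn(y, yhat), GOF = gof_fn(y, yhat))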

Figure 6.4 shows results from the three imputation methods used on the larceny crime series in Crystal, Minnesota. We see that the three imputation methods produce similar predictions, and all are very close to the actual crime counts. Table 6.1 shows the results of our comparison of the three imputation methods using the discrepancy statistics defined above. The average bias is very low for all three methods. Both the MSPE and GOF are lower for the independent Poisson GLM. The Bayesian method and SARIMA method have comparable results on both MSPE and GOF.

These results suggest that the hierarchical model does not show great improvement over the much simpler Poisson regression or the SARIMA time series model. We do understand, however, that these results cannot be generalized to the whole UCR, since we are only using three crimes for four agencies in one county in this particular study. It may be true, for instance, that the Poisson GLM is highly appropriate for these series, but may perform poorly in other series. Questions remain for further investigation, and we discuss these in Chapter 7.


Figure 6.4: Predictions from the three methods in Crystal’s larceny series. Actual crime counts for 1976 are represented by the dotted line.

Model                        Method                  BIAS    MSPE     GOF
Bayesian Hierarchical        Posterior Means        -0.19   125.17   1174.6
Indep. Poisson Regression    Predicted Mean Values  -0.19   118.81   1124.5
SARIMA (1,1,1) x (1,0,0)_12  Forecast-Backcast       0.08   126.20   1173.3

Table 6.1: Comparison of three imputation methods on 23 years of missing data.

6.3.1 Worst Case Scenario

One characteristic of a good imputation method is that it performs well in the most

extreme cases. We can assess how well our model performs in the worst situations by

simulating missing data for months when crime counts appear to be unpredictably

extreme. We now look at the vehicle theft crime series from Richfield, Minnesota,

shown in the first plot in Figure 6.5. We chose to remove the crime counts in 1979

from the data series, because of the unusually large range of crime counts during that

year, represented by a dotted line. The range of vehicle theft counts goes from 1 in

January to 43 in September. This happens to be the largest range within any one

year seen in the crime series. If the 1979 data were missing from the Richfield UCR

series, some of the monthly missing values would be very difficult to predict, so this

example represents a worst case scenario for imputation.

We fit our hierarchical Bayesian model, and the second plot in Figure 6.5 shows the imputed posterior means for each missing value, represented by circles, and the 2.5th and 97.5th posterior percentiles, represented by plus signs. These predictions appear to be very reasonable; if the crime counts from 1979 were truly unknown, we would feel very comfortable about our predictions. However, Figure 6.5 also shows how poor our predictions are for several months of the year when compared with the actual values. We completely miss the high true crime count for September, of course, but our predictions for January, February, and June are also poor.

For Richfield, we noticed that both larceny and burglary were also high in September 1979. Our model, however, does not model the dependence among crime counts for different crimes; instead, we only model dependence among the model parameters for different crimes. These parameters capture the overall crime pattern, but do not capture extreme or unusual crime counts. We do not see this as a weakness in our model, however, because it is too optimistic to think that we can predict every isolated spike. Our model, remember, emphasizes the basic components of overall crime trends and seasonality.

If crimes are generally higher in September, then we should predict higher crime counts for values that are missing in September. Figure 6.5 shows that we are predicting slightly higher crime counts in the Fall than in the Winter. Still, suspecting that our model was not capturing the seasonality appropriately, we reexamined the seasonality in the raw data by visual inspection. Figure 6.6 shows crime counts labeled by Fall and Winter in the first plot and by January and September in the second plot. In these plots, we see that the seasonality is not consistently strong over time. Given the weak seasonality we have observed, it is not surprising that the extremely high spike in vehicle theft in September 1979 cannot be adequately predicted. However, we do believe that there is some room for improvement. Our model in Section 5.3 does not learn about seasonality across crimes within an agency, but we discuss this idea in Chapter 7. For example, if Richfield's seasonality were seen more strongly in the larceny counts, then it is conceivable that we could make a better prediction for the vehicle theft count in September 1979.

[Figure 6.5: three panels of Richfield, MN vehicle theft counts. First panel: the full series, 1960–2000. Second panel: 1978–1981, with imputed values for 1979. Third panel: posterior distributions of the missing values, by month.]

Figure 6.5: Imputation in Richfield, MN. Posterior means are represented by circles. 95% posterior intervals are represented by plus signs. The actual data removed are represented by the dotted line. Boxplots show the posterior distributions of the missing values based on 2,500 samples.

[Figure 6.6: two scatter plots of Richfield, MN vehicle theft counts over 1960–2000, with points labeled by season in the first plot and by month in the second.]

Figure 6.6: In the first plot, Fall crime counts are labeled by "F" and Winter crime counts are labeled by "W". In the second plot, September crime counts are labeled by "S" and January crime counts are labeled by "J".

Even so, we must resign ourselves to the unfortunate conclusion that some missing crime counts will be impossible for our model to predict accurately. We will occasionally have high prediction errors just due to random variation, and some unobservable factors that cause crime will have an effect too great to go unnoticed, yet too obscure to be predicted. We are not willing to inflate the prediction variances for the sake of the worst cases; doing so would forfeit information from a good model. Instead, we argue that our methods do well, and that the variances are appropriate for the vast majority of missing gaps in the UCR.

CHAPTER 7

CONCLUSION AND FURTHER WORK

We dedicate our final chapter to looking back and to looking ahead. We will reflect upon the successes and limitations of our research, make recommendations to those in related areas of research, and propose new goals for the future.

The main accomplishment of our research is the application of statistical techniques to a practical problem. Statistics is a practical science, and is therefore advanced through application. Driven by the problem of missing data in the UCR, our research represents an innovative investigation into imputation techniques based on hierarchical models. In the end, we have made a notable improvement over the currently used imputation procedure, and we have opened the door wider for further research.

The FBI’s imputation procedure is based on simplistic, and often unrealistic, assumptions about the UCR data, and ignores several other factors which we find important. The FBI’s method implicitly assumes that crime counts for one year are independent of crime counts from previous or subsequent years. It also assumes that agencies within a population Group share the same average crime rate per capita.

Another assumption in the FBI's method is that there is no difference in monthly crime counts within a year. We remember, however, that the FBI was only concerned with making annual crime count estimates, so monthly seasonality was not important for their method. Unlike the original developers of the FBI's UCR imputation method in 1958, we are interested in monthly crime count estimates, and we have developed a strategy to impute for missing crime counts so that inferences on agency-level monthly data can be made.

We now summarize our research conclusions and present some thoughts on future research. The development of our imputation procedure has involved two important steps. In the first step, we construct models for our data. In the second step, we use our models to make imputations. We give final remarks on these two steps in Sections 7.1 and 7.2, respectively. In Section 7.3 we discuss future goals for this research.

7.1 Final Remarks on Modeling

From a modeling perspective, there are several important characteristics of the UCR data. First, the data set has a unique hierarchy that is able to organize over 650 million data points, representing a census of crime counts covering the entire country. Second, the data are in the form of crime counts ranging from zero to thousands. Third, the data are collected longitudinally at over five hundred time points, and are therefore dependent over time. Fourth, the data are multivariate, so several variables are measured by each agency. Fifth, the data set is mottled with missing gaps of variable lengths. We have constructed our model to incorporate each of these five characteristics. We have made a hierarchical model to structure the groupings of data. We have assumed the Poisson distribution for count data. We have successfully modeled the time series components of trend and seasonality in the UCR. We have also included a latent process to model the residual autocorrelation. In our multivariate response (for crime counts), since crime types within an agency are correlated, we have allowed agencies' crime series to learn from each other about the intercept parameters. Finally, we have clarified assumptions concerning the missing data.

One reason our research improves on previous studies of UCR missing data is that we have invested greater effort in justifying the assumptions about the missing data mechanism. The imputations will not be accurate unless we assume that the data are missing at random or find another model for the missing data mechanism. Having the data be missing at random is important because it allows us to make an ignorability assumption about our response indicator. Unfortunately, we cannot show that the UCR data are missing at random, but we can discuss the evidence that supports our MAR claim (see Section 6.1). Still, our claim, whether MAR or NMAR, might not have a large influence on our inferences about the missing values.

One way to test the sensitivity of the MAR assumption is to conduct an experiment in which missingness is simulated both at random and not at random, and then to compare the prediction error for the imputed values. On the other hand, if we want to assume NMAR, we have some intuition about the distribution of missingness in the UCR. If we see a missing gap in a crime series with very large crime counts, then we expect the data to be missing at random. If we see a missing gap in a crime series with very sparse crime counts, we are more likely to doubt the MAR assumption. An improvement on our model would be to construct a model for the response indicator. For example, such a model could give a higher probability of being missing for a zero crime count than for a non-zero crime count. We might also consider a model for the response indicator that depends on other factors, if we believe that those factors affect whether or not a crime count is observed. More research is needed, therefore, in strengthening the assumptions about the missing data mechanism in the UCR, but we have made progress toward this objective.
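As a rough illustration of the proposed sensitivity experiment, the following R sketch simulates one seasonal Poisson series, deletes values under a completely-random mechanism and under a count-dependent mechanism in which zero counts are more likely to be missing, and compares the MSPE of a simple seasonal-mean imputation under each mechanism. The seasonal-mean imputer and all numerical settings are illustrative stand-ins, not the methods or rates used in this dissertation.

```r
set.seed(1)
n     <- 12 * 20                               # 20 years of monthly counts
month <- rep(1:12, times = 20)
mu    <- exp(2 + 0.5 * sin(2 * pi * month / 12))
y     <- rpois(n, mu)                          # simulated "true" crime counts

miss_mar  <- rbinom(n, 1, 0.10) == 1                        # missing completely at random
miss_nmar <- rbinom(n, 1, ifelse(y == 0, 0.40, 0.05)) == 1  # zeros more likely missing

# Impute a missing month by the mean of the observed counts for that calendar month.
impute_seasonal_mean <- function(y, miss, month) {
  monthly_means <- tapply(y[!miss], month[!miss], mean)
  imp <- y
  imp[miss] <- monthly_means[as.character(month[miss])]
  imp
}

mspe_missing <- function(actual, imputed, miss) mean((actual[miss] - imputed[miss])^2)

imp_mar  <- impute_seasonal_mean(y, miss_mar,  month)
imp_nmar <- impute_seasonal_mean(y, miss_nmar, month)
c(MAR  = mspe_missing(y, imp_mar,  miss_mar),
  NMAR = mspe_missing(y, imp_nmar, miss_nmar))
```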

A cold hard truth in modeling is that there is no perfect model. Some failure is inevitable in modeling, and we review some notable weaknesses of our model here. The primary weakness is that our model has only been tested on a chosen subset of UCR data. We do not yet have solid support to recommend the implementation of our model to all counties across the country. We anticipate that our model will need to be more flexible in order to accommodate a wider variety of UCR data. What might be a very good model for a crime series in one agency may be a very poor model for a different agency.

One possible adjustment to our model is to include indicators for whether or not a model term is important. For example, the fourth Hermite polynomial might be important for some data series, but not for others. In our model, the coefficient for the fourth Hermite polynomial is represented by $\beta_{j,k}^{(4)}$ for agency $j$ and crime $k$. For agency $j = 1, 2, \ldots, J$ and crime $k = 1, 2, 3$, let

\[
\delta_{j,k}^{(4)} \sim \mathrm{Bernoulli}\bigl(\lambda_{j,k}^{(4)}\bigr), \quad \text{where } \lambda_{j,k}^{(4)} \sim \mathrm{Uniform}(0, 1).
\]

Now if we attach $\delta_{j,k}^{(4)}$ to $\beta_{j,k}^{(4)}$, then the fourth Hermite polynomial will only be included in the model with probability $\lambda_{j,k}^{(4)}$. It is also possible to learn about model complexity across crimes and agencies by fitting a logistic regression model on $\lambda_{j,k}^{(4)}$, for example.
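As a minimal prior-predictive sketch of this inclusion-indicator idea for a single agency and crime, the R code below draws λ from its Uniform(0, 1) prior, draws the indicator δ given λ, and multiplies δ into the coefficient of the fourth polynomial term. The quartic expression used for the fourth Hermite polynomial, the standardized time grid, and the normal prior on the coefficient are all illustrative assumptions, not the dissertation's specification.

```r
set.seed(1)
n_sim   <- 5000
lambda4 <- runif(n_sim)               # lambda ~ Uniform(0, 1)
delta4  <- rbinom(n_sim, 1, lambda4)  # delta | lambda ~ Bernoulli(lambda)

t_std <- seq(-2, 2, length.out = 100) # a standardized time grid (assumed)
x4    <- t_std^4 - 6 * t_std^2 + 3    # one common form of the fourth Hermite polynomial
beta4 <- rnorm(n_sim, 0, 0.1)         # an illustrative prior on the coefficient

# The quartic term contributes to the linear predictor only when delta4 == 1,
# i.e., with probability lambda4 for each simulated series.
quartic_contribution <- outer(delta4 * beta4, x4)
mean(delta4)                          # marginal inclusion probability, close to 0.5
```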

The idea is that, with a more flexible model, we will be able to model different types of series. There are some crime series that have very low crime counts. We have not tried our Bayesian model on these crime series, but we have seen in Chapter 4 that series with sparse crime counts do not always exhibit trend and seasonality.

Even if we decide that different series should have different models, we must still answer the question of how models for each agency should be related to one another. Perhaps the correlations among crimes differ among agencies. Perhaps there are spatial dependencies among agencies within a county or among counties within a state. These are questions to which we would still like to find answers.

Allowing for more flexibility in our model for different series, we may also need to make other adjustments. In 1960, the first year of the data we use, some agencies had not yet been created; other agencies ceased to exist at some point before 2002. We have seen in Chapter 4 that there are also other types of special cases in the UCR. Some agencies' crimes are reported by, or 'covered by', another agency for a period of time. Some agencies report aggregated crime counts rather than monthly crime counts. All of these are considerations that will require adjustments to our model if it is to be a truly comprehensive model of the crime counts in the UCR. Fortunately, Bayesian models are very flexible and are capable of handling diverse types of data.

In Chapter 6, we compared imputations from three models, using property crime data from four agencies in Hennepin County, Minnesota. Based on our discrepancy statistics using prediction error, we found that the independent Poisson regression model did slightly better than the Bayesian hierarchical model and the SARIMA model. We expected the Bayesian hierarchical model to perform better than the Poisson GLM because it specified hierarchical priors for pooling strength across agencies, and it included autocorrelated error terms to account for dependence among observations. This unexpected result may have occurred because of shrinkage in the parameter estimates, which is inherent in hierarchical models, or because our time series component is minimizing our seasonal effects. Another possibility is that we simply need to try our models on more series to see the actual differences among the models.

7.2 Final Remarks on Imputations

In our research, all of our imputation methods are based on models for the data. Our imputations will only be as good as our models, so the majority of our research focus has been on building good models of UCR crime counts. As we have seen in Chapter 6, our imputations do not accurately predict extreme or unusual crime counts, but this is a shortcoming that may be difficult to fix.

For classical time series models, an imputation method based on forecasting and backcasting is an intuitive way to estimate a string of missing values within a data series. When we have different patterns of missingness, implementation of the forecast-backcast method becomes troublesome, as we have seen, because we must perform forecasting and backcasting for each missing gap, sometimes based on different models, depending on which data are used. The algorithm used to perform forecast-backcast imputation can become very inefficient if the number of missing gaps is large. We must also be careful if we are using imputed values to estimate a model used to impute for other gaps. The advantage of the forecast-backcast method is that, for complicated SARIMA models, forecasts and associated standard errors are obtained automatically using the arima function in R. The disadvantage of the forecast-backcast method is that determining the weights of the forecast and backcast used to create the composite estimate is not straightforward. We used an ad hoc strategy for determining the weights: predictions with lower standard errors were given more weight in the composite estimate, but in general this does not produce the BLUP (best linear unbiased predictor).
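A minimal R sketch of this forecast-backcast strategy for a single interior gap is given below. It assumes a monthly ts object y, a vector gap of the indices of the missing months, and the SARIMA(1,1,1)×(1,0,0)₁₂ form used in our comparison; the backcast is approximated by reversing the later segment of the series before fitting, and the inverse-variance weighting is one simple version of the ad hoc weighting described above, not necessarily the exact scheme used in the dissertation.

```r
impute_gap <- function(y, gap) {
  # Segments before and after the interior gap (gaps at the ends of the series not handled).
  before <- window(y, end   = time(y)[min(gap) - 1])
  after  <- window(y, start = time(y)[max(gap) + 1])

  ord  <- c(1, 1, 1)
  sord <- list(order = c(1, 0, 0), period = 12)

  # Forecast forward across the gap from the earlier segment.
  fwd <- predict(arima(before, order = ord, seasonal = sord), n.ahead = length(gap))

  # Approximate the backcast by reversing the later segment and forecasting.
  bwd <- predict(arima(rev(as.numeric(after)), order = ord, seasonal = sord),
                 n.ahead = length(gap))
  back_pred <- rev(as.numeric(bwd$pred))
  back_se   <- rev(as.numeric(bwd$se))

  # Inverse-variance weights: lower standard error gets more weight in the composite.
  w <- (1 / fwd$se^2) / (1 / fwd$se^2 + 1 / back_se^2)
  as.numeric(w * fwd$pred + (1 - w) * back_pred)
}
```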

For observation-driven Poisson regression models, the forecast-backcast approach can be used for nonconsecutive missing crime counts, but we have seen that making imputations from these models under a Bayesian approach is much simpler and more reliable. Bayesian imputation uses the entire observed series, and possibly other series, simultaneously, whereas the forecast-backcast approach uses a model for each section of the data series.

In Chapter 6, we considered making imputations based on predicted values from independent Poisson regression models. These models, which capture dependence only through a polynomial for time and dummy variables for season, were shown to do very well in imputation for the series in Hennepin County. This result may suggest that, for some series, autoregressive error terms are not necessary if season and a polynomial for time are already in the model.
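A minimal R sketch of this style of imputation follows, assuming a data frame d with columns count (NA for the missing months), t (a standardized time index), and season (a four-level factor); these names are illustrative. Orthogonal polynomials from poly() stand in here for the Hermite polynomials used in our models; both choices give a degree-four polynomial trend.

```r
# Fit the independent Poisson regression on the observed months
# (glm drops the rows with missing counts by default).
fit <- glm(count ~ poly(t, 4) + season, family = poisson, data = d)

# Impute each missing month with its predicted mean value.
miss <- is.na(d$count)
d$count[miss] <- predict(fit, newdata = d[miss, ], type = "response")
```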

Using a Bayesian model, we can simulate a posterior distribution for each missing value, and we can use this distribution to make imputations in several different ways. We recommend using these posterior distributions to make multiple draws, that is, multiple imputations, for the missing crime counts. One advantage of providing multiple imputations to researchers is that they can make inference according to their own objectives. Another advantage of multiple imputation is that the imputations are integers and resemble actual crime counts. Of course, this means that imputations should be used with special caution to avoid treating them as actual observed values. We suggest that if multiple imputations are provided for public use, then some instruction should be given on the proper way to use the multiple completed data sets in analysis.
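In R, producing m completed data sets from stored posterior draws might look like the sketch below, where post is assumed to be a matrix of posterior draws for the missing counts (one column per missing cell) and ucr is the data frame with NA at those cells; both names are hypothetical.

```r
m     <- 5
draws <- post[sample(nrow(post), m), , drop = FALSE]   # m randomly chosen posterior draws

# Build m completed data sets, each with one draw filled in for the missing counts.
completed <- lapply(seq_len(m), function(i) {
  out <- ucr
  out$count[is.na(out$count)] <- draws[i, ]
  out
})
```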

7.3 Further Work

In this section, we propose a research agenda for the future. The first goal we describe is related to the modeling step in our imputation procedure. The second goal is related to the testing of our imputation procedure.

Further research will involve a continuation of model construction. The model in Section 5.3 is incomplete because only the intercept parameters are learning across crimes. Hierarchically, the regression parameters are pooling strength across agencies, but there is no correlation structure for regression parameters among crimes within an agency. With this in mind, we can think about changing our model in Section 5.3 in the following way.

Using simplified notation, let $\vec{x}^{(1)}$ be a vector of 1's of length $N$. The four Hermite polynomials for time are represented by the covariates $\{\vec{x}^{(2)}, \vec{x}^{(3)}, \vec{x}^{(4)}, \vec{x}^{(5)}\}$, and the dummy variables for season are represented by $\{\vec{x}^{(6)}, \vec{x}^{(7)}, \vec{x}^{(8)}\}$. The latent process for agency $j$ and crime $k$ is represented by $\{\epsilon_{j,k,t} : t = 1, \ldots, N\}$ and is specified as in Section 5.3. For agency $j = 1, 2, \ldots, J$, crime $k = 1, 2, 3$, and $t = 1, 2, \ldots, N$, let

\[
Y_{j,k,t} \mid \{\beta_{j,k}^{(p)} : p = 1, 2, \ldots, 8\},\ \{\epsilon_{j,k,t'} : t' \le t\} \ \overset{\text{cond. indep.}}{\sim}\ \mathrm{Poisson}(\mu_{j,k,t}),
\]

where

\[
\log(\mu_{j,k,t}) = \sum_{p=1}^{8} \beta_{j,k}^{(p)} x_t^{(p)} + \epsilon_{j,k,t}.
\]

For agency $j$, the regression parameters for the three crimes are assumed to follow a multivariate normal prior distribution. Let $\vec{\beta}_j^{(p)} = (\beta_{j,1}^{(p)}, \beta_{j,2}^{(p)}, \beta_{j,3}^{(p)})$. Conditionally on $\vec{\mu}_\beta^{(p)}$ and $\Sigma_\beta^{(p)}$, each $\vec{\beta}_j^{(p)}$ is drawn from the same density:

\[
\vec{\beta}_j^{(p)} \sim \mathrm{MVN}_3\bigl(\vec{\mu}_\beta^{(p)}, \Sigma_\beta^{(p)}\bigr) \quad \text{for } j = 1, 2, \ldots, J \text{ and } p = 1, 2, \ldots, 8.
\]

The hyperprior distribution of $\vec{\mu}_\beta^{(p)}$ is also multivariate normal, and $\Sigma_\beta^{(p)}$ is assumed to be inverse Wishart, which will model the covariance of model coefficients among the three crimes. For example, we might take

\[
\vec{\mu}_\beta^{(p)} \sim \mathrm{MVN}_3\!\left( \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix},\ \begin{pmatrix} 1 & .5 & .5 \\ .5 & 1 & .5 \\ .5 & .5 & 1 \end{pmatrix} \right), \quad \text{for } p = 1, 2, \ldots, 8,
\]

and

\[
\Sigma_\beta^{(p)} \sim \mathrm{InvWish}\!\left( \begin{pmatrix} 1 & .5 & .5 \\ .5 & 1 & .5 \\ .5 & .5 & 1 \end{pmatrix},\ 3 \right), \quad \text{for } p = 1, 2, \ldots, 8.
\]

learn from each other about crime trend and seasonality. We expect that this model

will be a better representation of the correlation that exists among crimes in an

agency.
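To make the hyperprior layer concrete, here is a small R sketch that draws one set of crime-specific coefficients for a single agency and polynomial term under the example hyperpriors above. It uses only base R; the inverse-Wishart draw is obtained by inverting a Wishart draw, and the helper rmvnorm1 is an illustrative function, assuming the (scale matrix, degrees of freedom) parameterization written above.

```r
set.seed(1)
# Common exchangeable correlation matrix used for both V0 (hyper-mean covariance)
# and S0 (inverse-Wishart scale) in the example above.
V0 <- S0 <- matrix(c(1, .5, .5,
                     .5, 1, .5,
                     .5, .5, 1), nrow = 3)

# One draw from MVN(mu, V) via the Cholesky factor of V.
rmvnorm1 <- function(mu, V) as.numeric(mu + t(chol(V)) %*% rnorm(length(mu)))

mu_beta    <- rmvnorm1(rep(0, 3), V0)                   # hyper-mean across the three crimes
W          <- rWishart(1, df = 3, Sigma = solve(S0))[, , 1]
Sigma_beta <- solve(W)                                  # inverse-Wishart(S0, 3) draw
beta_j     <- rmvnorm1(mu_beta, Sigma_beta)             # coefficients for crimes k = 1, 2, 3
```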

One possible reason the hierarchical model did worse than the other methods in Chapter 6 is that we have not allowed regression parameters other than the intercept to learn across crimes within an agency. Another reason might be that learning across crimes is only effective when the series needing imputation is largely missing. The FBI's method, for example, only looks to other series when the first series is mostly missing. This idea brings us to our second area for further research.

A second area of future work is the continued testing of our imputation procedure. We would like to see how our imputation method performs for all UCR series. We know, for example, that our methods have not been investigated to a great extent on series with low crime counts. We expect our method to produce reasonable predictions for the missing values, regardless of the size of the agencies' populations, but we would like to show evidence supporting the use of our method UCR-wide. We know, however, that there are greater limitations to imputation in series with small crime counts. As we have suggested earlier, we do not expect that monthly imputations will be meaningful on their own for these types of series, but only when combined to create aggregate estimates.

We also have questions concerning other factors that may affect the accuracy of our imputations. Do our imputations do better for short gaps? Do our imputations do better if we have more observed data for other agencies within the county? Will our imputations be reasonable for series that are mostly missing? To answer these questions we can design an experiment. The MSPE statistic is a possible choice for the response variable. One factor might be the gap size, with levels set at 1, 12, and 24 months, for example. Another factor might be the amount of missingness within the series. Other factors might be the amount of missingness in other crime series within the same agency, or the amount of missingness in other agencies. These experimental factors can be set at different levels, and for each treatment combination we would randomly assign missing data according to the level of missingness set for each factor. We can analyze the experimental data using analysis of variance (ANOVA), and we could publish the results, potentially with a caution to UCR users. This future research goal represents a thorough study of the quality of our imputation methods, undertaken so that the community of UCR researchers is responsibly informed.
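The skeleton of such an experiment in R might look like the sketch below. The function run_imputation, the number of replicates, and the levels chosen for the amount of missingness are hypothetical placeholders; only the gap-size levels of 1, 12, and 24 months come from the design described above.

```r
# Full-factorial design over gap size and amount of missingness, with replicates.
design <- expand.grid(gap_size    = c(1, 12, 24),    # months removed per gap
                      pct_missing = c(0.05, 0.20),   # illustrative missingness levels
                      rep         = 1:10)

# run_imputation(gap_size, pct_missing) is assumed to simulate missingness at the
# given settings, impute with the chosen method, and return the resulting MSPE.
design$mspe <- mapply(run_imputation, design$gap_size, design$pct_missing)

# Two-way ANOVA on the MSPE response.
summary(aov(mspe ~ factor(gap_size) * factor(pct_missing), data = design))
```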

The UCR is a data set that is rich with diversity and structure. The many facets of the UCR create endless opportunities for fascinating data exploration and research. One thing we have gained throughout the course of our research is a great appreciation for the UCR data set; its immensity has both humbled and captivated us. Inevitably, our vision for the future involves extracting even more information from the observed UCR data to better predict the missing data. We have begun the process of constructing a comprehensive model for crime counts in the UCR, but we believe that there are still dependencies among UCR data that have not yet been incorporated into the model. Of course, there is a point at which additional model improvement is insubstantial from a practical point of view. Conceivably, UCR imputation will evolve along with the ever-changing UCR. The UCR grows larger every year, and a prospective solution is a model that can anticipate the change and update itself. Whether or not UCR imputations are subject to regular modification, we look forward to learning about the implications this poses for studies involving UCR statistics, and we welcome the challenges it brings to future statistical research.

BIBLIOGRAPHY

B. Abraham. Missing observations in time series. Communications in Statistics, 10:1643–1653, 1981.

A. Agresti. An Introduction to Categorical Data Analysis. Wiley, 1st edition, 1996.

G. Box and G. Jenkins. Time Series Analysis: Forecasting and Control. Holden-Day, 1976.

P.J. Brockwell and R.A. Davis. Introduction to Time Series and Forecasting. 2nd edition, 2002.

M-H. Chen and J.G. Ibrahim. Bayesian predictive inference for time series count data. Biometrics, 56(3):678–685, 2000.

P. Congdon. Bayesian Statistical Modelling. Wiley, 1st edition, 2001.

D.R. Cox. Statistical analysis of time series: Some recent developments. Scandinavian Journal of Statistics, 8:93–115, 1981.

J.D. Cryer. Time Series Analysis. PWS-Kent, 1986.

R.A. Davis, W.T.M. Dunsmuir, and S.B. Streett. Observation-driven models for Poisson counts. Biometrika, 90:777–790, 2003.

R.A. Davis, W.T.M. Dunsmuir, and Y. Wang. Modelling time series of count data. Pages 63–114, 1999.

F.X. Diebold. Serial correlation and the combination of forecasts. Journal of Business and Economic Statistics, 6(1):105–111, 1988.

P.J. Diggle, K-Y. Liang, and S.L. Zeger. Analysis of Longitudinal Data. Oxford University Press, 1994.

F. Dominici, J.M. Samet, and S.L. Zeger. Combining evidence on air pollution and daily mortality from the 20 largest US cities: A hierarchical modelling strategy. Journal of the Royal Statistical Society, Series A: Statistics in Society, 163:263–284, 2000.

F. Dominici, A. Zanobetti, S.L. Zeger, J. Schwartz, and J.M. Samet. Hierarchical bivariate time series models: A combined analysis of the effects of particulate matter on morbidity and mortality. Biostatistics, 5:341–360, 2004.

J. Durbin and S.J. Koopman. Time series analysis of non-Gaussian observations based on state space models from both classical and Bayesian perspectives. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 62:3–56, 2000.

G.D. Faulkenberry. A method of obtaining prediction intervals. Journal of the American Statistical Association, 68(342):433–435, 1973.

FBI. Crime in the United States. U.S. Department of Justice, Washington, D.C., 2000.

A. Gelman. Prior distributions for variance parameters in hierarchical models. Bayesian Analysis, 1(3):515–533, 2006.

A. Gelman, J.B. Carlin, H.S. Stern, and D.B. Rubin. Bayesian Data Analysis. Chapman & Hall, 2nd edition, 2004.

J.K. Ghosh, M. Delampady, and T. Samanta. An Introduction to Bayesian Analysis. Springer, 2006.

M. Ghosh, N. Nangia, and D.H. Kim. Estimation of median income of four-person families: A Bayesian time series approach. Journal of the American Statistical Association, 91(436):1423–1431, 1996.

C.R. Goodall, K. Kafadar, and J.W. Tukey. Computing and using rural versus urban measures in statistical applications. American Statistician, 52:101–111, 1998.

A.C. Harvey. Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, 1989.

T. Hastie and R. Tibshirani. Generalized additive models. Statistical Science, 1(3):297–310, 1986.

J.L. Hay and A.N. Pettitt. Bayesian analysis of a time series of counts with covariates: An application to the control of an infectious disease. Biostatistics, 2(4):433–444, 2001.

JRSA. Status of NIBRS in the states, 2006. URL http://www.jrsa.org/ibrrc/background-status/nibrs_states.html. Accessed December 4, 2007.

J.F. Lawless. Negative binomial and mixed Poisson regression. The Canadian Journal of Statistics, 15(3):209–225, 1987.

W.K. Li. Time series models based on generalized linear models: Some further results. Biometrics, 50(2):506–511, 1994.

R.J.A. Little and D.B. Rubin. Statistical Analysis with Missing Data. Wiley-Interscience, 2nd edition, 2002.

A.M. Lyons, M.W. Packer, Jr., M.B. Thomason, J.C. Wesley, P.J. Hansen, J.H. Conklin, and D.E. Brown. Uniform Crime Report “SuperClean” data cleaning tool. Systems and Information Engineering Design Symposium, 2006.

M.D. Maltz. Bridging gaps in police crime data. Technical Report NCJ-1176365, Bureau of Justice Statistics, Office of Justice Programs, U.S. Department of Justice, Washington, D.C., 1999.

M.D. Maltz. Analysis of missingness in UCR crime data. Technical Report 215343, U.S. Department of Justice, 2006.

M.D. Maltz and J. Targonski. A note on the use of county-level crime data. Journal of Quantitative Criminology, 18:297–318, 2002.

M.D. Maltz and H.E. Weiss. Creating a UCR utility. Technical report, Final Report to the National Institute of Justice, 2006.

P. McCullagh and J.A. Nelder. Generalized Linear Models. Chapman & Hall, 2nd edition, 1989.

J. Nolan, S.M. Haas, T.K. Lester, J. Kirby, and C. Jira. Establishing the ‘statistical accuracy’ of the Uniform Crime Reports (UCR) in West Virginia. Technical report, Criminal Justice Statistical Analysis Center, 2006.

A. Pole, M. West, and J. Harrison. Applied Bayesian Forecasting and Time Series Analysis. Chapman & Hall, 1994.

R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2008. URL http://www.R-project.org. ISBN 3-900051-07-0.

M.R. Rand and C.M. Rennison. True crime stories? Accounting for differences in our national crime indicators. Chance Magazine, 15(1), 2002.

D.B. Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976.

D.B. Rubin. Multiple Imputation for Nonresponse in Surveys. Wiley, 1987.

D.G. Saphire. A longitudinal random effects model with covariates. American Statistical Association - Section, 1991.

J.L. Schafer. Multiple imputation: A primer. Statistical Methods in Medical Research, 8(3):3–15, 1999.

J. Shao and B. Zhong. Last observation carry-forward and last observation analysis. Statistics in Medicine, 22(15):2429–2441, 2003.

D. Spiegelhalter, A. Thomas, N. Best, and W. Gilks. BUGS: Bayesian Inference Using Gibbs Sampling, Version 0.30, 1994. URL citeseer.ist.psu.edu/spiegelhalter95bugs.html.

M.A. Tanner and W.H. Wong. The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82(398):528–540, 1987.

P.A. Thompson and R.B. Miller. Sampling the future: A Bayesian approach to forecasting from univariate time series models. Journal of Business and Economic Statistics, 4(4):427–436, 1986.

N. Ting and A. Brailey. Why last observation carried forward? Or why not? CT Chapter Mini Conference, ASA, 2005.

R.W.M. Wedderburn. Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika, 61(3):439–447, 1974.

E.W. Weisstein. Hermite polynomial, 2008. URL http://mathworld.wolfram.com/HermitePolynomial.html. Accessed April 10, 2008.

S.L. Zeger. A regression model for time series of counts. Biometrika, 75:621–629, 1988.

S.L. Zeger and B. Qaqish. Markov regression models for time series: A quasi-likelihood approach. Biometrics, 44:1019–1031, 1988.