GEOGRAPHIC INFORMATION SYSTEMS FOR SPATIAL DISEASE CLUSTER

DETECTION, SPATIO-TEMPORAL DISEASE MAPPING, AND HEALTH SERVICE

PLANNING

by

PING YIN

(Under the Direction of Lan Mu and Marguerite Madden)

ABSTRACT

Geographic information systems (GIS) are increasingly recognized as an effective and efficient tool to deal with geographic questions in health studies. The overarching research question of this dissertation asks how GIS and spatial analysis can be used to facilitate public health studies. Three aspects of health studies are included: spatial disease cluster detection, spatio-temporal disease mapping, and health service planning. New methods or models are proposed and implemented with GIS in this dissertation to address an important problem in each of the three aspects.

First, a redesigned spatial scan statistic (RSScan) is proposed to quickly detect disease clusters in arbitrary shapes. The experimental results indicate that the RSScan method generally has higher power and accuracy than three existing methods for detecting clusters in irregular shapes. Second, to explore the spatio-temporal patterns of lung cancer incidence risks in Georgia between 2000 and 2007, a total of seven hierarchical Bayesian models are developed and compared at the census tract level using a two-year time period as the temporal unit. The study shows that the northwest region of Georgia has stably elevated lung cancer incidence risks for all the population groups by race and sex. It also shows that there are strong inverse relationships between socioeconomic status and lung cancer incidence risk in males and weak inverse relationships in females in Georgia. Finally, two transportation models that address the modular capacitated maximal covering location problem (MCMCLP) are proposed and used to optimally site ambulances for Emergency Medical Services (EMS) Region 10 in Georgia. As a component of the location-allocation problems for health service planning, spatial demand representation is discussed, and three representation approaches are empirically compared in both problem complexity and representation error.

Results of this dissertation contribute to the advancement of geospatial analysis in disease surveillance and health service decision making. Future research could include using GIS and spatial analysis to improve the accuracy of detected clusters, explore the environmental factors related to the spatio-temporal patterns of lung cancer incidence risks in Georgia, and integrate population movement in health service planning.

INDEX WORDS: GIS, Public health, Cluster detection, Disease mapping, Health planning

GEOGRAPHIC INFORMATION SYSTEMS FOR SPATIAL DISEASE CLUSTER

DETECTION, SPATIO-TEMPORAL DISEASE MAPPING, AND HEALTH SERVICE

PLANNING

by

PING YIN

B.E., Tsinghua University, China, 2002

M.E., Tsinghua University, China, 2005

A Dissertation Submitted to the Graduate Faculty of The University of Georgia in Partial

Fulfillment of the Requirements for the Degree

DOCTOR OF PHILOSOPHY

ATHENS, GEORGIA

2012

© 2012

Ping Yin

All Rights Reserved


Major Professor: Lan Mu, Marguerite Madden

Committee: Xiaobai Yao, Thomas Jordan, John Vena

Electronic Version Approved:

Maureen Grasso

Dean of the Graduate School

The University of Georgia

August 2012

ACKNOWLEDGEMENTS

My five years of Ph.D. study in the Department of Geography at the University of Georgia (UGA) have been a great experience. I am grateful to all those who supported and helped me to finish my dissertation research. First and foremost, my deepest gratitude goes to my major professors, Dr. Lan Mu and Dr. Marguerite Madden, for their excellent guidance and full support. Without their endless input, timely feedback, and great inspiration, I could not have finished my research. I really appreciate their dedication and generous help with my research and other academic activities.

I would like to thank Dr. John Vena in the Department of Epidemiology and Biostatistics at UGA for providing the health data for my research. His invaluable advice from an epidemiological perspective greatly improved my research.

I would also like to acknowledge Dr. Xiaobai Yao and Dr. Thomas Jordan for their insightful advice and suggestions on this research and other academic areas.

I want to thank Dr. Andrew Herod. He made me realize how important correct citations are in academic writing.

The institutions that sponsored my research deserve special notice. They are the UGA Research Foundation and the UGA Graduate School, with the Dean’s Award in Social Sciences and the Dissertation Completion Award.

Finally, I deeply thank my parents and my wife, Jing. It is their unconditional love and endless patience that encouraged me to finish my dissertation.


TABLE OF CONTENTS

ACKNOWLEDGEMENTS

LIST OF TABLES

LIST OF FIGURES

CHAPTER

1 INTRODUCTION AND LITERATURE REVIEW

1.1 Background

1.2 Research Objectives

1.3 Literature Review

1.4 Dissertation Structure

References

2 DETECTING DISEASE CLUSTERS IN ARBITRARY SHAPES WITH A REDESIGNED SPATIAL SCAN STATISTIC

Abstract

2.1 Introduction

2.2 Existing Methods for Detection of Disease Clusters

2.3 Redesigned Spatial Scan Method (RSScan)

2.4 Performance Evaluation

2.5 Application: Georgia Lung Cancer, 1998-2005

2.6 Discussion and Conclusions

References

3 HIERARCHICAL BAYESIAN MODELING OF THE SPATIO-TEMPORAL PATTERNS OF LUNG CANCER INCIDENCE RISKS IN GEORGIA, 2000-2007

Abstract

3.1 Introduction

3.2 Study Area and Data

3.3 Methods

3.4 Results

3.5 Discussions

3.6 Conclusions

References

4 MODULAR CAPACITATED MAXIMAL COVERING LOCATION PROBLEM FOR THE OPTIMAL SITING OF EMERGENCY VEHICLES

Abstract

4.1 Introduction

4.2 Modular Capacitated Maximal Covering Location Problem (MCMCLP)

4.3 Spatial Demand Representation

4.4 Applications: Optimal Siting of Ambulances

4.5 Discussion

4.6 Conclusion

References

5 AN EMPIRICAL COMPARISON OF SPATIAL DEMAND REPRESENTATIONS IN MAXIMAL COVERAGE MODELING

Abstract

5.1 Introduction

5.2 Representation Error in Covering Location Modeling

5.3 The MCLP Model and Problem Complexity

5.4 Service Area Spatial Demand Representation

5.5 Experimental Design

5.6 Results and Discussions

5.7 Conclusions

References

6 CONCLUSIONS

6.1 Summary and Conclusions

6.2 Future Research

References

APPENDICES

I LIST OF ACRONYMS

LIST OF TABLES

Table 2.1: Test statistics and search strategies of four spatial scan methods

Table 2.2: Information of simulated cluster models

Table 2.3: Estimated power of four spatial scan methods (significance level = 0.05)

Table 2.4: Contingency table for detected cluster estimates and true clusters

Table 2.5: KIAs between the most likely clusters and true clusters for four spatial scan methods

Table 2.6: Average Type I error of four spatial scan methods

Table 3.1: Total number of cases of individuals over 20 years old and the percentage of included cases in the analyses by sex and race

Table 3.2: Variables incorporated in the modified Darden-Kamel Composite Index

Table 3.3: Components of logarithms of RRs in the seven Bayesian spatio-temporal models

Table 3.4: DICs of the seven models

Table 3.5: Posterior median (95% CI) of the shared temporal components and differential temporal components

Table 3.6: Posterior median (95% CI) of the RRs for SES quintile

Table 3.7: Correlations between the posterior median RRs using model 2 with two different types of hyperpriors

Table 4.1: Information for roads

Table 4.2: Count of the facilities with varied numbers of ambulances

Table 5.1: Numbers of demand objects in 45 SASDRs

Table 5.2: Numbers of demand objects in all demand representations for comparison

Table 5.3: Minimum numbers of facilities reported by models for covering 100% demand

Table 5.4: Cost and optimality errors between grid-point-based demand representations and SASDRs

Table 5.5: Cost and optimality errors between grid-rectangle-based demand representations and SASDRs

LIST OF FIGURES

Figure 1.1: GIS functions and GIS applications in public health

Figure 1.2: Logical structure of the dissertation research

Figure 2.1: Graph-based representation of a region map

Figure 2.2: Population 2000 by counties in GA in the United States

Figure 2.3: Locations of simulated clusters: (a) circular shape (b) linear shape (c) trifurcate shape

Figure 2.4: Estimated average power of four spatial scan methods

Figure 2.5: Average KIAs of four spatial scan methods

Figure 2.6: SIRs and the detected cluster of lung cancer incidence in GA, 1998-2005

Figure 3.1: Population density by census tract and the 10 most populous cities in Georgia 2000

Figure 3.2: Quintile map of SES in Georgia 2000

Figure 3.3: Maps of crude standardized incidence ratios (SIRs) by race and sex during 2000-2001

Figure 3.4: Maps of the posterior median RRs for white males in each time period

Figure 3.5: Maps of the posterior median RRs for white females in each time period

Figure 3.6: Maps of the posterior median RRs for black males in each time period

Figure 3.7: Maps of the posterior median RRs for black females in each time period

Figure 3.8: Maps of elevated RR frequency by race and sex during 2000-2007

Figure 3.9: Maps of the posterior median of the shared spatial component and differential spatial components

Figure 4.1: Illustration of three demand types: unallocated demand (da and db), covered allocated demand (dc), and uncovered allocated demand (dd)

Figure 4.2: Example of the SASDR with circular facility service area (a) demand space U (the square) and two potential service areas S1 and S2 (the circles) (b) four demand objects in the SASDR result of demand space U partitioned by service areas S1 and S2

Figure 4.3: Population density of Georgia EMS Region 10 (study area) by census block group and existing ambulance facility locations

Figure 4.4: Road network in EMS Region 10 in GA

Figure 4.5: Eight-minute service areas (non-white polygons) of all potential ambulance facility sites (red points) based on the road network

Figure 4.6: SASDR result for the study area with demand (population) distribution

Figure 4.7: Results of the MCMCLP models siting 58 ambulances in 82 potential facility locations with w = 6 × 10^-8 (the facility location is rendered in the same color as its allocation area) (a) the MCMCLP-NFC model (b) the MCMCLP-FC model with 20 facilities

Figure 5.1: Examples of spatial demand representations with (a) census blocks or their centroids, and (b) rectangle grid or its centroids

Figure 5.2: Illustration of overlay operation A▲B: (a) set A and set B (b) the result from A▲B

Figure 5.3: The SASDR with circular facility service area: (a) demand space U and two potential service areas S1 and S2, (b) the partition of demand space U with service area S1, and (c) the partition of demand space U with both service areas S1 and S2

Figure 5.4: Three modes of potential facility sites: (a) regular grid points with spacing R, (b) centroids of census blocks, and (c) intersections of major roads

Figure 5.5: Examples of grid-point-based and grid-rectangle-based demand representations for comparison with SASDR

Figure 5.6: Relationship between Site-Service Index and demand object density in SASDR with circular service coverage

Figure 5.7: Percentages of covered demand reported by the MCLP models with 3 types of demand representations when the configuration of potential facility sites include: (a) 66 grid points, (b) 272 grid points (c) 66 block centroids, and (d) 272 block centroids

CHAPTER 1

INTRODUCTION AND LITERATURE REVIEW

1.1 Background

Because all of these fields keep evolving, the debate over the definitions and scopes of subfields such as “medical geography”, “health geography” and “spatial epidemiology” continues (Brown et al. 2010). However, it cannot be denied that researchers in health, geography, and other fields are paying more and more attention to the geographic component of health, i.e., the question “where”. Where are populations at risk? Where are hotspot areas with elevated disease risks? Where can we intervene to eliminate or reduce disease risks? Where can we locate healthcare facilities to improve health services delivery? Geographic information systems (GIS), which were originally used within the formal discipline of geography, are increasingly recognized as an effective and efficient tool to deal with these geographic questions in research and practice in epidemiology and public health (Rushton 2003, Najafabadi 2009, Nykiforuk and Flaman 2011, Cromley and McLafferty 2012).

More than 150 years ago, early public health professionals learned that maps could be used to explore patterns of diseases and relationships between diseases and risk factors. In 1840, Robert Cowan used a map to show the relationship between fever and overcrowding in Glasgow (Melnick 2002). The famous story about John Snow, one of the fathers of modern epidemiology, is often used in current textbooks on epidemiology, disease mapping and GIS to illustrate one of the first uses of a map to identify a disease source (Melnick 2002, Koch 2005, Longley et al. 2005). In 1854, John Snow plotted a map showing the cholera deaths in the Soho district of London, by which he demonstrated the association between these deaths and contaminated water supplies from a public water pump in the center of the outbreak.

Since the development of the first real GIS, the Canada Geographic Information System, in the mid-1960s, the functions of GIS have expanded and improved rapidly, building on advances in computer science, cartography, computational geometry, and spatial statistics. Cromley and McLafferty (2012) define GIS as computer-based systems for the integration and analysis of geographic data. They classify GIS functions into three broad categories based on what people want to do with spatial data: 1) spatial database management; 2) visualization and mapping; and 3) spatial analysis. In the past, GIS was regarded as a technology, as discussed above. Nowadays, the term GIS carries multiple labels, such as GIS software, GIS data, GIS community, and doing GIS (Longley et al. 2005). Goodchild (1992) coined the term “GIScience” to refer to the research field concerned with the fundamental principles and questions underlying the use of GIS as a technology.

Nykiforuk and Flaman (2011) reviewed GIS applications in public health and classified four content categories in order of descending prevalence in the literature: disease surveillance, risk analysis, health access and planning, and community health profiling. Disease surveillance is the compilation and tracking of data on the incidence, prevalence, and spread of disease (Wall and Devine 2000). Cluster detection, disease mapping, and disease modeling are several interrelated components of disease surveillance. Cluster detection is an analysis process that aims to identify hotspot areas with elevated disease risks. Disease mapping is used to understand the distribution of disease or disease risk in the past or present. Disease modeling extends disease mapping to identify factors associated with disease risks in order to predict the future spread of disease. These components of disease surveillance, which are important for disease prevention and control, can be conducted in spatial or spatio-temporal dimensions. Risk analysis includes some aspect(s) of risk (assessment, management, communication, or monitoring) relative to impacts on health (Nykiforuk and Flaman 2011). Health access and planning aims to evaluate and improve health services delivery. Community health profiling is the compilation and mapping of information regarding the health of a population in a community. These four categories overlap. For example, in a disease mapping application, risk analyses could also be conducted.

Figure 1.1 shows GIS functions and GIS applications in public health based on Cromley and McLafferty’s (2012) and Nykiforuk and Flaman’s (2011) classifications discussed above. It is impossible to completely describe all GIS functions and how they can be used in public health studies because the use of GIS functions is usually application-dependent and both GIS and health studies are continually evolving. Here, we only briefly list several aspects to show how GIS can greatly facilitate health studies, including population estimation, data integration, exposure assessment, healthcare access evaluation, and communication.

(1) Population estimation

It is important for health studies to understand the distribution of a population at risk. Because of the economic and social processes that structure residential development, the age, sex and race-ethnicity of the population are usually not uniform throughout the region of settlement (Cromley and McLafferty 2012). GIS makes it possible to view residential distributions in great detail. In addition to residence, GIS can help to model people’s activity in space and their migration processes to understand the exposure people have experienced, which is important for studies of diseases with a long latency period, such as cancers. When population data are not available for some regions or time periods, GIS can be used to interpolate or model the population from data available for other regions or time periods.

GIS functions

Spatial database: store, join, query, edit, delete

Visualization and mapping: tables, graphs, maps, statistics

Spatial analysis: measurement, topological analysis, network analysis, surface analysis, spatial statistics

Public health studies

Disease surveillance: cluster detection, disease mapping, disease modeling

Risk analysis: assessment, management, communication, monitoring

Health access and planning: market segmentation, client catchment areas, market utilization, location-allocation modeling

Community health profiling: mapping health and setting variables in a community; multilevel, ecological links between people and settings

Figure 1.1. GIS functions and GIS applications in public health

(2) Data integration

The strong capability of spatial data management of GIS makes it easy to integrate multiple geographic data of health outcomes and environmental, socioeconomic, and behavioral factors based on geographic information (location). These spatial data may be collected by different local, state, or federal agencies, public and private, using different devices or technology. Linking all of these data can give a more comprehensive context or settings of the disease of interest, which is essential to identify relationships between diseases and all kinds of factors and develop etiological hypotheses.

(3) Exposure assessment

Accurate estimation and mapping of exposures is clearly vital if valid inferences are to be drawn either about the spatial distribution of risk factors, or about their geographic relationship with health outcome (Elliott et al. 2000). Suitable measures, such as biomarkers, tend to be costly and invasive. Therefore, especially for population-based research, it is common to estimate exposure based on environmental monitoring data, such as air pollutant concentrations, or using proxy measures of exposure, such as distance from source. These indirect methods can be easily conducted in GIS using interpolation methods and measuring functions.
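As an illustration of the interpolation idea mentioned above, the following minimal sketch estimates exposure at a residence from nearby monitoring stations using inverse distance weighting (IDW). IDW is only one of several interpolation methods available in GIS and is chosen here for brevity; the function name, the monitor coordinates, and the concentration values are hypothetical.

```python
import math

def idw_estimate(monitors, target, power=2.0):
    """Inverse-distance-weighted estimate at `target`.

    monitors: list of (x, y, value) tuples, e.g. pollutant concentrations.
    target:   (x, y) location such as a residence.
    power:    distance-decay exponent (2.0 is a common default, assumed here).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in monitors:
        d = math.hypot(x - target[0], y - target[1])
        if d == 0.0:
            # The target coincides with a monitor: use its reading directly.
            return value
        w = 1.0 / d ** power
        weighted_sum += w * value
        weight_total += w
    return weighted_sum / weight_total

# Hypothetical monitoring stations: (x, y, concentration).
monitors = [(0.0, 0.0, 10.0), (4.0, 0.0, 20.0), (0.0, 3.0, 30.0)]
residence = (1.0, 1.0)
exposure = idw_estimate(monitors, residence)
print(f"Estimated exposure at residence: {exposure:.2f}")
```

The estimate always lies between the smallest and largest monitor readings, which is one reason IDW is often used as a simple baseline before more elaborate geostatistical methods are considered.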

(4) Healthcare access evaluation

Evaluating the current status of health service delivery is important for health policy making and the utilization of resources. The network analysis functions in GIS provide convenient ways to calculate client catchment areas of healthcare facilities and the shortest distance from a population to healthcare facilities. Some measures of healthcare accessibility, such as the two-step floating catchment area method (2SFCA) for assessing the local availability of services in relation to population need (Luo and Wang 2003), can easily be implemented in GIS using join and sum functions.
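The join-and-sum logic of the 2SFCA can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the implementation used in any particular GIS: travel costs are supplied as a precomputed lookup table, the catchment threshold is a single fixed value, and all names and numbers are hypothetical.

```python
def two_step_fca(populations, supplies, distance, threshold):
    """Two-step floating catchment area accessibility scores.

    populations: {area_id: population}
    supplies:    {facility_id: capacity, e.g. number of physicians}
    distance:    {(area_id, facility_id): travel cost}
    threshold:   catchment size in the same units as `distance`
    """
    # Step 1: supply-to-demand ratio for each facility catchment.
    ratios = {}
    for j, supply in supplies.items():
        demand = sum(p for i, p in populations.items()
                     if distance[(i, j)] <= threshold)
        ratios[j] = supply / demand if demand > 0 else 0.0

    # Step 2: for each area, sum the ratios of the facilities it can reach.
    return {i: sum(r for j, r in ratios.items()
                   if distance[(i, j)] <= threshold)
            for i in populations}

# Hypothetical example: three areas, two facilities, 30-minute catchments.
populations = {"A": 1000, "B": 2000, "C": 1000}
supplies = {"F1": 3, "F2": 2}
distance = {("A", "F1"): 5, ("B", "F1"): 10, ("C", "F1"): 40,
            ("A", "F2"): 50, ("B", "F2"): 20, ("C", "F2"): 10}
access = two_step_fca(populations, supplies, distance, threshold=30)
for area, score in access.items():
    print(area, round(score * 1000, 3), "providers per 1,000 people")
```

In this example area B falls inside both catchments, so its score is the sum of both facility ratios, while areas A and C each reach only one facility.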

(5) Communication

Preparing and displaying maps of health information are among the most important functions of public health GIS (Cromley and McLafferty 2012). By portraying the results of analysis on a map, GIS technology gives communities an easily understandable visual picture of community health (Melnick 2002). Maps are recognized as one of the most important communication tools among researchers, decision makers, and the public. With the development of Internet GIS, health information can be quickly published using interactive web mapping to anyone with access to the Internet (Theseira 2002, Boulos 2003, Boulos 2005).

Based on the above examples of GIS applications in health, we can see that GIS can be used as a natural and effective means to approach a variety of program, policy, and planning issues in health promotion and public health (Nykiforuk and Flaman 2011).

1.2 Research Objectives

The overarching research question of this dissertation asks how GIS and spatial analysis can be used to facilitate public health studies. Understanding health status and then effectively and efficiently providing health care services are necessary to promote public health. Therefore, this research involves three aspects of health studies related to health surveillance and health service planning: spatial disease cluster detection, spatio-temporal disease mapping, and optimal siting of health facilities. The first two are both techniques used to describe the distribution of a disease. Spatial disease cluster detection aims to quickly identify hotspot areas with elevated risks. Usually, it requires only health outcome data and basic population data. It is very useful for health departments to maintain surveillance of disease outbreaks. However, it cannot provide detailed information on the spatial patterns of disease risks within hotspot areas and other areas of interest. Spatio-temporal disease mapping can complement cluster detection analysis. It can provide the spatio-temporal patterns of disease risks across the whole study area and time period. These health patterns can be linked to all kinds of factors to develop etiological hypotheses. Knowing the patterns of disease risks is not the end. The goal of health studies is to prevent and control the spread of disease and promote public health. Given the patterns of disease risks obtained from disease mapping analyses, we can easily identify areas with high health service needs. Then, based on the spatial distribution of the needs, health services can be planned more effectively and efficiently.

This dissertation research includes three main objectives, each of which addresses an important problem in one of the three aspects of health studies by developing new methods or models that are implemented with GIS and spatial analysis. More specifically, these three objectives are:

(1) To develop a new method to detect disease clusters in arbitrary shapes with higher statistical power and more accurate geographic boundaries;

(2) To develop hierarchical Bayesian models to explore the spatio-temporal patterns of lung cancer incidence risks by race and sex in Georgia (2000-2007) at a fine spatio-temporal scale;

(3) To develop a new location-allocation model to optimally site ambulances so that the emergency medical services (EMS) can be delivered more effectively and efficiently.

In the study of the location-allocation model for health service planning, a sub-problem, spatial demand representation, is worth discussing since it is highly related to modeling errors and problem complexity. Therefore, this dissertation research also empirically compares three existing spatial demand representation approaches to provide some implications on how to choose an appropriate one for a specific application.

In general, Figure 1.2 shows the logical structure of the dissertation research.

1.3 Literature Review

1.3.1 Detection of Irregular Disease Clusters

Detection of disease clusters in time, space or space-time has generated considerable interest within the disciplines of geography and public health for many decades (Besag and Newell 1991, Maheswaran and Craglia 2004, Lawson 2006). The shape of the geographic area of a true disease cluster may be arbitrary. For example, air pollution diffusing from an incinerator may cause an arbitrarily shaped disease cluster due to the wind strength and direction. To detect clusters in irregular shapes, several methods have been proposed (Duczmal and Assunção 2004, Tango and Takahashi 2005, Aldstadt and Getis 2006, Duczmal et al. 2006, Kulldorff et al. 2006, Yiannakoulias et al. 2007, Duczmal et al. 2008, Duczmal et al. 2009, Cançado et al. 2010). Seeking methods for the detection of clusters in irregular shapes with higher statistical power and more accurate geographic boundaries is still an active topic in current health research.

1.3.2 Spatio-temporal Mapping of Disease Risks

GIS for public health studies

Component: Health surveillance

Component: Spatial disease cluster detection. Research topic: New method for detection of clusters with irregular shapes

Component: Spatio-temporal disease mapping. Research topic: Spatio-temporal Bayesian models for Georgia lung cancer mapping at fine scales

Component: Health service planning

Component: Optimal siting of health facilities. Research topic: New location-allocation model for ambulance siting

Sub-problem: Spatial demand representation. Research topic: Comparison of three spatial demand representations

Figure 1.2. Logical structure of the dissertation research

Lung cancer is not only the second most commonly diagnosed cancer in men and women, but also the leading cause of cancer-related death in Georgia (Georgia Department of Public Health 2008). However, as far as we know, very few lung cancer studies have been conducted in Georgia, and most of them focus on descriptive analyses using crude rates at a coarse spatio-temporal scale, such as the 5-year incidence rates at the health district or county level. Such analyses are not useful for assessing the health of diverse communities, and could introduce inferential biases on etiological hypotheses. In addition, they can provide only limited help for healthcare performance assessment and health policy making to improve the efficiency of interventions and the distribution of resources. The low reliability of disease rates for small population areas is one of the challenges for mapping disease risk at a fine spatio-temporal scale. Recently, hierarchical Bayesian models have been widely used to map disease risk spatially or spatio-temporally to overcome or mitigate the small number problem (Bernardinelli et al. 1995, Waller et al. 1997, Xia and Carlin 1998, Knorr-Held 2000, Mollié 2001, Wakefield et al. 2001, Best et al. 2005, Richardson et al. 2006, Abellan et al. 2008, Lawson 2009, Fortunato et al. 2011).
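The small number problem can be made concrete with a short calculation. The sketch below computes a crude standardized incidence ratio (SIR, observed cases divided by expected cases) together with an approximate standard error, assuming the observed count is Poisson so that its variance is estimated by the count itself. The two tracts and their counts are hypothetical, and this crude calculation is not one of the Bayesian models referred to above.

```python
import math

def sir_with_se(observed, expected):
    """Return (SIR, approximate standard error).

    Assumes a Poisson observed count, so Var(O) is estimated by O
    and SE(SIR) is approximately sqrt(O) / E.
    """
    sir = observed / expected
    se = math.sqrt(observed) / expected
    return sir, se

# Hypothetical tracts with the same crude SIR but different expected counts.
small_tract = sir_with_se(observed=2, expected=1.0)
large_tract = sir_with_se(observed=200, expected=100.0)

for name, (sir, se) in [("small", small_tract), ("large", large_tract)]:
    print(f"{name} tract: SIR = {sir:.2f}, approx. SE = {se:.2f}")
```

Both tracts have the same crude SIR, but the estimate for the small tract is roughly ten times less precise, which is why crude rates mapped at fine scales can be dominated by noise.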

When mapping one disease for multiple population groups or multiple diseases that have common risk factors, a joint modeling framework can be used (Knorr-Held and Best 2001, Held et al. 2005, Richardson et al. 2006, Downing et al. 2008). In this modeling framework, a set of shared random components exists in each model.

1.3.3 Capacitated Maximal Covering Location Problems

Given a covering standard for a service, such as a maximum distance or travel time, the objective of the maximal covering location problem (MCLP) is to locate a fixed number of facilities so that the service covers as much demand as possible. MCLP modeling, after being put forward by Church and ReVelle (1974), has been a powerful and widely used tool in many planning processes to optimally distribute limited resources to maximize social and economic benefits. Chung et al. (1983) and Current and Storbeck (1988) published two early papers dealing with capacitated versions of the MCLP, where the demand allocated to a facility does not exceed the capacity of that facility. In all capacitated MCLP models, only one fixed capacity level of the facility is considered for each potential facility site. However, many situations arise where each potential facility site could have several possible maximum capacity levels for a facility to choose from. For example, the capacity limit of an emergency facility (e.g., ambulance base or fire station) can be assumed to be determined by its stationed emergency vehicles (e.g., ambulances or fire trucks). Therefore, varied numbers of emergency vehicles will provide a series of possible maximum capacity levels for the emergency facility to choose from.
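To make the basic (uncapacitated) MCLP objective concrete, the following minimal sketch enumerates every combination of p sites and keeps the one that covers the most demand. Exhaustive enumeration is used only because it is easy to verify on a toy instance; it is not the solution approach of this dissertation, it ignores capacities entirely, and the demand values and coverage sets are hypothetical.

```python
from itertools import combinations

def solve_mclp(demand, coverage, p):
    """Brute-force MCLP: choose p sites maximizing covered demand.

    demand:   {demand_id: demand weight, e.g. population}
    coverage: {site_id: set of demand_ids within the covering standard}
    p:        number of facilities to locate
    """
    best_sites, best_covered = (), -1
    for sites in combinations(sorted(coverage), p):
        # A demand object counts once, however many chosen sites cover it.
        covered_ids = set().union(*(coverage[s] for s in sites))
        covered = sum(demand[i] for i in covered_ids)
        if covered > best_covered:
            best_sites, best_covered = sites, covered
    return best_sites, best_covered

# Hypothetical instance: five demand objects, four candidate sites.
demand = {1: 50, 2: 30, 3: 20, 4: 40, 5: 10}
coverage = {"A": {1, 2}, "B": {2, 3, 4}, "C": {4, 5}, "D": {1, 5}}

sites, covered = solve_mclp(demand, coverage, p=2)
print(f"Best sites: {sites}, covered demand: {covered} of {sum(demand.values())}")
```

The number of combinations grows quickly with the number of candidate sites, so realistic instances are normally passed to an integer programming solver or a heuristic rather than enumerated.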

1.3.4 Spatial Demand Representations

For covering location modeling, it is common to assume that aggregated or continuous spatial demand is concentrated on a set of points or uniformly distributed within areal units. Different from the traditional area-based representations using census units or regular polygons, such as triangles or rectangles, as demand objects, Cromley et al. (2012) proposed a new area-based demand representation that partitions a continuous demand space into a set of the least common demand coverage units (LCDCUs) by overlaying demand coverage areas at potential facility sites. This representation approach, without complicated model formulations, could reduce or eliminate some errors associated with the traditional point-based and area-based representations.
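The overlay idea behind such coverage-based demand units can be approximated on a raster: every small cell is labeled with the set of potential sites whose service area contains it, and cells sharing the same label form one demand unit. The sketch below is a simplified stand-in for a true polygon overlay. It assumes circular service areas and a 10 × 10 demand space sampled at cell centers, it merges disconnected pieces that share a label, and the site names and coordinates are hypothetical.

```python
import math
from collections import defaultdict

def coverage_units(cells, sites, radius):
    """Group cell centers by the set of sites whose circular area covers them.

    cells:  list of (x, y) cell centers sampling the demand space
    sites:  {site_id: (x, y)} potential facility locations
    radius: service radius shared by all sites (an assumption)
    """
    units = defaultdict(list)
    for cx, cy in cells:
        signature = frozenset(
            name for name, (sx, sy) in sites.items()
            if math.hypot(cx - sx, cy - sy) <= radius
        )
        units[signature].append((cx, cy))
    return dict(units)

# A 10 x 10 demand space sampled at unit-cell centers, with two sites.
cells = [(x + 0.5, y + 0.5) for x in range(10) for y in range(10)]
sites = {"S1": (3.0, 5.0), "S2": (7.0, 5.0)}
units = coverage_units(cells, sites, radius=3.0)

for signature, members in units.items():
    label = ", ".join(sorted(signature)) or "uncovered"
    print(f"{label}: {len(members)} cells")
```

With two overlapping service areas the demand space splits into four units (covered by S1 only, by S2 only, by both, and by neither), and each unit can then enter a covering model as a single demand object whose demand is the total over its cells.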

Many covering location models, such as the maximal covering location problem (MCLP), have been proven to be nondeterministic polynomial time (NP)-hard (Megiddo et al. 1981), which means that no algorithm has yet been discovered that solves them in polynomial time in the worst case. Moreover, the size of a covering location problem is highly related to the demand representation it adopts. Therefore, even if a demand representation approach may theoretically reduce or eliminate some representation errors in a problem, it could make the problem difficult, if not impossible, to solve using exact methods in current optimization software. Relying on heuristic algorithms to solve such a complicated problem may introduce other errors in modeling results. It is worth noting that the complexity of problems associated with demand representations is rarely discussed in the current literature.

1.4 Dissertation Structure

The dissertation is organized into six chapters. Chapter 1 is a brief introduction to the background and objectives of the dissertation research, together with a literature review of the topics covered in this dissertation, including the detection of irregular disease clusters, spatio-temporal mapping of disease risks, capacitated maximal covering location problems, and spatial demand representations. The following four chapters are separate papers published in or to be submitted to journals. In Chapter 2, a redesigned spatial scan statistic is proposed to detect disease clusters with irregular shapes. Chapter 3 develops seven hierarchical Bayesian models under separate and joint modeling frameworks to explore the spatio-temporal patterns of lung cancer incidence risks in Georgia (2000-2007) at the census tract level with a two-year temporal unit. Chapter 4 develops modular capacitated maximal covering location problem (MCMCLP) models to optimally site emergency vehicles (e.g., ambulances). In Chapter 5, three spatial demand representation approaches are compared in both representation error and problem complexity using the MCLP as an example. Chapter 6 provides the conclusions of this dissertation and outlines future work.

References

Abellan, J.J., Richardson, S. & Best, N., 2008. Use of space–time models to investigate the stability of patterns of disease. Environmental health perspectives, 116 (8), 1111.

Aldstadt, J. & Getis, A., 2006. Using amoeba to create a spatial weights matrix and identify spatial clusters. Geographical analysis, 38 (4), 327-343.

Bernardinelli, L., Clayton, D., Pascutto, C., Montomoli, C., Ghislandi, M. & Songini, M., 1995. Bayesian analysis of space-time variation in disease risk. Statistics in Medicine, 14 (21-22), 2433-2443.

Besag, J. & Newell, J., 1991. The detection of clusters in rare diseases. Journal of the Royal Statistical Society. Series A (Statistics in Society), 154 (1), 143-155.

Best, N., Richardson, S. & Thomson, A., 2005. A comparison of bayesian spatial models for disease mapping. Statistical Methods in Medical Research, 14 (1), 35.

Boulos, M.N.K., 2003. The use of interactive graphical maps for browsing medical/health internet information resources. International Journal Of Health Geographics, 2 (1), 1.

Boulos, M.N.K., 2005. Web gis in practice iii: Creating a simple interactive map of england's strategic health authorities using google maps api, google earth kml, and msn virtual earth map control. International Journal Of Health Geographics, 4 (1), 22.

Brown, T., Mclafferty, S. & Moon, G. eds. 2010. A companion to health and medical geography, Chichester, UK: Wiley-Blackwell.

Cançado, A.L.F., Duarte, A.R., Duczmal, L.H., Ferreira, S.J., Fonseca, C.M. & Gontijo, E.C.D.M., 2010. Penalized likelihood and multi-objective spatial scans for the detection and inference of irregular clusters. International Journal of Health Geographics, 9 (1), 55.

Chung, C., Schilling, D. & Carbone, R., Year. The capacitated maximal covering problem: A heuristic. Proceedings of the Fourteenth Annual Pittsburgh Conference on Modeling and Simulation, 1423-1428.

Church, R. & Revelle, C., 1974. The maximal covering location problem. Papers in regional science, 32 (1), 101-118.

Cromley, E.K. & Mclafferty, S.L., 2012. Gis and public health, 2nd ed. New York: The Guilford Press.

Cromley, R.G., Lin, J. & Merwin, D.A., 2012. Evaluating representation and scale error in the maximal covering location problem using gis and intelligent areal interpolation. International Journal of Geographical Information Science, 26 (3), 495-517.

Current, J. & Storbeck, J., 1988. Capacitated covering models. Environment and Planning B, 15, 153-164.

Downing, A., Forman, D., Gilthorpe, M., Edwards, K. & Manda, S., 2008. Joint disease mapping using six cancers in the yorkshire region of england. International Journal of Health Geographics, 7 (1), 41.

Duczmal, L. & Assunção, R., 2004. A simulated annealing strategy for the detection of arbitrarily shaped spatial clusters. Computational Statistics & Data Analysis, 45 (2), 269-286.

Duczmal, L., Cançado, A.L.F. & Takahashi, R.H.C., 2008. Geographic delineation of disease clusters through multi-objective optimization. Journal of Computational & Graphical Statistics, 17, 243-262.

Duczmal, L., Duarte, A.R. & Tavares, R., 2009. Extensions of the scan statistic for the detection and inference of spatial clusters. Scan Statistics, 153-177.

Duczmal, L., Kulldorff, M. & Huang, L., 2006. Evaluation of spatial scan statistics for irregularly shaped clusters. Journal of Computational and Graphical Statistics, 15 (2), 428-442.

Elliott, P., Wakefield, J.C., Best, N.G. & Briggs, D.J., 2000. Spatial epidemiology: Methods and applications. In Elliott, P., Wakefield, J.C., Best, N.G. & Briggs, D.J. eds. Spatial epidemiology: Methods and applications. New York: Oxford University Press, 3-14.

Fortunato, L., Abellan, J.J., Beale, L., Lefevre, S. & Richardson, S., 2011. Spatio-temporal patterns of bladder cancer incidence in utah (1973-2004) and their association with the presence of toxic release inventory sites. International Journal of Health Geographics, 10 (1), 16.

Georgia Department of Public Health, 2008. Cancer program and data summary. Atlanta, GA.

Goodchild, M.F., 1992. Geographical information science. International Journal of Geographical Information Systems, 6 (1), 31-45.

Held, L., Natário, I., Fenton, S.E., Rue, H. & Becker, N., 2005. Towards joint disease mapping. Statistical Methods in Medical Research, 14 (1), 61-82.

Knorr-Held, L., 2000. Bayesian modelling of inseparable space-time variation in disease risk. Statistics in Medicine, 19 (17-18), 2555-2567.

Knorr-Held, L. & Best, N.G., 2001. A shared component model for detecting joint and selective clustering of two diseases. Journal of the Royal Statistical Society: Series A (Statistics in Society), 164 (1), 73-85.

Koch, T., 2005. Cartographies of disease: Maps, mapping, and medicine. Redlands, California: ESRI Press.

Kulldorff, M., Huang, L., Pickle, L. & Duczmal, L., 2006. An elliptic spatial scan statistic. Statistics in Medicine, 25 (22), 3929-3943.

Lawson, A., 2006. Statistical methods in spatial epidemiology, 2nd ed. Chichester, England; Hoboken, NJ: Wiley.

Lawson, A.B., 2009. Bayesian disease mapping: Hierarchical modeling in spatial epidemiology: Chapman & Hall/CRC.

Longley, P.A., Goodchild, M.F., Maguire, D.J. & Rhind, D.W., 2005. Geographic information systems and science, 2nd ed.: John Wiley & Sons, Ltd.

Luo, W. & Wang, F., 2003. Measures of spatial accessibility to health care in a gis environment: Synthesis and a case study in the chicago region. Environment and Planning B, 30 (6), 865-884.

Maheswaran, R. & Craglia, M., 2004. Gis in public health practice. Boca Raton: CRC Press.

Megiddo, N., Zemel, E. & Hakimi, S.L., 1981. The maximum coverage location problem: Northwestern University.

Melnick, A.L., 2002. Introduction to geographic information systems in public health. Gaithersburg, Maryland: Aspen Publishers.

Mollié, A., 2001. Bayesian mapping of hodgkins disease in france. Spatial Epidemiology, 1 (9), 267-286.

Najafabadi, A.T., 2009. Applications of gis in health sciences. Shiraz E Medical Journal, 10 (4), 221-230.

Nykiforuk, C.I.J. & Flaman, L.M., 2011. Geographic information systems (gis) for health promotion and public health: A review. Health Promotion Practice, 12 (1), 63-73.

Richardson, S., Abellan, J. & Best, N., 2006. Bayesian spatio-temporal analysis of joint patterns of male and female lung cancer risks in yorkshire (uk). Statistical Methods in Medical Research, 15 (4), 385.

Rushton, G., 2003. Public health, gis and spatial analytic tools. Annual Review of Public Health, 24, 43-56.

Tango, T. & Takahashi, K., 2005. A flexibly shaped spatial scan statistic for detecting clusters. International Journal of Health Geographics, 4, 11-15.

Theseira, M., 2002. Using internet gis technology for sharing health and health related data for the west midlands region. Health & Place, 8 (1), 37-46.

Wakefield, J., Best, N. & Waller, L., 2001. Bayesian approaches to disease mapping. Spatial Epidemiology, 1 (9), 104-128.

Wall, P.A. & Devine, O.J., 2000. Interactive analysis of the spatial distribution of disease using a geographic information systems. Journal of geographical systems, 2 (3), 243.

Waller, L., Carlin, B., Xia, H. & Gelfand, A., 1997. Hierarchical spatio-temporal mapping of disease rates. Journal of the American Statistical Association, 607-617.

Xia, H. & Carlin, B., 1998. Spatio-temporal models with errors in covariates: Mapping ohio lung cancer mortality. Statistics in Medicine, 17 (18), 2025-2043.

Yiannakoulias, N., Rosychuk, R.J. & Hodgson, J., 2007. Adaptations for finding irregularly shaped disease clusters. International Journal of Health Geographics, 6 (1), 28.


CHAPTER 2

DETECTING DISEASE CLUSTERS IN ARBITRARY SHAPES WITH A REDESIGNED

SPATIAL SCAN STATISTIC1

1 Yin, P. and Mu, L. To be submitted to Geographical Analysis.


Abstract

Detection and surveillance of spatial disease clusters in arbitrary shapes have generated considerable interest within the disciplines of geography and public health. However, most existing methods have drawbacks such as enormous computing workloads, the detection of peculiarly shaped clusters, and the multiple testing problem, among others. In this study, the commonly used Kulldorff’s circular spatial scan statistic (CSScan) was redesigned to quickly detect spatial disease clusters in arbitrary shapes by using Tango’s restricted likelihood ratio as the test statistic combined with Assunção et al.’s dynamic Minimum Spanning Tree (dMST) search strategy. Six cluster models and two non-cluster scenarios were designed, and five hundred replications for each model were simulated to test and compare the performance of the redesigned spatial scan statistic method (RSScan) with Tango’s method, Assunção et al.’s method, and Kulldorff’s CSScan method in detecting statistically significant clusters and identifying the boundaries of clusters. Besides the metric of power, the Kappa Index of Agreement (KIA) was used to indicate the degree of match between a cluster estimate and the true cluster. The results from the performance experiment indicate that the RSScan method with appropriate parameters, which were explored in this study, generally has a higher or similar capability to rapidly detect spatial disease clusters in arbitrary shapes compared with the other three methods. The RSScan method was then applied to detecting the cluster of lung cancer in the State of Georgia in the United States for the period 1998 to 2005. Limitations of the RSScan method are also discussed.

Keywords: Spatial scan statistic, Restricted likelihood ratio, Disease cluster, Arbitrary shape, Dynamic Minimum Spanning Tree


2.1 Introduction

Detection of disease clusters in time, space or space-time has generated considerable interest within the disciplines of geography and public health for many decades (Besag and Newell 1991, Maheswaran and Craglia 2004, Lawson 2006). Lawson (2006) described a disease cluster as “any area within the study region of significant elevated risk” of a particular disease. It is also referred to as a hot-spot cluster. The causes of disease clusters may include the communicability of some diseases; adverse effects from the physical, socioeconomic, or psychosocial environment; lifestyles commonly considered harmful to health, such as smoking; and poor accessibility to healthcare (Maheswaran and Craglia 2004). Detecting disease clusters not only aids the analysis of disease etiology, but also enables public health departments to improve their surveillance, distribute funding and other resources, and control possible disease outbreaks.

It is well accepted that the spatial variation of disease incidence is closely related to the background population at risk. For example, the number of cases of a disease may be higher in an urban area than in a rural area only because of the larger population in the urban area. If two cities have the same population size, but the proportion of the population over age 60 in the first city is much higher than that in the second city, it is not surprising that the incidence of cardiovascular disease in the first city is higher. In addition, the geographic shape of a true disease cluster may be arbitrary. For example, air pollution diffusing from an incinerator may cause an arbitrarily shaped disease cluster due to the wind strength and direction. Therefore, detection of spatial disease clusters should not only take account of the spatial variation of the population at risk, but also be able to capture the arbitrary shapes of disease clusters.

In the following sections, Section 2 is a brief review of several well-known methods for detecting spatial disease clusters. Section 3 proposes a redesigned spatial scan method (RSScan) using Tango’s (2008) restricted likelihood ratio as the test statistic combined with Assunção et al.’s (2006) dynamic Minimum Spanning Tree (dMST) search strategy to quickly detect spatial disease clusters in arbitrary shapes. Section 4 tests the performance of RSScan with simulated data, which is followed by an application in Section 5 using RSScan to detect the cluster of lung cancer in Georgia from 1998 to 2005. Section 6 concludes the paper.

2.2 Existing Methods for Detection of Disease Clusters

Local Moran’s I is an index which has been widely used to identify clusters (Anselin 1995, Jacquez and Greiling 2003, Rogerson and Yamada 2009, Goovaerts 2010). However, there are several issues with using Local Moran’s I to detect disease clusters. As Local Moran’s I is designed to test the similarity of the attribute values between the region of interest and its neighbors, the clusters detected with Local Moran’s I may not be areas of significantly elevated disease risk. Local Moran’s I is incapable of detecting clusters which involve only a single region. Conducting a separate statistical test with Local Moran’s I for each region in the study area results in a multiple testing problem, in which some clusters may be detected just by chance even if the real pattern of disease incidence is random (Rogerson and Yamada 2009). In addition, crude rates, such as the Standardized Incidence Ratio (SIR), are usually used directly as the attribute in Local Moran’s I to detect disease clusters (Jacquez and Greiling 2003, Rogerson and Yamada 2009), which may cause the test to be unstable due to the low reliability of disease rates with a small population at risk.

Different from Local Moran’s I, Openshaw et al.’s (1987) Geographical Analysis Machine (GAM) is an exploratory and graphical method that allows the detection of clusters with significantly elevated disease risk. A fine regular lattice is laid on the study region, and many circles of various radii are constructed on each lattice point. The number of disease cases in each circle is then counted and compared with the number of disease cases which would be expected under the null hypothesis that all disease incidences are spatially distributed randomly within the underlying structure of the population at risk. With Monte Carlo testing (Dwass 1957), where the probability distribution of the expected number of cases in each circle is generated based on simulations, if the null hypothesis is rejected, the corresponding circle will be drawn on the map. Finally, an idea about where and how large the disease clusters may be can be obtained by looking at the plotted circles. Each circle is regarded as having a significantly elevated risk. Since there are usually thousands of circles with various radii tested simultaneously, the multiple testing problem and the enormous computational workload need to be addressed. Turnbull et al. (1990) proposed a method, the Cluster Evaluation Permutation Procedure (CEPP), which only tests the circle with the maximum count of disease cases among all moving circles covering the same predefined population. This method solves the multiple testing problem, but the input threshold, a predefined population, may be hard to determine.

Based on Openshaw et al.’s (1987) and Turnbull et al.’s (1990) methods, Kulldorff and Nagarwalla (1995) developed a circular spatial scan statistic, which is denoted as the CSScan method in the following text. A circular scan window with various radii is constructed and moved over the study area. The null hypothesis is that the probability of being a case inside the circle, p, is the same as that in the rest of the study region, q. The alternative hypothesis is p > q. Given the number of cases and the population inside and outside the circle, the maximum likelihood ratio between these two hypotheses is selected as the test statistic, which can be derived under two stochastic models, Bernoulli and Poisson (Kulldorff 1997). The circular window with the maximum test statistic is regarded as the most likely cluster. Its significance is then tested using the Monte Carlo testing method (Dwass 1957). The spatial scan statistic λ based on the Poisson model is shown below (Equation 2.1, Kulldorff 1997):

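$$\lambda = \begin{cases} \sup\limits_{z \in Z} \left(\dfrac{n(z)}{e(z)}\right)^{n(z)} \left(\dfrac{n - n(z)}{n - e(z)}\right)^{n - n(z)} & \text{if } \dfrac{n(z)}{e(z)} > \dfrac{n - n(z)}{n - e(z)} \\ 1 & \text{otherwise} \end{cases} \qquad \text{(Equation 2.1)}$$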

where sup denotes the supremum (least upper bound), z denotes the zone within the circular scan window, which is included in the zone set Z, and n(z) and e(z) denote the actual number of disease cases and the null expected number of cases within the specified zone z, respectively. n is the total count of disease cases in the study area. The CSScan method remains one of the most widely used methods for cluster detection, possibly because it addresses the problems existing in such methods as Local Moran’s I, GAM, and CEPP. In addition, the latest version of the tool for this method, SaTScan™, can be easily accessed over the Internet (Kulldorff and Information Management Services Inc. 2010).
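As a minimal sketch of how the Poisson-model statistic in Equation 2.1 can be evaluated, the Python below computes the logarithm of the likelihood ratio for each candidate zone and takes the maximum. The function names `poisson_log_lr` and `scan_statistic` are hypothetical, working on the log scale is a numerical convenience rather than something stated in the text, and the input is assumed to be a plain list of (n(z), e(z)) pairs with 0 < e(z) < n.

```python
import math


def poisson_log_lr(n_z, e_z, n_total):
    """Log of the Equation 2.1 term for one zone z.

    n_z: observed cases in the zone, e_z: null expected cases in the zone,
    n_total: total cases in the study area (assumes 0 < e_z < n_total).
    """
    inside = n_z / e_z
    outside = (n_total - n_z) / (n_total - e_z)
    if inside <= outside:
        return 0.0  # likelihood ratio is 1 when risk inside is not elevated
    llr = n_z * math.log(inside)
    if n_z < n_total:  # skip the 0 * log(0) term when all cases fall inside
        llr += (n_total - n_z) * math.log(outside)
    return llr


def scan_statistic(zones, n_total):
    """Maximum log likelihood ratio over an iterable of (n_z, e_z) pairs."""
    return max(poisson_log_lr(n_z, e_z, n_total) for n_z, e_z in zones)
```

Because the logarithm is monotone, the zone that maximizes this value is the same zone that maximizes λ itself.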

Since Kulldorff’s CSScan uses a circular window to scan the study region, it has difficulty detecting clusters of irregular shapes. In order to solve this problem, many methods have been developed which mainly modify the search strategy of the scan window or the construction of the test statistic. Duczmal and Assunção (2004) proposed a simulated annealing search strategy for the detection of arbitrarily shaped spatial clusters. In this method, however, the choice among the four strategies with different levels of randomness for the successor of the current subgraph at each step tends to be arbitrary. Tango and Takahashi (2005) proposed a flexibly shaped spatial scan statistic which exhaustively searches all cluster candidates within a given radius of any area. However, the running time of their algorithm increases exponentially with the search radius. Several penalty parameters have been incorporated into the maximum likelihood ratio function in different methods, either to enable the method to find irregularly shaped clusters, such as the “eccentricity penalty” in Kulldorff et al. (2006) for elliptical clusters, or to penalize detected clusters that are very irregular in shape, such as the “non-compactness” penalty in Duczmal et al. (2006) and the “non-connectivity penalty” in Yiannakoulias et al. (2007). In spite of all these efforts, the choice of these penalty parameters remains highly subjective.

2.3 Redesigned Spatial Scan Method (RSScan)

From the review of existing methods in the previous section, it can be summarized that spatial scan methods mainly consist of two components: a search strategy and a test statistic such as the spatial scan statistic λ. The objective of a spatial scan is to find the zone z which maximizes the test statistic over all zones in the set Z and thus to identify the zone that constitutes the most likely cluster (Duczmal and Assunção 2004). A search strategy mainly defines the zone set Z and in turn determines the possible shape of a cluster estimate and the running time of an algorithm. A test statistic, combined with the search strategy, determines the performance of the method. In order to rapidly detect arbitrarily shaped spatial disease clusters for count data, and at the same time to address the issues identified in the above-mentioned methods, we redesigned Kulldorff’s CSScan method by using Assunção et al.’s (2006) dMST method as the search strategy and Tango’s (2008) restricted likelihood ratio as the test statistic in our RSScan method, which will be described in the following subsections (2.3.2 and 2.3.1, respectively). Table 2.1 shows the test statistics and search strategies used in four spatial scan methods: our RSScan method, Tango’s method, Assunção et al.’s method, and Kulldorff’s CSScan method.

Table 2.1. Test statistics and search strategies of four spatial scan methods

| Search Strategy | Test Statistic: Tango’s Restricted Likelihood Ratio | Test Statistic: Kulldorff’s Maximum Likelihood Ratio |
|---|---|---|
| Assunção et al.’s dMST | RSScan | Assunção et al.’s method |
| Circular Scan Window | Tango’s method | CSScan |

Although Tango (2008) mentioned that the restricted likelihood ratio could be used with a non-circular scan window, and the latest version of his software, FleXScan v3.1 (Takahashi et al. 2010), released just after this study was finished, allows the restricted likelihood ratio to be combined with his flexible scan method, the current literature lacks work testing and discussing this kind of combination. Tango (2008) designed four cluster models to test the statistical power of the restricted likelihood ratio with circular scan windows. However, these tests do not explain the performance of the restricted likelihood ratio as a test statistic under other situations, such as different levels of disease cases in the study area or various shapes of clusters. The choice of the screening level α1 in the restricted likelihood ratio also needs to be explored when it is combined with a non-circular scan window such as the dMST search strategy in our RSScan method.


2.3.1 Test Statistic

It is reasonable to think that not only should the disease clusters be areas of significantly elevated risk as a whole, but also the risks of individual regions within the clusters should not be very low. Therefore, we adopt the restricted likelihood ratio proposed by Tango (2008) as the test statistic λT in our RSScan method (Equation 2.2, Tango 2008).

$$\lambda_T = \sup_{z \in Z} \left(\frac{n(z)}{e(z)}\right)^{n(z)} \left(\frac{n - n(z)}{n - e(z)}\right)^{n - n(z)} I\!\left(\frac{n(z)}{e(z)} > \frac{n - n(z)}{n - e(z)}\right) \prod_{i \in z} I(p_i < \alpha_1) \qquad \text{(Equation 2.2)}$$

where I(·) is an indicator function. The only difference between Tango’s restricted likelihood ratio function (Equation 2.2) and Kulldorff’s maximum likelihood ratio function (Equation 2.1) is the product of indicator functions $\prod_{i \in z} I(p_i < \alpha_1)$, in which α1 is a screening level specified by users for the risk of any individual region, and pi is the one-tailed mid-p value of region i under the test for the null hypothesis H0: E(Ni) = ei, which is defined as below (Equation 2.3, Tango 2008).

$$p_i = \Pr\{N_i \ge n_i + 1 \mid N_i \sim \mathrm{Pois}(e_i)\} + \frac{1}{2} \Pr\{N_i = n_i \mid N_i \sim \mathrm{Pois}(e_i)\} \qquad \text{(Equation 2.3)}$$

where Ni is a random variable which denotes the number of disease cases in region i, and ni and ei denote the actual number of cases and the null expected number of cases in region i, respectively. In Tango’s restricted likelihood ratio function, if the one-tailed mid-p value of a region is less than the prespecified screening level α1, this region will be regarded as being of elevated risk. Otherwise, this region will not be considered in the disease cluster estimate. It should be noted that Kulldorff’s maximum likelihood ratio is the special case of the restricted likelihood ratio when the screening level α1=1.
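The screening step can be illustrated with a short Python sketch, assuming only the definition of the one-tailed mid-p value in Equation 2.3. The function names (`poisson_pmf`, `mid_p_value`, `screen_regions`) and the list-based inputs are hypothetical, and every expected count is assumed to be strictly positive.

```python
import math


def poisson_pmf(k, mu):
    """Pr{N = k} for N ~ Pois(mu), with mu > 0, computed on the log scale."""
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))


def mid_p_value(n_i, e_i):
    """One-tailed mid-p value: Pr{N >= n_i + 1} + 0.5 * Pr{N = n_i}."""
    cdf = sum(poisson_pmf(k, e_i) for k in range(n_i + 1))  # Pr{N <= n_i}
    return (1.0 - cdf) + 0.5 * poisson_pmf(n_i, e_i)


def screen_regions(cases, expected, alpha1):
    """Indices of regions whose mid-p value is below the screening level alpha1."""
    return [
        i
        for i, (n_i, e_i) in enumerate(zip(cases, expected))
        if mid_p_value(n_i, e_i) < alpha1
    ]
```

With `alpha1 = 1` every region passes the screen, which matches the observation that Kulldorff’s likelihood ratio is the special case α1=1.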

Although the problem of noninterpretability in the parameters is addressed and the cluster size is effectively controlled with the restricted likelihood ratio function, the choice of screening level α1 is totally up to users. Tango (2008) provides a guideline regarding the choice of α1 for a test of the nominal α level of 0.05, and recommends α1=0.2 as a default value. However, this guideline is derived only from the testing results with four simulated cluster models using a circular scan window. The recommendation of α1 value in our RSScan method for detecting the clusters in arbitrary shapes will be explored in Section 4.

2.3.2 Search Strategy

In order to detect arbitrarily shaped clusters and guarantee spatial contiguity, we use a graph G(V, E) to represent a region map, where V is a set of n vertices (each representing a region such as a census tract or county), and E is a set of edges (each connecting a unique pair of adjacent regions) (Figure 2.1).

Figure 2.1. Graph-based representation of a region map


The exclusion of low-risk regions in the restricted likelihood ratio function is realized by removing all edges of those regions from the graph. This screening step also reduces the amount of calculation in the algorithm. As a result, the final cluster estimate will only include regions which are connected in the graph. Similar to Kulldorff’s CSScan method, the RSScan method finds the most likely cluster, the one with the largest value of the test statistic, to address the multiple testing problem.

Assunção et al.’s (2006) dMST method is used as the search strategy in our RSScan method. Given a graph G and an empty collection T, for any vertex u, the steps can be described as follows:

1) Put vertex u into T.

2) Among all the vertices not in T but adjacent to any vertex in T, identify the vertex v whose addition gives T the largest value of the test statistic at the current step, and then put vertex v into T. All vertices in the current T constitute one zone (i.e. a potential cluster) for the scan.

3) Repeat step 2 until all vertices connected to vertex u in graph G are added into T.

The above steps are executed for each vertex that is not isolated in the graph G, which yields the zone set Z; the zone with the maximum test statistic is regarded as the most likely cluster. In order to reduce the computational load, a search radius K is set so that at most K-1 nearest neighboring vertices are included in the zones when scanning each vertex.
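The greedy growth in steps 1 to 3 can be sketched in Python as follows. This is an illustrative simplification under several assumptions that are not taken from the text: the graph is a dictionary of adjacency lists in which the edges of screened-out regions have already been removed, the search radius is approximated by a cap `max_size` on the number of regions in a zone (rather than a K-1 nearest-neighbor restriction), the single starting vertex is itself counted as a candidate zone, and the names `log_lr`, `grow_zones` and `most_likely_cluster` are hypothetical.

```python
import math


def log_lr(n_z, e_z, n_total):
    """Log Poisson likelihood ratio of a zone (0.0 when risk is not elevated)."""
    if e_z <= 0 or e_z >= n_total:
        return 0.0
    inside = n_z / e_z
    outside = (n_total - n_z) / (n_total - e_z)
    if inside <= outside:
        return 0.0
    llr = n_z * math.log(inside)
    if n_z < n_total:
        llr += (n_total - n_z) * math.log(outside)
    return llr


def grow_zones(start, adjacency, cases, expected, n_total, max_size):
    """Grow a zone from `start`, adding the best adjacent vertex at each step."""
    in_zone = {start}
    n_z, e_z = cases[start], expected[start]
    zones = [(frozenset(in_zone), log_lr(n_z, e_z, n_total))]
    while len(in_zone) < max_size:
        # Vertices adjacent to the current zone but not yet in it.
        frontier = sorted({v for u in in_zone for v in adjacency[u]} - in_zone)
        if not frontier:
            break
        best = max(
            frontier,
            key=lambda v: log_lr(n_z + cases[v], e_z + expected[v], n_total),
        )
        in_zone.add(best)
        n_z += cases[best]
        e_z += expected[best]
        zones.append((frozenset(in_zone), log_lr(n_z, e_z, n_total)))
    return zones


def most_likely_cluster(adjacency, cases, expected, max_size):
    """Scan every non-isolated vertex and return the best (zone, score) pair."""
    n_total = sum(cases)
    candidates = []
    for u in adjacency:
        if adjacency[u]:  # skip isolated vertices
            candidates.extend(
                grow_zones(u, adjacency, cases, expected, n_total, max_size)
            )
    return max(candidates, key=lambda item: item[1])
```

Each call to `grow_zones` produces at most `max_size` nested zones, so the total number of zones evaluated grows linearly with the number of regions for a fixed cap, which is what keeps this kind of search fast.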

2.4 Performance Evaluation

2.4.1 Experimental design

An experiment was designed with six single-cluster models based on simulated data in order to evaluate the performance of the RSScan method. For each cluster model, the location of the disease cluster was first fixed in the study area, and then a relative risk r>1 was assigned to the regions within the disease cluster and r=1 to the remaining regions. Given the total number of disease cases in the study area, the numbers of disease cases in the regions follow a multinomial distribution in which region i has probability $r_i p_i / \sum_{j=1}^{m} r_j p_j$, where ri and pi are the relative risk and the population at risk in region i, respectively, and m is the total number of regions in the study area.

Based on the criterion used by Kulldorff et al. (2003), the relative risk for all regions that constitute a cluster is determined using a one-sided binomial test with a significance level of 0.05 such that the null hypothesis is rejected with probability 0.999 when the alternative is a cluster with unknown risk but known location. This choice of relative risks provides an upper limit of 0.999 for the power attainable by any test.
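One replication of the simulated data could be drawn along the following lines. This is only a sketch of the multinomial allocation described above, using Python’s standard `random` module; the function name `simulate_cases` and the explicit `rng` argument are illustrative choices, and the procedure for choosing the relative risks themselves is not reproduced here.

```python
import random


def simulate_cases(populations, relative_risks, total_cases, rng):
    """Allocate total_cases to regions with probability proportional to r_i * p_i."""
    weights = [r * p for r, p in zip(relative_risks, populations)]
    counts = [0] * len(populations)
    # Each case is assigned independently, which yields a multinomial draw.
    for region in rng.choices(range(len(populations)), weights=weights, k=total_cases):
        counts[region] += 1
    return counts
```

For example, `simulate_cases(pops, risks, 500, random.Random(1))` would give one replication at the low level of disease cases; repeating the call with the same `rng` object gives further independent replications.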

Three types of shapes are designed for the simulated cluster models: round, line and trifurcate. The study area (Figure 2.2) is the State of Georgia (GA) in the United States, which includes 159 counties with a total population of 9,210,790 (year 2000). Three locations in this area (Figure 2.3) are chosen for the simulated clusters. Two levels of disease case numbers are designed: Low (500 cases) and High (5000 cases). Combining the levels of disease cases and the cluster shapes gives a total of six cluster models. A code format of ‘X_Shape’ was used to label these cluster models, where ‘X’ indicates the level of disease case numbers, with L for low and H for high. Table 2.2 lists the detailed information for each cluster model. We also simulated a scenario with no cluster for each level of disease case numbers (all regions have a relative risk r=1) so that the capability of the method to control Type I error could be tested.


Figure 2.2. Population 2000 by counties in GA in the United States

Figure 2.3. Locations of simulated clusters: (a) circular shape (b) linear shape (c) trifurcate shape


Table 2.2. Information of simulated cluster models

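| Cluster ID | Cluster Code | Count of Cases | Population in Cluster | Cluster Size (count of counties) | Relative Risk | Shape Type |
|---|---|---|---|---|---|---|
| 1 | L_Round | 500 | 1,802,970 | 7 | 1.63 | Round |
| 2 | H_Round | 5000 | 1,802,970 | 7 | 1.18 | Round |
| 3 | L_Line | 500 | 1,721,370 | 5 | 1.64 | Line |
| 4 | H_Line | 5000 | 1,721,370 | 5 | 1.18 | Line |
| 5 | L_Tri | 500 | 427,594 | 7 | 2.30 | Trifurcate shape |
| 6 | H_Tri | 5000 | 427,594 | 7 | 1.33 | Trifurcate shape |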

For each type of cluster and non-cluster scenario, 500 replications were simulated, each of which has the same cluster location and total number of disease cases over the whole study area but different numbers of disease cases in each region. The nominal significance level was set to 0.05, which means that clusters with p-values larger than 0.05 are considered not significant. The Monte Carlo testing method (Dwass 1957) with 999 repetitions was used to test the significance of the observed test statistic, so the p-value can be calculated from the rank of the observed test statistic among the total of 1000 values. In order to explore the effect of the screening level α1 in the restricted likelihood ratio function, five different values were set: 0.05, 0.1, 0.2, 0.3 and 0.4.
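The rank-based p-value of the Monte Carlo test can be written in a few lines of Python. The sketch below assumes the common convention that simulated statistics tied with the observed value count against it; the text does not state how ties are handled, and the function name `monte_carlo_p_value` is hypothetical.

```python
def monte_carlo_p_value(observed, simulated):
    """Rank of the observed statistic among observed + simulated values, as a p-value."""
    # Rank 1 means the observed statistic exceeds every simulated one.
    rank = 1 + sum(1 for s in simulated if s >= observed)
    return rank / (len(simulated) + 1)
```

With 999 simulated statistics the smallest attainable p-value is 1/1000 = 0.001, and an observed statistic ranked 50th gives exactly 0.05.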

Since the RSScan method is a hybrid of Tango’s (2008) method and Assunção et al.’s (2006) method, these two methods were chosen for comparison in the experiment. Because Kulldorff’s CSScan method is probably the most widely used method for detecting spatial clusters, it was also added to the comparison. The upper limit covered by the circular scan window in the CSScan method was set to 20% of the population in the study region, and the search radius K in the other three methods was correspondingly set to 30 counties.

2.4.2 Experimental Results

Power is the most important evaluation criterion for cluster detection tests; it indicates how effective methods are in identifying the presence of statistically noteworthy clusters (Kulldorff et al. 2003, Tango and Takahashi 2005, Assunção et al. 2006, Tango 2008). In order to understand how well these methods identify the correct boundaries of a cluster, the Kappa Index of Agreement (KIA, De Smith et al. 2007) is chosen as a complementary metric to power in this study, since it not only shows the degree of match between the detected cluster estimates and the true clusters, but also excludes the probability that the cluster regions are detected by chance. In this way, the KIA reduces the impact on the evaluation of differences in cluster model properties, such as study region size and cluster size. In order to easily compare the performance of different methods, or of different screening level values in the RSScan and Tango’s methods, the results of the six cluster models were averaged by level of disease cases and by cluster shape.

2.4.2.1 Estimated Power of Methods

The power in this study is defined as the ratio of the number of statistically significant clusters detected (significance level=0.05) to the number of replications for each cluster model (500). The results of the power analysis for the four spatial scan methods are shown in Table 2.3. The highest value for each scenario (column in the table) is shown in bold. The test statistics in Assunção et al.’s method and the CSScan method can be regarded as the restricted likelihood ratio with α1=1.

All four methods have higher power to detect significant clusters at the lower level of disease cases (L_Cas) than at the higher level (H_Cas). As α1 increases from 0.05 to 0.4, the shape that the RSScan method detects most easily shifts from the linear shape (Line) to the round shape (Round) and then to the trifurcate shape (Tri). For Tango’s method, the most easily detected shape shifts from the linear shape (Line) to the round shape (Round), while the trifurcate shape (Tri) remains the most difficult to detect whatever the value of α1 is. Assunção et al.’s method and the CSScan method both have their highest power for trifurcate clusters (Tri). However, Assunção et al.’s method has its lowest power for round clusters (Round), while the CSScan method has its lowest power for linear clusters (Line).

Table 2.3. Estimated power of four spatial scan methods (significance level=0.05)

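| Screening level | Method | Number of Cases: H_Cas | Number of Cases: L_Cas | Cluster Shape: Line | Cluster Shape: Round | Cluster Shape: Tri | Average |
|---|---|---|---|---|---|---|---|
| α1 = 0.05 | RSScan | 0.74 | 0.795 | 0.788 | 0.757 | 0.758 | 0.768 |
| α1 = 0.05 | Tango’s | 0.661 | 0.725 | 0.741 | 0.693 | 0.645 | 0.693 |
| α1 = 0.1 | RSScan | 0.773 | 0.802 | 0.824 | 0.796 | 0.743 | 0.788 |
| α1 = 0.1 | Tango’s | 0.669 | 0.733 | 0.752 | 0.71 | 0.64 | 0.701 |
| α1 = 0.2 | RSScan | 0.788 | 0.835 | 0.831 | 0.831 | 0.773 | 0.812 |
| α1 = 0.2 | Tango’s | 0.683 | 0.743 | 0.754 | 0.718 | 0.668 | 0.713 |
| α1 = 0.3 | RSScan | 0.79 | 0.831 | 0.807 | 0.817 | 0.807 | 0.81 |
| α1 = 0.3 | Tango’s | 0.693 | 0.765 | 0.754 | 0.741 | 0.693 | 0.729 |
| α1 = 0.4 | RSScan | 0.823 | 0.847 | 0.811 | 0.825 | 0.869 | 0.835 |
| α1 = 0.4 | Tango’s | 0.719 | 0.775 | 0.748 | 0.78 | 0.712 | 0.747 |
| α1 = 1 | Assunção’s | **0.866** | **0.887** | **0.873** | **0.855** | **0.901** | **0.876** |
| α1 = 1 | CSScan | 0.779 | 0.798 | 0.716 | 0.756 | 0.894 | 0.789 |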

Figure 2.4 shows the estimated average power for each method across all scenarios. Assunção et al.’s method has the highest average power (0.876) among the four methods over all levels of disease cases and all cluster shapes. The RSScan method has good power, especially when α1 is large, such as 0.4 (0.835). The CSScan method has relatively low power (0.789), and Tango’s method has the lowest power whatever the value of α1 is.

2.4.2.2 Kappa Index of Agreement

To evaluate how well these methods identify the correct boundaries of a cluster, the Kappa index of agreement (KIA) between the most likely cluster detected and the true cluster was used as another performance metric for the four methods. One advantage of KIA is that it discounts the agreement between detected and true cluster regions that would occur merely by chance. There are two categories of regions: inside cluster and outside cluster. Given the study area size (S), the true cluster size (T), the detected cluster estimate size (D), and the size of the intersection between the cluster estimate and the true cluster (I), Table 2.4 shows the contingency table for detected cluster estimates and true clusters.

[Figure 2.4: power (y-axis) plotted against screening level α1 (x-axis: 0.05, 0.1, 0.2, 0.3, 0.4, 1) for RSScan, Tango’s, Assunção’s, and CSScan]

Figure 2.4. Estimated average power of the four spatial scan methods

Table 2.4. Contingency table for detected cluster estimates and true clusters

                                      Cluster Estimate
                            Inside Cluster   Outside Cluster   Total
True      Inside Cluster    I                T-I               T
Cluster   Outside Cluster   D-I              S-T-D+I           S-T
          Total             D                S-D               S

Based on the above contingency table, the KIA equation can be derived for this study (Equation 2.4):

\[
\kappa = \frac{O - E}{1 - E} \tag{2.4}
\]

\[
O = \frac{I + (S - T - D + I)}{S}, \qquad E = \frac{T \times D + (S - T) \times (S - D)}{S^{2}}
\]

where O is the observed proportion of matching values (the contingency table diagonal) and E is the expected proportion of matches in this diagonal, assuming that the two categories of the true cluster are independent of the two categories of the cluster estimate. A KIA of 1 means perfect agreement, and a KIA of 0 means agreement no better than expected by chance.
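As a minimal sketch of Equation 2.4, the function below computes the KIA from the four sizes in Table 2.4. The function name `kappa_index` and the example sizes are hypothetical and are not taken from the experiments.

```python
def kappa_index(S, T, D, I):
    """Kappa index of agreement between a detected and a true cluster.

    S: study area size, T: true cluster size, D: detected cluster size,
    I: size of the intersection of the detected and true clusters.
    """
    if S <= 0:
        raise ValueError("study area size must be positive")
    observed = (I + (S - T - D + I)) / S             # diagonal of Table 2.4
    expected = (T * D + (S - T) * (S - D)) / S ** 2  # agreement expected by chance
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


# Hypothetical sizes (for example, counts of regions).
print(kappa_index(S=100, T=10, D=10, I=10))  # detected cluster equals the true cluster
print(kappa_index(S=100, T=10, D=10, I=1))   # overlap equal to the chance level
```

A detected cluster that overlaps the true cluster less than chance would predict gives a negative value.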

With the highest KIA value for each scenario (column in the table) in bold, Table 2.5 indicates that all methods identify the correct boundaries of a cluster slightly better when there is a relatively low level of disease cases in the study region (L_Cas). As α1 increases from 0.05 to 0.4, the cluster shape whose boundaries the RSScan and Tango’s methods identify best shifts from linear (Line) to round (Round). The boundaries of trifurcate clusters (Tri) are the most difficult for both methods to identify correctly. Assunção et al.’s method performs relatively better for trifurcate clusters (Tri) than for other shapes, and the CSScan method performs best for round clusters (Round).

Figure 2.5 shows the average KIA value for each method across all scenarios. The RSScan method identifies the boundaries of clusters of various shapes better than the other three methods, and its performance peaks when α1 is 0.2 (0.614). The performance of Tango’s method peaks when α1 is 0.4, where its KIA value is similar to that of the CSScan method (about 0.47). Assunção et al.’s method has a relatively low KIA (0.435), possibly because many low-risk regions are included in its cluster estimates.


Table 2.5. KIAs between the most likely clusters and true clusters for four spatial scan methods

                         Number of Cases       Cluster Shape
α1      Method         H_Cas     L_Cas     Line      Round     Tri       Average
0.05    RSScan         0.506     0.526     0.598     0.511     0.438     0.516
0.05    Tango’s        0.365     0.373     0.47      0.354     0.283     0.369
0.1     RSScan         0.571     0.581     0.661     0.603     0.464     0.576
0.1     Tango’s        0.391     0.397     0.498     0.386     0.298     0.394
0.2     RSScan         0.601*    0.628*    0.683*    0.667*    0.492*    0.614*
0.2     Tango’s        0.416     0.426     0.499     0.425     0.338     0.421
0.3     RSScan         0.56      0.599     0.612     0.638     0.489     0.58
0.3     Tango’s        0.441     0.457     0.493     0.48      0.374     0.449
0.4     RSScan         0.506     0.546     0.527     0.571     0.481     0.526
0.4     Tango’s        0.47      0.475     0.493     0.548     0.377     0.473
1       Assunção’s     0.424     0.445     0.383     0.444     0.477     0.435
1       CSScan         0.468     0.481     0.457     0.577     0.391     0.475

* Highest value in the column (shown in bold in the original table).

[Figure 2.5: KIA (y-axis) plotted against screening level α1 (x-axis: 0.05, 0.1, 0.2, 0.3, 0.4, 1) for RSScan, Tango’s, Assunção’s, and CSScan]

Figure 2.5. Average KIAs of four spatial scan methods

2.4.2.3 Non-cluster Scenario Results

For the non-cluster scenario, Table 2.6 shows that, on average, each method detected a significant cluster in no more than 5% of the 500 non-clustered replications. Given the significance level of 0.05 used for these tests, the results indicate that all methods control the Type I error well.


Table 2.6. Average Type I error of four spatial scan methods

              RSScan    Tango’s    Assunção’s    CSScan
α1 = 0.05     0.04      0.05       -             -
α1 = 0.1      0.044     0.046      -             -
α1 = 0.2      0.035     0.044      -             -
α1 = 0.3      0.043     0.045      -             -
α1 = 0.4      0.048     0.041      -             -
α1 = 1        -         -          0.046         0.042

2.5 Application: Georgia Lung Cancer, 1998-2005

Based on the above experimental results, the RSScan method with an appropriate screening level α1 was found to usually have a higher capability than the other three methods to detect significant clusters and identify the boundaries of clusters in arbitrary shapes. A value of 0.2 is recommended as the default α1.

We use the RSScan method to detect clusters of lung cancer diagnosed in Georgia (GA) during 1998-2005. Data from the Georgia Comprehensive Cancer Registry show a total of 42,521 lung cancer cases in GA from 1998 to 2005, of which 25,615 are male cases and 16,906 are female cases. The expected number of cases for county i is calculated from the GA population in 2000 (Figure 2.2), adjusted for age and sex.

Figure 2.6 shows the standardized incidence ratio (SIR) for each county in GA and the cluster detected by the RSScan method with screening level α1 = 0.2. The detected cluster is located in northwestern GA and includes a total of 8 counties: Bartow, Gordon, Haralson, Murray, Polk, Walker, Whitfield, and Paulding. The p-value of the cluster is 0.002, and a total of 3,177 cases occurred within the cluster area during that time. The SIR of the cluster is 1.31.
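As a small worked example, the SIR is the ratio of observed to expected cases, so the two reported figures imply an approximate expected count for the cluster area. The function name `sir` is hypothetical, and the implied count is only approximate because the reported SIR is rounded to two decimals.

```python
def sir(observed, expected):
    """Standardized incidence ratio: observed cases divided by expected cases."""
    if expected <= 0:
        raise ValueError("expected cases must be positive")
    return observed / expected


observed_cases = 3177  # cases inside the detected cluster (from the text)
reported_sir = 1.31    # SIR of the detected cluster (from the text)

# Expected count implied by the two reported figures (an approximation).
implied_expected = observed_cases / reported_sir
print(round(implied_expected))
print(round(sir(observed_cases, implied_expected), 2))
```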


Figure 2.6. SIRs and the detected cluster of lung cancer incidence in GA, 1998-2005

2.6 Discussion and Conclusions

It should be noted that the performance of the RSScan method and of the other three methods varies with the situation, such as the number of disease cases and the cluster shape. This finding corresponds well with the power analysis given by Waller and Gotway (2004), which shows that most tests to detect clusters have spatially heterogeneous power. The high estimated power in the experiment indicates that these methods are suitable for exploratory studies that flag questionable areas for further investigation. However, the relatively low KIA values indicate that these methods may be inappropriate for applications that require accurate cluster boundaries, such as the analysis of changes in spatial clusters over time. To gain deeper insight into spatio-temporal disease risk patterns, disease risk modeling, such as spatio-temporal multilevel models, may be a better approach.

Tango’s restricted likelihood ratio has good interpretability and strong power in detecting disease clusters with a circular scan window (Tango 2008). To our knowledge, however, no previous work has discussed its performance in detecting clusters in arbitrary shapes with other search strategies. This study is the first to implement and test the restricted likelihood ratio combined with Assunção et al.’s dMST search strategy to quickly detect disease clusters in arbitrary shapes. To understand the performance of this redesigned hybrid method in various situations, the performance test was designed with more cluster models than those of Tango (2008) and Assunção et al. (2006); it includes six cluster models and two non-cluster scenarios. These cluster models consider different numbers of disease cases in a study area and various cluster shapes. The choice of the screening level α1 in the restricted likelihood ratio, when combined with Assunção et al.’s dMST search strategy in the RSScan method, is also explored. Besides power, this study proposes using KIA to evaluate and compare how well cluster detection methods identify the boundaries of clusters, in order to avoid the effects of different cluster model properties. Finally, the RSScan method was applied to detect the cluster of lung cancer in Georgia between 1998 and 2005.

The experimental results indicate that the RSScan method with an appropriate screening level α1 generally has a higher or similar capability to quickly detect statistically significant disease clusters and identify their boundaries compared with Tango’s method, Assunção et al.’s method, and Kulldorff’s CSScan method under the same conditions, especially for clusters in irregular shapes. Based on the results of this study, 0.2 is recommended as the default screening level α1 in the RSScan method.


References

Anselin, L., 1995. Local indicators of spatial association-LISA. Geographical Analysis, 27 (2), 93-115.

Assunção, R., Costa, M., Tavares, A. & Ferreira, S., 2006. Fast detection of arbitrarily shaped disease clusters. Statistics in Medicine, 25 (5), 723-742.

Besag, J. & Newell, J., 1991. The detection of clusters in rare diseases. Journal of the Royal Statistical Society. Series A (Statistics in Society), 154 (1), 143-155.

De Smith, M., Goodchild, M. & Longley, P., 2007. Geospatial analysis: A comprehensive guide to principles, techniques and software tools: Troubador Publishing.

Duczmal, L. & Assunção, R., 2004. A simulated annealing strategy for the detection of arbitrarily shaped spatial clusters. Computational Statistics & Data Analysis, 45 (2), 269-286.

Duczmal, L., Kulldorff, M. & Huang, L., 2006. Evaluation of spatial scan statistics for irregularly shaped clusters. Journal of Computational and Graphical Statistics, 15 (2), 428-442.

Dwass, M., 1957. Modified randomization tests for nonparametric hypotheses. Annals of Mathematical Statistics, 28 (1), 181-187.

Goovaerts, P., 2010. Geostatistical analysis of county level lung cancer mortality rates in the southeastern United States. Geographical Analysis, 42 (1), 32-52.

Jacquez, G. & Greiling, D., 2003. Local clustering in breast, lung and colorectal cancer in Long Island, New York. International Journal of Health Geographics, 2 (1), 3.

Kulldorff, M., 1997. A spatial scan statistic. Communications in Statistics-Theory and Methods, 26 (6), 1481-1496.

Kulldorff, M., Huang, L., Pickle, L. & Duczmal, L., 2006. An elliptic spatial scan statistic. Statistics in Medicine, 25 (22), 3929-3943.


Kulldorff, M. & Information Management Services Inc., 2010. SaTScan™ v9.1: Software for the spatial and space-time scan statistics. http://www.satscan.org/

Kulldorff, M. & Nagarwalla, N., 1995. Spatial disease clusters - detection and inference. Statistics in Medicine, 14 (8), 799-810.

Kulldorff, M., Tango, T. & Park, P.J., 2003. Power comparisons for disease clustering tests. Computational Statistics & Data Analysis, 42 (4), 665-684.

Lawson, A., 2006. Statistical methods in spatial epidemiology, 2nd ed. Chichester, England ; Hoboken, NJ: Wiley.

Maheswaran, R. & Craglia, M., 2004. GIS in public health practice. Boca Raton: CRC Press.

Openshaw, S., Charlton, M., Wymer, C. & Craft, A., 1987. A mark 1 geographical analysis machine for the automated analysis of point data sets. International Journal of Geographical Information Systems, 1 (4), 335-358.

Rogerson, P. & Yamada, I., 2009. Statistical detection and surveillance of geographic clusters. Boca Raton: CRC Press.

Takahashi, K., Yokoyama, T. & Tango, T., 2010. FleXScan v3.1: Software for the flexible scan statistic. http://www.niph.go.jp/soshiki/gijutsu/download/flexscan/index.html

Tango, T., 2008. A spatial scan statistic with a restricted likelihood ratio. Japanese Journal of Biometrics, 29 (2), 75-95.

Tango, T. & Takahashi, K., 2005. A flexibly shaped spatial scan statistic for detecting clusters. International Journal of Health Geographics, 4, 11-15.

Turnbull, B.W., Iwano, E.J., Burnett, W.S., Howe, H.L. & Clark, L.C., 1990. Monitoring for clusters of disease - application to leukemia incidence in upstate New York. American Journal of Epidemiology, 132 (1), S136-S143.

Waller, L. & Gotway, C., 2004. Applied spatial statistics for public health data: Wiley-Interscience.


Yiannakoulias, N., Rosychuk, R.J. & Hodgson, J., 2007. Adaptations for finding irregularly shaped disease clusters. International Journal of Health Geographics, 6 (1), 28.


CHAPTER 3

HIERARCHICAL BAYESIAN MODELING OF THE SPATIO-TEMPORAL PATTERNS OF

LUNG CANCER INCIDENCE RISKS IN GEORGIA, 2000-2007²

² Yin, P., Mu, L., Madden, M. and Vena, J. To be submitted to International Journal of Health Geographics.

Abstract

Lung cancer is the second most commonly diagnosed cancer in men and women in

Georgia. However, the related studies about the patterns of lung cancer in Georgia at a fine

spatio-temporal scale are very limited. In this study, hierarchical Bayesian models are used to

explore the spatio-temporal patterns of lung cancer incidence risks by race and sex in Georgia for the period of 2000 to 2007. With the census tract level as the spatial scale and the two-year

period aggregation as the temporal scale, we propose and compare a total of seven Bayesian

spatio-temporal models including two under the separate modeling framework and five models

under the joint modeling framework. One of these models is finally chosen and its results clearly

show that the northwest region of Georgia has stably elevated lung cancer incidence risks for all

population groups during the study period. Showing more detailed and reliable variations of the

lung cancer incidence risks in space and time, our study aims to better support assessing

healthcare performance, establishing etiological hypotheses, and making effective and efficient

health policies. In addition, our study shows that there are strong inverse relationships between

the socioeconomic status (SES) and the lung cancer incidence risk in Georgia males, especially

white males, and weak inverse relationships in both white and black Georgia females. The spatial and temporal random effects in the models may also provide some implications about potential disease risk factors for further ecological studies. The limitations of this study, including the lack of smoking data and population estimation error, are also discussed at the end.

Keywords: Bayesian model, Spatio-temporal pattern, Lung cancer, Socioeconomic status,

Georgia

3.1 Introduction

Lung cancer is not only the second most commonly diagnosed cancer in men and women, but also the leading cause of cancer-related death in Georgia in the United States (Georgia Department of Public Health 2008). However, as far as we know, there are very few lung cancer studies in Georgia, and most of them focus on descriptive analyses using crude rates at a coarse spatio-temporal scale, such as 5-year incidence rates at the health district or county level. Such analytical results usually obscure the detailed variations of lung cancer risks in space and time, and could introduce inferential biases in etiological hypotheses. In addition, they provide only limited help for healthcare performance assessment and for health policy making aimed at improving the efficiency of interventions and the distribution of resources.

The small number problem is one of the challenges in mapping lung cancer risk at a fine spatio-temporal scale. For rare diseases such as cancers, the counts of cases can become very sparse at fine spatio-temporal scales, especially when demographic dimensions such as sex, age, and race are also considered. With sparse counts, some traditional estimates of disease risk or relative risk, such as the Standardized Incidence Ratio (SIR), can become unreliable and may seriously misrepresent the true disease risk because of high sampling variability. Recently, hierarchical Bayesian models have been widely used to map disease risk spatially or spatio-temporally (Bernardinelli et al. 1995, Waller et al. 1997, Xia and Carlin 1998, Knorr-Held 2000, Mollié 2001, Wakefield et al. 2001, Best et al. 2005, Richardson et al. 2006, Abellan et al. 2008, Lawson 2009, Fortunato et al. 2011). For sparse count data, Bayesian models combine the fit to the data with subjective prior information, which makes it possible to mitigate the inferential biases of frequentist methods that depend entirely on the fit to the data. In addition, it is easy to develop model-based spatial and spatio-temporal smoothing methods under the Bayesian framework that not only consider the effects of disease risk factors, but also borrow strength from neighboring areas and/or time periods.

In this study, we use hierarchical Bayesian models to explore the spatio-temporal patterns of lung cancer incidence risks in Georgia. The analyses are conducted for four population groups stratified by sex and race at the census tract level over four two-year periods from 2000 to 2007. A total of seven spatio-temporal models under two modeling frameworks are proposed and compared. One framework models the relative risks (RRs) of each population group separately, and the other jointly models the RRs of all population groups under the assumption that some common disease risk factors exist in all population groups. One model is finally chosen based on the deviance information criterion, and its results are interpreted. The aim of the study is to obtain reliable spatio-temporal patterns of lung cancer incidence risks by sex and race in Georgia at a fine scale, which are expected to identify spatio-temporal hot spots of disease risk in specific population groups for further study and to facilitate related health policy making in Georgia. In addition, the effects of area-based socioeconomic status (SES) on the lung cancer incidence risks in each population group are evaluated in the modeling. An understanding of the socioeconomic gradients in lung cancer incidence risks by race and sex could provide some implications for how to reduce lung cancer disparities in Georgia. This paper is organized as follows. In the next section, the study area and data sources are described. Then, the method for population estimation, the area-based SES measure, and the seven Bayesian spatio-temporal models under the two modeling frameworks are explained. Next, the modeling results and discussions are given, followed by some conclusions.

3.2 Study Area and Data

Our study area is the state of Georgia, which had 1,618 census tracts in 2000. Figure 3.1 shows the distribution of population density by census tract in Georgia in 2000. The 10 most populous cities in 2000 are also shown on this map. The population is mainly concentrated in the northern region of Georgia, especially in the metropolitan Atlanta area, which includes the cities of Atlanta, Sandy Springs, Roswell, and Marietta. All of the population and socioeconomic data come from the U.S. Census.

Figure 3.1. Population density by census tract and the 10 most populous cities in Georgia 2000

The lung cancer data (primary site codes C340-C349 in ICD-O-3) are extracted from the Georgia Comprehensive Cancer Registry (Georgia Department of Public Health 2011). A total of 44,671 lung cancer cases were diagnosed in Georgia from 2000 to 2007. In this study, we consider only the cases among white and black individuals over 20 years old, which total 43,504. A total of 3,219 cases are excluded from the analyses because their spatial accuracy is lower than the census tract level. Therefore, 40,285 cases are finally included and aggregated to the 1,618 census tracts in the Census 2000 geography. Table 3.1 shows the total number of cases among individuals over 20 years old and the percentage of cases included in the analyses by sex and race. The lowest percentage of included cases is 89.81%, for black males.

Table 3.1. Total number of cases of individuals over 20 years old and the percentage of included cases in the analyses by sex and race

           White                               Black
           Total cases   Included cases (%)    Total cases   Included cases (%)
Male       20,547        90.59                 5,557         89.81
Female     14,882        91.36                 3,362         91.73

To avoid a high level of sparseness while keeping the temporal dimension, cases are aggregated to four two-year periods (2000-2001, 2002-2003, 2004-2005, and 2006-2007) for the analyses. The average number of cases per census tract per two-year period is 2.9 for white males, 2.1 for white females, 0.77 for black males, and 0.48 for black females. The expected numbers of cases by census tract, two-year period, sex, and race are calculated from reference rates, which are the average age-specific incidence rates by sex and race across the whole of Georgia over the period 2000-2007. A total of 10 age groups are used in calculating the reference rates: 20-39, 40-49, seven five-year age groups from 50 to 84, and 85 and over.
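The calculation of expected cases described above can be sketched as follows, assuming that the population at risk is expressed in person-years and that the reference rates are per person-year. The function name `expected_cases` and the numbers are hypothetical; the study used 10 age groups rather than the three shown here.

```python
def expected_cases(person_years_by_age, reference_rate_by_age):
    """Expected cases for one census tract, time period, and population group.

    person_years_by_age: person-years at risk in each age group.
    reference_rate_by_age: reference incidence rate per person-year in the
        same age groups, for the same sex and race.
    """
    if len(person_years_by_age) != len(reference_rate_by_age):
        raise ValueError("age groups do not line up")
    return sum(n * r for n, r in zip(person_years_by_age, reference_rate_by_age))


# Hypothetical tract with three age groups.
person_years = [2000.0, 1000.0, 500.0]
rates = [0.0001, 0.001, 0.004]
print(expected_cases(person_years, rates))
```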

3.3 Methods

3.3.1 Population Estimation for Intercensal Years

The population at risk is important for accurately calculating expected cases and estimating disease risk. However, census population data at the tract level are available only for census years (e.g., 2000 and 2010). The geographic boundaries of census tracts also change from one census to the next. For example, there are a total of 1,618 tracts in Census 2000, while a total of 1,969 tracts exist in Census 2010. At the county level, the Census Bureau (Population Estimates Program 2011) provides estimates of population by race, sex, and age for each intercensal year. In this study, the boundaries of census tracts in 2000 are used as the standard geography for the whole study period. With the census population data currently available, one of the interpolation methods proposed by Best and Wakefield (1999) is used to estimate the population by race, sex, and age at the census tract level for each intercensal year.

The steps of the population estimation are as follows. First, we use the overlay function in the Geographical Information System (GIS) ArcGIS™ (ESRI, Inc.) and the areal weighting interpolation method (Goodchild and Lam 1980) to estimate the population in 2010 using the geography of the 2000 census tracts. To improve accuracy, we use the 2010 population data at the block level instead of the tract level, since blocks are at a finer spatial scale. Then, for each population group by race, sex, and age in a county, we assume that the population $N$ is multinomially distributed to the census tracts in that county with a vector of apportionment probabilities $\mathbf{p} = (p_1, \ldots, p_I)^{T}$, where $I$ denotes the number of census tracts in that county and $p_i$ is the proportion of the county population $N$ that lives in census tract $i$. The probabilities $\mathbf{p}$ for each intercensal year are estimated via a simple linear interpolation between the censuses (i.e., 2000 and 2010).
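A minimal sketch of the interpolation step is given below. It assumes that the tract population is taken as the multinomial mean, that is, the county estimate multiplied by the interpolated share; the function names and the example shares are hypothetical, and the procedure of Best and Wakefield (1999) may differ in detail.

```python
def interpolate_shares(shares_2000, shares_2010, year):
    """Linearly interpolate tract apportionment probabilities between censuses."""
    if not 2000 <= year <= 2010:
        raise ValueError("year must lie between the two censuses")
    w = (year - 2000) / 10.0
    shares = [(1 - w) * a + w * b for a, b in zip(shares_2000, shares_2010)]
    total = sum(shares)
    return [s / total for s in shares]  # renormalize to guard against rounding drift


def tract_population(county_population, shares):
    """Mean tract counts under the multinomial assumption: N * p_i."""
    return [county_population * p for p in shares]


# Hypothetical county with three tracts and a 2005 county estimate of 10,000.
shares_2005 = interpolate_shares([0.5, 0.3, 0.2], [0.4, 0.3, 0.3], 2005)
print(tract_population(10000, shares_2005))
```

Both census years use the 2000 tract geography here, which is why the 2010 block populations are first re-apportioned to the 2000 tracts.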

Based on the population estimates, the reference rates for all population groups are then calculated. Using the U.S. 2000 standard population for standardization, the direct age-adjusted (over 20 years old) annual lung cancer incidence rates (per 100,000 population) in Georgia (2000-2007) are 132.7 for white males, 75.3 for white females, 135.2 for black males, and 54.5 for black females.
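Direct age adjustment weights each age-specific rate by the share of that age group in a standard population. The sketch below uses hypothetical rates and standard population counts, not the Georgia data or the actual U.S. 2000 standard population.

```python
def direct_age_adjusted_rate(age_specific_rates, standard_population):
    """Average of age-specific rates weighted by standard population counts."""
    if len(age_specific_rates) != len(standard_population):
        raise ValueError("age groups do not line up")
    total = sum(standard_population)
    weighted = sum(r * w for r, w in zip(age_specific_rates, standard_population))
    return weighted / total


# Hypothetical rates per 100,000 and standard population counts for three age groups.
rates_per_100k = [10.0, 100.0, 400.0]
standard_counts = [500, 300, 200]
print(direct_age_adjusted_rate(rates_per_100k, standard_counts))
```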

3.3.2 Area-based SES Measure

Because census tracts are relatively homogeneous, an area-based SES measure at the census tract level can be a good surrogate for individual SES in a health study when individual SES is unavailable (Krieger 1992). Detailed discussions of area-based SES measures can be found in the literature (Krieger et al. 1997, Carstairs 2001, Krieger et al. 2002, Darden et al. 2009). Various single-variable or composite measures can capture different aspects of socioeconomic characteristics. In this study, we use the modified Darden-Kamel Composite Index (Darden et al. 2009) to measure SES at the census tract level and evaluate its relationships with lung cancer incidence risks by race and sex. The modified Darden-Kamel Composite Index is an average Z score of a total of nine socioeconomic variables from U.S. census data (Table 3.2).

Table 3.2. Variables incorporated in the modified Darden-Kamel Composite Index

Modified Darden-Kamel Composite Index
1. Percentage of residents with university degrees
2. Median household income
3. Percentage of managerial and professional positions
4. Median value of dwelling
5. Median gross rent of dwelling
6. Percentage of homeownership
7. Percentage below poverty
8. Unemployment rate
9. Percentage of households with vehicle

Based on Census 2000 data, the modified Darden-Kamel Composite Index is calculated for each census tract in Georgia, with values ranging from -31.05 to 24.77. A larger value means a higher SES. Based on the index, the census tracts in Georgia are divided into five SES groups with equal numbers of census tracts. Group 1 has the highest SES and group 5 has the lowest. Figure 3.2 shows the distribution of SES by census tract. The higher SES regions are mainly concentrated in the large cities in Georgia.
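The construction of a composite Z-score index and its quintile groups can be sketched as below. This is an illustration under stated assumptions, not the published Darden-Kamel procedure: the use of the population standard deviation, the sign reversal for variables where a larger value suggests lower SES (such as the poverty and unemployment variables), and all names and numbers are assumptions.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def composite_index(variables, reversed_vars=()):
    """Average Z score per tract; `variables` maps a name to per-tract values.

    Variables named in `reversed_vars` have their sign flipped (assumption).
    """
    names = list(variables)
    n_tracts = len(variables[names[0]])
    z = {}
    for name in names:
        zs = z_scores(variables[name])
        z[name] = [-v for v in zs] if name in reversed_vars else zs
    return [mean(z[name][i] for name in names) for i in range(n_tracts)]


def quintile_groups(index):
    """Equal-sized groups: group 1 has the highest index, group 5 the lowest."""
    order = sorted(range(len(index)), key=lambda i: index[i], reverse=True)
    groups = [0] * len(index)
    for rank, i in enumerate(order):
        groups[i] = rank * 5 // len(index) + 1
    return groups


# Hypothetical data for five tracts and two of the nine variables.
data = {"income": [1, 2, 3, 4, 5], "poverty": [5, 4, 3, 2, 1]}
index = composite_index(data, reversed_vars=("poverty",))
print(quintile_groups(index))
```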

Figure 3.2. Quintile map of SES in Georgia 2000

3.3.3 Bayesian Spatio-temporal Models

Bayesian models have naturally hierarchical structures. At the first level, the number of observed cases $y_{itk}$ for census tract $i = 1, \ldots, 1618$, time period $t = 1, \ldots, 4$, and population group by race and sex $k = 1, \ldots, 4$ is assumed to follow a Poisson distribution with mean $E_{itk} R_{itk}$, where $E_{itk}$ and $R_{itk}$ are, respectively, the known expected number of cases and the unknown RR compared to the corresponding reference risk (measured by the reference rate of the specific population group) in census tract $i$, time period $t$, and population group $k$. At the second level, the logarithms of the RRs are decomposed into fixed effects for measured risk factors such as SES, and random effects for unmeasured or unobserved risk factors. In Bayesian spatio-temporal models, three random effects are usually considered: a spatial random main effect, a temporal random main effect, and a spatio-temporal interaction random effect. Both the spatial and temporal random main effects can be further divided into a structured component and an unstructured component, which reflect the dependent and heterogeneous variations of risks in space and time, respectively.

In the Bayesian paradigm, prior distributions need to be assigned to the model parameters and the random effects. Inferences are then made based on the posterior distributions of the parameters and random effects derived from simulations.
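The first level of the hierarchy can be sketched as a Poisson log-likelihood for one tract, period, and population group. The function name and the numbers are hypothetical. The sketch also shows why sparse counts are a problem without priors: the likelihood alone is maximized at the crude SIR, y/E.

```python
import math


def poisson_log_likelihood(y, expected, log_rr):
    """Log-likelihood of y observed cases when the mean is expected * exp(log_rr)."""
    mu = expected * math.exp(log_rr)
    return y * math.log(mu) - mu - math.lgamma(y + 1)


# Hypothetical tract, period, and group: 3 observed cases, 2 expected.
y, expected = 3, 2.0
at_sir = poisson_log_likelihood(y, expected, math.log(y / expected))
at_reference = poisson_log_likelihood(y, expected, 0.0)
print(at_sir > at_reference)  # the crude SIR fits better than RR = 1
```

In the hierarchical models, the second-level random effects and their priors pull such crude estimates towards those of neighboring tracts and time periods.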

In this study, we model the RR of each population group under two modeling frameworks. The first framework is separate modeling, where each population group has an independent set of random effects. The second framework is joint modeling, where shared random effects represent some common unmeasured or unknown risk factors among all the population groups. This joint modeling framework has been used to map one disease for multiple population groups or multiple diseases that have common risk factors (Knorr-Held and Best 2001, Held et al. 2005, Richardson et al. 2006, Downing et al. 2008). We compare a total of seven models, including two separate models and five joint models. Table 3.3 shows the components of the logarithms of the RRs in each model.

Table 3.3. Components of logarithms of RRs in the seven Bayesian spatio-temporal models

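Model Type   Model #   Logarithms of RRs
Separate     Model 1   $\log(R_{itk}) = \alpha_k + \boldsymbol{\beta}_k^{T} \mathbf{x}_i + \lambda_{ik} + \xi_{tk}$
Separate     Model 2   $\log(R_{itk}) = \alpha_k + \boldsymbol{\beta}_k^{T} \mathbf{x}_i + \lambda_{ik} + \xi_{tk} + \upsilon_{itk}$
Joint        Model 3   $\log(R_{itk}) = \alpha_k + \boldsymbol{\beta}_k^{T} \mathbf{x}_i + \delta_{1,k} \phi_i + \delta_{2,k} \varsigma_t + \omega_{itk}$
Joint        Model 4   $\log(R_{itk}) = \alpha_k + \boldsymbol{\beta}_k^{T} \mathbf{x}_i + \delta_{1,k} \phi_i + \delta_{2,k} \varsigma_t + \lambda_{ik} + \xi_{tk}$
Joint        Model 5   $\log(R_{itk}) = \alpha_k + \boldsymbol{\beta}_k^{T} \mathbf{x}_i + \delta_{1,k} \phi_i + \delta_{2,k} \varsigma_t + \theta_{it} + \lambda_{ik} + \xi_{tk}$
Joint        Model 6   $\log(R_{itk}) = \alpha_k + \boldsymbol{\beta}_k^{T} \mathbf{x}_i + \delta_{1,k} \phi_i + \delta_{2,k} \varsigma_t + \lambda_{ik} + \xi_{tk} + \omega_{itk}$
Joint        Model 7   $\log(R_{itk}) = \alpha_k + \boldsymbol{\beta}_k^{T} \mathbf{x}_i + \delta_{1,k} \phi_i + \delta_{2,k} \varsigma_t + \theta_{it} + \lambda_{ik} + \xi_{tk} + \omega_{itk}$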

In each model, $\alpha_k$ is the overall log-RR for population group $k$ across the whole study area over the whole study period, and $\boldsymbol{\beta}_k$ are the coefficients associated with the SES group vector $\mathbf{x}_i$ for population group $k$. The seven models differ in their random-effect components. Separate models 1 and 2 both have a spatial random main effect $\lambda_{ik}$ for population group $k$ in census tract $i$ and a temporal random main effect $\xi_{tk}$ for population group $k$ in time period $t$. Model 2 also considers the spatio-temporal interaction $\upsilon_{itk}$ in census tract $i$ and time period $t$ for population group $k$. In addition to population-group-specific random effects like those in separate models 1 and 2, joint models 3-7 also consider random effects shared across the four population groups by race and sex. In these shared components of the joint models, $\phi_i$ represents the shared spatial component in census tract $i$, and $\varsigma_t$ represents the shared temporal component in time period $t$. The coefficients $\delta_{1,k}$ and $\delta_{2,k}$ allow the shared spatial and temporal components to have different gradients in each population group. In models 5 and 7, a shared spatio-temporal interaction $\theta_{it}$ is also considered. With respect to the population-group-specific random effects, model 3 considers only a spatio-temporal interaction random effect $\omega_{itk}$ for population group $k$, models 4 and 5 consider only the specific spatial and temporal random main effects $\lambda_{ik}$ and $\xi_{tk}$, and models 6 and 7 consider all three. In models 4-7, we set $\lambda_{ik}$ and $\xi_{tk}$ equal to 0 for white males ($k = 1$), so that these two components for the other population groups ($k = 2, 3, 4$) are the differentials of the spatial and temporal random main effects between that population group and white males.

Some early experiments show that considering only structured components in the spatial and temporal random main effects gives better modeling results than considering both structured and unstructured components. Therefore, the widely used Gaussian intrinsic conditional autoregressive (CAR normal) prior proposed by Besag et al. (1991) is assigned to the spatial random main effects $\lambda_{ik}$ and $\phi_i$ and to the temporal random main effects $\xi_{tk}$ and $\varsigma_t$ to represent the dependent variations of RRs over space and time. For a spatial random effect in an area, the CAR normal prior specifies that its conditional distribution, given all other spatial effects, is a normal distribution with mean equal to the average of the spatial effects of its neighboring areas and variance inversely proportional to the number of these neighbors. In this study, two areas are defined as spatial neighbors if they share a border or a vertex. For a temporal random effect in a time period, the CAR normal prior smooths it towards the temporal effects of its temporal neighbors (i.e., the previous and the next time periods).
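The conditional distribution described above can be sketched as follows, assuming unit neighbor weights and a single precision parameter, so that the conditional variance is 1 / (precision × number of neighbors). The function name, the neighbor structure, and the effect values are hypothetical.

```python
def car_conditional(effects, neighbors, i, precision):
    """Conditional mean and variance of effect i under an intrinsic CAR prior."""
    nbrs = neighbors[i]
    if not nbrs:
        raise ValueError("area or period has no neighbors")
    cond_mean = sum(effects[j] for j in nbrs) / len(nbrs)
    cond_variance = 1.0 / (precision * len(nbrs))
    return cond_mean, cond_variance


# Four hypothetical time periods: temporal neighbors are the previous and next periods.
temporal_neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
effects = [0.10, 0.20, 0.40, 0.30]
print(car_conditional(effects, temporal_neighbors, 1, precision=4.0))
```

The same function applies to the spatial effects if `neighbors` lists, for each census tract, the tracts that share a border or a vertex with it.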

Due to the lack of strong prior knowledge, vague prior distributions are used for the other parameters in the models, following the current literature. We assign a flat prior to the overall log-RR terms $\alpha_k$ and independent Normal$(0, 10^{5})$ prior distributions to the fixed effects $\boldsymbol{\beta}_k$. The logarithms of the scaling parameters $\delta_{1,k}$ and $\delta_{2,k}$ are assigned independent Normal$(0, 5)$ prior distributions (Downing et al. 2008). With respect to the spatio-temporal interaction random effects, independent normal prior distributions with mean 0 and precisions $\tau_{\upsilon k}$, $k = 1, \ldots, 4$, are assigned to $\upsilon_{itk}$ in model 2 for each population group; independent normal prior distributions with mean 0 and precision $\tau_{\theta}$ are assigned to $\theta_{it}$ in models 5 and 7; and a multivariate normal prior distribution with covariance matrix $\Sigma$ is assigned to $\omega_{itk}$ in models 3, 6, and 7 to allow correlations among the population groups (Richardson et al. 2006, Downing et al. 2008). Following previous studies (Kelsall and Wakefield 1999, Best et al. 2005, Downing et al. 2008), independent conjugate Gamma$(0.5, 0.0005)$ hyperpriors are assigned to all of the precision parameters in the normal priors, both for the shared components ($\tau_{\phi}$, $\tau_{\varsigma}$, $\tau_{\theta}$) and for the population-group-specific components ($\tau_{\lambda k}$, $\tau_{\xi k}$, $\tau_{\upsilon k}$, $k = 1, \ldots, 4$). The covariance matrix $\Sigma$ in the multivariate normal prior is assigned a Wishart$(Q, 4)$ distribution, where $Q$ is a diagonal matrix with diagonal entries of 0.01 (Richardson et al. 2006).

All of the models are constructed and run using WinBUGS software (Lunn et al. 2000). For each model, two independent chains are run. The first 50,000 iterations are discarded as burn-in so that inferences are based on converged simulations of the models. Then, a further 10,000 iterations are run for each chain and every 10th is kept for inference. Therefore, the modeling results are based on 2,000 thinned samples. Brooks-Gelman-Rubin diagnostics (Brooks and Gelman 1998) and visual checks are used to assess convergence.
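The sampling arithmetic and the idea behind the convergence check can be sketched as follows. This is not the WinBUGS implementation: the function computes only the basic Gelman-Rubin potential scale reduction factor for a single parameter, and the draws are simulated stand-ins for two converged chains.

```python
import random
from statistics import fmean, variance


def potential_scale_reduction(chains):
    """Basic Gelman-Rubin potential scale reduction factor for one parameter.

    chains -- list of equal-length lists of posterior draws, one per chain.
    """
    n = len(chains[0])
    within = fmean(variance(chain) for chain in chains)           # W
    between = n * variance([fmean(chain) for chain in chains])    # B
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


# Sample-size arithmetic for the run described above.
n_chains, kept_iterations, thin = 2, 10_000, 10
retained = n_chains * kept_iterations // thin   # total thinned draws

# Hypothetical draws standing in for two chains of one parameter.
rng = random.Random(0)
chains = [[rng.gauss(0.0, 1.0) for _ in range(kept_iterations // thin)]
          for _ in range(n_chains)]
r_hat = potential_scale_reduction(chains)       # close to 1 when chains agree
```

Values of the factor close to 1 suggest that the chains are sampling from the same distribution; chains that have settled in different regions give values well above 1.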

As in the joint mapping of male and female lung cancer risks by Richardson et al. (2006), the scaling parameters δ2,k are difficult to bring to convergence when fitting the models. This could be because four time periods do not provide enough information to differentiate the shared and specific temporal patterns. We therefore fix δ2,k = 1 for all joint models.

We use the deviance information criterion (DIC) to compare the seven models and choose the best one to interpret. The DIC was proposed by Spiegelhalter et al. (2002) as the sum of D and pD, where D is the posterior mean of the deviance, which measures the goodness-of-fit of a model, and pD is the effective number of model parameters, which measures model complexity. The model with a smaller DIC is preferred.

3.4 Results

From Table 3.4, we can see that joint model 6 has the smallest DIC value, 64155.6, among the seven models. Model 7 fits the data best (smallest D), and model 4 is the simplest (smallest pD). All of the joint models except model 3 are better than the separate models based on their DICs. In the following, we interpret the results of model 6. In model 6, both the shared and the specific components include the spatial and temporal random main effects, and the specific spatio-temporal interaction random effect is also considered.

Table 3.4. DICs of the seven models

Model type   Model #    D          pD         DIC
Separate     Model 1    63349.2     962.636   64311.8
Separate     Model 2    63029.5    1264.91    64294.4
Joint        Model 3    62996.6    1383.51    64380.1
Joint        Model 4    63328.4     869.157   64197.6
Joint        Model 5    63099.8    1064.9     64164.7
Joint        Model 6    62908.1    1247.48    64155.6
Joint        Model 7    62904.5    1347.36    64251.9
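As a quick check, the DIC column of Table 3.4 can be re-derived from the D and pD columns, and the models can be ranked on each criterion. The sketch below only re-uses the values reported in the table; the variable names are illustrative.

```python
# (posterior mean deviance D, effective number of parameters pD) from Table 3.4
table_3_4 = {
    1: (63349.2, 962.636),
    2: (63029.5, 1264.91),
    3: (62996.6, 1383.51),
    4: (63328.4, 869.157),
    5: (63099.8, 1064.9),
    6: (62908.1, 1247.48),
    7: (62904.5, 1347.36),
}

# DIC = D + pD for each model.
dic = {model: d + p_d for model, (d, p_d) in table_3_4.items()}

best_dic = min(dic, key=dic.get)                               # preferred model
best_fit = min(table_3_4, key=lambda m: table_3_4[m][0])       # smallest D
simplest = min(table_3_4, key=lambda m: table_3_4[m][1])       # smallest pD
```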

The crude standardized incidence ratio (SIR), the ratio of the number of observed cases to the number of expected cases, is the maximum likelihood estimate of RR in frequentist methods. For comparison, Figure 3.3 shows the spatial patterns of crude SIRs by race and sex in the first period, 2000-2001. Due to the uneven population distribution and possibly missing data, many census tracts in these SIR maps, especially those for black males and black females, have zero observed cases in that time period, which yields SIRs of zero. However, it is implausible that there is no disease risk in these census tracts in reality. In addition, the SIR surfaces are clearly not smooth across the study area, since most of the SIRs fall into either the very high or the very low category.
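The instability of the crude SIR in sparsely populated tracts can be seen in a small sketch. The observed and expected counts below are hypothetical and serve only to show how a tract with no observed cases receives an SIR of zero, while a tract with a few cases and a small expected count receives a very high SIR.

```python
def crude_sir(observed, expected):
    """Crude standardized incidence ratio: observed cases / expected cases."""
    return observed / expected


# Hypothetical (observed, expected) counts for three tracts in one period.
tracts = {
    "tract A": (0, 1.8),    # no observed cases, so the SIR is zero
    "tract B": (3, 1.2),    # few cases and a small expected count
    "tract C": (12, 10.0),  # larger counts give a more stable ratio
}
sirs = {name: crude_sir(obs, exp) for name, (obs, exp) in tracts.items()}
```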

Figure 3.3. Maps of crude standardized incidence ratios (SIRs) by race and sex during 2000-2001

Figures 3.4-3.7 show the maps of posterior median RRs by race and sex in the four time periods. Compared to the crude SIRs in Figure 3.3, the posterior median RRs show much smoother spatial patterns, with no RRs equal to 0. For white males and white females, the high RRs are mainly concentrated in the northwest, southeast, and middle regions of Georgia. For black males, the high RRs are mainly concentrated in the northwest, east, and south of Georgia. The high RRs for black females are mainly concentrated in the northwest of Georgia. Comparing the maps of different time periods, we can see that, for white males and black males, more census tracts with moderate and low RRs emerge and the number of census tracts with high RRs decreases over time, while the opposite holds for white females and black females.

Following Richardson et al.’s (2004) study evaluating the sensitivity and specificity of Bayesian hierarchical disease mapping models, we use a cut-off of 0.8 on the posterior probability that an area has an RR greater than 1 to identify the areas with truly elevated RRs. Figure 3.8 shows maps indicating how many times each census tract has a truly elevated RR during the four time periods based on the rule prob(RR > 1) > 0.8. The frequency associated with each census tract reflects the stability of the elevated RR in that area over the whole study period. From these maps, we can see that the northwest of Georgia and the area near Augusta have stably elevated RRs for all population groups. White males have the largest number of census tracts with stably elevated RRs, and black females have the smallest number. These results could help in establishing etiological hypotheses.
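The decision rule above can be sketched from posterior draws as follows. The draws are hypothetical (a real analysis would use the 2,000 retained samples per tract and period), and the function names are illustrative only.

```python
def exceedance_probability(rr_samples, threshold=1.0):
    """Share of posterior RR draws that lie above the threshold."""
    return sum(rr > threshold for rr in rr_samples) / len(rr_samples)


def elevated_frequency(samples_by_period, cutoff=0.8):
    """Number of periods in which prob(RR > 1) exceeds the cut-off."""
    return sum(exceedance_probability(s) > cutoff for s in samples_by_period)


# Hypothetical posterior draws for one census tract in the four time periods.
samples_by_period = [
    [1.3, 1.1, 1.2, 0.9, 1.4, 1.2, 1.1, 1.5, 1.3, 1.2],
    [1.1, 0.9, 1.2, 0.8, 1.0, 1.3, 0.95, 1.1, 1.2, 0.9],
    [1.4, 1.2, 1.3, 1.1, 1.5, 1.2, 1.6, 1.3, 1.1, 1.2],
    [0.8, 0.9, 1.1, 0.7, 0.95, 1.0, 0.85, 0.9, 1.05, 0.8],
]
frequency = elevated_frequency(samples_by_period)
```

In this example the tract is flagged in the first and third periods only, so it would be mapped with a frequency of 2 out of 4.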

Figure 3.4. Maps of the posterior median RRs for white males in each time period

Figure 3.5. Maps of the posterior median RRs for white females in each time period

Figure 3.6. Maps of the posterior median RRs for black males in each time period

Figure 3.7. Maps of the posterior median RRs for black females in each time period

Figure 3.8. Maps of elevated RR frequency by race and sex during 2000-2007

Figure 3.9 shows clearer spatial patterns of RRs through maps of the posterior median of the shared spatial component and the differential spatial components. Taking white males as the reference, with a scaling parameter of 1 for the shared spatial component, the posterior medians of the scaling parameters for white females, black males, and black females are 0.743, 0.538, and 0.571, respectively. The white female-white male differential and the black male-white male differential are relatively flat (less contrast) across the whole area, which indicates that the pattern of the shared spatial component captures the variations of the spatial effects on RRs well for white males, white females, and black males. The strong contrast in the black female-white male differential reflects an obvious difference in the patterns of spatial effects on RR between white males and black females.

Figure 3.9. Maps of the posterior median of the shared spatial component and differential spatial components

Table 3.5 shows the posterior medians and 95% credible intervals (CIs) of the shared temporal component and the differential temporal components. The shared temporal trend is flat over the first two periods and declines slightly from 2004 onward. This trend captures the temporal trend in the RRs of black males well, but differs from those of white females and black females.

Table 3.5. Posterior median (95% CI) of the shared temporal components and differential temporal components

Time period   Shared temporal components   White female-white male differential   Black male-white male differential   Black female-white male differential
2000-2001     1.04 (1.02, 1.07)            0.93 (0.90, 0.97)                      1.01 (0.98, 1.06)                    0.92 (0.86, 0.98)
2002-2003     1.04 (1.01, 1.06)            0.97 (0.94, 1.00)                      1.00 (0.97, 1.04)                    0.97 (0.92, 1.02)
2004-2005     0.98 (0.96, 1.00)            1.02 (0.99, 1.05)                      1.00 (0.97, 1.04)                    1.03 (0.98, 1.08)
2006-2007     0.95 (0.92, 0.97)            1.09 (1.05, 1.13)                      0.98 (0.94, 1.02)                    1.09 (1.03, 1.16)

To understand the relationships between SES and RR by race and sex, Table 3.6 shows the posterior medians of the RRs by SES quintile. The highest SES group is taken as the reference. The general trend for all population groups is that lower SES is associated with a higher RR. However, the gradients of the SES effects on the RRs in males, especially white males, are larger than those in females. That is, the socioeconomic disparities in lung cancer RR are more pronounced in males in Georgia. We also note that the RRs of SES groups 2 and 3 in black females are not significantly different from that of SES group 1, because their 95% CIs include 1.

Bayesian modeling is sensitive to the choice of priors and hyperpriors. Following Downing et al.’s (2008) work, we perform a sensitivity analysis using an alternative hyperprior distribution, Gamma (1, 1), in place of Gamma (0.5, 0.0005) for the precision parameters in model 2. The Gamma (0.5, 0.0005) distribution gives the variances (the inverses of the precisions) a 99% probability of lying between 0.000151 and 6.25, with a mode at 0.00033. For the Gamma (1, 1) distribution, the 99% probability range of the variances is from 0.217 to 100 and the mode is at 0.5. Table 3.7 shows the correlations between the posterior median RRs from model 2 under the two hyperpriors. The two sets of results show good concordance in general, but the correlations for black individuals are slightly lower than those for white individuals. These differences may be due to the different degrees of sparseness of the counts between races.
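The variance ranges quoted for the two hyperpriors can be checked approximately with the standard library alone. The sketch below assumes the shape-rate parameterization of the Gamma distribution and relies on two identities: a Gamma (0.5, b) precision multiplied by 2b follows a chi-square distribution with one degree of freedom, and a Gamma (1, b) precision is exponential. Under these assumptions, the quoted bounds appear to lie close to the 1st and 99th percentiles of the implied variance distributions; the function names are illustrative.

```python
import math
from statistics import NormalDist


def gamma_half_precision_quantile(prob, rate):
    """Quantile of a Gamma(shape=0.5, rate) precision.

    2 * rate * tau is chi-square with one degree of freedom, i.e. the
    square of a standard normal variable.
    """
    z = NormalDist().inv_cdf((1.0 + prob) / 2.0)
    return z * z / (2.0 * rate)


def gamma_one_precision_quantile(prob, rate):
    """Quantile of a Gamma(shape=1, rate) precision (an exponential)."""
    return -math.log(1.0 - prob) / rate


def inverse_gamma_mode(shape, rate):
    """Mode of the variance when the precision is Gamma(shape, rate)."""
    return rate / (shape + 1.0)


# Variance = 1 / precision, so the percentiles of the precision swap ends.
var_low_a = 1.0 / gamma_half_precision_quantile(0.99, 0.0005)
var_high_a = 1.0 / gamma_half_precision_quantile(0.01, 0.0005)
var_low_b = 1.0 / gamma_one_precision_quantile(0.99, 1.0)
var_high_b = 1.0 / gamma_one_precision_quantile(0.01, 1.0)

mode_a = inverse_gamma_mode(0.5, 0.0005)
mode_b = inverse_gamma_mode(1.0, 1.0)
```

The comparison makes the contrast between the two hyperpriors concrete: Gamma (0.5, 0.0005) concentrates the variances on very small values, whereas Gamma (1, 1) places most of its mass on variances between roughly 0.2 and 100.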

Table 3.6. Posterior median (95% CI) of the RRs for SES quintile

SES group     White males         White females       Black males         Black females
1 (highest)   1                   1                   1                   1
2             1.28 (1.20, 1.36)   1.11 (1.04, 1.18)   1.19 (1.04, 1.36)   1.01 (0.87, 1.19)
3             1.51 (1.41, 1.62)   1.20 (1.12, 1.30)   1.42 (1.24, 1.63)   1.13 (0.96, 1.33)
4             1.58 (1.46, 1.70)   1.16 (1.07, 1.26)   1.51 (1.32, 1.72)   1.23 (1.06, 1.44)
5 (lowest)    1.76 (1.61, 1.92)   1.32 (1.20, 1.44)   1.73 (1.52, 1.98)   1.41 (1.22, 1.65)

Table 3.7. Correlations between the posterior median RRs using model 2 with two different types of hyperpriors

Time period   White males   White females   Black males   Black females
2000-2001     0.998         0.992           0.988         0.990
2002-2003     0.998         0.991           0.988         0.989
2004-2005     0.998         0.991           0.987         0.988
2006-2007     0.998         0.991           0.987         0.988

3.5 Discussions

One of the limitations in this study is the lack of suitable smoking data at the fine spatial scale. It is well known that an individual’s smoking behavior is an important risk factor for lung cancer. To some extent, the random effects in our hierarchical Bayesian spatio-temporal models can approximate the total effects of unmeasured or unknown risk factors including smoking.

However, we believe that integrating suitable smoking data into the models can greatly improve the accuracy of the models.

For diseases with a long latency period, such as cancers, lifetime exposures could be important. In this study, we measure area-based SES with Census 2000 data and assume that it reflects individual SES during the long latency period. This assumption could introduce biases into the model inferences. In addition, the analysis of the relationship between disease RR and SES is subject to the modifiable areal unit problem (Openshaw and Taylor 1981): inferences based on analyses at the current scale and/or unit definition may not generalize to other scales and/or unit definitions.

Estimation of population in small areas is an active research topic in geography and statistics. In our study, we use an apportionment method to estimate the population by race, sex, and age in each census tract in each intercensal year. Improved population estimation models could greatly benefit disease mapping models.

3.6 Conclusions

Because there are a limited number of lung cancer studies in Georgia, especially at a fine spatio-temporal scale, we use hierarchical Bayesian models to explore the spatio-temporal patterns of lung cancer incidence risks in Georgia for the period 2000-2007. The study is conducted at the census tract level using a two-year time period as the temporal unit. The fine spatial and temporal scales enable the study to show more detailed variations of lung cancer incidence risks in space and time, which can better support healthcare performance assessment, the establishment of etiological hypotheses, and effective and efficient health policy making. Compared to the crude SIR, the Bayesian spatio-temporal model provides a more reliable estimate of disease risk at a fine spatio-temporal scale. The study also shows that there are strong inverse relationships between SES and lung cancer incidence risk in males and weak inverse relationships in females in Georgia. This could lead to further studies on the underlying reasons, such as occupational risk factors.

A total of seven Bayesian spatio-temporal models under the separate and joint modeling frameworks are proposed and compared. In this study, the joint models generally perform better than the separate models using DIC as the criterion. Our study currently focuses on mapping the patterns of disease risks. However, the spatial and temporal random effects in these disease mapping models may provide clues about potential disease risk factors for further ecological studies.

References

Abellan, J.J., Richardson, S. & Best, N., 2008. Use of space-time models to investigate the stability of patterns of disease. Environmental Health Perspectives, 116 (8), 1111.

Bernardinelli, L., Clayton, D., Pascutto, C., Montomoli, C., Ghislandi, M. & Songini, M., 1995. Bayesian analysis of space-time variation in disease risk. Statistics in Medicine, 14 (21-22), 2433-2443.

Besag, J., York, J. & Mollié, A., 1991. Bayesian image restoration, with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics, 43 (1), 1-20.

Best, N. & Wakefield, J., 1999. Accounting for inaccuracies in population counts and case registration in cancer mapping studies. Journal of the Royal Statistical Society. Series A (Statistics in Society), 162 (3), 363-382.

Best, N., Richardson, S. & Thomson, A., 2005. A comparison of Bayesian spatial models for disease mapping. Statistical Methods in Medical Research, 14 (1), 35.

Brooks, S.P. & Gelman, A., 1998. Alternative methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7, 434-455.

Carstairs, V., 2001. Socio-economic factors at areal level and their relationship with health. Spatial Epidemiology, 1 (9), 51-68.

Darden, J., Rahbar, M., Jezierski, L., Li, M. & Velie, E., 2009. The measurement of neighborhood socioeconomic characteristics and black and white residential segregation in metropolitan Detroit: Implications for the study of social disparities in health. Annals of the Association of American Geographers, 100 (1), 137-158.

Downing, A., Forman, D., Gilthorpe, M., Edwards, K. & Manda, S., 2008. Joint disease mapping using six cancers in the Yorkshire region of England. International Journal of Health Geographics, 7 (1), 41.

Fortunato, L., Abellan, J.J., Beale, L., Lefevre, S. & Richardson, S., 2011. Spatio-temporal patterns of bladder cancer incidence in Utah (1973-2004) and their association with the presence of toxic release inventory sites. International Journal of Health Geographics, 10 (1), 16.

Georgia Department of Public Health, 2008. Cancer program and data summary. Atlanta, GA.

Georgia Department of Public Health, 2011. Georgia comprehensive cancer registry [online]. http://www.health.state.ga.us/programs/gccr/ [Accessed 2011].

Goodchild, M.F. & Lam, N.S., 1980. Areal interpolation: A variant of the traditional spatial problem. Geo-Processing, 1, 297-312.

Held, L., Natário, I., Fenton, S.E., Rue, H. & Becker, N., 2005. Towards joint disease mapping. Statistical Methods in Medical Research, 14 (1), 61-82.

Kelsall, J. & Wakefield, J., 1999. Discussion of 'Bayesian models for spatially correlated disease and exposure data', by Best et al. In Bernardo, J., Berger, J., Dawid, A. & Smith, A. eds. Bayesian statistics 6. Oxford, UK: Oxford University Press, 151.

Knorr-Held, L., 2000. Bayesian modelling of inseparable space-time variation in disease risk. Statistics in Medicine, 19 (17-18), 2555-2567.

Knorr-Held, L. & Best, N.G., 2001. A shared component model for detecting joint and selective clustering of two diseases. Journal of the Royal Statistical Society: Series A (Statistics in Society), 164 (1), 73-85.

Krieger, N., 1992. Overcoming the absence of socioeconomic data in medical records: Validation and application of a census-based methodology. American Journal of Public Health, 82 (5), 703.

Krieger, N., Chen, J.T., Waterman, P.D., Soobader, M.J., Subramanian, S. & Carson, R., 2002. Geocoding and monitoring of US socioeconomic inequalities in mortality and cancer incidence: Does the choice of area-based measure and geographic level matter? American Journal of Epidemiology, 156 (5), 471.

Krieger, N., Williams, D.R. & Moss, N.E., 1997. Measuring social class in US public health research: Concepts, methodologies, and guidelines. Annual Review of Public Health, 18 (1), 341-378.

Lawson, A.B., 2009. Bayesian disease mapping: Hierarchical modeling in spatial epidemiology: Chapman & Hall/CRC.

Lunn, D.J., Thomas, A., Best, N. & Spiegelhalter, D., 2000. WinBUGS - a Bayesian modelling framework: Concepts, structure, and extensibility. Statistics and Computing, 10 (4), 325-337.

Mollié, A., 2001. Bayesian mapping of Hodgkin's disease in France. Spatial Epidemiology, 1 (9), 267-286.

Openshaw, S. & Taylor, P.J., 1981. The modifiable areal unit problem. In Wrigley, N. & Bennett, R. eds. Quantitative geography: A British view. London and Boston: Routledge and Kegan Paul, 60-69.

Population Estimates Program, 2011. County intercensal estimates (2000-2010) [online]. http://www.census.gov/popest/data/intercensal/county/county2010.html [Accessed 2012].

Richardson, S., Abellan, J. & Best, N., 2006. Bayesian spatio-temporal analysis of joint patterns of male and female lung cancer risks in Yorkshire (UK). Statistical Methods in Medical Research, 15 (4), 385.

Richardson, S., Thomson, A., Best, N. & Elliott, P., 2004. Interpreting posterior relative risk estimates in disease-mapping studies. Environmental Health Perspectives, 112 (9), 1016.

Spiegelhalter, D.J., Best, N.G., Carlin, B.P. & Van Der Linde, A., 2002. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64 (4), 583-639.

Wakefield, J., Best, N. & Waller, L., 2001. Bayesian approaches to disease mapping. Spatial Epidemiology, 1 (9), 104-128.

Waller, L., Carlin, B., Xia, H. & Gelfand, A., 1997. Hierarchical spatio-temporal mapping of disease rates. Journal of the American Statistical Association, 607-617.

Xia, H. & Carlin, B., 1998. Spatio-temporal models with errors in covariates: Mapping Ohio lung cancer mortality. Statistics in Medicine, 17 (18), 2025-2043.

CHAPTER 4

MODULAR CAPACITATED MAXIMAL COVERING LOCATION PROBLEM FOR THE

OPTIMAL SITING OF EMERGENCY VEHICLES3

3 Yin, P. and Mu, L. 2012. Applied Geography 34: 247-254. Reprinted here with permission of the publisher.

Abstract

To improve the application of the maximal covering location problem (MCLP), several

capacitated MCLP models were proposed to consider the capacity limits of facilities. However,

most of these models assume only one fixed capacity level for the facility at each potential site.

This assumption may limit the application of the capacitated MCLP. In this article, a modular

capacitated maximal covering location problem (MCMCLP) is proposed and formulated to allow

several possible capacity levels for the facility at each potential site. To optimally site emergency

vehicles, this new model also considers allocations of the demands beyond the service covering

standard. Two situations of the model are discussed: the MCMCLP-facility-constraint (FC), which fixes the total number of facilities to be located, and the MCMCLP-non-facility-constraint

(NFC), which does not. In addition to the model formulations, one important aspect of location modeling—spatial demand representation—is included in the analysis and discussion. As an example, the MCMCLP is applied with Geographic Information System (GIS) and optimization software packages to optimally site ambulances for the Emergency Medical Services (EMS)

Region 10 in the State of Georgia. The limitations of the model are also discussed.

Keywords: Modular capacitated MCLP, Spatial demand representation, GIS, Emergency vehicle

4.1 Introduction

Given a covering standard for a service, such as a distance or travel-time maximum, the objective of the maximal covering location problem (MCLP) is to locate a fixed number of facilities to provide the service to cover as many demands as possible. MCLP modeling, after being put forward by Church and ReVelle (1974), has been a powerful and widely used tool in many planning processes to optimally distribute limited resources to maximize social and economic benefits, such as the placement of emergency warning sirens (Current and O'Kelly

1992), fire stations (Indriasari et al. 2010), distribution centers for humanitarian relief (Balcik and Beamon 2008), health centers (Bennett et al. 1982, Verter and Lapierre 2002, Griffin et al.

2008, Ratick et al. 2009), and ecological reserves (Church et al. 1996). Among many different versions of MCLP models that have been proposed, a basic underlying assumption is that the facilities to be sited are uncapacitated. Under this assumption, the demand will be served as long as it is within the service covering standard of any facility. However, this assumption of uncapacitated facilities severely limits the application of covering models (Current and Storbeck

1988). Many service facilities have finite capacities to ensure an acceptable level of service and spatial equity (Murray and Gerrard 1997, Liao and Guo 2008). For example, an ambulance base can only respond to a limited number of demands within its service covering standard (e.g., 8-min driving distance) at one time because of the availability status of the ambulances stationed at the base. Therefore, the capacity limit, which is the main constraint addressed in this article, is an important consideration in location problems, especially for the siting of emergency facilities.

Chung et al. (1983) and Current and Storbeck (1988) published two early papers dealing with the capacitated versions of the MCLP. Both groups of authors added maximum capacity constraints into the mathematical formulations of the MCLP to ensure that the demands allocated

to a facility will not exceed the capacity of that facility. However, these two capacitated MCLP models only consider the allocation of the demands within the service covering standard of facilities. Many systems, particularly public services, are typically available to all demands within their jurisdiction. For example, even if a demand is located in an area where no ambulances can reach the demand within a time standard, the demand must still be responded to and be counted as part of some facility’s workload. Therefore, Pirkul and Schilling (1991) proposed an extension of the capacitated MCLP where all demands are assigned to facilities, regardless of whether that demand lies within the service covering standard. Such an idea of allocating all demands to facilities is also shown in some uncapacitated MCLP models, such as the generalized maximal covering location problem of Berman and Krass (2002). Following the work of Pirkul and Schilling (1991), Haghani (1996) proposed a multi-objective capacitated

MCLP model where the objective function maximizes the weighted covered demand while simultaneously minimizing the average distance from the uncovered demands to the located facilities. He showed how to ensure the maximization of the weighted covered demand to be the primary objective in the model by adjusting its weight in the objective function.

In all of the above capacitated MCLP models, only one fixed capacity level of the facility is considered for each potential facility site. However, many situations arise where each potential facility site could have several possible maximum capacity levels for a facility to choose. For example, the capacity limit of an emergency facility (e.g., ambulance base or fire station) can be assumed to be determined by its stationed emergency vehicles (e.g., ambulances or fire trucks).

Therefore, varied numbers of emergency vehicles will provide a series of possible maximum capacity levels for the emergency facility to choose. Correia and Captivo (2003) called the location problems with such capacity constraints modular capacitated location problems.

However, their model is an extension of the capacitated plant location problem, the objective of which is to minimize total costs, including the fixed and operating costs of plants and transportation costs, among others. For emergency services, the objective is often stated as the minimization of losses to the public, which is equivalent to the maximization of benefits (Indriasari et al. 2010). Cost is usually not the first consideration in these services. Therefore, the capacitated MCLP is more suitable than the capacitated plant location problem for emergency services. Although Griffin et al. (2008) considered three capability levels for each type of health care facility in their capacitated MCLP model, the capacity levels of their facilities are not composed of smaller units in the way that the capacity of an emergency facility is composed of the capacities of its emergency vehicles. In addition, their model did not consider the allocation of demands outside the service covering standard.

To apply the capacitated MCLP model to the emergency facility siting problem in which

an emergency facility could have different possible capacity levels with varied numbers of

stationed emergency vehicles, we propose an extension of the MCLP called the modular

capacitated maximal covering location problem (MCMCLP). Similar to the multi-objective

function in the model of Haghani (1996), the MCMCLP aims to maximize the weighted covered demand while simultaneously minimizing the average distance from the uncovered demands to the located facilities.

The remainder of this article is organized as follows: In the next section, the concepts, formulations, and related issues of the MCMCLP are introduced and discussed in terms of two situations. The first situation involves a fixed total number of facilities to be located; in the

second situation, the total number of facilities is not fixed. Subsequently, we briefly review the

approaches for spatial demand representation that could influence the accuracy of the problem

solutions. The method called service area spatial demand representation (SASDR) is briefly described. Next, the MCMCLP and the SASDR are applied to the optimal siting of ambulances for the Emergency Medical Services (EMS) Region 10 in the State of Georgia (GA). Finally, a discussion and conclusions are provided.

4.2 Modular Capacitated Maximal Covering Location Problem (MCMCLP)

Because of the capacity limit of a facility, the allocation problem (i.e., how to allocate demands to facilities) sometimes must be solved in conjunction with the location problem (i.e., where to site facilities) (Haghani 1996). Under the assumption that one demand can only be allocated to, at most, one facility, we define three demand types and use them in the following part of this article: 1) unallocated demand, which is not allocated to any facility (e.g., the demands da and db in Figure 4.1); 2) covered allocated demand, which is located within the service covering standard of a facility and is allocated to that facility (e.g., the demand dc in

Figure 4.1); 3) uncovered allocated demand, which is located beyond the service covering standard of a facility but is allocated to that facility (e.g., the demand dd in Figure 4.1).

Figure 4.1. Illustration of three demand types around facility f and its service covering standard: unallocated demand (da and db), covered allocated demand (dc), and uncovered allocated demand (dd)
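The three demand types can be expressed as a simple classification rule. The sketch below is only an illustration: the distances, the 8-unit covering standard, and the allocation flags are hypothetical values chosen to mirror the four demands in Figure 4.1.

```python
def demand_type(distance_to_facility, standard, allocated):
    """Classify one demand relative to a single facility.

    distance_to_facility -- travel distance or time from the facility
    standard             -- service covering standard S of the facility
    allocated            -- True if the demand is allocated to the facility
    """
    if not allocated:
        return "unallocated"
    if distance_to_facility <= standard:
        return "covered allocated"
    return "uncovered allocated"


# Hypothetical (distance, allocated) pairs for the demands in Figure 4.1.
S = 8.0
demands = {
    "da": (12.0, False),
    "db": (5.0, False),
    "dc": (6.0, True),
    "dd": (11.0, True),
}
types = {name: demand_type(dist, S, alloc)
         for name, (dist, alloc) in demands.items()}
```

Note that an unallocated demand may lie inside or outside the covering standard; only the allocation flag determines that type.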

Following the work of Pirkul and Schilling (1991) and Haghani (1996), and in light of a

different perspective of the capacitated plant location problem of Correia and Captivo (2003), we

present an extension to the capacitated MCLP called MCMCLP and utilize it for siting

emergency services. In addition to the basic concept of the MCLP that the covered allocated

demands should be maximized by optimally siting a fixed number of facilities, the MCMCLP

also includes the following considerations: 1) the facility at each potential site has a maximum

capacity, which will be chosen from a finite and discrete set of available capacity levels; 2) all

demands need to be allocated to facilities (i.e., no unallocated demands exist), and the uncovered

allocated demands could be assigned on the basis of their proximity to facilities; 3) the demands

within a demand object, which is a spatial point or areal unit derived by abstracting or

partitioning continuous demand space, may be divided and allocated to multiple facilities.

An area with a larger population usually has a higher frequency of calls for emergency service than an area with a smaller population. In addition, one emergency vehicle can only respond to one call at a time and becomes available again only after that task is finished. Therefore, the larger the population an ambulance serves, the more likely it is to be busy, the longer the average response time for a call, and the poorer the service it provides. To ensure an acceptable average response time for a call, each emergency vehicle can be thought of as having a maximum population that it can serve. In this article, we take population as demand, and the upper limit of the population served by an emergency vehicle is defined as the capacity of that vehicle. In practice, the calculation of an emergency vehicle’s capacity needs to consider multiple factors, including the required average response time, the average frequency of calls in the population that it will serve, and the average treatment time for a task, among others. For simplicity, in this article, all emergency vehicles are assumed to have the same capacity, and the capacity of a facility is the total capacity of all vehicles stationed in that facility. For example, if at most p vehicles can be stationed in a facility, there are p possible levels of capacity from which to choose. A facility will not be established in a location unless at least one emergency vehicle needs to be stationed there.
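The modular capacity levels described above can be listed directly. In the sketch below, the per-vehicle capacity of 25,000 people and the limit of three vehicles are hypothetical values used only for illustration.

```python
def capacity_levels(max_vehicles, vehicle_capacity):
    """Possible facility capacities when 1..max_vehicles vehicles are stationed."""
    return [k * vehicle_capacity for k in range(1, max_vehicles + 1)]


# Hypothetical example: at most 3 vehicles, each able to serve 25,000 people.
levels = capacity_levels(max_vehicles=3, vehicle_capacity=25_000)
```

Each entry corresponds to one choice of the number of vehicles at a site; stationing no vehicle means that no facility is established there.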

There are two situations for the MCMCLP. If there is no constraint on the total number of emergency facilities that will be established to station vehicles, we call such a non-facility-constraint problem the MCMCLP-NFC. This situation mainly concerns how to allocate a given number of vehicles to a set of predefined potential facility sites. If the total number of facilities is fixed, such a facility-constraint problem is termed the MCMCLP-FC. This situation requires selecting the sites for a given number of facilities and then allocating a given number of vehicles to these facilities. Consider the following notation:

$I$ = the set of demand objects $\{1, \ldots, i, \ldots, m\}$;

$J$ = the set of potential facility sites $\{1, \ldots, j, \ldots, n\}$;

$S$ = the service covering standard of a facility (i.e., maximum distance or time);

$d_{ij}$ = the travel distance or time from potential facility site $j$ to demand object $i$;

$J_i$ = the set of potential facility sites $j$ within the service covering standard of which demand object $i$ lies, i.e., $\{ j \mid d_{ij} \le S \}$;

$a_i$ = the amount of service demands at demand object $i$;

$p$ = the total number of emergency vehicles to be located;

$c$ = the capacity of one emergency vehicle (assuming all vehicles have the same capacity);

$w$ = the weight associated with all the uncovered allocated demands;

$x_j$ = the number of emergency vehicles stationed at potential facility site $j$; a facility is located on site $j$ when $x_j > 0$;

$y_{ij}$ = the proportion (between 0 and 1) of demands at demand object $i$ that is allocated to the facility on site $j$.

The formulation of the MCMCLP-NFC is

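Maximize $\sum_{i \in I} \sum_{j \in J_i} a_i y_{ij} - w \sum_{i \in I} \sum_{j \notin J_i} a_i d_{ij} y_{ij}$ (Equation 4.1)

Subject to:

$\sum_{i \in I} a_i y_{ij} \le c x_j \quad \forall j \in J$ (Equation 4.2)

$\sum_{j \in J} x_j = p$ (Equation 4.3)

$\sum_{j \in J} y_{ij} = 1 \quad \forall i \in I$ (Equation 4.4)

$x_j = 0, 1, 2, \ldots, p \quad \forall j \in J$ (Equation 4.5)

$0 \le y_{ij} \le 1 \quad \forall i \in I, \forall j \in J$ (Equation 4.6)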

Among Equations 4.1 to 4.6, Equation 4.1 is a multiobjective function that seeks to maximize the amount of the covered allocated demands ($\sum_{i \in I} \sum_{j \in J_i} a_i y_{ij}$) while simultaneously minimizing the total distance between the uncovered allocated demands and the sites to which they are assigned ($\sum_{i \in I} \sum_{j \notin J_i} a_i d_{ij} y_{ij}$). In this function, the weight $w \ge 0$ can be varied to adjust the preference given to each objective. Constraints 4.2 ensure that the demands allocated to any facility cannot exceed the maximum capacity of that facility (i.e., the total capacity of the emergency vehicles stationed there). If no facility (i.e., no vehicle) is located on a site, no demand will be allocated to that site. Constraint 4.3 specifies the total number of emergency vehicles to be located. Constraints 4.4 ensure that all demands at each demand object will be allocated to a facility. Constraints 4.5 indicate that the decision variable $x_j$ is a non-negative integer. Constraints 4.6 restrict the continuous decision variable $y_{ij}$ to the range from 0 to 1.
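As a hedged illustration of Equations 4.1 to 4.6, the following Python sketch evaluates the objective and checks the constraints for a candidate solution on a toy instance. The data, the function names (`objective`, `is_feasible`), and the numerical tolerance are assumptions made for illustration only; the study itself solved the models with CPLEX.

```python
def objective(a, d, S, w, y):
    """Equation 4.1: covered allocated demand minus w times the
    demand-weighted distance of the uncovered allocated demand."""
    covered = 0.0
    uncovered_distance = 0.0
    for i in range(len(a)):
        for j in range(len(d[i])):
            if d[i][j] <= S:      # site j belongs to J_i
                covered += a[i] * y[i][j]
            else:                 # site j lies outside J_i
                uncovered_distance += a[i] * d[i][j] * y[i][j]
    return covered - w * uncovered_distance


def is_feasible(a, c, p, x, y, tol=1e-9):
    """Check Constraints 4.2 to 4.6 for vehicle counts x and allocations y."""
    m, n = len(a), len(x)
    # 4.5: each x_j is an integer between 0 and p
    if any(not isinstance(xj, int) or xj < 0 or xj > p for xj in x):
        return False
    # 4.3: exactly p vehicles are located
    if sum(x) != p:
        return False
    for i in range(m):
        # 4.6: allocation proportions lie in [0, 1]
        if any(v < -tol or v > 1 + tol for v in y[i]):
            return False
        # 4.4: all demand at object i is allocated
        if abs(sum(y[i]) - 1.0) > tol:
            return False
    for j in range(n):
        # 4.2: demand allocated to site j does not exceed its capacity
        if sum(a[i] * y[i][j] for i in range(m)) > c * x[j] + tol:
            return False
    return True


# Toy data (hypothetical): 3 demand objects, 2 potential sites.
a = [100, 60, 40]
d = [[2, 9], [3, 4], [8, 1]]
S, c, p, w = 5, 120, 2, 0.001

x_split, y_split = [1, 1], [[1, 0], [0, 1], [0, 1]]
x_single, y_single = [2, 0], [[1, 0], [1, 0], [1, 0]]
```

In this toy instance, spreading the two vehicles over both sites covers all 200 demands, whereas stationing both at the first site leaves the third demand object uncovered and incurs the distance penalty.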

We use min{p, n} to denote the smaller value between the total number of emergency vehicles, p, and the total number of potential facility sites, n. In the MCMCLP-NFC, emergency vehicles could be stationed in facilities located on as many as min{p, n} sites, whereas the MCMCLP-FC fixes the total number of facilities to be sited. To present the formulation of the MCMCLP-FC, we need to introduce additional notation:

q = the total number of facilities to be sited;

$K$ = the set of possible facility sizes (i.e., the number of vehicles) on each potential facility site {1, ..., k, ..., p};

$x_{jk}$ = 1 if a facility with k vehicles is located on potential facility site j, and 0 otherwise.

The MCMCLP-FC has the same objective function Equation 4.1 and constraints 4.4 and 4.6 as

the MCMCLP-NFC formulation. The other constraints include:

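$\sum_{k \in K} x_{jk} \le 1 \quad \forall j \in J$ (Equation 4.7)

$\sum_{i \in I} a_i y_{ij} \le \sum_{k \in K} k c x_{jk} \quad \forall j \in J$ (Equation 4.8)

$\sum_{j \in J} \sum_{k \in K} k x_{jk} = p$ (Equation 4.9)

$\sum_{j \in J} \sum_{k \in K} x_{jk} = q$ (Equation 4.10)

$x_{jk} \in \{0, 1\} \quad \forall j \in J, \forall k \in K$ (Equation 4.11)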

Constraints 4.7 ensure that no more than one facility can be located on each potential facility site.

Constraints 4.8 ensure that all the demands allocated to a facility cannot exceed the maximum capacity of that facility. Constraint 4.9 specifies the total number of emergency vehicles to be stationed. Constraint 4.10 specifies the total number of facilities to be sited. Constraints 4.11 impose integrality restriction on the decision variable xjk.
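The siting constraints of the MCMCLP-FC can be checked mechanically. The sketch below is an illustration under stated assumptions: the dictionary encoding of the binary variables, the function names, and the example values are hypothetical and are not taken from the study. It tests Constraints 4.7 and 4.9 to 4.11 and computes the right-hand side of Constraint 4.8 for each site.

```python
def check_fc_siting(x, sites, sizes, p, q):
    """x maps (j, k) to 0 or 1; missing keys are treated as 0."""
    values = {(j, k): x.get((j, k), 0) for j in sites for k in sizes}
    # 4.11: binary variables only
    if any(v not in (0, 1) for v in values.values()):
        return False
    # 4.7: at most one facility size per site
    if any(sum(values[(j, k)] for k in sizes) > 1 for j in sites):
        return False
    # 4.9: exactly p vehicles are stationed in total
    if sum(k * values[(j, k)] for j in sites for k in sizes) != p:
        return False
    # 4.10: exactly q facilities are sited
    if sum(values.values()) != q:
        return False
    return True


def site_capacities(x, sites, sizes, c):
    """Right-hand side of Constraint 4.8 for each site j."""
    return {j: sum(k * c * x.get((j, k), 0) for k in sizes) for j in sites}


# Hypothetical example: 3 sites, 3 vehicles, 2 facilities.
sites, sizes = [1, 2, 3], [1, 2, 3]
x = {(1, 2): 1, (3, 1): 1}   # two vehicles at site 1, one at site 3
ok = check_fc_siting(x, sites, sizes, p=3, q=2)
capacity = site_capacities(x, sites, sizes, c=8000)
```

A site with no facility receives a capacity of zero, so Constraint 4.8 prevents any demand from being allocated to it.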

In objective function Equation 4.1 for both MCMCLP models, the weight w associated with uncovered allocated demands can be varied to trade off the two objectives: the maximization of covered allocated demands and the minimization of the total distance of uncovered allocated demands to facilities. When w = 0, the model considers only the former objective, and the service level for the uncovered allocated demands is not assured because they may be allocated to a farther facility instead of a nearer one. As w increases, the service level for the uncovered allocated demands improves because more preference is given to the latter objective, although the covered allocated demands may not be as large as when w = 0. In general, maximization of the covered allocated demands would be the primary objective in emergency service planning, which means that, for a model with an appropriate weight w, the optimal solution covers at least as much allocated demand as any other feasible solution (Haghani 1996). Following a proof similar to that given by Haghani (1996), we can show that, to ensure that maximization of the covered allocated demands is the primary objective, the weight w must meet the following condition when demands are assumed to be integers:

$0 \le w \le \dfrac{1}{A (d_{max} - d_{min})}$ (Equation 4.12)

where A is the total demand $\sum_{i \in I} a_i$, and $d_{max}$ and $d_{min}$ are the maximum and minimum distances, respectively, between any pair of demand object i and potential facility site j.

4.3 Spatial Demand Representation

When residents are taken as demands, aggregated census data may be the most readily

obtainable spatial information on demand. When information on individual activity or tracking data is

not available, a practical consideration is to assume that the demands are distributed continuously

within the census units. For such continuous area demands, some spatial demand representation

has to be adopted so that the MCLP model can be applied. The widely used point-based

abstractions may be prone to measurement and coverage errors (Murray and O'Kelly 2002, Tong

and Murray 2009). The areal representations with census units or grids of regular polygons often

complicate the model because of the explicit processing of partial coverage caused by the

mismatch between the boundaries of service covering areas and the demand areal units. To

maintain both the simplicity and the high degree of accuracy of the maximal coverage model, the

SASDR, which was proposed by Yin and Mu (2011), is used in this article to represent demand

space.

The SASDR is a polygon-overlay-based representation for continuous spatial demand.

In this representation, the demand objects are created by using the service areas of all potential

facility sites to partition the whole demand space. Figure 4.2(a) shows an example where a

square demand space U will be partitioned into the SASDR by two potential facilities f1 and f2

with circular service areas S1 and S2. Figure 4.2(b) shows the four resulting demand objects in the

final SASDR, which are $U - (S_1 \cup S_2)$, $(U \cap S_1) - S_2$, $(U \cap S_2) - S_1$, and $U \cap S_1 \cap S_2$. The

biggest advantage of the SASDR is that all the demand objects lie either within or beyond the

service covering standard of any potential facility site, which can avoid partial coverage in the

model. With the basic functions in GIS software packages, such as buffer, overlay and network

analysis, the SASDR can be easily realized.
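The partitioning idea can be sketched without GIS software. The following Python example is a raster approximation and not the polygon overlay used in this study: it assumes circular service areas, samples a square demand space on a regular grid, and groups the grid cells by the set of facilities that cover them. The coordinates, radius, and function name are hypothetical.

```python
from collections import defaultdict
from math import hypot


def sasdr_by_grid(facilities, radius, size, step):
    """Group grid-cell centres of a size-by-size square by the set of
    facilities whose circular service area contains the centre."""
    objects = defaultdict(list)
    n = int(round(size / step))
    for row in range(n):
        for col in range(n):
            x = (col + 0.5) * step
            y = (row + 0.5) * step
            signature = frozenset(
                name for name, (fx, fy) in facilities.items()
                if hypot(x - fx, y - fy) <= radius
            )
            objects[signature].append((x, y))
    return dict(objects)


# Two hypothetical facilities with overlapping circular service areas.
facilities = {"f1": (4.0, 5.0), "f2": (6.0, 5.0)}
demand_objects = sasdr_by_grid(facilities, radius=2.5, size=10.0, step=0.5)
```

Each key of `demand_objects` is a coverage signature, so every cell in a group lies either inside or outside each service area. With two overlapping circles the four signatures correspond to the four demand objects of Figure 4.2(b).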

Figure 4.2. Example of the SASDR with circular facility service areas: (a) demand space U (the square) and two potential service areas S1 and S2 (the circles); (b) the four demand objects in the SASDR that result from partitioning demand space U by service areas S1 and S2

4.4 Applications: Optimal Siting of Ambulances

Because of its important social and economic objectives, the ambulance location problem

has been widely studied over the past 40 years (Eaton et al. 1985, Adenso-Díaz and Rodríguez

1997, Brotcorne et al. 2003, Daskin and Dean 2005, Henderson and Mason 2005). Because ambulances are usually stationed in fire departments or parking lots with little additional

construction or administrative costs, it is unnecessary to limit the total number of facilities to be

sited. Given this practical consideration, the MCMCLP-NFC model may be more appropriate

than the MCMCLP-FC model. However, to better compare the performances of these two

models, we here apply both MCMCLP-NFC and MCMCLP-FC to the optimal siting of ambulances for EMS Region 10 in GA.

4.4.1 Study Area and Data

EMS Region 10, one of the 10 EMS regions in GA, lies in the northeastern section of the state and is composed of 10 counties (Figure 4.3). The region serves 405,231 people (2000 census data) in a total area of 3,006 square miles with 13 licensed ambulance services and 58 vehicles (OEMS 2006). The population in 2010 was 460,189, and the quartile map of the population density (persons/km²) by census block group is shown in Figure 4.3. The population data,

boundary maps of census units, and street map are all taken from US 2010 census data because

we need to reflect well the variation in demand across the study area with the population data at a

relatively low spatial aggregation level, such as at the block group or block level, which are only

available in census years. The Georgia EMS stations data from 2005 to 2007 are the only EMS

data that we can obtain thus far; these data come from the Homeland Security Infrastructure

Program (HSIP) and were downloaded from the website of the Georgia Department of

Community Affairs (DCA 2011). These data consist of the information of the locations where

the EMS personnel are stationed or based, or where the equipment that such personnel use in

performing their jobs is stored for ready use. According to these data, a total of 82 EMS stations

provide ambulance service in our study area (Figure 4.3). Among these stations, only two

(Madison County Emergency Medical Services Station 4 and Greene County Emergency

Medical Service) are not stationed in the fire departments. The count of EMS stations (82) is

larger than the count of ambulances (58). This difference may be due to the inconsistency in the time periods for which the data were collected. In addition, it is common for ambulances to be periodically relocated among facilities to ensure good coverage at all times, which is an important difference between the operations of emergency medical services and those of other emergency services, such as fire departments or police departments (Brotcorne et al. 2003).

Therefore, some EMS stations may not house vehicles at all times. Although the population data and EMS data come from different time periods, the interval between them is short; the time inconsistency is therefore ignored in this application until better-quality data become available. These input data are not the critical part of the models and should not significantly influence the illustration and validation of our models and their applications.

Figure 4.3. Population density of Georgia EMS Region 10 (study area) by census block group and existing ambulance facility locations

4.4.2 Tasks

To test the application of the MCMCLP for emergency services, a total of 58 ambulances

will be allocated to maximize the covered allocated demands within 8-min driving distance from

the facilities. The locations of 82 existing EMS stations are regarded as the potential facility sites.

The demands are represented by the census population in 2010 by census block group. To ensure

the existence of a feasible solution to the problem, we define the capacity of each ambulance as

8000 persons so that 58 ambulances have total capacity of 464,000, which exceeds the total

demand of 460,189. We assume that the capacity of 8000 persons per ambulance can meet the

requirement of the average response time to the calls for service in this region. In the MCMCLP-

NFC model, the 58 vehicles could be allocated to, at most, 58 facility sites. In the MCMCLP-FC

model, only 20 potential facility sites will be chosen, and the 58 vehicles will be allocated to

these 20 sites. ArcGIS™ v9.3.1 is used to realize the SASDR. Programming with Visual Basic

for Applications (VBA) for ArcObjects in ArcGIS™ v9.3.1 is used to structure the optimization

model files. The optimization problems are then solved using the commercial mixed integer programming (MIP) software package CPLEX v12.2. All analyses are performed on a personal computer equipped with an Intel Core Quad 2.4 GHz CPU and 3 GB of RAM.

4.4.3 Results

4.4.3.1 Realization of SASDR

In the realization of SASDR, three types of roads are used to create the road network and then to create the 8-min service area for each potential facility site. The information for roads is listed in Table 4.1 and includes the MAF/TIGER Feature Class Codes (MTFCC) defined in the census data, road descriptions and hypothetical speed limits. Figure 4.4 shows the road network in the study area.

Table 4.1. Information for roads

MTFCC   Description                                        Speed limit (miles/hour)
S1100   Primary Road                                       70
S1200   Secondary Road                                     55
S1400   Local Neighborhood Road, Rural Road, City Street   40

Figure 4.4. Road network in EMS Region 10 in GA

After the road network is created, a service layer that includes the 8-min service polygons

for the 82 potential facility sites is created from the road network using the network-analysis functions in ArcGIS (Figure 4.5). The white areas indicate that no vehicles can reach these locations within 8 minutes from any potential facility location. Each service polygon was identified by the ID of its corresponding facility site.


Figure 4.5. Eight-minute service areas (non-white polygons) of all potential ambulance facility sites (red points) based on the road network

With the polygon overlay tool “Identity” in ArcGIS, the service layer is used to partition

the study area to derive the partition layer that includes all intersecting units among the service

polygons and the study area. Because of possible overlap among the service polygons, the

partition layer may include duplicate intersecting units that have the same location and shape but

different facility site IDs. A new field, “DO_ID”, is created in the partition layer, and the “Field

Calculator” function in ArcGIS with VBScript is used to compare the centroid coordinates and the area of each unit to identify the duplicate units. All units that represent the same demand object will be assigned the same demand object ID in the field “DO_ID”. In the attribute table of the partition layer, both facility site ID and demand object ID now exist in each record. The facility site j in the record of the demand object i indicates that the demand object i can be

completely covered by the service from the potential facility site j. This information will later be used to construct the model input file for CPLEX to solve the problem. A total of 2,721 demand objects are obtained for the study area. We export them from the partition layer to create the demand object boundary layer.
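The duplicate-detection step can be sketched as follows. This is a Python stand-in and not the VBScript used in the Field Calculator; the record layout, the rounding precision `ndigits`, and the function names are assumptions. Units whose rounded centroid coordinates and area coincide receive the same demand object ID, and the facility site IDs attached to each demand object give the sites that fully cover it.

```python
def assign_demand_object_ids(units, ndigits=3):
    """Give units with the same (rounded) centroid and area the same DO_ID."""
    ids = {}
    for unit in units:
        key = (
            round(unit["cx"], ndigits),
            round(unit["cy"], ndigits),
            round(unit["area"], ndigits),
        )
        # Reuse the ID of an identical unit, otherwise issue the next one.
        unit["DO_ID"] = ids.setdefault(key, len(ids) + 1)
    return units


def covering_sites(units):
    """Map each demand object ID to the set of sites that cover it."""
    cover = {}
    for unit in units:
        cover.setdefault(unit["DO_ID"], set()).add(unit["site_id"])
    return cover


# Hypothetical partition-layer records: the first two are duplicates
# produced by overlapping service polygons of sites 7 and 9.
units = [
    {"site_id": 7, "cx": 1.0, "cy": 2.0, "area": 0.5},
    {"site_id": 9, "cx": 1.0, "cy": 2.0, "area": 0.5},
    {"site_id": 7, "cx": 3.0, "cy": 2.0, "area": 1.2},
]
assign_demand_object_ids(units)
cover = covering_sites(units)
```

Matching on rounded values is a simplification: two nearly identical values that fall on opposite sides of a rounding boundary would be treated as different units, so a tolerance-based comparison may be preferable for real data.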

The next step for the realization of SASDR is to calculate the amount of demands in each demand object, which is interpolated from the census block group population data and assumed to be distributed uniformly within the demand object. When the polygon overlay tool “Intersect” in ArcGIS is used to overlay the layer of population density by block group on the demand object boundary layer, many intersecting units emerge. The population in each unit is calculated by multiplying its population density by the area of that unit. Finally, the populations of the intersecting units are aggregated to the demand objects. Figure 4.6 shows the final SASDR result for the study area with demand (i.e., population) distribution. Because of round-off error, a total aggregated population of 460,219 in the study area is obtained, which is then used as the total amount of demands in the subsequent model. There are 623 demand objects with no people because of their small sizes and low population densities. These zero-population demand objects are first excluded from the optimization problem to reduce the computational complexity. After the optimization problem is solved by CPLEX, these demand objects are brought back and allocated to their nearest facilities.
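The interpolation described above amounts to a density-times-area sum per demand object. The sketch below uses hypothetical intersecting units; the tuple layout and the rounding of each demand object to whole persons are assumptions, because the text does not state at which stage the rounding was applied.

```python
def aggregate_population(pieces):
    """pieces: iterable of (demand_object_id, density, area).
    Returns the interpolated population of each demand object."""
    totals = {}
    for do_id, density, area in pieces:
        totals[do_id] = totals.get(do_id, 0.0) + density * area
    return totals


# Hypothetical intersecting units (density in persons/km2, area in km2).
pieces = [(1, 120.0, 0.5), (1, 80.0, 0.25), (2, 10.0, 0.02)]
population = aggregate_population(pieces)

# Assumed rounding to whole persons; tiny, sparse objects drop to zero.
rounded = {do_id: round(value) for do_id, value in population.items()}
zero_population = [do_id for do_id, value in rounded.items() if value == 0]
```

Demand object 2 in this toy example is small and sparsely populated, so it rounds to zero people, which mirrors the zero-population demand objects that are set aside before optimization.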


Figure 4.6. SASDR result for the study area with demand (population) distribution

4.4.3.2 Model Construction and Solution

The distance between demand object and facility location is measured from the centroid of the demand object to the facility location point in kilometers. The maximum distance in this study area is 33.377 km and the minimum distance is $2.683 \times 10^{-2}$ km. According to Equation 4.12, the value of weight w should be within the range $[0, 6.515 \times 10^{-8}]$ to ensure that the maximization of the covered allocated demands is the primary objective. In fact, as long as the value of weight w falls in this range and does not equal zero, the solutions of each model will be the same, irrespective of the weight w. Therefore, we set $w = 6 \times 10^{-8}$ for both the MCMCLP-NFC and MCMCLP-FC models.
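The upper bound on w follows directly from Equation 4.12. As a quick check, the sketch below recomputes it from the values reported in this section: a total demand of 460,219, a maximum distance of 33.377 km, and a minimum distance of 0.02683 km. Only the function name is an assumption.

```python
def weight_upper_bound(total_demand, d_max, d_min):
    """Upper limit on w from Equation 4.12: 1 / (A * (d_max - d_min))."""
    return 1.0 / (total_demand * (d_max - d_min))


bound = weight_upper_bound(460219, 33.377, 2.683e-2)
chosen_w = 6e-8
print(f"upper bound on w: {bound:.3e}")
print("chosen w is within the bound:", 0 < chosen_w <= bound)
```

The computed bound is approximately 6.515e-08, so the chosen weight of 6e-08 lies inside the admissible range.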

The model input files were constructed with the VBA program of ArcObjects in ArcGIS.

These models were then solved in CPLEX, which uses a branch-and-cut technique to find the

optimal solution (CPLEX Help 2011). The run time is 3,361 seconds for the MCMCLP-NFC

model and 706 seconds for the MCMCLP-FC model. The solutions obtained from CPLEX were

finally visualized as maps in ArcGIS.

Figure 4.7 shows the results of two MCMCLP models using the choropleth maps overlaid

with selected facility sites. In these maps, the facility and the demands allocated to it are

represented in the same colors, and larger facility symbols indicate more ambulances. With such

maps, the location-allocation patterns of the problem solution can be easily understood. For those

demand objects whose demands will be divided and allocated to more than one facility, the

strategy here is to split the demand object into multiple parts. For each facility that partially

serves the demand object, one part of the demand object is drawn as close as possible to that facility,

with a size proportional to the percentage of demands served by that facility. In Figure 4.7(a),

in which the MCMCLP-NFC is applied, a total of 51 out of 82 potential sites are chosen to set up

the facilities, and 402,365 demands (87.4% of total demands) are covered within the 8-min

service covering standard. In Figure 4.7(b), in which the MCMCLP-FC is applied, 20 facilities

are required by the problem specification, and 358,477 demands (77.9% of total demands) are

covered within the service covering standard. As expected, the amount of the covered allocated

demands obtained by the MCMCLP-NFC is greater than that obtained by the MCMCLP-FC

because more facilities in the MCMCLP-NFC provide greater flexibility for siting the ambulances. Because the proximity of the uncovered allocated demands to the facilities is considered in both models (i.e., $w = 6 \times 10^{-8}$), the demands allocated to a facility are generally distributed more compactly and more continuously than those in the models with w = 0 (results not shown). However, the allocations of many facilities are still dispersed into several parts that may

be far away from one another. For example, there are two major demand patches with varied

sizes (filled with diagonals) allocated to the facility at site 13 in Figure 4.7(a). One reason for

this allocation is that the primary objective of the models is to maximize the covered allocated

demands instead of the proximity of the uncovered allocated demands to the facilities. The

splitting operation of the demand objects to represent the partial coverage could also cause the

noncontinuous demand allocations in the maps. Because of the smaller number of facilities

established, the MCMCLP-FC shows a more compact and continuous distribution of the

demands than the MCMCLP-NFC shows.

Table 4.2 shows the counts of the facilities with varied numbers of ambulances in these

two models. The maximum number of ambulances in a facility is 3 (site 45 in Figure 4.7(a)) in the MCMCLP-NFC model and 12 (site 35 in Figure 4.7(b)) in the MCMCLP-FC model.

Figure 4.7. Results of the MCMCLP models siting 58 ambulances in 82 potential facility locations with $w = 6 \times 10^{-8}$ (the facility location is rendered in the same color as its allocation area): (a) the MCMCLP-NFC model; (b) the MCMCLP-FC model with 20 facilities

Table 4.2. Count of the facilities with varied numbers of ambulances

Number of ambulances   Count of facilities   Count of facilities
in a facility          (MCMCLP-NFC)          (MCMCLP-FC)
1                      45                    2
2                      5                     10
3                      1                     5
4                      0                     1
5                      0                     1
12                     0                     1
Total                  51                    20

4.5 Discussion

Several assumptions are made in this article to apply the MCMCLP models to optimally site emergency vehicles such as ambulances. One assumption is that a facility has a capacity that is related to the vehicles stationed there. This assumption is simple but reasonable. If the population in the jurisdiction of a facility is too large, one of the important indicators for the emergency service quality, the average response time to the calls for emergency service, will be too long. When the population exceeds a limit, the quality of the emergency service provided by that facility will be unacceptable. Given a requirement on the average response time to the calls, a facility with more vehicles may serve a greater population. In our application, for simplicity, we assume that each vehicle has the same capacity and that the capacity of a facility is equal to the total capacity of the vehicles located there. Admittedly, this is a very restrictive assumption because the capacity of an emergency vehicle actually depends on multiple factors, including the requirement on the average response time, the average frequency of calls in the population it will serve, and the average treatment time for a task, among others. A discussion of this problem exceeds the scope of this article. However, if the possible capacity levels of the facility at each potential site can be estimated and taken as a group of constants, the MCMCLP model can be easily modified to accommodate the situation. The location problems of emergency vehicles are,

in reality, complex. The MCMCLP is a static model that does not consider the dynamic factors

such as the daily population movement. Accounting for such factors will be the focus of our

future work.

The MCLP has been proven to be nondeterministic polynomial time (NP)-hard (Megiddo et al. 1981), which means that no algorithm has yet been discovered to solve it in polynomial time in the worst case. As an extension to the MCLP, the MCMCLP is also NP-hard. Therefore, the use of exact methods (e.g., enumeration or linear programming with branch-and-bound) to solve a large-scale MCMCLP will be difficult. Seeking heuristic methods (e.g., genetic algorithm or Lagrangian relaxation) is important for promoting the applications of the MCMCLP. A potential heuristic method for solving the MCMCLP is a two-phase procedure, in which the locations of the facilities and the demand allocation are first determined under the assumption that the facilities are uncapacitated; the emergency vehicles are then allocated to each facility depending on the allocated demands. We note that this two-phase procedure does not consider that the second phase may change the demand allocation determined by the first phase, which will cause the configuration of facility locations determined by the first phase to not necessarily be the optimal solution for the whole problem.
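The second phase of the two-phase procedure can be read in several ways. One simple reading, offered here only as an assumption and not as the method of this study, gives each facility the smallest number of vehicles whose combined capacity meets the demand allocated to it in the first phase. The facility labels and demand values below are hypothetical.

```python
from math import ceil


def vehicles_needed(allocated_demand, c):
    """Smallest vehicle count per facility so that capacity covers the
    demand allocated in phase one (facilities with no demand get none)."""
    return {
        site: ceil(demand / c)
        for site, demand in allocated_demand.items()
        if demand > 0
    }


# Hypothetical phase-one result: demand allocated to three facilities.
allocated = {"A": 15000, "B": 8000, "C": 500}
need = vehicles_needed(allocated, c=8000)
total_needed = sum(need.values())
```

If `total_needed` exceeds the number of available vehicles p, the first-phase allocation has to be revised, which is the limitation of the two-phase procedure noted above.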

Although model formulation and the optimization of algorithms are always the focus in location modeling, many other aspects of the location problem, such as the representation for spatial demands, also influence the accuracy of the modeling solutions and require attention. An effective visualization of the problem solutions will be helpful in understanding the location-

allocation patterns and in making decisions by comparing different modeling results. One

problem that we need to address for our MCMCLP models in the future is how to better

represent in the map the demand objects served by multiple facilities.

In the MCMCLP model, GIS plays an important role. It is used to manage and organize

the spatial data, to realize the spatial demand representation, to help construct the model input

file for optimization software packages, and to visualize the problem solution with maps. In

addition to these important functions, GIS also facilitates theoretical advances in current location

science (Church 2002, Murray 2010).

4.6 Conclusion

The MCMCLP that we proposed in this article is an extension of the capacitated MCLP

to accommodate situations where the facilities to be sited have several possible capacity levels.

For the optimal siting of emergency vehicles, the MCMCLP considers the modular capacity

levels of a facility, the allocation of all demands, and the proximity of the uncovered allocated

demands to facilities. Two situations—the MCMCLP-NFC and the MCMCLP-FC—can be used depending on the circumstances of the facility. In cases where the cost of a facility is low and maximization of the covered allocated demands is the main purpose, such as establishing bases for ambulances that are not always based in a building but are often at a very rudimentary location such as a parking lot (Brotcorne et al. 2003), the MCMCLP-NFC may be more useful because more covered allocated demands are generally obtained than with the MCMCLP-FC. If the cost of facilities is also an important consideration, such as with fire stations for fire trucks, the MCMCLP-FC may be better because we can incorporate information about how many facilities we can build in the location modeling.

References

Adenso-Díaz, B. & Rodríguez, F., 1997. A simple search heuristic for the MCLP: Application to the location of ambulance bases in a rural region. Omega, 25 (2), 181-187.

Balcik, B. & Beamon, B.M., 2008. Facility location in humanitarian relief. International Journal of Logistics: Research & Applications, 11 (2), 101-121.

Bennett, V.L., Eaton, D.J. & Church, R.L., 1982. Selecting sites for rural health workers. Social Science & Medicine, 16 (1), 63-72.

Berman, O. & Krass, D., 2002. The generalized maximal covering location problem. Computers & Operations Research, 29 (6), 563-581.

Brotcorne, L., Laporte, G. & Semet, F., 2003. Ambulance location and relocation models. European Journal of Operational Research, 147 (3), 451-463.

Chung, C., Schilling, D. & Carbone, R., Year. The capacitated maximal covering problem: A heuristic. Proceedings of the Fourteenth Annual Pittsburgh Conference on Modeling and Simulation, 1423-1428.

Church, R. & ReVelle, C., 1974. The maximal covering location problem. Papers in Regional Science, 32 (1), 101-118.

Church, R.L., 2002. Geographical information systems and location science. Computers & Operations Research, 29 (6), 541-562.

Church, R.L., Stoms, D.M. & Davis, F.W., 1996. Reserve selection as a maximal covering location problem. Biological Conservation, 76 (2), 105-112.

Correia, I. & Captivo, M.E., 2003. A Lagrangean heuristic for a modular capacitated location problem. Annals of Operations Research, 122 (1), 141-161.

CPLEX Help, 2011. Branch and cut [online]. http://www.iro.umontreal.ca/~gendron/IFT6551/CPLEX/HTML/usrcplex/solveMIP9.html#638133 [Accessed 2011].

Current, J. & O'Kelly, M., 1992. Locating emergency warning sirens. Decision Sciences, 23 (1), 221-234.

Current, J. & Storbeck, J., 1988. Capacitated covering models. Environment and Planning B, 15, 153-164.

Daskin, M. & Dean, L., 2005. Location of health care facilities. Operations Research and Health Care, 43-76.

DCA, 2011. Data and maps for planning [online]. http://www.georgiaplanning.com/dataforplanning.asp [Accessed 2011].

Eaton, D.J., Daskin, M.S., Simmons, D., Bulloch, B. & Jansma, G., 1985. Determining emergency medical service vehicle deployment in Austin, Texas. Interfaces, 96-108.

Griffin, P.M., Scherrer, C.R. & Swann, J.L., 2008. Optimization of community health center locations and service offerings with statistical need estimation. IIE Transactions, 40 (9), 880-892.

Haghani, A., 1996. Capacitated maximum covering location models: Formulations and solution procedures. Journal of Advanced Transportation, 30 (3), 101-136.

Henderson, S. & Mason, A., 2005. Ambulance service planning: Simulation and data visualisation. Operations Research and Health Care, 77-102.

Indriasari, V., Mahmud, A.R., Ahmad, N. & Shariff, A.R.M., 2010. Maximal service area problem for optimal siting of emergency facilities. International Journal of Geographical Information Science, 24 (2), 213-230.

Liao, K. & Guo, D., 2008. A clustering based approach to the capacitated facility location problem. Transactions in GIS, 12 (3), 323-339.

Megiddo, N., Zemel, E. & Hakimi, S.L., 1981. The maximum coverage location problem: Northwestern University.

Murray, A.T., 2010. Advances in location modeling: GIS linkages and contributions. Journal of Geographical Systems, 12 (3), 335-354.

Murray, A.T. & Gerrard, R.A., 1997. Capacitated service and regional constraints in location-allocation modeling. Location Science, 5 (2), 103-118.

Murray, A.T. & O'Kelly, M.E., 2002. Assessing representation error in point-based coverage modeling. Journal of Geographical Systems, 4 (2), 171-191.

OEMS, 2006. Office of emergency medical services/trauma operating report.

Pirkul, H. & Schilling, D.A., 1991. The maximal covering location problem with capacities on total workload. Management Science, 37 (2), 233-248.

Ratick, S.J., Osleeb, J.P. & Hozumi, D., 2009. Application and extension of the Moore and ReVelle hierarchical maximal covering model. Socio-Economic Planning Sciences, 43 (2), 92-101.

Tong, D. & Murray, A.T., 2009. Maximising coverage of spatial demand for service. Papers in Regional Science, 88 (1), 85-97.

Verter, V. & Lapierre, S.D., 2002. Location of preventive health care facilities. Annals of Operations Research, 110 (1), 123-132.

Yin, P. & Mu, L., 2011. Service area spatial demand representation in maximal coverage modeling. Manuscript submitted for publication.


CHAPTER 5

AN EMPIRICAL COMPARISON OF SPATIAL DEMAND REPRESENTATIONS IN

MAXIMAL COVERAGE MODELING⁴

⁴ Yin, P. and Mu, L. To be submitted to Environment and Planning B.

Abstract

Operationally representing spatial demand is necessary to apply location models to

planning processes and is closely related to the efficiency of modeling solutions. A spatial demand

representation should not only minimize representation error but also keep the

complexity of the model as low as possible. Most of the current research, however, is primarily

focused on assessing and reducing/eliminating representation error while ignoring the

complexity of modeling associated with demand representation. In this study, we use expressions

of set theory to formalize a polygon-overlay-based demand representation called service area

spatial demand representation (SASDR). Using the maximal covering location problem (MCLP)

as an example, we empirically compare SASDR to widely-used point-based and regular-area-

based demand representations in terms of both problem complexity and representation error. Our

study shows that, although use of SASDR can eliminate some errors associated with other

demand representations, problem complexity with SASDR could become extremely high as

the number of potential facility sites increases, making the problem computationally intractable for exact

methods in current optimization software. Point-based demand representation with fine

granularity sometimes is a good alternative to SASDR because it can provide similarly effective

modeling solutions while avoiding extensive computation in GIS for the realization of SASDR.

Regular-area-based demand representation is not strongly recommended based on its poor performance compared to the point-based demand representation with a similar problem complexity.

Keywords: MCLP, Spatial demand representation, Representation error, Problem complexity,

GIS

5.1 Introduction

The fact that different scale- and/or unit-definitions in geographic analyses produce

different results is known as the modifiable areal unit problem (MAUP) (Openshaw and Taylor

1981). The MAUP is important not only in general areas of geographic analysis, but also in

location modeling where the MAUP is manifested in aggregation and representation errors

(Cromley et al. 2012). There has been a long history of study on aggregation error in location

modeling including p-median problems and covering location problems (Hillsman and Rhoda

1978, Goodchild 1979, Current and Schilling 1987, Daskin et al. 1989, Current and Schilling

1990, Hodgson and Neuman 1993, Bowerman et al. 1999, Francis et al. 2009, Cromley et al.

2012). More recently, representation error in location modeling, especially covering location

models, has started to receive more attention (Murray and O'Kelly 2002, Murray et al. 2008,

Tong and Murray 2009, Cromley et al. 2012).

For covering location modeling, it is common to assume that aggregated or continuous spatial demand is concentrated on a set of points or uniformly distributed within areal units. With respect to these point-based and area-based demand representations, there are several studies

focusing on assessing the associated representation errors (Murray and O'Kelly 2002, Murray et

al. 2008). Several other studies tried to reduce or eliminate the representation errors by new

covering model formulations (Murray 2005, Tong and Murray 2009). Different from the

traditional area-based representations using census units or regular polygons, such as triangles or

rectangles, as demand objects, Cromley et al. (2012) proposed a new area-based demand

representation that partitions a continuous demand space using polygon overlay methods into a

set of areal units called the least common demand coverage units (LCDCUs). This representation

approach, without complicated model formulations, could reduce or eliminate some errors

associated with the traditional point-based and area-based representations.

Current studies with respect to spatial demand representations primarily focus on the evaluation of representation errors and how to reduce or eliminate these errors. However, the complexity of problems associated with demand representations is rarely discussed. Many covering location models, such as the maximal covering location problem (MCLP), have been proven to be nondeterministic polynomial time (NP)-hard (Megiddo et al. 1981), which means

that no algorithm has yet been discovered that solves them in polynomial time in the worst case.

Actually, the size of a covering location problem is highly related to the demand representation it

adopts. Therefore, even if a demand representation approach may theoretically reduce or

eliminate some representation errors in a problem, it probably could make the problem difficult,

if not impossible, to solve using exact methods in current optimization software. Relying on

some heuristic algorithms to solve such a complicated problem may introduce other errors in

modeling results.

As Cromley et al.’s (2012) spatial demand representation with LCDCUs is based on the

service area of a facility at each potential facility site, we define this representation as service

area spatial demand representation (SASDR). In this paper, we use the MCLP as an example to

empirically compare SASDR to the traditional point-based and regular-area-based

representations where both representation error and problem complexity are simultaneously considered. Specifically, we evaluate problem complexity associated with these three types of demand representations and compare their representation errors given similar degrees of problem complexity. This comparison is expected to provide some insight on how to choose appropriate demand representations in practical applications. Although the question of how to realize

SASDR with GIS was briefly described by Cromley et al. (2012), it is worth formalizing the process of its realization for greater precision and clarity. In the following two sections, more details about representation error and problem complexity in the MCLP are reviewed. Next, the formalization of SASDR is given and explained. Experimental designs for understanding the problem complexity and modeling errors associated with the three types of demand representations are then described, followed by the experimental results and discussions. Finally, some conclusions are offered.

5.2 Representation Error in Covering Location Modeling

In covering location modeling, aggregation and representation errors are related but fundamentally different. Murray and O’Kelly (2002) have noted that the aggregation of spatial information assumes there is one true lowest level of data. For example, the population at any higher level in the census hierarchy is an instance of the aggregation of the population at any lower level such as the census block level. Aggregation error occurs in any analysis conducted above the level of the individual or whenever a scale change occurs (Cromley et al. 2012).

Compared with demand aggregation, demand representation usually has no hierarchy like that in census data. Individual demand is usually represented by the location point of that demand.

Any aggregated or continuous demand is often assumed to be concentrated on a set of points or uniformly distributed within areal units. With different point or areal tessellations for representing the same aggregated or continuous demand in a region, some modeling errors could occur. Such representation error is usually measured by comparing modeling results with one spatial demand representation to those with another at the same aggregation levels.

It is a long-held tradition that continuous demand is represented by a set of discrete weighted points where the weight represents the amount of demand for service on that point.

Many location models including the MCLP were proposed based on this kind of demand representation. Along with the development of GIS in location science, areal units have been used to represent continuous demand due to the 2-dimensional nature of demand space and the strong capability of GIS to manipulate 2-dimensional spatial objects (Miller 1996, Kim and

Murray 2008, Murray et al. 2008, Tong et al. 2009, Tong and Murray 2009, Alexandris and

Giannikos 2010). Figure 5.1 shows four examples of the traditional point-based and area-based representations for the demand in a region with three polygons. In Figure 5.1(a), the demand in each polygon is assumed to be concentrated on the centroid of that polygon or uniformly distributed within that polygon. Figure 5.1(b) shows a rectangular grid, or its centroids, used to represent the demand space, where the demand in each rectangle is assumed to be uniformly distributed or concentrated on its centroid. When the demand within each demand object cannot be obtained directly, which is very common, it may need to be estimated using areal interpolation techniques with other available demand data whose unit boundaries are inconsistent with those of the demand representation. In particular, intelligent areal interpolation methods, which are based on the principles of dasymetric mapping, usually provide better estimates of the spatial heterogeneity of demand within areal units than simple areal interpolation methods do

(Cromley et al. 2012).


Figure 5.1. Examples of spatial demand representations with (a) census blocks or their centroids, and (b) rectangle grid or its centroids

In many covering location models, the demand of a demand object has only a binary status:

it is either completely covered by a facility or not covered at all. In Figure 5.1, we assume a facility

(the star) with circular service coverage is located in the region. According to the point-based

demand representation in Figure 5.1(a), the demand within polygon C is considered covered by

the facility since its centroid is within the service coverage. No demand in polygons A and B is

considered covered since both of their centroids are outside the service coverage. However, the

reality is that a portion of demand within polygon C is not covered while a portion of demand within polygons A and B is covered. Based on the area-based representation in Figure 5.1(a), no

demand in the whole region is considered covered since none of these three polygons is

completely within the service coverage. However, it is true that a portion of demand in these

three polygons is covered. A similar situation occurs when using the point-based or area-based

demand representations in Figure 5.1(b). Assuming the demand estimate within each areal unit is

“real”, we can see that point-based demand representation could either underestimate or

overestimate the amount of “real” demand covered, whereas traditional area-based demand

representation could underestimate the amount of “real” demand covered. Such underestimation or overestimation will lead to modeling errors in both the total amount of covered demand estimated by the objective functions of models and the configuration of facilities given by the decision variables in model results.
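The over- and underestimation described above can be illustrated numerically. The following minimal Python sketch is an illustration added here, not part of the study: the square demand unit, the facility locations, and the radius are made-up values, and the "real" covered share is only approximated by sampling a fine lattice under the uniform-demand assumption.

```python
import math

def covered_share(unit, facility, radius, n=200):
    """Approximate the share of a square unit lying within `radius` of `facility`.

    `unit` is (xmin, ymin, size). Demand is assumed uniform inside the unit,
    so the area share equals the demand share.
    """
    xmin, ymin, size = unit
    step = size / n
    inside = 0
    for i in range(n):
        for j in range(n):
            x = xmin + (i + 0.5) * step
            y = ymin + (j + 0.5) * step
            if math.hypot(x - facility[0], y - facility[1]) <= radius:
                inside += 1
    return inside / (n * n)

def point_based(unit, facility, radius):
    """Binary coverage: the whole unit counts if its centroid is covered."""
    xmin, ymin, size = unit
    cx, cy = xmin + size / 2, ymin + size / 2
    return 1.0 if math.hypot(cx - facility[0], cy - facility[1]) <= radius else 0.0

def area_based(unit, facility, radius):
    """Binary coverage: the unit counts only if it lies entirely in the circle.

    A square is inside a circle exactly when all four corners are inside.
    """
    xmin, ymin, size = unit
    corners = [(xmin, ymin), (xmin + size, ymin),
               (xmin, ymin + size), (xmin + size, ymin + size)]
    inside = all(math.hypot(x - facility[0], y - facility[1]) <= radius
                 for x, y in corners)
    return 1.0 if inside else 0.0

unit = (0.0, 0.0, 1.0)  # hypothetical unit square
for facility in [(0.9, 0.5), (1.3, 0.5)]:  # hypothetical facility locations
    print(facility,
          "point:", point_based(unit, facility, 0.6),
          "area:", area_based(unit, facility, 0.6),
          "sampled:", round(covered_share(unit, facility, 0.6), 3))
```

With the first facility the centroid rule counts the whole unit as covered although part of it lies outside the circle; with the second it counts nothing although part of the unit is covered. The all-or-nothing area rule reports zero in both cases.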

Based on the discussions by Casillas (1983) and Cromley et al. (2012), representation error is defined as the difference between the objective function values optimized for the same study area with two different demand representations. We use Cromley et al.’s (2012) terminology and consider the following notation:

$f_a$ is an objective function using representation a

$f_b$ is an objective function using representation b

$x_a$ is the optimal solution to the problem using representation a

$x_b$ is the optimal solution to the problem using representation b

Taking representation b as the reference, representation error is defined as follows:

$$\text{Representation error} = \frac{f_a(x_a) - f_b(x_b)}{f_b(x_b)} \qquad \text{(Equation 5.1)}$$

Representation error can be decomposed into cost error and optimality error. Cost error is the difference between the objective function values of the same solution measured with two different demand representations, which is shown as follows:

$$\text{Cost error} = \frac{f_a(x_a) - f_b(x_a)}{f_b(x_b)} \qquad \text{(Equation 5.2)}$$

Optimality error is the difference between the objective function values of two optimal solutions measured with the same demand representation. It is defined as follows:

$$\text{Optimality error} = \frac{f_b(x_a) - f_b(x_b)}{f_b(x_b)} \qquad \text{(Equation 5.3)}$$
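The decomposition can be checked with a few lines of Python. This is a minimal sketch with hypothetical objective values (the three numbers below are not results from this study); it only shows that cost error and optimality error, both scaled by the reference value f_b(x_b), add up to representation error.

```python
def representation_errors(fa_xa, fb_xa, fb_xb):
    """Return (representation, cost, optimality) errors relative to f_b(x_b).

    fa_xa : objective value of solution x_a measured with representation a
    fb_xa : objective value of solution x_a measured with representation b
    fb_xb : objective value of solution x_b measured with representation b
    """
    if fb_xb == 0:
        raise ValueError("reference objective value f_b(x_b) must be non-zero")
    representation = (fa_xa - fb_xb) / fb_xb
    cost = (fa_xa - fb_xa) / fb_xb
    optimality = (fb_xa - fb_xb) / fb_xb
    return representation, cost, optimality

# Hypothetical covered-demand totals, chosen only for illustration.
rep, cost, opt = representation_errors(fa_xa=9500.0, fb_xa=9200.0, fb_xb=9400.0)
print(f"representation={rep:.4f} cost={cost:.4f} optimality={opt:.4f}")
```

For a maximization problem such as the MCLP, the optimality error cannot be positive, because x_b is by definition optimal under representation b.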

5.3 The MCLP Model and Problem Complexity

Given a covering standard for a service, such as maximum distance or travel time, the objective of the MCLP is to locate a fixed number of facilities to provide service coverage for as much spatial demand as possible. Consider the following notation:

I = the set of demand objects (i as demand object index)

J = the set of potential facility sites (j as facility site index)

$d_{ij}$ = the travel distance or time from potential facility site j to demand object i

S = the distance or time beyond which a demand object is considered ‘uncovered’

$w_i$ = the demand for service at i

p = the total number of facilities to be located

$$x_j = \begin{cases} 1 & \text{if facility site } j \text{ is selected} \\ 0 & \text{otherwise} \end{cases}$$

$$y_i = \begin{cases} 1 & \text{if demand } i \text{ is covered (or served)} \\ 0 & \text{otherwise} \end{cases}$$

$$a_{ij} = \begin{cases} 1 & \text{if facility site } j \text{ is capable of serving demand } i \text{ (i.e., } d_{ij} \le S\text{)} \\ 0 & \text{otherwise} \end{cases}$$

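The formulation of the MCLP (Church and ReVelle 1974) is

$$\text{Maximize} \quad \sum_{i \in I} w_i y_i \qquad \text{(Equation 5.4)}$$

Subject to

$$\sum_{j \in J} a_{ij} x_j \ge y_i \quad \forall i \qquad \text{(Equation 5.5)}$$

$$\sum_{j \in J} x_j = p \qquad \text{(Equation 5.6)}$$

$$x_j \in \{0, 1\} \quad \forall j \qquad \text{(Equation 5.7)}$$

$$y_i \in \{0, 1\} \quad \forall i \qquad \text{(Equation 5.8)}$$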

The objective Equation 5.4 seeks to maximally cover the amount of weighted demand.

Constraints 5.5 ensure that demand i is counted as covered only if at least one facility is located at a site whose service can cover demand i. Constraint 5.6 specifies the total number of facilities to be located. Constraints 5.7 and 5.8 impose integrality conditions on decision variables.
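To make the formulation concrete, the sketch below solves a tiny MCLP instance by exhaustively enumerating every set of p sites. This is not the Branch-and-Cut approach of a commercial solver and is practical only for very small instances; the weights and distances are made-up toy data, and the function name is hypothetical.

```python
from itertools import combinations

def solve_mclp_brute_force(weights, dist, S, p):
    """Enumerate all p-subsets of sites and return (best_cover, best_sites).

    weights[i] : demand w_i at demand object i
    dist[i][j] : distance d_ij from site j to demand object i
    S          : covering standard; a_ij = 1 when d_ij <= S
    p          : number of facilities to locate
    """
    n_sites = len(dist[0])
    # For each demand object i, the set of sites j with a_ij = 1.
    covers = [{j for j in range(n_sites) if row[j] <= S} for row in dist]
    best_cover, best_sites = -1.0, None
    for sites in combinations(range(n_sites), p):
        chosen = set(sites)
        # y_i may be 1 only if at least one chosen site covers i (Equation 5.5).
        total = sum(w for w, n_i in zip(weights, covers) if n_i & chosen)
        if total > best_cover:
            best_cover, best_sites = total, sites
    return best_cover, best_sites

# Toy instance: 4 demand objects (rows) and 3 potential sites (columns).
weights = [10, 20, 30, 40]
dist = [
    [1.0, 5.0, 9.0],
    [2.0, 1.0, 9.0],
    [9.0, 2.0, 1.0],
    [9.0, 9.0, 2.0],
]
best_cover, best_sites = solve_mclp_brute_force(weights, dist, S=2.0, p=2)
print(best_cover, best_sites)
```

The number of candidate subsets grows combinatorially with the number of sites, which is one way to see why exact methods struggle as the problem size increases.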

The complexity of the MCLP problem mainly depends on the number of demand constraints (Equation 5.5) and the number of integrality constraints on decision variables

(Equation 5.7) and (Equation 5.8). For each demand object (e.g., point or areal unit), if its demand weight is larger than 0 and it can be covered by a facility at a potential location, there will be a demand constraint and an integrality constraint associated with this demand object in the MCLP model. Each potential facility site also contributes an integrality constraint to the model. Therefore, the complexity of the MCLP problem is closely related to the spatial demand representation and the number of potential facility sites in an application. When using census

units or their centroids to represent demand, the number of demand objects is equal to the number of census units in the study area. However, when using a point grid or a regular areal grid to represent demand, the number of demand objects depends on the grid design, which is often arbitrary.
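The counting rule described above can be written as a small helper. This is a sketch under the stated rule only (one demand constraint and one binary variable per demand object with positive weight that some site can cover, plus one binary variable per site); the function name and inputs are hypothetical.

```python
def mclp_model_size(weights, coverable, n_sites):
    """Count demand constraints and binary variables of an MCLP instance.

    weights[i]   : demand weight w_i of demand object i
    coverable[i] : True if at least one potential site can cover object i
    n_sites      : number of potential facility sites
    """
    active = sum(1 for w, c in zip(weights, coverable) if w > 0 and c)
    return {
        "demand_constraints": active,          # one Equation 5.5 row per active object
        "binary_variables": active + n_sites,  # y_i for active objects plus x_j
    }

# Toy input: object 1 has no demand and object 2 cannot be covered.
size = mclp_model_size(weights=[5, 0, 3, 2],
                       coverable=[True, True, False, True],
                       n_sites=3)
print(size)
```

As an upper-bound illustration, 37,012 demand objects with 272 potential sites would give at most 37,012 demand constraints and 37,284 binary variables, assuming every object has positive demand and is coverable.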

In applications of the MCLP model, the size of census unit or regular areal unit for demand representation is usually smaller than the service coverage of a facility for better accuracy of modeling results. Analysis based on a demand representation with finer granularity

(i.e., smaller size of demand object) also is expected to lead to smaller representation errors since more complete demand objects can be covered within the service coverage of a facility. With respect to predefined potential facility sites, we need to consider multiple factors including cost, site availability, proximity to demand, access to other services, etc., which may have large variability in a region. More potential facility sites provide more facility configurations to choose from, which in turn can increase the amount of demand covered by a given number of facilities. Note, however, that when more demand objects and potential facility sites are used to improve modeling results, the model could become dramatically more complex and pose a computational challenge for exact methods in current commercial optimization software. Heuristic methods, such as genetic algorithms, provide alternative approaches to solving such complex location problems. However, they cannot guarantee optimal solutions, which could introduce other errors in modeling results, and they also require sophisticated algorithmic strategies and strong programming skills.

5.4 Service Area Spatial Demand Representation

SASDR was originally described by Cromley et al. (2012) as an area-based demand representation, with or without intelligent areal interpolation, and was compared with census-centroid-based demand representation in terms of representation and scale error. In this section, we use expressions of set theory to formalize the realization of SASDR, which makes it easier to understand and to implement in different GIS software packages. In addition, we discuss both representation error and problem complexity of SASDR based on its concept.

The map overlay process has been used for approximately 50 years, and its multiple forms are important spatial analysis methods in GIS (McHarg and American Museum of Natural

History. 1969, Longley et al. 2005). SASDR is based on one of the map overlay operations.

Considering two sets A (rectangle) and B (circle) in Figure 5.2(a), the overlay operation A▲B is defined as below:

$$A \blacktriangle B = \{\, X \mid X \in I = \{A - B,\ A \cap B\}\ \text{and}\ X \neq \varnothing \,\} \qquad \text{(Equation 5.9)}$$

where I is a two-member set in which, as shown in Figure 5.2(b), member $A - B$ is the set of all elements that are members of A but not members of B, and member $A \cap B$ is the set of all elements that are members of both A and B. A▲B is the set whose members are those non-empty members of I. Therefore, A▲B can be a two-member set $\{A - B,\ A \cap B\}$ when $A \not\subseteq B$ and $A \cap B \neq \varnothing$, a one-member set $\{A - B\}$ when $A \neq \varnothing$ and $A \cap B = \varnothing$, a one-member set $\{A \cap B\}$ when $A \subseteq B$ and $A \cap B \neq \varnothing$, or the empty set $\varnothing$ when $A = \varnothing$.


Figure 5.2. Illustration of overlay operation A▲B: (a) set A and set B (b) the result from A▲B

For a set of sets C = {Ci, i= 1, 2, 3, …, n} and a set D, overlay operation C▲D is defined

as below:

$$C \blacktriangle D = \bigcup_{i=1}^{n} \left( C_i \blacktriangle D \right) \qquad \text{(Equation 5.10)}$$

Therefore, C▲D is actually a set of sets consisting of all members of the sets obtained by

conducting the overlay operation on each member Ci of set C with set D.
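The two overlay operations can be sketched with ordinary Python sets. This is an illustrative assumption rather than a GIS implementation: regions are modelled as finite sets of element labels instead of polygons, and the function names are hypothetical.

```python
def overlay(a, b):
    """A ▲ B: the non-empty members of {A - B, A ∩ B}."""
    a, b = frozenset(a), frozenset(b)
    return {part for part in (a - b, a & b) if part}

def overlay_collection(c, d):
    """C ▲ D: the union of C_i ▲ D over all members C_i of C."""
    result = set()
    for c_i in c:
        result |= overlay(c_i, d)
    return result

# A partly overlaps B: two members, A - B and A ∩ B.
print(overlay({1, 2, 3, 4}, {3, 4, 5}))
# A lies inside B: only A ∩ B is non-empty.
print(overlay({1, 2}, {1, 2, 3}))
# A set of sets overlaid with a single set D.
print(overlay_collection([{1, 2}, {3, 4}], {2, 3}))
```

Empty pieces are dropped inside `overlay`, which corresponds to the non-emptiness condition in the definition of A▲B.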

Because the set of potential facility sites and the service standard are given in our case, the service area at each potential facility site can then be determined. Consider the following

notation:

U = the whole demand space

Sj = the service area at potential facility site j (j = 1, 2, 3, …, m)

SASDR is defined as the partition of demand space U into a finite demand object set SA_DOS :

$$SA\_DOS = U \blacktriangle S_1 \blacktriangle S_2 \blacktriangle S_3 \blacktriangle \dots \blacktriangle S_m \qquad \text{(Equation 5.11)}$$

Each element $D \in SA\_DOS$ is defined as a demand object, also called an LCDCU following

Cromley et al.’s (2012) terminology; the elements are pairwise disjoint, and $\bigcup_{D \in SA\_DOS} D = U$.

Figure 5.3(a) shows an example in which a rectangle demand space U will be partitioned

into a SASDR by two potential facility sites f1 and f2 with circular service areas S1 and S2. First, demand space U is partitioned by service area S1, creating two demand objects

$U \blacktriangle S_1 = \{U - S_1,\ U \cap S_1\}$ (Figure 5.3(b)). Then, service area S2 is used to continue to partition

the demand space U. A total of four demand objects

$U \blacktriangle S_1 \blacktriangle S_2 = \{(U - S_1) - S_2,\ (U - S_1) \cap S_2,\ (U \cap S_1) - S_2,\ (U \cap S_1) \cap S_2\}$ are created in the final

SASDR (Figure 5.3(c)). Demand objects $(U \cap S_1) - S_2$ and $(U \cap S_1) \cap S_2$ can be completely

covered if a facility is located at site f1, and demand objects $(U - S_1) \cap S_2$ and $(U \cap S_1) \cap S_2$ can

be completely covered if a facility is located at site f2. Neither of the services can completely or

partially cover demand object $(U - S_1) - S_2$. Despite the simple circular shape demonstrated, the

facility service area could be any shape.
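The two-site example can be reproduced on a discretised demand space. The sketch below is an assumption-laden illustration, not the ArcGIS polygon overlay used in the study: the demand space is a 20 by 10 lattice of unit cells, each service area is the set of cells whose centre lies within the radius, and the site coordinates and radius are made up.

```python
import math

def service_cells(cells, site, radius):
    """Cells whose centre lies within `radius` of `site` (a discretised S_j)."""
    return frozenset(
        c for c in cells
        if math.hypot(c[0] + 0.5 - site[0], c[1] + 0.5 - site[1]) <= radius)

def sasdr(universe, service_areas):
    """Partition `universe` by successive overlay with each service area."""
    objects = {frozenset(universe)}
    for s in service_areas:
        refined = set()
        for obj in objects:
            # Keep only the non-empty pieces of {obj - s, obj ∩ s}.
            refined.update(part for part in (obj - s, obj & s) if part)
        objects = refined
    return objects

universe = frozenset((x, y) for x in range(20) for y in range(10))
s1 = service_cells(universe, site=(7.0, 5.0), radius=4.0)   # hypothetical site f1
s2 = service_cells(universe, site=(12.0, 5.0), radius=4.0)  # hypothetical site f2
demand_objects = sasdr(universe, [s1, s2])
print(len(demand_objects), sorted(len(d) for d in demand_objects))
```

Every resulting demand object is either entirely inside or entirely outside each service area, which is the property that removes partial coverage from the model.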

We can see that SASDR is fundamentally a simple map overlay-based approach.

Compared to point-based demand representations, it uses areal demand units that can reduce the potential measurement and coverage errors caused by aggregating continuous demand to discrete point demands. Compared to those traditional area-based demand representations using census units or regular area grid, it has the advantage that all demand objects will either be completely covered or not be covered by the service from any potential facility site. Without the partial coverage problem, the modeling is more efficient than those in which the partial coverage needs to be handled explicitly in models to reduce modeling errors, such as those proposed by Murray

(2005) and Tong and Murray (2009).


Figure 5.3. The SASDR with circular facility service area: (a) demand space U and two potential service areas S1 and S2, (b) the partition of demand space U with service area S1, and (c) the partition of demand space U with both service areas S1 and S2

Different from point-based and traditional area-based demand representations where the number of demand objects is independent of the configuration of potential facility sites, the number and arrangement of demand objects in SASDR are completely determined by the service standard and the configuration of potential facility sites in an application. In other words, the complexity of a MCLP model using SASDR is a function of the combination of service standard

and configuration of potential facility sites. This could be a problem when a high density of

potential facility sites is needed.

5.5 Experimental Design

Unlike previous studies where the comparisons of spatial demand representations only focus on representation error, we also simultaneously consider problem complexity associated with spatial demand representations. It is known that the increase of demand objects or potential facility sites is expected to reduce representation error and improve the optimality of modeling solutions. In our experiments, we mainly focus on the following two questions:

(1) How does the complexity of a problem using SASDR change when varying service

standard and configuration of potential facility sites?

(2) Given similar degrees of problem complexity, is there a large representation error

between SASDR and other types of demand representations including point-based

and traditional area-based approaches?

The study area in the experiments is the City of Decatur, Georgia which has an area of

approximately 4.2 square miles. The 2010 U.S. Census population data at the block level are

used to estimate the demand of each spatial object in all representations. To improve the

accuracy of the demand estimation, we use the 2010 land use data showing developed and

undeveloped areas as ancillary data and overlay it on the census population data so that all

population is constrained within the developed areas. The 2010 land use data were downloaded

from the website of Atlanta Regional Commission (ARC 2012).

To address question 1, we design three modes for potential facility

sites including one regular pattern and two irregular patterns as shown in Figure 5.4. Figure 5.4(a)

shows regular grid points with spacing R. Figure 5.4(b) shows the centroids of all census blocks,

and Figure 5.4(c) shows all intersections of major roads in the study area. The GIS data for both census blocks and major roads came from the 2010 Census data. For the mode of regular grid points in Figure 5.4(a), we set spacing R to five values: 500m, 400m,

300m, 250m, and 200m, which produce 42, 66, 116, 177, and 272 potential facility sites. Then, the same numbers of potential facility sites are randomly chosen from the centroids of census blocks in Figure 5.4(b) and the intersections of major roads in Figure 5.4(c). Finally, we have a total of 15 configurations of potential facility sites with three modes (regular grid point, centroid of census block, and intersection of roads) and five different numbers of sites (42, 66, 116, 177, and

272). With respect to the service standard of facilities, we define circular service coverage with three different radii: 300m, 650m, and 1000m. With each combination of service standard and configuration of potential facility sites, we create a SASDR and record the number of demand objects.


Figure 5.4. Three modes of potential facility sites: (a) regular grid points with spacing R, (b) centroids of census blocks, and (c) intersections of major roads

For question 2, we use circular service coverage with a radius of 1000m in the

experiment. Among the 15 configurations of potential facility sites created in the previous

experiment, we choose two configurations with 66 and 272 grid points and two configurations

with 66 and 272 centroids of census blocks. Therefore, there are a total of four SASDRs with the

combinations of one type of circular service coverage and four configurations of potential

facility sites. In all of these four situations, the whole study area can be covered by the service if

there are enough facilities located. For the traditional demand representations used to compare

with the SASDRs, we use four rectangle grids as the examples of traditional area-based demand

representation, and use the centroids of these rectangle grids as the examples of point-based

demand representation (Figure 5.5). By adjusting the spacing of the rectangle grid, we make the

numbers of demand objects in these four grid-rectangle-based and four grid-point-based demand

representations close to those in the four SASDRs. Finally, there are a total of four groups of

problems in this experiment for comparison, each of which includes three problems that have

different types of demand representations but similar degrees of problem complexity. The

number of facilities p in Equation 5.6 starts at 1 for every problem and increases by 1 until the model reports 100% demand coverage.


Figure 5.5. Examples of grid-point-based and grid-rectangle-based demand representations for comparison with SASDR

ArcGIS™ v10 is used to realize the SASDR and its visualization. Programming with

Visual Basic for Applications (VBA) for ArcObjects in ArcGIS™ v10 is used to structure the

optimization model file. The problems are solved using a commercial optimization package

CPLEX v12.2 that uses a Branch-and-Cut technique to search for the optimal solution (CPLEX Help

2011). All analyses are carried out on a personal computer with Intel Core Quad 2.4 GHz CPU

and 3 GB RAM.

5.6 Results and Discussions

5.6.1 Problem Complexity with SASDR

Table 5.1 summarizes the numbers of demand objects in 45 SASDRs with different combinations of service radius (SR) and configuration of potential facility sites. We can see that, regardless of whether the pattern of potential facility sites is regular (grid point) or irregular

(block centroid or road intersection), the number of demand objects in the SASDR increases dramatically with the increase of the number of potential facility sites. Taking the group with grid points for potential facility sites and SR=1000m as an example, an increment in the number of potential facility sites by a factor of 6.5 (i.e. 272/42) increases the number of demand objects by a factor of 39.4 (i.e. 37012/939). Such a sharply increasing trend is even more obvious when

SR=300m and SR=650m in this experiment.

Table 5.1. Numbers of demand objects in 45 SASDRs

Mode / Number of              Number of demand objects
potential facility sites      SR = 300m   SR = 650m   SR = 1000m
Grid_Point/42                       109         533          939
Grid_Point/66                       427       1,479        2,120
Grid_Point/116                      783       4,302        7,162
Grid_Point/177                    2,849       8,355       15,505
Grid_Point/272                    5,276      22,467       37,012
Block_Centroid/42                   162         490          904
Block_Centroid/66                   500       1,434        2,425
Block_Centroid/116                1,026       3,839        7,007
Block_Centroid/177                2,566       9,347       16,385
Block_Centroid/272                5,948      21,064       37,721
Road_Intersection/42                123         490          917
Road_Intersection/66                323       1,222        1,938
Road_Intersection/116             1,031       3,628        6,701
Road_Intersection/177             2,670       9,584       16,897
Road_Intersection/272             5,884      21,140       37,467
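The growth factors quoted in the text can be checked directly from the Grid_Point rows of Table 5.1. The short Python check below is an illustration added here rather than part of the original analysis; the variable names are arbitrary.

```python
# Demand-object counts for the Grid_Point mode, copied from Table 5.1.
# Outer key: service radius SR in metres; inner key: number of potential sites.
counts = {
    300: {42: 109, 272: 5276},
    650: {42: 533, 272: 22467},
    1000: {42: 939, 272: 37012},
}

site_factor = 272 / 42
growth = {sr: row[272] / row[42] for sr, row in counts.items()}

print(f"sites grow by a factor of {site_factor:.1f}")
for sr, factor in growth.items():
    print(f"SR={sr}m: demand objects grow by a factor of {factor:.1f}")
```

The ratios for SR=300m and SR=650m come out larger than the one for SR=1000m, consistent with the statement that the increase is even sharper for the smaller radii.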

With the same number of potential facility sites and SR, we note that the number of demand objects in SASDR with a regular pattern of potential facility sites could be either larger or smaller than that with an irregular pattern of potential facility sites. Therefore, there is no obvious rule relating the numbers of demand objects in SASDRs to regular versus irregular patterns of potential facility sites. Since the number of demand objects in SASDR is determined by both SR and

configuration of potential facility sites, we use Site-Service Index to measure the degree of

clustering of potential facility sites at the scale defined by SR. Site-Service Index describes the

average number of potential facility sites within a circle with radius = 2SR and is defined as

follows:

$$\text{Site-Service Index} = \frac{\sum_{i}^{N} \sum_{j}^{N} I\left(d_{ij} \le 2SR\right)}{N} \qquad \text{(Equation 5.12)}$$

where i and j are the indexes of potential facility sites, dij is the distance between potential

facility sites i and j, N is the total number of potential facility sites in a study region, and I(·) is an indicator function. We define the ratio of the total number of demand objects in SASDR to N as demand object density. Figure 5.6 shows the scatter plot of Site-Service Index and demand object density for the 45 SASDRs in our experiment. We can see there is a strong linear relationship between these two measures for both regular and irregular patterns of potential facility sites. The R² is 0.998 across all three modes of potential facility sites. This linear

relationship can be used to predict the number of demand objects in SASDR with circular service

coverage, which equals the product of demand object density and N. Given a fixed

study area and a SR, when N increases to some degree, the spatial pattern of potential facility

sites starts to become more and more clustered, and the Site-Service Index increases accordingly, which indicates an increase in demand object density based on the linear relationship.

Therefore, increases in both N and demand object density will make the total number of demand objects rise quickly.
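The index and the resulting prediction can be sketched as follows. This is a minimal illustration, not the authors' code: the double sum is taken over all ordered pairs as the formula is written, so each site counts itself, which is an assumption about the original implementation. The default slope and intercept are the fitted line reported with Figure 5.6 and should not be expected to transfer to other study areas or non-circular service areas; the site coordinates are made up.

```python
import math

def site_service_index(sites, sr):
    """Average number of sites within 2*SR of each site (Equation 5.12).

    The double sum runs over all ordered pairs, so each site counts itself
    (d_ii = 0). This follows the formula as written and is an assumption.
    """
    n = len(sites)
    hits = sum(1 for a in sites for b in sites
               if math.hypot(a[0] - b[0], a[1] - b[1]) <= 2 * sr)
    return hits / n

def predicted_demand_objects(sites, sr, slope=0.8335, intercept=1.5637):
    """Predict the SASDR size as N * (slope * index + intercept)."""
    density = slope * site_service_index(sites, sr) + intercept
    return density * len(sites)

# Three hypothetical sites on a line (metres), with SR = 100m.
sites = [(0.0, 0.0), (100.0, 0.0), (1000.0, 0.0)]
index = site_service_index(sites, sr=100.0)
print(round(index, 3), round(predicted_demand_objects(sites, sr=100.0), 2))
```

Because the pairwise scan is quadratic in N, a spatial index would be preferable for the very large site sets discussed in the text.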

[Scatter plot: Demand Object Density (vertical axis, 0 to 160) against Site-Service Index (horizontal axis, 0 to 200) for the Grid_Point, Road_Intersection, and Block_Centroid modes, with fitted line y = 0.8335x + 1.5637 (R² = 0.998)]

Figure 5.6. Relationship between Site-Service Index and demand object density in SASDR with circular service coverage

Based on the above experimental results, it is obvious that the problem complexity could become extremely high when a large number of highly clustered potential facility sites is used.

In many practical applications, especially those working with continuous space, the number of potential facility sites could easily rise to thousands or even millions and they could be highly clustered relative to the service coverage. The rapid growth of problem size with the increase of potential facility sites could make the problem computationally intractable for exact methods in current optimization software. In addition, the realization of SASDR with a large number of potential facility sites could also be a challenge for current GIS software since polygon overlay algorithms are still among the most difficult and complex parts of vector-based GIS (Longley et al. 2005).

5.6.2 Comparison in Representation Error

Given SR=1000m, Table 5.2 shows the numbers of demand objects in the four groups of problems with three types of demand representations for comparison. In the SASDRs in this experiment, the configurations of 66 potential facility sites lead to about 2,000 demand objects, while 272 potential facility sites lead to over 30,000 demand objects. The different numbers of demand objects also reflect the degrees of granularity of the demand representations. Since the difference in the number of demand objects within each group is less than 0.1%, and the same configuration of potential facility sites is used for the three problems in each group, the problems in each group for comparison have similar degrees of complexity.

Table 5.2. Numbers of demand objects in all demand representations for comparison

Mode / Number of            Number of demand objects
potential facility sites    SASDR     Point or rectangle grid   Difference
Grid_Point/66               2,120     2,120                     0.00%
Grid_Point/272              37,012    36,988                    0.06%
Block_Centroid/66           2,425     2,426                     0.04%
Block_Centroid/272          37,721    37,715                    0.02%

Table 5.3 shows the minimum numbers of facilities reported by the objective functions to cover 100% demand in the study area. As expected, more potential facility sites usually need fewer facilities to cover the same demand space. We also notice that one more facility is needed for the grid-rectangle demand representation than for the other two demand representations when using

66 block centroids as the potential facility sites. This is mainly due to the underestimation of “real” covered demand by grid-rectangle demand representation.

Table 5.3. Minimum numbers of facilities reported by models for covering 100% demand

Mode / Number of            Minimum number of facilities for 100% demand coverage
potential facility sites    SASDR   Point grid   Rectangle grid
Grid_Point/66               8       8            8
Grid_Point/272              7       7            7
Block_Centroid/66           9       9            10
Block_Centroid/272          7       7            7

Figure 5.7 shows the percentages of covered demand reported by the MCLP models with three types of demand representations for four configurations of potential facility sites. Both the regular and irregular configurations of potential facility sites show similar characteristics in the percentage of covered demand. When there are only 66 potential facility sites, the grid-rectangle demand representations lead to lower percentages of covered demand than the SASDRs and point-based demand representations do. When the number of potential facility sites increases to 272, all three demand representations have very similar percentages of covered demand.


[Figure 5.7, panels (a) to (d): percentage of covered demand (0% to 100%) plotted against the number of facilities.]

Figure 5.7. Percentages of covered demand reported by the MCLP models with three types of demand representations when the configuration of potential facility sites includes: (a) 66 grid points, (b) 272 grid points, (c) 66 block centroids, and (d) 272 block centroids
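Coverage curves of the kind summarized in Figure 5.7 come from solving the MCLP repeatedly for an increasing number of facilities p. The following is a minimal sketch of that procedure, assuming a tiny hypothetical instance and exhaustive enumeration in place of the optimization software used in the study; the names `solve_mclp`, `coverage_curve`, `demand`, and `covers` are illustrative and do not come from the original work.

```python
from itertools import combinations


def solve_mclp(demand, covers, p):
    """Return (covered demand, chosen sites) for the best set of exactly p sites.

    demand: dict mapping demand-object id -> demand weight
    covers: dict mapping candidate site id -> set of demand-object ids it covers
    Exhaustive search, so only practical for toy instances.
    """
    best_value, best_sites = -1.0, ()
    for sites in combinations(sorted(covers), p):
        covered = set().union(*(covers[s] for s in sites))
        value = sum(demand[i] for i in covered)
        if value > best_value:
            best_value, best_sites = value, sites
    return best_value, best_sites


def coverage_curve(demand, covers):
    """Percentage of covered demand for p = 1, 2, ... until 100% is reached
    or all candidate sites are used."""
    total = sum(demand.values())
    curve = []
    for p in range(1, len(covers) + 1):
        value, sites = solve_mclp(demand, covers, p)
        curve.append((p, 100.0 * value / total, sites))
        if value >= total:
            break
    return curve


# Hypothetical toy data (not taken from the study).
demand = {"d1": 10, "d2": 20, "d3": 30, "d4": 40}
covers = {
    "A": {"d1", "d2"},
    "B": {"d2", "d3"},
    "C": {"d3", "d4"},
    "D": {"d4"},
}

for p, pct, sites in coverage_curve(demand, covers):
    print(p, round(pct, 1), sites)
```

The last entry of the curve gives the minimum number of facilities at which the model reports 100% covered demand, which is the quantity tabulated in Table 5.3.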

Using SASDR as the reference, Table 5.4 shows the percent cost and optimality errors between the grid-point-based demand representations and the SASDRs for the four configurations of potential facility sites. The cost errors make up the primary part of the representation errors in each group. The magnitudes of the cost errors become smaller when more demand objects (i.e., a finer granularity of demand representation) are used. In addition, the non-zero cost errors are either positive or negative, which matches our expectation that a point-based demand representation can either overestimate or underestimate covered demand.

Table 5.4 also shows that only a few non-zero optimality errors occur when 66 potential facility sites are used with about 2,000 demand objects. When 272 potential facility sites are used with over 30,000 demand objects, all optimality errors are 0. This observation shows that, as the granularity of demand representation becomes finer, the differences between the optimal configurations of facilities given by the MCLP models with point-based demand representation and with SASDR generally become smaller. We also notice that, when the number of demand objects is small, the real 100% covered demand may not be reached even though the models report 100% covered demand, as with 8 facilities for the 66 grid-point sites and 9 facilities for the 66 block-centroid sites in this experiment.

Table 5.4. Cost and optimality errors between grid-point-based demand representations and SASDRs

| p | Grid_Point/66 (2120): Cost | Grid_Point/66 (2120): Optimality | Grid_Point/272 (36988): Cost | Grid_Point/272 (36988): Optimality | Block_Centroid/66 (2426): Cost | Block_Centroid/66 (2426): Optimality | Block_Centroid/272 (37715): Cost | Block_Centroid/272 (37715): Optimality |
|---|---|---|---|---|---|---|---|---|
| 1 | -0.08% | 0.00% | -0.06% | 0.00% | 0.06% | 0.00% | 0.01% | 0.00% |
| 2 | -0.03% | 0.00% | 0.05% | 0.00% | 0.12% | 0.00% | 0.00% | 0.00% |
| 3 | 0.24% | 0.00% | -0.02% | 0.00% | 0.12% | -0.07% | 0.02% | 0.00% |
| 4 | 0.39% | 0.00% | 0.00% | 0.00% | 0.26% | -0.12% | 0.02% | 0.00% |
| 5 | 0.14% | 0.00% | 0.00% | 0.00% | 0.14% | 0.00% | -0.02% | 0.00% |
| 6 | -0.02% | 0.00% | 0.00% | 0.00% | 0.04% | 0.00% | 0.01% | 0.00% |
| 7 | 0.00% | 0.00% | 0.00% | 0.00% | 0.07% | 0.00% | 0.00% | 0.00% |
| 8 | 0.02% | -0.02% | | | 0.01% | 0.00% | | |
| 9 | | | | | 0.03% | -0.03% | | |

Note: the number in parentheses is the number of demand objects in each demand representation.

Table 5.5 shows the percent cost and optimality errors between the grid-rectangle-based demand representations and the SASDRs for the four configurations of potential facility sites. The magnitudes of both cost and optimality errors are generally larger than those of the grid-point-based demand representations shown in Table 5.4. The cost errors are still the primary part of the representation errors for the grid-rectangle-based demand representations. In addition, the non-zero cost errors are all negative, which reflects that the grid-rectangle-based demand representation usually underestimates covered demand. As with the grid-point-based demand representations shown in Table 5.4, a finer granularity of demand representation decreases the difference between the optimal configurations of facilities given by the MCLP models with grid-rectangle-based demand representation and with SASDR. Moreover, the grid-rectangle-based demand representations can offer solutions that cover real 100% demand.

Table 5.5. Cost and optimality errors between grid-rectangle-based demand representations and SASDRs

| p | Grid_Point/66 (2120): Cost | Grid_Point/66 (2120): Optimality | Grid_Point/272 (36988): Cost | Grid_Point/272 (36988): Optimality | Block_Centroid/66 (2426): Cost | Block_Centroid/66 (2426): Optimality | Block_Centroid/272 (37715): Cost | Block_Centroid/272 (37715): Optimality |
|---|---|---|---|---|---|---|---|---|
| 1 | -7.69% | 0.00% | -1.85% | 0.00% | -7.61% | 0.00% | -1.82% | 0.00% |
| 2 | -7.70% | 0.00% | -1.72% | 0.00% | -5.85% | 0.00% | -2.04% | 0.00% |
| 3 | -4.77% | -0.37% | -1.51% | 0.00% | -4.62% | 0.00% | -1.09% | 0.00% |
| 4 | -3.01% | -0.89% | -0.87% | 0.00% | -3.38% | 0.00% | -0.68% | 0.00% |
| 5 | -1.83% | 0.00% | -0.36% | 0.00% | -1.33% | -0.61% | -0.31% | -0.07% |
| 6 | -0.51% | -0.18% | -0.02% | 0.00% | -0.84% | 0.00% | -0.07% | 0.00% |
| 7 | -0.06% | 0.00% | 0.00% | 0.00% | -0.45% | 0.00% | 0.00% | 0.00% |
| 8 | 0.00% | 0.00% | | | -0.19% | -0.09% | | |
| 9 | | | | | -0.05% | -0.02% | | |
| 10 | | | | | 0.00% | 0.00% | | |

Note: the number in parentheses is the number of demand objects in each demand representation.
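The different signs of the cost errors in Tables 5.4 and 5.5 can be illustrated with a simple geometric test. The sketch below assumes a circular service area of radius SR, treats a demand point as covered when it lies within SR of the facility, and treats a grid rectangle as covered only when it lies entirely inside the service area. These coverage rules, the 500 m cell, and all function names are assumptions made for illustration; they are not the GIS procedure used in the study.

```python
import math


def point_covered(facility, point, sr):
    """A demand point is covered if it lies within the service radius."""
    return math.dist(facility, point) <= sr


def rectangle_covered(facility, xmin, ymin, xmax, ymax, sr):
    """A rectangle lies entirely inside a disc if and only if all four
    corners do, because a disc is convex."""
    corners = [(xmin, ymin), (xmin, ymax), (xmax, ymin), (xmax, ymax)]
    return all(math.dist(facility, c) <= sr for c in corners)


def covered_fraction(facility, xmin, ymin, xmax, ymax, sr, n=200):
    """Approximate share of the rectangle's area inside the service area,
    sampled at the midpoints of an n x n lattice of sub-cells."""
    inside = 0
    for i in range(n):
        for j in range(n):
            x = xmin + (i + 0.5) * (xmax - xmin) / n
            y = ymin + (j + 0.5) * (ymax - ymin) / n
            inside += point_covered(facility, (x, y), sr)
    return inside / (n * n)


facility, sr = (0.0, 0.0), 1000.0   # SR = 1000 m, as in the experiment
cell = (600.0, 0.0, 1100.0, 500.0)  # hypothetical 500 m grid cell
centroid = ((cell[0] + cell[2]) / 2, (cell[1] + cell[3]) / 2)

print(point_covered(facility, centroid, sr))   # cell counted as fully covered
print(rectangle_covered(facility, *cell, sr))  # cell counted as not covered
print(covered_fraction(facility, *cell, sr))   # share actually inside the disc
```

For a cell that straddles the service boundary, the point rule counts all of its demand or none of it depending on where the point falls, so the error can take either sign, whereas the containment rule for rectangles can only miss demand that is in fact covered.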

Based on the experimental results about representation error described above, we have the following main findings:

(1) SASDR and the traditional area-based demand representations (e.g., use of census units or regular polygons as demand objects) can both offer solutions providing real 100% covered demand if the whole demand space can be covered by a sufficient number of facilities with a given configuration of potential facility sites. However, the minimum number of facilities needed with the traditional area-based demand representation could be larger than in the optimal solution. Point-based demand representation with coarse granularity has difficulty offering solutions that provide real 100% covered demand. However, a finer granularity of point-based demand representation could mitigate the problem.

(2) Given similar problem sizes and using SASDR as the reference, when the granularity of demand representation is relatively coarse, the representation errors, including cost and optimality errors, associated with both point-based and the traditional area-based demand representations are evident. However, when the granularity of demand representation is fine, the representation errors, especially the optimality errors, could become very small. In that case, the model solutions for the configuration of facilities could be equally effective no matter which type of demand representation is used.

(3) When the degrees of granularity are close, grid-point-based demand representation usually performs better than grid-area-based demand representation in terms of both cost and optimality error.

These main findings have implications for choosing an appropriate spatial demand representation in practical applications. When a small number of potential facility sites is needed or real 100% covered demand is required, SASDR is a good choice. When the number of potential facility sites is large enough to produce a SASDR with very fine granularity, using a point-based demand representation may be a good choice based on the following considerations. If a SASDR results in a large problem size that is nevertheless still solvable by exact methods in current optimization software, using a point-based demand representation with similar problem complexity as an alternative can give similar modeling solutions while avoiding the extensive computation in GIS needed to realize the SASDR. In point-based demand representation, the number of demand points is independent of the configuration of potential facility sites, which provides a flexible way to balance problem complexity and representation error. If a problem using SASDR is too complex to solve by exact methods in current optimization software, it can be replaced by a point-based demand representation with fewer demand objects, whose number can be chosen according to the capability of the optimization software. The loss of covered demand due to the representation errors could be compensated for by increasing the number of potential facility sites without a large increase in problem size.

Regular-area-based demand representation is not strongly recommended because, given similar problem sizes, its performance is usually not as good as point-based demand representation and it also needs spatial analysis functions in GIS to examine the topological relationship between service coverage and each regular areal demand unit, which could be very time-consuming.

5.7 Conclusions

Spatial demand representation is an important topic in location modeling because it is necessary for applying location models to the planning process and is strongly associated with the efficiency of modeling solutions. A spatial demand representation should not only minimize representation error but also keep the complexity of the model as low as possible. Most current research, however, focuses primarily on assessing and trying to reduce or eliminate representation error while ignoring the model complexity associated with demand representation. In this paper, we use expressions of set theory to formalize SASDR, a polygon-overlay-based demand representation originally described by Cromley et al. (2012) and also used for siting emergency vehicles by Yin and Mu (2012). Using the MCLP as an example, we then empirically compare SASDR to widely used point-based and regular-area-based demand representations in terms of both problem complexity and representation error.

SASDR has several advantages, including being able to offer solutions providing real 100% covered demand and eliminating some errors associated with point-based and other area-based demand representations. However, our study shows that the complexity of a problem with SASDR could become extremely high as the number and the degree of clustering of potential facility sites increase. This could lead to a dilemma for many practical applications, where it is common to set a large number of potential facility sites in order to cover more demand. Many covering location problems are themselves nondeterministic polynomial time (NP)-hard (Megiddo et al. 1981), which means that no algorithm has yet been discovered that solves them in polynomial time in the worst case. Therefore, these problems could become more difficult, if not impossible, to solve with SASDR by exact methods in current commercial optimization software. In such cases, heuristic methods may be the only option; they, however, could introduce other errors into modeling solutions and require sophisticated algorithmic strategies and strong programming skills. In addition, the realization of SASDR for a large number of potential facility sites could also be a computational challenge for current GIS software.

The empirical comparisons of problems with similar degrees of complexity, but different spatial demand representations, provide some insight into how to choose an appropriate spatial demand representation in practical applications. Point-based demand representation is sometimes a good alternative to SASDR when the problem with SASDR is too complex to solve by exact methods in current optimization software.

Point-based and regular-area-based demand representations can be very flexible, depending on the number and arrangement of demand objects as well as the shape of the areal unit. In this study, we choose only a limited number of point-based and regular-area-based demand representations as examples to explore their characteristics in MCLP modeling. Our findings may therefore not generalize well to all situations.

In addition, it should be noted that the MCLP has been extended to incorporate more considerations to meet specific application requirements, such as capacitated facilities (Chung et al. 1983, Current and Storbeck 1988, Haghani 1996) and the allocation of demand beyond the covering standard in emergency service planning (Pirkul and Schilling 1991, Yin and Mu 2012). In these variations of the MCLP, the allocation of demand to facilities needs to be considered. The aggregation and representation errors in demand allocation could be one topic of our future research.

References

Alexandris, G. & Giannikos, I., 2010. A new model for maximal coverage exploiting GIS capabilities. European Journal of Operational Research, 202 (2), 328-338.

ARC, 2012. GIS data and maps [online]. http://www.atlantaregional.com/info-center/gis-data-maps/gis-data [Accessed 2012].

Bowerman, R.L., Calamai, P.H. & Brent Hall, G., 1999. The demand partitioning method for reducing aggregation errors in p-median problems. Computers & Operations Research, 26 (10-11), 1097-1111.

Casillas, P., 1983. Data aggregation and the p-median problem in continuous space. In Ghosh, A. & Rushton, G. eds. Spatial analysis and location-allocation models. New York: Van Nostrand Reinhold, 327-344.

Chung, C., Schilling, D. & Carbone, R., 1983. The capacitated maximal covering problem: A heuristic. Proceedings of the Fourteenth Annual Pittsburgh Conference on Modeling and Simulation, 1423-1428.

Church, R. & Revelle, C., 1974. The maximal covering location problem. Papers in regional science, 32 (1), 101-118.

Cplex Help, 2011. Branch and cut [online]. http://www.iro.umontreal.ca/~gendron/IFT6551/CPLEX/HTML/usrcplex/solveMIP9.html#638133 [Accessed 2011].

Cromley, R.G., Lin, J. & Merwin, D.A., 2012. Evaluating representation and scale error in the maximal covering location problem using GIS and intelligent areal interpolation. International Journal of Geographical Information Science, 26 (3), 495-517.

Current, J. & Storbeck, J., 1988. Capacitated covering models. Environment and Planning B, 15, 153-164.

Current, J.R. & Schilling, D.A., 1987. Elimination of source A and B errors in p-median location problems. Geographical Analysis, 19 (2), 95-110.

Current, J.R. & Schilling, D.A., 1990. Analysis of errors due to demand data aggregation in the set covering and maximal covering location problems. Geographical Analysis, 22 (2), 116-126.

Daskin, M.S., Haghani, A.E., Khanal, M. & Malandraki, C., 1989. Aggregation effects in maximum covering models. Annals of Operations Research, 18 (1), 113-139.

Francis, R., Lowe, T., Rayco, M. & Tamir, A., 2009. Aggregation error for location models: Survey and analysis. Annals of Operations Research, 167 (1), 171-208.

Goodchild, M.F., 1979. The aggregation problem in location-allocation. Geographical Analysis, 11 (3), 240-255.

Haghani, A., 1996. Capacitated maximum covering location models: Formulations and solution procedures. Journal of advanced transportation, 30 (3), 101-136.

Hillsman, E.L. & Rhoda, R., 1978. Errors in measuring distances from populations to service centers. The Annals of Regional Science, 12 (3), 74-88.

Hodgson, M.J. & Neuman, S., 1993. A GIS approach to eliminating source C aggregation error in p-median models. Computers & Operations Research.

Kim, K. & Murray, A.T., 2008. Enhancing spatial representation in primary and secondary coverage location modeling. Journal of Regional Science, 48 (4), 745-768.

Longley, P.A., Goodchild, M.F., Maguire, D.J. & Rhind, D.W., 2005. Geographic information systems and science, 2nd ed.: John Wiley & Sons, Ltd.

McHarg, I.L. & American Museum of Natural History, 1969. Design with nature, 1st ed. Garden City, N.Y.: Published for the American Museum of Natural History [by] the Natural History Press.

Megiddo, N., Zemel, E. & Hakimi, S.L., 1981. The maximum coverage location problem: Northwestern University.

Miller, H.J., 1996. GIS and geometric representation in facility location problems. International Journal of Geographical Information Systems, 10 (7), 791-816.

Murray, A.T., 2005. Geography in coverage modeling: Exploiting spatial structure to address complementary partial service of areas. Annals of the Association of American Geographers, 95 (4), 761-772.

Murray, A.T. & O'Kelly, M.E., 2002. Assessing representation error in point-based coverage modeling. Journal of geographical systems, 4 (2), 171-191.

Murray, A.T., O'Kelly, M.E. & Church, R.L., 2008. Regional service coverage modeling. Computers & Operations Research, 35 (2), 339-355.

Openshaw, S. & Taylor, P.J., 1981. The modifiable areal unit problem. In Wrigley, N. & Bennett, R. eds. Quantitative geography: A British view. London and Boston: Routledge and Kegan Paul, 60-69.

Pirkul, H. & Schilling, D.A., 1991. The maximal covering location problem with capacities on total workload. Management Science, 37 (2), 233-248.

Tong, D., Murray, A. & Xiao, N., 2009. Heuristics in spatial analysis: A genetic algorithm for coverage maximization. Annals of the Association of American Geographers, 99 (4), 698-711.

Tong, D. & Murray, A.T., 2009. Maximising coverage of spatial demand for service. Papers in regional science, 88 (1), 85-97.

Yin, P. & Mu, L., 2012. Modular capacitated maximal covering location problem for the optimal siting of emergency vehicles. Applied Geography, 34 (0), 247-254.


CHAPTER 6

CONCLUSIONS

6.1 Summary and Conclusions

With the increasing availability of digital health data and of environmental, socioeconomic, and behavioral data, Geographic Information Systems (GIS) are receiving increased attention in public health studies. This dissertation research focuses mainly on three aspects of health studies using GIS and spatial analysis: spatial disease cluster detection, spatio-temporal disease mapping, and health service planning. New methods or models are proposed and implemented with GIS in this research to address an important problem in each of the three aspects.

With respect to the detection of spatial disease clusters, our study is the first to implement and test Tango’s (2008) restricted likelihood ratio combined with Assunção et al.’s (2006) dynamic Minimum Spanning Tree (dMST) search strategy to quickly detect disease clusters of arbitrary shape. To understand the performance of this redesigned hybrid method in various situations, we design six cluster models and two non-cluster scenarios. These cluster models consider different numbers of disease cases in a study area and various shapes of clusters. The choice of the screening level α1 in the restricted likelihood ratio is also explored in our redesigned spatial scan statistic method (RSScan). In addition to statistical power, we propose using the Kappa Index of Agreement (KIA) to evaluate and compare how well cluster detection methods identify the boundaries of clusters, which avoids effects due to the differing properties of the cluster models. Finally, we apply our RSScan method to detect clusters of lung cancer incidence in Georgia for the period 1998-2005. The experimental results indicate that the RSScan method with an appropriate screening level α1 generally has higher power and accuracy than Tango’s method, Assunção et al.’s method, and Kulldorff’s circular spatial scan statistic method (CSScan) for clusters of irregular shape. Based on numeric experiments, our study recommends 0.2 as the default screening level α1 in the RSScan method to obtain higher statistical power and more accurate cluster boundaries. It should also be noted that the performance of both the RSScan method and the other three methods varies with the situation, such as the count of disease incidence cases and the true cluster shape. This finding corresponds well with the power analysis given by Waller and Gotway (2004), which shows that most tests to detect clusters have spatially heterogeneous power.
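The role of the KIA in comparing detected and true cluster boundaries can be sketched as follows. This is a minimal illustration that assumes each areal unit carries a binary label (1 if the unit belongs to the cluster, 0 otherwise) and uses the standard Cohen's kappa formula; the function name and the example labels are hypothetical and are not taken from the experiments.

```python
def kappa_index_of_agreement(true_labels, detected_labels):
    """Cohen's kappa for two binary labelings of the same areal units
    (1 = unit inside the cluster, 0 = unit outside)."""
    if len(true_labels) != len(detected_labels) or not true_labels:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(true_labels)
    # Observed agreement: share of units labeled identically.
    p_o = sum(t == d for t, d in zip(true_labels, detected_labels)) / n
    # Chance agreement from the marginal proportions of each labeling.
    p_true = sum(true_labels) / n
    p_det = sum(detected_labels) / n
    p_e = p_true * p_det + (1 - p_true) * (1 - p_det)
    if p_e == 1.0:
        # Degenerate case: both labelings are constant and identical.
        # Returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical example with ten areal units.
true_cluster = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
detected_cluster = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(kappa_index_of_agreement(true_cluster, detected_cluster))
```

A value of 1 indicates that the detected cluster matches the true cluster exactly, while values near 0 indicate agreement no better than chance.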

Given that there are only a limited number of lung cancer studies in Georgia, especially at a fine spatio-temporal scale, our research, which uses hierarchical Bayesian models to explore the spatio-temporal patterns of lung cancer incidence risks in Georgia from 2000 to 2007, contributes to the geospatial health analysis literature. The study is conducted at the census tract level using a two-year time period as the temporal unit. The fine spatial and temporal scales enable the study to show more detailed variations of lung cancer incidence risks in space and time, which can better support assessing healthcare performance, establishing potential etiological hypotheses, and making effective and efficient health policies. Compared to the crude Standardized Incidence Ratio (SIR), a Bayesian spatio-temporal model can provide more reliable estimates of disease risk at a fine spatio-temporal scale. A total of seven Bayesian spatio-temporal models under the separate and joint modeling frameworks are developed and compared. In this study, the joint models generally perform better than the separate models when the deviance information criterion (DIC) is used as the criterion. The study also shows that there are strong inverse relationships between socioeconomic status (SES) and lung cancer incidence risk in Georgia males, especially white males, and weak inverse relationships in both white and black Georgia females. This could lead to further studies on the underlying reasons, such as occupational risk factors.
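The instability of the crude SIR at a fine scale, which motivates the Bayesian models, can be seen from a small worked example. The sketch below assumes indirect standardization with a single set of reference rates; the strata, populations, and rates are hypothetical values chosen for illustration and are not data from the study.

```python
def expected_cases(population_by_stratum, reference_rates):
    """Expected count under indirect standardization: the sum over strata
    of stratum population times the reference incidence rate."""
    return sum(
        population_by_stratum[s] * reference_rates[s]
        for s in population_by_stratum
    )


def standardized_incidence_ratio(observed, expected):
    """Crude SIR: observed cases divided by expected cases."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return observed / expected


# Hypothetical census tract and reference rates (cases per person per period).
tract_population = {"40-64": 1200, "65+": 300}
reference_rates = {"40-64": 0.0005, "65+": 0.003}

expected = expected_cases(tract_population, reference_rates)
for observed in (1, 2, 3):
    print(observed, standardized_incidence_ratio(observed, expected))
```

With an expected count of about 1.5 cases, each additional observed case shifts the crude SIR by roughly two thirds, so small tracts can swing between apparently low and apparently high risk by chance alone.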

The modular capacitated maximal covering location problem (MCMCLP) developed in Chapter 4 is an extension of the capacitated version of the maximal covering location problem (MCLP) that accommodates situations where the facilities to be sited have several possible capacity levels. For the optimal siting of emergency vehicles, the MCMCLP considers the modular capacity levels of a facility, the allocation of all demands, and the proximity of the uncovered allocated demands to facilities. Two models, the MCMCLP-NFC and the MCMCLP-FC, can be used depending on the circumstances of the facility. As an example, these two models are successfully applied to optimally site ambulances for emergency medical services (EMS) Region 10 in Georgia. GIS plays an important role in the MCMCLP models. It is used to manage and organize the spatial data, to realize the spatial demand representation, to help construct the model input file for optimization software packages, and to visualize the problem solution with maps. In addition to these important functions, GIS also facilitates theoretical advances in current location science (Church 2002, Murray 2010).

Spatial demand representation is an important topic in location-allocation modeling, including the MCMCLP discussed above. A spatial demand representation should not only minimize representation error but also keep the complexity of the model as low as possible. In Chapter 5, we use expressions of set theory to formalize the service area spatial demand representation (SASDR). Using the MCLP as an example, we then empirically compare SASDR to widely used point-based and regular-area-based demand representations in terms of both problem complexity and representation error. SASDR has several advantages, including being able to offer solutions providing real 100% covered demand and eliminating some errors associated with point-based and other area-based demand representations. However, our study shows that the complexity of a problem with SASDR could become extremely high as the number and the degree of clustering of potential facility sites increase. This could lead to a dilemma for many practical applications, where it is common to set a large number of potential facility sites in order to cover more demand. In addition, the realization of SASDR for a large number of potential facility sites could also be a computational challenge for current GIS software. The empirical comparisons of problems with similar degrees of complexity but different spatial demand representations indicate that point-based demand representation could be a good alternative to SASDR when the problem with SASDR is too complex to solve by exact methods in current optimization software.

6.2 Future Research

Based on the results of this dissertation research, future research will continue to use GIS and spatial analysis to advance health studies. Three example research directions are as follows:

(1) New method for disease cluster detection

Although our RSScan method shows good statistical power and relatively high accuracy in delineating the boundaries of spatial disease clusters of arbitrary shape, its weaknesses also need to be noted. Our experiments show that the statistical power of the RSScan method varies with the number of disease cases, the shape of the true cluster, and the pattern of the population at risk. The same is true of other existing cluster detection methods. The relatively arbitrary choice of the screening level parameter in the restricted likelihood ratio makes the RSScan method difficult to use in practice. Therefore, improving the statistical power and the accuracy of the boundaries of detected clusters of arbitrary shape is one task of my future research. This could be achieved by seeking more efficient artificial intelligence methods as search strategies and by constructing better penalty parameters for test statistics. Recently, a multi-objective algorithm (Cançado et al. 2010) was proposed to avoid or mitigate the subjectivity in choosing the penalty or other parameters in the test statistics of traditional cluster detection methods. This could be a direction for my future research. In addition, extending cluster detection from the spatial dimension to the spatial and temporal dimensions is receiving considerable interest in disease surveillance. I will explore new methods for detecting spatio-temporal disease clusters in my future studies.

(2) Risk factors for lung cancer in Georgia

My dissertation research shows the spatio-temporal patterns of lung cancer incidence risks by race and sex across the whole of Georgia from 2000 to 2007. These patterns could aid authorities in making more effective health policies and in planning healthcare services to reduce health disparities and promote public health. However, to better prevent lung cancer, an important question needs to be answered: what factors lead to such patterns? For example, why does northwest Georgia have stably high lung cancer incidence risks for all population subgroups? Studying the environmental factors related to the spatio-temporal patterns of lung cancer incidence risks in Georgia is one of my future research tasks. For example, how is the distribution of radon in underground water correlated with lung cancer incidence in Georgia?

(3) Dynamic factors in health service planning

People usually concentrate in workplaces or commercial districts in the daytime and stay in residences at night. Considering such population movements in health service planning could greatly improve the efficiency and efficacy of resource use, especially for emergency vehicles such as the ambulances discussed in my dissertation. In the future, I will integrate dynamic demand factors into my MCMCLP models to solve more practical problems.

References

Assunção, R., Costa, M., Tavares, A. & Ferreira, S., 2006. Fast detection of arbitrarily shaped disease clusters. Statistics in Medicine, 25 (5), 723-742.

Cançado, A.L.F., Duarte, A.R., Duczmal, L.H., Ferreira, S.J., Fonseca, C.M. & Gontijo, E.C.D.M., 2010. Penalized likelihood and multi-objective spatial scans for the detection and inference of irregular clusters. International Journal of Health Geographics, 9 (1), 55.

Church, R.L., 2002. Geographical information systems and location science. Computers & Operations Research, 29 (6), 541-562.

Murray, A.T., 2010. Advances in location modeling: GIS linkages and contributions. Journal of geographical systems, 12 (3), 335-354.

Tango, T., 2008. A spatial scan statistic with a restricted likelihood ratio. Japanese Journal of Biometrics, 29 (2), 75-95.

Waller, L. & Gotway, C., 2004. Applied spatial statistics for public health data: Wiley-Interscience.


APPENDIX I

LIST OF ACRONYMS

| Acronym | Full description |
|---|---|
| 2SFCA | Two-step Floating Catchment Area |
| CAR | Conditional Autoregression |
| CEPP | Cluster Evaluation Permutation Procedure |
| CI | Credible Interval |
| CSScan | Circular Spatial Scan Statistic |
| DCA | Department of Community Affairs |
| DIC | Deviance Information Criterion |
| dMST | Dynamic Minimum Spanning Tree |
| EMS | Emergency Medical Services |
| FC | Facility-constraint |
| GA | State of Georgia |
| GAM | Geographical Analysis Machine |
| GIS | Geographic Information Systems |
| HSIP | Homeland Security Infrastructure Program |
| KIA | Kappa Index of Agreement |
| LCDCU | Least Common Demand Coverage Unit |
| MAUP | Modifiable Areal Unit Problem |
| MCMCLP | Modular Capacitated Maximal Covering Location Problem |
| MCLP | Maximal Covering Location Problem |
| MIP | Mixed Integer Programming |
| MTFCC | MAF/TIGER Feature Class Codes |
| NFC | Non-facility-constraint |
| NP | Nondeterministic Polynomial Time |
| RR | Relative Risk |
| RSScan | Redesigned Spatial Scan Statistic |
| SASDR | Service Area Spatial Demand Representation |
| SES | Socioeconomic Status |
| SIR | Standardized Incidence Ratio |
| SR | Service Radius |
| VBA | Visual Basic for Applications |
