Synthesized Population Databases: a US Geospatial Database for Agent-Based Models

Methods Report

Synthesized Population Databases: A US Geospatial Database for Agent-Based Models

William D. Wheaton, James C. Cajka, Bernadette M. Chasteen, Diane K. Wagener, Philip C. Cooley, Laxminarayana Ganapathi, Douglas J. Roberts, and Justine L. Allpress

May 2009

RTI Press About the Authors RTI Press publication MR-0010-0905 William D. Wheaton, MA, is a senior research geographer and director of This PDF document was made available from www.rti.org as a public service RTI International’s Geospatial Science of RTI International. More information about RTI Press can be found at and Technology program. http://www.rti.org/rtipress. James C. Cajka, MA, Bernadette M. Chasteen, MA, and Justine RTI International is an independent, nonprofit research organization dedicated L. Allpress, MA, are research GIS analysts at RTI International. to improving the human condition by turning knowledge into practice. The Diane F. Wagner, PhD, is a senior RTI Press mission is to disseminate information about RTI research, analytic epidemiologist at RTI International. tools, and technical expertise to a national and international audience. RTI Press Philip C. Cooley, MS, is an RTI publications are peer-reviewed by at least two independent substantive experts Fellow in bioinformatics and high- and one or more Press editor. performance computing at RTI International. Suggested Citation Laxminarayana Ganapathi, PhD, Wheaton, W.D., Cajka, J.C., Chasteen, B.M., Wagener, D.K., Cooley, P.C., and Douglas J. Roberts, MS, are programmers/analysts at RTI Ganapathi, L., Roberts, D.J., and Allpress, J.L. (2009). Synthesized Population International. Databases: A US Geospatial Database for Agent-Based Models. RTI Press publication No. MR-0010-0905. Research Triangle Park, NC: RTI International. Retrieved [date] from http://www.rti.org/rtipress.

This publication is part of the RTI Press Methods Report series. RTI International ©2009 Research Triangle Institute. RTI International is a trade name of Research Triangle 3040 Cornwallis Road Institute. PO Box 12194 Research Triangle Park, NC All rights reserved. Please note that this document is copyrighted and credit must be provided to 27709-2194 USA the authors and source of the document when you quote from it. You must not sell the document Tel: +1.919.541.6000 or make a profit from reproducing it. Fax: +1.919.541.5985 doi:10.3768/rtipress.2009.mr.0010.0905 E-mail: [email protected] Web site: www.rti.org www.rti.org/rtipress Contents Synthesized Population Databases: Introduction 2 Conceptual Model of the A US Geospatial Database for Synthesized Agent Database 3 Methods 5 Agent-Based Models Generating Synthesized Households and Persons 5 Assigning Agents to Schools 8 William D. Wheaton, James C. Cajka, Assigning Agents to Bernadette M. Chasteen, Diane K. Wagener, Workplaces 11 Philip C. Cooley, Laxminarayana Ganapathi, Generating Group Quarters and Assigning Agents Douglas J. Roberts, and Justine L. Allpress to Them 12 Computer Resources, Time Resources, and Scalability 14 Summary and Conclusions 14 Abstract References 15 Agent-based models simulate large-scale social systems. They assign behaviors Acknowledgments Inside back cover and activities to “agents” (individuals) within the population being modeled and then allow the agents to interact with the environment and each other in complex simulations. Agent-based models are frequently used to simulate infectious disease outbreaks, among other uses.

RTI used and extended an iterative proportional fitting method to generate a synthesized, geospatially explicit, human agent database that represents the US population in the 50 states and the District of Columbia in the year 2000. Each agent is assigned to a household; other agents make up the household occupants.

For this database, RTI developed the methods for • generating synthesized households and persons • assigning agents to schools and workplaces so that complex interactions among agents as they go about their daily activities can be taken into account • generating synthesized human agents who occupy group quarters (military bases, college dormitories, prisons, nursing homes). • In this report, we describe both the methods used to generate the synthesized population database and the final data structure and data content of the database. This information will provide researchers with the information they need to use the database in developing agent-based models.

Portions of the synthesized agent database are available to any user upon request. RTI will extract a portion (a county, region, or state) of the database for users who wish to use this database in their own agent-based models. 2 Wheaton et al., 2009 RTI Press

Introduction The database described in this report was developed to support the infectious disease models of the An agent-based model is a computational model Modeling of Infectious Disease Agents Study for simulating the actions and interactions of (MIDAS), funded by the National Institutes of autonomous individuals (hereafter referred to as Health (see www.midasmodels.org). These models “agents”) in a network so as to assess their effects are used to study the spread of infectious diseases on the system as a whole. Agent-based models have through social contact. Cooley et al.,9 for example, been applied to such disparate fields as business used the synthesized population database in a (e.g., supply chain optimization,1 logistics, consumer model studying the spread of seasonal influenza in behavior2), traffic congestion,2 infectious agents,3-6 North Carolina. The simulation model predicted social and financial policy,7 and the human immune the spread of influenza by calculating probabilities system.8 In these and other applications, the system of disease spread between individuals represented of interest is simulated by capturing the behavior in the synthesized agent database. The synthesized of individual agents and their interactions. Agent- agent database, along with the associated school and based models can be used to test how changes in the workplace assignments, provided the agents for the behaviors of individual agents will affect the system’s model and baseline data about how individuals may overall behavior. come into contact with each other at school or work. Figure 1 illustrates a portion of the model results. Agent-based models are built on microdata (i.e., Specifically, it shows the percentage of persons who data on individual agents), not on aggregated are ill on day 85 of a simulated influenza outbreak population summary data. Microdata—in which in central North Carolina. Yellow dots are low each data element is associated with a specific percentages, orange are medium, and red are high. agent—allow researchers to specify and model The epidemic is concentrated in the western part of important population characteristics and structures, the study area. such as family structure, that are derived from data on individuals. For example, microdata on a single For models that depend on social interactions, household of four would contain information about aggregating people into settings that enhance the the age, sex, and relationship of each individual to the probability of interpersonal contact is especially household and to each other. This level of detailed important. Close interpersonal contact occurs in a structure could not be derived from aggregated limited number of settings, such as a households, population data. workplaces, schools, or hospitals. The methods

Figure 1. An example of how the synthetic agent database may be used in disease modeling Synthesized Population Database for Agent-Based Models 3 reported here result in a georeferenced population (highlighted in blue) are shown, as is the link of households containing individuals with correct between the household and its four occupants. The and appropriate age and sex demographics. The database contains the geographic location, household result is a synthetic population that reflects the actual attributes, and data on the four persons who live in population and household family relationships in the household. If all of the households in a geospatial different areas of the United States. We also developed area (such as a census tract, county, or state) were a separate, related method to generate the locations aggregated, the resulting synthetic population would of group quarters (military bases, college dormitories, be consistent with the census data for that area. prisons, and nursing homes) and their occupants, The remainder of this report describes the conceptual but it is not the focus of this paper. In addition, the model of the synthesized agent database and the synthetic population does not include the homeless: methods we used to generate it. We also summarize although the Census Bureau collected some data the methods used to assign agents to schools, on the homeless population as part of the 2000 workplaces, and group quarters. Portions of the decennial Census, it was not considered a census of synthesized agent database are available to any user the homeless and the data are not reported within upon request. RTI will extract a portion (a county, the general Census databases used to generate the region, or state) of the database for users who wish to synthesized population database described here.10 use this database in their own agent-based models. The synthesized population database contains a record for each household and separate records for each individual in the United States. All individual Conceptual Model of the Synthesized occupants are identified for any given household. Agent Database Household characteristics, characteristics of each The final synthesized database consists of two key person, and the link between individuals and the database tables: one contains a record representing households they belong to are maintained. Figure 2 each household (the Household table), and the other shows a map of a portion of the geospatial database contains a record representing each individual who consisting of an area about 2 miles wide. The black lives in a household (the Person table), as shown in dots represent the synthesized households in this Figure 2. area. The characteristics of a single household

Figure 2. The integrated spatial and tabular model of the synthesized population database

Randomly selected synthetic household

Household attributes

Persons 4 Wheaton et al., 2009 RTI Press

Table 1 depicts a portion of the Household table. contains the unique identifier of the household Here, each record (row) represents a different in which each agent resides; “pnum” is a number household. The hh_id is a unique identifier for every assigned to each individual in the household; “age” household in the United States. “Persons” represents contains the person’s age; “sex” contains the person’s a count of the persons in the household; “hinc” sex; and the remaining fields contain additional represents the total income of the household; “p18” attributes about each person, such as the school represents the number of persons in the household and workplace to which he or she is assigned. The less than 18 years of age; and “wif” represents table also contains xcoord, ycoord, lat, and long, as the number of working people in the family. The described previously for the household table but not “xcoord” and “ycoord” fields contain the location of shown here. each household in a standard map projection (Albers The two tables are linked with a common identifier equidistant conic projection) and are provided to (hh_id); records for individuals thus contain the simplify certain display and measurement functions identifier for the household in which they live. for users of the data. The “lat” and “long” fields We can, therefore, identify all the individuals who contain the latitude and longitude coordinates, constitute a household. This feature is important in respectively, of the household. infectious disease models because the close proximity A portion of the Person table appears in Table 2. and frequent contact among household members In this table, each record represents an individual increases the likelihood of disease transmission person, or agent. The “person_id” is a unique among them. identifier for every agent in the US database; “hh_id”

Table 1. The synthesized agent household database table

Table 2. The synthesized person database table Synthesized Population Database for Agent-Based Models 5

and many other variables) are provided. PUMS also Methods provides data on individuals within each household This section describes the main processing methods (including age, sex, ethnicity, language spoken, that we developed to generate the synthesized agent school enrollment, occupation, travel time to work, database. The overall process includes four steps: military service, and many other variables). In (1) generating the basic synthesized households and addition, the PUMS data set maintains linkages agents, (2) assigning agents to schools, (3) assigning between individuals and households, which allows agents to workplaces, and (4) generating additional the household population structure to be brought agents to reside in group quarters housing. forward through further analyses. • The PUMS data are available for predefined Census Generating Synthesized Households and areas known as Public Use Microdata Areas Persons (PUMAs). PUMAs are defined by each state, rather Household and Individual Data Sources than by the US Census Bureau. The Census Bureau requires that each PUMA contain about 100,000 We used three primary data sources to generate the persons, but otherwise, states have wide latitude synthesized agents and households. All three data to define the shape and extent of each PUMA. sources are produced by the US Census Bureau: PUMAs tend to be relatively small in densely • US Census Bureau TIGER Data. The TIGER populated areas and relatively large in sparsely (Topologically Integrated Geographic Encoding populated areas. Figure 3 illustrates the variability and Referencing) data provide the spatial context of sizes and shapes of PUMAs in Mecklenburg for decennial Census data collection.11 TIGER County, North Carolina. Some PUMAs are smaller defines, among many other things, the boundaries than counties (as in this case), whereas others of states, counties, Census tracts, block groups, and encompass multiple counties. blocks. Census tabulation data are aggregated into these various geographic boundaries. The smallest Figure 3. Public Use Microdata Areas for Mecklenburg Census geographic boundary for which the full County, N.C. suite of Census variables (including socioeconomic Iredell County variables) is available is the Census block group. Rowan County PUMA PUMA In addition, the TIGER files include data on 01500 01200 boundaries of bodies of water and road networks, Lincoln County which were used in generating the synthesized Mecklenburg County database. PUMA Cabarrus County 01000 PUMA • Summary File 3 (SF3) Data. The SF3 data contain Gaston County 01300 PUMA the demographic variables from the Census, 01100 PUMA organized and aggregated to many different 00902 PUMA 12 PUMA 00903 geographic boundaries. Data variables on 00904

population and housing are available in these files. PUMA 00901 PUMA • Public Use Microdata Sample (PUMS). The PUMS York County 00905 PUMA data contain records representing a 5 percent 00500 Union County PUM A sample of the occupied and vacant housing units 01400 in the United States and the people in the occupied units.13 These data are actual responses to Census long-form questionnaires and, therefore, retain Each PUMA must be associated with a Census block family structure information. Data on households group so that the PUMS household and person (including number of persons in the household, records can be explicitly tied to geographic areas number of bedrooms, age of building, access to containing SF3 summary statistics. RTI generated telephone service, type of heating, mortgage data, a cross-walk table defining these associations using 6 Wheaton et al., 2009 RTI Press standard geographic information system (GIS) spatial generated point features to represent the location overlay techniques. of each household. We determined the number of points generated within each Census block group by Synthetic Population Data Processing Methods the count of households in the block group according Figure 4 provides an overview of the process we used to SF3 data. The location within the block group is to generate the synthetic population database. This random, except that we did not place households process included two basic activities: within bodies of water (lakes, ponds, or large rivers). • generating household locations Generating Microdata Records for All Households. • generating microdata records for all households. Because the microdata are available for only 5 percent of US households, we needed to devise In addition, we also compared the synthetic a process to replicate the PUMS household data population to Census counts as a quality control records to generate microdata records for 100 percent measure. of the households represented by the point features described above. We did this using a statistical Generating Household Locations. Each household in technique known as iterative proportional the database is represented as a GIS “point feature.” fitting (IPF).14 Point features are unique x,y locations containing descriptive tabular attributes. Because no national To carry out the IPF procedures, we used an existing database of household locations is available, RTI computer program, the Population Generator, developed at Los Alamos Figure 4. Generalized flow chart of the synthetic population data National Laboratory as part of processing the TRANSIMS15 transportation Block Group Public Use Microdata Area simulation modeling package. The Geography (PUMA) Geography Population Generator implements SF3 Block group- Public use Aggregated to-PUMA microdate (5% of a procedure developed by Beckman Census Counts crosswalk all households) et al.14 to generate synthetic populations from the three US Census data sets described above Iterative proportional (TIGER, SF3, and PUMS) and the ﬁtting cross-walk table that associates PUMAs with Census block group Block Group polygons. Geography Synthesized Count of Synthesized households The basic concept of the IPF Block group households by person (100% of household boundaries block group procedure is as follows: given population) that the PUMS data contain a 5 One One record record for percent sample of actual household for each each person household Generate microdata in a PUMA (made household up of many census blocks), and locations given that aggregated counts of households and their characteristics Join synthesized are summarized at the block group household records Geospatial to geographic household geography in the SF3 data, then a households features set of PUMS household microdata One geographic Linked via feature (record) for records can be selected for a PUMA database each household that accurately represent the relationship Integrated geospatial characteristics of real households synthesized in the census block group. The households Synthesized Population Database for Agent-Based Models 7 details of the IPF procedure used in the TRANSIMS To measure the difference between the synthesized Population Generator are fully described in Beckman households and the actual Census aggregated counts, et al.14 we aggregated the attributes of the synthesized households for Durham County, North Carolina, The IPF procedure selects records from the 5 percent summarized them by Census block group, and sample of households contained in the PUMS data compared them with the original Census aggregated to represent households in a particular Census block data for each of the 129 Census block groups group such that, when the assignment process is in Durham County. Table 3 illustrates how the complete, 100 percent of the known households data compare at the county level. The percentage in a block group are represented. People living in difference in total household population is 0.08 group quarters are not included in the TRANSIMS percent; the percentage difference in average Population Generator process. Consequently, we household income is 1.52 percent. used a separate procedure to handle group quarters residents; that process is described in a later section Table 3. Comparison of Census aggregated data and of this report. synthesized data for Durham County, N.C. The TRANSIMS population generator program Data Type Total Household Average Household was developed to create synthetic households and Population Income associated population in major metropolitan areas Census SF3 213,500 $56,705.85 Synthesized 213,325 $55,844.73 (where population density is high). The synthesized Households database we created using the TRANSIMS Percentage 0.08 percent 1.52 percent population generator also includes rural areas Difference and other areas with lower population density. Because TRANSIMS uses specific criteria to assign Figure 5 shows the frequency distribution of the households to block group areas, we expect the total household population comparison, by Census synthetic population to have the best fit for the block group. The sizes of synthesized household following population characteristics: populations were within 5 percent of the sizes of the • number of people in the household under age 18 Census-reported household populations for about 81 • household income percent (104 of 129) of the block groups

• household size Figure 5. Frequency distribution of the percentage • household population difference in household population between the synthesized data set and the Census data for • vehicles available. Durham County, N.C.

Comparing the Synthetic Population to Census Counts 120 The IPF procedure used to generate microdata 100 for all households attempts to select the most relevant household records from the PUMS sample 80 to construct a set of records for 100 percent of households such that the aggregated household 60 demographics of that 100 percent data set match the original Census block data. However, because 40 Number of Block Groups the PUMS data are a sample of households, it is not possible to perfectly reconstruct the original Census 20 data attributes while also honoring the actual total 0 household population, household income, number 0-5 5-10 10-15 15-20 > 20 of workers in the family, persons under age 18 years, Percent Difference in Household opulationP and vehicles available in each block group. 8 Wheaton et al., 2009 RTI Press

Figure 6 shows the frequency distribution of the School Data Sources household income comparison, by block group, RTI used the synthesized agent database described based on median household income. The synthesized previously, public school data from two sources, and median household incomes were within 5 percent road network information from a third source to of the Census-reported median income for about assign synthesized school-age children to schools. The 64 percent (82 of 129) of the block groups. For two three data sources are the following: block groups, the difference between the synthesized • National Center for Education Statistics (NCES) income and Census income was significantly greater Public School Data for 2005–2006. These NCES than for the other block groups (difference greater data contain a list of public schools for the United than 75 percent). These two block groups contain the States, along with each school’s location and grade- dormitories for Duke University, which the Census by-grade capacity.17 categorized as group quarters. Very few households that are not group quarters are located in these two • Private school data. These data, obtained from a 18 block groups, and the Population Generator was not commercial data vendor, contain a list of private able to do as good a job selecting a set of synthetic and parochial schools and their locations. households that match the census SF3 data as for • TIGER road network data. These 2000 data other blocks because it considers only households (described earlier) from the US Census contain the when performing its selections. entire US road network.11

Figure 6. Frequency distribution of the percentage School Assignment Method difference in median household income between Figure 7 illustrates the data processing steps that we the synthesized data set and the Census data for used to assign school-age students to schools. Durham County, N.C. 100 The PUMS data include a code indicating whether a child is enrolled in public school, private school, or no school. We use those codes to select only children 80 who attend either private or public school for use in the school assignments method described below. 60 Children who do not have a school enrollment code in the PUMS data are not assigned to any school and 40 represent the approximately 2.2 percent of school-age children who are home-schooled. Number of Block Groups 20 We used different methods to assign students to public and private schools to reflect the different 0 0-5 5-10 10-15 15-20 20-25 > 25 geographical processes inherent in enrollment Percent Difference in edianM Household Income decisions for public versus private schools. We determined how many students should be assigned to public or private schools in each area based on codes Assigning Agents to Schools found in the PUMS data. Infectious disease transmission is known to occur at The public schools assignment method is based a higher rate in schools because of the relatively close on the assumption that public school students 16 and sustained contact between students. Therefore, are enrolled at the closest school having adequate the synthesized agent database must assign school- capacity. This assumption is necessary because no age individuals to schools to enable modelers to national data source of school catchment areas exists. model explicitly the social contacts between agents in Even within school districts, there may be magnet school settings. schools, which draw students from throughout a Synthesized Population Database for Agent-Based Models 9

county or city; or there may be busing or other school database from NCES (which contains data on school assignment methods that are not based on location and enrollment by grade), the synthesized distance to schools. However, in the absence of agent database, and the TIGER road network data good data on these factors, we believe the minimum to perform an assignment that attempts to fill each distance assumption for school assignments is grade in each school to capacity. The spatial allocation valid and an appropriate simplification that allows is based on a minimum path algorithm such that students to be assigned to schools in a nationally available students of a certain grade level will be consistent manner. assigned to the closest school that has capacity for The public school allocation method assigns students of that grade level. In a single household, one agents who are of school age (4 to 17 years of age) child could be assigned to a public school, a second to public schools. The allocation uses the public- to a private school, and a third might be home- schooled. We did not attempt to make the assignments Figure 7. Flow chart of the process of assigning school- to public, private, or home schools homogeneous age children to schools within households because of the added complexity and processing that would have been necessary to Synthesized persons complete this more complex assignment process. Figure 8 illustrates the spatial distribution of high

Extract 4- to schools in a portion of Kings County, Washington, 17-year-olds and the allocation of high school students from the synthesized agent database to those schools. Each dot on the map represents the location of a high National Synthesized schools school-aged school. For each student assigned to a high school, database persons a line is drawn between the location of the student’s One record per school household and the high school. The figure illustrates the results after all high school grade assignments Generate school Extract based Home- GIS features from on PUMS schooled have been run. Because the process is run grade by latitudes/longitudes enrollment code students grade, in some cases students seem to be drawn from other schools’ catchments. In effect, each grade has its Schools School-aged own catchment, and assignment overlaps result. If a GIS layer and children eligible school attributes for assignment school’s grade-by-grade capacity has not been met and unassigned students are available at greater distances, the algorithm will assign those students even if they

Allocate students are closer to a different school, if the closer school’s to public schools capacity has already been met. Public school assignments are done separately Students for each county. Children who are 4 years old are Unassigned assigned to a students public school and assigned only to schools that have pre-kindergarten school identiﬁer enrollment. Children who are 5 years old are assigned to kindergarten. The remaining children, ages 6 to 17, Students Allocate students assigned to are assigned to appropriate grades from first grade to to private schools private school and school identiﬁer twelfth grade. In the absence of school-specific data on actual enrollment, we assume that each school is fully Merge enrolled. Thus, the process continues until all public school positions have been assigned.

4- to 17-year-olds After public-school students have been assigned, we with school assignments (or no assignment, implement the private-school assignment method on for home-schooled) 10 Wheaton et al., 2009 RTI Press school-age individuals not assigned to public schools Individual choice is a major factor affecting whether and not reserved as homeschooled. Catchment students attend private schools and which specific areas for private schools are broader than for public private school they might attend. These factors schools; therefore, the private-school method allows are difficult to account for in assigning a synthetic students to be assigned to a private school that is population to private schools. We assumed that not necessarily in the same county. Because private although distance is important in private school schools draw students from across county borders, assignments, it is less critical than in public school we cannot process these assignments one county at assignments. Therefore, distance was still the key a time, as we did with public-school assignments. criteria in assigning students to private and parochial Instead, we did private-school assignments one state schools, but unlike the public school allocation at a time. This allowed the schools to draw from a method, private school students were not constrained more natural distribution of potential students. to attending the nearest private school with capacity. Instead, students were assigned to Figure 8. Allocation of high-school-age synthesized agents to Kings private schools using a concentric County, Wash., high schools ring approach. Table 4 illustrates the size of each concentric ring and our assumptions about the proportion of a private school’s enrollment drawn from each ring. These assumptions are the same across the United States, and they have not been validated. Figure 9 illustrates the geographical distribution of students assigned to the twelfth grade for one private school. The distribution closely aligns with the percentages specified in Table 4.

Table 4. Ring sizes and percent enrollment for private school assignments Distance Percentage of School Enrollment 0–10 km (0–6.2 mi) 50% 10–15 km (6.2–9.3 mi) 25% 15– 20 km (9.3–12.4 mi) 25%

At the end of the allocation process, we may still have non-home- schooled school-age children who have not been assigned to a public or private school. We assume in these cases that our schools databases are incomplete and that these school-age children should Synthesized Population Database for Agent-Based Models 11

be assigned to schools. Thus, Figure 9. Allocation of 12th-grade synthesized agents to a sample private we conducted a post-allocation school process that assigns remaining unassigned school-age children to the closest school even if the school’s capacity has already been met.

Assigning Agents to Workplaces To identify persons and locations where infectious disease may be transmitted, we need to be able to assign the non-school- age population to workplaces. A brief overview of the workplace assignment methods is provided here; a complete and detailed description will be included in a future methods report.

Workplace Data Sources We used the following data sources for making workplace assignments: • US Census Bureau Special Tabulation Product 64 Workplace Assignment Method 19 (STP64). STP64 provides counts of workers Figure 10 illustrates the generalized data processing by Census tract of residence and Census tract of steps followed to assign workers to individual work. It therefore provides a useful national data workplaces. set to guide the assignment of synthesized agents to workplaces. We developed a two-stage method to assign synthesized agents to workplaces. The first stage • Synthesized Population. The synthesized agent assigned workers to a Census tract for work. The database described above was the source of the second stage created individual workplace locations worker agents to be assigned to workplaces. within each Census tract and assigned workers to • InfoUSA Business Counts. We obtained these specific workplaces. 2006 data from InfoUSA via the ESRI Business Analyst GIS product.20 InfoUSA provides data on The STP64 data contain a record for each combination more than 14 million US businesses, including of Census tract of residence and Census tract of their location and a count of employees at each work for which persons from one Census tract work workplace. To simplify the assignment process, we in the same or another Census tract. We assigned a summarized these individual workplace records set of synthesized agents from each Census tract of into the following six categories by number of residence to a Census tract of work as specified by the employees: 1–4, 5–24, 25–99, 100–499, 500–4,999, STP64 data. For example, if the STP64 data indicated and 5,000+ employees. that 50 persons living in tract A work in tract B, then 50 agents living in tract A were coded with a work identifier of tract B. 12 Wheaton et al., 2009 RTI Press

Figure 10. Flow chart of the process of assigning workers to workplaces After determining which workers work in each Census tract and

InfoUSA generating records for each Synthesized business persons workplace in each tract, we then microdata assigned the synthesized agents 14 million individual businesses who work in a Census tract to specific workplaces. We used Extract Summarize by tract working-aged and business size the specified capacity of each adults (employees) workplace to determine how many workers should be assigned to that Persons eligible STP64 Place of Counts of for workplace Residence/Place workplaces by site. assignments of Work database tract and size At the conclusion of the workplace

Generate workplaces to assignment process, we assigned Assign workers ﬁt counts and size agents to individual workplaces to to tract of distribution by tract workplace meet two main criteria: (1) each agent works in the correct tract, Count of businesses by as specified in the STP64 data, size category Assign persons to and (2) each agent is assigned Persons with workplaces based tract of work on tract of work to a workplace such that the identiﬁer and business size distribution distribution and capacity of each workplace within each tract is honored. Agents who work in the Synthesized persons assigned same workplace have the same to workplaces workplace identifier; therefore, developers of agent-based models know explicitly which workers After assigning agents to a Census tract of work, we may come into contact with each other based on ran a process to generate specific workplaces that their workplace assignments. The method does not meet the workplace counts and sizes in the InfoUSA account for workers working different shifts or for data. For each Census tract, we generated a record for telecommuters. each business located within the tract. We also coded these records with a unique identifier that included Generating Group Quarters and Assigning the Census tract identifier and the size of the business Agents to Them (one of the six categories noted above for number of Because residents of group quarters (college workers based on the distribution of business sizes dormitories, prisons, military barracks, nursing from InfoUSA). homes, and others) make up 2.7 percent of the US population,20 they are important subpopulations for The result of this analysis was a workplace identifier infectious disease research. However, the synthesized for each person assigned to a place of work. The agent database described above does not include resulting data table contains a record representing representations of agents who live in group quarters. each business workplace (along with the number of Therefore, we created a separate process to generate employees who work there) found in the Census tract. group quarters agents and assign them to appropriate Each workplace is placed at the center of the Census group quarters locations. tract for this analysis. Synthesized Population Database for Agent-Based Models 13

Group Quarters Data Sources Group Quarters Generation and Assignment Methods We used four types of primary data sources to Figure 11 illustrates the generalized procedure generate group quarters data: for creating the group quarters locations and corresponding synthesized residents. • US Census Bureau Summary File 1 (SF1). The SF1 provides counts of group quarters residents by The goal of this process was to generate records for 21 type of group quarters, by block group. The SF1 group quarters and associated records for the people provides the number and sex of group quarters within each group quarters unit that mimic the populations and breaks out population by sex into synthesized agent data structure (described above) the following age groups: under 18 years, 18 to 64 as much as possible to enable the two data sets to be years, and over 64 years. used together with minimal difficulty. The person • Homeland Security Infrastructure Program records do not include all the detailed information in (HSIP) Database. HSIP is a US Department of the synthesized household agent database generated Homeland Security geospatial data product that from the PUMS data, but they do include sex and age. contains, among many other things, locations of We used the location data from HSIP and other prisons, military quarters, colleges and universities, sources to generate points reflecting accurate and nursing homes.22 locations for group quarters, where possible. We • Group Quarters Age Distributions. To refine the then assigned the count of residents of that type (e.g., age data, we needed additional data on the age military residents) for a specific block group from distributions of group quarters populations for the SF1 data to the group quarters points in that the year 2000. These additional data allow us to block group. Each point, therefore, became a group generate person records with associated age and sex quarters of a specific type. In some instances, we information. We used several sources, including could not find accurate locations for specific group those from US Department of Justice (DOJ) on quarters. When we could demonstrate that it was rare 23 prison populations and the US Department of for more than one group quarters of a specific type Defense (DoD) on military populations.24

Figure 11. Flow chart of the process of creating group quarters locations and assigning residents

Counts of group Counts of group quarters Generalized Block group quarters by block residents by block group group quarters boundaries group and type and type of group quarter demographics

Generate group Synthesize individual group quarters quarters resident records to match counts and locations demographic distributions

Group quarters Group quarters point features residents

Assign synthesized group quarters person records to group quarters points features

Integrated group quarters locations and residents data 14 Wheaton et al., 2009 RTI Press

(e.g., military bases) to occur within a block group, Summary and Conclusions we generated a single point for the group quarters in the block group, located at the geographic center of Our US synthesized human agent database provides the block group. We assigned all agents for that type a realistic agent population for use in agent-based of group quarters in that block group to that point. models. We developed the database by combining When necessary, we generated multiple points to available national data sets, population-generating represent multiple facilities. techniques, and GIS techniques. Moreover, we developed several custom enhancements, including We then refined the age distributions using the creating group quarters and assigning agents to additional age data. We used the SF1 data to schools, workplaces, and those group quarters. obtain, for example, the number of males under 18 in the military population; we then used the age Because of the large input data sets, nationwide distribution of males found in the population for processing, and the inherent difficulty in appending military personnel to assign specific person records county-based and state-based processing into for the block group. national geospatial databases, we encountered many technical hurdles during production. We Computer Resources, Time Resources, and applied several quality assurance and quality control Scalability processes to check the results as we processed each county and state. For example, we compared maps The original iterative proportional fitting procedures showing the counts of synthesized persons by Census that we used were designed to generate data one city tract and block group with the aggregated counts at a time. RTI developed automation routines to run provided by the Bureau of the Census. This step these procedures for the entire country on a county- ensured that we did not create any holes or gaps in by-county basis. The basic synthesized population the data. database is complete for the entire United States. The processes for generating group quarters and Future directions for enhancing and improving assigning synthesized agents to schools, workplaces, the database include improving the locations of and group quarters are too time-consuming to run household point features, developing a process for the entire United States at one time. Instead, these for including race and ethnicity in the iterative processes are run to make assignments on an as- proportional fitting technique, and developing needed basis as requests for synthesized populations updated versions of the database that use newer for a particular area are received. demographic data provided by the American Community Survey.25 The storage size of the final database (even if school, workplace, and group quarters assignments were Agent-based models are an increasingly popular tool included for the whole United States) would total less for analyzing large-scale social systems as varied as than 50GB of storage disk space. The data are housed disease epidemics, tax policy, and transportation in a geospatial database to allow the data to be used planning. The US synthesized database developed by in geospatial models and analyses. RTI and described here is available to any researcher who would like to use it in his or her own model. The database is too large to be distributed as a whole, but RTI will extract portions of the database by county or groups of counties and make these extracted subsets available to researchers on request. Synthesized Population Database for Agent-Based Models 15

11. US Census Bureau; Department of Commerce, References Economics and Statistics Administration. 2000 1. Lin F, Lin S. Enhancing the supply chain Topologically Integrated Geographic Encoding performance by integrating simulated and and Referencing (TIGER) system. Washington, physical agents into organizational information DC: US Census Bureau; 2005. systems. Journal of Artificial Societies and Social 12. US Census Bureau; Department of Commerce, Simulation. 2006;9(4):1. Economics and Statistics Administration. 2000 2. Sallach D, Macal C. The simulation of social Census of population and housing, summary agents: an introduction. Special Issue of Social file 3. Washington, DC: US Census Bureau; Science Computer Review. 2001;19(3):245-8. 2005 March. 3. Eubank S. Network-based models of infectious 13. US Census Bureau; Department of Commerce, disease spread. Jap J Infect Dis. 2005;58(6): S9-13. Economics and Statistics Administration. 2000 4. Ferguson NM, Cummings DAT, Fraser C, Census of population and housing, public use Cajka JC, Cooley PC, Burke DS. Strategies for microdata sample 2000. Washington, DC: US mitigating an influenza pandemic. Nature. 2006 Census Bureau; 2005 Dec. Jul;442:448-52. 14. Beckman RJ, Baggerly KA, McKay MD. Creating 5. Longini IM, Halloran ME, Nizam A, Yang Y, Xu synthetic baseline populations. Annals of S, Burke D, et al. Containing a large bioterrorist Transportation Research. 1996;30(6):415-29. smallpox attack: A computer simulation 15. TRANSIMS Open Source [homepage on the approach. Int J Infect Dis. 2007 Mar;11(2):98-108. Internet]. Los Alamos (NM): Los Alamos 6. Epstein JM, Axtell R. Growing artificial societies: National Laboratory; 2008. Transportation social sciences from the bottom up. Cambridge Analysis Simulation System (TRANSIMS). (MA): MIT Press; 1996. Originally available from Los Alamos National Laboratory; now available from: 7. Orcutt GH, Mertz J, Quinke H, editors. http://transims-opensource.org Microanalytic simulation models to support social and financial policy. Amsterdam: North- 16. Longini IM, Halloran ME. Strategy for Holland; 1986. distribution of influenca vaccine in high risk groups and children. Am J Epidemiol. 8. Macal CM, North MJ. Tutorial on agent-based 2005;161:303-6. modeling and simulation part 2: How to model with agents. In: Perrone LF, Wieland FP, Liu J, 17. Common Core of Data (CCD) [Internet]. Lawson BG, Nicol DM, Fujimoto RM, editors. Washington, DC: US Department of Education, Proceedings of the 2006 Winter Simulation National Center for Education Statistics; 2008. Conference; 2006. Public school data for 2005-2006 school year. Available from: http://nces.ed.gov/ccd 9. Cooley P, Ganapathi L, Ghneim GS, Holmberg /schoolsearch SD, Wheaton WD, Hollingsworth CE. Using influenza-like illness data to reconstruct an 18. Schoolinformation.com [homepage on the influenza outbreak. Mathematical and Computer Internet]. ASD Data Services, LLD. School Modeling. 2008;48(5-6): 929-39. information website; 2008. Available from: http://www.schoolinformation.com 10. Frequently asked questions: how does the Census Bureau account for homeless people? [Internet]. 19. Census 2000 special tabulation: census tract US Census Bureau; 2003 [cited 2008 Sept 17]. of work by census tract of residence (STP 64) Available from: http://factfinder.census.gov [Internet]. Wasington, DC: US Census Bureau; /home/en/epss/faq.html#homeless 2004, April. [cited 2008 March 6]. Available from: http://www.census.gov/mp/www/spectab /stp64-webpage.html 16 Wheaton et al., 2009 RTI Press

20. InfoUSA. c2006. InfoUSA (component of 24. Selected manpower statistics for fiscal year 2005: Environmental Systems Research Institute’s Defense Manpower Data Center report. [Internet]. [ESRI’s] Business Analyst GIS product). Available Washington, DC: US Department of Defense. at http://www.esri.com/software/arcgis/extensions Tables 2-17 for year 2000 and 2-17a for year 2001, /businessanalyst/index.html age distribution of male/female military personnel 21. Census 2000 summary file 1 [Internet]. strength [in thousands] and percent figures; Washington, DC: US Census Bureau; 2001. 2000 c2005. Available from http://siadapp.dmdc.osd. Census of population and housing: technical mil/personnel/M01/fy05/m01fy05.pdf documentation; 2007 July. Available from: 25. American community survey [Internet]. http://www.census.gov/prod/cen2000/doc/sf1.pdf Washington, DC: US Census Bureau; 2008. 22. US Department of Homeland Security, National Available from: http://www.census.gov/acs/www Geospatial Agency (NGA). Homeland Security Infrastructure Program (HSIP) geospatial database. Washington, DC: US Department of Homeland Security; 2006. 23. Beck AJ, Karberg JC. Bureau of Justice Statistics bulletin: prison and jail inmates at midyear 2000 [Internet]. Washington, DC: US Department of Justice, Office of Justice Statistics. Table 12: number of inmates in state or federal prisons or local jails by gender, race hispanic origin, and age, June 30, 2000; 2001 March. Available from: http://www.ojp.usdoj.gov/bjs/pub/pdf/pjim00.pdf Acknowledgments The project described was supported by grant number U01GM070698 (Models of Infectious Disease Agent Study—MIDAS) from the National Institute of General Medical Sciences. The content is solely the responsibility of the authors and does not necessarily represent the official view of the National Institute of General Medical Sciences or the National Institutes of Health. RTI International is an independent, nonprofit research organization dedicated to improving the human condition by turning knowledge into practice. RTI offers innovative research and technical solutions to governments and businesses worldwide in the areas of health and pharmaceuticals, education and training, surveys and statistics, advanced technology, international development, economic and social policy, energy and the environment, and laboratory and chemistry services. The RTI Press complements traditional publication outlets by providing another way for RTI researchers to disseminate the knowledge they generate. This PDF document is offered as a public service of RTI International. More information about RTI Press can be found at www.rti.org/rtipress. www.rti.org/rtipress RTI Press publication MR-0010-0905