
The LIFE-M Project: The Longitudinal, Intergenerational Family Electronic Micro-Database

Martha Bailey,1,2 Sarah Anderson,1 Catherine Massey1

1 University of Michigan

2 National Bureau of Economic Research

March 3, 2017

Abstract

Some of the most important questions in demography relate to changes across time. But most data spanning the late 19th and 20th centuries in the U.S. are cross-sectional. This limits the study of population dynamics, including long-run and intergenerational changes in fertility, racial and gender disparities, family formation and dissolution, immigration and geographic mobility, health and aging, and mortality. This paper describes the Longitudinal, Intergenerational Family Electronic Micro-Database (LIFE-M), a new project that seeks to create a longitudinal and intergenerational micro-database for millions of individuals, spanning much of the late 19th and 20th centuries. Complementing historical census linking projects, LIFE-M is novel in linking vital records that chronicle births, deaths, and marriages to the historical population censuses. Vital records provide insight about the life course and intergenerational outcomes of millions of families and uniquely allow the linking of women from before and after marriage (because they contain mothers’ birth and married names). This paper provides an overview of the LIFE-M project, including a description of the LIFE-M data structure, our linking methods, proposed expansions, and a discussion of future research capabilities.

Acknowledgements

This project was generously supported by the National Science Foundation (SMA 1539228), the University of Michigan Population Studies Center Small Grants (R24 HD041028), the Michigan Center for the Demography of Aging (MiCDA, P30 AG012846-21), and the Michigan Institute on Research and Teaching in Economics (MITRE). We gratefully acknowledge the use of the services and facilities of the Population Studies Center at the University of Michigan (R24 HD041028). We are grateful to Dora Costa, Shari Eli, Adriana Lleras-Muney, Joseph Price, and the board members of the LIFE-M project, including Eytan Adar, George Alter, Hoyt Bleakley, Matias Cattaneo, William Collins, Katie Genadek, Maggie Levenstein, Bhash Mazumder, and Evan Roberts, for their helpful suggestions. We are also grateful to Morgan Henderson and Garrett Anstreicher for their excellent contributions to the LIFE-M project and assistance with this analysis.

I. Introduction

Is the U.S. still the “land of opportunity?” Is it still the “great melting pot?” How did

intergenerational mobility change over the 20th century? Did the transformation of the South’s

institutions, from Jim Crow to Civil Rights, affect health, family structure, and poverty among

African Americans? How has transformation in women’s rights and roles changed the structure of

families and their children’s opportunities? What have been the long-run health and economic

implications of the extraordinary early 20th century public health efforts (e.g., the virtual

eradication of malaria and hookworm or the development of vaccinations for life-threatening and

debilitating diseases)?

These important questions relate to how individuals’ lives and experiences change across

time. However, most population data spanning the late 19th and 20th centuries are cross-sectional—

large sets of individuals at one point in time. Cross-sectional data limit the study of population

dynamics such as long-run and intergenerational changes in fertility, racial and gender disparities,

family formation and dissolution, immigration and geographic mobility, health and aging, and

mortality. As a consequence, some of the most important questions in population science and

demography remain unanswered.

This paper describes the Longitudinal, Intergenerational Family Electronic Micro-Database

(LIFE-M), which is building an integrated database of vital records and census data spanning the

early and middle 20th century US. LIFE-M uses the remarkable new resource of digitized vital

records (birth, marriage, and mortality certificates) to fill a critical need for more longitudinal and

intergenerational data.1 Once completed, LIFE-M will incorporate over 185 million individual

vital statistics records linked to the 1880 and 1940 decennial population censuses and, ultimately,

1 Vital records have been collected for centuries. Recent digitization and publication efforts make our project possible.

Bailey, Anderson, and Massey – 2

the social security death index (SSDI), military records (MR), and immigration records from ship

manifests (SM).

Funded by NSF 1539228, LIFE-M’s large longitudinal, intergenerational micro-data have

the potential to shift the research frontier in studies of health and longevity, childbearing and family

structure, and the long-run effects of early-life circumstances. Complementing historical census linking projects, LIFE-M is novel in its use of millions of newly transcribed, publicly available micro-data vital statistics. Using these data, LIFE-M will reconstitute family units from birth records (by linking on parents’ names), connect multiple generations and adult siblings, merge individual socio-demographic and economic variables using the 1880 and 1940 censuses, and create measures of infant mortality and longevity by linking to death records. All of this is possible for both men and women, because mothers’ birth (“maiden”) and married names are contained in both birth and marriage records.

LIFE-M will ultimately make three important contributions. First, it will facilitate studies of the effects of early-life conditions on later-life health and longevity by linking millions of individuals from birth to death. LIFE-M’s large samples will include individuals who died as infants or children (often not enumerated in censuses or surveyed), mitigating this source of sample selection bias and facilitating new analyses of infant mortality. LIFE-M also documents location changes over time, enabling the study of life-cycle geographic mobility and the relationship

between adult outcomes and early-life contextual factors (e.g., family, sibling sex composition,

birth order, and neighborhood). These data will provide new opportunities for evaluating the short

and long-run effects of early-life circumstances and the multitude of public health policies for

today’s aging population.

Second, LIFE-M will facilitate the linkage of multiple generations for millions of families and siblings. Unlike existing data that trace small numbers of families over time (or for short periods of time), LIFE-M reconstructs large samples of interconnected generations (enhanced and cross-validated using census records) to facilitate detailed studies of the intergenerational determinants of health and longevity across time and space.

Third, LIFE-M will consist of very large micro-data samples overall and for understudied populations. Existing longitudinal micro-data (PSID, NLS, and HRS) often have rich health variables but samples that are too small to answer questions of interest, especially for separate analyses by racial/ethnic or nativity subgroup. Historical longitudinal samples typically exclude women by necessity (e.g., the Union Army veterans) or because they cannot be linked by name.

LIFE-M will uniquely follow large, representative samples of women longitudinally and across generations and improve linkage rates among men. Improvements are especially notable for non-white men, non-native speakers of English, and the less educated, who are more likely to change the spellings of their names. This is possible because vital statistics data contain additional information (exact day of birth, middle names, and parents’ names) that reduces multiple matches, and they also contain multiple birth records for the same parents, which facilitate corrections of transcription errors and minor misspellings.

This paper provides an overview of the LIFE-M project and our preliminary results for

Ohio. We describe the variables, the intergenerational structure of the LIFE-M data, and our linking procedures. We also discuss the coverage and representativeness of the 20th century vital statistics microdata by comparing statistics from the microdata to published aggregate statistics and by examining correlates of missing data. We conclude with our preliminary linkage rates for

Ohio and a discussion of how the LIFE-M data can be used in population research.


II. Recent Advances in Longitudinal Data

Censuses provide rich cross-sectional snapshots of the late 19th and 20th centuries, but these data limit researchers’ abilities to follow individuals over time to answer longitudinal or intergenerational (or any inherently dynamic) questions. Existing large-scale data are transforming what is possible. This section discusses how LIFE-M complements these efforts.

Existing Large-Scale, Longitudinal and Intergenerational Data

Several types of data currently permit analyses of twentieth century population dynamics, including linked historical data, linked contemporary data, and panel surveys. One example of linked historical data is the Early Indicators Project. First collected by Robert Fogel and others and now led by Dora Costa, these data provide an important longitudinal perspective on health and economic outcomes during the middle 19th century (Wimmer 2003). The data consist of 39,340

Union Army (UA) soldiers, approximately 6,200 of whom were “Colored Troops.” These data measure the date of death and provide rich information on disability, health, the use of medical care, and pension receipt for men reaching retirement age in the late 19th century. Through links to the 1850 and 1930 censuses, the UA data also include socio-demographic and economic variables. An important limitation of the UA data is that they consist of men who were mostly northern born.

The Minnesota Population Center (MPC) has also led efforts to digitize and integrate historical census data. Thus far, they have created Integrated Public Use Microdata Series Linked

Representative Samples of the 1880 Census linked to the 1850-1930 Census one-percent samples

(Ruggles et al. 2010).2 These linked data record economic (e.g., occupation, literacy, labor-force

2 This large linked sample follows two earlier linking projects. Pullum and Guest created a national linked sample of two cohorts

participation, home ownership) and demographic (e.g., age, birth place, race, marital status,

number of children) outcomes for around 500,000 people. Though large in scale, important limitations of these data include that many women cannot be linked by name (because they change

their names at marriage), longitudinal coverage is sparse (at most two points in time for any single

person), and intergenerational coverage consists of at most two generations (father-son pairs).

Combining forces with the U.S. Census Bureau, MPC currently intends to expand these

historical linkages to the recently digitized 1850-1940 full-count population censuses as well as

link forward from 1940 on to the 2000 Census and the 2001-2015 American Community Surveys

– creating a data infrastructure spanning over 150 years (Alexander et al. 2015). These historical linkage efforts have focused on men because they are easily linkable across historical censuses.3

Smaller-scale surveys, collected more recently, offer a second type of longitudinal, intergenerational data. These include the Panel Study of Income Dynamics (PSID), National

Longitudinal Surveys (NLS), and the Health and Retirement Study (HRS). The PSID and NLS longitudinal surveys began in the late 1960s and contain rich information on economic status, health, and well-being. The initial PSID sample consisted of 5,000 families and has grown to over

7,000 families as the study also follows descendants. The original NLS cohorts, which cover the periods 1966-1981 for young men, 1966-1990 for older men, 1967-2003 for mature women, and

1968-2003 for younger women, also began with initial sample sizes of around 5,000.4

The HRS is a longitudinal survey that has followed Americans over age 50 from 1992 to the present. After beginning with the 1931-1941 birth cohorts (N= 12,000), the HRS added 1924-

of men born in the 1880 and 1900 censuses (Guest 1987, N=4,4014, linkage rate 39.4%). Ferrie (1996) linked a nationally representative sample of the 1850 census to the 1860 census (N=4,938, linkage rate 19.3%).

3 The linkages from 1940 forward are not limited to men due to the holdings of administrative data at the Census Bureau, which includes women’s birth (“maiden”) and married name changes as far back as 1936.

4 The NLS has subsequently tracked supplemental samples. One covers ages 14-22 in 1979 (N=12,686) (and children for women in this survey) and another ages 12-16 in 1996 (N=9000).

1930 and 1942-1947 birth cohorts in 1998 (N=14,000). HRS data include health, disability, wealth,

retirement, and financial literacy questions in addition to retrospective measures of early life and

adult economic and outcomes. Note that the oldest individuals (born in 1924) need to survive to

age 74 to make it into the survey, so HRS samples miss many individuals in these early cohorts

dying at younger ages.

These surveys cover cohorts reaching adulthood about 100 years after the UA veterans—

individuals born or reaching adulthood in the second half of the 20th century. This leaves a gap of

roughly one century between the longitudinal coverage of the UA data and recent surveys.

Moreover, these data have some common limitations: significant attrition (52.2 percent of the

original PSID sample remained by 1989; around 5 percent of the NLS samples left per year; 15

percent attrition in the HRS as of 2004) and lack of geographic coverage, which constrain the

representativeness of these data at the state and more local (county/town) levels.5

In summary, longitudinal and intergenerational analyses of the 20th century U.S. have been

limited by existing data in several ways. Historical data have focused on men, either because they were soldiers (UA)

or because they can be linked across historical censuses. Ongoing surveys (such as the PSID) tend

to have small samples and limited geographic coverage, especially for understudied populations.

Only a handful of existing data sources link a person’s early-life family and contextual factors to

their date of death, family structure, and later life economic outcomes, but many of these early life

5 A variety of independent administrative and restricted data sources offer a third type of longitudinal, intergenerational data. The National Longitudinal Mortality Study (NLMS) links the Current Population Surveys and other records to death certificates to examine the relationship of demographic and socio-economic characteristics with mortality rates. These large micro-data samples (N> 340,000 deaths) generally link individuals 50 or older to demographic and socio-economic information in the CPS from about age 40. Researchers have also conducted labor-intensive hand-linkages across censuses (Ferrie 1996; Guest 1987; Long and Ferrie 2013; Collins and Wanamaker 2014; Bleakley and Ferrie 2013a, 2013b, 2014) or hand-linked samples of U.S. census data to other records, including ship manifests (Abramitzky et al. 2014) and census records from other countries (Biavaschi and Elsner 2013). Many of these linked samples are the property of the researchers who collected or linked them and are not available for public use. Lack of access to these data and substantial barriers to creating such samples limit replication, new research using these data, and analyses of data quality.

events are based on imperfect recall or knowledge of aging adults (such as in the HRS). LIFE-M will be the first large-scale dataset to provide longitudinal and intergenerational information for a rich set of early life, family composition, health, and census variables for representative samples of men and women born in the first half of the 20th century.

The Contribution of LIFE-M to Population Data Infrastructure

LIFE-M’s compilation and linkage of millions of vital records and census data will advance data infrastructure in multiple dimensions. LIFE-M will provide:

(1) Enhanced micro-data on the U.S. for population research

LIFE-M will include the following variables for each represented person:

• birth family characteristics (e.g., birth order, sibling sex composition, age differences, twinning, number of siblings);
• health (date and place of death, i.e. longevity);
• own births (number of children, mortality of own infants and children, timing of births, sex composition, and twinning);
• own economic and demographic outcomes (wages, employment, occupation, birth state or country, education from the 1940 census);
• marriage family characteristics (e.g., age at marriage, married name, name and characteristics including all characteristics on this list);
• parental and grandparent characteristics (e.g., age, race, occupation6, education6, and birth state or country from the censuses); and
• geographic location (town or address) at vital events and census enumeration (lifetime mobility).

Linked census samples or vital records alone contain only snapshots of these outcomes, but LIFE-M integrates health, economic, family, and demographic data into a single large-scale, longitudinal and intergenerational file.

(2) Unprecedented sample sizes of linked women and understudied populations

Vital records contain information on married women’s birth and married names. Typically, census linking projects based on name necessarily exclude women who change their names at

6 Occupation only available in the 1880 and 1940 censuses. Completed education only available in the 1940 census.

marriage. Birth certificates, however, contain information on mothers’ birth (“maiden”) names in

at least 86 percent of cases. In addition, marriage certificates contain information on the birth

names of the bride and groom. Vital records information, therefore, contains a cross-walk between

women’s names that will facilitate their unprecedented inclusion in population and health research.
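The cross-walk idea can be illustrated with a minimal sketch. The record fields below (`bride_first`, `bride_birth_surname`, `groom_surname`) are hypothetical and are not LIFE-M's actual schema, and the sketch assumes, purely for illustration, that a bride adopts the groom's surname at marriage.

```python
from typing import NamedTuple


class MarriageRecord(NamedTuple):
    # Hypothetical fields for illustration; not the LIFE-M schema.
    bride_first: str
    bride_birth_surname: str
    groom_surname: str


def build_crosswalk(marriages):
    """Map (first name, birth surname) to the set of candidate married names.

    Assumes the bride takes the groom's surname, which is a simplification.
    """
    crosswalk = {}
    for m in marriages:
        key = (m.bride_first.lower(), m.bride_birth_surname.lower())
        married = (m.bride_first.lower(), m.groom_surname.lower())
        crosswalk.setdefault(key, set()).add(married)
    return crosswalk


# Made-up example record, not real data.
example = build_crosswalk([MarriageRecord("Mary", "Schmidt", "Jones")])
```

A woman recorded under her birth name in one source can then be searched for under each candidate married name in another source.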

LIFE-M will also increase sample sizes for other understudied groups. Vital records

contain more information than census records (exact day of birth, middle names, and parents’ full

names), which will increase unique matches. Moreover, LIFE-M data contain records for the same

parents with more than one child which enable corrections of transcription errors and minor

misspellings that limit name matching for the less educated, the less affluent, and the foreign born

(who have non-Anglophone names or speak English as a second language). Increasing the inclusion of understudied populations is crucial, because most historical linked samples are very small. For instance, the IPUMS linked 1870 and 1880 census samples match 12 percent of white men but only 6 percent of black men and 3 percent of immigrant men across censuses. Aggregating these samples over all of the linked censuses between 1850 and 1930 results in total sample sizes of only around 5,000 black men and 5,000 foreign-born men. These small sample sizes preclude the study of important questions for these groups. We anticipate LIFE-M’s sample sizes for these groups will be at least ten times as large.

(3) Intergenerational linkages of families across multiple generations

Whereas existing data trace small numbers of families over time (PSID, NLSY, HRS) or permit intergenerational analyses of men alone (UA data or linked census samples), LIFE-M uniquely permits the large-sample intergenerational linkage of entire families (including daughters and mothers).

(4) Coverage of cohorts born in the early 20th century

Current surveys cover cohorts reaching adulthood about 100 years after the UA veterans,

which leaves the cohorts born in the early 20th century largely uncovered by longitudinal data.

These cohorts include the veterans of two World Wars and the children of the Great Depression.

This group largely defines aging Americans in the last half of the 20th century.

(5) Geographic information over the entire life course

Place of event (town/county) or address at census enumeration presents a multitude of

opportunities for SBE researchers to link LIFE-M to administrative data, policy measures, or datasets containing contextual, demographic, and economic variables for specific research purposes. This information will also facilitate linkages to other datasets such as those containing

lead exposure by city, state-level changes in compulsory schooling laws, as well as other

contextual exposures that vary by geography and time.

III. Data Collections in LIFE-M

Until recently, most of the records in LIFE-M were handwritten and stored in archives.

FamilySearch.org, a nonprofit genealogical website, transcribed these records and made them

publicly available. For the entire U.S., these micro-data include over 60 million births for 19 states,

70 million deaths for 41 states, and 55 million marriages for 47 states.7

LIFE-M currently has funding to include five states, beginning with Ohio. We have

collected birth, marriage, and death records for Ohio from 1880-2000 from public sources. Figure

1 describes variable coverage for each type of vital record in Ohio. We have nearly full coverage

7 Data transcription is ongoing, so the number of covered states should increase over time. Birth records are available for (using state postal abbreviations) CA, Cook County, IL, DE, IA, ME, MA, MI, MN, NC, OH, Philadelphia, PA, Salt Lake City, UT, Sebastian County, AR, TX, VT, WV, WI, and Yates County, NY. Death records are available for AL, AZ, AR, CA, CT, DC, DE, FL, GA, HI, ID, IL, IN, IA, KS, KY, LA, ME, MA, MI, MN, MO, MT, NH, NJ, NM, NY, NC, OH, OR, Pittsburgh, PA, RI, SC, TN, TX, UT, VT, VA, WA, WV, and WI. Marriage records are available for AL, AK, AR, AZ, CA, CO, CT, DE, DC, FL, GA, HI, ID, IL, IN, IA, KS, KY, LA, ME, MD, MA, MI, MN, MS, MO, MT, NE, NV, NH, NJ, NM, NY, NC, ND, OH, OK, OR, PA, RI, SC, SD, TN, TX, UT, VT, VA, WA, WV, WI, and WY.

of birth, marriage, and death records from 1880 through the late twentieth century. Each of these

record holdings contains the name, place and date of birth, and parent names necessary for record

linkage across all three vital records types and census data. To assess completeness within a record holding, we compare the number of observations in the individual-vital record holdings to published vital statistics.

Completeness of LIFE-M Vital Records

Vital records may be incomplete for a variety of reasons, including under-registration,

record degradation (fires, floods, etc.), or incomplete digitization. To characterize the

completeness and representativeness of LIFE-M data, we compare pre-linked LIFE-M tabulations

to the census and published tabulations of births and deaths data (only possible after states entered

the Federal Registration Areas). Our analysis finds that the LIFE-M pilot data contain 70 percent

of all published births, 79 percent of all published deaths, and 45 percent of all marriages for the

areas and periods they represent. These figures are encouraging and similar to what is considered

a high response rate in major national surveys.

Intergenerational Structure of LIFE-M Data

Figure 2 provides approximate dates to illustrate the cohort and generational structure for

the LIFE-M project’s generation 0 (G0) to generation 3 (G3). G2 is our core sample of infant birth

certificates for which LIFE-M will construct intergenerational and longitudinal data. G0 refers to

those born before 1860 (contemporaries of the UA cohorts); G1 to those born 1870-1899; G2 to

those born 1900-1929; and G3 to those born 1930 forward (contemporaries of the HRS cohorts).
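As a minimal sketch, the approximate generation boundaries above can be written as a lookup on birth year. The function name is hypothetical, and because the stated ranges leave 1860-1869 unassigned, the sketch returns `None` for those years rather than guessing.

```python
from typing import Optional


def generation(birth_year: int) -> Optional[str]:
    """Assign a LIFE-M generation label from the approximate ranges in the text."""
    if birth_year < 1860:
        return "G0"  # contemporaries of the UA cohorts
    if 1870 <= birth_year <= 1899:
        return "G1"
    if 1900 <= birth_year <= 1929:
        return "G2"  # core sample of infant birth certificates
    if birth_year >= 1930:
        return "G3"  # contemporaries of the HRS cohorts
    return None  # 1860-1869 is not covered by the stated ranges
```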

Data Linking Structure

Figure 3 provides an overview of LIFE-M’s linking process. The first step of the process

is to reconstitute birth and marriage families of the late 19th and early 20th century birth cohorts

(G2) (Figure 3, arrow 1). This requires linking birth records (G2) to one another using parents’ full names (G1) and other information such as parents’ birth places (when available). We also examine records with only one parent to identify cases of parent deaths and remarriage.

Noteworthy is that the G2 birth certificates provide a link for at least two generations. Also,

G2 can be linked to their own children (G3), because birth records contain mother’s birth names.

This step allows the reconstruction of two to three generations of interrelated families. In addition, the resulting family sizes are compared to census tabulations to examine data quality.
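The family reconstitution in step 1 can be sketched as grouping birth records by parents' names. This is a deliberately simplified illustration: it matches on exactly equal normalized names, whereas real records contain misspellings and transcription errors that require the clerical review and machine learning described later. The dictionary keys (`father_name`, `mother_birth_name`, `child_name`) are hypothetical.

```python
from collections import defaultdict


def normalize(name: str) -> str:
    """Lowercase a name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def reconstitute_families(births):
    """Group children whose birth records list the same pair of parents."""
    families = defaultdict(list)
    for b in births:
        key = (normalize(b["father_name"]), normalize(b["mother_birth_name"]))
        families[key].append(b["child_name"])
    return dict(families)


def family_sizes(families):
    """Number of linked children per family, for comparison with census tabulations."""
    return {parents: len(children) for parents, children in families.items()}
```

The resulting family sizes are the quantity the text proposes comparing to census tabulations as a data quality check.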

Our second step (arrow 2) is to link marriage records by bride and groom name, exact date of birth (allowing for over-reporting of age; Blank et al. 2009), and place of birth (when available in the collection). Although 90 percent of women born in this period were ever married (Bailey et al. 2014), marriage registration was highly incomplete, so a match rate below 100 percent is expected.

This step can also be completed for some of G1 and G3.

The third step is to link G2 to their grandparents (G0) using the 1900 and 1880 censuses

(arrow 3). Parents’ (birth or married) names (G1) are linked to names in the 1900 census, which provides key information on birth place, age, and race. Next we link G1 to the 1880 census using their names only or names in addition to ages, birth place, and race (obtained from the 1900 link).

This step connects G2 to G0 and is important because it allows for the addition of G1’s early life family conditions, including G0 ancestry/heritage, economic circumstances such as occupation, race, and address.

The fourth step links four generations (G0, G1, G2, G3) to the full-count 1940 census (arrow 4). This step uses full names (including birth and married names of women), exact birth dates/age, and birth place. The 1940 Census is the first census to include rich information on educational attainment, wages and salary, and many employment outcomes. This is only possible

for some of G0 (many will have passed away before 1940), but most of G1, G2 as adults (in their marriage families), and G3 as children (in birth families).

Importantly, the 1940 census allows cross-validation of the linkages in steps 1 and 2 in

Figure 3 using birth place, age, children born (sample line), age at marriage (sample line), parent’s

birthplace (sample line), and spouse name.8 It also links the names of some children (G3), including their birth place and siblings, which can be compared to birth records in step 1. G1 and G2 can also be linked by name to the now fully indexed 1940 census (arrow 4). This links education and wages to the parents of G3 as children.

The final step (arrow 5) is to link G0-G3 to death records. The linking variables are full birth and/or married names, exact day of birth, place of birth (county/town), and parents’ names and place of birth when available in the collection. Almost all of G0-G2 will have died in the time span covered by the death records, as will many of G3. Because death records span almost the entirety of the 20th century for most collections, we will derive longevity for at least three generations. We also link infant deaths to parents’ names to fill in missing birth records, because many infant deaths were not recorded as births (and this helps us reconstitute families further).

8 The 1940 Census was the first census to include a long-form portion, a 5% sample of the full population that was asked additional questions. Selection into the long-form sample was determined by sample line, with sample lines 14 and 29 selected for additional questions. These included number of children born, age at first marriage, and parent’s birthplace. These variables are indicated with * in Figure 3.

IV. Linking Methodology

Correctly linked data are critical for population inferences (Abowd and Vilhuber 2005,

Campbell 2009, Kim and Chambers 2012). While human clerical review provides high quality links

of individuals across data sources, it is cost prohibitive for large population samples. Our approach

relies on machine learning disciplined by a new ground truth of clerically linked records.

The LIFE-M project has created a new ground truth using an independent human review

process. In this review process, two highly trained individuals choose from a set of computer-

generated potential links using name, date of birth (or age), and birth state. When the two initial

reviewers disagree, the records are re-reviewed by an additional three individuals to resolve discrepancies. Discrepancies occur infrequently; for example, reviewers disagreed on a unique match for 8.3 percent of male North Carolina infants and 10 percent of male Ohio infants when linking to the 1940 Census.9
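The two-stage review can be sketched as follows. The text does not state how the three additional reviewers resolve a discrepancy, so this sketch assumes a simple majority vote across all five reviews and flags ties as unresolved; that rule, the function names, and the `UNRESOLVED` marker are illustrative assumptions, not the project's documented procedure.

```python
from collections import Counter

UNRESOLVED = "unresolved"  # hypothetical marker for cases with no majority


def resolve_link(first_two, extra_three=None):
    """Return the chosen candidate ID, None for 'no match', or UNRESOLVED.

    first_two: choices of the two initial reviewers (candidate ID or None).
    extra_three: choices of the three additional reviewers, needed only
    when the first two disagree.
    """
    a, b = first_two
    if a == b:
        return a  # initial reviewers agree; no further review needed
    if extra_three is None:
        raise ValueError("initial reviewers disagree; three more reviews are needed")
    votes = Counter(list(first_two) + list(extra_three))
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return UNRESOLVED  # tie for the most votes (assumed handling)
    return ranked[0][0]


def discrepancy_rate(initial_reviews):
    """Share of records on which the two initial reviewers disagree."""
    disagreements = sum(1 for a, b in initial_reviews if a != b)
    return disagreements / len(initial_reviews)
```

The discrepancy rate computed this way corresponds to the share of records sent to the additional reviewers.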

A semi-automated process allows us to monitor the efficiency of different data trainers

(time per link) and link quality. The use of random “audit batches” also allows us to monitor the quality of data links for each trainer. The result of this process is a highly vetted, hand-matched

“ground truth” dataset. With funding from the NSF, LIFE-M has already created a “ground truth” sample of nearly 54,000 Ohio children and almost 15,000 parents linked to the 1940 census and death and marriage records. These data provide the “ground truth” basis for this project (Bailey et al. 2016).

In addition, we will use a genealogical “ground truth” sample for a small subsample of records. This has been created as part of the LIFE-M project in conjunction with Brigham Young

University’s data linking lab by research assistants skilled in family history. Research assistants

9 Discrepancy rates vary by linkage type, ranging from a low of 2.8 percent for birth certificates linked to death certificates (arrow 5) to 18.1 percent for birth records linked to other birth records (arrow 1).

sift through multiple sources of data (including all of those used in the LIFE-M project) to create record linkages and complete family trees. Although this type of linking is cost-prohibitive for larger projects and may be unrepresentative (e.g., the people who can be linked using this method are not representative of the population), the advantage of this comparison is that it has a very low rate of false links. Also, comparison of links produced by genealogical methods and those produced by the LIFE-M data trainers shows that they agree for 96% of cases.

V. Results from the Ohio Linking Project

We have produced a significant amount of high-quality ground truth data for Ohio that will be used in our machine learning models. This ground truth includes links of birth certificates to marriage and death records and the 1940 Census. Figure 4 demonstrates the completeness of Ohio’s birth records from 1868 to 2011 and death records from 1908 to 2012. Prior to Ohio’s entry into the Federal Registration Areas (1909 for the Death Registration Area and 1917 for the Birth Registration Area), there are no published Vital Statistics to use for a comparison. Figure 4 uses counts from the 1910 to 1940 censuses by age to generate appropriate denominators. The solid series shows the ratio of LIFE-M vital records to published vital statistics counts from the National Center for Health Statistics (NCHS). These ratios hover around 1 (the horizontal line in the figure) beginning in the early 1900s, which suggests that LIFE-M birth and death records capture very close to the universe of births and deaths for the entire 20th century. In some years, LIFE-M records exceed or fall below published statistics. This is likely due to duplicate records that we will remove with further data cleaning, as well as underreporting of infant deaths in census and vital statistics data.
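The completeness check described above amounts to a year-by-year ratio of record counts. The following is a minimal sketch of that calculation, not the project's own code: the function names, the 5 percent tolerance, and the counts are all hypothetical and chosen only for illustration.

```python
def completeness_ratios(lifem_counts, published_counts):
    """Return {year: LIFE-M count / published count} for years found in both."""
    ratios = {}
    for year, published in published_counts.items():
        if year in lifem_counts and published > 0:
            ratios[year] = lifem_counts[year] / published
    return ratios


def flag_years(ratios, tolerance=0.05):
    """Years whose ratio departs from 1 by more than `tolerance` (an assumed cutoff)."""
    return sorted(year for year, r in ratios.items() if abs(r - 1.0) > tolerance)


# Made-up counts for illustration only; these are not LIFE-M or NCHS figures.
lifem = {1920: 101_000, 1921: 98_500, 1922: 112_000}
published = {1920: 100_000, 1921: 100_000, 1922: 100_000}

ratios = completeness_ratios(lifem, published)
flagged = flag_years(ratios)
```

Years flagged with a ratio above 1 would be natural candidates for the duplicate-record cleaning mentioned in the text.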

Figure 5 shows the results of the Ohio linking to date. We began with a random sample of 13,270 birth certificates. Parents of these infants were linked to parents listed in the full set of Ohio birth certificates to produce a reconstructed family sample consisting of 53,721 siblings (shown by arrow 2 in Figure 3). Overall, we linked at least 51 percent of the 53,721 siblings in the reconstructed family data to the 1940 Census. In addition, 21 percent were linked to a marriage certificate and 19 percent were linked to their death record. These linkage rates are often similar to or higher than those found using other linkage methods, but we also assess the success of the linked LIFE-M Ohio samples in terms of representativeness and type I error.

Because birth certificates do not contain socio-demographic measures found in the census (race, age, or incomes of the parents), we make use of alternative features of these data to assess representativeness. Exact day of birth is ideal because it is as close to a continuous measure as we can get in historical records, and season of birth is strongly correlated with socio-economic characteristics in modern data (Buckles and Hungerman 2013). Figure 6 shows the distribution of day of birth of matched male Ohio infants relative to day of birth for the entire sample of male Ohio birth certificates. Although the distributions appear similar, a Kolmogorov-Smirnov test reveals the two distributions are statistically different from each other.
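For readers unfamiliar with the test, the two-sample Kolmogorov-Smirnov statistic is the largest vertical gap between two empirical distribution functions. The sketch below implements that statistic and the standard large-sample critical value in plain Python. It is an illustration only: the day-of-year values are hypothetical, and nothing here reproduces the computation behind Figure 6.

```python
import math


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Step both empirical CDFs past every observation equal to x (handles ties).
        while i < n and a[i] == x:
            i += 1
        while j < m and b[j] == x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_critical_value(n, m, alpha=0.05):
    """Large-sample critical value for the two-sample KS statistic."""
    c = math.sqrt(-0.5 * math.log(alpha / 2))
    return c * math.sqrt((n + m) / (n * m))


# Hypothetical day-of-year values, for illustration only.
all_births = list(range(1, 366))
linked_births = list(range(1, 366, 2))

d = ks_statistic(linked_births, all_births)
reject = d > ks_critical_value(len(linked_births), len(all_births))
```

Because the critical value shrinks as the sample sizes grow, very large samples can reject equality even when two distributions look similar to the eye.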

Bailey et al. (2016) tested representativeness of samples produced by other popular automated matching methods (mentioned in section IV) as well as LIFE-M. They find little evidence that LIFE-M’s clerical review or the automated linking methods provide representative samples of the population, which is consistent with findings in multiple papers (Abramitzky et al. 2012; Abramitzky et al. 2014; Collins and Wanamaker 2015). This could imply limited external validity of results using these samples, especially because the linked samples are more likely to be native born and more educated. Consequently, we will develop weights to ensure representativeness of the LIFE-M matched samples before release.
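The text does not specify how these weights will be constructed. One common approach, shown here purely as an assumption, is post-stratification: each linked record receives a weight equal to its cell's share of the population divided by that cell's share of the linked sample. The cell variable (nativity) and all counts below are hypothetical.

```python
from collections import Counter


def poststratification_weights(population_cells, linked_cells):
    """Weight per cell = population share / linked-sample share."""
    pop = Counter(population_cells)
    linked = Counter(linked_cells)
    n_pop, n_linked = len(population_cells), len(linked_cells)
    weights = {}
    for cell, count in linked.items():
        weights[cell] = (pop[cell] / n_pop) / (count / n_linked)
    return weights


# Hypothetical cells: the linked sample over-represents the native born.
population = ["native"] * 80 + ["foreign"] * 20
linked = ["native"] * 45 + ["foreign"] * 5

weights = poststratification_weights(population, linked)

# Weighted native-born share of the linked sample, which should match the population.
weighted_native_share = (45 * weights["native"]) / (
    45 * weights["native"] + 5 * weights["foreign"]
)
```

A cell with no linked records cannot be reweighted this way, so in practice cells would need to be coarse enough to contain linked observations.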

Bailey et al. (2016) also examined type I error by linkage approach. They compare the links of each automated algorithm to the LIFE-M clerically reviewed sample and, if the links are the same, treat them as correct. Discordant links were reviewed by two data trainers in a “police line-up” process, where the ground truth and algorithm links are presented alongside close candidates. Trainers then select the best match based on name, age, and place of birth information, which gives automated algorithms and LIFE-M equal chances at being chosen (or not). While the error rates of other automated matching methods ranged anywhere from 19 to 81 percent, the trainers only reversed their decision for the LIFE-M links for 2.2 percent of discordant matches, suggesting that the type I error of LIFE-M’s ground truth sample is low.
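One plausible reading of this procedure, offered as a sketch rather than as the calculation in Bailey et al. (2016), treats a link as an error when it disagrees with the ground truth and the reviewers did not select it in the line-up. The record identifiers, the dictionary layout, and the choice of denominator (all links made by the algorithm) are assumptions for illustration.

```python
def type_i_error_rate(algorithm_links, truth_links, review_choice):
    """Share of an algorithm's links judged incorrect.

    algorithm_links, truth_links: {record_id: matched_id}
    review_choice: {record_id: matched_id picked by reviewers for discordant links}
    """
    errors = 0
    for record_id, match in algorithm_links.items():
        if truth_links.get(record_id) == match:
            continue  # concordant links are treated as correct
        if review_choice.get(record_id) != match:
            errors += 1  # reviewers did not pick the algorithm's link
    return errors / len(algorithm_links)


# Hypothetical identifiers, for illustration only.
algorithm = {"b1": "c1", "b2": "c9", "b3": "c3", "b4": "c7"}
truth = {"b1": "c1", "b2": "c2", "b3": "c3", "b4": "c4"}
review = {"b2": "c2", "b4": "c7"}  # reviewers sided with the algorithm on b4

rate = type_i_error_rate(algorithm, truth, review)
```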

VI. Opportunities for Research

LIFE-M will ultimately provide a new resource for demographers, increasing sample sizes and broadening the representation of understudied subgroups. It will allow more systematic analyses of missing data and link rates, and reduce data barriers for path-breaking and high-impact fertility, mobility, health and aging studies.

LIFE-M data will be useful for a variety of analyses. One set of LIFE-M studies could replicate the volume of longitudinal work on men’s health for women, minorities, and the foreign born. LIFE-M data could also shed light on the following questions: How was the health of women or other understudied populations (who might live in worse neighborhoods or earn lower incomes) affected by environmental contaminants, infectious diseases, or policies? Do better measures of infant mortality change these findings?

LIFE-M will facilitate new work on the intergenerational determinants of health and mobility. For instance, what are the characteristics of intergenerational longevity transmission? What is the sibling (including sisters!) or parent-child longevity correlation? Do these relations vary across time or space? How has the relationship of life expectancy and demographic and socio-economic variables changed over time? Do early-life maternal characteristics affect later-life child health outcomes differently than paternal characteristics?

LIFE-M will also facilitate more research on families. How has assortative mating changed over the 20th century? Are there implications for inequality in children’s resources and health disparities? Has there always been a marriage longevity premium for men but penalty for women? LIFE-M also permits analyses of “shotgun” marriages, birth spacing, family size, and infant mortality rates over time and across space.

LIFE-M permits analyses of many previously unstudied policy interventions. How did the reduction in alcohol consumption with Prohibition (or the address-distance to counties/cities with legal alcohol sales after repeal) affect infant mortality rates and longevity? How did the diffusion of early vaccines or antibiotics reduce child and infant mortality rates?

Finally, LIFE-M is highly expandable. For example, we plan to digitize and incorporate cause of death for individuals in LIFE-M. Cause of death is currently written in script on death certificates but is not available in digital form. Digitizing this additional variable has the potential to make LIFE-M much more useful to researchers interested in aging and mortality, allowing them to examine relationships between a multitude of early-life and intergenerational factors, longevity, and the cause of death. Furthermore, LIFE-M is linkable to other historical data sources, such as enlistment records, as well as contemporary data sources. LIFE-M may also be linked to other large-scale data infrastructures, including projects at the Minnesota Population Center (MPC), Census Bureau, and surveys at the University of Michigan (HRS and PSID). For instance, a current proposal to the NIA for American Longitudinal Infrastructure for Research on Aging (ALIRA) would link LIFE-M to MPC’s Linked Historical Census Samples via a Historical Identification Key (HIK) and to the Census Longitudinal Infrastructure Project (CLIP) via a Personal Identification Key (PIK), which includes Medicare and Medicaid enrollment and the Survey of Income and Program Participation. These collaborations mean that enhancements to the health data in LIFE-M will directly enhance a variety of other closely related data infrastructure projects.

VII. References

Abowd, J. M. and L. Vilhuber (2005). "The Sensitivity of Economic Statistics to Coding Errors in Personal Identifiers." Journal of Business and Economic Statistics 23(2): 133-165.
Abramitzky, R., L. Platt Boustan and K. Eriksson (2014). "A Nation of Immigrants: Assimilation and Economic Outcomes in the Age of Mass Migration." Journal of Political Economy 122(3): 467-506.
Aizer, A., S. Eli, J. Ferrie and A. Lleras-Muney (2016). "The Long Term Impact of Cash Transfers to Poor Families." American Economic Review 106(4): 935-971.
Alexander, J. T., T. Gardner, C. G. Massey and A. O'Hara (2015). "Creating a Longitudinal Data Infrastructure at the Census Bureau." Retrieved September 29, 2016. Available at http://paa2015.princeton.edu/uploads/152688.
Bailey, M. J., S. Anderson and M. Henderson (2016). LIFE-M Ohio Boys and Men. University of Michigan.
Bailey, M. J., M. Guldi and B. J. Hershbein (2014). Is There a Case for a “Second Demographic Transition”: Three Distinctive Features of the Post-1960 U.S. Fertility Decline. In Human Capital and History: The American Record, edited by Boustan, L. P., C. Frydman and R. A. Margo. Cambridge, MA: National Bureau of Economic Research.
Bailey, M. J., M. Henderson and C. G. Massey (2016). How Do Automated Linking Methods Perform? Evidence from the LIFE-M Project. University of Michigan Working Paper.
Blank, R. M., K. K. Charles and J. M. Sallee (2009). "A Cautionary Tale about the Use of Administrative Data: Evidence from Age of Marriage Laws." American Economic Journal: Applied Economics 1(2): 128-149.
Buckles, K. S. and D. M. Hungerman (2013). "Season of Birth and Later Outcomes: Old Questions, New Answers." Review of Economics and Statistics 95(3): 711-724.
Campbell, K. M. (2009). "Impact of record-linkage methodology on performance indicators and multivariate relationships." Journal of Substance Abuse Treatment 36(1): 110-117.
Collins, W. J. and M. H. Wanamaker (2014). "Selection and Economic Gains in the Great Migration of African Americans: New Evidence from Linked Census Data." American Economic Journal: Applied Economics 6(1): 220-252.
Collins, W. J. and M. H. Wanamaker (2015). "The Great Migration in Black and White: New Evidence on the Selection and Sorting of Southern Migrants." Journal of Economic History 75(4): 947-992.
Feigenbaum, J. J. (2015). "Automated Census Record Linking: A Machine Learning Approach." Retrieved October 11, 2016. Available at http://scholar.harvard.edu/files/jfeigenbaum/files/feigenbaum-censuslink.pdf.
Ferrie, J. P. (1996). "A New Sample of Males Linked from the 1850 Public Use Micro Sample of the Federal Census of Population to the 1860 Federal Census Manuscript Schedules." Historical Methods 29(4): 141-156.
Ferrie, J. P. and J. Long (2013). "Intergenerational Occupational Mobility in Great Britain and the United States since 1850." American Economic Review 103(4): 1109-1137.
Goeken, R., L. Huynh, T. A. Lynch and R. Vick (2011). "New Methods of Census Record Linking." Historical Methods 44(1): 7-14.
Kim, G. and R. Chambers (2012). "Regression Analysis under Probabilistic Multi-Linkage." Statistica Neerlandica 66(1): 64-79.
Ruggles, S. (2006). "Linked Historical Censuses: A New Approach." History and Computing 14: 213-224.
Ruggles, S., J. T. Alexander, R. Goeken, M. B. Schroeder and M. Sobek (2010). Integrated Public Use Microdata Series (Version 5.0) [Machine-readable database]. Minneapolis: University of Minnesota.
Vick, R. and L. Huynh (2011). "The Effects of Standardizing Names for Record Linkage: Evidence from the United States and Norway." Historical Methods 44(1): 15-24.
Wimmer, L. T. (2003). Reflections on the Early Indicators Project: A Partial History. In Health and Labor Force Participation over the Life Cycle: Evidence from the Past, edited by Costa, D. L. Chicago, IL: University of Chicago Press: 1-11.


Figure 1: LIFE-M Holding Availability by Year and Linking Variables

State  Record Type  Name  Birth Date  Parent Name  Event Date  Event Location  Cause of Death
OH     B            x     x           x, m         x           x
OH     M            x     y           x            x           x
OH     D            x     x           x, m         x           x, b            x

Notes: Full legal names on a birth certificate are the name of the infant, on a marriage certificate the names of the bride and groom, and on a death certificate the name of the decedent. o on the arrows indicates entry of the state into the Federal Registration Area for births or deaths, x=data present, m=mother’s “maiden” name, and b=birthplace (town or county) of the primary individual.

Figure 2. LIFE-M Intergenerational Structure

Notes: G2 is our core sample of infant birth certificates for which LIFE-M will construct intergenerational and longitudinal data.

Figure 3. LIFE-M Linkage Procedure from Vital Statistics to Censuses and Other Datasets

[Figure 3 diagram, recoverable content: numbered arrows 1 through 5 connect birth, death, and marriage records to one another and to the 1880, 1900, and 1940 Censuses. Birth Records: infant full name (G2), day and place of birth, parents’ names (G1). Death Records: decedent full name (G0-G2), parents’ names (G0-G1), day and place of death. Marriage Records: bride and groom full names (G1-G3), day and place, parents’ names (G0-G2). Census variables listed include birth place, race, age, occupation, education, employment, wages, address, spouse name, children born*, and marriage*.]

Notes: G0: born <1860 (~ UA cohorts); G1: born 1870-1899; G2: born 1900-1929; G3: born 1930- (~HRS cohorts). Planned links to military records and ship manifests omitted for space reasons.

Figure 4. Completeness of Ohio Birth and Death Records

Panel A: Ohio Birth Records Relative to Published Vital Statistics and Census

Panel B: Ohio Death Records Relative to Published Vital Statistics and Census

Source: Ohio birth certificates, published vital statistics on births and deaths in the US, constructed births from the 1910-1940 Censuses

Figure 5. Pilot Project Linkage Rates of LIFE-M Sample from Ohio Births, 1909-1920

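[Figure 5 diagram, recoverable content: the Core Linking Sample consists of a random sample of birth records and the linked siblings of the random birth sample (N = 53,721). Link rates shown: Marriage Records 21%, Death Records 19%, 1940 Census 51%. The 1880 Census and 1900 Census are marked TBD. The diagram also includes a 1910 Census box and the labels “24.7%” and “In Progress”.]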

Notes: Ohio’s random sample contains 13,270 children (born from 1909-1920). There are 2.95 children per family in the family reconstitution versus 2.45 in the 1920 Census.

Figure 6. Representativeness of LIFE-M’s Linked Birth-1940 Census Records

Source: Bailey et al. (2016)
