
Digging into Data White Paper

Mining Microdata: Economic Opportunity and Spatial Mobility in Britain, Canada and the United States, 1850-1911

Baskerville, Peter, Professor of History and Classics/Humanities Computing, University of Alberta, Edmonton, Alberta, Canada.
Dillon, Lisa Y., Associate Professor and Co-Director, Programme de recherche en démographie historique, Département de Démographie, Université de Montréal, Montréal, Québec, Canada.
Inwood, Kris, Professor of Economics, University of Guelph, Guelph, Ontario, Canada.
Roberts, Evan, Assistant Professor of Population Studies, University of Minnesota, Minneapolis, Minnesota, United States of America.
Ruggles, Steven, Regents Professor of History and Population Studies, and Director of the Minnesota Population Center, University of Minnesota, Minneapolis, Minnesota, United States of America.
Schürer, Kevin, Professor of History and Pro Vice Chancellor, University of Leicester, Leicester, United Kingdom.
Warren, John Robert, Professor of Sociology, University of Minnesota.

Project progression

Mining Microdata: Economic Opportunity and Spatial Mobility in Britain, Canada and the United States was funded in Round 2 of the Digging into Data Challenge in 2011. The team learned of the award in October 2011, enabling a preliminary meeting of the project principals at the annual meetings of the Social Science History Association in November 2011. Although initially scheduled for two years, the grant was extended through two no-cost extensions because of significant delays in obtaining key data. A contract to obtain the British 1911 data in a format suitable for record linkage was not agreed until nearly the end of the initial grant period, and the data was not delivered until late 2014. The change to the data delivery schedule anticipated in the grant required us to revise the project’s analysis and publication timeline.
We have devoted more time than initially anticipated to describing the theory and practice of record linkage in large sets of individual-level records. Despite these challenges, we have presented initial analytical results on the first cohort (1850/1-1880/1) proposed for analysis, and developed a suite of analytical programs that will be run on the datasets constructed for the second cohort. Our white paper includes a publication on social mobility in the first cohort, as well as extensive discussion of our methods in published papers.

The project principals met to discuss and develop the project on several occasions during the grant period:

• May 2012 in Guelph, Ontario. This meeting was hosted by the Canadian team with local arrangements made by Kris Inwood, and held in conjunction with an international meeting about longitudinal data analysis in historical social science.
• April 2013 in Leicester, United Kingdom. This meeting was hosted by the British team, with local arrangements made by Kevin Schürer. We met for three days to outline plans for data construction and the eventual analytical papers. Our publications have followed the plans developed at this meeting, modified to accommodate the delayed production of the second cohort (1880/1-1910/11) of data.
• April 2014 in Vienna, Austria. The Canadian and British teams and Evan Roberts from the U.S. team met in Vienna in conjunction with the European Social Science History Conference, at which we presented a jointly authored paper.
• November 2014 in Toronto, Ontario. The Canadian and U.S. teams met in conjunction with the annual meetings of the Social Science History Association.

Project management

The team for this Digging into Data project had been working together for a decade on developing historical census data into a consistent international dataset as part of the North Atlantic Population Project (http://www.nappdata.org).
This was our first joint foray into substantive research, though many of the team members had worked on substantive research together in different groups of 2-3 people. We continued the style of project management that had been developed in the North Atlantic Population Project, and it worked well. We agreed on a schedule of in-person meetings early in the project, and established goals to be achieved in advance of those meetings. At the meetings we allocated, by mutual agreement, primary responsibility for different aspects of the project to particular institutions or individuals.

Key data creation, analysis and writing components were allocated as follows:

• Theoretical development of record linkage algorithms: Guelph team led by Kris Inwood with computer scientist Luiza Antonie.
• Evaluation of household relationships (father-son links) in Canadian census data, which did not explicitly enumerate household relationships: Montréal team led by Lisa Dillon.
• Development of methods for analysis of small area geographic context: British team led by Kevin Schürer.
• Supervision of record linkage production: Minnesota team led by Evan Roberts.
• Drafting of article on record linkage techniques and methods: Canadian team (Inwood, Dillon, Baskerville).
• First draft of article on comparative social mobility: Evan Roberts.

By allocating key leadership roles for particular tasks to different members of the team we ensured that the project would make progress in multiple areas, and that we would not duplicate effort. The work produced by each lead individual or team was reviewed by the other members of the collaboration prior to submission to conferences or journals. Throughout the project we remained in regular contact between in-person meetings by phone and email.

Project challenges

Our project faced two major, inter-related challenges.
At the time of grant submission we anticipated that we would take delivery of the 1911 British census dataset shortly after the grant began in early 2012 and complete linking of the 1881-1911 panel within 2012, allowing us to focus on data analysis in Year 2 of the grant. In fact, we did not receive the data until late 2014, primarily owing to delays in obtaining a license from Findmypast, the genealogical company that produced the data, after its corporate ownership changed. Without these necessary permissions we were unable to make progress on a critical component of the grant: creating a panel of men aged 0-20 in 1881, linked to their adult observations in 1911.

After obtaining the data, we ran into a second set of challenges: the methods we had derived to create linked datasets from smaller pairs of datasets were not computationally efficient for linking complete enumerations at both ends of the observation period. For all of our other cohort panels we had just one complete-count enumeration, as summarized in the following table.

Table 1. Sample densities in Mining Microdata cohorts

Country          1850/1-1880/1       1880/1-1910/11
                 sample densities    sample densities
Canada           20% to 100%         100% to 5%
Great Britain    2% to 100%          100% to 100%
United States    1% to 100%          100% to 1%

Our approach to record linkage relies on comparisons of names within blocks defined by the intersection of age, birthplace, and sex (and race in the United States). These are characteristics that should not change over time, and they allow identification of a set of individuals whose links are not biased by changes in social outcomes. Within each block defined by a single year of birth (surrounded by a small window to account for inaccuracies in the enumeration of ages), birthplace, and sex, we compare the name of every individual in one dataset with the name of every individual in the other. Thus for all individuals born in 1875 in a particular county (province or state) we compare the similarity of names between the two datasets.
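The blocking step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's production code: the `Person` record, the `candidate_pairs` function, and the default window of two years are all hypothetical names and values chosen for the example. It shows why the work grows quickly, since every person in a block of the earlier dataset is paired with every person in the matching block of the later one.

```python
from collections import defaultdict
from typing import Iterator, List, NamedTuple, Tuple


class Person(NamedTuple):
    """Hypothetical record holding only the time-invariant linking fields."""
    first_name: str
    last_name: str
    birth_year: int
    birthplace: str
    sex: str


def candidate_pairs(
    early: List[Person], late: List[Person], year_window: int = 2
) -> Iterator[Tuple[Person, Person]]:
    """Yield (early, late) pairs that share birthplace and sex and whose
    birth years differ by at most `year_window` (an assumed tolerance)."""
    # Index the later dataset by (birthplace, sex, birth year).
    by_block = defaultdict(list)
    for q in late:
        by_block[(q.birthplace, q.sex, q.birth_year)].append(q)

    # For each earlier record, look only in the blocks inside the year window.
    for p in early:
        for year in range(p.birth_year - year_window, p.birth_year + year_window + 1):
            for q in by_block.get((p.birthplace, p.sex, year), []):
                yield p, q


early = [
    Person("John", "Smith", 1875, "Ontario", "M"),
    Person("Jahn", "Smithson", 1876, "Ontario", "M"),
    Person("Mary", "Brown", 1875, "Ontario", "F"),
]
late = [
    Person("John", "Smith", 1875, "Ontario", "M"),
    Person("John", "Smyth", 1874, "Ontario", "M"),
    Person("Mary", "Brown", 1875, "Quebec", "F"),
]

pairs = list(candidate_pairs(early, late))
print(len(pairs))  # 4: each Ontario man is paired with both Ontario men
```

Only the pairs produced here would go on to a name comparison; the woman born in Ontario is never compared with the woman born in Quebec because they fall in different blocks.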
If there are 2,000 people in the first dataset and 1,000 people in the second dataset, we must make 2,000,000 (2,000 x 1,000) comparisons of the similarity of names. Links are made from the pool of people for whom there is no closely competing person with a similar name. Individuals with a common name in large entities are unlikely to be matched. John Smith born in Ontario, Yorkshire, or New York is never going to be matched because so many other individuals have the same representation in the data. Men with names that are genuinely rare or unique, but closely similar to another name in the dataset, will also not be matched. Thus “Jahn Smithson” from Ontario, Yorkshire, or New York may be the only man with that name in the few years surrounding his birth, but his name’s similarity to a more common name means it is possible he really was John Smith and his name was spelt incorrectly. Our record linkage procedure is designed to ensure that people are not linked because they are erroneously unique. If there is a close competitor in age or spelling from the same birthplace, a match is not made.

Our record linking procedures built on a significant existing literature. Our goals, however, differed significantly from those of most data mining applications of record linkage. The primary goal of most data mining has been to maximize the number of valid links. Our objective is different: we do not focus on maximizing the linkage rate. Instead, our procedures are designed to maximize the representativeness of the linked cases and the accuracy of the links. This means we pay close attention to potential sources of selection bias, and ignore information routinely used by other record-linkage procedures. Although we cannot eliminate selection bias for unobserved characteristics, we can adopt procedures that greatly reduce the potential for bias compared with previous approaches.

Our algorithm relies exclusively on characteristics that should not change over time.
At minimum, these variables are first name, last name (for men and for women who do not marry between observations), birth year, sex, and place of birth. Most record linkage software makes use of a broader range of characteristics to confirm links and resolve ambiguities, but that approach introduces bias. For example, if we used spouse’s characteristics to confirm linkages, we would bias the sample in favor of persons who remained married to the same person for multiple decades, and such persons are not representative with respect to either occupational or geographic mobility.
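The rule that a link is made only when there is no close competitor can be illustrated with a short Python sketch. Several things here are assumptions made for the example and are not taken from the project: the similarity measure is the standard library's `difflib.SequenceMatcher` ratio, used as a stand-in for whatever name comparison the project applied; the thresholds `accept=0.9` and `competitor=0.75` are hypothetical; and `name_similarity` and `link_unique` are illustrative names. The inputs are the full names of people who already share a block (birthplace, sex, and birth-year window).

```python
from difflib import SequenceMatcher
from typing import Dict, List


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; a stand-in measure, not the project's own."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_unique(
    early: List[str],
    late: List[str],
    accept: float = 0.9,       # assumed threshold for accepting a link
    competitor: float = 0.75,  # assumed threshold for a "close competitor"
) -> Dict[int, int]:
    """Link early[i] to late[j] only if they are similar enough and neither
    record has any other close competitor in the opposite dataset."""
    sims = [[name_similarity(a, b) for b in late] for a in early]
    links: Dict[int, int] = {}
    for i, row in enumerate(sims):
        if not row:
            continue
        # Best-scoring candidate in the later dataset.
        j = max(range(len(row)), key=row.__getitem__)
        if row[j] < accept:
            continue
        # Reject if another later record is also close to this early record...
        rival_late = any(row[k] >= competitor for k in range(len(late)) if k != j)
        # ...or another early record is also close to the chosen later record.
        rival_early = any(
            sims[m][j] >= competitor for m in range(len(early)) if m != i
        )
        if not rival_late and not rival_early:
            links[i] = j
    return links


early_names = ["John Smith", "Jahn Smithson", "Ebenezer Cartwright"]
late_names = ["John Smith", "John Smith", "Jahn Smithson", "Ebenezer Cartwright"]

print(link_unique(early_names, late_names))  # {2: 3}
```

In this toy block only Ebenezer Cartwright is linked. John Smith has two identical candidates, and Jahn Smithson, although he has an exact match, is too close to the John Smiths to be accepted. Because the function sees only names within a block of time-invariant characteristics, it cannot use spouse, occupation, or residence to resolve ambiguity, which mirrors the trade of a lower linkage rate for less selection bias.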