Identifying Households for Historical Censuses to Generate

Longitudinal Data

A Thesis

Presented to

The Faculty of Graduate Studies

of

The University of Guelph

by

Shada Omar Zarti

In partial fulfilment of requirements

for the degree of

Master of Science in Computer Science

June, 2017

© Shada Omar Zarti, 2017

ABSTRACT

Identifying Households for Historical Censuses to Generate Longitudinal Data

Shada Omar Zarti

University of Guelph, 2017

Advisors: Dr. Gary Gréwal, Dr. Luiza Antonie

The availability of historical censuses and advances in automatic record-linking techniques provide social scientists and historians with research opportunities based on longitudinal data. Automatically linking the same individuals and households across multiple sources creates longitudinal data more quickly and with less effort. The most common way to do this is to link individual records (pairwise linkage). More recently, a strategy of linking groups of records has been used. Unfortunately, in some historical censuses, household identifiers (HIDs) were not recorded at the time of the enumeration or were not transcribed into the digital collections. In this thesis, we link four Canadian historical censuses (1871, 1881, 1891, and 1901) using both pairwise and group-linkage methods. We develop and implement a method to identify HIDs in the 1891 and 1901 censuses automatically. Then, we use this new information to generate longitudinal data that follows 159,872 Canadians over three decades, from 1871 to 1901.

Acknowledgments

I would first like to thank my supervisors, Dr. Gary Gréwal and Dr. Luiza Antonie, for their continuous support of my Master's study and research, and for their patience, motivation, enthusiasm, and immense knowledge. Their guidance helped me throughout my study, research, and the writing of this thesis. They always steered me in the right direction. I could not have imagined working with better supervisors for my Master's study.

I would also like to acknowledge Dr. Kris Inwood. I am gratefully indebted to him for his very valuable comments on my work in this thesis.

Finally, I must express my very profound gratitude to my parents and to my husband for providing me with unfailing support and continuous encouragement throughout my years of study and through the process of researching and writing this thesis. This accomplishment would not have been possible without them. Thank you.

Table of Contents

List of Tables

List of Figures

1 Introduction
  1.1 Motivation
    1.1.1 Why do we need to link Canadian historical census records?
    1.1.2 Why do we need to use a group linkage?
    1.1.3 Why do we need to generate household identifiers?
  1.2 Thesis Statement
  1.3 Approach
  1.4 Contributions
  1.5 Organization

2 Background and Related Work
  2.1 Record Linkage Overview
  2.2 Record Linkage Process
    2.2.1 Data preprocessing
    2.2.2 Blocking
    2.2.3 Comparison
    2.2.4 Classification
    2.2.5 Evaluation
    2.2.6 Pairwise Record Linkage
    2.2.7 Group Record Linkage
  2.3 Historical Record Linkage
    2.3.1 Challenges of Household Linkage

3 Historical Census Data
  3.1 Canadian Historical Census Records
  3.2 The 1891 Census Records
  3.3 The 1901 Census Records
  3.4 Data Issues

4 Automatic Household Identification for Historical Census Data
  4.1 Household Identification - Methodology
    4.1.1 First-Pass Assignment of HID
    4.1.2 Resolving Suspended Records
      Case 1
      Case 2
      Case 3
      Case 4
      Case 5
    4.1.3 Sliding Window Algorithm
  4.2 Complexity
  4.3 Evaluation and Results

5 Linking Canadian Historical Census Collocations
  5.1 Method Overview
    5.1.1 The PiM System
    5.1.2 The Disambiguation System
    5.1.3 Links Integration
  5.2 Evaluation and Results
  5.3 The Disambiguation System with Different Group Similarity Techniques
    5.3.1 Set Similarity Measures
    5.3.2 Bipartite Matching Technique (BM)
    5.3.3 Winkler Bipartite Matching Technique (WBM)
    5.3.4 Applying Thresholds
    5.3.5 Evaluation and Results
  5.4 The Bias of the Linkage Process

6 Conclusions and Future Work
  6.1 Future Work

Bibliography

List of Tables

2.1 Data evaluation
3.1 The common attributes between the 1891 and 1901
4.1 The percentage of the 5 cases of the suspended records in the censuses
4.2 The use of the family number attribute in 1901 census
4.3 A relation to household member example
4.4 Incomplete information example
4.5 Given name instead of surname example
4.6 Member-record before head-record example
4.7 Spelling errors example
4.8 Head equivalent example
4.9 Suspended records of case 4 example
4.10 Different PIDs range in consecutive pages example
4.11 The percentage of the suspended records in the censuses before and after applying sliding-window search
4.12 Summary of household information
5.1 Linking census results: The PiM system
5.2 Linking census results: The disambiguation system
5.3 Set similarity measures
5.4 Evaluation results of the disambiguation system versions

List of Figures

2.1 Record Linkage Process
2.2 Types of links
2.3 Two households from two censuses
3.1 A cropped image of the 1891 Canadian census page
3.2 Transcribed 1891 census pages
3.3 The Frequency of Page Size
4.1 Scanning through census pages
4.2 Assigning HIDs flowchart
4.3 A sliding window search through census pages
4.4 The distribution of household size of the manual HID (1871, 1881) and the automatic HID (1891, 1901)
4.5 Number of households and average number of people per household, , 1851 to 2011. Reprinted from The shift to smaller households over the past century, by , April 8 2017, retrieved from http://www.statcan.gc.ca/pub/11-630-x/11-630-x2015008-eng.htm
5.1 The process of generating longitudinal data over 3 decades
5.2 The process of the PiM system
5.3 The process of the disambiguation system
5.4 Example of disambiguating multiple groups. (1) Starting one to many and many to one link groups (2) Disambiguated one to many link groups (3) Disambiguated many to one link groups (4) Final set of single links, by Richards, 2014, retrieved from [42]
5.5 Same person across 4 censuses
5.6 The generated longitudinal data
5.7 Venn diagram of 2 households
5.8 Two linked households example
5.9 Bipartite graph of households links
5.10 Bias of the PiM and disambiguation systems - Gender
5.11 Bias of the PiM and disambiguation systems - Age by Female
5.12 Bias of the PiM and disambiguation systems - Age by Male
5.13 Bias of the PiM and disambiguation systems - Marital status
5.14 Bias of the PiM and disambiguation systems - Males Marital status
5.15 Bias of the PiM and disambiguation systems - Female Marital status
5.16 Bias of the PiM and disambiguation systems - Birthplace
5.17 Bias of the PiM and disambiguation systems - Origin
5.18 Bias of the PiM and disambiguation systems -
5.19 Bias of the versions of the disambiguation system - Gender
5.20 Bias of the versions of the disambiguation system - Age by Females
5.21 Bias of the versions of the disambiguation system - Age by Males
5.22 Bias of the versions of the disambiguation system - Marital Status
5.23 Bias of the versions of the disambiguation system - Marital Status by Females
5.24 Bias of the versions of the disambiguation system - Marital Status by Males
5.25 Bias of the versions of the disambiguation system - Birthplace
5.26 Bias of the versions of the disambiguation system - Origin
5.27 Bias of the versions of the disambiguation system - Religion

Chapter 1

Introduction

Record linkage is the process of efficiently finding data records that describe the same entity (e.g., the same person) across various data sources [49]. Pairwise linkage, which considers pairs of individual records, is the most common record-linkage method [49]. In this process, the decision on whether two data records refer to the same entity is based solely on the similarity between their common attributes (e.g., name, date of birth). Recently, group linkage has received a lot of attention [49]. In contrast to pairwise linkage, group linkage considers the groups that individuals belong to (e.g., entire households) in addition to the individual records, in order to increase the linkage rate.

Social scientists, economists, and historians are interested in linking historical census data to build longitudinal data [21]. Longitudinal data hold the same type of information on the same entity at multiple points in time. Such data provide a wealth of information about past populations, thus enabling social scientists to perform in-depth studies of how the social and economic features of a society influenced how its people lived. However, to apply group-record linkage techniques to census data, group identifiers need to be available. Households of individuals are the natural groups in census data, but historical censuses usually lack Household Identifiers (HIDs). Finding HIDs is non-trivial, due to the large volume of records involved (often > 10^6), transcription and digitization errors, and a limited number of features (e.g., a person's name, gender, relation to the person considered as the head of the household, and limited geographical information about the household). Previous approaches generated households manually, but this is slow and can only be used for a relatively small number of data records. Identifying HIDs for millions of census records requires an automated approach.

In this thesis, we aim to link four Canadian census datasets, namely the 1871, 1881, 1891, and 1901 censuses, to generate longitudinal data over three decades. The 1871 and 1881 censuses were linked in [4] and [42], and we build on these studies to link the four censuses. However, the 1891 and 1901 censuses lack HIDs, so we present a household-identification system to generate HIDs for the 1891 and 1901 Canadian censuses. We then use the HIDs produced by the system to link the 1891 and 1901 censuses. Finally, we link the four Canadian historical censuses: 1871, 1881, 1891, and 1901.

1.1 Motivation

1.1.1 Why do we need to link Canadian historical census records?

The Canadian historical census data have been used in historical research to study various topics, such as class, wealth, gender, occupation, political behavior, and social structures [31], because of their rich systematic detail and the lack of alternative sources describing the population [5]. The historical census data are used to track individuals through time by linking records between censuses to generate longitudinal data. Currently, four Canadian historical censuses are digitized and available to link: 1871, 1881, 1891, and 1901. The 1871 and 1881 censuses were successfully linked twice, once using pairwise linkage [4], and then using household information [42] identified manually by domain experts. The resulting longitudinal data were used to study the change in Canadian work patterns during the 1870s [2]. In this thesis, our goal is first to link the 1891 and 1901 censuses, and then to link them to the 1871 and 1881 censuses. The result is longitudinal data that spans three decades and opens new opportunities to study Canadian society in the nineteenth and early twentieth centuries. However, as stated above, both the 1891 and 1901 datasets lack HIDs.

1.1.2 Why do we need to use a group linkage?

Group linkage has been shown to be a superior linkage technique in many studies [21, 39, 42] because it increases the linkage rate. Pairwise record linkage only compares individual records; however, it is difficult to decide whether two identical records represent the same individual or distinct individuals. In particular, the matching of records often relies on a limited number of attributes, which can lead to groups of people with similar attributes and thus makes the linkage process difficult. Hence, researchers have become interested in group linkage, which compares individual records and the groups they belong to. For example, two people from two censuses are matched based on the similarity of their personal information and the similarity of their households. The use of group-linkage techniques has increased the amount and quality of generated data [21, 39, 42].

1.1.3 Why do we need to generate household identifiers?

Group-linkage techniques require ways to group records together. In census data, people from the same household can be grouped together. However, household information is not available in many historical censuses, either because it was not recorded at the time of enumeration or because it was not transcribed into the digital datasets. Hence, identifying HIDs is a critical step in the group linkage of historical censuses, but it is not an easy task because census data are usually large and of low quality. Previous studies have either generated HIDs for census data manually, which is expensive in time and human resources [6], or generated HIDs automatically for only small samples of censuses (i.e., containing only thousands of records, unlike real-world census data) [21]. In general, automatically identifying HIDs for a full census would save an enormous amount of time and effort. In this thesis, we present an algorithm for generating HIDs for the 1891 and 1901 Canadian censuses. This algorithm is specific to the 1891 and 1901 censuses, but its ideas could be generalized to other census data with missing HIDs.

1.2 Thesis Statement

This thesis demonstrates that a longitudinal dataset can be constructed by linking four Canadian historical census datasets: 1871, 1881, 1891, and 1901. The ability to generate HIDs allows the 1891 and 1901 censuses to be linked by group-linkage techniques, which in turn increases the amount of longitudinal data constructed by linking the four censuses. A separate goal of this thesis is to investigate the effect of using different group similarity measures with the group-linkage technique.

1.3 Approach

The main goal of linking historical census data is to provide social scientists with accurate longitudinal data that track as many individuals as possible. To achieve this goal, researchers suggest using group linkage rather than pairwise linkage to link censuses [21, 39, 42], because it has achieved a higher linkage rate with a lower number of false matches (i.e., where an entity from one dataset is incorrectly linked to a different entity in another dataset). To apply group linkage to census data, HIDs should be available. Hence, we develop an automatic HID identification system. In this work, we generate HIDs for the 1891 and 1901 censuses. These HIDs enable us to link the 1881 census to the 1891 census and the 1891 census to the 1901 census. The 1871 and 1881 censuses were linked in [4] and [42]. The result of linking the four datasets is a longitudinal dataset that provides social scientists with new research opportunities.

1.4 Contributions

The main contributions of this thesis are listed below:

• The first automatic household-identification method that uses domain knowledge is proposed for generating HIDs of the 1891 and 1901 Canadian historical censuses.

• The first linking of the 1871, 1881, 1891, and 1901 Canadian historical census datasets.

• The new longitudinal data tracks 159,872 Canadians over three decades in the nineteenth and early twentieth centuries. This longitudinal data provides scientists with more research opportunities that span a longer window of time in Canadian history.

1.5 Organization

The remainder of this thesis is organized as follows: Chapter 2 discusses background information and previous work related to this thesis. Chapter 3 provides information on the datasets used in our research, along with the challenges they present. Chapter 4 presents a novel algorithm to find HIDs for the 1891 and 1901 Canadian census data. Chapter 5 explains the process of linking the four Canadian censuses using a group-linkage system. This chapter also includes an exploration of several similarity measures that can be used in the disambiguation record linkage system. Finally, Chapter 6 highlights the achievements and important conclusions of this research, along with ideas for future work. Publications related to this work include [3].

Chapter 2

Background and Related Work

This chapter introduces the background material and reviews the literature relevant to this thesis. Section 2.1 provides an overview of record linkage. Section 2.2 describes the process of record linkage, including the main steps, followed by an overview of the pairwise linkage and group-linkage methods. Section 2.3 reviews the literature on record linkage as applied to historical records.

2.1 Record Linkage Overview

Record linkage has been studied for decades, and scientists refer to it with various terms, including entity resolution, deduplication, and data integration. In 1946, Dunn proposed the idea of record linkage. He believed that "a book of life" for individuals could be created by linking health-care records (e.g., birth and death records) using (primarily) birth certificate numbers [16] as personal identifiers. However, not many datasets have unique identifiers. In 1959, Newcombe and his colleagues developed a computerized record-linkage method that used multiple record identifiers, such as name and date of birth, to determine the matched records between datasets [37]. In 1969, Fellegi and Sunter developed a mathematical model for the problem of record linkage [20]. A comprehensive overview of record linkage is presented in [18, 49].

Linking various data sources can provide scientists with more research opportunities than a single data source [49]. For example, linking cancer registry data, vital records, and birth records was the key strategy used to investigate the possibility of an unusual association between some neural tube defects in children and the subsequent development of cancer in their parents [46]. This study could not have been performed without the integrated data.

2.2 Record Linkage Process

Record linkage identifies records that describe the same entity in various datasets even if they have no common unique identifiers (e.g., social security number). If two or more datasets share unique identifiers, a standard database join operation is enough to link the datasets. However, such identifiers are not available in many real-world datasets. Hence, a record-linkage system that compares other common attributes, such as name, address, age, and gender, is often required to link datasets. Taking two or more datasets as input, a record-linkage system typically finds matches by following five steps (shown in Fig 2.1): data pre-processing, blocking, comparison, classification, and evaluation. Each step is described in the following subsections.

Figure 2.1: Record Linkage Process

2.2.1 Data preprocessing

Data pre-processing, such as data cleaning and/or standardization, is often the first step in the record linkage process. Usually, the datasets to be linked have various quality issues, such as incomplete or missing values, inconsistent values, and/or noisy data. Linking such data is difficult and may generate incorrect results. Also, the same information may be coded using different formats in different datasets, which makes comparing attributes a challenging task. For example, "John Smith" and "Smith J" could be two different formats that refer to the same name. Consequently, a cleaning and standardization process that improves data quality and makes record comparison possible is an important step.

2.2.2 Blocking

In the blocking step, record pairs are subdivided into blocks to speed up the linkage process [28]. For example, consider two datasets, A and B, each with 5000 records. Comparing each of the records in A to each of the records in B requires 25 × 10^6 comparisons. In general, the total number of comparisons grows quadratically with the size of the datasets. Blocking these pairs into groups and then comparing only the records in the same group reduces the number of comparisons. In practice, records are grouped using a blocking key. For example, the first two letters of a surname can be used as a key to block records. All records with the same blocking-key value are placed in the same group, and the comparison step is executed only between these records. Note that these groups should include all possible matches, because only record pairs within these groups can be linked. Choosing and applying a suitable blocking key is critical to achieving high-quality record linkage results.
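The blocking idea described above can be illustrated with a minimal Python sketch. The record layout (dictionaries with `id` and `surname` fields), the function names, and the toy data are assumptions made for illustration; they are not taken from the systems used in this thesis. The blocking key is the first two letters of the surname, as in the example in the text.

```python
from collections import defaultdict


def blocking_key(record):
    """Hypothetical blocking key: first two letters of the surname, upper-cased."""
    return record["surname"][:2].upper()


def candidate_pairs(dataset_a, dataset_b, key=blocking_key):
    """Yield only the (a, b) pairs whose records share a blocking-key value."""
    blocks_b = defaultdict(list)
    for rec_b in dataset_b:
        blocks_b[key(rec_b)].append(rec_b)
    for rec_a in dataset_a:
        # Records of B in a different block are never compared with rec_a.
        for rec_b in blocks_b.get(key(rec_a), []):
            yield rec_a, rec_b


# Toy data (illustrative only).
dataset_a = [
    {"id": "a1", "surname": "Smith"},
    {"id": "a2", "surname": "Smyth"},
    {"id": "a3", "surname": "Jones"},
]
dataset_b = [
    {"id": "b1", "surname": "Smith"},
    {"id": "b2", "surname": "Jonas"},
    {"id": "b3", "surname": "Brown"},
]

all_pairs = len(dataset_a) * len(dataset_b)  # comparisons without blocking
blocked = [(a["id"], b["id"]) for a, b in candidate_pairs(dataset_a, dataset_b)]
print(all_pairs, len(blocked), blocked)
```

In this toy case, blocking reduces nine comparisons to three. The sketch also shows the risk noted in the text: a true match whose surnames differ in the first two letters would fall into different blocks and could never be linked.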

2.2.3 Comparison

In the comparison step, the records in each block generated in the blocking step are compared to produce feature vectors γ (defined in Equ 2.1) that contain the similarity score between each corresponding attribute pair for every compared record pair. The feature vector describes how alike the attributes of the record pair are. Each dataset has a set of attributes. Let $t = \{t_1, \ldots, t_i\}$ denote the $i$ attributes that appear in both datasets, A and B. The vector γ contains the similarity score $s_i$ for each attribute $t_i$ of the record pair (a, b):

$$\gamma[(a, b)] = \{\, s_1[(a, b)], \ldots, s_i[(a, b)] \,\} \qquad (2.1)$$

There are two comparison strategies: exact comparison and approximate comparison [28]. Exact comparison is used when two attributes are required to be identical (e.g., grades). Approximate comparison uses similarity measures (i.e., real-valued measures that quantify the similarity between two objects) to assign a similarity score between 0 (no match) and 1 (exact match) to the compared attributes. The more similar the attribute values are, the closer the similarity score is to 1. For example, when the two names "MARTHA" and "MARHTA" are compared, the Jaro-Winkler measure (i.e., a similarity measure for string data [48]) assigns a similarity score of 0.96, which means these names are very similar. The Jaro-Winkler measure is used to tolerate spelling errors when comparing strings.
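As an illustration of approximate comparison, the following Python sketch implements the standard Jaro and Jaro-Winkler formulas and builds a small feature vector in the sense of Equ 2.1. The prefix scale of 0.1, the maximum prefix length of 4, the attribute names, and the function `comparison_vector` are assumptions for illustration; the thesis does not state which implementation or parameters its systems use.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(c1 != c2 for c1, c2 in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro-Winkler: Jaro similarity boosted for a shared prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def comparison_vector(a, b):
    """Hypothetical feature vector: approximate scores for names, exact for gender."""
    return [
        jaro_winkler(a["first_name"], b["first_name"]),
        jaro_winkler(a["surname"], b["surname"]),
        1.0 if a["gender"] == b["gender"] else 0.0,
    ]


rec_a = {"first_name": "MARTHA", "surname": "SMITH", "gender": "F"}
rec_b = {"first_name": "MARHTA", "surname": "SMITH", "gender": "F"}
print(round(jaro_winkler("MARTHA", "MARHTA"), 2))  # 0.96
print([round(s, 2) for s in comparison_vector(rec_a, rec_b)])
```

For "MARTHA" and "MARHTA" all six characters match with one transposition, giving a Jaro score of about 0.944; the shared prefix "MAR" raises this to about 0.961, in line with the value of 0.96 quoted above.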

2.2.4 Classification

Classification is the task of predicting the class of record pairs (match or non-match) based on their vectors γ from the comparison step. Classification methods can be divided into two major types: supervised and unsupervised methods [28]. Supervised methods infer a model from labeled data, which consist of a set of labeled record pairs (match and non-match) [28]. The dataset is split into a training dataset and a testing dataset. The training dataset is used to infer the classification model. The testing dataset is used to assess the performance of the inferred model, which is then used to classify new record pairs. Decision trees [41], Neural Networks [45], and Support Vector Machines (SVM) [30] are examples of supervised classifiers. However, labeled data are often not available, and the labeling process is time consuming and may require domain knowledge. In this case, an unsupervised method may be required. Unsupervised methods infer a model from unlabeled data (i.e., record pairs), and the model is then used to classify new record pairs [28]. These methods look for similarity between record pairs to determine whether they can be grouped into matches and non-matches without labeled data. K-means clustering [29] is a common method used to assign record pairs to matched or non-matched clusters. Other methods exist, including fuzzy clustering [17] and hierarchical clustering [28].
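To make the unsupervised case concrete, the sketch below clusters record pairs into two groups with a simplified one-dimensional, two-cluster variant of k-means. It is only an illustration under stated assumptions: each pair is reduced to a single score (for example, the sum of the entries of its vector γ), the initial centres are the smallest and largest scores, and the function name `two_means` and the example scores are hypothetical. It is not the classifier used by the systems in this thesis.

```python
def two_means(scores, iterations=20):
    """Label each score "match" or "non-match" with a 1-D, two-cluster k-means."""
    lo, hi = min(scores), max(scores)  # initial cluster centres (an assumption)
    for _ in range(iterations):
        low_cluster = [s for s in scores if abs(s - lo) <= abs(s - hi)]
        high_cluster = [s for s in scores if abs(s - lo) > abs(s - hi)]
        new_lo = sum(low_cluster) / len(low_cluster)
        new_hi = sum(high_cluster) / len(high_cluster) if high_cluster else hi
        if (new_lo, new_hi) == (lo, hi):
            break  # centres stopped moving
        lo, hi = new_lo, new_hi
    # Pairs closer to the high-similarity centre are labelled as matches.
    return ["match" if abs(s - hi) < abs(s - lo) else "non-match" for s in scores]


# Hypothetical summed similarity scores for six record pairs (three attributes each).
pair_scores = [2.96, 2.90, 0.80, 1.10, 2.75, 0.95]
labels = two_means(pair_scores)
print(labels)
```

No labeled pairs are needed here, which is the attraction of unsupervised methods; the trade-off is that nothing guarantees the high-similarity cluster contains only true matches.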

2.2.5 Evaluation

The quality of the results produced by a record-linkage system is difficult to assess without knowing the correct matches. When labeled testing data are not available, a record-linkage system can be evaluated by counting the number of generated matches (i.e., the linkage rate), but there is no guarantee that the generated links are correct. In such a situation, it may be possible to have the links reviewed by a domain expert. However, this is only feasible if the number of links is reasonably small.

When labeled data are available, other evaluation methods can be used to assess a record-linkage system. Comparing the link labels produced by the record-linkage system with those in the testing data gives four possible results (as shown in Table 2.1): True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN). A True Positive link (TP) is a record pair that appears as a match in the testing dataset and in the record linkage output. A False Negative link (FN), which is considered an error, is a record pair that appears as a match in the testing data and as a non-match in the record linkage output. A False Positive link (FP), which is also considered an error, is a record pair that appears as a non-match in the testing data and as a match in the record linkage output. A True Negative link (TN) is a record pair that is identified as a non-match by both the testing dataset and the record linkage system.

Table 2.1: Data evaluation

                             Testing data: Match    Testing data: Non-match
Linkage system: Match        TP                     FP
Linkage system: Non-match    FN                     TN

Figure 2.2: Types of links

There are different evaluation measures that use the testing dataset to evaluate the performance of record linkage systems. True Positive Rate (TPR) and False Positive Rate (FPR) are the two evaluation measures used in this thesis. The TPR is defined in Equ 2.2, and the FPR is defined in Equ 2.3. Social science studies typically require high TP and low FP.

$$TPR = \left( \frac{TP}{TP + FN} \right) \times 100 \qquad (2.2)$$

$$FPR = \left( \frac{FP}{TP + FP} \right) \times 100 \qquad (2.3)$$
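The two measures can be computed directly from sets of links, as in the following sketch. It assumes that both the system output and the testing data are available as sets of (record in A, record in B) pairs; the function name `evaluate_links` and the example identifiers are hypothetical. The FPR follows Equ 2.3 as defined in this thesis, with TP + FP in the denominator.

```python
def evaluate_links(predicted, truth):
    """Compute TP, FP, FN, TPR (Equ 2.2) and FPR (Equ 2.3) from two sets of links."""
    tp = len(predicted & truth)   # links found by the system that are true matches
    fp = len(predicted - truth)   # links found by the system that are not true matches
    fn = len(truth - predicted)   # true matches the system missed
    tpr = 100 * tp / (tp + fn) if tp + fn else 0.0
    fpr = 100 * fp / (tp + fp) if tp + fp else 0.0
    return {"TP": tp, "FP": fp, "FN": fn, "TPR": tpr, "FPR": fpr}


# Toy example: four true matches, three links produced by the system.
truth = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
predicted = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}
result = evaluate_links(predicted, truth)
print(result)
```

Here the system recovers two of the four true matches (TPR = 50%), and one of its three links is wrong (FPR of about 33.3%). True negatives are not needed for either measure, which is convenient because the number of non-matching pairs is usually very large.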

Linkage Rate (LR) is another evaluation measure that is used to assess the quality of results. There are two types of matches, single links and multiple links, and the linkage rate considers only the single links. A single link (a, b), as shown in Fig 2.2, is a record a from dataset A linked as a match to only one record from dataset B (a one-to-one link). A multiple link, as shown in Fig 2.2, arises either when a record c from dataset A is linked as a match to more than one record from dataset B (one-to-many links), such as (c, d) and (c, f), or when a record f from dataset B is linked to more than one record from dataset A (many-to-one links), such as (c, f) and (e, f). Multiple links generally arise when a limited number of record attributes are compared [42]. For example, when only names are compared, "Smith J" in one dataset could be linked to "John Smith", "Jones Smith", and "Jacob Smith" in the other dataset. These links are considered ambiguous, and only single links are evaluated [42]. As a result, the linkage rate is defined as the number of single links expressed as a percentage of the number of records in the first dataset (Equ 2.4).

$$LR = \left( \frac{\#\,\text{Single Links}}{\#\,\text{Records in First Dataset}} \right) \times 100 \qquad (2.4)$$
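The distinction between single and multiple links, and the linkage rate of Equ 2.4, can be sketched as follows. The link labels reuse the letters from the description of Fig 2.2; the function names and the assumption that dataset A contains exactly the three records a, c, and e are illustrative only.

```python
from collections import Counter


def split_links(links):
    """Separate one-to-one (single) links from ambiguous (multiple) links."""
    count_a = Counter(a for a, _ in links)  # how often each record of A is linked
    count_b = Counter(b for _, b in links)  # how often each record of B is linked
    single = [(a, b) for a, b in links if count_a[a] == 1 and count_b[b] == 1]
    multiple = [(a, b) for a, b in links if count_a[a] > 1 or count_b[b] > 1]
    return single, multiple


def linkage_rate(links, n_records_first):
    """Equ 2.4: single links as a percentage of the records in the first dataset."""
    single, _ = split_links(links)
    return 100 * len(single) / n_records_first


# (a, b) is a single link; c is linked to d and f; f is linked to c and e.
links = [("a", "b"), ("c", "d"), ("c", "f"), ("e", "f")]
single, multiple = split_links(links)
rate = linkage_rate(links, n_records_first=3)  # assumes A = {a, c, e}
print(single, multiple, round(rate, 1))
```

Only (a, b) survives as a single link, so the linkage rate in this toy case is one record out of three, or about 33.3%. The three discarded links may still contain true matches, which is the motivation for the disambiguation approach discussed later.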

2.2.6 Pairwise Record Linkage

The linkage process explained above is called pairwise linkage, as it only handles pairs of individual records. In this process, whether two records from different datasets refer to the same entity is determined by how closely the values of their attributes resemble each other. However, it is usually not easy to decide whether matched records represent the same entity or distinct entities. For instance, two people named "John" who live in the same area and are the same age may be neighbours rather than the same person. Another common problem is the large number of ambiguous links (multiple links), which are discarded even if they contain true matches. For instance, in [42] the multiple links make up roughly 92% of all matched links returned by the classifier in the pairwise linkage step. To overcome these difficulties, researchers tend to employ a group-linkage technique [7, 22, 23, 34, 39], which is described in the next subsection. In [42], the author presented a group-linkage system that disambiguates these multiple links to increase the linkage rate.

2.2.7 Group Record Linkage

Group record linkage compares the attributes of a record pair along with the groups the records belong to [39]. A real-world entity is often represented by a group of records that share a common group identifier rather than by an individual record. For example, an author in bibliographic data can be represented by a group of individual records, each of which represents a publication. Determining whether two records refer to the same author is based on the similarity of their groups of publications along with the similarity of the personal information listed in their records. As a result, two authors are considered the same person if their personal information is the same or very similar and they share as many publications as possible. Comparing groups of records instead of comparing only individual records is the core of the group-linkage technique [39]. Different group comparison methods have been used in group-linkage systems. For example, set similarity measures (i.e., functions that test the similarity between two finite sets), such as the Jaccard measure [11], are commonly used to compare groups in record linkage. Due to the generally good performance of the Jaccard measure, researchers have used it in historical record linkage with success [21, 42]. This measure will be explained in detail in Chapter 5. Further, a machine learning technique (a graph-based model) is used to deal with the record groups in [22]. For the classification and evaluation steps, the same methods that are used in pairwise record linkage can be used for group record linkage.
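As a brief preview of a set similarity measure, the sketch below computes the Jaccard measure, the size of the intersection of two sets divided by the size of their union, for two hypothetical households. Treating household members as exact name strings is a simplifying assumption made only for this example; it is not how the systems in this thesis compare household members, and the names are invented.

```python
def jaccard(set_a, set_b):
    """Jaccard similarity: |A intersect B| / |A union B|, defined here as 0 for two empty sets."""
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Two hypothetical households recorded ten years apart (illustrative names).
household_1871 = {"JOHN SMITH", "MARY SMITH", "ANN SMITH", "PETER SMITH"}
household_1881 = {"JOHN SMITH", "MARY SMITH", "ANN SMITH", "JAMES SMITH"}

score = jaccard(household_1871, household_1881)
print(score)  # 3 shared members out of 5 distinct members
```

Three of the five distinct members appear in both households, so the score is 0.6. A pair of records whose households score highly is more likely to refer to the same person than a pair with equally similar personal attributes but unrelated households.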

2.3 Historical Record Linkage

Historical census data (i.e., an official count of a population, which occurs at specific interval in time) is considered to be an informational treasure for genealogists, social scientists, and historians. It provides information about people’s ancestors, including how they lived and what the social and economic features of their society were [27]. Linking historical censuses to construct longitudinal data is called historical record linkage. The longitudinal data allow researchers to build more reliable studies about various topics such as family formation and dissolution, social and geographic mobility, and the interrelationship of geographic and economic movement [44]. Historical linkage has been studied since the 1980s. An early example can be found in [33], where a test census of Tampa, Florida and an independent post- enumeration survey conducted by the U.S. Census Bureau were linked by a record- linkage software system based on Fellegi-Sunter mathematical theory of record linkage [20]. This study discussed the practical issues of automated historical record linkage [33]. In 1996, Dillon integrated a sample of the 1871 Canadian census with similar samples of the 1850 and 1880 American censuses to generate a longitudinal data that used to examine household structure across the two countries and three years [14]. The Minnesota Population Center (MPC) at the University of Minnesota has the world’s largest demographic data collections, which are released through different projects including the Integrated Public Use Microdata Series (IPUMS) 1. These linked datasets (including historical American censuses) has been released to the public, and researchers worldwide [1]. A description of data and the linkage process can be found in [1, 26, 44]. In addition, various studies that used their linked data can be found at the center’s website [1]. 
1 http://doi.org/10.18128/D020.V6.4.

In Canada, a recent linking of historical censuses was made by a group of researchers at the University of Guelph (People-in-Motion (PiM)) [4]. They developed a pairwise linkage system (the PiM system) to link the complete 1871 Canadian census to the complete 1881 Canadian census. However, this system evaluated only the single links; multiple links were treated as ambiguous and discarded.

Generally, applying group linking methods to link households along with individuals has reduced the number of multiple links and increased the amount of generated longitudinal data. For instance, Richards [42] presented a group-linkage system (a disambiguation system) that uses households to link the 1871 and 1881 Canadian censuses. This system increased the linkage rate from 15.8% to 31.8%. A higher linkage rate means longitudinal data for more individuals. A similar group-linkage system that also uses households was presented by Fu [21] to link six samples of Lancashire (England) censuses of one district (Rawtenstall) during the 19th century. This system consists of two steps: pairwise linkage and group linkage. The maximum linkage rate of pairwise linkage was 28%, and after group linking the maximum linkage rate increased to 56%. These studies demonstrate the usefulness of linking households to create more longitudinal data; hence, we use the system by Richards [42] to link two more complete Canadian censuses (1891 and 1901) to the already linked 1871 and 1881 censuses to generate longitudinal data.

Linking historical censuses using group linkage requires the availability of household information. Group-linkage techniques require records to be grouped, and in census data records can be grouped by households. A household is a person or group of people who live in the same dwelling and share living accommodation. It consists of a head of household (i.e., the person who self-identified as the head when the census was taken, and who runs the household, looks after other household members, and pays living costs) and other members who are relatives or other people such as servants [21]. In historical census data, the household information is usually not available, either because it was not recorded at the time of the enumeration, or because it was not digitized.

Figure 2.3: Two households from two censuses

2.3.1 Challenges of Household Linkage

In general, linking historical censuses presents more challenges than linking contemporary data [13]. In addition to data quality issues (explained in Chapter 3), the changing structure of households over time makes comparing households of two consecutive historical censuses to find the same households a challenging task. The structure of a household is likely to change over time because of many factors (e.g., death or birth). Figure 2.3 shows two hypothetical households from two consecutive censuses (Census 1 and Census 2), where Census 2 occurs ten years after Census 1. The question that we need to answer is, “Are these households the same?” Notice that the head of the first household is a husband, “John Smith”, whereas the head of the second household is a widowed wife, “Mary Smith”. Mary could be John’s widow, but this is not clear. In historical censuses, the head of a household is usually the husband/father, but it could be any household member (male, female, husband, wife, etc.) who self-identified as head. Also notice that the number of household members of the two households is different. Due to changes over time (marriage, divorce, immigration, emigration, etc.), it is normal for a household to have more or fewer members after ten years. For instance, the daughter “Susan Smith” appears in the first household and not in the second one. Susan may have married and moved to another household, presumably as a wife, or she may be deceased. Moreover, the servant “Ann Taylor” only appears in the second household, which could mean that she joined the household later. Although there are many differences between these two households, they may still be the same household.

Researchers have used different strategies to compare households during record linkage. Household identifiers (HIDs) are unique within a census but are not stable across censuses. This means that comparing households is not as simple as looking up the same HID in two censuses. In [21, 26, 42], a pairwise linkage step was performed first to find all single links between households. Then, Goeken [26] used a simple rule to define matched households: any two households that have single links between them are matched households. Fu [21] and Richards [42] generate a household match score by using the Jaccard measure [32], which is defined as the number of common single links divided by the size of the union of the households. A match score ranges from 0 (non-matched households) to 1 (exactly matched households). In another study, Fu [22] builds a graph for each household that represents the structural relationships between household members. Then, a graph similarity function (explained in [22]) is used to assign household match scores.

While there are different studies that focus on how to compare households, there are few studies that focus on how to generate them. Without defining households, we cannot apply group linkage to historical census data.
Hence, the lack of HIDs is one of the main challenges in historical record linkage. To overcome this challenge, researchers have generated HIDs for census datasets manually [4]. If the dataset is small, creating households manually can be feasible, but for a census with millions of records it is time and labor intensive. Therefore, generating HIDs automatically is more time efficient. To the best of our knowledge, the only automated HID identification system was presented by Fu [21]. It generated HIDs automatically for a sample (160,000 records) of six Lancashire (England) censuses, whereas real-world complete census data has millions of records. This sample covers one district (Rawtenstall), which is a small area in contrast to a full census that covers an entire country. To generate HIDs, Fu used a set of rules to define the head record; the census records were then scanned sequentially. Each time a record satisfied the head rule, the HID number was incremented by one, and that HID number was assigned to all of the following member records until another head record was found. The generated households were used in a group linkage process to link these samples. Automatically identifying households would allow HIDs to be extracted even for large, complete census data. To fill this gap, in Chapter 4, we present a system that uses domain knowledge to extract HIDs automatically.
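The sequential scan attributed to Fu [21] above can be sketched as a counter that increases at every head record. This is only an illustration of the description in the text: the record layout and the `is_head` predicate (a simple check of the relation value) are assumptions, since the actual head rules in [21] are not reproduced here.

```python
def assign_hids(records, is_head):
    """Return one HID per record, in order.

    The counter is incremented whenever `is_head(record)` is true, and the
    current value is given to every following record until the next head.
    Records that appear before the first head receive 0.
    """
    hids = []
    current = 0
    for record in records:
        if is_head(record):
            current += 1
        hids.append(current)
    return hids


# Hypothetical records; only the relation value is used by the head rule.
sample = [
    {"surname": "Smith", "relation": "Head"},
    {"surname": "Smith", "relation": "Wife"},
    {"surname": "Smith", "relation": "Son"},
    {"surname": "Taylor", "relation": "Head"},
    {"surname": "Taylor", "relation": "Daughter"},
]

# Assumed head rule: the relation field says "head" (case-insensitive).
sample_hids = assign_hids(sample, lambda r: r["relation"].lower() == "head")
```

Because the scan relies only on record order and the head rule, any misplaced record or missing head value silently merges or splits households.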

Chapter 3

Historical Census Data

This chapter introduces the Canadian historical censuses to be linked in this thesis, and the challenges they present. Section 3.1 discusses the Canadian historical census data in general. Section 3.2 describes the 1891 census dataset, and Section 3.3 describes the 1901 census dataset. Finally, Section 3.4 discusses the data issues of the 1891 and the 1901 datasets.

3.1 Canadian Historical Census Records

In general, historical census data provides information about people’s ancestors, including how they lived and what the social and economic features of their society were. In Canada, the government collects census data to observe and measure various aspects of a number of social and economic issues [27]. With regard to the 1871, 1881, 1891, and 1901 Canadian censuses used in this work, the census collections cover both provinces and territories, which are divided into districts (i.e., counties and cities). The districts are further divided into sub-districts (i.e., townships, parishes and larger towns). The original enumerators were assigned to specific geographic areas. The census questionnaire forms cover a variety of subjects, such as personal information, religion, education, and finance [27]. The information used to complete the census forms was provided by the household member who self-identified as the head of a household or institution.

The hand-written documents (see Fig 3.1) were microfilmed and digitized (only the microfilmed and digital copies exist today). The digitized census documents are available from Library and Archives Canada (LAC) 1. Figure 3.2 shows an example of a digitized page from the 1891 census. In practice, the process of converting handwritten papers to machine-readable data often introduces errors. For example, an image that is hard to read may be difficult to transcribe, thus resulting in incorrect information. Moreover, the original census forms included much more information about the inhabitants (e.g., buildings and lands), but in the transcribed copy, only a limited number of attributes appear.

Figure 3.1: A cropped image of the 1891 Canadian census page

In this thesis, the four Canadian historical censuses have the following characteristics:

1. 1871 census contains 3,466,427 records, each with 50 attributes.

2. 1881 census contains 4,277,807 records, each with 54 attributes.

3. 1891 census contains 4,787,244 records, each with 18 attributes.

4. 1901 census contains 5,343,566 records, each with 19 attributes.

We aim to collectively link these censuses to allow researchers to trace people over a period of 30 years (i.e., 1871-1901). However, the lack of HIDs for the 1891 and 1901 censuses is the main obstacle. HIDs for the 1871 and 1881 censuses were manually generated by the PiM team at the University of Guelph [6], but HIDs for the 1891 and 1901 datasets do not exist. The digitized copies of the 1891 and 1901 censuses are simple lists of individuals with no clear designation of where one household ends and a new household begins. Hence, we need to build an automatic household identification system for the 1891 and 1901 datasets (described in Chapter 4). In the following sections, we describe the attributes of the digitized copies of the 1891 and 1901 datasets and identify the challenges of working with them.

1 http://www.bac-lac.gc.ca/eng/census/Pages/census.aspx

Figure 3.2: Transcribed 1891 census pages

3.2 The 1891 Census Records

The 18 attributes of the 1891 transcribed records are listed below:

1. Province: there are 9 values: , , , , Prince Edward Island, British Columbia, Territories, New Brunswick, and Manitoba.

2. District original: the name of the census district, which could be a city, town, group of townships, or other defined area. There are 201 district values in this census. There are no missing values.

3. District num original: a number associated with the census district, ranging from 1 to 201. The number of records that contain missing values is 7,046.

4. Subdistrict original: each district is divided into sub-districts, and a name is used to refer to these. In Figure 3.1, the district and subdistrict values are shown in the header of the census form. There are 2,393 sub-district values. There are no missing values.

5. ID: the image number of the original handwritten census page. The 1891 census has 101,098 pages. Each page has a number of individual records that ranges from 1 to 100, but as shown in Figure 3.3 the majority of these pages have 50 records. Only 0.2% of the pages have more than 50 records, and the average number of records per page is 47.3.

6. PID: a personal identifier which is a unique value for each record. There are no missing values.

7. Primary given: refers to the individual’s first name. The number of records that contain missing values is 23,015.

8. Primary surname: refers to the individual’s family name. In the original forms (see Fig 3.1), column three has a full name for each person (given name followed by surname). The number of records that contain missing values is 13,932.

9. Marital: a person’s marital status. This attribute has five values (single, married, widowed, cohabiting, or divorced). In Fig 3.1, the marital status attribute is in column seven.

10. Relation: a member’s relation to the head of household, such as son, wife, or lodger. In Fig 3.1, the relation attribute is located in column six. This field includes English and French values. There are 8,404 relation values, but 94% of the records contain common values, such as son, head, daughter, or wife. The rest of the values are less common (e.g., “grand child”). Because they account for a low percentage of the records, these values are not considered in the HID-defining rules of our HID identification system (described in Chapter 4).

11. Sex: “Male” or “Female” are the main values of the sex field, but there are a few other values (e.g., “??” or “Both”) that together account for less than 1% of the records; we ignore them.

12. Age: a numerical value that refers to the number of years completed at the last birthday before April 6, 1891 (the day when the census was collected). It ranges from 1 to 105. If the age is less than one year, then the age is followed by days, weeks, or months (e.g., “6 months”), or the age has a date format (day-month). When the age is unknown the symbol “??” is used.

13. Birth loc: refers to the place of birth for each individual. The input may be a name of a province, country, city or even continent.

14. Fat bir loc: refers to the birthplace of a person’s father. The values are names of provinces, countries, continents, or “not stated” when the place is unknown.

15. Mot bir loc: refers to the birthplace of a person’s mother. It has the same value format as the father’s birth location.

16. Religion: the belief of an individual. Although there are more than 9,000 different values, 97% of the census records have one of 50 main religions, and the most frequent religion is “Roman Catholic”. The large number of values is due to spelling errors and the use of different terms to refer to a particular religion.

Figure 3.3: The Frequency of Page Size (x-axis: number of records in a page; y-axis: frequency; series: 1891 and 1901)

17. French: a binary value (Yes/No) that refers to whether the person is French Canadian.

18. Roll: the roll number of the original hand-written census papers.

3.3 The 1901 Census Records

The 19 attributes of the 1901 transcribed records are listed below:

1. ProvinceFull: there are 9 values: Ontario, Quebec, Nova Scotia, New Brunswick, Prince Edward Island, British Columbia, Territories, New Brunswick, and Manitoba.

2. District: in this dataset, there is only one attribute to describe the district, which is the district name. There are 202 districts. The number of records containing a missing value is 149.

3. SubDistrict: a letter and a number are used to refer to the sub-districts. The total number of sub-districts is 3,069. The number of records containing a missing value is 11,120.

4. ID: refers to the digital image numbers of the original hand-written pages in the 1901 census. In total, there are 112,888 pages. Each page has a number of records that ranges between 1 and 357, but as shown in Figure 3.3 the majority of these pages have 50 records. Only 0.4% of the pages have more than 50 records, and the average number of records per page is 45.4.

5. Page: while ID is the number of the digital image of the original census page, this attribute is the number that appears in the original handwritten census page.

6. PID: a unique numerical value for each record. It has no missing values.

7. Name given: the given name of each person in the household. The number of records that contain missing values is 21,193.

8. Name surname: a string that refers to the last name of an individual. The number of records that contain missing values is 42,790.

9. Marriage status: there are four values: single, married, widowed or divorced.

10. Relation: relation to a head of the household. There are 17,177 different relation values in English and French; 90% of these values are common, such as wife, son, or lodger. The remaining values are infrequent values (e.g., day laborer or teacher). These values are not considered by our HID identification system.

11. FamNum: a number for each family or household in each subdistrict. This attribute is not unique and has 35,812 missing values, so it cannot be used as a household identifier.

12. Sex: “male” or “female”.

13. Age: a numerical value referring to the number of years completed at the last birthday before March 31, 1901 (the day when the census was collected). It ranges from 1 to 112. If the age is less than one year, the age is followed by days, weeks, or months, or the age has a date format (day-month).

14. Birloc: refers to a country or place of birth. If Canada, the province or territory is indicated. For people born outside of Canada, the name of the country of origin is indicated.

15. Religion: 97% of the census records indicate one of 50 main religions, but, in general, there are more than 14,000 values. This number is so high because different words are used to refer to the same religion (e.g., Budd and Buddhist) and because of spelling mistakes. The most frequent religion is “Roman Catholic”.

16. Roll: the roll number of the original hand-written census’s papers.

17. ImmYear: The year the person moved to Canada from another country.

18. Nationality: Canadian for those who live in Canada and who had acquired rights of citizenship. For Non-Canadians the country of their birth is indicated.

19. Tribal: race or origin, which is generally traced through the father. There are 6,824 different values in this field. The most frequent value is “England”, accounting for 21% of all values.

There are 14 common attributes between the 1891 and 1901 censuses. They are listed in Table 3.1. Specific information about the 1871 and 1881 censuses can be found in [4]. More information about the four censuses: 1871, 1881, 1891 and 1901 can be found at the Library and Archives Canada (LAC) website2.

2http://www.bac-lac.gc.ca/eng/census/Pages/census.aspx

Table 3.1: The common attributes between the 1891 and 1901 censuses

1891                           1901
District original              District
Subdistrict original (name)    Subdistrict (number)
Marital                        Marriage status
Province                       ProvinceFull
Primary given                  Name given
Primary surname                Name surname
PID                            PID
Sex                            Sex
Relation                       Relation
Birth loc                      Birloc
Age                            Age
ID                             ID
Religion                       Religion
Roll                           Roll
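When the two datasets are processed by the same code, the correspondence in Table 3.1 can be kept as a simple lookup. The sketch below is illustrative only: the spellings follow the table, but the column names in the actual digitized files are an assumption and may differ, and the helper name `to_1901_names` is hypothetical.

```python
# Table 3.1 as a mapping from 1891 attribute names to 1901 attribute names.
# Spellings follow the table; actual file column names are an assumption.
COMMON_1891_TO_1901 = {
    "District original": "District",
    "Subdistrict original": "Subdistrict",
    "Marital": "Marriage status",
    "Province": "ProvinceFull",
    "Primary given": "Name given",
    "Primary surname": "Name surname",
    "PID": "PID",
    "Sex": "Sex",
    "Relation": "Relation",
    "Birth loc": "Birloc",
    "Age": "Age",
    "ID": "ID",
    "Religion": "Religion",
    "Roll": "Roll",
}


def to_1901_names(record_1891: dict) -> dict:
    """Keep only the common attributes of an 1891 record, renamed to 1901 names."""
    return {
        new: record_1891[old]
        for old, new in COMMON_1891_TO_1901.items()
        if old in record_1891
    }


# "French" exists only in the 1891 census, so it is dropped.
example = {"Primary surname": "Smith", "Marital": "married", "French": "No"}
renamed = to_1901_names(example)
```

Renaming aligns the attribute names only. The values still differ in places; for example, the subdistrict is a name in 1891 and a letter-and-number code in 1901.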

3.4 Data Issues

The main challenge in linking the four censuses (1871, 1881, 1891, 1901) by group record linkage is providing the group identifiers (HIDs) for the 1891 and 1901 censuses. To achieve this, we present an automatic HID identification system. The difficulty in building such a system is that we have to work with imprecise data (the digitized historical census records). As mentioned in Section 3.1, transcribing the census data introduced some errors. The main quality problems of our datasets are listed below:

1. Excluded attributes: while transcribing the data, only a limited number of attributes were included. For example, the address attribute was not included in our census data. It would be very helpful in deciding whether a record belongs to a household; however, we use other attributes to find HIDs.

2. Attributes with blank values: missing information for some attributes can hinder the HID identification system. For instance, in our research, missing surname or relation values add ambiguity to the data, which affects the HID identification and the linkage process.

3. Attributes with incomplete values: missing information means that the attribute has no value (blank), while incomplete information means that there is a value but it does not give complete information. For example, listing adopted children in a household without recording the adoption makes the data vague; therefore, household identification becomes complicated.

4. Attributes with incorrect values: spelling errors are common mistakes that can create obstacles to data analysis. Further, misplacing data in the digitized census also results in ambiguous information.

5. Attributes with nonstandard values: using nonstandard values in an attribute makes building rules about this attribute a challenging task. For example, different terms were used in one dataset to describe a grandchild in the relation attribute, such as “grand son”, “g son”, and “son of daughter”. Also, the set of values of an attribute can change over time [13]. For example, while in the 1871 Canadian census the input values for “marital status” were married, widowed, divorced or single, in the 1891 Canadian census there is an extra value: cohabiting.

6. PID errors: over and above the general issues with data attributes, there is another problem related to the personal identifier (PID) attribute. After transcribing the historical censuses, the records were numbered automatically from the first record of the census to the last record. Any mistake that happened during this process generates PID errors. Because we strongly assume that the order of the transcribed records is the same as the order in the original census forms, any change in this order results in PID errors that affect the HID identification process.
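To make the nonstandard-values issue (item 5) concrete, the sketch below maps the variant spellings quoted above to one label. It is an illustration of the problem, not part of the thesis method: the lookup table, the canonical label "grandchild", and the function name are all hypothetical.

```python
import re

# Hypothetical lookup: variant relation strings mapped to one canonical label.
# The three variants are the ones quoted in the text; "grand child" is added
# as a further assumed spelling.
RELATION_VARIANTS = {
    "grand son": "grandchild",
    "g son": "grandchild",
    "son of daughter": "grandchild",
    "grand child": "grandchild",
}


def normalize_relation(value: str) -> str:
    """Lower-case the value, collapse whitespace, and apply the lookup.

    Values that are not in the lookup are returned in their cleaned form.
    """
    key = re.sub(r"\s+", " ", value.strip().lower())
    return RELATION_VARIANTS.get(key, key)


cleaned = [normalize_relation(v) for v in ["Grand Son", "  g   son", "Head"]]
```

A lookup of this kind only covers variants that have already been seen, which is why thousands of distinct relation values make rule building difficult.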

In this thesis, we aim to build an automatic household identification system for the 1891 and 1901 censuses that takes into account the data issues (described above) and presents some solutions for them. In the next chapter, all these issues are explained in more detail, and a household identification system is proposed to deal with these challenges.

Chapter 4

Automatic Household Identification for Historical Census Data

In this chapter, we present a method that uses domain knowledge to automatically discover and assign HIDs to individual records of the 1891 and 1901 censuses. Section 4.1 describes the two-step automatic HID identification method. Section 4.2 discusses the runtime of our method. Section 4.3 presents the results of applying the automatic HID identification to the 1891 and 1901 censuses. Finally, we conclude this chapter in Section 4.4.

4.1 Household Identification - Methodology

Based on the definitions of the household and the head of household given in Chapter 2 (Section 2.3), information from a domain expert1, and observation of the images of hand-written census forms (e.g., Fig 3.1) and the digitized datasets (e.g., Fig 3.2) of the 1891 and 1901 Canadian censuses, the following assumptions are made:

1. For every subdistrict in a district, each census form was filled in by the head of a household, and the head-record is the first record in the household, which is sometimes followed by other member records (e.g., partner, children, parent, servants). Hence, a household in a census dataset begins with a head-record and ends with the next head-record. For example, the “Smith” household in Fig 4.1 starts with a head-record and is then followed by various member records.

1Dr. Kris Inwood, Economics and History Department, University of Guelph

Page  PID  Surname  Relation
1     1    Smith    Head
1     2    Smith    Daughter
1     3    Smith    Son
1     4    Smith    Son
1     5    Smith    Sister
1     6    Corbeau  Head
1     7    Corbeau  wife
2     8    Corbeau  Son
2     9    Corbeau  Son
2     10   Eliott   Head
2     11   Gifford  Head
2     12   Gifford  Wife
2     13   Gifford  Brother
2     14   Seifert  Servant
3     15   Seifert  Servant
3     16   Murry    Head
3     17   Murry    Wife
3     18   Tombs    Head
3     19   Tombs    Wife
3     20   Tombs    Son
3     21   Tombs    Son

Figure 4.1: Scanning through census pages

2. After transcribing the historical censuses, the PIDs were generated by numbering the records automatically from the first to the last record of the census; thus, we can expect these identifiers to be sequential. For instance, household members in Fig 4.1 have sequential PIDs.

3. Household members (should) share some basic information that can be used to identify households.

(a) The nuclear household members (e.g., wife and children) should share the same surname with the head. The exceptions are wives from Quebec who often keep their surnames after marriage.

(b) Household members share the same location information, which can be identified by the district and subdistrict information.

(c) Household members should be located on the same page or on nearby pages: the previous page or the next page. For instance, in Fig 4.1, the head of the “Corbeau” household and his wife are located on page “1” and the rest of the household records appear on the next page.

(d) The personal identifiers (PID) of household members, starting from the head, should be sequential.

Based on these assumptions (which were validated by a domain expert1), we define the initial idea of household identification: when scanning a sorted dataset, a household begins with a head-record and ends at the next head-record. However, to be certain about households and to avoid possible errors while identifying them, we analyze seven attributes that can provide us with information about households: surname, relation to the head of household, marriage status, district name, subdistrict name, page number (ID), and personal identifier (PID). Starting from the first record, the seven attributes of each record X are analyzed and compared with the attributes of the previous record Y. The flow chart in Fig 4.2 shows the steps of the comparison, the result of each step, and the action that can be taken based on each result. There are four possible actions in the comparison step: start a new household (green rectangle), continue the current household (blue rectangle), compare more attributes (black rectangle), or exclude the record (red rectangle) because it has uncertain information and contradicts our working assumptions. An aim of our system is to identify the HID of all census records without excluding any. The process of identifying HIDs is therefore performed in two stages: assigning HIDs (as shown in Fig 4.2), and resolving unassigned records to decrease the number of excluded records.

Figure 4.2: Assigning HIDs flowchart

4.1.1 First-Pass Assignment of HID

First, we sort the digitized datasets by district, subdistrict, ID, and PID to recover the order in which the census records were microfilmed and then transcribed. The census records were arranged by district, subdistrict, and then by page.

The sorted census, which has M records each with N attributes, is scanned sequentially, starting with the first record and ending with the last record (Algorithm 1). Record X is compared with the previous record Y in order to decide whether it is: (1) a member-record (line 9), (2) a head-record of a new household (line 6), or (3) a suspended record (line 12) that requires more analysis to decide whether it belongs to this household. Following the flowchart in Fig 4.2, the record X is a head-record (identified by a green rectangle) if it:

1. has a “head” in the relation field, or

2. has a new district or subdistrict value.

The record X is a member-record (identified by a blue rectangle) if it has the same district and sub district values, and it:

1. has a sequential PID, a family relation value (son, single daughter, or wife), and shares a last name with the head-record, or

2. has a relation value that does not correspond to a family input (domestic, servant, etc.).

The record X is considered as a suspended record (identified by a red rectangle) if it does not satisfy any of the previous rules. We classify the suspended records into the following five cases:

1. Case 1: records with missing district names and numbers (red rectangle number 1, in Fig 4.2).

2. Case 2: records with missing sub district names and numbers (red rectangle number 2, in Fig 4.2).

3. Case 3: records with a sequential PID, a different surname from the head, and a family relation value (red rectangle number 3, in Fig 4.2).

4. Case 4: records with a non sequential PID and the same surname of the head (red rectangle number 4, in Fig 4.2).

5. Case 5: records with a non sequential PID, a different surname from the head, and a family relation value (red rectangle number 5, in Fig 4.2).

Algorithm 1 Scanning and Comparison Step
Require: Sorted CensusRecords[N,M]
 1: i ← 1, Y ← CensusRecords[, i]
 2: Y[HID] ← 1
 3: while i < M do
 4:     i ← i + 1
 5:     X ← CensusRecords[, i]
 6:     if HeadRecord(X) is True then
 7:         X[HID] ← Y[HID] + 1
 8:     else
 9:         if MemberRecord(X) is True then
10:             X[HID] ← Y[HID]
11:         else
12:             SuspendedRecord(X)
13:         end if
14:     end if
15:     Y ← X
16: end while
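The first pass can be sketched in Python as follows. This is a simplified reading of Algorithm 1 and the head/member rules, not the thesis implementation. It assumes that records are dictionaries sorted as described and that district and subdistrict are present. The sets `FAMILY` and `NON_FAMILY` are assumed stand-ins for the relation values, and marital status (needed for the "single daughter" rule) is ignored. The sketch keeps the HID counter and the current head in separate variables so that a suspended record, marked `None`, does not interrupt the numbering.

```python
FAMILY = {"son", "daughter", "wife"}             # assumed family relation values
NON_FAMILY = {"domestic", "servant", "lodger"}   # assumed non-family values


def first_pass(records):
    """Scan sorted records and return one HID per record (None = suspended)."""
    hids = []
    hid = 0
    head = None   # head-record of the household currently being built
    prev = None   # previous record (Y in Algorithm 1)
    for x in records:
        rel = x["relation"].strip().lower()
        area = (x["district"], x["subdistrict"])
        new_area = prev is None or area != (prev["district"], prev["subdistrict"])
        if rel == "head" or new_area:
            # Head rule: "head" relation, or a new district/subdistrict.
            hid += 1
            head = x
            hids.append(hid)
        elif rel in NON_FAMILY:
            # Member rule 2: non-family relation in the same area.
            hids.append(hid)
        elif (rel in FAMILY and x["pid"] == prev["pid"] + 1
              and x["surname"] == head["surname"]):
            # Member rule 1: sequential PID, family relation, head's surname.
            hids.append(hid)
        else:
            hids.append(None)
        prev = x
    return hids


def rec(pid, surname, relation):
    """Build a hypothetical record in a single district and subdistrict."""
    return {"district": "D1", "subdistrict": "S1", "pid": pid,
            "surname": surname, "relation": relation}


demo = [rec(1, "Smith", "Head"), rec(2, "Smith", "Daughter"),
        rec(3, "Smith", "Son"), rec(4, "Corbeau", "Head"),
        rec(5, "Corbeau", "Wife"), rec(6, "Wirth", "Son"),
        rec(7, "Seifert", "Servant")]
demo_hids = first_pass(demo)
```

In the demo, the record "Wirth, Son" has a sequential PID and a family relation but not the head's surname, so it is left unassigned, which corresponds to Case 3.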

4.1.2 Resolving Suspended Records

To resolve the suspended records, we manually explored these records and discussed their significance with a domain expert 1. In this section, we discuss all of the cases for the suspended records and present solutions to resolve them. These solutions are used in a search-and-assign sliding window algorithm, which is explained in Subsection 4.1.3. Table 4.1 shows each case of the suspended records as a percentage of the total records in the 1891 and 1901 censuses.

Table 4.1: The percentage of the 5 cases of the suspended records in the censuses

          1891                 1901
          %       Records      %        Records
Case 1    0.1%    7,045        0.002%   149
Case 2    0%      0            0.2%     11,109
Case 3    4.6%    223,707      1.6%     88,132
Case 4    4.2%    209,530      0.06%    3,269
Case 5    1%      50,876       0.02%    1,562
Total     10%     491,158      1.9%     104,221

The percentage of suspended records in the 1901 census is much smaller than in the 1891 census. This is due to the existence of the “family number” attribute in the 1901 census. This number is not unique and has many missing values. There are 34,825 missing values in this attribute. Also, this number sometimes appears out of order. For example, the numbers (785, 786, 797, 811) are not in the correct sequence. However, we used it to decide when a new household should start, regardless of the relation value. The definition of a head-record in the 1901 census includes “having a new family number value” as one of the rules for starting a new household. Table 4.2 shows an example where we start a new household (i.e., 556) when the family number changes from 1 to 2, even though the relation value of the first record is not “Head”.

Table 4.2: The use of the family number attribute in 1901 census

Family number  PID   Given name  Surname    Relation  HID
1              4994  Sheldon     Pettifer   Son       555
1              4995  Russel A    Pettifer   Son       555
1              4996  Gordon C    Pettifer   Son       555
1              4997  William     Mcchesney  Lodger    555
2              4998  Mary Jane   Spencer    Wife      556
2              4999  Charles     Spencer    Son       556
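The family-number rule can be illustrated on the rows of Table 4.2. The sketch below is a minimal sketch, assuming that each row carries a family number (missing values are not handled) and that all rows lie in the same district and subdistrict. The function name and field names are hypothetical.

```python
def assign_hids_1901(rows, first_hid):
    """Start a new household on a 'Head' relation or a changed family number."""
    hids = []
    hid = first_hid
    prev = None
    for row in rows:
        is_head = row["relation"].lower() == "head"
        new_family = prev is not None and row["fam_num"] != prev["fam_num"]
        if prev is not None and (is_head or new_family):
            hid += 1
        hids.append(hid)
        prev = row
    return hids


# Rows of Table 4.2 (family number, PID, surname, relation).
table_4_2 = [
    {"fam_num": 1, "pid": 4994, "surname": "Pettifer", "relation": "Son"},
    {"fam_num": 1, "pid": 4995, "surname": "Pettifer", "relation": "Son"},
    {"fam_num": 1, "pid": 4996, "surname": "Pettifer", "relation": "Son"},
    {"fam_num": 1, "pid": 4997, "surname": "Mcchesney", "relation": "Lodger"},
    {"fam_num": 2, "pid": 4998, "surname": "Spencer", "relation": "Wife"},
    {"fam_num": 2, "pid": 4999, "surname": "Spencer", "relation": "Son"},
]

# The first row is taken to belong to household 555, as in the table.
table_hids = assign_hids_1901(table_4_2, first_hid=555)
```

The change of family number from 1 to 2 at PID 4998 opens household 556 even though no record in it has the relation "Head".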

Case 1

Missing data is one of the main issues with the 1891 and 1901 datasets. Records that have missing values in the sorting attributes (district, subdistrict, ID, and PID) will be misplaced by the sort and then excluded. In the 1891 and 1901 censuses, only district and subdistrict have missing values. A record X with a missing district value is a suspended record because it is sorted into the wrong location, which results in a different surname and a non-sequential PID relative to its neighbours. The solution for this case is to include the district name in the sorting as well. Comparing district numbers instead of names avoids possible errors caused by spelling mistakes; however, when the number is missing we can use the name. This adjustment can only be applied to the 1891 census because the 1901 census has only one district attribute, which is the name.

Case 2

For the same reasons as in Case 1, a record X with a missing subdistrict value is a suspended record. However, we cannot apply the same solution as in Case 1 because both datasets have only one attribute to describe the subdistrict.

Case 3

A record X with a sequential PID, a different surname, and a family relation value is a suspended record because, if it were a member-record, it should share the head’s surname. Almost 5% of the records in the 1891 census and 2% of the records in the 1901 census fall into this case. After analyzing some of these records manually and discussing them with domain experts, we determined some of the causes and developed solutions for them. The causes and solutions are described below.

1. Incomplete information: there are two cases of incomplete information in the relation attribute (i.e., a relation to a household member rather than to the head of household, and a child relation rather than an adoption or stepchild relation), which result in suspended records. The first is when the relation refers to a household member rather than the head. Table 4.3 shows a household with this issue. According to the relation values in this household, the record containing “Jack Wirth” is a son of the head “W W Rowe”, but they do not share the same surname. According to the surnames of the household members, “Jack Wirth” shares a surname with the domestics “Henrich” and “Elizabeth Wirth”. It is reasonable to say that “Jack Wirth” is a son of the domestics “Henrich” and “Elizabeth Wirth”, not a son of the head “W W Rowe”. Inserting the value “Son” instead of “Son of domestic” is incomplete information. Hence, comparing surnames to the head-record alone will not be sufficient.

Table 4.3: A relation to household member example

PID      Given name   Surname   Relation   HID
598316   W W          Rowe      Head       1349
598317   P L          Rowe      brother    1349
598318   Henrich      Wirth     Domestic   1349
598319   Elizabeth    Wirth     Domestic   1349
598320   Jack         Wirth     Son        Case 3

The solution to this issue is to compare the surname of any new record with the surnames of the household members who have a surname different from the head's but one of the following relations to the head: Married/widowed/divorced daughter, Domestic, Lodger, Sister, or Brother.
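The member-surname check described above can be sketched as follows. This is a simplified illustration, not the thesis implementation: the relation strings in `MEMBER_RELATIONS`, the function name `matches_household`, and the use of exact (case-insensitive) surname comparison are all assumptions.

```python
# Relations whose holders may carry a surname different from the head's
# (hypothetical spellings; the census values may differ).
MEMBER_RELATIONS = {
    "married daughter", "widowed daughter", "divorced daughter",
    "domestic", "lodger", "sister", "brother",
}


def matches_household(new_surname, household):
    """Return True if new_surname matches the head or an eligible member."""
    target = new_surname.strip().lower()
    for member in household:
        relation = member["relation"].strip().lower()
        surname = member["surname"].strip().lower()
        if surname != target:
            continue
        if relation == "head" or relation in MEMBER_RELATIONS:
            return True
    return False


# Household 1349 from Table 4.3, before "Jack Wirth" is assigned.
household_1349 = [
    {"surname": "Rowe", "relation": "Head"},
    {"surname": "Rowe", "relation": "brother"},
    {"surname": "Wirth", "relation": "Domestic"},
    {"surname": "Wirth", "relation": "Domestic"},
]
print(matches_household("Wirth", household_1349))  # Jack Wirth can join 1349
```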

The second case of incomplete information in the relation attribute is when the value is missing adoption or stepchild information. Table 4.4 shows an example of this issue. Although there is an “adopted” value, other values (such as “son” or “daughter”) are sometimes used to refer to an adopted child. In Table 4.4, the children-records have sequential PIDs, but their surnames differ from that of the head-record. We can assume that these records refer to household members, but the information about adoption or stepchildren is incomplete.

Table 4.4: Incomplete information example

PID      Given name   Surname   Age   Relation   HID
923002   Annie        Stewart   55    Head       8285
923003   Wm I         Smith     27    Son        Case 3
923004   Annie F      Smith     25    Daughter   Case 3
923005   Lila E       Young     31    Daughter   Case 3

For this issue, we cannot apply any solution. The HIDs of these records will not be generated, and the records will be excluded because we cannot be certain whether they belong to the household.

2. Misplaced information: inserting given names in the surname field results in suspended records. In Table 4.5, the children-records have missing given names and “Askanee”, “Christine” and “Lottie” as surnames. Clearly, the surnames are given names which were entered in the wrong field. This problem is difficult to solve because it is not easy to detect whether the last name value is a surname or given name.

Table 4.5: Given name instead of surname example

PID      Given name   Surname     Relation   HID
837998   John         Haines      Head       259
837999   -            Askanee     Wife       Case 3
838000   -            Christine   Daughter   Case 3
838001   -            Lottie      Daughter   Case 3

Also, inserting member-record information before head-record information results in suspended records. Table 4.6 shows an example where a child-record comes before the head-record. The daughter-record “Nellie Galbraith” shares the surname with the head-record “William Galbraith”. The daughter-record has a PID that is less than the PID of the head-record; hence, the daughter-record comes before the head-record. The solution for this case is to search for the “Galbraith” household in the neighboring pages using a sliding-window algorithm.

Table 4.6: Member-record before head-record example

PID      Given name   Surname     Relation   HID
735357   Edward       Tombs       Head       989
735358   Mary         Tombs       Wife       989
735359   Eliza        Tombs       Daughter   989
735360   Alexander    Tombs       Son        989
735361   Albert       Tombs       Son        989
735362   Nellie       Galbraith   Daughter   Case 3
735363   William      Galbraith   Head       990

3. Spelling errors: these are common mistakes, especially in the transcription of historical documents. However, they are fairly solvable. For instance, spelling errors in surnames lead to different last names. In Table 4.7, “Jennie Murray” seems to be a daughter of “David Murry”, but the spelling difference (an extra “a”) in her surname makes the record suspended in the initial assignment.

Table 4.7: Spelling errors example

PID      Given name   Surname   Relation   HID
805206   David        Murry     Head       622
809827   Isabela      Murry     Wife       622
809828   Isabela      Murry     Daughter   622
809829   Margret      Murry     Daughter   622
809830   Peter        Murry     Son        622
809831   Jennie       Murray    Daughter   Case 3

The solution is to ignore spelling errors by using a string comparison measure instead of looking for an exact match between surname strings. In the proposed household identification system, we employ the Jaro-Winkler string comparison measure [48]. The Jaro-Winkler similarity measure assigns a similarity score between two strings with a scale from 0 to 1. The score 0 equates to no similarity between the two strings and the score 1 shows an exact match [48]. All strings that have a similarity score above 0.85 are considered matches by our system.
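The thresholded comparison can be illustrated with a self-contained sketch of the standard Jaro-Winkler measure. The function names, the lowercasing of surnames, and the conventional parameters (scaling factor 0.1, prefix capped at 4 characters) are assumptions; the source states only that Jaro-Winkler [48] is used with a 0.85 threshold.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro similarity boosted by the length of the common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


def same_surname(a, b, threshold=0.85):
    """Treat two surnames as a match when the score exceeds the threshold."""
    return jaro_winkler(a.lower(), b.lower()) > threshold


print(same_surname("Murry", "Murray"))  # the pair from Table 4.7
```

Under these assumptions, “Murry” and “Murray” score about 0.97, so the record in Table 4.7 would no longer be suspended on surname grounds.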

4. Nonstandard values: using different values (i.e., head equivalents) to refer to the relation “head” results in suspended records. Table 4.8 shows an example of this issue. According to the relation value, the record containing “Samuel Seifert” should be the father of the head “Robert Giord”, but they do not share the same surname. Based on the surnames of the household members and the relation values, it is reasonable to say that “Samuel” is the head of a new household, “Seifert”.

Table 4.8: Head equivalent example

PID      Given name   Surname     Relation   HID
553690   Robert       Giord       Head       305
553691   Rachel       Giord       Wife       305
553692   Elise        Giord       Daughter   305
553693   Walter       Giord       -          305
553694   Walter       Colbourne   Lodger     305
553695   Samuel       Seifert     Father     305
553696   Emily        Seifert     Wife       Case 3
553697   Percy        Seifert     Son        Case 3

However, because the value “father” is sometimes used as a head equivalent and sometimes to denote the father of a head, we cannot treat it consistently as a head equivalent. The only value we can use as a head equivalent is “widowed wife”.

Case 4

If a record X has a non-sequential PID and shares the same surname with the head-record, it is considered a suspended record. We can resolve these cases because they share the same surname with the head of household. However, before we can assign them a HID, we have to be certain that there are no other records that share the surname with these records and have sequential PIDs in neighboring pages (current, previous, next). We do that by using a sliding-window algorithm to search for similar records.

Table 4.9: Suspended records of case 4 example

PID      Given name   Surname   Relation   HID
743123   W            Eliott    Head       891
743124   Mariah       Eliott    Wife       891
751644   Foster       Eliott    Son        Case 4
757850   Alexr        Eliott    Son        Case 4

Case 5

A record X is a suspended record when it has a non-sequential PID, a different surname, and a family relation value (e.g., son). The main reason for having records in this case is a PID error. Having different PID ranges in consecutive pages causes the records of different households to mix together. If the last record of a page has a PID higher than the PID of the first record on the next page, household-member records will be missed. Table 4.10 shows an example of this situation. Page 108 ends with the record containing PID 889004, so page 109 should start with the record containing PID 889005. However, for an unknown reason, page 109 starts with the record containing PID 333291, and the record with PID 889005 comes later on this page. As a result, the records of the “Lafontine” household are split across two pages with many other records between them, and the records with PIDs 889005, 889006, and 889007 are suspended. The solution for this case is to search for the “Lafontine” household in the previous page using a sliding-window algorithm.

Table 4.10: Different PIDs range in consecutive pages example

ID    PID      Given name   Surname     Relation   HID
108   889003   Louis        lafontine   Head       721
108   889004   Sarah        lafontine   Wife       721
109   333291   Justine      levielle    Head       722
...   ...      (28 other records)
109   889005   Philomina    lafontine   Daughter   Case 5
109   889006   James        lafontine   Son        Case 3
109   889007   Elear        lafontine   Son        Case 3

[Figure 4.3 shows three consecutive census pages (587, 588, and 589) in district 56, subdistrict 6, each listing PID, surname (SN), relation (REL), and HID, with the sliding window spanning the three pages. Recoverable rows: PID 60 (Smith, Wife, HID 45), PID 61 (Smith, Son, HID 45), PID 62 (Rowe, Son, C3), PID 94 (Rowe, Head, HID 50), PID 95 (Rowe, Wife, HID 50), and PID 96 (Reed, Head, HID 51).]

Figure 4.3: A sliding window search through census pages

4.1.3 Sliding Window Algorithm

To resolve the unassigned HIDs, we incorporate domain knowledge into our system and use a sliding-window algorithm to search the nearby pages for a possible household for each suspended record. After all HID labels have been assigned, Algorithm 2 searches for a possible household for each suspended record in the search area. The search area consists of three neighboring pages in the same district and subdistrict: the previous page, the current page, and the next page (Fig 4.3). The average number of records per page is 50, so the size of the search area is about 150 records. This step is used to help make a decision about the suspended records; that is, if any household in the search area shares household information with a suspended record, the suspended record assumes the HID of this household. The term “window” refers to the search area, which is three pages, and the term “sliding” refers to the motion of the search window. For example, the search window first covers pages 1, 2, and 3; then it covers pages 2, 3, and 4, and so on.

The sliding-window algorithm is a separate step because it requires known HIDs. For example, on page 588 in Fig 4.3, record 62 is a suspended record of case 3 (i.e., it has a sequential PID, a family relation, and a different surname). The sliding window searches for a possible household for record 62 first in the previous page (587), then in the current page (588), and then in the next page (589). Household 50 on page 589 is a potential household for record 62. If the HID of this household were unknown, we could not identify the HID of record 62. As a result, we need to define all HIDs first; then we can apply the sliding-window search. The sliding-window algorithm handles only the suspended records of cases 3, 4, and 5 because cases 1 and 2 are caused by missing values in the sorting attributes. As explained before, the first step of our system is sorting the census data by district, subdistrict, ID, and then PID, and in our datasets only district and subdistrict have missing values. The suspended records with a missing district or subdistrict are misplaced records, and finding the households of these records requires a different search step. For the suspended records of cases 3 and 5, the sliding-window algorithm searches for a household that shares a surname (Lines 7-11), and for the suspended records of case 4, it searches for a household that shares a surname and also has a sequential PID (Lines 13-21).
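The search described above can be sketched in Python. This is a simplified illustration under stated assumptions, not the thesis implementation: records are dictionaries with hypothetical keys `page`, `pid`, `surname`, and `hid`; suspended records carry the labels `"Case3"`, `"Case4"`, or `"Case5"`; surnames are compared exactly rather than with Jaro-Winkler; all records are assumed to share one district and subdistrict; and a case 4 record with no sequential match falls back to the most recent known HID.

```python
SUSPENDED = {"Case3", "Case4", "Case5"}


def sliding_window_search(records):
    """Resolve suspended records in place and return the list."""
    # Index only records whose HID was known before the search started.
    known_by_page = {}
    for rec in records:
        if isinstance(rec["hid"], int):
            known_by_page.setdefault(rec["page"], []).append(rec)

    last_hid = None
    for rec in records:
        label = rec["hid"]
        if label not in SUSPENDED:
            if isinstance(label, int):
                last_hid = label
            continue
        # Search window: previous page first, then current, then next.
        window = []
        for page in (rec["page"] - 1, rec["page"], rec["page"] + 1):
            window.extend(known_by_page.get(page, []))
        same_name = [
            r for r in window if r["surname"].lower() == rec["surname"].lower()
        ]
        if label in ("Case3", "Case5"):
            if same_name:
                rec["hid"] = same_name[0]["hid"]
        else:  # Case4: surname must match and the PID must be sequential
            sequential = [r for r in same_name if rec["pid"] - r["pid"] == 1]
            if sequential:
                rec["hid"] = sequential[0]["hid"]
            elif last_hid is not None:
                rec["hid"] = last_hid
    return records


# Rows loosely modeled on Fig 4.3 (page assignments are assumed).
pages = [
    {"page": 587, "pid": 60, "surname": "Smith", "hid": 45},
    {"page": 587, "pid": 61, "surname": "Smith", "hid": 45},
    {"page": 588, "pid": 62, "surname": "Rowe", "hid": "Case3"},
    {"page": 589, "pid": 94, "surname": "Rowe", "hid": 50},
    {"page": 589, "pid": 96, "surname": "Reed", "hid": 51},
]
sliding_window_search(pages)
print(pages[2]["hid"])
```

Building the page index before any record is modified keeps the sketch consistent with the requirement that the search relies on HIDs that are already known.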

Algorithm 2 Sliding-window Search Step
Require: Sorted CensusRecords[N, M] with HID
 1: i ← 1
 2: R ← CensusRecords[, i]
 3: LastHID ← R[HID, i]
 4: while i < M do
 5:     i ← i + 1
 6:     if R[HID, i] == Case3 OR R[HID, i] == Case5 then
 7:         for all records r in the last, current, next page do
 8:             if R[surname, i] == r[surname] then
 9:                 R[HID, i] ← r[HID]
10:             end if
11:         end for
12:     else
13:         if R[HID, i] == Case4 then
14:             for all records r in the last, current, next page do
15:                 if R[surname, i] == r[surname] and R[PID, i] − r[PID] == 1 then
16:                     R[HID, i] ← r[HID]
17:                 else
18:                     R[HID, i] ← LastHID
19:                 end if
20:             end for
21:         else
22:             LastHID ← R[HID, i]
23:         end if
24:     end if
25: end while

4.2 Complexity

Time complexity is commonly estimated by counting the number of elementary operations performed by an algorithm, and “big-O” notation is used to express an algorithm's runtime complexity. Our system consists of three steps: sorting, comparing, and sliding-window search.

For the sorting step, many functions with different time complexities can be used. In our implementation, we used the “sort” function in Python, whose complexity is O(n log n), where n is the number of records to be sorted. During the comparison step, each record in the census data is scanned and compared once, so its complexity is O(n), where n is the number of records in the census data. The sliding-window algorithm deals with the suspended records (in the worst case, all census records are suspended records). For each suspended record, it evaluates all records in three pages. Hence, the worst-case complexity of the sliding-window algorithm is O(nm), where m = 3 × the maximum page size in the census.

4.3 Evaluation and Results

In this section, we discuss the results of applying the proposed household identification system to the 1891 and 1901 Canadian censuses. Table 4.11 shows the percentage of suspended records in the five cases before and after applying the sliding-window step. The HID identification system excludes the remaining suspended records. This step is effective in decreasing the number of excluded records that belong to the five cases from almost 10% and 2% in the 1891 and 1901 censuses, respectively, to less than 1% in both censuses (Table 4.11).

Table 4.11: The percentage of the suspended records in the censuses before and after applying sliding-window search

         1891 before        1891 after        1901 before         1901 after
Case 1   0.1%    7,045      0.0%   0          0.002%   149        0.002%   149
Case 2   0.0%    0          0.0%   0          0.2%     11,109     0.2%     11,109
Case 3   4.6%    223,707    0.8%   41,801     1.6%     88,132     0.6%     32,062
Case 4   4.2%    209,530    0.0%   0          0.06%    3,269      0.0%     0
Case 5   1.0%    50,876     0.1%   4,847      0.02%    1,562      0.005%   268
Total    10%     491,158    0.9%   46,648     1.9%     104,221    0.8%     43,588

In general, there is no precise way to evaluate the generated households except through manual evaluation. However, given the total number of records in the 1891 and 1901 censuses, this is not feasible for the whole dataset. Therefore, a small sample of the HIDs produced by the proposed system was presented for manual inspection. In addition, a comparison is made between the automatically produced households of the transcribed copies of the 1891 and 1901 censuses and the manually produced households of the transcribed copies of the 1871 and 1881 Canadian censuses [4], both in terms of total households and household sizes. Finally, we compared the manual and automatic HID identification of the four censuses to the aggregate household information on the Statistics Canada website¹.

Table 4.12 shows the household information in the transcribed censuses that resulted from the proposed identification system for the 1891 and 1901 censuses and from the manual household identification of the 1871 and 1881 censuses, along with the household information from the Statistics Canada website. Because of the unsolved data issues, 0.9% and 0.8% of the 1891 and 1901 census records, respectively, were excluded by the proposed household identification system, so the information in Table 4.12 (e.g., average household size) covers about 99% of the census records and not the whole census. All of the records that refer to the same household can be detected by grouping records with the same automatically generated HID. The average household size is the average number of people per household. While the average household sizes for the 1871, 1881, and 1891 censuses in the transcribed census data agree to within 0.1 with the values provided by the Statistics Canada website, the average household size for 1901 is 4.8 in the transcribed copy and 5.0 on the Statistics Canada website. The majority of households have a size that ranges from 1 to 15 (Fig 4.4), but in some cases a household can have hundreds of members because it represents a hotel, prison, orphanage, or factory.

Table 4.12: Summary of household information

Transcribed census data   1871        1881        1891        1901
HID source                Manual      Manual      Automatic   Automatic
Data set size             3,466,427   4,277,807   4,787,244   5,343,566
Included records          --          --          99.1%       99.2%
Total of households       609,300     801,052     881,923     1,096,878
Size range                1 : 761     1 : 625     1 : 1260    1 : 3190
Size average              5.671       5.340       5.375       4.831

Statistics Canada         1871        1881        1891        1901
Census data size          3,485,761   4,278,327   4,833,239   5,371,315
Total of households       622,719     800,410     900,080     1,058,564
Size average              5.6         5.3         5.3         5.0

According to the average household size and the size distribution of the majority of households (Fig 4.4) in the four censuses, there is a trend toward smaller households with the passage of time, which corresponds to the statement on the Statistics Canada website that “household size in Canada has declined over time”. Figure 4.5 (from the Statistics Canada website) shows the decrease in the average size of households in Canadian censuses. Also, if we compare the distributions of the manual HIDs for the 1871 and 1881 censuses with the distributions of the automatic HIDs for the 1891 and 1901 censuses, as shown in Fig 4.4, we see similar distributions of household sizes, but the curve of the 1901 census is shifted further to the left, toward smaller household sizes. For example, households of size two and size seven each form around 10% of the households in the 1871 census. In the 1901 census, however, households of size two form 14% of all households, while households of size seven form 8%.

¹ Statistics Canada website, URL: http://www.statcan.gc.ca/pub/11-630-x/11-630-x2015008-eng.htm

Figure 4.4: The distribution of household size of the manual HID (1871, 1881) and the automatic HID (1891, 1901)

Figure 4.5: Number of households and average number of people per household, Canada, 1851 to 2011. Reprinted from The shift to smaller households over the past century, by Statistics Canada, April 8 2017, retrieved from http://www.statcan.gc.ca/pub/11-630-x/11-630-x2015008-eng.htm

Chapter 5

Linking Canadian Historical Census Collections

Having assigned HIDs to the records in the 1891 and 1901 censuses, we now discuss the process of linking the 1871, 1881, 1891, and 1901 censuses. Section 5.1 describes the methodology used to link the four censuses based on HID groups. Section 5.2 discusses the evaluation process employed, along with the results produced by the linkage system. A comparison of seven set similarity measures appears in Section 5.3. Finally, a study of the bias of the linked data is presented in Section 5.4.

Figure 5.1: The process of generating longitudinal data over 3 decades

5.1 Method Overview

This section explains the process used to link the four historical Canadian censuses: 1871, 1881, 1891, and 1901. Recall that the 1871 and 1881 censuses were successfully linked in [4] using a pairwise linkage (PiM) system and a group-linkage (disambiguation) system [42]. We now extend this work by linking the 1881 to 1891 and the 1891 to 1901 censuses using the same linkage systems: the PiM system and the disambiguation system. Figure 5.1 shows an overview of the process.

5.1.1 The PiM System

The People-in-Motion (i.e., PiM) system [4] is a pairwise record linkage system. This system was designed by a group of researchers at the University of Guelph [6] to link the approximately 3.5 million records in the 1871 Canadian census to the approximately 4.3 million records in the 1881 Canadian census.

Figure 5.2: The process of the PiM system

Figure 5.2 shows an overview of the PiM system (the process of pairwise linkage is defined in Chapter 2). In the first step (blocking), three different attributes are used as blocking keys (i.e., the first name, the first letter of the last name, and the birthplace). Then, in the comparison step, six attributes of record pairs (i.e., surname, given name, age, gender, birthplace, and marriage status) are compared to generate feature vectors. While exact comparison was performed on birthplace and gender, an approximate comparison was performed on the remaining attributes. In the classification stage, the PiM system uses a supervised classifier, a Support Vector Machine (SVM), trained using labeled data (i.e., matching and non-matching record pairs) constructed from the 1871 and 1881 Canadian census datasets. The SVM classifier labels all record pairs as matches or non-matches based on their feature vectors. Only the single links are considered matches (link types are defined in Chapter 2, Section 2.2.5). The multiple links are considered ambiguous, and they are discarded by the PiM system. The PiM system obtained a linkage rate of 15.3% and a false-positive rate of 4.9% (the linkage rate and the false-positive rate are defined in Chapter 2, Section 2.2.5). As stated in [4], maintaining the false-positive rate below 5% is required to obtain high-quality links. The PiM team has linked the 1881 to 1891 and the 1891 to 1901 censuses using the PiM system. They provided us with the generated single and multiple links for use in the disambiguation system.

5.1.2 The Disambiguation System

The disambiguation system [42] is a group-record linkage system that uses household information to disambiguate the multiple links and increase the linkage rate of the PiM system. As stated in [42], approximately 92% of the links generated by the PiM system were not evaluated but removed because they were multiple links. The disambiguation system employs group linkage to compare households with the aim of finding more single links. This is done by favoring links between households that have more members in common. In this way, the disambiguation system seeks to increase the linkage rate.

As shown in Fig 5.3, the disambiguation system takes the single and multiple links of the PiM system as input and then disambiguates the multiple links in a two-step process. The first step is to calculate scores for the multiple links by comparing the households that they belong to. Census records are grouped by household identifiers. Then, each multiple link (a, b) gets a score based on the similarity of the two households (Ha, Hb). The Jaccard measure [32], given in Equation 5.1, is used to compare households.

Figure 5.3: The process of the disambiguation system

\[
\mathrm{Jac\_sim}(H_a, H_b) = \frac{|H_a \cap H_b|}{|H_a \cup H_b|} = \frac{|H_a \cap H_b|}{|H_a| + |H_b| - |H_a \cap H_b|} \tag{5.1}
\]

The second step is to generate single links from the highest-scored multi-link groups. A visual example of this step is shown in Fig 5.4. This step begins by examining the one-to-many multi-link groups, which are (A-E, A-F, A-G) and (C-G, C-H) in Fig 5.4-(1). The one-to-many multi-links with the highest scores are kept, and the rest of the links are removed (Fig 5.4-(2)). Then, the many-to-one multi-link groups ((A-G, B-G, C-G) and (C-H, D-H) in Fig 5.4-(3)) are examined among the remaining links using the same process. In the end, we get a set of links that have the highest scores (see (4) in Fig 5.4). After disambiguating the multi-link groups, the remaining links are examined to see if they are single links.

Figure 5.4: Example of disambiguating multiple groups. (1) Starting one-to-many and many-to-one link groups (2) Disambiguated one-to-many link groups (3) Disambiguated many-to-one link groups (4) Final set of single links, by Richards, 2014, retrieved from [42]

Using a pairwise linkage (the PiM system) first to link the 1871 and 1881 censuses and then a group linkage (the disambiguation system) prevents links from being biased toward the households. The matching decision is made based on pairwise linkage first. Then, by using extra information (i.e., households), the remaining links are examined to find more matches. This process allows both links based on the similarity of the records' information and links based on the similarity of their households. For example, suppose a single link (a, b) was seen as a match by the SVM classifier, but there are no other common links between the household of record a and the household of record b. Based on the household information, the link (a, b) may be seen as a non-match. If the matching decision on this link were made after comparing the households, the matched link (a, b) might be lost. Further, the links would be matched in favor of households that have more members in common, and the generated links would be biased toward their households. Consequently, keeping the group-linkage step separate from the pairwise linkage step avoids the household bias.

The disambiguation system increased the linkage rate of linking the 1871 to the 1881 Canadian census from 15.3% to 31.8%, while it reduced the false-positive rate by 0.4%. It is an efficient linkage system, and to confirm its efficiency, we use it to link the 1881 to 1891 and the 1891 to 1901 censuses. The household identifiers of the 1871 and 1881 censuses were generated manually by the PiM group, and the household identifiers of the 1891 and 1901 censuses were generated automatically by the HID identification method in Chapter 4.
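The second step of the disambiguation process (keeping the highest-scored link in each one-to-many group, then in each many-to-one group) can be sketched as below. This is an illustrative reading of the description, not the implementation from [42]: the function names, the `(a, b, score)` tuple layout, the tie handling, and all scores in the example are assumptions.

```python
from collections import Counter, defaultdict


def keep_best(links, side):
    """Within each group sharing the record on `side`, keep the top-scored links."""
    groups = defaultdict(list)
    for link in links:
        groups[link[side]].append(link)
    kept = []
    for group in groups.values():
        best = max(score for _, _, score in group)
        kept.extend(link for link in group if link[2] == best)
    return kept


def disambiguate(links):
    """links: iterable of (a, b, score). Returns the resulting single links."""
    after_one_to_many = keep_best(list(links), side=0)
    after_many_to_one = keep_best(after_one_to_many, side=1)
    left = Counter(a for a, _, _ in after_many_to_one)
    right = Counter(b for _, b, _ in after_many_to_one)
    # A link is single only if both of its records appear in no other link.
    return [
        (a, b) for a, b, _ in after_many_to_one if left[a] == 1 and right[b] == 1
    ]


# Link groups named as in Fig 5.4; the scores are invented for illustration.
example_links = [
    ("A", "E", 0.2), ("A", "F", 0.1), ("A", "G", 0.6),
    ("B", "G", 0.3),
    ("C", "G", 0.5), ("C", "H", 0.7),
    ("D", "H", 0.4),
]
print(disambiguate(example_links))
```

Links that tie for the best score in a group are all kept by `keep_best` and therefore remain multiple links, so the final filter drops them.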

5.1.3 Links Integration

After linking the four censuses, we have separate sets of links for 1871-1881, 1881-1891, and 1891-1901. To find the records that refer to the same individual across the four censuses, we use the transitive relation. Transitivity requires that if (a, b) and (b, c) are present in the relation, then so is (a, c). For example, in Figure 5.5, because we have the link (a, b) between 1871 and 1881, the link (b, c) between 1881 and 1891, and the link (c, d) between 1891 and 1901, there is also the link (a, d). Hence, we know that a, b, c, and d are records that refer to the same person.
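The transitive integration can be sketched with dictionaries that map a record in one census to its linked record in the next. The function name `chain_links` and the record identifiers are hypothetical, and the sketch assumes each link set contains only single (one-to-one) links.

```python
def chain_links(*link_maps):
    """Follow links across consecutive census pairs.

    Each argument maps a record ID in one census to the linked record ID
    in the next census. Returns the complete chains that span every pair.
    """
    chains = []
    for start in link_maps[0]:
        chain = [start]
        for links in link_maps:
            nxt = links.get(chain[-1])
            if nxt is None:
                break  # the person could not be followed further
            chain.append(nxt)
        else:
            chains.append(tuple(chain))
    return chains


links_1871_1881 = {"a": "b", "x": "y"}
links_1881_1891 = {"b": "c"}
links_1891_1901 = {"c": "d"}
print(chain_links(links_1871_1881, links_1881_1891, links_1891_1901))
```

Passing only two of the three mappings yields the two-decade chains in the same way.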

Figure 5.5: Same person across 4 censuses

5.2 Evaluation and Results

Tables 5.1 and 5.2 show the results of linking the 1871, 1881, 1891, and 1901 Canadian censuses using two systems, the PiM system and the disambiguation system. While the linkage rates of the PiM system are very close (15.88%, 14.84%, and 14.87%), the linkage rates of the disambiguation system differ more (31.84%, 28.27%, and 26.43%). However, measured by the percentage of the total multiple links that the disambiguation system converts into single links, its performance is almost the same across the three census pairs: the single links from the disambiguation system form 3.7%, 3.5%, and 3.5% of the multi-links for 1871-1881, 1881-1891, and 1891-1901, respectively.

Table 5.1: Linking census results: The PiM system

Censuses    PiM multi-links   PiM single links   LR
1871-1881   14,831,427        550,726            15.88%
1881-1891   16,422,428        635,161            14.84%
1891-1901   15,834,832        712,318            14.87%

Figure 5.6: The generated longitudinal data

Table 5.2: Linking census results: The disambiguation system

Censuses    Disambiguated single links   Total single links   Final LR
1871-1881   552,987                      1,103,713            31.84%
1881-1891   574,704                      1,209,865            28.27%
1891-1901   553,680                      1,265,998            26.43%

Overall, as shown in Fig 5.6, we generated longitudinal data over three decades (1871-1881, 1881-1891, 1891-1901) for 159,872 individuals. Also, we created 406,702 links over two decades (1871-1881, 1881-1891) and 455,184 links over two decades (1881-1891, 1891-1901).

5.3 The Disambiguation System with Different Group Similarity Techniques

As mentioned in the last section, the disambiguation system is a group-record linkage method that uses households as groups to disambiguate the multi-links. In its first step, it calculates scores for the multiple links based on household similarity. The disambiguation system relies on a suitable group similarity measure; therefore, in this thesis, we explore the performance of various measures. In this section, we investigate the effect of using different methods (i.e., five set similarity measures, the bipartite matching technique, and the Winkler bipartite matching technique) to calculate the scores of the multiple links in the disambiguation system.

5.3.1 Set Similarity Measures

A set similarity measure tests the similarity between two finite sets and assigns a score between zero (dissimilar sets) and one (identical sets). In our research, the sets are households, and the set items are household members. The Jaccard measure [32] is a common group or set similarity measure in historical group linkage systems [21] [42]. The households can be represented in a Venn diagram, such as Figure 5.7, where:

• Ha and Hb are the two compared households.

• The letter “a” refers to the number of Ha members that are not linked to Hb.

• The letter “b” refers to the number of Hb members that are not linked to Ha.

• The letter “c” refers to the number of Ha and Hb members that are linked to each other (by single links or multiple links), which is the intersection between the two households (Ha ∩ Hb).

Figure 5.7: Venn diagram of 2 households

Table 5.3: Set similarity measures

Measure                         Equation
Jaccard Measure                 |c| / (|Ha| + |Hb| − |c|)
Dice Measure                    2|c| / (|Ha| + |Hb|)
Overlap Measure                 |c| / min(|Ha|, |Hb|)
Sbraun Banquet Measure          |c| / max(|Ha|, |Hb|)
Kulczynski Similarity Measure   (|c|/|Ha| + |c|/|Hb|) / 2

Table 5.3 shows the set similarity methods [10] that we investigate, along with their definitions. Each measure defines similarity in a different way. For example, the households in Figure 5.8 look the same if we assume that “Jack” in Ha and “Jackson” in Hb are the same person, and that “Lily”, who is 10 years old, and “Adam”, who is 6 years old, are two children who were added to the household during the ten years between the censuses. To calculate the similarity score for the 1871 household Ha and the 1881 household Hb, we use:

• The size of the first household, |Ha| = 3

• The size of the second household, |Hb| = 5

• The number of links between the two households, c = 3

Figure 5.8: Two linked households example

Using the Jaccard measure, Ha and Hb have a similarity score of 0.6.

\[
\mathrm{Jac}(H_a, H_b) = \frac{3}{3 + 5 - 3} = 0.6
\]

Knowing that two households have common members is important when deciding whether they are the same. While the Jaccard measure gives equal weight to matching and non-matching members, Dice assigns more weight to the matching members between any two households by doubling the number of links. Using the Dice measure, the similarity score of Ha and Hb is 0.75.

\[
\mathrm{dice}(H_a, H_b) = \frac{2 \times 3}{3 + 5} = 0.75
\]

Because changes in households over time (by birth or death) are likely to occur, the Overlap measure disregards this change by dividing the number of links between the households by the size of the smaller of Ha and Hb. As a result, it gives 1 as the similarity score between the households.

\[
\mathrm{overlap}(H_a, H_b) = \frac{3}{3} = 1
\]

Ignoring the change between households is not always correct. For instance, if the age of “Lily” in the 1881 household in the example in Figure 5.8 were 23, and the age of “Adam” were 21, could we still consider these two households the same? In this case, assigning 1 as a similarity score may not be right. By dividing the number of links by the size of the larger household, the Sbraun Banquet measure assigns 0.6 to Ha and Hb.

\[
\mathrm{sbraun}(H_a, H_b) = \frac{3}{5} = 0.6
\]

The Kulczynski similarity measure assigns a similarity score to two households based on the average of the proportions of linked members relative to each household's size.

\[
\mathrm{kul}(H_a, H_b) = \frac{\frac{3}{3} + \frac{3}{5}}{2} = 0.8
\]

Overall, each measure has a different emphasis when calculating the similarity of households. Although it is not practical to introduce a “best” household similarity measure in general, a comparison study could shed light on the performance of these measures for the system we are proposing.
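The five measures of Table 5.3 can be checked against the worked example (|Ha| = 3, |Hb| = 5, c = 3) with a short sketch. The function names are chosen for illustration; `braun_banquet` corresponds to the measure the text calls Sbraun Banquet, which is more commonly spelled Braun-Blanquet elsewhere.

```python
def jaccard(size_a, size_b, c):
    return c / (size_a + size_b - c)


def dice(size_a, size_b, c):
    return 2 * c / (size_a + size_b)


def overlap(size_a, size_b, c):
    return c / min(size_a, size_b)


def braun_banquet(size_a, size_b, c):
    return c / max(size_a, size_b)


def kulczynski(size_a, size_b, c):
    return (c / size_a + c / size_b) / 2


# Household sizes and link count from the example in Figure 5.8.
size_a, size_b, c = 3, 5, 3
for measure in (jaccard, dice, overlap, braun_banquet, kulczynski):
    print(f"{measure.__name__}: {measure(size_a, size_b, c):.2f}")
```

All five functions assume non-empty households and c no larger than the smaller household, as in the definitions above.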

5.3.2 Bipartite Matching Technique (BM)

The BM measure [39] (defined in Equation 5.2) is the normalized weight of the maximum-weight matching in a bipartite graph in which the nodes are group members and the edges are the links between them. Each edge between two nodes carries the similarity score of these nodes, sim(r1a, r2b).

\[ \mathrm{BM}_{sim}(G_a, G_b) = \frac{\sum_{(r_{1a}, r_{2b})} sim(r_{1a}, r_{2b})}{|G_a \cup G_b|} \tag{5.2} \]

We can use this measure to compare households in the disambiguation system. Figure 5.9 shows an example of a bipartite graph of households, where the nodes are household members and the edges are the (single or multiple) links between them. These links (edges) have probability scores from the SVM classifier. The scores represent the confidence of the classifier that the link is a match or a non-match, so a higher score means more confidence. The BM measure for household matching is defined in Equation 5.3.

Figure 5.9: Bipartite graph of households links

\[ \mathrm{BM}_{sim}(H_a, H_b) = \frac{\sum_{(H_a \cap H_b)} Prop(r_{1a}, r_{2b})}{|H_a \cup H_b|} \tag{5.3} \]

Depending only on the size of the households and the number of common members to define matched households (as the set similarity measures in the previous subsection do) can be ineffective. Hence, the BM measure includes the probability scores of household members to help in the matching decision. For example, in Fig 5.7, Ha and Hc are from the 1871 census, and Hb is from the 1881 census. Using the Jaccard measure, the household pairs (Ha,Hb) and (Hc,Hb) have the same similarity score, which is 0.6. The BM measure, on the other hand, gives different scores depending on the probability scores of the links. The BM score for (Ha,Hb) is 0.39, and the BM score for (Hc,Hb) is 0.54.
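A minimal sketch of Equation 5.3 follows. It assumes that the links between the two households are already one-to-one (so no matching step is needed) and that the union size equals the sum of the household sizes minus the number of links, as in the Jaccard denominator. The probability scores are hypothetical values chosen for illustration; they are not the scores shown in the thesis figure.

```python
def bm_similarity(size_a, size_b, link_scores):
    """Sum of link probability scores divided by the size of the household union.

    link_scores: one probability score per one-to-one link between the
    two households (assumed to be disambiguated already).
    """
    union_size = size_a + size_b - len(link_scores)
    return sum(link_scores) / union_size


# Two hypothetical household pairs with identical sizes and link counts,
# so both have a Jaccard score of 3 / 5 = 0.6.
confident_pair = bm_similarity(3, 5, [0.9, 0.8, 0.7])
doubtful_pair = bm_similarity(3, 5, [0.6, 0.5, 0.4])

print(round(confident_pair, 2), round(doubtful_pair, 2))
```

Both pairs are indistinguishable under Jaccard, but the pair with higher link probabilities receives the higher BM score, which is the behaviour the paragraph above describes.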

5.3.3 Winkler Bipartite Matching Technique (WBM)

The link type (single or multiple) is another piece of information that can be used to decide whether two households match. The WBM measure is a new group similarity measure that combines the bipartite matching technique with the Jaro-Winkler (JW) measure. The Jaro-Winkler measure (defined in [48]) is a string similarity measure that gives more favorable ratings to strings that match from the beginning, based on the length l of the common prefix. We can apply a similar concept when comparing households by giving more favorable ratings to households that share more single links. Equation 5.4 shows the WBM measure, which uses the BM score and the number of single links (l).

WBM(Ha,Hb) = BM(Ha,Hb) + (l ∗ 0.1(1 − BM(Ha,Hb))) (5.4)

The similarity scores for (Ha,Hb) and (Hc,Hb) in the previous example (in Figure 5.9) will be:

WBM(Ha,Hb) = 0.39 + (0 ∗ 0.1(1 − 0.39)) = 0.39

WBM(Hc,Hb) = 0.54 + (2 ∗ 0.1(1 − 0.54)) = 0.63
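Equation 5.4 can be checked with a few lines of code. This is a sketch only: the function and parameter names are illustrative, and the scaling factor 0.1 is taken from the equation.

```python
def wbm_similarity(bm_score, single_links, scale=0.1):
    """Boost a BM score by the number of single links, as in Equation 5.4."""
    return bm_score + single_links * scale * (1 - bm_score)


# Values from the worked example in the text.
wbm_ab = wbm_similarity(0.39, 0)  # no single links, so the score is unchanged
wbm_cb = wbm_similarity(0.54, 2)  # two single links raise the score

print(round(wbm_ab, 2), round(wbm_cb, 2))
```

Note that, as written, the formula exceeds 1 whenever l is greater than 10 and the BM score is below 1. The source does not state a cap on l, so none is applied here.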

The choice of similarity measure is likely to have a major impact on the results of the disambiguation system. Studying the effect of each measure is an essential step in developing this system.

5.3.4 Applying Thresholds

The disambiguation system uses only the multi-links with scores that are higher than a specific threshold. The author of this system [42] argued that the multiple links with the highest scores are not always matches. Births, deaths, immigration, and emigration result in new individuals in the second census who do not appear in the first census, and in some individuals in the first census who do not appear in the second census. This could result in linking records that have similar attributes but do not refer to the same person. To cut off such links and to make sure that the chosen links are true matches, the system keeps only the multiple links with the highest scores that exceed a certain threshold. The author explored the performance of various threshold values and chose 0.25 as the best threshold. In our investigation of the performance of the disambiguation system with different group similarity techniques, we used the same threshold of 0.25. For comparison purposes, our study focuses on the performance of different measures rather than on applying different thresholds.
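The thresholding rule can be sketched as follows. The sketch assumes that each record has a list of candidate links, each with a numeric score, and that the system keeps the highest-scoring candidate only if its score exceeds the threshold. The data layout, the record identifiers, and the function name are hypothetical, and the score is treated as an opaque number because this section does not say how it is computed for each candidate.

```python
THRESHOLD = 0.25  # value reported in the text


def keep_best_links(candidates, threshold=THRESHOLD):
    """Keep, for each record, its best candidate link if it passes the threshold.

    candidates: dict mapping a record id to a list of
    (candidate_id, score) tuples representing its multiple links.
    """
    kept = {}
    for record_id, links in candidates.items():
        if not links:
            continue  # nothing to disambiguate for this record
        best_id, best_score = max(links, key=lambda link: link[1])
        if best_score > threshold:
            kept[record_id] = (best_id, best_score)
    return kept


# Hypothetical multiple links for two records of the first census.
example = {
    "r1": [("x", 0.6), ("y", 0.4)],  # best candidate passes the threshold
    "r2": [("z", 0.2)],              # best candidate is cut off
}
kept_links = keep_best_links(example)
print(kept_links)
```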

5.3.5 Evaluation and Results

Evaluating a record-linkage system requires test data: matching record pairs that human experts have identified as referring to the same people in two datasets. In this thesis, the only test data we have is a list of 11,716 matches between the 1871 and 1881 censuses. Hence, we evaluate the disambiguation system with the group similarity measures on linking the 1871 and 1881 censuses only. However, we assume that an improvement on the 1871 and 1881 linkage is likely to carry over to the linkage of the other censuses. In the evaluation step, we calculate the linkage rate, the false positive rate, and the true positive rate.

Table 5.4 shows the results of disambiguating the links between the 1871 and 1881 censuses using different group similarity measures. The highest linkage rate is obtained by the Kulczynski similarity measure (35.90%), closely followed by the Dice measure (35.62%), while the lowest false positive rates are achieved by the WBM (2.05%) and Jaccard (2.08%) measures. In practice, achieving a false positive rate below 5% is important to generate high-quality links, and all the methods accomplish that. Because no measure achieves both the highest linkage rate and the lowest false positive rate, we cannot choose a single preferable measure for the disambiguation system. However, if research requires as many links as possible regardless of the false positive rate, then the Kulczynski measure is the best choice. Further, if link quality matters, the measure with the lowest false positive rate (i.e., WBM) is the best choice.

Table 5.4: Evaluation results of the disambiguation system versions

Measure                          TPR      FPR     LR
Jaccard Coefficient              80.72%   2.08%   31.82%
Dice Coefficient                 83.72%   3.50%   35.62%
Overlap Coefficient              78.81%   3.61%   33.60%
Sbraun Banquet Measure           82.50%   2.95%   33.67%
Kulczynski Similarity Measure    82.97%   3.75%   35.90%
BM                               83.05%   2.28%   33.34%
WBM                              82.39%   2.05%   32.52%

TPR = true positive rate, FPR = false positive rate, LR = linkage rate.
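The rows of Table 5.4 can be ranked programmatically to see which measure leads on each criterion. The sketch below only transcribes the table values; the variable names are illustrative.

```python
# (TPR %, FPR %, LR %) for each measure, transcribed from Table 5.4.
table_5_4 = {
    "Jaccard": (80.72, 2.08, 31.82),
    "Dice": (83.72, 3.50, 35.62),
    "Overlap": (78.81, 3.61, 33.60),
    "Sbraun Banquet": (82.50, 2.95, 33.67),
    "Kulczynski": (82.97, 3.75, 35.90),
    "BM": (83.05, 2.28, 33.34),
    "WBM": (82.39, 2.05, 32.52),
}

highest_tpr = max(table_5_4, key=lambda m: table_5_4[m][0])
lowest_fpr = min(table_5_4, key=lambda m: table_5_4[m][1])
highest_lr = max(table_5_4, key=lambda m: table_5_4[m][2])

# Check the 5% false positive rate ceiling mentioned in the text.
all_below_5_percent = all(fpr < 5.0 for _, fpr, _ in table_5_4.values())

print(highest_tpr, lowest_fpr, highest_lr, all_below_5_percent)
```

No single row wins on all three columns, which is why the choice of measure depends on whether link quantity or link quality matters more for a given study.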

5.4 The Bias of the Linkage Process

There are two primary goals when we create linked (longitudinal) data: the data should be accurate and representative. Linked data is accurate if it has a low false positive rate, and it is representative if it accurately reflects the entire population. The individuals in the longitudinal data should be representative of the population in the original census, and any bias in the longitudinal data undermines this representativeness. For example, if half of the population is women, then 50% of the longitudinal data should refer to women. In this section, we investigate whether the generated longitudinal data is biased and, if so, which groups of individuals are affected.

First, to investigate the bias, we compare records from the census datasets (1871, 1881, 1891, and 1901), the longitudinal data from the pairwise (individual) linkage, and the longitudinal data from the group (household) linkage, which is the disambiguation system with the Jaccard measure. We study the bias over the following attributes: gender, age groups, marital status, birthplace, origin, and religion.

For the gender attribute, both linkage methods (the pairwise and the group linkage) over-represent males and under-represent females, as shown in Fig 5.10; however, the group linkage has a slightly lower gender bias than the pairwise linkage. For the age groups (children (0-14), young (15-24), middle (25-49), and older (50+)), the group linkage under-represents young females and over-represents the children group for both males and females (Figures 5.11, 5.12). One explanation for the bias against young, single females is that these women are likely to marry and change their surnames within the next 10 years, which makes it impossible to link them using the current record-linkage techniques. In general, a person who remained with the same group (e.g., married people) between census years would be over-represented in the linked data. The pairwise linkage over-represents married people and under-represents single people (Figures 5.13, 5.14, 5.15) more than the group linkage does.

Figures 5.16 and 5.17 show the distribution of birthplaces and origins respectively. Both the pairwise and the group linkage under-represent people born in Quebec and Ireland, and they over-represent people born in Nova Scotia. While the group linkage over-represents people born in Ontario, the pairwise linkage under-represents them. People who were born in England or have English origin are over-represented by the pairwise linkage, while people of French origin are under-represented by both linkage methods. Further, as shown in Figure 5.18, Catholics are under-represented, and Protestants are over-represented.

Based on these results, we can say that both methods have a bias, and we cannot prefer one method over the other. However, by providing researchers with the results and the bias of both linkage methods, we allow them to choose their preferred linkage method based on their research needs.
Figure 5.10: Bias of the PiM and disambiguation systems - Gender

Figure 5.11: Bias of the PiM and disambiguation systems - Age by Female

Figure 5.12: Bias of the PiM and disambiguation systems - Age by Male

Figure 5.13: Bias of the PiM and disambiguation systems - Marital status

Figure 5.14: Bias of the PiM and disambiguation systems - Males Marital status

Figure 5.15: Bias of the PiM and disambiguation systems - Female Marital status

Figure 5.16: Bias of the PiM and disambiguation systems - Birthplace

Figure 5.17: Bias of the PiM and disambiguation systems - Origin

Figure 5.18: Bias of the PiM and disambiguation systems - Religion

Secondly, we compare the bias of the links generated by different versions of the disambiguation system between the 1871 and the 1881 censuses over the same attributes: gender, age groups, marital status, birthplace, origin, and religion. The different versions of the disambiguation system have the same bias on the linked data across all discussed attributes, as shown in Figures 5.19 to 5.27. For example, the disambiguation system under-represents females and over-represents males in all its versions (Fig 5.19). In the census data, females form around 49%, while in the data linked by the disambiguation system with the Jaccard, Sbraun, Kulczynski, Overlap, and Dice measures, females form 45.78%, 45.82%, 46.12%, 46.57%, and 45.90% respectively. The Overlap measure represents slightly more females and fewer males than the other measures (Fig 5.19), and the increase in females is in the children age group (Fig 5.20). Overall, the method of linkage (pairwise or group) has the major effect on the bias of the linked data.
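The gender figures quoted above can be turned into a simple representation gap, that is, the share of females in the linked data minus their share in the census. The sketch uses the percentages from the text and treats the census share as exactly 49%, although the text only says "around 49%"; the variable names are illustrative.

```python
census_female_share = 49.0  # assumption: "around 49%" taken as 49.0

# Share of females (%) in the linked data for each similarity measure.
linked_female_share = {
    "Jaccard": 45.78,
    "Sbraun Banquet": 45.82,
    "Kulczynski": 46.12,
    "Overlap": 46.57,
    "Dice": 45.90,
}

# Negative values mean that females are under-represented.
gaps = {
    measure: round(share - census_female_share, 2)
    for measure, share in linked_female_share.items()
}

# The measure whose linked data is closest to the census share.
closest = max(gaps, key=gaps.get)

for measure, gap in gaps.items():
    print(f"{measure}: {gap:+.2f} percentage points")
print("closest to census:", closest)
```

Every gap is negative and the gaps differ by less than one percentage point, in line with the observation that the choice of similarity measure has little effect on the gender bias.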

Figure 5.19: Bias of the versions of the disambiguation system - Gender

Figure 5.20: Bias of the versions of the disambiguation system - Age by Females

Figure 5.21: Bias of the versions of the disambiguation system - Age by Males

Figure 5.22: Bias of the versions of the disambiguation system - Marital Status

Figure 5.23: Bias of the versions of the disambiguation system - Marital Status by Females

Figure 5.24: Bias of the versions of the disambiguation system - Marital Status by Males

Figure 5.25: Bias of the versions of the disambiguation system - Birthplace

Figure 5.26: Bias of the versions of the disambiguation system - Origin

Figure 5.27: Bias of the versions of the disambiguation system - Religion

Chapter 6

Conclusions and Future Work

In this thesis, we linked four historical Canadian censuses (1871, 1881, 1891, and 1901) to generate longitudinal data that traces the history of individuals across the 30-year span of the censuses. To link the four censuses, we used a two-step linkage system. First, the pairwise linkage (the PiM system) generated the single and the multiple links between each pair of consecutive censuses. Then, the group linkage (the disambiguation system) disambiguated the multiple links using the household information to improve the linkage rate of the PiM system. The following datasets have been constructed:

• The 1871 and the 1881 censuses (1,103,713 links)

• The 1881 and the 1891 censuses (1,209,805 links)

• The 1891 and the 1901 censuses (1,265,998 links)

• The 1871, the 1881, and the 1891 censuses (406,702 links)

• The 1881, the 1891, and the 1901 censuses (455,184 links)

• All four censuses (159,872 links)

To overcome the lack of HIDs in the 1891 and the 1901 censuses, we have introduced an automatic household identification system. The household identification system is implemented in two steps: a first-pass HID assignment, followed by resolution of the unassigned cases with the sliding-window algorithm. The experimental results show that the proposed system can perform efficiently without requiring a (possibly time-consuming) further data cleaning step. We could accurately detect households while excluding fewer than 1% of the 1891 and 1901 census records. Thus, when historical censuses lack HIDs, it may be faster and more cost-effective to adopt our HID identification strategy than to rely on a manual process.

In addition, we explored the performance of seven group similarity measures (Jaccard, Dice, Overlap, Sbraun Banquet, Kulczynski, Bipartite Matching, and Winkler Bipartite Matching) in the disambiguation system. The experimental results on the 1871 and 1881 test data show that all of the methods produced false positive rates lower than 5%, which is the threshold for an acceptable rate. While the Kulczynski measure achieved the highest linkage rate, the WBM measure has the lowest false positive rate.

The bias of the generated longitudinal data was investigated over six attributes (i.e., gender, age, marital status, birthplace, origin, and religion). We conclude that both linkage methods (pairwise and group linkage) have a bias, and that the group similarity measure does not have a major effect on the bias of the disambiguation system.

In this thesis, we show that a longitudinal dataset can be constructed by linking individuals and households in four historical Canadian censuses (1871, 1881, 1891, and 1901), and that identifying the households automatically allows us to use a group record-linkage technique, which increases the linkage rate.

6.1 Future Work

Although this system is capable of producing households of the 1891 and 1901 censuses, there are still areas of possible improvement which can be applied to various steps such as:

• A prior data cleaning step, such as normalizing the “Relation” attribute by translating it into a standardized list, could result in higher-quality households with no need to exclude any census record.

• The limited number of attributes in the transcribed copies of the censuses made extracting household information a complicated task. Adding extra attributes that carry information about households (e.g., address) could help in producing better households.

• In terms of the string comparison method chosen, an investigation into the performance of different methods could help the system tolerate more spelling errors.

In addition, there is another area for further improvement, which is to improve the four censuses linking process. For example:

• Even though we used efficient linkage techniques, the lack of test data for evaluating the links of the 1881-1891 and the 1891-1901 linkages prevented us from confirming the quality of these links. Test data labeled manually by experts could confirm the quality of the generated links.

• Although over 80% of the multi-links have an SVM score of 0.96, the WBM similarity measure achieved a higher linkage rate and a lower false positive rate than the Jaccard measure. Hence, using the similarity scores from the comparison step in the PiM system instead of the probability scores from the classification step (SVM) may affect the linkage rate and the false positive rate.

• One way to expand the comparison of the disambiguation system with different group similarity measures would be comparing these measures using different thresholds.

The longitudinal data created through this thesis will facilitate the study of changes in Canadian society, history, and economy over a thirty-year period. Possible research topics include migration, social mobility, labor market adjustment, and even individual health. In addition, these data could be harmonized with corresponding historical census data from other countries (e.g., the United States) to build expanded longitudinal data.

Bibliography

[1] Minnesota Population Center: North Atlantic Population Project, 2016.
[2] L. Antonie, P. Baskerville, K. Inwood, and J. A. Ross. Change amid continuity in Canadian work patterns during the 1870s. Lives in Transition: Longitudinal Perspectives from Historical Sources, 2014.
[3] L. Antonie, G. Grewal, K. Inwood, and S. Zarti. Automatic household identification for historical census data. 30th Canadian Conference on Artificial Intelligence, 2017.
[4] L. Antonie, K. Inwood, D. J. Lizotte, and J. A. Ross. Tracking people over time in 19th century Canada for longitudinal analysis. Machine Learning, 95(1):129–146, 2014.
[5] L. Antonie, K. Inwood, and J. A. Ross. A first look at longitudinal data from the Canadian censuses of 1871 and 1881.
[6] L. Antonie, K. Inwood, and J. A. Ross. People in Motion longitudinal data from historical sources, 2012.
[7] I. Bhattacharya and L. Getoor. Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):5, 2007.
[8] J. W. Buehler, K. Prager, and C. J. Hogue. The role of linked birth and infant death certificates in maternal and child health epidemiology in the United States. American Journal of Preventive Medicine, 19(1):3–11, 2000.
[9] M.-S. Chen, J. Han, and P. S. Yu. Data mining: an overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6):866–883, 1996.
[10] S.-S. Choi, S.-H. Cha, and C. C. Tappert. A survey of binary similarity and distance measures. Journal of Systemics, Cybernetics and Informatics, 8(1):43–48, 2010.
[11] P. Christen. Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer Science & Business Media, 2012.
[12] P. Christen. A survey of indexing techniques for scalable record linkage and deduplication. IEEE Transactions on Knowledge and Data Engineering, 24(9):1537–1555, 2012.
[13] P. Christen, M. B. ANU, and V. S. Verykios. Advanced record linkage methods and privacy aspects for population reconstruction. Population Reconstruction, 2014.
[14] L. Y. Dillon. Integrating nineteenth-century Canadian and American census data sets. Computers and the Humanities, 30(5):381–392, 1996.
[15] X. Dong, A. Halevy, and J. Madhavan. Reference reconciliation in complex information spaces. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pages 85–96. ACM, 2005.
[16] H. L. Dunn. Record linkage. American Journal of Public Health and the Nations Health, 36(12):1412–1416, 1946.
[17] J. C. Dunn. A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Journal of Cybernetics, 3(3):32–57, 1973.
[18] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19(1), 2007.
[19] D. P. Farrington. Longitudinal research strategies: Advantages, problems, and prospects. Journal of the American Academy of Child & Adolescent Psychiatry, 30(3):369–374, 1991.
[20] I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183–1210, 1969.
[21] Z. Fu, H. Boot, P. Christen, and J. Zhou. Automatic record linkage of individuals and households in historical census data. International Journal of Humanities and Arts Computing, 8(2):204–225, 2014.
[22] Z. Fu, P. Christen, and J. Zhou. A graph matching method for historical census household linkage. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 485–496. Springer, 2014.
[23] Z. Fu, J. Zhou, P. Christen, and M. Boot. Multiple instance learning for group record linkage. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 171–182. Springer, 2012.
[24] E. Fure. Interactive record linkage: The cumulative construction of life courses. Demographic Research, 3, 2000.
[25] E. Glasson, N. de Klerk, A. Bass, D. Rosman, L. Palmer, and C. Holman. Cohort profile: the Western Australian family connections genealogical project. International Journal of Epidemiology, 37(1):30–35, 2008.
[26] R. Goeken, L. Huynh, T. Lynch, and R. Vick. New methods of census record linking. Historical Methods, 44(1):7–14, 2011.
[27] P. Grainger. The census: One hundred years ago. Perspectives on Labour and Income, 3(2):277, 1991.
[28] J. Han, J. Pei, and M. Kamber. Data mining: concepts and techniques. Elsevier, 2011.
[29] J. A. Hartigan and M. A. Wong. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):100–108, 1979.
[30] C.-W. Hsu, C.-C. Chang, C.-J. Lin, et al. A practical guide to support vector classification. 2003.
[31] K. Inwood and R. Reid. Introduction: The use of census manuscript data for historical research. Histoire Sociale/Social History, 28(56), 1995.
[32] P. Jaccard. The distribution of the flora in the alpine zone. New Phytologist, 11(11):37–50, 1912.
[33] M. A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[34] C. W. Kelman, A. J. Bass, and C. Holman. Research use of linked health data: a best practice protocol. Australian and New Zealand Journal of Public Health, 26(3):251–255, 2002.
[35] A. McAfee, E. Brynjolfsson, T. H. Davenport, D. Patil, and D. Barton. Big data. The Management Revolution. Harvard Bus Rev, 90(10):61–67, 2012.
[36] M. Michelson and C. A. Knoblock. Learning blocking schemes for record linkage. In Proceedings of the National Conference on Artificial Intelligence, volume 21, page 440. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, 2006.
[37] H. B. Newcombe, J. M. Kennedy, S. Axford, and A. P. James. Automatic linkage of vital records. Science, 130(3381):954–959, 1959.
[38] B.-W. On, E. Elmacioglu, D. Lee, J. Kang, and J. Pei. Improving grouped-entity resolution using quasi-cliques. In Data Mining, 2006. ICDM’06. Sixth International Conference on, pages 1008–1015. IEEE, 2006.
[39] B.-W. On, N. Koudas, D. Lee, and D. Srivastava. Group linkage. In 2007 IEEE 23rd International Conference on Data Engineering, pages 496–505. IEEE, 2007.
[40] C. Parent and S. Spaccapietra. Database integration: The key to data interoperability. Advances in Object-Oriented Data Modeling, page 221.
[41] J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
[42] L. Richards. Disambiguating multiple links in historical record linkage. Master’s thesis, 2013.
[43] L. Richards, L. Antonie, S. Areibi, G. Grewal, K. Inwood, and J. A. Ross. Comparing classifiers in historical census linkage. In 2014 IEEE International Conference on Data Mining Workshop, pages 1086–1094. IEEE, 2014.
[44] S. Ruggles. Linking historical censuses: A new approach. History and Computing, 14(1-2):213–224, 2002.
[45] D. F. Specht. Probabilistic neural networks. Neural Networks, 3(1):109–118, 1990.
[46] A. D. Stark, D. T. Janerich, S. K. Jereb, and M. Hoff. An example of record linkage methods to monitor mortality and cancer incidence. Public Health Reports, 98(3):277, 1983.
[47] R. Vick and L. Huynh. The effects of standardizing names for record linkage: Evidence from the United States and Norway. Historical Methods, 44(1):15–24, 2011.
[48] W. E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. 1990.
[49] W. E. Winkler. Overview of record linkage and current research directions. In Bureau of the Census. Citeseer, 2006.
[50] M. J. Wisselgren, S. Edvinsson, M. Berggren, and M. Larsson. Testing methods of record linkage on Swedish censuses. Historical Methods: A Journal of Quantitative and Interdisciplinary History, 47(3):138–151, 2014.