Disambiguating Multiple Links in Historical Record Linkage

by

Laura Richards

A Thesis presented to The University of Guelph

In partial fulfilment of requirements for the degree of Master of Science in Computer Science

Guelph, Ontario, Canada

© Laura Richards, August, 2013

ABSTRACT

Disambiguating Multiple Links in Historical Record Linkage

Laura Richards
University of Guelph, 2013

Advisors:
Dr. L. Antonie
Dr. G. Gréwal
Dr. S. Areibi

Historians and social scientists are very interested in longitudinal data created from historical sources, as such data creates opportunities for studying people's lives over time. However, its generation is a challenging problem, since historical sources do not have personal identifiers. At the University of Guelph, the People-in-Motion group has constructed a record linkage system to link the 1871 Canadian census to the 1881 Canadian census. In this thesis, we discuss one aspect of linking historical census data: the problem of disambiguating multiple links that are created at the linkage step. We show that the disambiguation techniques explored in this thesis improve upon the linkage rate of the People-in-Motion system, while maintaining a false positive rate no greater than 5%.

Acknowledgements

First, I would like to thank my advisors, Dr. L. Antonie, Dr. G. Gréwal and Dr. S. Areibi, for their assistance throughout my research. Without their knowledge and guidance, I would not have been able to learn as much as I have through my studies. I am also thankful to Dr. F. Song, for his timely feedback and suggestions towards my thesis, and to Dr. K. Inwood, for providing thoughtful insights into the record-linkage problem.

Also, I would like to thank my parents and my friends for their unwavering support throughout my studies; your love and encouragement mean the world to me. And finally, a special thanks to Adrian and Jalisa for the ability to make me laugh, no matter what's happening.

Contents

1 Introduction 1

1.1 PiM Record-Linkage System ...... 2

1.2 Thesis Statement ...... 3

1.3 Approach ...... 3

1.4 Organization ...... 4

2 Background 5

2.1 Record Linkage ...... 5

2.1.1 Data Pre-processing and Blocking ...... 7

2.1.2 Comparison ...... 8

2.1.3 Classification ...... 9

2.1.4 Evaluation ...... 10

2.2 Classification Methods ...... 10

2.2.1 Support Vector Machines ...... 10

2.2.2 K-Nearest Neighbours ...... 12

2.2.3 Naïve Bayes ...... 13

2.3 Evaluation ...... 13

2.4 Canadian Census ...... 14

2.5 Current Record Linkage (Base) System ...... 17

2.5.1 Blocking ...... 18

2.5.2 Comparison ...... 18

2.5.3 Classification ...... 19

2.5.4 Evaluation ...... 20

3 Literature Review 22

3.1 Blocking ...... 22

3.2 Comparison ...... 24

3.3 Record Linkage Classification ...... 26

3.4 Current Tools ...... 27

3.5 Current Record Linkage Systems for Census Data sets ...... 28

3.6 Summary ...... 30

4 Disambiguation Record Linkage System 31

4.1 Exploration of Classification Techniques ...... 31

4.2 Proposed Record Linkage System ...... 33

4.3 Multiple-Links Group-Disambiguation Algorithm ...... 34

4.4 Probability Score from the Classifier ...... 37

4.5 Extra Attributes ...... 39

4.5.1 Origin and Attributes ...... 40

4.5.2 Household Attribute ...... 42

4.6 Using the Origin and Religion Attributes ...... 43

4.6.1 Integrated into a Similarity Measure ...... 44

4.6.2 Used as a Filter and paired with a Similarity Measure ...... 46

4.6.3 Summary of Analysis ...... 48

4.7 Using the Household Attribute ...... 49

4.7.1 Jaccard Coefficient ...... 49

4.7.2 Using Origin and Religion in the Jaccard Measure ...... 52

4.7.3 Current Links with Jaccard ...... 54

4.7.4 Applying Thresholds on Record based Jaccard ...... 59

4.7.5 Applying the Origin and Religion Filter to Household Similarity . . 63

4.7.6 Summary of Analysis ...... 66

4.8 Bias of Disambiguation Methods ...... 67

5 Conclusions 72

5.1 Summary ...... 72

5.2 Future Work ...... 75

A Comparison of Classification Techniques 83

A.1 Evaluation ...... 84

A.2 Experimental Setup ...... 85

A.2.1 Support Vector Machine ...... 86

A.2.2 K-Nearest Neighbour ...... 86

A.2.3 Naive Bayes ...... 87

A.3 Results ...... 87

A.4 Analysis ...... 89

A.5 Testing the Effect of Different Feature Sets on Classifier Performance ...... 90

A.5.1 Experimental Setup ...... 91

A.5.2 Results ...... 91

A.5.3 Analysis ...... 93

A.6 Summary ...... 94

B 1M ∪ M1 links between Household-pairs 96

List of Tables

1.1 Links generated by the PiM system ...... 3

2.1 Simple census data sets ...... 7

2.2 Simple census data sets - Data pre-processing ...... 7

2.3 Simple census data sets and their corresponding feature vectors resulting from a comparison ...... 9

2.4 Known record-pairs with imprecise recording ...... 15

2.5 Census records with similar attributes ...... 16

2.6 Details on the feature scores in a feature vector ...... 19

2.7 Positive links generated by the Base system ...... 21

2.8 Base Results - 5 Fold Cross Validation ...... 21

4.1 Averaged Baseline Results - 5 Fold Cross Validation ...... 31

4.2 Comparison of Prob disambiguation system against the Base system ...... 38

4.3 Comparison of COR disambiguation system against the Prob disambiguation system and Base system ...... 44

4.4 Comparison of ORFilter and ORMatch disambiguation systems against the Prob disambiguation system and Base system ...... 47

4.5 Comparison of AvgOR J disambiguation system against the ORFilter disambiguation system and Base system ...... 54

4.6 Comparison of Rc J disambiguation systems against the ORFilter disambiguation system and Base system ...... 58

4.7 Comparison of Rc J1:1 > threshold disambiguation systems against the Rc J1:1 disambiguation system and Base system ...... 62

4.8 Comparison of Rc J1M∪M1 > threshold disambiguation systems against the Rc J1M∪M1 disambiguation system and Base system ...... 63

4.9 Comparison of Rc J1:1+1M∪M1 > threshold disambiguation systems against the Rc J1:1+1M∪M1 disambiguation system and Base system ...... 64

4.10 Comparison of OR − Rc J disambiguation systems against the Rc J disambiguation systems and Base system ...... 65

A.1 F-B data set results across three different classifiers ...... 87

A.2 Percentage of total links unique to each classifier for F-B data set ...... 88

A.3 Intersection of testing set between classifiers for F-B data set ...... 89

A.4 Results from SVM and KNN tuning ...... 91

A.5 SVM, KNN and NB results over three different feature sets, F-B, F-1 and F-2 ...... 92

A.6 Percentage of total links unique to each feature set ...... 93

List of Figures

2.1 Overview of Record Linkage System ...... 6

2.2 A trained support vector machine classifier, where the circles and triangles represent different classes...... 11

2.3 An Example of K-Nearest Neighbour Classification ...... 12

2.4 Overview of Base Record Linkage System ...... 18

4.1 Overview of Proposed Disambiguation Record Linkage System ...... 33

4.2 Example of disambiguating 1M and M1 groups. (i) Starting 1M and M1 link groups (ii) Disambiguated 1M link groups (iii) Disambiguated M1 link groups (iv) Final set of 1:1 links ...... 36

4.3 Distribution of Prob ...... 39

4.4 Difference in Matching vs Non-Matching origin and religion Codes . . . . . 41

4.5 Distribution of household sizes between 1871 and 1881 censuses ...... 43

4.6 Distribution of COR ...... 46

4.7 Example of Simple Household ...... 50

4.8 Example of 1M Link Group ...... 51

4.9 Example of Households belonging to a 1M Link ...... 51

4.10 Origin/Religion Household setup ...... 53

4.11 1:1 and 1M ∪ M1 Links between Households ...... 56

4.12 Average distribution of Rc J1:1 across 1M ∪ M1 links ...... 60

4.13 Average distribution of Rc J1M∪M1 across 1M ∪ M1 links ...... 60

4.14 Average distribution of Rc J1:1+1M∪M1 across 1M ∪ M1 links ...... 61

4.15 Distribution of gender in 1:1 links compared to 1871 Canadian census . . . 68

4.16 Distribution of age in 1:1 links compared to 1871 Canadian census . . . . . 68

4.17 Distribution of marriage status in 1:1 links compared to 1871 Canadian census 69

4.18 Distribution of birthplace in 1:1 links compared to 1871 Canadian census . 70

4.19 Distribution of origin in 1:1 links compared to 1871 Canadian census . . . . 70

4.20 Distribution of religion in 1:1 links compared to 1871 Canadian census . . . 71

A.1 Intersection of 1:1 links between classifiers for F-B data set ...... 88

A.2 Intersection of 1:1 links between F-B, F-1, and F-2 for each classifier . . . . 93

B.1 1M ∪ M1 links between Household-pairs ...... 97

Chapter 1

Introduction

Record linkage is the process of identifying and linking records that refer to the same entities across several databases [51]. Without a unique identifier, record-linkage techniques must decide whether two records represent the same entity based solely on the similarity between their common attributes. Unfortunately, the attributes in question are usually of low quality, making the record linkage process even more difficult.

For history and social sciences, one of the most important topics of study is the impact of industrialization. However, studying this area is challenging without a way to follow individual people throughout their lives. With census, church and military data sources from the 19th century, historians and social scientists have access to millions of records, which first must be linked together to reconstruct the life-courses (longitudinal data) of individual people before the information is of any use.

Historical record linkage refers to creating longitudinal data from census data, and involves automatically identifying the same person across two (or more) censuses. The availability of high-quality historical longitudinal data is extremely important to both social scientists and historians, as it creates significant research opportunities to study people and societies. However, the design of an effective historical record-linkage system is fraught with challenges. In particular, the matching of records often relies on a limited number

of attributes, which can lead to groups of people with similar attributes, thus making the linkage process difficult.

1.1 PiM Record-Linkage System

At the University of Guelph, the People-in-Motion (PiM) [15] group is working on a system to link millions of records from the Canadian censuses taken every ten years (1852-1911) in order to construct longitudinal data. Currently, they have constructed a record-linkage system [1] to link the 3,466,427 records present in the 1871 Canadian census [30] to the 4,277,807 records present in the 1881 Canadian census [34].

The number and types of links generated by the system are shown in Table 1.1. A One-to-One link refers to a link (a potential match) between two individuals: one from the 1871 census and the other from the 1881 census. A One-to-Many link, on the other hand, refers to a link from a single individual in the 1871 census to two or more individuals in the 1881 census. Similarly, a Many-to-One link refers to a link from a single individual in the 1881 census to two or more individuals in the 1871 census. Finally, a No-Link refers to the case where the system was not able to find a link between an individual in the 1871 census and any individual in the 1881 census.

The current record-linkage system treats all One-to-Many and Many-to-One links as ambiguous, and does not consider them for evaluation. Nor are these links presented to the user. Rather, the focus of the current record-linkage system is on producing as many One-to-One links as possible. Currently, the system achieves a linkage rate of 17.17%, where the linkage rate is simply the percentage of the 1871 records that are included in the One-to-One links. The system also achieves a false-positive rate of 4.98%. The false positive rate corresponds to the percentage of One-to-One links that are falsely seen to be a match by the system. In practice, achieving a false-positive rate below 5% is crucial to social scientists [1], who require high-quality links with which to work.

Link type      Number of 1871 Records   Percentage
One-to-One     595,218                  17.17%
One-to-Many    831,145                  23.98%
Many-to-One    240,482                   6.94%
No Link        1,799,581                51.91%

Table 1.1: Links generated by the PiM system
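The four link types can be derived mechanically from a set of candidate links by counting how often each record appears on either side. A minimal sketch with invented toy IDs (not PiM code or data, and ignoring the interaction between One-to-Many and Many-to-One memberships that the full system must handle):

```python
from collections import defaultdict

def categorize_links(links, ids_1871):
    """Split candidate links into One-to-One, One-to-Many, Many-to-One
    and No-Link sets of 1871 records, based on link multiplicity."""
    out_1871 = defaultdict(set)  # 1871 id -> linked 1881 ids
    in_1881 = defaultdict(set)   # 1881 id -> linked 1871 ids
    for a, b in links:
        out_1871[a].add(b)
        in_1881[b].add(a)

    one_to_one, one_to_many, many_to_one = set(), set(), set()
    for a, bs in out_1871.items():
        if len(bs) > 1:
            one_to_many.add(a)      # one 1871 record, several 1881 candidates
        else:
            b = next(iter(bs))
            if len(in_1881[b]) > 1:
                many_to_one.add(a)  # its 1881 partner has several 1871 candidates
            else:
                one_to_one.add(a)
    no_link = set(ids_1871) - set(out_1871)
    return one_to_one, one_to_many, many_to_one, no_link
```
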

1.2 Thesis Statement

The linkage rate of the current record-linkage system can be improved by utilizing the One-to-Many and Many-to-One links created by the system, while maintaining a false-positive rate below 5%. Exploiting the information contained in the One-to-Many and Many-to-One links is challenging since (i) there are a limited number of features associated with each individual record, and (ii) there are over 7 million such links, thus making the problem computationally challenging.

1.3 Approach

In this thesis, we extend the current record-linkage system [1] to include not only One-to-One links, but Many-to-One and One-to-Many links, with the goal of improving the linkage rate of the current system. The incorporation of additional attributes is explored. These attributes include extra information not presently used in the current system, and correspond to information related to an individual's place of birth, religion, and relationships between members of an individual's family. The main contributions of this thesis are listed below:

• A novel record-linkage system that incorporates both One-to-Many and Many-to-One links

• The first study into the effects of adding different types of features (origin, religion and households) to help redefine the One-to-Many and Many-to-One links

• A novel disambiguation algorithm for exploiting these additional features

• A novel mechanism for incorporating household data into the Jaccard measure [12, 35]

• A comparison of the performance of the current record-linkage system, which is based upon a Support Vector Machine classifier [49], to that of systems based upon K-Nearest Neighbour [14] and Naive Bayes [31] classifiers.

The practical outcome of this thesis is the creation of reliable longitudinal data. The longitudinal data created through the proposed record linkage techniques can be used by researchers to investigate historical trends and to address questions about society, history and economy.

1.4 Organization

The remainder of this thesis is organized as follows: Chapter 2 provides background information on the area of record linkage, the classification models chosen and the performance evaluation measures employed. It also describes the Canadian census data, and the current record-linkage system. Chapter 3 discusses previous work in the literature related to record linkage, with a focus on census related techniques. Chapter 4 presents the proposed One-to-Many and Many-to-One record-linkage system. This chapter also includes an exploration into various similarity measures that can be used on the One-to-Many and Many-to-One links in the disambiguation record-linkage system. Finally, Chapter 5 highlights the achievements and important conclusions of this research, along with ideas for future work.

Chapter 2

Background

In this chapter we discuss several fundamental concepts to provide the necessary background knowledge for the research presented in this thesis. In particular, Section 2.1 discusses the major steps involved in the record-linkage process. Section 2.2 describes the classification techniques that are employed in this research, with Section 2.3 introducing the measures used to evaluate the performance of a record-linkage system. The 1871 and 1881 Canadian censuses and the challenges they present are discussed in Section 2.4. Finally, Section 2.5 describes the record-linkage system currently being employed by People-in-Motion at the University of Guelph.

2.1 Record Linkage

The area of record linkage is concerned with identifying and linking information that refers to the same entities across one or more data sources [51]. With the presence of linkage information researchers can explore a potentially important source of statistical information. Tracking the long-term effect of heart surgery in individuals, or following an individual through historical censuses, is all made possible through record linkage. Ultimately, if there exists a unique identifier for each record, the process of record linkage can be solved with

a standard database join, but in most situations no such identifier is available and linkage techniques have to be employed.

There are 5 steps involved in the record linkage process [12, 25]:

1. Data pre-processing,

2. Blocking,

3. Record pair comparison,

4. Classification, and

5. Evaluation.

Figure 2.1: Overview of Record Linkage System

An overview of the flow of these five steps is shown in Fig. 2.1. Each step is described in detail in subsections 2.1.1 to 2.1.4 using the simple census data sets shown in Table 2.1 as an example.

Census A: year 1
     First name   Last name   Age   Birthplace
R1:  Lewis        Barns       15    15030
R2:  Alexander    Martin      23    45300

Census B: year 10
     First name   Last name   Age   Birthplace
R3:  Luis         Barns       25    15030
R4:  Dr. Alex     McMartyn    36    45300

Table 2.1: Simple census data sets

2.1.1 Data Pre-processing and Blocking

Data pre-processing is an extremely important step in the record linkage process as the data sets used for record linkage can vary widely in format and content. It involves removing noise and irrelevant information, standardizing formats and picking attributes that are well defined across the data sets. A simple example of this is shown in Table 2.2 where titles before the first name, in this case Dr., are stripped out of census B so that it matches the format of census A.

Census A: year 1
     First name   Last name   Age   Birthplace
R1:  Lewis        Barns       15    15030
R2:  Alexander    Martin      23    45300

Census B: year 10
     First name   Last name   Age   Birthplace
R3:  Luis         Barns       25    15030
R4:  Alex         McMartyn    36    45300

Table 2.2: Simple census data sets - Data pre-processing
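The title-stripping step illustrated in Table 2.2 can be sketched as a simple normalization pass. The honorific list below is illustrative only, not the one actually used by PiM:

```python
import re

# Illustrative honorific list; a real system would use a curated one.
TITLE_RE = re.compile(r"^(dr|rev|mr|mrs|miss)\.?\s+", re.IGNORECASE)

def clean_first_name(name: str) -> str:
    """Strip a leading title and surrounding whitespace from a first name."""
    return TITLE_RE.sub("", name.strip())
```

Applied to record R4, `clean_first_name("Dr. Alex")` yields the cleaned value `"Alex"` shown in Table 2.2.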

Once the data sets have been processed it is (usually) infeasible to proceed to compare all possible pairs between the data sets. With multiple comparisons needed for a single

record-pair comparison, the total number of comparisons needed to compare all possible record-pairs becomes unmanageable. For example, a single record-pair comparison in Table 2.2 actually corresponds to 4 different comparisons, one for each attribute. If census A and census B each have 1 million records, this results in a total of 10^12 record-pairs with 4 comparisons each, for 4 × 10^12 total comparisons. If we assume that the comparison step can perform 4 million comparisons per second, computing the similarity between all 10^12 record-pairs would give a run-time estimate of: (10^12 × 4)/(4 × 10^6 comparisons/s)/(86,400 s/day) ≈ 11.57 days. To reduce the number of comparisons performed between data sets, the blocking step is needed.

Blocking helps to filter out record pairs that are highly unlikely to be matches, by using varying techniques to partition the data set into mutually exclusive blocks, and restricting the comparison step to records that are in the same block [19]. In the case of our simple example, one might block on the first letter of the last name, meaning only (R1,R3) and (R2,R4) would be compared between the two data sets.
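Blocking on the first letter of the last name, as in the example above, can be sketched as follows (toy record dictionaries, not the PiM implementation):

```python
from collections import defaultdict

def block_by_last_initial(records):
    """Partition records into mutually exclusive blocks keyed on the
    first letter of the last name."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec["last"][0].upper()].append(rec["id"])
    return blocks

def candidate_pairs(census_a, census_b):
    """Generate record-pairs restricted to records in the same block."""
    blocks_a = block_by_last_initial(census_a)
    blocks_b = block_by_last_initial(census_b)
    for key in blocks_a.keys() & blocks_b.keys():
        for a in blocks_a[key]:
            for b in blocks_b[key]:
                yield (a, b)
```

With the Table 2.2 records, only (R1,R3) and (R2,R4) survive: "Barns"/"Barns" share block B, and "Martin"/"McMartyn" share block M, so the cross-block pairs are never compared.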

2.1.2 Comparison

During this step all possible record-pairs are compared using similarity measures to pro- duce a feature vector. A Feature Vector (FV ) consists of one or more numerical values (feature scores) that indicate how alike the records’ attributes are to each other, usually on a continuous scale from 0 to 1. The similarity measures used are chosen based on the attribute types that need to be compared. An example of FVs for record-pairs (R1,R3) and (R2,R4) are shown at the bottom of Table 2.3. Since there is an exact match between the last name, age and birthplace of record-pair (R1,R3), the feature scores produced are 1.0, all other compared attributes will have a feature score that reflects the similarity measure being used.

Census A: year 1
     First name   Last name   Age   Birthplace
R1:  Lewis        Barns       15    15030
R2:  Alexander    Martin      23    45300

Census B: year 10
     First name   Last name   Age   Birthplace
R3:  Luis         Barns       25    15030
R4:  Alex         McMartyn    36    45300

FV (R1,R3): [0.9, 1.0, 1.0, 1.0]
FV (R2,R4): [0.7, 0.4, 0.5, 1.0]

Table 2.3: Simple census data sets and their corresponding feature vectors resulting from a comparison
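A feature vector of this shape can be sketched with any set of similarity measures. The thesis does not specify which measures produced the scores in Table 2.3, so the normalized edit distance below is only a stand-in (it scores Lewis/Luis at 0.6 rather than 0.9), and the age score is a crude check for the expected ten-year offset between the two censuses:

```python
def levenshtein(s, t):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def name_sim(a, b):
    """Edit distance normalized to a [0, 1] similarity score."""
    m = max(len(a), len(b))
    return 1.0 if m == 0 else 1.0 - levenshtein(a, b) / m

def feature_vector(r1, r2):
    """[first-name similarity, last-name match, age match, birthplace match]."""
    return [round(name_sim(r1["first"], r2["first"]), 2),
            float(r1["last"] == r2["last"]),
            float(r2["age"] - r1["age"] == 10),  # crude 10-year-offset check
            float(r1["birth"] == r2["birth"])]
```
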

2.1.3 Classification

During the classification step, the chosen classification method uses the feature vectors constructed in the comparison step to make decisions on whether the record-pair is a match or a non-match. Looking at Table 2.3, we would expect FV (R1,R3) to be classified as a match due to the high similarity scores between attributes, whereas FV (R2,R4) would be less likely to be classified as a match because its similarity scores are lower.

Classification techniques can be broken down into two major types:

• Supervised, where the classifier is trained on data that has been pre-labeled, and

• Unsupervised, where the classifier has to train on unlabelled data and come up with patterns and relations itself [9, 20].

For supervised classification, the given data set is split into three disjoint sets: a training set, a validation set and a testing set. The training set is used to train the classification model, with the validation set being used to determine whether or not it is necessary to continue training specific classification models. The final performance of the classifier is

evaluated using the testing set. A discussion on specific supervised classification techniques will be given in Sec. 2.2.

2.1.4 Evaluation

The overall quality of the matches made is assessed in the last step of the record-linkage system. This is done by using different measures of interest, depending on the application being tested. It should be noted that each decision made in the previous steps affects the evaluation outcome, and therefore it may be necessary to go back to a previous step and try a different approach to produce better results.

2.2 Classification Methods

A variety of supervised classification algorithms have been used to perform the record-linkage classification step, such as Support Vector Machines [49], K-Nearest Neighbours [14], Bayesian Classifiers [31], Rule-Based Classifiers, Decision Trees and many others. Subsections 2.2.1 to 2.2.3 will discuss three common supervised classification techniques that we chose to investigate in the context of this application.

2.2.1 Support Vector Machines

The main premise of a Support Vector Machine [49] is to find a hyperplane h in the training space that best discriminates between the classes in the training data. This is done by finding a hyperplane that gives the largest margin. In this context, margin refers to the distance between the given hyperplane and the closest training data point, and is measured in relation to both sides of the hyperplane. The largest margin for the hyperplane is desirable because it has been shown to give a better generalization error than small margins. Small margins can lead to model over-fitting and tend to give poor results on new data [46].

The actual support vectors of a SVM consist of the training data points that lie closest to the hyperplane. These support vectors are used to define the decision boundary, and

overall margin distance. An example of the setup for a support vector machine can be seen in Fig. 2.2, along with the chosen hyperplane, margin and support vectors. Once the hyperplane with the largest margin is found, new objects are easily classified based on which side of the hyperplane they fall on.

Figure 2.2: A trained support vector machine classifier, where the circles and triangles represent different classes.

Before running an SVM, kernel functions are applied to the data set. A kernel function is used to map the data set into a higher dimensional space, in the hopes that a hyperplane can be more easily found. Three basic kernels are polynomial, radial basis function and sigmoid [28]. The use of some kernels requires a user to tune various parameters to achieve higher accuracy. Tuning can be done by a grid search through the parameter space or by manual tuning.
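The decision rule described above can be sketched directly: once a hyperplane w·x + b = 0 is fixed, a new point is classified by the sign of the decision function, and the margin being maximized has width 2/||w||. The weights below are hand-picked for illustration, not the output of actual SVM training:

```python
import math

def decision(w, b, x):
    """Decision function w·x + b; its sign gives the side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Assign class +1 or -1 by which side of the hyperplane x falls on."""
    return 1 if decision(w, b, x) >= 0 else -1

def margin_width(w):
    """Width of the margin, 2 / ||w||, which SVM training maximizes."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))
```
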

2.2.2 K-Nearest Neighbours

K-Nearest Neighbour (KNN) [14] works differently from most supervised classification methods, as it only looks at the training data when a new object needs to be classified. KNN classifies a new instance by gathering the K nearest training records to the new instance, based on a given distance function, and assigning the new instance the majority class out of the K training records.

Figure 2.3 shows a visual example of this process. Here the star represents an instance we would like to classify, and the triangles and pentagons represent class 1 and class 2, respectively. Depending on the K value chosen, the predicted class for the new instance will be different. If K = 3, shown by the dotted circle, the star will be classified as a class 2 (pentagon), but if K = 5 the star will be classified as a class 1 (triangle), which is shown by the unbroken circle. It is important for this classifier to have an appropriate value for K, so that the misclassification of new instances is kept to a minimum.

Figure 2.3: An Example of K-Nearest Neighbour Classification
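The K = 3 versus K = 5 flip illustrated in Figure 2.3 can be reproduced with a few synthetic points (coordinates invented for illustration; Euclidean distance stands in for "a given distance function"):

```python
from collections import Counter
import math

def knn_predict(train, query, k):
    """Majority class among the k training points nearest to query,
    using Euclidean distance."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

With two class-2 points at distance 1 and three class-1 points slightly farther out, the three nearest neighbours vote class 2, while the five nearest vote class 1, mirroring the figure.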

2.2.3 Naïve Bayes

A Naïve Bayes (NB) [31] classifier assumes the attributes in a feature vector are independent of one another, given the class label y. When given a set of attributes X = {x_1, ..., x_k}, a NB classifier will calculate the probability of y given X, using Eq. 2.1.

P(y | X) = ( P(y) · ∏_i P(x_i | y) ) / P(X)    (2.1)

The class label with the highest probability, given the same set of attributes, is considered the predicted class for the attribute set in question. NB uses training data to estimate the value of P(x_i | y). If the attributes are categorical, the estimation is calculated by taking the fraction of training instances in class y that have the attribute value x_i. If the attributes are continuous, the estimation is calculated by using the mean and variance of the attributes in the form of the Gaussian equation shown in Eq. 2.2.

f(x, µ, σ) = (1 / (√(2π) · σ)) · e^(−(x − µ)² / (2σ²))    (2.2)
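Equation 2.1 for categorical attributes can be sketched with simple frequency counts; since P(X) is identical for every class, it can be dropped when comparing classes. The toy data below is invented, and a real implementation would add smoothing for unseen attribute values:

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate the prior P(y) and categorical likelihoods P(x_i | y)
    by counting over the training data."""
    n = len(labels)
    prior = {y: c / n for y, c in Counter(labels).items()}
    cond = defaultdict(Counter)  # (class, attribute index) -> value counts
    for x, y in zip(rows, labels):
        for i, v in enumerate(x):
            cond[(y, i)][v] += 1
    class_n = Counter(labels)

    def likelihood(y, i, v):
        return cond[(y, i)][v] / class_n[y]

    return prior, likelihood

def predict(prior, likelihood, x):
    """argmax_y P(y) * prod_i P(x_i | y), i.e. Eq. 2.1 with P(X) dropped."""
    def score(y):
        p = prior[y]
        for i, v in enumerate(x):
            p *= likelihood(y, i, v)
        return p
    return max(prior, key=score)
```
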

2.3 Evaluation

In order to produce a reliable performance evaluation for the record linkage system, cross validation is used. Cross validation, which is an iterative process, is done by partitioning the expert data set into N folds. During each cross validation iteration, the classifier's training set is constructed out of N − 1 folds, with the remaining fold becoming the testing set. The average results produced across the N folds are used to represent the performance of the record linkage system.
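The fold construction described above can be sketched as follows (the round-robin assignment of items to folds is one simple choice among several):

```python
def kfold_indices(n_items, n_folds):
    """Partition item indices into n_folds (nearly) equal folds."""
    folds = [[] for _ in range(n_folds)]
    for i in range(n_items):
        folds[i % n_folds].append(i)
    return folds

def cross_validate(items, n_folds, evaluate):
    """For each fold: train on the other N-1 folds, test on the held-out
    fold, then average the per-fold results."""
    folds = kfold_indices(len(items), n_folds)
    scores = []
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, f in enumerate(folds) if j != k for i in f]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / len(scores)
```
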

Equations 2.3 to 2.5 define the evaluation measures that are used within this thesis. In these equations, True Positive (TP) refers to the total number of record-pairs that have been labelled as a match by both the record linkage system and the testing set, False Positive (FP) refers to the record-pairs that have been labelled as a match by the record

linkage system, but have been labelled as a non-match by the testing set. Finally, False Negative (FN) is the total number of record-pairs that have been labelled as a match by the testing set, but are not seen as matches by the record linkage system.

TPR = (TP / (TP + FN)) × 100    (2.3)

FPR = (FP / (TP + FP)) × 100    (2.4)

LR = (# of 1:1 links / # of records in the 1871 census) × 100    (2.5)

The True Positive Rate (TPR), which is present in Eq. 2.3, showcases the record-linkage system's ability to link true matches from the testing set. In other words, it shows the percentage of true matches found by the record linkage system, out of all the true matches in the testing set. The False Positive Rate (FPR), Eq. 2.4, corresponds to the percentage of links that are falsely labeled as a match by the record linkage system. If a record pair (a, b) is labelled as a match in the testing set, but the record linkage system has labelled (a, c) as a match instead, we can conclude that link (a, c) is a false positive. Finally, the Linkage Rate (LR), presented in Eq. 2.5, shows the percentage of records matched by the record linkage system, out of all the records that need to be matched.
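The three measures can be written directly from Eqs. 2.3 to 2.5; plugging the Table 1.1 counts into the linkage rate recovers the 17.17% figure quoted in Chapter 1:

```python
def tpr(tp, fn):
    """True Positive Rate, Eq. 2.3 (as a percentage)."""
    return tp / (tp + fn) * 100

def fpr(tp, fp):
    """False Positive Rate, Eq. 2.4: percentage of produced links
    that are false."""
    return fp / (tp + fp) * 100

def linkage_rate(n_one_to_one, n_records_1871):
    """Linkage Rate, Eq. 2.5: percentage of 1871 records in 1:1 links."""
    return n_one_to_one / n_records_1871 * 100
```

For example, 595,218 One-to-One links over the 3,466,427 records in the 1871 census gives a linkage rate of about 17.17%.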

In this thesis, success is defined in terms of FPR, TPR and LR. From the social scientists' point of view, a FPR of more than 5% is not acceptable. Therefore, our goal is to maximize both TPR and LR, while minimizing FPR and ensuring that it stays below 5%.

2.4 Canadian Census

The data used in this thesis consists of two Canadian census data sets, specifically the Canadian censuses from 1871 and 1881. The 1871 census [30], which includes 3,466,427

records, was digitized, cleaned and compiled by the Church of Latter-Day Saints (LDS) and the 1881 census [34], which includes 4,277,807 records, was digitized, cleaned and compiled by LDS, the University of Ottawa Institute for Canadian Studies and Le Programme de recherche en démographie historique at Université de Montréal. A unique identifier was created for each record during digitization to help keep track of each record.

Extra cleaning and standardization across the 1871 and 1881 censuses was performed by People in Motion [15], at the University of Guelph. All non-alphanumeric characters and titles (e.g., Dr., Rev.) were removed from strings representing names, and the English and French enumerated information (e.g., jours, marié) was standardized across the two censuses. Duplicate records appearing in 1871 were removed, as well as records of people that died in 1871.

1871 Census
ID          Last name   First name   Gender   Age   Birthplace   Marriage status
804755859   Bagg        Addia        0        12    15030        6
804756984   Pritchard   Thomas       1        14    15030        6
804476817   Bambridge   M            1        36    41000        1

1881 Census
ID           Last name   First name   Gender   Age   Birthplace   Marriage status
710507210    Bagg        Adelia       0        23    15030        6
710503109    Pretchard   Thomas       1        24    15030        6
1170303704   Bambrige    Martin       1        45    41000        1

Table 2.4: Known record-pairs with imprecise recording

Performing record linkage on the 1871 and 1881 censuses poses some challenges due to imprecise recording and the extensive duplication of attributes between records. Table 2.4 showcases some examples of imprecise recording occurring in known record-pair matches between the censuses. In this case, the record-pairs (804755859, 710507210), (804756984, 710503109) and (804476817, 1170303704) refer to the same people in the 1871 and 1881 censuses. As emphasized in bold, names have changed (Pritchard → Pretchard), ages vary by more than ten years (12 → 23), and in some cases attributes aren't complete (M → Martin). This emphasizes that record-pairs with dissimilar attributes can be the same person, due to imprecise recording, making the task of judging what constitutes a matching record-pair difficult.

1871 Census
ID          Last name   First name   Gender   Age   Birthplace   Marriage status
805971460   Barns       Mary         0        11    15030        6
804507290   Barns       Mary         0        9     15030        6
805857311   Barns       Mary         0        8     15030        6
805857328   Barns       Mary         0        12    15030        6
805387235   Barns       Mary         0        10    15030        6
803518187   Barns       Mary         0        10    15030        6

1881 Census
ID           Last name   First name   Gender   Age   Birthplace   Marriage status
1121501415   Barns       Mary         0        20    15030        6
1092100809   Barns       Mary         0        22    15030        6

Table 2.5: Census records with similar attributes

An example of the extensive duplication of attributes between records is shown in Table 2.5, where all of the “Mary Barns” records with the same birthplace are displayed. All of the attributes between the 1871 and 1881 records are extremely similar, or even match, making the task of finding the correct matching record-pair between these records impossible with the information at hand.

Along with the 1871 and 1881 Canadian censuses, we have a list of 11,716 matching record-pairs that have been identified, by human experts, as being the same people between the 1871 and 1881 censuses. The 11,716 matching record-pairs, also known as expert links, are broken down into four subsets:

1. ON Prop: 8429 family members of 1871 Ontario industrial proprietors

2. Logan: 1760 residents of Logan Township, Ontario

3. St James: 232 family members of communicants of St. James Presbyterian Church in Toronto, Ontario

4. Les Boys: 1403 family members of 300 City boys who were ten years old in 1871

The above sets will be used for training and testing data in the classification and evaluation steps of the record-linkage system described in the next section.

2.5 Current Record Linkage (Base) System

The purpose of the record linkage system currently in use by People-in-Motion, at the University of Guelph [1], is to link records from the 1871 Canadian census to the 1881 Canadian census. Multiple attributes were collected about each individual, but for this record linkage system the number of attributes that define each record is restricted to six: last name, first name, age, gender, birthplace and marriage status. The record linkage system's goal is to find all the record-pairs between the two censuses that refer to the same entity; these record-pairs are also known as matches.

There are three main steps in the current record linkage process. Step one consists of sectioning each census into smaller blocks, to cut down on the number of record-pairs produced between the two censuses. Step two consists of comparing the records in each record-pair and creating a feature vector that contains information about how similar the records in the record-pair are to each other. In step three, the constructed record-pair feature vectors are labeled as matches or non-matches via a classifier that has been learned from a training set constructed from the 1871 and 1881 Canadian census data sets (described in Sec. 2.4).

An overview of the current record-linkage system, which we’ll denote as Base for the remainder of this thesis, is shown in Fig. 2.4. A brief description of the three main steps of the system can be found in subsections 2.5.1, 2.5.2 and 2.5.3. A more detailed description of the Base system can be found in [1].

Figure 2.4: Overview of Base Record Linkage System

2.5.1 Blocking

As mentioned in subsection 2.1.1, it is infeasible to compare all possible record-pairs between two large data sets. In the case of the 1871 and 1881 Canadian census data, doing so would result in 3,466,427 × 4,277,807 ≈ 14.8 × 10^12 record-pairs. Therefore, blocking techniques are applied on three different attributes [1] to help reduce the number of record-pairs being sent to the comparison step. The attributes in question are a name-code based on the first name, the first letter of the last name, and the birthplace. This means that a record-pair is only sent to the comparison step if the two records reside in the same name-code and last name block of their respective censuses, and their birthplaces match. Blocking reduces the number of record-pairs to 90,178,727.
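The blocking scheme just described can be sketched as follows. This is an illustrative sketch rather than the Base system's implementation: the name-code is approximated here by the first two letters of the first name, and the record layout (a dict with hypothetical keys `first`, `last` and `birthplace`) is assumed.

```python
from collections import defaultdict

def blocking_key(record):
    """Build a blocking key for a record (a dict of attributes).

    Illustrative stand-ins for the Base system's key: a first-name
    code (approximated here by the first two letters), the first
    letter of the last name, and the birthplace code.
    """
    return (record["first"][:2].upper(),
            record["last"][:1].upper(),
            record["birthplace"])

def block(census):
    """Partition a census (a list of records) into blocks by key."""
    blocks = defaultdict(list)
    for rec in census:
        blocks[blocking_key(rec)].append(rec)
    return blocks

def candidate_pairs(census_1871, census_1881):
    """Yield only the record-pairs whose blocking keys agree."""
    blocks_1881 = block(census_1881)
    for rec in census_1871:
        for other in blocks_1881.get(blocking_key(rec), []):
            yield rec, other
```

Only records sharing all three key components are ever paired, which is what cuts the 14.8 × 10^12 possible pairs down to the 90,178,727 that reach the comparison step.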

2.5.2 Comparison

During the comparison step, a feature vector is constructed for each record-pair (a, b) by measuring how similar the records' attributes are to each other, using various similarity measures. As mentioned, a record in this system is defined by six attributes:

• last name and first name (string),

• age (integer),

• gender (binary),

• birthplace and marriage status (categorical).

From the six attributes mentioned above, a feature vector with twelve different feature scores is produced for each record-pair, resulting in a total of 90,178,727 feature vectors. Details of each feature score and a summary of the similarity measures used to calculate it are given in Table 2.6. More details on how each feature is calculated can be found in [1].
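As an illustration of the comparison step, the following sketch builds a simplified comparison vector for a record-pair. The Base system's actual twelve feature scores are defined in [1]; the features, field names and age normalisation below are assumptions made for illustration only.

```python
def feature_vector(rec_a, rec_b):
    """Build a simplified comparison vector for a record-pair.

    This sketch computes six illustrative scores: exact-match
    indicators for the string, binary and categorical attributes,
    plus an age similarity that decays linearly with the age
    difference (capped at ten years).
    """
    age_sim = 1.0 - min(abs(rec_a["age"] - rec_b["age"]), 10) / 10.0
    return [
        1.0 if rec_a["last"] == rec_b["last"] else 0.0,
        1.0 if rec_a["first"] == rec_b["first"] else 0.0,
        1.0 if rec_a["gender"] == rec_b["gender"] else 0.0,
        age_sim,
        1.0 if rec_a["birthplace"] == rec_b["birthplace"] else 0.0,
        1.0 if rec_a["marriage"] == rec_b["marriage"] else 0.0,
    ]
```

In the real system, approximate string measures replace the exact-match indicators for names, which is what allows variants like Pritchard/Predtchard to score highly.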

Table 2.6: Details on the feature scores in a feature vector

2.5.3 Classification

During the classification step, each feature vector is labelled as a match or non-match. The classification method used in the classification step is a Support Vector Machine (SVM) [49]. The SVM is trained on a labelled set of 81,281 record-pair feature vectors constructed from the set of known matching record-pairs (Sec. 2.4), consisting of 8,543 matching and 72,738 non-matching record-pair feature vectors. Details of the training set and SVM construction can be found in [1].

The record-pair feature vectors produced in the comparison step are given to the learned SVM classifier, which labels them as positive or negative links. If the label for a feature vector is negative, the record-pair is treated as a non-match. If the label is positive, the record-pair is treated as a match if and only if it is the only positive record-pair link produced by the classifier for each record in the record-pair. For example, the positive record-pair (a, b) is only treated as a match if there is no other record c or d for which (c, b) or (a, d) is labelled positive. The group of record-pair links that fit this description is denoted by 1:1, and is used to evaluate the system. All other positive record-pair links are discarded, since their output is currently deemed ambiguous in this system.

The discarded, ambiguous, positive record-pair links can be broken down into two types of link groups. The first, called a One-to-Many links group, is where a single record in 1871 is linked to a group of records in 1881. We denote such a group as 1M. The second group, called a Many-to-One links group, is when a group of records in 1871 are linked to a single record in 1881. We denote such a group as M1. The full set of ambiguous positive record-pair links is denoted by 1M ∪ M1 and will be the focus of this thesis.
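The partition of positive links into 1:1 and 1M ∪ M1 sets can be expressed compactly. A minimal sketch, assuming positive links are given as (1871 ID, 1881 ID) pairs:

```python
from collections import Counter

def split_links(positive_links):
    """Partition classifier-positive links into 1:1 and 1M ∪ M1 sets.

    A positive link (a, b) is 1:1 iff record a and record b each
    appear in exactly one positive link; every other positive link
    belongs to some 1M or M1 group.
    """
    deg_1871 = Counter(a for a, b in positive_links)
    deg_1881 = Counter(b for a, b in positive_links)
    one_to_one, ambiguous = [], []
    for a, b in positive_links:
        if deg_1871[a] == 1 and deg_1881[b] == 1:
            one_to_one.append((a, b))
        else:
            ambiguous.append((a, b))
    return one_to_one, ambiguous
```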

2.5.4 Evaluation

To evaluate the performance of the Base system, 5-fold cross validation is used. This is done by taking the current training set and dividing it into 5 equal sized subsets. Each subset then becomes a testing set for an SVM trained using the remaining four subsets (see Sec. 2.3 for more details on cross-validation). This results in five different sets of 1:1 links to evaluate, one set from each trained SVM. The results, denoted as RBase, are shown in

Table 2.8, along with the standard deviation, σBase, over the five different systems. Table 2.7 showcases the number of record-pair links present in each grouping after the Base system has been employed.

As can be seen in Table 2.8, the Base system achieves a false positive rate (FPR) below 5% (4.98%), with a linkage rate (LR) and true positive rate (TPR) of 17.17% and 40.06%, respectively. As shown in Table 2.7, the Base system discards 7,481,329 ambiguous positive record-pair links (1M ∪ M1), which makes up roughly 92% of the positive links returned by the classifier. In Chapter 4, we consider ways to use the 1M ∪ M1 links to improve LR, TPR and FPR.
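The reported measures can be computed under assumed definitions that appear numerically consistent with Tables 2.7 and 2.8; the denominators below are inferences for illustration, not definitions taken from [1].

```python
def evaluate(tp, fp, n_expert_links, n_links_made, n_records):
    """Compute TPR, FPR and LR under assumed definitions.

    Assumptions: TPR = TP divided by the number of expert links in
    the test fold, FPR = FP / (TP + FP), and LR = links made divided
    by the number of linkable records.
    """
    tpr = tp / n_expert_links
    fpr = fp / (tp + fp)
    lr = n_links_made / n_records
    return tpr, fpr, lr
```

For instance, with TP = 684.80 and FP = 36.00, FPR = 36.00 / 720.80 ≈ 4.99%, matching Table 2.8 up to rounding.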

                 Positive record-pairs   Percentage
1:1 links        595,218                 7.37%
1M ∪ M1 links    7,481,329               92.63%

Table 2.7: Positive links generated by the Base system

        TP       FP      TPR      FPR     LR
RBase   684.80   36.00   40.06%   4.98%   17.17%
σBase   38.44    9.62    1.47%    1.25%   0.05%

Table 2.8: Base Results - 5 Fold Cross Validation

Chapter 3

Literature Review

This literature review will be organized as follows: Sections 3.1 to 3.3 will cover current techniques relating to the blocking, comparison and classification steps in the record linkage process (Sec. 2.1), and how they are related to historical record linkage. Then, Section 3.4 will describe the more prominent record linkage toolkits and Section 3.5 will cover some specific record linkage areas, with a focus on current historical record linkage research. Finally, Section 3.6 will summarize the review.

3.1 Blocking

Since real-world data sets can become quite large, it is infeasible to compare all the possible record pairs between data sets. To reduce the number of comparisons required, a method called blocking is usually employed. Blocking is the act of partitioning a data set into mutually exclusive blocks using a blocking key and restricting comparisons to records residing in the same block [12]. Over the last few years, two different categories of blocking research have arisen: the first involves the creation of new techniques and the improvement of existing ones, while the second involves techniques that can learn the optimal blocking key for a dataset. This subsection will cover both categories, as well as provide a discussion of which blocking technique is best for historical record linkage.

Currently, there are many different ways in which blocking can be performed [2, 3, 32, 43, 48]. However, a downside to blocking is that it can lead to misplaced records, and to the loss of potential matches when a record resides in a different block than its match. In 2008, Goiser et al. [25] called for higher restrictions on the use of blocking, as it can lead to biased data and skewed results. Schraagen [43] recently developed a blocking technique that addresses this problem and allows complete coverage of the dataset space. By constructing a bit vector for each record and assigning multiple characters to certain vector positions, a binary tree index can be made where the leaf nodes point to actual records. This encoding of records provides complete coverage of the dataset when searching for a potential match in the comparison stage. However, even though this algorithm is shown to perform indexing and matching faster, and to consume less memory, than other blocking techniques, the overall quality of the matches made is not explored.

Blocking has two main goals [33]: first, the groups produced by blocking should be relatively small, to reduce the number of comparisons needed; second, the act of blocking should not leave out possible matches. Trying to satisfy both goals creates a trade-off, since smaller blocks have a higher risk of missing potential matches, while covering all potential matches usually requires larger blocks. A good blocking key should try to balance these goals, and should consist of attributes that have a low probability of error and are, for the most part, uniformly distributed over the data set [27]. The problem of optimally selecting a blocking key was recently addressed in [3, 33], where the authors proposed two different supervised machine learning approaches based on predicate-based formulations [3] and the sequential-covering algorithm [33].

There are still major questions related to blocking that need to be answered: for instance, how does the performance of various blocking techniques differ when used with different types of data? Which techniques show better scalability for increasing database size? How are the number and quality of record pairs influenced by the choice of blocking key? To address some of these questions, the authors of [11] did a comparison of six different blocking techniques, for a total of twelve different blocking variations. Each blocking technique was tested with six different datasets, one being U.S. census data, and then evaluated based on five different measures: overall runtime, the fraction of record pairs removed, how effective the technique was in not removing true matched pairs, the quality of pairs made, and the F-measure (Eq. A.3). The simpler approaches, like traditional blocking (blocking by mutually exclusive blocks) and the sorted neighbour approach, were the fastest. The authors noticed that effectiveness, pair quality, and F-measure differed more prominently between datasets than between actual techniques, highlighting the need for careful definition of the blocking key to reflect the data being used. Overall, traditional blocking and threshold-based canopy clustering performed the best on the census dataset and, therefore, are a good choice of blocking technique.

3.2 Comparison

Often the attributes of a record are stored as strings [4], which can lead to problems when performing record comparisons. One of the main problems with strings is that the same information can be represented in many different typographical variations. “Christine” vs. “Christina” and “Street” vs. “St” are examples of such variations. An approximate measure of how similar two names are is desired in this stage, since exact string comparison will not lead to good results due to the many errors and variations in reporting.

There are three basic areas of comparison [8]: pattern-based matching, which consists of character- and token-based metrics, phonetic-based matching, and numeric matching. Most of the comparison methods currently employed in research revolve around pattern-based similarity metrics [20], as they have been designed to handle string variation. Character-based similarity metrics have been shown to perform well for typographical errors, and include metrics like edit distance [28] and Jaro distance [8]. Token-based similarity metrics are better suited for data sets where the strings have been rearranged, or truncated/shortened. Phonetic-based similarity metrics take into account that even though strings may be dissimilar on a character level, they may produce the same phonetics, for example the last names Mayr and Meyer. Overall, a phonetic-based measure takes a given name and produces a set code according to how the name is pronounced. Common phonetic techniques are Soundex [4], Metaphone [37] and Double Metaphone [38]. Research into numeric-based similarity metrics is still in its infancy, with exact matching or range queries being used [20]. Overall, the large number of similarity measures currently present in research reflects the large number of errors that can appear when dealing with string matching.
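As a concrete example of a character-based metric, below is a minimal implementation of Levenshtein edit distance [28], together with a normalisation into a [0, 1] similarity score. The normalisation by maximum string length is one common convention, not necessarily the one used in any of the cited systems.

```python
def edit_distance(s, t):
    """Levenshtein edit distance [28]: the minimum number of
    single-character insertions, deletions and substitutions
    needed to turn s into t (dynamic programming, O(|s||t|))."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

def string_similarity(s, t):
    """Normalise edit distance into a [0, 1] similarity score."""
    if not s and not t:
        return 1.0
    return 1.0 - edit_distance(s, t) / max(len(s), len(t))
```

For example, "Christine" and "Christina" differ by a single substitution, while "Street" and "St" are four deletions apart.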

There have been some different techniques proposed recently that differ from the previous areas. The first is by Gollapalli et al. [26], who present a new way to encode strings for comparison. Using scale-based hashing, the string is transformed into corresponding numerical values (hash codes). The overall similarity score is produced by comparing only the first n hash digits and calculating the relative errors between the n hash-gram subsets. The second published work [18] focuses on the idea that most measures do not take into account the actual semantics of the words being compared. To deal with this lack of semantics in the comparison stage, the authors proposed a similarity measure based on a semantic threshold and a combination of Latent Semantic Analysis (LSA) and the Jaccard string similarity measure. When used in conjunction with sorted neighbourhood blocking and an SVM, the combined hybrid approach outperformed LSA and the Jaccard similarity measure used on their own.

Picking the right similarity measures to use is extremely important when dealing with census data, as errors can be introduced into the final data set; for example, foreign names can be recorded into the census incorrectly by the original census transcriber. Digitizing the primary census data also introduces errors, as the handwriting might be hard to decipher. Recently, a paper by Christen [8] detailed a comprehensive look into the performance of the main similarity measures on certain types of name data.

Christen [8] compared six different phonetic measures, nineteen pattern-based measures, and two combined systems, with the overall goal of determining which matching technique produced the best quality matches for different name types. Testing on three different datasets taken from the midwives database, experimental results showed that there is currently no single technique that outperforms everything else when it comes to dealing with real-world data. This confirms the statement that researchers need to take into account the overall characteristics of the data they are using. The author gives a list of recommendations on what to look at when deciding on a matching technique. Another approach that looked into comparing similarity measures [4] came to a similar conclusion, stating that even if a similarity measure has been trained and tuned on many problems, it can still perform poorly on newly introduced or different data.

3.3 Record Linkage Classification

As previously described, record linkage classification systems are used on record-pairs to classify whether each pair is a match or a non-match. Classification methods can be broadly classified into deterministic, probabilistic and modern approaches [12].

Deterministic methods consist of techniques that can only be used if a unique identifier is available in all the data sets. This includes using a combination of attributes to create a linkage key in place of a unique identifier [13]. Deterministic methods are limited to smaller data sets, since it is unrealistic to compile a linkage key on large sets of data.

One of the classic probabilistic methods, explored by Fellegi and Sunter [21], gives an optimal decision procedure for record linkage. In their method, a composite weight is computed by summing the weight of each attribute in the similarity vector. A record-pair is then classified as a match if the composite weight is above a given threshold value. If the composite weight is below another given threshold value, it is deemed a non-match, and any composite weight in between the two thresholds is labelled as a possible match. This method has influenced most of the classification work being done today [50]. A comparison of the performance of probabilistic versus deterministic record linkage was detailed in [47].
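The Fellegi–Sunter decision rule can be sketched as follows. The weights and thresholds here are placeholders: in the original method they are derived from conditional (m- and u-) probabilities, not chosen by hand.

```python
def fellegi_sunter(comparison_vector, weights, upper, lower):
    """Sketch of the Fellegi-Sunter decision rule [21].

    The composite weight is the weighted sum over the comparison
    vector; two thresholds split the outcomes into match, possible
    match and non-match.
    """
    composite = sum(w * f for w, f in zip(weights, comparison_vector))
    if composite >= upper:
        return "match"
    if composite <= lower:
        return "non-match"
    return "possible match"
```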

Today's modern methods have evolved over the last decade, pulling ideas from Artificial Intelligence, Information Retrieval, Machine Learning, and Data Mining [51]. Modern approaches can be broken down into two major types: Supervised, where the classifier is trained on data that has been pre-labelled, and Unsupervised, where the classifier has to train on unlabelled data and come up with patterns and relations itself [9, 20].

Most of the classification methods employed in record linkage today are based on supervised learning, and require training data, something that is hard to come by and costly to prepare [25], especially when dealing with census data. This problem has led researchers to explore unsupervised methods, with hopes that their performance is on par with supervised approaches. One recently proposed method, by Su et al. [45], uses two different classifiers that work in an iterative manner to label record pairs. Another recent paper [9] proposed an unsupervised two-step classification method, consisting of selecting a seeded training set and training a support vector machine with it. Both of these approaches overcome the lack of labelled training data, and show comparable results to supervised methods, making them applicable to situations where no pre-labelled training data is available.

Another way to look at matching pairs in data sets is to define it as a group linkage problem, where groupings of records are used to determine if two entities are approximately the same. On et al. [36] proposed a group linkage measure based on bipartite-graph matching. This could be applicable to census data, as there are many ways in which to group individual records, for example into households.

The overall question of which classification method should be used for a given record linkage task is still unanswered, mainly because the performance of each method is highly dependent on the data being used; it is therefore hard to say that one method is better overall. An exploration of the performance of classification methods when applied to census data is given in Appendix A.

3.4 Current Tools

Current research has gone into producing all-encompassing record linkage software to be used during the blocking, comparison and/or classification steps. Free software like Febrl (Freely Extensible Biomedical Record Linkage) [10] has been used as a base for a lot of the record linkage experiments performed in the last few years, especially census record linkage. Developed in 2002, and built around Python, Febrl supports data exploration, cleaning, and standardization, and gives the user the ability to use multiple blocking techniques, similarity functions and supervised/unsupervised classifiers.

Recently, Elfeky et al. [19] have developed a Record-Linkage Toolbox named TAILOR that utilizes common classification techniques and can be tailored to fit any record linkage problem. Three different methods can be implemented for record-pair classification using TAILOR:

• Induction Model: based on supervised decision tree induction,

• Clustering Model: based on unsupervised k-means clustering,

• Hybrid Model: based on a combination of the first two models, to overcome problems with a lack of training data.

3.5 Current Record Linkage Systems for Census Data Sets

The act of record linkage has been applied to various problems over the last few years and many aspects of the overall process have been researched, from trying to reduce communication overhead for online record linkage [16], to setting up record linkage frameworks that will recognize multi-language records [39]. A paper by Yakout et al. [52] recently discussed the possibility of using user behaviour as extra information when performing record linkage between different transaction logs, and another recent paper [7] detailed a way to use record linkage for geocoding postal addresses. Since our specific area of interest is historical record linkage, the rest of this section will highlight some of the current record linkage frameworks and problems associated with census data.

The Minnesota Population Centre has linked U.S. census data from 1870 to 1880 [24]. Their overall record linkage framework consisted of breaking down each dataset into blocks of males, females, and married couples, with similarity scores only being computed if the records in a pair were born within seven years of each other. A hybrid classification system, consisting of a loose and a tight classifier (SVM), was developed, and true links were defined as records that had only one link in both models.

The People-in-Motion [15] group at the University of Guelph has also implemented a historical record linkage system [1], with the goal of creating longitudinal data for tracking people in 19th-century Canada. Their system (Sec. 2.5), based upon an SVM, produced very few false positives while linking over 600,000 records between the 1871 and 1881 Canadian censuses. The definition of a link in this system is when one record in 1871 is linked to only one record in 1881. This definition of a link cuts out record-pairs that gave multiple possible links, as the authors saw this as ambiguous information, even if a true link resided somewhere in those multiple links. However, the authors of [22] came up with a framework to help reduce the number of potential multiple links between records with the use of extra household information.

Using a two-step record linkage approach, Fu et al. [22] were able to utilize the extra household information that came with the census records from Rawtenstall in North-East Lancashire in the United Kingdom to reduce multiple possible links to a true link. The first step in this process was pair-wise linking. Blocking was done on the double metaphone encoding of each surname, and if the final similarity score for a pairing was above a set threshold the pair was seen as a match. From there, the issue of multiple links could be solved by the second step, named group linking, which takes into account household and one-to-one relationships between records by using a variation of the Jaccard measure. When a record in one dataset is linked to multiple records in the other dataset, the true link is found by looking into the household information of each record. The record with the highest household similarity to the single record's household becomes the true link, and all other links are labelled as false. Note that this does not get rid of all multiple links, as some households may tie for the highest similarity score, but it does help to reduce the number of multiple links overall. Experiments showed that this approach was effective in reducing the overall number of multiple links made, but because of the lack of testing data the authors did not evaluate the quality of the one-to-one links made through the group linking step, or look into the effect of thresholds. The same authors recently published another paper [23], which looked at the group linkage problem as a multiple instance learning problem.
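The group-linking idea can be illustrated with a small sketch. Representing a household as a set of member-name strings and using the plain Jaccard measure are simplifications made here for illustration; Fu et al. [22] use a variation of the Jaccard measure and richer household information.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def best_household_links(source_household, candidates):
    """Among candidate records, keep the one(s) whose household is
    most similar to the source record's household.

    `candidates` is a list of (record, household) pairs, where a
    household is a set of member-name strings. Ties for the highest
    score are kept, mirroring the observation that group linking
    does not remove all multiple links.
    """
    scored = [(jaccard(source_household, h), rec) for rec, h in candidates]
    best = max(score for score, _ in scored)
    return [rec for score, rec in scored if score == best]
```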

Since historical record linkage systems seem to have a problem with ambiguous links, incorporating extra information about the record-pairs, for instance household information, is worth exploring to help deal with those ambiguous links.

3.6 Summary

Previous work has shown that at each step of the record linkage process there is no distinct technique/method that outperforms all the rest, and the decision on which technique/method to use should reflect heavily on the data being linked. Based on the review, a good blocking technique for historical record linkage is traditional blocking, where records are split up into mutually exclusive blocks. The best comparison techniques vary depending on the type of attribute to be compared. Since census data tends to have lots of typographical errors, edit distance and Jaro distance would be a good beginning choice for comparing strings in the census.

Most historical record linkage systems employ an SVM classifier, yet no exploration into which classification technique performs best for historical data has actually been done. Therefore, a comparison between an SVM classifier and other classification techniques should be done for census data, to make sure the SVM is the right classification system to use.

Finally, one major problem with current historical record linkage systems is that they have no way of utilizing the ambiguous links that they produce. Of the historical record linkage frameworks presented, only Fu et al. [22] addressed the issue of trying to disambiguate the ambiguous links produced, and their exploration was limited. Therefore, a more detailed exploration of various disambiguation techniques is needed, and is the core issue addressed in this thesis.

Chapter 4

Disambiguation Record Linkage System

4.1 Exploration of Classification Techniques

The Base system employs a Support Vector Machine (SVM) classifier that produces the final results shown in Table 4.1, where RBase denotes the average performance and σBase denotes the standard deviation. As described in Sec. 2.2.1, an SVM classifier works by finding a hyperplane that splits the data space into the two classes. Any new instance is then labelled according to what side of the hyperplane it falls on.

        TP       FP      TPR      FPR     LR
RBase   684.80   36.00   40.06%   4.98%   17.17%
σBase   38.44    9.62    1.47%    1.25%   0.05%

Table 4.1: Averaged Baseline Results - 5 Fold Cross Validation

Since we want to start with as many 1:1 links as possible before proceeding with the disambiguation of the 1M ∪ M1 links, we first explore whether the linkage rate of the Base system can be improved upon (i.e., increased) by employing classifiers that are based upon different classification techniques. In particular, we consider K-nearest neighbour (KNN) [14] and Naive Bayes (NB) [31]. The KNN classifier is an instance-based method that labels a new instance by identifying the K nearest neighbours to that instance and assigning it the majority class among them. NB, on the other hand, works based on the probability that a certain class has a given attribute. By assuming that all attributes are independent of one another, a new instance is labelled based on the probability of its attributes belonging to a certain class.
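The KNN labelling rule can be sketched in a few lines. The actual experiments used tuned classifiers in R [40]; Euclidean distance and k = 3 here are assumptions made for illustration.

```python
from collections import Counter

def knn_label(instance, training, k=3):
    """Label a new instance by majority vote among its K nearest
    neighbours, as a KNN classifier [14] does.

    `training` is a list of (feature_vector, label) pairs.
    """
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
    nearest = sorted(training, key=lambda pair: dist(instance, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```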

To perform the comparison of the three different classification systems – SVM, KNN and NB – each system was created and tuned using R [40]. Since there are 90,178,727 feature vectors from the comparison step of the Base system, serial farming on SHARCNET [44] was employed to reduce the total runtime required to perform the classification step. Without SHARCNET, the overall CPU time of running all three classifiers on this data set would have been roughly three years. For brevity, the details of the setup and the results can be found in Appendix A. Overall, the KNN and NB classifiers produced false positive rates above the acceptable threshold of 5% (10.99% and 14.71%, respectively) and linkage rates below that of the Base system (14.54% and 8.11%, respectively). The use of KNN and NB was thus found not to improve upon the Base system's results, and in the remainder of this thesis we therefore employ the SVM to perform classification.

As shown in Table 2.7, the Base system, using the SVM, discards 7,481,329 ambiguous positive record-pair links (1M ∪ M1), which make up roughly 92% of the positive links returned by the classifier. By discarding the 1M ∪ M1 links, the Base system is potentially throwing away positive links that are actually matches, but have been put into the 1M ∪ M1 link set due to their high similarity with other positive links. With so many positive links being discarded, there is room for improvement in the Base system by including a technique for disambiguating the 1M ∪ M1 links.

Therefore, we propose a new record linkage system with an additional step to disambiguate the 1M ∪ M1 links. A successful disambiguation system will improve upon the performance of the Base system by increasing the linkage rate and true positive rate, while maintaining or improving upon the current false positive rate.

4.2 Proposed Record Linkage System

As discussed, roughly 92% of the positive links made by the classifier are removed before evaluation, due to the ambiguity present between groups of positive links. Therefore, we propose a new record linkage system that consists of an additional disambiguation step, where the 1M ∪ M1 links are further processed, instead of being discarded, with the goal of disambiguating them into high quality links (which will help the overall performance of the system). An overview of the new record linkage system is shown in Fig. 4.1.

Figure 4.1: Overview of Proposed Disambiguation Record Linkage System

The proposed disambiguation step consists of two steps:

1. Calculating a score for each 1M ∪ M1 link based upon a similarity measure, and

2. Disambiguating the 1M and M1 link groups based upon the score of each link.

There are many similarity measures used in record linkage [8], but not many of them specifically address the issue of 1M and M1 groupings, where there are groups of links that are all highly similar in their base attributes (e.g., name, age, sex, etc.). One way to help disambiguate these specific links is to introduce more information about each link. The Base system currently defines a record using only six census attributes. However, there are other attributes that were not included in the overall Base system. In this thesis we will be dealing with three of them, namely the origin, religion and household attributes.

The disambiguation algorithm used for the second step is described in Section 4.3, with Sections 4.6 and 4.7 proposing various similarity measures that can be constructed using the origin, religion and household attributes, and comparing the performance of the resulting new systems to that of the Base system.

4.3 Multiple-Links Group-Disambiguation Algorithm

Assuming that the similarity score given to each 1M ∪ M1 link is a good representation of a match, with a high similarity score corresponding to a high probability that the link is a match, we propose a heuristic for disambiguating the 1M ∪ M1 links.

A visual example of each step in the heuristic is shown in Fig. 4.2, with the links A-E, A-F, A-G and C-G, C-H being 1M link groups and the links A-G, B-G, C-G and C-H, D-H being M1 link groups (Fig. 4.2 - (i)). The proposed method, which is summarized in Algorithm 1, starts off by disambiguating the 1M link groups (Fig. 4.2 - (ii)). This is done by examining the group of links associated with each 1871 record. The link, or links in case of ties, with the maximum similarity score within the 1871 record group are kept, and the rest of the links are removed (lines 3-10). The M1 link groups are now disambiguated from the remaining links (Fig. 4.2 - (iii)), using the same process except this time examining the group of links associated with each 1881 record (lines 11-18). This leaves a set of links that have the highest similarity scores out of their respective 1M and M1 groups. After the 1M and M1 groups have been disambiguated, the remaining links are examined to see if they are 1:1 links between an 1871 record and an 1881 record; if they are, they are kept for evaluation (lines 19-23, Fig. 4.2 - (iv)).

Algorithm 1 Disambiguation Algorithm

 1: Inputs: S = 1871 Census, E = 1881 Census, L = 1M ∪ M1 link set with scores
 2: Outputs: O = set of one-to-one links
 3: for all vertex ∈ S do
 4:     maxWeight ← 0
 5:     for all edge of vertex do
 6:         if weight(edge) > maxWeight then
 7:             maxWeight ← weight(edge)
 8:     for all edge of vertex do
 9:         if weight(edge) < maxWeight then
10:             L = L \ edge
11: for all vertex ∈ E do
12:     maxWeight ← 0
13:     for all edge of vertex do
14:         if weight(edge) > maxWeight then
15:             maxWeight ← weight(edge)
16:     for all edge of vertex do
17:         if weight(edge) < maxWeight then
18:             L = L \ edge
19: for all edge ∈ L do
20:     vertex71 = vertex in 1871 of edge
21:     vertex81 = vertex in 1881 of edge
22:     if (count(edges of vertex71) == 1) and (count(edges of vertex81) == 1) then
23:         O = O ∪ edge

Figure 4.2: Example of disambiguating 1M and M1 groups. (i) Starting 1M and M1 link groups; (ii) disambiguated 1M link groups; (iii) disambiguated M1 link groups; (iv) final set of 1:1 links
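The greedy procedure of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the PiM system's actual implementation: representing links as a dictionary mapping (1871 record, 1881 record) pairs to similarity scores is an assumption made for the example.

```python
from collections import defaultdict

def disambiguate(links):
    """Greedy disambiguation of 1M/M1 link groups (sketch of Algorithm 1).

    `links` maps (rec1871, rec1881) pairs to similarity scores.
    Returns the links that survive as one-to-one matches.
    """
    def keep_max(links, side):
        # Group links by the record on one side (0 = 1871, 1 = 1881)
        # and keep only the top-scoring link(s) per group, preserving ties.
        groups = defaultdict(list)
        for pair, score in links.items():
            groups[pair[side]].append((pair, score))
        kept = {}
        for edges in groups.values():
            best = max(score for _, score in edges)
            for pair, score in edges:
                if score == best:
                    kept[pair] = score
        return kept

    links = keep_max(links, side=0)  # disambiguate 1M groups (per 1871 record)
    links = keep_max(links, side=1)  # disambiguate M1 groups (per 1881 record)

    # Keep only links that are now one-to-one on both sides.
    c71, c81 = defaultdict(int), defaultdict(int)
    for a, b in links:
        c71[a] += 1
        c81[b] += 1
    return {p: s for p, s in links.items() if c71[p[0]] == 1 and c81[p[1]] == 1}
```

On the link groups of Fig. 4.2, records with multiple candidate links are reduced to their highest-scoring edge on each side before the final 1:1 check.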

One thing to note about this algorithm is that by taking the highest scoring link from a 1M or M1 link group, we add in the assumption that every 1M and M1 group should have a link that is seen as a match, when realistically this is not the case. Not every entity in the 1871 census is present in the 1881 census, since death and emigration remove some people from the population, just as births and immigration add new people who were not present in the 1871 census. The newly added entities in the 1881 census may have characteristics similar to those who were present in the 1871 census, thus leading to 1M and M1 groups that might not have a true matching link.

One way to mitigate the negative effects associated with this assumption is to implement thresholds on the scores of every link, and only treat links as matches if their scores exceed the set threshold. This would ensure that the 1M ∪ M1 links being chosen as matches correspond to links that have high similarity scores. The use of thresholds on similarity scores is explored later in Section 4.7.4.

4.4 Probability Score from the Classifier

Each link in the 1M ∪ M1 data set has a starting similarity score associated with it in the form of an attached probability returned by the classification system. This probability (Prob) represents the confidence the SVM classification system had in its decision based on the training data it was given; the higher the score, the more confidence the classifier had in its label, e.g., a link (a, b) where Prob = 0.73 corresponds to a 73% chance that a and b are a match. This is the first similarity measure that we use with our disambiguation algorithm.

Table 4.2 compares the results of the Prob disambiguation system with those of the Base system. The results for Prob are denoted as RProb, while σProb denotes the standard deviation of the results over the 5-folds. ∆RProb−Base denotes the change in performance produced by RProb compared to RBase. As can be seen, the Prob system successfully increases the TPR and LR by 2.91% and 4.78% respectively, but it also increases the FPR by 14.34%, resulting in an FPR 3.88 times that of the Base system.

              TP       FP       TPR      FPR      LR
RBase         684.80   36.00    40.06%   4.98%    17.17%
σBase         38.44    9.62     1.47%    1.25%    0.05%
RProb         734.40   175.80   42.97%   19.32%   21.95%
σProb         36.31    16.39    1.32%    1.66%    0.05%
∆RProb−Base   +49.60   +139.80  +2.91%   +14.34%  +4.78%

Table 4.2: Comparison of the Prob disambiguation system against the Base system

These initial results show that the disambiguation of the 1M ∪ M1 links can lead to an increase in the TPR and LR of the system, but using the Prob system also leads to an increase in the number of false positives (FP) being produced. Recall that for this particular application the FPR must be less than 5%.

To understand what is causing the increase in FPR, we look at the distribution of the Prob score across the 1M ∪ M1 links. Figure 4.3 shows that over 80% of the 1M ∪ M1 links have a score of 0.96. Since the classification step only has a limited number of feature scores (i.e., 12) to work with, there are a large number of feature vectors that share the same information. Therefore, they have the same probability assigned to them. This is especially true for people having common names within the censuses (e.g., Mary Smith, John McDonald). Therefore, the poor performance of the Prob system is likely due to the skewed classifier probabilities. Since our proposed algorithm is based on a "greedy" maximum function, it cannot perform well with such skewed data.

Overall, the Prob score does not provide enough information to distinguish between the 1M ∪ M1 links, resulting in the skewed distribution, and therefore a better similarity measure needs to be defined. This similarity measure can potentially utilize extra information about the 1M ∪ M1 links in the form of origin, religion and household attributes, which will be described next.

Figure 4.3: Distribution of Prob

4.5 Extra Attributes

Since the Prob system produces a FPR that is over 5%, our next approach is to bring in extra information about each record to help distinguish between the 1M ∪ M1 links. As previously mentioned, the Base system defines a record using only six census attributes. Therefore, we have access to attributes that were not included in the overall Base system, i.e., origin, religion and household attributes. The next two subsections give details about these attributes.

4.5.1 Origin and Religion Attributes

Origin and religion attributes are numerical codes that correspond to a unique categorical label. For example, the origin code 2300 corresponds to the "Irish" category and the religion code 1100 corresponds to the "Roman Catholic" category. Between the 1871 and 1881 Canadian census data sets there are 113 different origin codes [5] and 149 different religion codes [6], each corresponding to their own unique label.

The number of different codes shown above corresponds to the base-level grouping, where one code corresponds to only one label. To decrease the number of origin and religion codes present, we also look at a coarse-level grouping of the codes, where one code corresponds to a group of categorical labels. This decreases the number of origin and religion codes to 8 and 3, respectively. The coarse-level version is created by grouping similar labels together under one label, e.g., French, French Canadian and Acadian would now correspond to just one label, French.

As shown below, origin (O) and religion (R) can create binary similarity scores, with a value of 1 corresponding to matching codes and a value of 0 corresponding to mis-matched codes between record-pair (rj, ri). We would expect to see matching origin and religion codes in record-pairs that are the same entity across the census sets.

O(rj, ri) = 1 if O(rj) == O(ri), 0 otherwise.

R(rj, ri) = 1 if R(rj) == R(ri), 0 otherwise.

To determine which level of grouping to move forward with, we compare the origin and religion codes of the expert links (discussed in Sec. 2.4) using the base- and coarse-level groupings. Figure 4.4 shows the percentages of the expert links that have matching and mis-matched codes for each grouping. Overall, we expect the expert links to have a high percentage of matching origin and religion codes, no matter what level of grouping we use, since origin and religion are less likely to change over time, and we see this when looking at the difference between the base-level and coarse-level origin groupings.

Figure 4.4: Difference in matching vs. non-matching origin and religion codes

Unfortunately, this is not the case for the base-level and coarse-level religion groupings.

When using the base religion grouping, the percentage of expert links with matching religion codes is actually below the percentage of expert links with non-matching religion codes. This shows that the base religion grouping is not a good indicator of an expert link, and suggests that the religious category a person associates with tends to change, or be misreported, over time. For our purpose, the coarse grouping of religion codes is a better measure to use for expert links, as it shows more than 90% of expert links having matched religion codes. Section 4.6 gives more details on how the origin and religion attributes can be utilized in a similarity measure.

4.5.2 Household Attribute

The household attribute comes in the form of a Household Identification Number (HID) that is assigned to each record in a given census data set. All of the records that reside in a given household can be found by grouping together records with the same HID.
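Grouping records into households by HID can be sketched as follows; the dictionary-of-lists representation and the field names (`hid`, `name`) are illustrative assumptions, not the actual census schema.

```python
from collections import defaultdict

def build_households(records):
    """Group census records into households by their HID."""
    households = defaultdict(list)
    for rec in records:
        # Records sharing a HID belong to the same household within one census.
        households[rec["hid"]].append(rec)
    return households

# A toy census fragment (hypothetical records):
census = [
    {"hid": 100163, "name": "Barns, Fred"},
    {"hid": 100163, "name": "Barns, Leo"},
    {"hid": 724222, "name": "Smith, Mary"},
]
households = build_households(census)
```

Since a HID is only unique within one census, the 1871 and 1881 censuses must each be grouped separately.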

There are 609,300 households present in the 1871 census and 801,052 households present in the 1881 census, with the number of records residing in each household ranging from 1 to 761 and 1 to 625, respectively. A detailed distribution of the household sizes in each census is given in Fig. 4.5.

Figure 4.5: Distribution of household sizes in the 1871 and 1881 censuses

There is a problem associated with using HIDs, however: they are unique within a census but not through time. This means that comparing households is not as simple as looking up the same HID in the 1871 and 1881 censuses. For example, the HID 45 does not correspond to the same household in both the 1871 and 1881 censuses. Section 4.7 gives more details on how the household attribute can be utilized in a similarity measure.

4.6 Using the Origin and Religion Attributes

In this section we investigate whether the inclusion of origin and religion in various similarity measures helps to bring the FPR of the disambiguation system down, while still improving upon the TPR and LR produced by the Base system. Subsection 4.6.1 describes how the origin and religion attributes are integrated into the similarity measure, while subsection 4.6.2 describes how the origin and religion attributes are used as a filter that is paired with the similarity measure. Finally, subsection 4.6.3 gives an analysis of the various results.

4.6.1 Integrated into a Similarity Measure

We incorporate the origin (O) and religion (R) attributes into a similarity measure with the classifier probability by taking the sum of the three values. The overall score is shown in Eq. 4.1,

COR(rj, ri) = sum(C(rj, ri), O(rj, ri), R(rj, ri))    (4.1)

where C(rj, ri) is the classifier probability and

O(rj, ri) = 1 if O(rj) == O(ri), 0 otherwise.

R(rj, ri) = 1 if R(rj) == R(ri), 0 otherwise.
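Eq. 4.1 can be sketched as a small function; the example probability and attribute codes are illustrative, not taken from the actual data set.

```python
def cor_score(prob, origin_a, origin_b, religion_a, religion_b):
    """COR similarity (Eq. 4.1): classifier probability plus binary
    origin and religion match scores."""
    o = 1 if origin_a == origin_b else 0
    r = 1 if religion_a == religion_b else 0
    return prob + o + r

# A link with Prob = 0.96, matching origin (2300 = "Irish") but
# mismatched religion scores 0.96 + 1 + 0 = 1.96:
score = cor_score(0.96, 2300, 2300, 1100, 1200)
```

Because O and R are binary, COR values fall into bands at Prob, Prob + 1 and Prob + 2, which explains the three spikes discussed below for Fig. 4.6.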

              TP       FP       TPR      FPR      LR
RBase         684.80   36.00    40.06%   4.98%    17.17%
σBase         38.44    9.62     1.47%    1.25%    0.05%
RProb         734.40   175.80   42.97%   19.32%   21.95%
σProb         36.31    16.39    1.32%    1.66%    0.05%
RCOR          837.8    229.2    49.04%   21.39%   25.43%
σCOR          25.88    51.12    1.08%    4.02%    1.30%
∆RCOR−Base    +153.0   +193.2   +8.98%   +16.40%  +8.26%
∆RCOR−Prob    +103.4   +53.4    +6.07%   +2.07%   +3.48%

Table 4.3: Comparison of the COR disambiguation system against the Prob disambiguation system and the Base system

Table 4.3 compares the results of the COR disambiguation system with those of the Prob disambiguation system and the Base system. The results for COR are denoted as RCOR, while σCOR denotes the standard deviation of the results over the 5-folds. ∆RCOR−Base denotes the change in performance produced by RCOR compared to RBase, while ∆RCOR−Prob denotes the change in performance produced by RCOR compared to RProb.

As shown, the COR system successfully increases the TPR and LR, compared to the Prob system, by 6.07% and 3.48% respectively, but it also increases the FPR by 2.07%. This leads to a FPR of 21.39%, which is 4.30 times higher than the FPR of the Base system (4.98%).

There are three different "spikes" visible in the distribution of COR, as shown in Fig. 4.6. Recalling that over 80% of the links have a Prob score of 0.96 (Fig. 4.3), the first spike corresponds to sum(0.96, 0, 0), the second spike to sum(0.96, 1, 0)/sum(0.96, 0, 1) and the third spike to sum(0.96, 1, 1). Setting the sum of C, O and R as the similarity measure for each link opens the possibility for links with only one matching origin or religion attribute to be seen as a match by the disambiguation algorithm, in their respective 1M or M1 groups. We expect such links to be of lower quality, since we are assuming that origin and religion should not change over time, and therefore this could be the cause of the increase in the FPR for the COR system. To address this issue, we explore the use of origin and religion filters within the disambiguation system.

Figure 4.6: Distribution of COR

4.6.2 Used as a Filter and Paired with a Similarity Measure

Due to our assumption that record-pairs with the same origin and religion have a higher probability of being true matches, we can expect a matching link to have the same origin and religion, and therefore can construct a rule (below) to filter out all of the links that do not hold to this assumption.

OR(rj, ri) = 1 if O(rj) == O(ri) ∧ R(rj) == R(ri), 0 otherwise.

We explore two different techniques for applying OR to the 1M ∪ M1 links. First we disambiguate the 1M ∪ M1 links by applying just the OR rule, which filters out roughly 45% of the links. The remaining links are then checked, and any 1M / M1 groups that have become 1:1 links are pulled out for evaluation. We denote this method as ORFilter. Second, we apply the Prob system (Section 4.4) to disambiguate the 1M ∪ M1 links, but we eliminate the record-pairs that do not satisfy the above OR rule. We denote this method as ORMatch.
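The ORFilter technique can be sketched as follows. The record representation (dictionaries with `origin` and `religion` fields keyed by record identifier) is an assumption made for illustration, not the PiM system's actual data structures.

```python
from collections import Counter

def or_rule(rec_a, rec_b):
    # A link survives only if BOTH the origin and religion codes match.
    return (rec_a["origin"] == rec_b["origin"]
            and rec_a["religion"] == rec_b["religion"])

def or_filter(links, recs71, recs81):
    """ORFilter sketch: drop 1M/M1 links violating the OR rule, then
    keep only the pairs that have thereby become one-to-one."""
    survivors = {p: s for p, s in links.items()
                 if or_rule(recs71[p[0]], recs81[p[1]])}
    c71 = Counter(a for a, _ in survivors)
    c81 = Counter(b for _, b in survivors)
    return {p: s for p, s in survivors.items()
            if c71[p[0]] == 1 and c81[p[1]] == 1}
```

ORMatch differs only in that the surviving links are fed to the greedy disambiguation algorithm of Section 4.3 rather than being required to already be 1:1.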

Table 4.4 compares the results of the ORFilter and ORMatch disambiguation systems with those of the Prob disambiguation system and the Base system. Both systems, ORFilter and ORMatch, increase the TPR and decrease the FPR compared to the Prob system, with ORFilter producing the lower FPR (7.17%) of the two, at 1.44 times the FPR of the Base system. Overall, both disambiguation systems produced a TPR and LR higher than that of the Base system, but produced FPRs above the acceptable threshold of 5%.

                   TP       FP        TPR      FPR      LR
RBase              684.80   36.00     40.06%   4.98%    17.17%
σBase              38.44    9.62      1.47%    1.25%    0.05%
RProb              734.40   175.80    42.97%   19.32%   21.95%
σProb              36.31    16.39     1.32%    1.66%    0.05%
RORFilter          775.8    59.8      45.39%   7.17%    19.71%
σORFilter          40.02    9.09      1.51%    1.11%    0.04%
∆RORFilter−Base    +91.00   +23.80    +5.33%   +2.18%   +2.54%
∆RORFilter−Prob    +41.40   -116.00   +2.42%   -12.15%  -2.24%
RORMatch           818.2    179.4     47.87%   17.98%   23.61%
σORMatch           38.07    19.09     1.29%    1.74%    0.04%
∆RORMatch−Base     +133.40  +143.40   +7.81%   +13.00%  +6.44%
∆RORMatch−Prob     +83.80   +3.60     +4.90%   -1.34%   +1.66%

Table 4.4: Comparison of the ORFilter and ORMatch disambiguation systems against the Prob disambiguation system and the Base system

4.6.3 Summary of Analysis

In this section we explored the performance of three new disambiguation systems, in which extra information in the form of origin and religion attributes has been incorporated. The disambiguation systems were as follows:

• COR - incorporated origin, religion and the classifier probability (Section 4.4) into a similarity measure.

• ORFilter - constructed a rule, based on the origin and religion attributes, to filter out links in the 1M ∪ M1 link set.

• ORMatch - used the previously constructed rule on the 1M ∪ M1 links and disambiguated the remaining links with the Prob score.

The performance of each disambiguation system was compared to the Prob disambiguation system, which does not use the extra attributes, and to the Base system. Of the three systems, the ORFilter system produced the lowest FPR, at 7.17%, but also the lowest TPR and LR, at 45.39% and 19.71%. On the other hand, the COR method produced the highest TPR and LR, at 49.04% and 25.43%, but also produced the highest FPR at 21.39%.

The inclusion of the origin and religion attributes helped to disambiguate more 1M ∪ M1 links, and therefore increased the TPR and LR by a factor of 1.13 to 1.22 and 1.15 to 1.48, respectively, relative to the Base system. However, each system also increased the FPR by a factor of 1.44 to 4.30.

With respect to the Prob system, the ORFilter and ORMatch systems decreased the FPR by a factor of 2.69 and 1.08, respectively. From these results we can argue that the inclusion of extra attributes in the disambiguation system does improve upon the FPR produced, compared to a disambiguation system that does not utilize extra attributes, and therefore we explore the use of another extra attribute, household, in Section 4.7.

4.7 Using the Household Attribute

As previously mentioned in subsection 4.5.2, each record has a HID associated with it that is only unique within the census, meaning a similarity measure cannot be constructed based upon matching the 1871 and 1881 HIDs of a record-pair.

Fortunately, the HID associated with each record can be used to create its household, consisting of all the records that have the same HID within the census. By having access to the 1871 and 1881 households that a record-pair resides in, it is possible to construct a similarity measure based upon the record-pair's household similarity. We investigate whether the inclusion of a household similarity in a disambiguation system helps to decrease the FPR, while still improving upon the TPR and LR produced by the Base system.

Subsection 4.7.1 describes the similarity measure used to compare two households, while subsections 4.7.2 to 4.7.5 describe the various household similarity measures used and the performance of their corresponding disambiguation systems.

4.7.1 Jaccard Coefficient

The similarity between two sets of items, A and B, can be calculated by taking the total number of items in the intersection of the two sets and dividing it by the total number of items in the union of the two sets. This is known as the Jaccard similarity coefficient [12, 35] as defined in Eq. 4.2.

J(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| − |A ∩ B|)    (4.2)
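Eq. 4.2 translates directly into a small function over Python sets:

```python
def jaccard(a, b):
    """Jaccard similarity coefficient (Eq. 4.2) between two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention for two empty sets
    # |A ∩ B| / |A ∪ B|, equivalently |A ∩ B| / (|A| + |B| - |A ∩ B|)
    return len(a & b) / len(a | b)

# Two households sharing one of two origin labels score 0.5:
score = jaccard({"French", "English"}, {"French"})
```

Identical sets score 1, disjoint sets score 0, and partial overlap falls in between.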

As households can be seen as sets of attributes (items), the Jaccard similarity coefficient is used as the general similarity measure for comparing two households. In a household we have access to three different sets of attributes, besides the attributes used in the record-linkage process, on which to base the Jaccard measure:

• HO = {origins present in the household},

• HR = {religions present in the household}, and

• HRc = {records present in the household}.

Figure 4.7: Example of a Simple Household

For example, given the simple household shown in Fig. 4.7, the three sets of attributes are:

• HO = {1, 2} = {French, English}

• HR = {2} = {Protestant}

• HRc = {1, 2, 3} = {{Barns, Fred, ...}, {Barns, Leo, ...}, {Barns, Leonard, ...}}

For each 1M ∪ M1 link, the Jaccard score is calculated based upon the households that the 1871 and 1881 records reside in. For example, Fig. 4.8 shows a 1M group with links between the 1871 record 805971460 and the 1881 records 1121501415 and 1092100809.

Figure 4.8: Example of a 1M Link Group

To calculate the Jaccard score for the link identified as X in Fig. 4.8 (link 805971460 – 1121501415), all of the records that reside in the same household as 805971460 and 1121501415 are collected, resulting in the households shown in Fig. 4.9. The first household (HID-100163) has 12 members, while the second (HID-724222) has 9 members. The actual Jaccard score can then be calculated between the two households based upon the attribute set to be used, e.g., HO, HR or HRc.

Figure 4.9: Example of Households belonging to a 1M Link

A single variation of the Jaccard measure was used by Fu et al. [22] to help disambiguate 1M groups using household data. The Jaccard measure, based on the 1:1 records between households, was shown to greatly reduce the number of 1M groups present (see Chapter 3 for more details). In contrast, in this thesis we expand on the exploration of the Jaccard measure by looking at four different variations that include origin, religion and records, along with the introduction of thresholds on the Jaccard measures.

4.7.2 Using Origin and Religion in the Jaccard Measure

The first set of attributes used with the Jaccard measure are origin (HO) and religion (HR). The general Jaccard measures for origin and religion are defined in Eq. 4.3 and 4.4 respectively, where A and B are household sets consisting of HO or HR attributes.

O J(A, B) = |A ∩origin B| / (|A| + |B| − |A ∩origin B|)    (4.3)

R J(A, B) = |A ∩religion B| / (|A| + |B| − |A ∩religion B|)    (4.4)

To calculate O J(A, B) the number of origins that are the same between households A and B is divided by the total number of origins found between households A and B. The calculation of R J(A, B) is exactly the same as O J(A, B), except the equation is dealing with the religions found between households A and B.

For example, to calculate the origin and religion Jaccard scores for the 1M link identified as X in Fig. 4.8, the origin and religion household sets are constructed for the link. In Fig. 4.9 the origins and religions present in the 100163 household are {4, 5} and {2}, respectively, whereas the origins and religions present in the 724222 household are {5} and {2}, respectively. Therefore the intersections between 100163 and 724222 are {5} for origins and {2} for religions. Figure 4.10 shows the final household sets produced, with matching origins and religions depicted by solid arrows.

Figure 4.10: Origin/Religion Household Setup

The origin and religion Jaccard measures for the 1M link X are calculated as follows:

O J(X) = 1 / ((2 + 1) − 1) = 0.5

R J(X) = 1 / ((1 + 1) − 1) = 1

Since origin and religion are considered together for each link (due to the assumption that they should stay the same over time), the average of the origin and religion Jaccard scores is used to create an overall similarity score, as shown in Eq. 4.5.

AvgOR J = (O J + R J) / 2    (4.5)

For example, using the average of the two previous Jaccard measures leads to the following score for the 1M link X:

AvgOR J(X) = (0.5 + 1) / 2 = 0.75
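The worked example above can be reproduced with a short sketch; the household sets are the ones read off Fig. 4.10 for link X.

```python
def avg_or_j(ho_a, ho_b, hr_a, hr_b):
    """Average of the origin and religion household Jaccard scores (Eq. 4.5).
    Arguments are the origin (HO) and religion (HR) sets of two households."""
    def j(a, b):
        # Jaccard via |A ∩ B| / (|A| + |B| - |A ∩ B|)
        inter = len(a & b)
        return inter / (len(a) + len(b) - inter)
    return (j(ho_a, ho_b) + j(hr_a, hr_b)) / 2

# Link X: origins {4, 5} vs {5}, religions {2} vs {2}
score = avg_or_j({4, 5}, {5}, {2}, {2})  # (0.5 + 1) / 2 = 0.75
```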

                      TP       FP       TPR      FPR      LR
RBase                 684.80   36.00    40.06%   4.98%    17.17%
σBase                 38.44    9.62     1.47%    1.25%    0.05%
RORFilter             775.8    59.8     45.39%   7.17%    19.71%
σORFilter             40.02    9.09     1.51%    1.11%    0.04%
RAvgOR J              874.8    128.4    51.15%   12.79%   23.14%
σAvgOR J              48.58    13.81    2.00%    1.14%    0.03%
∆RAvgOR J−Base        +190.00  +92.40   +11.12%  +7.81%   +5.97%
∆RAvgOR J−ORFilter    +99.00   +68.60   +5.79%   +5.63%   +3.43%

Table 4.5: Comparison of the AvgOR J disambiguation system against the ORFilter disambiguation system and the Base system

Table 4.5 compares the results of the AvgOR J disambiguation system with those of the best origin/religion disambiguation system, ORFilter (Section 4.6), and the Base system. The results for AvgOR J are denoted as RAvgOR J, while σAvgOR J denotes the standard deviation of the results over the 5-folds. ∆RAvgOR J−Base denotes the change in performance produced by RAvgOR J compared to RBase, while ∆RAvgOR J−ORFilter denotes the change in performance produced by RAvgOR J compared to RORFilter.

As shown, the AvgOR J system has successfully increased the TPR and LR, compared to the ORFilter system, by 5.79% and 3.43%, respectively, but it also increases the FPR by 5.63%. This leads to a FPR of 12.79%, which is higher by a factor of 2.57 with respect to the FPR of the Base system (4.98%).

4.7.3 Current Links with Jaccard

In the previous subsection, only origin and religion were considered, as these are not expected to (significantly) change. In this section, we now focus on the records present in the households, HRc. The general household Jaccard equation is defined in Eq. 4.6, where A and B are household sets consisting of HRc.

Rc J(A, B) = |A ∩ B| / (|A| + |B| − |A ∩ B|)    (4.6)

In this case, the intersection in the Jaccard measure refers to the number of record-pairs that have been linked (i.e., 1:1, 1M and/or M1), and therefore are considered to be the same between the two households. From the classification step, we have access to two different sets of linked record-pairs. The first set consists of the 1:1 links and the second set consists of the 1M ∪ M1 links. By redefining what is considered a link between the two households, we construct three different versions of the Jaccard measure, as shown in Eqs. 4.7 to 4.9.

Rc J1:1(A, B) = |A ∩1:1 B| / (|A| + |B| − |A ∩1:1 B|)    (4.7)

Rc J1M∪M1(A, B) = |A ∩1M∪M1 B| / (|A| + |B| − |A ∩1M∪M1 B|)    (4.8)

Rc J1:1+1M∪M1(A, B) = |A ∩1:1+1M∪M1 B| / (|A| + |B| − |A ∩1:1+1M∪M1 B|)    (4.9)

Fig. 4.11 shows an example of some 1M and M1 groups, along with the 1871 household and all the 1881 households linked to it. Double-lined boxes represent households, bold arrows depict 1:1 links and dashed arrows depict 1M ∪ M1 links. Note that |HID-100163| = 12, |HID-724222| = 9 and |HID-704832| = 2.

Figure 4.11: 1:1 and 1M ∪ M1 Links between Households

To calculate Rc J1:1(A, B) the number of 1:1 links present between households A and B is divided by the total number of records present between households A and B. Therefore, when using Rc J1:1, the intersection between the two households is the number of 1:1 links found between the households. For X shown in Fig. 4.11, there are four 1:1 links (shown by bold arrows) between HID-100163 and HID-724222 and for Y there are zero 1:1 links between HID-100163 and HID-704832. The final Rc J1:1 scores for X and Y are as follows:

Rc J1:1(X) = 4 / ((12 + 9) − 4) = 0.235

Rc J1:1(Y) = 0 / ((12 + 9) − 0) = 0

When using the Rc J1:1 measure, most of the 1M ∪ M1 links end up with a score of 0, since 96.22% of the 1M ∪ M1 links are between households that have no 1:1 links (as happens in the case of link Y in Fig. 4.11). Even though the Rc J1:1 measure will return only a small number of 1M ∪ M1 links with values higher than 0, we expect those links to be of high quality, since they are based on households that contain the high-quality 1:1 links.

The calculations of Rc J1M∪M1(A, B) and Rc J1:1+1M∪M1(A, B) are exactly the same as Rc J1:1(A, B), except that the equations deal with the 1M ∪ M1 and 1:1 + 1M ∪ M1 links present between households A and B, respectively; only the links used in the intersection have changed.¹ The final Rc J1M∪M1 and Rc J1:1+1M∪M1 scores of X and Y are as follows:

Rc J1M∪M1(X) = 3 / ((12 + 9) − 3) = 0.167

Rc J1M∪M1(Y) = 1 / ((12 + 9) − 1) = 0.05

Rc J1:1+1M∪M1(X) = (3 + 4) / ((12 + 9) − (3 + 4)) = 0.5

Rc J1:1+1M∪M1(Y) = (1 + 0) / ((12 + 9) − (1 + 0)) = 0.05
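Since the record-based Jaccard only needs the two household sizes and a link count, the worked scores for X and Y can be sketched as:

```python
def rc_j(size_a, size_b, n_linked):
    """Record-based household Jaccard (Eqs. 4.6-4.9): the intersection
    is the number of linked record-pairs between the two households."""
    return n_linked / (size_a + size_b - n_linked)

# Link X (|HID-100163| = 12, |HID-724222| = 9):
# four 1:1 links and three 1M ∪ M1 links between the households.
x_11 = rc_j(12, 9, 4)        # Rc J1:1(X), about 0.235
x_both = rc_j(12, 9, 3 + 4)  # Rc J1:1+1M∪M1(X) = 0.5
```

Which links count toward `n_linked` (1:1 only, 1M ∪ M1 only, or both) is what distinguishes the three variants.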

Our argument for using the ambiguous 1M ∪ M1 links is that household pairs with ambiguous links between them can still be considered similar, since the ambiguous links have the potential to be true matches. We make the assumption that the most probable link is the one where other family members have been linked between the same household-pairs.

Rc J Results

Table 4.6 compares the various Rc J disambiguation systems against the ORFilter disambiguation system and the Base system. As shown, each Rc J system increases the TPR and LR, compared to the ORFilter system, by 19.51% to 34.49% and 5.45% to 16.51%, respectively. However, the Rc J1:1 system decreased the FPR by a factor of 1.34 with respect to the Base system, whereas the other two systems increased the FPR. It should be noted that the Rc J1:1+1M∪M1 system only increased the FPR by a factor of 1.52 relative to the Base system, but it managed to roughly double the TPR and LR produced by the Base system, increasing them by factors of 1.99 and 2.11, respectively.

¹When dealing with 1M ∪ M1 in the Jaccard measures there is a pre-step involved; see Appendix B for details.

                             TP       FP       TPR      FPR      LR
RBase                        684.80   36.00    40.06%   4.98%    17.17%
σBase                        38.44    9.62     1.47%    1.25%    0.05%
RORFilter                    775.8    59.8     45.39%   7.17%    19.71%
σORFilter                    40.02    9.09     1.51%    1.11%    0.04%
RRc J1:1                     1109     43.00    64.90%   3.73%    25.17%
σRc J1:1                     46.29    10.22    1.66%    0.83%    0.04%
∆RRc J1:1−Base               +424.20  +7.00    +24.84%  -1.26%   +8.00%
∆RRc J1:1−ORFilter           +333.2   -16.8    +19.51%  -3.44%   +5.45%
RRc J1M∪M1                   1315.2   145.6    76.99%   9.94%    35.66%
σRc J1M∪M1                   25.76    22.55    0.77%    1.30%    0.06%
∆RRc J1M∪M1−Base             +630.40  +109.60  +36.93%  +4.96%   +18.49%
∆RRc J1M∪M1−ORFilter         +539.4   +85.8    +31.60%  +2.78%   +15.95%
RRc J1:1+1M∪M1               1364.6   111.8    79.88%   7.56%    36.22%
σRc J1:1+1M∪M1               29.28    18.90    0.76%    1.12%    0.06%
∆RRc J1:1+1M∪M1−Base         +679.80  +75.80   +39.82%  +2.57%   +19.05%
∆RRc J1:1+1M∪M1−ORFilter     +588.8   +52.0    +34.49%  +0.39%   +16.51%

Table 4.6: Comparison of the Rc J disambiguation systems against the ORFilter disambiguation system and the Base system

The increase of the TPR and LR in the Rc J1:1+1M∪M1 system is due to all of the 1M ∪ M1 links receiving a Jaccard score, since every household pair will have at least one 1M ∪ M1 link between them (the link in question). By letting every link have a score, the chance that a 1M ∪ M1 link with an extremely low similarity score will become a match increases, which can lead to an increase in FP links. One way to deal with this problem is to introduce thresholds on the Jaccard scores to prevent low similarity scores from becoming matches. The use of thresholds, when applied on Rc J1:1, Rc J1M∪M1 and Rc J1:1+1M∪M1, is explored in the next subsection.

4.7.4 Applying Thresholds on Record-based Jaccard

As mentioned in Section 4.3, by making the disambiguation algorithm take the highest scoring link from a 1M or M1 group, we are assuming that every group should have a link that would be seen as a match, when realistically that is not the case. With the passing of time, there will be people in 1871 who do not exist in 1881 due to death and emigration, yet because of births and immigration in the 1881 census there is a chance that those 1871 records will be linked anyway, due to similar attributes in the newly added 1881 records.

One way around the previous assumption is to implement thresholds on the scores of every 1M ∪ M1 link, and only take 1M ∪ M1 links as matches if their scores exceed a certain threshold. This would make sure that the 1M ∪ M1 links being chosen as matches correspond to links that have high similarity scores, and that the disambiguation algorithm does not end up seeing a link with a low similarity score as a match just because it is the only link in its 1M or M1 group with a value.

Applying a threshold works as follows: if the given score for a link does not exceed the threshold, the link in question is removed from its 1M ∪ M1 link group and is not given to the disambiguation algorithm. For example, a threshold of 0.5 would let all the links with a score higher than 0.5 through, and remove all links with scores of 0.5 and below.
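This thresholding step can be sketched as a simple filter over the scored link set; the link representation is the same illustrative (pair → score) mapping assumed earlier, not the PiM system's actual data structures.

```python
def apply_threshold(links, threshold):
    """Keep only links whose similarity score strictly exceeds the threshold;
    the rest are withheld from the disambiguation algorithm."""
    return {pair: score for pair, score in links.items() if score > threshold}

# Hypothetical scored links; with a threshold of 0.5, scores of 0.5
# and below are removed.
links = {('a', 'x'): 0.66, ('a', 'y'): 0.5, ('b', 'z'): 0.25}
kept = apply_threshold(links, 0.5)
```

Note the strict inequality: a link scoring exactly at the threshold is removed, matching the 0.5 example above.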

In terms of the Jaccard measure, setting a high threshold would confine the results to 1M ∪ M1 links that are present between household-pairs that do not differ drastically in size. However, household sizes are bound to change over time, due to death and birth within the households and people moving in and out. This means that an extremely high Jaccard threshold might not result in the best performance. Therefore, given the prior knowledge that households are going to change, we explore the performance of various threshold values upon the Rc J1:1, Rc J1M∪M1 and Rc J1:1+1M∪M1 systems. To determine which threshold values to test, the distribution of Rc J1:1, Rc J1M∪M1 and Rc J1:1+1M∪M1 on the 1M ∪ M1 links was examined, and can be seen in Figs. 4.12 to 4.14, respectively.


Figure 4.12: Average distribution of Rc J1:1 across 1M ∪ M1 links


Figure 4.13: Average distribution of Rc J1M∪M1 across 1M ∪ M1 links

[Bar chart: similarity score (0 to 1, in steps of 0.05) on the x-axis; percentage of 1M ∪ M1 links on the y-axis.]

Figure 4.14: Average distribution of Rc J1:1+1M∪M1 across 1M ∪ M1 links

Similar trends are seen in all three distributions of Rc J1:1, Rc J1M∪M1 and Rc J1:1+1M∪M1. For example, over 94% of the 1M ∪ M1 links have Jaccard scores in the range of 0 to 0.25, with the peak of each graph corresponding to the Jaccard value 1/13 (disregarding the peak at a score of 0 for Rc J1:1). This peak at 1/13 is related to the household sizes present in the census. Roughly 99% of the households in each census have a size between 1 and 12 (Section 4.5.2), making 14 the most common summed household-pair size. Since the majority of household-pairs have only one link between them, the 1M ∪ M1 link in question, the Jaccard score of 1/13 becomes prominent in the distribution.
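The 1/13 peak follows directly from the record-based Jaccard computation. A minimal sketch, assuming the measure is defined as |links| / (|H1871| + |H1881| − |links|), i.e. each matched pair of records is counted once in the union — this formula is inferred from the 14-record, one-link, 1/13 example in the text:

```python
def record_jaccard(size_1871, size_1881, n_links):
    """Record-based Jaccard score for a household-pair.

    The intersection is the number of links between the two
    households; the union is the combined household size with each
    matched pair of records counted only once.
    """
    union = size_1871 + size_1881 - n_links
    return n_links / union

# A 7-person household paired with a 7-person household, joined by a
# single 1M/M1 link: combined size 14, union 13, score 1/13.
score = record_jaccard(7, 7, 1)  # 1/13 ≈ 0.0769
```

Any household-pair whose sizes sum to 14 and that shares exactly one link lands on this same 1/13 value, which is why that score dominates the distribution.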

Each distribution also has the same five scores that cut the distribution into sections; these are Jaccard scores of 0.25, 0.33, 0.5, 0.66 and 0.75. Since each distribution showed the same trend at these five Jaccard scores, we decided to employ them as thresholds. Tables 4.7 to 4.9 compare the performance of each threshold on the Rc J1:1, Rc J1M∪M1 and Rc J1:1+1M∪M1 systems to the non-threshold performance of the Rc J1:1, Rc J1M∪M1 and Rc J1:1+1M∪M1 systems and the Base system.

                                     TP        FP       TPR       FPR      LR
R Base                               684.80    36.00    40.06%    4.98%    17.17%
σ Base                               38.44     9.62     1.47%     1.25%    0.05%
R Rc J1:1                            1109.00   43.00    64.90%    3.73%    25.17%
σ Rc J1:1                            46.29     10.22    1.66%     0.83%    0.04%
R Rc J1:1 > 0.25                     746.80    36.00    43.69%    4.59%    18.16%
σ Rc J1:1 > 0.25                     42.66     9.62     1.69%     1.14%    0.05%
ΔR (Rc J1:1 > 0.25) − Base           62.00     0.00     3.63%     -0.40%   0.99%
ΔR (Rc J1:1 > 0.25) − Rc J1:1        -362.20   -7.00    -21.21%   0.86%    -7.01%

Table 4.7: Comparison of Rc J1:1 > threshold disambiguation systems against the Rc J1:1 disambiguation system and the Base system

Higher thresholds were not applied to Rc J1:1 because all of the false positive (FP) links present in the Base system were removed by the 0.25 threshold; any threshold higher than 0.25 would therefore only remove true positive (TP) links, increasing the overall FPR and decreasing the TPR and LR.

Not surprisingly, the inclusion of thresholds on the Rc J systems decreased the TPR and LR compared to the non-threshold versions. The threshold versions also decreased the FPR with respect to the non-threshold versions (excluding Rc J1:1). By applying thresholds to the Rc J1M∪M1 and Rc J1:1+1M∪M1 systems, the FPR was brought below the 5% constraint.

Due to the overlap of the standard deviations for the FPR of all thresholds applied to a given Rc J system, we cannot state that one threshold is better than the others with respect to the FPR, since the difference in their results could be the effect of randomness in the data. However, we can state that the use of thresholds as a whole made a significant difference in the FPR between threshold and non-threshold systems (excluding the Rc J1:1 system).

Overall, Rc J1:1+1M∪M1 with a threshold of 0.25 produced the highest LR and TPR, while keeping the FPR below 5%.

                                     TP        FP        TPR       FPR      LR
R Base                               684.80    36.00     40.06%    4.98%    17.17%
σ Base                               38.44     9.62      1.47%     1.25%    0.05%
R Rc J1M∪M1                          1315.20   145.60    76.99%    9.94%    35.66%
σ Rc J1M∪M1                          25.76     22.55     0.77%     1.30%    0.06%
R Rc J1M∪M1 > 0.25                   1097.00   47.80     64.21%    4.17%    25.13%
σ Rc J1M∪M1 > 0.25                   28.99     8.79      1.26%     0.76%    0.03%
ΔR (> 0.25) − Base                   412.20    11.80     24.15%    -0.81%   7.96%
ΔR (> 0.25) − Rc J1M∪M1              -218.20   -97.80    -12.77%   -5.77%   -10.53%
R Rc J1M∪M1 > 0.33                   970.00    41.40     56.78%    4.10%    22.42%
σ Rc J1M∪M1 > 0.33                   30.20     7.60      1.35%     0.77%    0.03%
ΔR (> 0.33) − Base                   285.20    5.40      16.72%    -0.89%   5.24%
ΔR (> 0.33) − Rc J1M∪M1              -345.20   -104.20   -20.21%   -5.85%   -13.25%
R Rc J1M∪M1 > 0.5                    793.60    38.80     46.44%    4.65%    19.07%
σ Rc J1M∪M1 > 0.5                    33.03     8.58      1.04%     0.97%    0.04%
ΔR (> 0.5) − Base                    108.80    2.80      6.38%     -0.33%   1.90%
ΔR (> 0.5) − Rc J1M∪M1               -521.60   -106.80   -30.55%   -5.29%   -16.59%
R Rc J1M∪M1 > 0.66                   716.40    37.00     41.91%    4.90%    17.78%
σ Rc J1M∪M1 > 0.66                   35.39     9.49      1.25%     1.19%    0.05%
ΔR (> 0.66) − Base                   31.60     1.00      1.85%     -0.08%   0.61%
ΔR (> 0.66) − Rc J1M∪M1              -598.80   -108.60   -35.07%   -5.04%   -17.88%
R Rc J1M∪M1 > 0.75                   705.20    36.80     41.26%    4.95%    17.50%
σ Rc J1M∪M1 > 0.75                   36.97     9.65      1.33%     1.21%    0.05%
ΔR (> 0.75) − Base                   20.40     0.80      1.20%     -0.04%   0.33%
ΔR (> 0.75) − Rc J1M∪M1              -610.00   -108.80   -35.73%   -4.99%   -18.16%

Table 4.8: Comparison of Rc J1M∪M1 > threshold disambiguation systems against the Rc J1M∪M1 disambiguation system and the Base system

4.7.5 Applying the Origin and Religion Filter to Household Similarity

We expect that using all the information available will help to disambiguate the 1M ∪ M1 links; we therefore look into a technique that combines origin and religion with the record-based Jaccard measure. Since the Rc J1:1+1M∪M1 method produces the best results, we create a combination of that method with an origin and religion filter. This is done by

                                                    TP        FP       TPR       FPR      LR
R Base                                              684.80    36.00    40.06%    4.98%    17.17%
σ Base                                              38.44     9.62     1.47%     1.25%    0.05%
R Rc J1:1+1M∪M1                                     1364.60   111.80   79.88%    7.56%    36.22%
σ Rc J1:1+1M∪M1                                     29.28     18.90    0.76%     1.12%    0.06%
R Rc J1:1+1M∪M1 > 0.25                              1293.20   47.20    75.70%    3.52%    28.78%
σ Rc J1:1+1M∪M1 > 0.25                              26.17     9.68     1.08%     0.68%    0.02%
ΔR (> 0.25) − Base                                  608.40    11.20    35.64%    -1.47%   11.61%
ΔR (> 0.25) − Rc J1:1+1M∪M1                         -71.40    -64.60   -4.17%    -4.04%   -7.44%
R Rc J1:1+1M∪M1 > 0.33                              1202.00   41.40    70.36%    3.33%    26.40%
σ Rc J1:1+1M∪M1 > 0.33                              31.10     7.70     0.94%     0.59%    0.03%
ΔR (> 0.33) − Base                                  517.20    5.40     30.29%    -1.66%   9.23%
ΔR (> 0.33) − Rc J1:1+1M∪M1                         -162.60   -70.40   -9.52%    -4.23%   -9.82%
R Rc J1:1+1M∪M1 > 0.5                               946.40    39.40    55.39%    4.00%    21.70%
σ Rc J1:1+1M∪M1 > 0.5                               38.41     8.56     1.62%     0.85%    0.04%
ΔR (> 0.5) − Base                                   261.60    3.40     15.33%    -0.99%   4.53%
ΔR (> 0.5) − Rc J1:1+1M∪M1                          -418.20   -72.40   -24.49%   -3.56%   -14.52%
R Rc J1:1+1M∪M1 > 0.66                              787.80    37.60    46.10%    4.55%    18.88%
σ Rc J1:1+1M∪M1 > 0.66                              30.89     9.50     1.10%     1.11%    0.05%
ΔR (> 0.66) − Base                                  103.00    1.60     6.04%     -0.43%   1.71%
ΔR (> 0.66) − Rc J1:1+1M∪M1                         -576.80   -74.20   -33.77%   -3.01%   -17.34%
R Rc J1:1+1M∪M1 > 0.75                              748.00    37.00    43.77%    4.71%    18.12%
σ Rc J1:1+1M∪M1 > 0.75                              32.96     9.49     1.12%     1.15%    0.05%
ΔR (> 0.75) − Base                                  63.20     1.00     3.71%     -0.28%   0.95%
ΔR (> 0.75) − Rc J1:1+1M∪M1                         -616.60   -74.80   -36.11%   -2.85%   -18.11%

Table 4.9: Comparison of Rc J1:1+1M∪M1 > threshold disambiguation systems against the Rc J1:1+1M∪M1 disambiguation system and the Base system

taking the links produced by the variations of the Rc J1:1+1M∪M1 method and applying the origin and religion filter (OR), as presented in subsection 4.6.2.

Therefore, the first level of disambiguation of the 1M ∪ M1 links is done using the Jaccard scores, and a second level of disambiguation is added using the OR filter. Table 4.10 compares the performance of applying the OR filter on the various Rc J1:1+1M∪M1 > threshold systems to the non-filtered systems and the Base system. The OR filter was not applied to any system with a threshold higher than 0.33, since all of the false positive (FP) links present in the Base system were removed by the OR − Rc J1:1+1M∪M1 > 0.33 system, and any Rc J1:1+1M∪M1 system with a higher threshold will only remove true positive (TP) links, which increases the overall FPR and decreases the TPR and LR.
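The two-level scheme can be sketched as a short pipeline: threshold the Jaccard scores first, then apply the OR filter to the survivors. This is a sketch under assumptions — the link fields and the stand-in filter rule (requiring both origin and religion of the record-pair to agree) are placeholders for the actual rule defined in Subsection 4.6.2:

```python
def passes_or_filter(link):
    """Stand-in for the OR filter: keep a link only when both the
    origin and the religion attributes of the record-pair agree."""
    return link["origin_match"] and link["religion_match"]

def two_level_disambiguation(links, threshold):
    """Level 1: drop links whose Jaccard score does not exceed the
    threshold.  Level 2: drop links rejected by the OR filter."""
    passed = [l for l in links if l["score"] > threshold]
    return [l for l in passed if passes_or_filter(l)]

links = [
    {"score": 0.40, "origin_match": True,  "religion_match": True},
    {"score": 0.40, "origin_match": True,  "religion_match": False},
    {"score": 0.10, "origin_match": True,  "religion_match": True},
]
kept = two_level_disambiguation(links, 0.33)  # only the first link survives
```

The second link is removed by the OR filter despite passing the threshold, and the third is removed by the threshold despite matching on origin and religion — each level rejects links the other would accept.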

Applying the OR filter decreased the TPR and LR compared to the corresponding non-filtered systems. It also increased the FPR of the two Rc J1:1+1M∪M1 > threshold systems tested, by 0.03% and 0.16%. However, the OR filter decreased the FPR by 1.16% compared to the Rc J1:1+1M∪M1 system.

                                                            TP        FP       TPR       FPR      LR
R Base                                                      684.80    36.00    40.06%    4.98%    17.17%
σ Base                                                      38.44     9.62     1.47%     1.25%    0.05%
R Rc J1:1+1M∪M1                                             1364.60   111.80   79.88%    7.56%    36.22%
R Rc J1:1+1M∪M1 > 0.25                                      1293.20   47.20    75.70%    3.52%    28.78%
R Rc J1:1+1M∪M1 > 0.33                                      1202.00   41.40    70.36%    3.33%    26.40%
R Rc J1:1+1M∪M1 > 0.5                                       946.40    39.40    55.39%    4.00%    21.70%
R Rc J1:1+1M∪M1 > 0.66                                      787.80    37.60    46.10%    4.55%    18.88%
R Rc J1:1+1M∪M1 > 0.75                                      748.00    37.00    43.77%    4.71%    18.12%
R OR − Rc J1:1+1M∪M1                                        1285.20   88.00    75.23%    6.40%    32.87%
σ OR − Rc J1:1+1M∪M1                                        28.03     13.29    0.73%     0.88%    0.04%
ΔR (OR − Rc J1:1+1M∪M1) − Base                              600.40    52.00    35.17%    1.42%    15.70%
ΔR (OR − Rc J1:1+1M∪M1) − Rc J1:1+1M∪M1                     -79.40    -23.80   -4.65%    -1.16%   -3.35%
R OR − Rc J1:1+1M∪M1 > 0.25                                 1224.20   45.00    71.66%    3.54%    27.36%
σ OR − Rc J1:1+1M∪M1 > 0.25                                 24.53     8.40     0.97%     0.63%    0.02%
ΔR (OR − Rc J1:1+1M∪M1 > 0.25) − Base                       539.40    9.00     31.60%    -1.44%   10.19%
ΔR (OR − Rc J1:1+1M∪M1 > 0.25) − (Rc J1:1+1M∪M1 > 0.25)     -69.00    -2.20    -4.04%    0.03%    -1.42%
R OR − Rc J1:1+1M∪M1 > 0.33                                 1145.00   41.40    67.02%    3.49%    25.33%
σ OR − Rc J1:1+1M∪M1 > 0.33                                 28.40     7.70     0.93%     0.61%    0.03%
ΔR (OR − Rc J1:1+1M∪M1 > 0.33) − Base                       460.20    5.40     26.96%    -1.50%   8.16%
ΔR (OR − Rc J1:1+1M∪M1 > 0.33) − (Rc J1:1+1M∪M1 > 0.33)     -57.00    0.00     -3.33%    0.16%    -1.07%

Table 4.10: Comparison of OR − Rc J disambiguation systems against the Rc J disambiguation systems and the Base system

4.7.6 Summary of Analysis

In this section we looked at various ways to include household-related information in a similarity score. By employing the Jaccard measure as a similarity score between household-pairs, we developed four different Jaccard-based disambiguation techniques, built around the attributes found in a household, namely origin, religion and records. The disambiguation systems were as follows:

• AvgOR J - based upon the average of the origin and religion Jaccard scores.

• Rc J1:1 - record based Jaccard measure that only dealt with 1:1 links.

• Rc J1M∪M1 - record based Jaccard measure that only dealt with 1M ∪ M1 links.

• Rc J1:1+1M∪M1 - record based Jaccard measure that dealt with 1:1 and 1M ∪ M1 links.

The performance of each system was compared to the ORFilter disambiguation system and the Base system. All four disambiguation systems increased the TPR and LR relative to the Base system, with the Rc J1:1+1M∪M1 system producing the highest rates, by factors of 1.99 and 2.11, and AvgOR J producing the lowest, by factors of 1.28 and 1.35. The lowest FPR was produced by the Rc J1:1 system, which decreased the FPR by a factor of 1.34 relative to the Base system; all other FPRs were higher than the Base system's, with AvgOR J producing the highest, at a factor of 2.57.

We then looked at the effect of applying five different thresholds – 0.25, 0.33, 0.5, 0.66, and 0.75 – to the Rc J1:1, Rc J1M∪M1 and Rc J1:1+1M∪M1 systems. With thresholds, the Rc J1M∪M1 and Rc J1:1+1M∪M1 systems saw a decrease in their FPR, producing FPRs ranging from 0.04% to 1.66% below the Base system's FPR (4.98%). Due to the overlap of the standard deviations for the FPR, we cannot conclude that one threshold is better than the others in terms of the FPR. However, we can state that the use of thresholds as a whole made a significant difference in the FPR between threshold and non-threshold systems (excluding the Rc J1:1 system), and that Rc J1:1+1M∪M1 with a threshold of 0.25 produced the highest LR and TPR while keeping the FPR below 5%.

Finally, we explored the change in performance of the Rc J1:1+1M∪M1 and Rc J1:1+1M∪M1 > threshold systems when applying the OR filter (Section 4.6.2) to the results. Overall, the inclusion of the OR filter decreased the TPR and LR for all systems, and increased the FPR (excluding the Rc J1:1+1M∪M1 system).

Overall, the inclusion of household information successfully produces disambiguation systems that meet our requirements, i.e., an FP rate below 5% and a TPR and LR higher than the Base system's. In the next section, we examine the bias present in the best-performing disambiguation systems, compared to the distributions present in the original 1871 census.

4.8 Bias of Disambiguation Methods

The overall purpose of a historical record-linkage system is to produce 1:1 links between census data sets that can be used by social scientists. However, bias can occur when the 1:1 links between the census data sets are not representative of the records found in the original census. In this section, we investigate and report the biases present in the links produced by two different disambiguation methods: Rc J1:1+1M∪M1 > 0.33, which gave the lowest FP rate, and Rc J1:1+1M∪M1 > 0.25, which gave the highest linkage rate and highest TP rate while keeping the FP rate below 5%.

Each figure below shows the distributions of the 1871 census, the Base system, and the Rc J1:1+1M∪M1 > 0.33 (0.33T) and Rc J1:1+1M∪M1 > 0.25 (0.25T) disambiguation systems, based on the attributes used throughout this thesis.
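Each of these comparisons boils down to contrasting the frequency of an attribute value among the linked records with its frequency in the full census. A minimal sketch of that computation (plain Python; the data here is hypothetical, not taken from the censuses):

```python
from collections import Counter

def distribution(values):
    """Percentage share of each attribute value in a list."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: 100.0 * c / total for v, c in counts.items()}

def representation_bias(census_values, linked_values):
    """Percentage-point difference between each value's share among
    the linked records and its share in the full census; a positive
    number means the value is over-represented among the links."""
    census = distribution(census_values)
    linked = distribution(linked_values)
    return {v: linked.get(v, 0.0) - share for v, share in census.items()}

census = ["F", "M", "F", "M"]   # 50% female, 50% male
linked = ["M", "M", "M", "F"]   # 25% female, 75% male
shares = representation_bias(census, linked)
# females under-represented by 25 points, males over-represented by 25
```

A system with no bias on an attribute would produce differences near zero for every value of that attribute.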

Figure 4.15 shows the distribution of females to males in the 1871 census compared to the other three record linkage systems. Overall, the three systems under-represent females and over-represent males. To understand where this bias comes from, we break down the gender attribute by age and marriage status, as shown in Figs. 4.16 and 4.17, respectively.

Figure 4.15: Distribution of gender in 1:1 links compared to 1871 Canadian census

Figure 4.16: Distribution of age in 1:1 links compared to 1871 Canadian census

Figure 4.17: Distribution of marriage status in 1:1 links compared to 1871 Canadian census

The under-representation of females is most prominent for single females between the ages of 15-24. This bias is to be expected, since single females aged 15-24 are more likely to marry over the next 10 years, and therefore change their last names, making it impossible to link them with the current record linkage systems. Using the disambiguation record-linkage systems decreases the bias towards single females, bringing the links closer to being representative of the 1871 census.

In the Base system, males are over-represented in all age and marriage brackets, with the slight exception of males between the ages of 15-24. Using the disambiguation record-linkage systems increases the number of linked males between the ages of 0-14, as well as the number of married males present.

Figures 4.18, 4.19 and 4.20 show the distributions of birthplace, origin and religion, respectively, in the three record linkage systems, compared to the 1871 census.

Figure 4.18: Distribution of birthplace in 1:1 links compared to 1871 Canadian census

All three record linkage systems under-represent records with “Quebec” as their birthplace, with the two disambiguation record linkage systems also over-representing records with “Ontario” as their birthplace, by roughly 5.77%.

Figure 4.19: Distribution of origin in 1:1 links compared to 1871 Canadian census

Figure 4.20: Distribution of religion in 1:1 links compared to 1871 Canadian census

The distributions of origin and religion show that the three record linkage systems under-represent records with French origins and over-represent English origins, as well as under-represent Catholics and over-represent Protestants. The under-representation of French origin is expected, due to the variations present in the spelling of first and last names.

Note that there could be many underlying biases that are not shown by the figures, e.g., a bias towards people who do not move around or change family groups (since we deal with household data); in this section we focused only on the bias present in the basic attributes.

Chapter 5

Conclusions

Due to the limited number of attributes used to describe a record in the census data sets, there are groups of records between the censuses that are extremely similar, if not identical, making the task of correctly linking these ambiguous records extremely difficult. At the University of Guelph, the People-in-Motion group has constructed a record linkage system (Base) to link the 1871 Canadian census to the 1881 Canadian census. The techniques explored in this thesis aim to improve upon the linkage rate of the Base system, while maintaining a false positive rate no greater than 5%, by resolving the ambiguous links created by the system.

5.1 Summary

Prior to investigating disambiguation techniques, we compared the performance of the KNN and NB classification techniques to that of the SVM Base system. Overall, the KNN and NB classifiers produced false positive rates above the acceptable threshold of 5% (10.99% and 14.71%, respectively) and linkage rates below the Base system's (14.54% and 8.11%, respectively). The use of KNN and NB was found not to improve upon the Base system's results, and the remainder of this thesis therefore builds on the Base system.

To deal with the task of resolving the ambiguous links created by the Base system, we proposed a novel disambiguation record-linkage system that includes two extra steps compared to the Base system. In this system, each 1M ∪ M1 link produced is given a score based upon a similarity measure, and the 1M ∪ M1 links are then disambiguated using an algorithm that picks the highest-scoring 1M ∪ M1 link as the match out of its respective 1M/M1 group. This type of algorithm relies heavily on the use of a good similarity measure; therefore, the core of this thesis was an exploration of various measures and their performance.
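The greedy step described above — keep, within each 1M/M1 group, only the highest-scoring link — can be sketched as follows. This is an illustrative sketch, not the thesis code; the tuple layout and the handling of ties and of conflicts between the two passes are assumptions:

```python
def disambiguate(links):
    """Resolve 1M/M1 ambiguity by keeping, for each record on
    either side, only its highest-scoring link.

    `links` is a list of (id_1871, id_1881, score) tuples; the
    result contains at most one link per 1871 record and at most
    one per 1881 record.
    """
    # Pass 1: for each 1871 record, keep its best link (1M groups).
    best_left = {}
    for a, b, s in links:
        if a not in best_left or s > best_left[a][2]:
            best_left[a] = (a, b, s)
    # Pass 2: for each 1881 record, keep its best link (M1 groups).
    best = {}
    for a, b, s in best_left.values():
        if b not in best or s > best[b][2]:
            best[b] = (a, b, s)
    return sorted(best.values())

links = [(1, 10, 0.3), (1, 11, 0.8), (2, 11, 0.4)]
# Record 1 keeps its 0.8 link to record 11; record 2's weaker link
# to the same 1881 record loses to it.
print(disambiguate(links))  # [(1, 11, 0.8)]
```

The quality of the output clearly hinges on the scores being meaningful, which is why the similarity measure itself was the focus of the thesis.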

We first tried a similarity measure that was already available for each 1M ∪ M1 link: the probability assigned by the classifier (Prob). Using this measure increased the TPR and LR by factors of 1.07 and 1.28, respectively, relative to the Base system. However, the FPR was also increased, by a factor of 3.88. This showed that a similarity measure with new information was needed, and we therefore explored the performance of three new attributes, origin, religion and household, when included in various ways in a similarity measure.

The incorporation of the origin and religion attributes was explored first, with the following disambiguation systems tested:

• COR - incorporated origin, religion and the classifier probability (Section 4.4) into a similarity measure.

• ORFilter - constructed a rule, based on the origin and religion attributes, to filter out links in the 1M ∪ M1 link set.

• ORMatch - used the previously constructed rule on the 1M ∪ M1 links and disambiguated the remaining links with the Prob score.

The inclusion of the origin and religion attributes helped to disambiguate the 1M ∪ M1 links, and therefore increased the TPR and LR by a factor of 1.13 to 1.22 and 1.15 to 1.48 respectively, relative to the Base system. However, each system also increased the FPR by a factor of 1.44 to 4.30.

With respect to the Prob system, the ORFilter and ORMatch systems decreased the FPR by factors of 2.69 and 1.08, respectively. From these results, we argue that the inclusion of extra attributes in the disambiguation system does improve the FPR, compared to a disambiguation system that does not utilize extra attributes.

That led us to explore the use of another extra attribute, household. The household-based disambiguation systems were as follows:

• AvgOR J - used origin and religion within household-pairs and the Jaccard measure.

• Rc J1:1 - used records within household-pairs and based the intersection of the Jaccard measure on the 1:1 links.

• Rc J1M∪M1 - used records within household-pairs and based the intersection of the Jaccard measure on the 1M ∪ M1 links.

• Rc J1:1+1M∪M1 - used records within household-pairs and based the intersection of the Jaccard measure on the 1:1 and 1M ∪ M1 links.

All four household based disambiguation systems increased the TPR and LR, relative to the Base system, with the Rc J1:1+1M∪M1 system producing the highest rates by a factor of 1.99 and 2.11, respectively. The lowest FPR was produced by the Rc J1:1 system, which decreased the FPR by a factor of 1.34, relative to the Base system; all other FPRs produced were higher than the Base system with AvgOR J producing the highest at a factor of 2.57.

We then looked at the effect of applying five different thresholds – 0.25, 0.33, 0.5, 0.66, and 0.75 – to the Rc J1:1, Rc J1M∪M1 and Rc J1:1+1M∪M1 systems. With thresholds, the Rc J1M∪M1 and Rc J1:1+1M∪M1 systems saw a decrease in their FPR, producing FPRs ranging from 0.04% to 1.66% below the Base system's FPR (4.98%). Due to the overlap of the standard deviations for the FPR, we cannot conclude that one threshold is better than the others in terms of the FPR. However, we can state that the use of thresholds as a whole made a significant difference in the FPR between threshold and non-threshold systems (excluding the Rc J1:1 system), and that Rc J1:1+1M∪M1 with a threshold of 0.25 produced the highest LR and TPR while keeping the FPR below 5%.

In conclusion, the use of household information based upon a threshold version of the 1:1 and 1M ∪ M1 links was successful in producing higher TPRs and LRs, while keeping the FPR below 5%. The increase in the TPR and LR ranged from 1.06 to 1.68 and 1.09 to 1.89 times that of the Base system, respectively, with the FPR decreasing by a factor of 1.05 to 1.50 relative to the Base system.

5.2 Future Work

The techniques explored throughout this thesis provide motivation for further exploration of record linkage techniques for census data. Even though this thesis presented disambiguation record-linkage systems capable of producing high-quality links, there are still areas of possible improvement, which apply to various steps of the record linkage approach.

Improvements are summarized below:

• The training data associated with this pair of Canadian data sets is minuscule compared to the number of records we need to link. Exploration of the techniques used in creating seeded training data, discussed in Section 3.3, could result in a higher number of expert links that could be applied to this application.

• In terms of the classification model chosen, the links found at the intersection of the three classification models explored produced high quality links, with the FPR being roughly 3%. Investigation into the performance of hybrid classification systems could potentially be helpful to produce a classification system with an extremely low starting FPR.

• Our attempt at using origin and religion data to produce acceptable disambiguation systems was unsuccessful, due to our assumption that record-pairs with matching origin and religion attributes would have a higher probability of being true matches. More exploration of the origin and religion attributes should be done, especially into whether the use of origin and religion on their own would be more helpful.

• Further exploration of variations of the Avg J disambiguation system should be done, as the version used in this thesis is extremely basic and does not take into account the number of records in a household. For example, a household with three English people is currently equivalent to a household with ten English people. A version of the Avg J disambiguation system that incorporates the difference in household sizes could produce better results.
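One possible variation along these lines is sketched below: average the origin and religion Jaccard scores as before, then damp the result by a size-confidence factor so that agreement observed in larger households counts for more than agreement in tiny ones. This is a hypothetical design, not something evaluated in the thesis; the weighting scheme and the `max_size` cap are assumptions.

```python
def weighted_avg_jaccard(j_origin, j_religion, size_a, size_b,
                         max_size=12):
    """Size-aware variant of the Avg J score.

    Averages the origin and religion Jaccard scores, then scales by
    a confidence factor that grows with the combined household size.
    `max_size` caps the weight per household; roughly 99% of
    households in the censuses have at most 12 members.
    """
    avg = (j_origin + j_religion) / 2.0
    confidence = min(size_a + size_b, 2 * max_size) / (2 * max_size)
    return avg * confidence

# Two fully agreeing 3-person households now score lower than two
# fully agreeing 10-person households: 0.25 versus roughly 0.83.
small = weighted_avg_jaccard(1.0, 1.0, 3, 3)
large = weighted_avg_jaccard(1.0, 1.0, 10, 10)
```

Whether a linear confidence factor is the right damping curve is itself an open question; the point is only that household size becomes part of the score.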

The longitudinal data created through record linkage techniques would be used by re- searchers to investigate historical trends and to address questions about society, history and economy. One line of research will examine migration, social mobility, labour market adjustment and intergenerational inequality. A second area for investigation with longi- tudinal data is the determinants of individual health - the ways in which family origin or early life circumstance affect adult health and the influence of life experience on aging. Re- searchers in history and the social sciences are waiting for longitudinal data of this nature in order to resolve pressing research questions [1]. The potential to use longitudinal and multi-generational data to understand genetic and epigenetic pathways, and therefore to distinguish environmental from genetic influences, is not yet feasible but eventually will become so through the generation of mass longitudinal data as envisaged by this thesis.

Bibliography

[1] L. Antonie, K. Inwood, D. J. Lizotte, and J. A. Ross. Tracking people over time in 19th century Canada. Submitted to the Springer Machine Learning Journal, http://www.springer.com/computer/ai/journal/10994.

[2] R. Baxter, P. Christen, and T. Churches. A comparison of fast blocking methods for record linkage. In Proceedings of the KDD-2003 Workshop on Data Cleaning, Record Linkage, and Object Consolidation, pages 25–27, Washington, DC, 2003.

[3] M. Bilenko, B. Kamath, and R. J. Mooney. Adaptive blocking: Learning to scale up record linkage. In Proceedings of the Sixth International Conference on Data Mining, ICDM ’06, pages 87–96, Washington, DC, USA, 2006. IEEE Computer Society.

[4] M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg. Adaptive name matching in information integration. IEEE Intelligent Systems, 18:16–23, September 2003.

[5] Minnesota Population Center. North Atlantic Population Project - origins, August 2013. https://www.nappdata.org/napp-action/variables/ORIGIN#codes_section visited 29/08/2013.

[6] Minnesota Population Center. North Atlantic Population Project - religion, August 2013. https://www.nappdata.org/napp-action/variables/RELIGION#codes_section visited 29/08/2013.

[7] O. Charif, H. Omrani, O. Klein, M. Schneider, and P. Trigano. A method and a tool for geocoding and record linkage. In Geoscience and Remote Sensing (IITA-GRS), 2010 Second IITA International Conference on, volume 1, pages 356–359, aug. 2010.

[8] P. Christen. A comparison of personal name matching: Techniques and practical issues. In ICDM Workshops, pages 290–294, 2006.

[9] P. Christen. Automatic record linkage using seeded nearest neighbour and support vector machine classification. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’08, pages 151–159, New York, USA, 2008. ACM.

[10] P. Christen. Febrl: an open source data cleaning, deduplication and record linkage system with a graphical user interface. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’08, pages 1065–1068, New York, NY, USA, 2008. ACM.

[11] P. Christen. A survey of indexing techniques for scalable record linkage and dedupli- cation. IEEE Transactions on Knowledge and Data Engineering, 99, 2011.

[12] P. Christen. Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Data-centric systems and applications. Springer, 2012.

[13] P. Christen and K. Goiser. Quality and complexity measures for data linkage and deduplication. In Quality Measures in Data Mining, pages 127–151. 2007.

[14] T. M. Cover and P. E. Hart. Nearest neighbor pattern classification. IEEE Transactions in Information Theory, IT-13(1):21–27, 1967.

[15] J. Cranfield, K. Inwood, G. Morton, L. Antonie, and A. Ross. People in Motion: Longitudinal data from historical sources, February 2012. http://www.people-in-motion.ca/ visited 29/08/2013.

[16] D. Dey, V. Mookerjee, and D. Liu. Efficient techniques for online record linkage. IEEE Trans. on Knowl. and Data Eng., 23(3):373–387, March 2011.

[17] E. Dimitriadou, K. Hornik, F. Leisch, D. Meyer, and A. Weingessel. e1071: Misc Functions of the Department of Statistics (e1071), TU Wien, 2011. R package version 1.6.

[18] M. Ektefa, F. Sidi, H. Ibrahim, M.A. Jabar, S. Memar, and A. Ramli. A threshold- based similarity measure for duplicate detection. In Open Systems (ICOS), 2011 IEEE Conference on, pages 37 –41, sept. 2011.

[19] M. G. Elfeky, A. K. Elmagarmid, and V. S. Verykios. TAILOR: A record linkage tool box. In ICDE, pages 17–28, 2002.

[20] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Trans. on Knowl. and Data Eng., 19:1–16, January 2007.

[21] I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64:1183–1210, 1969.

[22] Z. Fu, P. Christen, and M. Boot. Automatic cleaning and linking of historical census data using household information. In ICDM Workshops, pages 413–420, 2011.

[23] Z. Fu, J. Zhou, P. Christen, and M. Boot. Multiple instance learning for group record linkage. In PAKDD (1), pages 171–182, 2012.

[24] R. Goeken, L. Huynh, T. A. Lynch, and R. Vick. New methods of census record linking. Historical Methods: A Journal of Quantitative and Interdisciplinary History, 44(1):7–14, 2011.

[25] K. Goiser and P. Christen. Towards automated record linkage. In Proceedings of the fifth Australasian conference on Data mining and analystics - Volume 61, AusDM ’06, pages 23–31, Darlinghurst, Australia, 2006. Australian Computer Society, Inc.

[26] M. Gollapalli, Xue Li, I. Wood, and G. Governatori. Approximate record matching using hash grams. In Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on, pages 504–511, dec. 2011.

[27] L. Gu, R. Baxter, D. Vickers, and C. Rainsford. Record linkage: Current practice and future directions. Technical Report 03/83, CSIRO Mathematical and Information Sciences, 2003.

[28] J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 3rd edition, 2011.

[29] C. Hsu, C. Chang, C. Lin, et al. A practical guide to support vector classification, 2003.

[30] K. Inwood. The 1871 census in Scotland and Canada, February 2012. http://www. census1871.ca visited 29/08/2013.

[31] D. Lewis. Naïve (Bayes) at forty: The independence assumption in information retrieval. In Proc. of ECML, pages 4–15, 1998.

[32] A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In KDD, pages 169–178, 2000.

[33] M. Michelson and C. A. Knoblock. Learning blocking schemes for record linkage. In Proceedings of the 21st national conference on Artificial intelligence - Volume 1, AAAI’06, pages 440–445. AAAI Press, 2006.

[34] University of Montreal. The 1852 and 1881 historical census of Canada, February 2012. http://www.prdh.umontreal.ca/census/en/main.aspx visited 29/08/2013.

[35] F. Naumann and M. Herschel. An Introduction to Duplicate Detection. Morgan and Claypool Publishers, 2010.

[36] B. On, N. Koudas, D. Lee, and D. Srivastava. Group linkage. In Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on, pages 496 –505, april 2007.

[37] L. Philips. Hanging on the metaphone. Computer Language Magazine, 7(12):39–44, December 1990.

[38] L. Philips. The double metaphone search algorithm. C/C++ Users J., 18(6):38–43, June 2000.

[39] K. Qin, Y. Yang, S. Zhen, and W. Liu. A unified record linkage strategy for web service data. In Knowledge Discovery and Data Mining, 2010. WKDD ’10. Third International Conference on, pages 253 –256, jan. 2010.

[40] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2012. ISBN 3-900051-07-0.

[41] I. Rish. An empirical study of the naive bayes classifier. IJCAI 2001 workshop on empirical methods in artificial intelligence, 3(22):41–46, 2001.

[42] K. Schliep and K. Hechenbichler. kknn: Weighted k-Nearest Neighbors, 2011. R package version 1.1-1.

[43] M. Schraagen. Complete coverage for approximate string matching in record linkage using bit vectors. In Tools with Artificial Intelligence (ICTAI), 2011 23rd IEEE International Conference on, pages 740–747, nov. 2011.

[44] SHARCNET. The shared hierarchical academic research computing network, April 2012. https://www.sharcnet.ca visited 29/08/2013.

[45] W. Su, J. Wang, and F. H. Lochovsky. Record matching over query results from multiple web databases. IEEE Trans. on Knowl. and Data Eng., 22(4):578–589, April 2010.

[46] P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison-Wesley, 2005.

[47] M. Tromp, A. C. Ravelli, G. J. Bonsel, A. Hasman, and J. B. Reitsma. Results from simulated data sets: probabilistic record linkage outperforms deterministic record linkage. Journal of Clinical Epidemiology, 64(5):565–572, 2011.

[48] Z. W. Tun and N. Thein. An approach of standardization and searching based on hierarchical bayesian clustering (HBC) for record linkage system. In Creating, Connecting and Collaborating through Computing, 2007. C5 '07. The Fifth International Conference on, pages 54–60, January 2007.

[49] V. N. Vapnik. The nature of statistical learning theory. Springer Verlag, Heidelberg, DE, 1995.

[50] V. S. Verykios and A. K. Elmagarmid. Automating the approximate record matching process. Information Sciences, 126:83–98, 1999.

[51] W. E. Winkler. Overview of record linkage and current research directions. Technical report, U.S. Census Bureau, 2006.

[52] M. Yakout, A. K. Elmagarmid, H. Elmeleegy, M. Ouzzani, and A. Qi. Behavior based record linkage. Proc. VLDB Endow., 3(1-2):439–448, September 2010.

Appendix A

Comparison of Classification Techniques

Before exploring ways to resolve 1M ∪ M1 links, we investigate and compare the performance of three previously introduced classifiers - Support Vector Machine (SVM), KNN and NB - when used for Canadian census data record linkage. This investigation is necessary to ensure that the most effective classification method is chosen for this particular application. In addition, we test different feature sets and explore their impact on the performance of the three classifiers.

An overview of the experimental setup is detailed in Section A.2, with Sections A.2.1 to A.2.3 presenting the setup and tuning of each of the classifiers in question. Sections A.3 and A.4 detail the results and analysis, respectively. Finally, Section A.5 presents the use of different feature sets and how they affect the performance of the classifiers.

A.1 Evaluation

To evaluate the performance of the classifiers, three other evaluation measures are used, along with the linkage rate and false positive rate (FPR) discussed in Sec. 2.3.

Precision = TP / (TP + FP)    (A.1)

Precision shows the fraction of true positives out of all the positive instances classified. The higher the precision value, the lower the overall number of false positive errors that the classifier will have made. This measurement alone does not give us any information regarding how many true positives the classifier has found out of all the true positives in the testing set.

Recall = TP / (TP + FN)    (A.2)

The Recall evaluation measure addresses this limitation, as it shows the fraction of the true positives that the classifier has found, out of all the true positives in the testing set. It illustrates the classifier's ability to recall positive classes.

Both the precision and recall measurements can be combined into one measure, known as the F-measure, shown in Eq. A.3. The general F-measure can be biased towards precision or recall, depending on the problem at hand, by setting β accordingly. Setting β to 1 gives the traditional F1-measure, shown in Eq. A.4, which balances the two biases. This is the primary measurement used when evaluating the classifiers in this appendix.

F_β-measure = (1 + β²) ∗ (P ∗ R) / ((β² ∗ P) + R)    (A.3)

F1-measure = 2 ∗ (P ∗ R) / (P + R)    (A.4)

The other two evaluation measures are the false negative rate (FNR) and the cut rate. The FNR, Eq. A.5, corresponds to the percentage of links that are matches but have been labelled as non-matches by the system. The cut rate, Eq. A.6, is the percentage of "positive" links produced by the system that are deemed ambiguous and are therefore put into the 1M ∪ M1 link set.

FNR = FN / (TP + FN)    (A.5)

Cut rate = (# of ambiguous links) / ((# of 1:1 links) + (# of ambiguous links))    (A.6)
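The measures above follow directly from the confusion-matrix counts. As a minimal illustration (a Python sketch with illustrative names, not part of the thesis implementation, which is written in R):

```python
def evaluation_measures(tp, fp, fn, n_ambiguous, n_one_to_one):
    """Compute the evaluation measures of Eqs. A.1, A.2, A.4, A.5 and A.6."""
    precision = tp / (tp + fp)                   # Eq. A.1
    recall = tp / (tp + fn)                      # Eq. A.2
    f1 = 2 * precision * recall / (precision + recall)  # Eq. A.4
    fnr = fn / (tp + fn)                         # Eq. A.5: missed matches
    cut_rate = n_ambiguous / (n_one_to_one + n_ambiguous)  # Eq. A.6
    return {"precision": precision, "recall": recall,
            "f1": f1, "fnr": fnr, "cut_rate": cut_rate}

def f_beta(precision, recall, beta):
    """General F-measure of Eq. A.3; beta > 1 weights recall more heavily."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

Note that with β = 1, `f_beta` reduces to the F1-measure of Eq. A.4.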

A.2 Experimental Setup

The classification step for the 1871 and 1881 Canadian censuses is binary, with record-pairs being labelled as a match or a non-match. Support Vector Machines [49], K-Nearest Neighbour [14] and Naive Bayes [31] were chosen to be evaluated since each method is based on a different classification technique that complements the binary nature of the problem. Details of the steps performed by the record-linkage system prior to the classification step can be found in subsections 2.5.1 to 2.5.2.

The training set for each classification method consists of 62,250 feature vectors, created by the Cartesian product of the expert data set ONProp. The testing set for each classifier consists of 3385 expert links, taken from the combination of the Logan, St James and Les Boys expert sets. Details of the expert data sets and an explanation of training and testing sets can be found in Sec. 2.4 and subsection 2.1.3, respectively.

Each classification method was constructed using the R [40] language, due to its rich functions and excellent documentation for the classification methods being tested. Before the SVM and KNN classification systems are trained, they are tuned to determine which classification parameters will give the best results. Once each classification system is tuned

and trained, the feature vector set (F-B) from the comparison step is given to the classifier to label. Training and tuning details of each classifier are discussed in subsections A.2.1 to A.2.3.

Since there are 90,178,727 feature vectors from the comparison step, serial farming on SHARCNET [44] is employed to reduce the total runtime required to perform the classification step. Without SHARCNET, the overall CPU time of running all three classifiers for this data set would have been roughly 3 years. Overall, each classifier is used once to label the given F-B feature vector set, with the results being evaluated using the given testing set.
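Serial farming amounts to partitioning the feature vectors into independent chunks, one per cluster job, since each record-pair is classified independently. A hedged sketch of the partitioning (chunk count and sizes are illustrative, not the layout actually used on SHARCNET):

```python
import math

def chunk_bounds(n_vectors, n_jobs):
    """Split n_vectors into n_jobs contiguous [start, end) ranges,
    one per serial-farm job; the last chunk absorbs the remainder."""
    size = math.ceil(n_vectors / n_jobs)
    return [(i, min(i + size, n_vectors))
            for i in range(0, n_vectors, size)]

# e.g. the 90,178,727 comparison-step vectors over 1,000 jobs
# gives chunks of at most 90,179 vectors each
bounds = chunk_bounds(90_178_727, 1_000)
```

Each job then classifies only its own range, and the labelled outputs are concatenated afterwards.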

A.2.1 Support Vector Machine

The main premise of an SVM, as discussed in subsection 2.2.1, is to find a hyperplane h in the training space that best discriminates between the classes in the training data. This is done by maximizing the margin between h and the closest class points, known as support vectors. Once this hyperplane is found, new objects are classified based on which side of the hyperplane they fall on. Details on the syntax of the R SVM function can be found in the R package 'e1071' [17].

The RBF kernel was chosen for the SVM, as it has been shown to give good general performance [29]; therefore, the parameters that the SVM requires are cost (C) and gamma (γ). C defines the trade-off of misclassification and γ defines how much influence a single training point has. An R tuning function is used to find the best values of C and γ for this training set, with the final values being 64 and 8, respectively.

A.2.2 K-Nearest Neighbour

The KNN classifier, as discussed in subsection 2.2.2, classifies a new instance by gathering the K nearest training records to the new instance, based on a given distance function, and assigning the new instance the majority class out of the K training records. In this case we

use a standard KNN that is based on Euclidean distance. Details of the R KNN function syntax can be found in the R package 'kknn' [42]. An R tuning function is used to find the best value of K for this training set, with the final value being 5.
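The majority-vote rule over Euclidean distance can be sketched as follows (a stdlib-Python illustration of the general KNN technique with toy record-pair features; the thesis itself uses the R 'kknn' package):

```python
import math
from collections import Counter

def knn_classify(train, query, k=5):
    """train: list of (feature_vector, label) pairs; query: feature vector.
    Returns the majority label among the k Euclidean-nearest records."""
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], query))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

# toy record-pair feature vectors: high similarity scores -> "match"
train = [([0.90, 0.80], "match"), ([0.95, 0.90], "match"),
         ([0.85, 0.90], "match"), ([0.10, 0.20], "non-match"),
         ([0.20, 0.10], "non-match"), ([0.15, 0.30], "non-match")]
label = knn_classify(train, [0.90, 0.85], k=5)
```

With K = 5, the query above finds three "match" and two "non-match" neighbours, so it is labelled a match.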

A.2.3 Naive Bayes

The NB classifier, as discussed in subsection 2.2.3, assumes the attributes in a feature vector are independent of one another and uses Bayes' rule to compute the probabilities of each class, given the independent attributes. Syntax for the R NB function can be found in the R package 'e1071'.
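For illustration, the independence assumption and Bayes' rule can be sketched with a Bernoulli Naive Bayes over binary features (a stdlib-Python sketch with Laplace smoothing, not the thesis code, which uses the R 'e1071' implementation):

```python
import math
from collections import Counter, defaultdict

def train_nb(records):
    """records: list of (binary_feature_tuple, label) pairs.
    Counts class frequencies and per-feature 'on' counts per class."""
    class_counts = Counter(label for _, label in records)
    feat_counts = defaultdict(lambda: defaultdict(int))
    for feats, label in records:
        for i, v in enumerate(feats):
            feat_counts[label][i] += v
    return class_counts, feat_counts, len(records)

def classify_nb(model, feats):
    """Pick the class maximizing log P(class) + sum_i log P(feat_i | class),
    i.e. Bayes' rule under the attribute-independence assumption."""
    class_counts, feat_counts, n = model
    best, best_lp = None, -math.inf
    for label, c in class_counts.items():
        lp = math.log(c / n)                          # class prior
        for i, v in enumerate(feats):
            p1 = (feat_counts[label][i] + 1) / (c + 2)  # Laplace smoothing
            lp += math.log(p1 if v else 1 - p1)
        if lp > best_lp:
            best, best_lp = label, lp
    return best
```

The smoothing keeps unseen feature/class combinations from producing zero probabilities.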

A.3 Results

The results achieved by each classifier are presented in this section. Details on the evaluation measures used can be found in Section A.1. As discussed in Sec. 2.5.3, only the positive one-to-one links produced by the classification step are used for evaluation, with all other positive links being discarded due to their ambiguity.

Table A.1: F-B data set results across three different classifiers

Table A.1 shows the various results for SVM, KNN and NB, with the best results bolded for each evaluation measure. SVM produces the best linkage rate, F1-measure and cut rate

at 16.29%, 39.51% and 94.62%, while KNN produces the best FPR, at 10.99%, and NB produces the best FNR, at 1.92%. The difference in performance between SVM and KNN ranges from 0.17% to 2.68%, whereas between SVM/KNN and NB the difference ranges from 3.25% to 17.18%.

Figure A.1: Intersection of 1:1 links between classifiers for F-B data set

       % Unique (Total 1:1 Links)
SVM    18.31% (567,899)
KNN     6.60% (497,370)
NB     28.17% (280,795)

Table A.2: Percentage of total links unique to each classifier for F-B data set

The intersection of the 1:1 links between the three classifiers is shown in Fig. A.1, with

Table A.2 showing the percentage of 1:1 links unique to each classifier (i.e., links that do not overlap with another classifier). The intersection of SVM-KNN produces the highest intersection of links, at 273,898, with the intersections of NB-KNN and NB-SVM producing the smallest, at 13,112 and 3487, respectively. NB is shown to produce the most unique links, at 28.17%, whereas KNN is shown to produce the fewest, at 6.60%.

Table A.3: Intersection of testing set between classifiers for F-B data set

The intersection of the True Positives (TP), False Positives (FP) and False Negatives (FN) between all three classifiers is shown in Table A.3. The links unique to NB produce a higher number of FP links compared to TP links, with the links found at the intersection of all three classifiers producing the lowest ratio of FP to TP links. SVM and KNN share the highest overlap in TP links.

A.4 Analysis

From the results shown in the previous section we can see that NB does not perform as well as SVM and KNN, and that it should not be used for this specific application. Overall, NB produces the worst results in linkage rate, cut rate, F1-measure and FP rate. Even though NB produced the lowest FNR (1.92%), this can be attributed to the fact that most of its TP links were deemed ambiguous positive links, and were therefore discarded.

Table A.3 shows that the quality of the links unique to NB is very low, with the number of FP links produced (52) being higher than the number of TP links produced (45). Surprisingly, the intersection of the three classifiers produced high-quality links, with an FPR of roughly 3%; these links could be useful for studies that have a low tolerance for FPs.

Similar results were produced between the SVM and KNN classifiers for all the evaluation metrics. KNN outperformed SVM in FPR by 0.17%, but in all other cases SVM outperformed KNN by 1.75% or more. The similarity between the two classifiers can also be seen when evaluating the intersection of links, shown in Fig. A.1 and Table A.3, with SVM and KNN producing the highest overlap of links.

Overall, we chose to proceed with the SVM classification method, since it outperformed KNN with respect to linkage rate, cut rate and F1-measure, and was extremely close to the FPR of KNN.

A.5 Testing the Effect of Different Feature Sets on Classifier Performance

In addition to exploring the performance of the classification systems, we consider whether different feature sets affect the classification systems' performance in a significant way. To do this we tune and train the same three classifiers, SVM, KNN and NB, but this time we use two different feature vector sets.

In feature vector set one (F-1), only one extra feature is introduced to the baseline set (F-B). Here, a feature is added to give more weight to feature vectors where the first and last name have high similarities. This is modelled by multiplying together the Jaro-Winkler distances for the first and last name.
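The combined feature is simply the product of the two name similarities. A sketch using a textbook Jaro-Winkler implementation (an illustration only, not the thesis code; function names are ours):

```python
def jaro(s1, s2):
    """Plain Jaro similarity (textbook formulation)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(len1, len2) // 2 - 1
    m1, m2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):            # count matching characters
        for j in range(max(0, i - window), min(i + window + 1, len2)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    t, k = 0, 0                           # count half-transpositions
    for i in range(len1):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Jaro-Winkler: boosts pairs sharing a common prefix (up to 4 chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def f1_name_feature(first_a, first_b, last_a, last_b):
    """The extra F-1 feature: product of the two name similarities."""
    return jaro_winkler(first_a, first_b) * jaro_winkler(last_a, last_b)
```

The product is high only when both names agree closely, which is exactly the weighting the F-1 feature is meant to capture.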

Feature vector set two (F-2) is included to help model the cases where a woman gets married and changes her last name. This is an important issue, since women who change marital status from single to married are usually harder to link, due to their last name changing between censuses.

We propose 7 new features that are added to the baseline set (F-B):

1. Female (F)

2. Marriage status 1871 == 1881

3. Marriage status 71-single to 81-married (S to M)

4. (F)*(S to M)*(Edit distance of last name)

5. (F)*(S to M)*(JW distance of last name)

6. (F)*(S to M)*(Edit distance DM-1 of last name)

7. (F)*(S to M)*(Edit distance DM-2 of last name)

The last four features are the most important, as they are specific indicators for females who got married.
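A sketch of how these 7 features could be assembled from the census attributes and pre-computed last-name distances (function and field names are illustrative, not the thesis code; DM-1 and DM-2 denote the two Double Metaphone codes):

```python
def f2_features(sex, status_1871, status_1881,
                lname_edit, lname_jw, lname_dm1_edit, lname_dm2_edit):
    """Build the 7 extra F-2 features. The four last-name distances
    (edit, Jaro-Winkler, and edit over DM-1/DM-2 codes) are assumed
    to be pre-computed by the comparison step."""
    female = 1 if sex == "F" else 0                               # feature 1
    same_status = 1 if status_1871 == status_1881 else 0          # feature 2
    s_to_m = 1 if (status_1871, status_1881) == ("S", "M") else 0 # feature 3
    gate = female * s_to_m     # fires only for women who married
    return [female, same_status, s_to_m,
            gate * lname_edit,       # feature 4
            gate * lname_jw,         # feature 5
            gate * lname_dm1_edit,   # feature 6
            gate * lname_dm2_edit]   # feature 7
```

Features 4 to 7 are non-zero only when the record-pair is a single-to-married female, so the name distances are considered precisely in the cases where a surname change is expected.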

A.5.1 Experimental Setup

The experimental setup for the classification methods is discussed in Sec. A.2. The only difference is the SVM and KNN parameters that are found by tuning, i.e., C, γ and K. Table A.4 shows the best values of C, γ and K for the F-1 and F-2 training sets.

       C    γ    K
F-1   64    8    5
F-2   16    8    6

Table A.4: Results from SVM and KNN tuning

A.5.2 Results

Table A.5 shows the results of the two new feature sets (F-1 and F-2), plus the results for the baseline feature set (F-B), with the best result over the three feature sets bolded for each classifier evaluation measure. F-1 and F-2 produce lower results for SVM and KNN in

terms of linkage rate and F1-measure, and higher results for the cut rate. The NB classifier, on the other hand, showed improved performance in all evaluation measurements, except the FNR, when the F-2 data set is used. The range between feature set results was below 5% in all classification evaluation cases.

Table A.5: SVM, KNN and NB results over three different feature sets, F-B, F-1 and F-2

The intersection of the returned 1:1 links between the three different feature sets is shown in Fig. A.2 for each classifier. In all cases, the biggest intersection of links was produced between F-B, F-1 and F-2, with the intersection of F-B and F-1 producing the next biggest set of links. The percentage of unique links made by each classifier and feature set pair is shown in Table A.6. F-2 produces the most unique links across all three classifiers, with the highest percentage of unique links, 19.51%, being produced when F-2 is paired with KNN.

Figure A.2: Intersection of 1:1 links between F-B, F-1, and F-2 for each classifier

       % Unique (Total 1-to-1 Links)
       SVM                  KNN                  NB
F-B     2.00% (567,899)      3.03% (506,995)      0.87% (282,795)
F-1     0.81% (560,333)      1.59% (500,464)      0.39% (281,601)
F-2    14.89% (496,400)     19.51% (473,515)     15.22% (285,929)

Table A.6: Percentage of total links unique to each feature set

A.5.3 Analysis

Similar results were produced between all cases of the F-B and F-1 feature sets for all evaluation metrics, with differences in performance no greater than 0.51%. The similarity

between the F-B and F-1 feature sets can also be seen when evaluating the intersection of links, shown in Fig. A.2, with F-B and F-1 having the second highest intersection. Since F-B and F-1 differ by only one attribute, it is no surprise that they are so close in performance.

F-2 affected the performance of the NB classifier across all the evaluation measures, as well as the FNR of all the classifiers. The change in performance of NB could be related to the addition of extra functionally dependent features when using the F-2 feature set, as NB has been shown to perform better with such features present [41].

The lower FNR of F-2 shows that the extra 7 attributes are helpful in pulling out the true links as matches, but since we are not seeing an increase in the F1-measure from the new links, most of these matches are being seen as ambiguous positive links and are being cut from the evaluation.

The similarity between all three feature sets can be seen in Fig. A.2, where the highest number of links is produced from the intersection of all three feature sets. Since the range between feature sets was below 5% in all classification evaluation cases, the addition of these extra feature sets did not help the overall performance of the classifiers, with the same classification system performing the best in all evaluation metrics regardless of the feature set used. Therefore, the F-B feature set will be used in the rest of this thesis.

A.6 Summary

In this appendix we investigated the performance of three different classification systems - SVM, KNN and NB - for the building of a record-linkage system using Canadian census data. Each classifier was tuned and trained using functions available in R, and run on SHARCNET to make use of serial farming. We also explored the effect of different feature sets on the performance of each classifier.

Results showed that the NB classifier is not suited to this application, as it gave the worst performance in all evaluation measures except the FNR. The SVM and KNN classifiers produce similar results, with SVM outperforming KNN in linkage rate, cut rate

and F1-measure, and only differing by 0.17% in FPR. SVM was chosen as an adequate classifier for this data set, and is used in the rest of this thesis. The use of additional feature sets did not produce a significant effect on the performance of the classification systems, and therefore we did not explore their use further.

Appendix B

1M ∪ M1 links between Household-pairs

When dealing with Rc J1M∪M1 and Rc J1:1+1M∪M1 there is an extra pre-processing step before the Jaccard measures can be applied to the group of links. It consists of removing all the 1M ∪ M1 links that appear within the same household-pair, and accounts for roughly 1.18% of the 1M ∪ M1 links.

An example of this is given in Fig. B.1, which shows three 1M ∪ M1 links, H, I and K, that appear between the same household-pair. Since I and K point to the same 1871 ID and potentially match two different records in the same 1881 household, they are removed from the overall 1M ∪ M1 link set during preprocessing.
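This pre-processing rule can be sketched as: group the 1M ∪ M1 links by (1871 record, 1881 household) and drop every link whose 1871 record points to more than one candidate inside that household (a Python sketch; field names and IDs are illustrative, not from the thesis data):

```python
from collections import defaultdict

def drop_within_household_links(links):
    """links: (id_1871, id_1881, hh_1881) tuples.
    Removes links where one 1871 record points to two or more
    candidate records inside the same 1881 household."""
    targets = defaultdict(set)
    for id71, id81, hh81 in links:
        targets[(id71, hh81)].add(id81)
    return [ln for ln in links if len(targets[(ln[0], ln[2])]) == 1]

# Fig. B.1 style example: I and K share an 1871 ID and point into
# the same 1881 household, so both are dropped; H survives.
links = [("A1", "B1", "hh9"),   # H
         ("A2", "B2", "hh9"),   # I
         ("A2", "B3", "hh9")]   # K
kept = drop_within_household_links(links)
```

In this toy example only link H remains after preprocessing, matching the behaviour described for Fig. B.1.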

Figure B.1: 1M ∪ M1 links between Household-pairs
