Disambiguating Multiple Links in Historical Record Linkage
by Laura Richards

A Thesis presented to The University of Guelph

In partial fulfilment of requirements for the degree of Master of Science in Computer Science

Guelph, Ontario, Canada
© Laura Richards, August, 2013

ABSTRACT

Disambiguating Multiple Links in Historical Record Linkage

Laura Richards
University of Guelph, 2013

Advisors: Dr. L. Antonie, Dr. G. Gréwal, Dr. S. Areibi

Historians and social scientists are very interested in longitudinal data created from historical sources, because such data creates opportunities for studying people's lives over time. However, generating it is a challenging problem, since historical sources do not have personal identifiers. At the University of Guelph, the People-in-Motion group has constructed a record linkage system to link the 1871 Canadian census to the 1881 Canadian census. In this thesis, we discuss one aspect of linking historical census data: the problem of disambiguating multiple links that are created at the linkage step. We show that the disambiguation techniques explored in this thesis improve upon the linkage rate of the People-in-Motion system, while maintaining a false positive rate no greater than 5%.

Acknowledgements

First, I would like to thank my advisors, Dr. L. Antonie, Dr. G. Gréwal and Dr. S. Areibi, for their assistance throughout my research. Without their knowledge and guidance, I would not have been able to learn as much as I have through my studies. I am also thankful to Dr. F. Song for his timely feedback and suggestions towards my thesis, and to Dr. K. Inwood for providing thoughtful insights into the record-linkage problem. I would also like to thank my parents and my friends for their unwavering support throughout my studies; your love and encouragement mean the world to me. And finally, a special thanks to Adrian and Jalisa for their ability to make me laugh, no matter what's happening.
Contents

1 Introduction
  1.1 PiM Record-Linkage System
  1.2 Thesis Statement
  1.3 Approach
  1.4 Organization
2 Background
  2.1 Record Linkage
    2.1.1 Data Pre-processing and Blocking
    2.1.2 Comparison
    2.1.3 Classification
    2.1.4 Evaluation
  2.2 Classification Methods
    2.2.1 Support Vector Machines
    2.2.2 K-Nearest Neighbours
    2.2.3 Naïve Bayes
  2.3 Evaluation
  2.4 Canadian Census
  2.5 Current Record Linkage (Base) System
    2.5.1 Blocking
    2.5.2 Comparison
    2.5.3 Classification
    2.5.4 Evaluation
3 Literature Review
  3.1 Blocking
  3.2 Comparison
  3.3 Record Linkage Classification
  3.4 Current Tools
  3.5 Current Record Linkage Systems for Census Data sets
  3.6 Summary
4 Disambiguation Record Linkage System
  4.1 Exploration of Classification Techniques
  4.2 Proposed Record Linkage System
  4.3 Multiple-Links Group-Disambiguation Algorithm
  4.4 Probability Score from the Classifier
  4.5 Extra Attributes
    4.5.1 Origin and Religion Attributes
    4.5.2 Household Attribute
  4.6 Using the Origin and Religion Attributes
    4.6.1 Integrated into a Similarity Measure
    4.6.2 Used as a Filter and paired with a Similarity Measure
    4.6.3 Summary of Analysis
  4.7 Using the Household Attribute
    4.7.1 Jaccard Coefficient
    4.7.2 Using Origin and Religion in the Jaccard Measure
    4.7.3 Current Links with Jaccard
    4.7.4 Applying Thresholds on Record based Jaccard
    4.7.5 Applying the Origin and Religion Filter to Household Similarity
    4.7.6 Summary of Analysis
  4.8 Bias of Disambiguation Methods
5 Conclusions
  5.1 Summary
  5.2 Future Work
A Comparison of Classification Techniques
  A.1 Evaluation
  A.2 Experimental Setup
    A.2.1 Support Vector Machine
    A.2.2 K-Nearest Neighbour
    A.2.3 Naive Bayes
  A.3 Results
  A.4 Analysis
  A.5 Testing the Effect of Different Feature Sets on Classifier Performance
    A.5.1 Experimental Setup
    A.5.2 Results
    A.5.3 Analysis
  A.6 Summary
B 1M ∪ M1 links between Household-pairs

List of Tables

1.1 Links generated by the PiM system
2.1 Simple census data sets
2.2 Simple census data sets - Data pre-processing
2.3 Simple census data sets and their corresponding feature vectors resulting from a comparison
2.4 Known record-pairs with imprecise recording
2.5 Census records with similar attributes
2.6 Details on the feature scores in a feature vector
2.7 Positive links generated by the Base system
2.8 Base Results - 5 Fold Cross Validation
4.1 Averaged Baseline Results - 5 Fold Cross Validation
4.2 Comparison of Prob disambiguation system against the Base system
4.3 Comparison of COR disambiguation system against the Prob disambiguation system and Base system
4.4 Comparison of ORFilter and ORMatch disambiguation systems against the Prob disambiguation system and Base system
4.5 Comparison of AvgOR J disambiguation system against the ORFilter disambiguation system and Base system
4.6 Comparison of Rc J disambiguation systems against the ORFilter disambiguation system and Base system
4.7 Comparison of Rc J1:1 > threshold disambiguation systems against the Rc J1:1 disambiguation system and Base system
4.8 Comparison of Rc J1M∪M1 > threshold disambiguation systems against the Rc J1M∪M1 disambiguation system and Base system
4.9 Comparison of Rc J1:1+1M∪M1 > threshold disambiguation systems against the Rc J1:1+1M∪M1 disambiguation system and Base system
4.10 Comparison of OR-Rc J disambiguation systems against the Rc J disambiguation systems and Base system
A.1 F-B data set results across three different classifiers
A.2 Percentage of total links unique to each classifier for F-B data set
A.3 Intersection of testing set between classifiers for F-B data set
A.4 Results from SVM and KNN tuning
A.5 SVM, KNN and NB results over three different feature sets, F-B, F-1 and F-2
A.6 Percentage of total links unique to each feature set

List of Figures

2.1 Overview of Record Linkage System
2.2 A trained support vector machine classifier, where the circles and triangles represent different classes.
2.3 An Example of K-Nearest Neighbour Classification
2.4 Overview of Base Record Linkage System
4.1 Overview of Proposed Disambiguation Record Linkage System
4.2 Example of disambiguating 1M and M1 groups. (i) Starting 1M and M1 link groups (ii) Disambiguated 1M link groups (iii) Disambiguated M1 link groups (iv) Final set of 1:1 links
4.3 Distribution of Prob
4.4 Difference in Matching vs Non-Matching origin and religion Codes
4.5 Distribution of household sizes between 1871 and 1881 censuses
4.6 Distribution of COR
4.7 Example of Simple Household
4.8 Example of 1M Link Group
4.9 Example of Households belonging to a 1M Link
4.10 Origin/Religion Household setup
4.11 1:1 and 1M ∪ M1 Links between Households
4.12 Average distribution of Rc J1:1 across 1M ∪ M1 links
4.13 Average distribution of Rc J1M∪M1 across 1M ∪ M1 links
4.14 Average distribution of Rc J1:1+1M∪M1 across 1M ∪ M1 links
4.15 Distribution of gender in 1:1 links compared to 1871 Canadian census
4.16 Distribution of age in 1:1 links compared to 1871 Canadian census
4.17 Distribution of marriage status in 1:1 links compared to 1871 Canadian census
4.18 Distribution of birthplace in 1:1 links compared to 1871 Canadian census
4.19 Distribution of origin in 1:1 links compared to 1871 Canadian census
4.20 Distribution of religion in 1:1 links compared to 1871 Canadian census
A.1 Intersection of 1:1 links between classifiers for F-B data set
A.2 Intersection of 1:1 links between F-B, F-1, and F-2 for each classifier
B.1 1M ∪ M1 links between Household-pairs