Large Scale Record Linkage in the Presence of Missing Data

Thilina Ranbaduge, The Australian National University, Canberra, Australia, [email protected]
Peter Christen, The Australian National University, Canberra, Australia, [email protected]
Rainer Schnell, University Duisburg-Essen, Duisburg, Germany, [email protected]

arXiv:2104.09677v1 [cs.DB] 19 Apr 2021

ABSTRACT

Record linkage is aimed at the accurate and efficient identification of records that represent the same entity within or across disparate databases. It is a fundamental task in data integration and increasingly required for accurate decision making in application domains ranging from health analytics to national security. Traditional record linkage techniques calculate string similarities between quasi-identifying (QID) values, such as the names and addresses of people. Errors, variations, and missing QID values can however lead to low linkage quality because the similarities between records cannot be calculated accurately. To overcome this challenge, we propose a novel technique that can accurately link records even when QID values contain errors or variations, or are missing. We first generate attribute signatures (concatenated QID values) using an Apriori based selection of suitable QID attributes, and then relational signatures that encapsulate relationship information between records. Combined, these signatures can uniquely identify individual records and facilitate fast and high quality linking of very large databases through accurate similarity calculations between records. We evaluate the linkage quality and scalability of our approach using large real-world databases, showing that it can achieve high linkage quality even when the databases being linked contain substantial amounts of missing values and errors.

1 INTRODUCTION

Organisations such as financial institutions, statistical agencies, and government departments increasingly require records about entities from multiple databases to be integrated to allow efficient and accurate decision making [12], where such databases can contain many millions of records with detailed information about people, such as customers or patients [6, 18]. Record linkage (RL) [15] is one major task required when databases are to be integrated. RL aims to identify and match records that refer to the same entities across different databases [6]. Linked databases allow improvement of data quality, enrichment of the information known about entities, and facilitate the discovery of novel patterns and relationships between entities that cannot be identified from individual databases [7, 12, 32].

Because there is often a lack of unique entity identifiers (such as social security numbers) across the databases to be linked, RL is commonly based on partially identifying attributes, known as quasi-identifiers (QIDs) [8], such as names, addresses, dates of birth, and so on. Data quality aspects such as typographical errors, variations, and changes of values over time (for example when people move or change their names due to marriage) are common in many of the QID attributes used for RL [6]. Due to errors and variations in QIDs, exact matching of attribute values can lead to poor linkage quality [6, 8].

One major data quality aspect that so far has only seen limited attention in RL is missing data [4, 16, 19, 28]. There are various reasons why missing values can occur, ranging from equipment malfunction or data items not considered to be important, to deletion of values due to inconsistencies, or even the refusal of individuals to provide information, for example when answering surveys.

Missing data can be categorised into different types [26]. Data missing completely at random (MCAR) are missing values that occur without any patterns or correlations at all with any other values in the same record or database. For example, if a first name is missing for an individual, then neither last name, address, age, nor gender can help predict the missing first name value (assuming no external database is accessible where a complete record of that person is available). With data missing at random (MAR) one can assume that a missing value can be predicted by other values in the same record and/or database. As an example, for a record of a surgeon in an employment database, her salary could be predicted by averaging the salaries of all other (female) surgeons in that database. Data missing not at random (MNAR) occur for specific reasons, for example if a patient in a medical study suffered from a stroke and therefore did not return to re-examinations, then there will be missing values for this patient in the study’s database. Finally, structurally missing data are values that are missing because they should not exist, and missing is their correct value. For example, young children should not have an occupation. Structurally missing data are different to MNAR data because for the former the correct actual value is an empty value, while for MNAR data a correct value does exist but was not recorded.

Missing data can often occur in the QID attributes used to link databases. While data imputation can be applied with the aim to fill such missing values before the databases are linked [22, 26], in our work we assume that not all missing values have been imputed. Specifically, imputation is not possible for missing values of types MCAR and structurally missing. For example, as shown in Table 1, if the street address is missing for an individual (like for A1), then no other QID value of this record can help to predict this missing value. Missing values can lead to lower similarities between records and thereby affect linkage quality. If alternatively those records or QIDs with missing values are not used for a linkage at all, then linkage quality will also likely suffer.

As an example, given the five records of three entities in the two databases D and D in Table 1, a linkage between D and D would potentially classify the record pairs (A1, A4) and (A2, A4) as matches because they have similar QID values, and it would potentially classify records A3 and A5 to refer to different entities because they have a low similarity due to name variations and missing values. Therefore, this linkage will be of poor quality due to the wrongly matched record pair (A1, A4) and the non-matched true matching record pair (A3, A5).

Table 1: Two small example databases, D and D, with missing values and variations, and where RecordID and EntityID represent record and entity identifiers, respectively.

Database | RecordID | EntityID | FirstName | LastName | BirthDate  | StreetAddress | City   | PhoneNumber
D        | A1       | 41       | Peter     | Smith    | 1981-11-25 |               | London |
D        | A2       | 42       | Peter     |          | 1981-11-25 | 43SkyePl      | Dublin | +353456785
D        | A3       | 43       | Anne      | Miller   | 1991-09-11 | 43SkyePlc     |        | +353456785
D        | A4       | 42       | Peter     | Smith    | 1981-11-25 | 43SkyePl      |        |
D        | A5       | 43       | Ann       | Myller   |            | 43SkyePlace   | Dublin | +353456785

Contribution: We propose a novel RL technique that is applicable in situations where QID values contain errors and variations or, importantly, are missing. For each record in a database to be linked, our approach first generates several distinct attribute signatures by concatenating values from a set of QID attributes [34]. We propose an Apriori [5] based technique to select suitable QID attributes for signature generation. As we detail in Section 4, the aim of attribute signatures is to uniquely describe records such that record pairs can be matched based on the number of common attribute signatures they share.

We then utilise the relationships between entities to identify matching records assuming they contain missing or dirty values in their QID attributes. We first generate a graph of records based on the relationships between them, for example the roles of individuals in a census household (such as father of or spouse of), or shared attribute values such as addresses or phone numbers. We then generate a relational signature for each record based on its neighbourhood in this graph, and calculate the similarities between relational signatures of pairs of records.

We analyse our approach in terms of its computational complexity, and evaluate linkage quality using large real-world data sets, including simulated German census databases containing in total over 130 million records. As the experimental results in Section 5 show, our approach scales to large databases and incorporating relational signatures can help to substantially improve linkage quality compared to several baseline approaches [13, 15, 28, 34].

the EM algorithm, but does not scale to large databases due to its computationally expensive weight estimations.

Goldstein and Harron [19] developed an approach to link attributes of interest for a statistical model between a primary database and one or more secondary databases that contain the attributes of interest. The approach exploits relationships between attributes that are shared across the databases, and treats the linkage of these databases as a missing data problem. The approach uses imputation to combine information from records in the different databases that belong to the same individual, and calculates match weights to correct selection bias.

Ferguson et al. [16] proposed an approach to improve RL when the databases to be linked contain missing values by using a modification of the EM algorithm [9] that considers both the imputation of missing values and correlations between QID attribute values. Aninya et al. [4] analysed how different blocking techniques are affected by missing values. They showed that techniques that insert each record into multiple blocks, such as canopy clustering and suffix array indexing [6], performed best when the attributes used for blocking contained missing values.

In contrast to these existing methods [4, 16, 19, 28], our approach does not consider imputation or estimation of similarities when QID values are missing. Instead, we exploit relational information between records to increase the overall similarities between record pairs even if their attribute similarities are low due