arXiv:2104.09677v1 [cs.DB] 19 Apr 2021

Large Scale Record Linkage in the Presence of Missing Data

Thilina Ranbaduge, The Australian National University, Canberra, Australia, [email protected]
Peter Christen, The Australian National University, Canberra, Australia, [email protected]
Rainer Schnell, University Duisburg-Essen, Duisburg, Germany, [email protected]

ABSTRACT
Record linkage (RL) [15] is aimed at the accurate and efficient identification of records that represent the same entity in disparate databases. It is a fundamental task and increasingly required for accurate decision making or data integration in application domains ranging from health analytics to national security. Traditional record linkage techniques calculate string similarities between quasi-identifying (QID) values, such as the names and addresses of people. Errors, variations, and missing QID values can, however, lead to low linkage quality because the similarities between records cannot be calculated accurately. To overcome this challenge, we propose a novel technique that can accurately link records even when QID values contain errors or variations, or are missing. We first generate attribute signatures (concatenated QID values) using an Apriori based selection of suitable QID attributes, and then relational signatures that encapsulate relationship information between records. Combined, these signatures can uniquely identify individual records and facilitate fast and high quality linking of very large databases through accurate similarity calculations between records. We evaluate the linkage quality and scalability of our approach using large real-world databases, showing that it can achieve high linkage quality even when the databases being linked contain substantial amounts of missing values and errors.

1 INTRODUCTION
Organisations such as financial institutions, statistical agencies, and government departments increasingly require records about entities from multiple databases to be integrated [12] to allow efficient and accurate decision making, where many such databases can contain millions of records with detailed information about people, such as customers or patients [6, 18]. One major task required when databases are to be linked is record linkage (RL) [15], which aims to identify and match records across different databases that refer to the same entities [6]. Linked databases allow the improvement of data quality, the enrichment of information known about entities, and facilitate the discovery of novel patterns and relationships between entities that cannot be identified from individual databases [7, 12, 32].

Because there is often a lack of unique entity identifiers (such as social security numbers) across the databases to be linked, RL is commonly based on partially identifying attributes, known as quasi-identifiers (QIDs) [8], such as names, addresses, dates of birth, and so on. Data quality aspects such as typographical errors and variations, and changes of QID values over time (for example when people move or change their names due to marriage) are common in many of the QID attributes used in RL [6]. Due to quality errors and variations in QIDs, exact matching of attribute values can lead to poor linkage quality [6, 8].

One major data quality aspect that so far has only seen limited attention in RL is missing data [4, 16, 19, 28]. There are various reasons why missing values can occur, ranging from equipment malfunction or data items not considered to be important, to deletion of values due to inconsistencies, or even the refusal of individuals to provide information, for example when answering surveys. Missing data can be categorised into different types [26].

Missing completely at random (MCAR) are missing values that occur without any patterns or correlations with other values in the same record or at all in the database. For example, if a first name is missing for an individual, then neither a last name, gender, age, nor address can help to predict this missing first name value (assuming no accessible external database is available where the first name of that person is recorded). With missing at random (MAR) data one can assume that a missing value can be predicted by other values in the same record and/or other records in the database. As an example, a missing salary value for the record of a (female) surgeon in an employment database could be predicted by averaging the salaries of all other (female) surgeons in that database. Data that are missing not at random (MNAR) do occur for some specific reasons, for example if a patient suffered from a stroke and therefore did not return to a medical re-examination, then there will be missing values for this patient in the study's database. Finally, structurally missing data are values that are missing because they should not exist; for example, young children should not have an occupation. Structurally missing values are different to MNAR data because for the former the actual correct value is an empty value, while for MNAR data there does exist a correct value, but it is not recorded.

Missing data can often occur in the QID attributes used to link databases. While data imputation can be applied to fill such missing values before the databases are linked [22, 26], in our work we assume that not all missing values have been imputed. Specifically, imputation is not possible for MCAR and structurally missing values. For example, as shown in Table 1, if the street address is missing for an individual (like for record r1), then no other QID value of this record can help to predict this missing value. Missing values can lead to lower similarities between records and thereby affect linkage quality, or, if records with missing QID values are not used for a linkage at all, then linkage quality will also likely suffer.

As an example, given the five records of three entities in the two databases D_A and D_B in Table 1, a linkage between D_A and D_B would potentially classify the record pairs (r1, r4) and (r2, r4) as matches because they have similar QID values, and it would potentially classify the records r3 and r5 to refer to different entities because they have a low similarity due to name variations and missing values. Therefore, this linkage will be of poor quality due to the wrongly matched record pair (r1, r4) and the non-matched true matching record pair (r3, r5).

Table 1: Two small example databases, D_A and D_B, with missing values and variations, and where RecordID and EntityID represent record and entity identifiers, respectively.

Database | RecordID | EntityID | FirstName | LastName | BirthDate  | StreetAddress | City   | PhoneNumber
D_A      | r1       | e1       | Peter     | Smith    | 1981-11-25 |               | London |
D_A      | r2       | e2       | Peter     |          | 1981-11-25 | 43SkyePl      | Dublin | +353456785
D_A      | r3       | e3       | Anne      | Miller   | 1991-09-11 | 43SkyePlc     |        | +353456785
D_B      | r4       | e2       | Peter     | Smith    | 1981-11-25 | 43SkyePl      |        |
D_B      | r5       | e3       | Ann       | Myller   |            | 43SkyePlace   | Dublin | +353456785

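For reference in the sketches that follow, the example records of Table 1 can be represented in plain Python as follows (an illustrative representation, not part of the paper; missing values are shown as None):

```python
# The two example databases D_A and D_B from Table 1,
# with missing QID values represented as None.
D_A = {
    "r1": {"EntityID": "e1", "FirstName": "Peter", "LastName": "Smith",
           "BirthDate": "1981-11-25", "StreetAddress": None,
           "City": "London", "PhoneNumber": None},
    "r2": {"EntityID": "e2", "FirstName": "Peter", "LastName": None,
           "BirthDate": "1981-11-25", "StreetAddress": "43SkyePl",
           "City": "Dublin", "PhoneNumber": "+353456785"},
    "r3": {"EntityID": "e3", "FirstName": "Anne", "LastName": "Miller",
           "BirthDate": "1991-09-11", "StreetAddress": "43SkyePlc",
           "City": None, "PhoneNumber": "+353456785"},
}
D_B = {
    "r4": {"EntityID": "e2", "FirstName": "Peter", "LastName": "Smith",
           "BirthDate": "1981-11-25", "StreetAddress": "43SkyePl",
           "City": None, "PhoneNumber": None},
    "r5": {"EntityID": "e3", "FirstName": "Ann", "LastName": "Myller",
           "BirthDate": None, "StreetAddress": "43SkyePlace",
           "City": "Dublin", "PhoneNumber": "+353456785"},
}
```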
Contribution: We propose a novel RL technique that is applicable in situations where QID values contain errors and variations, and importantly that can be missing. For each record in a database to be linked, our approach first generates several distinct attribute signatures by concatenating values from a set of QID attributes [34]. We propose an Apriori [5] based technique to select suitable QID attributes for signature generation. As we detail in Section 4, the aim of attribute signatures is to uniquely describe records such that record pairs can be matched based on the number of common attribute signatures they share.

We then utilise the relationships between entities to identify matching records assuming they contain missing or dirty values in their QID attributes. We first generate a graph of records based on the relationships between them, for example the roles of individuals in a household (such as father of or spouse of), or shared attribute values such as addresses or phone numbers. We then generate a relational signature for each record based on its neighbourhood in this graph, and calculate the similarities between relational signatures of pairs of records.

We analyse our approach in terms of its computational complexity, and evaluate linkage quality using large real-world data sets, including simulated German census databases containing in total over 130 million records. As the experimental results in Section 5 show, our approach scales to large databases, and incorporating relational signatures can help to substantially improve linkage quality compared to several baseline approaches [13, 15, 28, 34].
2 RELATED WORK
Missing data has always been a challenge for linking records across databases because missing QID values can result in lower linkage quality. Ong et al. [28] introduced three methods to improve the accuracy of probabilistic RL [15] when records have missing values: redistribute matching weights of attributes that have missing values to non-missing attributes, estimate the similarity between attributes when values are missing in a record, or use a set of additional attributes to calculate similarities if a primary QID value is missing.

Enamorado et al. [13] proposed an approach based on the probabilistic RL model [15] that allows for missing data and the inclusion of auxiliary information when linking records. This approach uses the Expectation-Maximisation (EM) algorithm [9] to estimate matching weights in an unsupervised manner [33]. An approach similar to [13] was proposed by Zhang et al. [35] for adjusting match weights of a record pair when missing data occurred using the EM algorithm, but it does not scale to large databases due to its computationally expensive weight estimations.

Goldstein and Harron [19] developed an approach to link attributes of interest for a statistical model between a primary database and one or more secondary databases that contain the attributes of interest. The approach exploits relationships between attributes that are shared across the databases, and treats the linkage of these databases as a missing data problem. The approach uses imputation to combine information from records in the different databases that belong to the same individual, and calculates match weights to correct selection bias.

Ferguson et al. [16] proposed an approach to improve RL when the databases to be linked contain missing values by using a modification of the EM algorithm [9] that considers both the imputation of missing values as well as correlations between QID attribute values. Anindya et al. [4] analysed how different blocking techniques are affected by missing values. They showed that techniques that insert each record into multiple blocks, such as canopy clustering and suffix array indexing [6], performed best when the attributes used for blocking contained missing values.

In contrast to these existing methods [4, 16, 19, 28], our approach does not consider imputation or estimation of similarities when QID values are missing. Instead, we exploit relational information between records to increase the overall similarities between record pairs even if their attribute similarities are low due to missing values, errors, or variations. To the best of our knowledge this work is the first to consider relational information for similarity calculations in the linkage process when records contain missing values.

The use of combinations of QID values (called linkage keys) is commonly used for deterministic linkage [6]. A single attribute combination (such as SLK-581 [23]) or multiple attribute combinations (such as p-signatures [34]), which can improve the likelihood of identifying matches [29], can be used to generate linkage keys to identify matches. However, linkage keys cannot tolerate differences in QID values, nor can they handle missing values or use additional information such as relationships between records [23, 29, 34].
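To illustrate the linkage keys discussed above, the following sketch derives a simplified SLK-581-style key [23]. The exact SLK-581 layout (selected surname and given name letters, date of birth, and sex, with short names padded using the digit 2) is assumed and simplified here for illustration; the function name is ours:

```python
def slk_style_key(last_name, first_name, birth_date, sex):
    """A simplified SLK-581-style linkage key: letters 2, 3 and 5 of the
    surname, letters 2 and 3 of the given name, plus date of birth and
    sex. Returns None if any required QID value is missing, since
    linkage keys cannot handle missing values (see above)."""
    if not (last_name and first_name and birth_date and sex):
        return None
    ln = last_name.upper().ljust(5, "2")   # pad short names (assumed convention)
    fn = first_name.upper().ljust(3, "2")
    return ln[1] + ln[2] + ln[4] + fn[1] + fn[2] + birth_date + sex

# Name variations (e.g. Anne Miller versus Ann Myller in Table 1,
# with an illustrative date of birth) produce different keys, so a
# deterministic comparison of such keys misses the true match:
print(slk_style_key("Miller", "Anne", "1991-09-11", "F"))  # ILENN1991-09-11F
print(slk_style_key("Myller", "Ann", "1991-09-11", "F"))   # YLENN1991-09-11F
```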
3 PROBLEM FORMULATION
Without loss of generality, we assume two deduplicated databases [6], D_A and D_B, such that each entity is represented by only one record in a database. We define two disjoint sets, M and U, from the cross-product D_A × D_B. A record pair (r_i, r_j), with r_i ∈ D_A and r_j ∈ D_B, is a member of M (true matches) if (r_i, r_j) represents the same entity; otherwise (r_i, r_j) is a member of U (true non-matches) and r_i and r_j represent two different entities. With the true class (M or U) unknown, the record linkage process attempts to accurately classify each record pair as belonging to either M or U [6, 15].

As we detail in Section 4, the aim of our approach is to link records between databases based on their values in QID attributes, as well as relationship information between records. We first generate a set of attribute signatures for each record, where each signature is formed by concatenating one or more tokens (substrings) extracted from QID values [34]. These tokens are generated by applying various string manipulation functions. Any pair of records with the same values for a given set of QID attributes will have the same attribute signature, thus leading to the identification of matching records. We define an attribute signature as:

Definition 3.1 (Attribute Signature). Let D be a database where each record r ∈ D contains values r.a in a set of QID attributes a ∈ A. Let S = [a ∈ A : C(a)] be a list of attributes where each a ∈ S is selected according to a certain criteria C. The attribute signature of r using S is defined as AS(r, S) = concat([g_a(r.a) : a ∈ S]), where concat() is the string concatenation function and g_a() is a manipulation function applied on the value r.a of attribute a.

For example, let us assume the attributes selected for attribute signature generation are based on a criteria C of an attribute having a maximum of 20% of records in a database with missing values. Thus, assuming an attribute combination S1 = [FirstName, LastName, yearOf(BirthDate)] is selected, the attribute signature generated for record r1 in Table 1 is PeterSmith1981, where the value 1981 is a token and the string manipulation function yearOf() returns the year of a date of birth value.

No signature can be generated for r if a QID value used in S is missing for r [34]. Furthermore, errors and variations in QID values can result in different signatures, which can lead to record pairs being classified as false matches or false non-matches. For example, assuming the same attribute combination as S1, the records r1 and r4 in Table 1 will be classified as a match because they have the same signature AS(r1, S1) = AS(r4, S1) = PeterSmith1981, while r3 and r5 will be classified as a non-match since no attribute signature can be generated for r5 due to its missing date of birth value.
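As a minimal illustration of Definition 3.1 (function names are ours; yearOf() follows the example above):

```python
def year_of(date_str):
    """Manipulation function g_a(): return the year token of a date value."""
    return date_str.split("-")[0]

def attr_signature(record, attr_combination):
    """Generate the attribute signature AS(r, S) by concatenating the
    (optionally manipulated) values of the attributes in S. Returns
    None if any attribute value used in S is missing."""
    tokens = []
    for attr, manip in attr_combination:
        value = record.get(attr)
        if value is None:   # no signature if a QID value is missing
            return None
        tokens.append(manip(value) if manip else value)
    return "".join(tokens)

S1 = [("FirstName", None), ("LastName", None), ("BirthDate", year_of)]
r1 = {"FirstName": "Peter", "LastName": "Smith", "BirthDate": "1981-11-25"}
print(attr_signature(r1, S1))  # PeterSmith1981
```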
To improve similarities when records contain missing values, or errors and variations, we use the relationships between records in a database, as identified through some attributes such as family relationship, address, or phone number. For example, the records r2 and r3 in Table 1 can be considered as related because they share the same phone number. We represent records and their relationships in a record graph, defined as:

Definition 3.2 (Record Graph). Let G = (V, E) be an undirected graph where V is the set of vertices, and each v ∈ V represents a record r ∈ D. The set of edges of G is defined as E = {(v_i, v_j) : v_i, v_j ∈ V ∧ v_i ↔_a v_j}, where v_i ↔_a v_j represents a relationship over an attribute a ∈ A between records r_i and r_j.

We then generate a relational signature for each v ∈ G based on its graph neighbourhood [2], such as the degree of v, where we generate m ≥ 1 features for the relational signature of a record. We define a relational signature as:

Definition 3.3 (Relational Signature). The relational signature of a record r ∈ D based on its corresponding vertex v ∈ G, using its neighbourhood N(v) = {v_j : (v, v_j) ∈ E}, is defined as RS(r) = ⟨f_1(N(v)), ..., f_m(N(v))⟩, with f_i being a certain graph feature generated from the neighbourhood N(v).

As we describe next, the features f_i that can be used in relational signatures include degree and density using the egonet of a vertex [2], or we can use the QID values of neighbouring vertices (records) as features for a given vertex. For example, records r3 and r5 in Table 1 can be matched based on their respective neighbouring records r2 and r4, given these share a highly similar street address and generate the same attribute signature AS(r2, S2) = AS(r4, S2) = Peter1981Skye for S2 = [FirstName, yearOf(BirthDate), streetName(StreetAddress)].
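The following sketch illustrates how a record graph and a simple relational signature could be derived under these definitions; the two features shown (degree and the attribute signatures of neighbouring records) follow the description above, while the helper names are ours:

```python
from collections import defaultdict

def build_record_graph(db, rel_attrs):
    """Record graph (Definition 3.2): connect two records if they share
    a value in any of the given relationship attributes, for example
    PhoneNumber or StreetAddress."""
    value_index = defaultdict(set)   # (attribute, value) -> record ids
    for rid, rec in db.items():
        for attr in rel_attrs:
            if rec.get(attr) is not None:
                value_index[(attr, rec[attr])].add(rid)
    graph = defaultdict(set)         # record id -> set of neighbour ids
    for ids in value_index.values():
        for rid in ids:
            graph[rid] |= ids - {rid}
    return graph

def relational_signature(rid, graph, attr_sigs):
    """Relational signature RS(r) with two example features of the
    neighbourhood N(v): its degree, and the attribute signatures of
    the neighbouring records."""
    neighbours = graph.get(rid, set())
    neighbour_sigs = set().union(*(attr_sigs.get(n, set()) for n in neighbours))
    return (len(neighbours), neighbour_sigs)
```

Applied to the D_A dictionary sketched after Table 1 with rel_attrs = ["PhoneNumber"], build_record_graph() connects r2 and r3, mirroring the example above.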
4 SIGNATURE BASED RECORD LINKAGE IN THE PRESENCE OF MISSING DATA
Our approach consists of three main steps. We first provide an adaptive method to select attribute combinations (from the set of QID attributes A) that are suitable to generate attribute signatures. We then generate several attribute signatures per record, as well as one relational signature per record based on the record graph we build for each database to be linked. Finally, we use the generated attribute and relational signatures to identify matching records.

4.1 Attribute Selection
The choice of QID attributes plays an important role for attribute signature generation. The aim of each attribute signature is to distinctively describe each record, such that if two records share the same attribute signature it is highly likely that they belong to the same entity. Due to errors, variations, and missing values in QIDs, however, a certain attribute combination might not be suitable as an attribute signature. For example, S3 = [FirstName, BirthDate] in Table 1 would generate the same attribute signature Peter1981-11-25 for records r1, r2, and r4, and result in the false matching of record pair (r1, r4).

One way to overcome this issue is to select a set S of attributes based on domain knowledge [34]. However, such knowledge is not always available, or it might be inadequate for a given linkage project, leaving a user to randomly select combinations of (potentially not suitable) QIDs for attribute signature generation. We therefore propose a method to select suitable QID attributes for signature generation based on their data characteristics. We consider two aspects in the selection process: (1) completeness, which measures the amount of missing values in each attribute, and (2) Gini impurity [20], which measures the homogeneity of an attribute with respect to the number of its distinct values.

As we outline in Algorithm 1, we follow an Apriori based approach [3, 5] where we use an iterative search over attribute combinations of length k generated from suitable attribute combinations of length k−1. First, in lines 1 and 2, we initialise an inverted index C to store the combinations we need to process. In lines 3 to 13 we then loop over each attribute combination of size k, with 2 ≤ k ≤ |A|. In line 4, using the function genCombinations() we generate the set C[k] of attribute combinations of size k based on the previously generated combinations C[k−1] of size k−1.
Algorithm 1: Attribute Selection
Input:
- D: Database to be linked
- A: List of QID attributes
- n_a: Number of combinations required
- α: Score weight, with 0 ≤ α ≤ 1
- c_t: Attribute selection threshold, with 0 ≤ c_t ≤ 1
Output:
- C_S: List of candidate attribute combinations

1: C = {}, T = [], k = 2                     // Initialise variables
2: C[1] = genCombinations(A)                 // Set initial combinations
3: while k ≤ |A| do:                         // Loop over all sizes of combinations
4:   C[k] = genCombinations(C[k−1])
5:   foreach S ∈ C[k] do:                    // Loop over each combination
6:     s_c = getCompletenessScore(S, D)      // Completeness of S
7:     s_g = getGiniScore(S, D)              // Gini impurity of S
8:     s = α · s_c + (1 − α) · s_g           // Calculate overall score of S
9:     if s ≥ c_t do:                        // Check if score s is at least c_t
10:      T.add((S, s))                       // Add the combination and its score
11:      T = removeSubCombinations(S, T)     // Remove subsets
12:    else:                                 // Score s is less than c_t
13:      C[k].remove(S)                      // Remove the combination S
14:  k = k + 1                               // Increment k
15: C_S = getTopCombinations(T, n_a)         // Get best combinations
16: return C_S

Algorithm 2: Signature Generation
Input:
- D: Database to be linked
- A: List of QID attributes
- C_S: List of candidate attribute combinations
- F: List of graph feature functions
- p_t: Probability threshold, with 0 ≤ p_t ≤ 1
Output:
- SD: Signature database

1: SD = {}, I = {}, L = ∅                    // Initialise variables
2: foreach r ∈ D do:                         // Loop over each record in the database
3:   I[r.id] = ∅                             // Initialise to an empty set
4:   foreach S ∈ C_S do:                     // Loop over each combination
5:     r_S = genAttrSignature(r, S)          // Attr. signature (Def. 3.1)
6:     I[r.id].add(r_S)                      // Add the generated signature to I
7:     L.add(r_S)                            // Add the generated signature to the set L
8: foreach r_S ∈ L do:                       // Loop over attribute signatures
9:   p = getSigProbability(r_S, I)           // Calculate the probability
10:  if p < p_t do:                          // Check if the probability is less than p_t
11:    I = removeSignature(r_S, I)           // Remove the signature
12: G = genRecordGraph(D, A)                 // Generate the graph (Def. 3.2)
13: foreach r ∈ G do:                        // Loop over each record in G
14:  R_S = genRelSignature(r, G, F)          // Rel. signature (Def. 3.3)
15:  SD[r.id] = (I[r.id], R_S)               // Add signatures of r to SD
16: return SD

Following the Apriori principle [3], to generate an attribute combination S of size k requires that all its attribute combinations of size k−1 are in C[k−1], which means all its subsets must satisfy the selection criteria we detail below. For example, for the attribute combination [a1, a2, a3] of size k = 3, where a_i ∈ A, all three combinations [a1, a2], [a2, a3], and [a1, a3] must be in C[2].

In lines 5 to 7, the algorithm iterates over each combination S ∈ C[k] and calculates the completeness and Gini impurity scores s_c and s_g, respectively. We calculate s_c = (|D| − m_S)/|D|, where |D| is the number of records in D and m_S is the number of records with missing values for any attribute in S. We calculate the Gini impurity of S as s_g = Σ_{v ∈ V} P(v) · (1 − P(v)), where V is the list of unique values in S and the probability P(v) = f_v/|D|, with f_v being the frequency of value v in D. In lines 8 and 9, a combination S is selected as a candidate attribute signature if the weighted score s = α · s_c + (1 − α) · s_g is at least the threshold c_t, where α determines the importance of completeness over Gini impurity.

If the score s is at least c_t, in lines 10 and 11 we add S along with its score s to the list T, and remove any subsets of S from T that were added in previous iterations. This subset pruning prevents redundant attribute combinations and reduces the costs of generating not required attribute signatures. For instance, if the attribute combination [FirstName, LastName, BirthDate] is selected, then all its subsets ([FirstName, LastName], [FirstName, BirthDate], and [LastName, BirthDate]) are pruned because they cannot provide more distinctive attribute signatures.

In line 13, we remove the combination S from the set of combinations C[k] if its score s is less than c_t. This ensures the supersets of S will not be processed in subsequent iterations. For instance, if the score of the attribute combination [FirstName, LastName] is less than the threshold c_t, then it is unlikely that its superset, such as [FirstName, LastName, BirthDate], can provide more distinctive attribute signatures.

In line 14 we increment the value of k by 1, allowing the algorithm to continue processing attribute combinations until k = |A|. Finally, in line 15 we select the n_a attribute combinations with the highest scores in the list T to create the final list of combinations C_S. As we show in Section 5, increasing the number of attribute combinations, n_a, can lead to an increase of the overall runtime of our approach as more signatures are generated per record, but potentially can also increase the linkage quality as more attribute signatures will be considered when comparing a pair of records.

The complexity of Algorithm 1 is O(|D| · |A|^2). The attribute selection threshold, c_t, can be used to increase or decrease the number of candidate attribute combinations that are pruned in each iteration, with lower values for c_t generating more attribute combinations.
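The two scores and their weighted combination (lines 6 to 8 of Algorithm 1) can be computed as in the following sketch (helper names are illustrative):

```python
def completeness_score(db, attrs):
    """s_c = (|D| - m_S) / |D|, where m_S is the number of records with
    a missing value in any attribute of the combination."""
    missing = sum(1 for rec in db.values()
                  if any(rec.get(a) is None for a in attrs))
    return (len(db) - missing) / len(db)

def gini_score(db, attrs):
    """s_g = sum over distinct value combinations v of P(v) * (1 - P(v)),
    with P(v) = f_v / |D| and f_v the frequency of v."""
    counts = {}
    for rec in db.values():
        v = tuple(rec.get(a) for a in attrs)
        counts[v] = counts.get(v, 0) + 1
    n = len(db)
    return sum((f / n) * (1 - f / n) for f in counts.values())

def combination_score(db, attrs, alpha=0.5):
    """Weighted score s = alpha * s_c + (1 - alpha) * s_g."""
    return (alpha * completeness_score(db, attrs)
            + (1 - alpha) * gini_score(db, attrs))
```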
For instance, per Definition 3.1) and add it to the sets A[A.83] and ! (lines 5 to 7). Large Scale Record Linkage in the Presence of Missing Data

Attribute signatures are generated only if all attributes in S have a value for a given r.

As we discussed in Section 3, we require signatures to be unique to a record because otherwise they can lead to incorrectly linked records. The probability for a given attribute signature r_S to occur repeatedly in n records is governed by a Binomial distribution [26]. The probability of an attribute value combination in a record to become an attribute signature decreases if the value combination occurs in multiple records. We follow the probability calculation described by Zhang et al. [34] to calculate the probability p of an attribute signature r_S ∈ L to be considered as a signature, which can be calculated as p = P(r_S | n), with n = |{r.id : r_S ∈ I[r.id]}| and p = 1/(1 + μ · λ^n), where 0 < μ < 1 < λ. The two parameters λ and μ control how fast the maximum of p decreases as the number of records n that have signature r_S increases [34]. We refer the reader to [34] for more details on this probability calculation.
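A small sketch of the signature probability filter (lines 8 to 11 of Algorithm 2); note that the placement of the exponent n follows our reading of the reconstructed formula from [34], and the default parameter values are those used in Section 5:

```python
def signature_probability(n, lam=1.2, mu=0.2):
    """Probability that a value combination occurring in n records is
    kept as a signature: p = 1 / (1 + mu * lam**n), with 0 < mu < 1 < lam."""
    return 1.0 / (1.0 + mu * lam ** n)

# The probability decays as a signature becomes less distinctive,
# so frequent signatures fall below the threshold p_t and are removed:
for n in (1, 5, 10, 50):
    print(n, round(signature_probability(n), 3))  # 0.806, 0.668, 0.447, 0.0
```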
We keep r_S if its probability p is at least the threshold p_t (line 10), otherwise we remove r_S from all records in I in line 11. In line 12 we then generate the record graph G following Definition 3.2. In line 13 we loop over each record r ∈ G, and in line 14 we generate the relational signature R_S for r based on its local neighbourhood structure in G. We provide a list F of graph feature functions to the function genRelSignature(), which generates R_S. To generate a relational signature we use different features that describe the local graph neighbourhood of a vertex in G. Features include the degree and egonet density of a vertex [2], as well as the attribute signatures of its neighbouring vertices. Finally, in line 15, we add the set of generated attribute signatures I[r.id] and R_S of each record r ∈ D into the signature database SD using the corresponding record identifier r.id as the key.

Generating the graph G is of O(|D| + |D| · d/2), assuming the average number of neighbours of a record in G (its degree) is d. If we assume |C_S| attribute combinations have been generated by Algorithm 1, each record in D is on average represented by l attribute signatures, and |F| is the number of features considered to generate a relational signature, then the complexity of the signature generation process in Algorithm 2 is O(|D| · (|C_S| + l + d · |F|)).

4.3 Signature based Record Matching
In the last step, as outlined in Algorithm 3, we compare the attribute and relational signatures generated for each record to find matching record pairs that correspond to the same entity. We first calculate the similarities between attribute signatures, and if these are low due to missing QID values, or errors and variations in QID values, we calculate the similarities between relational signatures.

Our signature based matching process uses the two signature databases SD_A and SD_B of databases D_A and D_B, respectively, as generated by Algorithm 2. We use the function sim_A() to calculate a similarity between records based on the number of attribute signatures that they have in common, while sim_R() calculates similarities between the features in relational signatures. For example, sim_R() can be a set based similarity function such as the Jaccard or Dice similarity [6] if we use the QID values of neighbouring records as relationship features.

Algorithm 3: Signature based Record Matching
Input:
- SD_A: Signature database of D_A
- SD_B: Signature database of D_B
- sim_A(): Similarity function for attribute signatures
- sim_R(): Similarity function for relational signatures
- β: Similarity weight, with 0 ≤ β ≤ 1
- s_t: Similarity threshold, with 0 ≤ s_t ≤ 1
Output:
- M: Matching record pairs

1: M = ∅                                     // Initialise a set to hold matching record pairs
2: R = genRecordPairs(SD_A, SD_B)            // Generate record pairs
3: foreach (r_A.id, r_B.id) ∈ R do:          // Loop over all record pairs
4:   Λ_A, Λ_B, R_A, R_B = getSignatures(SD_A[r_A.id], SD_B[r_B.id])
5:   s_A = sim_A(Λ_A, Λ_B)                   // Calculate attr. signature similarity
6:   if s_A ≥ s_t do:                        // Check if the similarity is at least s_t
7:     M.add((r_A.id, r_B.id))               // Add the record pair to M
8:   else:
9:     s_R = sim_R(R_A, R_B)                 // Calculate rel. signature similarity
10:    s = β · s_A + (1 − β) · s_R           // Calculate the combined similarity
11:    if s ≥ s_t do:                        // Check if the similarity is at least s_t
12:      M.add((r_A.id, r_B.id))             // Add the record pair to M
13: return M

In line 2, the function genRecordPairs() returns the set R of all record pairs that have at least one attribute signature in common, R = {(r_A.id, r_B.id) : r_S ∈ SD_A[r_A.id] ∧ r_S ∈ SD_B[r_B.id]}. In line 3, we loop over each (r_A.id, r_B.id) ∈ R and get its respective sets of attribute and relational signatures, Λ_A, Λ_B, R_A, and R_B, in line 4. We calculate the similarity s_A between the sets of attribute signatures Λ_A and Λ_B, and if s_A is at least a user specified threshold s_t then we add (r_A.id, r_B.id) into the set M of matching record pairs (lines 5 to 7). For non-matching record pairs, in line 9 we calculate the similarity s_R between their relational signatures R_A and R_B, and then calculate the weighted average similarity s using both s_A and s_R. If s is at least s_t, then in line 12 the record pair is classified as a match and added to M, which is returned in line 13.

The complexity of Algorithm 3 is O(n) if we assume n = |D_A| = |D_B| and each record in D_A and D_B has generated a unique attribute signature in the signature generation step in Algorithm 2.
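The decision logic of Algorithm 3 for a single candidate pair can be sketched as follows; this is a minimal illustration using the Jaccard coefficient, which Section 5 states is used for both sim_A() and sim_R() (function names are ours):

```python
def jaccard(set_a, set_b):
    """Jaccard coefficient between two sets of signatures."""
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

def match_pair(attr_sigs_a, attr_sigs_b, rel_sigs_a, rel_sigs_b,
               beta=0.5, s_t=0.8):
    """Classify one candidate pair as in Algorithm 3: accept on the
    attribute signature similarity alone, otherwise fall back to the
    weighted combination s = beta * s_A + (1 - beta) * s_R."""
    s_attr = jaccard(attr_sigs_a, attr_sigs_b)
    if s_attr >= s_t:
        return True
    s_rel = jaccard(rel_sigs_a, rel_sigs_b)
    return beta * s_attr + (1 - beta) * s_rel >= s_t
```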

5 EXPERIMENTAL EVALUATION
We use six real-world data sets to empirically evaluate our proposed approach, as summarised in Table 2, where each data set contains a pair of databases.

Table 2: A summary of the number of records, percentage of records containing missing values, and the number of ground truth record pairs for each data set pair used in the experimental evaluation.

Data set pair         | Records               | Missing %  | Ground truth pairs
DBLP - Scholar        | 2,616 / 64,263        | 17% / 54%  | 2,386
DBLP - ACM            | 2,616 / 2,294         | 20% / 20%  | 2,220
UKCD                  | 18,699                | 24%        | 18,354
NCVR-20               | 877,012               | 20%        | 877,012
NCVR-50               | 877,012               | 50%        | 877,012
NCVR-2017 - NCVR-2019 | 7,846,174 / 7,688,308 | 12% / 10%  | 6,948,248

Two are bibliographic data sets [24], DBLP-Scholar and DBLP-ACM, where the entities are academic publications, and each record contains title, authors, venue, and year of publication. We added missing values randomly, following the missing completely at random (MCAR) missing data category, to authors, venue, and year of publication, ensuring each record contains at most two missing values. The relationships between publications are identified based on their common authors.

The UKCD data set consists of census records of 12,044 households collected from 1851 to 1901 in 10 year intervals for the town of Rawtenstall and surrounds in the United Kingdom [17]. We use the role of individuals in a household to identify the relationships between records. We used first name, middle name, surname, address, occupation, gender, age, and birth parish as QIDs, where 24% of records contain missing values in at most four of these attributes.

We assume the missing data in the UKCD data set represents the MCAR, MAR, and MNAR categories. This is because the missing values in attributes such as first or middle name cannot be imputed using other QID values, which makes them MCAR data. The missing values in address could be imputed based on the other available information of the household, as we assume all members lived at the same address when the census was collected, hence they follow the MAR category. Finally, some missing values in the occupation attribute follow MNAR, as records about children do not contain an occupation.

Table 3: Average runtime results (in seconds) for linking the different data set pairs.

Approach             | DBLP - Scholar | DBLP - ACM | UKCD   | NCVR-20   | NCVR-50   | NCVR-2017 - NCVR-2019
ProbLink [15]        | 143            | 12         | 164    | 1,155     | 1,143     | 38,468
ReDistLink [28]      | 141            | 11         | 148    | 1,090     | 1,125     | 34,766
P-Sig [34]           | 9              | 6          | 8      | 45        | 47        | 892
FastLink [13]        | 15             | 26         | 38     | 267       | 279       | 6,482
RA-Sig (steps 1/2/3) | 5/1/5          | 5/1/2      | 10/2/3 | 112/43/17 | 123/39/19 | 861/363/198

The NCVR data sets contain records from an October 2019 snapshot of the North Carolina voter registration database (available from: http://dl.ncsbe.gov/). We extracted 309,877 household units from this snapshot and created two subsets, NCVR-20 and NCVR-50, where we randomly created (assuming MCAR) at most five missing values in the QID attributes first, middle, and last names, street address, city, and zip code in 20% and 50% of records, respectively. To evaluate our approach with data of different quality, we then applied various corruption functions using the GeCo data corruptor [8] on between 1 to 3 randomly selected QID values of 20% of records. For relationship information we considered if two records shared the same last name and address.

Finally, we use two full NCVR snapshots from October 2017 and October 2019, named NCVR-2017 and NCVR-2019, respectively, as two databases to be linked. We use the same QID attributes first name, middle name, last name, street address, city, and zip code as for the other NCVR data sets for linkage. The last name and address QIDs did not contain any missing values, and we therefore consider two voters as related if they share the same address and last name. Among the 89% of voters who appear in both snapshots, 1%, 3%, 1.3%, 17%, 10%, and 12.5% of voters have changed their first name, middle name, last name, address, city, and zip code, respectively.

We compare our approach (RA-Sig) with four baseline linkage techniques. We use traditional probabilistic record linkage (ProbLink) as proposed by Fellegi and Sunter [15] as a baseline, given this approach is widely used in practical record linkage applications [16, 19]. We also use the method by Ong et al. [28] (ReDistLink) that redistributes the weights of QID attributes that have missing values to non-missing QID attributes, where these attribute weights are calculated based on the probabilistic RL method [15]. The third baseline is the probabilistic signature approach (P-Sig) proposed by Zhang et al. [34], which identifies attribute signatures that distinctively describe records, without however considering errors, variations, nor missing values. Finally, we use the recently proposed probabilistic linkage approach by Enamorado et al. [13], named FastLink. This approach also utilises the Expectation-Maximisation algorithm based probabilistic RL model [9, 15], which estimates matching weights considering missing values and auxiliary information when linking records.
Based on a set of parameter sensitivity experiments, in Algorithm 1 we set n_a = [1, 5, 10], α = 0.5, and c_t = [0.5, 0.7, 0.9]. Following [34], we set λ = 1.2, μ = 0.2, and p_t = [0.5, 0.7, 0.9] in Algorithm 2. We use the attribute signatures of neighbouring vertices as features in Algorithm 3. The Jaccard coefficient [6] is used for both sim_A() and sim_R(), and in Algorithm 3 we set β = 0.5. Following [28], we set the similarity threshold s_t in Algorithm 3 to range from 0.5 to 1.0 (in 0.1 steps) for all techniques.

We ran experiments on a server with 64-bit Intel Xeon (2.1 GHz) CPUs, 512 GBytes of memory, running Ubuntu 18.04. We implemented all techniques in Python (version 3), except FastLink, for which we used an Apache Spark based Python implementation developed by the Ministry of Justice in the United Kingdom [27]. Though their original implementation takes advantage of parallelisation, we ran all experiments on a single core to allow a fair evaluation. We followed [1, 10] to measure the statistical significance of linkage results. To facilitate repeatability, the programs and data sets are available from: https://dmm.anu.edu.au/ra-sig.

We evaluated scalability using runtime, and linkage quality using precision and recall [6]. Precision measures the number of true matched record pairs against the total number of record pairs generated by a particular approach, and recall measures the number of true matched record pairs against the total number of record pairs in the linked ground truth data [21]. We do not use the F-measure for evaluation because recent research has shown that the F-measure is not suitable for measuring linkage quality in record linkage due to the relative importance given to precision and recall, which depends upon the number of predicted matches [21].
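Both measures can be computed directly from the set M of classified matches and the set of ground truth pairs (a minimal sketch):

```python
def precision_recall(matches, ground_truth):
    """Precision: true matches among the pairs an approach classified
    as matches; recall: true matches found among all ground truth
    pairs [21]. Both arguments are sets of record id pairs."""
    true_matches = matches & ground_truth
    precision = len(true_matches) / len(matches) if matches else 0.0
    recall = len(true_matches) / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```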

[Figure 1: nine precision-recall plots arranged in three columns (UKCD, NCVR-20, NCVR-50) and three rows (attribute selection method, number of attribute combinations n_a, probability threshold p_t); see the caption below.]

Figure 1: Linkage quality (precision and recall) results of our RA-Sig approach with different attribute selection methods and varying thresholds 0.5 ≤ c_t < 1.0 (top row), different numbers of attribute combinations n_a (middle row) in Algorithm 1, and different probability thresholds 0.5 ≤ p_t < 1.0 in Algorithm 2 (bottom row), for different similarity thresholds 0.5 ≤ s_t ≤ 1.0.

Table 3 shows the runtime of all approaches on the different data sets. As can be seen, the attribute selection step of our approach has the longest runtime due to the complexity of the Apriori based iterative selection method. This runtime can however be decreased by increasing c_t to reduce the number of candidate attribute combinations to be processed in each iteration. As shown in Table 3, the linkage between the two full NCVR snapshots took less than 30 minutes to complete all three steps of our approach. This shows our approach scales to large database sizes, making it applicable for large-scale linkages of databases that contain errors and missing values. Overall, P-Sig was the fastest approach (however, it can handle neither errors nor missing values), while ProbLink, ReDistLink, and FastLink required substantially longer runtimes due to the weight calculations and attribute string comparisons they employ [6].

The top row of Figure 1 shows the linkage quality of our approach for different values of c_t used in the adaptive attribute selection method. A low value of c_t resulted in lower linkage quality because the attribute combinations that are selected can include QID attributes with more missing or less distinct values. However, irrespective of the value of c_t, the Apriori based attribute selection method results in substantially increased linkage quality compared to randomly selecting QID attribute combinations for attribute signatures.

The middle row in Figure 1 shows the linkage quality of our approach with different numbers of attribute combinations, n_a, in Algorithm 1. As can be seen, an increase in the number of combinations, n_a, improves both overall precision and recall. This is because more attribute signatures can be generated for each record, which can potentially identify matching record pairs even when their QID values are missing or contain errors and variations.

We measured the linkage quality of our approach with different probability thresholds p_t for the attribute signature generation in Algorithm 2. As shown in the bottom row of Figure 1, a lower p_t value reduces the precision of our approach because the generated attribute signatures are not distinct enough (they are shared by many records), which can potentially lead to false positives. On the other hand, a higher p_t value tends to increase precision with a drop of recall because only a few signatures are compared in the matching process. We also noted that a lower p_t value can increase the runtime of our approach by 5% to 10% because many more attribute signatures are generated in Algorithm 2.

Table 4: Average precision and recall results (P / R) with s_t = 0.8. The best results are shown in bold.

Approach        | DBLP - Scholar | DBLP - ACM  | UKCD        | NCVR-20     | NCVR-50     | NCVR-2017 - NCVR-2019
ProbLink [15]   | 0.53 / 0.31    | 0.94 / 0.53 | 0.83 / 0.64 | 0.72 / 0.83 | 0.67 / 0.41 | 0.82 / 0.81
ReDistLink [28] | 0.67 / 0.32    | 0.96 / 0.57 | 0.84 / 0.65 | 0.73 / 0.84 | 0.67 / 0.66 | 0.82 / 0.84
P-Sig [34]      | 0.87 / 0.61    | 0.97 / 0.65 | 0.95 / 0.71 | 0.97 / 0.82 | 0.91 / 0.70 | 0.95 / 0.85
FastLink [13]   | 0.73 / 0.58    | 0.96 / 0.69 | 0.91 / 0.71 | 0.93 / 0.77 | 0.68 / 0.63 | 0.85 / 0.88
RA-Sig          | 0.86 / 0.73    | 0.99 / 0.83 | 0.92 / 0.86 | 0.97 / 0.91 | 0.96 / 0.84 | 0.98 / 0.93

The average linkage quality achieved by the different approaches for similarity threshold s_t = 0.8 in Algorithm 3 is shown in Table 4. As can be seen, RA-Sig resulted in high precision and recall values with all data sets, which illustrates that the use of relationship information between records can correctly match records even when their QID attributes contain missing values, or variations and errors. The linkage quality results remain qualitatively similar if we change the threshold s_t from 0.5 to 1.0.

The probabilistic linkage approaches performed poorly due to missing values that resulted in low similarities (even with weight redistribution [28] or estimation of weights considering missing values [13]), and many record pairs were classified as non-matches. P-Sig achieved similar precision as our approach because we are using the same set of candidate signatures generated in Algorithm 1 for P-Sig. However, RA-Sig achieved, based on Tukey's honestly significant difference (HSD) post hoc test [1], statistically significant improvements in recall (10% to 20%) even if the QID attributes contain large numbers of missing values.

6 LARGE SCALE REAL DATA LINKAGE
In this section, we describe how we apply our approach in a real-world record linkage setting. We use our approach to link two simulated census data sets containing information of all households in Germany.
Problem Description: Due to increasing demands for faster and more frequent census statistics, European countries are transitioning from traditional census operations to register-based censuses [31]. In Germany, the next census will be (similar to the Census 2011) a combination of a register-based and a traditional census. After the Census 2022, Germany will use a register-based census to decrease respondent burden, cost, and the time needed for processing [25]. A major challenge for any register-based census is linking the various available administrative registers if no unique personal identification number is available.

German Census Simulation Data: Since no unique personal identification number is currently available in Germany, QID attributes such as names, date of birth, and place of birth have to be used for linking. Data protection regulations, such as the GDPR [14], do not permit the permanent storage of QID attribute values in official statistics. Therefore, data sets used for the development of linkage applications have to use artificially simulated data [30]. Such data have to consider regional variations in household size, compositions of households, number and kind of institutional populations, name variations according to date of birth, and many more restrictions usually not considered when generating artificial data for simulations. Furthermore, the amounts of errors and the frequency of incomplete QID attributes have to be considered.

Destatis [11], the Federal Statistical Office of Germany, therefore commissioned a research group at the University of Duisburg-Essen to develop suitable data sets which fulfil the given constraints. These data sets are based on real-world samples of first names and last names, the actual distribution of households at given addresses, and marginal distributions of population characteristics [30]. Two of these data sets, with 81,194,354 and 55,321,209 records and an overlap of 55,126,620 records, are used in the experiments presented here. We use first and last names, date of birth, gender, and street address as QIDs for linkage, and household identifiers to obtain relationships between entities. The two data sets contain 1% missing values, with 4%, 3.3%, 2.6%, and 1.3% of individuals having changes in their first name, last name, date of birth, and street address, respectively.

Table 5: Average runtime (in seconds) and linkage quality results with s_t = 0.8 for the German census data sets.

Runtime         | Attribute selection step (Algorithm 1)  | 7,903
Runtime         | Signature generation step (Algorithm 2) | 4,573
Runtime         | Record matching step (Algorithm 3)      | 3,438
Linkage quality | Precision                               | 0.999
Linkage quality | Recall                                  | 0.962

Results and Discussion: Table 5 shows the runtime and linkage quality results our approach achieved linking these two census data sets. As can be seen, our approach completed the linkage in less than 5 hours. Our approach generated approximately 400 million unique attribute signatures in Algorithm 2 when we selected the five best attribute combinations in Algorithm 1, and set the score weight as α = 0.5 and the attribute selection threshold as c_t = 0.6, respectively. As shown in Table 5, our approach achieved an overall precision and recall of 0.999 and 0.962, respectively, which involved the comparison of approximately 54 million candidate record pairs.

The initially used combination of algorithms (including exact matching [6], phonetic matching [6], and FastLink [13]) needed more than 300 hours on average to link the two data sets, achieving similar linkage quality, which shows the superiority of our approach when linking large databases containing millions of records.

7 CONCLUSION
We have presented an efficient approach to link records even when they contain errors or missing values. Our approach generates suitable attribute signatures using an unsupervised selection method, and relational signatures from relationship information between records. Combined, these signatures can uniquely identify individual records and facilitate very fast and accurate linking of large databases compared to other state-of-the-art linkage techniques.

As future work, we plan to evaluate our approach under different missing data scenarios, and to use graph embeddings to generate improved relational signatures. We also aim to extend our approach to facilitate efficient and accurate private record linkage [7].

REFERENCES
[1] Hervé Abdi and Lynne J. Williams. 2010. Tukey's honestly significant difference (HSD) test. Encyclopedia of Research Design 3, 1 (2010), 1–5.
[2] Charu C. Aggarwal and Haixun Wang. 2010. Managing and Mining Graph Data. Springer, Boston.
[3] Rakesh Agrawal, Tomasz Imieliński, and Arun Swami. 1993. Mining association rules between sets of items in large databases. In SIGMOD. ACM, Washington DC, 207–216.
[4] Imrul Chowdhury Anindya, Murat Kantarcioglu, and Bradley Malin. 2019. Determining the Impact of Missing Values on Blocking in Record Linkage. In PAKDD. Springer, Melbourne, 262–274.
[5] Aaron Ceglar and John F. Roddick. 2006. Association mining. ACM Computing Surveys (CSUR) 38, 2 (2006), 5.
[6] Peter Christen. 2012. Data Matching – Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer, Heidelberg.
[7] Peter Christen, Thilina Ranbaduge, and Rainer Schnell. 2020. Linking Sensitive Data. Springer, Heidelberg.
[8] Peter Christen and Dinusha Vatsalan. 2013. Flexible and extensible generation and corruption of personal data. In CIKM. ACM, San Francisco, 1165–1168.
[9] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Wiley JRSS-B 39, 1 (1977), 1–22.
[10] Janez Demšar. 2006. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7 (2006), 1–30.
[11] Destatis. 2021. The Federal Statistical Office of Germany. https://www.destatis.de
[12] Xin L. Dong and Divesh Srivastava. 2015. Big Data Integration. Morgan and Claypool Publishers.
[13] Ted Enamorado, Benjamin Fifield, and Kosuke Imai. 2019. Using a probabilistic model to assist merging of large-scale administrative records. American Political Science Review 113, 2 (2019), 353–371.
[14] European Union. 2016. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the Protection of Natural Persons with Regard to the Processing of Personal Data and on the Free Movement of Such Data, and Repealing Directive 95/46/EC (General Data Protection Regulation). Official Journal of the European Union (L 119) 59 (2016), 1–88.
[15] Ivan P. Fellegi and Alan B. Sunter. 1969. A Theory for Record Linkage. Journal of the American Statistical Association 64, 328 (1969), 1183–1210.
[16] John Ferguson, Ailish Hannigan, and Austin Stack. 2018. A new computationally efficient algorithm for record linkage with field dependency and missing data imputation. Elsevier IJMI 109 (2018), 70–75.
[17] Zhichun Fu, Peter Christen, and Jun Zhou. 2014. A Graph Matching Method for Historical Census Household Linkage. In PAKDD. Springer, Tainan, Taiwan, 485–496.
[18] Lise Getoor and Ashwin Machanavajjhala. 2012. Entity Resolution: Theory, Practice and Open Challenges. VLDB Endowment 5, 12 (2012), 2018–2019.
[19] Harvey Goldstein and Katie Harron. 2015. Record linkage: a missing data problem. In Methodological Developments in Data Linkage. John Wiley & Sons, 110–124.
[20] Jiawei Han, Micheline Kamber, and Jian Pei. 2011. Data Mining: Concepts and Techniques (3rd ed.). Morgan Kaufmann.
[21] David Hand and Peter Christen. 2018. A note on using the F-measure for evaluating record linkage algorithms. Statistics and Computing 28, 3 (2018), 539–547.
[22] Thomas N. Herzog, Fritz Scheuren, and William E. Winkler. 2007. Data Quality and Record Linkage Techniques. Springer Verlag.
[23] Rosemary Karmel, Phil Anderson, et al. 2010. Empirical aspects of record linkage across multiple data sets using statistical linkage keys: the experience of the PIAC cohort study. BMC Health Services Research 10, 1 (2010), 41.
[24] Hanna Köpcke, Andreas Thor, and Erhard Rahm. 2010. Evaluation of entity resolution approaches on real-world match problems. VLDB Endowment 3, 1-2 (2010), 484–493.
[25] Thomas Körner, Anja Krause, and Kathrin Ramsauer. 2019. Requirements and perspectives on the way to a future register census [in German: Anforderungen und Perspektiven auf dem Weg zu einem künftigen Registerzensus]. Wirtschaft und Statistik Special Issue (2019), 74–87.
[26] Roderick J. A. Little and Donald B. Rubin. 2020. Statistical Analysis with Missing Data (3rd ed.). Wiley, Hoboken.
[27] Ministry of Justice (MoJ). 2020. splink: Probabilistic record linkage and deduplication at scale. https://github.com/moj-analytical-services/splink/
[28] Toan C. Ong, Michael V. Mannino, Lisa M. Schilling, and Michael G. Kahn. 2014. Improving record linkage performance in the presence of missing linkage data. Elsevier JBI 52 (2014), 43–54.
[29] Sean Randall, Adrian P. Brown, Anna M. Ferrante, and James H. Boyd. 2019. Privacy preserving linkage using multiple dynamic match keys. IJPDS 4, 1 (2019).
[30] Rainer Schnell, Christian Borgs, and Virginia Wedekind. 2021. Simulating a register-based census: An expertise [in German: Durchführung einer Simulationsstudie als Gutachten im Bereich Registerzensus]. Technical report for Destatis. University of Duisburg-Essen.
[31] Paolo Valente. 2010. Census Taking in Europe: How Are Populations Counted in 2010? Population & Societies 467 (2010), 1–4.
[32] Jiannan Wang, Guoliang Li, Jeffrey Xu Yu, and Jianhua Feng. 2011. Entity matching: How similar is similar. VLDB Endowment 4, 10 (2011), 622–633.
[33] William E. Winkler. 2000. Frequency-based matching in Fellegi-Sunter model of record linkage. US Bureau of the Census Statistical Research Division 14 (2000).
[34] Yuhang Zhang, Kee Siong Ng, Tania Churchill, and Peter Christen. 2018. Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases. In CIKM. ACM, Turin, 2213–2221.
[35] Yinghao Zhang, Senlin Xu, Mingfan Zheng, and Xinran Li. 2020. Field Weights Computation for Probabilistic Record Linkage in Presence of Missing Data. International Journal of Pattern Recognition and Artificial Intelligence 34, 14 (2020).