Learning Blocking Schemes for Record Linkage∗
Matthew Michelson and Craig A. Knoblock
University of Southern California
Information Sciences Institute, 4676 Admiralty Way
Marina del Rey, CA 90292 USA
{michelso,knoblock}@isi.edu

∗This research was sponsored by the Air Force Research Laboratory, Air Force Material Command, USAF, under Contract number FA8750-05-C-0116. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of AFRL or the U.S. Government. Copyright © 2006, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Record linkage is the process of matching records across data sets that refer to the same entity. One issue within record linkage is determining which record pairs to consider, since a detailed comparison between all of the records is impractical. Blocking addresses this issue by generating candidate matches as a preprocessing step for record linkage. For example, in a person matching problem, blocking might return all people with the same last name as candidate matches. Two main problems in blocking are the selection of attributes for generating the candidate matches and deciding which methods to use to compare the selected attributes. These attribute and method choices constitute a blocking scheme. Previous approaches to record linkage address the blocking issue in a largely ad-hoc fashion. This paper presents a machine learning approach to automatically learn effective blocking schemes. We validate our approach with experiments that show our learned blocking schemes outperform the ad-hoc blocking schemes of non-experts and perform comparably to those manually built by a domain expert.

Introduction

Record linkage is the process of matching records between data sets that refer to the same entity. For example, given databases of AI researchers and Census data, record linkage finds the common people between them, as in Figure 1. Since record linkage needs to compare each record from each dataset, scalability is an issue. Consider the above case, and assume each dataset consists of just 5,000 records. This would require 25 million detailed comparisons if all records are compared. Clearly, this is impractical.

To deal with this issue, a preprocessing step compares all of the records between the data sets with fast, approximate methods to generate candidate matches. This step is called blocking because it partitions the full cross product of record comparisons into mutually exclusive blocks (Newcombe 1967). That is, to block on an attribute, first we sort or cluster the data sets by the attribute. Then we apply the comparison method to only a single member of a block. After blocking, the candidate matches are examined in detail to discover true matches. This paper focuses on blocking.

[Figure 1: Record linkage example]

There are two main goals of blocking. First, the number of candidate matches generated should be small to minimize the number of detailed comparisons in the record linkage step. Second, the candidate set should not leave out any possible true matches, since only record pairs in the candidate set are examined in detail during record linkage. These blocking goals represent a trade-off. On the one hand, the goal of record linkage is to find all matching records, but the process also needs to scale. This makes blocking a challenging problem.

Most blocking techniques rely on the multi-pass approach of (Hernández & Stolfo 1998). The general idea of the multi-pass approach is to generate candidate matches using different attributes and methods across independent runs. For example, using Figure 1, one blocking run could generate candidates by matching tokens on the last name attribute along with the first letter of the first name. This returns one true match (M Michelson pair) and one false match (J Jones pair). Another run could match tokens on the zip attribute. Intuitively, different runs cover different true matches, so the union should cover most of the true matches.

The effectiveness of a multi-pass approach depends on which attributes are chosen and the methods used. For example, generating candidates by matching the zip attribute returns all true matches, but also, unfortunately, every possible pair. Method choice is just as important. If we change our last name and first name rule to match on the first three letters of each attribute, we drop the false match (J Jones). While past research has improved the methods for comparing attributes, such as bi-gram indexing (Baxter, Christen, & Churches 2003), or improved the ways in which the data is clustered into blocks (McCallum, Nigam, & Ungar 2000), it has not addressed the more fundamental issue of blocking research: which attributes should be used and which methods should be applied to the chosen attributes. As (Winkler 2005) puts it, “most sets of blocking criteria are found by trial-and-error based on experience.”

This paper addresses this fundamental question of blocking research. We present a machine learning approach to discovering which methods and which attributes generate a small number of candidate matches while covering as many true matches as possible, fulfilling both goals for blocking. We believe these goals, taken together, are what define effective blocking, and their optimization is the goal we pursue.

The outline of this paper is as follows. In the next section we further define the multi-pass approach and the choice of which attributes and methods to use for blocking. After that, we describe our machine learning approach to blocking. Then, we present some experiments to validate our idea. Following that we describe some related work and we finish with some conclusions and future work.

Blocking Schemes

We can view the multi-pass approach as a disjunction of conjunctions, which we call a Blocking Scheme. The disjunction is the union of each independent run of the multi-pass approach. The runs themselves are conjunctions, where each conjunction is an intersection between {method, attribute} pairs.

Using our example from the introduction, one conjunction is ({first-letter, first name} ∧ {token-match, last name}) and the other is ({token-match, zip}), so our blocking scheme becomes:

BS = ({first-letter, first name} ∧ {token-match, last name}) ∪ ({token-match, zip})

A blocking scheme should include enough conjunctions to cover as many true matches as it can. For example, the first conjunction by itself does not cover all of the true matches, so we added the second conjunction to our blocking scheme. This is the same as adding more independent runs to the multi-pass approach.

However, since a blocking scheme includes as many conjunctions as it needs, these conjunctions should limit the number of candidates they generate. As we stated previously, the zip attribute might return the true positives, but it also returns every record pair as a candidate. By adding more {method, attribute} pairs to a conjunction, we can limit the number of candidates it generates.

Therefore effective blocking schemes should learn conjunctions that minimize the false positives, but learn enough of these conjunctions to cover as many true matches as possible. These two goals of blocking can be clearly defined by the Reduction Ratio and Pairs Completeness (Elfeky, Verykios, & Elmagarmid 2002).

The Reduction Ratio (RR) quantifies how well the current blocking scheme minimizes the number of candidates. Let C be the number of candidate matches and N be the size of the cross product between both data sets:

RR = 1 − C/N

It should be clear that adding more {method, attribute} pairs to a conjunction increases its RR, as when we changed ({token-match, zip}) to ({token-match, zip} ∧ {token-match, first name}).

Pairs Completeness (PC) measures the coverage of true positives, i.e., how many of the true matches are in the candidate set versus those in the entire set. If Sm is the number of true matches in the candidate set, and Nm is the number of matches in the entire dataset, then:

PC = Sm/Nm

Adding more disjuncts can increase our PC. For example, we added the second conjunction to our example blocking scheme because the first did not cover all of the matches.

So, if our blocking scheme can optimize both PC and RR, then it can fulfill both of our defined goals for blocking: it can reduce the candidate matches for record linkage without losing matches, which would hinder the accuracy of the record linkage.

Generating Blocking Schemes

To generate our blocking schemes, we use a modified version of the Sequential Covering Algorithm (SCA), which discovers disjunctive sets of rules from labeled training data (Mitchell 1997). Intuitively, SCA learns a conjunction of attributes, called a rule, that covers some set of positive examples from the training data. Then it removes these covered positive examples and learns another rule, repeating this step until it can no longer discover a rule with performance above a threshold. In our case, the positive examples for SCA are matches between the data sets, and the attributes SCA chooses from in the algorithm are {method, attribute} pairs. The result of running SCA in our case will be a blocking scheme.

SCA, in its classic form, is shown in Table 1.

Table 1: Classic Sequential Covering Algorithm
SEQUENTIAL-COVERING(class, attributes, examples, threshold)
LearnedRules ← {}
Rule ← LEARN-ONE-RULE(class, attributes, examples)
While (PERFORMANCE(Rule) > threshold) do

SCA maps naturally to learning blocking schemes. At each iteration of the while loop, the LEARN-ONE-RULE step learns a conjunction that maximizes the reduction ratio
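To make the blocking-scheme semantics concrete, the following sketch applies a disjunction of conjunctions of {method, attribute} pairs to toy records. The `token_match` and `first_letter` functions are illustrative stand-ins for the methods named in the example scheme, and the exhaustive pairwise loop is for clarity only; a practical implementation would index blocking keys rather than enumerate the cross product.

```python
from itertools import product

# Illustrative blocking "methods": each maps an attribute value to a set of
# keys; two values agree under a method when their key sets intersect.
def token_match(value):
    return set(value.lower().split())

def first_letter(value):
    return {value[:1].lower()} if value else set()

def conjunction_candidates(records_a, records_b, conjunction):
    """Pairs agreeing on EVERY {method, attribute} pair in the conjunction.
    The exhaustive loop is for illustration; real blocking indexes keys."""
    return {
        (i, j)
        for (i, a), (j, b) in product(enumerate(records_a), enumerate(records_b))
        if all(method(a[attr]) & method(b[attr]) for method, attr in conjunction)
    }

def blocking_scheme_candidates(records_a, records_b, scheme):
    """A blocking scheme is a disjunction (union) of conjunctions."""
    candidates = set()
    for conjunction in scheme:
        candidates |= conjunction_candidates(records_a, records_b, conjunction)
    return candidates

# BS = ({first-letter, first name} ∧ {token-match, last name}) ∪ ({token-match, zip})
scheme = [
    [(first_letter, "first name"), (token_match, "last name")],
    [(token_match, "zip")],
]
census = [{"first name": "Matthew", "last name": "Michelson", "zip": "90292"}]
researchers = [
    {"first name": "M.", "last name": "Michelson", "zip": "90292"},
    {"first name": "J.", "last name": "Jones", "zip": "11111"},
]
print(blocking_scheme_candidates(census, researchers, scheme))  # {(0, 0)}
```

As in the running example, the first conjunction matches the M Michelson pair while rejecting J Jones, and the union with the zip conjunction adds no false candidates here.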
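Both metrics are straightforward to compute. A minimal sketch, reusing the 5,000 × 5,000 record scenario from the introduction with hypothetical candidate and match counts:

```python
def reduction_ratio(num_candidates, size_a, size_b):
    """RR = 1 - C/N, where N = |A| * |B| is the size of the cross product."""
    return 1.0 - num_candidates / (size_a * size_b)

def pairs_completeness(matches_in_candidates, matches_total):
    """PC = Sm/Nm: true matches surviving blocking over all true matches."""
    return matches_in_candidates / matches_total

# Two 5,000-record data sets; the candidate and match counts are made up
# for illustration.
rr = reduction_ratio(50_000, 5_000, 5_000)  # 1 - 50,000/25,000,000 = 0.998
pc = pairs_completeness(900, 1_000)         # 0.9
print(rr, pc)
```

A scheme that generates only 50,000 of the 25 million possible pairs achieves a high RR, but it forfeits PC for every true match those candidates omit, which is exactly the trade-off the learner must balance.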
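The covering loop of Table 1 can be sketched as follows. This is a simplification that replaces LEARN-ONE-RULE with greedy selection from a fixed pool of candidate rules; the pool, the scoring function, and the threshold here are assumptions for illustration (the paper's modified SCA learns conjunctions of {method, attribute} pairs and scores them by reduction ratio).

```python
def sequential_covering(positives, candidate_rules, performance, threshold):
    """Greedy covering sketch: pick the best-scoring rule, remove the
    positive examples it covers, and repeat until no rule scores above
    the threshold or every positive example is covered."""
    learned = []
    remaining = set(positives)
    while remaining:
        best = max(candidate_rules, key=lambda rule: performance(rule, remaining))
        if performance(best, remaining) <= threshold:
            break
        learned.append(best)
        remaining -= {example for example in remaining if best(example)}
    return learned

# Toy run: "positive examples" are integers, rules are predicates, and
# performance is the number of still-uncovered examples a rule covers.
coverage = lambda rule, examples: sum(1 for example in examples if rule(example))
is_even = lambda x: x % 2 == 0
is_small = lambda x: x < 3
rules = sequential_covering({1, 2, 3, 4}, [is_even, is_small], coverage, 0)
# learns is_even (covers 2 and 4), then is_small (covers 1); 3 stays uncovered
```

The disjunction of the learned rules plays the role of the blocking scheme: each learned rule corresponds to one conjunction, and uncovered positives are the true matches the scheme would miss.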