A Hierarchical Graphical Model for Record Linkage
Total Page:16
File Type:pdf, Size:1020Kb
A Hierarchical Graphical Model for Record Linkage Pradeep Ravikumar William W. Cohen Center for Automated Center for Automated Learning and Discovery, Learning and Discovery, School of Computer Science, School of Computer Science, Carnegie Mellon University Carnegie Mellon University [email protected] [email protected] Abstract edit distances as a general-purpose record matching scheme was proposed by Monge and Elkan [10, 9], and The task of matching co-referent records is in previous work [2, 3], we developed a toolkit of vari- known among other names as record link- ous string-distance based methods for matching entity- age. For large record-linkage problems, of- names. The AI community has focused on applying ten there is little or no labeled data available, supervised learning to the record-linkage task | for but unlabeled data shows a reasonably clear learning the parameters of string-edit distance metrics structure. For such problems, unsupervised [14, 1] and combining the results of di®erent distance or semi-supervised methods are preferable to functions [15, 4, 1]. More recently, probabilistic object supervised methods. In this paper, we de- identi¯cation methods have been adapted to matching scribe a hierarchical graphical model frame- tasks [12]. In statistics, a long line of research has been work for the record-linkage problem in an un- conducted in probabilistic record linkage, largely based supervised setting. In addition to propos- on the seminal paper by Fellegi and Sunter [5]. ing new methods, we also cast existing un- In this paper, we follow the Fellegi-Sunter approach of supervised probabilistic record-linkage meth- treating the record-linkage problem as a classi¯cation ods in this framework. Some of the tech- task, where the basic goal is to classify record-pairs as niques we propose to minimize over¯tting in matching or non-matching. Many record-linkage prob- the above model are of interest in the gen- lems are quite large, such as the matching of individu- eral graphical model setting. We describe a als and/or families between samples and censuses, e:g:, method for incorporating monotonicity con- in the evaluation of the coverage of the U.S. decennial straints in a graphical model. We also outline census. Often for such large problems, there is little a bootstrapping approach of using \single- or no labeled data available, but unlabeled data shows ¯eld" classi¯ers to noisily label latent vari- reasonably clear structure. For such problems, unsu- ables in a hierarchical model. Experimental pervised or semi-supervised methods are preferable to results show that our proposed unsupervised supervised methods. methods perform quite competitively even with fully supervised record-linkage methods. In this paper, we describe a hierarchical graphical model framework for approaching this problem. In addition to proposing new methods, we also cast ex- 1 Introduction isting unsupervised probabilistic record-linkage meth- ods in the framework. The proposed graphical model Databases frequently contain multiple records that re- has (k + 1) latent variables for records with k ¯elds, fer to the same entity, but are not identical. The and hence ¯tting it to the data with minimal over- task of matching such co-referent records has been ex- ¯tting is a non-trivial task. We outline approaches to plored by a number of communities, including statis- deal with this estimation problem in Section 4, some of tics, databases, and arti¯cial intelligence. Each com- which could also be utilized in more general graphical munity has formulated the problem di®erently, and dif- model applications. We address the problem of in- ferent techniques have been proposed. corporating monotonicity constraints into a graphical model, which should be helpful in reducing over¯tting In the database community, some work on record in complex generative models where such constraints matching has been based on knowledge-intensive ap- exist. We also outline a bootstrapping approach of proaches [7, 6, 13]. More recently, the use of string- using \single-¯eld" classi¯ers to assign noisy labels to M: MATCH CLASS LATENT VARIABLE M the latent variables in a hierarchical model. Results show that this enables us to capture constraints in the multi-¯eld-record data more e®ectively. We also note that the proposed hierarchical model could be used to address the general problem of ¯tting a graphical FEATURE VECTOR model to continuous data (Section 7). F1 F2 F3 F4 Experimental results show that our proposed unsuper- vised methods are competitive with fully supervised record-linkage methods. Figure 2: Model with a single latent match-class 2 Preliminaries basic approaches to deal with continuous variables in a Given two lists of records, A and B, we look at the the graphical model: we can restrict to speci¯c families of task of detecting the matching record-pairs (a; b) 2 A£ parametric distributions e:g:, Gaussian mixture mod- B.A record is basically a vector of ¯elds, e:g:; Figure 1. els, or we can discretize the variables and learn the Thus, a record-pair is essentially a vector of ¯eld-pairs. model over the discrete domain. More generally, we can represent a record-pair (a; b) as Discretization: The f feature vector is discretized to a vector of features, often called a comparison vector: a discrete-valued vector ~w. For example, one way of f(a; b) = f1(a; b); :::; fk(a; b), where f1; : : : ; fk are the discretizing into binary values is: features. ½ 1 if fi > i wi = (1) Unless speci¯ed otherwise, we will consider the record- 0 if fi · i pair feature vector to be a vector of distances, one for One could then ¯t a multinomial probability model to each ¯eld-pair. If there are k ¯elds, we denote the the graphical model in Figure 2, as the observation feature vector by f, where f is the distance feature for i values for the bottom layer are now discrete. This the ith ¯eld. approach of binarization of the distance-features has The record-linkage problem is the classi¯cation task been adopted by Winkler et al [16, 17] and is one of of assigning the record-pair feature vectors to a label the baseline methods we compare our model to. \matching" or \non-matching". Denote the match- While the above approach enables using the e±cient class by a binary variable M, where M = 0 indicates estimation and inference machinery of discrete graphi- a non-match and M = 1 indicates a match. The goal cal models, it has the problem that discretization into of probabilistic record-linkage is to formulate a prob- a small number of values leads to a poor approximation abilistic model for the match-class M and the feature of the continuous distribution. However, if we increase vector f, and use the same to estimate the probability the number of discrete values d, there is a potential of the match class given the record-pair feature vector, explosion in the number of parameters of the model. P (Mjf). In an unsupervised setting, this amounts to In the graphical model of Figure 2, if the average estimating a generative model for (f;M). number of parents for a node is q, and each node has d values, then the number of multinomial parameters 3 Graphical Models for existing to estimate for k such nodes is O(kdq). This might Record-Linkage Methods cause standard estimation methods to over¯t the data. Existing unsupervised methods for probabilistic Using speci¯c parametric families: Instead record-linkage use a generative model for the record- of using the discrete-valued multinomial distribution pair feature vector with a single latent match-class for the variables, we could use other parametric variable, as shown in Figure 2. Note that the genera- families which allow continuous values, and for which tive model is for the record-pair feature vector rather there exists an e±cient machinery for estimation and than the record-pair itself. Some other generative inference, such as the Gaussian distribution. models for classi¯cation such as Naive-Bayes and Tree- Gaussian distributions have the added advantage Augmented Naive Bayes form special cases of Figure 2. that a mixture of m Gaussians can model any prob- The predominant problem with the graphical model ability distribution to arbitrary accuracy, provided in Figure 2 is that the fi feature values are contin- m is large enough [8]. This suggests the following uous, which precludes the normal multinomial prob- semi-supervised approach: we cluster the unlabeled ability model for a Bayesian network. There are two feature-vectors using a Gaussian mixture model. We NAME1 ADDRESS1 CITY1 OCCUP1 SEX1 RECORD 1 NAME2 ADDRESS2 CITY2 OCCUP2 SEX2 RECORD 2 dist(name1,name2) Address City Occup. Sex RECORD−PAIR = 0.8 Match Match Non−match Match FEATURE VECTOR Figure 1: Records and Record Pairs MATCH CLASS LATENT VARIABLE M k ¯elds. Rationale for the hierarchical model: While the intermediate latent variables as described above are operationally free to take any value, superimposing a LATENT MATCH−CLASS VARIABLES certain semantic interpretation upon the latent vari- X1X2 X3 X4 ables shall give an intuition for the hierarchical model, as well as allow us to constrain the model in order to make the estimation of the structure and parameters easier. F1 F2 F3 F4 FEATURE VECTOR Speci¯cally, one could interpret the binary-valued mid- dle layer x nodes in Figure 3 as latent match vari- ables for each ¯eld. Thus, each node x in the middle Figure 3: Hierarchical generative model for Record i layer corresponds to the match-class of a single ¯eld- Linkage pair distance feature fi. The top node in Figure 3 is the record-match class latent variable, which gives then use the limited labeled training data to label the match class of the entire record-pair, and which the clusters.