Adaptive Name Matching in Information Integration
Total Page:16
File Type:pdf, Size:1020Kb
Information Integration on the Web Adaptive Name Matching in Information Integration Mikhail Bilenko and Raymond Mooney, University of Texas at Austin William Cohen, Pradeep Ravikumar, and Stephen Fienberg, Carnegie Mellon University hen you combine information from heterogeneous information sources, you W must identify data records that refer to equivalent entities. However, records that describe the same object might differ syntactically—for example, the same person can be referred to as “William Jefferson Clinton” and “bill clinton.” Figure 1 presents Identifying more complex examples of duplicate records that are tant in duplicate identification. We examine a few effec- not identical. tive and widely used metrics for measuring similarity. approximately Variations in representation across information sources can arise from differences in formats that Edit distance duplicate database store data, typographical and optical character recog- An important class of such metrics are edit dis- nition (OCR) errors, and abbreviations. Variations tances. Here, the distance between strings s and t is records that refer to are particularly pronounced in data that’s automati- the cost of the best sequence of edit operations that cally extracted from Web pages and unstructured or converts s to t. For example, consider mapping the the same entity is semistructured documents, making the matching task string s = “Willlaim” to t = “William” using these essential for information integration on the Web. edit operations: essential for Researchers have investigated the problem of identifying duplicate objects under several monikers, • Copy the next letter in s to the next position in t. information including record linkage, merge-purge, duplicate • Insert a new letter in t that does not appear in s. detection, database hardening, identity uncertainty, • Substitute a different letter in t for the next letter integration. The coreference resolution, and name matching. Such in s. diversity reflects research in several areas: statistics, • Delete the next letter in s; that is, don’t copy it to t. authors compare and databases, digital libraries, natural language pro- cessing, and data mining. The sidebar summarizes Table 1 shows one possible sequence of opera- describe methods for various traditional approaches to name matching. tions. (The vertical bar represents a “cursor” in s or Our research explores approaches to the name- t, indicating the next letter.) If the copy operation has combining and matching problem that improve accuracy. Particu- cost zero and all other operations have cost one, this larly, we employ methods that adapt to a specific is the least expensive such sequence for s and t,so learning similarity domain by combining multiple string similarity the edit distance between s and t is three. methods that capture different notions of similarity. A fairly efficient scheme exists for computing the measures for name lowest-cost edit sequence for these operations. The Static string similarity metrics trick is to consider a slightly more complex function, matching. Most methods we discuss in the sidebar use metrics D(s, t, i, j), which is the edit distance between the that measure the similarity between individual record first i letters in s and the first j letters in t. Let si denote fields. As Figure 1 suggests, individual record fields are the ith letter of s, and, similarly, let tj be the jth letter often stored as strings. This means that functions that of t. It’s easy to see that we can define D(s, t, i, j) accurately measure two strings’ similarity are impor- recursively, where D(s, t,0,0) = 0 and 2 1094-7167/03/$17.00 © 2003 IEEE IEEE INTELLIGENT SYSTEMS Published by the IEEE Computer Society Ds(),,, t i j Name Address City Phone Cuisine ()−− = Ds,, ti11 , jif sij t , and youcopy s ij to t Fenix 8358 Sunset Blvd. West Hollywood 213/848-6677 American −−+ Ds(),, t i111 , jif yousubstitute tji for s = min Fenix at the Argyle 8358 Sunset Blvd. W. Hollywood 213-848-6677 French (New) Ds(),,, t i j−1 if you insert the letter t j − (a) Ds(),, t i1 , jif youdelete the letter si (1) L.P. Kaelbling. An architecture for intelligent reactive systems. In Reasoning About Actions and Plans: Proceedings of the 1986 Workshop. Morgan Kaufmann, 1986 We can evaluate this recursive definition Kaelbling, L.P., 1987. An architecture for intelligent reactive systems. In M.P. Georgeff & A.L. efficiently using dynamic programming tech- Lansky, eds., Reasoning about Actions and Plans, Morgan Kaufmann, Los Altos, CA, pp. 395–410 niques. Specifically, for a fixed s and t,we (b) can store the D(s, t, i, j) values in a matrix that is filled in a particular order. The total Figure 1. Sample duplicate records from (a) a restaurant database and (b) a scientific computational effort for D(s, t,|s|, |t|), the edit citation database. distance between s and t, is thus approxi- mately O(|s||t|). for s′ and t′. The Jaro metric for s and t is Table 1. An example of an Edit distance metrics are widely used— edit-distance computation. not only for text processing but also for bio- ′ ′ ′ − stOperation 1 s t sTst′′, logical sequence alignment1—and many Jaro() s, t =⋅ + + . 3 s t s′ |Willlaim | variations are possible. The simple unit-cost W|illlaim W| Copy “W” metric just described is usually called Lev- enstein distance. The Needleman-Wunsch To better understand the intuition behind Wi|lllaim Wi| Copy “i” distance is a natural extension to Levenstein this metric, consider the matrix M in Figure Wil|llaim Wil| Copy “l” distance that introduces additional parame- 2, which compares the strings s = “WILL- Will|laim Will| Copy “l” ters defining each possible character substi- LAIM” and t = “WILLIAM.” The boxed tution’s cost and the cost of insertions and entries are the main diagonal, and M(i, j) = 1 Willl|aim Will| Delete “l” deletions. We can implement it straightfor- if and only if the ith character of s equals the Willla|im Willi| Substitute “i” for “a” wardly by modifying Equation 1—replacing jth character of t. The Jaro metric is based on Willlai|m Willia| Substitute “a” for “i” the second term of the min with something the number of characters in s that are in com- such as D(s, t, i − 1, j − 1) + substitution- mon with t. In terms of the matrix M of the Willlaim| William| Copy “m” Cost(si, tj). The Smith-Waterman distance figure, the ith character of s is in common lets us easily modify the metric to discount with t if Mi,j = 1 for some entry (i, j) that is mismatching text at the beginning and the “sufficiently close” to M’s main diagonal, WI LLI AM end of strings. The affine gap cost is another where sufficiently close means that |i − j| < W 10 0 0000 widely used variation that introduces two min(|s|, |t|)/2 (shown in the matrix in bold). I 01 0 0 100 costs for insertion: one for inserting the first William Winkler proposed a variant of the L 00 11000 character and a second (usually lower) for Jaro metric that also uses the length P of the L00 11000 inserting additional characters. longest common prefix of s and t.6 Letting L0011000 In information integration, Alvaro Monge P′ = max(P, 4), we define A0000010 and Charles Elkan performed several exper- I0100100 iments in which they used an affine-cost vari- Jaro-Winkler(s, t) = Jaro(s, t) + M0000001 ant of the Smith-Waterman distance to per- (P′/10) ⋅ (1 − Jaro(s, t)). form duplicate detection.2,3 This emphasizes matches in the first few char- Figure 2. The Jaro metric. The boxed entries are the main diagonal, and each The Jaro metric and variants acters. The Jaro and Jaro-Winkler metrics bold character is in common with the Another effective similarity metric is the seem to be intended primarily for short strings string “WILLIAM” (“WILLLAIM”). Jaro metric,which is based on the number (for example, personal first or last names). and order of common characters between 4–6 … two strings. Given strings s = a1 aK and Token-based and hybrid distances based metric is Jaccard similarity. Between … t = b1 bL, define a character ai in s to be In many situations, word order is unim- the word sets S and T,Jaccard similarity is “in common” with t iff there is a bj = ai in t portant. For instance, the strings “Ray simply (|S ∩ T|)/(|S ∪ T|). We can define such that i − H ≤ j ≤ i + H,where H = Mooney” and “Mooney, Ray” are likely to term frequency-inverse document frequency ′ = ′′ min(|s|,|t|)/2. Let saa1... k be the charac- be duplicates, even if they aren’t close in edit (TF-IDF) or cosine similarity—which the ters in s that are common with t (in the same distance. In such cases, we might convert the information retrieval community widely ′ = ′′ order they appear in s), and let tbb1... L′ strings s and t to token multisets (where each uses—as be analogous. Then define a transposition for token is a word) and consider similarity met- ()= ()⋅ () s′, t′ to be a position i such that ai = bi. Let rics on these multisets. TF-, IDF S T∑ V w , S V w , T , ∈∩ Ts′,t′ be one-half the number of transpositions One simple and often effective token- wS T SEPTEMBER/OCTOBER 2003 computer.org/intelligent 3 Information Integration on the Web Traditional Name-Matching Approaches Record linkage—the task of matching equivalent records blocks (“windows”) in which candidates for matching lie.3 Alter- that differ syntactically—was first explored in the late 1950s natively, the canopies method uses a computationally cheap and and 1960s.1 Ivan Fellegi and Alan Sunter’s seminal paper— general string similarity metric such as term frequency-inverse where they studied record linkage in the context of matching data frequency (TF-IDF) cosine similarity to create overlapping population records—provides a theoretical foundation for record clusters that contain possible matching pairs.4 subsequent work on the problem.2 They described several key We can roughly categorize methods for improving matching insights that still lie at the base of many modern name-match- accuracy by how much human expertise they require and the ing systems: extent to which they use machine learning and probabilistic methods.