Efficient Record Linkage in Large Data Sets

Liang Jin, Chen Li, and Sharad Mehrotra
Department of Information and Computer Science
University of California, Irvine, CA 92697, USA
{liangj,chenli,sharad}@ics.uci.edu

Abstract

This paper describes an efficient approach to record linkage. Given two lists of records, the record-linkage problem consists of determining all pairs that are similar to each other, where the overall similarity between two records is defined based on domain-specific similarities over individual attributes constituting the record. The record-linkage problem arises naturally in the context of data cleansing that usually precedes data analysis and mining. We explore a novel approach to this problem. For each attribute of records, we first map values to a multidimensional Euclidean space that preserves domain-specific similarity. Many mapping algorithms can be applied, and we use the FastMap approach as an example. Given the merging rule that defines when two records are similar, a set of attributes are chosen along which the merge will proceed. A multidimensional similarity join over the chosen attributes is used to determine similar pairs of records. Our extensive experiments using real data sets show that our solution has very good efficiency and accuracy.

1 Introduction

The record-linkage problem – identifying and linking duplicate records – arises in the context of data cleansing, which is a necessary pre-step to many database applications. Databases frequently contain approximately duplicate fields and records that refer to the same real-world entity, but are not identical, as illustrated by the following example.

EXAMPLE 1.1 A hospital has a database with thousands of patient records. Every year it receives new patient data from other sources, such as the government or local organizations. It is important for the hospital to link the records in its own database with the data from other sources. However, the same information (e.g., name, SSN, address, telephone number) can often be represented in different formats. For instance, a patient name can be represented as "Tom Franz" or "Franz, T." or other forms. In addition, there could be typos in the data. The main task here is to link records from different databases in the presence of mismatched information.

With the increasing importance of data linkage in a variety of data-analysis applications, developing effective and efficient techniques for record linkage has emerged as an important problem. It is further evidenced by the emergence of numerous organizations (e.g., Trillium, FirstLogic, Vality, DataFlux) that are developing specialized domain-specific record-linkage and data-cleansing tools.

Three primary challenges arise. First, it is important to determine similarity functions to link two records as duplicates [2, 17]. Such a similarity function consists of two levels. First, similarity metrics need to be defined at the level of each field to determine the similarity of different values. Next, field-level similarity metrics need to be combined to determine the overall similarity between two records. The second challenge is to provide user-friendly interactive tools for users to specify different transformations, and to use the feedback to improve the quality of the linkage [3, 5, 15]. The third challenge is that of scale. Often it is computationally prohibitive to apply a nested-loop approach that uses the similarity function(s) to compute the distance between each pair of records. This issue has previously been studied in [7], which proposed an approach that first merges two given lists of records, then sorts the records based on lexicographic ordering for each "key" attribute. A fixed-size sliding window is applied, and records within a window are checked to determine if they are duplicates using a merging rule for determining similarity between records. Notice that, in general, records with similar values for a given field might not appear close to each other in lexicographic ordering. The effectiveness of the approach rests on the expectation that if two records are duplicates, they will appear lexicographically close to each other in the sorted list based on at least one key. Even if we choose multiple keys, the approach is still susceptible to deterministic data-entry errors, e.g., the first character of a key attribute is always erroneous. It is also difficult to determine a value of the window size that provides a "good" tradeoff between accuracy and performance. In addition, it might not be easy to choose good keys to bring similar records close.

Recently, many new techniques with a direct bearing on the record-linkage problem have been developed. In particular, techniques to map similarity spaces into similarity/distance-preserving multidimensional Euclidean spaces have been developed. Furthermore, many efficient multidimensional similarity joins have been studied. In this paper, we develop an efficient strategy for record linkage that exploits these advances. We first consider the situation where a record consists of a single attribute. The record-linkage problem then reduces to linking similar strings in two string collections based on a given similarity metric. We propose a two-step solution for duplicate identification. In the first step, we combine the two sets of records, and map them into a high-dimensional Euclidean space. (A similar strategy is used in [14] to map event sequences.) Many mapping techniques can be used. As an example, in this paper we focus on the FastMap algorithm [4] due to its simplicity and efficiency. We develop a linear algorithm called "StringMap" to do the mapping; it is independent of the distance metric.

In the second step, we find similar object pairs in the Euclidean space whose distance is within a new threshold, which is closely related to the old threshold on string distances. Again, many similarity-join algorithms can be used, and we use the algorithm proposed by Hjaltason and Samet [8] as an example. For each object pair in the result of the second step, we check their corresponding strings to see if their original distance is within the original threshold, and thus find the similar-record pairs. Our approach has the following advantages. (1) It is "open" to many mapping functions (the first step) and high-dimensional similarity-join algorithms (the second step). (2) It does not depend on a specific similarity function of strings. (3) Extensive experiments on real large data sets show this approach has very good efficiency and accuracy (greater than 99%).

We next study mechanisms for record linkage when many attributes are used to determine the overall record similarity. Specifically, we consider merging rules expressed as logical expressions over similarity predicates based on individual attributes. Such rules would be generated, for example, if a classifier such as a decision tree was used to train an overall similarity function between records using a labeled training set. Given a merging rule, we choose a set of attributes over which the similarity join is performed (as discussed in the single-attribute case), such that similar pairs can be identified with minimal cost.

This work is conducted in the Flamingo Data-Cleansing Project at UC Irvine. The rest of the paper is organized as follows. Section 2 gives the formulation of the problem. Section 3 presents the first step of our solution, which maps strings to a Euclidean space. Section 4 presents the second step, which conducts a similarity join in the high-dimensional space. Section 5 studies how to solve the record-linkage problem if we have a merging rule over multiple attributes. In Section 6 we give the results of our extensive experiments. We conclude in Section 7.

2 Problem Formulation

Distance Metrics of Attributes: Consider two relations R and S that share a set of attributes A1, ..., Ap. Each attribute Ai has a metric fi that defines the difference between a value of r.Ai and a value of s.Ai. There are many ways to define the similarity metric at the attribute level, and domain-specific similarity measures are critical. We take two commonly-used metrics as examples: edit distance and q-gram distance.

Edit distance, a.k.a. Levenshtein distance [13], is a common measure of textual similarity. Formally, given two strings s1 and s2, their edit distance, denoted Δ_ed(s1, s2), is the minimum number of edit operations (insertions, deletions, and substitutions) of single characters needed to transform s1 to s2. For instance, Δ_ed("Harrison Ford", "Harison Fort") = 2. It is known that the complexity of computing the edit distance between strings s1 and s2 is O(|s1| × |s2|), where |s1| and |s2| are the lengths of s1 and s2, respectively [18].

Given a string s and an integer q, the set of q-grams of s, denoted Q(s), is obtained by sliding a window of length q over the characters of string s. For instance, for q = 2:

    Q("Harrison Ford") = {'Ha', 'ar', 'rr', 'ri', 'is', 'so', 'on', 'n ', ' F', 'Fo', 'or', 'rd'}
    Q("Harison Fort") = {'Ha', 'ar', 'ri', 'is', 'so', 'on', 'n ', ' F', 'Fo', 'or', 'rt'}

The relative q-gram distance between two strings s1 and s2, denoted Δ_q(s1, s2), is defined as:

    Δ_q(s1, s2) = 1 − |Q(s1) ∩ Q(s2)| / |Q(s1) ∪ Q(s2)|

For example, Δ_q("Harrison Ford", "Harison Fort") = 1 − 10/13 ≈ 0.23. Clearly, the smaller the relative q-gram distance between two strings, the more similar they are.
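To make these definitions concrete, here is a minimal Python sketch of both metrics; the function names are our own, and the two assertions reproduce the numbers of the running example.

    def edit_distance(s1, s2):
        # Levenshtein distance by dynamic programming, O(|s1| * |s2|).
        m, n = len(s1), len(s2)
        prev = list(range(n + 1))
        for i in range(1, m + 1):
            cur = [i] + [0] * n
            for j in range(1, n + 1):
                cost = 0 if s1[i - 1] == s2[j - 1] else 1
                cur[j] = min(prev[j] + 1,         # deletion
                             cur[j - 1] + 1,      # insertion
                             prev[j - 1] + cost)  # substitution
            prev = cur
        return prev[n]

    def qgrams(s, q=2):
        # Q(s): the set of substrings of length q of s.
        return {s[i:i + q] for i in range(len(s) - q + 1)}

    def qgram_distance(s1, s2, q=2):
        # Relative q-gram distance: 1 - |intersection| / |union|.
        g1, g2 = qgrams(s1, q), qgrams(s2, q)
        return 1.0 - len(g1 & g2) / len(g1 | g2)

    assert edit_distance("Harrison Ford", "Harison Fort") == 2
    assert round(qgram_distance("Harrison Ford", "Harison Fort"), 2) == 0.23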

Similarity Merging Rule: Given distance metrics f1, ..., fp for attributes A1, ..., Ap, there is an overall function that determines whether or not two records are to be considered duplicates. Such a function may either be supplied manually by a human, or alternatively, learned automatically using a classification technique. While the specifics of the method used to learn such a function are not of interest to us in this paper, we do make the assumption that the function is captured in the form of a rule discussed below. The forms of rules considered include those that could be learned using inductive rule-based techniques such as decision trees. Furthermore, this form of merging rule is consistent with the merging functions considered in [7].

Let r and s be two records whose similarity is being determined. A merging rule for records r and s is of the following disjunctive format:

      f1(r.A1, s.A1) ≤ δ_{1,1} ∧ ... ∧ fp(r.Ap, s.Ap) ≤ δ_{1,p}
    ∨ f1(r.A1, s.A1) ≤ δ_{2,1} ∧ ... ∧ fp(r.Ap, s.Ap) ≤ δ_{2,p}
      ...
    ∨ f1(r.A1, s.A1) ≤ δ_{k,1} ∧ ... ∧ fp(r.Ap, s.Ap) ≤ δ_{k,p}

For each conjunct fi(r.Ai, s.Ai) ≤ δ_{j,i} (1 ≤ i ≤ p and 1 ≤ j ≤ k), the value δ_{j,i} is a threshold using the metric function fi on attribute Ai. The conjunct means that the two records r and s should satisfy fi(r.Ai, s.Ai) ≤ δ_{j,i}.

For instance, given three attributes about papers, title ("T"), name ("N"), and year ("Y"), suppose we use the edit distance Δ_ed as a metric for attributes name and year, and the relative q-gram function Δ_q as a metric for attribute title. Then we could have the following rule, referred to as Q1:

      Δ_q(r.T, s.T) ≤ 0.10 ∧ Δ_ed(r.N, s.N) ≤ 4 ∧ Δ_ed(r.Y, s.Y) ≤ 1
    ∨ Δ_q(r.T, s.T) ≤ 0.15 ∧ Δ_ed(r.N, s.N) ≤ 2 ∧ Δ_ed(r.Y, s.Y) ≤ 2

The record-linkage problem is to find pairs of records (r, s) from relations R and S, such that each pair satisfies the merging rule defined above. The quadratic nested-loop solution is not desirable, since as the size of the data set increases, this solution becomes computationally prohibitive. For instance, in our experiments, it took many hours to use this approach on two single-attribute relations, each with just 10K names, assuming the edit-distance function is used (see Section 6 for our experimental setting). Since, by definition, the record-linkage problem is based on approximate matching of those individual metrics for different attributes, an efficient solution that might miss some pairs (but with a high accuracy) can often be more preferable.
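As an illustration of this formulation, the sketch below encodes the rule Q1 as plain data and evaluates it on a pair of records; the list-of-clauses representation and the record dictionaries are our own choices (reusing edit_distance and qgram_distance from the earlier sketch), not part of the authors' system.

    # Rule Q1: each inner list is one conjunctive clause of the DNF rule.
    RULE_Q1 = [
        [("T", qgram_distance, 0.10), ("N", edit_distance, 4), ("Y", edit_distance, 1)],
        [("T", qgram_distance, 0.15), ("N", edit_distance, 2), ("Y", edit_distance, 2)],
    ]

    def is_duplicate(rule, r, s):
        # r and s are dicts keyed by attribute name; the records match
        # if they satisfy every conjunct of at least one clause.
        return any(all(f(r[a], s[a]) <= t for a, f, t in clause)
                   for clause in rule)

    r = {"T": "Efficient record linkage", "N": "Tom Franz",  "Y": "2002"}
    s = {"T": "Efficient record linkage", "N": "Tom Frantz", "Y": "2002"}
    print(is_duplicate(RULE_Q1, r, s))   # True: both clauses hold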

3 Mapping Strings to Euclidean Space

We first consider the single-attribute case, where R and S share one attribute A. Thus the record-linkage problem reduces to linking similar strings in R and S based on a given similarity metric. Formally, given a predefined threshold value δ, we want to find pairs of strings (r, s) from R and S respectively, such that according to the metric f of attribute A, the distance of r and s is within δ, i.e.,

    f(r, s) ≤ δ

Such a string pair is called a similar-string pair; otherwise, it is called a dissimilar-string pair. Our proposed approach has two steps. In the first step, we map strings to objects in a multidimensional Euclidean space, such that the mapped space preserves the original string distances. In the second step, a multi-dimensional similarity join is conducted in the Euclidean space. In this section we discuss the first step.

3.1 StringMap

In the first step, we combine the strings from the two relations into one set, and embed them into a Euclidean space. Many mapping/embedding functions can be used, such as multidimensional scaling [12, 19], FastMap [4], Lipschitz [1], SparseMap [10], and MetricMap [16]. These algorithms have different properties in terms of their efficiency, contractiveness, and stress. (See [9] for a good survey.) In this paper we use the FastMap algorithm because of its efficiency and distance-preserving capability. We modify FastMap slightly and propose an algorithm called "StringMap," as shown in Figure 1(a). StringMap iterates d times to find pivot strings that form d orthogonal directions, and computes the coordinates of the strings on the d axes.

The function ChoosePivot(h, M) selects two strings to form an axis for the h-th dimension. These two strings should be as far from each other as possible, and the function iterates m times to find the seeds. A typical m value could be 5. The algorithm assumes the dimensionality d of the target space, and we will discuss how to choose a good d value shortly.

Function GetDistance(a, b, h, M) computes the distance between strings (indexed by a and b) after they are mapped to the first h − 1 axes. As shown in Figure 1(b), it iterates over the h − 1 dimensions, and does the computation using only the two strings and their already-computed coordinates on those dimensions.

There are a few observations about the algorithm. (1) It removes the recursion in FastMap. (2) The computation of ChoosePivot is not symmetric with respect to the two seed strings. (3) In function GetDistance, it is known that the value dist × dist − w × w can be negative [16]. In StringMap, we take the square root of the absolute value to compute the new distance. In our experiments we tried other ways to deal with this case, as described in [9]. Our results indicated that the approach of taking the absolute value preserves the distances well.

All the steps in StringMap are linear in the number of strings N. Its complexity is O(m × d × N), assuming it takes O(1) time to compute the metric M(·, ·). Notice that a major part of the cost in the algorithm is spent in function ChoosePivot(). We can reduce that cost as follows. At each step in the function, we want to find a new string that is as far from a given string (say, t[sa]) as possible. Instead of scanning all the strings, we can just do sampling to find a string that is very far from t[sa]. Or we can simply stop once we find a string that is "far enough" from t[sa], i.e., their distance is above a certain value.

    Algorithm StringMap
    Input:   N strings: t[1, ..., N]
             d: dimensionality of the Euclidean space
             M: metric function on strings
    Output:  N corresponding objects in the new space
    Variables: PA[1,2][1, ..., d]: 2 × d pivot strings
               coord[1, ..., N][1, ..., d]: object coordinates
    Method:
      for (h = 1 to d) {
        (p1, p2) = ChoosePivot(h, M);       // choose pivot strings
        PA[1,h] = t[p1]; PA[2,h] = t[p2];   // store them
        dist = GetDistance(p1, p2, h, M);
        if (dist == 0) {
          // set all coordinates in the h-th dimension to 0
          for (i = 1 to N) { coord[i,h] = 0; }
          break;
        }
        // compute coordinates of strings on this axis
        for (i = 1 to N) {
          x = GetDistance(i, p1, h, M);
          y = GetDistance(i, p2, h, M);
          coord[i,h] = (x*x + dist*dist - y*y) / (2*dist);
        }
      }

    // choose two pivot strings on the h-th dimension
    Function ChoosePivot(int h, Metric M)
    {
      seed t[sa] = a random string from t[1], ..., t[N];
      for (i = 1 to m) {   // a typical m value could be 5
        // use function GetDistance(., ., h, M) to compute distances
        seed t[sb] = a farthest string from t[sa];
        seed t[sa] = a farthest string from t[sb];
      }
      return (sa, sb);
    }

    // get distance of two strings (indexed by a and b)
    // after they are projected onto the first h - 1 axes
    Function GetDistance(int a, int b, int h, Metric M)
    {
      A = t[a]; B = t[b];   // get strings
      dist = M(A, B);       // get original metric distance
      for (i = 1 to h - 1) {
        // get their difference on dimension i
        w = coord[a,i] - coord[b,i];
        dist = sqrt(|dist × dist - w × w|);
      }
      return (dist);
    }

    (a) The main algorithm; (b) Functions

    Figure 1. Algorithm StringMap.
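For readers who want to experiment, the following compact Python rendering of Figure 1 follows the pseudocode's structure; it is our own sketch (the pivot bookkeeping in PA is omitted), not the authors' C++ implementation.

    import math
    import random

    def string_map(strings, d, metric, m=5):
        # Map each string to a point in a d-dimensional Euclidean space.
        N = len(strings)
        coord = [[0.0] * d for _ in range(N)]

        def get_distance(a, b, h):
            # Distance of strings a and b after projection onto the first h axes.
            dist = float(metric(strings[a], strings[b]))
            for i in range(h):
                w = coord[a][i] - coord[b][i]
                dist = math.sqrt(abs(dist * dist - w * w))  # |.| as in the text
            return dist

        def choose_pivot(h):
            # Heuristically find two strings that are far apart.
            sa = random.randrange(N)
            for _ in range(m):                 # a typical m value could be 5
                sb = max(range(N), key=lambda j: get_distance(sa, j, h))
                sa = max(range(N), key=lambda j: get_distance(sb, j, h))
            return sa, sb

        for h in range(d):
            p1, p2 = choose_pivot(h)
            dist = get_distance(p1, p2, h)
            if dist == 0:
                break                          # remaining coordinates stay 0
            for i in range(N):
                x = get_distance(i, p1, h)
                y = get_distance(i, p2, h)
                coord[i][h] = (x * x + dist * dist - y * y) / (2 * dist)
        return coord

    # Example: points = string_map(["Harrison Ford", "Harison Fort"], 4, edit_distance)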

3.2 Choosing Dimensionality d

A good dimensionality d used in StringMap should have the property that, after the mapping, similar strings can be differentiated from dissimilar ones. It cannot be too small, since otherwise the dissimilar pairs will not "fall apart" from each other; in particular, the distances of similar-string pairs will be too close to those of dissimilar ones. d cannot be too high either, since (1) the complexity of the StringMap algorithm (see above) is linear in d, and (2) we need to do a similarity join in the second step, and we want to avoid the curse of dimensionality: as d increases, it becomes more time-consuming to find object pairs whose distance is within a new threshold. We choose a dimensionality d as follows.

1. Randomly select a small number (say, 1K) of strings from the data sets R and S. Use the nested-loop approach to find all similar-string pairs within threshold δ among the selected strings.

2. Run StringMap using different d values. (Typically d is between 10 and 30.) For each d, compute the new distances of the similar-string pairs. Find their largest new distance w_d.

3. Find a dimensionality d with a low cost:

    cost(d) = (# of object pairs within distance w_d) / (# of similar-string pairs)    (1)

Intuitively, the cost is the average number of object pairs we need to retrieve in step 2 for each similar-string pair, if w_d is used as the new threshold for selecting similar-object pairs in the mapped space. Notice that we only use the pairs of selected strings to compute the cost. This value measures how well a new threshold δ' differentiates the similar-string pairs from the dissimilar pairs. In particular, the string pairs whose new distance is within δ' will be retrieved in step 2. Thus, the lower the cost, the fewer object pairs we need to retrieve in step 2 whose original distance is more than δ. Figure 3 in Section 6.1 shows that typically a good dimensionality value is between 15 and 25.
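The three-step selection can be sketched as follows, assuming a sample small enough for nested loops that contains at least one similar-string pair; the sketch reuses string_map from above, and the function name and signature are ours.

    import itertools
    import math

    def choose_dimensionality(sample, metric, delta, d_values):
        # Step 1: nested-loop search for similar-string pairs in the sample.
        pairs = list(itertools.combinations(range(len(sample)), 2))
        similar = [(i, j) for i, j in pairs
                   if metric(sample[i], sample[j]) <= delta]
        best_d, best_cost = None, float("inf")
        for d in d_values:                    # Step 2: try each dimensionality.
            coord = string_map(sample, d, metric)
            w_d = max(math.dist(coord[i], coord[j]) for i, j in similar)
            # Step 3: cost(d) = object pairs within w_d per similar-string pair.
            within = sum(1 for i, j in pairs
                         if math.dist(coord[i], coord[j]) <= w_d)
            cost = within / len(similar)
            if cost < best_cost:
                best_d, best_cost = d, cost
        return best_d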

4 Similarity Join in Euclidean Space

In the second step, we find object pairs whose Euclidean distance is within a new threshold δ'. For each candidate pair, we check the distance of their original strings to see if it is within threshold δ.

4.1 Choosing New Threshold δ'

The selection of the new threshold δ' depends on the mapping function. For instance, we can set δ' = δ if the mapping function is contractive, that is, for any two strings s1 and s2 we have dist'(s1', s2') ≤ dist(s1, s2), where dist(s1, s2) is their distance in the original metric space, and dist'(s1', s2') is the new Euclidean distance of the corresponding objects s1' and s2' in the new space. In general, suppose there are two constants c1 ≤ c2, such that for any two strings s1 and s2 we have

    c1 · dist(s1, s2) ≤ dist'(s1', s2') ≤ c2 · dist(s1, s2);

then we can just set δ' = c2 · δ. Properties of different mapping functions are studied in [9].

Ideally, the threshold δ' should be set to the maximal value of the new distance over all similar-string pairs in the original space. Then it will guarantee no false dismissals. However, this maximal value could be either too expensive to find (we do not want to have a nested-loop procedure), or the theoretical upper bound could be too large. Since it is acceptable to miss a few pairs, we would like to choose a threshold such that for most of the similar-string pairs, their new distances are within this threshold. As shown in our experimental results, even though a theoretical upper bound could be large, most of the new distances are within a much smaller threshold.

In our approach to selecting the dimensionality d, the threshold w_d can be used to identify similar pairs in the mapped space. Here is how we choose the threshold δ'. We randomly select a small number (say, 1K) of strings from data sets R and S. (Notice these selected strings might be different from those used to decide the dimensionality in Section 3.2.) We find all the similar-string pairs in these selected strings, and compute their new Euclidean distances after StringMap. We choose their maximal new distance as the new threshold δ'. We may do this sampling multiple times (on different sets of selected strings), and choose δ' as the maximal new distance over those similar-string pairs.
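In code, the sampling procedure is a max over the new distances of sampled similar pairs; the sample size and number of rounds below are free parameters of our sketch, which again reuses string_map (and the math and random modules) from above.

    def choose_new_threshold(strings_r, strings_s, metric, delta, d,
                             n=1000, rounds=10):
        # delta' = maximal new Euclidean distance over sampled similar pairs;
        # assumes each source has at least n strings.
        delta_prime = 0.0
        for _ in range(rounds):
            sample = random.sample(strings_r, n) + random.sample(strings_s, n)
            coord = string_map(sample, d, metric)
            for i in range(len(sample)):
                for j in range(i + 1, len(sample)):
                    if metric(sample[i], sample[j]) <= delta:   # similar pair
                        delta_prime = max(delta_prime,
                                          math.dist(coord[i], coord[j]))
        return delta_prime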

4.2 Finding Object Pairs within δ'

We want to find all object pairs whose new distance is within the new threshold δ'. Many similarity-join algorithms can be applied, and we use a simplified version of the algorithm in [8] as an example, due to its simplicity and the availability of the code. We briefly explain the main idea of the algorithm. We first build two R-trees for the mapped objects of the two string sets, respectively. We traverse the two trees from the roots to the leaf nodes to find the pairs of objects within distance δ'. As we do the traversal, a queue is used to store pairs of nodes (internal nodes or leaf nodes) from the two trees. (Since we just want to find those object pairs whose distance is within δ', we do not need a priority queue.) We only insert those node pairs that can potentially yield object pairs satisfying the condition. Node pairs that cannot produce results are pruned in the traversal, i.e., they are never inserted into the queue.

Take Figure 2 as an example. Initially, the pair of root nodes (R0, S0) is inserted into the queue. At each step, we dequeue the head pair. If both nodes are internal nodes, we consider all the pairs of their children. For each such pair, we compute their "distance," which is a lower bound of all the distances of their child objects. (The case of a node-object pair is handled similarly.) We prune node pairs as follows: if the distance of a node pair is greater than δ', we do not insert this pair into the queue; if the distance of two nodes is within δ', we insert this pair into the queue.

When we consider the two child nodes of each of the two root nodes in the figure, we have four pairs:

    (R1, S1), (R1, S2), (R2, S1), (R2, S2)

Suppose each pair has a distance within δ'; then we insert them into the queue. We remove the pair (R1, S1) from the queue, and consider the pairs of their child nodes:

    (R3, S3), (R3, S4), (R3, S5), (R4, S3), (R4, S4), (R4, S5)

For each pair, we compute their distance. If, for example, the distance between R3 and S4 is greater than δ', we will not consider this pair, since it cannot generate object pairs that have a distance within δ'; in other words, all the pairs of their descendants have distances greater than δ'. Suppose only (R3, S3), (R3, S5), (R4, S4), and (R4, S5) have a distance within δ'. Then we insert these pairs into the queue. The status of the queue at this point is shown in the figure. Eventually we obtain a pair of objects from the queue. We compute their Euclidean distance to check if it is within δ'. If so, we compute the original metric distance of their original strings, and output this pair of strings if that distance is within the original threshold δ.
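The traversal and its pruning rule can be sketched as follows. The dict-based node layout is a stand-in for a real R-tree, mindist is the standard lower bound between two minimum bounding rectangles, and, as noted above, a plain queue suffices; popping from the right end instead would give the depth-first order discussed next.

    import math
    from collections import deque

    def mindist(mbr1, mbr2):
        # Lower bound on the distance between any two points of the MBRs,
        # each given as a list of per-dimension (low, high) intervals.
        return math.sqrt(sum(max(l2 - h1, l1 - h2, 0.0) ** 2
                             for (l1, h1), (l2, h2) in zip(mbr1, mbr2)))

    def rtree_similarity_join(root_r, root_s, delta_prime):
        # Internal nodes: {'mbr': ..., 'children': [...]};
        # leaves: {'mbr': ..., 'points': [(id, vector), ...]}.
        queue = deque([(root_r, root_s)])   # use .pop() for depth-first
        while queue:
            nr, ns = queue.popleft()
            if mindist(nr["mbr"], ns["mbr"]) > delta_prime:
                continue                    # prune the whole pair of subtrees
            if "points" in nr and "points" in ns:   # leaf-leaf: check objects
                for id_r, p_r in nr["points"]:
                    for id_s, p_s in ns["points"]:
                        if math.dist(p_r, p_s) <= delta_prime:
                            yield id_r, id_s
            else:                           # expand the internal node(s)
                for cr in nr.get("children", [nr]):
                    for cs in ns.get("children", [ns]):
                        queue.append((cr, cs))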

Different strategies can be used to traverse the two R-trees, such as depth first and breadth first. Our experiments show that the depth-first traversal strategy has several advantages. (1) It can effectively reduce the queue size, since object pairs in the leaf nodes can be processed early and then dequeued. Thus the memory requirement of the algorithm tends to be small. In our experiments, when each data set had about 27K strings, the breadth-first strategy required about 1.2GB of memory, while the depth-first strategy requested only about 30MB. (2) It also reduces the time until the first pair is output, since it reaches the leaf nodes more quickly than the breadth-first strategy.

Comparison: In [6], the authors proposed an attractive q-gram-based approach to finding similar string pairs in a DBMS. Compared to that approach, ours has several advantages in the context of record linkage. First, the distance metrics used at individual attributes might not be edit distance. As argued earlier, domain-specific methods work better in identifying similar records. Furthermore, the algorithm in [6] is geared towards finding all the string pairs that are within a fixed edit-distance threshold. Since the record-linkage problem, by definition, is based on approximate matching, solutions that might miss some such pairs (say, those that obtain around 99% accuracy) but result in significant time savings might be more preferable. As we will see in Section 6.3, our algorithm works very efficiently, while still achieving a very high accuracy. In addition, the implementation of the q-gram-based approach inside a database might be less efficient than a direct implementation using other languages (e.g., the C language).


Figure 2. Finding similar-object pairs using R-trees.

5 Combining Multiple Attributes

In this section we discuss how to join over multiple attributes efficiently when the merging rule is of the disjunctive normal form (DNF) discussed in Section 2.

Single Conjunctive Clause: For a single conjunctive clause, we can process the most "selective" attribute to find the candidate pairs that satisfy its conjunct condition, and then check the other conjunct conditions. For instance, consider the following query clause Q2:

    Δ_q(r.T, s.T) ≤ 0.15 ∧ Δ_ed(r.N, s.N) ≤ 3 ∧ Δ_ed(r.Y, s.Y) ≤ 1

We could first do a similarity search to find all the record pairs that satisfy the first condition, Δ_q(r.T, s.T) ≤ 0.15. For each of the returned candidate pairs, we check if it satisfies the other two conditions on the name and year attributes. Alternatively, we can choose either name or year to do the similarity join. Our experiments show that the step of testing the other attributes takes much less time than the step of finding the candidate record pairs; thus we mainly focus on the time of the similarity join that finds the candidate pairs. We can use existing techniques for estimating the performance of spatial joins (e.g., [11]), and choose the attribute that takes the least time to do the corresponding similarity join. (This attribute is called the most selective attribute for this conjunctive clause.) Notice that, similar to [7], we could also search along multiple attributes of the conjunction to improve accuracy. Since the mapping in Step 1 does not guarantee that all the relevant string pairs will be found, using multiple attributes may improve accuracy. However, as will be shown in the experimental section, since our strategy for single attributes is able to identify matching string pairs at a very high accuracy (over 99%), traversals along multiple attributes are not necessary. In contrast, since string lexicographic ordering may not preserve domain-specific similarity as well, to get the same degree of accuracy the approach in [7] requires traversals along multiple keys.
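Returning to the basic strategy for a single clause, here is a sketch of the filter-then-verify step; sim_join stands for the two-step single-attribute search of Sections 3 and 4 and is an assumed callback, as are the record containers.

    def answer_clause(clause, join_attr, sim_join, recs_r, recs_s):
        # clause: list of (attribute, metric, threshold) conjuncts.
        conjuncts = {a: (f, t) for a, f, t in clause}
        _, t_join = conjuncts[join_attr]
        result = []
        for i, j in sim_join(join_attr, t_join):      # candidate pairs
            r, s = recs_r[i], recs_s[j]
            if all(f(r[a], s[a]) <= t                 # verify every conjunct
                   for a, (f, t) in conjuncts.items()):
                result.append((i, j))
        return result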

Disjunctive Clauses: The problem becomes more challenging in the case of multiple conjunctive clauses. Take query Q1 in Section 2 as an example. We have at least the following approaches to answering this query.

1. Approach A: Find all record pairs that satisfy the first conjunctive clause by doing a similarity search using the conjunct Δ_q(r.T, s.T) ≤ 0.10. Find all record pairs satisfying the second conjunctive clause by doing a similarity search using the conjunct Δ_ed(r.N, s.N) ≤ 2. Take the union of these two sets of results.

2. Approach B: Do a similarity search to find record pairs that satisfy Δ_q(r.T, s.T) ≤ 0.15 in the second conjunctive clause. These pairs also include all the pairs satisfying the first conjunctive clause, since Δ_q(r.T, s.T) ≤ 0.10 implies Δ_q(r.T, s.T) ≤ 0.15. Among all these pairs, find those satisfying the merging rule.

Approach A needs to do two similarity searches, while approach B requires only one. However, both similarity searches in approach A could be more selective than the single similarity search in approach B. Which approach is better depends on the data set.

The example shows that to answer a conjunctive clause, we can choose at most one conjunct in it to do a similarity join. In addition, once we choose a conjunct fi(r.Ai, s.Ai) ≤ δi to do a similarity join, we do not need to do a similarity join for any other conjunctive clause that has a conjunct fi(r.Ai, s.Ai) ≤ δ'i with δ'i ≤ δi. The reason is that a superset of the results for the conjunct fi(r.Ai, s.Ai) ≤ δ'i has already been returned by the similarity search. As the number of attributes and the number of conjunctive clauses increase, there could be an exponential number of possible ways to answer the query. In fact, assuming the time of doing a similarity search for each conjunct is given, we can show that the problem of finding an optimal solution (with a minimal total time) to answer the query is NP-complete. (The NP-hardness can be proven by a reduction from the NP-complete set-cover problem.)

Since it could be too time-consuming to find an optimal solution, we propose two heuristic-based algorithms for finding a good one.

• Algorithm 1: Treat all the conjunctive clauses separately. For each of them, choose the most selective attribute to do the corresponding similarity join. If we choose the same attribute in two conjunctive clauses, with corresponding thresholds δi ≤ δj, then we only do the similarity search with threshold δj for the second clause, saving one similarity search for the first one. Take the union of the results of all conjunctive clauses.

• Algorithm 2: For each attribute, choose the largest threshold among all its conjuncts. Among the largest thresholds of the different attributes, choose the most selective one to do a similarity join. Among the results, find the record pairs that satisfy the merging rule.

For instance, consider the query Q1 in Section 2. Suppose Algorithm 1 chooses Δ_q(r.T, s.T) ≤ 0.10 as the most selective condition for the first clause, and Δ_ed(r.N, s.N) ≤ 2 for the second one. Then it will produce approach A above. For Algorithm 2, the largest thresholds of the three attributes title, name, and year are 0.15, 4, and 2, respectively. Suppose Δ_q(r.T, s.T) ≤ 0.15 is the most selective one. This algorithm will produce approach B as the solution. In general, Algorithm 1 works better than Algorithm 2 if doing multiple similarity searches with small thresholds is more efficient than doing one with a large threshold.
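The two heuristics can be phrased as planning functions that decide which similarity searches to run; est_cost(attr, threshold) is an assumed cost estimator in the spirit of [11], and the returned searches would then be followed by verification against the full rule.

    def plan_algorithm1(rule, est_cost):
        # Per clause, pick the cheapest conjunct; keep only the largest
        # threshold per attribute (smaller ones are subsumed).
        picked = [min(clause, key=lambda c: est_cost(c[0], c[2]))
                  for clause in rule]
        searches = {}
        for a, f, t in picked:
            if a not in searches or t > searches[a][1]:
                searches[a] = (f, t)
        return [(a, f, t) for a, (f, t) in searches.items()]

    def plan_algorithm2(rule, est_cost):
        # Take each attribute's largest threshold over all clauses, then
        # run the single cheapest search among those.
        largest = {}
        for clause in rule:
            for a, f, t in clause:
                if a not in largest or t > largest[a][1]:
                    largest[a] = (f, t)
        a = min(largest, key=lambda x: est_cost(x, largest[x][1]))
        f, t = largest[a]
        return [(a, f, t)]

    # On RULE_Q1, plan_algorithm1 can yield approach A (two searches), and
    # plan_algorithm2 yields approach B (one search on title with 0.15).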

6 Experiments

In this section we present our extensive experimental results to evaluate our solution. The following are the three main sources we used. (1) Source 1 consists of 54,000 movie star names collected from The Internet Movie Database (http://www.imdb.com/). The names are up to 20 characters long, and their average length is about 12. (2) Source 2 is from the Die Vorfahren Database (http://feefhs.org/dpl/dv/indexdv.html) of mostly Pomeranian surnames and geographic locations. It contains 133,101 full names, whose lengths are less than 40, with a mean length of around 15. (3) Source 3 is from the publications in DBLP (http://www.informatik.uni-trier.de/ley/db/). We randomly selected 20,000 papers in proceedings. We use this data source to show how to do data linkage in the multiple-attribute case.

All the experiments were run on a PC with a 1.5GHz Athlon CPU and 512MB of memory. The operating system is Windows 2000, and the compiler is gnu C++ running in cygwin. This setting shows that our approach does not have specific hardware and software requirements. We used 8192 bytes as the page size to build R-trees. Most of our experimental results are similar for the three sources; due to space limitations, we mainly report the results on data source 1. In addition, since the results for the edit-distance metric and the q-gram distance metric are similar, we mainly report the results for the edit-distance metric.

6.1 Choosing Dimensionality d

As discussed in Section 3.2, we need to choose a good dimensionality d for the StringMap algorithm. To select a d value, we randomly selected 2K strings from Source 1 to form two data sets of the same size, and ran StringMap using different dimensionalities. For edit distance, we considered the cases δ = 1, 2, and 3. For the q-gram metric, we considered δ = 0.1 and 0.15.

Figure 3. Costs of different dimensionalities: (a) edit distance (δ = 1, 2, 3); (b) q-gram distance (δ = 0.1, 0.15).

Figures 3(a) and (b) show the costs of different dimensionalities for edit distance and q-gram distance, respectively. (See Section 3.2 for the definition of "cost.") The cost decreased as the dimensionality increased. That is, the larger the dimensionality, the fewer extra object pairs we need to retrieve in step 2 for each similar-string pair whose original strings have a distance greater than δ. On the other hand, due to the computational complexity of StringMap and the curse of dimensionality, d cannot be too high either. The results show that d = 20 is a good dimensionality for both metrics.

Figures 4(a) and (b) show the distributions of the object-pair distances after StringMap, for the edit-distance metric. We randomly sampled 2K strings from Source 1 to form two data sets of the same size. We chose δ = 2 for similar-string pairs, and set d = 20. The figures show that after StringMap, there is a new threshold that differentiates similar pairs from most dissimilar pairs. In particular, all the sampled similar-string pairs had new object-pair distances within a small value, while most of the dissimilar-string pairs had object-pair distances larger than that value.

Figure 4. Histograms of new Euclidean distances (edit distance, δ = 2, d = 20): (a) similar pairs; (b) dissimilar pairs.

6.2 Choosing Threshold δ'

As discussed in Section 4.1, we selected the new threshold δ' for the second step as follows. For each source, we randomly selected 2K strings, and partitioned them into two data sets of equal size. We used the nested-loop approach to find all the similar-string pairs in the selected strings. We ran StringMap with d = 20, and traced the new Euclidean distances of these sample similar-string pairs. We did this sampling 10 times, and chose δ' as the largest new object-pair distance of the sampled similar-string pairs.

Table 1. Threshold δ' used in step 2 (d = 20), for the edit-distance thresholds δ = 1, 2, 3 and the q-gram metric, on Sources 1 and 2.

Table 1 shows the δ' values used in step 2 for the two metrics. Notice that when using the edit-distance metric, since the two sources have different strings with different length distributions, it is not surprising that we need to choose different δ' values by sampling. For the q-gram metric, we set the same δ' for both data sources 1 and 2.

6.3 Running Time

In order to measure the performance of our approach, we ran our algorithm on different data sizes. In each case, we chose the same number of strings in both data sets. We chose dimensionality d = 20, and let the total number of strings vary from 2K, 4K, 8K, and 16K up to 54K from Source 1. We measured the corresponding running time.

Figure 5 shows the time of the complete algorithm (including the two steps), and the time of the StringMap step, assuming we use the edit-distance metric. Their gap is the time of the second step, which did the R-tree similarity join. The figure shows that as the data sizes increased, both the StringMap time and the total time grew. Our approach is shown to be very efficient and scalable. For instance, even at the largest data size, it took the approach only minutes to find the similar-string pairs, while it took the nested-loop approach almost one week to finish. The figures also indicate that other similarity-join techniques may be used in the second step to improve its performance.

Figure 5. Running time (δ = 2, d = 20), edit distance.

In the case where the edit-distance metric is used, the approach in [6] can be used to find all the string pairs whose edit distance is within a given threshold. We implemented this approach using Oracle 8.1.7 on the same PC, and let the database use indexes to run the query. We selected subsets of strings from Source 1, and let both sets have the same number of strings. In our approach we chose threshold δ = 2, dimensionality d = 20, and the corresponding new threshold δ'. Figure 6 shows the performance difference between the two approaches: our approach can substantially reduce the time of finding similar-string pairs. Notice that even though our approach cannot guarantee to find all such pairs, as we will see shortly, it can still achieve a very high accuracy. In the context of record linkage, an efficient approximate-search algorithm with a very high accuracy might be more desirable.

Figure 6. Our approach (approximate search) versus the q-gram exact approach in [6].

6.4 Accuracy

We want to know how well our approach finds all the similar-string pairs. (Ideally we want to find all of them!) In particular, we are interested in the accuracy, i.e., the ratio of similar-string pairs found among all similar-string pairs. Figure 7 shows the accuracy for data source 1, with different threshold δ' values in the second step, using the edit-distance metric. As the δ' value increased, the accuracy also increased, and it quickly got very close to 100%. For instance, in the case where δ = 2, the accuracy was already very high for a modest δ', and as we further increased the threshold, the accuracy continued to grow toward 100%. We also implemented the sliding-window approach in [7]. Without loss of generality, we used the attributes as keys, and the condition was Δ_ed ≤ 2 for data source 1. We chose different window sizes, and for each of them measured the time and accuracy.

Figure 7. Accuracy versus threshold δ' (d = 20), edit distance.

Figure 8 shows the accuracy and time for these two approaches. It shows that our approach can achieve a very high accuracy given a time limit. The primary reason is that our mapping function provides very good distance/similarity preservation. Since lexicographic ordering does not preserve edit distances as well, the approach discussed in [7] needs to consider a very large window size (and hence cost) to obtain a competitive degree of accuracy.

Figure 8. Our approach versus the sliding-window approach in [7].

6.5 Results on Multiple Attributes

Now we report our experimental results for the multiple-attribute case on data source 3.

Single Conjunctive Clause: We evaluated the single conjunctive query Q2 in Section 5. Of the three attributes available for a similarity join (title, name, and year), experimental results showed that the attribute year is not a good one for a similarity join, since too many candidate pairs were generated. Thus we answered the query by conducting a similarity join using attributes title and name, respectively.

Table 2. Results of similarity join using different attributes (time and final result size for the joins on name and on title).

As shown in Table 2, for the thresholds in the query, doing a similarity join on name is more efficient than on title. Notice that the result size differs for these two similarity searches, since both of them are approximate. The small difference (only two pairs) between them again shows that our approach has a very high accuracy.

Disjunctive Clause: We used the query Q1 in Section 2 as an example of disjunctive clauses, and implemented the two approaches in Section 5. For Algorithm 1, which produces approach A, we chose the conjunct Δ_q(r.T, s.T) ≤ 0.10 to do the similarity search for the first clause, and the conjunct Δ_ed(r.N, s.N) ≤ 2 for the second one. After getting the result pairs for each clause, we took the union of the two sets and produced the final result. Table 3 shows the results. Notice that the results of the two clauses had overlapping record pairs. The accuracy of this approach is more than 99%.

Table 3. Approach A for the disjunctive clause: time and number of record pairs for clause 1 (Δ_q(title) ≤ 0.1), clause 2 (Δ_ed(name) ≤ 2), and their total (the pairs overlap).

For Algorithm 2, which produces approach B, we found that Δ_q(r.T, s.T) ≤ 0.15 was the most selective conjunct, and we performed the similarity join using it. Table 4 shows the results; the final result had the same size as that of approach A.

Table 4. Approach B for the disjunctive clause: time and number of record pairs for the single search with Δ_q(title) ≤ 0.15.

For this example, approach B was better than approach A. In general, Algorithm 1 produces a solution that tries to minimize the time of each individual similarity join, which tends to produce a small candidate set. Algorithm 2 produces a solution that performs a similarity join only once for all the clauses; it may need more time for that single similarity join, since the threshold could be large. To fix this problem of Algorithm 2, we can set an upper bound on the threshold of each attribute, and never consider conjuncts on an attribute whose threshold is above this bound. One observation is that if the thresholds for the attributes vary a lot, it could be better to take Algorithm 1; otherwise, Algorithm 2 could be preferable.

7 Conclusion

In this paper we developed a novel approach to the record-linkage problem: given two lists of records, we want to find similar record pairs, where the overall similarity between two records is defined based on domain-specific similarities over individual attributes. For each attribute of records, we first map records to a multidimensional Euclidean space that preserves domain-specific similarity. Given the merging rule that defines when two records are similar, a set of attributes are chosen along which the merge will proceed. A multidimensional similarity join over the chosen attributes is used to determine similar pairs of records. Our extensive experiments using real data sets showed that our solution has very good efficiency and accuracy. In addition, our approach is very extendable, since many existing techniques can be used. It can also be generalized to other similarity functions between strings.

Acknowledgments: We thank Michael Ortega-Binderberger for providing his implementation of the algorithm in [8]. We thank the authors of [6] for providing the details of their implementation. We thank Heikki Mannila for letting us know about the work in [14].

References

[1] J. Bourgain. On Lipschitz embedding of finite metric spaces in Hilbert space. Israel Journal of Mathematics, 52(1-2):46–52, 1985.
[2] W. W. Cohen, H. A. Kautz, and D. A. McAllester. Hardening soft information sources. In Knowledge Discovery and Data Mining, pages 255–259, 2000.
[3] M. G. Elfeky, V. S. Verykios, and A. K. Elmagarmid. TAILOR: A record linkage toolbox. In ICDE, 2002.
[4] C. Faloutsos and K.-I. Lin. FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In M. J. Carey and D. A. Schneider, editors, SIGMOD, pages 163–174, 1995.
[5] H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita. Declarative data cleaning: Language, model, and algorithms. In VLDB, pages 371–380, 2001.
[6] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, pages 491–500, 2001.
[7] M. A. Hernández and S. J. Stolfo. The merge/purge problem for large databases. In M. J. Carey and D. A. Schneider, editors, SIGMOD, pages 127–138, 1995.
[8] G. R. Hjaltason and H. Samet. Incremental distance join algorithms for spatial databases. In L. M. Haas and A. Tiwary, editors, SIGMOD, pages 237–248, 1998.
[9] G. R. Hjaltason and H. Samet. Contractive embedding methods for similarity searching in metric spaces. Technical report, University of Maryland Computer Science, 2000.
[10] G. Hristescu and M. Farach-Colton. Cluster-preserving embedding of proteins. Technical Report 99-50, Rutgers University, 1999.
[11] Y.-W. Huang, N. Jing, and E. A. Rundensteiner. A cost model for estimating the performance of spatial joins using R-trees. In Statistical and Scientific Database Management, pages 30–38, 1997.
[12] J. B. Kruskal and M. Wish. Multidimensional Scaling. Sage Publications, Beverly Hills, CA, 1978.
[13] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707–710, 1966.
[14] H. Mannila and J. K. Seppänen. Recognizing similar situations from event sequences. In First SIAM Conference on Data Mining, 2001.
[15] V. Raman and J. M. Hellerstein. Potter's Wheel: An interactive data cleaning system. In The VLDB Journal, pages 381–390, 2001.
[16] J. T.-L. Wang et al. Evaluating a class of distance-mapping algorithms for data mining and clustering. In Knowledge Discovery and Data Mining, pages 307–311, 1999.
[17] W. Winkler. Advanced methods for record linkage, 1994.
[18] P. N. Yianilos and K. G. Kanzelberger. The LIKEIT intelligent string comparison facility. Technical report, NEC Research Institute, 1997.
[19] F. W. Young and R. M. Hamer. Multidimensional Scaling: History, Theory and Applications. Erlbaum, New York, 1987.