
Alias Assignment in Information Extraction

Emili Sapena, Lluís Padró and Jordi Turmo
TALP Research Center, Universitat Politècnica de Catalunya, Spain
{esapena, padro, turmo}@lsi.upc.edu

Resumen: This article presents a general method for the alias assignment task in information extraction. Two approaches to tackling the problem and learning a classifier are compared. The first quantifies a global similarity between the alias and all the possible entities by assigning weights to the features of each alias-entity pair. The second is a classical classifier in which each instance is an alias-entity pair and its attributes are the pair's features. Both approaches use the same feature functions over the alias-entity pair, where every level of abstraction, from characters up to the semantic level, is treated homogeneously. In addition, extended feature functions are proposed that break down the information and let the machine learning algorithm determine the final contribution of each value. The use of extended functions improves the results of the simple ones.

Palabras clave: alias assignment, information extraction, entity matching

Abstract: This paper presents a general method for the alias assignment task in information extraction. We compare two approaches to tackling the problem and learning a classifier. The first quantifies a global similarity between the alias and all the possible entities by weighting features computed over each alias-entity pair. The second is a classical classifier in which each instance is an alias-entity pair and its attributes are the pair's features. Both approaches use the same feature functions over the alias-entity pair, where every level of abstraction, from raw characters up to the semantic level, is treated in a homogeneous way. In addition, we propose extended feature functions that break down the information and let the machine learning algorithm determine the final contribution of each value. The use of extended features improves the results of the simple ones.

Keywords: Alias Assignment, Information Extraction, Entity Matching

1 Introduction

Alias assignment is a variation of the entity matching problem. Entity matching decides whether two named entities in the data, such as "George W. Bush" and "Bush", refer to the same real-world entity. Variations in named entity expressions arise for multiple reasons: use of abbreviations, different naming conventions (for example "Name Surname" and "Surname, N."), aliases, misspellings, or naming variations over time (for example "Leningrad" and "Saint Petersburg"). In order to keep extracted or processed data coherent for further analysis, it is mandatory to determine when different mentions refer to the same real entity.

This problem arises in many applications that integrate data from multiple sources. Consequently, it has been explored by a large number of communities, including statistics, information systems and artificial intelligence. In particular, many tasks related to natural language processing, such as question answering, summarization and information extraction, among others, have had to deal with the problem. Depending on the area, variants of the problem are known under different names, such as identity uncertainty (Pasula et al., 2002), tuple matching, record linkage (Winkler, 1999), deduplication (Sarawagi and Bhamidipaty, 2002), the merge/purge problem (Hernandez and Stolfo, 1995), data cleaning (Kalashnikov and Mehrotra, 2006), reference reconciliation (Dong, Halevy, and Madhavan, 2005), mention matching and instance identification.

Alias assignment decides whether a mention in one source can refer to one or more entities in the data. The same alias can be shared by several entities or, on the contrary, it can refer to an unknown entity. For instance, the alias "Moore" would be assigned to the entity "Michael Moore" and also to "John Moore" if we have both in the data. However, the alias "P. Moore" cannot be assigned to either of them. Therefore, while the entity matching problem consists of determining when two records are the same real entity, alias assignment focuses on finding out whether references in a text point to known real entities in our database or not. After alias assignment, a disambiguation procedure is required to decide which of the possible real entities the alias points to in each context. The disambiguation procedure, however, is out of the scope of this paper.

There is little previous work that directly addresses the problem of alias assignment as its main focus, but many solutions have been developed for the related problem of entity matching. Early solutions employ manually specified rules (Hernandez and Stolfo, 1995), while subsequent works focus on learning the rules from training data (Tejada, Knoblock, and Minton, 2002; Bilenko and Mooney, 2003). Numerous solutions focus on efficient techniques to match strings, either manually specified (Cohen, Ravikumar, and Fienberg, 2003) or learned from training data (Bilenko and Mooney, 2003). Other solutions exploit the database topology, for instance clustering a large number of tuples (McCallum, Nigam, and Ungar, 2000), exploiting links (Bhattacharya and Getoor, 2004) or using a relational probability model to define a generative model (Pasula et al., 2002).

In recent years, some works take advantage of domain knowledge at the semantic level to improve the results. For example, Doan et al. (Doan et al., 2003) show how semantic rules, either automatically learned or specified by a domain expert, can improve the results. Shen et al. (Shen, Li, and Doan, 2005) use probabilistic domain constraints in a more general model, employing a relaxation labeling algorithm to perform matching.

Some of the methods used for entity matching are not applicable to alias assignment because the information contributed by an alias-entity pair is poorer than that of an entity-entity pair. An alias is only a small group of words without attributes and, normally, without any useful contextual information. However, using some domain knowledge, some information about the entities and some information about the world, it is possible to improve the results of a system that relies only on string similarity measures.

This paper presents a general method for the alias assignment task in information extraction. We compare two approaches to tackling the problem and learning a classifier. The first quantifies a global similarity between the alias and all the possible entities by weighting features computed over each alias-entity pair. The algorithm employed to find the best weights is Hill Climbing. The second is a classical pairwise classification where each instance is an alias-entity pair and its attributes are the pair's features. The classifier is learned with Support Vector Machines. Both approaches use the same feature functions over the alias-entity pair, where every level of abstraction, from raw characters up to the semantic level, is treated in a homogeneous way. In addition, we propose a set of extended feature functions that break down the information and let the machine learning algorithm determine the final contribution of each value. The use of extended features improves the results of the simple ones.

The rest of the paper is structured as follows. Section 2 formalizes the problem of alias assignment and its representation. Section 3 introduces the machine learning algorithms used. Next, section 4 presents the experimental methodology and data used in our evaluation. In section 5 we describe the feature functions employed in our empirical evaluation. Section 6 shows the results obtained and, finally, we present our conclusions in section 7.

2 Problem definition and representation

The alias assignment problem can be formalized as pairwise classification: find a function f : N × N → {1, −1} which classifies an alias-entity pair as positive (1) if the alias represents the entity, or negative (−1) if not. The alias and the entity are represented as strings in a name space N. We propose a variation of the classifier in which we can also use any useful attributes we have about the entity. In our case, the function to find will be f : N × M → {1, −1}, where M represents a different space including all the entity's attributes.

We define a feature function as a function that represents a property of the alias, the entity, or the alias-entity pair. Once an alias-entity pair is represented as a vector of features, one can combine them appropriately using machine learning algorithms to obtain a classifier. In section 3 we explain how we learn classifiers using two different approaches. Most of the feature functions used here are similarity functions, which quantify the similarity of the alias-entity pair according to some criterion. In a similarity function, a larger returned value indicates greater similarity, while a smaller value indicates lower similarity (dissimilarity).

Feature functions can be divided into four groups by their level of abstraction, from raw characters up to the semantic level. At the lowest level, the functions focus on character-based similarity between strings. These techniques rely on character edit operations, such as deletions, insertions, substitutions and subsequence comparison. Edit similarities detect typographical errors such as writing mistakes or OCR errors, abbreviations, similar lemmas and other intra-word differences.

The second level of abstraction is centered on vector-space based techniques and is also known as token-level or word-level. The two strings to compare are considered as groups of words (or tokens), disregarding the order in which the tokens occur in the strings. Token-based similarity metrics use operations over sets, such as union or intersection.

At a higher level we find structural features, similar to the work in (Li, Morie, and Roth, 2004). Structural features encode information on the relative order of tokens between two strings, by recording the location of the participating tokens in the partition.

The highest level includes the functions with added knowledge. This extra knowledge can be obtained from other attributes of the entity, from an ontology, or can be knowledge about the world. Some previous works (Shen, Li, and Doan, 2005; Doan et al., 2003) use this extra knowledge as rules to be satisfied. First, rules are specified manually or obtained from the data; then some weight or probability must be assigned to each rule, also distinguishing hard rules from soft ones. In (Shen, Li, and Doan, 2005), weights are established by an expert user or learned from the same data set to be classified. In our work, we present another way to use this information: we propose to add more feature functions to increase the number of attributes for our classifier. Each new feature function describes some characteristic of the alias, of the entity, or of the alias-entity pair that needs some extra knowledge. The contribution of each feature will be learned like that of any other similarity function when a machine learning method is applied.

3 Learning classifiers

Two approaches are used and compared in order to obtain a good classifier using the feature functions introduced above: Hill Climbing (Skalak, 1994) and Support Vector Machines (Cortes and Vapnik, 1995). Each takes a different view of the problem. The first treats the problem as a nearest neighbor model and tries to determine a global Heterogeneous Euclidean-Overlap Metric (HEOM) from the target alias to all the entities in the database. The alias will be assigned to the entities with a HEOM shorter than some cut-value. Each alias-entity pair has a HEOM composed of all the similarity values. The second view is a classical classifier based on the instance's attributes projected into a multidimensional space. The classifier consists of a hyperplane that separates samples into two classes. Each alias-entity pair, with the values of the feature functions as attributes, is an instance for the classifier that can be classified as positive (matching) or negative (not matching).

The first approach determines a HEOM composed of the values returned by the similarity functions. All the similarity functions are normalized and transformed into dissimilarities in order to obtain a small HEOM value when alias and entity are similar and a large value otherwise. The HEOM is obtained from all the dissimilarities weighted in a quadratic summation:

HEOM = √( Σ_i w_i (d_i)² )

where d_i is the dissimilarity corresponding to similarity function i and w_i is the weight assigned to this value. Using a training data set, Hill Climbing determines the best weight for each feature and the cut-value in order to achieve the best possible performance. At each step, the algorithm increases and decreases each weight by a small step-value and selects the modification with the best results. The process is repeated until no modification is found that improves the current solution. The method is executed several times starting from random weights. Among the advantages of Hill Climbing are that it is easy to develop and can achieve good results in a short time.

The second approach consists of a pair alias-entity classifier using Support Vector Machines (SVM) (Cortes and Vapnik, 1995). SVMs have been widely used as classifiers (Osuna, Freund, and Girosi, 1997; Furey et al., 2000). This technique has the appealing features of having very few tunable parameters and of using structural risk minimization, which minimizes a bound on the generalization error. Theoretically, SVMs can achieve more precise values than Hill Climbing (for our task) because they search in a continuous space while hill climbing searches over discrete values. In addition, using kernels more complex than the linear one, they may combine attributes in a better way. Moreover, statistical learning avoids one of the problems of local search, namely falling into local minima. On the other hand, the computational cost of SVMs is higher than that of hill climbing.

4 Evaluation framework

We evaluated both algorithms on the alias assignment task with a corpus of organizations. While developing an IE system for the domain of football (soccer) over the Web, one of the problems we found is that clubs, federations, football players and many other football-related entities have very long official or real names. Consequently, nicknames or short names are widely used in both free and structured texts. Almost all texts use these short names to refer to the entities, assuming that everyone is able to distinguish which real entity is meant. For instance, to refer to "Futbol Club Barcelona", it is typical to find "FC Barcelona" or "Barcelona". We based the results of this paper on our study of the specific domain of football; however, we are presenting a general method for the alias assignment task that is useful in any other domain.

The corpus consists of 900 football club aliases assigned by hand against a database with 500 football club entities. Some aliases are assigned to more than one club, while others are not assigned because the referred club is not in our database. Each algorithm is trained and tested with five-fold cross-validation. Some examples of the annotated corpus can be seen in table 1.

Several aliases found across the Web refer to organizations not yet included in the database. Furthermore, for each matching alias-entity sample (classified as positive) we have almost 500 non-matching samples (classified as negative). This situation would drive accuracy to near 100% even for a blind classifier that always decides negative. In order to have a reasonable evaluation, only the set of positive predictions Mp is used in the evaluation and compared with the set Ma of examples annotated as positive. The measures used are Precision (1), Recall (2) and F1 (3). Only F1 values are shown and compared in this paper.

P = |Mp ∩ Ma| / |Mp|    (1)

R = |Mp ∩ Ma| / |Ma|    (2)

F1 = 2PR / (P + R)    (3)

5 Experiments

We evaluated the task of alias assignment in two experiments. In the first one, we compared the performance of Hill Climbing and SVM using a set of similarity functions. The second focuses on an improvement of the feature functions, breaking them down into several values representing more specific aspects of their characteristics.

5.1 Algorithm comparison

In the first approach, functions return a similarity value depending on some criterion. In this case, we are trying to simplify the classification process by including only the information we consider important. The larger the number of features included, the longer an algorithm takes to train and achieve good results. Based on this principle, we tried to pack as much information as we could into a few values. The feature functions used in this first experiment (example in figure 1) are the following:

Alias              Assigned entities
Sydney FC          Sydney Football Club
Man Utd            Manchester United Football Club
Nacional           Club Universidad Nacional AC UNAM, Club Deportivo El Nacional, Club Nacional, Club Nacional de Football
Steaua Bucharest   -not assigned-
Newcastle United   Newcastle United Jets Football Club, Newcastle United Football Club
Krylya Sovetov     Professional Football Club Krylya Sovetov Samara

Table 1: Example of some pairs alias-entity in the football domain
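Because a blind always-negative classifier would score near-perfect accuracy on data this skewed, section 4 restricts evaluation to the positive sets Mp and Ma. A minimal Python sketch of equations (1)-(3); the function name and the pair data are illustrative assumptions, not the authors' code:

```python
# Sketch of the positive-set evaluation of section 4 (eqs. 1-3).
# Each prediction/annotation is an (alias, entity) pair; data is hypothetical.

def evaluate(predicted_positive, annotated_positive):
    """Precision, recall and F1 computed only over the positive class."""
    mp = set(predicted_positive)   # M_p: pairs the classifier labels positive
    ma = set(annotated_positive)   # M_a: pairs annotated as positive
    hits = len(mp & ma)            # |M_p intersect M_a|
    p = hits / len(mp) if mp else 0.0
    r = hits / len(ma) if ma else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

p, r, f1 = evaluate(
    [("Man Utd", "Manchester United Football Club"),
     ("Sydney FC", "Sydney Football Club"),
     ("Steaua Bucharest", "Newcastle United Football Club")],  # a false positive
    [("Man Utd", "Manchester United Football Club"),
     ("Sydney FC", "Sydney Football Club")],
)
# p = 2/3, r = 1.0, f1 = 0.8
```

Note that true negatives (the vast majority of the almost 500 candidate entities per alias) never enter the computation, which is exactly what makes the measure informative here.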

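The token-level measures at the core of the feature set, the classical word overlap Sim(A, B) = |A ∩ B| / |A ∪ B| and the asymmetric variant that penalizes alias words absent from the entity name, can be sketched as follows. The whitespace tokenizer and function names are illustrative assumptions, not the paper's implementation:

```python
# Sketch of the two lexical similarities of section 5.1.2 (assumed tokenizer).

def tokens(name):
    """Case-insensitive whitespace tokenization into a set of words."""
    return set(name.lower().split())

def lexical_sim(alias, entity):
    """Symmetric word overlap: |A intersect B| / |A union B|."""
    a, b = tokens(alias), tokens(entity)
    return len(a & b) / len(a | b) if a | b else 0.0

def asymmetric_lexical_sim(alias, entity):
    """max(0, (|A intersect B| - |W_a|) / |A union B|),
    where W_a are the words of the alias not present in the entity name."""
    a, b = tokens(alias), tokens(entity)
    if not a | b:
        return 0.0
    return max(0.0, (len(a & b) - len(a - b)) / len(a | b))
```

For the alias "FC Barcelona" against "Futbol Club Barcelona", the symmetric measure gives 0.25, while the asymmetric variant gives 0.0 because "FC" is not a full word of the entity name; recovering such cases is precisely what the separate abbreviation and acronym features below are for.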
5.1.1 Character-based

• Prefix and Suffix similarities count the words of the alias that are the beginning (prefix) or the end (suffix) of a word in the entity name.

• Abbreviations similarity. If a word s in the alias is shorter than a word t in the entity name, they start with the same character, and each character of s appears in t in the same order, the function concludes that s is an abbreviation of t. For example, "Utd" is an abbreviation of "United" and "St" is an abbreviation of "Saint".

5.1.2 Token-based

• Lexical similarity compares the words of alias A and entity name B case-insensitively. A classical lexical similarity is:

Sim(A, B) = |A ∩ B| / |A ∪ B|

where |x ∩ y| corresponds to a function that returns the number of coincidences between the words of x and y, and |x ∪ y| denotes the number of different words in the union of x and y.

However, in our case of study we know that, while some words of the entity name may not occur in the alias, almost always, if a word occurs in the alias, it must be in the entity name. In other words, an alias tends to be a reduced subset of the words of the entity name, and it is difficult (though possible) to find an alias using words that do not occur in the entity name. In order to take advantage of this asymmetry in our lexical similarity, words of the alias not appearing in the entity name decrement the similarity, as shown below:

Sim(A, B) = max(0, (|A ∩ B| − |W_a|) / |A ∪ B|)

where W_a represents the words appearing in A but not in B, and the max function ensures that the similarity function never returns a value lower than zero.

• Keywords similarity is another lexical similarity, but it avoids typical domain-related words. These kinds of words occur in several names and can produce a good lexical similarity even when the important words (keywords) do not match. For example, "Manchester United Football Club" and "Dundee United Football Club" have a good lexical similarity but a bad keyword similarity, because "football" and "club" are considered typical domain-related words. It uses the same formula as Lexical similarity, but excludes typical domain-related words from A and B. Lexical similarity and Keywords similarity could be combined into a lexical similarity weighted with TF-IDF. However, the true contribution of each token to similarity is domain-specific and not always proportional to TF-IDF: some words have many occurrences but are still important, while others appear few times but are not helpful at all.

5.1.3 Structural

• Acronyms similarity looks for a correspondence between acronyms in the alias and capitalized words in the entity name. This feature takes word order into account, because the order of the characters in an acronym defines the order the words must have in the entity name. An example of an acronym is "PSV", which matches "Philips Sport Vereniging Eindhoven".

5.1.4 Semantic

• City similarity returns 1 (maximum similarity) only when one word in the alias corresponds to a city, one word in the entity name corresponds to a city, and both are the same city. In all other cases it returns 0 (no similarity). It can be useful because some cities have different names depending on the language; for instance, "Moscow" and "Moskva" are the same city, as are "Vienna" and "Wien". This feature requires world knowledge about cities.

• Website similarity compares the alias with the URL of the organization's website, if we have it. Leaving aside the first TLD (.com, .de, .es) and sometimes the second (.co.uk, .com.mx), it is usual for an organization to register a domain name containing its most typical alias. The return value of this function is the number of words of the alias included in the domain name divided by the total number of words in the alias. We can use this similarity function because we have more information about the entity than only the official name. If we do not have this information, the return value is zero.

(Figure not recoverable as an image; it showed the alias "Inter Milan" paired with the entity "Football Club Internazionale Milano s.p.a." and its website www.inter.it, with active features: "Football" and "Club" as typical words, "Inter" as an abbreviation of "Internazionale", "Milan" as a prefix of "Milano", and city and web matches.)

Figure 1: Example of a pair alias-entity and its active features

5.2 Extended features

The second experiment uses extended feature functions. This means that most of the feature functions used previously are modified so that they return more than one value, breaking down the information. The feature functions are the same, but they return a vector of values instead of a single value. The classifier may use this extra information if it is helpful for classification. For instance, lexical similarity now returns the number of words in the alias, the number of words in the entity name, and the number of equal words. Combining these values, the classifier can reproduce a function like our original lexical similarity, or maybe a better one.

In this second approach the goal is to compare the original feature functions with the extended ones. We chose SVM for this experiment because SVMs can use polynomial kernels that may combine attributes in a better way than a linear classifier. Consequently, in this experiment we compare the best classifier obtained in the first experiment with two SVM classifiers using the extended feature functions. One SVM uses a linear kernel, while the other tries to take advantage of a quadratic one. Table 2 shows the modifications made to each feature function.

6 Results

In our first experiment, described in section 5.1, we tried the two algorithms mentioned above, Hill Climbing and SVM, with the feature functions described previously. Table 3 shows the results, comparing them with a baseline consisting of some simple rules using only the lexical, keywords, acronyms and abbreviations similarities.

The first aspect to emphasize is that the baseline, a simple rule-based classifier, achieves an F1 measure over 80%.
This indicates that the alias assignment task has a high percentage of trivial examples; machine learning and new features may help with the difficult ones. Indeed, the results show that the machine learning algorithms significantly outperform the baseline. On the other hand, we find that the performance of Hill Climbing and SVM is similar. SVM seems to achieve better results, but the difference is not significant, since the confidence interval at the 95% significance level is 0.8%.

In the second approach we wanted to exploit the power of SVM in combining features, so we broke down the components of the feature functions as explained in section 5.2. The SVM may use this extra information if it is helpful for classification. In table 4, two SVMs with different kernels using the extended features are compared with the results obtained in the first experiment.

The results indicate that the extended features outperform the original ones. On the other hand, we can see that a quadratic kernel does not improve on the results of the linear kernel.

Feature    Return values
Prefix     Pre1: # words in the alias that are prefixes in the entity name
Suffix     Suf1: # words in the alias that are suffixes in the entity name
Abbrev.    Abr1: # words in the alias that are an abbreviation of a word in the entity name
Lexical    Lex1: # words in the alias
           Lex2: # words in the entity name
           Lex3: # equal words
           Lex4: # equal words, case sensitive
Keywords   Key1: # keywords in the alias (words excluding typical domain words: football, club, etc.)
           Key2: # keywords in the entity name
           Key3: # equal keywords
Acronym    Acr1: the alias has an acronym (boolean)
           Acr2: the alias acronym matches capitalized words in the entity name (boolean)
           Acr3: # words in the alias without acronyms
           Acr4: # words in the entity name without words involved in acronyms
           Acr5: # equal words without words involved in acronyms
City       Cit1: some word in the alias is a city (boolean)
           Cit2: some word in the entity name is a city (boolean)
           Cit3: both are the same city (boolean)
Website    Web1: the entity has a value in the website field (boolean)
           Web2: # words occurring both in the alias and in the URL of the entity

Table 2: Extended features used in the second experiment

     Baseline   Hill Climbing   SVM
F1   80.3       87.1            87.9

Table 3: Results of experiment (1), comparing the simple rule-based baseline with Hill Climbing and SVM

Features    Simple   Extended   Extended
Algorithm   SVM      SVM        SVM
Kernel      linear   linear     quadratic
F1          87.9     93.0       93.0

Table 4: Results of experiment (2), comparing the original features with the extended features

7 Conclusions

In this paper we have proposed a homogeneous model to deal with the problem of classifying an alias-entity pair into true/false categories. The model consists of using a set of feature functions instead of the state-of-the-art approach based on distinguishing between a set of lexico-orthographical similarity functions and a set of semantic rules.

Some experiments have been performed in order to compare different configurations of the proposed model. The configurations differ in the set of feature functions and in the discretization strategy for the feature weights. Also, two learning techniques have been applied, namely Hill Climbing and SVMs.

We have seen that Hill Climbing and SVM perform similarly. Both algorithms have advantages and disadvantages. On one hand, Hill Climbing is simple and fast but has two drawbacks. The first is that it searches for weights in steps, which means the weights are always discrete values, sometimes decreasing the final accuracy. The other drawback is that local search can fall into local minima.
However, this may be palliated by executing the algorithm several times starting from random values. On the other hand, SVMs work in a continuous space and learn statistically, which avoids the two drawbacks of hill climbing, although SVMs take longer to tune correctly.

In the second experiment, since SVMs can handle richer combinations of features when using polynomial kernels, we tested SVMs with a linear kernel and a quadratic one, obtaining similar results. The feature set used in this experiment was a refinement of the previous one; that is, the features contained the same information, but coded with finer granularity. The results showed that, although the similarity functions used in the first approach produced accurate results, letting the SVM handle all the parameters yields a significant improvement.

References

Bhattacharya, Indrajit and Lise Getoor. 2004. Iterative record linkage for cleaning and integration. In DMKD '04: Proceedings of the 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery, pages 11–18, New York, NY, USA. ACM Press.

Bilenko, Mikhail and Raymond J. Mooney. 2003. Adaptive duplicate detection using learnable string similarity measures. In KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 39–48, New York, NY, USA. ACM Press.

Cohen, W., P. Ravikumar, and S. Fienberg. 2003. A comparison of string distance metrics for name-matching tasks.

Cortes, Corinna and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning, pages 273–297. Kluwer Academic Publishers, Boston.

Doan, AnHai, Ying Lu, Yoonkyong Lee, and Jiawei Han. 2003. Profile-based object matching for information integration. IEEE Intelligent Systems, 18(5):54–59.

Dong, Xin, Alon Halevy, and Jayant Madhavan. 2005. Reference reconciliation in complex information spaces. In SIGMOD '05: Proceedings of the 2005 ACM SIGMOD international conference on Management of data, pages 85–96, New York, NY, USA. ACM Press.

Furey, T. S., N. Christianini, N. Duffy, D. W. Bednarski, M. Schummer, and D. Hauessler. 2000. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16(10):906–914.

Hernandez, Mauricio A. and Salvatore J. Stolfo. 1995. The merge/purge problem for large databases. In SIGMOD '95: Proceedings of the 1995 ACM SIGMOD international conference on Management of data, pages 127–138, New York, NY, USA. ACM Press.

Kalashnikov, Dmitri V. and Sharad Mehrotra. 2006. Domain-independent data cleaning via analysis of entity-relationship graph. ACM Trans. Database Syst., 31(2):716–767.

Li, Xin, Paul Morie, and Dan Roth. 2004. Identification and tracing of ambiguous names: Discriminative and generative approaches. In Proceedings of the National Conference on Artificial Intelligence, pages 419–424. AAAI Press / MIT Press.

McCallum, Andrew, Kamal Nigam, and Lyle H. Ungar. 2000. Efficient clustering of high-dimensional data sets with application to reference matching. In KDD '00: Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 169–178, New York, NY, USA. ACM Press.

Osuna, Edgar, Robert Freund, and Federico Girosi. 1997. Training support vector machines: an application to face detection. In CVPR, page 130.

Pasula, H., B. Marthi, B. Milch, S. Russell, and I. Shpitser. 2002. Identity uncertainty and citation matching.

Sarawagi, Sunita and Anuradha Bhamidipaty. 2002. Interactive deduplication using active learning. In KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 269–278, New York, NY, USA. ACM Press.

Shen, W., X. Li, and A. Doan. 2005. Constraint-based entity matching. In Proceedings of AAAI.

Skalak, David B. 1994. Prototype and feature selection by sampling and random mutation hill climbing algorithms. In International Conference on Machine Learning, pages 293–301.

Tejada, Sheila, Craig A. Knoblock, and Steven Minton. 2002. Learning domain-independent string transformation weights for high accuracy object identification. In KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 350–359, New York, NY, USA. ACM Press.

Winkler, W. 1999. The state of record linkage and current research problems.