
Alias Assignment in Information Extraction

Emili Sapena, Lluís Padró and Jordi Turmo
TALP Research Center, Universitat Politècnica de Catalunya, Spain
{esapena, padro, turmo}@lsi.upc.edu

Resumen: This article presents a general method for the alias assignment task in information extraction. Two approaches to tackling the problem and learning a classifier are compared. The first quantifies a global similarity between the alias and all the possible entities by assigning weights to the features of each alias-entity pair. The second is a classical classifier in which each instance is an alias-entity pair and its attributes are the pair's features. Both approaches use the same feature functions over the alias-entity pair, where every level of abstraction, from characters up to the semantic level, is treated homogeneously. In addition, extended feature functions are proposed that break down the information and let the machine learning algorithm determine the final contribution of each value. The use of extended functions improves the results of the simple ones.

Palabras clave: alias assignment, information extraction, entity matching

Abstract: This paper presents a general method for the alias assignment task in information extraction. We compare two approaches to tackling the problem and learning a classifier. The first quantifies a global similarity between the alias and all the possible entities by weighting features computed over each alias-entity pair. The second is a classical classifier in which each instance is an alias-entity pair and its attributes are the pair's features. Both approaches use the same feature functions over the alias-entity pair, where every level of abstraction, from raw characters up to the semantic level, is treated in a homogeneous way. In addition, we propose extended feature functions that break down the information and let the machine learning algorithm determine the final contribution of each value. The use of extended features improves the results of the simple ones.

Keywords: Alias Assignment, Information Extraction, Entity Matching

1 Introduction

Alias assignment is a variation of the entity matching problem. Entity matching decides whether two named entities in the data, such as "George W. Bush" and "Bush", refer to the same real-world entity. Variations in named entity expressions arise for multiple reasons: use of abbreviations, different naming conventions (for example "Name Surname" and "Surname, N."), aliases, misspellings, or naming variations over time (for example "Leningrad" and "Saint Petersburg"). In order to keep extracted or processed data coherent for further analysis, it is mandatory to determine when different mentions refer to the same real entity.

This problem arises in many applications that integrate data from multiple sources. Consequently, it has been explored by a large number of communities, including statistics, information systems and artificial intelligence. In particular, many tasks related to natural language processing, such as question answering, summarization and information extraction, among others, have had to deal with the problem. Depending on the area, variants of the problem are known under different names, such as identity uncertainty (Pasula et al., 2002), tuple matching, record linkage (Winkler, 1999), deduplication (Sarawagi and Bhamidipaty, 2002), the merge/purge problem (Hernandez and Stolfo, 1995), data cleaning (Kalashnikov and Mehrotra, 2006), reference reconciliation (Dong, Halevy, and Madhavan, 2005), mention matching and instance identification.

Alias assignment decides whether a mention in one source can refer to one or more entities in the data. The same alias can be shared by several entities or, on the contrary, it can refer to an unknown entity. For instance, the alias "Moore" would be assigned to the entity "Michael Moore" and also to "John Moore" if we have both in the data. However, the alias "P. Moore" cannot be assigned to either of them. Therefore, while the entity matching problem consists of determining when two records are the same real entity, alias assignment focuses on finding out whether references in a text point to known real entities in our database or not. After alias assignment, a disambiguation procedure is required to decide which of the possible real entities the alias points to in each context. The disambiguation procedure, however, is out of the scope of this paper.

There is little previous work that directly addresses the problem of alias assignment as its main focus, but many solutions have been developed for the related problem of entity matching. Early solutions employ manually specified rules (Hernandez and Stolfo, 1995), while subsequent works focus on learning the rules from training data (Tejada, Knoblock, and Minton, 2002; Bilenko and Mooney, 2003). Numerous solutions focus on efficient techniques to match strings, either manually specified (Cohen, Ravikumar, and Fienberg, 2003) or learned from training data (Bilenko and Mooney, 2003). Other solutions exploit the database topology, for instance clustering a large number of tuples (McCallum, Nigam, and Ungar, 2000), exploiting links (Bhattacharya and Getoor, 2004) or using a relational probability model to define a generative model (Pasula et al., 2002).

In recent years, some works take advantage of domain knowledge at the semantic level to improve the results. For example, Doan et al. (Doan et al., 2003) show how semantic rules, either automatically learned or specified by a domain expert, can improve the results. Shen et al. (Shen, Li, and Doan, 2005) use probabilistic domain constraints in a more general model, employing a relaxation labeling algorithm to perform matching.

Some of the methods used for entity matching are not applicable to alias assignment because the information contributed by an alias-entity pair is poorer than that of an entity-entity pair. An alias is only a small group of words without attributes and, normally, without any useful contextual information. However, using some domain knowledge, some information about the entities and some information about the world, it is possible to improve the results of a system that relies only on string similarity measures.

This paper presents a general method for the alias assignment task in information extraction. We compare two approaches to tackling the problem and learning a classifier. The first quantifies a global similarity between the alias and all the possible entities by weighting features computed over each alias-entity pair. The algorithm employed to find the best weights is Hill Climbing. The second is a classical pairwise classification where each instance is an alias-entity pair and its attributes are the pair's features. The classifier is learned with Support Vector Machines. Both approaches use the same feature functions over the alias-entity pair, where every level of abstraction, from raw characters up to the semantic level, is treated in a homogeneous way. In addition, we propose a set of extended feature functions that break down the information and let the machine learning algorithm determine the final contribution of each value. The use of extended features improves the results of the simple ones.

The rest of the paper is structured as follows. Section 2 formalizes the problem of alias assignment and its representation. Section 3 introduces the machine learning algorithms used. Next, section 4 presents the experimental methodology and data used in our evaluation. In section 5 we describe the feature functions employed in our empirical evaluation. Section 6 shows the results obtained and, finally, we present our conclusions in section 7.

2 Problem definition and representation

The alias assignment problem can be formalized as pairwise classification: find a function f : N × N → {1, −1} which classifies an alias-entity pair as positive (1) if the alias represents the entity, or negative (−1) if not. The alias and the entity are represented as strings in a name space N. We propose a variation of the classifier in which we can also use any useful attributes we have about the entity. In our case, the function to find will be f : N × M → {1, −1}, where M represents a different space including all the entity's attributes.

We define a feature function as a function that represents a property of the alias, the entity, or the alias-entity pair. Once an alias-entity pair is represented as a vector of features, one can combine them appropriately using machine learning algorithms to obtain a classifier. In section 3 we explain how we learn classifiers using two different approaches. Most of the feature functions used here are similarity functions, which quantify the similarity of the alias-entity pair according to some criterion. In a similarity function, a larger returned value indicates greater similarity, while a smaller value indicates lower similarity (dissimilarity).

Feature functions can be divided into four groups by their level of abstraction, from raw characters up to the semantic level. At the lowest level, the functions focus on character-based similarity between strings. These techniques rely on character edit operations, such as deletions, insertions, substitutions and subsequence comparison. Edit similarities detect typographical errors such as writing mistakes or OCR errors, abbreviations, similar lemmas and other intra-word differences.

The second level of abstraction is centered on vector-space based techniques and is also known as token-level or word-level. The two strings to compare are considered as groups of words (or tokens), disregarding the order in which the tokens occur in the strings. Token-based similarity metrics use operations over sets, such as union or intersection.

At a higher level we find structural features, similar to the work in (Li, Morie, and Roth, 2004). Structural features encode information on the relative order of tokens between two strings, by recording the location of the participating tokens in the partition.

The highest level includes the functions with added knowledge. This extra knowledge can be obtained from other attributes of the entity, from an ontology, or can be knowledge about the world. Some previous works (Shen, Li, and Doan, 2005; Doan et al., 2003) use this extra knowledge as rules to be satisfied. First, rules are specified manually or obtained from the data; then some weight or probability must be assigned to each rule, also distinguishing hard rules from soft ones. In (Shen, Li, and Doan, 2005), weights are established by an expert user or learned from the same data set to be classified. In our work, we present another way to use this information: we propose to add more feature functions to increase the number of attributes for our classifier. Each new feature function describes some characteristic of the alias, of the entity, or of the alias-entity pair that needs some extra knowledge. The contribution of each feature will be learned like that of any other similarity function when a machine learning method is applied.

3 Learning classifiers

Two approaches are used and compared in order to obtain a good classifier using the feature functions introduced above: Hill Climbing (Skalak, 1994) and Support Vector Machines (Cortes and Vapnik, 1995). Each takes a different view of the problem. The first treats the problem as a nearest neighbor model and tries to determine a global Heterogeneous Euclidean-Overlap Metric (HEOM) from the target alias to all the entities in the database. The alias will be assigned to the entities with a HEOM shorter than some cut-value. Each alias-entity pair has a HEOM composed of all the similarity values. The second view is a classical classifier based on the instance's attributes projected into a multidimensional space. The classifier consists of a hyperplane that separates samples into two classes. Each alias-entity pair, with the values of the feature functions as attributes, is an instance for the classifier that can be classified as positive (matching) or negative (not matching).

The first approach determines a HEOM composed of the values returned by the similarity functions. All the similarity functions are normalized and transformed into dissimilarities in order to obtain a small HEOM value when alias and entity are similar and a large value otherwise. The HEOM is obtained from all the dissimilarities weighted in a quadratic summation:

HEOM = √( Σ_i w_i (d_i)² )

where d_i is the dissimilarity corresponding to similarity function i and w_i is the weight assigned to this value. Using a training data set, Hill Climbing determines the best weight for each feature and the cut-value in order to achieve the best possible performance. At each step, the algorithm increases and decreases each weight by a small step-value and selects the modification with the best results. The process is repeated until no modification is found that improves the current solution. The method is executed several times starting from random weights. Among the advantages of Hill Climbing are that it is easy to develop and can achieve good results in a short time.

The second approach consists of a pair alias-entity classifier using Support Vector Machines (SVM) (Cortes and Vapnik, 1995). SVMs have been widely used as classifiers (Osuna, Freund, and Girosi, 1997; Furey et al., 2000). This technique has the appealing features of having very few tunable parameters and of using structural risk minimization, which minimizes a bound on the generalization error. Theoretically, SVMs can achieve more precise values than Hill Climbing (for our task) because they search in a continuous space while hill climbing searches over discrete values. In addition, using kernels more complex than the linear one, they may combine attributes in a better way. Moreover, statistical learning avoids one of the problems of local search, namely falling into local minima. On the other hand, the computational cost of SVMs is higher than that of hill climbing.

4 Evaluation framework

We evaluated both algorithms on the alias assignment task with a corpus of organizations. While developing an IE system for the domain of football (soccer) over the Web, one of the problems we found is that clubs, federations, football players and many other football-related entities have very long official or real names. Consequently, nicknames or short names are widely used in both free and structured texts. Almost all texts use these short names to refer to the entities, assuming that everyone is able to distinguish which real entity is meant. For instance, to refer to "Futbol Club Barcelona", it is typical to find "FC Barcelona" or "Barcelona". We based the results of this paper on our study of the specific domain of football; however, we are presenting a general method for the alias assignment task that is useful in any other domain.

The corpus consists of 900 football club aliases assigned by hand against a database with 500 football club entities. Some aliases are assigned to more than one club, while others are not assigned because the referred club is not in our database. Each algorithm is trained and tested with five-fold cross-validation. Some examples of the annotated corpus can be seen in table 1.

Several aliases found across the Web refer to organizations not yet included in the database. Furthermore, for each matching alias-entity sample (classified as positive) we have almost 500 non-matching samples (classified as negative). This situation would drive accuracy to near 100% even for a blind classifier that always decides negative. In order to have a reasonable evaluation, only the set of positive predictions Mp is used in the evaluation and compared with the set Ma of examples annotated as positive. The measures used are Precision (1), Recall (2) and F1 (3). Only F1 values are shown and compared in this paper.

P = |Mp ∩ Ma| / |Mp|    (1)

R = |Mp ∩ Ma| / |Ma|    (2)

F1 = 2PR / (P + R)    (3)

5 Experiments

We evaluated the task of alias assignment in two experiments. In the first one, we compared the performance of Hill Climbing and SVM using a set of similarity functions. The second focuses on an improvement of the feature functions, breaking them down into several values representing more specific aspects of their characteristics.

5.1 Algorithm comparison

In the first approach, functions return a similarity value depending on some criterion. In this case, we are trying to simplify the classification process by including only the information we consider important. The larger the number of features included, the longer an algorithm takes to train and achieve good results. Based on this principle, we tried to pack as much information as we could into a few values. The feature functions used in this first experiment (example in figure 1) are the following:

Alias              Assigned entities
Sydney FC          Sydney Football Club
Man Utd            Manchester United Football Club
Nacional           Club Universidad Nacional AC UNAM, Club Deportivo El Nacional, Club Nacional, Club Nacional de Football
Steaua Bucharest   -not assigned-
Newcastle United   Newcastle United Jets Football Club, Newcastle United Football Club
Krylya Sovetov     Professional Football Club Krylya Sovetov Samara

Table 1: Example of some pairs alias-entity in the football domain
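Because a blind always-negative classifier would score near-perfect accuracy on data this skewed, section 4 restricts evaluation to the positive sets Mp and Ma. A minimal Python sketch of equations (1)-(3); the function name and the pair data are illustrative assumptions, not the authors' code:

```python
# Sketch of the positive-set evaluation of section 4 (eqs. 1-3).
# Each prediction/annotation is an (alias, entity) pair; data is hypothetical.

def evaluate(predicted_positive, annotated_positive):
    """Precision, recall and F1 computed only over the positive class."""
    mp = set(predicted_positive)   # M_p: pairs the classifier labels positive
    ma = set(annotated_positive)   # M_a: pairs annotated as positive
    hits = len(mp & ma)            # |M_p intersect M_a|
    p = hits / len(mp) if mp else 0.0
    r = hits / len(ma) if ma else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

p, r, f1 = evaluate(
    [("Man Utd", "Manchester United Football Club"),
     ("Sydney FC", "Sydney Football Club"),
     ("Steaua Bucharest", "Newcastle United Football Club")],  # a false positive
    [("Man Utd", "Manchester United Football Club"),
     ("Sydney FC", "Sydney Football Club")],
)
# p = 2/3, r = 1.0, f1 = 0.8
```

Note that true negatives (the vast majority of the almost 500 candidate entities per alias) never enter the computation, which is exactly what makes the measure informative here.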

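The token-level measures at the core of the feature set, the classical word overlap Sim(A, B) = |A ∩ B| / |A ∪ B| and the asymmetric variant that penalizes alias words absent from the entity name, can be sketched as follows. The whitespace tokenizer and function names are illustrative assumptions, not the paper's implementation:

```python
# Sketch of the two lexical similarities of section 5.1.2 (assumed tokenizer).

def tokens(name):
    """Case-insensitive whitespace tokenization into a set of words."""
    return set(name.lower().split())

def lexical_sim(alias, entity):
    """Symmetric word overlap: |A intersect B| / |A union B|."""
    a, b = tokens(alias), tokens(entity)
    return len(a & b) / len(a | b) if a | b else 0.0

def asymmetric_lexical_sim(alias, entity):
    """max(0, (|A intersect B| - |W_a|) / |A union B|),
    where W_a are the words of the alias not present in the entity name."""
    a, b = tokens(alias), tokens(entity)
    if not a | b:
        return 0.0
    return max(0.0, (len(a & b) - len(a - b)) / len(a | b))
```

For the alias "FC Barcelona" against "Futbol Club Barcelona", the symmetric measure gives 0.25, while the asymmetric variant gives 0.0 because "FC" is not a full word of the entity name; recovering such cases is precisely what the separate abbreviation and acronym features below are for.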
5.1.1 Character-based

• Prefix and Suffix similarities count the words of the alias that are the beginning (prefix) or the end (suffix) of a word in the entity name.

• Abbreviations similarity. If a word s in the alias is shorter than a word t in the entity name, they start with the same character, and each character of s appears in t in the same order, the function concludes that s is an abbreviation of t. For example, "Utd" is an abbreviation of "United" and "St" is an abbreviation of "Saint".

5.1.2 Token-based

• Lexical similarity compares the words of alias A and entity name B case-insensitively. A classical lexical similarity is:

Sim(A, B) = |A ∩ B| / |A ∪ B|

where |x ∩ y| corresponds to a function that returns the number of coincidences between the words of x and y, and |x ∪ y| denotes the number of different words in the union of x and y.

However, in our case of study we know that, while some words of the entity name may not occur in the alias, almost always, if a word occurs in the alias, it must be in the entity name. In other words, an alias tends to be a reduced subset of the words of the entity name, and it is difficult (though possible) to find an alias using words that do not occur in the entity name. In order to take advantage of this asymmetry in our lexical similarity, words of the alias not appearing in the entity name decrement the similarity, as shown below:

Sim(A, B) = max(0, (|A ∩ B| − |W_a|) / |A ∪ B|)

where W_a represents the words appearing in A but not in B, and the max function ensures that the similarity function never returns a value lower than zero.

• Keywords similarity is another lexical similarity, but it avoids typical domain-related words. These kinds of words occur in several names and can produce a good lexical similarity even when the important words (keywords) do not match. For example, "Manchester United Football Club" and "Dundee United Football Club" have a good lexical similarity but a bad keyword similarity, because "football" and "club" are considered typical domain-related words. It uses the same formula as Lexical similarity, but excludes typical domain-related words from A and B. Lexical similarity and Keywords similarity could be combined into a lexical similarity weighted with TF-IDF. However, the true contribution of each token to similarity is domain-specific and not always proportional to TF-IDF: some words have many occurrences but are still important, while others appear few times but are not helpful at all.

5.1.3 Structural

• Acronyms similarity looks for a correspondence between acronyms in the alias and capitalized words in the entity name. This feature takes word order into account, because the order of the characters in an acronym defines the order the words must have in the entity name. An example of an acronym is "PSV", which matches "Philips Sport Vereniging Eindhoven".

5.1.4 Semantic

• City similarity returns 1 (maximum similarity) only when one word in the alias corresponds to a city, one word in the entity name corresponds to a city, and both are the same city. In all other cases it returns 0 (no similarity). It can be useful because some cities have different names depending on the language; for instance, "Moscow" and "Moskva" are the same city, as are "Vienna" and "Wien". This feature requires world knowledge about cities.

• Website similarity compares the alias with the URL of the organization's website, if we have it. Leaving aside the first TLD (.com, .de, .es) and sometimes the second (.co.uk, .com.mx), it is usual for an organization to register a domain name containing its most typical alias. The return value of this function is the number of words of the alias included in the domain name divided by the total number of words in the alias. We can use this similarity function because we have more information about the entity than only the official name. If we do not have this information, the return value is zero.

(Figure not recoverable as an image; it showed the alias "Inter Milan" paired with the entity "Football Club Internazionale Milano s.p.a." and its website www.inter.it, with active features: "Football" and "Club" as typical words, "Inter" as an abbreviation of "Internazionale", "Milan" as a prefix of "Milano", and city and web matches.)

Figure 1: Example of a pair alias-entity and its active features

5.2 Extended features

The second experiment uses extended feature functions. This means that most of the feature functions used previously are modified so that they return more than one value, breaking down the information. The feature functions are the same, but they return a vector of values instead of a single value. The classifier may use this extra information if it is helpful for classification. For instance, lexical similarity now returns the number of words in the alias, the number of words in the entity name, and the number of equal words. Combining these values, the classifier can reproduce a function like our original lexical similarity, or maybe a better one.

In this second approach the goal is to compare the original feature functions with the extended ones. We chose SVM for this experiment because SVMs can use polynomial kernels that may combine attributes in a better way than a linear classifier. Consequently, in this experiment we compare the best classifier obtained in the first experiment with two SVM classifiers using the extended feature functions. One SVM uses a linear kernel, while the other tries to take advantage of a quadratic one. Table 2 shows the modifications made to each feature function.

6 Results

In our first experiment, described in section 5.1, we tried the two algorithms mentioned above, Hill Climbing and SVM, with the feature functions described previously. Table 3 shows the results, comparing them with a baseline consisting of some simple rules using only the lexical, keywords, acronyms and abbreviations similarities.

The first aspect to emphasize is that the baseline, a simple rule-based classifier, achieves an F1 measure over 80%.
This indicates that the alias assignment task has a high percentage of trivial examples; machine learning and new features may help with the difficult ones. Indeed, the results show that the machine learning algorithms significantly outperform the baseline. On the other hand, we find that the performance of Hill Climbing and SVM is similar. SVM seems to achieve better results, but the difference is not significant, since the confidence interval at the 95% significance level is 0.8%.

In the second approach we wanted to exploit the power of SVM in combining features, so we broke down the components of the feature functions as explained in section 5.2. The SVM may use this extra information if it is helpful for classification. In table 4, two SVMs with different kernels using the extended features are compared with the results obtained in the first experiment.

The results indicate that the extended features outperform the original ones. On the other hand, we can see that a quadratic kernel does not improve on the results of the linear kernel.

Feature    Return values
Prefix     Pre1: # words in the alias that are prefixes in the entity name
Suffix     Suf1: # words in the alias that are suffixes in the entity name
Abbrev.    Abr1: # words in the alias that are an abbreviation of a word in the entity name
Lexical    Lex1: # words in the alias
           Lex2: # words in the entity name
           Lex3: # equal words
           Lex4: # equal words, case sensitive
Keywords   Key1: # keywords in the alias (words excluding typical domain words: football, club, etc.)
           Key2: # keywords in the entity name
           Key3: # equal keywords
Acronym    Acr1: the alias has an acronym (boolean)
           Acr2: the alias acronym matches capitalized words in the entity name (boolean)
           Acr3: # words in the alias without acronyms
           Acr4: # words in the entity name without words involved in acronyms
           Acr5: # equal words without words involved in acronyms
City       Cit1: some word in the alias is a city (boolean)
           Cit2: some word in the entity name is a city (boolean)
           Cit3: both are the same city (boolean)
Website    Web1: the entity has a value in the website field (boolean)
           Web2: # words occurring both in the alias and in the URL of the entity

Table 2: Extended features used in the second experiment

     Baseline   Hill Climbing   SVM
F1   80.3       87.1            87.9

Table 3: Results of experiment (1), comparing the simple rule-based baseline with Hill Climbing and SVM

Features    Simple   Extended   Extended
Algorithm   SVM      SVM        SVM
Kernel      linear   linear     quadratic
F1          87.9     93.0       93.0

Table 4: Results of experiment (2), comparing the original features with the extended features

7 Conclusions

In this paper we have proposed a homogeneous model to deal with the problem of classifying an alias-entity pair into true/false categories. The model consists of using a set of feature functions instead of the state-of-the-art approach based on distinguishing between a set of lexico-orthographical similarity functions and a set of semantic rules.

Some experiments have been performed in order to compare different configurations of the proposed model. The configurations differ in the set of feature functions and in the discretization strategy for the feature weights. Also, two learning techniques have been applied, namely Hill Climbing and SVMs.

We have seen that Hill Climbing and SVM perform similarly. Both algorithms have advantages and disadvantages. On one hand, Hill Climbing is simple and fast but has two drawbacks. The first is that it searches for weights in steps, which means the weights are always discrete values, sometimes decreasing the final accuracy. The other drawback is that local search can fall into local minima.
However, this may be palliated by executing the algorithm several times starting from random values. On the other hand, SVMs work in a continuous space and learn statistically, which avoids the two drawbacks of hill climbing, although SVMs take longer to tune correctly.

In the second experiment, since SVMs can handle richer combinations of features when using polynomial kernels, we tested SVMs with a linear kernel and a quadratic one, obtaining similar results. The feature set used in this experiment was a refinement of the previous one; that is, the features contained the same information, but coded with finer granularity. The results showed that, although the similarity functions used in the first approach produced accurate results, letting the SVM handle all the parameters yields a significant improvement.

References

Bhattacharya, Indrajit and Lise Getoor. 2004. Iterative record linkage for cleaning and integration. In DMKD '04: Proceedings of the 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery, pages 11–18, New York, NY, USA. ACM Press.

Bilenko, Mikhail and Raymond J. Mooney. 2003. Adaptive duplicate detection using learnable string similarity measures. In KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 39–48, New York, NY, USA. ACM Press.

Cohen, W., P. Ravikumar, and S. Fienberg. 2003. A comparison of string distance metrics for name-matching tasks.

Cortes, Corinna and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning, pages 273–297. Kluwer Academic Publishers, Boston.

Doan, AnHai, Ying Lu, Yoonkyong Lee, and Jiawei Han. 2003. Profile-based object matching for information integration. IEEE Intelligent Systems, 18(5):54–59.

Dong, Xin, Alon Halevy, and Jayant Madhavan. 2005. Reference reconciliation in complex information spaces. In SIGMOD '05: Proceedings of the 2005 ACM SIGMOD international conference on Management of data, pages 85–96, New York, NY, USA. ACM Press.

Furey, T. S., N. Christianini, N. Duffy, D. W. Bednarski, M. Schummer, and D. Hauessler. 2000. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16(10):906–914.

Hernandez, Mauricio A. and Salvatore J. Stolfo. 1995. The merge/purge problem for large databases. In SIGMOD '95: Proceedings of the 1995 ACM SIGMOD international conference on Management of data, pages 127–138, New York, NY, USA. ACM Press.

Kalashnikov, Dmitri V. and Sharad Mehrotra. 2006. Domain-independent data cleaning via analysis of entity-relationship graph. ACM Trans. Database Syst., 31(2):716–767.

Li, Xin, Paul Morie, and Dan Roth. 2004. Identification and tracing of ambiguous names: Discriminative and generative approaches. In Proceedings of the National Conference on Artificial Intelligence, pages 419–424. AAAI Press / MIT Press.

McCallum, Andrew, Kamal Nigam, and Lyle H. Ungar. 2000. Efficient clustering of high-dimensional data sets with application to reference matching. In KDD '00: Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 169–178, New York, NY, USA. ACM Press.

Osuna, Edgar, Robert Freund, and Federico Girosi. 1997. Training support vector machines: an application to face detection. In CVPR, page 130.

Pasula, H., B. Marthi, B. Milch, S. Russell, and I. Shpitser. 2002. Identity uncertainty and citation matching.

Sarawagi, Sunita and Anuradha Bhamidipaty. 2002. Interactive deduplication using active learning. In KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 269–278, New York, NY, USA. ACM Press.

Shen, W., X. Li, and A. Doan. 2005. Constraint-based entity matching. In Proceedings of AAAI.

Skalak, David B. 1994. Prototype and feature selection by sampling and random mutation hill climbing algorithms. In International Conference on Machine Learning, pages 293–301.

Tejada, Sheila, Craig A. Knoblock, and Steven Minton. 2002. Learning domain-independent string transformation weights for high accuracy object identification. In KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 350–359, New York, NY, USA. ACM Press.

Winkler, W. 1999. The state of record linkage and current research problems.