Aproximate Personal Name-Matching Through Finite-State Graphs
Total Page:16
File Type:pdf, Size:1020Kb
ASI5813_0000_20671.qxd 9/12/07 11:41 PM Page 1 Approximate Personal Name-Matching Through Finite-State Graphs Carmen Galvez and Félix Moya-Anegón Department of information Science, University of Granada, Campus Cartuja, Colegio Máximo, 18071 Granada, Spain. E-mail: {cgalvez, felix}@urg.es This article shows how finite-state methods can be individuals may have the same name) are due precisely to employed in a new and different task: the conflation of the ambiguity in that relationship of reference. The context personal name variants in standard forms. In biblio- could be clarified by determining the intellectual connection graphic databases and citation index systems, variant forms create problems of inaccuracy that affect informa- between the PNs, or by using subject indicators. tion retrieval, the quality of information from databases, The referential ambiguity of PNs is a major problem in IR and the citation statistics used for the evaluation of sci- and citation statistics. The disambiguation of authors whose entists’ work. A number of approximate string matching names coincide and overlap in the citation indexing systems techniques have been developed to validate variant calls for methods that distinguish the contexts in which forms, based on similarity and equivalence relations. We classify the personal name variants as nonvalid and valid authors or individuals appear, such as cross-document forms. In establishing an equivalence relation between co-reference resolution algorithms that use the vector space valid variants and the standard form of its equivalence model to resolve ambiguities (Bagga & Baldwin, 1998; class, we defend the application of finite-state transducers. Gooi & Allan, 2004; Han, Giles, Zha, Li, & Tsioutsiouliklis, The process of variant identification requires the elabora- 2004; Wacholder, Ravin, & Choi, 1997), co-occurrence tion of: (a) binary matrices and (b) finite-state graphs. This procedure was tested on samples of author names from analysis and clustering techniques (Han, Zha, & Giles, 2005; bibliographic records, selected from the Library and Infor- Mann & Yarowsky, 2003; Pedersen, Purandare, & Kulkarni, mation Science Abstracts and Science Citation Index 2005), probabilistic similarity metrics (Torvik, Weeber, Expanded databases. The evaluation involved calculating Swanson, & Smalheiser, 2005), or co-citation analysis the measures of precision and recall, based on complete- and visualization mapping algorithms through a Kohonen ness and accuracy. The results demonstrate the usefulness of this approach, although it should be complemented with network (X. Lin, White, & Buzydlowski, 2003; McCain, methods based on similarity relations for the recognition of 1990). Underlying identity uncertainty problems, however, spelling variants and misspellings. are beyond the scope of the present article. PNs can be considered object tags that may appear in many different forms, known as variants. A PN variant can Introduction be described as a text occurrence that is conceptually well related with the correct form, or canonical form, of a name. A number of different interrelated areas are involved in the The recognition of the variant of these sequences would identification of Personal Names (PNs), including Natural revolve around one of three procedures specified by Thomp- Language Processing, Information Extraction, and Informa- son and Dozier (1999): name-recognition, name-matching, tion Retrieval (IR). Personal names are included in the cate- and name-searching. gory of proper names, defined as expressions or phrases that Name-recognition is the process by which a string of designate persons without providing conceptual information. characters is identified as a name. This is done both in data- Their primary semantic function is to establish a relationship base indexing processes and at the time of retrieval, after of reference between the phrase and the object. The identity parsing the user query. It is widely used to extract names certainty of this relationship is determined in many cases by from texts, as described in the Message Understanding Con- context. Some problems surrounding homonyms (i.e., different ferences (MUC-4, 1992; MUC-6, 1995), as part of informa- tion extraction. In MUC-6, the recognition of the named entity is considered a key element of extraction systems; Received January 29, 2007; accepted January 30, 2007 entities include names of persons, organizations, or places as © 2007 Wiley Periodicals, Inc. • Published online 00 xxxxxx 2007 in Wiley well as expressions of time or monetary expressions. In InterScience (www.interscience.wiley.com). DOI: 10.1002/asi.20671 MUC-7 (Chinchor, 1997), the named entity recognition JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 58(13):001–017, 2007 ASI5813_0000_20671.qxd 9/12/07 11:41 PM Page 2 implies identifying and categorizing three subareas which online catalogs and bibliographic databases (Borgman & are the recognition of time expressions, numerical expres- Siegfried, 1992; Bouchard & Pouyez, 1980; Bourne, 1977; sions, and entity names—persons, organizations, and places. Rogers & Willett, 1991; Ruiz-Perez, Delgado López-Cózar, E., There are many studies dedicated to the specification of & Jiménez-Contreras, 2002; Siegfried & Bernstein, 1991; the formation rules for names (Baluja, Mittal, & Sukthankar, Strunk, 1991; Tagliacozzo, Kochen, & Rosenberg, 1970; 2000; Bikel, Miller, Schwartz, & Weischedel, 1997; Tao & Cole, 1991; Taylor, 1984; Weintraub, 1991). In gen- Coates-Stephens, 1993; Gaizauskas, Wakao, Humphreys, eral, the techniques for identifying variants are included Cunningham, & Wilks, 1995; Navarro, Baeza-Yates, & under the common heading of authority work (Auld, 1982; Arcoverde, 2003; Paik, Liddy, Yu, & McKenna, 1993; Taylor, 1989; Tillett, 1989). An authority file serves to estab- Patman & Thompson, 2003; Ravin & Wacholder, 1996; lish the correspondences among all the permitted forms of a Wacholder, Ravin, & Byrd, 1994). sequence; that is, among any equivalent forms within a par- Name-matching is the process through which it is deter- ticular bibliographic file. This is particularly important when mined whether two strings of characters previously recognized different bibliographic databases use different authority con- as names actually designate the same person—name-match- ventions, requiring the use of a joint authority file (French, ing does not focus on the case of various individuals who Powell, & Schulman, 2000). have identical name labels. Within this processing, two In retrieving scientific literature and citation indexes, a situations can arise: (a) The matching is exact, in which case variety of formats may be used to cite the same author there is no problem; or (b) the matching is not exact, making (Cronin & Snyder, 1997; Garfield, 1979, 1983a, 1983b; Giles, it necessary to determine the origin of these variants and Bollacker, & Lawrence, 1998; Moed & Vriens, 1989; Rice, apply approximate string matching. Borgman, Bednarski, & Hart, 1989; Sher, Garfield, & Elias, Name-searching designates the process through which a 1966). This inconsistency is present in the databases of the name is used as part of a query to retrieve information asso- Science Citation Index Expanded (SCI-E), Social Science ciated with that sequence in a database. Here, likewise, two Citation Index (SSCI) and Arts & Humanities Citation Index problems can appear: (a) The names are not identified as (A&HCI), produced by the Institute for Scientific Information such in the database registers or in the syntax of the query, in (ISI) and now owned by the Thomson Corporation, the Cite- which case Name-recognition techniques are needed; and Seer Project at the NEC Research Institute, or the product (b) the names are recognized in the database records and in from Elsevier called Scopus. Because citation statistics are the query, but it is not certain whether the recognized names habitually used to assess the quality of scientists’work, world- designate the same person, and so Name-matching tech- wide, many researchers and policy makers depend on the cita- niques must be used. tion statistics to help determine funding. Citation inaccuracies The setting of application of approximate matching tech- cause serious problems in citation statistics. Therefore, niques is not substantially different from that of IR itself. experts in bibliometric analysis should put extra effort into These techniques would be integrated in automatic spelling validating the accuracy of data used in their studies. correction systems and automatic text transformation sys- Many studies focus on automatic spelling correction tems to track down correspondences between the correct or (Accomazzi, Eichhorn, Kurtz, Grant, & Murray, 2000; standard dictionary forms and the items included in the data- Angell, Freund, & Willett, 1983; Blair, 1960; Cucerzan & base records. A further application would be when matching Brill, 2004; Damerau & Mays, 1989; Davidson, 1962; Gadd, query terms with their occurrences in prestored dictionaries 1988; Kukich, 1992; Landau & Vishkin, 1986; Petersen, or indexes. Once again, information retrieval through names 1986; Pollock & Zamora, 1984; Rogers & Willett, 1991; entails two points of inaccuracies. First, the documents Takahashi, Itoh, Amano, & Yamashita, 1990; Ullmann, 1977; might contain one or more terms of the query but with dif- Zobel & Dart, 1995). A number of studies have highlighted ferent spellings, impeding successful retrieval. Second, the the context