
Disambiguation of Proper Names in Text

Nina Wacholder, CRIA, Columbia University, New York, NY 10027, nina@cs.columbia.edu
Yael Ravin, TJ Watson Research Center, IBM, Yorktown Heights, NY 10598, yael@watson.ibm.com
Misook Choi, TJ Watson Research Center, IBM, Yorktown Heights, NY 10598, machoi@watson.ibm.com

Abstract

Identifying the occurrences of proper names in text and the entities they refer to can be a difficult task because of the many-to-many mapping between names and their referents. We analyze the types of ambiguity -- structural and semantic -- that make the discovery of proper names difficult in text, and describe the heuristics used to disambiguate names in Nominator, a fully-implemented module for proper name recognition developed at the IBM T.J. Watson Research Center.

1 Proper Name Identification in Natural Language Processing

Text processing applications, such as machine translation systems, information retrieval systems or natural-language understanding systems, need to identify multi-word expressions that refer to proper names of people, organizations, places, laws and other entities. When encountering Mrs. Candy Hill in input text, for example, a machine translation system should not attempt to look up the translation of candy and hill, but should translate Mrs. to the appropriate personal title in the target language and preserve the rest of the name intact. Similarly, an information retrieval system should not attempt to expand Candy to all of its morphological variants or suggest synonyms (Wacholder et al. 1994).

The need to identify proper names has two aspects: the recognition of known names and the discovery of new names. Since obtaining and maintaining a name database requires significant effort, many applications need to operate in the absence of such a resource. Without a database, names need to be discovered in the text and linked to entities they refer to. Even where name databases exist, text needs to be scanned for new names that are formed when entities, such as countries or commercial companies, are created, or for unknown names which become important when the entities they refer to become topical. This situation is the norm for dynamic applications such as news providing services or Internet information indexing.

The next Section describes the different types of proper name ambiguities we have observed. Section 3 discusses the role of context and world knowledge in their disambiguation; Section 4 describes the process of name discovery as implemented in Nominator, a module for proper name recognition developed at the IBM T.J. Watson Research Center. Sections 5-7 elaborate on Nominator's disambiguation heuristics.

2 The Ambiguity of Proper Names

Name identification requires resolution of a subset of the types of structural and semantic ambiguities encountered in the analysis of nouns and noun phrases (NPs) in natural language processing. Like common nouns ((Jensen and Binot 1987), (Hindle and Rooth 1993) and (Brill and Resnick 1994)), proper names exhibit structural ambiguity in prepositional phrase (PP) attachment and in conjunction scope.

A PP may be attached to the preceding NP and form part of a single large name, as in NP[Midwest Center PP[for NP[Computer Research]]]. Alternatively it may be independent of the preceding NP, as in NP[Carnegie Hall] PP[for NP[Irwin Berlin]], where for separates two distinct names, Carnegie Hall and Irwin Berlin.

As with PP-attachment of common noun phrases, the ambiguity is not always resolved, even in human sentence parsing (cf. the famous example I saw the girl in the park with the telescope). The location of an organization, for instance, could be part of its name (City University of New York) or an attached modifier (The Museum of Modern Art in New York City). Without knowledge of the official name, it is sometimes difficult to determine the exact boundaries of a proper name. Consider examples such as Western Co. of North America, Commodity Exchange in New York and Hebrew University in Jerusalem, Israel.

Proper names contain ambiguous conjoined phrases. The components of Victoria and Albert Museum and IBM and Bell Laboratories look identical; however, and is part of the name of the museum in the first example, but a conjunction joining two computer company names in the second. Although this problem is well known, a search of the computational literature shows that few solutions have been proposed, perhaps because the conjunct ambiguity problem is harder than PP attachment (though see (Agarwal and Boggess 1992) for a method of conjunct identification that relies on syntactic category and semantic label).

Similar structural ambiguity exists with respect to the possessive pronoun, which may indicate a relationship between two names (e.g., Israel's Shimon Peres) or may constitute a component of a single name (e.g., Donoghue's Money Fund Report).

The resolution of structural ambiguity such as PP attachment and conjunction scope is required in order to automatically establish the exact boundaries of proper names. Once these boundaries have been established, there is another type of well-known structural ambiguity, involving the internal structure of the proper name. For example, Professor of Far Eastern Art John Blake is parsed as [[Professor [of Far Eastern Art]] John Blake] whereas Professor Art Klein is [[Professor] Art Klein].

Proper names also display semantic ambiguity. Identification of the type of proper nouns resembles the problem of sense disambiguation for common nouns where, for instance, state taken out of context may refer either to a government body or the condition of a person or entity. A name variant taken out of context may be one of many types, e.g., Ford by itself could be a person (Gerald Ford), an organization (Ford Motors), a make of car (Ford), or a place (Ford, Michigan). Entity-type ambiguity is quite common, as places are named after famous people and companies are named after their owners or locations. In addition, naming conventions are sometimes disregarded by people who enjoy creating novel and unconventional names. A store named Mr. Tall and a woman named April Wednesday (McDonald 1993) come to mind.

Like common nouns, proper nouns exhibit systematic metonymy: United States refers either to a geographical area or to the political body which governs this area; Wall Street Journal refers to the printed object, its content, and the commercial entity that produces it.

In addition, proper names resemble definite noun phrases in that their intended referent may be ambiguous. The man may refer to more than one male individual previously mentioned in the discourse or present in the non-linguistic context; J. Smith may similarly refer to more than one individual named Joseph Smith, John Smith, Jane Smith, etc. Semantic ambiguity of names is very common because of the standard practice of using shorter names to stand for longer ones. Shared knowledge and context are crucial disambiguation factors. Paris usually refers to the capital of France, rather than a city in Texas or the Trojan prince, but in a particular context, such as a discussion of Greek mythology, the presumed referent changes.

Beyond the ambiguities that proper names share with common nouns, some ambiguities are particular to names: noun phrases may be ambiguous between a name reading and a common noun phrase, as in Candy, the person's name, versus candy the food, or The House as an organization versus a house referring to a building. In English, capitalization usually disambiguates the two, though not at sentence beginnings: at the beginning of a sentence, the components and capitalization patterns of New Coke and New Sears are identical; only world knowledge informs us that New Coke is a product and Sears is a company.

Furthermore, capitalization does not always disambiguate names from non-names because what constitutes a name as opposed to a non-name is not always clear. According to (Quirk et al. 1972) names, which consist of proper nouns (classified into personal names like Shakespeare, temporal names like Monday, or geographical names like Australia) have 'unique' reference. Proper nouns differ in their linguistic behavior from common nouns in that they mostly do not take determiners or have a plural form. However, some names do take determiners, as in The New York Times; in this case, they "are perfectly regular in taking the definite article since they are basically premodified count nouns... The difference between an ordinary common noun and an ordinary common noun turned name is that the unique reference of the name has been institutionalized, as is made overt in writing by initial capital letter." Quirk et al.'s description of names seems to indicate that capitalized words like Egyptian (an adjective) or Frenchmen (a noun referring to a set of individuals) are not names. It leaves capitalized sequences like Minimum Alternative Tax, Annual Report, and Chairman undetermined as to whether or not they are names.

All of these ambiguities must be dealt with if proper names are to be identified correctly. In the rest of the paper we describe the resources and heuristics we have designed and implemented in Nominator and the extent to which they resolve these ambiguities.

3 Disambiguation Resources

In general, two types of resources are available for disambiguation: context and world knowledge. Each of these can be exploited along a continuum, from 'cheaper' to computationally and manually more expensive usage. 'Cheaper' models, which include no context or world knowledge, do very little disambiguation. More 'expensive' models, which use full syntactic parsing, discourse models, inference and reasoning, require computational and human re- [...]

[...] no other over-riding information, it may be safe to assume that the string McDonald's refers to an organization.