
Disambiguation of Proper Names in Text

Nina Wacholder
CRIA, Columbia University
New York, NY 10027
nina@cs.columbia.edu

Yael Ravin
TJ Watson Research Center, IBM
Yorktown Heights, NY 10598
yael@watson.ibm.com

Misook Choi
TJ Watson Research Center, IBM
Yorktown Heights, NY 10598
machoi@watson.ibm.com

Abstract topical. This situation is the norm for dynamic ap- plications such as news providing services or Internet Identifying the occurrences of proper information indexing. names in text and the entities they refer to The next Section describes the different types of can be a difficult task because of the many- proper ambiguities we have observed. Sec- to-many mapping between names and their tion 3 discusses the role of context and world knowl- referents. We analyze the types of ambi- edge in their disambiguation; Section 4 describes the guity -- structural and semantic -- that process of name discovery as implemented in Nomi- make the discovery of proper names dif- nator, a module for proper name recognition devel- ficult in text, and describe the heuristics oped at the IBM T.J. Watson Research Center. Sec- used to disambiguate names in Nomina- tions 5-7 elaborate on Nominator's disambiguation tor, a fully-implemented module for proper heuristics. name recognition developed at the IBM T.J. Watson Research Center. 2 The Ambiguity of Proper Names

1 Proper Name Identification in Name identification requires resolution of a subset of the types of structural and semantic ambiguities en- Natural Language Processing countered in the analysis of nouns and noun phrases Text processing applications, such as machine trans- (NPs) in natural language processing. Like common lation systems, information retrieval systems or nouns, ((Jensen and Binot 1987), (Hindle and Rooth natural-language understanding systems, need to 1993) and (Brill and Resnick 1994)), proper names identify multi-word expressions that refer to proper exhibit structural ambiguity in prepositional phrase names of people, organizations, places, laws and (PP) attachment and in conjunction scope. other entities. When encountering Mrs. Candy Hill A PP may be attached to the preceding NP and in input text, for example, a machine translation form part of a single large name, as in NP[Midwest system should not attempt to look up the transla- Center PP[for NP[Computer Research]]]. Alterna- tion of candy and hill, but should translate Mrs. to tively it may be independent of the preceding NP, the appropriate personal in the target language as in NP[Carnegie Hall] PP[for NP[Irwin Berlin]], and preserve the rest of the name intact. Similarly, where for separates two distinct names, Carnegie an information retrieval system should not attempt Hall and Irwin Berlin. to expand Candy to all of its morphological variants As with PP-attachment of common noun phrases, or suggest synonyms (Wacholder et al. 1994). the ambiguity is not always resolved, even in hu- The need to identify proper names has two as- man sentence parsing (cf. the famous example I saw pects: the recognition of known names and the dis- the girl in the park with the telescope). The loca- covery of new names. 
Since obtaining and maintain- tion of an organization, for instance, could be part ing a name database requires significant effort, many of its name (City University of New York) or an applications need to operate in the absence of such attached modifier (The Museum of Modern Art in a resource. Without a database, names need to be New York City). Without knowledge of the official discovered in the text and linked to entities they re- name, it is sometimes difficult to determine the ex- fer to. Even where name databases exist, text needs act boundaries of a proper name. Consider examples to be scanned for new names that are formed when such as Western Co. of , Commod- entities, such as countries or commercial companies, ity Ezchange in New York and Hebrew University in are created, or for unknown names which become Jerusalem, Israel. important when the entities they refer to become Proper names contain ambiguous conjoined

202 phrases. The components of Victoria and Albert stand for longer ones. Shared knowledge and con- Museum and IBM and Bell Laboratories look identi- text are crucial disambiguation factors. Paris, usu- cal; however, and is part of the name of the museum ally refers to the capital of France, rather than a in the first example, but a conjunction joining two city in Texas or the Trojan prince, but in a particu- computer company names in the second. Although lar context, such as a discussion of Greek mythology, this problem is well known, a seazch of the computa- the presumed referent changes. tional literature shows that few solutions have been Beyond the ambiguities that proper names share proposed, perhaps because the conjunct ambiguity with common nouns, some ambiguities are particular problem is harder than PP attachment (though see to names: noun phrases may be ambiguous between (Agarwal and Boggess 1992) for a method of con- a name reading and a common noun phrase, as in junct identification that relies on syntactic category Candy, the person's name, versus candy the food, or and semantic label). The House as an organization versus a house refer- Similar structural ambiguity exists with respect ring to a building. In English, capitalization usually to the possessive pronoun, which may indicate a re- disambiguates the two, though not at sentence be- lationship between two names (e.g., Israel's Shimon ginnings: at the beginning of a sentence, the compo- Peres) or may constitute a component of a single nents and capitalization patterns of New Coke and name (e.g., Donoghue's Money Fund Report). New Sears are identical; only world knowledge in- The resolution of structural ambiguity such as forms us that New Coke is a product and Sears is a PP attachment and conjunction scope is required company. in to automatically establish the exact bound- Furthermore, capitalization does not always dis- aries of proper names. 
Once these boundaries have ambiguate names from non-names because what been established, there is another type of well-known constitutes a name as opposed to a'non-name is structural ambiguity, involving the internal struc- not always clear. According to (Quirk et al. 1972) ture of the proper name. For example, Professor of names, which consist of proper nouns (classified into Far Eastern Art John Blake is parsed as [[Professor personal names like Shakespeare, temporal names [of Fax Eastern Art]] John Blake] whereas Professor like Monday, or geographical names like ) Art Klein is [[Professor] Art Klein]. have 'unique' reference. Proper nouns differ in their Proper names also display semantic ambiguity. linguistic behavior from common nouns in that they Identification of the type of proper nouns resem- mostly do not take determiners or have a plural bles the problem of sense disambiguation for com- form. However, some names do take determiners, mon nouns where, for instance, state taken out of as in The New York Times; in this case, they "are context may refer either to a government body or perfectly regular in taking the definite article since the condition of a person or entity. A name variant they are basically prernodified count nouns... The taken out of context may be one of many types, e.g., difference between an ordinary common noun and Ford by itself could be a person (Gerald Ford), an an ordinary common noun turned name is that the organization (Ford Motors), a make of car (Ford), unique reference of the name has been institution- or a place (Ford, Michigan). Entity-type ambiguity alized, as is made overt in writing by initial capital is quite common, as places are named after famous letter." Quirk et al.'s description of names seems to people and companies are named after their owners indicate that capitalized words like Egyptian (an ad- or locations. 
In addition, naming conventions are jective) or Frenchmen (a noun referring to a set of sometimes disregarded by people who enjoy creating individuals) are not names. It leaves capitalized se- novel and unconventional names. A store named Mr. quences like Minimum Alternative Taz, Annual Re- Tall and a woman named April Wednesday (McDon- port, and Chairman undetermined as to whether or ald 1993) come to mind. not they are names. Like common nouns, proper nouns exhibit system- All of these ambiguities must be dealt with if atic : refers either to a geo- proper names are to be identified correctly. In graphical area or to the political body which governs the rest of the paper we describe the resources this area; Wall Street Journal refers to the printed and heuristics we have designed and implemented object, its content, and the commercial entity that in Nominator and the extent to which they resolve produces it. these ambiguities. In addition, proper names resemble definite noun phrases in that their intended referent may be am- 3 Disambiguation Resources biguous. The man may refer to more than one male individual previously mentioned in the discourse or In general, two types of resources are available for present in the non-linguistic context; J. Smith may disambiguation: context and world knowledge. Each similarly refer to more than one individual named of these can be exploited along a continuum, from Joseph Smith, John Smith, Jane Smith, etc. Se- 'cheaper' to computationally and manually more ex- mantic ambiguity of names is very common because pensive usage. 'Cheaper' models, which include of the standard practice of using shorter names to no context or world knowledge, do very little dis- 203 ambiguation. More 'expensive' models, which use no other over-riding information, it may be safe to full syntactic parsing, discourse models, inference assume that the string McDonald's refers to an or- and reasoning, require computational and human re- ganization. 
But even if an existing database is reli- sources that may not always be available, as when able, names that are not yet in it must be discovered massive amounts of text have to be rapidly processed and information in the database must be over-ridden on a regular basis. In addition, given the current when appropriate. For example, if a new name such state of the art, full parsing and extensive world as IBM Credit Corp. occurs in the text but not in knowledge would still not yield complete automatic the database, while IBM exists in the database, au- ambiguity resolution. tomatic identification of IBM should be blocked in In designing Nominator, we have tried to achieve a favor of the new name IBM Credi~ Corp. balance between high accuracy and speed by adopt- If a name database exists, Nominator can take ing a model which uses minimal context and world advantage of it. However, our goal has been to de- knowledge. Nominator uses no syntactic contextual sign Nominator to function optimally in the absence information. It applies a set of heuristics to a list of such a resource. In this case, Nominator con- of (multi-word) strings, based on patterns of capi- sults a small authority file which contains informa- talization, punctuation and location within the sen- tion on about 3000 special 'name words' and their tence and the document. This design choice differ- relevant lexical features. Listed are personal entiates our approach from that of several similar (e.g., Mr., King), organizational (includ- projects. Most proper name recognizers that have ing strong identifiers such as Inc. and weaker do- been reported on in print either take as input text main identifiers such as Arts) and names of large tagged by part-of-speech (e.g., the systems of (Paik places (e.g., Los Angeles, California, but not Scars- et al. 1993) and (Mani et al. 1993)) or perform syn- dale, N.Y.). 
Also listed are exception words, such tactic and/or morphological analysis on all words, as upper-case lexical items that are unlikely to be including capitalized ones, that are part of candi- single-word proper names (e.g., Very, I or TV) and date proper names (e.g., (Coates-Stephens 1993) and lower-case lexical items (e.g., and and van) that can (McDonald 1993)). Several (e.g., (McDonald 1993), be parts of proper names. In addition, the authority (Mani et al. 1993), (Paik et al. 1993) and (Cowie file contains about 20,000 first names. et al. 1992)) look in the local context of the candi- Our choice of disambiguation resources makes date proper name for external information such as Nominator fast and robust. The precision and re- appositives (e.g., in a sequence such as Robin Clark, call of Nominator, operating without a database of presiden~ of Clark Co.) or for human-subject verbs pre-existing proper names, is in the 90's while the (e.g., say, plan) in order to determine the category processing rate is over 40Mg of text per hour on a of the candidate proper name. Nominator does not RISC/6000 machine. (See (Ravin and Wacholder use this type of external context. 1996) for details.) This efficient processing has been Instead, Nominator makes use of a different kind achieved at the cost of limiting the extent to which of contextual information -- proper names co- the program can 'understand' the text being ana- occuring in. the document. It is a fairly standard lyzed and resolve potential ambiguity. Many word- convention in an edited document for one of the first sequences that are easily recognized by human read- references to an entity (excluding a reference in the ers as names are ambiguous for Nominator, given the title) to include a relatively full form of its name. restricted set of tools available to it. 
In cases where In a kind of discourse anaphora, other references to Nominator cannot resolve an ambiguity with rela- the entity take the form of shorter, more ambiguous tively high confidence, we follow the principle that variants. Nominator identifies the referent of the full 'noisy information' is to be preferred to data omit- form (see below) and then takes advantage of the ted, so that no information is lost. In ambiguous discourse context provided by the list of names to cases, the module is designed to make conservative associate shorter more ambiguous name occurrences decisions, such as including non-names or non-name with their intended referents. parts in otherwise valid name sequences. It assigns In terms of world knowledge, the most obvious re- weak types such as ?HUMAN or fails to assign a source is a database of known names. In fact, this is type if the available information is not sufficient. what many commercially available name identifica- tion applications use (e.g., Hayes 1994). A reliable 4 The Name Discovery Process database provides both accuracy and efficiency, if fast look-up methods are incorporated. A database In this section, we give an overview of the process also has the potential to resolve structural ambigu- by which Nominator identifies and classifies proper ity; for example, if IBM and Apple Computers are names. Nominator's first step is to build a list of listed individually in the database but IBM and Ap- candidate names for a document. Next, 'splitting' ple Computers is not, it may indicate a conjunction heuristics are applied to all candidate names for the of two distinct names. A database may also con- purpose of breaking up complex names into smaller tain default world knowledge information: e.g., with ones. Finally, Nominator groups together name vari- 204 ants that refer to the same entity. 
After information categorized by an entity type and assigned a 'canon- about names and their referents has been extracted ical name' as its . The canonical name is from individual documents, an aggregation process the fullest, least ambiguous label that can be used combines the names collected from all the documents to refer to the entity. It may be one of the variants into a dictionary, or database of names, representa- found in the document or it may be constructed from tive of the document collection. (For more details components of different ones As the links are formed, on the process, see (Ravin and Wacholder 1996)). each group is assigned a type. In the sample output We illustrate the process of name discovery with shown below, each canonical name is followed by its an excerpt taken from a Wall Street Journal article entity type and by the variants linked to it. in the TIPSTER CD-ROM collection (NIST 1993). Paragraph breaks are omitted to conserve space. American Bar Association (ORG) : ABA Steptoe & Johnson (ORG) ... The professional conduct of lawyers in other Washington (PLACE) jurisdictions is guided by American Bar Association Dubuque (PLACE) rules or by state bar ethics codes, none of which Robert Jordan (PERSON) : Mr. Jordan permit non-lawyers to be partners in law firms. The ABA has steadfastly reserved the title of partner and After the whole document collection has been partnership perks (which include getting a stake of processed, linked groups are merged across docu- the firm's profit) for those with law degrees. But ments and their variants combined. Thus, if in Robert Jordan, a partner at Steptoe & Johnson who one document President Clinton was a variant of took the lead in drafting the new district bar code, William Clinton, while in another document Gover- said the ABA's rules were viewed as "too restrictive" nor Clinton was a variant of William Clinton, both by lawyers here. 
"The practice of law in Washing- are treated as variants of an aggregated William ton is very different from what it is in Dubuque," Clinton group. In this minimal sense, Nominator he said .... Some of these non-lawyer employees are uses the larger context of the document collection paid at partners' levels. Yet, not having the part- to 'learn' more variants for a . ner title "makes non-lawyers working in law firms In the following sections we describe how ambigu- second-class citizens," said Mr. Jordan of Steptoe & ity is resolved as part of the name discovery process. Johnson .... 5 Resolution of Structural Before the text is processed by Nominator, it is Ambiguity analyzed into tokens -- sentences, words, tags, and punctuation elements. Nominator forms a candidate We identify three indicators of potential structural name list by scanning the tokenized document and ambiguity, prepositions, conj unctions and possessive collecting sequences of capitalized tokens (or words) pronouns, which we refer to as 'ambiguous oper- as well as some special lower-case tokens, such as ators'. In order to determine whether 'splitting' conjunctions and prepositions. should occur, a name sequence containing an am- The list of candidate names extracted from the biguous operator is divided into three segments -- sample document contains: the operator, the substring to its left and the sub- string to its right. The splitting process applies a American Bar Association Robert Jordan set of heuristics based on patterns of capitalization, Steptoe &= Johnson lexical features and the relative 'scope' of operators ABA (see below) to name sequences containing these op- erators to determine whether or not they should be Washington Dubuque split into smaller names. We can describe the splitting heuristics as deter- Mr. 
Jordan of Steptoe & Johnson mining the scope of ambiguous operators, by analogy Each candidate name is examined for the presence to the standard linguistic treatment of quantifiers. of conjunctions, prepositions or possessive 's. A set From Nominator's point of view, all three operator of heuristics is applied to determine whether each types behave in similar ways and often interact when candidate name should be split into smaller inde- they co-occur in the same name sequence, as in New pendent names. For example, Mr. Jordan of Steptoe York's MOMA and the Victoria and Albert Museum Johnson is split into Mr. Jordan and Steptoe 8J in London. Johnson. The scope of ambiguous operators also interacts Finally, Nominator links together variants that with the 'scope' of NP-heads, if we define the scope refer to the same entity. Because of standard of NP-heads as the constituents they dominate. For English-language naming conventions, Mr. Jordan example, in Victoria and Albert Museum, the con- is grouped with Robert Jordan. ABA is grouped junction is within the scope of the lexical head with American Bar Association as a possible abbre- Museum because Museum is a noun that can take viation of the longer name. Each linked group is PP modification (Museum of Natural History) and 205 hence pre-modification (Natural History Museum). nications and Houston Industries Inc. or Dallas's Since pre-modifiers can contain conj unctions (Japan- MCorp and First RepublicBank and Houston's First ese Painting and Printing Museum), the conjunction City Bancorp. of Tezas. is within the scope of the noun, and so the name is not split. 
Although the same relationship holds 6 Resolution of Ambiguity at between the lexical head Laboratories and the con- Sentence Beginnings junction and in IBM and Bell Laboratories, another heuristic takes precedence, one whose condition re- Special treatment is required for words in sentence- quires splitting a string if it contains an initial position, which may be capitalized because immediately to the left or to the right of the am- they are part of a proper name or simply because biguous operator. they are sentence initial. It is not possible to determine relative scope While the heuristics for splitting names are lin- strength for all the combinations of different opera- guistically motivated and rule-governed, the heuris- tors. Contradictory examples abound: Gates of Mi- tics for handling sentence-initial names are based on crosoft and Gerstner of IBM suggests stronger scope patterns of word occurrence in the document. When of and over ok The Department of German Lan- all the names have been collected and split, names guages and Literature suggests the opposite. Since containing sentence-initial words are compared to it is usually the case that a right-hand operator other names on the list. If the sentence-initial candi- has stronger scope over a left-hand one, we evalu- date name also occurs as a non-sentence-initial name ate strings containing operators from right to left. or as a substring of it, the candidate name is as- To illustrate, New York's MOMA and the Victoria sumed to be valid and is retained. Otherwise, it is and Albert Museum in London is first evaluated for removed from the list. For example, if White occurs splitting on in. Since the left and right substrings at sentence-initiai position and also as a substring do not satisfy any conditions, we proceed to the next of another name (e.g., Mr. White) it is kept. If it operator on the left -- and. 
Because of the strong is found only in sentence-initial position (e.g., White scope of Museum, as mentioned above, no splitting paint is ...), White is discarded. occurs. Next, the second and from the right is eval- A more difficult situation arises when a sentence- uated. It causes a split because it is immediately initial candidate name contains a valid name that preceded by an all-capitalized word. We have found begins at the second word of the string. If the pre- this simple typographical heuristic to be powerful ceding word is an adverb, a pronoun, a verb or a and surprisingly accurate. preposition, it can safely be discarded. Thus a sen- Ambiguous operators form recursive structures tence beginning with Yesterday Columbia yields Co- and so the splitting heuristics apply recursively to lumbia as a name. But cases involving other parts name sequences until no more splitting conditions of speech remain unresolved. If they are sentence- hold. New York's MOMA is further split at's be- initial, Nominator accepts as names both New Sears cause of a heuristic that checks for place names on and New Coke; it also accepts sentence-initial Five the left of a possessive pronoun or a comma. Victo- Reagan as a variant of President Reagan, if the two ria and Albert Museum in London remains intact. co-occur in a document. Nominator's other heuristics resemble those dis- cussed above in that they check for typographical 7 Resolution of Semantic Ambiguity patterns or for the presence of particular name types to the left or right of certain operators. Some heuris- In a typical document, a single entity may be re- tics weigh the relative scope strength in the sub- ferred to by many name variants which differ in their strings on either side of the operator. If the scope degree of potential ambiguity. As noted above, Paris strength is similar, the string is split. We have ob- and Washington are highly ambiguous out of con- served that this type of heuristic works quite well. 
text but in well edited text they are often disam- Thus, the string The Natural History Museum and biguated by the occurrence of a single unambiguous The Board of Education is split at and because each variant in the same document. Thus, Washington is of its substrings contains a strong-scope NP-head (as likely to co-occur with either President Washington we define it) with modifiers within its scope. These or Washington, D.C., but not with both. Indeed, we two substrings are better balanced than the sub- have observed that if several unambiguous variants strings of The Food and Drug Administration where do co-occur, as in documents that mention both the the left substring does not contain a strong-scope owner of a company and the company named after NP-head while the right one does (Administration). the owner, the editors refrain from using a variant Because of the principle that noisy data is prefer- that is ambiguous with respect to both. able to loss of information, Nominator does not split To disambiguate highly ambiguous variants then, names if relative strength cannot be determined. As we link them to unambiguous ones occurring within a result, there occur in Nominator's output certain the same document. Nominator cycles through the 'names' such as American Television ~ Commu- list of names, identifying 'anchors', or variant names 206 that unambiguously refer to certain entity types. Further disambiguation may be possible during When an anchor is identified, the list of name candi- aggregation across documents. As mentioned be- dates is scanned for ambiguous variants that could fore, during aggregation, linked groups from differ- refer to the same entity. They are linked to the an- ent documents are merged if their canonical forms chor. are identical. As a rule, their entity types should Our measure of ambiguity is very pragmatic. 
It is be identical as well, to prevent a merge of Boston based on the confidence scores yielded by heuristics (PLACE) and Boston (ORG). Weak entity types, that analyze a name and determine the entity types however, are allowed to merge with stronger entity it can refer to. If the heuristic for a certain entity types. Thus, Jordan Hills (?PERSON) from one type (a person, for example) results in a high con- document is aggregated with Jordan Hills (PER- difence score (highly confident that this is a person SON) from another, where there was sufficient evi- name), we determine that the name unambiguously dence, such as Mr. Hills, to make a firmer decision. refers to this type. Otherwise, we choose the highest score obtained by the various heuristics. A few simple indicators can unambiguously deter- 8 Evaluation mine the entity type of a name, such as Mr. for a person or Inc. for an organization. More commonly, however, several pieces of positive and negative evi- An evaluation of an earlier version of Nominator, dence are accumulated in order to make this judge- was performed on 88 Wall Street Journal documents ment. (NIST 1993) that had been set aside for testing. We We have defined a set of obligatory and optional chose the Wall Street Journal corpus because it fol- components for each entity type. For a human name, lows standard stylistic conventions, especially capi- these components include a professional title (e.g., talization, which is essential for Nominator to work. Attorney General), a personal title (e.g., Dr.), a first Nominator's performance deteriorates if other con- name, middle name, , last name, and suffix ventions are not consistently followed. (e.g., Jr.). The combination of the various compo- A linguist manually identified 2426 occurrences nents is inspected. Some combinations may result in of proper names, which reduced to 1354 unique to- a high negative score -- highly confident that this kens. 
cannot be a person name. For example, if the name lacks a personal title and a first name, and its last name is listed as an organization word (e.g., Department) in the authority list, it receives a high negative score. This is the case with Justice Department or Frank Sinatra Building. The same combination but with a last name that is not a listed organization word results in a low positive score, as for Justice Johnson or Frank Sinatra. The presence or absence of a personal title is also important for determining confidence: if present, the result is a high confidence score (e.g., Mrs. Ruth Lake); no personal title with a known first name results in a low positive confidence score (e.g., Ruth Lake, Beverly Hills); and no personal title with an unknown first name results in a zero score (e.g., Panorama Lake).

By the end of the analysis process, Justice Department has a high negative score for person and a low positive score for organization, resulting in its classification as an organization. Beverly Hills, by contrast, has low positive scores both for place and for person. Names with low or zero scores are first tested as possible variants of names with high positive scores. However, if they are incompatible with any, they are assigned a weak entity type. Thus in the absence of any other evidence in the document, Beverly Hills is classified as a ?PERSON. (?PERSON is preferred over ?PLACE as it tends to be the correct choice most of the time.) This analysis can, of course, be overridden by a name database listing Beverly Hills as a place.

Of these, Nominator correctly identified the boundaries of 91% (1230/1354). The precision rate was 92% for the 1409 names Nominator identified (1230/1409). In terms of semantic disambiguation, Nominator failed to assign an entity type to 21% of the names it identified. This high percentage is due to a decision not to assign a type if the confidence measure is too low. The payoff of this choice is a very high precision rate, 99%, for the assignment of semantic type to those names that were disambiguated. (See (Ravin and Wacholder 1996) for details.)

The main reason that names remain untyped is insufficient evidence in the document. If IBM, for example, occurs in a document without International Business Machines, Nominator does not type it; rather, it lets later processes inspect the local context for further clues. These processes form part of the Talent tool set under development at the T.J. Watson Research Center. They take as their input text processed by Nominator and further disambiguate untyped names appearing in certain contexts, such as an appositive, e.g., president of CitiBank Corp.

Other untyped names, such as Star Bellied Sneetches or George Melloan's Business World, are neither people, places, organizations nor any of the other legal or financial entities we categorize into. Many of these uncategorized names are titles of articles, books and other works of art that we currently do not handle.

9 Conclusion

Ambiguity remains one of the main challenges in the processing of natural language text. Efforts to resolve it have traditionally focussed on the development of full-coverage parsers, extensive lexicons, and vast repositories of world knowledge. For some natural-language applications, the tremendous effort involved in developing these tools is still required, but in other applications, such as information extraction, there has been a recent trend towards favoring minimal parsing and shallow knowledge (Cowie and Lehnert 1996). In its minimal use of resources, Nominator follows this trend: it relies on no syntactic information and on a small semantic lexicon, an authority list which could easily be modified to include information about new domains. Other advantages of using limited resources are robustness and execution speed, which are important in processing large amounts of text.

In another sense, however, development of a module like Nominator still requires considerable human effort to discover reliable heuristics, particularly when only minimal information is used. These heuristics are somewhat domain dependent: different generalizations hold for names of drugs and chemicals than those identified for names of people or organizations. In addition, as the heuristics depend on linguistic conventions, they are language dependent, and need updating when stylistic conventions change. Note, for example, the recent popularity of software names which include exclamation points as part of the name. Because of these difficulties, we believe that for the foreseeable future, practical applications to discover new names in text will continue to require the sort of human effort invested in Nominator.

References

Agarwal R. and L. Boggess, 1992. A simple but useful approach to conjunct identification. In Proceedings of the 30th Annual Meeting of the ACL, pp.15-21, Newark, Delaware, June.

Brill E. and P. Resnick, 1994. A rule-based approach to prepositional phrase disambiguation. URL: http://xxx.lanl.gov/list/cmp.lg/9410026.

Coates-Stephens S., 1993. The analysis and acquisition of proper names for the understanding of free text. In Computers and the Humanities, Vol.26, pp.441-456.

Cowie J. and W. Lehnert, 1996. Information Extraction. In Communications of the ACM, Vol.39(1), pp.83-92.

Cowie J., L. Guthrie, Y. Wilks, J. Pustejovsky and S. Waterman, 1992. Description of the Solomon System as used for MUC-4. In Proceedings of the Fourth Message Understanding Conference, pp.223-232.

Hayes P., 1994. NameFinder: Software that finds names in text. In Proceedings of RIAO 94, pp.762-774, New York, October.

Hindle D. and M. Rooth, 1993. Structural ambiguity and lexical relations. In Computational Linguistics, Vol.19, 1, pp.103-119.

Jensen K. and J-L. Binot, 1987. Disambiguating prepositional phrase attachments by using on-line definitions. In Computational Linguistics, Vol.13, 3-4, pp.251-260.

Mani I., T.R. Macmillan, S. Luperfoy, E.P. Lusher, and S.J. Laskowski, 1993. Identifying unknown proper names in newswire text. In B. Boguraev and J. Pustejovsky, eds., Corpus Processing for Lexical Acquisition, pp.41-54, MIT Press, Cambridge, Mass.

McDonald D.D., 1993. Internal and external evidence in the identification and semantic categorization of proper names. In B. Boguraev and J. Pustejovsky, eds., Corpus Processing for Lexical Acquisition, pp.61-76, MIT Press, Cambridge, Mass.

NIST, 1993. TIPSTER Information-Retrieval Text Research Collection, on CD-ROM, published by The National Institute of Standards and Technology, Gaithersburg, Maryland.

Paik W., E.D. Liddy, E. Yu, and M. McKenna, 1993. Categorizing and standardizing proper nouns for efficient information retrieval. In B. Boguraev and J. Pustejovsky, eds., Corpus Processing for Lexical Acquisition, pp.44-54, MIT Press, Cambridge, Mass.

Quirk R., S. Greenbaum, G. Leech and J. Svartvik, 1972. A Grammar of Contemporary English, Longman House, Harlow, U.K.

Ravin Y. and N. Wacholder, 1996. Extracting Names from Natural-Language Text, IBM Research Report 20338.

Wacholder N., Y. Ravin and R.J. Byrd, 1994. Retrieving information from full text using linguistic knowledge. In Proceedings of the Fifteenth National Online Meeting, pp.441-447, New York, May.
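Illustrative sketch. The person-name confidence scoring described in the text (personal title, known first name, organization word in final position) and the preference for ?PERSON over ?PLACE might be approximated as below. This is a minimal sketch under stated assumptions, not Nominator's implementation: the word lists, the numeric score values, and the function names `person_score` and `weak_type` are all hypothetical, the organization-word check is simplified to apply to any name, and the step that first tests low-scoring names as variants of high-scoring names is omitted.

```python
from typing import Dict, Optional

# Hypothetical fragments of an authority list; the paper does not give the real lists.
PERSONAL_TITLES = {"Mrs.", "Mr.", "Dr."}
FIRST_NAMES = {"Ruth", "Beverly", "Frank"}
ORGANIZATION_WORDS = {"Department", "Building"}

# Illustrative score values; only their ordering matters here.
HIGH, LOW, ZERO, HIGH_NEGATIVE = 2, 1, 0, -2


def person_score(name: str) -> int:
    """Return a confidence score that `name` refers to a person."""
    words = name.split()
    if not words:
        return ZERO
    if words[-1] in ORGANIZATION_WORDS:
        return HIGH_NEGATIVE      # e.g. Justice Department
    if words[0] in PERSONAL_TITLES:
        return HIGH               # e.g. Mrs. Ruth Lake
    if words[0] in FIRST_NAMES:
        return LOW                # e.g. Ruth Lake, Beverly Hills
    return ZERO                   # e.g. Panorama Lake


def weak_type(scores: Dict[str, int]) -> Optional[str]:
    """Pick a weak entity type among types with a low positive score.

    PERSON is preferred when it is one of the candidates, mirroring the
    preference for ?PERSON over ?PLACE; returns None if no type qualifies.
    """
    candidates = [etype for etype, score in scores.items() if score == LOW]
    if not candidates:
        return None
    if "PERSON" in candidates:
        return "?PERSON"
    return "?" + candidates[0]
```

For example, `weak_type({"PLACE": LOW, "PERSON": person_score("Beverly Hills")})` yields the weak type `?PERSON`, which a name database listing Beverly Hills as a place could then override.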
