Using an On-Line Dictionary to Extract a List of Sense-Disambiguated Synonyms
Total Page:16
File Type:pdf, Size:1020Kb
Using an On-line Dictionary to Extract a List of Sense-Disambiguated Synonyms G. Jan Wilms Computer Science, Mississippi State University [email protected] [email protected] POBox 1715, Mississippi State, MS 39762 ABSTRACT and riot). Chodorow used the synonyms and hypernyms in the on-line version of The New Collins Thesaurus to The feasibility of extracting both explicit and implicit compute the semantic distance between any two words synonym references from a machine readable dictionary (using a process he calls sprouting), but found that even is investigated; the extracted synonyms, both symmetric after sense-disambiguating the thesaurus many words and asymmetric, are then sense-disambiguated. At the remained spuriously related because of "poor sense same time lemma numbers and unbound parts-of-speech separation" in the thesaurus (CHODOROW 88, p. 149). of synonyms become instantiated. The dictionary source This paper argues that a dictionary has better sense is also a resource for parsing the definitions, but its separation than a thesaurus. comprehensiveness is often a mixed blessing as a disambiguation tool. The extracted taxonomy of related words will be used in an ongoing project at Mississippi State University, that is INTRODUCTION attempting to automate, in a domain-independent way, the extraction of knowledge contained in a machine Most computational lexicons used in natural language readable technical corpus into an object-oriented understanding systems have been manually constructed, knowledge base (HODGES et al. 91). An important a painstaking effort that is error-prone, labor intensive, research issue involves checking for redundancy; i.e. and usually is attempted only to create a bare-bone when more than one name is used to refer to a single real- lexicon that is sufficient for the immediate operation of world object (e.g. hemorrhage and bleeding), they should the program when applied to a restricted domain of be mapped to only one knowledgebase object. Another interest. To overcome this bottleneck many researchers issue is the automatic bootstrapping of a semantic lexicon have turned to machine readable resources as a possible for the domain text; one strategy for dealing with source for automating the acquisition of a semantic unknown words is to check for semantically related lexicon; various taxonomies of semantic relations have entries that already exist in the lexicon. been extracted (e.g. AMSLER 81, CALZOLARI 84, BYRD et al. 87, VERONIS and IDE 90). FUNK AND WAGNALLS STANDARD DESK DICTIONARY This paper investigates the feasibility of automating the In the preface to the F&W dictionary, synonyms are extraction of a list of sense-disambiguated synonyms defined in terms of verbal equivalences: "what word or from a machine readable dictionary (the Funk and phrase or circumlocution can serve as an equivalent for Wagnalls (F&W) Dictionary). Since most words are the word defined" (p xxii). The test for synonymy is polysemous, identifying what sense a synonym is used in given as interchangeability (in certain contexts), and the is essential to avoid relating words based on orthographic entry for synonym in the dictionary defines it as "a word in stead of semantic similarities (e.g. disorder is a syn- having the same or almost the same meaning as some onym of sickness, but only in the sense of ailment, not in other: opposed to antonym". Synonyms are explicitly the other senses of confusion, or tumult marked with the keyword Syn. Sometimes there is a pointer to a particular sense of the lemma being defined (e.g. Alien1 : adj. 1. Owing allegiance to another country; unnaturalized; This work was supported in part by the NSF grant foreign. 2. Of or related to aliens. 3. Not one's own; strange. 4. Not number IRI-9002135. I would like to thank Lois Boggess consistent with; incongruous; opposed: with to. — n. 1. An unnaturalized for her helpful comments. foreign resident. 2. A member of a foreign nation, tribe, people, etc. 3. One estranged or excluded. — Syn. (adj.) 4. extrinsic, extraneous, irrelevant. The superscript 1 next to the headword (the lemma being defined) is a lemma number; see below). Number of synonyms recovered % of total Explicit References 898 2.3 'Also' 925 2.4 Collateral Adjectives 95 0.3 One-word definitions 30,720 79.7 Conjoined one-word definitions 5,526 14.3 'Compare' 84 0.2 Synonym description 310 0.8 38,558 Table 1 Source of extracted synonyms Explicit references account for 898 synonym entries, lost in the process of making it machine readable: using and represent only a small percentage of the dictionary's a scanner and optical-character-recognition software, the potential (see table 1). Another source of synonyms are book was stored in plain ascii format (to make matters the fields flagged with also (e.g. Abomasum1 : The fourth or true worse, the scanning process introduced some errors of its digestive stomach of a ruminant: also called reed. Also ab.o.ma.sus). own, especially damaging when they involve critical flags The example shows that care must be taken to distinguish like sense numbers; e.g. the letter 'l' in stead of number 1, those cases where also introduces a spelling variant, and alphabetic 'Oh' rather than zero in 1O. See WILMS which can be considered a synonym only in a very 90). specialized sense. The also field contributes 2.4 % of the total synonyms extracted (table 1). If only the above relations were used, all of which are explicitly flagged in the dictionary, the extracted lists Another minor source (84 instances) is the compare would compare very poorly with the wealth of informa- field, aimed at the human dictionary browser. It points to tion found in a thesaurus. However, many more can be words that are not quite synonyms, but that are found by looking at the actual definitions of each lemma. semantically related (e.g. Collage1 : n. An artistic composition Definitions in the F&W, like most dictionaries, follow the consisting of or including flat materials pasted on a picture surface; þ Aristotelian principle of genus (supertype) and differen- Compare ASSEMBLAGE (def. 4)). tiae (necessary and sufficient conditions that separate it). A working hypothesis adopted in this paper is that A unique feature of the F&W dictionary are its collateral definitions consisting only of a genus can be treated as adjectives, which yield a small (0.3 %) but important set synonyms. Practically, this means that one-word defini- of 'synonyms'. They are flagged by and indicate tions can be treated as a synonym. And whereas one- adjectival forms of a noun entry "so remote in spelling word definitions are bad lexicographic practice that F&W that they may not be brought to mind by the noun" claims never to be guilty of (preface, p. 6a), many defini- (preface, p 7a) (e.g. Arm1 : n. 1. Anat. An upper limb of the human tions are actually multi-part explanations, some portions body, from the shoulder to the hand or wrist. Collateral adjective: of which are single words (e.g. Call1 : v.t. 1. To say in a loud brachial). Their importance comes from the fact that the voice; shout; proclaim. 2. To summon. 3. To convoke; convene: to call synonym crosses part-of-speech (POS) boundaries while a meeting. 4. To invoke solemnly). Notice that sense number two remaining closely semantically related to headword. is an example of something F&W claims never to do. Some liberty is taken in specifying single-word INCREASE COVERAGE definitions: fillers like a(n), the, to, any are removed to From the above discussion it is clear that the dictionary leave true single genuses. has a quasi-formal structure with specific fields for each entry, which makes the extraction of information from it Thus at the small cost of some extra parsing overhead, the easier than if it were unrestricted free-format text, as for number of synonyms is enormously increased (30,720 example a machine readable technical textbook. Building entries, or 79.7% of the total; see table 1). The following a grammar and writing a parser to convert the machine strategies, with varying degrees of parsing sophistication, readable dictionary into a computerized dictionary is no also contribute to increase coverage: trivial task, however, as many of the clues to interpret its structure are designed with a human reader in mind (the 1. Ignore differentiae that are not so 'necessary and visual layout, for example, is important). The task of sufficient', and contribute little. For instance, phrases bracketing the fields in the F&W dictionary was made introduced by as; their purpose is to give a (prototypical) even more difficult by the fact that most of the example that is not meant to be exclusive (e.g. Collect1 : þ 6. typographical clues (italics, bold, superscript, etc) were To accumulate, as sand or dust). Sometimes the purpose of including them seems to be a justification on the part of candidates (e.g. Allied1 : adj. 1. United, confederated, or leagued) is lexicographer for isolating and adding a separate sense fairly fool-proof. But if one side of the conjunction (e.g. Cahoots1 : n. pl U.S. Slang Affiliation; partnership, as in the phrase consists of two (or more) words (templates like X or Y Z in cahoots). Similarly, parenthesized expressions some- and Z Y or X, assuming Y Z or Z Y isn't a compound), this times are illustrative objects (e.g. Abstract1 : v.t. þ 3. To can cause problems for synonym candidate X because Z withdraw or disengage (the attention, interest, etc.)), extra (non- may have to be distributed, in which case we are no 'necessary') information (e.g. A:1 þ 4.