
A TOOL KIT FOR LEXICON BUILDING

Thomas E. Ahlswede
Computer Science Department
Illinois Institute of Technology
Chicago, Illinois 60616, USA

ABSTRACT

This paper describes a set of interactive routines that can be used to create, maintain, and update a computer lexicon. The routines are available to the user as a set of commands resembling a simple operating system. The lexicon produced by this system is based on lexical-semantic relations, but is compatible with a variety of other models of lexicon structure. The lexicon builder is suitable for the generation of moderate-sized vocabularies and has been used to construct a lexicon for a small medical expert system. A future version of the lexicon builder will create a much larger lexicon by parsing definitions from machine-readable dictionaries.

INTRODUCTION

Natural language processing systems need much larger lexicons than those available today. Furthermore, a good computer lexicon with semantic as well as syntactic information is elaborate and hard to construct. We have created a program which enables its user to build and extend a lexicon interactively. The program sets up a user environment similar to a simple interactive operating system; in this environment lexical entries can be produced through a small set of commands, combined with prompts specified by the user for the desired kind of lexicon.

The interactive lexicon builder is being used to help construct entries for a lexicon to be used to parse and generate stroke case reports. Many terms in this medical sublanguage either do not appear in standard dictionaries or are used in the sublanguage with special meanings. The design of the lexicon builder is intended to be general enough to make it useful for others building lexicons for large natural language processing systems involving different sublanguages.

The interactive lexicon builder will be the basis for a fully automatic lexicon builder which uses Sager's Linguistic String Parser (LSP) to parse machine-readable text into a relational network based on a modified version of Werner's MTQ (Modification-Taxonomy-Queueing) schema. Initially this program will be applied to Webster's Seventh Collegiate Dictionary (W7) and the Longman Dictionary of Contemporary English (LDOCE), both of which are available in machine-readable form.

LEXICAL-SEMANTIC RELATIONS

The semantic component of the lexicon produced by this system consists principally of a network of lexical-semantic relations. That is, the meaning of a word in the lexicon is indicated as far as possible by its relationships with other words. These relations often have semantic content themselves and thus contribute to the definition of the words they link.

The two most familiar such relations are synonymy and antonymy, but others are interesting and important. For instance, to take an example from the vocabulary of stroke reports, the carotid is a kind of artery and an artery is a kind of blood vessel. This "is a kind of" relation is taxonomy. We express the taxonomic relations of "carotid", "artery" and "blood vessel" with the relational arcs

    carotid T artery
    artery T blood vessel

Another important relation is that of the part to the whole:

    ventricle PART heart
    Broca's area PART brain

Note that taxonomy is transitive: if the carotid is an artery and an artery is a blood vessel, then the carotid is a blood vessel. The presence or absence of the properties of transitivity, reflexivity and symmetry is important in using relations to make inferences.
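As a minimal sketch of how the transitivity of taxonomy supports inference, the arcs above can be stored as (word1, relation, word2) triples and followed from one word to the next. The triple representation and the function name `ancestors` are assumptions made for this illustration in Python; they are not taken from the lexicon builder itself.

```python
# Illustrative only: relational arcs as (word1, relation, word2) triples.
ARCS = [
    ("carotid", "T", "artery"),
    ("artery", "T", "blood vessel"),
    ("ventricle", "PART", "heart"),
    ("Broca's area", "PART", "brain"),
]

def ancestors(word, relation, arcs):
    """Follow `relation` arcs transitively, starting from `word`.

    Only meaningful for relations assumed to be transitive, such as
    taxonomy (T); it should not be applied blindly to PART.
    """
    found = []
    frontier = [word]
    while frontier:
        current = frontier.pop()
        for w1, rel, w2 in arcs:
            if w1 == current and rel == relation and w2 not in found:
                found.append(w2)
                frontier.append(w2)
    return found

print(ancestors("carotid", "T", ARCS))  # ['artery', 'blood vessel']
```

The relation name is passed as a parameter so that the caller decides which relations may be chained; whether a given relation is transitive is a property that has to be recorded separately.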

The part-whole relation is more complicated than taxonomy in its properties; some instances of it are transitive and others are not. From this and other criteria, Iris et al. (forthcoming) distinguish four different part-whole relations.

Taxonomy and part-whole are very common relations, by no means restricted to any particular sublanguage. Sublanguages may, however, use relations that are rare or nonexistent in the general language. In the stroke vocabulary, there are many words for pathological conditions involving the failure of some physical or mental function. We have invented a relation NNABLE to express the connection between the condition and the function:

    aphasia NNABLE speech
    amnesia NNABLE memory

Relations such as T, PART, and NNABLE are especially useful in making inferences. For instance, if we have another relation FUNC, describing the typical function of a body part, we might combine the relational arc

    speech FUNC Broca's area

with the arc

    aphasia NNABLE speech

to infer that when aphasia is present, the diagnostician should check for the possibility of damage to Broca's area (as well as to any other body part which has speech as a function).

Another kind of relation is the "collocational relation", which governs the combining of words. Such relations are particularly useful for generating idiomatic text. Consider the "typical preposition" relation PREP:

    on PREP list

which says that an item may be "on a list" as opposed to "in a list" or "at a list."

Although the lexicon builder is based on a relational model, it can be adapted for use in connection with a variety of models of lexicon structure. A semantic-field approach can be handled by the same mechanism as relations; the lexicon builder also recognizes unary attributes of words, and these attributes can be treated as semantic features if one wishes to build a feature-based lexicon.

APPLICATIONS FOR THE LEXICON BUILDER

This project was motivated partly by theoretical questions of lexicon design and partly by projects which required the use of a lexicon.

For instance, the Michael Reese Hospital Stroke Registry includes a text generation module powered by a relational lexicon (Evens et al., 1984). This application provided a framework of goals within which the interactive lexicon builder was developed. The vocabulary required for the Stroke Registry text generator is of moderate size, about 2000 words and phrases. This is small enough that a lexicon for it can be built interactively.

One can imagine many applications for a large lexicon such as the automatic lexicon builder will construct. Question answering is one of our original areas of interest; a large, densely connected vocabulary will greatly add to the variety of inferences a question answering system can make. Another area is information retrieval, where experiments (Evens et al., forthcoming) have shown that the use of a relational thesaurus leads to improvements in both recall and precision.
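The kind of inference that a connected vocabulary makes possible can be sketched with the FUNC and NNABLE arcs from the stroke vocabulary. The Python fragment below is an illustration under stated assumptions: the triple format and the function name `parts_to_check` are hypothetical, and only arcs that appear in this paper are used as data.

```python
# Illustrative only: arcs as (word1, relation, word2) triples.
ARCS = [
    ("speech", "FUNC", "Broca's area"),
    ("aphasia", "NNABLE", "speech"),
    ("amnesia", "NNABLE", "memory"),
]

def parts_to_check(condition, arcs):
    """Return body parts whose typical function is disabled by `condition`.

    Step 1: collect the functions f with  condition NNABLE f.
    Step 2: collect the parts p with      f FUNC p.
    """
    functions = {w2 for w1, rel, w2 in arcs
                 if w1 == condition and rel == "NNABLE"}
    return sorted(w2 for w1, rel, w2 in arcs
                  if rel == "FUNC" and w1 in functions)

print(parts_to_check("aphasia", ARCS))  # ["Broca's area"]
print(parts_to_check("amnesia", ARCS))  # [] (no FUNC arc for memory here)
```

The more FUNC arcs the lexicon contains, the more body parts such a query returns, which is the sense in which a denser network widens the range of available inferences.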

On a more theoretical level, the automatic lexicon builder will add greatly to our understanding of sublanguages, notably that of the dictionary itself. We have noted that a specialized relation such as NNABLE, unusual in the general language, may be important in a sublanguage. We believe that such specific relations are part of the distinctive character of every sublanguage. The very possibility of creating a large, general-language lexicon points toward a time when sublanguages will be obsolete for many of the purposes for which they are now used; but they will still be useful and interesting for a long time to come, and the automatic lexicon builder gives us a new tool for analyzing them.

Figure 1. Part of a relational network

THE INTERACTIVE LEXICON BUILDER

Commands

The interactive lexicon builder consists of an operating-system-like environment in which the user may invoke the following commands:

HELP displays a set of one-line summaries of the commands, or a paragraph-length description of a specified command. This paragraph describes the command-line arguments, optional or required, for the given command, and briefly explains the function of the command.

ADDENTRY provides a series of prompts to enable the user to create a lexical entry. Some of these prompts are hard coded; others can be set up in advance by the user so that the lexicon can be tailored to the user's needs.

EDIT enables the user to modify an existing entry. It displays the existing contents of the entry item by item, prompting for changes or additions. If the desired entry is not already in the lexicon, EDIT behaves in the same way as ADDENTRY.

DELETE lets the user delete one or more entries. An entry is not physically deleted; it is removed from the directory, and all entries with arcs pointing to it are modified to eliminate those arcs. (This is simple to do, since for every such arc there is an inverse arc pointing to that entry from the deleted one.) On the next PACK operation (see below) the deleted entry will not be preserved in the lexicon.

This command can also be used to delete the defective entries that are occasionally caused by unresolved bugs in the entry-creating routines, or which might arise from other circumstances. A special option with this command searches the directory for a variety of "illegal" conditions such as nonprinting characters, zero-length names, etc.

LIST gives one-line listings of some or all of the entries in the lexicon. The listing for each entry includes the name (the word itself), sense number, part of speech, and the first forty characters of the definition if there is one.

SHOW displays the full contents of one or more entries.

RELATIONS displays a table of the lexical-semantic relations used by the lexicon builder. This table is created by the user in a separate operation.

UNDEF is a special form of EDIT. In creating an entry, the user may create relational arcs from the current word to other words that are not in the lexicon. The system keeps a queue of undefined words. UNDEF invokes EDIT for the word at the head of the queue, thus saving the user the trouble of looking up undefined words.

PACK performs file management on the lexicon, sorting the entries and eliminating space left by deleted ones.

This routine works in two passes. In the first pass, the entries are copied from the existing lexicon file to a new file in lexicographic order and a table is created that maps the entries from their old locations to their new ones. At this stage, a relational arc from one entry to another still points to the other entry's old location. The second pass updates the new lexicon, modifying all relational arcs to point to the correct new locations.

QUIT exits from the lexicon builder environment. Any new entries or changes made during the lexicon building session are incorporated and the directory is updated.

Extensions to the commands

All of the commands can be abbreviated; so far they all have distinctive initials and can thus be called with a single keystroke.

Each command may be accompanied by command-line arguments to define its action more precisely. Display commands, such as HELP or SHOW, allow the user to get a printout of the display. Where an entry name is to be specified, the user can get more than one entry by means of "wild cards." For instance, the command "LIST produc*" might yield a list showing entries for "produce", "produced", "produces", "producing", "product", and "production."

Additional commands are currently being developed to help the user manage the relation table and the attribute list from within the lexicon builder environment.
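The two passes of PACK described above can be sketched in Python as follows. This is an illustration under assumptions, not the actual routine: the lexicon file is modeled as an in-memory list whose indices stand for file locations, deleted entries are modeled as `None`, the relation label `T_INV` is a hypothetical name for the inverse of T, and DELETE is assumed to have already removed every arc that pointed to a deleted entry.

```python
def pack(lexicon):
    """Two-pass compaction of a lexicon held as a list.

    Each live entry is a dict with a "name" and "arcs", a list of
    (relation, location) pairs; a deleted slot is None.
    """
    # Pass 1: copy live entries in lexicographic order and record
    # where each one has moved (old location -> new location).
    live = [(entry["name"], old)
            for old, entry in enumerate(lexicon) if entry is not None]
    live.sort()
    new_lexicon = []
    moved = {}
    for name, old in live:
        moved[old] = len(new_lexicon)
        new_lexicon.append({"name": name, "arcs": list(lexicon[old]["arcs"])})

    # Pass 2: the copied arcs still hold old locations; redirect them.
    for entry in new_lexicon:
        entry["arcs"] = [(rel, moved[loc]) for rel, loc in entry["arcs"]]
    return new_lexicon


old_file = [
    {"name": "carotid", "arcs": [("T", 2)]},      # location 0
    None,                                         # location 1, deleted
    {"name": "artery", "arcs": [("T_INV", 0)]},   # location 2
]
packed = pack(old_file)
print(packed)
```

After packing, "artery" sits at location 0 and "carotid" at location 1, and each arc points to the other entry's new location. The mapping table is needed because an arc may refer to an entry that has not yet been copied when the arc's owner is written out in the first pass.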

The design of the user interface took into account both the available facilities and the expected users. The lexicon builder runs on a VAX 11/750, normally accessed with line-editing terminals. This suggests that a single-line command format is most appropriate. Since much of the work with the system is done over 300 baud telephone lines, conciseness is also important. The users have all had some programming experience (though not necessarily very much), so an operating-system-like interface is easy for them to get used to. If the lexicon builder becomes popular, we hope to have the opportunity to develop a more sophisticated interface, perhaps with a combination of features for beginners and more experienced users.

Structure of a lexical entry

A complete lexical entry consists of:

1. The "name" of the entry: its character-string form.

2. Its sense. We represent senses by simple numbers, not attempting to formally distinguish polysemy and homonymy, or any other degree of semantic difference. The system leaves to the user the problem of distinguishing different senses from extensions of a single sense: that is, where a word has already been entered in some sense, the user must decide whether to modify the entry for that sense or create a new entry for a new sense.

3. Part of speech, or "class." Our classification of parts of speech is basically the traditional classification with some convenient additions, largely drawn from the classification used by Sager in the LSP (Sager, 1981). Most of the additions are to the category of verbs: "verb" to the lexicon builder denotes the stem form, while the third person and past tense are distinguished as "finite verb", and the past and present participles are classified separately.

4. The text of the definition, entered by the user.

At this stage in our work, the definition is not parsed or otherwise analyzed, so its presence is more for purposes of documentation than anything else. In future versions of the lexicon builder, the definition will play an important role in constructing the entry but in the entry itself will be replaced by information derived from its analysis.

5. A list of attributes (or semantic features), each with its value, which may be binary or scalar.

6. A predicate calculus definition. For example, for the most common sense of the verb "promise", the predicate calculus definition is expressed as

    promise(x,y,z) = say(x,w,z)
        event(y) => w = will happen(y)
        thing(y) => w = will receive(z,y)

or, in freer form,

    (x promises y to z) = (x says w to z)
        where w = (y will happen)     if y is an event
                  (z will receive y)  if y is a physical object.

This is entered by the user.

We have been inclined to think of the relational lexicon as a network, since the network representation vividly brings out the interconnected quality which the relational model gives to the lexicon. Predicate calculus is better in other respects; for instance, it expresses the above definition of "promise" much more elegantly than any network notation could. The two methods of representation have traditionally been seen as alternatives rather than as supplementing each other; we believe that predicate calculus has an important supplementary role to play in defining the core vocabulary of the lexicon, although we are not sure yet how to use it.

7. Case structure (for verbs). This is a table describing, for each syntactic slot associated with the verb (subject, direct object, etc.), the semantic case or cases that may be used in that slot ("agent", "experiencer", etc.), and whether the slot is required, optional, or may be expressed elliptically (as with the direct and indirect object in "I promise!" referring to an earlier statement).

Space is reserved in this structure for selection restrictions. A relational model gives us the much more powerful option of indicating through relations such as "permissible subject", "permissible object", etc., not only what words may go with what others, but whether the usage is literal, a conventional figure of speech, fanciful, or whatever. Selection restrictions do, however, have the virtue of conciseness, and they permit us to make generalizations. Relational arcs may then be used to mark exceptions.

8. A list of zero or more relations, each with one or more pointers to other entries, to which the current entry is connected by that relation.
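The eight components listed above can be pictured as a single record. The following Python sketch is only an illustration: the field names, the use of a dictionary for the relation list, and the (name, sense) pairs standing in for pointers are all assumptions, and the actual system stores entries in fixed-size records in a file.

```python
from dataclasses import dataclass, field

@dataclass
class LexicalEntry:
    name: str                                            # 1. character-string form
    sense: int = 1                                       # 2. sense number
    word_class: str = ""                                 # 3. part of speech ("class")
    definition: str = ""                                 # 4. definition text
    attributes: dict = field(default_factory=dict)       # 5. attribute -> value
    predicate_calculus: str = ""                         # 6. predicate calculus definition
    case_structure: dict = field(default_factory=dict)   # 7. verbs only: slot -> cases
    relations: dict = field(default_factory=dict)        # 8. relation -> [(name, sense), ...]

# "carotid T artery" from the taxonomy example; sense 1 is assumed.
carotid = LexicalEntry(
    name="carotid",
    word_class="noun",
    relations={"T": [("artery", 1)]},
)

# The predicate calculus field holds the text entered by the user.
promise = LexicalEntry(
    name="promise",
    word_class="verb",
    predicate_calculus="promise(x,y,z) = say(x,w,z)",
)

print(carotid.relations["T"])  # [('artery', 1)]
```

Keeping the relation list as a mapping from relation name to a list of targets mirrors the description in item 8, where each relation carries one or more pointers to other entries.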

We find it convenient to treat morphological derivations such as plurals of nouns, and tenses and participles of verbs, as relations connecting separate entries. The entry for a regularly derived form such as a plural is a minimal one, consisting of name, sense, part of speech, and one relational arc, linking the entry to the stem form. The lexicon builder generates these regular forms automatically. It also distinguishes these "regular" entries from "undefined" entries, which have been entered indirectly as target words of relational arcs and which are on the queue accessed by UNDEF, as well as from "defined" entries.

Figure 2. Structure of a lexical entry (fields shown: name, sense, class, text of definition, attribute list, predicate calculus definition, case structure table, relation list with pointers to word2 entries)

File structure of the lexicon

There are four data files associated with the lexicon.

The first is the lexicon proper. The biggest complicating factor in the design of the lexicon is the extremely interconnected nature of the data; a change in one portion of the file may necessitate changes in many other places in the file. Each entry is linked through relational arcs to many other entries; and for every arc pointing from word1 to word2, there must be an inverse arc from word2 to word1. This means that whenever we create a new arc in the course of building or modifying an entry for word1, we must update the entry for word2 so that it will contain the appropriate inverse arc back to word1. Word2's entry has to be updated or created from scratch; we need to structure the lexicon file so that this updating process, which may take place anywhere in the file, can be done with the least possible dislocation.

    aphasia (1) n.
    definition
        a disorder of language due to injury to the brain
    attributes
        nonhuman  collective
    predicate calculus
        have(x, aphasia) => ~able(speak(x))
    relations
        TAX [aphasia is a kind of x]
            deficit  disorder  loss  inability
        'TAX [x is a kind of aphasia]
            anomic  global  Gerstmann's  Wernicke's  Broca's  conduction  transcortical
        SYMPTOM [aphasia is a symptom of x]
            stroke  TIA
        ASSOC [aphasia may be associated with x]
            apraxia
        'CAUSE [x is a cause of aphasia]
            injury  lesion
        NNABLE [aphasia is the inability to do x]
            speech  language

Figure 3. Lexical entry for "aphasia"

The size of an entry can vary enormously. Regular derived forms contain only the name, sense, class and one relational arc (to the stem form), as well as a certain amount of overhead for the definition, predicate calculus definition and attribute list, although these are not used. The smallest possible entry takes up about thirty bytes. At the other extreme, a word may have an extensive attribute list, elaborate text and predicate calculus definitions, and dozens or even (eventually) hundreds of relational arcs. "Aphasia", a moderately large entry with 19 arcs, occupies 322 bytes. Like all entries in the current lexicon, it will be subject to updating and will certainly become much larger.

With this range of entry sizes, the choice between fixed-size and variable-size records becomes somewhat painful. Variable-size records would be highly convenient as well as efficient except for the fact that when we add a new entry that is related to existing entries, we must add new arcs to those entries. The existing entries thus no longer fit into their previous space and must be either broken up or moved to a new space. The former option creates problems of identifying the various pieces of the entry; the latter requires that yet more existing entries be modified.

Because of these problems, we have opted for a fixed-size record. Some space is wasted, either in empty space if the record is too large or through proliferation of pointers if the record is too small; but the amount of necessary updating is much less, and the file can be kept in order through frequent use of the PACK command. The choice of record size is conditioned by many factors, system requirements as well as the range of entry sizes. We are currently working on determining the best record size for the MRH application.

So far the user does not have the option of saving or rejecting the results of a lexicon building session, since entries are written to the file as soon as they are created. We are studying ways of providing this option. A brute force way would be to keep the entire lexicon in memory and rewrite it at the end of the session. This is feasible if the host computer is large and the lexicon is small. The 2000-word lexicon for the Michael Reese stroke database takes up about a third of a megabyte, so this approach would work on a mainframe or a large minicomputer such as our VAX 750, but could not readily be ported to a smaller machine; nor could we handle a much larger vocabulary such as we plan to create with the automatic lexicon builder.

The second file is a directory, showing each entry's name, sense, and status (defined, undefined or regular derivative), with a pointer to the appropriate entry in the lexicon proper. The directory entries are linked in lexicographic order. When the lexicon builder is invoked, the entire directory is read into a buffer in memory, and this buffer is updated as entries are created, modified or deleted. At the end of a lexicon building session, the updated directory is written out to disk.

The third (optional) file is a table of attributes, with pointers into the lexicon proper. This can be extended into a feature matrix.

The fourth (also optional) is a table of pre-defined relations. This table includes, for each relation:

(1) its mnemonic name.

(2) its properties. A relation may be reflexive, symmetric or transitive; there may be other properties worth including.

(3) a pointer to the relation's inverse. If x REL y, then we can define some 'REL such that y 'REL x. If REL is reflexive or symmetric, then 'REL = REL.

(4) the appropriate parts of speech for the words linked by the relation. For instance, the NNABLE relation links two nouns, while the collocational PREP relation links a preposition to a noun. Taxonomy can link any two words (apart from prepositions, conjunctions, etc.) as long as they are of the same part of speech: nouns to nouns, verbs to verbs, etc.

(5) the text of a prompt. ADDENTRY uses this prompt when querying the user for the occurrence of relational arcs involving this relation. For instance, if we are entering the word "promise" and our application uses the taxonomy relation, we might choose a short prompt, in which case the query for taxonomy might take the form

    "promise" T: [user enters word2 here]

or we could use something more explicit:

    "promise" is a kind of:

Users familiar with lexical-semantic relations might prefer the shorter mnemonic prompt, whereas other users might prefer a prompt that better expressed the significance of the relation.

THE AUTOMATIC LEXICON BUILDER

Building a very large lexicon

There are numerous logistical problems in implementing the sort of very large lexicon that would result from analysis of an entire dictionary, as the work of Amsler and White (1979) or Kelly and Stone (1975) shows. Integrating the lexicon builder with the LSP, and with preprocessors for dictionary data, will also be a big job. Fully automatic analysis of dictionary material, then, is a long-range goal.

A major problem in the relational analysis of the dictionary is that of determining what relations to use. Noun and verb definitions rely on taxonomy to a great extent (e.g. Amsler and White, 1979) but there are definitions that do not clearly fit this pattern; furthermore, even in a taxonomic definition, much semantic information is contained in the qualifying or differentiating part of the definition.

Adjective definitions are another problem area. Adjectives are usually defined in terms of nouns or verbs rather than other adjectives, so simple taxonomy does not work neatly. In a sample of about 7,000 definitions from W7, we identified nineteen major relations unique to adjective definitions, and these covered only half of the sample. The remaining definitions were much more varied and would probably require far more than nineteen additional relations. And for each relation, we had to identify words or phrases (the "defining formulas") that signaled the presence of the relation.

For these reasons as well as theoretical ones, we need a simplifying model of relations, a model that enables us either to avoid the endless identification of new relations or to conduct the identification within an orderly framework. Werner's MTQ schema (Werner, 1978; Werner and Topper, 1976) seems to provide the basis for such a model.

Werner identifies only three relations: modification, taxonomy and queueing. He asserts that all other relations can be expressed as compounds of these relations and of lexical items; for instance, the PART relation can be expressed, with the help of the lexical item "part", by the relational arcs

    Broca's area T part
    brain M part

which say in effect that Broca's area is a kind of part, specifically a "brain-part."

Werner's modification and taxonomy relations reflect Aristotle's model of the definition as consisting of species, genus and differentiae: taxonomy links the species to the genus and modification links the differentiae to the genus. A study of definitions in W7 and LDOCE shows that they do indeed follow this pattern, although (as in adjective definitions) the pattern is not always obvious.

The special power of MTQ in the analysis of definitions is that in a definition following the Aristotelian structure, taxonomy and modification can be identified by purely syntactic means. One (or occasionally more than one) word in the definition is modified directly or indirectly by all the other words. The core word is linked to the defined word by taxonomy; all the others are linked to the core word by modification. (Queueing so far does not seem to be important in the analysis of definitions.)

In order to avoid certain ambiguities that arise in a very elaborate network such as that generated from a large dictionary, we have replaced the separate modification and taxonomy arcs with a single, ternary relational arc that keeps the species, genus and differentiating items of any particular definition linked to each other.

The problem of identifying "higher level" relations such as PART and NNABLE in an MTQ network still remains. At this point it seems to be similar to the problem of identifying higher level relations from defining formulas. The MTQ model [...]

Another pleasant discovery is that the Linguistic String Parser, which we have used successfully for some years, is exceptionally well suited for this strategy, since it is geared toward an analysis of sentences and phrases in terms of "centers" or "cores" with their modifying "adjuncts", which is exactly the kind of analysis we need to do.

Design of the automatic lexicon builder

The automatic lexicon builder will contain at least the following subsystems:

1. The standard data structure for the lexical entry, as described for the interactive lexicon builder, with slight changes to adjust to the use of MTQ.

The relation list is presently structured as a linked list of relations, each pointing to a linked list of word2s. ("Word2" refers to any word related to the word, "word1", that we are currently investigating.) Incorporating the ternary MTQ model, we would have two relation lists: a T list and an M list. The T list would be a linked list of words connected to word1 by the T relation; its structure would be identical to the present relation list except that its nodes would be lexical entry pointers instead of relations. Each of these lexical entry pointers would, like the relation nodes in the existing implementation, point to a linked list of word2s. The word2s in the T list would be connected to the T words by an inverse-modification relation ('M) and the word2s in the M list would be connected to the M words by inverse taxonomy ('T).

2. Preprocessors to convert pre-existing data to the standard form. The preprocessor need not be intelligent; its job is to identify and decode part-of-speech and other such information, separating this from the definition proper.

Part of the preprocessing phase is to generate a "dictionary" for the LSP. This dictionary need only contain part-of-speech information for all the words that will be used in definitions; other information such as part-of-speech subclass and selection restrictions is helpful but not necessary. Sager and her associates (1980) have created programs to do this.

3. Batch and interactive input modules. The batch input reads a data file in standard form, perhaps optionally noting where further information would be especially desirable. The interactive input is preserved from the interactive version of the system and allows the user to "improve" on dictionary data as well as to observe the results of the dictionary parse.

4. Definition analyzer. In this module, the LSP will parse the definition to produce a parse tree, which will then be converted into an MTQ network to be linked into the overall lexical network.

5. Entry generator. This module, like the preprocessor, can be tailored to the user's needs.

SUMMARY

A program has been written that enables a user interested in creating a lexicon for natural language processing to generate lexical entries interactively and link them automatically to other lexical entries through lexical-semantic relations. The program provides a small set of commands that allow the user to create, modify, delete, and display lexical entries, among other operations.

The immediate motivation for the program was to produce a relational lexicon for text generation of clinical reports by a diagnostic expert system. It is now being used for that purpose. It can equally well be used in any other sublanguage environment; in addition, it is intended to be compatible, as far as possible, with models of lexicon structure other than the relational model on which it is based.

The interactive lexicon builder is further intended as the starting point for a fully automatic lexicon building program which will create a large, general purpose relational lexicon from machine readable dictionary text, using a slightly modified form of Werner's Modification-Taxonomy-Queueing relational model.

REFERENCES

Ahlswede, Thomas E., and Evens, Martha W., 1983. "Generating a Relational Lexicon from a Machine-Readable Dictionary." Proceedings of the Conference on Artificial Intelligence, Oakland University, Rochester, Michigan.

Ahlswede, Thomas E., and Evens, Martha W., 1984. "A Lexicon for a Medical Expert System." Presented at the Workshop on Relational Models, Coling '84, Stanford University, Palo Alto, California.

Ahlswede, Thomas E., in press. "A Linguistic String Grammar of Adjective Definitions." In S. Williams, ed., Humans and Machines: The Interface Through Language, Ablex.

Amsler, Robert A., and White, John S., 1979. Development of a Computational Methodology for Deriving Natural Language Semantic Structures via Analysis of Machine Readable Dictionaries. Linguistics Research Center, University of Texas.

Evens, Martha W., Ahlswede, Thomas E., Hill, Howard, and Li, Ping-Yang, 1984. "Generating Case Reports from the Michael Reese Stroke Database." Proc. 1984 Conference on Intelligent Systems and Machines, Oakland University, Rochester, Michigan, April.

Evens, Martha W., Vandendorpe, James, and Wang, Yih-Chen, in press. "Lexical-Semantic Relations in Information Retrieval." In S. Williams, ed., Humans and Machines: The Interface Through Language, Ablex.

Iris, Madelyn, Litowitz, Bonnie, and Evens, Martha W., unpublished. "The Part-Whole Relation in the Lexicon: an Investigation of Semantic Primitives."

Kelly, Edward F., and Stone, Philip J., 1975. Computer Recognition of English Word Senses. North-Holland, Amsterdam.

Sager, Naomi, 1981. Natural Language Information Processing. Addison-Wesley, New York.

Sager, Naomi, Hirschman, Lynette, White, Carolyn, Foster, Carol, Wolff, Susanne, Grad, Robert, and Fitzpatrick, Eileen, 1980. Research into Methods for Automatic Classification and Fact Retrieval in Science Subfields. String Reports No. 13, New York University.

Werner, Oswald, 1978. "The Synthetic Informant Model: the Simulation of Large Lexical/Semantic Fields." In M. Loflin and J. Silverberg, eds., Discourse and Difference in Cognitive Anthropology. Mouton, The Hague.

Werner, Oswald, and Topper, Martin D., 1976. "On the Theoretical Unity of Ethnoscience and Ethnoscience Ethnographies." In C. Rameh, ed., Semantics, Theory and Application, Proc. Georgetown University Round Table on Language and Linguistics.
