A Tool Kit for Lexicon Building

Thomas E. Ahlswede
Computer Science Department
Illinois Institute of Technology
Chicago, Illinois 60616, USA

ABSTRACT

This paper describes a set of interactive routines that can be used to create, maintain, and update a computer lexicon. The routines are available to the user as a set of commands resembling a simple operating system. The lexicon produced by this system is based on lexical-semantic relations, but is compatible with a variety of other models of lexicon structure. The lexicon builder is suitable for the generation of moderate-sized vocabularies and has been used to construct a lexicon for a small medical expert system. A future version of the lexicon builder will create a much larger lexicon by parsing definitions from machine-readable dictionaries.

INTRODUCTION

Natural language processing systems need much larger lexicons than those available today. Furthermore, a good computer lexicon with semantic as well as syntactic information is elaborate and hard to construct. We have created a program which enables its user to interactively build and extend a lexicon. The program sets up a user environment similar to a simple interactive operating system; in this environment lexical entries can be produced through a small set of commands, combined with prompts specified by the user for the desired kind of lexicon.

The interactive lexicon builder is being used to help construct entries for a lexicon to be used to parse and generate stroke case reports. Many terms in this medical sublanguage either do not appear in standard dictionaries or are used in the sublanguage with special meanings. The design of the lexicon builder is intended to be general enough to make it useful for others building lexicons for large natural language processing systems involving different sublanguages.

The interactive lexicon builder will be the basis for a fully automatic lexicon builder which uses Sager's Linguistic String Parser (LSP) to parse machine-readable text into a relational network based on a modified version of Werner's MTQ (Modification-Taxonomy-Queueing) schema. Initially this program will be applied to Webster's Seventh Collegiate Dictionary and the Longman Dictionary of Contemporary English, both of which are available in machine-readable form.

LEXICAL-SEMANTIC RELATIONS

The semantic component of the lexicon produced by this system consists principally of a network of lexical-semantic relations. That is, the meaning of a word in the lexicon is indicated as far as possible by its relationships with other words. These relations often have semantic content themselves and thus contribute to the definition of the words they link.

The two most familiar such relations are synonymy and antonymy, but others are interesting and important. For instance, to take an example from the vocabulary of stroke reports, the carotid is a kind of artery and an artery is a kind of blood vessel. This "is a kind of" relation is taxonomy. We express the taxonomic relations of "carotid", "artery" and "blood vessel" with the relational arcs

    carotid T artery
    artery T blood vessel

Another important relation is that of the part to the whole:

    ventricle PART heart
    Broca's area PART brain

Note that taxonomy is transitive: if the carotid is an artery and an artery is a blood vessel, then the carotid is a blood vessel. The presence or absence of the properties of transitivity, reflexivity and symmetry are important in using relations to make inferences.

The part-whole relation is more complicated than taxonomy in its properties; some instances of it are transitive and others are not. From this and other criteria, Iris et al. (forthcoming) distinguish four different part-whole relations.

Taxonomy and part-whole are very common relations, by no means restricted to any particular sublanguage. Sublanguages may, however, use relations that are rare or nonexistent in the general language. In the stroke vocabulary, there are many words for pathological conditions involving the failure of some physical or mental function. We have invented a relation NNABLE to express the connection between the condition and the function:

    aphasia NNABLE speech
    amnesia NNABLE memory

Relations such as T, PART, and NNABLE are especially useful in making inferences. For instance, if we have another relation FUNC, describing the typical function of a body part, we might combine the relational arc

    speech FUNC Broca's area

with the arc

    aphasia NNABLE speech

to infer that when aphasia is present, the diagnostician should check for the possibility of damage to Broca's area (as well as to any other body part which has speech as a function).

Figure 1. Part of a relational network

Another kind of relation is the "collocational relation", which governs the combining of words. These are particularly useful for generating idiomatic text. Consider the "typical preposition" relation PREP:

    on PREP list

which says that an item may be "on a list" as opposed to "in a list" or "at a list."

Although the lexicon builder is based on a relational model, it can be adapted for use in connection with a variety of models of lexicon structure. A semantic-field approach can be handled by the same mechanism as relations; the lexicon builder also recognizes unary attributes of words, and these attributes can be treated as semantic features if one wishes to build a feature-based lexicon.

APPLICATIONS FOR THE LEXICON BUILDER

This project was motivated partly by theoretical questions of lexicon design and partly by projects which required the use of a lexicon.

For instance, the Michael Reese Hospital Stroke Registry includes a text generation module powered by a relational lexicon (Evens et al., 1984). This application provided a framework of goals within which the interactive lexicon builder was developed. The vocabulary required for the Stroke Registry text generator is of moderate size, about 2000 words and phrases. This is small enough that a lexicon for it can be built interactively.

One can imagine many applications for a large lexicon such as the automatic lexicon builder will construct. Question answering is one of our original areas of interest; a large, densely connected vocabulary will greatly add to the variety of inferences a question answering system can make. Another area is information retrieval, where experiments (Evens et al., forthcoming) have shown that the use of a relational thesaurus leads to improvements in both recall and precision.

On a more theoretical level, the automatic lexicon builder will add greatly to our understanding of sublanguages, notably that of the dictionary itself. We have noted that a specialized relation such as NNABLE, unusual in the general language, may be important in a sublanguage. We believe that such specific relations are part of the distinctive character of every sublanguage. The very possibility of creating a large, general-language lexicon points toward a time when sublanguages will be obsolete for many of the purposes for which they are now used; but they will still be useful and interesting for a long time to come, and the automatic lexicon builder gives us a new tool for analyzing them.

THE INTERACTIVE LEXICON BUILDER

Commands

The interactive lexicon builder consists of an operating-system-like environment in which the user may invoke the following commands:

HELP displays a set of one-line summaries of the commands, or a paragraph-length description of a specified command. This paragraph describes the command-line arguments, optional or required, for the given command, and briefly explains the function of the command.

ADDENTRY provides a series of prompts to enable the user to create a lexical entry. Some of these prompts are hard coded; others can be set up in advance by the user so that the lexicon can be tailored to the user's needs.

EDIT enables the user to modify an existing entry. It displays the existing contents of the entry item by item, prompting for changes or additions. If the desired entry is not already in the lexicon, EDIT behaves in the same way as ADDENTRY.

DELETE lets the user delete one or more entries. An entry is not physically deleted; it is removed from the directory, and all entries with arcs pointing to it are modified to eliminate those arcs.

SHOW displays the full contents of one or more entries.

RELATIONS displays a table of the lexical-semantic relations used by the lexicon builder. This table is created by the user in a separate operation.

UNDEF is a special form of EDIT. In creating an entry, the user may create relational arcs from the current word to other words that are not in the lexicon. The system keeps a queue of undefined words. UNDEF invokes EDIT for the word at the head of the queue, thus saving the user the trouble of looking up undefined words.

PACK performs file management on the lexicon, sorting the entries and eliminating space left by deleted ones.

This routine works in two passes. In the first pass, the entries are copied from the existing lexicon file to a new file in lexicographic order and a table is created that maps the entries from their old locations to their new ones. At this stage, a relational arc from one entry to another still points to the other entry's old location. The second pass updates the new lexicon, modifying all relational arcs to point to the correct new locations.

QUIT exits from the lexicon builder environment. Any new entries or changes made during the lexicon building session are incorporated and the directory is updated.

Extensions to the commands

All of the commands can be abbrevi-
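
The two-pass PACK routine described under Commands can be illustrated with a small sketch. This is not the paper's implementation: the in-memory list standing in for the lexicon file, the use of `None` for a deleted slot, the entry layout (`word`, `arcs`), and the function name `pack` are all assumptions made for illustration. The sketch also assumes, following the description of DELETE, that no surviving entry has an arc to a deleted one.

```python
def pack(old_lexicon):
    """Return a compacted, sorted copy of old_lexicon with arcs relocated.

    old_lexicon is a list of slots (hypothetical layout). A slot is None if
    the entry was deleted, otherwise a dict with:
      "word": the headword
      "arcs": a list of (relation, target_location) pairs, where
              target_location is an index into old_lexicon.
    """
    # Pass 1: copy live entries in lexicographic order and record where
    # each one moved. Arcs still hold old locations at this stage.
    live = [(loc, entry) for loc, entry in enumerate(old_lexicon)
            if entry is not None]
    live.sort(key=lambda pair: pair[1]["word"])

    new_lexicon = []
    relocation = {}  # old location -> new location
    for new_loc, (old_loc, entry) in enumerate(live):
        relocation[old_loc] = new_loc
        new_lexicon.append({"word": entry["word"],
                            "arcs": list(entry["arcs"])})

    # Pass 2: rewrite every arc so that it points to the new location.
    for entry in new_lexicon:
        entry["arcs"] = [(relation, relocation[target])
                         for relation, target in entry["arcs"]]
    return new_lexicon


# Example: slot 1 was deleted; the arcs are the taxonomy arcs from the text.
example = [
    {"word": "carotid", "arcs": [("T", 2)]},
    None,
    {"word": "artery", "arcs": [("T", 3)]},
    {"word": "blood vessel", "arcs": []},
]
packed = pack(example)
```

Building the relocation table before touching any arc is what makes the second pass simple: by then every old location that can appear as an arc target has a known new location.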

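
The inferences described under LEXICAL-SEMANTIC RELATIONS can likewise be sketched in a few lines. The triple representation of arcs and the helper names `targets`, `taxonomic_ancestors`, and `parts_to_check` are hypothetical; only the relation names and the example arcs come from the text.

```python
# Relational arcs as (source, relation, target) triples, taken from the text.
ARCS = [
    ("carotid", "T", "artery"),
    ("artery", "T", "blood vessel"),
    ("aphasia", "NNABLE", "speech"),
    ("speech", "FUNC", "Broca's area"),
]


def targets(arcs, source, relation):
    """All words reached from source by one arc of the given relation."""
    return [t for s, r, t in arcs if s == source and r == relation]


def taxonomic_ancestors(arcs, word):
    """Follow T arcs repeatedly, using the transitivity of taxonomy."""
    found = []
    frontier = [word]
    while frontier:
        current = frontier.pop()
        for parent in targets(arcs, current, "T"):
            if parent not in found:
                found.append(parent)
                frontier.append(parent)
    return found


def parts_to_check(arcs, condition):
    """Combine NNABLE and FUNC arcs: body parts whose typical function
    is one that the given condition disables."""
    return [part
            for function in targets(arcs, condition, "NNABLE")
            for part in targets(arcs, function, "FUNC")]
```

With these arcs, `taxonomic_ancestors(ARCS, "carotid")` yields artery and blood vessel, and `parts_to_check(ARCS, "aphasia")` yields Broca's area. A relation that is only sometimes transitive, such as PART, would need a more careful treatment than the simple closure used here for T.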