A Word Database for Natural Language Processing

Brigitte Barnett
Hubert Lehmann
Magdalena Zoeppritz

IBM Scientific Center, Tiergartenstraße 15, 6900 Heidelberg, Federal Republic of Germany

Abstract: The paper describes the design of a fair-sized lexical database that is to be used with a natural language based expert system with German as the language of interaction. Sources for entries and tools for constructing and maintaining the database are discussed, as well as the information needed in the lexicon for the purposes of syntactic and semantic processing.

1 Introduction

The intent of this paper is to show some aspects of a computer dictionary geared towards the natural language component of an expert system. The dictionary is organized as a database to integrate the various aspects of lexicographic work and, at the same time, enable fast access from a parser. Work on the lexicon was long neglected - both in theoretical linguistics and natural language processing projects - so we felt that a principled approach was overdue (cf. Sedelow (1985) for a survey of related work).

In the past two years, we concentrated therefore on the formulation of criteria for establishing syntactic features which have to be coded in the lexicon, and we will report here on some of our findings. This will be preceded by a brief overview of the aims of our overall project and a short description of the prototype system we are building. We will then describe the design of our lexicographic database including the criteria for selecting sources of the vocabulary and some of our tools for editing and querying.

The main objectives of the project Linguistics and Logic Based Legal Expert System, which is a Joint Research Project between the University of Tübingen and the IBM Scientific Center Heidelberg, are to design and implement a natural language based knowledge acquisition and query system and to build a legal expert system on its basis. It consists of the following components:

• The dialog component controls the interaction with users and contains among other things an explanation component and a component for preparing system output for display and for eventually generating natural language explanatory texts.
  In a so-called user profile, as much information about a user is kept as necessary: to improve answers and explanations, one must know certain things about the user, mainly about her or his knowledge in current sessions. For example, one may want to avoid explanations about details the user already knows.
• The deductive component is activated by user queries, by input of new knowledge, and by requests of the Natural Language Analyzer.
• The knowledge manager administers the actual knowledge base in the working area as well as its permanent version in the database. It is the only component allowed to update the permanent knowledge base. It loads knowledge from the database into storage and requests consistency checks for new knowledge. With the exception of the lexical database, the Knowledge Manager also accesses the database on behalf of other system components.
• The SQL Data System (IBM (1983)) maintains the database which is a repository of facts and rules:
  • linguistic knowledge, e.g. dictionary and grammar
  • common sense knowledge, including a thesaurus
  • legal knowledge (law, rules from commentaries and decisions, legal strategies)
  • cases
  • user profiles
• The natural language analyzer and its dictionary are extensions and modifications of the existing User Specialty Languages system (USL) developed at the Heidelberg Scientific Center (Lehmann, H., N. Ott, M. Zoeppritz (1985), M. Zoeppritz (1984)). USL is a natural language front end to SQL/DS (IBM 1983) operational in six languages. Within the scope of this project, it will be enhanced to suit the requirements of a natural language (German) based expert system. This means that it must be able to deal with both running texts and queries and to translate them into their corresponding logical form.
  The Natural Language Analyzer consists of the following parts:
  • a sentence separator splitting texts into sentences,
  • a pre-parser for dictionary look-up,
  • the parser and the routines for semantic analysis,
  • routines for the generation of the logical form from intermediate structures (cf. Guenthner and Lehmann (1984) for a description),
  • routines for semi-automatic generation of thesaurus extensions (Wirth, R. (1984)).

As a specific application, the area of German traffic law was chosen for the expert system which shall be used in two modes: for consultation by a legal expert and as a tutor for law students (cf. Alschwee et al. (1985) for details).

2 Description of the Dictionary

Within such an environment, a fairly large-sized and detailed dictionary is needed. Aspects of its design, the structure in the database, and the editing and querying facilities will be discussed (cf. also Barnett (1985)). The expected size of the dictionary within the scope of the project is estimated to be some 20,000 entries. Its current size is some 12,000 entries.

2.1 Word Database

Because we must be able to handle a large number of words in this project, we felt that it would be necessary to administrate them in a more appropriate form than the usual file organization and that a relational database would be the best tool for dealing with lexical information because of the following advantages:

• excerpting grammatical information according to specific features;
• links to related information not necessarily kept in the same table;
• easier control of updates;
• many types of integrity checks;
• automatic backup so that, in case of a systems breakdown, a consistent status remains available;
• another great advantage of database technology is concurrency capabilities which prevent users working on the same table from getting in each other's way;
• and, within the realm of this project, the possibility to link to the Natural Language Analyzer.

2.2 Scope

The scope of the information contained in our dictionary is geared towards the processing of natural language by computer. Lexical information must therefore be more detailed and more explicit than in standard dictionaries intended for humans. Also, a computer dictionary is of no value unless it matches the grammar and the needs of the semantic processing.

We started with the coding of morphological and syntactic information, since we felt to be on rather stable ground there. We will report on some of the difficulties we encountered - many of them not unknown to theoretical linguistics - in the next section.

Semantic information is coded primarily in the form of meaning rules, but we have not included these in our lexical database yet, as we are still experimenting with different kinds of information and representations before we go to large-scale coding. We also hope that, at least to some extent, the acquisition of such information can be automated (cf. the approach taken by Wirth (1984)).

2.3 Sources

For the purpose of our particular application, we need to cover the vocabulary occurring in German traffic law. However, to meet the goal of general applicability, it is also necessary to include the core of the general German vocabulary. We will try therefore to code the relevant legal words based on texts from this very domain.

• The vocabulary of the application area, i.e. from the legal domain, stems from the following sources:
  • A collection of relevant court decisions (from our study partner),
  • A number of accident descriptions collected from newspapers,
  • A few word lists used for document retrieval from both the Legal and Public Relations departments of IBM Germany.
• We plan to investigate to what extent machine-readable dictionaries or legal texts can be used for an automatic or semi-automatic acquisition of lexical and grammatical information and of common-sense knowledge.

2.4 Layout of the Dictionary Relation

In our word database, every word constitutes an entry, and most columns in the entry contain information concerning a particular word. Even though semantic aspects are not coded in this particular version, one may regard the codes as a representation of a word's morphological and syntactic meaning. Some words have more than one entry: to code multiple entries becomes necessary when different grammatical feature sets have to be assigned to one lemma.

All words are contained in a single table or relation. One could also envisage a separate table for every part of speech; however, this would be rather inconvenient, as it would be impossible to compare grammatical phenomena across different categories. Also it may be desirable to look at words of the same root but belonging to different parts of speech. With this necessity in mind, we designed an overall, general relation which would contain all words. In order to treat the words individually and according to their specific needs, a so-called "view" was defined for each part of speech. The present structure of the relation is described in Figure 1.

2.5 Tools and Aids

To facilitate coding and to ensure its accuracy, we use the following tools:

Editing: A Dictionary Editor (a menu-driven program running under ISPF (IBM 1982) interacting with the SQL/DS database) was developed to facilitate adding, updating, deleting, and checking of entries at the terminal.

Under this editor, a specific set of menus and help panels was implemented for nouns, verbs, and adjectives. Whereas the main menus contain only short hints to the grammatical information as a sort of reminder to the lexicographer, help menus give more detailed examples for the individual codes. Subpanels, as extensions to the main panel for input, and error messages also assist the lexicographer. Codes are verified by the Dictionary Editor to keep down the error rate.
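
The layout described in Section 2.4 (one relation holding all words, one view per part of speech, several entries for a lemma with different feature sets) and the code verification mentioned in Section 2.5 can be illustrated with a small sketch. It uses Python's built-in sqlite3 module rather than SQL/DS, and every table, view, and column name in it is an assumption made for illustration; the actual columns of the relation are those of Figure 1, which are not reproduced here.

```python
import sqlite3

# Hypothetical schema: names and codes are illustrative only and are not
# taken from the paper's Figure 1.
SCHEMA = """
CREATE TABLE lexicon (
    entry_id INTEGER PRIMARY KEY,
    lemma    TEXT NOT NULL,
    pos      TEXT NOT NULL CHECK (pos IN ('NOUN', 'VERB', 'ADJ')),
    gender   TEXT CHECK (gender IS NULL OR gender IN ('M', 'F', 'N')),
    features TEXT
);
-- One view per part of speech over the single general relation.
CREATE VIEW nouns AS
    SELECT entry_id, lemma, gender, features FROM lexicon WHERE pos = 'NOUN';
CREATE VIEW verbs AS
    SELECT entry_id, lemma, features FROM lexicon WHERE pos = 'VERB';
CREATE VIEW adjectives AS
    SELECT entry_id, lemma, features FROM lexicon WHERE pos = 'ADJ';
"""


def build_lexicon():
    """Create an in-memory database with a few sample entries."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    rows = [
        ("Fahrer", "NOUN", "M", None),
        # Two entries for one lemma with different feature sets.
        ("fahren", "VERB", None, "intransitive"),
        ("fahren", "VERB", None, "transitive"),
        ("fahrbar", "ADJ", None, None),
    ]
    with conn:
        conn.executemany(
            "INSERT INTO lexicon (lemma, pos, gender, features) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )
    return conn


def try_add_entry(conn, lemma, pos, gender=None, features=None):
    """Insert an entry; return False if a CHECK constraint rejects a code."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO lexicon (lemma, pos, gender, features) "
                "VALUES (?, ?, ?, ?)",
                (lemma, pos, gender, features),
            )
    except sqlite3.IntegrityError:
        return False
    return True


if __name__ == "__main__":
    conn = build_lexicon()
    # Per-category work goes through a view.
    print(conn.execute("SELECT lemma, features FROM verbs").fetchall())
    # Cross-category comparison stays possible on the single relation.
    print(conn.execute(
        "SELECT lemma, pos FROM lexicon WHERE lemma LIKE 'fahr%'"
    ).fetchall())
    # An unknown gender code is rejected by the integrity check.
    print(try_add_entry(conn, "Unfall", "NOUN", "X"))
```

Keeping all words in one relation means a query for words of the same root across parts of speech needs no joins, while the views give each part of speech only the columns relevant to it. The CHECK constraints here merely stand in for the kind of verification the Dictionary Editor performs; how the editor actually checks codes is not specified in the text.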
