Design and Implementation of a Lexical Data Base

Eric Wehrli
Department of Linguistics
U.C.L.A.
405 Hilgard Ave, Los Angeles, CA 90024

ABSTRACT

This paper is concerned with the specifications and the implementation of a particular concept of word-based lexicon to be used for large natural language processing systems such as machine translation systems, and compares it with the morpheme-based conception of the lexicon traditionally assumed in computational linguistics. It will be argued that, although less concise, a relational word-based lexicon is superior to a morpheme-based lexicon from a theoretical, computational and also practical viewpoint.

INTRODUCTION

It has been traditionally assumed by computational linguists and particularly by designers of large natural language processing systems such as machine translation systems that the lexicon should be limited to lexical information that cannot be derived by rules. According to this view, a lexicon consists of a list of basic morphemes along with irregular or unpredictable words.

In this paper, I would like to reexamine this traditional view of the lexicon and point out some of the problems it faces which seriously question the general adequacy of this model for natural language processing.

As a trade-off between the often conflicting linguistic, computational and also practical considerations, an alternative conception of the lexicon will be discussed, largely based on Jackendoff's (1975) proposal. According to this view, lexical entries are fully-specified but related to one another. First developed for a French parser (cf. Wehrli, 1984), this model has been adopted for an English parser in development, as well as for the prototype of a French-English translation system.

This paper is organized as follows: the first section addresses the general issue of what constitutes a lexical entry as well as the question of the relation between lexicon and morphology from the point of view of both theoretical linguistics and computational linguistics. Section 2 discusses the relational word-based model of the lexicon and the role morphology is assigned in this model. Finally, it spells out some of the details of the implementation of this model.

OVERVIEW OF THE PROBLEM

One of the well-known characteristic features of natural languages is the size and the complexity of their lexicons. This is in sharp contrast with artificial languages, which typically have small lexicons, in most cases made up of simple, unambiguous lexical items. Not only do natural languages have a huge number of lexical elements -- no matter what precise definition of this latter term one chooses -- but these lexical elements can furthermore (i) be ambiguous in several ways, (ii) have a non-trivial internal structure, or (iii) be part of compounds or idiomatic expressions, as illustrated in (1)-(4):

(1) ambiguous words:
    can, fly, bank, pen, race, etc.
(2) internal structure:
    use-ful-ness, mis-understand-ing, lake-s, tri-ed
(3) compounds:
    milkman, moonlight, etc.
(4) idiomatic expressions:
    to kick the bucket, by and large, to pull someone's leg, etc.

In fact, the notion of word, itself, is not all that clear, as numerous linguists -- theoreticians and/or computational linguists -- have acknowledged. Thus, to take an example from the computational linguistics literature, Kay (1977) notes:

    "In common usage, the term word refers sometimes to sequences of letters that can be bounded by spaces or punctuation marks in a text. According to this view, run, runs, running and ran are different words. But common usage also allows these to count as instances of the same word because they belong to the same paradigm in English accidence and are listed in the same entry in the dictionary."

Some of these problems, as well as the general question of what constitutes a lexical entry, whether or not lexical items should be related to one another, etc. have been much debated over the last 10 or 15 years within the framework of generative grammar. Considered as a relatively minor appendix of the phrase-structure rule component in the early days of generative grammar, the lexicon became little by little an autonomous component of the grammar with its own specific formalism -- lexical entries as matrices of features, as advocated by Chomsky (1965). Finally, it also acquired specific types of rules, the so-called word formation rules (cf. Halle, 1973; Aronoff, 1976; Lieber, 1980; Selkirk, 1983, and others), and lexical redundancy rules (cf. Jackendoff, 1975; Bresnan, 1977).

By and large, there seems to be widespread agreement among linguists that the lexicon should be viewed as the repository of all the idiosyncratic properties of the lexical items of a language (phonological, morphological, syntactic, semantic, etc.). This agreement quickly disappears, however, when it comes to defining what constitutes a lexical item, or, to put it slightly differently, what the lexicon is a list of, and how it should be organized.

Among the many proposals discussed in the linguistic literature, I will consider two radically opposed views that I shall call the morpheme-based and the word-based conceptions of the lexicon.

The morpheme-based lexicon corresponds to the traditional derivational view of the lexicon, shared by the structuralist school, many of the generative linguists and virtually all the computational linguists. According to this option, only non-derived morphemes are actually listed in the lexicon, complex words being derived by means of morphological rules. In contrast, in a word-based lexicon a la Jackendoff, all the words (simple and complex) are listed as independent lexical entries, derivational as well as inflectional relations being expressed by means of redundancy rules [1].

The crucial distinction between these two views of the lexicon has to do with the role of morphology. The morpheme-based conception of the lexicon advocates a dynamic view of morphology, i.e. a conception according to which "words are generated each time anew" (Hoekstra et al. 1980). This view contrasts with the static conception of morphology assumed in Jackendoff's word-based theory of the lexicon.

Interestingly enough, with the exception of some (usually very small) systems with no morphology at all, all the lexicons in computational linguistic projects seem to assume a dynamic conception of morphology.

The no-morphology option, which can be viewed as an extreme version of the word-based lexicon mentioned above modulo the redundancy rules, has been adopted mostly for convenience by researchers working on parsers for languages fairly uninteresting from the point of view of morphology, e.g. English. It has the non-trivial merit of reducing the lexical analysis to a simple dictionary look-up. Since all flectional forms of a given word are listed independently, all the orthographic words must be present in the lexicon. Thus, this option presents the double advantage of being simple and efficient. The price to pay is fairly high, though, in the sense that the resulting lexicon displays an enormous amount of redundancy: lexical information relevant for a whole class of morphologically related words has to be duplicated for every member of the class. This duplication of information, in turn, makes the task of updating and/or deleting lexical entries much more complex than it should be.

This option is more seriously flawed than just being redundant and space-greedy, though. By ignoring the obvious fact that words in natural languages do have some internal structure, may belong to declension or conjugation classes, but above all that different orthographical words may in fact realize the same grammatical word in different syntactic environments, it fails to be descriptively adequate. Interestingly enough, this inadequacy turns out to have serious consequences. Consider, for example, the case of a translation system. Because a lexicon of this exhaustive list type has no way of representing a notion such as "lexeme", it lacks the proper level for lexical transfer. Thus, if been, was, were, am and be are treated as independent words, what should be their translation, say in French, especially if we assume that the French lexicon is organized on the same model? The point is straightforward: there is no way one can give translation equivalents for orthographic words. Lexical transfer can only be made at the more abstract level of lexeme. The choice of a particular orthographic word to realize this lexeme is strictly language dependent. In the previous example, assuming that, say, were is to be translated as a form of the verb etre, the choice of the correct flectional form will be governed by various factors and properties of the French sentence. In other words, a transfer lexicon must state the fact that the verb to be is translated in French by etre, rather than the lower level fact that under some circumstances were is translated by etaient.

The problems caused by the size and the complexity of natural language lexicons, as well as the basic inadequacy of the "no morphology" option just described, have been long acknowledged by computational linguists, in particular by those involved in the development of large-scale application programs such as machine translation. It is thus hardly surprising that some version of the morpheme-based lexicon has been the option common to all large natural language systems. There is no doubt that restricting the lexicon to basic morphemes and deriving all complex words as well as all the inflected forms by morphological rules reduces substantially the size of the lexicon. This was indeed a crucial issue not so long ago, when computer memory was scarce and expensive.

[1] This point was already noticed in Halle (1973), who suggested that in addition to the list of morphemes and the word formation rules which characterize the set of possible words, there must exist a list of actual words which functions as a [...]
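The argument that lexical transfer belongs at the level of the lexeme rather than the orthographic word can be illustrated with a small sketch. It is not the implementation described in this paper: the names `Entry`, `TRANSFER`, `lexeme_of`, `realize` and `translate`, and the feature labels, are hypothetical and chosen only for illustration. Every orthographic word is listed as a fully specified entry, each entry is related to a lexeme, and the transfer lexicon states only that BE corresponds to ETRE.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    form: str            # orthographic word, listed in full
    lexeme: str          # abstract lexeme that the form realizes
    features: frozenset  # inflectional features (illustrative labels only)


def entry(form, lexeme, *features):
    return Entry(form, lexeme, frozenset(features))


# Word-based lexicons: every form is an entry, related to the others
# through a shared lexeme (a simplification of redundancy rules).
ENGLISH = [
    entry("be", "BE", "inf"),
    entry("am", "BE", "pres", "1", "sg"),
    entry("was", "BE", "past", "sg"),
    entry("were", "BE", "past", "pl"),
    entry("been", "BE", "pastpart"),
]
FRENCH = [
    entry("etre", "ETRE", "inf"),
    entry("etait", "ETRE", "imperfect", "3", "sg"),
    entry("etaient", "ETRE", "imperfect", "3", "pl"),
]

# Transfer lexicon: stated once, lexeme to lexeme, never word to word.
TRANSFER = {"BE": "ETRE"}


def lexeme_of(form, lexicon):
    """Lexical analysis stays a simple look-up on the orthographic form."""
    for e in lexicon:
        if e.form == form:
            return e.lexeme
    raise KeyError(form)


def realize(lexeme, features, lexicon):
    """Pick the target form from features supplied by the target sentence."""
    wanted = frozenset(features)
    for e in lexicon:
        if e.lexeme == lexeme and e.features == wanted:
            return e.form
    raise KeyError((lexeme, sorted(wanted)))


def translate(form, target_features):
    target_lexeme = TRANSFER[lexeme_of(form, ENGLISH)]
    return realize(target_lexeme, target_features, FRENCH)
```

Under these assumptions, `translate("were", ["imperfect", "3", "pl"])` selects etaient only because the target features request it. The transfer table itself never mentions were or etaient, which is the point made above about the proper level for lexical transfer.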
