A Dictionary and Morphological Analyser for English G.J
Total Page:16
File Type:pdf, Size:1020Kb
A Dictionary and Morphological Analyser for English G.J. Russell G.D. Rttchle S.G. Puhnan A.W. Black Computer Laboratory, Department of University of Cambridge Artificial Intelligence, University of Edinburgh 1. Introduction and Overview associated values, a morphosyntactlc category, and a rule This paper describes the current state of a three-year Identifier. project aimed at the development of software for use in The system is written in FRANZ LISP (opus 42.15) handling large quantities of dictionary information within running under Berkeley 4.2 Unix. Future developments natural language processing systems. 1 The project was will concentrate on improving its efficiency, in particular accepted for funding by SERC/Alvey commencing ill by restructuring the code. We also hope to produce an June 1984, and is being carried out by Graeme Ritchie implementation in C, which should offer a faster and Alan Black at the University of Edinburgh and response time. Steve Puhnan and Graham Russell at the University of Cambridge. It is one of three closely related projects 2. Linguistic Assumptions funded under the Alvey IKBS Programme (Natural The grammatical framework underlying the linguistic Language Tlleme); a parser is under development at aspects of the system is that of Generalized Phrase Edinburgh by Henry Thompson and John Phillips, and a Structure Grammar, as set out in Gazdar et al. (1985). sentence grammar is being devised by Ted Briscoe and Morphological categories employed here correspond to the Clare Grover at Lancaster and Bran Boguraev and John syntactic categories in that work, and the type of syn- Carroll at Cambridge. It is intended that the software tactic information present in dictionary entries is and rules produced by all three projects will be directly intended to facilitate the use of the system as part of a compatible and capable of functioning in an integrated more general GPSG-based program. In developing our system. prototype, have adopted many of the proposals made Realistic and useful natural language processing sys- we tems such as database front-ends require large numbers in that work. To that extent, certain assumptions about a correct analysis of English sentence syntax are built in of words, together with associated syntactic and semantic to the lexlcal entries, but this should not preclude adap- Information, to be efficiently stored in machine-readable tation by users to suit different analyses. form. Our system is Intended to provide the necessary Following what has become a general assumption in facilities, being designed to store a large number (at least syntactic theory, we take the major lexlcal categories to 10,000) of words and to perform morphological analysis be partitioned into four classes by the two binary-valued on them, covering both Inflectional and derlvatlonal mor- phology. In pursuit of these objectives, the dictionary features [+ N] and [:k V]. The major lexlcat categories have phrasal projections; these are distinguished from associates with each word information concerning its their lexlcal counterparts by their value for the feature morphosyntactlc properties. Users are free to modify the BAR. Lexlcal categories have the value 0, and phrasal system In a number of ways; they may add to the lexi- categories (including sentences) have the value 1 or 2. cal entries Lisp functions that perform semantic manipu- Thus, a Noun Phrase is of the category: latlons, and tailor the dictionary to the particular subject matter they are interested in (different databases, for ((V -) (N +) (BAR 2)) example). It Is also hoped that the system is general In our analysis, 'bound morphemes', that is to say enough to be of use to linguists wishing to Investigate prefxes and suffixes, are distinguished from others by the morphology of English and other languages. Con- their BAR specification; tile suffix ing is the sole member tents of the basle data files may be altered or replaced: of the category: 1. A 'Word Grammar' file contains rules assigning inter- nal structure to complex words, ((V 4-) (N -) (VFORM ING) (BAR -1)) 2. A 'Lexicon' file holds the morpheme entries which As in other GPSG-based work, our analysis encodes the include syntactic and other Information associated subcategorlzational prbpertles of lexlcal Items in the value with stems and affixes. of a feature SUBCAT. Transitive verbs such as devour 3. A 'Spelling Rules' file contains rules governing permis- are specified as (SUBCAT NP), and Intransitives such as sible correspondences between the form of morphemes elapse as (SUBCAT NULL). listed in the lextcon and complex words consisting of As an example from the current analysis of how the sequences of these morphemes. system can operate to produce well-formed words, con- Once these data flies have been prepared, they are com- sider the familiar fact of English morphology that no piled using a number of pre-processtng functions that word may contain more than one imqection. The word operate to produce a set of output files. These grammar must permit both walked and walking, but not constitute a fully expanded and cross-Indexed dictionary walkinged. This is achiev~xi by restricting the distribu- which can then be accessed from within LISP. tion of inflectional suffixes so that they attach to non- The process of morphological analysis consists of pars- Inflected stems only. A general statement of this type lng a sequence of Input morphemes with respect to the of restriction is made in terms of a feature INFL: stems word grammar, It Is Implemented as an active chart specified as (INFL +) may take an lnflecUonal sulfix, parser (Thompson & Rltchle (1984)), and builds a struc- while those specified as (INFL ~) may not. The STEM ture in the form of a tree in which each node has two feature described in section 4 provides one means of 1 This work is supported by SERC/AIvey grant number enforcing correct stem-affix combinations; if the suffixes GR/C/79114. ed and ing are specified with (STEM ((INFL +))), they 277 will attach only to categories which Include the al. (1968)), and the other holding the expanded entries specification (INFL +). Walk, as a regular verb, is so relating to them. The comptlatlon of a lexicon can take specified; wallced and waltcing are therefore accepted. Ed, a considerable amount of time; our prototype incorporates ing, other tnfectlonal suffixes, and irregular (i.e. a lexicon with approximately 3500 entries, which com- unlnflectable) words, however, are specified as (INFL -). plies In approximately ninety minutes. Our grammar assigns a binary structure to the words in question. In order for this method to prevent e.g. walk- 4. The Word Grammar inged, the stem walking must also bear the (INFL -) The internal structure of words is handled by a specification. This it does, since we regard sutfixes as unification feature grammar with rules of the form: being the head of a word, and as contributing to the categorial content of the word as a whole. If the INFL mother -~ daughter 1 daughter 2 ... specification of the suf~x is copied into the mother where 'mother', 'daughtcrl', etc. are categories. A rule category, the STEM specification of a further suffix will which adds the plural morpheme to a noun might be not be satisfied. See section 4 for more discussion of given as shown below: these matters. ((BAR 0) (V -) (N +) (PLU +) (INFL -)) => 3. The Lexicon ((BAR 0) (V -) (N +) (INFL +)) The lexicon itself consists of a sequence of entries, each ((BAR -1) (V -) (N 4-) (PLU 4-) (INFL -)) in the form of a Lisp s-expression. An entry has five elements: (1) and (ii) the head word, in its written form The system provides two methods of writing rules in a and in a phonological transcription, (ill) a 'syntactic more general form; variables and feature-passing conven- tions. field', (iv) a 'semantic field', and (v) a 'user field'. The semantic field has been provided as a facility for users, In our grammar, the category and inflectabllity of a and any Lisp s-expression can be inserted here. No suffixed word are determined by the category and significant semantic information is present in our entries, lnflectablllty of the suffix; in the rule below, ALPHA, beyond the fact that e.g. better and best are related in BETA, and GAMMA are variables ranging over the set meaning to good. of values {+, -}: Similarly, the user feld Is unexploited, being occupied ((V ALPHA)(N BETA)(INFL GAMMA)(BAR 0)) => in all cases by the atom 'nil'. It serves primarily as a ((BAR 0)) place-holder, in that, while it is desirable to maintain ((V ALPHA)(N BETA)(INFL GAMMA)(BAR -1)) the possibility for users to include in an entry whatever additional information they desire, the form which that Since variables are interpreted consistently throughout a Information might take in practice is clearly not predict- rule, the mother category and suffix will be identical In able. their specifications for N, V and INFL. As an alternative to variables, feature passing conven- The syntax field consists of a syntactic category, as tions are also available. These relate categories in what defined by Gazdar et al. (1985), i.e. a set of feature- Gazdar et al. (1.985) term 'local trees', i.e. sections of value pairs. Some of these are relevant only to the morphological structure consisting of a mother category workings of the word grammar, and may thus be and all of Its immediate daughters. The conventions Ignored by other components In an integrated natural refer to 'pre-lnstantlatlon' features; these are features language processing system. Their purpose is to control present in the categories mentioned In the relevant rule. the distribution of morphemes in complex words, as 'Extension' and 'unification' are meant In the sense of described in the following section.