<<

A and Morphological Analyser for English G.J. Russell G.D. Rttchle S.G. Puhnan A.W. Black Laboratory, Department of University of Cambridge Artificial Intelligence, University of Edinburgh 1. Introduction and Overview associated values, a morphosyntactlc category, and a rule This paper describes the current state of a three-year Identifier. project aimed at the development of software for use in The system is written in FRANZ LISP (opus 42.15) handling large quantities of dictionary information within running under Berkeley 4.2 Unix. Future developments natural processing systems. 1 The project was will concentrate on improving its efficiency, in particular accepted for funding by SERC/Alvey commencing ill by restructuring the code. We also hope to produce an June 1984, and is being carried out by Graeme Ritchie implementation in C, which should offer a faster and Alan Black at the University of Edinburgh and response time. Steve Puhnan and Graham Russell at the University of Cambridge. It is one of three closely related projects 2. Linguistic Assumptions funded under the Alvey IKBS Programme (Natural The grammatical framework underlying the linguistic Language Tlleme); a parser is under development at aspects of the system is that of Generalized Phrase Edinburgh by Henry Thompson and John Phillips, and a Structure Grammar, as set out in Gazdar et al. (1985). sentence grammar is being devised by Ted Briscoe and Morphological categories employed here correspond to the Clare Grover at Lancaster and Bran Boguraev and John syntactic categories in that work, and the type of syn- Carroll at Cambridge. It is intended that the software tactic information present in dictionary entries is and rules produced by all three projects will be directly intended to facilitate the use of the system as part of a compatible and capable of functioning in an integrated more general GPSG-based program. In developing our system. prototype, have adopted many of the proposals made Realistic and useful natural language processing sys- we tems such as database front-ends require large numbers in that work. To that extent, certain assumptions about a correct analysis of English sentence syntax are built in of , together with associated syntactic and semantic to the lexlcal entries, but this should not preclude adap- Information, to be efficiently stored in machine-readable tation by users to suit different analyses. form. Our system is Intended to provide the necessary Following what has become a general assumption in facilities, being designed to store a large number (at least syntactic theory, we take the major lexlcal categories to 10,000) of words and to perform morphological analysis be partitioned into four classes by the two binary-valued on them, covering both Inflectional and derlvatlonal mor- phology. In pursuit of these objectives, the dictionary features [+ N] and [:k V]. The major lexlcat categories have phrasal projections; these are distinguished from associates with each information concerning its their lexlcal counterparts by their value for the feature morphosyntactlc properties. Users are free to modify the BAR. Lexlcal categories have the value 0, and phrasal system In a number of ways; they may add to the lexi- categories (including sentences) have the value 1 or 2. cal entries Lisp functions that perform semantic manipu- Thus, a Noun Phrase is of the category: latlons, and tailor the dictionary to the particular subject matter they are interested in (different databases, for ((V -) (N +) (BAR 2)) example). It Is also hoped that the system is general In our analysis, 'bound ', that is to say enough to be of use to linguists wishing to Investigate prefxes and suffixes, are distinguished from others by the morphology of English and other . Con- their BAR specification; tile suffix ing is the sole member tents of the basle data files may be altered or replaced: of the category: 1. A 'Word Grammar' file contains rules assigning inter- nal structure to complex words, ((V 4-) (N -) (VFORM ING) (BAR -1)) 2. A '' file holds the entries which As in other GPSG-based work, our analysis encodes the include syntactic and other Information associated subcategorlzational prbpertles of lexlcal Items in the value with stems and affixes. of a feature SUBCAT. Transitive verbs such as devour 3. A ' Rules' file contains rules governing permis- are specified as (SUBCAT NP), and Intransitives such as sible correspondences between the form of morphemes elapse as (SUBCAT NULL). listed in the lextcon and complex words consisting of As an example from the current analysis of how the sequences of these morphemes. system can operate to produce well-formed words, con- Once these data flies have been prepared, they are com- sider the familiar fact of English morphology that no piled using a number of pre-processtng functions that word may contain more than one imqection. The word operate to produce a set of output files. These grammar must permit both walked and walking, but not constitute a fully expanded and cross-Indexed dictionary walkinged. This is achiev~xi by restricting the distribu- which can then be accessed from within LISP. tion of inflectional suffixes so that they attach to non- The process of morphological analysis consists of pars- Inflected stems only. A general statement of this type lng a sequence of Input morphemes with respect to the of restriction is made in terms of a feature INFL: stems word grammar, It Is Implemented as an active chart specified as (INFL +) may take an lnflecUonal sulfix, parser (Thompson & Rltchle (1984)), and builds a struc- while those specified as (INFL ~) may not. The STEM ture in the form of a tree in which each node has two feature described in section 4 provides one means of 1 This work is supported by SERC/AIvey grant number enforcing correct stem-affix combinations; if the suffixes GR/C/79114. ed and ing are specified with (STEM ((INFL +))), they

277 will attach only to categories which Include the al. (1968)), and the other holding the expanded entries specification (INFL +). Walk, as a regular verb, is so relating to them. The comptlatlon of a lexicon can take specified; wallced and waltcing are therefore accepted. Ed, a considerable amount of time; our prototype incorporates ing, other tnfectlonal suffixes, and irregular (i.e. a lexicon with approximately 3500 entries, which com- unlnflectable) words, however, are specified as (INFL -). plies In approximately ninety minutes. Our grammar assigns a binary structure to the words in question. In order for this method to prevent e.g. walk- 4. The Word Grammar inged, the stem walking must also bear the (INFL -) The internal structure of words is handled by a specification. This it does, since we regard sutfixes as unification feature grammar with rules of the form: being the head of a word, and as contributing to the categorial content of the word as a whole. If the INFL mother -~ daughter 1 daughter 2 ... specification of the suf~x is copied into the mother where 'mother', 'daughtcrl', etc. are categories. A rule category, the STEM specification of a further suffix will which adds the plural morpheme to a noun might be not be satisfied. See section 4 for more discussion of given as shown below: these matters. ((BAR 0) (V -) (N +) (PLU +) (INFL -)) => 3. The Lexicon ((BAR 0) (V -) (N +) (INFL +)) The lexicon itself consists of a sequence of entries, each ((BAR -1) (V -) (N 4-) (PLU 4-) (INFL -)) in the form of a Lisp s-expression. An entry has five elements: (1) and (ii) the head word, in its written form The system provides two methods of writing rules in a and in a phonological transcription, (ill) a 'syntactic more general form; variables and feature-passing conven- tions. field', (iv) a 'semantic field', and (v) a 'user field'. The semantic field has been provided as a facility for users, In our grammar, the category and inflectabllity of a and any Lisp s-expression can be inserted here. No suffixed word are determined by the category and significant semantic information is present in our entries, lnflectablllty of the suffix; in the rule below, ALPHA, beyond the fact that e.g. better and best are related in BETA, and GAMMA are variables ranging over the set meaning to good. of values {+, -}: Similarly, the user feld Is unexploited, being occupied ((V ALPHA)(N BETA)(INFL GAMMA)(BAR 0)) => in all cases by the atom 'nil'. It serves primarily as a ((BAR 0)) place-holder, in that, while it is desirable to maintain ((V ALPHA)(N BETA)(INFL GAMMA)(BAR -1)) the possibility for users to include in an entry whatever additional information they desire, the form which that Since variables are interpreted consistently throughout a Information might take in practice is clearly not predict- rule, the mother category and suffix will be identical In able. their specifications for N, V and INFL. As an alternative to variables, feature passing conven- The syntax field consists of a syntactic category, as tions are also available. These relate categories in what defined by Gazdar et al. (1985), i.e. a set of feature- Gazdar et al. (1.985) term 'local trees', i.e. sections of value pairs. Some of these are relevant only to the morphological structure consisting of a mother category workings of the word grammar, and may thus be and all of Its immediate daughters. The conventions Ignored by other components In an integrated natural refer to 'pre-lnstantlatlon' features; these are features language processing system. Their purpose is to control present in the categories mentioned In the relevant rule. the distribution of morphemes in complex words, as 'Extension' and 'unification' are meant In the sense of described in the following section. Gazdar et al. (1985), q.v. The content of a syntax field is often at least par- tlally predictable. This fact allows us to employ as an The Word-Head Convention: aid to users wishing to write their own dictionary rules After lnstantlatlon, the set of WHead features in the which add information to the lexicon during the compi- mother is the unification of the pre-lnstantlatlon lation process. Recall that, in our analysis of English, WHead features of the Mother with the pre- the lnflectablllty of a word is governed by the value in lnstantlatlon WHead features of the Rlghtdaughter. that word's category for INFL. Completion Rules (CRs) This convention is analogous to the simplest case of the can be written that will add the specification (INFL-) Head Feature Convention in Gazdar et at. (1985). to any entry already Including (PLU +) (for e.g. men), Although there is no formal notion of 'head' in the sys- (AFORM ER) (for e.g. (VFORM ING), etc,, thus worse), tem, this convention embodies the Implicit claim that the removing the need to state Individually that a given head in a local tree is always the right daughter. If the word cannot be inflected. daughters are a prefix and a stem (as in e.g. re-apply), A second means of reducing the amount of prepara- the WHead features of the stem are passed up to the tory work is provided in the form of Multiplication mother. Features encoding morphosyntactic category can Rules (MRs). Whereas CRs add further specifications to be declared as members of the WHead set, and re-apply a single entry, MRs have the effect of Increasing the is then of the same category as, and shares various number of entries In some principled way. One applica- sentence-level syntactic properties with, apply. If the tion of MRs Is to express the fact that nouns and adjec- daughters are a stem and a suffix, the category of the tlves do not subcategorize for obligatory complements. mother Is determined not by the stem, but rather by the A MR can be written which, for each entry containing suffix. For example, possible and ity may be combined to the specification (N +) and some non-NULL value for form possibility, whose 'nountness' is due to the category SUBCAT, produces a copy of that entry where the SUB- of the suffix. CAT specification is replaced by (SUBCAT NULL). The lexicon complies Into two files, one holding mor- phemes stored in a tree-shaped structure (cf. Thorne et

278 The Word-Daughter Convention: additional e is Inserted when some nouns are suffixed (a) If any WDaughter features exist on the Right- with the plural morpheme +s: daughter then the WDaughter features on the Epenthesls Mother are the unification of the pre-lnstantlaUon +:e <=~> { < s:s h:h > s:s x:x z:z } --- s:s WDaughter features on the Mother with the pre- or < c:c h:h2> .... s:s lnstantlatlon WDaughter featm-es on the Right-. The epenthests rule states that e must be inserted at a daughter. morpheme boundary if an(:[ only if the boundary has to (b) If no WDaughter features exist on the Right- its left sh, s, x, z or eh and to Its right s. The daughter then the WDaughter features on the Interpretation of the rule Is simple; the character pair Mother are the unification of the pre-lnstantiatlon ('lexical character:surface character') to the left of the WDaughter features on the Mother with the pre- arrow specifies the change that takes place between the lnstantlation WDaughter features on the Left- contexts (again stated in character pairs) given to the daughter. right of the arrow. Braces ('{','}') Indicate disjunction The subcategorlzation class of a word remains constant and angled brackets Indicate a sequence, Alternative under Inflection, but is likely to be changed by the contexts may be specified using the word 'or'. IJexlcal attachment of a derlvatlonal suffix. Moreover, the sub- and surface strings of unequal length can be matched by categorization of a prefixed word is the same as that of using the null character '0', and special characters may its stem. The WDaughter convention is designed to be defined and used in rules, for example to cover the reflect these facts by enforcing a feature correspondence set of alphabetic characters representing vowels. between one of the daughters and the mother. When The spelling rules are able to match any pair of char- the feature set WDaughter is defined as including the acter strings. It would for example be possible to subcategorlzation feature SUBCAT, the convention results analyse the suppletlve went as a surface form in configuratkms such as: corresponding to the lexlcal form go+ed. In this case, ((SUBCAT NP)) ((SUBCAT NP)) four rules would be needed to effect the change, and a ((V +)(N +]) ((SUBCAT NP)) better solution is to list went separately In the lexicon. ((SUBCAT NP)) ((VFORM ING)) in practice, the choice between treating this type of alternation dynamically, with morphological and spelling which show the relevant feature specifications in local rules, and statically, by exploiting the lexicon directly, trees arising from suffixatton of an adjective with +ize to produce a transitive verb and suffixatlon of a transitive depends on the user's Idea of which is the more elegant verb with +ing to produce a present participle. solution. While elegance may be in the eye of the beholder, computational efficiency is mffortunately not. The Word-Sister Convention: I[ will generally be more efficient to list a word In the When one daughter is specified for STEM, the lexicon titan to add spelling or morphological rules category of the other daughter must be an extension specific to small number of cases. of the value of STEM. The purpose of this third convention is to allow the Rel'el'ellCCS subcategorization of affixes with respect to the type of Gazdar, G., E. Klein, G.K. Pullmn, and I.A. Sag (1985) stem they may attach to. The behavlour of affixes that Generalized Phrase Structure Grammar. : attach to more than one category can be handled natur- Blackwells. ally by giving them a suitable specification for STEM. Karttunen, L. (1983) "KIM:MO - A General Morphologi- If it is desired to have anti- attached to both nouns and cal Processor", in Texas Linguistic Forum 22, 165 - adjectives, for example, the specification (STEM ((N +))) 186. Department of , University of Texas, will have that effect, since both adjectives and nouns are Austin, Texas. extensions of the category ((N +)1. Koskennieml, K. (1983a) "Two-level model for morpho- The user can define the sets WHead and WDaughter logical analysis", in Proceedings of the Eighth Interna- as he wishes, or, by leaving them undefined, avoid their tiona2 Joint Conference on AzTificial Intelligence, effects altogether. The feature STEM is built in, and Karlsruhe, 683 - 685. need not be defined. The effects of the Word-Sister Koskennleml, K. (1983b) Two-level Morphology: a general Convention can be modified by changing the STEM computational model for word-form recognition and pro- specifications ill the lexlcal entries, and avoided by duction, Publication No. 11, University of tIelslnkl, omitting them. Finland. Koskennteml, K. (1985) "Compilation of Automata from 5. The Spelling Rules Two-level Rules", talk given at the Workshop on Finite-State Morphoiogy, CSLI, Stanford, July, 1985. The rules are based on the work of Koskennlemt (1983a, Thompson, IL and G.D. Rltchte (19841 "Implementing 1983b, Karttunen 1983), though their application here is Natural Language Parsers", in T. O'Shea and M. Elsen+ solely to the question of 'morphographemlcs'; the more stadt (eds.) Az~tificial Intelligence: Tools, Techniques general morphological effects of Koskenniemi's rules are and Applications. New York: Harper and Row. produced dlffenmtly. The current version of the system contains a compiler allowing the rules to be written in a Thorne, J.P., P. Bratley, and, It. Dewar (1968) "The syn- high level notation based on KoskennIemi (1985). Any tactic analysis of English by machine", in D. Mlchie number of spelling rules can be employed, though our (ed.) Machine Intelligence 3. Edinburgh: Edinburgh system has fifleen. They are compiled during the gen- University Press. eral dictionary pre-processlng stage into deterministic finite state transducers, of which one tape represents the lexlcal form and the other the surface form. The following rule describes the process by which an

279