Grammatical Analysis by Computer of the Lancaster-Oslo/Bergen (LOB)
GRAMMATICAL ANALYSIS BY COMPUTER OF THE LANCASTER-OSLO/BERGEN (LOB) CORPUS OF BRITISH ENGLISH TEXTS.

Andrew David Beale
Unit for Computer Research on the English Language
Bowland College, University of Lancaster
Bailrigg, Lancaster, England LA1 4YT.

ABSTRACT

Research has been under way at the Unit for Computer Research on the English Language at the University of Lancaster, England, to develop a suite of computer programs which provide a detailed grammatical analysis of the LOB corpus, a collection of about 1 million words of British English texts available in machine readable form.

The first phase of the project, completed in September 1983, produced a grammatically annotated version of the corpus giving a tag showing the word class of each word token. Over 93 per cent of the word tags were correctly selected by using a matrix of tag pair probabilities and this figure was upgraded by a further 3 per cent by retagging problematic strings of words prior to disambiguation and by altering the probability weightings for sequences of three tags. The remaining 3 to 4 per cent were corrected by a human post-editor.

The system was originally designed to run in batch mode over the corpus but we have recently modified procedures to run interactively for sample sentences typed in by a user at a terminal. We are currently extending the word tag set and improving the word tagging procedures to further reduce manual intervention. A similar probabilistic system is being developed for phrase and clause tagging.

THE STRUCTURE AND PURPOSE OF THE LOB CORPUS.

The LOB Corpus (Johansson, Leech and Goodluck, 1978), like its American English counterpart, the Brown Corpus (Kucera and Francis, 1964; Hauge and Hofland, 1978), is a collection of 500 samples of British English texts, each containing about 2,000 word tokens. The samples are representations of 15 different text categories: A. Press (Reportage); B. Press (Editorial); C. Press (Reviews); D. Religion; E. Skills and Hobbies; F. Popular Lore; G. Belles Lettres, Biography, Memoirs, etc.; H. Miscellaneous; J. Learned and Scientific; K. General Fiction; L. Mystery and Detective Fiction; M. Science Fiction; N. Adventure and Western Fiction; P. Romance and Love Story; R. Humour. There are two main sections, informative prose and imaginative prose, and all the texts contained in the corpus were printed in a single year (1961).

The structure of the LOB corpus was designed to resemble that of the Brown corpus as closely as possible so that a systematic comparison of British and American written English could be made. Both corpora contain samples of texts published in the same year (1961) so that comparisons are not distorted by diachronic factors.

The LOB corpus is used as a database for linguistic research and language description. Historically, different linguists have been concerned to a greater or lesser extent with the use of corpus citations, to some degree, at least, because of differences in the perceived view of the descriptive requirements of grammar. Jespersen (1909-49), Kruisinga and Erades (1911) gave frequent examples of citations from assembled corpora of written texts to illustrate grammatical rules. Work on text corpora is, of course, very much alive today. Storage, retrieval and processing of natural language text is a more efficient and less laborious task with modern computer hardware than it was with hand-written card files but data capture is still a significant problem (Francis, 1980). The forthcoming work, A Comprehensive Grammar of the English Language (Quirk, Greenbaum, Leech, and Svartvik, 1985) contains many citations from both LOB and Brown Corpora.

A GRAMMATICALLY ANNOTATED VERSION OF THE CORPUS

Since 1981, research has been directed towards writing programs to grammatically annotate the LOB corpus. From 1981-83, the research effort produced a version of the corpus with every word token labelled by a grammatical tag showing the word class of each word form. Subsequent research has attempted to build on the techniques used for automatic word tagging by using the output from the word tagging programs as input to phrase and clause tagging and by using probabilistic methods to provide a constituent analysis of the LOB corpus.

The programs and data files used for word tagging were developed from work done at Brown University (Greene and Rubin, 1971). Staff and research associates at Lancaster undertook the programming in PASCAL while colleagues in Oslo revised and extended the lists used by Greene and Rubin (op.cit.) for word tag assignment. Half of the corpus was post-edited at Lancaster and the other half at the Norwegian Computing Centre for the Humanities.

How word tagging works.

The major difficulties to be encountered with word tagging of written English are the lack of distinctive inflectional or derivational endings and the large proportion of word forms that belong to more than one word class. Endings such as -able, -ly and -ness are graphic realizations of morphological units indicating word class, but they occur infrequently for the purposes of automatic word tag assignment; the reader will be able to establish exceptions to rules assigning word classes to words with these suffixes, because the characters do not invariably represent the same morphemes.

The solution we have adopted is to use a look up procedure to assign one or more potential tags to each input word. The appropriate word tag is then selected for words with more than one potential tag by calculating the probability of the tag's occurrence given neighbouring potential tags.

Potential word tag assignment.

In cases where more than one potential tag is assigned to the input word, the tags represent word classes of the word without taking the syntactic environment into account. A list of one to five word final characters, known as the 'suffixlist', is used for assignment of appropriate word class tags to as many word types as possible. A list of full word forms, known as the 'wordlist', is used for exceptions to the suffixlist, and, in addition, word forms that occur more than 50 times in the corpus are included in the wordlist, for speed of processing. The term 'suffixlist' is used as a convenient name, and the reader is warned that the list does not necessarily contain word final morphs; strings of between one and five word final characters are included if their occurrence as a tagged form in the Brown corpus merits it.

The 'suffixlist' used by Greene and Rubin (op.cit.) was substantially revised and extended by Johansson and Jahr (1982) using reverse alphabetical lists of approximately 50,000 word types of the Brown Corpus and 75,000 word types of both Brown and LOB corpora. Frequency lists specifying the frequency of tags for word endings consisting of 1 to 5 characters were used to establish the efficiency of each rule. Johansson and Jahr were guided by the Longman Dictionary of Contemporary English (1978) and other dictionaries and grammars including Quirk, Greenbaum, Leech and Svartvik (1972) in identifying tags for each item in the wordlist. For the version used for Lancaster-Oslo/Bergen word tagging (1985), the suffixlist was expanded to about 7~90 strings of word final characters, the wordlist consisted of about 7,000 entries and a total of 135 word tag types were used.

Potential tag disambiguation.

The problem of resolving lexical ambiguity for the large proportion of English words that occur in more than one word class (BLOW, CONTACT, HIT, LEFT, RA2~, RUN, REFUSE, ROSE, 'dALE, WATCH ...) is solved, whenever possible, by examining the local context. Word tag selection for homographs in Greene and Rubin (op.cit.) was attempted by using 'context frame rules', an ordered list of 5,300 rules designed to take into account the tags assigned to up to two words preceding or following the ambiguous homograph. The program was 77 per cent successful but several errors were due to appropriate rules being blocked when adjacent ambiguities were encountered (Marshall, 1983: 140). Moreover, about 80 per cent of rule application took just one immediately neighbouring tag into account, even though only a quarter of the context frame rules specified only one immediately neighbouring tag.

To overcome these difficulties, research associates at Lancaster have devised a transition probability matrix of tag pairs to compute the most probable tag for an ambiguous form given the immediately preceding and following tags. This method of calculating one-step transition probabilities is suitable for disambiguating strings of ambiguously tagged words because the most likely path through a string of ambiguously tagged words can be calculated.

The likelihood of a tag being selected in context is also influenced by likelihood markers which are assigned to entries with more than one tag in the lists. Only two markers, '@' and '%', are used, '@' notionally indicating that the tag is correct for the associated form less than 1 in 10 occasions, '%' notionally indicating that the tag occurs less than 1 in 100 occasions.

[...] at least some kind of analysis for unusual or eccentric syntax and prevented the system from grinding to a halt when confronted with a construction that it did not recognize.

Once these refinements to the suite of word tagging programs were made, the corpus was word-tagged. It was estimated that the number of manual post-editing interventions had been reduced from about 230,000 required for word tagging of the Brown corpus to about 35,000 required for the LOB corpus (Leech, Garside and Atwell, 1983: 36). The method achieves far greater consistency than could be attained by a human, were such a person able to labour through the task of attributing a tag to every word token in
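The two steps described under "Potential word tag assignment" and "Potential tag disambiguation" might be sketched as follows. This is a minimal illustration in Python, not the PASCAL programs themselves. The tag names, the list entries, the probabilities, the longest-ending-first matching order, the fallback tags for unmatched words, and the treatment of the '@' and '%' markers as multiplicative factors of 0.1 and 0.01 are all assumptions made for the example.

```python
from itertools import product

# Hypothetical toy lists; the real wordlist and suffixlist are far larger.
WORDLIST = {
    "the": ["AT"],
    "watch": ["NN", "VB"],
    "run": ["VB", "NN@"],  # '@' marks a tag assumed to be rare for this form
}
SUFFIXLIST = {"ness": ["NN"], "ly": ["RB"], "s": ["NNS", "VBZ"]}
DEFAULT_TAGS = ["NN", "VB", "JJ"]  # assumed fallback, not from the text

# Assumed scaling factors for the two likelihood markers.
MARKER_WEIGHT = {"@": 0.1, "%": 0.01}

# Hypothetical tag pair transition probabilities; "<s>" is a sentence start.
TRANSITION = {("<s>", "AT"): 0.3, ("AT", "NN"): 0.6, ("AT", "VB"): 0.01}
SMOOTH = 0.001  # assumed small value for tag pairs missing from the matrix


def potential_tags(word):
    """Look the word up in the wordlist, then try word endings of 5 to 1 characters."""
    w = word.lower()
    if w in WORDLIST:
        return WORDLIST[w]
    for n in range(min(5, len(w)), 0, -1):  # longest ending first (assumption)
        if w[-n:] in SUFFIXLIST:
            return SUFFIXLIST[w[-n:]]
    return DEFAULT_TAGS


def split_marker(tag):
    """Separate a trailing '@' or '%' from a tag and return (tag, weight)."""
    if tag[-1] in MARKER_WEIGHT:
        return tag[:-1], MARKER_WEIGHT[tag[-1]]
    return tag, 1.0


def disambiguate(words):
    """Return the tag sequence with the highest product of transition
    probabilities and marker weights, by enumerating every candidate path."""
    candidates = [[split_marker(t) for t in potential_tags(w)] for w in words]
    best_path, best_score = None, -1.0
    for path in product(*candidates):
        score, prev = 1.0, "<s>"
        for tag, weight in path:
            score *= TRANSITION.get((prev, tag), SMOOTH) * weight
            prev = tag
        if score > best_score:
            best_path, best_score = [tag for tag, _ in path], score
    return best_path


print(potential_tags("happiness"))     # ['NN']
print(disambiguate(["the", "watch"]))  # ['AT', 'NN']
print(disambiguate(["the", "run"]))    # ['AT', 'NN']
```

In the last call the noun reading of "run" is chosen even though it carries the '@' marker, because the assumed article-noun transition outweighs the penalty. Enumerating every path is only practical for short ambiguous spans; a dynamic programming search over the same tag pair matrix would scale better.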