Lexical Resource Reconciliation in the Xerox Linguistic Environment
Ronald M. Kaplan
Xerox Palo Alto Research Center
3333 Coyote Hill Road
Palo Alto, CA, 94304, USA
kaplan@parc.xerox.com

Paula S. Newman
Xerox Palo Alto Research Center
3333 Coyote Hill Road
Palo Alto, CA, 94304, USA
pnewman@parc.xerox.com

Abstract

This paper motivates and describes those aspects of the Xerox Linguistic Environment (XLE) that facilitate the construction of broad-coverage Lexical Functional grammars by incorporating morphological and lexical material from external resources. Because that material can be incorrect, incomplete, or otherwise incompatible with the grammar, mechanisms are provided to correct and augment the external material to suit the needs of the grammar developer. This can be accomplished without direct modification of the incorporated material, which is often infeasible or undesirable. Externally-developed finite-state morphological analyzers are reconciled with grammar requirements by run-time simulation of finite-state calculus operations for combining transducers. Lexical entries derived by automatic extraction from on-line dictionaries or via corpus-analysis tools are incorporated and reconciled by extending the LFG lexicon formalism to allow fine-tuned integration of information from different sources.

1 Introduction

The LISP-based LFG Grammar Writers Workbench (GWB) (Kaplan and Maxwell, 1996) has long served both as a testbed for the development of parsing algorithms (Maxwell and Kaplan, 1991; Maxwell and Kaplan, 1993) and as a self-contained environment for work in syntactic theory. The C/Unix-based Xerox Linguistic Environment (XLE) further develops the GWB parsing algorithms, extends them to generation, and adapts the environment to a different set of requirements.

This paper motivates and describes the morphological and lexical adaptations of XLE. They evolved concurrently with PARGRAM, a multi-site XLE-based broad-coverage grammar writing effort aimed at creating parallel grammars for English, French, and German (see Butt et al., forthcoming). The XLE adaptations help to reconcile separately constructed linguistic resources with the needs of the core grammars.

The paper is divided into three major sections. The next section sets the stage by providing a short overview of the overall environmental features of the original LFG GWB and its provisions for morphological and lexicon processing. The two following sections describe the XLE extensions in those areas.

2 The GWB Data Base

GWB provides a computational environment tailored especially for defining and testing grammars in the LFG formalism. Comprehensive editing facilities internal to the environment are used to construct and modify a data base of grammar elements of various types: morphological rules, lexical entries, syntactic rules, and "templates" allowing named abbreviations for combinations of constraints. (See Kaplan and Maxwell, 1996; Kaplan and Bresnan, 1982; and Kaplan, 1995 for descriptions of the LFG formalism.) Separate "configuration" specifications indicate how to select and assemble collections of these elements to make up a complete grammar, and alternative configurations make it easy to experiment with different linguistic analyses.

This paper focuses on the lexical mapping process, that is, the overall process of translating between the characters in an input string and the initial edges of the parse-chart. We divide this process into the typical stages of tokenization, morphological analysis, and LFG lexicon lookup. In GWB tokenizing is accomplished with a finite-state transducer compiled from a few simple rules according to the methods described by Kaplan and Kay (1994).
It tokenizes the input string by inserting explicit token boundary symbols at appropriate character positions. This process can produce multiple outputs because of uncertainties in the interpretation of punctuation such as spaces and periods. For example, "I like Jan." results in two alternatives ("I@like@Jan@.@" and "I@like@Jan.@.@") because the period in "Jan." could optionally mark an abbreviation as well as a sentence end.

Morphological analysis is also implemented as a finite-state transducer, again compiled from a set of rules. These rules are limited to describing only simple suffixing and inflectional morphology. The morphological transducer is arranged to apply to individual tokens produced by the tokenizer, not to strings of tokens. The result of applying the morphological rules to a token is a stem and one or more inflectional tags, each of which is the heading for an entry in the LFG lexicon. Morphological ambiguity can lead to alternative analyses for a single token, so this stage can add further possibilities to the alternatives coming from the tokenizer. The token "cooks" can be analyzed as "cook +NPL" or "cook +V3SG", for instance.

In the final phase of GWB lexical mapping, these stem-tag sequences are looked up in the LFG lexicon to discover the syntactic category (N, V, etc.) and constraints (e.g. (↑ NUM)=PL) to be placed on a single edge in the initial parse chart. The category of that edge is determined by the particular combination of stems and tags, and the corresponding edge constraints are formed by conjoining the constraints found in the stem/tag lexical entries. Because of the ambiguities in tokenization, morphological analysis and also lexical lookup, the initial chart is a network rather than a simple sequence.

The grammar writer enters morphological rules, syntactic rules, and lexical entries into a database. These are grouped by type into named collections. The collections may overlap in content in that different syntactic rule collections may contain alternative expansions for a particular category and different lexical collections may contain alternative definitions for a particular headword. A configuration contains an ordered list of collection names to indicate which alternatives to include in the active grammar.

This arrangement provides considerable support for experimentation. The grammar writer can investigate alternative hypotheses by switching among configurations with different inclusion lists. Also, the inclusion list order is significant, with collections mentioned later in the list having higher precedence than ones mentioned earlier. If a rule for the same syntactic category appears in more than one included rule collection, or an entry for the same headword appears in more than one included lexical collection, the instance from the collection of highest precedence is the one included in the grammar. Thus the grammar writer can tentatively replace a few rules or lexical entries by placing some very small collections containing the replacements later in the configuration list.

We constructed XLE around the same database-plus-configuration model but adapted it to operate in the C/Unix world and to meet an additional set of user requirements. GWB is implemented in a residential Lisp system where rules and definitions on text files are "loaded" into a memory-based database and then selected and manipulated. In C/Unix we treat the files themselves as the analog of the GWB database. Thus, the XLE user executes either a "create-parser" or "create-generator" command to select a file containing one or more configurations and to select one of those configurations to specify the current grammar. The selected configuration, in turn, names a list of files comprising the data base, and identifies the elements in those files to be used in the grammar.

This arrangement still supports alternative versions of lexical entries and rules, but the purpose is not just to permit easy and rapid exploration of these alternatives. The XLE database facilities also enable linguistic specifications from different sources and with different degrees of quality to be combined together in an orderly and coherent way. For XLE the goal is to produce efficient, robust, and broad-coverage processors for parsing and generation, and this requires that we make use of large-scale, independently developed morphological analyzers and lexicons. Such comprehensive, well-engineered components exist for many languages, and can relieve the grammar developer of much of the effort and expense of accounting again for those aspects of language processing.

3 XLE Morphological Processing

One obvious obstacle to incorporating externally developed morphological analyzers, even ones based on finite-state technology, is that they may be supplied in a variety of data-structure formats. We overcame this obstacle by implementing special software interpreters for the different transducer formats. But, as we learned in an evolutionary process, a more fundamental problem in integrating these components with syntactic processing is to reconcile the analyses they produce with the needs of the grammar.

Externally-available morphological components are, at present, primarily aimed at relatively undemanding commercial applications such as information retrieval. As such, they may have mistakes or gaps that have gone unnoticed because they have no effect on the target applications. For example, function words are generally ignored in IR, so mistakes in those words might go undetected; but those words are crucial in syntactic processing. And even correct analyses generally deviate in some respects from the characterizations desired by the grammar writer.

These mistakes, gaps, and mismatches are often not easy to correct at the source. It may not be practical to refer problems back to the supplier as

[...]

    +Vbase : +Verb +Inf
    +Vbase : +Verb +Not3Sg

Finally, in cases where the output of the external analyzer is simply wrong, corrections can be developed again in a locally specified way and combined with a "priority union" operator that blocks the incorrect analyses. The priority union of regular relations $A$ and $B$ is defined as $A \cup_p B = A \cup [\mathrm{Id}(\overline{\mathrm{Dom}(A)}) \circ B]$. The relation $A \cup_p B$ contains all the pairs of strings in $A$ together with all the pairs in $B$ except those whose domain string is also in the domain of $A$.
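
The priority union definition can be illustrated on small finite relations. The following is a minimal Python sketch, not XLE's implementation: it assumes a relation is a finite set of (domain string, range string) pairs, whereas the regular relations in the paper may be infinite and are combined by run-time simulation over transducers. The function names and the sample analyses (including the tags for "that") are hypothetical and were invented for illustration.

```python
def domain(relation):
    """Return the set of domain (first-component) strings of a relation."""
    return {source for source, _ in relation}


def priority_union(a, b):
    """Return the priority union of A and B: every pair of A, plus the
    pairs of B whose domain string is not in the domain of A."""
    blocked = domain(a)
    return set(a) | {(s, t) for s, t in b if s not in blocked}


# Hypothetical output of an external analyzer (illustration only).
external = {
    ("cooks", "cook +NPL"),
    ("cooks", "cook +V3SG"),
    ("that", "that +Adv"),  # assume the grammar writer considers this wrong
}

# Hypothetical locally specified corrections (illustration only).
corrections = {
    ("that", "that +Det"),
    ("that", "that +Conj"),
}

# Corrections take priority: every external analysis of "that" is blocked,
# while the analyses of "cooks" pass through unchanged.
combined = priority_union(corrections, external)
```

Because blocking is decided per domain string, supplying any correction for a word replaces all of the external analyses of that word, so a correction set has to list every analysis that should survive.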