SUPPLE: A Practical Parser for Natural Language Engineering Applications

Robert Gaizauskas, Mark Hepple, Horacio Saggion, Mark A. Greenwood and Kevin Humphreys∗ Department of Computer Science University of Sheffield, Sheffield, UK robertg|hepple|saggion|m.greenwood|- @dcs.shef.ac.uk { }

Abstract mars (CF-PSGs), written in Prolog, that has a num- ber of characteristics making it well-suited for use We describe SUPPLE, a freely-available, in LE applications. It is available both as a language open source natural language parsing sys- processing resource within the GATE General Ar- tem, implemented in Prolog, and designed chitecture for Text Engineering (Cunningham et al., for practical use in language engineering 2002) and as a standalone program requiring vari- (LE) applications. SUPPLE can be run as ous preprocessing steps to be applied to the input. a stand-alone application, or as a compo- We will here list some of its key characteristics. nent within the GATE General Architec- Firstly, the parser allows multiword units identi- ture for Text Engineering. SUPPLE is dis- fied by earlier processing components, e.g. named tributed with an example grammar that has entity recognisers (NERs), gazetteers, etc, to be been developed over a number of years treated as non-decomposable units for syntactic pro- across several LE projects. This paper de- cessing. This is important as the identification of scribes the key characteristics of the parser such items is an essential part of analyzing real text and the distributed grammar. in many domains. The parser allows a layered parsing process, with 1 Introduction a number of separate grammars being applied in se- ries, one on top of the other, with a “best parse” se- In this paper we describe SUPPLE1 — the Sheffield lection process between stages so that only a sub- University Prolog Parser for Language Engineering set of the constituents constructed at each stage is — a general purpose parser that produces both syn- passed forward to the next. While this may make tactic and semantic representations for input sen- the parsing process incomplete with respect to the tences, which is well-suited for a range of LE ap- total set of analyses licensed by the grammar rules, plications. SUPPLE is freely available, and is dis- it makes the parsing process much more efficient and tributed with an example grammar for English that allows a modular development of sub-grammars. was developed across a number of LE projects. We Facilities are provided to simplify handling will describe key characteristics of the parser and the feature-based grammars. The grammar representa- grammar in turn. tion uses flat, i.e. non-embedded, feature represen- 2 The SUPPLE Parser tations which are combined used Prolog term uni- fication for efficiency. Features are predefined and SUPPLE is a general purpose bottom-up chart parser source grammars compiled into a full form repre- for feature-based context free phrase structure gram- sentation, allowing grammar writers to include only

∗At Microsoft Corporation since 2000 (Speech and Natural relevant features in any rule, and to ignore feature or- Language Group). Email: [email protected]. dering. The formalism also permits disjunctive and 1In previous published materials and in the current GATE release the parser is referred to as buChart. This is name is now optional right-hand-side constituents. deprecated. The chart parsing algorithm is simple but very

200 Proceedings of the Ninth International Workshop on Parsing (IWPT), pages 200–201, Vancouver, October 2005. c 2005 Association for Computational Linguistics efficient, exploiting the characteristics of Prolog to stituent, in which tensed verbs are interpreted as re- avoid the need for active edges or an agenda. In in- ferring to unique events, and noun phrases as refer- formal testing, this approach was roughly ten times ring to unique objects. Where relations between syn- faster than a related Prolog implementation of stan- tactic constituents are identified in parsing, semantic dard bottom-up active chart parsing. relations between associated objects and events are The parser does not fail if full sentential parses asserted in the SQLF. cannot be found, but instead outputs partial anal- While linguistically richer grammatical theories yses as syntactic and semantic fragments for user- could be implemented in the grammar formalism selectable syntactic categories. This makes the of SUPPLE, the emphasis in our work has been on parser robust in applications which deal with large building robust wide-coverage tools — hence the re- volumes of real text. quirement for only minimal lexical morphosyntac- tic and semantic information. As a consequence the 3 The Sample Grammar combination of parser and grammars developed to date results in a tool that, although capable of return- The sample grammar distributed with SUPPLE has ing full sentence analyses, more commonly returns been developed over several years, across a number results that include chunks of analysis with some, LE projects. We here list some key characteristics. but not all, attachment relations determined. The morpho-syntactic and semantic information required for individual lexical items is minimal — 4 Downloading SUPPLE Resources inflectional root and word class only, the word SUPPLE resources, including source code and the class inventory is basically the PTB tagset. sample grammar, and also a longer paper providing A conservative is adopted regarding a more detailed account of both the parser and gram- identification of verbal arguments and attachment of mar, are available the supple homepage at: nominal and verbal post-modifiers, such as preposi- http://nlp.shef.ac.uk/research/supple tional phrases and relative clauses. Rather than pro- ducing all possible analyses or using probabilities to 5 Conclusion generate the most likely analysis, the preference is to offer a single analysis that spans the input sentence The SUPPLE parser has served as a component in only if it can be relied on to be correct, so that in numerous LE projects, and is currently in many cases only partial analyses are produced. The use in a Question Answering system which partic- philosophy is that it is more useful to produce par- ipated in recent TREC/QA evaluations. We hope tial analyses that are correct than full analyses which its availability as a GATE component will facilitate may well be wrong or highly disjunctive. Output its broader use by NLP researchers, and by others from the parser can be passed to further processing building applications exploiting NL . components which may bring additional information Acknowledgements to bear in resolving attachments. An analysis of verb phrases is adopted in which The authors would like to acknowledge the sup- a core verb cluster consisting of verbal head plus port of the UK EPSRC under grants R91465 and auxiliaries and adverbials is identified before any at- K25267, and also the contributions of Chris Huyck tempt to attach any post-verbal arguments. This con- and Sam Scott to the parser code and grammars. trasts with analyses where complements are attached to the verbal head at a lower level than auxiliaries and adverbials, e.g. as in the Penn TreeBank. This References decision is again motivated by practical concerns: it H. Cunningham, D. Maynard, K. Bontcheva, and is relatively easy to recognise verbal clusters, much V. Tablan. GATE: A framework and graphical devel- opment environment for robust NLP tools and applica- harder to correctly attach complements. tions. Proceedings of the 40th Anniversary Meeting of A semantic analysis, or simplified quasi-logical the Association for Computational Linguistics, 2002. form (SQLF), is produced for each phrasal con-

201