High-Performance Multi-Pass Unification Parsing
Paul Wesley Placeway
May 14, 2002
CMU-LTI-02-172
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.
Thesis Committee:
Eric Nyberg, Chair
Jaime Carbonell
Alon Lavie
Robert Bobrow, BBN Technologies
Copyright © 2002 Paul Wesley Placeway
This research was supported in part by Carnegie Mellon University
The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of Carnegie Mellon University.
Keywords: Parsing, Unification, Ambiguity
For Mary, Mom and Dad.
Abstract
Parsing natural language is an attempt to discover some structure in a text (or textual representation) generated by a person. This structure can be put to a variety of uses, including machine translation, grammar conformance checking, and determination of prosody in text-to-speech tasks.
Recent theories of Syntax use Unification to better describe the intricacies of natural language [137]. For parsing systems, unification techniques have been either added to a context-free base system [152, 40, 4, 23], or replaced the context-free base entirely [118, 135, 45] (possibly putting it back later [136]). The seemingly small step of adding unification has opened a Pandora’s Box of computational complexity, increasing the difficulty of the problem from polynomial [48] to somewhere between NP-complete and intractable, depending on the details of the unification system and how it was added [10]. Worse, unification on a context-free base parser can break the packing technique used to address the problem of ambiguity, leading to exponential blow-ups of the parser’s performance in both space and time in practice.
I propose the use of a multi-pass strategy to avoid these problems in practice. I describe a parser which combines the use of shallow, simple value unification with some approximation techniques in order to find a covering packed parse-forest. This parse-forest is then searched for a single-best fully-unifying value; the scoring system which drives the heuristic search encodes linguistically-based disambiguation preferences.
The resulting two-pass parser is compared to an ordinary single-pass parser in the context of a heavy-weight knowledge-based machine translation system. The two-pass parser is shown to be competitive with the single-pass parser on average data, both in terms of time and space. It is also shown to be able to avoid a common class of ambiguity blow-up that the single-pass parser is subject to.
These results indicate that the multi-pass technique, interleaving some of the unification equations in the parse, is the superior approach for heavy-weight unification parsing.
Acknowledgements
I would like to thank the many people without whom this work would not have been possible:
My advisor, Eric Nyberg, for asking the critical question “Why do these sentences take so long?”, many technical and philosophical discussions, and for help in turning my writing into English.
The members of my committee, Jaime Carbonell, Alon Lavie, and Robert “Rusty” Bobrow, for guidance in setting the technical direction of this work, many useful discussions, and their patience.
Robert Moore, for technical advice related to high-performance context-free parsing.
Kathy Baker, for supporting the KANT grammar, and for helpful technical discussions.
Krzysztof Czuba, for providing the Broadcast News grammar and test set, and many helpful technical discussions.
David Svoboda, for help in organizing the Catalyst 10,000-sentence test corpus.
Robert Igo, for organ