High-Performance Multi-Pass Unification

Paul Wesley Placeway
May 14, 2002
CMU-LTI-02-172

Language Technologies Institute
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Thesis Committee:
Eric Nyberg, Chair
Jaime Carbonell
Alon Lavie
Robert Bobrow, BBN Technologies

Copyright 2002 Paul Wesley Placeway

This research was supported in part by Carnegie Mellon University

The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of Carnegie Mellon University.

Keywords: Parsing, Unification, Ambiguity

For Mary, Mom and Dad.

Abstract

Parsing natural language is an attempt to discover some structure in a text (or textual representation) generated by a person. This structure can be put to a variety of uses, including , grammar conformance checking, and determination of prosody in text-to-speech tasks. Recent theories of use Unification to better describe the intricacies of natural language [137]. For parsing systems, unification techniques have been either added to a context-free base system [152, 40, 4, 23], or replaced the context-free base entirely [118, 135, 45] (possibly putting it back later [136]). The seemingly small step of adding unification has opened a Pandora’s Box of computational complexity, increasing the difficulty of the problem from polynomial [48] to somewhere between NP-complete and intractable, depending on the details of the unification system and how it was added [10]. Worse, unification on a context-free base parser can break the packing technique used to address the problem of ambiguity, leading to exponential blow-ups of the parser’s performance in both space and time in practice.

I propose the use of a multi-pass strategy to avoid these problems in practice. I describe a parser which combines the use of shallow, simple value unification with some approximation techniques in order to find a covering packed parse-forest. This parse-forest is then searched for a single-best fully-unifying value; the scoring system which drives the heuristic search encodes linguistically-based disambiguation preferences.

The resulting two-pass parser is compared to an ordinary single-pass parser in the context of a heavy-weight knowledge-based machine translation system. The two-pass parser is shown to be competitive with the single-pass parser on average data, both in terms of time and space. It is also shown to be able to avoid a common class of ambiguity blow-up that the single-pass parser is subject to.

These results indicate that the multi-pass technique, interleaving some of the unification equations in the parse, is the superior approach for heavy-weight unification parsing.

Acknowledgements

I would like to thank the many people without whom this work would not have been possible:

My advisor, Eric Nyberg, for asking the critical question “Why do these sentences take so long?”, many technical and philosophical discussions, and for help in turning my writing into English.

The members of my committee, Jaime Carbonell, Alon Lavie, and Robert “Rusty” Bobrow, for guidance in setting the technical direction of this work, many useful discussions, and their patience.

Robert Moore, for technical advice related to high-performance context-free parsing.

Kathy Baker, for supporting the KANT grammar, and for helpful technical discussions.

Krzysztof Czuba, for providing the Broadcast News grammar and test set, and many helpful technical discussions.

David Svoboda, for help in organizing the Catalyst 10,000 sentence test corpus.

Robert Igo, for organ