High-Performance Multi-Pass Unification Parsing

Paul Wesley Placeway
May 14, 2002
CMU-LTI-02-172

Language Technologies Institute
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Thesis Committee:
Eric Nyberg, Chair
Jaime Carbonell
Alon Lavie
Robert Bobrow, BBN Technologies

Copyright © 2002 Paul Wesley Placeway

This research was supported in part by Carnegie Mellon University. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of Carnegie Mellon University.

Keywords: Parsing, Unification, Ambiguity

For Mary, Mom and Dad.

Abstract

Parsing natural language is an attempt to discover some structure in a text (or textual representation) generated by a person. This structure can be put to a variety of uses, including machine translation, grammar conformance checking, and determination of prosody in text-to-speech tasks. Recent theories of Syntax use Unification to better describe the intricacies of natural language [137]. For parsing systems, unification techniques have been either added to a context-free base system [152, 40, 4, 23], or replaced the context-free base entirely [118, 135, 45] (possibly putting it back later [136]). The seemingly small step of adding unification has opened a Pandora’s Box of computational complexity, increasing the difficulty of the problem from polynomial [48] to somewhere between NP-complete and intractable, depending on the details of the unification system and how it was added [10]. Worse, unification on a context-free base parser can break the packing technique used to address the problem of ambiguity, leading to exponential blow-ups of the parser’s performance in both space and time in practice. I propose the use of a multi-pass strategy to avoid these problems in practice.
I describe a parser which combines the use of shallow, simple value unification with some approximation techniques in order to find a covering packed parse-forest. This parse-forest is then searched for a single-best fully-unifying value; the scoring system which drives the heuristic search encodes linguistically-based disambiguation preferences. The resulting two-pass parser is compared to an ordinary single-pass parser in the context of a heavy-weight knowledge-based machine translation system. The two-pass parser is shown to be competitive with the single-pass parser on average data, both in terms of time and space. It is also shown to be able to avoid a common class of ambiguity blow-up that the single-pass parser is subject to. These results indicate that the multi-pass technique, interleaving some of the unification equations in the parse, is the superior approach for heavy-weight unification parsing.

Acknowledgements

I would like to thank the many people without whom this work would not have been possible:

My advisor, Eric Nyberg, for asking the critical question “Why do these sentences take so long?”, many technical and philosophical discussions, and for help in turning my writing into English.

The members of my committee, Jaime Carbonell, Alon Lavie, and Robert “Rusty” Bobrow, for guidance in setting the technical direction of this work, many useful discussions, and their patience.

Robert Moore, for technical advice related to high-performance context-free parsing.

Kathy Baker, for supporting the KANT grammar, and for helpful technical discussions.

Krzysztof Czuba, for providing the Broadcast News grammar and test set, and many helpful technical discussions.

David Svoboda, for help in organizing the Catalyst 10,000 sentence test corpus.

Robert Igo, for organizing the FOATS regression test corpus.

The many other members of the KANT team.

Finally, my wife Mary Placeway, for her love and support throughout the adventure of graduate school.
Contents

1 Introduction
  1.1 Introduction
    1.1.1 Statement of Thesis
    1.1.2 Summary of Contributions
    1.1.3 Motivation: Why parsing is useful?
    1.1.4 As part of Knowledge-based translation
    1.1.5 Checking conformance to a restricted language
    1.1.6 Discovering prosody in text-to-speech
  1.2 Dissertation Overview

2 Background
  2.1 General Background
    2.1.1 Preliminaries
    2.1.2 Families of grammars
    2.1.3 Context-Free Parsing
    2.1.4 Unification Parsing
    2.1.5 Unification grammars are computationally powerful
    2.1.6 Parsing as Constraint Satisfaction
  2.2 Parsing Applications
    2.2.1 Machine Translation Systems

3 Unification Parsing
  3.1 Unification Parsing
    3.1.1 Pure Unification Parsing
    3.1.2 Unification Parsing on Context Free Spine
  3.2 About Pseudo-Unification
  3.3 Parsing and Ambiguity
    3.3.1 Ambiguity is a problem for context-free parsing
    3.3.2 Context free parsing with packing
    3.3.3 Context free parsing with packing and unification
    3.3.4 Ambiguity inherently causes disjunction
    3.3.5 Solution to ambiguity Packing is not Subsumption
    3.3.6 Problems with Packing in Disjunctions
    3.3.7 Pseudo-unification and Disjunction

4 Delayed Unification Parsing
  4.1 Delaying Unification Until After Parsing
    4.1.1 Interleaved unification Versus Delayed unification
  4.2 Delaying Some Unification Until After Parsing
    4.2.1 Negative Restriction
  4.3 The Two Purposes of Interleaved Unification
    4.3.1 ‘Cheating’ in the interleaved unification
    4.3.2 Our unification approach

5 Overview of the Approach
  5.1 Conceptual Design
    5.1.1 Don’t try to do the parse all in one shot.
    5.1.2 Don’t keep the unification values from the parse phase.
    5.1.3 Don’t follow the grammar precisely early in the process.
    5.1.4 Don’t try to find all possible final unification values.
    5.1.5 Don’t pick just any single unification value; pick a good one.
  5.2 System Requirements
  5.3 System Architecture
    5.3.1 Preprocessing
    5.3.2 Run-time Processing
  5.4 Evaluation during Development
    5.4.1 Development test conditions
    5.4.2 Development hardware
    5.4.3 Run-time Performance and Optimization Priorities
  5.5 Summary

6 Efficient Chart Parsing
  6.1 Motivation
    6.1.1 Chapter Outline
  6.2 Prior Context-Free Parsers
  6.3 The Tree-Structured Grammar
    6.3.1 Building a Tree-Structured Grammar
    6.3.2 Using a Tree-Structured Grammar
    6.3.3 Previous Approaches
  6.4 Left-Corner and Look-ahead Filtering
    6.4.1 Left-Corner Constraint
    6.4.2 Look-ahead Constraint
    6.4.3 Left-corner of Look-ahead
  6.5 Other Parser Features
    6.5.1 Complete algorithm
  6.6 Context-free Parsing Results
    6.6.1 Discussion of results

7 Pseudo-Unification: Implementation and Optimization
  7.1 Introduction to Pseudo-Unification
    7.1.1 On Interpreting Pseudo-Unification
  7.2 Modifications to the Pseudo-Unifier
    7.2.1 ‘Gray-Box’ Adaptation
    7.2.2 Handling of Data Disjunctions
    7.2.3 Explicit No-Value Values
    7.2.4 Wild-Carded Values
    7.2.5 Complements of Unifications
    7.2.6 Explicit over-write value equation
  7.3 Compilation and Optimization of Pseudo-Unification
    7.3.1 Unwinding of Conditional ORs
    7.3.2 Disjunction Flattening
    7.3.3 Multiple-Value Strength Reduction
  7.4 Shallow Pseudo-Unification as a First-Pass Filter
    7.4.1 Wild-carding deep structure assignments
    7.4.2 Pseudo-Optimizations for Shallow Unification
    7.4.3 Effectiveness of Shallow Approximate Unification
  7.5 Optimizations That Did Not Help
    7.5.1 Approximated unification packing in disjunctions
    7.5.2 Length limits in approximate packing
    7.5.3 Vector Unifier (is not faster)

8 Post-parse Search
  8.1 Introduction
    8.1.1 Previous Approaches
    8.1.2 Method of Attack
  8.2 The Search Component
    8.2.1 Best-First Search
    8.2.2 Searching a parse forest
    8.2.3 An All-Paths Search of a Parse Forest
    8.2.4 A Backtracking Greedy search for a best parse
    8.2.5 Full branch-and-bound search
    8.2.6 N-Best search
  8.3 Disambiguation Cost Calculator ...
