UCAM-CL-TR-743 Technical Report ISSN 1476-2986

Number 743

Computer Laboratory

Optimising the speed and accuracy of a Statistical GLR Parser

Rebecca F. Watson

March 2009

15 JJ Thomson Avenue
Cambridge CB3 0FD
United Kingdom
phone +44 1223 763500
http://www.cl.cam.ac.uk/

© 2009 Rebecca F. Watson

This technical report is based on a dissertation submitted September 2007 by the author for the degree of Doctor of Philosophy to the University of Cambridge, Darwin College.

Technical reports published by the University of Cambridge Computer Laboratory are freely available via the Internet:

http://www.cl.cam.ac.uk/techreports/

ISSN 1476-2986

Abstract

The focus of this thesis is to develop techniques that optimise both the speed and accuracy of a unification-based statistical GLR parser. However, we can apply these methods within a broad range of frameworks.

We first aim to optimise the level of tag ambiguity resolved during parsing, given that we employ a front-end PoS tagger. This work provides the first broad comparison of tag models as we consider both tagging and parsing performance. A dynamic model achieves the best accuracy and provides a means to overcome the trade-off between tag error rates in single tag per word input and the increase in parse ambiguity over multiple-tag per word input.

The second line of research describes a novel modification to the inside-outside algorithm, whereby multiple inside and outside probabilities are assigned for elements within the packed parse forest data structure. This algorithm enables us to compute a set of ‘weighted GRs’ directly from this structure. Our experiments demonstrate substantial increases in parser accuracy and throughput for weighted GR output.

Finally, we describe a novel confidence-based training framework, that can, in principle, be applied to any statistical parser whose output is defined in terms of its consistency with a given level and type of annotation. We demonstrate that a semisupervised variant of this framework outperforms both Expectation-Maximisation (when both are constrained by unlabelled partial-bracketing) and the extant (fully supervised) method. These novel training methods utilise data automatically extracted from existing corpora. Consequently, they require no manual effort on behalf of the grammar writer, facilitating grammar development.

Acknowledgements

I would first like to thank Ted Briscoe, who was an excellent supervisor. He has helped to guide this thesis with his invaluable insight and I have appreciated his patience and enthusiasm. Without his easy-going nature and constant support and direction this thesis would not have been completed as and when it was. Most importantly, he always reminded me to enjoy my time at Cambridge and have a nice glass of wine whenever possible!

I would also like to thank John Carroll who even at a distance has managed to provide a great deal of support and was always available when I needed help or advice. People from the NLIP group and administrative staff at the Computer Laboratory were also very helpful. I enjoyed my many talks with Anna Ritchie, Ben Medlock and Bill Hollingsworth. I will miss their moral support and I’m grateful that fate locked us in a room together for so many years! Thanks also to Gordon Royle and other staff at the University of Western Australia who supported me while I completed my research while visiting this University at home in Perth.

I also greatly appreciated the feedback I received during my PhD Viva. Both of my examiners, Stephen Clark and Anna Korhonen, provided helpful and thoughtful suggestions which improved the overall quality of this work’s presentation. This research would not have been possible without the financial support of both the Overseas Research Students Awards Scheme and the Poynton Scholarship awarded by the Cambridge Australia Trust in collaboration with the Cambridge Commonwealth Trust.

On a personal note I would like to thank my family; my parents and my sister Kathryn, who were always available to talk and provided a great deal of support. Finally, special thanks goes to my partner James who moved across the world to support me during my PhD. He made our home somewhere I didn’t mind working on weekends.

Contents

1 Introduction
1.1 Natural Language Parsing
1.1.1 Problem Definition
1.1.2 Corpus-based Estimation
1.1.3 Statistical Approaches
1.2 Research Background
1.3 Available Resources
1.3.1 Corpora
1.3.2 Evaluation
1.4 Research Goals
1.5 Thesis Summary
1.5.1 Contributions of this Thesis
1.5.2 Outline of Subsequent Chapters

2 LR Parsers
2.1 Introduction
2.2 Finite Automata
2.2.1 NFA
2.2.2 DFA
2.3 LR Parsers
2.3.1 LR Parsing Model
2.3.2 Types of LR Parsers
2.3.3 Parser Actions
2.3.4 LR Table
2.3.5 Parsing Program
2.3.6 Table Construction
2.4 GLR Parsing
2.4.1 Relationship to the LR Parsing Framework
2.4.2 Table Construction
2.4.3 Graph-structured Stack
2.4.4 Parse Forest
2.4.5 LR Parsing Program
2.4.6 Output
2.4.7 Modifications to the Algorithm
2.5 Statistical GLR (SGLR) Parsing
2.5.1 Probabilistic Approaches
2.5.2 Estimating Action Probabilities
2.6 RASP
2.6.1 Grammar
2.6.2 Training
2.6.3 Parser Application
2.6.4 Output Formats

3 Part-of-speech Tag Models
3.1 Previous Work
3.1.1 PoS Taggers and Parsers
3.1.2 Tag Models
3.1.3 HMM PoS Taggers
3.2 RASP’s Architecture
3.2.1 Processing Stages
3.2.2 PoS Tagger
3.3 Part-of-speech Tag Models
3.3.1 Part-of-speech Tag Files
3.3.2 Thresholding over Tag Probabilities
3.3.3 Top-ranked Parse Tags
3.3.4 Highest Count Tags
3.3.5 Weighted Count Tags
3.3.6 Gold Standard Tags
3.3.7 Summary
3.4 Part-of-speech Tagging Performance
3.4.1 Evaluation
3.4.2 Results
3.5 Parser Performance
3.5.1 Evaluation
3.5.2 Results
3.6 Discussion

4 Efficient Extraction of Weighted GRs
4.1 Inside-Outside Algorithm (IOA)
4.1.1 Background
4.1.2 The Standard Algorithm
4.1.3 Extension to LR Parsers
4.2 Extracting Grammatical Relations
4.2.1 Modification to Local Ambiguity Packing
4.2.2 Extracting Grammatical Relations
4.2.3 Problem: Multiple Lexical Heads
4.2.4 Problem: Multiple Parse Forests
4.3 The EWG Algorithm
4.3.1 Inside Probability Calculation and GR Instantiation
4.3.2 Outside Probability Calculation
4.3.3 Related Work
4.4 EWG Performance
4.4.1 Comparing Packing Schemes
4.4.2 Efficiency of EWG
4.4.3 Data Analysis
4.4.4 Accuracy of EWG
4.5 Application to Parse Selection
4.6 Discussion

5 Confidence-based Training
5.1 Motivation
5.2 Research Background
5.2.1 Unsupervised Training
5.2.2 Semisupervised Training
5.3 Extant Parser Training and Resources
5.3.1 Corpora
5.3.2 Extant Parser Training
5.3.3 Evaluation
5.3.4 Baseline
5.4 Confidence-based Training Approaches
5.4.1 Framework
5.4.2 Confidence Measures
5.4.3 Self-training
5.5 Experimentation
5.5.1 Semisupervised Training
5.5.2 Unsupervised Training
5.6 Discussion

6 Conclusion

References

List of Figures

1.1 Tree and GR parser output for the sentence The dog barked.
1.2 Example sentence from Susanne.
1.3 Example bracketed corpus training instance from Susanne.
1.4 Example annotated corpus training instance from Susanne.
1.5 Example annotated training instance from the GDT.
1.6 Example sentence from the WSJ.
1.7 Example bracketed corpus training instance from the WSJ.
1.8 Example sentence from PARC 700 Dependency Bank.
1.9 Example of sentences from DepBank.
2.1 NFA for the RE (a|b)∗ab.
2.2 DFA for the RE (a|b)∗ab.
2.3 Algorithm to simulate a DFA.
2.4 Components of an LR parser.
2.5 Grammar G1.
2.6 DFA for G1.
2.7 LR(0) items for the rule S → NP VP.
2.8 Grammar G2.
2.9 Grammar NFA for G2.
2.10 Example parses for G2.
2.11 Example graph-structured stack for G2.
2.12 Example metagrammar rule.
2.13 The GR subsumption hierarchy.
2.14 Simplified parse forest within the extant parser.
2.15 Example syntactic tree output.
2.16 Example n-best GR and weighted GR output.
3.1 RASP processing pipeline.
3.2 Example lexical entries in the tag dictionary.
3.3 Example mapping from PoS tag to terminal category.
3.4 PoS tag output for We all walked up the hill.
3.5 SINGLE-TAG and ALL-TAG PoS tags example.
3.6 MULT-SYS PoS tags example.
4.1 The inside (e) and outside (f) regions for node Ni.
4.2 Calculation of inside probabilities for node Ni.
4.3 Calculation of outside probabilities for node Ni.
4.4 Example GR output using an altered GR specification.
4.5 Example EWG data structures for N4.
4.6 Example EWG data structures for N2.
4.7 Comparison of total CPU time.
4.8 Comparison of total memory.
4.9 Scatter graph of parse ambiguity to sentence length.
5.1 Example RASP output for a sentence from Susanne.
5.2 Cross entropy convergence for semisupervised EM.
5.3 Performance over S for Cr and EM.
5.4 Performance over W for Cr and EM.
5.5 Performance over SW for Cr and EM.
5.6 Tuning over the WSJ (W) from Susanne (S).
5.7 Cross entropy convergence for EM over unsupervised Su.
5.8 Performance over Su for Cr and EM.

List of Tables

1.1 Corpora summary.
2.1 Transition table for the NFA in Figure 2.1.
2.2 LALR(1) table for G1.
2.3 Example stack configuration for G1.
2.4 The canonical LR(0) sets for G1.
2.5 Generalised LALR(1) table for G2.
2.6 Example stack configurations for G2.
2.7 Statistical models over the action table for G2.
2.8 Example of the alternative LR parser normalisation methods.
2.9 Probabilities and GR specifications for nodes in Figure 2.14.
3.1 Tag setup descriptions.
3.2 Tagging Performance.
3.3 Parsing Performance.
4.1 Inside probabilities and filled GR specifications example.
4.2 Outside probabilities, and IO probabilities example.
4.3 Performance of the two parse selection algorithms.
5.1 Corpus split for S, W and SW.
5.2 Semisupervised confidence-based training performance.
5.3 Unsupervised confidence-based training performance.

Chapter 1

Introduction

This thesis develops new parse selection models and training algorithms to improve the parsing accuracy and efficiency of an existing and well-developed natural language parser. This chapter first describes, in §1.1, the problem of natural language parsing: processing raw text to provide a linguistic analysis of the text’s meaning. The work in this thesis utilises and modifies an existing well-developed parser named RASP. We discuss RASP’s research background in §1.2, and describe the resources available for use throughout in §1.3. Finally, in §1.4 and §1.5 we describe the contributions of this thesis and provide an overview of the chapters to follow, respectively.

1.1 Natural Language Parsing

In this section, we first define the problem of natural language parsing. In particular, we follow previous work and define the problem as a supervised learning task. We discuss current statistical approaches to parsing and the current state-of-the-art performance for this task.

1.1.1 Problem Definition

In natural language processing, a parser is a computer program capable of analysing a string of words that form a well-formed sentence to determine a suitable grammatical structure: a parse.1.1 Parses can be represented in a number of ways. A parse tree (or syntactic tree; tree, henceforth) represents the syntactic structure of each phrase (e.g. verb, noun or adjective phrases). Grammatical relations (GRs, or relational dependencies) represent the grammatical roles of different words in the sentence (e.g. subject, object). The parser’s grammar effectively provides a mapping from word strings (i.e. terminals of the grammar) to the set of possible parses. For example, in context-free grammars (CFG) the grammar specifies syntax using simple rule rewrites to build a syntactic tree over a sentence. We use the terms analysis, derivation and parse interchangeably and consider the specific output formats as independent, given the analysis determined by the parser. For example, for the sentence The dog barked, our extant parser outputs the tree and GRs in Figure 1.1.

The problem of parse selection is to choose the correct parse from the set of all possible parses. This task is nontrivial as large numbers of parses can result due to ambiguities in structural attachment and interaction between rules in the grammar. As a result, researchers have favoured statistical techniques whereby the most probable parse is assumed to be correct.

1.1 Parsers optionally include a sentence detection component that marks the sentence boundaries within the raw text.


TAGGED SENTENCE: (The_AT dog_NN1 bark+ed_VVD)

TREE: (T/txt-sc1/- (S/np_vp (NP/det_n1 The_AT (N1/n dog_NN1)) (V1/v bark+ed_VVD)))

GRs: (ncsubj bark+ed_VVD dog_NN1 _) (det dog_NN1 The_AT)

Figure 1.1: Tree and GR parser output for the sentence The dog barked. TAGGED SENTENCE is the preprocessed part-of-speech (PoS) tagged sequence for this sentence. The GR types det and ncsubj correspond to determiner and nonclausal subject relations, respectively. The labelled bracketing (TREE) encodes the corresponding syntactic structure.

Thus, parse selection is often the task of parse ranking, where parses are ranked from most to least probable and the first parse is considered correct. As it is infeasible to output all possible parses, parsers output only the n highest ranked parses (the n-best list).

1.1.2 Corpus-based Estimation

In order to learn the probability distributions of the statistical component of the parser, we often base the distribution on an electronic and structured text; a corpus. This task is often supervised in that the corpus, i.e. training data, will contain examples of raw text paired with the full and correct analysis. A parse ranking model should be able to model and generalise from the underlying linguistic preferences of the corpus’ domain (e.g. newswire text). That is, a parse ranking model should model the underlying distribution (UD) which generates the domain’s linguistic preferences, whereby the primary assumption made is that this training data is representative of the domain. Arguably, learning the UD over more homogeneous domains is easier than over broader domains.

The parser is applied to sentences (or raw text) termed test data. Ideally, both the test and training data are drawn from the same domain, whereby the training data is considered in-domain. In contrast, the data may instead be out-of-domain. Therefore, overfitting the UD is undesirable, though generalisation or inductive bias must be capable of capturing the underlying preferences manifest in this distribution.

As we aim to model the UD, training over in-domain data has been shown to outperform models trained over out-of-domain data. Gildea (2001) illustrates that training and testing Model 1 of Collins (1999) on two different corpora reduces parsing accuracy significantly. However, training over both corpora slightly increases accuracy when testing on either corpus. This illustrates the need for a balanced training corpus that can potentially generalise well to unseen domains. Even small levels of in-domain data have been shown to improve parser accuracy. For example, Tadayoshi et al. (2005) adapt a statistical parser trained on newswire text to the biomedical domain by retraining on the Genia Corpus. Augmenting a parser, trained initially over a different domain, to model the preferences of another is often referred to as parser tuning or domain adaptation.

We first define parsing in terms of a supervised learning task, in which we aim to induce the function f : X → Y, given training examples ⟨xi, yc⟩, xi ∈ X, yc ∈ Y. We define f(xi) ∈ Y as the selected candidate for xi, where yc is the correct candidate for training example i. The function f utilises the function GEN to determine the set of possible candidates for xi, i.e. GEN(xi) ⊂ Y. In parsing, xi ∈ X is the i-th sentence, yij ∈ GEN(xi) is the j-th parse for the i-th sentence and yc ∈ Y is the correct analysis. As yc is often a syntactic tree, such corpora are also referred to as treebanks. The function GEN(xi) creates the compact data structure that represents all possible parses for sentence xi. The size of this structure depends largely on the complexity of the particular parser’s grammar.

1.1.3 Statistical Approaches

For probabilistic approaches to parsing, we utilise the supervised learning task framework, which assigns a probability to each parse yij ∈ Y for sentence xi. We utilise yij to denote an arbitrary parse, and yc to denote the correct parse, for the sentence xi.
Within statistical parsers, the most probable parse is selected as the best candidate for the correct parse:

f(xi) = argmax_{yij ∈ GEN(xi)} Pr(xi, yij)
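As an illustration only, the sketch below implements this selection rule and the associated n-best ranking in Python. The functions gen and score are placeholders standing in for GEN and Pr; they are not part of RASP.

def select_parse(x, gen, score):
    """f(x) = argmax over y in GEN(x) of Pr(x, y).

    gen enumerates candidate parses for sentence x; score assigns each
    candidate a probability (or any monotonic score) under the model.
    """
    best, best_score = None, float("-inf")
    for y in gen(x):
        s = score(x, y)
        if s > best_score:
            best, best_score = y, s
    return best

def n_best(x, gen, score, n=10):
    """Rank candidates from most to least probable and keep only the n highest
    ranked parses (the n-best list)."""
    return sorted(gen(x), key=lambda y: score(x, y), reverse=True)[:n]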

Statistical parsing models incorporate structural and/or lexical parameters (features) over yij that aim to include sufficient context to allow the model to learn and reflect linguistic preferences. Models differ in their choice of such features and in the function Pr, which is used to estimate the importance of features in relation to one another. Approaches can be separated in a variety of ways. We consider generative or discriminative and parametric or nonparametric approaches. In the following, we review current parsing models in terms of these types and their training criteria, and discuss the types of features they employ.

Training

Generative models define Pr using a joint probability model: P(xi, yij). Such models are parametric in that they utilise a specific set of features, each with an associated weight, where the probability of a parse is based on the product of the corresponding features multiplied by their associated weights. These models are all history-based parsing models, as defined by Black et al. (1991), whereby a parse’s probability is the product of the probabilities for each decision which results in creation of the parse. For example, in a probabilistic CFG (PCFG) model, the features represent the count of each CFG rule applied, while the weights represent the probability of the corresponding rule. The predominant method of training generative parametric models is to use maximum likelihood estimation (MLE) with a ‘smoothing’ method to allow for unseen data. MLE assigns the weights of the model so as to maximise the total probability of all parses in the training data.

In parametric discriminative models, Pr is the conditional likelihood P(yij|xi). That is, we model the probability only in terms of other parses for the sentence xi. These conditional probability models (such as log-linear models) utilise iterative training schemes to ensure they maximise the log-likelihood of the training data and to compensate for dependencies amongst features. In either case, for any given sentence in the corpus, the MLE training criterion does not ensure that the likelihood of the training parse is greater than that of incorrect (competing) parses. Nevertheless, conditional models are popular in the literature, having led to impressive ranking results when combined with large feature sets (e.g. Charniak & Johnson 2005; Clark & Curran 2004b).

In contrast, nonparametric models aim to directly differentiate between incorrect and correct parses for a given sentence. Thus all such models are discriminative and Pr is replaced by, for example, margin length values in maximum-margin approaches. Nonparametric approaches such as boosting, SVMs, and (variants of) the Perceptron algorithm have also yielded impressive results (e.g. Collins & Duffy 2002; Kudo et al. 2005; Shen & Joshi 2004). However it is unclear whether the different training criteria, or the number and type of features, provide a greater gain in accuracy. Collins & Roark (2004) report similar accuracy results for both parametric and nonparametric models over the same feature sets. Furthermore, nonparametric models are harder to train using either semisupervised or unsupervised techniques.

Models also differ in the way in which the underlying grammar is learnt. The statistical model can be learnt over a manually written grammar, or the grammar may be explicitly or implicitly learnt over a supervised corpus. Thus, to create a broad-coverage parser, manual effort is required mainly to write the grammar or create the corpus. Some approaches, however, such as ‘U-DOP’, an unsupervised variant of the ‘DOP’ parsing model, assume no grammar is available and instead assume a binary branching unlabelled PCFG grammar (Bod, 2006).
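Returning to the PCFG case above: a generative parametric model of this kind can be trained by MLE simply by counting rule applications. The following Python sketch is a toy illustration under these assumptions, not RASP’s model or code, and it omits smoothing.

from collections import Counter
from math import log

def mle_rule_probs(treebank_rules):
    """Relative-frequency (MLE) estimates P(lhs -> rhs) = count(lhs -> rhs) / count(lhs).

    treebank_rules: iterable of (lhs, rhs) pairs observed in the training
    derivations, where rhs is a tuple of daughter categories.
    """
    treebank_rules = list(treebank_rules)
    rule_counts = Counter(treebank_rules)
    lhs_counts = Counter(lhs for lhs, _ in treebank_rules)
    return {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}

def parse_log_prob(parse_rules, probs):
    """A parse's probability is the product of the probabilities of the rules used
    to build it; we sum logs to avoid numerical underflow."""
    return sum(log(probs[rule]) for rule in parse_rules)

# Toy treebank: rule applications collected from two training parses.
observed = [("S", ("NP", "VP")), ("NP", ("Det", "N")), ("VP", ("V",)),
            ("S", ("NP", "VP")), ("NP", ("N",)), ("VP", ("V", "NP"))]
probs = mle_rule_probs(observed)    # e.g. P(NP -> Det N) = 0.5
score = parse_log_prob([("S", ("NP", "VP")), ("NP", ("Det", "N")), ("VP", ("V",))], probs)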

Feature Spaces

PCFG models are widely acknowledged in the literature as inadequate due to their lack of context. Conversely, other parametric models such as the DOP model (Bod, 1998) should be able to model the UD accurately given explicit tree and subtree features. We can think of the level of context available (whether implicit or explicit in the model) as lying on a linear scale. At one end of the scale there are context-free models like PCFG, while at the other end we place full (sentential) context models like DOP. In general, we assume that the number of feature types and instances in a model is directly proportional to the level of context considered. Thus as we move along this scale, increasing the level of context, we decrease the level of bias, that is, the model is less likely to underfit the data. However this results in an associated increase in the number of features, where the model is more likely to overfit the data; the ‘bias versus variance’ trade-off.

Collins (2004) highlights that due to convergence requirements and to avoid overtraining, the number of training samples should be proportional to the size of the feature space. Therefore, we should only include as many features (or as much context) as is optimal to ensure effective generalisation. Both nonparametric and conditional models are valued for their ability to incorporate a large number of (possibly dependent) features. These features are usually either n-gram statistics or CFG(-like) rules labelled with a varying amount of additional structural and/or lexical context. Therefore, the latter set of features are ‘DOP-like’ in that they model context with features resembling subtrees (optionally labelled with lexical information).

However, the large number of lexical features (including n-gram statistics and lexicalised structural features) do not generalise well to other domains. Gildea (2001) illustrates that removing bilexical statistics (derived from the WSJ Corpus) from Model 1 of Collins (1999) decreases performance by less than 1% over the WSJ while performance was unaffected when testing over the Brown Corpus (we describe these corpora in §1.3.1). Similarly, Klein & Manning (2003) argue that lexicalised parsing models achieve around 4% absolute improvement over unlexicalised models trained and tested on the WSJ. Bikel (2004) illustrates that the bilexical parameters of Model 2 of Collins (1999) contribute less than 0.5% accuracy, while removing all of the lexical features results in a 3% decrease in accuracy when training and testing over the WSJ.

Further, efficiency is an important issue for real-world applications. All nonparametric and conditional models are reliant on an initial generative model to define the space of competing parses, the parse forest. Many also require the initial set of candidate parses from the parametric model to rerank (e.g. Collins & Duffy 2002; Kaplan et al. 2004; Kudo et al. 2005). Such reranking parsing models separate the parsing and selection phases, though some include the initial generative probability as a feature of the discriminative model, e.g. Collins & Roark (2004). It is often impracticable to unpack all parses, though we would ideally like to provide the reranking model with the set of all candidate parses if this model is more accurate. Specifically, there is a trade-off between efficiency and accuracy.
Collins & Roark (2004) define a more efficient approach to enable their nonparametric model to train over and rerank all parses. However this model is only able to consider a set of local features, available at any node in the parse forest. This precludes many of the nonlocal features that can only be applied to an entire parse.

State-of-the-art Performance

Comparison of parsing systems is hampered, as performance is reported for parsers trained on different treebanks, tested over different test suites and evaluated using a number of evaluation schemes (depending on the parser’s preferred output: tree or GRs). Comparison of performance over GRs is further complicated as parsers extract GRs with differing levels of granularity. Currently, parsers considered state-of-the-art are often trained and tested on the WSJ and favour the application of highly lexicalised probabilistic models. These models report precision and recall PARSEVAL scores (Black et al., 1991) of around 90% (e.g. Charniak 2000; Collins 1999; Ratnaparkhi 1999). However, the WSJ is arguably a more homogeneous and simpler corpus than other corpora, such as the Susanne treebank over which RASP is trained.

Briscoe & Carroll (2006) compare RASP’s performance to that of the XLE parser of Kaplan et al. (2004) and of Model 3 of Collins (1999). Though the comparison is nontrivial, they illustrate that RASP is substantially faster than both parsers, with parser accuracy higher than that of Collins’ parser and ranging between the ‘cut-down-grammar’ and ‘complete-grammar’ XLE systems. Further, RASP is trained over a subset of the Brown Corpus, that is, an out-of-domain training set. Therefore, RASP’s accuracy and efficiency are arguably state-of-the-art.

Discussion

Nonparametric models which aim to directly differentiate correct from incorrect parses are complex to train and often inefficient to decode. Nevertheless, they are currently popular because of their ability to accurately model the UD and ensure that the probability (or score) of correct parses exceeds that of competing incorrect ones for training inputs. Other recent parametric models, such as log-linear models, are also complex to train and inefficient to decode. However, if compared to nonparametric models using the same feature sets and training data, they produce similar ranking accuracy (e.g. Collins & Roark 2004).

Generative parametric models capable of direct MLE are currently disfavoured because empirically their performance has been worse. Briscoe & Carroll (1993) demonstrate that a generative probability distribution defined over the actions of a (nondeterministic) generalised LR parser (GLR, Tomita 1987) provides additional context to the ranking model obtained from a standard PCFG. Further, the parser retains advantages such as simple estimation and efficient decoding. In this thesis we utilise this GLR parser, in search of parametric models that can be easily trained and efficiently decoded and which readily support semisupervised training techniques.

1.2 Research Background

RASP resulted as part of the tool-set developed within the Alvey Natural Language Tools (ANLT) for English at the University of Cambridge, funded by the UK Alvey Programme. The parser is a generative (parametric) model based on the GLR framework.1.2
RASP (the ‘robust accurate statistical parser’) is a state-of-the-art system for text processing distributed freely for research purposes.1.3 It is a set of pipelined modules for sentence boundary detection, word tokenisation, part-of-speech tagging and syntactic parsing which recovers the grammatical relations between words, phrases and clauses in individual text sentences. The full (Lisp and C) code base runs to several hundred thousand lines. The RASP system, described by Briscoe & Carroll (2002), has been downloaded by over 160 groups. This paper, describing the first release, has been cited 147 times (Google Scholar, 14/08/07) in descriptions of further research utilising RASP by researchers in the UK, Europe, the US, Australia and Asia. It has been used to automatically annotate over 1 billion words of English text in the context of published work developing lexical databases, question-answering systems, text classifiers, information extraction systems, summarisers, and so forth. The second release of this system is described by Briscoe et al. (2006).

1.3 Available Resources

This thesis is the result of research conducted at the Computer Laboratory of the University of Cambridge, under the supervision of Professor E.J. Briscoe, who co-developed RASP. As a result, this work benefits from direct access to RASP’s grammar, code and all associated processing components. RASP is utilised in a number of research projects within the Computer Laboratory, in other universities, and in commercial projects. As a result the grammar, associated processing components and also the evaluation schemes were developed, and hence varied, throughout the life of this work. The data and evaluation schemes used throughout this work are described in the following sections. Results in subsequent experimental chapters are comparable within their respective chapters only, due to changes that occurred in the grammar and/or evaluation framework.

1.3.1 Corpora

The following sections describe the data in use throughout this work as training, tuning or test data. The treebanks we use in this work are in one of two possible formats. In either case, a treebank T consists of a set of training instances. Each training instance t ∈ T is a pair (s, M), where s is the automatically preprocessed sentence text (tokenised and labelled with PoS tags, see §3.2) and M is either a fully annotated derivation, A, or an unlabelled bracketing, U. This bracketing may be partial in the sense that it may be compatible with more than one derivation produced by a given parser. Although occasionally the bracketing is itself complete, the alternative nonterminal labelling causes indeterminacy. Often the ‘flatter’ bracketing available from existing treebanks is compatible with several alternative ‘deeper’, mostly binary-branching, derivations output by a parser.

1.2 We provide details of this framework, and specific details of the RASP system, in the following chapters.
1.3 See http://www.informatics.susx.ac.uk/research/nlp/rasp/ for license and download details.

Susanne Treebank

The Susanne Corpus (henceforth, Susanne) is a balanced subset of the Brown Corpus which consists of 15 broad categories of American English texts (Sampson, 1995). This treebank contains detailed syntactic derivations represented as trees, but the node labelling is incompatible with our system’s grammar. Figure 1.2 illustrates an example sentence from the corpus.

The RASP developers built two system-compatible corpora from Susanne, a bracketed corpus and an annotated corpus. To build the bracketed corpus, sentences were extracted from Susanne and automatically preprocessed. A few multiwords were retokenised, and the sentences were retagged using RASP’s PoS tagger. The bracketing was then automatically and deterministically modified to more closely match that of RASP’s grammar. This pipeline resulted in a bracketed corpus of 7014 sentences. Figure 1.3 illustrates the corresponding bracketed corpus training instance extracted from the original annotation in Figure 1.2. The first item (line) is the preprocessed word-stems with the corresponding PoS tags (determined using the original word rather than the stem). The second line provides unlabelled bracketing over the words of the sentence.

A01:0700a - APPGm His his [S[Ns:s.
A01:0700b - NN1c petition petition .Ns:s]
A01:0700c - VVDv charged charge [Vd.Vd]
A01:0700d - JJ mental mental [Ns:o.
A01:0700e - NN1n cruelty cruelty .Ns:o]S]
A01:0700f - YF +. - .O]

Figure 1.2: Example sentence from Susanne. The third column illustrates the PoS tag while the fourth and fifth columns show the word and word-stem, respectively. The final column illustrates the syntactic structure of the sentence.

his_APP$ petition_NN1 charge_VVN mental_JJ cruelty_NN1 .
((his petition) charge (mental cruelty))

Figure 1.3: Example bracketed corpus training instance from Susanne.

A fully annotated and system-compatible treebank of 4801 training instances (3543 of which are unique) from this bracketed corpus was also created. The system parser was applied to construct a parse forest of analyses which are compatible with the bracketing. For 1258 training instances, the grammar writer interactively selected correct (sub)analyses within this set until a single analysis remained. The remaining 2285 training instances were automatically parsed and all consistent derivations were returned. Since the bracketing is consistent with more than one possible derivation for roughly two thirds of the data, the 1258 training instances were repeated twice so that counts from these trees were weighted more highly. The level of reweighting was determined experimentally using some held-out data from Susanne. Even given the partial bracketing derived from Susanne, the costs of deriving the fully annotated treebank are high, as interactive manual disambiguation takes an average of ten minutes per sentence.

(his_APP$ petition_NN1 charge_VVN mental_JJ cruelty_NN1 ._.) (T/txt-sc1/-+ (S/np_vp (NP/det_n1 his_APP$ (N1/n petition_NN1)) (V1/v_n1 charge_VVN (N1/ap_n1/- (AP/a1 (A1/a mental_JJ)) (N1/n cruelty_NN1)))) (End-punct3/- ._.))

Figure 1.4: Example annotated corpus training instance from Susanne.

Returning to our previous example, Figure 1.4 illustrates the fully annotated training instance. Again, the first line contains the preprocessed text while the second element (subsequent lines) corresponds to the fully annotated derivation that will be directly compatible with (a specific version of) RASP’s grammar.

Grammar Development Treebank

The grammar development treebank (GDT) is an annotated corpus: a list of around two thousand manually maintained sentences paired with correct derivations. Figure 1.5 illustrates an example annotated training instance from the GDT. Each training instance consists of a pair: the preprocessed text and a corresponding derivation which is compatible with the current version of RASP’s grammar. The GDT does not include useful frequency information as, in general, each grammatical rule occurs in derivations only as many times as is required to illustrate its interaction with other rules to define the linguistic coverage of the system. Furthermore, these sentences are short and relatively artificial, having been constructed by the grammar writer to elucidate coverage and minimise ambiguity.

(The_AT technology_NN1 is_VBZ long_JJ in_II the_AT tooth_NN1) (T/txt-sc1/-- (S/np_vp (NP/det_n1 The_AT (N1/n technology_NN1)) (V1/be_ap/- is_VBZ (AP/a1 (A1/a_pp long_JJ (PP/p1 (P1/p_np in_II (NP/det_n1 the_AT (N1/n tooth_NN1)))))))))

Figure 1.5: Example annotated training instance from the GDT.

Wall Street Journal

The Penn Treebank (PTB) is a corpus of over 4.5 million words of American English (Marcus et al., 1993). The corpus has been annotated with PoS tags and around half has been annotated with skeletal syntactic structure. As this corpus is the largest treebank available for English,

((S (NP-SBJ (DT The) (JJ new) (NN rate)) (VP (MD will) (VP (VB be) (ADJP-PRD (JJ payable) (NP-TMP (NNP Feb.) (CD 15))))) (. .)))

Figure 1.6: Example sentence from section 2 of the WSJ. Brackets are labelled with the phrasal category or with the PoS tag for each word.

The_AT new_JJ rate_NN1 will_VM be_VB0 payable_JJ Feb._NPM1 15_MC ._.
((The new rate) (will (be (payable (Feb. 15)))) .)

Figure 1.7: Example bracketed corpus training instance from the WSJ.

the Wall Street Journal (WSJ) sections of the Penn Treebank (PTB) are employed as both training and test data by many researchers in the field of statistical parsing. The annotated corpus implicitly defines a grammar by providing labelled bracketing over words annotated with PoS tags. An example annotated sentence is shown in Figure 1.6.

We extract the unlabelled bracketing from all sections of the WSJ, including those for the de facto standard training sections (2-21 inclusive). The pipeline is the same as that used for creating the bracketed Susanne corpus. However, we do not automatically map the bracketing to be more consistent with the system grammar. Instead we simply remove unary brackets. The de facto training set is compiled to form a bracketed corpus of 38,329 training instances. Figure 1.7 shows the resulting bracketed treebank training instance for the corresponding WSJ sentence shown in Figure 1.6.

Parc 700 Dependency Bank

King et al. (2003) describe the development of the PARC 700 Dependency Bank, a gold-standard set of relational dependencies for 700 sentences drawn at random from section 23 of the WSJ (the de facto standard test set for statistical parsing). Briscoe & Carroll (2006) parse the corpus with RASP (and manually correct the output if required), to create a doubly annotated data set, enabling (nontrivial) comparison of RASP with other state-of-the-art statistical parsers. Figure 1.8 illustrates an example sentence from the resulting corpus with both parsers’ annotation.

We test our parser on the same 560-sentence subset (DepBank, henceforth) that Kaplan et al. (2004) utilise in their study of parser accuracy and efficiency. Unless otherwise stated we utilise DepBank without gold standard PoS tagging, that is, we apply our entire automated pipeline including the PoS tagger. A gold standard named-entity (NE) mark-up for DepBank was provided by Stephan Riezler, co-author of Kaplan et al. (2004). The gold standard RASP annotation over the NE mark-up was also developed in the format shown in Figure 1.8. Thus there are two gold standards for DepBank, one with and one without NE mark-up. As this is a test corpus, we build the corresponding files to parse that consist only of the

sentence(
  id(wsj_2351.19, parc_23.15)
  date(2002.6.12)
  validators(T.H. King, J.-P. Marcotte)
  sentence_form(But that won’t be easy.)
  structure(adjunct(be~0, but~6)
            adjunct(be~0, not~5)
            stmt_type(be~0, declarative)
            subj(be~0, pro~2)
            tense(be~0, fut)
            xcomp(be~0, easy~1)
            adegree(easy~1, positive)
            subj(easy~1, pro~2)
            num(pro~2, sg)
            pron_form(pro~2, that)
            adegree(but~6, positive))
  rasp( (conj But be)
        (ncsubj be that _)
        (aux be wo)
        (ncmod _ easy n’t)
        (xcomp _ be easy)))

Figure 1.8: Example sentence from PARC 700 Dependency Bank annotated using both RASP and the XLE parser of Kaplan et al. (2004).

But_CCB that_DD1 wo_VM n’t_XX be_VB0 easy_JJ ._.
Would_VM Mr. Antori_NP2 ever_RR get_VV0 back_RL in_II ?_?

Figure 1.9: Example of sentences from DepBank.

preprocessed sentences. To do so, we extract the raw text from the NE and non-NE DepBank corpora, then process this text (e.g. tokenise and tag) using RASP’s preprocessing modules. The first line of Figure 1.9 shows an example preprocessed sentence, which corresponds to the sentence in Figure 1.8. The second line illustrates an example sentence with NE mark-up.

Kaplan et al. (2004) report results in terms of both relational dependencies, which are the reformatted and corrected F-structure output from their lexical functional grammar (LFG) parser, and also a set of ‘semantically-relevant’ features. In our modified version of DepBank, we replace these features with a slightly richer relational annotation (Briscoe & Carroll, 2006). Briscoe & Carroll (2006) note that RASP’s parsing results (F1 scores) are lower than those reported in that paper partly because many of these features are numerous and relatively easy to recover.

Summary

Susanne is a (balanced) subset of the Brown Corpus which consists of 15 broad categories of American English texts. All but one category (reportage text) are drawn from different domains than that of the WSJ. Therefore we consider Susanne as out-of-domain training data when testing on DepBank, following Gildea (2001) and others.

We provide a summary of the data in Table 1.1. Note that the average parses per sentence and average GRs per sentence will differ depending on the version of the grammar used1.4 to parse the data. However, these statistics provide a basis for comparison of complexity. The default system processing limitations were imposed on our parser (see §2.6.3) for all corpora except DepBank, halting the parser for some sentences prior to completing parsing. Further, we considered only sentences with a word length of less than or equal to 50. Thus the number of sentences (Sent) in the corpus is shown, as well as the number of sentences that completed parsing (Comp). From this latter set we determined the statistics for sentence parses and GRs. The final column (Frag) illustrates the number of parses for which a fragmentary parse (see §2.6.4) was found. That is, the system was unable to find a full parse given the grammar, though was still able to return an analysis for the sentence.

Table 1.1: Summary of corpora used throughout this work (the table reports Sent, Comp, Words, Parses, GRs and Frag figures for Susanne, the GDT, the WSJ and DepBank). The WSJ statistics are calculated over the de facto training sections 2-21 inclusive.

1.4 Those quoted here were determined over the tsg15 grammar.

1.3.2 Evaluation

We evaluate the parser’s output using a relational dependency evaluation scheme (Carroll et al., 1998; Lin, 1998) with standard measures: precision, recall and F1. Relations are organised in a subsumption-based hierarchy. In addition to determining precision, recall and F1 for each level in the relation hierarchy, we calculate the micro- and macro-averaged values for each of these scores across all relations. The macro-average measure is calculated by taking the average of each measure for each individual relation, while the micro-average measure is calculated from the counts across all relations.
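To make the two averaging schemes concrete, the following Python sketch computes precision, recall and F1 per relation and then the micro- and macro-averages from per-relation true-positive, false-positive and false-negative counts. It is an illustration only, not the evaluation software used in this work.

def prf(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def micro_macro(counts):
    """counts maps each relation type to a (tp, fp, fn) triple."""
    # Micro-average: pool the counts across all relations, then score once.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = prf(tp, fp, fn)
    # Macro-average: score each relation separately, then average the scores.
    per_relation = [prf(*c) for c in counts.values()]
    macro = tuple(sum(vals) / len(per_relation) for vals in zip(*per_relation))
    return micro, macro

# e.g. micro_macro({"ncsubj": (90, 10, 20), "det": (50, 5, 5)})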

As previously mentioned, many parsers currently report PARSEVAL evaluation measures based on the labelled bracketing of the parser. That is, a bracket is labelled with a nonterminal grammar category and inside the bracket are other nonterminal categories (more labelled brackets) or words. This ‘bracketing’ provides a flat representation of a syntactic tree. Therefore this evaluation compares the structure of a parse tree output by the parser with that of the gold standard tree. In contrast, the relation-based evaluation considers the GRs output from the parser, where the same GR can be produced within different syntactic structures. As a result, our evaluation does not penalise different structural representations but instead aims to give credit to the semantics extracted.

Hierarchy Based Evaluation

For each sentence we determine the relations output by the parser that are correct at each level of the relational hierarchy. This hierarchy is shown in Figure 2.13, in a subsequent section in which we describe RASP’s grammar. Relations take the general form: (relation subtype head dependent initial). A relation is correct if the head and dependent slots are equal and if the other slots are equal (if specified). The evaluation we perform in the final experimentation chapter (Chapter 5) differs in two respects from the evaluation defined in Carroll et al. (1998). Firstly, a different relational hierarchy is applied due to a major change in the grammar (described in Briscoe & Carroll 2006). Secondly, the evaluation scheme is altered so that GR counts are percolated upwards throughout the hierarchy. This enables the root GR type dependent to represent the unlabelled dependency scores.1.5

In the new evaluation, if a relation is incorrect at a given level in the hierarchy it may still match for a subsuming relation (if the remaining slots all match). For example, if an ncmod relation is mislabelled with xmod, it is correct for all relations which subsume both ncmod and xmod, e.g. mod. Similarly, the GR is considered incorrect for xmod and all relations that subsume xmod but not ncmod. Thus, the evaluation scheme calculates unlabelled dependency accuracy at the dependency (most general) level in the hierarchy. The micro- and macro-averaged precision, recall and F1 scores are calculated as they were in the previous evaluation, from the counts for relations in the hierarchy.

Wilcoxon Signed Ranks Test

In statistics, a result is significant if it was unlikely to have occurred by chance. A statistically significant result means there is statistical evidence that there is a difference (not necessarily large) between two sets of data. We can use statistical significance to compare two parsers, by testing whether the difference in their performance is statistically significant. Therefore, we can determine whether a change in the parsing model improves the parser’s accuracy or whether an increase in accuracy occurs simply by chance.

The Wilcoxon Signed Ranks (Wilcoxon, henceforth) test is a nonparametric test for statistical significance that is appropriate when there is one data sample and several measures, for example, to compare the accuracy of two parsers over the same data set. As the number of samples (sentences) is large we use the normal approximation for z. Siegel & Castellan (1988) describe and motivate this test. These results are computed over micro-averaged F1 scores for each sentence in the test corpus. We first determine Cd, the subset of sentences in the test corpus C for which the accuracy differs between parsers. We determine the set of accuracy differences between the parsers for each sentence in Cd, ranking the (absolute values of) these differences in order from smallest to largest in a list D. The statistic T+ is the sum of the ranks of the positive, non-zero differences, where di ∈ D is the difference ranked i (this ranking starts at 1):

T+ = ∑_{di ∈ D, di > 0} i

Mean, variance and observed z are determined as follows, where N is the size of the set Cd:

Mean = μ_{T+} = N(N + 1) / 4

Variance = σ²_{T+} = N(N + 1)(2N + 1) / 24

z = (T+ − μ_{T+}) / σ_{T+}
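A Python sketch of the complete test over per-sentence F1 scores for two parsers follows. It is an illustration of the formulas above rather than the exact procedure used in this work: ties between absolute differences receive consecutive ranks rather than averaged ranks, and the reported tail probability is two-sided, which is an assumption.

from math import sqrt, erf

def wilcoxon_z(f1_a, f1_b):
    """Normal approximation to the Wilcoxon signed ranks test.

    f1_a, f1_b: per-sentence F1 scores for the two parsers over the same sentences.
    Returns the observed z and its (two-sided) normal tail probability.
    """
    diffs = [a - b for a, b in zip(f1_a, f1_b) if a != b]   # the set Cd of nonzero differences
    ranked = sorted(diffs, key=abs)                         # rank |d| from smallest to largest
    t_plus = sum(rank for rank, d in enumerate(ranked, start=1) if d > 0)
    n = len(diffs)
    mean = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24
    z = (t_plus - mean) / sqrt(variance)
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))         # two-sided normal tail probability
    return z, p

# A difference is judged significant at the 0.05 level if p < 0.05.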

We use a 0.05 level of significance, which indicates that there is a 5% chance that we incorrectly find the results are statistically significant when they are not. We provide z-value probabilities for significant results reported below, where a result is statistically significant if the z-value probability is less than 0.05. Therefore, providing the z-value probabilities enables the reader to ascertain the level of statistical significance that applies. For example, a z-value probability of 0.01 is judged significant in this work (as it is less than 0.05), and indicates that there is a 1% chance that this conclusion is incorrect.

1.5 We define the new grammar and GR hierarchy only. Readers are referred to Briscoe et al. (2002) for a diagram of the previous GR hierarchy.

1.4 Research Goals

The work in this thesis aims to improve the parser accuracy, efficiency and training methods of the extant parser. While these methods utilise (and modify) RASP’s extant parsing code and components, the parse selection and training methods we develop herein are applicable to a wider range of statistical parsers. Initial experimentation, including optimising PoS tag models and developing an efficient algorithm to determine the weighted GR output format, aids the author’s familiarisation with the RASP system and processing modules. Further experimentation predominantly aims to improve the training methods available to the parser: in particular, to develop semisupervised training methods, which require no on-going manual effort on behalf of the grammar writer, to facilitate grammar development.

1.5 Thesis Summary

1.5.1 Contributions of this Thesis

The research in this thesis provides a number of new results and techniques (all techniques and the majority of results have been published), in the following areas:

• Optimising front-end PoS tagging models: Watson (2006) provides a broad comparison of PoS tag models in terms of both tagging and parsing performance. Experimental results illustrate that parsers are unable to improve on the tagging performance of a ‘good’ PoS tagger. Resolving the majority of PoS tag ambiguity in the tagger aids in both parser accuracy and the overall system’s efficiency.

• Optimising weighted-GR output: Watson et al. (2005) define a novel modification to the Inside-Outside algorithm. This efficient dynamic programming approach directly determines the weighted-GR output format from the compact representation of parses, the parse forest. The approach improves over previous work, which either loses efficiency by unpacking the parse forest, or places extra constraints on local ambiguity packing, leading to less compact forests. This novel algorithm significantly increases the throughput and accuracy of this output format.

• Semisupervised parser training: Watson et al. (2007) illustrate more efficient and flexible use of existing training data. The Inside-Outside algorithm is applied to perform Expectation-Maximisation over the LR parse table to improve performance over the current, fully supervised training method. Removing the manual effort required to train the parser facilitates grammar development and domain adaptation.

• Confidence-based semisupervised training: Watson et al. (2007) describe a new training framework that defines a weighting scheme over analyses considered consistent during training. This method outperforms the Inside-Outside algorithm given the same level of corpus annotation. Further, the method can be applied to any level and type of annotation

by simply redefining which parses are compatible with the corpus annotation available during training.

1.5.2 Outline of Subsequent Chapters

Chapter 2 (LR Parsers) describes the theory, compilation, and application of LR parsers, or so-called shift-reduce parsers. We describe the modifications defined by Tomita (1987) which extend the LR framework to ambiguous grammars, resulting in the GLR parsing framework. We then describe various methods for defining statistical distributions over the GLR model, that is, to create statistical GLR (SGLR) parsers. Further, the chapter defines the SGLR parsing framework employed within RASP. Finally, we describe RASP’s grammar, the extant training and application of the parser, as well as the different output formats available for use.

Chapter 3 (Part-of-speech tag models) describes experimentation aimed at optimising the PoS tag models employed by the extant parser. We first describe previous work which illustrates that the choice of the PoS tag model employed as a front-end to parsing significantly affects the speed and accuracy of the parser. Next, we describe RASP’s processing components, in particular, the current PoS tagger. We define a number of different tag selection models, including those that consider the parser itself as a PoS tagger. Finally, we investigate the optimum level of PoS tag ambiguity to be resolved by the parser itself, rather than the front-end PoS tagger.

Chapter 4 (Efficient extraction of weighted GRs) first describes the Inside-Outside algorithm (IOA) and its application to train PCFGs. We extend this algorithm to SGLR parsers, in particular, to the extant parsing framework. We show how to apply a variant of the IOA to efficiently extract one of RASP’s output formats (‘weighted GRs’) directly from the parse forest rather than over the set of n-best parses (the extant method). Furthermore, we illustrate that this solution cannot be applied in all situations. Rather than modify the extant parse forest as previous work has, resulting in a less compact data structure, we describe a novel modification to the IOA. This modification enables multiple inside and outside probabilities to be determined for each node within the parse forest. Experimental results compare this novel approach to the current one, in terms of both parser throughput and accuracy.

Chapter 5 (Confidence-based training) reviews the current training method and describes the limitations of such fully supervised training approaches. These limitations have prompted the development of unsupervised and semisupervised methods, which we describe. We then define a confidence-based training framework which can be applied over any level or type of corpus annotation, and discuss its relationship to previous work. We also define a number of different confidence measures that can be employed within this framework. Experimental results compare these confidence measures and the IOA (Expectation-Maximisation), over semisupervised and unsupervised training corpora, to the current training method.

Chapter 6 (Conclusion) summarises the major contributions of this thesis, and suggests future lines of research.

Chapter 2

LR Parsers

This chapter describes the theory and application of LR parsers, which are a type of shift-reduce parser. We define LR parsers and generalised LR (GLR) parsers in §2.3 and §2.4, respectively. These descriptions build on the theory of finite-state automata we provide in §2.2. We describe existing statistical approaches over the latter parsing framework in §2.5. Experiments in this work use an existing and well-developed GLR parsing system which we modify as required. Finally, we provide details specific to the extant parser in §2.6. Much of the theory described herein regarding finite automata and LR parsing (including parser compilation) has been adapted from Aho et al. (1986).

2.1 Introduction

The LR parsing strategy was first devised for programming languages, in particular compilers, to enable precompilation of processing steps over a language. The strategy has since been generalised for use in a wider range of applications, including natural language parsing. We previously defined a natural language parser (in §1.1.1) as a program analysing a string of words (a sentence) to return a suitable grammatical structure: a parse. However, in the literature, the term parser refers to any program that processes a sequence of input tokens to return structure over these tokens, given an underlying formal language or grammar. In linguistics, the term grammar applies to various structures of human language including phonology, morphology and syntax. There are many types of grammars. We describe context-free grammars (CFG), a well known type of generative grammar.

A CFG defines the set of possible input tokens, the grammar terminals (that is, the alphabet Σ) of the grammar, in the lexicon. The production rules (rules or productions for short) define the way in which these terminals may acceptably combine. The primary goal of a parser is to organise the input sequence of terminals, based on the rules, into larger grammatical units or phrases. These larger units are referred to as nonterminal (NT) symbols of the grammar. Rules are written as rewrites: A → (β)∗ where A represents an NT symbol of the grammar and β may be any category of the grammar, i.e. a Σ or NT symbol. For example, the rule NP → Det N stipulates that a noun phrase NP may be formed by two terminal symbols: a determiner (Det) followed by a noun (N). As the NT categories combine Σ categories of the grammar, we consider this set of terminals the span (or word span, given that terminals are often words in natural language parsing) of the resulting NT category. One (or more) of these NT symbols are considered root (or top) categories. If such a top category spans the entire input sequence, then we accept the input and

28 2.2 Finite Automata 29 return the corresponding structure; a parse (or derivation). This structure can be graphically represented as a tree diagram (see Figure 1.1) where leaves are terminals while nonleaf nodes represent nonterminals (phrases) of the grammar. Hence, the term “parse tree” (or “phrase- structure tree”) is used to refer to this structure. We can formally define a grammar G using the tuple {NT,Σ,P,R} where:

• NT is the finite set of nonterminal symbols.

• Σ is the finite set of terminal symbols, disjoint from the set NT.

• P is the finite set of production rules of the grammar:

NT → (Σ|NT)∗

• R ∈ NT, the root category symbol. 2.1
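To make this definition concrete, a minimal Python sketch of the grammar tuple is given below; the structures and names (Rule, Grammar) are illustrative only and do not reflect the extant parser's implementation.

# A minimal sketch (not the parser's actual implementation) of the grammar
# tuple {NT, Sigma, P, R} as a Python structure; names are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    mother: str            # the NT symbol on the left-hand side
    daughters: tuple       # a sequence of terminal and/or NT symbols

@dataclass
class Grammar:
    nonterminals: set      # NT
    terminals: set         # Sigma, disjoint from NT
    rules: list            # P, the production rules
    roots: set             # root (top) categories; RASP allows several

# The example rule NP -> Det N from the text:
g = Grammar(
    nonterminals={"NP"},
    terminals={"Det", "N"},
    rules=[Rule("NP", ("Det", "N"))],
    roots={"NP"},
)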

We first define finite-state automata (FSA), that can be described with regular expressions (RE). REs are one way of characterising regular languages (a proper subset of the languages generated by a CFG). We describe how FSAs are graphically represented as transition graphs, though are encoded and processed by the parsing program using the corresponding transition table. We then describe LR parsers, that implicitly encode probabilistic deterministic FSA (PDFA), thence also utilise a similar table, the LR table, to drive the parser. 2.2 Finite Automata If the parser’s grammar identifies acceptable sequences of terminals only, without associating structure, then the parser is instead considered a recogniser. A recogniser takes as input a sequence of tokens and returns either ‘accept’ or ‘reject’ as output. For example, given the RE (a|b)∗ab, we only accept a sequence of input tokens if it consists of any number of a and b tokens (including 0) followed by the sequence ab. The sequences ab and bbabab are accepted, while aabb is not. FSA can apply either as a parser or as a recogniser given a language (grammar). Aho et al. (1986) describe a number of algorithms to compile an FSA from a RE, which we do not discuss here. We aim instead to illustrate how these automata are used as recognisers for RE, and build on this theory in subsequent sections to illustrate the automata implicit in LR parsers over CFGs. 2.2.1 NFA Model Definition A nondeterministic FSA (NFA) is defined using:

• A set of states S.

• A set of input symbols T.

• A transition function action that maps state-symbol pairs to the set of next possible states.

2.1 While a single root category symbol is defined, the grammar in the existing parser RASP defines several root category symbols. Thus, the following explanations assume more than one root category symbol may be defined in the parser's grammar.

• A state S0 ∈ S, which is the start (or initial) state.

• A set of states Sacc ⊆ S, the set of end (or accept) states.

An FSA is deterministic or nondeterministic, where nondeterministic means that more than one transition out of a state is possible on the same input symbol. That is, action returns one or more possible states for any given state-symbol pair. The null input symbol ε is included in T. Transition Graph The transition graph diagrammatically represents an NFA as a labelled directed graph. Vertexes of the graph each correspond to a state in the set S, while labelled edges represent the output of the function action. The graph looks like a transition diagram, where edges are labelled with terminals of the grammar, i.e. T = Σ. Given the RE in the previous example, (a|b)∗ab, the corresponding transition graph is shown in Figure 2.1. In this example, the NFA is defined over S = {0,1,2}, where S0 = 0 and Sacc = {2},2.2 T = {a,b}, and action is implicit in the labelled edges of the graph. For example, action(1,b) = {2} while action(1,a) = {}.

[Transition graph: start state 0 with self-loops on a and b; an edge labelled a from state 0 to state 1; an edge labelled b from state 1 to the accept state 2.]

Figure 2.1: NFA for the RE (a|b)∗ab.

Paths An NFA acting as a recogniser accepts a sequence of input symbols if we can find a complete path through the NFA from the start state (S0) to an end state (in Sacc) which consumes the entire input sequence. A path represents a state sequence, the sequence of states visited thus far given the input consumed. For example, we accept the input sequence abab as we determine the complete path {0,0,0,1,2} through the NFA. To determine a path we utilise two variables: the current state Sc and the remaining input terminals. To start, we initialise these variables to the start state Sc = S0 and entire input sequence, respectively. The transitions defined by action (edges) are parsing actions; these movements are conditioned on the current state Sc and next token in the input sequence. When we move along each edge of the NFA, we consume the terminal in the input sequence that labels the edge. Returning to our previous example, for the sequence abab, in state 1 we have consumed aba and have the remaining input {b}. Thus we move from state 1 to state 2 consuming b and, reaching one of the end states with no remaining input, we return 'accept'. Transition Table A transition table is used as an operational mechanism to drive the parser across the corresponding transition graph. There is a simple mapping from rows and columns in the table to states and terminals (edge labels) in the graph. Thus the table can be determined from the graph and vice versa.

For example, Table 2.1 illustrates the corresponding transition table for the NFA graph in Figure 2.1. Each row in the transition table represents a state in the NFA, while columns each represent a terminal symbol. Within each cell in the table, in row i and column j, is the set of possible edges (actions) of the NFA for state Si and the j-th terminal (Σj). Thus, the table encodes the function action, where each cell contains the set of states returned by action given the state-symbol pair.

2.2 We utilise the state number i in Si to identify each state.

               INPUT SYMBOL
STATE          a          b
0              0,1        0
1                         2
2

Table 2.1: Transition table for the NFA in Figure 2.1.
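As an illustration, Table 2.1 might be encoded as a Python dictionary mapping state-symbol pairs to sets of next states; a recogniser then simply tracks the end states of all active paths. The encoding and names below are illustrative.

# One possible encoding of Table 2.1; empty cells simply have no entry,
# which models the edge to the sink state.
TRANSITIONS = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
}

def action(state, symbol):
    # The set of next possible states for a state-symbol pair (empty = reject).
    return TRANSITIONS.get((state, symbol), set())

def nfa_accepts(tokens, start=0, accepting=frozenset({2})):
    states = {start}                      # the end states of all active paths
    for tok in tokens:
        states = {nxt for s in states for nxt in action(s, tok)}
        if not states:                    # every active path has terminated
            return False
    return bool(states & accepting)       # a complete path reached an accept state

assert nfa_accepts("abab") and not nfa_accepts("aabb")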

If action returns null (i.e. for empty cells in the table) this implies an edge to the sink state exists, at which point the input sequence is rejected. As a result, we consider the corresponding path to terminate at the current state. A path is considered active until it either terminates or completes. Multiple Paths More than one path may be possible through an NFA given an input sequence. Multiple entries in a single cell represent such ambiguous transitions. In this case, we continue processing all active paths until we determine a single complete path, at which point we accept the input. Alternatively, if all active paths terminate, we reject the input. At the state where multiple paths (edges) are found, we consider the paths to diverge. Conversely, paths are considered to merge for shared portions of their state sequence. For the example sequence abab, if we consume the first terminal a then two paths are active; one ending in state 0 and one in state 1. From these states, we consume b and move to states 0 and 2, respectively. At state 2, our path terminates as there is still the unexpended input {a,b}. In contrast, the path ending at state 0 remains active, though again we diverge to two paths: to states 0 and 1. After consuming the last token b, the first path terminates and the second path ends at state 2. The latter path ends at an accept state and is now considered complete. Formal Language An NFA implicitly defines its formal language; the set of all input strings it accepts. Returning to our previous example, the formal language is the set: {ab,aab,bab,aaab,baab,abab,bbab,...}. 2.2.2 DFA A deterministic FSA (DFA) is a special case of an NFA in which only one move is possible for any given state-symbol pair, and no state has an edge with the null symbol ε. Aho et al. (1986) describe an algorithm to compile an NFA to its equivalent DFA, which results in a more efficient recogniser. However, the number of states of the DFA may expand exponentially and this conversion is infeasible for large NFAs. We do not provide details of this algorithm as they are not relevant to this work. Figure 2.2 illustrates the equivalent DFA for the NFA in Figure 2.1. Given the automaton is deterministic, we utilise a list to store the corresponding state sequence (path) and input consumed. This list is referred to as the stack.

Figure 2.2: DFA for the RE (a|b)∗ab.

The stack U is initially empty, and new states and input symbols consumed are added to the stack as we move through the NFA. Continuing the previous example, the sequence abab was accepted with state sequence {0,0,0,1,2}. This state sequence is determined given the stack built during parsing: {0,a,0,b,0,a,1,b,2}. Algorithm: Simulating a DFA The algorithm in Figure 2.3 simulates a DFA, taking as input a sequence terminated by the end-of-file character (eof) and returning either 'accept' or 'reject'. The function next returns the next terminal in the input sequence.

Algorithm 1 (Simulating a DFA).
   Set Sc = S0 and U = {S0}
   1. repeat forever begin
   2.    l = next()
   3.    Set Sc = action(Sc, l)
   4.    if Sc ∈ S then U = U ∪ {l, Sc};
   5.    if Sc ∈ Sacc and l = eof then return 'accept';
   6.    if Sc = null or l = eof then return 'reject';
      end

Figure 2.3: Algorithm to simulate a DFA.
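For concreteness, the Python sketch below mirrors Algorithm 1; the transition table used is the standard minimal DFA for (a|b)∗ab and is assumed to match Figure 2.2, and the eof marker is illustrative.

# A Python rendering of Algorithm 1 over an assumed DFA for (a|b)*ab.
EOF = "$eof$"   # illustrative end-of-file marker

DFA = {(0, "a"): 1, (0, "b"): 0,
       (1, "a"): 1, (1, "b"): 2,
       (2, "a"): 1, (2, "b"): 0}

def simulate_dfa(tokens, start=0, accepting=frozenset({2})):
    """Simulate a DFA over `tokens`, recording the stack U of symbols and states."""
    state, stack = start, [start]            # Sc = S0, U = {S0}
    for tok in list(tokens) + [EOF]:
        if tok == EOF:                       # no more input to consume
            return "accept" if state in accepting else "reject"
        state = DFA.get((state, tok))        # Sc = action(Sc, l)
        if state is None:                    # an edge to the sink state
            return "reject"
        stack += [tok, state]                # U = U ∪ {l, Sc}

print(simulate_dfa("abab"))   # accept
print(simulate_dfa("aabb"))   # reject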

2.3 LR Parsers LR parsing is a table-driven technique, where the LR table is constructed from an unambiguous CFG and implicitly encodes a DFA. The LR table defines the next parsing action (edge of the DFA), so that the parser is driven over a sequence of input terminals in a similar fashion as described for a recogniser over a DFA. Consequently, we move over the input left-to-right, hence the 'L' in 'LR' parsing. We reduce the input consumed to form nonterminal categories whenever possible, which results in creating the 'right-most derivation in reverse', hence the 'R' in 'LR'. The LR precompilation process results in construction of the LR table over the CFG, similar to the transition table described previously. This process represents the bulk of the processing required, and subsequently, the parsing process is relatively simple to encode and apply. We describe this precompilation process in detail in this section. Algorithms that generalise these methods to apply over ambiguous grammars, and moreover, to define statistical models within this generalised framework, are discussed in subsequent sections. 2.3.1 LR Parsing Model In this section we describe how to compile and apply an LR parser over CFGs, in which rules are rewrites as defined previously: NT → (Σ|NT)∗. Following popular terminology, we refer to the left hand side category as the mother category, while terminals and nonterminals on the right hand side of the rule are daughters of the production. Relationship to the Recogniser

[Figure: components of an LR parser — the input a1, a2, ..., ai, ..., an, $; the stack S0 X1 S1 ... Xm Sm; the LR table; and the LR parsing program, which produces the output.]

Figure 2.4: Components of an LR parser.

Figure 2.4 illustrates the general components of an LR parser, similar to those of the recog- niser described in the previous section. In summary, the LR parser we describe differs from the recogniser as follows:

• Input: again consists of sequences of terminals of the grammar (from the set Σ), although Σ does not include ε and instead includes the end-of-sentence marker $. Input sequences are appended with the symbol $ (rather than eof) to denote the end of the input sequence.

• Parsing actions: consist of the type of actions described previously (consuming a terminal symbol, accepting and rejecting input) as well as a fourth action type: reduce actions, in which rules (rewrites) of the regular grammar are applied to form mother (NT) categories from the input consumed. The LR parser is driven over the input using shift and reduce actions, hence the parsers are a type of shift-reduce automata.

• Stack: differs in structure and use from the stack described in §2.2.2. Instead of using a list structure, the stack is implemented as a last-in-first-out (LIFO) queue. That is, symbol-state pairs are removed (POP operations) as well as added (PUSH operations) to the top of the stack. The POP operations occur when a reduce action is applied. We first remove the daughters of the rule before we add the newly created mother category of the rule to the stack.

• LR table: differs from the transition table as it also encodes reduce actions. The LR table consists of two parts; an action and goto table. The action table is similar to the transition

table, where rows represent states and columns represent terminals of the grammar. Cells contain competing parse actions given the current state Sc and terminal symbol. The goto table encodes the next state after a reduce action is applied.

• Output: we instead return the corresponding parse tree, whose root category spans the entire input (if found). Thus, we implicitly accept input if we can find such a derivation that is determined from the corresponding complete path through the underlying DFA.

2.3.2 Types of LR Parsers There are several types of LR parse tables, where each type results in creation of the correspond- ing LR parser type, and components of the LR parser are otherwise the same as in Figure 2.4. Aho et al. (1986) describe three techniques to construct an LR(k) parsing table for a grammar; simple LR (SLR), lookahead LR (LALR) and canonical LR. These methods represent increas- ing left context used to make parsing decisions. Similarly, the variable k represents the number of lookahead items, the unexpended terminals in the input sequence, used as right context. As the level of context (the LR method and k) increases, so too do the number of states in the DFA. The LALR(1) method is popular as it provides a trade-off between accuracy (context) and effi- ciency (number of LR states). We describe LALR(1) parser construction and application herein, as we apply such an LALR(1) parser, modified to handle an ambiguous unification-based gram- mar. 2.3.3 Parser Actions The types of actions applied by an LR parser include those considered previously for a recog- niser. We define these three action types informally described thus far, as well as a fourth action type which applies a grammar rule. LR Parser Configuration We represent the configuration of the parser using a pair; the current stack state and unexpended input. For example consider the following configuration:

(S0X1S1X2S2...XmSm,aiai+1...an)

We denote categories of the grammar (terminal and nonterminal) using the variable X. This configuration corresponds to being in state Sm, i.e. Sc = Sm, the next input symbol is ai and the input consumed thus far is (a1...ai−1). The action Function We refer to the unexpended input as the set of lookaheads, and the next input token as the current lookahead (or simply lookahead for short): lac. In the recogniser described previously, we determine the next state using the function action which returns the next state given the current state and lookahead: action(Sc,lac). The next state specified results from one of three actions: (i) accept the input, (ii) reject the input or (iii) consume the input item ai and move to the next state specified by action, i.e. Sc = action(Sm,ai). The final action type is referred to as a shift action in LR parsing. These three action types are encoded in the action table of the LR table. 2.3 LR Parsers 35

Reduce Actions In LR parsing, the fourth action type, a reduce action, requires different treatment of the con- figuration’s stack. A reduce action corresponds to applying one of the rules (productions) in the grammar A → β. Thus we form the category A by combining daughter categories β already created, where β = {D1,D2,...,Dp}. The stack must contain these categories formed in order, that is, with Dp on top of the stack. We remove these daughter categories from the stack and place the newly created mother category A, and the next state, on the stack. Note that for a reduce action, we do not consume the next input symbol. For example, if the grammar contains the rule VP → V NP and the stack contains a NP on top and then a V, we first remove these two items from the stack before placing the new VP analysis and next state on the stack. Thus, as the stack is implemented as a LIFO queue, a POP operation removes the daughters of a grammar rule, while PUSH adds the newly created mother category on the stack. For a reduce action, the next state to visit depends on the state exposed after popping the daughter categories from the stack, the ancestor state, and the newly formed mother category A. This information is encoded in the goto table of the LR table, which we describe further in subsequent sections. For now, we consider the function GOTO to return the next state to visit, given the ancestor state and mother category. Configurations for each Action Type The type of action returned by action, results in different parser configurations, as follows:

• Shift actions: if action(Sm,ai) = shift Sj, then the parser executes a shift action and moves to state j, consuming an input item as we did in the recogniser, entering the configuration:

(S0X1S1X2S2...XmSmaiSj,ai+1...an)

• Reduce actions: if action(Sm,ai) = reduce A → β then we execute the reduce action by popping off the p daughters of the production from the stack, exposing the state Sm−p. Assuming Sj = GOTO(Sm−p,A) then the resulting configuration will be:

(S0X1S1X2S2...Sm−pASj,ai...an)

• Accept: if action(Sm,ai) = accept, then parsing is complete.

• Reject: if action(Sm,ai) = null, then the path terminates at Sm and we reject the input.
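The following Python sketch illustrates the deterministic case of this parsing loop over the configurations above. The table encodings (tuples such as ('shift', j) and ('reduce', A, p), where p is the number of daughters) are illustrative and are not those of the extant parser.

# A schematic (deterministic) LR driver following the configurations above.
# action_table[(state, lookahead)] holds one of ('shift', j),
# ('reduce', mother, n_daughters), 'accept', or is absent (reject);
# goto_table[(ancestor_state, mother)] gives the next state after a reduce.
def lr_parse(tokens, action_table, goto_table, start_state=0, end="$"):
    stack = [start_state]                       # S0 X1 S1 ... Xm Sm
    tokens = list(tokens) + [end]
    i = 0
    while True:
        state, la = stack[-1], tokens[i]
        act = action_table.get((state, la))
        if act == "accept":
            return stack                        # parsing is complete
        if act is None:
            return None                         # the path terminates: reject
        if act[0] == "shift":
            stack += [la, act[1]]               # consume a_i, push it and S_j
            i += 1
        else:                                   # ('reduce', A, p)
            _, mother, p = act
            del stack[len(stack) - 2 * p:]      # pop the p daughters (and their states)
            ancestor = stack[-1]                # the exposed ancestor state
            stack += [mother, goto_table[(ancestor, mother)]]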

2.3.4 LR Table Action and Goto Tables The LR table consists of two parts, the action table and the goto table, and implicitly encodes the grammar DFA. The action table, encoded in a similar fashion to the transition table of a DFA, defines the set of possible actions given the current state Sc and the lookahead lac. States correspond to rows and lookaheads to columns, while cells in the table contain parsing actions. In the goto table, the rows and columns correspond to the ancestor state and the nonterminal category created, respectively.

Grammar G1

The unambiguous grammar G1, shown in Figure 2.5, is a simple CFG which models prepositional phrases (PP), noun phrases (NP) and verb phrases (VP), i.e. NT={PP, NP, VP, S} over the terminals verb, pronoun and preposition: Σ={V,Pro,P}. The original root category of the grammar was S. However, to handle the case where more than one root category is specified, the augmented grammar instead defines a single root category T and new rules (for each of the previous root categories where each rule rewrites the new root category as the old category).

r0: T -> S
r1: S -> NP VP
r2: NP -> Pro
r3: PP -> P NP
r4: VP -> V NP
r5: VP -> V NP PP

Figure 2.5: Grammar G1. Each rule number, using the format rx, refers to production number x. The single top category of an augmented grammar (e.g. T) allows the resulting LR parser compiled over the grammar to indicate when the input should be accepted. That is, the input is acceptable if an action calls for the creation of this top category. Thus, we consider the LR parser to consist of a single accept state Sacc. Grammar augmentation is performed during construction of the LR table, that is, as part of the precompilation process.

Grammar G1: LR Table

The corresponding LR table for G1 is shown in Table 2.2, where si specifies a shift to state i while rj specifies a reduce by production number j of the grammar. The 'accept' action and empty cells specify that we should accept and reject the input, respectively. The goto table simply encodes the next state number given the ancestor state (row) and newly created mother category (column).

Grammar G1: DFA

The corresponding grammar DFA for G1 is shown as a transition graph in Figure 2.6. Each vertex corresponds to a state in the LR table and the vertex number illustrates the corresponding state number. Vertexes may also specify, in brackets, the ancestor state which must have been exposed to enable the transition to the state. Edges are compressed and labelled with the set of lookahead items for the edge (i.e. a single edge corresponds to each lookahead item) and also the action which is performed. This transition graph provides a simplified view of the LR parser because the graph is conditionally traversed across a set of edges from a state that specifies the same reduce rule. That is, although multiple edges are shown for the same lookahead(s) and reduce rule, only one is applicable given the ancestor state exposed. Consequently, the automaton is deterministic. Relationship between the Grammar DFA and LR Table The grammar DFA is constructed from the table in three stages as follows: • Vertex creation: Each row in the table represents a state. As a result, we create one vertex in the graph for each state in the table. If the vertex's number is defined in a cell in the

state    action                              goto
         $        P        Pro      V        S    NP    VP    PP
0                          s3                9    1
1                                   s2                  8
2                          s3                     4
3        r2       r2                r2
4        r4       s5                                          7
5                          s3                     6
6        r3
7        r5
8        r1
9        accept

Table 2.2: LALR(1) table for G1.

[Transition graph: vertexes 0–9, each labelled with its state number and, in brackets, the ancestor state required for the transition into it (e.g. 1(0), 4(2), 6(5), 7(4), 8(1), 9(0)); shift edges are labelled with terminals (Pro, V, P) and reduce edges with lookahead sets and rule numbers, ending in the accept action from state 9.]

Figure 2.6: DFA for grammar G1.

goto table for state (row) Sa, then we label the vertex with the ancestor state Sa. This illustrates that a reduce action should have exposed Sa on the stack to enable transition along an edge to this vertex in the DFA.

• Shift edge creation: Each shift in the action table s j for state (row) Si in column Σk specifies that a directed edge should be created from state i to state j with label Σk.

• Reduce edge creation: For each reduce action in the action table r j, specifying a reduction of rule A → β, for state (row) Si in column Σk results in creation of one or more edges. We create a directed edge from state i to each state Sl, where Sl appears in the column for 38 2. LR PARSERS

category A in the goto table.

2.3.5 Parsing Program The Parsing Process Parsing determines paths through the underlying DFA, as described previously, from the start state S0 to the accept state Sacc given the current input. To begin parsing, we first initialise the variable representing the current state to the start state. That is, we set Sc = S0 and also set the current lookahead lac to the first input token. The parsing program is implemented by extending the algorithm for simulating a DFA, as defined in §2.2.2. Lines 2–4 of the algorithm are replaced by statements conditioned on whether the action returned by action(Sc,lac) is a shift or reduce. The next input symbol is only consumed if the action is a shift. The stack is updated for each action type as described in §2.3.3.

Parsing over G1

Table 2.3 illustrates the stack configurations reached after each action is applied for the G1 LALR(1) parser over the sentence ‘he likes her’ with corresponding tagged input sequence: {he Pro likes V her Pro $}. The resulting parse tree is extracted from the stack by keeping track of which rules applied in order and the word span of each category. In this example, the resulting parse tree is: (S (NP he Pro) (VP likes V (NP her Pro))).

number   stack                  input          action
1        0                      Pro V Pro $    shift to state 3
2        0 Pro 3                V Pro $        reduce using r2
3        0 NP 1                 V Pro $        shift to state 2
4        0 NP 1 V 2             Pro $          shift to state 3
5        0 NP 1 V 2 Pro 3       $              reduce using r2
6        0 NP 1 V 2 NP 4        $              reduce using r4
7        0 NP 1 VP 8            $              reduce using r1
8        0 S 9                  $              accept

Table 2.3: Stack configurations for G1 over PoS tag sequence {Pro,V,Pro}. The first column shows the configuration number while the next two columns form the tuple of each configuration reached. The final column illustrates the next action, determined using the stack top state (Sc) and current lookahead (the first token in the input column).
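As an illustration, Table 2.2 can be written out for the lr_parse sketch given in §2.3.3; running it over the tag sequence {Pro, V, Pro} passes through the same configurations as Table 2.3. The encoding below is illustrative.

# Table 2.2 written out for the illustrative lr_parse sketch above.
G1_ACTION = {
    (0, "Pro"): ("shift", 3),
    (1, "V"): ("shift", 2),
    (2, "Pro"): ("shift", 3),
    (3, "$"): ("reduce", "NP", 1), (3, "P"): ("reduce", "NP", 1), (3, "V"): ("reduce", "NP", 1),
    (4, "$"): ("reduce", "VP", 2), (4, "P"): ("shift", 5),
    (5, "Pro"): ("shift", 3),
    (6, "$"): ("reduce", "PP", 2),
    (7, "$"): ("reduce", "VP", 3),
    (8, "$"): ("reduce", "S", 2),
    (9, "$"): "accept",
}
G1_GOTO = {(0, "S"): 9, (0, "NP"): 1, (1, "VP"): 8, (2, "NP"): 4,
           (4, "PP"): 7, (5, "NP"): 6}

print(lr_parse(["Pro", "V", "Pro"], G1_ACTION, G1_GOTO))
# -> [0, 'S', 9], the final configuration (number 8) in Table 2.3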

2.3.6 Table Construction LR(k) Parsing and Handles The LR precompilation process results in compiling the states of the LR(k) table and the actions possible given the current state and next k lookaheads (including 0, for SLR parsers). This process facilitates a parsing mechanism capable of identifying the handle: the appropriate substring to reduce, along with the rule which should be applied to perform the reduction. Thus, LR tables encode additional context over the underlying CFG. Left-context is incorporated as states represent the handle, while right-context is incorporated in the set of k lookaheads.

For an LR(k) table, we identify which handles are appropriate given the next k input symbols as right context. In the shift-reduce LR parser, the state on top of the stack encodes all the information required to determine the current handle (or set of handles if the method is generalised to apply over an ambiguous grammar). LALR(1) States In an LALR(1) parser, each state represents the same (left) input having been consumed; either a terminal category of the grammar or a nonterminal category with the same left-context (category). For example, in G1, NP may appear as the subject (in rule r1), the object of a verb (in rules r4 and r5) or within a prepositional phrase (in rule r3). Consequently, there are three corresponding states for NP created in each of these contexts: states 1, 4 and 6, respectively. A trivial method to determine how many contexts each nonterminal category may appear in involves counting the number of states that appear within the corresponding column for this category in the goto table. LR(k) Items Algorithms to construct the LR table from the grammar differ depending on the type of LR table, that is, the type of LR parser we wish to construct. In general, we create LR(k) items, tuples in the form [A → α · β,δ], where the first element is a production of the grammar (with a dot at some position within the daughters of the rule) and δ is a set of k (including 0) terminal symbols of the grammar. The position of the dot indicates the input that has been witnessed to date; daughters to the left of the dot have been seen while daughters to the right of the dot may yet be witnessed (i.e. possible given the grammar).

S → · NP VP
S → NP · VP
S → NP VP ·

Figure 2.7: LR(0) items for the rule S → NP VP.

Kernel Items We can define kernel (and conversely, nonkernel) items as those items with the dot in a position other than immediately to the right of the arrow (→). In addition to these items, we also consider the nonkernel items for the augmented top category (e.g. T →· S) as kernel items. FIRST and FOLLOW If the dot appears at the end of the production (e.g. A → αβ·) then this indicates that a reduce rule is possible at this point during parsing. For example, the third LR(0) item in Figure 2.7 indicates that an NP is followed by a VP, so we reduce to form the mother category S. Alterna- tively, if A → α · β is an item where β ∈ Σ then we perform a shift over terminal β. We utilise the functions FIRST and FOLLOW to determine this kind of information from the item sets 40 2. LR PARSERS where FIRST(X) and FOLLOW(X) indicate the terminal symbols that the symbol X may be- gin with or be followed by in the given grammar. For example, in G1, FIRST(NP) = {Pro} while FOLLOW(NP) = {V,P,$}. Collections of Item Sets: States The precompilation process begins by augmenting the grammar and determining the set of all possible items given the CFG. Items are then grouped in sets which give rise to the states of the parser. During construction of these item sets, we effectively determine the compiled DFA for the grammar, as each new item set is created from the existing set by consuming the next terminal or nonterminal category to appear. From the resulting compiled DFA, we determine the set of LR states to be the collection of item sets. The actions of the LR parser are also encoded in this DFA.2.3 closure and goto In order to group the item sets we require two functions: closure and goto (where this goto function is distinct from the uppercase GOTO defined previously). The function closure is defined over a set of items I, where we include each item of I in closure(I), and expand each item in I to include all the nonkernel items for categories immediately to the right of the dot in the item. We repeat this expansion over the items until all such nonkernel items have been included. Formally, given A,B,C ∈ NT and α,β,γ,ρ ⊆ {Σ,NT,ε}, then if A → α · Bβ is in the closure of I then so too is B → ·Cγ as is C → ·ρ and so on and so forth. As the set of nonkernel items for any given category is computed (or for efficiency, precomputed) for each category of the grammar, we compress the set of items in a closure so that each item set is represented only by the kernel items. However, in the following examples we represent each item set in full, showing both kernel and nonkernel items of the set. The second function, goto, is defined in terms of the closure operation. Simply, goto(I,X) represents the items which result if items in the set I receive a grammar symbol X as input. Thus, goto(I,X) is defined to be the closure of the set of all items A → αX · β such that A → α · Xβ in the set I. Constructing the Canonical LR(0) Item Sets To construct the canonical collection of LR(0) items, we initialise the set of items Θ by taking the closure of the top category kernel items with the dot at the very start, T →· S. We repeatedly determine the goto(I,X) for each I ∈ Θ and X ∈ {Σ,NT} adding the resulting goto(I,X) item to a set in Θ if they are already present in this set. If not, we create a new item set which includes the new item. For each new item set we also perform the closure operation over this set I to augment the item set with the nonkernel items. 
The set Θ forms our set of states in the compiled DFA, where each state produced by goto(I,X) has an edge to it from the state corresponding to I labelled with the category X.
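A compact Python sketch of these two functions is given below, assuming the illustrative Rule and Grammar structures sketched in §2.1 and representing an LR(0) item as a (rule, dot position) pair.

# A sketch of closure and goto over LR(0) items; an item is (rule, dot_position).
def closure(items, grammar):
    """Add every nonkernel item B -> . gamma reachable from the item set."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for rule, dot in list(items):
            if dot < len(rule.daughters):
                nxt = rule.daughters[dot]        # category to the right of the dot
                for r in grammar.rules:
                    if r.mother == nxt and (r, 0) not in items:
                        items.add((r, 0))
                        changed = True
    return frozenset(items)

def goto(items, symbol, grammar):
    """Move the dot over `symbol` in every item that expects it, then close."""
    moved = {(rule, dot + 1)
             for rule, dot in items
             if dot < len(rule.daughters) and rule.daughters[dot] == symbol}
    return closure(moved, grammar)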

Canonical LR(0) Collection for G1

The canonical LR(0) collection of sets for grammar G1 is shown in Table 2.4. The table shows the set number j as Ij, and the corresponding kernel and nonkernel items of each set in the second and third columns, respectively. The remaining columns illustrate the next item set

created by determining the goto of the item (row) and grammar symbol (column). For example, goto(I0: NP → · Pro, Pro) = I3. That is, from an item in set I0 we consume a Pro terminal symbol and move to an item in I3. Starting from closure(T → · S), we create the first item set I0. From here, we calculate the goto of I0 with each X symbol in the grammar. For goto(S → · NP VP, NP) we create the item S → NP · VP, which is currently not in any set in the collection. In view of this new item, we create a new set I1. We then perform the closure operation over S → NP · VP in this new set to complete it. We continue until the goto function is unable to create a new set from the current sets.

2.3 Note that the compiled DFA differs from the one we extract from the resulting LR table previously discussed (the grammar DFA). The algorithm for compiling the LR table creates the compiled DFA, which contains edges from ancestor states rather than from the current state conditioned on the ancestor state.

Set    Kernel items                        Nonkernel items                goto(I,X)
I0     T → · S                             S → · NP VP, NP → · Pro        S: I9, NP: I1, Pro: I3
I9     T → S ·
I1     S → NP · VP                         VP → · V NP, VP → · V NP PP    VP: I8, V: I2
I2     VP → V · NP, VP → V · NP PP         NP → · Pro                     NP: I4, Pro: I3
I3     NP → Pro ·
I4     VP → V NP ·, VP → V NP · PP         PP → · P NP                    PP: I7, P: I5
I5     PP → P · NP                         NP → · Pro                     NP: I6, Pro: I3
I6     PP → P NP ·
I7     VP → V NP PP ·
I8     S → NP VP ·

Table 2.4: The canonical LR(0) sets for G1.
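Given the closure and goto sketches above, the canonical collection itself can be built by the following illustrative routine, which repeatedly applies goto until no new item set is produced.

def canonical_lr0_sets(grammar, root_rule):
    """Build the canonical collection of LR(0) item sets (the LR states)."""
    start = closure({(root_rule, 0)}, grammar)      # closure of T -> . S
    collection, agenda = {start}, [start]
    symbols = grammar.terminals | grammar.nonterminals
    while agenda:                                    # until no new set can be created
        item_set = agenda.pop()
        for X in symbols:
            new_set = goto(item_set, X, grammar)
            if new_set and new_set not in collection:
                collection.add(new_set)
                agenda.append(new_set)
    return collection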

Constructing a SLR Parsing Table To create the SLR parse table for a grammar, we proceed in full as follows for augmented grammar G:

1. Construct the LR(0) canonical item sets Θ = {I0,I1,...In} for G.

2. State i is constructed from Ii, and the table is initialised to contain empty cells for each terminal and nonterminal in the action and goto tables, respectively.

3. The corresponding parsing actions for the state are determined within the action table as follows:

(a) If Ii contains A → α · aβ, where a is a terminal, and goto(Ii,a) = Ij then we set the cell in the action table for state i and terminal a to contain a shift to Sj, i.e. action(Si,a) = sj.

(b) If Ii contains rule number k with the dot at the end e.g. A → α·, then set each cell in the action table for state i and for each terminal in FOLLOW(A) to contain a reduce by rule k i.e. rk.

(c) If T → S · is in Ii this specifies to reduce to the augmented top category. As a result, we set the action in the cell for state i and terminal $ to be ‘accept’.

4. The goto table consists of transitions for state i and nonterminal A such that if goto(Ii,A) = Ij then GOTO(Si,A) = Sj.

5. We set the initial state of the parser to I0, i.e. S0, determined by the initial closure operation over the root category item e.g. T → · S. Constructing the Canonical LR(1) Parse Table To create a canonical LR(1) table, we modify the closure and goto functions to operate over LR(1) items to create the item sets. The LR(1) items are designed to incorporate some right context to guide the parsing decisions. For the LR(1) item [A → α·,a], we only reduce by A → α if the next input symbol is a. The important distinction between this parser and the SLR parser described previously is that the set of such a symbols will now be a subset of FOLLOW(A) (recall that we allowed the reduction in the SLR table for all a in FOLLOW(A)). States created over LR(1) items effectively split the corresponding SLR states (constructed over LR(0) items) so that each state not only contains the information required to determine the current handle, but also indicates exactly which lookahead (input) symbols can follow the handle α, for which there is a possible reduction to the nonterminal A. In G1, a reduction to form NP by r2 occurs for the lookahead $ only within the context of a PP or VP in rules r3 and r4, respectively. For the NP in subject position however, the lookaheads may be either of {P, V}. So two separate LR(1) item sets now correspond to the reduce to NP rather than the single LR(0) item NP → Pro ·. The LR(1) item sets corresponding to these situations are [NP → Pro ·, $] and [NP → Pro ·, {P, V}], respectively. Constructing the LALR(1) Parse Table LALR(1) parse tables contain the same number of states as the SLR table, by compressing the LR(1) item sets i.e. those states found for the canonical LR(1) table. We can compress the LR(1) item sets into sets with the same core. Compressed sets contain sets of items in which only the lookahead values differ. These sets are used to create the states in the LALR(1) table. However, creation of the LR(1) item sets can prove infeasible for large grammars and a more efficient algorithm for creating the LALR(1) table involves creation of the LR(0) item sets (represented using kernel items only) which are then augmented with the set of lookahead symbols (the second element of the tuple). In order to augment the LR(0) item sets with lookahead symbols we can generate lookaheads spontaneously or they may propagate from one LR(0) item set to the item set's goto item sets (described by Anderson et al. 1973, also Aho et al. 1986). The algorithm utilises a dummy lookahead symbol # to detect the situations in which lookaheads propagate, calling for multiple passes over the kernel items to perform the lookahead augmentation. If [B → γ ·Cδ,b] is in I, and we know for each nonterminal C the set of A, where each A is a left-most NT for all possible

right-most rule expansions (⇒∗rm) of C, i.e. where C ⇒∗rm Aη, then we can determine the values of a for which [A → X · β,a] is in goto(I,X) as the set FIRST(ηδ). Note that this will be a subset of FOLLOW(A), the set given A is derived (expanded) from C (and not another category of the grammar). If the set ηδ is null or may be expanded to ε then b is also a possible lookahead and [A → X · β,b] will be in goto(I,X). In this case, we consider that the lookaheads (i.e. b) propagate from B → γ ·Cδ to A → X · β.

Constructing the G1 LALR(1) Parse Table

G1 is a simple grammar. As a result, the SLR and LALR(1) parse tables (Table 2.2) are equiv- alent. This occurs as all nonterminal categories are each expanded from a single nonterminal category. That is, continuing the discussion in the previous section, the following holds for all A: FIRST(ηδ) = FOLLOW(A). Thus we derive the LALR(1) table for G1 either by applying the SLR table construction method from the item sets (states) in Table 2.4, or by augmenting these item sets with lookaheads. 2.4 GLR Parsing We now describe modifications to the LR parser to facilitate nondeterminism in the parsing actions. In particular, we describe the modifications made by Tomita (1987) that generalise LR parsing, to compile and apply an LR parser over an ambiguous CFG. Tomita describes generalised LR parsing (GLR, henceforth), an extension of the LR parsing framework, to allow parallel tracking of multiple state transitions and stack actions by using a graph-structured stack. Thus we allow for multiple paths through the NFA to be determined, and each complete path found corresponds to a derivation as discussed in §2.2.1. We describe these modifications over an LALR(1) parser. 2.4.1 Relationship to the LR Parsing Framework If we consider the LR framework in Figure 2.4, the modifications that Tomita (1987) describes may be summarised as follows: • LR table: can specify more than one possible action i.e. sets of transitions as in the NFA described previously. Thus cells in the LR table now specify a set of possible actions and action returns this set given the current state and lookahead.

• Stack: we generalise the standard linear LIFO queue structure to enable the stack to merge and diverge as the corresponding paths (state sequences) do so.

• LR parsing program: processes each path in parallel, so that all possible paths are found throughout the underlying NFA.

• Output: the set of all possible parses licensed by the grammar. That is, the parser outputs one parse for each complete path found. 2.4.2 Table Construction The LR table is constructed as described previously, though more than one action can be applied given the current state and lookahead. Consequently, the function action returns any number of possible shift and reduce actions. This results in action conflicts or competing actions in cells throughout the LR table. Shift-reduce or reduce-reduce conflicts are possible, though shift-shift conflicts are not, as lac has a single value which determines a single possible shift move. 44 2. LR PARSERS

Competing actions correspond to either ambiguities in the grammar (that each result in competing derivations) or to alternative derivations that are not distinguishable given the current context (so that one or more actions result in a path that eventually terminates).

Grammar G2 Example

Grammar G2, shown in Figure 2.8, extends Grammar G1 and models simple attachment preferences for prepositional phrases (PP) between noun phrases (NP) and verb phrases (VP). The top category T has also been added, so that the augmented grammar also contains T' and the rule r0: T' → T. Rule r0 specifies the edge to the accept state. The GLR table (generalised LALR(1)) for this grammar is shown in Table 2.5. The corresponding grammar NFA is shown in Figure 2.9. There are several cells in the table which define multiple actions. For example, for state 5 with a preposition lookahead we have two possible actions: action(5, P) = {r5,s4}. This represents the situation where an NP occurs after a terminal P (as the ancestor state is 4) and the lookahead is P as well. So we have the choice between creating a PP using the previously seen P and NP with r5: PP → P NP, or shifting over the next P. In the latter case, we may include the current NP with the next prepositional phrase, using either r4 or r7.

r0: T' -> T
r1: T -> S
r2: S -> NP VP
r3: NP -> Pro
r4: NP -> NP PP
r5: PP -> P NP
r6: VP -> V NP
r7: VP -> V NP PP
r8: VP -> VP PP

Figure 2.8: Grammar G2.

2.4.3 Graph-structured Stack If a single derivation is possible, then a single path can represent the derivation. Otherwise, the stack diverges to separate paths where more than one action is applicable and merges when the same state is reached for the current lookahead (as from this point the same actions will be processed). This equivalence is critical as it allows for subsequent parsing actions to be ap- plied to a single entity. Consequently, it allows for tractable computation over large ambiguous grammars. The graph-structured stack (stack, henceforth) forms an NFA that is a subset of the gram- mar’s NFA implicit in the GLR table. Each vertex in the stack now holds more information in addition to the state number of the vertex. The additional information includes the resulting syntactic structure(s) of the sentence, enabling the parsing program to return all full derivations possible given the grammar. That is, the parsing algorithm is an all-path parsing algorithm in that it determines all possible paths in the underlying NFA, that is, all possible derivations. 2.4 GLR Parsing 45

state    action                                     goto
         $          P          Pro      V           T     S     NP    VP    PP
0                              s2                   12    1     3
1        r1
2        r3         r3                  r3
3                   s4                  s7                            10    6
4                              s2                               5
5        r5         r5, s4              r5                                  6
6        r4         r4                  r4
7                              s2                               8
8        r6         r6, s4                                                  9
9        r4, r7     r4, r7
10       r2         s4                                                      11
11       r8         r8
12       accept

Table 2.5: Generalised LALR(1) table for G2.
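In an implementation, the only change needed to the table encoding is that a cell holds a set of actions rather than a single one. A sketch of the conflict cell discussed above, action(5, P) = {r5, s4}, follows; the encoding and names are illustrative.

# In the generalised table a cell holds a *set* of competing actions.
G2_ACTION = {
    (5, "P"): {("reduce", "PP", 2),   # r5: PP -> P NP
               ("shift", 4)},         # s4: shift P to start another PP
    # ... remaining cells of Table 2.5 ...
}

def glr_actions(state, lookahead, table=G2_ACTION):
    """Return the (possibly empty) set of competing actions for a cell."""
    return table.get((state, lookahead), set())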

[Transition graph: vertexes 0–12, each labelled with its state number and, in brackets, the required ancestor state(s), e.g. 5(4), 6(3,5), 9(8), 11(10); shift edges are labelled with terminals and reduce edges with lookahead sets and rule numbers, ending in the accept action from state 12.]

Figure 2.9: Grammar NFA for G2.

Example Stack over G2 Given the sentence 'He gives it to her' with corresponding PoS tag sequence {Pro,V,Pro,P,Pro} the possible parses, and corresponding state sequences through the NFA, are shown in Figure 2.10. The stack over this example is shown in Table 2.6 in a similar fashion to the stack shown for the unambiguous grammar G1 in Table 2.3. However, the stack column is now subdivided into four columns. The first illustrates the stack shared by all three parses. The second column illustrates the stacks that each continue from the stack in the first column, while the third illustrates the stack that is again shared between all parses. That is, moving from the first to second column, the stack diverges (splits). Moving from the second column to the third shows these stacks merge again. The parse number is shown in the fourth stack column, which illustrates the parses that result for the given stack.

P1: (T (S (NP Pro) (VP V (NP Pro) (PP P (NP Pro)))))
    0,2,3,7,2,8,4,2,5,9,10,1,12,acc
P2: (T (S (NP Pro) (VP (VP V (NP Pro)) (PP P (NP Pro)))))
    0,2,3,7,2,8,10,4,2,5,11,10,1,12,acc
P3: (T (S (NP Pro) (VP V (NP (NP Pro) (PP P (NP Pro))))))
    0,2,3,7,2,8,4,2,5,9,8,10,1,12,acc

Figure 2.10: Example parses for G2, with corresponding state sequences that produced the parse. The state acc represents the accept state.

Figure 2.11 illustrates the same stack diagrammatically as an NFA. In this figure, the stack is separated into word boundary stacks that illustrate the parsing actions (edges) performed for each lookahead. Edges are arrows shown with solid lines. Each edge corresponds to a parsing action, which results in moving from one state to the next. For a reduce action the lookahead is not updated. Thus, reduce actions result in subanalyses in the same word boundary stack and correspond to downward edges in this figure. In contrast, for a shift action we consume the current lookahead, and a horizontal edge illustrates the transition to the next word boundary stack. We perform all possible reduce actions prior to performing the possible shift actions for a given lookahead. This order is evident in the edge numbering. For reduce actions (downward edges), we create a mother category from daughter subanalyses represented within vertexes of this graph. A dotted line between vertex i and vertex j illustrates that the ancestor state i is exposed when we perform a reduce action (POP the daughter categories from the stack) that results in a subanalysis for vertex j. The newly created mother category (shown in brackets) in vertex j spans the subset of the input shown from vertex i to vertex j. For example, given lookahead V in word boundary stack 1, we reduce from state 2 to 3 with r3 resulting in an NP. The word span of vertex 3 is from 0 to 1 (i.e. over Pro) and the ancestor state is in vertex 0 which represents the start state S0. The diverging paths in this graph, and in the stack example, illustrate the two positions where the individual parses result. That is, the two nondeterministic decision points. The first decision point occurs at state 8 with lookahead P (stack configuration 6 in Table 2.6), where reducing to state 10 results in P2. Similarly, at state 9 with lookahead $ (stack configuration 11b in Table 2.6) we decide between P1 and P3. 2.4.4 Parse Forest As we build the stack during parsing, the compact representation of all possible parses, the parse forest, is also built within this structure. The parse forest is efficiently represented using subtree sharing and local ambiguity packing which we now describe. Relationship between the Parse Forest and Graph-structured Stack First, recall that the stack is separated into word boundary stacks, that is, one word boundary stack for each item (lookahead) in the input sequence. Within each vertex of a word boundary stack we encode the corresponding subanalyses (subtrees) for the vertex's LR state. That is, each subanalysis results from a parsing action that specifies the vertex's LR state as the next state (for the given lookahead). Each subanalysis corresponds to a node in the parse forest. Thus each vertex in the stack represents a set of nodes in the parse forest for the given lookahead and LR state. Therefore, the parse forest and graph-structured stack are separate data structures, though the stack encodes the parse forest which represents the analyses built within the stack. Subtree Sharing If a vertex is formed after application of a shift action, then the corresponding nodes (subanalyses) for this vertex are word nodes. In contrast, after a reduce action, we form a subanalysis using one or more daughter nodes (in stack vertexes) and we create a tree node.
A tree node represents a subanalysis (subtree), created by storing the newly created mother category as well as pointers to the daughter nodes (rather than copying each daughter node's subtrees). This process is known as subtree sharing, as different nodes may specify (that is, contain pointers to) the same daughter nodes in the parse forest. Local Ambiguity Packing If we create nodes (subanalyses) for a vertex, then a subset of these nodes may represent equivalent subanalyses over the same subset of the input sequence (that is, with the same word span). The definition of node equivalence varies between parsing systems and grammars. For a CFG, two subanalyses are equivalent if their mother categories are the same. Equivalent nodes in a vertex represent competing nodes (subtrees) in the parse forest. As we create these nodes within the same state of the LR parser (with the same lookahead symbol and word span) we apply the same set of subsequent parsing actions to each. To reduce redundant processing, we need only apply these parsing actions once. In view of this redundancy we merge competing nodes in the parse forest (and therefore, in the vertex of the stack) through local ambiguity packing (packing, henceforth). Here, we represent each set of nodes using one node of the set. Subsequent parsing actions are then applied to the single representative node only. 2.4.5 LR Parsing Program The LR parsing program proceeds in the same fashion as previously described, although it now allows for more than one action to be applied given the current state and lookahead item.

[Table 2.6: Stack configurations for G2 over PoS tag sequence {Pro,V,Pro,P,Pro}. The first column shows the configuration number and the second the configuration it continues from; the next two columns ('stack' and 'input') form the tuple of each configuration reached; the final column illustrates the set of possible actions, determined using the stack top state (Sc) and current lookahead (the first token in the input column), with the action applied shown in brackets where more than one was possible.]

[Graph-structured stack diagram: word boundary stacks 0–5 (lookaheads Pro, V, Pro, P, Pro, $), with vertexes labelled by LR state and category, and 18 numbered parsing actions from 1(s2) through 18(acc).]

Figure 2.11: Example graph-structured stack for G2 over PoS tag sequence {Pro,V,Pro,P,Pro}. Each vertex is shown with the corresponding state number and root category for each subanalysis (in brackets). For word vertexes, which are created by shift actions, the category is the word itself. The vertexes are separated into word boundary stacks (over each lookahead), where each word boundary stack number is shown across the top of this figure. To the right of each word boundary stack is an item in the input sequence, where this item is the lookahead for the boundary. We also number edges (solid lines) which show the order in which the actions are performed and, in brackets, the action to perform. Dotted lines illustrate the ancestor state exposed for a reduce action that creates a subanalysis for the vertex.

The parsing program begins by initialising the first word boundary stack (the 0th stack) to a list containing a single vertex data structure representing the start LR state. All other word boundary stack positions are initialised to empty. The program continues by applying all possible reduce actions for each vertex in the current word boundary stack, so that subsequent shift actions may be applied to all possible states in this stack (for the given lookahead). Each production may be augmented with a set of tests, where each nonterminal is augmented with a set of attributes and the tests define whether the attribute values are acceptable. Tomita (1987) describes that during parsing, whenever a reduce action forms a higher-level nonterminal using a phrase structure rule, a separate function associated with the rule is called that defines, passes and/or tests the attribute values of the resulting nonterminal. If this function returns notification that one or more tests have failed, then the parser discards the resulting nonterminal. The parsing program also includes a processing stage to enable extraction of the set of (n- best) derivations from the resulting parse forest. This code is considered to unpack the parse forest, given that the nodes (competing subanalyses) are packed in the data structure.

2.4.6 Output Given subtree sharing and packing, the parse forest itself represents another NFA, where we may move from root node (representing the root category of the grammar) to word nodes. The path diverges where packing occurs and merges for subtree sharing. To unpack the parse forest and extract the set of competing derivations, we simply need to traverse the NFA and consider each possible combination of subanalyses for each node in the parse forest. That is, we perform a depth-first search of this structure and for each node we return the set of possible subanalyses. Optionally, this set is pruned so that only the n-best set of subanalyses is returned. Given the set of subanalyses for each daughter node, we create an alternative subanalysis for the current node for each possible combination of daughter subanalyses with the given mother category. We combine this set of subanalyses with the set returned by any packed nodes, as this set represents the alternative subanalyses for the current node. At the root node of the parse forest, the set of possible analyses represent competing derivations for the given input sequence (sentence).
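A self-contained Python sketch of this unpacking traversal is given below. The Node structure (category, shared daughter pointers, packed alternatives) is illustrative, and n-best pruning is omitted.

# A sketch of unpacking a packed parse forest: each node carries a category,
# pointers to daughter nodes (subtree sharing) and a list of packed
# alternatives; enumerating derivations is a depth-first traversal.
from itertools import product

class Node:
    def __init__(self, category, daughters=(), packed=()):
        self.category = category
        self.daughters = tuple(daughters)   # shared daughter nodes
        self.packed = list(packed)          # competing packed subanalyses

def derivations(node):
    """Yield every tree (as a nested tuple) represented by `node`."""
    if not node.daughters:                  # a word node
        yield node.category
    else:                                   # one tree per combination of daughter subanalyses
        for combo in product(*(derivations(d) for d in node.daughters)):
            yield (node.category,) + combo
    for alt in node.packed:                 # alternatives packed into this node
        yield from derivations(alt)

# e.g. a node with two packed alternatives yields (at least) two derivations.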

2.4.7 Modifications to the Algorithm Kipps (1989) reformulates the modifications defined by Tomita (1987) to improve the parser's efficiency through changes to the state-popping process. The algorithm identifies, for a given reduce rule, each possible ancestor. That is, the set of daughters on the stack and the resulting ancestor vertex. The ANCESTORS function, which dominates the complexity of the parser, determines a set of possible ancestors given the current vertex, the number of daughters and the mother category we wish to create. This function is modified so that the reduce function utilises a lookup table, reducing the complexity of the parser from n^(p+1), where p is the maximum number of daughters for a production, to n^3. 2.5 Statistical GLR (SGLR) Parsing This section discusses existing approaches to incorporate statistical models over the GLR (LR, henceforth) parsing framework defined above. In particular, we define the different normalisation methods over the LR table, given action counts derived over a supervised training corpus.

2.5.1 Probabilistic Approaches PCFG Model At first, methods considered LR parsing as a purely operational mechanism, aiming to distribute probabilities originally associated with the probabilistic CFG (PCFG) (Wright & Wrigley, 1989). However, these methods fail to take advantage of the additional context available in the LR parser. The resulting level of context is equivalent to that available in the PCFG, which is acknowledged in the literature as inadequate due to a lack of context-sensitivity. Stack Configurations Su et al. (1991) propose a model that is moderately more context-sensitive than the underlying CFG. They distribute probability mass across possible action sets for the stack configurations that result for each lookahead item, i.e. between the possible word boundary stacks. The probability of a parse is the product of the probabilities for each stack configuration that results (prior to each shift action) during construction of the parse. Effectively, they include full context (stack configuration) in the probability model. However, this model requires a complex algorithm to train and it is not efficient to decode. Parser Actions Briscoe & Carroll (1993) (B&C, hereafter) propose a method that distributes probability mass directly between competing parsing actions in the LR table, so that the parsing model successfully incorporates a greater level of context compared to the underlying CFG. Further, they estimate the probability distributions using simple MLE and the model is efficient to decode. The probability of each parse is the product of the probabilities of all shift/reduce actions that result in the parse. Inui et al. (1997) (I&T, hereafter) refine the B&C probabilistic model by providing an alternative normalisation method. Associating probabilities with each vertex in the stack is problematic, as each vertex represents a set of subanalyses (nodes) where each results from a different reduce action. Instead, Carroll (1993) associates probabilities with each subanalysis and with each packed subanalysis, i.e. with each node in the parse forest, rather than with each vertex in the graph-structured stack (see §2.4.4). 2.5.2 Estimating Action Probabilities Estimating action probabilities in the LR table consists of a) recording an action history for the correct derivation (for each sentence in a treebank), b) computing the frequency of each action over all action histories and c) normalising these frequencies to determine probability distributions over conflicting (i.e. shift/reduce or reduce/reduce) actions. Models differ in the last step, the normalisation method during MLE. Action Counts In order to estimate action probabilities, we must first derive from a supervised corpus the count for each action in the LR table. Given a pair (s,A), from an annotated treebank as defined in §1.3.1, we determine the parse forest for the sentence s. We then identify the parse in the parse forest that matches the tree A specified in the treebank. This 'correct' parse has a corresponding set of shift and reduce actions that results in creation of the parse. For each shift and reduce action in this set, we add a count of 1 to the count for this action in the LR table. Repeating this process for each sentence (training instance) in the treebank, we determine the total action count for each action in the LR table. We now discuss alternative normalisation methods to form probability distributions within the LR table, and consequently, for parses in the parse forest.

Lookahead Normalisation It seems reasonable to determine the probability of each parsing action through MLE over competing action sets i.e. within cells of the LR table. We refer to this normalisation method as la-norm. The resulting probabilities are analogous to the transition probabilities in first-order HMM PoS-taggers as we assign probability mass between transitions in the underlying NFA with the same edge label (lookahead item).

State and B&C Normalisation
B&C, by contrast, use MLE to estimate their model's parameters from counts over rows of the LR table, that is, across all lookaheads for each state; we refer to this as state-norm. Their full normalisation method further distributes probability mass between the alternative states possible after a reduce action. Thus, they effectively distribute probability mass between all actions (edges) possible from a state in the grammar NFA, including the edges that are conditional on the ancestor state exposed after a reduce action. However, this distributes probability mass between actions that do not compete with each other given the current input, that is, the lookahead item. As a result, the action probabilities are deficient, i.e. the probabilities of all parses licensed by the grammar do not sum to 1.

Inui Normalisation
I&T propose an alternative normalisation method, based on whether the current state was reached by a shift or by a reduce action: probability mass is distributed across rows (state-norm) or between cells (la-norm), respectively. They motivate this method using the conditional probability of moving from the current stack configuration σi−1 to another, σi. They estimate this probability using the stack-top state Sc, the next input symbol lac and the next action ai. They argue that the next lookahead is known after a reduce action (i.e. it is unchanged) and unknown after a shift. Let Ss and Sr be mutually exclusive sets of states, representing those states reached after shift and reduce actions, respectively. The probability of a given transition from one stack state to another (that is, of the given action ai) can be estimated using:

P(lac, ai, σi | σi−1) ≈ P(lac, ai | Sc)   if Sc ∈ Ss
P(lac, ai, σi | σi−1) ≈ P(ai | Sc, lac)   if Sc ∈ Sr

Therefore, they normalise over all lookaheads for a state or over each lookahead for the state, depending on whether the state is a member of Ss or Sr, respectively. I&T also remove the subdivision of probability mass between goto ancestor states for a given nonterminal category. They argue that this decision is deterministic in their model, which is based on the stack state rather than the top state Sc. However, in practice they utilise Sc to represent the stack, i.e. they back off to a purely state-based level of context. I&T's refinements are expected to provide performance gains over the model of B&C, because la-norm models the preferences between competing actions only, while state-norm implicitly incorporates bigram statistics in the model.
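The competing normalisation schemes can be sketched as follows (a simplified illustration, not RASP's implementation): counts are keyed on (state, lookahead, action) triples as in the counting sketch above, shift_states is an assumed set identifying the states in Ss, and the optional smooth argument adds a constant to observed counts only; RASP's actual Laplace estimation (§2.6.2) also reserves mass for unseen actions.

from collections import defaultdict

def normalise_actions(counts, shift_states, smooth=0.0):
    """counts: {(state, lookahead, action): freq}.
    Returns probability tables for la-norm, state-norm and the I&T scheme."""
    by_cell = defaultdict(float)   # totals per (state, lookahead) cell
    by_row = defaultdict(float)    # totals per state row
    for (state, la, act), freq in counts.items():
        by_cell[(state, la)] += freq + smooth
        by_row[state] += freq + smooth

    la_norm, state_norm, inui = {}, {}, {}
    for (state, la, act), freq in counts.items():
        f = freq + smooth
        la_norm[(state, la, act)] = f / by_cell[(state, la)]
        state_norm[(state, la, act)] = f / by_row[state]
        # I&T: row normalisation for states reached by a shift (lookahead unknown),
        # cell normalisation for states reached by a reduce (lookahead unchanged).
        if state in shift_states:
            inui[(state, la, act)] = f / by_row[state]
        else:
            inui[(state, la, act)] = f / by_cell[(state, la)]
    return la_norm, state_norm, inui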

Example

Returning to our previous example over the grammar G2, we consider a training corpus that consists of seven sentences: one example of P1, two of P2 and four of P3 parses. Table 2.7 shows the total action counts in brackets next to each action in the LR table. We illustrate only the action part of the table as the goto table is unchanged from Table 2.5. A tuple in the second row illustrates the probabilities for state-norm, la-norm and the normalisation model by I&T, respectively. B&C counts and resulting probabilities (for each ancestor vertex) appear for reduce actions in the remaining rows for each state, with the ancestor state in brackets. We do not apply smoothing in this example, although if no counts are seen for any of the competing actions in a set, we show the probabilities distributed equally between the unseen actions. The probabilities of P1, P2 and P3 for each normalisation model are shown in Table 2.8. Given the training data, the resulting probability model should rank P1 < P2 < P3. Both lookahead and I&T based normalisation achieve this ranking, with probabilities in proportion to the number of each parse seen during training, but the state and B&C normalisation methods do not. A similar example was shown by I&T, though over a different grammar. B&C and state based normalisation both rank P2 < P1, which on closer inspection of both examples is primarily due to the subdivision of probability mass to noncompeting actions (at this point in parsing), that is, to those that are not one of the two decision points (from states 8 and 9) as discussed in §2.4.3.

2.6 RASP

RASP (the 'robust accurate statistical parser') is a robust statistical analysis system for English developed by Briscoe & Carroll (2002). RASP is a SGLR parser, applying the probability distribution over complete derivations, using the I&T probability model over parsing actions as described in §2.5.2. This section provides specific details of RASP's grammar in §2.6.1, training in §2.6.2, parser application in §2.6.3 and, finally, the output formats available in §2.6.4.

2.6.1 Grammar

Unification-based Metagrammar
Briscoe (2006) describes the manually written feature-based unification grammar, where terminals are defined over PoS tags (Elworthy, 1994, the CLAWS II tagset). The grammar is written in the ANLT formalism, i.e. as a metagrammar, based on the notion of generalised phrase structure grammar (GPSG). Terminals do not include the null category and features hold atomic values, so that rules are written using only a subset of the attribute-valued (AV) grammar possibilities defined by the ANLT formalism (Grover et al., 1993). The grammars developed for use in RASP are referred to as the set of 'tag sequence grammars', tsg, and version numbers are appended to this acronym. The grammars utilised in this work are variants of the final tsg15 released with version 2 of RASP (Briscoe & Carroll, 2006). Attributes can be organised into sets for the purpose of feature propagation (feature sets) or simply to enable abbreviated representation of common sets (aliases). For example, Figure 2.12 shows a grammar rule analysing a verb phrase followed by a prepositional phrase modifier. The rule's name is V1/vp_pp and the syntactic specification which follows on the first line contains the alias categories V1, H1 and P2, which are defined in the grammar as:

ALIAS V1 = [V +, N -, BAR 1].

[Table 2.7 body: for each state (0-12) and lookahead ($, P, Pro and V), the shift/reduce actions with their training counts in brackets, the {state-norm, la-norm, I&T} probability tuple, and the B&C counts and probabilities for each goto ancestor state (ancestor state in brackets).]

Table 2.7: Statistical models over the action table for G2.

      Normalisation method
Parse  State-norm  la-norm  B&C     I&T
P1     0.0025      0.1420   0.0018  0.0051
P2     0.0011      0.2900   0.0003  0.0104
P3     0.0036      0.5680   0.0026  0.0204

Table 2.8: Example of the parse probabilities that result for different LR parser normalisation methods.

V1/vp_pp : V1[MOD +] --> H1 P2[ADJ -, WH -] : 1 : 2 = [PSUBCAT NP], (ncmod _ 1 2) : 2 = [PSUBCAT NONE], (ncmod prt 1 2) : 2 = [PSUBCAT (VP, VPINF, VPING, VPPRT, AP)], (xmod _ 1 2) : 2 = [PSUBCAT (SFIN, SINF, SING)], (cmod _ 1 2) : 2 = [PSUBCAT PP], (pmod 1 2).

Figure 2.12: Example metagrammar rule. This rule definition shows the rule name and syntactic specification (on the first line), with semantic rules for the GR output format shown in the remaining lines.

ALIAS H1 = [H +, BAR 1].
ALIAS P2 = [V -, N -, BAR 2].

The tsg grammars utilise x-bar theory, expressed in the feature BAR. The V and N features represent the major categories of verb and noun, while the values + and - represent the presence and absence of the feature, respectively. The head daughter is identified using the feature H, so that in this example, the first daughter is the head daughter. In addition to these features, a number of feature-value pairs are defined in the syntactic specification. For example, the prepositional phrase daughter is a non-wh phrase.

Object Grammar
The metagrammar enables the grammar writer to express the rules in a compressed and manageable format, e.g. with aliases. These 'compressed' rules are compiled into an object grammar by 'expanding out' the phrase structure rules with additional features. The resulting object grammar contains categories represented as lists, specifying the category number followed by values corresponding to the features for the given category. Each value is set either to a specific predetermined value (e.g. +), or to a variable value (starting with @) which is instantiated during unification. A table holds the corresponding feature names for each category. For example, the 6th category type in tsg15 corresponds to a verb category. The following example shows a verb with several features instantiated, together with the associated entry in the feature name table:

(6 - + |1| @2787 @2815 - PAST + - - -)
(N V BAR PLU MOD AUX VFORM FIN CONJ SCOLON QUOTE)

Rules in the object grammar consist of lists of such categories. The first item is the rule's mother, followed by the rule name, while the remaining categories represent the daughters of the rule. Each feature value is instantiated as specified in the metagrammar rule, and features that are required to unify between daughters, or to pass from the head daughter to the mother, are specified by using the same variable number (e.g. @7) in the daughter's and mother's category lists. The V1/vp_pp rule shown in Figure 2.12 results in the compiled object grammar rule, with the features set as specified in the aliases and within the rule itself, as follows:

((#(6 - + |1| @7 + @23 @24 @28 - - -) "V1/vp_pp" (#(6 - + |1| @7 @36049 @23 @24 @28 - - @36050)) (#(9 - - |2| @36051 @36052 - - @36053 - - @36054)))

CF Backbone
The context-free (CF) backbone grammar is determined from the object grammar, where rules contain categories identified using atomic names. That is, each atomic name has an associated residue of feature name-value pairs. Given a specified set of features we wish to 'compile' out of the grammar, we effectively determine a rule for each possible combination of these features, so that a CFG (with no residue of features) results if all features of the grammar are compiled out.2.4 For example, if a rule contains 2 values for each of 3 features that are unspecified (i.e. have variable values), all of which we wish to compile out, then we create 2^3 = 8 different run-time rules. Each of these 8 rules now has specific values defined for each of the three features compiled out, and these features no longer need to be unified during parsing. We determine the CF backbone from the object grammar in two stages. Firstly, we determine the set of disjoint categories in the object grammar. This set covers the whole grammar, where each of the features we wish to compile out has a set value in the categories (and thence rules) created. We identify each category with a distinct atomic category name. Secondly, we create a CF rule for each object grammar rule using the atomic category names. Additional refinements are required in the second stage to deal with, for example, coordination and unbounded dependencies, which we do not cover here (see instead Carroll, 1993). In the resulting CF backbone, nonterminals and terminals in the grammar are then identified using these atomic category names, and a table maps these names to the feature-value categories. The grammar for tsg15 consists of 1189 run-time productions in the compiled CF backbone. The CF backbone consists of 61 distinct categories: 28 terminals and 33 nonterminals. For example, the CF backbone rule for the grammar rule in Figure 2.12, with V1-41 and P2-48 distinct categories, follows:

(CFRULE :NAME V1/vp_pp :MOTHER V1-41 :DAUGHTERS (V1-41 P2-48))

V1-41: [N -, V +, BAR 1, PLU @7, MOD @1993, AUX @1994, VFORM @1995, FIN @28, CONJ @2738, SCOLON -, QUOTE @2917]
P2-48: [N -, V -, BAR 2, PSUBCAT @13, PFORM @16, ADJ @18, WH @20, MOD @22, CONJ @2832, SCOLON -, QUOTE @5439]
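The combinatorial expansion behind compiling features out can be sketched as follows; this is an illustrative simplification rather than the ANLT compiler, and the feature-value inventory passed in is an assumption of the sketch.

from itertools import product

def compile_out(category, feature_values, compile_features):
    """Expand one underspecified category into fully instantiated run-time categories.

    category:         dict of feature -> value, where None marks an unspecified (variable) value
    feature_values:   dict of feature -> list of possible atomic values (an assumed inventory)
    compile_features: the features to remove from the unification residue
    """
    open_feats = [f for f in compile_features if category.get(f) is None]
    expansions = []
    # One run-time category per combination of values for the open features,
    # e.g. 3 binary features -> 2^3 = 8 categories.
    for combo in product(*(feature_values[f] for f in open_feats)):
        new_cat = dict(category)
        new_cat.update(zip(open_feats, combo))
        expansions.append(new_cat)
    return expansions

# Example: a category with three unspecified binary features expands to 8 run-time categories.
cat = {"V": "+", "N": "-", "BAR": "1", "PLU": None, "FIN": None, "MOD": None}
values = {"PLU": ["+", "-"], "FIN": ["+", "-"], "MOD": ["+", "-"]}
print(len(compile_out(cat, values, ["PLU", "FIN", "MOD"])))  # prints 8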

Grammatical Relation (GR) Specifications Briscoe et al. (2006) describe the grammar and the rule-to-rule mapping from local trees to grammatical relations (GRs). The mapping specifies for each grammar rule the semantic head of the rule (head, henceforth), and one or more GRs that should be output (optionally depending on feature values instantiated at parse time). Therefore, the nontrivial task of determining GRs via feature propagation and structural context (over each n-best derivation) is no longer required.

2.4Note that this is infeasible given large broad-coverage unification grammars.

For example, the rule in Figure 2.12 identifies the first daughter (1) as the head (second line), and specifies that one of five possible GRs is to be output (subsequent lines), depending on the value of the PSUBCAT syntactic feature. If the feature has the value NP, then the relation is ncmod (nonclausal modifier), with slots filled by the heads of the first and second daughters (the 1 and 2 arguments). The resulting GR specification is <1, (ncmod 1 2)>. That is, each GR specification is defined as: <head, GR>.2.5 Figure 2.13 specifies the set of possible GRs defined in the grammar. These GRs are arranged in a subsumption-based hierarchy which illustrates the intent of the grammar writer to differentiate arguments from adjuncts. GRs take the following form:

(relation subtype head dependent initial)

where relation specifies the type of relationship between the head and dependent. The subtype slot encodes additional specifications of the relation type for some relations. Finally, the initial slot encodes the initial or underlying logical relation of the grammatical subject in constructions such as passive.
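The feature-conditioned choice among the five GRs in Figure 2.12 amounts to a dispatch on the PSUBCAT value instantiated on the rule's second daughter, as the following loose sketch (not RASP's rule compiler) illustrates; the PSUBCAT groupings are copied from the rule text.

def gr_for_vp_pp(psubcat):
    """Return the (unfilled) GR template selected by the V1/vp_pp rule in Figure 2.12,
    given the PSUBCAT value instantiated on its second daughter at parse time."""
    if psubcat == "NP":
        return ("ncmod", "_", 1, 2)
    if psubcat == "NONE":
        return ("ncmod", "prt", 1, 2)
    if psubcat in {"VP", "VPINF", "VPING", "VPPRT", "AP"}:
        return ("xmod", "_", 1, 2)
    if psubcat in {"SFIN", "SINF", "SING"}:
        return ("cmod", "_", 1, 2)
    if psubcat == "PP":
        return ("pmod", 1, 2)
    raise ValueError("PSUBCAT value not covered by this rule: %r" % psubcat)

print(gr_for_vp_pp("NP"))  # ('ncmod', '_', 1, 2), i.e. the specification <1, (ncmod _ 1 2)>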

[Figure 2.13 (diagram): the GR subsumption hierarchy. The root relation dependent subsumes ta, arg_mod, det, aux and conj; arg_mod subsumes arg and mod; mod subsumes ncmod, xmod, cmod and pmod; arg subsumes subj_dobj, subj and comp; subj subsumes ncsubj, xsubj and csubj; comp subsumes obj, pcomp and clausal; obj subsumes dobj, obj2 and iobj; clausal subsumes xcomp and ccomp.]

Figure 2.13: The GR subsumption hierarchy

Marked Rules

Of the 1189 productions, 255 are (manually) identified as marked: peripheral rather than core rules of English grammar. These rules are intended to apply only when other rules are not available. For example, there are marked rules to cover heavy NP shifted arguments to verbs, where a short PP argument occurs before an NP: presented to him the largest case of cigars she had ever seen.

Optional Grammar Constituents

Many rules include multiple optional constituents (for succinctness of grammar expression). For example, a rule for subject auxiliary inversion with be licenses not before or after the complement, which can also optionally be followed by an adverbial phrase: is (not) he (not) the abbot (clearly)?, resulting in 2^3 = 8 different run-time rules.

2.5One or more GRs are defined in the second element of the GR specification. For simplicity, we consider one GR per specification.

2.6.2 Training

LR Table Construction
A generalised LALR(1) (LR, henceforth) table is constructed from the CF backbone grammar utilising the efficient LALR(1) table construction technique described in §2.3.6. However, this algorithm requires multiple passes over the item sets to propagate the lookaheads correctly between the sets and, as Carroll (1993) reports, was infeasible given the size of RASP's CF backbone grammar. Kristensen & Madsen (1981) describe a modification to this algorithm, applied within RASP, that computes the lookaheads using a single pass over all item sets and caches intermediate results to enable tractable computation of the table.

Action Probabilities
Probabilities are associated with actions in the LR table using the normalisation method of Inui et al. (1997), as discussed in §2.5.2. We determine action counts by training on around 5K fully annotated training instances from Susanne (see §1.3.1). During normalisation, we apply Laplace estimation to ensure that we assign a non-zero probability to all actions in the LR table. That is, we assign 'unseen actions' a probability equal to the reciprocal of T, where T is the total frequency of actions plus 1 for each action a in A:

T = ∑_{a∈A} (|a| + 1)

Each unseen action is in the set A, so that the sum of all action probabilities equals 1.

2.6.3 Parser Application

Processing Components
RASP is based on a pipelined modular architecture in which text is preprocessed by a series of components that perform sentence boundary detection, tokenisation, part of speech (PoS) tagging, named entity recognition and morphological analysis, before being passed to a statistical parser (Briscoe & Carroll, 2002). The PoS tagged tokens are then mapped to terminals of the grammar, so that the input sequence is mapped from raw text to the terminals of the grammar prior to parsing.

Parsing Framework
To parse over an input sequence of grammar terminals, we utilise the GLR framework previously defined in §2.4. That is, we apply a graph-structured stack over the GLR table for the CF backbone. The shift and reduce actions are obtained and applied by considering the atomic categories only. However, after each reduce action we unify the features for the CF rule, effectively augmenting the rule with a test as in Tomita (1987). Thus, on each reduce action the features and the daughters of the rule are unified. That is, we unify the residue of features not incorporated into the CF backbone grammar. If unification fails then this derivation path also fails. Since unification often fails, it is not possible to apply beam or best-first search strategies during construction of the parse forest; statistically high-scoring paths often end in unification failure. Hence, the parse forest represents all parses licensed by the grammar.
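A minimal sketch of the reduce-time unification test follows. It is not the ANLT unifier: feature residues are flattened into plain dictionaries, variables are strings beginning with '@', and the rule's residue and its daughters' residues are simply merged, failing on any value clash.

def unify(rule_residue, daughter_residues):
    """Attempt to unify the feature residue of a CF-backbone rule on a reduce action.

    Each residue is a flat {feature: value} dict; values starting with '@' are variables.
    Returns the combined feature values on success, or None on unification failure
    (in which case the derivation path fails, as described above).
    """
    bindings = {}

    def resolve(value):
        seen = set()
        # Follow variable bindings to a concrete value if one exists.
        while isinstance(value, str) and value.startswith("@") and value in bindings and value not in seen:
            seen.add(value)
            value = bindings[value]
        return value

    result = {}
    for residue in [rule_residue] + daughter_residues:
        for feat, value in residue.items():
            value = resolve(value)
            existing = resolve(result.get(feat, value))
            if existing == value:
                result[feat] = value
            elif isinstance(existing, str) and existing.startswith("@"):
                bindings[existing] = value      # bind an unbound variable
                result[feat] = value
            elif isinstance(value, str) and value.startswith("@"):
                bindings[value] = existing      # variable meets a concrete value
                result[feat] = existing
            else:
                return None                     # clash: unification fails
    return result

# Two daughters agreeing on PLU unify; a clash (e.g. PLU + vs PLU -) would return None.
print(unify({"PLU": "@7"}, [{"PLU": "-", "VFORM": "PAST"}, {"PLU": "-"}]))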

Efficiency Modifications
Kipps (1989) describes how to turn the ANCESTORS function (called for each reduce action) into a table look-up function. Carroll (1993) modifies this algorithm so that the function utilises a cache of intermediate results to enable fast look-up of ancestors; these intermediate results are stored as sets of alternative node sequences. Each cache entry, keyed on v and k, holds the set of all possible node sequences whose start vertex is distance k in the stack from vertex v. The entry is updated when each new analysis is formed during parsing, so that the cache is always up to date. The resulting parser's time complexity is also n^3.

Packing
Carroll (1993) generalises the atomic category packing of Tomita (1987) (described in §2.4.4) to complex feature-based categories following Alshawi (1992). Packing is based on feature structure subsumption (Oepen & Carroll, 2000, provide a definition), whereby the most general node represents the set of packed nodes. Miyao & Tsujii (2002) define feature forests, an instance of a parse forest in which nodes are sets of property-values rather than, for example, CFG categories. Their conjunctive nodes correspond to the node definition we provide, while disjunctive nodes represent a set of equivalent conjunctive nodes. In practice, we utilise a single node to represent a set of equivalent nodes. Therefore, packing simply results in a compact data representation by grouping equivalent nodes, regardless of the formalism used to define this structure.

Parse Forest
In view of the multiple root categories, the parse forest is, in practice, considered the set of parse forests, where each dominates analyses spanning the whole sentence with the specified root category. Nodes within the parse forest, and in the resulting derivations, are labelled with sets of features (attribute-value pairs). As previously discussed, the LR action table defines the set of possible shift and/or reduce actions applicable given the current state and lookahead. The corresponding action probability assigns a score to each newly derived (sub)analysis, and moreover, to this node in the parse forest. Recall that the parse forest is built within the graph-structured stack, as defined in §2.4.4. Figure 2.14 shows a simplified parse forest containing three parses generated for the following preprocessed text:

I_PPIS1 see+ed_VVD the_AT man_NN1 in_II the_AT park_NN1

Each node is uniquely identified by the number in the top left of each square. Edges to circles with numbers indicate that a pointer to a shared daughter node is stored; in this case, the number identifies the corresponding node's data structure. For example, node 5 represents the word node for see+ed_VVD, which is a daughter of nodes 4, 18 and 22.

Parse Selection Model
At parse time, the probability of an action corresponding to a marked production, a 'marked action', is effectively reset to be equal to the minimum of all unseen action probabilities in the table. Due to the marked action probabilities and unification failure, the extant model's probability distributions are deficient. Arguably, the ranking of competing derivations is more important than preserving this property of probability distributions. Parses are ranked in order from most to least probable, and we determine the n-best derivations, as discussed in §1.1.1. Derivations are found by unpacking the parse forest, using a depth-first beam search over complete parse forests (those rooted in top categories of the grammar). The probability of each derivation is the product of all shift/reduce actions that result in the derivation (Briscoe & Carroll 1993, see §2.5.1).

Figure 2.14: Simplified parse forest for I saw the man in the park. Each node in the graph represents a node in the parse forest and is shown with the corresponding subanalysis' rule name. Two nodes are packed into the V1/v_np_pp node, shown using dotted lines, which results in three alternative parses for the sentence. Edges (with solid lines) illustrate pointers from mother to daughter nodes. Squares shown with dashed lines identify word nodes while those with solid lines represent tree nodes.

Processing Restrictions
Processing restrictions (time and memory limitations) can be imposed by the user during parsing to enable an efficiency and accuracy trade-off. Further, a word-length limitation may be specified by the user so that only sentences of length less than or equal to this value are processed. As the size of the parse forest scales, in general, exponentially with sentence length, this limitation enables the user to further prioritise efficiency or accuracy.

2.6.4 Output Formats

There are a number of output types available, including syntactic tree2.6, GRs and robust minimal recursion semantics (RMRS, see Copestake 2003). Each of these is computed from the n-best derivations. It is also possible to specify that a combination of these output formats should be returned.

N-best Derivations
From the parse forest, RASP unpacks the n-best derivations.2.7 Each derivation is represented as an embedded list of pointers to nodes of the parse forest, illustrating the syntactic structure over these nodes. We perform a final unification check across each derivation, as packing is based on subsumption and features may not unify once all combinations of packed nodes are enumerated. If unification fails for an unpacked derivation, then it is removed from the n-best list.

Fragmentary Parse
If the parser is unable to find a full analysis (that is, one rooted in the start category) then the system outputs a fragmentary derivation: a connected sequence of partial analyses spanning the input, found by applying a modified shortest-paths algorithm (Briscoe & Carroll, 2006). Therefore, given sufficient memory and time, the system is able to produce an analysis for most sentences which lie outside the grammar. If we are unable to create the parse forest within the imposed time or memory limitations, a fragmentary analysis is returned.

Syntactic Tree
A syntactic tree (or tree, for short) consists of labelled-bracketing output in which each set of brackets corresponds to a rule. The contents of each bracket indicate the daughters of the rule, and therefore, the word-span of the phrase itself. The tree is trivial to determine from the derivation, as the list of pointers has the same structure as the syntactic tree. We determine the rule names of the derivation by replacing each pointer to a node with the corresponding original production name (or the word itself for word nodes). Figure 1.1 shows an example syntactic tree and its corresponding flat labelled-bracketing representation. This flat labelled-bracketing is used to represent syntactic trees output by RASP. For example, Figure 2.15 shows three parses in this format generated for the parse forest shown in Figure 2.14.

2.6The term tree refers to syntactic trees in this work, while derivation refers to unpacked structures that include full derivation information, e.g. attribute-value pairs.
2.7The number n is specified by the user, and represents the maximum number of parses to be unpacked.

(T/txt-sc1/- (S/np_vp I_PPIS1 (V1/v_np see+ed_VVD (NP/det_n1 the_AT (N1/n1_pp1 (N1/n man_NN1) (PP/p1 (P1/p_np in_II (NP/det_n1 the_AT (N1/n park_NN1)))))))))

(T/txt-sc1/- (S/np_vp I_PPIS1 (V1/v_np_pp see+ed_VVD (NP/det_n1 the_AT (N1/n man_NN1)) (PP/p1 (P1/p_np in_II (NP/det_n1 the_AT (N1/n park_NN1)))))))

(T/txt-sc1/- (S/np_vp I_PPIS1 (V1/vp_pp (V1/v_np see+ed_VVD (NP/det_n1 the_AT (N1/n man_NN1))) (PP/p1 (P1/p_np in_II (NP/det_n1 the_AT (N1/n park_NN1)))))))

Figure 2.15: The n-best syntactic trees output for three parses for the sentence I saw the man in the park.

Grammatical Relations (GRs)
The GRs for each derivation are computed from the set of GR specifications at each node, passing the (semantic) head of each subanalysis up to the next higher level in the derivation (beginning from word nodes). GR specifications for nodes are, if required, instantiated based on the features of daughter nodes. These are referred to as unfilled until the slots containing numbers are filled with the corresponding heads of each daughter node. For example, the grammar rule named NP/det_n has the unfilled GR specification <2, (det 2 1)>. Therefore, if an NP/det_n local tree has two daughters with heads the and cat respectively, the resulting filled GR specification is <cat, (det cat the)>. That is, the head of the local tree is cat and the GR output is (det cat the). For a derivation, each node has one corresponding head and one or more corresponding GRs (in the filled GR specification). For word nodes, the head is the word itself.

Weighted GRs
The weighted GR output for a sentence consists of the unique set of grammatical relations in all derivations licensed for that sentence, where each GR is weighted based on the probabilities of the derivations in which it occurs. This weight is normalised to fall within the range [0,1], where 1 indicates that all derivations contain the GR.

Example
If we continue the example for the parse forest shown in Figure 2.14, the corresponding shift/reduce probability and instantiated GR specifications for each node are shown in Table 2.9. For example, the V1/vp_pp subanalysis (node 17) contains the instantiated GR specification <1, (ncmod 1 2)> since its second daughter has the value NP for its PSUBCAT feature. Note that the head of a word node is considered the word itself.

                                           GR SPECIFICATION
Node  Word/Rule      Probability    head           GR
1     T/txt-sc1/-    0.0            1
2     S/np_vp        -0.5391        2              (ncsubj 2 1 )
3     I_PPIS1        -0.6763        I_PPIS1
4     V1/v_np_pp     -0.8728        1              (dobj 1 2), (iobj 1 3)
5     see+ed_VVD     -0.00002       see+ed_VVD
6     NP/det_n       -1.1568        2              (det 2 1)
7     the_AT         -0.0004        the_AT
8     N1/n           -1.848         1
9     man_NN1        0.0            man_NN1
10    PP/p1          0.0            1
11    P1/p_np        -0.6565        1              (dobj 1 2)
12    in_II          -0.0134        in_II
13    NP/det_n       -0.1663        2              (det 2 1)
14    the_AT         -0.0005        the_AT
15    N1/n           -2.307         1
16    park_NN1       0.0            park_NN1
17    V1/vp_pp       0.0            1              (ncmod 1 2)
18    V1/v_np        -2.5335        1              (dobj 1 2)
19    PP/p1          0.0            1
20    P1/p_np        -0.6565        1              (dobj 1 2)
21    in_II          -0.0005        in_II
22    V1/v_np        -1.1534        1              (dobj 1 2)
23    NP/det_n       -0.1663        2              (det 2 1)
24    N1/n1_pp1      -0.5165        1              (ncmod 1 2)
25    P1/p_np        -0.6565        1              (dobj 1 2)
26    in_II          -0.0452        in_II

Table 2.9: Node (log base 10) probability and instantiated GR specifications for parse forest nodes shown in Figure 2.14.
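The bottom-up filling of GR specifications can be sketched as follows. This is an illustrative simplification rather than RASP's implementation: each tree node is assumed to carry its rule's unfilled specification as a (head slot, GR templates) pair, and numbered slots are replaced by the heads propagated from the daughters.

def fill_grs(node):
    """Recursively compute (head, filled GRs) for a derivation node.

    A word node is assumed to be a plain string and is its own head.
    A tree node is assumed to be a dict with keys 'spec' (head_slot, gr_templates)
    and 'daughters' (a list of nodes); slot numbers 1..n refer to daughters.
    """
    if isinstance(node, str):                       # word node: head is the word itself
        return node, []

    head_slot, templates = node["spec"]
    daughter_heads, grs = [], []
    for daughter in node["daughters"]:
        d_head, d_grs = fill_grs(daughter)
        daughter_heads.append(d_head)
        grs.extend(d_grs)                           # GRs already filled lower in the tree

    def fill(slot):
        return daughter_heads[slot - 1] if isinstance(slot, int) else slot

    for template in templates:                      # e.g. ("det", 2, 1) -> (det cat the)
        grs.append(tuple(fill(s) for s in template))
    return fill(head_slot), grs

# The NP/det_n example: unfilled specification <2, (det 2 1)> over daughters "the" and "cat".
np_node = {"spec": (2, [("det", 2, 1)]), "daughters": ["the", "cat"]}
print(fill_grs(np_node))   # ('cat', [('det', 'cat', 'the')])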

Figure 2.16 illustrates the n-best GRs and the corresponding (non-normalised and normalised) weighted GRs for this sentence. Weights on the GRs are normalised probabilities representing the weighted proportion of derivations in which the GR occurs. A non-normalised weight is calculated as the sum of derivation probabilities for derivations that contain the specific GR. We then normalise the weight using the sum of all derivation probabilities. For example, the GR (iobj see+ed in) is in one derivation, with log probability −8.237, the non-normalised score. The sum of all derivation probabilities is −7.843 (in log form). Therefore, the normalised probability (and final weight) of the GR is 10^(−8.237−(−7.843)) = 0.404265.2.8

2.8As we are dealing with log probabilities, summation and subtraction of these probabilities is not straightforward. For probabilities X and Y with log probabilities x and y respectively, the log probability of X × Y is x + y, that of X ÷ Y is x − y, that of X + Y is x + log10(1 + 10^(y−x)), and that of X − Y is x + log10(1 − 10^(y−x)).

N-BEST GRS:

Parse probability: -8.075
(det man_NN1 the_AT)
(det park_NN1 the_AT)
(dobj in_II park_NN1)
(dobj see+ed_VVD man_NN1)
(ncsubj see+ed_VVD I_PPIS1 )
(ncmod man_NN1 in_II)

Parse probability: -8.237
(det man_NN1 the_AT)
(det park_NN1 the_AT)
(dobj in_II park_NN1)
(dobj see+ed_VVD man_NN1)
(ncsubj see+ed_VVD I_PPIS1 )
(iobj see+ed_VVD in_II)

Parse probability: -9.884
(det man_NN1 the_AT)
(det park_NN1 the_AT)
(dobj in_II park_NN1)
(dobj see+ed_VVD man_NN1)
(ncsubj see+ed_VVD I_PPIS1 )
(ncmod see+ed_VVD in_II)

(NON-NORMALISED) WEIGHTED GRS:
-7.843 (det man_NN1 the_AT)
-7.843 (det park_NN1 the_AT)
-7.843 (dobj in_II park_NN1)
-7.843 (dobj see+ed_VVD man_NN1)
-7.843 (ncsubj see+ed_VVD I_PPIS1 )
-8.075 (ncmod man_NN1 in_II)
-8.237 (iobj see+ed_VVD in_II)
-9.884 (ncmod see+ed_VVD in_II)

(NORMALISED) WEIGHTED GRS:
1.0 (det man_NN1 the_AT)
1.0 (det park_NN1 the_AT)
1.0 (dobj in_II park_NN1)
1.0 (dobj see+ed_VVD man_NN1)
1.0 (ncsubj see+ed_VVD I_PPIS1 )
0.5866 (ncmod man_NN1 in_II)
9.0960e-3 (ncmod see+ed_VVD in_II)
0.404265 (iobj see+ed_VVD in_II)

Total probability (sum of all parse probabilities): -7.843

Figure 2.16: The n-best GRs, and non-normalised/normalised weighted GRs determined from three parses for the sentence I saw the man in the park. Parse probabilities and non-normalised weights are shown as log probabilities as RASP stores all probabilities in log (base 10) form with double float precision.
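A minimal sketch of this weighting computation follows, using the log-domain arithmetic from footnote 2.8. It is not RASP's code: derivations are assumed to arrive as (log10 probability, list of GRs) pairs.

import math

def log10_sum(log_probs):
    """Sum probabilities given in log10 form, returning the log10 of the sum."""
    m = max(log_probs)
    return m + math.log10(sum(10 ** (lp - m) for lp in log_probs))

def weighted_grs(derivations):
    """derivations: list of (log10_prob, [gr, ...]) pairs for one sentence.
    Returns {gr: normalised weight in [0, 1]}, 1.0 meaning the GR is in every derivation."""
    total = log10_sum([lp for lp, _ in derivations])
    per_gr = {}
    for lp, grs in derivations:
        for gr in set(grs):                     # each GR counted once per derivation
            per_gr.setdefault(gr, []).append(lp)
    # Non-normalised weight = sum of derivation probabilities containing the GR;
    # dividing by the total corresponds to subtracting 'total' in log space.
    return {gr: 10 ** (log10_sum(lps) - total) for gr, lps in per_gr.items()}

# The (iobj see+ed in) GR occurs only in the derivation with log probability -8.237,
# so its weight is 10^(-8.237 - (-7.843)), approximately 0.404, as in Figure 2.16.
example = [(-8.075, ["(ncmod man in)"]), (-8.237, ["(iobj see+ed in)"]), (-9.884, [])]
print(weighted_grs(example))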

Chapter 3

Part-of-speech Tag Models

We briefly described the preprocessing stages of the extant parser in §2.6.3. These processing modules, which we define in §3.2, include a part-of-speech (PoS) tagger (tagger, henceforth). The tagger maps each token in the input sequence to a set of possible PoS or grammatical categories defined in the tagger’s dictionary (or lexicon). The tagger usually decides on the optimum set of PoS tags given the input sequence, so that only one tag is returned per word. In this case, the tagger is considered to run in single tag per word (tpw) mode. However, most taggers may also run in multiple tpw mode, where more than one tag is returned for each word, though some of the tag ambiguity present in the dictionary is resolved by the tagger. If a tagger is used as a front-end to a parser, then the resulting PoS tags are considered terminals of the parser’s grammar. Tag ambiguity unresolved by the tagger is effectively resolved by the parser as the parser selects the most likely derivation, and therefore the corresponding PoS tag sequence, during parsing. This chapter describes work that aims to optimise the level of tag ambiguity to pass onto the parser, in terms of both parsing efficiency and accuracy. We investigate this aim with respect to RASP’s PoS tagger, and define a number of different tag models within RASP’s architecture in §3.3. Utilising gold standard tag and GR sets enables comparison of these tag models in terms of both tagging and parser performance in §3.4 and §3.5, respectively. As far as the author is aware, this work is the first to perform such a broad comparison. We first describe related work in §3.1. Much of the work we describe in this chapter appears in Watson (2006). 3.1 Previous Work This section describes previous work on the integration of PoS taggers into the parsing process. We discuss how PoS taggers may be utilised as front-ends to parsers, that is, as preprocessing modules prior to parsing, in §3.1.1. Next, we describe work that focuses on the choice of tag model, that is, the optimum level of tag ambiguity to pass onto the parser in §3.1.2. Finally, we describe PoS tag models in §3.1.3. 3.1.1 PoS Taggers and Parsers PoS Tagger Front-end In §2.3.1, we defined the components of an LR parsing system. We also described natural language parsers; programs that accept a word string as the input sequence and return the corre- sponding analysis (a parse). The input sequence analysed by the parser consists of terminals of the grammar, so that the parse structure (consisting of nonterminals) is defined and constructed

over these terminals. Given a CFG, we output a derivation that includes words labelled with PoS tags and nonterminal categories that each span (a subset of) these PoS tags. Depending on the architecture of the parser, we can consider either the words or the PoS tags as the terminals of the grammar. That is, rules of the grammar may include unary rules, where the word is the daughter of the rule and the PoS tag category is the mother of the rule, e.g. Det → the. If we remove these unary rules, and consider PoS tags terminals of the grammar, then we require a preprocessing stage that first maps from the raw text sequence to a PoS tag sequence. In this case, the tagger is considered a front-end or preprocessing module to the parser and the input sequence is a set of word-tag pairs, e.g. the Det.

Resolving PoS Tag Ambiguity
PoS tagging is the process of mapping from raw text to the PoS tag sequence, that is, a method to select one tag per word. A dictionary defines the set of tags for each terminal of the grammar. Initially, the set of tags for each word is considered the set in the dictionary. The tagger then resolves some or all of the tag ambiguity. If it resolves all ambiguity then only one tag per word will remain and the tagger is considered to be in single tpw mode. Conversely, if it does not remove any tag ambiguity, we bypass the tagger altogether and allow the parser to resolve all the ambiguity. That is, the parser builds all possible parses relating to each possible combination of tags in the input sequence and selects the most probable parse, and thus, the corresponding tag sequence. In this case, we effectively allow the tag dictionary to define a set of unary grammar rules as discussed previously. A number of intermediate models may be utilised, whereby we vary the level of tag ambiguity passed onto the parser, so that the tagger and parser are both resolving some of the ambiguity. In this way, the tagger and the parser (or a combination of both) form a PoS tag model. While the parser may act as a PoS tag model, the probability of each tag determined by the tagger, given multiple tpw input to the parser, may be integrated into, and thence affect, the probability of each resulting derivation (and corresponding PoS tag sequence).

Effect on Performance
Research investigating the use of PoS taggers as front-ends to parsers has, to date, concentrated on whether or not such a preprocessing stage improves parse accuracy and/or efficiency. Compared to the parser, the tagger resolves tag ambiguity efficiently, as parse ambiguity can increase exponentially with tag ambiguity. Thus, most studies agree that efficiency improves with a tagger front-end. For example, Charniak et al. (1996) illustrate that using a front-end tagger to resolve all tag ambiguity, which achieves 95.9% tagging accuracy, can significantly improve the efficiency of the parser. Their efficiency metric is the number of edges in their chart. They measured tagging accuracy only, and so did not test the impact of these tag models on parsing accuracy. Researchers have speculated that both the speed and accuracy of parsing improve, as the tagger 'filters out' unlikely tags, resulting in reduced parse ambiguity. Furthermore, if we reduce tag errors, we reduce the number of parse errors that result when the parser selects incorrect tags and, therefore, an incorrect syntactic analysis over these tags.
However, these studies generally employ a 'good' statistical PoS tagger, as taggers now achieve tagging accuracy in the high 90s when trained and tested over the same domain. Dalrymple (2006) investigates whether we can reduce parse ambiguity if we resolve tag ambiguity. Dalrymple argues that this will only occur if parses can be differentiated based on their tag sequences. If so, resolving tag ambiguity with a 'perfect tagger' may improve parser performance; otherwise, tag errors introduced by a tagger will be detrimental to parse accuracy and coverage. Dalrymple illustrates that over section 13 of the WSJ (see §1.3.1), parses are differentiated based on their tag sequences for 70.53% of sentences. Given access to a perfect tagger, parse ambiguity is reduced by around 50%. An increase in tag error rates results in a decrease in parser accuracy and coverage. This decline in performance may outweigh the benefits of increased efficiency, depending on whether the parsing task prioritises efficiency or accuracy in the parser. For example, Kaplan & King (2003) show that parser coverage (percentage of full parses) falls from 76% to only 62% if they employ a front-end PoS tagger. Dalrymple (2006) suggests that a major source of their tag errors is their mapping from PoS tags to the terminals of their grammar. Furthermore, this error-prone mapping is evident in their results for parser coverage and accuracy over gold-standard PoS tags: the parser's accuracy still declines, even though these gold-standard tags should instead provide an upper bound on the task.

Coverage
Parser coverage is often used to reflect parser accuracy, although this is not appropriate, as increased coverage does not necessarily translate to increased accuracy, especially if the parser's grammar is not well-constrained over the PoS tag terminals. For example, Charniak et al. (1996) report that incorrect tags only marginally affect the parser's coverage. Their parser finds a complete parse for all sentences given all possible tags, while it finds a complete parse for only 99.2% of sentences over single tpw input. However, it is unclear whether the accuracy of the parser improves given multiple tpw input, as parse ambiguity increases with multiple tpw input. That is, the parser is more likely to select an incorrect sequence of tags (and therefore, an incorrect parse) given all possible tags.

Discussion
Research investigating front-end PoS taggers supports their use if parsing efficiency is paramount. Furthermore, an increase in parser accuracy results if the PoS tagger's tagging errors result in fewer parse errors than those due to increased parse ambiguity over multiple tpw input. Thus, the optimal level of tag ambiguity resolved by the parser depends on the given PoS tagger and parser, and furthermore, on whether parser efficiency or accuracy is critical in the current parsing task. For example, Clark & Curran (2004a) illustrate that significant increases in efficiency, accuracy and coverage occur if they integrate a front-end multiple tpw supertagger with their combinatory categorial grammar (CCG) parser. Supertags are lexically rich PoS tags, and specify the local syntactic constraints for a word. They do not fully resolve tag ambiguity in the supertagger, as the supertagger in single tpw mode achieves lower accuracy (low 90s) than PoS taggers with coarser tag sets, such as the tagger of Charniak et al. (1996).
We do not investigate whether a PoS tagger should be used (readers are referred to Dalrymple 2006 for a recent survey and discussion of this) but instead focus on the choice of tag model. We define a number of tag models in the following section, and provide a broad comparison of these models in terms of a number of performance metrics for both parsing and tagging.

3.1.2 Tag Models

We define a tag model as the parsing architecture employed to select the tag sequence or to provide a ranking (probability) for each tag (given more than one tag per word). This may include utilising the tagger, the parser or a mixture of both. Thus the resulting tag file can contain any level of tag ambiguity, from single tpw to the full set of tags defined over each word in the tag dictionary. Recall that a tag sequence is defined as a sequence of tags (or word-tag pairs) where one tag is selected for each word (token) in the input sequence.

Charniak et al. (1996)
Charniak et al. (1996) investigate the optimal choice of tag model for a PCFG parser and consider the parser's tagging accuracy, that is, the tag sequence in the top-ranked parse. They conclude that the parser, given multiple tpw input, cannot significantly improve on the tag accuracy of a (single tpw) PoS tagger, although they employ a coarse tag set that consists of only 19 PoS tags. Therefore, these results may not translate to parsers that employ finer-grained tag sets (such as the set applied by RASP). Furthermore, if we assume that tag error rates correspond to parser error rates, we incorrectly assume that all tag errors are equally detrimental to parser output. This is not the case, however: for example, a noun mistagged as a verb will have a more adverse effect on parser performance than mistagging the noun as an adjective.

Dalrymple (2006)
Dalrymple (2006) investigates the impact of PoS tags on parse ambiguity, represented by the number of parses licensed by the grammar. By grouping parses via their tag sequences, Dalrymple finds that the majority of parses can be differentiated in terms of their tag sequence, since only 30% of sentences had all their parses defined over the same tag sequence. Given the correct tag sequence in these cases, she estimates that parse ambiguity will halve. Dalrymple suggests that the tag sequence that corresponds to the largest number of parses may well be the correct tag sequence, given that most parses contain that tag sequence, though she is unable to evaluate this tag selection model without a gold-standard tag set. We consider this tag model in this work, as well as more sophisticated tagging schemes.

Clark & Curran (2004a)
Clark & Curran (2004a) apply a tag model that uses the parser's CCG grammar to decide whether supertags output by the supertagger are acceptable, that is, whether the grammar is able to find a full analysis. A small number of supertags per word is output initially (1.4 tpw), the set of most probable tags. They continue to increase the number of supertags until either the parser is able to find a complete analysis or the maximum set of tags (those with probability within range of the most probable tag for the word) is considered. This method improves the efficiency, coverage and accuracy of the parser.

3.1.3 HMM PoS Taggers

In this section we describe Hidden Markov Models (HMMs), a common statistical PoS tagging model, as the parser currently utilises a first-order HMM PoS tagger.
We also discuss the algorithms that select a PoS tag sequence from the set defined in the tag dictionary. Furthermore, we describe methods that determine this dictionary and train the parameters of a HMM.

Hidden Markov Models (HMMs)
Hidden Markov Models (HMMs) may be viewed as a generalised NFA (see §2.2.1), as tags correspond to states and two types of edges exist. The first type, transition edges, links two states in the graph. That is, these edges are the same as the transition edges defined by an NFA.

The corresponding probability of such an edge represents the probability that one tag follows another tag, or a given set of tags. The order of a HMM is the number of preceding tags we consider when we calculate the transition probabilities. For example, a second-order HMM determines the probability of a tag based on the previous two tags witnessed. The second type, lexical edges, is defined for each state, where we define the probability of a word output by that state, i.e. the probability of the word given the tag. Non-zero probabilities for word-tag pairs occur only if the pair occurs in the tag dictionary. We output the tag sequence through the NFA, given the input sequence of words. The corresponding state sequence is unknown, hence the term 'hidden' in 'Hidden Markov Models'.

Determining Tag Sequences
We determine the tag sequence for an input sequence using the well-known Viterbi or Forward-Backward (FBA) algorithms, which are both dynamic programming approaches (readers are referred to Sharman 1990 for a detailed introduction to these algorithms). The Viterbi algorithm selects the tag sequence corresponding to the most likely path through the HMM, where a path's probability is the product of transition and lexical probabilities. Over an input sequence, the FBA determines the probability of each possible state (tag) in the HMM. That is, non-zero probabilities result for states that are members of one or more complete paths through the HMM. The forward and backward probabilities correspond to the total probability of all paths to and from the state, respectively. The probability of each state is the product of the forward and backward probabilities for the state. This probability is the posterior tag probability, that is, the probability of the tag given the word over the given input sequence of words. Either the most probable tag for each word or all possible tags for the word (with the corresponding posterior tag probabilities) are returned by the tagger. Thresholds can be applied over the posterior probabilities, so that we retain only highly probable tags in subsequent processing. Furthermore, the posterior tag probabilities can be incorporated into the parser's statistical model.

Training a HMM
If we have a tagged corpus from which to learn the model's parameters (probabilities), then frequency counts can be used to determine both transition and lexical probabilities via MLE. The set of tags observed for each word and their frequencies are used to create the dictionary. If a hand-tagged corpus is unavailable, we can train a HMM tagger using either the Viterbi or the Baum-Welch algorithm. To perform Baum-Welch training we iteratively update the model's parameters, starting from an initial set of model parameters. During each iteration, we update each tag's probabilities using the forward-backward probabilities of each state (tag). Baum-Welch is designed to converge on a set of parameters that maximise the probability of a training corpus; that is, this training algorithm is a generalised Expectation-Maximisation (EM) algorithm.

The Initial Model's Impact on Performance
Elworthy (1994) and Merialdo (1994) independently define three patterns of Baum-Welch training given different conditions to determine the initial model parameters. Elworthy applies MLE over a hand-tagged corpus to form one of the initial models, from which he refines the model parameters using Baum-Welch over another untagged corpus.
The classical pattern emerges given a poor initial model, where performance on the test set improves steadily with each training iteration. Two other patterns also result, the initial and early patterns, where the best model is either the start model or the one that results from very few training iterations, respectively. In these cases, the initial model either contains sufficient information (e.g. if the model trains over a hand-tagged corpus) or only partial information (e.g. lexical information only), respectively.

Smoothing and Unknown Words
During training, smoothing assigns unseen word-tag pairs a non-zero probability. Furthermore, most taggers also incorporate an unseen word module, so that if a word does not have an entry in the dictionary the tagger is still able to determine the most likely tag or set of tags for each word.

3.2 RASP's Architecture

3.2.1 Processing Stages

RASP is implemented as a series of modules written in C or Common Lisp, which can be pipelined analogously to a series of Unix-style filters. The modules include sentence boundary detection, tokenisation, PoS tagging, NE recognition and morphological analysis. Next, this preprocessed text is input to the statistical parser (Briscoe & Carroll, 2002).3.1 Figure 3.1 illustrates the different processing stages performed during parsing, many of which are optional (e.g. by default NE recognition is not performed). Note that the final two processing stages are both performed by the parser itself.

3.2.2 PoS Tagger

CLAWS Tagset
§2.6.1 describes RASP's grammar, where terminals are defined over PoS tags output by a first-order ('bigram') HMM PoS tagger originally implemented by Elworthy (1994). The tagset applied is based on the CLAWS tagset, which is used in Susanne (Sampson 1995, see §1.3.1). The full definition of the tagset is available in this reference, and in Jurafsky & Martin (2000, Appendix C). In the CLAWS tagset, the first letter encodes the major PoS category and subsequent letters/numbers encode increasingly minor differences. For example, nouns begin with the letter 'N', and a second letter of 'P' or 'N' indicates whether or not the noun is proper, respectively. Tags ending with the numbers 1 or 2 indicate singular or plural versions of the tags. Thus NP1 and NP2 are closely related tags, both being proper nouns, in singular and plural forms respectively.

Training
The tagger is trained on 3 million words of text from Susanne, the LOB (Johansson et al., 1986) and (a subset of) the BNC (Aston & Burnard, 1998). The resulting tag dictionary contains just over 50K words. Briscoe & Carroll (2006) make minor modifications to the tagger's dictionary, based on observed parse failures over sections from the WSJ (not including section 23). The frequency counts for each tag are stored in the tag dictionary. Given the training patterns observed by Elworthy, the frequency counts of these fully annotated corpora provide the best tagging accuracy. For example, Figure 3.2 illustrates a number of entries (lines) in the dictionary, where each specifies the word and its corresponding tagset and frequency counts.

3.1Processing times given throughout, including those in subsequent chapters, do not include these preprocessing stages. We omit these preprocessing overheads as they are negligible compared with those required during parsing.

Raw Text → Sentence Boundary Detection → Tokeniser → NE Recognition → PoS Tagger → Morphological Analysis → Parser and Grammar → Statistical Disambiguator

Figure 3.1: RASP processing pipeline.

We 2 PPIS2 100 NP1 1
all 3 DB 229 DB2 150 RR 55
walked 2 VVD 46 VVN 13
up 3 II 34 RP 213 VV0 10
the 1 AT 8520
hill 2 NN1 2 NNL1 24
. 1 . 6584

Figure 3.2: Example lexical entries in the tag dictionary. Each line in the dictionary corresponds to a word (at the start of the line), which is followed by the number of PoS tags for the word. The line then contains pairs consisting of a PoS tag and its corresponding frequency count.

The sentence We all walked up the hill . has the corresponding possible tagset shown for each word in this figure.

Mapping from PoS Tags to Terminal Categories
The full CLAWS tagset contains over 170 PoS and punctuation tags. However, the current grammar utilises only 150 of these, of which around 50 are associated with an identical or subsuming lexical category in the current grammar. That is, the mapping from terminals to PoS tags is many-to-many. For example, the tags VVD (past tense form of lexical verb) and VVG (-ing form of lexical verb) are both mapped to the terminal with name V0; however, the verb form values for the terminals (VFORM features) are PAST and ING, respectively. Figure 3.3 illustrates a number of example mappings from PoS tags to the terminals of the grammar.

WORD , : [PUNCT comma]. WORD . : [PUNCT dot]. WORD AT : DT[PLU @x, POSS -, WH -]. WORD CC : CJ[CJTYPE END]. WORD CCB : CJ[CJTYPE END]. WORD DB : N0[NTYPE PART, PLU -, POSS -, WH -, CONJ -]. WORD DB2 : N0[NTYPE PART, PLU +, POSS -, WH -, CONJ -]. WORD NN : N0[NTYPE NORM, PLU @x, POSS -, WH -, CONJ -]. WORD NN1 : N0[NTYPE NORM, PLU -, POSS -, WH -, CONJ -]. WORD PNQO : N2[QUOTE -, SCOLON -, NTYPE PRO, PLU -, POSS -, WH +, CONJ -]. WORD PPIS2 : N2[QUOTE -, SCOLON -, NTYPE PRO, PLU +, POSS -, WH -, CONJ -]. WORD RA : A0[AFORM NONE, ATYPE TEMP, ADV +, CONJ -]. WORD RP : A0[AFORM NONE, ATYPE CAT, ADV +, CONJ -], PT. WORD VVD : V0[FIN +, AUX -, VFORM PAST, PLU @x, CONJ -].

Figure 3.3: Example mapping from PoS tag to terminal category. Each line specifies that an input item (PoS tag) of the grammar is defined: the keyword WORD is followed by the CLAWS PoS tag and a colon. Following the colon is the metagrammar definition for the given tag (see §2.6.1), where many of the features are fully specified.
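The many-to-many character of this mapping can be illustrated with a small lookup sketch. The table below contains only a handful of entries abbreviated from Figure 3.3 and the surrounding text (it is not the full RASP mapping), and the feature dictionaries are truncated.

# A few CLAWS tag -> grammar terminal mappings, abbreviated from Figure 3.3.
# Several tags can share a terminal name while differing in feature values.
TAG_TO_TERMINAL = {
    "VVD": ("V0", {"VFORM": "PAST", "FIN": "+", "AUX": "-"}),
    "VVG": ("V0", {"VFORM": "ING"}),          # shares the V0 terminal with VVD
    "NN1": ("N0", {"NTYPE": "NORM", "PLU": "-"}),
    "AT":  ("DT", {"POSS": "-", "WH": "-"}),
}

def terminal_for(tag):
    """Map a CLAWS PoS tag to its grammar terminal name and feature residue."""
    if tag not in TAG_TO_TERMINAL:
        raise KeyError("tag %r has no terminal mapping in this toy table" % tag)
    return TAG_TO_TERMINAL[tag]

print(terminal_for("VVD"))  # ('V0', {'VFORM': 'PAST', 'FIN': '+', 'AUX': '-'})
print(terminal_for("VVG"))  # ('V0', {'VFORM': 'ING'})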

Tagging Modes

The tagger can be run in single tpw (default) or multiple tpw mode, where either the most probable tag or the set of all possible tags is retained, respectively. As the FBA is implemented in addition to the Viterbi algorithm, the tagger can trade off precision against recall by returning all but the most improbable tags up to some relative threshold (where tags are ranked according to their posterior probabilities found using the FBA). However, in practice, these thresholds are applied within the parsing module, so the multiple tpw output of the tagger contains all the tags defined in the tag dictionary. When run in multiple tpw mode, the tagger returns the posterior tag probability of each possible tag. For example, the sentence We all walked up the hill . has the corresponding dictionary entries shown in Figure 3.2. The single and multiple tpw output of the tagger is shown in Figure 3.4. The multiple tpw output shows each tag in the dictionary with its posterior tag probability. The single tpw tag sequence results from selecting the highest-scoring tag for each word.
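The posterior tag probabilities shown in Figure 3.4 are computed with the forward-backward algorithm described in §3.1.3. The following is a minimal first-order HMM sketch of that computation, not the Elworthy tagger itself; the transition, lexical and initial probabilities are assumed to be supplied as plain dictionaries, and the toy figures at the end are invented for illustration.

def tag_posteriors(words, tags, trans, lex, init):
    """Posterior P(tag | position) for a first-order HMM via forward-backward.

    trans[(t1, t2)]: P(t2 | t1); lex[(word, t)]: P(word | t); init[t]: P(t at position 0).
    Missing dictionary entries are treated as probability 0.
    """
    n = len(words)
    fwd = [{t: init.get(t, 0.0) * lex.get((words[0], t), 0.0) for t in tags}]
    for i in range(1, n):                       # forward pass
        fwd.append({t: lex.get((words[i], t), 0.0)
                       * sum(fwd[i - 1][p] * trans.get((p, t), 0.0) for p in tags)
                    for t in tags})
    bwd = [{t: 1.0 for t in tags} for _ in range(n)]
    for i in range(n - 2, -1, -1):              # backward pass
        bwd[i] = {t: sum(trans.get((t, s), 0.0) * lex.get((words[i + 1], s), 0.0)
                         * bwd[i + 1][s] for s in tags)
                  for t in tags}
    posteriors = []
    for i in range(n):                          # normalise forward * backward per position
        scores = {t: fwd[i][t] * bwd[i][t] for t in tags}
        z = sum(scores.values()) or 1.0
        posteriors.append({t: p / z for t, p in scores.items() if p > 0.0})
    return posteriors

# Toy two-tag example: an ambiguous word receives a posterior for each of its tags.
tags = ["NN1", "VV0"]
trans = {("NN1", "NN1"): 0.5, ("NN1", "VV0"): 0.5, ("VV0", "NN1"): 0.7, ("VV0", "VV0"): 0.3}
lex = {("hill", "NN1"): 0.6, ("hill", "VV0"): 0.1, ("walk", "VV0"): 0.5, ("walk", "NN1"): 0.1}
init = {"NN1": 0.6, "VV0": 0.4}
print(tag_posteriors(["walk", "hill"], tags, trans, lex, init))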

Unknown Word Handling

The tagger incorporates a well-developed statistical unknown word handling module (Piano, 1996; Weischedel et al., 1993) which performs well under most circumstances. However, known but rare words often cause problems, as tags for all of their realisations are rarely present. Briscoe et al. (2006) describe a series of manually developed rules which they semi-automatically apply to the dictionary to ameliorate this problem, by adding further tags with low counts to rare words. The new tagger has an accuracy of just over 97% on the DepBank part of section 23 of the WSJ (see §1.3.1), which is competitive performance over this (largely out-of-domain) text.

We_PPIS2 all_DB2 walked_VVD up_RP the_AT hill_NN1 ._.

We PPIS2:0.999983 NP1:1.73948e-05
all DB:0.168974 DB2:0.803405 RR:0.0276206
walked VVD:0.858121 VVN:0.141879
up II:0.149321 RP:0.850004 VV0:0.000674805
the AT:1
hill NN1:0.607938 NNL1:0.392062
. .:1

Figure 3.4: PoS tag output for We all walked up the hill . The first line illustrates the single tpw output, while subsequent lines illustrate the multiple tpw output.
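For illustration, the following sketch reads tagger output in the multiple tpw format of Figure 3.4 (one word per line, each tag paired with its posterior probability). The parsing code is an assumption about the textual format only and is not part of the RASP tools.

```python
# Illustrative sketch: read multiple tpw tagger output of the form
#   word tag1:prob1 tag2:prob2 ...
# (one word per line, as in Figure 3.4) into (word, {tag: prob}) pairs.
def parse_multi_tpw(lines):
    entries = []
    for line in lines:
        fields = line.replace(",", " ").split()
        if not fields:
            continue
        word, tag_fields = fields[0], fields[1:]
        tags = {}
        for field in tag_fields:
            tag, prob = field.rsplit(":", 1)
            tags[tag] = float(prob)
        entries.append((word, tags))
    return entries

example = ["walked VVD:0.858121 VVN:0.141879",
           "up II:0.149321 RP:0.850004 VV0:0.000674805"]
for word, tags in parse_multi_tpw(example):
    best = max(tags, key=tags.get)        # the single tpw choice: highest posterior
    print(word, best, tags[best])
```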

3.3 Part-of-speech Tag Models

We tend to consider parser efficiency and accuracy as parsing goals that we must trade off. However, it is unclear whether errors introduced by a tagger in single tpw mode affect parser accuracy more than parse selection errors introduced due to increased parse ambiguity. Unless the parser can select PoS tags with greater accuracy, and improve over the parsing performance (of the single tpw tagset), both efficiency and accuracy improve from the use of a PoS tagger front-end.

In §3.3.1 we describe the various tag models considered in this work. These include the tag models applied by Charniak et al. (1996) and suggested by Dalrymple (2006). We contrast these tag models in terms of PoS tag and parser accuracy in subsequent sections, using the gold standard tag and dependency files from DepBank (see §1.3.1). Inclusion of the parser in a tag model assumes that a feed-back loop enables the parser to first select the PoS tag sequence and then (without reparsing) select a parse from the group of parses which contain that tag sequence.

We utilise the first sentence in DepBank to illustrate the different tag files that result for each tag model. This sentence is: It will also purchase $473 million in assets, and receive $550 million in assistance from the RTC. Tokenisation results in the following input sequence, i.e. with punctuation separated from word forms: It will also purchase $ 473 million in assets , and receive $ 550 million in assistance from the RTC .

3.3.1 Part-of-speech Tag Files

Initially we apply the PoS tagger to create PoS tagged files over the raw text files of DepBank. PoS tag files that originate from the tagger alone are named ending with ‘-TAG’.

SINGLE-TAG

The SINGLE-TAG file contains preprocessed text with the tagger run in forced-choice (single tpw) mode. As a result, a single tag is selected for each token in the sentence and we do not output the associated probability, i.e. each tag has a probability of 1.

ALL-TAG

Similarly, the ALL-TAG file is created by running the tagger in multiple tpw mode. The resulting file contains more than one tag per word, i.e. all tags defined for the token in the tag dictionary. When run in multiple tpw mode, the tagger outputs the posterior tag probabilities of each tag as described previously. Figure 3.5 illustrates the SINGLE-TAG and ALL-TAG file contents for the example sentence.

It_PPH1 will_VM also_RR purchase_VV0 $_NNU 473_MC million_NNO in_II assets_NN2 ,_, and_CC receive_VV0 $_NNU 550_MC million_NNO in_II assistance_NN1 from_II the_AT RTC_NP1 ._.

It It_PPH1:1
will will_NN1:3.41324e-06 will_VM:0.999997
also also_RR:1 also_&FW:1.53286e-09
purchase purchase_VV0:0.959841 purchase_NN1:0.0401588
$ $_NNU:1
473 473_MC:1
million million_NNO:1
in in_RP:8.33575e-308 in_II:1
assets asset+s_VVZ:6.57905e-306 asset+s_NN2:1
, ,_,:1
and and_CC:1
receive receive_VV0:1
$ $_NNU:1
550 550_MC:1
million million_NNO:1
in in_RP:5.443e-308 in_II:1
assistance assistance_NN1:1
from from_RR:0.000113897 from_RG:2.20964e-05 from_II:0.999864
the the_AT:1
RTC RTC_NP1:1
. ._.:1

Figure 3.5: PoS tag output for It will also purchase $ 473 million in assets , and receive $ 550 million in assistance from the RTC . The first line illustrates the single tpw output in the SINGLE-TAG file, while subsequent lines illustrate the multiple tpw output in the ALL-TAG file. Each line in ALL-TAG corresponds to the word at the start of the line, which is followed by each possible tag paired with the corresponding posterior tag probability. Each possible tag is shown in the form: word tag:probability. Each word in the ALL-TAG file is analysed by the morphological processing module.

In this example, the SINGLE-TAG input is correct, therefore this tag sequence appears in the gold standard tag file.

3.3.2 Thresholding over Tag Probabilities

ALL-TAG contains token-tag pairs with their corresponding posterior tag probability. For efficiency, RASP applies two thresholds over the tag probabilities prior to parsing.3.2 Firstly, the parser removes all but the most probable tag if this tag has a posterior probability higher than 0.90. Secondly, only tags more than 1 in 50 times as probable as the most probable tag are parsed. We ‘filter’ the ALL-TAG file to produce the corresponding tag files that result from application of the RASP thresholds, retaining the original posterior tag probabilities. PoS tag files that originate from applying parse system thresholds over the ALL-TAG files alone are named ending with ‘-SYS’.

3.2These thresholds can be specified by the user.

DEFAULT-SYS

DEFAULT-SYS results from filtering the full tagset in the ALL-TAG file with the RASP system default tag thresholds. For the example sentence, all ambiguity can be resolved by the application of the system default thresholds. That is, the file contains the same tags as the SINGLE-TAG file, though these are weighted according to the weight of the tag shown in the ALL-TAG file. However, this tag weight has no overall effect on the parse ranking as all parses have the same tag sequence.

MULT-SYS

MULT-SYS results from increasing the system default thresholds to 0.99 and 200, respectively. The tag file contains more tags on average than DEFAULT-SYS due to these relaxed thresholds, though it still aims to filter out low probability tags. Figure 3.6 illustrates the MULT-SYS tag file contents for the example sentence. Some tag ambiguity remains, as two tags are included for the token purchase in the input.

It It_PPH1:1 will will_VM:0.999997 also also_RR:1 purchase purchase_VV0:0.959841 purchase_NN1:0.0401588 $ $_NNU:1 473 473_MC:1 million million_NNO:1 in in_II:1 assets assets_NN2:1 , ,_,:1 and and_CC:1 receive receive_VV0:1 $ $_NNU:1 550 550_MC:1 million million_NNO:1 in in_II:1 assistance assistance_NN1:1 from from_II:0.999864 the the_AT:1 RTC RTC_NP1:1 . ._.:1

Figure 3.6: MULT-SYS PoS tags for It will also purchase $ 473 million in assets , and receive $ 550 million in assistance from the RTC .
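The two thresholds described in §3.3.2 can be sketched as follows. This is an illustrative reimplementation rather than the RASP code itself; the DEFAULT-SYS values (0.90, 50) are used as defaults and the MULT-SYS values (0.99, 200) are passed explicitly.

```python
# Illustrative sketch of the two tag-probability thresholds described in §3.3.2.
def filter_tags(tags, single_threshold=0.90, ratio_threshold=50):
    """tags: {tag: posterior probability} for one word.
    Returns the filtered {tag: probability} set, retaining the original posteriors."""
    best_tag = max(tags, key=tags.get)
    best_prob = tags[best_tag]
    # 1. If the most probable tag is above single_threshold, keep it alone.
    if best_prob > single_threshold:
        return {best_tag: best_prob}
    # 2. Otherwise keep tags at least 1/ratio_threshold as probable as the best tag.
    return {t: p for t, p in tags.items() if p * ratio_threshold >= best_prob}

up = {"II": 0.149321, "RP": 0.850004, "VV0": 0.000674805}
print(filter_tags(up))              # DEFAULT-SYS thresholds: keeps II and RP
print(filter_tags(up, 0.99, 200))   # MULT-SYS thresholds
```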

3.3.3 Top-ranked Parse Tags

We apply the parser-based tag model considered by Charniak et al. (1996), where we select the tag sequence that appears in the top-ranked parse output by the parser. We run RASP over the aforementioned tag files to produce the parser-based files. Time and memory limitations self-imposed on the system (see §2.6.3) can result in parse time-outs; we therefore remove these limitations. The original posterior tag probabilities affect parse ranking, as RASP incorporates these probabilities directly. During parsing, the probability of the word with a given tag is considered the probability of this tag. Thus the probability of a parse is the product of all action probabilities (see §2.6.3) and the corresponding posterior tag probabilities. Each tag file contains a single tpw in the same format as that shown for the SINGLE-TAG file. PoS tag files that contain tag sequences corresponding to the parser’s top parse are named starting with the original tag file parsed, and ending with ‘-TOP-PARSE’.

ALL-TAG-TOP-PARSE

The ALL-TAG-TOP-PARSE file contains the top-ranked parse tag sequence when RASP parses the ALL-TAG file.

DEFAULT-SYS-TOP-PARSE

Similarly, the DEFAULT-SYS-TOP-PARSE file contains the top-ranked parse tag sequence when the DEFAULT-SYS tag file is parsed.

MULT-SYS-TOP-PARSE

Finally, we parse the MULT-SYS tag file to determine the top-ranked parse tag sequence for the MULT-SYS-TOP-PARSE file.

3.3.4 Highest Count Tags

We apply the parser-based tag model mentioned, though not implemented, by Dalrymple (2006). This model selects the tag sequence whereby each tag selected for a word appears in the highest number of derivations output by the parser.3.3 The number of derivations containing each tag is normalised by the total number of parses, and this score is used to rerank the tagset for each token. If the system finds a fragmentary parse (see §2.6.4) for a sentence then the system outputs the original tagset for each word.

We consider this reranking over the DEFAULT-SYS file only. This enables feasible processing times, as the grammar licenses too many parses given higher levels of tag ambiguity. Ordinarily, highly ambiguous sentences are halted due to the time and memory restrictions imposed during parsing, though we remove these as we require tag sequences for all sentences. As we utilise the DEFAULT-SYS file only, and rank tag sequences based on the number of parses, the corresponding tag files start with the name ‘DEFAULT-SYS-NUM’. We output the top tag for each word, or all tags with the corresponding probability based on the (normalised) number of derivations in which each tag occurs.

DEFAULT-SYS-NUM-TOP

We run RASP over DEFAULT-SYS, and output the top ranked tag sequence where the ranking is based on the proportion of derivations which contain the given tag. This tag file contains a single tag per word, in the same format as shown for the SINGLE-TAG file.

3.3As unpacking all derivations is impracticable, we apply the inside-outside algorithm (IOA) to determine these counts/probabilities directly from the parse forest. We describe the IOA fully in the following chapter.

DEFAULT-SYS-NUM-ALL

We output all tags of DEFAULT-SYS, though the tag probabilities are replaced with the new weighting based on the proportion of derivations containing the tag. The tag file contains one or more tags per word, in the same format as shown for the DEFAULT-SYS tag file. However, the corresponding tag weights are updated to the weight determined by this tag model.

3.3.5 Weighted Count Tags

We consider a novel tag model that is effectively a more sophisticated version of the previous tag model based on highest counts. Here, we instead weight tags based on the weighted sum of derivations output by the parser. We utilise the corresponding parses’ probability in the weighted sum. Therefore, the normalised weight of the tag represents the proportion of parse probability mass containing the tag rather than the proportion of parses.3.3 Thus tags in higher ranked derivations are considered more likely. Again, we utilise only the DEFAULT-SYS file to enable feasible processing times over the full data set. The resulting tag files start with the name ‘DEFAULT-SYS-WEIGHT’. We output the top ranked tag or all tags (with the new probability associated with each) into separate tag files. The following tag files are in the same format as the DEFAULT-SYS-NUM-TOP and DEFAULT-SYS-NUM-ALL files, respectively.

DEFAULT-SYS-WEIGHT-TOP

We output the top ranked tag sequence where each tag weight represents the proportion of parse probability mass that contains the given tag.

DEFAULT-SYS-WEIGHT-ALL

We output all tags, reweighting each tag with the new probability based on the proportion of parse probability mass.

3.3.6 Gold Standard Tags

We also utilise the gold standard tagset of DepBank (see §1.3.1), to measure the accuracy of the other tag models and to provide an upper bound on parser accuracy. This tag file is referred to as GOLD and contains the single (correct) tag per word, in the same format as shown for the SINGLE-TAG file.

3.3.7 Summary

Table 3.1 summarises the different tag models considered and the corresponding file name for each. The tag model’s file name is referred to henceforth.

Tag Setup | Name | Description
Tagger | SINGLE-TAG | Tagger in single tpw mode.
Tagger | ALL-TAG | Tagger in multiple tpw mode.
Thresholds | DEFAULT-SYS | Default thresholds (0.90, 50) applied to ALL-TAG.
Thresholds | MULT-SYS | Thresholds (0.99, 200) applied to ALL-TAG.
Parser | ALL-TAG-TOP-PARSE | Tag sequence in top parse when parsing ALL-TAG.
Thresholds & Parser | DEFAULT-SYS-TOP-PARSE | Tag sequence in top parse when parsing DEFAULT-SYS.
Thresholds & Parser | MULT-SYS-TOP-PARSE | Tag sequence in top parse when parsing MULT-SYS.
Thresholds & Parser | DEFAULT-SYS-NUM-TOP | Most frequently used tag, based on the number of parses in which tags occur when parsing DEFAULT-SYS.
Thresholds & Parser | DEFAULT-SYS-NUM-ALL | Normalised counts of all tags over DEFAULT-SYS.
Thresholds & Parser | DEFAULT-SYS-WEIGHT-TOP | Highest scoring tag, based on the sum of probabilities of parses in which tags occur when parsing DEFAULT-SYS.
Thresholds & Parser | DEFAULT-SYS-WEIGHT-ALL | Normalised weighted count of tags over DEFAULT-SYS.
Manual | GOLD | The gold standard tagset.

Table 3.1: Tag setup descriptions and corresponding file names. The tag setup (first column) defines whether the tagger, the thresholds (applied over posterior tag probabilities output by the tagger) and/or the parser is employed in the tag model. Note that we consider the thresholds under the application of the tagger and not the parser (though in practice these thresholds are applied within the parser module). The DEFAULT-SYS-NUM- and DEFAULT-SYS-WEIGHT- tag setups are normalised based upon the number of parses and the sum of all parse probabilities, respectively.

3.4 Part-of-speech Tagging Performance

This section defines a number of tagging evaluation measures in §3.4.1. These measures are applied to contrast the alternative tag models’ performance in §3.4.2.

3.4.1 Evaluation

Standard precision and recall measures are considered, along with a number of other performance metrics which we define here. When determining the correctness of a tag against the gold tag, a few exceptions apply, as tags identified as equivalent in the grammar are considered equivalent for the following metrics. Tags considered equivalent are &FO (treated by the grammar as a name) and proper nouns, that is, tags starting with NP. Furthermore, all nouns (starting with N) and numbers (starting with M) are considered equivalent.

Mean Reciprocal Rank (MRR)

MRR is an evaluation metric that can apply if a model produces a list of possible answers ordered by probability of correctness. MRR is used in a number of information extraction and question answering (QA) tasks, and was originally defined in the TREC QA task, where the number of answers for each question in the task was set to 5 only. As several of our tag models produce tag rankings, we utilise a generalised version of the TREC QA MRR to measure how well each tag model ranks the tagset for each word. Furthermore, we allow the tagset to be the size specified in the tag dictionary. Other metrics effectively determine boolean values for correctness, that is, reflect whether the top ranked tag is correct. In contrast, the MRR attempts to exemplify systems that are able to rank the correct tag higher. Thus, with identical accuracy, two tag models’ MRR differ greatly if one is able to consistently rank the correct tag higher. Calculation of the MRR is performed using the following equation over a set of tags ki for each word i in the sentence (the set of words W), where the function rank provides the rank of a given tag in the tagset ki, with correct tag ci ∈ ki:

MRR = \frac{1}{\|W\|} \sum_{i=1}^{\|W\|} \begin{cases} \dfrac{1}{\mathrm{rank}(c_i)} & \text{if } c_i \in k_i, \\[4pt] 0 & \text{if } c_i \notin k_i. \end{cases}
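A minimal sketch of this calculation, assuming each tag model supplies a best-first ranked tagset per word:

```python
# Illustrative sketch of the MRR measure over one tagged sentence.
def mean_reciprocal_rank(ranked_tagsets, gold_tags):
    """ranked_tagsets: list of tag lists, best-first, one per word.
    gold_tags: the correct tag for each word."""
    total = 0.0
    for ranked, gold in zip(ranked_tagsets, gold_tags):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)   # rank is 1-based
        # words whose tagset misses the gold tag contribute 0
    return total / len(gold_tags)

ranked = [["PPIS2", "NP1"], ["DB2", "DB", "RR"], ["VVD", "VVN"]]
gold = ["PPIS2", "DB2", "VVD"]
print(mean_reciprocal_rank(ranked, gold))   # 1.0: correct tag ranked first for every word
```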

Sentences Affected: Sent

We report the proportion of sentences affected by tagging errors in the metric Sent, calculated as the percentage of tagged sentences containing at least one tagging error. This metric aims to illustrate the proportion of sentences that are affected by tagging errors, as these sentences are more likely to result in parsing errors.

Average Tag Cost: ATC

The average tag cost (ATC) is designed to illustrate the average distance between the tag selected and the gold-standard tag, thereby representing the predicted impact on parsing accuracy. In the CLAWS tagset, the first letter encodes the major PoS category and subsequent letters/numbers encode more minor differences. Thus, ATC is determined using the average position in which the tag names disagree. If the first letters disagree then this is assumed to be more detrimental than if the last letters or numbers disagree. Therefore, VVD is closer to VVG than to NP1. This measures whether the majority of tag errors are confusions expected to have a detrimental effect on parsing performance. The distance between two tags is calculated as the reciprocal of 2 to the power of the position in which the tag letters or numbers disagree, where the position index begins at 0. In the previous example, the distances are therefore $\frac{1}{2^2} = 0.25$ and $\frac{1}{2^0} = 1$, respectively. A few exceptions apply to the distance measure: tags identified as equivalent in the grammar (defined previously) have a distance of 0. Confusions between nouns (starting with N) and adjectives (starting with J) are considered less detrimental and thus have a distance equal to the reciprocal of 2 to the power of the position in which the tag letters or numbers disagree, plus one. That is, we define a distance of $\frac{1}{2^{0+1}} = 0.5$.
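The tag-distance component of ATC can be sketched as follows; the equivalence classes and the noun/adjective exception follow the description above, while the exact handling of tag names of different lengths is an assumption made only for illustration.

```python
# Illustrative sketch of the tag distance used by ATC (§3.4.1).
def tag_distance(tag, gold):
    """Distance = 1 / 2**(position of first disagreement), positions counted from 0."""
    def equivalent(a, b):
        name_like = lambda t: t.startswith("NP") or t == "&FO"   # proper nouns and &FO
        return a == b or (name_like(a) and name_like(b)) or (a[0] in "NM" and b[0] in "NM")
    if equivalent(tag, gold):
        return 0.0
    pos = 0
    while pos < min(len(tag), len(gold)) and tag[pos] == gold[pos]:
        pos += 1
    if {tag[0], gold[0]} == {"N", "J"}:      # noun/adjective confusions: less detrimental
        return 1.0 / 2 ** (pos + 1)
    return 1.0 / 2 ** pos

print(tag_distance("VVD", "VVG"))   # 0.25  (first disagreement at position 2)
print(tag_distance("VVD", "NP1"))   # 1.0   (first disagreement at position 0)
```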

3.4.2 Results

Table 3.2 illustrates the tagging performance of all eleven tag models measured against the GOLD tag file. The first four rows of this table illustrate the tagging performance of the system’s PoS tagger. The following three rows illustrate the performance of the parser’s top parse tag model for the three alternative multiple tpw tag files. The remaining four rows illustrate the top tag and tag ranking based on the number of derivations and on the weighted sum of derivations, respectively.

Tag Setup | Avg tpw† | Precision | Recall | MRR | ATC | Sent
SINGLE-TAG | 1 | 97.23 | 97.23 | 97.18 | 0.5757 | 40.71
DEFAULT-SYS | 1.12 | 88.50 | 98.79 | 97.94 | - | 21.79
MULT-SYS | 1.23 | 80.86 | 99.42 | 98.26 | - | 11.25
ALL-TAG | 1.51 | 65.89 | 99.78 | 98.42 | - | 4.64
DEFAULT-SYS-TOP-PARSE | 1 | 95.38 | 95.38 | 95.38 | 0.6086 | 59.11
MULT-SYS-TOP-PARSE | 1 | 94.47 | 94.47 | 94.41 | 0.6286 | 64.46
ALL-TAG-TOP-PARSE | 1 | 93.77 | 93.77 | 93.71 | 0.6496 | 69.29
DEFAULT-SYS-NUM-TOP | 1 | 92.72 | 93.86 | 93.68 | 0.6325 | 65.71
DEFAULT-SYS-NUM-ALL | 1.12 | 89.23 | 98.65 | 95.99 | - | 24.11
DEFAULT-SYS-WEIGHT-TOP | 1 | 94.67 | 95.84 | 95.66 | 0.6127 | 54.82
DEFAULT-SYS-WEIGHT-ALL | 1.12 | 89.23 | 98.65 | 97.05 | - | 24.11

Table 3.2: Tagging Performance. †The average number of tags per word.

PoS Tagger Performance

Accuracy of the tagger in single tpw mode (SINGLE-TAG) on DepBank is good, achieving precision of 97.23%. This precision is higher than might be expected on arbitrary text, as the tagger dictionary has been adapted to the WSJ (described in §3.2.2). Upper bounds on tagging performance are illustrated by the ALL-TAG results, where the only tagging errors are made by the unknown word handling module. Here, 4.64% of sentences have at least one incorrect tag, due to 0.22% of words being incorrectly tagged. The proportion of sentences for which at least one tagging error occurs varies dramatically across the four PoS tagger models (from 40.71% to 4.64%). Therefore, if the parser can integrate the multiple tagged data and cope with the additional ambiguity introduced, there is limited room for improvement in terms of tag selection.

Parser Tagging Performance

While there is some room for improvement over the PoS tagger, as shown in Table 3.2, none of the alternative parser-based tag models are able to improve on the accuracy of the single tpw output of the tagger (or on the ranking of the multiple tpw output, as shown by the MRR score). As the average number of tags per word increases, the performance of the parser’s top parse (the -TOP-PARSE files) declines (reflected in all evaluation measures). However, this will not necessarily translate to poorer parsing performance, given that some tags are closer than others (based on the distance metric described). The relatively small decline in the ATC metric, compared to the decline in the PoS tagging accuracy, suggests that a similar drop in parsing accuracy is not expected when moving from single to multiple tpw tagger output.

Highest and Weighted Count Tag Models

The tag model DEFAULT-SYS-NUM-TOP, mentioned by Dalrymple (2006), is the poorest performing tag model. However, RASP’s grammar licenses a greater number of derivations (judged over section 13 of the WSJ) than the XLE parser applied in Dalrymple (2006).3.4 The more sophisticated weighted model (DEFAULT-SYS-WEIGHT-TOP) also performs poorly and, furthermore, fails to improve on the ranking of tags (illustrated by the lower MRR score of DEFAULT-SYS-WEIGHT-ALL). If we consider DEFAULT-SYS-TOP-PARSE as the gold standard, and measure the precision of tags in DEFAULT-SYS-NUM-TOP and DEFAULT-SYS-WEIGHT-TOP, we find that 94.86% and 99.72% of tags agree, respectively. This suggests that the top ranked parse tags tend to occur frequently in the higher ranked derivations but less frequently across all derivations. These findings suggest that it is the statistical model, more so than the grammar, that is causing incorrect tag selection.

Emulating Lower PoS Tagger Performance

In order to emulate performance of the tag models over data with higher levels of unseen words, we determine the tagging performance of each model using an initial PoS tagger with artificially reduced performance over DepBank. We reduced the accuracy of the tagging models by around 2%. However, similar results are observed across the tag models and none of the alternative parser-based tag models are able to improve on the accuracy of the PoS tagger. Furthermore, we test the performance of the tag models over the GDT (see §1.3.1). Whilst RASP is not trained on the GDT, this data represents sentences for which the grammar will have a correct parse. Therefore, this data provides an approximate upper bound on how well the parser can correct over a PoS tagger. However, similar results are also observed over this data set and the initial PoS tagger outperforms all other tag models.

3.5 Parser Performance

While the alternative parser-based tag models are unable to improve on the accuracy of the tagger, this will not necessarily translate to equally detrimental parsing performance. That is, the tagging evaluation measures may not accurately reflect the impact of these models on the parser’s performance, given that the parser can recover from certain tag confusions and not others. Therefore, this section discusses the optimal tag model in terms of parser evaluation measures over DepBank. We also contrast the performance achieved using NE markup over DepBank (see §1.3.1). The NE versions of the tag models described previously are named starting with the original tag file name, followed by ‘-NE’.

3.5.1 Evaluation

We utilise the micro-average and macro-average F1 measures defined in §1.3.2 against the gold standard (NE) dependency set for DepBank. The proportion of sentences which result in a fragmentary analysis (see §2.6.4) is also reported, along with the time taken to parse the sentences using an Intel Pentium 4 3.2GHz CPU with 1GB of RAM on a 32-bit version of Linux.

3.4Dalrymple (2006) reports that over section 13 of the WSJ on average 429 derivations result, with a median of 12 derivations. In contrast, RASP finds on average 927K derivations and a median of 128 over this section.

3.5.2 Results

Table 3.3 illustrates the performance of the parser using alternative tag models as front-ends. Ten tag models are shown in the first 20 rows of the table, where each model has two corresponding lines of results. The first line of each pair illustrates the parser’s performance over all 560 test sentences. Parsing over the correctly tagged test suite (GOLD) illustrates that a 2.02% increase in F1 (6.97% relative reduction in error) results from removing the 2.77% of tag errors. The last two rows of the table illustrate the upper bounds on precision and recall (for all test sentences) when parsing the DEFAULT-SYS tag setup. That is, the precision upper bound is achieved by considering only the GRs common to all possible derivations, and the recall upper bound by considering the GRs occurring in any derivation (Carroll & Briscoe, 2002).

Tag Setup | Micro-average (Prec Rec F1) | Macro-average (Prec Rec F1) | Frag† | Time‡
SINGLE-TAG | 71.06 70.96 71.01 | 58.08 58.98 58.53 | 21.25 | 0:03:50
 | 73.66 74.94 74.30 | 62.51 63.45 62.98 | |
DEFAULT-SYS | 71.14 72.21 71.67 | 58.02 57.64 57.83 | 12.85 | 0:05:23
 | 73.09 74.71 73.89 | 61.44 60.81 61.12 | |
MULT-SYS | 70.10 71.39 70.74 | 56.26 57.59 56.92 | 10.00 | 0:18:27
 | 72.07 73.49 72.77 | 59.66 59.72 59.69 | |
ALL-TAG | 68.42 70.14 69.27 | 54.61 56.90 55.73 | 6.96 | 13:40:32
 | 70.48 72.24 71.35 | 60.39 59.00 59.69 | |
SINGLE-TAG-NE | 73.53 69.66 71.54 | 59.10 58.16 58.63 | 25.00 | 0:03:13
 | 75.66 73.33 74.48 | 62.88 62.50 62.69 | |
MULT-SYS-NE | 72.54 70.49 71.50 | 56.96 56.92 56.94 | 12.68 | 0:10:57
 | 74.09 72.62 73.34 | 62.90 59.16 60.97 | |
ALL-TAG-NE | 71.32 69.30 70.30 | 55.21 56.13 55.67 | 9.28 | 0:45:51
 | 73.01 71.29 72.14 | 61.17 58.29 59.70 | |
DEFAULT-SYS-WEIGHT-TOP | 71.08 72.21 71.64 | 58.20 57.82 58.01 | 12.85 | 0:03:42
 | 72.99 74.69 73.83 | 61.68 61.04 61.38 | |
DEFAULT-SYS-NUM-TOP | 67.95 69.11 68.52 | 53.21 54.85 54.02 | 12.85 | 0:03:13
 | 69.59 71.26 70.41 | 54.90 57.57 56.20 | |
GOLD | 72.94 73.12 73.03 | 60.53 61.04 60.78 | 14.46 | 0:04:39
 | 74.58 75.70 75.14 | 64.94 63.91 64.42 | |
HYBRID | 71.59 72.39 71.99 | 59.02 60.36 59.68 | - | -
Precision U/bound | 82.25 31.34 45.39 | 70.22 21.87 33.36 | - | 4:02:49
Recall U/bound | 17.81 87.74 29.60 | 20.11 82.26 32.32 | - |

Table 3.3: Parser Performance. †Frag represents the percentage of fragmentary parses (the percentage of sentences for which full derivations could not be found). ‡Time is shown in format hours:minutes:seconds.

Parser Performance

If we first compare the performance of the parser for the alternative tag models over all 560 test sentences, it is clear that the most efficient tag model is the tagger in single tpw mode (SINGLE-TAG). The coverage of the parser is increased (the percentage of fragmentary derivations is halved) by using the DEFAULT-SYS tag setup. Furthermore, an increase of 0.66% in micro-averaged F1 results (a relative error reduction of 2.27%), though there is also a similar decrease in macro-averaged F1. However, the macro-average is a less informative metric, as decreased performance can result if incorrect GRs occur in rarer GR types instead. The increase in parse time required to process DEFAULT-SYS illustrates that there is, as is often the case, a trade-off between accuracy and efficiency. However, the relatively small increase in accuracy will not outweigh the large decrease in efficiency for most parsing tasks. These results and conclusions agree with those of Charniak et al. (1996).

NE Markup

A 0.53% increase in F1 (1.83% relative reduction in error) results from using gold standard NE mark-up, that is, SINGLE-TAG-NE compared to SINGLE-TAG. Comparing this to the 2.02% increase in F1 achieved from the gold standard tag set suggests there is more to be gained by concentrating on tag selection than on NE recognition. However, this may not be the case over data sets for which a high number of unknown words can be marked as NEs, for example, in biological texts.

Comparison over Full Derivations

In order to compare the impact of the alternative tagging models on parser accuracy, it is also necessary to consider the accuracy for those sentences for which full derivations are found. To compare all tag models across a consistent set of sentences, we consider the sentences for which full (non-fragmentary) derivations result for SINGLE-TAG. That is, we remove the 21.25% of sentences that result in fragmentary derivations.3.5 The accuracy over this set is illustrated in the second row for each tag model in Table 3.3. The 3.29% increase in F1 resulting for the SINGLE-TAG tag setup illustrates that a large proportion of the parse errors are introduced by the fragmentary parse output. Further, the margin between the SINGLE-TAG and GOLD tag setups has narrowed to only 0.84% F1. These results illustrate that tag errors in the SINGLE-TAG file account for a large proportion of the 21.25% of fragmentary derivations. This raises an interesting question: can we rely on the grammar to find an analysis if and only if the correct tag sequence is input? A positive response to this question is expected for grammars whose rules (nonterminal categories) are well-constrained over the grammar’s terminals (PoS tags).

HYBRID Tag Model

Clark & Curran (2004a) apply a tag selection strategy whereby they assign a small number of supertags per word initially (1.4 tpw) and increase the number of supertags if the parser fails to find an analysis. This is shown to improve the efficiency, coverage and accuracy of the parser. We apply a similar ‘dynamic’ tag selection model, as we observe that single tpw input achieves high accuracy when considering full derivations only, while multiple tpw input increases the parser’s coverage. This model outperforms all others considered herein with respect to parser accuracy. However, efficiency is expected to decrease (from that over single tpw) as this tag model parses each sentence repeatedly until either a full parse is found or we have considered all possible tags for the input sequence.

Supertagging is a harder task than PoS tagging. In single tpw mode, supertaggers achieve much lower accuracy than extant PoS taggers (around 91% compared to 97.23%), suggesting that resolving the ambiguity during parsing would achieve higher accuracy overall. Moreover, multiple tpw input should be considered for supertaggers as they achieve similar accuracy to PoS taggers in this mode (99.1% compared to 99.78%). Although our dynamic tag model is similar to that of Clark & Curran (2004a), it is a fundamentally different approach due to the higher accuracy of extant PoS taggers over in-domain data. Furthermore, supertaggers are far more sensitive to domain changes, as subcategorisation correlates closely with word sense, which in turn correlates with the topic/domain.

In order to test this tag selection strategy, we combine the output from the set of sentences that result in full derivations when parsing SINGLE-TAG and the output resulting from parsing DEFAULT-SYS for the remaining sentences.3.6 The accuracy achieved is illustrated in the row with the HYBRID tag setup. This model is the only tag selection model which improves on the accuracy of the parser in terms of both macro- and micro-averaged F1 (compared over all test sentences). Furthermore, as multiple tpw input is only considered for the fraction of sentences for which a fragmentary parse occurs, the time taken to parse the sentences should also improve, ranging between the time to parse the SINGLE-TAG and DEFAULT-SYS tag setups.

3.5The proportion of fragmentary derivations is around 5% worse than reported by Briscoe & Carroll (2006) as SINGLE-TAG does not include NE mark-up and contains different tokenisation (for example, quotation marks varied significantly).

3.6 Discussion

Contrasting the alternative tag models’ performance both in terms of tagging and parser performance has supported previous findings: that single tpw input to a parser is sensible, given large speed improvements and only a small decrease in accuracy. However, if accuracy is a higher priority than efficiency, then limited tag ambiguity should be passed on to the parser, as accuracy gains are available (0.66% F1, a relative error reduction of 2.27%). Interestingly, the system’s default thresholds (optimised on Susanne) for consideration of multiple tags appear near optimal for DepBank, achieving a trade-off between increased parse ambiguity and incorrect PoS tag errors. Significant gains were also made in parser coverage, where using the parser’s default thresholds almost halves the number of fragmentary derivations (from 21.25% to 12.85%).

Not one of the tag models based on the parser’s output was able to improve on the tagging accuracy achieved by the front-end tagger. That is, the SINGLE-TAG model achieved the highest tagging accuracy. The DEFAULT-SYS-NUM-TOP tag model proved to be the worst performing model in terms of both tagging and parsing performance, although, as previously noted, this may be due to the exponential increase in parse ambiguity given increased tag ambiguity that occurs in RASP. Parsers with more constrained grammars may benefit from the use of this tag model. The parser was unable to improve on the tagging and parse accuracy achieved by the single tpw PoS tagger. These results may reflect a problem with the integration of the tag probabilities in the statistical model of the parser. We aim to investigate this issue in future work. Again, this may not translate to all grammars, as the number of derivations licensed by our grammar is much higher than that of, for example, the XLE grammar (as reported by Dalrymple 2006).
The dynamic (HYBRID) tag model was the only model to improve both macro- and micro-averaged F1. Here, the known trade-off between parse ambiguity and PoS tag error provides a means to gauge PoS tag error based on parser output. Therefore, when no derivations are found, the model assumes that a PoS tag error is the cause and increases the number of tags considered. These findings are consistent with those of Clark & Curran (2004b) regarding a dynamic tag model, though their results were based on a supertagger that achieves lower tagging accuracy in single tpw mode.

Parsing over the gold standard tag set illustrated that a 2.02% increase in F1 (6.97% relative reduction in error) results. Comparing this to the 0.53% increase in F1 (1.83% relative reduction in error) which results from using gold standard named-entity (NE) mark-up suggests there is more to be gained by concentrating on tag selection than on NE recognition. However, the conclusions drawn here and by Charniak et al. (1996) are based upon high performing PoS taggers that achieve accuracy rates in the high 90s. These conclusions are not expected to translate to parsing systems that employ taggers with lower accuracy, that is, to data with higher levels of unseen words. For example, around 20% of words are ‘unseen’ in biological texts. In such cases, the use of NE recognition, particularly over the unseen words, is expected to affect parsing performance significantly.

3.6Note that a fully dynamic tag model can be implemented in the system, though it was not due to time constraints. Instead, the output of these two tag files is combined to illustrate the model’s performance gain (given only a single iteration of the dynamic model).

Chapter 4

Efficient Extraction of Weighted GRs

The current parser’s output formats, described in §2.6.4, are each determined from the n-best list of derivations. The weighted GR output consists of the unique set of GRs across the n-best GR sets, each weighted by the sum of the probabilities of the derivations (n-best GR sets) in which it occurs. This weight is normalised (using the sum of all derivation probabilities) to fall within the range [0,1], where 1 indicates that all derivations contain the GR. Hence, the current approach unpacks the n-best derivations from the parse forest, derives the corresponding n-best GR sets and finds the unique set of GRs and corresponding weights. Carroll & Briscoe (2002) illustrate that increasing the size of the n-best list improves the upper bound on precision/recall in the high precision/recall GR sets, determined by thresholding over the GR weights. Therefore, if practicable, it is preferable to include all possible derivations when calculating weighted GRs. Hence, the extant approach is either (i) inefficient (and for some examples impracticable) if a large number of derivations are licensed by the grammar, or (ii) inaccurate if the number of derivations unpacked (the size of the n-best list) is less than the number licensed by the grammar.

In this chapter we present a novel approach, the EWG algorithm, enabling weighted GRs to be determined directly from the packed parse forest produced by RASP. This approach is a dynamic programming variant of the Inside-Outside algorithm (IOA), which is ideal for this task, enabling exact computation of the GR weights using the inside and outside probabilities determined for nodes in our parse forest. We describe the IOA over PCFG parse forests in §4.1. Following this, we extend the IOA to apply over LR parse forests, referred to as the IOALR. Using the IOALR, we can determine, for each node in the parse forest, the probability of all derivations that include that node. This probability represents the probability of all n-best GR sets that contain the corresponding GR output for the node. Summing over such probabilities for each (unique) GR output (across all nodes in the parse forest) provides the non-normalised weight for the GR. Furthermore, the IOALR determines the sum of all derivations in the parse forest, which is the normalising factor.

We describe the application of the IOALR to determine weighted GR output directly from RASP’s parse forest in §4.2. A single iteration of the IOALR is applied to determine the inside and outside probabilities over the parse forest. These probabilities are applied to calculate weights for the GRs, which we determine in parallel. Consequently, this approach is referred to as IOALR(1).
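Before turning to EWG, the extant n-best computation just described can be sketched as follows; the derivation probabilities and GR strings below are schematic placeholders, not RASP output.

```python
# Illustrative sketch of the extant weighted-GR computation from an n-best list.
# Each derivation is (probability, set of GRs); the weight of a GR is the
# normalised sum of the probabilities of the derivations containing it.
from collections import defaultdict

def weighted_grs(nbest):
    total = sum(prob for prob, _ in nbest)
    weights = defaultdict(float)
    for prob, grs in nbest:
        for gr in grs:
            weights[gr] += prob
    return {gr: w / total for gr, w in weights.items()}

nbest = [
    (0.6, {"ncsubj(saw, I)", "dobj(saw, man)", "ncmod(saw, park)"}),
    (0.4, {"ncsubj(saw, I)", "dobj(saw, man)", "ncmod(man, park)"}),
]
print(weighted_grs(nbest))
# GRs in every derivation get weight 1.0; the two alternative ncmod GRs get 0.6 and 0.4
```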


However, this solution applies to the extant parser only if (i) the parse forest contains only derivations licensed by the grammar, that is, no derivations fail the final unification check we perform over the n-best list of derivations (see §2.6.4). In addition, the solution requires that (ii) each node in the parse forest is assigned a single semantic (lexical) head. This allows us to consider the product of inside and outside probabilities for each node as the corresponding probability of the node’s lexical head. EWG, defined in §4.3, ensures that (i) holds for all input by altering the local ambiguity packing operation. EWG does not ensure that (ii) holds, as related work has; instead, we modify the IOALR(1) so that each node in the parse forest can be assigned multiple inside and outside probabilities, one for each possible lexical head. Consequently, EWG improves on previous work which either loses efficiency by unpacking the parse forest before extracting weighted GRs, or places extra constraints on which nodes can be packed to ensure that each node specifies a single lexical head, leading to less compact parse forests.

Our experiments, described in §4.4, demonstrate substantial increases in parser accuracy and throughput for weighted GR output. Finally, in §4.5, we apply a parse selection strategy defined by Clark & Curran (2004b) that utilises the EWG algorithm and achieve a 3.01% relative reduction in error. Furthermore, the GR set output by this approach is a consistent set. In contrast, the high precision GR sets defined in Carroll & Briscoe (2002) are neither consistent nor coherent.4.1 Much of the work we describe in this chapter appears in Watson et al. (2005).

4.1 Inside-Outside Algorithm (IOA)

We first describe the development and properties of the IOA in §4.1.1. We then define the IOA and its standard (iterative) application to PCFG training in §4.1.2. Finally, in §4.1.3 we illustrate the simple extension to apply the IOA to the parse forest derived from an LR parser and to the extant parser’s forest.

4.1.1 Background

The Inside-Outside algorithm (IOA) was introduced by Baker (1979), as a generalisation of the HMM’s Baum-Welch estimation methods (see §3.1.3), to enable re-estimation of parameters in PCFGs over raw (unannotated) text. Inside and outside probabilities are analogous to the forward and backward probabilities of the Baum-Welch algorithm. The IOA was reviewed by Lari & Young (1990), who extended it to an iterative training method for PCFGs that allows the grammar to be inferred from a training corpus of unrestricted size.

Relationship to EM

The IOA is a variant of the Expectation-Maximisation (EM) algorithm, in which the basic assumption is that a ‘good’ grammar is one for which training sentences are likely to occur. That is, the IOA, as described by Lari & Young (1990), is used to find the MLE over training corpora that are not annotated. Prescher (2001) formally proves that inside-outside estimation is a dynamic-programming variant of EM, and therefore inherits the good convergence behaviour of EM. That is, the IOA converges on a set of parameter estimates that maximise the probability of the training corpus.

Local Maxima and Convergence Patterns

The IOA is not without problems. For example, Charniak (1994) illustrates that the algorithm converges to different local maxima given different initial estimates (randomly selected) for model parameters.

4.1A consistent set of GRs is one in which each word is the dependent of only one other word, and a complete set is one in which every word is listed as the dependent of another word.

Further, Elworthy (1994) found three general patterns for the related Baum-Welch algorithm over HMM PoS taggers, as described previously in §3.1.3, in which the algorithm did not necessarily improve given a ‘good’ initial model.

Discussion

We illustrate IOA (EM) training over the extant LR parser in the next chapter. However, the convergence properties of the IOA are not of concern in this chapter, as EWG simply applies a single iteration of a variant of the IOA to determine the probability of all derivations containing each GR in the parse forest. That is, we utilise the dynamic programming properties of the algorithm to efficiently compute weights of elements in the data structure. Several parse selection and training strategies employ similar dynamic programming approaches, which are necessary for efficient parser training and application, as unpacking the entire parse forest is unfeasible for broad coverage natural language parsers. For example, Miyao & Tsujii (2002) describe a variant of the IOA for training a log-linear parsing model from packed feature forests.

4.1.2 The Standard Algorithm

Much of the theory described in this section, regarding the ‘standard’ IOA application to PCFG training, has been adapted from Lari & Young (1990). We discuss the application over a PCFG parse forest, that is, a parse forest in which nodes result from applying a rule of the grammar, and hence each node represents the mother category i of the rule only. Each subanalysis for a node Ni results from the application of a rule of the grammar of form i → jk or i → j. That is, we assume that the PCFG is in Chomsky Normal Form (CNF).

Overview

We previously described forward and backward probabilities in an HMM (which may be considered a generalised NFA) in §3.1.3. The probability of a state in the HMM is considered the product of the forward and backward probabilities for the state, that is, the total probability of all complete paths to and from the state, respectively. Similarly, we consider the probability of nodes in a PCFG parse forest as the product of the inside and outside probabilities (the IO probability) for the node Ni. This probability corresponds to the sum of derivation probabilities over all derivations that contain the NT category i over the node’s word span. For example, for the grammar rule NP/det n over the input the man, the corresponding node’s IO probability is equal to the probability of all derivations which include the NP/det n category over this subset of the input.

We consider the sum of all such IO probabilities for each NT category i, for each possible word span for nodes Ni in each sentence in the training corpus, to represent the expected frequency of the nonterminal. Thus we effectively re-estimate the probability of the rule using relative frequency counts, that is, for production i → jk, the probability of the rule is determined using:

P(i \to jk) = \frac{freq(i \to jk)}{\sum_{j,k} freq(i \to jk)} \qquad (4.1)

Thus the IOA proceeds by iteratively re-estimating the probability for each rule in the grammar, where each iteration involves determining such relative frequency counts over the entire training corpus. This process converges the probabilities of each rule so that the probability of the training corpus is maximised.
Convergence occurs in practice, as rule rewrites that occur more frequently for each NT category are at each iteration assigned a higher proportion of the probability mass for the category.


Figure 4.1: The inside (e) and outside (f) regions for node Ni.

Thus, the set of sentences that contain these rules each have a higher probability, so that the total probability of these sentences in the corpus increases. However, to maximise the probability of the entire training corpus, we must ensure that low frequency rules are still assigned some of the probability mass. The IOA converges on a maximum that assigns more weight to frequently occurring rules yet still determines all training sentences to occur with ‘high’ probability.

The inside and outside probabilities for nodes are defined herein, and are applied to determine these expected frequency counts for each rule. We also define the re-estimation equations that effectively apply Equation 4.1, so that we iteratively converge on a (locally) maximal solution, that is, to perform EM over a PCFG.

Inside Probability

We defined a CFG grammar G, in §2.1, as a tuple {NT,Σ,P,R}. The NT and Σ elements represent the set of nonterminal and terminal symbols of the grammar, respectively. The element P represents the set of productions (rules), while R represents the nonterminal category that is considered the top grammar category. The probability of a subanalysis is calculated as the probability of the rule applied (that created the subanalysis), multiplied by the product over all daughter node probabilities. The inside probability of a node represents the probability of all subanalyses for the node, calculated by summing over their corresponding probabilities. Conversely, the outside probability represents the probability of all analyses which include one of the node’s subanalyses. Given an input sequence of terminals of the grammar {a1,...,aT}, we denote the inside and outside probabilities for a node Ni, that spans input items as to at inclusively, as e(s,t,Ni) and f(s,t,Ni), respectively. Figure 4.1 illustrates the corresponding nodes in the parse forest used when calculating the inside and outside probabilities for Ni. Nonleaf nodes in the figure represent NT categories, and Nr is the root node whose category r is in the set R. Leaf nodes represent Σ categories of the grammar, that is, the input sequence {a1,...,aT}.

The inside probability e(s,t,Ni) represents the probability of subanalyses that are rooted with mother category i for this sentence over the word span s to t. Each production is of the form i → jk, where the daughter nodes Nj and Nk span from as to ar and ar+1 to at, respectively. Figure 4.2 illustrates this structure for node Ni. The inside probability e(s,t,Ni) for a given sentence is calculated using:

e(s,t,N_i) = \sum_{j,k}\Big[ P(i \to jk) \sum_{r=s}^{t-1} e(s,r,N_j)\, e(r+1,t,N_k) \Big] \qquad (4.2)

We calculate the inside probabilities bottom-up.


Figure 4.2: Calculation of inside probabilities for node Ni.

For a word node, the inside probability is either equal to the PoS tag probability (if PoS tag terminals have associated posterior tag probabilities) or considered to be 1. That is, for the input item as for word number s, e(s,s,N_{a_s}) = 1. As a result, over a single derivation, the inside probability of each node corresponds to the product of all CFG rules that are applied to create the subanalysis. That is, the top category’s inside probability is the derivation’s probability. Over the parse forest, the inside probability of the root node (that is, the probability for sentence q, P_q = e(1,T_q,N_{R_q})) corresponds to the summation over the probabilities of all derivations for the sentence.
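As a sketch of Equation 4.2, the following computes inside probabilities bottom-up for a toy CNF PCFG; the rule encoding and the treatment of lexical probabilities are assumptions made purely for illustration.

```python
# Illustrative sketch of Equation 4.2: inside probabilities for a CNF PCFG,
# computed bottom-up over spans (s, t) of the input (CKY-style).
from collections import defaultdict

def inside_probabilities(words, lexical_rules, binary_rules):
    """lexical_rules: {(category, word): prob}; binary_rules: {(i, j, k): prob}."""
    T = len(words)
    e = defaultdict(float)                       # e[(s, t, category)]
    for s, word in enumerate(words):             # word nodes: span (s, s)
        for (cat, w), prob in lexical_rules.items():
            if w == word:
                e[(s, s, cat)] += prob
    for span in range(1, T):                     # longer spans, shortest first
        for s in range(T - span):
            t = s + span
            for (i, j, k), prob in binary_rules.items():
                for r in range(s, t):            # split point, as in Equation 4.2
                    e[(s, t, i)] += prob * e[(s, r, j)] * e[(r + 1, t, k)]
    return e

lex = {("Det", "the"): 1.0, ("N", "dog"): 1.0, ("V", "barks"): 1.0}
rules = {("NP", "Det", "N"): 1.0, ("S", "NP", "V"): 1.0}
e = inside_probabilities(["the", "dog", "barks"], lex, rules)
print(e[(0, 2, "S")])   # probability of the sentence: 1.0 for this trivial grammar
```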

Outside Probability

The outside probability for a node Ni, as shown in Figure 4.1, is calculated using all the nodes for which the node is a daughter (subanalysis). This calculation includes the inside probability of the other daughter nodes of which Ni is a member. That is, category i could appear in two settings: j → ik or j → ki, as shown in Figure 4.3.


Figure 4.3: Calculation of outside probabilities for node Ni.

The outside probability of Ni, f (s,t,Ni), represents the probability of all possible analyses that include the node Ni and span from a1 to as−1 and at+1 to aT . We calculate the outside probability of Ni, using the outside probability of the mother node (Nj) multiplied by the product of inside probabilities of the daughters other than Ni i.e. Nk. We perform a summation over this probability in each instance where Ni is a daughter of a node, i.e. for each possible Nj and Nk. The outside probability f (s,t,Ni) for a given sentence is calculated using:

f(s,t,N_i) = \sum_{j,k}\Big[ P(j \to ik) \sum_{r=t+1}^{T} f(s,r,N_j)\, e(t+1,r,N_k) \Big] + \sum_{j,k}\Big[ P(j \to ki) \sum_{r=1}^{s-1} f(r,t,N_j)\, e(r,s-1,N_k) \Big] \qquad (4.3)

We calculate the outside probabilities top-down, assigning the outside probability of the root node Nr to be 1.
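Continuing the toy example, a sketch of Equation 4.3. Outside probabilities are computed here by distributing each mother node's outside probability down to its daughters, which yields the same values as the summation over mothers in Equation 4.3; the inside table e is assumed to come from the previous sketch.

```python
# Illustrative sketch of Equation 4.3: outside probabilities, computed top-down
# from an inside table e keyed by (s, t, category).
from collections import defaultdict

def outside_probabilities(T, root, binary_rules, e):
    f = defaultdict(float)
    f[(0, T - 1, root)] = 1.0                      # the root's outside probability is 1
    for span in range(T - 1, 0, -1):               # widest spans first
        for s in range(T - span):
            t = s + span
            for (j, i, k), prob in binary_rules.items():
                for r in range(s, t):              # mother j over (s, t), split at r
                    f[(s, r, i)] += f[(s, t, j)] * prob * e[(r + 1, t, k)]
                    f[(r + 1, t, k)] += f[(s, t, j)] * prob * e[(s, r, i)]
    return f

# Inside values for 'the dog barks' from the previous sketch:
e = defaultdict(float, {(0, 0, "Det"): 1.0, (1, 1, "N"): 1.0, (2, 2, "V"): 1.0,
                        (0, 1, "NP"): 1.0, (0, 2, "S"): 1.0})
rules = {("NP", "Det", "N"): 1.0, ("S", "NP", "V"): 1.0}
f = outside_probabilities(3, "S", rules, e)
print(f[(1, 1, "N")])   # 1.0: outside probability of N over 'dog'
```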

Re-estimation Equations

We rearrange the standard inside Equation 4.2, splitting the equation into two separate equations. The first equation specifies the rule used, that is, the particular NT categories for j and k. We perform the summation over each possible set of categories for j and k in the second equation:

e(s,t,N_i,N_j,N_k) = P(i \to jk) \sum_{r=s}^{t-1} e(s,r,N_j)\, e(r+1,t,N_k) \qquad (4.4)

e(s,t,N_i) = \sum_{j,k} e(s,t,N_i,N_j,N_k) \qquad (4.5)

The product of the inside and outside probabilities e(s,t,Ni) and f(s,t,Ni) represents the sum of all probabilities for derivations in which the mother category i is utilised over the span as to at. If we sum over all possible values for s and t for the sentence, then we determine the sum of all probabilities for derivations in which the mother category i appears. If we normalise by the probability of all derivations (Pq) then we determine the proportion of the probability mass within the parse forest for sentence q that utilises the mother category i. Within the IOA, this quantity falls in the range [0,1], and is effectively considered the frequency of the mother category i for the sentence:

freq_q(N_i) = \frac{1}{P_q} \sum_{s,t} e_q(s,t,N_i)\, f_q(s,t,N_i)

Similarly, the product of the inside and outside probabilities, e(s,t,Ni,Nj,Nk) and f (s,t,Ni), represents the probability of all derivations in which the rule i → jk is utilised over the span as to at. Again, if we sum over the values for s and t, and normalise using Pq, then we determine the weighted proportion of derivations in which the rule i → jk occurs:

freq_q(N_i,N_j,N_k) = \frac{1}{P_q} \sum_{s,t} e_q(s,t,N_i,N_j,N_k)\, f_q(s,t,N_i) \qquad (4.6)

Dividing this proportion by the proportion determined for mother category i calculates the relative weighted frequency (i.e. probability) of expanding i using the rule i → jk for sentence q:

P_q(N_i,N_j,N_k) = \frac{freq_q(N_i,N_j,N_k)}{freq_q(N_i)} \qquad (4.7)

Extending this equation to apply over all sentences for a corpus simply involves summing over each sentence q in corpus Q. The following equation represents the relative frequency count Equation 4.1, and is the re-estimation equation applied during each IOA iteration to determine the new probability of each rule i → jk of the grammar:

P(i \to jk) = P(N_i,N_j,N_k) = \frac{\sum_{q \in Q} freq_q(N_i,N_j,N_k)}{\sum_{q \in Q} freq_q(N_i)} \qquad (4.8)
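Schematically, one re-estimation step (Equation 4.8) over frequencies already accumulated by Equations 4.6 and 4.7 might look as follows; this is a sketch, not the parser's training code.

```python
# Illustrative sketch of one IOA re-estimation step (Equation 4.8):
# rule_freq[(i, j, k)] and cat_freq[i] hold the frequencies freq_q summed over
# all sentences q in the corpus Q.
from collections import defaultdict

def reestimate(rule_freq, cat_freq):
    return {(i, j, k): f / cat_freq[i]
            for (i, j, k), f in rule_freq.items() if cat_freq[i] > 0}

rule_freq = defaultdict(float, {("NP", "Det", "N"): 7.2, ("NP", "NP", "PP"): 1.8})
cat_freq = defaultdict(float, {"NP": 9.0})
print(reestimate(rule_freq, cat_freq))   # {('NP','Det','N'): 0.8, ('NP','NP','PP'): 0.2}
```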

4.1.3 Extension to LR Parsers

We described the IOA for a PCFG parse forest. However, each node in a PCFG parse forest corresponds to a particular mother category of the grammar. As described in §2.3.6, an LR parser encodes additional context over the underlying CFG. In view of this additional context, we apply the re-estimation equations for each action in the LR table, rather than for each CFG production. Therefore this is not a ‘standard’ IOA implementation and, consequently, it is referred to as the IOALR instead.

Calculating Inside and Outside Probabilities

The extension of the IOA to LR parsers is relatively trivial, as we determine the inside and outside probabilities of each node in an LR parse forest just as we determine them for each node in the PCFG parse forest. Each frequency calculation now relates to the parse action in the LR table, rather than to the CFG rule applied. That is, we utilise the corresponding probability of the action for node Ni, P(a[i]), rather than that of the corresponding rule P(i → jk). The product of the inside and outside probabilities (the IO probability) for each action corresponds to the sum of probabilities for all derivations in the LR parse forest that result due to the LR parse action. For a word node, the (inside) probability is equal to the probability of the shift action that results in creation of the word node and (optionally) the posterior PoS tag probability.

Action Counts and Normalisation

In the LR parse forest, each node Ni represents an LR state that spans a subset of the input sequence. The set of actions for Ni that result in a subanalysis for Ni (and in moving to the LR state) is considered, rather than the set of rules applied. These actions are usually defined in different cells of the LR table, as each subanalysis results from a reduce action over different daughter nodes (representing varying left-context, thus states, in the LR parser). We determine the frequency of each action using the normalised IO probability of the action. That is, we apply Equation 4.6 (as described in the following section). However, we do not apply the normalisation factor in Equation 4.7. Instead, we normalise based on the set of competing actions for the current state and/or lookahead item, rather than the node’s mother category or LR state. For each set of competing actions in the LR table, we normalise each action in the set by the sum over these action frequencies. Therefore, the frequency of each action calculated using Equation 4.6 is considered the action count for that action, and we normalise over these counts as we would otherwise over a fully annotated corpus for an LR parser (see the normalisation methods defined in §2.5.2).

Extension to the Extant Parser

In addition to the minor modifications made to the algorithm to apply over LR parse actions (rather than PCFG rules), there are a number of practical differences in the application of the algorithm described to the parse forest data structure built by the extant parser. Nodes in the parse forest are efficiently represented using sub-tree sharing and packing (see §2.4.4). In keeping with previous notation, a set of nodes in the parse forest will be referred to using uppercase N. The representative node for the set Ni is the node ni, where the number i uniquely identifies a node (set) in the parse forest. Each node nc ∈ Ni represents a particular action, that is, a single subanalysis for the node set Ni. The representative node ni may contain a set of packed nodes np, where np = PACKED(ni). Thus, Ni = {ni} ∪ PACKED(ni). For example, see Figure 2.14, which illustrates a parse forest produced by the extant parser. There are 26 individual nodes, two of which are packed in another, resulting in 24 node sets. The node n4 for V1/v np pp represents the fourth node set N4, which includes the node itself and the packed nodes n17 (V1/vp pp) and n22 (V1/v np).

Each node nc ∈ Ni has a set span (set values for s and t), and corresponds to a single parse action. Similarly, the nodes’ daughter node sets DNl ∈ D(nc) do not vary, that is, each has a set span and mother category. Further, our grammar is not in CNF and more than one daughter may be specified for a grammar rule. As a result, in Equation 4.4, the values j, k and r are effectively set. We determine the inside probability of the node using a simplified version of this equation:

e(n_c) = P(a[c]) \prod_{DN_l \in D(n_c)} e(DN_l) \qquad (4.9)

Note that we no longer specify the span of the node in this equation, as this is already set for each node in the parse forest and each node nc represents a specific data structure in the parse forest. We use a[c] to represent the action for node nc that applies a rule of form A → α, where the set of daughter categories α corresponds to the set of categories in each of the daughter node sets D(nc). As packed nodes represent alternative subanalyses for the node, we apply the summation for the inside equation. That is, we apply Equation 4.5 over the set of nodes in Ni:

e(N_i) = \sum_{n_c \in N_i} e(n_c)        (4.10)

We store the inside probability for each node nc itself and utilise this probability to calculate the node's IO probability; however, we return the inside probability of the node ni as that of Ni. Simplifying the outside equations, given the set data structure of the parse forest, is performed in a similar fashion. We simplify Equation 4.3, as the value r is fixed, as are the daughters of each mother node nj of Ni, to:

\forall n_c \in N_i, \quad f(n_c) = f(N_i) = \sum_{n_j : N_i \in D(n_j)} f(n_j) \, P(a[j]) \prod_{N_k \in D(n_j), N_k \neq N_i} e(N_k)        (4.11)

Different nodes in the parse forest can result from the same parse action. To determine the frequency count for each action ad for sentence q, we sum over the IO probability for each node that results due to application of the action. That is, we simplify Equation 4.6 to:

freq_q(a_d) = \frac{1}{P_q} \sum_{n_c : a[c] = a_d} e_q(n_c) \, f_q(n_c)        (4.12)

Again, we determine this frequency for each sentence q in the training corpus Q. However, we determine the numerator of Equation 4.8 only:

freq(a_d) = \sum_{q \in Q} freq_q(a_d)

The denominator for Equation 4.8 is instead calculated by summing over the frequency counts for competing actions in the LR table. That is, we normalise the frequency counts for actions using the extant normalisation method described in §2.6.2.
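To make these simplified equations concrete, the sketch below (in Python; node and field names such as nodes, daughters, action_logprob and action_id are illustrative interfaces onto a packed forest, not the extant parser's representation) performs one inside–outside pass over packed node sets and accumulates per-action frequency counts as in Equations 4.9–4.12. Probabilities are base-10 logs, so the products of Equations 4.9 and 4.11 become sums, while the true summations of Equations 4.10 and 4.12 require a log-sum.

    import math
    from collections import defaultdict

    def logsumexp10(logs):
        """Sum real probabilities given as base-10 logs (Equations 4.10 and 4.12)."""
        m = max(logs)
        return m + math.log10(sum(10 ** (x - m) for x in logs))

    def inside(node_set, memo):
        """Equation 4.10: e(N_i) sums e(n_c) over the representative and packed nodes."""
        if id(node_set) in memo:
            return memo[id(node_set)]
        per_node = []
        for n in node_set.nodes:                     # n_i together with PACKED(n_i)
            e = n.action_logprob                     # log10 P(a[c]), the shift/reduce score
            for d in n.daughters:                    # Equation 4.9: product over daughter node sets
                e += inside(d, memo)
            n.inside = e
            per_node.append(e)
        memo[id(node_set)] = logsumexp10(per_node)
        return memo[id(node_set)]

    def action_frequencies(roots, freqs):
        """Accumulate freq_q(a_d) (Equation 4.12) for one sentence q into `freqs`."""
        memo = {}
        P_q = logsumexp10([inside(r, memo) for r in roots])       # probability of all derivations
        agenda = [(r, 0.0) for r in roots]                        # outside(root) = log10(1) = 0
        while agenda:
            node_set, f = agenda.pop(0)
            for n in node_set.nodes:
                freqs[n.action_id] += 10 ** (n.inside + f - P_q)  # normalised IO contribution
                for d in n.daughters:                             # Equation 4.11 contribution to d
                    sisters = sum(memo[id(k)] for k in n.daughters if k is not d)
                    agenda.append((d, f + n.action_logprob + sisters))
        # NB: for clarity this pushes one outside contribution per mother; a production
        # implementation would merge contributions per node set before descending (cf. §4.3.2).
        return freqs

    # Usage (per sentence): freqs = defaultdict(float); action_frequencies(parse_forest_roots, freqs)

The resulting counts are then normalised over the competing actions of each LR table cell, as described above, rather than by Equation 4.7.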

4.2 Extracting Grammatical Relations

In this section, we show how to obviate the need to trade off efficiency and accuracy by extracting weighted GRs directly from the parse forest using a dynamic programming variant of the IOALR: the IOALR(1). This approach, extended in §4.3 to handle all parse forests output by the extant parser, enables efficient calculation of weighted GRs over all derivations and substantially improves the throughput and memory usage of the parser. Our approach removes the intermediate processing stages that unpack the n-best derivations and determine the corresponding set of GRs for each derivation. Instead, we show how to extract the weighted GR output directly from the parse forest. We assume for now that a single lexical head is specified per node. In this case, we consider the IO probability of the node as the IO probability of the node's GR specification. This probability forms part of the non-normalised score for the GR. In §4.2.2, we describe the IOALR(1), which determines weighted GRs directly from the extant parse forest. We apply Equation 4.12, though over the unique GR gd rather than over the action ad for each node. Here, freq_q(g_d) is the final weight for unique GR gd in the weighted GR output, where g[c] represents the GR output for node nc:

freq_q(g_d) = \frac{1}{P_q} \sum_{n_c : g[c] = g_d} e_q(n_c) \, f_q(n_c)        (4.13)

We continue the example over the sentence I saw a man in the park, the parse forest for which is shown in Figure 2.14. The probability of shift/reduce actions and instantiated GR specifications for each node are shown in Table 2.9. Figure 2.16 illustrates the corresponding n-best GR sets and weighted GR output formats for this example. In practice, more than one lexical head can result for each node in the parse forest. We illustrate this by example in §4.2.3, artificially modifying the GR specifications in our continued example so that more than one lexical head results for a node in the parse forest. We illustrate that the IOALR(1) is unable to extract the corresponding weighted GR output from the resulting parse forest. Since the parser is unification-based, we first discuss, in §4.2.1, modifications to the parsing algorithm so that local ambiguity packing is based on feature structure equality rather than subsumption. This ensures that only derivations licensed by the grammar are represented within the parse forest.

4.2.1 Modification to Local Ambiguity Packing

We described the extant parser's packing method in §2.6.3, where packing is based on feature structure subsumption. In this section, we discuss the problem with this packing definition in relation to applying dynamic-programming approaches to the parse forest. We also describe how to modify the packing operation to enable weighted GRs to be extracted directly from the parse forest.

Global Consistency for Subsumption-based Packing

Oepen & Carroll (2000) note that when using subsumption-based packing with a unification-based grammar, the parse forest may implicitly represent some derivations that are not actually licensed by the grammar. These derivations have values for one or more features that are locally but not globally consistent. This is not a problem when computing GRs from derivations that have already been unpacked. In this case, the relevant unifications are checked during the unpacking process, and unification failure causes the affected derivations to be filtered out. Unification fails for at least one packed tree in approximately 10% of the sentences in the DepBank test suite (see §1.3.1). However, such inconsistent derivations are a problem for any approach to probability computation over the parse forest that is based on the IOA. For EWG, we therefore modify the parsing algorithm so that packing is based on feature structure equality rather than subsumption.

Packing Operations

Oepen and Carroll give definitions and implementation details for subsumption and equality operations, which we adopt. In the experiments below, we refer to versions of the extant parser with subsumption and equality based packing as SUB-PACKING and EQ-PACKING respectively. When packing based on subsumption, node nB is packed into node nA based on the truth value returned by the following definition of subsumption:

Given two nodes nA and nB with (unification-based) feature lists ΘA and ΘB, then nA subsumes nB if and only if:

1. the category type of A and B, specified by the first elements in ΘA and ΘB, are equal; and

2. each remaining corresponding element of ΘA and ΘB, θA and θB respectively, conforms to one of the following:

(a) θA and θB both contain values and these values are equal; or
(b) θA is unspecified
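A minimal sketch of this subsumption test, and of the equality variant noted just below, assuming (purely for illustration) that each node's feature list is held as a Python list whose first element is the category type and in which unspecified features are represented by None:

    def subsumes(theta_a, theta_b):
        """n_A subsumes n_B: clause 1 compares category types; clause 2 requires each remaining
        feature of A either to equal B's value (2a) or to be unspecified (2b)."""
        if theta_a[0] != theta_b[0]:
            return False
        return all(a is None or a == b for a, b in zip(theta_a[1:], theta_b[1:]))

    def equal(theta_a, theta_b):
        """Equality: identical, except that clause 2b requires *both* features to be unspecified."""
        if theta_a[0] != theta_b[0]:
            return False
        return all((a is None and b is None) or (a is not None and a == b)
                   for a, b in zip(theta_a[1:], theta_b[1:]))

Under SUB-PACKING, nB is packed into nA when subsumes(ΘA, ΘB) holds; under EQ-PACKING, only when equal(ΘA, ΘB) holds.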

The definition of equality between nodes is similar to the above definition except for a minor change to part 2b, requiring both θA and θB to be unspecified.

4.2.2 Extracting Grammatical Relations

Unfilled GR Specifications

The extant parser's grammar, described in §2.6.1, encodes a set of possible GR specifications for each rule of the grammar. The particular GR specification that applies for the resulting node nc in the parse forest may depend on the feature values of the node's daughters D(nc). That is, the grammar defines a rule-to-rule mapping from local trees to GR specifications that are optionally instantiated during parse time, so that one GR specification, of the form <head, GR>, is specified for each node nc in the parse forest. The head of the GR specification defines the semantic head (head, henceforth) for this node, while the GR is the GR output for the node. The GR specifications for each node in our example sentence are shown in Table 2.9. These are unfilled (see §2.6.4), as the GR slots are specified using the number of the daughter whose head should be used. Node 13 for NP/det n spans the subset of the input the park and has the unfilled GR specification <2, (det 2 1)>. This GR specification defines that the second daughter's head is the node's head, and that the GR output by the node is a det(erminer) whose head and dependent slots are to be filled by the heads of the second and first daughters, respectively (the 2 and 1 arguments).

Filled GR Specifications

GR specifications are referred to as unfilled until the slots containing numbers are filled with the corresponding head of each daughter node. We consider the head of each word node to be the word itself. For example, the resulting filled GR specification for node 13 is <park, (det park the)>, i.e. the head of the node is park and the GR output is (det park the). Over a single derivation, the corresponding set of GRs is computed from the instantiated GR specifications at each node in the derivation, passing the head determined for each node (for the filled GR specification) upwards from daughter to mother nodes in the derivation. For example, the corresponding n-best GR sets for each derivation output for our example sentence are shown in Figure 2.16.

Processing Stages

Three processing stages are required to determine weighted GRs over the parse forest, that is, to apply the IOALR(1). We calculate (1) filled GR specifications and corresponding inside probabilities, (2) outside (and non-normalised) probabilities of a unique GR set, and (3) normalised probabilities (that is, the final weights) of the weighted GR set. Note that the inside and outside probabilities in the first two stages are effectively determined using a single iteration of the IOALR, though we do not update our model's parameters. The first two processing stages are covered in detail in the following sections. The final stage normalises the probabilities of the unique GR set by dividing each weight by the sum of all the derivation probabilities, that is, by dividing by the root node's inside probability (Pq for sentence q). Stage (3) can instead be performed during stage (2), so that we incrementally determine the normalised probabilities of each unique GR, as we determine this normalising factor during stage (1).

Inside Probability and GR Determination

We now describe how to determine the inside probabilities over the node sets Ni in the parse forest. We perform a depth-first search, filling GR specifications from word nodes to the root node and returning the head and inside probability for each node in the extant parse forest, as described in §4.1.3. We determine the inside probability for each node nc and node set Ni, that is, using Equation 4.9 and Equation 4.10 to calculate the inside probabilities e(nc) and e(Ni), respectively. We fill the GR specification for the node nc using each head for the daughter node sets. That is, we use the head for DNl ∈ D(nc) to fill the GR slot if the number l is specified in the unfilled GR. We restrict the set of parse forests considered to those in which each node nc ∈ Ni specifies the same lexical head. For each node nc we determine a tuple {head(nc), e(nc)} that represents the head and inside probability of the node. For each node set Ni, the tuple returned is {head(ni), e(Ni)}, where ni is the representative node for the set Ni. We previously defined the head of a word node to be the word itself. Furthermore, we defined the inside probability of a word node to be the probability of the shift action which created the word node. Therefore, it is trivial to determine the head and inside probability for each node in the parse forest.

Example Sentence

For our example sentence, Table 4.1 illustrates the inside probability and filled GR specifications for each node in the parse forest shown in Figure 2.14. This table builds on Table 2.9, which illustrates the shift/reduce action probability and the corresponding instantiated (though unfilled) GR specifications.4.2 The filled GR specifications contain GRs that can be seen in the n-best GR sets and weighted GR output formats shown in Figure 2.16. For n4 (V1/v np pp), we illustrate a pair of probabilities. The first is the inside probability of the node e(n4) and the second is for the node set e(N4). All three nodes of the set N4 specify the same head see+ed VVD. Thus, we pass up the pair {see+ed VVD, −6.6284} for this node set.

Outside Probability and Weighted GR Determination

Once we determine the inside probabilities and fill each GR specification for each node in the parse forest, we traverse the parse forest top-down to determine outside probabilities as discussed in §4.1.3. Table 4.2 illustrates the outside probability for each node (and node set) in the continued example. We sum over each possible outside probability (for each subanalysis in which the node is a daughter node), i.e. for each node nc (row) of the table. This summation is shown for each mother node nj of the node in Equation 4.11. We consider the outside probability f(nc), ∀nc ∈ Ni, as the outside probability of the node set Ni, as shown in this equation. For example, we show a pair of probabilities for several columns of n4, one for the node itself and one for N4. The outside probability of N4 is considered the outside probability for each node in the set, that is, n4, n17 and n22. The IO probability of each node is the IO probability for the corresponding GR output by the node. This probability corresponds to (a portion of) the sum of all derivation probabilities that include the GR, that is, the non-normalised probability of the GR. In the case where the same GR occurs in different nodes of the parse forest, we sum over the IO for each of these nodes. That is, we apply the summation in Equation 4.13.
To normalise the GR and determine freq_q(g_d) as shown in this equation (the final weight of the GR), we divide by the inside probability of the root node. In our example, we show the normalised IO probability for each node (instance of a GR) and then sum over these normalised probabilities instead. For example, the GR (det man NN1 the AT) results for nodes 6 and 23; summing over their (normalised IO) probabilities, we weight this GR with value 1. In practice, we store a hash table indexed against each GR. As we determine outside probabilities for each node, we (incrementally) store the IO probability in the hash table against each GR index. Thus, after traversing the parse forest, the hash table contains a list of unique GRs and corresponding non-normalised weights. The non-normalised set of GRs for the example sentence is shown in Figure 2.16. In order to normalise this set we utilise the inside probability of the root node, that is, the probability of all derivations Pq. The final (normalised) set of weighted GRs for this example is also shown in Figure 2.16, where Pq = −7.8438.
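The following sketch pulls the three stages together for the single-head case (the IOALR(1)). It is illustrative only: is_word, word, action_logprob, daughters and the unfilled_gr.fill helper are assumed interfaces onto the parse forest, not the extant parser's implementation, and stage (3) is folded into stage (2) as described above.

    import math
    from collections import defaultdict

    def logsumexp10(xs):
        m = max(xs)
        return m + math.log10(sum(10 ** (x - m) for x in xs))

    def weighted_grs_single_head(roots):
        """Return {GR: weight}, assuming every node in a node set specifies the same lexical head."""
        memo = {}

        def fill_and_inside(node_set):            # stage (1): depth-first GR filling + Eq. 4.9/4.10
            if id(node_set) in memo:
                return memo[id(node_set)]
            if node_set.is_word:                  # head = the word itself, inside = shift score
                w = node_set.nodes[0]
                memo[id(node_set)] = (w.word, w.action_logprob)
                return memo[id(node_set)]
            for n in node_set.nodes:
                e, heads = n.action_logprob, {}
                for l, d in enumerate(n.daughters, start=1):
                    heads[l], d_inside = fill_and_inside(d)
                    e += d_inside
                n.inside = e
                n.filled = n.unfilled_gr.fill(heads)   # e.g. <2,(det 2 1)> -> <park,(det park the)>
            rep = node_set.nodes[0]                    # all nodes in the set share the same head
            memo[id(node_set)] = (rep.filled.head, logsumexp10([n.inside for n in node_set.nodes]))
            return memo[id(node_set)]

        P_q = logsumexp10([fill_and_inside(r)[1] for r in roots])
        weights = defaultdict(float)
        agenda = [(r, 0.0) for r in roots]             # stage (2): top-down outside pass, Eq. 4.11
        while agenda:
            node_set, f = agenda.pop(0)
            if node_set.is_word:
                continue
            for n in node_set.nodes:
                for gr in n.filled.grs:                # stages (2)+(3): Equation 4.13, normalised by P_q
                    weights[gr] += 10 ** (n.inside + f - P_q)
                for d in n.daughters:
                    sisters = sum(memo[id(k)][1] for k in n.daughters if k is not d)
                    agenda.append((d, f + n.action_logprob + sisters))
        # As in the earlier sketch, shared node sets are revisited once per outside contribution.
        return dict(weights)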

4.2.3 Problem: Multiple Lexical Heads

The IOALR(1) solution applies only if a single lexical head results for each node, though this is often not the case. We extend the example parse forest discussed previously to illustrate how multiple heads may occur for each (tree) node in the parse forest. In related work, discussed in the following section, the parse forest is altered so that a single lexical head is guaranteed to result for each node. This allows those approaches to apply a solution similar to that described previously to determine weighted GR output. Consider the set of nodes for the mother category V1 in Figure 2.14, that is, nodes 4, 17 and 22. Each node specifies the head see+ed VVD. However, if we alter the GR specification for V1/vp pp to <2, (ncmod 1 2)>, the second daughter is now considered the head.

4.2Note that some values in this table, and others following for this example, differ slightly from the exact values that are calculated in this example in practice, due to rounding errors.

[Table 4.1: Inside (log base 10) probabilities and filled GR specifications for the parse forest nodes shown in Figure 2.14 and Table 2.9. The shift or reduce probability for each node is shown in the third column and the inside probability for the node in the fourth column; the remaining two columns illustrate the filled GR specification(s) for the node.]
6 8 5 8 5 − − − − − ( ( ( ( ( ∑ ∑ ∑ ∑ ∑ pp [-7.02172,-6.6284] -1.2154 VVD -.00002 n -3.0052 n -2.4738 n -5.7067 -2.36882 -8.07552 0.58652 pp -8.66952 -1.2154 pp1 -5.54 -2.53552 -8.07552 0.58652 np np -5.53872np -6.86012 -4.3462 -1.2154 vp -7.8438 0.0 np -3.1437 -5.0934 np -3.1308np -6.75412 -3.1755 -4.90002 -9.88492 0.009097 -8.07552 0.58652 NN1 0.0 -7.8438 NN1 0.0 -7.84386 -7.84386 1.0 AT -.0004 AT 0.0005 -7.8433 II -0.0134 -8.2237 II -0.0005II -9.88442 -0.0452 -8.03032 -9.88492 0.009097 -8.07552 0.58652 PPIS1 -0.6763 -7.1675 Table 4.2: Outside and (non-)normalised IO probabilities f 5 see+ed 7 the 9 man 14 the 8 N1/n -1.848 234 S/np I V1/v 1213 in NP/det 1516 N1/n17 park 18 V1/vp 19 V1/v 20 -2.307 PP/p121 P1/p in 23 -3.1308 NP/det -5.5368 26 in -6.75412 -9.88492 0.009097 Node Word/Rule1 Inside T/txt-sc1/– -7.8438 outside 0.0 22 V1/v 24 N1/n1 10 PP/p1 -3.1437 -5.0934 11 P1/p 6 NP/det 25 P1/p 100 4. EFFICIENT EXTRACTION OF WEIGHTED GRS

In this case, both V1/v np pp and V1/v np specify the head see+ed VVD, while V1/vp pp specifies the head in II. Thus for the mother node S/np vp, we can fill the GR specification <2, (ncsubj 2 1)> as before, creating the filled GR specification <see+ed VVD, (ncsubj see+ed VVD I PPIS1)>. We may also fill the GR specification using the second daughter head combination: <in II, (ncsubj in II I PPIS1)>. It is clear in this example that node 2 (S/np vp) appears in all possible derivations for the sentence. However, the two possible filled GR specifications clearly do not appear in every possible derivation. Consequently, we are no longer able to utilise the IO of the node within the frequency calculations for the filled GR specifications of the node. The new set of n-best GRs and weighted GRs is shown in Figure 4.4. We describe the solution to this problem in §4.3. That is, we describe modifications to the IOALR(1) described in this section. This algorithm is called 'EWG' as it is capable of extracting weighted GRs from any parse forest produced by the extant parser and is not limited to those in which a single lexical head results per node. EWG is based on the simple observation that if a set of nodes Ni specifies more than one possible head, then the inside probability of each head forms part of the probability e(Ni). We re-define the summation over nodes in Ni in Equation 4.10, so that we condition the summation on the node's head, where the function H returns the head of a node:

e(N_i) = \sum_{h} \; \sum_{n_c \in N_i, H(n_c) = h} e(n_c)

That is, we can determine the inside probability e(Ni, h) for each head h of Ni using the inside probabilities of the nodes in the set, where the previous equation is split into two parts:

e(N_i, h) = \sum_{n_c \in N_i, H(n_c) = h} e(n_c)        (4.14)

e(N_i) = \sum_{h} e(N_i, h)

In the event that multiple heads occur for a node set, using the single probability e(Ni) to represent the probability of each head over-estimates e(Ni, h), unless only a single head results for the node set. That is, the previous solution applies only if every node set has a single head.
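A small sketch of this head-conditioned grouping (Equation 4.14), assuming each node exposes illustrative head and inside attributes and that probabilities are base-10 logs:

    import math
    from collections import defaultdict

    def logsumexp10(xs):
        m = max(xs)
        return m + math.log10(sum(10 ** (x - m) for x in xs))

    def inside_by_head(node_set):
        """Group the nodes of a set N_i by lexical head: e(N_i, h) sums the inside probabilities
        of the nodes whose head is h (Equation 4.14); e(N_i) is then the sum over heads."""
        grouped = defaultdict(list)
        for n in node_set:
            grouped[n.head].append(n.inside)
        e_h = {h: logsumexp10(vals) for h, vals in grouped.items()}
        return e_h, logsumexp10(list(e_h.values()))

For N4 under the altered GR specification, the two nodes with head see+ed VVD (inside probabilities −7.02172 and −6.86012) group to e(N4, see+ed VVD) ≈ −6.6324, while the single node with head in II gives e(N4, in II) = −8.66952; these are the values used in §4.3.1.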

4.2.4 Problem: Multiple Parse Forests

As described in §2.6.3, the grammar specifies a number of possible root categories, so a number of root node structures can result. Each of these structures defines a set of derivations, where each derivation spans the whole sentence with the specified top category. In practice, we consider the parse forest as the set of such root node data structures. In the following section, we describe EWG over a single root node, though the extension to multiple root nodes is trivial. To determine the inside probability of the sentence Pq, applied to normalise the weights of the weighted GR output, we sum over the inside probabilities of the root nodes produced. We perform the outside probability calculation by assigning the outside probability of each root node to be 1.

N-BEST GRS:

Parse probability: -8.075
(det man NN1 the AT)
(det park NN1 the AT)
(dobj in II park NN1)
(dobj see+ed VVD man NN1)
(ncsubj see+ed VVD I PPIS1 )
(ncmod man NN1 in II)

Parse probability: -8.237
(det man NN1 the AT)
(det park NN1 the AT)
(dobj in II park NN1)
(dobj see+ed VVD man NN1)
(ncsubj see+ed VVD I PPIS1 )
(iobj see+ed VVD in II)

Parse probability: -9.884
(det man NN1 the AT)
(det park NN1 the AT)
(dobj in II park NN1)
(dobj see+ed VVD man NN1)
(ncsubj in II I PPIS1 )
(ncmod see+ed VVD in II)

(NON-NORMALISED) WEIGHTED GRS:
-7.843 (det man NN1 the AT)
-7.843 (det park NN1 the AT)
-7.843 (dobj in II park NN1)
-7.843 (dobj see+ed VVD man NN1)
-7.847 (ncsubj see+ed VVD I PPIS1 )
-8.075 (ncmod man NN1 in II)
-8.237 (iobj see+ed VVD in II)
-9.884 (ncmod see+ed VVD in II)
-9.884 (ncsubj in II I PPIS1 )

(NORMALISED) WEIGHTED GRS:
1.0 (det man NN1 the AT)
1.0 (det park NN1 the AT)
1.0 (dobj in II park NN1)
1.0 (dobj see+ed VVD man NN1)
0.990904 (ncsubj see+ed VVD I PPIS1 )
0.5866 (ncmod man NN1 in II)
0.404265 (iobj see+ed VVD in II)
9.0960e-3 (ncmod see+ed VVD in II)
9.0960e-3 (ncsubj in II I PPIS1 )

Total probability (sum of all parse probabilities): -7.843

Figure 4.4: The n-best GRs, and non-normalised/normalised weighted GRs for the three parses for the sentence I saw the man in the park using an altered GR specification in the rule V1/vp pp. The corresponding parse forest and original GR set are shown in Figures 2.14 and 2.16, respectively.

4.3 The EWG Algorithm

In the previous section we illustrated the application of the IOALR(1) over the extant parse forest to extract weighted GRs. As long as a single lexical head results for each node in the parse forest, the corresponding IO probability of the node may be used as the IO probability of the GR output by the node. However, in practice, more than one lexical head may result for each node in the parse forest. We describe the modifications made to our previous approach in subsequent sections, and discuss the relationship of our novel EWG algorithm to previous work in §4.3.3. As described previously, three processing stages are required to determine weighted GRs over the parse forest. Several changes are required to the first stage of processing in which we

fill GR specifications and calculate inside probabilities. We discuss these changes in §4.3.1, and describe the data structure created during this stage. This structure is traversed during the next stage of processing, described in §4.3.2, to determine outside probabilities and non-normalised weighted GRs.

4.3.1 Inside Probability Calculation and GR Instantiation

Inside Data Structure

EWG allows multiple heads and their corresponding probabilities to propagate upwards in the parse forest. We return a set of data structures for each node set Ni, where each corresponds to a possible head for the node set. For now, we consider this data structure (S^h_{Ni}) to hold simply the head and corresponding inside probability. Thus, as before, we pass up a tuple that consists of the node's head and inside probability, although now more than one tuple is possible.

Multiple Heads in a Node Set

We first consider the case where each node has a single filled GR specification, and a filled GR specification in a packed node defines a different lexical head to that of the representative node. In this case, we allow multiple heads to be passed up by the node set. Hence, the summation over nodes in Ni in Equation 4.10 needs to be conditioned on the possible heads of a node, where e(Ni, h) is the inside probability of each head h for node set Ni. That is, we apply Equation 4.14:

e(N_i, h) = \sum_{n_c \in N_i, H(n_c) = h} e(n_c)        (4.15)

We create a data structure S^h_{Ni} = <h, e(Ni, h)> for each head h for the node set Ni. For example, continuing the altered example in §4.2.3, we determine such a data structure for each head see+ed VVD and in II, returning for N4 the list:

{S^{see+ed VVD}_{N4} = <see+ed VVD, −6.6324>, S^{in II}_{N4} = <in II, −8.66952>}

Here, the inside probability of the head see+ed VVD is calculated using the inside probabilities of n4 and n22, that is, using the nodes whose GR specifications have the head see+ed VVD. For the head in II, the inside probability is that of the single node n17.

Multiple Filled GR Specifications

We now need to consider multiple heads passed up by each daughter node, resulting in multiple filled GR specifications for a single node. We create one filled GR specification for each possible combination of daughters' heads. As described previously, for the mother node S/np vp, we fill the GR specification <2, (ncsubj 2 1)> with each of the heads for N4. This results in the filled GR specifications <see+ed VVD, (ncsubj see+ed VVD I PPIS1)> and <in II, (ncsubj in II I PPIS1)>.

GR Specification Inside Probability

Each possible head for daughter node sets DNl ∈ D(nc) has an associated inside probability, e(DNl,h), and we determine the probability of each filled GR specification, sm ∈ G(nc) for a node nc, using each daughter’s head probability. That is, we consider Equation 4.9 to apply to the lexical head of each daughter hk chosen to fill the GR specification (in the set of possible lexical heads for daughter node set l: H(DNl)):

e(n_c, s_m) = P(a[c]) \prod_{DN_l \in D(n_c), \, h_k \in H(DN_l), \, h_k \in s_m} e(DN_l, h_k)        (4.16)

Returning to our previous example, we now calculate the inside probabilities of the GR specifications <see+ed VVD, (ncsubj see+ed VVD I PPIS1)> and <in II, (ncsubj in II I PPIS1)>. These probabilities are equal to the reduce probability of the node n2 (−0.5391) multiplied by (i) the inside probability of the head I PPIS1 (−0.6763), and (ii) the inside probabilities of the heads see+ed VVD (−6.6324) and in II (−8.66952), respectively. That is, for the first filled GR specification, the inside probability is ∑(−0.5391 − 0.6763 − 6.6324) = −7.8478, while for the second it is ∑(−0.5391 − 0.6763 − 8.66952) = −9.88492.4.3

The same word can appear as a head for more than one daughter of a node. This occurs if competing analyses have daughters with different word spans and, therefore, particular words can be considered in the span of either daughter. As the grammar permits both pre- and post-modifiers, it is possible for words in the 'overlapping' span to be passed up as heads for both daughters. Therefore, semantic heads are not combined unless they are different words.
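A sketch of this combination step, mirroring the example just given (probabilities are base-10 logs; the argument names and the hypothetical fill helper are illustrative, and Equation 4.16 additionally restricts the product to heads that actually fill slots in sm):

    import itertools

    def filled_spec_insides(action_logprob, daughter_heads, fill):
        """daughter_heads maps daughter position l -> {head: e(DN_l, head)}.  One filled GR
        specification is produced per combination of the daughters' heads, with inside
        probability P(a[c]) times the chosen heads' inside probabilities (log10 sums)."""
        positions = sorted(daughter_heads)
        out = []
        for combo in itertools.product(*(daughter_heads[l].items() for l in positions)):
            heads = {l: h for l, (h, _) in zip(positions, combo)}
            e_sm = action_logprob + sum(e for _, e in combo)
            out.append((fill(heads), e_sm))
        return out

    # For node n_2: reduce probability -0.5391, daughter 1 head {I PPIS1: -0.6763}, daughter 2
    # heads {see+ed VVD: -6.6324, in II: -8.66952}; the two combinations give inside
    # probabilities -7.8478 and -9.88492, as in the example above.
    specs = filled_spec_insides(-0.5391,
                                {1: {"I PPIS1": -0.6763},
                                 2: {"see+ed VVD": -6.6324, "in II": -8.66952}},
                                fill=lambda heads: ("ncsubj", heads[2], heads[1]))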

Head Inside Probability: Grouping GR Specifications

As a node can have multiple filled GR specifications G(nc), and packed nodes can also contain multiple filled GR specifications, we alter Equation 4.15 to:

e(N_i, h) = \sum_{n_c \in N_i} \; \sum_{s_m \in G(n_c), H(s_m) = h} e(n_c, s_m)        (4.17)

Here, e(nc, sm) (the inside probability of filled GR specification sm for node nc) is determined using Equation 4.16. Hence, (a) calculation of inside probabilities takes into account multiple semantic heads, and (b) GR specifications are filled using every possible combination of daughters' heads.

Node Data Structures

For each node set Ni, which includes the representative node ni and any packed nodes np, we propagate up the set of data structures {S^h_{Ni}} for each possible head h for the node set. At word nodes, we simply return the word and the shift score of the node as the semantic head and inside probability, respectively. S^h_{Ni} also stores the corresponding set of GR specifications with the head h. That is, we store G(Ni, h) = {sm}, ∀sm ∈ G(nc), nc ∈ Ni, H(sm) = h. Furthermore, for each filled GR specification sm which results for node nc in the set Ni, we store the node's action probability P(a[c]) and the inside probability of the GR specification e(nc, sm).

S^h_{N_i} = \big( h, \; e(N_i, h), \; G(N_i, h) = \{ ( s_m : \langle h, \{GR\} \rangle, \; P(a[c]), \; e(n_c, s_m), \; \{ S^{h_k}_{DN_l} \} ) \} \big)

Note that we store the set of S^{hk}_{DNl} structures for each daughter's head hk used to fill a GR specification. For example, for the node set N4 in the continued example, we store filled GR specifications for each of the heads see+ed VVD and in II. Thus for N4 we return the set {S^{see+ed VVD}_{N4}, S^{in II}_{N4}}, where these data structures are shown in Figure 4.5. Similarly, we return the set {S^{see+ed VVD}_{N2}, S^{in II}_{N2}} for N2, which is shown in Figure 4.6.
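As a sketch, the S structures can be pictured as the following Python dataclasses; the field names are assumptions for illustration, not the parser's actual representation:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class FilledSpec:
        """One filled GR specification s_m created for some node n_c in the set N_i."""
        head: str                              # the semantic head h of the specification
        grs: List[str]                         # the instantiated GR(s) output with it
        action_logprob: float                  # P(a[c]) of the node that created s_m
        inside: float                          # e(n_c, s_m), Equation 4.16
        daughters: List["HeadStruct"] = field(default_factory=list)   # the S^{h_k}_{DN_l} used to fill slots

    @dataclass
    class HeadStruct:
        """S^h_{N_i}: one possible head h for node set N_i."""
        head: str
        inside: float                          # e(N_i, h), Equation 4.17
        specs: List[FilledSpec] = field(default_factory=list)         # G(N_i, h)

Each node set then returns the list {S^h_{Ni}} of such structures, one per unique head, and the outside pass of §4.3.2 traverses these structures rather than the parse forest itself.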

4.3As previously described, summation of log probabilities is equivalent to the multiplication of these probabilities.

S^{see+ed VVD}_{N4} = <see+ed VVD, −6.6324, G(N4, see+ed VVD)>
G(N4, see+ed VVD) = {(<see+ed VVD, {(dobj see+ed VVD man NN1), (iobj see+ed VVD in II)}>, P(a[4]) = −0.8728, −7.02172, {S^{see+ed VVD}_{N5}, S^{man NN1}_{N6}, S^{in II}_{N10}}),
                     (<see+ed VVD, {(dobj see+ed VVD man NN1)}>, P(a[22]) = −1.1534, −6.86012, {S^{see+ed VVD}_{N5}, S^{man NN1}_{N23}})}

S^{in II}_{N4} = <in II, −8.66952, G(N4, in II)>
G(N4, in II) = {(<in II, {(ncmod see+ed VVD in II)}>, P(a[17]) = 0.0, −8.66952, {S^{see+ed VVD}_{N18}, S^{in II}_{N19}})}

Figure 4.5: Example EWG data structures for N4.

S^{see+ed VVD}_{N2} = <see+ed VVD, −7.8478, G(N2, see+ed VVD)>
G(N2, see+ed VVD) = {(<see+ed VVD, {(ncsubj see+ed VVD I PPIS1)}>, P(a[2]) = −0.5391, −7.8478, {S^{I PPIS1}_{N3}, S^{see+ed VVD}_{N4}})}

S^{in II}_{N2} = <in II, −9.88492, G(N2, in II)>
G(N2, in II) = {(<in II, {(ncsubj in II I PPIS1)}>, P(a[2]) = −0.5391, −9.88492, {S^{I PPIS1}_{N3}, S^{in II}_{N4}})}

Figure 4.6: Example EWG data structures for N2.

Overview

Each node set Ni, of node ni with packed nodes np, is processed in full as follows:

• Process each of the node’s packed nodes np (as described for ni following) to determine the packed node’s list of filled GR specifications and corresponding inside probabilities.

• Process the node ni, with daughters D(ni):

– Instantiate ni’s GR specifications based on features of daughters D(ni).

– Process each daughter node set DNl ∈ D(ni) to determine a list of possible heads and corresponding inside probabilities for each. That is, we determine the set of S^{hk}_{DNl} structures for each daughter node set.

– Fill the GR specification of ni with each possible combination of daughters' heads, creating the set G(ni).

– Calculate the inside probability of each filled GR specification e(ni, sm), that is, for each sm ∈ G(ni), using Equation 4.16.

• Combine the alternative filled GR specifications of ni and of each of the packed nodes np, to determine the list of unique semantic heads h and corresponding inside probabilities e(Ni,h) using Equation 4.17.

• Create S^h_{Ni} for each unique h and return the set of such structures in a list: {S^h_{Ni}}

This results in the data structure over the parse forest, that is, over the set of parse forests with root nodes Nr: S^τ = {S^h_{Nr}}. This structure is traversed during the next stage of processing, so that the parse forest itself is only traversed once.
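A sketch of the inside pass that builds this structure, following the overview above (all node fields are assumed; fill_all stands in for the head-combination filling of Equation 4.16 sketched earlier, returning pairs of a filled specification and the log10 sum of the chosen daughters' head inside probabilities):

    import math
    from collections import defaultdict

    def logsumexp10(xs):
        m = max(xs)
        return m + math.log10(sum(10 ** (x - m) for x in xs))

    def build_s_structures(node_set, cache):
        """Return {head h: {'inside': e(N_i, h), 'specs': [(filled spec, e(n_c, s_m)), ...]}}
        for node set N_i, processing the representative node and its packed nodes alike."""
        if id(node_set) in cache:
            return cache[id(node_set)]
        if node_set.is_word:                          # head = the word itself, inside = shift score
            w = node_set.nodes[0]
            cache[id(node_set)] = {w.word: {"inside": w.action_logprob, "specs": []}}
            return cache[id(node_set)]
        by_head = defaultdict(list)
        for n in node_set.nodes:
            daughter_structs = [build_s_structures(d, cache) for d in n.daughters]
            for spec, e_daughters in n.fill_all(daughter_structs):
                by_head[spec.head].append((spec, n.action_logprob + e_daughters))   # Eq. 4.16
        cache[id(node_set)] = {h: {"inside": logsumexp10([e for _, e in ss]),        # Eq. 4.17
                                   "specs": ss}
                               for h, ss in by_head.items()}
        return cache[id(node_set)]

    # S_tau is then the set of such structures for the root node set(s):
    # cache = {}; S_tau = [build_s_structures(r, cache) for r in roots]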

4.3.2 Outside Probability Calculation

GR Specification Outside Probability

After the inside probabilities are computed (bottom-up), the resulting data structure S^τ is traversed to compute outside probabilities. This data structure is already split into alternative heads for each node set Ni (S^h_{Ni}). Therefore, it is trivial to traverse this structure to determine outside probabilities. The outside probability of S^h_{Ni} is equal to the outside probability of each filled GR specification data structure stored in S^h_{Ni}. That is, f(nc, sm) = f(Ni, h) for each possible sm filled for node nc ∈ Ni, where the head of the GR specification sm is h.

Daughter Node Outside Probability

We then calculate the outside probability of each daughter node DNl ∈ D(nc) whose head hk was used to fill the GR specification sm resulting for node nc. We consider the outside probability of the GR specification to be that of the mother node in Equation 4.11. Further, we utilise the inside probability e(DNp, hr) of each fellow daughter node DNp whose head hr also fills the GR specification sm. Note that the set of probabilities e(DNp, hr) for each k and r is already stored within the data structure with sm. The test we define for this product (hr ∈ sm) is not required if we apply this equation within the data structures created during the previous processing stage.

f(DN_l, h_k) = \sum_{n_c} f(n_c, s_m) \, P(a[c]) \prod_{DN_p \in D(n_c), \, DN_p \neq DN_l, \, h_r \in s_m} e(DN_p, h_r)        (4.18)

For example, for N4 we calculate the outside probability of the head see+ed VVD, where f(n2, <see+ed VVD, {(ncsubj see+ed VVD I PPIS1)}>) = 0, P(a[2]) = −0.5391 and e(N3, I PPIS1) = −0.6763,4.4 as:

f(N4, see+ed VVD) = 0 + (−0.5391) + (−0.6763) = −1.2154

Considering that f(n4, sm) = f(N4, see+ed VVD) for each GR specification sm with head see+ed VVD, we determine the non-normalised weight of the GRs for n4, {(dobj see+ed VVD man NN1), (iobj see+ed VVD in II)}, as −1.2154 + −7.02172 = −8.23712. Normalising each GR by −7.8438, we find, for example, that the final weight of (iobj see+ed VVD in II) is 0.40428.4.5

4.4The log10 probability of 1 is 0. Again, we use summation to perform multiplication for log probabilities.
4.5Again, these figures differ slightly from the weights shown in Figure 4.4, due to rounding errors.
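A sketch of this per-specification step (Equation 4.18 without the summation over mother nodes; probabilities are base-10 logs and the argument names are illustrative):

    def daughter_outsides(f_spec, action_logprob, chosen_heads):
        """chosen_heads maps each daughter node set l that filled s_m to e(DN_l, h) for the head
        chosen from it; f_spec is f(n_c, s_m).  Returns one outside contribution per daughter."""
        return {l: f_spec + action_logprob + sum(e for p, e in chosen_heads.items() if p != l)
                for l in chosen_heads}

    # Reproduces f(N_4, see+ed VVD) = 0 + (-0.5391) + (-0.6763) = -1.2154 from the example above:
    contrib = daughter_outsides(0.0, -0.5391, {"N3": -0.6763, "N4": -6.6324})["N4"]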

Traversing the New Data Structure

Therefore, once we create the new data structure, outside probabilities for each node are determined over this structure in the regular fashion, rather than over the parse forest. We simply equate the outside probability of each head to be that of every corresponding filled GR specification data structure.

Overview

In practice, we apply a breadth-first search (FIFO queue) over S^τ, to minimise multiple processing of shared data structures. We initialise this queue to contain each root node with an outside probability of 1. We perform a POP operation to determine the next paired list from the queue. This pair consists of the data structure and corresponding outside probability: {S^h_{Ni}, f(Ni, h)}. We process each pair as follows:

• Process each of the GR specifications in S^h_{Ni}. For each filled GR specification sm, where sm ∈ G(Ni, h) was created for node nc ∈ Ni:

– Let f (nc,sm) = f (Ni,h) and calculate the IO probability of sm using:

IO(n_c, s_m) = e(n_c, s_m) \, f(n_c, s_m)        (4.19)

– Add IO(nc, sm) to the (non-normalised) probability for the GR in the filled GR specification sm.

– Process the data structure for each daughter head hk chosen to fill sm from the daughter node set DNl ∈ D(nc). That is, process each of the daughter data structures (each of which specifies a single head) used to fill the slots in the GR specification sm. For each S^{hk}_{DNl} stored for sm:

∗ Calculate the outside probability f(DNl, hk) of the head hk from daughter node set DNl using the reduce probability of the node nc which created sm, P(a[c]), which is also stored in the data structure. That is, we use Equation 4.18 without the summation over nc, the set of mother nodes for which DNl is a daughter.

∗ Queue the data structure S^{hk}_{DNl} and the corresponding outside probability f(DNl, hk).

We calculate the outside probabilities (top-down) and, when we find filled GR specifications, we incrementally store the non-normalised (IO) weight of each GR in a hash table, as described previously. Each increment represents the situation where the node is a daughter of another in the parse forest; that is, it performs the summation over mother nodes nc in Equation 4.18. Thus, if an outside probability is determined for a data structure already queued, then the probability is appended to the list of outside probabilities for the queued item; prior to calculating Equation 4.19 we perform a summation over these outside probabilities, where f(nc, sm) = ∑ f(Ni, h).

Traversing the Parse Forest

We could instead store the inside probabilities and the set of filled GR specifications within each node of the parse forest, rather than create the S data structures. In this case, we calculate outside probabilities conditioned on which lexical head (from each daughter) is used to fill each GR specification. Here, we pass down the outside probability f(Ni, h) to each node nc ∈ Ni and then to each of the GR specifications stored in nc if and only if the head of the GR specification is h. This test is not required when processing over S^τ, as GR specifications are already grouped by head value. Equation 4.18 is already shown with the condition that we include the inside probability e(DNp, hr) for daughter p if and only if hr filled a slot in sm.

4.3.3 Related Work

In the previous sections we illustrated that a dynamic programming approach over the parse forest can be used to calculate weighted GR output only if a single head is determined for each node in the parse forest. Further, we illustrated how to extend this approach to successfully associate multiple heads, and corresponding inside and outside probabilities, with each node in the parse forest using our novel EWG algorithm. The approach we take is similar to that in Schmid & Rooth (2001), where 'expected governors' (similar to our filled GR specifications) are determined for each node. However, they ensure that competing nodes (i.e. each node in a node set) in the parse forest have the same head. Initially, they create a packed parse forest, and during a second pass the parse forest nodes are split if multiple heads occur. A single iteration of the IOA is applied over this split PCFG parse forest data structure, as described for LR parse forests using the IOALR(1) in §4.2.2. However, as they utilise a PCFG grammar, rules specify how to determine dependency relations given tuples consisting of the NT categories of mother and daughters of CFG rules. In contrast, within the extant parser such rules are encoded in finer-grained unification-based rules within the grammar itself.
Moreover, these are optionally instantiated on the features of the rules' daughters.

Clark & Curran (2004b) apply the dynamic programming approach of Miyao & Tsujii (2002) to determine weighted dependency relations (DRs) within their CCG parser. They alter their packing algorithm so that nodes in the packed chart effectively have the same semantic head. That is, their definition of feature structure equivalence is extended to include the head of each node as well.

These previous approaches determined dependency relations and corresponding weights directly from the parse forest, using dynamic programming approaches similar to the IOALR(1) we describe. That is, they ensure that a single head is specified for each node, which results in a less compact parse forest. As a result, they require additional processing overheads to create the parse forest, and moreover, to determine inside and outside probabilities from this less compact parse forest. In order to apply the EWG algorithm, we also modify the packing algorithm to be based on feature structure equality rather than subsumption. This results in a less compact parse forest, though only to the extent originally found in the parse forests of Schmid & Rooth (2001) and Clark & Curran (2004b). If we extend the definition of node equivalence to include the node's head (of a filled GR specification), or split the nodes in a postprocessing stage as the previous approaches have, this increases the size of the parse forest further. Therefore, we consider our approach an efficient alternative to those in previous work, given that many existing parsers already utilise equality-based packing.

4.4 EWG Performance

This section presents experimental results showing (a) improved efficiency and (b) increased upper bounds of precision and recall achieved using EWG. In the following section, §4.5, we show (c) increased accuracy achieved by a parse selection algorithm that would otherwise be too inefficient to consider. We utilise the same data (DepBank) and machine hardware described in §3.5.1.

4.4.1 Comparing Packing Schemes

Figures 4.7 and 4.8 compare the efficiency of EWG to the EQ-PACKING and SUB-PACKING methods in terms of CPU time and memory, respectively.4.6 Note that EWG applies equality-based packing to ensure only derivations licensed by the grammar are considered. As the maximum number of (n-best) derivations increases, EQ-PACKING requires more time and memory than SUB-PACKING. However, if we compare these systems with an n-best value of 1, the difference in time and memory is negligible. This suggests that it is the unpacking stage which is responsible for the decreased throughput. For EWG we are forced to use equality-based packing, although these results suggest that this condition does not affect the algorithm's throughput.

[Figure 4.7: plot of total CPU time (sec, 0–2500) against the maximum number of parses (n-best, 0–1000) for SUB-PACKING, EQ-PACKING and EWG.]

Figure 4.7: Comparison of total CPU time required by the different versions of the parsing system for calculation of weighted GRs over the n-best derivations.

4.4.2 Efficiency of EWG Both figures illustrate that the time and memory required by EWG are static, because the algo- rithm considers all derivations represented in the parse forest regardless of the value of n-best specified. Therefore, the ‘cross-over points’ are of particular interest: at which n-best value is EWG’s efficiency the same as that of the current system’s? This value is approximately 580 and 100 for time and memory, respectively (comparing EWG to EQ-PACKING). That is, the cur- rent system takes the same amount of time to calculate weighted GRs over approximately 580 n-best parses as the EWG algorithm takes to calculate weighted GRs directly from the parse forest (over all parses). Similarly, the memory requirements are approximately equal for the current system utilising an n-best list of size 100. Consequently, as the size of the n-best list un- packed (for the existing system’s calculation of weighted GRs) increases over these cross-over

4.6CPU time and memory usage are as reported using the time function in Allegro Common Lisp 7.0 and do not include system start-up overheads or the time required for garbage collection.

[Figure 4.8: plot of total memory (GB, 0–90) against the maximum number of parses (n-best, 0–1000) for SUB-PACKING, EQ-PACKING and EWG.]

Figure 4.8: Comparison of total memory required by the different versions of the system for calculation of weighted GRs over the n-best derivations.

Given that there are on average around 9K derivations per sentence for DepBank, these results indicate a substantial improvement in both efficiency and accuracy for weighted GR calculation. However, the median number of derivations per sentence is around 50, suggesting that large derivation numbers for a small subset of the test suite are skewing the arithmetic mean. Therefore, the complexity of this subset significantly decreases throughput, and EWG improves efficiency for these sentences more so than for others.

4.4.3 Data Analysis

The general relationship between sentence length and number of derivations suggests that the EWG is more beneficial for longer sentences. Figure 4.9 shows the distribution of derivation ambiguity over sentence length. The figure illustrates that the number of derivations cannot be reliably predicted from sentence length, though the general relationship holds. Considering the cross-over points for time and memory, the numbers of sentences with more than 580 and 100 derivations were 216 and 276, respectively. These points are shown using dotted lines in the figure. Thus, the EWG outperforms the current algorithm for around half of the sentences in the data set. The relative gain achieved overall for EWG reflects that a subset of sentences significantly decreases throughput. Hence, the EWG is expected to be more efficient than the current method for determining weighted GRs if longer sentences are present in the data set and if the size of the n-best list is set to a value greater than the cross-over point(s). Table 1.1 illustrates the average and median number of parses for a number of different data sets.4.7 Consequently, EWG is expected to outperform the current method over all corpora except the simple GDT.

4.7We employ a version of the tsg15 grammar in these experiments that is less ambiguous than the one released with version 2 of RASP (applied to determine the statistics in Table 1.1). As a result, EWG is expected to provide even further performance gains over the current method for determining weighted GRs.

[Figure 4.9: scatter plot of the number of parses (up to 1000+) against sentence length (10–70), with the memory and time cross-over points marked as dotted lines.]

Figure 4.9: Scatter graph of parse ambiguity to sentence length (one point per sentence). The cross-over points are illustrated for time and memory. The maximum number of derivations shown is 1K; points plotted at 1K correspond to sentences with 1K or more derivations.

4.4.4 Accuracy of EWG

Upper bounds on precision and recall are determined using thresholds over the weighted GRs of 1 and 0, respectively.4.8 The upper bounds of precision and recall provided by EWG are 79.57% and 82.02%, respectively. This results in an F1 upper bound of 81.22%. However, considering the top 100 derivations only, we achieve upper bounds on precision and recall of 78.77% and 81.18% respectively, resulting in an F1 upper bound of 79.96%. Therefore, using EWG, we achieve a relative error reduction of 6.29% for the F1 upper bound on the task. Similarly, Carroll & Briscoe (2002) demonstrate (on an earlier, different test suite) that increasing the number of derivations (n-best) from 100 to 1K increases precision of weighted GR sets from 89.59% to 90.24%, a relative error reduction (RER) of 6.8%. Therefore, EWG achieves a substantial improvement in both efficiency and accuracy for weighted GR calculation. The approach provides an increase in the upper bounds of precision and recall, that is, it provides an increased F1 upper bound on the task.

4.5 Application to Parse Selection

The previous section illustrated the increased level of efficiency achieved by EWG compared to the extant method for calculating weighted GRs. This section describes a parse selection algorithm using EWG that would otherwise be too inefficient to apply. We previously described the work of Clark & Curran (2004b), whereby weighted dependency relations (DRs) are determined directly from a packed chart. They also describe a parse selection algorithm which maximises the expected recall of dependencies. Their algorithm

4.8In these experiments we use a threshold of 1−ε (with ε = 0.0001) instead of a threshold of 1.0, to reduce the influence of very low-ranked derivations. Note that these upper bounds are at least as high as those achieved by selecting the best possible single n-best GR set.

selects the DR set with the highest average DR score based on the weights from the weighted DRs. We apply this parse selection algorithm in two ways. We (a) rerank the n-best GR sets based on the average weight of GRs and select the highest-ranking set, or (b) apply a simple variant of the Viterbi algorithm to select the GR set with the highest average weighted score over the data structure built during EWG. The latter approach, based on the parse selection algorithm in Clark & Curran (2004b), takes into account all possible derivations and effectively reranks these derivations using weights output by EWG. These approaches are referred to as RERANK (over the top 1K derivations) and BEST-AVG, respectively. Table 4.3 illustrates the performance of the extant parsing model and of these approaches over DepBank.

                 Micro-average                Macro-average
Model        Prec     Rec      F1         Prec     Rec      F1
Extant       71.24    71.24    71.24      58.42    59.62    59.01
RERANK       71.72    70.94    71.33      59.47    60.22    59.84
BEST-AVG     71.59    71.49    71.54      59.53    60.63    60.08

Table 4.3: Performance of the two parse selection algorithms.
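A minimal sketch of the RERANK variant (GR sets as iterables of hashable GR tuples and weights as produced by EWG; both are illustrative assumptions):

    def rerank(nbest_gr_sets, gr_weights):
        """Score each n-best GR set by the average EWG weight of its GRs and return the best.
        BEST-AVG instead applies a Viterbi-style search over the EWG data structure, so that
        all derivations, not just the top 1K, are considered."""
        def avg_weight(gr_set):
            grs = list(gr_set)
            return sum(gr_weights.get(gr, 0.0) for gr in grs) / len(grs) if grs else 0.0
        return max(nbest_gr_sets, key=avg_weight)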

The GR set corresponding to the system's top parse achieves an F1 of 71.24%. By applying BEST-AVG and RERANK parse selection, we achieve a relative error reduction (compared to the upper bound of 81.22%) of 3.01% and 0.90%, respectively. Therefore, BEST-AVG achieves higher accuracy and is more efficient than RERANK. This suggests that BEST-AVG is able to select derivations that are ranked in a position lower than 1K. Thus, as expected, it is advantageous to consider all derivations during reranking tasks. It is also worth noting that these parse selection schemes are able to output a consistent set of GRs, unlike the high precision GR output defined by Carroll & Briscoe (2002).

4.6 Discussion

We described a dynamic programming approach based on the IOA for producing weighted GR output directly from the parse forest of the extant unification-based LR parser. Our approach is novel in that we allow any node in the parse forest to have multiple heads. This removes the additional processing overheads, introduced either by splitting nodes in the parse forest (and duplicating the associated data structures) or by building a less compact parse forest, as described in previous approaches. In an evaluation on a standard test suite (DepBank), the approach achieves substantial improvements in accuracy and parser throughput over the extant implementation. EWG is available for use within the second release of the RASP parser, as an alternative method to calculate the weighted GR output format. We intend to extend this work to develop more sophisticated parse selection schemes based on weighted GR output. Reranking the n-best GR sets results in a consistent but not necessarily a complete set of GRs. Given the (increased) upper bound on precision for the high precision GR output, we hope to boost the corresponding recall measure by determining a consistent and complete set of GRs directly from the weighted GR set. We continue this discussion in Chapter 6.

Chapter 5

Confidence-based Training

We discussed statistical approaches to parsing in §1.1.3, where we defined training a parser as a supervised learning task. The current training method for RASP, described in §5.3, utilises such a supervised training framework. It employs around 5K fully-annotated, system-compatible derivations from a subset of Susanne to derive the action counts, and thus the corresponding probabilities, of the LR model. However, such treebanks are limited in their use and coverage, and furthermore, are expensive to develop and maintain. These limitations, discussed in detail in §5.1, have prompted the development of unsupervised and semisupervised statistical training approaches that illustrate promising results. In this chapter we focus on semisupervised training approaches, reviewed in §5.2, and describe novel confidence-based training methods, and their relationship to previous work, in §5.4. We contrast the performance of the extant parser trained over Susanne with that of the same parser trained over unannotated or unlabelled partially-bracketed sentences from this treebank and from the WSJ. That is, we contrast the performance of the current, fully supervised, training method with that of the unsupervised or semisupervised confidence-based methods in §5.5.

We also compare the performance of these methods against the IOALR, the IOA extended to apply over our extant LR parser as described in §4.1.3, and further, to apply over unlabelled partially-bracketed data following Pereira & Schabes (1992). We consider the IOALR a variant of Expectation-Maximisation (EM), following Prescher (2001). Hence, in the experimentation described herein, we refer to the IOALR training methods applied as EM. The semisupervised variants of the confidence-based methods we describe outperform both EM (when both are constrained over the same data set) and the current fully supervised training method, achieving statistically significant improvements over the extant parser. Though we would expect a supervised method to outperform a semisupervised one over the same data set, these results suggest that a semisupervised method can outperform a supervised one given sufficient training data. As semisupervised data is simpler to extract from existing, though incompatible, corpora and also simpler to develop, these methods are preferable to fully supervised methods and aid in adapting the parser to new domains. Although our training methods are considered semisupervised, they utilise partially-annotated data automatically extracted from existing corpora. As a result, they require no manual effort on behalf of the grammar writer. These methods have been adopted by the extant parser, as they provide significant increases in accuracy, and furthermore, aid in grammar development. Much of the work we describe in this chapter appears in Watson et al. (2007).


5.1 Motivation

Currently, many statistical parsers require extensive and detailed treebanks, as many of their lexical and structural parameters are estimated in a fully-supervised fashion from treebank derivations. Collins (1999) is a detailed exposition of one such ongoing line of research, which utilises the standard training sections of the WSJ. However, there are disadvantages to this approach, which we discuss in this section.

Manual Effort

Firstly, treebanks are expensive to create manually. Further, given a manually written grammar, the grammar writer must also maintain these treebanks. For example, §1.3.1 described the annotated (supervised) training corpus of the extant parser that incorporates a manually written feature-based unification grammar (see §2.6.1). Grammar updates require equivalent (manual) updates to the annotation in this training corpus, where interactive manual disambiguation takes an average of ten minutes per sentence.

Usability

Secondly, the richer the annotation required, the harder it is to adapt the treebank to train parsers which make different assumptions about the structure of syntactic analyses. For example, Hockenmaier (2003) trains a CCG statistical parser on the WSJ, but first maps the treebank to CCG derivations semi-automatically.

Coverage

Thirdly, many (lexical) parameter estimates do not generalise well between domains, as discussed in §1.1.3. For instance, Gildea (2001) reports that WSJ-derived bilexical parameters in Collins' (1999) Model 1 parser contribute about 1% to parse selection accuracy when test data is in the same domain. In contrast, they yield no improvement for test data selected from the Brown Corpus (of which Susanne is a subset). However, training over both corpora slightly increases performance when testing on either corpus. Therefore, the use of in-domain training data improves parsing accuracy. In-domain data can be included with out-of-domain data in a single training corpus. Alternatively, the parser may be adapted to the new domain by retraining the parser over a separate corpus, using the existing parser as the initial model. Here, even small levels of in-domain data are shown to improve parser accuracy. Note that we can frame the problem of parser tuning or domain adaptation as an unsupervised or (semi)supervised task. For example, Tadayoshi et al. (2005) adapt a statistical parser trained on the WSJ to the biomedical domain by retraining on the Genia Corpus, augmented with manually corrected derivations in the same format. They are able to tune their parser with around 5K in-domain annotated sentences (a small set compared to the 40K WSJ sentences that train the parser initially).

Discussion

In order to make statistical parsing more viable for a range of applications, we need to make more effective and flexible use of existing training data and minimise the cost of annotation for new data created to tune a system to a new domain. Therefore, the focus of this work is to develop semisupervised training approaches that utilise partially-annotated data extracted from existing, though incompatible, treebanks. Further, these methods can apply over unsupervised in-domain data, to adapt the parser to a new domain if required.

5.2 Research Background

Training methods can be defined based on the level of supervised annotation, that is, the level of manual effort required to provide training data prior to or during the training process. If we utilise fully annotated training data we consider the approach to be supervised. The limitations of these approaches have prompted the development of unsupervised and semisupervised methods. Although unsupervised methods have proven relatively unsuccessful, semisupervised methods have shown promising results. We provide examples of several relatively successful unsupervised training methods in §5.2.1, and review the popular semisupervised approaches in §5.2.2.

5.2.1 Unsupervised Training

Training approaches that require no (manual) annotation are considered unsupervised. These methods are generally based on the IOA, which we described in the previous chapter, and have been largely unsuccessful to date, although an advantage of such approaches is that raw data in any domain is readily available.

In recent years, inferring the grammar and statistical model from unlabelled data has provided some encouraging results. For example, Klein & Manning (2002) report promising results for unsupervised grammar induction over the ATIS Corpus (Marcus et al., 1993) and section 10 of the WSJ. The unlabelled F1 of 71% achieved for the WSJ section is only around 10% lower than the performance of a supervised PCFG trained over the same section. Klein & Manning (2004) extend this work, and combine the constituency based model with a dependency model, which achieves a further increase in performance to 77.6%. Similarly, Bod (2006) illustrates that an unsupervised 'DOP' parser trained using EM related to the IOA ('UML-DOP') outperforms a supervised PCFG when both are trained over the WSJ. However, the underlying models differ, and we still expect (semi)supervised training to outperform unsupervised training given the same model and data set.

5.2.2 Semisupervised Training

Given the limited (in-domain) training data or manual effort available, two general types of semisupervised training method are pursued in research: those that increase the quantity of training data and those that increase its quality, while minimising the corresponding manual effort required. Approaches can consist of a mixture of (i) using an existing, though incompatible, treebank's annotation, (ii) active learning, and (iii) bootstrapping methods. We describe each of these approaches (and their variants), and research which focuses on these methods, in this section. These approaches utilise either partially annotated data or a mixture of both supervised and unsupervised data.

Mapping Treebanks

Given an existing corpus with different representational assumptions, we can semi-automatically map the corpus to a representation compatible with the parser's grammar. For example, Hockenmaier (2003) trains a CCG statistical parser on the WSJ, but first maps the treebank to CCG derivations semi-automatically, creating the CCGBank. Similarly, Miyao & Tsujii (2005) semi-automatically map the WSJ to a head-driven phrase structure grammar (HPSG) treebank. The resulting treebanks are fully annotated, facilitating supervised training. However, the semi-automatic mapping requires less manual effort compared to the creation of these corpora from scratch, so we consider the underlying method to be semisupervised. These methods are more labour intensive than those that follow.

Partial Treebank Annotation

Alternatively, we can utilise only a subset of the annotation that is potentially more compatible with the parser's grammar. Pereira & Schabes (1992) adapt the IOA to apply over semisupervised data, that is, over unlabelled bracketing that they extract from the WSJ. They constrain the training data (derivations) they consider within the IOA to those consistent with the constituent boundaries defined by the bracketing. One advantage of this approach is that, although less information is derived from the treebank, it generalises better to parsers which make different representational assumptions. Furthermore, it is easier, as Pereira and Schabes did, to map unlabelled bracketing to a format more consistent with the target grammar. Another is that the cost of annotation with unlabelled brackets is lower than that of developing a representationally richer treebank.

More recently, Riezler et al. (2002) illustrate a method to train a maximum entropy parsing model over semisupervised data. They utilise all derivations in the model consistent with the labelled bracketing of the WSJ. They first extract a partial annotation from the WSJ treebank that consists of the PoS tags and labelled brackets that are relevant for determining their GRs. In order to utilise this annotation, they map from WSJ PoS tags to the terminals of their grammar. Parsing over these PoS tags results in a number of possible derivations, as the extracted annotation is not sufficient to disambiguate the finer grained LFG derivations. They modify the standard log-linear objective function so that they include all such consistent derivations (normalising by the set of all derivations). In contrast to the approach of Pereira & Schabes (1992), this approach incorporates all derivations licensed by the grammar during normalisation, and not only those consistent with the partial annotation.

Clark & Curran (2004b), following Riezler et al. (2002), train their maximum entropy model for their CCG parser using all derivations consistent with the gold standard derivations in the CCGBank created by Hockenmaier (2003). However, in Clark & Curran (2006), they utilise partially annotated data only, where the training data consists only of supertags. They extend the work of Clark & Curran (2004b), and redefine the consistency of a derivation in terms of the corresponding dependency relations. They consider the set of dependencies that occur in at least k% of all derivations (licensed by the set of supertags in the training data) to be a gold standard dependency set. This set of dependencies is analogous to the set of high precision weighted GRs produced by RASP created using a threshold of k/100. This approach achieves impressive performance: only 1.2% worse on section 23 of the WSJ than the model trained on full annotations.

Hwa (1999) applies the IOA to (partially) labelled data following Pereira & Schabes (1992). However, these experiments focus on the level of bracketing and constituent type (labels of the brackets), over the ATIS and WSJ corpora, from which the grammar may be reliably learnt. This experimentation illustrates that higher level constituents, which constitute only a small proportion of the bracketing, are the most informative. In fact, to train the underlying grammar, the model requires only 25% (randomly selected) of the bracketing to achieve over 90% accuracy for ATIS. In contrast, if the model trains over all brackets, it achieves around 93% accuracy.
Similarly, on the WSJ, the model achieves within 3% of the fully supervised performance when it trains over 25% of (randomly selected) brackets. Results are slightly lower if Hwa infers the grammar as well as the statistical model.

Active Learning

Active learning (AL) is a framework for actively selecting samples that should be manually annotated. The primary goal of AL is to reduce the number of annotated training sentences needed to achieve the same level of parsing accuracy. A popular method for AL is selective sampling (Cohn et al., 1994), whereby the AL system selects the next set of unannotated samples (from a large unannotated corpus) to be annotated. For natural language parsing, we wish to select the set of sentences which is most informative. Approaches favour either certainty- or committee-based methods.

For certainty-based methods, an initial (optionally supervised) parser selects the set of unannotated sentences with the lowest confidence, e.g. derivation probability. For committee-based approaches, a set of parsers is used to annotate the unannotated sentences. The set of sentences that led to the highest number of parser disagreements is selected.

A number of different certainty- and committee-based approaches are explored in the literature, which we do not cover here. Instead, we refer readers to Osborne & Baldridge (2004), who provide references to research in this area and compare the performance of a number of these approaches. They illustrate that, in general, certainty-based sampling slightly outperforms committee-based sampling. Furthermore, they show that both methods clearly outperform selection methods that apply random sampling, sentence length and overall structural ambiguity (number of derivations).

Bootstrapping

Bootstrapping significantly increases the quantity of training data, where additional training data is automatically annotated using an initial statistical model. These methods combine a (small) seed set of manually labelled data with a (large) set of unlabelled bootstrap data which is automatically labelled. If the seed and bootstrap data are from different domains, then we consider this a domain adaptation method. If the seed data is also unannotated, then the approach is unsupervised. These methods generally require manual effort to create the initial seed data only, unless an existing treebank is utilised to train the initial model. In either case, such methods are considered semisupervised due to the annotation provided in the seed data set. There are a number of variants to the general bootstrapping framework. For example, the way in which the data sets are combined may differ.

Self-training

In self-training approaches, we train an initial parsing model over the seed data. This parser then labels the bootstrap data set; that is, the parser labels its own training data. Charniak (1997) is the first to apply this method, though he did not use this terminology. In general, the top-ranked derivation for each sentence in the bootstrap data is added to the training data as though the parser's annotation is correct. That is, counts are simply merged between the seed and labelled bootstrap data sets.

Initial research on these methods illustrates that either modest performance gains or significant decreases occur in parsing and PoS tagging models. Charniak (1997) reports minor gains if he retrains a parser over 30 million words of unsupervised WSJ text. That is, he uses an initial 'basic' model to annotate these sentences. Similarly, Clark et al. (2003) illustrate that self-training with PoS taggers results in unimpressive performance, regardless of the size of the initial seed data set.

This framework incorporates any number of iterations. If we use more than one iteration, then we repeatedly update the initial model by labelling either the entire bootstrap data or only the next portion of this data. Furthermore, we can set the size of each portion depending on the number of iterations to perform. Steedman et al. (2003b) illustrate modest improvement and a steady decline in performance when performing self-training for each of two different parsers, respectively. Each parser is initialised with the first 500 sentences from the standard WSJ training sections. Using self-training iterations, each iteration incorporates 30 sentences, of which the 20 with the highest top-derivation probabilities are included. Therefore, this self-training method utilises some selective sampling within each iteration.

Recently, successful instances of self-training have been published in the literature. Bacchiani et al. (2006) report the first successful self-training experiments. Out-of-domain labelled seed data (from the Brown Corpus) is used to train the initial model which is then applied over the in-domain, though unannotated, WSJ bootstrap data (representing an unsupervised domain adaptation task). A single training iteration is performed over the entire bootstrap data set. The set of (up to 20) candidate derivations for each sentence, each weighted by their normalised derivation probability, is included during training. This results in weighting features of the model (corresponding to each derivation) using weights analogous to the expected frequency counts for the IOA, as described previously in §4.1.2. They simply combine the counts from each corpus, though set the contribution of the in-domain bootstrap data to be 5 times that of the out-of-domain seed data (this weight was optimised on supervised data). The performance of the self-trained parser increases from 75.70% to 80.55% f-score over section 23 of the WSJ, as the number of WSJ sentences in the bootstrap data increases from 0 to 200K. Furthermore, this work shows that during supervised domain adaptation, simply merging counts from each corpus outperforms model interpolation. That is, it outperforms interpolation between different parsing models where each model is trained over a separate corpus. Consequently, they utilised count-merging during the unsupervised domain adaptation.

Similarly, McClosky et al. (2006) report successful self-training experiments over a discriminative reranking parser. Here, an initial generative parser outputs the 50-best list, which is reranked by a discriminative maximum entropy model. Following Bacchiani et al. (2006), counts from each domain are simply merged, where in-domain (WSJ) data is weighted 5 times higher. However, they utilise only the top-ranked parse output by either the generative or discriminative parsing system during self-training and consider these top derivations to be 'correct'. The (baseline) reranking system achieves 91.3% f-score when trained and tested over the standard WSJ sections. They illustrate that self-training using the generative model alone, i.e. the generative parser's top parse, results in a decline in performance. However, incorporating the top-ranked derivations output by the full reranking model improves parser performance to 92.1%, currently the best reported PARSEVAL accuracy for the WSJ.
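Stripped of parser-specific detail, the iterative self-training loop described above can be sketched as follows. This is an illustrative outline only: the train and parse helpers are placeholders standing in for whichever parser is used, not any particular system's API.

def self_train(seed_data, bootstrap_pool, train, parse, iterations, batch_size, keep):
    """Generic iterative self-training with simple selective sampling.

    seed_data:       manually annotated (sentence, analysis) pairs
    bootstrap_pool:  unannotated sentences
    train(data):     returns a parsing model trained on (sentence, analysis) pairs
    parse(model, s): returns (top_analysis, probability) for sentence s
    """
    model = train(seed_data)
    labelled = list(seed_data)
    for _ in range(iterations):
        # label the next portion of the bootstrap data with the current model
        batch, bootstrap_pool = bootstrap_pool[:batch_size], bootstrap_pool[batch_size:]
        analysed = [(s,) + parse(model, s) for s in batch]
        # selective sampling: keep only the most confident self-annotations
        analysed.sort(key=lambda item: item[2], reverse=True)
        labelled.extend((s, a) for s, a, _ in analysed[:keep])
        # count-merging here amounts to retraining over the combined data
        model = train(labelled)
    return model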

Co-training

Co-training is another variant of bootstrapping in which parsers annotate each other's bootstrap training data iteratively. Generally, co-training outperforms self-training methods, even when one parser initially achieves much lower accuracy than another. This method requires that the annotations from both parsers are (somewhat) compatible.

Co-training has been used in a number of natural language processing applications including word-sense disambiguation, web-page classification and named-entity identification. Sarkar (2001) performs the first application of co-training to parsing, using two components of a single lexicalised tree adjoining grammar (LTAG) parser: the supertagger and the parser. A PoS tag dictionary is constructed over the labelled and unlabelled training data which is used by both the supertagger and the parser. Initial models for the supertagger and parser are created by training over a small initial data set of around 10K (seed) sentences from sections 2-6 of the WSJ. Sarkar co-trains over around 30K unannotated (bootstrap) sentences extracted from sections 7-21, and evaluates the models over the de facto WSJ section 23. The co-trained parsing model achieves 80.02% and 79.64% precision and recall, respectively, while the initial parsing model (trained over the seed data) achieves only 72.23% and 69.13%, respectively. Therefore, co-training is able to significantly improve the initial model's performance. These results are not as significant as those in previous applications, though parsing is arguably harder than those tasks, which involve a smaller and simpler set of labels and relatively small parameter spaces.

Steedman et al. (2003b) extend the work of Sarkar (2001) by co-training using separate parsers: the (same) LTAG parser and Model 2 of Collins (1999). They effectively combine the co-training framework with the AL selective sampling method, selecting 20 sentences of 30 during each iteration as previously described. Their results show that co-training outperforms self-training significantly, and is most beneficial when the labelled seed data sets are smaller (500 compared to 1K sentences). They note, though, that the performance of co-training from 500 seed sentences and an additional 2K of co-labelled bootstrap sentences never achieves the accuracy of the initial model trained over 1K seed sentences (76.9% compared to 78.6%).

Furthermore, performance achieved by co-training over the Brown Corpus is around 2% lower (76.8% vs. 79.0%) than that achieved over the WSJ. This performance drop is in agreement with that noted by Gildea (2001). However, if they seed the data with an additional 100 sentences from the WSJ, then the resulting co-trained performance increases to 78.2%, close to that achieved by seeding the model with in-domain data only. Therefore, these experiments help to illustrate that limited in-domain data can help parser accuracy. Steedman et al. (2003a) continue from Steedman et al. (2003b), and investigate alternative scoring functions within the selective sampling methods in each co-training iteration.

Corrected Co-training

The training frameworks discussed thus far can be combined to produce a wide range of training architectures. For instance, Pierce & Cardie (2001) propose a semisupervised variant of co-training, which Steedman et al. (2003a) apply, termed corrected co-training. This model attempts to combine the strengths of co-training and AL.
During each co-training iteration, selected samples are manually verified and corrected if required. Steedman et al. (2003a) compare the performance of co-training and corrected co-training, where they apply the same sampling method in both cases. That is, selective sampling determines the next set of bootstrap sentences to annotate, and these sentences are either annotated manually or automatically by the (other) parser. As expected, corrected co-training significantly reduces the number of bootstrap training sentences required to achieve the same level of parser accuracy, though corrected co-training requires additional manual effort.

5.3 Extant Parser Training and Resources

In this section we review the available training corpora in §5.3.1, and extant parser training in §5.3.2, which we described in full within previous chapters. We define derivation consistency over a fully supervised corpus as applied by the extant parser, and over a semisupervised (unlabelled partially-bracketed) corpus as described by Pereira & Schabes (1992). Finally, we review the evaluation methods in §5.3.3, and following, describe the baseline system in this work and its performance in §5.3.4.

5.3.1 Corpora

Corpus Format

In §1.3.1, we described a treebank T which consists of a set of training instances. Each training instance t is a pair (s,M), where s is the automatically preprocessed text (tokenised and labelled with PoS tags; see §3.2.2) and M is either a fully annotated derivation, A, or an unlabelled bracketing U. We extend this definition to include unsupervised corpora in which M is null. In this section we define the set of training corpora, and following, derivation consistency for each treebank type where either A or U is paired with each sentence in a corpus. The bracketing in U may be partial in the sense that it may be consistent with more than one derivation produced by a given parser. For unsupervised training, all derivations are considered 'consistent' given that no annotation is provided. Note that the sentence s which is parsed during training consists entirely of automatically preprocessed text using the RASP pipeline in all corpora considered. That is, we do not utilise the PoS tags in either Susanne or the WSJ.

Training Corpora

We previously described fully annotated and bracketed corpora for Susanne and the WSJ in §1.3.1. We utilise these corpora in the experimentation described in this chapter, and also the combination of both bracketed Susanne and WSJ corpora. We use the sentence-delimited preprocessed text from Susanne during unsupervised training. We refer to the fully annotated corpus created from a subset of Susanne, the extant training data, as B, as this represents the baseline training corpus. B consists of 4801 training instances in the format (s,A). The bracketed corpora extracted from Susanne and the WSJ are referred to as S and W. These corpora consist of 7014 and 38,329 training instances, respectively, in the format (s,U). The concatenated file containing both Susanne and WSJ bracketed training instances is referred to as SW. We refer to the unsupervised variant of the S corpus as Su. However, in practice, we process the S file and ignore the bracketing paired with each sentence s.

Annotated Derivation Consistency

Given t = (s,A), there exists a single derivation in the parse forest that is compatible (correct).
In this case, equality between the derivation tree and the treebank annotation A identifies the single correct (consistent) derivation.

Bracketed Derivation Consistency

Following Pereira & Schabes (1992), given t = (s,U), a node's span in the parse forest is valid if it does not overlap with any span defined in U. A derivation is considered consistent if the span of every node in the derivation is valid in U. That is, if no crossing brackets are present in the derivation. Each valid node in the parse forest is considered consistent if one or more possible daughter node sets are also consistent and if it is a member of a consistent derivation.5.1

5.1 When we walk the parse forest during the EM training methods we collect consistent nodes' actions and the corresponding IO probability. We walk the parse forest along valid paths, so that we only consider the set of nodes in the parse forest that appear in one or more consistent derivations.
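This consistency test reduces to a crossing-brackets check over token spans. The sketch below is purely illustrative (the span representation and function names are assumptions, not part of the RASP implementation): spans are (start, end) token offsets, and a derivation is consistent with U if none of its node spans cross a span of the unlabelled bracketing.

def spans_cross(a, b):
    """True if spans a and b overlap without one containing the other."""
    (a1, a2), (b1, b2) = a, b
    return a1 < b1 < a2 < b2 or b1 < a1 < b2 < a2

def consistent(derivation_spans, bracketing_spans):
    """A derivation is consistent with U if no node span crosses a span in U."""
    return not any(spans_cross(d, u)
                   for d in derivation_spans
                   for u in bracketing_spans)

# 'his petition charge mental cruelty': U = ((his petition) charge (mental cruelty))
U = [(0, 2), (3, 5), (0, 5)]
good = [(0, 2), (3, 4), (3, 5), (2, 5), (0, 5)]   # node spans of the derivation in Figure 5.1
bad = [(1, 3), (0, 5)]                            # (1, 3) crosses (0, 2): not consistent
assert consistent(good, U) and not consistent(bad, U)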

For example, Figure 1.3 illustrates an example bracketed lemmatised training instance from Susanne. The (single) derivation output by the extant parser for this sentence is shown in Figure 5.1. The corresponding constituent boundaries, (((his (petition)) (charge (((mental)) (cruelty))))), are consistent with the unlabelled bracketing extracted from Susanne shown in Figure 1.3: ((his petition) charge (mental cruelty)).

(T/txt-sc1/-+
 (S/np_vp
  (NP/det_n1 his_APP$ (N1/n petition_NN1))
  (V1/v_n1 charge_VVN
   (N1/ap_n1/- (AP/a1 (A1/a mental_JJ)) (N1/n cruelty_NN1))))
 (End-punct3/- ._.))

Figure 5.1: Example RASP output for a sentence from Susanne.

Given the set of consistent nodes in the parse forest, or the set of consistent derivations within the n-best list, we extract the corresponding action histories and estimate action probabilities, as described in §2.6.2. We either extract the corresponding action for each consistent node in the parse forest (see §4.1.3), or alternatively, extract the actions for each node in each consistent derivation, where actions may occur in more than one derivation. In this way, partial bracketing is used to constrain the set of derivations (and thus, corresponding LR parse actions) considered in training to those that are compatible with this bracketing.

5.3.2 Extant Parser Training

In this section we briefly review theory regarding supervised training for parametric models, and that of the extant parser, described fully in previous chapters.

Training Parametric Models

As described previously, a supervised corpus (or treebank) T consists of a set of training instances where each training instance t_i ∈ T is a pair (s,A). The parse forest Y is created by parsing over s, representing the set of derivations y_ij ∈ Y. A single derivation in the parse forest y_c ∈ Y is equal to the derivation A, and therefore, is considered the 'correct' derivation. The set of features for each correct derivation y_c is extracted for each training instance t_i ∈ T. The features' frequencies are used to determine the MLE for the model, often using relative frequency estimates over sets of competing features (for example, the set of rule rewrites for a given NT category of a PCFG). For an LR parser, the features are parsing actions and we normalise the frequency of these actions based on the total frequency across each set of competing actions in the LR table.

LR Parser Training

Estimating action probabilities in the LR table (see §2.5.2) consists of (a) recording an action history for the correct derivation y_c, for each training instance t_i ∈ T, (b) computing the frequency of each action over all action histories and (c) normalising these frequencies to determine probability distributions over conflicting actions.

For supervised training, the weight of each action in the first step is considered to be 1. That is, we consider step (b) a weighted frequency sum over actions where each action witnessed has a weight of 1. Therefore, to determine the frequency of action a_d in the LR table, where the function HIST returns the set of LR parsing actions used to create the given derivation, we apply the following equation, where δ(x,y) returns 1 if x equals y and 0 otherwise:

freq(a_d) = \sum_{t_i \in T} \sum_{a \in HIST(y_c)} \delta(a, a_d)    (5.1)
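As a concrete illustration, the sketch below accumulates the counts of Equation 5.1 and then normalises over competing actions. It is a simplification under stated assumptions: HIST is taken to return the list of LR actions of a derivation, and the normalisation shown is plain relative frequency with add-one smoothing, only loosely standing in for the Inui et al. (1997) scheme used by the extant parser.

from collections import Counter

def supervised_action_counts(treebank, hist):
    """Equation 5.1: each action in the correct derivation's history counts 1."""
    freq = Counter()
    for sentence, correct_derivation in treebank:
        for action in hist(correct_derivation):
            freq[action] += 1
    return freq

def normalise(freq, conflict_sets):
    """Distribute probability mass over each set of conflicting LR actions,
    with Laplace estimation so no action receives zero probability."""
    probs = {}
    for actions in conflict_sets:
        total = sum(freq[action] + 1 for action in actions)
        for action in actions:
            probs[action] = (freq[action] + 1) / total
    return probs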

Extant Parser Training

The extant parser (our model trained using the extant supervised method described in §2.6.2) is considered the baseline system in this work. We determine action counts by training on B. That is, we apply Equation 5.1 over the action histories for each correct derivation in this supervised corpus. We normalise over these action counts using the normalisation method defined by Inui et al. (1997), as discussed in §2.5.2. In addition, we apply Laplace estimation within this normalisation method in the extant parser, to ensure all actions in the table are assigned a non-zero probability (the IL function). In view of these definitions, we consider the baseline system in this work to be the parsing system IL(B).

5.3.3 Evaluation

We employ DepBank (see §1.3.1) as test data in subsequent experiments to compare parser performance. Further, we utilise the Wilcoxon test for statistical significance (see §1.3.2) and provide z-value probabilities to compare parsing systems, predominantly against the baseline system.

5.3.4 Baseline

The micro-averaged F1 score for the baseline system over DepBank is 75.61%, which (over similar sets of relational dependencies) is broadly comparable to recent evaluation results published by Kaplan et al. (2004) with their state-of-the-art parsing system (Briscoe & Carroll, 2006).

5.4 Confidence-based Training Approaches

In this section we describe confidence-based training. The general framework of this approach, and its relationship to previous approaches, is defined in §5.4.1. Following, in §5.4.2, we describe a number of confidence measures, which we apply in the experimentation discussed in §5.5. Finally, we define a self-training variant of the framework in §5.4.3 which is also applied in the experimentation to follow.

5.4.1 Framework

The general confidence-based training framework is described in this section, and is closely related to the self-training frameworks discussed. We describe in subsequent sections the initial parsing model, which is trained over seed data and then applied to annotate the bootstrap data. The differentiating factor of our framework is the incorporation of alternative confidence measures which depend on the initial parsing model used.

We also extend the self-training framework to include varying levels of annotation in both the seed and bootstrap data sets. As a result, we perform unsupervised or semisupervised training, where the level and type of annotation may vary. Given any type of annotation, we require only a definition of derivation consistency, as we define our models over the set of consistent derivations for each sentence in the bootstrap data set.

Note that the confidence-based framework can also be extended to include co-training techniques, and therefore, applied to a wide range of parsing models and training methods. We describe a single pass over the entire bootstrap data only. However, we could extend the framework so that the initial model is repeatedly updated. This extension is analogous to the iterative self-training methods described.

Initial Parsing Model

The first stage in our framework simply involves training the statistical parser over the initial seed data. While in self-training the seed data is generally a small supervised corpus, any initial parsing model can be considered. Therefore, we can train the initial model over an arbitrarily large seed data set (including null) using semisupervised or unsupervised training techniques.

Annotating Derivations

The self-training methods previously described utilise an initial parsing model to annotate remaining unlabelled data. That is, they parse the unlabelled data with the initial parser and take the top-ranked parse produced for each sentence as additional training data. Therefore, these methods effectively consider the annotation of this top-ranked parse as correct. The final parsing model is created by training over both the seed corpus and the annotated bootstrap corpus in a fully supervised fashion. In contrast, our models consider all derivations produced by the initial model, and weight the contribution of the corresponding features of each derivation based on the initial model's confidence in that derivation.

If we apply the confidence training framework over unsupervised seed and bootstrap data, then this method is related to the work of Bod (2006), which was published after the work in this chapter was complete. Bod applies an unsupervised variant of 'self-training' with the 'DOP' parser. The parser is trained with simple frequency estimates over the top 100 derivations ('U-DOP'), which performs 2% worse than a supervised PCFG when both are trained over the WSJ. However, while all derivations are considered in these frequency estimates, each of the derivations is weighted equally. We include all derivations in the confidence-based methods, though we weight them according to the initial model's confidence in the derivations returned.

If the bootstrap data consists of semisupervised (that is, partially annotated) data, then we follow previous work (e.g. Pereira & Schabes 1992) and restrict the set of derivations we include to those that are consistent with this annotation. Effectively, we apply any partial annotation as a means to 'filter' out the derivations in the n-best list that are incompatible with this annotation. This increases the quality of the derivations considered during training. Thus, the higher the level of annotation in the bootstrap data, the better our methods are expected to perform.

Estimating Action Probabilities

The corresponding features of all derivations within the n-best list (optionally, only those consistent with annotation, if available) are utilised during training. However, we weight the features of the model based on the corresponding derivation confidence of the initial parsing model used to derive the n-best list. A variant of Equation 5.1 is applied over the set of consistent derivations, where we instead apply a weighted sum. The weight is determined using the function c over the derivations in which the feature appears.
We determine the frequency of each action in the LR parse table using the weighted frequency of each occurrence of the action across all t_i ∈ T in each consistent derivation y_i ∈ Y:

freq(a_d) = \sum_{t_i \in T} \sum_{y_i : a_d \in HIST(y_i)} c(y_i)    (5.2)

The frequency of each feature determined over the bootstrap data is combined directly with the frequency of each feature over the seed data, i.e. using count-merging.
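A sketch of this weighted variant and of count-merging follows; the helper names are illustrative only, and the confidence function c is whichever of the measures defined in §5.4.2 is in use.

from collections import Counter

def weighted_action_counts(bootstrap_data, hist, c):
    """Equation 5.2: each consistent derivation contributes its actions,
    weighted by the initial model's confidence c(y) in that derivation."""
    freq = Counter()
    for sentence, consistent_derivations in bootstrap_data:
        for y in consistent_derivations:
            weight = c(y)
            for action in hist(y):
                freq[action] += weight
    return freq

def merge_counts(seed_freq, bootstrap_freq, bootstrap_weight=1.0):
    """Count-merging: add the (optionally re-weighted) bootstrap counts to
    the seed counts before normalisation."""
    merged = Counter(seed_freq)
    for action, count in bootstrap_freq.items():
        merged[action] += bootstrap_weight * count
    return merged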

Relationship to the IOA (EM)

We described the extension of the unsupervised IOA to LR parsers in §4.1.3: the IOALR. The frequency Equation 4.1.3 can be considered in terms of the confidence measure function c, applied to nodes n_j of the parse forest as follows:

freq(a_d) = \sum_{t_i \in T} \sum_{n_j : a[j] = a_d} c(n_j)    (5.3)

c(n_j) = \frac{1}{P_{t_i}} \, e(n_j) \, f(n_j)

Equation 5.3 is related to the confidence-based frequency Equation 5.2, in which we sum over derivations rather than nodes of the parse forest. Recall that the normalised IO probability for a node represents the summation of probabilities for all derivations in which the node occurs, normalised by the sum of all derivation probabilities. Consequently, if we unpack all possible derivations represented in the parse forest, weighted by their normalised probabilities, then the resulting weighted frequency counts for each unique node in these derivations are equal to the corresponding parse forest node's normalised IO probability. If we utilise a function c in Equation 5.2 that returns the derivation's normalised probability (as in Bacchiani et al. 2006), we effectively perform one iteration of the IOALR, though over the set of n-best derivations rather than the entire derivation space. Thus the confidence-based methods require only a small proportion (equivalent to a single iteration) of the processing overhead required to train using the IOALR.
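To make the correspondence concrete, a node-level sketch of this weight follows. It assumes the inside score e, outside score f and the sentence's total derivation probability P_t have already been computed over the packed forest; nothing here reflects the actual RASP data structures.

def node_confidence(e, f, p_t, node):
    """c(n_j) = e(n_j) * f(n_j) / P_t: the probability mass of all derivations
    passing through the node, as a fraction of the total derivation probability."""
    return e[node] * f[node] / p_t

def io_action_counts(forest_nodes, action_of, e, f, p_t):
    """One EM-style counting pass over a single parse forest (Equation 5.3)."""
    freq = {}
    for node in forest_nodes:
        action = action_of(node)
        freq[action] = freq.get(action, 0.0) + node_confidence(e, f, p_t, node)
    return freq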

Relationship to Previous Work

In summary, the framework defined is related to, though differentiated from, previous work described as follows:

• Self-training: while we utilise an initial model to annotate further training data, we consider unsupervised and semisupervised training for the initial parsing model. Moreover, we include the set of all derivations (consistent with the annotation, if available) of the bootstrap data rather than only the highest ranked derivation.

• Partially-annotated data: we constrain the set of derivations considered to those that are consistent with the partial annotation. However, we construct and rank this derivation set using an initial parsing model trained over a separate seed data set. Previous approaches that have constrained the set of derivations use only a uniform parsing model to construct and rank the derivations. That is, they include counts from the (single) bootstrap data only to create the resulting parsing model.

• The work of Bacchiani et al. (2006), regarding unsupervised self-training, is closely related to our framework, though it was published after the work in this chapter was complete. A feature's weight, i.e. the function c, is the corresponding derivation's normalised probability. This weighting effectively determines the frequency of features over the bootstrap data by applying one iteration of unsupervised IOA, considering the 20-best derivations only, instead of the entire parse forest.

In effect, we have combined the successful semisupervised training approaches that (i) constrain the set of derivations considered to those consistent with partial annotation (e.g. Clark & Curran 2006; Pereira & Schabes 1992; Riezler et al. 2002), with (ii) self-training approaches (e.g. Bacchiani et al. 2006; McClosky et al. 2006). However, we generalise the framework to include unsupervised training and apply confidence-based weighting within the frequency counts for features.

5.4.2 Confidence Measures

During confidence-based training, we select more than one derivation, placing an appropriate weight on the corresponding action histories based on the initial model's confidence in the derivation. In this section we define three such confidence measures for consistent derivations (i.e. the values returned by the function c in Equation 5.2). In subsequent experimentation we contrast the performance of each against EM, over both unsupervised and semisupervised corpora.

We weight transitions corresponding to each derivation ranked r with probability p in the set of size n using either 1/n, 1/r or p itself to weight counts. These methods all perform normalisation over the resulting action histories using the training function IL (defined in §5.3.2) and are referred to as Cn, Cr and Cp, respectively. These functions take two arguments: an initial model and the bootstrap data to train over. As we improve the accuracy of the initial model, and decrease the size of the n-best list in response, the accuracy of the resulting parsing model is expected to increase across all of these measures. We discuss each measure, in turn, in the following sections.

Uniform Measure: Cn

Cn is a ‘uniform’ model which weights each action count only by the degree of derivation ambiguity and makes no explicit use of ranking information. In effect, the initial parser only acts to provide the n-best derivations, where the likelihood of a correct parse being in this set increases as the accuracy of the initial model increases. While we weight in a uniform manner, both the initial parsing model and the number of derivations considered (the size of the n-best list) affect the accuracy of the resulting trained parser. This method is similar to that of Riezler et al. (2002), where all consistent derivations are included in the log-linear objective function. However, we normalise using consistent derivations only.

Ranking Measure: Cr

Cr weights each action count using the corresponding derivation’s rank. This measure is based on the intuition that features that consistently occur in highly ranked derivations are more likely to be correct, and hence, should be assigned a greater proportion of the probability mass.

Probability Measure: Cp

Cp weights each action count using the derivation's probability. This measure places the greatest level of trust in the initial model's statistical component. Given a 'perfect' initial parsing model, Cp is expected to outperform all other possible measures, as it assigns probability mass to correct syntactic subanalyses only. Cp is simpler than, and different to, one iteration of the IOALR, as we use inside probabilities only and, furthermore, do not normalise based on the sum of all derivation probabilities. In the case that we consider an n-best list of size 1, this method is similar to Viterbi training for HMMs, where the top-ranked path's probability is used in the weighted frequency sum for the corresponding edges in the path.
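For concreteness, the three measures amount to the following small weight functions over a ranked list of consistent derivations; the (rank, probability) pair representation is an assumption made for this sketch only.

def c_n(derivations):
    """Cn: uniform weight 1/n over the n consistent derivations."""
    n = len(derivations)
    return [1.0 / n] * n

def c_r(derivations):
    """Cr: weight 1/r for the derivation ranked r (1-based)."""
    return [1.0 / rank for rank, _ in derivations]

def c_p(derivations):
    """Cp: the derivation's probability under the initial model."""
    return [prob for _, prob in derivations]

# three consistent derivations, ranked by the initial model
derivs = [(1, 0.5), (2, 0.3), (3, 0.2)]
print(c_n(derivs))   # [0.333.., 0.333.., 0.333..]
print(c_r(derivs))   # [1.0, 0.5, 0.333..]
print(c_p(derivs))   # [0.5, 0.3, 0.2]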

5.4.3 Self-training

In the experimentation to follow, we also perform a variant of self-training within the confidence training framework. That is, we assign a weight of 1 to each action corresponding to the top-ranked parse output by the initial model over the bootstrap training corpus. We refer to this training method as C1. This method is closely related to the self-training methods employed in the literature which we discussed previously, though we first 'filter' the set of n-best derivations. That is, we consider only those derivations that are consistent with the partial annotation of the sentence.

5.5 Experimentation

As we utilise an initial model to annotate additional training data, our methods are closely related to self-training methods. However, in the experimentation discussed in this section, we train entirely from either unannotated or unlabelled partially-bracketed data using the confidence training framework described in the previous section. Therefore, these methods are best described as unsupervised or semisupervised, respectively. We expect the extant parser trained over a fully supervised corpus to outperform one trained over the same corpus with less detailed annotation. However, both EM (IOALR) and the confidence-based methods are trained over larger semisupervised (and unsupervised) corpora, providing greater potential for these methods to outperform the extant parser.

In §5.5.1 we contrast these models over semisupervised corpora consisting entirely of unlabelled partially-bracketed data. We utilise such data extracted from Susanne and the WSJ, as a major focus of this work is the flexible reuse of existing treebanks to train a wider variety of statistical parsing models. We train a different parsing model for each of the confidence measures (described in §5.4.2), and for the self-training method (described in §5.4.3). Here, the confidence-based training achieves statistically significant improvements in parser accuracy over both EM and the current supervised training method. In addition, these methods are more efficient than EM and require no manual annotation effort on behalf of the grammar writer.

Following, in §5.5.2, we compare the unsupervised variants of EM and the confidence-based models over the unsupervised Susanne corpus Su. Surprisingly, both models perform only slightly worse than the extant fully supervised method; moreover, these differences are not statistically significant (if we select the best performing EM iteration).

5.5.1 Semisupervised Training

In this section, we compare the accuracy of the current parse ranking model trained from a fully-annotated portion of Susanne with one trained from unlabelled partially-bracketed training instances derived from this treebank and from the WSJ. We demonstrate that the confidence-based semisupervised techniques outperform EM when both are constrained by partial bracketing. Both methods based on partially-bracketed training data outperform the fully supervised technique, and both can, in principle, be applied to any statistical parser whose output is consistent with such partial bracketing. We also explore tuning the model to a different domain and the effect of in-domain data in the semisupervised training processes.
Experimental Setup

We parsed all the bracketed training data using the baseline model to obtain up to 1K top-ranked derivations and found that a significant proportion of the sentences of the potential set available for training had only a single derivation consistent with their unlabelled bracketing. We refer to this set of sentences as the unambiguous training data (γ) and refer to the remaining sentences (for which more than one consistent derivation was returned) as the ambiguous training data (α). The availability of significant quantities of unambiguous training data that can be found automatically suggests that we may be able to dispense with the costly reannotation step required to generate the fully supervised training corpus, B.

Table 5.1 illustrates the split of the corpora into mutually exclusive sets γ, α, 'no match' and 'timeout'. The latter two sets are not utilised during training and refer to sentences for which all derivations were inconsistent with the bracketing and those for which no derivations were found due to time and memory limitations (self-imposed) on the system, respectively.5.2 As our grammar is different from that implicit in the WSJ, there is a high proportion of sentences where no derivations were consistent with the unmodified PTB bracketing. However, a preliminary investigation of the 'no match' data did not yield any clear patterns of inconsistency that we could quickly ameliorate by simple modifications of the PTB bracketing. We leave for the future a more extensive investigation of these cases which, in principle, would allow us to make more use of this training data. An alternative approach that we have also explored is to utilise a similar confidence-based training approach with data partially-annotated for grammatical relations (Watson & Briscoe, 2007).

5.2 As there are time and memory restrictions during parsing, the SW results are not equal to the sum of those from the S and W analyses.

Corpus   |γ|    |α|     No Match   Timeout
S        1097   4138    1322       191
W        6334   15152   15749      1094
SW       7409   19248   16946      1475

Table 5.1: Corpus split for S, W and SW.
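The split in Table 5.1 can be computed directly from the number of consistent derivations per sentence, once sentences that hit the time or memory limits have been set aside; a minimal sketch (with hypothetical helper names) follows.

def split_corpus(parsed_corpus):
    """Partition sentences by the size of their bracketing-consistent n-best set:
    exactly one derivation -> unambiguous (gamma); several -> ambiguous (alpha);
    none -> 'no match'. 'Timeout' sentences never reach this point."""
    gamma, alpha, no_match = [], [], []
    for sentence, consistent_derivations in parsed_corpus:
        if len(consistent_derivations) == 1:
            gamma.append((sentence, consistent_derivations))
        elif consistent_derivations:
            alpha.append((sentence, consistent_derivations))
        else:
            no_match.append(sentence)
    return gamma, alpha, no_match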

Confidence-based Approaches

Models derived using the unambiguous training data, γ, as the seed data alone are relatively accurate, achieving indistinguishable performance to that of the baseline system given in-domain (either W or SW) training data. We utilise these models as initial models and train over the corresponding ambiguous data sets (considered the bootstrap data) for each corpus with each of the confidence-based models. We consider the top 1K derivations output by each of the initial models over the bootstrap data, and then remove derivations from this n-best list that are inconsistent with the corresponding unlabelled bracketing. The initial models are novel, being the first to consider the use of such unambiguous data only. These models were based on the intuition that, as only a single derivation is consistent with the annotation, we can assume that the partial annotation is sufficient to disambiguate the finer grained derivations of the extant parser. This data can be considered analogous to the seed data set used during the self-training methods in the literature, where a supervised (high confidence) data set is used to train the initial model.

Table 5.2 gives results for all confidence-based models. Results statistically significant compared to the baseline system are shown in bold print (increase) or italic print (decrease). These methods show promise, often yielding systems whose performance is significantly better than the baseline system. Method Cr achieved the best performance in this experiment and remained consistently better across different corpora. Throughout the different approaches a domain effect can be seen: models utilising just S are worse, although the best performing models benefit from the use of both S and W as training data (i.e. SW). These results are consistent with those found by Gildea (2001). The 'self-training' C1 method performs the worst of all retrained models, resulting in a decline in performance over the initial model for both W and SW. The increase in performance for this model over S may reflect the benefit of using a balanced corpus during training, though it may be due to the relatively small size of the seed data considered, where there is less chance for bias in the model to be reflected in the reannotated data.

While many statistical parsers extract the grammar in parallel with the corresponding statistical parse selection model, our results demonstrate that existing treebanks can be utilised to train parsers that deploy grammars that make other representational assumptions. As a result, our methods can be applied by a range of parsers to minimise the manual effort required to train a parser or adapt to a new domain.

Model                    Prec   Rec    F1     P(z)‡
Baseline                 77.05  74.22  75.61  -
IL(γ(S))                 76.02  73.40  74.69  0.0294
C1(IL(γ(S)),α(S))        77.05  74.22  75.61  0.4960
Cn(IL(γ(S)),α(S))        77.51  74.80  76.13  0.0655
Cr(IL(γ(S)),α(S))        77.73  74.98  76.33  0.0154
Cp(IL(γ(S)),α(S))        76.45  73.91  75.16  0.2090
IL(γ(W))                 77.01  74.31  75.64  0.1038
C1(IL(γ(W)),α(W))        76.90  74.23  75.55  0.2546
Cn(IL(γ(W)),α(W))        77.85  75.07  76.43  0.0017
Cr(IL(γ(W)),α(W))        77.88  75.04  76.43  0.0011
Cp(IL(γ(W)),α(W))        77.40  74.75  76.05  0.1335
IL(γ(SW))                77.09  74.35  75.70  0.1003
C1(IL(γ(SW)),α(SW))      76.86  74.21  75.51  0.2483
Cn(IL(γ(SW)),α(SW))      77.88  75.05  76.44  0.0048
Cr(IL(γ(SW)),α(SW))      78.01  75.13  76.54  0.0007
Cp(IL(γ(SW)),α(SW))      77.54  74.95  76.23  0.0618

Table 5.2: Performance of confidence-based training models on DepBank. ‡ represents the statistical significance of the system against the baseline model.

Comparing EM and Confidence-based Approaches

As previously noted, we consider the IOALR a variant of EM following Prescher (2001). In order to further extend the IOALR to apply over semisupervised corpora, following Pereira & Schabes (1992), we simply have to extend Equation 5.3 (i.e. Equation 4.1.3) so that only consistent nodes are included in this summation. We employ the function τ over nodes, which returns 1 if a node is consistent (see §5.3.1) and 0 if not.

freq(a_d) = \sum_{t \in T} \sum_{n_j : a[j] = a_d} \tau(n_j) \, c(n_j)

We perform EM starting from two initial models: either a uniform probability model, IL(), or from models derived from the unambiguous training data, γ. We create the uniform model by distributing the probability mass equally between competing actions in the LR table. This model achieves 69.92%, which is fairly good given that no training data has been incorporated into the statistical model, indicating the expressive power of the underlying manually written grammar. We denote the EM models using the function EM. This function accepts the same input as the confidence-based functions, that is, an initial model and the bootstrap data to train over. We utilise the cross entropy estimate described in Pereira & Schabes (1992), where the cross entropy H over a given corpus C and grammar G is based on P_t, the total probability of all derivations for each sentence t ∈ C:

H(C,G) = - \frac{\sum_{t \in C} \log(P_t)}{\sum_{t \in C} |t|}

Figure 5.2 shows the cross entropy decreasing monotonically from iteration 2 (as guaranteed by the EM method) for different corpora and initial models. Some models show an initial increase in cross-entropy from iteration 1 to iteration 2, because the models are initialised from a subset of the data which is used to perform maximisation. Cross-entropy increases, by definition, as we incorporate ambiguous data with more than one consistent derivation (i.e. increasing the ratio of log(P_t) to |t|).
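A minimal sketch of this estimate, assuming natural logarithms and that each sentence's total derivation probability P_t and length |t| are available:

import math

def cross_entropy(sentence_probs, sentence_lengths):
    """H(C, G) = -(sum over t of log P_t) / (sum over t of |t|)."""
    return -sum(math.log(p) for p in sentence_probs) / sum(sentence_lengths)

# e.g. three sentences with total derivation probabilities and token counts
print(cross_entropy([1e-12, 1e-20, 1e-8], [10, 18, 7]))   # about 2.63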

[Figure 5.2 plots cross entropy H(C,G) against iteration number for EM(IL(),S), EM(IL(γ(S)),S), EM(IL(),W), EM(IL(γ(W)),W), EM(IL(),SW) and EM(IL(γ(SW)),SW).]

Figure 5.2: Cross entropy convergence for semisupervised EM over S, W and SW.

Performance over DepBank can be seen in Figures 5.3, 5.4, and 5.5 for each corpus S, W and SW, respectively. Comparing the accuracy of Cr and EM in each of Figures 5.3, 5.4, and 5.5, it is evident that Cr outperforms EM across all data sets, regardless of the initial model applied. In most cases, these results are statistically significant, even when we manually select the best model (iteration) for EM.

The graphs of EM performance illustrate the same 'classical' and 'initial' patterns observed by Elworthy (1994) (described in §3.1.3). When EM is initialised from a relatively poor model, such as that built from S (Figure 5.3), a 'classical' pattern emerges with relatively steady improvement from iteration 1 until performance asymptotes. However, when the starting point is better (e.g. in Figures 5.4 and 5.5), the 'initial' pattern emerges. That is, the best performance is reached after a small number of iterations.

[Figure 5.3 plots F1 against iteration number for the Baseline, Cr(IL(γ(S)),α(S)), EM(IL(),S) and EM(IL(γ(S)),S) models.]

Figure 5.3: Performance over S for Cr and EM.

[Figure 5.4 plots F1 against iteration number for the Baseline, Cr(IL(γ(W)),α(W)), EM(IL(),W) and EM(IL(γ(W)),W) models.]

Figure 5.4: Performance over W for Cr and EM.

Domain Adaptation

When building NLP applications it is preferable to accurately tune a parser to a new domain with minimal manual effort. To obtain training data in a new domain, annotating a corpus with partial-bracketing information is much cheaper than full annotation. To investigate whether such data would be of value, we considered W to be the corpus over which we were tuning and applied the best performing model trained over S, Cr(IL(γ(S)),α(S)), as our initial model. That is, we consider W as in-domain data (as DepBank is extracted from section 23 of the WSJ) and we utilise a model trained over out-of-domain data to annotate this corpus. We consider this a semisupervised domain adaptation task.

[Figure 5.5 plots F1 against iteration number for the Baseline, Cr(IL(γ(SW)),α(SW)), EM(IL(),SW) and EM(IL(γ(SW)),SW) models.]

Figure 5.5: Performance over SW for Cr and EM.

Figure 5.6 illustrates the performance of Cr compared to EM. Tuning using Cr was not significantly different from the model built directly from the entire data set with Cr, achieving 76.57% as opposed to 76.54% F1. In contrast, EM performs better given in-domain data from the beginning rather than tuning to the new domain. Cr outperforms EM except for one particular EM iteration, as shown in this figure. It is worth noting the behaviour of EM given only the tuning data (W) rather than the data from both domains (SW). In this case, the graph illustrates a combination of Elworthy's 'initial' and 'classical' patterns. The steep drop in performance (to under 70% F1) after the first iteration is probably due to loss of information from S. However, this run also eventually converges to similar performance, suggesting that the information in S is effectively disregarded as it forms only a small portion of SW, and that these runs effectively converge to a local maximum over W.

Bacchiani et al. (2006) explore the effect of weighting the contribution (frequency counts) of the in-domain and out-of-domain training data sets. They demonstrate that altering this weighting can have beneficial effects. However, the SW corpus already contains a disproportionate number of in-domain W sentences (40K of 47K). Furthermore, our results suggest that the parsing models are effectively converging on the WSJ in either case.

5.5.2 Unsupervised Training

The confidence-based models were primarily developed for use over semisupervised corpora. However, we wished to consider the performance of these models given unsupervised data. In this case, we illustrate the approximate accuracy lower bound for these models (given an initial parsing model), as we rely entirely on the initial model to annotate the bootstrap data. In this section we first describe the experimental setup, and following, describe the performance of the different confidence measures over the unsupervised Susanne corpus Su available in this work. Finally, we train the same parsing model using EM over the unsupervised data, and contrast the resulting parser's performance with that achieved by the current parser and the confidence-based methods.

[Figure 5.6 plots F1 against iteration number for the Baseline, Cr(IL(γ(SW)),α(SW)), Cr(Cr(IL(γ(S)),α(S)),W), EM(IL(γ(SW)),SW), EM(Cr(IL(γ(S)),α(S)),W) and EM(Cr(IL(γ(S)),α(S)),SW) models.]

Figure 5.6: Tuning over the WSJ (W) from Susanne (S).

Experimental Setup

We consider an initial uniform probability model, i.e. IL(). We output the top 1K derivations from this initial model over the Su corpus, and include all these derivations within the confidence-based framework.

Confidence-based Approaches

Table 5.3 illustrates the performance of the unsupervised models, where, surprisingly, the Cr model achieves performance that is lower than, though statistically indistinguishable from, that of the current, fully supervised, parsing model. The unsupervised corpus contains around 7K sentences while the supervised corpus contains 5K. The additional training data, combined with the relatively good performance of the uniform model, is able to achieve fairly impressive results. This model significantly outperforms the alternative Cn and Cp confidence measures, indicating that this method is fairly robust to the choice of initial parsing model. These results, combined with those discussed previously, indicate that Cr performs well over corpora with varying levels of annotation.

The results of self-training, i.e. C1, show an increase in performance over the initial model. We previously hypothesised that the marginal increase for this model over S is due to the lack of inherent bias within the initial model, as the unambiguous seed data γ(S) was a relatively small data set. This theory is supported by these results, in which the initial uniform model contains no learnt bias. Any learnt (incorrect) linguistic bias appears to be compounded in the resulting self-trained model, though this is expected as these biases are selected for when self-annotating the data. The self-training model also outperforms the Cp model over this data. However, the probability-based confidence measure is expected to perform poorly if the initial model's probability model is unable to assign probability mass to correct constituents only. In these experiments, the initial uniform model we employ does not reflect any preference between competing parse actions.

Model          Prec   Rec    F1     P(z)‡
Baseline       77.05  74.22  75.61  -
IL()           70.98  68.90  69.92  0.0000
C1(IL(),Su)    74.94  72.50  73.70  0.0000
Cn(IL(),Su)    75.74  73.25  74.48  0.0024
Cr(IL(),Su)    76.28  73.81  75.02  0.1170
Cp(IL(),Su)    72.25  70.23  71.23  0.0000

Table 5.3: Performance of all unsupervised confidence-based models on DepBank. ‡ represents the statistical significance of the system against the baseline model.
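For concreteness, the sketch below illustrates how a rank-based confidence measure of the kind used for Cr in Table 5.3 can weight each derivation’s contribution to the LR parse action frequency counts. It is a minimal illustration in Python, not the parser’s actual implementation: parse_nbest, is_consistent and the actions attribute are hypothetical stand-ins, and weight = 1/rank is one plausible instantiation of a measure based on the inverse of the derivation’s rank. In the fully unsupervised setting every derivation in the n-best list counts as consistent.

    from collections import defaultdict

    def confidence_weighted_counts(sentences, parse_nbest, is_consistent, n=1000):
        """Accumulate confidence-weighted LR parse action counts (a sketch only)."""
        counts = defaultdict(float)
        for sentence in sentences:
            # parse_nbest is assumed to return derivations in rank order
            derivations = [d for d in parse_nbest(sentence, n) if is_consistent(d)]
            for rank, derivation in enumerate(derivations, start=1):
                weight = 1.0 / rank                # rank-based confidence (Cr-style)
                for action in derivation.actions:  # LR parse actions used
                    counts[action] += weight
        return counts

A normalised variant of the ranking measure, based on the number of consistent derivations per sentence, would make each sentence contribute equally to these counts; this option is discussed in §5.6.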

Comparison to EM

We perform unsupervised EM over Su using the IOALR, as described in §4.1.3; its cross-entropy is shown in Figure 5.7. Figure 5.8 illustrates the performance of the unsupervised Cr and EM over Su. The ‘initial’ pattern of Elworthy (1994) emerges for EM, where the best performing iterations are those from 2 to 4, and EM outperforms the confidence-based method for these iterations. These iterations of EM, and the confidence-based method Cr, achieve lower performance than the current model, though the differences are not significant. However, if we were unable to manually select the best model (iteration) for EM, then the resulting EM model (the model from the converged fifth iteration) would perform significantly worse (z-value of 0.0040) than the current model. Therefore the Cr training method is preferable, even in an unsupervised training task.
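The convergence behaviour can be pictured with a short sketch. Here estep and mstep are hypothetical stand-ins for the inside-outside computations of the IOALR, and the cross-entropy definition (per sentence, in bits) is one common choice that may differ in normalisation from the figure; the point is simply that EM is iterated until the drop in H(C,G) becomes negligible, so the converged model, rather than a hand-picked early iteration, is the one obtained in practice.

    import math

    def run_em(model, corpus, estep, mstep, max_iters=10, tol=1e-3):
        """Iterate EM while tracking corpus cross-entropy (a sketch only)."""
        history = []
        for iteration in range(1, max_iters + 1):
            # estep is assumed to return expected counts and the corpus
            # log-likelihood (natural log) under the current model
            expected_counts, log_likelihood = estep(model, corpus)
            model = mstep(expected_counts)
            cross_entropy = -log_likelihood / (len(corpus) * math.log(2))
            history.append((iteration, cross_entropy))
            if len(history) > 1 and history[-2][1] - cross_entropy < tol:
                break  # change in cross-entropy below tolerance: treat as converged
        return model, history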

[Figure 5.7 plots cross entropy H(C,G) against iteration number for EM(IL(),Su).]

Figure 5.7: Cross entropy convergence for EM over unsupervised Su.

5.6 Discussion

We have presented several semisupervised confidence-based training methods which have significantly improved performance over the current supervised method, while also reducing the manual effort required to create training or tuning data.

[Figure 5.8 plots F1 against iteration number for the Baseline, Cr(IL(),Su) and EM(IL(),Su) models.]

Figure 5.8: Performance over Su for Cr and EM.

We have shown that given a medium-sized unlabelled partially-bracketed corpus, the confidence-based models achieve superior results to those achieved with EM applied to the same SGLR parse selection model. Indeed, a bracketed corpus provides flexibility as existing treebanks can be utilised despite the incompatibility between the system grammar and the underlying grammar of the treebank. Mapping an incompatible annotated treebank to a compatible partially-bracketed corpus is relatively easy compared to mapping to a compatible fully-annotated corpus.

An immediate benefit of this work is that (re)training parsers with incrementally-modified grammars based on different linguistic frameworks should be much more straightforward; see Oepen et al. (2002) for a discussion of the problem. Furthermore, our findings suggest that it may be possible to usefully tune a parser to a new domain with less annotation effort.

Of the confidence measures considered, Cr consistently performed the best, illustrating its robust nature across different domains and varying levels of annotation. The ‘self-training’ C1 measure performed poorly in the semisupervised training task for several corpora, supporting the findings in initial self-training studies.

Our findings mirror those of Elworthy (1994) and Merialdo (1994) for PoS tagging and suggest that EM is not always the most suitable training method (especially when some in-domain training data is available). The confidence-based methods were successful because the derivations compatible with the bracketing contained a high proportion of correct constituents, so the level of noise introduced did not outweigh the benefit of incorporating all such derivations. These findings may not hold if the level of bracketing available does not adequately constrain the derivations considered; Hwa (1999) describes a related investigation with EM. Nevertheless, we illustrated that even over an unsupervised corpus, the Cr confidence method achieved statistically equivalent performance to the extant model. This may not translate to other parsers in which the uniform statistical model performs poorly, or to settings where the models are also required to infer the grammar during training.

In future work we intend to further investigate the problem of tuning to a new domain, given that minimal manual effort is a major priority. We hope to develop methods which require no manual annotation. For example, high precision automatic partial bracketing using phrase chunking and/or named entity recognition techniques might yield enough information to support the training methods developed in this work.

Finally, further experiments regarding alternative confidence measures within the framework described may prove beneficial. For example, we could normalise the ranking measure based on the number of consistent derivations. This confidence measure would ensure that each sentence in the bootstrap data contributes equally to the LR parse action frequency counts. Several other variants of the framework may also improve the final parse model’s performance. For example, we could perform iterative rounds of reannotation over portions of the bootstrap data, as applied in previous self-training experiments, for example those described by Steedman et al. (2003b).

Chapter 6

Conclusion

The focus of this thesis was the optimisation, in terms of both parser accuracy and efficiency, of an extant and well-developed SGLR parser. In this chapter we review the novel contributions of this thesis and also describe future lines of investigation. This discussion is organised by chapter.

Chapter 3 considered the optimal choice of PoS tag model employed by the extant parser, given that a front-end PoS tagger (i.e. preprocessing component) is applied. Previous work shows that parser efficiency improves if tag ambiguity is resolved by the front-end PoS tagger, though the accuracy and coverage of the parser decline as the level of tag error increases. Consequently, we investigated the optimum level of tag ambiguity to pass to the parser, considering both PoS tag and parser performance. As far as the author is aware, this work is the first to perform such a broad comparison. This broad comparison is important as different tag confusions are not equally detrimental to parser output (as illustrated within the experimental results of this chapter). While the initial tag selection models investigated achieve poor tagging performance, we show that gains in parser accuracy and coverage are available if we allow the parser to resolve some of the tag ambiguity. However, this results in a significant decline in parser efficiency.

Parsing results suggested that tag errors introduced by the PoS tagger cause a large proportion of the resulting fragmentary parses found over the single tag per word (tpw) input. We hypothesised that a grammar (especially one that is well-constrained over the terminals) may be relied upon to find a parse over correct PoS tag sequences only. In response, we described a dynamic tag selection model similar to that applied by Clark & Curran (2004a). This model increases the number of tags considered in parsing, starting from the set of most probable tags, until a complete derivation is found. Here, the known trade-off between parse ambiguity and PoS tag error provides a means to gauge PoS tag error based on parser output. An artificial implementation of the model achieves the parsing accuracy and efficiency of the single tpw input for the large proportion of nonfragmentary parses. However, it also improves the accuracy over the remaining set of (otherwise fragmentary) parses, as we reparse these with larger tag sets (removing the tag errors introduced), resulting in an overall increase in parser accuracy and coverage. As we only reparse a relatively small proportion of sentences, our efficiency is improved over that of the parser considering multiple tag per word input across all test sentences.

In future work, we aim to implement the dynamic model within the extant parser, as our experimental results were achieved by merging the resulting parse output files only. Furthermore, we hope to apply this tag selection model over domains in which the PoS tagger achieves poor single tpw tagging accuracy. Here, we hope to investigate whether the dynamic model can improve parse accuracy and coverage while increasing parser throughput over a range of domains; that is, whether the grammar can reliably be used to indicate the presence of tag errors over out-of-domain (tag and parser) data.
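A minimal sketch of this dynamic strategy is given below. The tagger and parser interfaces (tag_distribution, parse, is_fragmentary) are hypothetical stand-ins, and the per-word probability thresholds are illustrative rather than those used in the experiments; the essential behaviour is that the tag set is widened only when no complete derivation is found.

    def parse_with_dynamic_tags(sentence, tagger, parser, thresholds=(1.0, 0.9, 0.5, 0.0)):
        """Widen the per-word tag set until a non-fragmentary parse is found (a sketch)."""
        analysis = None
        for threshold in thresholds:
            tag_sets = []
            for word in sentence:
                dist = tagger.tag_distribution(word)   # [(tag, prob), ...] sorted by prob
                top = dist[0][1]
                # keep every tag within `threshold` of the most probable tag;
                # a threshold of 1.0 keeps a single tag per word
                tag_sets.append([tag for tag, prob in dist if prob >= threshold * top])
            analysis = parser.parse(sentence, tag_sets)
            if analysis is not None and not parser.is_fragmentary(analysis):
                return analysis                        # complete derivation found
        return analysis                                # fall back to the widest tag set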

Chapter 4 defined a novel method that improves the throughput and accuracy of the ‘weighted GRs’ output format. The extant method required a number of processing stages to determine this output format: unpacking the n-best derivations from the parse forest, deriving the corresponding n-best GR sets, and finding the unique set of GRs and corresponding weights. Although the accuracy of the output improves as the size of the n-best list considered increases, the efficiency declines in turn. We illustrated how to obviate the need to trade off efficiency and accuracy by extracting weighted GRs directly from the parse forest using a dynamic programming approach based on the IOA. However, this method correctly calculates the weights of this output only if a single lexical head is found for each node in the parse forest. Related work enforces this condition by placing extra constraints on which nodes can be packed, leading to less compact parse forests.

Instead, we defined a novel dynamic programming approach, the ‘EWG’ algorithm, to enable multiple inside and outside probabilities for each node in the parse forest, one for each possible lexical head. Experimental results demonstrated that the novel EWG algorithm achieved substantial increases in parser accuracy and throughput for weighted GR output. EWG is available for use within the second release of RASP (see Briscoe et al. 2006), as an alternative method to calculate the weighted GR output format. This algorithm could be applied to any graph-structured data structure over which we aim to estimate weighted frequencies for node attributes that may take more than one value.

We employed the parse selection strategy defined by Clark & Curran (2004b). This method applied the EWG algorithm and achieved a 3.01% relative reduction in error for F1. However, it is infeasible to define some GRs within the mapping from local-trees to GRs in RASP’s grammar. Therefore each of the n-best GR sets is consistent, though it may not represent a complete GR set. These ‘missing’ GRs do not appear within the weighted GR output and should be inferred given an incomplete, though consistent, set of GRs. The F1 upper bound of the task is calculated using the high precision and recall GR sets determined from the weighted GR output. This upper bound is currently in the low 80s, and reflects the shortfall in the GR representation.

In future work, we aim to develop parse selection strategies directly over the weighted GR output format, the GRs of which form a directed graph (DG). Nodes of the DG are words of the input, while edges from head to dependent are labelled with the GR’s type and weight. Given a (nontrivial) definition of GR consistency, we aim to determine the set of consistent GRs in this DG. For example, we could employ a suitable search algorithm (defined in graph theory) to select the most probable consistent subset of this DG. If this work proves feasible, we aim to infer the GRs that are missing from this consistent set, to form a consistent and complete GR set.
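To make the weighted GR output concrete, the sketch below computes it the way the extant method does, from an explicit n-best list of derivations; EWG computes the same quantities directly from the parse forest without unpacking that list. The interfaces are hypothetical, and the weighting shown (each GR’s weight is the sum of the normalised probabilities of the derivations whose GR sets contain it) is one standard formulation under which weights lie in (0, 1], so a high precision GR set can be read off by thresholding them.

    from collections import defaultdict

    def weighted_grs_from_nbest(derivations):
        """Weighted GRs from an n-best list (a sketch of the extant approach).

        `derivations` is a list of (probability, gr_set) pairs, where each gr_set
        is a collection of (head, dependent, gr_type) triples.
        """
        total = sum(prob for prob, _ in derivations)
        weights = defaultdict(float)
        for prob, gr_set in derivations:
            for gr in set(gr_set):                 # count each GR once per derivation
                weights[gr] += prob / total        # normalised derivation probability
        return dict(weights)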
Finally, in Chapter 5, we described a novel training framework similar to the self-training approaches employed in the literature, that can be applied to a wide range of parsing models. The framework considered weighting the contribution of all derivations within the n-best list that are consistent with the training corpus annotation (if any). This weighting is based on the corresponding derivation confidence of the initial parsing model. We described a number of different confidence measures, and compared these over a range of different domains. The Cr measure, based on the inverse of the derivation’s rank, proved to be fairly robust in terms of the choice of initial parsing model, training domain and the level of corpus annotation available.

The confidence-based parsing models consistently outperformed Expectation-Maximisation in the experimentation described, over both semisupervised and unsupervised training corpora. Furthermore, constraining the confidence-based models using unlabelled partially-bracketed data (automatically extracted from existing corpora) resulted in several parsing models that significantly outperformed the extant parser. The Cr model, trained over the bracketed Susanne corpus S, has been adopted as the training method within the extant parser. This model improves the accuracy of the resulting parser; moreover, it aids grammar development as the grammar writer is no longer required to maintain a fully-supervised treebank. In fact, this method requires no manual effort on behalf of the grammar writer and the extant parser’s training method is now fully automated.

In future work, we aim to investigate methods to usefully tune the parser to new domains, given that several studies have illustrated that even limited in-domain training data can significantly improve parser accuracy. The unsupervised training experiments illustrated that the Cr method may be sufficiently robust to handle unsupervised domain adaptation. We also aim to investigate automatic methods (requiring no direct annotation) for providing partial annotation to constrain the set of derivations considered. Here, high precision automatic partial bracketing using phrase chunking and/or named entity recognition techniques might yield enough information to support the training methods developed in this thesis.

References

AHO, A., SETHI, R. & ULLMAN, J. (1986). Compilers: principles, techniques, and tools. Addison-Wesley, Boston, MA.

ALSHAWI, H., ed. (1992). The Core Language Engine. MIT Press, Cambridge, MA.

ANDERSON, T., EVE, J. & HORNING, J. (1973). Efficient LR(1) parsers. Acta Informatica, 2(1), 12–39.

ASTON, G. & BURNARD, L. (1998). The BNC Handbook. Edinburgh University Press, Edinburgh.

BACCHIANI, M., RILEY, M., ROARK, B. & SPROAT, R. (2006). MAP adaptation of stochastic grammars. Computer Speech and Language, 20(1), 41–68.

BAKER, J. (1979). Trainable grammars for speech recognition. In Proceedings of the Spring Conference of the Acoustical Society of America, 547–550, Boston, MA.

BIKEL, D. (2004). Intricacies of Collins’ parsing model. Computational Linguistics, 30(4), 479–511.

BLACK, E., ABNEY, S., FLICKENGER, D., GDANIEC, C., GRISHMAN, R., HARRISON, R., HINDLE, D., INGRIA, R., JELINEK, F., KLAVANS, J., LIBERMAN, M., MARCUS, M., ROUKOS, S., SANTORINI, B. & STRZALKOWSKI, T. (1991). A procedure for quantitatively comparing the syntactic coverage of English grammars. In Proceedings of the Fourth DARPA Workshop on Speech and Natural Language, 306–311, Morgan Kaufmann, San Mateo, CA.

BOD, R. (1998). Beyond Grammar: An Experience-Based Theory of Language. CSLI Publications, Cambridge University Press.

BOD, R. (2006). An all-subtrees approach to unsupervised parsing. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, 865–872, Sydney, Australia.

BRISCOE, E.J. (2006). An introduction to tag sequence grammars and the RASP system parser. Technical Report 662, Computer Laboratory, University of Cambridge.

BRISCOE, E.J. & CARROLL, J. (1993). Generalised probabilistic LR parsing of natural language (corpora) with unification-based grammars. Computational Linguistics, 19(1), 25–59.

BRISCOE, E.J. & CARROLL, J. (2002). Robust accurate statistical annotation of general text. In Proceedings of the 3rd International Conference on Language Resources and Evaluation, 1499–1504, Las Palmas, Gran Canaria.

BRISCOE, E.J. & CARROLL, J. (2006). Evaluating the speed and accuracy of a domain-independent statistical parser on the PARC DepBank. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics Main Conference Poster Sessions, 41–48, Sydney, Australia.

BRISCOE, E.J., CARROLL, J., GRAHAM, J. & COPESTAKE, A. (2002). Relational evaluation schemes. In Proceedings of the Beyond PARSEVAL Workshop at the 3rd International Conference on Language Resources and Evaluation, 4–8, Las Palmas, Gran Canaria.

BRISCOE, E.J., CARROLL, J. & WATSON, R. (2006). The second release of the RASP system. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics Interactive Presentation Sessions, 77–80, Sydney, Australia.

CARROLL, J. (1993). Practical unification-based parsing of natural language. Technical Report 314, Computer Laboratory, University of Cambridge.

CARROLL, J. & BRISCOE, E.J. (2002). High precision extraction of grammatical relations. In Proceedings of the 19th International Conference on Computational Linguistics, 134–140, Taipei, Taiwan.

CARROLL, J., BRISCOE, E.J. & SANFILIPPO, A. (1998). Parser evaluation: a survey and a new proposal. In Proceedings of the Workshop on The Evaluation of Parsing Systems at the 1st International Conference on Language Resources and Evaluation, 447–454, Granada, Spain.

CHARNIAK, E. (1994). Statistical Language Learning. MIT Press, Cambridge, MA.

CHARNIAK, E. (1997). Statistical parsing with a context-free grammar and word statistics. In Proceedings of the Fourteenth National Conference on Artificial Intelligence and Ninth Innovative Applications of Artificial Intelligence Conference, 598–603, Providence, Rhode Island.

CHARNIAK, E. (2000). A maximum-entropy-inspired parser. In Proceedings of the first conference on North American chapter of the Association for Computational Linguistics, 132–139, Seattle, Washington.

CHARNIAK, E., CARROLL, G., ADCOCK, J., CASSANDRA, A., GOTOH, Y., KATZ, J., LITTMAN, M. & MCCANN, J. (1996). Taggers for parsers. Artificial Intelligence, 85(1–2), 45–57.

CHARNIAK, E. & JOHNSON, M. (2005). Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, 173–180, Ann Arbor, Michigan.

CLARK, S. & CURRAN, J. (2004a). The importance of supertagging for wide-coverage CCG parsing. In Proceedings of the 20th International Conference on Computational Linguistics, 282–288, Geneva, Switzerland.

CLARK, S. & CURRAN, J. (2004b). Parsing the WSJ using CCG and log-linear models. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, 104–111, Barcelona, Spain.

CLARK, S. & CURRAN, J. (2006). Partial training for a lexicalised-grammar parser. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, 144–151, New York, New York.

CLARK, S., CURRAN, J. & OSBORNE, M. (2003). Bootstrapping PoS taggers using unlabelled data. In Proceedings of the 7th conference on Natural language learning at Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, 49–55, Edmonton, Canada.

COHN, D.A., ATLAS, L. & LADNER, R.E. (1994). Improving generalization with active learning. Machine Learning, 15(2), 201–221.

COLLINS, M. (1999). Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.

COLLINS, M. (2004). Parameter estimation for statistical parsing models: Theory and practice of distribution-free methods. In H. Bunt, J. Carroll & G. Satta, eds., New developments in Parsing Technology, 19–55, Kluwer Academic Publishers, Norwell, MA.

COLLINS, M. & DUFFY, N. (2002). New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, 263–270, Philadelphia, Pennsylvania.

COLLINS, M. & ROARK, B. (2004). Incremental parsing with the perceptron algorithm. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, 111–118, Barcelona, Spain.

COPESTAKE, A. (2003). Report on the Design of RMRS. Technical Report D1.1a, Computer Laboratory, University of Cambridge.

DALRYMPLE, M. (2006). How much can part-of-speech tagging help parsing? Natural Language Engineering, 12(4), 373–389.

ELWORTHY, D. (1994). Does Baum-Welch re-estimation help taggers? In Proceedings of the 4th Conference on Applied Natural Language Processing, 53–58, Stuttgart, Germany.

GILDEA, D. (2001). Corpus variation and parser performance. In Proceedings of the 6th conference on Empirical Methods in Natural Language Processing, 167–202, Pittsburgh, PA.

GROVER, C., CARROLL, J. & BRISCOE, E.J. (1993). The Alvey Natural Language Tools grammar (4th Release). Technical Report 284, Computer Laboratory, University of Cambridge.

HOCKENMAIER, J. (2003). Data and models for statistical parsing with Combinatory Categorial Grammar. Ph.D. thesis, School of Informatics, The University of Edinburgh.

HWA, R. (1999). Supervised grammar induction using training data with limited constituent information. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, 73–79, College Park, Maryland.

INUI, K., SORNLERTLAMVANICH, V., TANAKA, H. & TOKUNAGA, T. (1997). A new formalization of probabilistic GLR parsing. In Proceedings of the 5th International Workshop on Parsing Technologies, 123–134, Cambridge, MA.

JOHANSSON, S., ATWELL, R., GARSIDE, R. & LEECH, G. (1986). The tagged LOB corpus: user’s manual. Norwegian Computing Centre for the Humanities, Bergen.

JURAFSKY, D. & MARTIN, J. (2000). Speech and Language Processing. Prentice-Hall, Upper Saddle River, NJ, USA.

KAPLAN, R. & KING, T. (2003). Low-level mark-up and large-scale LFG grammar processing. In Proceedings of the 8th International Lexical Functional Grammar Conference, 238–249, Albany, NY.

KAPLAN, R., RIEZLER, S., KING, T., MAXWELL, J., VASSERMAN, A. & CROUCH, R. (2004). Speed and accuracy in shallow and deep stochastic parsing. In Proceedings of the Human Language Technology Conference and the 4th Annual Meeting of the North American Chapter of the Association for Computational Linguistics, 97–104, Boston, MA.

KING, T., CROUCH, R., RIEZLER, S., DALRYMPLE, M. & KAPLAN, R. (2003). The PARC 700 Dependency Bank. In Proceedings of the 4th International Workshop on Linguistically Interpreted Corpora at the 10th Conference of the European Chapter of the Association for Computational Linguistics, 1–8, Budapest, Hungary.

KIPPS, J. (1989). Analysis of Tomita’s algorithm for general context-free parsing. In Proceedings of the 1st International Workshop on Parsing Technologies, 193–202, Pittsburgh, PA.

KLEIN, D. & MANNING, C. (2002). A generative constituent-context model for improved grammar induction. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 128–135, Philadelphia, PA.

KLEIN, D. & MANNING, C. (2003). Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, 423–430, Sapporo, Japan.

KLEIN, D. & MANNING, C. (2004). Corpus-based induction of syntactic structure: models of dependency and constituency. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, 478–485, Barcelona, Spain.

KRISTENSEN, B.B. & MADSEN, O.L. (1981). Methods for computing LALR(k) lookahead. ACM Transactions on Programming Languages and Systems, 3(1), 60–82.

KUDO, T., SUZUKI, J. & ISOZAKI, H. (2005). Boosting-based parse reranking with subtree features. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, 189–196, Ann Arbor, Michigan.

LARI, K. & YOUNG, S. (1990). The estimation of stochastic context-free grammars using the Inside-Outside algorithm. Computer Speech and Language, 4(1), 35–56.

LIN, D. (1998). Dependency-based evaluation of MINIPAR. In Proceedings of the Workshop on The Evaluation of Parsing Systems at the 1st International Conference on Language Resources and Evaluation, Granada, Spain.

MARCUS, M.P., SANTORINI, B. & MARCINKIEWICZ, M.A. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330.

MCCLOSKY, D., CHARNIAK, E. & JOHNSON, M. (2006). Effective self-training for parsing. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, 152–159, New York, New York.

MERIALDO, B. (1994). Tagging English Text with a probabilistic model. Computational Linguistics, 20(2), 155–171.

MIYAO, Y. & TSUJII, J. (2002). Maximum entropy estimation for feature forests. In Proceedings of the Human Language Technology Conference, 292–297, San Diego, California.

MIYAO, Y. & TSUJII, J. (2005). Probabilistic disambiguation models for wide-coverage HPSG parsing. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, 83–90, Ann Arbor, Michigan.

OEPEN, S. & CARROLL, J. (2000). Ambiguity packing in constraint-based parsing: practical results. In Proceedings of the 1st conference on North American chapter of the Association for Computational Linguistics, 162–169, Seattle, WA.

OEPEN, S., TOUTANOVA, K., SHIEBER, C., MANNING, C., FLICKINGER, D. & BRANTS, T. (2002). The LinGO Redwoods Treebank: Motivation and preliminary applications. In Proceedings of the 19th International Conference on Computational Linguistics, 1–5, Taipei, Taiwan.

OSBORNE, M. & BALDRIDGE, J. (2004). Ensemble-based active learning for parse selection. In Proceedings of the North American Chapter of the Association for Computational Linguistics, 89–96, Boston, MA.

PEREIRA, F. & SCHABES, Y. (1992). Inside-outside reestimation from partially bracketed corpora. In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, 128–135, Newark, Delaware.

PIANO, L. (1996). Adaptation of Acquilex tagger to unknown words – release 2, University of Cambridge Computer Laboratory, unpublished memo.

PIERCE, D. & CARDIE, C. (2001). Limitations of co-training for natural language learning from large datasets. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 1–9, Philadelphia, PA.

PRESCHER, D. (2001). Inside-outside estimation meets dynamic EM. In Proceedings of the 7th International Workshop on Parsing Technologies, 241–244, Beijing, China.

RATNAPARKHI, A. (1999). Learning to parse natural language with maximum entropy models. Machine Learning, 34(1–3), 151–175.

RIEZLER, S., KING, T., KAPLAN, R., CROUCH, R., MAXWELL, J. & JOHNSON, M. (2002). Parsing the Wall Street Journal using a lexical-functional grammar and discriminative estimation techniques. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, 271–278, Philadelphia, PA.

SAMPSON, G. (1995). English for the Computer. Oxford University Press, Oxford, UK.

SARKAR, A. (2001). Applying cotraining methods to statistical parsing. In Proceedings of the 2nd meeting of the North American Chapter of the Association for Computational Linguistics, 1–8, Pittsburgh, PA.

SCHMID, H. & ROOTH, M. (2001). Parse forest computation of expected governors. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, 458–465, Toulouse, France.

SHARMAN, R.A. (1990). Hidden Markov Model Methods for Word Tagging. Technical Report 214, IBM UK Scientific Centre, Winchester, England.

SHEN, L. & JOSHI, A. (2004). Flexible margin selection for reranking with full pairwise samples. In Proceedings of the 1st International Joint Conference of Natural Language Processing, 446–455, Hainan Island, China.

SIEGEL, S. & CASTELLAN, N.J. (1988). Nonparametric Statistics for the Behavioural Sciences, 2nd edition. McGraw Hill, New York.

STEEDMAN, M., HWA, R., CLARK, S., OSBORNE, M., SARKAR, A., HOCKENMAIER, J., RUHLEN, P., BAKER, S. & CRIM, J. (2003a). Example selection for bootstrapping statistical parsers. In Proceedings of the Annual Meeting of the North American chapter of the Association for Computational Linguistics, 157–164, Edmonton, Canada.

STEEDMAN, M., OSBORNE, M., SARKAR, A., CLARK, S., HWA, R., HOCKENMAIER, J., RUHLEN, P., BAKER, S. & CRIM, J. (2003b). Bootstrapping statistical parsers from small datasets. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics, 331–338, Budapest, Hungary.

SU, K., WANG, J., SU, M. & CHANG, J. (1991). GLR parsing with scoring. In M. Tomita, ed., Generalized LR Parsing, 93–112, Kluwer Academic Publishers, Boston, MA.

TADAYOSHI, H., MIYAO, Y. & TSUJII, J. (2005). Adapting a probabilistic disambiguation model of an HPSG parser to a new domain. In Proceedings of the Second International Joint Conference on Natural Language Processing, 199–210, Jeju Island, Korea.

TOMITA, M. (1987). An efficient augmented-context-free parsing algorithm. Computational Linguistics, 13(1–2), 31–46.

WATSON, R. (2006). Part-of-speech models for parsing. In Proceedings of the 9th Annual CLUK Research Colloquium, Open University, Milton Keynes.

WATSON, R. & BRISCOE, E.J. (2007). Adapting the RASP system for the CoNLL07 domain-adaptation task. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, 1170–1174, Prague, Czech Republic.

WATSON, R., BRISCOE, E.J. & CARROLL, J. (2007). Semi-supervised training of a statistical parser from unlabeled partially-bracketed data. In Proceedings of the 10th International Workshop on Parsing Technologies, 23–32, Prague, Czech Republic.

WATSON, R., CARROLL, J. & BRISCOE, E.J. (2005). Efficient extraction of grammatical relations. In Proceedings of the 9th International Workshop on Parsing Technologies, 160–170, Vancouver, Canada.

WEISCHEDEL, R., SCHWARTZ, R., PALMUCCI, J., METEER, M. & RAMSHAW, L. (1993). Coping with ambiguity and unknown words through probabilistic models. Computational Linguistics, 19(2), 361–382.

WRIGHT, J. & WRIGLEY, E. (1989). Probabilistic LR parsing for speech recognition. In Proceedings of the 1st International Workshop on Parsing Technologies, 193–202, Pittsburgh, PA.