Parsing Inside-Out
Total Page:16
File Type:pdf, Size:1020Kb
Parsing Inside-Out The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters Citation Goodman, Joshua. 1998. Parsing Inside-Out. Harvard Computer Science Group Technical Report TR-07-98. Citable link http://nrs.harvard.edu/urn-3:HUL.InstRepos:24829603 Terms of Use This article was downloaded from Harvard University’s DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http:// nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of- use#LAA Parsing InsideOut Joshua Go o dman TR June Computer Science Group Harvard University Cambridge Massachusetts Parsing InsideOut A thesis presented by Joshua T Go o dman to The Division of Engineering and Applied Sciences in partial fulllment of the requirements for the degree of Do ctor of Philosophy in the sub ject of Computer Science Harvard University Cambridge Massachusetts May c by Joshua T Go o dman All rights reserved iii Abstract Probabilistic ContextFree Grammars PCFGs and variations on them have recently b e come some of the most common formalisms for parsing It is common with PCFGs to compute the inside and outside probabilities When these probabilities are multiplied to gether and normalized they pro duce the probability that any given nonterminal covers any piece of the input sentence The traditional use of these probabilities is to improve the probabilities of grammar rules In this thesis we show that these values are useful for solving many other problems in Statistical Natural Language Pro cessing We give a framework for describing parsers The framework generalizes the inside and outside values to semirings It makes it easy to describ e parsers that compute a wide variety of interesting quantities including the inside and outside probabilities as well as related quantities such as Viterbi probabilities and nb est lists We also present three novel uses for the inside and outside probabilities The rst novel use is an algorithm that gets improved p erformance by optimizing metrics other than the exact match rate The next novel use is a similar algorithm that in combination with other techniques sp eeds DataOriented Parsing by a factor of The third use is to sp eed parsing for PCFGs using thresholding techniques that approximate the insideoutside pro duct the thresholding techniques lead to a times sp eedup at the same accuracy level as conventional metho ds At the time this research was done no state of the art grammar formalism could b e used to compute inside and outside probabilities We present the Probabilistic Feature Grammar formalism which achieves state of the art accuracy and can compute these probabilities iv Acknowledgements To Erica my love my help There are to o many p eople for me to thank and for to o many things Most of all this thesis is for Erica who I met a month after I started graduate school and who I will marry a month after I nish Next I want to thank my committee Stuart Shieb er is an amazing advisor who gave me wonderful supp ort and just the right amount of guidance Barbara Grosz has b een a great acting advisor in my last year putting her fo ot down on imp ortant things but allowing me to o ccassionally split innitives when it did not matter Fernando Pereira has given me very go o d advice and is a terric source of knowledge Thanks also to Leslie Valiant for the almost thankless task of serving on my committee For my rst year or two I shared an oce with Stan Chen and Andy Kehler Id rather share a small oce with them than have a large oce to myself they taught me as much as anyone here and made grad school fun They and the other members of the AIResearch group read endless drafts of my pap ers attended no end of nearlyidentical practice talks and listened to hundreds of bad ideas Wheeler Ruml in particular has done countless small favors Lillian Lee has b een a source of guidance Im happy to call Reb ecca Hwa my friend Others Luke Hunsb erger Ellie Baker Kathy Ryall Jon Christenson Christine Nakatani Nadia Shalaby Greg Galp erin and David Magerman for p oker and rejecting my ACL pap er have all help ed me through and made my time here b etter A year after getting here I developed tendonitis in my hands I needed more help than usual to graduate and I got it b oth from the Division of Engineering and Applied Sciences and from many of the p eople I have named so far A PhD is built on a deep foundation My parents and family have always pushed me and supp orted me and continue to do so My father encouraged me from a ridiculously early age what kind of p erson gives a year old Scientic American while my mother through what she called gentle teasing gave me her own sp ecial kind of encouragement I want to thank Edward Siegfried my mentor for nine years who taught me so much ab out computers As an undergraduate I had many helpful professors among whom Harry Lewis stands out After college I worked at Dragon Systems where Dean Sturtevant Larry In the AI Research Group acknowledgments are traditionally funny This encourages p eople to read the acknowledgements and skip the thesis I have therefore written rather dry acknowledgments and put one joke in this thesis to encourage the reading of this work in its entirety The joke is not very funny so you will have to read the whole thing to b e sure you have found it I will send to the rst p erson each year to nd the joke without help Some restrictions apply v Gillick and Bob Roth initiated me into the black art of sp eech recognition and taught me how to b e a go o d programmer Id also like to thank the National Science Foundation for providing most of my funding with Grant IRI Grant IRI and an NSF Graduate Student Fellowship vi Contents Introduction Background Introduction to Statistical NLP ContextFree Grammars Probabilistic ContextFree Grammars Overview Semiring Parsing Introduction Earley Parsing Overview Semiring Parsing Semiring ItemBased Description The Grammar Conditions for Correct Pro cessing The derivation semirings Ecient Computation of Item Values Item Value Formula Solving the Innite Summation Reverse Values Reverse Values in Noncommutative Semirings Semiring Parser Execution Bucketing Interpreter Grammar Transformations Examples Finite State Automata and Hidden Markov Mo dels Prex Values Beyond ContextFree vii Tomita Parsing Graham Harrison Ruzzo Parsing Previous Work Recent similar work Conclusion A Additional Pro ofs A Viterbinb est is a semiring B Additional Examples B Graham Harrison and Ruzzo GHR Parsing B Beyond ContextFree B Greibach Normal Form C Reverse Value of Noncommutative Semirings C Pair Semirings C Sp ecic Pair Semirings C Derivation of NonCommutative Reverse Value Formulas Maximizing Metrics Introduction Evaluation Metrics Basic Denitions Evaluation Metrics Maximizing Metrics Which Metrics to Use Lab elled Recall Parsing Formulas Pseudo co de Algorithm ItemBased Description Bracketed Recall Parsing Exp erimental Results Grammar Induced by Pereira and Schabes metho d Grammar Induced by Counting NPCompleteness of Bracketed Tree Maximization NPCompleteness of HMM Most Likely String Bracketed Tree Maximization is NPComplete General Recall Algorithm NAry Branching Parse Trees NAry Branching Evaluation Metrics Combined Rate Maximization NAry Branching Exp eriments viii Conclusions A Pro of of Crossing Brackets Theorem B Glossary DataOriented Parsing Introduction Previous Research Reduction of DOP to PCFG Parsing Algorithms Sampling Algorithms Exp erimental Results and Discussion Timing Analysis Analysis of Bo ds Data Conclusion Thresholding Introduction Beam Thresholding Global Thresholding Global Thresholding Algorithm MultiplePass Parsing .