Grammar Induction and Parsing with Dependency-and-Boundary Models
A dissertation submitted to the Department of Computer Science and the Committee on Graduate Studies of Stanford University in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Valentin Ilyich Spitkovsky
December 2013

Preface

For Dan and Hayden, with even greater variances...

Unsupervised learning of hierarchical syntactic structure from free-form natural language text is an important and difficult problem, with implications for scientific goals, such as understanding human language acquisition, and engineering applications, including question answering, machine translation and speech recognition. As is the case with many unsupervised settings in machine learning, grammar induction usually reduces to a non-convex optimization problem. This dissertation proposes a novel family of head-outward generative dependency parsing models and a curriculum learning strategy, co-designed to effectively induce grammars despite local optima, by taking advantage of multiple views of the data.

The dependency-and-boundary models are parameterized to exploit, as much as possible, any observable state, such as words at sentence boundaries, which limits the proliferation of optima ordinarily caused by the presence of latent variables. They are also flexible in their modeling of overlapping subgrammars and sensitive to different kinds of input types. These capabilities allow training data to be split into simpler text fragments, in accordance with proposed parsing constraints, thereby increasing the numbers of visible edges. An optimization strategy then gradually exposes learners to more complex data.

The proposed suite of constraints on possible valid parse structures, which can be extracted from unparsed surface text forms, helps guide language learners toward linguistically plausible syntactic constructions.
These constraints are efficient, easy to implement and applicable to a variety of naturally-occurring partial bracketings, including capitalization changes, punctuation and web markup. Connections between traditional syntax and HTML annotations, for instance, were not previously known, and are one of several discoveries about statistical regularities in text that this thesis contributes to the science of linguistics.

Resulting grammar induction pipelines attain state-of-the-art performance not only on a standard English dependency parsing test bed, but also as judged by constituent structure metrics, in addition to a more comprehensive multilingual evaluation that spans disparate language families. This work widens the scope and difficulty of the evaluation methodology for unsupervised parsing, testing against nineteen languages (rather than just English), evaluating on all (not just short) sentence lengths, and using disjoint (blind) training and test data splits. The proposed methods also show that it is possible to eliminate commonly used supervision signals, including biased initializers, manually tuned training subsets, custom termination criteria and knowledge of part-of-speech tags, and still improve performance.

Empirical evidence presented in this dissertation strongly suggests that complex learning tasks like grammar induction can cope with non-convexity and discover more correct syntactic structures by pursuing learning strategies that begin with simple data and basic models and progress to more complex data instances and more expressive model parameterizations. A contribution to artificial intelligence more broadly is thus a collection of search techniques that make expectation-maximization and other optimization algorithms less sensitive to local optima.
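The simple-to-complex training idea can be illustrated with a minimal sketch: stage k trains only on sentences of length at most k, warm-starting from the parameters of stage k-1. The toy model below (smoothed relative-frequency re-estimation standing in for the real E and M steps of a dependency model) and all helper names are hypothetical illustrations, not the dissertation's actual implementation.

```python
def em_step(params, data):
    """One toy EM-style update: smoothed relative-frequency
    re-estimation of per-word probabilities (a stand-in for the
    E and M steps of a real dependency model)."""
    counts = {w: 1.0 for w in params}          # carry over known vocabulary
    for sentence in data:
        for word in sentence:
            counts[word] = counts.get(word, 1.0) + 1.0  # add-one smoothing
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def baby_steps(corpus, max_len, sweeps=5):
    """Curriculum training: stage k sees all sentences of length <= k,
    re-using the parameters of stage k-1 as its initializer, so the
    learner never faces the full, complex data from a cold start."""
    params = {}
    for k in range(1, max_len + 1):
        stage = [s for s in corpus if len(s) <= k]
        if not stage:
            continue
        for _ in range(sweeps):                # run EM on this stage
            params = em_step(params, stage)
    return params

corpus = [["fish"], ["fish", "swim"], ["big", "fish", "swim", "fast"]]
probs = baby_steps(corpus, max_len=4)
```

The key design point is the warm start: each stage's optimization begins from a solution already fitted to simpler data, which is the sense in which "starting small" steers the learner away from poor local optima.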
The proposed tools include multi-objective approaches for avoiding or escaping fixed points, iterative model recombination and "starting small" strategies that gradually improve candidate solutions, and a generic framework for transforming these and other already-found locally optimal models. Such transformations make for informed, intelligent, non-random restarts, enabling the design of comprehensive search networks that are capable of exploring combinatorial parameter spaces more rapidly and more thoroughly than conventional optimization methods.

Dedicated to my loving family.

Acknowledgements

I must thank many people who have contributed to the successful completion of this thesis, starting with the oral examination committee members: my advisor, Daniel S. Jurafsky, for his patience, encouragement and wisdom; Hayden Shaw, who has been a close collaborator at Google Research and a de facto mentor; Christopher D. Manning, from whom I learned NLP and IR; Serafim Batzoglou, with whom I coauthored my earliest academic paper; and Arthur B. Owen, who became my first guide into the world of Statistics. I am also grateful to others — professors, coauthors, coworkers, labmates, other collaborators, anonymous reviewers, recommendation letter writers, graduate program coordinators, friends, well-wishers, as well as the numerous dance, martial arts, and yoga instructors who helped to keep me (relatively) sane through it all. Any attempt to name everyone is doomed to failure. Here are the results of one such effort: Omri Abend, Eneko Agirre, Lauren Anas, Kelly Ariagno, Michael Bachmann, Prachi Balaji, Edna Barr, Alexei Barski, John Bauer, Steven Bethard, Yonatan Bisk, Anna Botelho, Stephen P. Boyd, Thorsten Brants, Helen Buendicho, Jerry Cain, Luz Castineiras, Daniel Cer, Nathanael Chambers, Cynthia Chan, Angel X.
Chang, Helen Chang, Ming Chang, Pi-Chuan Chang, Jean Chao, Wanxiang Che, Johnny Chen, Renate & Ron Chestnut, Rich Chin, Yejin Choi, Daniel Clancy, John Clark, Ralph L. Cohen, Glenn Corteza, Chris Cosner, Luke Dahl, Maria David, Amir Dembo, Persi Diaconis, Kathi DiTommaso, Lynda K. Dunnigan, Jason Eisner, Song Feng, Jenny R. Finkel, Susan Fox, Michael Genesereth, Suvan Gerlach, Kevin Gimpel, Andy Golding, Leslie Gordon, Bill Graham, Spence Green, Sonal Gupta, David L.W. Hall, Krassi Harwell, Cynthia Hayashi, Julia Hockenmaier, John F. Holzrichter, Anna Kazantseva, Carla Murray Kenworthy, Steven P. Kerckhoff, Sadri Khalessi, Jam Kiattinant, Donald E. Knuth, Daphne Koller, Mikhail Kozhevnikov, Linda Kubiak, Polina Kuznetsova, Cristina N. & Homer G. Ladas, Diego Lanau, Leo Landa, Beth Levin, Eisar Lipkovitz, Claudia Lissette, Ting Liu, Sherman Lo, Gabby Magana, Kim Marinucci, Marie-Catherine de Marneffe, Felipe Martinez, Andrea McBride, David McClosky, Ryan McDonald, R. James Milgram, Elizabeth Morin, Rajeev Motwani, Andrew Y. Ng, Natasha Ng, Barbara Nichols, Peter Norvig, Ingram Olkin, Jennifer Olson, Nick Parlante, Marius Paşca, Fernando Pereira, Lisette Perelle, Daniel Peters, Leslie Miller Peters, Slav Petrov, Ann Marie Pettigrew, Keyko Pintz, Daniel Pipes, Igor Polkovnikov, Elias Ponvert, Zuby Pradhan, Agnieszka Purves, Bala Rajaratnam, Daniel Ramage, Julian Miller Ramil, Marta Recasens, Roi Reichart, David Rogosa, Joseph P. Romano, Mendel Rosenblum, Vicente Rubio, Roy Schwartz, Xiaolin Shi, David O. Siegmund, Noah A. Smith, Richard Socher, Alfred Spector, Yun-Hsuan Sung, Mihai Surdeanu, Julie Tibshirani, Ravi Vakil, Raja Velu, Pier Voulkos, Mengqiu Wang, Tsachy Weissman, Jennifer Widom, Verna L. Wong, Adrianne Kishiyama & Bruce Wonnacott, Lowell Wood, Wei Xu, Eric Yeh, Ayano Yoneda, Alexander Zeyliger, Nancy R.
Zhang, and many others of Stanford's Natural Language Processing Group, the Computer Science, Electrical Engineering, Linguistics, Mathematics, and Statistics Departments, Aikido and Argentine Tango Clubs, as well as the Fannie & John Hertz Foundation and Google Inc.

Portions of the work presented in this dissertation were supported, in part, by research grants from the National Science Foundation (award number IIS-0811974) and the Defense Advanced Research Projects Agency's Air Force Research Laboratory (contract numbers FA8750-09-C-0181 and FA8750-13-2-0040), as well as by a graduate fellowship from the Fannie & John Hertz Foundation. I am grateful to these organizations, and to the organizers of TAC, NAACL, LREC, ICGI, EMNLP, CoNLL and ACL conferences and workshops.

Last but certainly not least, I thank Jayanthi Subramanian in the Ph.D. Program Office and Ron L. Racilis at the Office of the University Registrar, who sprang into action when, at the last moment, one of the doctoral dissertation reading committee members left the state before conferring his approval with a physical signature. It's been an exciting ride until the very end. I am grateful to (and for) all those who stood by me when life threw us curves.

Contents

Preface
Acknowledgements
1 Introduction
2 Background
  2.1 The Dependency Model with Valence
  2.2 Evaluation Metrics
  2.3 English Data
  2.4 Multilingual Data
  2.5 A Note on Initialization Strategies
I Optimization
3 Baby Steps
  3.1 Introduction
  3.2 Intuition
  3.3 Related Work
  3.4 New Algorithms for the Classic Model
    3.4.1 Algorithm #0: Ad-Hoc∗ — A Variation on Original Ad-Hoc Initialization
    3.4.2 Algorithm #1: Baby Steps — An Initialization-Independent Scaffolding
    3.4.3 Algorithm #2: Less is More — Ad-Hoc∗ where Baby Steps Flatlines
    3.4.4 Algorithm #3: Leapfrog — A Practical and Efficient Hybrid Mixture
    3.4.5 Reference Algorithms — Baselines, a Skyline and Published Art