Novel Estimation Methods for Unsupervised Discovery of Latent Structure in Natural Language Text

by Noah Ashton Smith

A dissertation submitted to The Johns Hopkins University in conformity with the requirements for the degree of Doctor of Philosophy.

Baltimore, Maryland
October, 2006

© Noah Ashton Smith 2006
All rights reserved

Abstract

This thesis is about estimating probabilistic models to uncover useful hidden structure in data; specifically, we address the problem of discovering syntactic structure in natural language text. We present three new parameter estimation techniques that generalize the standard approach, maximum likelihood estimation, in different ways. Contrastive estimation maximizes the conditional probability of the observed data given a “neighborhood” of implicit negative examples. Skewed deterministic annealing locally maximizes likelihood using a cautious parameter search strategy that starts with an easier optimization problem than likelihood, and iteratively moves to harder problems, culminating in likelihood. Structural annealing is similar, but starts with a heavy bias toward simple syntactic structures and gradually relaxes the bias.

Our estimation methods do not make use of annotated examples. We consider their performance in both an unsupervised model selection setting, where models trained under different initialization and regularization settings are compared by evaluating the training objective on a small set of unseen, unannotated development data, and supervised model selection, where the most accurate model on the development set (now with annotations) is selected. The latter is far superior, but surprisingly few annotated examples are required.

The experimentation focuses on a single dependency grammar induction task, in depth. The aim is to give strong support for the usefulness of the new techniques in one scenario.
It must be noted, however, that the task (as defined here and in prior work) is somewhat artificial, and improved performance on this particular task is not a direct contribution to the greater field of natural language processing. The real problem the task seeks to simulate (the induction of syntactic structure in natural language text) is certainly of interest to the community, but this thesis does not directly approach the problem of exploiting induced syntax in applications. We also do not attempt any realistic simulation of human language learning, as our newspaper text data do not resemble the data encountered by a child during language acquisition. Further, our iterative learning algorithms assume a fixed batch of data that can be repeatedly accessed, not a long stream of data observed over time in tandem with acquisition. (Of course, the cognitive criticisms apply to virtually all existing learning methods in natural language processing, not just the new ones presented here.) Nonetheless, the novel estimation methods presented are, we will argue, better suited to adaptation for real engineering tasks than the maximum likelihood baseline.

Our new methods are shown to achieve significant improvements over maximum likelihood estimation and maximum a posteriori estimation, using the EM algorithm, for a state-of-the-art probabilistic model used in dependency grammar induction (Klein and Manning, 2004). The task is to induce dependency trees from part-of-speech tag sequences; we follow standard practice and train and test on sequences of ten tags or fewer.
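The results for this task are stated as directed attachment accuracies paired with relative error rate reductions. The following is a minimal sketch of how the second quantity can be derived from the first, assuming (the abstract does not spell this out) that the error rate is simply 100% minus the accuracy. The function name `relative_error_reduction` and the `results` table layout are illustrative choices, not taken from the thesis; the accuracy figures are the ones quoted in the abstract.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Fraction of the baseline's errors that the new system removes.

    Both arguments are accuracies in percent. Assumes error = 100 - accuracy.
    """
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return (baseline_err - new_err) / baseline_err


# (baseline accuracy, new accuracy) pairs as quoted in the abstract.
results = {
    "English": (41.6, 66.7),
    "German": (54.4, 71.8),
    "Bulgarian": (45.6, 58.3),
    "Mandarin": (50.0, 58.0),
    "Turkish": (48.0, 62.4),
    "Portuguese": (42.5, 71.8),
}

# Relative error reduction, rounded to the nearest whole percent.
reductions = {
    lang: round(100 * relative_error_reduction(base, new))
    for lang, (base, new) in results.items()
}

# Turkish measured against the left-branching baseline (61.8%) instead.
turkish_vs_left_branching = round(100 * relative_error_reduction(61.8, 62.4))

for lang, pct in reductions.items():
    print(f"{lang}: {pct}% relative error reduction")
print(f"Turkish vs. left-branching baseline: {turkish_vs_left_branching}%")
```

Under that assumption, the rounded values agree with the dependency-parsing reductions quoted in the abstract (43%, 38%, 23%, 16%, 28%, and 51%, plus 2% for Turkish against the left-branching baseline).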
Our results are the best published to date for six languages, with supervised model selection: English (improvement from 41.6% directed attachment accuracy to 66.7%, a 43% relative error rate reduction), German (54.4% → 71.8%, a 38% error reduction), Bulgarian (45.6% → 58.3%, a 23% error reduction), Mandarin (50.0% → 58.0%, a 16% error reduction), Turkish (48.0% → 62.4%, a 28% error reduction, but only 2% error reduction from a left-branching baseline, which gives 61.8%), and Portuguese (42.5% → 71.8%, a 51% error reduction). We also demonstrate the success of contrastive estimation at learning to disambiguate part-of-speech tags (from unannotated English text): 78.0% to 88.7% tagging accuracy on a known-dictionary task (a 49% relative error rate reduction), and 66.5% to 78.4% on a more difficult task with less dictionary knowledge (a 35% error rate reduction).

The experiments presented in this thesis give one of the most thorough explorations to date of unsupervised parameter estimation for models of discrete structures. Two sides of the problem are considered in depth: the choice of objective function to be optimized during training, and the method of optimizing it. We find that both are important in unsupervised learning. Our best results on most of the six languages involve both improved objectives and improved search.

The methods presented in this thesis were originally presented in Smith and Eisner (2004, 2005a,b, 2006). The thesis gives a more thorough exposition, relating the methods to other work, presents more experimental results and error analysis, and directly compares the methods to each other.

Thesis committee (∗readers, †advisor):
∗†Jason Eisner (Assistant Professor, Computer Science, Johns Hopkins University)
Dale Schuurmans (Professor, Computing Science, University of Alberta)
Paul Smolensky (Professor, Cognitive Science, Johns Hopkins University)
∗David Yarowsky (Professor, Computer Science, Johns Hopkins University)

For K.R.T.
Acknowledgments

I would have written a shorter letter, but I did not have the time.
Attributed to Cicero (106–43 BCE), Blaise Pascal (1623–1662), Mark Twain (1835–1910), and T. S. Eliot (1888–1965)

First, I would like to thank my advisor, Jason Eisner, for many helpful and insightful technical conversations during the course of my graduate work. He usually knew when to let me tread on my own and when to grab me by the ankles and drag me (kicking and screaming) to the True Path,¹ and when to let me go play in the woods. Most importantly he taught me to do well by doing good. If I can emulate in my entire career half of the rhetorical flair, vision, or enthusiasm he’s displayed in the course of my Ph.D., then I’ll count it a success. Thanks also to Debbie and Talia, who on occasion dined sans Jason because of me.

My committee, consisting of David Yarowsky, Paul Smolensky, Dale Schuurmans, and, of course, Jason Eisner, have been supportive beyond the call of duty. They have provided valuable insight and held this work to a high standard. (Any errors that remain are of course my own.) Other current and former members of the CLSP faculty have kindly spent time and shared ideas with me: Bill Byrne, Bob Frank, Keith Hall, Fred Jelinek,² Damianos Karakos, Sanjeev Khudanpur,² and Zak Shafran. Other professors have contributed to my grad experience by teaching great classes: Yair Amir, Scott Smith, Rao Kosaraju, and Jong-Shi Pang. I thank researchers from other places for their helpful comments and discussions of many kinds at various times over the years: Eric Brill, Eugene Charniak, Michael Collins, Joshua Goodman, Mark Johnson, Rebecca Hwa, Dan Klein, John Lafferty, Chris Manning, Miles Osborne, Dan Roth, and Giorgio Satta; also anonymous reviewers on my papers. Deep gratitude especially to Philip Resnik and Dan Melamed, who provided early encouragement, continued mentoring, sunlight, and water.

My work during the academic years from 2001 through 2006 was supported generously by the Fannie and John Hertz Foundation. While I can’t predict the future impact of the work in these pages, I am certain that it would have been significantly reduced without the rare opportunity for unfettered exploration that this fellowship has offered. I am especially grateful for my annual progress meetings with Lowell Wood; special thanks also to John Holzrichter, Barbara Nichols, and Linda Kubiak. Research included in this dissertation was also supported (during the summers of 2003, 2005, and 2006) by grant number IIS-0313193 from the National Science Foundation to Jason Eisner.

My student colleagues at Hopkins have made the time fun and the atmosphere constructively critical and always creative: a never-ending brainstorm. Their exceptional intelligence needs no mention. Thanks to: Radu “Hans” Florian, Silviu Cucerzan, Rich Wicentowski, Jun Wu, Charles Schafer, Gideon Mann, Shankar Kumar, John Hale, Paola Virga, Ahmed Emami, Peng Xu, Jia Cui, Yonggang Deng, Veera Venkataramani, Woosung Kim, Yu David Liu, Paul Ruhlen (R.I.P.), Elliott Drábek,³ Reza Curtmola, Sourin Das, Geetu Ambwani, David Smith,² Roy Tromble,² Arnab Ghoshal, Lambert Mathias, Erin Fitzgerald, Ali Yazgan, Yi Su, Chal Hathaidharm, Sam Carliles, Markus Dreyer,² Brock Pytlik, Trish Driscoll, Nguyen Bach, Eric Goldlust, John Blatz, Chris White, Nikesh Garera, Lisa Yung, and visitors David Martinez, Pavel Pecina, Filip Jurčíček, and Václav Novák. Comments on drafts of my papers and at my seminar talks were always insightful and helpful. David & Cynthia, Roy & Nicole, and Markus: thanks for the camaraderie and the whiteboard. Thanks generally to the GRO.

¹ This statement is not meant to insinuate that Jason ever inappropriately touched my ankles. He is also not to be blamed for my liberal use of footnotes.
² Special thanks to these individuals, who voluntarily read drafts and gave feedback. Any remaining errors are of course my own.
In the CS Department, Linda Rorke, Nancy Scheeler, Kris Sisson, Conway Benishek, Erin Guest, and Jamie Lurz have made administrivia as painless as they could. In Barton Hall, Laura Graham and Sue Porterfield (the CLSP dream team) are unrivaled in competence and warmth. To Steve Rifkin, Steve DeBlasio, and Brett Anderson in CS, Jacob Laderman and Eiwe Lingefors in CLSP, and Hans, David, and Charles (encore une fois) in the NLP lab: thanks for the cycles.

Many, many friends have given me fond memories of the extra-lab time during my Baltimore years. These include the Book Smattering book group: Rita Turner, Ben Kle-

³ Special thanks to these individuals, who shared knowledge about languages of experimentation.
