Effective Statistical Models for Syntactic and Semantic Disambiguation
EFFECTIVE STATISTICAL MODELS FOR SYNTACTIC AND SEMANTIC DISAMBIGUATION

A dissertation submitted to the Department of Computer Science and the Committee on Graduate Studies of Stanford University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Kristina Nikolova Toutanova
September 2005

© Copyright by Kristina Nikolova Toutanova 2005
All Rights Reserved

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Christopher D. Manning (Principal Adviser)

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Andrew Y. Ng

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Daniel Jurafsky

Approved for the University Committee on Graduate Studies.

Abstract

This thesis focuses on building effective statistical models for disambiguation of sophisticated syntactic and semantic natural language (NL) structures. We advance the state of the art in several domains by (i) choosing representations that encode domain knowledge more effectively and (ii) developing machine learning algorithms that deal with the specific properties of NL disambiguation tasks: sparsity of training data and large, structured spaces of hidden labels.

For the task of syntactic disambiguation, we propose a novel representation of parse trees that connects the words of the sentence with the hidden syntactic structure in a direct way. Experimental evaluation on parse selection for a Head-Driven Phrase Structure Grammar shows that the new representation achieves superior performance compared to previous models. For the task of disambiguating the semantic role structure of verbs, we build a more accurate model, which captures the knowledge that the semantic frame of a verb is a joint structure with strong dependencies between arguments. We achieve this using a Conditional Random Field without Markov independence assumptions on the sequence of semantic role labels.

To address the sparsity problem in machine learning for NL, we develop a method for incorporating many additional sources of information, using Markov chains in the space of words. The Markov chain framework makes it possible to combine multiple knowledge sources, to learn how much to trust each of them, and to chain inferences together. It achieves large gains in the task of disambiguating prepositional phrase attachments.

Acknowledgements

My thesis work would not have been possible without the help of my advisor, other collaborators, and fellow students. I am especially fortunate to have been advised by Chris Manning. Firstly, I am grateful to him for teaching me almost everything I know about doing research and being part of the academic community. Secondly, I deeply appreciate his constant support and advice on many levels. The work in this thesis was profoundly shaped by numerous insightful discussions with him.

I am also very happy to have been able to collaborate with Andrew Ng on random walk models for word dependency distributions. It has been a source of inspiration to interact with someone who has such far-reaching research goals and is an endless source of ideas. He also gave me valuable advice on multiple occasions.
Many thanks to Dan Jurafsky for initially bringing semantic role labeling to my attention as an interesting domain that my research would fit into, for contributing useful ideas, and for helping with my dissertation on short notice. Thanks also to the other members of my thesis defense committee, Trevor Hastie and Francine Chen. I am also grateful to Dan Flickinger and Stephan Oepen for numerous discussions on my work in HPSG parsing.

Being part of the NLP and the greater AI group at Stanford has been extremely stimulating and fun. I am very happy to have shared an office with Dan Klein and Roger Levy for several years, and with Bill McCartney for one year. Dan taught me, among other things, to always aim to win when entering a competition, and to understand things from first principles. Roger was an example of how to be an excellent researcher while maintaining a balanced life with many outside interests. I will miss the heated discussions about research and the fun at NLP lunch. And thanks to Jenny for the Quasi-Newton numerical optimization code. Thanks to Aria Haghighi for collaborating on semantic role labeling models and for staying up until very early in the morning to finish writing up papers.

I would also like to take the chance to express my gratitude to my undergraduate advisor Galia Angelova and my high-school math teacher Georgi Nikov. Although they did not contribute directly to the current effort, I wouldn't be writing this without them. I am indebted in many ways to Penka Markova: most importantly, for being my friend for many years and for teaching me optimism and self-confidence, and additionally, for collaborating with me on the HPSG parsing work. I will miss the foosball games and green tea breaks with Rajat Raina and Mayur Naik. Thanks also to Jason Townsend and Haiyan Liu for being my good friends.

Thanks to Galen Andrew for making my last year at Stanford the happiest. Finally, many thanks to my parents Diana and Nikola, my sister Maria, and my nieces Iana and Diana, for their support, and for bringing me great peace and joy.

I gratefully acknowledge the financial support through the ROSIE project funded by Scottish Enterprise under the Stanford-Edinburgh link programme and through ARDA's Advanced Question Answering for Intelligence (AQUAINT) program.

Contents

Abstract
Acknowledgements

1 Introduction
   1.1 A New Representation for Parse Trees
      1.1.1 Motivation
      1.1.2 Representation
      1.1.3 Summary of Results
   1.2 Mapping Syntactic to Semantic Structures
      1.2.1 General Ideas
      1.2.2 Summary of Results
   1.3 Lexical Relations
      1.3.1 General Ideas
      1.3.2 Summary of Results

2 Leaf Path Kernels
   2.1 Preliminaries
      2.1.1 Support Vector Machines
      2.1.2 Kernels
   2.2 The Redwoods Corpus
      2.2.1 Data Characteristics
   2.3 Related Work
      2.3.1 Penn Treebank Parsing Models
      2.3.2 Models for Deep Grammars
      2.3.3 Kernels for Parse Trees
   2.4 Our Approach
   2.5 Proposed Representation
      2.5.1 Representing HPSG Signs
      2.5.2 The Leaf Projection Paths View of Parse Trees
   2.6 Tree and String Kernels
      2.6.1 Kernels on Trees Based on Kernels on Projection Paths
      2.6.2 String Kernels
   2.7 Experiments
      2.7.1 SVM Implementation
      2.7.2 The Leaf Projection Paths View versus the Context-Free Rule View
      2.7.3 Experimental Results using String Kernels on Projection Paths
      2.7.4 A Note on Application to Penn Treebank Parse Trees
   2.8 Discussion and Conclusions
3 Shallow Semantic Parsing
   3.1 Description of Annotations
   3.2 Previous Approaches to Semantic Role Labeling
   3.3 Ideas of This Work
   3.4 Data and Evaluation Measures
   3.5 Local Classifiers
      3.5.1 Additional Features for Displaced Constituents
      3.5.2 Enforcing the Non-overlapping Constraint
      3.5.3 On Split Constituents
   3.6 Joint Classifiers
      3.6.1 Joint Model Results
   3.7 Automatic Parses
      3.7.1 Using Multiple Automatic Parse Guesses
      3.7.2 Evaluation on the CoNLL 2005 Shared Task
   3.8 Conclusions

4 Estimating Word Dependency Distributions
   4.1 Introduction
   4.2 Markov Chain Preliminaries
   4.3 Related Work on Smoothing
      4.3.1 Estimating Distributions as Limiting Distributions of Markov Chains
   4.4 The Prepositional Phrase Attachment Task
      4.4.1 Dataset
      4.4.2 Previous Work on PP Attachment
   4.5 The PP Attachment Model
      4.5.1 Baseline Model
      4.5.2 Baseline Results
   4.6 Random Walks for PP Attachment
      4.6.1 Formal Model
      4.6.2 Parameter Estimation
      4.6.3 Link Types for PP Attachment
      4.6.4 Random Walk Experiments
      4.6.5 Extended Example
   4.7 Discussion
      4.7.1 Relation to the Transformation Model of Jason Eisner
      4.7.2 Relation to Other Models and Conclusions

Bibliography

List of Figures

1.1 An HPSG derivation and a schematic semantic representation corresponding to it for the sentence Stanford offers a distance-learning course in Artificial Intelligence.

1.2 A syntactic Penn Treebank-style parse tree and Propbank-style semantic annotation for the sentence Stanford offers a distance-learning course in Artificial Intelligence.

1.3 An example word space with similarities for the relation in(noun1, noun2).

2.1 A grammar rule and lexical entries in a toy unification-based grammar.

2.2 Native and derived Redwoods representations for the sentence Do you want to meet on Tuesday? (a) derivation tree using unique rule and lexical item (in bold) identifiers of the source grammar (top), (b) phrase structure tree labelled with user-defined, parameterizable category abbreviations (center), and (c) elementary dependency graph extracted from the MRS meaning representation (bottom).

2.3 Annotated ambiguous sentences used in experiments: the columns are, from left to right, the total number of sentences, average length, and structural ambiguity.