Copyright and Use of This Thesis This Thesis Must Be Used in Accordance with the Provisions of the Copyright Act 1968
Total Page:16
File Type:pdf, Size:1020Kb
COPYRIGHT AND USE OF THIS THESIS This thesis must be used in accordance with the provisions of the Copyright Act 1968. Reproduction of material protected by copyright may be an infringement of copyright and copyright owners may be entitled to take legal action against persons who infringe their copyright. Section 51 (2) of the Copyright Act permits an authorized officer of a university library or archives to provide a copy (by communication or otherwise) of an unpublished thesis kept in the library or archives, to a person who satisfies the authorized officer that he or she requires the reproduction for the purposes of research or study. The Copyright Act grants the creator of a work a number of moral rights, specifically the right of attribution, the right against false attribution and the right of integrity. You may infringe the author’s moral rights if you: - fail to acknowledge the author of this thesis if you quote sections from the work - attribute this thesis to another author - subject this thesis to derogatory treatment which may prejudice the author’s reputation For further information contact the University’s Director of Copyright Services sydney.edu.au/copyright Structured Named Entities Nicky Ringland Supervisor: James Curran A thesis submitted in fulfilment of the requirements for the degree of Doctor of Philosophy in the School of Information Technologies at The University of Sydney School of Information Technologies 2016 i Abstract The names of people, locations, and organisations play a central role in language, and named entity recognition (NER) has been widely studied, and successfully incorporated, into natural language processing (NLP) applications. The most common variant of NER involves identifying and classifying proper noun mentions of these and miscellaneous entities as linear spans in text. Unfortunately, this version of NER is no closer to a detailed treatment of named entities than chunking is to a full syntactic analysis. NER, so con- strued, reflects neither the syntactic nor semantic structure of NE mentions, and provides insufficient categorical distinctions to represent that structure. Representing this nested structure, where a mention may contain mention(s) of other entities, is critical for applications such as coreference resolution. The lack of this structure creates spurious ambiguity in the linear approximation. Research in NER has been shaped by the size and detail of the available annotated corpora. The existing structured named entity corpora are either small, in specialist domains, or in languages other than English. This thesis presents our Nested Named Entity (NNE) corpus of named entities and numerical and temporal expressions, taken from the WSJ portion of the Penn Treebank (PTB, Marcus et al., 1993). We use the BBN Pronoun Coreference and Entity Type Corpus (Weischedel and Brunstein, 2005a) as our basis, manu- ally annotating it with a principled, fine-grained, nested annotation scheme and detailed annotation guidelines. The corpus comprises over 279,000 entities over 49,211 sentences (1,173,000 words), including 118,495 top-level entities. Our annotations were designed using twelve high-level principles that guided the development of the annotation scheme and difficult decisions for annotators. We also monitored the semantic grammar that was being induced during annotation, seeking to identify and reinforce common patterns to main- tain consistent, parsimonious annotations. ii The result is a scheme of 118 hierarchical fine-grained entity types and nesting rules, covering all capitalised mentions of entities, and numerical and temporal expressions. Unlike many corpora, we have developed detailed guidelines, including extensive discussion of the edge cases, in an ongoing dia- logue with our annotators which is critical for consistency and reproducibility. We annotated independently from the PTB bracketing, allowing annotators to choose spans which were inconsistent with the PTB conventions and errors, and only refer back to it to resolve genuine ambiguity consistently. We merged our NNE with the PTB, requiring some systematic and one-off changes to both annotations. This allows the NNE corpus to complement other PTB resources, such as PropBank, and inform PTB-derived corpora for other formalisms, such as CCG and HPSG. We compare this corpus against BBN. We consider several approaches to integrating the PTB and NNE annotations, which affect the sparsity of grammar rules and visibility of syntactic and NE structure. We explore their impact on parsing the NNE and merged variants using the Berkeley parser (Petrov et al., 2006), which performs surprisingly well without specialised NER features. We experiment with flattening the NNE annotations into linear NER variants with stacked categories, and explore the ability of a maximum entropy and a CRF NER system to reproduce them. The CRF performs substantially better, but is infeasible to train on the enormous stacked category sets. The flattened output of the Berkeley parser are almost competitive with the CRF. Our results demonstrate that the NNE corpus is feasible for statistical models to reproduce. We invite researchers to explore new, richer models of (joint) parsing and NER on this complex and challenging task. Our nested named entity corpus will improve a wide range of NLP tasks, such as coreference resolution and question answering, allowing automated systems to understand and exploit the true structure of named entities. Acknowledgements The completion of this thesis has been a long journey, and I am immensely glad to have had wonderful travelling companions. Foremost thanks must go to James Curran, without whose encouragement, support and guidance I would never have embarked on this journey. Thank you for introducing me to computer science, for nurturing (and occasionally rekindling) my love of computational linguistics, and for encouraging my passion for education. Thank you to all members of Schwa Lab past and present who have helped me in so many ways. I have been incredibly fortunate to work with such a wonderful group of friends and colleagues. Thanks especially to Ben Hachey, whose extra guidance at the pointy end of things was particularly appreciated, Matt Honnibal, especially for giving me that first glimmer of hope that a lin- guistics student can add that extra ’computational’ adjective, David Vadas, Tara McIntosh, Jonathan Kummerfeld, Stephen Merity, Tim O’Keefe, Daniel Tse, Will Radford, Dominick Ng, James Constable, Glen Pink, and to Tim Dawborn, without whom I would have neither servers to run my experiments on, nor an NER system to evaluate on. Katie Bell, thank you for helping me debug my very first Python program, and for being an inspiration henceforth. Joel Nothman, thank you for the many stimulating discussions, cups of tea, con- tinued encouragement, and for embodying the epitome of a research student. My work is much the better thanks to the good example you have always set. Kellie Webster, thank you for being my thesis writing buddy, for helping me iii iv find words when I was stuck, for keeping me accountable to my deadlines, for the many hours of annotation and proof-reading and for countless other ways your presence and efforts have improved the past few years. My work rests on that of my annotators: Kellie Webster, Vivian Li, Joanne Yang and Kristy Hughes. Thank you for the many hours of discussion, and for battling through all 50,000 WSJ articles with me. I hope that, in time, you can start to enjoy reading newspaper articles again without too many flashbacks. I gratefully acknowledge the funding received from the Tempe Mann Trav- elling Scholarship, which enabled me to take a mini-sabbatical to Edinburgh, where my research was greatly improved by the efforts of Bonnie Webber, Mark Steedman and the entire computational linguistics group. Thanks also to the research communities of Cambridge, especially Stephen Clark, and the Univer- sity of Texas at Austin, especially Jason Baldridge. Thank you also to the many local academics who have advised, guided and influenced me along the way, especially Alan Fekete and Jon Patrick, and to the support, help navigating various bureaucratic hurdles, and friendship from all the SIT staff, especially Josie Spongberg and Evelyn Riegler. To the women of SIT, especially Georgina Wilcox, Emma Fitzgerald, Shagh- ayegh Sharif Nabavi, Mahboobeh Moghaddam, and Tara Babaie, thank you for always being ready with tea and chocolate. To the National Computer Science School and Girls’ Programming Network tutors, students and teachers, whose encouragement was always appreciated, even when phrased as the question: Have you finished yet? The challenges of a PhD can only be overcome when balanced by incredible friends. Thank you all for helping in more ways than you are probably aware, from much-needed stress relief, to technical support, distractions, laughter and copious amounts of chocolate. Jonathan Usmar, thank you for encouraging me to actually give this computer science thing a try. Lachlan Howe and James v Bailey, thank you for the music and camaraderie. Sophia Di Marco, thank you for always being there, and for your unfaltering confidence in me. Finally, thank you to my family for being constantly supportive, for raising me with a love of learning, and for being encouraging and patient even while not understanding my work. And to Sam, perhaps the most patient of all, who has been by my side throughout every minute of this PhD. Thank you. Statement of compliance I certify that: I have read and understood the University of Sydney Student • Plagiarism: Coursework Policy and Procedure; I understand that failure to comply with the Student Pla- • giarism: Coursework Policy and Procedure can lead to the University commencing proceedings against me for poten- tial student misconduct under Chapter 8 of the University of Sydney By-Law 1999 (as amended); this Work is substantially my own, and to the extent that any • part of this Work is not my own I have indicated that it is not my own by Acknowledging the Source of that part or those parts of the Work.