Screening Procedures Annotation? COMPOUND NOUN
Total Page:16
File Type:pdf, Size:1020Kb
! ! ! ! ! Lexical Semantic Analysis in Natural Language Text Nathan Schneider CMU-LTI-14-001 ! Language Technologies Institute School of Computer Science Carnegie Mellon University 5000 Forbes Ave., Pittsburgh, PA 15213 www.lti.cs.cmu.edu! ! ! Thesis Committee:! Noah A. Smith (chair), Carnegie Mellon University Chris Dyer, Carnegie Mellon University Eduard Hovy, Carnegie Mellon University Lori Levin, Carnegie Mellon University Timothy Baldwin, University! of Melbourne ! ! Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy In Language and Information! Technologies ! ! © 2014, Nathan Schneider Lexical Semantic Analysis in Natural Language Text Nathan Schneider Language Technologies Institute School of Computer Science Carnegie Mellon University ◇ September 30, 2014 Submitted in partial fulfillment of the requirements for the degree of doctor of philosophy in language and information technologies Abstract Computer programs that make inferences about natural language are easily fooled by the often haphazard relationship between words and their meanings. This thesis develops Lexical Semantic Analysis (LxSA), a general-purpose framework for describing word groupings and meanings in context. LxSA marries comprehensive linguistic annotation of corpora with engineering of statistical natural lan- guage processing tools. The framework does not require any lexical resource or syntactic parser, so it will be relatively simple to adapt to new languages and domains. The contributions of this thesis are: a formal representation of lexical segments and coarse semantic classes; a well-tested linguistic annotation scheme with detailed guidelines for identifying multi- word expressions and categorizing nouns, verbs, and prepositions; an English web corpus annotated with this scheme; and an open source NLP system that automates the analysis by statistical se- quence tagging. Finally, we motivate the applicability of lexical se- mantic information to sentence-level language technologies (such as semantic parsing and machine translation) and to corpus-based linguistic inquiry. i I, Nathan Schneider, do solemnly swear that this thesis is my own work, and that everything therein is, to the best of my knowledge, true, accurate, and properly cited, so help me Vinken. ∗ Except for the footnotes that are made up. If you cannot handle jocular asides, please∗ recompile this thesis with the \jnote macro disabled. ii To my parents, who gave me language. § In memory of Charles “Chuck” Fillmore, one of the great linguists. iii Acknowledgments As a fledgling graduate student, one is confronted with obligatory words of advice, admonitions, and horror stories from those who have already embarked down that path. At the end of the day,2 though, I am forced to admit that my grad school experience has been perversely happy. Some of that can no doubt be attributed to my own bizarre priorities in life, but mostly it speaks to the qualities of those around me at CMU, who have been uniformly brilliant, committed, enthusiastic, and generous in their work. First and foremost, I thank my Ph.D. advisor, Noah Smith, for serving as a researcher role model, for building and piloting a re- splendent research group (ARK), and for helping me steer my re- search while leaving me the freedom to develop my own style. I thank the full committee (Noah, Chris, Lori, Ed, and Tim), not just for their thoughtful consideration and feedback on the thesis itself, but for their teaching and insights over the years that have influ- enced how I think about language and its relationship to technology. It was my fellow ARK students that had the biggest impact on my day-to-day experiences at CMU. Mike Heilman, Shay Cohen, Dipanjan Das, Kevin Gimpel, André Martins, Tae Yano, Brendan O’Connor, Dani Yogatama, Jeff Flanigan, David Bamman, Yanchuan 2“day” being a euphemism for 6 years v Sim, Waleed Ammar, Sam Thomson, Lingpeng Kong, Jesse Dodge, Swabha Swayamdipta, Rohan Ramanath, Victor Chahuneau, Daniel Mills, and Dallas Card—it was an honor. My officemates in GHC 5713—Dipanjan, Tae, Dani, Waleed, and (briefly) Sam and Lingpeng— deserve a special shout-out for making it an interesting place to work. It was a privilege to be able to work with several different ARK postdocs—Behrang Mohit, (pre-professor) Chris Dyer, and Fei Liu—and always a pleasure to talk with Bryan Routledge at group meetings. But oh, the list doesn’t stop there. At CMU I enjoyed working with amazing annotators—Spencer Onuffer, Henrietta Conrad, Nora Ka- zour, Mike Mordowanec, and Emily Danchik—whose efforts made this thesis possible, and with star undergraduates including De- sai Chen and Naomi Saphra. I enjoyed intellectual exchanges (in courses, reading groups, collaborations, and hallway conversations) with Archna Bhatia, Yulia Tsvetkov, Dirk Hovy, Meghana Kshirsagar, Chu-Cheng Lin, Ben Lambert, Marietta Sionti, Seza Do˘gruöz, Jon Clark, Michael Denkowski, Matt Marge, Reza Bosagh Zadeh, Philip Gianfortoni, Bill McDowell, Kenneth Heafield, Narges Razavian, Nesra Yannier, Wang Ling, Manaal Faruqui, Jayant Krishnamurthy, Justin Cranshaw, Jesse Dunietz, Prasanna Kumar, Andreas Zollmann, Nora Presson, Colleen Davy, Brian MacWhinney, Charles Kemp, Ja- cob Eisenstein, Brian Murphy, Bob Frederking, and many others. And I was privileged to take courses with the likes of Noah, Lori, Tom Mitchell, William Cohen, Yasuhiro Shirai, and Leigh Ann Sudol, and to have participated in projects with Kemal Oflazer, Rebecca Hwa, Ric Crabbe, and Alan Black. I am especially indebted to Vivek Srikumar, Jena Hwang, and everyone else at CMU and elsewhere who has contributed to the preposition effort described in ch. 5. The summer of 2012 was spent basking in the intellectual warmth vi of USC’s Information Sciences Institute, where I did an internship under the supervision of Kevin Knight, and had the good fortune to cross paths with Taylor Berg-Kirkpatrick (and his cats), Daniel Bauer, Bevan Jones, Jacob Andreas, Karl Moritz Hermann, Christian Buck, Liane Guillou, Ada Wan, Yaqin Yang, Yang Gao, Dirk Hovy, Philipp Koehn, and Ulf Hermjakob. That summer was the beginning of my involvement in the AMR design team, which includes Kevin and Ulf as well as Martha Palmer, Claire Bonial, Kira Griffitt, and several others; I am happy to say that the collaboration continues to this day. The inspiration that pushed me to study computational linguis- tics and NLP in the first place came from teachers and mentors at UC Berkeley, especially Jerry Feldman and Dan Klein. I am happy to have remained connected to that community during my Ph.D., and to have learned a great deal about semantics from the likes of Collin Baker, Miriam Petruck, Nancy Chang, Michael Ellsworth, and Eve Sweetser. Finally, I thank my family for being there for me and putting up with my eccentricities. The same goes for my friends men- tioned above, as well as from the All University Orchestra (Smitha Prasadh, Tyson Price, Deepa Krishnaswamy), from Berkeley linguis- tics (Stephanie Shih, Andréa Davis, Kashmiri Stec, Cindy Lee), and from NLP conferences. This thesis would not have been possible without the support of research funding sources;3 all who have contributed to the smooth functioning of CMU SCS and LTI, especially Jaime Carbonell (our fearless leader) and all the LTI staff; as well as Pittsburgh eating 3The research at the core of this thesis was supported in part by NSF CAREER grant IIS-1054319, Google through the Reading is Believing project at CMU, DARPA grant FA8750-12-2-0342 funded under the DEFT program, and grant NPRP-08-485- 1-083 from the Qatar National Research Fund (a member of the Qatar Foundation). The statements made herein are solely the responsibility of the authors. vii establishments, especially Tazza d’Oro, the 61C Café, and Little Asia. There are undoubtedly others that I should have mentioned above, so I hereby acknowledge them. If in this document contains any errors, omissions, bloopers, dangling modifiers, wardrobe malfunctions, etc., it is because I screwed up. It is emphatically not the fault of my committee or reviewers if a bug in my code caused low numbers to round up, high numbers to round down, or expletives to be inserted in phonologi- cally suspect syllable positions. viii Contents Abstract i Acknowledgments v 1 Setting the Stage 1 1.1 Task Definition ....................... 3 1.2 Approach .......................... 5 1.3 Guiding Principles ..................... 6 1.4 Contributions ....................... 8 1.5 Organization ........................ 8 2 General Background: Computational Lexical Semantics 11 2.1 The Computational Semantics Landscape . 12 2.2 Lexical Semantic Categorization Schemes . 14 2.3 Lexicons and Word Sense Classification . 15 2.4 Named Entity Recognition and Sequence Models . 20 2.5 Conclusion ......................... 24 I Representation & Annotation 25 3 Multiword Expressions 27 ix 3.1 Introduction ........................ 28 3.2 Linguistic Characterization ................ 31 3.3 Existing Resources ..................... 39 3.4 Taking a Different Tack . 46 3.5 Formal Representation . 50 3.6 Annotation Scheme .................... 52 3.7 Annotation Process .................... 54 3.8 The Corpus ......................... 58 3.9 Conclusion ......................... 61 4 Noun and Verb Supersenses 63 4.1 Introduction ........................ 64 4.2 Background: Supersense Tags . 64 4.3 Supersense Annotation for Arabic . 67 4.4 Supersense Annotation for English . 72 4.5 Related Work: Copenhagen Supersense Data