STANFORD UNIVERSITY

Categorization by Character-Level Models:
Exploiting the Sound Symbolism of Proper Names

A thesis submitted in partial satisfaction of the requirements for the degree Master of Science in Symbolic Systems

by

Joseph Robert Smarr

Committee in charge:
Professor Christopher D. Manning, Advisor, Primary Thesis Reader
Professor Pat Langley, Second Thesis Reader

Copyright © Joseph Robert Smarr 2003
All rights reserved.

The Master's Thesis of Joseph Robert Smarr is approved, and it is acceptable in quality and form for a master's project in Symbolic Systems.

Primary thesis reader
Secondary thesis reader

TABLE OF CONTENTS

Abstract
Acknowledgements
Chapter 1: The challenge of unknown words
    1.1. Quantifying unknown words
    1.2. The importance of recognizing proper noun phrases
    1.3. Coping with unknown words
    1.4. Can names be classified without context?
Chapter 2: A probabilistic model of proper noun phrases
    2.1. Formalization of problem
    2.2. Model used for classification
Chapter 3: Experiments in classifying proper noun phrases
    3.1. Data sets used in experiments
    3.2. Classifying drugs, companies, movies, places, and people
    3.3. Generalizing to new categories
    3.4. Distinguishing biological names (genes, proteins, etc.)
    3.5. Distinguishing bands and artists
    3.6. Distinguishing car models and computer models
    3.7. Comparison to human-level performance
Chapter 4: Analysis of experimental results
    4.1. Sources of erroneous classification
    4.2. Contribution of model features
    4.3. Impact of other model parameters
    4.4. Generation of novel PNPs
Chapter 5: Combining segmentation and classification
    5.1. Challenges for named entity recognition
    5.2. A character-level HMM
    5.3. A character-feature based classifier
    5.4. A character-based CMM
    5.5. Discussion
Chapter 6: How and why proper names can be distinguished
    6.1. Relation to sound symbolism
    6.2. PNP classification as language identification
    6.3. The business of creating names
Chapter 7: Conclusions
References
Appendix
    Size and source of each PNP data set
    Answers to PNP challenge in Section 4.1

Abstract

This thesis presents a context-, domain-, and language-independent method for classifying proper names into semantic categories. The core observation is that many proper names contain character sequences that are distinctive of their category. There is, in effect, a surprisingly strong sound-symbolic relationship between names and the things they describe, which can be uncovered with machine learning methods. While previous work has exploited a portion of this form-meaning relationship (modeling suffixes, patterns of capitalization or punctuation, etc.), this thesis presents a generalized approach in which character-level subsequences are the basic unit of evidence and every piece of a proper noun can be used as input. The result is a highly general, robust, and accurate classifier that can categorize proper names without surrounding context and straightforwardly handle previously unseen words. We describe a probabilistic generative model based on n-gram character sequences and present experiments classifying a wide variety of names that include companies, people, places, pharmaceutical drugs, movie titles, proteins, car models, musical artists, and diseases. We show how this same model can be used to generate new proper names that mimic a given category. We then generalize this model to perform segmentation alongside classification, resulting in a named-entity recognizer that achieves competitive results in both English and German. Finally, we discuss the relationship of this work to sound symbolism, language identification, and professional brand-name creation, investigating the origins of the self-descriptive naming conventions that our model is able to capture.

Acknowledgements

I wish to extend my most sincere appreciation to all the people who have educated, supported, and challenged me throughout my career at Stanford. First, I wish to thank my parents and brother, who have provided me with so many opportunities, and who have always encouraged me and provided perspective. I thank also my dearest beloved girlfriend Michelle, who has patiently and cheerfully stuck by me through many late nights of research and manic bouts of work.
Second, I wish to thank my academic peers, most notably Dan Klein, who have spent countless hours debating the details of this research with me (and who pointed out many key insights), and my professors, who have taught me and spent considerable time with me outside the classroom, despite my often over-eager personality.

In particular, I wish to single out three professors who have had a particularly profound impact on me. Pat Langley took me under his wing as a young student excited by artificial intelligence, and taught me to be passionately dispassionate about following any one school of thought or method du jour. It was a valuable lesson that has led me to explore a wide variety of approaches to AI problems, and to appreciate the similarities and differences between them. Ivan Sag transformed my interest in language into a passion for linguistics, and specifically for developing precise linguistic theories that accurately describe a broad range of phenomena. It was this perspective that honed my desire to build language-processing computer programs as concrete instantiations of linguistic hypotheses. And, most importantly, Chris Manning acted as my research mentor for two years, echoing my enthusiasm for building practical natural language technology by gaining a deeper understanding of both machine learning and linguistics. Chris gave me supreme flexibility to pursue my interests, and yet he always had valuable insights to share and knowledge of related work to point me to. He funded my coterminal 5th year at Stanford as a research assistant, and despite some early disappointments in getting papers accepted for publication, he never stopped believing in me and always pushed me to do my very best.

Finally, I want to thank the Symbolic Systems Program itself for creating such a rare and wonderful framework in which I could pursue my inherently interdisciplinary interests.

Chapter 1: The challenge of unknown words

A major challenge in building robust, wide-coverage natural language processing (NLP) systems is coping with the unrelenting stream of unknown words. Most methods rely on observing repeated occurrences of data in order to measure correlations between relevant pieces of information (either tacitly when building rules by hand or explicitly in machine learning approaches). For example, statistical part-of-speech taggers usually predict tags by remembering the most common part of speech for each word observed in a large, manually labeled corpus of word-tag pairs. When an unknown word is encountered, no such information is available, and the system is forced to guess (Weischedel et al. 1993). Thus, unknown words represent a significant source of error for many NLP systems.
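The lookup-and-guess behavior described above can be made concrete with a minimal sketch (a hypothetical illustration, not code from this thesis): a most-frequent-tag baseline that memorizes each training word's most common tag and falls back to a fixed guess for unseen words. The toy corpus, function names, and the choice of "NNP" as the fallback tag are illustrative assumptions.

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_corpus):
    """Record, for each word, the tag it most often received in training.

    tagged_corpus: iterable of (word, tag) pairs from a labeled corpus.
    """
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_word(word, lexicon, default_tag="NNP"):
    """Look the word up; if it never appeared in training, we can only guess.

    The fixed fallback guess (here 'NNP', an assumed heuristic for unknown
    capitalized tokens) will often be wrong, which is exactly the
    unknown-word problem discussed above.
    """
    return lexicon.get(word, default_tag)

# Toy usage: "Prilosec" never occurs in training, so the tagger must guess.
lexicon = train_most_frequent_tag([("the", "DT"), ("dog", "NN"), ("runs", "VBZ")])
print(tag_word("dog", lexicon))       # -> NN (seen in training)
print(tag_word("Prilosec", lexicon))  # -> NNP (pure guess for an unseen word)
```

The point of the sketch is that the baseline has no evidence at all about an unseen word's identity; the approach developed in this thesis instead draws evidence from the characters inside the word itself.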