MULTILINGUAL DISTRIBUTIONAL LEXICAL SIMILARITY

DISSERTATION

Presented in Partial Fulfillment of the Requirements for

the Degree Doctor of Philosophy in the

Graduate School of The Ohio State University

By

Kirk Baker, B.A., M.A.

*****

The Ohio State University 2008

Dissertation Committee: Approved by

Chris Brew, Advisor

James Unger

Mike White Advisor

Graduate Program in Linguistics c Copyright by Kirk Baker 2008 ABSTRACT

One of the most fundamental problems in natural language processing involves words that are not in the dictionary, or unknown words. The supply of unknown words is virtually unlimited – proper names, technical jargon, foreign borrowings, newly created words, etc. – meaning that lexical resources like dictionaries and thesauri inevitably miss important vocabulary items. However, manually creating and main- taining broad coverage dictionaries and ontologies for natural language processing is expensive and difficult. Instead, it is desirable to learn them from distributional lex- ical information such as can be obtained relatively easily from unlabeled or sparsely labeled text corpora. Rule-based approaches to acquiring or augmenting repositories of lexical information typically offer a high precision, low recall methodology that fails to generalize to new domains or scale to very large data sets. Classification-based ap- proaches to organizing lexical material have more promising scaling properties, but require an amount of labeled training data that is usually not available on the neces- sary scale. This dissertation addresses the problem of learning an accurate and scalable lexical classifier in the absence of large amounts of hand-labeled training data. One approach to this problem involves using a rule-based system to generate large amounts of data that serve as training examples for a secondary lexical classifier. The viability of this approach is demonstrated for the task of automatically identifying English loanwords in Korean. A set of rules describing changes English words undergo when

ii they are borrowed into Korean is used to generate training data for an etymological classification task. Although the quality of the rule-based output is low, on a sufficient scale it is reliable enough to train a classifier that is robust to the deficiencies of the original rule-based output and reaches a level of performance that has previously been obtained only with access to substantial hand-labeled training data. The second approach to the problem of obtaining labeled training data uses the output of a statistical parser to automatically generate lexical-syntactic co-occurrence features. These features are used to partition English verbs into lexical semantic classes, producing results on a substantially larger scale than any previously reported and yielding new insights into the properties of verbs that are responsible for their lexical categorization. The work here is geared towards automatically extending the coverage of verb classification schemes such as Levin, VerbNet, and FrameNet to other verbs that occur in a large .

iii ACKNOWLEDGMENTS

I am indebted primarily to my dissertation advisor Chris Brew who supported me for four years as a research assistant on his NSF grant “Hybrid methods for acquisition and tuning of lexical information”. Chris introduced me to the whole idea of statistical and its applications to large-scale natural language processing. He gave me an enormous amount of freedom to explore a variety of projects as my interests took me, and I am grateful to him for all of these things. I am grateful to James Unger for generously lending his time to wide-ranging discussions of the ideas in this dissertation and for giving me a bunch of additional ideas for things to try with Japanese word processing. I am grateful to Mike White for carefully reading several drafts of my dissertation, each time offering feedback which crucially improved both the ideas contained in the dissertation and their presentation. His questions and comments substantially improved the overall quality of my dissertation and were essential to its final form. I am grateful to my colleagues at OSU. Hiroko Morioka contributed substan- tially to my understanding of statistical modeling and to the formulation of many of the ideas in the dissertation. Eunjong Kong answered a bunch of my questions about English loanwords in Korean and helped massively with revising the presentation of the material in Chapter 2 for other venues. I am grateful to Jianguo Li for lots of discussion about automatic English verb classification, and for sharing scripts and data.

iv VITA

1998 ...... B.A., Linguistics, University of North Carolina at Chapel Hill

2001 ...... M.A., Linguistics, University of North Carolina at Chapel Hill

2003 - 2007 ...... Research Assistant, The Ohio State University

2008 ...... Presidential Fellow, The Ohio State University

PUBLICATIONS

1. Kirk Baker and Chris Brew (2008). Statistical identification of English loan- words in Korean using automatically generated training data. In Proceedings of The Sixth International Conference on Language Resources and Evaluation (LREC). Marrakech, Morocco.

FIELDS OF STUDY Major Field: Linguistics

Specialization: Computational Linguistics

v TABLE OF CONTENTS

Page

Abstract ...... ii

Acknowledgments ...... iv

Vita ...... v

List of Figures ...... xi

List of Tables ...... xiii

1 Introduction ...... 1 1.1 Overview...... 1 1.2 GeneralMethodology...... 2 1.2.1 LoanwordIdentification ...... 3 1.2.2 Distributional Verb Similarity ...... 3 1.3 Structure of Dissertation and Summary of Contributions ...... 5

2 Descriptive Analysis of English Loanwords in Korean ...... 8 2.1 ConstructionoftheDataSet...... 8 2.1.1 Romanization ...... 10 2.1.2 PhonemicRepresentation ...... 12 2.1.2.1 SourceofPronunciations...... 12 2.1.2.2 Standardizing Pronunciations ...... 13 2.1.3 Alignments ...... 16 2.2 Analysis of English Loanwords in Korean ...... 18 2.3 Conclusion...... 27

3 English-to-Korean Transliteration ...... 29 3.1 Overview...... 29 3.2 Previous Research on English-to-Korean Transliteration...... 29 3.2.1 Grapheme-Based English-to-Korean Transliteration Models . . 30 3.2.1.1 LeeandChoi(1998);Lee(1999) ...... 30 3.2.1.2 KangandChoi(2000a,b) ...... 32

vi 3.2.1.3 KangandKim(2000) ...... 34 3.2.2 Phoneme-Based English-to-Korean Transliteration Models . . 35 3.2.2.1 Lee(1999);Kang(2001) ...... 35 3.2.2.2 Jung,Hong,andPaek(2000) ...... 37 3.2.3 Ortho-phonemic English-to-Korean Transliteration Models . . 39 3.2.3.1 OhandChoi(2002) ...... 39 3.2.3.2 Oh and Choi (2005); Oh, Choi, and Isahara (2006) . 42 3.2.4 SummaryofPreviousResearch ...... 44 3.3 Experiments on English-to-Korean Transliteration ...... 47 3.3.1 ExperimentOne ...... 48 3.3.1.1 Purpose ...... 48 3.3.1.2 Description of the Transliteration Model ...... 48 3.3.1.3 ExperimentalSetup ...... 52 3.3.1.4 ResultsandDiscussion...... 52 3.3.2 ExperimentTwo ...... 54 3.3.2.1 Purpose ...... 54 3.3.2.2 DescriptionoftheModel...... 55 3.3.2.3 ExperimentalSetup ...... 56 3.3.2.4 ResultsandDiscussion...... 56 3.3.3 ExperimentThree ...... 58 3.3.3.1 Purpose ...... 58 3.3.3.2 DescriptionoftheModel...... 59 3.3.3.3 ExperimentalSetup ...... 63 3.3.3.4 ResultsandDiscussion...... 64 3.3.4 ErrorAnalysis...... 68 3.3.5 Conclusion...... 69

4 Automatically Identifying English Loanwords in Korean ..... 73 4.1 Overview...... 73 4.2 PreviousResearch...... 74 4.3 CurrentApproach...... 76 4.3.1 Bayesian Multinomial Logistic Regression ...... 77 4.3.2 NaiveBayes...... 86 4.4 Experiments on Identifying English Loanwords in Korean ...... 87 4.4.1 ExperimentOne ...... 87 4.4.1.1 Purpose ...... 87 4.4.1.2 ExperimentalSetup ...... 87 4.4.1.3 Results ...... 88 4.4.2 ExperimentTwo ...... 89 4.4.2.1 Purpose ...... 89 4.4.2.2 ExperimentalSetup ...... 89 4.4.2.3 Results ...... 90 4.4.3 ExperimentThree ...... 91

vii 4.4.3.1 Purpose ...... 91 4.4.3.2 ExperimentalSetup ...... 91 4.4.3.3 Results ...... 92 4.5 Conclusion...... 93

5 Distributional Verb Similarity ...... 95 5.1 Overview...... 95 5.2 PreviousWork ...... 97 5.2.1 SchulteimWalde(2000) ...... 97 5.2.2 MerloandStevenson(2001) ...... 98 5.2.3 Korhonen,Krymolowski, andMarx(2003) ...... 100 5.2.4 LiandBrew(2008)...... 100 5.3 CurrentApproach...... 101 5.3.1 Scope and Nature of Verb Classifications ...... 101 5.3.2 Nature of the Evaluation of Verb Classifications ...... 102 5.3.3 Relation to Other Lexical Acquisition Tasks ...... 103 5.3.4 AdvantagesoftheCurrentApproach ...... 104 5.4 Components of Distributional Verb Similarity ...... 105 5.4.1 Representation of Lexical Context ...... 106 5.4.2 Bag-of-WordsContextModels ...... 106 5.4.3 Grammatical Relations Context Models ...... 107 5.5 Evaluation...... 109 5.5.1 Application-Based Evaluation ...... 109 5.5.2 Evaluation Against Human Judgments ...... 110 5.5.3 Evaluation Against an Accepted Standard ...... 111 5.5.3.1 Levin(1993) ...... 112 5.5.3.2 VerbNet...... 113 5.5.3.3 FrameNet ...... 113 5.5.3.4 WordNet ...... 113 5.5.3.5 Roget’sThesaurus ...... 114 5.5.3.6 Comparison of Verb Classification Schemes ...... 114 5.6 Measures of Distributional Similarity ...... 122 5.6.1 Set-Theoretic Similarity Measures ...... 124 5.6.1.1 Jaccard’sCoefficient ...... 125 5.6.1.2 Dice’sCoefficient ...... 125 5.6.1.3 OverlapCoefficient ...... 126 5.6.1.4 SetCosine...... 126 5.6.2 Geometric Similarity Measures ...... 126 5.6.2.1 L1 Distance ...... 127 5.6.2.2 L2 Distance ...... 127 5.6.2.3 Cosine...... 128 5.6.2.4 General Comparison of Geometric Measures . . . . . 128 5.6.3 Information Theoretic Similarity Measures ...... 129

viii 5.6.3.1 InformationRadius...... 130 5.7 FeatureWeighting ...... 131 5.7.1 Intrinsic Feature Weighting ...... 131 5.7.1.1 BinaryVectors ...... 131 5.7.1.2 VectorLengthNormalization ...... 132 5.7.1.3 ProbabilityVectors...... 132 5.7.2 Extrinsic Feature Weighting ...... 133 5.7.2.1 Correlation ...... 133 5.7.2.2 Inverse Feature Frequency ...... 134 5.7.2.3 LogLikelihood ...... 135 5.8 FeatureSelection ...... 136 5.8.1 FrequencyThreshold ...... 136 5.8.2 Conditional Frequency Threshold ...... 137 5.8.3 Dimensionality Reduction ...... 138

6 Experiments on Distributional Verb Similarity ...... 139 6.1 DataSet...... 140 6.1.1 Verbs...... 140 6.1.2 Corpus...... 142 6.2 EvaluationMeasures ...... 142 6.2.1 Precision...... 143 6.2.2 InverseRankScore ...... 145 6.3 FeatureSets...... 147 6.3.1 DescriptionofFeatureSets...... 147 6.3.2 FeatureExtractionProcess...... 150 6.4 Experiments...... 157 6.4.1 SimilarityMeasures...... 158 6.4.1.1 Set Theoretic Similarity Measures ...... 158 6.4.1.2 GeometricMeasures ...... 160 6.4.1.3 Information Theoretic Measures ...... 162 6.4.1.4 Comparison of Similarity Measures ...... 164 6.4.2 FeatureWeighting ...... 166 6.4.3 VerbScheme ...... 169 6.4.4 FeatureSet ...... 171 6.5 Relation Between Experiments and Existing Resources ...... 174 6.6 Conclusion...... 176

7 Conclusion ...... 178 7.1 Transliteration of English Loanwords in Korean ...... 178 7.2 Identification of English Loanwords in Korean ...... 179 7.3 Distributional Verb Similarity ...... 180

Appendices ...... 183

ix A English-to-Korean Standard Conversion Rules ...... 183 B Distributed Calculation of a Pairwise Distance Matrix ...... 187 C Full Results of Verb Classification Experiments using Binary Features 191 D Full Results of Verb Classification Experiments using Geometric Mea- sures...... 196 E Full Results of Verb Classification Experiments using Geometric Mea- sures...... 201 F Results of Verb Classification Experiments using Inverse Rank Score . 206

Bibliography ...... 212

x LIST OF FIGURES

Figure Page 2.1 Exampleloanwordalignment ...... 10 2.2 Correlation between number of loanword vowel spellings in English and Korean...... 26

3.1 Feature representation of English graphemes ...... 43 3.2 Example rule-based transliteration automaton for cactus ...... 56 3.3 Performance of three transliteration models as a function of training data size...... 65 3.4 Example probabilistic transliteration automaton for cactus ...... 66 3.5 Performance of the statistical decision list model producing multiple translit- eration candidates as a function of training data size ...... 67 3.6 Number of unique Roman letter words by number of Chinese characters intheChineseGigawordCorpus(CNA2004) ...... 72

4.1 Standardlogisticsigmoidfunction...... 80 4.2 Normal probability distribution densities for two possible values of µ .. 83 4.3 Density of the normal (dashed line) and Laplacian distributions with the samemeanandvariance ...... 84 4.4 Classifier accuracy trained on pseudo-English loanwords and classifying actualEnglishloanwords...... 90 4.5 Classifier accuracy trained on pseudo-English loanwords and pseudo-Korean items...... 92

5.1 Distribution of verb senses assigned by the five classification schemes. The x-axis shows the number of senses and the y-axis shows the number of verbs115 5.2 Distribution of class sizes. The x-axis shows the class size, and the y-axis shows the number of classes of a given size ...... 117 5.3 Distribution of neighbors per verb. The x-axis shows the number of neigh- bors, and the y-axis shows the number of verbs that have a given number ofneighbors...... 119

6.1 Precision at levels of k forthreeverbs...... 144 6.2 Featuregrowthrateonalogscale...... 156

xi C.1 Classification results for Levin verbs using binary features ...... 191 C.2 Classification results for VerbNet verbs using binary features ...... 192 C.3 Classification results for FrameNet verbs using binary features ...... 193 C.4 Classification results for Roget verbs using binary features ...... 194 C.5 Classification results for WordNet verbs using binary features ...... 195

D.1 Classification results for Levin verbs using geometric distance measures . 196 D.2 Classification results for VerbNet verbs using geometric distance measures 197 D.3 Classification results for FrameNet verbs using geometric distance measures198 D.4 Classification results for WordNet verbs using binary features ...... 199 D.5 Classification results for Roget verbs using binary features ...... 200

E.1 Classification results for Levin verbs using information theoretic distance measures...... 201 E.2 Classification results for VerbNet verbs using information theoretic dis- tancemeasures ...... 202 E.3 Classification results for FrameNet verbs using information theoretic dis- tancemeasures ...... 203 E.4 Classification results for WordNet verbs using information theoretic dis- tancemeasures ...... 204 E.5 Classification results for Roget verbs using information theoretic distance measures...... 205

xii LIST OF TABLES

1.1 Example lexical feature representation for loanword identification experi- ments ...... 3 1.2 Example verb-subject frequency co-occurrence matrix ...... 5

2.1 Example of labeled and unlabeled German loanwords ...... 9 2.2 Example of unlabeled English loanwords ...... 10 2.3 Romanization key for transliteration of Korean words into English . . . . 12 2.4 Hoosier Mental Lexicon and CMUDict symbol mapping table...... 15 2.5 Accuracy by phoneme of phonological adaptation rules. Mean=0.97 . . 20 2.6 Contingency table for the transliteration of ‘s’ in English loanwords in Korean...... 21 2.7 Contingency table for the transliteration of /j/ in English loanwords in Korean...... 22 2.8 Contingency table for the transliteration of ‘i’ in English loanwords in Korean...... 23 2.9 Average number of transliterations per vowel in English loanwords in Korean 23 2.10 Correlation between acoustic vowel distance and transliteration frequency 25 2.11 Examples of final stop epenthesis after long vowels in English loanwords inKorean ...... 27 2.12 Vowel epenthesis after voiceless final stop following Korean /o/. † indicates epenthesis ...... 27

2.13 Relation between voiceless final stop epenthesis after ‘’ and whether

the Korean form is based on English orthography ‘o’ or phonology . χ2 = 107.57; df = 1; p < .001...... 28

3.1 Feature representation for transliteration decision trees used in Kang and Choi(2000a,b)...... 34 3.2 Example English-Korean transliteration units from (Jung, Hong, and Paek, 2000:388–389,Tables6-1and6-2) ...... 37 3.3 Greek affixes considered in Oh and Choi (2002)to classify English loanwords 41 3.4 Example transliteration rules considered in Oh and Choi (2002) . . . . . 41 3.5 Feature sets used in Oh and Choi (2005) for transliterating English loan- wordsinKorean ...... 43 3.6 Summary of previous transliteration results ...... 45 3.7 Feature bundles for transliteration of target character ‘p’...... 63

xiii 4.1 Frequent English loanwords in the Korean Newswire corpus ...... 93

5.1 Correlation between number of verb senses across five classification schemes120 5.2 Correlation between number of neighbors assigned to verbs by five classi- ficationschemes...... 120 5.3 Correlation between neighbor assignments for intersection of verbs in five verbschemes ...... 121 5.4 An example contingency table used for computing the log-likelihood ratio 135

6.1 Number of verbs included in the experiments for each verb scheme . . . . 140 6.2 Average number of neighbors per verb for each of the five verb schemes . 141 6.3 Chance of randomly picking two verbs that are neighbors for each of the fiveverbschemes ...... 142 6.4 Examples of Subject-Type relation features ...... 152 6.5 Examples of Object-Type relation features ...... 152 6.6 Examples of Complement-Type relation features ...... 153 6.7 Examples of Adjunct-Type relation features ...... 154 6.8 Example of grammatical relations generated by Clark and Curran (2007)’s CCGparser ...... 155 6.9 Average maximum precision for set theoretic measures and the 50k most frequentfeaturesofeachfeaturetype ...... 159 6.10 Average maximum precision for geometric measures using the 50k most frequentfeaturesofeachfeaturetype ...... 161 6.11 Average maximum precision for information theoretic measures using the 50k most frequent features of each feature type ...... 163 6.12 Measures of precision and average number of neighbors yielding maximum precision across similarity measures ...... 164 6.13 Nearest neighbor average maximum precision for feature weighting, using the 50k most frequent features of type labeled dependency triple. . . . . 167 6.14 Average number of Roget synonyms per verb class ...... 170 6.15 Nearest neighbor precision with cosine and inverse feature frequency . . . 172 6.16 Coverage of each verb scheme with respect to the union of all of the verb schemes and the frequency of included versus excluded verbs ...... 174 6.17 Expected classification accuracy. The numbers in parentheses indicate raw counts used to compute the baselines ...... 175

F.1 Average inverse rank score for set theoretic measures, using the 50k most frequentfeaturesofeachfeaturetype ...... 207 F.2 Average inverse rank score for geometric measures using the 50k most frequentfeaturesofeachfeaturetype ...... 208 F.3 Average inverse rank score for information theoretic measures using the 50k most frequent features of each feature type ...... 209 F.4 Inverse rank score results across similarity measures ...... 210

xiv F.5 Inverse rank score with cosine and inverse feature frequency ...... 210 F.6 Nearest neighbor average inverse rank score for feature weighting, using the 50k most frequent features of type labeled dependency triple. . . . . 211

xv CHAPTER 1

INTRODUCTION

1.1 Overview

One of the fundamental problems in natural language processing involves words that are not in the dictionary, or unknown words. The supply of unknown words is vir- tually unlimited – proper names, technical jargon, foreign borrowings, newly created words, etc. – meaning that lexical resources like dictionaries and thesauri inevitably miss important vocabulary items. However, manually creating and maintaining broad coverage dictionaries and ontologies for natural language processing is expensive and difficult. Instead, it is desirable to learn them from distributional lexical information such as can be obtained relatively easily from unlabeled or sparsely labeled text cor- pora. Rule-based approaches to acquiring or augmenting repositories of lexical infor- mation typically offer a high precision, low recall methodology that fails to generalize to new domains or scale to very large data sets. Classification-based approaches to organizing lexical material have more promising scaling properties, but require an amount of labeled training data that is usually not available on the necessary scale. This dissertation addresses the problem of learning accurate and scalable lex- ical classifiers in the absence of large amounts of hand-labeled training data. It considers two distinct lexical acquisition tasks:

• Automatic transliteration and identification of English loanwords in Korean.

1 • Lexical semantic classification of English verbs on the basis of automatically derived co-occurrence features. The approach to the first task exploits properties of phonological loanword adaptation that render them amenable to description by a small number of linguistic rules. The basic idea involves using a rule-based system to generate large amounts of data that serve as training examples for a secondary lexical classifier. Although the precision of the rule-based output is low, on a sufficient scale it represents the lexical patterns of primary statistical significance with enough reliability to train a classifier that is robust to the deficiencies of the original rule-based output. The approach to the second task uses the output of a statistical parser to assign English verbs to lexical semantic classes, producing results on a substantially larger scale than any previously reported and yielding new insights into the properties of verbs that are responsible for their lexical categorization.

1.2 General Methodology

The task of automatically assigning words to semantic or etymological categories depends on two things – a reference set of words whose classification is already known, and a mechanism for comparing an unknown word to the reference set and predicting the class it most likely belongs to. The basic idea is to build a statistical model of how the known words are contextually distributed, and then use that model to evaluate the contextual distribution of an unknown word and infer its membership in a particular lexical class.

2 1.2.1 Loanword Identification

In the loanword identification task, a word’s contextual distribution is modeled in

terms of the phoneme sequences that comprise it. Table 1.1 contains an example of the type of lexical representation used in the loanword identification task. Statistical

Source Word Phonemes

h

t* u k* kpacelist

Korean // 11 111

Korean //1 1 121

h

English / / 1121112

h English / / 1 1 21121

Table 1.1: Example lexical feature representation for loanword identification experi- ments

differences in the relative frequencies with which certain sets of phonemes occur in

Korean versus English-origin words can be used to automatically assign words to one h

of the two etymological classes. For example, aspirated stops such as / / and the

epenthetic vowel // tend to occur more often in English loanwords than in Korean words.

1.2.2 Distributional Verb Similarity

Many people have noted that verbs often carry a great deal of semantic informa- tion about their arguments (e.g., Levin, 1993; McRae, Ferretti, and Amyote, 1997), and have proposed that children use syntactic and semantic regularities to bootstrap knowledge of the language they are acquiring (e.g., Pinker, 1994). For example, un- derstanding a sentence like Jason ate his nattou with a fork requires using knowledge about eating events, people, forks and their inter-relationships to know that Jason is an agent, nattou is the patient and fork is the instrument. These relations are

3 mediated by the verb eat, and knowing them allows us to infer that nattou, the thing being eaten by a person with a fork, is probably some kind of food. Conversely, when we encounter a previously unseen verb, we can infer some- thing about the semantic relationships of its arguments on the basis of analogy to similar sentences we have encountered before to figure out what the verb probably means. For example, the verb in a sentence like I IM’d him to say I was running about 5 minutes late can be understood to be referring to some means of communication on the basis of an understanding of what typically happens in a situation like this. Because verbs are central to people’s ability to understand sentences and also play a central role in several theories of the organization of the lexicon (e.g., McRae et al., 1997: and references therein), the second lexical acquisition problem this dissertation looks at is automatic verb classification – more specifically, how previously unknown verbs can be automatically assigned a position in a verbal lexicon on the basis of their distributional lexical similarity to a set of known verbs. In order to examine this prob- lem, we compare several verb classification schemes with empirically determined verb assignments. For the verb classification task, context was defined in terms of grammatical relations between a verb and its dependents (i.e., subject and object). Table 1.2 contains a representation of verbs in such a feature space. The features in this space are grammatical subjects of the verbs in column 2 of the table. The values of the features are the number of times each noun occurred as the subject of each verb, as obtained from an automatically parsed version of the New York Times subsection of the English Gigaword corpus (Graff, 2003). The verb class assignments in Table 1.2 come from the ESSLLI 2008 Lexical Semantics Workshop verb classification task and are based on Vinson and Vigliocco (2007).

4 Verb Class Verb Subjects of Verb bank company stock share child woman exchange acquire 362 2047 46 38 56 40 exchange buy 2844 7405 300 308 166 711 exchange sell 3893 17065 681 634 104 340 motionDirection rise 684 2437 20725 35166 23 213 motionDirection fall 881 2580 19289 31907 299 431 bodyAction cry 2 26 0 1 191 190 bodyAction listen 12 55 1 1 187 121 bodyAction smile 0 2 0 3 29 125

Table 1.2: Example verb-subject frequency co-occurrence matrix

In Table 1.2, bank and company tend to occur relatively often as subjects of the exchange verbs acquire, buy and sell. Similarly, the values for share and stock tend to be highest when they correspond to subjects of motionDirection verbs, whereas the bodyAction verbs tend to be associated with higher counts for child and woman. This systematic variability in the frequencies with which certain nouns appear as subjects

of verbs of different classes can be used to classify verbs. In essence, the frequency information associated with each noun can serve to predict something about which class a verb belongs to – i.e., high counts for child and woman are indicators for membership in the bodyAction class. When an unknown verb is encountered, its distribution of values for these nouns can be assessed to assign it to the most likely

class.

1.3 Structure of Dissertation and Summary of Contributions

The remainder of this dissertation is structured as follows. Chapters 2 – 4 deal with the transliteration and identification of English loanwords in Korean. Chapter 2

5 describes the preparation of the data set and the results of large scale quantitative analysis of English loanwords in Korean. The primary contributions of Chapter 2 include: • The preparation of a freely available set of 10,000 English-Korean loanword

pairs that are three-way aligned at the character level (English orthography, English phonology, Korean orthography). • A quantitative analysis of a set of phonological adaptation rules which shows that consonant adaptation is fairly regular but that vowel adaptation is much less predictable.

• A quantification of the extent to which English orthography influences loanword adaptation in Korean, particularly with respect to vowel transliteration. • The identification of an interaction between English orthography and Korean phonological processes as they relate to epenthesis following word final voiceless

stops. Chapter 3 deals with the automatic transliteration of English loanwords in Korean. The primary contributions of Chapter 3 include: • The implementation of a statistical transliteration model which is robust to

small amounts of training data. • A modified version of the statistical transliteration model which incorporates observations about the variability of vowel adaptation to generate a ranked list of transliteration candidates that obtains substantially higher precision than previous n-best transliteration models.

Chapter 4 deals with automatically identifying English loanwords in Korean. The primary contributions of Chapter 4 include: • A demonstration of the suitability of a sparse logistic regression classifier to the task of automatic loanword identification.

6 • A highly efficient solution to the problem of obtaining labeled training data that utilizes generative phonological rules to create large amounts of pseudo-training data. These data are used to train a classifier that distinguishes actual English and Korean words as accurately as one trained entirely on hand-labeled data.

Chapters 5 and 6 cover distributional verb similarity. Chapter 5 describes previous studies on automatic verb classification which provide a springboard for the current research and describes in general terms the elements that go into determining distributional verb similarity. Chapter 6 contains the results of a series of experi- ments that deal with various aspects of assigning and evaluating distributional verb similarity. The parameters explored here can be used to extend the coverage of verb classification schemes such as Levin, VerbNet, and FrameNet to unclassified verbs that occur in a large text corpus. The primary contributions of Chapter 6 include: • A comparison of 5 lexical semantic verb classification schemes – Levin (1993),

VerbNet, FrameNet, Roget’s Thesaurus, and WordNet – in terms of how each partitions verbs into classes. • An examination of interactions between a larger number of the parameters that determine empirical verb similarity – feature sets, similarity measures, feature

weighting, and feature selection – than has previously been considered in studies of distributional verb similarity. • A quantification of the extent to which synonymy influences verb assignments in Levin’s, VerbNet’s, and FrameNet’s classifications of verbs. Chapter 7 concludes the dissertation.

7 CHAPTER 2

DESCRIPTIVE ANALYSIS OF ENGLISH LOANWORDS IN KOREAN

This chapter presents a large scale quantitative analysis of English loanwords in Ko- rean. The analysis is based on a list of 10,000 orthographically and phonologically aligned English words attested as loanwords in Korean, and it details a number of previously unreported effects of orthography on the phonological adaptation of En- glish loanwords in Korean. The loanwords analyzed here are also used as data in a series of experiments on English-Korean transliteration 3 and identifying English loanwords in Korean 4. The remainder of this chapter describes the data set and aspects of English loanword adaptation in Korean. Section 2.1 deals with details of the construction of the data set including criteria for inclusion, data formatting, obtaining English phonological representations, and aligning orthographic and phonological forms. Sec- tion 2.2 presents an analysis of how orthography influences the adaptation of English loanwords in Korean, particularly with respect to vowels.

2.1 Construction of the Data Set

This analysis is based on a list of 10,000 English words attested as loanwords in Ko- rean. The majority of the words (9686) come from the National Institute of the Ko- rean Language’s (NIKL) list of foreign words (NIKL, 1991) after removing duplicate entries, proper names and non-English words. Entries considered duplicates in the

8 NIKL list are spelling variants like traveller/traveler, analog/analogue, hippy/hippie, etc. The remainder (314) were manually extracted from a variety of online Korean text sources. The original NIKL list of foreign words used in Korean contains 20,420 items

from a number of languages, including Italian, French, Japanese, Greek, Latin, Hindi, Hebrew, Mongolian, Russian, German, Sanskrit, Arabic, Persian, Spanish, Viet- namese, Malaysian, Balinese, Dutch, and Portuguese. Non-English words are often labeled according to their etymological source, whereas English words (the majority) are not labeled.

In many cases, however, a word which follows a non-English pattern of adap- tation is not labeled. For example, certain terms like acetylase and amidase are labeled in the NIKL list as German, whereas terms like catalase and aconitase are not labeled. However, the latter items are pronounced in Korean following the sound

patterns of the labeled German words – in particular, the final syllable is given as

//, as shown in Table 2.1. This pronunciation contrasts with other words ending

Etymological Label Orthographic Form Kr. Orthography Kr. Pronunciation

 h German acetylase [j9]j / /

German amidase p]j //

h h

None catalase 1»Ï]j / /

h h

None aconitase ïm ]j / /

Table 2.1: Example of labeled and unlabeled German loanwords

in the orthographic sequence -ase, which are realized in Korean as // as would be expected on the basis of the English pronunciation (Table 2.2). Unlabeled words whose pronunciation matched labeled non-English words were removed, as were words not contained in an online dictionary (American Heritage Dictionary, 2004). The ultimate decision to include a word as English came down

9

Etymological Label Orthographic Form Kr. Orthography Kr. Pronunciation

h h

None periclase ` o9þtYUsÛ¼ / /

None base Z sÛ¼ / /

Table 2.2: Example of unlabeled English loanwords

to a subjective judgment: if the word was recognized as familiar, it was included; otherwise, it was discarded. Each entry in the list corresponds to an orthographically distinct English word and consists of four tab-separated fields: English spelling, English pronunciation, linearized hangul transliteration, and orthographic hangul transliteration. The first three fields in each entry are aligned at the the character level. An example entry is shown below.

s-pi-der s-pY-dX- s|paid^- Û¼ s8

Figure 2.1: Example loanword alignment

The list is stored in a single, UTF-8 encoded text file, with one entry per line. UTF-8 is a variable length character encoding for Unicode symbols that uses one byte to encode the 128 US-ASCII characters and uses three bytes for Korean characters. Because it is a plain text file, it is not tied to any proprietary file format and can be opened with any modern text editor.

2.1.1 Romanization

Korean orthography is based on an alphabetic system that is organized into syllabic blocks containing two to four characters each. In standard Korean character encodings such as EUC-KR or UTF-8, each syllabic block is itself coded as a unique character.

10 This means that there is no longer an explicit internal representation of the individual orthographic characters composing that syllable. For example, in UTF-8 the Korean characters , a, and  are represented as ‘\u1112’, ‘\u1161’, and ‘\u1102’, respectively. However, the Korean syllable composed of these characters, Çô, is not represented as ‘\u1112\u1161\u1102’ but as its own character ‘\uD55C’. Therefore, determining character-level mappings (i.e., phoneme-to-phoneme or letter-to-letter) between Korean and English words is possible only by converting the syllabic blocks of Korean orthography into a linear sequence of characters. One way to do this is to convert hangul representations into an ASCII-based character representation.

For romanization of the data set, priority was given to a one-to-one mapping from hangul letters to ASCII characters because this simplifies many string-based operations like aligning and searching. Multicharacter representations such as Yale romanization (Martin, 1992) or phonemic representations like those in the CMU Pro- nouncing Dictionary (Weide, 1998) require additional processing or an additional delimiter between symbols. Furthermore, the symbol delimiter must be distinct from the word delimiter. As much as possible, romanization of the data set is phonemic in the sense that it uses ASCII characters that are already in use as IPA symbols. Consonant transliteration follows Yoon and Brew (2006), which in turn is based on Revised Romanization of Korean. We modified this transliteration scheme so that tense con- sonants are single character and velar nasal is single character. Table 2.3 (left column) shows the list of consonant equivalences. Vowels were romanized on the basis of the

IPA transliterations given in Yang (1996: 251, Table III), using the ASCII equiva- lents from the Hoosier Mental Lexicon (HML) (Nusbaum, Pisoni, and Davis, 1984). Vowel equivalents are shown in Table 2.3, right column. This dissertation uses Yale

11 Consonants Vowels

Hangul IPA Romanized Hangul IPA Romanized

// g a / / a

 //G b / / @

 // n e / / ^

 // d f / / e

 //D i / / o

 // l n / / u

 // m u / / i

 // b s / / |  // B ,#Œ,\V, etc. / / y+ vowel

// s

// S

// N

// j

// J h

 / / c h

 / / t h

 / / k h

 / / p

Table 2.3: Romanization key for transliteration of Korean words into English

romanization to represent Korean orthographic sequences and IPA-based translitera- tion when pronunciation is of primary importance, following Yang (1996) and Yoon and Brew (2006).

2.1.2 Phonemic Representation

2.1.2.1 Source of Pronunciations

English pronunciations in the data set are represented with the phonemic alphabet used in the HML (Nusbaum et al., 1984). The chief motivation for choosing this phonological representation was ease of processing, which in practical terms means an

12 ASCII-based, single character per phoneme pronunciation scheme. Pronunciations for English words were derived from two main sources: the HML (Nusbaum et al., 1984) and the Carnegie Mellon Pronouncing Dictionary (CMUDict) (Weide, 1998). The HML contains approximately 20,000 words, and CMUDict contains approximately

127,000. Loanwords contained in neither of these two sources were transcribed with reference to pronunciations given in the American Heritage Dictionary (2004).

2.1.2.2 Standardizing Pronunciations

There are several differences between the transcription conventions used in the HML

and CMUDict which had to be standardized for consistent pronunciation. The relevant differences are briefly summarized below, followed by the procedure used for normalizing these differences and standardizing pronunciations.

1. Different alphabets. CMUDict uses an all-capital phoneme set, with many

phonemes represented by two characters (e.g., AA /a/, DH , etc.). Two- character phones requires using an additional delimiter to separate unique sym- bols. The HML uses upper and lower case letters, with only one character per phoneme, which does not require an additional delimiter.

2. CMUDict represents three levels of lexical stress with indices 0, 1, or 2 at- tached to vowel symbols; the HML does not explicitly represent suprasegmen- tal stress. For example, chestnut CEsn^t (HML) versus CH EH1 S N AH2 T

(CMUDict).  3. The HML distinguishes two reduced vowels (| // vs. x / /); CMUDict treats

both as unstressed schwa (AH0 //). For example, wicked wIk|d (HML) and W IH1 K AH0 D (CMUDict) versus zebra zibrx (HML) and Z IY1 B R AH0 (CMUDict).

13 4. The HML uses distinct symbols for syllabic liquids and nasals; CMUDict treats these as unstressed schwa followed by a liquid or nasal. For example, tribal trYbL (HML) versus T R AY1 B AH0 L (CMUDict); ardent ardNt (HML)

versus AA1 R D AH0 N T (CMUDict).  5. CMUDict consistently transcribes // sequences as AO R where HML tran-

scribes them as or //. For example, sword sord (HML) versus S AO1 R D (CMUDict); sycamore sIkxmor versus S IH1 K AH0 M AO2 R (CMUDict).

CMUDict pronunciations were converted to HML pronunciations using the following procedure. In general, information was removed when it could be done so unambiguously rather than attempting to add information from one scheme into the other. 1. CMUDict unstressed schwa AH0 was converted to HML unstressed schwa x.

For example, action AE1 K SH AH0 N → AE1 K SH x N; callous K AE1 L AH0 S → K AE1 L x S. 2. CMUDict stressed schwa AH1 or AH2 was converted to HML stressed schwa ^. For example, blowgun B L OW1 G AH2 N → B L OW1 G ^ N; blood B L AH1 D

→ BL^D. 3. Remaining stress information was deleted from CMUDict vowels. For exam- ple, blowgun B L OW1 G ^ N → BLOWG^N; callous K AE1 L x S → K AE L x S

4. CMUDict AO R was converted to HML o r. For example, sword S AO R D →

SorD; sycamore S IH K x M AO R → SIHKxMor. 5. Remaining CMUDict symbols were converted to their HML equivalents using the equivalence chart shown in Table 2.4.

14 6. HML syllabic liquids and nasals were converted to an unstressed schwa + non- syllabic liquid (nasal) sequence. HML syllabics were expended with schwa fol-

lowing CMUDict as this made mapping to Korean #Q // easier. For example, tribal trYbL → trYbxl; ardent ardNt → ardxNt.

7. HML reduced vowel | // was converted to schwa x. For example, abandon

xb@nd|n → xb@ndxn; ballot b@l|t → b@lxt. 8. The distinction between HML X // and R / / was removed. For example, affirm xfRm → xfXm.

HML CMUDict Example HML CMUDict Example a AA odd b B be @ AE at C CH cheese ^ AH1, AH2 above, hut d D dee x AH0 about D DH thee c AO ought f F fee W AW cow g G green Y AY hide h HH he E EH Ed J JH gee R ER hurt k K key e EY ate l L lee I IH it m M me i IY eat n N knee o OW oat G NG ping O OY toy p P pee U UH hood r R read u UW two s S sea S SH she t T tea T TH theta v V vee w W we y Y yield z Z zee Z ZH seizure

Table 2.4: Hoosier Mental Lexicon and CMUDict symbol mapping table.

15 2.1.3 Alignments

In order to look at the influence of both orthography and pronunciation on English loanwords in Korean, we wanted a three-way, character level alignment between an English orthographic form, its phonemic representation, and corresponding linearized Korean transliteration. English spellings were automatically aligned with their pro- nunciations using the iterative, expectation-maximization based alignment algorithm detailed in Deligne, Yvon, and Bimbot (1995). The Korean transliteration was aligned with the English pronunciation using a simplified version of the edit-distance proce- dure detailed in Oh and Choi (2005). The algorithm described in Oh and Choi (2005) assigns a range of substitution costs depending on a set of conditions that describe the relation between a source and target symbol. For example, if the source and target symbol are phonetically similar, a cost of 0 is assigned; an alignment between a vowel and a semi-vowel incurs a cost of 30; an alignment between phonetically dis- similar vowels costs 100, and aligning phonetically dissimilar consonants costs 240. Manually constructed phonetic similarity tables are used to determine the relation between source and target symbols. We tried a simpler strategy of assigning consonant-consonant or vowel-vowel alignments a low cost consonant-vowel alignments a high cost and found that values of 0 and 10, respectively, performed reasonably well. These costs were determined by trial and error on a small sample. Because there are symbols in one representation that don’t have a counterpart in the other (e.g., Korean epenthetic vowels or English orthographic characters that are not pronounced), it is necessary to insert a special null symbol indicating a null alignment. The null symbol is ‘-’. The resulting align- ments are all the same length. The costs assigned determine alignments that tend to obey the following constraints.

16 1. consonants align with consonants; vowels align with vowels English Spelling k a n g a r o o English Pronunciation k @ G g X - u - Korean k@Ng ^ l u - 2. ‘silent vowels’ align with the null character English Spelling m a r i n e English Pronunciation m X - i n - Korean m ^ l in - 3. phonemes align at the left edge of orthographic character clusters English Spelling f i - g h t - English Pronunciation f Y - - - t - Korean pai--t | 4. Korean epenthetic vowels align with the null character in the English orthogra- phy and pronunciation English Spelling s - m o k e - English Pronunciation s - m o k - - Korean s | mok - | Because the accuracy of the alignments is crucial to the quality of any analyses of the data set, each alignment was checked by hand and corrected if necessary to ensure

that the above constraints are satisified. This representation of the correspondences between English and Korean char- acters makes it easy to possible to derive alignments between any two levels sans the third by deleting correspondences between the null character. For example, align-

ments between English spelling and pronunciation can be obtained by deleting a ‘-’ that arises from Korean vowel epenthesis: English Spelling s - m o k e - → smoke English Pronunciation s - m o k - - → smok- Alignments between English pronunciation and Korean can be obtained by deleting a ‘-’ that arises from silent orthographic characters:

17 English Pronunciation s - m o k - - → s-mok-

Korean s | mo k - | → s | m o k | Many-to-many correspondences between two levels may be obtained by consuming the null character in either level and concatenating symbols at both levels. For example, correspondences between English phones and orthographic character sequences can be obtained as: English Spelling f i -gh t - → f igh t

English Pronunciation f Y--- t - → f Y t Correspondences between English spelling and Korean can be obtained as: English Spelling f i -gh t - → f igh t Korean p ai - - t | → p ai t| Correspondences between English pronunciation and Korean can be obtained as: English Pronunciation f Y--- t - → f Y t Korean p a i-- t | → p ai t|

2.2 Analysis of English Loanwords in Korean

In recent years, computational and linguistic approaches to the study of English loanwords in Korean have developed in parallel, with little sharing of insights and techniques. Computational approaches are oriented towards practical problem solv- ing, and are framed in terms of identifying a function that maximizes the number of correctly transformed inputs. Linguistic analyses are oriented towards finding evi- dence for a particular theoretical point of view and are framed in terms of identifying general linguistic principles that account for a given set of observations. One of the main differences between these two approaches is the relative importance each places on the role of source language orthography in determining the form of a borrowed word. English orthography figures prominently in computational approaches. Early work derived mappings directly between English and Korean spellings (e.g., Kang

18 and Choi, 2000a), while later work considers the joint contribution of orthographic and phonological information (e.g., Oh and Choi, 2005). Many linguistic analyses of loanword adaptation, however, consider orthogra- phy a confound, as in Kang (2003: 234):

“problem of interference from normative orthographic conventions” or uninteresting, as in Peperkamp (2005: 10): “Given the metalinguistic character of orthography, adaptations that are

(partly) based on spelling correspondences are of course of little interest to linguistic analyses” Linguistic accounts of English loanword adaptation in Korean instead focus on whether the mechanisms of loanword adaptation are primarily phonetic or phono- logical. Other analyses of loanword adaptation in other languages acknowledge that orthography interacts with these mechanisms (e.g., Smith (2008) on English loanword adaptation in Japanese). This section looks at some influences of orthography on English loanwords in Korean, and shows that English spelling accounts for substantially more of the variation in Korean vowel adaptation than phonetic similarity does. The relevance of this correlation is illustrated for the case of variable vowel epenthesis following word final voiceless stops, and discussed more generally for understanding English loanword adaptation in Korean.

The Korean Ministry of Culture and Tourism (1995) published a set of phono- logical adaptation rules that describe the changes that English phonemes undergo when they are borrowed into Korean. Example rules are shown below (Korean Min- istry of Culture and Tourism, 1995: p. 129: 1(1), 2).

19 1. after a short vowel, word-final voiceless stops ([p], [t], [k]) are written as codas

(p, s, k)

book b k] →

2. is inserted after word-final and pre-consonantal voiced stops ([b], [d], [g])

  signal s gn l] →

These rules were implemented as regular expressions in a Python script and applied to the phonological representations of English words in the data set (this procedure is explained in detail in Chapter 3 Section 3.3.1). The output of the program was compared to the attested Korean forms, and the proportion of times the rule applied as predicted was calculated for each English consonant. These results are shown in Table 2.5.

Stops Fricatives Nasals Glides p 0.990 f 0.999 m 1.000 r 0.988

t 0.989 v 0.985 n 0.997 l 0.987 k 0.990 0.978 0.983 w 0.967

b 0.996 1.000 j 0.859 d 0.996 s 0.975 g 0.984 z 0.733

0.985

1.000

0.951

0.969 h 0.983

Table 2.5: Accuracy by phoneme of phonological adaptation rules. Mean = 0.97

In general the rules do a good job of predicting the borrowed form of English consonants in Korean. On average, consonants were realized as predicted by the phonological conversion rules 97% of the time. The prediction rates for /z/ and /j/ were substantially below the mean at 0.73 and 0.86, respectively. Based on Korean

20 Ministry of Culture and Tourism (1995: p. 129: 2, 3(1)) the following rules for the

adaptation of English in Korean loanwords were implemented:

1. word-final and pre-consonantal ℄ → ݼ

jazz ℄ → Fݼ / /

2. otherwise, ℄ → / /  zigzag ℄ → tÕªFÕª / /

occurred 704 times in English words in the data set; it was realized accord- ing to the rule as 512 times and realized as 188 times. In 117 of these cases,

the unpredicted form corresponds to English word-final representing the plural

morpheme (orthographic ‘-s’). Examples include words like users // → Ä»$

h



Û¼ //, broncos / / → ÚÔ2ŸxïÛ¼ / /, and bottoms / /

h

→ ˜Ð)3Û¼ / /. The contingency table in 2.6 shows how often is real- ized as predicted with respect to the English grapheme spelling it. The χ2 signifi-

cance test indicates that is significantly more likely to become s in Korean when the English spelling contains a corresponding ‘s’ than when it does not (Yates’ χ2 = 100.547, df =1,p< 0.001).

s ¬s English Orthography

→ 300 212 → 185 3

Table 2.6: Contingency table for the transliteration of ‘s’ in English loanwords in Korean

Although this result indicates that English spelling is a more reliable indicator

of the adapted form of than its phonological identity alone, it does not tease apart the question of whether low level phonetics or morphological knowledge of English

is responsible for this adaptation pattern. English word-final often devoices (e.g.

21 Smith, 1997); if the adaptation of these words is based on ℄ rather than , these

cases would be regularly handled under the rule for the adaptation of English .

Alternatively, these borrowed forms may represent knowledge of the morphological structure of the English words, in which a distinction between and is

maintained in the borrowed forms.

The following rule predicts the appearance of English // in English loanwords in Korean (Korean Ministry of Culture and Tourism, 1995): [j] → y.

// occurred 368 times in English loanwords in the data set; 275 of these cases

h

were adapted as the predicted j (e.g., yuppie // → #Œx / /), while 35 were y

adapted as i (e.g., billion // → n=oƒ / /) and 58 were adapted as ∅ (e.g., !s cellular // → qÀÒ Q / /). These cases are examined separately in the

2 χ tables 2.7 and 2.8. Table 2.7 shows how often English transliterates as Korean

s // with respect to whether the English spelling contains a corresponding ‘i’. The

i ¬i → 7 64

→∅ 29 4

Table 2.7: Contingency table for the transliteration of /j/ in English loanwords in Korean results of the χ2 test indicate that when the English orthography contains the vowel ‘i’,

2 is more likely to be transliterated as s / / (Yates’ χ = 57.192, df =1,p< 0.001).

Table 2.8 shows how often English is produced in the adapted form with respect to whether the English orthography contains a corresponding character. The results

2 of the χ test indicate that shows a tendency to drop when the orthography does not support its inclusion (e.g, cellular) (χ2 =4.725, df =1,p ≤ 0.03).

22

y ∅ → 54 204

→ ∅ 5 53

Table 2.8: Contingency table for the transliteration of ‘i’ in English loanwords in Korean

Whereas the behavior of English consonants in loanwords in Korean is reliably expressed with a handful of phonological rules, the behavior of vowels is considerably less constrained. Table 2.9 shows the number of transliterations found in the data set for each English vowel. The average number of transliterations per vowel is 8.46.

English Vowel Number of Korean Transliterations a 7

6

6 e 11

5

9 o 10 i 9 u 6

15

 12

9

5

Table 2.9: Average number of transliterations per vowel in English loanwords in Korean

Korean Ministry of Culture and Tourism (1995) does not provide phonological rules describing the adaptation of English vowels to Korean. However, Yang (1996) provides acoustic measurements of the English and Korean vowel systems. Based on this data, it is possible to estimate the acoustic similarity of the English and Korean

23 vowels, and examine the relation between the cross language vowel similarity and transliteration frequency. The prediction is that acoustically similar Korean vowels will be substituted for their English counterparts more frequently than non-similar vowels. Recognizing that acoustic similarity is not necessarily the best predictor of

perceptual similarity (e.g., Yang, 1996), we nonetheless applied two measures of vowel distance and correlated each with transliteration frequency. The first measurement was the Euclidean distance between vowels using F1 through F3 measurements for English and Korean vowels from Yang (1996):

3 (2.1) (F i − F i)2 v E K u i=1 uX t The notion of a perceptual F 2′ has been recognized as relevant since Carlson, Granstr¨om, and Fant (1970) introduced it for accounting for the perceptual integra- tion of the higher formants. We calculated F 2′ according to the formula in Padgett (2001: 200):

F 3 − F 2 F 2 − F 1 (2.2) F 2′ = F 2+ × 2 F 3 − F 1

and applied the Euclidean distance formula in 2.3 to calculate vowel distance:

2 ′ ′ 2 (2.3) (FE1 − FK 1) +(FE2 − FK 2 ) p The correlation between vowel distance and frequency of transliteration in an acoustic-perceptual space is very weak. Table 2.10 shows the associated correlations between each of the distance measures and vowel transliteration frequency.

24 Measure Correlation Euclidean -0.256 F 2′ -0.331

Table 2.10: Correlation between acoustic vowel distance and transliteration frequency

However, in many cases the Korean vowel corresponds to a normative “IPA reading” of the English orthographic vowel, regardless of its actual pronunciation. A much stronger correlation is found between the number of ways a vowel is writ- ten in English and the number of adaptations of that vowel in Korean (r = 0.92).

For example,  is represented orthographically in a variety of ways in English (e.g., action, Atlanta, cricket, coxswain, instrumentalism) and shows a variety of realiza- tions in loanwords in Korean (e.g., Óo‚ ayksyen, Ed¦™½  aythullayntha, ß¼oHàÔ khulikeythu, 9qÛ¼J?“ khoksuweyin, “Û¼àÔÀÒp_'FOo7£§ insuthulwumeynthellicum).

This correlation is depicted graphically in Figure 2.2. Finally, we note an orthography-sensitive distinction that concerns epenthesis following word-final voiceless stops. Kang (2003) observes that English tense vowels preceding a voiceless stop often trigger final vowel epenthesis. The standard conversion rules also specify this phenomenon, in terms of vowel length (Korean Ministry of Culture and Tourism, 1995: 1.3). Examples are shown in Table 2.11. In English, orthographic ‘o’ is typically pronounced one of two ways: /o/ (e.g., hope, smoke) and /a/ (e.g., pot, lock). These words are typically borrowed into Korean in one of two ways, as well. English words containing pre-final /o/ are typically produced in Korean with /o/ plus epenthesis (e.g., rope lophu, smoke sumokhu). However, many English words pronounced with /a/ are borrowed with /o/ as well, presumably on the basis of the English orthography (e.g., hardtop hatuthop, headlock heytulok, etc.).

Figure 2.2: Correlation between number of loanword vowel spellings in English and Korean (scatter plot of the number of English vowel spellings, 0–30, against the number of Korean vowel spellings, 0–15, per vowel)

Although the form of the adapted vowel is the same in both cases, epenthesis is significantly less likely to occur for

orthographically derived /o/ than when /o/ corresponds to the English pronunciation as well (Yates’ χ2 = 107.57; df = 1; p < .0001). Examples are given in Table 2.12, which contains a breakdown of the epenthesis data for /o/ by identity of the following stop. For /k/ and /p/, epenthesis is very unlikely when the English letter ‘o’ is

pronounced /a/; for /t/, orthographically derived /o/ is as likely to epenthesize as pronunciation-based /o/1. In essence, the Korean phonology preserves a distinction

1This difference may reflect morphophonemic constraints on final /t/ in Korean nouns (Kang, 2003).

English    Korean
rope       lophu
smoke      sumokhu
part       phathu
make       meyikhu

Table 2.11: Examples of final stop epenthesis after long vowels in English loanwords in Korean

between phonologically and orthographically derived /o/ in terms of epenthesis on the final voiceless stop.

Eng. Pron.   Following stop   Examples                                        Epenthesis   No Epenthesis
/a/          /p/              desktop teysukhuthop, turboprop thepophulophu†       0            27
/o/          /p/              rope† lophu, soap†                                  32             0
/a/          /k/              hemlock, smock                                       5            36
/o/          /k/              spoke, stroke                                       15             0
/a/          /t/              ascot, boycott                                      11            12
/o/          /t/              tugboat, vote†                                      26             0

Table 2.12: Vowel epenthesis after a voiceless final stop following Korean /o/. † indicates epenthesis

2.3 Conclusion

This chapter described the preparation of a set of English-Korean loanwords that is aligned at the character level to show correspondences between English spelling, pronunciation, and the Korean form of borrowed English words. This is the only resource of its kind that is freely available for unrestricted download: http://purl.org/net/kbaker/data. Several analyses of the data were presented which highlight previously unreported observations about the influence of orthography on English loanword adaptation in Korean. Orthography has a particularly noticeable influence on the realization of vowels in English loanwords in Korean. Vowel adaptation is not reliably predicted from the phonological representation of vowels in English source words in the absence of orthographic information, whereas consonant transliteration is reliably captured by a small set of phonological conversion rules. The analysis presented here also identified cases where English orthography

interacts with the Korean phonological process of word-final vowel epenthesis following voiceless stops. These findings are important for accounts of English loanword adaptation in Korean because they provide a quantification of the extent to which orthography influences the form of borrowed words, and indicate that accounts of

loanword adaptation which focus exclusively on the phonetics or phonology of the adaptation process are overlooking important factors that shape the realization of English loanwords in Korean. The next chapters use the data set described here in a series of experiments on automatic English-Korean transliteration and foreign word

identification.

                               English pronunciation of ‘o’
                               /a/            /o/
Korean /o/, with Epenthesis     16             73
Korean /o/, no Epenthesis       75              0

Table 2.13: Relation between voiceless final stop epenthesis after Korean /o/ and whether the Korean form is based on English orthography ‘o’ or phonology /o/. χ² = 107.57; df = 1; p < .001

CHAPTER 3

ENGLISH-TO-KOREAN TRANSLITERATION

3.1 Overview

3.2 Previous Research on English-to-Korean Transliteration

Three types of automatic English-to-Korean transliteration models have been proposed in the literature: grapheme-based models (Lee and Choi, 1998; Jeong, Myaeng, Lee, and Choi, 1999; Kim, Lee, and Choi, 1999; Lee, 1999; Kang and Choi, 2000a; Kang and Kim, 2000; Kang, 2001), phoneme-based models (Lee, 1999; Jung et al., 2000), and ortho-phonemic models (Oh and Choi, 2002, 2005; Oh, Choi, and Isahara, 2006b). Grapheme-based models work by directly transforming source language graphemes into target language graphemes without explicitly utilizing phonology in the bilingual mapping. Phoneme-based models, on the other hand, do not utilize orthographic information in the transliteration process. Phoneme-based models are generally implemented in two steps: first obtaining the source language pronunciation and then converting that representation into the target language graphemes. Ortho-phonemic models consider the joint influence of orthography and phonology on the transliteration process. They also involve a two-step process, but rather than discarding the orthographic information after the pronunciation of a source word has been determined, they utilize it as part of the transliteration process.

3.2.1 Grapheme-Based English-to-Korean Transliteration Models

Grapheme-based transliteration models attempt to define mappings directly from

English to Korean orthography.

3.2.1.1 Lee and Choi (1998); Lee (1999)

Lee (Lee and Choi, 1998; Lee, 1999) proposed a Bayesian grapheme-based English-to-Korean transliteration model that generates the most likely transliterated Korean

word Kˆ from an English source word E on the basis of Equation 3.1.

(3.1)   \hat{K} = \arg\max_K P(K|E) = \arg\max_K P(E|K)P(K)

Lee’s model begins by segmenting an English word into a sequence of graphones (Deligne et al., 1995; Bisani and Ney, 2002), or multi-letter sequences that correspond to English phonemes. For example, the word speaking can be represented as a sequence of five graphones (from Bisani and Ney, 2002: 105):

speaking = s   p   ea   k   ing
         = /s/ /p/ /i/  /k/ /ɪŋ/

In order to identify the most likely Korean graphone for each English graphone, Lee and Choi (1998) and Lee (1999) generate all possible graphone sequences for each English word and the corresponding Korean transliteration. For example, the

English word data can be segmented into the following 8 possible subsequences: data, dat-a, da-ta, da-t-a, d-ata, d-a-ta, d-at-a, d-a-t-a, and the corresponding Korean transliteration can be segmented into 16 possible subsequences: deit-, dei-t-, etc. Maximum likelihood estimates specifying the probability with which each English graphone maps onto each Korean graphone are obtained via the expectation maximization algorithm (Dempster, Laird, and Rubin, 1977).

The probability of a particular Korean graphone sequence K = (k_1, ..., k_L) occurring is represented as a first-order Markov process (Manning and Schütze, 1999: Ch.

9) and is estimated as the product of the probabilities of each graphone ki (Equation 3.2):

(3.2)   P(K) \approx P(k_1) \prod_{i=2}^{L} p(k_i | k_{i-1})

The probability of observing an English graphone sequence E =(e1,...,eL) given a Korean sequence K is estimated from the observed graphone alignment probabilities

as

(3.3)   P(E|K) \approx \prod_{i=1}^{L} p(e_i | k_i)

This approach suffers from two drawbacks (Oh, Choi, and Isahara, 2006a; Oh et al., 2006b). The first is the enormous time complexity involved in generating all possible graphone sequences for words in both English and Korean. There are an

exponential number of ordered substrings to consider for a string of length L (a string of length |L| has 2^{|L|-1} possible ordered subsequences). Because this number of substrings must be considered for both languages, the approach is impossible to implement for a large number of transliteration pairs. The second consideration involves the nature

of the alignment procedure for identifying within-language graphones. Alignment errors in this stage propagate to the cross-language alignments, leading to incorrect transliterations that might otherwise be avoided. This model obtained recall of 0.47

when evaluating the 20 best transliteration candidates per word in a comparison reported in Jung et al. (2000: 387, Table 3; trained on 90% of an 8368 word data set and tested on 10%). Recall is defined as the number of correctly transliterated words divided by the number of words in the test set.
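The combinatorics behind this complexity argument are easy to make concrete. The following sketch (not Lee and Choi's implementation) enumerates every ordered segmentation of a word; a word of length L yields 2^(L-1) of them, which is what makes exhaustive graphone enumeration intractable for long words and large word lists.

# Enumerate every contiguous segmentation of a word; a length-L word has 2**(L-1).
from itertools import combinations

def segmentations(word):
    """Yield every ordered segmentation of `word` into contiguous substrings."""
    n = len(word)
    for k in range(n):                          # number of cut points
        for cuts in combinations(range(1, n), k):
            bounds = (0,) + cuts + (n,)
            yield [word[i:j] for i, j in zip(bounds, bounds[1:])]

segs = list(segmentations("data"))
print(len(segs))    # 8 = 2**(4-1)
print(segs)         # ['data'], ['d', 'ata'], ..., ['d', 'a', 't', 'a']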

3.2.1.2 Kang and Choi (2000a,b)

Kang and Choi (2000a, b) describes a grapheme-based transliteration model that uses decision trees to convert an English word into its Korean transliteration. Like Lee and Choi (1998) and Lee (1999), it is based on alignments between source and target language graphones. However, this approach differs in terms of how the alignments are obtained. Kang and Choi (2000a, b) explicitly mentions some of the steps undertaken to mitigate the exponential growth of the graphone mapping problem, noting that the number of combinations can be greatly reduced by disallowing many-to-many mappings and null correspondences from English to Korean. Furthermore, Kang and Choi (2000a, b) does not apply an initial English grapheme-phoneme alignment step, but directly aligns English and Korean graphones. Character alignments are automatically obtained using a modified version of a depth-first search alignment algorithm based on Covington (1996). Covington (1996)’s alignment procedure is a variant of the string edit-distance algorithm (Levenshtein, 1966) that treats string alignment as a way of stepping through two words performing a match or skip operation at each step. Kang and Choi

(2000a, b) extends Covington’s algorithm by adding a bind operation that removes null mappings in the alignment and allows many-to-many correspondences between source and target characters. For example, Covington’s edit distance algorithm aligns

board and potu as

board-
po--tu

which produces null mappings (the ‘-’ symbol) in both the source and target strings. Kang and Choi’s modifications produce the following alignment

b  oar  d
p  o    tu

in which the null mapping has been replaced by a binding operation that produces many-to-many correspondences. Kang and Choi further modify the original alignment procedure by assigning different costs to matching symbols on the basis of their phonetic similarity (i.e., phonetically dissimilar alignments such as consonant-vowel

receive higher penalties than an alignment between phonetically similar consonants such as /f/ and /pʰ/). The penalties are heuristic in nature and are based on the following two observations:

• English consonants tend to transliterate as Korean consonants, and English vowels tend to transliterate as Korean vowels;

• there are typical Korean transliterations of most English characters.

These heuristics are implemented in terms of penalties involving the matching, skipping, or binding of specific classes of English and Korean characters (Kang and Choi, 2000a: 1139, Table 2). Kang and Choi (2000a, b) models the transliteration process in terms of a bank of decision trees that decide, for each English letter, the most likely Korean transliteration on the basis of seven contextual English graphemes (the left three, the target, and the right three). For example, given the word board and its Korean transliteration potu, 5 decision trees would attempt to predict the Korean output on the basis of the representations in Table 3.1. Kang and Choi (2000a, b) used ID3 (Quinlan, 1986), a decision tree learning algorithm that splits attributes with the highest information gain first. Information

> > > (b) o a r   →   p
> > b (o) a r d   →   o
> b o (a) r d >   →   -
b o a (r) d > >   →   -
o a r (d) > > >   →   tu

Table 3.1: Feature representation for transliteration decision trees used in Kang and Choi (2000a, b)

gain is defined as the difference between how much information is needed to make a correct decision before splitting versus how much information is needed after splitting. In turn, this is calculated on the differences in entropies of the original data set and the weighted sum of entropies of the subdivided data sets (Dunham, 2003: 97–98). Kang and Choi (2000b) reports word-level transliteration accuracy of 51.3% on a 7000 item data set (90% training, 10% testing) when generating a single transliteration candidate per English word. Word accuracy is defined as the number of correct transliterations divided by the number of generated transliterations.
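A minimal sketch of the feature extraction behind Table 3.1 is given below. The function and padding symbol are illustrative choices, not Kang and Choi's code; the aligned Korean outputs for board follow the example in the text.

# Build 7-letter context windows (3 left, target, 3 right) for each English letter.
PAD = ">"

def context_windows(word, width=3):
    padded = PAD * width + word + PAD * width
    for i, letter in enumerate(word):
        j = i + width
        yield padded[j - width:j], letter, padded[j + 1:j + width + 1]

korean_outputs = ["p", "o", "-", "-", "tu"]   # aligned targets from the board example
for (left, target, right), out in zip(context_windows("board"), korean_outputs):
    print(f"{' '.join(left)} ({target}) {' '.join(right)}  ->  {out}")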

3.2.1.3 Kang and Kim (2000)

Kang and Kim (2000) models English-to-Korean transliteration with a weighted finite state transducer that returns the best path search through all possible combinations of English and Korean graphones. Like Kang and Choi (2000a, b), Kang and Kim (2000) employs an initial heuristic-based bilingual alignment procedure. As with

Lee and Choi (1998) and Lee (1999), all possible English-Korean graphone chunks are generated from these alignments. Evidence for a particular English sequence transliterating as a particular Korean sequence is quantified by assigning a frequency- based weight to each graphone pair. This weight is computed in terms of a context

and an output, where context refers to an English graphone e_i and output refers to an aligned Korean graphone k_i as in Equation 3.4 (Kang and Kim, 2000: 420, Equation 4),

(3.4)   weight(context : output) = len(context) \times \frac{C(output)}{C(context)}, \quad weight(e_i : k_i) = len(e_i) \times \frac{C(k_i \cap e_i)}{C(e_i)}

where C(x) refers to the number of times x occurred in the training set. The weight is multiplied by the length of the English graphone sequence so that longer chunks receive more weight than shorter chunks.

A transliteration network is constructed as a finite state transducer where arcs between nodes are weighted with the weights obtained from the aligned training data. The best transliteration is found via the Viterbi algorithm (Forney, 1973) as the optimal path through the network.
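The weighting scheme in Equation 3.4 amounts to a length-scaled relative frequency, which the following sketch computes from a handful of hypothetical aligned graphone pairs (the pairs and counts are invented for illustration, not taken from Kang and Kim's data).

# Frequency-based arc weights of Equation 3.4 from hypothetical aligned graphone pairs.
from collections import Counter

aligned = [("oar", "o"), ("oar", "oa"), ("b", "p"), ("b", "p"), ("b", "b")]

pair_counts = Counter(aligned)
eng_counts = Counter(e for e, _ in aligned)

def weight(e, k):
    """weight(e : k) = len(e) * C(k and e) / C(e)   (Equation 3.4)"""
    return len(e) * pair_counts[(e, k)] / eng_counts[e]

print(weight("oar", "o"))   # 3 * 1/2 = 1.5
print(weight("b", "p"))     # 1 * 2/3 ≈ 0.67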

3.2.2 Phoneme-Based English-to-Korean Transliteration Models

Phoneme-based transliteration models map directly from English phonemes to Korean graphemes.

3.2.2.1 Lee (1999); Kang (2001)

Oh et al. (2006a, b) summarizes two phoneme-based transliteration models originally proposed by Lee (1999) and Kang (2001). Lee (1999)’s model generates Korean transliterations from English words through a two-step process. The first step involves the statistical segmentation of English words into graphones using the alignment procedure described in Section 3.2.1.1. At this point, instead of taking the

orthographic component as the representation of an English word, the phonological representation is used instead. English phonemes are transformed into Korean graphemes on the basis of a set of standard English-to-Korean conversion rules (Korean Ministry of Culture

and Tourism, 1995). These rules are expressed as context-sensitive rewrite rules of

the form A_E X_E B_E → Y_K, meaning that the English phoneme X becomes Korean grapheme Y in the context of English phonemes A and B. For example, the following rule

/ʃ/ → si / __ #

states that English /ʃ/ becomes Korean si at the end of words.

This approach suffered from two main problems: the propagation of errors that result from the statistical alignment procedure, and limitations in the set of phonological rewrite rules. Because the standard conversion rules are expressed in terms of phonological natural classes, there is a poor contextual mapping onto the statistically derived phoneme chunks. Furthermore, a great deal of the variability associated with loanword adaptation is simply not amenable to description by contextual rewrite rules. Kang (2001)’s model takes the pronunciation of English words directly from a pronouncing dictionary without relying on an automatic English grapheme-to-phoneme alignment procedure. Decision trees are constructed which convert English phonemes into Korean graphemes using the training procedure described in Section 3.2.1.2. The only difference between this model and the grapheme-based model described earlier is that the phoneme-based model applies to a phonological representation rather than an orthographic one. A drawback of the model is that it does not provide a method for estimating the pronunciation of English words not in the dictionary, making it impossible to generalize to a larger set of transliteration pairs.

3.2.2.2 Jung, Hong, and Paek (2000)

Jung et al. (2000) presents a phoneme-based approach to English-to-Korean transliteration that models the process with an extended Markov window consisting of the current English phoneme, the preceding and following English phoneme, and the current and preceding Korean grapheme. The first step of the transliteration process involves converting an English word to a pronunciation string using a pronouncing dictionary. A transcription automaton is used to generate pronunciations for words not contained in the dictionary. The next step involves constructing a phonological mapping table that links English and Korean pronunciation units. Pronunciation units may consist of vowel or consonant singletons, or larger units made up of combinations of consonant and vowel sequences. Mappings are based on hand-crafted rules that come from examining a set of English-Korean transliteration pairs. For each English pronunciation unit, a list of possible Korean transliterations is determined. Some examples are shown in Table 3.2 (Jung et al., 2000: 388–389, Tables 6-1 and 6-2).

English pronunciation unit    Korean orthographic unit(s)
/p/                           ‘p, b, p, bb’
/s/                           ‘s, s, j, ss, jj’
[illegible]                   [illegible]

Table 3.2: Example English-Korean transliteration units from (Jung et al., 2000: 388– 389, Tables 6-1 and 6-2)

English pronunciations are aligned with Korean orthographic strings in a two step heuristic-based process. In the first stage, English and Korean consonants are

aligned. The second pass aligns vowels with vowels while respecting the previously determined consonant alignments. Relying on the table of phonological mappings to constrain the alignment procedure results in a unique alignment for each English-Korean pair. For the generation stage of the transliteration process, all possible

segmentations of the English word are produced and the segmentation leading to the most likely Korean transliteration is selected as the transliterated output. Jung et al. (2000) model the transliteration process in terms of the joint probability of an English word and its Korean transliteration, P(E,K). This probability is approximated by substituting the English word E with its segmented phonemic representation S. The joint probability of S and K can be expressed in terms of a conditional probability according to Equation 3.5 (Jung et al., 2000: 385, Equation 2),

(3.5)   \hat{K} = \arg\max_K P(E,K) \approx \arg\max_K P(S,K) = \arg\max_K P(K|S)P(S)

where S =(s1,s2,...,sn) and K =(k1,k2,...,kn), with si an English pronunciation unit and ki a Korean orthographic segment.

In order to determine ki, four contextual variables are taken into account:

the current English segment si, the preceding and following English segments si−1 and si+1, and the preceding Korean segment ki−1. The transliteration term P (K|S)

can be approximated as a product of the probabilities of each ki conditioned on the contextual variables:

(3.6)   P(K|S) \approx \prod_{i=1}^{n} P(k_i | k_{i-1}, s_i, s_{i-1}, s_{i+1})

The probability of a given English phonemic segmentation S is estimated from a language model:

(3.7)   P(S) \approx \prod_{i=1}^{n} P(s_i | s_{i-1})

Jung et al. (2000) describes further enhancements to the basic model in terms of estimating backoffs to combat data sparsity and redundancies in feature prediction. In comparing their model to the grapheme-based approach, the authors note that grapheme-based models may have an advantage in transliterating proper names, which are often absent from pronouncing dictionaries (Jung et al., 2000: 388). This model obtains word level transliteration accuracy, defined as the number of correct transliterations divided by the number of generated transliterations, of 53% on a data set containing 8368 items (90% training, 10% testing).
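A compact sketch of how Equations 3.5 through 3.7 might be estimated and applied is shown below. The segmentation of data and its Korean segments are hypothetical, and plain relative-frequency estimates stand in for Jung et al.'s smoothed model; the point is only to show how the extended Markov window combines the transliteration and language-model terms.

# Score a candidate transliteration under the extended Markov window (Eqs. 3.5-3.7).
from collections import Counter

def train(pairs):
    """pairs: list of (S, K) where S and K are equal-length tuples of segments."""
    trans, trans_ctx, lm, lm_ctx = Counter(), Counter(), Counter(), Counter()
    for S, K in pairs:
        S = ("#",) + S + ("#",)
        K = ("#",) + K
        for i in range(1, len(S) - 1):
            ctx = (K[i - 1], S[i], S[i - 1], S[i + 1])
            trans[ctx + (K[i],)] += 1
            trans_ctx[ctx] += 1
            lm[(S[i - 1], S[i])] += 1
            lm_ctx[S[i - 1]] += 1
    return trans, trans_ctx, lm, lm_ctx

def score(S, K, model):
    trans, trans_ctx, lm, lm_ctx = model
    S = ("#",) + S + ("#",)
    K = ("#",) + K
    p = 1.0
    for i in range(1, len(S) - 1):
        ctx = (K[i - 1], S[i], S[i - 1], S[i + 1])
        p *= trans[ctx + (K[i],)] / max(trans_ctx[ctx], 1)    # Equation 3.6
        p *= lm[(S[i - 1], S[i])] / max(lm_ctx[S[i - 1]], 1)  # Equation 3.7
    return p

# Hypothetical segmented training pair for 'data': phoneme units -> Korean segments.
model = train([(("d", "ei", "t", "a"), ("te", "i", "th", "e"))])
print(score(("d", "ei", "t", "a"), ("te", "i", "th", "e"), model))  # 1.0 on its own data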

3.2.3 Ortho-phonemic English-to-Korean Transliteration Models

More recent research has explored models that combine orthographic and phonemic information in the transliteration process. In general, models that incorporate orthographic and phonemic information outperform models that include only one source of conditioning information.

3.2.3.1 Oh and Choi (2002)

Oh and Choi (2002) considered the joint influence of English orthography and pronunciation on the transliteration process in the form of ortho-phonemic transliteration rules. Oh and Choi’s model begins by applying the heuristic bilingual alignment procedure described in Kang and Choi (2000a, b). English phonological representations are taken from the Carnegie Mellon Pronouncing Dictionary (CMUDict) (Weide,

1998). English phonemes are converted into Korean graphemes using the Korean Ministry of Culture and Tourism’s standard English-to-Korean conversion rules described in Section 3.2.2.1 (Lee, 1999). Before converting phones, however, an additional layer of linguistic processing is applied to attempt to improve transliteration accuracy. The

first step involves an analysis of out-of-dictionary words to see if they can be analyzed as a compound, while the second involves morphological pattern-matching to see if a word can be classified as etymologically Greek. If a word is not contained in CMUDict, it is checked to see whether it can be segmented into two substrings that are contained in the dictionary. The segmentation procedure is a left-to-right scan that incrementally splits a word into two at the current index. For example, cutline can be segmented into c+utline, cu+tline, cut+line, at which point the pronunciation of both cut and line are retrieved from the dictionary. In case a pronunciation cannot be found after all segmentations have been attempted,

one is automatically generated using a decision tree learning algorithm (Quinlan, 1993). Oh and Choi (2002) observe that English words of Greek origin are often

transliterated into Korean exclusively on the basis of orthography. For example,

hernia is transliterated as heylwunia and acacia is transliterated as akhasia. Oh and Choi (2002) apply prefix and suffix pattern matching to try to identify a word as etymologically Greek. The prefixes and suffixes they use for classifying words as etymologically Greek are shown in Table 3.3 (Oh and Choi, 2002: Table). For these words, a separate grapheme-based transliteration model

is employed. For words not classified as Greek, a system of orthographic/phonemic context sensitive rewrite rules is used. Oh and Choi (2002)’s phoneme-based transliteration model is based on the set of standard English-to-Korean conversion rules described in Section 3.2.2.1. They

Prefix:  amphi-, ana-, anti-, apo-, dia-, dys-, ec-, acto-, enantio-, endo-, epi-, cata-, cat-, meta-, met-, palin-, pali-, para-, par-, peri-, pros-, hyper-, hypo-, hyp-
Suffix:  -ic, -tic, -ac, -ics, -ical, -oid, -ite, -ast, -isk, -iscus, -ia, -sis, -me, -ma

Table 3.3: Greek affixes considered in Oh and Choi (2002) to classify English loanwords

applied these rules to 200 randomly selected words from CMUDict and observed transliteration errors in the output. On the basis of these observations, they selected 27 high frequency rules and augmented them with orthographic information. Table

3.4 contains examples of some of these rules (Oh and Choi, 2002: Table).

Orthography   Pronunciation   Transliteration   Examples
C+le          [illegible]     ul                assemble eseympul, bustle pesul
sm#           [illegible]     cum               barbarism papelicum, chauvinism syopinicum
or#           [illegible]     e                 alligator ayllikeyithe, doctor tokthe

Table 3.4: Example transliteration rules considered in Oh and Choi (2002)

An analysis of their results shows that joint orthographic-phonemic rules outperform either grapheme-only or phoneme-only models (word level transliteration accuracy of 56% versus 35% for a grapheme-only model and 41% for a phoneme-only model). One of the biggest sources of transliteration error occurs for words whose

English pronunciation must be automatically generated; i.e., out-of-dictionary items (word level transliteration accuracy of 68% when the pronunciation of the source word is known versus 52% when the pronunciation is automatically generated).

3.2.3.2 Oh and Choi (2005); Oh, Choi, and Isahara (2006)

Oh and Choi (2005); Oh et al. (2006b) presents a generalized framework for combining orthographic and phonemic information into the transliteration process. Oh and Choi (2005) applies three different machine learning methods (maximum entropy modeling, decision tree learning, and memory-based learning) to the transliteration task and evaluates the results.

Oh and Choi’s method begins with establishing alignments between English graphemes and phonemes, and then alignments from English grapheme-phoneme pairs to Korean graphemes. English phonological representations are taken from CMUDict (Weide, 1998). Alignments are obtained automatically using a heuristically weighted version of the edit distance algorithm (Levenshtein, 1966). The cost schemes are borrowed from Kang and Choi (2000a, b). The first step involves aligning English graphemes with English phonemes (G_E → P_E) and then aligning English phonemes with Korean graphemes (P_E → G_K). Using the English phoneme as a pivot, English graphemes are aligned with Korean graphemes (G_E → P_E → G_K). The (G_E → P_E) alignments are used to construct training data for a procedure that can be used to generate the pronunciation of words that are not in CMUDict (the actual procedure is not specified). Oh and Choi model the transliteration process in terms of a function that maps a set of source language contextual features onto a target language grapheme. Four types of features are used: graphemes, phonemes, generalized graphemes, and generalized phonemes. These features are described in Table 3.5 (Oh and Choi, 2005: 1743, Table 6). Figure 3.1 (Oh and Choi, 2005: 1744, Figure 6) illustrates the principle of

using these features to predict the transliteration of the word board (‘bo-d’).

Feature                 Possible Values
English Graphemes       {a, b, c, ..., x, y, z}
English Phonemes        {/AA/, /AE/, ...}
Generalized Graphemes   Consonant (C), Vowel (V)
Generalized Phonemes    Consonant (C), Vowel (V), Semi-vowel (SV), Silence (∅)

Table 3.5: Feature sets used in Oh and Choi (2005) for transliterating English loan- words in Korean

The grapheme currently being transliterated is represented in the center of a context of three preceding and three following features. It can be described in terms of a 28-feature vector consisting of the current grapheme plus six contextual graphemes, the current phoneme plus six contextual phonemes, the current generalized grapheme plus six generalized graphemes, and the current generalized phoneme plus six generalized phonemes.

        L3   L2   L1   ▽    R1   R2   R3
G  = (  ∅    ∅    ∅    b    o    a    r  )
P  = (  ∅    ∅    ∅   /b/  /o/   ∅   /r/ )   → ‘b’
GG = (  ∅    ∅    ∅    C    V    V    C  )
GP = (  ∅    ∅    ∅    C    V    ∅    C  )

Figure 3.1: Feature representation of English graphemes

Oh and Choi apply three machine learning models to the feature representation described in Figure 3.1: maximum entropy modeling, decision tree learning, and memory-based learning. The maximum entropy model (Jaynes, 1991; Berger, Pietra, and Pietra, 1996) is a probabilistic framework for integrating information sources. It

is based on the constraint that the expected value of each feature in the final maximum entropy model must equal the expectation of that same feature in the training set. Training the model consists of finding, among the probability distributions subject to these constraints, the one with maximum entropy (Manning and Schütze,

1999: Chapter 16, 589–591). For the decision tree, Oh and Choi used C4.5 (Quinlan, 1993), a variant of the ID3 model described in Section 3.2.1.2 (Kang and Choi, 2000a, b). Memory-based learning is a k-nearest neighbors classifier (Hastie, Tibshirani, and Friedman, 2001). Training instances are stored in memory, and a similarity metric is used to compare a new instance with items in memory. The k most similar items are retrieved, and the majority class label is assigned to the new instance. Oh and Choi used TiMBL (Tilburg Memory-Based Learner) (Daelemans, Zavrel, van der Sloot, and van den Bosch, 2003), an efficient k-NN implementation geared towards NLP applications. The results of these comparisons are shown in Table 3.6.

3.2.4 Summary of Previous Research

Table 3.6 contains a summary of the results of previous English-to-Korean transliteration experiments. The reported results are for 1-best transliteration accuracy, defined as the number of correct transliterations divided by the number of generated transliterations, and include a mixture of words whose English pronunciation was automatically generated and words whose English pronunciation was found by dictionary lookup. Because not all results are reported over the same data set using the same methodology, they should be interpreted as representative of the various approaches to English-Korean transliteration rather than as strict comparisons. In general, the combined models outperform models that only include one source of information in the transliteration process. On average, the grapheme-based models are more accurate than the phoneme-based models, indicating that orthography alone is a more reliable indicator of the form of a transliterated word than phonology alone.

Model            Method          Source                                               Accuracy
Ortho-phonemic   Max-Ent         Oh et al. (2006a: 137, Table 11)                         73.3
                 TiMBL           Oh et al. (2006b: 200, Table VI)                         66.9
                 Rewrite Rules   Oh and Choi (2002: 6, Table 8)                           63.0
                 Decision Tree   Oh et al. (2006b: 200, Table VI)                         62.0
Grapheme-based   Weighted FST    Kang and Kim (2000: 422, Table 3)                        55.3
                 Decision Tree   Kang and Choi (2000b: 138, Section 5)                    51.3
Phoneme-based    Markov Window   Jung et al. (2000: 387, Figure 4)                        ≈53
                 Decision Tree   Kang (2001), from Oh et al. (2006b: 200, Table VI)       47.5

Table 3.6: Summary of previous transliteration results

It is worth straightening out a mischaracterization of the standard English-to-Korean transliteration rules (Korean Ministry of Culture and Tourism, 1995) that is repeated in one strand of English-to-Korean transliteration research:

However, EKSCR does not contain enough rules to generate correct Korean words for corresponding English words, because it mainly focuses on a way of mapping from one English phoneme to one Korean character without context of phonemes and PUs. For example, an English word ‘board’ and its pronunciation ‘/B AO R D/’, are transliterated into ‘bo-reu-deu’ by EKSCR – the correct transliteration is ‘bo-deu’ (Oh and Choi, 2002: 5).

Second, the EKSCR does not contain enough rules to generate relevant

Korean transliterations since its main focus is on a methods of mapping from one English phoneme to one Korean grapheme without the context

of graphemes and phonemes. For example, the English word board and its proununciation /B AO R D/ are incorrectly transliterated into ‘bo-reu-deu’ by EKSCR. However, the correct one, ‘bo-deu’, can be acquired when their contexts are considered (Oh and Choi, 2005: 1740).

The other problem is that EKSCRs does not contain enough rules to generate relevant Korean transliterations for all the corresponding English words since its main focus is on mapping from one English phoneme to one Korean grapheme without considering the context of graphemes and

phonemes. For example, the English word board and its proununciation /B AO R D/ are incorrectly transliterated into “boreudeu” by EKSCRs. If the contexts are considered, they are correctly transliterated into “bodeu” (Oh et al., 2006b: 191).

While it is true that the standard conversion rules do not adequately encapsulate the various ways in which English phonemes transliterate into Korean, the characterization of them as focusing mainly on a one-to-one bilingual mapping in the absence of contextual information is misleading. It is also incongruent with the description of the transliteration rules as “context-sensitive rewrite rules” given in (Oh et al., 2006a: 123). Instead, the rules are expressed in traditional phonological terms of phonologically conditioned sound change.

However, there is no rule that explicitly deals with the conversion of /r/ into Korean in this context. This is because the rules focus on alternations in the

pronunciation of English phonemes, i.e., environmentally conditioned changes. /r/ is always dropped in this context, so no rule is included. Nothing predicts that board would transliterate as polutu. On the other hand, there are lots of examples of

post-vocalic /r/ followed by a consonant that would indicate that board would not transliterate as polutu (Korean romanization not part of the original):

1.3  part     phatu
3.2  shark    syakhu
5.1  corn     khon
9.1  word     wetu
9.2  quarter  khwethe
9.3  yard     yatu; yearn  yen

So while the general sentiment is true, repeating this same example over and over results in a mischaracterization of the standard conversion rules to the larger research community.

3.3 Experiments on English-to-Korean Transliteration

This section describes and analyzes two ortho-phonemic models for transliterating English loanwords into Korean. The first model is based on a set of phonological conversion rules that describe the changes English words undergo when they are

borrowed into Korean. The second model is a statistical model that produces the highest scoring Korean transliteration of an English word based on a set of combined orthographic and phonemic features. The behavior of these two models with respect to the amount of training data required to produce optimal results is examined, and

the models are compared to each other in terms of the accuracy of the transliterations each produces. Both models are compared to a maximum entropy transliteration model which has obtained state-of-the-art results in previous research, and scenarios for which each of the models exhibits particular advantages are discussed. The sections below report the results of a series of experiments on English-to-

Korean transliteration. The first experiment deals with the rule based transliteration

model, first describing it in detail and then reporting the results of using it to transliterate a set of English-Korean loanwords. The second experiment presents a modified version of the rule based model which incorporates orthographic information into the transliteration process and examines its transliteration accuracy. The third experiment presents a statistical transliteration model and compares its performance to both the rule based models and a maximum entropy transliteration model. The last section of the chapter summarizes the characteristics of each model with respect to their applicability to situations where aligned bilingual training data is easily obtainable versus situations where it is harder to obtain.

3.3.1 Experiment One

3.3.1.1 Purpose

The purpose of this experiment is to investigate the use of phonological conversion rules for transliterating English words into Korean.

3.3.1.2 Description of the Transliteration Model

The transliteration model used in this experiment is a regular expression-based implementation of the Korean Ministry of Culture and Tourism (1995)’s set of English-to-Korean standard conversion rules. Although prescriptive in tenor, these rules are expressed in terms of feature-based phonological classes and are congruent with descriptive accounts of English loanword adaptation in Korean (e.g., stop and fricative adaptation (Kang, 2003; Kenstowicz, 2005; Lee, 2006; Park, 2007); vowel substitution (Yang, 1996)). The Korean Ministry of Culture and Tourism (1995)’s set of English-to-Korean standard conversion rules were manually converted into regular expressions in a computer program that takes a phonological representation of an English word

as input and produces a Korean transliteration of it as output. The programming language used was Python¹, although any language which provides regular expression support is suitable. In this experiment, the transliteration process was modeled in three steps.

First, a preprocessing step is applied to the English phonological representations that expands the single character representation of diphthongs used by the Hoosier Mental Lexicon (Nusbaum et al., 1984) into two vowel symbols. This step is performed because it reduces the number of symbols and transformation rules needed for transliteration. The second step consists of the successive application of a sequence of regular expression substitutions which transform a string of English phonemes into a Korean phonological representation. Finally, an optional post-processing step may be performed to syllabify the Korean string and convert it to hangul. This transliteration model assumes the definition of the following two character classes.

:shortvowel: = IE@aUcx^
:vowel:      = ieou + :shortvowel:

In addition to these definitions, a set of intermediate vowel symbols was used to handle

word boundaries, epenthesis, and /r/ deletion. # is inserted at the beginning and

end of words; ~ serves as a placeholder for deleted /r/, and ! and % stand for the epenthetic vowels /ɯ/ and /i/, respectively. Reserving extra symbols for epenthetic vowels facilitates the application of the phonological conversion rules such that rules that apply later are not inadvertently triggered by a vowel that was not present in the input. The preprocessing step consists of the following six character expansions.

Y -> ai
O -> oi
e -> ei
W -> au
X -> xr
R -> xr

¹Distributed under an open source license: http://www.python.org.

The transliteration step consists of the following regular expression substitutions, applied in the order presented below. In the description below, the following conventions for representing regular expression substitution are employed. Brackets [] are used to enclose a class of characters; e.g., [:vowel:] stands for any character that is a vowel. ^ inside brackets negates the character class; e.g., [^:vowel:] stands for any character that is not a vowel. Parentheses () are used to enclose regions of the regular expression that can be referred to in the substitution phase by index. Regions are numbered consecutively from the left starting at 1. For example, in the expression (first)(class), \1 refers to first and \2 refers to class. Text starting at %% contains examples meant to illustrate the application of each regular expression, but is not part of the regular expression itself.

contains examples meant to illustrate the application of each regular expression, but is not part of the regular expression itself.

1. /r/ deletion

r([^:vowel:]) -> ~\1
%% e.g., ‘church’ #CxrC# -> #Cx~C#

2. /ts/, /dz/ epenthesis

ts([^:vowel:]) -> C!\1
%% e.g., ‘Pittsburgh’ #pItsbx~g# -> #pIC!bx~g#
dz([^:vowel:]) -> J!\1
%% e.g., ‘odds’ #adz# -> #aJ!#

3. voiceless obstruent epenthesis

([^:shortvowel:])([ptk])([^:vowel:]) -> \1\2!\3
%% e.g., ‘cape’ #keip# -> #keip!#
([sTf])([^:vowel:]) -> \1!\2
%% e.g., ‘first’ #fx~st!# -> #fx~s!t!#

4. voiced obstruent/affricate epenthesis

([vbdzg])([^:vowel:]) -> \1!\2
%% e.g., ‘cape’ #keip# -> #keip!#
([CJSZ])([^:vowel:]) -> \1%\2
%% e.g., ‘church’ #Cx~C# -> #Cx~C%#

5. short vowel voiceless stop substitution

([:shortvowel:])p([^:vowel:]) -> \1b\2
%% e.g., ‘apt’ #@pt!# -> #@bt!#
([:shortvowel:])t([^:vowel:]) -> \1d\2
([:shortvowel:])k([^:vowel:]) -> \1g\2

6. /l/ gemination

([:vowel:~!%])l([:vowel:]) -> \1ll\2
%% e.g., ‘clasp’ #k!l@s!p!# -> #k!ll@s!p!#

7. unconditioned consonant substitutions

f -> p
v -> b
T -> s
D -> d
[zZ] -> J

8. unconditioned vowel substitutions

c -> o
I -> i
x -> ^
[U|] -> u
@ -> E
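To illustrate how such rules run in practice, the fragment below re-expresses a handful of them as Python re.sub calls applied in order. It covers only part of the rule set and uses the symbol conventions defined above, so it is a sketch of the mechanism rather than the full converter.

# A condensed sketch of the rule-based converter: a few of the substitutions above,
# expressed as regular expressions and applied in order.
import re

SHORTVOWEL = "IE@aUcx^"
VOWEL = "ieou" + SHORTVOWEL

RULES = [
    (r"r([^%s])" % VOWEL, r"~\1"),                                  # 1. /r/ deletion
    (r"([^%s])([ptk])([^%s])" % (SHORTVOWEL, VOWEL), r"\1\2!\3"),   # 3. stop epenthesis
    (r"([sTf])([^%s])" % VOWEL, r"\1!\2"),                          # 3. fricative epenthesis
    (r"f", "p"), (r"v", "b"),                                       # 7. consonant substitutions
    (r"c", "o"), (r"I", "i"), (r"x", "^"), (r"@", "E"),             # 8. vowel substitutions
]

def transliterate(phonemes):
    s = "#" + phonemes + "#"
    for pattern, repl in RULES:
        s = re.sub(pattern, repl, s)
    return s.strip("#")

print(transliterate("fxrst"))   # 'first' -> 'p^~s!t!' before syllabification/hangul conversion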

3.3.1.3 Experimental Setup

This experiment used the list of 10,000 English-Korean loanword pairs described in 2.1. The phonological representation of each English item in the list was transliterated via the rule based model and the resulting form was compared to the actual Korean adaptation of that English source word. Because the rule based model does not require training data, it was applied to all of the items in the data set.

3.3.1.4 Results and Discussion

The first evaluation of the rule based transliteration model measured transliteration accuracy in terms of the number of transliterated items that exactly matched the actual Korean form. Overall transliteration accuracy, measured as

# of correct transliterations / # of actual transliterations

was 49.2%. A strict comparison between the current work and previous research is not feasible given the range of approaches represented therein on different data sets². However, these results are in line with previous phoneme-based approaches (≈ 53% reported in Jung et al., 2000; 47.5% reported in Kang, 2001).

Based on the analysis of English loanwords in Korean provided in 2.1, it is known that vowel transliteration is harder to predict by phonological rule than consonant transliteration (Table 2.5). Therefore, we also examined the performance of

2Repeated efforts to obtain access to previously used data sets were unsuccessful.

the rule based model in terms of the number of correctly transliterated consonants per item. This comparison was made by deleting input vowels from both the predicted form (after transliteration) and the actual form, and comparing the remaining sequence of consonants. An input vowel is a transliterated vowel whose presence in

the transliterated form is due to a direct mapping from the original English vowel phoneme. In other words, epenthetic vowels were retained in the predicted and actual forms. For example, given the English word pocket and actual transliteration of phokheys, a predicted transliteration of phakheys counts as containing all correctly transliterated consonants (phkhs = phkhs).

Consonant sequence transliteration accuracy, defined as

# of correct consonant sequence transliterations / # of consonant sequence transliterations

was 89.9%. This is a stricter measure than overall character accuracy (cf. Kang and Kim, 2000; Oh and Choi, 2002), because it requires that all consonants in a word are correctly generated and ordered to count as correct. It also requires that rules concerning vowel epenthesis have correctly applied, as these rules often change the nature of the preceding consonant (e.g., whether it is an aspirated syllable onset or an unaspirated coda).
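The consonant-sequence comparison itself is straightforward to sketch. The vowel inventory below is an assumption made for the romanized forms in this example; the actual evaluation operates on the aligned phonological representations.

# Compare predicted and actual forms after removing input (non-epenthetic) vowels.
INPUT_VOWELS = set("aeiouEyw^")   # assumed romanized vowel symbols for this sketch

def consonant_skeleton(form, epenthetic_positions=()):
    return "".join(ch for i, ch in enumerate(form)
                   if ch not in INPUT_VOWELS or i in epenthetic_positions)

def consonants_correct(predicted, actual, pred_epen=(), act_epen=()):
    return consonant_skeleton(predicted, pred_epen) == consonant_skeleton(actual, act_epen)

# 'pocket': predicted phakheys vs. actual phokheys -> both skeletons are 'phkhs'.
print(consonants_correct("phakheys", "phokheys"))   # True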

The congruence of the full word transliteration results with previous models and the disparity between full word transliteration and consonant sequence transliteration reported here suggest that the phonological information represented in this data set alone does not convey sufficient information to reliably predict the transliterated

form of vowels in English loanwords in Korean. On the basis of this observation and the analysis of English loanwords in Chapter 2.1, we modified the rule based model to incorporate orthographic information into the transliteration of vowels. This modified rule based transliteration model is described in the next section.

3.3.2 Experiment Two

Previous researchers have examined the performance of transliteration models that produce a set of transliteration candidates for a given input string (Lee, 1999; Jung et al., 2000; Kang and Kim, 2000). The motivation for this approach to transliteration is spelled out in Kang and Choi (2000b), which points out that multiple transliterations of the same English word are often found in large document collections, creating problems for information retrieval. For example, the English word digital appears variously in Korean as ticithel, ticithal, and ticithul even though ticithel is the standard transliteration (Kang and Choi, 2000b: 133). Following this strand of research, this experiment examines the performance of a rule based model that produces a set of transliteration candidates.

3.3.2.1 Purpose

The purpose of this experiment is to investigate the performance of an ortho-phonemic rule based transliteration model for generating sets of transliteration candidates for English loanwords in Korean.

3.3.2.2 Description of the Model

One of the main sources of transliteration variability for vowels lies in the effect of orthography on pronunciation, where the orthographic vowels ‘a’, ‘e’, ‘i’, ‘o’, ‘u’ are often transliterated with their IPA values of /a, e, i, o, u/ regardless of their actual pronunciation (Oh and Choi, 2005). Therefore, we modified the rule-based model to produce both orthographic and pronunciation-based versions of English vowels. The modified model accepts an aligned orthographic and phonological representation of an English word. For each phonological vowel in the input up to two transliterations are produced: one is based on phonological substitution and the other is based on orthographic copying. In case the phonological value of a vowel is equivalent to its orthographic representation (e.g., smoke /o/) only one vowel transliteration is produced. Prior to transliteration, the alignment between an orthographic and phonemic representation of a word is converted into a finite state automaton whose arcs are labeled with phonemes. A vowel alignment produces up to two arcs from a preceding

to a following state. One arc is labeled with a phoneme symbol and the other is labeled with an orthographic character. An example finite state automaton is shown in Figure 3.2 for the alignment between the orthographic and phonological representations for cactus-k@ktxs. We used the AT&T Finite-State Machine Library (Mohri, Pereira, and Riley, 1998) to process the finite state automata produced for transliteration. Taking all paths through the finite state automaton in Figure 3.2 yields four strings which are each input to the rule based transliteration model described in Experiment 1: k@ktxs, k@ktus, kaktxs, kaktus. If a phonological vowel aligns with more than one orthographic vowel, e.g., head:hE-d, the only orthographic vowel produced is the one aligned directly to the phonological vowel. In other words, the null symbol ‘-’ in

the phonological representation does not produce any additional paths through the finite state machine. In principle, if a word contains V phonological vowels, up to

[Figure 3.2 shows a seven-state automaton (states 0–6) for cactus, with arcs labeled k, a/@, k, t, u/x, s between successive states.]

Figure 3.2: Example rule-based transliteration automaton for cactus

2^V unique transliterations may be produced: every vowel may result in two paths through the finite state automaton, so the final number of transliteration candidates will be 2_{v_1} \times 2_{v_2} \times \cdots \times 2_{v_V}. In practice, because the orthographic and phonemic vowels are often equivalent, far fewer candidates are produced (average 3.4 per word).
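Candidate generation can be sketched without the finite-state machinery by taking the cartesian product over per-vowel alternatives. The alignment for cactus and the vowel inventory below are illustrative, and the FSA-based implementation described above is the one actually used.

# Enumerate candidate input strings: each vowel contributes its phonemic and, if
# different, its orthographic value; non-vowels contribute a single arc.
from itertools import product

PHON_VOWELS = set("aeiou@IE^Ucx")   # assumed phoneme vowel symbols for this sketch

def candidates(alignment):
    """alignment: list of (orthographic char, phonemic char) pairs, '-' for null."""
    slots = []
    for orth, phon in alignment:
        if phon in PHON_VOWELS and orth != "-" and orth != phon:
            slots.append((phon, orth))       # two arcs: pronunciation and spelling
        elif phon != "-":
            slots.append((phon,))            # single arc
    return ["".join(path) for path in product(*slots)]

cactus = [("c", "k"), ("a", "@"), ("c", "k"), ("t", "t"), ("u", "x"), ("s", "s")]
print(candidates(cactus))   # ['k@ktxs', 'k@ktus', 'kaktxs', 'kaktus']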

3.3.2.3 Experimental Setup

This experiment used the list of 10,000 English-Korean loanword pairs described in 2.1. The aligned orthographic and phonological representation of each English item in the list was transliterated via the ortho-phonemic rule based model and the resulting forms were compared to the actual Korean adaptation of that English source word.

Because the ortho-phonemic rule based model does not require training data, it was applied to all of the items in the data set.

3.3.2.4 Results and Discussion

Following Lee (1999), Jung et al. (2000) and Kang and Kim (2000), we report whole word transliteration accuracy as the average number of correctly transliterated words divided by the actual number of loanwords (recall)

# of correct transliterations / # of actual transliterations.

We also report macroaveraged transliteration precision, which takes into account the total number of transliteration candidates produced:

# of correct transliterations / # of generated transliterations.

The ortho-phonemic rule based model returns recall and precision values of 0.78 and 0.23, respectively. In applications such as bilingual information retrieval where the cost of false positives is low or the chance of generating a false hit is unlikely (Kang and Choi, 2000b, 2002), this model offers benefits over the rule based model in terms of coverage. Once again, these results are compatible with previous research that has reported transliteration accuracy over multiple transliteration candidates (Lee

1999; Jung et al. 2000; Kang and Kim 2000). However, the current model offers two advantages over previous statistical approaches to English-Korean transliteration. One is that a rule based approach does not require a bilingual training set. Its only requirement is a monolingual pronunciation dictionary, which for English at least is readily available (Weide, 1998). This means that a rule based approach to transliteration can be extended to a large number of language pairs more quickly and with less expenditure of resources than approaches that require aligned bilingual data (see Section 3.3.5 for elaboration of this point).

A second advantage of the current model over previous n-best approaches is that by focusing attention on the transliteration units that exhibit the most variability (vowels), we are able to generate a relatively small number of transliteration candidates per word. Furthermore, the set of candidates is tuned to the input in such a way that relatively invariant items (e.g., a word with one phonological vowel whose pronunciation matches its orthographic form, like smoke /o/) produce a small set of transliteration candidates. Inputs that are likely to exhibit greater variation produce larger candidate sets. Finally, we are able to offer a direct comparison between the current approach and previous ones in terms of the precision given a correctly generated transliteration. On average, when the correct transliteration appears in the candidate set the ortho-phonemic rule based model generates 2.85 candidates, giving a precision when correct of 1/2.85 = 0.35. The size of the candidate set considered by previous researchers varies – Lee (1999) evaluated transliteration accuracy on the basis of the 20 most likely transliteration candidates, giving a precision when correct of 0.05; Jung et al. (2000) considered the top 10 transliteration candidates, giving a precision when correct of 0.10, and Kang and Kim (2000) used the top 5, giving a precision when correct of 0.20, all of which are considerably lower than the current results. Although the relative performance of the ortho-phonemic transliteration model represents an improvement over previous work, its overall precision is quite low. A further disadvantage of the model is that it does not rank transliteration candidates by any measure of goodness. Many statistical models do allow an ordering of a set of transliteration candidates. Therefore, we conducted a third experiment with a statistical transliteration model that produces a ranked list of transliteration candidates, and compare its performance to the rule based models.

3.3.3 Experiment Three

3.3.3.1 Purpose

The purpose of this experiment is to examine the performance of a statistical transliteration model and compare it to the ortho-phonemic rule based model in terms of ranking transliteration candidates.

3.3.3.2 Description of the Model

In this experiment we model the task of producing a transliterated Korean character

in terms of the probability of that character being generated by a given sequence of graphones and phonemes. Under this approach, the task of transliterating an English word into Korean can be formulated as the problem of finding an optimal alignment between three streams of symbols

G_E = (g_1, ..., g_L)
Φ_E = (φ_1, ..., φ_L)
K   = (κ_1, ..., κ_L)

where GE is a sequence of English graphemes, ΦE is a sequence of English phonemes, and K is a sequence of Korean graphemes. We assume that the three sequences have equal length (L) due to the insertion of a null symbol (‘-’) when necessary, and assume

a one-to-one alignment between symbols in the three strings. For example, the English

word ‘first’ and its Korean transliteration /pʰʌsɯtʰɯ/ can be represented as

G_E = ( f    i    r    s    −    t    − )
Φ_E = ( f    ɝ    −    s    −    t    − )
K   = ( pʰ   ʌ    −    s    ɯ    tʰ   ɯ )

with the symbol alignments (f, f, pʰ), (i, ɝ, ʌ), (r, −, −), etc. We are interested in obtaining the Korean string K that receives the highest score given (G_E, Φ_E). Computing the score of (G_E, Φ_E, K) can be formulated as a decoding problem that consists of finding the highest scoring Korean string \hat{K} given the aligned sequences of English graphemes and phonemes G_E and Φ_E.

The score of a particular Korean string given G_E and Φ_E is the product of the scores of the alignments comprising the three sequences:

Score(K | G_E, Φ_E) = \prod_{i=1}^{L} p(κ_i | g_i, φ_i)

In order to account for context effects of adjacent graphemes and phonemes on the

transliteration of a particular English grapheme-phoneme pair, we define gi and ϕi

as subsequences of GE and ΦE, respectively, centered at i and containing elements

<g_{i−2}, ..., g_{i+2}> and <φ_{i−2}, ..., φ_{i+2}>, respectively. For example, if κ_4 = s in the

preceding example, then g_4 = <i, r, s, −, t> and φ_4 = <ɝ, −, s, −, t>. Positions i < 1 and i > L are understood to contain a boundary symbol (#) to allow modeling context at word starts and ends. We estimate the probability of κ_i given subsequences g_i and φ_i with relative frequency counts:

p(κ_i | g_i, φ_i) = \frac{p(g_i, φ_i, κ_i)}{p(g_i, φ_i)} \approx \frac{c(g_i, φ_i, κ_i)}{c(g_i, φ_i)}

Given the relatively large context window (2 preceding and 2 following orthographic-phoneme pairs), the chance of encountering an unseen feature in the test set

is relatively high. In order to mitigate the effect of data sparsity on the transliteration model described above, we modified it to use a backoff strategy that involved successively decreasing the size of the context window centered at the Korean character currently being predicted until a trained feature was found. The specific backoff

strategy used in this model is to search for features in the following order starting at

the top of the list, where Si represents the source orthographic-phoneme pair at the

index of the Korean letter being predicted and si represent preceding and following ortho-phonemic pairs:

60 si−2si−1Sisi+1si+2

si−2si−1Sisi+1

si−1Sisi+1si+2

si−1Sisi+1

si−2si−1Si

Sisi+1si+2

si−1Si

Sisi+1

Si

As soon as a trained feature is found, iteration stops and the most highly ranked Korean target corresponding to that feature is produced. In the event that no feature corresponding to S_i is found, no prediction is made. This backoff strategy was based on the intuition that larger contextual units provide more reliable statistical cues to the transliteration of an English segment; it was determined prior to assessing its performance on any of the data and was not altered in response to its performance on the data.
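The backoff lookup can be sketched as follows. The context-window offsets mirror the order listed above; the counts and the toy board example are invented stand-ins for the trained relative-frequency estimates.

# Backoff over context windows around the source pair S_i, widest first.
from collections import Counter, defaultdict

BACKOFF = [  # index offsets relative to i, matching the order listed in the text
    (-2, -1, 0, 1, 2), (-2, -1, 0, 1), (-1, 0, 1, 2), (-1, 0, 1),
    (-2, -1, 0), (0, 1, 2), (-1, 0), (0, 1), (0,),
]

def predict(source, i, counts):
    """source: list of (grapheme, phoneme) pairs; counts: {context tuple: Counter of outputs}."""
    padded = [("#", "#")] * 2 + list(source) + [("#", "#")] * 2
    for offsets in BACKOFF:
        ctx = tuple(padded[i + 2 + o] for o in offsets)
        if ctx in counts:
            return counts[ctx].most_common(1)[0][0]
    return None   # no trained feature found: no prediction

# Toy training: the pair ('o', 'o') in a board-like context maps to Korean 'o'.
counts = defaultdict(Counter)
counts[(("b", "b"), ("o", "o"), ("a", "-"))]["o"] += 1
source = [("b", "b"), ("o", "o"), ("a", "-"), ("r", "r"), ("d", "d")]
print(predict(source, 1, counts))   # 'o', found at the (-1, 0, 1) window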

In order to establish a comparison between previous statistical transliteration approaches and the current work, we also applied a maximum entropy model (Berger et al., 1996; Pietra, Pietra, and Lafferty, 1997) that was demonstrated to outperform other machine learning approaches to English-Korean transliteration in previous comparisons (Oh and Choi, 2005; Oh et al., 2006a). The maximum entropy model is a

conditional probability model that incorporates a heterogeneous set of features to construct a statistical model that represents an empirical data distribution as closely as possible (Berger et al., 1996; Zhang, 2004). In the maximum entropy model, events are represented by a bundle of binary feature functions that map an outcome event y and a context x to {0, 1}. For example, the event of observing the Korean letter ‘p’ in the context of ##boa in a word like board can be represented as

f(x, y) = 1  if y = ‘p’ and x = ##boa
          0  otherwise

Once a set of features has been selected, the corresponding maximum entropy model can be constructed by adding features as constraints to the model and adjusting their weights. The model must satisfy the constraint that the empirical expectation of each feature in the training data equals the expectation of that feature with respect to the model distribution. Among the models that meet this constraint is one with maximum entropy. Generally, this maximum entropy model is represented as

p(y|x) = \frac{1}{Z(x)} \exp\left[ \sum_{i=1}^{k} λ_i f_i(x, y) \right]

where p(y|x) denotes the conditional probability of outcome y given contextual feature x, k is the number of features, f_i(x, y) are feature functions, and λ_i is a weighting parameter for each feature. Z(x) is a normalization factor defined as

62 Z(x)= exp λifi(x, y) y X hX i

to guarantee that \sum_{y} p(y \mid x) = 1 (Berger et al., 1996; Zhang, 2004).

In this experiment, we used Zhang Le’s maximum entropy toolkit (Zhang, 2004). In addition to the contextual features used by the statistical decision list model proposed here, we added grapheme-only and phoneme-only contextual features to the maximum entropy model in order to provide a close replication of the feature sets described by Oh et al. (2006a, b). Thus, each target character k_i is represented by a bundle of orthographic, phonemic, and ortho-phonemic contextual features. The full feature set is shown in Table 3.7 for the transliteration of target ‘p’ in the word board.

Feature          Values                                                          Target
Orthographic     ##boa, ##bo, #boa, #bo, ##b, boa, #b, bo, b                     ‘p’
Phonemic         ##bo-, ##bo, #bo-, #bo, ##b, bo-, #b, bo, b                     ‘p’
Ortho-phonemic   ##boa:##bo-, ##bo:##bo, #boa:#bo-, #bo:#bo, ##b:##b,            ‘p’
                 boa:bo-, #b:#b, bo:bo, b:b

Table 3.7: Feature bundles for transliteration of target character ‘p’
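For concreteness, the feature bundles in Table 3.7 can be generated mechanically from the aligned grapheme and phoneme strings. The sketch below assumes the two strings have been position-aligned and padded with '#' boundary symbols; the function name and the exact alignment for board are illustrative assumptions.

    # Context windows as (left, right) sizes; the nine windows mirror the
    # backoff order used by the statistical decision list model.
    WINDOWS = [(2, 2), (2, 1), (1, 2), (1, 1), (2, 0), (0, 2), (1, 0), (0, 1), (0, 0)]

    def context_features(graphemes, phonemes, i):
        """Build a Table 3.7-style feature bundle for target position i."""
        feats = []
        for left, right in WINDOWS:
            g = graphemes[i - left:i + right + 1]
            p = phonemes[i - left:i + right + 1]
            feats.append(("orth", g))
            feats.append(("phon", p))
            feats.append(("orth-phon", g + ":" + p))
        return feats

    # Target 'b' of "board", padded with '#' and aligned with an assumed
    # phoneme string "##bo-rd".
    print(context_features("##board", "##bo-rd", 2)[:3])
    # [('orth', '##boa'), ('phon', '##bo-'), ('orth-phon', '##boa:##bo-')]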

3.3.3.3 Experimental Setup

We evaluated both models by splitting the list of loanwords used in Experiments 1 and 2 into a training set and a disjoint set used for testing. 10% of the data was fixed as a test set, and the remainder of the total data set was used to select training data. The size of the training split ranged from 5% (500 items) to 90% (9,000 items) of the total data set in 5% intervals. Each training split was tested on the same 10% test set. This procedure was repeated 10 times, and the results were averaged. Following Oh and Choi (2005), we trained the maximum entropy model using the default Gaussian prior of 0; in addition, we used 30 iterations of the default L-BFGS method of parameter estimation and did not change any other default settings. However, we note that a training regime which utilizes development data to tune the Gaussian parameter and uses more training iterations may produce better results than those obtained here.

3.3.3.4 Results and Discussion

The first evaluation of the statistical transliteration models is reported in terms of 1-best whole word transliteration accuracy, defined as

\frac{\text{\# of correct transliterations}}{\text{\# of actual transliterations}}.

Figure 3.3 depicts transliteration accuracy for the two statistical models as a function of the size of the training set. This figure also shows the performance of the rule based model for comparison. Because the rule based model does not require training data, its performance is flat. For both statistical models, transliteration accuracy clearly depends on the amount of training data. As the amount of training data increases, the performance of the maximum entropy model and the statistical model proposed here nearly converge, but for all trials reported here the performance of the maximum entropy model never exceeds the performance of the newly proposed model. The best transliteration accuracies obtained by the statistical transliteration model and the maximum entropy model are 73.4% and 71.9%, respectively. The proposed model is relatively robust even to small amounts of training data, performing better than the rule based model with as few as 500 training items (5% training data).

The performance difference between the statistical decision model and the maximum entropy model is most noticeable for small amounts of training data. On 500 training items (5% training data), the statistical decision model performs nearly 20 percentage points higher than the maximum entropy model, indicating a potential advantage for the use of this model in situations where training data is scarce.

[Figure: transliteration accuracy (0.3–1.0) plotted against % training data (5–90%) for the conditional probability model, the maximum entropy model, and the rule-based model.]

Figure 3.3: Performance of three transliteration models as a function of training data size

We also examined the performance of the statistical decision model with respect to transliteration accuracy of consonant sequences (Experiment 1). The statistical decision model returns 90.8% consonant sequence transliteration accuracy, comparable to that of the rule based model (89.9%). These facts suggest that consonant transliteration is decidedly less variable than vowel transliteration, and that the main advantage that statistical models have over the rule based model is in accounting for the contextual effects of orthography on vowel transliteration.

In order to compare the statistical decision model to the ortho-phonemic rule based model under the condition of producing multiple transliteration candidates,

the statistical model was modified to produce up to two Korean characters for each phonological vowel in an English input. For example, given an input of cactus-k@ktxs, the model produces a weighted finite state automaton whose weights correspond to the negative log probabilities of each Korean character given a source feature (Figure 3.4). Transliteration candidates are ranked according to the cost of their path through

[Figure: a probabilistic transliteration automaton over states 0–7 with arcs k/0, {E/0.465, a/1.014}, g/0, t/0.052, {u/0.693, ^/0.693}, s/0, U/0.]

Figure 3.4: Example probabilistic transliteration automaton for cactus

the finite state automaton. For the cactus example, we obtain the following ranking of transliteration candidates: kEgtusU, kEgt^sU, kagtusU, kagt^sU, with the correct transliteration kEgt^sU coming in second place.
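For a chain-shaped automaton like the one in Figure 3.4, ranking by path cost reduces to summing the negative log probabilities of the characters chosen at each position and sorting the resulting strings. A small sketch under that simplifying assumption, using the weights from the cactus example (the data structure is hypothetical):

    from itertools import product

    # Per-position alternatives as (character, negative log probability),
    # following the cactus automaton in Figure 3.4.
    positions = [
        [("k", 0.0)],
        [("E", 0.465), ("a", 1.014)],
        [("g", 0.0)],
        [("t", 0.052)],
        [("u", 0.693), ("^", 0.693)],
        [("s", 0.0)],
        [("U", 0.0)],
    ]

    def ranked_candidates(positions):
        """Enumerate all paths and rank them by total cost (lower is better)."""
        candidates = []
        for path in product(*positions):
            string = "".join(ch for ch, _ in path)
            cost = sum(w for _, w in path)
            candidates.append((cost, string))
        # The sort is stable, so equal-cost candidates keep enumeration order.
        return [s for _, s in sorted(candidates, key=lambda c: c[0])]

    print(ranked_candidates(positions))
    # ['kEgtusU', 'kEgt^sU', 'kagtusU', 'kagt^sU']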

Figure 3.5 contains precision and recall curves as a function of the amount of training data for the statistical decision list model producing multiple transliteration candidates. When trained on 90% of the data, the statistical model obtains recall and precision scores of 0.84 and 0.49. The rule based model returns recall and precision

values of 0.78 and 0.23, and does not systematically vary with respect to the amount of training data. One reason for the higher precision of the statistical model is that it generates on average fewer candidates than the rule based model – 1.9 versus 3.4, respectively. The reason that the statistical model generates fewer candidates is that very often the second Korean character produced by the decision list is the same as

the first, in which case the model only makes one prediction.

[Figure: recall and precision (0.0–1.0) plotted against % training data (5–90%) for the statistical decision list model producing multiple transliteration candidates.]

Figure 3.5: Performance of the statistical decision list model producing multiple transliteration candidates as a function of training data size

This situation occurs when a more specific feature predicts a single vowel, and in order to obtain the second transliteration candidate, the backoff model described above is traversed, and the next feature encountered also predicts the same vowel.

When this happens, only one transition for that feature is generated in the corresponding finite state automaton. In this way the statistical decision list model is taking advantage of converging statistical evidence to limit the number of candidates it produces.

The statistical decision list model also offers an advantage over the ortho-phonemic rule based model in that it is capable of producing a ranked list of transliteration candidates, with the best candidate appearing at the beginning of the list. In order to compare the statistical model with the rule based model in terms of

candidate ranks, we computed the mean reciprocal rank. The reciprocal rank of a transliteration candidate is the multiplicative inverse of the rank of that candidate. For example, if the correct transliteration occurred as the second candidate in the list, that item’s reciprocal rank is 1/2=0.5. The mean reciprocal rank is the average of the reciprocal ranks of each transliterated item. In case the correct answer does

not appear in the list of transliteration candidates for a given item, a reciprocal rank of 0 is assigned. The mean reciprocal rank for the statistical model is 0.77 versus 0.54 for the rule based model3.
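Computing the mean reciprocal rank from ranked candidate lists is straightforward; a short sketch with invented data follows.

    def mean_reciprocal_rank(items):
        """items: list of (gold_form, ranked_candidate_list) pairs."""
        total = 0.0
        for gold, candidates in items:
            if gold in candidates:
                total += 1.0 / (candidates.index(gold) + 1)   # ranks start at 1
            # gold forms missing from the list contribute a reciprocal rank of 0
        return total / len(items)

    # Invented example: correct form ranked first, second, and missing.
    data = [
        ("kEgt^sU", ["kEgt^sU", "kEgtusU"]),
        ("thEphol", ["thaphol", "thEphol"]),
        ("missing", ["a", "b"]),
    ]
    print(mean_reciprocal_rank(data))   # (1 + 0.5 + 0) / 3 = 0.5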

3.3.4 Error Analysis

Examination of transliterations missed by the statistical model shows that many of these items are ones for which vowel transliteration follows an orthographic transliteration, e.g., oxalis → / /, orangutan → / /, ketene → / /, antivitamin → / /, delphi → / /, lazuli → / /, alkali → / /. An alternative explanation for orthographic transliteration is that the word is not borrowed directly from English but is borrowed in both languages from another source or has come to Korean from English via Japanese (Kang, Kenstowicz, and Ito, 2007). Although the ability to assess a detailed etymological history of newly encountered foreign words is difficult to implement in an automatic transliteration system, knowledge of the frequency of a word’s usage in non-English text (such as

3The rule based model does not impose a ranking on transliteration candidates, so the default hash order of the Python dictionary object was used to order candidates in the rule based model.

would be available, e.g., from Google estimates of language specific document counts for a word) could be explored for its utility in influencing the expectation of an English phonological versus orthographic transliteration. Work along these lines remains for future research.

A second area where both the statistical and rule-based models had difficulty is consonant transliteration corresponding to internal word boundaries in compounds like taphole, spillover, blackout, kickout, locknut, and cakework. In these cases the actual transliterations mark the presence of the internal word boundary by applying the expected end-of-word transliteration rule. For example, in the transliteration of the word black, the final /k/ becomes an unaspirated coda in Korean: / /. In intervocalic position, English voiceless stops typically aspirate and are realized as syllable onsets. For example, in the word Utah, the English /t/ becomes Korean aspirated / /, as in / /. In compound words like blackout, however, the intervocalic stop follows the end-of-word transliteration pattern and becomes / /. This transliteration is unexpected if only the segmental context is considered, where the intervocalic consonant would typically become an onset of the following syllable (blackout → */ /). Applying a module to pre-identify potential compound words and insert a word boundary symbol (e.g., blackout → #black#out#) is one way to incorporate additional morphological knowledge into the transliteration process and would be expected to improve transliteration accuracy in these cases.
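A pre-identification module of this kind could be approximated by a greedy split against an English word list, as in the sketch below; the miniature word list and function name are hypothetical, and real compounds would call for a more careful segmentation strategy.

    # Hypothetical mini word list; in practice this would be a full English lexicon.
    WORDS = {"black", "out", "spill", "over", "tap", "hole", "lock", "nut"}

    def mark_compound(word):
        """Insert '#' boundaries if the word splits into two known words,
        e.g. 'blackout' -> '#black#out#'; otherwise return the word unchanged."""
        for i in range(2, len(word) - 1):
            left, right = word[:i], word[i:]
            if left in WORDS and right in WORDS:
                return "#" + left + "#" + right + "#"
        return word

    print(mark_compound("blackout"))   # '#black#out#'
    print(mark_compound("board"))      # 'board'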

3.3.5 Conclusion

This chapter presented two novel transliteration models, both of which are robust to small amounts of data and are parsimonious in terms of the number of parameters required to estimate them and the number of outputs they produce. The rule based model is defined by a small set of regular expressions and requires no training data. By modifying it to produce both orthographic and pronunciation based vowel transliterations, its coverage is substantially increased. Relative to previous n-best transliteration models, its precision is high; however, its precision is substantially lower than that of the statistical decision list model when the latter model is modified to produce multiple transliteration candidates as well. The statistical decision list model achieves reasonable results on small amounts of training data. As the amount of training data increases, the performance of the two statistical models becomes much closer, although the simpler statistical model slightly outperforms the maximum entropy model on all trials in the experiments reported here. However, the maximum entropy model provides greater flexibility for incorporating multiple sources of information, and its performance may increase given a richer feature set for which the statistical decision list model is less suited.

Furthermore, its performance may improve given a suitable Gaussian penalty. These possibilities remain to be explored in future research. The rule based and statistical models lend themselves to situations where bilingual training data is scarce or unavailable. Although the cost of developing an aligned list of loanwords for an arbitrary pair of languages may be lower than the cost of developing a richer resource such as a large syntactically and semantically annotated corpus, it is not negligible. We are not aware of any accounts of the cost of developing a list of aligned English-Korean loanwords from scratch, but can provide an estimate of the amount of data that would be required to produce a similar list of

English loanwords in Chinese. Chinese is similar to Korean in that it has recently begun importing English loanwords into its lexicon as well (Riha and Baker, 2008a, b). However, in Chinese,

these words are often borrowed “as is”, i.e., in the original English orthography. Because these words occupy a distinct range of character codes when stored in electronic orthographic form, they are easy to extract from Chinese text using standard regular expression utilities (e.g., Perl or grep). Figure 3.6 displays the number of unique Roman letter strings in the 2004 CNA subsection4 of the Chinese Gigaword corpus (Graff, 2007) against the number of Chinese characters read before encountering each new instance. For example, the figure shows that in order to come across 5,000 unique Roman letter words, 17 million Chinese characters have to be read (conservatively, 4.25 million words on the basis of estimates of the average length of Chinese words in Teahan, Wen, Mcnab, and Witten 2000); in order to extract 10,000 unique Roman letter words, 37 million Chinese characters (9.25 million words) have to be read. For language pairs that are not as well attested (e.g., Danish-Korean, Italian-Korean), the amount of material required to produce similar lists would be substantially greater or non-existent at the requisite scale. However, phonological accounts of loanword adaptation such as that provided by Li (2005) contain phonological conversion rules for adapting loanwords into Korean from many languages, including Danish, Italian, Thai, Romanian, and Swedish among others. Furthermore, it is possible to

find similar accounts for additional pairs of languages like French and Vietnamese (Barker, 1969). In such situations, the cost and time required to develop even a moderately sized list of aligned loanwords for each of these language pairs is likely to exceed the cost and time required to deploy a rule based transliteration model. The next chapter demonstrates the utility of a low precision rule based transliteration model for bootstrapping a statistical model that classifies words according to their etymological source.
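As noted above, Roman letter strings occupy a distinct range of character codes in Chinese text, so they can be pulled out with ordinary regular expressions. A minimal Python sketch (the sample sentence is invented):

    import re

    def roman_strings(text):
        """Return all maximal ASCII-letter substrings in a text."""
        return re.findall(r"[A-Za-z]+", text)

    # Invented example mixing Chinese characters and Roman letter words.
    print(roman_strings("他在Google工作, 常用email"))   # ['Google', 'email']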

4This is the section of the corpus with the highest percentage of Roman letter words (Riha and Baker, 2008a, b).

[Figure: number of Chinese characters read (17, 37, 57, 77 million) plotted against the number of unique Roman letter words encountered (0–20,000) in the Chinese Gigaword Corpus (CNA 2004).]

Figure 3.6: Number of unique Roman letter words by number of Chinese characters in the Chinese Gigaword Corpus (CNA 2004)

CHAPTER 4

AUTOMATICALLY IDENTIFYING ENGLISH LOANWORDS IN KOREAN

4.1 Overview

This chapter deals with the task of automatically classifying unknown words ac- cording to their etymological source. It focuses on identifying English loanwords in Korean, and presents an approach for automatically generating training data for use by supervised machine learning techniques. The main innovation of the approach presented here is its use of generative linguistic rules to produce large quantities of training data, circumventing the need for manually labeled resources. Being able to automatically identify the etymological source of an unknown word is important for a wide range of NLP applications. For example, automatically translating proper names and technical terms is a notoriously difficult task because these items can come from anywhere, are often domain-specific and are frequently missing from bilingual dictionaries (e.g., Knight and Graehl, 1998; Al-Onaizan and

Knight, 2002). In the case of borrowings across languages with unrelated writing sys- tems and dissimilar phonemic inventories (i.e., English and Korean), the appropriate course of action for an unknown word may be transliteration or back-transliteration (Knight and Graehl, 1998). However, in order to transliterate an unknown word cor- rectly, it is necessary to first identify the originating language of the unknown word.

Etymological classification also plays a role in information retrieval and cross-lingual

information retrieval systems where finding equivalents between a source word and its various target language realizations improves indexing of search terms and subsequently document recall (e.g., Kang and Choi, 2000b; Oh and Choi, 2001; Kang and Choi, 2002).

Source language identification is also a necessary component of speech synthesis systems, where the etymological class of a word can trigger different sets of letter-to-sound rules (e.g., Llitjós and Black, 2001; Yoon and Brew, 2006). In Korean, for example, a phonological consonant tensification rule applies to semantically transparent compounds of Sino-Korean origin. For example, the Sino-Korean syllable pyeng corresponds to two homographic morphemes, illness and anger, both of which have two pronunciations in compounds: untensed initial /p/ (e.g., hwapyeng ‘vase’, holipyeng ‘genie’s bottle’, and cipyeng ‘terminal illness’) and tensed initial /p/ (e.g., khollapyeng, hwapyeng ‘anger disease’, and helipyeng ‘backache’) (Yoon and Brew, 2006: 367). In addition, words of English origin often undergo /s/-tensification that is not orthographically indicated (e.g., seyil ‘sale’, phelsu ‘pulse’) (Yoon and Brew, 2006: 372).

The sections that follow describe and evaluate statistical approaches to identi- fying English loanwords in Korean. Section 4.2 describes previous work on identifying English loanwords in Korean. Section 4.3 lays out the current approach and describes the supervised learning algorithm used in the experiments that are presented in Sec- tion 4.4.

4.2 Previous Research

Identifying foreign words is similar to the task of language identification (e.g., Beesley, 1988), in which documents or sections of documents are classified according to the

language in which they are written. However, foreign word identification is made more difficult by the fact that words are nativized by the target language phonology and the fact that differences in character encodings are removed when words are rendered in the target language orthography. For example, French and German words are often written in English just as they appear in the original languages – e.g., tête or außerhalb. In these cases, characters like ê and ß provide reliable cues to the etymological source of the foreign word. However, when these same words are transliterated into Korean, such character level differences are no longer maintained: tête becomes theytu and außerhalb becomes awusehalpu (Li, 2005: 132). Instead, information such as transition frequencies between characters or the relative frequency of certain characters in known Korean words versus known French or German words can be used to distinguish these classes of words. Oh and Choi (2001) describes an approach along these lines to automatically identifying and extracting English words from Korean text. Oh and Choi (2001) formulates the problem in terms of a syllable tagging problem – each syllable in a hangul orthographic unit is identified as foreign or Korean, and each sequence of foreign-tagged syllables is extracted as an English word. Hangul strings are modeled by a hidden Markov model where states represent a binary indication of whether a syllable is Korean or not. Transitional probabilities and the probability of a syllable being English or Korean are calculated from a corpus of over 100,000 words in which each syllable was manually tagged as foreign or Korean. Oh and Choi (2001) reports precision and recall values ranging from 96% to 98% for identifying foreign word tokens in their corpus, but it is not clear whether these values are obtained from a disjoint train/test split of the data or indicate performance of their system on trained data. Kang and Choi (2002) employs a similar Markov-based approach that alleviates the burden of manually syllable tagging an entire corpus, but relies instead

on a foreign word dictionary, a native word dictionary, and a list of 2000 function words obtained from a manually POS-tagged corpus. Kang and Choi (2002) uses their method to extract a set of 799 potential foreign terms from their corpus, and restricts their analysis to this set of terms. Kang and Choi (2002) reports precision and recall for foreign word extraction over this candidate set of 84% and 92%, respectively. While these results are promising, the burden of manually labeling data has not been eliminated, but deflected to external resources. The experiments presented in the next section describe an accurate, easily extensible method for automatically classifying unknown foreign words that requires minimal monolingual resources and no bilingual training data (which is often difficult to obtain for an arbitrary language pair). It does not require tagging and uses corpus data that is easily obtainable from the web, for example, rather than hand-crafted lexical resources.

4.3 Current Approach

While statistical approaches have been successfully applied to the language identification task, one drawback to applying a statistical classifier to loanword identification is the requirement for a sufficient amount of labeled training examples. Amassing a large list of transliterated foreign words is expensive and time-consuming. We address this issue by using phonological conversion rules to generate potentially unlimited amounts of pseudo training data at very low cost. Although the rules themselves are not highly accurate, a classifier trained on sufficient amounts of this automatically generated data performs as well as one trained on actual examples. The classifier used here is a sparse logistic regression model. The sparse logistic regression model has been shown to provide state of the art classification results on a range of natural language classification tasks such as author identification (Madigan, Genkin, Lewis, Argamon, Fradkin, and Ye, 2005a), verb classification (Li and Brew, 2008), and animacy classification (Baker and Brew, accepted). This model is described in the next section.

4.3.1 Bayesian Multinomial Logistic Regression

At a very basic level of description, learning is about observing relations that hold between two or more variables and using this knowledge to adapt future behavior under similar circumstances. Regression analysis models this type of learning in terms of the way that one variable Y varies as a function of a vector of variables

X. This function is represented in terms of the conditional distribution of Y given X and a set of weighted parameters β. Bayesian approaches to regression modeling involve setting up a distribution on the parameter vector β that encodes prior beliefs about the elements of β. The prior distribution should be strong enough to allow accurate estimation of the model parameters without overfitting the model to the training data (e.g., Genkin, Lewis, and Madigan, 2004; Gelman, Carlin, Stern, and Rubin, 2004: 354). The statistical inference task involves estimating the parameters β conditioned on X and Y (Gelman et al., 2004: 354). The simplest and most flexible regression model is the normal linear model (Hays, 1988; Gelman et al., 2004), which states that each value of Y is equal to a weighted sum of the corresponding values of the predictors in X:

(4.1a)   Y_i = \beta_0 + \sum_{p=1}^{P} \beta_p X_{ip}

In Equation (4.1a), i indexes over examples in the training set, and β_0 is the y-intercept or bias, which is analogous to the prior probability of class k in a naive Bayes model. This formulation assumes that the true relationship between Y and X falls on a straight line, and that the actual observations of these variables are normally distributed around it. Equation (4.1a) is often expressed in equivalent notation as

(4.1b)   Y_i = \sum_{p=0}^{P} \beta_p X_{ip}, \quad \text{where } X_{i0} \equiv 1

or in matrix notation as

(4.1c)   Y_i = \beta X_i, \quad \text{where } X_{i0} \equiv 1.

The regression function for model (4.1a) expresses the expected value of Y as a function of the weighted predictors X:

(4.2)   E\{Y_i\} = \beta_0 + \sum_{p=1}^{P} \beta_p X_{ip}

In simple linear regression the expected value of Y i ranges over the set of real numbers. However, in classification problems of the type considered here, the desired output ranges over a finite set of discrete categories. The solution to this problem involves treating Y i as a binary indicator variable where a value of 1 indicates membership in a class and a value of 0 indicates not belonging to that class.

When Y i is a binary random variable, the expected outcome E{Y i} has a special meaning. The probability distribution of a binary random variable is defined as follows:

Y_i    Probability
1      P(Y_i = 1) = π_i
0      P(Y_i = 0) = 1 − π_i

Applying the definition of expected value of a random variable (Kutner, Nachtsheim,

and Neter, 2004: 643, (A.12)) to Y_i yields the following:

E\{Y_i\} = \sum_{y \in Y} y\, P(y) \qquad \text{[definition of expectation]}

(4.3)   E\{Y_i\} = 1(\pi_i) + 0(1 − \pi_i) = \pi_i = P(Y_i = 1)

Equating (4.2) and (4.3) gives

(4.4)   E\{Y_i\} = \beta_0 + \sum_{p=1}^{P} \beta_p X_{ip} = \pi_i = P(Y_i = 1)

Thus, when Y_i is binary, the mean response E\{Y_i\} is the probability that Y_i = 1 given the parameterized vector X_i. Since E\{Y_i\} represents a probability, it is necessary that it be constrained as follows:

(4.5) 0 ≤ E{Y i} = π ≤ 1

This constraint rules out a linear regression function, because linear functions range

over the set of real numbers instead of being restricted to [0, 1]. Instead, one of a class of sigmoidal functions which are bounded between 0 and 1 and approach the bounds asymptotically is used (Kutner et al., 2004: 559). One such function having the desired characteristics is the logistic function or logit (Agresti, 1990; Christensen, 1997), defined as

(4.6)   \pi = \frac{e^{\eta}}{1 + e^{\eta}}

and having the shape shown in Figure 4.1.

[Figure: the curve π = e^η / (1 + e^η), with probability π (0–1) plotted against η (−6 to 6).]

Figure 4.1: Standard logistic sigmoid function

A regression model which assumes a bounded curvilinear relationship between X and Y is known as a generalized linear model (e.g., Ramsey and Schafer, 2002). A generalized linear model is a probability model that relates the mean of Y to X via a non-linear function applied to the regression equation. Generalized linear models are linear in the predictors and non-linear in the output. Logistic regression models are a type of generalized linear model. Multinomial logistic regression is an extension of the binary regression model described above to multiple classes. The basic method for handling more than two outcomes for Y is to compare only two things at a time, i.e., to model multiple binary comparisons (Christensen, 1997). In essence, this requires constructing a separate

logit model for each class and choosing the model which assigns the highest probability

to X_i. The multinomial logistic regression model has the form

(4.7)
\hat{\pi}_{1K} = \log \frac{P(Y = 1 \mid X = x)}{P(Y = K \mid X = x)} = \beta_{10} + \sum_{p=1}^{P} \beta_{1p} x_p

\hat{\pi}_{2K} = \log \frac{P(Y = 2 \mid X = x)}{P(Y = K \mid X = x)} = \beta_{20} + \sum_{p=1}^{P} \beta_{2p} x_p

\vdots

\hat{\pi}_{(K-1)K} = \log \frac{P(Y = K-1 \mid X = x)}{P(Y = K \mid X = x)} = \beta_{(K-1)0} + \sum_{p=1}^{P} \beta_{(K-1)p} x_p

P P (Y =1|X = x) πˆ = log K−1 = β + β x (K−1)K P (Y = K|X = x) (K−1)0 (K−1)p 2 p=1 X

The ratio inside the log function represents the odds of obtaining class k relative to class K. The choice of denominator is arbitrary in so far as the estimates \hat{\pi}_k are equivariant once the denominator is fixed (Hastie et al., 2001; Kutner et al., 2004). The classifier used in this dissertation is an implementation of the pooled response model (Christensen, 1997: 152) specified in Madigan, Genkin, Lewis, and Fradkin (2005b) and compares Y_k to the total of all other classes Y_{k' \neq k}, i.e., it models

(4.8)   \log \frac{P(Y_k)}{\sum_{k' \neq k} P(Y_{k'})}, \qquad k = 1, \ldots, K

which represents the odds of getting class k relative to not getting class k. The multinomial logistic regression model used in this dissertation is a conditional probability model of the form shown in (4.9) (Madigan et al., 2005b: 1, Equation

(1)).

(4.9)   P(y_k = 1 \mid x, B) = \frac{\exp(\beta_k^T x)}{\sum_{k' \neq k} \exp(\beta_{k'}^T x)}, \qquad k = 1, \ldots, K

The model is parameterized by the matrix B = [\beta_1, \ldots, \beta_K], where the columns of B are parameter vectors that correspond to one of the classes k: \beta_k = [\beta_{k1}, \ldots, \beta_{kP}]^T. That is,

B = \begin{bmatrix} \beta_{11} & \cdots & \beta_{k1} & \cdots & \beta_{K1} \\ \vdots & & \vdots & & \vdots \\ \beta_{1p} & \cdots & \beta_{kp} & \cdots & \beta_{Kp} \\ \vdots & & \vdots & & \vdots \\ \beta_{1P} & \cdots & \beta_{kP} & \cdots & \beta_{KP} \end{bmatrix}

Classification of a new instance is based on the vector of probability estimates produced by model (4.9) (or equivalently (4.7)). The class with the highest conditional probability estimate is chosen (Madigan et al., 2005b: 2):

\hat{y}(x) = \arg\max_{k} P(y_k = 1 \mid x)
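Given a fitted parameter matrix B, classification is a matter of computing the class scores of (4.9) and taking the argmax. A minimal NumPy sketch of the prediction step for a single instance (the toy weights and features are invented):

    import numpy as np

    def predict(B, x):
        """B: P-by-K parameter matrix (one column per class); x: feature vector.

        Returns (predicted class index, probability estimates), following (4.9).
        """
        scores = np.exp(B.T @ x)                    # exp(beta_k^T x) for each class k
        probs = scores / (scores.sum() - scores)    # pooled denominator excludes class k
        return int(np.argmax(probs)), probs

    # Toy example with 3 features and 2 classes.
    B = np.array([[0.5, -0.2],
                  [0.1,  0.3],
                  [-0.4, 0.0]])
    x = np.array([1.0, 2.0, 0.5])
    print(predict(B, x))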

Estimates for the values of B are obtained from the training set via the method of maximum likelihood (e.g., Kutner et al., 2004: 27-32). Maximum likelihood estimation involves choosing values of B that are most consistent with the sample data, i.e., the likelihood of B given a data set is maximized. The basic idea behind maximum likelihood estimates of B involves the fact that each observation Y_i is expressed in terms of the expected value of the parameter vector β_i applied to the observed values of X_i, i.e., E\{Y_i\} = \beta_i^T X_i (Equation 4.4).

In the normal regression model, each Y_i is assumed to be normally distributed with standard deviation σ. The likelihood of obtaining a particular value of β_i can be assessed with respect to the probability of seeing that value given a normal distribution with mean E\{Y_i\}. Maximum likelihood estimation uses the density of the probability distribution at Y_i as an estimate for the probability of seeing that observation. For example, Figure 4.2 shows the densities of the normal distribution for two possible parameterizations of β_i. If Y_i is in the tail (4.2b), it will be assigned

[Figure: two normal density curves, one centered at µ = β_i X_i and one at µ′ = β′_i X_i, with the observation Y_i marked under each.]

Figure 4.2: Normal probability distribution densities for two possible values of µ

a low probability of occurring. On the other hand, if it is closer to the center of the distribution (4.2a), it will be assigned a higher probability of occurrence. The method of maximum likelihood estimation for β_i involves choosing values of β_i that favor a value of Y_i that is near the center of its probability distribution. The parameters must be optimized over all of the observations in the training sample. Bayesian approaches to logistic regression involve specifying a distribution on B that reflects prior beliefs about likely values of the parameters. In the typical classification setting involving large data sets in a high dimensional feature space, a reasonable prior distribution for B is one that assigns a high probability that most

entries of B will have values of 0 (Krishnapuram, Carin, Figueiredo, and Hartemink, 2005; Madigan et al., 2005b). In other words, it is reasonable to expect that many of the features are redundant or noisy, and only a small subset are most important for classification. The goal of such so-called sparse classification algorithms is to learn a model that achieves optimal performance with as few of the original features as possible. A common choice of prior is the Laplacian (Figure 4.3), which favors values of B of 0 (Krishnapuram et al., 2005; Madigan et al., 2005b). The basic idea behind specifying a Laplacian prior on B is illustrated in Figure 4.3, which compares the Laplacian distribution to the normal distribution with the same mean and variance. Compared to the normal distribution, the Laplacian is more peaked at the mean,

[Figure: density curves of the normal (dashed line) and Laplacian distributions over the interval −4 to 4.]

Figure 4.3: Density of the normal (dashed line) and Laplacian distributions with the same mean and variance

while the normal distribution is relatively flat and wide around the mean. When the distribution is relatively flat in the region around the maximum likelihood estimate, the maximum likelihood estimate is not as precise because a large number of values of

βi are nearly as consistent with the training data as the maximum likelihood estimate itself (Kutner et al., 2004: 29-30). Because the Laplacian is sharply peaked about the

mean, estimates of βi that are slightly away from the mean will receive drastically lower probabilities than an estimate of 0. Only those features which receive strong support in the training data will receive a non-zero estimate. Thus, when applied to the original features, automatic feature selection is obtained as a side-effect of training

the model (Krishnapuram et al., 2005: 958). The Laplacian prior embodies a bias which allows for efficient model fitting in situations where the number of predictor variables is large and exceeds the number of observations. Because of this property it is expected to be suitable for large scale natural language classification tasks.
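The effect of a Laplacian prior corresponds to L1 (lasso) regularization of the regression weights. The dissertation's experiments use the implementation of Madigan et al.; purely as an illustration of the idea, an analogous sparse model can be fit with scikit-learn's L1-penalized logistic regression (the data and hyperparameters below are arbitrary).

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Toy data: 6 instances, 5 features, 2 classes (values are invented).
    X = np.array([[1, 0, 2, 0, 0],
                  [0, 1, 0, 0, 3],
                  [2, 0, 1, 0, 0],
                  [0, 2, 0, 0, 1],
                  [1, 0, 3, 0, 0],
                  [0, 1, 0, 0, 2]])
    y = np.array([0, 1, 0, 1, 0, 1])

    # The L1 penalty plays the role of the Laplacian prior: most weights are
    # driven exactly to 0, giving automatic feature selection as a side-effect.
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
    clf.fit(X, y)
    print(clf.coef_)            # sparse weight vector
    print(clf.predict(X[:2]))   # class predictions for the first two instances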

There are a number of algorithms in use for fitting regression models. Unlike for ordinary regression, a closed-form analytic solution for training a multinomial regression model does not exist (Kutner et al., 2004; Mitchell, 2006). Instead, iterative methods for finding approximations to the roots of a real-valued

function are used (e.g., iteratively reweighted least squares (Krishnapuram et al., 2005)). These methods produce a converging sequence of approximations to the actual root that can be used as approximations of the actual values of B. Detailed discussion of the algorithmic details and computational techniques involved in training a logistic regression classifier is provided in Hastie et al. (2001), Gelman et al. (2004),

Krishnapuram et al. (2005), and Madigan et al. (2005b).

4.3.2 Naive Bayes

For comparison purposes, we also use a naive Bayes classifier in the first experiment below. The motivation for including the naive Bayes classifier is its simplicity and the fact that it is often competitive with more sophisticated models on a wide range of classification tasks (Mitchell, 2006). The naive Bayes classifier is a conditional probability model of the form

P(C \mid F_1, \ldots, F_n)

where C stands for the class we are trying to predict and F_1, ..., F_n represent the features used for prediction. The class-conditional probabilities can be estimated using maximum likelihood estimates that are approximated with relative frequencies from the training data. Therefore, the conditional distribution over the class variable C can be written

P(C \mid F_1, \ldots, F_n) \approx P(C) \prod_{i=1}^{n} P(F_i \mid C)

This rewrite is possible only under the assumption that the features are independent. When used for classification, we are interested in obtaining the most likely class given a particular set of values of the input features, i.e.,

\text{classify}(f_1, \ldots, f_n) = \arg\max_{c} P(C = c) \prod_{i=1}^{n} P(F_i = f_i \mid C = c)

In the experiments reported here we use a balanced data set (i.e., the same number of English and Korean words) and therefore do not include the prior probability of a word being English or Korean in the model.
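Once trigram counts have been collected per class, the classification rule above takes only a few lines. In the sketch below the Laplace smoothing constant is an added assumption (the text does not specify a smoothing scheme), and the class prior is omitted to match the balanced-data setting; all data are invented.

    import math
    from collections import Counter

    def train(words_by_class):
        """words_by_class: dict class -> list of trigram lists; returns count tables."""
        return {c: Counter(t for w in ws for t in w) for c, ws in words_by_class.items()}

    def classify(trigrams, counts, alpha=1.0):
        """Pick the class maximizing the product of smoothed trigram probabilities."""
        vocab = {t for c in counts for t in counts[c]}
        best, best_score = None, float("-inf")
        for c, table in counts.items():
            total = sum(table.values()) + alpha * len(vocab)
            score = sum(math.log((table[t] + alpha) / total) for t in trigrams)
            if score > best_score:
                best, best_score = c, score
        return best

    counts = train({"eng": [["#co", "com", "om#"]], "kor": [["#ha", "han", "an#"]]})
    print(classify(["#co", "com", "om#"], counts))   # 'eng'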

4.4 Experiments on Identifying English Loanwords in Korean

4.4.1 Experiment One

4.4.1.1 Purpose

The purpose of this experiment is to establish classification accuracy for identify- ing English loanwords in Korean using hand labeled data in a supervised learning scenario. The accuracy obtained with hand labeled data will serve as a target for subsequent experiments which utilize automatically generated training data.

4.4.1.2 Experimental Setup

The data in this experiment consisted of the list of 10,000 English loanwords described in Chapter 2, Section 2.1 and 10,000 Korean words selected at random from the National Institute of the Korean Language's frequency list of Korean words (NIKL, 2002). No distinction between native Korean and Sino-Korean words was maintained. Standard Korean character encodings represent syllables rather than individual letters, so we converted the original hangul orthography to a character-based representation, retaining orthographic syllable breaks. Words are represented as sparse vectors, with each non-zero entry in the vector corresponding to the count of a particular character trigram that was found in the word. The count of a given trigram in a single word was rarely more than one. For example, the English loanword user is produced in Korean as yuce and is represented as

(∅∅y:1, ∅yu:1, yu-:1, u-c:1, -ce:1, ce∅:1, e∅∅:1)

where ∅ is a special string termination symbol and ‘-’ indicates an orthographic syllable boundary. The decision to use trigrams instead of syllables as in Oh and Choi (2001) and Kang and Choi (2002) was based on the intuition that segment level transitions provide important cues to etymological class that are lost by only considering syllable transitions. Unigrams or bigrams are not as likely to be sufficiently informative, while going to 4-grams or higher results in severe problems with data sparsity. This feature representation resulted in 2276 total features; English words contained 1431 unique trigrams and Korean words contained 1939 unique trigrams.
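The trigram representation of a romanized form such as yuce can be produced directly from the string once padding symbols are attached. A short Python sketch, using '0' in place of the ∅ termination symbol:

    from collections import Counter

    PAD = "0"   # stands in for the termination symbol written as ∅ in the text

    def trigram_features(word):
        """Character-trigram counts for a romanized word; '-' marks syllable breaks."""
        padded = PAD * 2 + word + PAD * 2
        return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

    print(trigram_features("yu-ce"))
    # seven trigrams, each with count 1: 00y, 0yu, yu-, u-c, -ce, ce0, e00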

This experiment used a 10-fold, 90/10 train/test split. We report identification accuracy, which is computed as the number of correctly classified words in the test set divided by the total number of words in the test set, averaged over ten trials. Baseline accuracy for all experiments is 50%.

4.4.1.3 Results

Mean classification accuracy using labeled data was 91.1% for the Bayes classifier and 96.2% for the regression classifier. This is expected, in accordance with the observation that discriminative models typically perform better than generative ones

(Ng and Jordan, 2002). Taking these results as a reasonable baseline for what can be expected using hand-labeled data, the next experiment looks at using phonological rules to automatically generate English training data.

4.4.2 Experiment Two

4.4.2.1 Purpose

The purpose of this experiment is to use phonological transliteration rules to generate a set of possible but unattested English loanwords in Korean and train a classifier to automatically distinguish actual English loanwords from actual Korean words.

4.4.2.2 Experimental Setup

This experiment applied the phonological rule based transliteration model presented in Chapter 3, Section 3.3.1 to the pronunciations of English words in the CMU Pronouncing Dictionary (Weide, 1998) to create a set of possible but unattested English loanwords in Korean. These items served as training data for the distinction between actual English loanwords and Korean words. The number of pseudo-English training instances ranged from 10,000 to 100,000. The test items were all 20,000 items from the experiment above. The training data did not include any of the test items. This means that if the phonological conversion rules produced a form that was homographic with any of the actual English loanwords, this item was removed from the training set. Note that this is conservative: in practical situations we would expect that the conversion rules would sometimes manage to duplicate actual loanwords, with the possibility of improved performance. We had a total of 62,688 labeled actual Korean words (Sino-Korean plus native Korean). In order to keep the same number of items in the English and Korean classes, i.e., in order to avoid introducing a bias in the training data that was not reflected in the test data, we sampled the Korean words with replacement.

4.4.2.3 Results

Figure 4.4 shows the classification accuracy of the regression classifier as a function

of the amount of training data. Classifier accuracy appears to asymptote at around 90,000 instances of each class, coming within 0.3% (95.8% correct) of the classifier trained on actual English loanwords.

[Figure: classifier accuracy (80–100%) plotted against training instances in thousands (10–100).]

Figure 4.4: Classifier accuracy trained on pseudo-English loanwords and classifying actual English loanwords

While this experiment demonstrates the feasibility of approximating a set of English loanwords with phonological conversion rules, it still relies on a manually con- structed dictionary of Korean words. The next experiment investigates the feasibility

of approximating a label for the Korean words as well.

4.4.3 Experiment Three

4.4.3.1 Purpose

The purpose of this experiment is to examine the performance of the loanword iden- tifier on distinguishing actual English loanwords from actual Korean words when it is trained on pseudo-English loanwords and unlabeled items that serve as examples of Korean words.

4.4.3.2 Experimental Setup

Based on observations of English loanwords in Japanese (Graff and Wu, 1995) and

Chinese (Graff, 2007) newswires, we believe that the majority of these items will occur relatively infrequently in comparable Korean text. This means that we are assuming that there is a direct relationship between word frequency and the likelihood of a word being Korean, i.e., the majority of English loanwords will occur very infrequently.

Accordingly, we sorted the items in the Korean Newswire corpus (Cole and Walker, 2000) by frequency on the assumption that Korean words will tend to dominate the higher frequency items, and examined the effects of using these as a proxy for known Korean words. We identified 23,406,254 Korean orthographic units (i.e., eojeol) in the Korean Newswire corpus (Cole and Walker, 2000). Because we believe that high frequency items are more likely to be Korean words, we sampled instances from the corpus without replacement. This means that the frequencies of items in our extracted subset approximately match those in the actual corpus, i.e., we have repeated items in the training data. Thus, the classifier for this experiment was trained on automatically generated pseudo-English loanwords as the

English data and unlabeled lexical units from the Korean Newswire as the Korean data. Again, the test items were all 20,000 items from Experiment 1. The training data did not include any of the test items.
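One way to realize the sampling scheme described above is to draw training tokens without replacement directly from the corpus token stream, so that high-frequency (presumably Korean) items dominate the sample and repeated items occur naturally. The sketch below reflects that reading of the procedure; the token stream is invented.

    import random

    def pseudo_korean_sample(tokens, n, seed=0):
        """Draw n tokens without replacement from the corpus token stream."""
        random.seed(seed)
        return random.sample(tokens, n)

    # Invented token stream; the real input would be the eojeol of the
    # Korean Newswire corpus.
    tokens = ["hako"] * 5 + ["kyengcey"] * 3 + ["khullinthon"] * 1
    print(pseudo_korean_sample(tokens, 4))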

4.4.3.3 Results

Figure 4.5 shows the classification accuracy of the regression classifier as a function of the amount of training data. Classifier accuracy again asymptoted at around 90,000 items per training class, at 3.7% below (92.4%) the accuracy of the classifier trained on actual English loanwords.

[Figure: classifier accuracy (80–100%) plotted against training instances in thousands (10–100).]

Figure 4.5: Classifier accuracy trained on pseudo-English loanwords and pseudo-Korean items

The assumption that frequent items in the Korean Newswire corpus are all Korean is false. For example, of the 100 most frequent items we extracted, 5 were English loanwords. These words and their rank are shown in Table 4.1. However, we

believe that the performance of the classifier in this situation is encouraging, and that using a different genre for the source of the unlabeled Korean words might provide

Word           Rank   Frequency
Yeonhab News   30     51792
percent        32     49367
New York       89     19652
Russia         91     19162
Clinton        94     18860

Table 4.1: Frequent English loanwords in the Korean Newswire corpus

slightly better results. This is because of the nature of a news corpus: it reports on international events, so foreign words are relatively frequent compared to other genres, such as period fiction.

4.5 Conclusion

The experiments presented here addressed the issue of obtaining sufficient labeled data for the task of automatically classifying words by their etymological source. We demonstrated an effective way of using linguistic rules to generate unrestricted amounts of virtually no-cost training data that can be used to train a statistical classifier to reliably discriminate actual English loanwords from actual Korean words. Because the rules describing how words change when they are borrowed from one language to another are relatively few and easy to implement, the methodology outlined here can be widely applied to additional languages for which obtaining labeled training data is difficult. For example, Khaltar, Fujii, and Ishikawa (2006) describes an approach to identifying Japanese loanwords in Mongolian that is also based on a small number of phonological conversion rules, and Mettler (1993) uses a set of katakana rewrite rules to find English loanwords in Japanese. The current approach is novel in that the identification of loanwords is not limited to those items explicitly generated by the conversion rules, but generalizes beyond a specific set of input items to identify loanwords that are not contained in the training material. As a point of comparison on the current data set, we can take the performance of the rule-based transliteration models described in Chapter 3 as indicative of a direct rule-based approach to identifying English loanwords on this data set. The phonological rule-based model correctly transliterates (i.e., identifies) about 49% of the loanwords in the data set, and the ortho-phonemic rule-based model finds 78%. The identification model trained on the output of the phonological rule-based model and approximated Korean labels performs about 15% higher than the ortho-phonemic model would, and the model trained on pseudo-English and actual Korean words performs about 18% higher.

CHAPTER 5

DISTRIBUTIONAL VERB SIMILARITY

5.1 Overview

The idea of lexical similarity provides the basis for the description of a wide range of linguistic phenomena. For example, morphological overgeneralizations resulting in new forms such as dived for dove proceed by analogy to existing irregular inflec- tional paradigms (Prasada and Pinker, 1993). Priming studies show that people are quicker to respond to a target word after very brief exposure to a phonologically or semantically related stimulus (e.g., O’Seaghdha and Marin, 1997). The concept of syntactic category can be approached in terms of classes of words that appear in similar structural configurations (e.g., Radford, 1997), and lexical semantic relations like synonymy and hyponymy are often understood in terms of words that can be substituted for one another without changing the truth conditions of a sentence (e.g., Cruse, 1986).

One particular strand of research has focused attention more narrowly on un- derstanding and describing patterns of lexical similarity among verbs. Of specific interest here is research that looks at ways to automatically assess lexical similarity of verbs in terms of their contextual distribution in large text corpora. This research is motivated by the idea, expressed as early as Harris (1954), that words that occur in

similar contexts tend to have similar meanings. The majority of research on distributional verb similarity, exemplified by work such as Lapata and Brew (1999); Schulte im Walde (2000); Merlo and Stevenson (2001); Joanis (2002); Lapata and Brew (2004); Li and Brew (2007) and Li and Brew (2008), has utilized Levin's (1993) organization of English verbs into syntactically and semantically homogeneous classes. Levin's classification is based on the hypothesis that verbs which exhibit similar alternations in the realization of their argument structure also share components of meaning and form semantically coherent classes (Levin, 1993). While a number of studies have examined the induction of Levin's verb classes from text data using clustering techniques (e.g., Schulte im Walde and Brew, 2002; Brew and Schulte im Walde, 2002; Schulte im Walde, 2003), the majority of the studies on assessing distributional verb similarity have dealt with the application of supervised learning techniques to corpus data for the purpose of automatic verb classification. The current study also deals with the assessment of distributional verb similarity, but focuses on the task of characterizing the nature of the distributional structure that underlies the performance of automatic verb classification techniques rather than the specific task of training a classifier to distinguish explicit verb classes. The goal of this approach is to quantify interactions between the components that determine empirical distributional verb similarity and predictions made about verb similarity by a variety of lexical semantic verb classification schemes. The remainder of this chapter provides background for understanding the assessments of distributional verb similarity carried out in Chapter 6 and is organized as follows. Section 5.2 describes previous studies on automatic verb classification which provide a springboard for the current research. Section 5.3 frames the current approach and sets out its specific purposes and goals. Section 5.4 describes in general terms the elements that go into determining distributional verb similarity.

5.2 Previous Work

A substantial body of research deals with the task of automatically classifying verbs according to their membership in lexical semantic classes on the basis of features extracted from their distribution in a large text corpus (e.g., Lapata and Brew, 1999; Stevenson and Merlo, 1999; Schulte im Walde, 2000; Joanis, 2002; Schulte im Walde and Brew, 2002; Tsang, Stevenson, and Merlo, 2002; Joanis and Stevenson, 2003;

Lapata and Brew, 2004; Li and Brew, 2008). This section provides an overview of several studies which serve as a springboard for the current research. In particular, the types of features which have been used by these studies for extracting Levin’s classification of English verbs from empirical data are relevant to the current study1.

5.2.1 Schulte im Walde (2000)

Schulte im Walde (2000) explores the hypothesis that verbs can be clustered seman- tically on the basis of their syntactic alternations. Schulte im Walde applies two unsupervised hierarchical clustering algorithms to 153 English verbs selected from 30 Levin classes. 103 of these verbs belong to a single Levin class, 35 of these verbs be- long to exactly two classes, 9 belong to exactly three classes, and 6 belong to exactly four classes. Each verb is represented by a distribution over subcategorization frames extracted from the British National Corpus (Clear, 1993) using a statistical parser (Carroll and Rooth, 1998). Schulte im Walde investigates the features relevant to au- tomatic verb clustering by evaluating three different components of subcategorization frames:

• syntactic frames, which are relevant to capturing argument alternations (e.g., NP-V-PP)

• prepositions, which are able to distinguish, e.g., directions from locations (e.g., NP-V-PP(into), NP-V-PP(on))

• selectional preferences, which encode participant roles (e.g., NP(PERSON)-V-PPon(LOCATION))

1Portions of Section 5.2 were co-authored with Jianguo Li.

Using Levin’s verb classification as a basis for evaluation, 61% of the verbs are correctly classified into semantic classes. The best clustering result is achieved when using subcategorization frames enriched with PP information. Adding selectional preferences actually decreases the clustering performance, a finding which is attributed to data sparsity that results from the specificity of the features produced when selectional preferences are incorporated.

5.2.2 Merlo and Stevenson (2001)

Merlo and Stevenson (2001) describes an automatic classification of three types of En- glish intransitive verbs including unergatives, unaccusatives, and object-drop. They select 60 verbs with 20 verbs from each verb class. However, verbs in these three se- lected classes show similarities with respect to their argument structure in that they can all be used as transitives and intransitives. Therefore, syntactic cues alone can- not effectively distinguish the classes. Merlo and Stevenson define five linguistically- motivated verb features that describe the thematic relations between subject and object in transitive and intransitive usage. These features are collected from an au- tomatically tagged corpus (primarily the Wall Street Journal corpus (LDC, 1995)).

Each verb is represented as a five-feature vector on which a decision tree classifier is trained. Merlo and Stevenson (2001) reports 69.8% accuracy for a task with a baseline of 33.3%, and an expert-based upper bound of 86.5%.

The approach described in Merlo and Stevenson (2001) requires deep linguistic expertise to identify the five verb features, which are crucial for the success of the classification experiments. The need for such linguistic expertise limits the applicability of the method because these features are designed specifically for the particular class distinctions investigated, and are unlikely to be effective when applied to other classes. Later work has proposed an analysis of possible class distinctions exhibited by Levin verbs that generalizes Merlo and Stevenson's features to a larger space of features that potentially cover any verb classes (Joanis, 2002; Joanis and Stevenson, 2003; Joanis, Stevenson, and James, 2006). These more general features fall into four groups:

• syntactic slots,

• slot overlaps,

• tense, voice and aspect, and

• animacy of NPs.

These features are extracted from the BNC using the chunker described in Abney (1991). This more general feature space is potentially applicable to any class distinction among Levin classes and is relatively inexpensive in that it requires only a POS tagger and chunker. Joanis et al. (2006) presents experiments on classification tasks involving 15 verb classes and 835 verbs using a support vector machine with the proposed feature space. These experiments achieve a rate of error reduction ranging from 48% to 88% over a chance baseline, across classification tasks of varying difficulty. In particular, these experiments yield classification accuracy comparable to or even better than that of the feature sets manually selected for each particular task.

5.2.3 Korhonen et al. (2003)

Korhonen et al. (2003) presents an investigation of English verb classification that concentrates on polysemic verbs. Korhonen et al. employs an extended version of Levin's verb classification that incorporates 26 classes introduced by Dorr (1997), and 57 additional classes described in Korhonen and Briscoe (2004). 110 test verbs are chosen, most of which belong to more than one verb class. After obtaining subcategorization frame frequency information from the British National Corpus (Clear, 1993) using the parser described in Briscoe and Carroll (1997), two clustering methods are applied: 1) a naive method that collects the nearest neighbor of each verb, and 2) an iterative method based on the information bottleneck method (Tishby, Pereira, and Bialek, 1999). Neither of these clustering methods allows the assignment of a single verb to multiple verb classes. In analyzing the impact of polysemy on cluster assignments, Korhonen et al. (2003) makes a distinction between regular and irregular polysemy. A verb is said to display regular polysemy if it shares its full set of Levin class memberships with at least one other verb. A verb is said to display irregular polysemy if it does not share its full set of Levin class memberships with any other verb. Korhonen et al. finds that polysemic verbs with one predominant sense and those with similar regular polysemy are often assigned to the same clusters, while verbs with irregular polysemy tend to resist grouping and are likely to be assigned to singleton clusters.

5.2.4 Li and Brew (2008)

Li and Brew (2008) evaluates a wide range of feature types for performing Levin-style verb classification using a sparse logistic regression classifier (Genkin et al., 2004) on a substantially larger set of Levin verbs and classes than previously considered. In addition to a replication of the feature set used in Joanis et al. (2006), this study examined a number of additional feature sets that focus attention on ways to combine syntactic and lexical information. These additional feature sets include dependency relations, contextual co-occurrences, part-of-speech tagged contextual co-occurrences, and subcategorization frames plus co-occurrence features. All features are extracted automatically from the English Gigaword corpus (Graff, 2003) using Clark and Curran's (2007) CCG parser. Results are reported over 48 Levin verb classes involving around 1300 single-class verbs. For the 48-way classification task, Li and Brew (2008) reports a best classification accuracy of 52.8% obtained with features derived from a combination of subcategorization frames plus co-occurrence features.

5.3 Current Approach

The current research also deals with automatic verb classification, but represents a departure from the work described above in two fundamental ways. The first way in which the current study differs from previous work relates to the scope and nature of the class structure assumed to be at work in organizing the structure of the verb lexicon. The second difference has to do with the type of evaluation that is applied to assessing distributional verb similarity. These differences are discussed below.

5.3.1 Scope and Nature of Verb Classifications

Previous research has tended to focus on assigning verbs to classes based directly on or derived from Levin (1993), investigating either a small number of verbs (e.g., Schulte im Walde, 2000) or a small number of classes (e.g., Joanis et al., 2006), but see Li and Brew (2008) for consideration of a larger set of Levin verbs. Rather than restricting the assessment of distributional verb similarity to a single linguistic

classification, the current study analyzes distributional verb similarity with respect to multiple verb schemes. Three of these – Levin (1993), VerbNet (Kipper, Dang, and Palmer, 2000), and FrameNet (Johnson and Fillmore, 2000) – explicitly incorporate some form of frame-based syntactic information into their organization of verbs, while

the other two – WordNet (Fellbaum, 1998) and an online version of Roget’s Thesaurus (Roget’s Thesaurus, 2008) – are organized primarily around the semantic relations of synonymy and antonymy. Furthermore, the first three classification schemes place verbs into a comparatively small number of classes on the basis of shared semantic components such as MOTION or COGNITION. This type of class structure has no

direct analogy in Roget or WordNet, which essentially create separate classes for each lexical entry (e.g., run heads its own class of synonyms, draw heads its own class, etc.).

5.3.2 Nature of the Evaluation of Verb Classifications

The differences in organizational structure discussed above and the goal of obtaining a consistent comparison across classification schemes lead to the second way in which the current research departs from previous work. Rather than training a classifier to assign verbs to a predefined set of classes or clustering verbs with respect to a particular taxonomy, the current study approaches the evaluation of distributional verb similarity from the standpoint of obtaining a pairwise matrix of similarities between all of the verbs under consideration. This conceptualization of the problem of measuring distributional lexical similarity is found as well in work on automatic thesaurus construction (e.g., Lin, 1998a; Curran and Moens, 2002; Weeds, 2003), which typically deals with techniques for grouping nouns into sets of lexically related items.

5.3.3 Relation to Other Lexical Acquisition Tasks

In many ways the task of extracting distributionally similar words from a corpus is analogous to the classic information retrieval task of retrieving documents from a collection in response to a query. For example, the vector-based representation of a target word can be considered a query and the vector-based representations of other words in the corpus can be treated as documents that are ranked by order of decreasing similarity to the target. This conceptualization of the task of assessing distributionally similar lexical items lends itself to evaluation techniques commonly used in information retrieval such as precision, recall, and F1. However, one difference between the evaluation of distributionally similar lexical items and information retrieval is that the former is primarily concerned with the quality of the first few highly ranked words (precision) rather than extracting all items that belong to the same class as the target word (recall). Representative work on automatic thesaurus extraction (e.g., Lin, 1998a; Curran and Moens, 2002; Weeds, 2003; Gorman and Curran, 2006) adopts measures that reflect the importance of correctly classifying the top few items, such as inverse rank (Curran and Moens, 2002; Gorman and Curran, 2006) or precision-at-k (Lin, 1998a; Curran and Moens, 2002; Manning, Raghavan, and Schütze, 2008: 148), without emphasizing recall of the entire set of class members. Other more general applications of distributional similarity that emphasize the quality of a small subset of highly ranked class members over exhaustive identification of all class members include dimensionality reduction techniques that preserve local structure (e.g., Roweis and Saul, 2000; Saul and Roweis, 2003) and semisupervised learning techniques that rely on the identification of a lower dimensional manifold in a high-dimensional ambient space (e.g.,

Belkin and Niyogi, 2003, 2004).

5.3.4 Advantages of the Current Approach

The points of departure from previous research outlined above present the following opportunities for increasing our understanding of the application of automatic classification techniques to distributional verb similarity.

• Taking the union of verbs that occur in multiple classification schemes greatly increases the number of verbs that can be included for evaluation. Many previous verb classification studies have examined a relatively small number of verbs belonging to relatively few classes (e.g., Schulte im Walde, 2000; Joanis, 2002; Joanis et al., 2006). One problem with such restrictions is not knowing how well those findings extend to other verbs in the classification that were excluded from the study.

• Considering a broader range of verb classification criteria allows us to tease apart some of the elements that are responsible for the organization of the various verb schemes.

Levin and VerbNet verbs are grouped according to similarity in both syntactic behavior and meaning, but it is not always clear which of these criteria are actually responsible for placing an individual verb in a certain class (Baker and Ruppenhofer, 2002). For example, the BUTTER verbs (Levin, 1993: 120) comprise a semantically diverse class, including items such as asphalt, buttonhole, lipstick, mulch, poison, sulphur, and zipcode on the basis of their behavior with respect to the locative and conative alternations (*Lora stained tea on the shirt/Lora stained the shirt with tea; Lora stained the shirt/*Lora stained at the shirt).

For other classes of verbs Levin explicitly indicates that the syntactic alternation forming the basis of their grouping only applies to some members of the class. For example, for the DRIVE verbs, comprised of barge, bus, cart, drive, ferry, fly, row, shuttle, truck, wheel, and wire (money), Levin indicates that the dative alternation only applies to “some verbs” (Levin, 1993: 136) but does not specify which ones. The current study, which separately considers synonymy as a potential selection criterion for verb classification, will shed light on the extent to which synonymy versus syntactic similarity influences automatic verb classification.

5.4 Components of Distributional Verb Similarity

A number of interdependent factors go into the process of determining distributional lexical similarity. The factors considered in this dissertation and discussed in the following sections are

• how to determine the features over which lexical similarity is measured,

• how to determine an appropriate numeric representation of those features,

• how to determine an appropriate measure of distributional similarity,

• and how to evaluate the results of ranking items according to their empirically determined similarity.

Implicit in the concept of lexical similarity is a method for comparing words along some set of characteristics that are relevant to the particular distinction being made (e.g., phonological, semantic, etc.). A procedure for doing this can be operationalized in terms of a vector X over which observations of a word are made. Under this scenario, a word can be conceptualized as a distribution of values over X, and similarity between two words y1 and y2 can be computed by application of a similarity metric to their respective distributional vectors X1 and X2. When the task involves determining lexical semantic similarity, the relevant characteristics are generally taken to be those words which co-occur with a target item. The following section describes various approaches to defining context in terms of word co-occurrences.

5.4.1 Representation of Lexical Context

In computational approaches to determining word similarity, distributional similarity is typically defined in terms of a word’s context, and words are said to be distributionally similar to the extent that they occur in similar contexts. Applying this definition to corpus data often yields word classes that overlap the classifications assigned by traditional lexical semantic relations such as synonymy and hyponymy. For example, Lin (1998a) describes a method for automatically extracting synonyms from corpus data that yields word classes such as {brief, affidavit, petition, memorandum, deposition, slight, prospectus, document, paper, ... } (p. 770). Just as often, applying this definition yields sets of words whose relation is best described in terms of topical associations. For example, Kaji and Morimoto (2005) describes a procedure for automatic word sense disambiguation using bilingual corpus data that groups words into lexical neighborhoods such as {air, area, army, assault, battle, bomb, carry, civilian, commander, ... } (p. 290 (a)). In this example, the common thread among these words is that they all co-occurred with the words tank and troop. Broadly speaking, the context representations which give rise to a distinction between topically associated and semantically similar words fall into two categories: models that use a bag-of-words representation, and those that model grammatical relations.

5.4.2 Bag-of-Words Context Models

Bag-of-words models take the context of a word to be some number of words preceding and following the target word. The order of items within this context is often not considered. The context of a target word can be delimited by index (i.e., n words before or after the target) or structurally (i.e., the paragraph or document the target occurs in). For example, Schütze (1998) describes a method for automatically discriminating word senses that uses a 50-word context window centered on the target word, and Gale, Church, and Yarowsky (1992) use a 100-word window. In the Hyperspace Analogue to Language approach to word similarity (Lund, Burgess, and Audet, 1996), context is defined as a window of 10 words preceding and following a target word. Word order is weakly accounted for in this model by assigning a weight to context words which is inversely proportional to their distance from the target. In the approach to lexical similarity of Deerwester, Dumais, Landauer, Furnas, and Harshman (1990), context is defined as an entire document, and word co-occurrences are calculated in a document space. One of the main advantages of the bag-of-words context model is its relative simplicity when applied to languages that naturally delimit words with whitespace – in its most basic form, no corpus pre-processing is required to determine a context window.

However, a series of pre-processing steps are typically applied before the context model is constructed. These steps involve procedures such as word tokenization, text normalization, stopword removal, and lemmatization that are designed to obtain a more consistent representation of items in the corpus.

5.4.3 Grammatical Relations Context Models

A second approach to modeling distributional context is to define it in terms of grammatical dependency relations. Lin (1998a) defines context on the basis of dependency relations between words in a sentence. For example, in the sentence “I have a brown dog”, the context of dog can be represented as the set of dependency triples {(dog Obj-of have), (dog Adj-mod brown), (dog Det a)} (Lin, 1998a: 769, (2)). Other approaches to automatic lexical acquisition that consider the role that syntactic relationships play in determining distributional word similarity include work on noun clustering (e.g., Pereira, Tishby, and Lee, 1993; Caraballo, 1999; Weeds, 2003) and verb classification (e.g., Lapata and Brew, 1999; Stevenson and Merlo, 1999; Schulte im Walde, 2000; Li and Brew, 2008). Applying a grammatical relations context model requires a deeper level of linguistic analysis than flat context models, and this analysis invariably entails most of the pre-processing steps involved in the bag-of-words model before grammatical relations can be extracted. Tools for extracting grammatical relations range from relatively straightforward Bayesian models (e.g., Curran and Moens, 2002) to full-blown parsers like MiniPar (Lin, 1998b) and the C&C CCG Parser (Clark and Curran, 2007).

The extra linguistic processing has been justified through direct comparison to bag-of-words models (e.g., Padó and Lapata, 2007) and in terms of a distinction between “loose” and “tight” thesauri (Weeds, 2003: 19) that are automatically derived from bag-of-words and grammatical relations context models, respectively. Previous work has claimed to show that using grammatical relation data yields sets of words which are semantically related (e.g., Kilgarriff and Yallop, 2000; Weeds, 2003), whereas flat context models generate word sets that are topically related. However, it is not always easy to disentangle semantic similarity from topical similarity, and all lexical context models group items according to a variety of associational relations. Following Weeds (2003), we take the position that grammatical relations are a sufficient information resource to allow for meaningful study of the process of calculating and assessing distributional lexical similarity. Accordingly, this dissertation focuses on using grammatical relations to define context models.

5.5 Evaluation

This section describes some common approaches to evaluating a set of empirically determined lexical similarity scores. The three approaches considered are application-based evaluation, comparison to human judgments, and comparison to an accepted standard. This dissertation primarily deals with evaluation by comparison to an accepted standard.

5.5.1 Application-Based Evaluation

Application-based evaluation tasks evaluate distributional lexical similarity scores by judging their usefulness in some other NLP application. For example, in information retrieval, the primary objective is to retrieve documents that are related to a user

query. Query expansion (Xu and Croft, 1996) is a technique for augmenting a given query with similar terms so that a greater number of relevant documents can be found. In this case, the performance of the information retrieval system could be evaluated with and without using query expansions based on distributionally similar words to assay the utility of the lexical similarity matrix.

Many additional applications of distributional lexical similarity scores are discussed in Weeds (2003: Chapter 2, Section 2). Some of these include language modeling, where the probability estimate of a previously unseen co-occurrence can be generated from the co-occurrence probability of distributionally similar items; prepositional phrase attachment, where the likelihood of a particular syntactic configuration can be estimated from smoothed estimates obtained from clusters of distributionally similar items; and spelling correction, where the detection of real word spelling errors (e.g., principle for principal) can be enhanced when distributionally-defined semantic plausibility is considered.

5.5.2 Evaluation Against Human Judgments

Evaluations against human judgments look at the correlation between distributional similarity scores and human similarity judgments for the same set of items. For example, McDonald and Brew (2004) presents a computational model of contextual priming effects that is based on a probabilistic distribution of word co-occurrences in the British National Corpus (Clear, 1993). These data are compared to lexical decision response times for a set of 96 prime-target pairs taken from Hodgson (1991) that represent a range of lexical relations including synonyms, antonyms, phrasal and conceptual associates, and hyper/hyponyms (McDonald and Brew, 2004: 21). Padó and Lapata (2007) presents a general framework for constructing distributional lexical models that define context on the basis of grammatical relations.

Model selection is based on correlations between empirical similarities and a set of human similarity judgments from Rubenstein and Goodenough (1965). These data consist of ordinal similarity ratings for a set of 65 noun-noun pairs that ranged from highly synonymous to unrelated (Padó and Lapata, 2007: 177). Additional research making use of the same data set is referenced in Padó and Lapata (2007: 177). In general, evaluations against human judgments involve comparisons between small data sets, chiefly due to the time and cost involved in gathering the requisite judgments from human subjects. Furthermore, the stimuli used in human subjects experiments may not be well-attested in the corpus being used for evaluating automatic word similarity measures (e.g., McDonald and Brew (2004) discarded 48 potential pairs due to low frequency), leading to a potential confound between low frequency items and the performance of the automatic technique.

5.5.3 Evaluation Against an Accepted Standard

The most common way to evaluate a wide variety of NLP techniques is to compare a procedure’s output with the answers provided by a standard that is generally accepted by the NLP community. For example, Landauer and Dumais (1997) applied their latent semantic indexing technique to a set of ToEFL2 multiple choice synonym and antonym questions, and report the number of correct answers. Parser evaluation frequently makes use of the Penn Treebank (Marcus, Santorini, and Marcinkiewicz, 1993) to measure the number of correctly generated parse trees. Word sense disambiguation tasks often train and test on data from a sense-tagged corpus like SemCor (Palmer, Gildea, and Kingsbury, 2005). In each case, the output from an automatic technique is compared to a manually created standard that is appropriate to the task.

Because of the link between distributional similarity and semantic similarity, evaluations of distributional similarity techniques often proceed by comparison to an accepted lexical semantic resource like WordNet (Fellbaum, 1998) or Roget’s Thesaurus (Roget’s Thesaurus, 2008). For example, Lin (1998a) describes a technique for evaluating distributional similarity measures that is based on the hyponymy relation in WordNet. Budanitsky (1999) provides an extensive survey of lexical similarity measures based on the WordNet nominal hierarchy. Many of these measures involve treating the hierarchy as a graph and computing distance between words in terms of weighted edges between nodes. This strand of research tends to focus on the distributional similarity of nouns in part because the noun hierarchy is the most richly developed of the lexical hierarchies in WordNet (Budanitsky, 1999: 15). It is not clear that the types of lexical relations that are used to organize nouns (e.g., hyponymy, meronymy) extend to the categorization of other parts of speech, namely verbs.

2 Test of English as a Foreign Language

A second strand of research has dealt with the classification of verbs. In this setting, empirically defined similarities between verbs are typically evaluated with respect to a lexical semantic verb classification such as that given in Levin (1993), VerbNet (Kipper et al., 2000), or FrameNet (Johnson and Fillmore, 2000). This dissertation evaluates a number of distributional similarity measures against five widely used lexical semantic standards: Levin (1993), VerbNet, FrameNet, WordNet, and Roget’s Thesaurus. The next sections outline the organizational principles behind each and quantify some of the similarities and differences among them with respect to how they organize the same set of verbs.

5.5.3.1 Levin (1993)

Levin’s classification of English verbs rests on the hypothesis that a verb’s meaning determines the syntactic realization and interpretation of its arguments (e.g., Lapata and Brew, 2004, and references therein). Levin argues that verbs which participate in the same diathesis alternations share certain components of meaning and form semantically coherent classes. The converse of the hypothesis Levin assumes is that the syntactic behavior of verbs can be used to provide clues to aspects of their meaning. Levin groups verbs into classes of items that participate in the same syntactic alternation. For example, the dative alternation, exemplified below,

(1) a. Brian passed DJ the ball.
    b. Brian passed the ball to DJ.

involves an alternation between the double object construction (1-a) and a prepositional phrase headed by to (1-b). Verbs that participate in the alternation, like give, feed, rent, sell, trade, etc., are all grouped into the class of GIVE verbs. Levin identifies around 77 such syntactic alternations and uses them to group 3,004 verbs into 191 classes such as SEND (FedEx, UPS, slip, smuggle, etc.) and CLING (adhere, cleave, cling).

5.5.3.2 VerbNet

VerbNet closely follows Levin’s classification, adding some verbs (3626 total) and classes (237 total). In addition, VerbNet specifies selectional preferences for some verbs (e.g., ±ANIMATE, ±LOCATION) that are not explicitly expressed in Levin’s original classification.

5.5.3.3 FrameNet

Verbs in FrameNet are organized around semantic frames, which are schematic representations of situation types that tie lexical units to frames of semantic knowledge. For example, the GIVING frame relates the subject, object, and indirect object of a verb like give to the donor, theme, and recipient semantic roles. Other verbs that evoke the GIVING frame are bequeath, donate, endow, fob off, foist, gift, give out, hand, hand in, hand out, hand over, pass, pass out, and treat. FrameNet classifies 2307 verbs into 321 classes.

5.5.3.4 WordNet

WordNet arranges specific senses of nouns, verbs, adjectives, and adverbs into synonym sets that express a distinct concept. Synonym sets are further linked by a number of lexical relations like hyponymy, antonymy, coordinate terms, etc. For example, the verb brachiate is synonymous with swing and sway and is a hyponym of move back and forth. WordNet organizes 11529 verbs in 13767 synonym sets. In addition to synonymy, WordNet defines hyponymy relations between verbs, which often correspond to Roget synonyms. For example, Roget synonyms of argue such as quibble, quarrel, dispute, and altercate are classified as hyponyms by WordNet.

5.5.3.5 Roget’s Thesaurus

Roget’s Thesaurus (2008) is an online thesaurus published by Lexico Publishing Group. It contains around 14,000 verbs, and each entry consists of an indication of its part of speech, a dictionary-style definition, synonyms and possibly antonyms. An example entry for gargle is shown below:

Main Entry: gargle
Part of Speech: verb
Definition: rinse
Synonyms: irrigate, swash, trill, use mouthwash

Querying the online thesaurus for verbs contained in the previously described verb classification schemes returned synonym sets for 3786 entries.

5.5.3.6 Comparison of Verb Classification Schemes

The fact that multiple classification schemes have been proposed and instantiated for a reasonably large number of English verbs suggests that the optimal criteria for determining verb categories are open to debate. At the very least, the existence of multiple categorization schemes suggests that the criteria for verb classification differ according to the particular aspect of lexical similarity that is of interest to the lexicographer. One of the purposes of this dissertation is to examine distributionally similar verb assignments with respect to multiple verb classification schemes. Before doing so, it is useful to quantify the extent to which the five schemes outlined above agree on their assignments of verbs to lexical classes.

Figure 5.1 compares the five schemes in terms of the number of senses each assigns to verbs. For WordNet, senses are explicitly distinguished and labeled in a

Figure 5.1: Distribution of verb senses assigned by the five classification schemes (panels: Levin, VerbNet, FrameNet, Roget, WordNet). The x-axis shows the number of senses and the y-axis shows the number of verbs.

verb’s entry, for example:

Synonyms/Hypernyms of verb run
Sense 1: run
Sense 2: scat, run, scamper, turn tail, ...
Sense 3: run, go, pass, lead, extend
...
Sense 39: melt, run, melt down
Sense 40: ladder, run
Sense 41: run, unravel

Roget’s Thesaurus also distinguishes verb senses in the form of multiple entries headed by the same item with distinct definitions (e.g., run1: move fast, run2: flow, run3: operate, run4: manage, run5: continue, run6: be candidate). For Levin, VerbNet, and FrameNet, we treat the number of classes to which a verb is assigned as the

number of senses of that verb. For example, Levin and VerbNet assign run to the PREPARING, SWARM, MEANDER, and RUN classes (4 senses); FrameNet assigns run to the SELF MOTION, LEADERSHIP, IMPACT, FLUIDIC MOTION, and CAUSE IMPACT classes (5 senses).

As Figure 5.1 shows, Levin, VerbNet, FrameNet, and Roget’s Thesaurus are quite similarly distributed, and do not assign more than 10 senses to any verb. The overall distribution of senses to verbs is similar in WordNet as well, but WordNet makes substantially more sense distinctions (up to 59) for a small number of verbs. Figure 5.2 compares Levin, VerbNet, and FrameNet in terms of the size of the verb classes each defines. Because Roget’s Thesaurus and WordNet do not explicitly define verb classes, only sets of synonyms, they are not included in this figure. Overall, the distribution of class sizes between Levin and VerbNet is similar, as is expected since VerbNet is based on Levin’s original classification. The largest Levin

Figure 5.2: Distribution of class sizes (panels: Levin, VerbNet, FrameNet). The x-axis shows the class size, and the y-axis shows the number of classes of a given size.

class is the CHANGE OF STATE verbs (255 members) and the largest VerbNet class (383 members) also contains change of state verbs (OTHER CHANGE OF STATE). Classes in FrameNet tend to be smaller (since there are more classes); the largest FrameNet class is the SELF MOTION verbs (123 members).

Figure 5.3 shows the number of neighbors that verbs are assigned by each of the five classification schemes. Senses are not distinguished in Figure 5.3, meaning that neighbors of a verb are calculated according to all of the classes that a verb belongs to. For example, the Levin neighbors of run include all of the verbs that belong to the PREPARING, SWARM, MEANDER, and RUN classes. Similarly, for Roget and WordNet, we followed the methodology employed by Curran and Moens (2002) and conflated the sense sets of each verb. For example, the Roget neighbors of run include all of the synonyms in the flow sense (e.g., flow, bleed, cascade, etc.), all of the synonyms in the operate sense (e.g., operate, maneuver, perform, etc.), all of the synonyms in the manage sense (e.g., manage, administer, boss, etc.), all of the synonyms in the continue sense (e.g., continue, circulate, cover, etc.), and all of the synonyms in the campaign sense (e.g., challenge, compete, contend, etc.). The distributions of the five schemes are fairly different in terms of the number of neighbors each assigns to a verb. In particular, WordNet defines relatively small synonym sets, and Levin, VerbNet, and Roget show a relatively even distribution of neighborhood sizes. The distribution of neighborhood sizes for FrameNet is relatively skewed toward smaller sizes.

The next comparisons involve a closer examination of assignments made by each of the five schemes for the set of 1313 verbs common to all of the schemes; i.e., their intersection. Table 5.1 contains the pairwise correlation matrix between schemes with respect to the number of senses each assigns to the same set of verbs. As expected, Levin and VerbNet are the most highly correlated pair (r = 0.93) with

Figure 5.3: Distribution of neighbors per verb (panels: Levin, VerbNet, FrameNet, Roget, WordNet). The x-axis shows the number of neighbors, and the y-axis shows the number of verbs that have a given number of neighbors.

           VerbNet   FrameNet   Roget   WordNet
Levin      0.93      0.33       0.32    0.40
VerbNet              0.35       0.35    0.44
FrameNet                        0.31    0.35
Roget                                   0.71

Table 5.1: Correlation between number of verb senses across five classification schemes

respect to the number of classes (number of senses) they assign this subset of verbs to. Roget and WordNet are the next most similar pair (r = 0.71), while the other comparisons are not strongly correlated. This indicates substantial differences in the distinctions the lexicographers behind each classification scheme considered necessary for distinguishing usages of a verb.

Table 5.2 contains the pairwise correlation matrix between schemes with respect to the number of neighbors each assigns to the same set of verbs. Again, Levin

           VerbNet   FrameNet   Roget   WordNet
Levin      0.99      0.59       0.09    0.09
VerbNet              0.59       0.09    0.08
FrameNet                        -0.03   -0.01
Roget                                   0.67

Table 5.2: Correlation between number of neighbors assigned to verbs by five classification schemes

and VerbNet are highly correlated (r = 0.99), and Roget and WordNet are correlated as well (r = 0.67). FrameNet is similar to Levin and VerbNet (r = 0.59), but none of the other comparisons are correlated beyond chance. Finally, Table 5.3 contains the pairwise matrix obtained by computing the correlations between verb schemes with respect to which verbs in the subset

are neighbors of one another. In other words, for each verb scheme, a pairwise affinity

           VerbNet   FrameNet   Roget   WordNet
Levin      0.97      0.48       0.21    0.09
VerbNet              0.49       0.21    0.10
FrameNet                        0.23    0.10
Roget                                   0.27

Table 5.3: Correlation between neighbor assignments for intersection of verbs in five verb schemes

matrix was computed where every pair of verbs received a score of 1 if they belong to the same class and a score of 0 if they do not. For Roget and WordNet, two verbs received a score of 1 if either was the synonym of the other, and a score of 0 otherwise. We assessed the extent to which each pair of verb schemes correspond in their assignment of verbs to classes by computing Pearson’s product moment correlation coefficient between their resulting affinity matrices. In other words, each individual affinity matrix is treated as a flat sequence of numbers (e.g., 1,1,0,0 versus 1,0,1,0) and the correlation between each pair of sequences is computed. These results are shown in Table 5.3. The only comparison with any substantial correlation is again VerbNet and Levin (r = 0.97), with FrameNet forming the third member of that group with a correlation of about 0.49. Otherwise, there is very little agreement between the

schemes about which verbs are neighbors of one another. The purpose of the above comparisons is to quantify some of the differences in a set of widely accepted lexical semantic standards for verb classification. As discussed in Weeds (2003: Chapter 2, Section 2.3.1), the use of an accepted standard to evaluate natural language processing techniques is not perfect. For example, if the

standard was prepared for a particular domain or data set, it may not generalize to data from another source. Another issue that is particularly relevant here is whether there is more than one correct answer – that is, whether experts themselves disagree over the assignments

made by a given standard. Such disagreements make it difficult to judge disparities between an empirical classification and the standard being used (Weeds, 2003: 37). With this overview of evaluation techniques complete, the next section turns to a discussion of measures of distributional lexical similarity.
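Before doing so, the affinity-matrix comparison used in Section 5.5.3.6 can be made concrete with a minimal Python sketch. The class-membership dictionaries and verbs below are hypothetical toy data, not the actual resources used in this dissertation; the procedure itself is the one described above: build a binary affinity value for every verb pair in each scheme and correlate the flattened sequences.

    # Sketch of the affinity-matrix comparison of Section 5.5.3.6.
    # `levin` and `frame` map each verb to the set of class labels it belongs
    # to in one scheme; these toy dictionaries are hypothetical.
    from itertools import combinations
    from math import sqrt

    def affinity_vector(scheme_classes, verbs):
        # 1 if two verbs share at least one class, 0 otherwise,
        # flattened over all unordered verb pairs.
        return [1 if scheme_classes[a] & scheme_classes[b] else 0
                for a, b in combinations(verbs, 2)]

    def pearson(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sqrt(sum((a - mx) ** 2 for a in x))
        sy = sqrt(sum((b - my) ** 2 for b in y))
        return cov / (sx * sy)

    levin = {"run": {"RUN", "MEANDER"}, "jog": {"RUN"}, "give": {"GIVE"}, "pass": {"GIVE"}}
    frame = {"run": {"SELF_MOTION"}, "jog": {"SELF_MOTION"}, "give": {"GIVING"}, "pass": {"GIVING"}}
    verbs = sorted(levin)  # verbs covered by both schemes

    print(pearson(affinity_vector(levin, verbs), affinity_vector(frame, verbs)))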

5.6 Measures of Distributional Similarity

A wide variety of measures have been proposed for quantifying distributional lexical similarity. Strictly speaking, many of these are more properly referred to as distance, divergence, or dissimilarity measures (Weeds, 2003: 46), but in practice these distinctions are often blurred. This is because distance and similarity are two ways of describing the same relationship: items that are a short distance apart are highly similar, whereas items that score low in terms of similarity score high in terms of distance. Functions to convert distance to similarity exist (e.g., Dagan, Lee, and Pereira, 1999), and in applications which require only a rank ordering of lexical neighbors, the distinction between distance and similarity is often irrelevant as both types of measures may be used to produce equivalent results. Strictly speaking, a distance metric is a function defined on a set X that meets the following criteria for all x, y, z in X:

d(x, y) ≥ 0                          [non-negativity]
d(x, y) = 0 iff x = y                [distance is zero from a point to itself]
d(x, y) = d(y, x)                    [symmetry]
d(x, z) ≤ d(x, y) + d(y, z)          [triangle inequality]

However, not all of the proposed lexical similarity measures are metrics – for example, some divergence measures such as the Kullback-Leibler divergence (Manning and Schütze, 1999: 304) are asymmetric, as is the information-theoretic measure of word similarity proposed in Lin (1998a). Weeds (2003) argues extensively that lexical similarity is inherently asymmetric, particularly with respect to hierarchical nominal relations such as hyponymy (e.g., a banana is a fruit but not all fruits are bananas), and that similarity functions which exploit this asymmetry are preferable to those that do not. This dissertation only considers similarity measures which are strictly metric.

This decision is based partly on consideration of the algorithmic complexity involved in computing the nearest neighbors for a large set of words in a high-dimensional feature space. In naive form, for a set of n lexical items and m features this computation requires mn² comparisons – the number of features times the distances between every pair of items in the set. However, if sim(x, y) is symmetric, then only half of the distances need to be computed, because sim(x, y) equals sim(y, x). In this case, the calculation of the pairwise distance matrix reduces to (mn² − mn)/2 comparisons (assuming there is no reason to calculate sim(x, x)). Over a large data set that involves multiple computation of distance matrices over a variety of experimental conditions, the constant time savings are appreciable. Additional time savings may be obtained by splitting the distance matrix into a number of submatrices that are computed in parallel (B).
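A minimal Python sketch of this saving, assuming a symmetric measure (L1 distance is used here purely for illustration): only the upper triangle of the pairwise matrix is computed and then mirrored.

    # Sketch: with a symmetric measure, only the upper triangle of the
    # pairwise matrix needs to be computed, i.e. (mn^2 - mn)/2 comparisons
    # rather than mn^2. The vectors below are illustrative toy data.
    def l1(x, y):
        return sum(abs(a - b) for a, b in zip(x, y))

    def pairwise_distances(vectors, dist=l1):
        n = len(vectors)
        d = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):      # upper triangle only
                d[i][j] = d[j][i] = dist(vectors[i], vectors[j])
        return d

    vecs = [[1, 0, 2], [0, 1, 2], [3, 3, 0]]
    for row in pairwise_distances(vecs):
        print(row)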

However, the primary reason that we restrict ourselves to similarity (distance) metrics has to do with the difficulty of applying clustering or classification techniques when sim(x, y) ≠ sim(y, x). In order to deal with this confound previous researchers have adopted a variety of strategies for converting asymmetric measures into de facto metrics. For example, Lin (1998a, c) proposes an asymmetric measure of similarity, and considers two words w1 and w2 neighbors if and only if the maximally similar neighbor of w1 is w2, and the maximally similar neighbor of w2 is w1. Using skew divergence, Brew and Schulte im Walde (2002) computes similarity in both directions and takes the larger of sim(w1, w2), sim(w2, w1) as the measure of two words’ similarity. Information radius (Section 5.6.3.1) takes the average similarity between (w1, w2) and (w2, w1). Although we are not averse to the claim that lexical similarity is inherently asymmetric, we do not explicitly adopt this approach here for the reasons outlined above.

The various lexical similarity metrics can be grouped into classes according to their conceptualization of the computation of similarity. Although the distinction blurs in practice, this dissertation groups similarity measures into three classes – set-theoretic, geometric, and information-theoretic – each of which is discussed in the

following sections.

5.6.1 Set-Theoretic Similarity Measures

Set-theoretic measures conceive of similarity between items on the basis of the cardinality of shared versus unshared features. In their basic instantiation, these measures are based solely on the concept of set membership, meaning that the number of times a feature occurs with a target word does not contribute to the measure of similarity (although numerous frequency-weighted variants have been proposed (e.g., Curran and Moens, 2002)). Most set-theoretic similarity measures consider the ratio of shared features to some combination of shared and unshared features.

5.6.1.1 Jaccard’s Coefficient

Jaccard’s coefficient is defined as the ratio of the intersection of two feature sets to

their union (Manning and Schütze 1999: 299, Table 8.7; Weeds 2003: 3.1.3.1 and references therein):

\frac{|A \cap B|}{|A \cup B|}

5.6.1.2 Dice’s Coefficient

Dice’s coefficient is similar to Jaccard’s coefficient, but takes the total number of elements as the denominator. Multiplying by 2 scales the measure to [0, 1]:

\frac{2|A \cap B|}{|A| + |B|}

Dice’s coefficient and Jaccard’s coefficient are monotonically equivalent3 in terms of the relative similarities they assign to a group of objects (Evert, 2000; Weeds, 2003). However, similarity drops off faster with Jaccard’s coefficient than with Dice’s coefficient. Dice’s coefficient penalizes a small number of shared features less than Jaccard’s coefficient does (Manning and Schütze, 1999: 299), and in general assigns higher similarity scores than Jaccard’s coefficient.

3 Jaccard = Dice / (2 − Dice)

5.6.1.3 Overlap Coefficient

The overlap coefficient measures the ratio of the intersection of two sets to the minimum cardinality of those sets (Manning and Schütze, 1999: 299, Table 8.7):

\frac{|A \cap B|}{\min(|A|, |B|)}

The overlap coefficient has a value of 1 if either set is a subset of the other, and a value of 0 if no entries are shared between the two sets.

5.6.1.4 Set Cosine

The set cosine is defined as the ratio of the intersection of two sets to the square root of the product of their cardinalities (Manning and Schütze, 1999: 299, Table 8.7):

\frac{|A \cap B|}{\sqrt{|A| \times |B|}}

The set cosine is identical to Dice’s coefficient for sets with the same number of elements (Manning and Schütze, 1999: 299), but penalizes less when the cardinality of the sets is very different (i.e., because of the square root in the denominator).
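For concreteness, the following Python sketch implements the four set-theoretic measures above; the feature sets are hypothetical dependency-style features invented for illustration.

    # Illustrative implementations of the set-theoretic measures of
    # Section 5.6.1, applied to toy sets of co-occurrence features.
    from math import sqrt

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    def dice(a, b):
        return 2 * len(a & b) / (len(a) + len(b))

    def overlap(a, b):
        return len(a & b) / min(len(a), len(b))

    def set_cosine(a, b):
        return len(a & b) / sqrt(len(a) * len(b))

    A = {"subj:dog", "obj:ball", "mod:fast"}
    B = {"subj:dog", "obj:stick"}
    for f in (jaccard, dice, overlap, set_cosine):
        print(f.__name__, round(f(A, B), 3))

On these two sets the measures also illustrate the point just made: Dice (0.4) assigns a higher score than Jaccard (0.25) for the same single shared feature.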

5.6.2 Geometric Similarity Measures

Geometric measures express similarity in terms of the distance between points or the angle between vectors and as such assume an underlying real-valued geometric space.

Within this framework features represent the dimensions of the space and the number of features determines its dimensionality. In general, the value of each dimension for a given item contains a count of how many times that feature occurred with the target

item. Some scaling procedure is typically applied to count vectors before similarity is computed.

5.6.2.1 L1 Distance

L1 distance (variously called Manhattan distance, city block distance, or taxicab distance among others) represents the distance between two points traveling only in orthogonal directions (i.e., walking around the block without taking any shortcuts). It can be computed as

\sum_{i=1}^{n} |x_i - y_i|

5.6.2.2 L2 Distance

L2 distance, more widely known as Euclidean distance, yields the straight line distance between two points. It can be calculated as

\sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

If only the rank ordering is important, the square root can be omitted, since it is a monotonic transformation. L1 and L2 distance are particular instances of the more general class of the generalized Lm or Minkowski distance measure (Black, 2006), defined as

\left( \sum_{i=1}^{n} |x_i - y_i|^m \right)^{1/m}

Both measures are widely used in computing distributional lexical similarity (Weeds, 2003: 48). L2 distance is more sensitive to large differences in the number of non-zero elements between two vectors, because it squares differences in each dimension, and for some applications is less effective than L1 distance (Weeds, 2003: 48, and references therein).

5.6.2.3 Cosine

In its geometric interpretation, the cosine measure returns the cosine of the angle between two vectors. The cosine is equivalent to the normalized correlation coefficient (i.e., Pearson’s product moment correlation coefficient) (Manning and Schütze, 1999: 300), and as such is a measure of similarity rather than distance. The cosine is bounded between [-1, 1]; when applied to vectors whose elements are all greater than or equal to zero, it is bounded between [0, 1] with 1 being identity and 0 being orthogonal vectors. The cosine of the angle between real-valued vectors can be calculated as (Manning and Schütze, 1999: 300, (8.40))

\frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2}}

The set cosine (Section 5.6.1.4) is equivalent to applying the above definition of cosine to binary vectors. In general, real-valued vectors result in the term in the denominator being larger, so that the value of cosine based on real-valued vectors tends to be smaller than the corresponding binary measure.
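The following Python sketch gives illustrative implementations of L1 distance, L2 distance, and cosine for real-valued vectors; the vectors are toy examples, and the final call simply illustrates the scaling invariance of cosine discussed in the next section.

    # Illustrative implementations of the geometric measures of Section 5.6.2
    # for real-valued co-occurrence vectors (toy data).
    from math import sqrt

    def l1(x, y):
        return sum(abs(a - b) for a, b in zip(x, y))

    def l2(x, y):
        return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def cosine(x, y):
        dot = sum(a * b for a, b in zip(x, y))
        return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

    x, y = [3.0, 0.0, 1.0], [1.0, 2.0, 1.0]
    print(l1(x, y), round(l2(x, y), 3), round(cosine(x, y), 3))
    # cosine is unchanged by constant scalings of a vector:
    print(round(cosine([2 * a for a in x], y), 3))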

5.6.2.4 General Comparison of Geometric Measures

In general, differences in ranking produced by the various distance measures are influenced by how differences in values along shared and unshared features are calculated. For example, in calculating cosine, corresponding elements between two vectors are multiplied; L1 distance subtracts corresponding elements and L2 squares the difference between corresponding elements. In large, sparse feature spaces, it is often the case that an element with a non-zero value in one vector has a corresponding value of zero in another vector. For cosine, these differences essentially cancel out because of the multiplication by zero. For L1 and L2, larger differences between vectors accumulate along these unshared dimensions.

Because of the normalization factor in its denominator, cosine is invariant to constant scalings of a vector, e.g., it returns the same score for a vector of raw counts as for unit vectors or probability vectors. Weightings which alter the ratio of values within a vector such as binary, log-likelihood, etc., impact the assignment of rankings given by cosine. Euclidean and L1 distance are sensitive to all weightings of the data in that they do not contain a normalization factor. For unit vectors, Euclidean(x, y) = \sqrt{2(1 - \cos(x, y))}, so the order of proximities coincides (Manning and Schütze, 1999: 301).

5.6.3 Information Theoretic Similarity Measures

Information theoretic measures assume an underlying probability space, and are based

on a comparison of two probability distributions. For example, the relative entropy (also known as Kullback-Leibler divergence, information gain, among others) of two probability mass functions p and q is defined as (Manning and Schütze, 1999: 72, (2.41))

D(p \| q) = \sum_i p_i \log \frac{p_i}{q_i}

and measures the average number of bits (when log = log2) required to express events drawn from distribution p in terms of q. D(p||q) ranges over [0, ∞] under the conventions 0 log(0/q) ≡ 0 and p log(p/0) ≡ ∞. However, this is problematic when applied to the sparse vectors typically associated with distributions of lexical co-occurrences, because zeros in one distribution or the other are so common that nearly everything ends up with a distance of ∞. Furthermore, D(p||q) is asymmetric, doubling the necessary computation of a pairwise distance matrix and requiring some further decision when D(p||q) ≠ D(q||p). One option for producing a symmetric version of relative entropy is to use a variant known as Jensen-Shannon divergence or information radius.

5.6.3.1 Information Radius

Information radius is defined as (Lin, 1991)

\mathrm{IRad}(p, q) = D\left(p \,\Big\|\, \frac{p+q}{2}\right) + D\left(q \,\Big\|\, \frac{p+q}{2}\right)
                    = \sum_i p_i \log \frac{p_i}{\frac{1}{2}p_i + \frac{1}{2}q_i} + \sum_i q_i \log \frac{q_i}{\frac{1}{2}p_i + \frac{1}{2}q_i}

and overcomes two problems with using relative entropy in a sparse vector space. First, it is symmetric, and second, because we are not interested in events which have zero probability under both p and q, p_i + q_i is greater than zero, and (p_i + q_i)/2 > 0, which eliminates the problem of division by zero. Information radius ranges from 0 for identical distributions to 2 log 2 for maximally different distributions, and measures the amount of information lost if two words represented by p and q are described by their average distribution (Manning and Schütze, 1999: 304).

Although information radius is the only information theoretic measure considered in this dissertation, nothing in practice prevents vector space measures from being applied to probability vectors. For example, applying cosine to probability vectors yields results identical to those obtained for vectors scaled by any other constant factor (e.g., unit vectors, which have been normalized to have vector length 1). When L1 distance is applied to probability vectors, the result can be interpreted as the expected proportion of events that differ between p and q (Manning and Schütze, 1999: 305).
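As a concrete illustration, the following Python sketch computes information radius for two toy probability vectors using base-2 logarithms; zero-probability events contribute nothing to the sum, as discussed above.

    # A minimal sketch of information radius (Jensen-Shannon divergence) as
    # defined in Section 5.6.3.1, applied to toy probability vectors.
    from math import log

    def irad(p, q):
        total = 0.0
        for pi, qi in zip(p, q):
            m = 0.5 * pi + 0.5 * qi
            if pi > 0:
                total += pi * log(pi / m, 2)
            if qi > 0:
                total += qi * log(qi / m, 2)
        return total

    p = [0.5, 0.5, 0.0]
    q = [0.0, 0.5, 0.5]
    print(irad(p, p))   # 0.0 for identical distributions
    print(irad(p, q))   # 1.0; the maximum (disjoint supports) is 2 log2(2) = 2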

5.7 Feature Weighting

The basic representation of words as a distribution of lexical co-occurrences is a vector whose elements are counts of the number of times features f1, ..., fn occurred in the context of target word wi. The assumption is that the frequency with which certain subsets of features co-occur with particular groups of words is an indication of those words’ lexical similarity. However, using raw co-occurrence counts is not the most effective method of weighting features (Manning and Schütze, 1999: 542), because gross differences in the frequency of two target words can overwhelm subtler distributional patterns. Therefore, feature weighting schemes which rely on some transformation of the original frequency counts are used. We divide these transformation schemes into two classes: intrinsic feature transformations, which use only frequency information which is contained in an individual target word’s vector, and extrinsic feature transformations, which consider the distribution of a feature over all of the target words in addition to its local frequency information.

5.7.1 Intrinsic Feature Weighting

5.7.1.1 Binary Vectors

The simplest feature weighting scheme is to disregard all frequency information and replace co-occurrence counts with a binary indication of presence versus absence.

This is the distributional representation assumed by the set-theoretic measures of lexical similarity (Section 5.6.1).

5.7.1.2 Vector Length Normalization

A common transformation procedure is to convert a vector to a unit vector, which is

done by dividing every element in the vector by the length of the vector, defined as

|x| = \sqrt{\sum_{i=1}^{n} x_i^2}

A unit vector has unit length according to the Euclidean norm:

|x| = \sqrt{\sum_{i=1}^{n} x_i^2} = 1

Working with unit vectors yields a couple of useful properties. The cosine of two unit vectors is equal to their dot product

\cos(x, y) = x \cdot y = \sum_{i=1}^{n} x_i y_i

which provides for an efficient calculation of the cosine and gives the same ranking as the Euclidean distance metric when applied to unit vectors with no values smaller than 0.

5.7.1.3 Probability Vectors

Converting a vector of counts to a vector of probabilities is done by dividing every element in the vector by the sum of all elements in the vector.

Other intrinsic transformation procedures are available; for example, in Latent Semantic Analysis, a vector is scaled by the entropy of the vector (Landauer, Foltz, and Laham, 1998). However, constant scalings of a vector do not change the relative ordering of similarities produced by the measures considered in this dissertation, with

the exception of L1 and L2 distance.
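A short Python sketch of the intrinsic weightings described in this section (binary, unit-length, and probability vectors), applied to an illustrative count vector:

    # Sketch of the intrinsic feature weightings of Section 5.7.1 applied to
    # a toy vector of raw co-occurrence counts.
    from math import sqrt

    def binary(counts):
        return [1 if c > 0 else 0 for c in counts]

    def unit(counts):
        length = sqrt(sum(c * c for c in counts))
        return [c / length for c in counts]

    def probabilities(counts):
        total = sum(counts)
        return [c / total for c in counts]

    counts = [4, 0, 1, 3]
    print(binary(counts))
    print([round(v, 3) for v in unit(counts)])           # Euclidean length 1
    print([round(v, 3) for v in probabilities(counts)])  # sums to 1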

5.7.2 Extrinsic Feature Weighting

Extrinsic feature weighting schemes try to capture the strength of the association between a feature and a target word relative to all of the target words. The assumption is that a feature that occurs very frequently with a small set of target words is important and should be weighted more highly than a feature that occurs frequently with all of the target words. Numerous approaches have been described, and many of these are summarized in Curran and Moens (2002) and Weeds (2003). This dissertation considers three representative ones.

5.7.2.1 Correlation

Rohde, Gonnerman, and Plaut (submitted) propose a feature weighting method based on the strength of the correlation between a word a and a feature b, defined as (Rohde et al., submitted: 3, (Table 4)):

w'_{a,b} = \frac{T\, w_{a,b} - \sum_j w_{a,j} \sum_i w_{i,b}}{\left( \sum_j w_{a,j} \cdot (T - \sum_j w_{a,j}) \cdot \sum_i w_{i,b} \cdot (T - \sum_i w_{i,b}) \right)^{1/2}}

T = \sum_i \sum_j w_{i,j}

The intuition behind using correlation to weight features is that the conditional rate of co-occurrence is more useful than raw co-occurrence. Correlation addresses the

question of whether feature f occurs more or less often in the context of word w than it does in general (Rohde et al., submitted: 6). Values of features weighted by correlation are in [−1, 1]. Following Rohde et al. (submitted), we eliminate features that are negatively correlated with a word on the basis of their observation that retaining negatively correlated features hurts performance. Rohde et al. (submitted) find that correlation outperforms other weighting schemes for modeling human word pair similarity judgments, and Li and Brew (2008) also reports success using correlation as a feature weight in an automatic classification study of Levin verbs.
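The following Python sketch is an illustrative implementation of this correlation weighting for a small, hypothetical word-by-feature count matrix, zeroing negative correlations as just described.

    # Sketch of the correlation weighting of Section 5.7.2.1, computed from a
    # word-by-feature count matrix; negatively correlated features are set to
    # zero. The small matrix is invented for illustration.
    from math import sqrt

    def correlation_weights(w):
        T = sum(sum(row) for row in w)
        row_sums = [sum(row) for row in w]
        col_sums = [sum(col) for col in zip(*w)]
        out = []
        for a, row in enumerate(w):
            new_row = []
            for b, count in enumerate(row):
                num = T * count - row_sums[a] * col_sums[b]
                den = sqrt(row_sums[a] * (T - row_sums[a]) *
                           col_sums[b] * (T - col_sums[b]))
                r = num / den if den > 0 else 0.0
                new_row.append(max(r, 0.0))   # drop negative correlations
            out.append(new_row)
        return out

    w = [[8, 2, 0],
         [1, 5, 4]]
    for row in correlation_weights(w):
        print([round(v, 3) for v in row])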

5.7.2.2 Inverse Feature Frequency

Inverse feature frequency is a family of weighting schemes taken from the field of information retrieval which are characterized by term co-occurrence weights, document frequency weights, and a scaling component (Manning and Schütze, 1999: 543). We adapt the scheme defined in Manning and Schütze (1999: 543, (15.5)) for use in distributional lexical similarity:

\begin{cases} (1 + \log(\mathrm{freq}(f,w))) \log \frac{W}{\mathrm{words}(f)} & \text{if } \mathrm{freq}(f,w) \geq 1 \\ 0 & \text{if } \mathrm{freq}(f,w) = 0 \end{cases}

where freq(f, w) is the number of times feature f occurs with word w, W is the total number of words, and words(f) is the number of words that f occurs with.
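A minimal Python sketch of this weighting, with illustrative values chosen for freq(f, w), W, and words(f):

    # Sketch of the inverse-feature-frequency weighting adapted above:
    # freq_fw is the feature count for the word, W the total number of words,
    # and words_f the number of words the feature occurs with (toy values).
    from math import log

    def iff_weight(freq_fw, W, words_f):
        if freq_fw >= 1:
            return (1 + log(freq_fw)) * log(W / words_f)
        return 0.0

    # A feature seen 5 times with this word but occurring with only 20 of
    # 4000 words gets a much higher weight than one seen with 3500 of them.
    print(round(iff_weight(5, 4000, 20), 3))
    print(round(iff_weight(5, 4000, 3500), 3))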

5.7.2.3 Log Likelihood

A third approach to feature weighting is based on hypothesis testing, where the strength of the relation between a word and a feature is expressed in terms of the extent to which their co-occurrence is greater than chance. A number of parametric statistical tests have been applied to the task of identifying above-chance co-occurrences, notably the t-test (e.g., Church and Hanks, 1989) and Pearson’s χ² test (e.g., Church and Gale, 1991). Dunning (1993) introduced the log-likelihood ratio (G-test) as an alternative to the χ² test that is more appropriate for sparse data that violates the assumption of normality. Like the χ² test, the log-likelihood ratio can be applied to a contingency table like Table 5.4, and measures the ratio of the frequency observed in a cell to the

             Word      ¬Word
Feature      a         b          a + b
¬Feature     c         d          c + d
             a + c     b + d      N = a + b + c + d

Table 5.4: An example contingency table used for computing the log-likelihood ratio

expected frequency of that cell if the null hypothesis is true. The general formula for G is (Wikipedia, 2008)

G = 2 \sum_i O_i \ln \frac{O_i}{E_i}

which can be calculated in terms of Table 5.4 as (Rayson, Berridge, and Francis, 2004: 929)

G = 2(a \ln a + b \ln b + c \ln c + d \ln d + N \ln N - (a+b)\ln(a+b) - (a+c)\ln(a+c) - (b+d)\ln(b+d) - (c+d)\ln(c+d)).
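For concreteness, the following Python sketch computes G from the four cells of a contingency table laid out as in Table 5.4; the counts are invented for illustration, and cells equal to zero contribute nothing to the sum.

    # Sketch of the log-likelihood ratio (G-test) for a 2x2 contingency table
    # with the cell layout of Table 5.4 (toy counts).
    from math import log

    def g_test(a, b, c, d):
        def xlnx(x):
            return x * log(x) if x > 0 else 0.0
        n = a + b + c + d
        return 2 * (xlnx(a) + xlnx(b) + xlnx(c) + xlnx(d) + xlnx(n)
                    - xlnx(a + b) - xlnx(a + c) - xlnx(b + d) - xlnx(c + d))

    # feature occurs 30 times with the word and 70 times with other words;
    # the word occurs 200 times without the feature, other words 9700 times.
    print(round(g_test(30, 70, 200, 9700), 2))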

5.8 Feature Selection

Automatic lexical acquisition tasks that take data from large corpora have to deal with an enormous number of potential features – on the order of hundreds of thousands to millions. Of this potential feature set, a small fraction provides nearly all of its discriminatory power, meaning that most of the features do no work at all or even obscure potential patterns of relatedness. Therefore, it is common or even necessary to eliminate features that are not expected to contribute to the desired classification. All feature reduction techniques rely at their core on frequency information contained in the feature set. The simplest reduction technique is to apply a frequency threshold to the feature set. An alternative to this relies on class-conditional frequencies, and a third choice, dimensionality reduction, indirectly incorporates frequency information into a matrix approximation of the original feature set.

5.8.1 Frequency Threshold

Applying a frequency threshold to a data set simply means removing any items that occur less than a specified number of times. In the early days of statistical NLP (i.e., the early 1990’s), a frequency cutoff was applied more because of physical storage and processing limitations than anything else. There is no principled justification for choosing a particular cutoff value; however, the intuitive justification is that extremely

low frequency features provide little or no generalizability and should be discarded to improve some downstream processing. For example, a feature that occurs only one time with one word in the whole data set does not relate to any other item in the data set, and retaining it only serves to increase the distance between all points. This reasoning extends from 1 to some arbitrary cutoff. Because of the frequency distribution of words in language, where there are a few high frequency items, more medium frequency items, and lots of low frequency items (e.g., Zipf’s law; Manning and Schütze, 1999: 24), relatively low frequency thresholds can drastically reduce the size of a lexical feature set.
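A trivial Python sketch of frequency thresholding over a hypothetical feature-count mapping:

    # Sketch: remove features whose total corpus frequency falls below a
    # cutoff. The feature names and counts are hypothetical.
    feature_counts = {"subj:dog": 41, "obj:ball": 17, "mod:cerulean": 1, "obj:stick": 3}

    def apply_threshold(counts, cutoff):
        return {f: c for f, c in counts.items() if c >= cutoff}

    print(apply_threshold(feature_counts, 5))   # keeps only the frequent features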

5.8.2 Conditional Frequency Threshold

A similar approach to feature selection uses class-conditional frequency information to order features by their ability to partition the data set into the desired classes. For example, the ID3 technique for constructing a decision tree (e.g., Quinlan, 1986) selects features with the highest information gain, which is defined in terms of the reduction of entropy achieved by splitting on that feature (Dunham, 2003: 97-98). The Rainbow text classifier (McCallum, 1996) provides a similar utility for selecting features with the highest class-conditional average mutual information.

Other statistical techniques for data reduction in general and feature selection in particular include class-based correlation and covariance (Richeldi and Rossotto, 1997). However, all of these techniques rely on prior knowledge of class assignments for their computation, and therefore may not be applicable in machine learning settings where class labels are either unknown or, as in the case of determining a word’s nearest lexical neighbors, irrelevant.

5.8.3 Dimensionality Reduction

A number of feature reduction techniques are based on the eigendecomposition of a large set of interrelated features into a new set of uncorrelated features of reduced dimensionality. These techniques, such as principal components analysis (e.g., Hastie et al., 2001), singular value decomposition (e.g., Manning and Schütze, 1999), and locally linear embedding (e.g., Roweis and Saul, 2000; Saul and Roweis, 2003), work by forming linear combinations of the original features that successively account for the variance in the original data set. The first component accounts for the greatest variance, the second accounts for the next largest amount of variance, and successive components account for progressively smaller amounts of the variance in the original data set (Richeldi and Rossotto, 1997: 274). All of the components are orthogonal to each other. Singular value decomposition is by far the most widely used dimensionality reduction technique in the statistical NLP literature, especially for information retrieval and document classification. Results obtained using singular value decomposition are somewhat mixed. Landauer and Dumais (1997) achieve optimal results on the ToEFL task with 300 dimensions versus using the entire feature set. Rohde et al. (submitted)’s results show that for some word judgment tasks, the full feature set outperforms lower dimensional representations, while for other tasks the reverse is true. Cook, Fazly, and Stevenson (2007) reports that performing singular value decomposition hurts classification of idioms into literal or idiomatic readings. Sahlgren, Karlgren, and Eriksson (2007) argues against using dimensionality reduction for tasks such as affective text classification (i.e., does it evoke positive or negative emotion), and suggest that its use may be appropriate for determining paradigmatic similarity, but not syntagmatic similarity.
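As an illustration of the general technique (not the specific configuration used in any of the studies cited above), the following Python sketch reduces a toy word-by-feature count matrix to k dimensions by truncated singular value decomposition; it assumes numpy is available.

    # Sketch of dimensionality reduction by truncated singular value
    # decomposition of a toy word-by-feature count matrix (requires numpy).
    import numpy as np

    X = np.random.poisson(1.0, size=(50, 200)).astype(float)  # toy counts
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    k = 10                          # number of retained dimensions
    X_reduced = U[:, :k] * s[:k]    # words represented in k dimensions
    print(X_reduced.shape)          # (50, 10)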

CHAPTER 6

EXPERIMENTS ON DISTRIBUTIONAL VERB SIMILARITY

This chapter describes a series of experiments that deal with various aspects of assigning and assessing lexical similarity scores to a set of English verbs on the basis of their distributional context in the English gigaword corpus. These experiments simultaneously varied four parameters that influence distributional lexical similarity with respect to five different verb classification schemes. The verb classifications considered were Levin (1993), VerbNet, FrameNet, Roget's Thesaurus, and WordNet.

The parameters considered were choice of feature set, measure of lexical similarity, feature weighting, and feature selection. The purpose of this series of experiments is to examine interactions between these parameters with respect to the five verb schemes mentioned above and described in Chapter 5.

The remainder of this chapter is organized as follows. Section 6.1 describes the set of verbs and corpus used in the experiments. Section 6.2 describes two measures for evaluating distributional lexical similarity used in the experiments. Section 6.3 discusses the feature sets and procedures for extracting features from the corpus.

Section 6.4 describes the experiments performed.

6.1 Data Set

6.1.1 Verbs

The set of verbs used in the following experiments was selected from the union of Levin, VerbNet, and FrameNet verbs that occurred at least 10 times in the English gigaword corpus (i.e., were tagged as verbs at least 10 times by the Clark and Curran CCG parser; details of the procedure are in Section 6.3.2). Roget and WordNet contain many more items than each of Levin, VerbNet, and FrameNet, so in order to maintain an approximately equal number of verbs in each verb scheme, we restricted the selection of verbs from Roget and WordNet to ones that appear in either Levin, VerbNet, or FrameNet. This selection procedure resulted in a total of 3937 verbs; the number of items per verb scheme is shown in Table 6.1.

Verb Scheme   Total Num. Verbs   Num. Verbs Included in Exps.
Levin         3004               2886
VerbNet       3626               3426
FrameNet      2307               2110
WordNet       11529              3762
Roget         ≈14000             2879

Table 6.1: Number of verbs included in the experiments for each verb scheme

Following Curran and Moens (2002)’s work on automatic thesaurus extraction, we do not distinguish between senses of verbs in the evaluation for two reasons. First, because we aggregate all occurrences of a verb into a single context vector, the extracted items represent a conflation of senses. Second, items that are ostensibly classified as belonging to only one class in, e.g., Levin or FrameNet rarely belong to only one class in practice. For example, one of the most frequent verbs in the English gigaword corpus is add, which Levin places exclusively in the MIX class (e.g.,

combine, join, link, merge, etc.). However, in the English gigaword corpus, this verb is used most often as a synonym for say (e.g., "I don't think I'll really fully realize the impact until I swear in," Bush added.), and FrameNet places it exclusively in the STATEMENT class. Because of the recognized difficulties in establishing an inventory of senses for verbs in particular and words in general (e.g., Manning and Schütze, 1999: 229-231), we conflated senses in the verb schemes and defined items as neighbors as follows.

1. Levin, VerbNet, FrameNet: two items are neighbors if the intersection of the classes they belong to is non-empty, i.e., they share at least one sense which puts them in the same class. For example, for VerbNet, link ∈ {MIX, TAPE} and harness ∈ {BUTTER, TAPE} are neighbors because {MIX, TAPE} ∩ {BUTTER, TAPE} = {TAPE}. (Both criteria are sketched in code after this list.)

2. Roget, WordNet: two words are neighbors if either is listed as a synonym of the other.
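
The two neighbor criteria above are simple to state in code. In the sketch below, the class assignments and synonym lists are toy fragments used only to reproduce the examples, not the full resources.

```python
def class_neighbors(classes_a, classes_b):
    """Levin/VerbNet/FrameNet criterion: neighbors iff the class sets intersect."""
    return bool(set(classes_a) & set(classes_b))

def synonym_neighbors(word_a, word_b, synonyms):
    """Roget/WordNet criterion: neighbors iff either word lists the other as a synonym."""
    return word_b in synonyms.get(word_a, set()) or word_a in synonyms.get(word_b, set())

verbnet = {"link": {"MIX", "TAPE"}, "harness": {"BUTTER", "TAPE"}, "say": {"SAY"}}
print(class_neighbors(verbnet["link"], verbnet["harness"]))   # True: both contain TAPE
print(class_neighbors(verbnet["link"], verbnet["say"]))       # False: no shared class

roget = {"argue": {"dispute", "quarrel"}}
print(synonym_neighbors("dispute", "argue", roget))           # True: dispute is listed under argue
```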

Table 6.2 shows the average number of neighbors per verb in our study for each of the verb schemes using these criteria. Table 6.3 contains the baselines that indicate the chance that two verbs in our study selected at random are neighbors.

                   Levin         VerbNet         FrameNet      Roget         WordNet
Mean (Std. Dev.)   86.3 (85.0)   103.5 (120.5)   40.4 (41.9)   31.9 (20.1)   12.6 (10.9)
Max                513           669             248           185           76
Median             49            48              23            39            10
Min                2             1               1             4             1

Table 6.2: Average number of neighbors per verb for each of the five verb schemes

For all five schemes, the baseline is less than 3%.

Verb Classification   Baseline
Levin                 0.029
VerbNet               0.028
FrameNet              0.018
Roget                 0.006
WordNet               0.001

Table 6.3: Chance of randomly picking two verbs that are neighbors for each of the five verb schemes

6.1.2 Corpus

The English gigaword corpus (Graff, 2003) is composed of nine years of newspaper text (1994–2002) from four distinct international sources of English newswire: Agence France Press English Service, Associated Press Worldstream English Service, The New York Times Newswire Service, and The Xinhua News Agency English Service.

This text covers a wide spectrum of subjects and is not tied to any particular domain, although it is skewed towards political and economic news.

6.2 Evaluation Measures

As discussed in Chapter 5, framing the task of extracting distributionally similar verbs as an information retrieval or thesaurus construction task enables the use of evaluation measures commonly used in those domains. Following representative work on automatic thesaurus extraction such as Lin (1998a), Curran and Moens (2002), and Weeds (2003), we utilize measures of precision and inverse rank score in evaluating the results of the experiments reported here. These measures are presented in the following sections along with a discussion of their key characteristics.

6.2.1 Precision

Following the methodologies for evaluating distributional lexical similarity reported in, e.g., Lin (1998a), Curran and Moens (2002), and Weeds (2003), one evaluation measure that we report here is precision at k, where k is a fixed, usually low, number of retrieved results. We report precision at k for k = 1, 5, 10. However, Manning et al. (2008: 148) point out that the highest point on the precision-recall curve can be of no less interest than single-point summaries such as F1, R-precision, or mean average precision. For the purposes of comparing feature sets and distance measures across verb schemes, we report microaveraged maximum precision (MaxP), defined as the point on the precision-recall curve at which precision is highest. We compute maximum precision for each individual verb and report the average of these values. In our study, the trends reported for MaxP always also hold for k = 1, 5, 10.

When precision is high and k is relatively large, many same-class items are clustered within the most highly ranked neighbors of a target verb (e.g., appeal in Figure 6.1). Low precision values associated with large k indicate that very few of the distributionally most similar items belong to the same class as the target (enshrine in Figure 6.1). High precision and small k suggest that only a few of the actual same-class items are contained within the set of highly ranked empirical neighbors (reply in Figure 6.1), or that the size of the class is small. The relative size of the class is visible in those portions of the precision curve that jag upwards, indicating that a cluster of same-class items has been retrieved at some lower value of k. However, precision alone does not account for the overall distribution of matches within the ranked set of results.

Figure 6.1: Precision at levels of k (0–100) for three verbs (reply, appeal, and enshrine)

A measure that does a better job of accounting for the relative positions of matches within the total set of results is the inverse rank score.

6.2.2 Inverse Rank Score

Following Curran and Moens (2002) and Gorman and Curran (2006), we also evaluate distributional similarity in terms of the inverse rank score (InvR), which is the sum of the inverse of the rank of each same-class item among the top-ranked m items:

\[
\mathrm{InvR} = \sum_{k=1}^{m} \frac{x_k}{k}
\]

where $x_k$ is an indicator variable defined as

\[
x_k =
\begin{cases}
1 & \text{if the class of the } k\text{th item matches the target class} \\
0 & \text{otherwise.}
\end{cases}
\]

For example, if items at ranks 2, 3, and 5 match the target class, the inverse rank score is $\frac{1}{2} + \frac{1}{3} + \frac{1}{5} = 1.03$. In the experiments reported here, only the 100 most highly ranked items were retained, so the maximum InvR score is 5.19. InvR is a useful measure because it distinguishes between result lists that contain the same number of

same-class items but rank them differently. InvR assigns higher scores to result lists in which same-class items are highly ranked. For example, a result list containing 5 matches at ranks 1, 2, 3, 5, 8 receives an InvR score of 2.16; another list containing the same 5 items at rank 3, 4, 5, 6, 7 receives an InvR score of 1.09.
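
The score is straightforward to implement. The sketch below reproduces the worked example above (matches at ranks 2, 3, and 5) and the 5.19 ceiling obtained when all 100 retained items match; the item names are placeholders.

```python
def inverse_rank_score(ranked_items, target_class_members, m=100):
    """Sum of 1/rank over the same-class items among the top m ranked neighbors."""
    score = 0.0
    for rank, item in enumerate(ranked_items[:m], start=1):
        if item in target_class_members:
            score += 1.0 / rank
    return score

# Matches at ranks 2, 3, and 5 give 1/2 + 1/3 + 1/5 = 1.03.
ranked = ["a", "hit1", "hit2", "b", "hit3", "c"]
print(round(inverse_rank_score(ranked, {"hit1", "hit2", "hit3"}), 2))   # 1.03

# Ceiling when every one of the 100 retained items matches: about 5.19.
print(round(inverse_rank_score(list(range(100)), set(range(100))), 2))  # 5.19
```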

As with MaxP, discretion is required in interpreting InvR in the current study. For example, if one word has five synonyms and another has ten synonyms, and both sets are returned as the highest ranked items, the inverse rank score of the second word will be higher than the score for the first word without indicating a difference in the quality of the ranked synonyms. Similarly, a word with many low-ranked synonyms can receive a higher InvR score than a word with only a few highly ranked synonyms. For example, a word with only five synonyms ranked in positions 2–6 would receive an InvR score of 1.45; a word with eleven synonyms in positions 1, 15–24 would receive an InvR score of 1.48.

Finally, because InvR is sensitive to the number of matched items, we cannot use it to compare across verb schemes that assign different numbers of neighbors to each verb. In this case, we only report measures of precision.

6.3 Feature Sets

This section describes the feature sets used here for assessing distributional verb similarity1. We evaluated four different feature sets for their effectiveness in extracting classes of distributionally similar verbs: Syntactic Frames, Labeled Dependency Relations, Unlabeled Dependency Relations, and Lexicalized Syntactic Frames. Syntactic frames contain mainly syntactic information, whereas the other three feature sets encode varying combinations of lexical and syntactic information. Each of these feature types has been used extensively in previous research on automatic Levin verb classification.

6.3.1 Description of Feature Sets

Syntactic Frames. Syntactic frames have been used extensively as features in early work on automatic verb classification due to their relevance to the alternation behaviors which are crucial for Levin's verb classification (e.g., Schulte im Walde, 2000; Brew and Schulte im Walde, 2002; Schulte im Walde and Brew, 2002; Korhonen et al., 2003). Syntactic frames provide a general feature set that can in principle be applied to distinguishing any number of verb classes. However, using syntactic information alone does not allow for the representation of semantic distinctions that are also relevant in verb classification. Work in this area has been primarily concerned with verbs taking noun phrase and prepositional phrase complements. To this end, prepositions

1Portions of Section 6.4.4 were co-authored with Jianguo Li.

have played an important role in defining relevant syntactic frames. However, knowing the identity of the preposition alone is not always enough to represent the desired distinctions. For example, the semantic interpretation of the syntactic frame NP-V-PP(with) depends to a large extent on the NP argument selected by the preposition with. In (1), the same surface form NP-V-PP(with) corresponds to three different underlying meanings. However, such semantic distinctions are totally lost if lexical information is disregarded.

(1) a. I ate with a fork. [INSTRUMENT]
    b. I left with a friend. [ACCOMPANIMENT]
    c. I sang with confidence. [MANNER]

Lexicalized Frames. This deficiency of unlexicalized subcategorization frames has led researchers to incorporate lexical information into the feature representation. One possible improvement over subcategorization frames is to enrich them with lexical information. Lexicalized frames are usually obtained by augmenting each syntactic slot with its head noun (2).

(2) a. I ate with a fork. [INSTRUMENT] → NP(I)-V-PP(with:fork)
    b. I left with a friend. [ACCOMPANIMENT] → NP(I)-V-PP(with:friend)
    c. I sang with confidence. [MANNER] → NP(I)-V-PP(with:confidence)

The analysis of feature sets previously used in automatic verb classification suggests that both syntactic and lexical information are relevant in determining the meaning of Levin verbs (e.g., Li and Brew, 2008). This agrees with findings from previous studies on WSD (Lee and Ng, 2002): although syntactic information on its own is not very informative for automatic word sense disambiguation, its combination with lexical information results in improved disambiguation. The next two feature types focus on various ways to mix syntactic and lexical information.

Dependency Relations. Recall that subcategorization frames are limited as verb features in the properties of verb behaviors they tap into. Lexicalized frames, with potentially improved discriminatory power, suffer from increased exposure to data sparsity. One way to overcome data sparsity is to break lexicalized frames into dependency relations. Dependency relations contain both syntactic and lexical information (3).

(3) a. SUBJ(I), PP(with:fork)
    b. SUBJ(I), PP(with:friend)
    c. SUBJ(I), PP(with:confidence)

However, since we augment prepositional phrases with the head nouns selected by prepositions, as in PP(with:fork), the data sparsity problem still exists. We therefore break all prepositional phrases in the form PP(preposition:noun) into two separate dependency relations: PP(preposition) and PP-noun, as shown in (4).

(4) a. SUBJ(I), PP(with), PP-fork
    b. SUBJ(I), PP(with), PP-friend
    c. SUBJ(I), PP(with), PP-confidence

Although dependency relations have proved effective in a range of lexical acquisition tasks such as word sense disambiguation (McCarthy, Koeling, Weeds, and Carroll, 2004), construction of a lexical semantic space (Padó and Lapata, 2007), and detection of polysemy (Lin, 1998a), their utility in automatic verb classification has not been as thoroughly examined.

Unlabeled Dependency Relations. In order to further examine the separate contributions of lexical and syntactic information, we removed the syntactic tag from the labeled dependency relations, leaving a feature set that consists only of lexical items selected on the basis of their structural relation to the verb. However, the distinction between, e.g., Subject, Object, and Prepositional Object is no longer explicitly represented in the unlabeled feature set. The representation of the examples above using this feature set is shown in (5).

(5) a. I ate with a fork → {I, with, fork}
    b. I left with a friend → {I, with, friend}
    c. I sang with confidence → {I, with, confidence}

6.3.2 Feature Extraction Process

The experiments reported here used Clark and Curran's (2007) CCG parser, a log-linear parsing model for an automatically extracted lexicalized grammar, to automatically extract the features described above from the English gigaword corpus (Graff, 2003). The lexicalized grammar formalism used by the parser is combinatory categorial grammar (CCG) (Steedman, 1987; Szabolcsi, 1992), and the grammar is automatically extracted from CCGbank (Hockenmaier and Steedman, 2005). The parser produces several output formats; we use grammatical relations (Briscoe, Carroll, and Watson, 2006) and employ a post-processing script to extract four types of grammatical relations that are relevant to verbs: Subject-Type, Object-Type, Complement-Type, and Modifier-Type.

The primary feature type that we extract from the parser's output is the lexicalized syntactic frame. Syntactic frames are defined in terms of the syntactic constituents used in Penn Treebank (Marcus et al., 1993) style parse trees. For example, a double object frame exemplified by a sentence like Sam handed Tom the flute can be represented as NP1-V-NP2-NP3. A lexicalized syntactic frame augments the structural information represented by a syntactic frame with the lexical head of each constituent, e.g., NP1(Sam)-V(hand)-NP2(Tom)-NP3(flute).

Extracting Subject-Type Relations. Table 6.4 illustrates the three types of Subject-Type relations extracted from the parser's output. The first column indicates the relation, the second column contains an example of the relation, the third column contains representative output from the parser, and the fourth column contains the lexicalized frame that is extracted as a result of processing the parser's output. Each relation is represented by the parser as a quadruple, with the first element in the quadruple always containing the name of the relation. The order of the other elements depends on the type of relation. For Subject-Type, the verb is always the second element of the quadruple. Each lexical entry in the parser's output is indexed according to its position in the input sentence. This index also points to each item's position in a lemmatized, part-of-speech tagged representation of the sentence that is also part of the parser's output. In order to extract features from the ncsubj relation, we combine the lemmatized form of the verb with the lemmatized form of the third element in the quadruple. Similarly, in order to extract features from the xsubj and csubj relations, we combine the lemmatized

form of the verb with the lemmatized form of the fourth element in the quadruple. If the fourth element is empty, we do not lexicalize the relation, which is the same thing as lexicalizing it with a null element.

Grammatical Relation          Example                Parser Output               Extracted Feature
non-clausal subject           Kim left               (ncsubj left Kim )          SUBJ(Kim)-V(leave)
unsaturated clausal subject   leaving matters        (xsubj matters leaving )    SUBJ(NONE)-V(matter)
saturated clausal subject     that he came matters   (csubj matters came that)   SUBJ(that)-V(matter)

Table 6.4: Examples of Subject-Type relation features

Extracting Object-Type Relations. Table 6.5 illustrates the three types of Object-Type relations extracted from the parser's output. Object-Type relations are represented as triples; the verb is always the second element, and the object is always the third element. In order to extract features from an Object-Type relation, we combine the lemmatized form of the verb with the lemmatized form of the third element in the triple.

Grammatical Relation   Example         Parser Output      Extracted Feature
direct object          likes her       (dobj likes her)   V(like)-DOBJ(her)
second object          gave Kim toys   (obj2 gave toys)   V(give)-IOBJ(toy)
indirect object        flew to Paris   (iobj flew to)     V(fly)-PP(to)

Table 6.5: Examples of Object-Type relation features

Extracting Complement-Type Relations. Table 6.6 illustrates the three types of Complement-Type relations extracted from the parser's output. Prepositional phrase complement relations (pcomp) are represented as triples; the verb is the second element, and the preposition is the third element. xcomp relations are represented as quadruples; the verb is the third element and the lexical head of the complement is the fourth element. In order to extract features from the pcomp relation, we combine the lemmatized form of the verb with a "PP" label and the third element in the pcomp triple. In order to extract features from the xcomp relation, we combine the lemmatized form of the verb with the lemmatized form of the final element in the relation, with a "GER" label indicating that this element represents the lexical head of a gerundive.

Grammatical Relation        Example              Parser Output            Extracted Feature
PP complement               pass by the shop     (pcomp pass by)          V(pass)-PP(by)
unsaturated VP complement   enjoy running        (xcomp enjoy running)    V(enjoy)-GER(run)
                            hate to go           (xcomp to hate go)       V(hate)-GER(go)
clausal complement          knew that you left   (xcomp that knew left)   V(know)-GER(leave)

Table 6.6: Examples of Complement-Type relation features

Extracting Adjunct-Type Relations. Table 6.7 illustrates the three types of Adjunct-Type relations extracted from the parser's output. Adjunct-Type relations are represented as quadruples; the verb is always the third element and the modifying item is always the fourth element. In order to extract Adjunct-Type relations, we combine the lemmatized form of the verb with the lemmatized form of the modifying item. The label of the relation is obtained by indexing into the lemmatized, part-of-speech tagged representation of the sentence that is also part of the parser's output, and comes from the part-of-speech tag of the modifying item.

Grammatical Relation           Example                  Parser Output            Extracted Feature
non-clausal modifier           sit on a table           (ncmod sit on)           V(sit)-PP(on)
                               left early               (ncmod left early)       V(leave)-ADVP(early)
unsaturated clausal modifier   entered smiling          (xmod entered smiling)   V(enter)-GER(smile)
                               left to catch her        (xmod left to)           V(leave)-INFV(to)
                               returned alive           (xmod returned alive)    V(return)-ADJP(alive)
clausal modifier               when he came, Kim left   (cmod left when)         V(leave)-S(when)

Table 6.7: Examples of Adjunct-Type relation features

The process of extracting lexicalized syntactic frames from CCG output is illustrated in Table 6.8 for the input sentence Two men broke the door with a hammer.

• Identify verbs in the grammatical relations output by the parser by indexing into the lemmatized, part-of-speech tagged representation of the sentence. For example, broke is identified as a verb by its index of 2, which points to the element broke|break|VBD.

• Identify dobj dependents of prepositions among the dependency relations. For example, with is identified as a preposition by its Penn Treebank-style part-of-speech tag IN, and hammer is identified as its object: PP(with) hammer.

• Identify dependents of the verb by extracting items from the grammatical relations whose index points to the verb. For example, door and with are identified as direct and indirect objects of broke, respectively; men is identified as its subject. The lemmatized form of each item is combined along with its grammatical relation to form a lexicalized syntactic frame: SUBJ(man)-V(break)-DOBJ(door)-PP(with) hammer.

• Other feature types are adapted from this primary representation. For example, syntactic frames are obtained by removing lexical material from the frame2: SUBJ-V-DOBJ-PP(with). Labeled dependencies are obtained by splitting the frame and representing it as a set composed of its individual lexicalized dependents: {SUBJ(man), V(break), DOBJ(door), PP(with) hammer}. Unlabeled dependencies are obtained by discarding the structural information associated with each element in the frame and retaining only the lexical heads: {man, break, door, with, hammer}. (A sketch of these conversions is given after Table 6.8.)

Input Sentence:
  Two men broke the door with a hammer
Output Relations:
  (det door 4 the 3)
  (dobj broke 2 door 4)
  (det hammer 7 a 6)
  (dobj with 5 hammer 7)
  (iobj broke 2 with 5)
  (ncsubj broke 2 men 1 )
  (det men 1 two 0)
Output part-of-speech tags:
  Two|two|CD men|man|NNS broke|break|VBD the|the|DT door|door|NN with|with|IN a|a|DT hammer|hammer|NN

Table 6.8: Example of grammatical relations generated by Clark and Curran (2007)’s CCG parser
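
The conversions described in the last step above can be sketched as follows. The frame is represented here as a list of (slot label, lexical items) pairs, which is an assumed data structure chosen for the illustration rather than the format used by the actual post-processing script.

```python
def syntactic_frame(slots):
    """Keep only the slot labels; prepositions stay inside the PP label."""
    return "-".join(label for label, _ in slots)

def labeled_dependencies(slots):
    """Split the frame into one lexicalized dependent per slot."""
    feats = set()
    for label, words in slots:
        if label.startswith("PP"):
            # e.g. 'PP(with) hammer': preposition kept in the label, noun appended.
            feats.add(f"{label} {words[-1]}")
        else:
            feats.add(f"{label}({words[0]})")
    return feats

def unlabeled_dependencies(slots):
    """Discard the structural labels and keep only the lexical heads."""
    return {w for _, words in slots for w in words}

# SUBJ(man)-V(break)-DOBJ(door)-PP(with) hammer, as extracted from Table 6.8.
slots = [("SUBJ", ["man"]), ("V", ["break"]), ("DOBJ", ["door"]),
         ("PP(with)", ["with", "hammer"])]

print(syntactic_frame(slots))        # SUBJ-V-DOBJ-PP(with)
print(labeled_dependencies(slots))   # {'SUBJ(man)', 'V(break)', 'DOBJ(door)', 'PP(with) hammer'}
print(unlabeled_dependencies(slots)) # {'man', 'break', 'door', 'with', 'hammer'}
```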

One consideration in the construction of different feature sets is their scalability, in terms of the potential number of features that will be generated.

2The lexical heads of prepositional phrases were retained.

The main motivation for using a large corpus like the English gigaword corpus is that relatively infrequent items may still be attested often enough to allow generalizations that would not be possible using a smaller resource. A potential downside of using such a large corpus is the bulk of data that will be generated and must be processed.

Most similarity metrics run in time linear in the number of non-zero elements of the two vectors being compared. Therefore, the more features, the longer the run time for finding nearest neighbors. Figure 6.2 shows the increase in the number of features as a function of the number of verb instances encountered in the English gigaword corpus.

Figure 6.2: Feature growth rate on a log scale (number of features, 10^3–10^7, as a function of the number of verb instances, 10^4–10^8) for the Syntax, Lex, LexSyn, and Frames feature sets

Due to their highly specific nature, lexicalized frames constitute the largest feature set. This is because the chance of the exact combination of a verb and its lexical arguments, including prepositional phrases, occurring more than once in any corpus is very small. Therefore, most of the lexical frame features are relatively infrequent. The number of labeled and unlabeled dependency relation features is fairly close. This means that including the structural information in the feature set does not greatly impact storage or performance, and if the grammatical labels improve verb classification, that could be considered a reason for using them with relatively little downside. Eliminating lexical information in the syntactic frames results in the smallest feature set, as syntactic frames do tend to occur fairly frequently across verbs.

6.4 Experiments

This section describes a series of experiments that examine the interaction of feature set, similarity measure, feature weighting, and feature selection on the assignment of distributionally similar verbs. The verbs used in this study come from the union of Levin, VerbNet, and FrameNet verbs that occur at least 10 times in the English gigaword corpus (3937 verbs total). The basic setup for each experiment is the same, and consists of computing a pairwise similarity matrix between all of the verbs in the study for each of the feature sets, feature weights, selected features, and distance measures under study. Each evaluation of verb distances with respect to a given verb classification scheme was restricted to the verbs included in that scheme; i.e., in evaluating Levin verbs, verbs which did not occur in Levin's classification were excluded both as target items and as empirical neighbors. The following sections contain the results of the experiments and are organized as follows. Section 6.4.1 contains an evaluation of distributional verb similarity with

respect to the choice of distance measure. Section 6.4.2 evaluates the effect of feature weighting on distributional verb similarity. Section 6.4.3 compares the different verb schemes described earlier with respect to how well their respective classifications match distributionally similar verbs, and Section 6.4.4 compares feature sets.

6.4.1 Similarity Measures

The purpose of this analysis is to examine the performance of different distance measures on identifying distributionally similar verbs. Three types of distance measure – set theoretic, geometric, and information theoretic – are compared across verb schemes and feature sets. For each type of distance measure only one feature weighting was employed: the set theoretic measures were applied to binary feature vectors, the geometric distance measures were applied to vector-length normalized count vectors, and the information theoretic measures were applied to count vectors normalized to probabilities.
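
For concreteness, the sketch below implements one representative measure from each family, applied with exactly the weighting just described (binary vectors, length-normalized counts, and probability vectors, respectively). The toy vectors are invented, and information radius is written in the total-divergence-to-the-mean form following Manning and Schütze (1999); note that the cosines are similarities (higher means closer), while information radius is a divergence (lower means closer).

```python
import math

def binary_cosine(a, b):
    """Set theoretic: cosine over binary (presence/absence) feature vectors."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / math.sqrt(len(sa) * len(sb))

def cosine(a, b):
    """Geometric: cosine of the angle between count vectors (scale-invariant,
    so equivalent to applying it to length-normalized vectors)."""
    dot = sum(v * b.get(f, 0.0) for f, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def information_radius(a, b):
    """Information theoretic: total KL divergence of each probability vector to their mean."""
    pa = {f: v / sum(a.values()) for f, v in a.items()}
    pb = {f: v / sum(b.values()) for f, v in b.items()}
    irad = 0.0
    for f in set(pa) | set(pb):
        p, q = pa.get(f, 0.0), pb.get(f, 0.0)
        m = (p + q) / 2.0
        if p:
            irad += p * math.log2(p / m)
        if q:
            irad += q * math.log2(q / m)
    return irad

v1 = {"DOBJ(door)": 12, "PP(with) hammer": 3}
v2 = {"DOBJ(door)": 20, "PP(with) key": 4, "SUBJ(I)": 1}
print(binary_cosine(v1, v2), cosine(v1, v2), information_radius(v1, v2))
```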

6.4.1.1 Set Theoretic Similarity Measures

Table 6.9 contains the precision results of the nearest neighbor classifications for three set theoretic measures of distributional similarity, using the 50,000 most frequently occurring features of each feature type. The full set of precision results using a range of feature frequencies is given in Appendix C, Figures C.1–C.5. Table F.1 (Appendix F) contains the corresponding inverse rank scores. Overall, for MaxP, cosine returned the best results across feature types and verb classifications (MaxP = 0.43); Jaccard's coefficient performed close to cosine (MaxP = 0.40), and overlap performed substantially lower (MaxP = 0.10). Similarly for InvR, cosine gave the overall best results (InvR = 0.81), followed by Jaccard's coefficient (InvR = 0.74) and overlap (InvR = 0.23).

Verb Classification   Feature Type      cosine   Jaccard   overlap   Mean
Levin                 Syntactic Frame   0.40     0.39      0.14      0.31
                      Lexical           0.50     0.47      0.08      0.35
                      Dep. Triple       0.57     0.56      0.10      0.41
                      Lex. Frame        0.49     0.48      0.11      0.36
                      Mean              0.49     0.48      0.11
VerbNet               Syntactic Frame   0.38     0.38      0.15      0.30
                      Lexical           0.48     0.45      0.08      0.34
                      Dep. Triple       0.55     0.53      0.10      0.39
                      Lex. Frame        0.48     0.46      0.11      0.35
                      Mean              0.47     0.46      0.11
FrameNet              Syntactic Frame   0.36     0.35      0.09      0.27
                      Lexical           0.48     0.44      0.07      0.33
                      Dep. Triple       0.54     0.51      0.10      0.38
                      Lex. Frame        0.49     0.46      0.11      0.35
                      Mean              0.47     0.44      0.09
Roget                 Syntactic Frame   0.28     0.27      0.06      0.20
                      Lexical           0.52     0.47      0.08      0.36
                      Dep. Triple       0.61     0.53      0.10      0.41
                      Lex. Frame        0.54     0.49      0.11      0.38
                      Mean              0.49     0.44      0.09
WordNet               Syntactic Frame   0.13     0.12      0.04      0.10
                      Lexical           0.23     0.20      0.06      0.17
                      Dep. Triple       0.28     0.25      0.11      0.22
                      Lex. Frame        0.24     0.22      0.10      0.19
                      Mean              0.22     0.20      0.08

Table 6.9: Average maximum precision for set theoretic measures and the 50k most frequent features of each feature type

In terms of verb scheme, focusing just on the cosine measure, Roget, Levin, VerbNet, and FrameNet perform nearly identically (MaxP ≈ 0.48), followed by WordNet at MaxP = 0.22.

For MaxP, the best performing feature type across verb schemes and distance measures is lexically specified dependency triples. Again focusing on cosine, across verb schemes dependency triples return around 0.51 maximum precision, followed by unlabeled dependents and lexicalized frames at MaxP ≈ 0.44, and finally syntactic frames (MaxP ≈ 0.31). These trends are mirrored in the InvR results.

6.4.1.2 Geometric Measures

Table 6.10 contains the MaxP results of the nearest neighbor classifications for three geometric measures of distributional similarity, using the 50,000 most frequently occurring features of each feature type. The context vectors were count vectors normalized by vector length. The full set of MaxP results using a range of feature frequencies is given in Appendix D, Figures D.1–D.5. Table F.2 (Appendix F) contains the results of the geometric measures as evaluated by InvR. Overall, the neighbors assigned by cosine similarity (mean MaxP = 0.35; mean InvR = 0.63) resemble the given verb classifications more than the neighbors assigned by L1 distance (mean MaxP = 0.25; InvR = 0.41) for both evaluation measures. In terms of feature type, for MaxP, frame-based features did not perform as well as lexically based features for either distance measure. For cosine, labeled and unlabeled lexical dependents performed at a very similar rate across verb schemes (mean MaxP = 0.40 for lexical-only versus mean MaxP = 0.39 for labeled dependency triples). These trends are mirrored in the InvR results.

For L1, the difference between lexical-only and labeled dependency triples was more pronounced: MaxP = 0.33 versus MaxP = 0.23, respectively. These trends are mirrored in the InvR results. This difference is likely due to the fact that the labeled dependency triples form a relatively sparser feature space than the unlabeled feature space, and to differences in how the two measures handle zeros when comparing two vectors.

Verb Classification   Feature Type      Cosine (=Euclidean)   L1     Mean
Levin                 Syntactic Frame   0.38                  0.40   0.39
                      Lexical           0.45                  0.38   0.42
                      Dep. Triple       0.44                  0.25   0.35
                      Lex. Frame        0.35                  0.17   0.26
                      Mean              0.41                  0.30
VerbNet               Syntactic Frame   0.36                  0.38   0.37
                      Lexical           0.44                  0.36   0.40
                      Dep. Triple       0.43                  0.25   0.34
                      Lex. Frame        0.34                  0.18   0.26
                      Mean              0.39                  0.29
FrameNet              Syntactic Frame   0.33                  0.33   0.33
                      Lexical           0.44                  0.36   0.40
                      Dep. Triple       0.45                  0.24   0.35
                      Lex. Frame        0.34                  0.17   0.26
                      Mean              0.39                  0.28
Roget                 Syntactic Frame   0.25                  0.28   0.27
                      Lexical           0.44                  0.37   0.41
                      Dep. Triple       0.45                  0.26   0.36
                      Lex. Frame        0.34                  0.10   0.22
                      Mean              0.37                  0.25
WordNet               Syntactic Frame   0.11                  0.13   0.12
                      Lexical           0.21                  0.17   0.19
                      Dep. Triple       0.20                  0.13   0.16
                      Lex. Frame        0.15                  0.05   0.10
                      Mean              0.17                  0.12

Table 6.10: Average maximum precision for geometric measures using the 50k most frequent features of each feature type

Because the calculation of the cosine of the angle between two vectors involves multiplying corresponding features, an element with a value of zero in one vector essentially cancels out a corresponding non-zero element in the other vector.

For L1, a zero element is subtracted from the corresponding non-zero element, and larger differences accumulate along the many zero dimensions. In a sparse space with many zeros, differences along non-shared dimensions overwhelm similarity along shared dimensions. Since the labeled and unlabeled dependency triples represent much of the same contextual information, but the ambient space is slightly denser for the unlabeled triples, L1 distance performs better in that space, while the cosine is relatively unaffected. Using geometric measures of similarity, Levin verbs are picked up slightly more often than the other classes, with MaxP of 0.41 versus MaxP = 0.39 for

VerbNet and FrameNet. Roget verbs are identified slightly less often at MaxP = 0.37, while WordNet synonyms are relatively unlikely to appear in the top-ranked set of distributionally similar verbs (MaxP = 0.17).

6.4.1.3 Information Theoretic Measures

Table 6.11 contains the results of the nearest neighbor classifications for two information theoretic measures of distributional similarity, using the 50,000 most frequently occurring features of each feature type. The context vectors were vectors of probabilities derived from counts. When L1 distance is applied to vectors of probabilities, the result can be interpreted as the expected proportion of events that differ between the two probability distributions (Manning and Schütze, 1999: 305), and it is included here for comparison. The full set of results using a range of feature frequencies is given in Appendix E, Figures E.1–E.5. Table F.3 (Appendix F) contains the corresponding inverse rank scores.

Overall, information radius and L1 distance performed similarly across feature types and verb schemes for both MaxP and InvR (MaxP: information radius = 0.45, L1 = 0.44; InvR: information radius = 0.81, L1 = 0.87).

Verb Classification   Feature Type      Information Radius   L1     Mean
Levin                 Syntactic Frame   0.47                 0.45   0.46
                      Lexical           0.55                 0.55   0.55
                      Dep. Triple       0.56                 0.57   0.56
                      Lex. Frame        0.46                 0.47   0.46
                      Mean              0.51                 0.51
VerbNet               Syntactic Frame   0.45                 0.44   0.45
                      Lexical           0.54                 0.54   0.54
                      Dep. Triple       0.55                 0.55   0.55
                      Lex. Frame        0.45                 0.45   0.45
                      Mean              0.50                 0.50
FrameNet              Syntactic Frame   0.43                 0.41   0.42
                      Lexical           0.55                 0.55   0.55
                      Dep. Triple       0.57                 0.57   0.57
                      Lex. Frame        0.46                 0.46   0.46
                      Mean              0.50                 0.50
Roget                 Syntactic Frame   0.39                 0.35   0.37
                      Lexical           0.62                 0.61   0.62
                      Dep. Triple       0.64                 0.63   0.64
                      Lex. Frame        0.37                 0.33   0.35
                      Mean              0.51                 0.48
WordNet               Syntactic Frame   0.17                 0.16   0.16
                      Lexical           0.29                 0.29   0.29
                      Dep. Triple       0.29                 0.29   0.29
                      Lex. Frame        0.17                 0.15   0.16
                      Mean              0.23                 0.22

Table 6.11: Average maximum precision for information theoretic measures using the 50k most frequent features of each feature type

With the information theoretic measures, for MaxP, a difference in classification accuracy is observed for Roget-style synonyms, which are identified substantially more often than neighbors classed by any of the other verb schemes when labeled dependencies are used (MaxP = 0.64 for information radius, versus MaxP = 0.57 for the next best scheme, FrameNet). Labeled and unlabeled lexical dependency relations perform better than the two syntax-based feature types.

6.4.1.4 Comparison of Similarity Measures

Across the three types of similarity measure, the relative performance of verb scheme and feature type was the same. Therefore, in order to get a sense of the differences in classification performance of the various similarity measures, this section focuses on the classification of Roget synonyms using labeled dependency triples, as this combination consistently returned the highest precision and inverse rank. Table 6.12 shows the precision values for k = 1, 5, 10, MaxP, and the average number of neighbors (kmax) that resulted in the maximum precision. The relative performance of each distance measure is the same for each value of k presented in the table. Table F.4 (Appendix F) shows the corresponding inverse rank scores.

                        P1     P5     P10    MaxP   kmax
Set Theoretic
  binary cosine         0.47   0.30   0.22   0.61   8.1
  Jaccard               0.43   0.27   0.20   0.57   8.1
  overlap               0.03   0.03   0.04   0.11   26.8
Geometric
  cosine (=Euclidean)   0.32   0.20   0.15   0.45   12.9
  L1                    0.19   0.10   0.07   0.26   13.1
Information Theoretic
  Information Radius    0.51   0.33   0.24   0.64   3.6
  L1                    0.50   0.32   0.24   0.63   7.9

Table 6.12: Measures of precision and average number of neighbors yielding maximum precision across similarity measures

Several trends are evident from the data in Tables 6.12 and F.4. First, overlap performs substantially worse than any of the other distance measures. Second, binary cosine, information radius, and L1 have very similar MaxP values. From this point of view, binary cosine can be considered an information theoretic measure in the sense that it computes the correlation between two distributions of numbers, and the fact that it is applied to binary vectors can be considered a particular feature weighting scheme. That is, the calculation of cosine does not change when it is applied to binary vectors; only the feature weighting changes.

Although binary cosine, information radius, and L1 distance all achieve the same average maximum precision, they do so at different values of k. Information radius tops out with k around 3.6, while binary cosine and L1 are between 7.6 and 8.1. Pairwise t-tests, adjusted for multiple comparisons, show that on average, information radius tops out significantly earlier than binary cosine and L1 distance, which are not significantly different from each other. This means that although the precision is the same, L1 distance returns just under twice as many actual neighbors as information radius does for the same precision. For k = 1, 5, 10, the three measures are nearly equal.

The geometric measures, i.e., cosine and L1 applied to normalized count vectors, return lower precision values than the information theoretic measures do. However, whereas the feature weighting for the set theoretic and information theoretic measures is fixed at {0,1} and Prob(f), respectively, many other feature weightings are available to which the more general geometric measures can be applied. The next section considers the effect of feature weighting on the performance of geometric similarity measures.

6.4.2 Feature Weighting

This section considers six feature weighting schemes and their interaction with lexical similarity measures. Three of the weightings (binary, normalized, and probabilities) were considered in the context of comparing distance measures; the other three (log-likelihood, correlation, and inverse feature frequency) are introduced into the study here. The three distance measures considered are cosine, Euclidean distance, and L1 distance. Within verb schemes and across feature sets, the relative performance of the different feature weighting schemes remained constant. Overall, labeled dependency triples performed the best, followed by unlabeled triples, lexicalized frames, and syntactic frames.

Tables 6.13 and F.6 show precision results for verb classifications using labeled dependency triples. Across verb schemes, the trends between feature weight and distance measure hold fairly consistently. Overall, the best performing combination of feature weight and distance measure was achieved by applying the cosine to vectors weighted by inverse feature frequency: 58% of the 1-nearest neighbors computed with this combination are classified as synonyms by Roget's thesaurus, with a maximum precision of 71%. This combination performed the best for the other verb schemes as well, ranging from MaxP = 63% for Levin to MaxP = 34% for WordNet.
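
A sketch of this combination is given below. The inverse feature frequency weight is written here in an idf-style form (the count scaled by the log of the total number of verbs over the number of verbs the feature occurs with); this is an assumption made for the illustration, and the exact formulation used in the experiments is the one defined in Chapter 5. The counts themselves are toy data.

```python
import math

def iff_weight(vectors):
    """Reweight counts by an (assumed, idf-style) inverse feature frequency factor.

    Features that occur with few verbs receive large weights; features that occur
    with nearly every verb receive weights near zero.
    """
    n_verbs = len(vectors)
    verbs_per_feature = {}
    for vec in vectors.values():
        for f in vec:
            verbs_per_feature[f] = verbs_per_feature.get(f, 0) + 1
    return {verb: {f: c * math.log(n_verbs / verbs_per_feature[f])
                   for f, c in vec.items()}
            for verb, vec in vectors.items()}

def cosine(a, b):
    dot = sum(v * b.get(f, 0.0) for f, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy labeled dependency counts for three verbs.
vectors = {
    "break":   {"DOBJ(door)": 12, "SUBJ(man)": 8, "PP(with) hammer": 3},
    "shatter": {"DOBJ(door)": 5,  "SUBJ(man)": 2, "PP(into) piece": 4},
    "say":     {"SUBJ(man)": 30,  "S(that)": 25},
}
weighted = iff_weight(vectors)
print(cosine(weighted["break"], weighted["shatter"]))  # higher: shared, informative features
print(cosine(weighted["break"], weighted["say"]))      # 0.0: only a ubiquitous feature is shared
```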

In terms of the interactions between feature weight and distance measure, the following tendencies are observed. For Euclidean distance, the following ranking of feature weights in terms of precision approximately holds:

normalized > probability > iff > binary, log-likelihood, correlation

Feature Weighting        Cosine                   Euclidean                L1

                         P1    MaxP  kmax         P1    MaxP  kmax         P1    MaxP  kmax
Levin
  binary                 0.43  0.57  8.8          0.19  0.29  7.7          0.19  0.29  7.7
  probability            0.31  0.44  12           0.28  0.40  14           0.43  0.57  9.5
  normalized             0.31  0.44  12           0.31  0.44  12           0.17  0.25  9.9
  log-likelihood         0.44  0.58  8.9          0.19  0.29  7.8          0.19  0.29  7.8
  correlation            0.38  0.55  10           0.20  0.27  6.4          0.10  0.15  5.6
  inv feat freq          0.49  0.63  7.7          0.27  0.38  6.1          0.19  0.28  6.4
VerbNet
  binary                 0.43  0.55  10           0.19  0.29  9.5          0.19  0.29  9.5
  probability            0.30  0.43  14           0.27  0.39  15           0.41  0.55  11
  normalized             0.30  0.43  14           0.30  0.43  14           0.17  0.25  14
  log-likelihood         0.41  0.56  10           0.19  0.30  9.1          0.19  0.29  9.1
  correlation            0.37  0.54  11           0.22  0.29  7.1          0.12  0.16  6.3
  inv feat freq          0.47  0.62  8.8          0.27  0.38  7.1          0.19  0.29  8.2
FrameNet
  binary                 0.41  0.54  7.7          0.18  0.28  4.3          0.18  0.26  4.3
  probability            0.33  0.45  10           0.29  0.40  11           0.45  0.58  8.7
  normalized             0.33  0.45  10           0.33  0.45  10           0.18  0.24  9.2
  log-likelihood         0.42  0.55  7.6          0.17  0.26  4.3          0.17  0.26  4.3
  correlation            0.42  0.57  7.6          0.16  0.22  2.3          0.02  0.06  2.6
  inv feat freq          0.49  0.61  6.8          0.27  0.36  3.5          0.18  0.26  3.7
Roget
  binary                 0.47  0.61  8.1          0.19  0.25  4.9          0.19  0.25  4.9
  probability            0.32  0.45  12.9         0.29  0.42  14.2         0.50  0.63  7.9
  normalized             0.32  0.45  12.9         0.32  0.45  12.9         0.19  0.26  13.1
  log-likelihood         0.48  0.62  7.7          0.19  0.26  3.7          0.19  0.26  3.7
  correlation            0.43  0.60  8.6          0.19  0.24  5.9          0.03  0.04  2.8
  inv feat freq          0.58  0.71  6.1          0.28  0.35  3.5          0.19  0.24  3.0
WordNet
  binary                 0.19  0.28  10           0.08  0.12  5.4          0.08  0.12  5.4
  probability            0.13  0.20  12           0.12  0.19  13           0.21  0.29  9.6
  normalized             0.13  0.20  12           0.13  0.20  12           0.09  0.13  9.9
  log-likelihood         0.19  0.29  10           0.07  0.12  4.8          0.08  0.12  4.8
  correlation            0.18  0.28  11           0.08  0.11  4.8          0.01  0.03  3.2
  inv feat freq          0.24  0.34  9.3          0.11  0.16  4.7          0.08  0.11  4.7

Table 6.13: Nearest neighbor average maximum precision for feature weighting, using the 50k most frequent features of type labeled dependency triple

For Euclidean distance, the tendency for normalized vectors to produce better neighbors can be explained by the fact that Euclidean distance is quadratic in the unshared terms; normalized vectors exhibit the smallest absolute feature values of the six weights considered, so these differences will be minimized.

Interestingly, inverse feature frequency performs better than the other three weights although the average feature weight is much larger than for binary, log-likelihood, or correlation. This suggests that inverse feature frequency does a better job of capturing the relevance of the association between a verb and a feature than log-likelihood or correlation.

For L1 distance, the following ranking of feature weights holds:

probability ≫ binary, log-likelihood, iff > normalized ≫ correlation

Probability vectors outperform any of the other weighting methods by a substantial margin (nearly 2:1), lending credence to the interpretation of L1 over probability vectors as a measure of the difference between two probability distributions. The differences between binary, log-likelihood, and inverse feature frequency were slight and varied unpredictably across verb schemes. Once again, weighting by correlation performed poorly. For cosine, the following ranking of feature weights was found:

iff > log-likelihood, binary, correlation ≫ normalized, probability

In this combination, inverse feature frequency returned appreciably better results across the verb schemes. As with Euclidean distance, binary feature weights perform just as well as log-likelihood and correlation, which in turn outperform normal vector scalings. These results indicate that two ingredients are needed for successful nearest neighbor identification of verbs. One is an extrinsic weighting method which models the relative strength of the association between a verb and a feature as a function of both co-occurrence frequency and the feature's proclivity to occur with other verbs. The second is an appropriate scaling of the magnitude of the feature weights. In the case of cosine, the distance measure itself provides the scaling; in the case of the other distance measures, a scaling such as normalization improves the selection of lexical neighbors.

Given that log-likelihood and inverse feature frequency both provide functions for measuring this associational strength, it appears that inverse feature frequency is a better choice for this task than log-likelihood, which attempts to model co-occurrence strength in terms of a prior asymptotic χ² distribution. This distribution may not be as appropriate for describing the distribution of word co-occurrences as a function which models co-occurrence distributions directly. A possible explanation for the poor performance of correlation is that in this feature space, all correlations are very small; the average correlation between a verb and a feature is likely too small to be very informative. Unlike Rohde et al. (submitted), who combat this problem by taking the square root of the correlations as a post-processing step, we did not make any further alterations to the correlation score.

6.4.3 Verb Scheme

One picture that consistently emerges across feature sets, distance measures, and weighting schemes is that empirically determined nearest neighbors match Roget's synonym assignments substantially more closely than they match any of the other schemes. Furthermore, WordNet synonym assignments show the lowest correspondence to empirical nearest neighbors of any of the schemes.

However, in addition to synonymy, WordNet defines hyponymy relations between verbs, which often correspond to Roget synonyms. For example, Roget synonyms of argue such as quibble, quarrel, dispute, and altercate are classified as hyponyms by WordNet, and as such were not counted as matches. In these cases, WordNet provides a further refinement of verb relations that may match distributionally similar verb assignments more closely than its stricter definition of synonymy does. Exploring these more finely grained lexical distinctions is left for future research.

Levin, VerbNet, and FrameNet place verbs into classes based on both similarity in meaning and similarity along more schematic representations of syntactic or semantic behavior, such as alternations (Levin, VerbNet) or participation in semantic frames (FrameNet). In an effort to tease apart the independent criteria of semantic and syntactic similarity, we can look at the proportion of items in a class that are independently classified as synonyms by a thesaurus, and compare this to the proportion of empirical neighbors that are synonyms by the same standard. The left column in Table 6.14 shows the average percentage of synonyms found within a verb class for Levin, VerbNet, and FrameNet.

Verb Scheme   %Synonyms in Class   %Synonyms in k-nn
                                   k=1    k=5    k=10
Levin         23                   71     57     48
VerbNet       23                   73     58     50
FrameNet      38                   76     65     56

Table 6.14: Average number of Roget synonyms per verb class

This number was calculated from the number of same-class items that are listed as synonyms in Roget's online thesaurus; i.e., on average 23% of a Levin verb's class neighbors are recognized as synonyms by the thesaurus. The right columns show the average percentage of empirically determined k-nearest neighbors that are also synonyms. For example, of the empirically determined 1-nearest neighbors that are put into the same class by Levin, 71% turn out to be synonyms by Roget's thesaurus; this figure is slightly higher for VerbNet and FrameNet. This means that when highly similar Levin-style neighbors are identified empirically, they are over three times more likely to be synonyms than would be expected based on the prior class probability of being synonyms; the same holds for VerbNet. The fact that a greater percentage of FrameNet verbs are synonyms is to be expected given that FrameNet emphasizes semantic relatedness and does not explicitly include participation in syntactic alternations as a criterion for partitioning verbs into classes (Baker, Fillmore, and Lowe, 1998; Baker and Ruppenhofer, 2002). Also apparent in Table 6.14 is the fact that as k increases, the percentage of synonyms decreases for all three verb schemes. Again this points to the interpretation that highly distributionally similar items are likely to be synonyms, and that the additional grouping criteria used by Levin, VerbNet, and FrameNet are not represented as well using the feature sets examined here.

The upshot of this analysis is that regardless of which of the four feature sets is applied here, distributionally similar verb assignments correspond to thesaurus-style synonyms more than they correspond to the groupings in Levin, VerbNet, and FrameNet.

As noted above, distributionally similar verbs do not correspond well to WordNet’s more restrictive definitions of synonymy. This is most likely due to the fact that WordNet is conservative in assigning synonymy to verbs, and makes subtler lexical distinctions than Roget’s thesaurus does.

6.4.4 Feature Set

Tables 6.15 and F.5 show the performance of the four feature sets across verb classifications for the best performing feature weight/distance measure combination of inverse feature frequency and cosine. In this setting, the following ranking of feature sets in terms of precision of empirically determined nearest neighbors holds:

labeled dependencies > unlabeled dependencies > lex. frames > syntactic frames

Verb Classification   Feature Type      P1     P5     P10    MaxP   kmax
Levin                 Syntactic Frame   0.29   0.24   0.21   0.45   11
                      Lexical           0.42   0.30   0.25   0.56   8.9
                      Dep. Triple       0.49   0.38   0.32   0.63   7.7
                      Lex. Frame        0.37   0.29   0.25   0.52   10
VerbNet               Syntactic Frame   0.27   0.22   0.20   0.43   13
                      Lexical           0.40   0.29   0.24   0.54   10
                      Dep. Triple       0.47   0.36   0.30   0.62   8.8
                      Lex. Frame        0.36   0.27   0.24   0.51   11
FrameNet              Syntactic Frame   0.29   0.21   0.18   0.41   7.3
                      Lexical           0.45   0.30   0.24   0.57   7.2
                      Dep. Triple       0.49   0.35   0.28   0.61   6.8
                      Lex. Frame        0.40   0.28   0.23   0.53   8.3
Roget                 Syntactic Frame   0.21   0.14   0.12   0.34   14.1
                      Lexical           0.50   0.31   0.23   0.64   7.5
                      Dep. Triple       0.58   0.38   0.29   0.71   6.1
                      Lex. Frame        0.43   0.29   0.22   0.58   8.5
WordNet               Syntactic Frame   0.09   0.06   0.05   0.16   12
                      Lexical           0.20   0.12   0.08   0.30   10
                      Dep. Triple       0.24   0.14   0.10   0.34   9.3
                      Lex. Frame        0.17   0.11   0.08   0.26   11

Table 6.15: Nearest neighbor precision with cosine and inverse feature frequency

For other settings of feature weight and distance measure, there was often no appreciable difference between labeled and unlabeled dependencies; the other relations between feature sets hold consistently.

The main conclusion to draw from these patterns is that lexical and syntactic information jointly specify distributional cues to lexical similarity. It is useful to differentiate whether a verb's argument appeared as a subject or an object, rather than simply recording the fact that it appeared as an argument. It is also likely that the lexicalized frames are overly specific and result in very large, sparse feature sets. Other researchers have noted this problem of data sparsity (e.g., Schulte im Walde, 2000) and have explored the additional use of selectional preference features, augmenting each syntactic slot with the concept to which its head noun belongs in an ontology (e.g., WordNet). For example, replacing the head nouns in a frame for eat with the ontological concepts they instantiate provides a level of generalization that overcomes some of the data sparsity problem. Although the problem of data sparsity can be mitigated through the use of such techniques, these features have generally not been shown to improve classification performance (Schulte im Walde, 2000; Joanis, 2002).

It is not surprising that syntactic frames perform worse than the other feature sets. A lexically unspecified syntactic frame conveys relatively little information, and any given frame may be shared by many verbs regardless of their semantic class or synonym set. The fact that syntactic information alone can achieve around 40% precision on a semantic classification task with a negligible baseline provides support for semantic theories that relate syntactic structure to verb meaning. It is interesting that syntactic frames do a better job of identifying Roget synonyms, which are not explicitly organized around syntactic behavior, than of identifying Levin neighbors, which do explicitly incorporate syntactic behavior. However, Levin’s classification involves specific syntactic alternations that preserve meaning rather than general syntactic frames that are not tied to a given semantic interpretation. The failure of syntactic frames to identify Levin neighbors more precisely is probably due to a mismatch between the information they represent and the criteria used in Levin’s original classification.

6.5 Relation Between Experiments and Existing Resources

One application of the techniques developed here would be to assist in extending existing verb schemes such as VerbNet, FrameNet, or Roget's thesaurus by suggesting neighbors of unclassified verbs. In order to estimate the coverage of the five verb schemes studied here, we compared the number of verbs in each scheme that occur at least 10 times in the English gigaword corpus to the number of verbs in the union of the five verb schemes. There are 7206 verbs in the union of Levin, VerbNet, FrameNet, Roget, and WordNet that occur at least 10 times in the English gigaword corpus3. Table 6.16 contains these comparisons. For each verb scheme, the average frequency of verbs included in that scheme is indicated along with the average frequency of verbs not included in that scheme.

Verb Scheme   Contained   Missing
Levin         2886        4320
  Avg. Freq   47231       23479
VerbNet       3426        3780
  Avg. Freq   44504       22558
FrameNet      2110        5096
  Avg. Freq   94374       7577
Roget         5660        1546
  Avg. Freq   61915       1151
WordNet       7110        96
  Avg. Freq   33433       351

Table 6.16: Coverage of each verb scheme with respect to the union of all of the verb schemes and the frequency of included versus excluded verbs

3The reason that there are more Roget and WordNet verbs here than in the experiments is that the experiments used the union of Levin, VerbNet, and FrameNet and extracted Roget and WordNet synonyms from those; here we are looking at the union of all five verb schemes.

For all of the verb schemes, the average token frequency of verbs included in the scheme is greater than the average frequency of excluded verbs. FrameNet covers the smallest number of verbs, but the verbs that it does contain occur on average more frequently than verbs in the other verb schemes. Levin and VerbNet show the least disparity between the average frequency of included versus excluded verbs (about 2:1), while WordNet and Roget show the greatest difference (about 95:1 and 81:1, respectively). In other words, there are many verbs that occur with relatively high frequency in the English gigaword corpus but which Levin and VerbNet do not cover.

We can take the precision results obtained on known verbs as an indication of the expected performance when using distributional similarity as a tool for assigning unknown verbs to lexical semantic classes. In this setting, we would assign an unknown verb to the class(es) of the distributionally most similar verbs in each verb scheme. Table 6.17 contains the expected proportion of correct assignments of unknown verbs to lexical semantic classes for each of Levin, VerbNet, and FrameNet. These proportions are the 1-nearest neighbor precision results using cosine similarity applied to labeled dependency triples weighted by inverse feature frequency over the 50,000 most frequent features.

Verb Scheme   Num. Classes   Baseline Accuracy                          Expected Acc.
                             Minimum      Average        Maximum
Levin         191            (1) 0.01     (1.39) 0.01    (10) 0.05      0.49
VerbNet       237            (1) 0.00     (1.37) 0.01    (10) 0.04      0.47
FrameNet      321            (1) 0.00     (1.35) 0.00    (8) 0.02       0.49

Table 6.17: Expected classification accuracy. The numbers in parentheses indicate raw counts used to compute the baselines

For each verb scheme, we also indicate baseline classification accuracy. Because we have conflated verb senses, the 1-nearest neighbor precision results indicate that a verb was correctly classified if it belongs to any one of the classes that its distributionally most similar neighbor belongs to. Therefore, for each verb scheme we show three baselines (a worked example follows the list):

• The most restrictive case, when the distributionally most similar known verb belongs to exactly one class, defined as 1 divided by the total number of classes in the verb scheme.
• The average case, defined as the average number of classes to which a verb belongs divided by the total number of classes.
• The least restrictive case, defined as the maximum number of classes any verb in the verb scheme belongs to divided by the total number of classes.
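
As a worked example, using the Levin counts reported in Table 6.17 (191 classes; a verb belongs to 1.39 classes on average and to at most 10 classes), the three baselines are

\[
\frac{1}{191} \approx 0.005, \qquad \frac{1.39}{191} \approx 0.007, \qquad \frac{10}{191} \approx 0.052,
\]

which round to the 0.01, 0.01, and 0.05 reported for Levin in the table.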

6.6 Conclusion

This chapter presented the results of a large-scale comparison of a variety of parameters which determine distributional lexical similarity over five lexical semantic classifications of English verbs. The main findings are summarized below.

• Of the distance measures considered here, cosine (viz. the correlation coefficient) yielded the best results.

• Of the feature sets considered here, labeled dependency triples yielded the best results.

• Of the feature weightings considered here, inverse feature frequency yielded the best results.

• Using the parameters studied here, distributionally similar verb assignments correspond more closely to Roget-style synonyms than to Levin, VerbNet, or FrameNet classes or WordNet synonyms.

• The parameters explored here can be used to extend the coverage of Levin, VerbNet, and FrameNet to other verbs that occur in a large text corpus with an expected accuracy of around 49% (over a baseline accuracy of about 5%).

Based on these findings, we conclude more generally that:

• Syntactically informed lexical co-occurrence features do a better job of identifying synonyms than of identifying neighbors based on the other lexical semantic criteria that Levin, VerbNet, and FrameNet rely on (e.g., shared components of meaning such as MOTION or COVERING; participation in semantic frames).

• Extrinsic feature weightings, which quantify the association between a feature and a target verb with respect to that feature's overall distribution among verbs, do a better job of identifying neighbors than feature weightings which do not account for overall feature distribution.

• In addition to extrinsic feature weightings, scaling a weighted context vector (i.e., via cosine or vector length normalization) improves neighbor identification, as illustrated in the sketch below.
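To make the last two points concrete, the sketch below applies one plausible reading of inverse feature frequency weighting – the reciprocal of the number of verbs a feature occurs with, analogous to inverse document frequency – and then length-normalizes each verb vector so that the dot product of two rows is exactly their cosine similarity. The exact weighting formula used in the experiments may differ; this is an illustration of the general scheme rather than the implementation used here.

import numpy as np

# Toy verb-by-feature co-occurrence counts (rows: verbs, columns: dependency features).
counts = np.array([[4.0, 1.0, 0.0],
                   [3.0, 0.0, 2.0],
                   [0.0, 5.0, 1.0]])

# Extrinsic weighting: down-weight features that co-occur with many verbs
# (here 1 / number of verbs the feature occurs with -- an IDF-like assumption).
feature_freq = (counts > 0).sum(axis=0)
weighted = counts / feature_freq

# Intrinsic scaling: length-normalize each verb vector, so cosine similarity
# between two verbs reduces to a dot product of the normalized rows.
norms = np.linalg.norm(weighted, axis=1, keepdims=True)
normalized = weighted / norms

print(np.round(normalized @ normalized.T, 2))   # pairwise cosine similarities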

CHAPTER 7

CONCLUSION

This dissertation addressed the problem of learning accurate and scalable lexical classifiers in the absence of large amounts of hand-labeled training data. It considered two distinct lexical acquisition tasks, both of which rely on an appropriate definition of distributional lexical similarity:

• Automatic transliteration and identification of English loanwords in Korean. For this problem, lexical similarity was defined over phonological co-occurrence features.

• Lexical semantic classification of English verbs on the basis of automatically derived co-occurrence features. For this problem, similarity was defined in terms of grammatical relations.

7.1 Transliteration of English Loanwords in Korean

The first task focused on ways to mitigate the effort of obtaining large amounts of labeled training data for transliterating and identifying English loanwords in other languages, using Korean as a case study. The key ideas that emerged from the transliteration task are:

• Consonant transliteration is highly regular and can be expressed reliably using a small number of phonological adaptation rules.

• Vowel transliteration is irregular and is heavily influenced by the orthographic forms of source words.

• These two observations can be used to constrain the predictions made by a statistical transliteration model, resulting in a model that is robust to small amounts of training data and produces a small number of transliterations per input item.

Two transliteration models were devised – a phonological rule-based model, and a statistical model that combined orthographic and phonological information. These models were applied to a set of 10,000 attested English loanwords in Korean.

The rule-based model obtained 1-best transliteration accuracy of 49.2%, compared to 73.4% for the statistical model. When vowels are excluded from the output transliterations, the performance of the rule-based model and the statistical model is much closer: 89.9% for the rule-based model versus 90.8% for the statistical model. These figures underscore the variability associated with vowel transliteration.

7.2 Identification of English Loanwords in Korean

For the identification task, the basic idea involved using a rule-based system to generate large amounts of data that serve as training examples for a secondary lexical classifier. Although the precision of the rule-based output was low, on a sufficient scale it represented the lexical patterns of primary statistical significance with enough reliability to train a classifier that was robust to the deficiencies of the original rule-based output. The primary contributions of this study of loanword identification include:

• A demonstration of the suitability of a sparse logistic regression classifier to the task of automatic loanword identification.

• A highly efficient solution to the problem of obtaining labeled training data for etymological classification.

• A demonstration of the fact that automatically generated pseudo-data can be used to train a classifier that distinguishes actual English and Korean words as accurately as one trained entirely on hand-labeled data.

Three experiments were conducted which systematically varied the quantity and quality of labeled training data. The first experiment, conducted entirely on hand-labeled training data, obtained classification accuracy of 96.2%. The second experiment used the rule-based transliteration model from Chapter 3 to produce large amounts of pseudo-English loanwords that were used in conjunction with actual Korean words to train a classifier capable of identifying actual English loanwords with 95.8% accuracy. The third experiment used pseudo-English loanwords and unlabeled items that served as examples of Korean words to train a classifier that identified actual English loanwords with 92.4% accuracy.
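As a rough illustration of the classifier setup (not the exact features, data, or hyperparameters used in the experiments summarized above), the sketch below trains an L1-regularized (sparse) logistic regression on character n-gram features of romanized word forms; the training items are invented placeholders.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy romanized examples: 1 = English loanword in Korean, 0 = native Korean word.
words  = ["keompyuteo", "keopi", "syupeomaket", "saram", "hakgyo", "eomeoni"]
labels = [1, 1, 1, 0, 0, 0]

model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 3)),   # character n-gram features
    LogisticRegression(penalty="l1", solver="liblinear"),   # sparse (L1) logistic regression
)
model.fit(words, labels)
print(model.predict(["peurinteo", "bap"]))   # predictions for two unseen items (illustrative only)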

7.3 Distributional Verb Similarity

The second lexical acquisition task considered in this dissertation was the assignment of English verbs to lexical semantic classes on the basis of their distributional context in a large text corpus. The approach to this task used the output of a statistical parser to automatically generate a feature set that was used to assign English verbs to lexical semantic classes. This study produced results on a substantially larger scale than any previously reported and yielded new insights into the properties of verbs that are responsible for their lexical categorization. A series of experiments was conducted which examined the interactions between a number of parameters that influence empirical determinations of distributional lexical similarity. The parameters examined were:

• Similarity measure. Three classes of similarity measure were considered – set theoretic, geometric, and information theoretic.

• Feature type. Four feature types based on grammatical dependencies were examined – syntactic frames, labeled and unlabeled dependency relations, and lexicalized syntactic frames.

• Feature weighting. Intrinsic weightings such as vector length normalization were compared to extrinsic weighting schemes such as correlation.

• Feature selection. Feature selection was limited to cutoff by frequency.

These parameters were further evaluated with respect to five verb classification schemes – Levin, VerbNet, FrameNet, Roget's Thesaurus, and WordNet. The main picture that emerged from this analysis is that a combination of the cosine similarity measure with labeled dependency triples and inverse feature frequency consistently yielded the best results in terms of how closely empirical verb similarities matched the labels of the five verb schemes. Performance asymptotes at around 50,000 of the most frequent features of each type. Simultaneously considering multiple verb classification schemes allowed for a comparison of the criteria used by each scheme for grouping verbs. One of the main findings along these lines is that, using the feature sets considered here, verbs within a given classification scheme that are related by synonymy are identified more reliably than verbs related by criteria such as diathesis alternations or participation in semantic frames. This approach also allowed for an examination of the relation between each verb scheme and empirically determined verb similarities. Here we saw that Roget synonyms were identified more reliably than Levin, VerbNet, and FrameNet verbs. Extrapolating the precision of empirical neighbor assignments for each of the five verb schemes to unknown verbs allowed an estimate of the expected accuracy that would be obtained for automatically extending the coverage of each scheme to new verbs. Using the best performing combination of parameters mentioned in the preceding paragraph, Roget synonyms were correctly identified 58% of the time; Levin, VerbNet, and FrameNet verbs obtained approximately 49% accuracy; and WordNet synonyms were correctly identified 24% of the time. Together, these findings indicate that we should pay closer attention to the relation between the various criteria used in each verb scheme and which of those criteria are primarily reflected in the feature sets commonly used in automatic verb classification.
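As a pointer to what the best-performing feature type looks like in practice, the sketch below turns a handful of labeled dependency triples into per-verb feature counts of the kind the similarity computations operate on. The triples and relation labels are invented for illustration and are not the output of the parser used in the experiments.

from collections import Counter, defaultdict

# Hypothetical labeled dependency triples of the form (verb, relation, dependent lemma).
triples = [
    ("break", "dobj", "window"), ("break", "nsubj", "boy"),
    ("shatter", "dobj", "window"), ("shatter", "nsubj", "glass"),
    ("eat", "dobj", "apple"), ("eat", "nsubj", "boy"),
]

# Each (relation, dependent) pair becomes one labeled-dependency-triple feature.
features = defaultdict(Counter)
for verb, rel, dep in triples:
    features[verb][(rel, dep)] += 1

for verb, feats in features.items():
    print(verb, dict(feats))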

APPENDIX A

ENGLISH-TO-KOREAN STANDARD CONVERSION RULES

Ministry of Education and Human Resources Development Publication 85-11 (1986.1.7)
Foreign Word Transcription
Section 1: Transcription of English

Write according to the general correspondence rules, with attention to the items that follow.

Part 1 Voiceless Stops ([p], [t], [k])

1) Word-final voiceless stops following a short vowel are written as codas.
   Examples: gap, cat, book

2) Voiceless stops that occur between a short vowel and any consonant other than a liquid or nasal are written as codas.
   Examples: apt, setback, act

3) All other word-final and pre-consonantal voiceless stops are written with a following epenthetic vowel.
   Examples: stamp, cape, nest, part, desk, make, apple, mattress, chipmunk, sickness

Part 2 Voiced Stops ([b], [d], [g])

1) An epenthetic vowel is written after word-final and all pre-consonantal voiced stops.
   Examples: bulb, land, zigzag, lobster, kidnap, signal

Part 3 Fricatives ([s], [z], [f], [v], [θ], [ð], [ʃ], [ʒ])

1) An epenthetic vowel is written after word-final and pre-consonantal [s], [z], [f], [v], [θ], [ð].
   Examples: mask, jazz, graph, olive, thrill, bathe

2) Word-final [ʃ] and pre-consonantal [ʃ] are written with their own fixed forms, and pre-vocalic [ʃ] is written as a single unit with the following vowel.
   Examples: flash, shrub, shark, shank, fashion, sheriff, shopping, shoe, shim

3) Word-final and pre-consonantal [ʒ] is written with a following epenthetic vowel, and pre-vocalic [ʒ] is written as a simple onset.
   Examples: mirage, vision

Part 4 Affricates ([ts], [dz], [tʃ], [dʒ])

1) Word-final and pre-consonantal [ts], [dz], [tʃ], and [dʒ] are each written as a full syllable with a following vowel.
   Examples: Keats, odds, switch, bridge, Pittsburgh, hitchhike

2) Pre-vocalic [tʃ] and [dʒ] are written as simple onsets.
   Examples: chart, virgin

Part 5 Nasals ([m], [n], [ŋ])

1) Word-final and pre-consonantal nasals are all written as codas.
   Examples: steam, corn, ring, lamp, hint, ink

2) Intervocalic [ŋ] is written as the coda of the preceding syllable.
   Examples: hanging, longing

Part 6 Liquids ([l])

1) Word-final and pre-consonantal [l] is written as a coda.
   Examples: hotel, pulp

2) When word-internal [l] comes before a vowel, or before a nasal ([m], [n]) not followed by a vowel, it is written doubled, as the coda of the preceding syllable and the onset of the following one. However, [l] following a nasal ([m], [n]) is written as a single onset even if it comes before a vowel.
   Examples: slide, film, helm, swoln, Hamlet, Henley

Part 7 Long Vowels

The length of a long vowel is not separately transcribed.
   Examples: team, route

Part 8 Diphthongs ([ai], [au], [ei], [ɔi], [ou], [auə])

For diphthongs, the phonetic value of each monophthong is realized and written separately, except that [ou] and [auə] each have a single fixed form.
   Examples: time, house, skate, oil, boat, tower

Part 9 Semivowels ([w], [j])

1) [w] is written as a single unit with the following vowel, the form depending on that vowel.
   Examples: word, want, woe, wander, wag, west, witch, wool

2) When [w] occurs after a consonant, two separate syllables are written; however, [gw], [hw], and [kw] are written as a single syllable.
   Examples: swing, twist, penguin, whistle, quarter

3) The semivowel [j] combines with the following vowel and is written as a single unit. However, [jə] following [d], [l], or [n] is written as a separate syllable.
   Examples: yard, yank, yearn, yellow, yawn, you, year, Indian, battalion, union

Part 10 Compound Words

1) In a compound, words that can stand alone and have combined to form the compound are written as they are when they occur independently.
   Examples: cuplike, bookend, headlight, touchwood, sit-in, bookmaker, flashgun, topknot

2) Words written with spaces in the source language may be written with or without spaces in Korean.
   Examples: Los Alamos, top class

APPENDIX B

DISTRIBUTED CALCULATION OF A PAIRWISE DISTANCE MATRIX

The basic idea for distributing the task of computing one half of a symmetric pairwise distance matrix across p independent processes, each consisting of an approximately equal number of comparisons, is illustrated as follows. Assuming a proper distance metric and five items [1, 2, 3, 4, 5], the minimum set of comparisons needed to compute the distance between every pair of points is:

1, 2

1, 3

1, 4

1, 5

2, 3

2, 4

2, 5

3, 4

3, 5

4, 5

For point 1, 5 − 1 = 4 comparisons are needed, for point 2, 5 − 2 = 3 comparisons are needed, etc. More generally, for each i in the sequence i = 1, 2, ..., n, n − i comparisons are necessary. The upshot of this fact is that simply dividing the list into p approximately equal-sized parts will not give an approximately equal number of comparisons. That is, comparing items [1, 2] to all of their neighbors requires (5 − 1) + (5 − 2) = 7 pairs, while items [3, 4] require only (5 − 3) + (5 − 4) = 3 comparisons. Instead of equal-sized parts, we need to divide the list into parts containing an approximately equal number of pairwise comparisons. In order to be efficient, we should avoid actually enumerating the comparisons. For 10 items, we have the following distribution of comparisons (item number 10 is excluded because we do not compare 10 to itself):

Item   Comparisons
  1         9
  2         8    (items 1–3: 24 comparisons)
  3         7
  4         6
  5         5
  6         4    (items 4–9: 21 comparisons)
  7         3
  8         2
  9         1

The number of comparisons forms a decreasing arithmetic series. In order to get roughly the same number of comparisons, we can index the list at the point where the sum of the comparisons up to a given index is approximately equal (i.e., 24 and 21 for splitting the list into two parts in the example above). This means finding the sum of the arithmetic series giving the number of comparisons for a given point, and dividing by the desired number of splits. The formula for the sum of the arithmetic series of integers from 1 to n is

n(n + 1)/2

Dividing this sum by the desired number of processes gives the (rounded) number of comparisons to make per process. Based on this number, starting and stopping points can be indexed into the list, and parallel jobs containing a start index, a stop index, and a pointer to the list (on disk or in memory) can be submitted for independent processing. After all jobs have terminated, the results can be merged to find pairwise distances between all points in the list. An algorithm for doing this is given below.

1: list ⊲ List of items.
2: p ⊲ Number of parallel processes.
3: sum ← list.length(list.length − 1)/2 ⊲ Total number of pairwise comparisons.
4: k ← sum/p ⊲ Number of comparisons per process.
5: L ← list.length − 1 ⊲ Comparisons contributed by the first item.
6: start ← 0, stop ← 0
7: while start < list.length − 1 do
8:     c ← 0
9:     while c ≤ k and L > 0 do
10:        c ← c + L ⊲ Accumulate comparisons.
11:        L ← L − 1
12:        stop ← stop + 1
13:    end while
14:    submitJob(start, stop, listPtr)
15:    start ← stop
16: end while

This algorithm does not guarantee that all submitted jobs are exactly the same size, only close. Furthermore, most vector comparisons are O(n). In practice, a vector of length m ≫ n takes appreciably longer to compute for, e.g., Euclidean distance. In a sparse vector space, care should be taken that very long vectors are not clustered early in the list, or those jobs will take much longer to compute than the others and load balancing will suffer.
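The scheme above can be made concrete in a short Python sketch. The names used here (split_jobs, run_job, and the toy dist function) are illustrative only and are not part of the implementation used for the experiments; in practice each (start, stop) range would be submitted as a separate batch job rather than run in a loop.

def split_jobs(n_items, n_procs):
    # Partition the n_items * (n_items - 1) / 2 pairwise comparisons into
    # contiguous (start, stop) index ranges of roughly equal size.
    total = n_items * (n_items - 1) // 2       # total number of comparisons
    per_proc = max(1, total // n_procs)        # target comparisons per process
    jobs = []
    remaining = n_items - 1                    # comparisons contributed by item `start`
    start = stop = 0
    while start < n_items - 1:
        acc = 0
        while acc <= per_proc and remaining > 0:
            acc += remaining                   # accumulate comparisons
            remaining -= 1
            stop += 1
        jobs.append((start, stop))
        start = stop
    return jobs

def run_job(items, start, stop, dist):
    # Compute the comparisons assigned to one job: every pair (i, j)
    # with start <= i < stop and i < j < len(items).
    return {(i, j): dist(items[i], items[j])
            for i in range(start, stop)
            for j in range(i + 1, len(items))}

if __name__ == "__main__":
    items = list(range(10))                    # ten dummy items
    jobs = split_jobs(len(items), 2)
    print(jobs)                                # [(0, 3), (3, 9)] -> 24 and 21 comparisons
    dist = lambda a, b: abs(a - b)             # stand-in distance function
    merged = {}
    for start, stop in jobs:                   # in practice these run as parallel jobs
        merged.update(run_job(items, start, stop, dist))
    print(len(merged))                         # 45 = 10 * 9 / 2 pairwise distances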

APPENDIX C

FULL RESULTS OF VERB CLASSIFICATION EXPERIMENTS USING BINARY FEATURES

[Each figure in this appendix has four panels – Syntactic Frames, Unlabeled Dependents, Labeled Dependents, and Lexical Frames – plotting maximum precision (y-axis) against the number of features (x-axis, 500 to 100k) for the binary cosine, Jaccard, and overlap measures.]

Figure C.1: Classification results for Levin verbs using binary features

Figure C.2: Classification results for VerbNet verbs using binary features

Figure C.3: Classification results for FrameNet verbs using binary features

Figure C.4: Classification results for Roget verbs using binary features

Figure C.5: Classification results for WordNet verbs using binary features

APPENDIX D

FULL RESULTS OF VERB CLASSIFICATION EXPERIMENTS USING GEOMETRIC MEASURES

[Each figure in this appendix has four panels – Syntactic Frames, Unlabeled Dependents, Labeled Dependents, and Lexical Frames – plotting maximum precision (y-axis) against the number of features (x-axis, 500 to 100k) for the cosine and L1 distance measures.]

Figure D.1: Classification results for Levin verbs using geometric distance measures

Figure D.2: Classification results for VerbNet verbs using geometric distance measures

Figure D.3: Classification results for FrameNet verbs using geometric distance measures

Figure D.4: Classification results for WordNet verbs using geometric distance measures

Figure D.5: Classification results for Roget verbs using geometric distance measures

APPENDIX E

FULL RESULTS OF VERB CLASSIFICATION EXPERIMENTS USING INFORMATION THEORETIC MEASURES

[Each figure in this appendix has four panels – Syntactic Frames, Unlabeled Dependents, Labeled Dependents, and Lexical Frames – plotting maximum precision (y-axis) against the number of features (x-axis, 500 to 100k) for the information radius (InfoRad) and L1 measures.]

Figure E.1: Classification results for Levin verbs using information theoretic distance measures

Figure E.2: Classification results for VerbNet verbs using information theoretic distance measures

Figure E.3: Classification results for FrameNet verbs using information theoretic distance measures

Figure E.4: Classification results for WordNet verbs using information theoretic distance measures

Figure E.5: Classification results for Roget verbs using information theoretic distance measures

APPENDIX F

RESULTS OF VERB CLASSIFICATION EXPERIMENTS USING INVERSE RANK SCORE

Verb Classification   Feature Type       cosine   Jaccard   overlap   Mean
Levin                 Syntactic Frame     0.82      0.78      0.21    0.60
                      Lexical             1.01      0.92      0.11    0.68
                      Dep. Triple         1.28      1.21      0.14    0.88
                      Lex. Frame          1.06      1.01      0.17    0.74
                      Mean                1.04      0.98      0.16
VerbNet               Syntactic Frame     0.78      0.75      0.24    0.59
                      Lexical             0.96      0.87      0.11    0.65
                      Dep. Triple         1.22      1.14      0.14    0.83
                      Lex. Frame          1.01      0.96      0.16    0.71
                      Mean                0.99      0.93      0.16
FrameNet              Syntactic Frame     0.66      0.62      0.12    0.47
                      Lexical             0.88      0.79      0.08    0.58
                      Dep. Triple         1.08      1.00      0.12    0.74
                      Lex. Frame          0.94      0.86      0.14    0.65
                      Mean                0.89      0.82      0.12
Roget                 Syntactic Frame     0.42      0.40      0.07    0.30
                      Lexical             0.82      0.71      0.10    0.55
                      Dep. Triple         1.07      0.95      0.15    0.72
                      Lex. Frame          0.93      0.80      0.16    0.63
                      Mean                0.81      0.71      0.12
WordNet               Syntactic Frame     0.18      0.17      0.05    0.13
                      Lexical             0.31      0.28      0.13    0.24
                      Dep. Triple         0.40      0.36      0.18    0.31
                      Lex. Frame          0.36      0.31      0.12    0.26
                      Mean                0.31      0.28      0.12

Table F.1: Average inverse rank score for set theoretic measures, using the 50k most frequent features of each feature type

Verb Classification   Feature Type       Cosine (=Euclidean)    L1     Mean
Levin                 Syntactic Frame          0.78             0.80   0.79
                      Lexical                  0.93             0.68   0.81
                      Dep. Triple              0.95             0.45   0.70
                      Lex. Frame               0.70             0.28   0.49
                      Mean                     0.84             0.55
VerbNet               Syntactic Frame          0.75             0.78   0.76
                      Lexical                  0.88             0.64   0.76
                      Dep. Triple              0.91             0.45   0.68
                      Lex. Frame               0.66             0.29   0.48
                      Mean                     0.80             0.54
FrameNet              Syntactic Frame          0.60             0.59   0.59
                      Lexical                  0.83             0.59   0.71
                      Dep. Triple              0.90             0.40   0.65
                      Lex. Frame               0.61             0.27   0.44
                      Mean                     0.73             0.46
Roget                 Syntactic Frame          0.36             0.41   0.39
                      Lexical                  0.68             0.55   0.61
                      Dep. Triple              0.72             0.37   0.55
                      Lex. Frame               0.45             0.13   0.29
                      Mean                     0.55             0.36
WordNet               Syntactic Frame          0.14             0.16   0.15
                      Lexical                  0.27             0.23   0.25
                      Dep. Triple              0.27             0.17   0.22
                      Lex. Frame               0.18             0.05   0.12
                      Mean                     0.22             0.15

Table F.2: Average inverse rank score for geometric measures using the 50k most frequent features of each feature type

Verb Classification   Feature Type       Information Radius    L1     Mean
Levin                 Syntactic Frame          0.91            1.00   0.96
                      Lexical                  1.17            1.20   1.19
                      Dep. Triple              1.09            1.30   1.20
                      Lex. Frame               0.86            1.01   0.93
                      Mean                     1.01            1.13
VerbNet               Syntactic Frame          0.87            0.96   0.91
                      Lexical                  1.12            1.15   1.13
                      Dep. Triple              1.06            1.23   1.15
                      Lex. Frame               0.82            0.96   0.89
                      Mean                     0.97            1.08
FrameNet              Syntactic Frame          0.75            0.81   0.78
                      Lexical                  1.08            1.10   1.09
                      Dep. Triple              1.02            1.18   1.10
                      Lex. Frame               0.78            0.88   0.83
                      Mean                     0.91            0.99
Roget                 Syntactic Frame          0.57            0.54   0.55
                      Lexical                  1.08            1.06   1.07
                      Dep. Triple              1.09            1.14   1.12
                      Lex. Frame               0.57            0.57   0.57
                      Mean                     0.83            0.83
WordNet               Syntactic Frame          0.21            0.21   0.21
                      Lexical                  0.43            0.42   0.43
                      Dep. Triple              0.41            0.44   0.42
                      Lex. Frame               0.22            0.23   0.23
                      Mean                     0.32            0.33

Table F.3: Average inverse rank score for information theoretic measures using the 50k most frequent features of each feature type

Set Theoretic            binary cosine           0.61
                         Jaccard                 0.53
                         overlap                 0.10
Geometric                cosine (=Euclidean)     0.72
                         L1                      0.37
Information Theoretic    Information Radius      1.09
                         L1                      1.14

Table F.4: Inverse rank score results across similarity measures

Verb Classification   Feature Type       InvR
Levin                 Syntactic Frame    0.97
                      Lexical            1.18
                      Dep. Triple        1.46
                      Lex. Frame         1.14
VerbNet               Syntactic Frame    0.92
                      Lexical            1.12
                      Dep. Triple        1.40
                      Lex. Frame         1.10
FrameNet              Syntactic Frame    0.83
                      Lexical            1.10
                      Dep. Triple        1.29
                      Lex. Frame         1.05
Roget                 Syntactic Frame    0.55
                      Lexical            1.12
                      Dep. Triple        1.36
                      Lex. Frame         1.04
WordNet               Syntactic Frame    0.22
                      Lexical            0.43
                      Dep. Triple        0.51
                      Lex. Frame         0.39

Table F.5: Inverse rank score with cosine and inverse feature frequency

Verb Classification   Feature Weighting   Cosine   Euclidean   L1
Levin                 binary               1.28      0.50     0.50
                      probability          0.95      0.81     1.30
                      normalized           0.95      0.95     0.45
                      log-likelihood       1.29      0.52     0.52
                      correlation          1.25      0.40     0.18
                      invfeatfreq          1.46      0.68     0.48
VerbNet               binary               1.22      0.53     0.53
                      probability          0.91      0.77     1.23
                      normalized           0.91      0.91     0.45
                      log-likelihood       1.23      0.54     0.54
                      correlation          1.21      0.45     0.22
                      invfeatfreq          1.40      0.68     0.50
FrameNet              binary               1.08      0.40     0.40
                      probability          0.90      0.75     1.18
                      normalized           0.90      0.90     0.40
                      log-likelihood       1.09      0.41     0.41
                      correlation          1.18      0.31     0.07
                      invfeatfreq          1.29      0.58     0.40
Roget                 binary               1.07      0.39     0.39
                      probability          0.72      0.64     1.14
                      normalized           0.72      0.72     0.37
                      log-likelihood       1.09      0.40     0.40
                      correlation          1.12      0.32     0.04
                      invfeatfreq          1.36      0.56     0.38
WordNet               binary               0.40      0.16     0.16
                      probability          0.27      0.26     0.44
                      normalized           0.27      0.27     0.17
                      log-likelihood       0.42      0.16     0.16
                      correlation          0.39      0.12     0.02
                      invfeatfreq          0.51      0.21     0.15

Table F.6: Nearest neighbor average inverse rank score for feature weighting, using the 50k most frequent features of type labeled dependency triple

211 BIBLIOGRAPHY

Abney, Steven. 1991. Parsing by chunks. Principle-Based Parsing, 257–278. Kluwer Academic Publishers.

Agresti, Alan. 1990. Categorical Data Analysis. John Wiley & Sons, Inc.

Al-Onaizan, Yaser and Kevin Knight. 2002. Translating named entities using monolingual and bilingual resources. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 400–408.

American Heritage Dictionary. 2004. The American Heritage® Dictionary of the English Language, Fourth Edition. Accessed January–February 2008. http://dictionary.reference.com. Houghton Mifflin Company.

Baker, Collin F., Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. Christian Boitet and Pete Whitelock, editors, Pro- ceedings of the Thirty-Sixth Annual Meeting of the Association for Computational Linguistics and Seventeenth International Conference on Computational Linguis- tics, 86–90.

Baker, Collin F. and Josef Ruppenhofer. 2002. Framenet’s frames vs. Levin’s verb classes. Julie Larson and Mary Paster, editors, Proceedings of the 28th Annual Meeting of the Berkeley Linguistics Society, 27–38.

Baker, Kirk and Chris Brew. accepted. Animacy classification using sparse logistic regression. OSU Working Papers in Linguistics.

Barker, Milton E. 1969. The phonological adaptation of French loanwords in Vietnamese. Mon-Khmer Studies, 3. 138–47.

Beesley, Kenneth R. 1988. Language identifier: A computer program for auto- matic natural-language identification of on-line text. Proceedings of the 29th Annual Conference of the American Translators Association, 47–54.

Belkin, Mikhail and Partha Niyogi. 2003. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6). 1373–1396.

Belkin, Mikhail and Partha Niyogi. 2004. Semi-supervised learning on Rie- mannian manifolds. Machine Learning, 56. 209–239.

Berger, Adam L., Stephen Della Pietra, and Vincent J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1). 39–71.

Bisani, Maximilian and Hermann Ney. 2002. Investigations on joint-multigram models for grapheme-to-phoneme conversion. Proceedings of the 7th International Conference on Spoken Language Processing, volume 1, 105–108.

Black, Paul E. 2006. Lm distance. In Dictionary of Algorithms and Data Structures [online], Paul E. Black, ed., U.S. National Institute of Standards and Technology. 31 May 2006. (accessed April 3, 2008.) Available from: http: //www.nist.gov/dads/HTML/lmdistance.html.

Brew, Chris and Sabine Schulte im Walde. 2002. Spectral clustering for German verbs. Proceedings of the Conference on Empirical Methods in Natural Language Processing, 117–124. Philadelphia, PA.

Briscoe, Ted and John Carroll. 1997. Automatic extraction of subcategoriza- tion from corpora. In Proceedings of the 5th ACL Conference on Applied Natural Language Processing, 356–363.

Briscoe, Ted, John Carroll, and Rebecca Watson. 2006. The second release of the rasp system. Proceedings of the COLING/ACL on Interactive presentation sessions, 77–80.

Budanitsky, Alexander. 1999. Lexical semantic relatedness and its application in natural language processing. Technical Report CSRG-390, Department of Computer Science, University of Toronto.

Caraballo, S. 1999. Automatic construction of a hypernym-labeled noun hier- archy from text. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, 120–126.

Carlson, Rolf, Björn Granström, and Gunnar Fant. 1970. Some studies concerning perception of isolated vowels. Speech Transmission Laboratory Quarterly Progress and Status Report, Royal Institute of Technology, Stockholm, 2–3, 19–35.

Carroll, Glenn and Mats Rooth. 1998. Valence induction with a head-lexicalized PCFG. In Proceedings of the 3rd Conference on Empirical Methods in Natural Language Processing, 36–45.

Christensen, Ronald. 1997. Log-Linear Models and Logistic Regression. Springer, second edition.

Church, Kenneth W. and William A. Gale. 1991. Concordances for paral- lel texts. Proceedings of the 7th Annual Conference for the New OED and Text Research, Oxford.

Church, Kenneth Ward and Patrick Hanks. 1989. Word association norms, mutual information and lexicography. Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, 76–83.

Clark, Stephen and James R. Curran. 2007. Formalism-independent parser evaluation with CCG and DepBank. Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics.

Clear, Jeremy H. 1993. The British national corpus. The Digital Word: Text-based Computing in the Humanities, 163–187. Cambridge, MA, USA: MIT Press.

Cole, Andy and Kevin Walker. 2000. Korean Newswire. Linguistic Data Con- sortium, Philadelphia. LDC2000T45.

Cook, Paul, Afsaneh Fazly, and Suzanne Stevenson. 2007. Pulling their weight: Exploiting syntactic forms for the automatic identification of idiomatic expressions in context. Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, 41–48. Prague, Czech Republic: Association for Computa- tional Linguistics.

Covington, Michael A. 1996. An algorithm to align words for historical compar- ison. Computational Linguistics, 22(4). 481–496.

Cruse, D. Alan. 1986. Lexical Semantics. Cambridge University Press.

Curran, James R. and Marc Moens. 2002. Improvements in automatic the- saurus extraction. Unsupervised Lexical Acquisition: Proceedings of the Workshop of the ACL Special Interest Group on the Lexicon (SIGLEX), 59–66.

Daelemans, Walter, Jakub Zavrel, Ko van der Sloot, and Antal van den Bosch. 2003. TiMBL: Tilburg Memory Based Learner, version 5.0, Reference Guide. ILK Technical Report 03-10. http://ilk.uvt.nl/downloads/pub/papers/ilk0310.pdf.

Dagan, Ido, Lillian Lee, and Fernando C. N. Pereira. 1999. Similarity- based models of word cooccurrence probabilities. Machine Learning, 34(1-3). 43–69.

Deerwester, Scott C., Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. 1990. Indexing by la- tent semantic analysis. Journal of the American Society of Information Science, 41(6). 391–407.

Deligne, Sabine, François Yvon, and Frédéric Bimbot. 1995. Variable-length sequence matching for phonetic transcription using joint multigrams. Fourth European Conference on Speech Communication and Technology (EUROSPEECH 1995), 2243–2246.

Dempster, Arthur, Nan Laird, and Donald Rubin. 1977. Maximum likeli- hood from incomplete data via the em algorithm. Journal of the Royal Statistical Society, Series B, 39(1). 1–38.

Dorr, Bonnie J. 1997. Large-scale dictionary construction for foreign language tutoring and interlingual machine translation. Machine Translation, 12(4). 271–322.

Dunham, Margaret H. 2003. : Introductory and Advanced Topics. Prentice Hall.

Dunning, Ted E. 1993. Accurate methods for the statistics of surprise and coinci- dence. Computational Linguistics, 19(1). 61–74.

Evert, Stefan. 2000. Association measures. Electronic document. http://www. collocations.de/AM/section5.html. Accessed April 2, 2008.

Fellbaum, Christiane, editor. 1998. WordNet: An Electronic Lexical Database. Cambridge, MA: The MIT Press.

Forney, G. David. 1973. The Viterbi algorithm. Proceedings of the IEEE, vol- ume 61, 268–278.

Gale, William, Kenneth Church, and David Yarowsky. 1992. One sense per discourse. Proceedings of the DARPA Speech and Natural Language Workshop, 233–237.

Gelman, Andrew, John B. Carlin, Hal S. Stern, and Donald B. Rubin. 2004. Bayesian Data Analysis. Chapman & Hall/CRC, second edition.

Genkin, Alexander, David D. Lewis, and David Madigan. 2004. Large-scale Bayesian logistic regression for text categorization. DIMACS Technical Report.

Gorman, James and James R. Curran. 2006. Scaling distributional similar- ity to large corpora. ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, 361–368.

Graff, Dave. 2007. Chinese gigaword third edition. Linguistic Data Consortium, Philadelphia. LDC2007T38.

Graff, David. 2003. English Gigaword. Linguistic Data Consortium, Philadelphia. LDC2003T05.

Graff, David and Zhibiao Wu. 1995. Japanese Business News Text. Linguistic Data Consortium, Philadelphia. LDC95T8.

Harris, Zellig. 1954. Distributional structure. Word, 10(23). 146–162.

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2001. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.

Hays, William T. 1988. Statistics. Holt, Rinehart and Winston, Inc., fourth edition.

Hockenmaier, Julia and Mark Steedman. 2005. CCGbank. Linguistic Data Consortium, Philadelphia. LDC2005T13.

Hodgson, J. M. 1991. Informational constraints on pre-lexical priming. Language and Cognitive Processes, 6. 169–205.

Jaynes, Edwin T. 1991. Notes on present status and future prospects. Jr. Wal- ter T. Grandy and Leonard H. Schick, editors, Maximum Entropy and Bayesian Methods, 1–13. Kluwer Academic Publishers.

Jeong, Kil Soon, Sung Hyun Myaeng, Jae Sung Lee, and Key-Sun Choi. 1999. Automatic Identification and Back-Transliteration of Foreign Words for Information Retrieval. Information Processing and Management, 35(4). 523–540.

Joanis, Eric. 2002. Automatic verb classification using a general feature space. Master’s thesis, University of Toronto.

Joanis, Eric and Suzanne Stevenson. 2003. A general feature space for auto- matic verb classification. Proceedings of the 10th Conference of the EACL (EACL 2003), 163–170.

Joanis, Eric, Suzanne Stevenson, and David James. 2006. A general feature space for automatic verb classification. Natural Language Engineering, 14(03). 337– 367.

Johnson, Christopher and Charles J. Fillmore. 2000. The FrameNet tagset for frame-semantic and syntactic coding of predicate-argument structure. Pro- ceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics (ANLP-NAACL 2000), 56–62.

Jung, Sung Young, SungLim Hong, and Eunok Paek. 2000. An English to Korean Transliteration Model of Extended Markov Window. Proceedings of The 18th Conference on Computational Linguistics, 383–389. Association for Compu- tational Linguistics.

Kaji, Hiroyuki and Yasutsugu Morimoto. 2005. Unsupervised word sense disambiguation using bilingual comparable corpora. IEICE Transactions on Infor- mation and Systems E88, D(2). 289–301.

Kang, Byung Ju. 2001. A resolution of word mismatch problem caused by foreign word transliterations and English words in Korean information retrieval. Ph.D. thesis, Computer Science Department, KAIST.

Kang, Byung-Ju and Key-Sun Choi. 2000a. Automatic transliteration and back- transliteration by decision tree learning. Proceedings of the 2nd International Con- ference on Language Resources and Evaluation, 1135–1411.

Kang, Byung-Ju and Key-Sun Choi. 2000b. Two approaches for the resolution of word mismatch problem caused by English words and foreign words in Korean information retrieval. Proceedings of the Fifth International Workshop on Infor- mation Retrieval with Asian Languages, 133–140.

Kang, Byung-Ju and Key-Sun Choi. 2002. Effective foreign word extraction for Korean information retrieval. Information Processing and Management, 38. 91–109.

Kang, In-Ho and GilChang Kim. 2000. English-to-Korean transliteration using multiple unbounded overlapping phoneme chunks. Proceedings of the 18th Inter- national Conference on Computational Linguistics, 418–424.

Kang, Yoonjung. 2003. Perceptual similarity in loanword adaptation: English post-vocalic word final stops in Korean. Phonology, 20. 219–273.

Kang, Yoonjung, Michael Kenstowicz, and Chiyuki Ito. 2007. Hybrid loans: a study of English loanwords transmitted to Korean via Japanese. Paper presented at the 4th Seoul International Conference on Phonology and Morphology. Seoul, Korea.

Kenstowicz, Michael. 2005. The phonetics and phonology of Korean loanword adaptation. paper presented at the First European Conference on Korean Lin- guistics, Leiden University, February 2005. http://web.mit.edu/linguistics/ people/faculty/kenstowicz/korean_loanword_adaptation.pdf.

Khaltar, Badam-Osor, Atsushi Fujii, and Tetsuya Ishikawa. 2006. Ex- tracting loanwords from mongolian corpora and producing a japanese-mongolian bilingual dictionary. ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, 657–664.

Kilgarriff, Adam and Colin Yallop. 2000. What’s in a thesaurus? Proceed- ings of the Second International Conference on Language Resources and Evaluation (LREC 2000).

Kim, J. J., Jae Sung Lee, and Key-Sun Choi. 1999. Pronunciation unit based automatic English-korean transliteration model using neural network. Proceedings of Korea Cognitive Science Association, 247–252.

Kipper, Karin, Hoa Trang Dang, and Martha Palmer. 2000. Class-based construction of a verb lexicon. Proceedings of the 17th National Conference on Artificial Intelligence.

Knight, Kevin and Jonathan Graehl. 1998. Machine Transliteration. Compu- tational Linguistics, 24. 599–612.

Korean Ministry of Culture and Tourism. 1995. English to Korean standard conversion rules. Electronic Document. http://www.hangeul.or.kr/nmf/23f.pdf. Accessed February 13, 2008.

Korhonen, Anna and Ted Briscoe. 2004. Extended lexical-semantic classifica- tion of english verbs. Dan Moldovan and Roxana Girju, editors, HLT-NAACL 2004: Workshop on Computational Lexical Semantics, 38–45. Boston, Massachusetts, USA: Association for Computational Linguistics.

Korhonen, Anna, Yuval Krymolowski, and Zvika Marx. 2003. Clustering polysemic subcategorization frame distributions semantically.

Krishnapuram, Balaji, Lawrence Carin, Mário A. T. Figueiredo, and Alexander J. Hartemink. 2005. Sparse multinomial logistic regression: fast algorithms and generalization bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27.

Kutner, Michael H., Christopher J. Nachtsheim, and John Neter. 2004. Applied Linear Regression Models. McGraw Hill, fourth edition.

Landauer, Thomas, P. W. Foltz, and D. Laham. 1998. Introduction to latent semantic analysis. Discourse Processes, 25. 259–284.

Landauer, Thomas K. and Susan T. Dumais. 1997. A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and repre- sentation of knowledge. Psychological Review, 104(2). 211–240.

Lapata, Mirella and Chris Brew. 1999. Using subcategorization to re- solve verb class ambiguity. Pascale Fung and Joe Zhou, editors, Proceedings of WVLC/EMNLP, 266– 274.

Lapata, Mirella and Chris Brew. 2004. Verb class disambiguation using infor- mative priors. Computational Linguistics, 30(2). 45–73.

LDC. 1995. ACL/DCI. Linguistic Data Consortium, Philadelphia. LDC93T1.

Lee, Ahrong. 2006. English coda /s/ in Korean loanword phonology. Blake Rodgers, editor, LSO Working Papers in Linguistics. Proceedings of WIGL 2006, volume 6.

Lee, Jae Sung. 1999. An English-Korean transliteration and retransliteration model for cross-lingual information retrieval. Ph.D. thesis, Computer Science Depart- ment, KAIST.

Lee, Jae Sung and Key-Sun Choi. 1998. English to Korean statistical transliteration for information retrieval. International Journal of Computer Processing of Oriental Languages, 12(1). 17–37.

Lee, Yoong Keok and Hwee Tou Ng. 2002. An empirical evaluation of knowl- edge sources and learning algorithms for word sense disambiguation. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 41–48.

Levenshtein, Vladimir. 1966. Binary codes capable of correcting deletions, inser- tions, and reversals. Soviet Physics Doklady, 10. 707–710.

Levin, Beth. 1993. English Verb Classes and Alternations: A Preliminary Investi- gation. Chicago, IL: University of Chicago Press.

Li, Eui Do. 2005. Principles for transliterating roman characters and foreign words in Korean. Proceedings of the 9th Conference for Foreign Teachers of Korean, 95–147.

Li, Jianguo and Chris Brew. 2007. Disambiguating levin verbs using untagged data. Proceedings of Recent Advances In Natural Language Processing (RANLP- 07).

Li, Jianguo and Chris Brew. 2008. Which are the best features for automatic verb classification. Proceedings of the 46th Annual Meeting of the Assoication for Computational Linguistics.

Lin, Dekang. 1998a. Automatic retrieval and clustering of similar words. COLING- ACL, 768–774.

Lin, Dekang. 1998b. Dependency-based evaluation of MINIPAR. Workshop on the Evaluation of Parsing Systems.

Lin, Dekang. 1998c. An information-theoretic definition of similarity. Proceedings of the 15th International Conference on Machine Learning, 296–304.

Lin, Jianhua. 1991. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1). 145–150.

Llitjós, Ariadna Font and Alan Black. 2001. Knowledge of language origin improves pronunciation of proper names. Proceedings of EuroSpeech-01, 1919–1922.

Lund, K., C. Burgess, and C. Audet. 1996. Dissociating semantic and associative relationships using high-dimensional semantic space. Cognitive Science Proceedings, 603–608. LEA.

Madigan, David, Alexander Genkin, David D. Lewis, Shlomo Argamon, Dmitriy Fradkin, and Li Ye. 2005a. Author identification on the large scale. Proceedings of The Classification Society of North America (CSNA).

Madigan, David, Alexander Genkin, David D. Lewis, and Dmitriy Frad- kin. 2005b. Bayesian multinomial logistic regression for author identification. DI- MACS Technical Report.

Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.

Manning, Christopher D. and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, Massachusetts: The MIT Press.

Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn treebank. Computational Linguistics, 19(2). 313–330.

Martin, Samuel Elmo. 1992. A Reference Grammar of Korean: A Complete Guide to the Grammar and History of the Korean Language. Rutland, Vermont: Charles E. Tuttle.

McCallum, Andrew Kachites. 1996. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/ ~mccallum/bow.

McCarthy, Diana, Rob Koeling, Julie Weeds, and John Carroll. 2004. Finding predominant senses in untagged text.

McDonald, Scott and Chris Brew. 2004. A distributional model of semantic context effects in lexical processing. Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, 17–24.

McRae, Ken, Todd R. Ferretti, and Liane Amyote. 1997. Thematic roles as verb-specific concepts. Language and Cognitive Processes: Special Issue on Lexical Representations in Sentence Processing, 12. 137–176.

Merlo, Paola and Suzanne Stevenson. 2001. Automatic verb classification based on statistical distribution of argument structure. Computational Linguistics, 27. 373–408.

Mettler, Matt. 1993. TRW Japanese fast data finder. Tipster Text Program: Phase I Workshop Proceedings. http://acl.ldc.upenn.edu/X/X93/X93-1011. pdf.

Mitchell, Tom M. 2006. Generative and discriminative classifiers: naive Bayes and logistic regression. http://www.cs.cmu.edu/~tom/NewChapters.html. Accessed March 1, 2008.

Mohri, Mehryar, Fernando C. N. Pereira, and Michael D. Riley. 1998. FSTs in speech and language processing. Electronic Document, AT&T Research Labs. http://www.research.att.com/~fsmtools/fsm/.

Ng, Andrew Y. and Michael I. Jordan. 2002. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems 14. MIT Press.

NIKL. 1991. Survey of the state of loanword usage: 1990. Electronic Document. The National Institute of the Korean Language, Seoul, Korea. http://www.korean. go.kr.

NIKL. 2002. Hyeondae gugeo sayong bindo josa bogoseo. Electronic Document. The National Institute of the Korean Language, Seoul, Korea. http://www.korean. go.kr.

Nusbaum, H. C., David Pisoni, and C. K. Davis. 1984. Sizing up the Hoosier Mental Lexicon: Measuring the Familiarity of 20,000 Words. Research on Speech Perception Progress Report No. 10, 357–376.

Oh, Jong-Hoon and Key-Sun Choi. 2001. Automatic extraction of transliterated foreign words using hidden Markov model. Proceedings of the 19th International Conference of Computer Processing on Oriental Language, 433–438.

Oh, Jong-Hoon and Key-Sun Choi. 2002. An english-korean transliteration model using pronunciation and contextual rules. Proceedings of the 19th Interna- tional Conference on Computational linguistics (COLING 2002), 1–7.

Oh, Jong-Hoon and Key-Sun Choi. 2005. Machine learning based English-to-Korean transliteration using grapheme and phoneme information. IEICE Transactions on Information and Systems, E88-D(7). 1737–1748.

Oh, Jong-Hoon, Key-Sun Choi, and Hitoshi Isahara. 2006a. A comparison of different machine transliteration models. Journal of Artificial Intelligence Research, 27. 119–151.

Oh, Jong-Hoon, Key-Sun Choi, and Hitoshi Isahara. 2006b. A machine transliteration model based on correspondence between graphemes and phonemes. ACM Transactions on Asian Language Processing, 5(3). 185–208.

O’Seaghdha, Padraig G. and Joseph W. Marin. 1997. Mediated semantic- phonological priming: Calling distant relatives. Journal of Memory and Language, 36(2). 226–252.

Padgett, Jaye. 2001. Contrast dispersion and Russian palatalization. Elizabeth Hume and Keith Johnson, editors, The Role of Speech Perception in Phonology. Academic Press.

Padó, Sebastian and Mirella Lapata. 2007. Dependency-based construction of semantic space models. Computational Linguistics, 33. 161–199.

Palmer, Martha, Dan Gildea, and Paul Kingsbury. 2005. The proposition bank: a corpus annotated with semantic roles. Computational Linguistics, 31(1). 71–106.

Park, Hanyong. 2007. Varied adaptation patterns of English stops and fricatives in Korean loanwords: The influence of the P-map. IULC Working Papers Online.

Peperkamp, Sharon. 2005. A psycholinguistic theory of loanword adaptations. Marc Ettlinger, Nicholas Fleisher, and Mischa Park-Doob, editors, Proceedings of the 30th Annual Berkeley Linguistics Society, volume 30.

Pereira, Fernando C. N., Naftali Tishby, and Lillian Lee. 1993. Distri- butional clustering of english words. Meeting of the Association for Computational Linguistics, 183–190.

Pietra, Stephen Della, Vincent Della Pietra, and John Lafferty. 1997. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19. 380–393.

Pinker, Steven. 1994. The Language Instinct. New York: W. Morrow and Co.

Prasada, Sandeep and Steven Pinker. 1993. Generalisation of regular and irregular morphological patterns. Language and Cognitive Processes, 8(1). 1–56.

Quinlan, J. Ross. 1986. Induction of decision trees. Machine Learning, 1. 81–106.

Quinlan, J. Ross. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, Inc.

Radford, Andrew. 1997. Syntactic theory and the structure of English: A mini- malist approach. Cambridge University Press.

Ramsey, Fred L. and Daniel W. Schafer. 2002. The Statistical Sleuth: A Course in Methods of Data Analysis. Pacific Grove, CA: Duxbury.

Rayson, Paul, Damon Berridge, and Brian Francis. 2004. Extending the Cochran rule for the comparison of word frequencies between corpora. Volume II of Purnelle G., Fairon C., Dister A. (eds.) Le poids des mots: Proceedings of the 7th International Conference on Statistical analysis of textual data (JADT 2004), 926–936.

Richeldi, Marco and Mauro Rossotto. 1997. Combining statistical tech- niques and search heuristics to perform effective feature selection. Gholamreza Nakhaeizadeh and Charles C. Taylor, editors, Machine Learning and Statistics: The Interface, 269–291. Wiley.

Riha, Helena and Kirk Baker. 2008a. The morphology and semantics of roman letter words in Chinese. Paper presented at the 13th International Morphology Meeting 2008. Vienna, Austria.

Riha, Helena and Kirk Baker. 2008b. Tracking sociohistorical trends in the use of Roman letters in Chinese newswires. Paper presented at the American Association for (AACL 2008). Provo, Utah.

Roget’s Thesaurus. 2008. Roget’s New Millennium Thesaurus, First Edition (v 1.3.1). Lexico Publishing Group, LLC. http://thesaurus.reference.com.

Rohde, Douglas L. T., Laura M. Gonnerman, and David C. Plaut. sub- mitted. An improved model of semantic similarity based on lexical co-occurrence. Cognitive Science.

Roweis, Sam and Lawrence Saul. 2000. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500). 2323–2326.

Rubenstein, Herbert and John B. Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM, volume 8, 627–633.

Sahlgren, Magnus, Jussi Karlgren, and Gunnar Eriksson. 2007. SICS: Va- lence annotation based on seeds in word space. Proceedings of Fourth International Workshop on Semantic Evaluations (SemEval-2007), 4. Prague, Czech Republic.

Saul, Lawrence K. and Sam T. Roweis. 2003. Think globally, fit locally: un- supervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4. 119–155.

Schulte im Walde, Sabine. 2000. Clustering verbs semantically according to their alternation behaviour. COLING, 747–753.

Schulte im Walde, Sabine. 2003. Experiments on the choice of features for learning verb classes. Proceedings of EACL 2003, 315–322.

Schulte im Walde, Sabine and Chris Brew. 2002. Inducing German semantic verb classes from purely syntactic subcategorisation information. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 223– 230. Philadelphia, PA.

Schütze, Hinrich. 1998. Automatic word sense discrimination. Computational Linguistics, 24(1). 97–123.

Smith, Caroline L. 1997. The devoicing of /z/ in American English: effects of local and prosodic context. Journal of Phonetics, 25(4). 471–500.

Smith, Jennifer. 2008. Source similarity in loanword adaptation: Correspondence theory and the posited source-language representation. Steve Parker, editor, Phono- logical Argumentation: Essays on Evidence and Motivation. London: Equinox.

Steedman, Mark. 1987. Combinatory grammars and parasitic gaps. Natural Lan- guage and Linguistic Theory, 5.

Stevenson, Suzanne and Paola Merlo. 1999. Automatic verb classification using distributions of grammatical features. Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics, 45–52.

Szabolcsi, Anna. 1992. Combinatory grammar and projection from the lexicon. Ivan Sag and Anna Szabolcsi, editors, Lexical Matters. CSLI Lecture Notes 24, 241–269. Stanford, CSLI Publications.

Teahan, W. J., Yingying Wen, Rodger Mcnab, and Ian H. Witten. 2000. A compression-based algorithm for chinese word segmentation. Computational Lin- guistics, 26. 375–393.

Tishby, Naftali, Fernando C. Pereira, and William Bialek. 1999. The information bottleneck method. In Proceedings of the 37th Annual Allerton Con- ference on Communication, Control and Computing, 368–377.

Tsang, Vivian, Suzanne Stevenson, and Paola Merlo. 2002. Crosslinguistic transfer in automatic verb classification. Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), 1–7.

Vinson, David P. and Gabriella Vigliocco. 2007. Semantic feature production norms for a large set of objects and events. Behavior Research Methods, 40. 183– 190.

Weeds, Julie. 2003. Measures and Applications of Lexical Distributional Similarity. Ph.D. thesis, Department of Informatics, University of Sussex.

Weide, J. W. 1998. The Carnegie Mellon Pronouncing Dictionary v. 0.6. Electronic Document, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA. http://www.speech.cs.cmu.edu/cgi-bin/cmudict.

Wikipedia. 2008. G-test. Wikipedia, The Free Encyclopedia. http://en. wikipedia.org/w/index.php?title=G-test&oldid=186838098. Accessed April 5, 2008.

Xu, Jinxi and W. Bruce Croft. 1996. Query expansion using local and global document analysis. SIGIR ’96: Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, 4–11.

Yang, Byunggon. 1996. A comparative study of American English and Korean vowels produced by male and female speakers. Journal of Phonetics, 24(2). 245– 261.

Yoon, Kyuchul and Chris Brew. 2006. A linguistically motivated approach to grapheme-to-phoneme conversion for Korean. Computer Speech and Language, 2(4). 357–381.

Zhang, Le. 2004. Maximum entropy modeling toolkit for Python and C++. http://homepages.inf.ed.ac.uk/s0450736/maxent_toolkit.html.
