Word Sense Disambiguation with Germanet
Total Page:16
File Type:pdf, Size:1020Kb
Word Sense Disambiguation with GermaNet Semi-Automatic Enhancement and Empirical Results Dissertation zur Erlangung des akademischen Grades Doktor der Philosophie in der Philosophischen Fakultät der Eberhard Karls Universität Tübingen vorgelegt von Verena Henrich aus Darmstadt 2015 Gedruckt mit Genehmigung der Philosophischen Fakultät der Eberhard Karls Universität Tübingen Hauptberichterstatter: Prof. Dr. Erhard Hinrichs Mitberichterstatter: Prof. Dr. Gerhard Jäger Dekan: Prof. Dr. Jürgen Leonhardt Tag der mündlichen Prüfung: 29.4.2015 Verlag: TOBIAS-lib, Tübingen Abstract The subject of this dissertation is boosting research on word sense disambigua- tion (WSD) for German. WSD is a very active area of research in computa- tional linguistics, but most of the work is focused on English. One of the factors that has hampered WSD research for other languages such as German is the lack of appropriate resources, particularly in the form of sense-annotated corpus data. Hence, this work inevitably has to start with the preparation of resources before actual WSD experiments can be performed. The work pro- gram is fourfold. Firstly, since sense definitions are necessary to distinguish word senses (both for humans and for automatic WSD algorithms), the Ger- man wordnet GermaNet is (semi-)automatically extended with sense descrip- tions. This is done by automatically mapping GermaNet senses to descrip- tions in the online dictionary Wiktionary. Secondly, since the availability of sense-annotated corpora is a prerequisite for evaluating and developing word sense disambiguation systems, two GermaNet sense-annotated corpora are con- structed. One corpus is automatically constructed and the other corpus is manually sense-annotated. Thirdly, several knowledge-based WSD algorithms are applied and evaluated – using the newly created sense-annotated corpora. These algorithms are based on a suite of semantic relatedness measures, in- cluding path-based, information-content-based, and gloss-based methods. Ex- periments on gloss-based methods also employ the newly harvested definitions from Wiktionary. Fourthly, several supervised machine learning classifiers are applied to the task of German WSD, including rule-based methods, instance- based methods, probabilistic methods, and support vector machines. The classifiers rely on a wide range of machine learning features and their evalua- tion focuses on several aspects, including a comparison of several algorithms, a detailed analysis of the implemented features, and an investigation of the in- fluence of syntax and semantics on the disambiguation performance for verbs. i Für Jens und meine Eltern ii Acknowledgements I would like to acknowledge everyone who supported me during the work on this thesis. Chronologically the first to thank are those who raised my interest in the topic of natural language processing during my master studies in Darm- stadt and Reykjavik: without Prof. Dr. Bettina Harriehausen-Mühlbauer, Dr. Hrafn Loftsson, and Timo Reuter I would not have turned to this research area. I am very grateful to my supervisor Prof. Dr. Erhard Hinrichs in several – conceptual and institutional – respects: of course, for supervising me working on the thesis and providing me with useful remarks, but also for giving me the opportunity to develop and extend my personal knowledge and strengths in many diverse aspects of computational linguistics and of the university’s daily work. He was always patient with me and gave me enough space to tackle those research questions I am interested in, while, at the same time, guided me into the right direction. I would like to thank my second reviewer Prof. Dr. Gerhard Jäger and the other members of my dissertation committee, Prof. Dr. Fritz Hamm, PD Dr. Helmut Schmid, and Prof. Dr. Andrea Weber, for their valuable feedback and comments. Due to my computer science background I am thankful to my ‘linguistic’ colleagues: to Reinhild Barkey mainly for GermaNet-related collaborations and discussions and to Kathrin Beck and Dr. Heike Telljohann for TüBa-D/Z- related linguistic input. Many thanks to Dr. Yannick Versley for making it possible to reuse his annotation tool for sense annotation in the TüBa-D/Z treebank, to Corina Dima, Christina Hoppermann, and Jianqiang Ma for fruitful discussions during iii our ‘PhD meetings’, and to all anonymous reviewers for their comments on papers that were published as part of this thesis. I would like to thank my SfS colleagues Dr. Chris Culy, Marie Hinrichs, Dr. Daniël de Kok, Jochen Saile, Daniil Sorokin, Johannes Wahle, Dr. Holger Wunsch, Dr. Thomas Zastrow, and Ramon Ziai and my external colleagues Prof. Dr. Chris Biemann, Dr. Christian Meyer, Tristan Miller, and Prof. Dr. Torsten Zesch for valuable support and feedback on several aspects of my thesis. I am thankful to the many student assistants who were part of the Germa- Net project at certain times – both on the programmatic as well as on the lexicographic side. In particular, thanks to Tatiana Vodolazova, Anne Brock, Agnia Barsukova, Edo Collins, and Steffen Tacke for their implementation contributions and Johannes Wahle, Sarah Schulz, Valentin Deyringer, and Annabell Grasse for helping with manual annotations. Since English is not my native language I am grateful to Siân Alsop, Dr. Scott Martens, and Tristan Miller for proof-reading some of my chapters. This thesis was written using the LATEX thesis template provided by the Engineering Department of the University of Cambridge.1 Mein besonderer Dank gilt meinen Schwiegereltern, die mir jederzeit viel Verständnis und Geduld entgegengebracht haben. Am allermeisten und von ganzem Herzen möchte ich Jens und meinen Eltern für ihr Verständnis und für ihre bedingungslose Unterstützung danken. Ich widme euch aus Dankbarkeit diese Arbeit, denn ohne euch hätte ich es nicht geschafft. 1http://www-h.eng.cam.ac.uk/help/tpl/textprocessing/ThesisStyle/ iv Table of Contents Abstract i Acknowledgements iii Table of Contents v Citation Conventions xi I Introduction 1 1 Introduction 3 1.1 Goals . .3 1.2 Word Senses to be Disambiguated . .5 1.3 Motivation . 10 1.4 Contributions . 12 1.5 Chapter Guide . 14 2 Fundamentals and Related Work on WSD 19 2.1 Introduction to the Task of WSD . 19 2.2 Evaluating WSD Systems . 23 2.2.1 Evaluation Measures . 23 2.2.2 Evaluation Procedure . 25 2.2.3 Baselines and Bounds . 27 2.2.4 SensEval and SemEval Competitions . 28 2.3 WSD Approaches and State of the Art . 29 2.3.1 Knowledge-Based Approaches to WSD . 32 v TABLE OF CONTENTS 2.3.2 Supervised Machine Learning Approaches to WSD . 39 2.3.3 Related Work on German WSD . 50 3 Fundamentals of GermaNet 55 3.1 Comparing GermaNet with WordNet . 56 3.2 Lexical Units and Synsets . 58 3.3 Lexical Semantic Relations . 60 3.4 The Hierarchy . 64 3.5 Interlingual Links to Princeton WordNet . 65 3.6 Nominal Compounds . 66 3.7 Verbal Frames . 68 3.8 Data Formats for GermaNet . 72 3.9 Coverage of GermaNet . 73 II Preparation of the Resources 75 4 Aligning GermaNet with Wiktionary 77 4.1 Wiktionary . 80 4.2 The Idea of the Alignment Algorithm . 81 4.3 Implementation of the Alignment Algorithm . 84 4.4 Evaluation . 87 4.5 Results . 90 4.6 Related Work on Aligning Wordnets . 95 4.7 Conclusion and Continuing Work . 97 5 Creating Sense-Annotated Corpora 99 5.1 Related Work on Sense-Annotated Corpora . 101 5.1.1 Manually Sense-Annotated Corpora for English . 102 5.1.2 Manually Sense-Annotated Corpora for German . 106 5.1.3 Automatically Sense-Annotated Corpora . 108 5.2 Automatically Constructed WebCAGe . 111 5.2.1 Creation of a Web-Harvested Corpus . 112 5.2.2 Automatic Detection of Target Words . 115 vi TABLE OF CONTENTS 5.2.3 Evaluation . 117 5.2.4 Future Directions . 119 5.3 Manually Sense-Annotated TüBa-D/Z . 120 5.3.1 Linguistic Annotations in the Treebank . 121 5.3.2 Selection of Words to be Sense-Annotated . 126 5.3.3 Annotation Process . 134 5.3.4 Inter-Annotator Agreement . 138 5.4 Comparison of WebCAGe and TüBa-D/Z . 141 5.5 Conclusion and Continuing Work . 144 6 Gold Standard Corpora 145 6.1 Creating Training and Test Sets . 146 6.2 Treatment of Annotations with No Sense or Multiple Senses . 150 6.3 Updating to WebCAGe 3.0 . 153 6.3.1 WebCAGe 3.0 Overall Statistics . 153 6.3.2 WebCAGe Gold Standard for Supervised WSD . 155 6.4 Sense-Annotated TüBa-D/Z Treebank . 156 6.4.1 Updating to TüBa-D/Z 9.1 . 156 6.4.2 TüBa-D/Z 9.1 Overall Statistics . 158 6.4.3 TüBa-D/Z Gold Standard for Supervised WSD . 159 6.5 Sense-Annotated deWaC . 160 6.5.1 Reasons for Choosing deWaC . 160 6.5.2 Reuse of an Existing Sense-Annotated Corpus . 161 6.5.3 deWaC Overall Statistics . 162 6.5.4 deWaC Gold Standard for Supervised WSD . 162 6.6 Automatic Linguistic Preprocessing . 163 III Word Sense Disambiguation (WSD) 167 7 Knowledge-Based Word Sense Disambiguation 169 7.1 Semantic Relatedness Measures . 172 7.1.1 Terminology . 173 7.1.2 Path-Based Measures . 174 vii TABLE OF CONTENTS 7.1.3 Information-Content-Based Measures . 176 7.1.4 Gloss-Based Measures . 180 7.2 Semantic Relatedness for WSD . 183 7.2.1 Context Window . 183 7.2.2 Random Sense Baseline . 185 7.2.3 Combined WSD Algorithms . 186 7.2.4 Purely Knowledge-Based Setup . 187 7.3 Evaluating Knowledge-Based WSD . 188 7.3.1 Profiling WSD Results for Nouns . 189 7.3.2 Profiling WSD Results for Verbs . 197 7.3.3 Profiling WSD Results for Adjectives . 203 7.3.4 Profiling Context Window Sizes . 207 7.4 Summary and Conclusion . 210 8 WSD Using Supervised Machine Learning Methods 215 8.1 Machine Learning Features .