
Definition extraction for glossary creation

A study on extracting definitions for semi-automatic glossary creation in Dutch

Published by LOT
Janskerkhof 13
3512 BL Utrecht
The Netherlands

Phone: +31 30 253 6006
Fax: +31 30 253 6000
e-mail: [email protected]
http://www.lotschool.nl/

Cover illustration: Pieter Bruegel the Elder, The Tower of Babel, oil on panel, c. 1563, Kunsthistorisches Museum Vienna.

ISBN 978-94-6093-034-8 NUR 616

Copyright © 2010 Eline Westerhout. All rights reserved.

Definition extraction for glossary creation
A study on extracting definitions for semi-automatic glossary creation in Dutch

Definitie-extractie voor het creëren van glossariums
Een onderzoek naar de extractie van definities voor het semi-automatisch creëren van Nederlandse glossariums
(Definition extraction for glossary creation: a study on extracting definitions for the semi-automatic creation of Dutch glossaries; with a summary in Dutch)

Dissertation

for the degree of Doctor at Utrecht University, by authority of the Rector Magnificus, Prof. dr. J.C. Stoof, in accordance with the decision of the Board for the Conferral of Doctoral Degrees, to be defended in public on Friday 2 July 2010 in the morning at 12.45 hours

by

Elizabeth Nicoline Westerhout

born on 21 September 1983 in Lienden

Promotor: Prof. dr. J.E.J.M. Odijk
Co-promotor: Dr. P. Monachesi

The beginning of wisdom is the definition of terms (Socrates)

Contents

1 INTRODUCTION
  1.1 Definitions
  1.2 The LT4eL project
  1.3 Definitions in eLearning
  1.4 Definitions in other applications
  1.5 The extraction of definitions
  1.6 Results
  1.7 Outline of the thesis

2 DEFINITIONS
  2.1 Introduction
  2.2 Classification of definitions
    2.2.1 Real and nominal definitions
    2.2.2 Purpose-based classification
    2.2.3 Method-based classification
    2.2.4 Pattern-based classification
  2.3 Classifying corpus definitions
    2.3.1 Is definitions
    2.3.2 Verb definitions
    2.3.3 Punctuation definitions
    2.3.4 Pronoun definitions
  2.4 Conclusions

3 PATTERN-BASED EXTRACTION
  3.1 Introduction
  3.2 Evaluation metrics
  3.3 State of the art
  3.4 Pre-processing the documents
    3.4.1 Document format
    3.4.2 Linguistic annotation
  3.5 Grammars of definitions
    3.5.1 General regular expressions
    3.5.2 Grammar for is definitions
    3.5.3 Grammar for verb definitions
    3.5.4 Grammar for punctuation definitions
    3.5.5 Grammar for pronoun definitions
  3.6 Lxtransduce
  3.7 Results
    3.7.1 Results for is definitions
    3.7.2 Results for verb definitions
    3.7.3 Results for punctuation definitions
    3.7.4 Results for pronoun definitions
    3.7.5 Overall results
    3.7.6 Results with basic grammars
    3.7.7 Results for other
  3.8 Qualitative evaluation
  3.9 Conclusions

4 MACHINE LEARNING
  4.1 Introduction
  4.2 Machine learning components
    4.2.1 Identification of data
    4.2.2 Data pre-processing
    4.2.3 Feature selection
    4.2.4 Algorithm selection
    4.2.5 Training and test set
    4.2.6 Evaluation metrics
    4.2.7 Parameter tuning
  4.3 Related research
  4.4 Machine learning for glossary creation
    4.4.1 Identification of data
    4.4.2 Data pre-processing
    4.4.3 Feature selection
    4.4.4 Algorithm selection
    4.4.5 Training and test set
    4.4.6 Evaluation metrics
    4.4.7 Parameter tuning
  4.5 Conclusions

5 MACHINE LEARNING RESULTS
  5.1 Introduction
  5.2 Individual settings
    5.2.1 Sub settings within individual settings
    5.2.2 Comparing the individual settings
    5.2.3 Ranking the individual settings
  5.3 Combined settings
    5.3.1 Is definitions
    5.3.2 Verb definitions
    5.3.3 Punctuation definitions
    5.3.4 Pronoun definitions
    5.3.5 All definitions
  5.4 Adding the bigram settings
    5.4.1 Adding bigrams to the individual settings
    5.4.2 Adding bigrams to the combined settings
  5.5 Conclusions

6 CONCLUSIONS AND DISCUSSION
  6.1 Introduction
  6.2 Conclusions
    6.2.1 Definition extraction on the basis of patterns
    6.2.2 Definition classification using machine learning techniques
    6.2.3 Definition extraction for semi-automatic glossary creation
  6.3 Discussion
    6.3.1 The extraction approach
    6.3.2 The features
    6.3.3 The results
  6.4 Main contributions
    6.4.1 Linguistic perspective
    6.4.2 eLearning perspective
    6.4.3 Development perspective
  6.5 Future research

APPENDICES
A LT4ELANA DTD
B NON-DETECTED SENTENCES
C WEKA INPUT: ARFF FILE
D BIGRAM PROPERTIES
E CONNECTOR PHRASES
F PARAMETER TUNING EXPERIMENTS
G RESULTS FOR THE INDIVIDUAL SETTINGS
H MACHINE LEARNING RESULTS

BIBLIOGRAPHY
SAMENVATTING IN HET NEDERLANDS (Summary in Dutch)

Acknowledgements

Writing a dissertation often means that you are working alone and sometimes you can get the feeling ‘I’m a poor lonesome researcher’1. Luckily, there were many people who accompanied me during my research journey, which made my research and writing process less lonesome. It’s the invaluable support of these people that made it possible to write my dissertation.

First and foremost I would like to thank my supervisor and co-promotor Paola Monachesi. I probably would never have written a PhD thesis without her support and guidance. She supervised my Master’s thesis and asked me to join the Language Technology for eLearning (LT4eL) project as a researcher in December 2005. I always liked doing research, but I never thought of writing a PhD thesis. During the project, I became more and more interested in the topic of definition extraction for glossary creation. I am grateful to UiL OTS for giving me the opportunity to work out my ideas and experiments as a PhD student. I am also indebted to my promotor Jan Odijk for his contributions, insights and detailed comments.

I would like to thank all the members of the LT4eL project. A special thanks to Lothar Lemnitzer, who guided the definition extraction work within the project, and to Claudia Borg, Rosa Del Gaudio, and Łukasz Degórski from the machine learning group. It has always been inspiring and great to work with you! I am grateful to Eelco Mossel, Miroslav Spousta, and Łukasz Kobyliński for their help in writing the scripts that were needed for conducting the experiments.

Nicole Grégoire, Eelco Mossel, Claudia Borg and Marieke Westerhout, thank you very much for reading (parts of) my thesis and giving suggestions for improving the content, style and structure of the thesis.

The last people I would like to thank are my family and friends. You were always willing to offer distraction and moral support when I needed it. Taking the time to relax and to do completely ‘thesis-unrelated’ things was essential and you pushed (or sometimes even forced) me to do this regularly.

1 Adapted version of Lucky Luke’s quote (‘I’m a poor lonesome cowboy’)

Plato defined man thus: “Man is a two-footed, featherless animal;” and was much praised for the definition; so Diogenes plucked a cock and brought it into his school, and said, “This is Plato’s man.” (Diogenes Laertius)

1 Introduction

It happens to everyone: you are reading a book and you encounter a word you do not know. Generally, you try to infer the meaning of the word from the context it is in, but if this is not enough you must turn to a dictionary or glossary, if available. Glossaries are commonly found in technical literature, manuals and textbooks. They provide definitions for a list of terms that are discussed in the book. In an online learning environment, teachers provide digital learning materials to their students. More often than not, these documents do not contain a glossary. Since teachers do not have the time to create them, the learner needs to search for the definitions of the important terms himself. Some definitions may be mentioned somewhere in the book, but it would take the learner a lot of time and effort to locate them. An alternative and probably quicker solution would be to search for the definition in an external source, such as an encyclopaedia. In this case, some effort is required from the learner as well. An additional problem is that for many terms there exists more than one definition. The learner has to be able to select the correct explanation that matches the meaning intended in the text.

A method to automatically retrieve definitions from texts is of great value to the learner. He can use it to compile a list of definitions on the basis of the text he has to study. This assures him that the definitions capture the correct meaning of the terms, and it enables him to spend more time on the actual learning process. To create such a tool, it is first necessary to understand what constitutes a good definition. Humans are very good at distinguishing definitions from other sentences. Even when they do not understand the content of the definition, they can nevertheless judge whether or not a sentence is a definition. The challenge we are facing is to develop a method that is able to distinguish definitions from non-definitions automatically.

This thesis presents research on the automatic extraction of definitions from electronic texts. The focus is on one of the applications in which definitions are relevant – the glossary creation context. The Language Technology for eLearning project (LT4eL)1 has been the starting point of our research. The aim of this project was to develop solutions that facilitate the retrieval of documents in an online learning environment through the use of language technologies and semantic knowledge. One of the tasks within the project involved the development of a tool to create glossaries on the basis of documents semi-automatically. This glossary creation tool can be embedded in an online learning environment to offer tutors and learners the possibility to semi-automatically compile glossaries for their learning materials.

1 http://www.lt4el.eu/

A sequential combination of a pattern-based approach and machine learning techniques has been used to extract definitions. In the pattern-based step, each sentence is compared against a set of patterns that are common to definitions, so as to extract all sentences that have a definition pattern. Since many definition patterns can be used in non-definitions, the pattern-based approach returns a number of non-definitions as well. Machine learning techniques based on several types of information are employed to filter out these non-definitions. With this combined approach, the majority of the definitions in each text can be found. Although there are still some non-definitions left, the number of extracted non-definitions has been reduced considerably by the machine learning classifier.
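To make the two-step architecture concrete, the sketch below imitates it in Python. This is only an illustration of the control flow: in the thesis the pattern step is implemented with lxtransduce grammars over linguistically annotated Dutch (Chapter 3) and the filtering step with trained classifiers (Chapters 4 and 5). The patterns and the toy classifier here are invented stand-ins.

    import re

    # Invented stand-ins for the definition patterns of Chapter 3; the real
    # grammars operate on part-of-speech-tagged Dutch, not raw English text.
    PATTERNS = [
        re.compile(r"\b(is|are) (a|an|the)\b"),               # 'is' definitions
        re.compile(r"\b(means|consists? of|stands? for)\b"),  # 'verb' definitions
    ]

    def pattern_step(sentences):
        """Step 1: keep every sentence that matches a definition pattern."""
        return [s for s in sentences if any(p.search(s) for p in PATTERNS)]

    class ToyClassifier:
        """Placeholder for the trained machine learning classifier."""
        def is_definition(self, sentence):
            # A real classifier combines many features; this toy rule only
            # rejects candidates whose subject is the pronoun 'it'.
            return not sentence.lower().startswith("it ")

    def ml_step(candidates, clf):
        """Step 2: filter out the non-definitions that step 1 let through."""
        return [s for s in candidates if clf.is_definition(s)]

    sentences = [
        "Gnuplot is a program for drawing graphs.",
        "It is a pity that the demo is broken.",
        "The weather was fine yesterday.",
    ]
    print(ml_step(pattern_step(sentences), ToyClassifier()))
    # -> ['Gnuplot is a program for drawing graphs.']

The pattern step keeps the first two sentences (both contain 'is a'), and the filtering step then discards the second, which matches a definition pattern without being a definition.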

1.1 DEFINITIONS

The Longman dictionary defines the word ‘definition’ as “A phrase or sentence that says exactly what a word, phrase, or idea means”. In some contexts, such as the dictionary context, it is indeed important that this is the case and that accurate definitions are used. The example on Plato and Diogenes at the beginning of this chapter illustrates that misunderstandings can arise when a definition does not exactly describe the meaning of a term. However, there are also situations that do not require a phrase that ‘says exactly’ what something means, and in such cases a weaker version of definition suffices: “A phrase or sentence that makes clear what a word, phrase, or idea means”. It is thus not necessary that the exact meaning is given; it is enough to provide a description of how to employ the word, e.g. by providing a description of its function. To distinguish between the two types of definitions, the phrases that fit the Longman dictionary description are called ‘narrow definitions’ whereas the others are referred to as ‘broad definitions’. The distinction between broad and narrow definitions is not absolute. Example 1 shows three possible definitions for the term ‘HTML’ in which the level of detail differs. Example 1a is a narrow definition that provides detailed information on the term whereas example 1c is a broad definition which gives only the class to which the term belongs. In example 1b, the level of detail is in between the two others – the definition is less detailed than the first one, but more specific than the last one.

(1) a. HyperText Markup Language (afgekort HTML) is een opmaaktaal voor de specificatie van documenten op het World Wide Web ontwikkeld door Tim Berners Lee in 1990.
       HyperText Markup Language (for short HTML) is a markup language for the specification of documents on the World Wide Web developed by Tim Berners Lee in 1990.
       ‘Hypertext Markup Language (HTML for short) is a markup language for the specification of documents on the World Wide Web developed by Tim Berners Lee in 1990.’

    b. HyperText Markup Language (afgekort HTML) is een opmaaktaal voor de specificatie van documenten op het World Wide Web.
       HyperText Markup Language (for short HTML) is a markup language for the specification of documents on the World Wide Web.
       ‘Hypertext Markup Language (HTML for short) is a markup language for the specification of documents on the World Wide Web.’

    c. HyperText Markup Language is een opmaaktaal.
       HyperText Markup Language is a markup language.
       ‘Hypertext Markup Language is a markup language.’

Depending on the context and situation it may sometimes be necessary to use narrow definitions whereas at other times both broad and narrow definitions can be employed. Narrow definitions clearly have to be used in the dictionary context, since the purpose of dictionaries is to provide users with an accurate, non-ambiguous description of an unknown term. In this context, broad definitions may cause misunderstandings, which is exactly what a dictionary strives to prevent. A context in which a narrow view on definitions may not be necessary is the glossary context on which our research focuses. The main purpose of glossary definitions is to provide the reader with an explanation of the most important terms. Broad definitions will often suffice in this area, since the learner may already have some knowledge about the domain. As a consequence, definitions do not necessarily have to be narrow definitions, since broad definitions often provide enough information. In addition to the dictionary and the glossary context, two other areas in which definitions are relevant are question answering and ontology building. Section 1.4 compares the glossary creation domain to the three other applications.

Figure 1.1 Elements of a definition

Definitions can be grouped in different ways. The classification methodology adopted within this thesis is a pattern-based one (Westerhout and Monachesi, 2007a). It is based on the assumption that both broad and narrow definitions consist of (at least) three elements, namely a definiendum, a definiens and a connector (Walter and Pinkal, 2006). Figure 1.1 shows an example definition in which these three constituents are present. The definiendum, often the subject, is the element that is defined (Latin: that which is to be defined). The definiens provides the meaning of the definiendum (Latin: that which is doing the defining). The definiendum and the definiens are linked via the connector, which can be a verbal phrase or a punctuation character. The connector indicates what the relation is between definiendum and definiens.

Type          Example
is            Gnuplot is a program for drawing graphs.
verb          A topic map consists of associated topics and represents knowledge.
punctuation   SMS messages: short text messages of up to 160 characters.
pronoun       Dedicated readers. These are special devices, developed to make it possible to read e-books.

Table 1.1 Examples for each of the definition types

Westerhout and Monachesi (2007a) proposed four types of definitions on the basis of the three elements (Table 1.1). Three of them can be characterized on the basis of their connector pattern. The first and most common connector is a form of the auxiliary verb ‘to be’. The second type of connector includes all other verbal phrases that are employed in definitions, such as ‘mean’ and ‘consist of’. Punctuation characters like colons and brackets constitute the third group of connectors. The last type is different from the others, since in this type a pronoun – sometimes in combination with a connector – is the distinctive feature. The pronoun refers to a definiendum or definiens in a previous (part of the) sentence.
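As a rough illustration of how these connector patterns can be operationalized, the toy regular expressions below tag the English examples of Table 1.1 with their type. The expressions are deliberately simplistic assumptions made for this sketch; the actual grammars of Chapter 3 are far richer and operate on annotated Dutch.

    import re

    # Toy patterns, one per definition type of Table 1.1 (simplified guesses,
    # not the grammars of Chapter 3).
    TYPES = {
        "is":          re.compile(r"^\w[\w\s]* (is|are) (a|an) "),
        "verb":        re.compile(r"\b(consists of|means|stands for|refers to)\b"),
        "punctuation": re.compile(r"^[\w\s]+: "),
        "pronoun":     re.compile(r"\. (This|These|It|They) (is|are)\b"),
    }

    examples = [
        "Gnuplot is a program for drawing graphs.",
        "A topic map consists of associated topics and represents knowledge.",
        "SMS messages: short text messages of up to 160 characters.",
        "Dedicated readers. These are special devices, developed to make "
        "it possible to read e-books.",
    ]
    for sentence in examples:
        matched = [name for name, pat in TYPES.items() if pat.search(sentence)]
        print(matched, "-", sentence)
    # -> ['is'], ['verb'], ['punctuation'], ['pronoun']: one type per example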

1.2 THE LANGUAGE TECHNOLOGY FOR ELEARNING PROJECT

The glossary candidate detector has been developed for an eLearning environment. The term eLearning is the collective term for modeling formal and informal learning situations using information and communication technology, especially Internet technology. Integrating language technology techniques and tools in eLearning applications can enhance the quality and speed of the learning process. Since such tools are intended to improve the learning experience of the learners, they should offer learners added value from a pedagogical perspective. In addition, they have to be easy and intuitive to use and should speed up the learning process.

The development of the Dutch glossary candidate detector that is presented in this thesis has been initiated within the Language Technology for eLearning (LT4eL) project (2005-2008)2. The aim of this European project was to develop multilingual language technology techniques and tools to be integrated into eLearning applications (Monachesi et al., 2006). The languages represented in the project are Bulgarian, Czech, Dutch, German, English, Polish, Portuguese, and Romanian. In addition to the glossary candidate detector, a keyword extractor, a domain ontology, and semantic search on the basis of this ontology have been developed within LT4eL. The LT4eL tools and ontology have been integrated in a Learning Management System (LMS). An LMS is an important application within the eLearning domain used for delivering, tracking and managing learning activities. Most LMSs are web-based to facilitate access to learning content and administration. For evaluation and validation purposes, the LT4eL tools and resources have been integrated into the LMS ILIAS3.

2 http://www.lt4el.eu/
3 http://www.ilias.de/

                          #
Documents                45
Sentences             31552
Words                420202
Definition sentences    709

Table 1.2 Corpus statistics for the Dutch LT4eL corpus

Within the LT4eL project, a corpus has been collected and annotated for the development of the definition extractor, keyword detector, and ontology in each of the eight LT4eL languages. The documents of the corpus are all learning objects. Furthermore, from the large variety of possible learning objects that are available on the web (e.g. manuals, videos, slides, lecture notes), only textual documents have been selected for the LT4eL corpus, since the aim of the LT4eL project was to apply natural language processing techniques to the documents. The Dutch LT4eL corpus consists of 45 text documents in the domains of computing and eLearning. It contains descriptive documents on computing and eLearning, manuals on computer programs and systems, such as Word, LaTeX, and Unix, and documents introducing academic skills to students, such as preparing a presentation or writing a paper. Table 1.2 provides some general statistics of the Dutch LT4eL corpus.

The definition extractor implemented in the glossary candidate detector has been developed and evaluated on the basis of this corpus. The definitions from the corpus have been investigated to gain insight into the ways in which definitions are used in learning objects. The first version of the glossary detector presented in this thesis has been developed within the context of the LT4eL project and is based on the patterns observed in the manually selected definitions.

Figure 1.2 A screenshot of the integrated glossary candidate detector within ILIAS

Figure 1.2 shows a screenshot of the LT4eL glossary candidate detector integrated within the LMS ILIAS. The tool takes as input a learning object (TXT, PDF, DOC, or HTML) and offers as output a list of definition candidates. Each candidate is surrounded by two context sentences, namely one sentence to the left and one sentence to the right. The user can select the correct definitions from the list and adjust or extend them on the basis of the left and right context information and his or her own knowledge.
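A minimal sketch of this context presentation: given the positions of the candidate sentences in a document, each candidate is paired with its left and right neighbour. The function and names are illustrative stand-ins, not the tool’s actual code.

    def with_context(sentences, candidate_indices):
        """Pair each candidate sentence with one sentence of left and
        right context, as shown in the glossary candidate detector."""
        triples = []
        for i in candidate_indices:
            left = sentences[i - 1] if i > 0 else ""
            right = sentences[i + 1] if i + 1 < len(sentences) else ""
            triples.append((left, sentences[i], right))
        return triples

    doc = [
        "Many tools exist for plotting.",
        "Gnuplot is a program for drawing graphs.",
        "It runs on most operating systems.",
    ]
    print(with_context(doc, [1]))
    # [('Many tools exist for plotting.',
    #   'Gnuplot is a program for drawing graphs.',
    #   'It runs on most operating systems.')]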

1.3 DEFINITIONS IN ELEARNING

Glossaries can play an important role within eLearning since they support a learner in understanding the learning object he is confronted with by explaining the central concepts which are being conveyed in it. In this way, glossaries can enhance the learning process, since learners do not have to search for the definitions themselves. A possible way to create glossaries for learning objects would be to link existing glossaries to them. However, an obvious shortcoming of this approach is that the learner could be confronted with many definitions for a term he is looking for instead of only the definitions that are appropriate in the given context. Another disadvantage of this option is that very detailed glossaries would be necessary to ensure that all relevant terms of each randomly selected document within the domain are defined in it. Moreover, in several domains, such as the computing domain, new terms are often introduced which will not be directly available in existing dictionaries or glossaries.

An alternative – the one adopted in our research – is to build glossaries on the basis of the learning objects themselves. By doing so, the exact definition as employed by the author of a certain document is captured; this definition often deviates from a more general definition of the term or a definition of the same term in another domain or context. The learning process is thus facilitated by providing the learner with the most appropriate definition of the concept he is not familiar with. In addition, the glossary gives the learner a quick impression of the content of the learning object, so that it can serve as a summary. On the basis of this summary, he can not only learn the most important terms, but also decide whether or not the document is relevant for him.

In addition to searching for narrow definitions, the purpose within the glossary creation context is to detect broad definitions as well. They can contain, for example, information on how a concept should be used or what its purpose is. In the context of glossary creation, where the glossary has to provide learners with the meaning of unknown concepts to improve their learning process, broad definitions are useful as well, since they generally provide the learner with enough information to understand a term.

A second consequence of applying definition extraction within the domain of glossary creation in eLearning is that it is important to obtain a high recall. A high recall means that most of the definitions that are present in a text are detected. The glossary is based on a learning object and should contain definitions for its key terms. For the terms for which no definition is detected automatically, the learner has to find or formulate one himself. Clearly, this leads to an increased workload for the learner.

The inevitable counter-effect of assigning more importance to obtaining a high recall is that the precision will decrease. The precision indicates which proportion of the retrieved sentences are indeed definitions. The goal is to find a good balance between finding as many relevant definitions as possible (high recall) on the one hand and keeping the number of extracted non-definitions as low as possible (precision not too low) on the other.
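This trade-off can be made concrete with the standard precision and recall formulas and a recall-weighted F-measure. The counts below are invented for illustration, and the beta value is only an example of how recall can be weighted more heavily; the evaluation metrics actually used are presented in Section 3.2.

    def scores(tp, fp, fn, beta=2.0):
        """Precision, recall, and F-beta; beta > 1 weights recall more
        heavily than precision."""
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f_beta = ((1 + beta ** 2) * precision * recall
                  / (beta ** 2 * precision + recall))
        return precision, recall, f_beta

    # Invented example: 90 of 100 definitions found, with 210 non-definitions
    # extracted along with them.
    p, r, f2 = scores(tp=90, fp=210, fn=10)
    print(p, r, round(f2, 2))  # 0.3 0.9 0.64

With these numbers the precision is low (0.3), but because recall is high (0.9) the recall-weighted score remains reasonable, which matches the priority described above.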

1.4 DEFINITIONS IN OTHER APPLICATIONS

In addition to glossary creation, Section 1.1 mentioned three other possible applications in which definitions play a role: question answering (QA), dictionary building and ontology building. This section compares these three applications with the glossary creation application.

Glossary creation versus question answering The most important difference between glossary creation and QA is that in QA the definiendum is known in advance, since it is already included in the question (cf. Voorhees (2002); Joho and Sanderson (2000); Saggion (2004)). A question analysis component is generally employed in QA to limit the number of definition candidates to a set of sentences that contain this definiendum before the definitions are extracted. For example, when the question ‘What is XML?’ has to be answered, the set of possible answers can be restricted to sentences containing the term XML or one of its synonyms. In the glossary creation context, the terms that will be included in the glossary are not known in advance. As a consequence, it is not possible to start from the definiendum. Instead, the extraction has to start from the characteristics inherent to definitions themselves, such as the lexico-syntactic patterns used: any pattern that can be used in a definition could be relevant and needs to be selected.

Another practical difference between QA and glossary creation is that in the glossary creation context one does not have to select the best answer(s) to a certain question, as in QA (cf. Joho and Sanderson (2000); Han et al. (2007)), but has to decide for each individual sentence whether or not it can be a definition. The user can either select the best one or combine different definitions into one definition that covers all important aspects of the term. In fact, multiple definitions for the same term in the set of extracted sentences can be quite useful for the creation of a glossary, since a combination of different definitions may well provide more insight into a term.
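The difference in starting point can be sketched as follows: QA first narrows the candidate set by the known definiendum, whereas glossary creation must scan every sentence against the definition patterns. The pattern and function names are illustrative only.

    import re

    def qa_candidates(sentences, definiendum):
        """QA: the term is known from the question, so candidates are
        first restricted to sentences that mention it."""
        return [s for s in sentences if definiendum.lower() in s.lower()]

    def glossary_candidates(sentences, patterns):
        """Glossary creation: no term is known in advance, so every
        sentence is matched against the definition patterns themselves."""
        return [s for s in sentences if any(p.search(s) for p in patterns)]

    doc = [
        "XML is a markup language for structured documents.",
        "Gnuplot is a program for drawing graphs.",
    ]
    print(qa_candidates(doc, "XML"))                            # the XML sentence only
    print(glossary_candidates(doc, [re.compile(r"\bis a\b")]))  # both sentences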

Glossary creation versus dictionary building Definition extraction for dictionary building is more closely related to the glossary creation context (Muresan and Klavans, 2002). Just as in glossary building, the terms that should end up in the dictionary are not always known beforehand. At first sight, definition extraction for glossary creation within eLearning and definition extraction for dictionary building might even seem to be the same task. However, there are at least two differences between the tasks.

The first difference is related to the corpus that is used. In the glossary creation context, the ‘corpus’ from which the information has to be extracted consists of only one document, and as many definitions as possible should be detected within this text. In dictionary building, a large corpus can be used, since the aim is to find the best definition for each term. The consequence is that in glossary creation it is more crucial to obtain a high recall. In this application it is not enough to include only the most common definition patterns, that is, the definitions in which a form of the verb ‘to be’ is used as connector. Since in dictionary building a larger corpus can be considered to find definitions, it is acceptable there to focus only on such common definitions.

A second difference is that a glossary may include a definition which is not the commonly accepted definition of a word, but only a meaning attached to it within that specific text. In the glossary, the description used by the author is important, whereas a dictionary definition should express a common meaning of a word.

The creation of specialized dictionaries is more similar to our context. In this case, only the first restriction applies, since it is still possible to use a larger corpus for the creation of the dictionary. Just as in the glossary creation context, the extraction method for specialized dictionaries should be able to differentiate between ambiguous terms and to select the appropriate definition.

Glossary creation versus ontology building Definitions can be used in two ways in the ontology building context. The first application of definitions is to describe concepts from an ontology. For the extraction, one can either start from the definiendum (the concept) or from the pattern of the definition. If the first option is selected, the extraction procedure is similar to the QA context, where the term is known in advance as well. The alternative is to start from the definition pattern. In this case, the situation is similar to the glossary creation context. The main difference with the glossary creation context in this respect is that definitions for ontologies need to be narrow, since they have to provide the exact meaning of the term.

The second way in which definitions can be used in ontology building is for the extraction of semantic relations between the definiendum and the definiens (Storrer and Wellinghof, 2006; Walter and Pinkal, 2006). A semantic relation expresses how two terms or phrases are related to each other. Often, this semantic relation is expressed by the verbal phrase that connects the two parts. The verbal patterns used in the definitions for glossaries contain different types of such semantic relations: apart from the ‘is a’ relation, these are relations such as ‘consist of’ and ‘used for’. Ontology developers can profit from the definition extraction work by using these relations to enrich ontologies automatically: new concepts can be added and new relations can be expressed on the basis of the definitions. However, although the number of semantic relations is higher in the glossary creation context, there are still relevant relations that are not addressed (e.g. ‘is part of’). These could be included by extending the verbal patterns currently addressed by the glossary creation tool.

1.5 THE EXTRACTION OF DEFINITIONS

For each of the four applications in which definitions play a role, research has been carried out to detect them automatically. Different approaches have been investigated to find out which method produces the best results. Generally, these approaches are independent of the context in which they are employed, although the exact way in which they are used can differ depending on the application.

Initially, automatic definition extraction has mainly been addressed using pattern-based approaches. These approaches are often based on patterns represented by regular expressions that are common in definitions (Joho and Sanderson, 2000; Prager et al., 2002) or on a more complex representation of the lexico-syntactic patterns of sentences (Muresan and Klavans, 2002; Saggion, 2004; Storrer and Wellinghof, 2006; Walter and Pinkal, 2006). More recently, the pattern-based approaches have been complemented with or completely replaced by machine learning approaches (Blair-Goldensohn et al., 2004; Miliaraki and Androutsopoulos, 2004; Fahmi and Bouma, 2006).

In our research – which started in 2005 – initially a pattern-based approach inspired by the work of Muresan and Klavans (2002) has been adopted (Westerhout and Monachesi, 2007a). During the development of the pattern-based approach, a number of shortcomings emerged that are difficult to address with the pattern-based approach itself. The most important problem is the fact that definition patterns are often used in non-definitions as well. To solve this problem, a machine learning approach is applied after the pattern-based approach as a filtering or refining step (Westerhout and Monachesi, 2007b, 2008; Westerhout, 2009a,b). An advantage of applying machine learning techniques is that the system can be used to investigate, for a set of manually selected features, which of them are best at distinguishing definitions from non-definitions. This thesis investigates which combinations of features perform best. Figure 1.3 shows the steps that have been followed in this thesis in the process of extracting a glossary from a document. The purely pattern-based approach is presented in Chapter 3 and the pattern-based approach enhanced with machine learning is described in Chapters 4 and 5.

Figure 1.3 From document to glossary

When applied to the domain of glossary creation for eLearning, the current approaches for definition extraction suffer from a number of shortcomings which are addressed in this thesis. A first shortcoming is a side-effect of the fact that most definition extraction research has been carried out in the context of question answering. For this application, the identification of only a limited number of definition types is often enough. As a consequence, it has not been necessary to use a more extended typology. The typology employed in this thesis also considers definition types that are observed less frequently, such as punctuation and pronoun definitions.

A second disadvantage of existing approaches is that most of them extract definitions from very structured texts, such as Wikipedia. They have not been designed for dealing with randomly selected learning objects, which are generally less structured than Wikipedia articles. The LT4eL corpus employed for developing and testing our application contains texts of different lengths (1-100 pages), styles (e.g. descriptive, manuals), levels (from beginners to experts), and genres. As all texts have been collected from the web, they give a good impression of what can be expected from web documents.

The third shortcoming is that it has not been investigated in detail which characteristics of definitions perform best on the classification task. This thesis presents experiments that implement a large variety of definition characteristics (e.g. linguistic, position, layout) and investigates which are the most suitable ones for the classification of definitions.

1.6 RESULTS

As explained in the previous section, the extraction approach adopted in this thesis consists of two steps. In the first step, the lexico-syntactic patterns of definitions have been used to extract all sentences having a definition pattern from the LT4eL corpus. Not only is definitions but also the other types of definitions are addressed. The results are evaluated for each of the four definition types individually. The evaluation shows that it is possible to obtain a high recall with this approach. It is more difficult to get a high precision, since the definition patterns are often used in non-definitions as well. Especially the punctuation and pronoun definitions turn out to be problematic in this respect. Although the precision scores for the is and verb definitions are better than those for the punctuation and pronoun definitions, there is room for improvement for these types as well.

In order to improve the low precision scores obtained with the pattern-based approach, machine learning classifiers have been trained. Since the recall obtained with the pattern-based method is acceptable, the sentences extracted with this approach have been used as input. The advantage of using these data is that a large number of non-definitions has been filtered out already. In the machine learning experiments, six types of features have been distinguished to filter out non-definitions. These features include linguistic n-grams, linguistic properties of definiendum and definiens, information on the connector, position, layout, and importance of the defined terms (referred to as ‘keyword information’). The best results are generally obtained with the linguistic, connector and position information. The layout and keyword information cannot be used individually to predict whether or not a sentence is a definition. When the different settings are combined, all types of features can be relevant, and it depends on the definition type which setting combination gives the best results. In all cases, the precision obtained with the pattern-based approach is improved. In particular, the results for the punctuation and pronoun definitions improve considerably.
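The six feature groups can be pictured as one record per candidate sentence, roughly as below. Every field name here is a hypothetical stand-in; the actual features and their encodings (the Weka ARFF input of Appendix C) are defined in Chapter 4.

    def feature_vector(c):
        """Assemble the six feature groups for one candidate sentence.
        All field names are illustrative stand-ins."""
        return {
            # 1. linguistic n-grams (e.g. part-of-speech bigrams)
            "pos_bigrams": c["pos_bigrams"],
            # 2. linguistic properties of definiendum and definiens
            "definiendum_has_article": c["definiendum_has_article"],
            # 3. the connector that links the two parts
            "connector": c["connector"],
            # 4. position of the sentence within the document
            "relative_position": c["sentence_index"] / c["sentence_count"],
            # 5. layout (e.g. whether the term is rendered in bold)
            "term_in_bold": c["term_in_bold"],
            # 6. keyword information: importance of the defined term
            "term_is_keyword": c["term_is_keyword"],
        }

    candidate = {
        "pos_bigrams": ["NOUN_VERB", "VERB_DET"],
        "definiendum_has_article": False,
        "connector": "is",
        "sentence_index": 12,
        "sentence_count": 240,
        "term_in_bold": True,
        "term_is_keyword": True,
    }
    print(feature_vector(candidate))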

1.7 OUTLINE OF THE THESIS

In Chapter 2, an overview is presented of different methodologies that have been used in the past to classify definitions. The methodology that has been adopted for the classification of the LT4eL definitions is pattern-based and takes the lexico-syntactic patterns of the definitions as its starting point. To assess how the distinguished definition types are actually used, the definitions from the LT4eL corpus have been classified on the basis of this methodology. Chapter 3 presents the pattern-based approach and describes how the lexico-syntactic patterns from Chapter 2 can be formalized using regular expressions. The results of the pattern-based approach are the input for the machine learning approach that is discussed in Chapter 4. In Chapter 4, the focus is on the selection of appropriate features for the classification of definitions. The results achieved with the different classifiers are presented in Chapter 5. The last chapter outlines the conclusions that can be drawn on the basis of the work carried out and discusses some directions that could be explored in future research on definition extraction.

’What I was going to say,’ said the Dodo in an offended tone, ’was, that the best thing to get us dry would be a Caucus-race.’ ’What is a Caucus-race?’ said Alice; not that she wanted much to know, but the Dodo had paused as if it thought that somebody ought to speak, and no one else seemed inclined to say anything. ’Why,’ said the Dodo, ’the best way to explain it is to do it.’ (Lewis Carroll (1832-1898), Alice in Wonderland)

2 Definitions

2.1 INTRODUCTION

Research on definitions has a long history. The philosophers Aristotle and Plato were the first to discuss the philosophical question of what a definition is. Later research on this topic mainly focused on the different methods that can be used to define a term. These can be very diverse and range from using gestures or actions to show what something means to describing a word by using examples.

A possible definition method is illustrated in the example of the Dodo at the beginning of this chapter. The Dodo explains the meaning of a ‘Caucus-race’ by demonstrating what it is. Whereas in this case no language is involved, most definition methods use words to explain a term.

Section 2.2 shows in which ways definitions have been classified in the course of history. The most common classification methodology has been the use of semantic properties to determine the class of a definition. The overview ends with a specification of the methodology that has been proposed to categorize the corpus definitions – the pattern-based classification methodology.

Section 2.3 describes how this methodology has been used to divide the definition patterns into four groups. The patterns of the different definition types have been examined to detect similarities and differences. This investigation constitutes the basis for the pattern-based extraction approach that will be presented in Chapter 3.

2.2 CLASSIFICATION OF DEFINITIONS

The history of definitions goes back to the 4th century BC, the time of Plato and Aristotle. Since that time, the notion of definition has evolved over the centuries. For the purpose of definition extraction, it is not necessary to know the details of the philosophical debates on the question of what a definition is. Instead, the focus is on the different definition types that have been distinguished in history from Plato to the 21st century, since for my research, which focuses on definition extraction, this is the most important aspect.

2.2.1 REAL AND NOMINAL DEFINITIONS

When Plato introduced the notion of definition in the 4th century BC, he only thought of definitions of things and not of words or symbols. His vision on definitions is illustrated in ‘The Allegory of the Cave’, in which he describes a fictional conversation between Socrates (Plato’s teacher) and Glaucon (Plato’s brother) (Plato, 514a-541b)1. The allegory describes a group of prisoners who have lived chained in a cave all of their lives, facing an empty wall. Things that pass by are projected as shadows on this wall and are watched by the prisoners. The prisoners begin to ascribe forms to these shadows, but do not get closer to seeing reality than the limited reflection shown by the shadows. Related to definitions, Plato states that people can only get restricted insight into reality. He takes the concept ‘Good’ as an example of a concept for which it is very difficult to provide a definition. The allegory states that it is not enough to give examples to understand what ‘Good’ or ‘Goodness’ is and that there is a difference between ‘Goodness’ itself and its many appearances. The appearances are only shadows of ‘Good’, and do not refer to the concept ‘Good’ itself. One of the conclusions from the allegory is that it is not possible to see Goodness at all. What we see are many good things, but not Goodness itself. From the problem of defining ‘Goodness’, Plato became interested in the general problem of how one can find the absolute characteristic shared by many examples that exactly defines a term. A definition is the end of the process of getting to know the most real things there are. These things, which he called ‘Forms’ or ‘Ideas’, constitute the basis of his ‘Theory of Forms’.

1 To refer to Plato’s works, the Stephanus pagination code has been used instead of the year.

Aristotle, a student of Plato, had a different emphasis than Plato when looking at definitions. According to him, the main point is that a definition is “a phrase which signifies the what-is-to-be” (Aristotle, 100a)2,3. In one of the six parts of the Organon, Aristotle provides a more detailed description of what he considers to be definitions (Aristotle, 71a)4:

Since a definition is said to be an account of what something is, it is clear that one type will be an account of what its name, or some other name-like account, means – e.g. what triangle means. [...] One definition of definition is the one just stated. Another definition is an account which shows why something exists. Hence the former type means something but does not prove it, whereas the latter will clearly be like a demonstration of what something is, differing in arrangement from a demonstration. For there is a difference between saying why it thunders and what thunder is. In the one case you will say: Because the fire is extinguished in the clouds. But: What is thunder? – A noise of fire being extinguished in the clouds. Hence the same account is given in different ways: in one way it is a continuous demonstration, in the other a definition. Again, a definition of thunder is noise in the clouds; and this is a conclusion of the demonstration of what it is. One type of definition, then, is an indemonstrable account of what something is; another is a deduction of what something is, differing in aspect from a demonstration; a third is a conclusion of the demonstration of what something is.

The three different types of definitions described in the fragment are shown in Table 2.1.

type                               example
indemonstrable account             A triangle is a polygon with three corners
deduction based on demonstration   Thunder is a noise of fire being extinguished in the clouds
conclusion of demonstration        Thunder is noise in the clouds

Table 2.1 Examples of the definition types distinguished by Aristotle

2 To refer to Aristotle’s works, the Bekker number codes have been used instead of the year.
3 Aristotle. Topics (100a). Book I, chapter 5, page 4.
4 Aristotle. Posterior Analytics (71a). Book II, chapter 10, page 58.

The characteristic of the first type is that no existence claim is made; that is, definitions of this type only state how to use a term and do not say anything about why the term exists. The two other types differ from the first in that they state something about the world. In the deduction based on demonstration, the definition describes the cause of existence of a thing. Cause is understood here primarily as the final cause, which is that towards which a thing tends to be. In the third definition type, only the conclusion of the demonstration is presented, without the reason why it exists. This is an abbreviated expression of the second, the only difference being that in this type, from a complex statement of causes, only the last, proximate one is stated.

The emphasis of both Plato and Aristotle was on the second and third type from Table 2.1, as they cared especially about the existence of things and the causes for this existence. The emphasis remained the same until the 17th century. In this century, Locke and Spinoza adopted Aristotle’s insights and shed new light on them. Spinoza (1677) still regarded definitions as primarily about things. To use his own words, “the true definition of each thing neither involves nor expresses anything apart from the nature of the defined thing”5 (Spinoza, 1677). John Locke was the first to propose a distinction between nominal and real definitions, which has been adopted by many others later on. In his ‘Essay concerning Human Understanding’ (Locke, 1690), he used the word ‘gold’ to explain the difference between the two. The real definition of ‘gold’ is “the constitution of the insensible parts of that body, on which those qualities and all the other properties of gold depend” whereas the nominal definition of ‘gold’ is “that complex Idea of the word Gold stands for [...], a Body yellow, of a certain weight, malleable, fusible and fixed”.6 Following Locke, a rough way to mark the distinction between real and nominal definitions is thus to say that the former states real essence, while the latter states nominal essence. Depending on the purpose and circumstances in which a word is used, there may be a preference for either one of the two definition types. For example, chemists aim at real definitions, whereas lexicographers aim at nominal definitions (Gupta, 2008).

5 B. de Spinoza. Ethics (1677). Book 1, proposition 8, page 80.
6 J. Locke (1690). An Essay concerning Human Understanding. Book III, VI.2.

The division into nominal and real definitions is the result of long philosophical debates. The term ‘definition’ nowadays generally refers to nominal definitions, since usually the aim is to find out what a word means when one uses it and how words or phrases can be embedded into a larger framework. In this thesis, the focus is on nominal definitions as well, since it is the actual way in which words are employed that matters in the creation of glossaries. Therefore, a more detailed description of (the sub types of) nominal definitions is given in the next section.

Within the main division of definitions into real and nominal definitions, different classifications into subtypes have been proposed. Robinson (1972) makes a distinction between purpose-based and method-based definitions. In the purpose-based classification, the purpose of the definition determines the sub types (e.g. lexical definitions). The method-based categories are inferred from the ways that are employed to define terms (e.g. using examples, giving relations). Other authors do not make the distinction of Robinson (1972), although the sub types they describe are comparable. No author provides a complete overview of definition types; two authors who made an attempt in this direction are Robinson (1972) and Borsodi (1967). The classification methodologies based on the purpose and on the method will be described in Sections 2.2.2 and 2.2.3.

2.2.2 PURPOSE-BASED CLASSIFICATION

In the previous section, nominal definitions have been addressed. According to Robinson (1972) they can be further classified on the basis of their purpose. More specifically, he distinguishes word-word and word-thing definitions. A word-word definition is a definition which states that two words refer to the same entity. An example of this type would be saying that the Dutch word ‘blauw’ means the same as the French word ‘bleu’. In a word-thing definition, the meaning of a word is said to be a certain thing; a word is then correlated to a thing instead of to a word. For example, a word-thing definition of the Dutch word ‘blauw’ could be given by pointing at a blue sky and saying that the Dutch word ‘blauw’ means this colour. Word-thing definitions are more common than word-word definitions, which are most often the entries found in a simple translation dictionary. Within the word-thing definitions, Robinson (1972) distinguishes lexical and stipulative definitions as sub types.

Lexical definition A lexical definition is a “word-thing definition in which we are explaining the actual way in which some actual word has been used by some actual persons”.7 According to Robinson, in lexical definitions three parties are involved: first the definer who is explaining the meaning of the word (in written texts this is the author of the text), second the hearer or reader to whom the meaning is explained, and third the person or persons whose usage of the word gave the term its meaning. Robinson (1972) considers lexical definitions to be different from dictionary definitions. The difference between lexical and dictionary definitions is in his opinion that the third party – that is, the historical aspect – is often ignored in dictionary definitions, which generally provide lexical descriptions of the vocabulary of a respected class and have a more authoritative character. As an example, Robinson provides a lexical and a dictionary definition of the word ‘phlogiston’. The dictionary definition is “the hypothetical principle of fire regarded formerly as a material substance”8 whereas the lexical description is “the cause of fire”, as this is what users of the word meant when they used this term in history. However, most other researchers consider lexical and dictionary definitions to be of the same type (cf. Parry and Hacker (1991)).

7 R. Robinson. Definitions (1972). Chapter III.
8 Merriam-Webster Online dictionary. 11 September 2008.

Stipulative definition A stipulative definition is the “explicit and self-conscious setting up of the meaning relation between some word and some object, the act of assigning an object to a name (or a name to an object), not the act of recording an already existing assignment”.9 The fragment from Through the Looking Glass (Carroll, 1999) clearly illustrates what a stipulative definition is10:

‘But “glory” doesn’t mean “a nice knock-down argument”,’ Alice objected. ‘When I use a word,’ Humpty Dumpty said, in rather a scornful tone, ‘it means just what I choose it to mean – neither more nor less.’ ‘The question is,’ said Alice, ‘whether you can make words mean so many different things.’ ‘The question is,’ said Humpty Dumpty, ‘which is to be master – that’s all.’

So a stipulative definition is a definition that tells what one intends a word to mean: it stipulates that, in the given context, the word has only the stipulated meaning. This kind of definition can be used when a certain word has a specific meaning in one context which is different from its meaning in other contexts.

9 R. Robinson. Definitions (1972). Chapter IV.
10 L. Carroll. Through the Looking Glass (1999). Chapter VI.

2.2.3 METHOD-BASED CLASSIFICATION

Definitions can be classified according to their purpose. An alternative possibility is to examine the methods that are used to reach this purpose. Robinson (1972) distinguishes seven types of nominal definitions based on the method adopted, which can be used for both lexical and stipulative definitions. Borsodi (1967), who does not make the distinction between methods and purposes, provides a list containing nineteen nominal definition types, most of which can be described in terms of the method employed. Several others have distinguished different definition classes; however, they did not intend to provide a complete overview (cf. Johnson (1921), Pepper (1945), Bridgman (1928)). Most of their types have been included in the overviews provided by Robinson (1972) and Borsodi (1967). In what follows, the seven types of definitions proposed by Robinson (1972) will be described in more detail.

Ostensive definitions If the meanings of all words were given only by using words, their meanings would be circular. It is thus necessary that one first gets to know the meaning of some words by using other information sources than just words. To this end, the ostensive method can be used. This method relies on more than words alone and makes use of the learner’s experience of examples. Johnson (1921) was the first to recognize this method and used it mainly as “imposing a name in the act of indicating, presenting or introducing the object to which the name is to apply”. In other words, he used a combination of words and objects present to explain things. Robinson (1972) distinguishes four other sub-methods within this type, in all of which more things than just words are used (e.g. drawings or pointing). The ostensive method is necessary to learn the meaning of some basic words. Other words can then be explained by referring verbally to those ostensively defined words. There is not a fixed set of words that should be defined this way; the crucial point is that some words are defined this way to provide a basis.

Synonymous definitions The method of synonymy is common in dictionaries and consists of providing the learner with a synonym with which he is already familiar. Apart from such standard synonymous definitions, Borsodi (1967) mentions two other types that can be considered synonymous definitions as well, namely derivative and translational definitions. In a derivative definition, a word is defined by reference to its derivation. In English, words often have a Latin or Greek origin. The translational definition provides a translation of an unknown word by giving a known word which has (almost) the same meaning. The analogic definition can also be seen as a type of synonymous definition. Words are in this case defined by referring to a similar object or event which is better known, to clarify their meaning (Borsodi, 1967). Table 2.2 shows an example of each of the four types of synonymous definitions.

sub type       example
synonymous     A foe is an enemy
derivative     definiendum (Latin: that which is to be defined)
translational  ‘Vijand’ is the Dutch word for enemy
analogic       Hyves is something like Myspace

Table 2.2 Examples of synonymous definition sub types

Analytical definitions An analytical definition provides an analysis of a word or phrase to clarify its meaning. Thus, apart from letting us know what it means, the definition also provides an analysis of it. The genus-differentia definition proposed by Aristotle is an analytical definition. It first describes a word by situating it in a broader category (the genus) and then differentiates it from other words in that broad category by mentioning its distinguishing properties (differentiae). The analytical method is often used in dictionaries. Compared to synonymous definitions, analytical definitions are more elaborate. An obvious advantage, however, is that they provide more information than a synonymous definition.

Borsodi (1967) distinguishes five definition types that all fit into this category, namely classificatory, operational, anatomic, qualitative and quantitative definitions. A classificatory definition resembles the genus-differentia definition, but only states the class to which a referent belongs. In an operational definition, a word is defined by describing what it does or can do and for what purpose it may be used. An anatomic definition enumerates a sufficient number of the parts or organs of the word to make its whole nature clear. The qualitative definition states the qualities, aspects, characteristics, or properties of the word and the quantitative definition describes its size, weight, length, area, or any other dimension using numerical mathematical symbols. Table 2.3 shows an example for each of the analytical definition types.

Relational definitions The relational method introduced by Johnson (1921) explains the meaning of a term by taking into account the relations among entities. Robinson (1972) uses the term synthetic definition

sub type           example
genus-differentia  A footnote is a note of text placed at the bottom of a page in a book or document that can provide an author’s comments on the main text or citations of a reference work in support of the text, or both.
classificatory     Python is a programming language
operational        Gnuplot is a program for drawing graphs
anatomic           A table consists of rows and columns
qualitative        Coordinated movement is the simultaneous control along multiple degrees of freedom, which results in an efficient trajectory
quantitative       A mountain is a peak that rises over 2,000 feet (609.6 m)

Table 2.3 Examples of analytical definition sub types

for this method. Theoretically, this type of definition is always possible, as everything stands in a relation to other things. An advantage of this method is that it enables us to reach greater agreement and precision in stipulative definitions. The main disadvantage of the relational method is that it does not give the meaning of the word unless either the word means something logically determined by the relation given or the learner is otherwise acquainted with the thing meant.
The relational definitions can be divided into antonymic and meronymic definitions (Borsodi, 1967). In an antonymic definition, a word is defined by enumerating contrasting words referring to referents of an opposite nature. Meronymic definitions explain a word by situating it between two other terms which refer to anything in between the synonym and antonym of the word to be defined. Table 2.4 shows an example of an antonymic and a meronymic definition.

sub type   example
antonymic  Bad is the opposite of good
meronymic  The present is the moment in time between past and future

Table 2.4 Examples of relational definition sub types

Exemplifying definitions In this method, examples are used to illustrate the meaning of a word or term. An exemplifying definition of the word ‘bird’ is shown in Table 2.5. Although this can be a very effective method, it has at least two disadvantages. First, according to Lewis (1929) it is not possible to give a collection of cases that determines the exact meaning of a term. A second, related objection also mentioned by him is that it is not possible to be sure that the common element extracted is the right one. As a consequence, what is achieved by the exemplificatory method is never better than a hypothesis about the meaning of a word. It is not possible to use this method for defining all terms; mathematical terms, for example, cannot be defined with only examples. Robinson (1972) uses the term denotative definitions for this type whereas Borsodi (1967) does not distinguish this type. Exemplifying definitions can be used when a list of examples provides more relevant information than can be conveyed with other types of definitions, or when listing the members of a set tells enough about the nature of that set. The latter is the case in the exemplifying definition of the word ‘bird’.

type          example
exemplifying  Bird means such things as swans, robins, geese, hens and larks, and not such things as bats, butterflies or aeroplanes

Table 2.5 Example of an exemplifying definition

Contextual definitions A contextual definition puts the word being defined in a context to explain its meaning, as shown in the example in Table 2.6. This method differs from all others in two ways. First, it uses the word being defined instead of mentioning it, whereas the other methods mention it and do not use it. Second, there is no distinction between definiendum and definiens, as no phrase equivalent to the term is provided. Different names have been proposed for this type. Gergonne (1818) introduced it under the name implicative definition, since it provides a sentence that implies what something means, whereas Borsodi (1967) calls it an illustrative definition, since it gives an illustration of a sentence or phrase in which the word being defined is used.

type        example
contextual  A square has two diagonals, and each of them divides the square into two right-angled isosceles triangles

Table 2.6 Example of a contextual definition

Reference definitions A reference definition is a definition in which the author refers to another source of information (e.g. a document or a person). Borsodi (1967) distinguishes three types of reference definitions, namely descriptive, historical and quotational definitions. The descriptive and historical definitions are intended for defining proper names only. In the descriptive definition, a person describes a proper name on the basis of his personal experience. On the other hand, the historical definition uses a historical document to define a name. The third type differs from the others in that it provides a definition of a word instead of a proper name. It does so by quoting one or more phrases in which the word being defined is used by some authority and its meaning is made clear by the context. The example on eLearning provided in Table 2.7 is such a quotational definition.

type         example
quotational  According to the EU, eLearning involves the use “of new multimedia technologies and the Internet to improve quality of learning by facilitating access to resources and services as well as remote exchanges and collaboration” (www.elearningeuropa.info)

Table 2.7 Example of a quotational reference definition

Rule-giving definitions In all previous methods, definitions are used to define a particular (e.g. Utrecht) or a general thing (e.g. goodness). However, there are also words that do not refer to one thing, but have a different meaning depending on the circumstances in which they are used.

These are words like ‘him’, ‘there’, ‘soon’, and ‘yesterday’. They are defined by rule-giving definitions, since it is possible to give general rules that describe what they mean and when they are appropriate. This definition type is quite rare. Generally, such words are learnt by observing others using them (Robinson, 1972). Dictionaries provide definitions of such words by describing the way in which they are used (that is, the rule for using them) and providing some examples. Table 2.8 shows a rule-giving definition of the word ‘he’ from the Longman Dictionary of English Language and Culture.

type         example
rule-giving  he pron (used as the subject of a sentence) 1 that male person or animal already mentioned: “Where’s John?” “He’s gone to the cinema” | “Be careful of that dog - he sometimes bites”

Table 2.8 Example of a rule-giving definition

2.2.4 PATTERN-BASED CLASSIFICATION

The method-based classification methodology is a semantic framework that uses the way in which information is conveyed to distinguish definitions from each other. The methodology provides valuable insights into the various ways that can be employed to explain the meaning of terms. However, the fact that semantic properties are difficult to formalize constitutes a problem for the glossary creation application, in which the aim is to develop an approach that can extract definitions automatically from texts. Since extraction on the basis of the method-based methodology would be very complicated, we decided to investigate whether an alternative approach could be used.
We noticed that a restricted set of lexico-syntactic patterns can be used to cover the majority of definitions. For computers, such patterns are easier to formalize and deal with than semantic information. Therefore, Westerhout and Monachesi (2007a) proposed another methodology for the categorization of the Dutch definitions that takes the lexico-syntactic patterns of definitions as a starting point for the classification instead of semantic properties.11
The pattern-based classification methodology relies on the assumption that all definitions consist of (at least) three elements, namely a definiendum, a definiens and a connector (Walter and Pinkal, 2006). Figure 2.1 shows an example definition in which these three constituents are present. The definiendum, often the subject, is the element that is defined (Latin: that which is to be defined). This is the term ‘definition’ in the example. The definiens provides the meaning of this definiendum (Latin: that which is doing the defining). In the example sentence, the definiens is ‘a statement of the meaning of a word’. The definiendum and the definiens are connected via the connector, which can be either a verbal phrase or a punctuation character. This connector indicates what the relation is between definiendum and definiens. The connector ‘is’ has been used in the example.

Figure 2.1 Elements of a definition
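To make the structure of Figure 2.1 concrete, the following minimal Python sketch splits a simple English is definition into its three elements. It is purely illustrative: the regular expression and the example sentence are mine, and the actual grammars, which operate on linguistically annotated Dutch text, are presented in Chapter 3.

    import re

    # Illustrative sketch: split a simple English 'is' definition into the
    # three elements of Figure 2.1 (definiendum - connector - definiens).
    PATTERN = re.compile(
        r"^(?P<definiendum>.+?)\s+(?P<connector>is|are)\s+(?P<definiens>.+)$"
    )

    match = PATTERN.match("A definition is a statement of the meaning of a word")
    if match:
        print(match.group("definiendum"))  # A definition
        print(match.group("connector"))    # is
        print(match.group("definiens"))    # a statement of the meaning of a word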

The classification methodology based on the patterns of definitions has been proposed within the scope of the LT4eL project (Westerhout and Monachesi, 2007a). It divides the definitions into six categories, which appeared to be language independent. An example for each of the definition types is provided in Table 2.9. All examples have been taken from the Dutch LT4eL corpus, since the focus in this thesis is on the extraction of Dutch definitions.

11Another typology that pays attention to lexico-syntactic patterns of definitions to some extent is the genus-differentia typology from Sierra et al. (2008). Since their work has been done in parallel with my work, I only came across this work after the experiments had been completed.

type                  example
is                    Gnuplot is een programma om grafieken te maken.
                      (‘Gnuplot is a program for drawing graphs.’)
verb                  E-learning omvat hulpmiddelen en toepassingen die via het internet beschikbaar zijn en creatieve mogelijkheden bieden om de leerervaring te verbeteren.
                      (‘eLearning comprises resources and applications that are available via the Internet and provide creative possibilities to improve the learning experience.’)
punctuation           Passen: plastic kaarten voorzien van een magnetische strip, die door een gleuf gehaald worden, waardoor de gebruiker zich kan identificeren en toegang krijgt tot bepaalde faciliteiten.
                      (‘Passes: plastic cards equipped with a magnetic strip, that can be swiped through a card reader, by means of which the identity of the user can be verified and the user gets access to certain facilities.’)
pronoun & connector   Dedicated readers. Dit zijn speciale apparaten, ontwikkeld met het exclusieve doel e-boeken te kunnen lezen.
                      (‘Dedicated readers. These are special devices, developed with the exclusive goal to make it possible to read e-books.’)
pronoun as connector  In het dagelijkse leven schrijven we getallen in het tientallig of decimale stelsel, d.w.z. dat we 10 cijfers gebruiken (0 t/m 9).
                      (‘In everyday life we write numbers using the decimal system, i.e. we use 10 digits (0 through 9).’)
layout                RABE
                      Een samenwerkingsverband van een aantal Duitse bibliotheken, die gezamenlijk een Internet inlichtingen dienst bieden, gevestigd bij de gemeenschappelijke catalogus, HBZ, in Keulen.
                      (‘RABE. A cooperation of a number of German libraries, that together provide an Internet information service, residing at the common catalog, HBZ, in Cologne.’)
other                 Bij superscript wordt de tekst iets hoger afgedrukt dan normaal en bij subscript iets lager.
                      (‘In superscript, text is printed slightly higher than normal and in subscript slightly lower.’)

Table 2.9 Examples of pattern-based definitions

The first three types are purely based on the connector phrase they contain. These are the is definitions, the verb definitions, and the punctuation definitions. In the is definitions, the connector is a form of the copula verb ‘to be’. In the verb definitions, a verb or verbal phrase is used to connect definiendum and definiens and to indicate how these two parts are related to each other. The punctuation definitions contain a punctuation character that connects definiens and definiendum. This punctuation character is usually a colon or bracket; in some cases it is a comma or a dash.
The fourth type are the pronoun definitions, which can be divided into two distinct groups. For both groups, an example is presented in Table 2.9. In the first, the pronoun ‘dit’ (these) refers back to a previously used definiendum or definiens. In this case, the definition contains both a pronoun and a connector. The other pronoun definition includes only a pronoun phrase, such as the abbreviation ‘d.w.z.’ (i.e.), as the connector. In this case, the pronoun phrase itself links the definiendum and the definiens to each other.
The fifth type is different from the others, since here the layout of the sentence is the only indication that it might be a definition. In the example in Table 2.9, the definiendum (‘RABE’) is the heading of the section and the definition is the first sentence of the section. Another layout feature typical for layout definitions is tables where the definiendum is contained in the leftmost cell of a row and the definiens in the cell to the right of the definiendum. The last type are the unclassifiable definitions: definitions in which either an uncommon connector verb or another pattern that is not common for definitions is used. In the method-based classification methodology, these definitions could be classified as exemplifying definitions.

2.3 CLASSIFYING THE CORPUS DEFINITIONS

In this section, the definitions from the Dutch part of the LT4eL corpus are investigated and classified on the basis of the pattern-based classification methodology from Westerhout and Monachesi (2007a). The definitions have been closely examined to discover which characteristics are shared among the definition patterns. This analysis constitutes the basis for the development of the pattern-based extraction approach that will be discussed in Chapter 3.

type         definitions  %
is           186          29.0
verb         201          31.4
punctuation  112          17.5
pronoun      104          16.2
layout       7            1.1
other        31           4.8
TOTAL        641          100.0

Table 2.10 Classification of the Dutch LT4eL definitions on the basis of the pattern- based methodology

Table 2.10 provides the frequencies of the different types in the LT4eL corpus. As can be seen, the majority of definitions can be classified in one of the first four definition categories. There are 75 definitions that consist of more than one sentence. The majority of the multi-sentence definitions (89.3%) belongs to the pronoun category. In those pronoun multi-sentence definitions, one of the sentences only contains the definiens (52 times) or the definiendum (16 times) whereas the other sentence contains the remaining definition elements, that is, the connector and either the definiendum or the definiens. The remaining multi-sentence definitions are actually combinations of two definitions, since both sentences conform to a definition pattern.
The layout and other definitions differ from the other types, since they do not contain a lexico-syntactic pattern indicating that it is a definition. In addition, these types are both very uncommon in the corpus compared to the other definition types. For these reasons, they have not been addressed in my experiments and are not included in the results. In this thesis, the focus is on the is, verb, punctuation, and pronoun definitions, which all contain linguistic characteristics that are typical for definitions. The patterns of definiendum and definiens are more or less comparable, but the types are different with respect to their connector phrases. In the pronoun definitions, a pronominal pattern functions as an additional indicator.

2.3.1 Is DEFINITIONS

An is definition is a definition in which the verb ‘to be’ is used as the connector. The corpus contains 186 definitions of this type, of which 185 consist of one sentence.

(2) a. Gnuplot is een programma om grafieken te maken . Gnuplot is a program to graphs to draw . ‘Gnuplot is a program for drawing graphs.’ b. Liegen is een manier om mensen te misleiden . Lying is a way to people to cheat . ‘Lying is a way to cheat people.’

Example 2a shows a typical is definition from the corpus. The sentence is an example of a broad definition, since the definiendum can be replaced with other graph drawing programs, such as Graphviz, without turning the sentence into an incorrect statement. The definiens thus does not provide a unique description for exactly this term. However, in the application of glossary creation it is a relevant definition, since the definiens gives quite valuable information about the definiendum for people who do not know what Gnuplot is. Many of the LT4eL is definitions are broad definitions.
An important problem with the is definitions is that there are many non-definitions that use the verb is as a main verb as well. The lexico-syntactic patterns of such sentences are often similar to the pattern of is definitions. Example 2b illustrates this problem and shows a sentence that has the same pattern as example 2a. However, only example 2a is a definition, although example 2b provides useful information about the meaning of lying. It is questionable whether it will be possible to distinguish example 2a from 2b on the basis of their lexico-syntactic structure. Human evaluators use other information, such as world knowledge, in combination with additional linguistic information to make their judgements.
The majority of is definitions (89.3%) have a pattern that is similar to the pattern of example 2a. It starts with a phrase as the definiendum (Gnuplot). The present tense of the verb ‘to be’ is then used as the connector and the definiens (a program for drawing graphs) constitutes the rest of the sentence. More generally, the definiendum of is definitions is usually a noun phrase (either starting with an article or not) while the definiens in most cases begins with a noun phrase.
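As a rough illustration of this typical pattern, the following Python sketch checks whether a POS-tagged sentence ends its definiendum in a noun directly before ‘is’ and starts its definiens with an article or noun. The tag names (N, ART, V, PREP, PART) are assumptions made for the sake of the example; they do not correspond to the annotation format described in Chapter 3.

    # Illustrative sketch: does a POS-tagged sentence fit the typical
    # is-definition pattern (NP definiendum + 'is' + NP-initial definiens)?
    def looks_like_is_definition(tagged):
        """tagged: list of (token, pos) pairs for one sentence."""
        tokens = [token.lower() for token, _ in tagged]
        if "is" not in tokens:
            return False
        i = tokens.index("is")
        # definiendum: the phrase before 'is' should end in a noun
        head = [pos for _, pos in tagged[:i]]
        if not head or head[-1] != "N":
            return False
        # definiens: in most cases begins with a noun phrase
        tail = [pos for _, pos in tagged[i + 1:]]
        return bool(tail) and tail[0] in ("ART", "N")

    sentence = [("Gnuplot", "N"), ("is", "V"), ("een", "ART"),
                ("programma", "N"), ("om", "PREP"), ("grafieken", "N"),
                ("te", "PART"), ("maken", "V")]
    print(looks_like_is_definition(sentence))  # True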

(3) a. Een veelgebruikt programma om presentaties te maken A common program to presentations to make is Powerpoint . is Powerpoint . ‘A common program to make presentations is Powerpoint.’ b. LaTeX is noch een DeskTop Publishing pakket noch een LaTeX is neither a DeskTop Publishing package nor a tekstverwerker , maar een zet-systeem . text editor , but a typesetting system . ‘LaTeX is neither a DeskTop Publishing package nor a text editor, but a typesetting system.’

In the remaining 10.7%, either the definiendum and definiens appear in the reverse order or the sentence construction is more complex. Example 3 provides an example of both types. The complex patterns are quite diverse and uncommon. Generally, the corpus contains only one or two similar examples for the same pattern.

2.3.2 Verb DEFINITIONS

Verb definitions are definitions in which a verb or verbal phrase (except a form of to be) is used as the connector between definiendum and definiens. The corpus contains 201 definitions of this type. Example 4 shows three verb definitions from the LT4eL corpus in which the connector phrase is used differently.

(4) a. Cryptografie wordt gebruikt om de inhoud van Cryptography is used to the content of elektronische documenten te verbergen of te verifiëren electronic documents to hide or to verify en om bestanden te beschermen tegen ongeoorloofde and to files to protect against illegal toegang , wijzigingen en diefstal . access , changes and theft . ‘Cryptography is used to hide or verify the content of electronic documents and to protect files against illegal access, changes and theft.’

b. Een nauwkeurig gedefinieerde serie woorden ter A precise defined sequence words for beschrijving van een bepaald levensgebied noemt men een description of a certain area call they an ontologie . ontology . ‘A precisely defined sequence of words to describe a certain area is called an ontology.’ c. Met alinea-afstand wordt de afstand tussen twee of With paragraph space is the distance between two or meer alinea’s bedoeld . more paragraphs meant . ‘With paragraph spacing the distance between two or more paragraphs is meant.’

Example 4a illustrates the use of the connector phrase ‘wordt gebruikt om’ (is used to). The sentence begins with a definiendum, which is followed by the connector phrase and the definiens. Most verb definitions have this pattern, although there are also patterns in which the order of definiendum and definiens is different. The first alternative is the reverse order of the standard pattern, that is, definiens – verb (phrase) – definiendum (example 4b). In this case, the connector verb (‘noemen’ (to call)) appears after the definiens and before the definiendum. In the other option, illustrated in example 4c, the definiendum is embedded in a prepositional phrase at the beginning of the sentence.

2.3.2.1 Verbal connector phrases

Based on the corpus, a list of 35 different verbs that are exploited in connector phrases has been compiled. A distinction can be made between connectors consisting of only a verb (e.g. ‘betekenen’ (to mean) or ‘noemen’ (to name)) and connector phrases in which a verb is combined with one or more other words (e.g. ‘bestaan uit’ (to consist of) or ‘fungeren als’ (to function as)). Some of the connector verbs are used in more than one connector phrase. For example, the verb ‘gebruiken’ (to use) is used in definitions in two ways, namely as ‘gebruiken om’ (to use to) and as ‘gebruiken voor’ (to use for). Since connectors can consist of more than one word, the term ‘verbal connector phrases’ or simply ‘verbal phrases’ is used to denote connector phrases in which a verb is used. In total, there are 50 verbal phrases.
Twelve verb definitions (6.0%) do not contain a verbal connector phrase indicating that the sentence is a definition. More details on these patterns are provided in Section 2.3.2.2. In the remaining 189 verb definitions, one of the 50 definition verbal phrases is used. This means that each verbal phrase is used on average 3.8 times. However, the distribution is not equally divided over the 50 possible phrases, but resembles a Zipf curve: there are some frequent patterns and there is a relatively large tail of patterns occurring only once or twice in the corpus.
The most common individual connector verb is ‘gebruiken’, which accounts for 16.9% of the verb definitions. Other frequently used verbs are ‘bestaan’ (to consist), ‘betekenen’ (to mean) and ‘noemen’ (to call). There are 12 verbs that have been used only once. These are verbs like ‘beschrijven’ (to describe), ‘verklaren’ (to explain) and ‘behelzen’ (to include). If one takes the verbal connector phrases into consideration instead, the most frequently used phrase is ‘bestaan uit’ (to consist of), which is used in 10% of the definition sentences. This phrase is followed by the phrases ‘gebruiken om’ (to use to), ‘noemen’ (to call) and ‘gebruiken voor’ (to use for). There are 23 phrases that are used only once. In addition to the aforementioned verbs, these include phrases like ‘mogelijk maken om’ (to allow to) and ‘mogelijk maken door’ (to make possible by). Although these uncommon phrases are less clear connector phrases than the more frequent ones, they are nevertheless relevant in a glossary creation context, since the definiens in these sentences provides very useful information about the definiendum. As has been said earlier, within the application of definition extraction in the domain of glossary creation it is crucial to retrieve as many definitions as possible. Figure 2.2 provides an overview of the occurrences of the different verbal connector phrases in the corpus.

Figure 2.2 Verbs and verbal connector phrases occurring three times or more in the corpus

Whereas in most connector phrases the simple present tense of an independent verb is used as the main verb, there are also verb definitions in which an auxiliary verb is embedded within the connector phrase. This happens in 35.4% of the definitions. In 25.9% of the definitions, the passive auxiliary verb ‘worden’ (to be) has been used. The auxiliary verbs ‘zijn’ (to be, 6.9%), ‘kunnen’ (can, 4.2%), and ‘zullen’ (will, 0.5%) are used considerably less. Two example definitions in which the verbal connector phrase contains the verb ‘worden’ have been shown in examples 4a and 4c.
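Since the connector verbs appear in many inflected and auxiliary-embedded forms, matching is easiest on lemmatized text. The Python sketch below illustrates this with a small excerpt of the phrase inventory; the full set contains 50 phrases, and the helper function is mine.

    # Illustrative sketch: find a verbal connector phrase in a lemmatized
    # sentence. Only an excerpt of the 50 phrases is listed here.
    CONNECTOR_PHRASES = [
        ("bestaan", "uit"),     # to consist of
        ("gebruiken", "om"),    # to use to
        ("gebruiken", "voor"),  # to use for
        ("fungeren", "als"),    # to function as
        ("noemen",),            # to call
        ("betekenen",),         # to mean
    ]

    def find_connector(lemmas):
        """Return the first phrase whose lemmas occur in order, with indices."""
        for phrase in CONNECTOR_PHRASES:
            position, indices = 0, []
            for lemma in phrase:
                try:
                    position = lemmas.index(lemma, position)
                except ValueError:
                    break
                indices.append(position)
                position += 1
            else:
                return phrase, indices
        return None

    # 'Cryptografie wordt gebruikt om ...' after lemmatization:
    print(find_connector(["cryptografie", "worden", "gebruiken", "om", "de"]))
    # (('gebruiken', 'om'), [2, 3])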

2.3.2.2 Problematic patterns

A number of patterns that have been tagged as verb definitions differ from the verb definitions discussed previously, since they do not contain a verbal phrase which indicates that the sentence is a definition. In the corpus, there are twelve sentences for which this is the case. Example 5a illustrates the problem.

(5) a. Een vaste spatie voorkomt dat een regel tussen twee A fixed space prevents that a line between two woorden wordt afgebroken . words is broken down . ‘A non-breaking space prevents a line from being broken between two words.’ b. Een vaste spatie is een spatie die voorkomt dat een regel A fixed space is a space that prevents that a line tussen twee woorden wordt afgebroken . between two words is broken down . ‘A non-breaking space is a space that prevents a line from being broken between two words.’

Example 5a contains no lexical or syntactic evidence that indicates that it is a definition. It is hard to extract such definitions using a pattern-based approach. Example 5b illustrates that a slightly adapted version of the sentence would be an is definition. The change generally involves adding a small phrase after the definiendum – ‘is een N die/dat’ (is a(n) N that) – and then adapting the rest of the sentence to make it grammatically correct. Such definitions are not considered in my extraction approach.

2.3.3 Punctuation DEFINITIONS

Within the punctuation patterns, four types of characters can be used as the connector. By far the most frequent type (65.2%) is the ‘colon definition’, in which a colon is used to connect definiendum and definiens (Example 6a). In the second type (Example 6b), the ‘bracket definitions’, brackets are used around either the definiendum or the definiens. The bracket definitions account for 29.5% of the punctuation definitions. Apart from these two types, there are two other less common punctuation definitions: in the ‘comma definitions’ (Example 6c), the connector is a comma (4.5%) and in the ‘dash definitions’ (Example 6d), a dash is used to this end (0.9%).

(6) a. 500 Internal server error : een fout in de server 500 Internal server error : an error in the server waardoor hij niet kan reageren as a result of which he can not react ‘500 Internal server error : an error in the server which prevented it from reacting’ b. ‘ antecedent ’ ( het woord waarnaar verwezen wordt ) ‘ antecedent ’ ( the word to which referred is ) ‘‘antecedent’ (the word that is referred to)’ c. een IRC-client , een programma dat ontworpen is om an IRC client , a program that designed is to verbinding te maken met een of meerdere IRC-netwerken connection to make with one or more IRC networks . . ‘an IRC client, a program that is designed to connect to one or more IRC networks.’ d. standaards - gecodificeerde regels en richtlijnen voor de standards - encoded rules and guidelines for the creatie , omschrijving en het beheer van digitale creation , description and the management of digital bronnen sources ‘standards - encoded rules and guidelines for the creation , descrip- tion and management of digital sources’

2.3.3.1 Colon definitions

Renkema (2002) presents in his book ‘Schrijfwijzer’ (Writing guide) four situations in which a colon should or could be used in Dutch:

1. before an enumeration

2. to announce a description, quotation, explanation or conclusion

3. after a sentence starter (e.g. ‘Samengevat’ (Summarized), ‘Dat wil zeggen’ (That is), ‘Met andere woorden’ (In other words))

4. in combinations with numbers (e.g. ‘10 : 2 = 5’ or ‘Genesis 11:9’).

The way in which the colon is used in the punctuation definitions fits best into the second category, that is, to announce a description, quotation, explanation or conclusion. However, since announcing a definition is not the only way in which this character is used, the extraction of colon definitions will be a difficult task; it is not possible to simply extract all sentences containing a colon.
The corpus contains 73 colon definitions. Except for one definition, they all start with the definiendum, after which the colon and definiens follow. Compared to the is and verb definitions, the definiendum less often begins at the first position of the sentence (82.2% of the definitions). In almost half of the cases, a noun phrase is used directly after the colon. The second most frequent phrase in this position is an adverb (16.4%). Other common phrases are pronouns (13.7%) (most times the demonstrative pronoun ‘dit’ (this)) and verbs (12.3%). The other syntactic categories used at the first position of the definiens all occur fewer than 5 times (e.g. conjunctions or preposition phrases).
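These corpus observations suggest a simple first-pass filter for colon-definition candidates, sketched below in Python. The POS tag names are again assumptions, and a filter of this kind would still pass many non-definitions, in line with the ambiguity of the colon just discussed.

    # Illustrative sketch: first-pass filter for colon-definition candidates.
    def colon_candidate(tagged):
        """tagged: list of (token, pos) pairs for one sentence."""
        tokens = [token for token, _ in tagged]
        if ":" not in tokens:
            return False
        i = tokens.index(":")
        if i == 0 or i + 1 >= len(tokens):
            return False  # need both a definiendum and a definiens
        if tagged[i - 1][1] != "N":
            return False  # the definiendum usually ends in a noun
        # first position of the definiens: categories observed in the corpus
        return tagged[i + 1][1] in ("ART", "N", "ADV", "PRON", "V")

    sentence = [("500", "NUM"), ("Internal", "N"), ("server", "N"),
                ("error", "N"), (":", "PUNCT"), ("een", "ART"),
                ("fout", "N"), ("in", "PREP"), ("de", "ART"), ("server", "N")]
    print(colon_candidate(sentence))  # True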

2.3.3.2 Bracket definitions

Brackets are punctuation marks used in pairs to set apart or interject text within other text. It is often possible to leave the part between brackets out of the sentence, because it frequently only adds extra information. There are four types of brackets:

• round brackets or parentheses: ( )

• square brackets or box brackets: [ ]

• curly brackets or braces: { }

• angle brackets, diamond brackets or chevrons: <>

In the corpus definitions, only the round and square brackets are observed. The majority of bracket definitions contains parentheses while the square brackets are employed as the connector in the remaining definitions. For the parentheses, Renkema (2002) mentions a variety of possible ways in which they can be used:

1. to add or explain something,

2. in references,

3. meaning ‘or’,

4. to introduce an abbreviation,

5. in bibliographies, and

6. in telephone numbers.

The first use case applies to the use of brackets in definitions, that is, to add or explain a word or phrase. In most definitions from the corpus, the definiendum is mentioned before the opening bracket and the definiens is contained between the brackets. However, the reverse order can be used as well, which happens in 9.1% of the bracket definitions. In these cases, the definiens is mentioned before the brackets and the definiendum follows between the brackets.
Only 57.6% of the bracket definitions start at the beginning of the sentence. The definiendum is in all cases a noun phrase. There is a large variety of patterns within the brackets. In some cases, it is a complete sentence whereas other times it is only a noun phrase. When looking at the syntactic category of the first phrase within the brackets, the noun phrase is the most common category (57.6%) while the preposition phrase is the second most frequent category (12.1%). All other types are used in less than 10% of the definitions.
According to Renkema (2002), square brackets should only be used to add comments to a quoted or translated text. However, the corpus definitions reveal that theory can differ from practice and that many authors do not adhere to the writing guidelines; in the corpus there are a number of definitions in which square brackets are used. Since the structural characteristics of the patterns are comparable, regardless of the brackets used, both types are referred to as ‘bracket definitions’.

2.3.3.3 Comma and dash definitions

In addition to the two main categories, there are two types of punctuation definitions that are rarely used. In the corpus, the ‘comma definition’ (example 6c) has been used five times and the ‘dash definition’ (example 6d) occurred only once. It is therefore almost impossible to generalize over these patterns. Besides, the comma is very common in non-definitions as well; it will be a complicated task to distinguish comma definitions from non-definitions containing a comma. In all comma and dash definitions, the definiens is a noun phrase that varies in length between 2 and 15 words.

2.3.3.4 Problematic patterns

The presence of problematic patterns is mainly caused by the fact that the corpus contains a diversity of text types. The texts in the corpus have been written for different purposes, by different authors and on different subjects. For example, the structure of manuals to learn how to use a program differs from the structure of descriptive documents. As a consequence, the patterns of definitions also differ for each text. Punctuation patterns that should not be used according to the writing guidelines prescribed by Renkema (2002) are nevertheless present in the LT4eL corpus.
A second problem that shows up when dealing with punctuation patterns is that the order of definiendum, connector and definiens varies. Although the pattern with the order ‘definiendum – connector – definiens’ is by far the most frequent pattern, there are also situations in which definiendum and definiens are switched and the structure becomes ‘definiens – connector – definiendum’.

2.3.4 Pronoun DEFINITIONS

Pronoun definitions are definitions in which a pronoun or a phrase containing a pronoun is used to refer to a definiendum or definiens. The majority of pronoun definitions consists of two sentences, of which the first contains only the definiendum or definiens. The second sentence contains the connector and the rest of the definition (i.e. either definiens or definiendum). The focus is on the detection of the sentence containing the pronoun and the connector. The part containing only the definiendum or definiens without a connector is much more difficult to detect, since this part does not contain any evidence that suggests that it is part of a definition. Example 7 shows two examples of pronoun definitions.

(7) a. Eén van de interessantste mogelijkheden van One of the most interesting possibilities of Powerpoint is het werken met ‘ aangepaste animaties ’ . Powerpoint is the working with ‘ adapted animations ’ . Hiermee kunt u grafieken , figuren , schema’s en With this can you diagrams , figures , schemes and dergelijke stap voor stap opbouwen . the like step by step build . ‘One of the most interesting possibilities of Powerpoint is working with ‘ custom animations ’ . With this you can build diagrams , figures , schemes and the like step by step.’ b. U kunt in het document helptekst opnemen die bedoeld You can in the document help text include that meant is voor uw chef of voor uzelf . Word noemt deze tekst is for your boss or for you . Word calls this text aantekeningen . notes . ‘You can include help text in your document that is meant for your boss or for yourself . Word calls such texts notes.’

In example 7a, the definiendum is a noun phrase that is used at the end of the first sentence (‘custom animations’) while the connector and pronoun are in the second sentence. In principle, the definiendum can be any noun phrase in the first sentence, regardless of its position. Since almost every sentence in a text contains noun phrases, it makes no sense to take this as a requirement. It would thus be necessary to use a more sophisticated approach to identify co-reference relations in order to make detection of the definiendum possible. An indicator might be the apostrophes that are used around the definiendum in example 7a. Example 7b illustrates a different pronoun pattern. In this case, the word ‘notes’ is defined in the first sentence while the definiendum is introduced in the second sentence, accompanied by a pronoun (‘with this’) and a connector (‘calls’). Again, the detection of the second sentence is easier because of the presence of the connector and the pronoun.

To detect the phrase to which the pronoun refers, it would be necessary to include co-reference detection. The problem of detecting co-reference relations is a relevant one and it has various application areas, such as summarization, question answering and information extraction. Within the COREA project, research has been done on co-reference resolution for Dutch and a system has been developed for assigning co-reference relations automatically (Hendrickx et al., 2008). However, the focus of this project was mainly on detecting co-reference relations between noun phrases (pronouns, common nouns, names) only. In addition, the use of ‘het’ (it) as a co-referent has been investigated, but the COREA system does not address this type (Hoste et al., 2007).
Many of the co-reference relations present in the pronoun definitions from the LT4eL corpus differ from the relation types addressed within COREA. For those pronoun definitions in which a form of ‘dit’ (this), ‘dat’ (that) or ‘deze’ (these) has been used to refer back to a simple noun phrase, it was investigated to what extent the COREA system was able to detect the co-reference relations.12 Table 2.11 shows the results obtained with COREA on these definitions. The first column in this table describes the pattern of the co-reference relation. The pattern “np ... dit | dat | deze” refers to sentences in which a sole pronoun refers back to an NP. In the other pattern, the pronoun is followed by a noun (e.g. ‘Dit programma’ (this program)). The second column indicates the number of times the pattern occurs in the set of definitions. Columns 3, 4 and 5 give information on which parts are detected by the COREA system as being possible candidates involved in co-reference relations. From the results, it can be seen that although the individual parts are detected successfully for most sentences, it is not an easy task to find the co-reference relation as well (column 6). Only 2 of these patterns have been successfully detected by the system. In both cases in which the reference relation was found, the pronoun was followed by a noun. The other pronoun patterns include, among others, sentences in which ‘het’ (it) refers back to a noun phrase, sentences in which the pronoun refers to a larger phrase (e.g. a more complex noun phrase), or patterns in which the pronoun is embedded in a form of the phrase ‘dat wil zeggen’ (that is).

12The online demo of COREA has been used to perform the test: http://www.cnts.ua.ac.be/cgi-bin/iris/corea2demo.client.pl

                                            Parts detected              Relation found
pattern                                 #   both   only NP   only Pron   yes   no
np(definiendum) ... dit | dat | deze    25  18     1         6           0     25
np(definiendum) ... dit | dat | deze N  3   3      0         0           2     1
Total                                   28  21     1         6           2     26

Table 2.11 Accuracy of COREA on LT4eL pronoun definitions

At the moment, a system that is able to resolve these relations does not exist for Dutch.

(8) a. Dit is een operator die twee teksten aan elkaar plakt . This is an operator that two texts together glues . ‘This is an operator which glues two texts together.’ b. Dit initiatief bevat een overzicht van belangrijke This initiative contains an overview of important nationale digitaliseringsprojecten . national digitization projects . ‘This initiative contains an overview of important national digiti- zation projects.’ c. De tekening is een vector-tekening , dat wil zeggen dat The drawing is a vector drawing , that will say that alle objecten opgeslagen worden in de vorm van all objects saved are in the form of hoekpunten en dergelijke . vertices and the like ‘The drawing is a vector drawing, which means that all objects are saved in the form of vertices and the like.’

The corpus contains 104 pronoun definitions which can be divided into three smaller groups. Example 8 provides an example for each of the three pronoun definition types. The structure of the first two types resembles the structure of is or verb definitions, except that the definiendum has been replaced by a pronoun (phrase). Within the verbal pronoun definitions, there are some sentences in which the pronoun replaces the definiens. In such situations, the previous sentence contains the definiens. On the basis of the resemblance to the ‘is’ and ‘verb’ patterns, these two types will be referred to as ‘pronoun-is definitions’ and ‘pronoun-verb definitions’. In addition to these two types of patterns, there are also pronoun patterns having a completely different structure, namely the patterns in which the pronoun phrase itself functions as the connector. These definitions are called ‘pronoun-connector definitions’.

2.3.4.1 Pronoun-is definitions

The corpus contains 24 pronoun-is definitions (23.1%). These are pronoun definitions in which a form of ‘zijn’ (to be) is used as the connector. The majority of pronoun-is definitions starts with the word ‘dit’ (this), either used as a demonstrative pronoun (‘This is ...’) or as a demonstrative determiner (‘This N is ...’). In Dutch, these are both sub types of the demonstrative pronouns. The demonstrative pronouns are referred to as ‘independent’ demonstrative pronouns in Dutch whereas the demonstrative determiners are called ‘attributive’ demonstrative pronouns. The remaining pronoun-is definitions begin with the neuter personal pronoun ‘het’ (it).
The first phrase containing the pronoun is directly followed by either the singular or plural form of the verb ‘zijn’ (to be), which is succeeded by the definiens. The patterns of the definiens are similar to the ones observed in the is definitions. Usually, the definiens begins with a noun phrase (83.3%); alternatively, an adverbial phrase (12.5%) is used between the connector and the definiens (e.g. ‘gewoon’ (just) or ‘eigenlijk’ (in fact)). Such words indicate that the definiendum is comparable to known concepts.

2.3.4.2 Pronoun-verb definitions

Just as the frequencies of is and verb definitions are comparable, the numbers of pronoun-is and pronoun-verb definitions are almost the same as well; there are 26 pronoun-verb definitions (25.0%). In the pronoun-verb definitions, a larger variety of pronouns is employed than in the pronoun-is definitions. Twelve definitions contain the word ‘dit’ (this), either as a demonstrative pronoun or as a demonstrative determiner. Other pronouns that have been used are, among others, the demonstrative pronoun ‘dat’ (that), the neuter personal pronoun ‘het’, and the phrase ‘zo’n’ (such a).
The verbal connector phrases employed in the pronoun-verb definitions constitute a subset of the patterns used in the verb definitions. Just as in the verb definitions, the order of the definition elements varies in the pronoun-verb definitions; the pronoun refers back to either the definiendum or the definiens. In both cases, the pronoun sentences begin with a pronoun phrase followed by a verbal phrase. The sentence ends with either the definiens or the definiendum.
The verb ‘noemen’ (to call) is the most common connector in the pronoun-verb definitions (57.7%). However, since there are only 26 pronoun-verb definitions, it is very likely that new data will contain different verbal connector phrases. Including only the patterns observed in the corpus definitions might cause problems when the grammar is applied to unseen data. A solution to this problem is to include all verbal patterns of the verb definitions described in Section 2.3.2 as pronoun-verb connector patterns. The fact that the set of connector phrases used in the pronoun-verb definitions is a subset of the connector phrases used in the verb definitions supports this idea.

2.3.4.3 Pronoun-connector definitions

The set of pronoun-connector definitions contains 46 definitions which can be divided into two sub groups. The first group (52.2%) consists of patterns in which a pronominal adverb starting with ‘waar’ (where), ‘hier’ (here), or ‘daar’ (there) is used as the connector, such as ‘waarin’ (in which), ‘waarmee’ (with which), ‘hierin’ (in here), ‘hierop’ (upon this), and ‘daarin’ (in it). It usually takes two words to translate these words from Dutch to English. Although literal translations for some of these terms exist (wherein, wherewith, whereby, and whereupon), these are often seen as archaic, and as a consequence they are not common in written texts.
In the second group (45.9%), the phrase ‘dat wil zeggen’ (that is (to say)) indicates that the sentence might be a definition. The abbreviations ‘d.w.z.’ and ‘dwz’ (i.e.) are alternatives of this phrase and have been used in the corpus definitions as well. In most cases, ‘dat wil zeggen’ functions as the connector of definiendum and definiens within one sentence, but the phrase has also been employed at the beginning of a sentence a number of times. In these cases, the definiendum is mentioned in the preceding sentence.
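The sketch below spots the two cue types with two regular expressions; both expressions are mine and deliberately over-permissive: pronominal adverbs formed on ‘waar’, ‘hier’ or ‘daar’, and the phrase ‘dat wil zeggen’ with its abbreviations.

    import re

    # Illustrative sketch: cues for pronoun-connector definitions.
    PRONOMINAL_ADVERB = re.compile(
        r"\b(waar|hier|daar)(in|mee|op|bij|van|door)\b", re.IGNORECASE)
    DAT_WIL_ZEGGEN = re.compile(
        r"dat wil zeggen|d\.w\.z\.|\bdwz\b", re.IGNORECASE)

    def pronoun_connector_cue(sentence):
        return bool(PRONOMINAL_ADVERB.search(sentence)
                    or DAT_WIL_ZEGGEN.search(sentence))

    print(pronoun_connector_cue(
        "De tekening is een vector-tekening, d.w.z. dat alle objecten "
        "opgeslagen worden in de vorm van hoekpunten."))  # True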

2.4 CONCLUSIONS

This section concludes the chapter on the classification of definitions. An overview of a number of classification methodologies that have been proposed in the past has been presented in this chapter. The most commonly used classification scheme is the method-based one, in which semantic properties of the definitions determine the definition class. Based on the (incomplete) overviews from Borsodi (1967) and Robinson (1972), the different types of this method-based scheme have been examined. The second typology is the pattern-based methodology that has been proposed in the LT4eL project (Westerhout and Monachesi, 2007a), which was designed because it is more suitable for the automatic extraction of definitions.
The pattern-based methodology has been used to classify the Dutch LT4eL definitions into four groups. A description of the distinctive features of the patterns revealed that the is and verb definitions are more common than the punctuation and pronoun definitions. Some general observations are that in the is, verb and punctuation definitions the definiendum usually is a noun phrase and that the verb is usually used in the third person, present tense. For the pronoun definitions, three types of patterns can be identified. In the first two types, the connectors are similar to the connector phrases employed in is and verb definitions. In the third type of pronoun definitions, the connector is a pronoun.
The next chapter describes the design and use of the pattern-based approach. The patterns of the definition types presented in this chapter constitute the basis for the regular expressions that are employed in the grammars to match definitions.

It is well to remember that grammar is common speech formulated (William Somerset Maugham, 1874-1965)

3 Pattern-based definition extraction

3.1 INTRODUCTION

The investigation of the previous chapter revealed that the majority of definitions conforms to a restricted set of patterns. In these patterns, the connector phrase is of decisive importance, since this part determines whether or not a sentence can be a definition. The other definition parts that are relevant are the structure of the definiendum and the beginning of the definiens. This information can be used to develop a pattern-based method for the extraction of definitions. In such an approach, the lexico-syntactic patterns from definitions are taken as the starting point.
The existing pattern-based approaches within question answering, dictionary building and ontology creation can be divided into two groups. Some approaches focus on the detection of common phrases (Tanev, 2004; Storrer and Wellinghof, 2006; Velardi et al., 2008). These phrases often correspond to the connector phrases that have been proposed for the classification of definitions in the previous chapter. Others use information on the lexico-syntactic patterns of definitions in addition to the common phrases (Muresan and Klavans, 2002; Han and Kamber, 2006). Our approach is based on a combination of connector phrases and lexico-syntactic information. The purpose is to match as many definition patterns as possible using grammars that consist of a set of regular expressions. For each of the four types of definitions, a separate grammar has been built.
In order to extract definitions from texts, it is necessary that the documents are pre-processed in a number of ways. The pre-processing steps involve the conversion of the original documents (PDF, DOC,

HTML) into a common XML format and the addition of linguistic annotation. The linguistic annotation is necessary because our extraction method relies on this information. The investigation of the definitions described in the previous chapter is the starting point for the formalization of regular expressions that match the definition patterns. The finite state transducer Lxtransduce has been used to identify sentences on the basis of these expressions.

The grammars have been evaluated on the corpus. The results show that it is possible to extract most definitions that are present in a text with this approach. However, they also reveal that matching a definition pattern does not automatically imply that a sentence is a definition. In other words, definitions use a restricted set of lexico-syntactic patterns, but these patterns do not uniquely identify definitions. Additional information seems to be necessary for the proper classification of sentences.

This chapter starts with an explanation of the most common metrics that can be used to evaluate the performance of different extraction methods (Section 3.2). Section 3.3 provides an overview of related research in which definition extraction has been addressed using a pattern-based approach. The remainder of the chapter describes my own approach and the results obtained with it. Section 3.4 introduces the pre-processing steps that have been taken to enrich documents with linguistic annotation and to convert them into a common format. In Section 3.5, the local grammars that have been designed on the basis of the patterns described in Chapter 2 are presented. The grammars consist of XPath-based regular expressions which are matched against elements in the input document by the XML transducer Lxtransduce (Section 3.6). The quantitative results obtained with the pattern-based approach are outlined in Section 3.7 while Section 3.8 focuses on the qualitative evaluation. The chapter concludes in Section 3.9 with a summary of the main findings regarding the use of a pattern-based approach for definition extraction.

3.2 EVALUATION METRICS

Several metrics can be used to describe and evaluate the performance of a pattern-based extraction method. The most common ones are recall (R), precision (P) and (variations of) the F-score (F). Recall is, in the definition extraction context, defined as the number of definitions extracted divided by the total number of definitions in the data set. A high recall means that most of the definitions that are present in a text are detected. Precision is defined as the number of definitions extracted divided by the total number of sentences extracted. This measure thus indicates which proportion of the extracted sentences are indeed definitions. The F-score is the harmonic mean of precision and recall and provides insight into the balance between these two metrics. In some situations, the recall is more important than the precision. If this is the case, the Fn-score can be used. The Fn-score is an adapted version of the F-score in which more weight is assigned to recall. Depending on how important the recall is, the metric can be adjusted by using different values of n (Rijsbergen, 1979). The formulae for the different metrics are:

R = \frac{\text{definitions extracted}}{\text{total number of definitions}} \quad (3.1)

P = \frac{\text{definitions extracted}}{\text{sentences extracted}} \quad (3.2)

F = \frac{2 \times P \times R}{P + R} \quad (3.3)

F_n = \frac{(1 + n^2) \times P \times R}{n^2 \times P + R} \quad (3.4)

For all metrics, the scores can have any value between 0 and 1, where 1 indicates a perfect result and 0 the worst result possible. A recall of 1 thus means that all definitions are retrieved. When the precision is 1, all extracted sentences are definitions. An F-score of 1 means that the set of sentences extracted exactly matches the manually annotated definitions, whereas a low F-score means that the correlation between the manually annotated definitions and the automatically extracted sentences is low.
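Expressed in Python, the four metrics look as follows; this is a direct transcription of formulae 3.1-3.4, and the counts in the usage example are invented.

    # Direct transcription of formulae 3.1-3.4.
    def recall(definitions_extracted, total_definitions):
        return definitions_extracted / total_definitions

    def precision(definitions_extracted, sentences_extracted):
        return definitions_extracted / sentences_extracted

    def f_score(p, r, n=1.0):
        """F-score for n = 1; a larger n assigns more weight to recall."""
        if p == 0.0 and r == 0.0:
            return 0.0
        return (1 + n ** 2) * p * r / (n ** 2 * p + r)

    # Invented example: 40 of 53 definitions found among 46 extracted sentences.
    r = recall(40, 53)
    p = precision(40, 46)
    print(round(p, 3), round(r, 3),
          round(f_score(p, r), 3), round(f_score(p, r, n=2.0), 3))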

3.3 STATE OF THE ART

This section presents an overview of related research in which a pattern-based approach has been used for definition extraction in the context of question answering, dictionary building and ontology building. The last part of the section describes the work that has been carried out within the glossary creation context.

Question answering Definition extraction is a relevant task in the question answering (QA) domain since definitions can be used to answer certain types of questions. The question types that can be addressed with definition extraction techniques are the ‘What is’ (e.g. What is an ace?) and ‘Who is’ (e.g. Who is Roger Federer?) questions. They are called definition questions (Voorhees, 2002). A traditional QA system consists of three modules. In the first module, the question is processed to find the term to be defined. The term is assumed to be expressed explicitly in the question sentence. Questions that need more inference to identify the term are generally not considered. In the second module, documents are searched for fragments containing this term. As a last step, the possible answers are ranked on the basis of properties defined by the developer. Since only the second and third steps of the question answering process are relevant for the definition extraction task, the focus in this overview is on these two steps.
Joho and Sanderson (2000, 2001) report on an experiment in which they try to identify hyponym-hypernym contexts to answer definition questions using nine regular expressions. These patterns include the copula verb ‘to be’, words and phrases like ‘such as’ and ‘which is’, phrases between brackets, and appositions. Four of these patterns were proposed earlier by Hearst (1992). The fragments containing the query noun and one of the regular expressions are ranked on the basis of three properties. As a first property, the system employs the weights of the regular expressions. These weights reflect how often the expression is used in definitions and non-definitions. As a second property, it looks up the number of sentences in the document containing the term before it is used in the definition. The third property is the word count, which shows how many of the words that are common across candidate answers are present in the sentence. To compute the word count, the first sentence of each document that contains the definition term is retrieved, and the 20 most frequent lemmas of these sentences are retained after applying a stop-list. The word count is the percentage of these words that are present in the sentence being ranked. The sentences are ranked using the weighted sum of the three attributes, after hand-tuning the weights of the sum on a training set. The method was evaluated with 50 definition questions and the top 600 documents that Google returned for each definition term. It returned a correct definition in the five top-ranked sentences for 66% of the questions.
Others who used hyponym-hypernym relations to answer definition questions are Prager et al. (2001, 2002), who retrieve the hypernyms from WordNet (Fellbaum, 1998). To identify the best hypernyms, it is counted how often they co-occur with the definition term in two-sentence passages of the document collection. The system was tested on the TREC-9 (2001) and TREC-10 (2002) datasets. In TREC-9 (2001), the system returned at least one correct response in the five most highly ranked two-sentence passages for 83.3% of the definition questions. In all these cases, the correct response was actually the highest ranked. In TREC-10 (2002), the definition questions more directly mirror real user queries, and as a consequence the percentage of correct answers dropped to 46%.
In 44 of the 130 definition questions (33.8%), WordNet did not help at all, because it either did not contain the term for which a definition was sought, or none of its hypernyms were useful. This illustrates the main disadvantage of using a WordNet-based approach. Furthermore, the approach is less suitable in our case, where the term is not known in advance. However, WordNet could be used to filter non-definitions after the extraction of definition candidates.
Tanev (2004) presents a prototype of a QA system for Bulgarian, called Socrates. In addition to definition questions, this system addresses ‘Where’ and ‘When’ questions. To answer the definition questions, the system selects fragments containing the term within one of six manually created lexico-syntactic patterns. The resources investigated are Google snippets and encyclopaedic resources. The six patterns include fragments containing the copula verb ‘to be’, four patterns in which a punctuation character is used (‘,’, ‘:’, ‘-’ and ‘(’) and complex noun phrases containing a common noun and a proper noun (e.g. ‘The tennis player Roger Federer’). The system was evaluated on 100 questions from the TREC 2001 question collection, which have been translated into Bulgarian. To evaluate the results, the Mean Reciprocal Rank (MRR) has been used instead of the precision and recall metrics. This means that the top five ranked answers are considered to assign a score and the position within this top five determines how good an answer is. The formula used is 1/position, which means that the score is 1 if the correct answer is ranked first and 0.2 if it is the fifth fragment. The total score from all the questions is then divided by the number of questions, thus obtaining an MRR between 0 and 1. The MRR of Socrates on the definition questions was 0.447. These results show that the integration of encyclopaedias in a QA system can significantly increase its performance and that pattern-based QA can be quite efficient.
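For reference, this is what the MRR computation described above amounts to in Python; the positions in the example are invented.

    # Each question scores 1/position of the first correct answer in the
    # top five, or 0 when no correct answer appears there; MRR is the mean.
    def mean_reciprocal_rank(first_correct_positions):
        """Positions are 1-5, or None when the top five contain no correct answer."""
        scores = [1.0 / p if p is not None else 0.0
                  for p in first_correct_positions]
        return sum(scores) / len(scores)

    print(mean_reciprocal_rank([1, 3, None, 5, 2]))  # 0.4066...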

Saggion (2004) addresses the problem of discriminating between definitional and non-definitional text passages about a particular definiendum in vast text collections, in the context of answering definition questions. To this end, he developed a method that integrates a restricted number of lexico-syntactic patterns and terms that co-occur with the definiendum in on-line sources, for both passage selection and definition extraction. In order to find good definitions, a collection of patterns like 'DEFINIENDUM is a' and 'DEFINIENDUM consists of' is used for the identification and extraction of the definiens. The problem he identified is that there are many ways in which definitions are conveyed in natural language, which makes it difficult to come up with a full set of lexico-syntactic patterns to detect all definitions. To make matters more complex, patterns are usually ambiguous, matching non-definitional contexts as well as definitional ones. As a solution, Saggion (2004) uses external sources to mine knowledge, consisting of terms that co-occur with the definiendum, before trying to define it using the given text collection. This knowledge is used for definition identification and extraction. The approach of Saggion resulted in an F5-measure (recall five times more important than precision) of 0.236 in the TREC QA 2003 competition. He does not give precision and recall separately, but states that one of the reasons for the low recall is that in many cases answers could not be extracted because the definition patterns and filters were far too restrictive to cover them.

Han et al. (2007) propose answer extraction and ranking strategies for definitional question answering using linguistic features and definition terminology. The extraction of answer candidates is based on five syntactic patterns. A passage expansion technique based on simple anaphora resolution is used to retrieve more informative sentences. In order to rank the phrases, several pieces of evidence are used to compare the different candidates. These are redundancy, local term statistics, external definitions and definition terminology. The most relevant features seem to be the external definitions and the definition terminology. The recall of the system evaluated on TREC 2004 is 0.3124, which is comparable to the other systems that participated.
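The F5-measure reported by Saggion is an instance of the general F-beta score, which also underlies the F2 score used later in this chapter; a small sketch:

    # F-beta: with beta = 5 recall counts five times as much as precision
    # (Saggion's F5); beta = 2 gives the F2 score reported in Section 3.7.
    def f_beta(precision, recall, beta):
        if precision == 0.0 and recall == 0.0:
            return 0.0
        b2 = beta ** 2
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    print(round(f_beta(0.350, 0.855, 2), 3))  # 0.664, cf. the is grammar results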

Dictionary building  Muresan and Klavans (2002) developed a rule-based system, DEFINDER (Klavans and Muresan, 2000), to identify and extract definitions and the terms they define from on-line consumer health literature. The system combines shallow natural language processing with deep grammatical analysis. DEFINDER aims at automatically constructing dictionaries from text to be used in summarizations of technical articles. The definitions should provide the explanation of technical terms in lay language. Their corpus consisted of consumer-oriented medical articles, which have a very typical text structure: almost 60% of the definitions are introduced by a limited set of text markers ('–', '()'), the other 40% being identified by more complex linguistic phenomena (anaphora, apposition, conjoined definitions). The DEFINDER system is based on two main functional modules. The first one uses cue-phrases (e.g. 'is the term for', 'is defined as', 'is called') and text markers in conjunction with a finite state grammar to extract definitions. The other module, which is used to achieve higher accuracy, is a grammar analysis module based on a statistical parser (Charniak, 2000), which accounts for several linguistic phenomena used in definition writing (e.g. appositions, relative clauses, anaphora). The evaluation of the system was done by comparing the results against human performance. Four subjects (not trained in the medical domain) were provided with a set of nine articles (the length of the articles was limited to two pages) and asked to annotate the definitions and their headwords in the text. The system was evaluated against a gold standard that consisted of a set of 53 definitions marked by at least 3 out of the 4 subjects. The interpretation of the results was more difficult than expected, given that there was no agreement among the subjects regarding the question 'What is a definition?', even though they were provided with a set of instructions and sample definitions. The DEFINDER system was able to detect 40 out of 53 definitions (75.47% recall) with a precision of 86.95%.
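The first DEFINDER module can be approximated with a simple cue-phrase regular expression; the sketch below uses the cue phrases quoted above, but the surrounding expression and the example sentence are my own illustration:

    import re

    # One cue-phrase pattern of the kind used by DEFINDER's first module:
    # a term, a cue phrase, and the remainder of the sentence as definiens.
    CUE = re.compile(r"(?P<term>[A-Z][\w-]*(?:\s+[\w-]+)?)\s+"
                     r"(?:is the term for|is defined as|is called)\s+"
                     r"(?P<definiens>.+)")

    m = CUE.search("Hypertension is defined as abnormally high blood pressure.")
    if m:
        print(m.group("term"), "->", m.group("definiens"))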

Ontology building  Research on definition extraction has been carried out in the area of ontology building as well. Definitions are mainly used in this area to detect relations between concepts. For example, within the German HyTex project (Storrer and Wellinghof, 2006), 19 verbs that typically appear in definitions ('definitor verbs') were distinguished, and search patterns based on the valency frames of these definitor verbs have been specified in order to extract definitions. Definition detection approaches developed in the context of question answering tasks (such as the method proposed by Saggion (2004)) are definiendum-centered, i.e. they search for definitions of a given term. The HyTex approach, in contrast, is connector-centered, i.e. it searches for verbs that typically appear in definitions with the aim of finding the complete list of all definitions in a corpus, independently of the defined terms. The corpus consisted of 100,000 words in 20 technical documents and contained 174 definitions. The results varied strongly depending on the verb; the precision was between 9 and 100% (with an average of 31%) and the recall was between 20 and 100% (with an average of 70%). In addition to extracting the definitions, semantic relations have been extracted from them. Even though this information has been employed for the automatic generation of hypertext views that support both reading and browsing of technical documents, the same technique could be used to update and enlarge existing formalized ontologies as well.

Walter and Pinkal (2006) investigated the use of computational linguistic analysis techniques for information access and ontology learning within the legal domain, with the goal of improving the quality of a text-based ontology learning method. In their approach, they first applied 33 extraction rules based on connectors to the corpus to extract definitions from parsed text. Precision using all 33 rules was 48.6% and increased to 75.2% when only the best 18 rules were used. The issue of recall was not addressed. Adjective-noun bigrams are often used as a basis in text-based ontology extraction, since in many cases such bigrams contain two concepts and one relation: the nominal head represents one concept, while adjective and noun together represent another concept that is subordinate to the first one (e.g. definition and nominal definition). Walter and Pinkal (2006) extracted such adjective-noun bigrams from definitions because they expected that a concept's occurrence in definitions indicates that it is an important concept. They then compared the quality of the adjective-noun bigrams extracted from the whole corpus and of those contained within the definitions. Using this method increased the precision, whereas the recall dropped dramatically.
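The bigram extraction step of Walter and Pinkal can be illustrated with a short sketch over POS-tagged tokens (the tag names are illustrative):

    # Collect adjective-noun bigrams from a sequence of (word, tag) pairs;
    # each such bigram is a candidate concept/subconcept pair for the ontology.
    def adj_noun_bigrams(tagged):
        return [(w1, w2) for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
                if t1 == "ADJ" and t2 == "N"]

    sentence = [("a", "ART"), ("nominal", "ADJ"), ("definition", "N"),
                ("explains", "V"), ("meaning", "N")]
    print(adj_noun_bigrams(sentence))  # [('nominal', 'definition')]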

Glossary creation  Velardi et al. (2008) proposed a three-step method for the creation of domain glossaries. Just as in question answering, their system starts with a list of terms that should end up in the glossary. This makes the approach different from ours, in which the terms are not known in advance. The document collection from which the definitions are extracted is the web. This is another difference with our approach, in which one document at a time is sent to the system, which then aims at detecting as many definitions as possible within that specific document. In the first step, Velardi et al. (2008) use Google to retrieve sentences that match a restricted set of regular expressions observed in definitions for each of the terms ('t is a', 't is an', 't are the', 't defines', 't refers to', 't concerns', 't is the', and 't is any'). For each query, the first five pages are saved. A consequence of using the web as a corpus is that it is not necessary to include many types of patterns, since the web contains a lot of content whereas the glossary only needs to contain one definition for each term. For example, entering their regular expressions in Google returns 11.9 million results for the term 'XML' and 7.8 million for the term 'Excel'. However, this simple approach returns many sentences that are non-definitions as well (e.g. 'Mastering Excel is a Critical Marketing Skill'). Therefore, a stylistic filter is applied to the returned candidates to match only sentences expressed in terms of genus and differentia. To this end, a J48 decision tree algorithm has been trained on a data set of positive and negative examples from four domains. The combination of the regular expressions and the stylistic filter is similar to what we are doing in the pattern-based approach, as the grammars combine regular expressions and the structure of definitions (the stylistic filter) to detect definitions. There are two differences between the approach by Velardi et al. (2008) and our approach: we use a more extensive set of regular expressions, and we implement the stylistic filter within the grammars instead of applying it as a separate filtering step. The third step in the approach of Velardi et al. (2008) is the use of a domain filter to remove candidate definitions that are not pertinent to the domain of interest. The list of terms used as the input to the glossary extraction algorithm is analyzed to learn a probabilistic model of the domain, which assigns a probability to each single word that occurs in the terminology. Although the domain filter is quite relevant in the context of Velardi et al. (2008), where the definitions are extracted from the web, it is less important in our context, where the extraction method is document-based.
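The first step of Velardi et al. amounts to instantiating their eight patterns for each glossary term; a minimal sketch (the pattern strings are theirs, the helper function is my own):

    # Build one phrase query per pattern; these are then submitted to Google
    # and the first five result pages are kept per query.
    PATTERNS = ["{t} is a", "{t} is an", "{t} are the", "{t} defines",
                "{t} refers to", "{t} concerns", "{t} is the", "{t} is any"]

    def queries_for(term):
        return ['"%s"' % p.format(t=term) for p in PATTERNS]

    print(queries_for("XML")[:3])  # ['"XML is a"', '"XML is an"', '"XML are the"']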
Within the LT4eL project, a pattern-based approach has been used to develop a semi-automatic tool for the creation of glossaries in eight languages (Bulgarian, Czech, Dutch, German, English, Polish, Portuguese, and Romanian). The distinction of definition types based on the connector as proposed in Westerhout and Monachesi (2007a) was adopted within the complete project and has thus been used in all languages (Przepiórkowski et al., 2007a; Iftene et al., 2007; Del Gaudio and Branco, 2007).

Przepiórkowski et al. (2007a) present work on definition extraction in the Slavic languages that were represented in the project, that is, Bulgarian, Czech and Polish. The three grammars for these languages show varying degrees of sophistication, with a small Bulgarian grammar (8 rules in a 2.5 kB file), a larger Polish grammar (34 rules in an 11 kB file) and a sophisticated Czech grammar (147 rules in a 28 kB file). The bigger the grammar, the more patterns are covered by its rules. As a consequence, the best recall was obtained with the sophisticated Czech grammar (46%) and the small Bulgarian grammar delivered the worst results (8.9%). Although the recall improved when more rules were added, the precision remained comparable for the three languages and varied between 22.3 and 22.5%.

Iftene et al. (2007) describe experiments on the extraction of Romanian definitions. The Romanian grammar has been evaluated at two levels, namely at the sentence and at the token level. At the sentence level, a sentence is called a definition when at least one token is marked as a definition, whereas at the token level it is tested for each individual token – word or punctuation character – whether there is a match between the automatically and manually annotated definitions. The recall when evaluating the Romanian grammar at the sentence level was 100% for all types, which would mean that all definitions have been found. However, at the token level this score was considerably lower and varied between 10.2 and 33.6% depending on the type of definition pattern. No explanation has been provided for this huge difference between the two evaluation methods.

In their work on the extraction of Portuguese definitions, Del Gaudio and Branco (2007) distinguish three definition types, namely is, verb and punctuation definitions. They use a rule-based grammar to extract these types and evaluate the grammar on unseen data. The recall scores for the is and verb patterns are similar (66% and 65%), whereas for the punctuation patterns the recall is considerably lower (47%). The precision is quite low for all three types (32% for is, 14% for verb and 28% for punctuation).

3.4 PRE-PROCESSING THE DOCUMENTS

The previous section described how definition extraction has been addressed with a pattern-based approach in the past. There are pattern-based methods that are entirely based on lexical phrases (e.g. 'is a') and methods that are based on a combination of lexical and syntactic information (e.g. the pattern of the definiendum). In my approach, the second method has been adopted. Since lexico-syntactic patterns are used as part of this approach, it is necessary to pre-process the documents as a first step to add the linguistic information.

The conversion of the documents is necessary for two reasons. First, one common format is necessary to enable tools like the glossary candidate detector to interpret the content of documents. The second reason is that the texts need to be linguistically annotated, since the pattern-based extraction of Dutch definitions is largely based on linguistic annotation. The conversion of the documents consists of two main phases. In the first phase, the original documents (i.e. PDF, HTML, DOC) are converted into a common XML format. The structural and layout information from the documents is stored in the XML. The glossary candidate detector uses the structural information of the XML to match sentences and tokens. In the second phase, the documents are annotated with three types of linguistic information: part-of-speech (POS) tags, lemmas, and morpho-syntactic information. The documents are split into sentences before the linguistic annotation is carried out.

The conversion and annotation process thus consists of several steps. Since all tools make some errors, the final documents can contain several types of errors. Most problems are caused by the sentence splitter and the linguistic annotator. The sentence splitter cannot always detect sentence boundaries properly in sentences that do not end with a period, such as headers and list items. The linguistic annotation tool does not always assign the correct categories to words. These problems are discussed in more detail in Section 3.4.2.3.

3.4.1 DOCUMENT FORMAT

The learning objects of the LT4eL corpus have different document formats, namely PDF (49%), HTML (22%) and DOC (29%). To process the documents with different tools (e.g. search, keyword extraction, definition extraction), it is necessary that all files have the same structure. Therefore, all corpus documents have been converted into one common XML format. The aim of the conversion process is to add linguistic information and to integrate this information with the structural and layout information in a final XML format conforming to the LT4ELAna DTD (cf. Appendix A). The LT4ELAna DTD is an enriched version of the XCES DTD for linguistically annotated corpora (Ide and Suderman, 2002). It provides the possibility to encode parts-of-speech, morpho-syntactic features and lemmas in addition to storing text and layout information. The extension of the XCES DTD makes it possible to mark definitions and keywords. This information is used by the LT4eL keyword extractor and the glossary candidate detector presented in this thesis.

Figure 3.1 Conversion of documents from original format to annotated XML format

Figure 3.1 illustrates the conversion process from the original input document (i.e. PDF, HTML, DOC) into the final XML output document. As can be seen from this diagram, the conversion process consists of several steps. First, the original formats are converted into HTML with UTF-8 encoding. In this format, all information from the original files with respect to layout is preserved. These HTML files are converted to a basic form of HTML (called 'base HTML') to simplify the document and to preserve only the most important layout information (e.g. bold, headers, links). All other layout information is removed from the file. In the next step, the base HTML is converted into base XML, which contains the same information as the base HTML, but in an XML format. In order to enable linguistic annotation, the base XML files are converted to plain text, since the linguistic annotation tools that have been used work on plain text. As a last step, the base XML file and the linguistically annotated plain text file are merged into the final XML file, called LT4eLAna XML. The tools for keyword extraction and definition detection are based on this format.

In Figure 3.2, an example sentence in LT4ELAna XML format is presented. This example illustrates how the different types of attributes are used. The most important elements for the glossary candidate detector are the sentence and token elements. Since the grammars match within the sentences at the token level, the most relevant information for our purposes is encoded in the token element.

<!-- Reconstruction of the figure content: the element names follow the
     XCES conventions and the attribute values shown are illustrative. -->
<s id="s1">
  ...
  <tok id="t17" base="het" ctag="Art" msd="bep,onzijd,neut">het</tok>
  <tok id="t18" base="eLearning-actieplan" ctag="N" msd="eigen,ev,neut"
       rend="bold">eLearning-actieplan</tok>
  <tok id="t19" base="." ctag="Punc" msd="punt">.</tok>
</s>

Figure 3.2 Part of a sentence in LT4ELAna format

The id attribute is a unique identifier for each word, the base attribute contains the lemma of the word, the ctag attribute contains the part-of-speech tag and the msd attribute gives the morpho-syntactic information. The layout information is stored in the rend attribute. In the example, a bold typeface has been used in the original document for the word 'eLearning-actieplan'.

3.4.2 LINGUISTIC ANNOTATION

3.4.2.1 Part-of-speech tagging

The pattern-based method is largely based on part-of-speech (PoS) tags, which have been embedded in the regular expressions. In PoS tagging, the task is to assign the most appropriate morphosyntactic category to each token in a sentence, taking the context into account. A PoS tag thus gives information about the function of a token in the context in which it occurs. Tokens of the same word can receive different tags depending on the context in which they are used. The input to a PoS tagging program is a sequence of tokens and the output is a single best tag for each token.

The Memory Based Tagger (MBT) developed by Daelemans et al. (1996b) has been used for annotating the Dutch documents with part-of-speech information and morphosyntactic features. In the memory-based approach used by MBT, a set of cases is kept in memory.

           Accuracy  Percentage
Known          97.1        94.5
Unknown        71.6         5.5
Total          95.7       100.0

Table 3.1 Tagging accuracy on known and unknown words

Each case consists of a word (or a lexical representation of it) together with the preceding and following context (two positions to the left and two positions to the right), and the corresponding category for that word in that specific context. A new sentence is tagged by selecting for each word in the sentence and its context the most similar case(s) in memory, and extrapolating the category of the word from these nearest neighbours. MBT is based on the assumption that two factors determine the part-of-speech tag of a word, namely its lexical probability and its contextual probability.

The MBT tagger-generator architecture has been applied to the written part of the Eindhoven corpus, which was tagged using the WOTAN tagset (Berghmans, 1995). The tagger was generated on the basis of the first 610,806 words of the tagged corpus (27,651 sentences). The performance of the resulting tagger has been tested on the last 100,000 words (5,763 sentences) of the 710,806-word Eindhoven corpus (Daelemans et al., 1996a). The accuracy of the tagger is considerably better on known words than on unknown words (cf. Table 3.1).
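A toy version of the memory-based idea can be sketched as follows; the feature weighting here is a crude stand-in for MBT's actual similarity metric:

    # Store cases of (left context, word, right context) with their tag and
    # tag new tokens by the most similar stored case (1-nearest-neighbour).
    def train(tagged_sentences, width=2):
        cases = []
        for sent in tagged_sentences:
            words = [w for w, _ in sent]
            for i, (word, tag) in enumerate(sent):
                left = tuple(words[max(0, i - width):i])
                right = tuple(words[i + 1:i + 1 + width])
                cases.append(((left, word, right), tag))
        return cases

    def tag_word(word, left, right, cases):
        def similarity(case):
            (cl, cw, cr), _ = case
            # an exact word match weighs more than context overlap
            return (cw == word) * 4 + len(set(cl) & set(left)) + len(set(cr) & set(right))
        return max(cases, key=similarity)[1]

    cases = train([[("de", "Art"), ("taal", "N"), ("verandert", "V")]])
    print(tag_word("taal", ("de",), ("verandert",), cases))  # N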

3.4.2.2 Lemmatization

Lemmatization is the “process of classifying together all the identical or related forms of a word under a common headword” (Kennedy, 1998). The pairing of words with their base forms is subject to three constraints: 1. The base form must be an independently existing word;

2. Pairing is performed on a word-by-word basis;

3. Each word form receives exactly one base form.

Lemmatization is thus more than stemming, that is, removing or ignoring affixes. For example, when a stemmer is instructed that removing '-ing' results in the lemma of a verb, it will not be able to lemmatize words like 'tasting' correctly, because the lemma is not 'tast' but 'taste'. A lemmatizer should recognize that words like 'better' and 'best' are inflected forms of 'good' and that the inflected verb forms 'went', 'gone' and 'goes' belong to the lemma 'go'. Lemmatizers should be able to deal with all such complexities and irregularities of a language (Kennedy, 1998).

The lemmatization task can be performed automatically. For the lemmatization of the LT4eL corpus, the Dutch lemmatizer MBLEM developed by Bosch and Daelemans (1999) has been used. This memory-based lemmatizer has been trained on CELEX data and was also used for the lemmatization of the Spoken Dutch Corpus (Eynde et al., 2000).
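The difference between stemming and lemmatization can be made concrete with a small sketch; the lookup table stands in for the CELEX-derived knowledge of a real lemmatizer:

    # A naive '-ing'-stripping stemmer versus a lookup-based lemmatizer.
    LEMMAS = {"tasting": "taste", "better": "good", "best": "good",
              "went": "go", "gone": "go", "goes": "go"}

    def stem(word):
        return word[:-3] if word.endswith("ing") else word

    def lemmatize(word):
        return LEMMAS.get(word, word)

    print(stem("tasting"), lemmatize("tasting"))  # tast taste
    print(stem("went"), lemmatize("went"))        # went go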

3.4.2.3 Annotation problems

The tools for tokenization and linguistic annotation have been trained and tested on a set of documents and performed quite well on these texts. However, within the LT4eL corpus the content of the texts and the format in which they are available cause two problems that might not occur in other situations. These problems are related to the tokenization and linguistic annotation of the texts and have direct consequences for definition extraction.

A first problem is caused by the fact that in the computing and eLearning domains a relatively large number of English words is used. Since the tagger does not recognize foreign words, it does not know to which category they belong. Unknown words are either tagged with the tag 'Misc(vreemd)', which means that the tagger does not know the word, or with the tag 'N(eigen,ev,neut)', which means that the tagger considers it to be the singular form ('ev') of a proper ('eigen') noun ('N'). When the class 'Misc' is assigned to a word, it is not possible to determine which tag it should have; it can be anything, from an adjective to a verb. This makes it more complicated to describe definition patterns in terms of the linguistic information of words and word groups. When a word is incorrectly tagged as 'N(eigen,ev,neut)', this also causes problems, as this often is not the correct tag for the word. In this case too, the error makes it more complicated to describe the definition patterns on the basis of the linguistic information. Later in this chapter it is described how the grammars deal with such errors.

A second problem is related to the tokenizer and is caused by the way the original documents are formatted. The end of headers, enumeration items, and table of contents entries generally does not contain a punctuation character to indicate that the sentence ends. Since there is no punctuation mark, the tokenizer encounters difficulties in detecting sentence boundaries. As a consequence, sentences that should actually be separated are linked together. This makes it more complicated for the linguistic annotation tools to assign the correct tags to all words.

3.5 GRAMMARS OF DEFINITIONS

According to Barnbrook (2002), the language of dictionary definitions can be considered a sublanguage. Although the patterns observed in glossary definitions are more diverse than the patterns from dictionary definitions, the investigation from Chapter 2 reveals that glossary definition sentences form a restricted subset of a language as well. To describe the different patterns contained in a sublanguage, a local grammar can be used. The link between local grammars and sublanguages is made explicit by Hunston and Sinclair (2000), who state that "It is possible, then, to see the items described by local grammars as small (but not insignificant) sub-languages, and sub-language descriptions as extended local grammars."1 Since the language of the Dutch LT4eL definitions can be considered a sublanguage, local grammars can be used to describe its patterns.

The Dutch local grammar initially consisted of rules matching as many definition patterns as possible, regardless of the definition type. Results were then evaluated on the basis of the performance of the complete grammar and it was not possible to split them for the four definition types. For pragmatic reasons, this grammar has been split into four smaller grammars, one for each of the four types. This makes it easier to investigate the performance of the grammar for each type of definition (Westerhout and Monachesi, 2007a). All grammars can be divided into five parts, each of them containing its own type of regular expressions.

1 Hunston and Sinclair (2000), A local grammar of evaluation, p. 77.

Part A uses part-of-speech information to define individual words. The second part, part B, contains regular expressions defining phrases. Parts C and D match the definiendum and the connector, and in the last part (part E) all pieces are put together and the complete definitions are analyzed. Although there are separate grammars for the four types, there is a large overlap between the individual grammars with respect to the regular expressions that match words or phrases (parts A, B and C). To make it easier to refer to the regular expressions defined in the grammars, a code has been assigned to each expression based on four aspects.

Example  Description                          Possible values
B        refers to part of the grammar        A, B, C, D, E
G        refers to definition type            I, V, Pu, Pr, G
1        number within rule 'BG'              1, 2, 3, ..., n
b        sub rule of rule 'BG1' (optional)    a, b, c, ..., z

Table 3.2 The code elements to refer to regular expressions illustrated on the basis of regular expression ‘BG1b’

The codes for the regular expressions consist of three or four parts (Table 3.2). The first letter indicates in which part of the grammar the regular expression appears (A–E). The second code element refers to the definition type to which the expression applies (G(eneral), I(s), V(erb), Pu(nctuation), Pr(onoun)), where the 'General' category contains the expressions that are included in each of the four grammars. Since each grammar part can contain more than one expression, the third code element is used to number them (1, 2, 3). Finally, it is possible that a number of expressions are closely related to each other. These are indicated with lowercase letters at the end of the code (a, b, c).

To describe the patterns defined by the grammar, a semi-formal notation based on Rebeyrolle and Tanguy (2001) has been used. Table 3.3 shows the meaning of the different abbreviations in this notation. Section 3.5.1 uses the semi-formal notation to define the general regular expressions that are included in each of the four grammars, while Sections 3.5.2–3.5.5 describe the ones that differ depending on the definition type.

Abbreviation  Meaning
X             any phrase or word
X1            first token of the sentence
XCAP          word beginning with a capital letter
«...»         word between the angle brackets is a verb (phrase); different
              inflections of it can be used (e.g. for 'to be' it can be 'is',
              'are', 'was', etc.)
|             or; the scope of the rule is one word to the left and one word
              to the right. If it refers to more than one word, the string is
              put between parentheses
*             the string after which it directly appears can appear any
              number of times
[...]         string between square brackets is optional
'...'         single quotes around a string express that it is not part of
              the semi-formal notation

Table 3.3 Semi-formal notation for the description of definition patterns

3.5.1 GENERAL REGULAR EXPRESSIONS

This section contains the regular expressions that have been included in all the grammars.

3.5.1.1 Part A: Defining tokens

Part A defines the individual words (e.g. verbs, nouns, adverbs) and words between quotes, using part-of-speech and morpho-syntactic information. In addition, it contains two regular expressions to define the beginning of a sentence and one to define the end of a sentence. Since little variation is observed between the four definition types at the word level, the majority of expressions in this part are general. Only the pronoun definitions require some specific rules for matching pronouns.

AG1: Quotes  Quotes can be put around words or phrases to emphasize them. Both single and double quotes are captured. The regular expression of this rule is simply "'|"" (a single or a double quote).

AG2a and AG2b: Adjectives  Adjectives can be used to modify a noun or pronoun by describing, identifying, or quantifying words. In Dutch, an adjective often precedes the noun it modifies. Examples of adjectives are words like fair, blue, and big. A broad and a narrow regular expression related to the adjectives have been implemented. The narrow one is used in the noun phrase of the definiendum, while the broad one is included in the remaining noun phrases. The narrow adjective expression (AG2a) includes only adjectives and participles (e.g. spoken, hidden). In the second one (AG2b), a broader view on adjectives leads to the inclusion of ordinal numbers (e.g. first, tenth) and indefinite pronouns (e.g. each, both, much, any) in addition to the elements of AG2a. Although the indefinite pronouns are related more closely to articles than to adjectives, they have nevertheless been treated as adjectives in the grammar. The third regular expression on adjectives, AG2c, describes a conjunction of adjectives. Since the adjectives are sometimes surrounded by quotes or parentheses, these have been included as optional elements in the regular expressions. Summarizing, the regular expressions for the adjectives are:

1. AG2a: "[AG1|'('] Adj | V(participle) [AG1|')']"

2. AG2b: "[AG1|'('] Adj | V(participle) | Num(ordinal) | Adv | Pron(indefinite) [AG1|')']"

3. AG2c: "AG2b ','|'en' AG2b"

AG3: Adverbs An adverb is a word that modifies another word or phrase (apart from nouns), such as verbs, adjectives, clauses, sentences, and other adverbs. In Dutch, there is a distinction between adverbs and adverbial adjectives. Adverbial adjectives are adjectives that are used as an adverb. For example, in the sentence ‘The car drives fast.’, the word fast would be an adverbial adjective. In the grammars, both adverbs and adverbial adjectives are included in the regular expression to match adverbs: “Adv|Adj(adverbial)”.

AG4: Articles  The Dutch language distinguishes two definite articles ('de' (the) and 'het' (the)) and one indefinite article ('een' (a)). The word

‘het’ is also commonly used as a personal pronoun (third person singu- lar, neutral form) or as a neutral indefinite pronoun (it). Since the tagger tagged the definite article ‘het’ several times incorrectly as a neutral in- definite pronoun instead of an indefinite article, I took the pragmatic decision to include the neutral indefinite pronoun ‘het’ into the regular expression to define articles. The regular expression defining articles is “Art|Het(Pron(indefinite))”.

AG5a, AG5b and AG5c: Nouns  Within the nouns, the two main types are common and proper nouns. In some cases, the infinitive form of a verb can also be used as a noun. Whereas the MBT tagger tags such words as verbs, they are considered nouns in the regular expression that matches nouns. A third type of words that can act as nouns are symbols (e.g. the 'at' sign (@) and the dash (-)) and foreign words, which are a subtype within the category miscellaneous (Misc) according to the MBT tagger. The regular expression to match nouns includes all these types in AG5a: "N|V(infinitive)|Misc(foreign)|Misc(symbol)". The regular expression AG5b additionally matches quotes around the noun ("AG1 AG5a AG1"), and in AG5c both options (AG5a and AG5b) are included in one expression: "AG5a|AG5b".

AG6: Prepositions  The MBT tagger distinguishes four types of adpositions, all of which are tagged as 'Prep'. The most common one in the LT4eL corpus is the preposition (89.3%), which is an adposition that is used before a noun phrase. Adpositions coming after the element to which they belong, called postpositions, are less common in Dutch (only 0.01% of the adpositions). The third type is the preposition 'te' (to), which is used before an infinitive. This adposition type constitutes 10.5% of the adpositions. The last type are the circumpositions, in which a preposition and a postposition are combined, such as 'door ... heen' (throughout) or 'naar ... toe' (to), where 'door' and 'naar' are prepositions and 'heen' and 'toe' postpositions. Only 0.2% of the adpositions are of this type. The grammar only includes the prepositions in its regular expression: "Prep(voor)".

AG7: Tokens in sentence  This regular expression defines any token except for the period. It is used in regular expressions to match a definition until the period indicates that the end of the sentence is reached.

AG8: Beginning of the sentence  In most situations, the beginning of a sentence is the first word within a sentence element. However, the sentence splitter tool did not detect sentence boundaries properly in all situations. Problematic cases are sentences that do not end with a period because they appear in headers, titles or tables. During the conversion process, information about the structure of documents has been preserved as much as possible using a tool developed within the LT4eL project, but this tool did not manage to include all layout information correctly. As a consequence, the end of the sentence is not detected in some cases, which results in two sentences being considered as one sentence. For this reason, an extra regular expression is included in addition to the standard situation where definitions begin at the first position of a sentence. It defines capitalized words as possible starting points of a sentence. Based on the observation that capitalized words directly preceded by prepositions are not at the beginning of a sentence, the regular expression includes this as a restriction. Regular expression AG8 combines the three possible sentence starts in one regular expression. Priority is given to the first token of the sentence. If no definition begins at this position, the rule looks successively at capitalized words and quotes as possible starting points. Summarizing, the regular expression for matching the beginning of a sentence is "X1|([AG1] XCAP)".

3.5.1.2 Part B: Defining phrases

Part B contains regular expressions defining phrases (e.g. noun phrases, prepositional phrases). Instead of using a chunker to detect chunks, the chunks are defined in the grammar to make it easier to put restrictions on them. Just as in part A, most of the expressions in this part are type-independent and are included in each of the four grammars. These are an expression to define noun phrases (BG1), one to define preposition phrases (BG2) and one to define fragments between brackets (BG3). Part B of the is and pronoun grammars contains an additional rule to match appositions.

BG1: Noun phrase  The regular expression to define noun phrases integrates a number of expressions defined in part A. A noun phrase optionally begins with an article (AG4). After that, one or more adjectives (AG2b or AG2c) can be used, which should be followed by one or more nouns (AG5c). The nouns are the only obligatory part of the noun phrase. Optionally, after the noun(s) the phrase can be extended to a conjunction of noun phrases. The only obligatory part is thus the noun: "[AG4] [AG2b|AG2c] AG5c [','|'en'|'of' [AG4] [AG2b|AG2c] AG5c]".
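The core of BG1 can be emulated over a sequence of coarse POS tags; in this sketch the tag names 'Art', 'Adj' and 'N' stand in for the token classes AG4, AG2b and AG5c, and the optional conjunction extension is omitted:

    import re

    # Optional article, any number of adjectives, at least one noun.
    BG1 = re.compile(r"(Art\s)?(Adj\s)*(N\s?)+")

    def matches_bg1(tags):
        return BG1.fullmatch(" ".join(tags) + " ") is not None

    print(matches_bg1(["Art", "Adj", "N"]))  # True
    print(matches_bg1(["Adj", "Art", "N"]))  # False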

BG2: Preposition phrase  A preposition phrase is a sequential combination of a preposition (AG6) and a noun phrase (BG1); its regular expression is "AG6 BG1".

BG3: Fragment between brackets  A phrase between brackets can occur at different places in a text. In the manually selected definitions, such phrases are observed many times directly before the connector. Since the bracket phrase can contain anything, the regular expression to define it has only two requirements: it has to begin with an opening bracket and then matches any token (AG7) until a closing bracket is found: "'(' AG7* ')'".

3.5.1.3 Part C: Defining definiendum

As described in Chapter 2, a typical definition contains at least three elements: definiendum, connector and definiens. In the grammar, two of these three elements are identified by individual regular expressions. Part C contains expressions to define definienda and in part D the connector is expressed. In most definitions, the definiendum is a noun phrase used at the beginning of a sentence. Therefore, four regular expressions describing different types of noun phrases beginning at the first sentence position are included in the grammar (CG1a–CG1d). The last expression of part C (CG2) combines expressions CG1a–CG1d.

CG1a–CG1d: Four definiendum types  The restriction that a definiendum must begin at the first sentence position makes it necessary to write a separate regular expression for each of the noun phrases that can occur at this position. In regular expression CG1a, the definiendum consists of one or more nouns, of which the first one is used directly at the beginning of the sentence. Rule CG1b matches definienda beginning with an article at the first position of the sentence, optionally followed by an adjective and ending with at least one noun. In the third rule (CG1c), the definiendum begins with one or more adjectives followed by one or more nouns. The last definiendum type (CG1d) is a definiendum between quotes. The part between the quotes is a noun phrase containing at least one noun. In semi-formal notation, the four regular expressions are:

1. CG1a: “AG5aCAP AG5c*”

2. CG1b: “AG4CAP [AG2a*] AG5c*”

3. CG1c: "AG2aCAP [AG2a*] AG5c*"

4. CG1d: "' [AG4] [AG2a*] AG5a* '"

CG2: Combining definiendum types  Regular expression CG2 begins with one of the phrases of CG1. Optionally, a preposition phrase or conjunction can follow. The phrase so defined is the definiendum part of the definition: "CG1a|CG1b|CG1c|CG1d [BG2] [en BG1]".

3.5.2 GRAMMAR FOR is DEFINITIONS

On top of the general regular expressions presented in the previous section, there are a number of expressions that are specific to the is definitions. The expression that defines is definitions is the most important rule of this grammar (EI1). Apart from this rule, the is grammar contains an expression for matching appositions (BI1), which are used only in is and pronoun definitions.

BI1: Defining appositions  An example of an apposition in an is definition is the phrase 'ook wel dissertatie genoemd' in the following definition:

(9) Een proefschrift, ook wel dissertatie genoemd, is een boek geschreven door een promovendus met daarin een wetenschappelijke verhandeling over een bepaald onderwerp.
    A 'proefschrift', also dissertation called, is a book written by a PhD candidate with in-it a scientific treatise on a certain topic.
    'A "proefschrift", also called a dissertation, is a book written by a PhD candidate containing a scientific treatise on a certain topic.'

An apposition, as defined by the grammar, is a phrase beginning with a comma, followed by any number of words until a verb and a closing comma are found: "',' AG7* V ','". This regular expression has been included in the pronoun grammar as well (BPr1).

EI1: Defining is definitions In this regular expression, three types of is definitions are described:

1. In | Volgens BG1 « zijn » BG1 [AG3] BG1 AG7* '.'
   In | According to BG1 « to be » BG1 [AG3] BG1 AG7* '.'

2. CG2 « zijn » [AG3] BG1|BG2 AG7* '.'
   CG2 « to be » [AG3] BG1|BG2 AG7* '.'

3. AG7* « zijn » BG1 '.'
   AG7* « to be » BG1 '.'

The first regular expression is used to identify definitions that describe who is responsible for that specific definition. This pattern is useful for glossary creation, since in some cases the context makes a certain definition more specific. Such specific definitions may not be contained in a dictionary. The second and third pattern are more common patterns of is definitions. The second and by far most common regular expression defines a sequence of a definiendum, connector and definiens. The definiens begins in most cases with a noun phrase or preposition phrase. In some of the definitions, the first word of the definiens is an adverb, such as 'dus', 'gewoon', or 'eigenlijk'. A number of adverbs observed in the patterns of the development corpus have therefore been included in this regular expression of the grammar. In the third pattern, the order of the definiendum and definiens is switched: the sentence begins with the definiens and ends with the connector and definiendum.

3.5.3 GRAMMAR FOR verb DEFINITIONS

In addition to the general regular expressions from Section 3.5.1, the verb grammar contains four additional rules.

DV1: Verbal connector phrase at beginning of sentence  The first verbal definition type involves sentences in which the definiendum is mentioned at the beginning of the sentence. This definiendum is followed by a verbal connector phrase and the definiens. This regular expression has also been included in the pronoun grammar (DPr1). In total, this regular expression matches 36 types of verbal connector phrases using 27 different verbs or verb phrases. The regular expression for this rule is "CG2 « verb phrase » AG7* '.'".

DV2: Verbal connector phrase at end of sentence  In the second type of verb definitions, the definiendum is mentioned at the end of the sentence. The grammar describes eight possible variations of this type using seven distinct verbs. The verbal connector phrase can be discontinuous in this case. The expression to match these definitions is "AG7* « verbal phrase » BG1 [« second part verbal phrase »] '.'".

DV3: Verbal connector after preposition phrase In the last type, the definiendum is contained in a preposition phrase after which the verbal connector phrase follows. Two pattern types can be distinguished:

1. Bij | Met | Onder BG1 « verbal phrase » AG7* '.'
   By | With | Under BG1 « verbal phrase » AG7* '.'

2. BG2 « kunnen | worden » BG1 « verbal phrase » AG7* '.'
   BG2 « can | are » BG1 « verbal phrase » AG7* '.'

In the first type, the sentence begins with a preposition phrase that contains the definiendum, after which the connector phrase and definiens follow. The other type begins with a preposition phrase (that does not contain the definiendum) that is followed by a copula verb. The definiendum, verbal phrase and definiens constitute the rest of the definition.

EV1: Defining verb definitions The three types described in DV1 - DV3 are combined in this regular expression: “DV1 | DV2 | DV3”.

3.5.4 GRAMMAR FOR punctuation DEFINITIONS

On top of the general expressions presented in Section 3.5.1, the punctuation grammar contains an additional rule to match colon and bracket definitions (EPu1).

EPu1: Defining punctuation definitions  The regular expressions to define the two types of punctuation patterns are:

1. BG1 [ '(' AG7* ')' ] : AG7*

2. BG1 ‘(’|‘[’ AG7* ‘)’|‘]’ There are some restrictions on the first word after the colon and within the brackets. After the colon, capitalized words are not allowed based on the observation that these often are names of authors or other types of proper nouns. Within the brackets, it is for the same reason not allowed that the phrase begins with a capitalized word. In addi- tion, imperative verbs are not allowed at this position as these are in our corpus mainly used to refer to another source. By far the most fre- quently observed imperative verb in this position is the verb ‘Zie’ (see). In addition to imperative verbs and capitalized words, there are some other word types that are used only in non-definitions in this position. Since excluding all types that are observed would result in overfitting to the dataset, I decided that a word type had to account for at least 5% of the extracted non-definitions to be excluded by the gram- mar. This was only the case for capitalized words and the imperative verbs, all other phenomena occurred less frequently.

3.5.5 GRAMMAR FOR pronoun DEFINITIONS

On top of the regular expressions that are used in all grammars (Section 3.5.1), the pronoun grammar contains a number of additional expressions in parts A, B, D and E of the grammar.

APr1a: Defining pronominal adverbs of the type 'waar...'  In this expression, a number of pronominal adverbs of the type 'waar...' (... which) are matched, such as 'waarmee' (with which), 'waarin' (in which), 'waarbij' (by which) and 'waardoor' (through which). These are used as connector in one of the pronoun definition types. These words are matched with the regular expression "'waarmee'|'waarin'|'waarbij'|'waardoor'".

APr1b: Defining pronominal adverbs of the type 'hier...' and 'daar...'  In this regular expression, a number of words of the type 'hier...' and 'daar...' are matched, namely 'hiermee' (with this), 'hierin' (in here), 'hierbij' (herewith), 'hierop' (upon this), 'daarmee' (with that) and 'daarin' (in it). The MBT tagger tagged these words either as standard adverbs ('hiermee', 'hierin', 'hierbij', 'daarmee') or as pronominal adverbs ('hierop', 'daarin'). Since they are considered pronominal adverbs in this thesis, definitions in which they are used as connector are called pronoun definitions. The expression that has been used to match these tokens is "'hiermee'|'hierop'|'hierin'|'hierbij'|'daarin'".

BPr1: Defining appositions This regular expression is the same as BI1: “‘,’ AG7* V ‘,’”.

DPr1: Verbal connector phrase at beginning of sentence  This regular expression resembles DV1. The only difference is that it begins with a demonstrative determiner (e.g. 'dit', 'dat') or pronoun (e.g. 'dit', 'zo'n', 'zulke'): "Pron(demonstrative) (AG5c) « verbal phrase » AG7* '.'".

DPr2: Verbal connector phrase at end of sentence  This is a restricted version of DV2. It only differs at the beginning of the definition: "Pron(demonstrative) (AG5c) « verbal phrase » BG1 [« second part verbal phrase »] '.'".

DPr3: ‘dat wil zeggen’ The Dutch phrase dat wil zeggen corresponds to the English phrase ‘i.e.’ and is often used as a connector between definiendum and definiens. It can be used as an abbreviation as well (d.w.z.). Because of the pronoun dat (‘that’), sentences in which this phrase has been used as the connector have been classified as pronoun definitions. The expression to match this phrase is “‘dat wil zeggen’| ‘dit wil zeggen’|‘d.w.z.’|‘D.w.z.’|‘dwz’”.

EPr1: Defining pronoun definitions  In addition to the patterns described in DPr1, DPr2, and DPr3, there are two other pronoun patterns. EPr1 combines the five types of pronoun definitions:

1. BG1 APr1a AG7* '.'
2. APr1b « verbal phrase » AG7* '.'
3. Pron (AG5c) « verbal phrase | zijn » BG1 AG7* '.' (DPr1)
4. Pron (AG5c) « verbal phrase » BG1 [« second part verbal phrase »] '.' (DPr2)
5. CG2 « dat wil zeggen » AG7* '.' (DPr3)

The first pattern begins with a noun phrase that is followed by a relative pronoun of the type 'waar...'. The second definition pattern begins with 'hier...' or 'daar...'. The remaining three types have been described in DPr1, DPr2, and DPr3.

3.6 LXTRANSDUCE

The regular expressions presented in the previous section have been converted into Lxtransduce rules to make the matching of definitions possible. Lxtransduce is a finite state XML transducer intended for use in natural language processing (NLP) applications (Tobin, 2005). XPath-based regular expressions are matched against elements in the input document. When a match is found, a corresponding rewrite is done. Lxtransduce can work with either text or XML documents as input formats. The corpus documents in which the definitions have to be annotated are XML files. The grammars are thus defined over XML elements and are therefore called 'XML-level grammars'.

Elements matched by a rule are replaced with the rule's rewrite, while for the unmatched elements the input remains unchanged. The output XML files produced by Lxtransduce thus include annotated definitions in addition to the information that was already available in the input file. The structure of the XML documents has been described in Section 3.4.1. The definitions are annotated with a definition element within the sentence element. Within the definitions, two other elements can be added: one to wrap the definiendum and one to wrap the connector phrase. Example 10 shows what is added to the original XML structure when a sentence is matched by the grammar:

(10) <!-- Reconstruction: the element names are illustrative and the
          token-level markup inside the sentence is omitted. -->
     <s id="s1">
       <definition type="is">
         <definiendum>Grafieken</definiendum> <connector>zijn</connector>
         grafische weergaves van waarden op het werkblad .
       </definition>
     </s>

The Lxtransduce grammar rules use three types of information contained in the attribute values of the token elements to formalize the regular expressions. These are the 'ctag' attribute (containing the PoS tag), the 'msd' attribute (containing the morpho-syntactic information), and the 'base' attribute (containing the lemma). To understand how the Lxtransduce grammar rules are structured, the basic principles are explained in Table 3.4. Example 11 shows how these principles are used in the grammar rules. This example provides a simplified version of the rule to match is definitions. A more detailed description of the grammar file format can be found in the online manual of Lxtransduce (Tobin, 2005).

element     attribute  description
rule                   refers to grammar rules
            name       name of the rule
            wrap       rewrite is wrapped in new element
            attrs      attributes of newly defined element
ref                    refers to another rule defined earlier in the grammar file
            name       name of the rule to which it refers
            mult       number of matches of the rule: * (0 or more),
                       + (1 or more), or ? (0 or 1)
query                  matches a regular expression
            match      regular expression to be matched
            mult       see ref
constraint             imposes extra constraints on the match
            test       regular expression whose value must be true for the
                       rule to match
first                  matches the first token to match
best                   matches the longest stretch of tokens to match
seq                    matches the input if the child elements match in order
and                    matches the input if all child elements match at the
                       current point
not                    matches the input if the child element does not match

Table 3.4 Lxtransduce elements

(11) <!-- Reconstructed sketch of the rule; only the first of the three
          is patterns is spelled out, and the exact attribute syntax is
          illustrative. -->
     <rule name="is_def" wrap="definition" attrs="type='is'">
       <best>
         <seq>
           <ref name="simple_NPs"/>            <!-- definiendum -->
           <query match="tok[@base='zijn']"/>  <!-- connector: form of 'to be' -->
           <first>
             <ref name="NP"/>
             <ref name="VP"/>
           </first>
           <ref name="tok" mult="*"/>          <!-- rest of the sentence -->
           <query match="tok[.='.']"/>         <!-- full stop -->
         </seq>
         <!-- patterns for the two other is definition types -->
       </best>
     </rule>

At the top level, the 'rule' element provides the name of the Lxtransduce rule ('is_def'). The two other attributes tell what needs to be done when this expression is matched. The 'wrap' attribute says that the phrase needs to be wrapped within a definition element and the 'attrs' attribute indicates that an attribute should be included within this element that tells that the definition is an is definition. Within the 'best' element, a number of expressions is defined. The grammar tries to find the longest (best) match possible of these expressions. This can be either one of the three types of is definitions. In the example, the pattern is defined only for the first type. This should begin with a definiendum that is matched by a previously defined rule called 'simple_NPs'. After the definiendum, the connector verb is should be used and the connector has to be followed by either a noun phrase or a verb phrase. For the rest of the sentence, the lexico-syntactic structure is not considered and the rule simply matches any number of tokens (with the rule 'tok') until it comes to a full stop. If a sentence does not have this pattern, the other two types are tried. When a match is found, the definition is marked in the original XML document.

3.7 RESULTS

The regular expressions presented in Section 3.5 have been implemented in Lxtransduce grammars. The results obtained with the four grammars for the different definition types are analyzed in this section. The definitions that are not detected by the grammars are analyzed as well. After the discussion of the results for each type, some general observations are described and explained.

For the purpose of grammar development and testing, the LT4eL corpus has been divided into a development part (75%) and a test part (25%) (Westerhout and Monachesi, 2008). The grammar has been evaluated on both the development and the test corpus.

               Development     Test    Total
Documents               33       12       45
Words               311883   108319   420202
Sentences            23820     7732    31552
Definitions            473      130      603

Table 3.5 General statistics of the definition corpus

The difference between the results on the development and the test corpus gives an indication of the robustness of the grammars. Table 3.5 shows the statistics for the development and test corpus.

The results have been evaluated in two ways. First, the results on the development and test corpus are compared to examine how well the grammars perform and how robust they are. In addition to the recall, precision and F-score, the F2 score is reported as well, since the focus of the pattern-based approach is on the recall. Second, the patterns that are not extracted with the grammars are analyzed. The percentages are based on the definitions that have been marked manually.

3.7.1 RESULTS FOR is DEFINITIONS

corpus     definitions      R      P      F     F2
devel              134  0.881  0.353  0.504  0.678
test                52  0.788  0.342  0.477  0.625
complete           186  0.855  0.350  0.497  0.664

Table 3.6 Results of is grammar on development (devel), test and complete corpus

Table 3.6 shows the results obtained with the is grammar. The results on the development and test corpus differ with respect to recall, where the grammar performs better on the development corpus, and are comparable with respect to precision. Table 3.7 shows the problems for the is definitions that are not matched. The majority of the non-detected definitions (63.0%) are caused by tagger errors. As described in Section 3.4.2, the MBT tagger performs

                      development          test
                         #      %        #      %
tagger error             9   6.7%        8  15.4%
sentence start           3   2.2%        –      –
complex pattern          4   3.0%        2   3.8%
missing in grammar       –      –        1   1.9%
total non-matched       16  11.9%       11  21.2%
total matched          118  88.1%       41  78.8%

Table 3.7 Non-matched definitions with is grammar

considerably better on known words (97.1%) than on unknown words (71.6%). Definitions from the test corpus contain considerably more tagger errors than definitions from the development corpus (15.4% and 6.7%, respectively). Error analysis reveals that both known and unknown words cause problems. For unknown words, two classes can be distinguished. Most errors of this type consist of assigning the wrong tag to names of things like programs, commands, and projects (e.g. postscript, xpaint, tr, eWatch). The second class to which a wrong tag is frequently assigned are words that are English in origin, such as shell, link, mark-up, datagram and web-address. Errors on known words have been made four times. Three times the wrong category has been assigned due to ambiguity of the word 'zijn', which can be either the infinitive form of to be or the possessive pronoun his. In all three cases, the connector verb has been tagged as a possessive pronoun. The other tagger error of this type, which occurs only once, is the conjunction 'of' (or) being tagged as a proper noun. It is not clear why this happened. When the definitions containing tagger errors are not considered, the recall improves to 93.2% (development) and 94.4% (test).

The other non-detected patterns either did not begin with a capital letter or had complicated sentence constructions. Only one non-detected definition had a pattern that should be added to the grammar. Appendix B contains a complete list of the definitions that are not matched by the grammar, grouped on the basis of the error types.

corpus     definitions      R      P      F     F2
devel              150  0.807  0.467  0.592  0.704
test                51  0.686  0.368  0.479  0.585
complete           201  0.776  0.441  0.562  0.679

Table 3.8 Results of verb grammar on development (devel), test and complete corpus

                      Development        Test
                         #      %      #      %
tagger errors           10   6.7%      3   5.9%
complex patterns         4   2.7%      2   3.8%
no verbal connector     13   8.7%      4   7.8%
other                    2   1.3%      3   5.9%
missing in grammar       –      –      4   7.8%
total non-matched       29  19.3%     16  31.4%
total matched          121  80.7%     35  68.6%

Table 3.9 Non-matched definitions with verb grammar

3.7.2 RESULTS FOR verb DEFINITIONS

Table 3.8 shows the results for the verb definitions. The grammar is able to extract almost 70% of the definitions from the test corpus. Although this is considerably lower than the recall on the development corpus (80.7%), it is still quite high, which indicates that the grammar covers most verbal patterns. The difference is not completely due to the grammar, since other factors play an important role as well. Table 3.9 provides an overview of the characteristics of the verb definitions not matched by the grammar. The main problem is formed by definitions that do not actually have a definition pattern and do not contain a verb or verb phrase that can be used to identify definitions. In the development corpus, this accounted for 44.8% of the errors, whereas for the test corpus this percentage was 25%. An example of such a sentence from the corpus is:

(12) Een vaste spatie voorkomt dat  een regel tussen  twee
     A   fixed space  prevents that a   line  between two
     woorden wordt afgebroken.
     words   is    broken down.
     ‘A non-breaking space prevents that a line will be broken
     between two words.’

It is very difficult to detect such sentences using a pattern-based approach, and such patterns will probably be problematic for other approaches as well. More investigation of the contextual features of such definitions is necessary to find out whether other information can be used to detect them with a different approach. As far as I know, no research in this direction has been carried out yet.

Tagger errors are the second most frequent cause of non-detected patterns. The proportion is higher for the development corpus than for the test corpus (34.5% versus 18.8%), and the errors are of the same type as the ones in the is definitions. For one quarter of the non-detected patterns from the test corpus, the errors involve definition patterns for which a regular expression should be added to the grammar, as these are missing definition patterns. For three of the four sentences, this would mean adding an extra connector phrase; for one sentence, an additional phrase within the connector (‘ook wel’ (also)) would have to be allowed. These patterns account for 7.8% of the verb definitions. The errors that do not fit into one of the other categories are put in the group ‘other’. These are problems like names of commands (‘“x” permissie’ (‘x’ permission)), a sentence ending with a semicolon, and a duplicated word (‘voor voor’ (for for)). Appendix B lists all verb definitions not matched by the grammar.

The results of the verb grammar differ largely per connector phrase. The recall is above 0.5 for all phrases except ‘geven + NP’ (to provide + NP), which can be used in a definition like ‘Metadata provide a brief description of the material, such as author, publication date, keywords and, in the case of e-learning, teaching level’. For this type, which has been used in the definition corpus four times, the recall is as low as 0.25. The variation in the precision scores is particularly high, ranging between 0.05 and 1. Appendix E (Table E.1) shows the balance between definitions and non-definitions for each of the verbal connector phrases. The most problematic phrase is constituted by the ambiguous verb ‘leveren’, which can be used in definitions in the meaning of to provide, but also has other senses, namely to furnish, to purvey, to supply, and to deliver. In some other phrases the low precision is partly due to the fact that the phrase is ambiguous as well. This is, for example, the case for the phrase ‘bevatten’ (precision of 0.14), which can mean to comprise or to include, but also to understand or to realize.

corpus       definitions      R      P      F     F2
devel                105  0.876  0.081  0.148  0.295
  colon               69  0.942  0.120  0.213  0.398
  bracket             30  0.900  0.045  0.086  0.188
  other                6  0.000  0.000  0.000  0.000
test                   7  0.857  0.018  0.035  0.083
  colon                2  0.500  0.006  0.013  0.030
  bracket              5  1.000  0.028  0.055  0.126
  other                0      –      –      –      –
complete             112  0.875  0.067  0.124  0.255
  colon               71  0.930  0.095  0.172  0.236
  bracket             35  0.914  0.041  0.079  0.114
  other                6  0.000  0.000  0.000  0.000

Table 3.10 Results of punctuation grammar on development (devel), test and complete corpus

3.7.3 RESULTS FOR punctuation DEFINITIONS

As described in Section 2.3.3, there are four types of punctuation connectors and thus also four types of punctuation patterns. The grammar focuses only on the detection of the two most common types: definitions in which a colon is used as the connector, and definitions in which either the definiendum or the definiens is enclosed within brackets. The comma and dash types are not included in the grammar. Since both the colon and the bracket can be used for many purposes other than definitions, the precision for this type is very low. From Table 3.10 one can see that on the complete corpus the grammar performs quite differently on colon and bracket definitions with respect to precision, which is much lower for the bracket definitions. The small number of punctuation definitions in the test corpus makes it difficult to compare the performance on the development and test corpus. In five of the seven punctuation definitions from the test corpus, a bracket is used as the connector; in the other two cases, a colon is used. In the single non-detected definition from the test corpus, the connector is a colon.

                        Development        Test
                           #      %      #      %
tagger errors              3   2.9%      –      –
problematic connector      6   5.8%      –      –
complex patterns           2   1.9%      1  14.3%
other                      2   1.9%      –      –
total non-matched         13  12.6%      1  14.3%
total matched             90  87.4%      6  85.7%

Table 3.11 Non-matched definitions with punctuation grammar

Table 3.11 shows the division of errors for the non-detected patterns. As a consequence of including only the two most frequently used connectors, the six definitions in which another connector is used (five times a comma, once a dash) are not detected; almost half of the errors are caused by this problem. Another problematic issue is constituted by sentences in which connector and definiendum appear at the end of the sentence instead of at the beginning. These are often not recognized, or only matched partially. The partial matches are caused by the design of the grammar, which first searches for definitions in which the definiendum is at the beginning of the definition, followed by a colon or bracket and the definiens. Because the performance is evaluated at sentence level, these definitions are nevertheless classified as extracted. Again, an overview of all non-detected definitions can be found in Appendix B.

3.7.4 RESULTS FOR pronoun DEFINITIONS

As described in Section 2.3.4, the term pronoun definitions is an umbrella term for a number of patterns that all have in common that a pronoun is used to connect definiendum and definiens. However, there are several ways in which the pronouns are used, and as a consequence there is a variety of pronoun definitions. This variety and the low frequency of each of the different types make generalization difficult.

                 Development                Test
              #      R      P      F     #      R      P      F
this         30  0.767  0.144  0.242     8  0.500  0.133  0.211
this + np    12  0.500  0.500  0.500     2  1.000  1.000  1.000
here...       6  1.000  0.231  0.375     2  0.000  0.000  0.000
there...     16  0.875  0.035  0.067     1  1.000  0.007  0.014
that is      15  1.000  0.341  0.508     6  0.833  0.556  0.667
other         2  0.000  0.000  0.000     1  0.000  0.000  0.000
all          81  0.790  0.100  0.177    20  0.600  0.065  0.118

                 Complete
              #      R      P      F
this         38  0.711  0.142  0.237
this + np    14  0.571  0.571  0.571
here...       8  0.750  0.194  0.308
there...     17  0.882  0.028  0.054
that is      21  0.952  0.377  0.541
other         3  0.000  0.000  0.000
all         101  0.752  0.092  0.164

Table 3.12 Results of pronoun grammar on development, test and complete corpus

To see how the grammar performs on the different types, the results in Table 3.12 are split according to five pronoun definition types. The results show that the recall differs for each pronoun type: for the ‘where’ type it is very low, whereas for the ‘dit - definiens’ and ‘dat wil zeggen’ types it is much better in both the development and the test corpus. Evaluation of the results for the distinct pronoun types also reveals that the differences between the patterns have consequences for the degree to which they can be extracted using a pattern-based approach. The ‘dat wil zeggen’ type appears to be the easiest type to extract. From the results, the ‘dit - definiens’ type seems to be problematic to extract; however, if the patterns that are not detected due to tagger errors are left out, the recall score for the complete corpus increases from 0.571 to 0.929. Three definitions have a pattern that differs from the five patterns described in the grammar. The decision to leave these out is justified by the fact that the patterns of these three sentences are quite complex and cannot be matched with general rules.

                     Development        Test
                        #      %      #      %
tagger errors           7   8.6%      –      –
complex patterns        3   3.7%      3  15.0%
no connector            7   8.6%      5  25.0%
total non-matched      17  21.0%      8  40.0%
total matched          64  79.0%     12  60.0%

Table 3.13 Non-matched definitions with pronoun grammar

                  R      P      F     F2
is            0.855  0.350  0.497  0.664
verb          0.776  0.441  0.562  0.674
punctuation   0.875  0.067  0.124  0.255
pronoun       0.752  0.092  0.164  0.309
all           0.813  0.158  0.264  0.444

Table 3.14 Summarizing the results of the pattern-based approach

3.7.5 OVERALL RESULTS

Table 3.14 presents a summary of the results obtained with the pattern-based approach. The first general observation is that for all types recall is considerably higher than precision. Since the aim was to obtain the highest recall possible, this result is consistent with our expectations. The precision scores are especially low for the punctuation and pronoun definitions.

Recall The recall for all types is above 75%, despite the large number of different patterns observed in the data. The overall score is even 6 points higher, which means that more than 80% of the definitions are detected within the texts. The reasons for patterns not being detected can be divided into three main groups: tagger errors, problems related to the connectors, and syntactically complex sentence constructions.

The tagger used, the Tilburg Memory-Based Tagger (MBT), has a reported performance comparable to other state-of-the-art taggers. However, all taggers make errors, especially when used on data different from their training data; it is thus inevitable that this happens in our corpus as well. To solve the first problem, the tagger needs to be improved. To reduce the number of problems related to the connectors, a solution might be to investigate the definition patterns in a larger corpus. The restricted number of definitions sometimes makes it difficult to define a general pattern, and some of the connectors appeared only in the test data. The grammar could be made more stable if more patterns were examined. However, this solution involves a lot of manual effort.

Precision The aim of the pattern-based approach is to match as many definitions as possible. Since the patterns of definitions are quite diverse, the regular expressions had to be quite general to obtain a high recall. An obvious side effect of using such general expressions is that the precision decreases. An example which clearly illustrates the negative influence of general rules on the precision is the type of article used in the definiendum of is definitions. The majority of these definitions contains either no article or an indefinite article (87%), whereas only 13% contains a definite article. Using a restricted rule to match only definienda in which an indefinite article or no article is used would eliminate 44% of the incorrectly extracted definitions. However, it would also cause a loss of 13% of the is definitions, which is not acceptable given the primary focus on obtaining the highest recall possible with the grammar.

One of the reasons for the low precision is the inclusion of a rule that was adopted for pragmatic reasons, namely to deal with errors of the sentence splitter tool. During the conversion from the original document (PDF, DOC or HTML) to XML, all sentences are split automatically and each sentence is wrapped within a sentence element. The grammar searches for definitions at the level of sentences. During the sentence splitting process not all sentences have been split correctly, and the resulting errors have not been corrected manually. Such splitting errors typically occur when sentences do not end with a full stop, which is often the case in headers, list items and table entries. As a consequence of these errors, it is not enough to state in the grammar that a sentence begins at the start of a sentence element. To solve this issue, the grammar states that each word beginning with a capital letter can indicate the beginning of a sentence. Although including this rule increases recall, it has a negative effect on the precision at the same time, since it also matches proper nouns, which are not necessarily used at the beginning of a sentence.

A lot of non-definitions have been eliminated with the pattern-based approach. However, it is still necessary to adopt another approach on top of the pattern-based one to increase the precision. To this end, a machine learning component is used after the pattern-based approach as a filtering or refining step. By attempting to learn what differentiates a definition from a non-definition on the basis of a set of characteristics or features, it is possible to further improve the results obtained with the pattern-based approach. The features can be of various kinds: some are based on linguistic information, others on different characteristics such as the importance of a word, the position within the document, or layout information. The details of the machine learning approach and the results obtained with it are discussed in the next chapters.
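To make the capital-letter workaround described above concrete, the sketch below mimics it with a plain regular expression. This is a simplified, hypothetical stand-in: the actual grammar is an Lxtransduce grammar operating on part-of-speech annotations, and the example sentence is invented.

```python
import re

# Simplified stand-in for the workaround described above: because the
# sentence splitter is unreliable, any capitalized word may open a
# sentence, so the pattern anchors the definiendum on a capitalized
# word rather than on the start of a sentence element. The real
# grammar works on POS tags in Lxtransduce, not on raw text.
pattern = re.compile(r'([A-Z]\w*(?:\s+\w+){0,3})\s+is\s+een\b')

text = "Het programma Xpaint is een eenvoudig tekenprogramma."
m = pattern.search(text)
if m:
    print("definiendum candidate:", m.group(1))  # "Het programma Xpaint"
```

Note how this general anchoring also fires on arbitrary proper nouns inside a sentence, which is exactly the precision penalty discussed above.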

3.7.6 RESULTS WITH BASIC GRAMMARS

The grammars that have been presented in this chapter include regular expressions that are based on the connector phrases and the syntactic structure of definiendum and definiens. In order to investigate how important it is to include the syntactic patterns in addition to the connectors, a basic version of each grammar has been developed, which only matches the connector phrases. These grammars have been evaluated on the same corpus as the sophisticated grammars. Table 3.15 shows that the precision scores drop dramatically when the basic grammars are used, while the recall increases. The improved recall scores can be explained by the fact that more complex patterns are matched and that tagger errors cause fewer problems. The decrease in precision shows that the syntactic structure of the definiendum and the beginning of the definiens are of crucial importance when definitions need to be distinguished from non-definitions.

                  R      P      F     F2
is            0.989  0.030  0.057  0.084
verb          0.837  0.088  0.159  0.218
punctuation   0.954  0.018  0.035  0.052
pronoun       0.762  0.034  0.065  0.093
all           0.892  0.033  0.064  0.092

Table 3.15 Results with the grammars based only on the connector phrase

3.7.7 RESULTS FOR OTHER LANGUAGES

Within the LT4eL project, Lxtransduce grammars have been written for eight languages. In addition to Dutch, grammars were written for Bulgarian, Czech, English, German, Polish, Portuguese, and Romanian. The levels of sophistication of the grammars varied for each language, and not all grammars addressed the four types described in this chapter.

                    Is                  Verb              Punctuation
               R     P    F2       R     P    F2       R     P    F2
Dutch       0.86  0.35  0.66    0.78  0.44  0.67    0.88  0.07  0.26
Bulgarian   0.67  0.28  0.52    0.68  0.24  0.49    0.25  0.10  0.19
Czech       0.48  0.30  0.43    0.25  0.09  0.18    0.62  0.21  0.44
English     0.58  0.17  0.39    0.32  0.34  0.32    0.12  0.33  0.14
German      0.55  0.37  0.50    0.20  0.32  0.22    0.17  0.02  0.06
Polish      0.74  0.22  0.50    0.40  0.13  0.28    0.64  0.04  0.17
Portuguese  0.66  0.32  0.54    0.65  0.14  0.38    0.47  0.28  0.41
Romanian    1.00  0.54  0.85    1.00  0.76  0.94    1.00  0.15  0.46

Table 3.16 Results of grammars for all LT4eL languages

Table 3.16 shows the results obtained for the eight languages with an Lxtransduce grammar. A sentence-based evaluation of the results has been used: for each of the retrieved definitions it has been evaluated whether one or more of its tokens are part of a manually marked definition, and if this is the case, the grammar is said to match that sentence. The results for the pronoun definitions have not been included in the table, since apart from Dutch only one grammar addressed this type. Most grammars have been evaluated on the development corpus; for the Dutch and Portuguese grammars, a combination of a test and a development corpus has been used.

Although the same approach has been applied for each of the languages, there are large differences in the results. The Romanian results seem very good at first sight, but it is not clear how they were obtained. The fact that the performance has been evaluated only on the development corpus makes the recall score of 1 less reliable. In addition, the results of a token-based evaluation are quite different for the Romanian definitions: both precision and recall drop dramatically, with the precision falling to between 0.003 and 0.064 and the recall to values between 0.116 and 0.333. For the other languages, the sentence-based and token-based evaluations provided similar results.

The Dutch grammar clearly outperforms the grammars for the other languages. This is partly due to the fact that the development of most of the grammars stopped when the project ended. The language partners that continued to work on the glossary candidate detector after the project mainly focused on the use of machine learning techniques to improve on the grammar results (Kobyliński and Przepiórkowski, 2008; Borg et al., 2009; Del Gaudio and Branco, 2009).
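A hedged sketch of this sentence-based evaluation, assuming each definition is represented by the set of its token identifiers (the identifiers shown are hypothetical placeholders):

```python
# Sentence-based evaluation as described above: a retrieved definition
# counts as a match when at least one of its tokens belongs to a
# manually marked definition. Token ids are hypothetical placeholders.
def matches_gold(retrieved_tokens, gold_definitions):
    retrieved = set(retrieved_tokens)
    return any(retrieved & gold for gold in gold_definitions)

gold = [{"t12", "t13", "t14"}, {"t40", "t41"}]
print(matches_gold(["t14", "t15"], gold))  # True: shares token t14
print(matches_gold(["t90", "t91"], gold))  # False: no overlap
```

A token-based evaluation would instead score the proportion of overlapping tokens, which explains why the two evaluation styles can diverge so strongly, as observed for Romanian.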

3.8 QUALITATIVE EVALUATION

Within the LT4eL project, the grammars written for the glossary candidate detector (GCD) tool have been integrated in the Learning Management System (LMS) ILIAS. The usefulness of the tool and the results obtained with it have been tested for all languages. In the experiments, the test persons used the GCD functionality and other LT4eL functionalities (keyword extractor, semantic search) within the context of a scenario. After running the scenario, they had to respond to a number of statements on the different aspects of the system. The respondents specified their level of agreement with each statement on five-level Likert items (Likert, 1932).

The opinions on the quality of the output varied. 14 out of the 28 responses agreed with the statement that the terms proposed for definitions were suitable, 10 disagreed and 4 neither agreed nor disagreed. There was more agreement on the context provided with the terms: 17 out of 28 agreed that the context given for the term was accurate, whereas 7 disagreed and 4 neither agreed nor disagreed.

The users also provided additional feedback about the GCD functionality. There were comments regarding the usage of the tool, the results obtained with it, and the definitions extracted with it. With respect to the results, some test persons said that the quality of the results should be improved to make it a useful tool. Not everyone agreed, as other persons remarked that most of the retrieved definitions were correct.

A remark made by one test person concerned the distinction between definitions and descriptions of the properties of a term. Since the GCD aims at extracting broad definitions, which can in some cases be a description of the properties of a term, this distinction is not a big problem in the glossary creation context. To give an example, the sentence ‘Word is a program for editing texts’ is not a definition in the sense that this description is unique to Word; other text editors have exactly the same property. However, it is useful as a definition since it provides the learner with a description of what the program Word is intended for, which will be enough in most cases.

The responses to the statement ‘The Definition finder is a useful tool’ showed that tutors in general see the potential of such a tool in an eLearning context, as 21 out of 27 users agreed with this statement. The main reason to consider it a useful tool is that it assists them by providing a basis for a definition which they can extend or adapt to their own needs.

3.9 CONCLUSIONS

This chapter described the use of a pattern-based approach to detect definitions. The XML transducer Lxtransduce has been used to match and annotate definition patterns in XML documents. The quantitative results obtained were first described for the language on which this thesis focuses, that is, Dutch. The general impression from this evaluation is that the precision scores are not good enough, especially for the punctuation and pronoun definitions, whereas the recall obtained with the pattern-based approach is high.

Quantitative evaluation of the grammars written for the other languages of the LT4eL project reveals that the Dutch grammar outperforms all other grammars. This is probably due to the fact that for Dutch more effort was invested in writing the grammars and in defining as many regular expressions of definitions as possible. A lot of manual work needs to be done to build an acceptable grammar, due to the diversity of definition patterns. The qualitative evaluation of the LT4eL glossary candidate detector shows that users see the potential of such a tool within an eLearning context. However, there is disagreement with respect to the quality of the glossary candidate detector that has been implemented in ILIAS.

From the quantitative and qualitative evaluation it is clear that the precision scores need to be improved. Experiments to this end using machine learning techniques are presented in the next chapters. Since the recall scores obtained with the pattern-based approach are high, the grammar output constitutes a good basis for further improvements. The machine learning approach is therefore used as a filtering step after the pattern-based extraction.

Observations always involve theory. (Edwin Hubble, 1889–1953)

4 Machine learning

4.1 INTRODUCTION

Using a pattern-based methodology for the identification of definitions has one major drawback: a number of the defined patterns can be used in non-definitions as well. The result is that not only definitions have been extracted, but also a large number of sentences that have been incorrectly recognized as such. This is especially the case for the punctuation and pronoun definitions. In this chapter, a machine learning approach is presented that exploits several types of information in order to discriminate between definitions and non-definitions.

A machine learning process involves several steps. Decisions need to be taken regarding the data that will be used, the features to be investigated, and the selection of an appropriate classifier. To provide a basis for the choices that have been made in the design of the experiments, the first part of the chapter introduces the different components of a machine learning process. Furthermore, it describes for some of the components a number of options from which one can choose.

The use of appropriate features determines to a large extent how well the classifiers will perform. The pattern-based approach already revealed that definitions conform to a restricted number of lexico-syntactic structures. This chapter outlines a number of characteristics in addition to the patterns for which definitions and non-definitions behave differently. Some of these features are based on previous work in the area of definition extraction (cf. Blair-Goldensohn et al. (2004); Fahmi and Bouma (2006)), but I also identified several other characteristics that have not been investigated before. The features are related to linguistic properties, connector properties, position properties, layout properties, and keyword properties.

The selection of a suitable classifier is based on several aspects. One element that is especially relevant concerns the balancedness of the data sets. Generally, the purpose of a classifier is to assign as many instances as possible to the correct class. As a consequence, many classifiers are not appropriate for imbalanced data sets in which the minority class contains the positive instances. Since some of the definition data sets that have been used are highly imbalanced, a classifier should be used that is able to handle imbalanced data sets. In my experiments, a classifier has been selected that addresses the imbalanced data set problem by using a balanced bagging version of the Random Forest classifier: the Balanced Random Forest classifier.

This chapter starts with an introduction to the field of machine learning (Section 4.2). This introduction addresses the different components that are involved in a machine learning process, such as the collection of data, the selection of features and the choice of a classifier. The different components are linked to the definition classification task by describing a number of challenges and relevant aspects related to this task. Section 4.3 presents an overview of related research in which machine learning has been applied to definition extraction. The machine learning experiments that have been carried out on the data sets created with the pattern-based approach are presented in Section 4.4. The focus of this section is on the description of the features. Section 4.5 concludes the chapter by summarizing the design of the experiments.

4.2 MACHINE LEARNING COMPONENTS

Machine learning has been applied to a wide variety of tasks, such as recognizing spoken words, predicting recovery rates of pneumonia patients, detecting fraudulent use of credit cards, and playing games (Mitchell, 1997). The classification of definitions falls within the category of classification tasks. In such tasks, a machine learning classifier is trained to assign unseen data to the correct category (or ‘class’) with the highest possible accuracy. To learn which classes new sentences belong to, an algorithm needs data. The amount of data depends on the problem and the domain; a lot of examples are often needed to build a good classifier. The data set in the definition classification context consists of a collection of input sentences (‘instances’) that can be divided into two classes: definitions and non-definitions. Each instance is represented by a number of characteristics on the basis of which the algorithm learns to make predictions. These characteristics are called ‘features’.

Kotsiantis (2007) presented a diagram that illustrates the components involved when applying machine learning to a problem (Figure 4.1). The process starts with collecting a data set for a certain problem. The instances of this data set should be converted into a format the classifier can understand. This involves the selection of a subset of relevant features and, if needed, the removal of redundant features to reduce the dimensionality of the data (Yu and Liu, 2004). This is especially relevant when there are many features and the data set is big. In the algorithm selection component, an appropriate classifier is selected. To train and test this classifier, the data set needs to be divided into training and test data.

The diagram shows that research into applying machine learning to a task is an iterative process. Backtracking to make changes in a previous component is generally part of machine learning research. These changes can be very diverse: the data set can be enlarged, the feature set can be adapted, a different algorithm can be selected, or the parameters of the classifier can be tuned better. The rest of this section explains the different components in more detail and relates them to the definition classification problem.

4.2.1 IDENTIFICATION OF DATA

As described in Section 1.4, the interpretation of the term ‘definition’ can be more or less strict depending on the application in which definitions are employed. The application influences the desired outcome of the process as well: in question answering the aim generally is to obtain a ranked list of answer candidates, in dictionary building the most concise definition needs to be selected, and in ontology building either the relation between genus and species has to be extracted or the best definition for a given concept should be selected.

Figure 4.1 The machine learning diagram adapted from Kotsiantis (2007)

Despite these differences, however, the shared task in all these applications is to distinguish definitions from non-definitions. The identification of appropriate data thus depends partly on the application and the interpretation of the term ‘definition’. It is therefore not possible to simply take over a data set that has been collected for definition extraction purposes in a different context.

Several approaches can be adopted to collect a data set. The data set can be the product of a previous step (Fahmi and Bouma, 2006; Del Gaudio and Branco, 2009). Another way to acquire a data set is the manual collection of data. This involves the selection of appropriate content (documents, sentences, phrases) that is representative for the problem one aims to address. The data set should contain enough examples of the different classes that are distinguished. An additional task that is often involved when a data set is collected manually is the labeling of the instances. Section 4.2.4 provides more details on the use of labeled and unlabeled data.

4.2.2 DATA PRE-PROCESSING

The collected data need to be converted into a format that can be understood by the machine learning algorithm. In this step, it is also decided how to deal with missing data. The format of the data depends on the program that is used. A common format understood by different machine learning programs is the comma-separated format (csv), in which the class is mentioned as the last value. This format can be read by machine learning applications like Weka (Witten and Frank, 2005) and TiMBL (Daelemans et al., 2007). Weka uses its own arff format in addition to the comma-separated format. An arff (attribute-relation file format) file is an ASCII text file that describes a list of instances sharing a set of attributes. Conversion from csv to arff can be done automatically, as the sketch below illustrates.
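As an illustration, a minimal csv-to-arff converter in Python; the file names, attribute types and class values are assumptions for the sake of the example, not the scripts used in this thesis.

```python
import csv

# Minimal csv-to-arff conversion, assuming numeric features and a
# binary class label in the last column. File names, attribute types
# and class values are hypothetical.
def csv_to_arff(csv_path, arff_path, relation="definitions"):
    with open(csv_path, newline="") as f:
        rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]
    with open(arff_path, "w") as out:
        out.write(f"@relation {relation}\n\n")
        for name in header[:-1]:
            out.write(f"@attribute {name} numeric\n")
        out.write(f"@attribute {header[-1]} {{def,nondef}}\n\n@data\n")
        for row in data:
            out.write(",".join(row) + "\n")

# Tiny example input, written here so the sketch is self-contained.
with open("sentences.csv", "w") as f:
    f.write("sent_len,kw_score,class\n18,0.42,def\n25,0.05,nondef\n")
csv_to_arff("sentences.csv", "sentences.arff")
```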

4.2.3 FEATURE SELECTION

In the definition classification task, the two classes are the definitions and the non-definitions. The ability to predict which class new sentences belong to strongly depends on the features employed for training the classifier. The better these features distinguish definitions from non-definitions, the better the classifier will be at predicting the correct class of new data.

A very basic type of feature often used in text classification tasks are n-grams (cf. Bekkerman and Allan (2003) and Tan et al. (2002)). This means that any consecutive combination of n items in a sequence is considered. The items included can be anything depending on the application (phonemes, syllables, letters, words, part-of-speech tags). Generally, n-grams of size 2 (bigrams) or 3 (trigrams) are used. An advantage of using n-grams as features is that little manual effort is involved in the selection of features. However, at the same time this naive way of feature selection can also be a disadvantage: it does not provide much insight into the characteristics of definitions. Another shortcoming is that n-grams only consider immediately adjacent combinations in the text, whereas other combinations can be relevant as well. In addition to using n-grams, feature selection can also be done on the basis of other aspects for which a different behaviour is observed in definitions and non-definitions.

Not all features are equally informative. The aim of the feature selection step is to define a set of features that leads to optimal classifier performance. The selection of those features can be done either automatically or manually. A disadvantage of automatic selection is that it does not provide insight into the performance of different combinations of features. If one wants to compare which types of information and combinations of features give good results, manual selection is necessary; otherwise, the selection of the best features can be carried out automatically. Since the selection of appropriate features is restricted by the information that is available in the sentences, it will often be necessary to go back to the data identification phase to collect more information about the data.
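To illustrate the n-gram idea on part-of-speech tags, a small sketch (the tag names are illustrative only, not the actual Dutch tagset used in the corpus):

```python
# Deriving part-of-speech bigram features from a tagged sentence.
# The tag names are illustrative placeholders.
def pos_bigrams(tags):
    return list(zip(tags, tags[1:]))

sentence_tags = ["Art", "N", "V", "Art", "Adj", "N"]
print(pos_bigrams(sentence_tags))
# [('Art', 'N'), ('N', 'V'), ('V', 'Art'), ('Art', 'Adj'), ('Adj', 'N')]
```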

4.2.4 ALGORITHM SELECTION

The selection of an appropriate algorithm to build the classifier takes place after the problem and data set have been defined. Since the problem that has to be addressed is the classification of definitions, a classification algorithm has to be used. To reduce the choice of potential classifiers even further, it is useful to make a distinction between three types of learning: supervised, unsupervised and semi-supervised learning (Chapelle et al., 2006). Each of these methods can be applied to classification tasks. In unsupervised learning, there is a set of n instances that have not been assigned to a class in advance. The goal of unsupervised learning is to find structures in the data that distinguish the different classes from each other. A common unsupervised method is clustering, where the aim is to find clusters of similar instances. In supervised learning, there is a set of n instances for which it is known to which class they belong. The purpose of supervised classifiers is to learn which class new instances belong to on the basis of the labeled training data. Semi-supervised learning has been designed to overcome the main disadvantages of supervised and unsupervised learning (Chapelle et al., 2006). In supervised learning, the labeling of instances is a time-consuming and expensive task that involves a lot of manual effort from human annotators who are experts in the domain. Unlabeled data are much easier to collect, but starting without labeled data often demands too much from classification algorithms. In semi-supervised learning, a large amount of unlabeled data is combined with a restricted amount of labeled data to build better classifiers. In this way, the manual effort is reduced to a minimum, whereas at the same time there is some knowledge from which the learning algorithm can start. In all related research that will be presented in Section 4.3, the data are labeled.

Another relevant aspect related to classifier selection is the balancedness of the data set. A data set is said to be imbalanced (or skewed) if the classification categories are not approximately equally represented (Chawla, 2005). Imbalance in class distribution is a common problem in real-world applications, ranging from predicting telecommunications equipment failure, text classification, learning word pronunciations and detecting fraud, to detecting oil spills from satellite images (Weiss, 2004). The data set in the definition extraction context can be either balanced or imbalanced. This depends on a number of factors, such as the definition types distinguished, the data collection process, and the application for which definitions are extracted. For example, the application of question answering demands different output than the glossary creation context: whereas in question answering the task of the classifier often is to rank a number of possible answers to find the best one, in glossary creation the classifier has to decide for each sentence whether it is a definition or a non-definition.
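The sketch below illustrates the imbalance problem on synthetic data, using the BalancedRandomForestClassifier from the imbalanced-learn package as one modern implementation of the balanced bagging idea. This is shown purely for illustration under stated assumptions; it is not the implementation used in the experiments of this thesis.

```python
from collections import Counter

from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.datasets import make_classification

# Synthetic stand-in for an imbalanced definition data set: roughly
# 9% positives, comparable to the pronoun sentences. The balanced
# random forest down-samples the majority class for each tree.
X, y = make_classification(n_samples=2000, weights=[0.91], random_state=0)
print(Counter(y))  # heavily skewed towards the negative class

clf = BalancedRandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
```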

4.2.5 TRAINING AND TEST SET

Different techniques can be used to evaluate the accuracy of a classifier (Han and Kamber, 2006). The first option is to split the data set by using one part for training and another part for testing the performance (the ‘holdout method’). A variation of this is random subsampling, where the holdout method is repeated a specified number of times; it is possible that data occur in a subsample more than once. In the second method, n-fold cross-validation, the original sample is partitioned into n subsamples of equal size. A single subsample of the n subsamples is retained as validation data for testing the model, and the remaining n − 1 subsamples are used as training data. The cross-validation process is then repeated n times (the ‘folds’), and each of the n subsamples is thus used exactly once as validation data. The n results from the folds are averaged to produce a single estimation. A slightly different method is stratified n-fold cross-validation, where the folds are stratified so that the class distribution of the subsamples is approximately the same as that in the initial data. The difference with the holdout method and random subsampling is that in n-fold cross-validation all instances are used exactly n − 1 times for training and one time for testing.

Leave-one-out cross-validation is a special case of n-fold cross-validation. As the name suggests, this method involves using a single observation from the original sample as test data, and the remaining observations as training data. This is repeated until each instance in the sample has been used once as test item. Leave-one-out cross-validation is computationally expensive because of the large number of times the training process is repeated.
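A sketch of stratified n-fold cross-validation on random stand-in data (scikit-learn is used here for illustration; the experiments in this thesis were run with Weka):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Stratified 10-fold cross-validation on random stand-in data:
# 200 instances, 10 features, imbalanced binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = rng.choice([0, 1], size=200, p=[0.85, 0.15])

scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X, y):
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X[train_idx], y[train_idx])                  # train on 9 folds
    scores.append(clf.score(X[test_idx], y[test_idx]))   # test on 1 fold
print(sum(scores) / len(scores))
```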

                               Predicted class
                          definition             non-definition
True    definition        true positives (TP)    false negatives (FN)
class   non-definition    false positives (FP)   true negatives (TN)

Table 4.1 Confusion matrix showing results of classification

4.2.6 EVALUATION METRICS

Precision, Recall and F-score The result of cross-validation is shown as a confusion matrix with n × n cells where n is the number of classes. The structure of the confusion matrix for the definition classification context is shown in Table 4.1. Precision and recall of both the definitions and non-definitions can be calculated on the basis of this contingency table (Formula (4.1)). Since in the context of definition classification the focus is on the classification of the definitions, the precision and recall are calculated for this class only. The F-score is calculated on the basis of the precision and recall in the same way as explained in Section 3.2.

\[
P_{def} = \frac{TP}{TP + FP} \qquad P_{non\text{-}def} = \frac{TN}{TN + FN}
\]
\[
R_{def} = \frac{TP}{TP + FN} \qquad R_{non\text{-}def} = \frac{TN}{TN + FP}
\tag{4.1}
\]

Accuracy The accuracy of a classifier is the proportion of instances that have been classified correctly without looking at how the classifier performs on the different classes. It can be calculated on the basis of the contingency table:

\[
A = \frac{TP + TN}{TP + FP + FN + TN}
\tag{4.2}
\]

Although the accuracy is a common way to measure how well a classifier performs, the fact that it is based on the classification results as presented in the contingency table has an important drawback: the figures in the contingency table do not reflect the exact classification scores for the individual instances. As a consequence, a definition that is classified ‘almost’ as a definition according to this score (e.g. a score of 0.45 where the threshold is 0.50) is dealt with in the same way as a definition having a very low score (e.g. a score of 0.05 with a threshold of 0.50). When evaluating a classifier, one would like to treat incorrectly classified instances close to the threshold differently from those that are far away from it.

AUC The AUC (area under the curve) score (Metz, 1978; Ling et al., 2003; Bradley, 1997) indicates how well a classifier performs on the basis of the scores obtained for each separate instance. The underlying ROC (receiver operating characteristic) curve plots the true positive (TP) rate of a classifier in the range [0, 1] as a function of the false positive (FP) rate in the domain [0, 1]. For a fully random classification, the ROC curve is a straight line connecting the origin to (1, 1); in this case, the classifier fails at providing good predictions. Any improvement over random classification results in an ROC curve at least partially above this straight line. The closer the curve is to the top left corner, the better it is. The ideal curve goes from (0, 0) to (0, 1) and then from (0, 1) to (1, 1); the AUC of this perfect curve is 1.0. The AUC is defined as the area under the ROC curve and is closely related to the ranking quality of the classification. An AUC score of 1 represents a perfect test, whereas an area of 0.5 or lower represents a worthless test. The AUC score is reported to be less sensitive to imbalanced data than measures like precision, recall and F-score (Fawcett, 2004). In addition, the AUC score says something about the accuracy of the complete classifier and not specifically about the classification performance on one of the classes, as is the case with precision, recall and F-score. A rough guide for classifying the accuracy of a classifier is:

0.5 – 0.6   no discrimination
0.6 – 0.7   poor
0.7 – 0.8   acceptable
0.8 – 0.9   excellent
0.9 – 1.0   outstanding
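As a toy illustration (computed here with scikit-learn; the per-instance scores are made up), note that the AUC is based on the classifier's per-instance scores, so a near-miss at 0.45 contributes differently than a confident error would:

```python
from sklearn.metrics import roc_auc_score

# Toy AUC computation on made-up per-instance scores: three
# definitions (label 1) and five non-definitions (label 0).
y_true   = [1, 1, 1, 0, 0, 0, 0, 0]
y_scores = [0.90, 0.80, 0.45, 0.55, 0.30, 0.20, 0.10, 0.05]
print(roc_auc_score(y_true, y_scores))  # 14 of 15 pairs ranked correctly, ~0.93
```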

This section showed that classifiers can be evaluated in several ways. The AUC score can be reported instead of the accuracy to give a better view of the quality of a classifier when imbalanced data sets are used. For the other metrics, the application determines the appropriateness of the method. In the case of question answering, the performance can be evaluated by giving the proportion of questions that are answered correctly. Since the corpus used for this application is generally large and only one answer is needed for each question, precision is more important in this context than recall. When definitions are used to build dictionaries, it is relevant that the definition is detailed enough and that the precision is high; since the corpus is large for this application as well, the recall is less important. When definitions need to be extracted in a glossary creation context, a number of aspects play a role. As has been emphasized earlier, it is especially important to have good recall scores, since the aim is to extract as many definitions as possible from each document. However, since the user needs to select definitions from the list of automatically extracted sentences, the precision needs to be acceptable as well.

4.2.7 PARAMETER TUNING

Many learning algorithms have parameters that can affect the outcome of learning. The parameter tuning component is concerned with finding the best settings for these parameters. The parameters to be tuned depend on the algorithm that is used. For example, a decision tree learner can have parameters that influence the amount of pruning, while a k-nearest-neighbor classifier has one that sets the neighborhood size. When an algorithm is adapted to deal with imbalanced data sets, the parameters of this adaptation need to be tuned as well. Related to the tuning of the algorithm parameters is the tuning of the features that are employed. For example, when n-grams are used, it can be useful to restrict the number of n-grams on the basis of information gain measures, as sketched below.
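A hedged sketch of such a feature restriction step, using mutual information as an information-gain style measure on random stand-in data (scikit-learn for illustration; the thesis experiments used Weka):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Keep only the 200 most informative of 1000 hypothetical bigram
# features, scored by mutual information with the class label.
rng = np.random.default_rng(0)
X_bigrams = rng.integers(0, 5, size=(300, 1000))  # sentence-by-bigram counts
y = rng.integers(0, 2, size=300)                  # definition / non-definition

selector = SelectKBest(mutual_info_classif, k=200)
X_reduced = selector.fit_transform(X_bigrams, y)
print(X_reduced.shape)  # (300, 200)
```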

4.3 RELATED RESEARCH

The previous section introduced a number of aspects that must be considered when one wants to apply machine learning techniques to the task of definition extraction. This section outlines the contributions of other researchers to this task. Since my approach aims to investigate which features are most relevant for definition classification, the overview focuses on the features that have been used.

Question answering In the field of question answering, the DefScriber system (Blair-Goldensohn et al., 2004) focuses on answering definitional questions of the type “what is X?” by finding genus-species statements (a subtype of the is definitions). The authors distinguish seven types of ‘definitional predicates’ (genus, species, target partition, cause, history, etymology and non-specific definitional fragments), of which DefScriber aims to find the genus, species and non-specific definitional fragments. A non-specific definitional fragment refers to information that is relevant in a detailed definition of the term. As this enumeration shows, DefScriber does not focus on finding definition sentences, but attempts to locate the relevant elements of a term, which are then summarized into one definition.

The DefScriber system uses a machine learning and a pattern-based approach to detect definitions. In the machine learning step, a rule-learning classifier (‘Ripper’ (Cohen, 1995)) and a boosting-based categorization algorithm (‘BoosTexter’ (Schapire and Singer, 2000)) have been used. The machine learning features are based on term frequency, the position of the sentence within the document, the presence of punctuation (to prevent the extraction of headers), and bag-of-words. With the machine learning approach, an accuracy of 81% was obtained. The second approach involves manually constructed high-precision parse patterns to match genus-species type sentences. With this method, a precision of 96% was obtained; the recall score is not mentioned. The most relevant aspect of DefScriber for my work is the set of features used, which are mainly based on positional information. The coverage of the high-precision patterns on our data is too small for the glossary creation context, since I focus on recall instead of precision.

Miliaraki and Androutsopoulos (2004) present a method (called DEFQA) to identify the best single-snippet answers to definition questions. As training data, 18,473 manually classified candidate answer snippets that include a search term have been used, of which only 3004 actually were definition snippets, which means that the data set is quite imbalanced. Their method integrates several types of features, some of which have been proposed by others. The features used include WordNet hypernym-hyponym relations (Prager et al., 2001, 2002), 13 lexical patterns (Joho and Sanderson, 2000, 2001), and 200 word n-grams (of size 1, 2 or 3) of phrases that occur directly before or after the definiendum. The classifier used is an SVM algorithm with a simple inner product (polynomial of first degree) kernel. For each question, a list of five answer snippets is shown (extracted from the web) on the basis of the highest definition scores. In this way, the approach gets around the imbalanced data problem. Evaluation shows that when all features are integrated, the correct answer is contained in the first five snippets for 72.50% of the TREC-2000 questions and 84.67% of the TREC-2001 questions.

Miliaraki and Androutsopoulos (2004) deal with the imbalanced data set problem in a way that cannot be adopted in the glossary creation context, since in the glossary context it is not known in advance which terms have to be included in the glossary.
Whereas the purpose of Miliaraki and Androutsopoulos (2004) is to find a correct definiens for a given definiendum, in the glossary creation context complete definitions have to be detected. Since the definiendum is not known in advance, it is not possible to show the first n definition sentences the system returns. In the glossary context, the system has to make a distinction between definitions and non-definitions, which is a completely different task than ranking a number of possible answers to definition questions. As a consequence, the imbalanced data set problem needs to be addressed in a different way. Another important difference with their approach is the importance of recall. Miliaraki and Androutsopoulos (2004) focus on precision, because they are not interested in obtaining all definitions: a single correct one suffices. Therefore, they want high-precision indicators. In the glossary creation context, where a glossary is extracted on the basis of one document, a good recall is crucial.

In a follow-up study on the work of Miliaraki and Androutsopoulos (2004), Androutsopoulos and Galanis (2005) propose a method to label instances automatically, because manual classification is a laborious task. They exploit online encyclopaedias and dictionaries (from http://www.encyclopedia.com/ and Google's ’define:’ feature) to tag training windows as definitions or non-definitions. Training instances in which the wording is very similar to that of the encyclopaedia definitions are tagged as definition windows (positive examples), while windows whose wording differs significantly from the encyclopaedia definitions are tagged as non-definitions (negative examples). The same classifier (SVM) and features are used as in Miliaraki and Androutsopoulos (2004). To evaluate the performance of the classifier, 81 target terms are used. Only the first snippet returned is considered, in contrast to Miliaraki and Androutsopoulos (2004), where five snippets were allowed. The classifier is able to provide a correct definition snippet for 58.02% of the target terms; applying the method from Miliaraki and Androutsopoulos (2004) to the same data results in only 14.81% correct answers.

The training corpus composed automatically by Androutsopoulos and Galanis (2005) only considers patterns from encyclopaedia and dictionary definitions. As a consequence, a restricted variety of patterns is addressed. In their experiments, this is acceptable, since their aim is to obtain a high precision while the recall is less important: one correct answer suffices. However, in the glossary creation application presented in this dissertation, the corpus from which the definitions need to be retrieved is smaller (one document instead of the web) and, as a consequence, a larger diversity of patterns needs to be addressed.

Glossary creation Kobyliński and Przepiórkowski (2008) describe an approach in which the Balanced Random Forest classifier is used to extract definitions from Polish texts. They compare the results obtained with this approach to results obtained in experiments on the same data in which grammars were used (Przepiórkowski et al., 2007b) and to results of experiments with standard classifiers (Degórski et al., 2008). The best results are obtained with the approach designed for dealing with imbalanced data sets. The differences with my approach are that (1) they used either only machine learning or only a grammar, and not a combination of the two, (2) they did not distinguish different definition types, and (3) they only used relatively simple features, such as n-grams. They combined the four definition types into one data set and obtained a precision of 0.21, a recall of 0.69, and an F-score of 0.33.

Borg et al. (2009) applied Genetic Algorithms to the extraction of English is definitions. Their experiments focus on assigning weights to a set of features for the identification of such definitions. These weights act as a ranking mechanism for the classification of sentences, providing a level of certainty as to whether a sentence is actually a definition or a non-definition. With this approach, a precision of 0.62 and a recall of 0.52 are obtained for the extraction of is definitions using a set of features such as ‘has keyword’ and ‘contains is a’. When they combined Genetic Algorithms with Genetic Programming, the precision was 1.0 with a recall of 0.51 and an F-score of 0.68.

Del Gaudio and Branco (2009) focus on the extraction of Portuguese is definitions. First, a simple grammar is used to extract all sentences in which the verb ‘to be’ is used as the main verb. Because their corpus is heavily imbalanced and only 10 percent of the sentences are definitions, they investigate which sampling technique gives the best results and present results from experiments that seek optimal solutions for this problem. From their experiments, the SMOTE technique appears to be the best solution for dealing with imbalanced data sets in definition extraction. With this technique, they obtained an F-score of 0.74 for the classification of is definitions. Since their experiments were carried out after my research was finished, the SMOTE technique has not been implemented in the classifier used for my experiments; the bagging approach that has been used in my experiments was missing from their comparison.

4.4 MACHINE LEARNING FOR GLOSSARY CREATION

As shown in the previous section, machine learning has been applied to the task of definition extraction for different purposes and in different ways. In my experiments, the focus is on improving the precision obtained with the pattern-based approach. To this end, experiments have been carried out in which a diversity of features is examined and compared. My investigation is restricted to document-based characteristics, which means that external sources (like WordNet or online dictionaries) have not been considered. The description of my experiments is structured according to the diagram from Kotsiantis (2007) presented in Figure 4.1.

4.4.1 IDENTIFICATION OF DATA

The machine learning step has been preceded by the pattern-based extraction of definitions described in Chapter 3. To develop and evaluate this approach, 600 definitions were manually annotated. The sentences matched by the regular expressions contained in the grammars have been used as data sets for the machine learning experiments.

              Definitions   Non-definitions
is                   35.0              65.0
verb                 44.1              55.9
punctuation           6.7              93.3
pronoun               9.2              90.8
all                  15.8              84.2

Table 4.2 The proportions of definitions and non-definitions

The precision scores obtained with the local grammars differ considerably per definition type. For the punctuation and pronoun types, the precision is low, which means that only a minority of the extracted sentences are definitions, whereas the majority are non-definitions. These are typical examples of the imbalanced data sets introduced in Section 4.2.4. Table 4.2 shows the degree of imbalancedness for each of the data sets and reveals that it is especially high for the punctuation and pronoun sentences, for which less than 10 percent of the extracted sentences are definitions. For the is definitions, the data set is imbalanced as well. The low precision is the main reason for applying machine learning techniques to the results of the pattern-matching step.

4.4.2 DATA PRE-PROCESSING

The data set consists of sentences in the LT4eLAna XML format (Appendix A) that have been enriched with some additional elements and attributes. The additional information is necessary to enable the inclusion of a variety of extra features in the machine learning data file that are not contained in the original LT4eLAna XML. An extra element has been added to encode the connector part of the definition. At the level of the tokens, a number of additional attributes are included; these are used to store position information, keyword information, and information on the type of connector phrase used. The position information (of definitions and definienda) has been derived from the original XML documents using an XSL script and has been included at the sentence level in the final XML data set. The keyword scores are calculated by the LT4eL keyword extractor (Lemnitzer et al., 2007) and have been merged into the final XML document using a Python script. The type of connector phrase is defined by the grammar and has been included as an attribute of the connector element.

The machine learning package used for the experiments is the Weka tool (Witten and Frank, 2005). Since XML is not allowed as an input format in Weka, the data sets have been converted into a different format. Formats supported by Weka are csv and arff (Witten and Frank, 2005). For the experiments, the XML data files have been converted to arff files; Appendix C contains an example arff file. An XSL script has been used to create the arff files on the basis of the XML files.
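A hedged sketch of the keyword-merging step mentioned above; the element and attribute names ("tok", "kw_score") are hypothetical, since the actual LT4eLAna tag set and scripts are not reproduced here.

```python
import xml.etree.ElementTree as ET

# Write externally computed keyword scores back onto the token
# elements of an annotated document. Element and attribute names
# ("tok", "kw_score") are hypothetical placeholders.
def merge_keyword_scores(xml_path, scores, out_path):
    tree = ET.parse(xml_path)
    for tok in tree.iter("tok"):
        word = (tok.text or "").strip().lower()
        if word in scores:
            tok.set("kw_score", str(scores[word]))
    tree.write(out_path, encoding="utf-8")

# Tiny stand-in document so the sketch is self-contained.
with open("doc.xml", "w") as f:
    f.write("<doc><s><tok>Een</tok><tok>definitie</tok></s></doc>")
merge_keyword_scores("doc.xml", {"definitie": 0.83}, "doc_kw.xml")
```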

4.4.3 FEATURE SELECTION

The data sets and literature have been investigated to identify a set of features that might be relevant for the classification of definitions. In this thesis, the focus is purely on information contained in the documents themselves, without taking external resources (like WordNet or dictionaries) into account. The aim is to investigate which information in the document is relevant for distinguishing between definitions and non-definitions.

On the basis of the linguistic annotation of the documents, bigrams of part-of-speech tags and morpho-syntactic information can be created and used for training. This is a naive way of collecting information on definitions, which simply considers any adjacent combination of two tags that appears in the sentences and compares whether there are differences between definitions and non-definitions in this respect.

The connector phrase was one of the key elements of the pattern-based approach. To investigate whether this information can be relevant in the machine learning experiments as well, several aspects related to the connector have been considered. Within the connector phrases that are typical for the four definition types, a more fine-grained division is often possible (described in Section 2.3). For example, over 30 different connector phrases exist for the verbs. Another characteristic that has been investigated in this respect concerns the linguistic categories of the left and right context of the connector.

The pattern-based approach is based largely on linguistic information contained in definitions. However, analysis of the data reveals that there are at least two linguistic characteristics that were hard to address with a pattern-based approach, but can be included in the machine learning settings. The first problematic issue concerns the frequencies of the different patterns in definitions and non-definitions. The other problem involves the way in which the linguistic categories and syntactic phrases are combined in definitions and non-definitions. Compared to the linguistic bigrams, these two characteristics provide more insight into the use of specific linguistic properties. An additional advantage over the bigrams is that the amount of information needed for training and classification is much smaller. This increases the speed of the training and classification process.

Another feature of texts that may be relevant is related to the position of definition and definiendum within the document. Fahmi and Bouma (2006) used the position of sentences within Wikipedia articles as a feature. Over 60% of their definitions are the first sentence of the article. Although the position at first sight does not seem to play such a crucial role in the corpus documents, Blair-Goldensohn et al. (2004) showed that position information can be relevant in less structured documents as well.

To signal that a word or phrase is relevant, it is possible to use a different layout for these parts. Since definitions contain important information, I investigated whether specific layout features occur more often in definitions than in non-definitions.

The relevance of a phrase in a text can be expressed by its keyword score. The more important a term is within a text, the higher the keyword score. I examined whether this information can be used as a feature to distinguish definitions from non-definitions.
Summarizing, the features that have been used can be divided into six categories:

1. n-gram properties: these include bigrams of part-of-speech tags and morpho-syntactic information.

2. Connector properties: this category contains information on the connector. The connector itself is one of the features and the other features include information on the left and right context of the connector.

3. Linguistic properties of definiendum and definiens: in this category, linguistic information on specific words or phrases of the definiendum and definiens is included, such as the type of noun used at the beginning of the definiens.

4. Position properties: these include several features that give information on the position within the document where the definition and definiendum are used.

5. Layout properties: this category deals with the layout of definitions.

6. Keyword properties: this category contains features on the importance of the definiendum within the document.

The remainder of this section presents an extensive comparison of the definitions and non-definitions with respect to these six properties. Since I wanted to discover to what extent the properties depend on the definition type, the comparisons have been made for each of the four definition types separately.

4.4.3.1 n-Gram properties

In many text classification tasks, n-grams have been used for predicting the correct class (cf. Bekkerman and Allan (2003); Tan et al. (2002)). For the classification of definitions, two types of bigrams based on linguistic information are considered: the part-of-speech tag (PoS) bigrams and the morpho-syntactic information (MSI) bigrams. The MBT tagger that has been used for tagging the corpus documents distinguishes ten PoS tags (Daelemans et al., 1996b): adjective (Adj), adverb (Adv), article (Art), conjunction (Conj), interjection (Int), noun (N), numeral (Num), preposition (Prep), pronoun (Pron), and verb (V). In addition, the tag miscellaneous (Misc) is used for unknown words and punctuation (Punc) is used for punctuation characters. Within the tags, a more fine-grained categorization results in a large number of subcategories. For nouns, for example, plural and singular forms are distinguished and a division is made into proper and common nouns. Such fine-grained categorization information is included in the morpho-syntactic information (MSI) attribute. Combining PoS and MSI information results in 215 different word types used in the corpus, of which 175 have been used in the machine learning data sets.

Words               PoS        MSI
Een_nominalisatie   Art_N      Art(onbep,zijdofonzijd,neut)_N(soort,ev,neut)
nominalisatie_is    N_V        N(soort,ev,neut)_V(hulpofkopp,ott,3,ev)
is_een              V_Art      V(hulpofkopp,ott,3,ev)_Art(onbep,zijdofonzijd,neut)
een_woord           Art_N      Art(onbep,zijdofonzijd,neut)_N(soort,ev,neut)
woord_dat           N_Pron     N(soort,ev,neut)_Pron(betr,neut,zelfst)
dat_...             Pron_...   Pron(betr,neut,zelfst)_...(...)
..._...             ..._...    ..._...
..._.               ..._Punc   ..._Punc(punt)

Table 4.3 An example of a bigram sequence of PoS tags (PoS) or PoS tags and morpho-syntactic information (MSI)

Table 4.3 shows an example bigram sequence for the two bigram types. It shows that for all sentences, each combination of two adjacent words is extracted until the end of the sentence is reached. The bigrams thus do not contain information on the position within the sentence. Whereas word bigrams have been used for the classification of definitions by several researchers (cf. Miliaraki and Androutsopoulos (2004); Androutsopoulos and Galanis (2005); Fahmi and Bouma (2006); Westerhout and Monachesi (2008)), less research has been carried out on the use of PoS and MSI bigrams. Only within the scope of the LT4eL project have experiments been started in which this information is investigated (cf. Kobyliński and Przepiórkowski (2008); Degórski et al. (2008); Westerhout (2009a); Del Gaudio and Branco (2009)). In the experiments discussed in this thesis, both PoS bigrams and MSI bigrams have been used. Experiments in which a combination of n-grams and the structure of definitions has been taken into account are presented in Westerhout (2009a).

bigram                                                 definitions   non-definitions
is definitions
  Art(onbep,zijdofonzijd,neut)_N(soort,ev,neut)           4.49            2.24
  V(hulpofkopp,ott,3,ev)_Art(onbep,zijdofonzijd,neut)     3.11            1.58
verb definitions
  Art(bep,zijdofmv,neut)_N(soort,ev,neut)                 2.73            3.87
  Art(onbep,zijdofonzijd,neut)_N(soort,ev,neut)           2.48            1.95
punctuation definitions
  N(soort,ev,neut)_Punc(komma)                            2.62            1.51
  N(soort,ev,neut)_Prep(voor)                             3.51            2.25
pronoun definitions
  N(soort,ev,neut)_Adv(gew,geenfunc,stell,onverv)         0.81            1.95
  Prep(voor)_Art(bep,zijdofmv,neut)                       2.51            2.05

Table 4.4 The use of MSI bigrams in definitions and non-definitions (in %)

The proportions of bigrams have been compared for the definition and non-definition classes. This comparison reveals that there are a number of differences between the two classes, especially for the MSI bigrams. Table 4.4 shows that the differences depend on the definition type. Appendix D provides the top ten PoS and MSI bigrams for the definitions and non-definitions for each data set. In the experiments, the PoS and MSI bigrams of each sentence are used as features to find out to what extent they are suitable for distinguishing definitions from non-definitions. The settings in which the bigrams have been used are referred to as the 'bigram settings'.
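As a concrete illustration of the extraction shown in Table 4.3, the following minimal Python sketch turns a sequence of tags into its bigram features. The function is an assumption about the implementation, not the code used in the experiments; the example tags follow Table 4.3, and the same function works for the MSI tag strings.

    # A minimal sketch of the bigram extraction illustrated in Table 4.3:
    # every pair of adjacent tags in a sentence becomes one bigram feature.

    def tag_bigrams(tags):
        """Return adjacent tag pairs, joined with an underscore."""
        return ["%s_%s" % (left, right) for left, right in zip(tags, tags[1:])]

    pos_tags = ["Art", "N", "V", "Art", "N", "Pron"]
    print(tag_bigrams(pos_tags))
    # ['Art_N', 'N_V', 'V_Art', 'Art_N', 'N_Pron']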

4.4.3.2 Connector properties

Definitions are distinguished on the basis of the connector. Since the corpus definitions contain four types of connectors, they are categorized in four groups. In the is definitions the connector is always the same, but in the other categories a variety of connector phrases can be used. Within the verb definitions, 36 different words or phrases are used to connect definiendum and definiens. In the punctuation definitions, two types of connectors are identified and in the pronoun definitions eight types can be distinguished.

Several pattern-based approaches have implemented a number of phrases (like 'is a', 'consist of', 'defines') that connect definiendum and definiens (cf. Muresan and Klavans (2002); Velardi et al. (2008)). A disadvantage is that these patterns are generally related to a restricted set of connectors and do not cover all definitions. On the basis of the pattern-based approach, an additional drawback has been discovered: connector phrases can often be used in non-definitions as well. In machine learning experiments, connector information is sometimes included implicitly in the bigrams (Androutsopoulos and Galanis, 2005; Fahmi and Bouma, 2006). The connector patterns have been used explicitly as well. In their genetic algorithm experiments, Borg et al. (2009) used the presence of the phrase 'is a' as a binary feature. Androutsopoulos and Galanis (2005) indicated whether the definitions contain one of the manually constructed patterns from Joho and Sanderson (2001) and Miliaraki and Androutsopoulos (2004).

Whereas all previous researchers (cf. Velardi et al. (2008); Borg et al. (2009); Androutsopoulos and Galanis (2005)) dealt with the connector phrase as one fragment (e.g. 'is a'), the connector and its context are addressed separately in my experiments. The first connector feature looks purely at the connector phrase without taking any context into account. The other four connector features contain information about the linguistic categories of the words that appear directly to the left and right of the connector phrase. More specifically, these are the PoS tag of the word to the left, the PoS tag of the word to the right, the morpho-syntactic information of the word to the left, and the morpho-syntactic information of the word to the right. The largest diversity of connector patterns is observed in the verb definitions.

connector       definitions   non-definitions
use to / for       19.23          11.68
consist of         10.90          10.66
mean                7.69           7.11
call                7.05           2.03
contain             5.13          24.87
define as           5.13           0
stands for          4.49           0.51

Table 4.5 Verbal connector phrases in the verb data set (in %)

Table 4.5 shows that in this data set there are several differences between the definitions and non-definitions. For example, the phrases 'call' and 'define as' are used considerably more often in definitions, whereas the verb 'contain' generally points to a non-definition. Appendix E provides a complete overview of the different connector phrases that have been used in the verb, punctuation and pronoun definitions. Summarizing, five features related to the connector information have been included in the experiments:

• Connector phrase

• PoS tag of the word to the left of the connector phrase

• PoS tag of the word to the right of the connector phrase

• morpho-syntactic information of the word to the left of the connector phrase

• morpho-syntactic information of the word to the right of the connector phrase

The setting in which these five features are combined is called the ‘connector setting’.
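A minimal sketch of how these five features could be assembled is given below. The (word, PoS, MSI) token triples and the 'NONE' padding value are a hypothetical representation chosen for illustration, not the actual LT4eLAna document model.

    # A minimal sketch of the five features in the 'connector setting'.

    def connector_features(tokens, start, end):
        """tokens: list of (word, pos, msi) triples; start/end delimit
        the connector phrase (end is exclusive)."""
        left = tokens[start - 1] if start > 0 else ("", "NONE", "NONE")
        right = tokens[end] if end < len(tokens) else ("", "NONE", "NONE")
        return {
            "connector": " ".join(w for w, _, _ in tokens[start:end]),
            "pos_left": left[1], "msi_left": left[2],
            "pos_right": right[1], "msi_right": right[2],
        }

    sentence = [("Een", "Art", "Art(onbep)"), ("RAM", "N", "N(soort)"),
                ("is", "V", "V(hulpofkopp)"), ("een", "Art", "Art(onbep)"),
                ("geheugen", "N", "N(soort)")]
    print(connector_features(sentence, 2, 3))
    # {'connector': 'is', 'pos_left': 'N', 'msi_left': 'N(soort)',
    #  'pos_right': 'Art', 'msi_right': 'Art(onbep)'}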

4.4.3.3 Linguistic properties of definiendum and definiens

The linguistic features are used to investigate how linguistic information differs in definitions and non-definitions. There are at least two important linguistic characteristics that have not been taken into account within the pattern-based approach and can be addressed with machine learning. The first problematic issue concerns the frequencies of the different patterns in definitions and non-definitions. For example, in the is data set, a definiendum with an indefinite article is more common in definitions (24.5%), but it can be used in non-definitions as well (8.1%). The other problem involves the way in which the linguistic categories and syntactic phrases are combined in definitions and non-definitions. The patterns either allow or prohibit combinations but do not make a distinction between common and uncommon combinations.

The linguistic features that have been investigated can be divided into features related to the definiendum and features related to the definiens. For the definiendum, the types of articles, adjectives and nouns used have been considered. The linguistic characteristics of the definiens that have been examined are the articles, adjectives, and relative clauses.

Article in definiendum Fahmi and Bouma (2006) investigated for is definitions whether there is a connection between the type of article (definite, indefinite, other) in the definiendum and the class of sentences (definition or non-definition). In their data set, which consists of is definition candidates extracted from Wikipedia, the majority of subjects in definition sentences do not contain an article (63%), while in non-definition sentences the use of definite articles is most common (50%). Although our document collection consists of texts that are less structured than Wikipedia documents, there are several similarities between the is sentences from Wikipedia and the definition corpus.

Table 4.6 shows the articles used in the definienda for each of the four types of definitions. For the is definitions, there are similarities and differences with the figures for the Wikipedia data from Fahmi and Bouma (2006). The proportion of definite articles in non-definitions is high in both data sets. An important difference between the two data sets is the proportion of definite and indefinite articles in the definitions: in the LT4eL definitions, the proportion of indefinite articles is much higher, whereas the proportion of definite articles is only half as high.

                        is              verb         punctuation       pronoun
                     D      ND       D      ND       D      ND       D      ND
article
- indefinite       24.53   8.11   17.31   10.66    7.07    7.87    6.49   16.18
- definite, 'de'    3.14  27.36   18.59   27.41    3.03   13.70    6.49   12.83
- definite, 'het'   9.43  13.85   10.90   16.75    2.02    6.71    2.60    6.95
no article, noun
- common, sg       10.06  14.86   18.59   11.68   32.32   29.37   12.99   14.97
- common, pl        8.81   6.08   14.10    6.09   14.14   22.01    5.19   16.18
- proper, sg       43.40  28.38   18.59   26.90   39.39   19.68    2.60    2.27
- proper, pl        0      0.68    0       0       0       0.07    0       0
other               0.63   0.68    1.92    0.51    2.02    0.58    3.90    6.69
no definiendum                                                    59.74   23.93

Table 4.6 The use of articles in the definiendum of definitions (D) and non-definitions (ND) (in %). For definienda lacking an article, the type of noun is provided.

Since Fahmi and Bouma (2006) do not make a more fine-grained categorization within the 'no article' group, it is not possible to compare the data sets in this respect. In the LT4eL data, both in definitions and non-definitions the majority of definienda lacking an article contain proper nouns.

The differences observed for the is definitions are not seen to the same extent in the verb, punctuation and pronoun definitions. In the verb data set, the use of definite articles is more common than that of indefinite articles. Just as for the is definitions, the proportion of indefinite articles is considerably higher in the definitions (17% versus 11%). For the punctuation patterns, the most notable result is the low proportion of definienda in definitions containing an article, which is as low as 12%. Furthermore, the high proportion of proper nouns in the definitions (38%) is notable. This is similar to the amount of proper nouns used in is definitions.

Noun in definiendum Another type of information that seems to be useful for the identification of is definitions is the distinction between proper and common nouns. Fahmi and Bouma (2006) found that proper nouns are more common in the definiendum of definitions whereas common nouns are used more often in non-definitions. Since the MBT tagger does not provide more detailed information about the type of proper noun (e.g. person, location), it is not possible to make a more fine-grained distinction. The proper and common nouns can be either singular or plural.

                        is              verb         punctuation       pronoun
                     D      ND       D      ND       D      ND       D      ND
common noun
- singular         37.11  55.74   53.85   56.35   41.41   50.87   25.97   46.12
- plural            8.18   9.46   14.10   12.18   16.16   25.22    7.79   19.39
proper noun
- singular         54.09  33.45   28.85   29.95   40.40   23.25    2.60    3.74
- plural            0      0.68    0       0.51    0       0       0       0
other               0.63   0.68    3.21    1.02    2.02    0.66    3.90    6.82
no definiendum                                                    59.74   23.93

Table 4.7 The use of nouns in the definiendum of definitions (D) and non-definitions (ND) (in %)

Table 4.7 indicates for each of the four definition categories how often the different types of nouns are used in the definiendum. The comparison reveals that the is and punctuation definitions more often contain proper nouns in the definiendum than the non-definitions of these types. In the verb data set, the different types of nouns are equally represented in the two classes. The number of pronoun definitions containing a definiendum in the same sentence is low. When the definitions missing a definiendum are ignored, the differences between definitions and non-definitions are small. The relatively low proportion of proper nouns compared to the other definition types is most striking.

Adjective in definiendum The type of adjective used in the definiendum has not been investigated in previous research. The motivation to include this feature is based on the observation that a comparative or superlative adjective in the definiendum typically indicates that a sentence is not a definition. Although none of the definitions from the LT4eL corpus contains a comparative or superlative adjective, this does not mean that they cannot be used in definitions at all. Since the grammars had to be as general as possible to obtain a high recall, this restriction has not been implemented there.

                        is              verb         punctuation       pronoun
                     D      ND       D      ND       D      ND       D      ND
no adjective       93.08  80.07   89.74   86.29   82.83   83.67   29.87   62.43
adjective
- positive          6.29  15.88    8.97   11.17   16.16   14.29    9.09   11.63
- comparative       0      0.34    0       0       0       0.66    0       0.27
- superlative       0      3.38    0       1.52    0       0.44    0       0.53
adverb              0.63   0.34    1.28    1.02    1.01    0.95    1.30    1.20
no definiendum                                                    59.74   23.93

Table 4.8 The use of adjectives in the definiendum of definitions (D) and non-definitions (ND) (in %)

Table 4.8 shows that adjectives in the definiendum are most common in punctuation definitions (17.2%) whereas they are used less in the is definitions (6.9%). The biggest difference between definitions and non-definitions is observed within the is data set, where the frequency of adjectives is considerably higher in the non-definitions. As expected, comparative and superlative adjectives have not been used in definitions. However, they are rare in non-definitions as well. The positive adjective is the most common category in both the definitions and the non-definitions. This is the same in the complete LT4eL corpus, where 91.6% of the adjectives are positive. The adverb category contains words tagged as adverbial adjectives, which can be used either as adverb or as adjective (e.g. 'technologisch' (technological) and 'automatisch' (automatic)). Most of them should have been tagged as positive adjectives instead of adverbial adjectives.

Article in definiens Fahmi and Bouma (2006) investigated the use of articles at the beginning of the definiens of is sentences and discovered that the majority of predicative complements begins with an indefinite article in definitions, while in non-definitions most articles are definite. I compared the definitions and non-definitions of the different data sets to find out whether this difference is observed in these data as well. Table 4.9 shows that in our is data set, just as in the Wikipedia data, the vast majority of articles at the beginning of the definiens are indefinite, whereas in the non-definitions this is only the case for a minority of the sentences.

                        is              verb         punctuation       pronoun
                     D      ND       D      ND       D      ND       D      ND
article
- indefinite       67.92  29.05   12.82   12.69   12.12    2.55    9.09    4.41
- definite, 'de'    9.43  17.91   14.10   12.69    3.03    5.10    5.19   14.57
- definite, 'het'   6.29  10.81   12.18    8.63    5.05    4.59    2.60   19.52
no article, noun
- common, sg        1.89  17.23   11.54   12.18   17.17   20.63   41.56   23.13
- common, pl        8.18   4.39   21.79   14.72   16.16    7.87   25.97   13.37
- proper, sg        1.89   6.76    5.77    1.02    0       3.06    3.90    3.88
- proper, pl        0      0       0       0       0       0.15    0       0
adverb              3.14   5.74    3.85   12.18   12.12   12.17    1.30    2.67
other               1.26   8.11   17.95   25.89   34.34   43.88   10.39   18.45

Table 4.9 The use of articles at the beginning of the definiens in definitions (D) and non-definitions (ND) (in %)

An important difference with the Wikipedia data set involves the use of definite articles in the initial noun phrase of the definiens of definitions, which is considerably higher in the data from Fahmi and Bouma (2006). Another difference is the proportion of non-definitions lacking an article, which is much higher in our data.

For the other definition types, the differences with respect to the type of article are smaller than for the is sentences. In the verb data set, the definiens of definitions more often begins with a definite article than the definiens of non-definitions (26% vs 21%) whereas the proportion of indefinite articles is the same (13%). In definitions, it is more common to start the definiens with a noun than in non-definitions (39% vs 28%). In the remaining sentences, the definiens begins with a different phrase, like an adverb, pronoun or verb. This is more often observed in non-definitions. The fact that verb definitions are more diverse makes it more complicated to extract this feature properly. In addition to the pattern 'definiendum - connector - definiens' there are more complex sentences as well. An example of a more complex verb definition is:

(13) Met de regelafstand wordt de ruimte tussen de regels bedoeld.
     by the line distance is the space between the lines meant
     'Line spacing refers to the space between the lines.'

The connector phrase is split into two parts ('wordt' (is) and 'bedoeld' (meant)) and the definiens is contained between these two parts. Due to the greater diversity of patterns, it is less straightforward to detect the article properly. As a consequence, the extraction results contain some errors, which have not been corrected.

Compared to the is and verb definitions, relatively few definientia in punctuation sentences begin with an article. Within this data set, the main difference between definitions and non-definitions concerns the use of indefinite articles, which is observed more frequently in definition sentences (12% versus 3%). In many punctuation sentences, the definiens does not begin with a noun phrase. Common word categories observed at the beginning of the definiens are adverbs, pronouns and verbs.

As described in Section 2.3.4, a number of different types are distinguished within the pronoun patterns. The article feature is only relevant for the pronoun-is and pronoun-verb sentences, since in the other types the definiens does not begin with a noun phrase. In the pronoun-is and pronoun-verb sentences, the definiens part of definitions generally begins with a noun, whereas in the definiens of non-definitions, articles are more common.

Adjective in definiens The next feature that has been examined is the adjective that is used in the first noun phrase of the definiens. Table 4.8 has shown that the adjective in the definiendum is used differently in definitions and non-definitions of the is data set. This feature investigates whether there is a difference with respect to the use of adjectives in the initial noun phrase of the definiens as well. Table 4.10 shows that adjectives in the first noun phrase of the definiens are most common in is definitions. For the three other definition types, the frequencies are lower.

                        is              verb         punctuation       pronoun
                     D      ND       D      ND       D      ND       D      ND
no adjective       77.36  78.72   88.46   85.79   88.89   89.80   88.31   84.49
adjective
- positive         21.38  18.58   10.90   12.69   11.11    9.55   10.39   13.64
- comparative       0.63   1.69    0       0       0       0.29    0       0.67
- superlative       0.63   0.34    0       1.52    0       0.07    1.30    0.40
adverb              0      0.68    0.64    0       0       0.29    0       0.80

Table 4.10 The use of adjectives in the definiens in definitions (D) and non-definitions (ND) (in %)

Whereas for the adjectives in the definiendum there is a difference between the definitions and non-definitions, this difference is not observed for the adjectives in the definiens. Again, by far the most common category for all types in both definitions and non-definitions is the positive adjective.

Relative clause Manual inspection of the is data set revealed that is definitions often contain a relative clause in which a more detailed analysis of the definiendum is provided. Muresan and Klavans (2002) noticed this as well and used this information in their pattern-based extraction approach. In terms of genus-species definitions, the relative clause contains the differentia that are typical for the species (the definiendum) and are needed to distinguish it from other species within the same genus. An example of such a definition is:

(14) Een vette letter is een letter die zwarter wordt afgedrukt dan de andere letters.
     a bold letter is a letter that blacker is printed than the other letters
     'A bold letter is a letter that is printed blacker than the other letters.'

In this example, the species 'bold letter' belongs to the genus 'letters'. It is differentiated from other letters by adding a relative clause which tells that such a letter is printed blacker than the other letters. Based on this observation, a feature has been included that considers whether or not the sentence contains a relative clause. Relative clauses can begin with five types of elements (Coppen et al., 2002), two of which are observed in definitions. More specifically, these are the relative pronouns ('dat', 'die') and the pronominal adverbs (e.g. 'waarmee', 'waarbij'). A third type of element, which has not been mentioned by Coppen et al. (2002), consists of the words 'om' (to) and 'voor' (for). Example 15 shows that all three types (relative pronoun, pronominal adverb, and preposition, including conjunction) can be used to convey the same message:

(15) a. PowerPoint is een programma dat gebruikt kan worden voor het maken van presentaties.
        PowerPoint is a program that used can be for it making of presentations
        'PowerPoint is a program that can be used for making presentations.'

     b. PowerPoint is een programma waarmee je presentaties kunt maken.
        PowerPoint is a program with which you presentations can make
        'PowerPoint is a program that allows you to make presentations.'

     c. PowerPoint is een programma om presentaties mee te maken.
        PowerPoint is a program to presentations with to make
        'PowerPoint is a program for making presentations.'

Table 4.11 shows the differences between definitions and non-definitions for the three types of relative clauses. The first part of the table is about the clauses starting with a relative pronoun (relative clauses), the second part provides information on pronominal adverbs, and the third part indicates how often preposition phrases are used. As can be seen in the last row, the is definitions contain the most relative clauses (74.2%). The difference between definitions and non-definitions with respect to the proportion of sentences that contain a relative clause is almost 50%.

                        is              verb         punctuation       pronoun
                     D      ND       D      ND       D      ND       D      ND
relative pronoun
- 'die' (who)      22.01   9.80   11.54   13.20   10.10    3.13   15.58    4.68
- 'dat' (that)     18.87   4.39    4.49    3.05    3.03    1.09   11.69    4.14
pron. adverb
- 'waarmee'         6.92   0.68    0       2.03    1.01    0.07    2.60    0.13
- 'waarbij'         4.40   1.01    0.64    0       0       0.36    3.90    0.67
- 'waarin'          2.52   1.01    1.92    1.52    3.03    0.29    5.19    1.07
- 'waarvan'         0.63   0.34    0.64    0.51    0       0       0       0
preposition
- 'om' (to)        11.32   3.72    5.13    4.06    5.05    0.87    2.60    2.67
- 'voor' (for)      7.55   5.41    1.92    5.08   11.11    2.55    6.49    4.28
none               25.79  73.65   73.72   70.56   66.67   91.62   51.95   82.35

Table 4.11 The use of relative clauses in the definiens of definitions (D) and non-definitions (ND) (in %)

In the punctuation and pronoun data sets, the definitions contain respectively 25% and 30% more relative clauses than the non-definitions. The only data set for which this difference is not observed is the verb data set, in which the non-definitions contain slightly more relative clauses.

Within the is, punctuation and pronoun data sets, relative pronouns are used three times more often in definitions than in non-definitions. The difference is even bigger for the relative clauses starting with a pronominal adverb, where definitions contain four times more relative clauses than non-definitions. For the clauses starting with a preposition, the differences depend on the definition type. The largest differences are observed within the is and punctuation data sets, in which the definitions contain more relative clauses than the non-definitions. For the pronoun sentences, there is only a difference with respect to the use of clauses starting with 'voor', which is observed more in definitions than in non-definitions. The situation is quite different for the verb data set, in which the three types of relative clauses are used equally often in definitions and non-definitions.
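The sketch below illustrates how such a relative clause feature could be extracted. It checks only for the trigger words discussed above inside the definiens and is therefore a simplification: real clause detection would need syntactic context to rule out other uses of these words.

    # A minimal sketch of the relative clause feature, based on the
    # trigger word classes from Table 4.11.

    RELATIVE_PRONOUNS = {"dat", "die"}
    PRONOMINAL_ADVERBS = {"waarmee", "waarbij", "waarin", "waarvan"}
    PREPOSITIONS = {"om", "voor"}

    def relative_clause_type(definiens_words):
        """Return the type of relative clause trigger found, or 'none'."""
        for word in (w.lower() for w in definiens_words):
            if word in RELATIVE_PRONOUNS:
                return "relative_pronoun"
            if word in PRONOMINAL_ADVERBS:
                return "pronominal_adverb"
            if word in PREPOSITIONS:
                return "preposition"
        return "none"

    print(relative_clause_type(
        "een programma waarmee je presentaties kunt maken".split()))
    # pronominal_adverb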

4.4.3.4 Position properties

Blair-Goldensohn et al. (2004) distinguish two types of position information in their work on the extraction of definitions. The first one is the 'term concentration', that is, the frequency of the term for which a definition needs to be found within a sentence and the nearby context. The other is related to the relative and absolute position of a sentence in a document and is based on the observation that important information is often found towards the top of documents. A specific type of documents are Wikipedia articles, in which the first sentence generally is a definition. Fahmi and Bouma (2006) used this observation and included a binary feature on the sentence position that indicates whether or not a definition candidate is the first sentence of the document. This feature is less useful in our documents, which are not as structured as Wikipedia articles. In addition to being less structured, our texts are often longer and generally contain definitions on several aspects of a topic. For example, a document on XML can include definitions for related terms as well, such as XML Schema, RDF, and HTML. A document from the LT4eL corpus contains on average 13.4 definitions, so the first position feature would not be very informative.

A number of position features have been investigated. These concern the relative and absolute position of sentences within the paragraph, the start position of definition candidates within the sentence, and the relative and absolute position of the definiendum compared to other occurrences of the term in the document. None of these features has been investigated in research on definition extraction before.

Position in paragraph The corpus documents consist of paragraphs that are divided into sentences. A paragraph is a 'distinct passage or section of a text, usually composed of several sentences, dealing with a particular point, a short episode in a narrative, a single piece of direct speech, etc. Paragraphs [...] are now usually indicated by a line break, often with an indent at the beginning of the new line.'1 Each paragraph can thus be considered as a separate block of information. One would expect definitions to appear more at the beginning of such a block than non-definitions.

1 Source: Oxford English Dictionary (2009).

In order to assess whether this intuition is correct, the length of the paragraph and the position of the sentence within this paragraph are relevant. Three types of position features are therefore included:

• length of the paragraph (LP)

• absolute position of sentence in the paragraph (AP)

• relative position of sentence in the paragraph (RP)

Length of the paragraph As will be shown in the section on layout properties (Section 4.4.3.5), the majority of definitions appear in a paragraph. However, since this is not the case for all definitions (they can be used in list items or tables as well), it is necessary to include all paragraph types.

               D      ND      t       Sig
is            4.92   4.56    0.79     0.22
verb          4.85   4.13    1.69    <0.05
punctuation   2.77   3.61   -2.98    <0.01
pronoun       5.25   4.75    1.11     0.13
all           4.52   4.08    2.12    <0.05

Table 4.12 The length of paragraphs containing definitions (D) and non-definitions (ND) (in sentences)

Table 4.12 shows that the average length of paragraphs differs for each data set. For the is, verb and pronoun sentences, the paragraphs with a definition are on average longer than the paragraphs containing a non-definition. For the punctuation sentences, the situation is different. As will be shown for the layout features, punctuation sentences are more often observed in list items than the other definition types. Nearly half of the paragraphs containing a punctuation definition consist of one sentence. The differences in length between definitions and non-definitions are only significant for the verb, punctuation, and combined data sets.

Absolute position The position within the paragraph can be measured as an absolute or a relative property. This feature considers the absolute position (AP), by which the position of the sentence within the paragraph is meant.

               D      ND      t       Sig
is            2.02   2.53   -2.20    <0.05
verb          2.07   2.37   -1.25    <0.05
punctuation   1.64   2.41   -5.17   <0.001
pronoun       2.79   2.83   -0.13     0.45
all           2.08   2.54   -4.32   <0.001

Table 4.13 The average position of definitions (D) and non-definitions (ND) within the paragraph (in sentences)

Definitions are used earlier within the paragraph than non-definitions (Table 4.13). For three of the four definition types and the combined data set, there is a significant difference between the absolute position of definitions and non-definitions (is, verb, and punctuation definitions). There is no significant difference for the pronoun patterns. This can be related to the fact that pronoun definitions are often spread over two sentences, of which only the second is addressed by the extraction method.

Relative position In addition to the absolute position of a sentence, a score on the relative position (RP) has been included as well. This score relates the absolute position (AP) to the length of the paragraph (LP). The intuition is that definitions are used relatively early in the paragraph, since they introduce new concepts which can be worked out in more detail in the remainder of the paragraph. The relative position has not been addressed earlier in this way. Other approaches (Blair-Goldensohn et al., 2004; Fahmi and Bouma, 2006) have used the position of a sentence within the document instead of the paragraph. However, since many of our documents discuss different topics and contain more definitions (an average document contains 13.4 definitions), the position within the document is less relevant.

RP = \frac{AP}{LP} \qquad (4.3)

This relative position formula (Formula 4.3) assigns a score between zero and one to each of the definition candidates. It should be noted that the score is highly influenced by the length of the paragraph. In a paragraph of one sentence, the relative position is 1 according to this formula, since the sentence is at the last position of the paragraph. However, it is at the first position of the paragraph as well. The relative position formula does not take this problem into account. Combining it with the absolute position makes it possible to relate the absolute and relative position.
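A minimal sketch of the three paragraph position features, assuming a paragraph is available as a list of sentences:

    # A minimal sketch of the paragraph position features (LP, AP, RP).
    # Note the caveat discussed above: in a one-sentence paragraph the
    # relative position is always 1.

    def paragraph_position_features(paragraph_sentences, sentence_index):
        lp = len(paragraph_sentences)   # length of the paragraph (LP)
        ap = sentence_index + 1         # absolute position, 1-based (AP)
        rp = ap / lp                    # relative position, Formula 4.3 (RP)
        return lp, ap, rp

    print(paragraph_position_features(["s1", "s2", "s3", "s4"], 0))
    # (4, 1, 0.25)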

               D      ND      t       Sig
is            0.48   0.64   -2.82   <0.001
verb          0.53   0.68   -2.28   <0.001
punctuation   0.76   0.77   -0.08     0.41
pronoun       0.61   0.69   -0.98    <0.05
all           0.57   0.72   -4.42   <0.001

Table 4.14 The average relative position of definitions (D) and non-definitions (ND) within the paragraph

The relative position (Table 4.14) is significantly lower in definitions than non-definitions within the is, verb, and pronoun data sets, which shows that the definitions are used relatively earlier in the paragraph than the non-definitions. The difference for the punctuation sentences is not significant, which means that the relative position is the same for the two classes in this data set. This can be related to the fact that 48.5% of the paragraphs containing a punctuation definition consist of only one sentence. For all these sentences, the relative position is 1 (1/1) with this formula.

Position in sentence One of the differences observed between the four definition types is the place in the sentence where definitions can start and end. Whereas is and verb definitions tend to span a complete sentence, the rules for punctuation definitions are less strict in this respect. On the basis of this observation, it has been investigated whether this type of position information can be used to distinguish definitions from non-definitions.

A more pragmatic reason to investigate this property is related to the pattern-based approach. As explained in Section 3.5.1 (rule AG8 of the grammars), the beginning of a sentence can be described with two rules. The first rule matches the first token of a sentence, whereas the second matches all capitalized words as possible initial elements. Since capitalized words can have other functions as well (the most common one being a proper noun), this second rule may result in the extraction of non-definitions. The position is indicated by the index of the first token of the definition within the sentence element. So, if a definition begins at the start of a sentence, the score is 1.

               D      ND      t       Sig
is            1.24   2.44    -3.70  <0.001
verb          1.29   3.06    -3.97  <0.001
punctuation   5.34   9.71    -5.52  <0.001
pronoun       3.29   8.04    -6.72  <0.001
all           2.40   7.91   -19.34  <0.001

Table 4.15 The average beginning position of definitions (D) and non-definitions (ND) within the sentences (in tokens)

Definitions begin significantly earlier in the sentence than non-definitions within all data sets. The majority of is and verb definitions start at the beginning of the sentence (97% and 94%). The situation is different for the punctuation and pronoun definitions: only 48% of the punctuation definitions start at the beginning of a sentence, whereas for the pronoun definitions this is 74%.

Position of definiendum The function of a definition is to explain the meaning of a term. One would therefore expect that terms are defined one of the first times they are used; it would be odd to define a term after it has already been used many times. Although it is possible that a term has been used two or three times before the definition (e.g. in the title of the document, the table of contents or a heading), intuitively one would expect it to occur more frequently after it has been defined. Based on this intuition, five features have been included. The first three are related to the number of times the term defined in the definiendum has been used in the complete document and provide information on the absolute position. Two other features assess the relative position of the definiendum. The position of the definiendum within the document has not been examined before.

Frequency and absolute position The first two measures are the absolute number of occurrences of the term before and after it is used in the definition candidate. The average number of occurrences before the candidate is significantly lower for definitions in all data sets except the is data set. The number of occurrences of the term after it has been defined seems to be a less good predictor and shows a significant difference only for the is and punctuation definitions.

                          Before                          After
               D      ND      t       Sig       D      ND      t       Sig
is            4.69   4.85   -0.10     0.46    22.38   5.76    3.43   <0.001
verb          2.65   5.73   -2.28    <0.05     7.04   7.87   -0.35     0.36
punctuation   2.92  10.91   -5.17   <0.001     5.51  10.02   -2.16    <0.05
pronoun       0.52   6.41   -8.08   <0.001     4.42   7.53   -3.56     0.12
all           3.03   8.54   -6.30   <0.001    11.29   8.66    5.85     0.12

                           All
               D      ND      t       Sig
is           27.07  10.61    2.96    <0.01
verb          9.69  13.59   -1.16     0.12
punctuation   8.42  20.93   -3.85   <0.001
pronoun       4.94  13.93   -3.51   <0.001
all          14.32  17.20   -1.24     0.11

Table 4.16 Frequency of use of the defined terms from definitions (D) and non- definitions (ND) in the complete document

The figures from Table 4.16 are in line with the intuitive expectation with respect to the distribution of the definiendum in definitions and non-definitions. The upper left part of the table refers to the frequency before the definition candidate is used, the upper right part focuses on the frequency after the term has been defined, and the lower part gives the total number of times the definiendum has been used in the document.

The frequency of the definiendum before it is used in the definition candidate is similar in definitions and non-definitions from the is data set. For the other types, the frequencies are significantly lower for the definitions than for the non-definitions. A possible explanation for this could be that the definienda of the is definitions are so important that they are used in titles, headers and the table of contents more often than the definienda of the other definition types. In the other data sets, the difference in this respect is especially big for the punctuation and pronoun patterns.

The difference between definitions and non-definitions is less clear when it comes to the number of times the definiendum is used after the definition candidate. Again, the is definitions are an exception, since for this type the terms defined in the definitions are used considerably more often after the definition than the terms from the non-definitions. For the punctuation data set, there is a significant difference as well. However, here the definienda from the non-definitions are observed more often after the definition candidate.

The total frequency of the terms depends on the definition type as well. Again, there is a distinction between the behaviour of the is definitions on the one hand and the three other definition types on the other hand. In the is data set, the frequency of the definienda is significantly higher in the definitions than in the non-definitions. For the verb definitions, there is no significant difference, although the average number is higher here for the non-definitions. In the punctuation and pronoun data sets, the terms from the non-definitions are used significantly more often. There is no significant difference for the complete data set, which can be explained by the opposite directions of the differences observed for the individual types.

Relative position The differences between the frequencies before and after the definiendum are considerably bigger for the definitions than for the non-definitions (Table 4.16). Based on this observation, I identified two formulas to express the relative position of the definiendum within the document. This information has not been used before and the formulas proposed are both new.

The first formula looks at the relative position of the definiendum within the document. It is similar to the formula for the relative position of the sentence within the paragraph (Formula 4.3). The absolute position has been replaced by the number of times the term has been used earlier in the document, plus one for the occurrence of the term in the definition. The length of the paragraph in this equation is replaced by the total frequency of the term. Formula 4.4 shows how the relative position has been measured:

\text{relative position} = \frac{1 + \text{frequency before}}{\text{total frequency}} \qquad (4.4)

In Formula 4.5, the balance between the number of times the term is used before and after the definition is considered. Compared to the relative position formula, an advantage of the balanced formula is that it is less sensitive to the number of times the term is used in the document, which is useful since many of the terms are not frequently used. It simply shows whether a term occurs relatively more towards the beginning or more towards the end of a document. A disadvantage of both formulas is that they do not reflect the frequency of the term, and therefore it is important to include the absolute frequency information as well. The balanced formula returns a score between -1 (the definition contains the last occurrence of the definiendum) and 1 (the definition contains the first occurrence of the definiendum):

\text{balance} = \frac{\text{frequency after} - \text{frequency before}}{\text{frequency after} + \text{frequency before}} \qquad (4.5)
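Both formulas can be computed directly from the occurrence counts, as the following sketch shows. The guard for terms that occur only in the definition itself is an assumption; the thesis does not spell out how that case was handled.

    # A minimal sketch of Formulas 4.4 and 4.5, computed from the raw
    # occurrence counts of the definiendum in the document.

    def relative_position(freq_before, total_frequency):
        # Formula 4.4: (1 + occurrences before the definition) / total
        return (1 + freq_before) / total_frequency

    def balance(freq_before, freq_after):
        # Formula 4.5: -1 = last occurrence, 1 = first occurrence
        if freq_before + freq_after == 0:
            return 0.0  # assumed: term occurs only in the definition
        return (freq_after - freq_before) / (freq_after + freq_before)

    # A term used twice before and six times after its definition:
    print(relative_position(2, 9), balance(2, 6))  # 0.333... 0.5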

                   Relative position                 Balance
               D      ND      t       Sig       D      ND      t       Sig
is            0.59   0.78   -5.68   <0.001     0.20   0.03    2.76    <0.01
verb          0.67   0.77   -2.71    <0.01     0.17  -0.01   14.72    <0.01
punctuation   0.79   0.78    0.20     0.30     0.13   0.01    2.57    <0.01
pronoun       0.21   0.54   -7.93   <0.001     0.19   0.04   18.88    <0.01
all           0.60   0.71   -5.94   <0.001     0.17   0.02   45.05   <0.001

Table 4.17 Relative position and balanced position of definiendum in definitions (D) and non-definitions (ND) within the complete document

Table 4.17 reveals that in most data sets the definienda of definitions are used significantly earlier in the document than those of non-definitions. In the punctuation definitions, the relative position is the same for the two classes. The balanced score shows the same overall difference: in the non-definitions, the terms are on average in the middle of the document (scores close to 0), whereas in the definitions, they are used relatively more towards the beginning of the document (scores between 0 and 1).

For both formulas, the large number of unique terms strongly influences the scores. The relative scores are closer to 1 than expected, since the relative score is 1 for terms that occur only in the definition. The balanced scores do not deviate much from 0 for the same reason.

4.4.3.5 Layout properties

The layout properties consider the structure of the document and the way in which the layout differs between definitions and non-definitions. The first property is based on the idea that learning objects are structured and divided into different parts (e.g. chapters, sections, subsections), in which each block of text has its own layout or style. In this way it is possible to distinguish, for example, headers from list items and titles from paragraphs. Intuitively, one would expect definitions to be used primarily within the paragraphs. This information is used in the first layout feature. To the best of my knowledge, no other research has included this kind of structural information as a feature for definition classification. A second type of layout information involves the investigation of different font styles. I assume that definitions are more often formatted in a special way (e.g. bold, italics) than non-definitions. This assumption has not been tested or implemented as a machine learning feature before.

Type of paragraph As described in Section 3.4.1, layout information has been preserved in the XML documents and has been stored in the rend-attribute of tokens, sentences and paragraphs. At the paragraph level, this attribute specifies the type of paragraph. The fifteen different types are based on the standard HTML notation for structuring texts. The most common paragraph type is the 'p' type, which refers to standard text paragraphs. Seven types are related to headings ('h1' to 'h6') and titles ('title'). Then there is the encoding for list items ('li'), for information from table cells ('th' for header cells and 'td' for data cells), and for definition lists ('dt' for terms and 'dd' for definitions). The code 'div' is used to define a generic block-level container. The 'pre' code indicates verbatim text and is often used to format computer code or other text in which it is important to preserve the whitespace.

Intuitively, one would expect definitions to appear most often in 'real' paragraphs, denoted by the value 'p'. For the is, verb and pronoun definitions, this indeed is the case (86%-91%). However, there is not much difference between definitions and non-definitions with respect to the type of paragraph. Within the punctuation data set, there is a clear distinction between definitions and non-definitions: only 55% of the definitions are used in 'p' paragraphs, whereas this is 73% for the non-definitions. It is interesting to note that punctuation definitions are quite common in list items (26%). Not only the structure of the sentences is different from the other definition types, but also the way in which the definitions are used in the text.

Layout of definiendum Since definitions contain important information, one would expect special layout features (e.g. bold, italics, underlined) to occur more often in definitions than in non-definitions. Information on the layout of the original documents is stored at the word, sentence and paragraph level in the XML files. For this feature, the layout of the nouns in the definiendum is investigated, as this is the key term of the definition.

                        is              verb         punctuation       pronoun
                     D      ND       D      ND       D      ND       D      ND
layout             18.87   7.77   17.95    8.63   28.28   14.07    9.09    5.75
- b                 3.77   1.69    3.85    0      10.10    5.25    1.30    1.34
- i                10.06   3.38    9.62    5.08   11.11    6.49    6.49    3.88
- u                 1.89   0.68    0.64    2.54    3.03    1.09    0       0
- dfn               1.89   0       0       0.51    0       0       0       0
- sup               1.26   1.01    2.56    0       3.03    1.02    1.30    0.53
- em                0      0.68    0.64    0       1.01    0       0       0
- code              0      0.34    0.64    0.51    0       0.22    0       0
no layout          81.13  92.23   82.05   91.37   71.72   85.93   90.91   94.25

Table 4.18 The use of layout elements in the definiendum of definitions (D) and non-definitions (ND) (in %)

Table 4.18 compares the layout information used in the definiendum of definitions and non-definitions for the four types of definitions. For each of these types, the definienda of definitions more often contain a specific type of layout than the definienda of non-definitions. The proportion of layout information is about twice as high in definitions as in non-definitions, regardless of the definition type. Since some of the properties are rarely used, all layout features have been combined into one group. This makes it possible to test whether there is a significant difference between the two groups, which is indeed the case: layout is used significantly more in definitions than in non-definitions for three types (is definitions (χ²(1) = 11.32, p < 0.001), verb definitions (χ²(1) = 5.99, p < 0.025) and punctuation definitions (χ²(1) = 13.52, p < 0.001)). For the pronoun definitions, the number of definienda is not high enough to draw statistical conclusions, because not all sentences contain a definiendum part. Two layout features have been included: the first contains information on the separate layout types (such as b, i, u) and the second is a binary feature which indicates whether or not a specific layout has been used.
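A minimal sketch of the two layout features is given below, assuming each noun in the definiendum comes with the value of its rend attribute (or None when no special layout is used); the tuple representation is an illustration, not the actual document model.

    # A minimal sketch of the two layout features derived from the
    # rend attribute of the nouns in the definiendum.

    def layout_features(definiendum_nouns):
        """definiendum_nouns: list of (word, rend) pairs."""
        rends = [rend for _, rend in definiendum_nouns if rend]
        layout_type = rends[0] if rends else "none"  # nominal feature
        has_layout = bool(rends)                     # binary feature
        return layout_type, has_layout

    print(layout_features([("vette", None), ("letter", "b")]))
    # ('b', True)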

4.4.3.6 Keyword properties

Definitions provide the reader of a text with an explanation of a term. Intuitively, one would expect more definitions for words or phrases that are important within a text. To indicate the importance of words within a document, it is possible to use keyword scores. A keyword score determines how relevant a word is in a specific text by comparing it to a corpus of documents.

The keyword extractor that has been developed within the LT4eL project (Lemnitzer and Monachesi, 2008) has been used. This tool assigns a score to each word and phrase in the text that indicates how important the term is within the document. The phrases are restricted on the basis of the lexico-syntactic patterns that a keyphrase can have. For example, a combination of an adjective and a noun can be a keyphrase, whereas the combination of an article and a noun cannot.

From the different available keyword extractors, the LT4eL extractor has been selected for three reasons. First, it can deal with Dutch texts, whereas many keyword extractors are available only for English (cf. Sclano and Velardi (2007), Frank et al. (1999), Witten et al. (1999)).

A second important aspect of the LT4eL keyword extractor is that it can extract keyphrases in addition to keywords. This is relevant, since the definiendum often consists of more than one word. The third, more pragmatic, reason to use the LT4eL keyword extractor is that this tool has been designed to work with the LT4eLAna document format.

The keyword scores can be measured by the tool in three different ways. For each of the three measures, it has been investigated whether there is a distinction between the definitions and non-definitions. The scores have been calculated on the basis of the LT4eL corpus.

RIDF keyword score The RIDF score makes use of the Poisson distribution or a mixture of Poisson distributions (Church and Gale, 1995a). While the distribution of function words like 'of', 'the', and 'it' and other common words is close to the expected distribution under the Poisson model, good keyword candidates deviate significantly from this expectation. The more a word deviates from Poisson, the more useful it is for discriminating documents and thus the more appropriate it is as a keyword (Church and Gale, 1995b). The residual inverse document frequency (RIDF) is based on this observation and is calculated by taking the difference between the actual inverse document frequency ('observed IDF') and the inverse document frequency predicted under the Poisson model ('predicted IDF'):

RIDF = IDF_o - IDF_p
     = \log_2 \frac{N}{df_t} - \left(-\log_2\left(1 - e^{-p}\right)\right)
     = \log_2 \frac{N}{df_t} + \log_2\left(1 - e^{-p}\right)
     = \log_2 \frac{N\left(1 - e^{-p}\right)}{df_t} \qquad (4.6)

with p = CF/N, where CF is the frequency of the word in the complete corpus. The RIDF score is thus based on the frequency of a word in the complete corpus and the number of documents in which it occurs. The RIDF score can be used for comparison across documents, since the frequency of the term within a document is not taken into account.

               D      ND      t       Sig
is            2.13   1.60    4.72   <0.001
verb          1.81   1.70    0.92     0.22
punctuation   1.05   1.43   -3.29    <0.05
pronoun       2.05   1.52    4.27    <0.05
all           1.79   1.49    5.25   <0.001

Table 4.19 Average RIDF scores for the definiendum in definitions (D) and non-definitions (ND)

The expectation about the keyword scores was that the scores would be on average higher for definienda used in definitions than in non-definitions. Table 4.19 shows that this idea is only correct for the is and pronoun definitions, where especially the difference within the is definitions is big. For the verb definitions, there is no significant difference, and for the punctuation definitions the scores are even higher in the non-definitions. For the combined data set, the average keyword score is significantly higher for the definitions than for the non-definitions.

TFIDF keyword score The TF-IDF score is composed of two parts: the term frequency (TF) and the inverse document frequency (IDF). The term frequency indicates how often a term is used within a specific document. The inverse document frequency is an indication of a term as a document discriminator within a collection of documents. It is computed by taking the inverse function of the document frequency of the term. The document frequency is the number of documents that contain this term. The LT4eL keyword extractor computes IDF using the following formula:

\[
IDF = \log_2\frac{N}{DF_t}
\tag{4.7}
\]

where N is the number of documents and DF_t the document frequency of the term. In TF-IDF, the IDF part is multiplied by the term frequency:

\[
TFIDF = TF \times IDF
\tag{4.8}
\]
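A corresponding sketch, again assuming tokenized documents and using the same illustrative conventions as the ridf sketch above:

    import math

    def tfidf(term, doc, docs):
        """TF-IDF (Eqs. 4.7 and 4.8): within-document term frequency
        times log-scaled inverse document frequency over the collection."""
        N = len(docs)
        df = sum(1 for d in docs if term in d)   # document frequency DF_t
        if df == 0:
            return 0.0
        return doc.count(term) * math.log2(N / df)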

Keyword scores are partly based on the frequency of a term within a document. This is a good way of scoring keywords as long as it is not necessary to compare keyword scores across documents; in the definition classification situation, however, such a comparison is necessary. The length of documents varies greatly within the corpus. As a consequence, the range of TF-IDF scores varies for each document. For example, a 50,000-word document on XML probably has a higher frequency of the term XML than a 1,000-word document on the same topic, and as a consequence the TF-IDF score will be higher for the longer document. However, this does not automatically imply that the word is more important in the document with the higher score. Different techniques have been proposed to make the scores comparable across documents (cf. Salton and Buckley (1988), Singhal et al. (1995), Singhal et al. (1996)). Although the score may not be very useful on its own, it can be relevant in combination with other features, like the feature on the frequency of the definiendum discussed in the section on position information. Therefore, this score has been included as a feature.

                 D        ND      t       Sig
  is             104.91   58.01   2.88    <0.01
  verb           48.54    61.34  -1.19    0.16
  punctuation    24.7     54.44  -4.93    <0.001
  pronoun        65.29    54.37   0.73    0.34
  all            66.14    55.33   1.68    0.09

Table 4.20 Average TFIDF scores for the definiendum in definitions (D) and non-definitions (ND)

Only for the is and punctuation definitions is there a significant difference between definitions and non-definitions with respect to the TF-IDF score (Table 4.20). The results for these types are the same as for the RIDF score. For the verb, pronoun, and combined data sets, the difference between the two classes is not significant.

ADRIDF keyword score The Adjusted Residual IDF (ADRIDF) score is an adapted version of the RIDF score, the only difference being that in this score the term frequency within the document has been taken into account:

\[
ADRIDF = RIDF \cdot \sqrt{TF}
\tag{4.9}
\]

where TF is the frequency of the term within the document. For this metric, the same restrictions apply as for the TFIDF score, since the term frequency within the document is considered here as well. Just as for the TFIDF score, I expected that it might be valuable in combination with other features, and therefore it has been included as well.
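Continuing the earlier sketches, ADRIDF only adds a square-root-weighted term frequency on top of RIDF; adridf and its arguments are illustrative names, and ridf is the helper defined above:

    import math

    def adridf(term, doc, docs):
        """ADRIDF (Eq. 4.9): RIDF weighted by the square root of the
        within-document term frequency; reuses ridf() from the earlier sketch."""
        return ridf(term, docs) * math.sqrt(doc.count(term))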

                 D       ND      t       Sig
  is             11.16   7.04    3.49    <0.01
  verb           6.86    7.45   -0.63    0.30
  punctuation    3.48    6.33   -4.3     <0.001
  pronoun        8.79    6.88    1.45    0.22
  all            7.92    6.63    2.56    <0.05

Table 4.21 Average ADRIDF scores for the definiendum in definitions (D) and non-definitions (ND)

The ADRIDF scores (Table 4.21) are similar to the TFIDF scores from Table 4.20. The same significant differences are observed for the is and punctuation data sets, and no significant differences are observed for the verb and pronoun data sets. A difference with the TFIDF scores is that with the ADRIDF score, the average keyword score within the combined data set is significantly higher for the definitions than for the non-definitions.

Summarizing, the four definition types show a different behaviour with respect to the keyword scores of the definienda in definitions and non-definitions. The is definitions and non-definitions are most different with respect to the keyword scores. For this type, the definienda from the definitions are more important within the documents than the definienda from the non-definitions. For the pronoun data set, the same difference is observed, although it is much smaller. The two classes from the verb data set do not differ significantly with respect to keywordiness. In the last data set, the punctuation definitions, the average keyword score is higher for the non-definitions than for the definitions.

4.4.4 ALGORITHM SELECTION

4.4.4.1 Selecting a classifier

The previous section revealed that there are a number of differences between definitions and non-definitions. On the basis of these differences, classifiers are trained that are able to distinguish definitions from non-definitions. To find an appropriate classifier, it has been investigated which classifiers are used in other research on definition extraction. On the basis of our data sets and task, the most appropriate one has been selected. The same classifier has been used for all data sets and feature settings.

As described in Section 4.2, the distinction between supervised, unsupervised and semi-supervised classifiers is relevant when a classifier needs to be selected. In order to be able to evaluate the performance of the pattern-based approach presented in Chapter 3, a number of definitions had to be selected and annotated manually. The instances from the machine learning data sets can be labeled automatically on the basis of these annotations. Since the data are labeled, a supervised classifier can be used. Previous machine learning experiments for the classification of definitions have used supervised classifiers as well. Fahmi and Bouma (2006) used three types of supervised classifiers in their experiments on the classification of is definitions: the Maximum Entropy classifier, a naive Bayes (NB) classifier and Support Vector Machines (SVM). Miliaraki and Androutsopoulos (2004) and Androutsopoulos and Galanis (2005) used an SVM classifier. Kobyliński and Przepiórkowski (2008) report on experiments with the Balanced Random Forest (BRF) classifier, in which the Random Forest classifier is combined with a sampling technique to deal with imbalanced data sets within the area of definition extraction. The BRF classifier outperformed a number of common classifiers, including the naive Bayes classifier (Degórski et al., 2008).

Since some of our data sets are imbalanced, I investigated whether the bagging procedure from Kobyliński and Przepiórkowski (2008) is useful on these data as well. To this end, an experiment has been carried out in which common classifiers are compared against the balanced versions of the same classifiers. The data set that has been used is the aggregated data set, which consists of a combination of the four individual data sets. It contains a minority of definitions (16%) and a majority of non-definitions (84%).

  classifier                 R      P      F      AUC
  common classifiers
    NB                       0.39   0.37   0.38   0.77
    SVM                      0      0      0      0.5
    RF                       0.32   0.36   0.34   0.77
  balanced classifiers
    BNB                      0.56   0.32   0.41   0.77
    BSVM                     0.56   0.35   0.43   0.76
    BRF                      0.60   0.35   0.44   0.81

Table 4.22 Balanced bagging experiments on the aggregated data set using the PoS bigram setting

Table 4.22 shows that the bagging procedure improves the results for each of the classifiers. The recall scores obtained with the classifiers in which no bagging has been used are considerably lower than the scores obtained with the balanced classifiers. The Balanced Random Forest classifier has been used for my experiments and is discussed in more detail in the remainder of this section.

4.4.4.2 The Balanced Random Forest classifier

The Balanced Random Forest classifier is an adapted version of the Random Forest classifier. The Random Forest classifier is a decision tree algorithm in which prediction is not based on one tree, but on an ensemble of trees (Breiman, 2001). This ‘forest’ of trees is created by taking bootstrap samples Di of the training data and using random feature selection for tree induction. A bootstrap sample is a sample in which the instances have been selected from a data set D with replacement. This makes it possible that some instances are contained more than once in each Di. Predictions are made on the basis of a majority vote, that is, the class predicted by most of the trees is selected. The idea behind Random Forest is used in other classifiers as well and is called bagging (bootstrap aggregating).

A disadvantage of the Random Forest approach is that there is a significant probability that a bootstrap sample contains few or even no instances of the minority class when data sets are imbalanced, as is the case in the definition classification context. As a consequence, the resulting tree will perform poorly at predicting the minority class. A naive way to fix this problem would be to use a stratified bootstrap, that is, taking a sample with replacement from within each class. The problem with this approach is that a large part of the information from the majority class is lost and not used in classification. This is an example of a down-sampling technique. Other sampling techniques that can be used are over-sampling the minority class, or a combination of both down-sampling and over-sampling (Kubat et al., 1998). A completely different approach to tackle the imbalanced data problem is based on cost-sensitive learning. In this approach a high cost is assigned to misclassification of the minority class, while at the same time the overall cost is kept as low as possible (Chen et al., 2004).

The Balanced Random Forest method combines a down-sampling technique with the ensemble idea: it down-samples the majority class and grows each tree on a more balanced data set. It is thus a modification of the Random Forest method designed specifically to deal with imbalanced data sets. In this method an adapted version of the bagging procedure is used, the difference being that trees are induced from balanced down-sampled data, which differs from the normal down-sampling procedure. A majority vote is taken for prediction. The procedure of the Balanced Random Forest (BRF) algorithm is described by Chen et al. (2004):

1. For each iteration in random forest, draw a bootstrap sample from the minority class. Randomly draw the same number of cases, with replacement, from the majority class.

2. Induce a classification tree from the data to maximum size, without pruning. The tree is induced with the CART (Classification and Regression Trees) algorithm (Breiman et al., 1984), with the following modification: at each node, instead of searching through all variables for the optimal split, only search through a set of m randomly selected variables. Breiman (2001) experimented with m = 1 and higher values of m and concluded that the procedure is not very sensitive to the value of m.

3. Repeat the two steps above for the number of times desired. Aggregate the predictions of the ensemble and make the final prediction.
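The following Python sketch mirrors these three steps, using scikit-learn's decision tree as a stand-in for CART; the class and parameter names are illustrative, and binary 0/1 labels are assumed:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    class BalancedRandomForest:
        """Minimal sketch of Chen et al.'s Balanced Random Forest."""

        def __init__(self, n_trees=30, m_features=20, seed=0):
            self.n_trees = n_trees
            self.m_features = m_features
            self.rng = np.random.default_rng(seed)
            self.trees = []

        def fit(self, X, y):
            y = np.asarray(y)
            classes, counts = np.unique(y, return_counts=True)
            minority = classes[counts.argmin()]
            min_idx = np.flatnonzero(y == minority)
            maj_idx = np.flatnonzero(y != minority)
            n = len(min_idx)
            for _ in range(self.n_trees):
                # step 1: balanced bootstrap - n cases with replacement
                # from the minority class and n from the majority class
                idx = np.concatenate([self.rng.choice(min_idx, n, replace=True),
                                      self.rng.choice(maj_idx, n, replace=True)])
                # step 2: unpruned tree, m randomly selected attributes
                # considered at each split
                tree = DecisionTreeClassifier(
                    max_features=min(self.m_features, X.shape[1]),
                    random_state=int(self.rng.integers(1 << 30)))
                self.trees.append(tree.fit(X[idx], y[idx]))
            return self

        def predict(self, X):
            # step 3: aggregate the ensemble by majority vote
            votes = np.stack([t.predict(X) for t in self.trees])
            return np.apply_along_axis(
                lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)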

4.4.5 TRAINING AND TEST SET

The classifiers are trained and evaluated using n-fold cross validation with n = 10. Ten-fold cross validation has been chosen since some of the data sets are relatively small; in this way it is possible to train in each fold on 90% of the original data set. The classifiers have not been evaluated on a different data set.
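A sketch of this evaluation loop; the stratified folds, the stand-in classifier and the random data are assumptions for illustration only, since the thesis experiments use the Balanced Random Forest:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import StratifiedKFold

    # Hypothetical data standing in for one of the definition data sets.
    rng = np.random.default_rng(0)
    X, y = rng.random((500, 20)), rng.integers(0, 2, 500)

    aucs = []
    for train, test in StratifiedKFold(n_splits=10, shuffle=True,
                                       random_state=0).split(X, y):
        clf = RandomForestClassifier(random_state=0).fit(X[train], y[train])
        aucs.append(roc_auc_score(y[test], clf.predict_proba(X[test])[:, 1]))
    print(f"mean AUC over 10 folds: {np.mean(aucs):.2f}")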

4.4.6 EVALUATION METRICS

In Section 4.2.6, the AUC score has been introduced. Since this score is less sensitive to imbalanced data than the accuracy, it has been used to evaluate the performance of the classifiers. The AUC provides only information on the overall accuracy of the classifier and not on the performance on the individual classes. For this reason, the precision (Pdef), recall (Rdef) and F-score (Fdef) of the definition class are reported as well. The recall and F-score are reported in addition to the AUC and precision to provide a complete view on the performance of a classifier.
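For a toy set of predictions, these four scores can be computed as follows; scikit-learn is an assumed implementation choice, and label 1 marks the definition class:

    from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

    # Toy predictions; label 1 = definition, 0 = non-definition.
    y_true  = [1, 0, 1, 1, 0, 0, 0, 1]
    y_pred  = [1, 0, 0, 1, 0, 1, 0, 1]
    y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.3, 0.7]   # P(definition)

    p, r, f, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", pos_label=1)
    auc = roc_auc_score(y_true, y_score)
    print(f"Pdef = {p:.2f}, Rdef = {r:.2f}, Fdef = {f:.2f}, AUC = {auc:.2f}")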

4.4.7 PARAMETER TUNING

The parameter tuning component is one of the last components from the machine learning diagram. It is concerned with finding the best settings for the parameters. A distinction is made between parameters related to the features and classifier parameters. The only features for which different parameter settings had to be compared are the PoS and MSI bigrams. The other features are either numeric (e.g. the position scores) or nominal (e.g. the linguistic properties); in both cases, the possible values do not need to be reduced. The optimal number of bigrams to be included depends on the number of bigrams available in the data set and the classifier used. To investigate which values give the best results, the Balanced Random Forest classifier has been extensively tested with different numbers of bigrams. The parameters of the Balanced Random Forest classifier have been tuned as well. The number of iterations to be used in the bagging procedure is the most important parameter that needs to be investigated in this respect.

This section describes the experiments that have been carried out with respect to the tuning of the number of bigrams to be considered (Section 4.4.7.1) and the parameter tuning related to the Balanced Random Forest classifier (Section 4.4.7.2).

4.4.7.1 Bigrams

The first thing that has been investigated is the optimal number of bigrams. Two types of bigrams have been considered: PoS tag bigrams (PoS) and bigrams containing morpho-syntactic information in addition to the PoS tags (MSI). Logically, the number of PoS unigrams is much lower than the number of MSI unigrams. For the first (PoS), there are only 12 classes, whereas for the second (MSI) 175 classes are distinguished within the complete data set containing all definitions and non-definitions extracted with the pattern-based approach. If each sequence of word types were possible, the maximum number of bigrams could be calculated by squaring the number of unigrams, which would result in 144 PoS tag bigrams (12²) and 30,625 MSI bigrams (175²). The two aspects that have been considered in the search for the optimal number of bigrams are the size of the data set and the balance of the classes within the data set.

The size of the data set correlates with the number of different bigrams that are used. Since a large part of the bigrams – and even unigrams – is rarely used, many of those rare bigrams and unigrams are missing in smaller data sets. As a consequence, in the smaller data sets considerably fewer than the 175 possible MSI classes have been used. For example, the data set for the is definitions contains only 119 classes. The largest data set is the set in which all four definition types are included. This data set contains 131 PoS bigrams and 3,187 MSI bigrams. The smallest data set, the is data set, contains only 112 PoS bigrams and 1,200 MSI bigrams. A second relevant issue to be considered with respect to the bigrams is the distinction between balanced and imbalanced data sets. In the imbalanced data sets, the minority class contains considerably fewer bigram types than the majority class. For this reason, the bigram values in the experiments have been set for each class, selecting an equal number of bigrams from both classes.

Based on these two aspects, the values for the PoS tag bigrams have been chosen between 0 and 120, whereas for the MSI bigrams the values differ between 10 and 1000. Experiments to find optimal settings have been carried out on all data sets for the Balanced Random Forest classifier, since this is the classifier that will be used in the experiments. The experiments reveal that for each of the data sets the AUC scores stabilize around 40 bigrams. The largest improvement is observed between including 10 and 20 bigrams. The optimal number of bigrams when the morpho-syntactic information is included as well differed slightly for each data set. The results stabilize around 400 bigrams, and therefore this value has been used in the experiments. The results of the experiments can be found in Appendix F.
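A sketch of this per-class selection, under the assumption that bigrams are ranked by raw frequency within each class:

    from collections import Counter

    def select_bigrams(tag_sequences, labels, n_per_class=20):
        """Count PoS-tag bigrams separately for definitions (1) and
        non-definitions (0) and keep the n most frequent bigrams of
        each class as features."""
        counts = {0: Counter(), 1: Counter()}
        for tags, label in zip(tag_sequences, labels):
            counts[label].update(zip(tags, tags[1:]))   # adjacent tag pairs
        selected = set()
        for label in (0, 1):
            selected.update(bg for bg, _ in
                            counts[label].most_common(n_per_class))
        return selected

    # Toy usage with two PoS-tagged sentences.
    sents = [["Art", "N", "V", "Art", "Adj", "N"],
             ["N", "V", "Prep", "N"]]
    print(select_bigrams(sents, [1, 0], n_per_class=3))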

4.4.7.2 Balanced Random Forest

As described in Section 4.4.4, the Balanced Random Forest classifier has been used for the experiments. For this classifier, two parameters need to be tuned: the number of iterations to be used in the bagging procedure and the number of attributes to be considered at each node.

Balanced bagging settings In balanced bagging, the bagging procedure is repeated a specified number of times (I). Breiman (1996) experimented with different values of I to find the optimal value and found that after 25 replicates no further improvement was obtained. Experiments on our data set have been carried out with the value of I varying between 1 (no bagging) and 100 for the Random Forest classifier using 40 PoS bigrams or 400 MSI bigrams. The experiments revealed that both for the PoS and the MSI bigrams the biggest improvement is obtained when going from 1 replicate (no bagging) to 10 replicates. Higher values result in slightly improved results, but after 30 bootstrap replicates (or fewer for some of the data sets) the AUC stabilizes and no significant improvement is observed anymore. Therefore, this value (I = 30) has been used in all balanced bagging experiments. Appendix F contains the results of the experiments.
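Schematically, the tuning experiment amounts to a loop over ensemble sizes; the data here are hypothetical, and BalancedRandomForest refers to the sketch in Section 4.4.4.2:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    # Hypothetical train/test split standing in for one fold.
    rng = np.random.default_rng(1)
    X, y = rng.random((400, 20)), rng.integers(0, 2, 400)
    X_tr, X_te, y_tr, y_te = X[:300], X[300:], y[:300], y[300:]

    for I in (1, 10, 20, 30, 50, 100):
        forest = BalancedRandomForest(n_trees=I).fit(X_tr, y_tr)
        auc = roc_auc_score(y_te, forest.predict(X_te))
        print(f"I = {I:3d}  AUC = {auc:.2f}")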

Number of attributes The number of attributes M differs for each data set. In the Balanced Random Forest classifier, the parameter m indicates the number of randomly selected attributes to be considered at each node. Experiments by Breiman (2001) with m = 1 and higher values of m showed that the procedure is not very sensitive to the value of m. Since the value of m thus does not influence the results very much, the same value has been used in all definition classification experiments, namely m = 20.

4.5 CONCLUSIONS

The process of applying machine learning consists of several steps, which are closely related to each other. This chapter presented a general overview of the different components. More specifically, it described the identification of data, data pre-processing, feature selection, algorithm selection, the creation of training and test sets, the choice of evaluation metrics, and parameter tuning. After this general overview, a description was provided of the choices that have been made for the set-up of the machine learning experiments.

In this thesis, I focused on the feature selection component. More specifically, I investigated which types of information that can be extracted from the documents themselves can be used for the classification of definitions and non-definitions. Six types of information have been considered. The first type of information is the linguistic bigrams. The other five types are related to connector information, linguistic information of definiendum and definiens, position information, layout information, and keyword information. The comparison of definitions and non-definitions with respect to these features revealed that the definition type plays an important role. The largest differences within the data set are observed in the is data set, while the differences are smallest in the verb data set.

In the algorithm selection component, the Balanced Random Forest classifier has been selected, since this classifier can deal better with imbalanced data sets than other classifiers that have been used for definition extraction (like naive Bayes and SVM). In the parameter tuning experiments, the optimal settings for the experiments have been determined. The last step in the machine learning process is the training of the classifiers. This part is discussed in the next chapter, which presents the results obtained with the different classifiers.

Part of the inhumanity of the computer is that, once it is competently programmed and working smoothly, it is completely honest.
Isaac Asimov (1920-1992)

5 Machine learning results

5.1 INTRODUCTION

Chapter 4 described how the different machine learning components distinguished by Kotsiantis (2007) have been addressed in my experiments. The emphasis of the previous chapter was on the selection of appropriate features, which have been used to train classifiers. The results obtained with the different classifiers are discussed in this chapter. The figures presented indicate the final performance after applying the sequential combination of the pattern-based approach and machine learning. An inevitable consequence of using such a sequential combination is that the final recall scores are lower than the recall scores obtained with the pattern-based approach alone. The decrease of this score is the price that has to be paid for the improvement of the precision.

The first part of this chapter describes the relevance of the individual settings (Section 5.2). For each of the characteristics discussed in Section 4.4.3, it has been investigated how they contribute to the classification of definitions. The use of more than one type of information is examined in Section 5.3, where different combinations of settings are compared to find out which one performs best. Section 5.4 presents the performance of classifiers in which PoS and MSI bigrams have been included in addition to the other characteristics.

5.2 INDIVIDUAL SETTINGS

Section 4.4.3 introduced six groups of text characteristics that can be relevant for the classification of definitions. More specifically, I examined whether the use of linguistic bigrams, connector properties, linguistic properties of definiens and definiendum, position properties, layout properties, and keyword properties is different in definitions and non-definitions. The first part of this section focuses on the sub settings within these six groups and investigates how relevant they are. It then proceeds with a description of the performance of the complete settings to determine for each definition type which kind of information is most relevant for the classification of definitions. The section ends by providing an overall type-based ranking of the relevance of the different settings.

5.2.1 SUB SETTINGS WITHIN INDIVIDUAL SETTINGS

5.2.1.1 n-gram properties

The first feature types are the Part of Speech (PoS) bigrams and the morpho-syntactic (MSI) bigrams. Since the parameter tuning experiments from Section 4.4.7 revealed that including more than 40 PoS bigrams or 400 MSI bigrams does not improve the performance, these values have been used in all experiments in which bigrams are included.

Table 5.1 shows the results for the two types of bigrams together with the results of the pattern-based approach. The precision scores obtained with the PoS or MSI bigrams are considerably higher than the precision scores from the grammars alone. For the punctuation and pronoun types, the precision with the bigrams is (almost) two times higher than the grammar precision. The situation is worse for the verb definitions, where the increase of the precision does not compensate for the decrease of the recall. As a consequence, the F-scores drop for this type when bigrams are used.

In some of the data sets (e.g. all or pronoun), there is only a minor difference between the results obtained with only the PoS bigrams or the combination of PoS and MSI bigrams. Although the MSI bigrams provide more detailed information on the categories of words than the PoS bigrams, this does not always lead to a better performance in these experiments.

  is definitions               AUC    R      P      F
    grammar                    –      0.86   0.35   0.50
    PoS bigrams                0.70   0.57   0.49   0.52
    MSI bigrams                0.80   0.59   0.57   0.58

  verb definitions             AUC    R      P      F
    grammar                    –      0.78   0.44   0.56
    PoS bigrams                0.64   0.48   0.52   0.50
    MSI bigrams                0.66   0.46   0.57   0.51

  punctuation definitions      AUC    R      P      F
    grammar                    –      0.88   0.07   0.12
    PoS bigrams                0.77   0.63   0.15   0.24
    MSI bigrams                0.79   0.56   0.18   0.28

  pronoun definitions          AUC    R      P      F
    grammar                    –      0.75   0.09   0.16
    PoS bigrams                0.69   0.42   0.16   0.24
    MSI bigrams                0.66   0.34   0.14   0.20

  all definitions              AUC    R      P      F
    grammar                    –      0.82   0.16   0.26
    PoS bigrams                0.81   0.60   0.35   0.44
    MSI bigrams                0.82   0.56   0.38   0.45

Table 5.1 Results for the bigram features

5.2.1.2 Connector properties

Experiments have been conducted with combinations of three features related to the connector. The first feature examines the connector phrases that have been used in definitions and non-definitions. Section 4.4.3.2 showed that the definitions can be divided into two groups when it comes to the variety of such phrases. One group contains the is and punctuation definitions, in which respectively one or two connector phrases can be used, while the other group consists of the verb and pronoun definitions, in which the diversity is much higher. The other text characteristics related to the connector are two types of linguistic context information, namely the Part of Speech tags of the words directly to the left and right of the connector phrase (second feature), and the morpho-syntactic information of the words to the left and right (third feature).

Connector phrase Table 5.2 shows that for is definitions the results of the classifier are exactly the same as the results obtained with the grammar and that the AUC score is quite low. This can be explained by the fact that the same connector has been used in all sentences, namely the verb ‘to be’.

  is definitions                 AUC    R      P      F
    baselines
      grammar                    –      0.86   0.35   0.50
      PoS bigrams                0.70   0.57   0.49   0.52
      MSI bigrams                0.80   0.59   0.57   0.58
    connector
      connector phrase           0.50   0.86   0.35   0.50
      + PoS left and right       0.61   0.73   0.44   0.54
      + MSI left and right       0.71   0.66   0.50   0.57
      all                        0.71   0.66   0.50   0.57

  verb definitions               AUC    R      P      F
    baselines
      grammar                    –      0.78   0.44   0.56
      PoS bigrams                0.64   0.48   0.52   0.50
      MSI bigrams                0.66   0.46   0.57   0.51
    connector
      connector phrase           0.70   0.51   0.59   0.55
      + PoS left and right       0.69   0.51   0.57   0.54
      + MSI left and right       0.72   0.52   0.60   0.56
      all                        0.72   0.54   0.61   0.57

  punctuation definitions        AUC    R      P      F
    baselines
      grammar                    –      0.88   0.07   0.12
      PoS bigrams                0.77   0.63   0.15   0.24
      MSI bigrams                0.79   0.56   0.18   0.28
    connector
      connector phrase           0.60   0.60   0.10   0.17
      + PoS left and right       0.70   0.61   0.12   0.20
      + MSI left and right       0.74   0.60   0.14   0.22
      all                        0.74   0.59   0.13   0.22

  pronoun definitions            AUC    R      P      F
    baselines
      grammar                    –      0.75   0.09   0.16
      PoS bigrams                0.69   0.42   0.16   0.24
      MSI bigrams                0.66   0.34   0.14   0.20
    connector
      connector phrase           0.76   0.58   0.23   0.33
      + PoS left and right       0.84   0.62   0.23   0.34
      + MSI left and right       0.82   0.60   0.25   0.35
      all                        0.83   0.60   0.24   0.35

  all definitions                AUC    R      P      F
    baselines
      grammar                    –      0.82   0.16   0.26
      PoS bigrams                0.81   0.60   0.35   0.44
      MSI bigrams                0.82   0.56   0.38   0.45
    connector
      connector phrase           0.79   0.58   0.37   0.45
      + PoS left and right       0.82   0.58   0.38   0.46
      + MSI left and right       0.84   0.62   0.37   0.46
      all                        0.84   0.62   0.37   0.46

Table 5.2 Results for the connector features

All sentences have therefore been classified as definitions. In the punctuation definitions, in which two different connectors have been distinguished, using the connector phrase as a feature improves the precision scores of the pattern-based approach.

The biggest improvements of the precision are clearly obtained for the verb and pronoun definitions, for which the classifiers outperform the precision and AUC scores of the classifiers that have been trained on PoS or MSI bigrams. However, for the verb definitions the F-score is still below the F-score of the grammar. The complete data set contains a large variety of connector phrases as well, and as a consequence, the precision exceeds the grammar and bigram results.

Connector phrase + context In the second and third settings related to the connector, contextual information has been included in addition to the connector phrase. The second setting contains the connector phrase, the PoS tag of the word to its left, and the PoS tag of the word to its right. The context information improves the precision for the is and punctuation data sets compared to the setting that only includes the connector phrase. For the other data sets, the precision scores with and without the context information are similar. The overall accuracy as indicated by the AUC improves for all data sets except the verb data set. In the third setting, the morpho-syntactic information of the left and right context has been used instead of the PoS tags. This results in better precision scores for each of the four definition types. Only on the combined data set is the precision similar with PoS tags and MSI information.

All connector information In the second and third settings, the use of PoS tags and morpho-syntactic information has been examined separately, while in the last setting both are included. In this case, the results are similar to the results obtained with the combination of connector phrase and MSI information. Apparently, the detailed morpho-syntactic information makes the more general categories superfluous for the classification. Only for the verb definitions is there a slight improvement when both are used.

5.2.1.3 Linguistic properties of definiendum and definiens

The linguistic properties of definiendum and definiens have been considered individually and together. Section 4.4.3.3 showed that the linguistic characteristics of definitions and non-definitions are especially different within the is data set. I therefore expect that this information will be most relevant for the classification of is definitions. For the definiendum, the articles, adjectives and nouns have been examined. In the definiens, the usefulness of the characteristics of the first noun phrase and the type of relative clause (if present) has been investigated.

Table 5.3 summarizes the results for the linguistic settings. The characteristics of the definiendum and the definiens have been examined separately. In addition, they have been combined in one linguistic setting. The results for the sub settings within the definiendum (article, adjective, and noun) and definiens (article, adjective, and relative clause) settings can be found in Appendix G (Tables G.1 and G.2).

Linguistic properties of the definiendum Individually, the noun is in all data sets the most relevant characteristic of the definiendum for the classification of definitions. However, even for this feature the AUC scores are below 0.70 for each of the definition types, which is lower than the bigram baselines. When the article, adjective and noun are combined, the scores improve, but for the individual definition types they are still low. The performance is considerably better on the aggregated data set that combines the sentences of the four definition types.

Linguistic properties of the definiens The AUC scores on the sub settings of the linguistic setting are low, especially for the verb and punctuation definitions (cf. Appendix G). The classifiers based on only this characteristic are therefore useless. In the is data set, the article and the type of relative clause are the best individual features. The combination of all linguistic characteristics related to the definiens performs clearly better, both with respect to the precision and the AUC score (cf. Table 5.3). For the is and pronoun data sets, the linguistic setting even outperforms the bigram settings. The linguistic characteristics of the definiens are thus very important for these two types of definitions.

  is definitions                 AUC    R      P      F
    baselines
      grammar                    –      0.86   0.35   0.50
      PoS bigrams                0.70   0.57   0.49   0.52
      MSI bigrams                0.80   0.59   0.57   0.58
    linguistic
      definiendum                0.73   0.67   0.51   0.58
      definiens                  0.80   0.62   0.62   0.62
      both                       0.84   0.67   0.62   0.64

  verb definitions               AUC    R      P      F
    baselines
      grammar                    –      0.78   0.44   0.56
      PoS bigrams                0.64   0.48   0.52   0.50
      MSI bigrams                0.66   0.46   0.57   0.51
    linguistic
      definiendum                0.61   0.45   0.58   0.51
      definiens                  0.55   0.45   0.48   0.47
      both                       0.63   0.44   0.54   0.49

  punctuation definitions        AUC    R      P      F
    baselines
      grammar                    –      0.88   0.07   0.12
      PoS bigrams                0.77   0.63   0.15   0.24
      MSI bigrams                0.79   0.56   0.18   0.28
    linguistic
      definiendum                0.63   0.46   0.12   0.19
      definiens                  0.64   0.44   0.16   0.23
      both                       0.72   0.58   0.12   0.20

  pronoun definitions            AUC    R      P      F
    baselines
      grammar                    –      0.75   0.09   0.16
      PoS bigrams                0.69   0.42   0.16   0.24
      MSI bigrams                0.66   0.34   0.14   0.20
    linguistic
      definiendum                0.63   0.48   0.17   0.25
      definiens                  0.71   0.48   0.17   0.25
      both                       0.68   0.45   0.14   0.21

  all definitions                AUC    R      P      F
    baselines
      grammar                    –      0.82   0.16   0.26
      PoS bigrams                0.81   0.60   0.35   0.44
      MSI bigrams                0.82   0.56   0.38   0.45
    linguistic
      definiendum                0.79   0.58   0.36   0.44
      definiens                  0.80   0.58   0.39   0.47
      both                       0.82   0.60   0.37   0.46

Table 5.3 Results for the linguistic features related to definiens and definiendum

Combining linguistic properties of definiendum and definiens The last row of Table 5.3 presents the results of the classification on the basis of the linguistic information from both the definiendum and the definiens. On the one hand, these classifiers achieve higher AUC scores on the is, verb, punctuation, and all data sets than the classifiers that are based on the definiendum or definiens setting alone. On the other hand, the precision scores decrease for all definition types except the is definitions. However, since the AUC score gives a measure of the reliability of the classifier, the classifiers with a higher AUC score are considered to be better.

5.2.1.4 Position properties

The position properties, which were presented in Section 4.4.3.4, have been split into two groups, of which the first is related to the position of the definition and the second deals with the position of the definiendum within the document. The position of definitions within the paragraph and the sentence is considered in the first group, whereas the other setting investigates the definiendum. More specifically, with respect to the position in the paragraph, the length of the paragraph and the absolute and relative position have been considered. For the position of the definiendum, the position of the term within the document is examined by investigating how often the term has been used before and after it is mentioned in the definition. In addition, the frequency of the term has been taken into account. Table 5.4 shows the results for the settings related to the position of definitions and definienda. The performance on the sub settings can be found in Appendix G (Tables G.3 and G.4).

Position of definition The classifiers based on the position of the definition in the paragraph perform poorly in terms of AUC scores, which are below 0.7 for all definition types. Only for the complete data set is the situation better, but even there it does not come close to 0.8. Comparison of the sub settings (Appendix G, Table G.3) on the absolute position, the relative position, and the length of the paragraph reveals that the absolute position is most relevant for the classification of is and punctuation definitions, whereas the relative position setting performs better for the verb and pronoun definitions.

  is definitions                    AUC    R      P      F
    baselines
      grammar                       –      0.86   0.35   0.50
      PoS bigrams                   0.70   0.57   0.49   0.52
      MSI bigrams                   0.80   0.59   0.57   0.58
    position
      definition: in paragraph      0.63   0.59   0.47   0.52
      definition: in sentence       0.55   0.83   0.38   0.52
      definition: both              0.67   0.65   0.49   0.56
      definiendum                   0.63   0.42   0.48   0.45
      all                           0.67   0.51   0.46   0.48

  verb definitions                  AUC    R      P      F
    baselines
      grammar                       –      0.78   0.44   0.56
      PoS bigrams                   0.64   0.48   0.52   0.50
      MSI bigrams                   0.66   0.46   0.57   0.51
    position
      definition: in paragraph      0.53   0.40   0.51   0.45
      definition: in sentence       0.55   0.73   0.48   0.57
      definition: both              0.59   0.42   0.54   0.48
      definiendum                   0.55   0.30   0.47   0.36
      all                           0.62   0.47   0.53   0.50

  punctuation definitions           AUC    R      P      F
    baselines
      grammar                       –      0.88   0.07   0.12
      PoS bigrams                   0.77   0.63   0.15   0.24
      MSI bigrams                   0.79   0.56   0.18   0.28
    position
      definition: in paragraph      0.59   0.63   0.09   0.15
      definition: in sentence       0.61   0.58   0.09   0.15
      definition: both              0.70   0.57   0.12   0.20
      definiendum                   0.56   0.67   0.08   0.15
      all                           0.73   0.56   0.14   0.23

  pronoun definitions               AUC    R      P      F
    baselines
      grammar                       –      0.75   0.09   0.16
      PoS bigrams                   0.69   0.42   0.16   0.24
      MSI bigrams                   0.66   0.34   0.14   0.20
    position
      definition: in paragraph      0.58   0.38   0.15   0.22
      definition: in sentence       0.68   0.54   0.15   0.24
      definition: both              0.68   0.45   0.14   0.22
      definiendum                   0.67   0.54   0.16   0.25
      all                           0.70   0.47   0.18   0.27

  all definitions                   AUC    R      P      F
    baselines
      grammar                       –      0.82   0.16   0.26
      PoS bigrams                   0.81   0.60   0.35   0.44
      MSI bigrams                   0.82   0.56   0.38   0.45
    position
      definition: in paragraph      0.75   0.52   0.36   0.42
      definition: in sentence       0.78   0.61   0.34   0.44
      definition: both              0.78   0.58   0.33   0.42
      definiendum                   0.77   0.59   0.33   0.42
      all                           0.80   0.59   0.34   0.43

Table 5.4 Results for the features related to the position of definition and definiendum

The start position of the definition within the sentence is especially relevant for the punctuation and pronoun sentences, where it is more important than the position in the paragraph. This is the case in the integrated data set as well, although here only the AUC score is higher, whereas the precision is slightly worse.

The results on the is and punctuation data sets improve when all information on the position of the definition is combined. For the verb definitions, the precision and AUC increase, whereas the recall and F-score decrease. On the pronoun data set, the sentence position alone performs better than the setting in which the paragraph position has been added.

Position of definiendum Although our investigations showed that there are significant differences with respect to the position of the definiendum (Section 4.4.3.4), this finding is not reflected in the experiments. Compared to the other position settings, this is the least relevant type of position information. Only for the pronoun definitions is the information as relevant as the sentence position, whereas for the other data sets this setting performs worst. The AUC scores are well below 0.7 for each of the definition types. Within the sub settings related to the definiendum position (cf. Appendix G, Table G.4), the absolute, relative and balanced position settings perform similarly in terms of precision and F-scores.

All position information The combination of all position information generally gives the best results of the different position settings in terms of AUC scores. However, when looking at the F-scores, both the is and the verb definitions get better results when the position in the sentence is used. The information on the position of the definiendum, which is on its own not very relevant, clearly contributes to the improved performance for the classification of the punctuation and pronoun definitions.

The position setting improves the grammar results for the punctuation, pronoun and aggregated data sets. However, this is not the case for the is and verb definitions. When comparing the results to the bigram classifiers, the situation is even worse, since for four of the five data sets the classifiers trained on position settings cannot outperform these classifiers. Only on the pronoun data set is the performance considerably better than the results obtained with the bigram settings.

  is definitions                 AUC    R      P      F
    baselines
      grammar                    –      0.86   0.35   0.50
      PoS bigrams                0.70   0.57   0.49   0.52
      MSI bigrams                0.80   0.59   0.57   0.58
    layout
      definiendum                0.52   0.15   0.57   0.24
      paragraph                  0.50   0.83   0.36   0.50
      both                       0.52   0.17   0.51   0.25

  verb definitions               AUC    R      P      F
    baselines
      grammar                    –      0.78   0.44   0.56
      PoS bigrams                0.64   0.48   0.52   0.50
      MSI bigrams                0.66   0.46   0.57   0.51
    layout
      definiendum                0.54   0.13   0.69   0.23
      paragraph                  0.50   0.70   0.45   0.55
      both                       0.52   0.13   0.57   0.21

  punctuation definitions        AUC    R      P      F
    baselines
      grammar                    –      0.88   0.07   0.12
      PoS bigrams                0.77   0.63   0.15   0.24
      MSI bigrams                0.79   0.56   0.18   0.28
    layout
      definiendum                0.55   0.25   0.13   0.18
      paragraph                  0.59   0.30   0.17   0.22
      both                       0.64   0.44   0.16   0.23

  pronoun definitions            AUC    R      P      F
    baselines
      grammar                    –      0.75   0.09   0.16
      PoS bigrams                0.69   0.42   0.16   0.24
      MSI bigrams                0.66   0.34   0.14   0.20
    layout
      definiendum                0.48   0.05   0.11   0.07
      paragraph                  0.43   0.17   0.07   0.10
      both                       0.42   0.18   0.07   0.10

  all definitions                AUC    R      P      F
    baselines
      grammar                    –      0.82   0.16   0.26
      PoS bigrams                0.81   0.60   0.35   0.44
      MSI bigrams                0.82   0.56   0.38   0.45
    layout
      definiendum                0.75   0.53   0.37   0.44
      paragraph                  0.76   0.55   0.36   0.44
      both                       0.77   0.54   0.37   0.44

Table 5.5 Results for the features related to the layout of definition and definiendum

5.2.1.5 Layout properties

In the layout setting, the type of paragraph and the layout of the definiendum are considered. The results obtained with the individual sub settings of the layout setting are not acceptable. Most AUC scores (Table 5.5) are below 0.6, which means that these classifiers are not able to discriminate between definitions and non-definitions. The added value of this information appears only when it is combined with other properties. It is clear that the bigram settings and the grammars outperform the layout settings for all data sets.

5.2.1.6 Keyword properties

For the keyword properties, the three types of keyword scores (RIDF, TFIDF and ADRIDF) and the combination of these three scores have been used as features. The results from Table 5.6 reveal that the keyword scores alone do not provide enough information to enable a good discrimination between definitions and non-definitions. There are some AUC scores above 0.6, but this still indicates that only a poor discrimination is possible. For the complete data set, the AUC scores are better, but the precision scores are low here as well. The bigram settings outperform the keyword settings for all data sets.

5.2.2 COMPARING THE INDIVIDUAL SETTINGS

I showed in the previous section that the different features of definitions are not all equally relevant. This section investigates for each of the definition types which information is most important for the classification of definitions. To this end, five settings are compared in which all the characteristics of a certain type are combined. For example, for the connector setting this means that the setting includes the connector phrase, and the PoS tags and morpho-syntactic information of the words to its left and right.

  is definitions                 AUC    R      P      F
    baselines
      grammar                    –      0.86   0.35   0.50
      PoS bigrams                0.70   0.57   0.49   0.52
      MSI bigrams                0.80   0.59   0.57   0.58
    keyword
      ridf                       0.57   0.41   0.42   0.42
      tfidf                      0.67   0.52   0.48   0.50
      adridf                     0.62   0.46   0.44   0.45
      all                        0.64   0.46   0.46   0.46

  verb definitions               AUC    R      P      F
    baselines
      grammar                    –      0.78   0.44   0.56
      PoS bigrams                0.64   0.48   0.52   0.50
      MSI bigrams                0.66   0.46   0.57   0.51
    keyword
      ridf                       0.53   0.31   0.50   0.38
      tfidf                      0.53   0.32   0.52   0.40
      adridf                     0.48   0.29   0.46   0.36
      all                        0.55   0.33   0.51   0.40

  punctuation definitions        AUC    R      P      F
    baselines
      grammar                    –      0.88   0.07   0.12
      PoS bigrams                0.77   0.63   0.15   0.24
      MSI bigrams                0.79   0.56   0.18   0.28
    keyword
      ridf                       0.60   0.48   0.09   0.15
      tfidf                      0.57   0.46   0.08   0.13
      adridf                     0.61   0.50   0.09   0.15
      all                        0.62   0.46   0.09   0.16

  pronoun definitions            AUC    R      P      F
    baselines
      grammar                    –      0.75   0.09   0.16
      PoS bigrams                0.69   0.42   0.16   0.24
      MSI bigrams                0.66   0.34   0.14   0.20
    keyword
      ridf                       0.56   0.55   0.12   0.19
      tfidf                      0.60   0.59   0.12   0.20
      adridf                     0.56   0.56   0.12   0.20
      all                        0.59   0.56   0.12   0.20

  all definitions                AUC    R      P      F
    baselines
      grammar                    –      0.82   0.16   0.26
      PoS bigrams                0.81   0.60   0.35   0.44
      MSI bigrams                0.82   0.56   0.38   0.45
    keyword
      ridf                       0.76   0.55   0.32   0.40
      tfidf                      0.76   0.54   0.32   0.40
      adridf                     0.75   0.53   0.31   0.39
      all                        0.78   0.56   0.35   0.43

Table 5.6 Results for the features based on the keyword scores of the definiendum

The five settings are the connector, linguistic, position, layout and keyword settings. The bigram settings and the results of the grammars are used as baselines against which the five settings are compared.

                               AUC    R      P      F
  baselines
    grammar                    –      0.86   0.35   0.50
    PoS bigrams                0.70   0.57   0.49   0.52
    MSI bigrams                0.80   0.59   0.57   0.58
  settings
    connector                  0.71   0.66   0.50   0.57
    linguistic                 0.84   0.67   0.62   0.64
    position                   0.67   0.51   0.46   0.48
    layout                     0.52   0.17   0.51   0.25
    keywords                   0.64   0.46   0.46   0.46

Table 5.7 Results for the is definitions with the individual settings

5.2.2.1 Is definitions

By far the best individual setting in Table 5.7 is the linguistic setting, which improves on the precision obtained with the grammar and also outperforms the PoS and MSI bigram settings. Closer inspection of the linguistic sub settings (cf. Appendix G, Tables G.1 and G.2) reveals that especially the articles used in definiendum and definiens, and the type of relative clause, contribute to the good performance of this setting. In general, the information on the definiens, including the information on the presence of a relative clause, provides especially relevant information for the classification of is definitions. The second best setting is the connector setting. The left and right context, represented by the morpho-syntactic information of the words directly to the left and right of the connector, is quite relevant here. This setting outperforms the PoS bigram setting, but performs worse than the MSI bigram setting. The three remaining settings improve on the precision of the grammar, but perform considerably below both bigram baselines. Especially the layout information seems to be a poor indicator for the classification of sentences, and the AUC score obtained with this setting shows that the classifier trained on this information cannot successfully differentiate between definitions and non-definitions.

The aim of the machine learning component is to improve on the precision scores obtained with the grammar. With the linguistic setting, an improvement of 0.27 (77.1%) is obtained for this score. The price for this improvement is a decrease of the recall score by 22.1%, but there is an overall gain, since the F-score improves by 28%. The connector setting improves on the F-score obtained with the grammar (14%) as well, whereas the other three settings reduce the F-score of the grammar.
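For concreteness, these relative changes follow directly from the grammar and linguistic rows of Table 5.7:

\[
\Delta P = \frac{0.62 - 0.35}{0.35} \approx +77.1\%, \qquad
\Delta R = \frac{0.67 - 0.86}{0.86} \approx -22.1\%, \qquad
\Delta F = \frac{0.64 - 0.50}{0.50} = +28\%
\]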

                               AUC    R      P      F
  baselines
    grammar                    –      0.78   0.44   0.56
    PoS bigrams                0.64   0.48   0.52   0.50
    MSI bigrams                0.66   0.46   0.57   0.51
  settings
    connector                  0.72   0.54   0.61   0.57
    linguistic                 0.63   0.44   0.54   0.49
    position                   0.62   0.47   0.53   0.50
    layout                     0.52   0.13   0.57   0.21
    keywords                   0.55   0.33   0.51   0.40

Table 5.8 Results for the verb definitions with the individual settings

5.2.2.2 Verb definitions

Table 5.8 shows that the connector setting is the best performing individual setting for the verb definitions. More specifically, it is the connector phrase that contributes most to a good classification. The connector setting clearly outperforms the PoS and MSI bigram baselines, both with respect to AUC and precision. Whereas the linguistic setting performed very well on the is definitions, this is clearly not the case for the verb definitions. The classifier trained on this information performs similarly to the classifier trained on the PoS bigram setting, but below the MSI bigram baseline. For the position setting, only the combination of all position information leads to an AUC higher than 0.6. This still indicates a poor classification, but it is similar to the PoS bigram baseline and only slightly below the MSI bigram baseline. The AUC scores are quite low for the keyword and layout settings as well, where the score is even lower than 0.6. Such low scores indicate that the classifier does not discriminate at all. This means that training a classifier on these settings does not make sense for the verb definitions, since it does not improve the result obtained with the grammar.

The improvement of the precision compared to the grammar is 0.17 when the connector setting is used, which means an improvement of 38.6%. However, the price for this improvement is a decrease of the recall score by 29.9%, and the F-score does not change. For all other individual settings, the F-score decreases compared to the score obtained with the grammar. This means that the use of machine learning techniques does not improve the overall classification of definitions when these settings are used; the only effect is that the precision improves at the expense of the recall.

                               AUC    R      P      F
  baselines
    grammar                    –      0.88   0.07   0.12
    PoS bigrams                0.77   0.63   0.15   0.24
    MSI bigrams                0.79   0.56   0.18   0.28
  settings
    connector                  0.74   0.59   0.13   0.22
    linguistic                 0.72   0.58   0.12   0.20
    position                   0.73   0.56   0.14   0.23
    layout                     0.64   0.44   0.16   0.23
    keywords                   0.62   0.46   0.09   0.16

Table 5.9 Results for the punctuation definitions with the individual settings

5.2.2.3 Punctuation definitions

Only three of the classifiers produce acceptable results when looking at the AUC scores (cf. Table 5.9). These are the position setting, the connector setting and the linguistic setting. The other two classifiers (keyword and layout) have an AUC score below 0.7. With respect to the precision, the position setting is the best individual setting for the punctuation definitions. Although it clearly improves on the F-score obtained with the grammar, it nevertheless performs below the PoS and MSI bigram baselines. The difference with the PoS bigram baseline mainly lies in the recall score, whereas for the MSI bigram baseline the precision is higher and the recall is similar. Within the position setting, especially the position of the definition within the sentence seems to be important. The improvement of the precision and F-score compared to the grammar results is 100% and 91.7% respectively. The second best setting is the connector setting. The left and right context are especially relevant here. This setting performs below the PoS and MSI bigram baselines as well, especially with respect to precision and F-score. The linguistic setting performs similarly to the position and connector settings. The layout and keyword classifiers produce the worst results.

                               AUC    R      P      F
  baselines
    grammar                    –      0.75   0.09   0.16
    PoS bigrams                0.69   0.42   0.16   0.24
    MSI bigrams                0.66   0.34   0.14   0.20
  settings
    connector                  0.83   0.60   0.24   0.35
    linguistic                 0.68   0.45   0.14   0.21
    position                   0.70   0.47   0.18   0.27
    layout                     0.42   0.18   0.07   0.10
    keywords                   0.59   0.56   0.12   0.20

Table 5.10 Results for the pronoun definitions with the individual settings

5.2.2.4 Pronoun definitions

Table 5.10 shows that the connector setting is by far the best setting for the pronoun definitions. Whereas the connector phrase itself already provides quite useful information, the performance improves even further when the left and right linguistic context information is added. The precision improves by 166.7% compared to the precision obtained with the grammar, and the F-score improves by 118.8%. The connector setting performs much better than the bigram baselines. The second best results are achieved with the position setting. Especially the start position of the definition within the sentence provides valuable information here. The classifier trained on the position setting as a whole outperforms the bigram baselines. The linguistic setting results are below the PoS bigram baseline but similar to the baseline in which morpho-syntactic information has been included. For the remaining two settings, the AUC scores are below 0.6. For the keyword setting, the F-score of the definition class improves compared to the grammar, but the results are below both bigram baselines. The performance is worst for the layout setting, which achieves an F-score of only 0.1 and thus decreases by 37.5% compared to the grammar result.

                               AUC    R      P      F
  baselines
    grammar                    –      0.82   0.16   0.26
    PoS bigrams                0.81   0.60   0.35   0.44
    MSI bigrams                0.82   0.56   0.38   0.45
  settings
    connector                  0.84   0.62   0.37   0.46
    linguistic                 0.82   0.60   0.37   0.46
    position                   0.80   0.59   0.34   0.43
    layout                     0.78   0.56   0.35   0.43
    keywords                   0.77   0.54   0.37   0.44

Table 5.11 Results for all definitions with the individual settings

5.2.2.5 All definitions

The obvious advantage of combining the different definition types into one data set is that there is a larger amount of data on which the classifiers can be trained, which generally results in a better classifier. The results obtained on this combined data set are shown in Table 5.11. Regardless of the setting that is used, the classifiers produce acceptable or even excellent results according to the AUC scores. For none of the four definition types is this the case when they are used separately.

The best individual setting is the connector setting. The combination of the connector phrase and its surrounding context – represented by PoS tags or morpho-syntactic information of the words – provides the best results. The classifier based on this information outperforms both the PoS bigram and MSI bigram baselines. The improvement over the precision of the grammar is 131.3%, and the F-score improves by 76.9%, which means that the loss in recall is considerably smaller than the gain in precision.

The second best individual setting is the linguistic setting, which performs almost as well as the connector setting. Compared to the bigram baselines, the results are slightly better than both of them. The position information is quite relevant for the classification as well. Although the scores for this setting are slightly lower than the results obtained with the connector or linguistic settings, they are still similar to the PoS and MSI bigram baselines.

The remaining two settings perform below the PoS and MSI bigram baselines when looking at their recall scores. However, whereas the keyword scores do not provide good results for the four definition types individually, the situation changes considerably when they are used for the aggregated data set. Although this setting is still worse than the connector, linguistic and position settings, the difference is much smaller now. The layout setting, which performed worst for each of the four definition types as well, ends at the last position. The situation with this setting is very similar to the keyword setting: here too, the performance of the classifier is better when it is trained on the combined data set instead of the four type-specific data sets.

5.2.3 RANKING THE INDIVIDUAL SETTINGS

The overview for each of the definition types shows that there is a clear distinction between the relevance of the five settings. To investigate which of the settings can be considered the best, they have been ranked for each definition type. For the ranking, the scores have been converted to ordinal numbers, assigning each setting a rank from best (1) to worst (5). This means that minor and major differences between two settings are treated in the same way.

               is    verb   punct   pron   all    rank
  connector    2     1      1       1      1      1
  linguistic   1     2      3       3      2      2
  position     3     3      2       2      3      3
  keywords     4     4      5       4      5      4
  layout       5     5      4       5      4      5

Table 5.12 Ranking the individual settings on the basis of the AUC scores

Table 5.12 shows the rankings for each of the settings. The settings are ordered on the basis of the overall rank (last column). The rankings reveal that the connector setting is generally the best individual setting. The linguistic and position settings are quite close to each other and are ranked overall as the second and third settings. The keyword setting ends at the fourth place, while the layout setting is at the last position.

5.3 COMBINED SETTINGS

The results presented so far focused on the use of one type of information. In this section, the individual settings are combined in different ways to find out whether the combination of settings can enhance the results. Since there are five individual settings, 26 combined settings are possible. More specifically, there are ten combinations of two settings, ten combinations of three settings, five combinations of four settings, and one combination of all settings. The complete tables with the results for all these 26 settings are presented in Appendix H. This section presents for each definition type the best combinations of two, three, four, and five settings.

5.3.1 Is DEFINITIONS

Table 5.13 shows the best combinations of settings for the is definitions. When combining two settings, the combination of the position setting and the linguistic setting gives the best results. This is the only combination that improves on the results obtained with the linguistic setting alone.

                               AUC    R      P      F
  baselines
    grammar                    –      0.86   0.35   0.50
    PoS bigrams                0.70   0.57   0.49   0.52
    MSI bigrams                0.80   0.59   0.57   0.58
  settings
    linguistic                 0.84   0.67   0.62   0.64
    ling+pos                   0.86   0.70   0.64   0.67
    conn+ling+pos              0.86   0.71   0.67   0.69
    conn+ling+pos+lay          0.87   0.70   0.66   0.68
    all                        0.87   0.72   0.65   0.68

Table 5.13 Best combinations of settings for the is definitions

Comparison of the combinations of two settings to the baselines (Appendix H) reveals that seven out of the ten settings outperform the PoS bigram baseline, while four perform better than the MSI bigram baseline as well. Each of these four settings includes the linguistic setting, which also proved to be the best individual setting. Compared to the grammar alone, all combinations of two settings perform either similarly or better with respect to the F-scores.

The best combination of three settings is the combination of the linguistic, position and connector settings. The addition of the connector setting especially improves the precision score compared to the combination of the linguistic and position settings. When three settings are combined, nine out of the ten settings achieve results above the PoS bigram baseline. Only the combined setting of position, layout and keyword information performs slightly below this baseline. The six settings that include the linguistic setting all outperform the MSI bigram baseline, whereas the remaining four settings perform below this baseline.

Adding a fourth setting to the combination of the linguistic, position and connector settings does not improve the results. The only combination of four settings that performs below the MSI bigram baseline is the setting in which the linguistic information is not included. The combination of all settings also does not improve on the results obtained with three settings.

The best combined setting for the is definitions is thus the combination of connector, linguistic, and position information. This combination improves on the precision and F-score of the grammar by 91.4% and 38% respectively.

                      AUC    R     P     F
baselines
  grammar                    0.78  0.44  0.56
  PoS bigrams         0.64   0.48  0.52  0.50
  MSI bigrams         0.66   0.46  0.57  0.51
settings
  connector           0.72   0.54  0.61  0.57
  conn+ling           0.74   0.54  0.63  0.58
  conn+ling+lay       0.74   0.54  0.64  0.59
  conn+ling+lay+kw    0.75   0.55  0.64  0.59
  all                 0.73   0.53  0.62  0.57

Table 5.14 Best combinations of settings for the verb definitions

5.3.2 Verb DEFINITIONS

Table 5.14 shows that the classifiers trained on combined settings do not improve considerably over the scores obtained with the connector setting alone. The best combination of two settings is the combination of connector and linguistic information, which performs only slightly better than the connector setting. In the other three combinations that include the connector setting (cf. Appendix H), the F-score is similar to the F-score of the grammar (0.56). For the settings without connector information, the gain in precision still does not compensate for the loss in recall, and as a consequence the F-scores are below the grammar F-score.

Including more than two settings barely improves the results obtained with two settings. On the contrary, for many of the combinations the results get slightly worse. The classifier trained on all settings has a lower AUC than the classifier trained on only the connector setting.

The best combined setting for the classification of verb definitions is the combination of connector, linguistic, and layout information. This combination improves on the precision and F-score of the grammar by 45.5% and 5.4% respectively.

                      AUC    R     P     F
baselines
  grammar                    0.88  0.07  0.12
  PoS bigrams         0.77   0.63  0.15  0.24
  MSI bigrams         0.79   0.56  0.18  0.28
settings
  connector           0.74   0.59  0.13  0.22
  ling+pos            0.81   0.61  0.17  0.26
  ling+pos+lay        0.82   0.58  0.18  0.27
  ling+pos+lay+kw     0.82   0.59  0.18  0.28
  all                 0.78   0.61  0.17  0.27

Table 5.15 Best combinations of settings for the punctuation definitions

5.3.3 Punctuation DEFINITIONS

Although the best individual results are obtained with the connector setting, the best combination of two settings does not include this setting. Table 5.15 reveals that the combination of the linguistic and position settings gives the best results. With this combination, the performance is better than both bigram baselines with respect to the AUC score, while the precision is slightly below the precision of the MSI bigrams baseline. A general observation of the settings in which two information types are combined is that the AUC scores are considerably higher than for the separate settings (cf. Appendix H). In this respect, especially the combination of the layout and keyword settings – which individually both have an AUC score far below 0.7 – proves to be successful.

The addition of the layout setting to the combination of linguistic and position settings again improves the results. However, this is only a minor improvement. The results obtained with these three settings do not improve when a fourth and fifth setting are added. When all settings are included, the results get even slightly worse.

The best combined setting for the punctuation types is thus the combination of position, linguistic, and layout information. This combination improves on the precision and F-score of the grammar by 142.9% and 116.7% respectively.

                        AUC    R     P     F
baselines
  grammar                      0.75  0.09  0.16
  PoS bigrams           0.69   0.42  0.16  0.24
  MSI bigrams           0.66   0.34  0.14  0.20
settings
  connector             0.83   0.60  0.24  0.35
  conn+kw               0.83   0.60  0.25  0.35
  conn+ling+lay         0.84   0.59  0.29  0.39
  conn+ling+lay+kw      0.84   0.59  0.29  0.39
  conn+ling+pos+lay+kw  0.84   0.60  0.29  0.39

Table 5.16 Best combinations of settings for the pronoun definitions

5.3.4 Pronoun DEFINITIONS

For the pronoun definitions, Table 5.16 shows that adding a second setting to the connector setting does not improve the results in terms of the AUC score. However, there is an improvement of the precision score to 0.27 when the linguistic setting is combined with the connector setting (cf. Appendix H). This combination is not mentioned as a best setting in Table 5.16, since the best combined settings here have been selected on the basis of the AUC scores.

When the linguistic, connector, and layout settings are combined, the precision score improves even further to 0.29. The other scores do not change. Combining more than three settings does not increase the performance any further. The results with four and five settings are almost the same as the results obtained with three settings.

The best performance is thus obtained with a combination of the connector, linguistic and layout settings. The classifier trained on this information enhances the results of the grammar considerably: the precision improves by 222.2% and the F-score by 143.8%.

5.3.5 ALL DEFINITIONS

The last data set is the aggregated data set containing all types of definitions. Table 5.17 reveals that the combination of the linguistic and position settings is the best combination of two classifiers, and especially improves the precision. The addition of the connector setting as a third setting results in an improved recall.

                      AUC    R     P     F
baselines
  grammar                    0.82  0.16  0.26
  PoS bigrams         0.81   0.60  0.35  0.44
  MSI bigrams         0.82   0.56  0.38  0.45
settings
  connector           0.84   0.62  0.37  0.46
  ling+pos            0.86   0.61  0.41  0.49
  conn+ling+pos       0.86   0.65  0.42  0.51
  conn+ling+pos+lay   0.87   0.64  0.43  0.51
  all                 0.87   0.64  0.44  0.52

Table 5.17 Best combinations of settings for the combined data set containing the four definition data sets

Combining four or five settings again leads to a slightly improved performance. When all settings are used, the precision improves by 175% compared to the precision obtained with the grammar. The F-score improves considerably as well, increasing by 100%.

5.4 ADDING THE BIGRAM SETTINGS

A general observation of the results presented so far is that the AUC scores are often disappointing, especially for the individual settings. The AUC scores are often higher in the bigram baseline settings. An aspect that has not been considered yet is the combination of bigrams and other text characteristics. The classifiers presented in Sections 5.2 and 5.3 are either based on only bigrams or on (combinations of) other text characteristics without bigrams.

Since the bigram baselines generally have a higher AUC score than the individual settings, experiments have been conducted to discover whether the addition of bigrams to the individual and combined settings improves the AUC and precision scores. The results with the bigrams are examined from two perspectives in this section. On the one hand, it has been investigated whether the addition of bigrams improves the results obtained with the five settings. The opposite has been done as well: the improvements when adding these settings to the bigrams have been examined.

5.4.1 ADDING BIGRAMS TO THE INDIVIDUAL SETTINGS

The AUC scores are especially low for the individual settings (Section 5.2). The settings presented in this section correspond to the ones discussed in Section 5.2.2. More specifically, the connector, linguistic, position, keyword and layout settings have been combined with PoS and MSI bigrams to investigate whether the classification results improve. The tables presented in this section only mention the AUC and precision scores, since the focus is on the improvement of AUC and precision. The complete tables, containing the recall and F-scores in addition to the AUC and precision, can be found in Appendix H.
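As an illustration of what such a combination amounts to in practice, the sketch below concatenates bigram counts over PoS tag sequences with a dense vector of setting features. The tag strings, the feature values, and the use of scikit-learn are assumptions of the sketch, not part of the original experimental setup:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    # Hypothetical input: each candidate sentence is represented by its
    # PoS tag sequence (a space-separated string) and a dense vector of
    # setting features; both the tags and the values are invented here.
    pos_sequences = ["Art N V Art Adj N Punc",
                     "N V Prep N Punc"]
    setting_features = np.array([[1.0, 0.3, 5.0],
                                 [0.0, 0.7, 2.0]])

    # Treating each tag as a token, ngram_range=(2, 2) counts tag bigrams.
    vectorizer = CountVectorizer(analyzer="word", ngram_range=(2, 2),
                                 token_pattern=r"\S+")
    bigram_counts = vectorizer.fit_transform(pos_sequences).toarray()

    # Concatenate bigram counts and setting features into one matrix.
    X = np.hstack([setting_features, bigram_counts])
    print(X.shape)  # (2, 3 + number of distinct tag bigrams)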

5.4.1.1 Is definitions

Table 5.18 reveals that the combination of the position or keyword setting with the bigrams results in an improvement of the results from both perspectives: adding bigrams improves the classifiers based on the position and keyword settings, while adding the settings improves the bigram classifiers. This indicates that the bigrams and the settings complement each other and there is a mutual benefit. Adding PoS tag bigrams to the connector or linguistic setting barely improves the results that have been obtained without the bigrams, whereas the results improve from the bigram perspective. For the layout setting, the combination with the PoS bigrams improves the classifier, but the bigram setting alone gives better results than the combination of the two.

               no bigrams   PoS bigrams   PoS+MSI bigrams
               AUC    P     AUC    P      AUC    P
  bigrams                   0.70   0.49   0.80   0.57
  connector    0.71   0.50  0.74   0.51   0.82   0.66
  linguistic   0.84   0.62  0.84   0.62   0.86   0.65
  position     0.67   0.46  0.81   0.56   0.83   0.63
  layout       0.52   0.51  0.68   0.47   0.80   0.58
  keywords     0.64   0.46  0.77   0.55   0.83   0.62

Table 5.18 Results for the individual settings (left) combined with PoS bigrams (middle) and PoS+MSI bigrams (right) on the is definitions

The situation is different for the MSI bigrams, where each of the settings that includes this information outperforms the corresponding setting without the bigrams. When the five settings are added to the MSI bigrams, there is an added value as well for four of the settings. The layout setting forms an exception, since here the results do not change.

5.4.1.2 Verb definitions

The results for the verb definitions are provided in Table 5.19. The performance of the classifiers improves when PoS tag bigrams are added to the position, layout, or keyword settings, whereas it remains the same for the connector and linguistic settings. From the other perspective, the addition of the connector, position, and layout settings to the PoS tag bigrams enhances the classifiers that are based on the bigrams alone. The biggest mutual benefit is observed when the position setting is used.

Again, the results are different when MSI bigrams are employed. All classifiers improve when MSI bigrams are added. The keyword setting is the only one that does not improve the results obtained with the MSI bigrams alone. Apparently, this information is not relevant for the classification of verb definitions.

               no bigrams   PoS bigrams   PoS+MSI bigrams
               AUC    P     AUC    P      AUC    P
  bigrams                   0.64   0.52   0.66   0.57
  connector    0.72   0.61  0.72   0.59   0.75   0.62
  linguistic   0.63   0.54  0.65   0.54   0.70   0.63
  position     0.62   0.53  0.71   0.59   0.74   0.64
  layout       0.52   0.57  0.68   0.58   0.72   0.60
  keyword      0.55   0.51  0.65   0.52   0.64   0.56

Table 5.19 Results for the individual settings (left) combined with PoS bigrams (middle) and PoS+MSI bigrams (right) on the verb definitions

5.4.1.3 Punctuation definitions

Table 5.20 shows that the classifiers in which PoS or MSI bigrams have been used in addition to the five settings in general improve the results for the punctuation definitions considerably, both with respect to the AUC and the precision scores.

               no bigrams   PoS bigrams   PoS+MSI bigrams
               AUC    P     AUC    P      AUC    P
  bigrams                   0.77   0.15   0.79   0.18
  connector    0.74   0.13  0.83   0.21   0.81   0.21
  linguistic   0.72   0.12  0.81   0.18   0.81   0.21
  position     0.73   0.14  0.83   0.19   0.82   0.21
  layout       0.64   0.16  0.83   0.19   0.87   0.27
  keyword      0.62   0.09  0.79   0.17   0.80   0.19

Table 5.20 Results for the individual settings (left) combined with PoS bigrams (middle) and PoS+MSI bigrams (right) on the punctuation definitions

This is not surprising, given the fact that the classification on the basis of bigrams alone clearly outperforms all individual settings. When the results are examined from the other perspective, the inclusion of keyword information in addition to the bigrams is least relevant for the classification. For the other four settings, there is a complementary gain, since the addition of these settings to the bigrams improves the results obtained with the bigrams alone, while the bigram information contributes to a better performance of the classifiers based on the individual settings.

The classifiers based on PoS and MSI bigrams perform similarly, although the precision scores are generally slightly higher when the MSI bigrams are used. The classifiers that are based on keyword information perform worst, both with and without bigrams.

5.4.1.4 Pronoun definitions

The addition of PoS or MSI bigrams improves the AUC scores of all individual settings, as shown in Table 5.21. The best individual setting was the connector setting, and when the PoS or MSI bigrams are added these results get even better. Also for the other settings, the addition of bigrams results in an enhanced performance of the classifiers. The classifier based on only PoS tag bigrams does not improve when the layout information is added, while the four other settings do improve the performance. The layout setting performed worst individually as well.

               no bigrams   PoS bigrams   PoS+MSI bigrams
               AUC    P     AUC    P      AUC    P
  bigrams                   0.69   0.16   0.66   0.14
  connector    0.83   0.24  0.85   0.26   0.81   0.27
  linguistic   0.68   0.14  0.74   0.19   0.74   0.20
  position     0.70   0.18  0.75   0.20   0.75   0.20
  layout       0.42   0.07  0.67   0.15   0.64   0.16
  keyword      0.59   0.12  0.71   0.18   0.67   0.15

Table 5.21 Results for the individual settings (left) combined with PoS bigrams (middle) and PoS+MSI bigrams (right) on the pronoun definitions

A reason for this may be the fact that the definiendum is often not included in the definition pattern, because it is in another sentence. The layout setting is partly based on the layout of the definiendum.

When the morpho-syntactic information is included in addition to the PoS tags, the AUC scores often decrease. This is also the case when only the bigrams are used. Apparently, the more general categories of the PoS tag bigrams are better suited for the classification of this type of definitions.

5.4.1.5 All definitions

The results on the aggregated data set are shown in Table 5.22. Generally, the addition of PoS and MSI bigrams to the individual settings improves the classifiers.

               no bigrams   PoS bigrams   PoS+MSI bigrams
               AUC    P     AUC    P      AUC    P
  bigrams                   0.81   0.35   0.82   0.38
  connector    0.84   0.37  0.87   0.44   0.87   0.45
  linguistic   0.82   0.37  0.86   0.41   0.85   0.43
  position     0.80   0.34  0.86   0.41   0.86   0.44
  layout       0.77   0.37  0.83   0.38   0.84   0.39
  keyword      0.78   0.35  0.83   0.38   0.83   0.39

Table 5.22 Results for the individual settings (left) combined with PoS bigrams (middle) and PoS+MSI bigrams (right) on the aggregated data set with all definitions

The biggest improvement is observed for the position setting, for which both the AUC and precision scores increase considerably when bigrams are added. Similar improvements are observed when the individual settings are added to the bigrams. All combinations outperform the purely bigram-based classifiers. For the aggregated data set, the combination of bigrams and individual settings is thus useful from both perspectives. The settings in which bigrams are included perform excellently with respect to the AUC scores, which are all above 0.8.

RANKING THE INDIVIDUAL SETTINGS

                  is   verb   punct   pron   all   rank
PoS bigrams
  connector       3    1      3       1      1     1
  position        2    2      2       2      2     2
  linguistic      1    4      4       3      2     3
  layout          5    3      1       5      4     4
  keywords        4    5      5       4      5     5
PoS+MSI bigrams
  connector       3    1      1       1      1     1
  position        2    2      2       2      2     2
  linguistic      1    4      4       3      3     3
  layout          5    3      3       5      4     4
  keywords        4    5      5       4      5     5

Table 5.23 Ranking the individual settings combined with bigrams on the basis of the AUC scores

For some of the settings there is overlap between the bigram information and the information contained in the five settings. This is especially the case in the linguistic setting, which is heavily based on PoS and MSI information of the definiendum and definiens. It might be the case that the combination with the bigrams improves more when there is no overlap between the bigrams and the individual settings. To investigate whether this is the case, the rankings with and without the bigrams are compared.

In Section 5.4.1.5, it has been examined which setting is most relevant for the classification of definitions. The ranking of the settings in which the bigrams are included (Table 5.23) is similar to the ranking of the settings without the bigrams, with one important difference: the linguistic setting dropped to third place whereas the position setting climbed to second place. Another conclusion that can be drawn on the basis of the comparison of the results with and without the bigrams is that the differences between the five settings become smaller when bigrams are included (cf. Tables 5.18–5.22 and Appendix H).

5.4.2 ADDING BIGRAMS TO THE COMBINED SETTINGS

In the last group of experiments, the bigram settings are added to the 26 combinations of individual settings. These experiments correspond to the ones presented in Section 5.3. The tables show the results for the best combinations of settings without using the bigram settings, i.e. the combinations that were provided in Tables 5.13–5.17. The complete tables including all combined settings can be found in Appendix H.

5.4.2.1 Is definitions

Table 5.24 shows that the addition of the PoS and morpho-syntactic bigrams to the best combinations of settings does not improve the results to the same extent as it does for the individual settings. This is partly due to the fact that I report the results for the best combinations of settings without bigrams in this table. The improvements are bigger for the classifiers that achieved a lower performance without the bigrams (cf. Table H.1). The results obtained with the classifiers trained on one of the bigram settings in combination with two or more individual settings are all quite similar. The best AUC score that can be obtained is 0.90, while the highest precision score is 0.71.

5.4.2.2 Verb definitions

For the verb definitions, Table 5.25 reveals that it makes no difference whether the PoS or the PoS+MSI bigram setting is added. Just as for the is definitions, the classifiers perform similarly to each other. They perform better than the settings without the bigrams, both with respect to the AUC score and with respect to the precision. The highest AUC score is 0.78, whereas the best precision score is 0.68.

                      no bigrams   PoS bigrams   PoS+MSI bigrams
                      AUC    P     AUC    P      AUC    P
baselines
  bigrams                          0.70   0.49   0.80   0.57
  linguistic          0.84   0.62  0.84   0.62   0.86   0.65
combinations
  ling+pos            0.86   0.64  0.89   0.68   0.88   0.69
  conn+ling+pos       0.86   0.67  0.87   0.68   0.87   0.67
  conn+ling+pos+lay   0.87   0.66  0.88   0.68   0.89   0.71
  all                 0.87   0.65  0.89   0.69   0.89   0.69
best results
  settings            conn+ling+pos+lay  all    ling+pos+lay
  results             0.87   0.66  0.89   0.69   0.90   0.67

Table 5.24 Is definitions: results for the combined settings (left) with PoS bigrams (middle) and PoS+MSI bigrams (right)

Given the fact that the computational load is considerably higher for the PoS+MSI settings than for the PoS settings, it is remarkable that these additional computations do not lead to an improved performance.

5.4.2.3 Punctuation definitions

For the punctuation definitions, the differences between the settings with and without the bigrams are bigger than for the is and verb definitions (Table 5.26). The use of bigrams means a considerable improvement for this setting. There is a clear difference between the use of PoS bigrams and PoS+MSI bigrams: the results improve gradually from no bigrams to PoS bigrams and finally to PoS+MSI bigrams. The best result is obtained when the PoS+MSI setting is combined with the linguistic, position and layout settings. The precision score is 0.29 for this classifier, which is 314.3% higher than the result obtained with the grammar.

5.4.2.4 Pronoun definitions

The situation with the pronoun definitions is different from all other data sets (Table 5.27). The PoS+MSI bigram settings perform worse than the PoS bigram settings in all classifiers for this type of definitions. Even the results of the classifiers without the bigram setting are either better or similar. For the PoS bigrams, the results are slightly better compared to the classifiers without bigrams, but the differences are only marginal.

                      no bigrams   PoS bigrams   PoS+MSI bigrams
                      AUC    P     AUC    P      AUC    P
baselines
  bigrams                          0.64   0.52   0.66   0.57
  connector           0.72   0.61  0.72   0.59   0.75   0.62
combinations
  conn+ling           0.74   0.63  0.75   0.65   0.75   0.64
  conn+ling+lay       0.74   0.64  0.75   0.65   0.75   0.66
  conn+ling+lay+kw    0.75   0.64  0.77   0.66   0.77   0.67
  all                 0.73   0.62  0.77   0.66   0.77   0.66
best results
  settings            conn+ling+lay+kw  conn+ling+pos  conn+pos+lay+kw
  results             0.75   0.64  0.77   0.68   0.78   0.68

Table 5.25 Verb definitions: results for the combined settings (left) with PoS bigrams (middle) and PoS+MSI bigrams (right)

                      no bigrams   PoS bigrams   PoS+MSI bigrams
                      AUC    P     AUC    P      AUC    P
baselines
  bigrams                          0.77   0.15   0.79   0.18
  connector           0.74   0.13  0.83   0.21   0.81   0.21
combinations
  ling+pos            0.81   0.17  0.85   0.21   0.84   0.26
  ling+pos+lay        0.82   0.18  0.87   0.23   0.88   0.29
  ling+pos+lay+kw     0.82   0.18  0.86   0.24   0.86   0.28
  all                 0.78   0.17  0.83   0.23   0.86   0.30
best results
  settings            ling+pos+lay+kw  ling+pos+lay   ling+pos+lay
  results             0.82   0.18  0.87   0.23   0.88   0.29

Table 5.26 Punctuation definitions: results for the combined settings (left) with PoS bigrams (middle) and PoS+MSI bigrams (right)

                      no bigrams   PoS bigrams   PoS+MSI bigrams
                      AUC    P     AUC    P      AUC    P
baselines
  bigrams                          0.69   0.16   0.66   0.14
  connector           0.83   0.24  0.85   0.26   0.81   0.27
combinations
  conn+lay            0.83   0.25  0.86   0.28   0.82   0.25
  conn+ling+lay       0.84   0.29  0.85   0.29   0.81   0.25
  conn+ling+lay+kw    0.84   0.29  0.85   0.30   0.82   0.28
  all                 0.84   0.29  0.84   0.29   0.79   0.25
best results
  settings            all          conn+lay       conn+ling+lay+kw
  results             0.84   0.29  0.86   0.28   0.82   0.28

Table 5.27 Pronoun definitions: results for the combined settings (left) with PoS bigrams (middle) and PoS+MSI bigrams (right)

5.4.2.5 All definitions

Table 5.28 shows that on the aggregated data set the bigram settings outperform the settings in which no bigrams are used. There is a slight difference between the use of PoS and PoS+MSI bigrams in terms of precision, whereas the AUC scores are usually exactly the same. The best results are obtained when all features are used, including the MSI bigrams.

5.5 CONCLUSIONS

In this chapter, I presented the results obtained with the sequential combination of a pattern-based approach and machine learning techniques. The features that have been used in the experiments can be divided into six groups. The first group consists of the linguistic bigrams (PoS tags, or PoS tags combined with morpho-syntactic information). The two bigram settings have been used as baselines; a third baseline is constituted by the results of the grammar.

                      no bigrams   PoS bigrams   PoS+MSI bigrams
                      AUC    P     AUC    P      AUC    P
baselines
  bigrams                          0.81   0.35   0.82   0.38
  connector           0.84   0.37  0.87   0.44   0.87   0.45
combinations
  ling+pos            0.86   0.41  0.88   0.46   0.88   0.46
  conn+ling+pos       0.86   0.42  0.89   0.46   0.89   0.48
  conn+ling+pos+lay   0.87   0.43  0.89   0.47   0.89   0.47
  all                 0.87   0.44  0.89   0.48   0.89   0.50
best results
  settings            all          all            all
  results             0.87   0.44  0.89   0.48   0.89   0.50

Table 5.28 All definitions: results for the combined settings (left) with PoS bigrams (middle) and PoS+MSI bigrams (right)

The other five feature settings are related to connector information, linguistic information on definiendum and definiens, position information, layout information, and keyword information. The comparison of definitions and non-definitions with respect to these features (presented in Chapter 4) revealed that the definition type plays an important role. The largest differences within the data set are observed in the is data set, while the differences are smallest in the verb data set.

The experiments with classifiers based on these features revealed that the definition type to some extent influences which type of information is most relevant. Generally speaking, especially the connector, linguistic and position information are relevant for a correct classification, while the keyword and layout information are less relevant. The combinations of the different settings show that it is generally enough to combine three settings. A fourth and fifth setting often do not improve on the results obtained with the best combination of three settings, regardless of the setting that is added.

In general, the addition of the bigram settings – either PoS bigrams or PoS+MSI bigrams – improves the performance of the classifiers. This is especially the case when only one of the individual settings is used. When more settings are combined, the added value of the bigrams depends on the data set: for the is, verb, and aggregated data sets, both bigram settings improve on the results and there is not much difference between the two settings, while for the punctuation definitions both settings improve the results as well, but the PoS+MSI bigram setting enhances the classifiers more than the PoS bigram setting. The bigram settings do not improve the results of the pronoun definitions.

On the basis of the experiments, we can conclude that the machine learning techniques considerably improve the results that have been obtained with the pattern-based approach. The best results are obtained with the combination of all characteristics, including the MSI bigrams. This shows that all characteristics contribute to the correct classification of definitions, even though they were not all relevant individually. Generally, the results improve only marginally when more than three types of characteristics are used. This may indicate that a limit has been reached that cannot be surpassed on the basis of text characteristics alone.

Finally, the results obtained can also be evaluated from the perspective of our starting point – the semi-automatic creation of glossaries. Our aim was to develop a tool that assists learners in the creation of such glossaries. The pattern-based glossary candidate detector presents a list to the learner that contains on average 69.7 sentences, of which only 10.9 are definitions. In other words, only about 1 out of 6 definition candidates is correct. The sequential approach results in on average 16.9 possible definitions per document, of which 8.5 are definitions; that is, half of the retrieved sentences are correct definitions. The situation for the learner or tutor who wants to create a glossary semi-automatically has thus improved, since the list for each document gets much shorter while the relative amount of definitions increases. As a consequence, the time to create a glossary reduces.

A conclusion is the place where you got tired of thinking (Arthur Bloch, 1948)

6 Conclusions and discussion

6.1 INTRODUCTION

This thesis addressed the problem of automatic definition extraction from texts. More specifically, I have investigated which information can be used for distinguishing Dutch definitions from non-definitions automatically. In this chapter, the experiments carried out, the results achieved, and the level of success of my approach are evaluated. On the basis of the study carried out, my main contributions to the area of definition extraction are presented from the linguistic, eLearning, and development perspectives. Finally, the thesis is concluded by suggesting some future directions for research on definition extraction.

6.2 CONCLUSIONS

Definition extraction constitutes a relevant task for different applications, such as question answering, dictionary building, and glossary creation. The focus of my research has been the development of a method for creating glossaries semi-automatically to assist learners or tutors in an eLearning context. In order to support learners and tutors in finding relevant information, a broad definition of ‘definition’ was adopted to ensure the coverage of a large diversity of patterns.

The definition extraction approach presented in this thesis consists of two phases. In the first phase, a pattern-based approach has been proposed to match sentences that conform to a restricted set of patterns (Section 6.2.1). After the pattern-based approach, a filtering step has been applied in which machine learning techniques are used to reduce the number of non-definitions (Section 6.2.2). The combination of the pattern-based approach and machine learning techniques has resulted in a useful tool for glossary creation. The tool detects the majority of definitions while the amount of non-definitions retrieved is acceptable (Section 6.2.3).

6.2.1 DEFINITION EXTRACTION ON THE BASIS OF PATTERNS

Definitions generally contain (at least) three elements – a definiendum, a definiens, and a connector phrase that links the definiendum to the definiens. Manual inspection of 600 definitions revealed that a restricted number of verbal phrases, pronouns, and punctuation characters are employed in the connector phrases. On the basis of this observation, a classification methodology has been proposed that distinguishes four types of definitions. The first group – the is definitions – contains the definitions in which the verb ‘to be’ connects the definiens and the definiendum, while the second group – the verb definitions – includes all other verbal connector phrases, such as consist of, mean, and stands for. In the third category, a punctuation character is used as the connector (e.g. a colon) and therefore these are called punctuation definitions. The last group contains definitions in which either a pronominal phrase constitutes the connector part, or a pronoun phrase in combination with a verbal connector phrase is employed. Both types are referred to as pronoun definitions. Apart from the connector phrase, the structures employed in the definienda and the beginning of the definiens are restricted as well. For these elements, a number of syntactic patterns typical for definitions can be distinguished.

On the basis of the connector phrases and the lexico-syntactic patterns from the definiendum and definiens, a pattern-based grammar has been created for each of the four definition types. The grammars are able to detect 81.3% of the definitions, regardless of the definition type. This shows that hand-crafted grammars can be used to detect a large variety of definitions. The main problems related to the detection of definitions are errors that have been made by the linguistic annotation tools and complex patterns that are rarely used in the corpus. These problems illustrate an important shortcoming of the pattern-based approach – the lack of flexibility. Only definitions that are exactly described by the regular expressions are matched.

The patterns described in the grammars matched a number of non-definitions as well, especially for the punctuation and pronoun definitions. As a consequence, only 15.8% of the sentences retrieved with the grammars are definitions. Qualitative evaluation of the pattern-based method revealed that learners and tutors in an eLearning environment appreciate the idea of a glossary creation tool; however, some people remarked that the amount of false hits with the tool is too high. From this we concluded that for definition extraction either more information than the patterns is necessary, or the patterns are more complex than the ones described by the grammar. The pattern-based approach can be applied successfully to restrict the number of sentences a document contains to a set of sentences conforming to a definition pattern. However, in order to obtain a useful application, a filtering step has to be employed to reduce the amount of non-definitions.
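To give a flavour of such patterns, the sketch below matches a bare-bones is definition with a single regular expression over raw text. This is only an approximation of the idea: the actual grammars are far richer and operate on PoS-tagged XML via lxtransduce, and the pattern shown here is invented for the example.

    import re

    # A deliberately simplified is-definition pattern: an optional
    # article, a one-word definiendum, the connector 'is'/'zijn', and a
    # definiens opened by an indefinite article. Real definienda are
    # full noun phrases, which this toy pattern does not cover.
    IS_PATTERN = re.compile(
        r"^(?:Een|De|Het)?\s*(?P<definiendum>\w[\w-]*)\s+"
        r"(?P<connector>is|zijn)\s+"
        r"(?P<definiens>(?:een|het)\s+.+)$",
        re.IGNORECASE)

    m = IS_PATTERN.match("Internet is een netwerk van netwerken.")
    if m:
        print(m.group("definiendum"), "->", m.group("definiens"))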

6.2.2 DEFINITION CLASSIFICATION USING MACHINE LEARNING TECHNIQUES

The purpose of my research was to develop a method for distinguishing definitions from non-definitions automatically. I have shown that a pattern-based method alone leaves room for improvement. More specifically, such an approach extracts, in addition to 81% of the definitions, a considerable amount of non-definitions as well (84% of the extracted sentences). I have therefore investigated whether other text properties and techniques can be used to improve the results of the pattern-based approach. The properties that have been examined were divided into six groups; a schematic sketch of how such properties might be encoded follows the list:

1. Linguistic bigrams

(a) bigrams of part-of-speech tags (PoS)
(b) bigrams of part-of-speech tags and morpho-syntactic information (MSI)

2. Connector properties: connector phrase and the linguistic categories of the tokens directly to its left and right

3. Linguistic properties of definiendum and definiens

(a) definiendum: type of article, adjective, and noun
(b) definiens: type of article, adjective, and relative clause

4. Position properties

(a) definition: length of paragraph, absolute and relative position in the paragraph, absolute position in the sentence
(b) definiendum: frequency in document, relative and absolute position within the document

5. Layout properties: layout of definitions and definiendum

6. Keyword properties: keyword score of definiendum
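Schematically, one candidate sentence could be encoded as a flat record like the sketch below; every name and value is invented for illustration and does not mirror the actual feature encoding used in the experiments:

    # Hypothetical feature record for one extracted candidate sentence.
    candidate = {
        # connector properties
        "connector": "is",
        "left_context_pos": "N",
        "right_context_pos": "Art(indef)",
        # linguistic properties of definiendum and definiens
        "definiendum_article": "none",
        "definiens_article": "indefinite",
        "definiens_has_relative_clause": False,
        # position properties
        "paragraph_length": 4,
        "relative_position_in_paragraph": 0.25,
        "definiendum_first_occurrence": True,
        # layout properties
        "definiendum_in_bold": False,
        # keyword properties
        "definiendum_keyword_score": 0.62,
    }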

Machine learning techniques have been applied to examine whether these properties are suitable for distinguishing definitions from non-definitions. The data sets that have been used for training and testing consist of the sentences extracted with the pattern-based approach. As a consequence, there are separate data sets for each of the four definition types. In addition, an aggregated data set has been created in which the four individual data sets are combined. Since some of the data sets are highly imbalanced, the Balanced Random Forest classifier has been used to conduct the experiments.
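A minimal sketch of such a filtering classifier is given below, using the BalancedRandomForestClassifier from the imbalanced-learn package on synthetic stand-in data; the thesis experiments used their own feature sets, so nothing here reproduces the reported scores:

    from imblearn.ensemble import BalancedRandomForestClassifier
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score, precision_recall_fscore_support

    # Imbalanced two-class data standing in for definition/non-definition
    # feature vectors (85% majority class).
    X, y = make_classification(n_samples=1000, weights=[0.85], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)

    # Each tree is grown on a balanced bootstrap sample, one way to
    # realise the balanced bagging idea mentioned in the text.
    clf = BalancedRandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)

    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    p, r, f, _ = precision_recall_fscore_support(
        y_test, clf.predict(X_test), average="binary")
    print(f"AUC={auc:.2f}  P={p:.2f}  R={r:.2f}  F={f:.2f}")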

6.2.2.1 The relevance of the properties

On the basis of the experiments, we can conclude that all the properties contribute to the correct classification of definitions. However, they are clearly not equally important. Although the relevance of the settings partially depends on the definition types, it is possible to draw some general conclusions. Overall, the bigram, linguistic, connector, and position features proved to be most relevant for the classification of definitions, while the layout and keyword properties only contribute to a correct classification when they are combined with other information.

Linguistic bigrams The two types of linguistic bigrams yielded different results. For the is definitions, the use of the more detailed MSI information instead of the PoS tags contributed to a better classification of definitions, while for the other types the differences between the types of linguistic bigrams are much smaller. For the pronoun definitions, the PoS bigrams are even better suited than the MSI bigrams. Both for the PoS and the MSI bigrams, it has been noticed that the bigrams that contain information on the connector phrase are most useful for differentiating between definitions and non-definitions.

Connector properties For the verb, punctuation, and pronoun definitions, the connector properties are most suitable for distinguishing definitions from non-definitions. Both the connector phrase and the linguistic categories of the left and right context contribute to the good classification results. The importance of the connector phrase for the classification of definitions is again stressed by this finding: the connector phrase constitutes the basis of the classification methodology, forms the core of the pattern-based grammars, and thus performs best in the machine learning experiments as well. Especially in the verb and pronoun definitions, the information on the connector and its context is considerably more important than any of the other properties. This can be explained by the fact that the connector phrases of these types are most diverse.

Linguistic properties In both the pattern-based approach and the machine learning experiments, linguistic information has been included. The linguistic features that have been used in the machine learning experiments differ from the pattern-based information in two ways. The machine learning features consider the frequencies of the linguistic elements (e.g. indefinite articles in the definiendum) in definitions and non-definitions and take the combinations of the linguistic elements into account, whereas the pattern-based approach only determines which patterns can be identified in acceptable definitions. The linguistic properties are especially relevant for the classification of is definitions. On the basis of this finding, we can conclude that in the is definitions the lexico-syntactic structure differs most from non-definitions.

Position properties The position features are relevant for the classifi- cation of punctuation and pronoun definitions, while they are less rele- vant for the is and verb definitions. Both the position of the definition and the definiendum contribute to the correct classification.

Layout properties My examination of definitions reveals that only a minority of definitions uses a specific layout. Although this is observed more often in definitions than in non-definitions, this property alone is not enough for the classification of definitions. The layout features only contribute to the classification when they are combined with other properties.

Keyword properties Although my investigation has clearly shown that the keyword scores of definitions differ from those of non-definitions, it is not possible to classify definitions on the basis of this information alone. Just like the layout properties, the keyword scores are only useful for the classification when they are combined with other features.

6.2.2.2 The optimal combination for definition classification

The different characteristics were not only investigated individually; they have also been combined in several ways to discover the optimal combination. On the basis of these experiments, it is possible to conclude which features are most useful for the classification of definitions.

The linguistic bigrams My first conclusion is related to the use of the linguistic bigrams. I have shown that the MSI bigrams constitute a good basis for the classification of definitions, but that the addition of other text properties improves the classification results considerably. The precision scores of the best classifiers – including MSI bigrams – on the five data sets are on average 0.116 higher than the results for the MSI bigrams alone, while the AUC score is 0.108 higher. This shows that the five types of properties contain relevant information that is not covered by the linguistic bigrams alone. On the other hand, the classification that is based on the properties without the bigrams improves when bigrams are added, which shows that the features used do not cover all relevant linguistic information. However, the improvements are in this case considerably smaller: the precision improves on average by 0.042 while the AUC improves by 0.024.

Aggregating the data sets In the experiments, I have investigated for each of the four definition types individually which features are most important for the classification. This provided relevant insights into the way the distinct types are used and revealed that there are clear differences between them. It was also a useful distinction for the pattern-based approach. However, the experiments have revealed that for the machine learning filtering it is better to combine the four data sets, since the classifier that has been trained on the aggregated data set outperforms the type-based classifiers.

The best combination The last conclusion is that the combination of all features, including the MSI bigrams, generally gives the best classification results. This shows that all properties contribute to the correct classification of definitions, even though they were not all relevant individually. Generally, the results improve only marginally when more than three types of properties are used. This may indicate that a limit has been reached that cannot be surpassed on the basis of text properties alone. The classifier based on all information reduces the amount of non-definitions extracted with the pattern-based approach. Compared to the pattern-based approach, the precision improves from 0.16 to 0.50 and the F-score from 0.26 to 0.56.

6.2.3 DEFINITION EXTRACTION FOR SEMI-AUTOMATIC GLOS- SARY CREATION

The starting point of my research was the practical demand for a tool to automatically create glossaries. On the basis of the experiments, it is now possible to assess the level of success of my approach, that is, the sequential combination of a pattern-based method and machine learning techniques. Compared to other research on definition extraction for glossary creation, my results are better (Kobyliński and Przepiórkowski, 2008; Borg et al., 2009) or comparable (Del Gaudio and Branco, 2009). More details on the comparison to these approaches are provided in Section 6.3.3.

To make the performance of the two methods more concrete, it is useful to look at the results from the perspective of a learner who wants to create a glossary semi-automatically. On average, a document from our corpus contains 13.4 definitions. The pattern-based glossary candidate detector presents a list to the learner that contains on average 69.7 sentences, of which only 10.9 are definitions. The learner has to decide for almost 60 sentences that these are non-definitions. The glossary candidate detector in which a machine learning classifier has been applied after the pattern-based extraction of definition candidates presents on average 16.9 possible definitions per document, of which 8.5 are definitions. The situation for the learner or tutor who wants to create a glossary semi-automatically has thus improved, since the list for each document gets much shorter while the relative amount of definitions increases. As a consequence, the time needed for creating a glossary reduces.

6.3 DISCUSSION

6.3.1 THE EXTRACTION APPROACH

This thesis presents a definition classification approach that combines a pattern-based extraction method with machine learning techniques. This approach shows similarities to what has been presented in other research on definition extraction. Each pattern-based method includes the use of connector phrases (cf. Joho and Sanderson (2001); Saggion (2004); Walter and Pinkal (2006)). While the variety of the connector phrases is often restricted and focuses only on clear definition indicators (e.g. ‘is a’, ‘means’), my approach addresses a larger diversity of patterns, since in the glossary creation application it is important to retrieve as many definitions as possible. Following Muresan and Klavans (2002); Walter and Pinkal (2006); Han et al. (2007), the connector phrases have been combined with information on the lexico-syntactic structures of definitions. More specifically, regular expressions were formulated to match the structure of the definiendum and the beginning of the definiens.

An important drawback of the pattern-based approach appeared to be the fact that the connector phrases are often used in non-definitions as well. Different solutions have been proposed to solve this problem. Joho and Sanderson (2001) and Han et al. (2007) complemented their pattern-based approach with a method in which a number of text characteristics are included in a linear formula. The optimal weights of the different elements are determined on the basis of the definitions. An alternative way of including different text properties for classification is the use of machine learning techniques (cf. Fahmi and Bouma (2006); Del Gaudio and Branco (2009)). The experiments from Fahmi and Bouma (2006) for the classification of is definitions show that machine learning techniques can be successfully applied to definition extraction. Therefore, it was decided to use a sequential combination consisting of a pattern-based extraction phase and a machine learning filtering step.

6.3.2 THE FEATURES

Our main contribution to the research on definition extraction involves the use of a wide variety of text properties for the extraction of definitions. Section 6.2.2 presented an overview of the properties that have been considered.

Linguistic bigrams Word n-grams have been used in various other experiments on definition extraction (Androutsopoulos and Galanis, 2005; Fahmi and Bouma, 2006). The most frequent word n-grams often correspond to the connector phrases that have been distinguished in the pattern-based approach. This information has in my experiments been included in the connector properties. Instead of word n-grams, bigrams of part-of-speech tags and morpho-syntactic information have been used. While we were the first to employ such linguistic n-grams, other researchers have followed my approach and started to use them as well (Kobyliński and Przepiórkowski, 2008; Del Gaudio and Branco, 2009).

Connector properties The experiments of Androutsopoulos and Galanis (2005) revealed that the connector phrases are similar to the most frequent word n-grams. My use of connector information is also related to the weighted phrases (e.g. ‘is a’) from the pattern-based approach of Joho and Sanderson (2001). Important differences are that in my experiments the connector phrase is separated from its context and that the linguistic categories of the surrounding words have been used instead of the actual words. For example, the phrase ‘is a’ is in my approach represented by a feature for the connector, ‘is’, and a feature for the linguistic category of the context of the connector, which is an indefinite article.

Linguistic properties The linguistic properties that have been employed are partly based on the features used by Fahmi and Bouma (2006), who looked into the articles in definiendum and definiens and the nouns in the definiendum. Relative clauses had not been used as a machine learning feature before, although Muresan and Klavans (2002) have employed this information in their pattern-based extraction approach. In addition to these linguistic features, the adjectives used in definiens and definiendum have been considered. The use of linguistic properties has been extended in my approach to other types of definitions, whereas Fahmi and Bouma (2006) focused only on is definitions.

Position properties The position of the definition within the document has been taken into account before within the pattern-based approach (Joho and Sanderson, 2001) and as a machine learning feature (Fahmi and Bouma, 2006; Blair-Goldensohn et al., 2004). In my features, attention has been paid to the position of the definition as well. While Joho and Sanderson (2001) and Fahmi and Bouma (2006) used the absolute position of the sentence within the document, I have considered the (absolute and relative) position within the paragraph because our documents are longer.

An innovative characteristic that has been included in my experiments involves the use of the position of the definiendum within the document. In this respect, the use of the definiendum frequency in the document, and the absolute and relative position of the definiendum compared to other occurrences of the term in the same document, have been investigated. It has been shown that the relative position of the definiendum is on average lower within the definitions, especially within the is definitions. However, for the classification the information on the position of the sentence is more relevant.

Layout properties The use of layout information has not been investigated before. This information proved to be especially useful when it is combined with other information.

Keyword properties Another innovative feature involved the use of keyword scores. Although the results that have been obtained with this information alone are worse compared to the other types of information that have been used, I have shown that this information can be employed to improve the performance obtained with other settings.

6.3.3 THE RESULTS

It is difficult to compare the performance of my approach to other results, since there are several factors that influence the classification. The most important differences are related to the data sets, the variety of patterns distinguished, and the application. My research was initiated within the LT4eL project, in which a glossary candidate detector had to be developed for eight languages. While for half of the languages only a pattern-based approach has been implemented, four alternative approaches – including my method – to the classification of definitions have been examined (Kobyliński and Przepiórkowski, 2008; Borg et al., 2009; Del Gaudio and Branco, 2009). Since we all focused on the same application (glossary creation), used comparable data sets (in different languages), and distinguished the same definition types, comparison to these approaches is the best option.

                                          AUC    R     P     F
All definition types
  Our results                             0.89   0.63  0.50  0.56
  Kobyliński and Przepiórkowski (2008)    0.85   0.69  0.21  0.33
Only is definitions
  Our results                             0.90   0.72  0.67  0.70
  Borg et al. (2009)                             0.51  1.00  0.68
  Del Gaudio and Branco (2009)                         0.94  0.74

Table 6.1 Definition extraction for glossary creation: a comparison of four different approaches

Kobyliński and Przepiórkowski (2008) used only machine learning techniques – a Balanced Random Forest classifier – to distinguish definitions from non-definitions. My method, in which the machine learning step is preceded by a pattern-based one, clearly outperforms their approach, especially with respect to the precision scores. Borg et al. (2009) applied a combination of Genetic Programming and Genetic Algorithms for the classification of is definitions. The main problem of their approach is the balance between precision and recall: all the retrieved definitions are correct, but half of the definitions are not detected. This conflicts with my goal to maximize the recall. In my approach, the balance between precision and recall is considerably better. The results from Del Gaudio and Branco (2009), in which an alternative balancing algorithm (SMOTE) has been used, are slightly better than my results. However, it should be noted that they did not include the results of the pattern-based step in their performance scores. When evaluating only the machine learning step of my approach, an F-score of 0.75 is obtained.
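The recall–precision trade-off noted for Borg et al. (2009) can be made explicit with the F-score formula used throughout this thesis: $F = \frac{2PR}{P+R}$, so that $F = \frac{2 \cdot 1.00 \cdot 0.51}{1.00 + 0.51} \approx 0.68$, which is the value reported in Table 6.1.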

6.4 MAIN CONTRIBUTIONS

My dissertation provides relevant insights on definitions and the extraction of definitions from several perspectives. The main contributions are presented from the linguistic perspective, the eLearning perspective, and the development perspective.

6.4.1 LINGUISTIC PERSPECTIVE

From the linguistic perspective, the main contribution involves the investigation of the characteristics of different definition types – the is, verb, punctuation, and pronoun definitions. Especially the application of definition extraction to the pronoun definitions was an unexplored area. I have shown that linguistic properties are of crucial importance for the classification of all definition types. The two properties that are related to the linguistic structure of the definitions – the connector and the linguistic features – proved to be the most informative ones. In addition, the linguistic bigrams performed well for the different definition types. The differences between definitions and non-definitions with respect to the linguistic properties of definiendum and definiens are most prominent in the is definitions.

6.4.2 ELEARNING PERSPECTIVE

The starting point of my research was the need for a glossary creation tool to be used in an eLearning environment. To this end, this thesis has presented a method to extract definitions from documents. To assist the learner or tutor as much as possible, the definition extraction tool addresses a large variety of definition patterns. In addition to is definitions, the focus has been on verb, punctuation, and pronoun definitions as well. Quantitative evaluation of the tool has revealed that the majority of definitions in a document is detected, while the amount of non-definitions is acceptable (cf. Section 6.2.3). From the eLearning perspective, I have thus contributed by developing a practical tool to enhance the learning process that can be implemented in a Learning Management System (e.g. BlackBoard, WebCT).

6.4.3 DEVELOPMENT PERSPECTIVE

On the basis of the results obtained in my study on definition extraction, some insights can be provided to inspire future research in this area. The combination of a pattern-based approach and machine learning proved to be a good method for the extraction of definitions. In the machine learning approach, it is important to use an algorithm that is able to deal with imbalanced data sets, like the balanced bagging algorithm that has been implemented in the Balanced Random Forest classifier. As for the features, the linguistic bigrams in which part-of-speech tags and morpho-syntactic information are combined provide a good starting point for the classification of definitions. In this respect, the part of the sentence containing the connector phrase and its context is most relevant. The addition of connector, linguistic, and position properties to the bigrams improves on these results. Including the keyword and layout scores is not worth the effort, since they contribute only marginally to the classification of definitions.

6.5 FUTURE RESEARCH

In any research there is always room for improvement and the work described in this thesis is no different. While the primary focus has been on investigating the contribution of different types of information to the classification of definitions, there remain some other directions that deserve to be explored in more detail.

Soft matching techniques A disadvantage of the pattern-based approach was its low flexibility to cope with errors and variation. It is not tolerant to noise in the training data, and cannot recognize definition patterns that are not explicitly described in the grammar. Since a high recall was important within the glossary creation context, it was necessary to create very specific patterns. The manually crafted grammars are based on a 400,000 word corpus that contains 600 definitions, and although the evaluation on our test corpus revealed that the grammar patterns covered most of the definitions, it might be the case that in a different corpus additional patterns are used. Especially within the group of verb definitions there may be room for improvement. The use of soft matching techniques (Cui et al., 2007) might help in this respect. Such techniques could be used to compute the degree of match between the test sentences and the patterns described in the grammar using a probabilistic model.
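As a toy illustration of the idea (not the probabilistic model of Cui et al. (2007)), a normalised edit distance over PoS tag sequences already yields a graded match score instead of a hard yes/no decision; the tags and sequences below are invented:

    def edit_distance(a, b):
        """Standard Levenshtein distance over two tag sequences."""
        d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
             for i in range(len(a) + 1)]
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                              d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
        return d[len(a)][len(b)]

    def match_score(pattern, candidate):
        """1.0 for an exact match, decreasing towards 0 with more edits."""
        return 1 - edit_distance(pattern, candidate) / max(len(pattern),
                                                           len(candidate))

    pattern = ["Art", "N", "V(is)", "Art", "N"]
    candidate = ["Art", "Adj", "N", "V(is)", "Art", "N"]
    print(match_score(pattern, candidate))  # 0.833..., one insertion away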

Addressing imbalanced datasets The use of different types of classifiers is another direction for future research. Relevant work in this direction that can provide fruitful insights for future work involves machine learning experiments carried out by the Portuguese partner from the LT4eL project in this area (Del Gaudio and Branco, 2009). They compared different algorithms to address the imbalanced dataset problem, of which the SMOTE algorithm (Chawla et al., 2002) seems to be the most promising. When I started my work, only the balanced bagging method had been applied to definition extraction. Instead of using the balanced bagging method, the SMOTE algorithm could be implemented to investigate whether the results can be improved further.

Borg et al. (2009) employed Genetic Algorithms to address the problem of definition extraction. One of the aspects of these algorithms is that they assign a weight to each of the features. Something similar has been done in the linear formulas used by Joho and Sanderson (2001); Han et al. (2007). The Balanced Random Forest classifier that has been used in my experiments does not include feature weights. Enriching the classifier with a feature weighting procedure could improve the results.
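A minimal sketch of the SMOTE alternative mentioned above, using the imbalanced-learn implementation on synthetic stand-in data (the actual feature sets of the thesis are not reproduced here):

    from collections import Counter
    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification

    # Imbalanced two-class data (90% majority class). SMOTE synthesises
    # new minority-class examples by interpolating between nearest
    # neighbours, rather than under-sampling the majority class.
    X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)
    print(Counter(y))                       # imbalanced
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    print(Counter(y_res))                   # balanced after oversampling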

Features from external sources The features used in my approach are restricted to properties that can be extracted from the document itself. The use of information from external resources to improve the classifiers is worth exploring, since they may contain valuable information for the classification as well. Prager et al. (2002) investigated the use of WordNet hypernyms, but concluded that this is not possible for all definitions, since many terms are not covered in WordNet. However, it might nevertheless be useful to include the information for the terms that do exist in WordNet. Another direction that deserves attention in this respect is the use of similarity measures to compute the relationships between the definiendum and the noun phrases from the definiens. When the definiendum and definiens are closely related to each other (e.g. HTML and markup language), it is more likely that the sentence is a definition than when the two elements are not related to each other (e.g. HTML and example).
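A sketch of the similarity idea using NLTK's interface to the English WordNet; for Dutch data a resource such as Cornetto would be needed, and that substitution, like the choice of path similarity, is an assumption of the sketch:

    from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

    def noun_similarity(word1, word2):
        """Path similarity between the first noun senses, if any."""
        s1 = wn.synsets(word1, pos=wn.NOUN)
        s2 = wn.synsets(word2, pos=wn.NOUN)
        if not s1 or not s2:
            return None  # term not covered in WordNet
        return s1[0].path_similarity(s2[0])

    print(noun_similarity("language", "dialect"))   # related pair
    print(noun_similarity("language", "example"))   # less related pair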

Different features per definition type In my experiments, the same features have been used for each of the definition types. However, I noticed that some features are not relevant for all the types; they sometimes even decreased the classification performance. It would be interesting to investigate whether the classification results can be improved when only the features that improve the results are considered.
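A greedy backward elimination over feature groups, run separately per definition type, is one way to operationalise this. The sketch below is an assumption-laden illustration: a standard random forest with balanced class weights from scikit-learn stands in for the Balanced Random Forest classifier used in my experiments, and the feature groups and labels are hypothetical.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def select_groups(X_groups, y, folds=5):
        """Greedy backward elimination over feature *groups* (e.g.
        linguistic, connector, position): drop a group whenever doing so
        does not hurt the cross-validated AUC."""
        def auc(groups):
            X = np.hstack([X_groups[g] for g in groups])
            clf = RandomForestClassifier(n_estimators=100,
                                         class_weight="balanced",
                                         random_state=0)
            return cross_val_score(clf, X, y, cv=folds, scoring="roc_auc").mean()

        kept = list(X_groups)
        best = auc(kept)
        improved = True
        while improved and len(kept) > 1:
            improved = False
            for g in list(kept):
                trial = [h for h in kept if h != g]
                score = auc(trial)
                if score >= best:  # dropping g does not hurt
                    kept, best, improved = trial, score, True
                    break
        return kept, best

    # Hypothetical data: two feature groups for one definition type.
    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, 200)
    X_groups = {"linguistic": rng.random((200, 5)),
                "position": rng.random((200, 3))}
    print(select_groups(X_groups, y))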

Appendices

A LT4eLAna DTD

The corpus documents conform to the LT4eLAna DTD, which is contained in this appendix. The LT4eLAna DTD is an enriched version of the XCES DTD for linguistically annotated corpora (Ide and Suderman, 2002).



B Sentences not covered by the grammar

This appendix contains the sentences that are not covered by the grammars presented in Chapter 3. The groups of errors that are distinguished correspond to the categories mentioned in Section 3.7. The words or phrases that are problematic for the grammars have been marked in bold.

Is DEFINITIONS

Development corpus (16)

• tagger errors

1. Postscript is een taal waarmee de opmaak van een pagina beschreven kan worden.
2. Het programma xpaint is een tekenprogramma dat geschikt om grotere bitmaps en ook kleurenplaatjes, z.g. pixelmaps te maken.
3. Internet is een netwerk van netwerken.
4. Een symbolische link (ook wel eens zacht genoemd) is een indirecte verwijzing naar een andere file of directory.
5. De Korn shell (geschreven door Korn) is een uitbreiding van de Bourne shell met o.a. ook een history mechanisme.
6. tr is een filter dat tekens kan vertalen in andere tekens.

7. grep is een programma dat een tekststroom leest en de regels waarin een bepaald stuk tekst voorkomt eruit filtert.
8. sort is een programma om files te sorteren, d.w.z. op een bepaalde volgorde te zetten.
9. eWatch is een waarnemingsplatform dat erop is gericht meer inzicht te verwerven in de vernieuwing en verandering van het Europese onderwijs.

• sentence start

1. een zelfstandig naamwoord is een ‘woord, dat met ‘die’ of ‘dat’ gecombineerd kan worden, (meestal) een enkelvouds- en een meervoudsvorm heeft en een zelfstandigheid (in de ruimste zin) aanduidt.’
2. een proces is een “zwarte doos” waar iets in gaat en iets uitkomt (figuur 2.1).
3. e-Learning is het gebruik van nieuwe multimediatechnieken en internet om de kwaliteit van het leren te verbeteren door middel van het vergemakkelijken van de toegang tot hulpbronnen en diensten én door uitwisseling en samenwerking op afstand

• complex patterns

1. LaTeX is noch een DeskTop Publishing-pakket noch een tekstverwerker, maar een zet-systeem.
2. Een directory is intern in het Unix systeem een lijst van files en (sub)directories.
3. Z39.50 is een internationaal zoek- en opsporingsprotocol (ISO 23950, 1998) dat het doorzoeken van heterogene databases en de opsporing van data via één gebruikersinterface mogelijk maakt.
4. Levenslang leren is een " vormingsproces gedurende het hele leven, met als doel kennis, vaardigheden en deskundigheid te optimaliseren, binnen een persoonlijk, burger-, maatschappelijk en/of werkgerelateerd perspectief ".

Test corpus (11)

• tagger errors

1. Markup is het gebruik van code om de browser, een programma waarmee HTML-documenten bekeken kunnen worden, te vertellen hoe de inhoud van het document weergegeven moet worden en naar welke bestemming de hyperlinks moeten leiden.
2. SMTP of simple mail transfer protocol is een mail service welke is gebaseerd op het FTP-model.
3. Een domein-naam (domain name) is het web-adres zoals je dat normaal gebruikt in je web-browser.
4. UDP of user datagram protocol is een transport protocol welke niet is gebaseerd op een verbinding.
5. SGML (“Standard Generalized Markup Language”) en XML (“Extensible Markup Language”) zijn standaarden waarmee de structuur van documenten kan worden vastgelegd.
6. Data islands zijn blokken xml content in een HTML-pagina.
7. RDF en CDF zijn toepassingen die ontwikkeld zijn met XML.
8. XML Schema is zelf XML, ondersteunt allerlei gegevenstypen en staat ook toe om zelf types te definiëren, en is bovendien zeer flexibel.

• complex patterns

1. MONOGRAFIEËN zijn min of meer specialistische boeken over een relatief afgebakend onderwerp.
2. DTD is al sinds 1983 een standaard die gebruikt wordt voor document definities voor de Standard Generalized Markup Language (SGML).

• patterns that should be added to grammar

1. Een “complex type” daarentegen is een type dat een element definieert dat attributen en/of sub-elementen bevat.

Verb DEFINITIONS

Development corpus (29)

• tagger errors

1. Afzonderlijke waarden uit een tabel die als grafiek worden weergegeven, noemt men gegevensmarkeringen.
2. Een verzameling gegevenspunten die afkomstig is uit dezelfde kolom of rij heet een gegevensreeks (serie).
3. De environment itemize wordt gebruikt voor eenvoudige lijsten (zie fig. 2.1).
4. De environment enumerate wordt gebruikt voor lijsten met genummerde regels (zie fig. 2.2).
5. Het commando \mbox{...} zorgt ervoor, dat het woord helemaal niet afgebroken wordt.
6. Als we de werkfile uit de versiefile willen halen, bijv. omdat we de werkfile weggegooid hebben of omdat we een oude versie nodig hebben, dan heet dat check-out.
7. e-learning veronderstelt het gebruikmaken van nieuwe multimediatechnologieën en het internet om daarmee onderwijs te innoveren
8. De sociaal-communicatieve competentie omvat niet alleen de mondelinge- en schriftelijke vaardigheid van de eTutor, het gaat ook om kennis van en inzicht in groepsprocessen.
9. Elektronische tijdschriften kunnen gedefinieerd worden als elke periodiek, tijdschrift e-zine, nieuwsbrief of artikelenreeks die via Internet beschikbaar is.
10. Met betrekking tot digitale inhoud, betekent interoperabiliteit dat het zo breed mogelijk herbruikbaar moet zijn, te transporteren door verschillende netwerken, systemen en organisaties en zo duurzaam mogelijk.

• complex patterns

1. Vooralsnog wordt hier gesteld dat er sprake is van eLearning als leerprocessen worden gerealiseerd en begeleid binnen een elektronische leeromgeving.
2. Het begrip “agent” is op dit moment niet helemaal duidelijk gedefinieerd maar wordt meestal gebruikt voor een programma of robot die informatie verzamelt of een andere dienst verricht zonder de directe aanwezigheid van de gebruiker.
3. Het onderscheid tussen toegangspoorten en portalen (gateways en portals) is nogal vaag, maar over het algemeen zal een toegangspoort bestaan uit groepen geannoteerde links naar andere websites, die doorgelicht zijn door de makers van de gateway.
4. In het ideale geval betekent web-enabled, dat de docent materiaal ontwikkelt dat hij probleemloos op een webserver kan zetten.

• no connector verb

1. Een vaste spatie voorkomt dat een regel tussen twee woorden wordt afgebroken.
2. Een voettekst wordt afgedrukt in de ondermarge van iedere pagina van een sectie.
3. Werkbalken laten U toe op een snelle manier allerlei taken uit te voeren.
4. Een celadres geeft de plaats aan van een cel in het werkblad.
5. Het commando \\ of \newline breekt de regel af zonder een nieuwe alinea te beginnen.
6. Het commando \* breekt een regel af, maar er mag op die plaats niet op een nieuwe pagina begonnen worden.
7. De tabular-environment is voor het zetten van tabellen, waarbij LaTeX automatisch de benodigde kolombreedtes berekent en waarin ook uitvullen en hulplijnen toepasbaar zijn.
8. Een modem vertaalt de digitale signalen van de computer in analoge signalen (geluid) die verwerkt kunnen worden door de telefoonlijnen.

9. Het Z39.50 protocol geeft zoekers de mogelijkheid in een catalogus of bibliografisch bestand te zoeken, zonder de verschillende zoekstructuren te kennen die door de verschillende software leveranciers worden geleverd.
10. Een thesaurus toont relaties zoals hiërarchie en gelijkwaardigheid tussen gebruikte termen.
11. Portalen richten zich er in het algemeen op om diensten te bieden aan hun gebruikers als aanvulling op hun verzameling links.
12. Een cognitief gereedschap versterkt de intellectuele vermogens van mensen en breidt deze vermogens uit (Saljö, 1996).
13. Metadata geven een korte beschrijving van het materiaal, zoals de auteur, publicatiedatum, ’ keywords ’ en, in het geval van e-learning, bijvoorbeeld didactisch niveau.

• other problems

1. De “x” permissie op een directory betekent dat je “door de directory heen mag gaan”, d.w.z. als je een de naam van een file in de directory weet dan kun je die filenaam gebruiken.
2. Een niet-inhoudelijke samenvatting wordt meestal een abstract genoemd;

Test corpus (16)

• tagger errors

1. Scripts kun je gebruiken om extra mogelijkheden aan HTML-documenten toe te voegen.
2. Een belangrijke tool waarmee XML documenten gelezen kunnen worden, wordt een XML parser genoemd, een meer formele naam is XML processor.
3. Well-formed, betekent vrij vertaald welgevormdheid en geeft aan of een XML document syntactisch correct is.

• complex patterns

1. De discoursanalyse impliceert bijvoorbeeld niet alleen een bepaalde ‘technische’ aanpak en uitvoering van het onderzoek, maar komt tevens voort uit bepaalde theoretische aannames over de sociale functie van maatschappelijke discoursen, i.e. de veronderstelling dat deze onze blik op de wereld bepalen (zie Philips/Hardy 3-11)
2. PNG is als patent- en rechtenvrije tegenhanger van GIF van CompuServe ontwikkeld en bedoeld om GIF op te volgen als standaard voor afbeeldingen op het Internet.

• no connector verb

1. Hoorcolleges bieden je een logisch geordend overzicht van de belangrijkste kennis, inzichten en vaardigheden die je aan het eind van de cursus geacht wordt te beheersen.
2. Het JPEG-formaat gebruikt een compressiemethode waarbij de bestandsgrootte van een foto wordt verkleind door het selectief verminderen van de details van de afbeelding en door het overbrengen van de afbeeldinggegevens naar een formaat dat beter geschikt is om te worden gecomprimeerd.
3. Een parser ontleedt een XML-document en controleert of het document conform een bepaald documentmodel is opgesteld.
4. Een non-validating parser controleert het XML-document enkel op well-formedness.

• other problems

1. De Dublin Core metadata set versie 1.1 bestaat uit 15 elementen voor het beschrijven van elektronische informatie.
2. XSLT wordt hoofdzakelijk gebruikt voor voor de transformatie van data van de ene XML structuur naar een andere.
3. XSLT-documenten worden stylesheets genoemd.

• patterns that should be added to grammar

1. De Dublin Core standaard wordt gehanteerd voor de beschrijving van metadata bij webpublicaties van de overheid.

2. DNS of domain name service zorgt voor de koppeling tussen een domain name (domein naam) met een ip-adres.
3. TCP of transmission control protocol zorgt voor een logische verbinding tussen de twee eindpunten binnen het netwerk.
4. Het opvragen van informatie via het internet wordt ook wel downloaden genoemd.

Punctuation DEFINITIONS

Development corpus (13)

• tagger errors

1. vIRC (Een IRC client geschreven in Visual Basic, gratis en geschikt voor beginners.
2. 400 Bad request: je hebt een fout of onvolledig adres gegeven
3. 503 Service unavailable: als de server niet weet wat hij met je verzoek moet doen dan krijg je deze boodschap, een reden kan zijn dat er te veel verkeer is naar deze server

• comma or dash as connector

1. Al de computers van de ISP’s, permanent met elkaar verbonden (de servers of hostcomputers), hebben een uniek nummer, het IP-adres.
2. Deze is gebaseerd op XSLT, een transformatietaal die gebruikt wordt voor het reorganiseren, toevoegen en verwijderen van tags en attributen.
3. Een IRC-gebruiker gebruikt meestal een IRC-client, een programma dat ontworpen is om verbinding te maken met een of meerdere IRC-netwerken.
4. Communicatie tussen deze duizenden netwerken en miljoenen computers is mogelijk via het TCP/IP-protocol, een serie technische afspraken over de data-uitwisseling.
5. In 1979 wordt Usenet opgezet, het platform van nieuwsgroepen.

6. De sleutel om dit te bereiken ligt bij standaards - gecodificeerde regels en richtlijnen voor de creatie, omschrijving en het beheer van digitale bronnen (zie Reinventing the Wheel D-Lib Magazine Jan 2002 voor meer informatie).

• reverse order

1. Deze overeenkomst is opmerkelijk, doch niet verwonderlijk, gezien het feit dat ook Engeland koerst op 50 % participatie in het ho, wat ook daar impliceert dat een grotere diversiteit aan studenten zal instromen (widening participation).
2. In de regel zal elke opleiding e-learning in combinatie met contactonderwijs bevatten: ‘blended learning’.

• other

1. De marges (kantlijnen)
2. Radardiagram:Deze is bijzonder geschikt om de verbanden tussen verschillende gegevens aanschouwelijk te maken.

Test corpus (1)

• complex pattern

1. De beste omschrijving voor resolutie is eigenlijk: de dichtheid van de punten per gegeven oppervlak.

Pronoun DEFINITIONS

Development corpus (17)

• tagger errors

1. Dit noemen we afspatiëren.
2. Word noemt deze tekst aantekeningen.
3. ‘overzichts-weergave’ waarin u de tekst van al uw dia’s onder elkaar ziet.
4. De belangrijkste mode is de paragraph mode waarin tekst gewoon in alinea’s gezet wordt.

5. Dit noemt men bookmarken.
6. We spreken dan van een link.
7. Van tijd tot tijd - meestal wanneer een stuk werk af is - wordt de inhoud van de werkfile in de versiefile gestopt; dit heet check-in.

• complex patterns

1. Het is speciaal ontworpen voor het zetten en drukken van mathematische teksten en formules.
2. Dan is er ook nog het commando \chapter ∗ { ...} , dat ervoor zorgt dat het hoofdstuk niet genummerd wordt en dat het niet in de inhoudsopgave verschijnt.
3. Men noemt deze platformen o.a. ‘leer management systemen’ (‘LMS-en’), ‘educatieve platformen’, ‘e-learning platformen’ of ‘teleleerplatformen’.

• no connector verb

1. Dit programma biedt een aantal interessante mogelijkheden om je presentatie te ondersteunen.
2. Het stelt de auteur in staat zijn publicaties op eenvoudige wijze en met gebruik van een van te voren opgegeven structuur, met boekdruk-kwaliteit te zetten en af te drukken.
3. Dit zijn, via je browser, te raadplegen webpagina’s waarop je databanken kan consulteren die de inhoud van het Internet catalogeren.
4. Het omgekeerde is tail -<n>, dat de laatste <n> regels geeft, of tail +<n> dat vanaf regel <n> begint.
5. Deze leggen een verbinding tussen computers en netwerken zonder dat er fysieke verbindingen nodig zijn.
6. Dit stelt twee of meer mensen in staat om vanaf verschillende locaties elkaar te zien en te horen, en soms om samen te werken aan hun Pc’s.
7. Dit systeem reguleert de toegang tot online databanken.

Test corpus (8)

• no connector verb

1. Deze artikelen zijn gebaseerd op meer of minder oorspronkelijk bronnenonderzoek en presenteren aan de hand van een vraagstelling een interessante interpretatie over een meer of minder nauw afgebakend onderwerp.
2. Hierin heeft de docent een verzameling van wetenschappelijke teksten opgenomen om te bestuderen.
3. Deze kernzin kondigt zo compact en compleet mogelijk het thema of de hoofdgedachte van de alinea aan of vat zo kernachtig mogelijk de hoofdgedachte samen.
4. Hierbij bouw je je presentatie op naar het model van een ui, bestaande uit een kern en daar omheen verschillende schillen.
5. In zo’n werkstuk worden de verschillende interpretaties die historici (of theoretici) in de afgelopen decennia ten aanzien van een bepaald vraagstuk hebben geformuleerd, in hun onderlinge verband en in de context van hun tijd systematisch beschreven, vergeleken en beoordeeld.

• complex pattern

1. Alle teksten die je schrijft moeten oorspronkelijk zijn, dat wil zeggen: in je eigen woorden zijn geschreven en een neerslag bieden van je eigen bevindingen en interpretaties.
2. Een document dat hieraan voldoet, is well-formed.
3. Deze documenten zijn niet alleen well-formed, maar zijn eveneens volgens een bepaalde DTD of XML Schema opgesteld.

C Weka input: arff file

We have used the Weka machine learning package for our experiments (Witten and Frank, 2005). Our LT4ELAna XML files have therefore been converted to arff files. An arff file consists of three parts. In the first part, the ‘relation’ indicates what the document is about. The relation can be anything and has no influence on the training or testing process. The second part contains a description of the features that are used in the instances. For each feature, a name is given and the type of input is indicated (e.g. string, integer). The third part contains a list of instances, each including all attributes described in the second part. The order of the features has to match the order and format indicated in the second part. This appendix contains a fragment of the arff file for the is definitions.

@relation is_definitions

@attribute mt_uni_ctagmsd string
@attribute def_uni_ctagmsd string
@attribute aft_conn_ctag_unimsd string
@attribute mt_uni_ctag string
@attribute def_uni_ctag string
@attribute aft_conn_ctag_uni string
@attribute conn string
@attribute conn_befctag string
@attribute conn_aftctag string
@attribute conn_befctag_aftctag string
@attribute conn_befctagmsd string
@attribute conn_aftctagmsd string
@attribute abspos_sent integer
@attribute length_par integer
@attribute relpos_sent integer
@attribute mt_num_before integer

@attribute mt_num_after integer
@attribute mt_total integer
@attribute balance_mt integer
@attribute relpos_mt integer
@attribute mt_rendpres {yes,no}
@attribute mt_rend string
@attribute par_rend string
@attribute pos_def integer
@attribute mt_cap {commonsg,commonpl,propersg,properpl,other}
@attribute mt_art {de,het,een,propersg,properpl,commonsg,commonpl,inf,other}
@attribute mt_adj_msd {stell,overtr,vergr,adv,noadj}
@attribute def_art {de,het,een,propersg,properpl,commonsg,commonpl,adverb,other}
@attribute def_adj_msd {stell,overtr,vergr,adv,noadj}
@attribute def_dat_postags string
@attribute def_dat {dat,die,om,waarmee,waarbij,waarin,waarvan,voor,nodat}
@attribute tfidf integer
@attribute ridf integer
@attribute adridf integer
@attribute bef_bictagmsd string
@attribute bef_bictag string
@attribute man_def {yes,no}

@data "Art(onbep,zijdofonzijd,neut) N(soort,ev,neut)", "Art(onbep,zijdofonzijd, neut) N(soort,ev,neut) V(hulpofkopp,ott,3,ev) Art(onbep,zijdofonzijd,neut) N(soort,ev,neut) Pron(betr,neut,zelfst) Prep(voor) Art(onbep,zijdofonzijd,neut) Adj(attr,stell,onverv) N(soort,ev,neut) V(hulpofkopp,ott,3,ev) V(trans,verldw,onverv) Punc(punt)","Art(onbep,zijdofonzijd,neut)_N(soort,ev,neut)", "Art N","Art N V Art N Pron Prep Art AdjNVVPunc","Art", "Art_N","Art_N_Pron","zijn","N","Art","N_Art","N(soort,ev,neut)", "Art(onbep,zijdofonzijd,neut)",2,2,1,0,0,0,1,0,no,"empty","li",1, commonsg,een,noadj,een,noadj,"Art_N_",dat,5.5077946402, -0.0158247555265,-0.0158247555265,yes "N(soort,mv,neut)","N(soort,mv,neut) V(hulpofkopp,ott,1of2of3,mv) Art(bep,zijdofmv,neut) N(soort,mv,neut) Prep(voor) Art(onbep,zijdofonzijd,neut) N(soort,ev,neut) Punc(punt)", "Art(bep,zijdofmv,neut)","N","N V Art N Prep Art N Punc","Art", "Art_N","Art_N_Prep","zijn","N","Art","N_Art","N(soort,mv,neut)", "Art(bep,zijdofmv,neut)",1,6,0.166666666666667,2,20,22, 0.130434782608696,0.818181818181818,no,"empty","p",1,commonpl, commonpl,noadj,de,noadj,"nodat",nodat,51.9769439154, 2.2484848605,10.7833545737,yes D Bigram properties

This appendix contains a comparison of the most frequent PoS and MSI bigrams in the machine learning datasets. A discussion of how this information has been used can be found in the part on n-gram properties in Section 4.4.3.

Table D.1 Bigrams in the is definitions and non-definitions
Table D.2 Bigrams in the verb definitions and non-definitions
Table D.3 Bigrams in the punctuation definitions and non-definitions
Table D.4 Bigrams in the pronoun definitions and non-definitions
Table D.5 Bigrams in all definitions and non-definitions

This appendix provides an overview of the verbal connector phrases in the definitions and non-definitions extracted with the pattern-based approach (Section 3.7.2). This information has been integrated into the connector properties of the machine learning approach (Section 4.4.3).

Dutch connector                 translation            definitions   non-definitions
gebruiken om / voor             use to / for                 19.23             11.68
bestaan uit                     consist of                   10.9              10.66
betekenen                       mean                          7.69              7.11
noemen                          call                          7.05              2.03
bevatten                        contain                       5.13             24.87
omschrijven / definiëren als    describe / define as          5.13              0
staan voor                      stands for                    4.49              0.51
zijn gericht op                 is directed to                3.21              5.08
betreffen                       concerns                      3.21              4.57
bedoelen                        mean                          3.21              0.51
omvatten                        comprise                      2.56              1.52
zorgen voor                     take care of                  2.56              1.02
verstaan onder                  mean by                       2.56              0.51
spreken van                     speak of                      2.56              0
definiëren                      define                        1.92              3.05
bedoelen om / voor              used to / for                 1.92              3.05
heten                           call                          1.92              2.03
gevormd door                    formed by                     1.28              1.52
maken mogelijk                  make possible                 1.28              1.02
mogelijk zijn                   is possible                   1.28              0.51
staan er                        there are                     1.28              0
fungeren als                    function as                   1.28              0
leveren                         provides                      0.64             10.15
other                                                         7.69              8.58
TOTAL                                                       100               100

Table E.1 Connector phrases in the verb data set

connector   definitions   non-definitions
colon             67.68             45.92
bracket           32.32             54.08
TOTAL            100               100

Table E.2 Connector phrases in the punctuation data set

Dutch connector       translation      definitions   non-definitions
waar...               which                  19.48             69.79
hier...               this                   16.88              5.21
dwz                   that is                16.88              2.54
zijn                  are                    27.27             14.97
noemen                call                   11.69              0.53
bevatten              contain                 3.90              0.27
betekenen             mean                    2.60              1.07
bedoelen om / voor    mean to / for           1.30              1.07
other                                         0                 4.55
TOTAL                                       100               100

Table E.3 Connector phrases in the pronoun data set

F Parameter tuning experiments

In this appendix, we list the optimal number of PoS and MSI bigrams and the number of iterations for the bagging experiments described in Chapter 4. The parameter tuning procedure is presented in Section 4.4.7. We based the optimal numbers on the five datasets that have been used.

THE NUMBER OF POS AND MSI BIGRAMS

In this experiment we investigated the optimal number of PoS and PoS+MSI bigrams to be used in the bigram experiments. Results are reported in Table F.1.
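The bigram features themselves can be obtained by simply ranking the bigrams in the training data by frequency and keeping the n most frequent ones as binary features. A minimal sketch, with hypothetical toy sentences, is given below.

    from collections import Counter

    def top_n_bigrams(tagged_sentences, n):
        """Select the n most frequent PoS bigrams in the training sentences."""
        counts = Counter()
        for tags in tagged_sentences:
            counts.update(zip(tags, tags[1:]))
        return [bg for bg, _ in counts.most_common(n)]

    def bigram_features(tags, selected):
        """Binary feature vector: is each selected bigram present in the sentence?"""
        present = set(zip(tags, tags[1:]))
        return [1 if bg in present else 0 for bg in selected]

    sentences = [["Art", "N", "V", "Art", "N"], ["N", "V", "Adj", "N"]]
    selected = top_n_bigrams(sentences, n=3)
    print(selected, bigram_features(["Art", "N", "V"], selected))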

THE NUMBER OF ITERATIONS IN THE BAGGING PROCEDURE

In this experiment, different numbers of iterations have been tried to investigate which value should be used for the parameter I. On the basis of the results for the experiments with PoS and MSI bigrams on all datasets, the value of 30 has been used for this parameter. Although a slight improvement is sometimes observed beyond this value, these improvements are only marginal, and therefore we used 30 iterations. Results of the experiments are provided in Table F.2.
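A sketch of such a sweep is shown below: the number of bagging iterations I is increased step by step and the cross-validated AUC is recorded, so that the smallest I after which the gains become marginal can be chosen. scikit-learn's BaggingClassifier and a synthetic imbalanced dataset act as hypothetical stand-ins for the Weka setup and the actual datasets.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import cross_val_score

    # Hypothetical imbalanced dataset (10% positives).
    X, y = make_classification(n_samples=500, n_features=20,
                               weights=[0.9, 0.1], random_state=0)
    for I in [5, 10, 15, 20, 25, 30, 40, 50]:
        clf = BaggingClassifier(n_estimators=I, random_state=0)
        auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
        print(I, round(auc, 3))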

PoS bigrams:

n     all   is    verb  punct  pron
10    0.67  0.59  0.61  0.66   0.53
20    0.70  0.64  0.62  0.75   0.65
30    0.75  0.66  0.65  0.75   0.66
40    0.76  0.69  0.63  0.76   0.67
50    0.76  0.68  0.64  0.75   0.69
60    0.76  0.67  0.59  0.77   0.69
70    0.75  0.71  0.60  0.76   0.71
80    0.76  0.70  0.59  0.75   0.71
90    0.77  0.69  0.58  0.75   0.70
100   0.76  0.69  0.58  0.76   0.67
110   0.77  0.69  0.58  0.73   0.73
120   0.76  0.69  0.58  0.73   0.73
all   0.76  0.69  0.58  0.73   0.73

PoS+MSI bigrams:

n     all   is    verb  punct  pron
10    0.60  0.69  0.57  0.69   0.49
20    0.63  0.72  0.53  0.71   0.46
30    0.67  0.73  0.58  0.71   0.47
40    0.72  0.75  0.56  0.69   0.56
50    0.73  0.76  0.59  0.73   0.56
100   0.75  0.75  0.63  0.75   0.59
200   0.77  0.80  0.66  0.77   0.62
300   0.77  0.80  0.65  0.78   0.61
400   0.79  0.78  0.66  0.78   0.65
500   0.78  0.78  0.66  0.78   0.64
600   0.79  0.78  0.66  0.76   0.69
700   0.79  0.78  0.66  0.77   0.69
800   0.80  0.78  0.66  0.77   0.69
900   0.79  0.78  0.66  0.77   0.69
1000  0.79  0.78  0.66  0.77   0.69
2000  0.80  0.78  0.66  0.77   0.69
all   0.79  0.78  0.66  0.77   0.69

Table F.1 The influence of the number of bigrams n on the AUC score using the Balanced Random Forest classifier

      PoS bigrams                       PoS+MSI bigrams
I     all   is    verb  punct  pron     all   is    verb  punct  pron
0     0.59  0.59  0.59  0.61   0.55     0.59  0.66  0.58  0.57   0.52
5     0.72  0.64  0.59  0.69   0.64     0.75  0.73  0.64  0.74   0.62
10    0.74  0.68  0.62  0.73   0.67     0.77  0.77  0.66  0.75   0.63
15    0.75  0.68  0.63  0.75   0.68     0.78  0.77  0.65  0.77   0.64
20    0.76  0.69  0.63  0.76   0.67     0.79  0.78  0.66  0.78   0.65
25    0.76  0.69  0.64  0.76   0.69     0.79  0.79  0.65  0.79   0.66
30    0.76  0.70  0.64  0.77   0.69     0.80  0.80  0.66  0.79   0.66
40    0.77  0.70  0.64  0.76   0.70     0.80  0.80  0.66  0.79   0.65
50    0.77  0.70  0.64  0.77   0.69     0.80  0.81  0.66  0.79   0.65
60    0.77  0.70  0.65  0.77   0.70     0.80  0.81  0.66  0.80   0.66
70    0.77  0.70  0.65  0.77   0.70     0.81  0.81  0.66  0.80   0.66
80    0.77  0.70  0.65  0.78   0.71     0.81  0.81  0.67  0.80   0.65
90    0.77  0.70  0.66  0.78   0.71     0.81  0.82  0.67  0.80   0.66
100   0.77  0.70  0.65  0.78   0.71     0.81  0.82  0.67  0.80   0.66

Table F.2 The influence of the number of iterations I in the bagging procedure of the Balanced Random Forest classifier on the AUC score

G Results for the individual settings

This appendix presents the results for the sub settings of the linguistic and position settings. The experiments based on this information are discussed in Section 5.2.

is definitions:
setting      AUC   R     P     F
grammar       –    0.86  0.35  0.50
article      0.57  0.46  0.44  0.45
noun         0.68  0.66  0.49  0.56
adjective    0.55  0.80  0.38  0.52
all          0.73  0.67  0.51  0.58

verb definitions:
grammar       –    0.78  0.44  0.56
article      0.43  0.17  0.40  0.24
noun         0.55  0.40  0.59  0.48
adjective    0.45  0.70  0.44  0.54
all          0.61  0.45  0.58  0.51

punctuation definitions:
grammar       –    0.88  0.07  0.12
article      0.58  0.38  0.11  0.17
noun         0.64  0.65  0.09  0.16
adjective    0.49  0.21  0.07  0.10
all          0.63  0.46  0.12  0.19

pronoun definitions:
grammar       –    0.75  0.09  0.16
article      0.61  0.46  0.17  0.25
noun         0.64  0.45  0.16  0.24
adjective    0.47  0.65  0.09  0.16
all          0.63  0.48  0.17  0.25

all definitions:
grammar       –    0.82  0.16  0.26
article      0.76  0.60  0.33  0.43
noun         0.79  0.59  0.37  0.45
adjective    0.73  0.52  0.39  0.45
all          0.79  0.58  0.36  0.44

Table G.1 Results for the linguistic features related to the definiendum

is definitions:
setting              AUC   R     P     F
grammar               –    0.86  0.35  0.50
article              0.70  0.65  0.55  0.60
adjective            0.49  0.35  0.34  0.35
article+adjective    0.70  0.65  0.52  0.58
relative clause      0.73  0.63  0.60  0.61
all                  0.80  0.62  0.62  0.62

verb definitions:
grammar               –    0.78  0.44  0.56
article              0.57  0.46  0.49  0.48
adjective            0.49  0.69  0.45  0.55
article+adjective    0.58  0.48  0.53  0.50
relative clause      0.47  0.64  0.44  0.52
all                  0.55  0.45  0.48  0.47

punctuation definitions:
grammar               –    0.88  0.07  0.12
article              0.56  0.34  0.10  0.16
adjective            0.47  0.21  0.06  0.09
article+adjective    0.55  0.29  0.11  0.16
relative clause      0.59  0.29  0.23  0.25
all                  0.64  0.44  0.16  0.23

pronoun definitions:
grammar               –    0.75  0.09  0.16
article              0.64  0.57  0.15  0.24
adjective            0.49  0.64  0.10  0.17
article+adjective    0.63  0.54  0.16  0.24
relative clause      0.58  0.34  0.21  0.26
all                  0.71  0.48  0.17  0.25

all definitions:
grammar               –    0.82  0.16  0.26
article              0.79  0.58  0.34  0.43
adjective            0.73  0.52  0.39  0.44
article+adjective    0.77  0.59  0.34  0.43
relative clause      0.78  0.58  0.37  0.45
all                  0.80  0.58  0.39  0.47

Table G.2 Results for the linguistic features related to the definiens

is definitions:
setting                  AUC   R     P     F
grammar                   –    0.86  0.35  0.50
paragraph:
  - length paragraph     0.45  0.32  0.29  0.30
  - absolute position    0.61  0.54  0.46  0.50
  - relative position    0.57  0.56  0.45  0.50
  - all                  0.63  0.59  0.47  0.52
position in sentence     0.55  0.83  0.38  0.52
paragraph + sentence     0.67  0.65  0.49  0.56

verb definitions:
grammar                   –    0.78  0.44  0.56
paragraph:
  - length paragraph     0.46  0.35  0.39  0.37
  - absolute position    0.52  0.45  0.47  0.46
  - relative position    0.57  0.44  0.50  0.47
  - all                  0.53  0.40  0.51  0.45
position in sentence     0.55  0.73  0.48  0.57
paragraph + sentence     0.59  0.42  0.54  0.48

punctuation definitions:
grammar                   –    0.88  0.07  0.12
paragraph:
  - length paragraph     0.59  0.60  0.09  0.16
  - absolute position    0.57  0.63  0.09  0.15
  - relative position    0.54  0.68  0.08  0.14
  - all                  0.59  0.63  0.09  0.15
position in sentence     0.61  0.58  0.09  0.15
paragraph + sentence     0.70  0.57  0.12  0.20

pronoun definitions:
grammar                   –    0.75  0.09  0.16
paragraph:
  - length paragraph     0.50  0.28  0.10  0.15
  - absolute position    0.53  0.38  0.11  0.17
  - relative position    0.57  0.38  0.11  0.17
  - all                  0.58  0.38  0.15  0.22
position in sentence     0.68  0.54  0.15  0.24
paragraph + sentence     0.68  0.45  0.14  0.22

all definitions:
grammar                   –    0.82  0.16  0.26
paragraph:
  - length paragraph     0.75  0.52  0.35  0.42
  - absolute position    0.75  0.50  0.39  0.44
  - relative position    0.75  0.54  0.34  0.42
  - all                  0.75  0.52  0.36  0.42
position in sentence     0.78  0.61  0.34  0.44
paragraph + sentence     0.78  0.58  0.33  0.42

Table G.3 Results for the features related to the position of the definition

is definitions:
setting                AUC   R     P     F
grammar                 –    0.86  0.35  0.50
absolute               0.63  0.42  0.49  0.46
absolute + relative    0.62  0.42  0.48  0.45
absolute + balance     0.61  0.42  0.48  0.45
all                    0.63  0.42  0.48  0.45

verb definitions:
grammar                 –    0.78  0.44  0.56
absolute               0.55  0.29  0.46  0.36
absolute + relative    0.54  0.30  0.46  0.36
absolute + balance     0.56  0.31  0.48  0.38
all                    0.55  0.30  0.47  0.36

punctuation definitions:
grammar                 –    0.88  0.07  0.12
absolute               0.57  0.68  0.08  0.15
absolute + relative    0.55  0.65  0.08  0.15
absolute + balance     0.55  0.67  0.08  0.15
all                    0.56  0.67  0.08  0.15

pronoun definitions:
grammar                 –    0.75  0.09  0.16
absolute               0.54  0.59  0.11  0.19
absolute + relative    0.67  0.53  0.16  0.24
absolute + balance     0.54  0.61  0.11  0.19
all                    0.67  0.54  0.16  0.25

all definitions:
grammar                 –    0.82  0.16  0.26
absolute               0.76  0.52  0.36  0.43
absolute + relative    0.77  0.59  0.33  0.42
absolute + balance     0.76  0.51  0.36  0.42
all                    0.77  0.59  0.33  0.42

Table G.4 Results for the features related to the position of the definiendum

H Machine learning results

This appendix presents the results for the machine learning experiments performed with the BRF classifier using the five types of settings. The description of the experiments can be found in Chapter 5.

                     no bigrams             PoS bigrams            PoS+MSI bigrams
setting              AUC  R    P    F       AUC  R    P    F       AUC  R    P    F
grammar               –   0.86 0.35 0.50     –   0.86 0.35 0.50     –   0.86 0.35 0.50
bigrams               –    –    –    –      0.70 0.59 0.49 0.52    0.80 0.59 0.57 0.58
linguistic           0.84 0.67 0.62 0.64    0.84 0.66 0.62 0.64    0.86 0.62 0.65 0.64
connector            0.71 0.66 0.50 0.67    0.74 0.58 0.51 0.54    0.82 0.61 0.66 0.68
position             0.67 0.51 0.46 0.48    0.81 0.63 0.56 0.59    0.83 0.62 0.63 0.62
keywords             0.64 0.46 0.46 0.46    0.77 0.63 0.55 0.59    0.83 0.62 0.62 0.62
layout               0.52 0.17 0.51 0.25    0.68 0.56 0.47 0.51    0.80 0.58 0.58 0.58
ling+conn            0.84 0.66 0.65 0.65    0.83 0.63 0.62 0.62    0.84 0.67 0.64 0.65
ling+pos             0.86 0.70 0.64 0.67    0.89 0.70 0.68 0.69    0.88 0.68 0.69 0.68
ling+lay             0.85 0.66 0.64 0.65    0.84 0.66 0.62 0.64    0.86 0.64 0.66 0.65
ling+kw              0.84 0.68 0.62 0.65    0.85 0.69 0.63 0.66    0.86 0.67 0.68 0.67
conn+pos             0.74 0.55 0.50 0.52    0.80 0.61 0.57 0.59    0.84 0.63 0.62 0.63
conn+lay             0.72 0.64 0.49 0.56    0.77 0.60 0.57 0.58    0.81 0.58 0.63 0.60
conn+kw              0.74 0.60 0.50 0.54    0.78 0.62 0.58 0.60    0.83 0.56 0.62 0.59
pos+lay              0.69 0.52 0.47 0.50    0.79 0.60 0.55 0.57    0.84 0.61 0.64 0.63
pos+kw               0.71 0.59 0.49 0.53    0.81 0.65 0.58 0.61    0.83 0.62 0.64 0.63
kw+lay               0.64 0.48 0.53 0.51    0.77 0.62 0.56 0.59    0.83 0.61 0.64 0.62
conn+ling+pos        0.86 0.71 0.67 0.69    0.87 0.70 0.68 0.69    0.87 0.67 0.67 0.67
conn+ling+lay        0.85 0.65 0.65 0.65    0.84 0.62 0.65 0.64    0.86 0.62 0.65 0.64
conn+ling+kw         0.84 0.67 0.65 0.66    0.86 0.67 0.63 0.65    0.86 0.65 0.66 0.65
conn+pos+lay         0.74 0.56 0.50 0.53    0.78 0.59 0.57 0.58    0.84 0.59 0.62 0.60
conn+pos+kw          0.75 0.59 0.50 0.54    0.80 0.66 0.58 0.61    0.84 0.63 0.61 0.62
conn+lay+kw          0.76 0.59 0.54 0.56    0.80 0.65 0.58 0.61    0.83 0.60 0.63 0.62
ling+pos+lay         0.86 0.70 0.65 0.68    0.89 0.72 0.67 0.69    0.90 0.72 0.67 0.70
ling+pos+kw          0.86 0.70 0.64 0.67    0.88 0.71 0.67 0.69    0.89 0.72 0.69 0.70
ling+lay+kw          0.85 0.68 0.64 0.66    0.86 0.69 0.66 0.68    0.87 0.69 0.66 0.68
pos+lay+kw           0.71 0.59 0.51 0.54    0.81 0.63 0.61 0.62    0.84 0.60 0.64 0.62
conn+ling+pos+lay    0.87 0.70 0.66 0.68    0.88 0.72 0.68 0.70    0.89 0.67 0.71 0.69
conn+ling+pos+kw     0.86 0.70 0.65 0.68    0.88 0.72 0.66 0.68    0.88 0.67 0.66 0.67
conn+ling+lay+kw     0.85 0.66 0.66 0.66    0.85 0.67 0.66 0.66    0.86 0.66 0.66 0.66
conn+pos+lay+kw      0.75 0.60 0.52 0.56    0.80 0.65 0.59 0.62    0.83 0.60 0.62 0.61
ling+pos+lay+kw      0.87 0.72 0.65 0.68    0.88 0.74 0.66 0.70    0.89 0.69 0.71 0.70
all                  0.87 0.72 0.65 0.68    0.89 0.72 0.69 0.70    0.89 0.70 0.69 0.70

Table H.1 Results for is definitions with the BRF classifier

                     no bigrams             PoS bigrams            PoS+MSI bigrams
setting              AUC  R    P    F       AUC  R    P    F       AUC  R    P    F
grammar               –   0.77 0.44 0.56     –   0.77 0.44 0.56     –   0.77 0.44 0.56
bigrams               –    –    –    –      0.64 0.48 0.52 0.50    0.66 0.46 0.57 0.51
connector            0.72 0.54 0.61 0.57    0.72 0.48 0.59 0.53    0.75 0.50 0.62 0.55
position             0.62 0.47 0.53 0.50    0.71 0.51 0.59 0.55    0.74 0.56 0.64 0.59
linguistic           0.63 0.44 0.54 0.49    0.65 0.49 0.54 0.51    0.70 0.48 0.63 0.54
layout               0.52 0.13 0.57 0.21    0.68 0.50 0.58 0.54    0.72 0.49 0.60 0.54
keyword              0.55 0.33 0.51 0.40    0.65 0.48 0.52 0.50    0.64 0.46 0.56 0.50
conn+pos             0.72 0.52 0.60 0.56    0.74 0.50 0.61 0.55    0.75 0.50 0.61 0.55
conn+ling            0.74 0.54 0.63 0.58    0.75 0.50 0.65 0.57    0.75 0.49 0.64 0.56
conn+lay             0.73 0.53 0.61 0.57    0.73 0.49 0.61 0.55    0.76 0.52 0.65 0.57
conn+kw              0.72 0.54 0.62 0.58    0.74 0.53 0.66 0.59    0.72 0.47 0.61 0.53
pos+ling             0.62 0.47 0.51 0.49    0.65 0.48 0.56 0.51    0.72 0.49 0.61 0.54
pos+lay              0.67 0.49 0.57 0.53    0.72 0.53 0.61 0.57    0.74 0.52 0.63 0.57
pos+kw               0.60 0.44 0.50 0.47    0.73 0.54 0.60 0.57    0.72 0.51 0.60 0.55
ling+lay             0.59 0.44 0.49 0.46    0.64 0.48 0.55 0.51    0.71 0.50 0.62 0.55
ling+kw              0.66 0.47 0.54 0.50    0.67 0.52 0.57 0.54    0.72 0.47 0.63 0.54
lay+kw               0.58 0.34 0.54 0.41    0.68 0.49 0.57 0.53    0.69 0.48 0.60 0.53
conn+ling+pos        0.73 0.55 0.61 0.58    0.77 0.57 0.68 0.62    0.75 0.49 0.67 0.57
conn+ling+lay        0.74 0.54 0.64 0.59    0.75 0.53 0.65 0.59    0.75 0.50 0.66 0.57
conn+ling+kw         0.74 0.55 0.63 0.59    0.77 0.53 0.66 0.59    0.78 0.51 0.67 0.58
conn+pos+lay         0.72 0.52 0.60 0.56    0.74 0.49 0.64 0.55    0.77 0.51 0.62 0.56
conn+pos+kw          0.71 0.53 0.60 0.56    0.75 0.49 0.63 0.55    0.76 0.51 0.65 0.57
conn+lay+kw          0.73 0.53 0.63 0.58    0.74 0.49 0.64 0.56    0.76 0.46 0.63 0.53
ling+pos+lay         0.62 0.44 0.50 0.47    0.67 0.48 0.56 0.52    0.71 0.48 0.61 0.54
ling+pos+kw          0.63 0.50 0.53 0.51    0.67 0.48 0.56 0.52    0.75 0.50 0.66 0.57
ling+lay+kw          0.65 0.46 0.54 0.50    0.68 0.51 0.58 0.54    0.70 0.49 0.60 0.54
pos+lay+kw           0.64 0.46 0.53 0.49    0.74 0.56 0.62 0.58    0.71 0.50 0.62 0.55
conn+ling+pos+lay    0.74 0.53 0.62 0.57    0.77 0.52 0.66 0.58    0.77 0.51 0.65 0.57
conn+ling+pos+kw     0.73 0.54 0.63 0.58    0.76 0.51 0.65 0.57    0.78 0.55 0.68 0.61
conn+ling+lay+kw     0.75 0.55 0.64 0.59    0.77 0.53 0.66 0.59    0.77 0.51 0.67 0.58
conn+pos+lay+kw      0.72 0.53 0.60 0.56    0.75 0.50 0.63 0.56    0.78 0.53 0.68 0.60
ling+pos+lay+kw      0.62 0.46 0.50 0.48    0.70 0.51 0.62 0.56    0.70 0.47 0.60 0.53
all                  0.73 0.53 0.62 0.57    0.77 0.51 0.66 0.57    0.77 0.51 0.66 0.58

Table H.2 Results for verb definitions with the BRF classifier

                     no bigrams             PoS bigrams            PoS+MSI bigrams
setting              AUC  R    P    F       AUC  R    P    F       AUC  R    P    F
grammar               –   0.88 0.07 0.12     –   0.88 0.07 0.12     –   0.88 0.07 0.12
bigrams               –    –    –    –      0.77 0.63 0.15 0.24    0.79 0.56 0.18 0.28
position             0.73 0.56 0.14 0.23    0.83 0.66 0.19 0.30    0.82 0.59 0.21 0.31
connector            0.74 0.59 0.13 0.22    0.83 0.62 0.21 0.31    0.81 0.58 0.21 0.30
linguistic           0.72 0.58 0.12 0.20    0.81 0.57 0.18 0.27    0.81 0.49 0.21 0.29
layout               0.64 0.44 0.16 0.23    0.83 0.64 0.19 0.29    0.87 0.63 0.27 0.38
keyword              0.62 0.46 0.09 0.16    0.79 0.63 0.17 0.27    0.80 0.50 0.19 0.28
pos+conn             0.78 0.63 0.16 0.26    0.83 0.63 0.21 0.31    0.86 0.65 0.27 0.38
pos+ling             0.81 0.61 0.17 0.26    0.85 0.63 0.21 0.32    0.84 0.57 0.26 0.35
pos+lay              0.74 0.58 0.16 0.25    0.85 0.66 0.22 0.34    0.84 0.56 0.26 0.36
pos+kw               0.73 0.54 0.15 0.23    0.85 0.69 0.22 0.33    0.83 0.57 0.24 0.33
conn+ling            0.74 0.60 0.15 0.24    0.81 0.64 0.21 0.32    0.81 0.50 0.19 0.28
conn+lay             0.77 0.59 0.16 0.25    0.83 0.66 0.23 0.34    0.86 0.62 0.28 0.38
conn+kw              0.74 0.57 0.15 0.24    0.81 0.62 0.20 0.30    0.82 0.57 0.22 0.32
ling+lay             0.78 0.54 0.16 0.24    0.86 0.67 0.23 0.34    0.86 0.56 0.26 0.36
ling+kw              0.74 0.59 0.14 0.23    0.83 0.60 0.20 0.30    0.82 0.52 0.22 0.31
lay+kw               0.72 0.52 0.14 0.22    0.83 0.61 0.20 0.30    0.86 0.54 0.24 0.34
conn+ling+pos        0.77 0.61 0.16 0.25    0.83 0.62 0.21 0.32    0.83 0.61 0.24 0.35
conn+ling+lay        0.75 0.61 0.16 0.25    0.82 0.65 0.23 0.34    0.85 0.63 0.27 0.37
conn+ling+kw         0.73 0.58 0.15 0.24    0.80 0.63 0.20 0.31    0.84 0.56 0.22 0.32
conn+pos+lay         0.78 0.62 0.16 0.26    0.83 0.66 0.23 0.34    0.88 0.62 0.28 0.38
conn+pos+kw          0.78 0.62 0.16 0.25    0.85 0.64 0.22 0.33    0.85 0.61 0.27 0.37
conn+lay+kw          0.75 0.60 0.16 0.25    0.84 0.63 0.23 0.34    0.86 0.59 0.27 0.37
ling+pos+lay         0.82 0.58 0.18 0.27    0.87 0.66 0.23 0.34    0.88 0.57 0.29 0.39
ling+pos+kw          0.80 0.58 0.16 0.26    0.85 0.65 0.23 0.34    0.84 0.56 0.27 0.37
ling+lay+kw          0.79 0.57 0.16 0.25    0.86 0.62 0.23 0.33    0.86 0.61 0.31 0.41
pos+lay+kw           0.76 0.57 0.16 0.24    0.84 0.62 0.21 0.31    0.86 0.58 0.30 0.40
conn+ling+pos+lay    0.78 0.58 0.16 0.25    0.84 0.65 0.22 0.33    0.86 0.57 0.27 0.37
conn+ling+pos+kw     0.78 0.60 0.16 0.25    0.82 0.64 0.22 0.32    0.84 0.58 0.27 0.37
conn+ling+lay+kw     0.74 0.60 0.16 0.25    0.82 0.60 0.21 0.31    0.85 0.55 0.26 0.35
conn+pos+lay+kw      0.78 0.61 0.16 0.25    0.84 0.63 0.22 0.32    0.85 0.62 0.29 0.40
ling+pos+lay+kw      0.82 0.59 0.18 0.28    0.86 0.63 0.24 0.34    0.86 0.54 0.28 0.37
all                  0.78 0.61 0.17 0.27    0.83 0.67 0.23 0.34    0.86 0.63 0.30 0.40

Table H.3 Results for punctuation definitions with the BRF classifier

                     no bigrams             PoS bigrams            PoS+MSI bigrams
setting              AUC  R    P    F       AUC  R    P    F       AUC  R    P    F
grammar               –   0.73 0.09 0.16     –   0.73 0.09 0.16     –   0.73 0.09 0.16
bigrams               –    –    –    –      0.69 0.42 0.16 0.24    0.66 0.34 0.14 0.20
connector            0.83 0.60 0.24 0.35    0.85 0.54 0.26 0.35    0.81 0.51 0.27 0.35
position             0.70 0.47 0.18 0.27    0.75 0.49 0.20 0.29    0.75 0.48 0.20 0.28
linguistic           0.68 0.45 0.14 0.21    0.74 0.44 0.19 0.26    0.74 0.43 0.20 0.28
keyword              0.59 0.56 0.12 0.20    0.71 0.45 0.18 0.25    0.67 0.36 0.15 0.21
layout               0.42 0.18 0.07 0.10    0.67 0.40 0.15 0.22    0.64 0.38 0.16 0.22
conn+pos             0.82 0.53 0.24 0.33    0.85 0.50 0.26 0.34    0.84 0.54 0.27 0.36
conn+ling            0.82 0.59 0.27 0.37    0.84 0.53 0.28 0.37    0.82 0.50 0.27 0.35
conn+kw              0.83 0.60 0.25 0.35    0.85 0.53 0.29 0.37    0.80 0.43 0.23 0.30
conn+lay             0.83 0.59 0.25 0.35    0.86 0.56 0.28 0.37    0.82 0.49 0.25 0.33
pos+ling             0.72 0.47 0.19 0.27    0.76 0.46 0.21 0.29    0.76 0.46 0.21 0.29
pos+kw               0.70 0.43 0.17 0.24    0.75 0.46 0.20 0.28    0.73 0.47 0.21 0.29
pos+lay              0.71 0.46 0.19 0.27    0.75 0.48 0.20 0.29    0.74 0.44 0.19 0.27
ling+kw              0.70 0.50 0.16 0.25    0.74 0.45 0.20 0.28    0.74 0.40 0.20 0.27
ling+lay             0.70 0.47 0.16 0.24    0.75 0.45 0.20 0.27    0.74 0.41 0.20 0.27
kw+lay               0.57 0.52 0.12 0.19    0.71 0.41 0.16 0.23    0.66 0.34 0.14 0.20
conn+ling+pos        0.82 0.56 0.28 0.37    0.83 0.53 0.27 0.35    0.82 0.48 0.24 0.32
conn+ling+lay        0.84 0.59 0.29 0.39    0.85 0.54 0.29 0.38    0.81 0.47 0.25 0.33
conn+ling+kw         0.83 0.59 0.28 0.38    0.85 0.52 0.29 0.37    0.80 0.47 0.26 0.34
conn+pos+lay         0.82 0.53 0.25 0.34    0.84 0.55 0.27 0.37    0.79 0.49 0.25 0.33
conn+pos+kw          0.82 0.54 0.25 0.34    0.83 0.57 0.27 0.36    0.81 0.47 0.24 0.32
conn+lay+kw          0.84 0.60 0.26 0.36    0.84 0.51 0.28 0.36    0.81 0.50 0.25 0.34
ling+pos+lay         0.72 0.47 0.20 0.28    0.76 0.46 0.22 0.30    0.76 0.44 0.21 0.29
ling+pos+kw          0.72 0.46 0.19 0.27    0.77 0.47 0.22 0.30    0.76 0.46 0.21 0.29
ling+lay+kw          0.72 0.49 0.17 0.26    0.74 0.44 0.20 0.28    0.72 0.38 0.19 0.25
pos+lay+kw           0.70 0.46 0.19 0.27    0.75 0.47 0.22 0.30    0.76 0.45 0.21 0.29
conn+ling+pos+lay    0.83 0.58 0.29 0.38    0.83 0.53 0.28 0.37    0.82 0.52 0.25 0.34
conn+ling+pos+kw     0.82 0.57 0.28 0.38    0.83 0.54 0.28 0.37    0.81 0.48 0.25 0.33
conn+ling+lay+kw     0.84 0.59 0.29 0.39    0.85 0.55 0.30 0.39    0.82 0.51 0.28 0.36
conn+pos+lay+kw      0.82 0.54 0.26 0.35    0.82 0.52 0.26 0.35    0.81 0.52 0.24 0.33
ling+pos+lay+kw      0.73 0.46 0.20 0.28    0.75 0.43 0.21 0.28    0.76 0.45 0.23 0.31
all                  0.84 0.60 0.29 0.39    0.84 0.55 0.29 0.38    0.79 0.47 0.25 0.33

Table H.4 Results for pronoun definitions with the BRF classifier

                     no bigrams             PoS bigrams            PoS+MSI bigrams
setting              AUC  R    P    F       AUC  R    P    F       AUC  R    P    F
grammar               –   0.81 0.16 0.26     –   0.81 0.16 0.26     –   0.81 0.16 0.26
bigrams               –    –    –    –      0.81 0.60 0.35 0.44    0.82 0.56 0.38 0.45
connector            0.84 0.62 0.37 0.46    0.87 0.63 0.44 0.51    0.87 0.60 0.45 0.51
linguistic           0.82 0.60 0.37 0.46    0.86 0.61 0.41 0.49    0.85 0.60 0.43 0.50
position             0.80 0.59 0.34 0.43    0.86 0.61 0.41 0.49    0.86 0.62 0.44 0.52
keyword              0.78 0.56 0.35 0.43    0.83 0.58 0.38 0.46    0.83 0.59 0.39 0.47
layout               0.77 0.54 0.37 0.44    0.83 0.60 0.38 0.46    0.84 0.59 0.39 0.47
conn+ling            0.85 0.64 0.40 0.49    0.88 0.63 0.45 0.53    0.87 0.59 0.47 0.52
conn+pos             0.84 0.62 0.38 0.47    0.87 0.63 0.43 0.51    0.88 0.62 0.45 0.52
conn+lay             0.84 0.63 0.38 0.47    0.87 0.62 0.44 0.51    0.87 0.61 0.46 0.52
conn+kw              0.84 0.63 0.37 0.47    0.88 0.63 0.43 0.51    0.87 0.60 0.46 0.52
ling+pos             0.86 0.61 0.41 0.49    0.88 0.64 0.46 0.54    0.88 0.61 0.46 0.53
ling+lay             0.83 0.62 0.37 0.47    0.87 0.61 0.43 0.51    0.86 0.60 0.44 0.51
ling+kw              0.83 0.61 0.37 0.46    0.87 0.61 0.43 0.51    0.86 0.59 0.45 0.51
pos+lay              0.81 0.59 0.38 0.46    0.86 0.62 0.42 0.50    0.87 0.63 0.44 0.52
pos+kw               0.81 0.59 0.38 0.47    0.86 0.62 0.43 0.51    0.86 0.62 0.44 0.51
lay+kw               0.79 0.57 0.37 0.45    0.85 0.60 0.41 0.49    0.85 0.61 0.41 0.49
conn+ling+pos        0.86 0.65 0.42 0.51    0.89 0.64 0.46 0.54    0.89 0.61 0.48 0.53
conn+ling+lay        0.86 0.64 0.41 0.50    0.88 0.63 0.46 0.53    0.88 0.60 0.48 0.53
conn+ling+kw         0.85 0.63 0.41 0.50    0.88 0.64 0.46 0.53    0.88 0.60 0.48 0.53
conn+pos+lay         0.84 0.62 0.39 0.48    0.88 0.63 0.44 0.52    0.89 0.63 0.48 0.54
conn+pos+kw          0.84 0.63 0.38 0.47    0.88 0.65 0.44 0.53    0.88 0.62 0.47 0.53
conn+lay+kw          0.84 0.64 0.39 0.48    0.88 0.64 0.45 0.53    0.88 0.60 0.46 0.52
ling+pos+lay         0.86 0.62 0.42 0.50    0.89 0.65 0.47 0.55    0.88 0.61 0.48 0.54
ling+pos+kw          0.86 0.61 0.41 0.49    0.89 0.62 0.46 0.53    0.87 0.61 0.46 0.52
ling+lay+kw          0.84 0.61 0.38 0.47    0.88 0.61 0.45 0.52    0.87 0.61 0.45 0.52
pos+lay+kw           0.82 0.61 0.38 0.47    0.87 0.62 0.45 0.52    0.87 0.61 0.44 0.51
conn+ling+pos+lay    0.87 0.64 0.43 0.51    0.89 0.64 0.47 0.54    0.89 0.61 0.47 0.54
conn+ling+pos+kw     0.87 0.65 0.42 0.51    0.89 0.64 0.47 0.54    0.89 0.61 0.48 0.54
conn+ling+lay+kw     0.86 0.63 0.42 0.51    0.89 0.64 0.48 0.55    0.88 0.59 0.48 0.53
conn+pos+lay+kw      0.84 0.62 0.39 0.48    0.88 0.65 0.46 0.54    0.88 0.62 0.48 0.54
ling+pos+lay+kw      0.86 0.62 0.43 0.51    0.89 0.64 0.47 0.54    0.88 0.62 0.48 0.54
all                  0.87 0.64 0.44 0.52    0.89 0.64 0.48 0.55    0.89 0.63 0.50 0.56

Table H.5 Results for all definitions with the BRF classifier

Bibliography

I. Androutsopoulos and D. Galanis. A practically unsupervised learning method to identify single-snippet answers to definition questions on the web. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 323–330, 2005.

Aristotle. Posterior Analytics - Book II. 71a. Translated by J. Barnes (2007). Oxford Clarendon Press.

Aristotle. Topics - Book I. 100a. Translated by R. Smith (1997). Oxford University Press.

G. Barnbrook. Defining language: a local grammar of definition sentences. John Benjamins Publishing Company, 2002.

R. Bekkerman and J. Allan. Using bigrams in text categorization. Technical report, Technical Report IR, 2003.

J. Berghmans. Wotan - een probabilistische grammatikale tagger voor het Nederlands. Master’s thesis, TOSCA Research Group, University of Nijmegen, Nijmegen, The Netherlands, 1995.

S. Blair-Goldensohn, K. R. McKeown, and A. Hazen Schlaikjer. New directions in Question Answering, chapter Answering definitional questions: A hybrid approach. AAAI Press, 2004.

C. Borg, M. Rosner, and G. Pace. Evolutionary algorithms for definition extraction. In Proceedings of the Workshop Definition Extraction (wDE) at RANLP 2009, 2009.

R. Borsodi. The definition of definition. Porter Sargent Publisher, 1967.

A. van den Bosch and W. Daelemans. Memory-based morphological analysis. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 285–292, 1999.

A.P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145–1159, 1997.

L. Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001.

L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and regression trees. Wadsworth, 1984.

P.W. Bridgman. The logic of modern physics. Macmillan, 1928.

L. Carroll. The annotated Alice. Alice’s adventures in wonderland & through the looking-glass. W.W. Norton & Company, Inc., 1999.

O. Chapelle, B. Schölkopf, and A. Zien. Semi-supervised learning. MIT Press, 2006.

E. Charniak. A maximum-entropy-inspired parser. In Proceedings of NAACL-2000, 2000.

N.V. Chawla. Data mining for imbalanced datasets: An overview. In O.Z. Maimon and L. Rokach, editors, Data Mining and Knowledge Discovery Handbook, pages 853–867. Springer, 2005.

N.V. Chawla, K.W. Bowyer, L.O. Hall, and W.P. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16(1):321–357, 2002.

C. Chen, A. Liaw, and L. Breiman. Using Random Forest to learn imbalanced data. Technical Report 666, University of California, Berkeley, 2004. URL http://www.stat.berkeley.edu/tech-reports/666.pdf.

K.W. Church and W.A. Gale. Poisson mixtures. Natural Language Engi- neering, 1(2):163–190, 1995a.

K.W. Church and W.A. Gale. Inverse Document Frequency (IDF): A measure of deviations from Poisson. In Proceedings of the Third Work- shop on Very Large Corpora, pages 121–130, 1995b.

W.W. Cohen. Fast effective rule induction. In Machine Learning: Proceedings of the 12th International Conference, pages 115–123, 1995.

P.A. Coppen, W. Haeseryn, and F. de Vriend. E-ANS, 2002. URL http://www.let.ru.nl/ans/e-ans/index.html.

H. Cui, M.Y. Kan, and T.S. Chua. Soft pattern matching models for definitional question answering. ACM Transactions on Information Systems (TOIS), 25(2):8, 2007.

W. Daelemans, J. Zavrel, and P. Berck. Part-of-speech tagging for Dutch with MBT, a memory-based tagger generator. In Informatiewetenschap, pages 33–40, 1996a.

W. Daelemans, J. Zavrel, P. Berck, and S. Gillis. MBT: A memory-based part of speech tagger generator. In E. Ejerhed and I. Dagan, editors, Proceedings of the Fourth Workshop on Very Large Corpora, pages 14–27, 1996b.

W. Daelemans, J. Zavrel, K. van der Sloot, and A. van den Bosch. TiMBL: Tilburg Memory-Based Learner, 2007.

Ł. Degórski, M. Marcińczuk, and A. Przepiórkowski. Definition extraction using a sequential combination of baseline grammars and machine learning classifiers. In Proceedings of LREC 2008, 2008.

R. Del Gaudio and A. Branco. Automatic extraction of definitions in Portuguese: A rule-based approach. In Proceedings of the 2nd Workshop on Text Mining and Applications at EPIA 2007, 2007.

R. Del Gaudio and A. Branco. Extraction of definitions in Portuguese: An imbalanced data set problem. In Proceedings of Text Mining and Applications at EPIA 2009, 2009.

F. van Eynde, J. Zavrel, and W. Daelemans. Part of Speech tagging and lemmatisation for the Spoken Dutch Corpus. In Proceedings of LREC 2000, pages 1427–1433, 2000.

I. Fahmi and G. Bouma. Learning to identify definitions using syntactic features. In R. Basili and A. Moschitti, editors, Proceedings of the EACL workshop on Learning Structured Information in Natural Language Applications, 2006.

T. Fawcett. ROC graphs: Notes and practical considerations for researchers. Machine Learning, 31, 2004.

C. Fellbaum, editor. WordNet – An electronic lexical database. MIT Press, 1998.

E. Frank, G.W. Paynter, I.H. Witten, C. Gutwin, and C.G. Nevill-Manning. Domain-specific keyphrase extraction. In International Joint Conference on Artificial Intelligence, volume 16, pages 668–673. Lawrence Erlbaum Associates Ltd, 1999.

J.D. Gergonne. Essai sur la théorie des définitions. In Annales de mathématiques pures et appliquées, volume 9, 1818.

A. Gupta. Definitions, 2008. URL http://plato.stanford.edu/entries/definitions/.

J. Han and M. Kamber. Data mining: concepts and techniques. Morgan Kaufmann, 2006.

K.S. Han, Y.I. Song, K. Sang-Bum, and H.C. Rim. Answer extraction and ranking strategies for definitional question answering using linguistic features and definition terminology. Information Processing and Management, 45:353–364, 2007.

M. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of COLING 92, pages 539–545, 1992.

I. Hendrickx, G. Bouma, F. Coppens, W. Daelemans, V. Hoste, G. Kloosterman, A. Mineur, J. Van Der Vloet, and J.L. Verschelde. Coreference resolution for extracting answers for Dutch. In Proceedings of LREC 2008, Marrakech, Morocco, 2008.

V. Hoste, I. Hendrickx, and W. Daelemans. Disambiguation of the neuter pronoun and its effect on pronominal coreference resolution. Lecture Notes in Artificial Intelligence, 4629:48–55, 2007. Proceedings of the 10th International Conference on Text, Speech and Dialogue, Plzen, Czech Republic.

S. Hunston and J. Sinclair. A local grammar of evaluation. Evaluation in Text: Authorial stance and the construction of discourse, pages 74–101, 2000.

N. Ide and K. Suderman. XML Corpus Encoding Standard, document XCES 0.2. Technical report, Department of Computer Science, Vassar College, and Equipe Langue et Dialogue, New York, USA and LORIA/CNRS, Vandoeuvre-lès-Nancy, France, 2002. http://www.xces.org/.

A. Iftene, D. Trandabăț, and I. Pistol. Grammar-based automatic extraction of definitions. Applications for Romanian. In Workshop proceedings RANLP 2007, 2007.

W.E. Johnson. Logic. The University Press, 1921.

H. Joho and M. Sanderson. Large scale testing of a descriptive phrase finder. In Proceedings of the 1st Human Language Technology Conference, pages 219–221, 2001.

H. Joho and M. Sanderson. Retrieving descriptive phrases from large amounts of free text. In Proceedings of the ninth international conference on Information and knowledge management, pages 180–186. ACM Press New York, NY, USA, 2000.

G. Kennedy. An introduction to corpus linguistics. Addison Wesley Longman Limited, 1998.

J. J. Klavans and S. Muresan. Definder: Rule-based methods for the extraction of medical terminology and their associated definitions from on-line text. American Medical Informatics Association, 2000.

Ł. Kobyliński and A. Przepiórkowski. Definition extraction with Balanced Random Forests. In B. Nordström and A. Ranta, editors, Advances in Natural Language Processing: Proceedings of the 6th International Conference on Natural Language Processing, GoTAL 2008, pages 237–247. Springer Verlag, LNAI series 5221, 2008.

S.B. Kotsiantis. Supervised machine learning: A review of classification techniques. Informatica, 31:249–268, 2007.

M. Kubat, R. Holte, and S. Matwin. Machine learning for the detection of oil spills in satellite radar images. Machine Learning, 30:195–215, 1998.

D. Laertius. The lives and opinions of eminent philosophers. URL http://classicpersuasion.org/pw/diogenes/dldiogenes.htm. Translated by C.D. Yonge (1853).

L. Lemnitzer and P. Monachesi. Extraction and evaluation of keywords from learning objects – a multilingual approach. In Proceedings of LREC 2008, 2008.

L. Lemnitzer, C. Vertan, A. Killing, K. Simov, D. Evans, D. Cristea, and P. Monachesi. Improving the search for learning objects with key- words and ontologies. In Proceedings of ECTEL 2007. Springer Verlag, 2007.

C.I. Lewis. Mind and the world-order. Charles Scribner’s Sons Chicago, 1929.

R. Likert. A technique for the measurement of attitudes. Archives of Psychology, 140:1–55, 1932.

C.X. Ling, J. Huang, and H. Zhang. AUC: a statistically consistent and more discriminating measure than accuracy. In International Joint Conference on Artificial Intelligence, volume 18, pages 519–526. Lawrence Erlbaum Associates Ltd, 2003.

J. Locke. An essay concerning human understanding. 1690. Translated by P.H. Nidditch (1975). Oxford University Press.

C.E. Metz. Basic principles of ROC analysis. Seminars in nuclear medicine, 8(4):283–298, 1978.

S. Miliaraki and I. Androutsopoulos. Learning to identify single-snippet answers to definition questions. In Proceedings of COLING 2004, pages 1360–1366, 2004.

T. M. Mitchell. Machine learning. McGraw-Hill, 1997. BIBLIOGRAPHY 263

P. Monachesi, L. Lemnitzer, and K. Simov. Language Technology for eLearning. In W. Nejdl and K. Tochtermann, editors, Proceedings of EC-TEL 2006, pages 667–672. Springer LNCS, 2006.

S. Muresan and J. Klavans. A method for automatically building and evaluating dictionary resources. In Proceedings of the Language Re- sources and Evaluation Conference, 2002.

W.T. Parry and E.A. Hacker. Aristotelian Logic. SUNY Press, 1991.

S.C. Pepper. The Basis of Criticism in the Arts. Harvard University Press, 1945.

Plato. Republic - Book VII. 514a-541b. Translated by R.E. Allen (2006). Yale University Press.

J. Prager, D. Radev, and K. Czuba. Answering what-is questions by vir- tual annotation. In Proceedings of the 1st Human Language Technology Conference, pages 26–30, 2001.

J. Prager, J. Chu-Carroll, and K. Czuba. Use of WordNet hypernyms for answering What-Is-questions. In Proceedings of TREC-2001, 2002.

A. Przepiórkowski, Ł. Degórski, M. Spousta, K. Simov, P. Osenova, L. Lemnitzer, V. Kubon, and B. Wójtowicz. Towards the automatic extraction of definitions in Slavic. In Proceedings of BSNLP workshop at ACL, 2007a.

A. Przepiórkowski, Ł. Degórski, and B. Wójtowicz. Towards the auto- matic extraction of definitions in Polish. In Proceedings of LTC 2007, 2007b.

J. Rebeyrolle and L. Tanguy. Repérage automatique de structures lin- guistiques en corpus: le cas des énoncés définitoires. Cahiers de Gram- maire, 25:153–174, 2001.

J. Renkema. Schrijfwijzer: Handboek voor duidelijk taalgebruik. Sdu Uit- gevers, Den Haag, 2002.

C.J. van Rijsbergen. Information Retrieval. Butterworth, 1979. 264 BIBLIOGRAPHY

R. Robinson. Definitions. Oxford University Press, 1972.

H. Saggion. Identifying definitions in text collections for question an- swering. In Proceedings of the Language Resources and Evaluation Con- ference, 2004.

G. Salton and C. Buckley. Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523, 1988.

R.E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2):135–168, 2000.

F. Sclano and P. Velardi. Termextractor: a web application to learn the shared terminology of emergent web communities. In Proceedings of the 3rd International Conference on Interoperability for Enterprise Software and Applications (I-ESA), pages 28–30. Springer, 2007.

G. Sierra, R. Alarcon, C. Aguilar, and C. Bach. Definitional verbal pat- terns for semantic relation extraction. Terminology, 14(1):74–98, 2008.

A. Singhal, G. Salton, and C. Buckley. Length normalization in de- graded text collections. In Fifth Annual Symposium on Document Anal- ysis and Information Retrieval, pages 149–162. Cornell University, 1995.

A. Singhal, C. Buckley, and M. Mitra. Pivoted document length nor- malization. In Proceedings of the Nineteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 21–29, 1996.

B. de Spinoza. Ethics. 1677. Translated by G.H.R. Parkinson (2000). Oxford University Press.

A. Storrer and S. Wellinghof. Automated detection and annotation of term definitions in German text corpora. In Proceedings of LREC 2006, 2006.

C.M. Tan, Y.F. Wang, and C.D. Lee. The use of bigrams to enhance text categorization. Information Processing and Management, 38(4):529–546, 2002. BIBLIOGRAPHY 265

H. Tanev. Socrates – a Question Answering prototype for Bulgarian. In Proceedings of RANLP, 2004.

R. Tobin. Lxtransduce, a replacement for fsgmatch, 2005. http://www.ltg.ed.ac.uk/~richard/ltxml2/ lxtransduce-manual.html.

P. Velardi, R. Navigli, and P. D’Amadio. Mining the web to create spe- cialized glossaries. IEEE Intelligent Systems, 23(5):18–25, 2008.

E.M. Voorhees. The TREC question answering track. Natural Language Engineering archive, 7(4):361–378, 2002.

S. Walter and M. Pinkal. Automatic extraction of definitions from Ger- man court decisions. In Proceedings of the workshop on information ex- traction beyond the document, pages 20–28, 2006.

G.M. Weiss. Mining with rarity: a unifying framework. ACM SIGKDD Explorations Newsletter, 6(1):7–19, 2004.

E.N. Westerhout. Extraction of definitions using grammar-enhanced machine learning. In Proceedings of the Student Research Workshop at EACL 2009, pages 88–96, Athens, Greece, 2009a. Association for Computational Linguistics.

E.N. Westerhout. Definition extraction using linguistic and structural features. In Proceedings of the Workshop Definition Extraction (wDE) at RANLP 2009, 2009b.

E.N. Westerhout and P. Monachesi. Extraction of Dutch definitory con- texts for elearning purposes. In Proceedings of CLIN 2006, 2007a.

E.N. Westerhout and P. Monachesi. Combining pattern-based and ma- chine learning methods to detect definitions for elearning purposes. In Proceedings of RANLP 2007 Workshop “Natural Language Processing and Knowledge Representation for eLearning Environments”, 2007b.

E.N. Westerhout and P. Monachesi. Creating glossaries using pattern- based and machine learning techniques. In Proceedings of LREC 2008, 2008. 266 BIBLIOGRAPHY

I. Witten and E. Frank. Data mining: Practical machine learning tools and techniques. Morgan Kaufman Publishers, 2005.

I.H. Witten, G.W. Paynter, E. Frank, C. Gutwin, C.G. Nevill-Manning, and N.Z. Hamilton. KEA: Practical automatic keyphrase extraction. pages 254–256, 1999.

Lei Yu and Huan Liu. Efficient feature selection via analysis of rele- vance and redundancy. Journal of Machine Learning Research, 5:1205– 1224, 2004. Samenvatting

This dissertation focuses on the automatic extraction of definitions. A definition is a description of the meaning of a term. Automatically identifying definitions is a relevant task for various applications. The context on which this dissertation concentrates is the semi-automatic creation of glossaries. A glossary is a list of definitions of the most important terms discussed in a text. A glossary supports the study of a text, because it brings together the main terms of a document and presents their meanings in an accessible way.

Since a glossary is not available for every text, learners often have to search for definitions themselves. To do so, they must be able to select the best definition from the various candidates, that is, a definition that is correct and fits the domain of the learning text. An additional problem is posed by new terms, for which no definitions exist yet. Learners would therefore benefit greatly if a glossary could be generated automatically from the learning text. In this study we present a method to perform this task semi-automatically. The development of this method started within the European project 'Language Technology for eLearning' (LT4eL), in which language technology applications were developed to support eLearning activities.

Before a method can be developed, it is first necessary to define what counts as a definition. A relevant distinction here is that between definitions in the narrow and in the broad sense. A definition in the narrow sense gives the exact meaning of a term, which is important in a dictionary, for example. For a definition in the broad sense it suffices that a general description is given, for example by indicating the category to which a term belongs. The distinction between global and exact definitions is gradual. For glossaries, definitions in both the narrow and the broad sense are useful.

In order to distinguish definitions from non-definitions, chapter 2 examines existing classifications of definitions. The most common approach is based on semantic properties of definitions and classifies them according to the way in which the term is clarified (e.g. the use of synonyms, relational definitions, contextual definitions). Since semantic properties are hard to identify automatically, this study adopts a pattern-based approach instead.

In this approach, a definition is assumed to consist of at least three parts: the concept to be defined (the 'definiendum'), the description of that concept (the 'definiens'), and a phrase connecting the definiendum and the definiens (the 'connector'). Based on the different types of connectors found in a collection of 600 manually selected definitions, the pattern-based method distinguishes four definition types: is, verb, punctuation, and pronoun definitions, illustrated in the examples in (16) and in the sketch that follows them:

(16) a. is definition:
[Gnuplot]DEFINIENDUM [is]CONN [een programma om grafieken te maken.]DEFINIENS
('Gnuplot is a program for creating graphs.')

b. verb definition:
[eLearning]DEFINIENDUM [omvat]CONN [hulpmiddelen en toepassingen die via het internet beschikbaar zijn en creatieve mogelijkheden bieden om de leerervaring te verbeteren.]DEFINIENS
('eLearning comprises tools and applications that are available via the internet and offer creative opportunities to improve the learning experience.')

c. punctuation definition:
[Passen]DEFINIENDUM [:]CONN [plastic kaarten voorzien van een magnetische strip, die door een gleuf gehaald worden, waardoor de gebruiker zich kan identificeren en toegang krijgt tot bepaalde faciliteiten.]DEFINIENS
('Passes: plastic cards equipped with a magnetic strip, which are swiped through a slot so that users can identify themselves and gain access to certain facilities.')

d. pronoun definition:
[Dedicated readers]DEFINIENDUM. [Dit zijn]CONN [speciale apparaten, ontwikkeld met het exclusieve doel e-boeken te kunnen lezen.]DEFINIENS
('Dedicated readers. These are special devices, developed with the exclusive aim of reading e-books.')
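The connector patterns are implemented in the thesis as lxtransduce grammars over linguistically annotated documents (chapter 3). Purely as an illustration of the underlying idea, and not of the actual grammars, a minimal Python sketch of an is pattern over raw text could look as follows; the regular expression and function name are invented for this example:

    import re

    # Minimal sketch of the pattern-based idea for "is" definitions: a
    # definiendum, the connector "is"/"zijn", and a definiens starting
    # with a noun phrase. The real system matches grammars against
    # POS-tagged XML rather than raw strings; this regular expression
    # is a deliberately rough surface approximation.
    IS_PATTERN = re.compile(
        r"^(?P<definiendum>\w[\w-]*(?:\s[\w-]+){0,3}?)"  # rough 1-4 word NP
        r"\s(?P<connector>is|zijn)\s"                    # connector phrase
        r"(?P<definiens>(?:een|de|het)\s.+)$"            # NP-initial definiens
    )

    def match_is_definition(sentence):
        """Return (definiendum, connector, definiens) or None."""
        m = IS_PATTERN.match(sentence.strip())
        if m is None:
            return None
        return m.group("definiendum"), m.group("connector"), m.group("definiens")

    print(match_is_definition("Gnuplot is een programma om grafieken te maken."))
    # ('Gnuplot', 'is', 'een programma om grafieken te maken.')

On sentence (16a) this yields the triple shown in the final comment. A surface pattern like this overgenerates heavily, which is precisely the problem that the grammars mitigate by using linguistic annotation and that the machine learning step addresses further.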

The lexico-syntactic patterns of the definiendum and of the beginning of the definiens also turn out to be limited in number. As the examples in (16) show, the definiendum is generally a noun phrase, and the definiens often starts with one as well. Besides these frequent patterns, the definitions also contain less common ones (e.g. the use of adverbial phrases and of appositions). The connector phrases and the lexico-syntactic patterns have been integrated into four grammars that are used to distinguish the four definition types from non-definitions (see chapter 3). With these grammars it proves possible to find the majority of the definitions.

A drawback of the pattern-based method is that the connector phrases are also frequently used in non-definitions. This is a problem in particular for the punctuation and pronoun definitions. To solve it, the pattern-based method has been complemented with a second step, in which sentences that are not definitions are filtered out of the data sets as far as possible. To this end, chapter 4 investigates, per definition type, whether a number of text properties behave differently in definitions and in non-definitions (a schematic sketch of these property groups follows the list):

1. Connector properties: the number of incorrectly extracted sentences turns out to correlate with the connector phrase used. In addition, we look at the linguistic category of the words immediately to the left and to the right of the connector phrase.

2. Linguistic properties of definiendum and definiens: in the definiendum we examined the type of article (de, het, een), the type of adjective (positive, comparative, superlative), and the type of noun (proper or common, singular or plural). For the definiens we examined the article and the adjective of the first noun phrase, as well as the possible presence of a relative clause. Considerable differences between definitions and non-definitions exist for these features, especially in the is definitions; the differences are smallest in the verb definitions.

3. Positional properties: compared to non-definitions, definitions occur relatively more often at the beginning of a paragraph, especially the is definitions. A second positional aspect that has been investigated concerns the position of the definiendum within the text: the definiendum is used relatively more often after it has been defined than before. This holds for all definition types.

4. Layout properties: with respect to the layout of the definition, it is striking that punctuation definitions are used relatively often in lists, whereas the other definition types mainly occur within paragraphs. An investigation of the layout properties of the definiens showed that specific layout properties are used more often in definitions than in non-definitions.

5. Keyword properties: for the is definitions, a higher keyword score indicates that a sentence is more likely to be a definition. For the other types this difference is less pronounced.
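Purely as a schematic illustration (the actual feature extraction is described in chapter 4), the five property groups could be flattened into a single feature representation per candidate sentence along the following lines; all field and feature names are hypothetical:

    from dataclasses import dataclass

    @dataclass
    class Candidate:
        connector: str            # 1. connector phrase matched by the grammar
        left_pos: str             #    POS of the word left of the connector
        right_pos: str            #    POS of the word right of the connector
        definiendum_article: str  # 2. 'de', 'het', 'een' or 'none'
        first_in_paragraph: bool  # 3. does the sentence open its paragraph?
        term_seen_before: bool    #    definiendum already used earlier in the text?
        in_list: bool             # 4. does the sentence appear inside a list?
        keyword_score: float      # 5. keyword score of the definiendum

    def features(c: Candidate) -> dict:
        """Flatten the five property groups into one feature dictionary."""
        return {
            "conn=" + c.connector: 1,
            "left=" + c.left_pos: 1,
            "right=" + c.right_pos: 1,
            "art=" + c.definiendum_article: 1,
            "para_initial": int(c.first_in_paragraph),
            "seen_before": int(c.term_seen_before),
            "in_list": int(c.in_list),
            "kw_score": c.keyword_score,
        }

A dictionary of this kind can then be fed to any standard classifier implementation; which algorithms and parameters were actually used is discussed in chapters 4 and 5.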

Machine learning classifiers have been trained on the basis of these five property groups (chapter 5). The connector, linguistic, and positional properties generally prove most relevant for the classification, whereas the layout and keyword properties are individually least suited to classifying definitions successfully. When different kinds of properties are combined, however, the latter do add relevant information. For the punctuation definitions in particular, the layout features contribute substantially to the classification.

In addition to these five property groups, the usefulness of linguistic bigrams for the classification has been examined, distinguishing part-of-speech tag bigrams from morpho-syntactic bigrams (a small sketch of the two variants follows at the end of this summary). When only bigrams are used, the classification results are often comparable to or worse than those obtained with the best individual settings. Using only the bigrams does, however, provide a good basis that can be improved further by adding the individual settings to it. Which information adds the most value depends partly on the definition type; in general, the connector, linguistic, and positional properties are the most relevant. The type of bigrams used often does not matter when several properties are added: in that case the general part-of-speech tag bigrams often perform comparably to the morpho-syntactic bigrams. When only one or two properties are used alongside the bigrams, it is usually better to use the more detailed morpho-syntactic bigrams.

Chapter 6 concludes the dissertation with conclusions, a discussion of the method developed, and some suggestions for future research on definition extraction. The results show that the method proposed here, which combines patterns for definitions with machine learning techniques over the properties mentioned above, yields results equal or superior to those of other approaches. Furthermore, the diversity of the patterns identified with our method is large compared to most existing methods. The distinction between the four definition types made in the pattern-based approach is not useful when machine learning techniques are applied; in that case it is better to use a single data set for all types.

In conclusion, with the presented method for the semi-automatic creation of glossaries it is possible to automatically detect 63% of the definitions of the four types in a learning text. Although a number of non-definitions are extracted as well, it is a simple task for the user to select the correct definitions and thus create a glossary semi-automatically.
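As announced above, the following sketch illustrates the difference between the two bigram variants; the tags are invented for the illustration and do not reflect the actual Dutch tag set used in the thesis:

    def bigrams(tags):
        """Return the adjacent tag pairs of a tagged sentence."""
        return list(zip(tags, tags[1:]))

    # Coarse part-of-speech tags for 'Gnuplot is een programma':
    pos_tags = ["N", "V", "Art", "N"]
    # Morpho-syntactic tags add detail such as number and definiteness:
    msd_tags = ["N(sg)", "V(pres,sg)", "Art(indef)", "N(sg)"]

    print(bigrams(pos_tags))  # [('N', 'V'), ('V', 'Art'), ('Art', 'N')]
    print(bigrams(msd_tags))  # [('N(sg)', 'V(pres,sg)'), ...]

The morpho-syntactic variant produces more fine-grained, and hence sparser, features, which fits the observation above that it pays off mainly when few other properties are used alongside it.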

Curriculum Vitae

Eline Westerhout was born in Lienden on the 21st of September, 1983. After obtaining her Atheneum diploma at Het Wartburgcollege, she started the bachelor's programme 'Communication studies' at the University of Tilburg, specializing in 'Language and Artificial Intelligence'. In 2004 she obtained her bachelor's degree (cum laude). She then enrolled in the master's programme 'Language and Speech Technology' at Utrecht University and obtained her master's degree in December 2005 (cum laude) with a thesis entitled A corpus of Dutch aphasic speech: sketching the design and performing a pilot study. In the same month, Eline started working on the European project 'Language Technology for eLearning' (LT4eL), where she supported the project coordinator in management and coordination tasks and was also actively involved in work packages on corpus construction and annotation, keyword extraction, glossary creation, validation, and ontology development. After the project ended, she was appointed in 2008 as a PhD candidate at the Utrecht Institute of Linguistics to write a dissertation on definition extraction for the semi-automatic creation of glossaries. The results of her research are reflected in this book. In September 2009, she joined the European project 'Language Technologies for Lifelong Learning' (LTfLL).