Automatic Acquisition of Semantic Classes for Adjectives
Total Page:16
File Type:pdf, Size:1020Kb
Automatic acquisition of semantic classes for adjectives Ph. D. thesis by Gemma Boleda Torrent Under the supervision of Toni Badia Cardús and Sabine Schulte im Walde Pompeu Fabra University Barcelona, December 2006 Dipòsit legal: B.53873-2007 ISBN: 978-84-691-1206-9 Abstract This thesis concerns the automatic acquisition of semantic classes for adjectives. Our work builds on two hypotheses: first, that some aspects of the semantics of adjectives are not to- tally unpredictable, but correspond to a set of denotational types (semantic classes). Therefore, adjectives can be grouped together according to their semantic class. Second, that the seman- tic class of an adjective can be traced in more than one linguistic level. In particular, the morphology-semantics and syntax-semantics interfaces are explored for clues that lead to the acquisition of the targeted semantic classes. Since we could not rely on a previously established classification, a major effort is devoted to defining an adequate classification. The classification proposal is reached through an iterative methodology. By combining deductive and inductive approaches, we evolve from an initial classification based on literature review to a final classification proposal that takes advantage of the insight gained through a set of experiments. Each iteration consists of three steps: (a) a classification proposal is made; (b) a number of clas- sification experiments, involving human subjects and machine learning techniques, are carried out; (c) through the analysis of the experiments, advantages and drawbacks of the classification proposal are identified. We present a total of three iterations. The first one starts with a classification based on liter- ature review. A Gold Standard is built and tested against results obtained through a series of unsupervised experiments. The analysis suggests a refinement of the classification. The second iteration uses the refined classification and validates it through a new annotation task performed by human subjects and a further set of unsupervised experiments. The third iteration incorporates three significant modifications: first, a large-scale human anno- tation task with 322 human subjects is carried out. For this task, the estimated agreement (K 0.31 to 0.45) is very low for academic standards. Thus, the achievement of reliable linguistic data is revealed as a major bottleneck. Second, the architecture used for the automatic classifi- cation of adjectives is modified so that it allows for the acquisition of multiple classes, so as to account for polysemy. The best results obtained, 84% accuracy, improve upon the baseline by a raw 33%. Third, a systematic comparison between different levels of linguistic description is performed to assess their role in the acquistion task at hand. ii Acknowledgements I am grateful to my supervisors, Toni Badia and Sabine Schulte im Walde, for their support and advice during the development of this thesis. Toni has wisely guided me through the whole process, and, even in his busiest times, has always been there when I needed him. His insightful comments have provided keys to analysis that I would not have found on my own. Sabine has provided valuable advice in methodological and technical issues and has given me a lot of emotional support. She has always had a global vision of this piece of research and its value that has helped me make sense of it. Louise McNally is a model to me, as a researcher but also as a person. I am grateful to her for many discussions about semantics, and for leading me into fruitful joint research. I am also thankful to Stefan Evert for many discussions about several aspects of my work, and particularly for thorough discussions and many ideas regarding the assessment of interrater agreement. The first piece of research I was involved in was a joint paper with Laura Alonso. Many of her ideas, such as using clustering for the classification of adjectives, shaped the initial steps of this thesis. During the PhD, I was on two occasions a visiting researcher at the Department of Computa- tional Linguistics and Phonetics (CoLi) in Saarland University. I am very grateful to Manfred Pinkal and the SALSA team for letting me collaborate in their project. In particular, I learned a lot from Sebastian Padó and Katrin Erk. Alissa Melinger provided me with very useful advice for the design of the web experiment reported in this thesis. The work reported has also ben- efited from discussions with Eloi Batlle, Nedjet Bouayad, Gretel De Cuyper, Teresa Espinal, Adam Kilgarrif, Alexander Koller, Maite Melero, Oana Postolache, and Suzanne Stevenson. I also received helpful feedback when presenting my work at several seminars and conferences. I am grateful to the GLiCom members for providing a nice working atmosphere and for help in many aspects of the PhD. Eva Bofias introduced me into Perl –her 4-page manual was all I needed for my scripts for many months– and into LaTeX. Stefan Bott has helped me many times with programming difficulties. Àngel Gil, Laia Mayol, and Martí Quixal have annotated data and have entered into useful discussion about the difficulties involved in the semantic classification of adjectives. Outside GLiCom, the “Barcelona adjective working group” members, Roser Sanromà and Clara Soler, have also participated in annotation. I am specially indebted to Roser for lend- ing me the manually annotated database of adjectives she spent so many hours developing, and for spending three afternoons as part of the expert committee annotating the last Gold Stan- dard. I also want to express my gratitude to the (over 600!) people who took part in the web experiment. I am infinitely grateful for the existence of much freely available software that has made pro- ducing, analysing, and reporting the results of this thesis much easier –or even possible. Some of the tools I find most useful are CLUTO, the IMS Corpus WorkBench, R, TeXnicCenter, Weka, and, of course, Perl. The research reported here has also received the support of some institutions. Special thanks are due to the Institut d’Estudis Catalans for lending the research corpus to the Pompeu Fabra Uni- iii Acknowledgements versity. The financial support of the Pompeu Fabra University and its Translation and Philol- ogy Department, the Generalitat de Catalunya and the Fundación Caja Madrid is gratefully acknowledged. The administrative staff of the Translation and Philology Department, and in particular Susi Bolós, have helped me solve so many burocratic issues over the last six years that it is impossible to list them here. On a personal note, I thank my friends and family for their love and support. The Associació 10 de cada 10 has hosted very entertaining Wednesday evenings for over a decade now. My parents have helped me in my PhD not only as most parents do, by being enormously proud of me. My father taught me how to program when I still was a Philology student, and has even helped in implementation work for the PhD. My mother’s experience as a researcher has made it easier to make my way through the “research way of life”, and she has helped me overcome many a bottleneck in the development of the thesis. I am very lucky to have met Martí, my husband. He has had to suffer endless explanations and doubts about almost every aspect of the thesis, and has always provided constructive criticism, even late at night when he was hardly awake. And finally, my daughter Bruna is the light of my life. She has also contributed to this thesis, by regularly (sometimes, even irregularly) switching my mind off work. iv Contents Abstract...................................... ii Acknowledgements . iii Contents...................................... v Listoffigures................................... viii Listoftables ................................... ix List of acronyms and other abbreviations . xi 1 Introduction 1 1.1 Motivation..................................... 1 1.2 Goalsofthethesis................................. 3 1.3 Structure of the thesis . 4 1.4 Contributions ................................... 5 2 Tools and resources 7 2.1 Data........................................ 7 2.1.1 Corpus .................................. 7 2.1.2 Adjectivedatabase ............................ 9 2.1.3 Adjectives in the working corpus . 10 2.2 Tools........................................ 12 2.2.1 Shallowparser .............................. 12 2.2.2 CLUTO.................................. 14 2.2.3 Weka ................................... 14 2.2.4 Other tools . 15 3 Adjective semantics and Lexical Acquisition 17 3.1 Adjectivesasapartofspeech. 19 3.1.1 Morphology ............................... 19 3.1.2 Syntax .................................. 19 3.1.3 Delimitation problems . 21 3.2 Adjectives in descriptive grammar . 24 3.2.1 Qualitative adjectives . 25 3.2.2 Relational adjectives . 27 3.2.3 Adverbial adjectives . 28 3.2.4 Summary................................. 29 3.3 Adjectives in formal semantics . 29 3.3.1 Lexical semantics within formal semantics . 29 3.3.2 Adjective classes from entailment patterns . 30 3.3.3 Summary................................. 37 3.4 Adjectives in ontological semantics . 38 3.4.1 Generalframework............................ 38 3.4.2 Adjectiveclasses ............................. 40 3.4.3 Summary................................. 41 3.5 Adjective semantics in NLP . 41 3.5.1 WordNet ................................. 41 3.5.2 SIMPLE ................................. 43 3.5.3 MikroKosmos............................... 44 v Contents 3.5.4 Summary................................