Cross-Language Ontology Learning
Total Page:16
File Type:pdf, Size:1020Kb
Cross-language Ontology Learning Cross-language Ontology Learning Incorporating and Exploiting Cross-language Data in the Ontology Learning Process Hans Hjelm c Hans Hjelm, Stockholm 2009 ISBN 978-91-7155-806-0 Printed in Sweden by US-AB, Stockholm 2009 Distributor: Department of Linguistics, Stockholm University Abstract An ontology is a knowledge-representation structure, where words, terms or concepts are defined by their mutual hierarchical relations. Ontologies are becoming ever more prevalent in the world of natural language processing, where we currently see a tendency towards using semantics for solving a variety of tasks, particularly tasks related to information access. Ontologies, taxonomies and thesauri (all related notions) are also used in various variants by humans, to standardize business transactions or for finding conceptual relations between terms in, e.g., the medical domain. The acquisition of machine-readable, domain-specific semantic knowledge is time consuming and prone to inconsistencies. The field of ontology learning therefore provides tools for automating the construction of domain ontologies (ontologies describing the entities and relations within a particular field of interest), by analyzing large quantities of domain-specific texts. This thesis studies three main topics within the field of ontology learning. First, we examine which sources of information are useful within an ontology learning system and how the information sources can be combined effectively. Secondly, we do this with a special focus on cross-language text collections, to see if we can learn more from studying several languages at once, than we can from a single-language text collection. Finally, we investigate new approaches to formal and automatic evaluation of the quality of a learned ontology. We demonstrate how to combine information sources from different languages and use them to train automatic classifiers to recognize lexico-semantic relations. The cross-language data is shown to have a positive effect on the quality of the learned ontologies. We also give theoretical and experimental results, showing that our ontology evaluation method is a good complement to and in some aspects improves on the evaluation measures in use today. 5 Sammanfattning En ontologi är en struktur som används för kunskapsrepresentation i maskinläsbart format. Ontologier innehåller koncept, termer eller ord som definieras genom inbördes hierarkiska relationer. De ges en alltmer framskjutande roll inom dagens språkteknologi, där vi ser tendenser till att integrera semantik i ett allt större antal applikationer, främst inom områden relaterade till informationsåtkomst. Ontologier, taxonomier och tesaurer (alla starkt förknippade med varandra) används också av människor för att exempelvis standardisera affärsprocesser eller för att hitta relationer mellan koncept eller termer inom medicinen. Att skapa maskinläsbara, domänspecifika, semantiska resurser är en tidsö- dande och krävande process, där det är svårt för människor att alltid fatta konsekventa beslut. Därför försöker man inom fältet för ontologiinlärning utveckla verktyg för att automatisera skapandet av domänspecifika ontologier. Inlärningen görs huvudsakligen genom automatisk analys av stora samlingar domänspecifik text. Denna avhandling studerar tre huvudsakliga frågor inom området för on- tologiinlärning. För det första undersöker den vilka informationskällor som är användbara inom ett ontologiinlärningssystem, samt hur dessa källor kan kombineras på ett effektivt sätt. För det andra gör den detta med fokus på tvärspråkliga textsamlingar, för att se om det finns fördelar med att studera flera språk parallellt, jämfört med att bara använda ett språk i taget. För det tredje utforskar avhandlingen nya ansatser till formell automatisk utvärdering av kvaliteten hos en automatiskt inlärd ontologi. Vi visar hur information från olika språk kan kombineras och användas av automatiska klassificerare för att känna igen lexikala semantiska relationer. Tvärspråkliga data visas ha en positiv effekt på kvaliteten hos de inlärda on- tologierna. Vi ger också teoretiska och experimentella resultat där vår metod för utvärdering av inlärda ontologier visas vara ett gott komplement till, och i vissa fall förbättra, de utvärderingsmetoder som används idag. 6 Preface During my first two years as a PhD student, I worked half-time to finance my studies. I worked mainly for Intrafind Software AG, based in Munich, Germany, and they showed a great deal of flexibility towards having me work out of Sweden and allowing me to work on my thesis on the side. I would especially like to thank Intrafind’s Christoph Goller, also for valuable discussions and for commenting on the pre-final version of my thesis. By my third year, I was able to start working full-time as a PhD student, thanks to a scholarship from the GSLT. I am glad to have been a part of the GSLT for many reasons, not the least of which has been the ability to meet so many nice and interesting people/colleagues. Thank you all – I don’t dare to start making a list of people to thank, it would fill up the page! A scholarship from the DAAD (Deutsher Akademischer Austausch Dienst) enabled me to go to Saarbrücken for four months, as a guest researcher at the DFKI, under the knowing supervision of Paul Buitelaar. This was a great experience, during which I learned a lot and made many friends. I have had the pleasure of sharing offices with some fantastic people here at Stockholm university: Henrik Oxhammar, Kristina Nilsson and Yvonne Samuelsson, all fellow PhD students. Thank you for all discussions and all coffee breaks; a special thanks to Henrik for continuously filling our white- board with obscure pictures of data structures. Another special thanks to Kina for many discussions regarding this thesis. I have also much enjoyed sharing offices with Magdalena Mikolajczyk, David Hagstrand, Britt Hartmann, Jör- gen Aasa, Per Weijnitz and Joakim Lundborg, for varying lengths of time. I appriciate having had Jennifer Spenader, Sofia Gustafson-Capkova and Mag- nus Sahlgren as colleagues in the Stockholm CL-group. Special thanks to all proof-readers: Kristina Nilsson, Britt Hartmann, Ma- rina Santini and Yvonne Samuelsson. I really appreciate your help. Tomas Einarsson has very patiently but enthusiastically explained various mathematical issues to me, I am very grateful for that. Per Sundqvist intro- duced me to MATLAB and talked to me at length about matrices. Christoph Schwarz, Mikael Parkvall and Eva Lindström have all kindly answered my questions on linguistic issues. Magnus Rosell taught me about clustering eval- uation – much appreciated! My two supervisors, Martin Volk and Joakim Nivre both deserve special mention. Martin has kept me on track during this process, making sure I fo- cused on the right things, in the right order. I appreciate him giving me the chance to start working as a PhD student, even though I did not have official funding. Joakim has been invaluable to me in helping me sort out various sta- 7 tistical matters, among other things. Both have provided me with quick and precise feedback during the writing of my thesis. For this I am truly thankful. Most importantly, I want to thank my family. You have been a constant moral support during these years, I am very happy to have you. 8 Contents 1 Introduction .......................................... 13 1.1 Ontologies .............................................. 14 1.2 Ontology Learning ......................................... 17 1.3 Research Questions ....................................... 18 1.4 Thesis Outline ........................................... 19 2 Ontology Learning Perspectives ............................ 21 2.1 Primary Uses of Ontologies .................................. 21 2.2 Cross-language Ontologies .................................. 24 2.3 Terms – Building Blocks of the Ontology .......................... 27 2.4 Distributional Similarity ...................................... 28 2.4.1 Distributional similarity model parameters ..................... 30 2.4.2 Is there Structure in Word-space? .......................... 31 2.5 Ontology Learning Approaches ................................ 32 2.5.1 Term clustering based on distributional similarity ................ 33 2.5.2 Intra-term information ................................... 35 2.5.3 Lexico-syntactic patterns ................................ 36 2.5.4 Formal concept analysis ................................. 37 2.5.5 Statistical measures of term specificity ....................... 38 2.5.6 Hybrid ontology learning approaches ........................ 39 2.5.7 Probabilistic approaches ................................. 40 2.6 Translational Equivalence for Terms ............................. 41 2.6.1 Cross-language distributional similarity ....................... 42 2.6.2 Statistical machine translation ............................. 43 2.6.3 Hybrid approaches for translation ........................... 43 2.6.4 Working with comparable corpora .......................... 44 2.7 Exploiting Cross-language Data ............................... 46 2.8 Summary ............................................... 48 3 Resources ............................................ 49 3.1 Pre-processing ........................................... 50 3.1.1 Morphological analysis .................................. 50 3.1.2 Term spotting ........................................ 50 3.2 Corpora ................................................ 51 3.2.1 JRC-ACQUIS Multilingual Parallel