ANNOTATING THE SEMANTIC WEB

By Alexiei Dingli

SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY AT THE UNIVERSITY OF SHEFFIELD DEPARTMENT OF COMPUTER SCIENCE REGENT COURT, 211 PORTOBELLO STREET, SHEFFIELD, S1 4DP. UNITED KINGDOM JULY 2004

© Copyright by Alexiei Dingli, 2004

THE UNIVERSITY OF SHEFFIELD DEPARTMENT OF COMPUTER SCIENCE

The undersigned hereby certify that they have read and recommend to the Faculty of Science for acceptance a thesis entitled "Annotating the Semantic Web" by Alexiei Dingli in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Dated: July 2004

External Examiner: Professor Nigel Shadbolt

Research Supervisors: Professor Yorick Wilks

Professor Fabio Ciravegna

Examining Committee: Dr Mark Hepple

THE UNIVERSITY OF SHEFFIELD

Date: July 2004

Author: Alexiei Dingli

Title: Annotating the Semantic Web

Department: Department of Computer Science
Degree: Ph.D.
Convocation: August
Year: 2004

Permission is herewith granted to the University of Sheffield to circulate and to have copied for non-commercial purposes, at its discretion, the above title upon the request of individuals or institutions.

THE AUTHOR RESERVES OTHER PUBLICATION RIGHTS, AND NEITHER THE THESIS NOR EXTENSIVE EXTRACTS FROM IT MAY BE PRINTED OR OTHERWISE REPRODUCED WITHOUT THE AUTHOR'S WRITTEN PERMISSION. THE AUTHOR ATTESTS THAT PERMISSION HAS BEEN OBTAINED FOR THE USE OF ANY COPYRIGHTED MATERIAL APPEARING IN THIS THESIS (OTHER THAN BRIEF EXCERPTS REQUIRING ONLY PROPER ACKNOWLEDGEMENT IN SCHOLARLY WRITING) AND THAT ALL SUCH USE IS CLEARLY ACKNOWLEDGED.

To my dear parents and girlfriend, who have always been the beam that supported me and to God who was always there to sustain that beam and me.

Table of Contents

Table of Contents

Abstract

Acknowledgements

1 Introduction
   1.1 Motivation
   1.2 Goal
   1.3 Thesis Structure

2 Annotation
   2.1 Motivation
   2.2 Previous Work
      2.2.1 Alembic
      2.2.2 Annotation System
      2.2.3 Annotea
      2.2.4 CritLink
      2.2.5 iMarkup
      2.2.6 MnM
      2.2.7 S-CREAM
      2.2.8 SHOE Knowledge Annotator
      2.2.9 SMORE
      2.2.10 The Gate Annotation Tool
      2.2.11 Trellis Web
      2.2.12 Yawas
   2.3 Conclusion

3 Information Extraction and Integration
   3.1 Motivation for using AIE
   3.2 Previous Work in AIE
      3.2.1 Information Extraction Terminology
      3.2.2 Shallow Approaches
      3.2.3 Deeper Approaches
   3.3 Motivation for using II
   3.4 Previous Work in II
      3.4.1 Ariadne
      3.4.2 KIND
      3.4.3 PROM
      3.4.4 Name-Matching
      3.4.5 StatMiner
   3.5 Conclusion

4 Melita: A Semi-automatic Annotation Methodology
   4.1 Introduction
   4.2 Melita: A Semi-automatic Annotation Methodology
      4.2.1 Dealing with timeliness
   4.3 The methodology at work
      4.3.1 Intrusiveness
      4.3.2 Timeliness
   4.4 Melita Evaluation
      4.4.1 CMU Seminar Announcements Task
      4.4.2 The PASTA task
      4.4.3 User Oriented Evaluation
   4.5 Future Development: Adapting to different user groups
      4.5.1 Naïve Users
      4.5.2 Application Experts
      4.5.3 IE Experts
   4.6 Conclusion

5 Armadillo: An Automated Annotation Methodology
   5.1 Introduction
   5.2 Armadillo: A Generic Architecture for Automatic Annotation
      5.2.1 Set of Strategies
      5.2.2 Group of Oracles
      5.2.3 Ontologies
      5.2.4 Database
      5.2.5 The Armadillo methodology
      5.2.6 The methodology at work
      5.2.7 Future Development
   5.3 The Computer Science Department Task: A case study
   5.4 Armadillo Evaluation
      5.4.1 Finding People's Names
      5.4.2 Paper Discovery
      5.4.3 Other domains
      5.4.4 Analysis and Observations
   5.5 Conclusion

6 Conclusion
   6.1 Research Contributions
   6.2 Future Directions
      6.2.1 IE and the web
      6.2.2 IE and the SW challenges
      6.2.3 II and the SW challenges
      6.2.4 Armadillo: Agents over the Grid
      6.2.5 The Semantic Annotation Engine
      6.2.6 The Semantic Web Proxy
      6.2.7 Semantic Plugins
   6.3 Summary

Abstract

The web of today has evolved into a huge repository of rich multimedia content for human consumption. The exponential growth of the web has made it possible for the amount of information to reach astronomical proportions, far more than a mere human can manage, causing the problem of information overload. Because of this, the creators of the web (10) spoke of using computer agents in order to process the large amounts of data. To do this, they planned to extend the current web to make it understandable by computer programs. This new web is being referred to as the Semantic Web. Given the huge size of the web, a collective effort is necessary to extend it. For this to happen, tools easy enough for non-experts to use must be available. This thesis first proposes a methodology which semi-automatically labels semantic entities in web pages. The methodology first requires a user to provide some initial examples. The tool then learns how to reproduce the user's examples and generalises over them by making use of Adaptive Information Extraction (AIE) techniques. When its level of performance is good enough compared to the user's, it takes over and processes the remaining documents autonomously. The second methodology goes a step further and attempts to gather semantically typed information from web pages automatically. It starts from the assumption that semantics are already available all over the web and that, by making use of a number of freely available resources (like databases) combined with AIE techniques, it is possible to extract most information automatically. These techniques will certainly not provide all the solutions for the problems


brought about with the advent of the Semantic Web. They are intended to provide a step forward towards making the Semantic Web a reality.

Acknowledgements

I would like to thank my supervisor, Professor Yorick Wilks, for giving me this opportunity and also for his support and comments throughout these years. I would also like to thank Professor Fabio Ciravegna for the many times I disturbed him with questions, for the patience he has whenever I turn around, and for the helpful comments he always gives me. Finally, but not least, I would like to thank Professor Niranjan for accepting to be the chair of my PhD panel. Of course, I am grateful to my parents and girlfriend for their patience and love. Without them this work would never have come into existence (literally).

Alexiei Dingli
Sheffield, United Kingdom
July 1, 2004

Chapter 1

Introduction

1.1 Motivation

Research on an experimental computer network, driven by the fear of a nuclear war, has become, many decades later, one of the most popular and important media in the world. It originally took the form of a wide area network referred to at the time as ARPANet (Advanced Research Projects Agency Network). Its purpose was to exchange military information efficiently in the eventuality that a nuclear attack made specific parts of the network unusable. Later on, universities recognised the power of such a network, connected themselves to it and provided access to their students. Today this huge network has grown by many orders of magnitude, spans every corner of the earth and is more commonly referred to as the Internet. In recent years, the web has been growing at a very fast rate. The main catalyst for this increase in information[1] is the move (which happened some years ago) towards an information society, succeeded by the present drive towards a knowledge[2] society. Globalisation is another factor that is placing huge information strains and demands on organisations, since organisations no longer compete only with other organisations in their own neighbourhood but with organisations all over the globe.

[1] Information is data which is interpreted.
[2] Knowledge is information transformed for effective use.


For organisations to survive in this fast-changing world, it is imperative that they restructure themselves and move towards modern techniques of Knowledge Management (KM) in order to be able to handle all the knowledge available. KM is an emerging, interdisciplinary business model dealing with all aspects of knowledge within the context of the firm. It seeks to improve the performance of organisations by helping users to capture, share and apply their collective knowledge, with the aim of leading the organisation towards making optimal decisions in real time. These techniques make effective use and reuse of the knowledge within the organisation, with the effect of increasing efficiency, productivity and service quality. KM is not a separate process performed continuously by a particular department; it should happen in the background, carried out by everyone as part of the day-to-day running of the business. In this era where companies are moving towards installing a Digital Nervous System(55), it is imperative that information is available where, when and to whoever needs it, rather than hidden in the senior management offices. An effective digital nervous system allows a company's internal processes to operate smoothly and quickly, enables an organisation to respond to customer feedback quickly, gives it the ability to react to its competitive environment in a timely manner, and empowers employees with critical knowledge. The key is how effectively an organisation manages the flow of its digital information. All kinds of information - numbers, text, audio and video - can now be put into digital form. Widely available hardware and software has also made it possible and necessary for organisations of all sizes to reshape the way they conduct their business. A combination of modern technology together with effective use of humans can make an explosive mix. Modern technologies can supply any kind of information at the touch of a button, while humans are natural consumers of any type of information and have the unique ability of efficiently and effectively processing it. The basic organisational unit of KM is the Community of Practice (COP)(16). A COP is a group of people informally bound to one another through exposure to a common class of problems and a common pursuit of solutions, thereby embodying a store of knowledge. Individuals are a very important asset in any organisation because they bring into the company intellectual capital built from their previous experiences. Humans are also capable of processing a wide variety of information represented in any kind of media. A COP is something around which all the information processes circulate. It becomes much more complex once the COP is spread amongst several distributed locations around the globe, but this is inevitable since more users seem to be moving towards virtual collaboration. For KM to be more effective, it can make use of Artificial Intelligence[3] (AI) techniques such as knowledge bases in order to filter and process the huge amount of data which users are faced with. Knowledge bases store data in structures which make it easy and fast for users to retrieve it. These knowledge bases support users who want to perform anything from simple searches to complex queries. AI must face a number of challenges in order to make KM successful.
It must work for a wide range of people, not just for the expert user. If complex tasks need to be performed by humans, AI techniques must disguise the burdens introduced by the complexity of the task. One of the main bottlenecks in any KM system (where AI can make a huge difference) is the extraction of information from various sources (see Chapter 3 for more details). Another bottleneck, in terms of cost, is the harvesting, organising and storing of information. Since KM cannot afford to use poor quality information which is inaccurate or out of date, KM techniques reuse information across multiple tasks and multiple audiences to reduce the costs and risks involved (see Section 3.2.3).

[3] AI is the field concerned with making computer programs achieve goals in the real world using different techniques capable of learning and improving themselves.

The possibilities are endless, but unfortunately the main problem hindering further progress is the Internet itself. Although the long tentacles of the web reach almost every aspect of our lives, it is still not mature enough and quite far from reaching its full potential. The W3C[4] committee has stated that:

for the Web to reach its full potential, it must evolve into a Semantic Web, providing a universally accessible platform that allows data to be shared and processed by automated tools as well as by people.

The important question at this stage is: what is the Semantic Web (SW)(10), and why is there a need for it? The SW is basically similar to the existing web but with an added dimension, the semantic dimension. This dimension does not visually alter web pages in any way, and users will still see pages the same as they see them today. But this dimension adds meaning to web pages. Let's take a small example to understand better what we mean. Imagine we have a very simple Online Book Shop containing many pages with details regarding books. Let us also assume that every single page (which describes a book) must contain at least the name of the book and the author of the book. When a person views such a page, he immediately identifies and distinguishes between the name of the book and the names of the authors. Humans manage to do so using a number of tricks. First of all, they are equipped with a powerful data processing system commonly known as the brain. The brain is specialised at creating associations between data and is capable of processing huge amounts of it. It is also very good at inferring new facts using existing facts combined with logical rules. Keeping this in mind, let's see how humans manage to identify the names of the authors. Since they are social animals, throughout their lives, they come in contact with other humans like them, all with different names.

[4] http://www.w3c.org/

Therefore, it is easy for humans to distinguish between ordinary words and names because they have stored in their brain a partial database containing some of the names of the people they have met throughout their lives. If an author's name is similar to the name of a person they have met, the brain manages to generalise and infer that the particular word being analysed is probably the name of a person. Using this information, together with the conscious knowledge that the web site is an online book store, the person infers from past experience that a person mentioned in a web page of an online book store is most probably the author of a book. Since the authors were identified, (by exclusion) the other words present in the page must be the title of the book. This was a very simple example. What happens if the author has a previously unseen name? Once again, the brain comes to the rescue. Using its powerful generalisation functions over the partial list of names which it has, it manages to extract a set of linguistic cues that offer hints towards identifying names. One such cue could be that names normally start with a capital letter. Another cue could point to the fact that the name of a book is normally called a title; therefore the identification of the word title in the text could signify that the name of the book is to be found somewhere in its proximity. The number of cues and their complexity is enormous, and humans use them continuously in order to understand the meaning of the world around them. But what would happen if a person had no experiences, no cues, nothing? It would be impossible to understand the world around him. This is the current situation with any program that analyses data. When a computer program is created, it has no experiences, no partial lists, no memory, nothing, just instructions. So far these programs were created for closed domains where their mode of operation is well defined by their creator, and any cues and lists of data they need to operate are also supplied by him. So if a program is needed to perform the same task as described above, a very simple approach would be to write a program that goes through the page, gets every word in the page and compares it with a list of names. Obviously, the list of names must be supplied by the programmer. This is not very effective though, since any list of names is finite and cannot contain all the names in the world. The programmer could also hardwire cues inside the program. A cue could be that a word starting with a capital letter is potentially the name of a person. But what happens if the word is the name of a company, or it is the first word in a sentence? Those too start with a capital letter! These cases can be catered for by writing other rules, but obviously their complexity is always increasing! A smarter approach would be to use AI techniques. AI is divided into several subfields, and the subfield concerned with trying to understand human language is Natural Language Processing (NLP). The task of understanding human language is very complex. It involves identifying entities (such as names in a sentence), finding out the relationships between them, the context in which they were mentioned and a million other issues. For the simple task we mentioned before, we only need techniques to identify names in a sentence.
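
As a rough, hypothetical illustration of the naive list-plus-cues approach just described (the gazetteer contents and the example sentence are invented), a sketch in Python might look like this:

    # Naive name spotting: a partial list of known names plus the
    # capital-letter cue described above. Purely illustrative.
    KNOWN_NAMES = {"Charles", "Dickens", "Emily", "Thomas"}   # hypothetical partial gazetteer

    def naive_name_spotter(text):
        """Return tokens that look like person names."""
        candidates = []
        for position, word in enumerate(text.split()):
            token = word.strip(".,;:!?")
            if token in KNOWN_NAMES:
                candidates.append(token)             # found in the list
            elif position != 0 and token.istitle():
                candidates.append(token)             # capital-letter cue (very noisy)
        return candidates

    print(naive_name_spotter("The novel was written by Charles Dickens in London."))
    # ['Charles', 'Dickens', 'London']  -- 'London' is a false positive,
    # which is precisely the limitation discussed above.

Real NLP systems go well beyond such hand-crafted heuristics, as discussed next.
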
There are various techniques for identifying names; the most relevant is Named Entity Recognition and Classification (NERC), whose task is to identify names of people, organisations, currencies, dates and, basically, all other named entities. To achieve this task, it makes use of a set of rules and lists in a similar way to what we mentioned before. For a more specialised approach, we could make use of Information Extraction (IE). IE refers to the activity of automatically extracting pre-specified information from a document. So in our previous example, an IE engine could be used to extract the names of the authors and the title of the book. The difference between a NERC and an IE engine is that a NERC is normally generic to any domain and therefore does not require any prior training before it is used in a new application. An IE engine is specific to a particular domain and requires training so that it is customised to the domain being analysed. The training phase normally involves either giving the IE engine some extraction rules or some example documents so that it can learn which facts it needs to extract. Obviously, the training phase that makes use of an IE tool is more complex than for a NERC, since a NERC does not require training; but the NERC has some serious limitations. Let us go back to the scenario we were analysing before and complicate life a little bit more by introducing the name of the editor in the page. So now we have three pieces of information on every page: the title of the book, the names of the authors and the name of the editor. A NERC would spot the names of the authors and that of the editor without any problems, but it cannot distinguish who is who, i.e. it cannot tell that the first name found is the name of the author and the second name found is the name of the editor, or vice-versa. To perform this, we need to use an IE engine. The IE engine learns to spot linguistic cues from the document which help identify where the name of the author is normally found and where the name of the editor is normally found in the document. Both names are treated as two distinct entities because they come from different sets (the first from the set of people who are authors and the second from the set of people who are editors). IE is a very powerful methodology; however, most current IE technologies require skilled human effort (e.g. NLP experts) to port the IE system to new domains. To simplify their usage, the IE community uses techniques borrowed from another subfield of AI called Machine Learning (ML). ML refers to a broad class of probabilistic and statistical methods for estimating dependencies between data and using these dependencies to make predictions. ML-based IE is called Adaptive IE (AIE)(18; 7; 27). It can be exploited to allow naïve users (i.e. users knowledgeable about their domain but having limited knowledge when it comes to computing) to port IE systems without referring to NLP experts. In AIE, only three items are required to extract information from any domain: first and foremost, the set of documents from which data needs to be extracted; secondly, a set of training documents which are used to train the IE engine; and thirdly, an ontology[5] defining the concepts which need to be extracted. Let's look at each one in more detail.

[5] An ontology is a formal specification of a shared conceptualisation (61; 12).

The set of documents used for extraction is normally referred to as the test corpus[6]. These documents can be free texts, structured texts (texts containing tables, lists, etc.) or a combination of both. Most important of all, a subset of these documents generally contains the information that needs to be extracted. In the scenario we are analysing, this corpus would consist of all the pages containing the names of the authors, editors and books.

[6] A corpus is a large collection of writings of a specific kind or subject, used for linguistic analysis.

The set of documents used for training is called the training corpus. It is made up of documents similar in structure and content to the test corpus, but containing internal or external markers showing which information needs to be extracted. These markers are called annotations, and a document having annotations is called an annotated document. In the case of external annotations, the annotations are referred to as template annotations, whereby a template is associated with one or several documents and its instances point to the elements found in the documents. The process of creating these annotations is referred to as the annotation process. There are several kinds of markers which one can use; the most commonly used are called tags. A tag is a label assigned to a section of the document in order to identify some data. To understand better what we mean, imagine that in the Online Book Shop scenario we have the details regarding a random book as follows:

Oliver Twist - Charles Dickens

The text means nothing on its own for a machine; it is just a sequence of letters and symbols. Therefore, if we want the IE system to learn examples from this document, we must insert training tags inside the document to mark the examples as follows:


<BookTitle>Oliver Twist</BookTitle> - <Author>Charles Dickens</Author>

The <BookTitle> tag marks the start of a book title while </BookTitle> marks its end. The text in between those tags is the title of the book (here, the book Oliver Twist). The same holds for the <Author> and </Author> tags, which mark the start and end of the author's name (in this case Charles Dickens).
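To make the role of these tags concrete, here is a small hypothetical sketch (it does not reproduce the exact format used by any particular IE engine) showing how a program could read such inline tags back as training examples:

    # Read (concept, text) pairs from a document annotated with inline tags.
    # The tag format is assumed to be simple, well-nested XML-style markup.
    import re

    TAG_PATTERN = re.compile(r"<(\w+)>(.*?)</\1>")

    def read_training_examples(annotated_text):
        """Return a list of (tag, span) pairs found in the annotated text."""
        return TAG_PATTERN.findall(annotated_text)

    doc = "<BookTitle>Oliver Twist</BookTitle> - <Author>Charles Dickens</Author>"
    print(read_training_examples(doc))
    # [('BookTitle', 'Oliver Twist'), ('Author', 'Charles Dickens')]

Each pair tells the learner which concept a piece of text is an example of; an AIE engine generalises from many such pairs.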

An ontology defines a subset of the real world made up of concepts and relationships between them. There exist several formats and standards for ontologies. A simple representation for the Book Shop scenario would be as follows:

(Book)
    → has a (BookTitle)
    → is written by one or many (Author)
    → edited by an (Editor)

This representation does not follow any particular standard; its purpose is to illustrate the concepts behind an ontology. The words within () are concepts, while the connecting phrases represent the relationships between these concepts. Therefore, the above ontology states that a book has a title, is written by one or more authors and is edited by an editor. An ontology is used to specify all the elements required in any domain and the relationships between them. It is useless to annotate a document with tags that are not linked to an ontology, because tags on their own are meaningless unless they are put in a context of some sort. If we look at the tag Editor on its own, we know that an editor can be:

1. A person who edits.

2. A person who writes editorials. 10

3. A device to edit film.

4. A program used to edit text.

If the context is not supplied, it is quite difficult to disambiguate what the particular word means.
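
Standard languages such as RDF Schema and OWL exist for writing ontologies of this kind. Purely as a hypothetical sketch (not in any standard notation), the informal book ontology above could also be captured in code as follows:

    # Concepts as classes, relationships as typed attributes. Illustrative only;
    # a real Semantic Web ontology would be expressed in RDF Schema or OWL.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Author:
        name: str

    @dataclass
    class Editor:
        name: str

    @dataclass
    class Book:
        title: str              # has a BookTitle
        authors: List[Author]   # is written by one or many Author
        editor: Editor          # edited by an Editor

    oliver_twist = Book(title="Oliver Twist",
                        authors=[Author("Charles Dickens")],
                        editor=Editor("A. N. Other"))     # editor name is invented

Linking annotations to concepts such as Book, Author and Editor is what gives the tags in the previous example a well-defined meaning.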

So far we have seen that IE can be used to automate the process of adding annotations to any kind of page. These annotations are linked to an ontology and therefore have a well-defined meaning. In other words, IE can be used to add meaning/semantics to pages. Unfortunately, this comes at a cost, since it requires a certain amount of training. This training is normally supplied by a human whose job is to manually annotate the training corpus. This has been very successful so far, since the tasks being tackled were restricted to closed domains. But when it comes to larger scenarios, the IE methodology starts facing problems. The largest scenario possible is without doubt the SW scenario. In this scenario all the pages on the Internet need to be annotated with meanings so that they can be processed by programs. The need for having these programs analysing the web arises from the fact that the web is full of both relevant and irrelevant information on any topic imaginable. The amount of information available is such that a human being cannot cope with it. This problem is also referred to as information overload. Therefore, to make this process tractable, a human requires the aid of some automated agent(70) capable of analysing the data in order to search and filter out redundant and irrelevant information whilst directing the user's search through this information space. The only problem with this approach is that the agents do not understand anything which is on the web because, for them, all the information present is just a bunch of binary symbols. What the agents need is a way to help them understand the data so that they can process it in a more meaningful way. This is exactly the goal of the SW. Basically, the SW is a silent revolution

whose aim is to add semantics to the web which can be understood and used by both humans and automated agents.

If we try to insert this meaning manually into web pages, we immediately realise that this task is not feasible. The documents which are already semantically annotated in the current web are few. The number of un-annotated web sites is enormous and, every day, more and more new web sites are being created. Even if we assume for a second that the process of manually annotating these web sites is tractable, and even if we ignore that it is both time consuming and error prone when performed by humans, the task is still difficult to perform for non-experts. To help users perform these tasks, there is a lack of adequate tools capable of providing some kind of support to semi-automatically assist the annotation process. IE can surely help in this respect, but we are now faced with a scalability problem, i.e. how can we provide enough training documents to annotate the web? A possibility would be to make use of the data already available on the web to annotate the training documents. If we take a look back at the Online Book Shop scenario, all we require is a partial list of authors and book titles. Once we have this partial list, it can be used to automatically annotate some of the documents in the test corpus so that we create a training corpus. Then the training corpus is used to train the IE engine. The trained IE engine is applied over the remaining test corpus, and new authors and titles are extracted. The annotation part is automatic because it is simply a matter of going through the authors and book titles in the lists, searching for them through the documents, and then adding the appropriate annotation around each instance. The lists can be found quite easily using any search engine and involve little effort on the human's side. The only remaining problem is the fact that these lists are normally distributed around the web, probably stored in several databases. To overcome this we can make use of Information Integration (II) techniques. II refers to a collection of techniques which let applications access data as though it were a single database, whether it is or not. It enables the integration of data and content sources to provide real-time read and write access. Therefore, if in the Online Book Shop scenario we need to annotate some of the authors and we find two partial lists of authors, II methods can help us access these lists as if they were a single huge list. The combination of the different technologies mentioned above proposes a realistic path towards the creation of the SW. This marriage of AI with Information Technology on the web gave birth to a new field referred to as Web Intelligence (WI)(80)(106). WI is a very recent field that combines different research interests of individuals whose goals are related to producing next generation applications on the web. It seeks to exploit a combination of Artificial Intelligence (AI) and advanced Information Technology (IT) techniques on the Internet with the aim of creating Intelligent Web Information Systems (IWIS). WI seeks to build the next generation of Web-empowered systems, environments and activities in order to address a number of emerging challenges on the web. Even though most of the techniques and methods used in WI have been in use for years (if not decades), the growth of the Internet created new needs in the research community. All the techniques used so far work fine in their particular restricted domains, but most of them are not usable over the Internet due to the new properties and structure of this new medium.
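
Before looking at those properties, it is worth making the list-based bootstrapping and integration idea described above concrete. The following is a toy, hypothetical sketch (the sources, names and tag format are all invented; a real system such as the one described in Chapter 5 is far more elaborate):

    # Merge partial author lists from two 'sources' and use the merged list
    # to annotate raw text automatically, yielding training material for an
    # IE engine. Purely illustrative.
    import re

    SOURCE_A = ["Charles Dickens", "Jane Austen"]     # e.g. an online list
    SOURCE_B = ["Jane Austen", "Leo Tolstoy"]         # e.g. another database

    def integrate(*sources):
        """Crude information integration: treat several lists as one."""
        merged = set()
        for source in sources:
            merged.update(source)
        return merged

    def auto_annotate(text, names, tag="Author"):
        """Wrap every known name found in the text with inline training tags."""
        for name in names:
            text = re.sub(re.escape(name), f"<{tag}>{name}</{tag}>", text)
        return text

    authors = integrate(SOURCE_A, SOURCE_B)
    print(auto_annotate("Oliver Twist was written by Charles Dickens.", authors))
    # Oliver Twist was written by <Author>Charles Dickens</Author>.

Documents annotated in this way can then serve as a training corpus for an AIE engine, which in turn discovers authors that were not in the original lists.
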
The Internet brought with it several open issues which were not considered before. First of all, its structure is quite different from any other structure seen before. Not only can the documents themselves be free text, semi-structured text or structured text, but they also contain a number of multimedia objects within them. So existing theories and technologies need to be modified or enhanced in every field to handle these intra-document elements. But this is only a small portion of the problem: on the Internet one can also find inter-document elements in the form of hyperlinks. A side effect of using hyperlinks is that they extend the boundaries of a document beyond the "physical" boundaries of that file. This obviously means that the complexity dealt with when analysing and processing these pages is much higher than with traditional documents. Apart from this, if we take a look at the growth rate of the Internet (30), we immediately realise that the size of the net is almost doubling every year, therefore making it unmanageable for traditional systems. Basically, the web increases the availability and accessibility of information to a much larger community than other computer applications. This new platform does not only bring about new problems; it also provides some new and exciting possibilities.

Web Mining The process of finding correlations or patterns among dozens of fields in large relational databases (traditionally known as Data Mining) is being extended to the web. The fresh stream of reliable information which Web Mining produces is vital for e-businesses, marketing departments and other decision-making sections of an organisation, since such information can help them gain competitive advantage over their competitors. The use of explicit physical links between web documents (11) provides new logical connections between whole documents or segments of them. These techniques make it possible for the system to create personalised views of the web.

Web Information Retrieval Making use of information retrieval engines is the most popular way of finding information on the web. These search engines are, and will remain, very important in the future, but unfortunately they are currently very limited. A look at the statistics gathered by one of the most popular search engines available shows that queries in English account for

only about half of the searches[7], and other languages are slowly gaining popularity. This shows that there is a growing need for more multi-lingual retrieval.

[7] http://www.google.com/press/zeitgeist.html

There is also a need for more personalisation and therefore for the creation of personal search agents. Finally, the SW is bringing about a new way of searching. There will no longer be the need to invent a bag of words in order to try to find the information required. A search through the semantic space rather than through the syntactic space would make queries much more accurate.

Intelligent Web Agents The current generation of web agents is quite dumb. One of the main reasons is that they do not possess the necessary technology to process and understand what is written in web pages. That is one of the reasons why the SW is being set up. Once this is done, we will be able to see agents that can perform all kinds of intelligent tasks, from more powerful searches to the execution of complex processes. This could also lead to the creation of virtual representations of individuals on the web, whose job would be to execute tasks which they receive from their counterparts in the real world.

In summary, WI seeks to address a number of future challenges. First of all, a major goal is the SW per se. Once the SW is set up, it must be populated with agents. Agents can handle simple tasks easily, but when it comes to complex tasks problems start arising, since a degree of planning is essential. A plan is a sequence of actions used to achieve the goals, given an initial state. In this context, planning information is interpreted from Semantic Web documents, while the meaning and relationships of the words in the documents are specified in ontologies. If this scenario is reached, the next step would be to have agents distributed over the whole web and collaborating with each other. Finally, these agents will start representing humans in the real world, and they will start forming a social, self-organising network. According to (80), if WI manages to tackle these challenges, we would have created what they call the Wisdom Web. This web would enable humans to gain practical wisdom of everything existing both in the virtual and the real world, while providing a medium for sharing

knowledge and experience. It would also facilitate the creation of knowledge and social development by making use of its self-organising resources. WI is still in its infancy, but it is a field that is trying to reach very ambitious targets. So far, we have only considered the technological revolution which the SW is expected to bring. But if the SW is to become a reality, there is a need for a cultural change amongst its users. Individuals should be inclined towards a knowledge-sharing culture where people are not afraid to share their knowledge and learn from others. Since in such a society the most important resource is knowledge itself, incentives and recognition must be provided to encourage people to share their own knowledge. But people will not just share knowledge for the sake of it. First of all, some users tend to be conservative; therefore, these knowledge technologies must infiltrate into day-to-day use by making gradual changes to traditional systems. Secondly, if fascinating new technologies start emerging, one must make sure that they provide added value to the user in real time, even before the users start accessing them. The reason is that it is difficult to convince users to contribute to something which is empty; therefore, new technologies must not wait for users to bootstrap the process but must seed the process before the technology is used. Users utilise a technology only if it solves their problems today and not someday! The basic organisational structure of the data is normally stored in knowledge structures or ontologies which allow the users to search in the "semantic neighbourhood" (29) of a concept, therefore making the search much more powerful and natural for the user. Since such an ontology is normally defined as a formal specification of a shared conceptualisation(60), and since every user might need to see a different conceptualisation, different ontologies can be used to present different views of the same data. This is very important since it allows future technologies to be transparent for busy knowledge workers and to present to them only what they need to see. The SW is such a huge initiative that it will take a lot of time and effort before it becomes accessible and usable by everyone. In the coming chapters, we will present two methodologies that contribute a small step forward towards reaching these targets and eventually making the SW a reality.

1.2 Goal

The future of the Internet may well be based around the SW concept. For this web to become a reality, most of the documents available online must be semantically annotated in order to be usable by automated agents. It is unrealistic to believe that people will just sit down and semantically annotate their pages for the sake of creating the SW. Many people still have no idea what the SW is and why it is being created. If we want the SW to be a success, this annotation process must itself be automated. The goal of this research is therefore to investigate how different approaches can be used to automate, as much as possible, the process of semantically annotating the web. This task can be further subdivided into two distinct subgoals. The first subgoal must cater for documents which are not represented adequately on the web and for which, therefore, user intervention is inevitable. Our research focuses on using a semi-automatic methodology combined with AIE techniques in order to help users annotate documents. In this case AIE can be used in the traditional way, i.e. by having a user annotate a set of examples and training a learning algorithm. To do this, we researched and examined how an annotation tool for AIE should work. We then propose a methodology that reduces the user's input as much as possible by exploiting the annotations which were already encountered in the document. So if the user annotates the name of a person which is repeated a further four times in the document being annotated, the system will cater for those annotations itself by annotating them automatically. Also, with the help of an AIE engine, it will be learning from the user what kind of annotations are required in the document. The system will continuously evaluate itself and, when it learns to

reproduce these annotations, it will take over the annotation process and annotate the remaining documents for the user. The task of the user will change from the hard task of annotator to the much lighter task of supervisor of the annotations inserted by the system. This methodology is expected to give two main benefits: first of all, the effort expected from the user is diminished and focused only on the most difficult annotations; secondly, the time taken to annotate will be reduced drastically. The second subgoal, on the other hand, is aimed at handling the huge amount of information available on the Internet. Since the information found online is not centralised (it is normally distributed amongst different sources such as lists, search engines, digital libraries, etc.), it is common to have a lot of redundancy. This is given by the presence of multiple citations of the same fact in different documents. When known information is presented in different ways, it is possible to use multiple occurrences to bootstrap several recognisers, which are then generalised further and utilised to recognise new information. Therefore the main focus of this research will be on how to exploit the redundancies of the web and combine them with AIE techniques in order to discover more information. The approaches we will explore are varied and range from using Human Language Technologies such as Information Extraction to other information and resources available on the web. The result of this research will be a methodology capable of semantically annotating web pages using just some initial configuration from the user's side. The path towards achieving these goals raised various interests amongst the scientific community. In fact, throughout these past years, we managed to present our work to several different subgroups in the computing community worldwide and also to publish a long list of papers. The following is the complete list of publications which were produced as a direct result of the research we conducted:

Book Chapters

• Ciravegna, Dingli, Wilks and Petrelli: "Using Adaptive Information Extraction for Effective Human-centred Document Annotation" in Text Mining (Springer Verlag 2003)

Papers

• Dingli : "Next Generation Annotation Interfaces for Adaptive Infor­ mation Extraction " in 6 th Annual Computer Linguists UK Colloquium (CLUK03) , January 6-7, 2003 , Edinburgh, UK

• Dingli, Ciravegna and Wilks: "Automatic Semantic Annotation using Unsupervised Information Extraction and Integration" in International Conference on Knowledge Capture (K-CAP03) held in collaboration with the International Semantic Web Conference 2003 (ISWC03) Workshop on Knowl­ edge Markup and Semantic Annotation (K-CAP03), Florida, United States, October, 20-23

• Ciravegna, Dingli, Iria and Wilks: "Multi-Strategy Definition of Anno­ tation Services in Melita" in International Semantic Web Conference 2003 Workshop on Human Language Technology for the Semantic Web (ISWC03), Florida, United States, October, 20-23

• Ciravegna, Dingli, Guthrie and Wilks: "Integrating Information to Boot­ strap Information Extraction from Web Sites" in IJCAI 2003 Work­ shop on Information Integration on the Web, workshop in conjunction with the 18th International Joint Conference on Artificial Intelligence (IJCAI 2003), Acapulco, Mexico, August, 9-15

• Ciravegna, Dingli, Guthrie and Wilks: "Mining Web Sites Using Adap­ tive Information Extraction" in Research notes and demos section of the 19

10th Conference of the European Chapter of the Association of Computational (EACL 2003), Budapest, Hungary, April, 12-17

• Ciravegna, Dingli, Petrelli and Wilks: " User-System Cooperation in Doc­ ument Annotation based on Information Extraction" in 13th Inter­ national Conference on Knowledge Engineering and Knowledge Management (EKAW02), 1-4 October 2002 - Sigenza (Spain)

• Ciravegna, Dingli , Petrelli and Wilks : "Timely and N on-Intrusive Active Document Annotation via Adaptive Information Extraction" in Se­ mantic Authoring, Annotation and Knowledge Markup (SAAKM 2002) , ECAI 2002 Workshop July 22-26, 2002 , Lyon, France

• Ciravegna, Chapman, Dingli and Wilks: "Learning to Harvest Information for the Semantic Web" in Proceedings of the 1st European Semantic Web Symposium, Heraklion, Greece, May 10-12, 2004

• Glaser, Alani, Carr, Chapman, Ciravegna, Dingli, Gibbins, Harris, Schraefel and Shadbolt: "CS AKTiveSpace: Building a Semantic Web Application" in Proceedings of the 1st European Semantic Web Symposium, Heraklion, Greece, May 10-12, 2004

Posters

• Dingli: "Active Document Annotation via Adaptive Information Extraction" in University of Sheffield, Department of Computer Science, Research Retreat, 6th November 2002, Sheffield (Runner up for first prize)

• Ciravegna, Dingli and Petrelli: "Active Document Enrichment using Adaptive Information Extraction from Text" in 1st International Semantic Web Conference (ISWC2002), June 9-12, 2002, Sardinia, Italy (Runner up for first prize)

• Ciravegna, Dingli and Petrelli: "Document Annotation via Adaptive Information Extraction" in The 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 11-15, 2002, Tampere, Finland

• Ciravegna, Chapman and Dingli: "Armadillo: Harvesting information for the Semantic Web" in The 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 25-29, 2004, Sheffield, United Kingdom

1.3 Thesis Structure

The first part of the thesis takes a look at the different technologies involved. Chapter 2 describes the problem of annotation for the SW, together with a literature review. Chapter 3 describes the problem of automatically extracting and integrating semantic information; both the IE and II sections are followed by a literature review. In the second part, starting from Chapter 4, we will explain the semi-automatic methodology we propose. Section 4.4 will report the evaluation of the system and some concluding remarks. Then we take the problem a step further and explore the possibility of using fully automated techniques. In Chapter 5 we will explain the automated methodology we propose. Section 5.4 will report the evaluation of the system and some concluding remarks. Finally, Chapter 6 will present our vision of the future of annotation for the SW together with a general conclusion.

Chapter 2

Annotation

This chapter will deal with some of the most relevant research and work done in the use of Adaptive Information Extraction as support for annotation. The first section will highlight the main motivations behind our work. It will show some of the current problems in the field and the arising needs of the computing community. The following section will provide a comprehensive survey of the different approaches used so far to tackle this problem. It will also give an overview of similar systems used for other problems but which can provide ideas useful for our domain of research.

2.1 Motivation

The web of the future, most commonly referred to as the Semantic Web[1], is currently being constructed. It is invisible to the naked eye, and most people who use the Internet are unaware that the web is slowly changing, not in style but in content. This change in content involves enriching the data contained in the web with semantic content in order to facilitate the work of automated agents trying to interpret the contents of a web page. A fundamental task in order to achieve this goal is to associate metadata with content.

[1] http://www.w3.org/2000/01/sw/


Metadata enriches the information on the web with properties about some given content, even if the medium in which the information is stored does not directly support it. This kind of data is not only useful to describe content, but it can also be used for organisational and classification purposes, since it supplies additional properties which can be used for groupings.

The impact of this kind of data is unimaginable. If we just consider a common search problem as an example we can appreciate the potential of metadata. Until now whenever a user searches for information, this is done by selecting keywords which may be related to, or which are normally found in proximity of a particular concept.

This has the obvious limitation that the search criteria are purely based on guessing related keywords using the world knowledge of the user. Because of this, the search tends to be inexact and ambiguous. All this could be replaced by intelligent agents capable of performing searches based on semantic information rather than on related keywords. These kinds of searches would be more exact because the similarity of a page to the search query would not be based on bags of words and their occurrences in related contexts, but rather on exact semantic information. This example can be extended to unlimited domains such as online shopping, any type of query, etc.

What is frustrating about the current web is that the technology to make this possible exists. Browsers have become much more sophisticated than the original Mosaic[2], allowing customisable styles, applets, any kind of multimedia, etc. Standards too have evolved, from the powerful yet difficult-to-use SGML to the much more usable XML and all its vocabularies like RDF, XHTML, etc. Finally, and most importantly, all the information we need is available in the web pages, yet there is no automated way (so far) of extracting the semantic information from them.

[2] http://archive.ncsa.uiuc.edu/SDG/Software/Mosaic/NCSAMosaicHome.html

This is the reason why it is imperative to semantically annotate web pages before they are of any use for the Semantic Web. Semantic annotation is the process of associating metadata with whole documents or with parts of them. But annotation is not the only limitation: having all the technologies and standards without tools that make effective use of them is useless. A number of prototypical systems have been designed, yet they still lack a number of fundamental features. The basic and most important feature lacking in most systems is the learning element. Manual annotation is without doubt a burden for human users

because it is a repetitive, time-consuming task. It is a known fact that humans are not good at repetitive tasks and tend to be error prone. The systems that do support some sort of learning do so in a batch mode, whereby the learning is managed not by the application but by the user of the system. This can be seen clearly in tools such as MnM (38), S-CREAM (63), etc., whereby a user is first asked to annotate and then an IE engine is trained. There is a clear distinction between the tagging phase and the training phase. This has the adverse effect of interrupting the user's work, since the user has to manually invoke and wait for the learner in order for it to learn the new annotations. Apart from this, since the learning is performed incrementally, the user cannot be certain whether the learner has been trained on enough examples, considering the sparseness of the data normally dealt with. It may also be difficult for the user to decide at which stage the system should take over the annotation process, therefore making the handing over a trial-and-error process. Research towards making annotation semi-automatic, or indeed fully automatic, in order to semantically annotate documents is still far from achieving this goal. Because of this, there is still a lot of work necessary in this field of research.

2.2 Previous Work

This section highlights a number of existing systems that contributed in some way or another towards adding annotations to web text. Some of the systems set out to solve a different problem from the one we are trying to tackle. In our case the use of annotation is mainly restricted to a particular group of people: those annotating documents in order to train an IE system. Most of the applications presented here tackle the task of generic annotation, i.e. they want to empower users by giving them the facility to add comments to a web site. The task of annotating for IE builds upon generic annotation but goes a step further. It is not just a matter of adding comments in order to identify or highlight sections of the document, but rather a learning process whereby the annotator is teaching the IE system what concepts are required to be learned. In our approach, the IE system is constantly learning and, when it has seen enough examples, the annotations stop, since the IE system is capable of annotating the remaining documents. Therefore annotation is done in order to support an IE system and not just to annotate the documents.

Even though the programs in this section have a different scope from our task, they provide good indications as to how user annotation can be eased. This section will first take a quick look at the initial systems and then focus on the most relevant systems, those that can give a greater contribution to the solution of our problem.

1969 Standard Generalised Markup Language (SGML)[1] was developed by IBM in a research project whose task was to integrate law office information systems. It was meant to allow text editing, formatting, and information retrieval subsystems to share documents by having different meta-information, relevant to specific tasks, coexisting in the same document. SGML can be considered the ancestor of modern markup languages.

[1] ISO 8879:1986

1970-80's TeX[2] was one of the initial typesetting systems, developed by Professor Donald Knuth at Stanford University. The system is still used today with many enhancements, like the addition of LaTeX[3], which allows the user to define higher-level commands for common tasks. TeX reinforced the idea that layout information and content can be mixed in the same document, which lies at the base of modern web languages like HTML.

1988 Xanadu[4] is the original hypertext project led by Ted Nelson. It is still alive and proposing ideas as to how hypertext should exist together with different styles of annotation. From it a number of prototypical applications were born, like CosmicBook, etc., which seek to promote the ideas of markup.

1994 CoNote (56) is a computer-supported cooperative work system designed to facilitate communication within a group via the use of shared annotations on a set of documents. The annotation system is used to study how people can collaborate when working with a set of shared documents. It allows a group of people to share a set of documents and to make comments about them, all of which are shared with the other members of the group. An annotation to a document is a comment on, or question about, the document. Annotations can also refer to other annotations. For example, they can answer questions or refute arguments.

[2] http://www-cs-faculty.stanford.edu/~knuth/
[3] http://www.latex-project.org/
[4] http://www.xanadu.com/

1994 ComMentor (95) at the Stanford Digital Library was a research prototype for annotating web pages with a special customised browser. It required the installation of trusted, platform-specific client-side software. The browser provided a simple mechanism to identify a place on a currently visible page in which a comment could be written. In making an annotation, the user simply selects a region in the document being annotated, and the browser stores redundant information about it. When a page changes, an attempt is made to relocate the attachment point for the comment based on mechanisms such as embedded tags and/or string matching. Users can browse for annotation sets and put together their personal selection. The selected sets are stored as part of the user's browser profile, which is built from the information about the user on the annotation server. Among these sets, users can designate specific ones as "active". If a set is active, then the comments in this set for a document being retrieved are fetched from the comment server for this set. Annotations can be indicated in an interface in a number of ways, including marginal markings (as in LaTeX), format-marking of annotated text (as in WWW browsers with underlined anchors), in-line presentation of the annotation text, and in-place annotation indicators. The tool is a generic meta-viewer which can be used in a variety of contexts to get "preview" information faster than is possible with a full document-view window. Since annotations are documents too, they can be recursively annotated.

1998 JotBot[5] is a Java applet that retrieves annotations from specialised servers and presents an interface for viewing and composing annotations. It is a thin application when compared to all the others and only requires the installation of client-side software in a trusted Java virtual machine. The existing editor controls annotation size and does simple content validation. The existing version only supports small annotations, due to the fact that it maintains lightweight IP connections. Although limited, this tool was the first to demonstrate lightweight annotations over the web.

1999 Third Voice was a commercial company that created a browser plug-in for

annotation. The company is no more in existence mainly because of massive

6 preassure against its products . The system was a kind of news group enabling

people to talk and put comments in the contents of a site. The site was not

altered in anyway but users could see the comments attached to it if they

obtained the plugin. ThirdVoice was unpopular with many Web site owners,

who were disturbed by the idea of people posting critical, OffTopic, or obscene

material that would be presented on top of their site. Legal action was also

discussed. Another issue was privacy since the annotations were centrally stored

and controlled by Third Voice only.

1999 Annotate.net (http://www.annotate.net) is a commercial company which makes it possible for companies

who are members of a privileged group to annotate specific pages in order to

divert customer traffic to their own sites.

2000 The Annotation Engine (http://cyber.law.harvard.edu/projects/annotate.html), found at the Berkman Center for Internet and Society, Harvard Law School, is an open-source server-based proxy written in Perl. The

system makes use of a database in order to store user comments, pointers etc.

These entities are then added to a document before it is previewed by a user.

The original page is not physically altered; instead, the page is

loaded through a proxy which adds the annotations to it. This proxy approach

allows the notes to be browser and operating system independent.

2002 VisualText™ is a development environment created by a commercial company

called Text Analysis (http://www.textanalysis.com/). It integrates multiple strategies, including statistical,

keyword, grammar-based, and pattern-based, as well as diverse information

sources, including linguistic, conceptual, and domain knowledge, to quickly and

efficiently develop text analysis applications. Apart from these, it also implements methods for automatically generating rules from annotated text samples

in order to aid the user during annotation.

2.2.1 The Alembic Workbench

The Alembic Workbench(34) is made up of a set of integrated tools that make use of several strategies in order to bootstrap the annotation process. The idea behind

Alembic is that whenever a user starts annotating, the annotations being inserted are normally not bound to that specific document but can apply to other similar documents. In order to reduce the annotation burden as much as possible, the system should make use of such information. Therefore in Alembic, every piece of information

which can be used to help the user is utilised. Eventually the task of the user changes

from one of manual annotation to one of manual reviewing.

A user inserts annotations in Alembic by marking elements in the document using a mouse. Together with this method, some other strategies are used in order to facilitate annotation, such as:

• Simple string matching is used to locate additional instances of marked entities.

• A built-in rule language is used to specify domain-specific tagging rules.

• A tool lists potential phrases in documents in order to help users identify common patterns.

• Statistical information about words, such as frequency counts and potential importance, is provided.

The most innovative feature of this application is the use of pre-tagging. The main idea is that information which can be identified before the user starts tagging should be tagged automatically by the system, so that the user does not waste time on elements the system can handle by itself.

Alembic also implements a bootstrapping strategy. In this approach, a user is asked to mark some initial examples as a seed for the whole process. These examples are sent to a learning algorithm that generates new examples, and the cycle continues in this way. Eventually a number of markings are obtained and are presented to the user for review. If the user notices that some of the rules are generating incorrect results, it is possible for the user to manually change the rules or their order so that the precision of the algorithm is increased. Although the machine learning rules generate quite good results, they lack two important factors which humans have, i.e. linguistic intuition and world knowledge. The Alembic methodology does not cater for redundant information, therefore allowing documents which are already covered by the IE system to be presented to the user for annotation. This makes the annotation process more tedious and time consuming for the user.
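The cycle can be summarised as a seed-learn-review loop. The sketch below reduces the learner to exact string matching purely for illustration; it is not Alembic's actual algorithm, and every function and variable name is invented:

```python
# Schematic bootstrapping loop: seed examples -> learn rules -> propose
# annotations -> user review -> feed accepted results back as new seeds.
# The "learner" is trivial (literal string matching) and stands in for the
# far richer rule learning used by real systems such as Alembic.
def learn_rules(seed_examples):
    """'Learn' trivial rules: here, simply the literal seed strings."""
    return set(seed_examples)

def apply_rules(rules, documents):
    """Propose annotations by matching the learnt rules against the corpus."""
    proposals = []
    for doc_id, text in documents.items():
        for rule in rules:
            start = text.find(rule)
            if start != -1:
                proposals.append((doc_id, start, start + len(rule), rule))
    return proposals

def user_review(proposals):
    """Stand-in for the manual reviewing step: accept everything."""
    return proposals

documents = {"d1": "The seminar starts at 3pm in room G12.",
             "d2": "Lunch is at 3pm, the talk follows at 4pm."}
seeds = ["3pm"]

accepted = []
for _ in range(2):                        # a couple of bootstrap iterations
    rules = learn_rules(seeds)
    accepted = user_review(apply_rules(rules, documents))
    seeds = [text for (_, _, _, text) in accepted]   # results become new seeds
print(accepted)
```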

Experiments performed using the Alembic Workbench have shown significant improvements in the annotation of documents. In several tests it was shown that users doubled their productivity rate. Also, with the data provided both by the users and automatically by the system, it was possible to train quite complex IE tools.

2.2.2 Annotation System

The Annotation System (http://www.ncb.ernet.in/groups/dake/annotate/) developed at NeST Bangalore is one of the emerging annotation systems specifically targeted at the Semantic Web. This system works using a three-tier approach made up of a thin client (which is normally a plug-in), an application server and a database used to store the annotations.

The system proposes an extension to a normal web browser by attaching to it the functionality of annotating web pages directly from the browser. The user can select a particular piece of text in the current web page and either annotate it or create an annotation for the whole document. The system offers a restricted set of possible annotation types: changes required, comments, corrections, examples, explanations and questions. This division into categories is important because it allows the user to query for annotations in a particular category. Apart from this, the annotations are grouped into subject hierarchies which a user can add to or edit. He can browse through the hierarchy and select the appropriate subject under which his annotation can then be categorised. Annotations can also be stored in the corresponding subject

categories under different users, and others can easily look up the annotations made for particular subjects once they come to know that a person X is doing research on topic Y. Once the annotation phase is completed, the annotations are stored separately from the document, either locally or on a remote server. A strength of this system is that virtually any resource on the web, ranging from images and embedded objects to anything with a URI (Uniform Resource Identifier, http://www.w3.org/Addressing/), can be annotated. These annotations are limited to whole documents and not to parts of them.

Another feature of the system is that users can not only view the annotations but also reply to any annotations they see. This facilitates the creation of discussions based on subparts of the whole document. An important contribution of this approach is the fact that annotations are not stored within the document. This makes it possible for documents to have multiple annotations for different purposes. The filtering and searches performed on the annotation database make it possible for the users to navigate through the huge annotation space without getting lost in the sea of annotations which one document could have. A user is capable of viewing or hiding annotations according to several criteria chosen by him.

2.2.3 Annotea

Annotea (74) is another system, similar to the ones already described, whose task is to associate metadata with content. The main difference between this and other systems is that Annotea is a W3C (http://www.w3.org/) initiative, and although its main aim is to create a web-based shared annotation framework, it does so using W3C standards as much

as possible. The annotations are stored in an external document, and the information is stored as an RDF (Resource Description Framework, http://www.w3.org/RDF/) file. Together with RDF, additional information is stored using XPointer (http://www.w3.org/TR/xptr/), XLink (http://www.w3.org/TR/xlink/) and HTTP (http://www.w3.org/Protocols/). The current implementation makes use of the Amaya (http://www.w3.org/Amaya/) editor/browser and a generic RDF database. This kind of database was used because it allows other applications that use RDF to access it without having to change anything.

Annotea is capable of handling many types of annotations. In its simplest form, an annotation can be seen as a remark about a document referenced by a URI, made either by the document's author or by a third party. The annotations are stored in specialised servers and can be accessed and modified by anyone.

When Annotea was being designed, a number of issues were considered. It was built around open technologies and standards in order to simplify its interoperability with other systems and to maximise its extensibility. To simplify the design of the system, annotations were limited to XML-based documents, and the annotated documents must follow the basic XML rule of well-formedness. Annotations are typed, using metadata to represent the types. The types allow users to classify annotations. Annotations can be either private or public. Private annotations are stored locally while public ones are published on distributed annotation servers.
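As an illustration of what an external, linked annotation can look like in RDF, the sketch below builds a small graph with rdflib. The vocabulary only approximates the Annotea model (the namespace and property names shown are assumptions for illustration, not quoted from the specification):

```python
# Illustrative sketch of an annotation expressed as RDF and kept outside the
# annotated document, in the spirit of Annotea.  The vocabulary below is an
# approximation used for illustration, not the official Annotea schema.
from rdflib import Graph, Namespace, URIRef, Literal, BNode
from rdflib.namespace import DC, RDF

ANN = Namespace("http://www.w3.org/2000/10/annotation-ns#")  # assumed namespace

g = Graph()
g.bind("a", ANN)
g.bind("dc", DC)

note = BNode()
g.add((note, RDF.type, ANN.Annotation))
g.add((note, ANN.annotates, URIRef("http://example.org/report.html")))
g.add((note, ANN.context, Literal("xpointer(id('sec2'))")))   # illustrative XPointer
g.add((note, ANN.body, Literal("This section needs a reference.")))
g.add((note, DC.creator, Literal("A. Reviewer")))

print(g.serialize(format="turtle"))
```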

The system described is mainly a generic backend, but for our purposes we will shift focus to the client side, namely Amaya. Amaya is a full-featured web browser and editor developed by W3C as a testbed. The system supports all the


Annotea protocols and its main features include the creation, browsing and filtering of annotations. First of all, a user can annotate three things: a whole document, a particular location in the document (based on the cursor's position) or the currently selected text. As soon as an annotation is created, the user is asked to enter metadata describing the annotation. The browsing of the annotations occurs in two stages: first the annotations are downloaded, either from the local repository or from remote servers, and then they are merged with the document. A user goes through the annotations by selecting them, and their content, plus all the associated metadata, is immediately displayed in another window. Finally, since a document may be heavily commented, the system allows annotations to be filtered by author, by annotation type or by annotation server. Amaya also has a query language called

Algae which allows the user to query the document for particular annotations only.

Annotea is a tool that allows users to associate metadata with web resources.

The system is built on open standards and can therefore be used by different applications. Also, the system is virtually clientless since it can exist on top of any browser

that handles the W3C standards. The model specified by this tool is used in various other annotation projects like Annozilla (http://annozilla.mozdev.org/screenshots/phoenix/annozilla_0.4), which is basically an implementation for the Mozilla web browser. A further enhancement over Annozilla is the COHSE Annotator (http://cohse.semanticweb.org/), which adds to the basic system the possibility of using a DAML+OIL ontology instead of just a list of concepts.


2.2.4 CritLink

The idea behind the CritLink tool (http://crit.org/ping/ht98.html) is based upon the original hypertext idea, where users can insert comments in any web page in order to create an online discussion about that particular page. It also allows new types of links like bidirectional linking, extrinsic linking, fine-grained anchors, and link typing.

Bidirectional links are quite important, especially for consistency's sake. Currently the web makes use of unidirectional links; this has the obvious limitation that whenever a link is followed, there is no simple way to trace the referring link. This omission makes the logical flow of any web page unclear and difficult to follow.

Intrinsic links refer to links found in a document. They are inserted by the author and no one else can modify them or add new links. Extrinsic links are external links which are stored somewhere outside the particular document but which originate from a location within it. This kind of link allows third parties to add relations to other documents.

Link typing allows a link to have attributes that describe the relationship between the two objects being linked. The reason for this is that whenever a link exists, there is no way of understanding what the purpose of the link is unless the link is followed or unless there is a description next to the link. This limitation makes it more difficult to determine whether a link is relevant or not.

Anchor granularity refers to anchors that explicitly define the part of the document being anchored. Although this is part of the current HTML standard, most browsers simply ignore it. Whenever an anchor is clicked in a document, the user is taken to approximately the location referred to by the anchor, but there is no definite way of

knowing the exact area.

These limitations degrade the current user experience on the web. If a user notices an error on a web page, the page cannot be corrected, and if a user wishes to add some references to an existing web page, it cannot be done. This leads to a lot of incomplete information and wasted time. The web is not exploiting the benefits of having a community of users who can contribute to enrich the information found in web pages.

The CritLink software offers all these features with the added benefit that no change is required on the client's side. Before web pages are presented to the user, they are passed to an intermediate proxy called a mediator in order to modify the content with annotations. The mediator processes the document, examines the links in the database and inserts new links in the document in order to create reversible links. The system displays different icons depending on the type of the link. Therefore, the original page is not altered in any way and the user's experience is enhanced.
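CritLink's mediator and the proxy-based systems described earlier all rely on the same pattern: the page is rewritten on its way to the browser. The sketch below illustrates that pattern only; it is not CritLink's actual code, and the annotation store, URL and injected markup are invented:

```python
# Minimal sketch of a proxy-style annotation mediator: the page is fetched on
# behalf of the user and the stored annotations are merged into the HTML
# before it is returned.  The store and the injected markup are assumptions
# made purely for illustration.
from urllib.request import urlopen

# Hypothetical store mapping a page URL to a list of comments.
ANNOTATION_STORE = {
    "http://example.org/": ["First shared comment", "A second remark"],
}

def annotate_page(url: str) -> str:
    """Fetch a page and inject any annotations known for it."""
    html = urlopen(url).read().decode("utf-8", errors="replace")
    notes = ANNOTATION_STORE.get(url, [])
    if not notes:
        return html
    block = "<div class='annotations'><h3>Annotations</h3><ul>"
    block += "".join(f"<li>{note}</li>" for note in notes)
    block += "</ul></div>"
    # Insert the annotation block just before the closing body tag.
    if "</body>" in html:
        return html.replace("</body>", block + "</body>", 1)
    return html + block

if __name__ == "__main__":
    print(annotate_page("http://example.org/")[-400:])
```

Because the merging happens on the server side, any browser on any operating system sees the annotated page, which is exactly the platform independence noted above.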

2.2.5 iMarkup

iMarkup (http://www.imarkup.com/) is a commercial tool whose purpose is to promote collaborative document creation. It is similar in spirit to the other tools already mentioned. The system makes use of a Context Review Server (CRS) in order to facilitate web-based collaboration between pre-defined groups of users. The iMarkup tools allow users to annotate, organize and collaborate on Web pages and documents. The system allows the creation of various types of markup.

These include:

sticky notes: This note is like a Post-It Note for web pages. It is a symbol that


can be "stuck" to any web page. The note is transparent therefore allowing

text underneath it to remain visible. The sticky notes can also be minimized or

displayed as a small icon on the Web page until the mouse is moved over it. At

the end of the session, the note is saved.

free form drawings: The program comes with a Paint Brush tool that enables freeform

drawing or to draw straight lines on any Web page. To create freeform marks

on a Web page, the mouse is used to draw lines on the screen. The Paint Brush

can be a variety of colors and sizes, even transparent.

text markup: The system allows various text markup functions. Italics, underline,

bold, strike-through and highlight are all supported. It also allows the use of

a system similar to a highlighting pen whereby the background is changed to

a bright color (highlight) and the text remains intact. Additional notes and

comments can also be saved with the text markups.

voice annotations: The latest version of the program also supports voice annotations and markups. The system works by simply adding your annotations by speaking "into the markups". Whenever a new voice annotation is inserted, a

small speaker is shown in the text exactly in the position where the annotation

was inserted. To listen to a voice annotation, the user just needs to click on the

speaker icon that is associated with the markup. When sharing markups the

recipient can hear, instead of just read, ideas associated with the content on a

Web page. hotlinks and file attachments The markups also support file attachments and 37

hotlinks. This is useful when the user wants to attach additional support infor­

mation to a document. Whenever a file is attached, a paperclip symbol appears

exactly in the place where the attachment was inserted.

The key benefits of using this system are that the collaborative environment allows all team members to perform simultaneous reviews as part of the workflow process.

The different kinds of annotations allow a clearer delivery of ideas, therefore making sure the message gets across to the other members of the team.

2.2.6 MnM

MnM (38) is an annotation tool that aids the user whose task is to annotate documents by providing semi-automatic or fully automatic support for annotation. The tool has integrated into it both an ontology editor and a web browser. MnM supports five main activities: browse, markup, learn, test and extract. Browsing is the activity of presenting to the user ontologies stored in different

locations through a unified frontend. The purpose of this activity is to allow the

user to select concepts from the ontologies or whole ontologies which can be used to

markup the documents in future stages. To do so, the application provides several

views, various previews of the ontologies and the data stored in them. This part is

also referred to as ontology browsing. Markup is the activity of semantically annotating documents. This is done in

the traditional way, i.e. by selecting concepts from the chosen ontology and marking

up the related text in the current document. When a specific section is marked, an

XML tag corresponding to a concept in the ontology is inserted in the body of the

document.

For the learning phase, MnM has a simple interface through which several learning algorithms can be used. Various IE engines have been tested, ranging from BADGER (http://www-nlp.cs.umass.edu/software/badger.html) to, more recently, Amilcare (http://nlp.shef.ac.uk/amilcare/). The role of IE in MnM is to learn mappings between annotations in the documents and concepts in the various ontologies.

In MnM there are basically two ways of performing tests, explicitly or implicitly.

In the explicit approach, the user is asked to select a test corpus, which is either stored locally or somewhere online, and the system performs tests on that corpus. In the implicit approach, the user is still asked to select a corpus, as in the explicit approach.

The difference is that the strategy for testing is handled by MnM and not all the documents are necessarily used for testing. The corpus is divided between test and training corpora automatically by the system. The final phase is the extraction phase. After the IE algorithm is trained, it is used on a set of un-tagged documents in order to extract information. The information extracted is first verified by the user and then sent to the ontology server to populate the different ontologies. MnM is part of a new breed of tools that integrate ontology editors with annotation interfaces. These, together with the support of IE engines, surely make the annotation task easier. Tools like MnM are very important and will surely give a push towards the realisation of the Semantic Web.

2.2.7 S-CREAM

S-CREAM (63) (Semi-automatic CREAtion of Metadata) is an annotation framework that facilitates the creation of metadata and which can be trained for specific domains.


On top of this framework, there is Ont-O-Mat, an annotation tool. This tool makes use of Amilcare, an adaptive information extraction engine. Amilcare is trained on test documents in order to learn information extraction rules. The IE engine is then used to support the users of Ont-O-Mat, therefore making the annotation process semi-automatic.

The origins of S-CREAM lie in another system called CREAM, which is an annotation and authoring framework. The system makes use of an ontology together with annotations. In this application, annotations are elements inserted in a document which can be of three types: tags that are part of the DAML+OIL (http://www.w3.org/TR/2001/NOTE-daml+oil-reference-20011218) domain, attribute tags that specify the type of a particular element in a document (also referred to as

Metadata), or a relationship tag, also called Relational Metadata. A user can interact with CREAM in three ways: by changing the ontology and the fact templates manually, by marking concepts in the document and then associating them with elements of the ontology, or by grabbing concepts from the ontology and marking them in the document. S-CREAM goes a step further and exploits the power of adaptive IE offered by

Amilcare in order to improve the CREAM model. Whereas in CREAM all the interaction occurred almost exclusively manually, the process in S-CREAM is as follows. The first kind of process is similar to the one in CREAM, i.e. a user manually annotates and performs the mappings between the web documents and the ontology. The second kind is the same as the first, but instead of a user, an IE engine

(in this case Amilcare) is used to automatically perform the mappings. Obviously, this can only occur after the IE engine has been trained on a substantial number of examples

provided by the user. The last kind of process uses a discourse representation to map from the tagged document to the ontology. This discourse representation is a very light implementation of the original theory, the reason being that discourse representation was never intended for semi-structured text but rather for free text.

Therefore to overcome this limitation, the one used in S-CREAM is a light version made up of manually written logical rules in order to map the concepts from the document to the ontology.

S-CREAM is a comprehensive framework for creating metadata together with relations in order to semantically mark up documents. The addition of an IE engine makes this process even easier and helps pave the way towards building automated systems that automatically and semantically enrich the millions of documents which are still not ready for the next evolution of the web, the Semantic Web.

2.2.8 SHOE Knowledge Annotator

The Knowledge Annotator (82) (68) is a tool that makes it easy to add SHOE-specific knowledge (SHOE is an HTML-based knowledge representation language) to web pages by making selections and filling in forms. The tool has an interface that displays instances, ontologies and assertions. A variety of methods can be used to view the knowledge in the document. These include a view of the source HTML, a logical notation view, and a view that organises assertions by subject and describes them using simple English. The Annotator can open documents from either the local disk or the Web. These are parsed using the SHOE library, and any SHOE instances contained in the document are displayed in the Instances panel.

The Knowledge Annotator can be used by inexperienced users to create simple SHOE markup for their pages. By guiding them through the process and prompting them


with forms, it can create valid SHOE without the user having to know the underlying

syntax. For these reasons, only a rudimentary understanding of SHOE is necessary

to mark up web pages. However, for large markup efforts, the Annotator can be quite slow.

2.2.9 SMORE

SMORE (3) is an integrated environment for creating online content. The system is

made up of a Semantic Markup tool, an Ontology viewer/editor and an RDF Editor.

The aim behind SMORE is to create a tool that facilitates the seamless insertion of semantic markup into any online content.

The idea being promoted in this tool is that anyone should be able to say anything about anything. To achieve this, there should be no constraints such as a predefined ontology. Because of this, SMORE tries to move away from the traditional

knowledge engineering tools towards web authoring tools. Basically the insertion of

semantics should be a side effect of the creation of a web site rather than the main

concern of the user. This philosophy can be seen in several parts of the tool. First

of all, it has a WYSIWYG (What You See Is What You Get) editor which allows users to insert annotations as RDF

triples while they are creating the actual page. A user can also write emails from

within the tool for which the standard fields (such as subject, to, from, body etc.)

are attached to an email ontology and the user can annotate sub parts of an email.

One typical use is whenever an email is sent requesting a meeting. The date, time, location, etc. are annotated with concepts from an ontology which is sent as an attachment with the email and which could be used by agents on the receiving side (if they are capable of processing the ontology). Some additional features of the tool include annotation


of a whole image or parts of it, an advanced ontology management facility, web scraping and a semantic virtual portal (that provides other semantic links).

The tool basically promotes the idea of semantic annotation not as a process separate from other web processes but as something which can be integrated with

any web application. It also removes the boundaries from traditional annotation and

proposes annotation of various different types of media. A key feature is the flexibility

offered by this tool whereby one or more ontologies can be used to model the domain

and if one is not available, then the user is free to create his own ontology ad hoc.

2.2.10 The Gate Annotation Tool

GATE (General Architecture for Text Engineering, http://gate.ac.uk/), from the Natural Language Processing Group at the University of Sheffield, is an infrastructure which facilitates the development and deployment of software components used mainly for natural language processing. The GATE package comes complete with a number of software

components such as IE engines, Part of Speech taggers etc. One of the main features

of the Graphical User Interface (GUI) provided with GATE is the annotation tool.

The annotation tool is first of all an advanced text viewer compliant with many

standard formats. A document in GATE is made up of content, annotations and

features (attributes related to the document). The annotations in GATE (like any other piece of information) are described in terms of an attribute/value map. The attribute is a textual description of the object while the value can represent any Java object (ranging from a simple annotation to a whole Java object). These annotations

are typed and are considered by the system as directed acyclic graphs having a start

and end position.
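The following sketch shows what such stand-off, typed annotations with start/end offsets and an attribute/value feature map can look like in code. It is an illustrative data structure only, not GATE's actual Java API:

```python
# Stand-off annotations in the spirit of the model described above: each
# annotation is typed, carries start/end character offsets into the document
# content, and holds an attribute/value feature map.  Illustrative only.
from dataclasses import dataclass, field

@dataclass
class Annotation:
    start: int                  # character offset where the annotation begins
    end: int                    # character offset where it ends
    type: str                   # e.g. "Person", "Date"
    features: dict = field(default_factory=dict)   # attribute/value map

text = "Alexiei Dingli submitted the thesis in July 2004."
annotations = [
    Annotation(0, 14, "Person", {"gender": "male"}),
    Annotation(39, 48, "Date", {"month": "July", "year": "2004"}),
]

for ann in annotations:
    print(ann.type, "->", text[ann.start:ann.end], ann.features)
```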


The annotation interface works like similar tools, whereby a user selects a concept from an ontology and highlights the instances of the concept in the document. The system also supplies some generic tools which are capable of extracting generic concepts from documents. These tools can be extended by use of a simple grammar to cover more domain-specific concepts. Being an architecture, GATE allows other external components to be loaded which can help to locate concepts. The results of

these tools are then presented in the annotation interface in the form of a tree of

concepts. The user simply needs to highlight a concept or a group of them and the

annotations are immediately displayed in the document viewer as coloured highlights.

The GATE annotation tool is a powerful tool since it allows several independent

and different components to work together. It also presents the user with a unified

view of results obtained from these components.

2.2.11 Trellis Web

Trellis Web(57)(58) is an annotation tool with a different purpose. Instead of allowing

users to add any kind of comment to a particular web site, it restricts these comments

to a well defined set of annotations having specific meaning. These annotations are

used to express either discourse relations, logical connectives or temporal relations.

Since these relations can be too strict at times, some partial ones are also allowed. The

reason why Trellis Web takes this approach is because its application is not oriented

per se towards normal annotation but rather towards information analysis and

decision making. To make this analysis possible in an automated way, it is imperative

that the grammar used is well defined and restricted. This makes it possible for the

tool to perform some automatic inferences on the connected data available.

A plus of this tool is that, since the comments inserted are based upon well defined relations, the user does not need to justify or explain why a certain comment was written, since it was inserted in a structured and logical way which can be interpreted unambiguously by anyone. If one of the decisions needs to be updated, this too does not pose a problem, because the full explanation of why the decision was reviewed is available. Although this tool is very useful, it has an obvious drawback: users must be accustomed to the vocabulary used to define the relationships. This can be limiting in some cases, since the relations offered do not cover all domains.

2.2.12 Yawas

An annotation tool that is oriented towards allowing users to insert comments directly in the web is Yawas. This tool as described in (36) tries to address a number of problems. Users do not remember why some pages were kept for future reference, they do not understand how they found a page (e.g. the navigational path is lost).

The aim of Yawas is to have an annotation tool working with a Web browser in order to help overcome problems such as handling the explosion of bookmarks in web browsers, remembering why bookmarks were created and storing information about the context of a page, e.g. the path that led to it.

Yawas helps the user solve these problems in a number of ways. First of all, by highlighting specific texts, the user can remember what was of interest in a particular page. Secondly, by keeping the parent URL of each highlighted document, the user is provided with the context of the book-marked pages. Thirdly, the annotations can improve personalised document clustering. Most of these features are not new, and they all have a number of weaknesses, the main one being that each of them has the potential of altering the contents of a page, and it is very difficult to control who has the right to do so without being intrusive. The original solutions were oriented towards a client-server approach whereby the client is presented with a platform-independent interface and the changes are stored somewhere remotely on a server.

This approach was later discarded due to the fact that the delay imposed on the user in order to connect to the server would be annoying and intrusive in the annotation cycle. Apart from this, the server approach raises some privacy issues about who has the right to store the data.

In synthesis, the Yawas system is a light annotation tool for the web. Annotations are stored locally, for the previously mentioned reasons, and the system makes use of Dynamic HTML in order to modify the HTML document dynamically without the need to reload it. Yawas works by allowing the user to insert highlights in the HTML

document simply by highlighting interesting areas. Comments can also be attached

to any document. The comments inserted in the text can be considered as being a

rough form of inserting semantics. In experiments performed with annotated documents using a classifier, the classifier produced more accurate and uniform semantic descriptions of page clusters when the pages had already been annotated by different users than when un-annotated documents were used. These experiments suggest that annotations can improve the automatic clustering of Web pages.

The main problem with this application is that Yawas is platform dependent.

Apart from this, it is a research project, and other commercial annotation tools have already appeared that work along the same lines as Yawas and compete with it.

2.3 Conclusion

In this chapter, we saw a number of systems, all with different features and characteristics. We started from the first systems that made a significant contribution to the annotation task and gradually moved towards the different annotation systems we have today. To get a concise picture of all these systems, we have grouped them using the following attributes:

Usability shows whether the program is difficult or easy to use. This feature was

quite crucial in some of the systems mentioned above since a difficult to use

interface sometimes resulted in the silent death of the system.

Features refers to those characteristics that distinguish a system from the rest. E.g.

some of the programs mentioned before have special features that avoid duplicated annotations. Such features are too varied to list in full, but with

this attribute, we are distinguishing between those systems that give an added

benefit to the user and those that just present a plain annotation tool.

Type refers to the different kind of applications and how they work. These can be

subdivided into four groups:

Stand Alone applications work on a single machine and they are not connected to a server somewhere else. The benefit of this approach is that

the system generally works faster since no remote invocations occur. The

downside of the system is that since the program is self contained, it does

not share information with others and thus there is no collaboration involved.

Client-Server applications typically have a thin client on the user's machine

which accesses a remote server. The benefit of such a system is that since

these programs require a lot of processing power to function correctly, all

the processing is done remotely on the server and only the results are sent

to the thin client. Thus, the user does not need a powerful machine to use the system.

Proxy based applications are similar to Client-Server systems but instead of

using a thin client, they typically use a normal web browser to access the

annotated page. In fact, the page is not annotated on the client machine

but rather when it is being passed through the proxy. This has the same advantages as a Client-Server approach, with the added benefit that the user does not have to install any client application.

Distributed applications can be similar to either Client-Server or Proxy based

approaches, but since the system is distributed, the same data can be accessed and modified by different users. Therefore, such systems would have

additional features that handle different user rights thus allowing secure

collaboration between different users.

IE distinguishes between systems that make use of IE and those that don't. It also

differentiates between those systems that make use of a Batch approach and those that use an

Incremental approach (See section 3.2.1).

Embedded/Linked refers to the method used to store annotated documents. The

former method modifies the body of the document and stores the annotations

inside it. The latter does not alter the document containing the data in any

way but stores the annotations in a separate document which contains links to

the information inside the data document.

Annotations distinguishes between the different kind of annotations which can be

stored inside a document:

Layout annotations are used to define how a document should be displayed

visually, such as Bold, Italics, etc.

Information annotations are used to insert any kind of information, such as

comments, inside the document. They are generally free text which are

not associated with a particular topic, therefore their purpose is generic

and not well defined.

Semantic annotations are used to assign meaning to a portion of text. These

annotations are normally associated with an Ontology and have a well

defined meaning.

Public attribute distinguishes between systems developed by public entities and

those developed by commercial organisations.

Secure systems offer a certain degree of protection when a user annotates a document. The security features are various and include protecting the document from unauthorised modifications or viewing of annotations, protecting document editing in distributed environments, etc.

Systems | Usability | Features | Type | IE | Embedded/Linked | Annotations | Public | Secure
SGML | Difficult | N/A | N/A | N/A | Embedded | All | Yes | No
TEX | Simple | N/A | N/A | N/A | Embedded | Layout | Yes | No
Xanadu | Simple | N/A | N/A | N/A | Linked | Information | Yes | No
CoNote | Simple | Yes | Distributed | N/A | Linked | Information | Yes | No
ComMentor | Simple | Yes | Client-Server | N/A | Linked | Layout/Information | Yes | No
JotBot | Simple | Yes | Client-Server | N/A | Linked | Layout/Information | Yes | No
Third Voice | Simple | Yes | Client-Server | N/A | Linked | Layout/Information | No | No
Annotate.net | Simple | Yes | Client-Server | N/A | Linked | Layout/Information | No | Yes
Annotation Engine | Simple | Yes | Proxy | N/A | Linked | Layout/Information | No | No
VisualText | Simple | Yes | Stand Alone | Batch | Embedded | Information | No | Yes
Alembic Workbench | Simple | Yes | Stand Alone | Batch | Embedded | Information | Yes | Yes
Annotation System | Simple | Yes | Client-Server | N/A | Linked | Semantic | Yes | No
Annotea | Simple | Yes | Client-Server | N/A | Linked | Semantic | Yes | No
CritLink | Simple | Yes | Proxy | N/A | Linked | Information | Yes | No
iMarkup | Simple | Yes | Distributed | N/A | Linked | Information | Yes | Yes
MnM | Simple | Yes | Stand Alone | Batch | Embedded | Semantic | Yes | Yes
S-Cream | Simple | Yes | Stand Alone | Batch | Embedded | Semantic | Yes | Yes
Shoe Annotator | Simple | No | Stand Alone | N/A | Embedded | Semantic | Yes | Yes
SMORE | Simple | Yes | Stand Alone | N/A | Embedded | Semantic | Yes | Yes
Gate Annotator | Simple | Yes | Stand Alone | Batch | Linked | Semantic | Yes | Yes
Trellis Web | Difficult | Yes | Client-Server | N/A | Linked | Information | Yes | Yes
Yawas | Simple | No | Stand Alone | N/A | Linked | Information | Yes | Yes

Table 2.1: Comparison of the different Annotation systems

Table 2.1 lists all the different systems mentioned earlier and their attributes.

From it, we can identify a number of desirable features which an annotation system should have. First and foremost, such a system should be simple to use. Simplicity does not imply that our system will lack special features; it just means that the system should still have powerful features which enhance the user's experience, but they should be easy to use. The type of the application should be considered carefully, because a stand-alone system would discriminate against users with a slow machine, whilst the other approaches would discriminate against users who do not have access to a network.

Because of this, an ideal system would cater for all the different situations. When it comes to the issue of using IE, the most desirable feature is having a system capable of interactive IE rather than batch IE. In fact, no system mentioned above is capable of having an interactive IE session with the user. The issue of having embedded or linked annotations is very debatable; there are many pros and cons to both approaches, and the choice is very much dependent on the IE engine used. Regarding these issues, in Chapter

3 we will take a look at the various IE systems available.

The annotations which such a system should cater for are Semantic annotations.

There are two reasons for this: first of all, the scope of this thesis is the Semantic Web, and secondly, Semantic annotations can be considered a superset of Information and

Layout annotations, since both of them can be expressed as Semantic annotations. Finally, the program must be secure enough, especially if it handles distributed domains.

These features and many others will be demonstrated in Melita, the annotation tool described in Chapter 4. The following chapters will then build on the ideas implemented in that tool.

Chapter 3

Information Extraction and Integration

The two main technologies which are most relevant to our work are without doubt Information Extraction (IE) and Information Integration (II). IE provides the capabilities necessary to learn new concepts, while II allows our systems to handle huge amounts of information harvested from several different sources. In Section 3.1, we will first take a look at why we need IE in the first place. To do so, we will focus on a specific subgroup of IE techniques most commonly referred to as Adaptive Information Extraction (AIE) techniques. This section will highlight the main motivations behind our work. It will show some of the current deficiencies in the field and the arising needs of the computing community. Section 3.2 will start with a brief introduction in order to explain some technical concepts used in the IE field and will continue towards giving a comprehensive survey of the different approaches used so far. Section 3.3 will take a look at II and will try to give an overview of the most relevant research done so far. It will highlight the main motivations behind our work and will illustrate some of the main problems in the field. Section 3.4 will then give


an overview of the different approaches developed so far. Unfortunately, due to the fragmented nature of the field, we will only scratch the surface in our reviews, since a more comprehensive review of all the topics found in the II field would be beyond the scope of this thesis. Instead, we will give an overview of the major topics found in the field.

3.1 Motivation for using AIE

AIE is quite a recent field; it has its roots in traditional Information Extraction

(IE) but differs in a subtle aspect. Whereas traditional IE is more concerned with producing systems which are capable of extracting information, AIE still has that as its main goal but considers the usability and accessibility of the system to be equally important. Although traditional IE managed to find the required information with a high degree of success, it did not fare as well whenever the need arose to port an existing system to new domains easily. There are a number of reasons for this. Such systems are normally complex, with large amounts of domain-dependent data; because of this, an IE expert is required to port them. Also, due to the complexity of such systems, the process of porting them to a new domain is very time consuming.

In synthesis, one of the main barriers to the use of IE is the difficulty in adapting IE systems to a new scenario (105). AIE seeks to overcome this problem by creating systems which are adaptable to new domains without the need for much human input.

This is achieved by exploiting the power of Machine Learning (ML) in order to learn how to automatically tune the application to new domains. This field is still in its infancy and the future will surely hold much more powerful tools than we have today.

In Adaptive IE, a typical life-cycle is composed of scenario design, system adaptation, results validation and application delivery (23). The adaptation part, which is perhaps the most problematic of all the elements in the life cycle, can be further subdivided into four main tasks. These are:

• Adapting to the new domain information.

• Adapting to different sub-languages features.

• Adapting to different text genres.

• Adapting to different types.

The adaptation problem has been around for quite some time but was magnified in the last couple of decades, due to the fact that the IE movement has grown by exploiting and joining the recent trend towards a more empirical and text-based computational linguistics (105). The proliferation of the information superhighway is also pushing this problem to new heights, since document content is evolving from old text-based documents to a richer document model full of multimedia elements, where the sequential structure is disrupted by the use of hyperlinks. Therefore, the text-based methods alone, which were normally used, are not very capable of adapting to this new model.

The first approaches in the territory of AIE all made use of shallow techniques. The main reason is that it was not known how to train complex approaches using only knowledge of the domain and annotations. Since shallow approaches are less complex, they are also much simpler to port to new domains than deeper ones.

Also, deeper approaches rely heavily on domain-dependent data, therefore making the task of adapting them to new domains much more difficult. Although the shallow

approaches can extract a wide range of information, they are still not capable of handling more complex tasks in which different relations are involved. For these tasks, deeper approaches which make use of semantics are a necessity. Due to the fact that AIE can be easily ported to new domains, it can also be used to set up dynamic scenarios automatically. This will be very useful when harvesting knowledge from the web, since domain information may not be known beforehand or might be incomplete, and therefore an AIE system must be set up on the fly. Research in this area is still in its infancy and that is why we think there is the need for more work.

3.2 Previous Work in AIE

This section will give a brief overview of the different types of IE systems available. It will give a flavour of what we believe are the most relevant systems to our work. Before doing so, we will introduce some terminology highlighting the important differences between the existing systems.

3.2.1 Information Extraction Terminology

Shallow vs Deep approaches

IE systems fall into one of two major categories: those that make use of shallow NLP and those that use deep NLP techniques. Shallow approaches (see (77; 46; 32; 24; 102; 94; 19), etc.) are mainly concerned with the syntax of a document and make use of very little semantic information. They typically include tokenisation, part-of-speech tagging and parsing, and produce a regularised form which may be anything from a partially filled template to a full logical form. The deeper approaches (see (53; 72; 54; 50), etc.) also make use of the information obtained from the syntactic analysis, but they combine it with additional external knowledge of the domain such

as heuristics. They attempt to integrate the information from the individual sentence

representations with common world knowledge into a larger structure. Ultimately this

structure is either the final template or serves to provide the information for that

template after it is analysed using deeper techniques. These techniques generally make

use of logic to validate the semantics of the sentence and to deduce new information.

The deeper approaches can be considered as being a continuum of the shallower approaches.

Single vs Multi Slot

The different systems can learn either a single concept or multiple concepts. These

are normally referred to as single and multi slot. The reason why this terminology is

used is that, traditionally, the task of IE was to fill slots in templates. A single slot

handles simple atomic concepts without any relations between the concepts. Normally

the relations are handled at a higher level mainly by using an ontology. Also the

concepts in this kind of domain are unique and can be identified fairly easily. This

task is normally tackled quite easily by the shallow NLP approaches. In multi slots,

concepts are complex entities. They are normally made up of several sub-concepts

and a document may contain several similar concepts which are irrelevant for the IE

task at hand.

Batch vs Incremental Learners

Learning systems either learn in batch or in incremental mode. Most systems operate in batch mode. This means that the examples are collected and the system is trained on the whole training set. There are a number of reasons why this approach is preferred.

First of all, it is much simpler to invoke a training session on the whole set rather than on individual examples. Secondly, the amount of processing power required for every training phase is very expensive; therefore, invoking such a phase often would degrade the performance of the system. Third, in order to speed up the process, the examples are normally ordered in an efficient way, so that whenever paths in the search space are discarded, all the paths emerging from them are discarded too. If the other approach were used, this would be impossible to do. The incremental approach works by adding rules one by one and evaluating each time the effect of the rule. This has some obvious benefits: basically, rules are evaluated according to individual merit and not as part of some collection. Therefore, using this approach, the learner is able to evaluate the coverage of individual rules learnt, and using such an evaluation it can decide when to stop the training session. This obviously makes the task more complex and might take longer to converge, since in this approach rules that produce local maxima are removed in order to find global maxima.
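The contrast between the two regimes can be reduced to the following toy sketch, in which rules are plain substrings and coverage is simply the set of examples a rule matches; real IE learners are far richer, so this only illustrates the control flow:

```python
# Toy contrast between batch and incremental rule evaluation.
positives = ["at 3pm", "at 4pm", "at noon"]
candidate_rules = ["at ", "pm", "noon"]

def covers(rule, example):
    return rule in example

# Batch: score every candidate against the whole training set, then keep the
# useful rules in one go.
batch_scores = {r: sum(covers(r, e) for e in positives) for r in candidate_rules}
batch_ruleset = [r for r, s in sorted(batch_scores.items(), key=lambda x: -x[1]) if s > 0]

# Incremental: add rules one by one, keeping a rule only if it covers
# previously uncovered examples, and stop as soon as everything is covered.
incremental_ruleset, covered = [], set()
for rule in candidate_rules:
    gain = {e for e in positives if covers(rule, e)} - covered
    if gain:
        incremental_ruleset.append(rule)
        covered |= gain
    if len(covered) == len(positives):
        break                      # the learner can decide to stop early

print("batch:", batch_ruleset)
print("incremental:", incremental_ruleset)
```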

Interactive vs Non-Interactive Learners

Interactive learners make use of an Oracle in order to verify and validate their results.

These Oracles are small programs created beforehand which say whether a concept is part of the domain we are searching in or not. Interactive learners will be explained further in Chapter 5. Non-interactive learners do not use such programs and, normally, their only source of feedback is the training provided by the user. Both approaches have their strengths and weaknesses, but the non-interactive approach is more oriented towards AIE since it requires no additional settings apart from the information already supplied by the user.

Top-Down vs Bottom-Up learning

An issue to consider is whether the system starts learning from scratch or from a hypothesis. The former is referred to as a top-down approach. In it, the learner starts from a very generic rule and gradually adds cases to constrain the rule. It then evaluates the validity of the new rule. The cycle continues until a good rule is obtained. The latter (also referred to as theory revisers and bottom-up approaches) begins from an initial hypothesis which is revised as the learning proceeds.

This hypothesis normally starts with all the features available and systematically drops some of them (it relaxes the constraints of the rule). At every stage, the new hypothesis is tested and the one that gives the best results is retained.

Text types

A positive outcome of the Second World War was without doubt the rapid development of computers. These huge machines were capable of processing large amounts of text much more quickly than any human possibly could. In those days, text types were not an issue because the only kind of text available was free text. Free text contains no formatting or any other information. It is normally made up of grammatical sentences, and examples of free text range from news articles and quotes to stories. The text contains two types of features: syntactic and semantic features.

Syntactic features can be extracted from the document using tools like part-of-speech taggers, chunkers, parsers etc. Semantic information can be extracted by using semantic classes, named entity recognisers etc. In the 1960s, when computers started being used in businesses and information was being stored in databases, structured data became very common. This is similar

to free text, but the logical and formatting layout of the information is predefined according to some template. This kind of text is quite limiting in itself. The layout used is only understandable by the machine for which it was created (or other compatible ones). Other machines are not capable of making sense of it unless there is some sort of translator program. Syntactic tools are not very good at handling such texts, the reason being that most tools are trained on free text. Apart from this, the basic structures of free text (such as sentences, phrases etc.) do not necessarily exist in structured text. Structured text mainly has an entity of information as its atomic element. The entity has some sort of semantics according to its position in the structure. Humans, on the other hand, show an unprecedented skill in inducing the meaning most of the time. Yet, they still prefer to write using free text. Therefore, these two types co-exist in parallel.

With the creation of the internet, another text type gained popularity: semi-structured text. Unfortunately, it did so for the wrong reasons. Before the semi-structured text era, information and layout were generally two distinct objects. In order to enable the easy creation of rich document content on the internet, a new content description language was created. The layout features which until then had been hidden by the applications became accessible to the users. The new documents contained both layout and content in the same layer. This made new structural elements, such as tables and lists, accessible to the user. Now users could create documents having all the flexibility of free text with the possibility of using structures to make difficult concepts clear. Obviously, this complexity made it more difficult for automated systems to process such documents. NLP tools worked well on the free text part but produced unreliable results whenever they reached the structured part. Standard templates did not exist for the structured part because the number of possible combinations of the different structures is practically infinite.

Generalisation

Some other issues worth explaining at this stage have to do with the actual implementation strategies of adaptive algorithms. First of all, it is worth clarifying that these algorithms, although they are concerned with a traditional NLP task, are a hybrid between NLP and ML approaches. Therefore, the problems which must be faced when designing such algorithms are inherited from both fields. A critical issue in developing a learning algorithm is generalisation: basically, the issue of how to produce generic rules out of domain-specific data. There are a number of ways to do so, such as using regular expressions. In the systems below we will describe several approaches which accomplish this task.
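As a minimal example of generalisation with regular expressions (a deliberately simple stand-in for the richer, linguistically informed generalisation used by the systems below), a single annotated example can be turned into a rule that also matches unseen instances:

```python
# Tiny illustration of generalisation: a domain-specific example is turned
# into a more generic rule by replacing digits and letters with character
# classes.  Real AIE systems also generalise over linguistic features.
import re

def generalise(example: str) -> str:
    pattern = re.sub(r"\d", r"\\d", example)            # any digit
    pattern = re.sub(r"[A-Za-z]", "[A-Za-z]", pattern)  # any letter
    return pattern

seed = "3:30 pm"
rule = generalise(seed)
print(rule)                                  # \d:\d\d [A-Za-z][A-Za-z]
print(bool(re.fullmatch(rule, "4:15 am")))   # True: the rule now generalises
```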

Overfitting vs Underfitting

The issue of generalisation is quite important because, if it is tackled badly, the system may suffer from what is normally referred to as underfitting or overfitting. A learner that is not sufficiently complex can fail to fully detect the model in a complicated data set, leading to underfitting. On the other hand, a learner that is too complex may fit the noise, not just the model, leading to overfitting. Overfitting is especially dangerous because it can easily lead to predictions that are far beyond the range of the training data and can also produce wild predictions even with noise-free data. Underfitting produces excessive bias in the outputs, whereas overfitting produces excessive variance. But this phenomenon does not occur only due to the generalisation strategies

Most of the data normally available is incomplete or inconsistent, the effect of this is that the model learnt by the learner is partial and does not represent adequately the target concept. Therefore our algorithms must be capable of creating a model as representative as possible of the target concept using the noisy information available.

A number of approaches tackle this issue such as the use of linguistic information in order to perform generalization, the use of as much data as possible (which although desirable not always possible) or the use of cross-validation in order to cross check the expressibility of the rules, between subsets of the training data set.

3.2.2 Shallow Approaches

Wrapper Induction

With the rapid expansion of the Internet and before the advent of the Semantic Web, data was being encapsulated in documents containing lots of fancy layouts in order to make the data more appealing to the user. Although this made things easier from the user's point of view, it made it extremely difficult for the automated agents to extract information from such documents since they did not know how to handle the composition of layout and content. Since the layout information is normally embedded in the document, the traditional Natural Language Processing tools like

Part of Speech taggers, parsers etc. produce lots of erroneous results, because the documents are not just pure text but also contain layout elements. Because of this, IE was performed by manually constructing wrappers in order to define where the data is located. The wrappers model the data and layout surrounding the information which needs to be extracted, therefore providing enough cues for the software agent to

To overcome this problem, (78) developed Wrapper Induction (WI). In WI a wrapper is defined as a function from an information source to a tuple, and induction is the task of generalising from labelled examples to a hypothesis. What happens is that "Oracles" are constructed whose task is to recognise instances of particular attributes on a page. There are two types of Oracles, page and label Oracles. The difference is that the former generates example pages specific to particular information resources while the latter is composed of reusable domain independent heuristics.

These Oracles are rated according to whether they are perfect (accept all positive instances and reject all negative instances), incomplete (reject all negative instances but reject some positive instances), unsound (accept some positive instances but accept some negative instances) or unreliable (reject some positive instances and accept some negative instances). According to their ratings they are then composed together.

The result is a set of labelled pages which is then given as input to the algorithm that builds the wrappers. In this particular implementation, the items of information which need to be extracted are bound using a wrapper made up of a head and tail element which delimit the start and end of the information. Inside these two elements, there are several left and right elements that are used to locate the various instances of the pattern inside the bounds. This wrapper is called a Head Left Right Tail (HLRT) wrapper. The model is a very simple one and basically computes common prefixes and suffixes of sets of strings, which are highly constrained after very few examples.
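To make the HLRT idea concrete, the sketch below applies a hand-written HLRT wrapper to a toy page. The page content, the delimiters and the helper name are invented for illustration, and the loop is a simplification of the execution procedure used by the actual WI system.

    def exec_hlrt(page, h, t, delimiters):
        """Extract tuples from `page` using an HLRT wrapper.
        h, t       : head and tail strings bounding the region of interest
        delimiters : list of (l_k, r_k) pairs, one per attribute."""
        tuples = []
        pos = page.index(h) + len(h)           # skip everything before the head
        end = page.index(t, pos)               # the tail closes the region
        while True:
            first_l = page.find(delimiters[0][0], pos)
            if first_l == -1 or first_l >= end:
                break                          # no further tuples before the tail
            row = []
            for l, r in delimiters:            # locate each attribute in turn
                start = page.index(l, pos) + len(l)
                stop = page.index(r, start)
                row.append(page[start:stop])
                pos = stop + len(r)
            tuples.append(tuple(row))
        return tuples

    # Toy page: a country-code listing with an invented layout.
    page = "<html><b>Codes</b><p><i>Malta</i>: <tt>+356</tt><br><i>Italy</i>: <tt>+39</tt><hr></html>"
    print(exec_hlrt(page, h="<p>", t="<hr>", delimiters=[("<i>", "</i>"), ("<tt>", "</tt>")]))
    # -> [('Malta', '+356'), ('Italy', '+39')]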

This can be seen further in (77). In this model, Probably Approximately Correct (PAC) analysis is used in order to identify the number of examples required for a learner to be confident that its hypothesis is good enough. Therefore, whenever this level is reached, the algorithm stops learning. PAC analysis, as described in (86; 75; 67), basically states that the error of a hypothesis measures its closeness to the concept being searched for. So, if we have a class of concepts C defined over a set of instances X, our task is to find a learner L using hypothesis space H such that it manages to cover the concepts given a small error bound (ε) and a high confidence level (δ). These constraints must be met in polynomial time.
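As general background, for a finite hypothesis space H and a learner that outputs a hypothesis consistent with the training data, the standard PAC result bounds the number m of examples needed to guarantee true error at most ε with probability at least 1 − δ. The bound below is the textbook form found in standard machine learning references and is not specific to the wrapper classes discussed here:

\[
m \;\geq\; \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)
\]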

The WI algorithm, although quite robust, managed to produce wrappers that cover only 48% of the given data. Although this result is much better than producing all the wrappers by hand, it is still low. Another disadvantage is that to train the algorithm, Oracles must be created, which is not a very intuitive approach, especially with naïve users. Also, wrappers are limited to structured data, but most of the web is made up of semi-structured or free text, therefore constraining wrappers to limited domains. Finally, in (78) it is stated that because the PAC model is too weak, the number of examples needed for efficient learning is high. The consequence of this is that wrapper construction is bound to be an off-line process.

Boosted Wrapper Induction

An enhancement to the WI algorithm is Boosted Wrapper Induction (BWI) as described in (46). Since WI is restricted to highly structured domains, it is unsuitable for most conventional IE tasks. BWI uses a technique called boosting whose task is to boost the performance of weak classifiers (in this case the wrappers). First, weak rules are learnt based on simple contextual patterns. Since the patterns are applied to any kind of text, the amount of regularity is typically very low, making the rules high in precision but very low in recall. Therefore, to perform effective IE, several weak rules combined together are used in order to identify a single element. In contrast to WI, instead of having an HLRT wrapper we have a FAH (Fore After Histogram) wrapper. The fore detector identifies the starting boundaries, the after detector identifies the end boundaries and the histogram is a probability distribution over the field's length. At every stage, the results obtained by these three classifiers are combined and if the combined result surpasses a threshold, it is accepted as being positive. The three classifiers have equal probability of occurring, therefore making the result of their combination proportional to a naïve Bayesian estimate with a uniform prior (as specified in (86)). From the experiments, handling exact tokens was not generalising enough; therefore the classifiers were further enhanced to handle wildcards. Since both the Fore and After are quite weak detectors, their effectiveness is increased by using Boosting as described in (49).

This version of the algorithm is called AdaBoost, which is short for Adaptive Boosting. The boosting algorithm invokes the learner using a uniform distribution for all training examples, generates weak hypotheses and then progressively increases the weight of the misclassified examples and decreases the weight of the correctly classified examples.

Thus the learning tends to focus on the harder examples. The final hypothesis is a weighted majority vote of the weak hypotheses. The algorithm is fast and simple, without any need for parameter tuning, and it can also be applied to any learning algorithm. Its strength lies in the fact that it manages to identify examples which are hard to categorise, while its weakness is revealed when complex weak hypotheses or very weak hypotheses are used. In BWI, Boosting improves enormously the results obtained, although the number of cycles required to reach high performance was very sensitive to the difficulty of the task at hand. BWI was tested on several domains and in most tasks produced very good results. This algorithm is living proof of how weak classifiers such as wrappers can be boosted to obtain competitive results.
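A minimal sketch of the AdaBoost loop described above is given below for binary labels (+1/−1). The weak_learner interface, the data and the round count are placeholders and are not taken from the BWI implementation; the sketch only illustrates the reweighting step and the weighted majority vote.

    import math

    def adaboost(examples, labels, weak_learner, rounds):
        """examples: list of items; labels: list of +1/-1.
        weak_learner(examples, labels, weights) must return a function h(x) -> +1/-1."""
        n = len(examples)
        weights = [1.0 / n] * n                      # start from a uniform distribution
        ensemble = []                                # list of (alpha, hypothesis) pairs
        for _ in range(rounds):
            h = weak_learner(examples, labels, weights)
            err = sum(w for w, x, y in zip(weights, examples, labels) if h(x) != y)
            err = min(max(err, 1e-10), 1 - 1e-10)    # avoid division by zero / log(0)
            alpha = 0.5 * math.log((1 - err) / err)  # confidence of this weak hypothesis
            ensemble.append((alpha, h))
            # increase the weight of misclassified examples, decrease the others
            weights = [w * math.exp(-alpha * y * h(x))
                       for w, x, y in zip(weights, examples, labels)]
            total = sum(weights)
            weights = [w / total for w in weights]
        def classify(x):                             # weighted majority vote
            return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
        return classify

In the BWI setting the weak learner would return a fore- or aft-style boundary detector rather than a generic classifier, but the reweighting and voting scheme is the same.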

Another technique worth mentioning is Bagging (Bootstrap Aggregation) as described in (69; 13). This technique has some similarities to Boosting in that both generate a number of classifiers and use a voting system, but the similarities stop there. Bagging does not modify the distribution of the training examples but rather keeps on generating new hypotheses based on randomly chosen training examples. In (8) it is estimated that, whenever the number of training examples is large, each bootstrap sample contains around 63% of the unique instances. This means that there is a high probability of having duplicates in the training set and therefore the different classifiers will be trained on smaller sets, which can destabilise stable algorithms. The final results from the classifiers are obtained by aggregating in some way the results of the different sub-classifiers, normally by taking the average.
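For comparison, here is a sketch of bagging under the same assumptions (a placeholder base learner and data); each classifier is trained on a bootstrap sample drawn with replacement and the ensemble then votes.

    import random

    def bagging(examples, labels, base_learner, n_classifiers, seed=0):
        """base_learner(sample_x, sample_y) must return a function h(x) -> label."""
        rng = random.Random(seed)
        n = len(examples)
        classifiers = []
        for _ in range(n_classifiers):
            # bootstrap sample: n draws with replacement, so roughly 63% unique examples
            idx = [rng.randrange(n) for _ in range(n)]
            classifiers.append(base_learner([examples[i] for i in idx],
                                            [labels[i] for i in idx]))
        def classify(x):
            votes = [h(x) for h in classifiers]
            return max(set(votes), key=votes.count)   # simple majority vote
        return classify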

Both Bagging and Boosting offer substantial improvements to the learning algorithms, although Boosting generally tends to produce better results. Boosting however tends to be quite sensitive to noise. Both these techniques seem to have a common shortfall, i.e. they both create complex classifiers which may become incomprehensible to users and which do not seem to cover the domain.

SLIPPER

SLIPPER is another algorithm that uses boosting to improve its performance. Its origins can be traced back to the Reduced Error Pruning (REP) method, then Incremental REP (IREP) and eventually RIPPER (Repeated Incremental Pruning to Produce Error Reduction) in (32). The original REP works similarly to decision tree learning systems by first overfitting the training data and then pruning the complex tree. The pruning method chosen at each stage depends on which one reduces the error on the training set. This was later improved by IREP, the main difference being that the pruning is not performed after all rules are generated: every rule is pruned immediately when it is created. This avoids exploring parts of the search space that are unreachable, therefore making the algorithm more efficient and obviously faster in general. Even though these improvements were noticed, the algorithm still did not manage to get results better than the C4.5 algorithm (86). Because of this, the algorithm was enhanced further and RIPPER was created. It implements three main improvements over its predecessor. First of all, the metric used to guide pruning was altered because it was noticed that the original metric occasionally stopped the algorithm from converging. Second, when IREP was generating rules, it had a threshold heuristic which halted the rule production process if the error exceeded 50%. This made the algorithm sensitive to small problems; because of this, it used to stop generating rules much earlier than it should have done.

To solve this problem, instead of binding the stopping condition to the error rate, it was bound to a minimum description length (MDL) of the rules. The threshold value was chosen empirically from experiments. The final improvement was to introduce a re-optimisation at the end of the process (in a similar way to REP) over the whole rule set. RIPPER finally managed to outperform C4.5 in the majority of tests performed.

Not only were the results obtained better, but due to its efficient algorithm, the system in general converged faster than C4.5. A further improvement on RIPPER, which makes use of the same boosting technique that BWI uses, is SLIPPER (Simple Learner with Iterative Pruning to Produce Error Reduction). The algorithm works by repeatedly boosting a simple, greedy, rule learning algorithm. At every stage, examples are not removed from the training set but their weight is reduced. A voting mechanism is then used amongst the rules in order to rate the coverage of the elements in the training set. An interesting fact is that rules have the ability to abstain from casting a vote if the example is not covered by the rule. The learner generates the rules by splitting the training set into two, extracting a single rule using one subset and then pruning the rule using the other subset.

SLIPPER manages to achieve good results. Most of them must be attributed to the benefits obtained from boosting. Also SLIPPER manages to scale well to large noisy data sets even though it is less efficient than RIPPER.

Another algorithm which works on similar lines to WI is (LP)2 (Learning Patterns by Language Processing), which is described in (24; 23). This algorithm is an Adaptive IE algorithm; by adaptive it is meant that the algorithm is capable of being ported to new scenarios without the need of an IE expert to configure it. In fact, for the algorithm to work, all that is required is a simple scenario which describes the particular domain. It performs information extraction using linguistic analysis. This information is then used to generalise over the flat sequence of words, but the algorithm goes further and tries to utilise the structure of the document (e.g. HTML, XML, etc.) depending on the task at hand.

The (LP)2 algorithm induces symbolic rules learned from a tagged corpus and, depending on the implementation version, it could use either top-down or bottom-up generalisation. It mainly induces two types of rules: tagging and correction rules.

Tagging rules are made up of a sequence of words followed by an action to insert a tag. The word sequence is a window of w words to the left and right of the tag. For each positive example, the algorithm builds an initial rule. This rule is generalised by dropping some of the constraints (in bottom-up generalisation), introducing some wildcards and introducing some of the results obtained from shallow NLP tools (such as a morphological analyser, Part Of Speech [POS] tagger or user defined dictionaries).

The different possible combinations of rules are generated, pruning is used to remove unreliable rules or rules that cover instances already covered by existing rules, and a search is performed in order to determine which rule is the best. The best rule is determined by checking its error rate against a maximum threshold. Finally, the algorithm keeps the k best generalisations of the initial rule. The instances covered by the k best rules are no longer available for rule induction and the cycle continues until there are no more instances to cover. The rules obtained at this stage are high in precision but low in recall. (LP)2 tries to increase recall without affecting precision by recovering less good rules and trying to make them reliable by constraining them. These rules may not be good enough to identify a specific instance of a concept but may be enough to define the boundary of a particular instance. They are called contextual rules. Once these rules are learnt, additional rules are induced in order to correct mistakes. These rules do not insert new tags but rather correct any misplacement of tags. Finally, the tags produced are validated; this means that a check is performed to make sure that basic conventions are adhered to (such as every opening tag having a corresponding closing tag). This algorithm has the benefit of reducing training time, reducing corpus size and limiting both data sparseness and overfitting in the training phase. These properties are achieved due to the use of linguistic information. In fact, morphology allows data sparseness due to number/gender word realisation to be overcome, while POS tagging allows generalisation over lexical categories. Since more powerful and generic rules are created easily, the algorithm converges much more rapidly. (LP)2 implements a number of innovative and successful ideas which make it compare well with the best IE algorithms available today.
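Purely as an illustration of what a window-based tagging rule and its bottom-up generalisation might look like, the fragment below encodes one invented rule as data; the actual (LP)2 rule language and generalisation operators in (24; 23) differ in their details, and the words, POS tags and tag name here are made up for the example.

    # Illustrative only: a tagging rule as a window of token constraints around an
    # insertion point, plus the tag to insert.  None means "wildcard".
    initial_rule = {
        "left":  [{"word": "the"}, {"word": "seminar"}, {"word": "at"}],
        "right": [{"word": "4"},   {"word": "pm"}],
        "action": "insert <stime>",
    }

    # A bottom-up generalisation of the same rule: lexical constraints are dropped
    # or replaced by shallow-NLP categories so that it covers more instances.
    generalised_rule = {
        "left":  [None, {"pos": "NN"}, {"word": "at"}],
        "right": [{"kind": "DIGIT"}, None],
        "action": "insert <stime>",
    }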

CRYSTAL

CRYSTAL, as mentioned in (102), took a different approach from the algorithms already seen. Instead of working towards extracting information, its main task was to produce a conceptual dictionary. A conceptual dictionary is made up of a list of contextual nodes, and each node is basically a description of the local syntactic and semantic context in which relevant information is likely to be found. It tries to achieve inductive concept learning, which is similar to Inductive Logic Programming (40), by using a specific-to-general data driven approach in order to find the most specific generalisations that cover all positive examples and no negative instances.

The system starts with a set of tagged documents in which the concepts related to a particular semantic class are annotated. For each of the available instances, CRYSTAL induces a contextual node from the information available in the background. It then goes through all the definitions and compares highly similar examples. Since there may be many rules, especially in large corpora, the system makes use of indexes in order to locate similar contextual nodes quickly. The similarity metric used counts the number of relaxations necessary in order to unify the two structures. The rule which produces the best unification is kept and tested in order to check that it does not cover wrong instances. If the new generalisation is valid, all the rules that are covered by the new rule are deleted from the database and only the new rule is kept.

CRYSTAL is one of the first systems that tries to achieve the sub-task of information extraction automatically. The system also supplies some tradeoff parameters which can be adjusted in order to tune the precision and recall required. A limitation of this approach is that the examples generated are not generic enough and most probably, will only match the sentence they were learnt on or very similar ones. The major contribution of this system was that it was one of the first systems whereby the user was not required to know anything new apart from information about the domain in which he was an expert.

AUTOSLOG-TS

AutoSlog-TS (94) was a major research contribution toward making information extraction systems more easily portable to new domains. It is the first system that can generate extraction patterns automatically using only raw text as input; other systems require annotated training texts or other forms of specialised training data.

Tests showed that its extraction patterns achieved performance comparable to the hand crafted patterns of its predecessor AutoSlog (93).

AutoSlog was a dictionary construction system that created extraction patterns using heuristics. In this system the phrases to be extracted had to be tagged in some way. The rules which were generated were divided into three categories, based on the syntactic class of the noun phrase being examined. Also, after a rule was created, it had to be reviewed by human experts in order to remove any errors introduced by the part of speech tagger.

AutoSlog-TS is basically an extension to AutoSlog. It operates by creating an extraction pattern for every noun phrase in the training set. The patterns are extracted by using some domain independent heuristics. Each extracted pattern is then evaluated by going through the training set a second time and collecting statistics about that pattern. The statistics are used to calculate a probability based on whether the text is relevant and on which rules are fired. This measure is referred to as a pattern relevance rate. The patterns are then automatically ranked using a formula that promotes high relevance or high frequency.

In this system, the extraction patterns managed to produce significantly fewer spurious extractions and better text categorisation results than those generated from the hand-crafted patterns on one of the categories. Using AutoSlog-TS, the extraction-based text categorisation algorithms can be trained for new categories much more easily than before. This is the first algorithm aimed at building semantic lexicons from raw text without using any additional domain knowledge. A user only needs to supply a representative text corpus and a small set of seed words for each target category. The experiments showed that a core semantic lexicon can be built for each category with only 10 to 15 minutes of human interaction. The experiments also showed that extraction-based categorisation can perform well on fine-grained categories and can be used effectively to classify texts with respect to subclasses within the context of a single domain.

Inductive Logic Programming

A different approach adopted in Information Extraction makes use of a technique known as Inductive Logic Programming (ILP). This technique, as defined in (86; 35), has its foundations as a cross between logic programming and machine learning. It is concerned with the development of methods and tools for the automation of reasoning and makes use of a set of algorithms in domains which have a rich relational structure.

In these domains instance attributes are not isolated, but are related to each other logically. Relational learners are basically rule learners which derive features on-the-fly and which can in theory logically derive new features from existing ones.

One of the first such systems, and one whose basis lies at the heart of many modern ILP systems, is FOIL as described in (92). FOIL learns how to construct a logic program that defines a relation in terms of itself and other background relations. Since it works with first order logic, it uses tuples that belong (or do not belong) to the target relation. It starts with a general clause and progressively adds literals until all negative examples are excluded. The algorithm makes use of a divide and conquer approach, whereby the covered examples in the target space are separated and the algorithm directs its attention to the remaining concepts. At every stage, one or several clauses are created, made up of literals. The literals can be either gainful or determinate. The former are evaluated using an information gain heuristic that helps to reduce the search space by eliminating unreliable information. The latter, on the other hand, do not help to reduce the search but rather add new variables required for the final clause. During the search, a greedy approach is taken; but this has the potential to explode the size of the search space. In order to avoid this, checkpoints are established along the paths being searched, and if the current path does not lead to an improved condition, the search is halted and continues not from the last node in the search tree but from the last checkpoint, therefore reducing the amount of searching spent on less valuable paths.

There are a number of systems related to FOIL. One such system worth mentioning is mFoil (40). This particular implementation improves over FOIL by implementing several techniques to handle noisy data. It also replaces the greedy search found in FOIL with a more efficient beam search, therefore increasing the chances of finding a good clause. This system and many other similar ones form the backbone of modern ILP approaches.

RAPIER

The Robust Automated Production of Information Extraction Rules (RAPIER) is one of the modern ILP systems used for IE. It was built around 1997 (19) and it incorporates techniques from several ILP methods. RAPIER was designed mainly to cater for semi-structured text and it expects as input a pair of documents at a time. One document contains the data while the other contains filled templates. The filled templates hold a list of information to be extracted from the document. The learner's task was then to find in the data the elements specified by the filled templates and learn them using its algorithm. The user was not involved in the process apart from supplying the pair of documents for training.

The algorithm uses a bottom-up approach, starting from a specific rule that matches a target concept in the training set. The rule is made up of an unbound pattern composed of constraints on the words and part-of-speech tags surrounding the filler. Once the rules are collected, a cycle begins where pairs of rules are randomly chosen and a beam search is performed to identify the best least general generalisation that covers both of them. At this stage, constraints are systematically added and the rule is evaluated. This iteration continues until no progress is recorded.

The rules of RAPIER are based on both content and delimiting information. The patterns exploit syntactic and semantic information in order to obtain better generalisations. For syntactic information, a tagger is used to obtain the Part of Speech of the words found in the document, while for semantic information a lexicon of semantic classes is used. The pattern is nothing more than a sequence of items which match either one word or a list of words. Every rule is indexed by a template and slot name, and consists of a pre-filler, filler and post-filler. The current approach handles only single slots, although Soderland in (101) proposes possible alternative strategies in order to extend the current paradigm to perform multi-slot extractions. The idea is to divide the text into more than just three fields.
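As an illustration of the pre-filler/filler/post-filler structure, the fragment below writes out one invented rule for a hypothetical "city" slot; the real RAPIER rule syntax and constraint types in (19) are richer than this sketch.

    # Illustrative only: a single-slot rule for a hypothetical "city" slot.
    rapier_style_rule = {
        "template": "job_posting", "slot": "city",
        "pre_filler":  [{"word": "located"}, {"word": "in"}],           # pattern before the filler
        "filler":      [{"pos": "NNP", "semantic_class": "location"}],  # the text to extract
        "post_filler": [{"word": ","}],                                 # pattern after the filler
    }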

SRV

Sequence Rules with Validation (SRV) as described in (45) is a general-purpose top-down relational learner for IE. The system works by taking as input a document containing markup information around the data to extract. It makes no assumption about the information available for use and the document is treated as a token-oriented feature set. The features range from token length, type, orthography, POS and lexical meanings to other relational structures. These can be divided into two classes, simple and relational. A simple feature is a function that matches one token to some discrete value. A relational feature provides a mapping between a token and another token, such as adjacency. The information is extracted at the token level because multi-term text fragments are difficult to describe in terms of simple features.

The output of this process is a set of rules which are used to extract other similar concepts. This algorithm looks at IE from a classification point of view and considers the document as being made up of several mini-documents, where every candidate instance is presented to a classifier and rated according to its relevance, which is a confidence level that shows how probable it is that the phrase is part of the target concept. The search is performed in a similar way to how FOIL does it in [QUI95]. It basically goes through a test space made up of all the textual fragments from the test document. The extraction is performed by examining the fragments of appropriate size in order to match them with the available rules. The learning process is once again similar to FOIL, i.e. the data space is divided into two pools containing negative and positive examples. The positive examples are those elements tagged beforehand while the negative examples are the remaining elements. Induction is performed and only stops whenever a rule does not manage to cover more elements than the previous rule. At this stage, the rule is considered to be good, all the examples covered by the rule are removed from the training set, and the process goes on until all examples are covered. This algorithm has a number of strengths. First of all, no prior syntactic analysis is required for it to work. It also makes use of a very expressive representation using orthographic features and linguistic information when available. A limitation of this algorithm is that it was designed to extract single slots.

WHISK

The WHISK system presented in (101) is a covering algorithm which performs top-down induction of rules. It is based upon ILP principles but differs from the other systems mentioned so far in that it is capable of handling any kind of text. It can extract information from structured pages, from semi-structured texts like web pages and from free text. When it comes to free text, WHISK can also make use of a syntactic analyser and semantic tagger to increase its performance.

The WHISK learner works in a supervised mode with phases that alternate between tagging and training. The learner starts from a set of untagged instances and an empty training set. In each iteration, the system selects and presents to the user a set of instances for annotation. The definition of an instance is variable in this context. It depends mainly on the kind of text being analysed. If the text is semi-structured, having some sort of tags, heuristics (mainly based on HTML tags) are used in order to divide the text according to the position of specific tags, while if the program is handling free text a sentence splitter is used to divide the document into sentences.

The user's task is then simply to mark the region of text where the information for extraction is situated. When the user is ready, the batch is sent to the learner in order to induce a set of rules from the newly expanded training set. Since WHISK is a top-down learning algorithm, it starts the induction by finding the most generic rule that covers a seed concept. Terms are then gradually added until the error is reduced to zero or until a pre-pruning criterion is satisfied. The iteration finishes when the generated rules manage to cover all the positive examples in the training set. Some post-pruning is also performed in order to remove some of the rules that overfit the data. The main limitation of WHISK is that its feature set is not very expressive and therefore its generalisations are less effective than those of other learners. The main advantage which it clearly has over the others is that it can extract multi-slot information, but this comes with the added cost of requiring more examples than other learners.

SNoW-IE

SNoW-IE is the last ILP based system this subsection will look at. It is described in (97) and is based on a new paradigm for relational learning which promises to be more flexible and allows the use of any propositional algorithm, including probabilistic ones.

The problem with such systems has always been to learn definitions of the relations or concepts of interest in terms of given relations. Although there were systems capable of doing this, they were not scalable enough to large domains and consequently might not generalise well. Research in ILP suggests that, unless the rule representation is severely restricted, the learning problem is intractable. The main problem with such systems is that most of them have problem specific functions, therefore restricting them to specific domains. The SNoW-IE system makes use of efficient general purpose propositional algorithms. At the heart of the system there is a knowledge representation language that allows an efficient representation and evaluation of rich relational structures. The language is a subset of First Order Logic and works as a collection of graphs defined over elements in the domain. Restrictions are imposed by limiting the expressibility of the language to a collection of formulae that can be evaluated efficiently. Objects are also given types in order to classify objects in the language according to their properties. To do this, the language provides Relation Generation Functions (RGFs), as described in (33). These functions generate more expressive formulae that model the structures in a domain. In this proposed calculus a formula is only formed when all its parameters are available and therefore need not be tested in order to see if it can fire.

The main difference with respect to previous ILP approaches is that the search space is structured in a different way: instead of generating the features as part of a search procedure, they are obtained in a data driven way using the RGFs. Knowledge is incorporated in the system by using sensors. These programs have the ability to treat information from the input, external sources or previous information in a uniform way. The data at the back of the system is expressive enough to be used by any system, therefore making the paradigm not bound to any specific learning algorithm.

To test the paradigm, SNoW (Sparse Network of Winnows) (96) was used. The learning architecture is a multi-class classifier that is specifically tailored for large scale learning tasks and for domains in which the potential number of features taking part in decisions is very large, but may be unknown a priori. It learns a sparse network of linear functions in which the target concepts (class labels) are represented as linear functions over a common feature space. The system offers a feature efficient learning algorithm, in that it scales linearly with the number of relevant features, and linearly with the number of features active in the domain.

The improved system called SNoW-IE first starts by filtering the negative examples. This part can be seen as a classifier designed to obtain high recall. The examples are removed if they do not contain a feature which should be active in a positive example or if a confidence level is below some threshold. The second part is aimed towards increasing precision and is called the classification stage. For this stage, the candidate fragments are enhanced using a new collection of RGFs, then a classifier for each concept is trained. Training results show that the system outperformed most rule based systems. The system works in a similar way to existing ILP techniques but instead of representing the rules as a conjunction, they are represented as a linear function.

Maximum Entropy Classifier

An application of a Maximum Entropy Classifier to IE is described in (22). This approach is similar to SRV in that it too looks at the problem as a classification problem. The Maximum Entropy (ME) principle is very old and simple. Basically it models all that is known and assumes nothing about what is unknown. In other words, given a collection of facts, choose a model which is consistent with all the facts, but otherwise as uniform as possible (9). Also, the ME principle takes into account features that are not independent, therefore making it suitable for multi-slot extraction.

Every candidate for a slot filler can be one of five types, i.e. a begin slot, a continue slot, an end slot, a unique slot or not-a-slot. To perform extraction, several groups of features were defined. These groups include unigrams, bigrams, zones (delimited by tags found in the document), capital letter words, headings, times, names (according to a gazetteer) and new words (not found in a dictionary). The features are associated with the words in the document and a probability is assigned to every feature using the ME classifier. The Viterbi algorithm is then used to select the sequence of word classes with the highest probability. This approach is typically used to perform single-slot extraction.
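A minimal sketch of the decoding step is shown below. It assumes that some classifier (here replaced by hard-coded toy probabilities, not a real maximum entropy model) has already assigned each token a probability for each of the five classes; Viterbi search then selects the most probable class sequence subject to simple well-formedness constraints. The transition constraints and the example numbers are assumptions made purely for the illustration.

    import math

    CLASSES = ["begin", "continue", "end", "unique", "not-a-slot"]
    # Which class may follow which (a simplification of the well-formedness rules).
    ALLOWED = {
        "begin":      {"continue", "end"},
        "continue":   {"continue", "end"},
        "end":        {"begin", "unique", "not-a-slot"},
        "unique":     {"begin", "unique", "not-a-slot"},
        "not-a-slot": {"begin", "unique", "not-a-slot"},
    }
    START = {"begin", "unique", "not-a-slot"}

    def viterbi(token_probs):
        """token_probs: list of dicts mapping class -> P(class | token).
        Returns the most probable admissible class sequence."""
        best = {c: (math.log(token_probs[0][c]), [c]) for c in START}
        for probs in token_probs[1:]:
            new_best = {}
            for c in CLASSES:
                candidates = [(lp + math.log(probs[c]), path + [c])
                              for prev, (lp, path) in best.items() if c in ALLOWED[prev]]
                if candidates:
                    new_best[c] = max(candidates)
            best = new_best
        return max(best.values())[1]

    # Toy probabilities for the tokens "seminar at 4 : 30 pm" (invented numbers).
    tokens = [
        {"begin": .05, "continue": .05, "end": .05, "unique": .05, "not-a-slot": .80},  # seminar
        {"begin": .05, "continue": .05, "end": .05, "unique": .05, "not-a-slot": .80},  # at
        {"begin": .60, "continue": .10, "end": .10, "unique": .10, "not-a-slot": .10},  # 4
        {"begin": .05, "continue": .60, "end": .20, "unique": .05, "not-a-slot": .10},  # :
        {"begin": .05, "continue": .30, "end": .50, "unique": .05, "not-a-slot": .10},  # 30
        {"begin": .05, "continue": .10, "end": .60, "unique": .05, "not-a-slot": .20},  # pm
    ]
    print(viterbi(tokens))
    # -> ['not-a-slot', 'not-a-slot', 'begin', 'continue', 'continue', 'end']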

For multi-slot extraction, some enhancements were added. First of all, the texts were passed to a text-filtering module whose task was to remove irrelevant documents.

The documents were then semantically tagged so that information such as people's names, organisations, etc. is found and tagged in the document. This phase was called candidate selection. The third step is relation classification: basically, the task of finding relations between entities. Finally, a template building procedure is used. It takes a graph of relations between entities as input and fills a template using the relations in the graph.

This approach is capable of handling both single and multi slots for both semi-structured and free text. The results obtained were superior in most cases to any algorithm available. The limitations of this approach have mainly to do with how the features were selected. Some of the features defined were dependent on the domain being analysed, yet this does not diminish in any way the potential of this algorithm because domain independent features can still be produced, even though the success of the algorithm will probably be lower.

HMM

A Hidden Markov Model (HMM) is not the name of an IE algorithm. It is a mathematical model used in various areas of computer science. In (48) an HMM was used to locate textual sub-segments in a document that answer a particular information need.

To build these models, a statistical technique called shrinkage was used to improve the estimation of the parameters of the model.

In this implementation HMMs are used as a probabilistic generative model of the entire document. The documents were not fragmented, due to the fact that this could cause a huge space of fragments. The model transition and emission probabilities were learnt from the training data, and the Viterbi algorithm was used to find the most probable sequence of states. Separate HMMs were built to extract different fields, with each HMM having two states:

• Target state containing the tokens to extract

• Background state containing the remaining elements

The HMM is not fully connected (in a fully-connected model, each state is just one step away from every other state), the reason being to capture textual context, therefore improving efficiency. The state transition topologies are set manually and not learnt.

Two topological parameters which were tested are the window size (the number of prefix and suffix states) and the target path count (the number of parallel target paths of different lengths). In the model only target states can emit target tokens, and vice-versa for non-target states, and only a single unambiguous path is possible through any topology. In order to improve the parameter estimation, Laplace smoothing and absolute discounting were tested. Both techniques make use of only the training data available.
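For reference, Laplace (add-one) smoothing of an emission estimate has the form below, where c(w, s) is the number of times word w is emitted from state s in the training data and |V| is the vocabulary size; absolute discounting instead subtracts a small constant from every observed count and redistributes the freed mass over unseen words. The notation here is generic and is not taken from (48):

\[
\hat{P}(w \mid s) \;=\; \frac{c(w, s) + 1}{\sum_{w'} c(w', s) + |V|}
\]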

Shrinkage is a statistical technique used in speech recognition systems. It makes use of weighted averages and learns using the Expectation Maximisation algorithm. Basically, it shrinks parameter estimates from data-sparse states of complex models towards estimates in related data-rich states of simpler models. In some conditions, this technique was shown to produce optimal results.
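In its simplest form, the shrinkage-based estimate for a state is a weighted mixture of the estimates taken at each level of its hierarchy, with the mixture weights λ_j (one per ancestor, summing to one) learnt by Expectation Maximisation on held-out data:

\[
\hat{P}_{\mathrm{shrunk}}(w \mid s) \;=\; \sum_{j} \lambda_j \, \hat{P}_j(w \mid s), \qquad \sum_j \lambda_j = 1
\]

where P̂_j is the estimate at the j-th ancestor of state s, ranging from the state itself up to the uniform distribution.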

To perform IE, the system was defined in terms of a hierarchy, i.e. subsets of states were identified that have similar distributions (e.g. all prefixes are expected to have the same distribution). This was done to represent a similarity between parameter estimates. At the top of every hierarchy is the uniform distribution (which gives all words equal probability). States were further divided into four classes, namely Non-target, Target, Prefix and Suffix. There were also four shrinkage configurations:

• None (use only absolute discounting)

• Uniform distribution

• Global (shrinkage is performed towards a common parent and eventually the uniform distribution)

• Hierarchical (the target class is shrunk in a similar way to the Global configuration, while the other classes are shrunk towards a separate parent. Prefix and Suffix are shrunk towards a "context" grandparent. In turn every state is eventually shrunk towards a common ancestor and eventually the uniform distribution)

The EM technique is finally used to re-estimate the parameters and the cycle continues.

Experiments using this system showed that as the size of the window increases, precision decreases. The more complex model fractures the training data, making it difficult to obtain parameter estimates. Shrinkage improved the model for large window sizes. It managed to reduce the error of parameter estimation by 40% and allowed the learner to learn more robust HMM emission probabilities using limited training data. It was also shown that the most consistent patterns were those closest to the target concept, because the local weight of states declined with increasing distance and when related to field difficulty.

ESSENCE

ESSENCE (20; 21) is one of a different breed of IE systems that promise to build IE patterns automatically based on an un-tagged corpus of relevant documents. This task is achieved by using a generalisation algorithm which delays as much as possible the involvement of the user in the system. This delay occurs because the system will gather as many patterns as possible and the user is then asked to go through the potential patterns and not through the whole corpus.

The ESSENCE methodology makes use of a general purpose lexicon. In the current implementation WordNet (84) is used to provide lexical, syntactical and semantical information. Since WordNet is not an exhaustive list of world concepts, additional gazetteers and domain specific word lists are used to complement WordNet. The first step in the methodology is the Task Definition module. The user is asked to enter a number of keywords that commonly appear next to the target concept. For each keyword, the user is also asked to select the synset number as defined in WordNet. With this information, the system calls the Selection of Relevant Texts module, whose task is to select from the corpus a set of texts based on the user's previous input. Noun phrases, verb phrases and prepositional phrases are then identified because they will form the basis of the generated patterns.

The sentences obtained are then filtered according to a relevance judgement based on the number of keywords found in them. Out of the relevant sentences a window is selected which will be used for analysis. The nouns and verbs are then semantically tagged using WordNet. At this stage, the patterns are learnt by the ESSENCE Learning Algorithm (ELA) and the resulting patterns are filtered in order to remove very specific patterns. The ELA works similarly to other covering algorithms. The main difference is that the granularity of elements is much bigger, working at the noun, verb and preposition group level. Also, the generalisation is performed by relaxing the semantic tags associated with groups of patterns.

Although the results obtained by this system were quite respectable, the true achievements are twofold. First of all the system manages to obtain such results using a higher granularity than the one used by other similar systems. Secondly, it does so without requiring lots of human intervention. Basically annotation is not required, only a number of seed words are needed to start the IE process.

LIEP

The LIEP (Learning Information Extraction Patterns) system (71) is an IE system which aims primarily towards the effective acquisition of patterns. It learns the extraction patterns using an On Demand Information Extractor (ODIE).

The ODIE first tokenises the text given as input. It then goes through all the sentences, a sentence at a time, and checks whether some of the keywords (which indicate the possibility of an event of interest) appear in the current sentence. Those sentences which have no keywords are discarded while the others are kept for further analysis. The retained sentences are tagged with a part of speech tagger and a pattern matcher is run over the sentences in order to identify entities and other information (like prepositions, noun groups and verb groups). The system then tries to identify events by using IE patterns that match syntactic constituents and other properties.

At no stage does the system attempt to perform a full parse; instead, simple syntactic relations are checked by using simple rules to verify the validity of sub-constructs found in a sentence. This technique is called on-demand parsing. The advantage of such an approach is that the burdens associated with a full parse are avoided, but this obviously means that the text is not as adequately constrained as in a full parse, and this may lead to over-generalising some of the rules.

The actual patterns are learnt by going through the different elements found in the system and trying to identify relations between them. In the case that several paths exist between the elements, all the paths are generated. The patterns are then tested and the one with the highest F-measure is selected. The selected patterns, although high in precision, would have a low recall since they are too specific to the text in the given document. Because of this, the patterns are generalised further.
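The F-measure used to rank the candidate patterns (and reported later in Table 3.1) is, in its balanced form, the harmonic mean of precision and recall:

\[
F_1 \;=\; \frac{2 \times \mathit{Precision} \times \mathit{Recall}}{\mathit{Precision} + \mathit{Recall}}
\]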

LIEP in general compared fairly well with IE tools developed during the same time. Its generalisation was not optimised and its main source of input was only from positive examples. Negative examples were only used for testing purposes and no additional use was made out of them.

CICERO

The CICERO system (66) is an IE engine based upon the effective utilisation of WordNet (84). The idea is to mine WordNet in order to discover concepts and lexico-semantic relations relevant to the current domain. To check the validity of these relations, tests are performed using the domain dependent documents that are available.

The methodology implemented in CICERO first involves creating a semantic space that models the specific domain via WordNet. The semantic space is in the form of a triplet made up of concepts, connections and contexts. A concept list is made up of words or groups of words all referring to the same concept. Connections can be of three types, i.e. thematic, subsumption and contextual. Thematic connections refer to elements related by lexico-semantic or morphological relations. Subsumption connections refer mainly to Is-A relations found in WordNet. Contextual connections define possible relations between elements in the same context. The final element in the triplet is a context element. The context element models the different ways and restrictions in which different words and objects can be arranged in order to express a concept. This is done due to the nature of natural language, whereby a concept can be expressed in a multitude of different ways.

This space restricts enormously the searches required in WordNet when modelling the domain. The documents being analysed are then scanned to identify and link the domain concepts which are defined in the semantic space. The patterns obtained are finally classified against the WordNet hierarchy and only the most general linguistic domain patterns are retained. For CICERO to work, several algorithms and a number of heuristics were devised in order to make effective use of WordNet. WordNet can be seen as the strength and weakness of this system. Whilst it provides that syntactic and semantic knowledge which is so difficult to obtain, its topology makes it difficult for users to retrieve such information.

                  Speaker   Location   Start Time   End Time   All Slots
    (LP)2          77.6      75.0        99.0        95.5        86.0
    BWI            67.7      76.7        99.6        93.9        83.9
    HMM            76.6      78.6        98.5        62.1        82.0
    SRV            56.3      72.3        98.5        77.9        77.1
    Rapier         53.0      72.7        93.4        96.2        77.3
    Whisk          18.3      66.4        92.6        86.0        64.9
    SNoW           73.8      75.2        99.6        96.3        85.3
    Max Entropy    72.6      82.6        99.6        94.2        86.9

Table 3.1: F-measure on CMU seminar announcements.

Discussion

It is difficult to compare the different algorithms without examining how they perform when confronted with a common task. Luckily, experiments on the CMU seminar announcements collection (downloadable from http://www-2.cs.cmu.edu/~dayne/SeminarAnnouncements/) were performed with most of the algorithms mentioned in Subsection 3.2.2, and they were reported in (24) and (22). The CMU collection consists of 485 seminar announcements and the task involves identifying the speaker, the location of the seminar, the start time and the end time. The documents contain 485 instances of start time, 464 locations, 409 speakers and 228 end times. The results of the different algorithms can be seen in Table 3.1.

The outcome of these experiments is very interesting. We will mainly compare the algorithms which produced the most promising results. First of all it can be noted that (LP)2 and BWI produce comparable results, even though in the final score (LP)2 outperforms BWI. The reason for this is that both algorithms are based upon the same concept of creating wrappers. (LP)2's results are better in two particular instances, speaker and end time. The reason is that a speaker can have a very variable context; therefore, the linguistic features used by (LP)2 are capable of modelling this variable context better than BWI. When it comes to the end time concept, from the distribution of the concepts in the documents it can be noticed that it is the concept with the fewest instances. In (LP)2 it was noted that a positive side effect which linguistic features add to the learning of the algorithm is that they tend to reduce spurious data, and this is the reason why the algorithm manages to get a higher score for end time. BWI would require more examples in order to improve its score. The results of BWI are better when it comes to start time since the collection contains many examples of that concept and therefore the algorithm is capable of boosting the wrappers in order to model the data more effectively.

The two other algorithms that performed well in the tests were SNoW and Maximum Entropy (ME). The two are based on different genres of algorithm. SNoW is based on a new form of ILP which is much more generic than common approaches.

This form of relational learning produced good results when it comes to start time and end time. These two concepts are very regular, but an interesting fact is that SNoW manages to get the best result for end time. This means that this form of relational learning is capable of learning regular patterns using very few examples and therefore overcoming the problem of spurious data. The ME method too performs well on the start time, but the most surprising result is that it manages to get the best result for the location concept. It manages to outperform the next best algorithm by about 4%. This can be attributed to how ME works. Basically it models the domain using the positive examples only and makes no assumptions on the remaining parts of the document. This domain model is more representative than any model produced by the other algorithms and, in fact, the best result for this concept after ME is produced by the HMM algorithm, which is based on a similar idea of domain modelling.

3.2.3 Deeper Approaches

LaSIE

The LaSIE (Large Scale Information Extraction) system is one of the few systems that make use of deeper approaches to perform IE. The initial version of the system, called LaSIE-I (53), is an integrated system that performs lexical, syntactical and semantical analysis to build a single, rich representation of the text which is then used to perform several IE tasks.

The lexical analysis is very similar to the analysis performed in shallower approaches. The text file given as input is passed through a tokenizer, sentence-splitter, part-of-speech tagger, morphological analyser and pattern matcher. The pattern matching works by checking the input against a number of lexicons that contain lists of organisations, company designators, people's titles, human names, currency units, locations and time expressions. If an element in the document matches one of the elements in the lists, a category is assigned to that element equivalent to the name of the list in which it was found. Apart from these lists there is also a hand crafted list of trigger words used to give an indication of the possible class of the surrounding tokens.

The syntactical analysis takes as input the data obtained from the previous stage and parses it using a simple Prolog parser. The parsing occurs in two passes. In the first pass a Named Entity Grammar is used in order to locate domain specific named entities. In the second pass Sentence Grammar Rules (extracted from the Penn TreeBank-II (83)) are used to extract a semantic representation from the sentences.

The semantic analysis first starts by passing the output from the previous stages to a discourse interpretation module that builds an ontology using the XI knowledge representation language (51). The ontology also has attributes associated with its nodes, therefore making the final result a world model. Apart from attributes, the ontology may also have inference rules associated with its nodes. While the ontology is being constructed, a co-reference resolution module tries to resolve any references in the text. This module has a number of heuristics which try to resolve the various instances found in the text.

The results are obtained from the system by simply going through the world model and gathering the required information. Since the MUC-6 results of LaSIE-I were quite promising, the team behind it created an improved version called LaSIE-II (72) and submitted it to MUC-7. The main difference from the earlier version is that all the modules are organised and executed as a pipeline inside the GATE architecture (http://gate.ac.uk/). Other minor differences include:

• enlarging the lexicons available with new data.

• representing sentences in Quasi-Logical Form (QLF) (52; 5), meaning that they are expressed as conjunctions of first order logical terms.

• adding presuppositions to the data such as additional inferences, additional hypotheses, word sense disambiguation, person role classification and partial parse extension.

• adding object co-reference with extra features like long distance co-reference, copular constructions, cataphora, coordinated NPs, header co-reference and generic nouns.

• adding consequence rules used mainly to fill slots in the results.

• performing event co-reference in order to merge events with additional information found elsewhere in the text.

LaSIE-II does not really differ much from its predecessor in terms of the approach adopted. Even so, the small refinements in the system managed to produce improved results on a much harder test.

LOLITA

The LOLITA (Large-scale, Object-based, Linguistic Interactor, Translator and Analyser) system (88; 54) is in reality a general purpose NLP system, although for the MUC competitions it was used as an IE system. It is mainly composed of two parts: the first part converts the text to a logical representation of its meaning while the second part generates the results.

At the heart of the whole system lies a semantic network. All the analysis tasks are organised in a pipeline and the several output results obtained from the different stages are used to update the information in the semantic network. The network is used to hold different kinds of information such as conceptual hierarchies, lexical information, prototypical events (that define restrictions on objects) and other general events. The rest of the information found in the network is obtained directly from WordNet (84).

The main process first passes the output of an SGML module to a Text Pre-processing module. The module locates structural information in the document such as reported speech, paragraphs, sentences and words. The output is then passed over to a Morphology module whose first task is to normalise the input, for example by expanding abbreviations, removing hyphens and expanding monetary and numeric contractions. Only the root of a word is retained and, to avoid losing information such as number, case and person, this information is stored in a feature structure. After this module, the information is passed to a parser made up of the following five submodules:

Pre-parser: used to identify and provide structure for monetary expressions.

Parser: used to parse whole sentences. The result of this is a parse forest (made up of all possible parses).

Decoding of the parse forest: utilises the previously generated forest and makes use of a number of heuristics to select the best trees.

Best parse tree selection: takes as input the decoded forest and selects the lowest costing tree.

Normalisation of the syntax tree: done to reduce the number of different cases possible. This normalisation does not alter in any way the semantics of the sentence.

Once the syntactic analysis is completed, the system performs semantic and pragmatic analysis. The semantic analysis is performed by going through the trees and building up meaning based upon each tree's subtrees. It makes use of the composition principle that the meaning of a sentence is made up from the meanings of the words contained inside it. The pragmatic stage is used to disambiguate and type check the data. Once an event is disambiguated, the system attempts to create connections between it and the previous events. While the document is being processed, if the system identifies an anaphoric expression it looks for possible referents. If it manages to find such references, it unifies the information it possesses with them. If no references are found, a new instance is created in a reference database called a Context buffer.

The results obtained by the LOLITA system were not great and this was due to a number of limitations of the system. The system requires a successful parse in order to analyse the text. Such a parse was not always possible even though error recovery routines were implemented. Another limitation is that the data required by the system is much larger than the amount provided for testing purposes. Because of this, the results obtained were much lower than the ones expected by the LOLITA team.

Machine Learning Based Text Understanding System

The Machine Learning Based Text Understanding System (103) is a system designed for very deep textual understanding. The domain for which it was developed is the medical domain, in which it is vital to have full understanding since a patient's life may be at stake. The system accepts a textual description as input. This is passed to a structural analyser whose task is to make use of a maximum entropy classifier to determine the sentence boundaries. The sentences are then passed to a lexical analyser. It makes use of a handcrafted domain specific lexicon to assign syntactic and semantic features to specific words. A syntactic parser is then used to create a dependency graph linking each word to the word it modifies. The probability of every arc is obtained from a statistical parser trained upon hand-tagged training sentences. After this stage, a semantic interpreter (built upon handcrafted rules) is used to go through the data and search for semantic relations. These relations are then grouped into a frame. In this context, a frame is basically a factual entity with a list of properties associated with it. The final step is a co-reference resolution module that tracks references amongst frames in sentences. Co-reference is performed using hand-coded probabilistic rules and lexical cues.

Although such systems in reality are far from full text understanding, the methodology developed here is a step forward. The system has a serious limitation when it comes to porting it to new domains: some domain specific lexicons must be created and a system developer must create a precise definition of the target concepts for the domain.

PENSEE

Unlike the systems seen so far, PENSEE (50) is a commercial system developed by Oki3. The original modules used in this system were constructed for Machine Translation (MT) purposes. In fact, one of the reasons Oki designed this IE system was to find out how effective the core modules are when used in other applications. The system is made up mainly of two modules: a surface pattern recognition module and a structural pattern recognition module. The surface recognition module detects named entities and co-references without any lexical or syntactical analysis. This module searches for capitalised areas and merges groups of words whenever there is a high probability of having a compound word (based on some handcrafted heuristics). Types of named entities are recognised using other rules by making use of cue words found before or after words of specific types. The remaining capitalised words which are still ambiguous are checked against a handcrafted dictionary.

3 http://www.oki.co.jp/

The structural pattern recognition module traces parse trees and uses them to find named entities, co-references, etc. All these are expressed using the Grammar Description Language (GDL). The GDL system allows the user to describe extraction rules used during the IE process. These rules are then translated and interpreted. The rules are basically divided into two parts, a matching part and an action part. The matching part describes a condition to match a part of a tree representing the document and the action part describes what changes need to be done to the tree.

Also, to help the user create such rules, a powerful debugging system is supplied with the tool. The system in general did not obtain good results in the evaluation, even though, to be fair, the developers emphasised that it was developed in a very short time. Most of the rules used to identify the several concepts were handcrafted and there is very little automation in the system. Even with all these faults, in itself it is a simple and interesting system that shows how pattern matching can be used effectively to extract information without, in some cases, the need for deeper approaches.
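
The matching/action split can be illustrated with a small sketch. This is not GDL itself (whose syntax is not reproduced here) but an assumed tree-rewriting rule in the same spirit, with invented node and field names.

    # Sketch of a match/action rule applied over a document tree, in the spirit of
    # the GDL rules described above. All names and the tree layout are assumptions.
    from typing import Callable, Dict, List

    Node = Dict[str, object]   # a tree node: {"label": ..., "text": ..., "children": [...]}

    class Rule:
        def __init__(self, match: Callable[[Node], bool], action: Callable[[Node], None]):
            self.match = match      # condition on a part of the tree
            self.action = action    # change applied to the tree when the condition holds

    def apply_rules(node: Node, rules: List[Rule]) -> None:
        """Walk the tree and fire every rule whose matching part holds."""
        for rule in rules:
            if rule.match(node):
                rule.action(node)
        for child in node.get("children", []):
            apply_rules(child, rules)

    # Example rule: mark capitalised leaf words as candidate named entities.
    mark_names = Rule(
        match=lambda n: isinstance(n.get("text"), str) and n["text"][:1].isupper(),
        action=lambda n: n.update({"entity": "NAME?"}),
    )

    tree = {"label": "S", "children": [{"label": "W", "text": "Oki", "children": []},
                                       {"label": "W", "text": "announced", "children": []}]}
    apply_rules(tree, [mark_names])
    print(tree["children"][0]["entity"])   # NAME?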

3.3 Motivation for using II

The vast amount of data available on the Internet unleashes an unimaginable number of ways in which the information can be combined and utilised in new applications and tools. Even though this may sound simple, when an automated agent tries to exploit the available data it is faced with many obstacles. Imagine you are driving in a city without well-defined roads, where the roads and buildings change as you move around, and where the road signs, even though they mean the same things, are represented differently in different districts of the city. It sounds like an extract from the 1998 science fiction film Dark City, but this is the current situation automated agents must face when they surf the internet. One of the main reasons is the lack of structured representation in the information found in web pages. This irregularity makes the data too complicated to navigate. The problem gets more complex when we consider that part of the data which is online is not static: many web pages are dynamically generated on the fly from scripts or databases which change frequently. What is referred to as static data changes as well every so often: links which existed a second ago suddenly go dead and others which were not there appear as if from nowhere. Another complication is that there is no uniform vocabulary to describe the same concepts amongst different domains, or even within the same domain.

Apart from this, the II technologies available are not mature enough since they normally require the construction of specialised applications in order to extract information from the different sources. This obviously makes the whole process time-consuming, since it is a very repetitive task performed by a human, and the side effect of having a human do the job is that a human is also error prone and costly. Such specialised applications are, moreover, difficult to maintain, since the structure of a page can change from time to time. Every time this happens these programs break, since they were trained upon a particular structure which no longer exists.

Therefore, the following sections will focus on the basic foundations and techniques in Information Integration as it applies to the Web.

3.4 Previous Work in II

This section will give a brief overview of the different types of Information Integration systems available. It will give a flavour of what we believe are the most relevant systems to our work.

3.4.1 Ariadne

Ariadne (6)(76) is an Information Integration system that makes it fast and cheap to build automated agents used to access, query and integrate existing web sources. It uses several approaches borrowed from knowledge representation, machine learning and automated planning. The system also provides tools for maintaining these agents by making it easy to customise them as new resources become available. Ariadne is made up of the following components:

Application Domain Model is an Ontology of the application domain. It is used as a basis to integrate all the information gathered into one single view, by providing a unique and standardised terminology for the concepts of the domain.

Wrappers are important because Ariadne needs to know what information can be extracted from the web page. This information is normally stored in semi-structured web pages. In order to do this the system uses both a syntactic and a semantic model of the information found on the page. This information is extracted by making use of a simple annotation interface, whereby the user is asked to mark examples of the concepts to extract in the source document. Once the annotations are inserted, a learning algorithm is used to induce the grammar rules for that particular page (a minimal sketch of such an extraction rule is given after this component list).

Information Sources descriptions are used to model how the system should navigate through the internet in order to extract the information found distributed across different web pages. The system makes use of the same wrapping techniques described before, but the information is not just labelled as atomic data and left there. Associations found in the ontology are used to build a plan of how one item of information can lead to another item in a different page.

Query planner handles the different possible queries together with their decomposition into subqueries. It is all based upon the Application Domain Model since it contains a reference to all the data available. The first step, before the actual plan is constructed, is to build the actual integration axioms. This is done off-line so that the cost of creating them during query execution is eliminated. Ariadne manages to extract the different rules by analysing the domain model together with the source descriptions and by calculating the different ways in which these two are related. Once this process is finished, it is time to get a query from the user and process it. The query is first of all parsed, simplified and rewritten so that it matches the different concepts found in the domain model. It is then passed to an optimising module in order to remove any inefficient queries and finally it is executed over the domain model.
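
The wrapper rules referred to in the component list above can be made concrete with a minimal sketch of a landmark-based extraction rule over a semi-structured page. The page fragment and the landmark strings are invented for illustration; Ariadne induces such rules automatically from the examples marked by the user rather than having them written by hand, and its actual formalism differs from this toy version.

    # Minimal sketch of a landmark-based wrapper rule for a semi-structured page.
    # The HTML fragment and the landmarks are illustrative assumptions.
    import re

    page = """
    <tr><td>Name:</td><td>Alexiei Dingli</td></tr>
    <tr><td>Office:</td><td>Regent Court 123</td></tr>
    """

    def extract(page: str, left_landmark: str, right_landmark: str) -> list:
        """Return every string appearing between the two landmark strings."""
        pattern = re.escape(left_landmark) + r"(.*?)" + re.escape(right_landmark)
        return re.findall(pattern, page)

    print(extract(page, "<td>Name:</td><td>", "</td>"))    # ['Alexiei Dingli']
    print(extract(page, "<td>Office:</td><td>", "</td>"))  # ['Regent Court 123']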

Ariadne works fine for most domains but it has a number of major limitations. Currently the system cannot create wrappers for unrestricted natural language texts. It is bound to structured or semi-structured texts, therefore restricting the information which it can obtain. Apart from that, the wrapper construction techniques used require a lot of examples, making the whole process quite tedious for the user setting up the system.

3.4.2 KIND

KIND (81)(62) stands for Knowledge-based Integration of Neuroscience Data and it focuses mainly on integrating information from large databases. It is described by the authors as being a Mediator system, whereby it acts as a common interface between the several data sources available. This has the benefit of providing a unified view of the several databases, therefore allowing an easier and more effective interface for the users. At the heart of the system lies a semantic model of the information sources, used to tie together the data obtained from the different sources. The mapping between the different data (called semantic integration) cannot be done automatically and a mediation engineer must construct the integration views for the system to work.

The problems faced in this system are different from the traditional ones. Normally systems work in domains with similar concepts and attributes, but in the Neuroscience domain (which is the main domain on which the system is based) this is not necessarily the case and different sections of the field have very few concepts and attributes in common. The integration is performed using a logic language, therefore increasing the complexity of the system since only a person with experience in logic languages can create these integration rules. The complexity of the system increases once again since the end users' queries are expressed using an XML-based query language. The system manages to perform its task but the setting up of the different sources and its use is not straightforward and requires a certain degree of expertise.

3.4.3 PROM

PROM (37) stands for Profiler Based Object Matching and is a system that actually takes care of matching concepts and attributes in order to find out how the data should be integrated. The problem is called object matching and seeks to find out whether two given objects refer to the same entity in the real world. There are two main novelties in the system. First of all, it makes use of profilers. These profilers contain a sort of ontology which is very specific to a particular domain, with rules and relationships. The data which is extracted is validated before it is integrated, thus making it possible for the system to reduce errors. This is important in cases where the data does not make sense in the real world; e.g. imagine finding some information about Mr X which says that his age is 10 and he holds a driving license. For an automated system there is nothing wrong with that, but for humans, who know how things work in the real world, the above statement does not make sense since it is obvious that no 10-year-old can hold a driving license. In these cases profilers come to the rescue by providing real-world rules in order to validate the data being integrated. The profilers can be added, removed and reused, therefore making the whole architecture very flexible and easily extensible. Secondly, it does not only seek full attribute matching. In some cases, the attributes in a site do not exist in a similar site. Therefore a strength of the system is that it handles partial matches of attributes. This is achieved, once again, by using the profilers. The system has two sets of profilers: the profilers in the first set are called hard profilers and the others are called soft profilers. The difference is the following: hard profilers contain heuristics that must evaluate to true whenever they are confronted with a matching of attributes or concepts, otherwise the match is rejected. Soft profilers, on the other hand, express a confidence score and the system works by using a sort of voting mechanism. When all the soft profilers have expressed their view about the matching process, if the confidence level is greater than a predefined threshold, then the match is accepted, otherwise it is rejected. The approach described works fine when the profilers are defined by domain experts, but normally it is infeasible to make use of domain experts to build these profilers. This technology is promising since the results obtained were quite good, but research into automated methods to create these profilers must still be conducted before they become usable.
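
The interplay between hard and soft profilers can be summarised in a small sketch. The profiler functions, the voting scheme and the threshold below are illustrative assumptions rather than PROM's actual profilers.

    # Sketch of hard and soft profilers deciding whether two records match.
    # Profiler functions, weights and the threshold are illustrative assumptions.
    from typing import Callable, Dict, List

    Record = Dict[str, object]

    def objects_match(a: Record, b: Record,
                      hard: List[Callable[[Record, Record], bool]],
                      soft: List[Callable[[Record, Record], float]],
                      threshold: float = 0.5) -> bool:
        # Hard profilers: every heuristic must hold, otherwise the match is rejected.
        if not all(rule(a, b) for rule in hard):
            return False
        # Soft profilers: each returns a confidence in [0, 1]; a simple vote decides.
        confidence = sum(rule(a, b) for rule in soft) / max(len(soft), 1)
        return confidence >= threshold

    # Example hard rule: a 10-year-old cannot hold a driving licence.
    no_child_driver = lambda a, b: not (a.get("age", 99) < 17 and b.get("has_licence", False))
    same_surname = lambda a, b: 1.0 if a.get("surname") == b.get("surname") else 0.0

    a = {"surname": "X", "age": 10}
    b = {"surname": "X", "has_licence": True}
    print(objects_match(a, b, hard=[no_child_driver], soft=[same_surname]))   # False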

3.4.4 Name-Matching

An important task in II is the filtering out of duplicate values from the main database where all the information is stored. Due to the redundancy of the web, the same data can be found in different sites around the internet, therefore it is justifiable for the different II and IE strategies to return the same data. In (31) several approaches are explored in order to find out which technique manages to filter out duplicate data while keeping the error rate to a minimum. Finally they propose a technique which is ideal for domains where the data is not structured and little information regarding the particular domain is known a priori. Most of the techniques examined are based on string distance functions. These functions normally calculate a distance between two strings based upon the number of insertions, deletions and substitutions required to be performed on the first string in order to turn it into the second. There have been several distance functions, one of the most popular being the Levenshtein distance. These functions normally work on characters, but there are others, such as the Jaccard similarity, that make use of tokens: a string is considered as being a bag of words, in this case called tokens. These techniques work fine for the majority of cases but obviously they are not perfect since there are no semantics or real-world heuristics to verify the match.
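
As an illustration of the two families of measures mentioned above, the following is a minimal sketch: a textbook Levenshtein edit distance over characters and a Jaccard similarity over tokens, not the exact implementations evaluated in (31).

    # Textbook character edit distance and token-based Jaccard similarity.
    def levenshtein(s: str, t: str) -> int:
        """Number of insertions, deletions and substitutions turning s into t."""
        prev = list(range(len(t) + 1))
        for i, cs in enumerate(s, 1):
            curr = [i]
            for j, ct in enumerate(t, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (cs != ct)))   # substitution
            prev = curr
        return prev[-1]

    def jaccard(s: str, t: str) -> float:
        """Similarity between the two strings seen as bags (here sets) of tokens."""
        a, b = set(s.lower().split()), set(t.lower().split())
        return len(a & b) / len(a | b) if a | b else 1.0

    print(levenshtein("Levenshtein", "Levenstein"))                              # 1
    print(round(jaccard("University of Sheffield", "Sheffield University"), 2))  # 0.67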

3.4.5 StatMiner

StatMiner (91)(90) is a system based around a set of connected techniques that estimate coverage and overlap statistics while keeping the amount of statistics needed under control. To do so, the system computes statistics for classes of queries rather than for individual ones, based upon the attributes being selected to execute the query. In order to do so, the system makes use of hierarchies which classify the attributes and queries. These hierarchies are either designed by a knowledge engineer or induced by machine learning algorithms. Since the classes of queries and attributes can be extremely large, the system prunes away the queries which are rarely used, therefore reducing the general complexity of the system. The system was tested on a Computer Science application called BibFinder. It integrates several sources that provide information regarding the Computer Science domain, such as CSB1, DBLP2, Network Bibliography3, the ACM Digital Library4, ScienceDirect5 and CiteSeer6. BibFinder acts as a mediator between all these sources in order to provide a unified view. Each source does not provide the same information; the information extracted from different sources can be missing, new or overlapping. Basically, different information sources provide a different view of the same data.

1 http://liinwww.ira.uka.de/bibliography/
2 http://dblp.uni-trier.de/
3 http://www.cs.columbia.edu/~hgs/netbib/
4 http://portal.acm.org/
5 http://www.sciencedirect.com/
6 http://citeseer.nj.nec.com/cs
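
To make the notions of coverage and overlap concrete, the following sketch computes them for one query class under simple assumed definitions (coverage: the fraction of all known answers a source returns; overlap: the fraction returned by both sources). The answer sets are invented; this only illustrates the kind of statistic StatMiner gathers, not its algorithm.

    # Illustrative coverage/overlap statistics for a single query class.
    from typing import Set

    def coverage(source_answers: Set[str], all_answers: Set[str]) -> float:
        return len(source_answers & all_answers) / len(all_answers) if all_answers else 0.0

    def overlap(a: Set[str], b: Set[str], all_answers: Set[str]) -> float:
        return len(a & b & all_answers) / len(all_answers) if all_answers else 0.0

    # Invented answers from two bibliography sources for the query class "AI papers, 2003".
    dblp = {"p1", "p2", "p3"}
    citeseer = {"p2", "p3", "p4", "p5"}
    union = dblp | citeseer

    print(coverage(dblp, union))           # 0.6
    print(coverage(citeseer, union))       # 0.8
    print(overlap(dblp, citeseer, union))  # 0.4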

The system presented in (91)(90) provides a neat unified view of the data. It tackles the problem of unifying the same data coming from multiple sources. Unfortunately this poses the problem of having to rely on these sources for valid data. If they contain some wrong entries with missing information, the system cannot do anything about it since it does not seek to gather information directly from the horse's mouth!

3.5 Conclusion

This chapter showed that AIE and II are two vital technologies for the success of our research. AIE techniques provide the capabilities necessary for learning new concepts, while II allows our system to glue the learnt and harvested information together. The different AIE technologies mentioned are a good representation of the most successful IE techniques available. They cover a wide range of methodologies, and in Chapter 4 we will select one of these tools and utilise it in our methodology. With our approach we will show how such tools can be used in a more effective way.

With respect to II, we decided to list only those systems which we believe are most relevant to our research. By doing so, we only gave an overview of II, mainly because II is a very vast field, fragmented into several subfields, and a complete review of the whole field would be beyond the scope of this thesis. For a more complete review of II technologies please refer to (2)(73). In the coming chapters, we will see how both IE and II were used in our work.

The usage of these technologies will not only produce better systems which are more effective and efficient, but will also add more value to the general user experience.

Chapter 4

Melita: A Semi-automatic Annotation Methodology

4.1 Introduction

The main goal behind the Semantic Web (SW) is to add meaning to the current web, therefore making it understandable by both humans and machines. Humans are already equipped for such a task: the knowledge they have gathered throughout their lives makes it possible for them to understand the content of different web pages without requiring any external help. When it comes to machines, it is a totally different story. They do not possess any background knowledge, therefore they cannot understand the information found on the web, since they cannot place it in any appropriate context. This poses a big constraint on the SW: this new web must somehow supply enough meaning (or semantics) with the data so that machines can understand it, while making sure it remains understandable by humans. If we manage to achieve this goal, we will have made it possible for knowledge to be managed automatically by the machines themselves, therefore reducing the need for human intervention.

There are several stages required before this goal is achieved. The first stages involve defining the infrastructure for the semantic web. New standards were defined for representing knowledge, starting from the very atomic levels where data is defined using the XML language and moving towards more elaborate and powerful representations such as RDF. At this stage, the need was felt to abstract even further from the atomic elements and represent higher level concepts and relationships. To define these higher level structures for knowledge organisation (like ontologies, etc.), further standards were created such as OIL, DAML, DAML+OIL and finally OWL.

This is only half of the story. Having the adequate structures without the data is useless. Therefore, the next step in the SW is the population of these knowledge structures. There are several ways of populating these ontologies. The most basic way is to manually fill these knowledge structures with instances. But this is not very useful since we live in a dynamic world where things constantly change around us and those instances would probably have to change with time. Another way is to insert in the ontologies a reference to an instance rather than the actual instance. The process of associating parts of a document to parts of an ontology is normally referred to as annotation. In this way, if the instance changes slightly, there is no need to modify all the ontologies where this instance appears. This also makes sense because in our world, and even on the Internet, there exists no Oracle of Delphi that has the answers to all the possible questions, thus we can never be sure of the validity of our data. Knowledge is by nature distributed and dynamic, and the most plausible scenario in the future (60) seems to be made up of several distributed ontologies which share concepts between them. The annotation task may seem trivial but in actual fact it is repetitive, time-consuming and tedious to perform by humans. If we think about the current size and growth of the web (30), it is already an unmanageable process. If we re-dimension our expectations and try to annotate just the newly created documents, it is still a slow, time-consuming process that involves high costs.

Due to these problems, it is vital for the SW to produce methodologies that either help users during the annotation of these documents in a semi-automatic way or actually produce the annotations automatically. One of the most promising technologies in the Human Language Technologies (HLT) area is without doubt Information Extraction (IE). IE is a technology used to identify important facts in a document automatically. The extracted facts can then be used to insert annotations in the document or to populate a knowledge base. The job performed by IE techniques fits perfectly into the whole SW picture. IE can be used to support knowledge identification and extraction from web documents in a semi-automatic way (e.g. by highlighting the information in the documents). Also, when IE is combined with other techniques such as Machine Learning (ML), it can be used to port systems to new applications/domains without requiring any complex settings. This combination of IE and ML is normally called Adaptive IE (18)(7)(27).

There are already a number of tools that exploit AIE in order to support users while annotating documents. These include the MnM annotation tool (38) developed at the Open University (which uses both the UMass IE tools (1) and Sheffield's Amilcare (24)), as well as the Ontomat annotiser (64) developed at the University of Karlsruhe, which is an implementation of the CREAM environment mentioned before. The approach which both MnM and Ontomat use to train the AIE algorithms is called active learning and works as follows:

1. the user annotates the document and the IE system learns how to reproduce those annotations.

2. when similar examples are encountered, the IE system automatically inserts annotations based upon the previous cases.

3. the user finally checks the annotations.

4. the cycle starts all over again so that the algorithm is retrained with the newly corrected data.
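
The batch-oriented cycle listed above can be sketched as follows; train(), suggest() and user_corrects() are placeholders standing in for the IE engine and the user, not the actual interfaces of MnM or Ontomat.

    # Sketch of the batch active-learning annotation cycle described above.
    # train(), suggest() and user_corrects() are placeholders, not real APIs.
    from typing import Callable, List

    def annotate_corpus(batches: List[List[str]],
                        train: Callable[[List[str]], None],
                        suggest: Callable[[str], str],
                        user_corrects: Callable[[str], str]) -> List[str]:
        annotated: List[str] = []
        for batch in batches:
            proposed = [suggest(doc) for doc in batch]            # the IE system proposes annotations
            corrected = [user_corrects(doc) for doc in proposed]  # the user checks and corrects them
            annotated.extend(corrected)
            train(annotated)                                      # the engine is retrained on the corrected data
        return annotated

    # Toy run with identity stand-ins for the engine and the user.
    result = annotate_corpus([["doc1", "doc2"], ["doc3"]],
                             train=lambda docs: None,
                             suggest=lambda doc: doc,
                             user_corrects=lambda doc: doc)
    print(len(result))   # 3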

It has been proven in (104) that this approach reduces the burden of manual annotation by up to 80% in some cases. The advantages are quite obvious but unfortunately the interaction model between the users and the system which is implemented in both MnM and Ontomat is rather simplistic. Generally, the annotation process happens in a batch, i.e. the user annotates a batch of documents and, when the batch is manually annotated, the IE algorithm is launched over that batch for training. Once trained, the IE system can be used over a new unseen batch of documents and the IE engine manages to recognise examples which are similar to the ones already seen in the training set. Although we do not deny that these systems minimise the burdens of annotating documents, they are not effective enough and they do not exploit the full benefits of adaptive IE. In this chapter we propose a methodology1, developed during the same period that MnM and Ontomat were being created, which implements an improved interaction model between the users and the system. It was originally designed to maximise effectiveness and reduce the burdens introduced by the annotation process. In so doing we move from a system which interacts with the user to a system that collaborates with the user.

1 This methodology was developed and refined by myself together with Professor Yorick Wilks, Professor Fabio Ciravegna and Dr Daniela Petrelli. The design and implementation of the system are exclusively mine.

The methodology to realise this interaction model is based around the Melita system. Melita is an ontology-based demonstrator for text annotation. As the basis for this system, we propose two user-centered criteria as measures of the appropriateness of this collaboration: the timeliness and the intrusiveness of the IE process. The first refers to the time lag between the moment in which annotations are inserted by the user and the moment in which they are learnt by the IE system. Basically, Melita automatically calculates how timely the system is in learning from user annotations. In systems like MnM and Ontomat this happens sequentially in a batch. The Melita system implements an intelligent scheduling in order to keep timeliness to a minimum without increasing intrusiveness. The latter represents the level to which the system bothers the user, because, for example, it requires CPU (and therefore stops the user's annotation activity) or because it suggests wrong annotations. It also refers to the several ways in which the IE system gives suggestions to the user to help reduce the burden of annotating tags.

4.2 Melita: A Semi-Automatic Annotation Methodology

In Melita, the annotation process is split into two main phases from the IES2 point of view: (1) training and (2) active annotation with revision. In user terms the first corresponds to unassisted annotation, while the latter mainly requires correction of the annotations proposed by the IES. While the system is in training mode, it behaves in a similar way to other annotation tools. In fact, at this stage, the IES is not contributing in any way

2 The IES used in Melita is Amilcare (http://nlp.shef.ac.uk/amilcare/).


4.2.1 Dealing with timeliness

In the Melita methodology mentioned above, timeliness3 is only partially guaranteed, because the IES annotation capability always refers to rules learned by using the entire annotated corpus, less the last two documents. This means that the IES is not able to help when three similar documents are annotated in sequence. This is the price paid for learning in the background. From the user's point of view the methodology is equivalent to training on batches of three texts, with all the disadvantages of batch training mentioned above (even if a batch of size three is quite small). Timeliness is a matter of perception from the user's side, not an absolute feature; therefore the only important matter, we believe, is that the users do not perceive it, rather than trying to avoid it at all costs. In this respect we start from the consideration that in many applications the order in which documents are annotated is random. Generally users adopt criteria such as date of creation or file name order in directories. In such cases it is possible to organise the annotation order so as to avoid presenting similar documents in sequence and therefore to hide the lack of timeliness. In order to implement such a feature we need a measure of similarity of texts from the annotation point of view. (41) tries to address this problem by proposing a committee-based sample selection approach. This involves examining the available corpus and selecting the most informative examples using some probabilistic classifiers. It avoids redundancy since the user only annotates examples which convey new information to the IE engine.

In our approach, we use the IES to calculate such a measure. The user is initially asked to select the corpus of documents to annotate. These documents are then presented to the user for tagging one by one. As far as the user is concerned, they are presented in no particular order, but in reality they follow a smart ordering. Iterating through the selected corpus is not simply a matter of parsing the documents one by one. The point is that, in reality, there is no need to examine all the training corpus.

3 Timeliness is the time lag between the moment in which annotations are inserted by the user and the moment in which they are learnt by the IE system.

An IE engine can stop parsing the documents once all the concepts required are learnt. If we consider the training documents to be part of a document space in which all the training documents are represented, then the documents can be virtually clustered according to the number of patterns found by Amilcare during a test run. Melita then begins suggesting documents for tagging starting from those which contain the fewest patterns, the reason being that these documents are more likely to contain new patterns relevant to Amilcare which have not yet been learnt.

This process of smart ordering of documents cannot be initialised immediately; in fact it is started after two documents are tagged. To start immediately, several other issues would have to be addressed, such as how to select documents with the most relevant examples and, most importantly, how one can denote relevance in new unseen domains. Since at the start of the training no pattern is known by the training algorithm, all the documents are arranged in no particular order. When a pattern is learnt by the IE algorithm, it is normally the case that other documents contain similar patterns. Therefore, if the other documents are tested with the IE engine, we can rate them according to the number of tags identified by the IE algorithm. These relevance ratings can be calculated only after the first document is tested, and this is the reason why the process cannot be started immediately.

Since a rating of documents now exists, the next step is to choose the best document. The documents can be divided into 3 subcategories, in terms of whether they match any of the patterns so far proposed by the rule induction algorithm:

Documents without annotations Since we could be working in an open domain, we cannot be sure why these documents were not annotated. It might be the case that these documents are irrelevant and therefore there are no patterns to learn from them, or the document is relevant but the learning algorithm did not encounter any similar patterns in the already covered documents. Because we cannot know for sure which documents are relevant or not, these documents are stored and will be used at a later stage.

Documents with partial annotations These documents are those which have new patterns not yet covered by the algorithm. To find these documents, the application must keep statistics of the average number of instances per annotation suggested by the IE algorithm for an average document. These statistics help the application choose a document that has a high probability of containing examples with new instances. A typical example is when the learning algorithm returns from a document partial information covering only a few of the concepts. Since the document in question already contains some of the patterns, there is a higher probability that it will contain other unknown but relevant patterns. Therefore these documents must be sent for training in order to allow the algorithm to learn all the missing concepts.

Documents having complete annotations These are documents which were annotated by the IE algorithm but were never verified by the user. The reason to train the algorithm on these documents is to reinforce the IE algorithm. These documents can be selected when the program enters Active Supervision (AS). In AS the user is required to supervise or correct the output of the learning algorithm rather than introduce new tags from scratch.

Once the documents are categorised, the best one can be chosen depending on the status of the application. In our model we always select in turn for annotation a completely uncovered document (i.e. a document from category 1) followed by a fairly covered document (a document from category 2). In this way, successive documents are very likely to differ and therefore the probability that similar documents are presented in turn within the batch of three (i.e. the blindness window of the system) is very low. Incidentally, this strategy also tackles another major problem in annotation, i.e. user boredom. This is the major reason why the level of user productivity and effectiveness falls over time. Presenting users with radically different documents should avoid the boredom that comes from coping with very similar documents in sequence. When the documents from the first two categories are annotated, those in category 3 are proposed for annotation (provided the learning algorithm did not already learn all the different patterns!). This makes sure that the documents which require the most effort from the user are presented at the beginning of the session and those requiring the least effort are presented towards the end of the session.
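
The alternating selection strategy just described can be sketched as follows. The scoring function stands in for a test run of the IE engine over each document, and the category boundaries (no matched tags / partially matched / fully matched) are simplifying assumptions.

    # Sketch of Melita's alternating document selection.
    # count_suggested_tags() stands in for a test run of the IE engine over a document.
    from typing import Callable, Dict, List

    def next_document(unannotated: List[str],
                      count_suggested_tags: Callable[[str], int],
                      expected_tags: int,
                      pick_uncovered_next: bool) -> str:
        scores: Dict[str, int] = {d: count_suggested_tags(d) for d in unannotated}
        uncovered = [d for d in unannotated if scores[d] == 0]               # category 1
        partial = [d for d in unannotated if 0 < scores[d] < expected_tags]  # category 2
        complete = [d for d in unannotated if scores[d] >= expected_tags]    # category 3
        if pick_uncovered_next and uncovered:
            return uncovered[0]
        if partial:
            return min(partial, key=lambda d: scores[d])     # fewest known patterns first
        return (complete or uncovered or unannotated)[0]     # fall back to category 3

Flipping the pick_uncovered_next flag on every call reproduces the category 1 / category 2 alternation described above, leaving the documents of category 3 for the end of the session.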

4.3 The methodology at work

Melita is an ontology-based demonstrator for text annotation. The goal of Melita is to demonstrate how it is possible to actively interact with the Information Extraction System (IES) in order to meet the requirements of timeliness and tunable pro-activity mentioned above. Melita's main control panel is shown in Figure 4.3. It is composed of two main areas:

1. The ontology (left) representing the annotations that can be inserted;


Figure 4.3: The Melita annotation interface

• Annotations are associated to concepts and relations.

• A specific color is associated to each node in the ontology (e.g. "speaker" is depicted in grey).

2. The document to be annotated (DOC panel) (center-right).

• Selecting a portion of text with the mouse and then clicking on a node in the ontology inserts an annotation.

• Inserted annotations are shown by turning the background of the annotated text portion to the color associated to the node in the hierarchy (e.g. the background of the portion of text representing a speaker becomes grey).

Melita does not differ in appearance from other annotation interfaces such as MnM or Ontomat. It differs in that it is possible to tune the IES so as to provide the desired level of pro-activity, and in the possibility of selecting texts so as to provide timeliness in annotation learning.

The typical annotation cycle in Melita follows the two-phase cycle based on training and active supervision. Users may not be aware of the difference between the two phases. They will just notice that at some point the annotation system will start suggesting annotations directly in the DOC panel. In the initial phase, simple techniques that help the user identify instances in the document are used. For example, if the user tagged a name in the top part of the document, a search is performed to identify other words which match exactly that name in the rest of the document. These matches are then presented to the user so that he can decide whether they are correct or not.

Since Melita is ontology based, the different tags are defined in an ontology (see Figure 4.3). The component shows a hierarchical display of tags together with a small box on the left of every concept indicating the current level of accuracy reached by the learning algorithm for that tag. When the box is clicked, a level metre pops up showing the confidence of the IE algorithm in discovering individual concepts. Internally, the system makes use of precision and recall4 to calculate this level, but in Melita we just show one combined measure, because a naïve user may not be acquainted with the meaning of precision and recall. Considering that it is normal to have high precision and low recall, or vice versa, at the start of the training, this may disorient the user.
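
Assuming the standard balanced form of the F-measure (the exact weighting used by the IES is not spelt out here), the single combined measure shown to the user is computed from the precision P and recall R as

    F = \frac{2 \cdot P \cdot R}{P + R}

so, for instance, a concept with P = 0.9 and R = 0.5 would be displayed as a single accuracy level of about 0.64.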

The user is also given the power to set the level of accuracy at which the IES should start showing suggestions. Different concepts can also have different levels of accuracy, because particular tags may be more important than others and therefore a higher accuracy would be required for them. When the confidence level surpasses the level indicated by the user, the program utilises the feedback obtained from the IE engine on the given corpus to suggest occurrences of tags in the document. At this stage the role of the user changes from one of training the learning algorithm to one of actually supervising the algorithm. In fact, the user will reach a point where he no longer needs to assist the IES with new annotations but just corrects the tags suggested by the learning algorithm. This is the concept of Active Supervision: the user is actively involved in supervising the work of the IE engine.

4 The calculation which makes use of precision and recall is most commonly referred to as the F-measure.

In Ontomat and MnM, the learning stage is normally done in batches. In Melita, the system needs to constantly monitor the accuracy level of the algorithm so that it suggests annotations immediately when the accuracy goes beyond a user-defined level. Because of this, the system does not train in batches; documents are processed almost immediately. The smart approach which Melita takes allows results to be obtained just in time, when we need them, and since all these evaluations are done in the background, the user does not even notice that the algorithm is being trained.

4.3.1 Intrusiveness

The fact that Melita manages to anticipate the user by highlighting suggestions immediately when the document is loaded is a very powerful feature. But this feature can also be very risky. By doing so, we would be taking away control from the user and placing it in the hands of the system. This is also referred to as intrusiveness, or rather the act of interfering without permission to do so. These interferences can be various: when the accuracy is high enough, any suggestion would be welcomed by users since it would save them a lot of work. But what about when the system suggests annotations (during the active annotation with revision phase) while its suggestions are still unreliable? The system would be a nuisance for the users and would hinder them rather than help them.

Figure 4.4: Intervention Level Setup

To cater for this problem, Melita adopts a number of strategies. In a typical Melita session, once the first few iterations are complete, the program starts obtaining suggestions from the learning algorithm. These suggestions are stored and only shown to the user when and if the user decides to see them. This is done so that the program never interferes with the user's work without the user's explicit consent. Therefore, with this system, we are empowering the user to calibrate the intrusiveness of the system in a very simple way. First of all the system has a suggestion button. Whenever a user disables it, no matter how many suggestions the system obtains, they are never shown to the user. If the user decides to view the suggestions, all that is needed is to enable this button and the suggestions are displayed immediately. In order to make things simpler for the user, the system calculates the level of expected accuracy5 of the extraction process and uses that measure to either suggest to the user (if the accuracy level is low) or insert a tag in the document (if the accuracy is very high). This measure is possible because the rules in the IE system return an error score based on the training corpus.

The interface which allows the user to decide which rules should be used can be seen in Figure 4.4. It is made up of two knobs which the user can drag up or down along the percentage bar that lies between them. The two knobs represent the suggestion level and the certainty level respectively. The value in the percentage bar is calculated using the accuracy measure returned by the algorithm. Basically, a low percentage in the bar indicates less accurate rules and a high percentage indicates high accuracy rules. The two knobs give an indication of which rules will be enabled.

• Annotations produced from rules below the suggestion level are never shown.

• Annotations produced from rules between the suggestion and certainty levels are enabled but are shown in the document as a colour-bordered box. The fact that the box is just made up of a border and does not have a filling colour means that the program is not certain about its validity. If the suggestions are not validated by the user (by clicking on them), they are discarded before the document is saved!

• Annotations produced from rules above the certainty level are enabled and shown using a fully coloured box. The fact that the box is fully coloured means that the program is fairly sure about the validity of the annotation. These annotations do not require much validation from the user and they are inserted immediately in the document.

5 Expected accuracy is calculated by finding the F-measure of the learning algorithm.
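
The effect of the two knobs can be summarised with a small sketch; the accuracy scores and threshold values below are illustrative, not Melita's internal settings.

    # Sketch of how the suggestion and certainty knobs gate the rules' output.
    def present(rule_accuracy: float, suggestion_level: float, certainty_level: float) -> str:
        if rule_accuracy < suggestion_level:
            return "hidden"        # never shown to the user
        if rule_accuracy < certainty_level:
            return "tentative"     # colour-bordered box, discarded unless validated
        return "inserted"          # fully coloured box, inserted immediately

    for accuracy in (0.30, 0.65, 0.92):
        print(accuracy, present(accuracy, suggestion_level=0.50, certainty_level=0.85))
    # 0.3 hidden, 0.65 tentative, 0.92 inserted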

The purpose of this component is to allow the user to tune the level of precision and recall without actually knowing anything about them. The user is unaware that by moving the two knobs he is calibrating the application. What he sees is tags appearing, disappearing or changing certainty level whenever a knob is moved. The reason for the creation of this component is to simplify the tuning of the precision and recall values individually for specific concepts.

This fine-grained control per concept was created because the learner might need few examples in order to cover simple concepts and more examples for complex concepts. This allows the system to suggest annotations for simple concepts and let the user annotate the more complex ones. The advantage of this is that the user's attention is focused on the difficult tasks and the easy annotations are left to the system. The annotation interface must bridge the qualitative vision of users (e.g. a request to be more/less active or accurate) with the specific IES settings (e.g. change error thresholds) (28).

Another important aspect of intrusiveness is processor time. Any IES is quite heavy on CPU usage when it comes to training. The IES can be so demanding at times that it could stop the user's activity. To solve this problem, systems (MnM, Ontomat, etc.) often work in batch mode for training. Therefore the user annotates and then waits while the system is being trained. Normally this training is delayed and scheduled for periods when the CPU is not used by the user, such as during breaks, at night, etc. This approach unfortunately poses a problem of timeliness. The IES working in batch does not propose suggestions when required but does so only after the system goes through a training session. This means that the user has to annotate a number of similar texts before the IES starts learning and a number of suggestions are proposed. Melita proposes a methodology to partially solve this problem by scheduling the learning phase in a way that allows timeliness in the learning and limits to an absolute minimum the total amount of idle time imposed on the user.

The annotation process is normally made up of three phases, i.e. loading the document, annotating the document and saving the document with the annotations. If we examine these phases from the user's point of view we find that, time wise, loading and saving are the least time-consuming since they only take a negligible amount of time. Most of the user's time is spent annotating the documents. If we change our perspective and examine CPU time and effort, we find that some of the computer's resources are used during the loading and saving phases, while during the annotation phase the load on the CPU is negligible. For this reason we believe that this is the right moment to train the IES in the background without the user noticing it.6

Therefore, we propose an interaction model made up of two parallel and asynchronous processes. The first process handles all the interaction between the user and the machine. It provides an annotation interface for the user and gives the user the impression that he is using all the machine's processing power. The second process works in the background and is used to train the learner as soon as new examples are supplied. This process uses a limited amount of processing power so that the user is not affected by a lack of computing resources. What happens is the following: when the user loads a new document (let us call it document N), the IES applies the rules induced in the previous learning sessions7 in order to extract information (either for suggesting annotations during active annotation or in order to silently test its accuracy during unassisted learning). Performing IE on one document is very fast and the user does not notice any delays. While the user is annotating document N, the IES is learning from document N-1. This methodology reduces timeliness to an absolute minimum since the user is never idle. The time taken to train an IE system with new examples is normally shorter than the time required to manually annotate a document. Therefore, as soon as a document is annotated, the IES is ready, waiting to suggest new annotations for the next document.

6 In theory, the system could start learning from the annotations as soon as they are inserted in the document, even before they are saved to disk. This would make the process quicker, but let us not forget that humans are prone to commit errors, so if an annotation is marked wrongly and the system starts learning immediately, it would be learning wrong annotations before the user has the time to correct them. If they are corrected by the user, the IES would have to modify the new rules back to how they were before the wrong example was learnt. Obviously this makes the whole interaction model between the user and the machine much more complex.
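
One possible realisation of these two parallel, asynchronous processes is sketched below with a background worker thread; the thread-based design and the function names are assumptions, not Melita's actual implementation.

    # Sketch of the two asynchronous processes: the foreground loop serves the user
    # while a background worker trains on the previously saved document.
    # annotate() and train_on() are placeholders for the user interface and the IES.
    import queue
    import threading

    def background_trainer(saved_docs: "queue.Queue", train_on) -> None:
        while True:
            doc = saved_docs.get()      # blocks until an annotated document is saved
            if doc is None:             # sentinel: the annotation session has finished
                break
            train_on(doc)               # learn from document N-1 while the user tags N

    def annotation_session(corpus, annotate, train_on) -> None:
        saved_docs = queue.Queue()
        worker = threading.Thread(target=background_trainer, args=(saved_docs, train_on))
        worker.start()
        for doc in corpus:
            annotated = annotate(doc)   # the user works; CPU load here is negligible
            saved_docs.put(annotated)   # hand the document to the learner and move on
        saved_docs.put(None)
        worker.join()

    # Toy run: the stand-ins just record which documents were annotated and trained on.
    events = []
    annotation_session(["d1", "d2", "d3"],
                       annotate=lambda d: (events.append(("annotate", d)), d)[1],
                       train_on=lambda d: events.append(("train", d)))
    print(events)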

4.3.2 Timeliness

The interface which gives information to the user regarding timeliness (See Figure 4.5) consists of the list of the documents in the corpus, rated according to the percentage of possible new patterns in every document. The document with the highest rating is the one that has some of the patterns but which also contains other patterns not covered by the learning algorithm and therefore it is the document suggested for training.

The list of documents can also be sorted either by rating or by name. To switch between orderings the user must select the title of either the column containing the file names or the other column. A user can also ignore the suggestions of the algorithm and simply jump to a specific document by selecting the document from the list and pressing the "Ok" button.

7 i.e. from document 1 to document N-2, since document N-1 is still waiting to be processed

Figure 4.5: Document rankings in Melita

4.4 Melita Evaluation

We performed a number of experiments both to demonstrate how effective the IE system is in learning concepts and to quantify its contribution to the annotation task. In the following sections we will highlight two specific tasks: the first is the CMU Seminar Announcements Task and the second is the PASTA Task. After these two experiments we will discuss alternative user-oriented evaluations of the system. Finally, we will peek into what we believe are the features which future annotation systems will have, followed by a general conclusion.

4.4.1 CMU Seminar Announcements Task

The first set of tests was performed on the CMU seminar announcements corpus1. This corpus is widely used in the IE field ((43), (17), (47), (23) and (98)) to evaluate adaptive algorithms. In these experiments, we show how fast the IE system is able to converge to a status where it can really and effectively suggest. We also try to figure out its ability to suggest correctly and quantify its contribution to the annotation task.

The CMU seminar announcements corpus consists of 485 documents2 which were posted to an electronic bulletin board. Each document announces an upcoming seminar organised in the Department of Computer Science at Carnegie Mellon University. The documents contain semi-structured texts consisting of meta information (like sender, time, etc.), which is quite structured, and the announcement, which is generally written using free text. The idea behind this domain is to train an intelligent agent capable of reading the announcements, extracting the information from them and, if it thinks they are interesting (based on some criteria predefined by the user), inserting them directly in the user's electronic diary.

1 Originally prepared in (44)
2 This corpus can be downloaded from http://www.isi.edu/~muslea/RISE/

Figure 4.6: Distribution of the different tags found in the documents.

In these documents, the following fields were manually annotated:

Speaker The full name (including any title) of the person giving the seminar.

Location The name of the room where the seminar is going to be held.

Start Time The starting time of the seminar.

End Time The finishing time of the seminar.

To better understand the task at hand, Figure 4.6 takes a look at the number of tags found in this corpus. It seems that the corpus is very rich in Start Time tags and less rich in End Time tags. It is generally the case that the more examples we have the better (since we have more possibility to learn), but it is not always the case. It is important to analyse another statistic related to the corpus. The graph in Figure 4.7 shows the number of distinct phrases containing a tag, such as "Start time: 12:00pm", "Seminar starting at 12:00", etc. From this we can predict that:

Speaker will be quite difficult to learn. Intuitively, we know that a Speaker can be any sequence of words. A named entity recogniser can help spot names using linguistic cues (such as the titles Mr, Mrs, Dr, etc. before a word) but this is not enough. Let us not forget that this task is not just a matter of spotting names: the system must identify the person who is going to give the seminar. From the graph we can see that there are around 491 distinct phrases containing names, meaning that there is more than one new phrase (containing a name) per document. Even though there are 757 examples (around 27% of all tags) in the documents containing the Speaker tag, the fact that many of these examples are new and not repeated elsewhere makes the whole task much harder.

Location too will be difficult to learn, but definitely easier than the Speaker tag. This is very visible from the two graphs. First of all, the total number of examples in the corpus amounts to 643, or 23% of all the tags, which is quite a reasonable representation. Secondly, the total number of distinct phrases is 243, which means that slightly more than a third of the tags are new.

Start Time is a completely different story. The documents contain a total of 982 tags (or 35% of the total number of tags) and there are only around 151 distinct phrases. This means that training on 15% of all the documents (almost 75 documents) is enough to learn this concept.

Figure 4.7: Number of distinct phrases (containing a tag) found in the corpus.

End Time is slightly more complex. The number of distinct phrases is very small, around 93 instances, but so is the representation of this concept in the corpus. In fact, this concept appears only around 433 times (i.e. 15% of all the tags). This means that there is a substantial number of documents where the End Time concept is not represented.3

3 Considering that the total number of tags amounts to less than the number of documents and that there are cases where more than one tag appears in a document.

In the following experiments, the annotations in the corpus were used to simulate human annotation. The experiments were conducted as follows:

1. The corpus was randomly divided into two: a test corpus and a training corpus.

2. A cycle begins where we evaluate the contribution of the IE system at regular intervals during the tagging of the corpus. At every cycle, 5 new annotated documents from the training corpus (together with the previously selected documents) are used to train the IE system. Therefore, we train the system by gradually adding 5 documents to the training corpus (i.e. 5, 10, 15, 20, 25, ..., 150 documents).

3. As soon as the training finishes, we test on the test corpus. The ability to suggest on the test corpus was measured in terms of precision and recall. Recall represents here an approximation of the probability that the user receives a suggestion in tagging a new document. Precision represents the probability that such a suggestion is correct.

4. The cycle continues like this until all the training documents are learnt.
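
The evaluation protocol above can be written down as a short sketch; train() and extract() are placeholders for the IE engine (not the actual Amilcare calls), and annotations are represented as (document, tagged span) pairs for simplicity.

    # Sketch of the incremental evaluation protocol: train on 5, 10, 15, ... documents
    # and measure precision and recall on the held-out test corpus after every step.
    from typing import Callable, Dict, List, Set, Tuple

    Annotation = Tuple[str, str]   # (document id, tagged text span), a simplifying assumption

    def score(predicted: Set[Annotation], gold: Set[Annotation]) -> Tuple[float, float]:
        correct = len(predicted & gold)
        precision = correct / len(predicted) if predicted else 0.0
        recall = correct / len(gold) if gold else 0.0
        return precision, recall

    def learning_curve(training: List[str], test: List[str],
                       gold: Dict[str, Set[Annotation]],
                       train: Callable[[List[str]], None],
                       extract: Callable[[str], Set[Annotation]],
                       step: int = 5) -> List[Tuple[int, float, float]]:
        curve = []
        for n in range(step, len(training) + 1, step):
            train(training[:n])                                  # previously selected documents plus 5 new ones
            predicted = set().union(*(extract(d) for d in test))
            reference = set().union(*(gold[d] for d in test))
            curve.append((n, *score(predicted, reference)))
        return curve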

The results are shown in Figure 4.8. The X-axis represents the number of documents provided for training while the Y-axis shows the percentage scale on which precision, recall and F-measure are plotted. Although these results are similar to the experiments we published in (42), they have a number of important differences. The curves are more gradual, yet they are smoother and manage to achieve higher precision and recall. This is due to a slight modification in the final version of the system with respect to the one presented in that paper. Since the system always selected the best document for learning, it was noticed that some concepts were being learned quicker than others. The reason is that the "best" document for learning is the one with the most uncovered cases. If we take a look once again at Figure 4.6 and examine the distribution of the tags, we realise that, since some concepts are represented more than others, we would be discriminating against concepts which are not represented as much. Because of this, we modified slightly the selection algorithm so that we learn from the "best" document but also keep in consideration the distribution of the tags. In our approach, every time we select a document, we do so by selecting the "best" document which represents a particular tag. E.g. the first document is the best document which represents the concept Speaker, the second one is the best document that represents the concept Location, etc. In this way all the tags are learnt at a steady pace. The annotations suggested by the system are more helpful for the user since they do not discriminate between concepts which are represented in the corpus more than others. The downside of this approach is that learning individual concepts is slightly slower because the system is trying to learn collectively all the concepts in the domain. Also, since the algorithm is learning from the "best" documents per concept, it is covering more unseen examples in the training phase, therefore resulting in much higher levels of precision and recall. In fact, when the system is trained on just 50 documents, the average F-measure for all concepts is beyond 75% (see Figure 4.9). The rise is constant and when 80 documents are used for training, the system reaches almost 84% F-measure for all concepts (see Figure 4.9). After that, the system reaches a sort of plateau where the improvement is very gradual.

As was predicted earlier, the maximum gain comes in annotating Start Time and

End Time since they present quite regular fillers. Table 4.1 shows that after training on only 10 texts, the system is potentially able to propose 433 start times (out of 485),

373 are correct and 60 are wrong or partially wrong, leading to Precision=86, Recall=77.

With 25 texts the recognition reaches P=94, R=68, and with 50 texts P=97, R=80. The situation is quite similar for end time (although it takes more examples than start time to converge), while it is more complex for speaker and location, where 70% and 80% F-measure respectively are reached only after about 80 texts. This is due to the fact that locations and speakers are much more difficult to learn than time expressions, because they are much less regular. The experiments show that the contribution of the IE system can be quite high.

Tag          Texts needed for training   Precision   Recall   F-measure
speaker      70                          98          50       62
location     30                          96          50       61
start time   10                          86          77       80
end time     25                          95          54       66

Table 4.1: Experimental results showing the number of training texts needed for reaching at least 75% precision and 50% recall

If we take a look at Table 4.1 we realise that, with just limited training, reliable annotations can be achieved. This shows that the IE system contributes quite heavily to reducing the burden of manual annotation and that such a reduction is particularly relevant. In the case of very regular concepts the benefits are immediate since the system starts suggesting from a very early phase. This allows the user to focus on annotation tasks which are more complex (e.g. speaker and location) while trivial and repetitive instances are handled by the algorithm. For complex tasks the system also contributes towards annotating the complex cases, but some additional help is required from the user's side (e.g. in the case of speaker, the system needs twice the number of documents required by the other concepts to reach 75% precision and 50% recall).

From these experiments, a number of issues arise. It is clear that the order in which documents are presented to the learner is important. Figure 4.9 shows the effect of selecting different document sets for training.

[Figure: precision, recall and F-measure plotted against the number of training documents for each concept.]

Figure 4.8: Experimental results for the CMU Seminar Announcements Task

[Figure: average F-measure over all concepts plotted against the number of training examples, comparing the original document order with the selected order.]

Figure 4.9: Experimental results showing a faster convergence when presenting ordered documents to the learning algorithm (the lines show the average F-measure for all the concepts)

Figure 4.9 shows the effect of selecting different document sets for training. When the IE system is presented with a random order of documents, its ability to converge to optimal results is limited for a number of reasons. First of all, the corpus contains similar texts; therefore, even though the system is examining new documents, it is not examining new examples but examples which were already seen elsewhere. Secondly, the smart ordering of the documents presented to the learning algorithm allows the system to learn only from documents which contain new examples, and therefore the coverage of the IE system increases drastically. This occurs because the value of the information found in the training corpus is much greater, which allows a better convergence of the learner towards an optimal state. For example, when 70 documents are annotated, the random order reaches a plateau (with very slow increases) at an F-measure of 71%, while the similarity-based selection provides the same F-measure with only around 40 texts. This means that users receive much greater help when a non-random order of documents is used for training. This is consistent with Active Learning theory (104), and will require more investigation in the future for a better selection of the training corpus.

Another issue to consider is the way the user reacts to such help from the IE system. Since the system is capable of suggesting quite quickly, especially for concepts which are regular and therefore easy to learn, users could be tempted to rely on the IE system's suggestions only, avoiding any further action apart from correction. This could be dangerous for two reasons: both the quality of the annotations and the effectiveness of the system would decrease. Right from the start of the process, every newly annotated document is utilised for further training. The annotations are used to induce new rules and these rules are tested on the whole corpus in order to identify any false positives. If the user just corrects the learner without providing any new examples, the IE system will never learn any uncovered cases.

4.4.2 The PASTA task

The aim of the PASTA (Protein Active Site Template Acquisition)4 project was to develop Natural Language Processing (NLP) techniques for the area of molecular biology. The main task was to extract information on 3-dimensional protein structures from on-line scientific articles. As a result of this project, several PASTA corpora5 were produced. To perform the PASTA experiments mentioned in this section, we downloaded two sets of abstracts made up of around 50 documents each. These documents are annotated with biological terminology information and were used in the PASTA project specifically for evaluation purposes.

4 http://www.dcs.shef.ac.uk/research/groups/nlp/pasta/
5 The PASTA corpora can be downloaded from http://www.dcs.shef.ac.uk/research/groups/nlp/pasta/results.html

The task we selected was to recognise the following concepts in a set of 113 molecular-biology-related documents:

• Species

• Residue

• Protein

• Site

• Region

• SecStruct

• SuperSecStruct

• Base

• Atom

• Non Protein

• Interaction

• QuaternStruct

We believe that this task can be considered representative of the Semantic Web, especially since the molecular biology domain has attracted a lot of attention in recent years. In our experiment, the annotation in the corpus was used to simulate human annotation, exactly as we did in the previous task. The documents in the two corpora were grouped together and then randomly divided into two: a training corpus and a test corpus. We then evaluate the contribution of the IE system at regular intervals during the tagging of the corpus. At every cycle, 5 new annotated documents from the training corpus (together with the previously selected documents) are used to train the IE system. Therefore, we train the system by gradually adding 5 documents to the training corpus (i.e. 5, 10, 15, 20, 25, ..., 55 documents). As soon as the training finishes, we test on the test corpus; the ability to suggest on the test corpus is measured in terms of precision and recall. The cycle continues until all the training documents have been learnt. Results are shown in Figures 4.10, 4.11, 4.12 and 4.13. On the X-axis the number of documents provided for training is shown; on the Y-axis precision, recall and F-measure are presented.
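The evaluation cycle just described can be summarised in a few lines of code. The sketch below is illustrative Python rather than the scripts actually used; `ie_engine.train` and `ie_engine.suggest` stand in for calls to the underlying IE engine (Amilcare in our experiments), and the scorer simply compares proposed annotations with the gold standard as sets.

    def f_measure(precision, recall):
        """Harmonic mean of precision and recall."""
        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    def score(proposed, gold):
        """Precision and recall of proposed annotations against the gold standard."""
        proposed, gold = set(proposed), set(gold)
        correct = len(proposed & gold)
        precision = correct / len(proposed) if proposed else 0.0
        recall = correct / len(gold) if gold else 0.0
        return precision, recall

    def incremental_evaluation(train_docs, test_docs, gold, ie_engine, step=5):
        """Train on 5, 10, 15, ... documents and test on the held-out corpus
        after every increment."""
        results = []
        for n in range(step, len(train_docs) + 1, step):
            ie_engine.train(train_docs[:n])           # assumed engine call
            proposed = ie_engine.suggest(test_docs)   # proposed annotations
            p, r = score(proposed, gold)
            results.append((n, p, r, f_measure(p, r)))
        return results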

The IE task in the molecular biology domain is much more complex than in ordinary domains. If we take a look at the information to extract, we find elements like CheY and Thr87; these are typical names given to the structures we are trying to extract from the documents. It is immediately apparent that these names are very irregular; therefore, the few cues we might use to identify them must come from the context surrounding them. Even though the task is quite complex, the system is capable of proposing a significant number of instances. After training on only 15 texts, the system is potentially able to propose more than half the instances for 3 out of the 12 concepts, with an average precision of over 80% for those three. This is possible because of Amilcare's ability to generalise over both the text context and the filler where possible. Since most of these concepts are names of biological structures, some of them follow the typical format of names, i.e. they start with a capital letter followed by lowercase letters. In this case we might be able to exploit a Named Entity Recogniser such as Annie (www.gate.ac.uk). Even if the algorithm does that, note that the recognition of proteins, residues, secondary structures, etc. is not a simple Named Entity Recognition task. The system must not only recognise the names, but also assign them to the correct type. There are many names in the documents, so this requires the ability to recognise the context in which the names appear. The situation is more complex for other fields such as site, where 80% F-measure is reached only after 50 texts. These annotations are much more difficult to learn than expressions whose fillers are either very regular (e.g. interaction) or can be listed in a gazetteer (e.g. base), because their regularity is much less direct.

Tag             Texts needed for training   Precision   Recall   F-measure
Species         25                          100         50       62
Residue         20                          95          55       66
Protein         25                          95          50       62
Site            40                          100         50       62
Region          35                          100         58       70
SecStruct       25                          100         50       62
SuperSecStruct  20                          100         60       71
Base            15                          100         61       73
Atom            15                          90          50       61
Non Protein     35                          98          53       65
Interaction     15                          83          55       64
QuaternStruct   25                          100         50       62

Table 4.2: Experimental results showing the amount of training texts needed for reaching at least 75% precision and 50% recall

In the case of the PASTA task, our experiments show that it is possible to move from bootstrapping to active annotation after annotating a very limited amount of text. In Table 4.2 we show the amount of training needed for moving to active annotation for each type of information, given a minimum user requirement of 75% precision.

[Figure: precision, recall and F-measure against the number of training documents for the Species, Residue and Protein concepts.]

Figure 4.10: Experimental results for the PASTA Task (1/4)

[Figure: precision, recall and F-measure against the number of training documents for the Site, Region and SecStruct concepts.]

Figure 4.11: Experimental results for the PASTA Task (2/4)

[Figure: precision, recall and F-measure against the number of training documents for the SuperSecStruct, Base and Atom concepts.]

Figure 4.12: Experimental results for the PASTA Task (3/4)

[Figure: precision, recall and F-measure against the number of training documents for the Non Protein, Interaction and QuaternStruct concepts.]

Figure 4.13: Experimental results for the PASTA Task (4/4)

This shows that the IES contribution heavily reduces the burden of manual annotation and that such a reduction is particularly relevant and immediate in the case of quite regular information (e.g. base). In user terms this means that it is possible to focus the activity on annotating more complex pieces of information (e.g. site), avoiding being bothered with easy and repetitive ones. With some more training cases the IES is also able to contribute to annotating the complex cases.

4.4.3 User Oriented Evaluation

Over the past few years, Melita was distributed to several research groups and to some commercial companies, and it was also used by students in several summer schools. The main reason for doing this was that we wanted the users to evaluate our system not only on test domains, but also on real-life applications. As a result of this distribution we obtained a lot of feedback, which we then used to adjust and enhance our system.

To make life simple for the users, the first thing we did was to create a user manual. Secondly, whenever we were asked to send a copy of the system, we gave a demonstration of how the system works where possible; if not, we sent an animated video clip instructing the users how to use Melita. There were never any problems with regard to the usability of the system, since it was designed to be simple to use.

The general comment we received from the users was that the system indeed manages to offer them a better annotation experience. The learning curve is very shallow and a typical user manages to master the system very quickly. The interface is quite straightforward: no complex numbers bother the user with IE statistics (these measures are masked behind graphical components which everyone can understand), and there are very few buttons, all with a well-defined purpose. Thus, the user immediately feels comfortable with the system. When the annotations become complex, and the different annotations are cluttered on top of each other, the system is capable of hiding the annotations which are not being used so that the user's attention is focused only on the task at hand. This feature was much appreciated by the users since it makes complex tasks manageable. Melita's compliance with well-established standards also helped new users to configure the system for new domains without much difficulty. In fact, since Melita already accepts ontologies in DAML format, the only configuration necessary to set up a domain was to load the ontology.

With regard to functionality, Melita also offered a number of simple yet powerful features. The annotation process only involves selecting a concept and highlighting the instances of that concept in the document. Whenever an instance already highlighted by the user is encountered somewhere else in the document, it is automatically highlighted for the user (a minimal sketch of this propagation is given after the list of benefits below). The users reported three main benefits of this feature:

Missing/Wrong annotations. It was noticed that throughout an annotation session the performance of the user starts declining, and the time taken to annotate a document becomes inversely proportional to the number of annotation errors. In simple terms, the more time the user spends annotating a group of documents, the more error-prone he becomes. Using this feature, the system takes care of annotating instances already encountered, and the user's attention is focused only on discovering new ones.

Avoid duplication. Since duplicate instances which were already encountered by the system are highlighted automatically, the user is relieved of the tedious task of re-annotating the same instances.

Consistency. This feature avoids the problem of inconsistent annotations, whereby the same elements are annotated differently in different documents (e.g. should we annotate Mr Thomas Smith or just Thomas Smith?).
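The automatic propagation of annotations within a document amounts to a simple exact-match search. The sketch below is illustrative Python, not the Melita code; it finds every further occurrence of a string the user has just annotated and records it under the same concept.

    def propagate_annotation(text, instance, concept):
        """Return (start, end, concept) spans for every occurrence of an
        already-annotated instance, so that repeated mentions are highlighted
        automatically and consistently."""
        spans = []
        start = text.find(instance)
        while start != -1:
            spans.append((start, start + len(instance), concept))
            start = text.find(instance, start + 1)
        return spans

    # Annotating "Thomas Smith" once highlights every other mention as well.
    doc = "Thomas Smith will speak at 1pm. Thomas Smith is from MIT."
    print(propagate_annotation(doc, "Thomas Smith", "speaker"))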

The highlights introduced automatically by the learning algorithm were warmly welcomed by most users. Unfortunately, this feature raised a hot intrusiveness issue: even though the system allows the user to tune the intrusiveness, some users felt that the system was suggesting annotations too soon and thus disrupted their work, since their focus was shifted towards verifying the system's suggestions. Because of this, the system was enhanced to meet the users' requirements and a new button was added which gives the users the power to hide or show the suggested annotations whenever they want. Thus, the system does not present new suggestions to the user immediately, but only when the user requests them. The smart ordering in Melita was also very useful during the annotation phase, since the users were not bothered with highlighting similar documents. This meant that each document presented to the users for annotation was different from the previous one. As a result, the annotation task was more intellectually challenging for the user, which reduced the boredom associated with it.

Finally, since Melita was used in many group activities (such as summer schools), its collaborative annotation features became very useful. In such cases, the Melita server was installed centrally and users collaborated in annotating the same domain. Whenever a user annotated a document, the annotations found in that document were learnt by the IE engine and utilised to annotate the documents of the other users as well. This obviously led to a much faster completion of the annotation task at hand.

Our focus has been on the users ever since we started designing Melita. Overall, the feedback we received from them was very positive and could be used in the future for a more formal and quantitative analysis. Through this feedback, we identified two kinds of analysis which can be performed:

Features comparison refers to the different features offered by the tool, such as smart ordering, simple suggestions, IE suggestions, etc. In such an experiment, a group of users (who must all be confident with the annotation process but preferably not familiar with Melita) are seated in a lab. Each one is asked to annotate several test sets within a specified time limit. The configuration of the tool on each machine is different, i.e. some of the features are switched off. Melita writes every event to a log file together with a time stamp; thus, at the end of the session, this log file can be fed to an automated process which analyses the different users and configurations, and produces an analysis of how much the different features (or a mixture of them) contributed to the annotation process in terms of speed, error avoidance, etc. (a sketch of such an analysis, over a hypothetical log format, is given after these two comparisons).

Tool comparison refers to the different annotation tools freely available, such as MnM, S-CREAM, etc. This comparison would be a little trickier than the previous one, since these tools do not provide a log file which can be analysed by an automated process; thus, all the evaluation must be conducted manually. In this case, we could organise a lab with several users, each using a different annotation tool. They would be given a common task, basically to annotate a collection of documents, together with a time limit. At the end of the session, we can analyse how many documents were annotated using each tool, and we could also compare the annotated documents in order to identify which tool produced the higher-quality annotations (i.e. those which are less error-prone and more consistent).

In both experiments, more than one user would be given the same configuration and an average of their performance would be taken. This would avoid the possibility of obtaining misleading evaluation results.
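As an illustration of the first kind of analysis, the sketch below is written in Python over a purely hypothetical log format (one tab-separated record per line: timestamp, user, configuration, event); the actual format of Melita's log file may differ. It computes the number of annotations per minute for each configuration, averaged over the users who shared that configuration.

    import csv
    from collections import defaultdict
    from datetime import datetime

    def annotation_rates(log_path):
        """Average annotations per minute for each tool configuration.
        Assumed log format: timestamp<TAB>user<TAB>configuration<TAB>event."""
        events = defaultdict(list)                  # (config, user) -> timestamps
        with open(log_path, newline="") as log:
            for ts, user, config, event in csv.reader(log, delimiter="\t"):
                if event == "ANNOTATION_ADDED":     # hypothetical event name
                    events[(config, user)].append(datetime.fromisoformat(ts))
        per_config = defaultdict(list)
        for (config, _user), stamps in events.items():
            minutes = (max(stamps) - min(stamps)).total_seconds() / 60 or 1
            per_config[config].append(len(stamps) / minutes)
        return {config: sum(rates) / len(rates) for config, rates in per_config.items()}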

Although we do not deny that such an evaluation would be very useful, we had many one-to-one contacts with different users, and the feedback we received encouraged us to move our research forward by developing new tools which further reduce the burden on the user when it comes to annotating new documents. This was our main motivation when we decided to develop Armadillo, the system described in the next chapter.

4.5 Future Development: Adapting to different user groups

The methodology as a whole is quite generic and with it we have targeted the most common users, but the reality is that there are several categories of users, all with different backgrounds and needs. Because of this, we divided the users into three distinct categories; a potential future tool could be designed to cater for their individual needs. We also acknowledge that users may not fall exactly into these categories but somewhere in between. Therefore, the system should make sure that a user from one category can use tools designed for the other categories. The main categories are Naive Users, Application Experts and IE Experts.

[Figure: screenshot of the interface for naive users, showing an annotated CMU seminar announcement with highlighted concept instances.]

Figure 4.14: Naive Users Interface

4.5.1 Naive Users

These users are familiar with their domain but have limited knowledge when it comes to computing. In order to set up the system they only need to specify an ontology and a corpus of documents. Apart from this, all that is required to use the system is the ability to highlight concepts in the document according to concepts in the ontology. The system will also offer advanced features disguised as simple widgets, like the component that tunes precision and recall (see Figure 4.4). By using this component the user will see tags being updated in real time according to the movement of the knob, and the calibration of the component stops when the results are satisfactory.

A potential contribution of this kind of user is domain knowledge. In order to exploit this, the system will highlight words found around the concept that are part of the rules induced by the IE engine. These words will be highlighted using a slightly lighter colour than the highlight of the main concept (see Figure 4.14), to show that they are used to help identify the concept. The user will be able to remove or add such highlights. By doing so, the user will unconsciously be guiding the algorithm to use certain cue words which are good at identifying the current concept. Therefore the algorithm will induce rules faster because it can use heuristics indirectly provided by the user. It will also converge faster because it is being guided by the domain knowledge of the user.

4.5.2 Application Experts

The people in this category are at a level between naive and expert users. They are capable of tuning the application but do not have the expertise of an IE expert. Our tool will allow them to access advanced features such as adjusting IE parameters, setting the precision and recall levels explicitly, using the IE system interface and accessing other internal settings.

Given that this kind of user has limited knowledge about the nature of the information, but that their role is to maximise the potential of the IE engine, it is imperative that they are allowed to tweak rules in the simplest way possible. The system provides an abstraction for the rules in the document being edited. It highlights the words which form part of a rule using a slightly lighter colour than the highlight of the concept they are associated with. When a word is selected by the user, a percentage is shown indicating the level of generalisation of that word used by the rule: 0% indicates that a very specific level is used (i.e. the word is explicitly part of the rule) and 100% indicates the most generic level. By sliding the percentage bar the user will be able to see the effect of those changes in real time through components which graphically show the precision, recall, F-measure and error levels of the new rule. The highlights are not fixed and the user can add or remove any of them. This tweaking of rules will allow the user to change rules without the need to understand linguistic properties. Current rule editors for IE, like the Amilcare Rule Editor, allow users to change rules but implicit linguistic knowledge is an absolute requirement, therefore restricting the use of such tools to IE experts.

Figure 4.15: A rule editor for regular-expression-based rules

4.5.3 IE Experts

Our final user is the IE expert, who knows how to control an IE engine and wishes to extract the maximum power from the AIE system. All the benefits offered to the other types of users will be available, but this user will require more sophisticated tools. Therefore, the system has a fully fledged rule-editing environment.

Using this environment, a user can view a list of rules together with statistics about every rule. In this view rules can be compared simultaneously; this is done in order to help the user restructure groups of rules. For example, it may be the case that a group of similar rules can be compressed into one rule. This view also enables easy browsing and selection of individual rules. Once a rule is identified as requiring change, or if the user would like to create a new rule, the rule is opened in the rule editor (see Figure 4.15). This program presents the user with several facilities to allow him to develop the best rule possible. He can change individual properties of a rule and also insert new ones. The changes to the rule can be tested immediately in real time, and all the examples in the text where the rule applies are presented to the user. Based on previously annotated documents, the system displays the positive and negative examples covered by the rules. The system also displays examples of where the rule fires in the test corpus (which is untagged). At this stage the user can separate the positive from the negative results and, using this new knowledge, the system induces a new rule which is presented to the user for verification. The cycle continues until the user is satisfied with the new rule.

What is actually happening in the rule-editing environment is that we are again using our methodology, but this time at a deeper level, in order to help the user create handmade rules quickly. The system guides the user towards creating effective rules both by suggesting new rules and by providing important statistical results.

4.6 Conclusion

IES can strongly support users in the annotation task, relieving them of a great deal of the annotation burden. Our experiments show that such help is particularly strong and immediate for repetitive or regular cases, allowing the expensive and time-consuming user activity to be focused on the more complex cases. Despite these positive results, we claim that simple quantitative support is not enough. An interaction methodology between the annotation interface, the user and the IES is necessary in order to reduce intrusiveness and maintain timeliness of support. The methodology implemented in the Melita tool addresses the following concerns:

1. It operates in the usual user environment without imposing particular requirements on the annotation interface used to train the IES (reduced intrusiveness).

2. It maximises the cooperation between user and IES: users insert annotations in texts as part of their normal work and at the same time they train the IES. The IES in turn simplifies the user's work by inserting annotations similar to those inserted by the user in other documents; this collaboration is made timely and effective by the fact that the IES is retrained after each document annotation.

3. The modality in which the IES suggests new annotations is fully tuneable and therefore easily adaptable to specific user needs and preferences (intrusiveness is kept under control).

4. It allows timely training of the IES without disrupting the user's pace with learning sessions that consume a large amount of CPU time (and therefore either stop or slow down the annotation process).

Unfortunately, there are still a number of open issues in our methodology, such as the effect on the user of excellent IES performance after a small amount of annotation. For example, when P=86, R=77 is reached after only 10 texts (as for Start Time in the CMU seminar announcements task), users could be tempted to rely on the IES suggestions only, avoiding any further action apart from correction. This would be bad not only for the quality of document annotation, but also for the IES effectiveness. As a matter of fact, each newly annotated document is used for further training. Rules are developed using the existing annotations and are tested on the whole corpus to check against false positives (i.e. the rest of the corpus is considered a set of negative examples). A corpus with a significant number of missing annotations provides a significant number of (false) negative examples that disorient the learner, degrading its effectiveness and therefore producing worse future annotation. The full extent of the problem is still to be analysed.

Even though these issues remain, Melita is still a major improvement over the existing systems. In the coming chapter we will show how we can build further on top of the methodology presented in Melita by making a small modification to the system. The result will be a generic and fully automated approach which can be used over the Web to insert semantic annotations in web documents.

Chapter 5

Armadillo: An Automated Annotation Methodology

5.1 Introduction

The Semantic Web (SW) needs semantically-based document annotation1 both to enable better document retrieval and to empower semantically-aware agents. Unfortunately, although the Melita methodology presented in Chapter 4 can be applied to a wide number of domains, it is not suitable for the generic Semantic Web task.

Even though Melita changes the annotation task from a completely manual one (such as (65)) to a semi-automatic one (26), it is still based on human-centered annotation. Whenever a user is involved in the annotation process, although the methodology relieves some of the annotation burden from the user, the process is still difficult, time consuming and expensive. Apart from this, considering that the SW is such a huge domain, convincing millions of users to annotate documents is almost impossible, since it would require an ongoing world-wide effort of gigantic proportions. Even if for a second we assume that this task can be achieved, we are still faced with a number of open issues.

1 Semantic annotation is the process of inserting tags in the document, whose purpose is to assign semantics to the text between the opening and closing tags.

In this methodology, annotation is meant mainly to be statically associated with (and saved within) the documents. Static annotation associated with a document can:

1. be incomplete or incorrect when the creator is not skilled enough;

2. become obsolete, i.e. not be aligned with page updates;

3. be irrelevant for some use(r)s: a page on a florist's web site can be annotated with shop-related annotations, but some users would rather prefer to find annotations related to flowers.

Different annotations can be imposed on a document using different ontologies. An ontology is required because it describes concepts and relationships that occur in a very restricted view of the real world; basically, it describes the domain in which we are working. In the future, most annotation is likely to be associated with documents by Web actors other than the page's owner, exactly as today's search engines produce indexes without modifying the code of the page. Producing methodologies for automatic annotation of pages with no or minimal user intervention therefore becomes important: the initial annotation associated with the document loses its importance because at any time it is possible to automatically (re)annotate the document and to store the annotation in a separate database or ontology. In the future Semantic Web, automatic annotation systems might become as important as indexing systems are nowadays for search engines.

Therefore, we propose a methodology that learns how to annotate semantically-consistent portions of the Web by extracting and integrating information from different sources. All the annotation is produced automatically with no user intervention, apart from some corrections the users might want to perform. The methodology has been fully designed and implemented by myself in Armadillo2, a system for unsupervised information extraction and integration from large collections of documents.

The natural application of such a methodology is the Web, but large companies' information systems are also an option. In this chapter we will focus on the Web, and in particular we will take a look at the task of mining Computer Science Department websites. The whole process is based on integrating information from different sources in order to provide some seed annotations. These then bootstrap learning, which in turn provides more annotations, and so on. In essence, we start with a simple methodology which requires limited annotation, and move on to produce further annotation to train more complex modules.

2 http://www.aktors.org/technologies/Armadillo/

In the next section we will present the generic architecture that we have used to build the application. Then we will describe the CS Department task. It will be shown how the information from different sources is integrated in order to learn to annotate the desired information.

5.2 Armadillo: A Generic Architecture for Automatic Annotation

The Armadillo methodology is centred around the idea of exploiting the redundancy of the web to bootstrap IE learning. This idea is not new, since it was proposed by Brin (15) and Mitchell (87). The difference with our approach is the way in which learning is bootstrapped: Brin uses user-defined examples, while Mitchell uses generic patterns that work independently of the site or the page being analysed.

[Figure: diagram of the Armadillo architecture, showing a set of strategies (Strategy 1 ... Strategy N) feeding data into the system.]

Figure 5.1: The Armadillo Methodology.

We use both of the above but, in addition, we exploit the redundancy of information and integrate information extracted from different sources with different levels of complexity.

Since there can be many potential applications of the proposed technology, we decided to make the architecture as generic as possible. In order to do so, the architecture had to be portable and scalable. This was achieved by making use of a simple methodology that requires no special configuration, built around a distributed web service architecture. The methodology is made up of four main items: a set of Strategies, a group of Oracles, a set of Ontologies and a Database (see Figure 5.1).

5.2.1 Set of Strategies

Strategies are modules capable of extracting information from a given document using very simple techniques. Each strategy takes as input a document, performs a simple extraction function over that document and returns a typed output.

The input document is not restricted to any particular text type and can range from free text to structured text. The only limitation is imposed by the processor involved in pre-analysing the particular document; in fact, there is nothing stopping the input document from being a picture or a piece of music. The module itself is responsible for implementing or accessing these preprocessors. Since the system is built around a distributed architecture, the preprocessor (if needed) can be either a process within the module or a process running somewhere over the network, such as a web service.

The extraction functions found in the strategies do not use any complex Artificial Intelligence (A.I.) techniques but rather weak strategies such as simple pattern-matching routines. The idea is that when weak strategies are combined together, they produce stronger strategies. The process does not have to be perfect; an error rate is tolerated at this stage since the validity of the data will be further confirmed by the Oracles. The output from every module is a set of instances belonging to a particular concept which is described in one of the ontologies used by the system. These instances are stored in the central database and are typed according to the name of the concept to which they belong.

To better illustrate the role of a strategy, imagine we need to extract the potential names of people found in a particular document. For this task we do not need any particular preprocessor; we just need a good heuristic to extract the names. A very simple yet highly effective heuristic is to extract all the bigrams3 containing words starting with a capital letter. This works pretty well and manages to return bigrams like "Alexiei Dingli", "Yorick Wilks", etc. One can argue that this approach would probably return some garbage as well, like the words "The System". This is true, but the problem is solvable in two ways. First of all, there is nothing stopping a module from having a postprocessing procedure inside it, used to filter away garbage produced by the simple approaches. In fact, the module used as an example in this paragraph is part of the current implementation of the Armadillo system and it does have a postprocessing procedure which filters out bigrams that contain stopwords4. Secondly, we should keep in mind that any instance returned by this module must be verified by an Oracle before being inserted in the main database.

In short, a strategy is a module that allows us to implement very simple yet extremely fast techniques in order to extract information. These techniques do not have to be highly sophisticated and do not necessarily need high precision. The most important thing is that they provide some seed examples which are used by the Oracles to discover other instances.

3 A bigram is a pair of adjacent words.
4 A stopword is a word that is very common inside a textual document (e.g. prepositions and articles).
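A minimal version of this bigram strategy can be written in a few lines. The sketch below is illustrative Python (the actual Armadillo modules are packaged as reusable components, as described later in this chapter), and the small stopword list is only an example.

    import re

    STOPWORDS = {"The", "A", "An", "This", "That", "These", "Those"}  # example list only

    def candidate_names(text):
        """Weak strategy: return every bigram of capitalised words, discarding
        bigrams that contain a stopword. The result is only a set of candidate
        person names; an Oracle must confirm them before they enter the database."""
        pattern = re.compile(r"\b([A-Z][a-z]+)\s+([A-Z][a-z]+)\b")
        return {f"{first} {second}" for first, second in pattern.findall(text)
                if first not in STOPWORDS and second not in STOPWORDS}

    print(candidate_names("The System was built by Alexiei Dingli and Yorick Wilks."))
    # -> {'Alexiei Dingli', 'Yorick Wilks'} ("The System" is filtered out)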

5.2.2 Group of Oracles

An Oracle is an entity, real or artificial, that possesses some knowledge about the current domain. To better understand what Oracles do, it might be useful to think of them as filters. These filters accept as input some data of a particular type found in the ontology and return data of the same type, but the data they return is validated; the validity of this data is guaranteed by the Oracle. This adds a certain degree of accountability to the system, since an Oracle is responsible for the data it validates. Therefore, if an item of data is wrong and it was validated by a particular Oracle, we can pinpoint exactly which Oracle validated the data and take appropriate corrective actions. These actions include adjusting the validation mechanism of the Oracle or even excluding it from future validations. The exclusion of one Oracle is normally not a big loss, since the system has different Oracles performing the same validation using different methods. Having redundant Oracles is very important if we want to exploit the redundancy of the web to the full. Since the same information is available from different reliable sources, it is vital for the system to have an Oracle for each source. The combination of these sources produces very reliable data, since it is not up to one Oracle to decide whether an instance is valid or not but rather to a committee of Oracles (similar to having a panel of experts evaluate the data).

An Oracle performs another task apart from the filtering: if the Oracle possesses some additional information on the subject being analysed which is not part of the central database, it adds it to the database. This information is reliable since it is supplied by the Oracle. To summarise, an Oracle filters away information that is marked as being an instance of a particular concept but which in reality is not (according to the Oracle), and adds any additional reliable data it possesses. There can be different types of Oracles, such as humans, gazetteers, web resources and learning algorithms. The list is not exhaustive but highlights the most commonly used Oracles.

Humans are the best kind of Oracles we can use. They contain a huge amount of information and they are excellent information-processing systems. The problem with humans is that the whole point of this system is precisely to spare them the tedious job of finding instances in documents. Our strategy is to make use of humans only if we cannot find any other Oracle to do the job, and to use them only in the starting phase in order to initialise the whole process with a small set of high-precision data. The remaining tasks should then be performed by the other kinds of Oracles. This does not mean that the role of the user stops there: at any stage the user can see what the different strategies and Oracles are producing, and review their data if necessary.

Gazetteers are lists of elements which belong to the same class. These lists can contain anything, such as lists of names, countries, currencies, etc. Each list is associated with a concept found in one or more ontologies. If the output obtained from one of the strategies is found in one of the lists, then that element is confirmed as being an instance of the particular concept represented by the list. The element is given both a time stamp and a signature signifying that the information was confirmed by this particular Oracle. Let us assume for a second that one of the strategies returns the words "United Kingdom". A search is performed through the lists to identify whether this phrase occurs in any of the available lists. In this example the phrase "United Kingdom" is found in the gazetteer called countries, which is itself attached to a concept called country in one of the ontologies used. Therefore, the system assumes that "United Kingdom" is equivalent to the instance found in the countries gazetteer, and it is semantically typed as being a country. Once we have semantically typed data which is verified by an Oracle, we can insert it in the database (a minimal sketch of such a gazetteer check is given after this list of Oracle types). Apart from gazetteers, an Oracle could also exploit a knowledge base containing a store of knowledge about a domain represented in machine-processable form. The data may include rules (in which case the knowledge base may be considered a rule base), facts, or other representations.

Web resources include any reliable list or database found over the web which can be used to verify whether an instance is part of a particular class or not. The information must be very reliable and up to date since, at this stage, a minor error rate is not allowed if we want the data to be reliable. The task of querying these web resources is not always straightforward. First of all, if the web resource is accessible through a web service, the system can easily access the web service by making use of standard techniques. If no web service is available, a strategy must be defined by the user in order to instruct the system which information to extract from the web page. Luckily for us, these pages are normally front ends to databases; since their content is generated on the fly by some program, the layout of the pages is very regular. In this case, the user is asked to mark a couple of examples in the documents and the system learns how to extract these instances by using an underlying IE engine. The learning algorithm used is based on simple wrapping techniques and manages to achieve very good results since the web pages are very regular. These resources are not very different from a gazetteer, the main difference being that while a gazetteer is generally a static list, a web resource has dynamic data which is continuously being updated by a third party.

Learning algorithms include technologies such as ML and IE tools, basically any available classifier which can identify whether an element is part of a particular class or not. These algorithms are much more sophisticated than what we have seen so far. They specialise in extracting information from individual documents which are not regular and for which, therefore, learning a common wrapper as we have seen before is impossible. These algorithms do not even need any training by humans. They typically get a page, partially annotate it with the instances which are available in the database, learn from those instances and extract information from the same page. The cycle continues until no more instances can be learnt from the page. The main difference between these classifiers and the ones mentioned before in the strategies is that they must have high precision since, at this stage, we are verifying the data and no longer harvesting information.
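As an illustration of the simplest kind of Oracle, the sketch below (again illustrative Python, not the actual implementation) validates candidate instances against a gazetteer list bound to an ontology concept, and stamps each confirmed instance with a signature and a time stamp as described above.

    from datetime import datetime, timezone

    class GazetteerOracle:
        """Validates candidates against a static list bound to an ontology concept."""

        def __init__(self, name, concept, entries):
            self.name = name            # used to sign (and later audit) validations
            self.concept = concept      # e.g. "country"
            self.entries = set(entries)

        def validate(self, candidate):
            """Return a signed, typed record if the candidate is known, else None."""
            if candidate in self.entries:
                return {"instance": candidate,
                        "concept": self.concept,
                        "oracle": self.name,
                        "timestamp": datetime.now(timezone.utc).isoformat()}
            return None

    countries = GazetteerOracle("countries-gazetteer", "country",
                                ["China", "France", "Italy", "United Kingdom"])
    print(countries.validate("United Kingdom"))   # confirmed and typed as a country
    print(countries.validate("The System"))       # rejected: None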

5.2.3 Ontologies

An ontology is a formal, explicit specification of a shared conceptualisation of a domain of interest (60). In simpler terms, the ontology specifies the concepts found in a domain5 together with the existing relations between those concepts. Ontologies are important for the system to work correctly because they are the glue that makes it possible for the data produced by the system to be related together. Apart from this, the different modules are semantically typed, meaning that whatever they return must always be an instance of one of the concepts in the ontologies. To understand the role of the ontology better, imagine we have a module which returns the title of a book called "Animal Farm" and an author called "George Orwell" from a web page found in an online book store. Although for us it is quite easy to figure out that the two pieces of data are related, i.e. an author writes a book and that book has a title, for the system there is no way in which this can be achieved unless it is explicitly stated somewhere. Ontologies are the places where these associations between concepts are expressed. Using the ontology and the data gathered from the other modules, the system would infer that every book has an author, that the title of the book is "Animal Farm", and that the author of the book "Animal Farm" is "George Orwell". These relations are all stored inside the database and can be used by other modules to discover further information.

5 A domain is a subset of the real world.

5.2.4 Database

The database is the central repository where all the gathered information is stored. A database is typically made up of a set of tables, which define the structure of the data and the relationships between the data. Inside the tables one finds the actual data instances discovered by combining the strategies and the Oracles mentioned before. The relationships are discovered when the data is integrated together. The information gathered from each source consists of a group of data which is semantically typed (therefore being an instance of one of the concepts found in one of the ontologies). The data found in this group, although obtained from a source without a formal structure, can easily be structured by using the relational information found in one of the ontologies. E.g. if we find a book author in the same context as a book title and we have an ontology containing rules related to books, we would probably find a rule which states that an author writes a book and a book has a title. The problem here is how to find a good context. The simplest way to solve this problem is to use proximity: if the name of a person appears somewhere in the text not far from the title of a book, then the person is considered a potential author. Obviously, this is not good enough for our needs since this heuristic does not guarantee that the person is actually the author of the book, but because we have several Oracles all validating the same data, we can check whether the person is the actual author of the book or not (e.g. in this case we could have an Oracle that checks authors and book titles using the Amazon web service). This approach is not only relevant for data found in the same context; it is also used for data found in different contexts or even different documents. E.g. imagine that in a particular page we find the name of a person in the proximity of a book title, and in a different page we find the book title in the proximity of an ISBN number. A search in one of the ontologies reveals that a book has an author and an ISBN number. Using this rule, the three items of data are integrated together; in this case, the integrating element was the book title since it was common to both contexts. Once again, the validity of the data and of their relationships is checked by the various Oracles. There is a tight coupling between the ontologies and the database because the information extracted from the documents is just a set of instances of the concepts found in the ontologies. Therefore it makes much more sense to store the ontological relationships inside the database as well.
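The proximity-based integration described above can be sketched as follows. This is illustrative Python, not the Armadillo code: the record structure and the stand-in Oracle are assumptions, and the ISBN is a dummy value. Partial records extracted from different pages are merged whenever they share a value, here the book title, and a merged record is kept only if at least one Oracle confirms it.

    from collections import defaultdict

    def integrate(partial_records, key, oracles):
        """Merge partial records (dicts) that agree on `key`, e.g. a book title,
        then keep only the merged records confirmed by at least one Oracle."""
        merged = defaultdict(dict)
        for record in partial_records:
            merged[record[key]].update(record)
        return [record for record in merged.values()
                if any(oracle(record) for oracle in oracles)]

    # One page yields author and title, another page yields title and ISBN.
    records = [{"title": "Animal Farm", "author": "George Orwell"},
               {"title": "Animal Farm", "isbn": "0000000000"}]   # dummy ISBN
    amazon_like_oracle = lambda rec: rec.get("author") == "George Orwell"  # stand-in
    print(integrate(records, "title", [amazon_like_oracle]))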

5.2.5 The Armadillo methodology

The Armadillo methodology annotates large repositories with domain-specific annotations. It performs this task by harvesting and extracting information from different sources, which is then integrated into a common repository. The methodology is largely unsupervised, meaning that the user's intervention is very limited.

The approach adopted in this methodology is as follows. The user provides:

1. an Ontology (or a set of Ontologies) which defines the domain being processed;

2. a set of possible instances (each of which is associated with a concept in the ontology);

3. a reference to a set of documents that need to be annotated (this can vary from a small set of documents to the whole Internet).

The set of possible instances (referred to previously as strategies) can be expressed either using a list of instances or using generic patterns capable of spotting a whole set of potential instances. There is no need for the instances discovered to be highly reliable, since they will be confirmed at a later stage using other strategies (e.g. multiple occurrences in other documents, etc.).

The cycle begins by going through a subset of documents available for processing and discovering possible instances using the partial set of instances already available.

When new instances are discovered, they are:

1. confirmed by the Oracles using a number of methodologies, such as multiple evidence from different sources (e.g. a new piece of information is confirmed if it is found in different linguistic or semantic contexts);

2. if they are valid instances, they are:

(a) integrated with the existing data found in the central database (e.g. some entities are merged, updated, etc.);

(b) used by the Strategies to discover new seeds;

(c) used by the Oracles to validate new data;

3. otherwise, they are discarded.

This cycle continues until there is no more information to discover. The result of this whole process is a database containing structured data which can subsequently be used for several tasks (such as semantically annotating documents, etc.).
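The whole cycle can be condensed into a short loop. The sketch below is an illustrative Python rendering of the methodology rather than the Armadillo code itself; each strategy is assumed to be a callable taking a document and the current database, and each Oracle a callable returning whether a candidate is valid.

    def armadillo_cycle(documents, strategies, oracles, seeds):
        """Bootstrap loop: weak strategies propose candidates, Oracles confirm
        them, and confirmed instances are stored and reused as seeds for the
        next pass, until nothing new is discovered."""
        database = set(seeds)
        while True:
            candidates = set()
            for doc in documents:
                for strategy in strategies:
                    candidates |= strategy(doc, database)   # may contain noise
            confirmed = {c for c in candidates
                         if c not in database
                         and any(oracle(c) for oracle in oracles)}
            if not confirmed:              # nothing new: the cycle terminates
                return database
            database |= confirmed          # new seeds for strategies and Oracles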

The Armadillo methodology can be applied to any domain. In fact, porting it to new applications does not require knowledge of IE. All the methods used are domain-independent and based on generic strategies; the only domain-specific parts are the initial settings. The next section will take a look at how the implementation of the methodology works.

5.2.6 The methodology at work

The internal workings of the Armadillo methodology (see Figure 5.2) are quite straightforward too. To make use of the system, one has to go through two phases: the first is the domain setup, aimed at Domain Architects, and the second is the actual system usage, aimed at the normal user.

Domain Setup

In the Domain Setup phase, a Domain Architect is required to plug together the different strategies, Oracles and ontologies. The database is hidden from the user since it is part of the ontology-handling mechanism: whenever a new instance is inserted in the ontology, the system automatically inserts it in the underlying database. The insertion process is transparent and does not impose any additional overhead on the system or on the user. The Domain Architect must perform two main tasks: the first is to define the various modules and the second is to attach them together.

[Figure: the two phases of the methodology: in Domain Setup the Domain Architect defines the strategies and the Oracles; in System Usage the user selects strategies, monitors their execution and verifies the results produced by the Oracles.]

Figure 5.2: The nuts and bolts of the methodology.

To facilitate this task, Armadillo provides a Domain Setup interface designed to ease the life of the Domain Architect (see Figure 5.3). In this interface, the user is not expected to do any programming (although nothing prevents the user from doing so). The first steps in creating a domain include:

Loading an Ontology A user can add one or more ontologies to the system. The ontologies are used to type all the data in the system and glue the different modules together. The interface loads ontologies written in DAML+OIL.

Loading a Reusable Component A reusable software component (such as a Java Bean) is a self-contained object that can be controlled dynamically and assembled to form applications. Component software environments, along with object-oriented techniques, allow applications to be built using a building-block approach. For such an approach to work, components must interoperate according to a set of rules and guidelines, and must behave in ways that are expected. The system provides a common interface capable of discovering the several methods found inside the component. It also allows the user to simply add new components. The inputs and outputs of these methods are then typed using the ontologies loaded in the system. If the user does not find adequate components to include in the system, they can always be created programmatically.

Creating or Loading a Gazetteer A user can load existing lists and gazetteers into the system. When a gazetteer is loaded, its contents can be modified using the interface. Once again, the elements found in these lists must be typed to a concept in the ontology; e.g. a gazetteer containing the elements "China, France, Italy, United Kingdom, etc." is typed with the concept in the ontology called "World Countries".

Creating a Heuristic A heuristic is similar to a gazetteer in the sense that it represents a class of objects. The main difference is that whilst a gazetteer lists the instances belonging to that class, the heuristic allows a user to define a rule, using regular expressions6, that covers all those instances. As an example, let us take a look at a heuristic that discovers names of people: using regular expressions we could easily write a pattern that retrieves all the bigrams whose words start with a capital letter.

Binding to a Web Service Web services are software components containing a set of functions that can be accessed remotely using TCP/IP as the transportation medium. They are quite similar to the Java Beans we saw earlier; the main difference is that they are accessed remotely and run on another machine somewhere over the internet, while a Java Bean is executed locally. The system provides a common interface capable of discovering the several methods found inside the web services. The inputs and outputs of these methods are then typed using the ontologies loaded in the system. If the user does not find adequate web services to include in the system, they can always be created using standard web service techniques in any available computer language.

Creating a Wrapper A wrapper is a template of the data and layout surrounding information which needs to be extracted from a set of documents. It is normally used by software agents which need to locate and extract data from a set of similar documents. Let us take as an example a corpus of documents made up of seminar announcements. In these documents it is quite common to find the starting time of a seminar in the following format: "Time: 12:30". The main difference between documents is that the time changes depending on the seminar. A typical wrapper would use the word "Time: " as the data surrounding the time which needs to be extracted, and anything that follows it as the actual information to be extracted (a minimal sketch of such a wrapper is given after this list of setup steps).

The system presents the user with a very simple interface to create wrappers, using positive examples provided by the user himself. The user is asked to mark a couple of examples in the documents and the system learns how to extract these instances. The learning algorithm used is based on simple wrapping techniques and manages to achieve very good results since the web pages are very regular.

Downloading Web Pages This download module was inserted in the system for convenience's sake. Since the basic raw material of the system is a collection of web pages, it makes sense to provide a module that facilitates the task of retrieving them.

6 A regular expression is a formula for matching strings that follow some pattern.
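A wrapper of the kind described for the seminar announcements can be reduced to a pair of delimiters around the target value. The sketch below is illustrative Python; in Armadillo the wrapper is induced from the examples marked by the user rather than written by hand.

    import re

    def make_wrapper(left, right=r"$"):
        """Build a wrapper from the text surrounding the target value;
        e.g. left='Time: ' extracts '12:30' from the line 'Time: 12:30'."""
        pattern = re.compile(re.escape(left) + r"(.*?)" + right, re.MULTILINE)
        return lambda page: [match.strip() for match in pattern.findall(page)]

    start_time = make_wrapper("Time: ")
    print(start_time("Speaker: A. Dingli\nTime: 12:30\nPlace: Regent Court"))
    # -> ['12:30']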

Once the creation of these modules is completed, they are stored inside the application for later use. In fact, these modules are not bound to a specific domain and can be reused over and over again, even in different applications. Since standard techniques are being used, they can also be easily exported and used in different systems.

Although we now have all the tools necessary to create the strategies and Oracles, there exists no connection whatsoever between these modules. This takes us to the second task of the Domain Architect.

Figure 5.3: The Armadillo Domain Setup.

As can be seen in Figure 5.3, the user must select the components which were created previously, create instances of them and compose them together. In the example shown in Figure 5.3, the diagram in the right panel represents a crawler created in this way. It is made up of four components:

URL List is a module that takes as input a URL or a list of URLs and returns as output the first URL in the list when a request is made.

Site Grabber is a module that consumes a URL and returns as output an object representing a Web Page (basically, it downloads the web page specified by the URL and returns it).

Extract URL's is a module that, given a web page as input, looks for a pattern inside the page that represents a URL. The URL is then extracted and returned as output. The pattern is expressed using regular expressions.

Sites is a module that takes as input a web page and adds it to a list. When a request is made, it returns as output the first web page in the list. It works in a similar way to the URL List but the type of the data is different.

There are a number of things to notice at this stage. First of all, the inputs and outputs are typed using the concepts found in the ontologies; the concepts used here are a URL and a Web Page. Secondly, the composition was performed by simply selecting a module and dragging a line to the connecting module. Since all the inputs and outputs are typed, the system can check for semantic type mismatches. Finally, there does not seem to be any entry point for the crawler. The reason is that all modules work using a data-driven approach. Therefore, to start operating the crawler, it is simply a matter of inserting a URL into the URL List module. This can be done either programmatically or manually. Immediately, the Site Grabber is called using the newly inserted URL. The module downloads the web page and sends a copy to the Extract URL's module and to the Sites module. The Extract URL's module applies the pattern to extract URLs from the newly downloaded page and sends the resulting URLs to the URL List. The Sites module is just a store which can be used for further processing. The cycle continues like this until either there are no more sites to download or the user decides to stop the crawler.
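The data-driven behaviour of this crawler can be approximated with the sketch below (Python purely for illustration; the actual modules are Java components or web services, and the page download here is simulated, so the names and contents are hypothetical):

```python
import re
from collections import deque

def site_grabber(url):
    """Consumes a URL and returns an object representing a web page.
    Here the download is simulated with a canned HTML string."""
    return {"url": url, "html": "<a href='http://example.org/next'>next</a>"}

def extract_urls(page):
    """Applies a regular-expression pattern to a page and returns the URLs found."""
    return re.findall(r"href='([^']+)'", page["html"])

def crawl(start_url, max_pages=10):
    """Data-driven cycle: URL List -> Site Grabber -> Extract URL's / Sites."""
    url_list = deque([start_url])   # URL List module
    sites, seen = [], set()         # Sites module plus a guard against revisits
    while url_list and len(sites) < max_pages:
        url = url_list.popleft()    # URL List hands out the first URL
        if url in seen:
            continue
        seen.add(url)
        page = site_grabber(url)    # Site Grabber downloads the page
        sites.append(page)          # a copy goes to the Sites store
        url_list.extend(extract_urls(page))  # Extract URL's feeds new URLs back
    return sites

print(len(crawl("http://example.org/")))   # 2 pages in this toy run
```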

Although this was a very simple example, there is nothing stopping the Domain

Architect from creating much more complex modules. Whenever a module is created, it can be stored as a bean or executed as a web service and used by other modules.

As can be seen, the Domain Architect can model the whole domain using these basic yet powerful tools.

System Usage

Since most of the work is performed by the Domain Architect, the role of the user is quite simple. It only involves setting up the initial system and monitoring its execution. The system generally starts from a set of generic strategies defining where to search for information and what to look for. If some initial data is required, it is supplied by the user. Since the whole system is data-driven, using the system for a different application in the same domain is just a matter of altering the initial settings. In the example of the crawler found in Section 5.2.6, to crawl web site X rather than web site Y, it just involves supplying web site X as the start URL.

When data is harvested using the strategies defined in the domain, it is passed to an Oracle to verify whether the information is valid or not. The task of the Oracle is to identify which items returned by the strategies are instances of concepts found in one of the preloaded ontologies. Strategies and Oracles are created in exactly the same way using the Domain Setup tool. The major difference is the validity rating which they assign to the data they discover. An item of data verified by an Oracle has a higher rating than one verified by a strategy. Once the data is verified, it is stored inside the main database. The system then checks which strategies can be used to process the new information available by matching the type of the new data with the type of the inputs which the modules require. When these modules are identified, they are executed using the newly added information. This is done to try to discover further information. The results are then given to other, more specialised Oracles and the system keeps looping until either all the discoverable information has been found or the user decides to interrupt the cycle.

At any stage, the user can interrupt the execution and inspect the data. He can make any changes required to the data and then resume the system from where it was interrupted. For example, if a person's name is missed by the system, it can be manually added by the user. The modules that receive as input the output of that name finder will then be re-run and further information will hopefully be retrieved. The modified data has a very high rating since an Oracle of type human is considered the most reliable Oracle in the system. If one or more modules are not producing good results, the user can reduce the rating of that module (even if it is an Oracle). This has the effect of reducing the impact which that module has on the whole system.

The defined architecture works as a "Glass Box". All the steps performed by the system are shown to the user together with their input and output. The user can check the intermediate results and manually modify their output, or change their strategy (if possible, such as in the case of modules that integrate information). In this way the user is able both to check the results of each step and to improve the results of the system by manually providing some contributions (additions, corrections, deletions).
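The role of the validity ratings can be pictured with a very small sketch (Python; the numeric values and the function are purely illustrative assumptions, not the ratings actually used by Armadillo):

```python
# Illustrative reliability ranking: a human Oracle outranks automatic Oracles,
# which in turn outrank plain strategies. Lowering a module's weight scales down
# the confidence of everything that module produces.
BASE_RATING = {"human": 1.0, "oracle": 0.8, "strategy": 0.5}

def rate(value, source_kind, module_weight=1.0):
    """Attach a confidence score to a harvested item of data."""
    return {"value": value, "confidence": BASE_RATING[source_kind] * module_weight}

print(rate("Yorick Wilks", "oracle"))         # verified by an automatic Oracle
print(rate("Yorick Wilks", "strategy", 0.5))  # from a down-weighted strategy
```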

Rationale

It is clear that we do not make any assumptions about any particular domain. The strategies range from simple/generic ones (like looking for capitalised words) to more complex ones (like setting up and using IE tools) and they can be easily created for any domain. This technique works because a key feature of the web is the redundancy of information. Redundancy is given by the presence of multiple citations of the same information in different contexts and in different superficial formats. This factor is currently used for improving question answering systems (39). When known information is present in different sources, it is possible to use its multiple occurrences to bootstrap recognizers that, when generalized, will retrieve other pieces of information, producing in turn more (generic) recognizers (15). Information can be present in different formats on the Web: in documents, in repositories (e.g. databases or digital libraries), via agents able to integrate different information sources, etc. From them or their output, it is possible to extract information with different reliability. Systems such as databases generally contain structured data and can be queried using an API.

In case the API is not available (e.g. the database has a web front end and the output is textual), wrappers can be induced to extract such information. Wrapper Induction methodologies are able to model rigidly structured Web pages such as those produced by databases (79) (89). When the information is contained in textual documents, extracting information requires more sophisticated methodologies. Wrapper induction systems have been extended to cope with less rigidly structured pages (47), free texts and even a mixture of them (23). There is an obvious increasing degree of complexity in the extraction tasks mentioned above. The more difficult the task, the less reliable the extracted information generally is. For example, wrapper induction systems generally reach 100% accuracy on rigidly structured documents, while IE systems reach some 70% on free texts. Also, the more the complexity increases, the more the amount of data needed for training grows: wrappers can be trained with a handful of examples whereas full IE systems may require millions of words (85). In short, the more complex the task becomes, the more information is needed for training and the harder it becomes to identify reliable input data.

To make the system scalable we have implemented an architecture based on Web Services where each task is divided into subtasks. Each subtask is performed by a server which in turn will use other servers for implementing parts of the subtask. Each server exposes a declaration of input and output, plus a set of working parameters. Servers are reusable in different contexts and applications. For example, one server in the CS department task mentioned in Section 5.3 will return all papers written by a person by accessing Citeseer. Another one will do the same on another digital library. The named entity recogniser server (whose role is to decide if a string is a name) will invoke these servers, integrate the evidence returned and decide if such evidence is enough to conclude that the candidate string represents a person.
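A minimal sketch of this evidence-integration step (Python for illustration; the two library look-ups below are stand-ins for the Citeseer and second digital-library servers and perform no real network calls, and the evidence threshold is an invented example):

```python
def citeseer_papers(name):
    """Stand-in for the server that queries Citeseer for papers by `name`."""
    return ["paper A", "paper B"] if name == "Fabio Ciravegna" else []

def other_library_papers(name):
    """Stand-in for a second digital-library server."""
    return ["paper C"] if name == "Fabio Ciravegna" else []

def is_person(candidate, min_evidence=2):
    """Decide whether a candidate string denotes a person by pooling the
    evidence returned by the individual servers."""
    evidence = len(citeseer_papers(candidate)) + len(other_library_papers(candidate))
    return evidence >= min_evidence

print(is_person("Fabio Ciravegna"))  # True
print(is_person("Sea Lion"))         # False
```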

Facilities for defining wrappers are provided in our architecture by Amilcare (nlp.shef.ac.uk/amilcare/), an adaptive IE system based on a wrapper induction methodology able to cope with a whole range of documents, from rigidly structured documents to free texts (25). Amilcare can be trained to work on rigid documents (e.g. Citeseer or Google output) by providing a handful of manually annotated examples, while it needs some hundreds of examples for more sophisticated cases (24).

All the servers are defined in a resource pool and can be inserted in a user-defined

architecture to perform some specific tasks. New servers can be easily defined and

added to the pool by wrapping them in a standard format. In the CS website task, wrappers are defined for all the resources described in Section 5.3.

5.2.7 Future Development

Although the current implementation of the Armadillo system is based around Web Services, it still executes on a stand-alone computer. This was important for the scope of this thesis since, while we were testing the core methodology, it facilitated the debugging of the system itself. Now that we are confident that the methodology works and produces effective results, we will look at the direction which future enhancements of the system should take.

Due to the distributed nature of the web, and with the intuition that a distributed application can better exploit the resources on the web, we decided to model our system on the same concept. In fact, Armadillo was originally designed using a modular architecture. The rationale behind our architectural choice was that eventually we wanted to take each module, separate it from the core architecture and transform it into a self-contained entity over the web with as little effort as possible. In doing so, we were not thinking of creating any new standard but rather of making use of the existing ones (such as web services, etc.). Also, we did not want to restrict anyone with proprietary formats, but wanted to make the system easily adaptable by everyone. It is our conviction that if we want the SW to become a reality, a global voluntary effort is necessary whereby everyone contributes a little towards making the SW a success.

The way forward for Armadillo is basically to repackage the modules either as web services or as mobile intelligent agents. The web service approach has the advantage of exploiting resources which are located remotely. As such, a powerful central server is not required, since the only task of the core Armadillo system would be to coordinate the distributed modules. The problem with web services is that if the server (or the connection to the server) which hosts an important web service is not reachable, the processing of the Armadillo system can come to a halt until the web service is available again. With mobile intelligent agents, this problem can be partially solved. If we imagine a scenario whereby a server hosting an agent is going off-line, the mobile agent can easily transfer itself to another server which will stay on-line. Thus, Armadillo will still be able to access the service offered by the agent even if its original hosting server is off-line. A more intelligent way of solving such a problem would be to have similar mobile agents, located on different servers, collaborating on the same task. Thus, if one of them goes off-line, it is just a matter of distributing the work assigned to the off-line agent to the other similar agents that are still online.

Another future enhancement of the system relates to the composition of services. The original system allowed the users to define a list of services which are executed in a sequential order. The final version of the system presented in this thesis went a step further and offered a graphical user interface. This gave the users the facility to plug in different components visually and create a flow diagram whereby several components can be executed in parallel. Unfortunately, this system is still too complex to configure for non-expert users. This complexity arises from two essential tasks: first of all, the user must find and define the services to use; secondly, the user must compose the services manually.

The discovery of services can be automated to a certain extent. If we make use of systems like the Universal Description, Discovery and Integration7 (UDDI) directory, we could make it possible for users to discover available web services without too much hassle.

7 A directory model for web services. UDDI is a specification for maintaining standardised directories of information about web services, recording their capabilities, location and requirements in a universally recognised format.

The composition of services is still a complex task which is very hard to automate. There is currently considerable research on ways in which this can be done by using planners. These programs have special-purpose algorithms which take as input a problem and goals in a planning domain, and return a plan to meet those goals, or a determination that no plan is possible. If we add a learning algorithm to the planner and use a methodology similar to the one developed in Chapter 4 (whereby a learner is trained by the user), we might be able to automate simple repetitive tasks and eventually move towards automating more complex tasks. In the original system, we also made use of a recognise-act cycle (sketched after the list below) whereby, whenever data is harvested:

1. The data is identified as being an instance of a concept found in one of the ontologies.

2. A search is made through all the services available in order to recognise which services accept the newly discovered data as input.

3. The system acts upon this newly found data by invoking the service using this data.


4. The cycle continues until there is no more data which can be consumed by the services.
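A minimal sketch of this recognise-act cycle, under stated assumptions (the service descriptions, concept names and toy outputs below are invented for illustration):

```python
# Each service declares the concept it consumes and the concept it produces.
SERVICES = [
    {"name": "paper_finder", "input": "Person", "output": "Paper",
     "run": lambda value: [f"paper by {value}"]},
    {"name": "author_finder", "input": "Paper", "output": "Person",
     "run": lambda value: []},   # returns nothing in this toy example
]

def recognise_act(seed_items):
    """Keep firing every service whose input type matches newly found data
    until no service can consume anything new."""
    agenda = list(seed_items)          # (concept, value) pairs
    seen = set(agenda)
    while agenda:
        concept, value = agenda.pop(0)
        for service in SERVICES:
            if service["input"] == concept:            # recognise
                for result in service["run"](value):   # act
                    item = (service["output"], result)
                    if item not in seen:
                        seen.add(item)
                        agenda.append(item)
    return seen

print(recognise_act([("Person", "Alexiei Dingli")]))
```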

Although this system worked quite successfully, it is our belief that in future systems it should be researched and exploited further. Such a producer-consumer approach can greatly reduce the need for manual composition, since the composition of services can be induced from the type of the available data.

5.3 The Computer Science Department Task: A case study

To highlight the internal workings of the methodology mentioned above, we will take a look at the Computer Science Department Task. The goal is to discover the people who work in a specific department and their publication lists. For the first task, we are not just interested in the name of the person but also in other personal information such as position, home page, email address, telephone number, etc., if available. For the second task, the list of published papers must be larger than the one provided by online services such as Citeseer. On the web, we cannot really assume anything about the structure and layout of the web pages. They will probably be different for each and every department, but we can expect to find some higher-level constructs, such as a page listing all the people working in the department, among other constructs. These are all cues which the Domain Architect inserted inside the system when it was being designed.

Finding People Names

The task of finding the names of people who work in a specific department is the first step towards starting the whole process. Without these names, one cannot find the related details and publications. At first, it might seem like a trivial task, but let us not confuse it with the problem which generic Named Entity Recognition caters for. This task is much more complex because the site contains many irrelevant people's names, e.g. names of undergraduate students, clerks, secretaries, etc., as well as names of researchers from other departments who, for example, gave seminars in the department, participate in common projects or have co-authored papers with members of staff. One could still make use of a generic Named Entity Recogniser (NER) such as Annie (www.gate.ac.uk) but this option is not always viable. Apart from recognising ALL the people's names in the site (without discriminating between relevant and irrelevant), if we apply it to a normal web site the classic NER would be very slow. We performed some initial experiments using this approach on the CS department site at the University of Southampton. This site was chosen because it was neither too big nor too small and contains around 1,600 pages, most of which are generated automatically by a database (therefore containing a very regular structure). The experiments showed that an NER tends to be quite slow if launched on such sites and can be quite imprecise on Web pages, as NERs are generally designed for newspaper-like articles. Because of this, we use a short list of highly reliable seed names discovered either by using simple strategies or by using an NER in a very restricted way (i.e. rather than applying it on a whole document, we use it on a subset of the document). These names are then used as seeds to bootstrap learning in order to find further names.

Finding Seed Names

The search for seed names makes use of a number of weak strategies combined with Oracles. The process works as follows:

1. The system crawls the web site looking for strings that are potential names of people (e.g. using a gazetteer of first names and a regular expression such as a known first name followed by a capitalised word, using an NER, etc.).

2. As soon as the system has some potential names, it makes use of the following Oracles to try to validate those names:

• AKT 3store (inanna.ecs.soton.ac.uk/3store/):

- Input: the potential name;
- Output: a list of papers, co-authors, personal details and a URL for the home page (if any);

• Annie (www.gate.ac.uk):

- Input: the potential name and the text surrounding it;
- Output: True/False indicating whether the string is a name or not;

• Citeseer (www.citeseer.com):

- Input: the potential name;
- Output: a list of papers, co-authors and a URL for the home page (if any);

• Google (www.google.co.uk):

- Input: the potential name and the URL of the site in order to restrict the search;
- Output: relevant pages that are hopefully home pages;

• HomePageSearch (hpsearch.uni-trier.de):

- Input: the potential name;
- Output: a URL for the home page (if any);

• The CS bibliography at Unitrier (www.informatik.uni-trier.de/ley/db/):

- Input: the potential name;
- Output: a list of papers and co-authors (if any);

The information returned by the digital libraries (AKT 3store, Citeseer and

Unitrier) is used to confirm or deny the name identity of the string. If the

results they return are reasonable for a specific name, this name is retained as

being a potential name. At this stage, it is important to understand what we

mean by a reasonable result and why we included this restriction. A reasonable

result is one where the system returns neither too many hits nor too few. If a string is a valid name, we expect to have a number of papers returned, otherwise the output is either empty or with unlikely features.

For example, when querying these digital libraries with a popular term like "Smith", AKT 3store, Citeseer and Unitrier return more than 10,000 related papers each. This is definitely not a reasonable result, since the probability that a person writes more than 150 papers is quite low. In these cases, the name is discarded. The opposite happens when looking for a non-name (e.g. the words "Sea Lion"): no papers are returned and therefore the potential name is not accepted. The criterion we use is quite restrictive and generally accepts a name when the system returns more than 5 papers and fewer than 50. These upper and lower bounds were decided empirically and were chosen to keep reliability high. The lower bound may seem small but, as already noted by Brin (15), the redundancy of information allows learning to be bootstrapped using just a limited amount of information. The results of the digital libraries are integrated with those of the classic Named Entity Recogniser run on a window of words around the candidate (so as to avoid the problem of slow processing). The result of this process is a set of people which fall into three main categories:

(a) those working for the department;

(b) those who do not work there, but are cited because, for example, they have collaborated with some of the researchers of the department;

(c) false positives (those which are not even names of people).

Out of the three categories, only the people in the first one are correct. The remaining people must be filtered out. For this reason, other Oracles are used, such as AKT 3store, Citeseer, Google and HomepageSearch, to look for a personal web page in the site. If such a page is not found, the names are discarded since, in general, most people have a personal web page. The personal web pages are recognised using simple heuristics such as looking for the name of the person or the words "Home Page" in the title.

The different strategies mentioned above are all weak, and they all report high recall but low precision. This does not matter much because they were only meant to determine a small, reliable list of names, which can be used as seeds to bootstrap the learning phase. Since each weak process is not working in isolation but rather in collaboration with the others (by using results obtained from the other modules), their combination manages to produce highly accurate data.
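A sketch of this seed-acceptance logic, using the empirical bounds quoted above (Python for illustration; the home-page test and function names are simplified stand-ins for the actual Oracles):

```python
def reasonable_paper_count(n_papers, lower=5, upper=50):
    """Empirical criterion: too few hits suggests the string is not a researcher's
    name, too many suggests an over-common string such as "Smith"."""
    return lower < n_papers < upper

def looks_like_home_page(title, name):
    """Very simple home-page heuristic used to keep only departmental staff."""
    return name.lower() in title.lower() or "home page" in title.lower()

def accept_seed(name, n_papers, candidate_page_title):
    return reasonable_paper_count(n_papers) and looks_like_home_page(candidate_page_title, name)

print(accept_seed("Yorick Wilks", 30, "Yorick Wilks: On-line Curriculum Vitae"))  # True
print(accept_seed("Smith", 10000, "Smith"))                                       # False
```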

Learning Further Names

Starting from a list of seed names makes the discovery of further names easier. These seed names, although distinct, have one common feature: they all represent people working in the same place. The strategies we used previously therefore gave us a subset of the people working in a particular department. If we manage to find other pages where the people in this subset are mentioned together, we might be able to discover new names of people that belong to this set using more refined strategies. To do this, we make use of Google to try to find pages where these people are mentioned together. Once we have these pages, all the occurrences of seed names are annotated on the retrieved documents. Initially, the program looks for HTML structures such as lists and tables containing names which were annotated. Such structures normally have an implicit semantic relationship. Lists generally contain elements of the same type (e.g. names of people), while the semantics in tables is generally related to the position, either in rows or columns (e.g. all the elements of the first column are people, the second column represents addresses, etc.). If a reasonable quantity of known names (at least four or five in our case) is found, learning is performed on these subparts of the documents. A classifier capable of learning linguistic and/or formatting features (e.g. relevant names are always the first element in each row) is trained and used to identify where the names appear. If we succeed, we are able to reliably recognise the other names in the structure.8
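The selection of structures worth learning from could be sketched as follows (Python; the threshold of four known names comes from the text above, while the helper name and the flattened representation of lists and tables are illustrative assumptions):

```python
def promising_structures(structures, seed_names, min_seeds=4):
    """Given HTML lists/tables already split into their textual items, keep
    those that contain enough known (seed) names to train a classifier on."""
    selected = []
    for items in structures:
        hits = sum(any(seed in item for seed in seed_names) for item in items)
        if hits >= min_seeds:
            selected.append(items)
    return selected

staff_list = ["Y. Wilks", "F. Ciravegna", "M. Hepple", "M. Lapata", "New Person"]
news_list = ["Seminar on Friday", "New building opened"]
print(promising_structures([staff_list, news_list],
                           {"Wilks", "Ciravegna", "Hepple", "Lapata"}))
# only the staff list is kept; "New Person" becomes a candidate new name
```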

In the approach mentioned in this section, we are assuming that a department contains lists and tables of some kind (including a staff list). This is a reasonable assumption, based on our experience of browsing many departmental web sites. Our search is not limited to simple lists or tables: other, more complex structures can be useful as well, provided we can discriminate between the information found in these structures. One typical example commonly found in these web sites is a table assigning supervisors to students. In this case a strategy could be used which discriminates between supervisors and students by retaining only the names found in a particular column of the table.

When new examples are identified, the site is further annotated and more patterns can potentially be learnt. New names are cross-checked against the resources used to identify the seed list, such as the digital libraries. In our experiments this is enough to discover a large part of the staff of an average CS website with very limited noise, even using a very strict strategy of multiple cross-evidence. To accept a learnt name, we use a combination of the following strategies:

1. either the name was recognised as a seed by the previous strategy;

2. or the name is included in an HTML structure where other known occurrences are found, such as a list;

3. or there is evidence from generic patterns (as derived by recognising people on other sites) that this is a person (inspired by (87)).

8 Classifiers are induced in our implementation by Amilcare (http://nlp.shef.ac.uk/amilcare/), an adaptive IE system developed at Sheffield (25).

Extracting Personal Data

Personal data such as email address, telephone number, position, etc., is slightly trickier to find. This information is normally found in one place, generally inside the personal home page (or a page with a similar role) of that particular person. Once again we combine information from Citeseer, HomepageSearch and Google to check if the person has a known page in the current web site. Otherwise we look for occurrences in the site in which the name is completely included in a hyperlink pointing internally to the site. We must be careful where the links lead to. In many departments, these links lead to intermediate pages automatically generated by a database. Heuristics can be used to help us avoid such pages. When we manage to reach a personal page, we make use of a named entity recogniser (e.g. Annie) to identify the personal data. This analysis is not just performed on the main page but also on subpages pointed to by the main home page. It is quite common for a home page to have contact details mentioned in a subpage inside the web site. To avoid falling into the trap of following links all over the web, we consider only pages with an address under the same domain.
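The restriction to the same domain could be expressed as in the small sketch below (Python; a simplified assumption of how the check might look, not the actual implementation):

```python
from urllib.parse import urlparse

def same_domain(candidate_url, site_url):
    """Only follow links whose host matches the departmental site, to avoid
    wandering off across the whole web."""
    return urlparse(candidate_url).netloc == urlparse(site_url).netloc

print(same_domain("http://www.dcs.shef.ac.uk/~yorick/contact.html",
                  "http://www.dcs.shef.ac.uk/"))   # True
print(same_domain("http://www.example.org/page.html",
                  "http://www.dcs.shef.ac.uk/"))   # False
```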

Discovering Papers

Discovering the titles of papers and their authors is a very difficult task. First of all, there is no standard format in which publications are listed. If the publication pages were all formatted in the same way, it would be easy to extract the information, but this happens very rarely. In most cases each person writes the list using a personal format. Even in these cases, it is very rare that the person uses the same style in the entire page, and very often the style is quite irregular since the list is compiled manually at different times. This is a typical case in which the classic methodology of manually annotating some examples for each page for each member of staff is infeasible, due to the large number of different pages. Also, irregularities in style produce noisy data, and classic wrappers are not able to cope with noise. A generic methodology is needed that does not require any manual annotation.

Secondly, although there is no written rule, a publication list normally contains a list of entries giving the title of the paper, the conference where it was published and the authors of the paper. A title is normally a random sequence of words (e.g. the title of (39)) and cannot be characterised in any way (i.e. we cannot write generic patterns for identifying candidate strings as we did for people). A title can also be very similar to the name of a conference, and it might be hard to distinguish which is the title of the paper and which is the name of the conference where it was published.

Finding the names of the authors is not difficult, as we have seen in the previous section. The problems arise when we try to associate authors with a particular paper. There is nothing apart from visual cues to create the association. Unfortunately, we cannot characterise these cues since there are no rules which state whether the name of an author must appear before or after the title of a paper. Authors are names in particular positions and in particular contexts. They must not be confused with editors of collections in which the paper can be published, nor must they be confused with other names mentioned in the surrounding text. Finally, since most publications have more than one author, it is very likely that each paper is cited more than once, even in the same web site, by different co-authors. Nearly every department and member of staff in a CS department provides a personal, updated list of publications.

The strategy we propose uses the names of staff members as keywords to query the digital libraries (AKT 3store, Citeseer and UniTrier). These digital stores return a list of papers for every staff member. Since all the available libraries are based upon manual or semiautomatic techniques to harvest the information, all the lists retrieved are incomplete. This is not a problem because we only require a small number of examples to bootstrap the paper discovery process. The next step is to use the titles in the list to query a search engine and get a number of pages containing multiple paper citations (we accept only pages which contain at least four paper titles). Titles are used because they tend to be unique identifiers. Pages with few hits are discarded since it is unlikely that they contain other examples for the same author. Pages with many hits are discarded too, so as to avoid titles which contain very common strings such as "System". Once we have a set of pages containing good examples, we take each page and

annotate the titles of the papers and the names of the authors. Examples stored in HTML lists and tables for which we have multiple evidence are favoured over other examples, since it is much easier to divide the page into smaller chunks and extract those chunks. Each chunk will contain a complete set of data elements, consisting of one paper title together with the related conference name (where it was published) and its authors. Note, however, that the citation is often not very structured internally. The following two examples refer to chunks representing the same publication, but they come from two different web pages.9,10

Yorick Wilks: On-line Curriculum Vitae


Dingli, A., Ciravegna, F. and Wilks, Y. (2003) Automatic Semantic Annotation using Unsupervised Information Extraction and Integration. In Skuppin (ed.) Proc. Second Internat. Conference on Knowledge Capture (KCAP03).

Publication List of Fabio Ciravegna

Alexiei Dingli, Fabio Ciravegna, and Yorick Wilks
Automatic Semantic Annotation using Unsupervised Information Extraction and Integration
in Proceedings of the K-CAP 2003 Workshop on Knowledge Markup and Semantic Annotation, Sanibel, Florida, October 26, 2003

9 http://www.dcs.shef.ac.uk/~yorick/cv.html
10 http://www.dcs.shef.ac.uk/~fabio/cira-papers.html

The example clearly shows that simple wrappers relying on the HTML structure only would be ineffective. The number of tags varies between documents according to the different styles and layouts used. HTML is a language that provides only information about the layout, and there is no way to discriminate between authors and editors, or between the title of a paper and the title of a collection, since there is no semantic information stored inside the document. More sophisticated wrapper induction systems are needed, like that provided by Amilcare (25), which exploits both XML structures and (para-)linguistic information (24).

By adopting a strategy like the one mentioned before, made up of a cycle of annotation/learning/annotation, we are able to discover a large number of new papers. The cycle terminates when there is no more information to discover or the user decides to interrupt it. Every time co-authorship amongst people is discovered, we exploit the redundancy again by using the other authors to discover more collaborations between the co-authors.
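Under stated assumptions (hypothetical helper names, canned data instead of real queries), the page-filtering step of this bootstrapping cycle might look like this:

```python
def pages_with_enough_titles(candidate_pages, seed_titles, min_titles=4):
    """Keep only pages that mention at least `min_titles` of the seed paper
    titles; such pages are likely to be full publication lists worth learning from."""
    good = []
    for url, text in candidate_pages:
        hits = sum(title in text for title in seed_titles)
        if hits >= min_titles:
            good.append(url)
    return good

seeds = ["Automatic Semantic Annotation using Unsupervised Information Extraction and Integration",
         "Title B", "Title C", "Title D"]        # partial list from the digital libraries
pages = [("http://www.dcs.shef.ac.uk/~fabio/cira-papers.html",
          " ".join(seeds) + " plus many other unknown papers"),
         ("http://example.org/blog", "Title B mentioned once")]
print(pages_with_enough_titles(pages, seeds))    # only the publication-list page survives
```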

5.4 Armadillo Evaluation

The architecture mentioned is fully implemented and we performed extensive experiments on a number of Computer Science (CS) web sites. We have experimented on the sites of (1) the Computer Science Department of the University of Sheffield11, UK, (2) the School of Informatics at the University of Edinburgh12, UK, and (3) the Department of Computer Science of the University of Aberdeen13, UK. In the following sections we report extensively on the results obtained on these sites.

11 http://www.dcs.shef.ac.uk
12 http://www.inf.ed.ac.uk
13 http://www.csd.abdn.ac.uk

5.4.1 Finding People's Names

The first task was to identify the names of people working in a particular site. It was not enough to return just the names of all the people mentioned in the site since such a list could also include names of students, names of people from other departments who collaborated with the current CS department, etc. Therefore it is not just a matter of discovering named entities but it is a much more complex task.

The experiments show that the names of people can be found with high reliability.

The following are the results for the various sites:

The Computer Science Department of the University of Sheffield

In the case of the Sheffield department, using just Information Integration (II)14 techniques the system discovers 40 seed names of people belonging to the department as either academics, researchers or PhD students, 35 correct and 5 wrong. A qualitative evaluation of the errors is worth doing. The false positives can be divided into three groups:

14 By II techniques we mean that no IE is involved. Information is just harvested from online sources and integrated together.

Spurious Hits refer to entities which are truly wrong but for which a home page was also erroneously found. In general, their number is quite low in proportion to the total number of names. Recognising these names as false hits is quite easy for a person, so if the results are to be used or checked by a person, they do not constitute a problem. The only spurious hit we got (Computer Science) is actually the name of the departmental home page. A tighter post-processing filtering technique was also applied whereby potential names containing stop words (such as An, For, The and Is) are discarded. This removed spurious names including the entities "An Journal", "For Information" and "The Is" which were returned by the system in the initial phases, thus leaving only the one reported here in the final results.

Wrong Persons refer to names of people who are researchers in other institutions but who also have a web page similar to a home page in the Sheffield Computer Science department. The list of people includes:

• Eugenio Moggi

• Jose Principe

• Sophia Ananiadou

The recognition of these people is due to the wrong identification of some web pages within the departmental site as their home pages. This occurred because, in the past, these people all collaborated in some project with someone from within the department and their names are mentioned in a heading of one of the pages in the web site of the department.

Wrong Criterion refers to wrong information found within the web site. This included the name of the following person:

• Florentin Ipate

The system accepts a person as being part of a department if he/she has a home page in the departmental web site. In fact this person does have a home page at the department, and further examination reveals that he was in fact a past student or employee. Therefore, the system could not possibly know that he is no longer employed within the department, since the old web pages are still accessible and there is absolutely nothing on them that indicates that the particular person no longer studies or works there.

These potential names are then used to seed the learning algorithm. But before doing so, the system must find the best pages from which to learn. Pages are considered good if they contain lists of people. A search is conducted in the Computer Science web site using a number of queries similar to this:

site:dcs.shef.ac.uk "academics | personnel | academics | staff | faces | list" Lapata Clayton Demetriou

The first element in the query is the name of the site we are searching through, in this case dcs.shef.ac.uk. The second element is an OR list containing relevant keywords such as academics, personnel, etc. This list is generated automatically from a domain-dependent list of words. The final element is a list of seed names and surnames obtained from the previous procedures. These lists normally contain about three names or surnames chosen randomly; in this case the surnames Lapata

and Demetriou, and the name Clayton. All of these are people who work in the department and who were discovered earlier. We use just three names because of limitations imposed on us by the search engine. In Google, a query cannot be bigger than a certain length. The system performs around 30 such queries and then selects from the results the web pages that appear the most. The top five web pages obtained from these experiments were:

1. http://www.dcs.shef.ac.uk/department/people/groups.htm

2. http://www.dcs.shef.ac.uk/department/people/atoz.htm

3. http://www.dcs.shef.ac.uk/research/publications.html

4. http://www.dcs.shef.ac.uk/people/atoz.htm

5. http://www.dcs.shef.ac.uk/people/groups.htm
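The query construction and page-voting steps just described could be sketched as follows (Python; the random seed selection and the counting of result pages follow the description above, but the keyword list shown is only an example and the helper that actually calls the search engine is left out):

```python
import random
from collections import Counter

KEYWORDS = ["academics", "personnel", "staff", "faces", "list"]

def build_query(site, seed_names, n_seeds=3):
    """Compose one site-restricted query from the keyword list and a few
    randomly chosen seed names (the search engine limits query length)."""
    chosen = random.sample(seed_names, n_seeds)
    return 'site:{} "{}" {}'.format(site, " | ".join(KEYWORDS), " ".join(chosen))

def vote_for_pages(result_lists):
    """Given the result pages of the ~30 queries, keep the most frequent ones."""
    counts = Counter(url for results in result_lists for url in results)
    return [url for url, _ in counts.most_common(5)]

print(build_query("dcs.shef.ac.uk", ["Lapata", "Clayton", "Demetriou", "Wilks"]))
```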

If we examine this list, we find that all the pages (except for the third page) contain structured lists of names belonging to people who work in the department. Therefore four out of five of these pages are easy for our learning algorithm to learn from. The third page in the list is a very unstructured page containing lists of people together with their publications. It obviously ranked well in the search because it contains a list of all the publications in the department, and all the people who published papers are mentioned in this page. Since this unstructured page is quite difficult for a learning algorithm to learn from, we expect some errors to be introduced in the Information Extraction (IE) phase.

The Amilcare IE tool we used managed to discover a total of 124 names from this department. Of these, 116 were correct and 8 were wrong. An analysis of the wrong names (introduced in the IE phase and before filtering) shows that four of them were spurious hits similar to the ones we saw earlier, while the remaining three were real errors. These included:

• For Translation

• The Of

• In To

• The Research

• Prospective Student

• Reading Reading

• Piek Vossen

As mentioned before, these spurious errors were filtered by making use of a stop-word list, since most of them contain a stop word. The list is reported here because it adds further insight into the problems which Armadillo is facing. Since the phrases in the list do not even make grammatical sense, this suggests that the system is not adequately delimiting the different sections of a document. What is happening is that tags found in the HTML documents are not being treated correctly. More adequate sentence splitting should be investigated when HTML documents are being analysed, in order to prevent these errors from the very start.

The remaining error was the name of a real person who did not work in the department. His name is "Piek Vossen" and he was mistaken for a person working in the department because he appeared in several pages of the CS web site. A search through the site reveals that in the past he collaborated on papers with some people working in the department.

While browsing through the Sheffield CS department web site looking for such anomalies, we came across several other pages with potential problems. There were several people who did not work in the CS department but who had a web page dedicated to them. These consisted mainly of people who gave a seminar in Sheffield. In other cases, pages similar to home pages were found in the home page of one of the members of the department. Apparently, she keeps a bibliography of the publications of these people in separate pages dedicated to them. In our experiments, these people were not considered part of the CS department, probably because their names do not appear in the main lists of people. In other sites, such pages with names could cause problems because, even though the model we use to spot a home page of a person works in most cases, there are definitely some exceptions. This means that the heuristics which we use should be refined further in order to cover, as much as possible, only positive cases.

To evaluate these results, we manually counted the number of academics, research staff and research students found in the groups page15 (and filtered out secretarial staff who were also included in the list). The total number of people in the department was 139. If we now calculate Precision, Recall and F-Measure for the data obtained so far, we get the results in Table 5.1.

15 http://www.dcs.shef.ac.uk/people/groups.htm

     Possible  Actual  Correct  Wrong  Missing  Precision  Recall  F-Measure
II   139       40      35       5      104      87.5%      25.2%   39.1%
IE   139       124     116      8      23       93.5%      83.5%   88.2%

Table 5.1: Results in Discovering People and Associated Home Page. The first line refers to accuracy in discovering names using the procedure for seed names (Citeseer+Google, etc.), the second one refers to the discovery using adaptive IE.
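For reference, the figures in these tables follow the standard IE evaluation measures; taking the IE row of Table 5.1 as a worked example:

```latex
\mathrm{Precision} = \frac{\mathrm{Correct}}{\mathrm{Actual}} = \frac{116}{124} \approx 93.5\%,
\qquad
\mathrm{Recall} = \frac{\mathrm{Correct}}{\mathrm{Possible}} = \frac{116}{139} \approx 83.5\%,
\qquad
F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \approx 88.2\%
```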

These experiments show a drastic increase in both precision and recall when Information Extraction (IE) is applied on top of the results obtained using Information Integration (II). Precision increases by almost 7%16 while recall increases by more than 200%17 over the II results.

16 (93.5/87.5) × 100% ≈ 106.9%, i.e. almost a 7% increase
17 (83.5/25.2) × 100% ≈ 331.4%, i.e. around a 231% increase

The results indicate that the II techniques are good at boosting precision while recall stays fairly low. The main reason for this is that the searches are directed by the web site (since potential names are harvested by crawling its pages). This means that, since the system has no specific entry point to the web site, it is never guaranteed to find links that will take it to a page rich in information (in our case, rich in names). Therefore the data collected has very low recall, since the few examples collected are found purely by chance and using simple heuristics. Also, the II methods are restricted to a limited number of databases or Oracles. Therefore, their knowledge cannot be greater than the knowledge found in these Oracles. This restriction keeps recall quite low since the system is dependent on these sources. Precision, on the other hand, rockets up because we are collecting information from these Oracles, which are very structured sources. Therefore the wrappers we create manage to collect precise information without any errors.

When we use IE, the story changes. Recall rockets up while precision increases too, but the increase in precision is minimal. This increase in recall occurs because the IE is not just being applied on an arbitrary web page but on a carefully

selected page rich in names. The page is selected by querying an Information Retrieval engine (in our case Google) using the information discovered in the previous stage as query terms. The retrieved page is then annotated using very precise data obtained from the II stage, so that when the IE engine trains, it is training on good examples. It then learns from the structure of the particular page how to recognise names and returns new examples. The slight increase in precision is still very good. Normally, a decrease in precision would be expected because the IE engine tends to introduce new errors, but the quality of the examples used for training makes it possible for the IE engine to maintain the same precision. Recall is obviously increased since we are now using a more robust classifier capable of finding new instances of the target concept, trained on very good examples and applied on carefully selected pages.

The School of Informatics at the University of Edinburgh

In the experiments performed on the site of the School of Informatics at the University of Edinburgh, the system using just II discovers 13 seed names of people working in the department, 11 correct and 2 wrong. According to the lists found on their web site18,19, the School has a total of 216 people working there, made up of academic staff, research staff and full-time postgraduates. If we take a look at the type of errors returned by the system, we find that they are very similar to the errors found in the Sheffield site. The first error was a spurious phrase, "Computer Science". The second error was the name of a real person called "Javier Esparza". This happens to be the name of a person who used to work in Edinburgh many years ago. In the site there are several references pointing to this person because he published and co-authored papers with people who still work there. Since the system managed to find a copy of one of these papers within the CS web site, with his name as the title of the page, the system thought that it was a valid personal page and accepted the name of the person.

18 http://www.inf.ed.ac.uk/research/iccs/directory.html
19 http://www.inf.ed.ac.uk/research/iccs/students.html

The next phase makes use of several IE strategies like the ones mentioned earlier.

In this phase the results we obtain depend on a number of factors. When we search for a page, the queries are designed to look for lists of names which can be easily learnt by the IE engine. These pages are generally very structured and therefore, few examples are required to bootstrap the learning process. In this phase, we cannot exclude the possibility of analysing a few unstructured pages, therefore the IE system returns some spurious results. In fact, the IE tool managed to discover a total of

163 names from this department. Of these 153 were correct and 10 were wrong. An

analysis of the wrong names (introduced in the IE phase) shows that they were all spurious hits similar to the ones we saw earlier. These included:

• Students Home

• Postgraduates Research

• Management Science

• Intranet Home

• Computer

• Computer Programming

• Cognitive Science

• Centre Research

     Possible  Actual  Correct  Wrong  Missing  Precision  Recall  F-Measure
II   216       13      11       2      205      84.6%      5.1%    9.6%
IE   216       163     153      10     63       93.9%      70.8%   80.7%

Table 5.2: Results in Discovering People and Associated Home Page at the School of Informatics at the University of Edinburgh. The first line refers to accuracy in discovering names using the procedure for seed names (Citeseer+Google, etc.), the second one to the discovery using adaptive IE.


To evaluate these results, we manually counted the number of academics, research staff and research students found in the groups page20,21 (and filtered out secretarial staff who were also included in the list). The total number of people in the department was 216. If we now calculate Precision, Recall and F-Measure for the data obtained so far, we get the results in Table 5.2.

These experiments show a significant change in both precision and recall when IE is applied on top of the II results. Precision increases by almost 11%22 while recall increases by more than 1200%23 over the II results. Once again, we are confirming what we saw earlier, namely that the II techniques are good at boosting precision and the IE techniques are good at boosting recall. The increase experienced in this experiment shows that only a few quality data items (from 11 correct names to 153 correct names) are required to bootstrap the whole process and obtain excellent results. The combination of both these techniques seems to be the answer towards achieving high precision and recall in applications.

20 http://www.inf.ed.ac.uk/research/iccs/directory.html
21 http://www.inf.ed.ac.uk/research/iccs/students.html
22 (93.9/84.6) × 100% ≈ 111%, i.e. almost an 11% increase
23 (70.8/5.1) × 100% ≈ 1388.2%, i.e. around a 1288% increase

The Department of Computer Science of the University of Aberdeen

The Computer Science department at the University of Aberdeen is quite small when compared with the other departments. It is composed of 70 people made up of academics, research staff and research students. When Armadillo was tested on this site, the results obtained were quite similar to those already seen. Using just II techniques the system discovers 22 seed names of people belonging to the department as either academics, researchers or PhD students, 21 correct and 1 wrong. An analysis of the errors shows that the name belongs to a former PhD student of the department. In the web site, one can still access a page with the name of this person on it. There was also another error which I did not include in these results. The name found belonged to a former member of staff but, in this case, it is curious to note that the name of the person is actually still mentioned in the staff page24. The staff page contains two additional lists entitled "Honorary Staff" and "Former Staff". The former list is made up of five people from other universities who were given an honorary title by the University of Aberdeen. The latter list contains the names of two people who used to work at the University. For the purpose of this evaluation, the people found in these lists were included as part of the total number of people working in the department. Most of them still have a personal web page somewhere in the site and there are several references to them scattered here and there. In order to figure out who still works there, we had to manually go through the site, analyse what was written and deduce from the various clues whether or not the person still works at the CS department. This obviously involves a higher level of intelligent inferencing, much higher than Armadillo can handle.

24 http://www.csd.abdn.ac.uk/people/

     Possible  Actual  Correct  Wrong  Missing  Precision  Recall  F-Measure
II   70        22      21       1      49       95.5%      30%     45.7%
IE   70        65      63       2      7        96.9%      90%     93.3%

Table 5.3: Results in Discovering People and Associated Home Page at the Department of Computer Science of the University of Aberdeen. The first line refers to accuracy in discovering names using the procedure for seed names (Citeseer+Google, etc.), the second one to the discovery using adaptive IE.

Because of this, if we did not include these people as positive results, the evaluation would be unfair, since it would be unreasonable to expect the system to identify them as persons working at the University.

In the next phase, the IE tool managed to discover a total of 65 names from this department. Of these, 63 were correct and 2 were wrong. An analysis of the only wrong name introduced in the IE phase shows that it was the name of a past M.Sc. student called "Richard Martin". Although his name is not included in the staff page, he does have some publications and also a web page dedicated to him. Therefore, it was difficult for the system to figure out whether or not the student is still there, since the different cues associated with this student matched the cues associated with typical researchers working at the department (i.e. they must have a home page and a number of publications). A possible solution to this problem could be to analyse and keep a record of the date when a piece of information was created on the web. In this way, papers which were published or web pages which were updated many years ago should not be considered as valid clues towards deciding whether or not a person still works at a particular site.

To evaluate these results, we manually counted the number of academics, research staff and research students found in the groups page25. Then we calculated Precision, Recall and F-Measure for the data obtained up to that stage, and we got the results in Table 5.3.

These experiments continue to confirm the observations made in the other experiments. Precision increases by almost 2%26 while recall increases by 200%27 over the II results. This seems to further strengthen our hypothesis that a combination of II and IE techniques is necessary to obtain results with high precision and recall.

25 http://www.csd.abdn.ac.uk/people/
26 (96.9/95.5) × 100% ≈ 101.5%, i.e. almost a 2% increase
27 (90/30) × 100% = 300%, i.e. an additional 200% increase

5.4.2 Paper Discovery

After harvesting all the names of the people working in a CS department, the next task is to identify their publications. The task of discovering papers is very complex. If we analyse the implications associated with finding the title of a paper, we realise that a title:

• is made up of a variable number of words;

• is generally a grammatical phrase or sentence, but there is no rule requiring it to be so;

• can contain any word, number or symbol, in any order.

Such cues are not very helpful for identifying a title. If we try to examine the context in which paper titles appear, we realise that this is slightly more regular (since it generally contains the names of the authors, conference name, conference date, conference location, etc.) but the author is not really bound to include any of this information, and the format is left to everyone's discretion. Unfortunately, there is not really a standard way of representation. Some formats are more prevalent than others (like BibTeX28), but we cannot assume the system will encounter just one particular format because on the net there can be an infinite number of different formats. The only thing we know for sure is that references to publications do not just appear anywhere on the web but in specific places. These include databases containing lists of publications, an author's personal publication list, references found in other papers, etc. Once again, most of these pages contain lists of papers in no particular format, but they are definitely more regular.

The experiments we performed were based on the ideas presented earlier. While harvesting the names of people working in a particular site, we also gathered a partial list of their publications (from the online databases). For every author, we used this list to query a search engine and obtain a web page where all these publications appear. The rationale is that a page which contains the elements of a representative partial list will probably contain other unknown members of the complete list. This is the same idea which we used to identify the names of people earlier.

The results presented in this section refer to the experiments performed for the people working in the Sheffield CS department only. In our experiments a paper was considered correctly assigned to a person if it was mentioned in the personal publication list of that author and the title was 100% correct. The seed procedure discovers 143 papers; the learning procedure returns 407. If we discover fewer than 3 papers for a particular author, we do not consider the list representative enough and therefore exclude that author from the search. Checking the correctness of the results is very labour intensive, therefore we randomly checked the papers extracted for 10 of the staff members for which the seed papers exceeded 5 examples.

28BibTeX is a program and file format designed by Oren Patashnik and Leslie Lamport in 1985 for the LaTeX document preparation system. The format is entirely character based, so it can be used by any program. It is field based and the BibTeX program ignores unknown fields, so it is extensible. It is probably the most common format for bibliographies on the Internet.

These are shown in Tables 5.4 and 5.5.

        Possible  Actual  Correct  Wrong  Missing  Precision  Recall  F-Measure
R1            12       5        5      0        7       100%     42%        59%
R2            45       7        6      1       38        86%     13%        23%
R3            49      22       22      0       27       100%     45%        62%
R4            94      18       18      0       76       100%     19%        32%
R5            48       8        8      0       40       100%     17%        29%
R6            43       5        5      0       38       100%     12%        21%
R7            81      27       27      0       54       100%     33%        50%
R8            21       8        5      3       13        63%     24%        34%
R9            31      18       15      3       13        83%     48%        61%
R10          103      25       24      1       78        96%     23%        38%
Total        527     143      135      8      384        94%     26%        40%

Table 5.4: Seed Paper Discovery Accuracy; each line represents the papers discovered for a person. Possible represents the number of papers present in the personal publication list page, Actual the number of papers returned by the system. The actual results are divided into Correct, Wrong and Missing.

        Possible  Actual  Correct  Wrong  Missing  Precision  Recall  F-Measure
R1            12       8        8      0        4       100%     67%        80%
R2            45      40       36      4        5        90%     80%        85%
R3            49      36       36      0       13       100%     73%        85%
R4            94      82       50     32       12        61%     53%        57%
R5            48      43       41      2        5        95%     85%        90%
R6            43      20       15      5       23        75%     35%        48%
R7            81      42       33      9       39        79%     41%        54%
R8            21      18       17      1        3        94%     81%        87%
R9            31      28       27      1        3        96%     87%        92%
R10          103      90       83      7       13        92%     81%        86%
Total        527     407      346     61      120        85%     66%        74%

Table 5.5: IE-based Paper Discovery Accuracy obtained by Amilcare using the seeds in Table 5.4 to bootstrap learning.


     Possible  Actual  Correct  Wrong  Missing  Precision  Recall  F-Measure
II        527     143      135      8      384     94.41%  25.62%     40.30%
IE        527     407      346     61      120     85.01%  65.65%     74.09%

Table 5.6: Paper Discovery: Grand total
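For reference, the F-measures in Table 5.6 follow from the usual harmonic mean of precision (P) and recall (R):

F = \frac{2PR}{P+R}, \qquad
F_{\mathrm{II}} = \frac{2 \times 0.9441 \times 0.2562}{0.9441 + 0.2562} \approx 0.4030, \qquad
F_{\mathrm{IE}} = \frac{2 \times 0.8501 \times 0.6565}{0.8501 + 0.6565} \approx 0.7409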

The results follow what was noticed earlier in the other experiments: the use of IE significantly increases the overall recall when seeded with results obtained using II. In fact, recall grows from 26% (for the seeds) to 66% (for the IE-based results), an increase of about 40 percentage points, i.e. more than a third of the total number of papers. Precision drops from 94% to 85% and the F-measure rises from 40% to 74% (see Table 5.6).

5.4.3 Other domains

Armadillo works well in the CS domain discussed in Section 5.4.1, but what about other domains? It would be pointless to create a system so bound to one particular domain that it cannot be used in any other. In order to test the portability of the system, I, together with my colleague Mr. Sam Chapman, decided to port Armadillo to another domain. The evaluation of this system (which follows) was subsequently performed exclusively by myself. For these tests we chose a totally different domain: the Artists domain.

This scenario was created for the ArtEquAKT project29. The ArtEquAKT group is building a system which accepts as input the name of a famous artist and generates a biography of that artist on the fly. The aim of this project is to use knowledge acquisition and analysis techniques to extract information from web pages on a given subject domain and construct a knowledge base overlaid with an ontology.

29http://www.artequakt.ecs.soton.ac.uk/

The ontology together with story templates can then be used to construct stories.

This is achieved by walking through the ontology, extracting appropriate knowledge from the underlying knowledge base and inserting it into the templates. The data they need is not obtained from any particular database but is harvested from the web on the fly. A major problem they encountered is how to find the information and, once it is found, how to extract it. Because Armadillo was designed to perform exactly this task, we decided to take up the challenge and adapt our system for their domain.

For the Artist scenario, we customised the basic Armadillo system and produced Art-madillo. This system takes as input the name of an artist and returns a list of paintings for that particular artist, together with other meta information such as dates, etc. (if available). It works in a similar way to the CS domain seen previously; the only difference is that the main sources of information are related to artists rather than academic researchers. The sources used are various, some of which are listed in Table 5.7.

When the system is given the name of an artist, it searches through these sources for information concerning that particular artist. The newly discovered information is all stored in a database within Art-madillo. As soon as the system has some information about an artist which can be used to bootstrap further discovery, a search is launched using Google to find new sites which might contain further information. These sites are annotated with the known information. IE is then used to learn the patterns in the site and is reapplied on the same site in order to discover new facts. The cycle continues until no further information can be discovered.
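A minimal sketch of this cycle, under some simplifying assumptions, is given below. The function names (query_sources, find_pages, train_extractor, apply_extractor) are placeholders for the structured-source wrappers, the search-engine lookup and the adaptive IE engine, and pages are treated as plain strings; it illustrates the loop just described rather than the actual Art-madillo code.

def harvest(artist, query_sources, find_pages, train_extractor, apply_extractor):
    """One possible shape of the seed-annotate-learn-extract cycle (a sketch)."""
    known = set(query_sources(artist))              # seed facts (e.g. painting titles)
    while True:
        new_facts = set()
        for page in find_pages(artist, known):      # pages that mention known facts
            seeds_here = [fact for fact in known if fact in page]
            if not seeds_here:
                continue
            model = train_extractor(page, seeds_here)        # learn the page's patterns
            new_facts |= set(apply_extractor(model, page)) - known
        if not new_facts:                           # fixed point: nothing new discovered
            return known
        known |= new_facts                          # store new facts, reuse as seeds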

Site                                   Artists
http://www.artcyclopedia.com           Cezanne, Raphael
http://www.artpublishing.com           Cezanne, Monet
http://www.artrenewal.org              Caravaggio, Manet, Raphael, Renoir
http://www.bildindex.de                Cezanne, Manet
http://www.postersposters.net          Manet, Monet
http://www.postershop.com              Cezanne, Renoir
http://www.wholesaleoilpainting.com    Caravaggio, Raphael

Table 5.7: A list of sites from where information about particular artists was collected during the manual evaluation.

The following is the result of a comprehensive evaluation which we performed on the system. We took a random set of 6 famous artists and used them to test the system. A problem we had during the evaluation was how to calculate the total number of pieces created by a particular artist. We could not find any site that gives a complete list of paintings, probably because the exact number is unknown.

Therefore, we had to compute a manual estimate of the total number of pieces per artist. What we did was to find one or more sites that list most of the pieces done by a particular artist. Then we went through each site and manually counted the number of paintings. Since the paintings in most sites were listed in alphabetical order, it was possible to count the pieces while going through several sites at once.

This was not a straightforward task, since we also had to ignore duplicate names and duplicate paintings. There were several cases where some of the paintings were displayed more than once in the searches, using the same name but in a different language. In other cases, the same painting was shown using the same name but with different dimensions. This mainly occurred because some of the sites we used were online shops which sell prints of paintings. The complete list of sites we used for the manual evaluation to calculate the total number of paintings per artist can be seen in Table 5.7. Once we had this estimate, the rest was quite straightforward and basically consisted of going through the results and counting the correct and wrong entries. The results we obtained can be seen in Table 5.8.

Artist      Method  Possible  Actual  Correct  Wrong  Precision  Recall  F-Measure
Caravaggio  II            82      50       50      0       100%     61%      75.8%
            IE            82      81       81      0       100%   98.8%      99.4%
Cezanne     II           188      51       51      0       100%   27.1%      42.7%
            IE           188      88       80      8        91%   42.6%        58%
Manet       II           128      38       38      0       100%   29.7%      45.8%
            IE           128      52       52      0       100%   40.6%      57.8%
Monet       II           130      19       19      0       100%   14.6%      25.5%
            IE           130      73       63     10      86.3%   48.5%      62.1%
Raphael     II           162      97       97      0       100%   59.9%      74.9%
            IE           162     145      140      5      96.5%   86.4%      91.2%
Renoir      II            45      19       18      1      94.7%     40%      56.2%
            IE            45      28       27      1      96.4%     60%        74%

Table 5.8: Artist Discovery


If we analyse the results, we realise that the average precision for all 6 artists during the II phase was 99.1%. This is not unexpected, since our II sources were structured databases containing thousands of paintings. It was therefore easy for our wrapping algorithms to get the names of the paintings and the name of the relevant artist from the sites. The few errors introduced into the system were due to a number of factors, some of which were already mentioned before. The first kind of error encountered was due to the fact that the sites contained duplicates. This was not really a problem, since duplicates can easily be filtered from any list. The problems arise when the duplicate name is either spelt differently or is in a different language. If, for example, we take a famous painting by Renoir entitled "The Blue Vase", we found the following three versions of the name in various sites:

1. The Blue Vase

2. Blue Vase

3. Le Vase Bleu

They all refer to the same painting, but for our system it is quite hard to realise that. The second name is missing an article when compared with the original name. We could exclude articles, but how can we be sure when an article is part of the name and when it is not? And what happens if we exclude articles and then realise that there is another painting entitled just "Blue Vase"? How would we go about distinguishing them? These are all open questions for which we do not really have an answer, but they are definitely worth asking. The third name is 100% correct when compared with the original name, but it is written in French. If a system analysed it, it would consider it entirely different from the original name (unless the system knows some French!).
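To make the article problem concrete, a simple normalised string-similarity check (a sketch, not the mechanism actually used in Armadillo) catches the missing-article variant but, as expected, says little about the French one:

from difflib import SequenceMatcher

def similarity(a, b):
    """Normalised similarity between two painting titles (0..1)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(similarity("The Blue Vase", "Blue Vase"))    # high: likely the same painting
print(similarity("The Blue Vase", "Le Vase Bleu")) # much lower: translation defeats a
                                                   # purely string-based check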

A possible solution to the language problem could be to add a translation component to the system. In the previous domain there was no need for such a component, but since the Artists domain is multilingual, such a component could be useful. Unfortunately, translation components are far from perfect and therefore this would not solve the problem entirely. The second kind of error has to do with wrong entries inside the databases from which we gather information. We already encountered such errors when working in the CS domain.

We cannot really do much about this kind of error since, by default, we consider the entries inside a database to be 99.99% reliable, because for us a database is an Oracle! If a database is not considered reliable, then it is not used for II in the first place. In the case of the Artists domain, the errors in the databases were negligible and therefore they were still used as part of the II techniques. With regard to recall, this depends entirely on the popularity of the particular artist: the more popular the artist, the more information the system is likely to gather during the II phase.

Since, as mentioned before, there is no single database containing a comprehensive list of paintings per artist, we had to use several sources. The recall obtained during the II phase represents the total number of paintings discovered for a particular artist using the different sources. When it comes to the IE phase, the precision value remains stable or increases in half of the tests and decreases slightly in the other half. This slight decrease is due to the fact that IE always introduces some errors into the whole process. These errors have to do with wrong boundaries when detecting the names of paintings. On the other hand, recall increases drastically in all cases (and on average it doubles).

We should not forget that the results obtained in the II phase indicate the popularity of the artists in the sites we used as a source for our information. Basically, this means that the more represented an artist is in a site, the more structured information we manage to find about that artist. Because of this, some famous artists like Monet were penalised. The results do not imply that Monet is less famous than the other artists; they just indicate that our sources of information did not have much information about Monet. Unfortunately, we could not really do much about it, since there is no site on the web which contains a comprehensive list of paintings.

This reminds us once again about the nature of the web: information is rarely stored in one place, but is distributed. This is not a big problem, since in all the experiments presented in this document our approach manages to overcome it. If we now take a look at the IE results, the system still manages to find new paintings for every artist, even though the paintings used as seeds (i.e. those found using II) are few in some cases. Looking again at the results for Monet, recall more than triples (from 14.6% to 48.5%) when the IE techniques are used. On the whole, the methodology, starting only from the name of the artist, manages to find quite a large list of paintings for each artist. In all cases II and IE combined proved to be the way forward to gather information from the web effectively.

5.4.4 Analysis and Observations

In summary, when we analyse all the results, some interesting trends emerge. When the modules make use of II techniques, they all achieve high precision and low recall. When IE techniques are used on top of the II techniques, the precision is in all cases stable (i.e. it either drops or rises by a slight amount, or else stays the same) while the recall increases drastically. Therefore, whenever our methodology is fully applied, we obtain both high precision and high recall. Since this pattern was constant throughout all the experiments, we conclude that it is the default modus operandi of our methodology. These results are important because they show that our system manages to strike a good balance between precision and recall. Most of the IE systems mentioned in Chapter 3 manage to obtain high precision but low recall (or vice versa). Our methodology, on the other hand, shows that if II and IE techniques are combined, we can obtain both high recall and high precision.

The rationale behind our results is the following. With II techniques, we obtain high precision but low recall. The results exhibit this trend because the various sources found on the web, no matter how well maintained, cannot possibly be as up to date as the place from which the data originates. With II we therefore obtain a high level of precision because:

• The sites contain next to perfect information and therefore we are confident that the data we gather from them is generally correct. We are not guaranteed that it is the most up to date, but we are guaranteed its correctness, since these sites specialise in providing such data.

• These sites present information in a structured way. This makes the process of extracting information from them much easier and less error prone. Because of this, we eliminate a lot of noise from our data immediately.

The only problem with II techniques is that the information obtained from the different sources is generally not the latest data. In our methodology, we are interested in harvesting the latest information from the web. When we test the results obtained from the II techniques, most of the data is correct (therefore the precision is high), but since it is not the latest, the recall appears low in most cases because our data does not include the most recent information.

At this stage IE comes to the rescue: it allows us to bring up the recall of the system. First of all we make use of the precise examples (found using the II techniques) to find web pages where these examples appear. Our rationale is that if these examples appear in a page, it is likely that the page contains other unseen examples. Once we have these pages, we mark the examples we already know, take them together with some context and send these phrases to the IE algorithm for training. The trained IE system is then reapplied to the same page in the hope of discovering new examples. Since the system is learning from good information, it is very unlikely that noise is introduced, and therefore precision is very stable. Recall shoots up because the page from which we are extracting information is the page which contains the latest information. In most cases, the pages returned by the search engines are the original pages written by the owner of the information and are therefore the most up to date. In short, the IE succeeds because it learns the page structure and exploits that knowledge to gather more of the information found in the page. The conditions for our methodology to work are very simple: the system only needs an external partial source of information which helps it spot precise information (but which need not be complete). Once this information is available it is used to seed the IE process, bootstrap learning and make the discovery of new instances possible.

5.5 Conclusion

In the Armadillo methodology, we have shown how to exploit two features of the web: the redundancy of information and the availability of information. Redundancy is an important feature of the data found on the web and refers to the property that the same information is not stored in just one place but is repeated in different repositories around the web. The availability of information discussed in (14) refers to the fact that online information can be used to bootstrap this process, but in some cases this information is not available online for a number of reasons. These include cases where the information is expected to be known by everyone and therefore no one bothers to publish it online. Many other domains suffer from the same problem of a lack of published information, and so far there are few plausible ways of tackling it apart from this methodology. Armadillo is far from solving all the open issues related to the annotation task on the web, but the results clearly show that what we propose is definitely a step forward in the right direction. To conclude, in the coming chapter we will summarise the research found in this thesis, share our vision of the future of the SW and propose new (sometimes daring) ideas. In so doing, we hope to stimulate more research in the SW field.

Chapter 6

Conclusion

6.1 Research Contributions

Throughout this thesis, we have seen how vital the whole annotation process is for the success of the Semantic Web (SW). All SW researchers understand the potential of this new web; they all expect great things to happen in the near future, but most of them seem to put the annotation problem aside. As can be seen from the SW Wave diagram1 (see Figure 6.1), these semantic tags are the building blocks upon which the SW idea is built. At the bottom, we find the markup languages (SGML, XML, etc.) and on top of them all the other layers, starting from resources and proceeding up to trust. Unfortunately the annotation task is far from trivial, since it is tedious, error prone and difficult when performed by humans. Apart from this, the number of web pages on the internet is increasing drastically every second, making it unfeasible to have someone manually annotate all these pages. We could consider asking the authors of web sites to include semantic annotations inside their own documents, but would this not be similar to asking users to index their own pages? How many users would go about doing so?

1http://www.w3.org/2003/Talks/01-siia-tbl/slide19-0.html



Figure 6.1: The Semantic Web "Wave" as presented by Tim Berners-Lee.

Some years ago, there was pressure to insert meta tags inside HTML documents. These tags insert information into the header of web pages. The information is not visible to those viewing the pages in browsers. Instead, meta information in this area is used to communicate information that a human visitor may not be concerned with; for example, it can tell a browser what "character set" to use or whether a web page has rated itself in terms of adult content. These tags laid the initial steps towards inserting semantic information2 into a web page. Since users could not really see the benefit of inserting these tags, and considering that search engines were not giving them particular weight, the meta tag suffered a quiet death.

The web reached its massive popularity and exponential growth mainly due to sociological factors. It was a structure originally created by people for people. Since people are social animals, they have an innate desire to socialise, take part in a community and make themselves known in that community. Also, in the fast moving world of today, we live in a society where people want results immediately and expect immediate gratification for their efforts. The web offers just that!

The original web consisted of a network of academics. These people all came from the same community. They needed to get to know other people working in their area, share their knowledge and experiences, and collaborate. Snail Mail3 was the only form of communication available between these academics, but it was too slow. E-mail changed all this and made it possible for academics working far apart to exchange information in a few seconds. These academics saw the potential of this technology and therefore decided to grant access to this service even to their students. The latter saw even greater potential in this network and started experimenting with new ideas such as Bulletin Board Services (BBS) and online games (such as Multi User Dungeons (MUDs)). The original scope of the net changed completely. It was no longer limited to collaboration between individuals for work-related purposes but became also a form of entertainment. This new and exciting technology could not be concealed for long within the university walls and soon people from outside these communities started using it too. It was a quick way of communicating with people, obtaining multimedia files, playing games, etc. This web grew mainly because it gained popularity in a closed community which was very influential and which gave it a very strong basis from which to expand. Subsequently, since this technology is easy to use by anyone and requires no special skills, its usage expanded at a quick rate. Even though all this happened before the web as we know it today was even conceived, the current web grew in exactly the same way. With the appearance of web browsers, people who had been using the internet realised that they could move away from the dull world of textual interfaces and express themselves using pages containing rich multimedia content. People soon realised that it was not just a way of presenting their work; this new medium gave them an opportunity to present themselves to others. It allowed anyone to have his/her own corner on the internet, accessible to everyone else: a sort of online showcase which was always there representing their personal views, ideas, dreams, etc., twenty-four hours a day, seven days a week! The technology to create these pages was a little difficult to grasp initially but soon became extremely user friendly, with editors very similar to text processors (a technology which had been around for decades and which was known by everyone who was computer literate). To cut a long story short, everybody wanted to be present in this online world. Most people did not even know why they wanted to be there!

2The meta keywords tag allows the user to provide additional text for crawler-based search engines to index along with the body of the document. Unfortunately, for major search engines it does not help at all, since most crawlers now ignore the tag.
3A slang term used to refer to traditional or surface mail sent through postal services, nicknamed snail mail because the delivery time of a posted letter is slow when compared to the fast delivery of e-mail.

The initial popularity grew mainly due to word of mouth, but that growth is insignificant when compared with the expansion which the web is currently experiencing. Search engines were vital for this success. Before search engines were conceived, one knew where to find information only by asking other people or communities (specialised in the area) about sites where the information could be found. Afterwards, the process of looking for information was just a matter of using a search engine. Information therefore did not have to be advertised anywhere, since automatic programs called crawlers were capable of searching the web for information and keeping large indexes with details of where that information is found.

The creator of the SW (Tim Berners-Lee, who happens also to be the creator of the web) is trying to make the SW a reality using the same approach. That is, the SW should not be something created by a group of academics for people, but should emerge as something made by people for people. For this to happen, we should base our work upon the lessons learnt when creating the current web. In the same way as search engines are vital for the web of today, semantic search engines will be vital for the web of tomorrow. These search engines will allow searches to be conducted using semantics rather than the bag-of-words approach (popular in today's search engines). For this to happen, we must have programs similar to crawlers that create semantic indexes containing references to documents based on semantics rather than keywords. To discover these semantics, we further require automatic semantic annotators which semantically annotate the documents. In this thesis, we tackled exactly this task by proposing two annotation methodologies.

The first one is semi-automatic and makes use of the interaction between user, interface and Information Extraction System (IES) for human-centred document annotation that both maximises the timeliness and minimises the intrusiveness of the IES suggestions. An important feature of this methodology is that it empowers users to fully customise the degree of proactivity so as to obtain the desired behaviour from the IES. Originally, it was intended as a prototype system designed to showcase how Information Extraction (IE) can be utilised efficiently and effectively, especially during the training phase. Melita is an annotation interface that actively supports interaction between users and an adaptive IE system. The idea is that, while users tag documents to train the IE system, Melita:

1. avoids the user having to re-annotate examples which the IE system is already able to annotate;

2. helps select the most informative documents to be annotated (to maximise the effectiveness of the training process).

To do this, Melita runs an adaptive IE system in the background and provides timely annotation of already covered cases. The selection of documents for annotation is also based on an analysis of the IE system's performance on an un-tagged corpus. This methodology has proved to be suitable and effective whenever a closed set of documents needs to be annotated. In the experiments conducted, we have shown that the IES provides effective, high quality annotation with little effort.

The second methodology is similar to the first one, but the human factor is virtually eliminated by making use of Information Extraction (IE) and Information Integration (II) technologies. Information is initially extracted starting from highly reliable, easy-to-mine sources such as databases and digital libraries and is then used to bootstrap more complex modules such as wrappers for extracting information from highly regular Web pages. Information extracted by the wrappers is then used to train more sophisticated IE engines. All the annotated training corpora for the IE engines are produced automatically. Experimental results show that the methodology can produce high quality results.

User intervention is limited to providing an initial element of information, in order to instruct the engine where to start searching for the desired data, and to acting as an Oracle by adding information missed by the different modules once the computation is finished. No preliminary manual annotation is required. The information added or deleted by the user can then be reused to restart learning and therefore obtain more information (recall) and/or more precision. The type of user needed is a person able to understand the annotation task; no skills in IE are needed.

The natural application of such a methodology is the Web, but large companies' repositories are also an option. In this document we have focused on the use of the technology for mining web sites, an issue that can become very relevant for the Semantic Web, especially because annotation is provided largely without user intervention. It could potentially provide a partial solution to the outstanding problem of who will provide semantic annotation for the SW. It can potentially be used either by search engines associated with services/ontologies to automatically annotate/index/retrieve relevant documents, or by specific users to retrieve the required information on the fly.

The idea of using the redundancy of information to bootstrap IE learning is not new, having already been proposed by Brin (15) and Mitchell (87). The difference with our approach is the way in which learning is bootstrapped. Brin uses user-defined examples, while Mitchell uses generic patterns that work independently of the place at hand (e.g. the site or the page). We use both of the above4, but in addition we exploit the redundancy of information and integrate information extracted from different sources with different levels of complexity. In this respect our approach is, to our knowledge, unique.

As noted by Brin, great care is needed in order to select only reliable information for annotation for learning. The integration of different knowledge sources multiplies the available information and therefore allows learning to be seeded only when multiple evidence is found. The experimental results described in this document were obtained by fixing high precision (>90%) and testing the obtained recall. One relevant question for the effective usability of the methodology in real applications concerns the required level of accuracy (as a balance of precision and recall) the system has to provide. As far as Web applications are concerned, it is well known that high accuracy is not always required. Search engines are used every day by millions of people, even if their accuracy is far from ideal: further navigation is often required to find satisfying results, large portions of the Web are not indexed (the so-called dark and invisible Webs), etc. Services like Citeseer, although incomplete, are very successful. What really seems to matter is size: the ability to both retrieve information dispersed on the Web and create a critical mass of relatively reliable information. In this respect the proposed methodology is satisfying. Experiments show that, in the case of paper discovery, it is able to discover a large part of the information that the digital libraries (which were used to seed) did not have (+50%). Precision tops 90%, so the information provided is very reliable. This methodology was implemented in the Armadillo system.

4Generic patterns are used in the named entity recogniser.

Both systems proved to be so useful that they opened various doors into several other fields of research, such as Human Computer Interaction, Visualisation and Adaptive Information Extraction. The systems were also distributed to a number of universities and private companies for evaluation purposes. They have both gained national and international recognition. In fact, the Melita system has been evaluated by a British company with the intent of commercialising it, and Armadillo was a core component of the CS AKTive Space (99) (100) (59) application that won first prize in the 2003 Semantic Web challenge5. Both applications were also used as demonstration tools in various international Summer Schools. They were also rated world class in a mid-term review report created for the AKT project6 by a panel of international experts chaired by Professor James Hendler (one of the creators of the Semantic Web). Although these methodologies do not manage to solve all the problems associated with the SW, they definitely contribute a step forward in the right direction towards making the SW a reality!

6.2 Future Directions

This section takes a look at what we believe to be the future directions in which SW-related research should go. The first sections highlight the problems IE and II face when confronted with the web. The final sections present several potential future enhancements and usages of the methodologies developed in this document. They propose different ways in which annotations may change from how we know them today and what their new role will be in the big SW picture.

5http://www-agki.tzi.de/swc/
6http://www.aktors.org/

6.2.1 IE and the web

From a semantic point of view, the web of today is not very different from the SW of the future. In fact it already contains a lot of information; the only problem is that it is targeted at human consumption. Machines are not capable of interpreting this data in order to process it further. Thus, the need arises to annotate all the information found on the internet using tags which are understandable both by humans and by machines. As we have already seen, manual annotation is difficult and time consuming to perform. The resources which such an exercise would require are unimaginable. If we make use of a semi- or fully automatic methodology (like the ones presented in this thesis), the task would be simplified by many orders of magnitude, but can we do better than that?

An important component of our approach is IE. IE algorithms are first trained to identify the information which is important to us. They then automatically extract similar information from other, unseen documents. Unfortunately, the results we obtain are not always very good and are sometimes very much dependent upon the complexity of the page being processed. So far, most of the successful algorithms have focused on linguistic approaches and very few of them have taken into consideration properties associated with the document structure, layout, etc. Basically, most of them ignore the visual properties of a document; that is why we are proposing a new breed of IE tools called Visual IE (VIE).

VIE is not something new; the most advanced information processing system (the human being) has been using it for thousands of years, so why not replicate it? The brain is a fantastic organ which makes use of associative memory, whereby different elements are associated together. These elements can be of various types: they can be linguistic, visual or of an infinite number of other types. Like a big jigsaw puzzle, all these associations in the brain make it possible for a person to relate several pieces together and, in so doing, groupings start to emerge until finally the big picture becomes clear. The huge number of associations also makes it possible for the whole process to cope with noisy environments. If we take a look at the text "SOME1 GR8", a machine would not be capable of interpreting it, whilst a human can easily understand that it means "Someone great". This trick is currently being used by spammers all over the internet to fool automated email filters and, so far, we must admit that they are succeeding in doing so!

Our VIE system does not necessarily have to be as sophisticated as the human brain, but at least we should start taking visual cues into consideration. The VIE must be taught the effect of specific formatting, not just visually but also semantically. For example, text in bold or italics might signify a certain degree of importance with respect to the other text in the document. Another important visual cue is proximity, which refers to the fact that elements close to each other might be related in some way. A piece of text which is closer to an image than to other elements in the document might be the caption of that image. Some text formatted as important at the top of the document might be the title of the document. The list of examples can go on forever. If we abstract further, a group of words in the proximity of each other forms a paragraph. And what is a paragraph if not a small set of ideas which, when combined with other related paragraphs, forms a document? What about other complex structures such as tables and lists: aren't they groupings of related data as well?

There is an infinite number of visual cues one can learn to extract from a document. To cover them all, one would require another thesis solely dedicated to the subject. The message we would like to convey is that the information already exists on the web; there is no need to annotate everything. The information is there waiting for us to harvest it, and only with tools like VIE can we manage to do so!

6.2.2 IE and the SW challenges

From the IE point of view with regard to the SW, there are a number of challenges in learning from automatic annotation instead of human annotation. On the one hand, not all human annotation is reliable: the use of multiple strategies and combined evidence reduces the problem, but there is still a strong need for methodologies robust with respect to noise. On the other hand, many IE systems are able to learn from completely annotated documents only, so that all the annotated strings are considered positive examples and the rest of the text is used as a set of counterexamples. In our cycle of seed and learn, we generally produce partially annotated documents. This means that the system is presented with positive examples, but the rest of the text can never be considered as a set of negative examples, because un-annotated portions of text can contain instances that the system has to discover, not counterexamples.

This is a challenge for the learner. At the moment we present the learner with just the annotated portion of the text plus a window of words of context, not with the whole document. This is enough to have the system learn correctly: the number of un-annotated examples that become negative examples in the training corpus is generally low enough to avoid problems. In the future we will have to focus on machine learning methodologies that are able to learn from scattered annotation.
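As an illustration of the context-window idea, the sketch below (not the actual implementation) turns a partially annotated text into training instances, each consisting of a known example plus a few words of context on either side; the rest of the document is never shown to the learner.

import re

def training_instances(text, known_examples, window=5):
    """Yield (left_context, example, right_context) for each known example.

    Only the annotated portions plus a window of words are kept, so
    un-annotated text never acts as negative evidence (a sketch).
    """
    joined = " ".join(text.split())
    for example in known_examples:
        for match in re.finditer(re.escape(example), joined):
            left_words = joined[:match.start()].split()[-window:]
            right_words = joined[match.end():].split()[:window]
            yield (" ".join(left_words), example, " ".join(right_words))

# Toy usage:
text = "The seminar by John Smith starts at 3pm in room 101 of Regent Court."
print(list(training_instances(text, ["John Smith"], window=3)))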

6.2.3 II and the SW challenges

The methodology proposed in this document is based on exploiting the redundancy of information. Information is extracted from different sources (databases, digital libraries, documents, etc.) and therefore the classic problems of integrating information arise. Information can be represented in different ways in different sources, from both a syntactic and a semantic point of view.

The syntactic variation is coped with when we define the architecture, because when two modules are connected a canonical form of the information is defined; e.g. the classic problem of recognising film titles such as "The big chill" and "Big chill, the" can be addressed. More complex tasks need to be addressed, though. For example, a person's name can be cited in different ways: J. Smith, John Smith and John William Smith are potential variations of the same name. But do they identify the same person as well? When a large quantity of information is available (e.g. authors' names in Citeseer) this becomes an important issue (4). This problem intersects with that of intra- and inter-document co-reference resolution, well known in Natural Language Processing. We are currently focusing on mining websites, because this allows us to apply some heuristics that very often solve these problems in a satisfying way. For example, the probability that J. Smith, John Smith and John William Smith are not the same person in a specific CS website is very low, and therefore it is possible to hypothesise co-reference.
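One plausible formulation of such a heuristic is sketched below (an illustration, not a quotation of the actual Armadillo code): two names found on the same site are treated as co-referent when the surnames match and the given names or initials are compatible.

def compatible_names(a, b):
    """Heuristic: same surname and compatible given names/initials.

    Reasonable within a single CS site; unsafe across large external
    resources such as digital libraries (a sketch).
    """
    a_parts, b_parts = a.replace(".", "").split(), b.replace(".", "").split()
    if a_parts[-1].lower() != b_parts[-1].lower():      # surnames must match
        return False
    for x, y in zip(a_parts[:-1], b_parts[:-1]):        # compare given names
        if not (x.lower() == y.lower()
                or x[0].lower() == y[0].lower() and (len(x) == 1 or len(y) == 1)):
            return False
    return True

print(compatible_names("J. Smith", "John Smith"))            # True
print(compatible_names("John Smith", "John William Smith"))  # True
print(compatible_names("Jane Smith", "John Smith"))          # False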

We live in a fast changing world full of discrepancies which are accepted by every­ one. If we image a task where we need to integrate information concerning number of people living in a country, it is very likely that different reliable sources will have different values, therefore creating discrepancies. These values might be correct if they present a number which is approximately equal to a common average, accepted by everyone (but which is not necessarily the exact number!).

6.2.4 Armadillo: Agents over the Grid

Armadillo is primarily a web based system aimed at automatically harvesting infor­ mation from the web. Although it serves its intended purpose quite well, it suffers from a number of limitations.

Exploitation of Resources Resources are limited and if they are not managed well,

we risk wasting some of them. The internal modules of the current system work

on a single machine. They do not exploit other machines available on a network.

If these computers are available, it would be useful to distribute the internal 232

modules over several machines and execute them in parallel.

Coordination The distributed nature of the system is still coordinated by a cen­

tralised module. Although there is nothing wrong with this approach since most systems (even in the physical world) have a centralised control , in itself , control imposes additional overheads over the whole system. This module wastes a lot

of precious processing power which can be used to perform further processing.

Robustness Since the system connects directly to external resources (such as web

services), if these resources are not available, the whole process reaches a stalling

point especially if the resource provides a critical function. These resources are

not intelligent enough to cater for failures.

Accountability A module is not accountable for the data it produces. Although,

every item of data is signed by the module which produces the data, if a mod-

ule produces spurious data, there is nothing which stops the module or which

calibrates it in order to produce correct data. All the data verification must be

performed manually by the user.

To overcome these limitations, the system can easily be ported towards an agent1 based architecture which exploits new technologies in distributed computing such as grid8 computing. Let us take a look at how Agents and Grid Technology can overcome the different limitations of Armadillo.

7A computer program that carries out tasks on behalf of another entity. Frequently used to reference a program that searches the Internet for information meeting the specified requirements of an individual user.
8A grid is a software framework providing layers of services to access and manage distributed hardware and software resources.

In any computerised system, resources (such as processing power, secondary storage, main memory, etc.) are always scarce. No matter how much we have available, a system could always make use of more. Systems like Armadillo are very demanding, much more so than normal ones. The system needs to process huge amounts of text using complex linguistic tools in the least possible time. These demands have taken Armadillo away from the normal desktop platform found in every home towards top of the range systems (and even servers). Physical resources are not the only bottleneck in the whole system. The raw material which the system utilises is web pages downloaded from the internet. Unfortunately, this process is still extremely slow, especially when several pages are requested simultaneously from the same server. In short, we could conclude that such systems are only usable by top of the range computers with a high bandwidth connection to the internet ... or are they?

Agents and Grid technology can make all the difference by efficiently exploiting the resources we have to the maximum. To achieve this, first of all we need to create an environment where agents can live, accessible from anywhere over the internet. This would be like a virtual city, accessible over HTTP, where agents can decide to go and live or work for some time. This city is basically just a server to which mobile agents can upload themselves freely. This community would have some Immigration Control Agents which monitor upload requests, and they can:

Refuse entry to an agent based on some predefined heuristics (e.g. the agent is blacklisted).

Grant an entry visa, which allows an agent to exist in the city but with very few resources allotted to it. In this way, the agent can perform simple processes like looking for another community, contacting its creator for further instructions, etc. Under this visa, the agent is expelled from the system after a certain period, either by sending it to another community which is accepting agents or by simply deleting it.

Grant a work permit, which allows an agent to exist in the city using as many resources as it can get from the city. This kind of visa is not bound by a time limit.

Once an agent is inside the city, it can utilise the resources of the city, such as processing power, memory, etc. A city could also act as a portal to a grid network. This would allow agents to execute functions over the grid transparently, without altering anything in the agents. The system would take care of distributing work over the grid. Other services must also be created for this community to work, such as search engines which return links to cities together with related statistics (regarding agent population, resource utilisation, etc.). A city can be installed on any machine and can be configured to use as little or as much of the available resources as desired. As for Armadillo, it is already divided into several modules, some internal and others external. All of these modules should be packaged inside an intelligent agent for the architecture to work.

The benefits of this system are various. Since the system is made up of several agents, on the client side there is no longer the need for a powerful machine with huge network bandwidth. The client is just a thin system whereby the user defines the strategies (as is the case in the current version of Armadillo). When the user is happy with the strategy, the system is initialised. Agents are created for every module used in the strategy and sent to a city somewhere over the internet for execution. Here we are exploiting the property that an agent is able to move from one system to another and through various architectures and platforms, also referred to as Agent Mobility. If no cities are found, the agents are executed locally.

A legitimate question to ask is: how would agents go about coordinating their activities? From our previous experience with the Armadillo system, a data-driven approach was enough to coordinate everything. This means that the data of the system is stored somewhere central, within reach of every module, and each new data item inserted in the database acts as a trigger for a module to start execution. Every module had at least an input and an output. The data passed between modules was all semantically typed, therefore it was possible to have hooks connected to specific fields in the database which executed functions in a module whenever new data was available (a minimal sketch of this trigger mechanism is given after the list below). The same can be applied to distributed agents. All we need is an agent that acts as a gatekeeper for the database and another agent used to wrap every module in the system. Agents can do this easily since they are:

• capable of interacting and cooperating with other agents or human beings in order to complete a task. Therefore the gatekeeper agent could easily ask another agent to execute a particular function whenever new data is available.

• capable of choosing what to do, and in which order, according to the external environment. Since an agent can easily be overloaded with work, it can clone itself, send the copies to other servers and instruct them to perform a subtask of its work. A typical example would be an agent used to download a list of web pages: the agent could easily clone itself and send the cloned agents to download sublists of the whole list.
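The sketch below illustrates the data-driven trigger mechanism referred to above, reduced to its essentials: a shared store of semantically typed data plus hooks that fire whenever a new item of a given type arrives. The class and function names are illustrative, not taken from Armadillo.

class Blackboard:
    """Data-driven coordination (a sketch): modules register hooks on
    semantically typed fields and fire whenever new data of that type arrives."""

    def __init__(self):
        self.data = {}       # data type -> set of values
        self.hooks = {}      # data type -> list of callables

    def register(self, data_type, hook):
        self.hooks.setdefault(data_type, []).append(hook)

    def add(self, data_type, value):
        seen = self.data.setdefault(data_type, set())
        if value in seen:
            return
        seen.add(value)
        for hook in self.hooks.get(data_type, []):
            hook(value, self)                # each new item triggers the module

# Toy usage: a new person name triggers a (stub) paper-discovery module.
board = Blackboard()
board.register("person", lambda name, b: b.add("paper", "A paper by " + name))
board.add("person", "John Smith")
print(board.data["paper"])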

Another issue to consider is robustness. In a system which makes heavy use of the web, it is very common to try to access links or resources that are no longer available. There can be various reasons for this: a server could be down, a page may no longer exist, etc. Agents come to the rescue once again. Since agents are goal-oriented, they accept precise "human requests" and decide how to satisfy them. If an external resource which they require is not available, they are able to take initiatives in an autonomous way and can exert control over their own actions. In other words, they are capable of defining their own objectives and achieving them without human supervision. Let us imagine an agent requires an online resource which is not available; there are several strategies it could adopt. The most intelligent agent would look for another online resource as a substitute (but probably we are expecting too much!). A more realistic view is to contact the client and ask for a link to another resource. Another approach an agent might take is to go to sleep and try to access the service later; it is quite common on the internet for a server to be down for some time. Agents can easily do this because, contrary to traditional software, an agent can decide to start an action at a precise moment, based on the state of the external environment.

The last weakness of the system is accountability. Whenever new data is produced, agents should add to this data some meta information, such as which agent created the data, at what time and from which source it was obtained. Agents should be able to adapt to the user's needs by analysing their past actions. Therefore, if an item of data is reviewed by a user and rejected, the agent should re-evaluate the reliability rating of the source from which the data was obtained. The global reliability rating of a site must be a reflection of the data it produces. The system should also be capable of performing automatic checks without the user's intervention and deciding what appropriate action to take when necessary.

Imagine some seed data is used to train a learning algorithm. If the precision of the algorithm is low when tested on the training data itself, then the agent should realise that the learning algorithm is not good enough and exclude it from the whole process automatically. This is just one of the many automatic tests which can be performed by agents. The degree of automation of these tests depends entirely upon the application and its complexity.
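One possible shape of such an automatic check is sketched below; train and extract are placeholders for whatever learner the agent wraps, and precision is measured conservatively against the seed annotations themselves. This is an illustration of the idea, not a prescribed implementation.

def passes_self_test(train, extract, seeds, page, min_precision=0.9):
    """Train on the seed annotations and re-apply the learner to the same page.

    Precision is computed against the seeds only (a conservative proxy):
    if the learner cannot even reproduce its own training data with high
    precision, the agent should drop it from the process.
    """
    model = train(page, seeds)
    predicted = set(extract(model, page))
    if not predicted:
        return False
    correct = len(predicted & set(seeds))
    return correct / len(predicted) >= min_precision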

6.2.5 The Semantic Annotation Engine

A Semantic Annotation Engine is an engine that, given a document, annotates it with semantic tags. The need for such an engine arises from the fact that the web currently contains millions of documents and its size is constantly increasing (30). Faced with such huge tasks, it is inconceivable that humans will manage to annotate such a huge number of documents. An alternative would be to use automated tools such as Armadillo (refer to Chapter 5). Armadillo makes use of the information present on the web, combined with IE and II techniques, in order to discover new facts in a document. But this is a partial solution and not entirely feasible, even though it would certainly be more successful than manual annotation, since the system would not get bored of annotating huge amounts of documents. There are a number of problems with this approach. First of all, the system is specific to a particular domain and cannot semantically annotate just any document (especially if that document is not related to the current domain). If we imagine that the system could annotate any document, a problem arises regarding which annotations are required for which document. In theory a document could have an infinite number of annotations, because different users would need different views of a particular document. Therefore, deciding which document contains which annotations should not be a task assigned to the annotation system but rather to the user requesting the annotations.

Apart from this, there are still some open issues with regard to annotations per se, such as: should annotations be inserted as soon as a document is created, or not? Some annotations can become out of date very quickly, like weather reports, stock exchange rates, positions inside a company, etc. What would happen when this volatile information changes? Would we have to re-annotate the documents again? That would mean that we would not need annotation engines but re-annotating engines, constantly maintaining annotations. The order of magnitude of the problem would therefore increase exponentially. The second problem is that the rate at which the web is growing is probably much faster than the rate at which such a system could annotate. We would therefore require a small army of Armadillos to try to catch up with such growth.

A more realistic and generic solution than Armadillo would be to provide annotations on demand instead of pre-annotating all the documents, i.e. the Semantic Annotation Engine (SAE). This means that annotations will always be the most recent and there would not be any legacy from the past. If a document is out of date or updated with more recent information, the annotations would still be the most recent. The system would make use of currently available technologies, but in a smarter way, as can be seen in Figure 6.2.

In traditional Information Retrieval (IR) engines (like Google9, Yahoo10, etc.) a collection of documents (such as the web) is crawled and indexed. Queries using a bag of words are used to retrieve the pages within the collection that contain most (if not all) of the words in the query. These engines do a pretty good job at indexing a substantial part of the web and retrieving relevant documents, but they are still far from providing the user with exactly what he needs. Most of the time, users must filter the results returned by the search engine. The SAE would not affect the current IR setup in any way. The search through all the documents would still be performed using the traditional IR methods; the only difference lies in what the engine returns. Instead of passing the results back to the user, they are passed to a SAE. As can be seen from Figure 6.2, the system contains an index of the different SAEs which are available online. The SAEs are indexed using keywords, similarly to how normal indexing of web documents works. Whenever an IR engine receives a query, it not only retrieves the documents with the highest relevance but also the SAEs with the highest relevance. These are then used to annotate the documents retrieved by the search engine before passing them back (annotated) either to the user or to an intermediate system that performs further processing. The SAE will provide on-the-fly annotations for web documents, therefore avoiding the need to annotate the millions of documents on the web beforehand.

One relevant question for the effective usability of this methodology in real applications concerns the required level of accuracy (as a balance of precision and recall) the system has to provide. As far as Web applications are concerned, it is well known that high accuracy is not always required. Search engines are used every day by millions of people, even if their accuracy is far from ideal: further navigation is often required to find satisfying results, large portions of the Web are not indexed (the so-called dark and invisible Webs), etc. Services like Citeseer, although incomplete, are very successful. What really seems to matter is size: the ability to both retrieve information dispersed on the Web and create a critical mass of relatively reliable information.

9http://www.google.com
10http://www.yahoo.com


Figure 6.2: A potential architecture for a Semantic Annotation Engine.
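
To make the flow of Figure 6.2 concrete, the sketch below shows, in Python, how an IR front end could route its results through an index of SAEs before handing them back. It is only an illustration under our own assumptions: the names (SAE, SAEIndex, answer_query) and the keyword-overlap ranking are hypothetical and do not correspond to any existing engine.

# Sketch of the SAE lookup-and-annotate flow of Figure 6.2.
# All classes and functions are hypothetical illustrations.

from dataclasses import dataclass, field


@dataclass
class SAE:
    """A Semantic Annotation Engine registered under a set of keywords."""
    name: str
    keywords: set

    def annotate(self, document: str) -> str:
        # A real SAE would run IE/NLP here; this stub only marks the document.
        return f"<annotated by='{self.name}'>{document}</annotated>"


@dataclass
class SAEIndex:
    """Keyword index over the available SAEs, mirroring the ordinary web index."""
    engines: list = field(default_factory=list)

    def best_for(self, query: str, k: int = 1) -> list:
        terms = set(query.lower().split())
        ranked = sorted(self.engines,
                        key=lambda e: len(e.keywords & terms),
                        reverse=True)
        return [e for e in ranked[:k] if e.keywords & terms]


def answer_query(query: str, ir_results: list, sae_index: SAEIndex) -> list:
    """Annotate the IR results with the most relevant SAE before returning them."""
    saes = sae_index.best_for(query)
    if not saes:
        return ir_results  # no relevant SAE: behave exactly like plain IR
    return [saes[0].annotate(doc) for doc in ir_results]


if __name__ == "__main__":
    index = SAEIndex([SAE("seminar-sae", {"seminar", "talk", "speaker"})])
    docs = ["The seminar on IE starts at 3pm in Regent Court."]
    print(answer_query("seminar speaker time", docs, index))

The point of the design is that the SAE sits behind the existing IR machinery: when no relevant SAE is known for a query, the pipeline degrades gracefully to ordinary retrieval.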

6.2.6 The Semantic Web Proxy

A Semantic Web Proxy (SWP) is similar to a normal web proxy (an intermediate server that sits between the client and the origin server: it accepts requests from clients, forwards them to the origin server, and returns the response to the client; if several clients request the same content, the proxy can deliver it from its cache rather than requesting it from the origin server each time, thereby reducing response time). It provides all the functionality associated with such a program. The main difference is that it also provides some semantic functions hidden from the user.

To understand what these semantic functions are, let us take a look at a typical example. Imagine a user would like to purchase a computer having a 17-inch flat screen monitor, 1 GByte of RAM, a 3 GHz processor, etc. The user would go to an online store, look for a search form where he can enter the search criteria and perform the search for the desired product. This operation must be repeated for each and every online store he would like to query.

A SWP would simplify this process in several ways. First of all, it would keep a record of all the forms being filled in during a single session (a series of transactions or hits made by a single user; if there has been no activity for a period of time, followed by the resumption of activity by the same user, a new session is considered started) and the content of these

forms. If the forms are semantically tagged or associated with an ontology, then the details inserted by the user are automatically tagged as well. Once ready, the user would then go to a different site and perform a similar search. The system would notice that the form in the new site is filled with the same details as those in the other site and it would take the following actions:

1. tag the details inserted by the user with the semantic tags found in the new site.

2. create equivalence relationships between the semantic tags in the first site and the tags in the second site.

3. make these relationships available to everyone.

The effect of this process is the following. If another user performs a search for a similar item in one of the sites already known to the SWP, the system uses the shared relationships (obtained before through the interaction with the other users) and automatically searches the other online stores. All the results are returned to the user for evaluation.

The system basically creates links between ontologies collaboratively, in a shared way, without the user realising it. It does so by examining the browsing habits of the user and deducing implicit relations when possible. Once again, the system adopts the same ideas used to create the web: it exploits the small contributions of many users (without adding any extra effort on their part) to automatically create relationships between concepts over the web.
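
A minimal sketch of the bookkeeping just described is given below: the proxy records the fields filled in during a session and, whenever the same value reappears on a second site, emits an equivalence relationship between the two sites' tags. The names (FormFill, EquivalenceStore, observe_session) and the value-matching rule are assumptions made purely for illustration.

# Hypothetical sketch of how an SWP could derive tag equivalences
# from forms filled in during one browsing session.

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FormFill:
    site: str    # e.g. "store-a.example"
    tag: str     # semantic tag / ontology term attached to the field
    value: str   # what the user typed, e.g. "17 inch"


class EquivalenceStore:
    """Shared store of (site, tag) pairs believed to denote the same concept."""

    def __init__(self):
        self._equiv = set()

    def add(self, a: FormFill, b: FormFill):
        self._equiv.add(frozenset({(a.site, a.tag), (b.site, b.tag)}))

    def equivalents(self, site: str, tag: str):
        for pair in self._equiv:
            if (site, tag) in pair:
                yield from (p for p in pair if p != (site, tag))


def observe_session(fills, store: EquivalenceStore):
    """Link the tags of two sites whenever the same value was entered on both."""
    by_value = defaultdict(list)
    for f in fills:
        by_value[f.value.strip().lower()].append(f)
    for group in by_value.values():
        for i, a in enumerate(group):
            for b in group[i + 1:]:
                if a.site != b.site:
                    store.add(a, b)


if __name__ == "__main__":
    store = EquivalenceStore()
    session = [FormFill("store-a.example", "screenSize", "17 inch"),
               FormFill("store-b.example", "displayDiagonal", "17 inch")]
    observe_session(session, store)
    print(list(store.equivalents("store-a.example", "screenSize")))

Matching on raw values is of course naive; the only point is that equivalences between ontologies can accumulate, and be shared, as a side effect of ordinary browsing.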

6.2.7 Semantic Plugins

Semantic Plugins (SP) are application add-ons which can be inserted into any application and which have two major tasks. First of all, they must harvest semantic information from the user without the user realising it; this information must be stored in ways which can be reused by other applications. Secondly, they must provide a value-added service to the user without increasing the effort of using the application. If they do slightly increase the effort required, they must make sure that the return is worth the extra effort. This second requirement is important because, as we have already seen in Section 6.1 (with the example of the Meta Tag), if a user does not get an immediate return from using a new technology, then no one will go through the learning curve involved in using that technology. As an example of an SP, imagine a normal word processor enhanced with an SP aimed specifically at paper writing.

If we look at the process involved in writing a paper (or any other document) we immediately realise that it is quite an unnatural process. People just dump down their thoughts in an ordered way, then they must further structure the text, making sure their ideas are coherent, and finally gather external evidence and supporting statements from other documents. All this makes the whole process quite difficult to perform unless the author is a skilled writer, well trained in the art of writing. Unfortunately, this writing methodology is most probably far from the way human thoughts are structured. Information is not stored in the brain as one big structured document but rather as small atomic and composite concepts attached together by several relationships. The idea we propose here is not to drastically change the way humans have been writing for thousands of years, but rather to start moving slowly towards a new model of writing which is slightly more natural than traditional methods. Whilst doing so, we must not forget our main aim, i.e. to provide tools in normal text editors which allow people to transparently insert semantics into documents without themselves knowing it.

Let us look at a scenario where a user is writing a paper. The process involves writing the paper and adding references to papers which the author previously read. So basically, the paper is a self-contained piece of knowledge which has no physical links to anything else. There are only implicit semantic links expressed in the form of references, and one needs a powerful processor, like a human or some clever code, in order to work out which link refers to what. The SP would be a program running in the background which analyses what is being written. It grabs a sentence at a time (it could equally work on paragraphs or phrases) and compares it with papers already published. When it finds similar sentences, it shows the extracts to the user in a continuously updated list. The user sees this list, can read the extract and insert a reference where he is writing by just clicking on the link. Adding references to a paper is a must for every author, therefore this program will make the user's life easier: there is no need to dig through his memory trying to remember in which paper he read about a particular concept many months ago. This program will also add value to his paper: if there is a relevant paper which the user did not read, the program might be able to find it for him. Contrary to what happens today, a paper produced with the help of the SP is not a self-contained piece of knowledge but should contain links to semantic information about the references. These links could point to a location inside the same document or to a URI somewhere over the Internet. In short, the paper should be semantically annotated with tags stating that a particular concept in the document is the same as the concept defined in the paper being referenced. These relationships can be created automatically by the system, since each time a user inserts a reference he is implicitly creating a semantic link between that portion of the text and the contents of the reference. What we are doing differently here is that the SP automatically and explicitly defines that relationship in the document.
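
As an illustration of the background behaviour just described, the following sketch compares each typed sentence with the sentences of known papers and proposes the closest matches as candidate references. The similarity measure (plain word overlap) and all names (tokens, suggest_references) are hypothetical simplifications of whatever a real plugin would use.

# Hypothetical sketch of a Semantic Plugin suggesting references while
# the author types, by comparing the current sentence with known papers.

import re


def tokens(text: str) -> set:
    """Lowercased word set; a stand-in for real linguistic processing."""
    return set(re.findall(r"[a-z]+", text.lower()))


def similarity(a: str, b: str) -> float:
    """Jaccard overlap between two sentences, between 0.0 and 1.0."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def suggest_references(sentence: str, papers: dict, threshold: float = 0.4):
    """Return (paper id, matching sentence, score) for sufficiently similar extracts."""
    hits = []
    for paper_id, sentences in papers.items():
        for s in sentences:
            score = similarity(sentence, s)
            if score >= threshold:
                hits.append((paper_id, s, score))
    return sorted(hits, key=lambda h: h[2], reverse=True)


if __name__ == "__main__":
    corpus = {
        "paperA": ["Adaptive information extraction reduces annotation cost."],
        "paperB": ["Ontologies describe shared conceptualisations of a domain."],
    }
    typed = "Annotation cost can be reduced by adaptive information extraction."
    for paper, extract, score in suggest_references(typed, corpus):
        print(f"{paper}: {extract!r} (score {score:.2f})")

A confirmed click on such a suggestion would insert both the citation and a machine-readable link between the sentence and the cited paper, which is precisely what turns an ordinary reference into a semantic annotation.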

Defining these relationships explicitly allows the system to associate pieces of text inside one document with pieces of text inside another document. All the information provided by the user is kept, so another system could infer with a very high degree of certainty that line X in document A is associated with line Y in document B. The parts not associated with any other document are probably original parts referring to the new work in the document per se. The same should be done for picture references, table references, etc.: they are not objects inserted in the document at random; they normally have a caption associated with them and, apart from that, they are mentioned somewhere in the document as well. The document could be further analysed with NLP techniques such as co-reference resolution, relational IE, etc. in order to make it easier to find new facts. Also, since the SP in this domain would know things like authors, it could examine previous publications of the same author(s) and find similarities between them. In this way it could find inter-document connections between documents written by the same authors. These connections might help establish which section of text describes the innovative part of one document with respect to another: the text in the new document that is not shared with the older documents is likely to be its innovative part.

Documents could also be indexed semantically. The system would go about achieving this by guiding the user into writing a document from a new perspective. Instead of writing a whole document, the user could be asked to first write a storyline made up of just points, and then gradually expand and restructure the order of those points. How the document evolves from a set of points to sentences and then paragraphs is all stored, since every step is a refinement of the previous one. These refinements could therefore be used to semantically index the document. This kind of indexing would offer many benefits. Let us imagine that, some months later, the user is compiling a report using these documents. The typical process involves going through the written documents looking for the paragraphs required for the new document. If, instead of this, we have a semantic index which stores different granularities of the same idea (using the refinement technique mentioned before), the user could just look for specific paragraphs using their semantic content as search terms. The system would then insert the pieces of text inside the new document, and the job of the user would just be to glue the different pieces together. Some smart grammatical analysis could also be done at this stage to propose a possible glue; basically, the job of the user would just be to check the document and correct it. The portion of text extracted from the other documents could either be a copy of the text or just a link to the other document. In this way, if the text in the original document is updated, the text in the new document is updated automatically.

SPs propose a new way of interacting with the user which does not add any extra burden on the user, but which adds value since the discovery of references is facilitated and the writing of a document is easier. Value is also added to the document itself, since it would contain semantic information stored inside it. This information can be further used for a variety of applications, like automatically compiling documents from other documents using just a guideline provided by the user, or making semantic search engines a reality since the engines would index sub-parts of a document semantically, etc.
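
The refinement idea can be pictured as a tree whose root is a storyline point and whose children are successively more detailed versions of it. The sketch below (all names hypothetical) shows how such refinements could be stored and then searched at whatever granularity a later document needs.

# Hypothetical sketch of a refinement tree used as a semantic index:
# each node refines its parent (point -> sentence -> paragraph, ...).

from dataclasses import dataclass, field


@dataclass
class Refinement:
    text: str
    children: list = field(default_factory=list)

    def refine(self, text: str) -> "Refinement":
        """Record a more detailed version of this node and return it."""
        child = Refinement(text)
        self.children.append(child)
        return child

    def search(self, query: str) -> list:
        """Return every stored granularity whose text mentions all query words."""
        words = query.lower().split()
        hits = [self.text] if all(w in self.text.lower() for w in words) else []
        for child in self.children:
            hits.extend(child.search(query))
        return hits


if __name__ == "__main__":
    point = Refinement("annotation on demand avoids pre-annotating the web")
    sentence = point.refine(
        "Annotating on demand avoids the need to pre-annotate millions of "
        "web documents.")
    sentence.refine(
        "Because annotations are produced only when a document is requested, "
        "annotating on demand avoids pre-annotating millions of documents and "
        "never serves stale annotations.")
    # Retrieve the idea at whatever granularity suits the new report.
    for granularity in point.search("on demand"):
        print("-", granularity)

Because every refinement keeps its link to the coarser version it came from, a later report can pull in exactly the level of detail it needs and nothing more.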

6.3 Summary

In summary, we have presented two methodologies for annotating documents, one semi-automatic and the other fully automatic. They make use of several Artificial Intelligence techniques such as Natural Language Processing (NLP) and Machine Learning (ML). The combination of these techniques contributes towards creating systems capable of advanced text analysis, which manage to discover which parts of the text to annotate and which can easily be adapted to new domains. We believe that these technologies are effective enough to support the annotation process which the SW requires. This smart combination of techniques made it possible to achieve SW objectives which were previously unreachable.
