Application of Link Integrity Techniques from Hypermedia to the Semantic Web
UNIVERSITY OF SOUTHAMPTON
Faculty of Engineering and Applied Science
Department of Electronics and Computer Science

A mini-thesis submitted for transfer from MPhil to PhD

Supervisors: Prof. Wendy Hall and Dr Les Carr
Examiner: Dr Nick Gibbins

Application of Link Integrity Techniques from Hypermedia to the Semantic Web

by Rob Vesse

February 10, 2011

ABSTRACT

As the Web of Linked Data expands it will become increasingly important to preserve data and links such that the data remains available and usable. In this work I present a method for locating linked data to preserve which functions even when the URI the user wishes to preserve does not resolve (i.e. is broken/not RDF), and an application for monitoring and preserving the data. This work is based upon the principle of adapting ideas from hypermedia link integrity in order to apply them to the Semantic Web.

Contents

1 Introduction
  1.1 Hypothesis
  1.2 Report Overview
2 Literature Review
  2.1 Problems in Link Integrity
    2.1.1 The 'Dangling-Link' Problem
    2.1.2 The Editing Problem
    2.1.3 URI Identity & Meaning
    2.1.4 The Coreference Problem
  2.2 Hypermedia
    2.2.1 Early Hypermedia
      2.2.1.1 Halasz's 7 Issues
    2.2.2 Open Hypermedia
      2.2.2.1 Dexter Model
    2.2.3 The World Wide Web
      2.2.3.1 Search Engine
  2.3 Link Integrity
    2.3.1 Link Integrity in Hypermedia
      2.3.1.1 Microcosm - LinkEdit
      2.3.1.2 HyperG - p-flood
    2.3.2 Link Integrity on the World Wide Web
      2.3.2.1 Permanent Identifier Services
      2.3.2.2 Replication & Versioning
      2.3.2.3 Robust Hyperlinks
    2.3.3 Link Integrity for the Semantic Web
      2.3.3.1 Replication & Versioning
      2.3.3.2 The Co-Reference Problem
      2.3.3.3 Link Maintenance
      2.3.3.4 Vocabularies
3 Method
  3.1 Recovery
    3.1.1 Expansion Algorithm
      3.1.1.1 Expansion Profiles
      3.1.1.2 Algorithm Design
      3.1.1.3 Default Profile
  3.2 Preservation
    3.2.1 All About That
      3.2.1.1 Schema
      3.2.1.2 Profile Creation & Update
      3.2.1.3 Change Reporting
      3.2.1.4 Architecture & Scalability
4 Results
  4.1 Expansion Algorithm
    4.1.1 Initial Testing
    4.1.2 Large-Scale Experiments
      4.1.2.1 BBC Programmes
      4.1.2.2 DBPedia Countries
      4.1.2.3 ECS People
  4.2 Preservation
    4.2.1 Initial Experiment - BBC Programmes
    4.2.2 Expanded Experiment - BBC Programmes
5 Conclusions and Future Work
  5.1 Conclusions
  5.2 Future Work
    5.2.1 Refining the Expansion Algorithm
    5.2.2 Improving AAT
    5.2.3 Aims
    5.2.4 Potential Applications
Bibliography
Raw Results
BBC Programmes Expansion Profiles

List of Figures

1.1 Linked Data Cloud October 2007
1.2 Linked Data Cloud September 2008
1.3 Linked Data Cloud July 2009
1.4 Linked Data Cloud September 2010
3.1 LOD Cloud extract - What if DBPedia were to disappear?
3.2 Lookup endpoint processing model
3.3 Discovery endpoint processing model
3.4 Original Triple
3.5 Triple transformed to AAT Annotated Form
3.6 All About That Architecture
4.1 BBC 1 Programmes with Domain-specific Profile - Number of Graphs vs. Number of Triples
4.2 DBPedia Countries - Number of Graphs vs. Number of Triples (Triples < 30,000)
4.3 DBPedia Countries - Number of Graphs vs. Number of Triples (Triples > 500,000)
4.4 ECS People - Number of Graphs vs. Number of Triples
4.5 BBC Programmes Demonstration Application built on top of data from AAT
4.6 BBC 1 Programmes Dataset - Changed Profiles over Time
4.7 BBC 1 Programmes Dataset - Total Changes over Time
4.8 BBC 1 Programmes Dataset - Average Changes over Time
4.9 BBC 1 Programmes Dataset - Total Changes vs Changed Profiles

List of Tables

4.1 Sample Expansion Algorithm Results
4.2 BBC 1 Programmes Dataset - Changes over 1 week
4.3 BBC 1 Programmes Dataset - Changes over 1 month
1 Expansion Algorithm Results for BBC 1 Programmes with Default Profile
2 Expansion Algorithm Results for BBC 1 Programmes with BBC Domain-specific Profile
3 Expansion Algorithm Results for DBPedia Countries
4 Expansion Algorithm Results for ECS People

Chapter 1 Introduction

Hypermedia is a technology which evolved from an idea first proposed by Vannevar Bush that 'the human mind does not work by alphabetical or numerical linking but through association of thoughts' (Bush, 1945). In his article he presented an idea for a device called the 'memex' which would allow a person to browse large collections of information, link items together and add annotations to them. From this concept the idea of hypermedia was born, its aim being to provide a mechanism for linking together collections of accumulated knowledge in an interesting and useful way in order to improve access to them and to express the relationships between items of information.

Thus there is an obvious problem in hypermedia with regard to what happens when links do not work as intended. Links are unfortunately susceptible to becoming 'broken' in a number of different ways, and this has become an open research question, particularly since the 1990s and the advent of large-scale hypermedia systems like the World Wide Web (Berners-Lee et al., 1992). This problem is known as link integrity and can be divided into two main problems: a) the 'dangling-link' problem and b) the editing problem.

A limitation of hypermedia is that it tends to lead to a very document-centric interaction and navigation model, as seen with the Web. As a result content is very much aimed at humans, and information discovery often relies on users searching for topics they are interested in. The Semantic Web is an extension to the existing document web, inspired by ideas voiced by Tim Berners-Lee (Berners-Lee, 1998), which aims to augment the existing web with machine-readable data. The value of this is that it allows machines to retrieve and process structured data from the Web without relying on complex and somewhat inaccurate data extraction techniques such as natural language processing.

The standard model for data on the Semantic Web is RDF, a syntax-independent abstract model specified by the W3C (Klyne and Carroll, 2004) which represents data in the form of graphs. Each relationship between two nodes in the graph is a triple formed of a subject, predicate and object, where the subject and object are nodes representing some resource or value and the predicate represents the relationship between them. Resources and relationships are identified using URIs, which means that on the Semantic Web every individual triple represents a link between two resources.
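To make the triple model concrete, the following minimal sketch builds a single triple whose subject, predicate and object are all identified by URIs, showing how one statement acts as a directed link between two resources. The use of the Python rdflib library, the example URIs and the borrowed FOAF 'knows' predicate are illustrative assumptions of mine rather than anything prescribed by this work.

    from rdflib import Graph, URIRef

    # Illustrative URIs only: any resolvable URIs could stand in here
    subject = URIRef("http://example.org/people/alice")    # a resource
    predicate = URIRef("http://xmlns.com/foaf/0.1/knows")  # the relationship (FOAF 'knows')
    obj = URIRef("http://example.org/people/bob")          # another resource

    g = Graph()
    g.add((subject, predicate, obj))  # one triple = one directed link between two resources

    # N-Triples serialisation shows the raw subject-predicate-object statement
    print(g.serialize(format="nt"))

Because the object here is itself a URI naming another resource, the statement behaves exactly like a hyperlink and is just as capable of breaking if either URI ceases to resolve.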
As a result it becomes more important than ever to be able to maintain and preserve links and to be able to recover useful data in the event of failure. Due to its nature the Semantic Web introduces two additional link integrity problems that need to be dealt with. Since a URI can be minted (see Definition 1.1) by anyone and used to refer to any concept they want, how do you determine what the meaning of a URI is? Additionally, it is possible to state that the concept identified by some URI A is the same as the concept identified by some URI B, but what does it actually mean to say this? These two problems are known respectively as a) URI identity & meaning and b) co-reference.

Definition 1.1. Minting a URI is the act of establishing the association between the URI and the resource it denotes. A URI MUST only be minted by the URI's owner or delegate. Minting a URI from someone else's URI space is known as URI squatting (Booth, 2009).

The biggest segment of the Semantic Web at the current time is Linked Data, a project that started as a community movement aimed at bootstrapping the Semantic Web by getting large data sources out on the web in the form of RDF and making links between them. It started out by converting a number of large, freely available data sources such as Wikipedia into RDF, and gradually many smaller datasets have grown up around these initial hubs, as seen in Figures 1.1-1.4. The Linked Data project is of particular interest since it provides a large amount of RDF data, and the applications built upon it are heavily reliant on the interlinking between different datasets. This provides a comprehensive selection of sources of real-world data to test possible solutions against, and is a domain where link integrity tools would benefit end users.

1.1 Hypothesis

My hypothesis in this research is that link integrity is an issue of great importance to the Semantic Web.
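As a rough illustration of the kind of failure this hypothesis is concerned with, the sketch below tries to dereference a URI and parse the response as RDF, treating anything that does not resolve to triples as a 'dangling' link. This is only a hedged sketch of the general problem under my own assumptions; the function name and example URI are hypothetical, and it is not the recovery method or the AAT application described later in this report.

    from rdflib import Graph

    def is_dangling(uri: str) -> bool:
        """Return True if the URI fails to resolve to parseable RDF.

        A crude check only: real link-integrity tooling would also need to
        handle redirects, timeouts and transient failures before declaring
        a link broken.
        """
        try:
            g = Graph()
            g.parse(uri)          # fetch the URI and attempt to parse the response as RDF
            return len(g) == 0    # it resolved, but the response contained no triples
        except Exception:
            # e.g. HTTP 404/410, DNS failure, or a response that is not RDF at all
            return True

    # Hypothetical usage:
    # print(is_dangling("http://dbpedia.org/resource/Semantic_Web"))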