A Framework to Support Developers in the Integration and Application of Linked and Open Data
This copy of the thesis has been supplied on condition that anyone who consults it is understood to recognize that its copyright rests with its author and that no quotation from the thesis and no information derived from it may be published without the author’s prior consent.

A FRAMEWORK TO SUPPORT DEVELOPERS IN THE INTEGRATION AND APPLICATION OF LINKED AND OPEN DATA

by

TIMM HEUSS

A thesis submitted to Plymouth University in partial fulfilment of the degree of

DOCTOR OF PHILOSOPHY

Faculty of Science and Engineering

April 2016

Abstract

Timm Heuss: A Framework to Support Developers in the Integration and Application of Linked and Open Data

In recent years, the number of freely available Linked and Open Data datasets has multiplied into tens of thousands. The number of applications taking advantage of these, however, has not. Thus, large portions of potentially valuable data remain unexploited and are inaccessible for lay users. Therefore, the upfront investment in releasing data in the first place is hard to justify. The lack of applications needs to be addressed in order not to undermine efforts put into Linked and Open Data. In existing research, strong indicators can be found that the dearth of applications is due to a lack of pragmatic, working architectures supporting these applications and guiding developers.

In this thesis, a new architecture for the integration and application of Linked and Open Data is presented. Fundamental design decisions are backed up by two studies: firstly, based on real-world Linked and Open Data samples, characteristic properties are identified. A key finding is that large amounts of structured data display tabular structures, do not use clear licensing and involve multiple different file formats. Secondly, following on from that study, a comparison of storage choices in relevant query scenarios is made.
It includes the de-facto standard storage choice in this domain, triple stores, as well as relational and NoSQL approaches. Results show significant performance deficiencies of some technologies in certain scenarios. Consequently, when integrating Linked and Open Data in scenarios with application-specific entities, the first choice of storage is relational databases.

Combining these findings and related best practices of existing research, a prototype framework is implemented using Java 8 and Hibernate. As a proof-of-concept, it is employed in an existing Linked and Open Data integration project. Thereby, it is shown that a best-practice architectural component is introduced successfully, while the development effort to implement specific program code can be simplified.

Thus, the present work provides an important foundation for the development of semantic applications based on Linked and Open Data and potentially leads to a broader adoption of such applications.

Table of Contents

Abstract
List of Figures
List of Tables
List of Listings
Acknowledgements
Author’s Declaration
Abbreviations

1 Introduction
  1.1 Topics and Motivation
    1.1.1 World Wide Web
    1.1.2 Semantic Web
    1.1.3 Open Data
    1.1.4 Linked Data
    1.1.5 Term Usage in the present Work
    1.1.6 Research Project Mediaplatform
  1.2 Problem Statement
  1.3 Research Question
  1.4 Outline of Thesis

2 Background and Literature Review
  2.1 Reference Architectures
    2.1.1 Semantic Web Layered Cake, 2000-2006
    2.1.2 Comprehensive, Functional, Layered Architecture (CFL), 2008
    2.1.3 5-Star LOD Maturity Model, 2006
    2.1.4 Linked Data Application Architectural Patterns, 2011
  2.2 Characteristics of Linked Data and RDF
    2.2.1 Uniformity, Referencibility, Timeliness
    2.2.2 RDF is Entity-Agnostic
    2.2.3 Not Consistently Open, Not Consistently Linked
    2.2.4 Quality Issues
    2.2.5 Completeness
    2.2.6 Simple Constructs
    2.2.7 JSON-LD and SDF
    2.2.8 Pay-as-you-go Data Integration
    2.2.9 Data Portal Analysis
  2.3 Storage
    2.3.1 Typical Triple Stores
    2.3.2 NoSQL and Big Data
    2.3.3 Translation Layers
  2.4 Storage Benchmarks
    2.4.1 Lehigh University Benchmark (LUBM)
    2.4.2 SPARQL Performance Benchmark (SP2Bench)
    2.4.3 Berlin SPARQL Benchmark (BSBM)
    2.4.4 DBpedia SPARQL Benchmark (DBPSB)
    2.4.5 Linked Data Benchmark Council Semantic Publishing Benchmark (LDBC SPB)
  2.5 Conceptual Data Integration
    2.5.1 Knowledge Discovery in Databases Process
    2.5.2 Data Warehouse Architecture
    2.5.3 News Production Content Value Chain
    2.5.4 Multimedia / Metadata Life-Cycles
    2.5.5 Linked Media Workflow
    2.5.6 Linked Data Value Chain
  2.6 Data Integration Frameworks
    2.6.1 Open PHACTS
    2.6.2 WissKI
    2.6.3 OntoBroker
    2.6.4 LOD2 Stack
    2.6.5 Catmandu
    2.6.6 Apache Marmotta
  2.7 Linked Data Applications
    2.7.1 Building Linked Data Applications
    2.7.2 Search Engines
    2.7.3 BBC
    2.7.4 Wolters Kluwer’s LOD2 Stack Application
    2.7.5 KDE Nepomuk
    2.7.6 Cultural Heritage
    2.7.7 Natural Language Processing
  2.8 Development Paradigms
    2.8.1 Speed
    2.8.2 Data Binding
    2.8.3 Polyglot Persistence
    2.8.4 New Web Development Approaches
  2.9 Summary of Findings
    2.9.1 Entity-Agnostic RDF versus Developer APIs
    2.9.2 Storage in Triple Stores
    2.9.3 RDF-centred Architectures versus Application-specific Architectures
    2.9.4 Essential Steps Not Covered by Existing Architectures
    2.9.5 Warehousing Data
  2.10 Conclusions

3 Research Approach
  3.1 Research Methodology
    3.1.1 Field Study
    3.1.2 Dynamic Analysis
    3.1.3 Architecture Analysis and Informed Argument
    3.1.4 Case Study
  3.2 Research Design
    3.2.1 Field Study on Data at CKAN-based repositories
    3.2.2 Benchmark of Storage Technologies
    3.2.3 Architecture of a Linked and Open Data Integration Chain
    3.2.4 Proof of concept Framework Application

4 Framework for the Integration of Linked and Open Data
  4.1 Field Study On Linked and Open Data at Datahub.io
    4.1.1 Common Formats
    4.1.2 Popular Formats
    4.1.3 Licences
    4.1.4 Ages
    4.1.5 In-Depth RDF Analysis Results
    4.1.6 Summary of the Findings
  4.2 Storage Evaluation for a Linked and Open Data Warehouse
    4.2.1 Selection of Databases
    4.2.2 Installation
    4.2.3 Setup
    4.2.4 Storage Decisions
    4.2.5 MARC versus RDF
    4.2.6 Overview of Measurement Results
    4.2.7 Stability of Operation
    4.2.8 Loading Times
    4.2.9 Entity Retrieval
    4.2.10 Caching
    4.2.11 Conditional Table Scan
    4.2.12 Aggregation
    4.2.13 Graph Scenarios
    4.2.14 Change Operations
    4.2.15 Summary of the Findings
  4.3 Development of a Linked and Open Data Integration Framework
    4.3.1 Comparison of Existing Approaches
    4.3.2 Gap Analysis
    4.3.3 Requirements
    4.3.4 The LODicity Framework
    4.3.5 Design Decisions
    4.3.6 Main Components
    4.3.7 Execution Process