A Semantic Framework for Integrating and Publishing Linked Data on the Web

A Semantic Framework for Integrating and Publishing Linked Data on the Web by Aparna Padki A Thesis Presented in Partial Fulfillment of the Requirements for the Degree Master of Science Approved June 2016 by the Graduate Supervisory Committee: Srividya Bansal, Chair Ajay Bansal Timothy Lindquist ARIZONA STATE UNIVERSITY August 2016 ABSTRACT Semantic web is the web of data that provides a common framework and technologies for sharing and reusing data in various applications. In semantic web terminology, linked data is the term used to describe a method of exposing and connecting data on the web from different sources. The purpose of linked data and semantic web is to publish data in an open and standard format and to link this data with existing data on the Linked Open Data Cloud. The goal of this thesis to come up with a semantic framework for integrating and publishing linked data on the web. Traditionally integrating data from multiple sources usually involves an Extract-Transform-Load (ETL) framework to generate datasets for analytics and visualization. The thesis proposes introducing a semantic component in the ETL framework to semi-automate the generation and publishing of linked data. In this thesis, various existing ETL tools and data integration techniques have been analyzed and deficiencies have been identified. This thesis proposes a set of requirements for the semantic ETL framework by conducting a manual process to integrate data from various sources such as weather, holidays, airports, flight arrival, departure and delays. The research questions that are addressed are: (i) to what extent can the integration, generation, and publishing of linked data to the cloud using a semantic ETL framework be automated; (ii) does use of semantic technologies produce a richer data model and integrated data. Details of the methodology, data collection, and application that uses the linked data generated are presented. Evaluation is done by comparing traditional data integration approach with semantic ETL approach in terms of effort involved in integration, data model generated and querying the data generated. i DEDICATION This thesis is dedicated to my husband, Srikar Deshmukh for his constant support, motivation and love. I also dedicate this thesis to my family for their blessings and support. ii ACKNOWLEDGMENTS I would like to thank my supervisor, Dr. Srividya Bansal, for her constant support, guidance and motivation during the course of the development of this thesis and through the entire coursework of my Master’s degree. My appreciation and thanks also goes to Dr. Ajay Bansal and Dr. Timothy Lindquist for their support and encouragement throughout my time in Arizona State University. I would also like to acknowledge and thank Jaydeep Chakraborty for his contribution in the evaluation process and Neha Singh for her contribution in researching various traditional ETL tools. iii TABLE OF CONTENTS Page LIST OF TABLES ................................................................................................................... vi LIST OF FIGURES ............................................................................................................... vii CHAPTER 1 INTRODUCTION ………………………………………………………..1 1.1 Motivation ………………………………………………………...1 1.2 Problem Statement………………………………………………...4 1.3 Scope …………………………………………………………..….4 2 BACKGROUND…………………………………………………….........6 2.1 Semantic Technologies……………………………........................6 2.2 Extract Transform Load Frameworks……………………………11 2.3 Linked Data………………………………………………………17 3 RELATED WORK………………………………………………………20 3.1 Data Integration………………………………………………….20 3.2 Linked Data Generation………………………………………….23 3.3 Semantic ETL…………………………………………………....25 4 DATA EXTRACTION…………………………………………………..28 4.1 Flights Delays Data Set………………………………………….29 iv CHAPTER Page 4.2 Weather Data Set…………………………………………..........29 4.3 Holiday Data Set…………………………………………...........30 5 SEMANTIC DATA INTEGRATION – METHODOLOGY ..…………31 5.1 Semantic Data Model……………………………………………31 5.2 Linked Data Integration………………………………………….32 6 CASE STUDY IMPLEMENTATION……...…………………………...35 6.1 High Level Design ………………………………………………35 6.2 Implementation Details…………………………..........................37 7 EVALUATION AND RESULTS…………………………...…………...41 7.1 Requirements For Semantic ETL Framework ...………………...41 7.2 Comparison with Traditional ETL…………….………………...49 8 CONCLUSION AND FUTURE WORK………………………………..55 8.1 Conclusion……………………………………………………….55 8.2 Future Work……………………………………………………...55 REFERENCES....................................................................................................................... 57 APPENDIX A WEB APPLICATION SOURCE CODE ............................................................... 62 B PROOF OF CONCEPT SOURCE CODE ............................................................. 74 v LIST OF TABLES Table Page 1. Comparison of ETL Tools ................................................................................... 15 2. Number of Entities in the Ontology…………………………………………...31 3. Size of Individual Datasets……………………………………………………33 4. Query Execution time Traditional vs Semantic……………………………….52 vi LIST OF FIGURES Figure Page 1. Data warehouse Architecture ........................................................................... 2 2. Statement in Form of a Simple Graph ............................................................. 7 3. Hierarchy in the Ontology ............................................................................... 9 4. Clover ETL Sample Workflow ...................................................................... 12 5. LOD Cloud Diagram ..................................................................................... 18 6. Relationship Between Datasets Used and their Sources ............................... 28 7. Semantic Data Model ..................................................................................... 32 8. Data Model in Karma Tool ............................................................................ 33 9. Landing Page of Web Application ................................................................ 36 10. Result of First Search .................................................................................... 36 11. Architecture for Case Study ......................................................................... 37 12. Semantic ETL Process .................................................................................. 43 13. Landing Page ................................................................................................. 48 14. Page After Uploading Ontology File and Selecting a Class ........................ 48 15. Extending Linked Vocabulary - Architecture ............................................. 49 16. Database Schema .......................................................................................... 52 17. Data integration and linking using Semantic ETL ....................................... 53 18. Data integration and linking using Traditional ETL .................................... 54 vii CHAPTER 1 INTRODUCTION 1.1 Motivation With the widespread usage of internet, social media, technology devices and mobile applications, big data is only getting bigger and the sources of this data, types of format of data (structured and unstructured) are increasing as well. In order to be able to utilize this big data, a data integration process from these aforementioned heterogeneous sources is imperative. A wide variety of data warehouses exist that store these huge datasets to be analyzed and visualized for various business/research needs. The definition of a data warehouse is a central repository for all or significant parts of data that an enterprise’s various business systems collect. From this definition, we can understand two important concepts in a data warehouse: (1) The source(s) of data populated in this repository is usually multiple (2) the repository should hold this large amount of data in a coherent/useful format, irrespective of the disparate sources. The general architecture of a data warehouse is shown in Figure 1. 1 Figure 1 Data warehouse Architecture In order to store data from a wide range of sources into a central repository, Extract- Transform-Load (ETL) tools are widely used [1]. Each stage in an ETL process can be defined as: Extract the data – Obtain data from different sources and convert it into an intermediate format. Transform the data - Business rules are applied to the data. Also, to make the data more efficient columns are split or merged. This is done in order to combine data from the extract phase to make it more meaningful and useful. This phase also involves cleaning of the data. Load the data – The transformed data is loaded to a data warehouse for a single access of this data for analytics and visualizations. 2 In the semantic web world, we can draw a parallel to this process of loading a data warehouse for publishing linked data to the linked open data cloud [2]. A Semantic component (i.e., richer representation of the data model to provide more meaning, context, and relationships) in the transform phase is the major missing component in the traditional ETL tools if it has to be used for generating linked data. By introducing this semantic element, we can associate a domain-specific ontology to the key elements/fields in the dataset that is under integration. This is essential in generating linked data and connecting it to the linked open data (LOD) cloud. Using a semantic ETL framework, a richer

A Semantic Framework for Integrating and Publishing Linked Data on the Web

Normalized Form Snowflake Schema

Data Warehouse: an Integrated Decision Support Database Whose Content Is Derived from the Various Operational Databases

The Design of Multidimensional Data Model Using Principles of the Anchor Data Modeling: an Assessment of Experimental Approach Based on Query Execution Performance

Chapter 7 Multi Dimensional Data Modeling

ER/Studio Enterprise Data Modeling

Advantages of Dimensional Data Modeling

Bio-Ontologies Submission Template

Round-Trip Database Support Gives ER/Studio Data Architect Users The

Data Model Transformations: Relational to Dimensional

Object Triple Mapping Bridging Semantic Web and Object- Oriented Programming

Query Optimizing for On-Line Analytical Processing Adventures in the Land of Heuristics

Data Modeling Immersion 3NF DV & Dimensional