Strategies for a Semantified Uniform Access to Large And
Total Page:16
File Type:pdf, Size:1020Kb
Strategies for a Semantified Uniform Access to Large and Heterogeneous Data Sources Dissertation zur Erlangung des Doktorgrades (Dr. rer. nat.) der Mathematisch-Naturwissenschaftlichen Fakultät der Rheinischen Friedrich-Wilhelms-Universität Bonn von Mohamed Nadjib Mami aus Médéa - Algeria Bonn, 20.07.2020 Dieser Forschungsbericht wurde als Dissertation von der Mathematisch-Naturwissenschaftlichen Fakultät der Universität Bonn angenommen und ist auf dem Hochschulschriftenserver der ULB Bonn https://nbn-resolving.org/urn:nbn:de:hbz:5-61180 elektronisch publiziert. 1. Gutachter: Prof. Dr. Sören Auer 2. Gutachter: Prof. Dr. Jens Lehmann Tag der Promotion: 28.01.2021 Erscheinungsjahr: 2021 Acknowledgments "Who does not thank people does not thank God." Prophet Muhammad PBUH To the two humans who brought me to existence and afforded life’s most severe scarifies so I can stand today in dignity and pride, to my dear mother Yamina and my father Zakariya. There is no amount of words that can suffice to thank you. I am eternally in your debt. To my two heroines, my two sisters Sarah and Mounira, I cannot imagine living my life without you, thank you for the enormous unconditional love that you overwhelmed me with. To my wife, to my self, Soumeya, to the pure soul that held me straight and high during my tough times. You were and continue to be a great source of hope and regeneration. To my non- biological brother and life-mate Oussama Mokeddem, it has been such a formidable journey together. To my other brother, Mohamed Amine Mouhoub, I’ve lost count of your favors to me in easing the transition to the European continent. I am so grateful for having met both of you. To the person who gave me the opportunity that changed the course of my life. The person who provided me with every possible facility so I can pursue this lifelong dream. Both in your words and silence wisdom, I learned so much from you; to you my supervisor Prof. Dr. Sören Auer, a thousand Thank You! To my co-supervisor who supported me from the very first day to the very last, Dr. Simon Scerri. You are the person who set me in the research journey and showed me its right path. You will be the person who I will miss the most. I so much appreciate everything you did for me! To my other co-supervisor, Dr. Damien Graux, my journey flourished during your presence; a pure coincidence?! No. You are a person who knows how to get the most and best out of people. You are a planning master. Merci infiniment! Thank you Prof. Dr. Maria-Esther Vidal, I will not be complimenting to say that the energy of working with you is incomparable. You have built and continue to build generations of researchers. The Semantic Web community will always be thankful for having you among its members. Thank you Prof. Dr. Jens Lehmann, I have the immense pleasure to collaborate with you and have your name on my papers. Working on your project SANSA marked my journey and gave it a special flavor. Thank you Dr. Hajira Jabeen, I am so grateful for your advice and orientations, you are the person with whom I wish if I could work more. Thanks, Dr. Christoph Lange and Dr. Steffen Lohmann, you were always there and supportive in the difficult situations. To Lavdim, Fathoni, Kemele, Irlan, and many others, the journey was much nicer with your colleagueship. To my past Algerian university respectful teachers who have equipped me with the necessary fundamentals. To you go all the credits. Thank you! To this great country Germany, "Das Land der Ideen" (the country of ideas). Danke! iii Abstract The remarkable advances achieved in both research and development of Data Management as well as the prevalence of high-speed Internet and technology in the last few decades have caused unprecedented data avalanche. Large volumes of data manifested in a multitude of types and formats are being generated and becoming the new norm. In this context, it is crucial to both leverage existing approaches and propose novel ones to overcome this data size and complexity, and thus facilitate data exploitation. In this thesis, we investigate two major approaches to addressing this challenge: Physical Data Integration and Logical Data Integration. The specific problem tackled is to enable querying large and heterogeneous data sources in an ad hoc manner. In the Physical Data Integration, data is physically and wholly transformed into a canonical unique format, which can then be directly and uniformly queried. In the Logical Data Integration, data remains in its original format and form and a middleware is posed above the data allowing to map various schemata elements to a high-level unifying formal model. The latter enables the querying of the underlying original data in an ad hoc and uniform way, a framework which we call Semantic Data Lake, SDL. Both approaches have their advantages and disadvantages. For example, in the former, a significant effort and cost are devoted to pre-processing and transforming the data to the unified canonical format. In the latter, the cost is shifted to the query processing phases, e.g., query analysis, relevant source detection and results reconciliation. In this thesis we investigate both directions and study their strengths and weaknesses. For each direction, we propose a set of approaches and demonstrate their feasibility via a proposed implementation. In both directions, we appeal to Semantic Web technologies, which provide a set of time-proven techniques and standards that are dedicated to Data Integration. In the Physical Integration, we suggest an end-to-end blueprint for the semantification of large and heterogeneous data sources, i.e., physically transforming the data to the Semantic Web data standard RDF (Resource Description Framework). A unified data representation, storage and query interface over the data are suggested. In the Logical Integration, we provide a description of the SDL architecture, which allows querying data sources right on their original form and format without requiring a prior transformation and centralization. For a number of reasons that we detail, we put more emphasis on the logical approach. We present the effort behind an extensible implementation of the SDL, called Squerall, which leverages state-of-the-art Semantic and Big Data technologies, e.g., RML (RDF Mapping Language) mappings, FnO (Function Ontology) ontology, and Apache Spark. A series of evaluation is conducted to evaluate the implementation along with various metrics and input data scales. In particular, we describe an industrial real-world use case using our SDL implementation. In a preparation phase, we conduct a survey for the Query Translation methods in order to back some of our design choices. v Contents 1 Introduction 1 1.1 Motivation.......................................2 1.2 Problem Definition and Challenges..........................3 1.3 Research Questions...................................6 1.4 Thesis Overview....................................7 1.4.1 Contributions..................................7 1.4.2 List of Publications..............................9 1.5 Thesis Structure.................................... 11 2 Background and Preliminaries 13 2.1 Semantic Technologies................................. 13 2.1.1 Semantic Web................................. 13 2.1.2 Resource Description Framework....................... 14 2.1.3 Ontology.................................... 15 2.1.4 SPARQL Query Language........................... 16 2.2 Big Data Management................................. 17 2.2.1 BDM Definitions................................ 17 2.2.2 BDM Challenges................................ 17 2.2.3 BDM Concepts and Principles........................ 18 2.2.4 BDM Technology Landscape......................... 20 2.3 Data Integration.................................... 22 2.3.1 Physical Data Integration........................... 23 2.3.2 Logical Data Integration............................ 24 2.3.3 Mediator/Wrapper Architecture....................... 24 2.3.4 Semantic Data Integration........................... 24 2.3.5 Scalable Semantic Storage and Processing.................. 25 3 Related Work 29 3.1 Semantic-based Physical Integration......................... 29 3.1.1 MapReduce-based RDF Processing...................... 29 3.1.2 SQL Query Engine for RDF Processing................... 31 3.1.3 NoSQL-based RDF Processing........................ 32 3.1.4 Discussion.................................... 34 3.2 Semantic-based Logical Data Integration....................... 35 3.2.1 Semantic-based Access............................. 36 3.2.2 Non-Semantic Access.............................. 37 3.2.3 Discussion.................................... 38 vii 4 Overview on Query Translation Approaches 41 4.1 Considered Query Languages............................. 43 4.1.1 Relational Query Language.......................... 43 4.1.2 Graph Query Languages............................ 43 4.1.3 Hierarchical Query Languages......................... 44 4.1.4 Document Query Languages.......................... 44 4.2 Query Translation Paths................................ 44 4.2.1 SQL ↔ XML Languages............................ 45 4.2.2 SQL ↔ SPARQL................................ 45 4.2.3 SQL ↔ Document-based........................... 45 4.2.4 SQL ↔ Graph-based.............................. 45 4.2.5 SPARQL ↔ XML Languages......................... 46 4.2.6 SPARQL ↔ Graph-based........................... 46 4.3 Survey Methodology.................................. 46 4.4 Criteria-based Classification.............................