Universal Integration Architecture for Heterogeneous Datasources and Optimisation Methods

Uniwersalna architektura integracyjna dla heterogenicznych źródeł danych i metod optymalizacji

This dissertation is submitted for the degree of Doctor of Philosophy
by Michał Chromiak
Faculty of Mathematics, Physics and Computer Science, Maria Curie-Skłodowska University, Lublin
Advisor: prof. dr hab. Krzysztof Stencel
Institute of Fundamental Technological Research, Polish Academy of Sciences
Warsaw 2015

Table of Contents

Listings
List of Figures
List of Tables
Abstract

Chapter 1. Introduction
  1.1 Motivation
  1.2 Considerations, Objectives and the Thesis
  1.3 History and Related Work
  1.4 Thesis Outline

Chapter 2. The State of the Art and the Related Works
  2.1 Integrity - the Philosophy of Integration
  2.2 Integration - Cure for Chaos of Multiplicity, General Considerations
    2.2.1 At the beginning there was a relation
    2.2.2 Revolution - the Web changes everything
    2.2.3 Integration - Principia and Taxonomy
    2.2.4 Data Integration Practices
    2.2.5 Integration Theory
    2.2.6 Data Integration Issues
  2.3 Data Stores - the Integration Targets
    2.3.1 Database modelling - persistence
    2.3.2 Relational Model
    2.3.3 Object-oriented Database Model
    2.3.4 Column-oriented Relational Database Model (CORDB) – Relational Approach
    2.3.5 NoSQL – Distributed Storage Services
    2.3.6 NewSQL
    2.3.7 Big Data - all or nothing
    2.3.8 After SQL Era
    2.3.9 Database taxonomy
  2.4 Related Works - Overview of Modern Integrating Solutions
    2.4.1 OLTP & OLAP - sets of operations
    2.4.2 Metamodels - Metadata
    2.4.3 Distributed File Systems - Embracing Scaling Up in Size
    2.4.4 Enterprise Service Bus (ESB)
    2.4.5 ESB / SOA - Rules of Engagement
  2.5 Conclusions

Chapter 3. The Model of the Architecture
  3.1 Data vs Application Integration Patterns
    3.1.1 Patterns in Software Development
    3.1.2 Architectural Patterns in Integration
  3.2 General Architecture and Assumptions
    3.2.1 Virtualization as the Key to Integration – Postulates
    3.2.2 Polyglot Persistence – building "The Tower of Babel"
    3.2.3 Event Sourcing as a Persistence Technique
    3.2.4 Command Query Responsibility Separation (CQRS) Pattern
    3.2.5 OMG CORBA - Standard Specification
    3.2.6 Metadata
    3.2.7 Design Patterns - Study of Utility
    3.2.8 Integration Database Model - IDBM
    3.2.9 Indexing Role in Integrated Datamodel
  3.3 The Architecture
    3.3.1 Principia – Assumptions and Directions
    3.3.2 Components of the Architecture
    3.3.3 Workflow
  3.4 Faced Challenges

Chapter 4. Applications
  4.1 Integration
    4.1.1 Polystores as the Next-gen Federations vs Qboid-based Architecture for BigData Integration
  4.2 Optimization
    4.2.1 Indexing Distributed and Heterogeneous Data
    4.2.2 Indexing Projections
    4.2.3 Exploiting Order Dependencies Optimization Technique for Qboid-based Integration Architecture
    4.2.4 Polyglot Persistence as an Optimization Technique for Integration Architecture
  4.3 Conclusions

Chapter 5. Summary and Conclusions
  5.1 The Limitation of Prototype and Further Works
  5.2 Additional Mediator Functionalities

Appendix A. Prototype Implementation
  A.1 Integration Layer
    A.1.1 The IDL Scheme for Integration Contexts of Qboid and the Integration View
    A.1.2 The Integration Scheme in Action – Example
Appendix B. Standards and Classifications
Appendix C. Hadoop Ecosystem

Bibliography

Listings

2.1 OWL/XML Syntax for Ontology Management
2.2 GaV on data sources
2.3 GaV based query
2.4 GaV query unfolding
2.5 LaV S1_emp(Name, Age)
2.6 LaV S2_emp(Name, Age)
2.7 Declare emp_type object with methods - PL/SQL style
2.8 Define emp_type object with methods - PL/SQL style
2.9 Define column and table of emp_type type
2.10 Query column of emp_type type
2.11 Column
2.12 Super-Column
2.13 ColumnFamily - simplified notation - i.e. no timestamps and column/super-column names removed
2.14 Raw XML based document
2.15 JSON-based document; MongoDB style
2.16 Metadata document for page node
3.1 Employee class
3.2 Employee repository class
3.3 Employee class
3.4 Employee repository class
3.5 Employee repository class – now handles COMMANDS
3.6 Extracted query search handler class
3.7 SQL based FAM selection
3.8 Contributory View metadata schema. Some parts omitted for readability
3.9 Remote Database Object Reference (rDOR)
3.10 Contact and Connection Details of a rDOR
3.11 Virtual, BRI-based data identification strategy
3.12 Exemplary Cell Definition
3.13 Exemplary Tuple Definition
3.14 Exemplary Record Definition
3.15 Exemplary Record Definition
3.16 SQL based FAM selection
3.17 Qboid Layer
3.18 Qboid replica
3.19 Qboid replication
4.1 BigDWAG selection
4.2 Index on Employee's salary
4.3 Index on Employee's salary
4.4 A query for sales in the indicated period
4.5 A rewritten query for sales in the indicated period
4.6 Query general schema
4.7 PL/SQL function that finds minimal Fact_ID for a given date
4.8 Simple rewrite with sub-queries