Multi-Schema-Version Data Management
Total Page:16
File Type:pdf, Size:1020Kb
Multi-Schema-Version Data Management Dissertation submitted August 31, 2017 by Dipl.-Inf. Kai Herrmann born July 25, 1988 in Dresden at Technische Universität Dresden and Aalborg Universitet Supervisors: Prof. Dr.-Ing. Wolfgang Lehner Prof. Torben Bach Pedersen ,7%, R o SI ' & JOINT PH.D. DEGREE The thesis is jointly submitted to the Faculty of Computer Science at Technische Universität Dresden (TUD) and the Technical Faculty of IT and Design at Aalborg University (AAU), in partial fulfillment of the requirements within the scope of the IT4BIDC program for the joint Ph.D. degree in computer science (TUD: Dr.-Ing., AAU: Ph.D. in computer science). The thesis is not submitted to any other organization at the same time. DEFENDED: DECEMBER 13, 2017 IN DRESDEN PH.D. COMMITTEES: TUD Ph.D. Supervisor: Prof. Dr.-Ing. Wolfgang Lehner Technische Universität Dresden, Dresden, Germany AAU Ph.D. Supervisor: Prof. Torben Bach Pedersen Aalborg University, Aalborg, Denmark TUD Ph.D. Committee: Prof. Dr. Franz Baader Technische Universität Dresden, Dresden, Germany Prof. Dr.-Ing. Wolfgang Lehner Technische Universität Dresden, Dresden, Germany Prof. Dr.-Ing. Stefan Deßloch Technische Universität Kaiserslautern, Kaiserslautern, Germany Prof. Dr. Thorsten Strufe Technische Universität Dresden, Dresden, Germany Prof. Torben Bach Pedersen Aalborg University, Aalborg, Denmark AAU Ph.D. Committee: Prof. Torben Bach Pedersen Aalborg University, Aalborg, Denmark Prof. Dr.-Ing. Stefan Deßloch Technische Universität Kaiserslautern, Kaiserslautern, Germany Panos Vassiliadis, Associate Professor University of Ioannina, Ioannina, Greece Kristian Torp, Associate Professor Aalborg University, Aalborg, Denmark © Copyright by Kai Herrmann. The author has obtained the rights to include parts of the already published articles in the thesis. THESIS DETAILS Thesis Title: Multi-Schema-Version Data Management Ph.D. Student: Kai Herrmann Supervisors: Prof. Dr.-Ing. Wolfgang Lehner, Technische Universität Dresden Prof. Torben Bach Pedersen, Aalborg University The main body of this thesis consists of the following peer-reviewed and published papers. 1. Kai Herrmann, Hannes Voigt, Andreas Behrend, Jonas Rausch, and Wolfgang Lehner. Living in Parallel Realities: Co-Existing Schema Versions with a Bidirectional Database Evolution Lan- guage. In International Conference on Management Of Data (SIGMOD), pages 1101–1116. ACM, 2017 Input for Chapters 2 and 6 2. Kai Herrmann, Hannes Voigt, Jonas Rausch, Andreas Behrend, and Wolfgang Lehner. Robust and simple database evolution. Information Systems Frontiers, (Special issue):1–17, 2017 Input for Chapters 4 and 5 3. Kai Herrmann, Hannes Voigt, Thorsten Seyschab, and Wolfgang Lehner. InVerDa – The Liq- uid Database. In Datenbanksysteme für Business, Technologie und Web (BTW), pages 619–622. Gesellschaft für Informatik (GI), 2017 Relates to Chapter 2 4. Kai Herrmann, Hannes Voigt, Thorsten Seyschab, and Wolfgang Lehner. InVerDa – co-existing schema versions made foolproof. In International Conference on Data Engineering (ICDE), pages 1362–1365. IEEE, 2016 Relates to Chapter 2 5. Kai Herrmann, Hannes Voigt, Andreas Behrend, and Wolfgang Lehner. CoDEL – A Relationally Complete Language for Database Evolution. In European Conference on Advances in Databases and Information Systems (ADBIS), pages 63–76. Springer, 2015 Relates to Chapter 4 This thesis has been submitted for assessment in partial fulfillment of the joint Ph.D. degree. The thesis is based on the published scientific papers which are listed above. As part of the assessment, co-author statements have been made available to the assessment committee and are also available at the Technical Doctoral School of IT and Design at Aalborg University and the Faculty of Computer Science at Technische Universität Dresden. ABSTRACT Modern agile software development methods allow to continuously evolve software systems by easily adding new features, fixing bugs, and adapting the software to changing requirements and conditions while it is continuously used by the users. A major obstacle in the agile evolution is the underly- ing database that persists the software system’s data from day one on. Hence, evolving the database schema requires to evolve the existing data accordingly—at this point, the currently established solu- tions are very expensive and error-prone and far from agile. In this thesis, we present InVerDa, a multi-schema-version database system to facilitate agile database development. Multi-schema-version database systems provide multiple schema versions within the same database, where each schema version itself behaves like a regular single-schema database. Cre- ating new schema versions is very simple to provide the desired agility for database development. All created schema versions can co-exist and write operations are immediately propagated between schema versions with a best-effort strategy. Developers do not have to implement the propagation logic of data accesses between schema versions by hand, but InVerDa automatically generates it. To facilitate multi-schema-version database systems, we equip developers with a relational complete and bidirectional database evolution language (BiDEL) that allows to easily evolve existing schema versions to new ones. BiDEL allows to express the evolution of both the schema and the data both forwards and backwards in intuitive and consistent operations; the BiDEL evolution scripts are orders of magnitude shorter than implementing the same behavior with standard SQL and are even less likely to be erroneous, since they describe a developer’s intention of the evolution exclusively on the level of tables without further technical details. Having the developers’ intentions explicitly given in the BiDEL scripts further allows to create a new schema version by merging already existing ones. Having multiple co-existing schema versions in one database raises the need for a sophisticated phys- ical materialization. Multi-schema-version database systems provide full data independence, hence the database administrator can choose a feasible materialization, whereby the multi-schema-version database system internally ensures that no data is lost. The search space of possible materializations can grow exponentially with the number of schema versions. Therefore, we present an adviser that re- leases the database administrator from diving into the complex performance characteristics of multi- schema-version database systems and merely proposes an optimized materialization for a given work- load within seconds. Optimized materializations have shown to improve the performance for a given workload by orders of magnitude. We formally guarantee data independence for multi-schema-version database systems. To this end, we show that every single schema version behaves like a regular single-schema database independent of the chosen physical materialization. This important guarantee allows to easily evolve and access the database in agile software development—all the important features of relational databases, such as transaction guarantees, are preserved. To the best of our knowledge, we are the first to realize such a multi-schema-version database system that allows agile evolution of production databases with full support of co-existing schema versions and formally guaranteed data independence. 7 8 DANSK RESUMÉ (SUMMARY IN DANISH) Moderne agile software udviklings metoder gør det muligt at kontinuerligt videreudvikle software sys- temer ved at tilføje ny funktionalitet, fjerne fejl og tilpasse systemet til at opfylde nye krav samtidig med at det bliver brugt af dets brugere. En stor udfordring i den agile software udviklinger er den underliggende database der indeholder systemets data, fordi at foretage ændringer af database ske- maet kræver også at alt data bliver ændret. At foretage sådanne ændringer i nuværende etablerede løsninger er meget ressourcekrævende og risikoen for fejl er høj og hele processen er langt fra agil. I denne afhandling, præsentere vi InVerDa, en multi-skema-version database der muliggøre agil vide- reudvikling af databasen. Multi-skema-version databasesystemer har flere versioner af database ske- maet, hvor hvert skema version fungere som en standard enkelt-skema databasesystem. Det er meget simpelt at lave en ny version af et skema og det er passer godt med den agil udviklings filosofi. Flere skema version kan eksistere samtidigt og skrive operationer bliver propageret mellem alle skemaer. Software udviklerne behøver ikke at implementere datatilgang logikken mellem skema versioner, det gør InVerDa automatisk. For at facilitere multi-skema-version systemer, har vi skabt et relationelt komplet og tovejs database udviklings sprog (BiDEL), det gør det muligt og nemt at lave nye og videreudvikle eksisterende skemaer. BiDEL gør det muligt at skifte mellem sema versioner både fre- mad og bagud i tid. Endvidere, er BiDEL sproget flere størrelsesordner kortere end implementationen af samme opførelse i standard SQL. Der er også langt mindre risiko for at lave fejl ved implementerin- gen fordi sproget beskriver software udviklerens intentioner, baseret på database tabeller, uden at gå i unødige tekniske detaljer. At have software udviklernes intentioner beskrevet med BiDEL gør det muligt at kombinere eksisterende skemaer for at skema helt nye skema versioner. Det er at have flere database skema versioner i en database kræver sofistikeret fysisk materialisering af data. Multi-skema-version databasesystemer tager konceptet “data uafhængighed”