The Potential of Temporal for the Application in Data Analytics

Master Thesis

Alexander Menne July 2019

Thesis supervisors: Prof. Dr. Marko van Eekelen (ICIS, Radboud University) and Ronald van Herwijnen (Avanade Netherlands)

Second assessor: Dr. Stijn Hoppenbrouwers (ICIS, Radboud University)

Radboud University and Avanade Netherlands

Abstract

The concept of temporal databases has received much attention in computer science research, but there is a lack of literature concerning the practical application of temporal databases. This thesis examines the potential of temporal databases for data analytics. The main concepts of temporal databases and the role of temporal data for data analytics and data warehousing are studied using a literature review. Furthermore, the current implementations of temporal databases are discussed to exemplify the differences between the literature and practice. We compare temporal databases embedded in a data warehouse architecture with conventional data warehouses by means of two prototypes and five assessment criteria. The results of the assessment indicate that the use of temporal databases as an alternative to traditional data warehouses has great advantages. Temporal databases solve data integrity issues of the classical ETL process and enable a more direct data flow from the source to the business intelligence tool. Also, the integrated support of the temporal dimension reduces programming efforts and increases the maintainability of the system. Hence, we find that temporal databases have the potential to enhance the data-driven strategy of companies significantly.

Keywords: Temporal Databases, Data Analytics, Data Warehouse, ETL


Acknowledgements

I would like to thank Marko van Eekelen for the excellent guidance through our biweekly meetings, which provided me with very valuable insights into research methodology. Also, I want to thank my external supervisor Ronald van Herwijnen for sharing his expertise in data warehousing, which contributed significantly to the prototype construction. Furthermore, I want to thank Stijn Hoppenbrouwers for the time and effort he has devoted to reading and assessing this thesis. Finally, I want to express my gratitude to all the people that supported, encouraged, and inspired me on this academic journey in the Netherlands, which is a place I am proud to call home.


Table of Contents

1. Introduction
   1.1. Related work
   1.2. Research question
   1.3. Thesis outline

2. Research Methods
   2.1. Literature review
   2.2. Assessment
        2.2.1. Issues and criteria
        2.2.2. Measurement
        2.2.3. Technology

3. Theoretical Background
   3.1. Temporal Data
   3.2. Conventional Databases and Temporal Data
   3.3. Temporal Databases
        3.3.1. Implementation issues
   3.4. Data Analytics
        3.4.1. Big Data
        3.4.2. The role of temporal data
   3.5. Data Warehouses
        3.5.1. The temporal dimension

4. Practical Background
   4.1. SQL:2011 standard
   4.2. Microsoft SQL Server implementation
   4.3. Other implementations

5. Temporal Databases compared to Conventional Data Warehouses
   5.1. Prototype A: Conventional data warehouse
        5.1.1. Architecture
        5.1.2. Insights
   5.2. Prototype B: Data warehouse with system-versioned tables
        5.2.1. Architecture
        5.2.2. Insights
   5.3. Assessment results
        5.3.1. Performance
        5.3.2. Costs
        5.3.3. Data integrity
        5.3.4. Maintainability
        5.3.5. Acceptance
        5.3.6. Three issues of the classical ETL process

6. Discussion
   6.1. Interpretation of the findings
   6.2. Limitations

7. Conclusion
   7.1. Conclusions
   7.2. Future work

Glossary

References

A. Source Code
   A.1. Transfer between staging database and data warehouse
   A.2. Data warehouse transformation
   A.3. Views in temporal database

B. Technical specifications

C. Assessment
   C.1. Performance assessment results
   C.2. Assessment scripts

1 Introduction


Thirty years ago, the British computer scientist Tim Berners-Lee submitted a 'vague but exciting' proposal for an information system meant as a "free, open, permissionless space for all of humanity to share knowledge and ideas" [32]. Nowadays, we call this system the internet, and its purpose has expanded far beyond the spreading of knowledge. All kinds of data are stored online, which leads to the exponential growth of the internet. The fact that the data stored online doubles every 20 months gives an indication of the challenge that companies face to cope with the masses of collected data [26]. The field of data analytics aims at finding solutions for the increasingly important extraction of knowledge to empower data-driven company strategies. The internet has become not only a source of data but also a facilitator for data analytics due to the efficiency and flexibility that cloud computing offers.

In spite of the technological progress driven by the internet, the architecture of data warehouses has hardly changed in the last two decades. Databases usually represent the current state of an organisation and are updated when a change happens in the real world. Periodically, data is extracted, transformed, and loaded (ETL) into a data warehouse to gain insights for decision-making processes. In between two ETL processes, the propositions stored in the database can change multiple times, which is not reflected in the next ETL process, as only the current state of the database is extracted at the start of the ETL process. In this fast-changing world, this introduces a potential bias into the decision-making process, since ETL processes are often executed only once per day or less.

Temporal databases may solve this problem by recording all changes made to the database in tables. Hence, no data ever gets lost, as the archiving of the data is not dependent on an ETL process. This, however, also means that much data needs to be stored, which might be a reason why it took almost 30 years from the first idea of a temporal database to an implementation by a database management system vendor. Nowadays, the conditions have changed, as the prices for hardware have declined immensely, which makes it affordable to store large amounts of data even for small businesses. An illustration of the price decline of disk drives from 2004 to 2019 can be found in figure 1.1. Surprisingly, the low storage prices and available solutions have not led to an establishment of temporal databases in the industry yet.

Fig. 1.1.: Average disk drive price per gigabyte in US dollars, based on [21].

1.1. Related work

Much research on temporal databases has been done in the field of computer science, which mainly focuses on the underlying concepts and possible implementations of temporal databases in existing database management systems. With over 100 publications, Richard T. Snodgrass is the primary contributor to the design and implementation of temporal databases [31]. His early publications on temporal databases contribute significantly to the theoretical foundation this thesis is built on [10, 9]. Also, Snodgrass is co-director of the institution TimeCenter, which plays an essential role in the advancement of knowledge within the domain of temporal databases [17]. More recent works include the books of Johnston [18] and Date et al. [14], which combine the theoretical aspects of temporal data with a rather practical view on the design of temporal databases. In particular, Date et al. [14] offer an interesting discussion of the problems that arise when implementing a temporal database.

There is a relatively small body of literature that is solely concerned with data analytics in general. Most recent academic research focuses on trends within the field of data analytics, such as big data. Gandomi et al. [2] provide a comprehensive overview of relevant techniques in the field of big data analytics. The study of Russom [27] offers some important insights into the state of big data analytics and the best practices of the industry as reported by 325 data management professionals. However, the reader should bear in mind that the study was conducted in 2011.

Closely related to data analytics, data warehouses are foremost addressed by practical researchers and industry leaders. Ralph Kimball and the Kimball Group are important authorities in this field, as they established the best practices of data warehouse architecture with their 'Toolkit' books [20]. A main contribution of Kimball is the concept of dimensional modelling, which implies that tables in the data warehouse are modelled as a star schema with fact tables surrounded by dimension tables [19]. In contrast to the data warehouse design suggested by Inmon [16], the Kimball data warehouse is not normalised, in order to simplify the information retrieval process.

Overall, there is sufficient theoretical research on temporal databases and practical research on data analytics and data warehousing. Very little is known, however, about the use of temporal databases for the purposes of data analytics. Also, the use of a temporal database within a data warehouse architecture has not been investigated yet.

1.2. Research question

This research aims at closing the knowledge gap on the use of temporal databases in data warehouse architectures to achieve an enhancement in the field of data analytics. Specifically, we want to examine the potential of temporal databases to innovate the ETL process of data warehousing for the benefit of data analytics. The main question this research seeks to answer is:

What is the potential of temporal databases for the application in data analytics?

In order to answer this research question in a structured manner, we defined six sub-questions:

Q1 What are the main concepts of temporal databases?
Q2 How are temporal databases currently implemented?
Q3 What role does temporal data play in data analytics and data warehousing?
Q4 What are suitable criteria to assess the impact of temporal databases for the application in data analytics?
Q5 How well do temporal databases serve the purposes of data analytics compared to current databases when embedded in a data warehouse architecture?
Q6 Can temporal databases solve issues of the classical ETL process in data warehouse architectures?

1.3. Thesis outline

This research investigates the potential of temporal databases when applied in the context of data analytics. In the second chapter, we describe the methods used as part of this research. In the third chapter, the theoretical background of temporal databases, data analytics, and data warehouses is presented. The fourth chapter is concerned with the implementations of temporal databases and discusses the differences between them. In chapter five, the comparison between temporal databases and conventional data warehouses is presented using the two prototypes built. The prototypes are assessed by means of five assessment criteria and three issues of the classical ETL process. Finally, the results of this research are discussed and concluded in chapters six and seven.

2 Research Methods


In this research, we make use of a literature review and an assessment based on prototype construction. In the first section of this chapter, the methods used for the literature review are described. The second section concerns the assessment criteria and the methods used to measure them. Also, the technology behind the prototypes is introduced and motivated.

2.1. Literature review

In order to answer the first four sub-questions of this research, we make use of a literature review. For this, we apply the five-stage grounded-theory method of Wolfswinkel et al. [13], which suggests taking an iterative approach to enable the continuous development and refinement of themes and theories found in the literature. In the first stage of the method, criteria for inclusion and exclusion, the fields of research, the appropriate sources, and the search terms need to be defined. The criteria for inclusion in the literature review were chosen to be currentness and authoritativeness. The criterion of currentness was more important for the review on analytics and data warehouses than for temporal databases, as there has been much development in the fields of analytics and data warehousing in recent years. As for authoritativeness, we preferred peer-reviewed literature over non-peer-reviewed literature, except where no suitable peer-reviewed literature was available.

The fields of research we considered for the literature review were mainly Information Systems and Computer Science. The search terms used were 'temporal database', 'temporal table', 'bitemporal database', 'temporal data', 'analytics', 'analytics requirements', 'analytics success factors', 'business intelligence', 'use of temporal databases', 'teradata', 'data warehouse', 'data warehouse requirements', 'data warehouse criteria', 'data warehouse assessment', 'data analytics', 'sql:2011', 'sql server 2016', 'azure data warehouse', 'SQL complexity', and several combinations of these terms.

The second stage is concerned with the actual search for literature. As tools for the literature search, we made use of Google Scholar and the library search engine of Radboud University (RUQuest) [25]. The collected literature was then scanned and selected based on the criteria defined in the first stage. The fourth stage of Wolfswinkel's method proposes to iterate through the literature corpus randomly and to highlight significant parts of the text. We found it to be more effective to iterate through the literature sorted by topic and relevance and thus slightly deviated from the method. After the whole body of literature had been analyzed, we re-read the excerpts and assigned them to concepts. This was done using the iOS application Liquidtext. In the last stage, we sorted the excerpts into the suitable chapters of this thesis to prepare the text production.

2.2. Assessment

2.2.1. Issues and criteria

The sub-questions Q5 and Q6 are answered by assessing temporal databases and comparing them to a conventional data warehouse architecture. In order to answer Q6, we defined three issues of the ETL process used in conventional data warehouses based on the literature review. These issues are used to determine whether temporal databases are able to solve them. The identified issues of the classical ETL process (described in more detail in section 3.5.1) are:

Issue 1 Transactions executed between two ETL iterations are lost.
Issue 2 The time attributes reflect the times of ETL executions and not the actual times of the transactions.
Issue 3 It is not possible, or very costly, to realize (nearly) real-time processing of data.

In order to compare temporal databases with conventional data warehouse architectures and to answer Q5, we defined five assessment criteria based on computer science and data analytics literature. These criteria are used to assess both temporal databases and conventional data warehouses. The results of the assessment are compared, and a grounded judgement about the potential of temporal databases can be made. In the following, the five criteria are presented and motivated.

Performance
Performance is a frequently used assessment criterion applied in computer science to evaluate the responsiveness of a system [12]. This criterion is relevant for our research because data analytics projects often involve the processing of massive amounts of data, which should be analyzed in an efficient manner. Therefore, the time passed from a transaction in the source database to reflecting this transaction in the business intelligence tool should be minimal, so that business users have access to up-to-date information [27].

Costs
The initial and post-implementation costs are a key criterion for the choice of data warehouses [6]. In this research, we focus on cloud-based solutions, as data warehouses are rarely implemented on-premise anymore. Therefore, we disregard the initial costs and concentrate on the monthly running costs of both concepts. These costs are a factor which must not be neglected, as data analytics is not only for big businesses anymore [27]. After all, cloud services and lower storage costs have enabled smaller companies with budget constraints to use their data for decision-making.

Data integrity
Data integrity is a vital criterion for the evaluation of data warehouses, which commonly concerns the whole data flow of the ETL process [3]. For data analytics, data integrity is critical, as the information retrieved from the business intelligence tool is the foundation for a company's decision-making. Therefore, the data must be consistent, accurate, and complete, as false or lost data can have severe consequences for data-driven companies [19]. For the comparison between temporal databases and conventional data warehouses, we pay special attention to the integrity of the temporal attributes of the systems.

Maintainability
Maintainability refers to the ease with which changes can be made to a system. This criterion is particularly relevant for data warehouses, as they are often designed iteratively and regularly need to adapt to changes in the data model of the organisation [3]. After all, businesses are very dynamic nowadays and have constantly changing demands, which also applies to their information needs. Therefore, the data warehouse architecture should allow for easy adaptation and expansion in order to react to change [19]. Maintainability plays a key role in this ability to adapt, as it affects the time and costs needed to implement adjustments to the architecture.

Acceptance
The introduction of new technology can only be deemed successful if the end users accept the technology and incorporate it into their routines [19]. Therefore, acceptance is a frequently used criterion for evaluating new technology [1]. A temporal database embedded in a data warehouse architecture is a rather new concept, which is why it is necessary to assess to what extent it can be accepted by business users. In this respect, it is particularly important that the data retrieved in the business intelligence tool is similar between the two concepts, as the end users might otherwise resist the technological change.

2.2.2. Measurement

In order to compare temporal databases and conventional data warehouses, we built two prototypes representing the two concepts. This way, it is possible to compare the two concepts using the aforementioned issues and criteria in a realistic manner. The prototypes are both based on the Microsoft Azure data warehouse infrastructure due to the client's focus on Microsoft products. Prototype A has a conventional data warehouse architecture using a classical ETL process, while prototype B uses a temporal database as an alternative to the data warehouse.

As mentioned before, the ETL process of conventional data warehouses causes three issues. We use prototype B to determine whether temporal databases solve these issues. The first issue is considered solved if no data transferred from the source database to the BI dashboard gets lost at any time. The second issue is regarded as solved if the temporal attributes in the temporal database are equivalent to the actual execution times of the transactions in the source database. Finally, we find the third issue to be solved if the delay from executing a transaction on the source database to the reflection of that transaction in the BI dashboard is less than one minute. In the following, we present and motivate the metrics used for evaluating the prototypes in terms of the assessment criteria.

Performance
In order to assess the performance of the prototypes, we apply load testing, which is a frequently used testing practice in computer science involving the monitoring of the responsiveness of a system that is exposed to a certain work load [12]. In the context of data warehouses, such tests commonly focus on the processing time of the whole data flow of the ETL process [3]. We apply this approach by measuring the time separately for 1) transforming and loading the data, and 2) retrieving the data in the dashboard. The time for extracting the data is neglected, as there is no reliable method to measure the replication time, and we do not expect significant differences between the two prototypes. For prototype A, the time for transforming and loading the data is retrieved from Data Factory. For prototype B, this time is measured by timing how long it takes to query all views. For both prototypes, we measure the time consumed for retrieving the data in the dashboard by manually taking the time using a stopwatch app. We have taken some measures to make the results of the assessment more reliable and realistic. The prototypes are tested using a low-performance and a high-performance setup for the SQL databases used in the prototypes in order to gain insight into the effect of the hardware setup on the performance of the prototypes. Furthermore, the measurements are performed three times each for both prototypes and both configurations to eliminate possible external influences such as irregularities of the Azure servers. The results are then combined by calculating the average per prototype and configuration.
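To illustrate how the query time of prototype B's views can be taken, the sketch below times one full evaluation of a view; the view name dbo.DimProduct is a hypothetical placeholder and not part of the prototype code (the scripts actually used are listed in appendix C.2).

-- Minimal sketch: measure how long it takes to evaluate one view on
-- prototype B. The view name dbo.DimProduct is a hypothetical placeholder.
DECLARE @start DATETIME2 = SYSUTCDATETIME();

SELECT COUNT(*) AS RowsRead
FROM dbo.DimProduct;          -- forces the view to be evaluated completely

SELECT DATEDIFF(MILLISECOND, @start, SYSUTCDATETIME()) AS ElapsedMilliseconds;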

Costs
We measure the costs of each prototype in both configurations by calculating the overall monthly costs of the Azure services used. For prototype A, we calculate the monthly costs based on a daily ETL process and an hourly ETL process. This way, we get an understanding of the degree to which an increase in the ETL frequency affects the overall costs. The prices are retrieved from Microsoft's price calculator using the specifications of our prototypes [24]. We use the prices of the region 'West Europe'. If there are no monthly prices given for a service, we consider a month to be an interval of 30 days.

Data integrity
The data integrity of the prototypes is tested by comparing the effect that transactions have on the data warehouse of prototype A and the temporal database of prototype B. Both prototypes use the same source data, which means that they should in principle show the same data at any time. However, the currentness and the precision of the temporal attributes of the data may differ, which we measure by successively executing three sets of create, update, and delete transactions on the source database. Also, we measure the influence of the different ETL processes in the prototypes. After each set, we first determine the differences in both prototypes using SQL Server Management Studio, and again after an ETL iteration has been executed on prototype A. In the analysis, we focus on the differences in the temporal attributes and on potentially different or non-existent data.
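As an illustration, one such set of transactions might look as follows; the statements target real AdventureWorks tables, but the concrete values are made up for this example and are not necessarily the exact transactions used in the assessment.

-- Illustrative set of create, update, and delete transactions executed on the
-- AdventureWorks source database; the concrete values are examples only.

-- create: add a new product category
INSERT INTO Production.ProductCategory (Name)
VALUES (N'Test Category');

-- update: raise the list price of one product by 10 percent
UPDATE Production.Product
SET ListPrice = ListPrice * 1.10
WHERE ProductID = 707;

-- delete: remove the category created above again
DELETE FROM Production.ProductCategory
WHERE Name = N'Test Category';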

Maintainability
It is rather difficult to measure the maintainability of data warehouses, as the data flow involves several processes, and the manual implementations are done completely in SQL. Unlike other programming languages, the maintainability of SQL code cannot be measured by a prediction model using metrics such as cohesion, coupling, or complexity [3]. Consequently, commonly used code quality evaluation services such as Better Code Hub do not support SQL [29]. Hence, we apply a rather straightforward metric for this criterion, namely the lines of code of each prototype. This is measured by transforming all code of each prototype into a standard format using the online tool Instant SQL Formatter and calculating the sum of the lines of code per prototype [15]. Furthermore, we evaluate the ease of changing and testing code by examining the steps involved for each prototype.

Acceptance
The acceptance of a system is commonly measured by means of validation tests performed by end users [1]. However, this approach is not feasible for this research, as we do not have the capacity to achieve a reliable end user validation. Therefore, we assess the acceptance by analyzing the differences between the dashboards of the prototypes and determine whether the dashboard of prototype B offers a user experience that is equal to or possibly better than that of prototype A. In doing so, we focus on the usefulness of the temporal information and the loading time of the dashboards.

2.2.3. Technology

For creating the prototypes, we make use of Microsoft solutions because of the client's focus on Microsoft products. Irrespective of this, however, Microsoft offers a very comprehensive package of integrated cloud-based services to implement a modern data warehouse architecture, which makes Microsoft a suitable choice for our purposes. Due to the abundance of services, there are many ways to implement a data warehouse. Our approach was to build a realistic architecture that stays within the scope of this research and the financial budget. That is to say, we implemented an architecture according to industry best practices without making use of services that are not relevant for our research. For instance, we did not make use of services that are intended for processing big data, because they do not add value to our research (see section 3.4.2). In the following, the key elements of the architecture of the prototypes are presented and motivated.

SQL Server
A rather obvious choice for the implementation of the databases is SQL Server, as it is the only database management system from Microsoft which is available on Azure. Also, as described in section 4.2, SQL Server supports system-versioned tables, which is clearly necessary for prototype B. When setting up SQL Server, there are two options in Azure: creating a virtual machine or using a managed instance. In principle, a managed instance is the preferable option because of its scalability and easy setup. However, a virtual machine provides more control over the server, as the virtual machine simulates an on-premise server. Also, the managed instance uses a slightly different version of SQL Server, namely Azure SQL, which does not offer some features that are available in the original version. For the prototypes, we use both a managed Azure SQL instance and a virtual machine running SQL Server, for different purposes. The managed instance is used for the staging database and the data warehouse. Operational source databases are in practice often on-premise, which is why we chose to use a virtual machine for the source database. Another reason is that the original version of SQL Server supports replication, which we use for the synchronization between the source database and the staging database and the data warehouse.

Replication is a technology that enables the distribution and synchronization of database contents directly from publisher databases to subscriber databases. SQL Server provides four different publication types: snapshot, transactional, peer-to-peer, and merge. They differ in the frequency of synchronization and in the way the data is transferred. For instance, snapshot publications transfer a snapshot at scheduled intervals, while transactional publications transfer data immediately after the data has been added, changed, or deleted. Peer-to-peer publications enable replication with more than one publication database and stream data directly to the peer databases. Merge publications periodically consolidate the changes made to both the publisher database and the subscriber database. We chose to use a transactional publication because it makes it possible to synchronize the source database with the staging database and the temporal database in nearly real-time, and there is no need for more than one publisher database.
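As an illustration of the system-versioned tables mentioned above, the sketch below declares such a table in SQL Server 2016 and queries a past state of it; the table and column names are made up for the example and do not correspond to the prototype schema.

-- Sketch of a system-versioned (temporal) table in SQL Server 2016.
-- Table and column names are illustrative, not the prototype schema.
CREATE TABLE dbo.Product
(
    ProductId INT           NOT NULL PRIMARY KEY CLUSTERED,
    Name      NVARCHAR(50)  NOT NULL,
    Price     DECIMAL(9, 2) NOT NULL,
    ValidFrom DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo   DATETIME2 GENERATED ALWAYS AS ROW END   NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.ProductHistory));

-- The state of the table at an earlier point in time can then be retrieved with:
SELECT ProductId, Name, Price
FROM dbo.Product
FOR SYSTEM_TIME AS OF '2019-01-01T00:00:00';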

Data Factory
Data Factory is a data integration tool in Azure that offers a simple graphical user interface without the need to write one's own code. The core of Data Factory is the pipeline, which consists of activities that together fulfil an overlying task. These activities can concern straightforward tasks such as copying or transforming data, but there are also more complex activities that involve other Azure services. For instance, there are activities for big data analytics services, such as MapReduce. In prototype A, we make use of the copy activity, which copies data from a source database to a so-called sink database. One can execute select statements to retrieve data from several tables, which makes it possible to extract and transform the data into an analyzable format. Also, one can execute pre-copy scripts on the sink database, which is useful if one wants to modify or delete data in the sink database before inserting the data from the source database. A sketch of such a source query and pre-copy script is given at the end of this section. An alternative to Data Factory is SQL Server Integration Services (SSIS). SSIS is a component of SQL Server designed for performing ETL processes. We could have used SSIS for prototype A, but the integration within Azure and the good usability made us choose Data Factory as the service for the ETL process.

PowerBI Desktop
PowerBI is a business analytics service with a variety of data visualization tools. It makes it possible for users to create business intelligence dashboards using data from databases and other sources. PowerBI supports several platforms, such as Windows computers and mobile devices. Furthermore, Microsoft offers an online browser-based software-as-a-service solution for PowerBI. For the dashboards of the prototypes, we used PowerBI Desktop, as it is available for free and sufficient for our purposes.

AdventureWorks
As we aim at building realistic prototypes, we made use of the comprehensive sample database AdventureWorks, which is provided by Microsoft [22]. Adventure Works Cycles is a fictitious company selling bikes and sports accessories in multiple countries around the world. The database consists of 68 tables divided into human resources, person, production, purchasing, and sales tables. The human resources schema comprises six tables which give typical information needed for the HR department, such as the data of all employees, the departments within the company, and job candidates. In the person schema, there are 13 tables with various kinds of personal data, such as addresses and phone numbers, about employees, resellers, and other contacts of the business. The production schema consists of 25 tables with detailed data about the products and their production. The purchasing schema encompasses five tables with information about purchase orders and vendors. The sales schema has 19 tables with orders and customer and reseller data. All tables are filled with a large amount of sample data. For instance, there are 31 465 sales orders, 19 614 addresses, and 504 different products in the database, which makes the AdventureWorks database very realistic and therefore suitable for our research.
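By way of illustration, a source query and pre-copy script for such a copy activity could look as follows; the joined AdventureWorks tables exist in the sample database, but the sink table dbo.DimProduct is a hypothetical name, and these statements are not the prototype code (see appendix A for the scripts that were actually used).

-- Possible source query of a copy activity: flatten the AdventureWorks
-- product hierarchy into one analyzable record per product.
SELECT p.ProductID,
       p.Name  AS ProductName,
       ps.Name AS Subcategory,
       pc.Name AS Category,
       p.ListPrice
FROM Production.Product AS p
LEFT JOIN Production.ProductSubcategory AS ps
       ON p.ProductSubcategoryID = ps.ProductSubcategoryID
LEFT JOIN Production.ProductCategory AS pc
       ON ps.ProductCategoryID = pc.ProductCategoryID;

-- Possible pre-copy script on the sink database: empty the (hypothetical)
-- dimension table before the fresh load.
TRUNCATE TABLE dbo.DimProduct;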

3 Theoretical Background


In this chapter, the three main concepts this paper deals with are introduced, namely temporal databases, analytics, and data warehouses. Before defining what temporal databases are, it is necessary to understand the notion of temporal data and its terminology, and how conventional databases integrate temporal data. Building on that foundation, the key concepts of temporal databases are pointed out using several examples, and the history of temporal databases is explained. Also, the problems arising when implementing a temporal database are addressed. Finally, the issues that temporal databases solve are pointed out. In the Data Analytics section, we give an overview of the field of analytics, and the concept of big data analytics is elaborated on to illustrate the current trends within analytics. Furthermore, the role of temporal data for data analytics is discussed. In the last section, an introduction to data warehousing and the principles used in data warehouse architectures is given. Lastly, we discuss the temporal dimension in data warehouses.

3.1. Temporal Data

Temporal data can be defined as any data which may change over the course of time [10]. Since most data can, at least in theory, change over time, we add to this definition that the aspect of time needs to be significant for the purpose of the data. For instance, an e-commerce company might not be interested in the time at which a customer changed her email address, while the time at which a customer places an order is certainly important. At first, the notion of temporal data may seem rather trivial, since time is considered to be a generally known concept. However, there is a considerable number of terms used ambiguously in the academic literature about temporal databases. Hence, we define a concise nomenclature for temporal aspects in the following.

The purpose of temporal data is to model certain events occurring in reality by means of records, the so-called transactions [18]. These transactions contain information about an event and possibly a timestamp with certain semantics defined by either the user herself or the database management system, which we call a temporal attribute. Temporal attributes can give information about the truthfulness of a statement in the real world (the valid time) or as stated in the database (the transaction time) [14]. Also, a temporal attribute could have other semantics defined by the user, which are not interpreted by the database management system (user-defined time) [10]. As these terms are used throughout this paper, we recommend having a look at the glossary in the back matter of this paper for more comprehensive definitions of the introduced terms.

3.2. Conventional Databases and Temporal Data

Conventional databases model the state of an enterprise or organization at a certain point in time [10]. When information stored in the database is not believed to be true anymore and is modified, the old data is replaced by the new data and is not retrievable any longer [10]. The consequence of this behaviour is that no history of transactions is stored in conventional databases. This, however, does not mean that no temporal data at all is stored in these databases. For instance, user-defined time can be stored in conventional databases, but there is no support for automatically maintained temporal attributes such as transaction time [10, 14]. Therefore, any temporal attribute in conventional databases can be updated [14].

If the history of a company's data needs to be stored, a common solution is to regularly make backups of the databases (so-called snapshots). There is, however, often a need to include history in a conventional database, as this allows the users to query historical data from within the database. This request is commonly dealt with by adding timestamps to the primary key of the table for which historical information is needed [18]. Another option is to create a separate table with these timestamps added in order to keep the schema of the original table [18]. These approaches involve either the manual execution of all transactions or a trigger that modifies the update query before it is executed [18]. The need for these additional operations is best explained using example 3.1, which we will use and advance throughout this chapter.

Example 3.1
Let company CheeseHut be a chain store for cheese offering three different types of cheese, namely young, mature, and old cheese, with each cheese having its own price.

Row  id  name    price
1    1   Young   6
2    2   Mature  8
3    3   Old     11

Tab. 3.1.: Product table of CheeseHut

Table 3.1 illustrates the product table of CheeseHut, modelling the types of cheese and their prices per kilogram in euros. In 2014 CheeseHut changes the supplier for the old cheese, which results in a higher price of 12 euros per kilogram. Table 3.2 reflects this change.

Row  id  name    price
1    1   Young   6
2    2   Mature  8
3    3   Old     12

Tab. 3.2.: Product table of CheeseHut

Suppose now that in 2019, CheeseHut wants to change back to the former supplier of the old cheese. CheeseHut wants to sell the cheese for the same price as before, but none of the employees recalls the retail price of the old cheese in 2013 anymore. This information has been lost in the change made before. Fortunately, the IT department of CheeseHut implemented a history table before the change was made, which is illustrated in table 3.3.

Row  id  startTime   endTime     name    price
1    1   2011-01-01  9999-12-31  Young   6
2    2   2011-01-01  9999-12-31  Mature  8
3    3   2011-01-01  2014-01-01  Old     11
4    3   2014-01-01  9999-12-31  Old     12

Tab. 3.3.: Product history table of CheeseHut

As can be seen in this example, a conventional database without any temporal attributes comes with the risk of losing possibly valuable information about the history of the database. Adding a history table may mitigate this risk, but comes with additional operations, as mentioned before. When updating a row in the original table, this change also needs to be reflected in the history table, which can either be done manually or by setting up a trigger function. Furthermore, the end time of the existing record needs to be changed and a new row needs to be inserted. This effort, however, is still reasonable compared to the complex queries and constraints that need to be written to cope with more complicated transactions, which is why a database providing well-designed solutions for the processing of temporal data can be very valuable [14]. In the next section, we elaborate on these complex transactions and how they can be solved.
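A sketch of such a trigger is shown below; it assumes the tables of example 3.1 are stored as dbo.Product and dbo.ProductHistory with the columns shown in table 3.3, and it only handles updates, not inserts or deletes.

-- Sketch of a trigger that keeps the history table of example 3.1 up to date
-- when a product row is updated. Table and column names are assumptions
-- based on the example; inserts and deletes would need similar handling.
CREATE TRIGGER trg_Product_History
ON dbo.Product
AFTER UPDATE
AS
BEGIN
    -- Close the currently open history row of every updated product.
    UPDATE h
    SET h.endTime = SYSDATETIME()
    FROM dbo.ProductHistory AS h
    JOIN inserted AS i ON i.id = h.id
    WHERE h.endTime = '9999-12-31';

    -- Open a new history row with the new values.
    INSERT INTO dbo.ProductHistory (id, startTime, endTime, name, price)
    SELECT i.id, SYSDATETIME(), '9999-12-31', i.name, i.price
    FROM inserted AS i;
END;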

3.3. Temporal Databases

A temporal database is a database that contains time-varying data and offers built-in support for modelling the temporal dimension of the data [10]. When using a temporal database, retrieving information about the past is supported by built-in query functions, which makes the development and use of such databases more efficient and potentially increases the performance [10, 14]. Furthermore, any modifications to stored data are automatically dealt with, which reduces the manual workload involved in maintaining the database [10].

The concepts of valid time and transaction time are the core of temporal databases. Again, valid time defines the time interval during which a statement is true according to current beliefs, while transaction time specifies the time interval during which a statement was true according to the database [14]. If a database were always up-to-date and contained only correct data, then valid time and transaction time would be identical. This is, however, a rather utopian scenario, as beliefs about the truthfulness of statements constantly change due to new insights. Furthermore, an important distinction is that valid time is usually kept in the tables containing the temporal data, while transaction time is kept in a separate history table [14]. The difference between valid time and transaction time is further explained in example 3.2.

Example 3.2
As stated in example 3.1, the IT department of CheeseHut implemented a table with historical data of the products (see table 3.3). This table is a simple example of a separate history table with transactional times. For the purpose of this example, we add times to the history table:

Row  id  startTime            endTime              name    price
1    1   2011-01-01 08:15:26  9999-12-31 23:59:59  Young   6
2    2   2011-01-01 08:17:12  9999-12-31 23:59:59  Mature  8
3    3   2011-01-01 08:18:03  2014-01-01 15:47:52  Old     11
4    3   2014-01-01 15:47:52  9999-12-31 23:59:59  Old     12

Tab. 3.4.: History table with transactional times

Suppose now that the IT department also added valid times to the product table, as illustrated in table 3.5.

Row  id  startTime            endTime              name    price
1    1   2008-06-01 08:00:00  9999-12-31 23:59:59  Young   6
2    2   2008-06-01 08:00:00  9999-12-31 23:59:59  Mature  8
3    3   2008-06-01 08:00:00  2013-12-31 23:59:59  Old     11
4    3   2014-01-01 00:00:00  9999-12-31 23:59:59  Old     12

Tab. 3.5.: Product table with valid times

The times indicated in table 3.5 deviate to a great extent from those in the history table 3.4, as the transaction times differ greatly from reality. CheeseHut opened on 1 June 2008 at 8 o'clock, and therefore the prices were in fact valid from then on. The database, however, was only set up on 1 January 2011, which explains the deviation between the valid time and the transaction time. Furthermore, the price change of the old cheese was supposed to be enacted with the transition to the year 2014, but the database was only updated in the afternoon of 1 January.

Example 3.2 shows that valid time and transaction time can differ significantly. These two temporal attributes store different information, and a combination of both enables users to retrieve not only the time when a statement was true in reality, but also the time when that statement was true according to the database. Both attributes therefore have their own purpose and use cases.

Now that the purpose and usefulness of temporal databases are evident, it may come as a surprise that it took almost 30 years from the first concrete idea of a temporal database until temporal features were included in the standard SQL:2011 [10, 5]. The effort put into research was undoubtedly not the issue, as about 400 papers had already been written by 1992 [9]. More than 20 temporal data models and query languages had been proposed by that time, with each contributor having their own vision on the implementation of the temporal dimension in databases [9]. There has been a very active academic discussion about the terminology of temporal features and attributes, but a general consensus data model could not be established due to the profusion of proposals [14]. This was an immense stumbling block for the consolidation of knowledge within this research domain. A core issue which divided the research community revolved around the role of temporal data. One group of researchers took the position that temporal data should be treated as a special kind of data that should be represented with hidden timestamp attributes, accepting the divergence from general relational principles [14]. The other group of researchers, however, advocated sticking with relational principles by treating temporal data as much as possible just like any other data [14].

Nowadays, there is little discussion about the terms introduced in section 3.1, but there are still different data models for temporal databases. The key difference in data models lies in the inclusion of valid time, transaction time, or both. A valid-time relation supports only valid time, while a transaction-time relation supports only transaction time [10]. A bitemporal relation, however, supports both valid and transaction time, which has several advantages but also makes the implementation more complex [14].
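Returning to example 3.2, the difference between the two attributes also shows in how they are queried. The sketch below asks the same question once against the transaction-time history table (table 3.4) and once against the valid-time product table (table 3.5); the table names dbo.ProductHistory and dbo.Product are assumptions made for the example.

-- What price did the database record for old cheese (id = 3) on 1 July 2013?
-- (transaction time, table 3.4)
SELECT price
FROM dbo.ProductHistory
WHERE id = 3
  AND startTime <= '2013-07-01'
  AND endTime   >  '2013-07-01';

-- What price actually applied to old cheese on 1 July 2013?
-- (valid time, table 3.5)
SELECT price
FROM dbo.Product
WHERE id = 3
  AND startTime <= '2013-07-01'
  AND endTime   >  '2013-07-01';

-- For 1 July 2013 both queries return 11; for a date before the database was
-- set up (e.g. 2010-01-01) only the valid-time query returns a price.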

3.3.1. Implementation issues

As touched upon in section 3.2, the built-in support of temporal databases for temporal attributes can offer solutions to rather complex problems and atypical situations. Date et al. [14] identified several issues that arise when manually implementing a temporal database with valid times. In the following, we introduce two major problems that Date et al. addressed.

The redundancy problem
Undoubtedly, consistency is of major importance for structured databases. Data needs to be stored in a uniform way in order to routinize the information retrieval process. Therefore, redundancy needs to be avoided in order to guarantee the integrity of the database. Example 3.3 illustrates how a redundancy problem can occur. In table 3.6, the redundancy emerges from the fact that the third and fourth rows could be stored in one row without losing any information. The price effectively never changed, and therefore the fourth row should be deleted and the end time of the third row should be set to '9999-12-31 23:59:59'. This way, a query with the semantics of 'Since when does the old cheese cost 11 euros?' can be answered by 'Since 2008-06-01 08:00:00' instead of 'Since 2008-06-01 08:00:00 and since 2014-01-01 00:00:00', with the latter implying that an event occurred in between these times.

Example 3.3
Suppose that CheeseHut is not satisfied with the new supplier for the old cheese (see example 3.1). The management decides that they will not do business with the new supplier and stay with the old supplier. Consequently, the price change that was implemented before is reverted, as shown in table 3.6.

Row  id  startTime            endTime              name    price
1    1   2008-06-01 08:00:00  9999-12-31 23:59:59  Young   6
2    2   2008-06-01 08:00:00  9999-12-31 23:59:59  Mature  8
3    3   2008-06-01 08:00:00  2013-12-31 23:59:59  Old     11
4    3   2014-01-01 00:00:00  9999-12-31 23:59:59  Old     11

Tab. 3.6.: Redundancy problem
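A minimal sketch of how such redundancy could be detected is given below; it assumes the table of example 3.3 is stored as dbo.Product with closed intervals at one-second granularity, as in table 3.6.

-- Find pairs of rows for the same product whose periods meet and whose values
-- are identical: such pairs are redundant and should be coalesced into one row.
-- Table and column names are assumptions based on example 3.3.
SELECT a.id, a.startTime AS firstStart, b.endTime AS secondEnd
FROM dbo.Product AS a
JOIN dbo.Product AS b
  ON  a.id = b.id
  AND DATEADD(SECOND, 1, a.endTime) = b.startTime  -- periods meet
  AND a.name  = b.name
  AND a.price = b.price;                           -- values unchanged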

The contradiction problem
Temporal tables usually include the start and the end time of the temporal attribute in the primary key, which enables users to add rows which are, apart from the times, identical.

A row with identical times and different values is, due to the nature of primary keys, not possible, as this would establish a contradiction in the database. However, the start and end times in different rows can overlap while still being non-identical, which allows for unwanted contradictions. Example 3.4 demonstrates the contradiction problem. A query with the semantics of 'How much is young cheese on 2 June 2018?' would return contradictory results, since the database states in the first and fourth rows that young cheese costs both six euros and four euros.

Example 3.4
Suppose that CheeseHut celebrates its 10-year anniversary with a discount on young cheese. For two weeks starting from 1 June 2018, the stores sell young cheese for four euros, which is implemented in the database as shown in table 3.7.

Row  id  startTime            endTime              name    price
1    1   2008-06-01 08:00:00  9999-12-31 23:59:59  Young   6
2    2   2008-06-01 08:00:00  9999-12-31 23:59:59  Mature  8
3    3   2008-06-01 08:00:00  9999-12-31 23:59:59  Old     11
4    1   2018-06-01 08:00:00  2018-06-16 07:59:59  Young   4

Tab. 3.7.: Contradiction problem

These two problems have in common that they need to be solved by implementing rather complex constraints, which makes it evident why a self-made temporal database with valid times is a challenging endeavour. For a comprehensive discussion of the solutions to these problems, we refer to the work of Date et al. [14].
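As an impression of what such a constraint might look like, the sketch below uses a trigger to reject rows whose valid-time period overlaps an existing period for the same product; it assumes the table of example 3.4 is stored as dbo.Product, and a complete solution would need to cover more cases.

-- Sketch of a trigger that prevents the contradiction of example 3.4 by
-- rejecting rows whose valid-time period overlaps an existing period for the
-- same product. Table and column names are assumptions based on the example.
CREATE TRIGGER trg_Product_NoOverlap
ON dbo.Product
AFTER INSERT, UPDATE
AS
BEGIN
    IF EXISTS (
        SELECT 1
        FROM inserted AS i
        JOIN dbo.Product AS p
          ON  p.id = i.id
          -- a different row for the same product (the period is part of the key)
          AND NOT (p.startTime = i.startTime AND p.endTime = i.endTime)
          -- closed intervals overlap
          AND p.startTime <= i.endTime
          AND i.startTime <= p.endTime
    )
    BEGIN
        RAISERROR ('Overlapping valid-time period for this product.', 16, 1);
        ROLLBACK TRANSACTION;
    END
END;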

3.4. Data Analytics

Data analytics can be defined as the use of IT applications to support decision-making by analyzing large data sets [26]. Commonly, data analytics is used in the context of structured data stored in data warehouses to maximize the usefulness of the collected data [11]. The main goal of data analytics is to improve a company's performance by analyzing the past (descriptive analytics) and predicting the future (predictive analytics). More specifically, descriptive analytics aims to explain why something has happened by summarizing raw data into a format that is appealing to business users [8]. For instance, descriptive analytics could give an answer to why a company's sales declined. Predictive analytics, however, is used to evaluate the future and to forecast trends by means of prediction models and scoring [8]. An example of predictive analytics could be the prediction of a company's net profit at the end of the fiscal year.

Data analytics projects can be structured in phases, as suggested by Runkler's four-phase model [26]. The first step in a data analytics project is the preparation of the data, meaning that data is collected and a selection is made based on the information needs of the business users. Afterwards, the data needs to be preprocessed, as raw data is often not in an analyzable format and therefore needs to be cleaned, filtered, and transformed. When the data is in a suitable format, mathematical and statistical calculations can be applied to the data in order to gain valuable insights. These insights are commonly communicated in the form of visual dashboards which can be used by business users. In the last phase, the business users interpret and document the findings and possibly take action based on the gained insights. See figure 3.1 for an illustration of the four phases of data analytics projects.

Fig. 3.1.: Phases of data analytics projects, based on Runkler [26].

3.4.1. Big Data

It is estimated that the data stored online doubles every 20 months [26]. These massive amounts of continuously produced heterogeneous data are often referred to as big data. The largest part of big data is in unstructured form, such as photos, videos, or user posts, which poses a challenge for companies to analyze the data [2]. The challenges of big data are often referred to as the Three V's, which stand for Volume, Variety, and Velocity [8]. The volume, referring to the enormous size of the data, creates issues such as high storage and computational costs. The second V, variety, concerns the heterogeneity of the data sources, which can be, for example, a mixture of structured data created by employees and unstructured data generated by Internet of Things devices. Velocity can be described as the frequency at which data is generated, and the time it takes to be able to analyze the data. The challenge connected to the velocity of data is to process the data at the rate it is created.

Big data analytics seeks to cope with these challenges and endeavours to achieve the efficient processing and analysis of massive amounts of heterogeneous data. Current research mainly focuses on extracting information from unstructured data by means of data mining, which aims at deriving relationships and other insights from the data using statistical methods and machine learning algorithms [11]. A key issue for data mining is efficiency, as the methods and algorithms used often require complex computations. A popular framework for reducing processing time is MapReduce, which enables the parallel processing of massive amounts of (unstructured) data on a distributed network of servers [11].

3.4.2. The role of temporal data

Temporal data is a valuable source of information for analytics, as it gives insights on trends and other significant changes in data over time. Non-temporal databases do not offer enough information on the temporal dimension of the data, which may lead to biased insights. Both descriptive and predictive analytics can benefit a great deal from a complete history of data enabled by temporal databases. After all, the insights gathered using descriptive and predictive analytics are mostly based on historical data. Also, the inclusion of valid time can make analyses and predictions more precise, as it reduces the gap between reality and the modelled reality (i.e. the database).

As for big data analytics, temporal data is undoubtedly valuable. Temporal databases, however, are less relevant, because they are grounded on database management systems which require structured data. Even if unstructured data is processed into structured data and inserted into a relational database, the added value of temporal databases is limited. Valid times are assigned and maintained by humans, which is neither feasible nor logical for big data due to its high volume and the mostly external human sources or sensors the data is retrieved from. Furthermore, transaction times are useful for tracking the changes within a database, which may occur if the data refers to some object that changes over time. This is, however, rarely the case for big data, since big data often refers to either a state at a certain point in time or to an unchangeable object.

3.5. Data Warehouses

A data warehouse can be defined as a database which stores data from several sources and presents it in an integrated structure that is suitable for effective decision-making support [11]. Data warehouses are an integral part of data analytics projects, as they provide the infrastructure for effective analysis. Specifically, data warehouses focus on subjects of analysis that are valuable for decision-making, such as sales or customer satisfaction, while operational databases are normalized and store the data that is necessary for the effective functioning of the company's operations [11]. Furthermore, data warehouses store historical data and are optimized to support the efficient handling of complex queries by calculating and aggregating significant performance indicators beforehand [11].

Significant tools for monitoring the performance of a company are key performance indicators (KPIs), which are measurable organizational targets [11]. Data warehouses present these KPIs (also called measures) from different dimensions, which are perspectives such as time or location. For instance, the measure 'sales' can be seen from the time dimension with the query 'sales in 2015', or from the location dimension with the query 'sales in Berlin'. This principle is called multidimensional modelling and builds the foundation of modern data warehouses [11]. In practice, this concept is implemented by designing fact tables which are surrounded by connected dimension tables. This design is often referred to as a star schema because it resembles a star-like construction [19]. The fact tables mainly contain measures and references to dimension tables, while dimension tables provide background information to enable the business user to see the measures from different perspectives. Example 3.5 illustrates the principle of multidimensional modelling.

Example 3.5
Suppose that CheeseHut has experienced a substantial decrease in sales and the CEO wants to know the reason for that. The tables below show an excerpt of the sales fact table and the corresponding product and store dimension tables of CheeseHut's data warehouse.

Row  ProductKey  StoreKey  Quarter  Sales
1    1           2         2018Q3   9562
2    1           3         2018Q3   11279
3    1           2         2018Q4   6043
4    1           3         2018Q4   10862
5    1           2         2019Q1   4590
6    1           3         2019Q1   11054

Tab. 3.8.: Excerpt of sales fact table

Row  ProductKey  Name
1    1           Young
2    2           Mature
3    3           Old

Tab. 3.9.: Product dimension table

Row  StoreKey  Location
1    1         London
2    2         Berlin
3    3         Amsterdam

Tab. 3.10.: Store dimension table

The company's data analytics specialist looks into the data and sees that the measure 'sales' for the store with the ID '2' and the product '1' decreased significantly from the third quarter of 2018 to the first quarter of 2019. Looking into the product and store dimension tables, the expert concludes that the sales of young cheese in the store located in Berlin almost halved, while the sales in the store in Amsterdam remained stable. The specialist shares these insights with the CEO, thereby enabling her to make a data-driven decision.
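The specialist's analysis corresponds to a simple query over the star schema; the sketch below assumes the tables of example 3.5 are stored as FactSales, DimProduct, and DimStore.

-- Sales of young cheese per store and quarter, joining the fact table with
-- its dimension tables (table names assumed from example 3.5).
SELECT s.Location,
       f.Quarter,
       SUM(f.Sales) AS TotalSales
FROM FactSales AS f
JOIN DimProduct AS p ON p.ProductKey = f.ProductKey
JOIN DimStore   AS s ON s.StoreKey   = f.StoreKey
WHERE p.Name = 'Young'
GROUP BY s.Location, f.Quarter
ORDER BY s.Location, f.Quarter;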

Another central principle of data warehousing is ETL, which is the classical process of extracting significant data from different sources, transforming the data into a format that is suitable for analysis, and loading the transformed data into the data warehouse. The sources are often operational databases, but increasingly also include unstructured data from, for example, Internet of Things devices. The data that is significant for analyzing the company's performance therefore needs to be transformed into a format that can be used within data warehouses. This transformation process is commonly done in so-called staging databases, which serve as temporary data storage. Once the data is in the right format, the data is transferred from the staging database to the data warehouse. Clearly, this process implements the preparation and preprocessing phases of data analytics projects mentioned in section 3.4. An illustration of the ETL process is given in figure 3.2.

Fig. 3.2.: ETL process
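As a minimal illustration of the transform-and-load step, the sketch below moves cleaned product rows from a staging table into a dimension table of the data warehouse; all table and column names are hypothetical.

-- Minimal sketch of a load step from a staging table into a dimension table;
-- table and column names are hypothetical.
INSERT INTO dw.DimProduct (ProductKey, Name, Price, LoadDate)
SELECT s.ProductID,
       s.ProductName,
       s.ListPrice,
       SYSDATETIME()                -- timestamp added during the ETL process
FROM staging.Product AS s
WHERE NOT EXISTS (SELECT 1
                  FROM dw.DimProduct AS d
                  WHERE d.ProductKey = s.ProductID);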

3.5.1. The temporal dimension

An operational database usually does not store the history of its content. For the purposes of analysis, however, historical data is of high value. Therefore, an important function of the ETL process is the addition of the temporal dimension. This is done by adding timestamps to the data, such as the time when a certain row has been inserted or modified. If a business requires more detailed temporal information, one can even add transaction times to the data during the ETL process. This, however, is not the same as using a temporal database, as the ETL process is usually executed periodically. Accordingly, the data warehouse is only updated daily, weekly, or even monthly [11]. While this may be sufficient for many business applications, it poses the risk of losing data in the case of fast-changing data sources. For instance, if one row in a table in the operational database changes several times between two ETL processes, only the last version of the row is loaded into the data warehouse. Hence, the changes that happen between the two ETL processes are not reflected in the data warehouse. This issue is illustrated in example 3.6.

Example 3.6
The IT department of CheeseHut has configured the ETL process to run weekly every Sunday night. On Monday, 3 June 2019, CheeseHut introduces a new cheese with spinach flavour. The non-temporal product table of CheeseHut is shown in table 3.11.

Row  id  name     price
1    1   Young    6
2    2   Mature   8
3    3   Old      11
4    4   Spinach  9

Tab. 3.11.: Product table of CheeseHut

On Thursday, 6 June 2019 a customer complains about stomach ache after eating the spinach cheese. CheeseHut’s quality assurance department finds that some of the spinach cheeses are contaminated. CheeseHut immediately takes the spinach cheese off the shelves and consequently removes the cheese from the product table. On Sunday, 9 June 2019 the weekly ETL process is executed, and the product table is scanned for any updates. The table is the same as it has been the week before, and no changes are made to the product dimension table in the data warehouse.

Furthermore, business users nowadays require real-time performance indicators, which presents challenges to the ETL technology [11]. For one, the ETL process itself consumes time, which is a hurdle for the real-time analysis of the data. Also, the resulting delay means that the added timestamps refer to the time of the ETL process execution and do not represent the actual times at which the data has been added, updated, or deleted in the operational database. One approach to cope with these issues is to increase the execution frequency of the ETL process. However, the execution of ETL processes is rather costly, which justifies the search for alternative ways to process data in (nearly) real-time.

4 Practical Background

4.1. SQL:2011 standard ...... 31 4.2. Microsoft SQL Server implementation ...... 35 4.3. Other implementations ...... 37

This section gives an overview of several available implementations of temporal databases in order to exemplify the differences and similarities between the temporal database concepts as described in the literature and their realization in practice. First, the SQL:2011 standard is introduced, which defines the foundation for most available solutions. In the second section, the implementation of temporal databases in Microsoft SQL Server 2016 is described, which is used in the prototypes made within this research. Finally, two other implementations of temporal databases are presented.

4.1. SQL:2011 standard

In 1995, the International Organization for Standardization (ISO) started an effort to extend the SQL standard to support temporal data. However, the ISO SQL committee could not agree on a common proposal and the interest of DBMS vendors in supporting temporal data was rather low, which led to the cancellation of the project in 2001. Ten years later, the ISO and the International Electrotechnical Commission (IEC) published the SQL standard SQL:2011, which ultimately introduced support for temporal tables¹. In the following, the key definitions of the SQL:2011 standard are given, based on an article by Kulkarni and Michels [5].

Period definitions
SQL:2011 introduces period definitions which identify a pair of columns as a period, consisting of a start time and an end time. The standard defines that the start time is included in the period, while the end time is excluded.

¹ Before, we referred to this concept as temporal database because this term is mostly used in the literature. However, temporal table describes the concept more precisely and is commonly used in practice.


Naturally, the standard also defines the constraint that the end time of a period needs to be greater than its start time. Furthermore, SQL:2011 defines two time dimensions: the system-time period, which is the equivalent of what is referred to in the literature as transaction time, and the application-time period, which corresponds to valid time.

Application-time period tables
Application-time period tables are designed to meet the demand for capturing the time during which a certain proposition is true in the real world according to current beliefs. This time period is defined by the user and can be updated at any time. Using the syntax defined by SQL:2011, the product table with valid times, as shown in table 3.5 (p. 23), can be created using this code:

CREATE TABLE Products(
    id INTEGER,
    startTime TIMESTAMP(12),
    endTime TIMESTAMP(12),
    name VARCHAR(20),
    price INTEGER,
    PERIOD FOR ProductPeriod (startTime, endTime)
)

As can be seen, no particular column names are prescribed; the user can pick any suitable names. However, SQL:2011 sets the restriction that the data types of the start and end columns need to be either date or timestamp, and both columns need to have the same data type. After the application-time period has been defined, it is possible to execute update and delete statements that take changes in the application-time period into account. The standard defines the for portion of clause, which indicates the period for which the SQL statement is applied. For instance, table 3.5 could be changed so that the price of young cheese is set to 5 euros for the year 2013 with the following code:

UPDATE Products
FOR PORTION OF ProductPeriod
    FROM TIMESTAMP '2013-01-01 00:00:00'
    TO TIMESTAMP '2013-12-31 23:59:59'
SET price = 5
WHERE id = 1

Since the first row in table 3.5 already defines the price during this period, this statement replaces the row with the following three new rows:

Row  id  startTime            endTime              name   price
1    1   2008-06-01 08:00:00  2013-01-01 00:00:00  Young  6
2    1   2013-01-01 00:00:00  2013-12-31 23:59:59  Young  5
3    1   2013-12-31 23:59:59  9999-12-31 23:59:59  Young  6

Tab. 4.1.: Excerpt of product table with application-time periods

A delete statement with the for portion of clause behaves similarly, as it also creates new rows for the remaining periods during which a certain proposition is true in the modelled world. Furthermore, SQL:2011 defines seven period predicates that simplify querying data in application-time period tables: contains, equals, overlaps, precedes, immediately precedes, succeeds, and immediately succeeds. For example, if one wants to know how much the young cheese cost as of July 1, 2009, one can use the following query:

SELECT name, price
FROM Products
WHERE id = 1
AND ProductPeriod CONTAINS TIMESTAMP '2009-07-01 00:00:00'

The fact that the application-time periods are maintained by the user indicates that there must be constraints to ensure that the periods cannot lead to contradictions in the database. SQL:2011 proposes that application-time periods for the same object must not be overlapping. Note that this requirement, when correctly implemented, should solve both the redundancy problem and the contradiction problem presented in section 3.3.1. Another constraint defined by the standard prevents references from a child table to a parent table if the application-time periods of an object in the child table are not contained in the periods of the matching object in the parent table. These complex constraints indicate that the implementation of application-time period tables is a rather difficult endeavour.
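As a sketch of how such constraints look in SQL:2011 (the syntax follows the description by Kulkarni and Michels [5]; the child table OrderLines is an illustrative assumption, and support varies per DBMS):

-- Primary key that forbids overlapping application-time periods per product
ALTER TABLE Products
    ADD PRIMARY KEY (id, ProductPeriod WITHOUT OVERLAPS)

-- Foreign key whose periods must be contained in the referenced product's periods
ALTER TABLE OrderLines
    ADD FOREIGN KEY (productId, PERIOD OrderPeriod)
        REFERENCES Products (id, PERIOD ProductPeriod)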

System-versioned tables
System-versioned tables are designed for keeping track of all data changes in a table, which makes it possible to reconstruct the state of the table at a certain point in time. The SQL:2011 standard defines that all update or delete statements need to store the current state of a row prior to updating or deleting it. This happens completely automatically, which means that the system has control over the start and end times of the system-time periods. In contrast to application-time period tables, the user is not able to change these periods. This has the advantage that the history of the data changes is, in principle, immutable, which prevents human failures and makes system-versioned tables very suitable for auditing purposes. Interestingly, the SQL:2011 standard does not define that the system-time periods should be stored in a separate table. Thus, similar to the application-time period table, a system-versioned table is created by adding columns to the current table. The code for creating the product history table 3.4 (p. 22) is:

CREATE TABLE Products(
    id INTEGER,
    startTime TIMESTAMP(12) GENERATED ALWAYS AS ROW START,
    endTime TIMESTAMP(12) GENERATED ALWAYS AS ROW END,
    name VARCHAR(20),
    price INTEGER,
    PERIOD FOR SYSTEM_TIME (startTime, endTime)
) WITH SYSTEM VERSIONING

As both current and historical data are contained in one table, SQL:2011 proposes names to separate the data: current system rows and historical system rows. Plainly, current system rows are the rows whose system-time period contains the current time, and all other rows are historical system rows. Since the system exclusively manages the historical system rows, users can execute update or delete statements only on current system rows. An update statement first copies the old row and sets the period end time of this copy to the timestamp at which the statement is executed. The current row is then updated according to the statement, its period start time is set to the execution timestamp, and its end time is set to the default value of ’9999-12-31 23:59:59’. A delete statement simply sets the period end time to the execution timestamp of the statement.

The major use case of system-versioned tables is so-called ’time travel’, which in this context means presenting the state of the table at a certain point in time or period. Therefore, SQL:2011 defines the syntax for system_time as of, which makes it possible to query the content of a table at any point in time or for any time period. For instance, if one wants to retrieve the content of table 3.4 as of July 1, 2009 at 14:00, one can use the following select statement:

SELECT id, startTime, endTime, name, price
FROM Products FOR SYSTEM_TIME AS OF TIMESTAMP '2009-07-01 14:00:00'

For time periods, SQL:2011 defines the expressions between ... and ..., which includes the period end time, and from ... to ..., which does not include the period end time. Thus, if one wants to retrieve all rows that were current system rows between July 1, 2009 at 09:00 and (including) August 1, 2009 at 17:00, the following code can be used:

SELECT id, startTime, endTime, name, price
FROM Products FOR SYSTEM_TIME BETWEEN
    TIMESTAMP '2009-07-01 09:00:00' AND
    TIMESTAMP '2009-08-01 17:00:00'

Clearly, system-versioned tables are easier to implement than application-time period tables because the system-time periods cannot be tampered with by users. Therefore, only basic constraints such as primary key or foreign key constraints need to be enforced on the current system rows. The historical system rows are immutable and thus do not need any protection in the form of constraints.

Bitemporal tables
As already defined in section 3.3, bitemporal tables are tables which include valid time and transaction time. This concept is also included in SQL:2011, which defines that a table is bitemporal if it includes both application-time and system-time periods. The implementation is rather straightforward, as it is merely a combination of both concepts. A bitemporal table which combines tables 3.4 and 3.5 can be implemented using this code:

CREATE TABLE Products(
    id INTEGER,
    ApplicationStart TIMESTAMP(12),
    ApplicationEnd TIMESTAMP(12),
    SystemStart TIMESTAMP(12) GENERATED ALWAYS AS ROW START,
    SystemEnd TIMESTAMP(12) GENERATED ALWAYS AS ROW END,
    name VARCHAR(20),
    price INTEGER,
    PERIOD FOR ProductPeriod (ApplicationStart, ApplicationEnd),
    PERIOD FOR SYSTEM_TIME (SystemStart, SystemEnd)
) WITH SYSTEM VERSIONING

Bitemporal tables are very useful if there is a need to record both the times during which a proposition is believed to be true in the modelled reality and the times during which a proposition was recorded in the database. This way, complex scenarios can be captured in the database without a potential loss of temporal data.
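A bitemporal query can then combine both time dimensions. The following sketch, using the syntax introduced above with illustrative dates, asks what the database recorded on 1 July 2014 about the price that applied in the modelled world on 1 July 2009:

SELECT name, price
FROM Products FOR SYSTEM_TIME AS OF TIMESTAMP '2014-07-01 00:00:00'
WHERE id = 1
  AND ProductPeriod CONTAINS TIMESTAMP '2009-07-01 00:00:00'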

4.2. Microsoft SQL Server implementation

In 2016, Microsoft published its database management system SQL Server 2016, which supports temporal tables for the first time. This functionality is based on the SQL:2011 standard, but it has one significant restriction: SQL Server 2016 supports only system-versioned tables. This is also the case for the current 2017 version [28]. As mentioned in section 4.1, application-time period tables are considerably more complex, and the effort needed to implement this component is therefore higher, which might be the reason why Microsoft decided not to support application-time period tables. Hence, if one needs to store application-time periods in a database, one either needs to implement this functionality oneself or has to maintain the time periods manually.
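One way to approximate application-time periods in SQL Server is to maintain them as ordinary columns. The following is a minimal sketch (table, column, and constraint names are illustrative); rules such as non-overlapping periods would still have to be enforced by application logic or triggers:

CREATE TABLE ProductsWithValidTime (
    id        INT       NOT NULL,
    ValidFrom DATETIME2 NOT NULL,   -- manually maintained start of the valid-time period
    ValidTo   DATETIME2 NOT NULL,   -- manually maintained end of the valid-time period
    name      VARCHAR(20),
    price     INT,
    CONSTRAINT CK_ValidPeriod CHECK (ValidTo > ValidFrom)
);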

System-versioned tables
In contrast to the SQL:2011 standard, a system-versioned table is implemented as two separate tables: a current table and a history table. In both tables, there need to be two columns of type datetime2 which are maintained by the system and therefore cannot be changed by the user. When creating a system-versioned table, one only needs to create the current table and define that it is system-versioned. The history table is then created automatically by the system. The syntax for implementing a system-versioned table is similar to the SQL:2011 standard, as can be seen in the following code, which implements the product history table 3.4 of CheeseHut:

CREATE TABLE Products
(
    id INT CONSTRAINT PK_Products PRIMARY KEY,
    startTime DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
    endTime DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
    name VARCHAR(20),
    price INT,
    PERIOD FOR SYSTEM_TIME (startTime, endTime)
)
WITH (SYSTEM_VERSIONING = ON);

In the history table, only historical data is stored, which means that when creating and filling a system-versioned table, the associated history table is empty at first. Only after updating or deleting a row in the current table is the old row copied to the history table. This has the advantage that current data and historical data are strictly separated and that the current table behaves similarly to a normal, non-temporal table. This way, applications that do not support system-time periods can continue to be used without any adjustments.

In accordance with the SQL:2011 standard, users can only insert, change, or delete current data. In terms of Microsoft’s implementation, this means that users can only modify the content of the current table. Hence, the history table is fully maintained by the system. For instance, when a user executes an insert statement, a new row is added to the current table with the system-time period set by the system. The history table does not change, as there is no historical data to be added or updated. However, when a user executes an update or delete statement, a copy of the corresponding row is stored in the history table, and the row is updated or deleted in the current table. The system-time periods are updated according to the rules defined by the SQL:2011 standard. Example 4.1 illustrates the behaviour of system-versioned tables as implemented in SQL Server.

Example 4.1
CheeseHut’s IT department decided to use SQL Server to implement their products table as a system-versioned table to keep track of the data changes. On 20 May 2019, the department creates and fills the table. The history table is empty as there have not been any changes yet. Table 4.2 shows the current table of the new product table.

Row  id  startTime                endTime                  name    price
1    1   2019-05-20 07:50:45.045  9999-12-31 23:59:59.999  Young   6
2    2   2019-05-20 07:51:12.268  9999-12-31 23:59:59.999  Mature  8
3    3   2019-05-20 07:51:55.651  9999-12-31 23:59:59.999  Old     11

Tab. 4.2.: System-versioned current table

Two days later, the IT department notices that the price of the mature cheese should be 9 euros instead of 8 euros. They change the product table accordingly, and the current table and history table are updated as can be seen in tables 4.3 and 4.4.

Row  id  startTime                endTime                  name    price
1    1   2019-05-20 07:50:45.045  9999-12-31 23:59:59.999  Young   6
2    2   2019-05-22 11:02:18.492  9999-12-31 23:59:59.999  Mature  9
3    3   2019-05-20 07:51:55.651  9999-12-31 23:59:59.999  Old     11

Tab. 4.3.: System-versioned current table

Row  id  startTime                endTime                  name    price
1    2   2019-05-20 07:51:12.268  2019-05-22 11:02:18.492  Mature  8

Tab. 4.4.: System-versioned history table
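For reference, the price correction in example 4.1 corresponds to an ordinary update statement; copying the old row to the history table and stamping the periods is done entirely by the system:

UPDATE Products SET price = 9 WHERE id = 2;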

Regarding the querying of temporal data, SQL Server recognizes the clause for system_time and the corresponding sub-clauses as of, between ... and ..., and from ... to ... that were defined in SQL:2011. Additionally, Microsoft implemented the sub-clauses contained in and all. The sub-clause contained in can be used to retrieve all data that has a start time and an end time within the specified boundaries. The sub-clause all returns all rows, including current data and historical data, which would produce a combined table such as table 3.4.
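A sketch of the two additional sub-clauses (the boundary timestamps are illustrative):

-- All current and historical rows in one result set, as in table 3.4
SELECT id, startTime, endTime, name, price
FROM Products FOR SYSTEM_TIME ALL
ORDER BY id, startTime;

-- Only rows whose complete period lies within the boundaries
SELECT id, startTime, endTime, name, price
FROM Products FOR SYSTEM_TIME CONTAINED IN
    ('2019-05-20 00:00:00', '2019-05-23 00:00:00');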

4.3. Other implementations

There are a few other database management systems (DBMS) that offer built-in support for temporal tables. In this section, we present two exemplary DBMS to give a brief insight into the design choices made by other DBMS vendors. One of them is IBM, which implemented system-versioned tables, application-time period tables, and bitemporal tables with DB2 version 10, released in 2012. IBM followed the specifications of the SQL:2011 standard closely and only deviated from them in some details [7]. For instance, IBM uses the term business time instead of application time, and just as Microsoft, IBM decided to separate current rows and historical rows by means of a current table and a history table. In contrast to the implementation in SQL Server, however, DB2 does not create the history table automatically. Hence, the user first needs to create a current table and a history table and then alter the current table to be system-versioned (a sketch of this two-step approach is given at the end of this section).

Another database management system supporting temporal tables is Teradata, which implemented the functionality based on the TSQL2 model before the SQL:2011 standard was published [30]. While the essential functions correspond with the standard, there are some technical differences in the implementation. One significant distinction is the use of a period data type, which consists of two dates or timestamps. This design choice makes sense for temporal tables, but it has consequences that are worthy of attention. That is to say, the introduction of a new data type does not only affect the database language itself but also dependent programming languages and other technologies. Therefore, a new data type could have a negative influence on adoption, which is why SQL:2011 added period definitions instead of a new data type [5]. Teradata adopted this approach after the publication of the SQL:2011 standard and made its database management system compliant with the standard [30].
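The two-step approach of DB2 can be sketched roughly as follows; the syntax is approximated from IBM’s documentation for DB2 10 (for instance, the additional transaction-start-id column), so details may differ between versions and platforms:

-- Current table with system-time period columns (DB2 uses ROW BEGIN/ROW END)
CREATE TABLE Products (
    id        INT NOT NULL PRIMARY KEY,
    name      VARCHAR(20),
    price     INT,
    sys_start TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW BEGIN,
    sys_end   TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW END,
    ts_id     TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS TRANSACTION START ID,
    PERIOD SYSTEM_TIME (sys_start, sys_end)
);

-- Unlike SQL Server, the history table is created explicitly and then linked
CREATE TABLE Products_History LIKE Products;
ALTER TABLE Products ADD VERSIONING USE HISTORY TABLE Products_History;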

5 Temporal Databases compared to Conventional Data Warehouses

5.1. Prototype A: Conventional data warehouse ...... 40 5.1.1. Architecture ...... 40 5.1.2. Insights ...... 40 5.2. Prototype B: Data warehouse with system-versioned tables 41 5.2.1. Architecture ...... 41 5.2.2. Insights ...... 42 5.3. Assessment results ...... 42 5.3.1. Performance ...... 43 5.3.2. Costs ...... 44 5.3.3. Data integrity ...... 45 5.3.4. Maintainability ...... 46 5.3.5. Acceptance ...... 47 5.3.6. Three issues of the classical ETL process ...... 47

In this section, we analyze the potential of a temporal database embedded in a data warehouse architecture compared to a conventional data warehouse. As part of this research, we made two prototypes representing these two concepts. We use the prototypes as a method to evaluate the use of system-versioned tables as an alternative to conventional data warehouses with a classical ETL process. One section each is devoted to the two prototypes: Prototype A, which implements a conventional setup, and Prototype B, which implements a data warehouse architecture using a temporal database. In each of the two sections, the architecture of the prototype is presented and the insights gained from making and using the prototype are discussed. In the third section, the results of the assessment are presented in terms of the five assessment criteria and the three issues of the classical ETL process.


5.1. Prototype A: Conventional data warehouse

5.1.1. Architecture

This prototype has a classical data warehouse architecture consisting of a source database, a staging database, a data warehouse, and a business intelligence dashboard. A high-level illustration of the architecture of this prototype is given in figure 5.1. The source database is filled with data using the AdventureWorks backup file as provided by Microsoft [22]. The transfer between the source database and the staging database is done using transactional replication. This means that a snapshot of the source database is made at the beginning and copied to the staging database. Afterwards, every change made to the source database is immediately applied to the staging database. The function of the staging database is to provide a non-operational space for the tables that are deemed useful for data analytics. For instance, in this prototype, only 31 of the 68 tables in the source database are transferred to the staging database.

The data in the staging database is transformed into an analyzable format, which mainly means that the data is inserted into dimension and fact tables. This is done using a pipeline of copy activities within Data Factory (see appendix A.1). These activities first delete the content of all tables and then insert the current content. A second pipeline then adds the temporal dimension to the data, which means that temporal attributes are added and the history of the data is preserved (see appendix A.2). In this prototype, we added the attributes ’ETLDate’ and ’SysEndDate’. ’ETLDate’ is filled with the date and time of the execution of the ETL process and is similar to the ’SysStartTime’ of system-versioned tables. The attribute ’SysEndDate’ has by default the value ’9999-12-31 23:59:59.000’ and is set to the ETL execution time if a row in the source database has been changed or deleted. Analogous to the behaviour of system-versioned tables, a new row is added to the data warehouse in case a row has been updated. However, it has to be noted that this functionality was added manually to this prototype to allow for a fair comparison between the two prototypes.

Once the analyzable data is loaded into the data warehouse, the data is imported into Power BI using the SQL Server database connector. This means that a snapshot of the data warehouse is made at the time of the first import. Afterwards, it is possible to refresh the data either on demand or on a schedule. The imported data in Power BI is then used to build business intelligence dashboards using various data visualization tools. When the dashboard is ready, it is published to Power BI on the web, which makes it accessible online to business users. These users can then use the data visualized in the BI report to support their decision-making processes.

Fig. 5.1.: Architecture of prototype A
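The actual pipeline code is in appendix A.2; purely as an illustration of the manually added temporal attributes described above, a dimension table in the data warehouse of prototype A could look roughly like this (column names other than ETLDate and SysEndDate are assumptions):

CREATE TABLE DimProduct (
    ProductKey INT IDENTITY(1,1) PRIMARY KEY,   -- surrogate key of the dimension
    ProductID  INT NOT NULL,                    -- business key from the source
    Name       VARCHAR(50),
    Price      INT,
    ETLDate    DATETIME2 NOT NULL,              -- filled with the ETL execution time
    SysEndDate DATETIME2 NOT NULL
        DEFAULT '9999-12-31 23:59:59.000'       -- closed when the source row changes
);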

5.1.2. Insights

While making and using this prototype, we gathered some insights that we discuss in the following. First, it was surprising that there is still a lot of manual coding work necessary to implement the ETL process. After all, ETL is the industry standard for the data flows within data warehouse architectures. On the one hand, Data Factory supports the user immensely by offering a graphical interface and integrated connectors to many data sources. On the other hand, if the ETL process involves more sophisticated actions than copying data from one table to another, there is no way around writing SQL queries. These SQL queries get rather advanced if one wants to implement an incremental load (see appendix A.2), which is necessary if the history of the data warehouse should be kept. Also, adding temporal attributes during the ETL process is part of the industry’s best practices, but there is no built-in solution for it. Hence, while Data Factory delivers a clean and simple user interface, it does not relieve the developer from manually implementing standard operations.

Furthermore, during the implementation of the prototype, it became apparent how many steps and different technologies are used in a conventional data warehouse architecture. While each part of the architecture has its own function and purpose, it is questionable whether it is necessary to have three databases which are all based on the same data. In principle, one could query all data from the operational database and possibly other sources directly using Power BI’s built-in connectors. Indeed, there are strong arguments against this approach, such as data size limits and performance issues. But as technology continually evolves and data storage becomes more affordable, it may be a possibility in the near future.

5.2. Prototype B: Data warehouse with system-versioned tables

5.2.1. Architecture

This prototype makes use of system-versioned tables as an alternative to a conventional data warehouse. An illustration of the architecture is given in figure 5.2. The prototype uses the same source database as prototype A and replicates the significant data to the temporal database. The temporal database has the function of both the staging database and the data warehouse, as the transformations and the storage of the analyzable data are done in this database. The technology behind it, however, is entirely different from the technology used in prototype A. The temporal database consists of the same tables as the staging database in prototype A, with the difference that the tables are system-versioned. This means that the addition of the temporal attributes as well as the incremental load are implemented by the system and therefore do not require manual coding.

The data is then transformed into dimension and fact tables using SQL views, which can be described as virtual tables with aggregated data from tables in the database. The significant difference between this approach and the use of Data Factory is that the transformation happens on demand and directly in the database. The views are created to show the same data as the dimension and fact tables in prototype A. They aggregate all data, including the historical data, by means of the for system_time all command (see section 4.2). The temporal attributes are taken from the system-versioned tables, which means that they reflect the times at which the data has been added, changed, or deleted in the system-versioned tables as a result of the transactional replication. The data in the views can be imported into Power BI in the same way as with prototype A.
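A minimal sketch of such a view (the actual definitions are in appendix A.3; the schema and column names are illustrative) shows how for system_time all exposes the full history including the temporal attributes:

CREATE VIEW dbo.DimProduct AS
SELECT p.ProductID,
       p.Name,
       p.ListPrice,
       p.SysStartTime,   -- temporal attributes maintained by the system
       p.SysEndTime
FROM dbo.Product FOR SYSTEM_TIME ALL AS p;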

Fig. 5.2.: Prototype B: Architecture

5.2.2. Insights

A rather straightforward insight gained from making this prototype is that it requires significantly less coding work than prototype A. The only code that needed to be written was the select statements for the views (see appendix A.3). Also, the system not only handles the temporal attributes and the history completely, but it also offers handy time-related commands, which make it possible to analyze the data within the database easily. Surprisingly, the temporal commands also work on the views, given that at least one temporal table is included. For business users, however, this does not add any functionality, as it is not possible to use the commands within Power BI. This is unfortunate, as a built-in function for point-in-time analysis would surely be of great use. Therefore, the impression arises that the potential of temporal databases has not been fully unleashed, possibly due to the restrained adoption in the industry.

5.3. Assessment results

As stated in section 2.2, we compare the prototypes on two configurations using five assessment criteria. Also, we determine whether prototype B solves the three detected issues of the classical ETL process. In the following, the results of the assessment are presented and analyzed. The exact results and techniques used for the assessment can be found in appendices C.1 and C.2.

5.3.1. Performance

The performance of the prototypes has been assessed by measuring the time of the data flow process, which we split into two parts:
• Transformation of the data into an analyzable format and loading it into the data warehouse.
• Importing the data from the data warehouse into Power BI.
In prototype A, the transformation and loading of the data are implemented in two pipelines in Data Factory. Therefore, we measure the execution times of these pipelines. In prototype B, however, the transformation happens inside the database, and the data does not need to be loaded because the temporal database functions as a data warehouse. For testing the performance of prototype B, we measured the time to query all views in the database. The replication process is left out of account, as there is no reliable time measurement method for it and it only takes a few seconds. In table 5.1, the average results of the performance measurements for both configurations are given.

Prototype  Part                        Time (low cfg.)  Time (high cfg.)
A          Transformation and Loading  44:11            15:57
B          Transformation and Loading  06:14            01:21
A          Power BI import             00:55            00:25
B          Power BI import             11:58            02:43

Tab. 5.1.: Performance assessment results (mm:ss)

Evidently, the performances of the prototypes differ significantly from each other on the low configuration. While the transformation and loading part is a major bottleneck for prototype A, the import into Power BI takes less than one minute. The opposite applies to prototype B: the Power BI import takes almost twelve minutes, while the transformation and loading part is done in less than one third of the time of prototype A. The reason for these differences is that the transformation process in prototype A is handled completely by the Data Factory pipelines. Hence, the data is already stored in an analyzable format in the data warehouse. In prototype B, however, the transformation is implemented in the views, which has the consequence that the transformation needs to be done every time the data is queried. For instance, when one wants to import the data into Power BI, the data from prototype A only needs to be loaded from the database, while the data from prototype B first needs to be transformed and is then loaded from the database.

Regarding the performances on the high configuration, it is apparent that the performance of prototype B improves a great deal. The times consumed by the transformation and loading part and the Power BI import are reduced by factors of 4.6 and 4.4 respectively. Interestingly, the increase in performance is significantly lower for prototype A, with time reduction factors of 2.8 for the transformation and loading process and 2.2 for the Power BI import.

This may be caused by the data transition time in the pipelines, which is not influenced by the performance of the SQL databases. Also, the performance of the Power BI import was already rather good on the low configuration, which means that an increase in hardware performance no longer has as much leverage.

5.3.2. Costs

The costs of the prototypes have been assessed using the pricing calculator provided by Microsoft [24]. Prototype A makes use of a virtual machine running SQL Server, two Azure SQL databases, and two pipelines in Data Factory. Prototype B makes use of the same virtual machine and one Azure SQL database. In principle, the difference between the lower and the higher configuration lies only in the performance setup of the Azure SQL databases. However, Data Factory activities are billed in consumed time units, and the consumed time is lower on the high configuration, as mentioned in the performance assessment. The costs per service and in total for prototypes A and B on both configurations are given in tables 5.2 and 5.3 respectively. For prototype A, we give the costs for both a daily ETL execution and an hourly ETL execution to emphasize the effect of an increase in the ETL execution frequency on the costs.

Service                          Costs (low cfg.)  Costs (high cfg.)
Virtual machine with SQL Server  338.69            338.69
2 SQL databases                  8.26              49.64
Daily ETL execution              18.97             6.69
Hourly ETL execution             455.45            160.61
Total daily ETL execution        365.93            395.02
Total hourly ETL execution       802.40            548.94

Tab. 5.2.: Monthly costs of prototype A (in euros)

Service                          Costs (low cfg.)  Costs (high cfg.)
Virtual machine with SQL Server  338.69            338.69
SQL database                     4.13              24.82
Total                            342.82            363.51

Tab. 5.3.: Monthly costs of prototype B (in euros)

Clearly, the virtual machine causes relatively high expenses. For instance, the costs for the SQL databases and the daily ETL run amount to less than 10 per cent of the total costs of prototype A on the low configuration. This has to do with the reserved storage space and computation power of virtual machines. Furthermore, the total costs of prototype B are slightly lower than the total costs of prototype A with daily ETL execution. While this price difference is negligible, the price difference between prototype B and prototype A with hourly ETL execution is tremendous. This shows that an increase in the frequency of ETL execution goes hand in hand with a significant increase in total costs. However, it is noteworthy that the total expenses for prototype A with hourly ETL execution are lower on the high configuration than on the low configuration.

This is caused by the faster execution of the Data Factory pipelines due to the higher performance of the SQL databases.

5.3.3. Data integrity

As described in section 2.2, we tested the prototypes on their data integrity by assessing the differences in the databases after executing three sets of create, update, and delete transactions on the source database. For prototype A, we evaluate the database at each step before and after an ETL execution.

Step 1: Inserting data
After the create statements have been executed on the source database, the transaction set is replicated to the staging database of prototype A and the temporal database of prototype B. As the views implemented in prototype B retrieve the new data directly within the database, the inserted data is added almost immediately and can then be loaded into Power BI. If one were to import data into Power BI using prototype A, the new data would not be shown, as the data is only added to the staging database. Thus, the data is only visible after an ETL process has been run. The time attributes in prototype A are less accurate than the ones in prototype B, as they only get added during the ETL process.

Step 2: Updating data
Likewise, any update statements executed on the source database are applied to prototype B almost directly, while they are applied to prototype A only after an ETL process has been executed. Apart from the time attributes, there are also differences in the data stored in the data warehouse of prototype A and the temporal database of prototype B. Two of the update statements refer to the same object, which means that only the last state of that object is loaded into the data warehouse of prototype A. This exemplifies the aforementioned issue that changes happening in between two ETL iterations get lost. In contrast, prototype B processes both update statements and records the first one in the history table and the second one in the current table. Surprisingly, the first record is not shown in the view, which is most likely a bug in SQL Server’s implementation. A possible cause is that the two records have the same SysStartTime because the update statements were replicated to the temporal database in one batch. In practice, however, this error is negligible, since more than one update of the same object in a short period of time is rather unusual.

Step 3: Deleting data
Similar observations have been made with regard to the effects of the delete statements. Rows that have been created and deleted in between two ETL iterations are not shown in the data warehouse or the dashboard of prototype A. This is a major issue for the data integrity of prototype A, as whole rows may be deleted by mistake and cannot be restored. Prototype B, however, stores all deleted rows directly in the history table. Since the history table is immutable for users, it is, without any changes to the database, impossible to erase any data completely.

5.3.4. Maintainability

The lines of code written for each prototype can give an indication of their maintainability. For the implementation of prototype A, T-SQL queries needed to be written to transform the data into an analyzable format and to incrementally load the data with temporal attributes into the data warehouse. For prototype B, the views in the temporal database needed to be programmed using T-SQL. In table 5.4, the lines of code per table and prototype are given. Prototype A is split into the first and the second Data Factory pipeline.

Table                  A: Transformation  A: Incremental load  B
DimAddress             15.00              97.00                45.00
DimCurrency            4.00               82.00                7.00
DimCustomer            29.00              116.00               54.00
DimDepartmentGroup     4.00               40.00                7.00
DimEmployee            49.00              168.00               78.00
DimGeography           16.00              88.00                43.00
DimProduct             44.00              118.00               65.00
DimProductCategory     5.00               76.00                8.00
DimProductSubCategory  9.00               81.00                39.00
DimReseller            28.00              93.00                51.00
DimSalesReason         6.00               79.00                9.00
DimSalesTerritory      10.00              84.00                39.00
FactInternetSales      81.00              140.00               80.00
FactResellerSales      83.00              144.00               81.00
FactSalesQuota         6.00               86.00                9.00
Total                  389.00             1492.00              615.00

Tab. 5.4.: Lines of code per table and prototype

As can be seen, the lines of code written for prototype A exceed those of prototype B by a great deal. This is primarily caused by the complexity of the incremental load, which is necessary for the implementation of the temporal dimension. For instance, if one row in the source database has been updated, both an update and an insert statement need to be executed on the data warehouse. The update statement changes the SysEndDate of the obsolete row to the current date, and the insert statement inserts the new, updated row with the current date as ETLDate. These operations consume considerable computation power because the new row needs to be compared entirely with the old row by means of hash values to check for any differences. Also, the ETLDate of the new row needs to match the SysEndDate of the old row in order to avoid any temporal gaps. Therefore, the SysEndDate of the old row needs to be selected in the insert statement, which adds additional complexity to the code.

In contrast, the lines of code written for prototype B amount to about one third of the lines of code of prototype A. This is because considerably less manually

implemented functionality required for prototype B. For instance, the incremental load is done entirely by the system. Furthermore, it has to be noted that all code for prototype B is stored within the database, which makes the maintenance easier because all changes can directly be tested on the database. Concerning prototype A, all code is stored in the Data Factory activities, which complicates any adaptations for several reasons. First, the code is split within the activity into a source query and a pre-copy script. As all tables have at least one activity in each of the two pipelines, the code for one table is divided into four pieces. Hence, if the schema of a table changes, one would need to make adaptations at four different locations within Data Factory. Furthermore, it is not recommended to test new code by running the pipelines in Data Factory. Instead, the code should be tested directly on the database, which means that there are additional steps required for testing.
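To give an impression of the pattern described above, the following is a condensed, illustrative sketch of the incremental load for one dimension table (the real code is in appendix A.2; the schema names staging and dw, the column list, and the hash expression are assumptions):

DECLARE @ETLDate DATETIME2 = SYSUTCDATETIME();  -- in the prototype, set by the pipeline

-- 1. Close rows whose source version has changed (compared via hash values)
UPDATE d
SET d.SysEndDate = @ETLDate
FROM dw.DimProduct AS d
JOIN staging.Product AS s ON s.ProductID = d.ProductID
WHERE d.SysEndDate = '9999-12-31 23:59:59.000'
  AND HASHBYTES('SHA2_256', CONCAT(s.Name, '|', s.Price))
   <> HASHBYTES('SHA2_256', CONCAT(d.Name, '|', d.Price));

-- 2. Insert the new versions; ETLDate matches the SysEndDate of the closed rows
INSERT INTO dw.DimProduct (ProductID, Name, Price, ETLDate, SysEndDate)
SELECT s.ProductID, s.Name, s.Price, @ETLDate, '9999-12-31 23:59:59.000'
FROM staging.Product AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM dw.DimProduct AS d
    WHERE d.ProductID = s.ProductID
      AND d.SysEndDate = '9999-12-31 23:59:59.000'
);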

5.3.5. Acceptance

The acceptance of prototype B is tested by determining the differences in the prototypes’ Power BI user experience. A pronounced difference is the time that the data import takes. The data import of prototype B, especially on the low configuration, takes so long (see table 5.1) that it may be an issue for the acceptance of system-versioned tables as an alternative to classical data warehouses. However, Power BI offers two solutions to this problem. First, it is possible to schedule the data refresh so that the business user does not need to do it. This solution, however, has the disadvantage that the data is not always up to date. Another approach is to use the incremental refresh function, which only loads the new data instead of all data. Unfortunately, this function is only available in the Power BI Premium subscription, which costs around 5000 US dollars per month [23].

Regarding the dashboards, no significant difference between the prototypes could be observed, given that the data in the data warehouse of prototype A is up to date. The data is entirely equivalent except for the temporal attributes, which is decisive for the acceptance, as business users do not need to adapt their work routines. However, users should be aware of the different meaning of the temporal attributes. While ETLDate and SysEndDate of prototype A reflect the state of the data warehouse, the SysStartTime and SysEndTime attributes of prototype B concern the state of the source database¹. Assuming that the business user rather wants to see the data from the temporal dimension of the source database, it can be concluded that this difference has a positive impact on the acceptance of prototype B.

5.3.6. Three issues of the classical ETL process

The traditional ETL process has several problems which constitute considerable limitations for the conventional approach of data warehousing implemented in prototype A.

¹ To be precise, the attributes reflect the state of the temporal database. However, as the delay caused by the replication is very low, we assume the states to be equivalent.

In the following, we analyze to what extent prototype B solves the three issues defined in section 2.2.

Lost transactions
All transactions replicated to the temporal database of prototype B are in principle immutable. If a row gets updated, the system changes the SysEndTime of the current row and inserts a new row with the updated data. If a row gets deleted, the system sets the SysEndTime to the current time. This way, all changes to the source database are kept in the temporal database. Even in the event of a downtime of the temporal database, the source database keeps any transactions in its logs and replicates them as soon as the temporal database is online again. Hence, prototype B prevents the loss of transactions very effectively.

Inaccurate temporal attributes
The temporal attributes in prototype B are maintained entirely by the system. The attributes are added or changed as soon as a replicated transaction from the source database is executed on the temporal database. The delay between the execution of a transaction on the source database and the execution of the replicated transaction on the temporal database amounts to only a few seconds. In contrast, the temporal attributes in prototype A can have a delay of up to 24 hours, given a daily ETL execution. Therefore, it can be concluded that temporal databases offer a satisfactory solution to this issue.

No real-time data processing
Prototype B allows for analysis with less delay than prototype A, but one cannot call this real-time data processing. Even on the high configuration, the prototype has a delay of about three minutes (see table 5.1). However, the increase in performance from the low to the high configuration indicates that it may be possible to reduce the delay further. Furthermore, the incremental refresh feature of Power BI Premium could reduce the data refresh time tremendously. Hence, prototype B, in its present condition, does not support real-time data processing, but it is reasonable to believe that nearly real-time data processing is possible with other setups involving higher costs.

6 Discussion

6.1. Interpretation of the findings ...... 49 6.2. Limitations ...... 52

6.1. Interpretation of the findings

This research investigated the potential of temporal databases for the benefit of data analytics. We defined six sub-questions to answer this main research question. The first sub-question concerns the main concepts of temporal databases. In the literature, we found that the concepts of transaction time and valid time are an integral part of the notion of temporal databases. In particular, a lot of attention is drawn to the use cases of valid time and the issues connected with it. For instance, Date et al. [14] addressed two problems of the implementation of valid time in temporal databases, namely the redundancy problem and the contradiction problem. Surprisingly, this attention for the valid time attribute is not reflected in Microsoft’s implementation of temporal tables.

On account of the second sub-question, regarding the implementations of temporal databases, we found that Microsoft did not implement a valid time attribute in SQL Server, in spite of the attribute being defined in the SQL:2011 standard. On the one hand, one might argue that this is not particularly surprising, because the implementation of valid time is complex since it lies in the control of the user. On the other hand, other DBMS vendors such as IBM and Teradata did implement the valid time attribute, so it is reasonable to believe that this was a conscious decision by Microsoft rather than an implementation issue. Possibly, Microsoft is waiting for a broad establishment of its system-versioned tables in the industry before adding additional functionality.

Furthermore, we investigated the role of temporal data in data analytics and data warehouses on account of the third sub-question of this research. It has been found that data analytics can benefit significantly from a complete history of data and valid times. This, however, is only applicable to the analysis of structured data, as the notion of temporal databases is inextricably linked to relational database management systems.


Considering the increasing demand for big data analytics, the question arises whether it is possible to adapt the concept of temporal databases for use with massive amounts of unstructured data. After all, the idea behind temporal databases was conceived in the late 1980s, and there have not been any fundamental changes to it since. At that time, megabytes of structured data were collected, not petabytes of unstructured data.

Concerning data warehousing, the temporal dimension is a very important, if not the most important, element of data warehouses. A major function of data warehouses is to store a complete history of the analyzable data. Therefore, the use of temporal databases seems like a natural fit for this purpose, which raises the question of why this is not done in practice. Presumably, there is not enough awareness of the existence of temporal databases. Also, the classical approach to load the data into the data warehouse, ETL, typically adds the temporal dimension only at the end of the whole process. This entails several issues which can partly be solved by using temporal databases as an alternative.

The last three sub-questions concern the assessment of the concepts of temporal databases and conventional data warehouses. For that, we made a prototype with a traditional data warehouse architecture and a prototype with a temporal database. These prototypes have been compared using five assessment criteria: performance, costs, data integrity, maintainability, and acceptance. This part of the research aimed at gaining insights into the potential of a temporal database embedded in a data warehouse architecture for the purposes of data analytics. Conventional data warehouses have been found to have a rather cumbersome data flow with much manual coding work involved to implement the temporal dimension. A temporal database, however, supports the temporal dimension by design and therefore saves a great deal of manual coding.

Regarding the assessment, the results indicate that a data warehouse architecture using a temporal database has an overall better performance than a conventional data warehouse. This result may be explained by the fact that temporal databases have built-in, optimized support for incremental loading, which consumes much computation power in conventional data warehouses. Furthermore, a conventional data warehouse needs to transfer data between four systems, namely the source database, the staging database, the data warehouse, and the business intelligence tool. A temporal database can function as both staging database and data warehouse, which avoids delays caused by the data transfer.

However, the results of the assessment suggest that the direct import of data from the temporal database to the business intelligence tool consumes much more time than the import from the conventional data warehouse. While the ETL process transforms the data and stores the analyzable data in the data warehouse, the temporal database is transformed on demand when importing the data into the BI tool. This constitutes an undesirable bottleneck, as the business user needs to wait longer for the analysis of the data. Nonetheless, it has been found that the data import time decreases significantly with higher hardware performance. Also, an incremental load from the temporal database to the business intelligence tool could lower the delay, as not all data would have to be queried again.

Concerning the costs, we could not establish a significant difference between the two concepts. A conventional data warehouse uses one more database for staging and a data integration solution. Naturally, the costs of the data integration depend on the frequency of the ETL execution. For instance, if a company requires hourly updated data in the BI tool, the costs of a conventional data warehouse increase notably. It has to be noted that an hourly update of the data is still inferior to the on-demand update of the temporal database. However, it has been found that the costs of the ETL process can be reduced by increasing the performance of the staging database and the data warehouse. This, however, is only applicable to data integration solutions that are billed based on the execution time.

Regarding data integrity, temporal databases and conventional data warehouses differ greatly in the way the data history and the temporal attributes are handled. In a temporal database, the history table and the temporal attributes are automatically maintained as soon as transactions are transferred to the database. In a conventional data warehouse, these operations are done in the transformation part of the ETL process. As a consequence, only the last state of a row in the source database is transmitted to the data warehouse. Also, rows that are created and deleted in between two ETL iterations are not stored in the data warehouse. In contrast, a temporal database saves a complete history of all transactions transferred to it. Hence, according to the results of the assessment, temporal databases have clear advantages over conventional data warehouses in terms of data integrity.

The maintainability has been found to be fundamentally different between the two concepts. The implementation of the temporal dimension of a conventional data warehouse requires much manual coding, while temporal databases support the temporal dimension by design. It can, therefore, be assumed that changes to the conventional data warehouse require more effort than changes to the temporal database. Furthermore, conventional data warehouses rely on an external data integration solution which contains the whole logic of the temporal dimension. This means that any changes to the code cannot be tested directly, but first need to be extracted and then tested on the staging database and the data warehouse. In contrast, the testing can be done within the temporal database, which makes any changes easier to verify. Especially for data analytics, this is increasingly important due to the fast-changing requirements.

With regard to the acceptance of temporal databases, we identified the poor performance of the data import into the BI tool as a significant issue for the experience of the business users. However, as discussed above, there are solutions to this matter. Furthermore, the data input for the BI tool is equivalent except for the temporal attributes. This is vital for the acceptance of temporal databases, as business users do not need to change their routines. The temporal attributes have a different meaning, as the attributes of the conventional data warehouse architecture refer to the state of the data warehouse itself, while the attributes of the temporal database reflect the state of the source database. Since the purpose of the source database is to model the real world, the attributes of the temporal database are more adequate for the aims of data analytics. This advantage may foster the acceptance of temporal databases as an alternative for data warehouses.

The results of the prototype assessment show that temporal databases solve two of the three issues of the classical ETL process that we identified. In contrast to the classical approach, temporal databases reliably store all transactions transferred to them. Also, as mentioned before, the temporal attributes of temporal databases are more accurate than those of conventional data warehouses. However, we found that temporal databases do not enable real-time data processing, as there is a considerable delay between the execution of a transaction on the source database and the presentation of that transaction in the BI tool. Nevertheless, in general, we see a high potential for temporal databases to enable nearly real-time data processing, as there are the aforementioned solutions to improve the data import performance.

6.2. Limitations

There are some limitations of this research that should be noted. First, prior research relevant to this thesis was partly very limited. Little research has been done on the role of temporal data within data analytics and data warehouses, which limited our research on sub-question Q3. Furthermore, the evaluation of data warehouses has not been researched sufficiently, which was a hurdle for finding adequate assessment criteria and methods. In particular, there is a need for a maintainability prediction model for SQL code. Moreover, the literature on data analytics and data warehouses used in this research was mostly non-academic due to a lack of current, peer-reviewed articles in these domains.

Second, the prototypes made for comparing temporal databases with conventional data warehouses were limited by the financial budget and by the client’s focus on Microsoft products. With a higher financial budget, it would have been possible to make the prototypes more realistic and thereby increase the credibility of the assessment. For instance, higher configurations could have been used in order to have better evidence for the performance assessment. Furthermore, the restriction of the prototypes to Microsoft products meant that we could not assess the practical use of bitemporal databases for data analytics, as Microsoft did not implement the valid time attribute (see section 4.2).

Third, it has to be noted that the selection of the criteria and the technology used for the prototypes limits the generalizability of the assessment of the concepts in some aspects. The criteria were carefully selected from generally accepted assessment criteria in computer science and data analytics literature. However, the final selection of the criteria and measurement methods depends on the researcher, which may cause different findings for sub-questions Q4 and Q5. Also, the use of a different database management system for the prototypes may have an influence on the performance and cost assessments in particular.

7 Conclusion


7.1. Conclusions

This research set out to evaluate the potential of temporal databases for the application in data analytics. The literature review identified the temporal attributes transaction time and valid time as the main concepts of temporal databases. Transaction time refers to the period at which a certain statement was stored in a database, while valid time expresses the period at which a statement was, is, or will be true in the real world. The SQL:2011 standard suggested the implementation of these temporal attributes, which has been put into practice by a few DBMS vendors. For instance, IBM implemented both temporal attributes, whereas Microsoft only implemented transaction time.

It has been found that data analytics can benefit from the inclusion of temporal data, since this additional information provides a vital source for analysis. Data warehouses are intended to include temporal data, which is commonly implemented manually within the ETL process. This approach has the downside that the temporal dimension is only added at the last step of the data flow within the data warehouse architecture. The consequences are that transactions happening in between two ETL iterations are lost and the temporal attributes are delayed. Also, real-time data processing is not possible due to the delay caused by the ETL process. Temporal databases support the temporal dimension by design, which motivates the evaluation of temporal databases as an alternative to conventional data warehouses.

We compared the concepts of temporal databases and conventional data warehouses by means of two prototypes and five assessment criteria. The results of the assessment suggest that temporal databases are overall significantly more efficient than conventional data warehouses due to a more direct data flow.

However, the data import to the business intelligence tool has been found to be a major bottleneck for temporal databases, which limits the potential to realize real-time data processing. The two concepts do not differ significantly in terms of costs, but the results indicate that an increase in the ETL frequency of data warehouses causes a considerable increase in the total costs. Temporal databases, in contrast, are always up to date, which means that their costs are very stable.

Furthermore, the assessment results highlight the shortcomings of conventional data warehouses in terms of data integrity. Conventional data warehouses do not prevent data loss and do not record the temporal attributes accurately, which is a significant drawback for the purposes of data analytics. Temporal databases solve these issues by automatically maintaining the temporal attributes at the execution of a data transaction. Moreover, the results of the assessment suggest that temporal databases are significantly easier to maintain, as conventional data warehouses require more manual coding. Also, the data transformation code for the data warehouse is stored within the ETL tool, whereas the code for the temporal database is stored within the database itself. This may indicate that changes made to the temporal database are easier to test. Furthermore, it has been found that the data import time of the business intelligence tool is a potential hurdle for the acceptance of temporal databases embedded in a data warehouse architecture. Other than that, temporal databases provide the end user with more reliable temporal data, which may foster their acceptance as an alternative to conventional data warehouses.

In conclusion, we attribute to temporal databases a high potential for the use in data analytics. The use of temporal databases as an alternative to traditional data warehouses has been shown to have significant advantages. Consequently, organizations following a data-driven strategy should investigate the use of temporal databases. Broad adoption of temporal databases may have a positive influence on innovation within the field of data analytics, such as the native support of point-in-time analysis in business intelligence tools.
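To make the notion of point-in-time analysis concrete, the following minimal T-SQL sketch shows how a system-versioned table can be queried for its state at an arbitrary moment; the table dbo.dimproduct, its period columns, and the chosen timestamp are hypothetical and not taken from the prototypes.

-- Minimal sketch (hypothetical names): return the contents of a
-- system-versioned table as they were at a given point in time.
-- SQL Server combines the current and the history table automatically.
SELECT productkey,
       productname,
       listprice
FROM   dbo.dimproduct
FOR    SYSTEM_TIME AS OF '2019-06-01T00:00:00';

-- The full change history of a single row can be inspected as well:
SELECT productkey,
       listprice,
       sysstarttime,
       sysendtime
FROM   dbo.dimproduct
FOR    SYSTEM_TIME ALL
WHERE  productkey = 310
ORDER  BY sysstarttime;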

7.2. Future work

In terms of directions for future research, further work on the use of temporal data within data analytics is needed to better understand the role that temporal data plays in data-driven decision making. Also, further research is required on the measurement of the maintainability of SQL code. Moreover, it would be interesting to repeat the assessment using other database management systems in the prototypes in order to validate the generalizability of the results. In doing so, it is suggested to use high-performance hardware for the databases to gain insight into the feasibility of real-time data processing using temporal databases. Furthermore, the use of a database management system supporting the valid time attribute could give a better understanding of the practical benefit of valid time for data analytics. Finally, it would be interesting to assess whether the concept of temporal databases could be adapted in order to be applied to unstructured data.
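As an indication of what such an investigation of valid time could build on, the sketch below shows an application-time (valid time) table in the style of the SQL:2011 syntax described by Kulkarni and Michels [5]; the table and column names are hypothetical, and the exact syntax differs per vendor (IBM Db2, for example, uses the keyword BUSINESS_TIME).

-- Minimal sketch of an application-time (valid time) table following the
-- SQL:2011 period syntax [5]; names are hypothetical, vendor syntax varies.
CREATE TABLE employeedepartment
  (
     employeeid   INTEGER NOT NULL,
     departmentid INTEGER NOT NULL,
     validstart   DATE NOT NULL,
     validend     DATE NOT NULL,
     PERIOD FOR validperiod (validstart, validend)
  );

-- Rows can then be selected with the period predicates of SQL:2011:
SELECT employeeid,
       departmentid
FROM   employeedepartment
WHERE  validperiod CONTAINS DATE '2019-01-01';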

Glossary

Temporal Data: Data which may change over the course of time, provided that the aspect of time is significant for the purpose of the data [10].
Time instant: A specific point in time on the time line [10].
Event: A non-recurring occurrence which happens at a certain time instant [10, 18].
Time period: The time between two instants, which can be seen as a sequence of contiguous chronons [10, 4].
Time interval: A certain duration of time with a fixed length and unspecified begin and end time [14, 10].
Chronon: A time interval of a minimal atomic duration [4].
Timestamp: The time instant or interval associated with an event or object [4].
Transaction: The record associated with a certain event [18].
Temporal attribute: A timestamp with certain semantics defined by the user or the database management system.
Valid time: The time period at which a certain statement was, is, or will be true in reality according to current beliefs. Valid times can be updated if current beliefs about the truthfulness of a statement have changed [9, 4, 14].
Transaction time: The time period at which a certain statement was stored in a database. Transaction times refer to the history of a database and therefore cannot be updated [10, 4, 14].
User-defined time: A timestamp defined by the user which is not interpreted by the database management system, unlike valid time and transaction time [10].
Temporal Database: A database containing time-varying data and offering built-in support for modelling the temporal dimension of the data [10].
Snapshot: A backup of a database capturing its contents at a certain point in time [10].
Data Analytics: The use of IT applications to support decision-making by analyzing large data sets [26].

Descriptive Analytics: Data analytics aiming at reasoning why something has happened by summarizing raw data into a format that is appealing to business users [8].
Predictive Analytics: Data analytics used to evaluate the future and to forecast trends by means of prediction models and scoring [8].
Measure: Usually numeric indicators for the performance of an organization [11].
Dimension: A perspective from which measures can be looked at in a data warehouse [11].
ETL: The classical process of extracting significant data from different sources, transforming the data into a format that is suitable for analysis, and loading the transformed data into the data warehouse.
Period definition: A definition in SQL:2011 identifying a pair of date or timestamp columns as a period with a start time and an end time [5].
System-time period: The SQL:2011 equivalent of transaction time [5].
Application-time period: The SQL:2011 equivalent of valid time [5].
Current system row: A row with a system-time period containing the current time [5].
Historical system row: A row with a system-time period not containing the current time [5].
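To connect these SQL:2011 terms to the implementation used in the prototypes, the following minimal T-SQL sketch (with hypothetical table and column names) shows a system-versioned table in Microsoft SQL Server: the PERIOD FOR SYSTEM_TIME clause is the period definition, the two generated columns delimit the system-time period, current system rows remain in the main table, and historical system rows are moved to the history table automatically.

-- Minimal sketch (hypothetical names) of a system-versioned table.
CREATE TABLE dbo.customer
  (
     customerid   INT NOT NULL PRIMARY KEY CLUSTERED,
     customername NVARCHAR(100) NOT NULL,
     sysstarttime DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
     sysendtime   DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
     PERIOD FOR SYSTEM_TIME (sysstarttime, sysendtime)
  )
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.customerhistory));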

References

Academic sources

[1] Harrine Freeman. "Software testing". In: IEEE Instrumentation & Measurement Magazine 5.3 (2002), pp. 48–50.
[2] Amir Gandomi and Murtaza Haider. "Beyond the hype: Big data concepts, methods, and analytics". In: International Journal of Information Management 35.2 (2015), pp. 137–144. issn: 0268-4012. doi: 10.1016/j.ijinfomgt.2014.10.007. url: http://www.sciencedirect.com/science/article/pii/S0268401214001066.
[3] Matteo Golfarelli and Stefano Rizzi. "Data warehouse testing". In: International Journal of Data Warehousing and Mining (IJDWM) 7.2 (2011), pp. 26–43.
[4] Christian S Jensen et al. "The consensus glossary of temporal database concepts - February 1998 version". In: Temporal Databases: Research and Practice. Springer, 1998, pp. 367–405.
[5] Krishna Kulkarni and Jan-Eike Michels. "Temporal features in SQL:2011". In: ACM SIGMOD Record 41.3 (2012), pp. 34–43.
[6] Hua-Yang Lin, Ping-Yu Hsu, and Gwo-Ji Sheen. "A fuzzy-based decision-making procedure for data warehouse system selection". In: Expert Systems with Applications 32.3 (2007), pp. 939–953.
[7] Dušan Petkovic. "Temporal data in relational database systems: a comparison". In: New Advances in Information Systems and Technologies. Springer, 2016, pp. 13–23.
[8] Saumyadipta Pyne, BLS Prakasa Rao, and Siddani Bhaskara Rao. Big data analytics: Methods and applications. Springer, 2016.
[9] Richard T Snodgrass. "Temporal databases". In: Theories and Methods of Spatio-Temporal Reasoning in Geographic Space. Springer, 1992, pp. 22–64.
[10] Richard Thomas Snodgrass. "Temporal databases". In: IEEE Computer. 1986.

[11] Alejandro Vaisman and Esteban Zimányi. "Data warehouse systems". In: Data-Centric Systems and Applications (2014).
[12] Elaine J Weyuker. "Testing component-based software: A cautionary tale". In: IEEE Software 15.5 (1998), pp. 54–59.
[13] Joost F Wolfswinkel, Elfi Furtmueller, and Celeste P M Wilderom. "Using grounded theory as a method for rigorously reviewing literature". In: European Journal of Information Systems 22.1 (2013), pp. 45–55. issn: 1476-9344. doi: 10.1057/ejis.2011.51. url: https://doi.org/10.1057/ejis.2011.51.

Non-academic sources

[14] Christopher John Date, Hugh Darwen, and Nikos Lorentzos. Time and relational theory: temporal databases in the relational model and SQL. Morgan Kaufmann, 2014.
[15] Gudu Software. Instant SQL Formatter. http://www.dpriver.com/pp/sqlformat.htm (accessed June 11, 2019).
[16] William H Inmon. Building the data warehouse. John Wiley & Sons, 2005.
[17] Christian S Jensen and Richard T Snodgrass. TimeCenter. http://timecenter.cs.aau.dk/ (accessed April 11, 2019).
[18] Tom Johnston. Bitemporal data: theory and practice. Newnes, 2014.
[19] Ralph Kimball and Margy Ross. The data warehouse toolkit: the complete guide to dimensional modeling. John Wiley & Sons, 2011.
[20] Ralph Kimball et al. Kimball Group. https://www.kimballgroup.com/ (accessed April 12, 2019).
[21] John C McCallum. Disk Drive Prices. https://jcmit.net/diskprice.htm (accessed May 16, 2019).
[22] Microsoft. AdventureWorks Installation and configuration. https://docs.microsoft.com/en-us/sql/samples/adventureworks-install-configure (accessed June 12, 2019).
[23] Microsoft. Power BI pricing. https://powerbi.microsoft.com/en-us/pricing/ (accessed June 17, 2019).
[24] Microsoft. Pricing calculator Microsoft Azure. https://azure.microsoft.com/en-gb/pricing/calculator/ (accessed June 17, 2019).
[25] OCLC. RUQuest. https://ru.on.worldcat.org/discovery?lang=en (accessed April 15, 2019).
[26] Thomas A Runkler. Data Analytics. Wiesbaden: Springer, 2012.
[27] Philip Russom. "Big data analytics". In: TDWI best practices report, fourth quarter 19.4 (2011), pp. 1–34.

[28] Dejan Sarka, Milos Radivojevic, and William Durkin. SQL Server 2017 Developer's Guide. Packt Publishing Ltd, 2018.
[29] Software Improvement Group. Better Code Hub configuration manual. https://bettercodehub.com/docs/configuration-manual (accessed June 28, 2019).
[30] Teradata. ANSI Temporal Table Support. https://docs.teradata.com/reader/_LRlKl9_m2VqMOOEPhMinA/47Rv9E15F~JnVitHW3xXzQ (accessed May 28, 2019).
[31] The University of Arizona. Richard T. Snodgrass Publications. https://www2.cs.arizona.edu/~rts/publications.html (accessed April 10, 2019).
[32] World Wide Web Foundation. Interview with Tim Berners-Lee. http://time.com/5549635/tim-berners-lee-interview-web/ (accessed April 9, 2019).


A. Source Code

A.1. Transfer between staging database and data warehouse

The transfer between the staging database and the data warehouse is done using a pipeline of copy activities. In the following, the source data set query and the pre-copy script are given for each table in the data warehouse.

DimAddress

-- Source data set query
SELECT a.addressid,
       a.addressline1,
       a.addressline2,
       a.city,
       sp.NAME AS StateProvince,
       cr.NAME AS CountryRegion,
       a.postalcode,
       a.modifieddate
FROM   person.address a
       LEFT OUTER JOIN person.stateprovince sp
                    ON a.stateprovinceid = sp.stateprovinceid
       LEFT OUTER JOIN person.countryregion cr
                    ON sp.countryregioncode = cr.countryregioncode

-- Pre-copy script
DELETE FROM dbo.dimaddress;
DBCC checkident ('dbo.DimAddress', reseed, 0);

DimCurrency

-- Source data set query
SELECT c.CurrencyCode, c.Name
FROM   Sales.Currency c

-- Pre-copy script
DELETE FROM dbo.dimcurrency;
DBCC checkident ('dbo.DimCurrency', reseed, 0);


DimCustomer

-- Source data set query
SELECT c.customerid    AS [CustomerKey],
       dg.geographykey AS [GeographyKey],
       pp.title        AS [Title],
       pp.firstname    AS [FirstName],
       pp.middlename   AS [MiddleName],
       pp.lastname     AS [LastName],
       pe.emailaddress AS [EmailAddress],
       ppp.phonenumber AS [PhoneNumber],
       pa.addressline1 AS [AddressLine1],
       pa.addressline2 AS [AddressLine2],
       pa.city         AS [City],
       pa.postalcode   AS [PostalCode]
FROM   [Sales].[customer] c
       INNER JOIN person.person pp
               ON pp.businessentityid = c.personid
       INNER JOIN person.emailaddress pe
               ON pe.businessentityid = c.personid
       INNER JOIN person.businessentityaddress pbea
               ON pbea.businessentityid = c.personid
       INNER JOIN person.address pa
               ON pa.addressid = pbea.addressid
       INNER JOIN person.personphone ppp
               ON ppp.businessentityid = c.personid
       INNER JOIN temp.dimgeography dg
               ON dg.city = pa.city
                  AND dg.postalcode = pa.postalcode
ORDER  BY c.customerid;

-- Pre-copy script
DELETE FROM dbo.dimcustomer;
DBCC checkident ('DimCustomer', reseed, 0);

DimDepartmentGroup

-- Source data set query
SELECT DISTINCT humanresources.department.groupname AS DepartmentGroupName
FROM   humanresources.department

-- Pre-copy script
DELETE FROM dbo.dimdepartmentgroup;
DBCC checkident ('DimDepartmentGroup', reseed, 0);

DimEmployee

1 -- Source data set query 2 SELECT e.[businessentityid] AS BusinessEntityID, 3 e.[nationalidnumber] AS 4 [EmployeeNationalIDAlternateKey], 5 COALESCE(sp.[territoryid], 11) AS [SalesTerritoryKey], 6 co.[firstname] AS [FirstName], 7 co.[lastname] AS [LastName], 8 co.[middlename] AS [MiddleName], 9 e.[jobtitle] AS [Title], 10 e.[hiredate] AS [HireDate], A.1. Transfer between staging database and data warehouse 63

11 e.[birthdate] AS [BirthDate], 12 e.[loginid] AS [LoginID], 13 em.[emailaddress] AS [EmailAddress], 14 pp.phonenumber AS [Phone], 15 e.[maritalstatus] AS [MaritalStatus], 16 e.[salariedflag] AS [SalariedFlag], 17 e.[gender] AS [Gender], 18 eph.[payfrequency] AS [PayFrequency], 19 eph.[rate] AS [BaseRate], 20 e.[vacationhours] AS [VacationHours], 21 e.[sickleavehours] AS [SickLeaveHours], 22 e.[currentflag] AS [CurrentFlag], 23 d.[name] AS [DepartmentName], 24 COALESCE(edh.[startdate], e.[hiredate]) AS [StartDate], 25 edh.[enddate] AS [EndDate], 26 CASE 27 WHEN edh.[enddate] IS NULL THEN N’Current’ 28 ELSE NULL 29 END AS [Status] 30 FROM [HumanResources].[employee] e 31 INNER JOIN [Person].[person] co 32 ON e.[businessentityid] = co.[businessentityid] 33 INNER JOIN [Person].[personphone] pp 34 ON pp.businessentityid = e.businessentityid 35 INNER JOIN [Person].[emailaddress] em 36 ON e.[businessentityid] = em.businessentityid 37 INNER JOIN [Person].[businessentityaddress] ea 38 ON e.[businessentityid] = ea.[businessentityid] 39 INNER JOIN [Person].[address] a 40 ON ea.[addressid] = a.[addressid] 41 LEFT OUTER JOIN [Sales].[salesperson] sp 42 ON e.[businessentityid] = sp.[businessentityid] 43 LEFT OUTER JOIN [HumanResources].[employeedepartmenthistory] edh 44 ON e.businessentityid = edh.[businessentityid] 45 INNER JOIN [HumanResources].[department] d 46 ON edh.[departmentid] = d.[departmentid] 47 LEFT OUTER JOIN [HumanResources].[employeepayhistory] eph 48 ON e.[businessentityid] = eph.[businessentityid] 49 -- Pre-copy script 50 DELETE FROM dbo.dimemployee; 51 DBCC checkident (’dbo.DimEmployee’, reseed, 0);

DimGeography

-- Source data set query
SELECT DISTINCT a.[city]               AS [City],
                sp.[stateprovincecode] AS [StateProvinceCode],
                sp.[name]              AS [StateProvinceName],
                cr.[countryregioncode] AS [CountryRegionCode],
                cr.[name]              AS [CountryRegionName],
                a.[postalcode]         AS [PostalCode]
FROM   [Person].[address] AS a
       INNER JOIN [Person].[stateprovince] AS sp
               ON a.[stateprovinceid] = sp.[stateprovinceid]
       INNER JOIN [Person].[countryregion] AS cr
               ON sp.[countryregioncode] = cr.[countryregioncode]
ORDER  BY cr.[countryregioncode],
          sp.[stateprovincecode],
          a.[city];

-- Pre-copy script
DELETE FROM dbo.dimgeography;
DBCC checkident ('DimGeography', reseed, 0);

DimProduct

1 -- Source data set query 2 SELECT p.productnumber AS 3 ProductAlternateKey, 4 p.productsubcategoryid AS 5 ProductSubcategoryKey, 6 p.weightunitmeasurecode AS 7 WeightUnitMeasureCode, 8 p.sizeunitmeasurecode AS 9 SizeUnitMeasureCode, 10 p.[name] AS ProductName, 11 pch.standardcost AS StandardCost, 12 p.finishedgoodsflag AS 13 FinishedGoodsFlag, 14 COALESCE(p.color, ’NA’) AS Color, 15 p.safetystocklevel AS 16 SafetyStockLevel, 17 p.reorderpoint AS ReorderPoint, 18 plph.listprice AS ListPrice, 19 p.size AS Size, 20 CONVERT(FLOAT, p.weight) AS Weight, 21 p.daystomanufacture AS 22 DaysToManufacture, 23 p.productline AS ProductLine, 24 p.class AS Class, 25 p.style AS Style, 26 pm.[name] AS ModelName, 27 COALESCE(plph.startdate, pch.startdate, p.sellstartdate) AS StartDate, 28 COALESCE(plph.enddate, pch.enddate, p.sellenddate) AS EndDate, 29 CASE 30 WHEN COALESCE(plph.enddate, pch.enddate, p.sellenddate) IS NULL THEN 31 N’Current’ 32 ELSE NULL 33 END AS Status 34 FROM production.product p 35 LEFT OUTER JOIN production.productmodel pm 36 ON p.productmodelid = pm.productmodelid 37 LEFT OUTER JOIN production.productcosthistory pch 38 ON p.productid = pch.productid 39 LEFT OUTER JOIN production.productlistpricehistory plph 40 ON p.productid = plph.productid 41 AND pch.startdate = plph.startdate 42 AND COALESCE(pch.enddate, ’12-31-2020’) = 43 COALESCE(plph.enddate, ’12-31-2020’) 44 -- Pre-copy script 45 DELETE FROM dbo.dimproduct; A.1. Transfer between staging database and data warehouse 65

46 DBCC checkident (’DimProduct’, reseed, 0);

DimProductCategory

-- Source data set query
SELECT DISTINCT pc.productcategoryid AS ProductCategoryAlternateKey,
                pc.[name]            AS ProductCategoryName
FROM   [Production].[productcategory] pc

-- Pre-copy script
DELETE FROM dbo.dimproductcategory;
DBCC checkident ('DimProductCategory', reseed, 0);

DimProductSubcategory

-- Source data set query
SELECT DISTINCT ps.productsubcategoryid AS ProductSubcategoryKey,
                ps.productsubcategoryid AS ProductSubcategoryAlternateKey,
                ps.[name]               AS ProductSubcategoryName,
                dpc.productcategorykey  AS ProductCategoryKey
FROM   [Production].[productsubcategory] ps
       INNER JOIN [temp].[dimproductcategory] dpc
               ON ps.productcategoryid = dpc.productcategoryalternatekey

-- Pre-copy script
DELETE FROM dbo.dimproductsubcategory;
DBCC checkident ('DimProductSubcategory', reseed, 0);

DimReseller

-- Source data set query
SELECT DISTINCT s.[businessentityid] AS [ResellerKey],
                dg.[geographykey]    AS [GeographyKey],
                s.[name]             AS [ResellerName],
                a.addressline1       AS AddressLine1,
                a.addressline2       AS AddressLine2,
                a.city               AS City,
                a.postalcode         AS PostalCode,
                a.stateprovinceid    AS StateProvinceID
FROM   [Sales].[customer] cu
       INNER JOIN [Sales].[store] s
               ON cu.[storeid] = s.[businessentityid]
       INNER JOIN [Person].[businessentityaddress] bea
               ON cu.[storeid] = bea.[businessentityid]
       INNER JOIN [Person].[address] a
               ON bea.[addressid] = a.[addressid]
       INNER JOIN [Person].[stateprovince] sp
               ON a.[stateprovinceid] = sp.[stateprovinceid]
       INNER JOIN [Person].[countryregion] cr
               ON sp.[countryregioncode] = cr.[countryregioncode]
       INNER JOIN [temp].[dimgeography] dg
               ON a.[city] = dg.[city]
                  AND sp.[stateprovincecode] = dg.[stateprovincecode]
                  AND cr.[countryregioncode] = dg.[countryregioncode]
                  AND a.[postalcode] = dg.[postalcode]
WHERE  bea.[addresstypeid] = 3
ORDER  BY s.[name];

-- Pre-copy script
DELETE FROM dbo.dimreseller;
DBCC checkident ('DimReseller', reseed, 0);

DimSalesReason

-- Source data set query
SELECT DISTINCT sr.[salesreasonid] AS [SalesReasonAlternateKey],
                sr.[name]          AS [SalesReasonName],
                sr.[reasontype]    AS [SalesReasonReasonType]
FROM   [Sales].[salesreason] sr;

-- Pre-copy script
DELETE FROM dbo.dimsalesreason;
DBCC checkident ('DimSalesReason', reseed, 0);

DimSalesTerritory

-- Source data set query
SELECT st.[territoryid] AS [SalesTerritoryAlternateKey],
       st.[name]        AS [SalesTerritoryRegion],
       cr.[name]        AS [SalesTerritoryCountry],
       st.[group]       AS [SalesTerritoryGroup]
FROM   [Sales].[salesterritory] st
       INNER JOIN [Person].[countryregion] cr
               ON st.[countryregioncode] = cr.[countryregioncode]
ORDER  BY st.[name];

-- Pre-copy script
DELETE FROM dbo.dimsalesterritory;
DBCC checkident ('DimSalesTerritory', reseed, 0);

FactInternetSales

1 -- Source data set query 2 SELECT dp.[productkey] AS 3 [ProductKey], 4 soh.[orderdate] AS 5 [OrderDateKey], 6 soh.[duedate] AS 7 [DueDateKey], 8 soh.[shipdate] AS 9 [ShipDateKey], 10 soh.[customerid] AS 11 [CustomerKey], 12 sod.[specialofferid] AS 13 [PromotionKey], 14 COALESCE(dc.[currencykey], (SELECT currencykey 15 FROM [temp].[dimcurrency] 16 WHERE currencyalternatekey = N’USD’)) AS 17 [CurrencyKey], 18 soh.[territoryid] AS 19 [SalesTerritoryKey], 20 soh.[salesordernumber] AS 21 [SalesOrderNumber], 22 soh.[revisionnumber] AS 23 [RevisionNumber], A.1. Transfer between staging database and data warehouse 67

24 sod.[orderqty] AS 25 [OrderQuantity], 26 sod.[unitprice] AS 27 [UnitPrice], 28 sod.[orderqty] * sod.[unitprice] AS 29 [ExtendedAmount], 30 sod.[unitpricediscount] AS 31 [UnitPriceDiscountPct], 32 sod.[orderqty] * sod.[unitprice] * sod.[unitpricediscount] AS 33 [DiscountAmount], 34 pch.[standardcost] AS 35 [ProductStandardCost], 36 sod.[orderqty] * pch.[standardcost] AS 37 [TotalProductCost], 38 sod.[linetotal] AS 39 [SalesAmount], 40 CONVERT(MONEY, sod.[linetotal] * 0.08) AS 41 [TaxAmt], 42 CONVERT(MONEY, sod.[linetotal] * 0.025) AS 43 [Freight], 44 sod.[carriertrackingnumber] AS 45 [CarrierTrackingNumber], 46 soh.[purchaseordernumber] AS 47 [CustomerPONumber] 48 FROM [Sales].[salesorderheader] soh 49 INNER JOIN [Sales].[salesorderdetail] sod 50 ON soh.[salesorderid] = sod.[salesorderid] 51 INNER JOIN [Production].[product] p 52 ON sod.[productid] = p.[productid] 53 INNER JOIN [temp].[dimproduct] dp 54 ON dp.[productalternatekey] = 55 p.[productnumber] COLLATE 56 sql_latin1_general_cp1_ci_as 57 AND [dbo].[Udfminimumdate](soh.[orderdate], soh.[duedate]) 58 BETWEEN 59 dp.[startdate] AND COALESCE(dp.[enddate], ’12-31-9999’) 60 INNER JOIN [Sales].[customer] c 61 ON soh.[customerid] = c.[customerid] 62 LEFT OUTER JOIN [Production].[productcosthistory] pch 63 ON p.[productid] = pch.[productid] 64 AND [dbo].[Udfminimumdate](soh.[orderdate], 65 soh.[duedate]) 66 BETWEEN 67 pch.[startdate] AND COALESCE(pch.[enddate], 68 ’12-31-9999’) 69 LEFT OUTER JOIN [Sales].[currencyrate] cr 70 ON soh.[currencyrateid] = cr.[currencyrateid] 71 LEFT OUTER JOIN [temp].[dimcurrency] dc 72 ON cr.[tocurrencycode] = dc.[currencyalternatekey] COLLATE 73 sql_latin1_general_cp1_ci_as 74 LEFT OUTER JOIN [HumanResources].[employee] e 75 ON soh.[salespersonid] = e.[businessentityid] 76 LEFT OUTER JOIN [temp].[dimemployee] de 77 ON e.[nationalidnumber] = de.[employeenationalid] COLLATE 68 Appendix A. Source Code

78 sql_latin1_general_cp1_ci_as 79 WHERE soh.onlineorderflag = 1 80 ORDER BY [orderdatekey], 81 [customerkey]; 82 -- Pre-copy script 83 DELETE FROM dbo.factinternetsales;

FactResellerSales

1 -- Source data set query 2 SELECT dp.[productkey] AS 3 [ProductKey], 4 soh.[orderdate] AS 5 [OrderDate], 6 soh.[duedate] AS 7 [DueDate], 8 soh.[shipdate] AS 9 [ShipDate], 10 soh.[customerid] AS 11 [ResellerKey], 12 de.[employeekey] AS 13 [EmployeeKey], 14 sod.[specialofferid] AS 15 [PromotionKey], 16 COALESCE(dc.[currencykey], (SELECT currencykey 17 FROM [temp].[dimcurrency] 18 WHERE currencyalternatekey = N’USD’)) AS 19 [CurrencyKey], 20 soh.[territoryid] AS 21 [SalesTerritoryKey], 22 soh.[salesordernumber] AS 23 [SalesOrderNumber], 24 soh.[revisionnumber] AS 25 [RevisionNumber], 26 sod.[orderqty] AS 27 [OrderQuantity], 28 sod.[unitprice] AS 29 [UnitPrice], 30 sod.[orderqty] * sod.[unitprice] AS 31 [ExtendedAmount], 32 sod.[unitpricediscount] AS 33 [UnitPriceDiscountPct], 34 sod.[orderqty] * sod.[unitprice] * sod.[unitpricediscount] AS 35 [DiscountAmount], 36 pch.[standardcost] AS 37 [ProductStandardCost], 38 sod.[orderqty] * pch.[standardcost] AS 39 [TotalProductCost], 40 sod.[linetotal] AS 41 [SalesAmount], 42 CONVERT(MONEY, sod.[linetotal] * 0.08) AS 43 [TaxAmt], 44 CONVERT(MONEY, sod.[linetotal] * 0.025) AS 45 [Freight], 46 sod.[carriertrackingnumber] AS A.1. Transfer between staging database and data warehouse 69

47 [CarrierTrackingNumber], 48 soh.[purchaseordernumber] AS 49 [CustomerPONumber] 50 FROM [Sales].[salesorderheader] soh 51 INNER JOIN [Sales].[salesorderdetail] sod 52 ON soh.[salesorderid] = sod.[salesorderid] 53 INNER JOIN [Production].[product] p 54 ON sod.[productid] = p.[productid] 55 INNER JOIN [temp].[dimproduct] dp 56 ON dp.[productalternatekey] = 57 p.[productnumber] COLLATE 58 sql_latin1_general_cp1_ci_as 59 AND [dbo].[Udfminimumdate](soh.[orderdate], soh.[duedate]) 60 BETWEEN 61 dp.[startdate] AND COALESCE(dp.[enddate], ’12-31-9999’) 62 INNER JOIN [Sales].[customer] c 63 ON soh.[customerid] = c.[customerid] 64 LEFT OUTER JOIN [Production].[productcosthistory] pch 65 ON p.[productid] = pch.[productid] 66 AND [dbo].[Udfminimumdate](soh.[orderdate], 67 soh.[duedate]) 68 BETWEEN 69 pch.[startdate] AND COALESCE(pch.[enddate], 70 ’12-31-9999’) 71 LEFT OUTER JOIN [Sales].[currencyrate] cr 72 ON soh.[currencyrateid] = cr.[currencyrateid] 73 LEFT OUTER JOIN [temp].[dimcurrency] dc 74 ON cr.[tocurrencycode] = dc.[currencyalternatekey] COLLATE 75 sql_latin1_general_cp1_ci_as 76 LEFT OUTER JOIN [HumanResources].[employee] e 77 ON soh.[salespersonid] = e.businessentityid 78 LEFT OUTER JOIN [temp].[dimemployee] de 79 ON e.[nationalidnumber] = de.employeenationalid COLLATE 80 sql_latin1_general_cp1_ci_as 81 WHERE soh.onlineorderflag = 0 82 ORDER BY [orderdate], 83 [resellerkey]; 84 -- Pre-copy script 85 DELETE FROM dbo.factresellersales;

FactSalesQuota

-- Source data set query
SELECT DISTINCT spqh.businessentityid AS [EmployeeKey],
                spqh.[quotadate]      AS [Quotadate],
                spqh.[salesquota]     AS [SalesAmountQuota]
FROM   [Sales].[salespersonquotahistory] spqh

-- Pre-copy script
DELETE FROM dbo.factsalesquota;
DBCC checkident ('DimReseller', reseed, 0);

A.2. Data warehouse transformation

DimAddress

1 -- Source data set query 2 SELECT s.*, 3 Getdate() AS etldate, 4 ’9999-12-31 23:59:59’ AS enddate 5 FROM dbo.dimaddress s 6 LEFT OUTER JOIN target.dimaddress t 7 ON s.addressalternatekey = t.addressalternatekey 8 WHERE t.addressalternatekey IS NULL 9 -- Pre-copy script 10 -- DELETE 11 UPDATE target.dimaddress 12 SET sysenddate = Getdate() 13 WHERE addressalternatekey IN (SELECT t2.addressalternatekey 14 FROM target.dimaddress t2 15 LEFT OUTER JOIN dbo.dimaddress s 16 ON s.addressalternatekey = 17 t2.addressalternatekey 18 WHERE s.addressalternatekey IS NULL) 19 AND sysenddate = ’9999-12-31 23:59:59’ 20 21 -- UPDATE 22 UPDATE t 23 SET SysEndDate = Getdate() 24 FROM target.dimaddress t, 25 dbo.dimaddress s 26 WHERE t.addresskey IN (SELECT s.addresskey 27 FROM (SELECT s.addresskey, 28 Hashbytes(’SHA2_512’, 29 Concat_ws(’,’, s.addressalternatekey, 30 s.addressline1, 31 s.addressline2, s.city, 32 s.countryregion, 33 s.modifieddate, 34 s.postalcode, s.stateprovince)) AS hash 35 FROM dimaddress s) s 36 JOIN (SELECT t.addresskey, 37 Hashbytes(’SHA2_512’, 38 Concat_ws(’,’, 39 t.addressalternatekey, 40 t.addressline1, 41 t.addressline2, t.city, 42 t.countryregion, 43 t.modifieddate, 44 t.postalcode, t.stateprovince)) AS hash, 45 t.sysenddate 46 FROM target.dimaddress t) t 47 ON s.addresskey = t.addresskey 48 WHERE s.hash != t.hash 49 AND t.sysenddate = ’9999-12-31 23:59:59.000’) 50 AND s.addresskey = t.addresskey A.2. Data warehouse transformation 71

51 AND t.sysenddate = ’9999-12-31 23:59:59.000’ 52 53 INSERT INTO target.dimaddress 54 SELECT s.addresskey, 55 s.addressalternatekey, 56 s.addressline1, 57 s.addressline2, 58 s.city, 59 s.stateprovince, 60 s.countryregion, 61 s.postalcode, 62 s.modifieddate, 63 t.sysenddate, 64 ’9999-12-31 23:59:59’ 65 FROM dbo.dimaddress s, 66 (SELECT *, 67 Row_number() 68 OVER( 69 partition BY addresskey 70 ORDER BY sysenddate DESC) AS rn 71 FROM target.dimaddress) AS t 72 WHERE t.addresskey IN (SELECT s.addresskey 73 FROM (SELECT s.addresskey, 74 Hashbytes(’SHA2_512’, 75 Concat_ws(’,’, s.addressalternatekey, 76 s.addressline1, 77 s.addressline2, s.city, 78 s.countryregion, 79 s.modifieddate, 80 s.postalcode, s.stateprovince)) AS hash 81 FROM dimaddress s) s 82 JOIN (SELECT t.addresskey, 83 Hashbytes(’SHA2_512’, 84 Concat_ws(’,’, 85 t.addressalternatekey, 86 t.addressline1, 87 t.addressline2, t.city, 88 t.countryregion, 89 t.modifieddate, 90 t.postalcode, t.stateprovince)) AS hash 91 FROM (SELECT *, 92 Row_number() 93 OVER( 94 partition BY addresskey 95 ORDER BY sysenddate DESC) AS rn 96 FROM target.dimaddress) AS t 97 WHERE rn = 1) t 98 ON s.addresskey = t.addresskey 99 WHERE s.hash != t.hash) 100 AND s.addresskey = t.addresskey 101 AND rn = 1;

DimCurrency

1 -- Source data set query

2 SELECT s.*, 3 Getdate() AS etldate, 4 ’9999-12-31 23:59:59’ AS enddate 5 FROM dbo.dimcurrency s 6 LEFT OUTER JOIN target.dimcurrency t 7 ON s.currencyalternatekey = t.currencyalternatekey 8 WHERE t.currencyalternatekey IS NULL 9 -- Pre-copy script 10 -- DELETE 11 UPDATE target.dimcurrency 12 SET sysenddate = Getdate() 13 WHERE currencyalternatekey IN (SELECT t2.currencyalternatekey 14 FROM target.dimcurrency t2 15 LEFT OUTER JOIN dbo.dimcurrency s 16 ON s.currencyalternatekey = 17 t2.currencyalternatekey 18 WHERE s.currencyalternatekey IS NULL) 19 AND sysenddate = ’9999-12-31 23:59:59’ 20 21 -- UPDATE 22 UPDATE t 23 SET SysEndDate = Getdate() 24 FROM target.dimcurrency t, 25 dbo.dimcurrency s 26 WHERE t.currencykey IN (SELECT s.currencykey 27 FROM (SELECT s.currencykey, 28 Hashbytes(’SHA2_512’, 29 Concat_ws(’,’, s.currencyalternatekey, 30 s.currencyname)) AS 31 hash 32 FROM dbo.dimcurrency s) s 33 JOIN (SELECT t.currencykey, 34 Hashbytes(’SHA2_512’, 35 Concat_ws(’,’, 36 t.currencyalternatekey, 37 t.currencyname)) AS 38 hash, 39 t.sysenddate 40 FROM target.dimcurrency t) t 41 ON s.currencykey = t.currencykey 42 WHERE s.hash != t.hash 43 AND t.sysenddate = ’9999-12-31 23:59:59.000’) 44 AND s.currencykey = t.currencykey 45 AND t.sysenddate = ’9999-12-31 23:59:59.000’ 46 47 INSERT INTO target.dimcurrency 48 SELECT s.currencykey, 49 s.currencyalternatekey, 50 s.currencyname, 51 t.sysenddate, 52 ’9999-12-31 23:59:59’ 53 FROM dbo.dimcurrency s, 54 (SELECT *, 55 Row_number() A.2. Data warehouse transformation 73

56 OVER( 57 partition BY currencykey 58 ORDER BY sysenddate DESC) AS rn 59 FROM target.dimcurrency) AS t 60 WHERE t.currencykey IN (SELECT s.currencykey 61 FROM (SELECT s.currencykey, 62 Hashbytes(’SHA2_512’, 63 Concat_ws(’,’, s.currencyalternatekey, 64 s.currencyname)) AS 65 hash 66 FROM dbo.dimcurrency s) s 67 JOIN (SELECT t.currencykey, 68 Hashbytes(’SHA2_512’, 69 Concat_ws(’,’, 70 t.currencyalternatekey, 71 t.currencyname)) AS 72 hash 73 FROM (SELECT *, 74 Row_number() 75 OVER( 76 partition BY 77 currencykey 78 ORDER BY sysenddate 79 DESC) AS 80 rn 81 FROM target.dimcurrency) AS t 82 WHERE rn = 1) t 83 ON s.currencykey = t.currencykey 84 WHERE s.hash != t.hash) 85 AND s.currencykey = t.currencykey 86 AND rn = 1;

DimCustomer

1 -- Source data set query 2 SELECT s.*, 3 Getdate() AS etldate, 4 ’9999-12-31 23:59:59’ AS enddate 5 FROM dbo.dimcustomer s 6 LEFT OUTER JOIN target.dimcustomer t 7 ON s.customeralternatekey = t.customeralternatekey 8 WHERE t.customeralternatekey IS NULL; 9 -- Pre-copy script 10 UPDATE target.dimcustomer 11 SET sysenddate = Getdate() 12 WHERE customeralternatekey IN (SELECT t2.customeralternatekey 13 FROM target.dimcustomer t2 14 LEFT OUTER JOIN dbo.dimcustomer s 15 ON s.customeralternatekey = 16 t2.customeralternatekey 17 WHERE s.customeralternatekey IS NULL) 18 AND sysenddate = ’9999-12-31 23:59:59’ 19 20 UPDATE t 21 SET SysEndDate = Getdate() 74 Appendix A. Source Code

22 FROM target.dimcustomer t, 23 dbo.dimcustomer s 24 WHERE t.customerkey IN (SELECT s.customerkey 25 FROM (SELECT s.customerkey, 26 Hashbytes(’SHA2_512’, 27 Concat_ws(s.customeralternatekey, 28 s.geographykey, 29 s.title, 30 s.firstname, 31 s.middlename, 32 s.lastname, 33 s.emailaddress, s.phonenumber, 34 s.addressline1, 35 s.addressline2, 36 s.city, s.postalcode)) AS hash 37 FROM dbo.dimcustomer s) s 38 JOIN (SELECT t.customerkey, 39 Hashbytes(’SHA2_512’, 40 Concat_ws(t.customeralternatekey, 41 t.geographykey, 42 t.title, 43 t.firstname, 44 t.middlename, 45 t.lastname, 46 t.emailaddress, t.phonenumber, 47 t.addressline1, 48 t.addressline2, 49 t.city, t.postalcode)) AS hash, 50 t.sysenddate 51 FROM target.dimcustomer t) t 52 ON s.customerkey = t.customerkey 53 WHERE s.hash != t.hash 54 AND t.sysenddate = ’9999-12-31 23:59:59.000’) 55 AND s.customerkey = t.customerkey 56 AND t.sysenddate = ’9999-12-31 23:59:59.000’ 57 58 INSERT INTO target.dimcustomer 59 SELECT s.customerkey, 60 s.customeralternatekey, 61 s.geographykey, 62 s.title, 63 s.firstname, 64 s.middlename, 65 s.lastname, 66 s.emailaddress, 67 s.phonenumber, 68 s.addressline1, 69 s.addressline2, 70 s.city, 71 s.postalcode, 72 t.sysenddate, 73 ’9999-12-31 23:59:59’ 74 FROM dbo.dimcustomer s, 75 (SELECT *, A.2. Data warehouse transformation 75

76 Row_number() 77 OVER( 78 partition BY customerkey 79 ORDER BY sysenddate DESC) AS rn 80 FROM target.dimcustomer) AS t 81 WHERE t.customerkey IN (SELECT s.customerkey 82 FROM (SELECT s.customerkey, 83 Hashbytes(’SHA2_512’, 84 Concat_ws(s.customeralternatekey, 85 s.geographykey, 86 s.title, 87 s.firstname, 88 s.middlename, 89 s.lastname, 90 s.emailaddress, s.phonenumber, 91 s.addressline1, 92 s.addressline2, 93 s.city, s.postalcode)) AS hash 94 FROM dbo.dimcustomer s) s 95 JOIN (SELECT t.customerkey, 96 Hashbytes(’SHA2_512’, 97 Concat_ws(t.customeralternatekey, 98 t.geographykey, 99 t.title, 100 t.firstname, 101 t.middlename, 102 t.lastname, 103 t.emailaddress, t.phonenumber, 104 t.addressline1, 105 t.addressline2, 106 t.city, t.postalcode)) AS hash 107 FROM (SELECT *, 108 Row_number() 109 OVER( 110 partition BY customerkey 111 ORDER BY sysenddate DESC) AS 112 rn 113 FROM target.dimcustomer) AS t 114 WHERE rn = 1) t 115 ON s.customerkey = t.customerkey 116 WHERE s.hash != t.hash) 117 AND s.customerkey = t.customerkey 118 AND rn = 1;

DimDepartmentGroup

1 -- Source data set query 2 SELECT s.*, 3 Getdate() AS etldate, 4 ’9999-12-31 23:59:59’ AS enddate 5 FROM dbo.dimdepartmentgroup s 6 LEFT OUTER JOIN target.dimdepartmentgroup t 7 ON s.departmentgroupkey = t.departmentgroupkey 8 WHERE t.departmentgroupkey IS NULL; 9 -- Pre-copy script 76 Appendix A. Source Code

10 UPDATE target.dimdepartmentgroup 11 SET sysenddate = Getdate() 12 WHERE departmentgroupkey IN (SELECT t2.departmentgroupkey 13 FROM target.dimdepartmentgroup t2 14 LEFT OUTER JOIN dbo.dimdepartmentgroup s 15 ON s.departmentgroupkey = 16 t2.departmentgroupkey 17 WHERE s.departmentgroupkey IS NULL) 18 AND sysenddate = ’9999-12-31 23:59:59’ 19 20 UPDATE t 21 SET SysEndDate = Getdate() 22 FROM target.dimdepartmentgroup t, 23 dbo.dimdepartmentgroup s 24 WHERE t.departmentgroupkey = s.departmentgroupkey 25 AND t.departmentgroupname != s.departmentgroupname 26 AND t.sysenddate = ’9999-12-31 23:59:59.000’; 27 28 INSERT INTO target.dimdepartmentgroup 29 SELECT s.departmentgroupkey, 30 s.departmentgroupname, 31 t.sysenddate, 32 ’9999-12-31 23:59:59’ 33 FROM dbo.dimdepartmentgroup s, 34 (SELECT *, 35 Row_number() 36 OVER( 37 partition BY departmentgroupkey 38 ORDER BY sysenddate DESC) AS rn 39 FROM target.dimdepartmentgroup) AS t 40 WHERE t.departmentgroupkey = s.departmentgroupkey 41 AND t.departmentgroupname != s.departmentgroupname 42 AND rn = 1;

DimEmployee

1 -- Source data set query 2 SELECT s.*, 3 Getdate() AS etldate, 4 ’9999-12-31 23:59:59’ AS sysenddate 5 FROM dbo.dimemployee s 6 LEFT OUTER JOIN target.dimemployee t 7 ON s.employeekey = t.employeekey 8 WHERE t.employeekey IS NULL; 9 -- Pre-copy script 10 UPDATE target.dimemployee 11 SET sysenddate = Getdate() 12 WHERE employeekey IN (SELECT t2.employeekey 13 FROM target.dimemployee t2 14 LEFT OUTER JOIN dbo.dimemployee s 15 ON s.employeekey = t2.employeekey 16 WHERE s.employeekey IS NULL) 17 AND sysenddate = ’9999-12-31 23:59:59’ 18 19 UPDATE t A.2. Data warehouse transformation 77

20 SET SysEndDate = Getdate() 21 FROM target.dimemployee t, 22 dbo.dimemployee s 23 WHERE t.employeekey IN (SELECT s.employeekey 24 FROM (SELECT s.employeekey, 25 Hashbytes(’SHA2_512’, 26 Concat_ws(’,’, s.businessentityid, 27 s.employeenationalid, 28 s.salesterritorykey, 29 s.firstname, 30 s.lastname, 31 s.middlename, s.title, s.hiredate, 32 s.birthdate, 33 s.loginid, 34 s.emailaddress, s.phone, 35 s.maritalstatus, 36 s.salariedflag, 37 s.gender, 38 s.payfrequency, s.baserate, 39 s.vacationhours, 40 s.sickleavehours, 41 s.currentflag, 42 s.departmentname, 43 s.startdate, 44 s.enddate, 45 s.status)) 46 AS hash 47 FROM dbo.dimemployee s) s 48 JOIN (SELECT t.employeekey, 49 Hashbytes(’SHA2_512’, 50 Concat_ws(’,’, t.businessentityid, 51 t.employeenationalid, 52 t.salesterritorykey, 53 t.firstname, 54 t.lastname, 55 t.middlename, t.title, t.hiredate, 56 t.birthdate, 57 t.loginid, 58 t.emailaddress, t.phone, 59 t.maritalstatus, 60 t.salariedflag, 61 t.gender, 62 t.payfrequency, t.baserate, 63 t.vacationhours, 64 t.sickleavehours, 65 t.currentflag, 66 t.departmentname, 67 t.startdate, t.enddate, 68 t.status)) 69 AS hash, 70 t.sysenddate 71 FROM target.dimemployee t) t 72 ON s.employeekey = t.employeekey 73 WHERE s.hash != t.hash 78 Appendix A. Source Code

74 AND t.sysenddate = ’9999-12-31 23:59:59.000’) 75 AND s.employeekey = t.employeekey 76 AND t.sysenddate = ’9999-12-31 23:59:59.000’ 77 78 INSERT INTO target.dimemployee 79 SELECT s.employeekey, 80 s.businessentityid, 81 s.employeenationalid, 82 s.salesterritorykey, 83 s.firstname, 84 s.lastname, 85 s.middlename, 86 s.title, 87 s.hiredate, 88 s.birthdate, 89 s.loginid, 90 s.emailaddress, 91 s.phone, 92 s.maritalstatus, 93 s.salariedflag, 94 s.gender, 95 s.payfrequency, 96 s.baserate, 97 s.vacationhours, 98 s.sickleavehours, 99 s.currentflag, 100 s.departmentname, 101 s.startdate, 102 s.enddate, 103 s.status, 104 t.sysenddate, 105 ’9999-12-31 23:59:59’ 106 FROM dbo.dimemployee s, 107 (SELECT *, 108 Row_number() 109 OVER( 110 partition BY employeekey 111 ORDER BY sysenddate DESC) AS rn 112 FROM target.dimemployee) AS t 113 WHERE t.employeekey IN (SELECT s.employeekey 114 FROM (SELECT s.employeekey, 115 Hashbytes(’SHA2_512’, 116 Concat_ws(’,’, s.businessentityid, 117 s.employeenationalid, 118 s.salesterritorykey, 119 s.firstname, 120 s.lastname, 121 s.middlename, s.title, s.hiredate, 122 s.birthdate, 123 s.loginid, 124 s.emailaddress, s.phone, 125 s.maritalstatus, 126 s.salariedflag, 127 s.gender, A.2. Data warehouse transformation 79

128 s.payfrequency, s.baserate, 129 s.vacationhours, 130 s.sickleavehours, 131 s.currentflag, 132 s.departmentname, 133 s.startdate, 134 s.enddate, 135 s.status)) 136 AS hash 137 FROM dbo.dimemployee s) s 138 JOIN (SELECT t.employeekey, 139 Hashbytes(’SHA2_512’, 140 Concat_ws(’,’, t.businessentityid, 141 t.employeenationalid, 142 t.salesterritorykey, 143 t.firstname, 144 t.lastname, 145 t.middlename, t.title, t.hiredate, 146 t.birthdate, 147 t.loginid, 148 t.emailaddress, t.phone, 149 t.maritalstatus, 150 t.salariedflag, 151 t.gender, 152 t.payfrequency, t.baserate, 153 t.vacationhours, 154 t.sickleavehours, 155 t.currentflag, 156 t.departmentname, 157 t.startdate, t.enddate, 158 t.status)) 159 AS hash 160 FROM (SELECT *, 161 Row_number() 162 OVER( 163 partition BY employeekey 164 ORDER BY sysenddate DESC) AS rn 165 FROM target.dimemployee) AS t 166 WHERE rn = 1) t 167 ON s.employeekey = t.employeekey 168 WHERE s.hash != t.hash) 169 AND s.employeekey = t.employeekey 170 AND rn = 1;

DimGeography

1 -- Source data set query 2 SELECT s.*, 3 Getdate() AS etldate, 4 ’9999-12-31 23:59:59’ AS enddate 5 FROM dbo.dimgeography s 6 LEFT OUTER JOIN target.dimgeography t 7 ON s.geographykey = t.geographykey 8 WHERE t.geographykey IS NULL; 9 -- Pre-copy script 80 Appendix A. Source Code

10 UPDATE target.dimgeography 11 SET sysenddate = Getdate() 12 WHERE geographykey IN (SELECT t2.geographykey 13 FROM target.dimgeography t2 14 LEFT OUTER JOIN dbo.dimgeography s 15 ON s.geographykey = t2.geographykey 16 WHERE s.geographykey IS NULL) 17 AND sysenddate = ’9999-12-31 23:59:59’ 18 19 UPDATE t 20 SET SysEndDate = Getdate() 21 FROM target.dimgeography t, 22 dbo.dimgeography s 23 WHERE t.geographykey IN (SELECT s.geographykey 24 FROM (SELECT s.geographykey, 25 Hashbytes(’SHA2_512’, 26 Concat_ws(’,’, s.city, 27 s.stateprovincecode, 28 s.stateprovincename, 29 s.countryregioncode, 30 s.countryregionname, s.postalcode)) AS 31 hash 32 FROM dbo.dimgeography s) s 33 JOIN (SELECT t.geographykey, 34 Hashbytes(’SHA2_512’, Concat_ws(’,’, t.city, 35 t.stateprovincecode, 36 t.stateprovincename, 37 t.countryregioncode, 38 t.countryregionname, t.postalcode)) AS hash, 39 t.sysenddate 40 FROM target.dimgeography t) t 41 ON s.geographykey = t.geographykey 42 WHERE s.hash != t.hash 43 AND t.sysenddate = ’9999-12-31 23:59:59.000’) 44 AND s.geographykey = t.geographykey 45 AND t.sysenddate = ’9999-12-31 23:59:59.000’ 46 47 INSERT INTO target.dimgeography 48 SELECT s.geographykey, 49 s.city, 50 s.stateprovincecode, 51 s.stateprovincename, 52 s.countryregioncode, 53 s.countryregionname, 54 s.postalcode, 55 t.sysenddate, 56 ’9999-12-31 23:59:59’ 57 FROM dbo.dimgeography s, 58 (SELECT *, 59 Row_number() 60 OVER( 61 partition BY geographykey 62 ORDER BY sysenddate DESC) AS rn 63 FROM target.dimgeography) AS t A.2. Data warehouse transformation 81

64 WHERE t.geographykey IN (SELECT s.geographykey 65 FROM (SELECT s.geographykey, 66 Hashbytes(’SHA2_512’, 67 Concat_ws(’,’, s.city, 68 s.stateprovincecode, 69 s.stateprovincename, 70 s.countryregioncode, 71 s.countryregionname, s.postalcode)) AS 72 hash 73 FROM dbo.dimgeography s) s 74 JOIN (SELECT t.geographykey, 75 Hashbytes(’SHA2_512’, Concat_ws(’,’, t.city, 76 t.stateprovincecode, 77 t.stateprovincename, 78 t.countryregioncode, 79 t.countryregionname, t.postalcode)) AS hash 80 FROM (SELECT *, 81 Row_number() 82 OVER( 83 partition BY geographykey 84 ORDER BY sysenddate DESC) AS rn 85 FROM target.dimgeography) AS t 86 WHERE rn = 1) t 87 ON s.geographykey = t.geographykey 88 WHERE s.hash != t.hash) 89 AND s.geographykey = t.geographykey 90 AND rn = 1;

DimProduct

1 -- Source data set query 2 SELECT s.*, 3 Getdate() AS etldate, 4 ’9999-12-31 23:59:59’ AS sysenddate 5 FROM dbo.dimproduct s 6 LEFT OUTER JOIN target.dimproduct t 7 ON s.productkey = t.productkey 8 WHERE t.productkey IS NULL; 9 -- Pre-copy script 10 UPDATE target.dimproduct 11 SET sysenddate = Getdate() 12 WHERE productkey IN (SELECT t2.productkey 13 FROM target.dimproduct t2 14 LEFT OUTER JOIN dbo.dimproduct s 15 ON s.productkey = t2.productkey 16 WHERE s.productkey IS NULL) 17 AND sysenddate = ’9999-12-31 23:59:59’ 18 19 UPDATE t 20 SET SysEndDate = Getdate() 21 FROM target.dimproduct t, 22 dbo.dimproduct s 23 WHERE t.productkey IN (SELECT s.productkey 24 FROM (SELECT s.productkey, 25 Hashbytes(’SHA2_512’, 82 Appendix A. Source Code

26 Concat_ws(’,’, s.productalternatekey, 27 s.productsubcategorykey, 28 s.weightunitmeasurecode, 29 s.sizeunitmeasurecode, s.productname, 30 s.standardcost, 31 s.finishedgoodsflag, 32 s.color, 33 s.safetystocklevel, s.reorderpoint, 34 s.listprice, s.size, 35 s.weight, 36 s.daystomanufacture, s.productline, 37 s.class, s.style, 38 s.modelname, 39 s.startdate, s.enddate, s.status)) AS hash 40 FROM dbo.dimproduct s) s 41 JOIN (SELECT t.productkey, 42 Hashbytes(’SHA2_512’, 43 Concat_ws(’,’, t.productalternatekey, 44 t.productsubcategorykey, 45 t.weightunitmeasurecode, 46 t.sizeunitmeasurecode, t.productname, 47 t.standardcost, 48 t.finishedgoodsflag, 49 t.color, 50 t.safetystocklevel, t.reorderpoint, 51 t.listprice, t.size, 52 t.weight, 53 t.daystomanufacture, t.productline, 54 t.class, t.style, 55 t.modelname, 56 t.startdate, t.enddate, t.status)) AS hash, 57 t.sysenddate 58 FROM target.dimproduct t) t 59 ON s.productkey = t.productkey 60 WHERE s.hash != t.hash 61 AND t.sysenddate = ’9999-12-31 23:59:59.000’) 62 AND s.productkey = t.productkey 63 AND t.sysenddate = ’9999-12-31 23:59:59.000’ 64 65 INSERT INTO target.dimproduct 66 SELECT s.*, 67 t.sysenddate, 68 ’9999-12-31 23:59:59’ 69 FROM dbo.dimproduct s, 70 (SELECT *, 71 Row_number() 72 OVER( 73 partition BY productkey 74 ORDER BY sysenddate DESC) AS rn 75 FROM target.dimproduct) AS t 76 WHERE t.productkey IN (SELECT s.productkey 77 FROM (SELECT s.productkey, 78 Hashbytes(’SHA2_512’, 79 Concat_ws(’,’, s.productalternatekey, A.2. Data warehouse transformation 83

80 s.productsubcategorykey, 81 s.weightunitmeasurecode, 82 s.sizeunitmeasurecode, s.productname, 83 s.standardcost, 84 s.finishedgoodsflag, 85 s.color, 86 s.safetystocklevel, s.reorderpoint, 87 s.listprice, s.size, 88 s.weight, 89 s.daystomanufacture, s.productline, 90 s.class, s.style, 91 s.modelname, 92 s.startdate, s.enddate, s.status)) AS hash 93 FROM dbo.dimproduct s) s 94 JOIN (SELECT t.productkey, 95 Hashbytes(’SHA2_512’, 96 Concat_ws(’,’, t.productalternatekey, 97 t.productsubcategorykey, 98 t.weightunitmeasurecode, 99 t.sizeunitmeasurecode, t.productname, 100 t.standardcost, 101 t.finishedgoodsflag, 102 t.color, 103 t.safetystocklevel, t.reorderpoint, 104 t.listprice, t.size, 105 t.weight, 106 t.daystomanufacture, t.productline, 107 t.class, t.style, 108 t.modelname, 109 t.startdate, t.enddate, t.status)) AS hash 110 FROM (SELECT *, 111 Row_number() 112 OVER( 113 partition BY productkey 114 ORDER BY sysenddate DESC) AS rn 115 FROM target.dimproduct) AS t 116 WHERE rn = 1) t 117 ON s.productkey = t.productkey 118 WHERE s.hash != t.hash) 119 AND s.productkey = t.productkey 120 AND rn = 1;

DimProductCategory

1 -- Source data set query 2 SELECT s.*, 3 Getdate() AS etldate, 4 ’9999-12-31 23:59:59’ AS enddate 5 FROM dbo.dimproductcategory s 6 LEFT OUTER JOIN target.dimproductcategory t 7 ON s.productcategorykey = t.productcategorykey 8 WHERE t.productcategorykey IS NULL; 9 -- Pre-copy script 10 UPDATE target.dimproductcategory 11 SET sysenddate = Getdate() 84 Appendix A. Source Code

12 WHERE productcategorykey IN (SELECT t2.productcategorykey 13 FROM target.dimproductcategory t2 14 LEFT OUTER JOIN dbo.dimproductcategory s 15 ON s.productcategorykey = 16 t2.productcategorykey 17 WHERE s.productcategorykey IS NULL) 18 AND sysenddate = ’9999-12-31 23:59:59’ 19 20 UPDATE t 21 SET SysEndDate = Getdate() 22 FROM target.dimproductcategory t, 23 dbo.dimproductcategory s 24 WHERE t.productcategorykey IN (SELECT s.productcategorykey 25 FROM (SELECT s.productcategorykey, 26 Hashbytes(’SHA2_512’, 27 Concat_ws(’,’, s.productcategoryalternatekey, 28 s.productcategoryname)) AS hash 29 FROM dbo.dimproductcategory s) s 30 JOIN (SELECT t.productcategorykey, 31 Hashbytes(’SHA2_512’, 32 Concat_ws(’,’, 33 t.productcategoryalternatekey, 34 t.productcategoryname)) AS hash, 35 t.sysenddate 36 FROM target.dimproductcategory t) t 37 ON s.productcategorykey = t.productcategorykey 38 WHERE s.hash != t.hash 39 AND t.sysenddate = ’9999-12-31 23:59:59.000’) 40 AND s.productcategorykey = t.productcategorykey 41 AND t.sysenddate = ’9999-12-31 23:59:59.000’ 42 43 INSERT INTO target.dimproductcategory 44 SELECT s.*, 45 t.sysenddate, 46 ’9999-12-31 23:59:59’ 47 FROM dbo.dimproductcategory s, 48 (SELECT *, 49 Row_number() 50 OVER( 51 partition BY productcategorykey 52 ORDER BY sysenddate DESC) AS rn 53 FROM target.dimproductcategory) AS t 54 WHERE t.productcategorykey IN (SELECT s.productcategorykey 55 FROM (SELECT s.productcategorykey, 56 Hashbytes(’SHA2_512’, 57 Concat_ws(’,’, s.productcategoryalternatekey, 58 s.productcategoryname)) AS hash 59 FROM dbo.dimproductcategory s) s 60 JOIN (SELECT t.productcategorykey, 61 Hashbytes(’SHA2_512’, 62 Concat_ws(’,’, 63 t.productcategoryalternatekey, 64 t.productcategoryname)) AS hash 65 FROM (SELECT *, A.2. Data warehouse transformation 85

66 Row_number() 67 OVER( 68 partition BY 69 productcategorykey 70 ORDER BY sysenddate 71 DESC) AS rn 72 FROM target.dimproductcategory) 73 AS t 74 WHERE rn = 1) t 75 ON s.productcategorykey = t.productcategorykey 76 WHERE s.hash != t.hash) 77 AND s.productcategorykey = t.productcategorykey 78 AND rn = 1;

DimProductSubcategory

1 -- Source data set query 2 SELECT s.*, 3 Getdate() AS etldate, 4 ’9999-12-31 23:59:59’ AS enddate 5 FROM dbo.dimproductsubcategory s 6 LEFT OUTER JOIN target.dimproductsubcategory t 7 ON s.productcategorykey = t.productcategorykey 8 WHERE t.productcategorykey IS NULL; 9 -- Pre-copy script 10 UPDATE target.dimproductsubcategory 11 SET sysenddate = Getdate() 12 WHERE productsubcategorykey IN (SELECT t2.productsubcategorykey 13 FROM target.dimproductsubcategory t2 14 LEFT OUTER JOIN dbo.dimproductsubcategory s 15 ON s.productsubcategorykey = 16 t2.productsubcategorykey 17 WHERE s.productsubcategorykey IS NULL) 18 AND sysenddate = ’9999-12-31 23:59:59’ 19 20 UPDATE t 21 SET SysEndDate = Getdate() 22 FROM target.dimproductsubcategory t, 23 dbo.dimproductsubcategory s 24 WHERE t.productsubcategorykey IN (SELECT s.productsubcategorykey 25 FROM (SELECT s.productsubcategorykey, 26 Hashbytes(’SHA2_512’, 27 Concat_ws(’,’, s.productsubcategoryalternatekey, 28 s.productsubcategoryname, 29 s.productcategorykey)) AS 30 hash 31 FROM dbo.dimproductsubcategory s) s 32 JOIN (SELECT t.productsubcategorykey, 33 Hashbytes(’SHA2_512’, 34 Concat_ws(’,’, 35 t.productsubcategoryalternatekey, 36 t.productsubcategoryname, 37 t.productcategorykey)) AS 38 hash, 39 t.sysenddate 86 Appendix A. Source Code

40 FROM target.dimproductsubcategory t) t 41 ON s.productsubcategorykey = t.productsubcategorykey 42 WHERE s.hash != t.hash 43 AND t.sysenddate = ’9999-12-31 23:59:59.000’) 44 AND s.productsubcategorykey = t.productsubcategorykey 45 AND t.sysenddate = ’9999-12-31 23:59:59.000’ 46 47 INSERT INTO target.dimproductsubcategory 48 SELECT s.*, 49 t.sysenddate, 50 ’9999-12-31 23:59:59’ 51 FROM dbo.dimproductsubcategory s, 52 (SELECT *, 53 Row_number() 54 OVER( 55 partition BY productsubcategorykey 56 ORDER BY sysenddate DESC) AS rn 57 FROM target.dimproductsubcategory) AS t 58 WHERE t.productsubcategorykey IN (SELECT s.productsubcategorykey 59 FROM (SELECT s.productsubcategorykey, 60 Hashbytes(’SHA2_512’, 61 Concat_ws(’,’, s.productsubcategoryalternatekey, 62 s.productsubcategoryname, 63 s.productcategorykey)) AS 64 hash 65 FROM dbo.dimproductsubcategory s) s 66 JOIN (SELECT t.productsubcategorykey, 67 Hashbytes(’SHA2_512’, 68 Concat_ws(’,’, 69 t.productsubcategoryalternatekey, 70 t.productsubcategoryname, 71 t.productcategorykey)) AS 72 hash 73 FROM (SELECT *, 74 Row_number() 75 OVER( 76 partition BY productsubcategorykey 77 ORDER BY sysenddate DESC) AS rn 78 FROM target.dimproductsubcategory) AS t 79 WHERE rn = 1) t 80 ON s.productsubcategorykey = t.productsubcategorykey 81 WHERE s.hash != t.hash) 82 AND s.productsubcategorykey = t.productsubcategorykey 83 AND rn = 1;

DimReseller

1 -- Source data set query 2 SELECT s.*, 3 Getdate() AS etldate, 4 ’9999-12-31 23:59:59’ AS enddate 5 FROM dbo.dimreseller s 6 LEFT OUTER JOIN target.dimreseller t 7 ON s.resellerkey = t.resellerkey 8 WHERE t.resellerkey IS NULL; A.2. Data warehouse transformation 87

9 -- Pre-copy script 10 UPDATE target.dimreseller 11 SET sysenddate = Getdate() 12 WHERE resellerkey IN (SELECT t2.resellerkey 13 FROM target.dimreseller t2 14 LEFT OUTER JOIN dbo.dimreseller s 15 ON s.resellerkey = t2.resellerkey 16 WHERE s.resellerkey IS NULL) 17 AND sysenddate = ’9999-12-31 23:59:59’ 18 19 UPDATE t 20 SET SysEndDate = Getdate() 21 FROM target.dimreseller t, 22 dbo.dimreseller s 23 WHERE t.resellerkey IN (SELECT s.resellerkey 24 FROM (SELECT s.resellerkey, 25 Hashbytes(’SHA2_512’, 26 Concat_ws(’,’, s.reselleralternatekey, 27 s.geographykey, 28 s.resellername, 29 s.addressline1, 30 s.addressline2, 31 s.city, s.postalcode, 32 s.stateprovinceid)) AS hash 33 FROM dbo.dimreseller s) s 34 JOIN (SELECT t.resellerkey, 35 Hashbytes(’SHA2_512’, 36 Concat_ws(’,’, 37 t.reselleralternatekey, 38 t.geographykey, 39 t.resellername, 40 t.addressline1, 41 t.addressline2, 42 t.city, t.postalcode, 43 t.stateprovinceid)) AS hash, 44 t.sysenddate 45 FROM target.dimreseller t) t 46 ON s.resellerkey = t.resellerkey 47 WHERE s.hash != t.hash 48 AND t.sysenddate = ’9999-12-31 23:59:59.000’) 49 AND s.resellerkey = t.resellerkey 50 AND t.sysenddate = ’9999-12-31 23:59:59.000’ 51 52 INSERT INTO target.dimreseller 53 SELECT s.*, 54 t.sysenddate, 55 ’9999-12-31 23:59:59’ 56 FROM dbo.dimreseller s, 57 (SELECT *, 58 Row_number() 59 OVER( 60 partition BY resellerkey 61 ORDER BY sysenddate DESC) AS rn 62 FROM target.dimreseller) AS t 88 Appendix A. Source Code

63 WHERE t.resellerkey IN (SELECT s.resellerkey 64 FROM (SELECT s.resellerkey, 65 Hashbytes(’SHA2_512’, 66 Concat_ws(’,’, s.reselleralternatekey, 67 s.geographykey, 68 s.resellername, 69 s.addressline1, 70 s.addressline2, 71 s.city, s.postalcode, 72 s.stateprovinceid)) AS hash 73 FROM dbo.dimreseller s) s 74 JOIN (SELECT t.resellerkey, 75 Hashbytes(’SHA2_512’, 76 Concat_ws(’,’, 77 t.reselleralternatekey, 78 t.geographykey, 79 t.resellername, 80 t.addressline1, 81 t.addressline2, 82 t.city, t.postalcode, 83 t.stateprovinceid)) AS hash 84 FROM (SELECT *, 85 Row_number() 86 OVER( 87 partition BY resellerkey 88 ORDER BY sysenddate DESC) AS 89 rn 90 FROM target.dimreseller) AS t 91 WHERE rn = 1) t 92 ON s.resellerkey = t.resellerkey 93 WHERE s.hash != t.hash) 94 AND s.resellerkey = t.resellerkey 95 AND rn = 1;

DimSalesReason

1 -- Source data set query 2 SELECT s.*, 3 Getdate() AS etldate, 4 ’9999-12-31 23:59:59’ AS enddate 5 FROM dbo.dimsalesreason s 6 LEFT OUTER JOIN target.dimsalesreason t 7 ON s.salesreasonalternatekey = t.salesreasonalternatekey 8 WHERE t.salesreasonalternatekey IS NULL; 9 -- Pre-copy script 10 UPDATE target.dimsalesreason 11 SET sysenddate = Getdate() 12 WHERE salesreasonkey IN (SELECT t2.salesreasonkey 13 FROM target.dimsalesreason t2 14 LEFT OUTER JOIN dbo.dimsalesreason s 15 ON s.salesreasonkey = 16 t2.salesreasonkey 17 WHERE s.salesreasonkey IS NULL) 18 AND sysenddate = ’9999-12-31 23:59:59’ 19 A.2. Data warehouse transformation 89

20 UPDATE t 21 SET SysEndDate = Getdate() 22 FROM target.dimsalesreason t, 23 dbo.dimsalesreason s 24 WHERE t.salesreasonkey IN (SELECT s.salesreasonkey 25 FROM (SELECT s.salesreasonkey, 26 Hashbytes(’SHA2_512’, 27 Concat_ws(’,’, s.salesreasonalternatekey, 28 s.salesreasonname, 29 s.salesreasonreasontype)) AS 30 hash 31 FROM dbo.dimsalesreason s) s 32 JOIN (SELECT t.salesreasonkey, 33 Hashbytes(’SHA2_512’, 34 Concat_ws(’,’, t.salesreasonalternatekey, 35 t.salesreasonname, 36 t.salesreasonreasontype)) AS 37 hash, 38 t.sysenddate 39 FROM target.dimsalesreason t) t 40 ON s.salesreasonkey = t.salesreasonkey 41 WHERE s.hash != t.hash 42 AND t.sysenddate = ’9999-12-31 23:59:59.000’) 43 AND s.salesreasonkey = t.salesreasonkey 44 AND t.sysenddate = ’9999-12-31 23:59:59.000’ 45 46 INSERT INTO target.dimsalesreason 47 SELECT s.*, 48 t.sysenddate, 49 ’9999-12-31 23:59:59’ 50 FROM dbo.dimsalesreason s, 51 (SELECT *, 52 Row_number() 53 OVER( 54 partition BY salesreasonkey 55 ORDER BY sysenddate DESC) AS rn 56 FROM target.dimsalesreason) AS t 57 WHERE t.salesreasonkey IN (SELECT s.salesreasonkey 58 FROM (SELECT s.salesreasonkey, 59 Hashbytes(’SHA2_512’, 60 Concat_ws(’,’, s.salesreasonalternatekey, 61 s.salesreasonname, 62 s.salesreasonreasontype)) AS 63 hash 64 FROM dbo.dimsalesreason s) s 65 JOIN (SELECT t.salesreasonkey, 66 Hashbytes(’SHA2_512’, 67 Concat_ws(’,’, t.salesreasonalternatekey, 68 t.salesreasonname, 69 t.salesreasonreasontype)) AS 70 hash 71 FROM (SELECT *, 72 Row_number() 73 OVER( 90 Appendix A. Source Code

74 partition BY salesreasonkey 75 ORDER BY sysenddate DESC) AS rn 76 FROM target.dimsalesreason) AS t 77 WHERE rn = 1) t 78 ON s.salesreasonkey = t.salesreasonkey 79 WHERE s.hash != t.hash) 80 AND s.salesreasonkey = t.salesreasonkey 81 AND rn = 1;

DimSalesTerritory

1 -- Source data set query 2 SELECT s.*, 3 Getdate() AS etldate, 4 ’9999-12-31 23:59:59’ AS enddate 5 FROM dbo.dimsalesterritory s 6 LEFT OUTER JOIN target.dimsalesterritory t 7 ON s.salesterritorykey = t.salesterritorykey 8 WHERE t.salesterritorykey IS NULL; 9 -- Pre-copy script 10 UPDATE target.dimsalesterritory 11 SET sysenddate = Getdate() 12 WHERE salesterritorykey IN (SELECT t2.salesterritorykey 13 FROM target.dimsalesterritory t2 14 LEFT OUTER JOIN dbo.dimsalesterritory s 15 ON s.salesterritorykey = 16 t2.salesterritorykey 17 WHERE s.salesterritorykey IS NULL) 18 AND sysenddate = ’9999-12-31 23:59:59’ 19 20 UPDATE t 21 SET SysEndDate = Getdate() 22 FROM target.dimsalesterritory t, 23 dbo.dimsalesterritory s 24 WHERE t.salesterritorykey IN (SELECT s.salesterritorykey 25 FROM (SELECT s.salesterritorykey, 26 Hashbytes(’SHA2_512’, 27 Concat_ws(’,’, s.salesterritoryalternatekey, 28 s.salesterritoryregion, 29 s.salesterritorycountry, 30 s.salesterritorygroup)) AS hash 31 FROM dbo.dimsalesterritory s) s 32 JOIN (SELECT t.salesterritorykey, 33 Hashbytes(’SHA2_512’, 34 Concat_ws(’,’, 35 t.salesterritoryalternatekey, 36 t.salesterritoryregion, 37 t.salesterritorycountry, 38 t.salesterritorygroup)) AS hash, 39 t.sysenddate 40 FROM target.dimsalesterritory t) t 41 ON s.salesterritorykey = 42 t.salesterritorykey 43 WHERE s.hash != t.hash 44 AND t.sysenddate = A.2. Data warehouse transformation 91

45 ’9999-12-31 23:59:59.000’) 46 AND s.salesterritorykey = t.salesterritorykey 47 AND t.sysenddate = ’9999-12-31 23:59:59.000’ 48 49 INSERT INTO target.dimsalesterritory 50 SELECT s.*, 51 t.sysenddate, 52 ’9999-12-31 23:59:59’ 53 FROM dbo.dimsalesterritory s, 54 (SELECT *, 55 Row_number() 56 OVER( 57 partition BY salesterritorykey 58 ORDER BY sysenddate DESC) AS rn 59 FROM target.dimsalesterritory) AS t 60 WHERE t.salesterritorykey IN (SELECT s.salesterritorykey 61 FROM (SELECT s.salesterritorykey, 62 Hashbytes(’SHA2_512’, 63 Concat_ws(’,’, s.salesterritoryalternatekey, 64 s.salesterritoryregion, 65 s.salesterritorycountry, 66 s.salesterritorygroup)) AS hash 67 FROM dbo.dimsalesterritory s) s 68 JOIN (SELECT t.salesterritorykey, 69 Hashbytes(’SHA2_512’, 70 Concat_ws(’,’, 71 t.salesterritoryalternatekey, 72 t.salesterritoryregion, 73 t.salesterritorycountry, 74 t.salesterritorygroup)) AS hash 75 FROM (SELECT *, 76 Row_number() 77 OVER( 78 partition BY salesterritorykey 79 ORDER BY sysenddate DESC) AS rn 80 FROM target.dimsalesterritory) AS t 81 WHERE rn = 1) t 82 ON s.salesterritorykey = 83 t.salesterritorykey 84 WHERE s.hash != t.hash) 85 AND s.salesterritorykey = t.salesterritorykey 86 AND rn = 1;

FactInternetSales

1 -- Source data set query 2 SELECT s.*, 3 Getdate() AS etldate, 4 ’9999-12-31 23:59:59’ AS enddate 5 FROM dbo.factinternetsales s 6 LEFT OUTER JOIN target.factinternetsales t 7 ON s.salesordernumber = t.salesordernumber 8 AND s.productkey = t.productkey 9 WHERE t.salesordernumber IS NULL 10 AND t.productkey IS NULL; 92 Appendix A. Source Code

11 -- Pre-copy script 12 UPDATE target.factinternetsales 13 SET sysenddate = Getdate() 14 FROM target.factinternetsales t 15 WHERE NOT EXISTS (SELECT NULL 16 FROM dbo.factinternetsales s 17 WHERE s.salesordernumber = t.salesordernumber 18 AND s.productkey = t.productkey) 19 AND sysenddate = ’9999-12-31 23:59:59’ 20 21 UPDATE t 22 SET SysEndDate = Getdate() 23 FROM target.factinternetsales t, 24 dbo.factinternetsales s 25 WHERE Concat_ws(’,’, t.salesordernumber, t.productkey) IN (SELECT 26 Concat_ws(’,’, s.salesordernumber, s.productkey) 27 FROM 28 (SELECT s.salesordernumber, 29 s.productkey, 30 Hashbytes(’SHA2_512’, Concat_ws(’,’, s.orderdatekey, 31 s.duedatekey, 32 s.shipdatekey, s.customerkey, 33 s.promotionkey, 34 s.currencykey, 35 s.salesterritorykey, s.revisionnumber, 36 s.orderquantity, 37 s.unitprice, 38 s.extendedamount, s.unitpricediscountpct, 39 s.discountamount, 40 s.productstandardcost, 41 s.totalproductcost, 42 s.salesamount, 43 s.taxamt, s.freight, 44 s.carriertrackingnumber, 45 s.customerponumber)) AS hash 46 FROM 47 dbo.factinternetsales s) s 48 JOIN 49 (SELECT t.salesordernumber, 50 t.productkey, 51 Hashbytes(’SHA2_512’, Concat_ws(’,’, t.orderdatekey, 52 t.duedatekey, 53 t.shipdatekey, t.customerkey, 54 t.promotionkey, 55 t.currencykey, 56 t.salesterritorykey, t.revisionnumber, 57 t.orderquantity, 58 t.unitprice, 59 t.extendedamount, t.unitpricediscountpct, 60 t.discountamount, 61 t.productstandardcost, 62 t.totalproductcost, 63 t.salesamount, 64 t.taxamt, t.freight, A.2. Data warehouse transformation 93

65 t.carriertrackingnumber, 66 t.customerponumber)) AS hash, 67 t.sysenddate 68 FROM target.factinternetsales t) t 69 ON s.salesordernumber = t.salesordernumber 70 AND s.productkey = t.productkey 71 WHERE s.hash != t.hash 72 AND t.sysenddate = ’9999-12-31 23:59:59.000’) 73 AND s.salesordernumber = t.salesordernumber 74 AND s.productkey = t.productkey 75 AND t.sysenddate = ’9999-12-31 23:59:59.000’ 76 77 INSERT INTO target.factinternetsales 78 SELECT s.*, 79 t.sysenddate, 80 ’9999-12-31 23:59:59’ 81 FROM dbo.factinternetsales s, 82 (SELECT *, 83 Row_number() 84 OVER( 85 partition BY salesordernumber, productkey 86 ORDER BY sysenddate DESC) AS rn 87 FROM target.factinternetsales) AS t 88 WHERE Concat_ws(’,’, t.salesordernumber, t.productkey) IN (SELECT 89 Concat_ws(’,’, s.salesordernumber, s.productkey) 90 FROM 91 (SELECT s.salesordernumber, 92 s.productkey, 93 Hashbytes(’SHA2_512’, Concat_ws(’,’, s.orderdatekey, 94 s.duedatekey, 95 s.shipdatekey, s.customerkey, 96 s.promotionkey, 97 s.currencykey, 98 s.salesterritorykey, s.revisionnumber, 99 s.orderquantity, 100 s.unitprice, 101 s.extendedamount, s.unitpricediscountpct, 102 s.discountamount, 103 s.productstandardcost, 104 s.totalproductcost, 105 s.salesamount, 106 s.taxamt, s.freight, 107 s.carriertrackingnumber, 108 s.customerponumber)) AS hash 109 FROM 110 dbo.factinternetsales s) s 111 JOIN 112 (SELECT t.salesordernumber, 113 t.productkey, 114 Hashbytes(’SHA2_512’, Concat_ws(’,’, t.orderdatekey, 115 t.duedatekey, 116 t.shipdatekey, t.customerkey, 117 t.promotionkey, 118 t.currencykey, 94 Appendix A. Source Code

119 t.salesterritorykey, t.revisionnumber, 120 t.orderquantity, 121 t.unitprice, 122 t.extendedamount, t.unitpricediscountpct, 123 t.discountamount, 124 t.productstandardcost, 125 t.totalproductcost, 126 t.salesamount, 127 t.taxamt, t.freight, 128 t.carriertrackingnumber, 129 t.customerponumber)) AS hash 130 FROM (SELECT *, 131 Row_number() 132 OVER( 133 partition BY salesordernumber, productkey 134 ORDER BY sysenddate DESC) AS rn 135 FROM target.factinternetsales) AS t 136 WHERE rn = 1) t 137 ON s.salesordernumber = t.salesordernumber 138 AND s.productkey = t.productkey 139 WHERE s.hash != t.hash) 140 AND s.salesordernumber = t.salesordernumber 141 AND s.productkey = t.productkey 142 AND rn = 1;

FactResellerSales

1 -- Source data set query 2 SELECT s.*, 3 Getdate() AS etldate, 4 ’9999-12-31 23:59:59’ AS enddate 5 FROM dbo.factresellersales s 6 LEFT OUTER JOIN target.factresellersales t 7 ON s.salesordernumber = t.salesordernumber 8 AND s.productkey = t.productkey 9 WHERE t.salesordernumber IS NULL 10 AND t.productkey IS NULL; 11 -- Pre-copy script 12 UPDATE target.factresellersales 13 SET sysenddate = Getdate() 14 FROM target.factresellersales t 15 WHERE NOT EXISTS (SELECT NULL 16 FROM dbo.factresellersales s 17 WHERE s.salesordernumber = t.salesordernumber 18 AND s.productkey = t.productkey) 19 AND sysenddate = ’9999-12-31 23:59:59’ 20 21 UPDATE t 22 SET SysEndDate = Getdate() 23 FROM target.factresellersales t, 24 dbo.factresellersales s 25 WHERE Concat_ws(’,’, t.salesordernumber, t.productkey) IN (SELECT 26 Concat_ws(’,’, s.salesordernumber, s.productkey) 27 FROM 28 (SELECT s.salesordernumber, A.2. Data warehouse transformation 95

29 s.productkey, 30 Hashbytes(’SHA2_512’, Concat_ws(’,’, s.orderdate, 31 s.duedate, 32 s.shipdate, 33 s.resellerkey, s.employeekey, 34 s.promotionkey, 35 s.currencykey, 36 s.salesterritorykey, 37 s.revisionnumber, 38 s.orderquantity, 39 s.unitprice, s.extendedamount, 40 s.unitpricediscountpct, 41 s.discountamount, 42 s.productstandardcost, 43 s.totalproductcost, 44 s.salesamount, 45 s.taxamt, s.freight, 46 s.carriertrackingnumber, 47 s.customerponumber)) AS hash 48 FROM 49 dbo.factresellersales s) s 50 JOIN 51 (SELECT t.salesordernumber, 52 t.productkey, 53 Hashbytes(’SHA2_512’, Concat_ws(’,’, t.orderdate, t.duedate, 54 t.shipdate, 55 t.resellerkey, t.employeekey, 56 t.promotionkey, 57 t.currencykey, 58 t.salesterritorykey, 59 t.revisionnumber, 60 t.orderquantity, 61 t.unitprice, t.extendedamount, 62 t.unitpricediscountpct, 63 t.discountamount, t.productstandardcost, 64 t.totalproductcost, 65 t.salesamount, 66 t.taxamt, t.freight, 67 t.carriertrackingnumber, 68 t.customerponumber)) AS hash, 69 t.sysenddate 70 FROM target.factresellersales t) t 71 ON s.salesordernumber = t.salesordernumber 72 AND s.productkey = t.productkey 73 WHERE s.hash != t.hash 74 AND t.sysenddate = ’9999-12-31 23:59:59.000’) 75 AND s.salesordernumber = t.salesordernumber 76 AND s.productkey = t.productkey 77 AND t.sysenddate = ’9999-12-31 23:59:59.000’ 78 79 INSERT INTO target.factresellersales 80 SELECT s.*, 81 t.sysenddate, 82 ’9999-12-31 23:59:59’ 96 Appendix A. Source Code

83 FROM dbo.factresellersales s, 84 (SELECT *, 85 Row_number() 86 OVER( 87 partition BY salesordernumber, productkey 88 ORDER BY sysenddate DESC) AS rn 89 FROM target.factresellersales) AS t 90 WHERE Concat_ws(’,’, t.salesordernumber, t.productkey) IN (SELECT 91 Concat_ws(’,’, s.salesordernumber, s.productkey) 92 FROM 93 (SELECT s.salesordernumber, 94 s.productkey, 95 Hashbytes(’SHA2_512’, Concat_ws(’,’, s.orderdate, 96 s.duedate, 97 s.shipdate, 98 s.resellerkey, s.employeekey, 99 s.promotionkey, 100 s.currencykey, 101 s.salesterritorykey, 102 s.revisionnumber, 103 s.orderquantity, 104 s.unitprice, s.extendedamount, 105 s.unitpricediscountpct, 106 s.discountamount, 107 s.productstandardcost, 108 s.totalproductcost, 109 s.salesamount, 110 s.taxamt, s.freight, 111 s.carriertrackingnumber, 112 s.customerponumber)) AS hash 113 FROM 114 dbo.factresellersales s) s 115 JOIN 116 (SELECT t.salesordernumber, 117 t.productkey, 118 Hashbytes(’SHA2_512’, Concat_ws(’,’, t.orderdate, t.duedate, 119 t.shipdate, 120 t.resellerkey, t.employeekey, 121 t.promotionkey, 122 t.currencykey, 123 t.salesterritorykey, 124 t.revisionnumber, 125 t.orderquantity, 126 t.unitprice, t.extendedamount, 127 t.unitpricediscountpct, 128 t.discountamount, t.productstandardcost, 129 t.totalproductcost, 130 t.salesamount, 131 t.taxamt, t.freight, 132 t.carriertrackingnumber, 133 t.customerponumber)) AS hash 134 FROM (SELECT *, 135 Row_number() 136 OVER( A.2. Data warehouse transformation 97

137 partition BY salesordernumber, productkey 138 ORDER BY sysenddate DESC) AS rn 139 FROM target.factresellersales) AS t 140 WHERE rn = 1) t 141 ON s.salesordernumber = t.salesordernumber 142 AND s.productkey = t.productkey 143 WHERE s.hash != t.hash) 144 AND s.salesordernumber = t.salesordernumber 145 AND s.productkey = t.productkey 146 AND rn = 1;

FactSalesQuota

1 -- Source data set query 2 SELECT s.*, 3 Getdate() AS etldate, 4 ’9999-12-31 23:59:59’ AS enddate 5 FROM dbo.factsalesquota s 6 LEFT OUTER JOIN target.factsalesquota t 7 ON s.salesquotakey = t.salesquotakey 8 WHERE t.salesquotakey IS NULL; 9 -- Pre-copy script 10 UPDATE target.factsalesquota 11 SET sysenddate = Getdate() 12 WHERE salesquotakey IN (SELECT t2.salesquotakey 13 FROM target.factsalesquota t2 14 LEFT OUTER JOIN dbo.factsalesquota s 15 ON s.salesquotakey = 16 t2.salesquotakey 17 WHERE s.salesquotakey IS NULL) 18 AND sysenddate = ’9999-12-31 23:59:59’ 19 20 UPDATE t 21 SET SysEndDate = Getdate() 22 FROM target.factsalesquota t, 23 dbo.factsalesquota s 24 WHERE t.salesquotakey IN (SELECT s.salesquotakey 25 FROM (SELECT s.salesquotakey, 26 Hashbytes(’SHA2_512’, 27 Concat_ws(’,’, s.employeekey, 28 s.quotadate, 29 s.salesamountquota)) AS hash 30 FROM dbo.factsalesquota s) s 31 JOIN (SELECT t.salesquotakey, 32 Hashbytes(’SHA2_512’, 33 Concat_ws(’,’, t.employeekey, 34 t.quotadate, 35 t.salesamountquota)) AS hash, 36 t.sysenddate 37 FROM target.factsalesquota t) t 38 ON s.salesquotakey = t.salesquotakey 39 WHERE s.hash != t.hash 40 AND t.sysenddate = ’9999-12-31 23:59:59.000’) 41 AND s.salesquotakey = t.salesquotakey 42 AND t.sysenddate = ’9999-12-31 23:59:59.000’ 98 Appendix A. Source Code

43 44 INSERT INTO target.factsalesquota 45 (employeekey, 46 quotadate, 47 salesamountquota, 48 etldate, 49 sysenddate) 50 SELECT s.employeekey, 51 s.quotadate, 52 s.salesamountquota, 53 t.sysenddate, 54 ’9999-12-31 23:59:59’ 55 FROM dbo.factsalesquota s, 56 (SELECT *, 57 Row_number() 58 OVER( 59 partition BY salesquotakey 60 ORDER BY sysenddate DESC) AS rn 61 FROM target.factsalesquota) AS t 62 WHERE t.salesquotakey IN (SELECT s.salesquotakey 63 FROM (SELECT s.salesquotakey, 64 Hashbytes(’SHA2_512’, 65 Concat_ws(’,’, s.employeekey, 66 s.quotadate, 67 s.salesamountquota)) AS hash 68 FROM dbo.factsalesquota s) s 69 JOIN (SELECT t.salesquotakey, 70 Hashbytes(’SHA2_512’, 71 Concat_ws(’,’, t.employeekey, 72 t.quotadate, 73 t.salesamountquota)) AS hash 74 FROM (SELECT *, 75 Row_number() 76 OVER( 77 partition BY 78 salesquotakey 79 ORDER BY sysenddate 80 DESC) AS 81 rn 82 FROM target.factsalesquota) AS 83 t 84 WHERE rn = 1) t 85 ON s.salesquotakey = t.salesquotakey 86 WHERE s.hash != t.hash) 87 AND s.salesquotakey = t.salesquotakey 88 AND rn = 1; A.3. Views in temporal database

DimAddress

CREATE VIEW [dbo].[DimAddress] AS
(
    SELECT a.addressid,
           a.addressline1,
           a.addressline2,
           a.city,
           sp.NAME AS stateprovince,
           cr.NAME AS countryregion,
           a.postalcode,
           a.modifieddate,
           (SELECT Max(v)
            FROM (VALUES (a.sysstarttime), (sp.sysstarttime), (cr.sysstarttime)) AS value(v)) AS [SysStartTime],
           (SELECT Min(v)
            FROM (VALUES (a.sysendtime), (sp.sysendtime), (cr.sysendtime)) AS value(v)) AS [SysEndTime]
    FROM   person.address FOR system_time ALL a
           LEFT OUTER JOIN person.stateprovince FOR system_time ALL sp
                  ON a.stateprovinceid = sp.stateprovinceid
           LEFT OUTER JOIN person.countryregion FOR system_time ALL cr
                  ON sp.countryregioncode = cr.countryregioncode
                 -- Combine only row versions whose system-time periods overlap
                 -- (latest start before earliest end).
                 AND (SELECT Max(v)
                      FROM (VALUES (a.sysstarttime), (sp.sysstarttime), (cr.sysstarttime)) AS value(v)) <
                     (SELECT Min(v)
                      FROM (VALUES (a.sysendtime), (sp.sysendtime), (cr.sysendtime)) AS value(v))
)
go

DimCurrency

CREATE VIEW [dbo].[DimCurrency] AS
(
    SELECT c.CurrencyCode,
           c.Name,
           c.SysStartTime,
           c.SysEndTime
    FROM   Sales.Currency FOR SYSTEM_TIME ALL c
)
GO

DimCustomer

1 CREATE VIEW [dbo].[DimCustomer] AS 2 ( 3 SELECT c.customerid AS [CustomerKey], 4 pp.title AS [Title], 5 pp.firstname AS [FirstName], 6 pp.middlename AS [MiddleName], 7 pp.lastname AS [LastName], 8 pe.emailaddress AS [EmailAddress], 9 ppp.phonenumber AS [PhoneNumber], 10 pa.addressline1 AS [AddressLine1], 11 pa.addressline2 AS [AddressLine2], 12 pa.city AS [City], 13 pa.postalcode AS [PostalCode], 14 ( 15 SELECT Max(v) 16 FROM (VALUES 17 ( 18 c.sysstarttime 19 ) 20 , (pp.sysstarttime), (pe.sysstarttime), ( pa.sysstarttime), (ppp.sysstarttime)) AS value(v)) AS [SysStartTime], 21 ( 22 SELECT Min(v) 23 FROM (VALUES 24 ( 25 c.sysendtime 26 ) 27 , (pp.sysendtime), (pe.sysendtime), (pa. sysendtime), (ppp.sysendtime)) AS value(v)) AS [SysEndTime] 28 FROM [Sales].[Customer] FOR system_time ALL c 29 INNER JOIN person.person FOR system_time ALL pp 30 ON pp.businessentityid = c.personid 31 INNER JOIN person.emailaddress FOR system_time ALL pe 32 ON pe.businessentityid = c.personid 33 INNER JOIN person.businessentityaddress FOR system_time ALL pbea 34 ON pbea.businessentityid = c.personid 35 INNER JOIN person.address FOR system_time ALL pa 36 ON pa.addressid = pbea.addressid 37 INNER JOIN person.personphone FOR system_time ALL ppp 38 ON ppp.businessentityid = c.personid 39 AND 40 ( 41 SELECT max(v) 42 FROM (VALUES 43 ( 44 c.sysstarttime 45 ) A.3. Views in temporal database 101

46 , (pp.sysstarttime), (pe.sysstarttime), ( pa.sysstarttime), (ppp.sysstarttime)) AS value(v)) < 47 ( 48 SELECT min(v) 49 FROM (VALUES 50 ( 51 c.sysendtime 52 ) 53 , (pp.sysendtime), (pe.sysendtime), (pa. sysendtime), (ppp.sysendtime)) AS value(v)) 54 )go

DimDepartmentGroup

CREATE VIEW [dbo].[DimDepartmentGroup] AS
(
    SELECT DISTINCT humanresources.department.groupname AS departmentgroupname,
                    humanresources.department.sysstarttime,
                    humanresources.department.sysendtime
    FROM   humanresources.department FOR system_time ALL
)
go

DimEmployee

1 CREATE VIEW [dbo].[DimEmployee] AS 2 ( 3 SELECT e.[BusinessEntityID] AS businessentityid, 4 e.[NationalIDNumber] AS [ EmployeeNationalIDAlternateKey], 5 COALESCE(sp.[TerritoryID], 11) AS [ SalesTerritoryKey], 6 co.[FirstName] AS [FirstName], 7 co.[LastName] AS [LastName], 8 co.[MiddleName] AS [MiddleName], 9 e.[JobTitle] AS [Title], 10 e.[HireDate] AS [HireDate], 11 e.[BirthDate] AS [BirthDate], 12 e.[LoginID] AS [LoginID], 13 em.[EmailAddress] AS [EmailAddress], 14 pp.phonenumber AS [Phone], 15 e.[MaritalStatus] AS [MaritalStatus], 16 e.[SalariedFlag] AS [SalariedFlag], 17 e.[Gender] AS [Gender], 18 eph.[PayFrequency] AS [PayFrequency], 19 eph.[Rate] AS [BaseRate], 20 e.[VacationHours] AS [VacationHours], 21 e.[SickLeaveHours] AS [SickLeaveHours], 22 e.[CurrentFlag] AS [CurrentFlag], 23 d.[Name] AS [DepartmentName], 24 COALESCE(edh.[StartDate], e.[HireDate]) AS [ StartDate], 25 edh.[EndDate] AS [EndDate], 102 Appendix A. Source Code

26 CASE 27 WHEN edh.[EndDate] IS NULL THEN N’Current’ 28 ELSE NULL 29 END AS [Status], 30 ( 31 SELECT Max(v) 32 FROM (VALUES 33 ( 34 e.sysstarttime 35 ) 36 , (co.sysstarttime), (pp. sysstarttime), (em. sysstarttime), (sp. sysstarttime), (edh. sysstarttime), (d. sysstarttime), (eph. sysstarttime)) AS value(v)) AS [SysStartTime], 37 ( 38 SELECT Min(v) 39 FROM (VALUES 40 ( 41 e.sysendtime 42 ) 43 , (co.sysendtime), (pp.sysendtime ), (em.sysendtime), (sp. sysendtime), (edh.sysendtime) , (d.sysendtime), (eph. sysendtime)) AS value(v)) AS [SysEndTime] 44 FROM [HumanResources].[Employee] FOR system_time ALL e 45 INNER JOIN [Person].[Person] FOR system_time ALL co 46 ON e.[BusinessEntityID] = co.[BusinessEntityID] 47 INNER JOIN [Person].[PersonPhone] FOR system_time ALL pp 48 ON pp.businessentityid = e.businessentityid 49 INNER JOIN [Person].[EmailAddress] FOR system_time ALL em 50 ON e.[BusinessEntityID] = em.businessentityid 51 INNER JOIN [Person].[BusinessEntityAddress] FOR system_time ALL ea 52 ON e.[BusinessEntityID] = ea.[BusinessEntityID] 53 INNER JOIN [Person].[Address] FOR system_time ALL a 54 ON ea.[AddressID] = a.[AddressID] 55 LEFT OUTER JOIN [Sales].[SalesPerson] FOR system_time ALL sp 56 ON e.[BusinessEntityID] = sp.[BusinessEntityID] 57 LEFT OUTER JOIN [HumanResources].[EmployeeDepartmentHistory] FOR system_time ALL edh 58 ON e.businessentityid = edh.[BusinessEntityID] 59 INNER JOIN [HumanResources].[Department] FOR system_time ALL d 60 ON edh.[DepartmentID] = d.[DepartmentID] 61 LEFT OUTER JOIN [HumanResources].[EmployeePayHistory] FOR system_time ALL eph 62 ON e.[BusinessEntityID] = eph.[BusinessEntityID] A.3. Views in temporal database 103

63 AND 64 ( 65 SELECT max(v) 66 FROM (VALUES 67 ( 68 e.sysstarttime 69 ) 70 , (co.sysstarttime), (pp. sysstarttime), (em. sysstarttime), (sp. sysstarttime), (edh. sysstarttime), (d. sysstarttime), (eph. sysstarttime)) AS value(v)) < 71 ( 72 SELECT min(v) 73 FROM (VALUES 74 ( 75 e.sysendtime 76 ) 77 , (co.sysendtime), (pp.sysendtime ), (em.sysendtime), (sp. sysendtime), (edh.sysendtime) , (d.sysendtime), (eph. sysendtime)) AS value(v)) 78 )go

DimGeography

CREATE VIEW [dbo].[DimGeography] AS
(
    SELECT DISTINCT a.[City]               AS [City],
                    sp.[StateProvinceCode] AS [StateProvinceCode],
                    sp.[Name]              AS [StateProvinceName],
                    cr.[CountryRegionCode] AS [CountryRegionCode],
                    cr.[Name]              AS [CountryRegionName],
                    a.[PostalCode]         AS [PostalCode],
                    (SELECT Max(v)
                     FROM (VALUES (a.sysstarttime), (sp.sysstarttime), (cr.sysstarttime)) AS value(v)) AS [SysStartTime],
                    (SELECT Min(v)
                     FROM (VALUES (a.sysendtime), (sp.sysendtime), (cr.sysendtime)) AS value(v)) AS [SysEndTime]
    FROM   [Person].[Address] FOR system_time ALL AS a
           INNER JOIN [Person].[StateProvince] FOR system_time ALL AS sp
                   ON a.[StateProvinceID] = sp.[StateProvinceID]
           INNER JOIN [Person].[CountryRegion] FOR system_time ALL AS cr
                   ON sp.[CountryRegionCode] = cr.[CountryRegionCode]
                  AND (SELECT Max(v)
                       FROM (VALUES (a.sysstarttime), (sp.sysstarttime), (cr.sysstarttime)) AS value(v)) <
                      (SELECT Min(v)
                       FROM (VALUES (a.sysendtime), (sp.sysendtime), (cr.sysendtime)) AS value(v))
)
go

DimProduct

1 CREATE VIEW [dbo].[DimProduct] AS 2 ( 3 SELECT p.productnumber AS productalternatekey, 4 p.productsubcategoryid AS productsubcategorykey, 5 p.weightunitmeasurecode AS weightunitmeasurecode, 6 p.sizeunitmeasurecode AS sizeunitmeasurecode, 7 p.[Name] AS productname, 8 pch.standardcost AS standardcost, 9 p.finishedgoodsflag AS finishedgoodsflag, 10 COALESCE(p.color, ’NA’) AS color, 11 p.safetystocklevel AS safetystocklevel, 12 p.reorderpoint AS reorderpoint, 13 plph.listprice AS listprice, 14 p.size AS size, 15 CONVERT(FLOAT, p.weight) AS weight, 16 p.daystomanufacture AS daystomanufacture, 17 p.productline AS productline, 18 p.class AS class, 19 p.style AS style, 20 pm.[Name] AS modelname, 21 COALESCE(plph.startdate, pch.startdate, p. sellstartdate) AS startdate, A.3. Views in temporal database 105

22 COALESCE(plph.enddate, pch.enddate, p. sellenddate) AS enddate, 23 CASE 24 WHEN COALESCE(plph.enddate, pch .enddate, p.sellenddate) IS NULL THEN N’Current’ 25 ELSE NULL 26 END AS status, 27 ( 28 SELECT Max(v) 29 FROM (VALUES 30 ( 31 p.sysstarttime 32 ) 33 , (pm.sysstarttime), (pch. sysstarttime), (plph. sysstarttime)) AS value(v)) AS [SysStartTime], 34 ( 35 SELECT Min(v) 36 FROM (VALUES 37 ( 38 p.sysendtime 39 ) 40 , (pm.sysendtime), (pch. sysendtime), (plph.sysendtime )) AS value(v)) AS [ SysEndTime] 41 FROM production.product FOR system_time ALL p 42 LEFT OUTER JOIN production.productmodel FOR system_time ALL pm 43 ON p.productmodelid = pm.productmodelid 44 LEFT OUTER JOIN production.productcosthistory FOR system_time ALL pch 45 ON p.productid = pch.productid 46 LEFT OUTER JOIN production.productlistpricehistory FOR system_time ALL plph 47 ON p.productid = plph.productid 48 AND pch.startdate = plph.startdate 49 AND COALESCE(pch.enddate, ’12-31-2020’) = COALESCE(plph. enddate, ’12-31-2020’) 50 AND 51 ( 52 SELECT max(v) 53 FROM (VALUES 54 ( 55 p.sysstarttime 56 ) 57 , (pm.sysstarttime), (pch. sysstarttime), (plph. sysstarttime)) AS value(v)) < 58 ( 59 SELECT min(v) 60 FROM (VALUES 106 Appendix A. Source Code

61 ( 62 p.sysendtime 63 ) 64 , (pm.sysendtime), (pch. sysendtime), (plph.sysendtime )) AS value(v)) 65 )go

DimProductCategory

CREATE VIEW [dbo].[DimProductCategory] AS
(
    SELECT DISTINCT pc.productcategoryid AS productcategoryalternatekey,
                    pc.[Name]            AS productcategoryname,
                    pc.sysstarttime,
                    pc.sysendtime
    FROM   [Production].[ProductCategory] FOR system_time ALL pc
)
go

DimProductSubcategory

CREATE VIEW [dbo].[DimProductSubcategory] AS
(
    SELECT DISTINCT ps.productsubcategoryid         AS productsubcategorykey,
                    ps.productsubcategoryid         AS productsubcategoryalternatekey,
                    ps.[Name]                       AS productsubcategoryname,
                    dpc.productcategoryalternatekey AS productcategorykey,
                    (SELECT Max(v)
                     FROM (VALUES (ps.sysstarttime), (dpc.sysstarttime)) AS value(v)) AS [SysStartTime],
                    (SELECT Min(v)
                     FROM (VALUES (ps.sysendtime), (dpc.sysendtime)) AS value(v)) AS [SysEndTime]
    FROM   [Production].[ProductSubcategory] FOR system_time ALL ps
           INNER JOIN dbo.dimproductcategory dpc
                   ON ps.productcategoryid = dpc.productcategoryalternatekey
                  AND (SELECT Max(v)
                       FROM (VALUES (ps.sysstarttime), (dpc.sysstarttime)) AS value(v)) <
                      (SELECT Min(v)
                       FROM (VALUES (ps.sysendtime), (dpc.sysendtime)) AS value(v))
)
go

DimReseller

CREATE VIEW [dbo].[DimReseller] AS
(
    SELECT DISTINCT s.[BusinessEntityID] AS [ResellerKey],
                    s.[Name]             AS [ResellerName],
                    a.addressline1       AS addressline1,
                    a.addressline2       AS addressline2,
                    a.city               AS city,
                    a.postalcode         AS postalcode,
                    a.stateprovinceid    AS stateprovinceid,
                    (SELECT Max(v)
                     FROM (VALUES (s.sysstarttime), (a.sysstarttime)) AS value(v)) AS [SysStartTime],
                    (SELECT Min(v)
                     FROM (VALUES (s.sysendtime), (a.sysendtime)) AS value(v)) AS [SysEndTime]
    FROM   [Sales].[Customer] FOR system_time ALL cu
           INNER JOIN [Sales].[Store] FOR system_time ALL s
                   ON cu.[StoreID] = s.[BusinessEntityID]
           INNER JOIN [Person].[BusinessEntityAddress] FOR system_time ALL bea
                   ON cu.[StoreID] = bea.[BusinessEntityID]
           INNER JOIN [Person].[Address] FOR system_time ALL a
                   ON bea.[AddressID] = a.[AddressID]
           INNER JOIN [Person].[StateProvince] FOR system_time ALL sp
                   ON a.[StateProvinceID] = sp.[StateProvinceID]
           INNER JOIN [Person].[CountryRegion] FOR system_time ALL cr
                   ON sp.[CountryRegionCode] = cr.[CountryRegionCode]
                  AND (SELECT Max(v)
                       FROM (VALUES (s.sysstarttime), (a.sysstarttime)) AS value(v)) <
                      (SELECT Min(v)
                       FROM (VALUES (s.sysendtime), (a.sysendtime)) AS value(v))
    WHERE  bea.[AddressTypeID] = 3 -- Main Office
)
go

DimSalesReason

CREATE VIEW [dbo].[DimSalesReason] AS
(
    SELECT DISTINCT sr.[SalesReasonID] AS [SalesReasonAlternateKey],
                    sr.[Name]          AS [SalesReasonName],
                    sr.[ReasonType]    AS [SalesReasonReasonType],
                    sr.sysstarttime,
                    sr.sysendtime
    FROM   [Sales].[SalesReason] FOR system_time ALL sr
)
go

DimSalesTerritory

CREATE VIEW [dbo].[DimSalesTerritory] AS
(
    SELECT st.[TerritoryID] AS [SalesTerritoryAlternateKey],
           st.[Name]        AS [SalesTerritoryRegion],
           cr.[Name]        AS [SalesTerritoryCountry],
           st.[Group]       AS [SalesTerritoryGroup],
           (SELECT Max(v)
            FROM (VALUES (st.sysstarttime), (cr.sysstarttime)) AS value(v)) AS [SysStartTime],
           (SELECT Min(v)
            FROM (VALUES (st.sysendtime), (cr.sysendtime)) AS value(v)) AS [SysEndTime]
    FROM   [Sales].[SalesTerritory] FOR system_time ALL st
           INNER JOIN [Person].[CountryRegion] FOR system_time ALL cr
                   ON st.[CountryRegionCode] = cr.[CountryRegionCode]
                  AND (SELECT Max(v)
                       FROM (VALUES (st.sysstarttime), (cr.sysstarttime)) AS value(v)) <
                      (SELECT Min(v)
                       FROM (VALUES (st.sysendtime), (cr.sysendtime)) AS value(v))
)
go

FactInternetSales

1 CREATE VIEW [dbo].[FactInternetSales] AS 2 ( 3 SELECT dp.[ProductAlternateKey] AS [ProductKey] , 4 soh.[OrderDate] AS [OrderDateKey] , 5 soh.[DueDate] AS [DueDateKey] , 6 soh.[ShipDate] AS [ShipDateKey] , 7 soh.[CustomerID] AS [CustomerKey] , 8 sod.[SpecialOfferID] AS [PromotionKey] , 9 COALESCE(dc.[CurrencyCode], 10 ( 11 SELECT currencycode 12 FROM [dbo].[DimCurrency] 13 WHERE currencycode = N’USD’)) AS [ CurrencyKey] , 14 soh.[TerritoryID] AS [SalesTerritoryKey] , 15 soh.[SalesOrderNumber] AS [SalesOrderNumber] , 16 soh.[RevisionNumber] AS [RevisionNumber] , 17 sod.[OrderQty] AS [OrderQuantity] , 18 sod.[UnitPrice] AS [UnitPrice] , 19 sod.[OrderQty] * sod.[UnitPrice] AS [ ExtendedAmount] , 20 sod.[UnitPriceDiscount] AS [ UnitPriceDiscountPct] , 21 sod.[OrderQty] * sod.[UnitPrice] * sod.[ UnitPriceDiscount] AS [DiscountAmount] , 22 pch.[StandardCost] AS [ProductStandardCost] , 23 sod.[OrderQty] * pch.[StandardCost] AS [ TotalProductCost] , 24 sod.[LineTotal] AS [SalesAmount] , 25 CONVERT(MONEY, sod.[LineTotal] * 0.08) AS [ TaxAmt] , 26 CONVERT(MONEY, sod.[LineTotal] * 0.025) AS [ Freight] , 110 Appendix A. Source Code

27 sod.[CarrierTrackingNumber] AS [ CarrierTrackingNumber] , 28 soh.[PurchaseOrderNumber] AS [ CustomerPONumber], 29 ( 30 SELECT Max(v) 31 FROM (VALUES 32 ( 33 sod.sysstarttime 34 ) 35 , (soh.sysstarttime), (dp. sysstarttime), (pch. sysstarttime), (dc. sysstarttime)) AS value(v)) AS [SysStartTime], 36 ( 37 SELECT Min(v) 38 FROM (VALUES 39 ( 40 sod.sysendtime 41 ) 42 , (soh.sysendtime), (dp. sysendtime), (pch.sysendtime) , (dc.sysendtime)) AS value(v )) AS [SysEndTime] 43 FROM [Sales].[SalesOrderHeader] FOR system_time ALL soh 44 INNER JOIN [Sales].[SalesOrderDetail] FOR system_time ALL sod 45 ON soh.[SalesOrderID] = sod.[SalesOrderID] 46 INNER JOIN [Production].[Product] FOR system_time ALL p 47 ON sod.[ProductID] = p.[ProductID] 48 INNER JOIN [dbo].[DimProduct] dp 49 ON dp.[ProductAlternateKey] = p.[ProductNumber] COLLATE sql_latin1_general_cp1_ci_as 50 AND [dbo].[udfMinimumDate](soh.[OrderDate], soh.[DueDate]) BETWEEN dp.[StartDate] AND COALESCE(dp.[EndDate], ’ 12-31-9999’)-- Make sure we get all the Sales Orders! 51 INNER JOIN [Sales].[Customer] FOR system_time ALL c 52 ON soh.[CustomerID] = c.[CustomerID] 53 LEFT OUTER JOIN [Production].[ProductCostHistory] FOR system_time ALL pch 54 ON p.[ProductID] = pch.[ProductID] 55 AND [dbo].[udfMinimumDate](soh.[OrderDate], soh.[DueDate]) BETWEEN pch.[StartDate] AND COALESCE(pch.[EndDate], ’ 12-31-9999’)-- Make sure we get all the Sales Orders! 56 LEFT OUTER JOIN [Sales].[CurrencyRate] FOR system_time ALL cr 57 ON soh.[CurrencyRateID] = cr.[CurrencyRateID] 58 LEFT OUTER JOIN [dbo].[DimCurrency] dc 59 ON cr.[ToCurrencyCode] = dc.[CurrencyCode] COLLATE sql_latin1_general_cp1_ci_as 60 LEFT OUTER JOIN [HumanResources].[Employee] FOR system_time ALL e 61 ON soh.[SalesPersonID] = e.[BusinessEntityID] A.3. Views in temporal database 111

62 LEFT OUTER JOIN [dbo].[DimEmployee] de 63 ON e.[NationalIDNumber] = de.[EmployeeNationalIDAlternateKey ] COLLATE sql_latin1_general_cp1_ci_as 64 AND 65 ( 66 SELECT max(v) 67 FROM (VALUES 68 ( 69 sod.sysstarttime 70 ) 71 , (soh.sysstarttime), (dp. sysstarttime), (pch. sysstarttime), (dc. sysstarttime)) AS value(v)) < 72 ( 73 SELECT min(v) 74 FROM (VALUES 75 ( 76 sod.sysendtime 77 ) 78 , (soh.sysendtime), (dp. sysendtime), (pch.sysendtime) , (dc.sysendtime)) AS value(v )) 79 WHERE soh.onlineorderflag = 1 80 )go

FactResellerSales

1 CREATE VIEW [dbo].[FactResellerSales] AS 2 ( 3 SELECT dp.[ProductAlternateKey] AS [ProductKey], 4 soh.[OrderDate] AS [OrderDate], 5 soh.[DueDate] AS [DueDate], 6 soh.[ShipDate] AS [ShipDate], 7 soh.[CustomerID] AS [ResellerKey], 8 de.[BusinessEntityID] AS [EmployeeKey], 9 sod.[SpecialOfferID] AS [PromotionKey], 10 COALESCE(dc.[CurrencyCode], 11 ( 12 SELECT currencycode 13 FROM [dbo].[DimCurrency] 14 WHERE currencycode = N’USD’)) AS [ CurrencyCode], 15 soh.[TerritoryID] AS [SalesTerritoryKey], 16 soh.[SalesOrderNumber] AS [SalesOrderNumber], 17 soh.[RevisionNumber] AS [RevisionNumber], 18 sod.[OrderQty] AS [OrderQuantity], 19 sod.[UnitPrice] AS [UnitPrice], 20 sod.[OrderQty] * sod.[UnitPrice] AS [ ExtendedAmount], 21 sod.[UnitPriceDiscount] AS [ UnitPriceDiscountPct], 22 sod.[OrderQty] * sod.[UnitPrice] * sod.[ UnitPriceDiscount] AS [DiscountAmount], 112 Appendix A. Source Code

23 pch.[StandardCost] AS [ProductStandardCost], 24 sod.[OrderQty] * pch.[StandardCost] AS [ TotalProductCost], 25 sod.[LineTotal] AS [SalesAmount], 26 CONVERT(MONEY, sod.[LineTotal] * 0.08) AS [ TaxAmt], 27 CONVERT(MONEY, sod.[LineTotal] * 0.025) AS [ Freight], 28 sod.[CarrierTrackingNumber] AS [ CarrierTrackingNumber], 29 soh.[PurchaseOrderNumber] AS [ CustomerPONumber], 30 ( 31 SELECT Max(v) 32 FROM (VALUES 33 ( 34 soh.sysstarttime 35 ) 36 , (sod.sysstarttime), (dp. sysstarttime), (pch. sysstarttime), (dc. sysstarttime), (de. sysstarttime)) AS value(v)) AS [SysStartTime], 37 ( 38 SELECT Min(v) 39 FROM (VALUES 40 ( 41 soh.sysendtime 42 ) 43 , (sod.sysendtime), (dp. sysendtime), (pch.sysendtime) , (dc.sysendtime), (de. sysendtime)) AS value(v)) AS [SysEndTime] 44 FROM [Sales].[SalesOrderHeader] FOR system_time ALL soh 45 INNER JOIN [Sales].[SalesOrderDetail] FOR system_time ALL sod 46 ON soh.[SalesOrderID] = sod.[SalesOrderID] 47 INNER JOIN [Production].[Product] FOR system_time ALL p 48 ON sod.[ProductID] = p.[ProductID] 49 INNER JOIN [dbo].[DimProduct] dp 50 ON dp.[ProductAlternateKey] = p.[ProductNumber] 51 AND [dbo].[udfMinimumDate](soh.[OrderDate], soh.[DueDate]) BETWEEN dp.[StartDate] AND COALESCE(dp.[EndDate], ’ 12-31-9999’) 52 INNER JOIN [Sales].[Customer] FOR system_time ALL c 53 ON soh.[CustomerID] = c.[CustomerID] 54 LEFT OUTER JOIN [Production].[ProductCostHistory] FOR system_time ALL pch 55 ON p.[ProductID] = pch.[ProductID] 56 AND [dbo].[udfMinimumDate](soh.[OrderDate], soh.[DueDate]) BETWEEN pch.[StartDate] AND COALESCE(pch.[EndDate], ’ 12-31-9999’) A.3. Views in temporal database 113

57 LEFT OUTER JOIN [Sales].[CurrencyRate] FOR system_time ALL cr 58 ON soh.[CurrencyRateID] = cr.[CurrencyRateID] 59 LEFT OUTER JOIN [dbo].[DimCurrency] dc 60 ON cr.[ToCurrencyCode] = dc.[CurrencyCode] 61 LEFT OUTER JOIN [HumanResources].[Employee] FOR system_time ALL e 62 ON soh.[SalesPersonID] = e.businessentityid 63 LEFT OUTER JOIN [dbo].[DimEmployee] de 64 ON e.[BusinessEntityID] = de.businessentityid 65 AND 66 ( 67 SELECT max(v) 68 FROM (VALUES 69 ( 70 soh.sysstarttime 71 ) 72 , (sod.sysstarttime), (dp. sysstarttime), (pch. sysstarttime), (dc. sysstarttime), (de. sysstarttime)) AS value(v)) < 73 ( 74 SELECT min(v) 75 FROM (VALUES 76 ( 77 soh.sysendtime 78 ) 79 , (sod.sysendtime), (dp. sysendtime), (pch.sysendtime) , (dc.sysendtime), (de. sysendtime)) AS value(v)) 80 WHERE soh.onlineorderflag = 0 81 )go

FactSalesQuota

CREATE VIEW [dbo].[FactSalesQuota] AS
(
    SELECT DISTINCT spqh.businessentityid AS [EmployeeKey],
                    spqh.[QuotaDate]      AS [Quotadate],
                    spqh.[SalesQuota]     AS [SalesAmountQuota],
                    spqh.sysstarttime,
                    spqh.sysendtime
    FROM   [Sales].[SalesPersonQuotaHistory] FOR system_time ALL spqh
)
go
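The views above expose the combined system-time period of the joined tables as SysStartTime and SysEndTime, so a consumer can reconstruct the state of a dimension at any instant by filtering on these two columns. The following query is a minimal sketch of such a point-in-time lookup; the chosen timestamp is purely illustrative and not part of the prototype.

-- Illustrative point-in-time query against one of the views above.
DECLARE @AsOf datetime2 = '2019-06-01T00:00:00';

SELECT *
FROM   dbo.DimSalesTerritory
WHERE  SysStartTime <= @AsOf
  AND  SysEndTime   >  @AsOf;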

B Technical specifications

In the following, the technical specifications of the prototypes are given. Unless stated otherwise, the specifications apply to both configurations.

Virtual Machine: Standard D1 v2 with 1 vCPU and 3.5 GiB memory, running Windows Server 2016 Datacenter and SQL Server 2017. Region: West Europe.

SQL database (low configuration): Azure SQL Database on the 'Basic' tier with 5 DTU and 2 GB. Region: West Europe.

SQL database (high configuration): Azure SQL Database on the 'Standard' tier with 20 DTU and 2 GB. Region: West Europe.

Data Factory: Version 2. Region: West Europe.

Power BI: Desktop version 2.70.5494.761 (64-bit), running on a Windows 10 Home laptop (HP Pavilion x360 Convertible 14-ba1xx) with an internet connection of 200 Mbps on average.
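The database-side configuration can be verified from within SQL Server itself; the following queries are a sketch of such a check, assuming they are run against the prototype databases in Azure SQL (the catalog objects are standard, but these exact checks are not part of the assessment).

-- Sketch: report the current Azure SQL service objective (e.g. 'Basic' for the low configuration)
-- and list the system-versioned tables (relevant for prototype B only).
SELECT DATABASEPROPERTYEX(DB_NAME(), 'ServiceObjective') AS service_objective;

SELECT SCHEMA_NAME(schema_id) AS table_schema,
       name                   AS table_name,
       temporal_type_desc
FROM   sys.tables
WHERE  temporal_type_desc = 'SYSTEM_VERSIONED_TEMPORAL_TABLE';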


C Assessment

C.1. Performance assessment results

In the following, the detailed results of the performance assessment are given. All times are presented either in minutes and seconds (mm:ss) or in hours, minutes, seconds, and milliseconds (hh:mm:ss.ms).

Prototype A: Transfer between staging database and data warehouse

Table                   Low1   Low2   Low3   High1  High2  High3
DimAddress              0:36   0:38   0:36   0:31   0:25   0:28
DimCurrency             0:15   0:16   0:15   0:15   0:20   0:16
DimCustomer             1:15   1:18   1:11   0:37   0:32   0:22
DimDepartmentGroup      0:07   0:07   0:06   0:06   0:08   0:05
DimEmployee             0:22   0:20   0:16   0:21   0:20   0:15
DimGeography            0:22   0:20   0:21   0:24   0:19   0:15
DimProduct              0:08   0:07   0:06   0:07   0:09   0:05
DimProductCategory      0:07   0:06   0:05   0:07   0:08   0:05
DimProductSubCategory   0:17   0:16   0:15   0:16   0:22   0:15
DimReseller             0:19   0:18   0:19   0:17   0:17   0:16
DimSalesReason          0:17   0:16   0:15   0:17   0:19   0:14
DimSalesTerritory       0:16   0:20   0:16   0:16   0:20   0:15
FactInternetSales       3:32   3:35   3:33   1:23   1:18   1:11
FactResellerSales       3:24   3:25   3:20   1:13   1:12   1:11
FactSalesQuota          0:20   0:17   0:16   0:17   0:17   0:16
TempEmployee            0:17   0:18   0:15   0:15   0:16   0:15
TempGeography           0:06   0:05   0:06   0:07   0:07   0:05
TempCurrency            0:17   0:17   0:16   0:16   0:15   0:16
TempProduct             0:16   0:16   0:16   0:15   0:25   0:15
TempProductcategory     0:07   0:05   0:05   0:05   0:10   0:05
Total                   12:40  12:40  12:08  7:25   7:39   6:25


Prototype A: Data warehouse transformation

Table                   Low1   Low2   Low3   High1  High2  High3
DimAddress              2:22   2:20   2:03   0:26   0:47   0:36
DimCurrency             0:56   0:58   1:41   0:13   0:16   0:12
DimCustomer             2:36   2:26   2:25   0:41   0:55   0:42
DimDepartmentGroup      0:46   0:46   1:13   0:39   0:16   0:06
DimEmployee             1:10   1:17   1:33   0:35   0:34   0:22
DimGeography            1:27   1:29   1:21   0:33   0:18   0:22
DimProduct              1:38   1:33   1:11   0:40   0:34   0:21
DimProductCategory      1:29   1:36   1:31   0:39   0:17   0:07
DimProductSubCategory   2:26   2:20   2:18   0:34   0:37   0:25
DimReseller             1:27   1:29   1:22   0:40   0:20   0:22
DimSalesReason          0:52   0:57   1:29   0:36   0:28   0:07
DimSalesTerritory       0:48   0:50   0:59   0:32   0:29   0:05
FactInternetSales       6:15   6:05   6:16   1:01   1:40   1:20
FactResellerSales       6:45   6:35   6:19   1:23   1:49   1:32
FactSalesQuota          1:44   1:36   1:28   0:34   0:16   0:22
Total                   32:41  32:17  33:09  9:46   9:36   7:01

Prototype B: Loading the views

Table                   Low1          Low2          Low3
DimAddress              00:00:05.234  00:00:04.250  00:00:04.641
DimCurrency             00:00:00.250  00:00:00.109  00:00:00.140
DimCustomer             00:00:11.859  00:00:12.516  00:00:11.782
DimDepartmentGroup      00:00:00.078  00:00:00.063  00:00:00.156
DimEmployee             00:00:06.297  00:00:05.954  00:00:05.656
DimGeography            00:00:04.766  00:00:04.438  00:00:04.797
DimProduct              00:00:00.672  00:00:00.562  00:00:00.859
DimProductCategory      00:00:00.109  00:00:00.094  00:00:00.110
DimProductSubCategory   00:00:00.375  00:00:00.344  00:00:00.281
DimReseller             00:00:02.250  00:00:02.750  00:00:02.156
DimSalesReason          00:00:00.078  00:00:00.125  00:00:00.078
DimSalesTerritory       00:00:00.266  00:00:00.250  00:00:00.250
FactInternetSales       00:02:00.971  00:02:02.250  00:02:04.172
FactResellerSales       00:03:39.953  00:03:40.828  00:03:39.813
FactSalesQuota          00:00:00.172  00:00:00.094  00:00:00.109
Total                   00:06:13.330  00:06:14.627  00:06:15.000

Table                   High1         High2         High3
DimAddress              00:00:03.437  00:00:00.812  00:00:00.750
DimCurrency             00:00:00.563  00:00:00.235  00:00:00.344
DimCustomer             00:00:21.765  00:00:01.969  00:00:01.094
DimDepartmentGroup      00:00:00.157  00:00:00.265  00:00:00.156
DimEmployee             00:00:03.937  00:00:00.813  00:00:00.594
DimGeography            00:00:01.391  00:00:01.000  00:00:00.938
DimProduct              00:00:00.594  00:00:00.172  00:00:00.218
DimProductCategory      00:00:00.093  00:00:00.093  00:00:00.125
DimProductSubCategory   00:00:00.203  00:00:00.218  00:00:00.157
DimReseller             00:00:01.157  00:00:00.547  00:00:00.484
DimSalesReason          00:00:00.172  00:00:00.110  00:00:00.110
DimSalesTerritory       00:00:00.218  00:00:00.109  00:00:00.140
FactInternetSales       00:00:33.704  00:00:22.750  00:00:23.595
FactResellerSales       00:00:39.141  00:00:38.141  00:00:38.908
FactSalesQuota          00:00:00.187  00:00:00.141  00:00:00.109
Total                   00:01:46.719  00:01:07.375  00:01:07.722

Power BI data import

              Low1   Low2   Low3   High1  High2  High3
Prototype A   00:57  00:54  00:55  00:21  00:29  00:26
Prototype B   11:56  11:41  12:17  02:36  02:47  02:47

C.2. Assessment scripts

In the following, two scripts are given: one to measure the performance of prototype B and one to test the data integrity of both prototypes.

Performance testing script for prototype B

ALTER DATABASE SCOPED CONFIGURATION CLEAR PROCEDURE_CACHE;

PRINT SYSDATETIME();
GO

SELECT * FROM dbo.DimAddress;
GO
PRINT 'DimAddress' + convert(varchar, SysDateTime(), 21);
GO

SELECT * FROM dbo.DimCurrency;
GO
PRINT 'DimCurrency' + convert(varchar, SysDateTime(), 21);
GO

SELECT * FROM dbo.DimCustomer;
GO
PRINT 'DimCustomer' + convert(varchar, SysDateTime(), 21);
GO

SELECT * FROM dbo.DimDepartmentGroup;
GO
PRINT 'DimDepartmentGroup' + convert(varchar, SysDateTime(), 21);
GO

SELECT * FROM dbo.DimEmployee;
GO
PRINT 'DimEmployee' + convert(varchar, SysDateTime(), 21);
GO

SELECT * FROM dbo.DimGeography;
GO
PRINT 'DimGeography' + convert(varchar, SysDateTime(), 21);
GO

SELECT * FROM dbo.DimProduct;
GO
PRINT 'DimProduct' + convert(varchar, SysDateTime(), 21);
GO

SELECT * FROM dbo.DimProductCategory;
GO
PRINT 'DimProductCategory' + convert(varchar, SysDateTime(), 21);
GO

SELECT * FROM dbo.DimProductSubcategory;
GO
PRINT 'DimProductSubcategory' + convert(varchar, SysDateTime(), 21);
GO

SELECT * FROM dbo.DimReseller;
GO
PRINT 'DimReseller' + convert(varchar, SysDateTime(), 21);
GO

SELECT * FROM dbo.DimSalesReason;
GO
PRINT 'DimSalesReason' + convert(varchar, SysDateTime(), 21);
GO

SELECT * FROM dbo.DimSalesTerritory;
GO
PRINT 'DimSalesTerritory' + convert(varchar, SysDateTime(), 21);
GO

SELECT * FROM dbo.FactInternetSales;
GO
PRINT 'FactInternetSales' + convert(varchar, SysDateTime(), 21);
GO

SELECT * FROM dbo.FactResellerSales;
GO
PRINT 'FactResellerSales' + convert(varchar, SysDateTime(), 21);
GO

SELECT * FROM dbo.FactSalesQuota;
GO
PRINT 'FactSalesQuota' + convert(varchar, SysDateTime(), 21);
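The script above prints a timestamp after every SELECT, so the duration per view follows from the difference between consecutive timestamps. The same measurement could also be taken directly in T-SQL with DATEDIFF; the following fragment is a sketch for a single view and is not part of the assessment script.

-- Sketch: measure the load time of one view directly instead of
-- subtracting the printed timestamps afterwards.
DECLARE @t0 datetime2 = SYSDATETIME();

SELECT * FROM dbo.DimAddress;

PRINT 'DimAddress: '
      + CONVERT(varchar(12), DATEDIFF(MILLISECOND, @t0, SYSDATETIME()))
      + ' ms';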

Data integrity testing script

-- STEP 1: INSERT 10 rows in SalesOrderHeader/SalesOrderDetail

PRINT 'INSERT';
GO

INSERT INTO Sales.SalesOrderHeader (ModifiedDate, rowguid, OnlineOrderFlag, Status, RevisionNumber, OrderDate, DueDate, ShipDate, PurchaseOrderNumber, AccountNumber, CustomerID, SalesPersonID, TerritoryID, BillToAddressID, ShipToAddressID, ShipMethodID, CreditCardID, CreditCardApprovalCode, CurrencyRateID, SubTotal, TaxAmt, Freight)
VALUES (GETDATE(), NEWID(), 1, 5, 8, GETDATE(), GETDATE() + 5, GETDATE() + 3, 'PO522145787', '10-4020-000676', '29825', '279', '5', '985', '985', '5', '16281', '105041Vi84182', 4, '1002', '210', '10');
GO

INSERT INTO Sales.SalesOrderDetail (SalesOrderID, CarrierTrackingNumber, OrderQty, ProductID, SpecialOfferID, UnitPrice, UnitPriceDiscount, rowguid, ModifiedDate)
VALUES ((SELECT TOP(1) SalesOrderID FROM Sales.SalesOrderHeader WHERE subtotal = 1002 ORDER BY ModifiedDate DESC), '4911-403C-98', 2, 776, 1, '2024,994', '0,00', NewID(), GETDATE());
GO

INSERT INTO Sales.SalesOrderHeader (ModifiedDate, rowguid, OnlineOrderFlag, Status, RevisionNumber, OrderDate, DueDate, ShipDate, PurchaseOrderNumber, AccountNumber, CustomerID, SalesPersonID, TerritoryID, BillToAddressID, ShipToAddressID, ShipMethodID, CreditCardID, CreditCardApprovalCode, CurrencyRateID, SubTotal, TaxAmt, Freight)
VALUES (GETDATE(), NEWID(), 1, 5, 8, GETDATE(), GETDATE() + 5, GETDATE() + 3, 'PO522145787', '10-4020-000676', '29825', '279', '5', '985', '985', '5', '16281', '105041Vi84182', 4, '2002', '210', '10');
GO

INSERT INTO Sales.SalesOrderDetail (SalesOrderID, CarrierTrackingNumber, OrderQty, ProductID, SpecialOfferID, UnitPrice, UnitPriceDiscount, rowguid, ModifiedDate)
VALUES ((SELECT TOP(1) SalesOrderID FROM Sales.SalesOrderHeader WHERE subtotal = 2002 ORDER BY ModifiedDate DESC), '4911-403C-98', 2, 776, 1, '2024,994', '0,00', NewID(), GETDATE());
GO

INSERT INTO Sales.SalesOrderHeader (ModifiedDate, rowguid, OnlineOrderFlag, Status, RevisionNumber, OrderDate, DueDate, ShipDate, PurchaseOrderNumber, AccountNumber, CustomerID, SalesPersonID, TerritoryID, BillToAddressID, ShipToAddressID, ShipMethodID, CreditCardID, CreditCardApprovalCode, CurrencyRateID, SubTotal, TaxAmt, Freight)
VALUES (GETDATE(), NEWID(), 1, 5, 8, GETDATE(), GETDATE() + 5, GETDATE() + 3, 'PO522145787', '10-4020-000676', '29825', '279', '5', '985', '985', '5', '16281', '105041Vi84182', 4, '3002', '210', '10');
GO

INSERT INTO Sales.SalesOrderDetail (SalesOrderID, CarrierTrackingNumber, OrderQty, ProductID, SpecialOfferID, UnitPrice, UnitPriceDiscount, rowguid, ModifiedDate)
VALUES ((SELECT TOP(1) SalesOrderID FROM Sales.SalesOrderHeader WHERE subtotal = 3002 ORDER BY ModifiedDate DESC), '4911-403C-98', 2, 776, 1, '2024,994', '0,00', NewID(), GETDATE());
GO

INSERT INTO Sales.SalesOrderHeader (ModifiedDate, rowguid, OnlineOrderFlag, Status, RevisionNumber, OrderDate, DueDate, ShipDate, PurchaseOrderNumber, AccountNumber, CustomerID, SalesPersonID, TerritoryID, BillToAddressID, ShipToAddressID, ShipMethodID, CreditCardID, CreditCardApprovalCode, CurrencyRateID, SubTotal, TaxAmt, Freight)
VALUES (GETDATE(), NEWID(), 1, 5, 8, GETDATE(), GETDATE() + 5, GETDATE() + 3, 'PO522145787', '10-4020-000676', '29825', '279', '5', '985', '985', '5', '16281', '105041Vi84182', 4, '4002', '210', '10');
GO

INSERT INTO Sales.SalesOrderDetail (SalesOrderID, CarrierTrackingNumber, OrderQty, ProductID, SpecialOfferID, UnitPrice, UnitPriceDiscount, rowguid, ModifiedDate)
VALUES ((SELECT TOP(1) SalesOrderID FROM Sales.SalesOrderHeader WHERE subtotal = 4002 ORDER BY ModifiedDate DESC), '4911-403C-98', 2, 776, 1, '2024,994', '0,00', NewID(), GETDATE());
GO

INSERT INTO Sales.SalesOrderHeader (ModifiedDate, rowguid, OnlineOrderFlag, Status, RevisionNumber, OrderDate, DueDate, ShipDate, PurchaseOrderNumber, AccountNumber, CustomerID, SalesPersonID, TerritoryID, BillToAddressID, ShipToAddressID, ShipMethodID, CreditCardID, CreditCardApprovalCode, CurrencyRateID, SubTotal, TaxAmt, Freight)
VALUES (GETDATE(), NEWID(), 1, 5, 8, GETDATE(), GETDATE() + 5, GETDATE() + 3, 'PO522145787', '10-4020-000676', '29825', '279', '5', '985', '985', '5', '16281', '105041Vi84182', 4, '5002', '210', '10');
GO

INSERT INTO Sales.SalesOrderDetail (SalesOrderID, CarrierTrackingNumber, OrderQty, ProductID, SpecialOfferID, UnitPrice, UnitPriceDiscount, rowguid, ModifiedDate)
VALUES ((SELECT TOP(1) SalesOrderID FROM Sales.SalesOrderHeader WHERE subtotal = 5002 ORDER BY ModifiedDate DESC), '4911-403C-98', 2, 776, 1, '2024,994', '0,00', NewID(), GETDATE());
GO

-- STEP 2: UPDATE 10 rows in SalesOrderDetail

PRINT 'UPDATE';
GO

UPDATE Sales.SalesOrderDetail
SET    UnitPrice = CONVERT(MONEY, RAND() * 10000), ModifiedDate = GETDATE()
WHERE  SalesOrderID = 43659 AND SalesOrderDetailID = 1;
GO

UPDATE Sales.SalesOrderDetail
SET    UnitPrice = CONVERT(MONEY, RAND() * 10000), ModifiedDate = GETDATE()
WHERE  SalesOrderID = 43659 AND SalesOrderDetailID = 2;
GO

UPDATE Sales.SalesOrderDetail
SET    UnitPrice = CONVERT(MONEY, RAND() * 10000), ModifiedDate = GETDATE()
WHERE  SalesOrderID = 43659 AND SalesOrderDetailID = 3;
GO

UPDATE Sales.SalesOrderDetail
SET    UnitPrice = CONVERT(MONEY, RAND() * 10000), ModifiedDate = GETDATE()
WHERE  SalesOrderID = 43659 AND SalesOrderDetailID = 4;
GO

UPDATE Sales.SalesOrderDetail
SET    UnitPrice = CONVERT(MONEY, RAND() * 10000), ModifiedDate = GETDATE()
WHERE  SalesOrderID = 43659 AND SalesOrderDetailID = 5;
GO

UPDATE Sales.SalesOrderDetail
SET    UnitPrice = CONVERT(MONEY, RAND() * 10000), ModifiedDate = GETDATE()
WHERE  SalesOrderID = 43659 AND SalesOrderDetailID = 6;
GO

UPDATE Sales.SalesOrderDetail
SET    UnitPrice = CONVERT(MONEY, RAND() * 10000), ModifiedDate = GETDATE()
WHERE  SalesOrderID = 43659 AND SalesOrderDetailID = 7;
GO

UPDATE Sales.SalesOrderDetail
SET    UnitPrice = CONVERT(MONEY, RAND() * 10000), ModifiedDate = GETDATE()
WHERE  SalesOrderID = 43659 AND SalesOrderDetailID = 8;
GO

UPDATE Sales.SalesOrderDetail
SET    UnitPrice = CONVERT(MONEY, RAND() * 10000), ModifiedDate = GETDATE()
WHERE  SalesOrderID = 43659 AND SalesOrderDetailID = 9;
GO

UPDATE Sales.SalesOrderDetail
SET    UnitPrice = CONVERT(MONEY, RAND() * 10000), ModifiedDate = GETDATE()
WHERE  SalesOrderID = 43659 AND SalesOrderDetailID = 10;
GO

-- STEP 3: DELETE THE 10 ROWS CREATED BEFORE

PRINT 'DELETE';
GO

DELETE FROM Sales.SalesOrderHeader WHERE ModifiedDate > '2019-06-17';
GO
DELETE FROM Sales.SalesOrderDetail WHERE ModifiedDate > '2019-06-17' AND SalesOrderID != 43659 AND SalesOrderID != 43661;
GO