Data Vault 2.0 Automation Solutions for Commercial Use
Jenni Laukkanen
Faculty of Information Technology and Communication Sciences (ITC)
Master’s thesis
September 2020

Abstract

Jenni Laukkanen: Data Vault 2.0 Automation Solutions for Commercial Use
Master’s thesis
Tampere University
Master’s Degree Programme in Computational Big Data Analytics
September 2020

As the amount of data and the need for its processing and storage have increased, methods for its management and reporting have been intensely developed. However, these methods require a lot of skills, time, and manual work. Efforts have been made to fully automate data warehousing solutions in various areas, such as loading data at different stages of data warehousing. However, few solutions automate data warehouse construction, and learning how to use these data warehouse automation solutions requires a certain amount of expertise and time.

In this research, we discuss different solution options for automating data warehouse construction. From the point of view of organizations, the study identifies different options such as purchasing the solution, collaborating with other organizations to obtain it, or building it. In addition to market analysis, we also create and implement an automated tool for building a Data Vault 2.0 type data warehouse by leveraging metadata as well as the relationships of the source RDBMS to predict critical components of Data Vault 2.0 data warehousing, most of which are usually defined by experts.

Based on the metadata collected and processed, the classification algorithm correctly classified an average of 85.89% of all given observations and 55.11% of the business keys alone. The algorithm classified the observations that were not business keys more accurately than the business keys themselves.
However, the correctness of the classification has the most significant impact on what the automation tool that builds Data Vault 2.0 inserts into the target tables of the data model, rather than on what kinds of tables are built and which source tables they are built from.

The model generated by the tool corresponded well to the target model implemented at the beginning of the study. As for hubs and satellites, apart from a couple of missing hubs and the content of some hubs caused by shortcomings in the classification of business keys, the model could have been used as an enterprise data warehouse. Links differed more from the original target, but after testing, the link variations produced by the tool worked well either way.

There are still many shortcomings and areas for development in the tool created and implemented in this research, which, however, have been considered in the logic and structure of the tool. Also, the tool can be implemented with even a small amount of financial capital but requires a lot of experience and expertise on the subject.

Keywords: Data Vault 2.0, Automation solution, Commercial, Data warehouse, SQL server, Data modeling, Automated relationship transform, Metadata, Cloud solution

The originality of this thesis has been checked using the Turnitin Originality Check service.

Acknowledgements

Thanks to my employer Digia Plc and my supervisor there, Teemu Hämäläinen, who served as the father of the idea and provided the tools, resources, training, and constructive feedback to conduct the research. Thanks also to my thesis supervisor Martti Juhola, who has always been available when needed to support me during the research, and auditor Marko Junkkari, whose instructions taught me a lot. However, my greatest gratitude goes to my loved ones, who have managed to be enthusiastic and encouraging and to listen to my problem-solving self-talks, even though the topic is not part of their daily life at all.
If nothing else, we have at least got lots of confused expressions and plenty of laughter out of the research process.

Contents

Table of figures
List of tables
List of abbreviations used
1 Introduction
2 Background
2.1 Literature review
2.1.1 Conceptual model
2.1.2 Building a Data Vault 2.0 Type Datawarehouse
2.1.3 Automation Tools in literature
2.2 Market analysis
2.3 Summary and prediscussion
3 Research methods
3.1 The Data
3.1.1 AdventureWorks
3.1.2 The Training Data
3.2 The Basics of Data Vault 2.0
3.2.1 Hubs
3.2.2 Links
3.2.3 Satellites
3.2.4 Advanced Components of Data Vault 2.0
3.2.5 Conclusions of Structures and Best Loading Practices
3.3 Used Cloud Services and Resources
3.4 The Data Vault 2.0 Automation Tool
3.4.1 Gathering the metadata and feature Selection
3.4.2 Training and testing the classification of the business keys
3.4.3 Building the Data Vault 2.0
4 Results
5 Conclusions and future development
Bibliography
APPENDIX A. The productized tools in the market analysis comparison
APPENDIX B. The Sample Data sets used in the training of the tool

Table of figures

2.1 The Physical Data Vault automation tool phases [24].
2.2 The Physical Data Vault automation tool logical path [24].
2.3 The façade measures of the automation tool organizations.
3.1 AdventureWorks sample database provided by Microsoft on a general level. A more legible version of the figure is available on GitHub [36].
3.2 AdventureWorks sample database ordinary relationships represented by SQL server database diagram.
3.3 AdventureWorks sample database exceptional relationships represented by SQL server database diagram.
3.4 The Data Vault 2.0 Architecture [5].
3.5 The correlation matrix of all of the variables to be used in the classification of the business keys.
3.6 Iris data decision tree.
3.7 AdventureWorks database as the Data Vault 2.0 type data warehouse modeled by an expert of the field. Blue objects represent hubs, green objects links, and yellow objects satellites. A more legible version of the figure is available on GitHub [37].
3.8 The ordinary transition of one table into a hub containing the business key and a satellite holding the descriptive data.
3.9 The ordinary transition of two related tables into hubs containing the business keys, satellites holding the descriptive data, and a link containing the foreign keys.
3.10 The transition of one table related to itself into a hub containing the business key, a satellite holding the descriptive data, and a link containing the foreign keys.
3.11 The transition of one table that includes only foreign keys as its business keys into a satellite holding the descriptive data and a link containing the foreign keys.
4.1 The decision tree of the classifier of the automation tool created from the numerical training metadata. A more legible version of the figure is available on GitHub [39].
4.2 AdventureWorks database as the Data Vault 2.0 type datawarehouse modeled after the results of the Data Vault 2.0 Creation Automatization Tool. Blue objects represent hubs, green objects links, and yellow objects satellites. A more legible version of the figure is available on GitHub [38].

List of tables

2.1 Features of productized tools on the market advertised for Data Vault 2.0 automation. Table 1 out of 3.
2.2 Features of productized tools on the market advertised for Data Vault 2.0 automation. Table 2 out of 3.
2.3 Features of productized tools on the market advertised for Data Vault 2.0 automation. Table 3 out of 3.
2.4 Compatibilities of productized tools on the market advertised for Data Vault 2.0 automation. Table 1 out of 2.
2.5 Compatibilities of productized tools on the market advertised for Data Vault 2.0 automation. Table 2 out of 2.
4.1 The 10 iterations’ results of the classification algorithm part of the Data Vault 2.0 Creation Automation Tool.

List of abbreviations used

ETL / ELT - Extract, transform, load / Extract, load, transform
3NF - 3rd Normal Form
BI - Business Intelligence
CART - Classification and regression tree
CASE - Computer-aided software engineering
CRM - Customer Relationship Management
DB - Database
DV - Data Vault
DW/EDW - Data warehouse / Enterprise data warehouse
ER - entity-relationship
ERP - Enterprise Resource Planning
ID3 - Iterative Dichotomiser 3
IoT - Internet of Things
IP - Internet Protocol
IT - Information Technology
JSON - JavaScript Object Notation
MD - Message-digest
MPP - Massively Parallel Processing
NA - not applicable, not available or no answer
OLTP - online transaction processing
OS - Operating System
PDV - Physical Data Vault
PIT - Point in Time
RDBMS - Relational database management system
SQL - Structured Query Language
VM - Virtual machine
VNet - Virtual network
XML - Extensible Markup Language

1 Introduction

The amount of data is increasing by 2.5 quintillion bytes every day at the current rate, and it will only increase further as the number of Internet of Things devices grows [1]. Today, different business areas need to consider not only the Four V’s of big data (volume, variety, velocity, and veracity) but also variability, visualization, and value [2]. Data and its utilization in business have never been as crucial as they are today. Businesses and their stakeholders, partners, and customers are continually producing valuable data that could be left unutilized if the data cannot be effectively processed, stored, and analyzed. Such utilization can be very time- and resource-consuming. Therefore, the automation of data processing becomes crucial.
A data warehouse (DW, also known as an Enterprise Data Warehouse, EDW) is a hub for collecting and managing data from one or more different data sources, and it serves as the core of Business Intelligence.