Jenni Laukkanen

Data Vault 2.0 Automation Solutions for Commercial Use

Faculty of Information Technology and Communication Sciences (ITC)
Master’s thesis
September 2020

Abstract

Jenni Laukkanen: Data Vault 2.0 Automation Solutions for Commercial Use
Master’s thesis
Tampere University
Master’s Degree Programme in Computational Big Data Analytics
September 2020

As the amount of data and the need for its processing and storage have increased, methods for its management and reporting have been developed intensely. However, these methods require a lot of skill, time, and manual work. Efforts have been made to fully automate data warehousing solutions in various areas, such as loading data at different stages of data warehousing. However, few solutions automate the construction itself, and learning how to use these data warehouse automation solutions requires a certain amount of expertise and time.

In this research, we discuss different options for automating data warehouse construction. From the point of view of organizations, the study identifies options such as purchasing the solution, collaborating with other organizations to obtain it, or building it. In addition to a market analysis, we also create and implement an automated tool for building a Data Vault 2.0 type data warehouse by leveraging collected metadata as well as source RDBMS relationships to predict the critical components of Data Vault 2.0 data warehousing, most of which are usually defined by experts.

Based on the metadata collected and processed, the classification algorithm was able to classify correctly an average of 85.89% of all given observations, and 55.11% of the business keys alone. The algorithm classified observations that were not business keys more accurately than the business keys themselves. However, the correctness of the classification has the most significant impact on what the automation tool that builds the Data Vault 2.0 inserts into the target tables of the data model, rather than on what kinds of tables are created and which source tables they consist of. The model generated by the tool corresponded well to the target model implemented at the beginning of the study. As for hubs and satellites, disregarding a couple of missing hubs and the content of some hubs caused by shortcomings in the classification of business keys, the model could have been used as an enterprise data warehouse. Links differed more from the original target, but after testing, the link variations produced by the tool worked well either way. There are still many shortcomings and areas for development in the tool created and implemented in this research, which, however, have been considered in its logic and structure. The tool can also be implemented with only a small amount of financial capital, but it requires a lot of experience and expertise on the subject.

Keywords: Data Vault 2.0, Automation solution, Commercial, Data warehouse, SQL server, Data modeling, Automated relationship transform, Metadata, Cloud solution

The originality of this thesis has been checked using the Turnitin Originality Check service.

Acknowledgements

Thanks to my employer Digia Plc and my supervisor there, Teemu Hämäläinen, who served as the father of the idea and provided the tools, resources, training, and constructive feedback to conduct the research. Thanks also to my thesis supervisor Martti Juhola, who has always been available and supportive when needed, and to the examiner Marko Junkkari, whose instructions taught me a lot. However, my greatest gratitude goes to my loved ones, who have managed to be enthusiastic and encouraging and to listen to my problem-solving self-talk, even though the topic is not part of their daily life at all. If nothing else, the research process has at least given us lots of confused expressions and plenty of laughter.

Contents

Table of figures
List of tables
List of abbreviations used
1 Introduction
2 Background
  2.1 Literature review
    2.1.1 Conceptual model
    2.1.2 Building a Data Vault 2.0 Type Data Warehouse
    2.1.3 Automation Tools in literature
  2.2 Market analysis
  2.3 Summary and prediscussion
3 Research methods
  3.1 The Data
    3.1.1 AdventureWorks
    3.1.2 The Training Data
  3.2 The Basics of Data Vault 2.0
    3.2.1 Hubs
    3.2.2 Links
    3.2.3 Satellites
    3.2.4 Advanced Components of Data Vault 2.0
    3.2.5 Conclusions of Structures and Best Loading Practices
  3.3 Used Cloud Services and Resources
  3.4 The Data Vault 2.0 Automation Tool
    3.4.1 Gathering the metadata and feature selection
    3.4.2 Training and testing the classification of the business keys
    3.4.3 Building the Data Vault 2.0
4 Results
5 Conclusions and future development
Bibliography
APPENDIX A. The productized tools in the market analysis comparison
APPENDIX B. The Sample Data sets used in the training of the tool

Table of figures

2.1 The Physical Data Vault automation tool phases [24]
2.2 The Physical Data Vault automation tool logical path [24]
2.3 The façade measures of the automation tool organizations

3.1 AdventureWorks sample database provided by Microsoft on a general level. A more observable version of the figure is in GitHub [36]
3.2 AdventureWorks sample database ordinary relationships represented by an SQL Server database diagram
3.3 AdventureWorks sample database exceptional relationships represented by an SQL Server database diagram
3.4 The Data Vault 2.0 Architecture [5]
3.5 The correlation matrix of all of the variables to be used in the classification of the business keys
3.6 Iris data decision tree
3.7 AdventureWorks database as a Data Vault 2.0 type data warehouse modeled by an expert of the field. Blue objects represent hubs, green links, and yellow satellites. A more observable version of the figure is in GitHub [37]
3.8 The ordinary transition of one table into a hub containing the business key and the satellite holding the descriptive data
3.9 The ordinary transition of two related tables into hubs containing the business keys, the satellites holding the descriptive data, and the link containing the foreign keys
3.10 The transition of one table related to itself into a hub containing the business key, the satellite holding the descriptive data, and the link containing the foreign keys
3.11 The transition of one table that includes only foreign keys as its business keys into the satellite holding the descriptive data and the link containing the foreign keys

4.1 The decision tree of the classifier of the automation tool created from the numerical training metadata. A more observable version of the figure is in GitHub [39]
4.2 AdventureWorks database as a Data Vault 2.0 type data warehouse modeled after the results of the Data Vault 2.0 Creation Automation Tool. Blue objects represent hubs, green links, and yellow satellites. A more observable version of the figure is in GitHub [38]

List of tables

2.1 Features of productized tools on the market advertised for Data Vault 2.0 automation. Table 1 out of 3
2.2 Features of productized tools on the market advertised for Data Vault 2.0 automation. Table 2 out of 3
2.3 Features of productized tools on the market advertised for Data Vault 2.0 automation. Table 3 out of 3
2.4 Compatibilities of productized tools on the market advertised for Data Vault 2.0 automation. Table 1 out of 2
2.5 Compatibilities of productized tools on the market advertised for Data Vault 2.0 automation. Table 2 out of 2

4.1 The 10 iterations’ results of the classification algorithm part of the Data Vault 2.0 Creation Automation Tool

List of abbreviations used

ETL / ELT - Extract, transform, load / Extract, load, transform
3NF - 3rd Normal Form
BI - Business Intelligence
CART - Classification and regression tree
CASE - Computer-aided software engineering
CRM - Customer Relationship Management
DB - Database
DV - Data Vault
DW / EDW - Data warehouse / Enterprise data warehouse
ER - Entity-relationship
ERP - Enterprise Resource Planning
ID3 - Iterative Dichotomiser 3
IoT - Internet of Things
IP - Internet Protocol
IT - Information Technology
JSON - JavaScript Object Notation
MD - Message-digest
MPP - Massively Parallel Processing
NA - Not applicable, not available, or no answer
OLTP - Online transaction processing
OS - Operating System
PDV - Physical Data Vault
PIT - Point in Time
RDBMS - Relational database management system
SQL - Structured Query Language
VM - Virtual machine
VNet - Virtual network
XML - Extensible Markup Language

1 Introduction

The amount of data is increasing by 2.5 quintillion bytes every day at the current rate, and the amount of data will only increase further as the number of Internet of Things devices grows [1]. Today, different business areas need to consider not only the four V’s of big data (volume, variety, velocity, and veracity) but also variability, visualization, and value [2]. Data and its utilization in business have never been as crucial as they are today. Businesses and their stakeholders, partners, and customers are continually producing valuable data that could be left unutilized if the data cannot be effectively processed, stored, and analyzed. Such utilization can be a very time- and resource-consuming activity. Therefore, at this stage, the automation of data processing becomes crucial.

A data warehouse (DW, also known as an Enterprise Data Warehouse, EDW) is a hub for collecting and managing data from one or more data sources that serves as the core of Business Intelligence. The data warehouse is designed for queries and analysis instead of transaction processing. The data warehouse approach enables, among other things, integrating data from many different data sources into a single database and model, isolating the source system from the burden of large queries, presenting organizational information consistently, and facilitating decision support. Data warehouses store a copy of the current data from their data sources as well as maintain historical data, even if the source system does not maintain it. Different roles, such as data analysts and data engineers, use various analytics tools to utilize the data warehouse and the data it contains. Many businesses rely on data and analytics, which are powered by data warehouses.

There are many services and tools available on the market for automating the building of a data warehouse. However, not all services and tools are fully universal, meaning that they are not completely customizable, as each data warehousing project is different and has its own characteristics. The use of specific tools and services may also require the purchase of other services and tools, which are needed for the particular solution to work but may not be suitable for the project owner’s use. Even if the purchased service or tool turns out to be a workable and functional solution, it has its maintenance costs and operating expenses (for example, personnel, monthly fees, and licenses), which can accrue over time. Free open-source tools may be an alternative to a purchased service if the business already has the skills of its own to carry out the integration of the tool. However, open-source tools carry their own security and ethical risks, which one needs to be prepared for when implementing the tool. Besides, many companies refuse to enter partnerships with light terms and costs. As a piece of evidence, a general lack of partnership offerings emerged among the companies specializing in data warehousing automation while conducting the market analysis. Furthermore, creating and negotiating a working partnership can take a lot of the company’s resources and may not achieve the desired result.
Therefore, the purpose of this study is to explore the basics of Data Vault 2.0 data warehousing, discuss the best options for automated data warehouse construction, and create one possible solution by building a tool that can build a Data Vault 2.0 model data warehouse based on the characteristics of the source data and identify its essential business keys without human intervention.

Data Vault 2.0 is a database modeling method designed to provide a long-term solution for storing historical data from multiple source systems, enabling, among many other things, data auditing. Data Vault 2.0 is a hybrid approach that combines the best of 3rd Normal Form (3NF) and dimensional modeling [41]. Its advantages include storing historical data, integrating several data sources, and making it easier to validate the data. These data sources could be, for example, an organization’s operational system, customer relationship management system, or some other system from which the data are extracted with an ETL tool. Data Vault 2.0 is based on the hub, link, and satellite concepts. The hub is the key concept of the Data Vault 2.0 method, and it holds the business keys of the data. Links connect the hubs containing the business keys, creating a flexible structure for the model. Satellites hold all the descriptive data of the business keys and can be attached either to a hub or to a link. However, the main advantage of the Data Vault 2.0 architecture is that, due to its structure, Data Vault 2.0 is much easier to automate than other data warehouse approaches. The Data Vault 2.0 structure is more flexible and adaptable to changes and additions, as no rework is required when adding additional information to the data warehouse model. For this reason, Data Vault 2.0 has been selected as the target data model to be automated. Here, the target data model means a data model into which the data models of multiple source systems are brought together as one.

While turning existing data models automatically into the Data Vault 2.0 model, we also evaluate in the second section of this thesis the challenges that have arisen in automating the creation of the Data Vault in general and in other studies related to the area. After this, we take a look at the market analysis conducted in the spring of 2020, which provides information on the level and scope of existing Data Vault automation tools and potential collaboration opportunities. After the background section, the training and testing data are introduced. Using the presented training and testing data, the rudiments of Data Vault 2.0 are demonstrated and visualized. Once the data and basic methods are known, the used cloud service resources are reviewed. Azure is the provider of the cloud service resources used in the study. The tool is built using the data, methods, and resources mentioned above; the details and creation of the tool are presented in section 3.4. Once a satisfactory tool according to the research aim stated above has been reached, the performance, efficiency, and accuracy of the solution are compared in the results section with a concrete Data Vault 2.0 model made by an expert in the field.

2 Background

In the recent past, one of the most significant developments in data warehousing has been the development and generalization of data warehouse automation tools. The goal of data warehouse automation is to significantly speed up the construction and maintenance of the data warehouse and reduce the amount of manual work. With data warehouse automation, it is possible to do iterative development work efficiently. This is especially true if data warehousing is approached with the Data Vault 2.0 model. Data warehousing and its management are also moving mainly to the cloud, which is bringing a new dimension to data warehousing solutions. However, the idea of automation is that while every data warehouse project is different, there are abundant similarities in the design and loading of data warehouses that can be exploited across project boundaries. In the following section, we review the current state of automation tools in both the literature and the market.

2.1 Literature review

Building a Data Vault 2.0 is both an interesting and a tricky concept in the literature. There are several studies and articles on the subject, but when looking for a concrete example of Data Vault construction automation in the literature, sources are scarce. Most of the existing literature portrays how Data Vault 2.0 should be built and automated, but not how this is implemented in practice. For example, the loading processes and basics have been described in an accurate and comprehensible way, but the implementation and the selection or construction of tools have been left in the hands of the reader. Often, the content of the literature stops at the construction phase or at the automation of loads, but does not reach the automation of the construction of the Data Vault 2.0 itself (such as [5]; [7]; [13]; [14]; [15]; [16]; [18]; [22]; [25]; [27]; [28]; [29]; [30]; [31]). In the following sections, the literature on automating the construction of the Data Vault 2.0 data warehousing model, its challenges, best practices, and practical implementations of the tools is reviewed. We aim to take advantage of the solutions reviewed here and resolve the issues that have arisen in the past when creating the automation tool in section 3.4.

2.1.1 Conceptual model

As the first step in building a data warehouse of any model, it is essential first to create a conceptual model. A conceptual model presents an existing ensemble through different concepts that make it easier for people to understand what the ensemble represents. Evidence of the conceptual Data Vault model and its importance is provided by Jovanovic et al. [11]. They propose, among many other researchers [13]; [14]; [15], that the biggest weakness in modeling of the first generation of data warehouse approaches was the lack of a conceptual model. However, the suggestions for conceptual models mentioned in their study pay little or no attention to the staging area of data warehousing, which is one of the starting points for data warehousing.

Nonetheless, Jovanovic et al. note that several advances in information technology have led to a revision of second-generation data warehousing and the inclusion of conceptual modeling in data warehousing. The second-generation data warehouse architectures include, among others, Inmon’s top-down approach, where the data warehouse is built with normalized enterprise data modeling [16], which has also influenced other data modelers in their data warehouse models. The best known of these models are the Anchor model [17] developed by Lars Rönnbäck and the Data Vault 2.0 [5] developed by Dan Linstedt and used in this thesis. Data Vault 2.0 emphasizes the value of historical data in, for example, data auditing, data discovery, load speed, and flexibility when it comes to changes in the warehouse [5]; [7]. The Data Vault 2.0 model is a good alternative for exploiting the conceptual model due to, among other things, its logical model and its resolution of the challenges of first-generation data warehousing. Challenges identified in first-generation (data warehouse 1.0) data models include complex updates, coordination and integration of source data, increasing delays and incomplete retrievals due to incompatibilities in data, slow loading, lack of data integration, lack of data traceability, and convoluted schema rebuilds [11]. Data Vault 1.0 and 2.0, as data warehouse 2.0 data models, have successfully solved all the previous challenges in their models, so they are the most optimal data models to use [11]. The Data Vault 2.0 model already specifies the data warehouse and the information mart layer separately [5], but officially the transformation of the model from the staging area to the Data Vault 2.0 data warehouse itself has not been conceptually modeled [11]. This step is the most crucial part of the tool being worked on in this research, so the transformation must be modellable from a dimensional model to a Data Vault 2.0 model.

The tool created in the research does not automatically create a conceptual model, but the conceptual model is implemented for the data warehouse to be created so that the outcome of the tool can be compared to the desired result and utilized when building the logic of the tool. Here, the logic that Jovanovic et al. [11]; [12] developed for modeling various relations is utilized. Reasons to leverage their logic include the need to present and analyze data needs at an early stage, to develop an existing ontology independently of the available data sources, and to demonstrate transformations between the source database and the target database [11].

2.1.2 Building a Data Vault 2.0 Type Data Warehouse

According to Inmon: ”A Data Warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process” [18]. Indeed, data warehousing systems support decision making at the core of most enterprise database applications [19].

Watson states that data warehousing consists of three components [20]. According to Watson, data warehousing includes the enterprise data warehouse, data marts (or, in the case of the Data Vault 2.0 model, information marts), and applications that extract, convert, and upload data to the data warehouse or database. These applications can be various ETL tools or reporting applications such as Power BI. Although Watson lists three different areas as part of data warehousing, an essential part is also the area of data pre-storage before the EDW layer, i.e., the staging area.

Metadata, including definitions and rules for processing data, are also a critical part of the data warehouse [21]. Metadata are data about data. In other words, metadata are a description or summary of data and its features. While metadata play a significant role in data warehousing, they also play a significant role in identifying the business keys in the tool created in this research.

Building and developing a data warehouse according to a company’s needs requires knowledge of best practices and solutions. The challenges and characteristics of different types of data warehouse approaches must also be considered when building a data warehouse. Different data warehouse models can be identified based on their architecture. Inmon proposed the Corporate Information Factory as a data warehousing solution [18]. The Corporate Information Factory is a 3rd normal form (3NF) database, from which multidimensional data marts are produced [18]. Another data warehousing architecture is the so-called bus architecture developed by Kimball [22]. In Kimball’s bus architecture, the data warehouse consists entirely of data marts and the dimensions they require.

As discussed earlier in the conceptual modeling section, Inmon’s and Kimball’s data warehousing models are among the first-generation data warehousing models. Second-generation data warehouses introduced new advantages to data warehousing, such as the ability to support changes in data over time. In particular, the Data Vault 2.0 model developed by Linstedt has acknowledged the problems of scalability, flexibility, and efficiency compared to the first-generation models. The Data Vault 2.0 model is also recognizable by its architecture. Data Vault 2.0 is mainly modeled using hub, link, and satellite tables [5]. Unique business keys are stored in hubs, foreign key references between different tables are stored in links, and data describing the business keys are stored in satellites [5]. In addition to the tables mentioned above, the Data Vault 2.0 model may have other, less common types of tables, which are discussed later in the Data Vault 2.0 basics in section 3.2.4.

Data Vault 2.0 is an excellent solution for integrating, storing, and protecting data [23]. However, as Krneta et al. note, it is neither intended nor suitable for intensive queries or reports, and therefore the data warehouse architecture also includes an information mart data-sharing layer, which generally follows the star pattern [23]. This is because the Data Vault 2.0 data model is an insert-only data model, i.e., data items are exported as such, with errors and omissions, to the data warehouse.
Erroneous and incomplete data are only processed in the data distribution layer, taking the data into error marts, for example, which can be used, among other things, to improve data quality.

As already noted, in practice there are not yet many options for automating the construction of a data warehouse. Krneta et al. note that one of the practical problems in building organizations’ data warehouses is the lack of automation in the identification of various metrics, dimensions, and entities (hubs, links, and satellites in the case of Data Vault 2.0) [23]; [24]. Since the design of the data warehouse model is a vital part of building a data warehouse (without planning it is not possible to build an at least good data warehouse), there are shortcomings in the overall picture of automatic data warehouse construction. Krneta et al. also argue in favor of the previous statement when examining physical Data Vault design while building their CASE tool, which is discussed further in the next section [24]. Krneta et al. compare eight different approaches to data warehouse design from several researchers against the design and implementation tool they were developing. All the approaches supported structured data sources, and a few supported semi-structured sources. Only one approach followed the rules set for the data model. Most of the compared approaches supported the generation of the conceptual and physical data model, but none supported the construction of the Data Vault except the CASE tool they built. This supports the view that tools for building a Data Vault 2.0 type data warehouse are lacking.

Before building the data warehouse itself or implementing it from a conceptual model, it is a good idea to have an entity-relationship (ER) model for the data coming from the source. If the ER model does not exist but the data exists, it is good to build an ER model, since the ER model gives an overall picture of the concepts of the target area and the relationships between them. Once the schemas have been identified, distinguished, and created, the business keys, in other words the facts, must be identified. Next, the necessary dimensions, metrics, and aggregate functions are identified. In the last step before building the data model, a logical and physical data warehouse schema is produced. This order of implementation is also used in the work of Krneta et al. [23]; [24]. This study also utilizes the same data warehousing features that Krneta et al. used in automating their data mart designs. The automatic construction of Data Vault 2.0 is likewise approached based on metadata and especially table relationships.

There are also specific requirements for building a Data Vault data warehouse. These requirements are presented, among others, by Linstedt et al. and Krneta et al. [5]; [24]. Krneta et al. find data warehouse design a difficult task that involves several challenges for the architect. The first challenge is the choice of the target data warehouse data model discussed earlier. In this work, Data Vault has been chosen as the data model, as Jovanovic et al. have already defined it as the most optimal data model to use [11]. The next challenge concerns the difference in data modeling between the source system and the data warehouse.
Therefore, when creating a solution, the modeler should consider whether the modeling follows a demand-driven data model determined by the source systems and needs, or whether a data-driven data model is created in the target data warehouse based on the data. The hybrid model, on the other hand, acknowledges the properties of both previous models as independent of each other (a sequential approach). Data modeling should also consider existing and possible new separate source information systems. According to Linstedt, Data Vault 2.0 is the most sensible choice when it comes to distributed data sources [25]. In addition, a significant feature for automating Data Vault 2.0 data warehouse construction is the ability to build a Data Vault 2.0 data model directly based on source Relational Database Management System (RDBMS) schemas, due to the characteristics of the Data Vault 2.0 entities (hub business keys, link foreign keys, and satellite descriptive data storage).

The first version of the data warehouse is considered to be the first version of the data model built from scratch. The first version does not necessarily meet all the requirements and needs, but subsequent versions of the data warehouse are no longer created from scratch. Either new source systems are added to the existing data warehouse or changes from existing source systems are added. After automating the construction of a Data Vault 2.0 type data warehouse, the next challenge concerns the evolution of existing schemas and how these changes are handled in the construction tool, either when building a new warehouse or modifying an old one.

Managing and monitoring change in data warehousing is also discussed by Subotic et al., who see these changes and keeping the data warehouse up to date as a core challenge. Schema evolution assumes that there is one version of a schema at a time.

When changes occur in source systems, for example, when new attributes are added to a table/entity, the data item is stored in the current version but transferred to the new version to which those changes have been made. However, Subotic et al. did not yet reach a concrete solution in terms of schema versioning and change management, but they mapped out the state of schema evolution as well as possible future development targets and ideas. [26]

2.1.3 Automation Tools in literature

In the literature, concrete solutions for data warehouse construction, especially for the Data Vault 2.0 model, have not been presented much. This is partly because building a data warehouse is not a completely simple process: it is not enough only to identify the data warehouse schema from the sources; the logic of the conceptual business model must also be taken into account. Krneta et al. noticed a lack of automation tools in the industry literature when building their prototype [24]. Influenced by the findings of other researchers, they found that several of the sources they studied did not develop concrete automation of the modeling process ([25]; [27]; [28]; [29]; [30]; [31]). However, these sources consist of fundamental information concerning Data Vault 2.0, while the automation of the modeling process is already a more advanced subject.

Figure 2.1 The Physical Data Vault automation tool phases [24].

Krneta et al. present a direct algorithm for the incremental design of a physical Data Vault data warehouse [24]. The algorithm utilizes the meta-model and rules of the existing data schemas. The algorithm also acknowledges some unstructured and semi-structured sources. Based on the algorithm, Krneta et al. built a prototype for a Data Vault modeling CASE tool. The prototype of Krneta et al. supports vital data needs, source data retention, and the scalability of the Data Vault model.

However, Krneta et al. argue that data warehouse modeling cannot be fully automated, and that some of the measures required to build a data warehouse must be kept manual. Krneta et al. consider these measures to be the identification of business keys and the creation of metrics [24]. Indeed, the metrics used to support business decision-making are most often customized and implemented for the organization’s needs in the data warehouse information layer, which operates as a separate layer from the raw data warehouse. Nonetheless, keeping the identification of business keys manual can be argued against. Although business keys are unique to each organization, the concepts in the different areas of the business are very similar for every organization. Utilizing enough metadata and features of the different concepts, it can be assumed that business key identification is possible with classification algorithms trained to identify business keys.
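To make the previous point concrete, the following is a minimal sketch of such a classifier: a CART-style decision tree trained on column-level metadata. The feature names, the toy training rows, and the train/test split are illustrative assumptions only; the actual metadata features and training data used by the tool are described in sections 3.1.2 and 3.4.

```python
# Minimal sketch: a decision tree that flags business key columns based on
# column-level metadata. Feature names and toy data are illustrative only.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# One row of hypothetical metadata per source column.
metadata = pd.DataFrame({
    "is_primary_key":    [1, 0, 0, 1, 0, 1, 0, 0, 1, 0],
    "is_foreign_key":    [0, 1, 0, 0, 0, 0, 1, 0, 0, 1],
    "uniqueness_ratio":  [1.0, 0.3, 0.9, 1.0, 0.1, 1.0, 0.4, 0.2, 1.0, 0.5],
    "name_ends_with_id": [1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
    "is_business_key":   [1, 0, 0, 1, 0, 1, 0, 0, 1, 0],  # label given by an expert
})

X = metadata.drop(columns="is_business_key")
y = metadata["is_business_key"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

clf = DecisionTreeClassifier(max_depth=3, random_state=42)  # CART-style tree
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

With realistic volumes of metadata rows gathered from several source schemas, the same pattern scales directly; only the feature engineering and labeling effort change.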

Figure 2.2 The Physical Data Vault automation tool logical path [24].

The Physical Data Vault (PDV) automation tool created by Krneta et al. follows the phases of figure 2.1, and the logic of the tool in its simplicity follows the steps of figure 2.2. The tool was developed using Larman’s practices [32]. Larman’s practices include that the attributes of the modeling needs can be presented verbally, which means that some sort of user interface, e.g., an application, is created to support the design. Krneta et al. created a user interface where, among other things, the source is selected and the business keys are given as parameters. Because of the conciseness of this study, the user interface for the tool to be created is not implemented in the empirical part, but the implementation of the tool supports the creation of a possible user interface.

In the tool of Krneta et al., the structured data are loaded directly into the data warehouse, the semi-structured data go through the staging area for additional structuring before the load into the data warehouse, and the unstructured data go through textual analysis into the staging database for further structuring and from there into the data warehouse [24]. In the tool of this study, all the data are imported into the staging area at some point before the data warehouse, because the tool collects, among other things, the metadata to be utilized and the relationships between the tables using RDBMS schemas from the tables coming into the staging area.

In their dissertation, Nataraj successfully provided a tool to transform heterogeneous data, such as XML and JSON, into a Data Vault 2.0 model [33]. In addition, Nataraj took advantage of the Data Vault 2.0 Big Data features and also integrated image, video, and IoT data into the data warehouse. However, the tool concerned the automatic transformation of unstructured and semi-structured data into a Data Vault 2.0 model and not the construction of the entire Data Vault 2.0 data model. Nataraj’s work utilized the same rules for building the Data Vault 2.0 repository as Krneta et al. [24], which can be exploited from several perspectives in approaching the challenges of this research as well.

2.2 Market analysis

As for the research question, in addition to building the tool oneself, purchasing and partnership options should also be considered. The current state of the latter two options is clarified by examining the market of Data Vault 2.0 automation tools through a market analysis. The market analysis in this thesis was carried out in the spring of 2020.

The market analysis includes organizations and their already productized tools, rather than comprehensive solutions from the perspective of different data warehousing solutions, as the study is intended to focus on automating the construction of a data warehouse that takes place within a database dedicated to it. The database is one component in the overall data warehousing solution, which includes many other components such as ETL orchestration, the data mart layer, and reporting. Of particular interest are the tool’s compatibility with various products and its features, either as part of a database or as a stand-alone unit. The tools included in the comparison are products on the market that were found high in the search results when searching for tools in the topic area. In addition, some of the tools have been included in the evaluation both on the basis of the search results and on the basis of their visibility, for example, in courses related to the topic [3] or in the work community.

The first observation in the analysis emerged even before the comparison of the features and compatibility of the products was made. Some of the tools in the topic area that were found high in the search results were not related to automating the building of the Data Vault 2.0 model data warehouse itself, or even to general-level workloads. Instead, their main function was as an ETL-, data lineage-, or data insight-focused component for either an existing data warehouse or a data warehouse being built. This already says a commendable amount about the market situation and the room there is for tools like the one being created and implemented in this study.

The next observation concerned the opportunities for development cooperation in the market. The tools and organizations studied showed at a very early stage that there is no opportunity to collaborate in the development of their tools, or at least it is not publicly advertised. The only tools that could be developed by users were those that were open source. However, almost all organizations in the market analysis offered a technological or consulting partnership, i.e., the tool could be used and implemented in customer projects, but the price of this cooperation was not disclosed by any organization, so the terms of such a partnership would have to be clarified. The possibility of cooperation must also consider supplier-related products: is it possible to use or develop the product on behalf of other third parties once it has been acquired? In other words, is the ownership of a product easy to pass on when necessary? In that case the other party would not be dependent on the creator of the product as both a supplier and a developer, but could use and develop the product themselves.

In terms of features and compatibility, the results of the analysis are shown in tables 2.1, 2.2, 2.3, 2.4 and 2.5. In tables 2.1, 2.2 and 2.3, the features reflect the characteristics of each organization’s tool as reported on its public website. The variations, number, and details of the features varied considerably depending on the organization. Self-explanatory features are not marked unless they have been mentioned by the organization. Other compatible products of the organization have not been acknowledged, and only the features of the Data Vault 2.0 building-related tool are considered. In addition, the tables include those tools that are not related to the automation of Data Vault 2.0 building in the way this study had intended but were found very high in the search results when retrieving automation tools, as mentioned earlier in this section. In tables 2.4 and 2.5, compatibility is also considered only at a general level and with the most commonly used tools. If, for example, only SQL Server without an associated cloud service is mentioned as compatible on the organization’s website, compatibility cannot be fully ensured with certain cloud resource providers, nor with all the other internal options of the candidate cloud services.

Table 2.1 Features of productized tools on the market advertised for Data Vault 2.0 automation. Table 1 out of 3.

Table 2.2 Features of productized tools on the market advertised for Data Vault 2.0 automation. Table 2 out of 3.

Table 2.3 Features of productized tools on the market advertised for Data Vault 2.0 automation. Table 3 out of 3.

Table 2.4 Compatibilities of productized tools on the market advertised for Data Vault 2.0 automation. Table 1 out of 2.

Table 2.5 Compatibilities of productized tools on the market advertised for Data Vault 2.0 automation. Table 2 out of 2.

Each compared tool has its strengths and characteristics. From the tools represented in the analysis, the tool(s) that meet the needs of both the intended use and the available technologies should be selected. However, every tool, including the tool in this study, needs a human user in its first steps, but the goal is to find or produce a tool that minimizes user training and the time spent on that tool. Depending on the project and the organization, the tool also needs to be chosen to suit one’s budget. The tools whose prices were publicly shared had an average monthly fee of €731.75 (standard deviation 241.01). In addition, some of the listed monthly prices applied to only one user. However, some organizations billed according to usage and the amount of data, making the price of the tool more flexible and better suited for smaller organizations. Nevertheless, most organizations did not include prices on their pages but instead wanted to be contacted.

In the case of fixed annual or monthly licenses, the lifespan of the tool should be recognized. It must be assessed whether the tool only assists in setting up the data warehouse or whether the tool controls data warehousing around the clock and has the same life cycle as the data warehouse itself. Additional costs are also incurred from training, and many organizations offer costly courses for their tools. However, given the fixed quoted prices and possible additional costs (training, consultants), there is room in the market for a more affordable alternative, if it meets the needs of the user at a certain level, depending on the customer organization’s internal technical capital. If budget is not an issue, there are already options available in the market, depending on the features sought.

In addition to features, compatibility, and price, the reliability and functionality of the tool must be examined. When considering the market situation, the turnover of the organization offering the tool, the number of employees, the country of origin, and the existing references must also be acknowledged. There are no guarantees that the tool will work unless it is demonstrably successful. In addition, the organization’s turnover and number of employees say a lot about work productivity. The country of origin has been taken into account because some organizations may face either a language or a geographical location barrier in cooperation or acquisition. Figure 2.3 shows the factors measuring the façade of the organizations whose tools were studied above. For these factors, the choice is largely a matter of opinion, as different people prefer organizations of different sizes for different reasons, and for most, the location of the organization is not a stumbling block. However, the market supply can be perceived as slightly worrying in that more than a third of the automation tool providers have provided either no references at all to the tool’s operation or customer cases, or only very weak ones, for example by describing new features of the tool in their blog posts without giving any concrete use cases. Once the features, compatibility, and reliability of the tool have been ascertained, the continuation and type of cooperation can be agreed between the parties.

2.3 Summary and prediscussion

After going through the automation tools in the literature and on the market, a clear difference can be noticed. There is not much literature about the development or creation of automation tools, but there are lots of tools on the market. The literature seeks to broaden the understanding and perspective of its target group regarding the automation of data warehouse modeling and its possibilities. Thus, the literature provides skills and theory that can be applied to develop automation tools for different areas of data warehousing.

The market, on the other hand, puts the information provided by the literature into practice. With data warehousing and effectively utilizing data being very topical issues, it is understandable that nothing related to them is distributed freely, such as an accurate description of the construction of a working automation tool in a publicly distributed document. On the other hand, creating an automation tool based on ready-made instructions also requires knowledge of one’s own if there is no previous experience on the subject. However, there are also exceptions in the market, meaning that free open-source tools are available. Although these open-source tools performed well in comparison to other tools, they were still not as advanced as the licensed tools.

Figure 2.3 The façade measures of the automation tool organizations.

3 Research methods

In the following subsections, we go through each part of the empirical research and development of the Data Vault 2.0 building automation tool. First, we consider the data used as training and testing data; second, the Data Vault 2.0 model itself and its basics; third, the used cloud resources; and lastly, the Data Vault 2.0 Automation Tool and each development block it consists of.

3.1 The Data

In the following subsections, the dimensional database on which the research tool is to be tested is presented. In addition, a brief review of the training data, which have been used to train the tool’s ability to find the correct business keys, is given.

3.1.1 AdventureWorks

The simple but extensive AdventureWorks data set has been chosen to test the tool created in this study. AdventureWorks is an online transaction processing (OLTP) sample database provided by Microsoft [10]. The AdventureWorks database can be loaded into an existing database in several different ways. It can either be restored from a publicly distributed database snapshot backup, downloaded from GitHub, or obtained as a light version directly as part of an SQL database in Azure [10]. Microsoft is continuously updating the sample database by releasing new versions. In this study, we use the 2017 version.

The AdventureWorks sample database contains the database of the fictional international company Adventure Works Cycles. Figure 3.1 shows the different schemas of the database in different colors. Due to the size of the database, figure 3.1 is only a descriptive graph of the contents and relationships of the database. A better quality, zoomable image of the database can be found in GitHub [36]. Production-related tables are shown in yellow, purchasing in pink, sales in gray, person-related tables in blue, human resources in green, and dbo (database owner) tables in purple.

The relationships between the tables are represented by relational lines accompanied by a three-part identifier representing the table in which that attribute acts as a foreign key (1), the original table of the attribute (2), and the attribute itself (3). An example of this is the relationship between UnitMeasure (Production) and ProductVendor (Purchasing) represented in figure 3.2. Sometimes, however, the tables are joined by an attribute that does not have a uniform name in the source and destination tables.

Figure 3.1 AdventureWorks sample database provided by Microsoft on a general level. A more observable version of the figure is in GitHub [36].

An example of such a case is the relationship between the Vendor (Purchasing) and PurchaseOrderHeader (Purchasing) tables represented in figure 3.3, where the attribute is named according to what it is as a foreign key in the target table. Thus, instead of BusinessEntityID, this relation is labeled as VendorID. This database is currently modeled with a dimensional model, and adding new tables and data to it is tedious and affects previous tables. This model is to be transformed by automation from a dimensional database to a Data Vault 2.0 modeled database by the end of this research.
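Because the tool relies on these relationships, they have to be read programmatically rather than from diagrams. The following is a minimal sketch of how the foreign-key pairs shown in figures 3.2 and 3.3 could be listed from the SQL Server system catalog views; the connection string is a placeholder, and the query is an illustration, not the exact query of the tool described in section 3.4.

```python
# Sketch: reading foreign-key relationships (referencing table/column,
# referenced table/column) from the SQL Server system catalog views.
# The connection string is a placeholder; the query is illustrative only.
import pyodbc

CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;"   # hypothetical server
    "DATABASE=AdventureWorks2017;"
    "UID=user;PWD=password"
)

FK_QUERY = """
SELECT
    fk.name                                   AS fk_name,
    SCHEMA_NAME(tp.schema_id) + '.' + tp.name AS referencing_table,
    cp.name                                   AS referencing_column,
    SCHEMA_NAME(tr.schema_id) + '.' + tr.name AS referenced_table,
    cr.name                                   AS referenced_column
FROM sys.foreign_keys fk
JOIN sys.foreign_key_columns fkc
    ON fkc.constraint_object_id = fk.object_id
JOIN sys.tables  tp ON tp.object_id = fkc.parent_object_id
JOIN sys.columns cp ON cp.object_id = fkc.parent_object_id
                   AND cp.column_id = fkc.parent_column_id
JOIN sys.tables  tr ON tr.object_id = fkc.referenced_object_id
JOIN sys.columns cr ON cr.object_id = fkc.referenced_object_id
                   AND cr.column_id = fkc.referenced_column_id
ORDER BY referencing_table, fk.name;
"""

with pyodbc.connect(CONN_STR) as conn:
    cursor = conn.cursor()
    for row in cursor.execute(FK_QUERY):
        # e.g. Purchasing.PurchaseOrderHeader.VendorID -> Purchasing.Vendor.BusinessEntityID
        print(f"{row.referencing_table}.{row.referencing_column} "
              f"-> {row.referenced_table}.{row.referenced_column}")
```

A listing like this also exposes the naming exception of figure 3.3 directly: the referencing column (VendorID) and the referenced column (BusinessEntityID) do not share a name, which the automation logic has to tolerate.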

3.1.2 The Training Data

To identify the business keys, the business key identification portion of the automation tool must be trained to recognize the correct business key in the tables. The training databases also cover most of the features of Data Vault 2.0, such as multiple sources, including business keys from different sources that define the same business concept but might have different names.

Figure 3.2 AdventureWorks sample database ordinary relationships represented by an SQL Server database diagram.

Figure 3.3 AdventureWorks sample database exceptional relationships represented by an SQL Server database diagram.

Several publicly distributed sample databases have been selected for the training. The sources of the databases used in the training are specified separately in the appendix. Below is a brief introduction to each database used for the training.

New York City Taxi Sample

The NYC Taxi sample includes information on, among other things, the pick-up time, the drop-off time, the pick-up and drop-off locations, the length of the trip, and separate fares. This sample data set contains approximately 1.7 million rows of data. The data were collected and provided by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs.

School Database

This database contains one schema for a school database. The database is elementary and basic, but it contains several tables for the purpose at hand.

Dofactory Database

Dofactory’s database is an upgraded version of Microsoft’s Northwind sample database, which AdventureWorks replaced in 2005. The database contains a basic description of products ordered from suppliers for customers, including the Supplier, Product, OrderItem, Order, and Customer tables.

Bikestores

The Bikestores database contains a simple description of the stores and production of a retailer selling bikes. The Bikestores database contains two schemas, sales and production.

Wide World Importers sample database v1.0

Wide World Importers is a fictitious organization that imports and distributes wholesale novelty goods such as toys and chilli chocolates. The sample database consists of Wide World Importers’ daily operational data from selling products to retail customers across the United States. If needed, Wide World Importers has its own documented workflow from the beginning to the end of the supply chain.

Microsoft Contoso BI Demo Dataset for Retail Industry

Contoso is a fictitious retail organization demo dataset used to demonstrate data warehouse and business intelligence solutions and functionalities within the Microsoft Office product family. The dataset consists of sales and marketing, IT, finance, and many other scenarios for a retail organization, including transactions from online transaction processing, aggregations from online analytical processing, reference data, and dimensional data.

Northwind and pubs sample databases for Microsoft SQL Server

The Northwind and pubs sample databases include the sales data of the fictitious organization Northwind Traders, which specializes in food export and import. Similar to Contoso, the Northwind databases are also used by Microsoft to demonstrate the features of some of its products.

Lahman’s Baseball Database

Lahman’s Baseball Database consists of batting and pitching data from 1871 to 2019. In addition to batting and pitching, the database includes fielding, standings, team, managerial, post-season, and other statistics. Data have been gathered from the American League, National League, American Association, Union Association, Players League, and Federal League, as well as the National Association of 1871-1875.

SalesDB Sample Database

The SalesDB database offers a straightforward database structure with only four tables representing a small organization’s employees, customers, products, and sales. This database is a natural benchmark test for the classification tool, since if there were problems in identifying business keys in this database, there would have to be more significant problems with the classification tool or with the meta-attributes chosen to classify the data.

3.2 The Basics of Data Vault 2.0

There are two well-known models for data warehouse modeling. In Kimball’s modeling, the data warehouse solution is made entirely of star models. In Inmon’s data warehouse solution, the data warehouse side is formed in a normalized way, and the data marts formed from it are formed as star models. However, change management has been found to be problematic in both models, and both models require a large number of skilled personnel to maintain them [4]. In addition, the Inmon model can also be a Data Vault model; thus, Data Vault 2.0 is based on the Inmonian architecture [6]. However, Data Vault 2.0 modeling has addressed the weaknesses of the previous models, including its predecessor Data Vault 1.0. Data Vault 2.0, used in this thesis (established in 2013), works as an extension of version 1.0.

While Data Vault 2.0 is a technique to model a data warehouse, it is much more multidimensional. In addition to modeling, Data Vault 2.0 covers methodology as well as architecture. The methodology also includes an implementation that includes rules, best practices, and processes, among other things. This allows for flexible data integration from multiple sources, storing, historizing, and data distribution [5]. Data Vault was developed by Dan Linstedt as early as the 1990s but did not enter public distribution until 2000. The Data Vault 2.0 method is platform-independent and supports both relational and NoSQL environments. The method also allows hybrid solutions of the previous two as well as Big Data solutions. The Data Vault 2.0 method is based on generally accepted methods of process development, project management, and application development such as Scrum and Kanban. The cornerstones of the method are coherence, repeatability, and the utilization of business models [5]. Data Vault 2.0 enables agile development in iterations, and data modeling according to the architecture and methodology guarantees the model’s flexibility, scalability, and fault tolerance. The modeling approach has been made flexible for changes in source systems by separating structural data from descriptive data [7]. Of particular importance in the data model are the traceability of the data (auditing the data) and the possibility to add data sources so that the existing implementation is not disrupted [5]. In addition, the integration and harmonization of data from different systems are essential.

Allowing handling of and interaction with NoSQL and Big Data structures and following Scrum and other agile best practices were among the new characteristics added to Data Vault 2.0 compared to Data Vault 1.0. However, the main distinction between 2.0 and 1.0 is the replacement of sequence numbers with hash keys [34]. The sequence numbers worked as primary keys, just as hash keys do now. However, they have several limitations, such as dependencies slowing down data loading (specific tables needed to be loaded before others due to dependencies between tables), synchronization problems (the sequence generator could cause duplication without syncing), erroneous satellite-link connections (differences in sequence numbers for the same data), scalability issues, and the prevention of Massively Parallel Processing (MPP). To overcome all the sequence number limitations, hash keys were included in version 2.0. Hash keys are generated by hash functions, which ensure that the same hash key is created for the same data every time.
Besides, hash keys are created automatically and independently during loading, so there is no need to look up other entities, which overcomes the dependency and parallel-loading problems.

In the Data Vault 2.0 model, data storage and information generation are implemented separately. In our research, we focus on the area of data storage, not on the information mart layer, also known as the data mart layer. Separating data storage from the information mart layer enables the automation of the data storage workflow, as data are loaded and stored as raw data. Indeed, a data warehouse provides ”a single version of the facts” because it does not in itself define bad and good data. Information is added to the data to facilitate auditing and updating, such as the load time, checksum, hash key, and source system. However, the data in the data storage are not changed at any stage. An incorrect data item is also left as it is for traceability, and because the incorrect data item must in any case be corrected in the source system itself. Data editing operations are performed only at the information mart level.

The Data Vault 2.0 solution is based on business concepts that vary from organization to organization. Before starting a data warehouse project, it is a good idea to find out the fundamental concepts of the business and the relations between them. A business key is defined for each concept; the business key is discussed further in the hub subsection. The implementation of the data warehouse is based on business keys and data decentralization. The hash keys mentioned previously in this section, added to the data to facilitate updating, are calculated for the data to be stored based on the unique identifiers. When data are exported to hub, satellite, and link structures, the related data are combined with the same hash key value. Hubs, satellites, and links, as well as their features and loads, are discussed in more detail in the separate subsections of this section.

The Data Vault 2.0 method is an insert-only method. Thus, if, for example, sales order data have changed in the source system, the checksums on the old line of the sales order in the data warehouse and on the new line entering the data warehouse do not match. In that case, the new version of the data is saved alongside the previous version. Different versions of the data can be distinguished, for example, based on the load time. The insert-only method using checksums and hash keys also allows for faster updates and a smaller chance of errors. The use of hash keys, on the other hand, enables direct references, streamlines both loads and searches, enables parallel runs (including massively parallel processing, MPP), and allows relational databases and NoSQL data to be combined in hybrid solutions [8]; [9]. The latest versions of the data loaded into the information mart are usually retrieved for reporting, but sometimes historical data are also used for reporting. Potentially incorrect data can be compiled into error marts, based on which, among other things, the quality of the data can be improved.
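As an illustration of the hash key idea, the following is a minimal sketch of deriving an MD5-based hash key from one or more business key values. The normalization rules (trimming, upper-casing, and the ";" delimiter) are common conventions assumed for the sketch, not a specification of the thesis tool.

```python
# Sketch: deriving an MD5 hash key from one or more business key values.
# The normalization rules (trim, upper-case, ';' delimiter) are common
# conventions and illustrative only.
import hashlib

def hash_key(*business_key_values) -> str:
    """Return a 32-character MD5 hash key for the given business key parts."""
    normalized = ";".join(str(v).strip().upper() for v in business_key_values)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest().upper()

# The same input always yields the same key, so hubs, links, and satellites
# can be loaded independently and in parallel without sequence lookups.
print(hash_key("AW00000001"))             # single-column business key
print(hash_key("SO43659", "2011-05-31"))  # composite business key
```

Because the key is a pure function of the business key values, any loading process can compute it locally; this is what removes the dependency and synchronization problems of sequence numbers described above.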

3.2.1 Hubs

Data Vault seeks to address the challenge of changes in source systems by separating the business keys associated with business concepts, and the associations between them, from the attributes that describe them [5]. For each key business concept, a business key is defined that can be used to identify information. Business keys are thus unique but may consist of many attributes. A business key is identified by the fact that it is a column that is continuously used to locate and identify data. Business keys have a definite meaning in the source system, such as a customer number or sales order. Business keys must be true business keys that are not tied to any source system. Surrogate keys generated by source systems are not good business keys, as they may not be usable at all in the integration of different source systems and, in the worst case, may change due to changes or upgrades to the source system. Business keys are the most stable and central parts of Data Vault modeling and form the skeleton of the model. Therefore, the selection of business keys is the most critical step in building a stable and efficient model [5]. Unique and infrequently changing business keys related to a business concept are stored in the hub. In addition to the business keys, the hub also stores the surrogate key, load time, source, and metadata corresponding to each key regarding the origin of the business key [5]. However, in Data Vault 2.0 modeling, the hub's surrogate primary key has been replaced with a hash key computed from the hub's business columns. A modern choice is to generate the hash key, both in the hub and in other parts of the Data Vault 2.0 model, using the MD5 algorithm. The MD5 algorithm results in a 128-bit hash, which is typically represented as a 32-character key. The hub must not have duplicates of the business keys, except when two systems supply the same business key. Hubs are always modeled first and should always be associated with at least one satellite, and they do not historize data [5]. With hubs as the backbone of the model, hubs can be used to perform data profiling and basic analysis based on a business key. Based on the content of the hub, it can be concluded, among other things, how many keys related to the concept exist, for example, how many customers the organization has. In addition, it can be determined, for example, how many sources customer data consist of. Data quality problems can also be noticed in hubs. Hubs can have several nearly identical customers, but from different sources. This may be related to data duplication and master data management issues [35].
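The following Python sketch illustrates what a hub row carries and how the MD5-based hash key can be derived from the business key; it is not the thesis T-SQL code, and the names (HubCustomerRow, customer_number, "CRM") are assumptions chosen for readability.

import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone

def hub_hash_key(*business_key_parts: str) -> str:
    """MD5 over the (possibly multi-part) business key, trimmed, upper-cased, pipe-delimited."""
    payload = "|".join(part.strip().upper() for part in business_key_parts)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()  # 32-character hexadecimal key

@dataclass
class HubCustomerRow:
    hash_key: str          # primary key in Data Vault 2.0, replacing the surrogate sequence
    customer_number: str   # the business key itself
    load_date: datetime
    record_source: str

row = HubCustomerRow(
    hash_key=hub_hash_key("CUST-10023"),
    customer_number="CUST-10023",
    load_date=datetime.now(timezone.utc),
    record_source="CRM",
)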

3.2.2 Links

Link tables are the main reason for the flexibility and scalability of Data Vault modeling. They are modeled so that any changes and additions to the model are easy to make without affecting previous components. In Data Vault 2.0 modeling, link tables store relationships between business concepts, such as references to sales orders belonging to a customer. These relations can be considered the foreign keys of source system tables. In hubs, foreign keys cannot be represented because of the modeling rules, so they are presented in links. The purpose of the link is to capture and store the relationship of the data elements at the lowest possible granularity. The link between hubs has only unambiguous connections, and there can also be transactional links. The relation of links is always many to many (m:n). The relations are many to many because this creates flexibility in the model [5]. Usually, many-to-many relationships should be avoided, but if the business model is done right, the only thing that can change is the cardinality. Most likely, a city can only be located in one country, but even if this changed, the one-to-one (1:1) ratio could change to one to many. However, link tables can accommodate this change without anyone modifying the model or the load process. Links can be linked to other links to deal with the granularity of the data, because, for example, adding a new key to a database table would change the granularity of that table. The organization can have a connection between the customer and the address, in which case a reference to the link between the product and transport company hubs could be added. However, this is not recommended. The link, like the hub, is a simple structure. The link contains the primary key of the link, which is a hash key. In addition, the link includes the hash keys of its parent hubs, the load time, and the source. As in hubs, the link only saves a connection the first time it appears in Data Vault 2.0 (no historizing). [5]

3.2.3 Satellites

Hubs and links form the structure of the model but do not contain any descriptive data. Satellites contain descriptive information related to business concepts. [5] Satellites contain historical information, i.e., information on which changes are actively monitored. The satellite contains date attributes for the start and end dates of the validity of the data, which are either provided by the source system or generated by logical procedures and effectivity tables in the data warehouse [5]. However, often no end date is available, and creating logical procedures and additional tables is unnecessarily cumbersome, so changes are tracked using checksums that cover the entire row data, computed, for example, with the aforementioned MD5 algorithm, which is also often used to generate hash keys. Thus, if the data describing the key of a business concept changes, a new version of the row is stored in the satellite when the values do not match when comparing the checksums. If the checksums match, the new version is not stored in the data warehouse. This speeds up the loading process considerably because, instead of comparing the values of several columns, only one column value needs to be compared in the load. The satellites also include a link that connects them to their parent hub or link. This connection is a hash key based on the same information as in the hub or link. Thus, a satellite can join either a hub or a link, and multiple satellites can join the same hub. If more than one satellite is connected to the hub, this may be due to, for example, the frequency of satellite updates (such as bank current account transactions compared to savings account transactions). It is also good practice to separate satellite data related to the same issue but coming from different source systems into different satellites, but these may also be in the same satellite because of the column identifying the source system. [5]

3.2.4 Advanced Components of Data Vault 2.0

The Data Vault 2.0 model also includes tables other than hubs, satellites, and links. These tables include the Reference, Effectivity, Point in Time (PIT), and Bridge tables, among others. [5] Reference tables are shared among the other tables and are not linked in the model by relations to other tables. In general, reference tables are used by satellites. Reference tables include, for example, calendars and codebooks that exist independently without information from other tables, such as a country or language codebook. Reference tables also contain data that are frequently referenced, and they are used to curb unnecessary memory usage. The Effectivity table is a satellite that stores start and end dates for the validity of rows of links and hubs. Effectivity satellites can also store deletion dates and other metadata that describe parent information within or across a specified timeline. Bridge and PIT tables exist to improve performance. Bridge tables are placed between the Data Vault and the information mart and are essential for those instances where joins and aggregations against the raw data vault cause performance problems. So, without sacrificing the required granularity or performance, it is a good idea to take advantage of Bridge tables. Bridge tables store a cross-combination of hub and link keys that is controlled by a where clause. The number of rows is thus managed based on business needs. [5] If there is a need to view the data at different times, snapshot-type PIT data structures can be formed for reporting based on the load times, in which direct reference keys are compiled into the versions of the data valid at the desired times. PIT tables store hub and link hash keys, hub business keys, and primary key values for the surrounding satellites. Bridge and PIT tables are both snapshot-based timed loads on the business data vault side, including raw data and critical elements. The purpose of both tables is to eliminate outer joins and also provide full scalability for views over the Data Vault. Besides, both are intended to improve performance and data partitioning.

3.2.5 Conclusions of Structures and Best Loading Practices

Hubs and links describe the structure of a data model and are used to describe how data items are related to each other. Satellites, in turn, represent descriptive information. Data Vault 2.0 is an enterprise data warehouse design located between operational systems and information marts. Data Vault 2.0 integrates data from different sources, linking them while preserving the context of the source systems. Data Vault stores facts in their raw form, while data are interpreted in information marts. Hubs, links, and satellites have several attributes in common. These include hash keys, natural business keys, load date, and source. Hubs, links, and satellites can also have the end date of the validity of the row, but this is not mandatory and, most of the time, nowadays not even included. In data modeling, the connection between hubs is only allowed via links. No reference keys (also known as foreign keys) or descriptive information can be found in the hubs. The reference keys can be found in the links and the descriptive information in the satellites. There must be at least two reference keys in a link. Both hubs and links have their surrogate as the primary key. Recursive connections are also established with a link structure. Link-to-link connections are possible but not recommended. Event-type tables are usually represented as link tables. Satellites should always have one parent table, either a hub or a link. The satellite primary key is a combination of the parent table's key and the load date. Satellites cannot have satellites, but multiple satellites can be connected to one hub. A satellite associated with a single hub can be divided into multiple satellites depending on the type, frequency of change, or data source. In Data Vault 2.0 modeling, the hubs are modeled first, based on the business keys selected for the business concepts. Next, the relations between the hubs, and possibly between links, are modeled with link tables. Third, satellites are modeled for the hubs and links where data warehouse history takes place. Finally, possible independent tables, such as reference tables, are modeled. The data must not be changed when loaded into the data warehouse; the data are brought in with errors and in raw format [5]. Only technical changes and additions to facilitate auditing may be made. The data are only reviewed and processed at the data distribution layer. Errors can also be driven into error marts to improve data quality in the future. A simple insert-only load reduces costs and minimizes changes. Business rules are brought closer to the users. The ETL loading of Data Vault 2.0 is relatively straightforward and follows the Data Vault 2.0 modeling itself. First, all hubs are loaded, and new surrogate keys are created for possible new business keys. Because the hubs do not attach directly to each other, they can be loaded in parallel. Next, all links between the hubs are explored and loaded, and new surrogates are created for possible new relations. Since the links should not be directly related to each other, these can also be loaded in parallel. At the same time, satellites that attach directly to hubs, and their surrogates, can be created and updated. Once all the links and any new connections have been established, the satellites that attach to the links can be added and updated. Satellites can also be loaded in parallel.
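As a sketch of the load ordering just described (hubs first, then links and hub satellites, then link satellites), the following Python fragment shows one way such staged, parallel loading could be orchestrated. The loader function and table names are hypothetical; the thesis tool itself runs its loads as T-SQL procedures.

from concurrent.futures import ThreadPoolExecutor

def load_table(table_name: str) -> str:
    # Placeholder for the actual load, e.g. calling a stored procedure per target table.
    print(f"loading {table_name}")
    return table_name

stages = [
    ["hub_customer", "hub_sales_order", "hub_product"],            # 1. hubs
    ["link_customer_sales_order", "sat_customer", "sat_product"],  # 2. links and hub satellites
    ["sat_link_customer_sales_order"],                             # 3. link satellites
]

with ThreadPoolExecutor() as pool:
    for stage in stages:
        # list() forces every load of a stage to finish before the next stage starts,
        # while the loads inside one stage run in parallel.
        list(pool.map(load_table, stage))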

Difficulties with parallel loads can arise if, for example, the model has link-to-link relations. The main goal of the load is to be fast, parallel, restartable, scalable, and partitioned. First, the pre-storage (staging area) is loaded, then the enterprise data warehouse section, and then the information marts, i.e., data marts and error marts. The data are modified in only two sections of the data flow. The first modification concerns the load of the data warehouse, which may be accompanied by hard business rules such as changes in data types for dates and the derivation of age and gender from an identity number. Processing related to soft business rules is only carried out in the information distribution layer. These treatments include, for example, quality control and cleaning, derived data, and summations. The Data Vault 2.0 architecture as a whole is represented in figure 3.4 [5]. However, unlike Linstedt's architecture model, the range of hard rules has been extended to cover the entire staging area and not only the ETL part between sources and staging, as the types and formats of variables can be, and most often still are, dealt with in the staging area. Nevertheless, the data themselves are not modified in the staging area.

Figure 3.4 The Data Vault 2.0 Architecture [5].

Data Vault 2.0 is flexible and easily expandable. Traceability is a built-in feature, and its methodology supports agile development. The approach simplifies, standardizes, and speeds up loads and performance. However, it is not suitable for modeling together with users; instead, the warehouse is first modeled traditionally and user-specific needs are then served to end users through the information distribution layer. Also, due to the modeling architecture, more tables are created in the data warehouse than exist in the source systems themselves. Thus, Data Vault 2.0 is also not suitable as a basis for direct reporting; instead, users need derived reporting views. While Data Vault 2.0 is coherent, it requires a lot of expertise, and there is a higher threshold for building Data Vault 2.0 than for a star model-based data warehouse.

3.3 Used Cloud Services and Resources

The tool of this research can be implemented locally, but in today's world it is safer, due to hardware failures and other potential physical challenges, for organizational development to secure the tools and the data in the cloud, with access granted to the right people when needed but protected from possible attacks and threats. Of the Azure resources used in the study, the simplest approach would be to use SQL Server and SQL databases to create the needed training and testing databases. However, the current version of Azure SQL Server does not yet support the specific external scripts that are utilized in the tool created and used in the study. Instead of just setting up an SQL server and databases, this study utilizes eight different Azure resources. The primary resource in the tool is a virtual machine (VM) and an SQL virtual machine running inside it, which together enable the comprehensive utilization of the Azure SQL server and external scripts. A virtual machine is like another computer in the cloud, rented for a specific use. Usually, a virtual machine is called an image, but it behaves like a regular computer. A virtual machine has disks that are like the physical disks in an on-premises server but virtualized, and they handle the storage of the virtual machine. Typically, there are at least three disks for each virtual machine: one reserved for data, one for the operating system (OS), and one temporary disk. In addition to these, the Storage Account resource is used to share important files and collect virtual machine diagnostics. The set of resources also utilizes Azure's virtual network (VNet), which is a representation of the network in the cloud dedicated exclusively to the subscription in question. In addition, the public IP address of that network, through which Azure's resources communicate, among other things, is stored in its own resource. For all the above resources to be able to communicate with each other, with the Internet, and with on-premises resources, the solution includes a network interface resource installed by the Azure portal by default with the virtual machine. Who is entitled to use these resources is the responsibility of the network security group. The network security group contains zero or more rules that allow or prohibit inbound and outbound traffic on resources. After all of these have been installed, one is also responsible for handling the other applications needed for the tool to work, such as installing Python on the virtual machine and making sure that the SQL Launchpad service for Python code is running. The tool's code will warn the user with error messages if all of the above is not assembled correctly.

3.4 The Data Vault 2.0 Automation Tool

In the following subsections, we go into detail about the creation and implementation of the tool built during the research. The functionality and areas for improvement of the tool are discussed in the results and future development sections. In addition, the code used in the tool is in GitHub [40]. The tool consists of one preprocessing section and three larger logical sections consisting of queries and code. The first of the three major sections is the collection of metadata for the second section of the tool. However, the training and testing data attributes used in the second part for the classification are processed before that part, in the preprocessing. In order to select the best possible attributes for the classification, they are subjected to feature selection. After selecting the attributes, the training metadata are used to train the classification model, which is run as a classification procedure on the test data. The classification results are added as part of the final metadata table, from which the third part of the tool, which builds the Data Vault 2.0 model data warehouse, takes its information when building the data warehouse. The three larger entities described above can be run at the same time with the push of a button, as feature selection has been implemented separately from them, and its results are already part of the latest version of the code.

3.4.1 Gathering the metadata and feature selection

When we started collecting metadata, the issue was approached from two angles: what metadata already exist and what kind of metadata are worthwhile and useful to create to promote the classification of business keys. From the pre-existing metadata built into the database, it was decided to collect several attributes from the tables called

INFORMATION_SCHEMA.COLUMNS, INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE, INFORMATION_SCHEMA.TABLE_CONSTRAINTS and INFORMATION_SCHEMA

Of these tables, the attributes

TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, ORDINAL_POSITION, IS_NULLABLE,

DATA_TYPE, CHARACTER_MAXIMUM_LENGTH, CHARACTER_OCTET_LENGTH, NUMERIC_PRECISION, NUMERIC_PRECISION_RADIX, NUMERIC_SCALE, DATETIME_PRECISION, CHARACTER_SET_NAME and CONSTRAINT_TYPE were used in the research. In addition to the built-in metadata, the research found it necessary to create a few both fundamental human-reviewed and more specific attributes to recognize business keys. These were

FOREIGN_KEY, PRIMARY KEY, CONTAINS_ID, CONTAINS_NAME, CONTAINS_NUMBER, CONTAINS_NODE, CONTAINS_KEY and COLUMN_NAME_LENGTH.

In addition to these, both the test data and the training data metadata included the attribute BUSINESS_KEY. The attribute has been populated by a human expert based on the existing data and is used to identify whether the attribute on a specific row is a business key or not. This attribute is also utilized in examining the goodness of the tool's functionality. The Transact-SQL query language was used to collect the metadata. As one can see from the code in GitHub [40], the first step is to select the database where one wants to start implementing the tool, in this case the test database AdventureWorks2017. For both the training and the testing data, the first step is to read the customized metadata into the tables provided for them. When the custom metadata are available in the tool's target database, the desired information is also selected from the built-in database tables. Metadata are collected by looping through all the databases inside the server and retrieving the necessary information from them. Because both our test and training data are on the same server, when we collect training data we ignore the database that contains the test data as well as all the databases contained in the master branch. For the test data, we only consider the database that contains the test data, excluding tables that hold data concerning the tool itself. In between the metadata collection steps, we implemented the transformation of the categorical attributes into numerical form, because the classification algorithm chosen for the study does not accept categorical attributes. Because the classification is about the difference/distance between chosen points, we cannot change the metadata into a form such as 1 = dog, 2 = cat, 3 = horse, and 4 = human. This numbering would indicate that the horse is more human-like than a dog or a cat. Due to this, several binary attributes were generated for each value of each metadata attribute. Rows were assigned a value of 1 for each attribute if the row had the feature and 0 if it did not. Already at this stage of the implementation, the first pitfall was identified. The content of the attributes of the existing metadata already formed more than 1024 columns, which is the maximum number of columns in a single table without the wide table feature, while with the wide table feature there can be up to 30 000 columns. In addition, the intention was also to keep the number of columns in the training metadata table as constant as possible. Of these more than 1024 attributes, the metadata that could be identified as affecting the result minimally were immediately removed, including binary attributes based on the TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, and COLUMN_NAME attributes. Feature selection was performed for the remaining 53 attributes. This selection analysis was performed with the R program utilizing a correlation matrix and the Pearson correlation coefficient method. The Pearson correlation coefficient measures the linear correlation between two variables and obtains values between -1 and 1. A value of 1 means a complete positive correlation, and a value of -1 means a complete negative correlation. A value of 0 means that there is no correlation. The Pearson correlation between the attributes X and Y

with the sample $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ can be calculated with the following formula:

$$ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}, $$

where $n$ is the sample size, $(x_i, y_i)$ are the individual sample values with index $i$, and

$$ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i $$

is the sample mean (and analogously for $\bar{y}$).
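Before the correlation analysis, the categorical metadata attributes were expanded into binary indicator columns as described earlier in this subsection. A minimal Python sketch of that expansion is shown below; the attribute names and values are illustrative, not the T-SQL used in the tool.

import pandas as pd

metadata = pd.DataFrame({
    "COLUMN_NAME_LENGTH": [10, 7, 12],
    "DATA_TYPE": ["int", "nvarchar", "datetime"],
    "IS_NULLABLE": ["NO", "YES", "NO"],
})

# Numeric attributes are kept as-is; each categorical value becomes its own 0/1 column,
# so no artificial ordering (1 = dog, 2 = cat, ...) is introduced.
encoded = pd.get_dummies(metadata, columns=["DATA_TYPE", "IS_NULLABLE"], dtype=int)
print(encoded.columns.tolist())
# e.g. ['COLUMN_NAME_LENGTH', 'DATA_TYPE_datetime', 'DATA_TYPE_int',
#       'DATA_TYPE_nvarchar', 'IS_NULLABLE_NO', 'IS_NULLABLE_YES']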

The following R code was used to construct and visualize the correlation matrix:

# Reading the wanted data in
training_data <- read.delim(file.choose())

# Implementing the correlation matrix with the Pearson method
res <- cor(training_data, method = "pearson", use = "complete.obs")
round(res, 2)

# Installing visualization packages
install.packages("devtools")
devtools::install_github("taiyun/corrplot", build_vignettes = TRUE)
install.packages("corrplot")

# Visualizing the correlation matrix results
library(corrplot)
corrplot(res, method = "pie", type = "upper")

In the first round of the correlation matrix analysis, those attributes which did not get any results with attributes other than themselves (those with NA values against every other variable) were removed from the candidate attributes. There were 7 of these, which can be seen as the attributes commented out of the SQL code in the GitHub source [40]. In the second round of the correlation matrix analysis, variables that correlated completely or otherwise highly with one or more other attributes were considered. All attribute pairs with a correlation greater than 0.5 were examined. These attributes were first checked to see whether any variable had a high correlation with several different attributes, in which case the variable was removed. Three such candidate attributes were removed. After the previous removals, if an attribute had a strong correlation with only one other attribute, then by random sampling one of these two attributes was always included in the classifying attributes and the other one was removed. However, only those attribute pairs with a correlation of at least 0.8 ended up in the random removal process. Due to the random sampling removals, five more attributes were dropped from the classification. These can also be seen as attributes commented out of the code in GitHub [40]. Based on the metadata collection and feature selection described above, the variables and their correlations shown in figure 3.5 were included in the classification.

Figure 3.5 The correlation matrix of all of the variables used in the classification of the business keys.
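The pruning rules just described (dropping attributes with no defined correlations, dropping attributes that correlate above 0.5 with several others, and randomly keeping one attribute from each pair with correlation of at least 0.8) could be sketched in Python as below. The thresholds follow the text; everything else, including the simplification of the "several attributes" rule to more than one, is illustrative and not the exact R/SQL workflow of the thesis.

import random
import pandas as pd

def prune_correlated(df: pd.DataFrame, review_at: float = 0.5, drop_at: float = 0.8) -> list[str]:
    corr = df.corr(method="pearson").abs()
    # 1. remove attributes that have no defined correlation with any other attribute
    keep = [c for c in corr.columns if corr[c].drop(labels=[c]).notna().any()]
    # 2. remove attributes correlating above the review threshold with more than one other attribute
    keep = [
        c for c in keep
        if (corr.loc[c, [k for k in keep if k != c]] > review_at).sum() <= 1
    ]
    # 3. from remaining pairs at or above the drop threshold, keep one attribute of the pair at random
    for a in list(keep):
        for b in list(keep):
            if a in keep and b in keep and a < b and corr.loc[a, b] >= drop_at:
                keep.remove(random.choice([a, b]))
    return keep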

3.4.2 Training and testing the classification of the business keys

Utilizing the results of the metadata collection and feature selection in the previous section, we trained a business key classification model with the training data, which allowed us to generate business keys for the test data based on the same attributes. As the classification method, we used decision trees. A decision tree is a method that resembles a flow-chart-like upside-down tree graph, including different decision branches (paths) ending in leaf nodes that represent the different possible outcomes (classes). Every inner node (non-leaf node) in a tree represents an attribute of the data. A decision tree is created by splitting the data set into subsets based on attribute values chosen by a selection criterion. This is done recursively for each resulting subset. The goodness of a split can be calculated with the selection criterion. One selection criterion is entropy:

$$ E(S) = -\sum_{i=1}^{c} p_i \log_2 p_i $$

Another possible criterion is the Gini impurity:

$$ G(S) = 1 - \sum_{i=1}^{c} p(c_i)^2, $$

where $p(c_i)$ is the relative frequency of class $c_i$ in $S$. In this research, the classifier uses the Gini impurity as the measure of the impurity of a node. The recursion of the decision tree is done when a subset ends in a node with a single outcome value. The construction is also done if splitting adds no more value to the decision tree. A decision tree is easy to construct since it does not need any domain knowledge or parameter setup, and it also handles high-dimensional data. Regression trees are presented and created in the same way as decision trees, but they predict continuous values instead of classes. However, if a tree grows unchecked, it might become too complicated, or one might end up with an overfitted model. One can either set a minimum number of inputs to use in each leaf, so that there will not be lots of leaves with one or two inputs, or set a maximum depth for the tree (the longest path in the tree), so that paths will not become too complicated and lengthy. Performance as a whole can also be increased by pruning. Pruning reduces the size of a decision tree by removing those parts of the tree that provide only little value to the classification of inputs. Pruning can start either from the top (root) or the bottom (leaves) of the tree, while the tree is represented upside down. However, in this research, neither the maximum depth of a path nor the number of inputs to use in each leaf was set. The decision tree was allowed to grow on its own without pruning. Next, we go through an example of the creation of a decision tree and its splits using Gini impurity as the selection criterion. Gini impurity splits favor larger partitions. The Gini impurity is used in the classic CART (Classification and Regression Tree) algorithm. First, to select the partition starting from the root node, which includes all the observations, one should find the Gini impurity, also known as the Gini index, of the classes and then calculate the weighted sum of the Gini indices of the classes for the given variables. The Gini index measures the randomness of the sample: if it is high, it is harder to draw any conclusions. The best possible outcome for the decision tree is to have the lowest possible Gini index. After calculating the Gini index for all the variables, the split chosen for the root node will be the one with the lowest Gini index. After choosing the first split, we then continue creating new splits from the observations in each path until a path ends with observations of only one class, making that node a leaf node. In figure 3.6, an example calculation of this splitting is presented based on the well-known Iris data, with the same default settings in the classifier as in the automation tool of this research.

Figure 3.6 Iris data decision tree.

As in figure 3.6, not all the attributes have to be taken into account while creating the tree, only those with the most significant advantage to the tree. In the Iris case, where x[3] represents petal width, x[2] petal length, and x[1] sepal width, these three variables are enough to classify the observations, even though more variables could be used.
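The Iris example in figure 3.6 can be reproduced with a few lines of scikit-learn using the same default settings as the automation tool's classifier (Gini impurity, no depth limit, no pruning); the exact splits may differ slightly between runs because ties are broken randomly.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier()   # defaults: criterion="gini", no max_depth, no pruning
clf.fit(iris.data, iris.target)

# Print the learned splits; petal length and width dominate, as in figure 3.6.
print(export_text(clf, feature_names=list(iris.feature_names)))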

There are many decision tree algorithms other than the classification and regression tree (CART), which is dynamic and creates either a regression tree or a classification tree depending on the attribute. Also well known is ID3, which is created iteratively by calculating the entropy of every attribute in the dataset, splitting the set into subsets by the attribute with the most significant gain, and making a decision tree node out of that attribute. Then there is C4.5. Decision tree type C4.5 is the successor of ID3. C4.5 uses either information gain or gain ratio to decide the attribute that becomes the following node. It is almost the same as ID3, but it can also handle the classification of both continuous and missing attribute values. There are also the Chi-squared Automatic Interaction Detector, better known as the CHAID decision tree, MARS, and conditional inference trees, but these are more advanced and are not discussed further in this research, nor are ID3 and C4.5. Decision trees have several advantages and disadvantages. The results of decision trees are easy to explain, understand, and present. They are easy to learn and use, which is why they are very user-friendly and widely used. There are also other useful advantages. Decision trees support feature selection and variable screening and require very little work to prepare and implement. They generate rules that anyone can understand and perform classification without high time consumption and computation. Like C4.5, decision trees can handle both continuous and categorical variables. In addition, decision trees indicate the importance of different attributes, and data exploration on behalf of data lineage is simple. It depends on the tree model, but outliers and missing values should not have a critical effect on some of the models, while some data variations could create an entirely different decision tree. This variance can, however, be lowered by bagging or boosting, which we do not discuss further here. However, as mentioned, decision trees also have disadvantages. Even though some of the decision tree models can classify continuous attributes, they are still less appropriate for tasks where the value of a continuous attribute is predicted. Also, if the number of classes grows and the training data are quite small, decision trees are inclined to errors, calculations might become intricate, and trees are prone to overfitting. However, overfitting can be handled by pruning. Decision trees can be costly to train. The growth of the decision tree can be expensive depending on the candidate splits, since the split field must be sorted before its most optimal split can be found. After all, information gain also prefers attributes with a larger number of values among categorical variables. As previously mentioned, we use the default settings of the DecisionTreeClassifier algorithm from the sklearn.tree library in the classification code, including the Gini index as the selection criterion and no pruning. As seen in the code [40], we first create a procedure that creates the trained model using the training metadata in numeric form, gathered, selected, and transformed previously. Then we create a table that stores all the trained models that are to be created, even future ones not created in this research. After creating the procedure, we declare the model itself and a name for it. We set a name for the model and execute the model-generating procedure described above.
If the model already exists in the model storage table, we replace it with the new model. If the model does not exist, we simply insert the new model into the table. Then we create a procedure that takes as a parameter the trained model that one wants to use. However, before executing the classifier on the test data, we create a table to store the predicted business keys. When the classifier is done, the predicted business keys are joined with the original categorical and numerical data to be used in the Data Vault 2.0 building automation.
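A minimal sketch of the train-store-predict flow described above is given below in plain Python, with pickle serialization standing in for the model table and the scoring procedure. The thesis implementation wraps equivalent Python in T-SQL stored procedures and external scripts, so the function and column names here are assumptions, and the test metadata is assumed to contain only the selected feature columns.

import pickle
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def train_model(training_metadata: pd.DataFrame) -> bytes:
    features = training_metadata.drop(columns=["BUSINESS_KEY"])
    target = training_metadata["BUSINESS_KEY"]
    model = DecisionTreeClassifier()       # defaults: Gini impurity, no pruning
    model.fit(features, target)
    return pickle.dumps(model)             # serialized form, analogous to a row in the model table

def predict_business_keys(serialized_model: bytes, test_metadata: pd.DataFrame) -> pd.DataFrame:
    model = pickle.loads(serialized_model)
    result = test_metadata.copy()
    # Predictions are later joined back to the categorical metadata for the building step.
    result["PREDICTED_BUSINESS_KEY"] = model.predict(test_metadata)
    return result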

3.4.3 Building the Data Vault 2.0

In the last and most visible step of the tool, the basic concepts of the Data Vault 2.0 method, hubs, links, and satellites, are built and loaded using the metadata table created in the previous steps and the business keys classified by the classification model. The goal of the tool is to classify the business keys correctly and use both this and other relational data to build a model as similar as possible to that shown in figure 3.7. Figure 3.7 is a model created by one expert in the field, and, depending on the modeler, the model may have different variations, all of which are, however, very close to each other. If the modeler strictly adheres to the rules of Data Vault 2.0 modeling, the model should be similar regardless of the modeler, although there is room for interpretation. In its simplicity, the construction algorithm seeks to follow the transitions to Data Vault 2.0 shown in figures 3.8, 3.9, 3.10, and 3.11. As shown in figure 3.8, a standalone table in the model, which does not contain any foreign keys, produces only a hub and an associated satellite. A simple relationship between two tables, as in figure 3.9, forms a hub and a satellite for each table, but in addition to these, the business keys and the associated foreign keys are collected into the link connecting the hubs. If the table has a relation to itself, i.e., the table contains a business key and a foreign key with the same values, for example in a hierarchical situation such as in figure 3.10, a hub, a satellite, and a link table containing the rows related to the internal relation of the table are created from the table. In a situation where the table does not itself contain a unique business key to be used as an identifier, but its identifier is formed as a combination of several foreign keys, as in figure 3.11, a link containing the foreign keys and a satellite containing the descriptive data are created. No hub is formed in this situation. The building algorithm itself is implemented using dynamic and non-dynamic Transact-SQL.

Figure 3.7 AdventureWorks database as a Data Vault 2.0 type data warehouse modeled by an expert in the field. Blue objects represent hubs, green links, and yellow satellites. A more observable version of the figure is in GitHub [37].

The building algorithm consists of three parts, in which the hubs, links, and satellites are formed separately, even though, if created as separate procedures, they could be processed in parallel. In addition, before the tool is officially run, the part of the tool containing the creation of the hash function must initially be run separately so that it can generate the needed hash keys and checksums later. As well as the hash function, one also has to create a target schema, for example from the database security settings. A schema called edw (enterprise data warehouse) has been created for the tool in this study.

Figure 3.8 The ordinary transition of one table into a hub containing the business key and a satellite holding the descriptive data.

The three concept-building parts of the algorithm share the same base logic. For the formation of each hub, link, and satellite, the metadata table of the test data is looped through at three different layers: the schema, table, and column layers. The purpose is to go through each schema, its tables, and their columns and direct their data to the correct locations in the Data Vault 2.0 model. For each part, a table is first created in which the algorithm builds four default columns for each table: a hash key formed based on the table's business keys, a checksum formed based on all of the table's columns except the default columns, the load time, and the source. The logic of both the hash key and the checksum is based on the MD5 algorithm, which results in a 128-bit hash that is typically represented in 32-character hexadecimal format, but the hash key and the checksum are based on different data. The hash key is meant to create links between different tables, while the checksum is meant to spot changed data and help with historization. Changed data are tracked in the Data Vault 2.0 method only in satellites, but checksums were also implemented on hubs and links in this study due to, among other things, testing and reproducibility. The created table is then modified by adding the columns related to the concept, after which the table is populated with the corresponding data from the test data staging area, from which all the information needed for the building has been retrieved.

However, each of the three parts of the algorithm has its own characteristics. The first part, which creates and loads the hubs of the data model, deals explicitly only with those metadata rows that have been identified as business keys by the preceding classification algorithm. Based on this information and the columns in the staging tables, hub tables are created and loaded. For each column, any special cases related to the column are treated in the column layer, while at the table level, special cases related to the table data are treated in the data insertion. Since hub tables should not be formed from tables that do not contain a business key but, for example, a combination of foreign keys as a unique identifier, the number of columns in each created hub table is compared to the number of default columns formed in all tables. If only default columns have been formed in the table during the loop, the table is removed from the final model.

Figure 3.9 The ordinary transition of two related tables into hubs containing the business keys, satellites holding the descriptive data, and a link containing the foreign keys.

In the second part of the final section of the complete tool, link tables are created and loaded. Creating and loading link tables follows much the same logic as creating and loading the hubs, but instead of considering only the rows of metadata identified as business keys, rows with foreign key status in their constraint type are also considered. The hash keys are still formed based on the business key(s), but now foreign keys are also considered in the checksum, table construction, and data loading. Special cases related to columns are again handled at the column level, and special cases related to data at the table level. Each table has at least some type of unique key identifier but may not have foreign keys, so, as far as links are concerned, the number of columns in the link table is compared to the number of columns in its hub. If the link and the hub have the same number of columns, or the link includes only the default columns, the table does not have any foreign keys, or even a business key, and the link is redundant.

Figure 3.10 The transition of one table related to itself into a hub containing the business key, a satellite holding the descriptive data, and a link containing the foreign keys.

Figure 3.11 The transition of one table that includes only foreign keys as its business keys into a satellite holding the descriptive data and a link containing the foreign keys.

Lastly, satellite tables are created and loaded following almost the same logic as for hubs and links, but all the information found in the metadata table regarding the source tables is taken into account when building the tables and loading the data. The checksum again covers the data of all the columns excluding the previously mentioned default columns, while the hash key takes into account the business key of the satellite or a combination of business keys formed by more than one column. After the construction and loading of the satellites, the algorithm of the tool has completed its task, and the first built and loaded version of the Data Vault 2.0 model has been formed in the schema created for the research.
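To illustrate the shared base logic of the three building blocks described in this subsection, the sketch below generates a CREATE TABLE statement with the four default columns (hash key, checksum, load date, source) plus concept-specific columns taken from the metadata. It is a simplified Python stand-in for the dynamic T-SQL in the tool, and the schema, table, and column names are assumptions.

DEFAULT_COLUMNS = [
    ("HASH_KEY", "CHAR(32)"),        # MD5 of the business key(s)
    ("CHECKSUM", "CHAR(32)"),        # MD5 of the non-default columns, used for change tracking
    ("LOAD_DATE", "DATETIME2"),
    ("RECORD_SOURCE", "NVARCHAR(100)"),
]

def create_table_sql(schema: str, table: str, concept_columns: list[tuple[str, str]]) -> str:
    # Every hub, link, and satellite starts from the same default columns,
    # then gets its concept-specific columns appended from the metadata.
    columns = DEFAULT_COLUMNS + concept_columns
    column_sql = ",\n    ".join(f"{name} {data_type}" for name, data_type in columns)
    return f"CREATE TABLE {schema}.{table} (\n    {column_sql}\n);"

print(create_table_sql("edw", "hub_customer", [("CustomerNumber", "NVARCHAR(50)")]))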

4 Results

Figure 4.1 shows one of the decision trees generated from the training data, which is used to classify the test data. A more observable image of the tree can be found in GitHub [39]. The classification algorithm itself had 543 observations to classify as either a business key or not. In the observations, multi-part business keys have been specified as separate observations, i.e., the classification algorithm can also get part of a multi-part business key correct and then receive credit for the correctly classified part. That is, the entire business key is not classified as wrong if the algorithm does not fully recognize all its parts. The classification algorithm was iterated ten times, and its results are shown in table 4.1. Over all observations, the mean of correct classifications was 85.89%, with a standard deviation of 0.63%. Considering only the business keys, there were 92 observations, with a mean of 55.11% correctly classified and a standard deviation of 2.24%. Based on this, it can be concluded that the results of the classification algorithm are very stable, but the algorithm classifies the observations that are not business keys better than the business keys themselves.

Table 4.1 The results of the 10 iterations of the classification algorithm part of the Data Vault 2.0 Creation Automation Tool.
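The two reported figures can be computed from a single iteration's predictions along the lines of the sketch below; the arrays y_true and y_pred are hypothetical placeholders for the actual classification output, not thesis data.

import numpy as np

def iteration_accuracies(y_true: np.ndarray, y_pred: np.ndarray) -> tuple[float, float]:
    """Overall accuracy over all observations, and accuracy over the business-key observations only."""
    overall = float((y_true == y_pred).mean())          # all 543 observations
    bk_mask = y_true == 1                               # the 92 business-key observations
    business_keys_only = float((y_pred[bk_mask] == 1).mean())
    return overall, business_keys_only

# The reported 85.89 % and 55.11 % figures are the means of these two values over the ten
# iterations, and 0.63 % and 2.24 % their standard deviations (np.mean / np.std over the runs).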

Figure 4.2 shows the AdventureWorks database modeled as a Data Vault 2.0 type of model. The model in the figure has been created from the results of the Data Vault 2.0 Creation Automation Tool. As one can notice, the model differs a significant amount from figure 3.7 with regard to links, but the satellites and hubs are quite the same as in the previously presented figure. Some of the links are either missing (see especially the links of those hubs whose names end in the word 'History') or split into separate links. Also, the model does not yet respect all the rules of the Data Vault 2.0 methodology, since, for example, the satellite called Currency has relations to two different links (satellites can only attach to one entity, either a parent link or a parent hub), due to the fact that it did not have a hub, since the classification algorithm did not classify any business key for its staging table. However, after a couple of test queries from the database combining several tables, the automatically created model works well for the purpose.

Figure 4.1 The decision tree of the classifier of the automation tool, created from the numerical training metadata. A more observable version of the figure is in GitHub [39].

Figure 4.2 AdventureWorks database as a Data Vault 2.0 type data warehouse modeled after the results of the Data Vault 2.0 Creation Automation Tool. Blue objects represent hubs, green links, and yellow satellites. A more observable version of the figure is in GitHub [38].

5 Conclusions and future development

The overall aim of the research was to examine the basics of Data Vault 2.0 data storage, discuss the best options for automated data warehouse construction in commercial use, and create a tool that constructs a Data Vault 2.0 type data warehouse by utilizing metadata, the existing relationships of a dimensional data model, and decision tree classification for finding the business keys. At the beginning of the research, we found that the literature on the topic focused firmly on theoretical aspects, and very few studies approached the topic as pragmatically as this research has sought to do. Many texts in the literature aim to create a framework and best practices on the topic, while the market analysis examined how these work in practice and what solutions already exist for modeling a Data Vault 2.0 type data warehouse. However, computing is a growing and financially productive field, many of whose benefits are still hidden and unharvested, so all new knowledge and algorithms on the subject are to be utilized as much as possible by organizations for their own use. There are software products on the market whose features are sure to satisfy the needs of most organizations. However, the development of these tools has required several years and thousands of hours of expert work, which is, of course, reflected in the prices of the licenses and the use of the tools. In addition, the tools involve many mandatory hidden costs, such as their internalization and the associated staff training. Because of this, we set out to implement a proof of concept to show that, in addition to purchasing and collaborative development, an organization can also build its own automation tool if it has staff with the skills required to implement it. In addition, we wanted to deviate from existing solutions, where many selections are implemented through the user interface, and we built the tool so that, based on the selected metadata, the tool itself classifies the key business keys of the Data Vault 2.0 model and builds a data warehouse based on these and the staging data. Many development areas were raised during the construction of the tool, some of which were already addressed in the first published version of this research, but as a topic and model, Data Vault 2.0, collecting and creating usable metadata, and classifying business keys contain many special cases that need to be addressed. When collecting training metadata, the algorithm loops through the databases inside one server, but in the future, a feature should be added where training data can be collected from several different servers, i.e., looped at the level of both the servers and the databases inside them. Besides, if companies using the tool allow their metadata to be utilized, it would be a favorable idea to add a feature to the algorithm that collects metadata from all authorized parties for use as part of the tool's classification algorithm, further improving classification quality. Concerning the collection and editing of metadata, the form of the metadata emerged as an area for development. Metadata often comes in several different forms, such as numbers or text. However, due to the classification algorithm used in the study, the data should be in numerical form and not in textual or categorical form. As a possible future solution to this, the sklearn.preprocessing algorithm OneHotEncoder, which encodes categorical features as a one-hot numeric array, was investigated but not used further.
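As a pointer for that future work, the OneHotEncoder alternative mentioned above could be wired in roughly as follows. This was not used in the thesis tool; the column names are illustrative, and the sketch assumes a recent scikit-learn (1.2 or newer, where the parameter is called sparse_output).

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

metadata = pd.DataFrame({"DATA_TYPE": ["int", "nvarchar", "datetime"],
                         "IS_NULLABLE": ["NO", "YES", "NO"]})

# handle_unknown="ignore" lets categories unseen in training encode to all-zero vectors
# for the test metadata instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
encoded = encoder.fit_transform(metadata)
print(encoder.get_feature_names_out())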
It is also preferable to have a constant number of columns in the metadata to be utilized, and to improve that metadata. There can be no more than 1024 columns in the current metadata tables because SQL Server does not support wider tables than this. The solution to this is to take advantage of the wide table feature (which supports tables of up to 30,000 columns), the use of which has not been explored in the current version, because for the sake of the classifier alone, the number of attributes in the classifier table should be kept to a minimum. In the current solution, the number of attributes is tentatively controlled by the feature selection mentioned in the methods section, which is, however, implemented separately from the tool. This preliminary feature selection should also be automated and incorporated into the tool, as the more training data accumulates for the tool, the more the most significant classification attributes may change. In general, retrieving and modifying training and testing data according to needs should be more automated. The current solution involves too much manual data addition and selection for custom metadata, and it should be further explored in more depth whether the same information can be more dynamically incorporated into usable metadata or how metadata quality can be improved for classification results, which are not yet good enough. In the current version, the training data and their data models used by the classifier are also straightforward and small, which clearly shows that the automated building of a database as large and complex as AdventureWorks and the classification of its business keys are far from satisfactory enough for the tool to be immediately utilized in potential customer projects. The current construction algorithm only considers the construction of the first version of the Data Vault 2.0 type data warehouse, i.e., the algorithm can currently be run only once. If the currently static data keep changing, which is more the rule than the exception today, the checksum that follows the changes in the data at row level can be utilized in further data loads to track the changes. The tool could also consider information about the same attribute in a different source system. However, there may often be differences in how the same concepts are named between source systems, making it more challenging to load those attributes into the same target tables in the data warehouse. Nonetheless, this problem can be addressed in the future with algorithm modifications and concept mapping tables, based on which the form of the same concept in different source systems is known. In addition to the previous developments, challenges also emerged for which there is certainly a solution, but which have not been explicitly focused on in this study and solution. Many examples on the topic have highlighted how one should distinguish between observations of different activity levels and separate them into their own satellites from the source table, for example regarding the organization's active and inactive bank accounts. The tool of this study does not consider the update activity of the data stream describing the business keys within the tables, or the development of the number of observations, because the testing has been performed with static data. In the special cases of each concept (hub, link, and satellite), two different issues have been addressed, both of which relate to the data used in testing.
For each concept, columns whose names are so-called illegal in SQL Server and the T-SQL used in the tool have had to be specially treated. According to good naming conventions, no spaces should be used in attribute names, which has caused problems with the data used in the current version of the tool. Also, code-reserved names such as group, schema, and primary have been used as attribute names. Besides, the combination of non-dynamic and dynamic T-SQL used in the tool poses challenges for a pair of advanced data types, namely hierarchyid and xml. In future development, the central task is the handling of all possible special cases universally. Although several challenges were identified in the implementation of the tool, the final result, after the collection and processing of metadata, the classification of business keys, and the construction of the data warehouse, satisfactorily met the objectives set for the tool. While the current version is not yet smart enough to strictly follow the rules of the Data Vault 2.0 data warehouse, the tool has potential. The most apparent conclusion is that, depending on a company's interest in Data Vault 2.0 modeling and the extent of its needs, there is an implementation alternative for everyone.

Bibliography

[1] Bernard Marr. How Much Data Do We Create Every Day? The Mind-Blowing Stats Everyone Should Read. https://www.forbes.com/sites/bernardmarr/2018/05/21/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read/#602591d560ba, 2018.

[2] Heikki Hyyrö. Slides: Introduction to big data analysis. TUNI, course TI- ETA17 - Introduction to Big Data Processing, 2019.

[3] Ari Hovi. Maailmanluokan koulutukset - Ari Hovi yhdistää parhaat asiantun- tijat parhaiden asiakkaiden kanssa. https://www.arihovi.com/kurssit/

[4] Sakthi Rangarajan. Data Warehouse Design – Inmon versus Kimball. The Data Administration Newsletter, https://tdan.com/data-warehouse-design-inmon- versus-kimball/20300, 2016.

[5] Dan Linstedt and Michael Olschimke. Building a Scalable Data Warehouse with Data Vault 2.0. Morgan Kaufmann, 2015.

[6] Ari Hovi. EDW suunnittelu Data Vault -menetelmällä. Kurssimateriaali omaan käyttöön, ei jaettavaksi. https://www.arihovi.com/kurssit/data-vault- johdanto-2/, 2020.

[7] Dan Linstedt. Super Charge Your Data Warehouse: Invaluable Data Modeling Rules to Implement Your Data Vault. Kent Graziano, Createspace Independent Pub, 2011.

[8] Scalefree blog. Hash Keys In The Data Vault. https://blog.scalefree.com/2017/04/28/hash-keys-in-the-data-vault/, 2017.

[9] Dan Linstedt. #datavault 2.0 hashes versus natural keys. https://danlinstedt.com/allposts/datavaultcat/datavault-2-0-hashes-versus-natural-keys/, 2014.

[10] Microsoft Documentation. AdventureWorks installation and configuration. https://docs.microsoft.com/en-us/sql/samples/adventureworks-install-configure?view=sql-server-ver15, 2018.

[11] Vladan Jovanovic and Ivan Bojicic. Conceptual Data Vault Model. 2020.

[12] Vladan Jovanovic, Ivan Bojicic, Curtis Knowles and Mile Pavlic. Persistent staging area models for data warehouses. Issues in Information Systems. Volume 13, Issue 1, pp. 121-132, 2012.

[13] Matthias Jarke. Fundamentals of Data Warehousing. 2ed, Springer, 2010.

[14] Matteo Golfarelli and Stefano Rizzi. Data Warehouse Design. McGraw Hill, 2009.

[15] Elzbieta Malinowski and Esteban Zimanyi. Advanced Data Warehouse Design. Springer, 2009.

[16] William Inmon, Derek Strauss and Genia Neushloss. DW 2.0-The Architecture for the next generation of data warehousing. Morgan Kaufman, 2008.

[17] Lars Rönnbäck, Olle Regardt, Maria Bergholtz, Paul Johannesson and Petia Wohed. Anchor modeling - Agile information modeling in evolving data environments. Data & Knowledge Engineering. 69 (12): 1229–1253, 2010.

[18] William Inmon. Building the Data Warehouse. Wiley Computer Publishing, New York, 1992.

[19] Matteo Golfarelli and Stefano Rizzi. A Survey on Temporal Data Warehousing. International Journal of Data Warehouse and Mining, 5(1), 1-17, 2009.

[20] Hugh Watson. Recent developments in data warehousing. Communications of the AIS. 8: 1-25, 2001.

[21] Sid Adelman, Larissa Moss and Majid Abai. Data Strategy. Addison Wesley, New York, 2001.

[22] Ralph Kimball. The Data Warehouse Toolkit: Practical Techniques for Build- ing Dimensional Data Warehouses. Wiley Computer Publishing, 1996.

[23] Dragoljub Krneta, Vladan Jovanovic and Zoran Marjanovic. An Approach to Data Mart Design from a Data Vault. INFOTEH-JAHORINA. Vol. 15, March 2016.

[24] Dragoljub Krneta, Vladan Jovanovic and Zoran Marjanovic. A Direct Approach to Physical Data Vault Design. Computer Science and Information Systems. 11(2):569–599, 2014.

[25] Dan Linstedt. Data Vault Modeling Methodology. http://www.learndatavault.com, 2011.

[26] Danijela Subotic, Vladan Jovanovic and Patrizia Poscic. Data Warehouse Schema Evolution: State of the Art. 2014.

[27] Dan Linstedt. Data Vault Series 1-5. http://www.tdan.com/view-articles, 2002.

[28] Kent Graziano. Introduction to Data Vault Modeling. True BridgeResources, White paper, 2011.

[29] Ronald Damhof. The next generation EDW. Database Magazine, 2008.

[30] Thomas Hammergren and Alan Simon. Data Warehousing for Dummies. 2nd edition. 2009.

[31] Matt Casters, Roland Bouman and Jos van Dongen. Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration. Wiley Publishing, 2010.

[32] Craig Larman. Applying UML and Patterns: An introduction to object- oriented analysis and design, Second edition. Prentice Hall, New Jersey, 1998.

[33] Geethanjali Nataraj. Integration of Heterogeneous Data in the Data Vault Model. Institute for Parallel and Distributed Systems, Universität Stuttgart, Universitätsstraße 38, 70569 Stuttgart, Germany, 2019.

[34] Roelant Vos. Data Vault 2.0 – Introduction and (technical) differences with 1.0. http://roelantvos.com/blog/data-vault-2-0-introduction-and-technical-differences-with-1-0/, 2014.

[35] Faizura Haneem, Rosmah Ali, Nazri Kama and Sufyan Basri. Resolving data duplication, inaccuracy and inconsistency issues using Master Data Management. pp. 1-6, doi: 10.1109/ICRIIS.2017.8002453, 2017.

[36] Jenni Laukkanen. Dimensional modeling of the AdventureWorks2017 database. https://github.com/jennilaukkanen/masterthesis/blob/master/AdventureWorks_v3.png, 2020.

[37] Jenni Laukkanen. Data Vault 2.0 type of target data warehouse model for AdventureWorks2017 dimensional model. https://github.com/jennilaukkanen/masterthesis/blob/master/AdventureWorksDV2_25062020_v2.png, 2020.

[38] Jenni Laukkanen. Data Vault 2.0 type of data warehouse model for AdventureWorks2017 dimensional model created by the Data Vault 2.0 building automation tool. https://github.com/jennilaukkanen/masterthesis/blob/master/AdventureWorksDV_20072020_result.png, 2020.

[39] Jenni Laukkanen. Decision tree used in the classification of the business keys created from the training data. https://github.com/jennilaukkanen/masterthesis/blob/master/owndectree.png, 2020.

[40] Jenni Laukkanen. The code for the automated Data Vault 2.0 building tool. https://github.com/jennilaukkanen/masterthesis/blob/master/Jenni%20Laukkanen_Master%20Thesis_Empirical%20part%20code_DV2.0Automation_20072020.sql, 2020.

[41] Editorial Team. Is Data Vault Modeling a Good Choice for Your Organization? https://insidebigdata.com/2017/07/28/data-vault-modeling-good-choice-organization/, 2017.

APPENDIX A. The productized tools in the market analysis comparison

1. Datavault Builder (https://datavault-builder.com/)
2. ADE (https://www.solita.fi/en/agiledataengine/)
3. WhereScape (https://www.wherescape.com/solutions/automation-software/data-vault-express/)
4. Fivetran (https://fivetran.com/)
5. TimeXtender (https://www.timextender.com/data-automation/)
6. Datarider (https://francescolerario.net/dv2-model-automation/)
7. Ajilius (https://www.minerra.net/business-intelligence-technologies/ajilius-data-warehouse-automation/)
8. Erwin (https://erwin.com/products/erwin-data-modeler/)
9. Quipu (https://quipu.nl/solutions/data-warehouse/)
10. Optimal Data Engine (http://ode.ninja/ or https://optimalbi.com/product/optimal-data-engine/)
11. VaultSpeed (https://vaultspeed.com/)
12. BimlFlex (https://varigence.com/bimlflex)
13. Datagaps ETL Validator (https://www.datagaps.com/etl-testing-tools/etl-validator/)
14. dbtvault (https://www.data-vault.co.uk/dbtvault/)
15. BI4Dynamics (https://www.bi4dynamics.com/data-warehouse-automation/)
16. BI-Ready (http://www.bi-ready.com/)

APPENDIX B. The sample data sets used in the training of the tool

1. New York City Taxi Sample (https://docs.microsoft.com/en-us/sql/machine-learning/tutorials/demo-data-nyctaxi-in-sql?view=sql-server-ver15)
2. School database (https://docs.microsoft.com/en-us/ef/ef6/resources/school-database)
3. Dofactory database (https://www.dofactory.com/sql/sample-database/)
4. Bikestores (https://www.sqlservertutorial.net/sql-server-sample-database/)
5. Wide World Importers sample database v1.0 (https://github.com/Microsoft/sql-server-samples/releases/tag/wide-world-importers-v1.0)
6. Microsoft Contoso BI Demo Dataset for Retail Industry (https://www.microsoft.com/en-us/download/details.aspx?id=18279)
7. Northwind and pubs sample databases for Microsoft SQL Server (https://github.com/microsoft/sql-server-samples/tree/master/samples/databases/northwind-pubs)
8. Lahman’s Baseball Database (http://www.seanlahman.com/baseball-archive/statistics/)
9. SalesDB Sample Database (https://www.sqlskills.com/sql-server-resources/sql-server-demos/)