A proposal for improvements of the Data Vault

Ensemble process approach to retrieve Big Data

Data Vault limitations and optimization

Tahira Jéssica da Silva Ruivo Vissaram

Dissertation presented as partial requirement for obtaining the Master’s degree in Information Management

NOVA Information Management School Instituto Superior de Estatística e Gestão de Informação Universidade Nova de Lisboa

A PROPOSAL FOR IMPROVEMENTS OF DATA VAULT ENSEMBLE PROCESS APPROACH TO RETRIEVE BIG DATA

by

Tahira Jéssica da Silva Ruivo Vissaram

Dissertation presented as partial requirement for obtaining the master’s degree in Information Management, with a specialization in Information Systems and Technologies Management

Advisor / Co-Advisor: Dr. Vítor Santos

November 2019

ACKNOWLEDGMENT

I would like to express my sincere gratitude to my supervisor, Professor Vítor Santos, Ph.D., for the support, motivation, guidance, and persistence that drove me to complete this dissertation, as well as for the knowledge he transmitted.

I am also extremely grateful to Nova IMS, and to all the teachers and staff, for these years of learning and motivation as a student; they enabled the conditions for this final work to be executed and helped me in my professional career.

A special thanks to my mother and my brother, for their unconditional support, for their encouragement, love, and dedication.

Finally, I thank all my friends who directly or indirectly contributed to this work, with words of encouragement and motivation.

ABSTRACT

Data has become the most powerful asset in an organization, due to the insights and patterns that can be discovered in it and because it can be transformed into real-time information through BI tools to support decision-making.

So, it is crucial to have a DW architecture that stores all the business data of an organization in a central repository accessible to all end-users, allowing them to query the data for reporting.

When we want to design a DW, the most common approach used is the Star Schema, created by Kimball; however, the costs of maintaining and re-designing the model when the business requirements and business processes change, or even when the model needs to be incremented, are very high and have a significant impact on the whole structure.

For that reason, the Data Vault approach, invented by Dan Linstedt, emerged, bringing a methodology more oriented to auditability, traceability, and agility of the data, which rapidly adapts to changes in business rules and requirements while handling large amounts of data. This hybrid modus operandi combines the best of 3NF and the Star Schema, being flexible, scalable, and consistent, so the costs of implementation and maintenance are reduced, without the need to modify the whole model structure, allowing the incremental building of new business processes and requirements.

However, as it is still recent, the Data Vault approach has limitations compared to the Star Schema: it requires many joins to access and execute ad-hoc queries, which makes end-user access to the model difficult. Consequently, the model has low performance, and more storage is required because the data is split across many tables.

Although the two are competitors, when it comes to building an EDW capable of providing a central view of the whole business, the Star Schema and Data Vault 2.0 approaches complement each other in the Data Vault architecture. On top of the Data Vault, in the information delivery layer, since the Data Vault cannot be accessed by end-users, Data Marts are created using Star Schemas or OLAP cubes so that BI tools can produce reports for organizational decision-making.

So, briefly, the purpose of this Dissertation is, through a case study, to compare the Star Schema model with the Data Vault 2.0 Ensemble model, to demonstrate the limitations of Data Vault 2.0, and to present an optimized way of designing a Data Vault 2.0 model, reducing the joins required to query the data, minimizing the complexity of the model, and allowing users to access the data directly, instead of creating Data Marts.

KEYWORDS

Big Data; Data Vault; Modeling; Limitations; Optimization

INDEX

1. Introduction
1.1. Problem justification
1.2. Problem (Research Question) / General objective (Main goal)
1.2.1. Specific objectives
1.3. Methodology
1.4. Case Study Research
1.5. Case study strategy
1.6. Methodology and Tools
2. Literature review
2.1. Data Warehouse and Big Data Concepts
2.1.1. Big Data Concept
2.1.2. Data Warehouse definition
2.2. Data Modelling and Big Data challenges
2.3. Data Integration problems
2.4. Problems with Traditional Data Warehousing and Business Intelligence
2.5. Data Vault Ensemble Modeling
2.5.1. Data Vault Fundamentals
2.5.2. Data Vault Architecture
2.5.3. Benefits, disadvantages and limitations of Data Vault Approach
2.5.4. Comparison with other dimensional models
4. Case study
4.1. Data Sources and Data Collection
4.1.1. Business Entities
4.1.2. Data dictionary of ER model
4.2. Differences between a Relational model and a Dimensional model
4.2.1. Traditional DW model - Star schema
4.2.2. Traditional Data Vault 2.0 Ensemble Modeling
4.2.3. The proposal for the optimized Data Vault 2.0 model
5. Results and Discussion
6. Conclusions
7. Limitations
8. Recommendations for future works
Bibliography

Annexes
Load Dimension tables – ETL process

LIST OF FIGURES

Figure 1 - The three V's of Big Data, (Whishworks, 2017)
Figure 2 - Big Data drivers and risks, (EY, 2014)
Figure 3 - ETL Pipeline, (Hultgren, 2012)
Figure 4 - Implementation problems in Business Intelligence projects, (BI-Survey.com, n.d.)
Figure 5 - Data Vault EDW, (Hultgren, 2012)
Figure 6 - Data Vault EDW, (Hultgren, 2012)
Figure 7 - Data Vault EDW, (Hultgren, 2012)
Figure 8 - Hub table, adapted from (Hultgren, 2018)
Figure 9 - Link table, adapted from (Hultgren, 2018)
Figure 10 - Satellite table, adapted from (Hultgren, 2018)
Figure 11 - Data Vault Architecture, (Linstedt & Olschimke, 2015)
Figure 12 - Parallel load in Data Vault 2.0 approach, (Hultgren, 2012)
Figure 13 - SWOT analysis
Figure 14 - ER model data source from Hotel Chain
Figure 15 - Main differences between relational and dimensional modeling, (Varge, 2001)
Figure 16 - Star Schema model, (Moody & Kortink, 2000)
Figure 17 - Star schema model for Bookings Management
Figure 18 - Star schema model for Services Management
Figure 19 - Load Dimension and Fact tables dtsx
Figure 20 - Load Fact Tables package in SSIS
Figure 21 - Fact Booking measures, through derived column component
Figure 22 - ETL process to Load Fact Booking
Figure 23 - OLE DB Source, using a SQL command to extract services data from source
Figure 24 - ETL process to Load Fact Service
Figure 25 - Load Dimension Tables package in SSIS
Figure 26 - Traditional Data Vault 2.0 Model
Figure 27 - Load Hubs, Links and Satellites tables dtsx
Figure 28 - Load Hubs entities package in SSIS
Figure 29 - Example of loading a Hub table in SSIS
Figure 30 - Adding attributes in the Hub entity
Figure 31 - Load Link tables package in SSIS
Figure 32 - Adding metadata to the Link table
Figure 33 - Example of loading a Link table in SSIS
Figure 34 - Load Satellite tables package in SSIS
Figure 35 - Adding metadata to the Satellite tables
Figure 36 - Update new records in SSIS
Figure 37 - Example of loading a Satellite table in SSIS
Figure 38 - Case 1 - Query result in Data Vault 2.0 model
Figure 39 - Case 2 - Query result in Data Vault 2.0 model
Figure 40 - Proposal for an optimized Data Vault 2.0 model
Figure 41 - Bridge Booking Sales table
Figure 42 - SQL Stored Procedure to load the Bridge Booking Sales table
Figure 43 - Query result using the Bridge Booking Sales table in the Data Vault optimized model
Figure 44 - Bridge Booking Guest table
Figure 45 - SQL Stored Procedure to load the Bridge Booking Guest table
Figure 46 - Query result using the Bridge Booking Guest table in the Data Vault optimized model
Figure 47 - Creation of views using Bridge tables
Figure 48 - SQL query to create the Booking Sales view using the Bridge Booking Sales table
Figure 49 - SQL query to create the Booking Information view using the Bridge Booking Sales table
Figure 50 - SQL query to create the Guest Information view using the Bridge Booking Guest table
Figure 51 - Load Hotel Dimension table
Figure 52 - Load Discount Dimension table
Figure 53 - Load Booking Status Dimension table
Figure 54 - Load Cancellation Detail Dimension table
Figure 55 - Load Services Dimension table
Figure 56 - Load Trip Type Dimension table
Figure 57 - Load Room Type Dimension table
Figure 58 - Load Rating Dimension table
Figure 59 - Load Platform Dimension table
Figure 60 - Load Guest Dimension table
Figure 61 - Load Dates Dimension table

LIST OF TABLES

Table 1 - Differences between ETL and ELT, adapted from (Smallcombe, 2019)
Table 2 - Main differences between traditional and modern DW, adapted from (McCue, 2007; Santoso & Yulia, 2017)
Table 3 - Principal features of BI, adapted from (Chugh & Grandhi, 2013)
Table 4 - Different concepts in different Data Models, (Bojičić et al., 2016)
Table 5 - Comparison of the Inmon, Data Vault and Kimball approaches, adapted from (Orlov, 2014)
Table 6 - Business entities of the ER model
Table 7 - Case study attributes, data dictionary of ER model
Table 8 - Fact Tables Booking and Service measures
Table 9 - Hotel dimension attributes
Table 10 - Cancellation dimension attributes
Table 11 - Discount dimension attributes
Table 12 - Booking Status dimension attributes
Table 13 - Trip type dimension attributes
Table 14 - Date dimension attributes
Table 15 - Room type dimension attributes
Table 16 - Customer dimension attributes
Table 17 - Platform dimension attributes
Table 18 - Rating dimension attributes
Table 19 - Service dimension attributes
Table 20 - Identification of Hubs and business keys
Table 21 - Booking Satellites
Table 22 - Service Satellites
Table 23 - Hotel Satellites
Table 24 - Guest Satellites
Table 25 - Room Satellite
Table 26 - Link entities
Table 27 - Bridge Booking Sales table
Table 28 - Bridge Booking Guests table
Table 29 - Results of case study

LIST OF ABBREVIATIONS AND ACRONYMS

BI Business Intelligence

CWM Common Warehouse Metamodel

DW Data Warehouse

DWBI Data Warehouse and Business Intelligence

EDW Enterprise Data Warehousing

EWBK Enterprise Wide Business Keys

ELT Extract, Load, and Transform

ETL Extract, Transform and Load

IS Information System

KPI Key Performance Indicators

MPP Massively Parallel Processing

NF Normal Form

OLAP Online Analytical Processing

SMP Symmetric Multiprocessing

1. INTRODUCTION

Nowadays, with the expansion of the Internet and the consequent increase of information systems (Sarker, Bin Deraman, Hasan, & Abbas, 2019) and the diffusion of social networking, mobile computing, and online advertising, companies are faced with large amounts of data - Big Data (Hashem & Ranc, 2015) - that are crucial to their core business. Information is transformed into a powerful and strategic resource that can support decision-making grounded in real facts, allowing companies to achieve medium- and long-term goals (EY, 2014).

However, most of the collected data makes it challenging to provide feasible answers due to the multiple sources of information. These data are subject to various transformations, are unrelated across the various departments of the organization, have no standards or structure, and can sometimes be obsolete (Oumkaltoum, Mohamed Mahmoud, & Omar, 2019).

The solution for companies to deal with Big Data is to implement an approach capable of transforming these large volumes of data into useful information - the Data Warehouse - and, consequently, into reliable knowledge for the decision-making process. Besides, this multi-dimensional approach is a robust architecture for applying data analysis and reporting techniques over heterogeneous data sources that can be accessed and understood (Ballard et al., 1998).

These heterogeneous data sources contain structured, unstructured, and semi-structured data in different formats in real-time, which leads to Big Data. Traditional databases cannot handle these large volumes of datasets, so data modeling becomes a relevant research topic for designing an architecture capable of defining and categorizing the data, establishing standard definitions and descriptors, and allowing its consumption (Rao, Mitra, Bhatt, & Goswami, 2018).

The Inmon and Kimball approaches are the most famous methodologies used when designing a DW. However, a new approach created by Dan Linstedt, the Data Vault, has gained importance in recent years as a way of building a DW from raw (unprocessed) data coming from heterogeneous sources (Yessad & Labiod, 2017). The emergence of this approach has enabled the traceability of the data and improved the scalability, flexibility, and productivity of the DW compared with other data models (Bojičić et al., 2016), while keeping the total cost of ownership low (Yessad & Labiod, 2017).

Data Vault aims to represent the real core business of the company (Inmon & Linstedt, 2015). It is an incremental (flexible) approach that does not require the total redesign of the dimensional structure (Naamane & Jovanovic, 2016), which provides added value for large amounts of constantly changing data - Big Data - while fitting budgetary expectations (Hultgren, 2012).

1.1. PROBLEM JUSTIFICATION

Organizations handle large amounts of data daily, which makes it challenging to adapt to the constant changes in business rules and requirements. Big Data still needs to confront challenges to achieve a successful architecture model (Storey & Song, 2017), which is why data modeling capable of integrating, aligning, and reconciling unpredictable formats of mainly unstructured and multi-structured data is crucial (Hultgren, 2012).

Data Vault represents a viable and effective approach for modeling data that needs to be traceable, respond to business changes over time, integrate multiple types of sources, accommodate new subject areas, and remain highly agile - and, most importantly, do so with lower maintenance costs (Hultgren, 2012).

Although this methodology is very useful for managing, architecting, and abstracting the main business requirements, the model still presents limitations when storing and accessing the data. A Data Vault cannot be used by end-users due to the exhaustive joins that must be performed to query the data, which has a significant impact on the model's performance (Naamane & Jovanovic, 2016).

So, the challenge will be modeling Big Data into a DW architecture through the Data Vault 2.0 Ensemble approach, to understand the main challenges that companies face when designing a conceptual model and, on the other hand, to demonstrate the main limitations of this approach, comparing it with the Star Schema model, and to present an optimized model capable of responding to the limitations found.

1.2. PROBLEM (RESEARCH QUESTION) / GENERAL OBJECTIVE (MAIN GOAL)

The main purpose of this Dissertation is to propose improvements in the Data Vault 2.0 approach for retrieving Big Data, in order to reduce the exhaustive joins needed to connect the main entity elements of the Data Vault approach: Hubs, Links, and Satellites. Furthermore, another goal of this Dissertation is to compare the Data Vault 2.0 model with the traditional DW model - the Star Schema - and to demonstrate the limitations that DW projects still face when using the Data Vault 2.0 approach.

A new Data Vault 2.0 model will be proposed with the aim of minimizing the Data Vault limitations and allowing end-users to access and query the data using this approach, by applying BI tools directly.

1.2.1. Specific objectives

The following research questions will be investigated in order to achieve the goal under study:

▪ What are the benefits and the disadvantages of Data Vault 2.0, compared with the Kimball approach?
▪ Why is Data Vault not an end-user approach?
▪ Can end-users apply BI tools in Data Vault 2.0 architecture directly?
▪ What are the limitations of the Data Vault 2.0 Ensemble approach?
▪ Why are so many joins used to relate the entities in the Data Vault 2.0 Ensemble?
▪ Is there a way to optimize the Data Vault 2.0 model?


1.3. METHODOLOGY

In the scope of this master's dissertation, exploratory research will be conducted. The main goal is to provide a better understanding of the research questions identified and to find improvements and limitations in the framework under study.

The purpose will be to reveal new standards and insights around the concepts in the study to provide an optimized model to deliver a better response to the challenges faced by organizations.

The choice of this type of research design is based on the flexibility and adaptability to change that it yields, since the goal is to observe and comprehend the data and to discover new ideas by tentative means.

The research method used is based on a qualitative research method, a case study, as described below, to increase the knowledge and find new aspects relevant to this phenomenon.

1.4. CASE STUDY RESEARCH

A case study method allows exploring, investigating, and gaining a better understanding of data coming from a given scenario (Bolder-Boos, 2015).

Case study research is used to investigate the phenomenon under study more deeply and profoundly, to get more contextual insights and understandings (Yin, 2008). Besides, case study methods allow researchers to respond to “How” and “Why” questions of the study problem and do not require any control over it (Yin, 2008).

A case study is a “general term for the exploration of an individual, group or phenomenon” (Sturman, 1997), corresponding to an extensive description of the case and its analysis (Starman, 2013).

According to Simons (2009), a case study is "an in-depth exploration from multiple perspectives of the complexity and uniqueness of a particular project, policy, institution, program, or system in real life".

Case study research can present some advantages regarding its capacity to reach high conceptual validity, which consists of determining and quantifying the indicators related to the theoretical concepts under study. Case studies also integrate methods capable of inducing new hypotheses or even identifying new variables pertinent to particular cases. Besides, they allow researchers to examine causal mechanisms in detail in an individual case context and have a strong capability to adjust to complex causal relations (Starman, 2013).

Briefly, this case study will allow investigating, exploring, demonstrating, and gathering results of this specific scenario in a more practical component, with the objective of justifying and supporting the analysis under study.


1.5. CASE STUDY STRATEGY

An emblematic case study from a typical BI project was chosen to achieve the research objectives.

The strategy is to apply the Data Vault process approach with some improvements and see whether these improvements bring benefits beyond the established traditional Data Vault.

If these improvements are observed then, as this is a typical BI project, it can be inferred that the proposed improvement measures will also benefit future projects.

In summary, with this case study it will be possible to understand, analyze, compare, and study, on a technical level, the differences in the implementation of these approaches, determine which model can most quickly meet expectations and business needs and, using the data, demonstrate the limitations and forms of optimization that remain in the Data Vault approach that has emerged in recent years.

1.6. METHODOLOGY AND TOOLS

This case study implemented Kanban, an agile methodology, owing to the interactivity and incremental building that it provides. Moreover, this methodology accommodates change, adjusting to the business requirements, and is focused on business value and end-users, leading to quality improvement in each delivery.

Besides, the Data Vault Ensemble approach aligns with this methodology, being capable of adapting to business changes and improving model quality.

Regarding the tools, the Star Schema and Data Vault models will be created with Microsoft tools, such as SQL Server Management Studio 2017 and SQL Server Data Tools 2015, owing to the licenses provided by Nova Information Management School.


2. LITERATURE REVIEW

In this chapter, a theoretical background will be presented in order to introduce the main studies and research already done on the topic of this Master's Dissertation.

In order to sustain the theoretical research and to support the Dissertation presented, subjects related to data structures and the conceptual data model, data integration issues, traditional DW problems, the main challenges with Big Data and, finally, Data Vault modeling and its comparison with the Inmon and Kimball approaches are included in the study.

This literature review aims to understand the main problems and challenges that organizations face nowadays when implementing a DW using Big Data, and the strategies that they use. It serves as a foundation for defining and collecting studies and research about the Data Vault approach and for comparing Linstedt's DW methodology with the Inmon and Kimball approaches, in order to comprehend the benefits, disadvantages, and limitations of developing a DW project with large amounts of data.

To start, and for a better understanding of the two concepts most discussed in this Dissertation, Big Data and Data Warehouse, a definition of these two notions is presented.

2.1. DATA WAREHOUSE AND BIG DATA CONCEPTS

Before presenting the theoretical background collected related to the topic of this Dissertation, it is crucial to define the two main concepts that will be addressed during this research: Big Data and Data Warehouse.

2.1.1. Big Data Concept

The Big Data concept refers to large amounts of data that are dynamic, because they are continuously changing, and that are created by people, tools, and machines (EY, 2014).

Gartner defines Big Data as "high-volume, high-velocity, and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation" (Gartner, 2012). Unstructured data, in particular, face exponential growth, due in large part to the explosion of machine-generated data and human engagement within social networks (Eberendu, 2016).

In the beginning, the term Big Data was characterized, according to Doug Laney, by three key concepts: volume, velocity, and variety. Volume corresponds to the total amount of data that is generated and increased by e-commerce channels. Velocity refers to how often data is generated by these e-commerce channels and needs to be stored. Finally, variety determines the heterogeneity of the data sources, which induces incompatible data formats, non-aligned data structures, and inconsistent data semantics, making the adoption of effective data management critical (Laney, 2001).

Later, by incorporating structured, semi-structured, and unstructured data, Big Data was again characterized, adopting one more V: Veracity, related to the truthfulness of the data and their integration. The amount of data that is created is enormous compared to traditional databases, encompassing a diversity of sources that are generated continuously and require rapid processing. Since the data come from various data sources, it is necessary to test their veracity (EY, 2014).

Figure 1 - The three V's of Big Data, (Whishworks, 2017)

Currently, the Big Data concept has seven V's associated with it: Volume, Velocity, Variety, Veracity, Variability, Visualization, and Value (Mcnulty, 2014). The new V's are Variability, which is associated with constantly changing data; Visualization, which corresponds to representing the large volume of data in a way that is understandable for those who use it; and, finally, Value, related to the significance of the data for the business goals (Mcnulty, 2014). For these massive amounts of data, scalable technology capable of collecting, storing, and analytically processing all the information is necessary (EY, 2014), in order to extract insights and transform them into knowledge, allowing organizations to increase their competitive advantage, become more innovative, and raise their level of productivity (Eberendu, 2016).


Figure 2 - Big Data drivers and risks, (EY, 2014)

2.1.2. Data Warehouse definition

Gartner defines a DW as an architecture that stores data from different data sources (transactional systems, operational data, and external sources) and aggregates all these data and business requirements into a single enterprise-wide view suitable for reporting and data analysis in the decision-making process (Gartner, 2019).

A DW is characterized as being subject-oriented, giving information about a specific business subject; integrated, collecting heterogeneous sources into a single one; nonvolatile, because the data is not updated or changed once inserted into the DW; and time-variant, because the data relate to a certain period of time (Inmon, 2002).

The advantages of creating a DW are related to the following characteristics (Almeida, 2017):

▪ Integrating data from multiple sources;
▪ Performing new types of analytical analysis;
▪ Reducing costs to access historical data;
▪ Standardizing data across the organization, having a single vision of the data;
▪ Improving turnaround time for analysis and reporting;
▪ Sharing data and allowing others to access data easily;
▪ Supporting ad-hoc reporting and inquiry;
▪ Reducing the development burden on IS/IT;
▪ Removing informational processing load from transaction-oriented databases.

However, the adoption of this architecture can lead to some challenges (Almeida, 2017):

▪ Time-consuming preparation and implementation;
▪ Difficulty in integration compatibility considering the use of different technologies;
▪ High maintenance costs;
▪ Limited use due to confidential information;
▪ Data ownership and data security;
▪ Underestimation of ETL processing time;
▪ Inability to capture the required data;
▪ Increased demands of the users.

The costs of building and maintaining a DW can be very high and significantly different from the cost of a standard system, due to the large volume of data that the DW stores and the cost of keeping the interface between the DW and the operational sources (depending on whether ETL tools are used); furthermore, the implementation of a DW is never finished, due to the ongoing need to add new data or new areas to the DW (Inmon, 2002).

Data Warehousing is a collection of decision support technologies that allows experts (management, analysts) to make better and quicker responses in decision-making (Chaudhuri & Dayal, 1998).


The importance of Data Warehousing increases with the need to structure and store data for the decision-making process of companies. Data is considered a powerful and tangible asset, which can bring competitive advantages in the business world. The purpose of creating a DW has been growing due to the vast quantities of data generated by organizations, which they need to access and use in the day-to-day business (Ballard et al., 1998).

2.2. DATA MODELLING AND BIG DATA CHALLENGES

Companies are faced with large amounts of data due to the development of new technologies, which have been growing exponentially. This information boom is characterized by the difficulty of integrating and aggregating all the data to support the organization's data structure, especially for data management and decision-making (Oumkaltoum et al., 2019).

Conceptual data modeling becomes increasingly crucial for documenting and understanding all of the organization's existing data elements and attributes, the flow of information and, particularly, how they can be associated - the relationships between the data (Teorey et al., 2011).

Data modeling consists of a representation/visualization of the business world, incorporating abstraction and a reflection of the business area, before any implementation, which is why it is so important. This concept is characterized as a well-organized abstraction of business data (Ballard et al., 1998).

Furthermore, the conceptual data model is essential because it defines the business objects (data abstraction) and their properties (attributes). It permits communication with all members involved, who do not need any expertise to understand the business model, identifies the scope of the business data, and defines the cardinalities (associations) of the relations between the data objects (Teorey et al., 2011).

Designing a conceptual data model is an iterative process, which becomes more detailed as the entities and relationships are added to transform logical designs into physical designs (Hultgren, 2012).

Nevertheless, it is also essential that the definition of the requirements is clear for it to be possible to model a conceptual data model. Otherwise, the IS project fails for reasons such as unclear and incomplete requirements and specifications, lack of user input, or constant change of the requirements by the stakeholders. Although the design of IT structures is essential for success, it continually presents challenges (Gemino & Wand, 2003).

Besides, it is challenging to design a DW architecture due to business dynamism and the complexity of the data. It is not realistic to expect the information to remain static, and the requirements are not always provided at the beginning of a DW project (Jovanovic, Romero, Simitsis, Abelló, & Mayorova, 2014). Nowadays, the DW must be adaptable to constant data and source changes.

With the arrival of the Big Data era, the design process becomes more difficult to organize and represent. The challenges grow because of the volume of the data, the uncertain veracity of the data, the variety of the sources and, finally, the fast velocity at which data arrive and change (Gil & Song, 2016).

Besides, the variety of data in different formats, platforms, and structures makes it difficult to represent a big-picture perspective (Ballard et al., 1998).

However, modeling is an essential key to communicating with the stakeholders, to codifying the business needs and requirements and, more importantly, to providing the technical aspects and details for the developers to build the DW (Hultgren, 2012).

Before any implementation, the ideal is to analyze and design the data structure, to have a solid conceptual data model capable of representing the business and the data flow and of enabling a better selection of the DW approach to be used. This step will ensure an effective DW and reduce implementation costs (Ballard et al., 1998).

Many data modeling approaches exist to design DW architectures, and they are designed with the same characteristics, tables, and relationships. So, the main difference between them is the essence of the rules established and their purpose in the way these tables and relationships are modeled (Hultgren, 2012).

When we talk about significant amounts of data, the Data Vault modeling approach is considered one of the most effective, because the primary purpose of this methodology is orientation to requirements, the integration of multiple heterogeneous sources, especially unstructured (semi-structured, multi-structured) data, and the provision of agility, absorbing business changes rapidly (Linstedt & Olschimke, 2015).

The principal difference in the traditional modeling approaches is that the design is not expected to receive business changes, because their source systems are constant and the project scope is restricted to specific requirements; neither auditability nor an enterprise-wide view of data is required or planned (Hultgren, 2012).

2.3. DATA INTEGRATION PROBLEMS

Data integration is critical in the process of building a DW (Calvanese, De Giacomo, Lenzerini, Nardi, & Rosati, 2002).

However, with the Big Data era, data integration requires special attention in the way data is extracted from massive data sources. The variety, volume, and overlap of the data create efficiency and effectiveness problems when we talk about integrating the data. The massive volume of data can be very costly and bring issues when accessing all the data sources, which makes it challenging to achieve scalability and efficiency (Linstedt & Olschimke, 2015).

Moreover, data integration deals with other kinds of challenges, such as the semantics and meaning of the business objects; the grain, precision, accuracy, and quality of data; the definition of keys and identifiers; formats, defaults, exception rules, and null interpretations; temporal and timeline issues; the consistency of loads and changes; and the reengineering of the data and requirements (Hultgren, 2018).


In the process of integrating data from the source into the DW structure, the data can suffer redundancies and inconsistencies (Calvanese et al., 2002) that must be solved through ETL tools. It is vital to consider some data quality criteria in order to have a reconciled and integrated view of business data. Consistency, validity, conformity, accuracy, and integrity are keywords when it comes to data processing (Shivtare & Shelar, 2015). These ensure that the data loaded at the destination is non-conflicting and consistent, reasonable over a given period, accurate and useful to the real world, and able to relate to each other.
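To make these criteria concrete, the following is a minimal T-SQL sketch of a validation step of the kind an ETL process might apply; the staging tables (stage.Booking_Raw, stage.Booking_Clean) and the rules are hypothetical illustrations, not the case-study schema.

-- Hypothetical validation step: only rows that pass basic quality rules move on
-- from the raw staging table to the load-ready table.
INSERT INTO stage.Booking_Clean (BookingCode, CheckIn, CheckOut, TotalAmount)
SELECT s.BookingCode, s.CheckIn, s.CheckOut, s.TotalAmount
FROM   stage.Booking_Raw AS s
WHERE  s.BookingCode IS NOT NULL     -- integrity: the key must be present
  AND  s.CheckOut >= s.CheckIn       -- consistency: dates must be reasonable
  AND  s.TotalAmount >= 0;           -- validity: no negative amounts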

Figure 3 represents the different forms of transformations possible in the Extract and Load processes of ETL.


Figure 3 - ETL Pipeline, (Hultgren, 2012)

However, it is not always possible to ensure efficient data integration, for multiple reasons. Bad data, unexpected changes, different formats and missing data in source systems, source data that does not comply with standards, the complexity of the DW, different encoding formats, and a lack of business ownership, policy, and planning for the entire enterprise data all contribute to data quality problems (Shivtare & Shelar, 2015) and are some of the most prevalent issues with which the integration process deals.

Currently, in the Big Data era, data integration becomes critical due to the variety of the data, which comes from autonomous and heterogeneous data sources and is more vulnerable to overlapping. Besides, the characteristics of Big Data bring challenges, especially in efficiency and effectiveness aspects (Lin, Wang, Li, & Gao, 2019).

The massive data sources that need to be handled in order to integrate Big Data make the process costly and sometimes make the data impossible to access, requiring high computational complexity and efficient algorithms capable of dealing with this phenomenon (Lin et al., 2019).

Data quality also becomes a concern in the data integration strategy because bad data quality can bring poor insights and improper decision-making (Brown, 2019).


With the Big Data paradigm, data integration also undergoes changes in the methods of transforming and loading data from diverse sources into one location. The typical ETL process, which is well known and widely used in data integration, is now being replaced by the ELT technique.

In ETL, after the extraction of the data, the transformation is performed before the data is loaded into the final architecture. With Big Data, however, this process becomes more complex due to the large quantities of data generated (Smallcombe, 2019). So, the data is loaded first, and the necessary transformations are made afterward. However, this technology is very recent, so the ELT pipeline presents challenges, and experts are needed to implement it.
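As a simple illustration of the ELT ordering (all table and column names in this sketch are invented), the raw data is landed first, without transformation, and only later transformed inside the target system when it is needed for analysis:

-- Step 1 (load): land the raw extract as-is, with no transformation.
INSERT INTO dw.Booking_Raw (Payload, LoadDate)
SELECT Payload, SYSUTCDATETIME()
FROM   ext.BookingFeed;

-- Step 2 (transform, later and on demand): shape the data inside the target system.
INSERT INTO dw.Booking (BookingCode, TotalAmount)
SELECT JSON_VALUE(Payload, '$.bookingCode'),
       TRY_CAST(JSON_VALUE(Payload, '$.totalAmount') AS DECIMAL(10, 2))
FROM   dw.Booking_Raw;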

Table 1 presents the differences between these two concepts (ETL and ELT) for a clearer understanding.

Adoption of the technology and availability of tools and experts
ETL: Very well known, with expertise built over the last 20 years.
ELT: New technology, so it can be difficult to locate experts.

Availability of data in the system
ETL: Only transforms and loads the data necessary for the DW; the data is transformed before the load.
ELT: Loads all data immediately; users can determine which data to transform and analyze later.

Compatibility with data lakes
ETL: Not a solution for data lakes, because it only integrates data in a relational data warehouse system.
ELT: Offers a pipeline for data lakes to absorb unstructured data.

Compliance
ETL: Removes sensitive information before loading it into the DW.
ELT: Requires the upload of the data before removing sensitive information, so sensitive information is more vulnerable.

Data size vs. complexity of transformations
ETL: More appropriate for handling smaller data sets that require complex transformations.
ELT: Handles massive amounts of structured and unstructured data - Big Data.

Data warehousing support
ETL: Works with cloud-based solutions and DWs; requires a relational or structured data format.
ELT: Works with cloud-based DW solutions supporting structured, unstructured, semi-structured, and raw data types.

Hardware requirements
ETL: Cloud-based ETL platforms do not require specialized hardware.
ELT: ELT processes are cloud-based and do not require specialized hardware.

How aggregations differ
ETL: If datasets increase in size, aggregation becomes more complicated.
ELT: With a cloud-based target data system, it is possible to process massive amounts of data quickly.

Maintenance requirement
ETL: Automated, cloud-based ETL solutions require little maintenance; however, an onsite ETL solution that uses a physical server will require frequent maintenance.
ELT: Cloud-based and generally incorporating automated solutions, so very little maintenance is required.

Order of the extract, transform, load process
ETL: Data transformations happen immediately after extraction, within a staging area; after transformation, the data is loaded into the data warehouse.
ELT: Data is extracted and then loaded into the target data system first; only later is some of the data transformed on an "as-needed" basis for analytical purposes.

Transformation process
ETL: Transformations happen within a staging area outside the data warehouse.
ELT: Transformations happen inside the data system itself, and no staging area is required.

Unstructured data support
ETL: Can be used to structure unstructured data, but cannot be used to pass unstructured data into the target system.
ELT: A solution for uploading unstructured data into a data lake and making unstructured data available to business intelligence systems.

Waiting time to load information
ETL: Load times are longer than with ELT because it is a multi-stage process: (1) data loads into the staging area, (2) transformations take place, (3) data loads into the DW. Once the data is loaded, the analysis of the information is faster than with ELT.
ELT: Data loading happens faster because there is no waiting for transformations and the data only loads once into the target data system. However, the analysis of the information is slower than with ETL.

Waiting time to perform transformations
ETL: Data transformations take longer initially, because every piece of data requires transformation before loading; also, as the size of the data system increases, transformations take more time. However, once transformed and in the system, analysis happens quickly and efficiently.
ELT: Since transformations happen after loading, on an as-needed basis, and only the data required for analysis at the time is transformed, transformations happen much faster. However, the need to continually transform data slows down the total time it takes for querying/analysis.

Table 1 - Differences between ETL and ELT, adapted from (Smallcombe, 2019)

2.4. PROBLEMS WITH TRADITIONAL DATA WAREHOUSING AND BUSINESS INTELLIGENCE

The constant increase of data and the need to obtain knowledge instantly have raised the demands for accuracy and efficacy in the organization of information, and databases have come to play a critical role in its management (Wannalai & Mekruksavanich, 2019).

The EDW has emerged to represent all the organization's business data and specific rules across the multiple subject business areas - a "single version of the truth" - instead of a traditional DW that represents only one single business area (now called a Data Mart). The EDW concept is capable of providing all organizational information, aggregated by context - a "single version of the facts" - making all organizational data over time available to the individual users of the organization (Linstedt & Olschimke, 2015).

Conventional systems lack efficiency when dealing with large volumes of data and information (Wannalai & Mekruksavanich, 2019), which is reflected in the DW development with the absence of a standardized DW data model (Bojičić et al., 2016).

So, a DW is a critical corporate asset nowadays, given its importance in strategic business decisions, which, besides providing operational system support, can also bestow personalized offers and present upsell promotions (Linstedt & Olschimke, 2015). This aspect can be crucial when organizations deal with their competitors.

With the expansion of information, companies must have real-time data to facilitate decision-making and the capacity to respond more quickly to their customers. Traditional DWs are neither prepared to handle these volumes of data nor to deliver information in real-time (Bouaziz, Nablil, & Gargouri, 2017).

Traditional DWs can give us information about the past, answering questions like "What has happened?", which is supplied by historical data. However, although these questions remain relevant, Big Data can yield organizations answers about their future. Using advanced analytics, they are capable of discovering powerful insights and trends in the variety of data and transforming them into information and knowledge useful for the strategies of the company (McCue, 2007).

Modern DW, using Big Data, can respond to questions such as “What is happening now?” or even “What could happen?” (McCue, 2007), which traditional DWs cannot. Besides, with the new era of Big Data, it is possible to make predictive analyses based on the data, which adds value to the core business of organizations.

Traditional DWs are oriented mainly toward strategic decisions, containing historical data that is integrated daily, weekly, or monthly, which makes reporting (more restricted to the existing processes and patterns) and measuring the data difficult (Bouaziz et al., 2017).


Besides, much of the data comes from unstructured or semi-structured sources, and a traditional DW cannot categorize and store this type of data. The data is generated very quickly, so a flexible and agile structure that can process it quickly is needed.

Table 2 displays the main differences between traditional and modern DWs:

Purpose
Traditional DW: The principal purpose is to support the decision-making process. It is implemented for a specific business area, and the data collected is non-volatile, time-variant, and integrated.
DW nowadays: The primary purpose is to integrate multiple heterogeneous sources (structured, semi-structured, and unstructured data) to store, manage, and analyze them.

Data source
Traditional DW: Transactional and operational databases.
DW nowadays: Different formats, sources, and standards.

Data size
Traditional DW: Terabytes.
DW nowadays: Petabytes.

Scope
Traditional DW: Support BI (Business Intelligence) and OLAP (Online Analytical Processing).
DW nowadays: Discover insights from Big Data using data mining techniques.

Architecture
Traditional DW: Star schema is the most used approach; oriented to ETL tools.
DW nowadays: No defined architecture; it depends on the complexity of the DW project.

Schema
Traditional DW: Static.
DW nowadays: Unstructured, non-transactional data; dynamic schemas.

Repositories
Traditional DW: Often fragmented into multiple warehouses.
DW nowadays: Single repository using the concept of a data lake, which is constantly gathering and adding data.

Technology
Traditional DW: There are several free and licensed applications and tools in the market.
DW nowadays: The technology must support, process, and store Big Data.

Processing scalability
Traditional DW: Scales vertically.
DW nowadays: MPP (Massively Parallel Processing) capacity.

Storage
Traditional DW: Relational data stores.
DW nowadays: Distributed file system.

End-user
Traditional DW: Top management and business analysts.
DW nowadays: Data scientists.

Table 2 - Main differences between traditional and modern DW, adapted from (McCue, 2007; Santoso & Yulia, 2017)


The evolution of Big Data has affirmed the importance of adopting effective BI to improve companies' tactical and strategic management processes and decision-making processes, and to increase productivity and efficiency. This set of computing technologies, capable of identifying, collecting, storing, and analyzing data with the aim of converting them into actionable and pertinent information, can proffer successful strategic plans to companies. The adoption of efficacious BI will primarily provide insights leading to the discovery and comprehension of consumer buying trends, which can increase profits through better-targeted marketing campaigns (Chugh & Grandhi, 2013).

The constant increase of data brings challenges to traditional decision support systems, which are not sufficient to handle it. So, BI tools capable of processing and analyzing this kind of data captured from multiple sources are needed. BI tools can create intelligence for the core business of the organization, converting data into meaningful and useful information (Chugh & Grandhi, 2013).

Table 3 presents the main features that BI is capable of handling:

Data consolidation
▪ Integration of data from both in-house and external sources.
▪ Simplified extraction, transformation, and loading of data through graphical interfaces.
▪ Elimination of unwanted and unrelated data.

Data quality
▪ Sanitize and prepare data to improve the overall accuracy of decisions.

Reporting
▪ User-defined, as well as standard, reports can be generated to serve employees at different levels.
▪ Personalized reports to cater to different individuals and functional units.

Forecasting and modeling
▪ Support in creating forecasts and making comparisons between historical data and real-time data.

Tracking of real-time data
▪ Monitor current progress against defined objectives through KPIs or expected outcomes.
▪ Prioritize scarce resources.

Data visualization
▪ Interactive reports with visualization to understand relationships easily.
▪ Scorecards to improve communication.

Data analysis
▪ What-if analysis.
▪ Sensitivity analysis.
▪ Goal-seeking analysis.
▪ Market basket analysis.

Mobility
▪ Portable applications can be installed on mobile devices such as mobile phones and tablet computers to support executives and sales staff while traveling.

Rapid insights
▪ Drill-down features allow users to dig deeper into data.
▪ Through dashboards, it is possible to identify and correct negative trends, monitor the impact of newly made decisions, and improve overall business performance.

Report delivery & shareability
▪ Deliver reports to view in the most commonly used office applications, such as Microsoft Office (Word, Excel, and so forth).
▪ Email reports in different formats.

Ready-to-use applications
▪ Pre-built metadata with defined mappings considering performance & security needs.
▪ Pre-built reports and alerts to support management in real-time.

Language support
▪ Multiple language support.

Table 3 - Principal features of BI, adapted from (Chugh & Grandhi, 2013)

However, BI projects still face some issues when implementing a DW architecture, and several factors can lead to these problems, as presented below:

Figure 4 - Implementation problems in Business Intelligence projects, (BI-Survey.com, n.d.)


Notwithstanding BI project issues, these programs, processes, and tools allow organizations to make more informed decisions. These decisions are focused on an integrated enterprise data view for the whole company, because organizations do not work with only one unit, so it is essential to maintain the whole perspective (Hultgren, 2012). However, without an appropriate DWBI initiative, the integration of data to extract pertinent insights is not possible.

The DWBI framework is confronted with dynamically changing requirements, so the challenge is to be more real-time oriented.

2.5. DATA VAULT ENSEMBLE MODELING

A DW is a fundamental concept in an Enterprise due to the possibility of evaluating its performance over time, facilitating decision-making support (IBM, 2011).

In order to store large amounts of data from multiple heterogeneous sources and preserve the historical data, a data model that represents the physical structure of a DW, able to consume the data, reconcile the different sources, and be resilient to changes that may occur, is mandatory (Bojičić et al., 2016).

The CWM defines approaches that propose that the data should be organized according to 3NF or in multi-dimensional models; however, these have limitations with respect to the maintenance of the DW. A new approach, the Data Vault, has recently emerged to overcome these limitations (Yessad & Labiod, 2017).

When building a DW, one of the things needed is to measure the agility to adapt to changes, because an EDW is continually changing due to new sources and attributes, new requirements and business rules, deliveries, and the expansion of subject areas. Thus, it is crucial to ensure that the database model is agile for possible future changes and that maintenance costs are not unsustainable (Linstedt & Olschimke, 2015).

The DW needs to be based on central business data that can easily adapt to future changes/modifications, integrating multiple sources into one structure and tracking information history, providing truthful and auditable information.

The Data Vault approach, created in the early 2000s by Dan Linstedt (Linstedt & Olschimke, 2015), came to compete with the Inmon and Kimball approaches. Linstedt defines the Data Vault as "a detail-oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business" (Linstedt & Olschimke, 2015).

A Data Vault is an empirical model, and its modeling approach consists of a form of ensemble modeling, with the fundamental principle that things must be broken into parts (Hultgren, 2012). When we refer to the term Ensemble Modeling, we associate it with Unified Decomposition.

The implementation of a DW should be subject-oriented, and that is what the Ensemble Modeling concept is based on. The goal is to divide things into multiple components for the sake of flexibility, adaptability, and agility, and to facilitate the way things are interpreted differently or change independently of each other (Hultgren, 2012).

However, although we intend to encapsulate the information, grouping it by subject, we also intend, on the other hand, to integrate all the data into a single view (Hultgren, 2012).

Figure 5 - Data Vault EDW, (Hultgren, 2012)

A Data Vault represents the business processes with their ties, through the business keys, which are crucial to the model because they indicate how the business can access, connect, and integrate the systems (Inmon & Linstedt, 2015).

The methodology under study presents characteristics that adapt to changes in business and organizational processes. One of them is the separation of descriptive attributes, making the model more flexible and responsive to new changes - incremental build (Hultgren, 2012) - which allows data to be loaded in parallel and traced back to its source, enabling the exploitation of the data (Yessad, 2016).

The Data Vault approach is ideal for organizations that need to react to constant changes in business requirements and to integrate multiple sources when the business environment is very complex. So, a centralized DW is needed - one which takes advantage of the market, is flexible, increments the business, and can extract information for decision-making (Inmon & Linstedt, 2015).

2.5.1. Data Vault Fundamentals

The Data Vault 2.0 approach is based on three components, each with a specific function: Hubs, Links, and Satellites. Hubs consist of the natural business keys, Links are the natural business relationships and, finally, Satellites cover all the business context, descriptive data, and history (Linstedt & Olschimke, 2015).


Figure 6 - Data Vault EDW, (Hultgren, 2012)

The identification of three levels is required in the development of the modeling process of a Data Vault: first, the business keys and business concepts; second, the identification and modeling of the existing natural business relationships; and finally, the design of the correct attribute context for the creation of the Satellites (Hultgren, 2012).

The principal tasks needed when building a DW with the Data Vault approach are as follows (Hultgren, 2013):

1. Identify Business concepts;
1.1. Establish EWBK for Hubs;
1.2. Model Hubs;
2. Identify Natural Business Relationships;
2.1. Analyze Relationships Units of Work;
2.2. Model Links;
3. Gather context attributes to define keys;
3.1. Establish Criteria and design satellites;
3.2. Model Satellites.


Figure 7 - Data Vault EDW, (Hultgren, 2012)

2.5.1.1. Hubs

In operational systems, users access data through business keys, which refer to the business objects. The business keys are thus of central importance in identifying the business objects, which is why the Data Vault model separates them from the rest of the model (Linstedt & Olschimke, 2015).

The business keys are defined to identify, track, and locate information, and must be unique and have a very low propensity to change (Linstedt & Olschimke, 2015).

Hubs are the central pillar of the Data Vault model (Linstedt & Olschimke, 2015) and represent the core business concepts or business objects (Lans et al., 2015). Hub entities do not contain any descriptive information or foreign keys, and their cardinality must be 1:1 (Hultgren, 2013). The Hub table is essential in tracking the arrival of a new business key in the DW (Linstedt & Olschimke, 2015) and incorporates the business key(s) referring to the business object, which can be a composite key (Cox, 2014).

The Hub structure, Figure 8, is composed of the following attributes (Linstedt & Olschimke, 2015):


▪ Surrogate Key: based on the business key, it corresponds to the primary key of the Hub and improves lookup performance within the DW. It is also used as a foreign key, referenced in Link and Satellite entities;
▪ Business Key: this attribute is the central element of the Hub; it should be a unique index and can be a composite key used by the business object;
▪ Load Date: generated in the ETL process that loads the DW, it indicates when the business key first arrived in the DW. It allows tracing errors and finding technical load problems, which can affect data when loaded;
▪ Record Source: describes the master data source, or the origin of the source of the business key, allowing traceability of the information.

Figure 8 - Hub table, adapted from (Hultgren, 2018)
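To make this structure concrete, the following is a minimal T-SQL sketch of a Hub table; the Hub_Hotel name, the HotelCode business key, and the data types are hypothetical illustrations rather than the case-study schema.

-- Minimal sketch of a Hub table (hypothetical Hub_Hotel).
CREATE TABLE dbo.Hub_Hotel (
    HotelHashKey  CHAR(32)      NOT NULL,  -- surrogate (hash) key: primary key of the Hub
    HotelCode     NVARCHAR(50)  NOT NULL,  -- business key: unique across the enterprise
    LoadDate      DATETIME2     NOT NULL,  -- when the business key first arrived in the DW
    RecordSource  NVARCHAR(100) NOT NULL,  -- originating source system, for traceability
    CONSTRAINT PK_Hub_Hotel PRIMARY KEY (HotelHashKey),
    CONSTRAINT UQ_Hub_Hotel_BK UNIQUE (HotelCode)   -- the business key must be unique
);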

Hubs use a “unique list of business keys and provide a soft-integration point of raw data that is not altered from the source system but is supposed to have the same semantic meaning. The business keys in the same hub should have the same semantic granularity” (Linstedt & Olschimke, 2015).

However, in some cases, when multiple sources populate the Hub, the business key cannot be unique in the Hub context, so other identification attributes, called metadata, are used (Inmon & Linstedt, 2015). This metadata consists of two attributes, the record source and the load date. The first identifies and tracks the source system, while the second gives the arrival date and time of the business key in the DW (Cox, 2014).

The hash key is another attribute, used to reference (as a foreign key) the business object in the Link and Satellite elements of the Data Vault, to enhance the performance of the DW load and of the joins between the business keys in the model (Linstedt & Olschimke, 2015).

The hash key increases join speed and lookup performance in the Data Vault DW; it is based on the business key and becomes the primary key of the Hub (Linstedt & Olschimke, 2015).
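As an illustration, a minimal T-SQL sketch of a Hub table and of the hash-key computation is given below. The table and column names (H_Customer, Customer_BK, and so on) are hypothetical and only follow the structure described above; they are not taken from the case study.

-- Hypothetical Hub for a "Customer" business object (illustrative names).
CREATE TABLE H_Customer (
    Customer_HashKey CHAR(32)    NOT NULL PRIMARY KEY, -- MD5 hash of the business key
    Customer_BK      VARCHAR(50) NOT NULL UNIQUE,      -- natural business key
    Load_Date        DATETIME2   NOT NULL,             -- arrival time of the key in the DW
    Record_Source    VARCHAR(50) NOT NULL              -- originating source system
);

-- The hash key is typically derived from the trimmed, upper-cased business key,
-- e.g. with SQL Server's HASHBYTES function:
SELECT CONVERT(CHAR(32),
       HASHBYTES('MD5', UPPER(LTRIM(RTRIM('CUST-0001')))), 2) AS Customer_HashKey;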

2.5.1.2. Links

Another element of the Data Vault methodology is the Link entity, which represents the natural business relationships between Hubs (Lans et al., 2015) or sometimes with other Links (Hultgren, 2013).

These entities are responsible for modeling transactions, associations, hierarchies, and redefinitions of business terms (Linstedt & Olschimke, 2015), connecting two or more Hubs through the operational business processes that use business objects in the execution of their tasks (Inmon & Linstedt, 2015).

Like Hub entities, Links also have hash keys, computed from the combination of all the business keys involved, which replace the joins otherwise needed to reference the Hubs. This also helps the ETL jobs, when loading data from the staging area, to confirm that no duplicate Link entries represent the same relationship, instead of comparing all the Hubs' business keys with the Links' business keys (Linstedt & Olschimke, 2015).

The number of Hubs a Link connects defines the granularity of the Link; a new grain is added whenever a new Hub is added to a Link entity. The more Hubs a Link connects, the finer the granularity (Hultgren, 2012).

Links are instrumental in storing relationship records from the past, present, and future of the data. Each Link is composed of a hash key, which corresponds to the primary key of the Link, making it identifiable in the DW and ensuring the scalability of the Data Vault model (Linstedt & Olschimke, 2015).

The cardinality of the relationship is many-to-many. This characteristic makes the Link an associative entity, allowing many instances on both sides of the relationship (Linstedt & Olschimke, 2015). The Link contains the respective foreign keys of the Hubs (hash keys) and the metadata attributes (Load Date and Record Source), but no descriptive information (Hultgren, 2013).

The Link structure, Figure 9, is composed of the following attributes (Linstedt & Olschimke, 2015):

▪ Link surrogate key: combines all the business keys of the Link, making the identification of this entity easier and the joins faster;
▪ Load date: metadata attribute used for technical and informative reasons;
▪ Record source: metadata attribute recording the origin of the source;
▪ Hub surrogate key(s): foreign key(s) referencing the Hub entities.

Figure 9 - Link table, adapted from (Hultgren, 2018)

The many-to-many cardinality provides some advantages, especially the flexibility of the Links in the Data Vault model. If the business rules change, it is easy for developers to respond to the new requirements by connecting new Hubs to existing ones through Link entities, without re-engineering the whole model (Linstedt & Olschimke, 2015).

Link entities are a crucial element in the physical model because they absorb changes in business requirements and business rules without any impact on the existing (historical) data sets or on the existing processes (Hultgren, 2012).
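A minimal T-SQL sketch of such a Link is shown below, connecting the hypothetical H_Customer Hub from the previous section to an equally hypothetical H_Order Hub; the names are illustrative, not part of any prescribed standard.

-- Hypothetical Link between the Customer and Order Hubs.
-- The Link hash key is computed over the combination of both business keys.
CREATE TABLE L_Customer_Order (
    Customer_Order_HashKey CHAR(32)    NOT NULL PRIMARY KEY,
    Customer_HashKey       CHAR(32)    NOT NULL REFERENCES H_Customer (Customer_HashKey),
    Order_HashKey          CHAR(32)    NOT NULL REFERENCES H_Order (Order_HashKey),
    Load_Date              DATETIME2   NOT NULL, -- metadata: arrival time in the DW
    Record_Source          VARCHAR(50) NOT NULL  -- metadata: origin of the relationship
);

Note that the Link carries only hash keys and metadata; relating a new Hub later means creating a new Link table, not altering this one.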


2.5.1.3. Satellites

The last Data Vault model element is the Satellite, which contains the attributes of the Hubs (Lans et al., 2015). This entity provides all the context and descriptive information of the business object. It is possible to have many Satellites describing a single business key, but each Satellite can describe only one key, either a Hub or a Link (Hultgren, 2013).

Satellites can describe a business object, relationship, or transaction (Inmon & Linstedt, 2015), giving a particular context to Hubs and Links over a period of time (Linstedt & Olschimke, 2015).

A Satellite can depend on only one Hub or Link entity (Hultgren, 2012). It is recommended to have at least one Satellite entry for every Hub or Link key; otherwise, an outer join is required, which should be avoided due to performance and complexity problems (Linstedt & Olschimke, 2015).

One of the functions of the DW is to provide historical data and, in the Data Vault 2.0 model, the Satellites store every change to the raw data, giving a historical view of it (Hultgren, 2012).

The Satellite structure, Figure 10, is composed of the following attributes (Linstedt & Olschimke, 2015):

▪ Parent surrogate key: corresponds to the hash key of the parent Hub or Link and is part of the primary key, together with the load date attribute, providing the context and the date and time of the change;
▪ Load date: indicates the date and time at which a change in the Satellite entries occurred, and is also part of the primary key. The date and time refer to the moment the record is inserted into the DW. The load date is a metadata attribute;
▪ Record source: hard-coded and applied to maintain traceability of the arriving data set; it should indicate the master data source. This metadata attribute is the key to maintaining the auditability of the DW;
▪ Load end date: indicates the date and time when the Satellite entry becomes invalid. It is the only updatable attribute in a Satellite, set every time a new entry for the same key is loaded from the source system.

Figure 10 - Satellite table, adapted from (Hultgren, 2018)

A good practice when creating Satellite entities is to split the data among multiple Satellites, so that not all the descriptive information is stored in a single Satellite. It is therefore recommended to split the descriptive attributes by source system, which means that each incoming data set is kept in an individual Satellite, dependent on its parent (Hub or Link) (Linstedt & Olschimke, 2015).

The raw data from a denormalized source data set would be distributed in different Satellites to be kept dependent on the appropriate business object, relationship, or transaction. This aspect provides some benefits:

▪ It allows developers to add new sources without changing existing Satellite entities;
▪ It removes the need to alter the incoming data to fit existing structures;
▪ It enables the Data Vault model to keep the history of the source system and consequently keep the system auditable;
▪ It maximizes load parallelism (MPP) because there is no competition for the Satellite; the data can be inserted into the Satellite immediately, without taking the arrival of data from other systems into account;
▪ It allows the integration of real-time data without the need to integrate it with raw data loaded from batches; there are no dependencies across multiple systems that could force the system to have both types of data ready at the same time.

Another good practice is to split the data by rate of change, storing the attributes that are frequently changing in one Satellite and the attributes that change less frequently into another. This procedure is useful to separate these kinds of attributes in order not to consume unnecessary storage in new records (Linstedt & Olschimke, 2015).
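Continuing the hypothetical example from the previous sections, a minimal T-SQL sketch of a Satellite on the H_Customer Hub could look as follows; the descriptive columns are illustrative only.

-- Hypothetical Satellite with descriptive attributes of the Customer Hub.
CREATE TABLE S_Customer_Details (
    Customer_HashKey CHAR(32)     NOT NULL REFERENCES H_Customer (Customer_HashKey),
    Load_Date        DATETIME2    NOT NULL,        -- part of the primary key
    Load_End_Date    DATETIME2    NULL,            -- the only updatable attribute
    Record_Source    VARCHAR(50)  NOT NULL,        -- metadata for auditability
    Customer_Name    VARCHAR(100) NULL,            -- descriptive attributes follow
    Customer_City    VARCHAR(50)  NULL,
    PRIMARY KEY (Customer_HashKey, Load_Date)      -- parent key + load date
);

A second Satellite on the same Hub (for example, for frequently changing attributes, or for a second source system) would simply be another table with the same key structure.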

2.5.2. Data Vault Architecture

The primary purpose of an EDW is to collect and make available useful information for the business core of the organization, in which the data is aggregated, summarized, and consolidated according to the context of the business (Linstedt & Olschimke, 2015).

Data Vault modeling corresponds to a hybrid approach, where its architecture is composed of three layers (Linstedt & Olschimke, 2015):

▪ The staging area, which stores all the raw data provided by the source systems;
▪ The EDW, which is modeled with the Data Vault 2.0 Ensemble approach;
▪ The information delivery layer, which corresponds to the information mart.

The EDW layer incorporates three vaults: the Metrics Vault, which contains runtime information; the Business Vault, which applies business rules to the raw data to transform it into relevant information (information mart); and the Operational Vault, which stores data fed from operational systems into the DW (Hultgren, 2012).

The Raw Data Vault incorporates all unfiltered data from the operational data sources that are afterward loaded into Hubs, Links, and Satellites entities, through the business keys (Hultgren, 2012).


The Business Vault corresponds to an extension of the Raw Vault, applying business rules, denormalizations, calculations, and other query-assistance functions to facilitate user access and reporting (Hultgren, 2012).

Figure 11 - Data Vault Architecture, (Linstedt & Olschimke, 2015)

Figure 11 presents the Data Vault 2.0 architecture, which contains the three layers mentioned previously. The flow integrates the source data provided by the operational systems into the staging area layer. Through ETL tools, the EDW layer is loaded, and the business requirements and rules of the organization are applied in the Business Vault. The consolidated data in the Business Vault serves as a source for the information mart layer, allowing end-users to explore the data and perform reporting (Hultgren, 2012).

The Data Vault 2.0 architecture cannot be directly accessed by end-users (Kambayashi, Winiwarter, & Arikana, 2002), so the information mart layer is provided.

The information mart provides subject-oriented information, which can be represented as a star schema or as multidimensional OLAP cubes, to make reporting easy (Linstedt & Olschimke, 2015).

Other examples of information marts are the Error Mart, a central location for errors in the DW, and the Meta Mart, also a central location, but for metadata. These two types of information marts are not rebuilt from the Raw Data Vault or any operational data source. End-users, such as administrators, use these marts to analyze errors in the ETL processes when loading the DW and to inspect the metadata collected for the DW, in order to trace the data sources (Linstedt & Olschimke, 2015).


2.5.3. Benefits, disadvantages, and limitations of the Data Vault Approach

When we talk about building a DW, it is necessary to look at the factors that make the final product efficient: integration, optimization, historization, and agility with respect to the requirements of the specific organizational business.

The architecture is agile and flexible; therefore, this approach can be used when the organization wants to integrate diverse sources and complex data. It allows the management and storage of historical data, data traceability, and adaptation to business changes where more requirements can be added, and it provides a central enterprise data view (Inmon & Linstedt, 2015).

Compared with other dimensional models used to build an EDW, the Data Vault 2.0 approach has advantages at three levels: business, project, and architecture (Hultgren, 2018).

At the business level, this methodology is oriented to the business, being an accessible model for business analysts to understand that quickly adapts to new business needs. The data is traceable, allowing the storage of fully auditable data, and the EDW can assimilate data in real time. The model quickly adapts to changes in requirements or business rules, or even to newly added sources, without high implementation costs, and finally provides a DW with a lower total cost of ownership (TCO) (Hultgren, 2018).

Data Vault projects mostly follow an agile methodology, in order to lower risks and allow multiple deliverables. An essential technical benefit is that the model can be built incrementally as new business needs appear, without compromising the architecture, while supporting terabytes and petabytes of data (Linstedt, 2015).

The architecture is characterized by parallel loading and by the possible expansion of the model, reaching large sizes and remaining applicable to emerging architectures. Data Vault is a data-based architecture, typically derived from transactional systems (Linstedt, 2010).

However, it does not guarantee the quality of the data or the type of information obtained, because most of the time the data from the sources need data quality transformations.

In the Data Vault 2.0 approach, the use and quality of business information are not considered, nor can the approach discern whether the information is correct or wrong, because that depends on the business perspective (Linstedt, 2010). The quality of the information is not supported by this data architecture and has to be managed by a quality management team.

When we build a Data Vault, we face some problems, both at the business and technical level.

The Data Vault presents some limitations at the business level because the data cannot be accessed by end-users; it is only used by data experts capable of using data mining and analytics tools. The data is not cleaned, and its quality is not confirmed; an initial work effort is mandatory, and often, at the beginning of the implementation, business analysts consider that they do not need data backups (Hultgren, 2018).

The business churn is more important than the elegance of the model (Hultgren, 2012). In the initial analysis, a focus on the business processes and on the data from the sources is required, making business analysts responsible for the reliability of the analyses. Agreement between the elements of the several business areas is necessary before implementing the Data Vault architecture (Linstedt & Olschimke, 2015).

Regarding the technical problems of this methodology, one of the most common is that the Data Vault model requires too many joins for querying, which degrades query performance and makes ad-hoc access cumbersome (Linstedt & Olschimke, 2015). The methodology is designed for MPP computing rather than SMP computing, which is not a clustered architecture (Kambayashi et al., 2002).
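To make the join overhead concrete, consider a hedged sketch over the hypothetical Hub, Link, and Satellite tables from section 2.5.1: retrieving a single descriptive attribute from each side of one relationship already requires five joins, plus a current-record filter per Satellite (S_Order_Details is assumed to exist analogously to S_Customer_Details).

-- Ad-hoc question: current customer names together with a current order attribute.
SELECT cd.Customer_Name,
       od.Order_Total
FROM H_Customer hc
JOIN S_Customer_Details cd ON cd.Customer_HashKey = hc.Customer_HashKey
                          AND cd.Load_End_Date IS NULL      -- current record only
JOIN L_Customer_Order   lo ON lo.Customer_HashKey = hc.Customer_HashKey
JOIN H_Order            ho ON ho.Order_HashKey    = lo.Order_HashKey
JOIN S_Order_Details    od ON od.Order_HashKey    = ho.Order_HashKey
                          AND od.Load_End_Date IS NULL;     -- current record only

In a star schema, the equivalent question is typically a single join between a fact table and a dimension, which is why information marts are placed on top of the Data Vault.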

2.5.3.1. Data Vault and MPP Computing

MPP (Massively Parallel Processing), as the name indicates, allows processes to run in parallel on all associated machines. MPP systems are based on a divide-and-conquer architecture: the work is split across the machines, and the results are gathered afterward (Linstedt, 2010).

This architecture allows the parallelism of the activities to be performed, splitting the work into several parts using parallel processing. When a single result is expected, a coordinating process waits for all the activities involved to finish and integrates their partial results into one output (Daeng Bani, Suharjito, Diana, & Girsang, 2018).

The idea of using MPP computing in the Data Vault is to apply vertical partitioning, dividing the data set by a specific column through the Hub components, which enables distributing the data over physical hardware without too much effort (Linstedt, 2010).

A good practice to improve computation, memory, and query performance in a DW whose volume constantly changes and grows is the compression of columns, pages, or rows, allowing the reduction of repeated values and increased coverage of the index (Daeng Bani et al., 2018).
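As an illustration of these options on SQL Server (the table name reuses the hypothetical Satellite from section 2.5.1.3):

-- Page-level compression reduces the storage taken by repeated values.
ALTER TABLE S_Customer_Details REBUILD
    WITH (DATA_COMPRESSION = PAGE);

-- A nonclustered columnstore index compresses by column and speeds up
-- analytical scans over the Satellite.
CREATE NONCLUSTERED COLUMNSTORE INDEX ix_S_Customer_Details_cs
    ON S_Customer_Details (Customer_HashKey, Load_Date, Customer_City);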

Figure 12 - Parallel load in the Data Vault 2.0 approach, (Hultgren, 2012)


2.5.4. Comparison with other dimensional models

When we compare dimensional models with the Data Vault model, we realize that all of them have advantages and drawbacks.

However, an organizational DW needs to satisfy the business needs and respond to the requirements and to changes in them.

The differences between the various dimensional models, regarding the objects, relationships, attributes, and identifiers are presented below.

Model | Object | Relationship | Attribute | Identifier
Normalized model | Relation | Foreign Key | Domain | Primary Key
Data Vault model | Hub | Link | Satellite | Business/Primary Key
Anchor model | Anchor/Knot | Tie | Attribute | Primary Key
Dimensional model | Dimension | Fact | Attribute | Business/Primary Key

Table 4 – Different concepts in different Data Models (Bojičić et al., 2016)

2.5.4.1. Comparing Data Vault approach with Inmon and Kimball’s approaches

The following table describes the main differences, by category, between Inmon's and Kimball's approaches and the Data Vault 2.0 Ensemble approach.

Category: Storage utilization
▪ Inmon: Data is stored in a 3NF structure that closely resembles the source system, with the addition of timestamp keys that allow capturing changes over time. The structure is not optimized for direct queries and requires dimensional data marts for reporting. These data marts need to be persisted (physical) for all but the smallest data volumes for performance reasons, which essentially doubles storage requirements compared to the Kimball approach.
▪ Data Vault: Data is stored in a 3NF hub-link-satellite structure with several time-stamped copies of data that capture changes over time. The structure is complicated to query directly and requires dimensional data marts for reporting. These data marts need to be persisted (physical) for all but the smallest data volumes for performance reasons, which essentially doubles storage requirements compared to the Kimball approach.
▪ Kimball: Data is stored in the final consumption format, optimized for reporting. Changes are tracked via slowly changing dimensions.

Category: ETL complexity
▪ Inmon: The model requires two ETL processes: loading from source systems and building reporting data marts.
▪ Data Vault: The model requires two ETL processes: loading from source systems and building reporting data marts.
▪ Kimball: The model requires a single ETL process that loads the final data model used for reporting.

Category: ETL scalability and loading performance
▪ Inmon: The architecture supports the loading of multiple data sources in parallel. Within each source, table loads need to be sequenced based on dependencies in the source 3NF model.
▪ Data Vault: The architecture supports the loading of multiple data sources in parallel. Within each source, the model structure supports better parallelism (hubs loaded first, then links, then satellites).
▪ Kimball: The architecture supports the loading of multiple subject areas in parallel (dimensions first, then facts). The two-tier architecture requires a single ETL layer, which delivers faster processing.

Category: Auditing, traceability, and compliance
▪ Inmon: Historical information is captured by inserting new records each time source data changes. Change tracking is easy, as the data warehouse structure closely resembles the source.
▪ Data Vault: Historical changes are captured by inserting new links and satellites. It provides the most detailed and auditable capture of changes. Change tracking is complex due to the highly normalized structure.
▪ Kimball: Uses the concept of slowly changing dimensions to track historical changes. Requires the business to identify the attributes requiring tracking prior to loading. Adding new attributes is possible but will not re-create historical changes.

Category: Modeling
▪ Inmon: Modeling is generally not complex, as the structure essentially copies the source system with some denormalizations and the addition of timestamps to each primary key to track changes. Requires additional modeling for dimensional data marts.
▪ Data Vault: Modeling can be complex, as the link/satellite structure can be modeled in multiple different ways. Requires additional modeling for dimensional data marts.
▪ Kimball: Modeling complexity varies by industry/subject area. A well-established methodology exists, with guidelines and modeling frameworks for each subject area type by industry. Since the model itself is exposed for final consumption, it does not need additional modeling to capture source data, as in the Inmon and Data Vault approaches.

Category: Query performance
▪ Inmon: A direct query is very slow due to the 3NF structure of the data. Requires data marts (virtual or materialized) for querying and reporting.
▪ Data Vault: A direct query is very slow due to the highly normalized structure of the data. Every join requires a date component, which makes queries very complex. Requires data marts (virtual or materialized) for querying and reporting.
▪ Kimball: The model is designed for the highest query performance, denormalizing dimensions containing filter attributes and hierarchies and keeping fact data in 3NF for optimal performance and storage. Large fact tables can be easily partitioned (typically on the date key) and indexed to support high performance. The newest column-store in-memory indexing technologies work very well for fact tables.

Table 5 – Comparison of Inmon, Data Vault and Kimball approaches (adapted from Orlov, 2014)


4. CASE STUDY

The case study conducted aims to compare the traditional DW model, Star Schema, with the Data Vault 2.0 model, present the limitations that the Data Vault 2.0 approach faces, and finally, propose improvements in this approach.

So, in order to perform this case study, it was necessary to obtain data from a real organization, with real business processes.

The case study is related to a Hotel Group, in which the data were provided by a Consulting Company, to support the analysis under study and gather results for this specific scenario.

This is an illustrative case, using the data provided, which means that it can be extrapolated to other organizations' projects by adjusting the business core.

The case study is based on one entity-relationship (ER) model containing Hotel chain data, which is divided into two business concepts, presented in the next chapter.

Therefore, briefly, the idea is, through the ER model provided, to capture the business concepts of the organization and first create a Star Schema model and a Data Vault model, in order to compare the differences studied between the two approaches.

Secondly, the limitations of the Data Vault 2.0 approach will be presented with the model performed to support and substantiate the literature review.

Finally, from the limitations found, the Data Vault model will be optimized to meet the business needs better, especially in the optimization of joins, which makes it difficult for end-users to use the model.

The Hotel Group provides one ER model, which is currently the structure where its business processes data is stored. This organization needs to obtain more timely and useful information to improve the management of their bookings and services, their direct marketing campaigns, and enhance the support for the decision-making of their Top Management. Besides, this organization has another business process that is not associated with the current ER model.

The information is disorganized and in several different sources. Hence, it is necessary to create a DW solution to have a centralized, subject-oriented, integrated and non-volatile repository, allowing the harmonization and standardization of information in the Hotel chain, capable of providing detailed and pertinent information to the business in real-time.

A SWOT (Strengths, Weaknesses, Opportunities, and Threats) analysis is presented to understand the technological needs of this specific organization better and to gain insight into the company's strategic diagnosis.


Strengths: online bookings technological platform; current system documentation.
Weaknesses: decentralized and inconsistent information; redundancy of information; implementation costs; adaptation of information; single supplier; large-volume migration of data from an obsolete platform.
Opportunities: adapting to change; ease of access to information; speed of information; greater control; better data processing; possibility of integration with other applications; process integration; possibility of meeting new business requirements; quality of information for decision-making.
Threats: competition; loss of relevant information; inconsistency of data; obsolete system.

Figure 13 - SWOT analysis

It is important to note that this case study is related to a specific organization; however, all companies have similar business needs when it comes to gaining insights and patterns from their customers in real time to improve the decision-making process.

When an organization wants to build a DW, it usually pays attention to the necessary implementation and maintenance costs. Besides, the model needs to support future changes due to the dynamics of the business, adapt to business changes, be auditable, and contain historical data of the main processes and of the customer information. Moreover, the quantity of data received by organizations is nowadays on a large scale, so they need to be able to handle it.

The primary data source is described below, and a data dictionary was produced to provide better knowledge of the data used in the case study.

4.1. DATA SOURCES AND DATA COLLECTION

The central data source provided contains information about the bookings and services of the hotel chain.


Figure 14 - ER model data source from the Hotel Chain

The ER model, presented in Figure 14, corresponds, currently, to the operational system of the Hotel Group, which receives a large amount of data. It is a structure unable to meet business needs.

The relational model is mainly used for the operational and transactional systems of organizations because they process and perform many transactions, which most of the time are executed concurrently. Transactions are related to the business processes of organizations, where data is continuously being inserted and updated. So, the relational model needs to trace the execution of the transactions in the system and constitute a business process flow model (Varge, 2001).

4.1.1. Business Entities

In order to understand what information this data source contains, the main business entities stored in the ER model are described below.

Type of Information | Description
Bookings | This entity stores all information regarding the bookings of each customer of each hotel in the group: the hotel, the room, room type, guest, booking start and end dates, the booking reservation platform, the type of trip made by the customer, booking status, discount on the booking, and the rating. It also stores whether the booking was canceled and the respective reason for cancellation.
Trip types | This entity collects the type of trip made by the guests when they reserve the booking.
Platforms | This entity stores the description of the platform where the guests made the booking reservation. It corresponds to the partners associated with the hotel chain.
Cancellation Detail | This entity collects the description detail of the cancellation of the booking, if it occurs, made by the guest.
Discount | This entity stores the discount on the booking, if it exists.
Ratings | This entity collects the rating of each booking, provided by the customers.
Booking Status | This entity stores the booking status associated with each booking of each hotel.
Hotel | This entity collects the information of each hotel in the group, such as the hotel name, address, post-code, and the derived city.
Country | This entity stores information on each country associated with each hotel in the group, such as the country name and the country currency, in order to calculate the exchange rate to USD.
Room | This entity collects information about the rooms in each hotel.
Room type | This entity stores data related to the hotel room type, such as the room type description, rate, and capacity of the room.
Service | This entity collects information about the services provided by the hotel: the service description, service type (classification), and the associated sector.
Sector | This entity categorizes the different services provided by each hotel into sectors.
Booking Service | This entity links the service types provided by the hotel to the guest booking, collecting service date and rating data.
Hotel Service | This entity associates the services to each hotel in the group, giving the service type and cost, in order to make service profit calculation easier.
Guest | This entity collects guest information, such as guest name, address, city, age, gender, job, education, country, marital status, and the number of children. This information is useful to characterize the customer.
Marital Status | This entity stores the marital status description linked to each customer.
Education | This entity stores the education description linked to each guest.

Table 6 - Business entities of the ER model

4.1.2. Data dictionary of ER model

The qualitative data collected was acquired from a SQL Server database, with the structure shown above.

With the business entities described, it is also essential to understand which attributes characterize them. The attributes corresponding to the business-process context of this organization are presented below. The table presents the stored data and their relevance, to achieve a deeper understanding of the hotel chain business core.

Nomenclature | Description | Data type | Allows Nulls
Id_Booking | Unique identifier to identify the Hotel bookings | Int | No
dtBookingStart | Identifies the booking start date | Date | No
dtBookingEnd | Identifies the booking end date | Date | No
dtCancellation | Identifies the cancellation date of the booking (only for canceled bookings) | Date | Yes
Id_Service | Unique identifier to identify the Hotel services | Int | No
dtServiceDate | Identifies the date that the service was consumed | Date | No
Id_BookingStatus | Identifies the booking status (booked or canceled) | Int | No
Id_Cancellationdetail | Unique identifier to identify the booking cancellation detail | Int | No
dsCancellationDescription | Cancellation detail description | Varchar(50) | No
Id_Country | Unique identifier to identify the Hotel country | Int | No
dsCountryCurrency | Hotel country currency description | Char(10) | Yes
dsCountryName | Identifies the hotel country name | Varchar(50) | Yes
Id_Discount | Unique identifier to identify the discount campaigns | Int | No
dsDiscountDescription | Hotel discount campaigns description | Varchar(50) | No
nrDiscountPercentage | Percentage of applied discount depending on the description | Decimal(18,6) | No
Id_Education | Unique identifier to identify the education level of customers | Int | No
dsEducationDescription | Education level description of customers | Varchar(50) | Yes
Id_Guest | Unique identifier to identify the clients | Int | Yes
dsGuestName | Identifies the client name | Varchar(50) | Yes
dsGuestAddress | Identifies the client address | Varchar(50) | Yes
dsGuestCity | Identifies the city of the client | Varchar(50) | Yes
nrGuestAge | Identifies the client age | Int | Yes
flGuestGender | Identifies the client gender | Bit | Yes
dsGuestOccupation | Identifies the client job | Varchar(50) | Yes
dsGuestCountry | Identifies the country of the client | Varchar(50) | Yes
nrNumberOfChildren | Identifies the number of children that the client has | Int | Yes
Id_MaritalStatus | Identifies the client marital status | Int | No
dsMaritalStatusDescription | Client marital status description | Varchar(50) | Yes
Id_Hotel | Unique identifier to identify the Hotels | Int | No
ds_Hotel_Name | Identifies the Hotel name | Varchar(50) | No
dsHotelAddress | Identifies the Hotel address | Varchar(50) | Yes
dsHotelPostCode | Hotel postal code | Varchar(50) | Yes
dsHotelCity | Identifies the Hotel city | Varchar(50) | Yes
nrServicePrice | Service price | Decimal(18,6) | Yes
nrServiceCost | Service cost | Decimal(18,6) | Yes
dsServiceDescription | Description of the hotel service | Varchar(50) | Yes
flServiceType | Identifies if it is an outdoor or an indoor service | Bit | Yes
Id_Sector | Unique identifier to identify the sector associated with the service | Int | No
dsSectorDescription | Description of the Sector | Varchar(50) | Yes
Id_Platform | Unique identifier for the type of platform on which the client made the booking | Int | No
dsPlatformDescription | Identifies the name of the platform/partners | Varchar(50) | No
Id_Rating | Unique identifier to identify the booking rating | Int | No
nrRatingValue | Identifies the rating value (1 to 5) | Int | No
dsRatingDescription | Description of the rating according to the value | Varchar(50) | Yes
Id_Room | Unique identifier for Hotel rooms | Int | No
nrRoomFloor | Identifies the Hotel rooms floor | Int | No
Id_Room_Type | Unique identifier for the Hotel room type | Int | No
dsRoomTypeDescription | Description of the Hotel room type | Varchar(50) | No
nrRoomTypeRate | Identifies the rate per Hotel room type | Decimal(18,6) | Yes
nrRoomCapacity | Identifies the capacity by room type | Int | Yes
Id_TripType | Unique identifier to identify the type of trip of the client | Int | No
dsTripTypeDescription | Description of the trip type made by the client | Varchar(50) | Yes

Table 7 - Case study attributes, data dictionary of the ER model


4.2. DIFFERENCES BETWEEN A RELATIONAL MODEL AND A DIMENSIONAL MODEL

Before starting to present the dimensional models for this case study, it is important to discuss the differences between a relational model and a dimensional one.

The business support of the organization under study is based on an ER model, which makes it challenging to extract useful information, insights, and relevant patterns about their customers, to obtain KPIs for the business, or even to support decision-making with proper information. All these problems are caused by the implemented model, which is not capable of supporting the business needs.

This is why the entities presented in section 4.1.1 are associated with each other. The presented model constitutes fundamental business entities (strong, independent entities), to which weak (dependent) entities may be associated. This model emphasizes the relationships between the entities, focusing on the execution of the transactions necessary to the business process (Varge, 2001). Besides, it constitutes a normalized model, in 3NF.

However, in the dimensional model, the interest is in presenting the effects of the transactions of the business processes, in order to get relevant measures and useful information for business decisions.

Although dimensional models are instantiated in database management systems, they differ from 3NF models (ER models), especially in the normalization level and in how data redundancies are handled (Kimball & Ross, 2013).

The dimensional models correspond to a technique to represent analytic data, which delivers understandable data to business users with fast query performance (Kimball & Ross, 2013).

Figure 15 below shows the main differences between these two models:

Figure 15 - Main differences between relational and dimensional modeling (Varge, 2001)


4.2.1. Traditional DW model - Star schema

The Star Schema model, created by Kimball in 1998, corresponds to a well-known dimensional model, which resembles a “star” structure (Kimball & Ross, 2013).

This model is composed of a fact table, which represents the center of the star, with smaller dimension tables around the central table (fact) (Moody & Kortink, 2000), as presented in Figure 16.

Figure 16 - Star Schema model, (Moody & Kortink, 2000)

The fact table contains measures and KPIs related to the business core of the organization, and the dimension tables store the data used to aggregate the data in the fact table. The cardinality between the fact table and the dimension tables is one-to-many, and the primary key of the fact table is composed of the primary keys of the dimensions (Kimball & Ross, 2013).

In brief, the fact table collects all the aggregations and business rules, like metrics and KPIs relevant to the business, which can be additive, semi-additive, or non-additive measures. On the other hand, the dimension tables store the descriptive attributes associated with the business objects and processes, answering the following questions: “who, what, where, when, how, and why.”

From the ER model data source presented in section 4.1, a Star Schema model was designed, in order to demonstrate the main differences to the Data Vault 2.0 Ensemble approach.

The star schema created contains two fact tables and eleven dimension tables, split between Figure 17 and Figure 18. It should be noted that four dimensions are shared between the two fact tables.


Figure 17 - Star schema model for Bookings Management

Figure 18 - Star schema model for Services Management


4.2.1.1. Star Schema Fact tables

Two Fact tables were created, one for the Bookings and another for the Services. Table 8 presents the main measures that each Fact table stores, according to the business under study.

In section 4.2.1.3 – Star Schema ETL process, the calculations to achieve these measures are presented, as well as the load of the Fact tables.

Booking Fact measures:
▪ nrStandardRate: the total value of the booking per guest. This value is the price of the booking, without the discount rate, if it exists;
▪ nrDiscountRate: the final price of the booking with the associated discount rate per guest, if it exists. Otherwise, the value is equal to that represented in the nrStandardRate measure;
▪ nrDayDuration: the total number of days of a guest's booking.

Service Fact measures:
▪ nrStandardRate: the total value of the service(s) paid by a guest during their booking;
▪ nrServiceCost: the value of the service costs.

Table 8 – Fact Tables Booking and Service measures

4.2.1.2. Star Schema Dimensions

The Star Schema model presented, as mentioned before, contains eleven dimensions:

▪ Hotel dimension;
▪ Cancellation detail dimension;
▪ Discount dimension;
▪ Booking status dimension;
▪ Trip type dimension;
▪ Date dimension;
▪ Room type dimension;
▪ Guest dimension;
▪ Platform dimension;
▪ Rating dimension;
▪ Service dimension.

Between the Fact Booking table and the Fact Service table, four common dimensions exist, linked to the two Fact tables: the Guest, Hotel, Date, and Rating dimensions.

Next, each dimension is presented, with the respective description of the attributes that compose it.

All the Dimension tables have as primary key a surrogate key, generated incrementally by the DW, which corresponds to a unique identifier for each Dimension entry. The surrogate key is not derived from the natural key, which corresponds to the business key of the Dimension table used to connect with the source. The next chapter presents how the Dimension tables are loaded using these types of keys, which are also needed to load the Fact tables.

Hotel Dimension

The Hotel Dimension contains all the Hotel information, descriptive attributes, related to each Hotel in the Hotel Chain.

Attribute | Description
Sk_id_Hotel | Surrogate key of the Hotel dimension table; corresponds to the primary key.
Nk_Id_Hotel | Natural key provided by the ER model; used to connect the two fact tables.
dsHotelName | Descriptive attribute, which describes the name of each Hotel of the Group.
dsHotelAddress | Descriptive attribute, which indicates the address of each Hotel in the Group.
dsHotelPostCode | Descriptive attribute, which indicates the post-code of each Hotel in the Group.
dsHotelCity | Descriptive attribute, which indicates the city of each Hotel in the Group.
dsCountryName | Descriptive attribute, which indicates the country of each Hotel in the Group.
dsCountryCurrency | Descriptive attribute, which indicates the currency of each Hotel in the Group.
nrCapacity | Descriptive attribute, which describes the total capacity of rooms for each Hotel in the Group.

Table 9 - Hotel dimension attributes


Cancellation Dimension

The Cancellation Dimension is related to the reasons that led customers to cancel their bookings.

Attribute | Description
Sk_Id_CancellationDetail | Surrogate key of the Cancellation dimension table; corresponds to the primary key.
Nk_Id_CancellationDetail | Natural key provided by the ER model.
dsCancellationDetailDescription | Descriptive attribute, which describes the reasons for the cancellation of a customer booking.

Table 10 - Cancellation dimension attributes

Discount Dimension

The Discount Dimension is related to the discounts associated with the bookings. Guests can use discounts directly or through the partners of the Hotel Group, which can offer discounted prices.

Attribute | Description
Sk_Id_Discount | Surrogate key of the Discount dimension table; corresponds to the primary key.
Nk_Id_Discount | Natural key provided by the ER model.
nrDiscountPercentage | Descriptive attribute, which describes the percentage of the discount associated with each booking.
dsDiscountDescription | Descriptive attribute, which indicates the description of the discount.

Table 11 - Discount Dimension attributes

Booking status Dimension

The Booking Status Dimension represents the current status of the associated booking, indicating whether the booking is booked or canceled.

Attribute | Description
Sk_Id_BookingStatus | Surrogate key of the Booking Status dimension table; corresponds to the primary key.
Nk_Id_BookingStatus | Natural key provided by the ER model.
dsBookingStatusDescription | Descriptive attribute, which describes the status of the booking.

Table 12 - Booking Status dimension attributes


Trip type Dimension

The Trip Type Dimension describes the type of trip taken by the guests, which can have the following descriptions: fun, holiday, music, finalist trip, bachelor party, business trip, family trip, festival, competition, alone, others, not stated, and conference.

Attribute | Description
Sk_Id_TripType | Surrogate key of the Trip Type dimension table; corresponds to the primary key.
Nk_Id_TripType | Natural key provided by the ER model.
dsTripTypeDescription | Descriptive attribute, which describes the type of trip made by guests in each booking.

Table 13 -Trip type dimension attributes

Date Dimension

The Date dimension represents the temporality of the bookings, in which the data is partitioned by day, month, and year.

Attribute | Description
Id_Date | Natural key of the Date dimension table; corresponds to the primary key.
dtFullDateAlternateKey | Represents the full date, separated by '-'.
nrDay | Contains the day extracted from the date.
nrMonth | Contains the month number extracted from the date.
dsMonthName | Contains the month name extracted from the date.
nrYear | Contains the year extracted from the date.

Table 14 - Date dimension attributes

Room type Dimension

The Room type Dimension presents the types of rooms associated with each Hotel in the Group. The room types are composed of five categories: Standard Single Bed, Standard Twin Bed, Deluxe Double Bed, Suite Room, and Penthouse.


Attribute | Description
Sk_Id_RoomType | Surrogate key of the Room Type dimension table; corresponds to the primary key.
Nk_Id_RoomType | Natural key provided by the ER model.
dsRoomTypeDescription | Descriptive attribute, which describes the type of the room.
nrRoomTypeRate | Descriptive attribute, which indicates the rate per room type.
nrRoomTypeCapacity | Descriptive attribute, which indicates the total capacity per room type.
nrRoomQuantity | Descriptive attribute, which indicates the number of guests per room type.

Table 15 - Room type dimension attributes

Guest Dimension

The Guest Dimension stores all the characteristics of the customers of the Hotel Group, allowing each Hotel to know its customers and categorize them by age, country, gender, number of children, and other attributes.

Attribute | Description
Sk_Id_Guest | Surrogate key of the Guest dimension table; corresponds to the primary key.
Nk_Id_Guest | Natural key provided by the ER model.
dsGuestName | Descriptive attribute, which contains the name of the customer of the Hotel Group.
dsGuestAddress | Descriptive attribute, which contains the address of the customer.
dsGuestCountry | Descriptive attribute, which contains the country of the customer.
dsGuestCity | Descriptive attribute, which contains the city of the customer.
nrGuestAge | Descriptive attribute, which contains the age of the customer.
dsGuestGender | Descriptive attribute, which contains the gender of the customer.
dsGuestEducation | Descriptive attribute, which contains the education description of the customer.
dsGuestOccupation | Descriptive attribute, which contains the job of the customer.
dsGuestMaritalStatus | Descriptive attribute, which contains the marital status of the customer.
nrGuestChildrenNumber | Descriptive attribute, which contains the number of children of the customer.

Table 16 - Guest dimension attributes

Platform Dimension

The Platform Dimension refers to the ways customers reserve their bookings. A booking can be made through platforms such as Booking, Agoda, Trivago, Momondo, E-Dreams, GetaRoom, and Prestigia, or on the Hotel website and through physical reservation.

Attribute | Description
Sk_Id_Platform | Surrogate key of the Platform dimension table; corresponds to the primary key.
Nk_Id_Platform | Natural key provided by the ER model.
dsPlatformDescription | Descriptive attribute, which describes the platform on which the customer reserves the Hotel booking.

Table 17 - Platform dimension attributes

Rating Dimension

The Rating Dimension presents the ratings of the bookings and of the services, provided by the guests of the Hotel. The ratings are categorized into five assessments: Awful, Bad, Average, Good, and Excellent, represented by the numbers 1 to 5, respectively.

Attribute | Description
Sk_Id_Rating | Surrogate key of the Rating dimension table; corresponds to the primary key.
Nk_Id_Rating | Natural key provided by the ER model.
nrRatingValue | Descriptive attribute, which indicates the rating value of the bookings or services.
dsRatingsDescription | Descriptive attribute, which describes the rating associated with the bookings or services.

Table 18 - Rating dimension attributes


Service Dimension

The Service Dimension presents all the descriptive information regarding the services of the Hotel. The Hotel services include the following: Pool, Heated Pool, Free Wifi, Wifi, Room Service, Restaurants, 24-hour service, Bar, Garden, Golf Course, Spa, Jacuzzi, Thermal Bath, Concierge, Kids Space, Parking Space, Valet, Airport Service, Cleaning Service, Gym, Air Condition, Cable TV, Reduced Mobility, Bike Rental, Babysitting Service, Electric Car Charging, ATM, Sport Courts, and Casino.

Attribute | Description
Sk_Id_Service | Surrogate key of the Service dimension table; corresponds to the primary key.
Nk_Id_Service | Natural key provided by the ER model.
dsServiceDescription | Descriptive attribute, which describes the name of the service.
dsServiceType | Descriptive attribute, which describes the service type associated with the Hotel services.
dsSectorDescription | Descriptive attribute, which describes the sector to which each service corresponds.

Table 19 - Service dimension attributes

4.2.1.3. Star Schema ETL process

The ETL process is a crucial tool for data optimization and integration. It provides valuable benefits for the construction of the DW: it ensures significant data quality and can solve highly complex problems by using metadata, which can be generated and maintained automatically, preventing incorrect-information problems at the end of the project. It performs well when extracting, transforming, and loading large volumes of data, facilitates connectivity to multiple data sources, and provides stability and security.

As mentioned in chapter 3.1.1.1 – Case Study Methods and Tools, the traditional Star Schema DW model was designed using Microsoft tools: SQL Server Management Studio was used to design the DW model, and SQL Server Data Tools was used to implement the ETL processes, which include the extraction, transformation, and loading of the data into the final Star Schema DW.

A new project in SSIS was created, which is composed of two packages, presented in Figure 19:

• LoadDimensionTables.dtsx
• LoadFactTables.dtsx

Figure 19 - Load Dimension and Fact tables dtsx


Load Fact Tables

The Fact tables package is composed of a Sequence container, which contains an Execute SQL Task component and a Data Flow component for each Fact table, as presented in Figure 20.

The Execute SQL Task component truncates each Fact table before it is loaded by the corresponding Data Flow component, which loads the Fact table (Booking and Service Facts).

Figure 20 - Load Fact Tables package in SSIS

An OLE DB Source component was used to load the Fact Booking table, getting the data from the tbBooking source provided by the ER model. Afterward, the Lookup component was used to look up the surrogate keys in the Dimension tables, which are inherited as foreign keys in the Fact table. A Derived Column component was used to create the relevant measures for this Fact, presented in Figure 21.

Figure 21 - Fact Booking measures, through derived column component
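The exact expressions are those of the Derived Column component in Figure 21; purely as a hedged T-SQL equivalent (assuming tbBooking carries the room and discount foreign keys and that nrDiscountPercentage is stored as a fraction), the measures could be derived roughly as follows:

SELECT b.Id_Booking,
       -- nrDayDuration: total days of the guest's booking
       DATEDIFF(DAY, b.dtBookingStart, b.dtBookingEnd) AS nrDayDuration,
       -- nrStandardRate: booking price without discount (room rate x nights)
       rt.nrRoomTypeRate * DATEDIFF(DAY, b.dtBookingStart, b.dtBookingEnd)
           AS nrStandardRate,
       -- nrDiscountRate: final price after the discount, when one exists
       rt.nrRoomTypeRate * DATEDIFF(DAY, b.dtBookingStart, b.dtBookingEnd)
           * (1 - ISNULL(d.nrDiscountPercentage, 0)) AS nrDiscountRate
FROM tbBooking b
JOIN tbRoom r          ON r.Id_Room = b.Id_Room
JOIN tbRoomType rt     ON rt.Id_Room_Type = r.Id_Room_Type
LEFT JOIN tbDiscount d ON d.Id_Discount = b.Id_Discount;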

Figure 22 represents the ETL process to load the Fact Booking, with the components described previously.

48

Figure 22 - ETL process to Load Fact Booking

The source of the Fact Service is an SQL query that joins tbBooking, tbBookingService, and tbHotelService from the ER model, as shown in Figure 23.

Figure 23 - OLE DB Source, using a SQL command to extract services data from the source
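The actual query is the one shown in Figure 23; a hedged sketch of such a source join (the linking key columns of tbBookingService and tbHotelService are assumed) could be:

SELECT bs.Id_Booking,
       bs.Id_Service,
       bs.dtServiceDate,
       hs.nrServicePrice AS nrStandardRate, -- value of the service paid by the guest
       hs.nrServiceCost  AS nrServiceCost   -- cost of the service for the hotel
FROM tbBookingService bs
JOIN tbBooking b       ON b.Id_Booking = bs.Id_Booking
JOIN tbHotelService hs ON hs.Id_Service = bs.Id_Service
                      AND hs.Id_Hotel   = b.Id_Hotel;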

As in the Fact Booking, the Lookup component is used in the Fact Service to search the surrogate keys of all the dimensions associated with this Fact table, so they can be loaded into it.

Figure 24 shows the ETL process used to load the Fact Service table.


Figure 24 - ETL process to Load Fact Service

Load Dimension Tables

The Dimension Tables package consists of eleven Data Flow components inside a Sequence container, each one loading a Dimension table.

In the process of loading the Dimension tables, the SCD (Slowly Changing Dimension) technique was used, which allows storing and managing both the historical and the current data in the DW over time. In the ETL processes of the Star Schema model, it is mandatory to use this type of dimension in order to track the data and update the dimensions with the new records received.

The SCD type used to load the Dimension tables was Type 2, which consists of creating an additional record and retaining the full history of the data. If the value of an attribute in the Dimension table changes, a new record is added to the Dimension with a new surrogate key, the old record is kept as history, and the new row becomes the current record.

Therefore, the Slowly Changing Dimension component was used in SSIS to select the business key of each dimension and the attributes that can change over time, in order to record the new values.
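The SSIS component generates this logic automatically; as a hedged T-SQL sketch of the same Type 2 behaviour for the Hotel dimension (tracking only dsHotelCity, with hypothetical RowStartDate, RowEndDate, and IsCurrent housekeeping columns and an identity surrogate key):

-- 1) Expire the current row when the tracked attribute changed in the source.
UPDATE d
SET    d.RowEndDate = GETDATE(),
       d.IsCurrent  = 0
FROM   DimHotel d
JOIN   tbHotel s ON s.Id_Hotel = d.Nk_Id_Hotel
WHERE  d.IsCurrent = 1
  AND  d.dsHotelCity <> s.dsHotelCity;

-- 2) Insert a new current row for new keys and for keys just expired.
INSERT INTO DimHotel (Nk_Id_Hotel, dsHotelName, dsHotelCity, RowStartDate, IsCurrent)
SELECT s.Id_Hotel, s.ds_Hotel_Name, s.dsHotelCity, GETDATE(), 1
FROM   tbHotel s
WHERE  NOT EXISTS (SELECT 1
                   FROM DimHotel d
                   WHERE d.Nk_Id_Hotel = s.Id_Hotel
                     AND d.IsCurrent = 1);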


Figure 25 - Load Dimension Tables package in SSIS

The ETL processes used to load each Dimension table are presented in the Annexes, in Figures 51 to 61, respectively.

4.2.2. Traditional Data Vault 2.0 Ensemble Modeling

Through the data source provided, a DW was implemented using the Data Vault 2.0 Ensemble approach. Nevertheless, before presenting the model, the business concepts and their attributes are identified, to better describe how this model differs from the Star Schema.

Note that this first model presented does not correspond to the proposed model.

4.2.2.1. Identify the Business Objects and define the Hubs entities

One of the critical steps before starting a Data Vault 2.0 model is to identify the main business objects of the business organization under study. By identifying the principal business objects, one quickly discovers the business keys.

So, according to the ER model provided by the Hotel Group, it is simple to identify the business objects:

▪ Bookings;
▪ Services;
▪ Hotel;
▪ Guests;
▪ Rooms.

These five business objects are essential to the business core of this organization and correspond to the primary data elements that must be stored in the DW.

Through these indicated business objects, the Hub entities can be defined as:


Hub Entity | Business key | Source in ER model
Booking Hub | Id_Booking | tbBooking
Service Hub | Id_Service | tbService
Hotel Hub | Id_Hotel | tbHotel
Guest Hub | Id_Guest | tbGuest
Room Hub | Id_Room | tbRoom

Table 20 - Identification of Hubs and business keys
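A hedged T-SQL sketch of a load for the Booking Hub is given below, inserting only business keys not yet present; the column names follow the Hub pattern used later in the model (Figure 26), and the hash computation mirrors section 2.5.1.1.

INSERT INTO H_Booking (H_Booking_SID, Bk_Id_Booking, Load_Date, Record_Source)
SELECT CONVERT(CHAR(32),
           HASHBYTES('MD5', CAST(src.Id_Booking AS VARCHAR(20))), 2), -- hash key
       src.Id_Booking,   -- business key from the source
       GETDATE(),        -- load date metadata
       'tbBooking'       -- record source metadata
FROM tbBooking src
WHERE NOT EXISTS (SELECT 1
                  FROM H_Booking h
                  WHERE h.Bk_Id_Booking = src.Id_Booking);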

4.2.2.2. Define Satellites

With the Hubs recognized, it is possible to define the main descriptive attributes, or relevant context, of the identified Hubs, in order to determine the Satellite entities associated with each Hub.

Booking Satellites

The Booking Hub is an entity that contains multiple Satellites, seven in total, split according to the source table and the contextual attributes provided by the source.

Besides this, another good practice was taken into account: splitting the attributes by rate of change, separating the contextual attributes that change rarely from those that change frequently, in order to minimize the consumption of unnecessary storage in new records.

Hub Entity | Satellite entity | Descriptive attributes
Booking | S_Booking_Dates | Booking_Start_Date, Booking_End_Date
Booking | S_Booking_Status | Booking_Status_Description
Booking | S_Booking_Rating | Rating_Value, Rating_Description
Booking | S_Booking_Cancellation | Cancelation_Description, Cancelation_Date
Booking | S_Booking_Discount | Discount_Description, Discount_Percentage
Booking | S_Booking_Platforms | Platform_Description
Booking | S_Booking_Trip_Type | Trip_Type_Description

Table 21 - Booking Satellites

Services Satellites

The Service Hub contains two Satellites: one aggregates all the descriptive data related to the service information, and the other holds the ratings of the services, provided by the guests.

Once again, the splitting of the attributes into two Satellites derives from the rate of change of the contextual attributes and the different sources provided by the ER model.

Hub Entity | Satellite entity | Descriptive attributes
Service Hub | S_Service_Characteristics | Service_Name, Service_Type_Description, Service_Price, Service_Cost, Service_Date, Sector_Description
Service Hub | S_Service_Rating | Rating_Value, Rating_Description

Table 22 - Service Satellites

Hotel Hub

The Hotel Hub contains two associated Satellites: one related to the attributes that characterize this entity, and a second that refers to the country exchange rates associated with each Hotel.

The country exchange rate Satellite provides the exchange rate associated with each Hotel, allowing the calculation of the booking value according to the respective country, which is why this information is stored in a separate Satellite.

Hub Entity | Satellite entity | Descriptive attributes
Hotel Hub | S_Hotel_Characteristics | Hotel_Name, Hotel_Address, Hotel_PostCode, Hotel_City
Hotel Hub | S_Hotel_Country | Hotel_Country_Currency, Country_Name, Hotel_Country_Exchange_Rate_To_USD

Table 23 - Hotel Satellites

Guest Hub

The Guest Hub entity contains only one Satellite, providing information related to the characteristics of the guests, which can afterward be associated with each Hotel for marketing strategies. This Satellite stores descriptive data about the guests who booked and consumed services in the Hotel Group.

Hub Entity | Satellite entity | Descriptive attributes
Guest Hub | S_Guest_Characteristics | Guest_Name, Guest_Address, Guest_City, Guest_Age, Guest_Gender, Guest_Occupation, Guest_Country, Number_Of_Children, Marital_Status_Description, Guest_Education_Description

Table 24 - Guest Satellites

Room Hub

Finally, the Room Hub entity aggregates one Satellite, which stores contextual attributes related to the rooms' information.

Hub Entity    Satellite entity            Satellite attributes
Room Hub      S_Room_Characteristics      Room_Type_Description, Room_Floor, Room_Type_Rate,
                                          Room_Capacity

Table 25 - Room Satellite

4.2.2.3. Connect Hubs with Links

The Link entities connect the Hubs to each other and are the key to the flexibility and scalability of the Data Vault 2.0 approach. They allow entities and relationships to be changed or added in the model over time, without the need to re-engineer the whole DW model.

The goal of the Link entities is to capture and collect the relationships between the business objects at the lowest grain; each Link must reference at least two parent tables (foreign keys), as illustrated in the sketch below.
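
A minimal T-SQL sketch of one such Link, assuming the Hub definitions sketched earlier and integer surrogate keys, could be:

-- Sketch of a Link connecting the Hotel and Booking Hubs:
-- two foreign keys to the parent Hubs, plus load metadata.
CREATE TABLE L_H_Hotel_L_H_Booking (
    L_Hotel_Booking_SID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    H_Hotel_SID   INT NOT NULL REFERENCES H_Hotel (H_Hotel_SID),     -- parent Hub 1
    H_Booking_SID INT NOT NULL REFERENCES H_Booking (H_Booking_SID), -- parent Hub 2
    Load_Date     DATETIME2    NOT NULL,
    Record_Source VARCHAR(100) NOT NULL
);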

Hub entities               Link table entity
H_Hotel and H_Service      L_H_Hotel_L_H_Service
H_Hotel and H_Room         L_H_Hotel_L_H_Room
H_Hotel and H_Booking      L_H_Hotel_L_H_Booking
H_Guest and H_Service      L_H_Guest_L_H_Service
H_Guest and H_Booking      L_H_Guest_L_H_Booking
H_Booking and H_Service    L_H_Booking_L_H_Service
H_Booking and H_Room       L_H_Booking_L_H_Room

Table 26 - Link entities

4.2.2.4. Traditional Data Vault model

The traditional Data Vault 2.0 model is based on the Hub, Link, and Satellite elements described previously. It is a hybrid data modeling methodology that tracks data and stores history, using a set of normalized, linked tables to support the business areas of an organization. The approach has a flexible, scalable, and consistent design that quickly adapts to business changes.

It provides benefits through its adaptability to changes in the business environment, supports big data sets, and simplifies the EDW design; being an incrementally built model, it allows new data sources to be added without impacting the whole design.

However, this model differs from the Star Schema presented in section 4.2.1. The Star Schema uses Fact and Dimension tables to model the DW; the Data Vault, in contrast, uses Hub entities to store the business keys and the respective metadata that tracks the origin of the data (where and when the data was provided). Link entities connect Hubs, or even establish relationships with other Links; these entities handle business changes in data granularity and minimize the impact of adding new Hubs to the architecture.

The Satellite entities store all the contextual and temporal attributes related to the identified business keys, as well as metadata. Like Dimension tables using SCD Type II, as presented in the Star Schema model, they keep historical data, recording the changes of attributes and updating the record every time an attribute changes.

In the Satellites, another type of metadata was also added. This metadata comprises the following attributes: IsCurrent, a flag indicating whether that specific record is the current one (Flag = 1), and two date attributes: ValidFrom, which contains the effective date/time, and ValidTo, which contains the expiry date/time. This metadata improves the speed, reusability, and parallelism of identifying the current records.
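
A minimal T-SQL sketch of a Satellite carrying this metadata, taking the Booking status Satellite as an example and assuming the naming used in Figure 26, could be:

-- Sketch of a Satellite with SCD Type II style metadata.
CREATE TABLE S_Booking_Status (
    H_Booking_SID              INT          NOT NULL REFERENCES H_Booking (H_Booking_SID),
    Load_Date                  DATETIME2    NOT NULL,
    Booking_Status_Description VARCHAR(100) NULL,      -- contextual attribute
    Record_Source              VARCHAR(100) NOT NULL,
    Is_Current                 BIT          NOT NULL,  -- 1 = current version of the record
    Valid_From                 DATETIME2    NOT NULL,  -- effective date/time
    Valid_To                   DATETIME2    NULL,      -- expiry date/time (NULL while current)
    PRIMARY KEY (H_Booking_SID, Load_Date)             -- Hub key combined with load date
);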

Figure 26 presents the typical Data Vault 2.0 Ensemble model, based on the case under study.


Figure 26 - Traditional Data Vault 2.0 Model

After the implementation of the Data Vault 2.0 approach, the complexity of the model, with its many interrelated tables, is apparent.


This DW model adapts quickly to requirement changes without much effort and facilitates the addition of new sources, or even new Hubs, Links, or Satellites, without compromising the whole architecture. Nevertheless, the gaps of this approach are still easy to notice in the model represented in Figure 26.

This approach faces limitations regarding data quality, because the data does not undergo any transformation: it is loaded into the Raw Vault layer directly from the sources of the operational system. The quality of the information provided is therefore not ensured, which is a problem because, most of the time, the data sources need data quality transformations.

However, these transformations can be made in the Business Vault layer, after the data is loaded into the Raw Vault. In the Business Vault, it is possible to remove noise and handle missing values by cleaning the data, and to consolidate and prepare it for later loading into the Data Marts. This step is not mandatory; it depends on the business rules of the organization itself.

In the Star Schema model presented in Figures 17 and 18, by contrast, the data is cleaned before being loaded into the DW, and is transformed, aggregated, and consolidated to create the measures in the Fact tables.

Another limitation is that the model cannot be accessed by end-users. The model presented confirms that it is too complex for key users to produce reporting or OLAP cubes from it to support the organization's decision-making; only expert data scientists are capable of applying data mining and data analytics tools to it.

Finally, the major limitation is the number of joins required to combine all the tables with each other, which compromises the performance of the model when large volumes of data have to be merged.

4.2.2.5. Data Vault ELT process

In contrast with Kimball's approach, which uses the ETL process to load the Star Schema EDW, the Data Vault 2.0 approach uses the ELT process.

This means that, to load a Data Vault EDW, the data is extracted from the operational systems presented in section 4.1 – Data Sources and Data Collection, and loaded directly into the Data Vault DW without any transformation.

The data is stored in the Raw Vault layer, i.e., the part of the Data Vault that holds the raw data.

SQL Server Data Tools was also used to execute the ELT process, for which three SSIS packages were created:

▪ LoadHubsTables.dtsx
▪ LoadLinksTables.dtsx
▪ LoadSatellitesTables.dtsx


Figure 27 - Load Hubs, Links and Satellites tables dtsx

Load Hub entities

A Sequence Container was used to load the Hub tables, which contains a Data Flow component for each Hub table, as shown in Figure 28.

Figure 28 - Load Hubs entities package in SSIS

All the Hubs are loaded in the same way: an OLE DB Source component extracts the raw data, a Derived Column component adds the metadata information (the Load Date and Record Source attributes), a Lookup component searches for the business key provided by the source and, lastly, an OLE DB Destination loads the data into the Data Vault EDW.

Figure 29 represents an example of the ELT process of loading a Hub entity.

Two variables were created to record the metadata attributes in the Hub entity, the Load Date and Record Source attributes respectively, which track all the changes made over time, as shown in Figure 30.


Figure 29 - Example of load a Hub table in SSIS

Figure 30 - Adding metadata attributes in the Hub entity
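
A set-based T-SQL equivalent of this SSIS flow would insert only the business keys not yet present, stamped with the two metadata variables. The sketch below assumes the H_Booking definition given earlier; it approximates, rather than reproduces, the actual package logic.

-- Sketch of a Hub load: new business keys only, with metadata.
DECLARE @LoadDate     DATETIME2    = SYSDATETIME();
DECLARE @RecordSource VARCHAR(100) = 'tbBooking';

INSERT INTO H_Booking (Bk_ID_Booking, Load_Date, Record_Source)
SELECT s.Id_Booking, @LoadDate, @RecordSource
FROM tbBooking AS s
WHERE NOT EXISTS (                         -- the Lookup step: skip keys already loaded
    SELECT 1 FROM H_Booking AS h
    WHERE h.Bk_ID_Booking = s.Id_Booking
);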


Load Link entities

After all Hub tables are loaded, the Link entities follow, as they need the surrogate keys created in the Hub tables to be stored in the Link tables.

Similar to the Hubs package, the Links package uses a Sequence Container containing multiple Data Flow components, each responsible for loading one of the designed Link tables.

Figure 31 - Load Link tables package in SSIS

The Link entities also store metadata, the Load Date and Record Source attributes, added with a Derived Column component, as in the Hub entities.

Figure 32 - Adding metadata to the Link table

The Lookup component is used to fetch the surrogate keys created in the Hub tables so that they can be stored in the Link. These entities help when business requirements are added or changed, minimizing the impact of re-designing the whole model.

Figure 33 presents an example of loading a Link table.


Figure 33 - Example of load a Link table in SSIS
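
In T-SQL terms, the Link load looks up the surrogate keys of both parent Hubs and inserts only new key pairs. The sketch below assumes that tbBooking carries an Id_Hotel foreign key; the exact source join is an assumption for illustration.

-- Sketch of a Link load: resolve both Hub surrogate keys, insert new pairs.
INSERT INTO L_H_Hotel_L_H_Booking (H_Hotel_SID, H_Booking_SID, Load_Date, Record_Source)
SELECT hh.H_Hotel_SID, hb.H_Booking_SID, SYSDATETIME(), 'tbBooking'
FROM tbBooking AS s
JOIN H_Hotel   AS hh ON hh.Bk_ID_Hotel   = s.Id_Hotel    -- Lookup on the Hotel Hub
JOIN H_Booking AS hb ON hb.Bk_ID_Booking = s.Id_Booking  -- Lookup on the Booking Hub
WHERE NOT EXISTS (
    SELECT 1 FROM L_H_Hotel_L_H_Booking AS l
    WHERE l.H_Hotel_SID = hh.H_Hotel_SID
      AND l.H_Booking_SID = hb.H_Booking_SID
);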

Load Satellite entities

The loading of the Satellite tables in the SSIS package is similar to that of the Hub and Link tables (Figure 34). The Satellites need the surrogate keys of the Hubs, which, combined with the Load Date attribute, form the primary key of these tables.

Figure 34 - Load Satellite tables package in SSIS


The Satellite tables store all the contextual attributes related to the identified Hubs, along with metadata used to track the rate of change of the attributes and their sources. Four metadata attributes are used, as shown in Figure 35.

Figure 35 - Adding metadata to the Satellite tables

The Is_Current flag allows faster identification of the current records, combined with the Valid_From and Valid_To attributes, which hold the effective date and the expiry date, respectively. As mentioned before, the Satellite entities follow the same process as the SCD Type II Dimension tables of Star Schema models. Hence, apart from the Lookup for the Hub surrogate key, a second Lookup component is used, which compares the attributes already inserted in the Satellite table and checks whether they have been updated.

If the Lookup cannot find a match for a record, this means that an attribute has changed. In these cases, a new current record is inserted, with Valid_From set to the current Load Date. The historical data is kept: the previous record's Valid_To is set to the current Load_Date and its Is_Current flag is updated to "0". The metadata attributes are thus updated to track the changes in these attributes; a set-based sketch of this logic is given below.
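
A T-SQL sketch of this Type II logic, assuming the S_Booking_Status Satellite above and a Status_Description column in the source (an assumed name), could be:

-- Sketch of the Type II update: expire the changed record, insert the new version.
DECLARE @LoadDate DATETIME2 = SYSDATETIME();

-- 1. Close the current record whose attribute no longer matches the source.
UPDATE s
SET s.Is_Current = 0,
    s.Valid_To   = @LoadDate
FROM S_Booking_Status AS s
JOIN H_Booking AS h   ON h.H_Booking_SID = s.H_Booking_SID
JOIN tbBooking AS src ON src.Id_Booking  = h.Bk_ID_Booking
WHERE s.Is_Current = 1
  AND s.Booking_Status_Description <> src.Status_Description; -- source column name assumed

-- 2. Insert the new current version of the record.
INSERT INTO S_Booking_Status (H_Booking_SID, Load_Date, Booking_Status_Description,
                              Record_Source, Is_Current, Valid_From, Valid_To)
SELECT h.H_Booking_SID, @LoadDate, src.Status_Description, 'tbBooking', 1, @LoadDate, NULL
FROM tbBooking AS src
JOIN H_Booking AS h ON h.Bk_ID_Booking = src.Id_Booking
WHERE NOT EXISTS (SELECT 1 FROM S_Booking_Status AS cur
                  WHERE cur.H_Booking_SID = h.H_Booking_SID
                    AND cur.Is_Current = 1);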

Figure 36 presents an example of an update of records in a Satellite table, using an OLE DB command component.


Figure 36 - Update new records in SSIS

Figure 37 presents an example of how a Satellite is loaded.

Figure 37 - Example of loading a Satellite table in SSIS


4.2.3. The proposal for the optimized Data Vault 2.0 model

The traditional Data Vault 2.0 model, represented in Figure 26, shows the complexity of this design. Although it is easy to add more Hub, Link, and Satellite tables without compromising the whole DW structure, adapting quickly to business changes, it is clear that end-users are not able to use this model. This impediment is due to the number of joins that must be performed to query all the contextual attributes needed to support an organization's decision-making.

In the Star Schema, the measures and KPIs are all stored in the Fact table, which makes the information simple to access because it is already aggregated; in the case of Data Vault 2.0, it is not. Taking the measures created for this case study in the traditional DW (Star Schema) approach, shown in Figures 17 and 18, we tried to recreate them using the Data Vault approach designed and presented in Figure 26.

If we want to aggregate all the information related to the Bookings in order to calculate the booking price, the discount rate, and the duration of the booking using the Data Vault model, we need a query that joins four Hub entities through three Link entities and joins the Hubs to their respective Satellites.

This query against the Data Vault 2.0 model requires sixteen (16) joins, which has a significant impact on the performance of the EDW when processing large quantities of data. Besides, only experts in SQL tools could write such complex queries. Figure 38 shows the complexity of the query, which needs sixteen joins to gather all the data necessary to create the measures described.
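
A partial sketch of the shape of such a query (only a few of the sixteen joins shown, with table names as in Figure 26) illustrates the pattern: every pair of Hubs is bridged through a Link, and every contextual attribute needs a further join to its Satellite.

-- Partial sketch: Booking-to-Service path only; the Room and Hotel
-- Hubs, their Links, and their Satellites add the remaining joins.
SELECT d.Booking_Start_Date, d.Booking_End_Date,
       disc.Discount_Percentage, sc.Service_Price, sc.Service_Cost
FROM H_Booking AS hb
JOIN L_H_Booking_L_H_Service   AS lbs  ON lbs.H_Booking_SID  = hb.H_Booking_SID
JOIN H_Service                 AS hs   ON hs.H_Service_SID   = lbs.H_Service_SID
JOIN S_Booking_Dates           AS d    ON d.H_Booking_SID    = hb.H_Booking_SID
                                      AND d.Is_Current       = 1
JOIN S_Booking_Discount        AS disc ON disc.H_Booking_SID = hb.H_Booking_SID
                                      AND disc.Is_Current    = 1
JOIN S_Service_Characteristics AS sc   ON sc.H_Service_SID   = hs.H_Service_SID
                                      AND sc.Is_Current      = 1;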


Figure 38 - Case 1 - query result in the Data Vault 2.0 model

Besides the example above, another case arises if the Hotel Group wants to know its customers and collect information in order to apply marketing strategies and campaigns.

In this case, it is necessary to aggregate three Hub entities (the Booking, Hotel, and Guest Hubs) through two Links and to join each Hub to its respective Satellite(s).

Ten (10) joins are needed to query this data, which once again reveals a high number of joins and, consequently, poor performance.


Figure 39 - Case 2 - Query result in Data Vault 2.0 model

By analyzing these cases, the limitations of this approach become very noticeable in terms of aggregating the data (joins), accessing the data, and creating reporting from it. It requires users with knowledge of SQL or other database languages to transform this data into useful information and make it available to all the stakeholders.

So, in order to respond to the Research Questions and in line with the goal of this Dissertation, which consists in proposing a way of optimizing the Data Vault 2.0 approach, a method is presented for reducing the joins required to aggregate the data and to access and build reporting on it.

The optimization consists of creating Bridge tables in the Business Vault layer, which make it possible to apply business rules and aggregate the data needed, taking into account the business core and the type of information that is relevant to present through reporting tools.

The Bridge tables store the surrogate keys of the Hubs we want to join and of the Links that connect them, together with the contextual attributes from the Satellites that we want to use.

These structures are useful for transforming the data into relevant information that answers the decision-making questions.

In addition, hash keys were used in the Hub, Link, and Satellite tables, instead of the typical surrogate keys, since these keys can be generated consistently by any system that knows the enterprise-wide unique business key. Besides, this supports reporting by preparing the data for BI consumption, makes the data easier to access, and reduces the joins in the queries; a sketch of how such a key can be derived is shown below.
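
A T-SQL sketch of deriving a hash key from the business key, assuming SQL Server's HASHBYTES function and MD5 as the hash (a common Data Vault 2.0 choice, here an assumption), could be:

-- Sketch: derive a deterministic hash key from the business key.
-- Any system that knows the business key computes the same value,
-- so no central surrogate-key generator is required.
SELECT CONVERT(CHAR(32),
               HASHBYTES('MD5',
                         UPPER(LTRIM(RTRIM(CAST(Id_Booking AS VARCHAR(20)))))),
               2) AS H_Booking_HK
FROM tbBooking;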

Figure 40 represents the optimized model, obtained by adding this type of table to the Data Vault model.



Figure 40 - Proposal for an optimized Data Vault 2.0 model

The use of Bridge tables brings benefits by eliminating outer joins from the ad-hoc queries, providing scalability in views, enhancing the partitioning of the data, and improving the performance of the EDW.

For the cases described previously, creating the Bridge tables in the Business Vault layer produces a clear reduction in the number of joins needed.

Going back to Case 1, which consists of aggregating all the booking data to create the same measures as presented before in the Star Schema, the Bridge_Booking_Sales table makes it possible to obtain the following result:

Bridge Booking Sales

Attribute                              Description
Bridge_Bookings_SID                    Surrogate key and primary key of the table
Bridge_Load_Date                       Metadata attribute to track the data loaded
H_Booking_SID                          Lookup of the surrogate key of the Booking Hub
H_Service_SID                          Lookup of the surrogate key of the Service Hub
H_Room_SID                             Lookup of the surrogate key of the Room Hub
H_Hotel_SID                            Lookup of the surrogate key of the Hotel Hub
L_Booking_Service_SID                  Lookup of the surrogate key of the L_H_Booking_L_H_Service Link
L_Booking_Room_SID                     Lookup of the surrogate key of the L_H_Booking_L_H_Room Link
L_Hotel_Booking_SID                    Lookup of the surrogate key of the L_H_Hotel_L_H_Booking Link
Cancelation_Description                Satellite attribute provided by S_Booking_Cancelation
Cancelation_Date                       Satellite attribute provided by S_Booking_Cancelation
Booking_Start_Date                     Satellite attribute provided by S_Booking_Dates
Booking_End_Date                       Satellite attribute provided by S_Booking_Dates
Discount_Percentage                    Satellite attribute provided by S_Booking_Discount
Platform_Description                   Satellite attribute provided by S_Booking_Platforms
Rating_Value                           Satellite attribute provided by S_Booking_Rating
Booking_Status_Description             Satellite attribute provided by S_Booking_Status
Service_Cost                           Satellite attribute provided by S_Service_Characteristics
Service_Date                           Satellite attribute provided by S_Service_Characteristics
Service_Price                          Satellite attribute provided by S_Service_Characteristics
Service_Type_Description               Satellite attribute provided by S_Service_Characteristics
Room_Type_Description                  Satellite attribute provided by S_Room_Type
Room_Type_Rate                         Satellite attribute provided by S_Room_Type
Hotel_Country_Exchange_Rate_To_USD     Satellite attribute provided by S_Hotel_Exchange_Rate

Table 27 - Bridge Booking Sales table


Figure 41 - Bridge Booking Sales table

The following SQL statement shows the stored procedure to load the Bridge Booking Sales table:


Figure 42 - SQL Stored Procedure to load the Bridge Booking Sales table
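
In outline, the procedure joins the Hubs through their Links, filters the Satellites to their current records, and inserts the combined result into the Bridge table. A simplified sketch, with the full column list of Table 27 abbreviated and a full-rebuild strategy assumed, could be:

-- Simplified sketch of a Bridge-load procedure (column list abbreviated).
CREATE PROCEDURE LoadBridgeBookingSales
AS
BEGIN
    TRUNCATE TABLE Bridge_Booking_Sales;   -- full rebuild on each load (an assumption)

    INSERT INTO Bridge_Booking_Sales
        (Bridge_Load_Date, H_Booking_SID, H_Service_SID,
         Booking_Start_Date, Booking_End_Date, Discount_Percentage, Service_Price)
    SELECT SYSDATETIME(), hb.H_Booking_SID, hs.H_Service_SID,
           d.Booking_Start_Date, d.Booking_End_Date,
           disc.Discount_Percentage, sc.Service_Price
    FROM H_Booking AS hb
    JOIN L_H_Booking_L_H_Service   AS lbs  ON lbs.H_Booking_SID  = hb.H_Booking_SID
    JOIN H_Service                 AS hs   ON hs.H_Service_SID   = lbs.H_Service_SID
    JOIN S_Booking_Dates           AS d    ON d.H_Booking_SID    = hb.H_Booking_SID
                                          AND d.Is_Current       = 1
    JOIN S_Booking_Discount        AS disc ON disc.H_Booking_SID = hb.H_Booking_SID
                                          AND disc.Is_Current    = 1
    JOIN S_Service_Characteristics AS sc   ON sc.H_Service_SID   = hs.H_Service_SID
                                          AND sc.Is_Current      = 1;
END;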

So, when we now query the data, it is only necessary to join with the respective Satellite each attribute came from, as represented in Figure 43.

Figure 43 - Query result using the Bridge Booking Sales table in the Data Vault optimized model


By looking at Figure 43, we observe that, with the Bridge table, only ten joins to the Satellite tables are needed, instead of the sixteen joins seen in Figure 38. A sketch of such a reduced query is given below.
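
The sketch assumes the Bridge definition of Table 27: the Bridge already carries the Hub and Link keys and most contextual attributes, so only the remaining Satellites need to be joined.

-- Sketch: query through the Bridge, joining only the Satellites
-- whose attributes are not already materialized in the Bridge.
SELECT b.Booking_Start_Date, b.Booking_End_Date,
       b.Discount_Percentage, b.Service_Price,
       r.Rating_Description            -- not stored in the Bridge (only Rating_Value is)
FROM Bridge_Booking_Sales AS b
JOIN S_Booking_Rating AS r
  ON r.H_Booking_SID = b.H_Booking_SID
 AND r.Is_Current    = 1;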

The same occurs in Case 2, presented before. If another Bridge table is created to store all the data we intend to aggregate about the booking guests of each Hotel, we obtain the following result:

Bridge Booking Guests

Attribute                          Description
Bridge_Booking_Guest_SID           Surrogate key and primary key of the table
Bridge_Booking_Guest_Load_Date     Metadata attribute to track the data loaded
H_Booking_SID                      Lookup of the surrogate key of the Booking Hub
H_Guest_SID                        Lookup of the surrogate key of the Guest Hub
H_Hotel_SID                        Lookup of the surrogate key of the Hotel Hub
L_Booking_Guest_SID                Lookup of the surrogate key of the L_H_Guest_L_H_Booking Link
L_Booking_Hotel_SID                Lookup of the surrogate key of the L_H_Hotel_L_H_Booking Link
Discount_Description               Satellite attribute provided by S_Booking_Discount
Trip_Type_Description              Satellite attribute provided by S_Booking_Trip_Type
Rating_Description                 Satellite attribute provided by S_Booking_Rating
Hotel_Name                         Satellite attribute provided by S_Hotel_Characteristics
Hotel_Address                      Satellite attribute provided by S_Hotel_Characteristics
Hotel_City                         Satellite attribute provided by S_Hotel_Characteristics
Guest_Country                      Satellite attribute provided by S_Guest_Characteristics
Guest_City                         Satellite attribute provided by S_Guest_Characteristics
Guest_Age                          Satellite attribute provided by S_Guest_Characteristics
Number_Of_Children                 Satellite attribute provided by S_Guest_Characteristics

Table 28 - Bridge Booking Guests table


Figure 44 - Bridge Booking Guest table

The following SQL statement shows the stored procedure to load the Bridge Booking Guest table:

Figure 45 - SQL Stored Procedure to load the Bridge Booking Guest table


With this Bridge table created, the number of joins needed to query the data is reduced to only six, instead of ten, as shown in the query of Figure 46.

Figure 46 - Query result using the Bridge Booking Guest table in Data Vault optimized model

Typically, after developing the Data Vault 2.0 EDW, the information mart layer is built on top of it; according to the Data Vault Architecture, this layer is responsible for delivering information and presenting reporting on the data.

As end-users cannot access the data directly in the Data Vault approach, this information mart layer enables reporting through subject-oriented Data Marts built with the Star Schema model, OLAP cubes, or even Error Marts.

However, by building the Bridge tables presented above, this is no longer necessary, because it is possible to create views across these tables, allowing the production of measures and KPIs, which enables reporting through BI tools.

With the two Bridge tables previously represented, the Bridge Booking Sales and the Bridge Booking Guests, it is simple to build views on these tables, allowing end-users to create measures useful for supporting the decision-making process.

Views can provide benefits in Data Warehousing, making it possible to avoid the proliferation of redundant data downstream in the architecture and bringing performance and agility in accessing the data. Nevertheless, they also have disadvantages, such as performance issues with large data sets and traceability and auditability problems, since the result set based on business rules is not persisted while those rules change over time (Hultgren, 2012).

Another option, instead of views, is data virtualization, which allows virtual tables to be created for query results, or even virtual Data Marts.

For this case study, using the Bridge tables built, the following views can be created (Figure 47) in order to compute measures similar to the metrics of the Star Schema model presented in section 4.2.1.


Figure 47 - Creation of views using Bridge tables

Figure 48 - SQL query to create the Booking Sales view by using the Bridge Booking Sales table

Figure 49 - SQL query to create the Booking information view by using Bridge Booking Sales table


Figure 50 - SQL query to create the Guest Information view by using Bridge Booking Guest
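
A sketch of one such view over the Bridge table, with illustrative measure definitions that are assumptions rather than the exact formulas used in the figures, could be:

-- Sketch of a Booking Sales view deriving Star-Schema-like measures
-- directly from the Bridge table (measure formulas are illustrative).
CREATE VIEW vw_Booking_Sales
AS
SELECT H_Booking_SID,
       DATEDIFF(DAY, Booking_Start_Date, Booking_End_Date) AS Booking_Duration_Days,
       Service_Price * Hotel_Country_Exchange_Rate_To_USD  AS Service_Price_USD,
       Service_Price * (1 - Discount_Percentage / 100.0)   AS Discounted_Price
FROM Bridge_Booking_Sales;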

Through the optimized Data Vault 2.0 model presented in Figure 40, whose Bridge tables act as higher-level factless fact tables storing the Hub and Link hash keys, it is possible to minimize three limitations of this approach: the performance of the model increases, the joins needed to combine Hub, Link, and Satellite tables are reduced and, more importantly, end-users gain direct access, enabling the production of BI reporting.


5. RESULTS AND DISCUSSION

The goal of the case study was accomplished through a classic BI project that demonstrates the differences between the traditional DW model, the Star Schema proposed by Kimball, and the emerging approach created by Dan Linstedt, the Data Vault model, and that proposes improvements to the Data Vault approach in order to decrease the joins required when querying data, minimizing the complexity of the model and allowing end-users to apply reporting techniques to it.

With the design of the Star Schema model presented in section 4.2.1, it is possible to perceive that, although this model is frequently used in DW projects, it does not provide the auditability, traceability, scalability, and flexibility that the Data Vault model presents (section 4.2.3). Besides, the Star Schema model does not adapt easily to business and rule changes: designing and implementing new requirements entails high maintenance and development costs and a high impact on the whole model, which requires re-engineering.

However, compared with the Data Vault Model, the Star schema has the advantage of being capable of producing and aggregating measures and KPIs in Fact tables, which simplifies the access and query of data, without much effort from end-users. It is also a model where it is possible to apply BI tools to support the decision-making process for organizations.

The Data Vault 2.0 approach is known for its incremental building, which easily adapts to business changes by adding new Hub, Link, and Satellite tables every time a new requirement appears, and is therefore associated with an agile methodology. Data Vault 2.0 provides high levels of traceability and auditability of the data by using metadata attributes that store all the modifications made. Nonetheless, as demonstrated in section 4.2.3 and presented in the Literature review chapter, section 2.5.3, the model design carries limitations. According to the Data Vault architecture, these limitations are overcome on top of this model, in the information delivery layer, by creating subject-oriented Data Marts, represented through Star Schemas or OLAP cubes, to allow data reporting.

The main limitations of Dan Linstedt's approach stem from the complexity of the model: the exhaustive joins needed to perform ad-hoc queries on the data, the difficulty of end-user access (only experts with database know-how can access the data), and the inability to create reports directly from the data.

So, based on these facts, a proposal for improvements was created in the Data Vault 2.0 process model, in order to overcome these gaps.

The strategy consists of creating auxiliary tables, the Bridge tables, to store and aggregate Satellite attributes from multiple Hubs, which in turn may contain multiple Satellites, and to collect the Hub and Link hash keys so that the data can be related more easily.

These tables help to reduce the joins, as presented in the queries represented by Figures 43 and 46, because they aggregate attributes from multiple Satellites from different Hubs. Bridge tables also contribute to minimizing the complexity impact on the model, providing end-users direct access to the model, by creating a view, without the need of building Data Marts.


The following table summarizes the results achieved for each characteristic, to provide more insight into the outcome of the case study:

Load Data
Star Schema: This approach uses ETL tools, which extract data from the data sources and transform the data in order to consolidate, aggregate, and clean it, creating relevant measures for the Fact tables. After the transformation, the data is loaded into a staging area to ensure a correct load, and then into the final DW architecture.
Traditional Data Vault: This approach uses the ELT process, extracting the data from the operational system's sources and loading it directly into the Raw Vault layer, without transforming the data. Afterwards, according to the business rules and requirements, the data can be transformed and loaded into the Business Vault layer; however, this is not mandatory, and the Data Vault model design does not require any data transformation.
Optimized Data Vault: This proposal, like the traditional Data Vault approach, also uses the ELT process. The data is extracted from the operational sources and loaded directly into the Raw Vault, and the data transformations are then implemented in the Business Vault, so that end-users are able to access the data for reporting instead of creating Data Marts.

Traceability
Star Schema: The Star Schema uses SCDs (Slowly Changing Dimensions), which make the ETL process difficult. The SCD is used to track the historical changes of the data; the business attributes to be tracked must be identified before the data load into the final DW, and the mechanism demands maintenance and reconfiguration as the business changes over time. Note that if the DW holds huge amounts of data, the SCD implementation is complicated and performance decreases.
Traditional Data Vault: In the Data Vault approach, the traceability of the data is provided by the Link and Satellite entities, which store metadata attributes that collect and track the existing changes in detail. The Satellite tables work similarly to the SCD Type II of the Star Schema model, but with better tracking performance, storing all the changes in the attributes.
Optimized Data Vault: Similar to the traditional Data Vault model, this proposal also tracks the data through Hubs, Links, and Satellites, using appropriate metadata attributes stored in each table.

Auditability
Star Schema: It is possible to add metadata for the auditability of the data sources and attribute loads. However, it is impossible to know when the data is used in Data Marts.
Traditional Data Vault: The auditability of the model is very high, responding to attribute changes at all times and tracking all information from the extracted data source to where it was used.
Optimized Data Vault: Similar to the traditional Data Vault, this proposal responds to the changes that occur in the attributes, tracking all information from the extracted data source to where it was used.

Scalability and Flexibility
Star Schema: Business requirement or business rule changes imply the re-design of the DW architecture and the modification of the ETL packages, which brings many changes to the model, due to the aggregation of and relationships between all tables (Dimension and Fact tables). The costs are very high.
Traditional Data Vault: The Data Vault approach adapts easily to business and requirements changes by adding new Hubs, Links, and Satellites. The DW architecture does not need to be modified, and the cost of adding new requirements is low. This approach is very scalable due to the advantage of incremental building.
Optimized Data Vault: Same as the traditional Data Vault approach: it adapts easily to business and requirements changes by adding new Hubs, Links, and Satellites, the architecture does not need to be modified, the cost of adding new requirements is low, and it is very scalable thanks to incremental building.

Joins
Star Schema: Composed of star joins, which allow the end-user to analyze and access the data and facilitate the load of the data into the Data Marts, providing more compact and summarized data.
Traditional Data Vault: This approach requires many joins to associate the tables and obtain information, which decreases the performance of the model, due to the complexity of the model when gathering useful data.
Optimized Data Vault: This proposed model brings benefits by reducing the joins required to combine the Hub, Link, and Satellite tables, through a new auxiliary table, the Bridge table. The Bridge table helps the performance of the model by storing the Hub and Link hash keys and the Satellite attributes relevant to the business core. These tables can also support data aggregations and transformations, which minimize the complexity of ad-hoc queries.

Access to the model
Star Schema: The Star Schema allows end-users to use the model to perform reporting or build OLAP cubes. The data is well prepared and easy to access.
Traditional Data Vault: The Data Vault model is not designed for end-user access, being complicated to get information from the tables. Besides, the complexity of the model is high, so only experts can retrieve information from it, due to the high complexity of the joins.
Optimized Data Vault: This proposal allows end-users to access the model and retrieve information through the Bridge tables, which store the attributes relevant to computing metrics and KPIs for the business organization. Besides, it is possible to create views over the Bridge tables to transform the data into useful information, and to create reporting with BI tools, without the need to create Data Marts.

Model design
Star Schema: The design of the model is based on Dimension and Fact tables, in which the Fact tables aggregate all the surrogate keys of the Dimension tables. The data is thus all aggregated, and the Fact tables contain the measures and KPIs pertinent to the business organization. The model is easy to design and more comprehensible for end-users.
Traditional Data Vault: The design of the Data Vault model is easy to implement by creating Hub, Link, and Satellite tables, bringing advantages when the business rules change. It is an incremental building model, making it simple to add more Hubs, Links, or Satellites without compromising or re-designing the whole model, which makes it very flexible, scalable, and adaptable to change. However, it is a complex model, difficult for end-users to use.
Optimized Data Vault: The proposed model has the same benefits as the traditional Data Vault, adapting to business changes and being incrementally built, so it is simple to add more Hub, Link, and Satellite tables. Contrary to the traditional Data Vault, however, it presents a further advantage: the Bridge tables, which store the Hub and Link hash keys together with relevant attributes gathered from multiple Satellites, decreasing the join complexity when querying the data, allowing end-users to access the data more efficiently, and enabling reporting to support decision-making.

Keys
Star Schema: The Star Schema model uses surrogate keys to join the Dimension tables with the Fact tables.
Traditional Data Vault: The traditional Data Vault, like the Star Schema, uses surrogate keys to link the Hub, Link, and Satellite tables.
Optimized Data Vault: The proposed model uses hash keys, which bring benefits compared with surrogate keys: better data load performance, consistency, and auditability, and support for MPP architectures.

Table 29 - Results of the case study

Briefly, through the results obtained from the case study, which illustrates a typical DW project in organizations, it was possible to demonstrate the main differences between the traditional organizational DW, the Star Schema, and the Data Vault 2.0 model. The case study presents a method of optimizing the latter model so as to mitigate the limitations studied, obtain better performance, and show that this approach can easily be adopted by other organizations.


6. CONCLUSIONS

To achieve the goals of this Dissertation and answer the Research Questions under study, a proposal for an improved Data Vault 2.0 model was demonstrated through a case study, minimizing the limitations present in traditional Data Vault 2.0 models. The case study is a typical BI project, so it is believed that its results can easily be extrapolated to other DW projects.

Designing a conceptual data model is essential for organizations because it represents their business world. It is an iterative process that becomes more detailed as entities and relationships are added, but it is made difficult by business dynamics and the complexity of organizational business cores.

Besides, with Big Data, the design process becomes more difficult to organize and represent due to the volume of the data, the uncertain veracity of the data, the variety of the sources, and the fast velocity that data arrives and changes.

For that reason, a DW capable of adjusting to business requirement changes, and at the same time flexible and scalable enough to allow new entities and relationships to be added in response to business and data growth, is crucial. The traditional modeling approaches, mostly the Star Schema, are not designed to absorb business changes, because they assume constant source systems and a project scope restricted to specific requirements.

Although Star Schema is well-known in the delivery of DW projects, with the case study conducted, it was demonstrated that Data Vault provides benefits compared with Star Schema.

The Data Vault 2.0 modeling approach is considered one of the most effective, being oriented for business requirements, integrating multiple heterogeneous sources, especially unstructured data (semi-structured, multi-structured), providing agility and traceability of data, rapidly absorbing business changes and managing and storing historical data.

In addition, adding new sources does not carry high implementation costs, providing a lower total cost of DW ownership than the Star Schema model, and the approach follows an agile methodology, with lower risks and multiple deliverables. It is an incrementally built model, where new requirements can easily be added without compromising the architecture, while supporting terabytes and petabytes of data. In contrast, the Star Schema requires re-engineering the whole model to accommodate an additional requirement, and the costs and effort of maintaining and changing the model are very high.

Nevertheless, the Data Vault, as presented in the research studies and demonstrated through the case study, has limitations, the main objective of this Dissertation being to represent a way of overcoming these limitations.

The high complexity of joining the data between the Hub, Link, and Satellite tables makes the model very complex and its performance low; end-users cannot access the data to retrieve relevant information, produce reports, or run ad-hoc queries to support the decision-making process. Handling the join complexity when merging Big Data demands high computational capacity and efficient algorithms; without them, the process becomes impracticable.


So, to transcend the Data Vault constraints, the optimized Data Vault 2.0 model represented in section 4.2.3 was proposed. It uses Bridge tables, created with the purpose of aggregating Satellite attributes from multiple Hubs, including Hubs that contain many Satellites; these tables also store the Hub and Link hash keys.

Adding Bridge tables to the Data Vault 2.0 model brings benefits and reduces the limitations of this approach. The proposed model minimizes the complexity of the model, reducing the joins needed between the Hub, Link, and Satellite tables, because the relevant contextual attributes are stored in this type of table.

Furthermore, end-users can have direct access to the data with the model suggested, through views or data virtualization, and apply BI tools to implement reports useful for the decision-making of organizations, without requiring the creation of subject-oriented Data Marts using Star Schemas or OLAP cubes. End-users can access and transform all the useful information by using views or data virtualization and can build reporting through them.

However, when adding Bridge tables, it must be noted that some joint effort between analysts and developers is needed to determine what information and contextual attributes the organization wants to aggregate into these tables, just as the grain and content of the Fact tables must be agreed upon in the Star Schema model.

With the case study performed, it is possible to conclude that the Data Vault 2.0 model can bring benefits to organizations and DW projects. The proposed model adds value to the existing approach by addressing its existing gaps.

To conclude, it is pertinent to mention that all the Research Questions were answered and to stress, once more, that the presented case study is based on a classic DW project; the proposed model can therefore also add value in other DW projects and business organizations.

7. LIMITATIONS

The contribution of this Dissertation will lead to an improvement in the existing Data Vault 2.0 Ensemble approach, bringing benefits for organizations, BI developers, experts, and DW model design.

Despite proposing improvements in the Data Vault 2.0 model, it is relevant to consider that there may be limitations when implementing these. One of them is the fact that the design and implementation of the Data Vault 2.0 model are based on an agile methodology, so the BI project to be developed should proceed following the methods of this methodology. Otherwise, it can be challenging to manage and deliver results.

Another identified limitation concerns the tools used for large-scale data processing, which can become inadequate depending on the volumes of data the organization handles.

Besides, in some cases data quality cannot be ensured due to the ELT process: the data is loaded directly from the sources into the Data Vault, so in the majority of cases it presents inconsistencies, is not clean, and can contain noise.


The size of the organization and its business core can also bring challenges in building the optimized Data Vault 2.0 model: given the complexity of the model, some effort is required to analyze all the tables in order to aggregate the pertinent data into Bridge tables, decrease the joins, improve the model's performance, and prepare the data for BI reporting tools.

8. RECOMMENDATIONS FOR FUTURE WORKS

As a future recommendation, it would be interesting to perform market research in Portuguese companies, in order to understand whether the Data Vault 2.0 approach introduced by Dan Linstedt is well known and whether, when organizations consider building an EDW for their business models, they think about using it or instinctively choose the traditional model, the Star Schema.

Although identified as a limitation, it is also a recommendation for future work to use appropriate tools that can handle Big Data, considering market trends such as MapReduce and Hadoop, which can process, manage, and store these massive amounts of data with better performance.

Another recommendation for future research is to produce reporting on the designed optimized model, using BI reporting tools, in order to demonstrate that Bridge tables can indeed support information delivery without the need to build Data Marts with Star Schema models or OLAP cubes.

Finally, research can be conducted in order to understand the contribution and the impact of these auxiliary tables (Bridge tables) on the current Data Vault 2.0 Architecture.


BIBLIOGRAPHY

Almeida, F. (2017). Concepts and Fundaments of Data Warehousing and OLAP. ResearchGate, (September), 39. Retrieved from https://www.researchgate.net/publication/319852408_Concepts_and_Fundaments_of_Data_Warehousing_and_OLAP

Anderson, D. (2015). What is "The Data Vault" and why do we need it? Retrieved from Talend website: https://www.talend.com/blog/2015/03/27/what-is-the-data-vault-and-why-do-we-need-it/

Ballard, C., Herreman, D., Schau, D., Bell, R., Kim, E., & Valencic, A. (1998). Data modelling techniques for Data Warehousing. Redbooks.Ibm.Com.

BI-Survey.com. (n.d.). The most common Business Intelligence Problems - 2,500 Users Responses Analyzed. Retrieved from https://bi-survey.com/business-intelligence-problems

Bojičić, I., Marjanović, Z., Turajlić, N., Petrović, M., Vučković, M., & Jovanović, V. (2016). A comparative analysis of data warehouse data models. 2016 6th International Conference on Computers Communications and Control, ICCCC 2016, (Icccc), 151–159. https://doi.org/10.1109/ICCCC.2016.7496754

Bolder-Boos, M. (2015). Der Krieg und die Liebe - Untersuchungen zur römischen Venus. Klio, 97(1), 81–134. https://doi.org/10.1515/klio-2015-0004

Bouaziz, S., Nablil, A., & Gargouri, F. (2017). From Traditional Data Warehouse To Real Time Data Warehouse. 1(February), 0–10. https://doi.org/10.1007/978-3-319-53480-0

Brown, T. (2019). 3 Biggest Challenges for Data Integration. Retrieved October 28, 2019, from ITChronicles website: https://www.itchronicles.com/big-data/3-biggest-challenges-for-data-integration/

Calvanese, D., De Giacomo, G., Lenzerini, M., Nardi, D., & Rosati, R. (2002). Data Integration in Data Warehousing. International Journal of Cooperative Information Systems, 10(03), 237–271. https://doi.org/10.1142/s0218843001000345

Chaudhuri, S., & Dayal, U. (1998). An Overview of Data Warehousing and OLAP Technology. (March 1997).

Chugh, R., & Grandhi, S. (2013). Why Business Intelligence ? Significance of Business Intelligence. (November 2015). https://doi.org/10.4018/ijeei.2013040101

Cox, N. (2014). Data Vault Design: Hub Tables. Retrieved October 29, 2019, from Optimal BI website: https://optimalbi.com/blog/2014/10/10/data-vault-design-hub-tables/

Daeng Bani, F. C., Suharjito, Diana, & Girsang, A. S. (2018). Implementation of Database Massively Parallel Processing System to Build Scalability on Process Data Warehouse. Procedia Computer Science, 135, 68–79. https://doi.org/10.1016/j.procs.2018.08.151

Eberendu, A. C. (2016). Unstructured Data: an overview of the data of Big Data. International Journal of Computer Trends and Technology, 38(1), 46–50. https://doi.org/10.14445/22312803/ijctt-v38p109

EY. (2014). Big data: Changing the way businesses compete and operate. International Journal of Simulation: Systems, Science and Technology, 16(April), 28. https://doi.org/10.5013/IJSSST.a.16.5B.22

Gartner. (2012). Big Data. Retrieved October 20, 2019, from https://www.gartner.com/en/information-technology/glossary/big-data

Gartner. (2019). Data Warehouse. Retrieved October 21, 2019, from Gartner Glossary website: https://www.gartner.com/en/information-technology/glossary/data-warehouse

Gemino, A., & Wand, Y. (2003). Evaluating modeling techniques based on models of learning. Communications of the ACM, 46(10), 79–84. https://doi.org/10.1145/944217.944243

Gil, D., & Song, I. Y. (2016). Modeling and Management of Big Data: Challenges and opportunities. Future Generation Computer Systems, 63, 96–99. https://doi.org/10.1016/j.future.2015.07.019

Hashem, H., & Ranc, D. (2015). An integrative modeling of BigData processing. International Journal of Computer Science and Applications, 12(1), 1–15.

Hultgren, H. (2012). Modeling the Agile Data Warehouse with Data Vault. New Hamilton.

Hultgren, H. (2013). Introductory Guide to Data Vault Modeling.

Hultgren, H. (2018). Data Vault Modeling Certification 2018. New Hamilton.

IBM. (2011). Overview of Data Warehousing. Retrieved from IBM Knowledge Center website: https://www.ibm.com/support/knowledgecenter/en/SSGU8G_11.50.0/com.ibm.whse.doc/ids_ ddi_344.htm

Inmon, W. H. (2002). Building the Data Warehouse.

Inmon, W. H., & Linstedt, D. (2015). Introduction to Data Vault Modeling. In Data Architecture: a Primer for the Data Scientist. https://doi.org/10.1016/b978-0-12-802044-9.00022-2

Jovanovic, P., Romero, O., Simitsis, A., Abelló, A., & Mayorova, D. (2014). A requirement-driven approach to the design and evolution of data warehouses. Information Systems, 44, 94–119. https://doi.org/10.1016/j.is.2014.01.004

Kambayashi, Y., Winiwarter, W., & Arikana, M. (2002). Data Warehousing and Knowledge Discovery. In G. Goos, J. Hartmanis, & J. van Leeuwen (Eds.), 4th International Conference, DaWaK 2002 Aix-en-Provence, France, September 4-6, 2002 Proceedings (Vol. 9). https://doi.org/10.1016/0020-7101(78)90038-7

Kimball, R., & Ross, M. (2013). The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling (3rd ed.). Indianapolis: John Wiley & Sons.

Laney, D. (2001). 3d Data management: controlling data volume, velocity and variety, Appl. Delivery Strategies Meta Group. Information and Software Technology, 51(4), 769–784. https://doi.org/10.1016/j.infsof.2008.09.005

Lans, R. F. van der. (2015). Data Vault and Data Virtualization: Double Agility. (March).

Lin, Y., Wang, H., Li, J., & Gao, H. (2019). Data source selection for information integration in big data era. Information Sciences, 479, 197–213. https://doi.org/10.1016/j.ins.2018.11.029


Linstedt, D. (2010a). Data Vault Model & MPP Architecture. Retrieved from DanLinstedt.com website: https://danlinstedt.com/allposts/datavaultcat/data-vault-model-mpp-architecture/

Linstedt, D. (2010b). Potencial Data Vault Issues. Retrieved from DanLinstedt.com website: https://danlinstedt.com/allposts/datavaultcat/potential-data-vault-issues/

Linstedt, D. (2015). Data Vault Basics. Retrieved from DanLinstedt.com website: https://danlinstedt.com/solutions-2/data-vault-basics/

Linstedt, D., & Olschimke, M. (2015). Building a Scalable Data Warehouse with Data Vault 2.0. Retrieved from http://eds.a.ebscohost.com/eds/ebookviewer/ebook/bmxlYmtfXzEwNjU1MDRfX0FO0?nobk=y &sid=c9983dab-c64c-42e5-a2c9-0a5455d14e0b@sdc-v-sessmgr02&vid=5&format=EB&rid=1

McCue, C. (2007). Data Mining and Predictive Analysis. Retrieved from http://www.sciencedirect.com/science/article/pii/B9780750677967500428

Mcnulty, E. (2014). Understanding Big Data: The Seven V’s - Dataconomy. Retrieved October 20, 2019, from Dataconomy website: https://dataconomy.com/2014/05/seven-vs-big-data/

Moody, D. L., & Kortink, M. A. R. (2000). From Enterprise Models to Dimensional Models: A Methodology for Data Warehouse and Data Mart Design. Proceedings of the International Workshop on Design and Management of Data Warehouses (DMDW 2000), 1–12.

Naamane, Z., & Jovanovic, V. (2016). Effectiveness of Data Vault compared to Dimensional Data Marts on Overall Performance of a Data Warehouse System. International Journal of Computer Science Issues, 13(4), 16–31. https://doi.org/10.20943/01201604.1631

Orlov, V. (2014). Data Warehouse Architecture: Inmon CIF, Kimball Dimensional or Linstedt Data Vault? - The Blend: A West Monroe Partners Blog. Retrieved October 15, 2019, from https://blog.westmonroepartners.com/data-warehouse-architecture-inmon-cif-kimball-dimensional-or-linstedt-data-vault/

Oumkaltoum, B., Mohamed Mahmoud, E. B., & Omar, E. B. (2019). Toward a business intelligence model for challenges of interoperability in egov system: Transparency, scalability and genericity. 2019 International Conference on Wireless Technologies, Embedded and Intelligent Systems, WITS 2019, 1–6. https://doi.org/10.1109/WITS.2019.8723756

Rao, T. R., Mitra, P., Bhatt, R., & Goswami, A. (2018). The big data system, components, tools, and technologies: a survey. In Knowledge and Information Systems (Vol. 60). https://doi.org/10.1007/s10115-018-1248-0

Run, J. (2018). Scalable Data Warehouse Architecture. Retrieved November 6, 2019, from https://jerryrun.wordpress.com/2018/09/11/chapter-2-scalable-data-warehouse-architecture/

Santoso, L. W., & Yulia. (2017). Data Warehouse with Big Data Technology for Higher Education. Procedia Computer Science, 124, 93–99. https://doi.org/10.1016/j.procs.2017.12.134

Sarker, K. U., Bin Deraman, A., Hasan, R., & Abbas, A. (2019). Ontological practice for big data management. International Journal of Computing and Digital Systems, 8(3), 265–273. https://doi.org/10.12785/ijcds/080306

Shivtare, P. S., & Shelar, P. P. (2015). Data Warehouse with Data Integration : Problems and Solution. 67–71.


Simons, H. (2009). Case study research in practice. London: SAGE.

Smallcombe, M. (2019). ETL vs ELT: Top Differences. Retrieved October 18, 2019, from Xplenty website: https://www.xplenty.com/blog/etl-vs-elt/

Standards – Data Vault & Ensemble Modeling Standards. (2018). Retrieved November 14, 2019, from Genesee Academy LLC website: http://dvstandards.com/standards/

Starman, A. (2013). The case study as a type of qualitative research. Journal of Contemporary Educational Studies, 1(2013), 28–43.

Storey, V. C., & Song, I. Y. (2017). Big data technologies and Management: What conceptual modeling can do. Data and Knowledge Engineering, 108(February), 50–67. https://doi.org/10.1016/j.datak.2017.01.001

Sturman, A. (1997). Case study methods. In J. P. Keeves (Ed.), Educational research, methodology and measurement: an international handbook (2nd ed.). Pergamon.

Teorey, T., Jagadish, H. V, Modeling, D., & Edition, D. F. (2011). Conceptual Data Modeling Requirements Analysis and Conceptual Data Modeling.

Varge, M. (2001). On the Differences of Relational and Dimensional Data Model. The 12th International Conference on Information and Intelligent Systems IIS 2001, 245–251. Retrieved from https://bib.irb.hr/datoteka/102195.t09r02.pdf

Wannalai, N., & Mekruksavanich, S. (2019). The application of intelligent database for modern information management. ECTI DAMT-NCON 2019 - 4th International Conference on Digital Arts, Media and Technology and 2nd ECTI Northern Section Conference on Electrical, Electronics, Computer and Telecommunications Engineering, 105–108. https://doi.org/10.1109/ECTI-NCON.2019.8692242

Whishworks. (2017). Understanding the 3 Vs of Big Data - volume, velocity and variety. Retrieved from https://www.whishworks.com/blog/big-data/understanding-the-3-vs-of-big-data-volume- velocity-and-variety

Yessad, L. (2016). Comparative Study of Data Warehouses Modeling Approaches: Inmon, Kimball and Data Vault. 2016 International Conference on System Reliability and Science (ICSRS), 95–99. https://doi.org/10.1109/ICSRS.2016.7815845

Yessad, L., & Labiod, A. (2017). Comparative study of data warehouses modeling approaches: Inmon, Kimball and Data Vault. 2016 International Conference on System Reliability and Science, ICSRS 2016 - Proceedings, 95–99. https://doi.org/10.1109/ICSRS.2016.7815845

Yin, R. K. (2008). Case study research: Design and methods (4th ed.). Sage Publications Incorporated.


ANNEXES

LOAD DIMENSION TABLES – ETL PROCESS

Figure 51 - Load Hotel Dimension table

Figure 52 - Load Discount Dimension table


Figure 53 - Load Booking Status Dimension table

Figure 54 - Load Cancellation Detail Dimension table


Figure 55 - Load Services Dimension table

Figure 56 - Load Trip Type Dimension table


Figure 57 - Load Room Type Dimension table

Figure 58 - Load Rating Dimension table


Figure 59 - Load Platform Dimension table


Figure 60 - Load Guest Dimension table

Figure 61 - Load Dates Dimension table

