A proposal for improvements of the Data Vault

Ensemble process approach to retrieve Big Data

Data Vault limitations and optimization

Tahira Jéssica da Silva Ruivo Vissaram

Dissertation presented as partial requirement for obtaining the Master’s degree in Information Management

NOVA Information Management School Instituto Superior de Estatística e Gestão de Informação Universidade Nova de Lisboa

A PROPOSAL FOR IMPROVEMENTS OF DATA VAULT ENSEMBLE PROCESS APPROACH TO RETRIEVE BIG DATA

by

Tahira Jéssica da Silva Ruivo Vissaram

Dissertation presented as partial requirement for obtaining the master’s degree in Information Management, with a specialization in Information Systems and Technologies Management

Advisor / Co-Advisor: Dr. Vítor Santos

November 2019

ACKNOWLEDGMENT

I would like to express my sincere gratitude to my supervisor, Professor Vítor Santos, Ph.D., for the support, motivation, guidance, and persistence that drove me to complete this dissertation, as well as for the knowledge he transmitted.

I am also extremely grateful to Nova IMS, and to all the teachers and staff, for these years of learning and motivation as a student; they enabled the conditions for this final work to be executed and helped me in my professional career.

A special thanks to my mother and my brother, for their unconditional support, for their encouragement, love, and dedication.

Finally, I thank all my friends who directly or indirectly contributed to this work, with words of encouragement and motivation.

ABSTRACT

Data has become the most powerful asset in an organization, due to the insights and patterns that can be discovered in it and because it can be transformed into real-time information through BI tools to support decision-making.

So, it is crucial to have a DW architecture that stores all the business data of an organization in a central repository accessible to all end-users, allowing them to query the data for reporting.

When we want to design a DW, the most common approach used is the Star Schema, created by Kimball; however, the costs of maintaining and re-designing the model when the business requirements and business processes change, or even when the model needs to be incremented, are very high and have a significant impact on the whole structure.

For that reason, the Data Vault approach, invented by Dan Linstedt, emerged, bringing a methodology more oriented to auditability, traceability, and agility of the data, which rapidly adapts to changes in business rules and requirements while handling large amounts of data. This hybrid modus operandi combines the best of 3NF and the Star Schema, being flexible, scalable, and consistent, so the costs of implementation and maintenance are reduced, without the need to modify the whole model structure, allowing the incremental building of new business processes and requirements.

However, as it is still recent, the Data Vault approach has limitations compared to the Star Schema: it requires many joins to access and execute ad-hoc queries, which makes end-user access to the model difficult. Consequently, the model has low performance, and more storage is required because the data is split across many tables.

Although the two are competitors, when it comes to building an EDW capable of providing a central view of the whole business, the Star Schema and Data Vault 2.0 approaches complement each other in the Data Vault architecture. On top of the Data Vault, in the information delivery layer, since the Data Vault cannot be accessed by end-users, Data Marts are created using Star Schemas or OLAP cubes so that BI tools can produce reports for organizational decision-making.

So, briefly, the purpose of this Dissertation is, through a case study, to compare the Star Schema model with the Data Vault 2.0 Ensemble model, to demonstrate the limitations of Data Vault 2.0, and to present an optimized way of designing a Data Vault 2.0 model, reducing the joins required to query the data, minimizing the complexity of the model, and allowing users to access the data directly, instead of creating Data Marts.

KEYWORDS

Big Data; Data Vault; Modeling; Limitations; Optimization

INDEX

1. Introduction
1.1. Problem justification
1.2. Problem (Research Question) / General objective (Main goal)
1.2.1. Specific objectives
1.3. Methodology
1.4. Case Study Research
1.5. Case study strategy
1.6. Methodology and Tools
2. Literature review
2.1. Data Warehouse and Big Data Concepts
2.1.1. Big Data Concept
2.1.2. Data Warehouse definition
2.2. Data Modelling and Big Data challenges
2.3. Data Integration problems
2.4. Problems with Traditional Data Warehousing and Business Intelligence
2.5. Data Vault Ensemble Modeling
2.5.1. Data Vault Fundamentals
2.5.2. Data Vault Architecture
2.5.3. Benefits, disadvantages and limitations of Data Vault Approach
2.5.4. Comparison with other dimensional models
4. Case study
4.1. Data Sources and Data Collection
4.1.1. Business Entities
4.1.2. Data dictionary of ER model
4.2. Differences between a Relational model and a Dimensional model
4.2.1. Traditional DW model - Star schema
4.2.2. Traditional Data Vault 2.0 Ensemble Modeling
4.2.3. The proposal for the optimized Data Vault 2.0 model
5. Results and Discussion
6. Conclusions
7. Limitations
8. Recommendations for future works
Bibliography

Annexes
Load Dimension tables – ETL process

LIST OF FIGURES

Figure 1 - The three V's of Big Data, (Whishworks, 2017)
Figure 2 - Big Data drivers and risks, (EY, 2014)
Figure 3 - ETL Pipeline, (Hultgren, 2012)
Figure 4 - Implementation problems in Business Intelligence projects, (BI-Survey.com, n.d.)
Figure 5 - Data Vault EDW, (Hultgren, 2012)
Figure 6 - Data Vault EDW, (Hultgren, 2012)
Figure 7 - Data Vault EDW, (Hultgren, 2012)
Figure 8 - Hub table, adapted from (Hultgren, 2018)
Figure 9 - Link table, adapted from (Hultgren, 2018)
Figure 10 - Satellite table, adapted from (Hultgren, 2018)
Figure 11 - Data Vault Architecture, (Linstedt & Olschimke, 2015)
Figure 12 - Parallel load in Data Vault 2.0 approach, (Hultgren, 2012)
Figure 13 - SWOT analysis
Figure 14 - ER model data source from Hotel Chain
Figure 15 - Main differences between relational and dimensional modeling, (Varge, 2001)
Figure 16 - Star Schema model, (Moody & Kortink, 2000)
Figure 17 - Star schema model for Bookings Management
Figure 18 - Star schema model for Services Management
Figure 19 - Load Dimension and Fact tables dtsx
Figure 20 - Load Fact Tables package in SSIS
Figure 21 - Fact Booking measures, through derived column component
Figure 22 - ETL process to Load Fact Booking
Figure 23 - OLE DB Source, using a SQL command to extract services data from source
Figure 24 - ETL process to Load Fact Service
Figure 25 - Load Dimension Tables package in SSIS
Figure 26 - Traditional Data Vault 2.0 Model
Figure 27 - Load Hubs, Links and Satellites tables dtsx
Figure 28 - Load Hubs entities package in SSIS
Figure 29 - Example of loading a Hub table in SSIS
Figure 30 - Adding attributes in the Hub entity
Figure 31 - Load Link tables package in SSIS
Figure 32 - Adding metadata to the Link table
Figure 33 - Example of loading a Link table in SSIS
Figure 34 - Load Satellite tables package in SSIS
Figure 35 - Adding metadata to the Satellite tables
Figure 36 - Update new records in SSIS
Figure 37 - Example of loading a Satellite table in SSIS
Figure 38 - Case 1 - Query result in Data Vault 2.0 model
Figure 39 - Case 2 - Query result in Data Vault 2.0 model
Figure 40 - Proposal for an optimized Data Vault 2.0 model
Figure 41 - Bridge Booking Sales table
Figure 42 - SQL Stored Procedure to load the Bridge Booking Sales table
Figure 43 - Query result using the Bridge Booking Sales table in the Data Vault optimized model
Figure 44 - Bridge Booking Guest table
Figure 45 - SQL Stored Procedure to load the Bridge Booking Guest table
Figure 46 - Query result using the Bridge Booking Guest table in the Data Vault optimized model
Figure 47 - Creation of views using Bridge tables
Figure 48 - SQL query to create the Booking Sales view using the Bridge Booking Sales table
Figure 49 - SQL query to create the Booking Information view using the Bridge Booking Sales table
Figure 50 - SQL query to create the Guest Information view using the Bridge Booking Guest table
Figure 51 - Load Hotel Dimension table
Figure 52 - Load Discount Dimension table
Figure 53 - Load Booking Status Dimension table
Figure 54 - Load Cancellation Detail Dimension table
Figure 55 - Load Services Dimension table
Figure 56 - Load Trip Type Dimension table
Figure 57 - Load Room Type Dimension table
Figure 58 - Load Rating Dimension table
Figure 59 - Load Platform Dimension table
Figure 60 - Load Guest Dimension table
Figure 61 - Load Dates Dimension table

LIST OF TABLES

Table 1 - Differences between ETL and ELT, adapted from (Smallcombe, 2019)
Table 2 - Main differences between traditional and modern DW, adapted from (McCue, 2007; Santoso & Yulia, 2017)
Table 3 - Principal features of BI, adapted from (Chugh & Grandhi, 2013)
Table 4 - Different concepts in different Data Models, (Bojičić et al., 2016)
Table 5 - Comparison of the Inmon, Data Vault and Kimball approaches, adapted from (Orlov, 2014)
Table 6 - Business entities of the ER model
Table 7 - Case study attributes, data dictionary of ER model
Table 8 - Fact Tables Booking and Service measures
Table 9 - Hotel dimension attributes
Table 10 - Cancellation dimension attributes
Table 11 - Discount dimension attributes
Table 12 - Booking Status dimension attributes
Table 13 - Trip type dimension attributes
Table 14 - Date dimension attributes
Table 15 - Room type dimension attributes
Table 16 - Customer dimension attributes
Table 17 - Platform dimension attributes
Table 18 - Rating dimension attributes
Table 19 - Service dimension attributes
Table 20 - Identification of Hubs and business keys
Table 21 - Booking Satellites
Table 22 - Service Satellites
Table 23 - Hotel Satellites
Table 24 - Guest Satellites
Table 25 - Room Satellite
Table 26 - Link entities
Table 27 - Bridge Booking Sales table
Table 28 - Bridge Booking Guests table
Table 29 - Results of case study

LIST OF ABBREVIATIONS AND ACRONYMS

BI Business Intelligence

CWM Common Warehouse Metamodel

DW Data Warehouse

DWBI Data Warehouse and Business Intelligence

EDW Enterprise Data Warehousing

EWBK Enterprise Wide Business Keys

ELT Extract, Load, and Transform

ETL Extract, Transform and Load

IS Information System

KPI Key Performance Indicators

MPP Massively Parallel Processing

NF Normal Form

OLAP Online Analytical Processing

SMP Symmetric Multiprocessing

1. INTRODUCTION

Nowadays, with the expansion of the Internet and the consequent increase of information systems (Sarker, Bin Deraman, Hasan, & Abbas, 2019) and the diffusion of social networking, mobile computing, and online advertising, companies are faced with large amounts of data - Big Data (Hashem & Ranc, 2015) - that are crucial to their core business. Information is transformed into a powerful and strategic resource that can support decision-making grounded in real facts, allowing companies to achieve medium- and long-term goals (EY, 2014).

However, most of the collected data makes it challenging to provide feasible answers due to the multiple sources of information. These data are subject to various transformations, are unrelated across the various departments of the organization, have no standards or structure, and can sometimes be obsolete (Oumkaltoum, Mohamed Mahmoud, & Omar, 2019).

The solution for companies to deal with Big Data is to implement an approach capable of transforming these large volumes of data into useful information - the Data Warehouse - and, consequently, into reliable knowledge for the decision-making process. Besides, this multi-dimensional approach is a robust architecture for applying data analysis and reporting techniques over heterogeneous data sources that can be accessed and understood (Ballard et al., 1998).

These heterogeneous data sources contain structured, unstructured, and semi-structured data in different formats in real-time, which leads to Big Data. Traditional databases cannot handle these large volumes of datasets, so data modeling becomes a relevant research topic for designing an architecture capable of defining and categorizing the data, establishing standard definitions and descriptors, and allowing its consumption (Rao, Mitra, Bhatt, & Goswami, 2018).

The Inmon and Kimball approaches are the most famous methodologies used when designing a DW. However, a new approach created by Dan Linstedt, the Data Vault, has gained importance in recent years as a way of building a DW from raw (unprocessed) data coming from heterogeneous sources (Yessad & Labiod, 2017). The emergence of this approach has enabled the traceability of the data and improved the scalability, flexibility, and productivity of the DW compared with other data models (Bojičić et al., 2016), while keeping the total cost of ownership low (Yessad & Labiod, 2017).

Data Vault aims to represent the real core business of the company (Inmon & Linstedt, 2015). It is an incremental (flexible) approach that does not require the total redesign of the dimensional structure (Naamane & Jovanovic, 2016), which provides added value for large amounts of constantly changing data - Big Data - while fitting budgetary expectations (Hultgren, 2012).

1.1. PROBLEM JUSTIFICATION

Organizations handle large amounts of data daily, which makes it challenging to adapt to the constant changes in business rules and requirements. Big Data still needs to confront challenges to achieve a successful architecture model (Storey & Song, 2017), which is why data modeling capable of integrating, aligning, and reconciling unpredictable formats of mainly unstructured and multi-structured data is crucial (Hultgren, 2012).

Data Vault represents a viable and effective approach for modeling data that needs to be traceable, respond to business changes over time, integrate multiple types of sources, accommodate new subject areas, and remain highly agile - and, most importantly, do so with lower maintenance costs (Hultgren, 2012).

Although this methodology is very useful for managing, architecting, and abstracting the main business requirements, the model still presents limitations when storing and accessing the data. A Data Vault cannot be used by end-users due to the exhaustive joins that must be performed to query the data, which has a significant impact on the model's performance (Naamane & Jovanovic, 2016).

So, the challenge will be modeling Big Data into a DW architecture through the Data Vault 2.0 Ensemble approach, to understand the main challenges that companies face when designing a conceptual model and, on the other hand, to demonstrate the main limitations of this approach, comparing it with the Star Schema model, and to present an optimized model capable of responding to the limitations found.

1.2. PROBLEM (RESEARCH QUESTION) / GENERAL OBJECTIVE (MAIN GOAL)

The main purpose of this Dissertation is to propose improvements in the Data Vault 2.0 approach for retrieving Big Data, in order to reduce the exhaustive joins needed to connect the main entity elements of the Data Vault approach: Hubs, Links, and Satellites. Furthermore, another goal of this Dissertation is to compare the Data Vault 2.0 model with the traditional DW model - the Star Schema - and to demonstrate the limitations that DW projects still face when using the Data Vault 2.0 approach.

A new Data Vault 2.0 model will be proposed with the aim of minimizing the Data Vault limitations and allowing end-users to access and query the data using this approach, by applying BI tools directly.

1.2.1. Specific objectives

The following research questions will be investigated in order to achieve the goal under study:

▪ What are the benefits and the disadvantages of Data Vault 2.0, compared with the Kimball approach?
▪ Why is Data Vault not an end-user approach?
▪ Can end-users apply BI tools in Data Vault 2.0 architecture directly?
▪ What are the limitations of the Data Vault 2.0 Ensemble approach?
▪ Why are so many joins used to relate the entities in the Data Vault 2.0 Ensemble?
▪ Is there a way to optimize the Data Vault 2.0 model?


1.3. METHODOLOGY

In the scope of this master's dissertation, exploratory research will be conducted. The main goal is to provide a better understanding of the research questions identified and to find improvements and limitations in the framework under study.

The purpose will be to reveal new standards and insights around the concepts in the study to provide an optimized model to deliver a better response to the challenges faced by organizations.

The choice of this type of research design is based on the flexibility and adaptability to change that it yields, since the goal is to observe and comprehend the data and to discover new ideas by tentative means.

The research method used is based on a qualitative research method, a case study, as described below, to increase the knowledge and find new aspects relevant to this phenomenon.

1.4. CASE STUDY RESEARCH

A case study method allows exploring, investigating, and gaining a better understanding of data coming from a given scenario (Bolder-Boos, 2015).

Case study research is used to investigate the phenomenon under study more deeply and profoundly, to get more contextual insights and understandings (Yin, 2008). Besides, case study methods allow researchers to respond to “How” and “Why” questions of the study problem and do not require any control over it (Yin, 2008).

A case study is a “general term for the exploration of an individual, group or phenomenon” (Sturman, 1997), corresponding to an extensive description of the case and its analysis (Starman, 2013).

According to Simons (2009), a case study is "an in-depth exploration from multiple perspectives of the complexity and uniqueness of a particular project, policy, institution, program, or system in real life".

Case study research can present some advantages regarding its capacity to reach high conceptual validity, which consists of determining and quantifying the indicators related to the theoretical concepts under study. Case studies also integrate methods capable of inducing new hypotheses or even identifying new variables pertinent to particular cases. Besides, they allow researchers to examine causal mechanisms in detail in an individual case context and have a strong capability to adjust to complex causal relations (Starman, 2013).

Briefly, this case study will allow investigating, exploring, demonstrating, and gathering results of this specific scenario in a more practical component, with the objective of justifying and supporting the analysis under study.


1.5. CASE STUDY STRATEGY

An emblematic case study from a typical BI project was chosen to achieve the research objectives.

The strategy is to apply the Data Vault process approach with some improvements and see whether these improvements bring benefits beyond the established traditional Data Vault.

If these improvements are observed then, as this is a typical BI project, it can be inferred that the proposed improvement measures will also benefit future projects.

In summary, with this case study it will be possible to understand, analyze, compare, and study, on a technical level, the differences in the implementation of these approaches, determine which model can most quickly meet expectations and business needs and, using the data, demonstrate the limitations and forms of optimization that remain in the Data Vault approach that has emerged in recent years.

1.6. METHODOLOGY AND TOOLS

This case study implemented Kanban, an agile methodology, owing to the interactivity and incremental building that it provides. Moreover, this methodology accommodates change, adjusting to the business requirements, and is focused on business value and end-users, leading to quality improvement in each delivery.

Besides, the Data Vault Ensemble approach aligns with this methodology, being capable of adapting to business changes and improving model quality.

Regarding the tools, the Star Schema and Data Vault models will be created with Microsoft tools, such as SQL Server Management Studio 2017 and SQL Server Data Tools 2015, owing to the licenses provided by Nova Information Management School.


2. LITERATURE REVIEW

In this chapter, a theoretical background will be presented in order to introduce the main studies and research already done on the topic of this Master's Dissertation.

In order to sustain the theoretical research and to support the Dissertation presented, subjects related to data structures and the conceptual data model, data integration issues, traditional DW problems, the main challenges with Big Data and, finally, Data Vault modeling and its comparison with the Inmon and Kimball approaches are included in the study.

This literature review aims to understand the main problems and challenges that organizations face nowadays when implementing a DW using Big Data, and the strategies that they use. It serves as a foundation for defining and collecting studies and research about the Data Vault approach and for comparing Linstedt's DW methodology with the Inmon and Kimball approaches, in order to comprehend the benefits, disadvantages, and limitations of developing a DW project with large amounts of data.

To start, and for a better understanding of the two concepts most discussed in this Dissertation, Big Data and Data Warehouse, a definition of these two notions is presented.

2.1. DATA WAREHOUSE AND BIG DATA CONCEPTS

Before presenting the theoretical background collected related to the topic of this Dissertation, it is crucial to define the two main concepts that will be addressed during this research: Big Data and Data Warehouse.

2.1.1. Big Data Concept

The Big Data concept refers to large amounts of data that are dynamic, because they are continuously changing, and that are created by people, tools, and machines (EY, 2014).

Gartner defines Big Data as "high-volume, high-velocity, and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation" (Gartner, 2012). Unstructured data, in particular, face exponential growth, due in large part to the explosion of machine-generated data and human engagement within social networks (Eberendu, 2016).

In the beginning, the term Big Data was characterized, according to Doug Laney, by three key concepts: volume, velocity, and variety. Volume corresponds to the total amount of data that is generated and increased by e-commerce channels. Velocity refers to how often data is generated by these e-commerce channels and needs to be stored. Finally, variety determines the heterogeneity of the data sources, which induces incompatible data formats, non-aligned data structures, and inconsistent data semantics, making the adoption of effective data management critical (Laney, 2001).

Later, by incorporating structured, semi-structured, and unstructured data, Big Data was again characterized, adopting one more V: Veracity, related to the truthfulness of the data and their integration. The amount of data that is created is enormous compared to traditional databases, encompassing a diversity of sources that are generated continuously and require rapid processing. Since the data come from various data sources, it is necessary to test their veracity (EY, 2014).

Figure 1 - The three V's of Big Data, (Whishworks, 2017)

Currently, the Big Data concept has seven V's associated with it: Volume, Velocity, Variety, Veracity, Variability, Visualization, and Value (Mcnulty, 2014). The new V's are Variability, which is associated with constantly changing data; Visualization, which corresponds to representing the large volume of data in a way that is understandable for those who use it; and, finally, Value, related to the significance of the data for the business goals (Mcnulty, 2014). For these massive amounts of data, scalable technology capable of collecting, storing, and analytically processing all the information is necessary (EY, 2014), in order to extract insights and transform them into knowledge, allowing organizations to increase their competitive advantage, become more innovative, and raise their level of productivity (Eberendu, 2016).


Figure 2 - Big Data drivers and risks, (EY, 2014)

2.1.2. Data Warehouse definition

Gartner defines a DW as an architecture that stores data from different data sources (transactional systems, operational data, and external sources) and aggregates all these data and business requirements into a single enterprise-wide view suitable for reporting and data analysis in the decision-making process (Gartner, 2019).

A DW is characterized as being subject-oriented, giving information about a specific business subject; integrated, collecting heterogeneous sources into a single one; nonvolatile, because the data is not updated or changed once inserted into the DW; and time-variant, because the data relate to a certain period of time (Inmon, 2002).

The advantages of creating a DW are related to the following characteristics (Almeida, 2017):

▪ Integrating data from multiple sources;
▪ Performing new types of analytical analysis;
▪ Reducing costs to access historical data;
▪ Standardizing data across the organization, having a single vision of the data;
▪ Improving turnaround time for analysis and reporting;
▪ Sharing data and allowing others to access data easily;
▪ Supporting ad-hoc reporting and inquiry;
▪ Reducing the development burden on IS/IT;
▪ Removing informational processing load from transaction-oriented databases.

However, the adoption of this architecture can lead to some challenges (Almeida, 2017):

▪ Time-consuming preparation and implementation;
▪ Difficulty in integration compatibility considering the use of different technologies;
▪ High maintenance costs;
▪ Limited use due to confidential information;
▪ Data ownership and data security;
▪ Underestimation of ETL processing time;
▪ Inability to capture the required data;
▪ Increased demands of the users.

The costs of building and maintaining a DW can be very high and significantly different from the cost of a standard system, due to the large volume of data that the DW stores and the cost of keeping the interface between the DW and the operational sources (depending on whether ETL tools are used); furthermore, the implementation of a DW is never finished, due to the ongoing need to add new data or new areas to the DW (Inmon, 2002).

Data Warehousing is a collection of decision support technologies that allows experts (management, analysts) to make better and quicker responses in decision-making (Chaudhuri & Dayal, 1998).


The importance of Data Warehousing increases with the need to structure and store data for the decision-making process of companies. Data is considered a powerful and tangible asset, which can bring competitive advantages in the business world. The purpose of creating a DW has been growing due to the vast quantities of data generated by organizations, which they need to access and use in the day-to-day business (Ballard et al., 1998).

2.2. DATA MODELLING AND BIG DATA CHALLENGES

Companies are faced with large amounts of data due to the development of new technologies, which have been growing exponentially. This information boom is characterized by the difficulty of integrating and aggregating all the data to support the organization's data structure, especially for data management and decision-making (Oumkaltoum et al., 2019).

Conceptual data modeling becomes increasingly crucial for documenting and understanding all of the organization's existing data elements and attributes, the flow of information and, particularly, how they can be associated - the relationships between the data (Teorey et al., 2011).

Data modeling consists of a representation/visualization of the business world, incorporating abstraction and a reflection of the business area, before any implementation, which is why it is so important. This concept is characterized as a well-organized abstraction of business data (Ballard et al., 1998).

Furthermore, the conceptual data model is essential because it defines the business objects (data abstraction) and their properties (attributes). It permits communication with all members involved, who do not need any expertise to understand the business model, identifies the scope of the business data, and defines the cardinalities (associations) of the relations between the data objects (Teorey et al., 2011).

Designing a conceptual data model is an iterative process, which becomes more detailed as the entities and relationships are added to transform logical designs into physical designs (Hultgren, 2012).

Nevertheless, it is also essential that the definition of the requirements is clear for it to be possible to model a conceptual data model. Otherwise, the IS project fails for reasons such as unclear and incomplete requirements and specifications, lack of user input, or constant change of the requirements by the stakeholders. Although the design of IT structures is essential for success, it continually presents challenges (Gemino & Wand, 2003).

Besides, it is challenging to design a DW architecture due to business dynamism and the complexity of the data. It is not realistic to expect the information to remain static, and the requirements are not always provided at the beginning of a DW project (Jovanovic, Romero, Simitsis, Abelló, & Mayorova, 2014). Nowadays, the DW must be adaptable to constant data and source changes.

With the arrival of the Big Data era, the design process becomes more difficult to organize and represent. The challenges grow because of the volume of the data, the uncertain veracity of the data, the variety of the sources and, finally, the fast velocity at which data arrive and change (Gil & Song, 2016).

Besides, the variety of data in different formats, platforms, and structures makes it difficult to represent a big-picture perspective (Ballard et al., 1998).

However, modeling is an essential key to communicating with the stakeholders, to codifying the business needs and requirements and, more importantly, to providing the technical aspects and details for the developers to build the DW (Hultgren, 2012).

Before any implementation, the ideal is to analyze and design the data structure, to have a solid conceptual data model capable of representing the business and the data flow and of enabling a better selection of the DW approach to be used. This step will ensure an effective DW and reduce implementation costs (Ballard et al., 1998).

Many data modeling approaches exist to design DW architectures, and they are designed with the same characteristics, tables, and relationships. So, the main difference between them is the essence of the rules established and their purpose in the way these tables and relationships are modeled (Hultgren, 2012).

When we talk about significant amounts of data, the Data Vault modeling approach is considered one of the most effective, because the primary purpose of this methodology is orientation to requirements, the integration of multiple heterogeneous sources, especially unstructured (semi-structured, multi-structured) data, and the provision of agility, absorbing business changes rapidly (Linstedt & Olschimke, 2015).

The principal difference in the traditional modeling approaches is that the design is not expected to receive business changes, because their source systems are constant and the project scope is restricted to specific requirements; neither auditability nor an enterprise-wide view of data is required or planned (Hultgren, 2012).

2.3. DATA INTEGRATION PROBLEMS

Data integration is critical in the process of building a DW (Calvanese, De Giacomo, Lenzerini, Nardi, & Rosati, 2002).

However, with the Big Data era, data integration requires special attention in the way data is extracted from massive data sources. The variety, volume, and overlap of the data create efficiency and effectiveness problems when we talk about integrating the data. The massive volume of data can be very costly and bring issues when accessing all the data sources, which makes it challenging to achieve scalability and efficiency (Linstedt & Olschimke, 2015).

Moreover, data integration deals with other kinds of challenges, such as the semantics and meaning of the business objects; the grain, precision, accuracy, and quality of data; the definition of keys and identifiers; formats, defaults, exception rules, and null interpretations; temporal and timeline issues; the consistency of loads and changes; and the reengineering of the data and requirements (Hultgren, 2018).


In the process of integrating data from the source into the DW structure, the data can suffer redundancies and inconsistencies (Calvanese et al., 2002) that must be solved through ETL tools. It is vital to consider some data quality criteria in order to have a reconciled and integrated view of business data. Consistency, validity, conformity, accuracy, and integrity are keywords when it comes to data processing (Shivtare & Shelar, 2015). These ensure that the data loaded at the destination is non-conflicting and consistent, reasonable over a given period, accurate and useful to the real world, and able to relate to each other.
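To make these criteria concrete, the following is a minimal T-SQL sketch of a validation step of the kind an ETL process might apply; the staging tables (stage.Booking_Raw, stage.Booking_Clean) and the rules are hypothetical illustrations, not the case-study schema.

-- Hypothetical validation step: only rows that pass basic quality rules move on
-- from the raw staging table to the load-ready table.
INSERT INTO stage.Booking_Clean (BookingCode, CheckIn, CheckOut, TotalAmount)
SELECT s.BookingCode, s.CheckIn, s.CheckOut, s.TotalAmount
FROM   stage.Booking_Raw AS s
WHERE  s.BookingCode IS NOT NULL     -- integrity: the key must be present
  AND  s.CheckOut >= s.CheckIn       -- consistency: dates must be reasonable
  AND  s.TotalAmount >= 0;           -- validity: no negative amounts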

Figure 3 represents the different forms of transformations possible in the Extract and Load processes of ETL.


Figure 3 - ETL Pipeline, (Hultgren, 2012)

However, it is not always possible to ensure efficient data integration, for multiple reasons. Bad data, unexpected changes, different formats and missing data in source systems, source data that does not comply with standards, the complexity of the DW, different encoding formats, and a lack of business ownership, policy, and planning for the entire enterprise data all contribute to data quality problems (Shivtare & Shelar, 2015) and are some of the most prevalent issues with which the integration process deals.

Currently, in the Big Data era, data integration becomes critical due to the variety of the data, which comes from autonomous and heterogeneous data sources and is more vulnerable to overlapping. Besides, the characteristics of Big Data bring challenges, especially in efficiency and effectiveness aspects (Lin, Wang, Li, & Gao, 2019).

The massive data sources that need to be handled in order to integrate Big Data make the process costly and sometimes make the data impossible to access, requiring high computational complexity and efficient algorithms capable of dealing with this phenomenon (Lin et al., 2019).

Data quality also becomes a concern in the data integration strategy because bad data quality can bring poor insights and improper decision-making (Brown, 2019).


With the Big Data paradigm, data integration also undergoes changes in the methods of transforming and loading data from diverse sources into one location. The typical ETL process, which is well known and widely used in data integration, is now being replaced by the ELT technique.

In ETL, after the extraction of the data, the transformation is performed before the data is loaded into the final architecture. With Big Data, however, this process becomes more complex due to the large quantities of data generated (Smallcombe, 2019). So, the data is loaded first, and the necessary transformations are made afterward. However, this technology is very recent, so the ELT pipeline presents challenges, and experts are needed to implement it.
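As a simple illustration of the ELT ordering (all table and column names in this sketch are invented), the raw data is landed first, without transformation, and only later transformed inside the target system when it is needed for analysis:

-- Step 1 (load): land the raw extract as-is, with no transformation.
INSERT INTO dw.Booking_Raw (Payload, LoadDate)
SELECT Payload, SYSUTCDATETIME()
FROM   ext.BookingFeed;

-- Step 2 (transform, later and on demand): shape the data inside the target system.
INSERT INTO dw.Booking (BookingCode, TotalAmount)
SELECT JSON_VALUE(Payload, '$.bookingCode'),
       TRY_CAST(JSON_VALUE(Payload, '$.totalAmount') AS DECIMAL(10, 2))
FROM   dw.Booking_Raw;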

Table 1 presents the differences between these two concepts (ETL and ELT) for a clearer understanding.

Adoption of the technology and availability of tools and experts
ETL: Very well known, with expertise built over the last 20 years.
ELT: New technology, so it can be difficult to locate experts.

Availability of data in the system
ETL: Only transforms and loads the data necessary for the DW; the data is transformed before the load.
ELT: Loads all data immediately; users can determine which data to transform and analyze later.

Compatibility with data lakes
ETL: Not a solution for data lakes, because it only integrates data in a relational data warehouse system.
ELT: Offers a pipeline for data lakes to absorb unstructured data.

Compliance
ETL: Removes sensitive information before loading it into the DW.
ELT: Requires the upload of the data before removing sensitive information, so sensitive information is more vulnerable.

Data size vs. complexity of transformations
ETL: More appropriate for handling smaller data sets that require complex transformations.
ELT: Handles massive amounts of structured and unstructured data - Big Data.

Data warehousing support
ETL: Works with cloud-based solutions and DWs; requires a relational or structured data format.
ELT: Works with cloud-based DW solutions supporting structured, unstructured, semi-structured, and raw data types.

Hardware requirements
ETL: Cloud-based ETL platforms do not require specialized hardware.
ELT: ELT processes are cloud-based and do not require specialized hardware.

How aggregations differ
ETL: If datasets increase in size, aggregation becomes more complicated.
ELT: With a cloud-based target data system, it is possible to process massive amounts of data quickly.

Maintenance requirement
ETL: Automated, cloud-based ETL solutions require little maintenance; however, an onsite ETL solution that uses a physical server will require frequent maintenance.
ELT: Cloud-based and generally incorporating automated solutions, so very little maintenance is required.

Order of the extract, transform, load process
ETL: Data transformations happen immediately after extraction, within a staging area; after transformation, the data is loaded into the data warehouse.
ELT: Data is extracted and then loaded into the target data system first; only later is some of the data transformed on an "as-needed" basis for analytical purposes.

Transformation process
ETL: Transformations happen within a staging area outside the data warehouse.
ELT: Transformations happen inside the data system itself, and no staging area is required.

Unstructured data support
ETL: Can be used to structure unstructured data, but cannot be used to pass unstructured data into the target system.
ELT: A solution for uploading unstructured data into a data lake and making unstructured data available to business intelligence systems.

Waiting time to load information
ETL: Load times are longer than with ELT because it is a multi-stage process: (1) data loads into the staging area, (2) transformations take place, (3) data loads into the DW. Once the data is loaded, the analysis of the information is faster than with ELT.
ELT: Data loading happens faster because there is no waiting for transformations and the data only loads once into the target data system. However, the analysis of the information is slower than with ETL.

Waiting time to perform transformations
ETL: Data transformations take longer initially, because every piece of data requires transformation before loading; also, as the size of the data system increases, transformations take more time. However, once transformed and in the system, analysis happens quickly and efficiently.
ELT: Since transformations happen after loading, on an as-needed basis, and only the data required for analysis at the time is transformed, transformations happen much faster. However, the need to continually transform data slows down the total time it takes for querying/analysis.

Table 1 - Differences between ETL and ELT, adapted from (Smallcombe, 2019)

2.4. PROBLEMS WITH TRADITIONAL DATA WAREHOUSING AND BUSINESS INTELLIGENCE

The constant increase of data and the need to obtain knowledge instantly have raised the demands for accuracy and efficacy in the organization of information, and databases have come to play a critical role in its management (Wannalai & Mekruksavanich, 2019).

The EDW has emerged to represent all the organization's business data and specific rules across the multiple subject business areas - a "single version of the truth" - instead of a traditional DW that represents only one single business area (now called a Data Mart). The EDW concept is capable of providing all organizational information, aggregated by context - a "single version of the facts" - making all organizational data over time available to the individual users of the organization (Linstedt & Olschimke, 2015).

Conventional systems lack efficiency when dealing with large volumes of data and information (Wannalai & Mekruksavanich, 2019), which is reflected in the DW development with the absence of a standardized DW data model (Bojičić et al., 2016).

So, a DW is a critical corporate asset nowadays, given its importance in strategic business decisions, which, besides providing operational system support, can also bestow personalized offers and present upsell promotions (Linstedt & Olschimke, 2015). This aspect can be crucial when organizations deal with their competitors.

With the expansion of information, companies must have real-time data to facilitate decision-making and the capacity to respond more quickly to their customers. Traditional DWs are neither prepared to handle these volumes of data nor to deliver information in real-time (Bouaziz, Nablil, & Gargouri, 2017).

Traditional DWs can give us information about the past, answering questions like "What has happened?", which is supplied by historical data. However, although these questions remain relevant, Big Data can yield organizations answers about their future. Using advanced analytics, they are capable of discovering powerful insights and trends in the variety of data and transforming them into information and knowledge useful for the strategies of the company (McCue, 2007).

Modern DW, using Big Data, can respond to questions such as “What is happening now?” or even “What could happen?” (McCue, 2007), which traditional DWs cannot. Besides, with the new era of Big Data, it is possible to make predictive analyses based on the data, which adds value to the core business of organizations.

Traditional DWs are oriented mainly toward strategic decisions, containing historical data that is integrated daily, weekly, or monthly, which makes reporting (more restricted to the existing processes and patterns) and measuring the data difficult (Bouaziz et al., 2017).


Besides, much of the data comes from unstructured or semi-structured sources, and a traditional DW cannot categorize and store this type of data. The data is generated very quickly, so a flexible and agile structure that can process it quickly is needed.

Table 2 displays the main differences between traditional and modern DWs:

Purpose
Traditional DW: The principal purpose is to support the decision-making process. It is implemented for a specific business area, and the data collected is non-volatile, time-variant, and integrated.
DW nowadays: The primary purpose is to integrate multiple heterogeneous sources (structured, semi-structured, and unstructured data) to store, manage, and analyze them.

Data source
Traditional DW: Transactional and operational databases.
DW nowadays: Different formats, sources, and standards.

Data size
Traditional DW: Terabytes.
DW nowadays: Petabytes.

Scope
Traditional DW: Support BI (Business Intelligence) and OLAP (Online Analytical Processing).
DW nowadays: Discover insights from Big Data using data mining techniques.

Architecture
Traditional DW: Star schema is the most used approach; oriented to ETL tools.
DW nowadays: No defined architecture; it depends on the complexity of the DW project.

Schema
Traditional DW: Static.
DW nowadays: Unstructured, non-transactional data; dynamic schemas.

Repositories
Traditional DW: Often fragmented into multiple warehouses.
DW nowadays: Single repository using the concept of a data lake, which is constantly gathering and adding data.

Technology
Traditional DW: There are several free and licensed applications and tools in the market.
DW nowadays: The technology must support, process, and store Big Data.

Processing scalability
Traditional DW: Scales vertically.
DW nowadays: MPP (Massively Parallel Processing) capacity.

Storage
Traditional DW: Relational data stores.
DW nowadays: Distributed file system.

End-user
Traditional DW: Top management and business analysts.
DW nowadays: Data scientists.

Table 2 - Main differences between traditional and modern DW, adapted from (McCue, 2007; Santoso & Yulia, 2017)


The evolution of Big Data has affirmed the importance of adopting effective BI to improve companies' tactical and strategic management processes and decision-making processes, and to increase productivity and efficiency. This set of computing technologies, capable of identifying, collecting, storing, and analyzing data with the aim of converting them into actionable and pertinent information, can proffer successful strategic plans to companies. The adoption of efficacious BI will primarily provide insights leading to the discovery and comprehension of consumer buying trends, which can increase profits through better-targeted marketing campaigns (Chugh & Grandhi, 2013).

The constant increase of data brings challenges to traditional decision support systems, which are not sufficient to handle it. So, BI tools capable of processing and analyzing this kind of data captured from multiple sources are needed. BI tools can create intelligence for the core business of the organization, converting data into meaningful and useful information (Chugh & Grandhi, 2013).

Table 3 presents the main features that BI is capable of handling:

Data consolidation
▪ Integration of data from both in-house and external sources.
▪ Simplified extraction, transformation, and loading of data through graphical interfaces.
▪ Elimination of unwanted and unrelated data.

Data quality
▪ Sanitize and prepare data to improve the overall accuracy of decisions.

Reporting
▪ User-defined, as well as standard, reports can be generated to serve employees at different levels.
▪ Personalized reports to cater to different individuals and functional units.

Forecasting and modeling
▪ Support in creating forecasts and making comparisons between historical data and real-time data.

Tracking of real-time data
▪ Monitor current progress against defined objectives through KPIs or expected outcomes.
▪ Prioritize scarce resources.

Data visualization
▪ Interactive reports with visualization to understand relationships easily.
▪ Scorecards to improve communication.

Data analysis
▪ What-if analysis.
▪ Sensitivity analysis.
▪ Goal-seeking analysis.
▪ Market basket analysis.

Mobility
▪ Portable applications can be installed on mobile devices such as mobile phones and tablet computers to support executives and sales staff while traveling.

Rapid insights
▪ Drill-down features allow users to dig deeper into data.
▪ Through dashboards, it is possible to identify and correct negative trends, monitor the impact of newly made decisions, and improve overall business performance.

Report delivery & shareability
▪ Deliver reports to view in the most commonly used office applications, such as Microsoft Office (Word, Excel, and so forth).
▪ Email reports in different formats.

Ready-to-use applications
▪ Pre-built metadata with defined mappings considering performance & security needs.
▪ Pre-built reports and alerts to support management in real-time.

Language support
▪ Multiple language support.

Table 3 - Principal features of BI, adapted from (Chugh & Grandhi, 2013)

However, BI projects still face some issues when implementing a DW architecture, and several factors can lead to these problems, as presented below:

Figure 4 - Implementation problems in Business Intelligence projects, (BI-Survey.com, n.d.)


Notwithstanding BI project issues, these programs, processes, and tools allow organizations to make more informed decisions. These decisions are focused on an integrated enterprise data view for the whole company, because organizations do not work with only one unit, so it is essential to maintain the whole perspective (Hultgren, 2012). However, without an appropriate DWBI initiative, the integration of data to extract pertinent insights is not possible.

The DWBI framework is confronted with dynamically changing requirements, so the challenge is to be more real-time oriented.

2.5. DATA VAULT ENSEMBLE MODELING

A DW is a fundamental concept in an Enterprise due to the possibility of evaluating its performance over time, facilitating decision-making support (IBM, 2011).

In order to store large amounts of data from multiple heterogeneous sources and preserve the historical data, a data model that represents the physical structure of a DW, able to consume the data, reconcile the different sources, and be resilient to changes that may occur, is mandatory (Bojičić et al., 2016).

The CWM defines approaches that propose that the data should be organized according to 3NF or in multi-dimensional models; however, these have limitations with respect to the maintenance of the DW. A new approach, the Data Vault, has recently emerged to overcome these limitations (Yessad & Labiod, 2017).

When building a DW, one of the things needed is to measure the agility to adapt to changes, because an EDW is continually changing due to new sources and attributes, new requirements and business rules, deliveries, and the expansion of subject areas. Thus, it is crucial to ensure that the database model is agile for possible future changes and that maintenance costs are not unsustainable (Linstedt & Olschimke, 2015).

The DW needs to be based on central business data that can easily adapt to future changes/modifications, integrating multiple sources into one structure and tracking information history, providing truthful and auditable information.

The Data Vault approach, created in the early 2000s by Dan Linstedt (Linstedt & Olschimke, 2015), came to compete with the Inmon and Kimball approaches. Linstedt defines the Data Vault as "a detail-oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business" (Linstedt & Olschimke, 2015).

A Data Vault is an empirical model, and its modeling approach consists of a form of ensemble modeling, with the fundamental principle that things must be broken into parts (Hultgren, 2012). When we refer to the term Ensemble Modeling, we associate it with Unified Decomposition.

The implementation of a DW should be subject-oriented, and that is what the Ensemble Modeling concept is based on. The goal is to divide things into multiple components for the sake of flexibility, adaptability, and agility, and to facilitate the way things are interpreted differently or change independently of each other (Hultgren, 2012).

However, although we intend to encapsulate the information, grouping it by subject, we also intend, on the other hand, to integrate all the data into a single view (Hultgren, 2012).

Figure 5 - Data Vault EDW, (Hultgren, 2012)

A Data Vault represents the business processes with their ties, through the business keys, which are crucial to the model because they indicate how the business can access, connect, and integrate the systems (Inmon & Linstedt, 2015).

The methodology under study presents characteristics that adapt to changes in business and organizational processes. One of them is the separation of descriptive attributes, making the model more flexible and responsive to new changes - incremental build (Hultgren, 2012) - which allows data to be loaded in parallel and traced back to its source, enabling the exploitation of the data (Yessad, 2016).

The Data Vault approach is ideal for organizations that need to react to constant changes in business requirements and to integrate multiple sources when the business environment is very complex. So, a centralized DW is needed - one which takes advantage of the market, is flexible, increments the business, and can extract information for decision-making (Inmon & Linstedt, 2015).

2.5.1. Data Vault Fundamentals

The Data Vault 2.0 approach is based on three components, each with a specific function: Hubs, Links, and Satellites. Hubs consist of the natural business keys, Links are the natural business relationships and, finally, Satellites cover all the business context, descriptive data, and history (Linstedt & Olschimke, 2015).


Figure 6 - Data Vault EDW, (Hultgren, 2012)

The identification of three levels is required in the development of the modeling process of a Data Vault: first, the business keys and business concepts; second, the identification and modeling of the existing natural business relationships; and finally, the design of the correct attribute context for the creation of the Satellites (Hultgren, 2012).

The principal tasks needed when building a DW with the Data Vault approach are as follows (Hultgren, 2013):

1. Identify Business concepts;
1.1. Establish EWBK for Hubs;
1.2. Model Hubs;
2. Identify Natural Business Relationships;
2.1. Analyze Relationships Units of Work;
2.2. Model Links;
3. Gather context attributes to define keys;
3.1. Establish Criteria and design satellites;
3.2. Model Satellites.


Figure 7 - Data Vault EDW, (Hultgren, 2012)

2.5.1.1. Hubs

In operational systems, users access data through business keys, which refer to the business objects. The business keys are thus of central importance in identifying the business objects, which is why the Data Vault model separates them from the rest of the model (Linstedt & Olschimke, 2015).

The business keys are defined to identify, track, and locate information, and must be unique and have a very low propensity to change (Linstedt & Olschimke, 2015).

Hubs are the central pillar of the Data Vault model (Linstedt & Olschimke, 2015) and represent the core business concepts or business objects (Lans et al., 2015). Hub entities do not contain any descriptive information or foreign keys, and their cardinality must be 1:1 (Hultgren, 2013). The Hub table is essential in tracking the arrival of a new business key in the DW (Linstedt & Olschimke, 2015) and incorporates the business key(s) referring to the business object, which can be a composite key (Cox, 2014).

The Hub structure, Figure 8, is composed of the following attributes (Linstedt & Olschimke, 2015):


▪ Surrogate Key: based on the business key, it corresponds to the primary key of the Hub and improves lookup performance within the DW. It is also used as a foreign key, referenced in Link and Satellite entities;
▪ Business Key: this attribute is the central element of the Hub; it should be a unique index and can be a composite key used by the business object;
▪ Load Date: generated in the ETL process that loads the DW, it indicates when the business key first arrived in the DW. It allows tracing errors and finding technical load problems, which can affect data when loaded;
▪ Record Source: describes the master data source, or the origin of the source of the business key, allowing traceability of the information.

Figure 8 - Hub table, adapted from (Hultgren, 2018)
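To make this structure concrete, the following is a minimal T-SQL sketch of a Hub table; the Hub_Hotel name, the HotelCode business key, and the data types are hypothetical illustrations rather than the case-study schema.

-- Minimal sketch of a Hub table (hypothetical Hub_Hotel).
CREATE TABLE dbo.Hub_Hotel (
    HotelHashKey  CHAR(32)      NOT NULL,  -- surrogate (hash) key: primary key of the Hub
    HotelCode     NVARCHAR(50)  NOT NULL,  -- business key: unique across the enterprise
    LoadDate      DATETIME2     NOT NULL,  -- when the business key first arrived in the DW
    RecordSource  NVARCHAR(100) NOT NULL,  -- originating source system, for traceability
    CONSTRAINT PK_Hub_Hotel PRIMARY KEY (HotelHashKey),
    CONSTRAINT UQ_Hub_Hotel_BK UNIQUE (HotelCode)   -- the business key must be unique
);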

Hubs use a “unique list of business keys and provide a soft-integration point of raw data that is not altered from the source system but is supposed to have the same semantic meaning. The business keys in the same hub should have the same semantic granularity” (Linstedt & Olschimke, 2015).

However, in some cases, when multiple sources populate the Hub, the business key cannot be unique in the Hub context, so other identification attributes, called metadata, are used (Inmon & Linstedt, 2015). This metadata consists of two attributes, the record source and the load date. The first identifies and tracks the source system, while the second gives the arrival date and time of the business key in the DW (Cox, 2014).

The hash key is another attribute, used to reference (as a foreign key) the business object in the Link and Satellite elements of the Data Vault, to enhance the performance of the DW load and of the joins between the business keys in the model (Linstedt & Olschimke, 2015).

The hash key increases join speed and lookup performance in the Data Vault DW; it is based on the business key and becomes the primary key of the Hub (Linstedt & Olschimke, 2015).
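As an illustration, a minimal T-SQL sketch of a Hub table and of the hash-key computation is given below. The table and column names (H_Customer, Customer_BK, and so on) are hypothetical and only follow the structure described above; they are not taken from the case study.

-- Hypothetical Hub for a "Customer" business object (illustrative names).
CREATE TABLE H_Customer (
    Customer_HashKey CHAR(32)    NOT NULL PRIMARY KEY, -- MD5 hash of the business key
    Customer_BK      VARCHAR(50) NOT NULL UNIQUE,      -- natural business key
    Load_Date        DATETIME2   NOT NULL,             -- arrival time of the key in the DW
    Record_Source    VARCHAR(50) NOT NULL              -- originating source system
);

-- The hash key is typically derived from the trimmed, upper-cased business key,
-- e.g. with SQL Server's HASHBYTES function:
SELECT CONVERT(CHAR(32),
       HASHBYTES('MD5', UPPER(LTRIM(RTRIM('CUST-0001')))), 2) AS Customer_HashKey;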

2.5.1.2. Links

Another element of the Data Vault methodology is the Link entity, which represents the natural business relationships between Hubs (Lans et al., 2015) or sometimes with other Links (Hultgren, 2013).

These entities are responsible for modeling transactions, associations, hierarchies, and redefinitions of business terms (Linstedt & Olschimke, 2015), connecting two or more Hubs through the operational business processes that use business objects in the execution of their tasks (Inmon & Linstedt, 2015).

Like Hub entities, Links also have hash keys, computed from the combination of all the business keys involved, which replace the joins otherwise needed to reference the Hubs. This also helps the ETL jobs, when loading data from the staging area, to confirm that no duplicate Link entries represent the same relationship, instead of comparing all the Hubs' business keys with the Links' business keys (Linstedt & Olschimke, 2015).

The number of Hubs a Link connects defines the granularity of the Link; a new grain is added whenever a new Hub is added to a Link entity. The more Hubs a Link connects, the finer the granularity (Hultgren, 2012).

Links are instrumental in storing relationship records from the past, present, and future of the data. Each Link is composed of a hash key, which corresponds to the primary key of the Link, making it identifiable in the DW and ensuring the scalability of the Data Vault model (Linstedt & Olschimke, 2015).

The cardinality of the relationship is many-to-many. This characteristic makes the Link an associative entity, allowing many instances on both sides of the relationship (Linstedt & Olschimke, 2015). The Link contains the respective foreign keys of the Hubs (hash keys) and the metadata attributes (Load Date and Record Source), but no descriptive information (Hultgren, 2013).

The Link structure, Figure 9, is composed of the following attributes (Linstedt & Olschimke, 2015):

▪ Link surrogate key: combines all the business keys of the Link, making the identification of this entity easier and the joins faster;
▪ Load date: metadata attribute used for technical and informative reasons;
▪ Record source: metadata attribute recording the origin of the source;
▪ Hub surrogate key(s): foreign key(s) referencing the Hub entities.

Figure 9 - Link table, adapted from (Hultgren, 2018)

The many-to-many cardinality provides some advantages, especially the flexibility of the Links in the Data Vault model. If the business rules change, it is easy for developers to respond to the new requirements by connecting new Hubs to existing ones through Link entities, without re-engineering the whole model (Linstedt & Olschimke, 2015).

Link entities are a crucial element in the physical model because they absorb changes in business requirements and business rules without any impact on the existing (historical) data sets or on the existing processes (Hultgren, 2012).
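A minimal T-SQL sketch of such a Link is shown below, connecting the hypothetical H_Customer Hub from the previous section to an equally hypothetical H_Order Hub; the names are illustrative, not part of any prescribed standard.

-- Hypothetical Link between the Customer and Order Hubs.
-- The Link hash key is computed over the combination of both business keys.
CREATE TABLE L_Customer_Order (
    Customer_Order_HashKey CHAR(32)    NOT NULL PRIMARY KEY,
    Customer_HashKey       CHAR(32)    NOT NULL REFERENCES H_Customer (Customer_HashKey),
    Order_HashKey          CHAR(32)    NOT NULL REFERENCES H_Order (Order_HashKey),
    Load_Date              DATETIME2   NOT NULL, -- metadata: arrival time in the DW
    Record_Source          VARCHAR(50) NOT NULL  -- metadata: origin of the relationship
);

Note that the Link carries only hash keys and metadata; relating a new Hub later means creating a new Link table, not altering this one.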


2.5.1.3. Satellites

The last Data Vault model element is the Satellite, which contains the attributes of the Hubs (Lans et al., 2015). This entity provides all the context and descriptive information of the business object. It is possible to have many Satellites describing a single business key, but each Satellite can describe only one key, either a Hub or a Link (Hultgren, 2013).

Satellites can describe a business object, relationship, or transaction (Inmon & Linstedt, 2015), giving a particular context to Hubs and Links over a period of time (Linstedt & Olschimke, 2015).

A Satellite can depend on only one Hub or Link entity (Hultgren, 2012). It is recommended to have at least one Satellite entry for every Hub or Link key; otherwise, an outer join is required, which should be avoided due to performance and complexity problems (Linstedt & Olschimke, 2015).

One of the functions of the DW is to provide historical data and, in the Data Vault 2.0 model, the Satellites store every change to the raw data, giving a historical view of it (Hultgren, 2012).

The Satellite structure, Figure 10, is composed of the following attributes (Linstedt & Olschimke, 2015):

▪ Parent surrogate key: corresponds to the hash key of the parent Hub or Link and is part of the primary key, together with the load date attribute, providing the context and the date and time of the change;
▪ Load date: indicates the date and time at which a change in the Satellite entries occurred, and is also part of the primary key. The date and time refer to the moment the record is inserted into the DW. The load date is a metadata attribute;
▪ Record source: hard-coded and applied to maintain traceability of the arriving data set; it should indicate the master data source. This metadata attribute is the key to maintaining the auditability of the DW;
▪ Load end date: indicates the date and time when the Satellite entry becomes invalid. It is the only updatable attribute in a Satellite, set every time a new entry for the same key is loaded from the source system.

Figure 10 - Satellite table, adapted from (Hultgren, 2018)

A good practice when creating Satellite entities is to split the data among multiple Satellites, so that not all the descriptive information is stored in a single Satellite. It is therefore recommended to split the descriptive attributes by source system, which means that each incoming data set is kept in an individual Satellite, dependent on its parent (Hub or Link) (Linstedt & Olschimke, 2015).

The raw data from a denormalized source data set would be distributed in different Satellites to be kept dependent on the appropriate business object, relationship, or transaction. This aspect provides some benefits:

▪ It allows developers to add new sources without changing existing Satellite entities;
▪ It removes the need to alter the incoming data to fit existing structures;
▪ It enables the Data Vault model to keep the history of the source system and consequently keep the system auditable;
▪ It maximizes load parallelism (MPP) because there is no competition for the Satellite; the data can be inserted into the Satellite immediately, without taking the arrival of data from other systems into account;
▪ It allows the integration of real-time data without the need to integrate it with raw data loaded from batches; there are no dependencies across multiple systems that could force the system to have both types of data ready at the same time.

Another good practice is to split the data by rate of change, storing the attributes that are frequently changing in one Satellite and the attributes that change less frequently into another. This procedure is useful to separate these kinds of attributes in order not to consume unnecessary storage in new records (Linstedt & Olschimke, 2015).
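Continuing the hypothetical example from the previous sections, a minimal T-SQL sketch of a Satellite on the H_Customer Hub could look as follows; the descriptive columns are illustrative only.

-- Hypothetical Satellite with descriptive attributes of the Customer Hub.
CREATE TABLE S_Customer_Details (
    Customer_HashKey CHAR(32)     NOT NULL REFERENCES H_Customer (Customer_HashKey),
    Load_Date        DATETIME2    NOT NULL,        -- part of the primary key
    Load_End_Date    DATETIME2    NULL,            -- the only updatable attribute
    Record_Source    VARCHAR(50)  NOT NULL,        -- metadata for auditability
    Customer_Name    VARCHAR(100) NULL,            -- descriptive attributes follow
    Customer_City    VARCHAR(50)  NULL,
    PRIMARY KEY (Customer_HashKey, Load_Date)      -- parent key + load date
);

A second Satellite on the same Hub (for example, for frequently changing attributes, or for a second source system) would simply be another table with the same key structure.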

2.5.2. Data Vault Architecture

The primary purpose of an EDW is to collect and make available useful information for the business core of the organization, in which the data is aggregated, summarized, and consolidated according to the context of the business (Linstedt & Olschimke, 2015).

Data Vault modeling corresponds to a hybrid approach, where its architecture is composed of three layers (Linstedt & Olschimke, 2015):

▪ The staging area, which stores all the raw data provided by the source systems;
▪ The EDW, which is modeled with the Data Vault 2.0 Ensemble approach;
▪ The information delivery layer, which corresponds to the information mart.

The EDW layer incorporates three vaults: the Metrics Vault, which contains runtime information; the Business Vault, which applies business rules to the raw data to transform it into relevant information (information mart); and the Operational Vault, which stores data fed from operational systems into the DW (Hultgren, 2012).

The Raw Data Vault incorporates all unfiltered data from the operational data sources that are afterward loaded into Hubs, Links, and Satellites entities, through the business keys (Hultgren, 2012).


The Business Vault corresponds to an extension of the Raw Vault, applying business rules, denormalizations, calculations, and other query-assistance functions to facilitate user access and reporting (Hultgren, 2012).

Figure 11 - Data Vault Architecture, (Linstedt & Olschimke, 2015)

Figure 11 presents the Data Vault 2.0 architecture, which contains the three layers mentioned previously. The flow integrates the source data provided by the operational systems into the staging area layer. Through ETL tools, the EDW layer is loaded, and the business requirements and rules of the organization are applied in the Business Vault. The consolidated data in the Business Vault serves as a source for the information mart layer, allowing end-users to explore the data and perform reporting (Hultgren, 2012).

The Data Vault 2.0 architecture cannot be directly accessed by end-users (Kambayashi, Winiwarter, & Arikana, 2002), so the information mart layer is provided.

The information mart provides subject-oriented information, which can be represented as a star schema or as multidimensional OLAP cubes, to make reporting easy (Linstedt & Olschimke, 2015).

Other examples of information marts are the Error Mart, a central location for errors in the DW, and the Meta Mart, also a central location, but for metadata. These two types of information marts are not rebuilt from the Raw Data Vault or any operational data source. End-users, such as administrators, use these marts to analyze errors in the ETL processes when loading the DW and to inspect the metadata collected for the DW, in order to trace the data sources (Linstedt & Olschimke, 2015).


2.5.3. Benefits, disadvantages, and limitations of the Data Vault Approach

When we talk about building a DW, it is necessary to look at the factors that make the final product efficient: integration, optimization, historization, and agility with respect to the requirements of the specific organizational business.

The architecture is agile and flexible; therefore, this approach can be used when the organization wants to integrate diverse sources and complex data. It allows the management and storage of historical data, data traceability, and adaptation to business changes where more requirements can be added, and it provides a central enterprise data view (Inmon & Linstedt, 2015).

Compared with other dimensional models used to build an EDW, the Data Vault 2.0 approach has advantages at three levels: business, project, and architecture (Hultgren, 2018).

At the business level, this methodology is oriented to the business, being an accessible model for business analysts to understand that quickly adapts to new business needs. The data is traceable, allowing the storage of fully auditable data, and the EDW can assimilate data in real time. The model quickly adapts to changes in requirements or business rules, or even to newly added sources, without high implementation costs, and finally provides a DW with a lower total cost of ownership (TCO) (Hultgren, 2018).

Data Vault projects mostly follow an agile methodology, in order to lower risks and allow multiple deliverables. An essential technical benefit is that the model can be built incrementally as new business needs appear, without compromising the architecture, while supporting terabytes and petabytes of data (Linstedt, 2015).

The architecture is characterized by parallel loading and by the possible expansion of the model, reaching large sizes and remaining applicable to emerging architectures. Data Vault is a data-based architecture, typically derived from transactional systems (Linstedt, 2010).

However, it does not guarantee the quality of the data or the type of information obtained, because most of the time the data from the sources need data quality transformations.

In the Data Vault 2.0 approach, the use and quality of business information are not considered, nor can the approach discern whether the information is correct or wrong, because that depends on the business perspective (Linstedt, 2010). The quality of the information is not supported by this data architecture and has to be managed by a quality management team.

When we build a Data Vault, we face some problems, both at the business and technical level.

The Data Vault presents some limitations at the business level because the data cannot be accessed by end-users; it is only used by data experts capable of using data mining and analytics tools. The data is not cleaned, and its quality is not confirmed; an initial work effort is mandatory, and often, at the beginning of the implementation, business analysts consider that they do not need data backups (Hultgren, 2018).

The business churn is more important than the elegance of the model (Hultgren, 2012). In the initial analysis, a focus on the business processes and on the data from the sources is required, making business analysts responsible for the reliability of the analyses. Agreement between the elements of the several business areas is necessary before implementing the Data Vault architecture (Linstedt & Olschimke, 2015).

Regarding the technical problems of this methodology, one of the most common is that the Data Vault model requires too many joins for querying, which degrades query performance and makes ad-hoc access cumbersome (Linstedt & Olschimke, 2015). The methodology is designed for MPP computing rather than SMP computing, which is not a clustered architecture (Kambayashi et al., 2002).
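To make the join overhead concrete, consider a hedged sketch over the hypothetical Hub, Link, and Satellite tables from section 2.5.1: retrieving a single descriptive attribute from each side of one relationship already requires five joins, plus a current-record filter per Satellite (S_Order_Details is assumed to exist analogously to S_Customer_Details).

-- Ad-hoc question: current customer names together with a current order attribute.
SELECT cd.Customer_Name,
       od.Order_Total
FROM H_Customer hc
JOIN S_Customer_Details cd ON cd.Customer_HashKey = hc.Customer_HashKey
                          AND cd.Load_End_Date IS NULL      -- current record only
JOIN L_Customer_Order   lo ON lo.Customer_HashKey = hc.Customer_HashKey
JOIN H_Order            ho ON ho.Order_HashKey    = lo.Order_HashKey
JOIN S_Order_Details    od ON od.Order_HashKey    = ho.Order_HashKey
                          AND od.Load_End_Date IS NULL;     -- current record only

In a star schema, the equivalent question is typically a single join between a fact table and a dimension, which is why information marts are placed on top of the Data Vault.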

2.5.3.1. Data Vault and MPP Computing

MPP (Massively Parallel Processing), as the name indicates, allows processes to run in parallel on all associated machines. MPP systems are based on a divide-and-conquer architecture: the work is split across the machines, and the results are gathered afterward (Linstedt, 2010).

This architecture allows the parallelism of the activities to be performed, splitting the work into several parts using parallel processing. When a single result is expected, a coordinating process waits for all the activities involved to finish and integrates their partial results into one output (Daeng Bani, Suharjito, Diana, & Girsang, 2018).

The idea of using MPP computing in the Data Vault is to apply vertical partitioning, dividing the data set by a specific column through the Hub components, which enables distributing the data over physical hardware without too much effort (Linstedt, 2010).

A good practice to improve computation, memory, and query performance in a DW whose volume constantly changes and grows is the compression of columns, pages, or rows, allowing the reduction of repeated values and increased coverage of the index (Daeng Bani et al., 2018).
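As an illustration of these options on SQL Server (the table name reuses the hypothetical Satellite from section 2.5.1.3):

-- Page-level compression reduces the storage taken by repeated values.
ALTER TABLE S_Customer_Details REBUILD
    WITH (DATA_COMPRESSION = PAGE);

-- A nonclustered columnstore index compresses by column and speeds up
-- analytical scans over the Satellite.
CREATE NONCLUSTERED COLUMNSTORE INDEX ix_S_Customer_Details_cs
    ON S_Customer_Details (Customer_HashKey, Load_Date, Customer_City);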

Figure 12 - Parallel load in the Data Vault 2.0 approach, (Hultgren, 2012)


2.5.4. Comparison with other dimensional models

When we compare dimensional models with the Data Vault model, we realize that all of them have advantages and drawbacks.

However, an organizational DW needs to satisfy the business needs and respond to the requirements and to changes in them.

The differences between the various dimensional models, regarding the objects, relationships, attributes, and identifiers are presented below.

Model | Object | Relationship | Attribute | Identifier
Normalized model | Relation | Foreign Key | Domain | Primary Key
Data Vault model | Hub | Link | Satellite | Business/Primary Key
Anchor model | Anchor/Knot | Tie | Attribute | Primary Key
Dimensional model | Dimension | Fact | Attribute | Business/Primary Key

Table 4 – Different concepts in different Data Models (Bojičić et al., 2016)

2.5.4.1. Comparing Data Vault approach with Inmon and Kimball’s approaches

The following table describes the main differences, by category, between Inmon's and Kimball's approaches and the Data Vault 2.0 Ensemble approach.

Category: Storage utilization
▪ Inmon: Data is stored in a 3NF structure that closely resembles the source system, with the addition of timestamp keys that allow capturing changes over time. The structure is not optimized for direct queries and requires dimensional data marts for reporting. These data marts need to be persisted (physical) for all but the smallest data volumes for performance reasons, which essentially doubles storage requirements compared to the Kimball approach.
▪ Data Vault: Data is stored in a 3NF hub-link-satellite structure with several time-stamped copies of data that capture changes over time. The structure is complicated to query directly and requires dimensional data marts for reporting. These data marts need to be persisted (physical) for all but the smallest data volumes for performance reasons, which essentially doubles storage requirements compared to the Kimball approach.
▪ Kimball: Data is stored in the final consumption format, optimized for reporting. Changes are tracked via slowly changing dimensions.

Category: ETL complexity
▪ Inmon: The model requires two ETL processes: loading from source systems and building reporting data marts.
▪ Data Vault: The model requires two ETL processes: loading from source systems and building reporting data marts.
▪ Kimball: The model requires a single ETL process that loads the final data model used for reporting.

Category: ETL scalability and loading performance
▪ Inmon: The architecture supports the loading of multiple data sources in parallel. Within each source, table loads need to be sequenced based on dependencies in the source 3NF model.
▪ Data Vault: The architecture supports the loading of multiple data sources in parallel. Within each source, the model structure supports better parallelism (hubs loaded first, then links, then satellites).
▪ Kimball: The architecture supports the loading of multiple subject areas in parallel (dimensions first, then facts). The two-tier architecture requires a single ETL layer, which delivers faster processing.

Category: Auditing, traceability, and compliance
▪ Inmon: Historical information is captured by inserting new records each time source data changes. Change tracking is easy, as the data warehouse structure closely resembles the source.
▪ Data Vault: Historical changes are captured by inserting new links and satellites. It provides the most detailed and auditable capture of changes. Change tracking is complex due to the highly normalized structure.
▪ Kimball: Uses the concept of slowly changing dimensions to track historical changes. Requires the business to identify the attributes requiring tracking prior to loading. Adding new attributes is possible but will not re-create historical changes.

Category: Modeling
▪ Inmon: Modeling is generally not complex, as the structure essentially copies the source system with some denormalizations and the addition of timestamps to each primary key to track changes. Requires additional modeling for dimensional data marts.
▪ Data Vault: Modeling can be complex, as the link/satellite structure can be modeled in multiple different ways. Requires additional modeling for dimensional data marts.
▪ Kimball: Modeling complexity varies by industry/subject area. A well-established methodology exists, with guidelines and modeling frameworks for each subject area type by industry. Since the model itself is exposed for final consumption, it does not need additional modeling to capture source data, as in the Inmon and Data Vault approaches.

Category: Query performance
▪ Inmon: A direct query is very slow due to the 3NF structure of the data. Requires data marts (virtual or materialized) for querying and reporting.
▪ Data Vault: A direct query is very slow due to the highly normalized structure of the data. Every join requires a date component, which makes queries very complex. Requires data marts (virtual or materialized) for querying and reporting.
▪ Kimball: The model is designed for the highest query performance, denormalizing dimensions containing filter attributes and hierarchies and keeping fact data in 3NF for optimal performance and storage. Large fact tables can be easily partitioned (typically on the date key) and indexed to support high performance. The newest column-store in-memory indexing technologies work very well for fact tables.

Table 5 – Comparison of Inmon, Data Vault and Kimball approaches (adapted from Orlov, 2014)


4. CASE STUDY

The case study conducted aims to compare the traditional DW model, Star Schema, with the Data Vault 2.0 model, present the limitations that the Data Vault 2.0 approach faces, and finally, propose improvements in this approach.

So, in order to perform this case study, it was necessary to obtain data from a real organization, with real business processes.

The case study is related to a Hotel Group, in which the data were provided by a Consulting Company, to support the analysis under study and gather results for this specific scenario.

This is an illustrative case, using the data provided, which means that it can be extrapolated to other organizations' projects by adjusting the business core.

The case study is based on one entity-relationship (ER) model containing Hotel chain data, which is divided into two business concepts, presented in the next chapter.

Therefore, briefly, the idea is, through the ER model provided, to capture the business concepts of the organization and first create a Star Schema model and a Data Vault model, in order to compare the differences studied between the two approaches.

Secondly, the limitations of the Data Vault 2.0 approach will be presented with the model performed to support and substantiate the literature review.

Finally, from the limitations found, the Data Vault model will be optimized to meet the business needs better, especially in the optimization of joins, which makes it difficult for end-users to use the model.

The Hotel Group provides one ER model, which is currently the structure where its business processes data is stored. This organization needs to obtain more timely and useful information to improve the management of their bookings and services, their direct marketing campaigns, and enhance the support for the decision-making of their Top Management. Besides, this organization has another business process that is not associated with the current ER model.

The information is disorganized and in several different sources. Hence, it is necessary to create a DW solution to have a centralized, subject-oriented, integrated and non-volatile repository, allowing the harmonization and standardization of information in the Hotel chain, capable of providing detailed and pertinent information to the business in real-time.

A SWOT (Strengths, Weaknesses, Opportunities, and Threats) analysis is presented to understand the technological needs of this specific organization better and to gain insight into the company's strategic diagnosis.


Strengths: online bookings technological platform; current system documentation.
Weaknesses: decentralized and inconsistent information; redundancy of information; implementation costs; adaptation of information; single supplier; large-volume migration of data from an obsolete platform.
Opportunities: adapting to change; ease of access to information; speed of information; greater control; better data processing; possibility of integration with other applications; process integration; possibility of meeting new business requirements; quality of information for decision-making.
Threats: competition; loss of relevant information; inconsistency of data; obsolete system.

Figure 13 - SWOT analysis

It is important to note that this case study is related to a specific organization; however, all companies have similar business needs when it comes to gaining insights and patterns from their customers in real time to improve the decision-making process.

When an organization wants to build a DW, it usually pays attention to the necessary implementation and maintenance costs. Besides, the model needs to support future changes due to the dynamics of the business, adapt to business changes, be auditable, and contain historical data of the main processes and of the customer information. Moreover, the quantity of data received by organizations is nowadays on a large scale, so they need to be able to handle it.

The primary data source is described below, and a data dictionary was produced to provide better knowledge of the data used in the case study.

4.1. DATA SOURCES AND DATA COLLECTION

The central data source provided contains information about the bookings and services of the hotel chain.


Figure 14 - ER model data source from the Hotel Chain

The ER model, presented in Figure 14, corresponds, currently, to the operational system of the Hotel Group, which receives a large amount of data. It is a structure unable to meet business needs.

The relational model is mainly used for the operational and transactional systems of organizations because they process and perform many transactions, which most of the time are executed concurrently. Transactions are related to the business processes of organizations, where data is continuously being inserted and updated. So, the relational model needs to trace the execution of the transactions in the system and constitute a business process flow model (Varge, 2001).

4.1.1. Business Entities

In order to understand what information this data source contains, the main business entities stored in the ER model are described below.

Type of Information | Description
Bookings | This entity stores all information regarding the bookings of each customer of each hotel in the group: the hotel, the room, room type, guest, booking start and end dates, the booking reservation platform, the type of trip made by the customer, booking status, discount on the booking, and the rating. It also stores whether the booking was canceled and the respective reason for cancellation.
Trip types | This entity collects the type of trip made by the guests when they reserve the booking.
Platforms | This entity stores the description of the platform where the guests made the booking reservation. It corresponds to the partners associated with the hotel chain.
Cancellation Detail | This entity collects the description detail of the cancellation of the booking, if it occurs, made by the guest.
Discount | This entity stores the discount on the booking, if it exists.
Ratings | This entity collects the rating of each booking, provided by the customers.
Booking Status | This entity stores the booking status associated with each booking of each hotel.
Hotel | This entity collects the information of each hotel in the group, such as the hotel name, address, post-code, and the derived city.
Country | This entity stores information on each country associated with each hotel in the group, such as the country name and the country currency, in order to calculate the exchange rate to USD.
Room | This entity collects information about the rooms in each hotel.
Room type | This entity stores data related to the hotel room type, such as the room type description, rate, and capacity of the room.
Service | This entity collects information about the services provided by the hotel: the service description, service type (classification), and the associated sector.
Sector | This entity categorizes the different services provided by each hotel into sectors.
Booking Service | This entity links the service types provided by the hotel to the guest booking, collecting service date and rating data.
Hotel Service | This entity associates the services to each hotel in the group, giving the service type and cost, in order to make service profit calculation easier.
Guest | This entity collects guest information, such as guest name, address, city, age, gender, job, education, country, marital status, and the number of children. This information is useful to characterize the customer.
Marital Status | This entity stores the marital status description linked to each customer.
Education | This entity stores the education description linked to each guest.

Table 6 - Business entities of the ER model

4.1.2. Data dictionary of ER model

The qualitative data collected was acquired from a SQL Server database, with the structure shown above.

With the business entities described, it is also essential to understand which attributes characterize them. The attributes corresponding to the business-process context of this organization are presented below. The table presents the stored data and their relevance, to achieve a deeper understanding of the hotel chain business core.

Nomenclature | Description | Data type | Allows Nulls
Id_Booking | Unique identifier to identify the Hotel bookings | Int | No
dtBookingStart | Identifies the booking start date | Date | No
dtBookingEnd | Identifies the booking end date | Date | No
dtCancellation | Identifies the cancellation date of the booking (only for canceled bookings) | Date | Yes
Id_Service | Unique identifier to identify the Hotel services | Int | No
dtServiceDate | Identifies the date that the service was consumed | Date | No
Id_BookingStatus | Identifies the booking status (booked or canceled) | Int | No
Id_Cancellationdetail | Unique identifier to identify the booking cancellation detail | Int | No
dsCancellationDescription | Cancellation detail description | Varchar(50) | No
Id_Country | Unique identifier to identify the Hotel country | Int | No
dsCountryCurrency | Hotel country currency description | Char(10) | Yes
dsCountryName | Identifies the hotel country name | Varchar(50) | Yes
Id_Discount | Unique identifier to identify the discount campaigns | Int | No
dsDiscountDescription | Hotel discount campaigns description | Varchar(50) | No
nrDiscountPercentage | Percentage of applied discount depending on the description | Decimal(18,6) | No
Id_Education | Unique identifier to identify the education level of customers | Int | No
dsEducationDescription | Education level description of customers | Varchar(50) | Yes
Id_Guest | Unique identifier to identify the clients | Int | Yes
dsGuestName | Identifies the client name | Varchar(50) | Yes
dsGuestAddress | Identifies the client address | Varchar(50) | Yes
dsGuestCity | Identifies the city of the client | Varchar(50) | Yes
nrGuestAge | Identifies the client age | Int | Yes
flGuestGender | Identifies the client gender | Bit | Yes
dsGuestOccupation | Identifies the client job | Varchar(50) | Yes
dsGuestCountry | Identifies the country of the client | Varchar(50) | Yes
nrNumberOfChildren | Identifies the number of children that the client has | Int | Yes
Id_MaritalStatus | Identifies the client marital status | Int | No
dsMaritalStatusDescription | Client marital status description | Varchar(50) | Yes
Id_Hotel | Unique identifier to identify the Hotels | Int | No
ds_Hotel_Name | Identifies the Hotel name | Varchar(50) | No
dsHotelAddress | Identifies the Hotel address | Varchar(50) | Yes
dsHotelPostCode | Hotel postal code | Varchar(50) | Yes
dsHotelCity | Identifies the Hotel city | Varchar(50) | Yes
nrServicePrice | Service price | Decimal(18,6) | Yes
nrServiceCost | Service cost | Decimal(18,6) | Yes
dsServiceDescription | Description of the hotel service | Varchar(50) | Yes
flServiceType | Identifies if it is an outdoor or an indoor service | Bit | Yes
Id_Sector | Unique identifier to identify the sector associated with the service | Int | No
dsSectorDescription | Description of the Sector | Varchar(50) | Yes
Id_Platform | Unique identifier for the type of platform on which the client made the booking | Int | No
dsPlatformDescription | Identifies the name of the platform/partners | Varchar(50) | No
Id_Rating | Unique identifier to identify the booking rating | Int | No
nrRatingValue | Identifies the rating value (1 to 5) | Int | No
dsRatingDescription | Description of the rating according to the value | Varchar(50) | Yes
Id_Room | Unique identifier for Hotel rooms | Int | No
nrRoomFloor | Identifies the Hotel rooms floor | Int | No
Id_Room_Type | Unique identifier for the Hotel room type | Int | No
dsRoomTypeDescription | Description of the Hotel room type | Varchar(50) | No
nrRoomTypeRate | Identifies the rate per Hotel room type | Decimal(18,6) | Yes
nrRoomCapacity | Identifies the capacity by room type | Int | Yes
Id_TripType | Unique identifier to identify the type of trip of the client | Int | No
dsTripTypeDescription | Description of the trip type made by the client | Varchar(50) | Yes

Table 7 - Case study attributes, data dictionary of the ER model


4.2. DIFFERENCES BETWEEN A RELATIONAL MODEL AND A DIMENSIONAL MODEL

Before starting to present the dimensional models for this case study, it is important to discuss the differences between a relational model and a dimensional one.

The business support of the organization under study is based on an ER model, which makes it challenging to extract useful information, insights, and relevant patterns about their customers, to obtain KPIs for the business, or even to support decision-making with proper information. All these problems are caused by the implemented model, which is not capable of supporting the business needs.

This is why the entities presented in section 4.1.1 are associated with each other. The presented model constitutes fundamental business entities (strong, independent entities), to which weak (dependent) entities may be associated. This model emphasizes the relationships between the entities, focusing on the execution of the transactions necessary to the business process (Varge, 2001). Besides, it constitutes a normalized model, in 3NF.

However, in the dimensional model, the interest is in presenting the effects of the transactions of the business processes, in order to get relevant measures and useful information for business decisions.

Although dimensional models are instantiated in database management systems, they differ from 3NF models (ER models), especially in the normalization level and in how data redundancies are handled (Kimball & Ross, 2013).

The dimensional models correspond to a technique to represent analytic data, which delivers understandable data to business users with fast query performance (Kimball & Ross, 2013).

Figure 15 below shows the main differences between these two models:

Figure 15 - Main differences between relational and dimensional modeling (Varge, 2001)


4.2.1. Traditional DW model - Star schema

The Star Schema model, created by Kimball in 1998, corresponds to a well-known dimensional model, which resembles a “star” structure (Kimball & Ross, 2013).

This model is composed of a fact table, which represents the center of the star, with smaller dimension tables around the central table (fact) (Moody & Kortink, 2000), as presented in Figure 16.

Figure 16 - Star Schema model, (Moody & Kortink, 2000)

The fact table contains measures and KPIs related to the business core of the organization, and the dimension tables store the data used to aggregate the data in the fact table. The cardinality between the fact table and the dimension tables is one-to-many, and the primary key of the fact table is composed of the primary keys of the dimensions (Kimball & Ross, 2013).

In brief, the fact table collects all the aggregations and business rules, like metrics and KPIs relevant to the business, which can be additive, semi-additive, or non-additive measures. On the other hand, the dimension tables store the descriptive attributes associated with the business objects and processes, answering the following questions: “who, what, where, when, how, and why.”

From the ER model data source presented in section 4.1, a Star Schema model was designed, in order to demonstrate the main differences to the Data Vault 2.0 Ensemble approach.

The star schema created contains two fact tables and eleven dimension tables, split between Figure 17 and Figure 18. It should be noted that four dimensions are shared between the two fact tables.


Figure 17 - Star schema model for Bookings Management

Figure 18 - Star schema model for Services Management


4.2.1.1. Star Schema Fact tables

Two Fact tables were created, one for the Bookings and another for the Services. Table 8 presents the main measures that each Fact table stores, according to the business under study.

In section 4.2.1.3 – Star Schema ETL process, the calculations to achieve these measures are presented, as well as the load of the Fact tables.

Booking Fact measures:
▪ nrStandardRate: the total value of the booking per guest. This value is the price of the booking, without the discount rate, if it exists;
▪ nrDiscountRate: the final price of the booking with the associated discount rate per guest, if it exists. Otherwise, the value is equal to that represented in the nrStandardRate measure;
▪ nrDayDuration: the total number of days of a guest's booking.

Service Fact measures:
▪ nrStandardRate: the total value of the service(s) paid by a guest during their booking;
▪ nrServiceCost: the value of the service costs.

Table 8 – Fact Tables Booking and Service measures

4.2.1.2. Star Schema Dimensions

The Star Schema model presented, as mentioned before, contains eleven dimensions:

▪ Hotel dimension;
▪ Cancellation detail dimension;
▪ Discount dimension;
▪ Booking status dimension;
▪ Trip type dimension;
▪ Date dimension;
▪ Room type dimension;
▪ Guest dimension;
▪ Platform dimension;
▪ Rating dimension;
▪ Service dimension.

Between the Fact Booking table and the Fact Service table, four common dimensions exist, linked to the two Fact tables: the Guest, Hotel, Date, and Rating dimensions.

Next, each dimension is presented, with the respective description of the attributes that compose it.

All the Dimension tables have as primary key a surrogate key, generated incrementally by the DW, which corresponds to a unique identifier for each Dimension entry. The surrogate key is not derived from the natural key, which corresponds to the business key of the Dimension table used to connect with the source. The next chapter presents how the Dimension tables are loaded using these types of keys, which are also needed to load the Fact tables.

Hotel Dimension

The Hotel Dimension contains all the Hotel information, descriptive attributes, related to each Hotel in the Hotel Chain.

Attribute | Description
Sk_id_Hotel | Surrogate key of the Hotel dimension table; corresponds to the primary key.
Nk_Id_Hotel | Natural key provided by the ER model; used to connect the two fact tables.
dsHotelName | Descriptive attribute, which describes the name of each Hotel of the Group.
dsHotelAddress | Descriptive attribute, which indicates the address of each Hotel in the Group.
dsHotelPostCode | Descriptive attribute, which indicates the post-code of each Hotel in the Group.
dsHotelCity | Descriptive attribute, which indicates the city of each Hotel in the Group.
dsCountryName | Descriptive attribute, which indicates the country of each Hotel in the Group.
dsCountryCurrency | Descriptive attribute, which indicates the currency of each Hotel in the Group.
nrCapacity | Descriptive attribute, which describes the total capacity of rooms for each Hotel in the Group.

Table 9 - Hotel dimension attributes


Cancellation Dimension

The Cancellation Dimension is related to the reasons that led customers to cancel their bookings.

Attribute | Description
Sk_Id_CancellationDetail | Surrogate key of the Cancellation dimension table; corresponds to the primary key.
Nk_Id_CancellationDetail | Natural key provided by the ER model.
dsCancellationDetailDescription | Descriptive attribute, which describes the reasons for the cancellation of a customer booking.

Table 10 - Cancellation dimension attributes

Discount Dimension

The Discount Dimension is related to the discounts associated with the bookings. Guests can use discounts directly or through the partners of the Hotel Group, which can offer discounted prices.

Attribute | Description
Sk_Id_Discount | Surrogate key of the Discount dimension table; corresponds to the primary key.
Nk_Id_Discount | Natural key provided by the ER model.
nrDiscountPercentage | Descriptive attribute, which describes the percentage of the discount associated with each booking.
dsDiscountDescription | Descriptive attribute, which indicates the description of the discount.

Table 11 - Discount Dimension attributes

Booking status Dimension

The Booking Status Dimension represents the current status of the associated booking, indicating whether the booking is booked or canceled.

Attribute | Description
Sk_Id_BookingStatus | Surrogate key of the Booking Status dimension table; corresponds to the primary key.
Nk_Id_BookingStatus | Natural key provided by the ER model.
dsBookingStatusDescription | Descriptive attribute, which describes the status of the booking.

Table 12 - Booking Status dimension attributes


Trip type Dimension

The Trip Type Dimension describes the type of trip taken by the guests, which can have the following descriptions: fun, holiday, music, finalist trip, bachelor party, business trip, family trip, festival, competition, alone, others, not stated, and conference.

Attribute | Description
Sk_Id_TripType | Surrogate key of the Trip Type dimension table; corresponds to the primary key.
Nk_Id_TripType | Natural key provided by the ER model.
dsTripTypeDescription | Descriptive attribute, which describes the type of trip made by guests in each booking.

Table 13 -Trip type dimension attributes

Date Dimension

The Date dimension represents the temporality of the bookings, in which the data is partitioned by day, month, and year.

Attribute | Description
Id_Date | Natural key of the Date dimension table; corresponds to the primary key.
dtFullDateAlternateKey | Represents the full date, separated by '-'.
nrDay | Contains the day extracted from the date.
nrMonth | Contains the month number extracted from the date.
dsMonthName | Contains the month name extracted from the date.
nrYear | Contains the year extracted from the date.

Table 14 - Date dimension attributes

Room type Dimension

The Room type Dimension presents the types of rooms associated with each Hotel in the Group. The room types are composed of five categories: Standard Single Bed, Standard Twin Bed, Deluxe Double Bed, Suite Room, and Penthouse.


Attribute | Description
Sk_Id_RoomType | Surrogate key of the Room Type dimension table; corresponds to the primary key.
Nk_Id_RoomType | Natural key provided by the ER model.
dsRoomTypeDescription | Descriptive attribute, which describes the type of the room.
nrRoomTypeRate | Descriptive attribute, which indicates the rate per room type.
nrRoomTypeCapacity | Descriptive attribute, which indicates the total capacity per room type.
nrRoomQuantity | Descriptive attribute, which indicates the number of guests per room type.

Table 15 - Room type dimension attributes

Guest Dimension

The Guest Dimension stores all the characteristics of the customers of the Hotel Group, allowing each Hotel to know its customers and categorize them by age, country, gender, number of children, and other attributes.

Attribute | Description
Sk_Id_Guest | Surrogate key of the Guest dimension table; corresponds to the primary key.
Nk_Id_Guest | Natural key provided by the ER model.
dsGuestName | Descriptive attribute, which contains the name of the customer of the Hotel Group.
dsGuestAddress | Descriptive attribute, which contains the address of the customer.
dsGuestCountry | Descriptive attribute, which contains the country of the customer.
dsGuestCity | Descriptive attribute, which contains the city of the customer.
nrGuestAge | Descriptive attribute, which contains the age of the customer.
dsGuestGender | Descriptive attribute, which contains the gender of the customer.
dsGuestEducation | Descriptive attribute, which contains the education description of the customer.
dsGuestOccupation | Descriptive attribute, which contains the job of the customer.
dsGuestMaritalStatus | Descriptive attribute, which contains the marital status of the customer.
nrGuestChildrenNumber | Descriptive attribute, which contains the number of children of the customer.

Table 16 - Guest dimension attributes

Platform Dimension

The Platform Dimension refers to the ways customers reserve their bookings. A booking can be made through platforms such as Booking, Agoda, Trivago, Momondo, E-Dreams, GetaRoom, and Prestigia, or on the Hotel website and through physical reservation.

Attribute | Description
Sk_Id_Platform | Surrogate key of the Platform dimension table; corresponds to the primary key.
Nk_Id_Platform | Natural key provided by the ER model.
dsPlatformDescription | Descriptive attribute, which describes the platform on which the customer reserves the Hotel booking.

Table 17 - Platform dimension attributes

Rating Dimension

The Rating Dimension presents the ratings of the bookings and of the services, provided by the guests of the Hotel. The ratings are categorized into five assessments: Awful, Bad, Average, Good, and Excellent, represented by the numbers 1 to 5, respectively.

Attribute | Description
Sk_Id_Rating | Surrogate key of the Rating dimension table; corresponds to the primary key.
Nk_Id_Rating | Natural key provided by the ER model.
nrRatingValue | Descriptive attribute, which indicates the rating value of the bookings or services.
dsRatingsDescription | Descriptive attribute, which describes the rating associated with the bookings or services.

Table 18 - Rating dimension attributes


Service Dimension

The Service Dimension presents all the descriptive information regarding the services of the Hotel. The Hotel services include the following: Pool, Heated Pool, Free Wifi, Wifi, Room Service, Restaurants, 24-hour service, Bar, Garden, Golf Course, Spa, Jacuzzi, Thermal Bath, Concierge, Kids Space, Parking Space, Valet, Airport Service, Cleaning Service, Gym, Air Condition, Cable TV, Reduced Mobility, Bike Rental, Babysitting Service, Electric Car Charging, ATM, Sport Courts, and Casino.

Attribute | Description
Sk_Id_Service | Surrogate key of the Service dimension table; corresponds to the primary key.
Nk_Id_Service | Natural key provided by the ER model.
dsServiceDescription | Descriptive attribute, which describes the name of the service.
dsServiceType | Descriptive attribute, which describes the service type associated with the Hotel services.
dsSectorDescription | Descriptive attribute, which describes the sector to which each service corresponds.

Table 19 - Service dimension attributes

4.2.1.3. Star Schema ETL process

The ETL process is a crucial tool for data optimization and integration. It provides valuable benefits for the construction of the DW: it ensures significant data quality and can solve highly complex problems by using metadata, which can be generated and maintained automatically, preventing incorrect-information problems at the end of the project. It performs well when extracting, transforming, and loading large volumes of data, facilitates connectivity to multiple data sources, and provides stability and security.

As mentioned in chapter 3.1.1.1 – Case Study Methods and Tools, the traditional Star Schema DW model was designed using Microsoft tools: SQL Server Management Studio was used to design the DW model, and SQL Server Data Tools was used to implement the ETL processes, which include the extraction, transformation, and loading of the data into the final Star Schema DW.

A new project in SSIS was created, which is composed of two packages, presented in Figure 19:

• LoadDimensionTables.dtsx
• LoadFactTables.dtsx

Figure 19 - Load Dimension and Fact tables dtsx


Load Fact Tables

The Fact tables package is composed of a Sequence container, which contains an Execute SQL Task component and a Data Flow component for each Fact table, as presented in Figure 20.

The Execute SQL Task component truncates each Fact table before it is loaded by the corresponding Data Flow component, which loads the Fact table (Booking and Service Facts).

Figure 20 - Load Fact Tables package in SSIS

An OLE DB Source component was used to load the Fact Booking table, getting the data from the tbBooking source provided by the ER model. Afterward, the Lookup component was used to look up the surrogate keys in the Dimension tables, which are inherited as foreign keys in the Fact table. A Derived Column component was used to create the relevant measures for this Fact, presented in Figure 21.

Figure 21 - Fact Booking measures, through derived column component
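The exact expressions are those of the Derived Column component in Figure 21; purely as a hedged T-SQL equivalent (assuming tbBooking carries the room and discount foreign keys and that nrDiscountPercentage is stored as a fraction), the measures could be derived roughly as follows:

SELECT b.Id_Booking,
       -- nrDayDuration: total days of the guest's booking
       DATEDIFF(DAY, b.dtBookingStart, b.dtBookingEnd) AS nrDayDuration,
       -- nrStandardRate: booking price without discount (room rate x nights)
       rt.nrRoomTypeRate * DATEDIFF(DAY, b.dtBookingStart, b.dtBookingEnd)
           AS nrStandardRate,
       -- nrDiscountRate: final price after the discount, when one exists
       rt.nrRoomTypeRate * DATEDIFF(DAY, b.dtBookingStart, b.dtBookingEnd)
           * (1 - ISNULL(d.nrDiscountPercentage, 0)) AS nrDiscountRate
FROM tbBooking b
JOIN tbRoom r          ON r.Id_Room = b.Id_Room
JOIN tbRoomType rt     ON rt.Id_Room_Type = r.Id_Room_Type
LEFT JOIN tbDiscount d ON d.Id_Discount = b.Id_Discount;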

Figure 22 represents the ETL process to load the Fact Booking, with the components described previously.

48

Figure 22 - ETL process to Load Fact Booking

The source of the Fact Service is an SQL query that joins tbBooking, tbBookingService, and tbHotelService from the ER model, as shown in Figure 23.

Figure 23 - OLE DB Source, using a SQL command to extract services data from the source
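The actual query is the one shown in Figure 23; a hedged sketch of such a source join (the linking key columns of tbBookingService and tbHotelService are assumed) could be:

SELECT bs.Id_Booking,
       bs.Id_Service,
       bs.dtServiceDate,
       hs.nrServicePrice AS nrStandardRate, -- value of the service paid by the guest
       hs.nrServiceCost  AS nrServiceCost   -- cost of the service for the hotel
FROM tbBookingService bs
JOIN tbBooking b       ON b.Id_Booking = bs.Id_Booking
JOIN tbHotelService hs ON hs.Id_Service = bs.Id_Service
                      AND hs.Id_Hotel   = b.Id_Hotel;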

As in the Fact Booking, the Lookup component is used in the Fact Service to search the surrogate keys of all the dimensions associated with this Fact table, so they can be loaded into it.

Figure 24 shows the ETL process used to load the Fact Service table.


Figure 24 - ETL process to Load Fact Service

Load Dimension Tables

The Dimension Tables package consists of eleven Data Flow components inside a Sequence container, each one loading a Dimension table.

In the process of loading the Dimension tables, the SCD (Slowly Changing Dimension) technique was used, which allows storing and managing both the historical and the current data in the DW over time. In the ETL processes of the Star Schema model, it is mandatory to use this type of dimension in order to track the data and update the dimensions with the new records received.

The SCD type used to load the Dimension tables was Type 2, which consists of creating an additional record and retaining the full history of the data. If the value of an attribute in the Dimension table changes, a new record is added to the Dimension with a new surrogate key, the old record is kept as history, and the new row becomes the current record.

Therefore, the Slowly Changing Dimension component was used in SSIS to select the business key of each dimension and the attributes that can change over time, in order to record the new values.
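The SSIS component generates this logic automatically; as a hedged T-SQL sketch of the same Type 2 behaviour for the Hotel dimension (tracking only dsHotelCity, with hypothetical RowStartDate, RowEndDate, and IsCurrent housekeeping columns and an identity surrogate key):

-- 1) Expire the current row when the tracked attribute changed in the source.
UPDATE d
SET    d.RowEndDate = GETDATE(),
       d.IsCurrent  = 0
FROM   DimHotel d
JOIN   tbHotel s ON s.Id_Hotel = d.Nk_Id_Hotel
WHERE  d.IsCurrent = 1
  AND  d.dsHotelCity <> s.dsHotelCity;

-- 2) Insert a new current row for new keys and for keys just expired.
INSERT INTO DimHotel (Nk_Id_Hotel, dsHotelName, dsHotelCity, RowStartDate, IsCurrent)
SELECT s.Id_Hotel, s.ds_Hotel_Name, s.dsHotelCity, GETDATE(), 1
FROM   tbHotel s
WHERE  NOT EXISTS (SELECT 1
                   FROM DimHotel d
                   WHERE d.Nk_Id_Hotel = s.Id_Hotel
                     AND d.IsCurrent = 1);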


Figure 25 - Load Dimension Tables package in SSIS

The ETL processes used to load each Dimension table are presented in the Annexes, in Figures 51 to 61, respectively.

4.2.2. Traditional Data Vault 2.0 Ensemble Modeling

Through the data source provided, a DW was implemented using the Data Vault 2.0 Ensemble approach. Nevertheless, before presenting the model, the business concepts and their attributes are identified, to better describe how this model differs from the Star Schema.

Note that this first model presented does not correspond to the proposed model.

4.2.2.1. Identify the Business Objects and define the Hubs entities

One of the critical steps before starting a Data Vault 2.0 model is to identify the main business objects of the business organization under study. By identifying the principal business objects, one quickly discovers the business keys.

So, according to the ER model provided by the Hotel Group, it is simple to identify the business objects:

▪ Bookings;
▪ Services;
▪ Hotel;
▪ Guests;
▪ Rooms.

These five business objects are essential to the business core of this organization and correspond to the primary data elements that must be stored in the DW.

Through these indicated business objects, the Hub entities can be defined as:


Hub Entity | Business key | Source in ER model
Booking Hub | Id_Booking | tbBooking
Service Hub | Id_Service | tbService
Hotel Hub | Id_Hotel | tbHotel
Guest Hub | Id_Guest | tbGuest
Room Hub | Id_Room | tbRoom

Table 20 - Identification of Hubs and business keys
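A hedged T-SQL sketch of a load for the Booking Hub is given below, inserting only business keys not yet present; the column names follow the Hub pattern used later in the model (Figure 26), and the hash computation mirrors section 2.5.1.1.

INSERT INTO H_Booking (H_Booking_SID, Bk_Id_Booking, Load_Date, Record_Source)
SELECT CONVERT(CHAR(32),
           HASHBYTES('MD5', CAST(src.Id_Booking AS VARCHAR(20))), 2), -- hash key
       src.Id_Booking,   -- business key from the source
       GETDATE(),        -- load date metadata
       'tbBooking'       -- record source metadata
FROM tbBooking src
WHERE NOT EXISTS (SELECT 1
                  FROM H_Booking h
                  WHERE h.Bk_Id_Booking = src.Id_Booking);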

4.2.2.2. Define Satellites

With the Hubs recognized, it is possible to define the main descriptive attributes, or relevant context, of the identified Hubs, in order to determine the Satellite entities associated with each Hub.

Booking Satellites

The Booking Hub is an entity that contains multiple Satellites, seven in total, split according to the source table and the contextual attributes provided by the source.

Besides this, another good practice was taken into account: splitting the attributes by rate of change, separating the contextual attributes that change rarely from those that change frequently, in order to minimize the consumption of unnecessary storage in new records.

Hub Entity | Satellite entity | Descriptive attributes
Booking | S_Booking_Dates | Booking_Start_Date, Booking_End_Date
Booking | S_Booking_Status | Booking_Status_Description
Booking | S_Booking_Rating | Rating_Value, Rating_Description
Booking | S_Booking_Cancellation | Cancelation_Description, Cancelation_Date
Booking | S_Booking_Discount | Discount_Description, Discount_Percentage
Booking | S_Booking_Platforms | Platform_Description
Booking | S_Booking_Trip_Type | Trip_Type_Description

Table 21 - Booking Satellites

Services Satellites

The Service Hub contains two Satellites: one aggregates all the descriptive data related to the service information, and the other holds the ratings of the services, provided by the guests.

Once again, the splitting of the attributes into two Satellites derives from the rate of change of the contextual attributes and the different sources provided by the ER model.

Hub Entity | Satellite entity | Descriptive attributes
Service Hub | S_Service_Characteristics | Service_Name, Service_Type_Description, Service_Price, Service_Cost, Service_Date, Sector_Description
Service Hub | S_Service_Rating | Rating_Value, Rating_Description

Table 22 - Service Satellites

Hotel Hub

The Hotel Hub contains two associated Satellites: one related to the attributes that characterize this entity, and a second that refers to the country exchange rates associated with each Hotel.

The country exchange rate Satellite provides the exchange rate associated with each Hotel, allowing the calculation of the booking value according to the respective country, which is why this information is stored in a separate Satellite.

Hub Entity | Satellite entity | Descriptive attributes
Hotel Hub | S_Hotel_Characteristics | Hotel_Name, Hotel_Address, Hotel_PostCode, Hotel_City
Hotel Hub | S_Hotel_Country | Hotel_Country_Currency, Country_Name, Hotel_Country_Exchange_Rate_To_USD

Table 23 - Hotel Satellites

Guest Hub

The Guest Hub entity contains only one Satellite, providing information related to the characteristics of the guests, which can afterward be associated with each Hotel for marketing strategies. This Satellite stores descriptive data about the guests who booked and consumed services in the Hotel Group.

Hub Entity | Satellite entity | Descriptive attributes
Guest Hub | S_Guest_Characteristics | Guest_Name, Guest_Address, Guest_City, Guest_Age, Guest_Gender, Guest_Occupation, Guest_Country, Number_Of_Children, Marital_Status_Description, Guest_Education_Description

Table 24 - Guest Satellites

Room Hub

Finally, the Room Hub entity aggregates one Satellite, which stores contextual attributes related to the rooms' information.

Hub Entity    Satellite entity            Satellite attributes
Room Hub      S_Room_Characteristics      Room_Type_Description, Room_Floor, Room_Type_Rate,
                                          Room_Capacity

Table 25 - Room Satellite

4.2.2.3. Connect Hubs with Links

The Link entities connect the Hubs to each other and are the key to the flexibility and scalability of the Data Vault 2.0 approach. They allow entities and relationships to be changed or added in the model over time, without the need to re-engineer the whole DW model.

The goal of the Link entities is to capture and collect the relationships between the business objects at the lowest grain; each Link must reference at least two parent tables (foreign keys), as illustrated in the sketch below.
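
A minimal T-SQL sketch of one such Link, assuming the Hub definitions sketched earlier and integer surrogate keys, could be:

-- Sketch of a Link connecting the Hotel and Booking Hubs:
-- two foreign keys to the parent Hubs, plus load metadata.
CREATE TABLE L_H_Hotel_L_H_Booking (
    L_Hotel_Booking_SID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    H_Hotel_SID   INT NOT NULL REFERENCES H_Hotel (H_Hotel_SID),     -- parent Hub 1
    H_Booking_SID INT NOT NULL REFERENCES H_Booking (H_Booking_SID), -- parent Hub 2
    Load_Date     DATETIME2    NOT NULL,
    Record_Source VARCHAR(100) NOT NULL
);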

Hub entities               Link table entity
H_Hotel and H_Service      L_H_Hotel_L_H_Service
H_Hotel and H_Room         L_H_Hotel_L_H_Room
H_Hotel and H_Booking      L_H_Hotel_L_H_Booking
H_Guest and H_Service      L_H_Guest_L_H_Service
H_Guest and H_Booking      L_H_Guest_L_H_Booking
H_Booking and H_Service    L_H_Booking_L_H_Service
H_Booking and H_Room       L_H_Booking_L_H_Room

Table 26 - Link entities

4.2.2.4. Traditional Data Vault model

The traditional Data Vault 2.0 model is based on the Hub, Link, and Satellite elements described previously. It is a hybrid data modeling methodology that tracks data and stores history, using a set of normalized, linked tables to support the business areas of an organization. The approach has a flexible, scalable, and consistent design that quickly adapts to business changes.

It provides benefits through its adaptability to changes in the business environment, supports big data sets, and simplifies the EDW design; being an incrementally built model, it allows new data sources to be added without impacting the whole design.

However, this model differs from the Star Schema presented in section 4.2.1. The Star Schema uses Fact and Dimension tables to model the DW; the Data Vault, in contrast, uses Hub entities to store the business keys and the respective metadata that tracks the origin of the data (where and when the data was provided). Link entities connect Hubs, or even establish relationships with other Links; these entities handle business changes in data granularity and minimize the impact of adding new Hubs to the architecture.

The Satellite entities store all the contextual and temporal attributes related to the identified business keys, as well as metadata. Like Dimension tables using SCD Type II, as presented in the Star Schema model, they keep historical data, recording the changes of attributes and updating the record every time an attribute changes.

In the Satellites, another type of metadata was also added. This metadata comprises the following attributes: IsCurrent, a flag indicating whether that specific record is the current one (Flag = 1), and two date attributes: ValidFrom, which contains the effective date/time, and ValidTo, which contains the expiry date/time. This metadata improves the speed, reusability, and parallelism of identifying the current records.
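
A minimal T-SQL sketch of a Satellite carrying this metadata, taking the Booking status Satellite as an example and assuming the naming used in Figure 26, could be:

-- Sketch of a Satellite with SCD Type II style metadata.
CREATE TABLE S_Booking_Status (
    H_Booking_SID              INT          NOT NULL REFERENCES H_Booking (H_Booking_SID),
    Load_Date                  DATETIME2    NOT NULL,
    Booking_Status_Description VARCHAR(100) NULL,      -- contextual attribute
    Record_Source              VARCHAR(100) NOT NULL,
    Is_Current                 BIT          NOT NULL,  -- 1 = current version of the record
    Valid_From                 DATETIME2    NOT NULL,  -- effective date/time
    Valid_To                   DATETIME2    NULL,      -- expiry date/time (NULL while current)
    PRIMARY KEY (H_Booking_SID, Load_Date)             -- Hub key combined with load date
);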

Figure 26 presents the typical Data Vault 2.0 Ensemble model, based on the case under study.


Figure 26 - Traditional Data Vault 2.0 Model

After the implementation of the Data Vault 2.0 approach, the complexity of the model, with its many interrelated tables, is apparent.


This DW model adapts quickly to requirement changes without much effort and facilitates the addition of new sources, or even new Hubs, Links, or Satellites, without compromising the whole architecture. Nevertheless, the gaps of this approach are still easy to notice in the model represented in Figure 26.

This approach faces limitations regarding data quality, because the data does not undergo any transformation: it is loaded into the Raw Vault layer directly from the sources of the operational system. The quality of the information provided is therefore not ensured, which is a problem because, most of the time, the data sources need data quality transformations.

However, these transformations can be made in the Business Vault layer, after the data is loaded into the Raw Vault. In the Business Vault, it is possible to remove noise and handle missing values by cleaning the data, and to consolidate and prepare it for later loading into the Data Marts. This step is not mandatory; it depends on the business rules of the organization itself.

In the Star Schema model presented in Figures 17 and 18, by contrast, the data is cleaned before being loaded into the DW, and is transformed, aggregated, and consolidated to create the measures in the Fact tables.

Another limitation is that the model cannot be accessed by end-users. The model presented confirms that it is too complex for key users to produce reporting or OLAP cubes from it to support the organization's decision-making; only expert data scientists are capable of applying data mining and data analytics tools to it.

Finally, the major limitation is the number of joins required to combine all the tables with each other, which compromises the performance of the model when large volumes of data have to be merged.

4.2.2.5. Data Vault ELT process

In contrast with Kimball's approach, which uses the ETL process to load the Star Schema EDW, the Data Vault 2.0 approach uses the ELT process.

This means that, to load a Data Vault EDW, the data is extracted from the operational systems presented in section 4.1 – Data Sources and Data Collection, and loaded directly into the Data Vault DW without any transformation.

The data is stored in the Raw Vault layer, i.e., the part of the Data Vault that holds the raw data.

SQL Server Data Tools was also used to execute the ELT process, for which three SSIS packages were created:

▪ LoadHubsTables.dtsx
▪ LoadLinksTables.dtsx
▪ LoadSatellitesTables.dtsx


Figure 27 - Load Hubs, Links and Satellites tables dtsx

Load Hub entities

A Sequence Container was used to load the Hub tables, which contains a Data Flow component for each Hub table, as shown in Figure 28.

Figure 28 - Load Hubs entities package in SSIS

All the Hubs are loaded in the same way: an OLE DB Source component extracts the raw data, a Derived Column component adds the metadata information (the Load Date and Record Source attributes), a Lookup component searches for the business key provided by the source and, lastly, an OLE DB Destination loads the data into the Data Vault EDW.

Figure 29 represents an example of the ELT process of loading a Hub entity.

Two variables were created to record the metadata attributes in the Hub entity, the Load Date and Record Source attributes respectively, which track all the changes made over time, as shown in Figure 30.


Figure 29 - Example of load a Hub table in SSIS

Figure 30 - Adding metadata attributes in the Hub entity
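
A set-based T-SQL equivalent of this SSIS flow would insert only the business keys not yet present, stamped with the two metadata variables. The sketch below assumes the H_Booking definition given earlier; it approximates, rather than reproduces, the actual package logic.

-- Sketch of a Hub load: new business keys only, with metadata.
DECLARE @LoadDate     DATETIME2    = SYSDATETIME();
DECLARE @RecordSource VARCHAR(100) = 'tbBooking';

INSERT INTO H_Booking (Bk_ID_Booking, Load_Date, Record_Source)
SELECT s.Id_Booking, @LoadDate, @RecordSource
FROM tbBooking AS s
WHERE NOT EXISTS (                         -- the Lookup step: skip keys already loaded
    SELECT 1 FROM H_Booking AS h
    WHERE h.Bk_ID_Booking = s.Id_Booking
);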


Load Link entities

After all Hub tables are loaded, the Link entities follow, as they need the surrogate keys created in the Hub tables to be stored in the Link tables.

Similar to the Hubs package, the Links package uses a Sequence Container containing multiple Data Flow components, each responsible for loading one of the designed Link tables.

Figure 31 - Load Link tables package in SSIS

The Link entities also store metadata, the Load Date and Record Source attributes, added with a Derived Column component, as in the Hub entities.

Figure 32 - Adding metadata to the Link table

The Lookup component is used to fetch the surrogate keys created in the Hub tables so that they can be stored in the Link. These entities help when business requirements are added or changed, minimizing the impact of re-designing the whole model.

Figure 33 presents an example of loading a Link table.


Figure 33 - Example of load a Link table in SSIS
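
In T-SQL terms, the Link load looks up the surrogate keys of both parent Hubs and inserts only new key pairs. The sketch below assumes that tbBooking carries an Id_Hotel foreign key; the exact source join is an assumption for illustration.

-- Sketch of a Link load: resolve both Hub surrogate keys, insert new pairs.
INSERT INTO L_H_Hotel_L_H_Booking (H_Hotel_SID, H_Booking_SID, Load_Date, Record_Source)
SELECT hh.H_Hotel_SID, hb.H_Booking_SID, SYSDATETIME(), 'tbBooking'
FROM tbBooking AS s
JOIN H_Hotel   AS hh ON hh.Bk_ID_Hotel   = s.Id_Hotel    -- Lookup on the Hotel Hub
JOIN H_Booking AS hb ON hb.Bk_ID_Booking = s.Id_Booking  -- Lookup on the Booking Hub
WHERE NOT EXISTS (
    SELECT 1 FROM L_H_Hotel_L_H_Booking AS l
    WHERE l.H_Hotel_SID = hh.H_Hotel_SID
      AND l.H_Booking_SID = hb.H_Booking_SID
);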

Load Satellite entities

The loading of the Satellite tables in the SSIS package is similar to that of the Hub and Link tables (Figure 34). The Satellites need the surrogate keys of the Hubs, which, combined with the Load Date attribute, form the primary key of these tables.

Figure 34 - Load Satellite tables package in SSIS


The Satellite tables store all the contextual attributes related to the identified Hubs, along with metadata used to track the rate of change of the attributes and their sources. Four metadata attributes are used, as shown in Figure 35.

Figure 35 - Adding metadata to the Satellite tables

The Is_Current flag allows faster identification of the current records, combined with the Valid_From and Valid_To attributes, which hold the effective date and the expiry date, respectively. As mentioned before, the Satellite entities follow the same process as the SCD Type II Dimension tables of Star Schema models. Hence, apart from the Lookup for the Hub surrogate key, a second Lookup component is used, which compares the attributes already inserted in the Satellite table and checks whether they have been updated.

If the Lookup cannot find a match for a record, this means that an attribute has changed. In these cases, a new current record is inserted, with Valid_From set to the current Load Date. The historical data is kept: the previous record's Valid_To is set to the current Load_Date and its Is_Current flag is updated to "0". The metadata attributes are thus updated to track the changes in these attributes; a set-based sketch of this logic is given below.
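
A T-SQL sketch of this Type II logic, assuming the S_Booking_Status Satellite above and a Status_Description column in the source (an assumed name), could be:

-- Sketch of the Type II update: expire the changed record, insert the new version.
DECLARE @LoadDate DATETIME2 = SYSDATETIME();

-- 1. Close the current record whose attribute no longer matches the source.
UPDATE s
SET s.Is_Current = 0,
    s.Valid_To   = @LoadDate
FROM S_Booking_Status AS s
JOIN H_Booking AS h   ON h.H_Booking_SID = s.H_Booking_SID
JOIN tbBooking AS src ON src.Id_Booking  = h.Bk_ID_Booking
WHERE s.Is_Current = 1
  AND s.Booking_Status_Description <> src.Status_Description; -- source column name assumed

-- 2. Insert the new current version of the record.
INSERT INTO S_Booking_Status (H_Booking_SID, Load_Date, Booking_Status_Description,
                              Record_Source, Is_Current, Valid_From, Valid_To)
SELECT h.H_Booking_SID, @LoadDate, src.Status_Description, 'tbBooking', 1, @LoadDate, NULL
FROM tbBooking AS src
JOIN H_Booking AS h ON h.Bk_ID_Booking = src.Id_Booking
WHERE NOT EXISTS (SELECT 1 FROM S_Booking_Status AS cur
                  WHERE cur.H_Booking_SID = h.H_Booking_SID
                    AND cur.Is_Current = 1);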

Figure 36 presents an example of an update of records in a Satellite table, using an OLE DB command component.


Figure 36 - Update new records in SSIS

Figure 37 presents an example of how a Satellite is loaded.

Figure 37 - Example of loading a Satellite table in SSIS


4.2.3. The proposal for the optimized Data Vault 2.0 model

The traditional Data Vault 2.0 model, represented in Figure 26, shows the complexity of this design. Although it is easy to add more Hub, Link, and Satellite tables without compromising the whole DW structure, adapting quickly to business changes, it is clear that end-users are not able to use this model. This impediment is due to the number of joins that must be performed to query all the contextual attributes needed to support an organization's decision-making.

In the Star Schema, the measures and KPIs are all stored in the Fact table, which makes the information simple to access because it is already aggregated; in the case of Data Vault 2.0, it is not. Taking the measures created for this case study in the traditional DW (Star Schema) approach, shown in Figures 17 and 18, we tried to recreate them using the Data Vault approach designed and presented in Figure 26.

If we want to aggregate all the information related to the Bookings in order to calculate the booking price, the discount rate, and the duration of the booking using the Data Vault model, we need a query that joins four Hub entities through three Link entities and joins the Hubs to their respective Satellites.

This query against the Data Vault 2.0 model requires sixteen (16) joins, which has a significant impact on the performance of the EDW when processing large quantities of data. Besides, only experts in SQL tools could write such complex queries. Figure 38 shows the complexity of the query, which needs sixteen joins to gather all the data necessary to create the measures described.
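
A partial sketch of the shape of such a query (only a few of the sixteen joins shown, with table names as in Figure 26) illustrates the pattern: every pair of Hubs is bridged through a Link, and every contextual attribute needs a further join to its Satellite.

-- Partial sketch: Booking-to-Service path only; the Room and Hotel
-- Hubs, their Links, and their Satellites add the remaining joins.
SELECT d.Booking_Start_Date, d.Booking_End_Date,
       disc.Discount_Percentage, sc.Service_Price, sc.Service_Cost
FROM H_Booking AS hb
JOIN L_H_Booking_L_H_Service   AS lbs  ON lbs.H_Booking_SID  = hb.H_Booking_SID
JOIN H_Service                 AS hs   ON hs.H_Service_SID   = lbs.H_Service_SID
JOIN S_Booking_Dates           AS d    ON d.H_Booking_SID    = hb.H_Booking_SID
                                      AND d.Is_Current       = 1
JOIN S_Booking_Discount        AS disc ON disc.H_Booking_SID = hb.H_Booking_SID
                                      AND disc.Is_Current    = 1
JOIN S_Service_Characteristics AS sc   ON sc.H_Service_SID   = hs.H_Service_SID
                                      AND sc.Is_Current      = 1;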


Figure 38 - Case 1 - query result in the Data Vault 2.0 model

Besides the example above, another case arises if the Hotel Group wants to know its customers and collect information in order to apply marketing strategies and campaigns.

In this case, it is necessary to aggregate three Hub entities (the Booking, Hotel, and Guest Hubs) through two Links and to join each Hub to its respective Satellite(s).

Ten (10) joins are needed to query this data, which once again reveals a high number of joins and, consequently, poor performance.


Figure 39 - Case 2 - Query result in Data Vault 2.0 model

By analyzing these cases, the limitations of this approach become very noticeable in terms of aggregating the data (joins), accessing the data, and creating reporting from it. It requires users with knowledge of SQL or other database languages to transform this data into useful information and make it available to all the stakeholders.

So, in order to respond to the Research Questions and in line with the goal of this Dissertation, which consists in proposing a way of optimizing the Data Vault 2.0 approach, a method is presented for reducing the joins required to aggregate the data and to access and build reporting on it.

The optimization consists of creating Bridge tables in the Business Vault layer, which make it possible to apply business rules and aggregate the data needed, taking into account the business core and the type of information that is relevant to present through reporting tools.

The Bridge tables store the surrogate keys of the Hubs we want to join and of the Links that connect them, together with the contextual attributes from the Satellites that we want to use.

These structures are useful for transforming the data into relevant information that answers the decision-making questions.

In addition, hash keys were used in the Hub, Link, and Satellite tables, instead of the typical surrogate keys, since these keys can be generated consistently by any system that knows the enterprise-wide unique business key. Besides, this supports reporting by preparing the data for BI consumption, makes the data easier to access, and reduces the joins in the queries; a sketch of how such a key can be derived is shown below.
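
A T-SQL sketch of deriving a hash key from the business key, assuming SQL Server's HASHBYTES function and MD5 as the hash (a common Data Vault 2.0 choice, here an assumption), could be:

-- Sketch: derive a deterministic hash key from the business key.
-- Any system that knows the business key computes the same value,
-- so no central surrogate-key generator is required.
SELECT CONVERT(CHAR(32),
               HASHBYTES('MD5',
                         UPPER(LTRIM(RTRIM(CAST(Id_Booking AS VARCHAR(20)))))),
               2) AS H_Booking_HK
FROM tbBooking;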

Figure 40 represents the optimized model, obtained by adding this type of table to the Data Vault model.



Figure 40 - Proposal for an optimized Data Vault 2.0 model

The use of Bridge tables brings benefits by eliminating outer joins from the ad-hoc queries, providing scalability in views, enhancing the partitioning of the data, and improving the performance of the EDW.

For the cases described previously, creating the Bridge tables in the Business Vault layer produces a clear reduction in the number of joins needed.

Going back to Case 1, which consists of aggregating all the booking data to create the same measures as presented before in the Star Schema, the Bridge_Booking_Sales table makes it possible to obtain the following result:

Bridge Booking Sales

Attribute                              Description
Bridge_Bookings_SID                    Surrogate key and primary key of the table
Bridge_Load_Date                       Metadata attribute to track the data loaded
H_Booking_SID                          Lookup of the surrogate key of the Booking Hub
H_Service_SID                          Lookup of the surrogate key of the Service Hub
H_Room_SID                             Lookup of the surrogate key of the Room Hub
H_Hotel_SID                            Lookup of the surrogate key of the Hotel Hub
L_Booking_Service_SID                  Lookup of the surrogate key of the L_H_Booking_L_H_Service Link
L_Booking_Room_SID                     Lookup of the surrogate key of the L_H_Booking_L_H_Room Link
L_Hotel_Booking_SID                    Lookup of the surrogate key of the L_H_Hotel_L_H_Booking Link
Cancelation_Description                Satellite attribute provided by S_Booking_Cancelation
Cancelation_Date                       Satellite attribute provided by S_Booking_Cancelation
Booking_Start_Date                     Satellite attribute provided by S_Booking_Dates
Booking_End_Date                       Satellite attribute provided by S_Booking_Dates
Discount_Percentage                    Satellite attribute provided by S_Booking_Discount
Platform_Description                   Satellite attribute provided by S_Booking_Platforms
Rating_Value                           Satellite attribute provided by S_Booking_Rating
Booking_Status_Description             Satellite attribute provided by S_Booking_Status
Service_Cost                           Satellite attribute provided by S_Service_Characteristics
Service_Date                           Satellite attribute provided by S_Service_Characteristics
Service_Price                          Satellite attribute provided by S_Service_Characteristics
Service_Type_Description               Satellite attribute provided by S_Service_Characteristics
Room_Type_Description                  Satellite attribute provided by S_Room_Type
Room_Type_Rate                         Satellite attribute provided by S_Room_Type
Hotel_Country_Exchange_Rate_To_USD     Satellite attribute provided by S_Hotel_Exchange_Rate

Table 27 - Bridge Booking Sales table


Figure 41 - Bridge Booking Sales table

The following SQL statement shows the stored procedure to load the Bridge Booking Sales table:


Figure 42 - SQL Stored Procedure to load the Bridge Booking Sales table
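
In outline, the procedure joins the Hubs through their Links, filters the Satellites to their current records, and inserts the combined result into the Bridge table. A simplified sketch, with the full column list of Table 27 abbreviated and a full-rebuild strategy assumed, could be:

-- Simplified sketch of a Bridge-load procedure (column list abbreviated).
CREATE PROCEDURE LoadBridgeBookingSales
AS
BEGIN
    TRUNCATE TABLE Bridge_Booking_Sales;   -- full rebuild on each load (an assumption)

    INSERT INTO Bridge_Booking_Sales
        (Bridge_Load_Date, H_Booking_SID, H_Service_SID,
         Booking_Start_Date, Booking_End_Date, Discount_Percentage, Service_Price)
    SELECT SYSDATETIME(), hb.H_Booking_SID, hs.H_Service_SID,
           d.Booking_Start_Date, d.Booking_End_Date,
           disc.Discount_Percentage, sc.Service_Price
    FROM H_Booking AS hb
    JOIN L_H_Booking_L_H_Service   AS lbs  ON lbs.H_Booking_SID  = hb.H_Booking_SID
    JOIN H_Service                 AS hs   ON hs.H_Service_SID   = lbs.H_Service_SID
    JOIN S_Booking_Dates           AS d    ON d.H_Booking_SID    = hb.H_Booking_SID
                                          AND d.Is_Current       = 1
    JOIN S_Booking_Discount        AS disc ON disc.H_Booking_SID = hb.H_Booking_SID
                                          AND disc.Is_Current    = 1
    JOIN S_Service_Characteristics AS sc   ON sc.H_Service_SID   = hs.H_Service_SID
                                          AND sc.Is_Current      = 1;
END;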

So, when we now query the data, it is only necessary to join with the respective Satellite each attribute came from, as represented in Figure 43.

Figure 43 - Query result using the Bridge Booking Sales table in the Data Vault optimized model


By looking at Figure 43, we observe that, with the Bridge table, only ten joins to the Satellite tables are needed, instead of the sixteen joins seen in Figure 38. A sketch of such a reduced query is given below.
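
The sketch assumes the Bridge definition of Table 27: the Bridge already carries the Hub and Link keys and most contextual attributes, so only the remaining Satellites need to be joined.

-- Sketch: query through the Bridge, joining only the Satellites
-- whose attributes are not already materialized in the Bridge.
SELECT b.Booking_Start_Date, b.Booking_End_Date,
       b.Discount_Percentage, b.Service_Price,
       r.Rating_Description            -- not stored in the Bridge (only Rating_Value is)
FROM Bridge_Booking_Sales AS b
JOIN S_Booking_Rating AS r
  ON r.H_Booking_SID = b.H_Booking_SID
 AND r.Is_Current    = 1;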

The same occurs in Case 2, presented before. If another Bridge table is created to store all the data we intend to aggregate about the booking guests of each Hotel, we obtain the following result:

Bridge Booking Guests

Attribute                          Description
Bridge_Booking_Guest_SID           Surrogate key and primary key of the table
Bridge_Booking_Guest_Load_Date     Metadata attribute to track the data loaded
H_Booking_SID                      Lookup of the surrogate key of the Booking Hub
H_Guest_SID                        Lookup of the surrogate key of the Guest Hub
H_Hotel_SID                        Lookup of the surrogate key of the Hotel Hub
L_Booking_Guest_SID                Lookup of the surrogate key of the L_H_Guest_L_H_Booking Link
L_Booking_Hotel_SID                Lookup of the surrogate key of the L_H_Hotel_L_H_Booking Link
Discount_Description               Satellite attribute provided by S_Booking_Discount
Trip_Type_Description              Satellite attribute provided by S_Booking_Trip_Type
Rating_Description                 Satellite attribute provided by S_Booking_Rating
Hotel_Name                         Satellite attribute provided by S_Hotel_Characteristics
Hotel_Address                      Satellite attribute provided by S_Hotel_Characteristics
Hotel_City                         Satellite attribute provided by S_Hotel_Characteristics
Guest_Country                      Satellite attribute provided by S_Guest_Characteristics
Guest_City                         Satellite attribute provided by S_Guest_Characteristics
Guest_Age                          Satellite attribute provided by S_Guest_Characteristics
Number_Of_Children                 Satellite attribute provided by S_Guest_Characteristics

Table 28 - Bridge Booking Guests table


Figure 44 - Bridge Booking Guest table

The following SQL statement shows the stored procedure to load the Bridge Booking Guest table:

Figure 45 - SQL Stored Procedure to load the Bridge Booking Guest table


With this Bridge table created, the number of joins needed to query the data is reduced to only six, instead of ten, as shown in the query of Figure 46.

Figure 46 - Query result using the Bridge Booking Guest table in Data Vault optimized model

Typically, after developing the Data Vault 2.0 EDW, the information mart layer is built on top of it; according to the Data Vault Architecture, this layer is responsible for delivering information and presenting reporting on the data.

As end-users cannot access the data directly in the Data Vault approach, this information mart layer enables reporting through subject-oriented Data Marts built with the Star Schema model, OLAP cubes, or even Error Marts.

However, by building the Bridge tables presented above, this is no longer necessary, because it is possible to create views across these tables, allowing the production of measures and KPIs, which enables reporting through BI tools.

With the two Bridge tables previously represented, the Bridge Booking Sales and the Bridge Booking Guests, it is simple to build views on these tables, allowing end-users to create measures useful for supporting the decision-making process.

Views can provide benefits in Data Warehousing, making it possible to avoid the proliferation of redundant data downstream in the architecture and bringing performance and agility in accessing the data. Nevertheless, they also have disadvantages, such as performance issues with large data sets and traceability and auditability problems, since the result set based on business rules is not persisted while those rules change over time (Hultgren, 2012).

Another option, instead of views, is data virtualization, which allows virtual tables to be created for query results, or even virtual Data Marts.

For this case study, using the Bridge tables built, the following views can be created (Figure 47) in order to compute measures similar to the metrics of the Star Schema model presented in section 4.2.1.


Figure 47 - Creation of views using Bridge tables

Figure 48 - SQL query to create the Booking Sales view by using the Bridge Booking Sales table

Figure 49 - SQL query to create the Booking information view by using Bridge Booking Sales table


Figure 50 - SQL query to create the Guest Information view by using Bridge Booking Guest
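
A sketch of one such view over the Bridge table, with illustrative measure definitions that are assumptions rather than the exact formulas used in the figures, could be:

-- Sketch of a Booking Sales view deriving Star-Schema-like measures
-- directly from the Bridge table (measure formulas are illustrative).
CREATE VIEW vw_Booking_Sales
AS
SELECT H_Booking_SID,
       DATEDIFF(DAY, Booking_Start_Date, Booking_End_Date) AS Booking_Duration_Days,
       Service_Price * Hotel_Country_Exchange_Rate_To_USD  AS Service_Price_USD,
       Service_Price * (1 - Discount_Percentage / 100.0)   AS Discounted_Price
FROM Bridge_Booking_Sales;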

Through the optimized Data Vault 2.0 model presented in Figure 40, whose Bridge tables act as higher-level factless fact tables storing the Hub and Link hash keys, it is possible to minimize three limitations of this approach: the performance of the model increases, the joins needed to combine Hub, Link, and Satellite tables are reduced and, more importantly, end-users gain direct access, enabling the production of BI reporting.


5. RESULTS AND DISCUSSION

The goal of the case study was accomplished through a classic BI project that demonstrates the differences between the traditional DW model, the Star Schema proposed by Kimball, and the emerging approach created by Dan Linstedt, the Data Vault model, and that proposes improvements to the Data Vault approach in order to decrease the joins required when querying data, minimizing the complexity of the model and allowing end-users to apply reporting techniques to it.

With the design of the Star Schema model presented in section 4.2.1, it is possible to perceive that, although this model is frequently used in DW projects, it does not provide the auditability, traceability, scalability, and flexibility that the Data Vault model presents (section 4.2.3). Besides, the Star Schema model does not adapt easily to business and rule changes: designing and implementing new requirements entails high maintenance and development costs and a high impact on the whole model, which requires re-engineering.

However, compared with the Data Vault Model, the Star schema has the advantage of being capable of producing and aggregating measures and KPIs in Fact tables, which simplifies the access and query of data, without much effort from end-users. It is also a model where it is possible to apply BI tools to support the decision-making process for organizations.

The Data Vault 2.0 approach is known for its incremental building, which easily adapts to business changes by adding new Hub, Link, and Satellite tables every time a new requirement appears, and is therefore associated with an agile methodology. Data Vault 2.0 provides high levels of traceability and auditability of the data by using metadata attributes that store all the modifications made. Nonetheless, as demonstrated in section 4.2.3 and presented in the Literature review chapter, section 2.5.3, the model design carries limitations. According to the Data Vault architecture, these limitations are overcome on top of this model, in the information delivery layer, by creating subject-oriented Data Marts, represented through Star Schemas or OLAP cubes, to allow data reporting.

The main limitations of Dan Linstedt's approach stem from the complexity of the model: the exhaustive joins needed to perform ad-hoc queries on the data, the difficulty of end-user access (only experts with database know-how can access the data), and the inability to create reports directly from the data.

So, based on these facts, a proposal for improvements was created in the Data Vault 2.0 process model, in order to overcome these gaps.

The strategy consists of creating auxiliary tables, the Bridge tables, to store and aggregate Satellite attributes from multiple Hubs, which in turn may contain multiple Satellites, and to collect the Hub and Link hash keys so that the data can be related more easily.

These tables help to reduce the joins, as presented in the queries represented by Figures 43 and 46, because they aggregate attributes from multiple Satellites from different Hubs. Bridge tables also contribute to minimizing the complexity impact on the model, providing end-users direct access to the model, by creating a view, without the need of building Data Marts.


The following table summarizes the results achieved for each characteristic, to provide more insight into the outcome of the case study:

Load Data
Star Schema: This approach uses ETL tools, which extract data from the data sources and transform the data in order to consolidate, aggregate, and clean it, creating relevant measures for the Fact tables. After the transformation, the data is loaded into a staging area to ensure a correct load, and then into the final DW architecture.
Traditional Data Vault: This approach uses the ELT process, extracting the data from the operational system's sources and loading it directly into the Raw Vault layer, without transforming the data. Afterwards, according to the business rules and requirements, the data can be transformed and loaded into the Business Vault layer; however, this is not mandatory, and the Data Vault model design does not require any data transformation.
Optimized Data Vault: This proposal, like the traditional Data Vault approach, also uses the ELT process. The data is extracted from the operational sources and loaded directly into the Raw Vault, and the data transformations are then implemented in the Business Vault, so that end-users are able to access the data for reporting instead of creating Data Marts.

Traceability
Star Schema: The Star Schema uses SCDs (Slowly Changing Dimensions), which make the ETL process difficult. The SCD is used to track the historical changes of the data; the business attributes to be tracked must be identified before the data load into the final DW, and the mechanism demands maintenance and reconfiguration as the business changes over time. Note that if the DW holds huge amounts of data, the SCD implementation is complicated and performance decreases.
Traditional Data Vault: In the Data Vault approach, the traceability of the data is provided by the Link and Satellite entities, which store metadata attributes that collect and track the existing changes in detail. The Satellite tables work similarly to the SCD Type II of the Star Schema model, but with better tracking performance, storing all the changes in the attributes.
Optimized Data Vault: Similar to the traditional Data Vault model, this proposal also tracks the data through Hubs, Links, and Satellites, using appropriate metadata attributes stored in each table.

Auditability
Star Schema: It is possible to add metadata for the auditability of the data sources and attribute loads. However, it is impossible to know when the data is used in Data Marts.
Traditional Data Vault: The auditability of the model is very high, responding to attribute changes at all times and tracking all information from the extracted data source to where it was used.
Optimized Data Vault: Similar to the traditional Data Vault, this proposal responds to the changes that occur in the attributes, tracking all information from the extracted data source to where it was used.

Scalability and Flexibility
Star Schema: Business requirement or business rule changes imply the re-design of the DW architecture and the modification of the ETL packages, which brings many changes to the model, due to the aggregation of and relationships between all tables (Dimension and Fact tables). The costs are very high.
Traditional Data Vault: The Data Vault approach adapts easily to business and requirements changes by adding new Hubs, Links, and Satellites. The DW architecture does not need to be modified, and the cost of adding new requirements is low. This approach is very scalable due to the advantage of incremental building.
Optimized Data Vault: Same as the traditional Data Vault approach: it adapts easily to business and requirements changes by adding new Hubs, Links, and Satellites, the architecture does not need to be modified, the cost of adding new requirements is low, and it is very scalable thanks to incremental building.

Joins
Star Schema: Composed of star joins, which allow the end-user to analyze and access the data and facilitate the load of the data into the Data Marts, providing more compact and summarized data.
Traditional Data Vault: This approach requires many joins to associate the tables and obtain information, which decreases the performance of the model, due to the complexity of the model when gathering useful data.
Optimized Data Vault: This proposed model brings benefits by reducing the joins required to combine the Hub, Link, and Satellite tables, through a new auxiliary table, the Bridge table. The Bridge table helps the performance of the model by storing the Hub and Link hash keys and the Satellite attributes relevant to the business core. These tables can also support data aggregations and transformations, which minimize the complexity of ad-hoc queries.

Access to the model
Star Schema: The Star Schema allows end-users to use the model to perform reporting or build OLAP cubes. The data is well prepared and easy to access.
Traditional Data Vault: The Data Vault model is not designed for end-user access, being complicated to get information from the tables. Besides, the complexity of the model is high, so only experts can retrieve information from it, due to the high complexity of the joins.
Optimized Data Vault: This proposal allows end-users to access the model and retrieve information through the Bridge tables, which store the attributes relevant to computing metrics and KPIs for the business organization. Besides, it is possible to create views over the Bridge tables to transform the data into useful information, and to create reporting with BI tools, without the need to create Data Marts.

Model design
Star Schema: The design of the model is based on Dimension and Fact tables, in which the Fact tables aggregate all the surrogate keys of the Dimension tables. The data is thus all aggregated, and the Fact tables contain the measures and KPIs pertinent to the business organization. The model is easy to design and more comprehensible for end-users.
Traditional Data Vault: The design of the Data Vault model is easy to implement by creating Hub, Link, and Satellite tables, bringing advantages when the business rules change. It is an incremental building model, making it simple to add more Hubs, Links, or Satellites without compromising or re-designing the whole model, which makes it very flexible, scalable, and adaptable to change. However, it is a complex model, difficult for end-users to use.
Optimized Data Vault: The proposed model has the same benefits as the traditional Data Vault, adapting to business changes and being incrementally built, so it is simple to add more Hub, Link, and Satellite tables. Contrary to the traditional Data Vault, however, it presents a further advantage: the Bridge tables, which store the Hub and Link hash keys together with relevant attributes gathered from multiple Satellites, decreasing the join complexity when querying the data, allowing end-users to access the data more efficiently, and enabling reporting to support decision-making.

Keys
Star Schema: The Star Schema model uses surrogate keys to join the Dimension tables with the Fact tables.
Traditional Data Vault: The traditional Data Vault, like the Star Schema, uses surrogate keys to link the Hub, Link, and Satellite tables.
Optimized Data Vault: The proposed model uses hash keys, which bring benefits compared with surrogate keys: better data load performance, consistency, and auditability, and support for MPP architectures.

Table 29 - Results of the case study

Briefly, through the results obtained from the case study, which illustrates a typical DW project in organizations, it was possible to demonstrate the main differences between the traditional organizational DW, the Star Schema, and the Data Vault 2.0 model. The case study presents a method of optimizing the latter model so as to mitigate the limitations studied, obtain better performance, and show that this approach can easily be adopted by other organizations.


6. CONCLUSIONS

To achieve the goals of this Dissertation and answer the Research Questions under study, a proposal for an improved Data Vault 2.0 model was demonstrated through a case study, minimizing the limitations present in traditional Data Vault 2.0 models. The case study is a typical BI project, so it is believed that its results can easily be extrapolated to other DW projects.

Designing a conceptual data model is essential for organizations because it represents their business world. It is an iterative process that becomes more detailed as entities and relationships are added, but it is made difficult by business dynamics and the complexity of organizational business cores.

Besides, with Big Data, the design process becomes more difficult to organize and represent due to the volume of the data, the uncertain veracity of the data, the variety of the sources, and the fast velocity that data arrives and changes.

For that reason, a DW capable of adjusting to business requirement changes, and at the same time flexible and scalable enough to allow new entities and relationships to be added in response to business and data growth, is crucial. The traditional modeling approaches, mostly the Star Schema, are not designed to absorb business changes, because they assume constant source systems and a project scope restricted to specific requirements.

Although Star Schema is well-known in the delivery of DW projects, with the case study conducted, it was demonstrated that Data Vault provides benefits compared with Star Schema.

The Data Vault 2.0 modeling approach is considered one of the most effective, being oriented for business requirements, integrating multiple heterogeneous sources, especially unstructured data (semi-structured, multi-structured), providing agility and traceability of data, rapidly absorbing business changes and managing and storing historical data.

In addition, adding new sources does not carry high implementation costs, providing a lower total cost of DW ownership than the Star Schema model, and the approach follows an agile methodology, with lower risks and multiple deliverables. It is an incrementally built model, where new requirements can easily be added without compromising the architecture, while supporting terabytes and petabytes of data. In contrast, the Star Schema requires re-engineering the whole model to accommodate an additional requirement, and the costs and effort of maintaining and changing the model are very high.

Nevertheless, the Data Vault, as presented in the research studies and demonstrated through the case study, has limitations, the main objective of this Dissertation being to represent a way of overcoming these limitations.

The high complexity of joining the data between the Hub, Link, and Satellite tables makes the model very complex and its performance low; end-users cannot access the data to retrieve relevant information, produce reports, or run ad-hoc queries to support the decision-making process. Handling the join complexity when merging Big Data demands high computational capacity and efficient algorithms; without them, the process becomes impracticable.


So, to transcend the Data Vault constraints, the optimized Data Vault 2.0 model represented in section 4.2.3 was proposed. It uses Bridge tables, created with the purpose of aggregating Satellite attributes from multiple Hubs, including Hubs that contain many Satellites; these tables also store the Hub and Link hash keys.

Adding Bridge tables to the Data Vault 2.0 model brings benefits and reduces the limitations of this approach. The proposed model minimizes the complexity of the model, reducing the joins needed between the Hub, Link, and Satellite tables, because the relevant contextual attributes are stored in this type of table.

Furthermore, end-users can have direct access to the data with the model suggested, through views or data virtualization, and apply BI tools to implement reports useful for the decision-making of organizations, without requiring the creation of subject-oriented Data Marts using Star Schemas or OLAP cubes. End-users can access and transform all the useful information by using views or data virtualization and can build reporting through them.

However, when adding Bridge tables, it must be noted that some joint effort between analysts and developers is needed to determine what information and contextual attributes the organization wants to aggregate into these tables, just as the grain and content of the Fact tables must be agreed upon in the Star Schema model.

With the case study performed, it is possible to conclude that the Data Vault 2.0 model can bring benefits to organizations and DW projects. The proposed model adds value to the existing approach by addressing its existing gaps.

To conclude, it is pertinent to mention that all the Research Questions were answered and to stress, once more, that the presented case study is based on a classic DW project; the proposed model can therefore also add value in other DW projects and business organizations.

7. LIMITATIONS

The contribution of this Dissertation will lead to an improvement in the existing Data Vault 2.0 Ensemble approach, bringing benefits for organizations, BI developers, experts, and DW model design.

Despite proposing improvements in the Data Vault 2.0 model, it is relevant to consider that there may be limitations when implementing these. One of them is the fact that the design and implementation of the Data Vault 2.0 model are based on an agile methodology, so the BI project to be developed should proceed following the methods of this methodology. Otherwise, it can be challenging to manage and deliver results.

Another identified limitation concerns the tools used for large-scale data processing, which can become inadequate depending on the volumes of data the organization handles.

Besides, in some cases data quality cannot be ensured due to the ELT process: the data is loaded directly from the sources into the Data Vault, so in the majority of cases it presents inconsistencies, is not clean, and can contain noise.


The size of the organization and its business core can also bring challenges in building the optimized Data Vault 2.0 model: given the complexity of the model, some effort is required to analyze all the tables in order to aggregate the pertinent data into Bridge tables, decrease the joins, improve the model's performance, and prepare the data for BI reporting tools.

8. RECOMMENDATIONS FOR FUTURE WORKS

As a future recommendation, it would be interesting to perform market research in Portuguese companies, in order to understand whether the Data Vault 2.0 approach introduced by Dan Linstedt is well known and whether, when organizations consider building an EDW for their business models, they think about using it or instinctively choose the traditional model, the Star Schema.

Although identified as a limitation, it is also a recommendation for future work to use appropriate tools that can handle Big Data, considering market trends such as MapReduce and Hadoop, which can process, manage, and store these massive amounts of data with better performance.

Another recommendation for future research is to produce reporting on the designed optimized model, using BI reporting tools, in order to demonstrate that Bridge tables can indeed support information delivery without the need to build Data Marts with Star Schema models or OLAP cubes.

Finally, research can be conducted in order to understand the contribution and the impact of these auxiliary tables (Bridge tables) on the current Data Vault 2.0 Architecture.


BIBLIOGRAPHY

Almeida, F. (2017). Concepts and Fundaments of Data Warehousing and OLAP. ResearchGate, (September), 39. Retrieved from https://www.researchgate.net/publication/319852408_Concepts_and_Fundaments_of_Data_Warehousing_and_OLAP

Anderson, D. (2015). What is "The Data Vault" and why do we need it? Retrieved from Talend website: https://www.talend.com/blog/2015/03/27/what-is-the-data-vault-and-why-do-we-need-it/

Ballard, C., Herreman, D., Schau, D., Bell, R., Kim, E., & Valencic, A. (1998). Data modelling techniques for Data Warehousing. Redbooks.Ibm.Com.

BI-Survey.com. (n.d.). The most common Business Intelligence Problems - 2,500 Users Responses Analyzed. Retrieved from https://bi-survey.com/business-intelligence-problems

Bojičić, I., Marjanović, Z., Turajlić, N., Petrović, M., Vučković, M., & Jovanović, V. (2016). A comparative analysis of data warehouse data models. 2016 6th International Conference on Computers Communications and Control, ICCCC 2016, (Icccc), 151–159. https://doi.org/10.1109/ICCCC.2016.7496754

Bolder-Boos, M. (2015). Der Krieg und die Liebe - Untersuchungen zur römischen Venus. Klio, 97(1), 81–134. https://doi.org/10.1515/klio-2015-0004

Bouaziz, S., Nablil, A., & Gargouri, F. (2017). From Traditional Data Warehouse To Real Time Data Warehouse. 1(February), 0–10. https://doi.org/10.1007/978-3-319-53480-0

Brown, T. (2019). 3 Biggest Challenges for Data Integration. Retrieved October 28, 2019, from ITChronicles website: https://www.itchronicles.com/big-data/3-biggest-challenges-for-data-integration/

Calvanese, D., De Giacomo, G., Lenzerini, M., Nardi, D., & Rosati, R. (2002). Data Integration in Data Warehousing. International Journal of Cooperative Information Systems, 10(03), 237–271. https://doi.org/10.1142/s0218843001000345

Chaudhuri, S., & Dayal, U. (1998). An Overview of Data Warehousing and OLAP Technology. (March 1997).

Chugh, R., & Grandhi, S. (2013). Why Business Intelligence ? Significance of Business Intelligence. (November 2015). https://doi.org/10.4018/ijeei.2013040101

Cox, N. (2014). Data Vault Design: Hub Tables. Retrieved October 29, 2019, from Optimal BI website: https://optimalbi.com/blog/2014/10/10/data-vault-design-hub-tables/

Daeng Bani, F. C., Suharjito, Diana, & Girsang, A. S. (2018). Implementation of Database Massively Parallel Processing System to Build Scalability on Process Data Warehouse. Procedia Computer Science, 135, 68–79. https://doi.org/10.1016/j.procs.2018.08.151

Eberendu, A. C. (2016). Unstructured Data: an overview of the data of Big Data. International Journal of Computer Trends and Technology, 38(1), 46–50. https://doi.org/10.14445/22312803/ijctt-v38p109

EY. (2014). Big data: Changing the way businesses compete and operate. International Journal of Simulation: Systems, Science and Technology, 16(April), 28. https://doi.org/10.5013/IJSSST.a.16.5B.22

Gartner. (2012). Big Data. Retrieved October 20, 2019, from https://www.gartner.com/en/information-technology/glossary/big-data

Gartner. (2019). Data Warehouse. Retrieved October 21, 2019, from Gartner Glossary website: https://www.gartner.com/en/information-technology/glossary/data-warehouse

Gemino, A., & Wand, Y. (2003). Evaluating modeling techniques based on models of learning. Communications of the ACM, 46(10), 79–84. https://doi.org/10.1145/944217.944243

Gil, D., & Song, I. Y. (2016). Modeling and Management of Big Data: Challenges and opportunities. Future Generation Computer Systems, 63, 96–99. https://doi.org/10.1016/j.future.2015.07.019

Hashem, H., & Ranc, D. (2015). An integrative modeling of BigData processing. International Journal of Computer Science and Applications, 12(1), 1–15.

Hultgren, H. (2012). Modeling the Agile Data Warehouse with Data Vault. New Hamilton.

Hultgren, H. (2013). Introductory Guide to Data Vault Modeling.

Hultgren, H. (2018). Data Vault Modeling Certification 2018. New Hamilton.

IBM. (2011). Overview of Data Warehousing. Retrieved from IBM Knowledge Center website: https://www.ibm.com/support/knowledgecenter/en/SSGU8G_11.50.0/com.ibm.whse.doc/ids_ ddi_344.htm

Inmon, W. H. (2002). Building the Data Warehouse.

Inmon, W. H., & Linstedt, D. (2015). Introduction to Data Vault Modeling. In Data Architecture: a Primer for the Data Scientist. https://doi.org/10.1016/b978-0-12-802044-9.00022-2

Jovanovic, P., Romero, O., Simitsis, A., Abelló, A., & Mayorova, D. (2014). A requirement-driven approach to the design and evolution of data warehouses. Information Systems, 44, 94–119. https://doi.org/10.1016/j.is.2014.01.004

Kambayashi, Y., Winiwarter, W., & Arikana, M. (2002). Data Warehousing and Knowledge Discovery. In G. Goos, J. Hartmanis, & J. van Leeuwen (Eds.), 4th International Conference, DaWaK 2002 Aix-en-Provence, France, September 4-6, 2002 Proceedings (Vol. 9). https://doi.org/10.1016/0020-7101(78)90038-7

Kimball, R., & Ross, M. (2013). The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling (3rd ed.). Indianapolis: John Wiley & Sons.

Laney, D. (2001). 3d Data management: controlling data volume, velocity and variety, Appl. Delivery Strategies Meta Group. Information and Software Technology, 51(4), 769–784. https://doi.org/10.1016/j.infsof.2008.09.005

Lans, R. F. van der. (2015). Data Vault and Data Virtualization: Double Agility. (March).

Lin, Y., Wang, H., Li, J., & Gao, H. (2019). Data source selection for information integration in big data era. Information Sciences, 479, 197–213. https://doi.org/10.1016/j.ins.2018.11.029


Linstedt, D. (2010a). Data Vault Model & MPP Architecture. Retrieved from DanLinstedt.com website: https://danlinstedt.com/allposts/datavaultcat/data-vault-model-mpp-architecture/

Linstedt, D. (2010b). Potencial Data Vault Issues. Retrieved from DanLinstedt.com website: https://danlinstedt.com/allposts/datavaultcat/potential-data-vault-issues/

Linstedt, D. (2015). Data Vault Basics. Retrieved from DanLinstedt.com website: https://danlinstedt.com/solutions-2/data-vault-basics/

Linstedt, D., & Olschimke, M. (2015). Building a Scalable Data Warehouse with Data Vault 2.0. Retrieved from http://eds.a.ebscohost.com/eds/ebookviewer/ebook/bmxlYmtfXzEwNjU1MDRfX0FO0?nobk=y &sid=c9983dab-c64c-42e5-a2c9-0a5455d14e0b@sdc-v-sessmgr02&vid=5&format=EB&rid=1

McCue, C. (2007). Data Mining and Predictive Analysis. Retrieved from http://www.sciencedirect.com/science/article/pii/B9780750677967500428

Mcnulty, E. (2014). Understanding Big Data: The Seven V’s - Dataconomy. Retrieved October 20, 2019, from Dataconomy website: https://dataconomy.com/2014/05/seven-vs-big-data/

Moody, D. L., & Kortink, M. A. R. (2000). From Enterprise Models to Dimensional Models: A Methodology for Data Warehouse and Data Mart Design. Proceedings of the International Workshop on Design and Management of Data Warehouses (DMDW 2000), 1–12.

Naamane, Z., & Jovanovic, V. (2016). Effectiveness of Data Vault compared to Dimensional Data Marts on Overall Performance of a Data Warehouse System. International Journal of Computer Science Issues, 13(4), 16–31. https://doi.org/10.20943/01201604.1631

Orlov, V. (2014). Data Warehouse Architecture: Inmon CIF, Kimball Dimensional or Linstedt Data Vault? - The Blend: A West Monroe Partners Blog. Retrieved October 15, 2019, from https://blog.westmonroepartners.com/data-warehouse-architecture-inmon-cif-kimball-dimensional-or-linstedt-data-vault/

Oumkaltoum, B., Mohamed Mahmoud, E. B., & Omar, E. B. (2019). Toward a business intelligence model for challenges of interoperability in egov system: Transparency, scalability and genericity. 2019 International Conference on Wireless Technologies, Embedded and Intelligent Systems, WITS 2019, 1–6. https://doi.org/10.1109/WITS.2019.8723756

Rao, T. R., Mitra, P., Bhatt, R., & Goswami, A. (2018). The big data system, components, tools, and technologies: a survey. In Knowledge and Information Systems (Vol. 60). https://doi.org/10.1007/s10115-018-1248-0

Run, J. (2018). Scalable Data Warehouse Architecture. Retrieved November 6, 2019, from https://jerryrun.wordpress.com/2018/09/11/chapter-2-scalable-data-warehouse-architecture/

Santoso, L. W., & Yulia. (2017). Data Warehouse with Big Data Technology for Higher Education. Procedia Computer Science, 124, 93–99. https://doi.org/10.1016/j.procs.2017.12.134

Sarker, K. U., Bin Deraman, A., Hasan, R., & Abbas, A. (2019). Ontological practice for big data management. International Journal of Computing and Digital Systems, 8(3), 265–273. https://doi.org/10.12785/ijcds/080306

Shivtare, P. S., & Shelar, P. P. (2015). Data Warehouse with Data Integration : Problems and Solution. 67–71.


Simons, H. (2009). Case study research in practice. London: SAGE.

Smallcombe, M. (2019). ETL vs ELT: Top Differences. Retrieved October 18, 2019, from Xplenty website: https://www.xplenty.com/blog/etl-vs-elt/

Standards – Data Vault & Ensemble Modeling Standards. (2018). Retrieved November 14, 2019, from Genesee Academy LLC website: http://dvstandards.com/standards/

Starman, A. (2013). The case study as a type of qualitative research. Journal of Contemporary Educational Studies, 1(2013), 28–43.

Storey, V. C., & Song, I. Y. (2017). Big data technologies and Management: What conceptual modeling can do. Data and Knowledge Engineering, 108(February), 50–67. https://doi.org/10.1016/j.datak.2017.01.001

Sturman, A. (1997). Case study methods. In J. P. Keeves (Ed.), Educational research, methodology and measurement: an international handbook (2nd ed.). Pergamon.

Teorey, T., Jagadish, H. V, Modeling, D., & Edition, D. F. (2011). Conceptual Data Modeling Requirements Analysis and Conceptual Data Modeling.

Varge, M. (2001). On the Differences of Relational and Dimensional Data Model. The 12th International Conference on Information and Intelligent Systems IIS 2001, 245–251. Retrieved from https://bib.irb.hr/datoteka/102195.t09r02.pdf

Wannalai, N., & Mekruksavanich, S. (2019). The application of intelligent database for modern information management. ECTI DAMT-NCON 2019 - 4th International Conference on Digital Arts, Media and Technology and 2nd ECTI Northern Section Conference on Electrical, Electronics, Computer and Telecommunications Engineering, 105–108. https://doi.org/10.1109/ECTI-NCON.2019.8692242

Whishworks. (2017). Understanding the 3 Vs of Big Data - volume, velocity and variety. Retrieved from https://www.whishworks.com/blog/big-data/understanding-the-3-vs-of-big-data-volume- velocity-and-variety

Yessad, L. (2016). Comparative Study of Data Warehouses Modeling Approaches: Inmon, Kimball and Data Vault. 2016 International Conference on System Reliability and Science (ICSRS), 95–99. https://doi.org/10.1109/ICSRS.2016.7815845

Yessad, L., & Labiod, A. (2017). Comparative study of data warehouses modeling approaches: Inmon, Kimball and Data Vault. 2016 International Conference on System Reliability and Science, ICSRS 2016 - Proceedings, 95–99. https://doi.org/10.1109/ICSRS.2016.7815845

Yin, R. K. (2008). Case study research: Design and methods (4th ed.). Sage Publications Incorporated.


ANNEXES

LOAD DIMENSION TABLES – ETL PROCESS

Figure 51 - Load Hotel Dimension table

Figure 52 - Load Discount Dimension table


Figure 53 - Load Booking Status Dimension table

Figure 54 - Load Cancellation Detail Dimension table


Figure 55 - Load Services Dimension table

Figure 56 - Load Trip Type Dimension table


Figure 57 - Load Room Type Dimension table

Figure 58 - Load Rating Dimension table


Figure 59 - Load Platform Dimension table


Figure 60 - Load Guest Dimension table

Figure 61 - Load Dates Dimension table

