Data Vault Modeling & Methodology

Technical Side and Introduction
© Dan Linstedt, 2010, http://DanLinstedt.com

Technical Definition
The Data Vault is a detail-oriented, historical-tracking, and uniquely linked set of normalized tables that supports one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent, and adaptable to the needs of the enterprise. It is architected specifically to meet the needs of today’s enterprise data warehouses.
5/28/2010 http://empoweredHoldings.com

What Does One Look Like?
Elements:
• Hub = List of Unique Business Keys
• Link = List of Relationships, Associations
• Satellites = Descriptive Data
[Diagram: Customer, Product, and Order Hubs connected by Links, with Satellites attached to the Hubs and Links. Callout: "Records a history of the interaction".]

Excel As A Source…
[Diagram: User Grouping Structures (Level A, Level B, Level C, Item) are loaded into a Staging Table as a Flattened Structure, then stored as Raw Source Data in the DV using a Hierarchical Link of Groups, Hub Grouping, Sat Group Type, Link Acct To Group, and Hub Account.]
Do you have a power executive who is technically inclined, who runs the business off a rogue spreadsheet?

Data Vault Basic Elements
CORE ARCHITECTURE

Data Vault Core Architecture
• Hubs, Links, Satellites
• Hubs = Unique List of Business Keys
• Links = Unique List of Relationships across keys
• Satellites = Descriptive Data
• Satellites have one and only one parent table
• Satellites cannot be “Parents” to other tables
• Hubs cannot be child tables
• Last Seen Dates, Load Dates, Record Sources, and Surrogate keys are not part of the core architecture. They exist to help the models and key migration.
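The core architecture above can be sketched as a small schema. This is a minimal illustration using Python's built-in sqlite3 module, not the author's reference design: every table and column name (hub_customer, customer_seq_id, and so on) is an assumption made for the example, and the helper load_customer_key is hypothetical. It shows a Hub as a unique list of business keys, a Link as a unique list of key combinations, and a Satellite with exactly one parent.

```python
import sqlite3

# All table and column names are illustrative assumptions, not part of the deck.
SCHEMA = """
CREATE TABLE hub_customer (
    customer_seq_id INTEGER PRIMARY KEY,   -- surrogate key
    customer_number TEXT NOT NULL UNIQUE,  -- business key (unique index)
    load_dts        TEXT NOT NULL,
    record_source   TEXT NOT NULL
);
CREATE TABLE hub_product (
    product_seq_id  INTEGER PRIMARY KEY,
    product_number  TEXT NOT NULL UNIQUE,
    load_dts        TEXT NOT NULL,
    record_source   TEXT NOT NULL
);
CREATE TABLE link_customer_product (
    link_seq_id     INTEGER PRIMARY KEY,
    customer_seq_id INTEGER NOT NULL,
    product_seq_id  INTEGER NOT NULL,
    load_dts        TEXT NOT NULL,
    record_source   TEXT NOT NULL,
    UNIQUE (customer_seq_id, product_seq_id)  -- unique list of relationships
);
CREATE TABLE sat_customer (
    customer_seq_id INTEGER NOT NULL,      -- the Satellite's single parent
    load_dts        TEXT NOT NULL,
    customer_name   TEXT,                  -- descriptive data
    record_source   TEXT NOT NULL,
    PRIMARY KEY (customer_seq_id, load_dts)
);
"""


def build_vault():
    """Create an in-memory database holding the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def load_customer_key(conn, customer_number, load_dts, record_source):
    """Insert a business key only if it is new, then return its surrogate key."""
    conn.execute(
        "INSERT OR IGNORE INTO hub_customer "
        "(customer_number, load_dts, record_source) VALUES (?, ?, ?)",
        (customer_number, load_dts, record_source),
    )
    row = conn.execute(
        "SELECT customer_seq_id FROM hub_customer WHERE customer_number = ?",
        (customer_number,),
    ).fetchone()
    return row[0]
```

Loading the same customer number twice returns the same surrogate key and leaves the first row untouched, which is one simple way to keep a Hub a unique list of keys.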
Hub Entity
A Hub is a list of unique business keys.

Hub Structure (example: Hub Product)
• Primary Key: Product Sequence ID
• <Business Key> (unique index, primary index): Product Number
• Load DTS: Product Load DTS
• Last Seen DTS: Product Last Seen DTS
• Record Source: Prod Record Source

• A Hub’s business key is a unique index
• A Hub’s load date represents the FIRST TIME the EDW saw the data
• A Hub’s record source represents: first, the “Master” data source (on collisions); if that is not available, it holds the origination source of the actual key

Link Entity
A Link is an intersection of two or more business keys. It can contain Hub keys and other Link keys.

Link Structure (example: Link Line-Item)
• Primary Key: Link Line Item Sequence ID
• {Hub/Lnk Surrogate Keys 2..N} (unique index, primary index): Hub Product Sequence ID, Hub Order Sequence ID, **Line Item Number
• Load DTS: Load DTS
• Last Seen DTS: Last Seen DTS
• Record Source: Record Source

• A Link’s business key is a composite unique index
• A Link may or may not have a “**Item Numbering” attribute
• A Link’s load date represents the FIRST TIME the EDW saw the data
• A Link’s record source represents: first, the “Master” data source (on collisions); if that is not available, it holds the origination source of the actual key

Satellite Entity
A Satellite is a time-dimensional table housing detailed information about the Hub’s or Link’s business keys.

Satellite Structure (example: Customer)
• Primary Key: Customer #
• Load DTS: Load DTS
• Extract DTS: Extract DTS
• **Load End Date: **Load End Date
• Detail Business Data: Customer Name, Customer Addr1, Customer Addr2
• {Update User}: {Update User}
• {Update DTS}: {Update DTS}
• Record Source: Record Source
The Primary Key and Load DTS together form the unique index (primary index).

• Satellites are defined by TYPE of data and RATE OF CHANGE
• Mathematically – this reduces redundancy and decreases storage requirements over time (compared to a star schema)

Rules and Standards
GOVERN your deployment…
THINKING OF BREAKING RULES…

Some Rules For You
• NO Foreign Keys in the Satellites!
• NO Hub to Hub (Parent Child relationships)
• NO Enforcement of relationships in the data model…
• NO Date Time attributes in HUB or LINK Primary Keys…
• Why??
  – It breaks flexibility
  – It breaks auditability / accountability
  – It breaks Scalability
  – It breaks Performance
  – It introduces “Decisions” in the architecture, which breaks Patterns!
Up Next → Links and the Unit Of Work…

Business Key Definitions…
• “The contracts system is responsible for creating customer account numbers. The EDW will never see other systems creating customer account numbers.” (Requirement #101)
Sales is clearly creating customer numbers. How do we detect the issue and alert the business?
Point: Not all business keys are created EQUAL!

Link: Unit of Work
[Diagram labels: Hub Category; Hub Product; Hub Supplier; Unit Of Work Link; Link: Product by Line Item Supplier by Category; Link Sat Effectivity Prod-Cat; Link Sat Effectivity Prod-Supp; Link Product by Category; Link Product by Supplier. Note on the last two: "These links are Optional, used For exploration only".]

What Happens When: We Break the Unit of Work

Source System UOW
Product_ID  Category_ID  Supplier_ID
222         12           96
222         12           93
729         15           87
222         17           93

Model Normalization splits the source into two Links:

Link Product by Supplier
Product_ID  Supplier_ID
222         96
222         93
729         87

Link Product by Category
Product_ID  Category_ID
222         12
222         17
729         15

Question: After normalizing, how can you reconstruct the source image EXACTLY as it stands?
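One way to explore the question above is to replay the slide's rows in a few lines of Python. The sketch below is only an illustration: the function names split_unit_of_work and rejoin are hypothetical, and the tuples are the Product_ID, Category_ID, and Supplier_ID values from the Source System UOW table. It splits the source into the two Links and then joins them back on Product_ID.

```python
# Rows are (Product_ID, Category_ID, Supplier_ID), copied from the slide.
source_uow = {
    (222, 12, 96),
    (222, 12, 93),
    (729, 15, 87),
    (222, 17, 93),
}


def split_unit_of_work(rows):
    """Normalize the three-way rows into two separate two-way Links."""
    product_by_supplier = {(p, s) for p, _c, s in rows}
    product_by_category = {(p, c) for p, c, _s in rows}
    return product_by_supplier, product_by_category


def rejoin(product_by_supplier, product_by_category):
    """Join the two Links back together on Product_ID."""
    return {
        (p, c, s)
        for p, s in product_by_supplier
        for p2, c in product_by_category
        if p == p2
    }


by_supplier, by_category = split_unit_of_work(source_uow)
rebuilt = rejoin(by_supplier, by_category)
spurious = rebuilt - source_uow

print(sorted(spurious))          # [(222, 17, 96)]
print(source_uow <= rebuilt)     # True: nothing is lost, but one row is invented
```

The join returns five rows for four source rows. The extra row pairs product 222 in category 17 with supplier 96, a combination the source never recorded, so the two Links alone cannot reproduce the source image.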
What Happens When: Trying to Rebuild from Two Links

Link Product by Supplier
Product_ID  Supplier_ID
222         96
222         93
729         87

Link Product by Category
Product_ID  Category_ID
222         12
222         17
729         15

Source System UOW (as rebuilt by re-joining the two Links)
Product_ID  Category_ID  Supplier_ID
222         12           96
222         12           93
222         17           96
222         17           93
729         15           87

Re-joining the data creates a record (222, 17, 96) that does not exist in the original source system. This is the same problem that BI engines will have when putting together Data Mart results.

Link: Unit of Work Kept Together

Source System: Source Table UOW
Product_ID  Category_ID  Supplier_ID
222         12           96
222         12           93
729         15           87
222         17           93

Data Vault: Link: Product by Category by Supplier
Product_ID  Category_ID  Supplier_ID
222         12           96
222         12           93
729         15           87
222         17           93

Commutative Property: Enable reproduction of the source exactly as it stands.
The UOW is properly represented by a single Link in the Data Vault.

What keeps you up at night?
CURRENT LOADING PAIN

Problems with EDW Loads Today
Technical Issues:
• 2am Wakeup Calls, because “data” won’t fit the business rules
• “Emergency Fixes” to Production
• Speed, Speed, Speed (shrinking load window + more data)
• Can’t load real-time data (business rules in the way!!)
• Business won’t buy better, faster hardware!
Business Issues:
• Maintenance cycles take too long
• Maintenance costs continue to increase
• Fixes to “existing mappings” break working logic
• Complexity of existing systems becomes unsustainable to the business
• IT isn’t using 80%+ of the hardware resources given to them (their jobs are running at 40% utilization when they are “full-bore”)

Solutions!
Technical Solutions
• All Parallel Job Streams, as much as possible
• 1 Target Per Map, Per Action → reduces complexity
• Generate Data Flows based on patterns (then focus on the real work)
• Get some SLEEP at night!! (no more production modifications)
Business Solutions
• Decrease turn-around time
• Increase Performance
• Handle Real-Time Data!!
• Reduce Complexity = Reduce Costs, Reduce Time to Implement
• Get the power back for decision making, discovering and building your own marts

How?

Some standards to follow…
BASIC LOADING CONCEPTS

Loading: A Golden Rule
100% of the Data Loaded to the EDW 100% of the time!
It’s all about Auditability…

Load Date / End Date Geology
[Figure labels: Batch Load; Real-Time Loading]

Real Time Loading - DV
[Diagram: an incoming Stock Trade message (ACCOUNT=123443576, STOCK="DAN", TRADE="Buy", SHARES=100.0, CURRENCY="USD", PRICE=115.52, DATE="Feb 20, 2002", Comment="Buy Order to Execute") is loaded in three steps: (1) Acct Hub receives 123443576; (2) Stock Hub receives "DAN"; (3) Trade Link receives TRADE="Buy", SHARES=100.0, CURRENCY="USD", PRICE=115.52, DATE="Feb 20, 2002", Comment="Buy Order to Execute".]
Transactional Link = Inserts Only, no Updates
• As critical mass of current business keys is reached, the insert rates decrease rapidly.
• New systems add new keys, quickly and efficiently, to an existing Hub.
[Chart: # of Inserts (10M to 75M) against Months in Production (1 to 8). Annotations: "First Data Set Loaded", "New Systems Data Added".]

Batch Load Date Time Stamp
[Diagram: Staging Area and EDW – Data Vault. Each Stage Load fills a STAGING TABLE with the columns Sequence_ID, …., Load_DTS, Record_Source. CNTRL_DTE supplies LOAD_DTS.]
Load Date is exactly the same for all rows.

Parallel Load Architecture - Batch
[Diagram: Staging Loads, then Data Vault Loads, then Data Mart Loads. Labels: Sources, Stage, Hubs, Links, Hub Satellites, Link Satellites, Dimensions, Facts. Major Synchronization Points separate the sets of loads.]
Processing:
• All loads are done in parallel
• Sets of processes “wait” for the previous set to complete
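The two processing rules above can be sketched with Python's standard concurrent.futures module. This is a hedged illustration rather than the deck's loading framework: the function run_in_waves, the wave names, and the target names are all hypothetical, and the ordering of waves in the usage note is only one reading of the flattened diagram. Each wave's loads run in parallel, and the next wave starts only after every load in the previous wave has finished.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_waves(waves, load_fn, max_workers=4):
    """Run each wave's loads in parallel, one wave at a time.

    `waves` is a list of (wave_name, [target, ...]) pairs and `load_fn`
    is any callable that loads a single target. Both are assumptions
    made for this sketch.
    """
    completed = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for wave_name, targets in waves:
            # list() drains the iterator returned by map(), so this line
            # blocks until every load in the wave is done. That wait is
            # the synchronization point between sets of processes.
            results = list(pool.map(load_fn, targets))
            completed.append((wave_name, results))
    return completed
```

A caller might pass waves such as stage tables, then Hubs, then Links and Hub Satellites, then Link Satellites, then Dimensions, then Facts, with load_fn wrapping whatever tool actually moves the data. Any exception raised by a load surfaces when its result is read, which stops later waves from starting.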
