Data Vault Modeling & Methodology

Technical Side and Introduction
© Dan Linstedt, 2010, http://DanLinstedt.com

Technical Definition

The Data Vault is a detail-oriented, historical-tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise.

Architected specifically to meet the needs of today’s enterprise data warehouses

5/28/2010 http://empoweredHoldings.com

What Does One Look Like?

[Figure: Hubs Customer, Product and Order, connected by Links (marked F(x)); the Hubs and Links carry Satellites (Sat). Callout on the Customer-Product Link: "Records a history of the interaction".]

Elements:
• Hub = List of Unique Business Keys
• Link = List of Relationships, Associations
• Satellites = Descriptive Data

Excel As A Source…

[Figure: a hierarchical spreadsheet (Level A, Level B, Level C, Item, Item, Item) is loaded to a Staging Table as a Flattened Structure (Raw Source). Data in DV: Hub Account, Hub Grouping, Sat Group Type, Link Acct To Group and a Hierarchical Link of Groups, which together hold the User Grouping Structures.]

Do you have a power executive who is technically inclined, who runs the business off a rogue spreadsheet?

Data Vault Basic Elements
CORE ARCHITECTURE

Data Vault Core Architecture

• Hubs, Links, Satellites
• Hubs = Unique List of Business Keys
• Links = Unique List of Relationships across keys
• Satellites = Descriptive Data

• Satellites have one and only one parent table
• Satellites cannot be “Parents” to other tables
• Hubs cannot be child tables

• Last Seen Dates, Load Dates, Record Sources, and Surrogate keys are not part of the core architecture. They exist to help models and key migration.

Hub Entity

A Hub is a list of unique business keys.

Hub Structure                   Hub Product
------------------------------  ---------------------
Primary Key                     Product Sequence ID
Unique Index (Primary Index)    Product Number
Load DTS                        Product Load DTS
Last Seen DTS                   Product Last Seen DTS
Record Source                   Prod Record Source

• A Hub’s business key is a unique index
• A Hub’s load date represents the FIRST TIME the EDW saw the data
• A Hub’s record source represents: first, the “Master” data source (on collisions); if not available, it holds the origination source of the actual key
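The Hub structure above can be sketched as a table definition. This is a minimal illustration using Python's built-in sqlite3 module; the column names, the sample key `P-100` and the source names `ERP` and `CRM` are assumptions made for the example, not part of the original slides.

```python
import sqlite3

# Column names are illustrative; they mirror the Hub Product structure above.
HUB_PRODUCT_DDL = """
CREATE TABLE hub_product (
    product_sqn    INTEGER PRIMARY KEY,   -- surrogate sequence ID
    product_number TEXT NOT NULL UNIQUE,  -- business key (unique index)
    load_dts       TEXT NOT NULL,         -- first time the EDW saw the key
    last_seen_dts  TEXT NOT NULL,
    record_source  TEXT NOT NULL
)
"""

INSERT_SQL = (
    "INSERT INTO hub_product "
    "(product_number, load_dts, last_seen_dts, record_source) "
    "VALUES (?, ?, ?, ?)"
)

conn = sqlite3.connect(":memory:")
conn.execute(HUB_PRODUCT_DDL)
conn.execute(INSERT_SQL, ("P-100", "2010-05-28", "2010-05-28", "ERP"))

# A second arrival of the same business key must not create a second Hub row.
try:
    conn.execute(INSERT_SQL, ("P-100", "2010-05-29", "2010-05-29", "CRM"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

hub_rows = conn.execute(
    "SELECT product_number, load_dts, record_source FROM hub_product"
).fetchall()
```

The unique index keeps one row per business key, so the load date and record source of the first arrival are the ones that survive.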

Link Entity

A Link is an intersection of two or more business keys. It can contain Hub keys and other Link keys.

Link Structure                      Link Line-Item
----------------------------------  --------------------------
Primary Key                         Link Line Item Sequence ID
Unique Index (Primary Index):       Hub Product Sequence ID
  {Hub/Lnk Surrogate Keys 2..N}     Hub Order Sequence ID
                                    **Line Item Number
Load DTS                            Load DTS
Last Seen DTS                       Last Seen DTS
Record Source                       Record Source

• A Link’s business key is a composite unique index
• A Link may or may not have a “**Item Numbering” attribute
• A Link’s load date represents the FIRST TIME the EDW saw the data
• A Link’s record source represents: first, the “Master” data source (on collisions); if not available, it holds the origination source of the actual key
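A possible rendering of the Link Line-Item structure, again as a sqlite3 sketch. The table and column names, the helper `add_line_item` and the source name `ORDERS` are hypothetical. No foreign key constraints are declared, on the assumption that relationships are not enforced in the data model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE link_line_item (
    line_item_sqn    INTEGER PRIMARY KEY,  -- Link surrogate sequence ID
    product_sqn      INTEGER NOT NULL,     -- surrogate key of Hub Product
    order_sqn        INTEGER NOT NULL,     -- surrogate key of Hub Order
    line_item_number INTEGER NOT NULL,     -- optional "**" item-numbering attribute
    load_dts         TEXT NOT NULL,
    last_seen_dts    TEXT NOT NULL,
    record_source    TEXT NOT NULL,
    UNIQUE (product_sqn, order_sqn, line_item_number)  -- composite unique index
);
""")

def add_line_item(product_sqn, order_sqn, line_no, load_dts, source):
    """Insert the relationship once; repeats are ignored (insert-only Link)."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO link_line_item "
        "(product_sqn, order_sqn, line_item_number, "
        " load_dts, last_seen_dts, record_source) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (product_sqn, order_sqn, line_no, load_dts, load_dts, source),
    )
    return cur.rowcount == 1  # True when a new relationship was recorded

first = add_line_item(1, 10, 1, "2010-05-28", "ORDERS")
repeat = add_line_item(1, 10, 1, "2010-05-29", "ORDERS")
second_line = add_line_item(1, 10, 2, "2010-05-29", "ORDERS")
link_count = conn.execute("SELECT COUNT(*) FROM link_line_item").fetchone()[0]
```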

Satellite Entity

A Satellite is a time-dimensional table housing detailed information about the Hub’s or Link’s business keys.

Structure                                     Customer example
--------------------------------------------  ---------------------------------------------
Primary Key / Unique Index (Primary Index)    Customer #, Load DTS
Extract DTS                                   Extract DTS
**Load End Date                               **Load End Date
Detail Business Data                          Customer Name, Customer Addr1, Customer Addr2
{Update User}                                 {Update User}
{Update DTS}                                  {Update DTS}
Record Source                                 Record Source

• Satellites are defined by TYPE of data and RATE OF CHANGE

• Mathematically – this reduces redundancy and decreases storage requirements over time (compared to a star schema)
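To illustrate the time-dimensional nature of a Satellite, the sketch below stores two versions of a customer's address and reads back the version that was current on a given date. It assumes a single address Satellite keyed by customer number and load date; the table name, sample values and the helper `address_as_of` are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sat_customer_address (
    customer_number INTEGER NOT NULL,  -- key of the one parent Hub
    load_dts        TEXT NOT NULL,
    load_end_dts    TEXT,              -- optional "**Load End Date"; NULL = current
    customer_addr1  TEXT,
    customer_addr2  TEXT,
    record_source   TEXT NOT NULL,
    PRIMARY KEY (customer_number, load_dts)
);
""")

rows = [
    (1001, "2010-01-01", "2010-03-01", "1 Main St", None, "CRM"),
    (1001, "2010-03-01", None, "9 Elm Ave", "Suite 4", "CRM"),
]
conn.executemany("INSERT INTO sat_customer_address VALUES (?, ?, ?, ?, ?, ?)", rows)

def address_as_of(customer_number, as_of):
    """Return addr1 from the latest row loaded on or before as_of."""
    row = conn.execute(
        "SELECT customer_addr1 FROM sat_customer_address "
        "WHERE customer_number = ? AND load_dts <= ? "
        "ORDER BY load_dts DESC LIMIT 1",
        (customer_number, as_of),
    ).fetchone()
    return row[0] if row else None

february = address_as_of(1001, "2010-02-15")
latest = address_as_of(1001, "2010-05-28")
before_first_load = address_as_of(1001, "2009-12-31")
```

Attributes that change at a different rate (for example, a customer name) would live in a separate Satellite under the same parent, which is what keeps unchanged columns from being repeated on every change.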

Rules and Standards GOVERN your deployment…
THINKING OF BREAKING RULES…

Some Rules For You

• NO Foreign Keys in the Satellites!
• NO Hub to Hub (Parent Child relationships)
• NO Enforcement of relationships in the data model…
• NO Date Time attributes in HUB or LINK Primary Keys…

• Why??
  – It breaks flexibility
  – It breaks auditability / accountability
  – It breaks Scalability
  – It breaks Performance
  – It introduces “Decisions” in the architecture, which breaks Patterns!

Up Next → Links and the Unit Of Work…

Business Key Definitions…

• “The contracts system is responsible for creating customer account numbers. The EDW will never see other systems creating customer account numbers.” (Requirement #101)

Sales is clearly creating customer numbers. How do we detect the issue and alert the business?

Point: Not all business keys are created EQUAL!
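One way to detect the situation, sketched under the assumption that the customer Hub's record source holds the originating source of each key: list the keys whose record source is not the agreed master. The source names `CONTRACTS` and `SALES`, the sample keys and the function name are hypothetical.

```python
MASTER_SOURCE = "CONTRACTS"  # per Requirement #101; source names are hypothetical

# (customer_account_number, record_source) as first recorded in the Hub
hub_customer = [
    ("C-1001", "CONTRACTS"),
    ("C-1002", "CONTRACTS"),
    ("C-9001", "SALES"),
]

def keys_from_unexpected_sources(hub_rows, master_source):
    """Return business keys whose originating source is not the agreed master."""
    return [(key, src) for key, src in hub_rows if src != master_source]

violations = keys_from_unexpected_sources(hub_customer, MASTER_SOURCE)
```

Because the key is loaded rather than rejected, the warehouse keeps the evidence and the list of violations can be reported back to the business.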

Link: Unit of Work

[Figure: Hub Product, Hub Category and Hub Supplier joined by one Unit Of Work Link (labelled "Link: Product by Line Item Supplier by Category"). Two further Links, Link Product by Category and Link Product by Supplier, each carry an Effectivity Satellite (Link Sat Effectivity Prod-Cat, Link Sat Effectivity Prod-Supp).]

Link Product by Category and Link Product by Supplier are optional, used for exploration only.

What Happens When: We Break the Unit of Work

Source System UOW
Product_ID  Category_ID  Supplier_ID
222         12           96
222         12           93
729         15           87
222         17           93

Model Normalization splits the UOW into two Links:

Link Product by Supplier
Product_ID  Supplier_ID
222         96
222         93
729         87

Link Product by Category
Product_ID  Category_ID
222         12
222         17
729         15

Question: After normalizing, how can you reconstruct the source image EXACTLY as it stands?

What Happens When: Trying to Rebuild from Two Links

Link Product by Supplier
Product_ID  Supplier_ID
222         96
222         93
729         87

Link Product by Category
Product_ID  Category_ID
222         12
222         17
729         15

Source System UOW, as rebuilt by re-joining the two Links
Product_ID  Category_ID  Supplier_ID
222         12           96
222         12           93
222         17           96    <- not in the original source
222         17           93
729         15           87

Re-joining the data creates a record that does not exist in the original source system. This is the same problem that BI engines will have when putting together results.
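The effect can be reproduced with the slide's own rows. The sketch below splits the source Unit of Work into the two Links and joins them back on Product_ID; the variable names are illustrative.

```python
# (Product_ID, Category_ID, Supplier_ID) rows of the source Unit of Work
source_uow = {
    (222, 12, 96),
    (222, 12, 93),
    (729, 15, 87),
    (222, 17, 93),
}

# Normalizing the UOW into two separate Links
link_product_supplier = {(p, s) for p, _c, s in source_uow}
link_product_category = {(p, c) for p, c, _s in source_uow}

# Re-joining the two Links on Product_ID
rejoined = {
    (p1, c, s)
    for p1, s in link_product_supplier
    for p2, c in link_product_category
    if p1 == p2
}

phantom_rows = rejoined - source_uow  # rows the source never contained
missing_rows = source_uow - rejoined  # source rows the join failed to rebuild
```

Here `phantom_rows` holds the single row (222, 17, 96): product 222 has two suppliers and two categories, so the join pairs every supplier with every category, including a combination the source never recorded.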

Link: Unit of Work Kept Together

Source System: Source Table UOW
Product_ID  Category_ID  Supplier_ID
222         12           96
222         12           93
729         15           87
222         17           93

Data Vault: Link Product by Category by Supplier
Product_ID  Category_ID  Supplier_ID
222         12           96
222         12           93
729         15           87
222         17           93

Commutative Property: Enable reproduction of the source exactly as it stands

UOW is properly represented by a single Link in the Data Vault

What keeps you up at night?
CURRENT LOADING PAIN

Problems with EDW Loads Today

Technical Issues:
• 2am Wakeup Calls – because “data” won’t fit the business rules
• “Emergency Fixes” to Production
• Speed, Speed, Speed (shrinking load window + more data)
• Can’t load real-time data (business rules in the way!!)
• Business won’t buy better, faster hardware!

Business Issues:
• Maintenance cycles take too long
• Maintenance costs continue to increase
• Fixes to “existing mappings” break working logic
• Complexity of existing systems becomes unsustainable for the business
• IT isn’t using 80%+ of the hardware resources given to them (their jobs are running at 40% utilization when they are “full-bore”)

Solutions!

Technical Solutions
• All Parallel Job Streams, as much as possible
• 1 Target Per Map, Per Action → reduces complexity
• Generate Data Flows based on patterns (then focus on the real work)
• Get some SLEEP at night!! (no more production modifications)

Business Solutions
• Decrease turn-around time
• Increase Performance
• Handle Real-Time Data!!
• Reduce Complexity = Reduce Costs, Reduce Time to Implement
• Get the power back for decision making, discovering and building your own marts

How?

Some standards to follow…
BASIC LOADING CONCEPTS

Loading: A Golden Rule

100% of the Data Loaded to the EDW 100% of the time!

It’s all about Auditability…

Load Date / End Date Geology

[Figure: two panels, Batch Load and Real-Time Loading.]

Real Time Loading - DV

Incoming Stock Trade message:
ACCOUNT=123443576
STOCK="DAN"
TRADE="Buy"
SHARES=100.0
CURRENCY="USD"
PRICE=115.52
DATE="Feb 20, 2002"
Comment="Buy Order to Execute"

1. Acct Hub: 123443576
2. Stock Hub: "DAN"
3. Trade Link: TRADE="Buy", SHARES=100.0, CURRENCY="USD", PRICE=115.52, DATE="Feb 20, 2002", Comment="Buy Order to Execute"

Transactional Link = Inserts Only, no Updates

• As critical mass of current business keys is reached, the insert rates decrease rapidly.
• New systems add new keys, quickly and efficiently, to an existing Hub.

[Chart: # of Inserts (axis marks 10M, 25M, 50M, 75M) against Months in Production (1 to 8), annotated "First Data Set Loaded" and "New Systems Data Added".]

Batch Load Date Time Stamp

[Figure: Staging Area and EDW (Data Vault). A control date (CNTRL_DTE) supplies LOAD_DTS to each Stage Load; each STAGING TABLE carries Sequence_ID, …., Load_DTS, Record_Source.]

Load Date is exactly the same for all rows.
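A minimal sketch of the batch load-date rule: the timestamp is captured once per batch and stamped on every staged row, together with a sequence ID and a record source. The function name `stage_batch`, the row layout and the sample values are assumptions for illustration.

```python
from datetime import datetime, timezone

def stage_batch(source_rows, record_source, load_dts=None):
    """Stamp every row of one batch with the same load date and a sequence ID."""
    # Capture the timestamp once, before the loop, so it cannot drift per row.
    load_dts = load_dts or datetime.now(timezone.utc)
    return [
        {"sequence_id": i, **row, "load_dts": load_dts, "record_source": record_source}
        for i, row in enumerate(source_rows, start=1)
    ]

batch = stage_batch(
    [{"product_number": "P-100"}, {"product_number": "P-200"}],
    record_source="ERP",
)
distinct_load_dates = {row["load_dts"] for row in batch}
```

Passing an explicit `load_dts` (for example, a value read from a control table) makes a rerun of the same batch reproduce the same stamp.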

Parallel Load Architecture - Batch

[Figure: Staging Loads (Sources, Stage), Data Vault Loads (Hubs, Links, Hub Satellites, Link Satellites) and Data Mart Loads (Dimensions, Facts), separated by Major Synchronization Points.]

Processing:
• All loads are done in parallel
• Sets of processes “wait” for the previous set to complete
• Processes are run as soon as data is ready
• No other “waiting” time is required
• Load dependencies are greatly reduced

Mathematics of Batch Loading

It’s all about SPEED SPEED SPEED

10 Million Incoming Rows, arriving at an EDW of 1 Billion Rows and growing:
• 10%-20% Inserts (Never Seen Before)
• 60%-80% Updates (Matched By KEY)
• 5% Deletes

• Inserts are the single fastest operation in the Database!
• Updates are the single slowest operation in the Database!

Q: Why push 80% of your data through “the heaviest/slowest” transformation logic?

Simple Loading Patterns

Rule: 1 Target Per Data Flow (map/graph) Per Action

Pattern 1:
Source → SQ → LKP Target → Filter If Exists → Target Insert

Pattern 2:
Source → SQ → Target Insert
  Insert View: Select ALL that do not exist by PK in target
Source (Stage) → Target
  Update View: Select ALL that exist by PK in target, ONLY those with DELTA
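The insert-view and update-view pattern can be sketched with two SQL views over a staging table. This uses Python's sqlite3 module; the table names `stage` and `target`, the single `name` attribute and the sample rows are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stage  (pk INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE target (pk INTEGER PRIMARY KEY, name TEXT);

-- Insert view: all staged rows that do not exist by PK in the target
CREATE VIEW v_insert AS
SELECT s.pk, s.name FROM stage s
WHERE NOT EXISTS (SELECT 1 FROM target t WHERE t.pk = s.pk);

-- Update view: staged rows that exist by PK in the target AND carry a delta
CREATE VIEW v_update AS
SELECT s.pk, s.name FROM stage s
JOIN target t ON t.pk = s.pk
WHERE s.name IS NOT t.name;

INSERT INTO target VALUES (1, 'unchanged'), (2, 'old value');
INSERT INTO stage  VALUES (1, 'unchanged'), (2, 'new value'), (3, 'brand new');
""")

# Evaluate both views against the same starting state, then run one data flow
# per target per action: one pure insert, one pure update.
to_insert = conn.execute("SELECT pk, name FROM v_insert ORDER BY pk").fetchall()
to_update = conn.execute("SELECT pk, name FROM v_update ORDER BY pk").fetchall()

conn.executemany("INSERT INTO target (pk, name) VALUES (?, ?)", to_insert)
conn.executemany(
    "UPDATE target SET name = ? WHERE pk = ?", [(name, pk) for pk, name in to_update]
)

final_target = conn.execute("SELECT pk, name FROM target ORDER BY pk").fetchall()
```

Row 1 appears in neither view, so the unchanged data never reaches the slower update path.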

Results of Pattern Tuning

FROM THIS…..
• 5M rows @ 600 RPS = 2.31 hrs
• OR: 5m @ 7k rps = 11.9 mins
• No parallelism

This map must run at a minimum of 10k rps to beat the parallel times:
5m @ 10k rps = 8.33 mins

TO THIS!
• Pass 1:
  • 5m @ 33k RPS = 2.52 mins
• Pass 2:
  • 5m @ 33k RPS = 2.52 mins
  • 5m @ 25k RPS = 3.33 mins
• Pass 3:
  • 5m @ 50k RPS = 1.66 mins
  • 5m @ 33k RPS = 2.52 mins
  • 5m @ 40k RPS = 2.08 mins
  • 5m @ 23k RPS = 3.61 mins
• Total Time:
  • 2.52+3.33+3.61 = 9.46 mins
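The arithmetic behind these figures can be checked with a short calculation. It assumes that the flows within a pass run in parallel, so each pass lasts as long as its slowest flow, and that the passes run one after another. The grouping of rates into passes follows the slide; the function and variable names are illustrative.

```python
ROWS = 5_000_000

def minutes(rows, rows_per_second):
    """Elapsed minutes to move `rows` at a steady rate."""
    return rows / rows_per_second / 60

# Rates (rows per second) of the flows that run side by side in each pass
passes = [
    [33_000],
    [33_000, 25_000],
    [50_000, 33_000, 40_000, 23_000],
]

# Flows in a pass run in parallel, so a pass takes as long as its slowest flow;
# passes run one after another, so their times add up.
pass_minutes = [max(minutes(ROWS, rps) for rps in rates) for rates in passes]
parallel_total = sum(pass_minutes)

single_map_hours = minutes(ROWS, 600) / 60

# Rate a single non-parallel map would need in order to match the parallel total
break_even_rps = ROWS / (parallel_total * 60)
```

Computed without truncation, the total comes to roughly 9.48 minutes rather than 9.46, and the break-even rate lands somewhat below the 10k rps quoted, so 10k rps is a comfortable round figure for "beating the parallel times".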

Patterns Take the Cake!
LOADING THE DATA VAULT

Loading Templates: Hubs

Staging Data → Distinct List BK Keys → Exists In Target?
  No  → Insert Into Target Hub (Gen Surrogate)
  Yes → Drop From Feed

• Select a “Master” system, and a hierarchy of importance for sub-systems to annotate arrival location of data
• Purpose of the loading template: Find out if the business key exists in the hub; if not, insert it
• Use a distinct list (unique) of business keys coming from the staging area
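The Hub loading template can be sketched in a few lines. Here the Hub is modelled as an in-memory dictionary keyed by business key; the function name `load_hub`, the row layout and the sample keys are assumptions, and a real load would run as set-based SQL or a generated data flow.

```python
from itertools import count

def load_hub(hub, staging_rows, key_column, load_dts, record_source, next_sqn):
    """Insert business keys that the Hub has not seen yet; drop the rest."""
    inserted = []
    # Distinct list of business keys from staging (first-seen order preserved)
    distinct_keys = dict.fromkeys(row[key_column] for row in staging_rows)
    for business_key in distinct_keys:
        if business_key in hub:      # exists in target -> drop from feed
            continue
        hub[business_key] = {
            "sqn": next(next_sqn),   # generate surrogate
            "load_dts": load_dts,    # first time the EDW saw the key
            "record_source": record_source,
        }
        inserted.append(business_key)
    return inserted

sequence = count(1)
hub_product = {}
staging = [
    {"product_number": "P-100"},
    {"product_number": "P-200"},
    {"product_number": "P-100"},
]

first_run = load_hub(hub_product, staging, "product_number", "2010-05-28", "ERP", sequence)
second_run = load_hub(hub_product, staging, "product_number", "2010-05-29", "ERP", sequence)
```

Running the same feed twice inserts nothing the second time, which is what makes the template restartable.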

Loading Templates: Links

Staging Data → Distinct List Busn Keys → Lookup EACH Hub’s Surrogate Keys → Exists In Target?
  No  → Insert Into Target Link (gen surrogate)
  Yes → Drop Row From Feed

• Select a “Master” system, and a hierarchy of importance for sub-systems to annotate arrival location of data
• Purpose of the loading template: Find all relationships between business keys, then check whether the relationship is already recorded in the Link; if not, insert it
• Use a distinct list of related business keys
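A comparable sketch for the Link template, assuming the Hubs have already been loaded and are reduced here to business key to surrogate key lookups. The function name `load_link`, the surrogate values and the column names are hypothetical; the sample rows reuse the Product, Category and Supplier IDs from the Unit of Work slides.

```python
from itertools import count

def load_link(link, hubs, staging_rows, key_columns, load_dts, record_source, next_sqn):
    """Record each distinct relationship between business keys exactly once."""
    inserted = []
    distinct = dict.fromkeys(
        tuple(row[column] for column in key_columns) for row in staging_rows
    )
    for business_keys in distinct:
        # Look up EACH Hub's surrogate key for the related business keys
        surrogate_keys = tuple(
            hubs[column][key] for column, key in zip(key_columns, business_keys)
        )
        if surrogate_keys in link:   # relationship already recorded -> drop row
            continue
        link[surrogate_keys] = {
            "sqn": next(next_sqn),   # generate surrogate
            "load_dts": load_dts,
            "record_source": record_source,
        }
        inserted.append(surrogate_keys)
    return inserted

# Hubs reduced to business key -> surrogate key; assumed loaded before the Link
hubs = {
    "product_id": {222: 1, 729: 2},
    "category_id": {12: 1, 15: 2, 17: 3},
    "supplier_id": {96: 1, 93: 2, 87: 3},
}
staging = [
    {"product_id": 222, "category_id": 12, "supplier_id": 96},
    {"product_id": 222, "category_id": 12, "supplier_id": 93},
    {"product_id": 729, "category_id": 15, "supplier_id": 87},
    {"product_id": 222, "category_id": 17, "supplier_id": 93},
    {"product_id": 222, "category_id": 12, "supplier_id": 96},  # repeat in the feed
]
link = {}
columns = ("product_id", "category_id", "supplier_id")
first_run = load_link(link, hubs, staging, columns, "2010-05-28", "ERP", count(1))
second_run = load_link(link, hubs, staging, columns, "2010-05-29", "ERP", count(5))
```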

Loading Templates: Satellites

Staging Data → Distinct List Sat Rows → Lookup EACH Hub’s or Link’s Surrogate Keys → Find Latest Sat Row → All Columns Match?
  No  → Insert Into Satellite Target
  Yes → Drop Row From Feed

• Select a “Master” system, and a hierarchy of importance for sub-systems to annotate arrival location of data
• Purpose of the loading template: Gather descriptive data and compare it to the most recent copy of the information in the satellite; if there are any deltas, load, and if not, don’t load
• Use a distinct list of descriptive fields from the source systems
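The Satellite template differs from the other two in that it compares against the latest stored row rather than testing for existence. A minimal in-memory sketch follows, assuming at most one row per business key per batch; the function name `load_satellite`, the attribute names and the sample data are invented for illustration.

```python
def load_satellite(satellite, staging_rows, key_column, hub, load_dts, record_source):
    """Append a Satellite row only when the descriptive data has changed."""
    inserted = 0
    for row in staging_rows:
        parent_sqn = hub[row[key_column]]               # look up the parent's surrogate key
        attributes = {k: v for k, v in row.items() if k != key_column}
        history = satellite.setdefault(parent_sqn, [])  # rows kept in load order
        latest = history[-1]["attributes"] if history else None
        if latest == attributes:                        # all columns match -> drop row
            continue
        history.append(
            {"load_dts": load_dts, "record_source": record_source, "attributes": attributes}
        )
        inserted += 1
    return inserted

hub_customer = {"C-1001": 1}
sat_customer = {}
day1 = [{"customer_number": "C-1001", "name": "Acme", "addr1": "1 Main St"}]
day2 = [{"customer_number": "C-1001", "name": "Acme", "addr1": "1 Main St"}]  # no delta
day3 = [{"customer_number": "C-1001", "name": "Acme", "addr1": "9 Elm Ave"}]  # delta

counts = [
    load_satellite(sat_customer, day1, "customer_number", hub_customer, "2010-05-26", "CRM"),
    load_satellite(sat_customer, day2, "customer_number", hub_customer, "2010-05-27", "CRM"),
    load_satellite(sat_customer, day3, "customer_number", hub_customer, "2010-05-28", "CRM"),
]
```

Only the first and third loads add a row; existing rows are never updated, so the Satellite stays insert-only.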

How to build your Data Vault…
GETTING STARTED… HOW TO

Step 1: Establish Scope (Build Business Case Model)

Step 1: Define Business Keys

[Figure: Hub Invoice, Hub Campaign, Hub Customer, Hub Products.]

Step 2: Define Associations

[Figure: the four Hubs joined by Link Campaign by Invoice by Customer, Link Product on Campaign and Link Invoice Line Items.]

Step 3: Define Descriptive Data

[Figure: the Hubs and Links from Step 2 with Satellites attached: Sat Effectiveness Ratings, Sat Effectiveness Dates, Sat Dates and Amounts, Sat Address, Sat Details, Sat Contacts, Sat Availability Dates, Sat Descriptions, Sat Amounts, Sat Quantities, Sat Defect Reasons, Sat Stock Quantities.]

Step 4: Build Source Model (PK/FK)

(No Pictures, Sorry)
• Ensure the source model (DDL Only) has Primary and Foreign Keys defined
• Normalize the source model (if not normalized)
• Capture and integrate all source systems involved (if not already captured)
• Add Comments to the DDL (tables and fields)

Step 5: Build Cross-Reference

The purpose of such an exercise is not to identify all the elements, but specifically to identify the target Hubs (i.e., the business keys), target Links, and at LEAST a single Satellite for at least 1 source column…

The engine (SaaS) will automatically assign all other descriptive elements to the first Satellite identified.

SOURCE TABLE      SOURCE COLUMN     GROUP  TARGET TABLE          TARGET COLUMN
AHLTAT_DIAGNOSIS  DOC_REF           1      SAT_AHLTAT_DIAGNOSIS  DOC_REF
                  DATAID            1      HUB_DIAGNOSIS         DIAGNOSIS_DATAID
                  FACILITYNCID      1      HUB_FACILITY          FAC_ID
                  DIAGNOSISNCID     1      SAT_AHLTAT_DIAGNOSIS  DIAGNOSISNCID
                  ENCOUNTERNUMBER   1      HUB_EVENT             EVNT_ID
                  CLINICIANNCID     1      HUB_CLINICIAN         CLINICIAN_NCID
                  UNIT_NUMBER       1      HUB_UNIT              UNIT_ID
                  MEDCINID          1      HUB_MEDCIN            MEDCIN_ID
                  CREATETIME        1      SAT_AHLTAT_DIAGNOSIS  CREATETIME
                  CREATEUSERNCID    1      SAT_AHLTAT_DIAGNOSIS  CREATEUSERNCID
                  MODIFYUSERNCID    1      SAT_AHLTAT_DIAGNOSIS  MODIFYUSERNCID
                  MODIFYTIME        1      SAT_AHLTAT_DIAGNOSIS  MODIFYTIME
                  PRIORITY          1      SAT_AHLTAT_DIAGNOSIS  PRIORITY
                  DIAGNOSESCOMMENT  1      SAT_AHLTAT_DIAGNOSIS  DIAGNOSESCOMMENT
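As a rough illustration of how such a cross-reference might be consumed, the sketch below groups a subset of the rows above by target table and then sends any unmapped source column to the first Satellite identified. It imitates the behaviour described for the engine and is not the engine's actual logic; both function names are hypothetical.

```python
from collections import defaultdict

# (source_column, target_table, target_column): a subset of the rows above
CROSS_REF = [
    ("DOC_REF", "SAT_AHLTAT_DIAGNOSIS", "DOC_REF"),
    ("DATAID", "HUB_DIAGNOSIS", "DIAGNOSIS_DATAID"),
    ("FACILITYNCID", "HUB_FACILITY", "FAC_ID"),
    ("ENCOUNTERNUMBER", "HUB_EVENT", "EVNT_ID"),
    ("CREATETIME", "SAT_AHLTAT_DIAGNOSIS", "CREATETIME"),
    ("PRIORITY", "SAT_AHLTAT_DIAGNOSIS", "PRIORITY"),
]

def group_by_target(cross_ref):
    """Map each target table to its {target_column: source_column} assignments."""
    targets = defaultdict(dict)
    for source_column, target_table, target_column in cross_ref:
        targets[target_table][target_column] = source_column
    return dict(targets)

def assign_unmapped(targets, source_columns, cross_ref):
    """Send every source column missing from the cross-reference to the first Satellite."""
    mapped = {source for source, _table, _column in cross_ref}
    first_sat = next(table for _s, table, _c in cross_ref if table.startswith("SAT_"))
    for column in source_columns:
        if column not in mapped:
            targets[first_sat][column] = column
    return targets

# DIAGNOSESCOMMENT is deliberately left out of CROSS_REF to show the fallback.
source_columns = [source for source, _t, _c in CROSS_REF] + ["DIAGNOSESCOMMENT"]
targets = assign_unmapped(group_by_target(CROSS_REF), source_columns, CROSS_REF)
```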

Step 6: Generate Baseline ETL/ELT

[Figure: Cross-Ref Mapping XLS, Source DDL and Target DDL feed "Generate Code, Reports, Documentation", which produces Data Flows (Mappings / Graphs).]

What did we learn?
CONCLUSIONS / SUMMARY

Data Vault…

Modeling Is…
• Made up of Hubs, Links, and Satellites
• Easy to create and build
• Hardest thing is to “find/locate” and define the Business Keys
• Consistent, Scalable, Repeatable, Pattern Based
• RULES BASED / STANDARDS DRIVEN

Loading Is….
• Scalable, Fault-Tolerant, Parallelizable, Pattern Based
• Generatable
• Performance Based
• 100% Restartable
• Set Based
• Devoid of “Soft” Business Rules!!

Still - Lots To Learn…

We didn’t cover:
• Joins
• point-in-time tables
• building marts
• business logic components
• SQL extraction
• what to do when…
• dealing with bad data
• architecting security, managing governance, handling
• bridge tables

Contact me for Workshops (training), and Mentoring…

Questions?

Dan Linstedt
President, Empowered Holdings, LLC
http://EmpoweredHoldings.com
http://DanLinstedt.com
Tel: +1 802-524-8566
E-Mail: [email protected]

SERVICES:
• Consulting
• Assessments
• Product Selection Scorecards
• Architecture / Design
• Mentoring and Workshops (training)