Lecture @DHBW: Data Warehouse 04 Modeling Andreas Buckenhofer
A Daimler AG company

Andreas Buckenhofer
Senior DB Professional
Since 2009 at Daimler TSS
Department: Machine Learning Solutions
Business Unit: Analytics
Contact/Connect: xing, DOAG, DHBW, vcard

• Over 20 years of experience with database technologies
• Over 20 years of experience with Data Warehousing
• International project experience
• Oracle ACE Associate
• DOAG responsible for InMemory DB
• Lecturer at DHBW
• Certified Data Vault Practitioner 2.0
• Certified Oracle Professional
• Certified IBM Big Data Architect

Daimler TSS GmbH, Wilhelm-Runge-Straße 11, 89081 Ulm / Phone +49 731 505-06 / Fax +49 731 505-65 99
[email protected] / Internet: www.daimler-tss.com
Registered office and register court: Ulm / HRB no.: 3844 / Management: Martin Haselbach (Chairman), Steffen Bäuerle
© Daimler TSS

Change Log
Date        Changes
24.10.2019  Initial version

Daimler TSS Data Warehouse / DHBW

What you will learn today
• Understand data modeling (logical models) and physical implementation (physical models)
• Explain models for a DWH
  • Dimensional model
  • Data Vault model
• Understand dimensions and facts
• Understand ROLAP & MOLAP

Structuring Data
Source: https://www.linkedin.com/feed/update/urn:li:activity:6507927816082845696/

Conceptual – logical – physical level
Different levels of abstraction:
• Conceptual (domain) model
  • Focus on (main) entities and their business definitions
  • No attributes
• Logical data model
  • Relational data model (independent of a DBMS or technology)
  • The logical model does not address performance: no performance optimization at this level
• Physical implementation
  • Representation of a data design for a specific DBMS
  • RDBMS are the closest to physical independence

Physical implementation in OLTP applications
Requirements
• Efficient update and delete operations
• Efficient read operations
• Avoid contradictions in the data: don't store data twice or multiple times
• Easy maintenance of the data model
• As little redundancy as possible in the data model

Codd's normal forms for DB relations: 1NF
• First Normal Form (1NF):
  • A relation is in first normal form if
    • the domain of each attribute contains only atomic (simple, indivisible) values.
    • the value of any attribute in a tuple/row is a single value from the domain of that attribute, i.e. no attribute values can be sets

Codd's normal forms for DB relations: 1NF
Not in 1NF:
CD_ID  Album                            Founded  Titles
11     Anastacia – Not that kind        1999     1. Not that kind, 2. I'm outta love, 3. Cowboys & Kisses
12     Pink Floyd – Wish you were here  1964     1. Shine on you crazy diamond
13     Anastacia – Freak of Nature      1999     1. Paid my dues

In 1NF:
CD_ID  Album               Performer   Founded  Track  Title
11     Not that kind       Anastacia   1999     1      Not that kind
11     Not that kind       Anastacia   1999     2      I'm outta love
11     Not that kind       Anastacia   1999     3      Cowboys & Kisses
12     Wish you were here  Pink Floyd  1964     1      Shine on you crazy diamond
13     Freak of Nature     Anastacia   1999     1      Paid my dues

Codd's normal forms for DB relations: 2NF
• Second Normal Form (2NF):
  • In 1st normal form
  • Every non-key attribute is fully dependent on the key. There are no dependencies between a partial key and a non-key field.
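
The 2NF rule can be sketched in a few lines of Python, assuming the 1NF rows of the CD example with the composite key (CD_ID, Track). Attributes that depend on CD_ID alone move to a CD relation; only the title stays with the full key. The function name `to_2nf` and the tuple layout are illustrative assumptions, not part of the lecture.

```python
# 1NF rows from the CD example; the key is the pair (cd_id, track).
rows_1nf = [
    (11, "Not that kind", "Anastacia", 1999, 1, "Not that kind"),
    (11, "Not that kind", "Anastacia", 1999, 2, "I'm outta love"),
    (11, "Not that kind", "Anastacia", 1999, 3, "Cowboys & Kisses"),
    (12, "Wish you were here", "Pink Floyd", 1964, 1, "Shine on you crazy diamond"),
    (13, "Freak of Nature", "Anastacia", 1999, 1, "Paid my dues"),
]


def to_2nf(rows):
    """Split 1NF rows into a CD relation and a track relation (hypothetical helper)."""
    cds = {}     # cd_id -> (album, performer, founded): depends on cd_id only
    tracks = {}  # (cd_id, track) -> title: depends on the full key
    for cd_id, album, performer, founded, track, title in rows:
        cd_attrs = (album, performer, founded)
        # A partial-key dependency means every row of one CD carries the same values.
        if cds.setdefault(cd_id, cd_attrs) != cd_attrs:
            raise ValueError(f"inconsistent attributes for CD {cd_id}")
        tracks[(cd_id, track)] = title
    return cds, tracks


cds, tracks = to_2nf(rows_1nf)
```

After the split, album, performer and founding year are stored once per CD instead of once per track, which removes the redundancy that the partial dependency caused.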

Codd's normal forms for DB relations: 2NF
In 1NF:
CD_ID  Album               Performer   Founded  Track  Title
11     Not that kind       Anastacia   1999     1      Not that kind
11     Not that kind       Anastacia   1999     2      I'm outta love
11     Not that kind       Anastacia   1999     3      Cowboys & Kisses
12     Wish you were here  Pink Floyd  1964     1      Shine on you crazy diamond
13     Freak of Nature     Anastacia   1999     1      Paid my dues

In 2NF:
CD_ID  Album               Performer   Founded
11     Not that kind       Anastacia   1999
12     Wish you were here  Pink Floyd  1964
13     Freak of Nature     Anastacia   1999

CD_ID  Track  Title
11     1      Not that kind
11     2      I'm outta love
11     3      Cowboys & Kisses
12     1      Shine on you crazy diamond
13     1      Paid my dues

Codd's normal forms for DB relations: 3NF
• Third Normal Form (3NF):
  • In 2nd normal form
  • No functional dependencies between non-key fields: a non-key attribute depends on the primary key only

Codd's normal forms for DB relations: 3NF
In 2NF:
CD_ID  Album               Performer   Founded
11     Not that kind       Anastacia   1999
12     Wish you were here  Pink Floyd  1964
13     Freak of Nature     Anastacia   1999

CD_ID  Track  Title
11     1      Not that kind
11     2      I'm outta love
11     3      Cowboys & Kisses
12     1      Shine on you crazy diamond
13     1      Paid my dues

In 3NF (the track table is unchanged):
CD_ID  Album               Performer
11     Not that kind       Anastacia
12     Wish you were here  Pink Floyd
13     Freak of Nature     Anastacia

Performer   Founded
Anastacia   1999
Pink Floyd  1964

Codd's normal forms - Summary from 1NF to 3NF
Not normalized:
CD_ID  Album                            Founded  Titles
11     Anastacia – Not that kind        1999     1. Not that kind, 2. I'm outta love, 3. Cowboys & Kisses
12     Pink Floyd – Wish you were here  1964     1. Shine on you crazy diamond
13     Anastacia – Freak of Nature      1999     1. Paid my dues

In 3NF:
CD_ID  Album               Performer
11     Not that kind       Anastacia
12     Wish you were here  Pink Floyd
13     Freak of Nature     Anastacia

CD_ID  Track  Title
11     1      Not that kind
11     2      I'm outta love
11     3      Cowboys & Kisses
12     1      Shine on you crazy diamond
13     1      Paid my dues

Performer   Founded
Anastacia   1999
Pink Floyd  1964

Why database design?
"Data modeling is the process of learning about the data, and regardless of technology, this process must be performed for a successful application."
• Learn about the data and promote collective data understanding
• Derive security classification and measures
• Design for performance
• Accelerate development
• Improve software quality
• Reduce maintenance costs
• Generate code
• NoSQL schema-on-read: understand model versions after years
Source quote: Steve Hoberman: Data Modeling for MongoDB, Technics Publications 2014

Importance of a good database design

Measuring the quality of a data model: data model scorecard
Source: Steve Hoberman - Data Modeling Scorecard, Technics Publications 2015

Exercise: use OLTP physical implementation for DWH?
The diagram shows a typical OLTP data model
• Customers and products have unique ids and some descriptive attributes
• A customer can place an order on a specific date
• The order contains one or more products

Exercise: use OLTP physical implementation for DWH?
Now consider DWH requirements like non-volatile and time-variant data
• Customer Bush marries and takes her husband's last name
• Product number 5 gets a price increase
How would you solve such requirements in a data model for the Core Warehouse Layer?

Exercise: use OLTP physical implementation for DWH?
Possible solutions:
• Add timestamp column as part of the primary key
  • For all tables, not only for specific tables (e.g.
    product, customer)
  • Composite keys can become inefficient and impractical
• New tables with head and version data to avoid redundancy
  • Head table contains static data that does not change (e.g. customer id, birthdate)
  • Version table contains data that changes (e.g. last name, comments)
• Store every change in log tables
  • Querying tables can become difficult and slow if history is required ("main" table + log tables)

Bad physical implementations
Create a SQL statement for: How many "Order Transactions" have been created by "Person/Organisation"?
Source: Corr / Stagnitto: Agile Data Warehouse Design, DecisionOne Press, 2011, page 5

Disadvantages of a direct physical implementation of 3NF for DWH
• 3NF is inefficient for query processing
• 3NF models are difficult to understand
• 3NF gets even more complicated with history added
• 3NF is not well suited for "new" data sources (JSON, NoSQL, etc.)
• A DWH needs its own data modeling approaches for the Core Warehouse Layer and the Mart Layer

What are candidates for primary keys?
Primary keys: natural keys, sequences, hash keys

Natural keys
• "Intelligent" keys that have a meaning to the business user
• e.g. VIN (vehicle identifier)
• e.g. ISO country codes such as DE, US, GB

Generated keys
• System-generated, unique values, e.g. sequences (increments): 1, 2, 3, 4, 5, etc.
• GUIDs (globally unique identifiers) contain e.g. MAC address + timestamp to make an identifier unique

Hash keys
• (Composite) natural key run through a hash function, e.g. MD5(VIN)

Natural keys
Advantages
• Have a meaning: can be considered as master keys
• Same value across (OLTP) systems: valid across business processes
• Allow parallel loads in a DWH or Big Data system
Disadvantages
• Varying length (can be short or very long)
• Meaning can change over time, e.g. VIN standards changed
• Can be composite (several fields), which would make joins slower and more complex [concatenation would be possible]
• Often sequence-driven in OLTP systems (e.g. customer number; collisions possible when integration into DWH is done)

Generated keys
Advantages
• Small byte size (sequences): less storage and faster joins
• Always unique
• Good B*Tree index clustering
Disadvantages
• Insert performance can be slow (hot spot on index)
• No business meaning
• Data load into DWH causes lookups (Big Data systems often have no sequences but would fail performance-wise with *sequential* sequence generation)

Hash keys
Advantages / Disadvantages
• Advantage: Allow parallel loads in a DWH or Big Data system
• Disadvantage: Computed value can be longer compared to natural keys
• Advantage: Ability to join across platforms (e.g.