Lecture @DHBW: Data Warehouse Review Questions
Andreas Buckenhofer, Daimler TSS

A company of Daimler AG

ABOUT ME
Andreas Buckenhofer
Senior DB Professional
Since 2009 at Daimler TSS
Department: Big Data
Business Unit: Analytics
[email protected]
https://de.linkedin.com/in/buckenhofer
https://twitter.com/ABuckenhofer
https://www.doag.org/de/themen/datenbank/in-memory/
http://wwwlehre.dhbw-stuttgart.de/~buckenhofer/
https://www.xing.com/profile/Andreas_Buckenhofer2

NOT JUST AVERAGE: OUTSTANDING.
As a 100% Daimler subsidiary, we give 100 percent, always and never less. We love IT and pull out all the stops to aid Daimler's development with our expertise on its journey into the future.
Our objective: We make Daimler the most innovative and digital mobility company.

INTERNAL IT PARTNER FOR DAIMLER
+ Holistic solutions according to the Daimler guidelines
+ IT strategy
+ Security
+ Architecture
+ Developing and securing know-how
+ TSS is a partner who can be trusted with sensitive data
As subsidiary: maximum added value for Daimler
+ Market closeness
+ Independence
+ Flexibility (short decision making process, ability to react quickly)

LOCATIONS
Daimler TSS Germany: 7 locations, 1000 employees*; Ulm (Headquarters), Stuttgart, Berlin, Karlsruhe
Daimler TSS China: Hub Beijing, 10 employees
Daimler TSS Malaysia: Hub Kuala Lumpur, 42 employees
Daimler TSS India: Hub Bangalore, 22 employees
* as of August 2017

Daimler TSS Data Warehouse / DHBW

WHICH CHALLENGES COULD NOT BE SOLVED BY OLTP?
WHY IS A DWH NECESSARY?
• Distributed data: data is spread across applications
• Different data structures: each system has its own data model
• Missing integrated view: data is not harmonized / standardized
• Missing historic data: OLTP normally stores current data only
• Technological challenges: OLTP has different infrastructure requirements
• System workload: additional workload from DWH users most likely decreases performance for OLTP users

EXPLAIN TWO DEFINITIONS OF THE DWH
Ralph Kimball:
“A data warehouse is a copy of transaction data specifically structured for querying and reporting”
William Harvey “Bill” Inmon:
“A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management’s decision-making process”

WHICH CHARACTERISTICS DOES A DWH HAVE ACCORDING TO BILL INMON?
• Subject-oriented
  • A DWH is organized around the major themes, not around processes
• Integrated
  • Data in the DWH is harmonized and uniformly standardized across all sources
• Non-volatile
  • Operations in a DWH are insert and select (updates and deletes for technical reasons only)
• Time-variant
  • All data in the data warehouse is accurate as of some moment in time and has to be associated with a time stamp

WHICH LAYERS DOES THE LOGICAL STANDARD ARCHITECTURE HAVE?
• Staging (Input)
• Integration (Cleansing)
• Core Warehouse (Storage)
• Aggregation
• Mart (Reporting, Output)
• and additionally Metadata, Security, DWH Manager, Monitor

LOGICAL STANDARD DATA WAREHOUSE ARCHITECTURE
[Figure: internal data sources (OLTP) and external data sources feed the Data Warehouse (Backend and Frontend), which consists of Staging Layer, Integration Layer, Core Warehouse Layer, Aggregation Layer and Mart Layer, supported by Metadata Management, Security and DWH Manager incl. Monitor]

DESCRIBE STAGING LAYER CHARACTERISTICS
• “Landing Zone” for data coming into a DWH
• Purpose is to increase the speed of loading into the DWH and to decouple source and target system (repeating extraction run, additional delivery)
• Granular data (no pre-aggregation or filtering in the Data Source Layer, i.e. the source system)
• Usually not persistent, therefore regular housekeeping is necessary (for instance, delete data in this layer that is a few days/weeks old or, more commonly, once a correct upload to the Core Warehouse Layer is ensured)
• Tables have no referential integrity constraints, columns are often varchar only

DESCRIBE CORE WAREHOUSE LAYER CHARACTERISTICS
• Data storage in an integrated, consolidated, consistent and non-redundant (normalized) data model
• Contains enterprise-wide data organized around multiple subject areas
• Application / Reporting neutral data storage on the most detailed level of granularity (incl. historic data)
• Size of the database can be several TB and can grow rapidly due to data historization

DESCRIBE MART LAYER CHARACTERISTICS
• Data is stored in a denormalized data model for performance reasons and better end user usability/understanding
• The Data Mart Layer typically provides aggregated data or data with less history (e.g. latest years only) in a denormalized data model
• Created through filtering or aggregating the Core Warehouse Layer
• One Mart ideally represents one subject area
• Technically the Data Mart Layer can also be a part of an Analytical Frontend product (such as Qlik, Tableau, or IBM Cognos TM1) and need not be stored in a relational database

KIMBALL BUS ARCHITECTURE
WHAT ARE MAIN CHARACTERISTICS?
• 2-layered architecture
• The sum of the data marts constitutes the Enterprise DWH
• Enterprise Service Bus / conformed dimensions for integration purposes (don’t confuse with ESB as middleware/communication system between applications)
• Rather simple approach to make data fast and easily accessible
• Lower startup costs (but higher subsequent development costs)
• If table structures change (unstable source systems), high effort to implement the changes and reload data, especially conformed dimensions (“Dimensionitis” disease)

DATA INTEGRATION WITH AND WITHOUT CORE WAREHOUSE LAYER
[Figure: data integration with and without a Core Warehouse Layer]

DATA VAULT 2.0 ARCHITECTURE (DAN LINSTEDT)
WHAT ARE MAIN CHARACTERISTICS?
• Data in Raw Data Vault Layer is regarded as “Single version of the facts”
• Business rules are implemented down-stream
• Core Warehouse Layer is modeled with Data Vault and integrates data by BK (business key) “only”
• Real-Time capability
• Write-back of data in Core Warehouse Layer

WHICH TABLE TYPES ARE USED IN DATA VAULT?

HOW MANY ROWS ARE STORED IN THE HUB AND LINK TABLES?
Staging data in table stg_vehicle from 15.01.2015
vehicleid  model   productiondate  engine  color
V1         SUV     15.01.13        E1      red
V2         Cabrio  16.01.13        E2      blue
V1         SUV     15.01.13        E1      red
V3         Cabrio  17.01.13        E3      red

Staging data in table stg_vehicle from 16.01.2015
vehicleid  model   productiondate  engine  color
V1         SUV     16.01.13        E4      red
V4         Cabrio  17.01.13        E5      blue

Staging data in table stg_vehicle from 17.01.2015
vehicleid  model   productiondate  engine  color
V1         SUV     16.01.13        E1      red

• H_VEHICLE
  • 4 rows: V1, V2, V3, V4
• H_ENGINE
  • 5 rows: E1, E2, E3, E4, E5
• L_PLUGGED_IN_EFFECTIVITY
  • 5 rows: V1-E1, V2-E2, V3-E3, V1-E4, V4-E5

HOW MANY ROWS ARE STORED IN THE FIRST 3 SAT TABLES?
Staging data in table stg_vehicle from 15.01.2015
vehicleid  model   productiondate  engine  color
V1         SUV     15.01.13        E1      red
V2         Cabrio  16.01.13        E2      blue
V1         SUV     15.01.13        E1      red
V3         Cabrio  17.01.13        E3      red

Staging data in table stg_vehicle from 16.01.2015
vehicleid  model   productiondate  engine  color
V1         SUV     16.01.13        E4      red
V4         Cabrio  17.01.13        E5      blue

Staging data in table stg_vehicle from 17.01.2015
vehicleid  model   productiondate  engine  color
V1         SUV     16.01.13        E1      red

[Figure: answer slide annotating the satellite tables with the row counts 5, 4, 4, 6, 5 and 5]

WHICH TABLE TYPES ARE USED IN A STAR SCHEMA?

WHICH THREE TYPES OF SLOWLY CHANGING DIMENSIONS ARE MOST COMMON?
SCD Type 1
• No history
• Dimension attributes always contain current data
SCD Type 2
• Full historization
• Dimension contains timestamps
SCD Type 3
• Historization of latest change only
• And storage of current value

WHAT IS THE SIZE OF THE CUBE IN TERMS OF NUMBER OF CELLS?
For a MOLAP model with 4 dimensions having 10, 20, 50 and 100 elements and 50 000 facts, what is the size of the cube in terms of number of cells?
• 1 000 000 (= 10 * 20 * 50 * 100)

MOLAP OR ROLAP
Which type of OLAP system would you recommend
• for getting fast response times for smaller data sizes: MOLAP
• for achieving an optimal storage utilization: ROLAP
• for an enterprise cube of 10 TB: ROLAP

WHICH ROLAP ENHANCEMENTS EXIST?
• Precomputation of aggregated values
• Materialized views / query tables store data physically
• Relational columnar (in-memory) databases

WHICH MONITORING TECHNIQUES EXIST TO DETECT CHANGES IN THE SOURCE?
                                         Trigger-based  Replication  Log-based  Timestamp-based  Snapshot-based
                                         discovery      techniques   discovery  discovery        discovery
Performance impact on source system      Medium         Low          Low        Medium           High
Performance impact on target system      Low            Low          Low        Low              High
Load on network                          Low            Low          Low        Low              High
Data loss if nologging operations        No             Yes          Yes        No               No
Identify DELETE operations               Yes            Yes          Yes        No               Yes
Identify ALL changes (changes between
extractions)                             Yes            Yes          Yes        No               No

WHAT ARE TYPICAL DATA QUALITY PROBLEMS AND POSSIBLE SOLUTIONS?
Issue                                    Solution
Wrong data e.g.

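The hub and link row counts from the Data Vault exercise (three stg_vehicle loads from 15.01.2015 to 17.01.2015) can be cross-checked with a short script. This is only a sketch under assumptions: the variable names are hypothetical, the hubs and link are modelled as plain Python sets, and only the business keys (vehicleid, engine) from the staging rows are used. It is not the lecture's loading procedure.

```python
# Business keys (vehicleid, engine) taken from the three staging loads.
# The dictionary layout is an assumption made for illustration only.
loads = {
    "15.01.2015": [("V1", "E1"), ("V2", "E2"), ("V1", "E1"), ("V3", "E3")],
    "16.01.2015": [("V1", "E4"), ("V4", "E5")],
    "17.01.2015": [("V1", "E1")],
}

h_vehicle = set()     # one row per distinct vehicle business key
h_engine = set()      # one row per distinct engine business key
l_plugged_in = set()  # one row per distinct (vehicle, engine) combination

for load_date, rows in loads.items():
    for vehicle_id, engine_id in rows:
        # Insert-only behaviour: a key that already exists adds no new row.
        h_vehicle.add(vehicle_id)
        h_engine.add(engine_id)
        l_plugged_in.add((vehicle_id, engine_id))

print("H_VEHICLE:", len(h_vehicle), sorted(h_vehicle))
print("H_ENGINE:", len(h_engine), sorted(h_engine))
print("L_PLUGGED_IN_EFFECTIVITY:", len(l_plugged_in), sorted(l_plugged_in))
```

The script reports 4 vehicle keys, 5 engine keys and 5 vehicle-engine combinations, matching the answer slide. The duplicate V1 row in the first load and the reappearance of V1-E1 in the third load add no rows, because hubs and links only store each key or key combination once.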