
A company of Daimler AG

LECTURE @DHBW: REVIEW QUESTIONS
ANDREAS BUCKENHOFER, DAIMLER TSS

ABOUT ME

Andreas Buckenhofer
Senior DB Professional
Since 2009 at Daimler TSS
Department: Big Data
Business Unit: Analytics

[email protected]
https://de.linkedin.com/in/buckenhofer
https://twitter.com/ABuckenhofer
https://www.doag.org/de/themen/datenbank/in-memory/
http://wwwlehre.dhbw-stuttgart.de/~buckenhofer/
https://www.xing.com/profile/Andreas_Buckenhofer2

NOT JUST AVERAGE: OUTSTANDING.

As a 100% Daimler subsidiary, we give 100 percent, always and never less. We love IT and pull out all the stops to aid Daimler's development with our expertise on its journey into the future.

Our objective: We make Daimler the most innovative and digital mobility company.

Daimler TSS INTERNAL IT PARTNER FOR DAIMLER

+ Holistic solutions according to the Daimler guidelines
+ IT strategy
+ Security
+ Architecture
+ Developing and securing know-how
+ TSS is a partner who can be trusted with sensitive data

As a subsidiary: maximum added value for Daimler
+ Market closeness
+ Independence
+ Flexibility (short decision-making process, ability to react quickly)

Daimler TSS 4 LOCATIONS

Daimler TSS Germany: 7 locations, 1000 employees*
  Ulm (Headquarters), Stuttgart, Berlin, Karlsruhe
Daimler TSS China: Hub Beijing, 10 employees
Daimler TSS Malaysia: Hub Kuala Lumpur, 42 employees
Daimler TSS India: Hub Bangalore, 22 employees
* as of August 2017

Daimler TSS Data Warehouse / DHBW

WHICH CHALLENGES COULD NOT BE SOLVED BY OLTP? WHY IS A DWH NECESSARY?

• Distributed data: data is spread across applications
• Different data structures: each system has its own
• Missing integration: data is not harmonized / standardized
• Missing historic data: OLTP normally stores current data only
• Technological challenges: OLTP has different infrastructure requirements
• System workload: additional workload from DWH users most likely decreases performance for OLTP users

EXPLAIN TWO DEFINITIONS OF THE DWH

Ralph Kimball:
“A data warehouse is a copy of transaction data specifically structured for querying and reporting”

William Harvey “Bill” Inmon:
“A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management’s decision-making process”

WHICH CHARACTERISTICS DOES A DWH HAVE ACCORDING TO INMON?

• Subject-oriented
  • A DWH is organized around the major themes, not around processes
• Integrated
  • Data in the DWH is harmonized and uniformly standardized across all sources
• Non-volatile
  • Operations in a DWH are insert and select (updates and deletes for technical reasons only)
• Time-variant
  • All data in the data warehouse is accurate as of some moment in time and has to be associated with a time stamp

WHICH LAYERS DOES THE LOGICAL STANDARD ARCHITECTURE HAVE?

• Staging (Input)
• Integration (Cleansing)
• Core Warehouse (Storage)
• Aggregation
• Mart (Reporting, Output)
• and additionally Metadata Management, Security, DWH Manager, Monitor

LOGICAL STANDARD DATA WAREHOUSE ARCHITECTURE

[Figure: internal data sources (OLTP) and external data sources feed the Data Warehouse, which is split into Backend and Frontend. Layers in order: Staging Layer (Input Layer), Integration Layer (Cleansing Layer), Core Warehouse Layer (Storage Layer), Aggregation Layer, Mart Layer (Reporting / Output Layer). Additional components: Metadata Management, Security, DWH Manager incl. Monitor.]

DESCRIBE STAGING LAYER CHARACTERISTICS

• “Landing Zone” for data coming into a DWH

• Purpose is to speed up loading into the DWH and to decouple source and target system (e.g. for a repeated extraction run or an additional delivery)

• Granular data (no pre-aggregation or filtering in the Data Source Layer, i.e. the source system)

• Usually not persistent, therefore regular housekeeping is necessary (for instance, delete data in this layer that is a few days/weeks old or, more commonly, once a correct upload to the Core Warehouse Layer is ensured)

• Tables have no constraints, columns often varchar
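
The staging characteristics above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration only: the table layouts, the `to_iso` helper and the sample rows are assumptions made for the example, not part of the lecture. The staging table accepts every delivered value as text, validation happens on the way to a constrained target table, and the staging table is emptied after the upload.

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")

# Staging table (hypothetical layout): no constraints, every column is text.
conn.execute(
    "CREATE TABLE stg_vehicle (vehicleid TEXT, model TEXT, productiondate TEXT)"
)
# Target table (hypothetical, heavily simplified): keyed and constrained.
conn.execute(
    "CREATE TABLE core_vehicle ("
    "vehicleid TEXT PRIMARY KEY, model TEXT NOT NULL, productiondate TEXT NOT NULL)"
)

# The landing zone accepts whatever the source delivers, including bad values.
conn.executemany(
    "INSERT INTO stg_vehicle VALUES (?, ?, ?)",
    [
        ("V1", "SUV", "15.01.13"),
        ("V2", "Cabrio", "31.02.13"),  # invalid date still lands in staging
    ],
)


def to_iso(value):
    """Return the date in ISO format, or None if it is not a valid DD.MM.YY date."""
    try:
        return datetime.strptime(value, "%d.%m.%y").date().isoformat()
    except ValueError:
        return None


rejected = []
staged = conn.execute(
    "SELECT vehicleid, model, productiondate FROM stg_vehicle"
).fetchall()
for vehicleid, model, productiondate in staged:
    iso = to_iso(productiondate)
    if iso is None:
        rejected.append(vehicleid)
    else:
        conn.execute(
            "INSERT INTO core_vehicle VALUES (?, ?, ?)", (vehicleid, model, iso)
        )

# Housekeeping: empty the staging table once the upload has been processed.
conn.execute("DELETE FROM stg_vehicle")
staged_rows = conn.execute("SELECT COUNT(*) FROM stg_vehicle").fetchone()[0]
core_rows = conn.execute(
    "SELECT vehicleid, productiondate FROM core_vehicle"
).fetchall()
```

In a real DWH the rejected rows would typically be logged or routed to an error table rather than kept in a Python list.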

DESCRIBE CORE WAREHOUSE LAYER CHARACTERISTICS

• Data storage in an integrated, consolidated, consistent and non-redundant (normalized) data model

• Contains enterprise-wide data organized around multiple subject-areas

• Application / Reporting neutral data storage on the most detailed level of granularity (incl. historic data)

• Size can be several TB and can grow rapidly due to data historization

DESCRIBE MART LAYER CHARACTERISTICS

• Data is stored in a denormalized data model for performance reasons and better end user usability/understanding

• The Data Mart Layer typically provides aggregated data or data with less history (e.g. latest years only) in a denormalized data model

• Created through filtering or aggregating the Core Warehouse Layer

• One Mart ideally represents one subject area

• Technically, the Data Mart Layer can also be part of an Analytical Frontend product (such as Qlik, Tableau, or IBM Cognos TM1) and need not be stored in a relational database

KIMBALL BUS ARCHITECTURE
WHAT ARE MAIN CHARACTERISTICS?

• 2-layered architecture
• Sum of the data marts constitutes the Enterprise DWH
• Enterprise Service Bus / conformed dimensions for integration purposes (don’t confuse with ESB as middleware/communication system between applications)
• Rather simple approach to make data fast and easily accessible
• Lower startup costs (but higher subsequent development costs)
• If structures change (unstable source systems), high effort to implement the changes and reload data, especially conformed dimensions (“Dimensionitis” disease)

DATA INTEGRATION WITH AND WITHOUT CORE WAREHOUSE LAYER

[Figure: data integration with and without a Core Warehouse Layer]

DATA VAULT 2.0 ARCHITECTURE (DAN LINSTEDT)
WHAT ARE MAIN CHARACTERISTICS?

• Data in Raw Data Vault Layer is regarded as “Single version of the facts”
• Business rules are implemented down-stream
• Core Warehouse Layer is modeled with Data Vault and integrates data by BK (business key) “only”
• Real-Time capability
• Write-back of data in Core Warehouse Layer

WHICH TABLE TYPES ARE USED IN DATA VAULT?

HOW MANY ROWS ARE STORED IN THE HUB AND LINK TABLES?

Staging data in table stg_vehicle from 15.01.2015

vehicleid  model   productiondate  engine  color
V1         SUV     15.01.13        E1      red
V2         Cabrio  16.01.13        E2      blue
V1         SUV     15.01.13        E1      red
V3         Cabrio  17.01.13        E3      red

Staging data in table stg_vehicle from 16.01.2015

vehicleid  model   productiondate  engine  color
V1         SUV     16.01.13        E4      red
V4         Cabrio  17.01.13        E5      blue

Staging data in table stg_vehicle from 17.01.2015

vehicleid  model   productiondate  engine  color
V1         SUV     16.01.13        E1      red

HOW MANY ROWS ARE STORED IN THE HUB AND LINK TABLES?

• H_VEHICLE
  • 4 rows: V1, V2, V3, V4
• H_ENGINE
  • 5 rows: E1, E2, E3, E4, E5
• L_PLUGGED_IN_EFFECTIVITY
  • 5 rows: V1-E1, V2-E2, V3-E3, V1-E4, V4-E5
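
The row counts above can be reproduced with a short Python sketch. It assumes the simplest possible loading rule, namely that a hub or link row is inserted only if its business key (or key combination) is not stored yet. The list-based "tables" and the helper name `insert_if_new` are illustrative only.

```python
# The three stg_vehicle deliveries from the question, reduced to
# (vehicleid, engine) pairs because only the business keys matter here.
deliveries = [
    [("V1", "E1"), ("V2", "E2"), ("V1", "E1"), ("V3", "E3")],  # 15.01.2015
    [("V1", "E4"), ("V4", "E5")],                              # 16.01.2015
    [("V1", "E1")],                                            # 17.01.2015
]

h_vehicle, h_engine, l_plugged_in = [], [], []


def insert_if_new(table, key):
    """Store each business key (or key combination) exactly once."""
    if key not in table:
        table.append(key)


for delivery in deliveries:
    for vehicle, engine in delivery:
        insert_if_new(h_vehicle, vehicle)
        insert_if_new(h_engine, engine)
        insert_if_new(l_plugged_in, (vehicle, engine))

print(len(h_vehicle), len(h_engine), len(l_plugged_in))  # prints "4 5 5"
```

The delivery of 17.01.2015 adds no link row because the combination V1-E1 already exists from the first delivery.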

HOW MANY ROWS ARE STORED IN THE FIRST 3 SAT TABLES?

Staging data in table stg_vehicle from 15.01.2015

vehicleid  model   productiondate  engine  color
V1         SUV     15.01.13        E1      red
V2         Cabrio  16.01.13        E2      blue
V1         SUV     15.01.13        E1      red
V3         Cabrio  17.01.13        E3      red

Staging data in table stg_vehicle from 16.01.2015

vehicleid  model   productiondate  engine  color
V1         SUV     16.01.13        E4      red
V4         Cabrio  17.01.13        E5      blue

Staging data in table stg_vehicle from 17.01.2015

vehicleid  model   productiondate  engine  color
V1         SUV     16.01.13        E1      red

[Figure: staging data annotated with satellite row counts 5, 4, 4, 6, 5, 5]

WHICH TABLE TYPES ARE USED IN A ?

WHICH THREE TYPES OF SLOWLY CHANGING DIMENSIONS ARE MOST COMMON?

SCD Type 1
• No history
• Dimension attributes always contain current data
SCD Type 2
• Full historization
• Dimension contains timestamps
SCD Type 3
• Historization of latest change only
• And storage of current value
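
The difference between the three types is easiest to see when the same two changes are applied to each of them. The following Python sketch assumes a hypothetical customer dimension with a single `city` attribute. The column names (`valid_from`, `valid_to`, `previous_city`) are illustrative choices, not prescribed by the lecture.

```python
from datetime import date


def scd1_update(row, new_city):
    """Type 1: overwrite in place, no history is kept."""
    row["city"] = new_city


def scd2_update(rows, key, new_city, change_date):
    """Type 2: close the current version and append a new one."""
    for row in rows:
        if row["key"] == key and row["valid_to"] is None:
            row["valid_to"] = change_date
    rows.append(
        {"key": key, "city": new_city, "valid_from": change_date, "valid_to": None}
    )


def scd3_update(row, new_city):
    """Type 3: keep only the previous value next to the current one."""
    row["previous_city"] = row["city"]
    row["city"] = new_city


type1 = {"key": 1, "city": "Ulm"}
type2 = [{"key": 1, "city": "Ulm", "valid_from": date(2015, 1, 1), "valid_to": None}]
type3 = {"key": 1, "city": "Ulm", "previous_city": None}

# Two successive changes of the same attribute.
for city, day in [("Stuttgart", date(2016, 1, 1)), ("Berlin", date(2017, 1, 1))]:
    scd1_update(type1, city)
    scd2_update(type2, 1, city, day)
    scd3_update(type3, city)
```

After both changes, Type 1 only knows "Berlin", Type 2 holds three versions (Ulm, Stuttgart, Berlin) with their validity periods, and Type 3 holds "Berlin" plus the previous value "Stuttgart"; "Ulm" is lost.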

WHAT IS THE SIZE OF THE CUBE IN TERMS OF NUMBER OF CELLS?

For a MOLAP model with 4 dimensions having 10, 20, 50 and 100 elements and 50 000 facts, what is the size of the cube in terms of number of cells?

• 1 000 000 (= 10 * 20 * 50 * 100)
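
The multiplication can be checked in a few lines of Python. The density figure is an addition for illustration: it relates the 50 000 facts to the number of cells and assumes that each fact occupies exactly one cell.

```python
from math import prod

dimension_sizes = [10, 20, 50, 100]
facts = 50_000

# Number of cells = product of the dimension sizes.
cells = prod(dimension_sizes)

# Share of cells that actually hold a fact (assuming one fact per cell).
density = facts / cells

print(cells, density)
```

Under that assumption only 5 % of the cells are filled, which is why storage utilization is a concern for MOLAP cubes.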

MOLAP OR ROLAP

Which type of OLAP system would you recommend
• for getting fast response times for smaller data sizes: MOLAP
• for achieving an optimal storage utilization: ROLAP
• for an enterprise cube of 10 TB: ROLAP

WHICH ROLAP ENHANCEMENTS EXIST?

• Precomputation of aggregated values
• Materialized views / query tables store data physically
• Relational Columnar (in-memory)
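
Precomputation can be sketched with Python's sqlite3 module. SQLite has no materialized views, so a summary table created with CREATE TABLE ... AS SELECT stands in for one here. The table and column names (`fact_sales`, `agg_sales_by_model`) and the sample rows are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (model TEXT, color TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [
        ("SUV", "red", 100),
        ("SUV", "blue", 150),
        ("Cabrio", "red", 200),
        ("SUV", "red", 50),
    ],
)

# Stand-in for a materialized view: the aggregate is computed once
# and stored physically in its own table.
conn.execute(
    """
    CREATE TABLE agg_sales_by_model AS
    SELECT model, SUM(amount) AS total_amount, COUNT(*) AS fact_rows
    FROM fact_sales
    GROUP BY model
    """
)

# Queries on the model level now read the small summary table
# instead of scanning the fact table.
summary = conn.execute(
    "SELECT model, total_amount, fact_rows FROM agg_sales_by_model ORDER BY model"
).fetchall()
```

Unlike a real materialized view, this summary table is not refreshed automatically; it would have to be rebuilt or maintained after each load.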

WHICH MONITORING TECHNIQUES EXIST TO DETECT CHANGES IN THE SOURCE?

Criterion                                           Trigger-based  Replication  Log-based  Timestamp-based  Snapshot-based
                                                                   techniques   discovery  discovery        discovery
Performance impact on source system                 Medium         Low          Low        Medium           High
Performance impact on target system                 Low            Low          Low        Low              High
Load on network                                     Low            Low          Low        Low              High
Data loss if nologging operations                   No             Yes          Yes        No               No
Identify DELETE operations                          Yes            Yes          Yes        No               Yes
Identify ALL changes (changes between extractions)  Yes            Yes          Yes        No               No
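
The last two table rows can be illustrated for the timestamp-based and the snapshot-based technique. The Python sketch below uses hypothetical data: between two extractions, V1 is repainted twice (red to green at time 2, green to blue at time 3), V2 is deleted and V3 is inserted. Function names and the integer timestamps are assumptions for the example.

```python
def snapshot_diff(previous, current):
    """Snapshot-based: compare two full extracts keyed by primary key."""
    inserted = sorted(k for k in current if k not in previous)
    deleted = sorted(k for k in previous if k not in current)
    updated = sorted(
        k for k in current if k in previous and current[k] != previous[k]
    )
    return inserted, updated, deleted


def timestamp_changes(current, last_extraction):
    """Timestamp-based: pick rows modified after the previous extraction."""
    return sorted(
        k for k, row in current.items() if row["last_modified"] > last_extraction
    )


# Source table at extraction time 1.
previous = {
    "V1": {"color": "red", "last_modified": 1},
    "V2": {"color": "blue", "last_modified": 1},
}
# Source table at the next extraction; the intermediate state
# "V1 is green" (time 2) is no longer visible in the source.
current = {
    "V1": {"color": "blue", "last_modified": 3},
    "V3": {"color": "red", "last_modified": 2},
}

inserted, updated, deleted = snapshot_diff(previous, current)
changed_by_timestamp = timestamp_changes(current, last_extraction=1)
```

The snapshot comparison finds the deletion of V2, while the timestamp query cannot, because a deleted row has no timestamp left to query. Neither technique sees the intermediate green state of V1, matching the "No" entries in the last table row.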


WHAT ARE TYPICAL DATA QUALITY PROBLEMS AND POSSIBLE SOLUTIONS?

Issue                                   Solution
Wrong data, e.g. 31.02.2016             Proper data type definition
Wrong values, e.g. number out of range  Check constraint
Missing values                          NOT NULL constraint
Violated references                     FOREIGN KEY constraint
Duplicates                              PRIMARY or UNIQUE KEY constraint
Inconsistent data                       ACID transactions, business logic, additional checks
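
Most of the constraint-based solutions can be tried out with Python's sqlite3 module. The schema below (tables `engine` and `vehicle`, the `vin` and `doors` columns) is made up for the example. The date case is left out because SQLite does not enforce a DATE type, and note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.execute("CREATE TABLE engine (engineid TEXT PRIMARY KEY)")
conn.execute(
    """
    CREATE TABLE vehicle (
        vehicleid TEXT PRIMARY KEY,
        vin       TEXT UNIQUE,
        model     TEXT NOT NULL,
        doors     INTEGER CHECK (doors BETWEEN 2 AND 5),
        engineid  TEXT REFERENCES engine (engineid)
    )
    """
)
conn.execute("INSERT INTO engine VALUES ('E1')")
conn.execute("INSERT INTO vehicle VALUES ('V1', 'VIN-1', 'SUV', 4, 'E1')")

# One deliberately bad row per data quality issue.
bad_rows = {
    "wrong value": ("V2", "VIN-2", "SUV", 9, "E1"),          # CHECK
    "missing value": ("V3", "VIN-3", None, 4, "E1"),         # NOT NULL
    "violated reference": ("V4", "VIN-4", "SUV", 4, "E9"),   # FOREIGN KEY
    "duplicate primary key": ("V1", "VIN-5", "SUV", 4, "E1"),
    "duplicate unique key": ("V5", "VIN-1", "SUV", 4, "E1"),
}

rejected = []
for issue, row in bad_rows.items():
    try:
        conn.execute("INSERT INTO vehicle VALUES (?, ?, ?, ?, ?)", row)
    except sqlite3.IntegrityError:
        rejected.append(issue)

vehicle_count = conn.execute("SELECT COUNT(*) FROM vehicle").fetchone()[0]
```

All five bad rows are rejected by the database itself, so only the one valid vehicle remains in the table.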

WHAT DOES BITEMPORAL DATA MEAN?

Valid time is the time period during which a fact is true in the real world. Transaction time is the time period during which a fact stored in the database was known. Bitemporal data combines both Valid and Transaction Time.

Source: (Wikipedia, https://en.wikipedia.org/wiki/Temporal_database)

Supported in the SQL standard (since SQL:2011)
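
A small Python sketch can make the two time axes concrete. The example data is invented: on 2015-01-15 the database records that a vehicle has been red since 2013-01-15, and on 2015-03-01 a correction is recorded stating that the vehicle was repainted blue on 2014-06-01. Column names and the half-open period convention are assumptions for the illustration.

```python
from datetime import date

FOREVER = date.max

# Each row carries a valid-time period and a transaction-time period,
# both half-open: from <= t < to.
rows = [
    # Recorded on 2015-01-15: red from 2013-01-15 onwards.
    {"color": "red", "valid_from": date(2013, 1, 15), "valid_to": FOREVER,
     "tx_from": date(2015, 1, 15), "tx_to": date(2015, 3, 1)},
    # Correction recorded on 2015-03-01: red only until 2014-06-01 ...
    {"color": "red", "valid_from": date(2013, 1, 15), "valid_to": date(2014, 6, 1),
     "tx_from": date(2015, 3, 1), "tx_to": FOREVER},
    # ... and blue from 2014-06-01 onwards.
    {"color": "blue", "valid_from": date(2014, 6, 1), "valid_to": FOREVER,
     "tx_from": date(2015, 3, 1), "tx_to": FOREVER},
]


def color_as_of(valid_on, known_on):
    """Color that applied on valid_on, according to what was known on known_on."""
    for row in rows:
        if (row["valid_from"] <= valid_on < row["valid_to"]
                and row["tx_from"] <= known_on < row["tx_to"]):
            return row["color"]
    return None


before_correction = color_as_of(date(2014, 12, 1), known_on=date(2015, 2, 1))
after_correction = color_as_of(date(2014, 12, 1), known_on=date(2015, 4, 1))
```

The same real-world date (2014-12-01) yields "red" when asked with the knowledge of February 2015 and "blue" when asked with the knowledge of April 2015; that is the combination of valid time and transaction time.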

WHICH PERFORMANCE OPTIMIZING TECHNIQUES FOR A DWH EXIST?

• Indexing
• Partitioning
• Parallelism
• Compression
• Relational In-memory Columnar DB
• Materialized Views

WHICH TWO APPROACHES ARE KNOWN TO BUILD A DWH?

Top-Down (Inmon)
• Comprehensive approach regarding available data
• Design Core Warehouse Layer = integrated data model first considering all requirements
• Design data marts afterwards

Bottom-Up (Kimball)
• Approach focusing on fast delivery of first results
• Design one data mart first
• Next Marts are modeled afterwards usually using Kimball architecture
• Conformed dimensions to integrate different data marts / fact tables

WHAT ARE CHALLENGES FOR OPERATIONAL DWHS?

• Shorter and more frequent ETL cycles (near real-time ETL)
• Higher requirements for system availability and performance
• Mixed OLAP and OLTP type workload
• Data quality mandatory as data is used for automated decisions

THANK YOU

Daimler TSS GmbH Wilhelm-Runge-Straße 11, 89081 Ulm / Telefon +49 731 505-06 / Fax +49 731 505-65 99 [email protected] / Internet: www.daimler-tss.com/ Intranet-Portal-Code: @TSS Domicile and Court of Registry: Ulm / HRB-Nr.: 3844 / Management: Christoph Röger (CEO), Steffen Bäuerle

WHAT ARE TYPICAL DWH ANALYSIS AND DESIGN WORK PRODUCTS?

• Example data table
• Hierarchy chart
• Timeline diagrams
• Event matrix
• Enhanced Star schema

WHICH OPERATION REQUIRES A HIERARCHY?

• Selection
• Slicing
• Dicing
• Rotate/Pivot
• Roll-up/Drill-down
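
Of the operations listed, roll-up and drill-down are the ones that move between levels of a dimension, which is only possible if a hierarchy maps each member to its parent levels. The Python sketch below assumes a hypothetical time hierarchy (day, month, year) and invented production figures. A slice is included for contrast, since it needs only a single dimension member.

```python
from collections import defaultdict

# Hypothetical time hierarchy: day -> (month, year).
hierarchy = {
    "2013-01-15": ("2013-01", "2013"),
    "2013-01-16": ("2013-01", "2013"),
    "2013-02-03": ("2013-02", "2013"),
}

# Day-level facts: (day, vehicles produced).
facts = [("2013-01-15", 2), ("2013-01-16", 3), ("2013-02-03", 4)]


def roll_up(facts, level):
    """Aggregate day-level facts to level 0 (month) or level 1 (year)."""
    totals = defaultdict(int)
    for day, value in facts:
        totals[hierarchy[day][level]] += value
    return dict(totals)


by_month = roll_up(facts, 0)
by_year = roll_up(facts, 1)

# Slicing fixes one member of a dimension and needs no hierarchy.
slice_jan_15 = [fact for fact in facts if fact[0] == "2013-01-15"]
```

Drill-down is the reverse navigation: going from `by_year` back to `by_month` or to the day-level facts.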
