Lecture @DHBW: 03 Architecture Andreas Buckenhofer

Ein Unternehmen der Daimler AG Andreas Buckenhofer Contact/Connect Senior DB Professional

xing Since 2009 at Daimler TSS

Department: Machine Learning Solutions DOAG DHBW

Business Unit: Analytics vcard

• Over 20 years experience with • Oracle ACE Associate technologies • DOAG responsible for InMemory DB • Over 20 years experience with Data • Lecturer at DHBW Daimler TSS GmbH WilhelmWarehousing-Runge-Straße 11, 89081 Ulm / Telefon +49 731 505-06 / Fax +49 731 505• -65 Certified99 Data Vault Practitioner 2.0 [email protected]• International / Internet: www.d projectaimler-tss.com experience • Certified Oracle Professional Sitz und Registergericht: Ulm / HRB-Nr.: 3844 / Geschäftsführung: Martin Haselbach (Vorsitzender), Steffen Bäuerle • Certified IBM Big Data Architect © Daimler TSS I Template Revision Change Log

Date Changes 01.03.2020 Initial version

Daimler TSS Data Warehouse / DHBW 3 What you will learn today

• Explain and draw different DWH architectures • Explain and draw Big Data architectures • Understand Data Lakes and other buzzwords

Daimler TSS Data Warehouse / DHBW 4 Purpose: why are DWH/Big Data architectures useful?

• Specific implementation can follow an architecture • Architecture describes an ideal type. Therefore an implementation may not use all components or can combine components • Better understanding, overview and complexity reduction by decomposing a DWH into its components • Can be used in many projects: repeatable, standardizable • Map DWH tools into the different components and compare functionality • Functional oriented as it describes data and control flow

Daimler TSS Data Warehouse / DHBW 5 5 1. DWH architectures • Standard DWH architecture • Kimball Bus architecture • Data Vault 2.0 architecture 2. Big Data architectures • Data Lake • • Kappa architecture 1. DWH architectures • Standard DWH architecture • Kimball Bus architecture • Data Vault 2.0 architecture 2. Big Data architectures • Data Lake • Lambda architecture • Kappa architecture Standard Data Warehouse architecture

Internal data sources Data Warehouse

Backend Frontend

OLTP Core Mart Layer Staging Integration Warehouse Aggregation (Output Layer Layer Layer Layer Layer) (Input (Cleansing (Storage (Reporting OLTP Layer) Layer) Layer) Layer) External data sources

Metadata Management Security DWH Manager incl. Monitor

Daimler TSS Data Warehouse / DHBW 8 8 Data sources

• Providing internal and external data out of the source systems

• Enabling data through Push (source is generating extracts) or Pull (BI Data Backend is requesting or directly accessing data)

• Example for Push practice (deliver csv or text data through file interface; Change Data Capture (CDC))

• Example for Pull practice (direct access to the source system via ODBC, JDBC, API and so on)

Daimler TSS Data Warehouse / DHBW 9 9 Staging layer

• “Landing Zone” for data coming into a DWH

• Purpose is to increase speed into DWH and decouple source and target system (repeating extraction run, additional delivery)

• Granular data (no pre-aggregation or filtering in the Data Source Layer, i.e. the source system)

• Usually not persistent, therefore regular housekeeping is necessary (for instance delete data in this layer that is few days/weeks old or – more common - if a correct upload to Core Warehouse Layer is ensured)

• Tables have no constraints, columns often varchar

Daimler TSS Data Warehouse / DHBW 10 10 Integration layer

• Business Rules, harmonization and standardization of data

• Classical Layer for transformations: ETL = Extract – TRANSFORM – Load

• Fixing data quality issues

• Usually not persistent, therefore regular housekeeping is necessary (for instance after a few days or weeks or at the latest once a correct upload to Core Warehouse Layer is ensured)

• The component is often not required or often not a physical part of a DB

Daimler TSS Data Warehouse / DHBW 11 Core Warehouse layer

• Data storage in an integrated, consolidated, consistent and non-redundant (normalized)

• Contains enterprise-wide data organized around multiple subject-areas

• Application / Reporting neutral data storage on the most detailed level of granularity (incl. historic data)

• Size of database can be several TB and can grow rapidly due to data historization

• Write-optimised layer

Daimler TSS Data Warehouse / DHBW 12 Aggregation layer

• Preparing data for the Layer to the required granularity

• E.g. Aggregating daily data to monthly summaries

• E.g. Filtering data (just last 2 years or just data for a specific region)

• Harmonize computation of key performance indicators (measures) and additional Business Rules

• The component is often not required or often not a physical part of a DB

Daimler TSS Data Warehouse / DHBW 13 Data Mart layer

• Read-optimised layer: Data is stored in a denormalized data model for performance reasons and better end user usability/understanding

• The Data Mart Layer is providing typically aggregated data or data with less history (e.g. latest years only) in a denormalized data model

• Created through filtering or aggregating the Core Warehouse Layer

• One Mart ideally represents one subject area

• Technically the Data Mart Layer can also be a part of an Analytical Frontend product (such as Qlik, Tableau, or IBM Cognos TM1) and need not to be stored in a relational database

Daimler TSS Data Warehouse / DHBW 14 management, Security, DWH Manager

• Metadata Management

• Not just “Data about Data”, separate lecture

• Security

• Not all users are allowed to see all data

• Data security classification (e.g. restricted, confidential, secret)

• DWH Manager incl. Monitor

• DWH Manager initiates, controls, and checks job execution

• Monitor identifies changes/new data from source systems, separate lecture

Daimler TSS Data Warehouse / DHBW 15 1. DWH architectures • Standard DWH architecture • Kimball Bus architecture • Data Vault 2.0 architecture 2. Big Data architectures • Data Lake • Lambda architecture • Kappa architecture Kimball Bus Architecture (Central Data Warehouse based on data marts)

Source: http://www.kimballgroup.com/2004/03/differences-of-opinion/ Daimler TSS Data Warehouse / DHBW 17 Kimball bus architecture (Central Data Warehouse based on data marts)

Internal data sources Data Warehouse

Backend Frontend More Business- process oriented Core Warehouse Layer than subject- = Mart Layer oriented, OLTP integrated, Staging Layer Data Mart 1 time-variant, non-volatile (Input Layer) OLTP Data Mart 2 Data Mart 3 External data sources

Metadata Management Security DWH Manager incl. Monitor

Daimler TSS Data Warehouse / DHBW 18 Kimball Bus Architecture (Central Data Warehouse based on data marts)

• Bottom-up approach • Dimensional model with denormalized data • Sum of the data marts constitute the Enterprise DWH • Enterprise Service Bus / conformed dimensions for integration purposes • (don’t confuse with ESB as middleware/communication system between applications) • Kimball describes that agreeing on conformed dimensions is a hard job and it’s expected that the team will get stuck from time to time trying to align the incompatible original vocabularies of different groups • Data marts need to be redesigned if incompatibilities exist

Daimler TSS Data Warehouse / DHBW 19 Daimler Daimler TSS IntegrationData

Core Warehouse Layer with and without Core Warehouse Data / Data DHBW Warehouse layer 20 1. DWH architectures • Standard DWH architecture • Kimball Bus architecture • Data Vault 2.0 architecture 2. Big Data architectures • Data Lake • Lambda architecture • Kappa architecture Data Vault 2.0 architecture – today’s world (Dan Linstedt)

Data Warehouse / DHBW Data Vault 2.0 architecture (Dan Linstedt)

Michael Olschimke, Dan Linstedt: Building a Scalable Data Warehouse with Data Vault 2.0, Morgan Kaufmann, 2015, Chapter 2.2 Daimler TSS Data Warehouse / DHBW Daimler23 TSS Data Vault 2.0 architecture (Dan Linstedt)

Internal data sources Data Warehouse

Backend Frontend

OLTP Mart Layer Raw Business (Output Staging Layer Data Vault Data Vault Layer) (Input Layer) OLTP (Reporting Hard Layer) External data sources Rules Soft Rules only Metadata Management Security DWH Manager incl. Monitor

Daimler TSS Data Warehouse / DHBW 24 Data Vault 2.0 architecture (Dan Linstedt)

• Core Warehouse Layer is modeled with Data Vault and integrates data by BK (business key) “only” ( is a separate lecture) • Business rules (Soft Rules) are applied from Raw Data Vault Layer to Mart Layer and not earlier • Alternatively from Raw Data Vault to additional layer called Business Data Vault • Hard Rules don’t change data • Data is fully auditable • Real-time capable architecture • Architecture got very popular recently; also applicable to BigData, NoSQL

Daimler TSS Data Warehouse / DHBW 25 Data Vault 2.0 architecture (Dan Linstedt)

• In the classical DWHs, the Core Warehouse Layer is regarded as “single version of the truth” • Integrates + cleanses data from different sources and eliminates contradiction • Produces consistent results/reports across Data Marts • But: cleansing is (still) objective, Enterprises change regularly, paradigm does not scale as more and more systems exist • Data in Raw Data Vault Layer is regarded as “Single version of the facts” • 100% of data is loaded 100% of time • Data is not cleansed and bad data is not removed in the Core Layer (Raw Vault)

Daimler TSS Data Warehouse / DHBW 26 Data Vault 2.0 architecture (Dan Linstedt)

• Data Vault is optimized for the following requirements: • Flexibility

• Agility

• Data historization

• Data integration

• Auditability • wrote in 2008: “Data Vault is the optimal approach for modeling the EDW in the DW2.0 framework.” (DW2.0)

Daimler TSS Data Warehouse / DHBW 27 Data Vault – architecture, methodology, model

Architecture Methodology Model

• Multi-Tier • Repeatable • Flexible • Scalable • Measureable • Hash based • Supports NoSQL • Agile • Hub & Spoke

Lecture Lecture DWH Architectures Implementation: Automation, DWH Data Modeling Pattern based, High speed Daimler TSS Data Warehouse / DHBW 28 DWH Examples Examples of DWHs in the industry

Apple: multiple Petabytes • Customer insights: who’s who and what are the customers up to Walmart: 300TB (2003), several PB today • It tells suppliers, “You have three feet of shelf space. Optimize it.” eBay: >10PB, 100s of production DBs fed in • Get better understanding of customers

Most DWHs are much smaller though. For huge and small DWHs: High challenges to architect + develop + maintain + run such complex systems

Sources: https://gigaom.com/2013/03/27/why-apple-ebay-and-walmart-have-some-of-the-biggest-data-warehouses-youve-ever-seen/ and http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/ Daimler TSS Data Warehouse / DHBW 30 Smart Factory: Operational BI Service Platform

New data sources sending lots of data with high speed (sensor data, Containing and logs, etc.) displaying complex manuals, e.g. during repair

Right-Time data required for automated actions, e.g. Workers getting cordless alarms on their screwdriver watch knows and adjusts torque Source: Gluchowski: Analytische Informationssysteme, 5.Aufl., p. 279 Daimler TSS Data Warehouse / DHBW 31 Digital Twin

• Digital replica of an entity (product, service, factory, etc) • Represent entities during the whole lifecycle (including design phase) Exercise: recommend an architecture

Review the presented Data Warehouse architectures. Which architecture would you recommend for • An online store with real/right-time data integration needs • Marketing department of a bank

List advantages and drawbacks of your proposal.

Daimler TSS Data Warehouse / DHBW 33 Exercise: recommend an architecture

An online store with real-time/right-time data integration needs • Architecture: Data Vault 2.0 • + Integration of many internal and external source systems (e.g. integrate social media data about the online store) • + Fast data delivery in Raw Vault Layer (Real-time/Right-time data integration). Complex data cleansing / transformation / soft rules are delayed downstream towards Mart Layer • - Transformation overhead (Source system data model to Data Vault data model to Dimensional data model)

Daimler TSS Data Warehouse / DHBW 34 Exercise: recommend an architecture

Marketing department of a bank • Architecture: Kimball Bus architecture • + Start small for a department. If other departments are interested, new data and new Marts can be added on demand • - High risk to loose the Enterprise and several DWHs are built

That’s still quite a common scenario nowadays. A single Enterprise DWH is often not achieved (e.g. Mergers & Acquisitions, inflexibility due to a single centralized DWH, rapidly changing conditions, etc.) and therefore very often several DWHs with different architectures exist in parallel within a company.

Daimler TSS Data Warehouse / DHBW 35 Summary DWH architectures

Which layers does the standard DWH architecturehave? • Staging (Input), Integration (Cleansing), Core Warehouse (Storage), Aggregation, Mart (Reporting, Output) and additionally Metadata, Security, DWH Manager + Monitor Which other architectures exist? • Kimball Bus Architecture (Central Data Warehouse based on data marts) • Data Vault 2.0 Architecture (Dan Linstedt) There are even more • DW 2.0: The Architecture for the Next Generation of Data Warehousing • (ODS)

Daimler TSS Data Warehouse / DHBW 36 DWH architectures - advantages

Standard DWH Kimball Bus Data Vault 2.0 Single version of the truth Single version of the truth Single version of the facts • Separation of concerns: long- • Two layers only mean faster • Separation of concerns: term enterprise data storage development and less work long-term enterprise data separated from data • Rather simple approach to storage separated from presentation make data fast and easily data presentation • Changes in requirements and accessible • Store raw data and scope are easier to manage • Lower startup costs (but postpone time-consuming • Lower subsequent higher subsequent data cleansing development costs (but development costs) • Supports flexibility and higher startup costs) agility

Daimler TSS Data Warehouse / DHBW 37 DWH architectures - disadvantages

Standard DWH Kimball Bus Data Vault 2.0 • Data model transformations • Can become inflexible and • Data model transformations from 3NF to Dimensional expensive in the long-run from Data Vault model to model required • If structures change (instable Dimensional model required • More complex as two source systems), high effort to • More complex as two different different data models are implement the changes and reload data models are required required data, especially conformed • Larger team(s) of specialists dimensions (“Dimensionitis” required desease) • High effort to start • Non-metric data not optimal for dimensional model • Dimensional model (esp. ) contains data redundancy • Difficult to optimize: read or write?

Daimler TSS Data Warehouse / DHBW 38 1. DWH architectures • Standard DWH architecture • Kimball Bus architecture • Data Vault 2.0 architecture 2. Big Data architectures • Data Lake • Lambda architecture • Kappa architecture 1. DWH architectures • Standard DWH architecture • Kimball Bus architecture • Data Vault 2.0 architecture 2. Big Data architectures • Data Lake • Lambda architecture • Kappa architecture Standard Data Warehouse architecture

Internal data sources Data Warehouse

Backend Frontend

OLTP Core Mart Layer Staging Integration Warehouse Aggregation (Output Layer Layer Layer Layer Layer) (Input (Cleansing (Storage (Reporting OLTP Layer) Layer) Layer) Layer) External data sources

Metadata Management Security DWH Manager incl. Monitor

Daimler TSS Data Warehouse / DHBW 41 What are differences between Hadoop and DWH?

Dump question, but actually there are many comparisons like that at the moment  • Hadoop is a tool / technology (or even many tools) like a RDBMS • DWH is an architecture and concept • Architecture is abstraction and defines a goal • Architecture vs tools / technology

Source: https://de.slideshare.net/mrm0/architecting-a-data-platform-for-enterprise-use-strata-ny-2018 Daimler TSS Data Warehouse / DHBW 42 What is a Data Lake?

Dump anything in and wait?

Hadoop

Hoard 100ths of Petabyte in HDFS?

Daimler TSS Data Warehouse / DHBW 43 Assumptions + requirements have changed

• The data is not always known in advance, so it can’t be modeled in advance. • A variety of data formats are coming like JSON, pdf, text, audio, video, voice, etc. • The data architecture must be read-write from both the back and front, not a one-way data flow. The data written back may be repeatedly used, persistent data, or it may be temporary. • The data may arrive with any frequency, and the rate may not be under your control.

Daimler TSS Data Warehouse / DHBW 44 It‘s very hard to get speed and quality – choose one

Stability & Quality Flexibility & Speed •Predictable •Explanatory •Improving •Experimenting •Save user time •Save developer time

Source: https://www.gartner.com/it-glossary/bimodal/ Daimler TSS Data Warehouse / DHBW 45 Data Library Data Lake on Spark Data Repository Data Reservoir Data Archive Data Swamp Data Lake 3.0 Landing Zone Data Lake on Hadoop Landing Zone – Definition by Dan Linstedt

Source: https://datavaultalliance.com/news/defining-a-data-lake/ Daimler TSS Data Warehouse / DHBW 47 Data Lake – Definition by Dan Linstedt

Source: https://datavaultalliance.com/news/defining-a-data-lake/ Daimler TSS Data Warehouse / DHBW 48 Data Lake Data Vault 2.0 systems architecture by Dan Linstedt

Source: https://datavaultalliance.com/news/defining-a-data-lake/ Daimler TSS Data Warehouse / DHBW 49 Data Lake Zones instead of layers according to Mark Madsen

More effort applied to data Optimized for specific uses / management: cleansed, workloads. Personal folders. curated. Typically schema- Schema-on-write. on-write.

New data of unknown value, simple requests for new data can land here first, with little work by IT. Typically schema-on-read. Source: https://de.slideshare.net/mrm0/architecting-a-data-platform-for-enterprise-use-strata-ny-2018 Daimler TSS Data Warehouse / DHBW 50 Data Lake architecture - summary

Internal data sources Data Lake

Backend Frontend DWH OLTP Persistent Landing Zone Standardized data (enhanced data) (raw Data) OLTP

Sandbox (temporary data) External data sources

Metadata Management Security Data Manager incl. Monitor

Daimler TSS Data Warehouse / DHBW 51 Characteristics of a Data Lake architecture

Characteristics of a Data Lake architecture: • Is managed, governed, organized • Deals with data and schema change easily • Does not require up front modeling • Does not limit the format or structure of data • Assumes a full range of data latencies, from streaming to one-time bulk loads • Scales easily for higher data volumes • Leverages lineage and metadata

Daimler TSS Data Warehouse / DHBW 52 Old wine in new skins

Source: https://twitter.com/nheudecker/status/1110252515931508736?s=11 Daimler TSS Data Warehouse / DHBW 53 1. DWH architectures • Standard DWH architecture • Kimball Bus architecture • Data Vault 2.0 architecture 2. Big Data architectures • Data Lake • Lambda architecture • Kappa architecture Lambda architecture

• Architecture by Nathan Marz • Combine realtime and batch processing • Batch layer stores and historizes raw data • Serving layer contains batch views • Query unions serving and realtime layer • Rather complex • Author recommends graph data model and advises against schema-on-read

Daimler TSS Data Warehouse / DHBW 55 What is the main inevitable problem in data systems?

• Lack of human fault tolerance • Bugs will be deployed • Operational errors will happen, e.g. accidentally delete data • Data loss / corruption is worst case scenario • Without data loss / corruption, mistakes can be fixed if original data is still available • Must design for human error like you’d design for any other fault

Daimler TSS Data Warehouse / DHBW 56 Analytics vs Re-Analytics

“My own personal opinion is that data analysis is much less important than data re-analysis. It’s hard for a data team to get things right on the very first try, and the team shouldn’t be faulted for their honest efforts. When everything is available for review, and when more data is added over time, you’ll increase your chances of converging to someplace near the truth.”–Jules J. Berman.

Source: http://www.odbms.org/blog/2014/07/big-data-science-interview-jules-j-berman/ Daimler TSS Data Warehouse / DHBW 57 Immutable data - Master data set

• Atomic, immutable data • Ensure that data can not be deleted (accidentally) • Fundamentally simpler • Easy to implement on top of a distributed filesystem, eg Hadoop • CR instead of CRUD • No updates • No deletes • Create (insert) and read (select) only Immutability restricts the range of errors that can cause data loss/corruption

Daimler TSS Data Warehouse / DHBW 58 What is the Lambda architecture?

Query (merge)

Serving Layer Batch Views

Batch Layer Speed Layer All Data (Master data set) RealTime Views

Data Stream Daimler TSS Data Warehouse / DHBW 59 Lambda architecture – data example compute follower list

Query Follower: 3

Anne Serving Anne George Layer George John

Batch 1.1. insert Jim Speed Layer Layer 1.1. insert Anne insert John (now) 2.1 remove Jim 3.1. insert George 5.1. insert John (now) Data Stream Daimler TSS Data Warehouse / DHBW 60 Batch views and realtime views

Daimler TSS Data Warehouse / DHBW 61 Data absorbtion

https://dzone.com/articles/Lambda-architecture-with-apache-spark Daimler TSS Data Warehouse / DHBW 62 Eventual accuracy

Computing best answer in real time may not always be possible • Can compute exact answer in batch layer and approximate answer in realtime layer • Best of both worlds of performance and accuracy For example, a machine learning application where generation of the batch model requires so much time and resources that the best result achievable in real-time is computing and approximated updates of that model. In such cases, the batch and real-time layers cannot be merged, and the Lambda architecture must be used.

Daimler TSS Data Warehouse / DHBW 63 Which tools could be used in the Lambda architecture?

Query (merge)

Serving Layer Batch Views

Batch Layer Speed Layer All Data RealTime Views

Data Stream Daimler TSS Data Warehouse / DHBW 64 Cloud vendors offerings, e.g. Microsoft Azure

https://social.technet.microsoft.com/wiki/contents/articles/33626.Lambda-architecture-implementation-using-microsoft-azure.aspx Daimler TSS Data Warehouse / DHBW 65 Cloud vendors offerings, e.g. Amazon AWS

https://aws.amazon.com/de/blogs/big-data/unite-real-time-and-batch-analytics-using-the-big-data-Lambda-architecture-without-servers/ https://d0.awsstatic.com/whitepapers/lambda-architecure-on-for-batch-aws.pdf Daimler TSS Data Warehouse / DHBW 66 Lambda architecture – pros and cons

Pro Con

Architecture emphasizes to keep data Maintaining code that needs to produce immutable. Mistakes can be corrected via the same result in two complex distributed recomputation systems Reprocessing is one of the key challenges Operational burden of running and of but is very often debugging two systems ignored

Daimler TSS Data Warehouse / DHBW 67 Summary Lambda architecture

• Incoming data is sent to batch and speed layer • Batch layer constantly (re-) computes batch views • Master data is stored in the batch layer in raw format: immutable & append-only • Contains data except most up-to-date data due to high latency • Replaces data in speed layer as soon as data in newer compared to speed layer • Speed layer uses incremental algorithms to refresh real-time Views • Receives data stream for real-time processing • Contains most up-to-date data only • Serving layer contains views on batch layer data • Merge is done in the query or in the serving layer Daimler TSS Data Warehouse / DHBW 68 1. DWH architectures • Standard DWH architecture • Kimball Bus architecture • Data Vault 2.0 architecture 2. Big Data architectures • Data Lake • Lambda architecture • Kappa architecture Questioning the Lambda architecture

Jay Kreps wrote an article about „Questioning the Lambda architecture“ • He wrote about his experience with the Lambda architecture • It works, but not very pleasant or productive • Keeping code in sync is really hard • Need to build complex, low-latency processing systems • Scalable high-latency batch system • Low-latency stream stream-processing system • Instead of duct taping batch & speed:  Kappa architecture / Log-centric architecture / Stream data platform

Source: https://www.oreilly.com/ideas/questioning-the-Lambda-architecture

Daimler TSS Data Warehouse / DHBW 70 Kappa architecture

• Architecture by Jay Kreps • Logcentric, write-ahead logging • Each event is an immutable log entry and is added to the end of the log • Read and write operations are separated • Materialized views can be recomputed consistently from data in the log

Daimler TSS Data Warehouse / DHBW 71 What is a message? A log!

Source: Jay Kreps: I heart logs, O’Reilly 2014

Daimler TSS Data Warehouse / DHBW 72 What is a log?

• A bank account’s current balance can be built from a complete list of its debits and credits, but the inverse is not true. • In this way, the log of transactions is the more “fundamental” data structure than the database records storing the results of those transactions.

A software application’s database is better thought of as a series of time-ordered immutable facts collected since that system was born, instead of as a current snapshot of all data records as of right now.

Source: https://blog.parse.ly/post/1550/kreps-logs/

Daimler TSS Data Warehouse / DHBW 73 Kappa architecture

Source: https://www.oreilly.com/ideas/questioning-the-Lambda-architecture

Daimler TSS Data Warehouse / DHBW 74 Reprocessing

Lambda architecture • Complex • It works, but not very pleasant or productive: keeping code in sync is really hard • Addresses the challenge of reprocessing • Batch views to contain updated data (a Lambda architecture rcommendation) • Bugs (true requirement for reprocessing)  new processing logic • New user requirements (true requirement for reprocessing)  new processing logic Kappa architecture • Logs unify Batch processing + Stream processing • In Kappa, reprocessing required only when processing logic has been modified

Daimler TSS Data Warehouse / DHBW 75 Kappa architecture – pros and cons

Pro Con

Flexible architecture that can be used All use cases must be mapped to a • stand-alone streaming problem which can become • combined with a Data Lake or DWH artificial for well-known cases, e.g. sales • in a microservice architecture prognosis. Easy to understand Not suited for analysis of history data: sequential reading and writing is fast, random is not.

Daimler TSS Data Warehouse / DHBW 76 Summary: Enterprise-wide architecture

Source: Jay Kreps: I heart logs, O’Reilly 2014

Daimler TSS Data Warehouse / DHBW 77 1. DWH architectures • Standard DWH architecture • Kimball Bus architecture • Data Vault 2.0 architecture 2. Big Data architectures • Data Lake • Lambda architecture • Kappa architecture Architecture - summary

One size does not fit all • Understand why an architecture is needed • Use + work on an architecture • Don’t reinvent the wheel • But adapt to your needs and constraints (time, budget, skills, etc)

Daimler TSS Data Warehouse / DHBW 79 Theory vs Reality: Gather all data in one place

Source : https://www.linkedin.com/feed/update/urn:li:activity:6440630785102880768/ Daimler TSS Data Warehouse / DHBW 80 Virtual architecture

Internal data sources Data Warehouse

DWH Backend Frontend

Data Weakly+partly subject-oriented, Weakly+partly integrated, Lake Not time-variant, Not non-volatile OLTP Views, Alias names, Query Management External data sources Metadata

Daimler TSS Data Warehouse / DHBW 81 Virtual architecture

• Data not extracted from operational systems and stored separately • Standardized interface for all operational data sources • One "GUI" for all existing data

• Generates combined queries

• Query Processor joins query result data from different sources

• Can also access data in Hadoop (Polybase, Big SQL, BigData SQL, etc)

Daimler TSS Data Warehouse / DHBW 82 Virtual architecture

• Query Management manages metadata about all operational systems • (physical) location of data and algorithms for extracting data from OLTP system

• Implementation easier

• Low cost: can use existing hardware infrastructure • Queries cause significant performance problems in operational systems • Known problems when analyzing operational data directly • Same query is processed multiple times (if queried multiple times) • Same query delivers different results when processed at different times

Daimler TSS Data Warehouse / DHBW 83 Exercise - Architecture

• Now imagine that you prepare an exam. • Identify 1 question about DWH / Big Data architecture that you would ask in an exam. • Write down the question on a stick-it card.

Daimler TSS Data Warehouse / DHBW 84 Daimler TSS GmbH Wilhelm-Runge-Straße 11, 89081 Ulm / Telefon +49 731 505-06 / Fax +49 731 505-65 99 [email protected] / Internet: www.daimler-tss.com Sitz und Registergericht: Ulm / HRB-Nr.: 3844 / Geschäftsführung: Martin Haselbach (Vorsitzender), Steffen Bäuerle

© Daimler TSS I Template Revision Architecture from a project in the automotive industry

ETL Engine

Datastore Standard Source Reports IIDR ReplEngine Mart Layer Source

Backend Frontend DWH DB Raw + Business Data Vault

Staging Layer OLTP DB Datastore IIDR ReplEngine AdHoc Logs DWH DWH Reports Mirror DB Datastore IIDR ReplEngine (Operational TSM Mirror Mirror Data Store)

Daimler TSS Data Warehouse / DHBW 86 End user sample questions

Which vehicles or aggregates are documented incompletely? (Data quality)

Which vehicles / which control units require SW updates?

Which interiors are most common by region?

Supply data for external simulations, customs clearance, spare part planning, etc.

Daimler TSS Data Warehouse / DHBW 87 according to BARC Flexibility causes heterogenity

Source: https://www.datafestival.de/events/proof-of-concepts-als-strategietool-erfahrungen-aus-der-medienbranche-de/; Jacqueline Bloemen, Datenarchitektur für Business Analytics – was Sie berücksichtigen sollten, DataFestival 2019 Daimler TSS Data Warehouse / DHBW 88 Use case: Analysis battery aging

• JSON data ingested into HDFS, Hive tables on JSON files • Identify breaks (“> 8h”) and compute current drain

Max capacity Current capacity

Daimler TSS Data Warehouse / DHBW 89 Structuring the Data Lake new data sources – sensor data

• Sensor data format change without notice • Sensors get regularly updated with new versions • Names of metrics may change • Sensors with various versions in the field • Sensors from different suppliers • Often many fields >>100 and increasing with new sensor versions • Easy storing of data in HDFS and applying schema later • Data from Robots, vehicles, …

Daimler TSS Data Warehouse / DHBW 90 Use case: Analysis battery aging Data Lake architecture

Structuring the Data Lake “schema- R Python on-read” Hive tables • Sensor data format change Consumption without notice Samp-

• Time consuming and error-prone ling / Data Security Data data integration into the Data filter Archival Data Lake Enhanced data Hive tables • Therefore preparation of data for

usage in the Consumption area Data Governance required: “Data Engineer” Management Metadata Struc- ture Raw data json Daimler TSS Data Warehouse / DHBW 91 Other DWH architectures

• Inmon Corporate Information Factory • DW 2.0: The Architecture for the Next Generation of Data Warehousing • Operational Data Store (ODS) • Virtual architecture

Daimler TSS Data Warehouse / DHBW 92 Exercise: classical DWH architectures

The article http://www.kimballgroup.com/2004/03/differences-of-opinion/ compares THE two classic DWH architectures.

Read the paper and complete the table / questions on the next slide. (Caution: The paper is biased / favors one approach; you may want to read other/more papers for a neutral view.)

Daimler TSS Data Warehouse / DHBW 93 Exercise: classical DWH architectures

How are the approaches called? Who “invented” the approach?

How many layers are used and how are the layers called? Which data modeling approaches are used in which layer? In which layer are atomic detail data stored? In which layer are aggregated / summary data stored? List at least 2 advantages

List at least 2 disadvantages Exercise: classical DWH architectures

How are the Kimball Bus Architecture Corporate Information Factory approaches called? Who “invented” the approach? • • Bill Inmon How many layers are used and • Data Staging • Data Acquisition how are the layers called? • Dimensional Data Warehouse • Normalized Data Warehouse • Data Delivery / Dimensional Mart Which data modeling approaches • Data Staging: variable, corresponds to • Data Acquisition: variable, corresponds to are used in which layer? source system source system • Dimensional Data Warehouse: • Normalized Data Warehouse: 3NF Dimensional Model • Data Delivery: Dimensional Model In which layer are atomic detail • Dimensional Data Warehouse • Normalized Data Warehouse data stored? In which layer are aggregated / • Dimensional Data Warehouse • Data Delivery / Dimensional Mart summary data stored?

Daimler TSS Data Warehouse / DHBW 95 Exercise: classical DWH architectures

Kimball Bus Architecture Corporate Information Factory Advantages • Two layers only mean faster development and • Separation of concerns: long-term enterprise data less work storage separated from data presentation • Rather simple approach to make data fast and • Changes in requirements and scope are easier to easily accessible manage • Lower startup costs (but higher subsequent • Lower subsequent development costs (but higher development costs) startup costs) Disadvantages • If table structures change (instable source • Data model transformations from 3NF to systems), high effort to implement the changes Dimensional model required and reload data, especially conformed • More complex as two different data models are dimensions (“Dimensionitis” desease) required • Non-metric data not optimal for dimensional • Larger team(s) of specialists required model • Dimensional model (esp. Star Schema) contains data redundancy

Daimler TSS Data Warehouse / DHBW 96 Inmon Corporate Information Factory

Source: http://www.kimballgroup.com/2004/03/differences-of-opinion/ Daimler TSS Data Warehouse / DHBW 97 Inmon Corporate Information Factory

Internal data sources Data Warehouse

Backend Frontend

subject- oriented, OLTP Core integrated, Warehousetime - Mart Layer Staging Layer variant, Layer (Output Layer) (Input Layer) non-volatile OLTP (Storage (Reporting Layer) Layer) External data sources

Metadata Management Security DWH Manager incl. Monitor

Daimler TSS Data Warehouse / DHBW 98 Inmon Corporate Information Factory

• Top-down approach • (Normalized) Core Warehouse is essential for subject-oriented, integrated, time-variant and nonvolatile data storage • Create (departmental) Data Marts as subsets of Core Enterprise DWH as needed • Data Marts can be designed with Dimensional model • The standard DWH architectureis more general compared to CIF, but was mainly influenced by CIF

Daimler TSS Data Warehouse / DHBW 99 DW 2.0: The Architecture for the Next Generation of Data Warehousing

Operational application data model

Integrated corporate data model

Data Data Lifecycle Integrated corporate data model

Archival data model

Source: W.H. Inmon, Dan Linstedt: Data Architecture: A Primer for the Data Scientist, Morgan Kaufmann, 2014, chapter 3.1 Daimler TSS Data Warehouse / DHBW 100 DW 2.0: The Architecture for the Next Generation of Data Warehousing

Main characteristics: • Structured and “unstructured” data, not just metrics • Life Cycle of data with different storage areas • Hot data: High speed, expensive storage (RAM, SSDs) for most recent data

• …

• Cold data: Low speed, inexpensive storage (e.g. hard disks) for old data; archival data model with high compression • Metadata is an integral part of the DWH and not an afterthought

Daimler TSS Data Warehouse / DHBW 101 Operational data store (ODS)

Internal data sources Data Warehouse

Backend Frontend

Core subject- oriented, OLTP Warehouse Mart Layer Staging Layer integrated, Layer time- (Output Layer) (Input Layer) variant, (Storagenon -volatile (Reporting Layer) OLTP Layer)

External data sources Operational Data Store

Metadata Management Security DWH Manager incl. Monitor

Daimler TSS Data Warehouse / DHBW 102 Operational data store (ODS)

• ODS: Real-time/Right-time layer • Replication techniques used to transport data from source database to ODS layer with minimal impact on source system • Data in the ODS has no history and is stored without any cleansing and without any integration (1:1 copy from single source) • DWH performance not optimal as data model is suited for OLTP and not for reporting requirements • ODS normally additionally to Staging / Core DWH / Mart Layer but can exist alone without other layers

Daimler TSS Data Warehouse / DHBW 103 Logical architecture (Gartner)

• The LDW is a multi-server / multi-engine architecture • The LDW can have multiple engines, the main ones being: The Data Warehouse(DW), data marts and the Data Lake • We should not be trying to choose a single engine for all our requirements. Instead, we should be sensibly distributing all our requirements across the various components we have to choose from. Reasons: • Regulatory constraints • Organizational constraints

https://blogs.gartner.com/henry-cook/2018/01/18/logical-data-warehouse-project-plans/ Daimler TSS Data Warehouse / DHBW 104 Specific Big Data architectures

• There exist well-known reference architectures for Data Warehouses • Many tools and schema-on-read came with the Hadoop ecosystem • Was a “black box” at the beginning • Gets more and more structure with different layers instead of a “black box” • Structure, modeling, organization, governance instead of tool-only focus

Daimler TSS Data Warehouse / DHBW 105 How does a Data Lake differ from a data swamp?

• Naive idea: dump everything in (“landing zone”) • Data hoarding is not a data management strategy • A Data Lake brings in structure • e.g. create directories in HDFS if Hadoop is used • /transient • /raw • /standardized • /use-case specific

Daimler TSS Data Warehouse / DHBW 106 What is a Data Lake according to Mark Madsen? separate collect / manage / delivery

Daimler TSS Data Warehouse / DHBW 107 Schema-on-write vs schema-on-read revisited

Old approach New approach Model Collect Collect Model Analyze Analyze Promote move data into next zone if required (data has been analyzed and is needed more often)

Daimler TSS Data Warehouse / DHBW 108 Data Vault 2.0 architecture – today’s world (Dan Linstedt) Defining a Data Lake … by Dan Linstedt

https://www.youtube.com/watch?v=tDNjI1Yvqxw Interview with “Father of data warehousing” Bill Inmon

https://www.youtube.com/watch?time_continue=48&v=b9YfsjEjVS8 Data Lake turned into Data swamp

Source: Ungerer: Cleaning Up the Data Lake with an Operational Data Hub, O’Reilly Media 2018, p.12

Daimler TSS Data Warehouse / DHBW 112 Data Lake

• No agreed definition • Characteristics [Madsen]: • Deals with data and schema change easily • Does not always require up front modeling • Does not limit the format or structure of data • Assumes a full range of data latencies, from streaming to one-time bulk loads, both in and

Source: Ungerer: Cleaning Up the Data Lake with an Operational Data Hub, O’Reilly Media 2018, p.11 out including write-back Daimler TSS Data Warehouse / DHBW 113 Bimodal data governance

Two different styles of work: different zones with different degrees of governance • Stability & Quality • Predictable, improving and renovating • Flexibility & Speed • Explanatory

Source: https://www.gartner.com/it-glossary/bimodal/

Daimler TSS Data Warehouse / DHBW 114 Exercise: recommend an architecture

A holding of 3 telecommunication companies • Architecture: Virtual Data Warehouse • + Companies may not want to provide their data to a new storage • + Can easily be extended if new companies join the holding or reduced if a company leaves the holding • - Bad performance • - Not really data integration achieved, low Data Quality • - Firewalls have to be opened

Daimler TSS Data Warehouse / DHBW 115 Where to use Hadoop and spark?

Source: Rick F. van der Lans: New Data Storage Technologies, TDWI Munich 2018 Where to use ?

Source: Rick F. van der Lans: New Data Storage Technologies, TDWI Munich 2018 Examples of Data Warehouses in the industry

Apple: multiple Petabytes • Customer insights: who’s who and what are the customers up to Walmart: 300TB (2003), several PB today • It tells suppliers, “You have three feet of shelf space. Optimize it.” eBay: >10PB, 100s of production DBs fed in • Get better understanding of customers

Most DWHs are much smaller though. For huge and small DWHs: High challenges to architect + develop + maintain + run such complex systems https://gigaom.com/2013/03/27/why-apple-ebay-and-walmart-have-some-of-the-biggest-data-warehouses-youve-ever-seen/ and http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/

Daimler TSS Data Warehouse / DHBW 118 Data Lake vs Hadoop

Data Lake •Architecture, concept

Hadoop, Spark, •Tools (that can be used Elastic Stack to implement a Lake)

Daimler TSS Data Warehouse / DHBW 119 DWH and Data Lake

DWH on RDBMS Data Lake on Hadoop

Slowly Changing Dimension Schema-on-Read ELT vs ETL Agility 3-Layer vs 2-Layer Parquet Kimball Approach Hive Inmon Definition HBase Star Schema SQL-on-Hadoop Data Vault Impala Methods, Oozie Tools, etc Concepts, Zoekeeper Tools, Techniques Tools

Daimler TSS Data Warehouse / DHBW 120 The (ab)use of the words „Data Lake“ technical

• A physical implementation of data solutions in the Hadoop ecosystem. • A reservoir of curated, and often unconnected, datasets for data science, data exploration and reference data management use. • The new Operational Data Store, a pub flap to hide a proper data management architecture. • A name to indicate the staging area or the persistent staging area of a Data Warehouse solution. • A technical solution to process huge volumes of data, for which an RDBMS is often too expensive to deploy.

Source: https://www.linkedin.com/pulse/abuse-words-data-lake-martijn-ten-napel/ Daimler TSS Data Warehouse / DHBW Daimler121 TSS The (ab)use of the words „Data Lake“ cynical reality

• A concept that will spray fairy dust upon data management issues that arise from organisational issues, politics or a lack of understanding what working with data requires of an organisation. • A term to use when you are really out of your comfort zone as an architect and to shelter behind. • A clever marketing and sales trick, used to sell you a “must have” solution next to what you already have. You are told that without a Data Lake you will be unable to do data science, digital transformation or business analytics.

Source: https://www.linkedin.com/pulse/abuse-words-data-lake-martijn-ten-napel/ Daimler TSS Data Warehouse / DHBW 122 Data Lake (Martin fowler)

Source:Daimler https://martinfowler.com/bliki/DataLake.html TSS Data Warehouse / DHBW 123 Data Lake (Martin fowler)

Source:Daimler https://martinfowler.com/bliki/DataLake.html TSS Data Warehouse / DHBW 124 What is a Data Lake?

A Data Lake acquires data from multiple sources in an enterprise in its native form and may also have internal, modeled forms of this same data for various purposes. The information thus handled could be any type of information, ranging from structured or semi-structured data to completely unstructured data. A Data Lake is expected to be able to derive enterprise-relevant meanings and insights from this information using various analysis and machine learning algorithms.

Source: Pankaj Misra, Tomcy John: Data Lake for Enterprises Packt 2017

Daimler TSS Data Warehouse / DHBW 125 Data Lake life cycle

Source:Daimler Pankaj TSS Misra, Tomcy John: Data Lake for Enterprises Packt 2017 Data Warehouse / DHBW 126 Data Lake (eckerson group)

Source: https://www.eckerson.com/articles/ten-characteristics-of-a-modern-data-architecture Daimler TSS Data Warehouse / DHBW 127 New approach to build data systems

Immutable data (Materialized) Views Query = Application (Source of Truth / (Materializedon data) Views (Materialized) Views Single version of (Materializedon data) Views (Materializedon data) Views facts) on data on data

HDFS / NoSQL (Materialized) Views (Materialized) Views (MaterializedQuery = Application) Views RDBMS on data (Materializedon data) Views (Materializedon data) Views NewSQL on data on data

Daimler TSS Data Warehouse / DHBW 128 Lambda architecture

Batch Layer Batch Engine Serving Layer Serving Backend Queries Real-Time Layer Real-Time Engine Source:

Raw history Result data data

Daimler TSS Data Warehouse / DHBW 129 Overview Lambda architecture

https://de.slideshare.net/gschmutz/big-data-and-fast-data-Lambda-architecture-in-action Daimler TSS Data Warehouse / DHBW 130 Which tools could be used in the Lambda architecture? db requirements

Easier to implement for database vendors compared to random access

Daimler TSS Data Warehouse / DHBW 131 Which tools could be used in the Lambda architecture? db requirements Batch Speed Serving layer layer layer

Write Random write Batch write sequential once More challenging

Bulk sequential read many Random read Random read times More challenging More challenging

Daimler TSS Data Warehouse / DHBW 132 What are the three paradigms of programming?

• Request/Response

• Batch

• Stream processing

Source: https://de.slideshare.net/JayKreps1/distributed-stream-processing-with-apache-kafka-71737619 Daimler TSS Data Warehouse / DHBW 133 Enterprise-wide architecture using log-centric approach

Source: https://www.confluent.io/blog/using-logs-to-build-a-solid-data-infrastructure-or-why-dual-writes-are-a-bad-idea/ Daimler TSS Data Warehouse / DHBW 134 Enterprise-wide architecture using log-centric approach and CDC

Source: https://www.confluent.io/blog/using-logs-to-build-a-solid-data-infrastructure-or-why-dual-writes-are-a-bad-idea/ Daimler TSS Data Warehouse / DHBW 135 Enterprise-wide architecture today data integration challenge

Source: https://www.confluent.io/blog/using-logs-to-build-a-solid-data-infrastructure-or-why-dual-writes-are-a-bad-idea/ Daimler TSS Data Warehouse / DHBW 136 Enterprise-wide architecture today data integration challenge

Source: https://www.confluent.io/blog/using-logs-to-build-a-solid-data-infrastructure-or-why-dual-writes-are-a-bad-idea/ Daimler TSS Data Warehouse / DHBW 137 Enterprise-wide architecture today data integration challenge

Source: https://www.confluent.io/blog/using-logs-to-build-a-solid-data-infrastructure-or-why-dual-writes-are-a-bad-idea/ Daimler TSS Data Warehouse / DHBW 138 Log centric architecture / Kappa architecture / data streaming platform

Source: Jay Kreps: I heart logs, O’Reilly 2014 Daimler TSS Data Warehouse / DHBW 139 What is stream processing?

Most common understanding (doing the „T“ in ETL): • Stream processing is the parallel processing of data in motion = computing on data directly as it is produced or received. • Not necessarily transient, approximate, lossy (assumptions from Lambda architecture and other event processing systems)

Daimler TSS Data Warehouse / DHBW 140 Messaging systems @Linkedin

1st attempt 2nd attempt • Active MQ, RabbitMQ • Kafka • Problems: • Key abstraction: Logs • Not distributed • build from scratch • Throughput • Distributed system by design • Persistence • Partitioning with local ordering • Ordering • Elastic scaling • Fault tolerance

Daimler TSS Data Warehouse / DHBW 141 Kafka: a modern distributed system for streams

• Scalability • Hundreds of MB/sec/server throughput • Many TB per node • Guarantees of a database • All messages strictly ordered (within a ) • All data persistent • Distributed by default • Replication • Partitioning • Producers + consumers all fault tolerant and horizontally scalable

Daimler TSS Data Warehouse / DHBW 142 ETL revisited with kafka connect and kafka streams

Source: https://de.slideshare.net/JayKreps1/distributed-stream-processing-with-apache-kafka-71737619 Daimler TSS Data Warehouse / DHBW 143 What are stream processing frameworks?

streaming • (twitter, Nathan Marz) • Apache Samza (linkedin) • Streams • Simple library • Reprocessing • No microbatch = everything is a stream • Local state • Key operations: filter, aggregate, join

Daimler TSS Data Warehouse / DHBW 144 Kafka @Linkedin

Source: https://de.slideshare.net/JayKreps1/i-32858698 Daimler TSS Data Warehouse / DHBW 145 Summary: Stream data platform using Kafka unifying batch and stream processing

Source: https://de.slideshare.net/JayKreps1/i-32858698 Daimler TSS Data Warehouse / DHBW 146 Kafka and microservices

• A message-oriented implementation requires an efficient messaging backbone that facilitates the exchange of data in a reliable and secure way with the lowest latency possible. • Creating small, self-contained, data-driven applications that meld streaming data and microservices together is a good practice to break down large problems and projects into approachable chunks, reduce risk, and deliver value faster. • Think of combinations of data-processing applications with microservices to deliver specific features and insights from a data stream.

Daimler TSS Data Warehouse / DHBW 147 Kappa architecture

Real-Time Layer Serving Layer Real-Time Serving Data Engine Backend Queries

Source:

Raw history Result data data

Daimler TSS Data Warehouse / DHBW 148 PS-3C architecture

Data Library 3C Layer Serving Layer Storage Preparation Delivery Data Engine Engine Engine Queries

Raw history Integrated Result data subjects data

Daimler TSS Data Warehouse / DHBW 149 PS-3C architecture

• Architecture by Rogier Werschkull • Store incoming data in Data Library Layer (Persistent staging = PS) • Prepare data in a 3C layer for “Concept – Context – Connector”-model • Concept + Connector can be virtualized on data in Data Library Layer

Daimler TSS Data Warehouse / DHBW 150 Polyglot warehouse

• Architecture by Joe Caserta • Big Data Warehouse may live in one or more platforms on premises or in the cloud • Hadoop only • Hadoop + MPP or RDBMS • Additionally NoSQL or Search

Daimler TSS Data Warehouse / DHBW 151 Polyglot warehouse

Source: https://www.slideshare.net/CasertaConcepts/Hadoop-and-your-data-warehouse Daimler TSS Data Warehouse / DHBW 152 The extended Data Warehouse architecture (xdw) The Enterprise Analytics Architecture

• Architecture by Claudia Imhoff • combine the stability and reliability of the BI architectures while embracing new and innovative technologies and techniques • 3 components that extend the EDW environment • Investigative computing platform • Data refinery • Real-time (RT) analysis platform

Data Warehouse / Daimler TSS Data Warehouse / DHBW Daimler153 TSS153 DHBW The extended Data Warehouse architecture (xdw) The Enterprise Analytics Architecture

Source: https://upside.tdwi.org/articles/2016/03/15/extending-traditional-data-warehouse.aspx Daimler TSS Data Warehouse / DHBW 154 Gartner Data Lake architecture styles

• Inflow Lake: accommodates a collection of data ingested from many different sources that are disconnected outside the lake but can be used together by being colocated within a single place • Outflow Lake: a landing area for freshly arrived data available for immediate access or via streaming. It employs schema-on-read for the downstream data interpretation and refinement. • Data Science Lab: most suitable for data discovery and for developing new advanced analytics models

Source: http://blogs.gartner.com/nick-heudecker/data-lake-webinar-recap/

Daimler TSS Data Warehouse / DHBW 155 Gartner Data Lake architecture styles

Source: http://blogs.gartner.com/nick-heudecker/data-lake-webinar-recap/ Daimler TSS Data Warehouse / DHBW 156 Gartner – the logical DWH

https://blogs.gartner.com/henry-cook/2018/01/28/the-logical-data-warehouse-and-its-jobs-to- be-done/

https://blogs.gartner.com/henry-cook/2018/01/28/the-logical-data-warehouse-and-its-jobs-to-be-done/ Daimler TSS Data Warehouse / DHBW 157 Summary

Landing Area Data Lake Data Presentation Storage Integration Delivery Data Engine Engine Engine Queries

Raw history Lightly Result data integrated data data

Daimler TSS Data Warehouse / DHBW 158 Data core – Dave Wells

• Architecture by Eckerson Group • DWHs exist together with MDM, ODS, and portions of the Data Lake as a collection of data that is curated, profiled, and trusted for enterprise reporting and analysis

https://www.eckerson.com/articles/the-future-of-the-data-warehouse Daimler TSS Data Warehouse / DHBW 159 Data core – Dave Wells

https://www.eckerson.com/articles/the-future-of-the-data-warehouse

Daimler TSS Data Warehouse / DHBW 160 Data lifecycle – Dave Wells

https://www.eckerson.com/articles/the-future-of-the-data-warehouse Daimler TSS Data Warehouse / DHBW 161 DWH and Data Lake – Dave Wells In parallel vs inside

https://www.eckerson.com/articles/the-future-of-the-data-warehouse

Daimler TSS Data Warehouse / DHBW 162 in the age of Big Data

• Dimensional modeling is not dead. • The benefits are still valid in the age of Big Data, Hadoop, Spark, etc: • Eliminate joins • Data model is understandable for end users • Well-suited for columnar storage + processing (e.g. SIMD) • Nesting technique • E.g. tables with lower granularity can be nested into large • Usage in SQL: Flatten(kvgen())

Daimler TSS Data Warehouse / DHBW 163 Self-service data

Functional Why important Self-service approach area Data With shadow analytics, users create The system must be capable of autonomously redundant data copies. identifying the best optimizations and adapting to acceleration emerging query patterns over time. Data Data consumers struggle to find data that In the self-service approach, the catalog is is important to their work. Users keep automatic—as new data sources are brought online. catalog private notes about data sources and data quality, meaning there is no governance. Data It is virtually impossible for an organization Data consumers need to be able to access all data to centralize all data in a single system. sets equally well, regardless of the underlying virtualizatio technology or location of the system. n https://www.oreilly.com/ideas/how-self-service-data-avoids-the-dangers-of-shadow-analytics

Daimler TSS Data Warehouse / DHBW 164 Self-service data

Functional Why important Self-service approach area Data There is no single “shape” of data that Data consumers need the ability to interact with works for everyone. data sets from the context of the data itself, not curation exclusively from simple metadata that fails to tell the whole story. Data consumers should be capable of reshaping data to their own needs without writing any code or learning new languages. Data lineage As data is accessed by data consumers As users reshape and share data sets with one and in different processes, it is important another through a virtual context, a self-service data to track the provenance of the data, who platform can seamlessly track these actions and all accessed the data, how the data was states of data along the way, providing full audit accessed, what tools were used, and what capabilities as well. results were obtained. https://www.oreilly.com/ideas/how-self-service-data-avoids-the-dangers-of-shadow-analytics

Daimler TSS Data Warehouse / DHBW 165 Self-service data

Functional Why important Self-service approach area Open Because data is essential to every area of Self-service data platforms build on open source every business, the underlying data standards like , , and source formats and technologies used to access to store, query, and analyze data and process the data should be open from any source. source Security Organizations safeguard their data assets Self-service data platforms integrate with existing with security controls that govern security controls of the organization, such as LDAP controls authentication (you are who you say you and Kerberos. are), authorization (you can perform specific actions), auditing (a record of the actions you take), and encryption (you can only read the data if you have the right key). https://www.oreilly.com/ideas/how-self-service-data-avoids-the-dangers-of-shadow-analytics

Daimler TSS Data Warehouse / DHBW 166 Is the DWH dead?

Hadoop (and for similar reasons Spark) has its strengths but no as a DWH replacement, e.g. • Fast query reads only possible in HBase with an inflexible (use case specific) data model • No sophisticated query optimizer • Hadoop is very complex with many tools/versions/vendors and no standard • Security is still at the beginning

Daimler TSS Data Warehouse / DHBW 167 Digitization is the process of converting information into a digital (i.e. computer-readable) format Digitization – the digital data explosion

Source: https://slideplayer.com/slide/10254426/

Daimler TSS Data Warehouse / DHBW 168 Future mobility

https://blog.daimler.com/en/2018/07/19/future-mobility-metropolitan-cities-congress/ Daimler TSS Data Warehouse / DHBW 169 Display video Is china the next silicon valley?

Daimler TSS Data Warehouse / DHBW 170 Some statistics bat – baidu + alibaba + tencent

Sources: https://venitism.wordpress.com/2017/12/15/beware-of-the-bats-baidu-alibaba-and-tencent/ https://www.afr.com/brand/business-summit/baidu-alibaba-tencent-to-disrupt-facebook-amazon-netflix-google-in-asia-20180228-h0wrdl Daimler TSS Data Warehouse / DHBW 171 Digitization – challenges & opportunities

Data-driven Ethics Volume Data quality AI Velocity Variety Real-time Data integration Veracity Value Impact Digital Services IoT, Industry 4.0, …

Daimler TSS Data Warehouse / DHBW 172 Digitization public authorities

Daimler TSS Data Warehouse / DHBW 173 Nathan Marz view on Schema

„Many developers go down the path of • Rawness writing their raw data in a schemaless Store the data as it is. No format like JSON. This is appealing because transformations. of how easy it is to get started, but this • Immutability approach quickly leads to problems. Don’t update or delete data, just Whether due to bugs or misunderstandings add more. between different developers, data corruption inevitably occurs“ • Graph-like schema recommended (see page 103, Nathan Marz, „Big Data: Principles and best practices of scalable realtime data systems", Manning Publications)

Source image: Nathan Marz, James Warren: Big Data: Principles and best practices of scalable realtime data systems, Manning Publications 2015 Daimler TSS Data Warehouse / DHBW 174 source: Markus Schmidberger - Big Data ist tot – Es lebe Business Intelligenz?, TDWI 2016 Daimler Daimler TSS Data / Data DHBW Warehouse 175 source: Markus Schmidberger - Big Data ist tot – Es lebe Business Intelligenz?, TDWI 2016 AWS Lambda # Lambda Remark • • Lambda@ : own requirements architecture to Adapt Lambda requirements real processing with driven data Enrich batch - time glomex architecture -