Designing a Modern Data Warehouse + Data Lake Strategies & architecture options for implementing a modern data warehousing environment
Melissa Coates, Solution Architect, BlueGranite
Blog: sqlchick.com | Twitter: @sqlchick
Presentation content last updated: March 29, 2017

Agenda
Strategies & architecture options for:
1) Evolving to a Modern Data Warehouse
2) Data Lake Objectives, Challenges, & Implementation Options
3) The Logical Data Warehouse & Data Virtualization

Evolving to a Modern Data Warehouse

Data Warehousing
Data is inherently more valuable once it is integrated from multiple systems.
Full view of a customer:
o Sales activity
o Delinquent invoices
o Support/help requests
A DW is designed to be user-friendly, utilizing business terminology.
A DW is frequently built with a denormalized (star schema) data model.
Data modeling + ETL processes consume most of the time & effort.

Transaction System vs. Data Warehouse
OLTP:
✓ Focus: operational transactions ("writes")
✓ Scope: one database system
✓ Ex. objectives: process a customer order; generate an invoice

Data Warehouse:
✓ Focus: informational and analytical ("reads")
✓ Scope: integrate data from multiple systems
✓ Ex. objectives: identify lowest-selling products; analyze margin per customer

Traditional Data Warehousing

Loan Repayment Scenario:
Organizational data: customer data (income, assets), loan history, payment activity. Third party data: credit history.
[Diagram] Sources flow through an operational data store and batch ETL into the data warehouse (historical data, data marts, master data); an OLAP semantic layer serves analytics & reporting tools, alongside operational reporting from the operational application.

Modernizing an Existing DW

Loan Repayment Scenario:
[Diagram] The modernized architecture adds: streaming data from devices & sensors for alerts and near-real-time monitoring; a data lake and Hadoop feeding advanced analytics and machine learning (predictive analytics, ex: a model to predict repayment ability); new sources such as phone records (sentiment analysis), e-mail records (text analytics), social media (personal comments), and demographics; federated queries; self-service reports & models; and an in-memory data model, all alongside the existing operational data store, batch ETL, data warehouse, OLAP semantic layer, data marts, and master data.

What Makes a Data Warehouse "Modern"
✓ Variety of data sources; multistructured data
✓ Coexists with a data lake
✓ Coexists with Hadoop
✓ Larger data volumes; MPP
✓ Multi-platform architecture
✓ Data virtualization + integration
✓ Support for all user types & levels
✓ Deployment decoupled from dev
✓ Flexible deployment model
✓ Governance & MDM
✓ Promotion of self-service solutions
✓ Near real-time data; Lambda architecture
✓ Cloud integration; hybrid environments
✓ Advanced analytics
✓ Agile delivery
✓ Analytics sandbox w/ promotability
✓ Automation & APIs
✓ Data catalog; search ability
✓ Scalable architecture
✓ Bimodal environment

Ways to Approach Data Warehousing

The DW *is* Hadoop:
✓ Many open source projects for SQL-on-Hadoop, batch processing & interactive queries (Hive, Impala, Spark, Presto, etc.)
✓ Can utilize distributions (Hortonworks, Cloudera, MapR)
✓ Challenging implementation
[Diagram] Distributed analytics & storage, surfaced through reporting tools.

OR
The DW & Hadoop co-exist & complement each other:
✓ Generally an easier path
✓ Augment an existing DW environment
✓ Additional value to existing DW investment
(Focus of the remainder of this presentation.)

Growing an Existing DW Environment
Data Warehouse
Growing a DW:
✓ Data modeling strategies
✓ Partitioning
✓ Clustered columnstore index
✓ In-memory structures
✓ MPP (massively parallel processing)

Larger Scale Data Warehouse: MPP
Massively Parallel Processing (MPP) operates on high volumes of data across distributed nodes.
✓ Shared-nothing architecture: each node has its own disk, memory, CPU (a control node coordinates the compute nodes)
✓ Decoupled storage and compute
✓ Scale up compute nodes to increase parallelism
✓ Integrates with relational & non-relational data
Examples: Azure SQL DW, APS, Amazon Redshift, Snowflake

Growing an Existing DW Environment
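The shared-nothing MPP pattern described above can be sketched in a few lines: rows are assigned to compute nodes by hashing a distribution key, so each node scans only its own slice in parallel. This is a toy illustration in plain Python, not any vendor's actual algorithm; the table and column names are invented:

```python
from collections import defaultdict

def distribute(rows, key, num_nodes):
    """Assign each row to a compute node by hashing its distribution key."""
    nodes = defaultdict(list)
    for row in rows:
        node_id = hash(row[key]) % num_nodes  # shared-nothing: one owner per row
        nodes[node_id].append(row)
    return nodes

def parallel_count(nodes):
    """Each node counts its own slice; the control node sums the partial results."""
    return sum(len(node_slice) for node_slice in nodes.values())

# Hypothetical fact table; customer_id acts as the distribution key.
sales = [{"customer_id": i % 7, "amount": i * 10} for i in range(100)]
nodes = distribute(sales, "customer_id", num_nodes=4)
assert parallel_count(nodes) == len(sales)  # no rows lost or duplicated
```

All rows sharing a key land on the same node, which is why vendors stress choosing a distribution key that spreads data evenly.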
[Diagram] The data warehouse at the center, surrounded by a data lake, Hadoop, NoSQL, and an in-memory model.
Growing a DW:
✓ Data modeling strategies
✓ Partitioning
✓ Clustered columnstore index
✓ In-memory structures
✓ MPP (massively parallel processing)

Extending a DW:
✓ Complementary data storage & analytical solutions
✓ Cloud & hybrid solutions
✓ Data virtualization (virtual DW)
-- Grow around your existing data warehouse --

Multi-Structured Data

Objectives:
1) Storage for multi-structured data (json, xml, csv…) with a 'polyglot persistence' strategy
2) Integrate portions of the data into the data warehouse
3) Federated query access (data virtualization)
[Diagram] Organizational data, web logs, social media, images/audio/video, and devices, sensors, spatial & GPS data land in the data lake, Hadoop, and NoSQL stores within the big data & analytics infrastructure, alongside the data warehouse and analytics & reporting tools.

Lambda Architecture

Speed Layer: low latency data
Batch Layer: data processing to support complex analysis
Serving Layer: responds to queries
Objectives:
• Support large volume of high-velocity data
• Near real-time analysis + persisted history
[Diagram] Streaming data from devices & sensors flows through data ingestion into the speed layer (feeding dashboards) and into the data lake (raw data, curated data); the batch layer feeds advanced analytics and machine learning, and batch ETL with organizational & master data loads the data warehouse, which serves reports, dashboards, and analytics & reporting tools via the serving layer.

Hybrid Architecture

Objectives:
1) Scale up MPP compute nodes during peak ETL data loads or high query volumes
2) Utilize existing on-premises data structures
3) Take advantage of cloud services for advanced analytics
[Diagram] On-premises sources (CRM, sales system, inventory system, data mart) feed a cloud MPP data warehouse and Hadoop, with machine learning and social media data in the cloud, surfaced through analytics & reporting tools.

Data Lake Objectives, Challenges & Implementation Options

Data Lake
A repository for analyzing large quantities of disparate sources of data in its native format.
One architectural platform to house all types of data:
o Machine-generated data (ex: IoT, logs)
o Human-generated data (ex: tweets, e-mail)
o Traditional operational data (ex: sales, inventory)

Objectives of a Data Lake
✓ Reduce up-front effort by ingesting data in any format without requiring a schema initially
✓ Make acquiring new data easy, so it can be available for data science & analysis quickly
✓ Store large volume of multi-structured data in its native format

Objectives of a Data Lake
✓ Defer work to 'schematize' until after value & requirements are known
✓ Achieve agility faster than a traditional data warehouse can
✓ Speed up decision-making ability
✓ Storage for additional types of data which were historically difficult to obtain

Strategy: Data Lake as a Staging Area for DW
• Reduce storage needs in the data warehouse
• Practical use for data stored in the data lake
1) Utilize the data lake as a landing area for DW staging, instead of the relational database
[Diagram] CRM, corporate data, social media data, and devices & sensors land in the data lake store (raw data: staging area), which feeds the data warehouse and analytics & reporting tools.

Strategy: Data Lake for Active Archiving
Data archival, with query ability available when needed
1) Archival process based on data retention policy
2) Federated query to access current & historical data
[Diagram] CRM, corporate data, social media data, and devices & sensors land in the data lake store (raw data: staging area, plus an active archive); the data warehouse archives older data to the lake per the retention policy, and federated queries span the warehouse and the active archive for analytics & reporting tools.
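One way to sketch the retention-driven archival step above: partitions older than the policy window move from the warehouse's hot storage to the lake's active archive, yet remain reachable through a federated query. A simplified Python sketch; the partition representation and the 365-day policy are invented for illustration:

```python
from datetime import date

RETENTION_DAYS = 365  # hypothetical policy: keep roughly one year hot in the DW

def archive_partitions(dw_partitions, today, retention_days=RETENTION_DAYS):
    """Split partitions into those kept in the DW and those moved to the active archive."""
    keep, archive = {}, {}
    for partition_date, partition_rows in dw_partitions.items():
        if (today - partition_date).days > retention_days:
            archive[partition_date] = partition_rows  # archival per retention policy
        else:
            keep[partition_date] = partition_rows
    return keep, archive

def federated_query(dw, archive):
    """A federated query spans current (DW) and historical (archive) data."""
    return {**archive, **dw}

dw = {date(2015, 1, 1): ["old rows"], date(2017, 3, 1): ["recent rows"]}
hot, cold = archive_partitions(dw, today=date(2017, 3, 29))
assert date(2015, 1, 1) in cold and date(2017, 3, 1) in hot
assert federated_query(hot, cold).keys() == dw.keys()  # nothing lost to the archive
```

The point of the pattern is the last line: consumers still see the full history even though the warehouse only pays to keep the recent slice.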
Iterative Data Lake Pattern
1) Acquire data: ingest and store data indefinitely in its native format, with cost-effective storage
2) Analyze data: analyze in place to determine the value of the data ('schema on read'), with scalable parallel processing ability
3) Deliver data: for data of value, integrate with the data warehouse ('schema on write') once requirements are fully known, or use data virtualization on a case-by-case basis

Data Lake Implementation
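The 'schema on read' step in the iterative pattern above can be mimicked in a few lines of Python: raw text lands in the lake as-is, and a schema is applied only at query time. The column names and types here are invented for illustration:

```python
import csv, io

# Raw data lands in the lake untyped, in its native format.
raw = "id,amount,loan_date\n1,250.00,2017-01-15\n2,975.50,2017-02-03\n"

# Schema on read: types are declared at query time, not at load time.
schema = {"id": int, "amount": float, "loan_date": str}

def read_with_schema(raw_text, schema):
    """Parse raw CSV text, coercing each column per the declared schema."""
    reader = csv.DictReader(io.StringIO(raw_text))
    return [{col: schema[col](val) for col, val in row.items()} for row in reader]

rows = read_with_schema(raw, schema)
assert rows[0]["amount"] == 250.0 and isinstance(rows[1]["id"], int)
```

If the data turns out to have no value, no modeling effort was wasted; if it does, the same schema becomes the starting point for 'schema on write' integration into the warehouse.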
A data lake is a conceptual idea; it can be implemented with one or more technologies.
✓ HDFS (Hadoop Distributed File System) is a very common option for data lake storage. However, Hadoop is not a requirement for a data lake, and a data lake may also span more than one Hadoop cluster.
✓ NoSQL databases are also very common.
✓ Object stores (like Amazon S3 or Azure Blob Storage) can also be used.

Coexistence of Data Lake & Data Warehouse
Data Lake Enterprise Data Warehouse
Data Lake values:
✓ Agility
✓ Flexibility
✓ Rapid delivery
✓ Exploration

DW values:
✓ Governance
✓ Reliability
✓ Standardization
✓ Security

Data acquisition: less effort in the data lake, more effort in the DW.
Data retrieval: more effort in the data lake, less effort in the DW.

Zones in a Data Lake
[Diagram] Zones within the data lake: Transient/Temp Zone, Raw Data/Staging Zone, Curated Data Zone, and Analytics Sandbox, underpinned by metadata | security | governance | information management.

Raw Data Zone
✓ The raw data zone is immutable to change
✓ History is retained to accommodate future unknown needs
✓ Staging may be a distinct area on its own
✓ Supports any type of data: streaming, batch, one-time, full load, incremental load, etc.

Transient Zone
✓ Useful when data quality or validity checks are necessary before data can be landed in the Raw Zone
✓ All landing zones are considered the "kitchen area" with highly limited access:
o Transient Zone
o Raw Data Zone
o Staging Area

Curated Data Zone
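A minimal sketch of the Transient Zone gate described above: records land in a transient area, pass quality checks, and only then move to the Raw Zone. The field names and validity rules are illustrative assumptions, not a standard:

```python
def is_valid(record):
    """Example quality checks a record must pass before landing in the Raw Zone."""
    return (
        record.get("customer_id") is not None            # required key present
        and isinstance(record.get("amount"), (int, float))
        and record["amount"] >= 0                         # no negative payments
    )

def promote_from_transient(transient_zone):
    """Valid records move to the Raw Zone; the rest stay quarantined for review."""
    raw_zone = [r for r in transient_zone if is_valid(r)]
    quarantined = [r for r in transient_zone if not is_valid(r)]
    return raw_zone, quarantined

landed = [
    {"customer_id": 1, "amount": 100.0},
    {"customer_id": None, "amount": 50.0},   # fails: missing key
    {"customer_id": 2, "amount": -5},        # fails: negative amount
]
raw_zone, quarantined = promote_from_transient(landed)
assert len(raw_zone) == 1 and len(quarantined) == 2
```

Because the Raw Zone is immutable, this kind of check is the last chance to reject bad records before they are retained indefinitely.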
✓ Cleansed, organized data for data delivery:
o Data consumption
o Federated queries
o Provides data to other systems
✓ Most self-service data access occurs from the Curated Data Zone
✓ Standard governance & security in the Curated Data Zone

Analytics Sandbox
✓ Data science and exploratory activities
✓ Minimal governance of the Analytics Sandbox
✓ Valuable efforts are "promoted" from the Analytics Sandbox to the Curated Data Zone or to the data warehouse

Sandbox Solutions: Develop
1) Utilize the sandbox area in the data lake for data preparation
2) Execution of R scripts from a local workstation for exploratory data science & advanced analytics scenarios
[Diagram] CRM, corporate data, social media, devices & sensors, and flat files land in the data lake (raw data, curated data, sandbox); a data scientist/analyst works against the sandbox and the data warehouse from a local workstation.

Sandbox Solutions: Operationalize
1) The trained model is promoted to run in a production server environment
2) Sandbox use is discontinued once the solution is promoted
3) Execution of R scripts from a server (R Server) for operationalized data science & advanced analytics scenarios
[Diagram] The same sources land in the data lake; the promoted solution runs on R Server alongside the data warehouse, and the data scientist/analyst consumes the results.

Organizing the Data Lake
Plan the structure based on optimal data retrieval. The organization pattern should be self-documenting.
Organization is frequently based upon:
✓ Subject area
✓ Time partitioning
✓ Security boundaries
✓ Downstream app/purpose
Note: the metadata capabilities of your technology will have a *big* impact on how you choose to handle organization.

-- The objective is to avoid a chaotic data swamp --

Organizing the Data Lake: Example 1
Raw Data Zone: Subject Area > Data Source > Object > Date Loaded > File(s)
Ex: Sales > Salesforce > CustomerContacts > 2016 > 12 > 2016_12_01 > CustCct.txt
Curated Data Zone: Purpose > Type > Snapshot Date > File(s)
Ex: Sales Trending Analysis > Summarized > 2016_12_01 > SalesTrend.txt
Pros: subject area at top level, organization-wide; partitioned by time
Cons: no obvious security or organizational boundaries

Organizing the Data Lake: Example 2
Raw Data Zone: Organizational Unit > Subject Area > Data Source > Object > Date Loaded > File(s)
Ex: East Division > Sales > Salesforce > CustomerContacts > 2016 > 12 > 2016_12_01 > CustCct.txt
Curated Data Zone: Organizational Unit > Purpose > Type > Snapshot Date > File(s)
Ex: East Division > Sales Trending Analysis > Summarized > 2016_12_01 > SalesTrend.txt
Pros: security at the organizational level; partitioned by time
Cons: potentially siloed data; duplicated data

Organizing the Data Lake
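A layout like Example 1 above can be enforced in code so every load writes to a predictable, self-documenting path. A small sketch; the zone root names "raw" and "curated" are invented for illustration:

```python
from datetime import date
from pathlib import PurePosixPath

def raw_path(subject_area, source, obj, load_date, filename):
    """Raw Data Zone: Subject Area / Data Source / Object / Date Loaded / File."""
    return PurePosixPath(
        "raw", subject_area, source, obj,
        f"{load_date:%Y}", f"{load_date:%m}", f"{load_date:%Y_%m_%d}", filename,
    )

def curated_path(purpose, data_type, snapshot_date, filename):
    """Curated Data Zone: Purpose / Type / Snapshot Date / File."""
    return PurePosixPath(
        "curated", purpose, data_type, f"{snapshot_date:%Y_%m_%d}", filename
    )

p = raw_path("Sales", "Salesforce", "CustomerContacts", date(2016, 12, 1), "CustCct.txt")
assert str(p) == "raw/Sales/Salesforce/CustomerContacts/2016/12/2016_12_01/CustCct.txt"
```

Centralizing path construction like this is one way to keep the organization pattern self-documenting: no load process can invent its own folder convention.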
Other options which affect organization and/or metadata:
✓ Data retention policy: temporary data, permanent data, applicable period (ex: project lifetime), etc.
✓ Probability of data access: recent/current data, historical data, etc.
✓ Confidential classification: public information; internal use only; supplier/partner confidential; personally identifiable information (PII); sensitive (financial); sensitive (intellectual property); etc.
✓ Business impact/criticality: high (HBI), medium (MBI), low (LBI)
✓ Owner / steward / SME

Challenges of a Data Lake
Technology:
✓ Complex, multi-layered architecture
✓ Unknown storage & scalability
✓ Data retrieval
✓ Working with un-curated data
✓ Performance

Process:
✓ Right balance of deferred work vs. up-front work
✓ Ignoring established best practices for data management
✓ Data quality
✓ Governance
✓ Security
✓ Change management

People:
✓ Expectations
✓ Data stewardship
✓ Redundant effort
✓ Skills required to make analytical use of the data

Ways to Get Started with a Data Lake
1. Data lake as a staging area for the DW
2. Offload archived data from the DW back to the data lake
3. Ingest a new type of data to allow time for longer-term planning

Getting Real Value From the Data Lake
1. Selective integration with the data warehouse (physical or virtual)
2. Data science experimentation with APIs
3. Analytical toolsets on top of the data lake: Hive, Pig, Spark, Impala, Presto, Drill, Solr, Kafka, etc.
4. Query interfaces on top of the data lake using familiar technologies: SQL-on-Hadoop, OLAP-on-Hadoop, data virtualization, metadata, data cataloging

The Logical Data Warehouse & Data Virtualization

Conceptual Data Warehouse Layers
User Access: analytics & reporting tools
Metadata: views into the data storage
Storage: data persistence
Software: database management system (DBMS)
Foundational Systems: servers, network
Source: Philip Russom, TDWI

Logical Data Warehouse
The same layers, with two changes:
✓ A data virtualization abstraction layer sits between user access (analytics & reporting tools) and the metadata views into the data storage
✓ Storage and software become distributed storage & processing from databases, the data lake, Hadoop, NoSQL, etc.

Logical Data Warehouse
An LDW is a data warehouse which uses "repositories, virtualization, and distributed processes in combination."
7 major components of an LDW:
✓ Data Virtualization (focus here)
✓ Distributed Processing (focus here)
✓ Repository Management
✓ Auditing & Performance Services
✓ Metadata Management
✓ Service Level Agreement Management
✓ Taxonomy/Ontology Resolution
Source: Gartner

Data Virtualization
Objective: ability to access various data platforms without doing full data integration
1) User issues a query from the analytical tool of choice
2) Data is returned from this federated query across more than one data source (ex: the data warehouse and the data lake)

Data Virtualization
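The two-step federated flow above can be mimicked with an in-memory sketch: the 'virtualization layer' takes one logical query, fetches from two different stores, and joins the results for the user. Both stores here are stand-ins invented for illustration (an in-memory SQLite database plays the warehouse, a dict plays the lake):

```python
import sqlite3

# Source 1: the relational data warehouse (simulated with in-memory SQLite).
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
dw.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])

# Source 2: the data lake holding semi-structured clickstream data.
lake = {1: {"page_views": 42}, 2: {"page_views": 7}}

def federated_query():
    """Virtualization layer: one logical query spanning both sources."""
    result = []
    for cust_id, name in dw.execute("SELECT id, name FROM customers ORDER BY id"):
        # Join warehouse rows with lake attributes without moving either dataset.
        result.append({"id": cust_id, "name": name, **lake.get(cust_id, {})})
    return result

rows = federated_query()
assert rows[0] == {"id": 1, "name": "Ada", "page_views": 42}
```

Neither dataset is copied into the other store; the join happens at query time, which is exactly what makes virtualization fast to set up and also what creates the performance challenges listed later.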
[Diagram] User queries pass through the data virtualization layer, which handles data access across the big data & analytics infrastructure: data warehouse, data lake, Hadoop, NoSQL, transactional systems, ERP & CRM systems, cloud systems, master data, web logs, social media, images/audio/video, spatial & GPS data, devices & sensors, third party data, and flat files.

Objectives of Data Virtualization
✓ Add flexibility & speed to a traditional data warehouse
✓ Make current data available quickly 'where it lives'; useful when:
o Data is too large to practically move
o The data movement window is small
o Data cannot legally move out of a geographic region
✓ Enable user access to various data platforms
✓ Reduce data latency; enable near real-time analytics
✓ Reduce data redundancy & processing time
✓ Facilitate a polyglot persistence strategy (use the best storage for the data)

Challenges of Data Virtualization
✓ Performance; impact of adding reporting load on source systems
✓ Limited ability to handle data quality & referential integrity issues
✓ Complexity of the virtualization layer (ex: different data formats, query languages, data granularities)
✓ Change management & managing the lifecycle + impact of changes
✓ Lack of historical data; inability to do point-in-time historical analysis
✓ Consistent security, compliance & auditing
✓ Real-time reporting can be confusing with its frequent data changes
✓ Downtime of underlying data sources
✓ Auditing & reconciling abilities across systems

Wrap-Up & Questions

Growing an Existing DW Environment
[Diagram] The data warehouse at the center, surrounded by a data lake, Hadoop, NoSQL, and an in-memory model.
Growing a DW:
✓ Data modeling strategies
✓ Partitioning
✓ Clustered columnstore index
✓ In-memory structures
✓ MPP (massively parallel processing)

Extending a DW:
✓ Complementary data storage & analytical solutions
✓ Cloud & hybrid solutions
✓ Data virtualization (virtual DW)
-- Grow around your existing data warehouse --

Final Thoughts
Traditional data warehousing is still important, but it needs to co-exist with other platforms. Build around your existing DW infrastructure.
Plan your data lake with data retrieval in mind.
Expect to balance ETL with some (perhaps limited) data virtualization techniques in a multi-platform environment.
Be fully aware of data lake & data virtualization challenges in order to craft your own achievable & realistic best practices.
Plan to work in an agile fashion. Conduct frequent proof-of-concept projects to validate assumptions.

Recommended Resources
(Resources range from fundamentals to advanced.)
Recommended Whitepaper: "How to Build An Enterprise Data Lake: Important Considerations Before Jumping In" by Mark Madsen, Third Nature Inc.

Thank You!
To download a copy of this presentation: SQLChick.com “Presentations & Downloads” page
Melissa Coates, Solution Architect, BlueGranite
Blog: sqlchick.com | Twitter: @sqlchick
Creative Commons License: Attribution-NonCommercial-NoDerivative Works 3.0

Appendix A: Terminology

Logical Data Warehouse: facilitates access to various source systems via data virtualization, distributed processing, and other system components.
Data Virtualization: access to one or more distributed data sources without requiring the data to be physically materialized in another data structure.
Data Federation: accesses & consolidates data from multiple distributed data stores.
Polyglot Persistence: using the most effective data storage technology to handle different data storage needs.
Schema on Write: the data structure is applied at design time, requiring additional up-front effort to formulate a data model.
Schema on Read: the data structure is applied at query time rather than when the data is initially stored; deferred up-front effort facilitates agility.
Defining the Components of a Modern Data Warehouse: http://www.sqlchick.com/entries/2017/1/9/defining-the-components-of-a-modern-data-warehouse-a-glossary

Appendix B: What Makes A Data Warehouse "Modern"

Variety of subject areas & data sources for analysis, with capability to handle large volumes of data
Expansion beyond a single relational DW/data mart structure to include Hadoop, Data Lake, or NoSQL
Logical design across multi-platform architecture balancing scalability & performance
Data virtualization in addition to data integration What Makes a Data Warehouse “Modern”
Support for all types & levels of users
Flexible deployment (including mobile) which is decoupled from tool used for development
Governance model to support trust and security, and master data management
Support for promoting self-service solutions to the corporate environment

What Makes a Data Warehouse "Modern"
Ability to facilitate near real-time analysis on high-velocity data (Lambda architecture)
Support for advanced analytics
Agile delivery approach with fast delivery cycle
Hybrid integration with cloud services
APIs for downstream access to data

What Makes a Data Warehouse "Modern"
Some DW automation to improve speed, consistency, & flexibly adapt to change
Data cataloging to facilitate data search & document business terminology
An analytics sandbox or workbench area to facilitate agility within a bimodal BI environment
Support for self-service BI to augment corporate BI: data discovery, data exploration, self-service data prep

Appendix C: Challenges With Modern Data Warehousing
Agility:
✓ Evolving & maturing technology
✓ Reducing time to value
✓ Minimizing chaos
✓ Balancing 'schema on write' with 'schema on read'
✓ How strict to be with dimensional design?

Challenges with Modern Data Warehousing
Complexity:
✓ Multi-platform infrastructure
✓ Ever-increasing data volumes
✓ File type & format diversity
✓ Hybrid scenarios
✓ Real-time reporting needs
✓ Effort & cost of data integration
✓ Broad skillsets needed

Challenges with Modern Data Warehousing
Balance with self-service initiatives:
✓ Self-service solutions which challenge the centralized DW
✓ Managing 'production' delivery from IT and user-created solutions
✓ Handling ownership changes (promotion) of valuable solutions

Challenges with Modern Data Warehousing
The never-ending challenges:
✓ Data quality
✓ Master data
✓ Security
✓ Governance

Appendix D: Challenges of a Data Lake

[Diagram] Challenges fall into three overlapping areas: People, Process, Technology.

Challenges of a Data Lake: Technology
Complex, multi-layered architecture:
✓ Polyglot persistence strategy
✓ Architectures & toolsets are emerging and maturing

Unknown storage & scalability:
✓ Can we realistically store "everything"?
✓ Cloud deployments are attractive when scale is undetermined

Working with un-curated data:
✓ Inconsistent dates and data types
✓ Data type mismatches
✓ Missing or incomplete data
✓ Flex-field data which can vary per record
✓ Different granularities
✓ Incremental data loads, etc.

Challenges of a Data Lake: Technology
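The 'inconsistent dates and data types' challenge above is concrete: the same logical field often arrives in several formats from different sources. A small normalization pass in Python; the format list is an illustrative guess, not exhaustive:

```python
from datetime import datetime

# Date formats observed across hypothetical source systems.
KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y"]

def coerce_date(value):
    """Try each known format in turn; return None for missing or unparseable
    values rather than failing the whole load."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value, fmt).date()
        except (ValueError, TypeError):
            continue
    return None

assert coerce_date("2017-03-29") == coerce_date("03/29/2017") == coerce_date("29-Mar-2017")
assert coerce_date("") is None and coerce_date(None) is None
```

Returning None instead of raising keeps one bad record from aborting an un-curated load, but it also illustrates the trade-off: the bad values must be counted and surfaced somewhere, or the quality problem just moves downstream.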
Performance:
✓ Trade-offs between latency, scalability, & query performance
✓ Monitoring & auditing

Data retrieval:
✓ Easy access for data consumers
✓ Organization of data to facilitate data retrieval
✓ Business metadata is *critical* for making sense of the data

Change:
✓ File structure changes (inconsistent 'schema-on-read')
✓ Integration with master data management
✓ Inherent risks associated with a highly flexible repository
✓ Maintaining & updating over time (meeting future needs)

Challenges of a Data Lake: Process
Finding the right balance of agility (deferred work vs. up-front work):
✓ Risk & complexity are still there, just shifted
✓ Finding the optimal level of chaos which invites experimentation
✓ When schema-on-read is appropriate (temporarily? permanently?)
✓ How the data lake will coexist with the data warehouse
✓ How to operationalize self-service/sandbox work from analysts

Data quality:
✓ How to reconcile or confirm accuracy of results
✓ Data validation between systems

Challenges of a Data Lake: Process
Governance & security:
✓ Challenging to implement data governance
✓ Difficult to enforce standards
✓ Securing and obfuscating confidential data
✓ Meeting compliance and regulatory requirements

Ignoring established best practices for data management:
✓ New best practices are still evolving
✓ Repeating the same data warehousing failures (ex: silos)
✓ Erosion of credibility due to disorganization & mismanagement
✓ The "build it and they will come" mentality

Challenges of a Data Lake: People
Redundant effort:
✓ Little focus on reusability, consistency, standardization
✓ Time-consuming data prep & cleansing efforts which don't add analytical value
✓ Effort to operationalize self-service data prep processes
✓ Minimal sharing of prepared datasets, calculations, previous findings, & lessons learned among analysts

Expectations:
✓ High expectations for analysts to conduct their own data preparation, manipulation, integration, cleansing, analysis
✓ Skills required to interact with data which is schema-less

Data stewardship:
✓ Unclear data ownership & stewardship for each subject area or source