Differentiate vs Data Warehouse use cases for a cloud solution

James Serra Big Data Evangelist [email protected] Blog: JamesSerra.com About Me

▪ Microsoft, Big Data Evangelist ▪ In IT for 30 years, worked on many BI and DW projects ▪ Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer ▪ Been perm employee, contractor, consultant, business owner ▪ Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference ▪ Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions ▪ Blog at JamesSerra.com ▪ Former SQL Server MVP ▪ Author of book “Reporting with Microsoft SQL Server 2012” I tried to understand the Microsoft data platform on my own… And felt like I was body slammed by Randy Savage:

Let’s prevent that from happening… Agenda

• Data Lake Driven Analytics • Compute technologies • Architecture Patterns Enterprise Data Maturity Stages

STAGE 4: Transformative STAGE 3: Predictive Data transforms STAGE 2: business to drive Informative Data capture is desired outcomes. STAGE 1: Any data, any Reactive comprehensive and Structured data is scalable and leads source, anywhere at managed and business decisions scale Structured data is analyzed centrally based on advanced transacted and and informs the analytics locally managed. business Data used reactively

Laggards Leaders (rear-view (Real-time mirror) intelligence) Benefits of the cloud for Big Data

Agility • Unlimited elastic scale • Pay for what you need Innovation • Quick “Time to market” • Fail fast Risk • Availability • Reliability • Security

Total cost of ownership calculator: https://www.tco.microsoft.com/

Need to collect any data Harness the growing and changing nature of data Structured Semi-structured Streaming

“ ”

Challenge is combining transactional data stored in relational databases with less structured data

Big Data = All Data. Requires a data lake

Get the right information to the right people at the right time in the right format Two Approaches to getting value out of data: Top-Down + Bottoms-Up

How can we make it happen? Prescriptive What will Analytics happen? Theory Predictive Theory Analytics Why did Hypothesis Hypothesis it happen? Diagnostic Pattern What Observation Analytics happened? Observation Confirmation Descriptive Analytics Data Warehousing Uses A Top-Down Approach

Understand Gather Implement Data Warehouse Corporate Requirements Reporting & Strategy Reporting & Analytics Design Analytics Business Development Requirements

Dimension Modelling Physical Design

ETL ETL Design Development Technical Requirements Data sources Setup Infrastructure Install and Tune The “data lake” Uses A Bottoms-Up Approach

Ingest all data Store all data Do analysis regardless of requirements in native format without Using analytic engines schema definition like Hadoop

Batch queries Devices Interactive queries Real-time analytics Machine Learning Data warehouse Exactly what is a data lake?

A storage repository, usually Hadoop, that holds a vast amount of raw data in its native format until it is needed.

• Inexpensively store unlimited data • Collect all data “just in case” • Store data with no modeling – “Schema on read” • Complements Enterprise data warehouse (EDW) • Frees up expensive EDW resources • Quick user access to data • ETL Hadoop tools • Easily scalable • Place to backup data to • Place to move older data (archive) Needs data governance so your data lake does not turn into a data swamp! Objectives ✓ Plan the structure based on optimal data retrieval ✓ Avoid a chaotic, unorganized data swamp

Common ways to organize the data: Time Partitioning Data Retention Policy Probability of Data Access Year/Month/Day/Hour/Minute Temporary data Recent/current data Permanent data Historical data Applicable period (ex: project lifetime) etc… Subject Area etc… Confidential Classification Security Boundaries Business Impact / Criticality Public information Department High (HBI) Internal use only Business unit Medium (MBI) Supplier/partner confidential etc… Low (LBI) Personally identifiable information (PII) etc… Sensitive – financial Sensitive – intellectual property Downstream App/Purpose Owner / Steward / SME etc… Data Lake + Data Warehouse Better Together

What happened? What will happen? Descriptive Predictive Analytics Analytics

Why did it happen? How can we make it happen? Diagnostic Prescriptive Data sources Analytics Analytics Data Lake Data Warehouse Schema-on-read Schema-on-write Physical collection of uncurated data Data of common meaning System of Insight: Unknown data to do System of Record: Well-understood data to do experimentation / data discovery operational reporting Any type of data Limited set of data types (ie. relational)

Skills are limited Skills mostly available All workloads – batch, interactive, streaming, Optimized for interactive querying machine learning Complementary to DW Can be sourced from Data Lake Data Warehouse Serving, Security & Compliance • Business people • Low latency • Complex joins • Interactive ad-hoc query • High number of users • Additional security • Large support for tools • Dashboards • Easily create reports (Self-service BI) • Know questions A data lake is just a glorified file folder with data files in it – how many end- users can accurately create reports from it?

What is Azure Databricks?

A fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure

Best of Databricks Best of Microsoft

Designed in collaboration with the founders of

One-click set up; streamlined workflows

Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.

Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage)

Enterprise-grade Azure security (Active Directory integration, compliance, enterprise -grade SLAs) Fully-managed Hadoop and Spark Azure for the cloud HDInsight 100% Open Source data platform Hadoop and Spark as a Service on Azure Clusters up and running in minutes Managed, monitored and supported by Microsoft with the industry’s best SLA Familiar BI tools for analysis, or open source notebooks for interactive data science 63% lower TCO than deploy your own Hadoop on-premises*

*IDC study “The Business Value and TCO Advantage of in the Cloud with Microsoft Azure HDInsight” Hortonworks Data Platform (HDP) 3.0 (under the covers of HDInsight 4.0 – public preview)

Simply put, Hortonworks ties all the open source products together (20)  Have a mature Hadoop ecosystem already established that you want to migrate  Want to be 100% open source  Want to use Hadoop tools and make them available 24/7 at a lower cost  Want to use certain tools like Kafka/Storm/HBase/R Server/LLAP/Pig/Beam/Flink KNOWING THE VARIOUS BIG DATA SOLUTIONS

CONTROL EASE OF USE

Azure Databricks

Azure HDInsight

Azure Marketplace

HDP | CDH | MapR

Reduced Reduced Administration

BIG DATA DATA BIG ANALYTICS Any Hadoop technology, Workload optimized, Frictionless & Optimized any distribution managed clusters Spark clusters

IaaS Clusters Managed Clusters Analytics

Azure Data Lake Store Gen2 STORAGE Azure Storage BIG DATA Azure SQL Data Warehouse A relational data warehouse-as-a-service, fully managed by Microsoft. Industries first elastic cloud data warehouse with enterprise-grade capabilities. Support your smallest to your largest data storage needs while handling queries up to 100x faster. SMP vs MPP Azure Analysis Services

Microsoft Products vs Hadoop/OSS Products Microsoft Product Hadoop/Open Source Software Product Office365/Excel OpenOffice/Calc Cosmos DB MongoDB, MarkLogic, HBase, Cassandra SQL Database SQLite, MySQL, PostgreSQL, MariaDB, Apache Ignite Azure Data Lake Analytics/YARN None Azure VM/IaaS OpenStack Blob Storage HDFS, Ceph Azure HBase Apache HBase (Azure HBase is a service wrapped around Apache HBase), Event Hub , Apache Spark Streaming, , , Heron Power BI Apache Zeppelin, Apache Jupyter, Airbnb Caravel, Kibana HDInsight Hortonworks (pay), Cloudera (pay), MapR (pay) Azure ML (Machine Learning) , Apache Spark MLib, Apache PredictionIO Microsoft R Open R SQL Data Warehouse/Interactive queries LLAP, Presto, Apache Spark SQL, , IoT Hub Apache NiFi Azure Data Factory Apache Falcon, Airbnb Airflow, , Apache Azkaban Azure Data Lake Storage/WebHDFS HDFS Ozone

Azure Analysis Services/SSAS , , AtScale (pay)

SQL Server Reporting Services None Hadoop Indexes Jethro Data (pay) Azure Data Catalog Apache Atlas PolyBase Apache Drill Azure Search , Apache ElasticSearch (Azure Search build on ES) SQL Server Integration Services (SSIS) Talend Open Studio, Pentaho Data Integration

Apache Ambari (manage Hadoop clusters), Apache Ranger (data security such as row/column-level security), Apache Knox (secure entry point for Others Hadoop clusters), (collecting log data)

Note: Many of the Hadoop/OSS products are available in Azure Custom apps BI Analytics

SQL Server SQL master instance

Compute pool Compute pool Compute pool Directly read from SQL Compute SQL Compute SQL Compute SQL Compute SQL Compute … HDFS Node Node Node Node Node

Data mart Storage pool

SQL Data SQL Data SQL SQL SQL Node Node Spark Spark … Spark Server Server Server

HDFS Data Node HDFS Data Node HDFS Data Node Storage Storage Kubernetes pod IoT data

Node Node Node Node Node Node Node Persistent storage Fast and fully managed big data analytics service

Fully managed Optimized for Designed for for efficiency streaming data data exploration

Focus on insights, not the infra- Get near-instant insights Run ad-hoc queries using the structure for fast time to value from fast-flowing data intuitive

No infrastructure to manage; Scale linearly up to 200 MB per Returns results from 1 Billion provision the service, choose the second per node with highly records < 1 second without SKU for your workload, and performant, low latency ingestion. modifying the data or metadata create database.

© Microsoft Corporation Azure Data Explorer overview

1. Capability for many data types, formats, and sources Structured (numbers), semi-structured (JSON\XML), and free text

2. Batch or streaming ingestion Use managed ingestion pipeline or queue a request for pull ingestion

3. Compute and storage isolation • Independent scale out / scale in • Persistent data in Azure Blob Storage • Caching for low-latency on compute

4. Multiple options to support data consumption Use out-of-the box tools and connectors or use APIs/SDKs for custom solution

© Microsoft Corporation

Questions to ask client

• Can you use the cloud? • Is this a new solution or a migration? • Do the developers have Hadoop skills? • Will you use non-relational data (variety)? • How much data do you need to store (volume)? • Is this an OLTP or OLAP/DW solution? • Will you have streaming data (velocity)? • Will you use dashboards? • How fast do the operational reports need to run? • Will you do predictive analytics? • Do you want to use Microsoft tools or open source? • What are your high availability and/or disaster recovery requirements? • Do you need to master the data (MDM)? • Are there any security limitations with storing data in the cloud? • Does this solution require 24/7 client access? • How many concurrent users will be accessing the solution at peak-time and on average? • What is the skill level of the end users? • What is your budget and timeline? • Is the source data cloud-born and/or on-prem born? • How much daily data needs to be imported into the solution? • What are your current pain points or obstacles (performance, scale, storage, concurrency, query times, etc)? • Are you ok with using products that are in preview? LOB

CRM BI + Reporting INGEST STORE PREP MODEL & SERVE (& store) Graph

Advanced Analytics Image

Social Data orchestration Big data store Transform & Clean Data warehouse and monitoring AI

IoT Azure Data Factory Azure Data Lake Azure Databricks Azure SQL Data Warehouse Storage Gen2 SSIS Azure HDInsight Azure Analysis Services Blob Storage PolyBase & Stored SQL Database (Single, MI, Azure Data Lake Procedures HyperScale) Storage Gen1 Power BI Dataflow SQL Server in a VM SQL Server 2019 Big Data Cluster Azure Data Lake Analytics Cosmos DB Power BI Aggregations Blob Storage Data Lake Store

Large partner ecosystem Built for Hadoop Global scale – All 50 regions Hierarchical namespace Durability options ACLs, AAD and RBAC Tiered - Hot/Cool/Archive Performance tuned for big data Cost Efficient Very high scale capacity and throughput Azure Data Lake Storage Gen2

Large partner ecosystem Built for Hadoop Global scale – All 50 regions Hierarchical namespace Durability options ACLs, AAD and RBAC Tiered - Hot/Cool/Archive Performance tuned for big data Cost Efficient Very high scale capacity and throughput CLOUD DATA WAREHOUSE

INGEST STORE PREP & TRAIN MODEL & SERVE

Logs (unstructured)

Media (unstructured)

PolyBase

Files (unstructured) Azure Data Factory Azure Data Lake Store Gen2 Azure SQL Data Azure Analysis Power BI Warehouse Services

Business/custom apps (structured)

Microsoft Azure also supports other Big Data services like Azure HDInsight to allow customers to tailor the above architecture to meet their unique needs. MODERN DATA WAREHOUSE

INGEST STORE PREP & TRAIN MODEL & SERVE

Logs (unstructured)

Azure Databricks

Media (unstructured)

PolyBase

Files (unstructured) Azure Data Factory Azure Data Lake Store Gen2 Azure SQL Data Azure Analysis Power BI Warehouse Services

Business/custom apps (structured)

Microsoft Azure also supports other Big Data services like Azure HDInsight to allow customers to tailor the above architecture to meet their unique needs. ADVANCED ANALYTICS ON BIG DATA

INGEST STORE PREP & TRAIN MODEL & SERVE

Logs (unstructured)

Azure Databricks Cosmos DB

SparkR Real-time apps Media (unstructured)

PolyBase

Files (unstructured) Azure Data Factory Azure Data Lake Store Gen2 Azure SQL Data Azure Analysis Power BI Warehouse Services

Business/custom apps (structured)

Microsoft Azure also supports other Big Data services like Azure HDInsight, Azure Machine Learning to allow customers to tailor the above architecture to meet their unique needs.

REAL TIME ANALYTICS

INGEST STORE PREP & TRAIN MODEL & SERVE

Sensors and IoT (unstructured)

Azure Databricks

Logs (unstructured) Apache Kafka for Cosmos DB Real-time apps HDInsight

Media (unstructured)

Files (unstructured) PolyBase

Azure Data Factory Azure Data Lake Store Gen2 Azure SQL Data Azure Analysis Power BI Business/custom apps Warehouse Services (structured)

Microsoft Azure also supports other Big Data services like Azure IoT Hub, Azure Event Hubs, Azure Machine Learning to allow customers to tailor the above architecture to meet their unique needs.

Financial services use cases

Effective customer Decision services Risk and revenue Risk and compliance Recommendation engagement management management management engine

Customer profiles Customer segmentation Transaction data CRM Clickstream data Credit history CRM data Demographics Credit Products Transactional data Credit data Purchasing history Risk Services LTV Market data Trends Merchant records Customer service data Loyalty Products and services

Customer Financial Risk, fraud, Credit Marketing analytics modeling threat detection analytics analytics

Customer 360 Commercial/retail banking, Real-time anomaly detection Enterprise DataHub Recommendation engine degree evaluation securities, trading and investment models Card monitoring and fraud Regulatory and Predictive analytics and Customer segmentation detection compliance analysis targeted advertising Decision science, simulations Reduced customer churn and forecasting Security threat identification Credit risk management Fast marketing and multi- channel engagement Underwriting, servicing and Investment recommendations Risk aggregation Automated credit analytics delinquency handling Customer sentiment analysis Insights for new products

Faster innovation Improved consumer Enhanced customer Transform growth Improved customer for a better outcomes and experience with with predictive engagement with customer experience increased revenue machine learning analytics machine learning https://aka.ms/ADAG Q & A ?

James Serra, Big Data Evangelist Email me at: [email protected] (email me for a copy of this deck) Follow me at: @JamesSerra Link to me at: www.linkedin.com/in/JamesSerra Visit my blog at: JamesSerra.com (check out the “presentations” tab)

Who manages what?

On Premises Infrastructure Platform Software Physical / Virtual as a Service as a Service as a Service

Applications Applications Applications Applications

Data Data Data Data

You manage You management by Microsoft

Runtime Runtime Runtime Runtime Resilience Scale, and management by Microsoft

Middleware Middleware Middleware Middleware

Scale, Resilience Scale, and

You scale, make scale, make You resilient manage & resilient

O/S O/S O/S O/S Managed Managed by Microsoft Virtualization Virtualization Virtualization Virtualization

Servers Servers Servers Servers

You scale, make resilient manage and resilient scale, make You Storage Storage Storage Storage

Networking Networking Networking Networking

Windows Azure Windows Azure Virtual Machines Cloud Services Microsoft data platform solutions Product Category Description More Info

SQL Server 2017 RDBMS Earned top spot in Gartner’s Operational Database magic https://www.microsoft.com/en-us/server- quadrant. JSON support. support cloud/products/-server-2017/ SQL Database RDBMS/DBaaS Cloud-based service that is provisioned and scaled quickly. https://azure.microsoft.com/en- Has built-in high availability and disaster recovery. JSON us/services/sql-database/ support. Managed Instance option soon SQL Data Warehouse MPP RDBMS/DBaaS Cloud-based service that handles relational big data. https://azure.microsoft.com/en- Provision and scale quickly. Can pause service to reduce us/services/sql-data-warehouse/ cost Azure Data Lake Store Hadoop storage Removes the complexities of ingesting and storing all of https://azure.microsoft.com/en- your data while making it faster to get up and running with us/services/data-lake-store/ batch, streaming, and interactive analytics HDInsight PaaS Hadoop A managed Apache Hadoop, Spark, R Server, HBase, Kafka, https://azure.microsoft.com/en- compute/Hadoop Interactive Query (Hive LLAP) and Storm cloud service made us/services/hdinsight/ clusters-as-a-service easy Azure Databricks PaaS Spark clusters A fast, easy, and collaborative Apache Spark based analytics https://databricks.com/azure platform optimized for Azure

Azure Data Lake Analytics On-demand analytics job Cloud-based service that dynamically provisions resources https://azure.microsoft.com/en- service/Big Data-as-a- so you can run queries on exabytes of data. Includes U-SQL, us/services/data-lake-analytics/ service a new big data query language Azure Cosmos DB PaaS NoSQL: Key-value, Globally distributed, massively scalable, multi-model, multi- https://azure.microsoft.com/en- Column-family, API, low latency data service – which can be used as an us/services/cosmos-db/ Document, Graph operational database or a hot data lake Azure Database for PostgreSQL, RDBMS/DBaaS A fully managed database service for app developers https://azure.microsoft.com/en- MySQL, and MariaDB us/services/postgresql ExpressRoute

(3.6TB/hour)

(7.2TB/hour)

(18TB/hour)

(36TB/hour)

We guarantee 99.95% ExpressRoute dedicated circuit availability Low Hot Data Cold Data

HDFS (HDP, Cloudera)

ADLS NoSQL Structure (DocDB)

SQL Cache (SQL DB, SQL DW)

High

Request Rate Low

High Cost/GB

Latency High

Low Data Volume Data Lake Store: Technical Requirements

Secure Must be highly secure to prevent unauthorized access (especially as all data is in one place).

Scalable Must be highly scalable. When storing all data indefinitely, data volumes can quickly add up

Reliable Must be highly available and reliable (no permanent loss of data).

Throughput Must have high throughput for massively parallel processing via frameworks such as Hadoop and Spark

Low latency Must have low latency for high-frequency operations.

Details Must be able to store data with all details; aggregation may lead to loss of details.

Native format Must permit data to be stored in its ‘native format’ to track lineage & for data provenance.

All sources Must be able ingest data from a variety of sources-LOB/ERP, Logs, Devices, Social NWs etc.

Multiple analytic Must support multiple analytic frameworks—Batch, Real-time, Streaming, ML etc. frameworks No one analytic framework can work for all data and all types of analysis.

64 DW SCALABILITY SPIDER CHART

The spiderweb depicts “Data Volume” important attributes to “Mixed MPP – Multidimensional consider when evaluating Scalability Workload” 5 PB Data Warehousing options. “Query Concurrency“ SMP – Tunable in one dimension 500 TB on cost of other dimensions Strategic, Tactical Big Data support is newest Loads, SLA dimension. 100 TB Strategic, Tactical 10.000 Loads 50 TB

10 TB 1.000 Strategic, Tactical

Strategic 100 “Data “Query complexity“ Freshness” Near Real Time Daily Weekly 3-5 Way Data Feeds Load Load Joins 5-10 Way ▪ Joins + Joins ▪ OLAP operations + Simple Batch Reporting, ▪ Aggregation + Star Repetitive Queries ▪ Complex “Where” constraints + Multiple, ▪ Views MB’s Integrated ▪ Parallelism Stars Ad Hoc Queries Data Analysis/Mining Normalized

GB’s Multiple, Integrated Stars and Normalized

“Query Freedom“ TB’s “Schema Sophistication“

“Query Data Volume“ HiveQL T-SQL Updates INSERT UPDATE, INSERT, DELETE Transactions Supported Supported Index Types Compact, Bitmap Clustered, Non-Clustered, Columnar

Latency Minutes Seconds Data Types + map, struct, array Int, float, boolean, char, binary.. Functions Dozens of built-in functions Hundreds of built-in functions Multi- Inserts Supported Not Supported Partitioning Supported Supported Multi-Level Partitioning Supported Not Supported Views Read-Only Read-Only and Updateable Extensibility UDF’s, MapReduce Scripts UDF’s, Stored Procedures HDInsight vs HDP on Azure VM

HDInsight HDP on Azure VM PaaS (setup, scale, manage, patch, etc) IaaS Managed by Microsoft Managed by customer Storage separate (Blob or ADLS) Storage in VM (local disk), but can also have storage in Azure blob or ADLS Delete VM keeps data Delete VM deletes data (unless external) Up to 30-days behind latest HDP version Latest HDP Version Limited Hadoop projects Unlimited Hadoop projects Microsoft supports VM and Hadoop Microsoft: VM, HDP: Hadoop No on-prem version On-prem version PolyBase use cases

Ingestion Batch Presentation Company C Ingestion Speed Presentation

DATA SOURCES INGEST PREPARE ANALYZE PUBLISH CONSUME

Machine Hot Path Learning

Real-time Scoring

Blobs – Reference Data Event hubs ASA Job Rule #2

Cortana

Event hubs Flatten & Event hubs ASA Job Rule #1 Simulated Sensors Metadata Join and devices

Archived Data Aggregated Data Hourly, Daily, Power BI Monthly Roll Ups

Data Lake Data Lake Data Lake Data Lake Store Store Analytics Store Offline Training

Machine Batch Scoring Learning Azure SQL Web/LOB Data Warehouse Dashboards CSV Data Data Factory: Move Data, Orchestrate, Schedule, and Monitor

On Premise Cold Path Batch