Differentiate Big Data vs Data Warehouse use cases for a cloud solution
James Serra Big Data Evangelist Microsoft [email protected] Blog: JamesSerra.com About Me
▪ Microsoft, Big Data Evangelist
▪ In IT for 30 years, worked on many BI and DW projects
▪ Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer
▪ Been perm employee, contractor, consultant, business owner
▪ Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference
▪ Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions
▪ Blog at JamesSerra.com
▪ Former SQL Server MVP
▪ Author of book “Reporting with Microsoft SQL Server 2012”
I tried to understand the Microsoft data platform on my own… And felt like I was body slammed by Randy Savage:
Let’s prevent that from happening… Agenda
• Data Lake Driven Analytics
• Compute technologies
• Architecture Patterns
Enterprise Data Maturity Stages
STAGE 1: Reactive. Structured data is transacted and locally managed. Data used reactively
STAGE 2: Informative. Structured data is managed and analyzed centrally and informs the business
STAGE 3: Predictive. Data capture is comprehensive and scalable and leads business decisions based on advanced analytics
STAGE 4: Transformative. Data transforms business to drive desired outcomes. Any data, any source, anywhere at scale
Laggards (rear-view mirror) → Leaders (Real-time intelligence)
Benefits of the cloud for Big Data
Agility
• Unlimited elastic scale
• Pay for what you need
Innovation
• Quick “Time to market”
• Fail fast
Risk
• Availability
• Reliability
• Security
Total cost of ownership calculator: https://www.tco.microsoft.com/
Need to collect any data Harness the growing and changing nature of data Structured Semi-structured Streaming
Challenge is combining transactional data stored in relational databases with less structured data
Big Data = All Data. Requires a data lake
Get the right information to the right people at the right time in the right format
Two Approaches to getting value out of data: Top-Down + Bottoms-Up
Top-Down: Theory → Hypothesis → Observation → Confirmation
Bottoms-Up: Observation → Pattern → Hypothesis → Theory
Descriptive Analytics: What happened?
Diagnostic Analytics: Why did it happen?
Predictive Analytics: What will happen?
Prescriptive Analytics: How can we make it happen?
Data Warehousing Uses A Top-Down Approach
Understand Corporate Strategy
Gather Requirements: Business Requirements, Technical Requirements
Implement Data Warehouse:
• Dimension Modelling, Physical Design
• ETL Design, ETL Development
• Reporting & Analytics Design, Reporting & Analytics Development
• Setup Infrastructure, Install and Tune
Data sources
The “data lake” Uses A Bottoms-Up Approach
Ingest all data regardless of requirements
Store all data in native format without schema definition
Do analysis using analytic engines like Hadoop
[Figure labels: Devices; Batch queries; Interactive queries; Real-time analytics; Machine Learning; Data warehouse]
Exactly what is a data lake?
A storage repository, usually Hadoop, that holds a vast amount of raw data in its native format until it is needed.
• Inexpensively store unlimited data
• Collect all data “just in case”
• Store data with no modeling – “Schema on read”
• Complements Enterprise data warehouse (EDW)
• Frees up expensive EDW resources
• Quick user access to data
• ETL Hadoop tools
• Easily scalable
• Place to backup data to
• Place to move older data (archive)
Needs data governance so your data lake does not turn into a data swamp!
Objectives
✓ Plan the structure based on optimal data retrieval
✓ Avoid a chaotic, unorganized data swamp
Common ways to organize the data:
Time Partitioning: Year/Month/Day/Hour/Minute
Data Retention Policy: Temporary data, Permanent data, Applicable period (ex: project lifetime), etc…
Probability of Data Access: Recent/current data, Historical data, etc…
Subject Area
Confidential Classification: Public information, Internal use only, Supplier/partner confidential, Personally identifiable information (PII), Sensitive – financial, Sensitive – intellectual property, etc…
Security Boundaries: Department, Business unit, etc…
Business Impact / Criticality: High (HBI), Medium (MBI), Low (LBI), etc…
Downstream App/Purpose
Owner / Steward / SME
Data Lake + Data Warehouse Better Together
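The subject-area and time-partitioning conventions listed under "Common ways to organize the data" can be sketched as a small path builder. This is only an illustration: the function name `lake_path`, the zone name `raw`, and the `sales/orders` dataset are hypothetical and do not come from the deck.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

# Time levels in the order given on the slide: Year/Month/Day/Hour/Minute.
LEVELS = ["year", "month", "day", "hour", "minute"]


def lake_path(zone: str, subject_area: str, dataset: str,
              event_time: datetime, granularity: str = "hour") -> str:
    """Build a folder path partitioned by subject area, then by time."""
    if granularity not in LEVELS:
        raise ValueError(f"granularity must be one of {LEVELS}")
    parts = [
        f"{event_time.year:04d}",
        f"{event_time.month:02d}",
        f"{event_time.day:02d}",
        f"{event_time.hour:02d}",
        f"{event_time.minute:02d}",
    ]
    depth = LEVELS.index(granularity) + 1  # keep only the requested levels
    return str(PurePosixPath(zone, subject_area, dataset, *parts[:depth]))


ts = datetime(2019, 3, 7, 14, 5, tzinfo=timezone.utc)
print(lake_path("raw", "sales", "orders", ts))                     # hourly folders
print(lake_path("raw", "sales", "orders", ts, granularity="day"))  # daily folders
```

Putting the time levels last keeps one subject area's data under a single prefix, which tends to make retention and security boundaries easier to apply per folder; other orderings are equally valid depending on how the data is retrieved.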
Descriptive Analytics: What happened?
Diagnostic Analytics: Why did it happen?
Predictive Analytics: What will happen?
Prescriptive Analytics: How can we make it happen?
Data sources
Data Lake | Data Warehouse
Schema-on-read | Schema-on-write
Physical collection of uncurated data | Data of common meaning
System of Insight: Unknown data to do experimentation / data discovery | System of Record: Well-understood data to do operational reporting
Any type of data | Limited set of data types (i.e. relational)
Skills are limited | Skills mostly available
All workloads – batch, interactive, streaming, machine learning | Optimized for interactive querying
Complementary to DW | Can be sourced from Data Lake
Data Warehouse Serving, Security & Compliance
• Business people
• Low latency
• Complex joins
• Interactive ad-hoc query
• High number of users
• Additional security
• Large support for tools
• Dashboards
• Easily create reports (Self-service BI)
• Know questions
A data lake is just a glorified file folder with data files in it – how many end users can accurately create reports from it?
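The schema-on-read versus schema-on-write distinction in the comparison above can be illustrated with a minimal sketch. The sample records and the helper names `write_with_schema` and `read_with_schema` are invented for illustration; no real lake or warehouse API is involved.

```python
import json

# Raw events as they might land in a data lake: stored as-is, shapes vary.
RAW_LINES = [
    '{"id": 1, "amount": "19.99", "country": "US"}',
    '{"id": 2, "amount": "5", "channel": "web"}',
    '{"id": "3", "amount": "oops"}',
]


def write_with_schema(line: str) -> dict:
    """Schema-on-write: validate and convert before storing; bad rows raise."""
    rec = json.loads(line)
    return {"id": int(rec["id"]), "amount": float(rec["amount"])}


def read_with_schema(lines, fields):
    """Schema-on-read: keep the raw text and apply a schema only when querying."""
    for line in lines:
        rec = json.loads(line)
        row = {}
        for name, cast in fields.items():
            try:
                row[name] = cast(rec[name])
            except (KeyError, ValueError, TypeError):
                row[name] = None  # missing or unparsable values surface as None
        yield row


# Warehouse-style load: rows that do not fit the schema are rejected up front.
loaded, rejected = [], []
for line in RAW_LINES:
    try:
        loaded.append(write_with_schema(line))
    except (KeyError, ValueError):
        rejected.append(line)

# Lake-style query: a different question can use a different schema later.
rows = list(read_with_schema(RAW_LINES, {"id": int, "country": str}))
print(len(loaded), len(rejected))
print(rows)
```

The warehouse path pays the modeling cost once and gives every reader clean rows, while the lake path keeps everything and pushes interpretation to each reader, which is the point behind the closing question on the slide.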
What is Azure Databricks?
A fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure
Best of Databricks Best of Microsoft
Designed in collaboration with the founders of Apache Spark
One-click set up; streamlined workflows
Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.
Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage)
Enterprise-grade Azure security (Active Directory integration, compliance, enterprise-grade SLAs)
Azure HDInsight: Fully-managed Hadoop and Spark for the cloud
100% Open Source Hortonworks data platform
Hadoop and Spark as a Service on Azure
Clusters up and running in minutes
Managed, monitored and supported by Microsoft with the industry’s best SLA
Familiar BI tools for analysis, or open source notebooks for interactive data science
63% lower TCO than deploying your own Hadoop on-premises*
*IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight” Hortonworks Data Platform (HDP) 3.0 (under the covers of HDInsight 4.0 – public preview)
Simply put, Hortonworks ties all the open source products together (20)
• Have a mature Hadoop ecosystem already established that you want to migrate
• Want to be 100% open source
• Want to use Hadoop tools and make them available 24/7 at a lower cost
• Want to use certain tools like Kafka/Storm/HBase/R Server/LLAP/Pig/Beam/Flink
KNOWING THE VARIOUS BIG DATA SOLUTIONS
[Figure: options arranged from CONTROL to EASE OF USE, with reduced administration toward the ease-of-use end]
BIG DATA ANALYTICS
IaaS Clusters: Azure Marketplace (HDP | CDH | MapR). Any Hadoop technology, any distribution
Managed Clusters: Azure HDInsight. Workload optimized, managed clusters
Azure Databricks: Frictionless & Optimized Spark clusters
Azure Data Lake Analytics
BIG DATA STORAGE
Azure Data Lake Store Gen2
Azure Storage
Azure SQL Data Warehouse
A relational data warehouse-as-a-service, fully managed by Microsoft. Industry’s first elastic cloud data warehouse with enterprise-grade capabilities. Support your smallest to your largest data storage needs while handling queries up to 100x faster.
SMP vs MPP
Azure Analysis Services
Microsoft Products vs Hadoop/OSS Products
Microsoft Product | Hadoop/Open Source Software Product
Office365/Excel | OpenOffice/Calc
Cosmos DB | MongoDB, MarkLogic, HBase, Cassandra
SQL Database | SQLite, MySQL, PostgreSQL, MariaDB, Apache Ignite
Azure Data Lake Analytics/YARN | None
Azure VM/IaaS | OpenStack
Blob Storage | HDFS, Ceph
Azure HBase | Apache HBase (Azure HBase is a service wrapped around Apache HBase), Apache Trafodion
Event Hub | Apache Kafka
Azure Stream Analytics | Apache Storm, Apache Spark Streaming, Apache Flink, Apache Beam, Twitter Heron
Power BI | Apache Zeppelin, Apache Jupyter, Airbnb Caravel, Kibana
HDInsight | Hortonworks (pay), Cloudera (pay), MapR (pay)
Azure ML (Machine Learning) | Apache Mahout, Apache Spark MLlib, Apache PredictionIO
Microsoft R Open | R
SQL Data Warehouse/Interactive queries | Apache Hive LLAP, Presto, Apache Spark SQL, Apache Drill, Apache Impala
IoT Hub | Apache NiFi
Azure Data Factory | Apache Falcon, Airbnb Airflow, Apache Oozie, Apache Azkaban
Azure Data Lake Storage/WebHDFS | HDFS Ozone
Azure Analysis Services/SSAS | Apache Kylin, Apache Druid, AtScale (pay)
SQL Server Reporting Services | None
Hadoop Indexes | Jethro Data (pay)
Azure Data Catalog | Apache Atlas
PolyBase | Apache Drill
Azure Search | Apache Solr, Apache ElasticSearch (Azure Search built on ES)
SQL Server Integration Services (SSIS) | Talend Open Studio, Pentaho Data Integration
Others | Apache Ambari (manage Hadoop clusters), Apache Ranger (data security such as row/column-level security), Apache Knox (secure entry point for Hadoop clusters), Apache Flume (collecting log data)
Note: Many of the Hadoop/OSS products are available in Azure
[Figure: SQL Server big data cluster. Labels: Custom apps, BI, Analytics; SQL Server master instance; Compute pools of SQL Compute Nodes (directly read from HDFS); Data mart of SQL Data Nodes; Storage pool in which each Kubernetes pod holds SQL Server, Spark and an HDFS Data Node; Persistent storage; IoT data]
Azure Data Explorer
Fast and fully managed big data analytics service
Fully managed for efficiency: Focus on insights, not the infrastructure, for fast time to value. No infrastructure to manage; provision the service, choose the SKU for your workload, and create database.
Optimized for streaming data: Get near-instant insights from fast-flowing data. Scale linearly up to 200 MB per second per node with highly performant, low latency ingestion.
Designed for data exploration: Run ad-hoc queries using the intuitive query language. Returns results from 1 Billion records < 1 second without modifying the data or metadata.
© Microsoft Corporation Azure Data Explorer overview
1. Capability for many data types, formats, and sources Structured (numbers), semi-structured (JSON\XML), and free text
2. Batch or streaming ingestion Use managed ingestion pipeline or queue a request for pull ingestion
3. Compute and storage isolation • Independent scale out / scale in • Persistent data in Azure Blob Storage • Caching for low-latency on compute
4. Multiple options to support data consumption Use out-of-the box tools and connectors or use APIs/SDKs for custom solution
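The "up to 200 MB per second per node" ingestion figure quoted for Azure Data Explorer lends itself to a rough sizing calculation. The sketch below assumes perfectly linear scaling and treats the quoted figure as a ceiling; real capacity depends on the chosen SKU and workload, and the function names are hypothetical.

```python
import math

MB_PER_SEC_PER_NODE = 200  # upper bound quoted on the slide, not a guarantee


def nodes_needed(ingest_mb_per_sec: float,
                 per_node: float = MB_PER_SEC_PER_NODE) -> int:
    """Smallest node count whose combined ceiling covers the ingest rate."""
    if ingest_mb_per_sec < 0:
        raise ValueError("ingest rate must be non-negative")
    return max(1, math.ceil(ingest_mb_per_sec / per_node))


def tb_per_day(nodes: int, per_node: float = MB_PER_SEC_PER_NODE) -> float:
    """Daily volume at the ceiling rate, using decimal units (1 TB = 1,000,000 MB)."""
    return nodes * per_node * 86_400 / 1_000_000


print(nodes_needed(750))  # nodes for a 750 MB/s stream under these assumptions
print(tb_per_day(1))      # TB per day for one node at the ceiling rate
```

Because the slide gives an "up to" number, a result from this calculation is best read as a lower bound on cluster size rather than a recommendation.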
Questions to ask client
• Can you use the cloud?
• Is this a new solution or a migration?
• Do the developers have Hadoop skills?
• Will you use non-relational data (variety)?
• How much data do you need to store (volume)?
• Is this an OLTP or OLAP/DW solution?
• Will you have streaming data (velocity)?
• Will you use dashboards?
• How fast do the operational reports need to run?
• Will you do predictive analytics?
• Do you want to use Microsoft tools or open source?
• What are your high availability and/or disaster recovery requirements?
• Do you need to master the data (MDM)?
• Are there any security limitations with storing data in the cloud?
• Does this solution require 24/7 client access?
• How many concurrent users will be accessing the solution at peak-time and on average?
• What is the skill level of the end users?
• What is your budget and timeline?
• Is the source data cloud-born and/or on-prem born?
• How much daily data needs to be imported into the solution?
• What are your current pain points or obstacles (performance, scale, storage, concurrency, query times, etc)?
• Are you ok with using products that are in preview?
Sources: LOB, CRM, Graph, Image, Social, IoT
INGEST (Data orchestration and monitoring): Azure Data Factory, SSIS
STORE (Big data store): Azure Data Lake Storage Gen2, Blob Storage, Azure Data Lake Storage Gen1, SQL Server 2019 Big Data Cluster
PREP (Transform & Clean): Azure Databricks, Azure HDInsight, PolyBase & Stored Procedures, Power BI Dataflow, Azure Data Lake Analytics
MODEL & SERVE (& store) (Data warehouse): Azure SQL Data Warehouse, Azure Analysis Services, SQL Database (Single, MI, HyperScale), SQL Server in a VM, Cosmos DB, Power BI Aggregations
Outputs: BI + Reporting, Advanced Analytics, AI
Blob Storage Data Lake Store
Large partner ecosystem Built for Hadoop Global scale – All 50 regions Hierarchical namespace Durability options ACLs, AAD and RBAC Tiered - Hot/Cool/Archive Performance tuned for big data Cost Efficient Very high scale capacity and throughput Azure Data Lake Storage Gen2
Large partner ecosystem Built for Hadoop Global scale – All 50 regions Hierarchical namespace Durability options ACLs, AAD and RBAC Tiered - Hot/Cool/Archive Performance tuned for big data Cost Efficient Very high scale capacity and throughput CLOUD DATA WAREHOUSE
Sources: Logs (unstructured), Media (unstructured), Files (unstructured), Business/custom apps (structured)
INGEST: Azure Data Factory
STORE: Azure Data Lake Store Gen2
PREP & TRAIN: PolyBase
MODEL & SERVE: Azure SQL Data Warehouse, Azure Analysis Services, Power BI
Microsoft Azure also supports other Big Data services like Azure HDInsight to allow customers to tailor the above architecture to meet their unique needs. MODERN DATA WAREHOUSE
Sources: Logs (unstructured), Media (unstructured), Files (unstructured), Business/custom apps (structured)
INGEST: Azure Data Factory
STORE: Azure Data Lake Store Gen2
PREP & TRAIN: Azure Databricks, PolyBase
MODEL & SERVE: Azure SQL Data Warehouse, Azure Analysis Services, Power BI
Microsoft Azure also supports other Big Data services like Azure HDInsight to allow customers to tailor the above architecture to meet their unique needs. ADVANCED ANALYTICS ON BIG DATA
Sources: Logs (unstructured), Media (unstructured), Files (unstructured), Business/custom apps (structured)
INGEST: Azure Data Factory
STORE: Azure Data Lake Store Gen2
PREP & TRAIN: Azure Databricks (SparkR), PolyBase
MODEL & SERVE: Cosmos DB (Real-time apps), Azure SQL Data Warehouse, Azure Analysis Services, Power BI
Microsoft Azure also supports other Big Data services like Azure HDInsight, Azure Machine Learning to allow customers to tailor the above architecture to meet their unique needs.
REAL TIME ANALYTICS
Sources: Sensors and IoT (unstructured), Logs (unstructured), Media (unstructured), Files (unstructured), Business/custom apps (structured)
INGEST: Apache Kafka for HDInsight, Azure Data Factory
STORE: Azure Data Lake Store Gen2
PREP & TRAIN: Azure Databricks, PolyBase
MODEL & SERVE: Cosmos DB (Real-time apps), Azure SQL Data Warehouse, Azure Analysis Services, Power BI
Microsoft Azure also supports other Big Data services like Azure IoT Hub, Azure Event Hubs, Azure Machine Learning to allow customers to tailor the above architecture to meet their unique needs.
Financial services use cases
Effective customer engagement | Decision services management | Risk and revenue management | Risk and compliance management | Recommendation engine
Customer profiles Customer segmentation Transaction data CRM Clickstream data Credit history CRM data Demographics Credit Products Transactional data Credit data Purchasing history Risk Services LTV Market data Trends Merchant records Customer service data Loyalty Products and services
Customer analytics | Financial modeling | Risk, fraud, threat detection | Credit analytics | Marketing analytics
Customer 360 Commercial/retail banking, Real-time anomaly detection Enterprise DataHub Recommendation engine degree evaluation securities, trading and investment models Card monitoring and fraud Regulatory and Predictive analytics and Customer segmentation detection compliance analysis targeted advertising Decision science, simulations Reduced customer churn and forecasting Security threat identification Credit risk management Fast marketing and multi- channel engagement Underwriting, servicing and Investment recommendations Risk aggregation Automated credit analytics delinquency handling Customer sentiment analysis Insights for new products
Faster innovation for a better customer experience | Improved consumer outcomes and increased revenue | Enhanced customer experience with machine learning | Transform growth with predictive analytics | Improved customer engagement with machine learning
https://aka.ms/ADAG
Q & A ?
James Serra, Big Data Evangelist Email me at: [email protected] (email me for a copy of this deck) Follow me at: @JamesSerra Link to me at: www.linkedin.com/in/JamesSerra Visit my blog at: JamesSerra.com (check out the “presentations” tab)
Who manages what?
Layer | On Premises (Physical / Virtual) | Infrastructure as a Service | Platform as a Service | Software as a Service
Applications | You | You | You | Microsoft
Data | You | You | You | Microsoft
Runtime | You | You | Microsoft | Microsoft
Middleware | You | You | Microsoft | Microsoft
O/S | You | You | Microsoft | Microsoft
Virtualization | You | Microsoft | Microsoft | Microsoft
Servers | You | Microsoft | Microsoft | Microsoft
Storage | You | Microsoft | Microsoft | Microsoft
Networking | You | Microsoft | Microsoft | Microsoft
You: you scale, make resilient and manage. Microsoft: scale, resilience and management by Microsoft.
Examples: Windows Azure Virtual Machines (Infrastructure as a Service), Windows Azure Cloud Services (Platform as a Service)
Microsoft data platform solutions
Product | Category | Description | More Info
SQL Server 2017 | RDBMS | Earned top spot in Gartner’s Operational Database magic quadrant. JSON support. Linux support | https://www.microsoft.com/en-us/server-cloud/products/sql-server-2017/
SQL Database | RDBMS/DBaaS | Cloud-based service that is provisioned and scaled quickly. Has built-in high availability and disaster recovery. JSON support. Managed Instance option soon | https://azure.microsoft.com/en-us/services/sql-database/
SQL Data Warehouse | MPP RDBMS/DBaaS | Cloud-based service that handles relational big data. Provision and scale quickly. Can pause service to reduce cost | https://azure.microsoft.com/en-us/services/sql-data-warehouse/
Azure Data Lake Store | Hadoop storage | Removes the complexities of ingesting and storing all of your data while making it faster to get up and running with batch, streaming, and interactive analytics | https://azure.microsoft.com/en-us/services/data-lake-store/
HDInsight | PaaS Hadoop compute/Hadoop clusters-as-a-service | A managed Apache Hadoop, Spark, R Server, HBase, Kafka, Interactive Query (Hive LLAP) and Storm cloud service made easy | https://azure.microsoft.com/en-us/services/hdinsight/
Azure Databricks | PaaS Spark clusters | A fast, easy, and collaborative Apache Spark based analytics platform optimized for Azure | https://databricks.com/azure
Azure Data Lake Analytics | On-demand analytics job service/Big Data-as-a-service | Cloud-based service that dynamically provisions resources so you can run queries on exabytes of data. Includes U-SQL, a new big data query language | https://azure.microsoft.com/en-us/services/data-lake-analytics/
Azure Cosmos DB | PaaS NoSQL: Key-value, Column-family, Document, Graph | Globally distributed, massively scalable, multi-model, multi-API, low latency data service – which can be used as an operational database or a hot data lake | https://azure.microsoft.com/en-us/services/cosmos-db/
Azure Database for PostgreSQL, MySQL, and MariaDB | RDBMS/DBaaS | A fully managed database service for app developers | https://azure.microsoft.com/en-us/services/postgresql
ExpressRoute
(3.6TB/hour)
(7.2TB/hour)
(18TB/hour)
(36TB/hour)
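As a rough illustration of what the hourly rates listed above mean for a migration, the sketch below estimates how long a given volume would take at each rate. Only the four TB/hour figures come from the slide; the circuit sizes they belong to did not survive in this text, the 100 TB volume is an arbitrary example, and sustained real-world throughput is usually lower than a nominal rate.

```python
# Hourly transfer rates as listed on the slide (TB per hour).
RATES_TB_PER_HOUR = [3.6, 7.2, 18, 36]


def transfer_hours(volume_tb: float, rate_tb_per_hour: float) -> float:
    """Idealized transfer time: volume divided by a constant rate."""
    if rate_tb_per_hour <= 0:
        raise ValueError("rate must be positive")
    return volume_tb / rate_tb_per_hour


VOLUME_TB = 100  # hypothetical data set size
for rate in RATES_TB_PER_HOUR:
    hours = transfer_hours(VOLUME_TB, rate)
    print(f"{rate} TB/hour: {hours:.1f} hours for {VOLUME_TB} TB")
```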
We guarantee 99.95% ExpressRoute dedicated circuit availability
[Figure: Hot Data versus Cold Data. Stores shown: Cache, SQL (SQL DB, SQL DW), NoSQL (DocDB), ADLS, HDFS (HDP, Cloudera). Axes, each running between Low and High: Structure, Request Rate, Cost/GB, Latency, Data Volume]
Data Lake Store: Technical Requirements
Secure Must be highly secure to prevent unauthorized access (especially as all data is in one place).
Scalable Must be highly scalable. When storing all data indefinitely, data volumes can quickly add up
Reliable Must be highly available and reliable (no permanent loss of data).
Throughput Must have high throughput for massively parallel processing via frameworks such as Hadoop and Spark
Low latency Must have low latency for high-frequency operations.
Details Must be able to store data with all details; aggregation may lead to loss of details.
Native format Must permit data to be stored in its ‘native format’ to track lineage & for data provenance.
All sources Must be able to ingest data from a variety of sources: LOB/ERP, logs, devices, social networks etc.
Multiple analytic frameworks Must support multiple analytic frameworks: Batch, Real-time, Streaming, ML etc. No one analytic framework can work for all data and all types of analysis.
DW SCALABILITY SPIDER CHART
The spiderweb depicts important attributes to consider when evaluating Data Warehousing options. Big Data support is newest dimension.
MPP – Multidimensional Scalability
SMP – Tunable in one dimension on cost of other dimensions
[Figure: spider chart. Axes: “Data Volume” (10 TB to 5 PB); “Mixed Workload” (Strategic; Strategic, Tactical; Strategic, Tactical Loads; Strategic, Tactical Loads, SLA); “Query Concurrency” (100 to 10.000); “Data Freshness” (Weekly Load, Daily Load, Data Feeds, Near Real Time); “Query complexity” (3-5 Way Joins; 5-10 Way Joins; Joins + OLAP operations + Aggregation + Complex “Where” constraints + Views + Parallelism); “Query Freedom” (Simple Batch Reporting, Repetitive Queries, Ad Hoc Queries, Data Analysis/Mining); “Schema Sophistication” (Star; Multiple, Integrated Stars; Normalized; Multiple, Integrated Stars and Normalized); “Query Data Volume” (MB’s, GB’s, TB’s)]
Feature | HiveQL | T-SQL
Updates | INSERT | UPDATE, INSERT, DELETE
Transactions | Supported | Supported
Index Types | Compact, Bitmap | Clustered, Non-Clustered, Columnar
Latency | Minutes | Seconds
Data Types | + map, struct, array | Int, float, boolean, char, binary..
Functions | Dozens of built-in functions | Hundreds of built-in functions
Multi-Table Inserts | Supported | Not Supported
Partitioning | Supported | Supported
Multi-Level Partitioning | Supported | Not Supported
Views | Read-Only | Read-Only and Updateable
Extensibility | UDF’s, MapReduce Scripts | UDF’s, Stored Procedures
HDInsight vs HDP on Azure VM
HDInsight | HDP on Azure VM
PaaS (setup, scale, manage, patch, etc) | IaaS
Managed by Microsoft | Managed by customer
Storage separate (Blob or ADLS) | Storage in VM (local disk), but can also have storage in Azure blob or ADLS
Delete VM keeps data | Delete VM deletes data (unless external)
Up to 30-days behind latest HDP version | Latest HDP Version
Limited Hadoop projects | Unlimited Hadoop projects
Microsoft supports VM and Hadoop | Microsoft: VM, HDP: Hadoop
No on-prem version | On-prem version
PolyBase use cases
[Figure: PolyBase use cases. Labels: Ingestion, Batch, Speed, Presentation, Company C]
[Figure: hot path and cold path architecture across the stages DATA SOURCES, INGEST, PREPARE, ANALYZE, PUBLISH, CONSUME.
Hot Path: Simulated Sensors and devices; Event hubs; ASA Job (Flatten & Metadata Join); Blobs (Reference Data); ASA Job Rule #1 and Rule #2; Machine Learning (Real-time Scoring); Cortana; Power BI.
Cold Path (Batch): On Premise CSV Data; Data Lake Store (Archived Data; Aggregated Data with Hourly, Daily, Monthly Roll Ups); Data Lake Analytics; Machine Learning (Offline Training, Batch Scoring); Azure SQL Data Warehouse; Web/LOB Dashboards.
Data Factory: Move Data, Orchestrate, Schedule, and Monitor]