Differentiate Big Data Vs Data Warehouse Use Cases for a Cloud Solution
Total Page:16
File Type:pdf, Size:1020Kb
Differentiate Big Data vs Data Warehouse use cases for a cloud solution James Serra Big Data Evangelist Microsoft [email protected] Blog: JamesSerra.com About Me ▪ Microsoft, Big Data Evangelist ▪ In IT for 30 years, worked on many BI and DW projects ▪ Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer ▪ Been perm employee, contractor, consultant, business owner ▪ Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference ▪ Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions ▪ Blog at JamesSerra.com ▪ Former SQL Server MVP ▪ Author of book “Reporting with Microsoft SQL Server 2012” I tried to understand the Microsoft data platform on my own… And felt like I was body slammed by Randy Savage: Let’s prevent that from happening… Agenda • Data Lake Driven Analytics • Compute technologies • Architecture Patterns Enterprise Data Maturity Stages STAGE 4: Transformative STAGE 3: Predictive Data transforms STAGE 2: business to drive Informative Data capture is desired outcomes. STAGE 1: Any data, any Reactive comprehensive and Structured data is scalable and leads source, anywhere at managed and business decisions scale Structured data is analyzed centrally based on advanced transacted and and informs the analytics locally managed. business Data used reactively Laggards Leaders (rear-view (Real-time mirror) intelligence) Benefits of the cloud for Big Data Agility • Unlimited elastic scale • Pay for what you need Innovation • Quick “Time to market” • Fail fast Risk • Availability • Reliability • Security Total cost of ownership calculator: https://www.tco.microsoft.com/ Need to collect any data Harness the growing and changing nature of data Structured Semi-structured Streaming “ ” Challenge is combining transactional data stored in relational databases with less structured data Big Data = All Data. Requires a data lake Get the right information to the right people at the right time in the right format Two Approaches to getting value out of data: Top-Down + Bottoms-Up How can we make it happen? Prescriptive What will Analytics happen? Theory Predictive Theory Analytics Why did Hypothesis Hypothesis it happen? Diagnostic Pattern What Observation Analytics happened? Observation Confirmation Descriptive Analytics Data Warehousing Uses A Top-Down Approach Understand Gather Implement Data Warehouse Corporate Requirements Reporting & Strategy Reporting & Analytics Design Analytics Business Development Requirements Dimension Modelling Physical Design ETL ETL Design Development Technical Requirements Data sources Setup Infrastructure Install and Tune The “data lake” Uses A Bottoms-Up Approach Ingest all data Store all data Do analysis regardless of requirements in native format without Using analytic engines schema definition like Hadoop Batch queries Devices Interactive queries Real-time analytics Machine Learning Data warehouse Exactly what is a data lake? A storage repository, usually Hadoop, that holds a vast amount of raw data in its native format until it is needed. • Inexpensively store unlimited data • Collect all data “just in case” • Store data with no modeling – “Schema on read” • Complements Enterprise data warehouse (EDW) • Frees up expensive EDW resources • Quick user access to data • ETL Hadoop tools • Easily scalable • Place to backup data to • Place to move older data (archive) Needs data governance so your data lake does not turn into a data swamp! Objectives ✓ Plan the structure based on optimal data retrieval ✓ Avoid a chaotic, unorganized data swamp Common ways to organize the data: Time Partitioning Data Retention Policy Probability of Data Access Year/Month/Day/Hour/Minute Temporary data Recent/current data Permanent data Historical data Applicable period (ex: project lifetime) etc… Subject Area etc… Confidential Classification Security Boundaries Business Impact / Criticality Public information Department High (HBI) Internal use only Business unit Medium (MBI) Supplier/partner confidential etc… Low (LBI) Personally identifiable information (PII) etc… Sensitive – financial Sensitive – intellectual property Downstream App/Purpose Owner / Steward / SME etc… Data Lake + Data Warehouse Better Together What happened? What will happen? Descriptive Predictive Analytics Analytics Why did it happen? How can we make it happen? Diagnostic Prescriptive Data sources Analytics Analytics Data Lake Data Warehouse Schema-on-read Schema-on-write Physical collection of uncurated data Data of common meaning System of Insight: Unknown data to do System of Record: Well-understood data to do experimentation / data discovery operational reporting Any type of data Limited set of data types (ie. relational) Skills are limited Skills mostly available All workloads – batch, interactive, streaming, Optimized for interactive querying machine learning Complementary to DW Can be sourced from Data Lake Data Warehouse Serving, Security & Compliance • Business people • Low latency • Complex joins • Interactive ad-hoc query • High number of users • Additional security • Large support for tools • Dashboards • Easily create reports (Self-service BI) • Know questions A data lake is just a glorified file folder with data files in it – how many end- users can accurately create reports from it? What is Azure Databricks? A fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure Best of Databricks Best of Microsoft Designed in collaboration with the founders of Apache Spark One-click set up; streamlined workflows Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts. Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage) Enterprise-grade Azure security (Active Directory integration, compliance, enterprise -grade SLAs) Fully-managed Hadoop and Spark Azure for the cloud HDInsight 100% Open Source Hortonworks data platform Hadoop and Spark as a Service on Azure Clusters up and running in minutes Managed, monitored and supported by Microsoft with the industry’s best SLA Familiar BI tools for analysis, or open source notebooks for interactive data science 63% lower TCO than deploy your own Hadoop on-premises* *IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight” Hortonworks Data Platform (HDP) 3.0 (under the covers of HDInsight 4.0 – public preview) Simply put, Hortonworks ties all the open source products together (20) Have a mature Hadoop ecosystem already established that you want to migrate Want to be 100% open source Want to use Hadoop tools and make them available 24/7 at a lower cost Want to use certain tools like Kafka/Storm/HBase/R Server/LLAP/Pig/Beam/Flink KNOWING THE VARIOUS BIG DATA SOLUTIONS CONTROL EASE OF USE Azure Databricks Azure HDInsight Azure Marketplace HDP | CDH | MapR Reduced Administration BIG DATA BIG DATA ANALYTICS Any Hadoop technology, Workload optimized, Frictionless & Optimized any distribution managed clusters Spark clusters IaaS Clusters Managed Clusters Azure Data Lake Analytics Azure Data Lake Store Gen2 STORAGE Azure Storage BIG DATA Azure SQL Data Warehouse A relational data warehouse-as-a-service, fully managed by Microsoft. Industries first elastic cloud data warehouse with enterprise-grade capabilities. Support your smallest to your largest data storage needs while handling queries up to 100x faster. SMP vs MPP Azure Analysis Services Microsoft Products vs Hadoop/OSS Products Microsoft Product Hadoop/Open Source Software Product Office365/Excel OpenOffice/Calc Cosmos DB MongoDB, MarkLogic, HBase, Cassandra SQL Database SQLite, MySQL, PostgreSQL, MariaDB, Apache Ignite Azure Data Lake Analytics/YARN None Azure VM/IaaS OpenStack Blob Storage HDFS, Ceph Azure HBase Apache HBase (Azure HBase is a service wrapped around Apache HBase), Apache Trafodion Event Hub Apache Kafka Azure Stream Analytics Apache Storm, Apache Spark Streaming, Apache Flink, Apache Beam, Twitter Heron Power BI Apache Zeppelin, Apache Jupyter, Airbnb Caravel, Kibana HDInsight Hortonworks (pay), Cloudera (pay), MapR (pay) Azure ML (Machine Learning) Apache Mahout, Apache Spark MLib, Apache PredictionIO Microsoft R Open R SQL Data Warehouse/Interactive queries Apache Hive LLAP, Presto, Apache Spark SQL, Apache Drill, Apache Impala IoT Hub Apache NiFi Azure Data Factory Apache Falcon, Airbnb Airflow, Apache Oozie, Apache Azkaban Azure Data Lake Storage/WebHDFS HDFS Ozone Azure Analysis Services/SSAS Apache Kylin, Apache Druid, AtScale (pay) SQL Server Reporting Services None Hadoop Indexes Jethro Data (pay) Azure Data Catalog Apache Atlas PolyBase Apache Drill Azure Search Apache Solr, Apache ElasticSearch (Azure Search build on ES) SQL Server Integration Services (SSIS) Talend Open Studio, Pentaho Data Integration Apache Ambari (manage Hadoop clusters), Apache Ranger (data security such as row/column-level security), Apache Knox (secure entry point for Others Hadoop clusters), Apache Flume (collecting log data) Note: Many of the Hadoop/OSS products are available in Azure Custom apps BI Analytics SQL Server SQL master instance Compute pool Compute pool Compute pool Directly read from SQL Compute SQL Compute SQL Compute SQL Compute SQL Compute … HDFS Node Node Node Node Node Data mart Storage pool SQL Data SQL Data SQL SQL SQL Node