Data in the Cloud Happy 10th ACM SoCC!
Raghu Ramakrishnan CTO for Data, Technical Fellow ACM SoCC Topics Over the Past 10 Years
2010 2011 2012 2013 2014
2015 2016 2017 2018 2019
Word clouds courtesy Carlo Curino ACM SoCC Topics After Filtering “data” and “cloud”
2010 2011 2012 2013 2014
2015 2016 2017 2018 2019
Word clouds courtesy Carlo Curino 4
Carbon Footprint of Cloud Computing
TRADITIONAL 52% TRADITIONAL 71% TRADITIONAL 77% TRADITIONAL 22% more efficient more efficient more efficient more efficient DATACENTER DATACENTER DATACENTER DATACENTER Virtualized Dedicated Large, Large, servers, high- storage, high-end high-end end deployment high-end deployment deployment deployment
SharePoint Online
Azure Compute
Azure Storage Exchange Online
Electricity/core-hour Electricity/TB-year Electricity/mailbox-year Electricity/user-year
http://download.microsoft.com/download/7/3/9/739BC4AD-A855-436E-961D-9C95EB51DAF9/Microsoft_Cloud_Carbon_Study_2018.pdf 8 AI in Operation & Optimization
IoT and big data platforms make it increasingly easy to optimize datacenters IoT telemetry, analytics and ML optimization
Predictive maintenance
Capacity planning and workload placement
Microsoft Confidential 9 Microsoft Confidential 10
Ubiquitous Data
Scale
Heterogeneity
Network Many engines
latency Data silos Data
Many workloads Elastic compute Elastic
Elastic storage
16 Azure SQL DW Azure SQL DB Analytics-optimized Update-optimized OPERATIONAL ANALYTICS Meta data Meta data XACT_STATE XACT_STATE
Governance
Governance RELATIONAL
Azure Cosmos DB Spark, Hive, ML… Document model
Data Lake RELATIONAL Meta data -
Meta data NON XACT_STATE Governance
Governance Big Picture: Separation of Compute and State
Spark, Hive, ML… Azure SQL DW Azure SQL DB Azure Cosmos DB
Data Lake Analytics-optimized Update-optimized Document model
Meta data Meta data Meta data Meta data
XACT_STATE XACT_STATE XACT_STATE Microsoft’s Internal Big Data Service
Microsoft’s internal data lake Azure Data Lake Store HDFS as a PaaS cloud service • A data lake for all teams Enabling business growth: @Microsoft • Microsoft’s serverless Big Data platform • Good developer tools Office productivity revenue (45%YoY)* Intelligent Cloud (100% YoY)* • Fully aligned with Hadoop ecosystem • Batch, Interactive, Streaming, ML Bing search share doubles and standards, with full support for • Used across Office, Xbox, Azure, Hadoop tools and engines as well as Windows, Ads, Bing, Skype, … unique Microsoft capabilities • Production jobs and experimentation • Migrated to ADLS • 1P = 3P By the numbers • 9+ Exabytes of data, 8+ billion files J. Zhou et. al., SCOPE: parallel databases • 100Ks of physical servers meet MapReduce, VLDBJ 21(5) • Millions of interactive queries R. Ramakrishnan et. Al., Azure Data Lake • Huge streaming pipelines Store, SIGMOD 2017 • 100Ks of daily batch jobs • 15K+ developers Apache YARN Federation • 300+ teams
MSR/GSL Collaboration Traditional MPP DW Architecture Meta data
Transactions
DQE communication channel
Data movement channel
Compute • DMS • SQL Adaptive cache • SSD Cache
Remote storage
Snapshot backups Data
Log Premium Standard Cloud-Native Scale-Out, Data Heterogeneity
▪ Data and state separated from compute Data movement channel ▪ Fault-tolerant scale-out
▪ Online scaling Compute • ES ▪ Data heterogeneity • SQL • SSD Cache
➢ Converge DW and Lake
Centralized services
Remote storage Meta data Standard • Distribution-less
Transactions
Columnar files Polaris Concurrency – Workload Aware Scheduling State Machines: A next generation distributed query engine (blend massive scale batch QP• withGuarantees scale up (small precedence scale-out) constraints interactive QP) are satisfied • Defines a formal model on how we recover from failures Task
Resource % Demand Workload Task Graph State Machine Execution
State
25 5 Transition Workload Tasks
Agg 10 25
5 5 Query 1 Agg 10 5 5
15 40 Edges are precedence constraints States 5 5 Global Workload Graph that enablesWaiting Execution for 40 15 5 workload optimizations acrossExecuting queries 5 15 Query 2 Failed Completed
5
Task-cost Driven Scheduling
Resource Aware Task Placement Scalability: All TPC-H Queries at 1PB Scale! Elastic DQP – Unlimited Scale P. Antonopoulos, et. al,., Socrates: The New SQL Server in the Cloud. ACM SIGMOD 2019
풔풊풛풆풐풇(풅풂풕풂) 풔풊풛풆풐풇(풅풂풕풂) 풔풊풛풆풐풇(풅풂풕풂) 푵 푵 푵
MSR Collaboration OLTP Data Warehouses Big Data / Lake NoSQL Unified Data Suite and Governance
Global apps
Spark, Hive, ML… SQL Azure Cosmos DB
Data Lake Analytics-optimized Update-optimized Document storage
Meta data Meta data Meta data Meta data
XACT_STATE XACT_STATE XACT_STATE
Governance Big Picture: Must Simplify Usability and Governance
• Cloud • Elastic compute and storage is transformative • But compute-storage latency and bandwidth is key challenge • Edge blurs cloud/on-prem separation
• ML • An integral part of data processing, with a rapidly growing community of its own
• Implications for Data Management • Rethink what belongs in a “DBMS”—ML, data governance • Rethink data architectures from the ground up—OLTP/Analytics/HTAP A. Agrawal et al., Cloudy with high chance of DBMS: a 10-year prediction for Enterprise-Grade ML, CIDR 2020.
Model Development / Training other data
Data Model Tracking featurization Model Model Model Catalogs & Provenance Training optimization
offline featurization. Logs & Access Telemetry offline Control
online
App deployment orchestration policies logic Featurization Governance Live Data policies Model Model Scoring Decisions
GSL Collaboration Unified Governance
Big Data and Data warehousing A single pane of glass to…
Manage data lifecycle Data & (collect, clean, publish, discover, curate, …) Operational Systems Ensure Data Quality & Correctness Assess data compliance, privacy & protection BI AI and ML Author & manage data policy (access, use, retention, location, sharing)
Across Cloud, Edge, On-Prem