Data in the Cloud Happy 10th ACM SoCC!

Raghu Ramakrishnan CTO for Data, Technical Fellow ACM SoCC Topics Over the Past 10 Years

2010 2011 2012 2013 2014

2015 2016 2017 2018 2019

Word clouds courtesy Carlo Curino ACM SoCC Topics After Filtering “data” and “cloud”

2010 2011 2012 2013 2014

2015 2016 2017 2018 2019

Word clouds courtesy Carlo Curino 4

Carbon Footprint of Cloud Computing

TRADITIONAL 52% TRADITIONAL 71% TRADITIONAL 77% TRADITIONAL 22% more efficient more efficient more efficient more efficient DATACENTER DATACENTER DATACENTER DATACENTER Virtualized Dedicated Large, Large, servers, high- storage, high-end high-end end deployment high-end deployment deployment deployment

SharePoint Online

Azure Compute

Azure Storage Exchange Online

Electricity/core-hour Electricity/TB-year Electricity/mailbox-year Electricity/user-year

http://download.microsoft.com/download/7/3/9/739BC4AD-A855-436E-961D-9C95EB51DAF9/Microsoft_Cloud_Carbon_Study_2018.pdf 8 AI in Operation & Optimization

IoT and big data platforms make it increasingly easy to optimize datacenters IoT telemetry, analytics and ML optimization

Predictive maintenance

Capacity planning and workload placement

Microsoft Confidential 9 Microsoft Confidential 10

Ubiquitous Data

Scale

Heterogeneity

Network Many engines

latency Data silos Data

Many workloads Elastic compute Elastic

Elastic storage

16 Azure SQL DW Azure SQL DB Analytics-optimized Update-optimized OPERATIONAL ANALYTICS Meta data Meta data XACT_STATE XACT_STATE

Governance

Governance RELATIONAL

Azure Cosmos DB Spark, Hive, ML… Document model

Data Lake RELATIONAL Meta data -

Meta data NON XACT_STATE Governance

Governance Big Picture: Separation of Compute and State

Spark, Hive, ML… Azure SQL DW Azure SQL DB Azure Cosmos DB

Data Lake Analytics-optimized Update-optimized Document model

Meta data Meta data Meta data Meta data

XACT_STATE XACT_STATE XACT_STATE Microsoft’s Internal Big Data Service

Microsoft’s internal data lake Store HDFS as a PaaS cloud service • A data lake for all teams Enabling business growth: @Microsoft • Microsoft’s serverless Big Data platform • Good developer tools Office productivity revenue (45%YoY)* Intelligent Cloud (100% YoY)* • Fully aligned with Hadoop ecosystem • Batch, Interactive, Streaming, ML Bing search share doubles and standards, with full support for • Used across Office, , Azure, Hadoop tools and engines as well as Windows, Ads, Bing, , … unique Microsoft capabilities • Production jobs and experimentation • Migrated to ADLS • 1P = 3P By the numbers • 9+ Exabytes of data, 8+ billion files J. Zhou et. al., SCOPE: parallel databases • 100Ks of physical servers meet MapReduce, VLDBJ 21(5) • Millions of interactive queries R. Ramakrishnan et. Al., Azure Data Lake • Huge streaming pipelines Store, SIGMOD 2017 • 100Ks of daily batch jobs • 15K+ developers Apache YARN Federation • 300+ teams

MSR/GSL Collaboration Traditional MPP DW Architecture Meta data

Transactions

DQE communication channel

Data movement channel

Compute • DMS • SQL Adaptive cache • SSD Cache

Remote storage

Snapshot backups Data

Log Premium Standard Cloud-Native Scale-Out, Data Heterogeneity

▪ Data and state separated from compute Data movement channel ▪ Fault-tolerant scale-out

▪ Online scaling Compute • ES ▪ Data heterogeneity • SQL • SSD Cache

➢ Converge DW and Lake

Centralized services

Remote storage Meta data Standard • Distribution-less

Transactions

Columnar files Polaris Concurrency – Workload Aware Scheduling State Machines: A next generation distributed query engine (blend massive scale batch QP• withGuarantees scale up (small precedence scale-out) constraints interactive QP) are satisfied • Defines a formal model on how we recover from failures Task

Resource % Demand Workload Task Graph State Machine Execution

State

25 5 Transition Workload Tasks

Agg 10 25

5 5 Query 1 Agg 10 5 5

15 40 Edges are precedence constraints States 5 5 Global Workload Graph that enablesWaiting Execution for 40 15 5 workload optimizations acrossExecuting queries 5 15 Query 2 Failed Completed

5

Task-cost Driven Scheduling

Resource Aware Task Placement Scalability: All TPC-H Queries at 1PB Scale! Elastic DQP – Unlimited Scale P. Antonopoulos, et. al,., Socrates: The New SQL Server in the Cloud. ACM SIGMOD 2019

풔풊풛풆풐풇(풅풂풕풂) 풔풊풛풆풐풇(풅풂풕풂) 풔풊풛풆풐풇(풅풂풕풂) 푵 푵 푵

MSR Collaboration OLTP Data Warehouses Big Data / Lake NoSQL Unified Data Suite and Governance

Global apps

Spark, Hive, ML… SQL Azure Cosmos DB

Data Lake Analytics-optimized Update-optimized Document storage

Meta data Meta data Meta data Meta data

XACT_STATE XACT_STATE XACT_STATE

Governance Big Picture: Must Simplify Usability and Governance

• Cloud • Elastic compute and storage is transformative • But compute-storage latency and bandwidth is key challenge • Edge blurs cloud/on-prem separation

• ML • An integral part of data processing, with a rapidly growing community of its own

• Implications for Data Management • Rethink what belongs in a “DBMS”—ML, data governance • Rethink data architectures from the ground up—OLTP/Analytics/HTAP A. Agrawal et al., Cloudy with high chance of DBMS: a 10-year prediction for Enterprise-Grade ML, CIDR 2020.

Model Development / Training other data

Data Model Tracking featurization Model Model Model Catalogs & Provenance Training optimization

offline featurization. Logs & Access Telemetry offline Control

online

App deployment orchestration policies logic Featurization Governance Live Data policies Model Model Scoring Decisions

GSL Collaboration Unified Governance

Big Data and Data warehousing A single pane of glass to…

Manage data lifecycle Data & (collect, clean, publish, discover, curate, …) Operational Systems Ensure Data Quality & Correctness Assess data compliance, privacy & protection BI AI and ML Author & manage data policy (access, use, retention, location, sharing)

Across Cloud, Edge, On-Prem