Building File System Semantics for an Exabyte Scale Object Storage System

Shane Mainali, Raji Easwaran

Agenda

- Analytics Workloads (access patterns and challenges)
- Storage overview
- Under the hood
- Q&A

Analytics Workloads: Access Patterns and Challenges

Analytics Workload Pattern

[Diagram: a typical analytics pipeline. Data sources (sensors and IoT, real-time apps, logs, media, files, business/custom apps; structured and unstructured) move through INGEST, EXPLORE, PREP & TRAIN, and MODEL & SERVE stages, with Azure Data Lake Storage Gen2 as the STORE layer. Services shown include Azure Data Factory, Azure Databricks, Azure Data Explorer, Cosmos DB, SQL Data Warehouse, Azure SQL, Azure Analysis Services, and Power BI.]

Challenges

- Containers are mounted as file systems on analytics engines like Hadoop and Databricks
- Client-side file system emulation impacts performance, semantics, and correctness
- Directory operations are expensive (over a flat namespace, a rename or delete touches every blob under the prefix, as sketched below)
- Coarse-grained access control
- Throughput is critical for analytics workloads
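The cost of those directory operations over a flat namespace can be seen in a minimal sketch; list_blobs, copy_blob, and delete_blob below are hypothetical stand-ins for generic object-store calls, not a specific SDK.

```python
# Minimal sketch (not actual client code): why a directory rename is expensive
# when a flat blob namespace is emulated client-side. The helpers list_blobs,
# copy_blob, and delete_blob are hypothetical stand-ins for object-store calls.

def rename_directory_flat(store, container, src_prefix, dst_prefix):
    """Rename 'src_prefix/' to 'dst_prefix/' by copying and deleting every blob."""
    for blob_name in store.list_blobs(container, prefix=src_prefix + "/"):
        new_name = dst_prefix + blob_name[len(src_prefix):]
        store.copy_blob(container, blob_name, new_name)    # one copy per blob
        store.delete_blob(container, blob_name)             # one delete per blob
    # Cost is O(number of blobs under the prefix), and the operation is not
    # atomic: a failure midway leaves the namespace in a mixed state.
```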

Storage for Analytics - Goals

- Address shortcomings of the client-side design
- First-class hierarchical namespace
- Interoperability with object storage (Blobs)
- Object-level ACLs (POSIX)
- Platform for future filesystem-based protocols (e.g. NFS)

Azure Data Lake Storage: File System Semantics on Object Storage

Hierarchical Namespace (HNS)

Azure Data Lake Storage Architecture

[Diagram: two APIs over one store. The Blob API serves unstructured object data (file server backups, archive storage, semi-structured data); the ADLS Gen2 API serves file data for Hadoop File System workloads (file and folder hierarchy, granular ACLs, atomic file transactions). Both sit on a common Blob Storage foundation providing object tiering and lifecycle policy management, AAD integration, RBAC, storage account security, and HA/DR through ZRS and RA-GRS.]

Blobs and Flat Namespace

[Diagram: with a flat namespace, the Blob API addresses each blob by its full path - the key "/foo/bar/file.txt" maps directly to the blob's data, and directories exist only as name prefixes.]

Files and Folders in HNS

[Diagram: with the hierarchical namespace, both the Blob API and the ADLS API resolve a path through real directory entries - "foo" contains "bar", which contains "baz.txt", which points to the blob's data.]

Mapping the concepts

- Account == Account: same storage account for ADLS and Blob. URIs are identical except for the endpoint:
  http://account.dfs.core.windows.net/container/videos/movie.mp4 (ADLS Gen2)
  http://account.blob.core.windows.net/container/videos/movie.mp4 (Blob)
- File System == Container: the Create File System and Create Container APIs do the same thing; exactly the same metadata and objects under the covers.
- Directory ~= Blob: directories are first-class entities; both implicit and explicit creation are supported (implicit creation when blobs are created). ACLs and leases are obeyed by both.
- File == Blob: ADLS Gen2 adds Append and Flush semantics; existing Blob semantics are supported as is. ACLs and leases are obeyed by both.
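As a quick illustration of the shared-endpoint mapping above, a minimal sketch; build_uris is a hypothetical helper, not part of any Azure SDK.

```python
# Minimal sketch: the same object is addressable through both endpoints.
# build_uris is a hypothetical helper, not part of the Azure SDKs.

def build_uris(account: str, container: str, path: str) -> dict:
    """Return the ADLS Gen2 (dfs) and Blob URIs for the same stored object."""
    return {
        "adls": f"http://{account}.dfs.core.windows.net/{container}/{path}",
        "blob": f"http://{account}.blob.core.windows.net/{container}/{path}",
    }

print(build_uris("account", "container", "videos/movie.mp4"))
# {'adls': 'http://account.dfs.core.windows.net/container/videos/movie.mp4',
#  'blob': 'http://account.blob.core.windows.net/container/videos/movie.mp4'}
```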

API Interoperability

- Either the Blob or the ADLS Gen2 APIs can be used to access the same data.
- Existing Blob applications work without code changes, and no data movement is needed on the Data Lake account.

[Diagram: the ADLS-side Account / File System / Directory / File hierarchy maps to the Blob-side Account / Container / Blob hierarchy.]

Under the Hood: Designing for Performance, Scale & Throughput

Blob Storage Architecture

- Front End: stateless front door for request handling (auth, request metering/throttling, validation)
- Partition Layer: serves data in key-value fashion based on partitions; enables batch transactions and strong consistency
- Stream Layer: stores multiple replicas of the data; deals with failures, bit rot, etc.

Blob Storage with HNS Architecture

- Front End: stateless front door for request handling (auth, request metering/throttling, validation)

- Hierarchical Namespace: serves metadata based on partitions, including file names, directory structure, and ACLs
- Partition Layer: serves data in key-value fashion based on partitions; enables batch transactions and strong consistency
- Stream Layer: stores multiple replicas of the data; deals with the media/devices, handles failures, bit rot, etc.
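To make the layering concrete, a rough sketch of which layers different request types touch; the grouping of operations is an assumption drawn from these slides, not service code.

```python
# Which layers a request touches, per the architecture above. Purely
# illustrative; the grouping of operations is an assumption based on the slides.

LAYERS_BY_OPERATION = {
    # namespace/metadata operations: resolved via the Hierarchical Namespace,
    # which persists its rows through the Partition Layer
    "create_directory": ["Front End", "Hierarchical Namespace", "Partition Layer"],
    "rename":           ["Front End", "Hierarchical Namespace", "Partition Layer"],
    "set_acl":          ["Front End", "Hierarchical Namespace", "Partition Layer"],
    # data operations: once the name is resolved, data bypasses the HNS
    "read":             ["Front End", "Partition Layer", "Stream Layer"],
    "append":           ["Front End", "Partition Layer", "Stream Layer"],
}

for op, layers in LAYERS_BY_OPERATION.items():
    print(f"{op:>16}: {' -> '.join(layers)}")
```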

Hierarchical Namespace Topology

[Diagram: every directory and file is identified by a GUID. GUID1 <=> "/", (GUID1, GUID2) <=> "path1", (GUID1, GUID3) <=> "path2", (GUID3, GUID4) <=> "path3". The edges GUID1 -> GUID2, GUID1 -> GUID3, and GUID3 -> GUID4 yield the paths /path1/, /path2/ (containing file1 and file2), and /path2/path3/ (containing file3).]
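A minimal sketch of the GUID-based mapping above; a plain dict stands in for the partitioned namespace store, and the file GUIDs are illustrative.

```python
# Minimal sketch of the topology above: each directory/file gets a GUID, and the
# namespace is a map from (parent GUID, child name) to the child's GUID.
# A plain dict stands in for the partitioned namespace store.

ROOT = "GUID1"
namespace = {
    ("GUID1", "path1"): "GUID2",
    ("GUID1", "path2"): "GUID3",
    ("GUID3", "path3"): "GUID4",
    ("GUID3", "file1"): "GUID-FILE1",   # illustrative file entries
    ("GUID3", "file2"): "GUID-FILE2",
    ("GUID4", "file3"): "GUID-FILE3",
}

def resolve(path: str) -> str:
    """Walk the path one component at a time, parent GUID by parent GUID."""
    current = ROOT
    for name in filter(None, path.split("/")):
        current = namespace[(current, name)]
    return current

print(resolve("/path2/path3/file3"))   # -> GUID-FILE3
```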

Partition Layer Schema

Partition key: Account;FileSystem;<Parent ID> - Row key: (Name, CT)

 #  | Parent ID  | Name  | CT    | Del | File | Metadata | Child ID
----+------------+-------+-------+-----+------+----------+------------
 1  | GUID-ROOT  | .     | 00001 | N   | Y    | …        | GUID-BLOB1
 2  | GUID-ROOT  | path1 | 00100 | N   | N    | …        | GUID-PATH1
 3  | GUID-ROOT  | path2 | 00200 | N   | N    | …        | GUID-PATH2
 4  | GUID-PATH1 | .     | 00100 | N   | Y    | …        | GUID-BLOB2
 6  | GUID-PATH2 | .     | 00200 | N   | N    | …        | GUID-BLOB3
    | GUID-PATH2 | file1 | 00300 | N   | Y    | …        | GUID-BLOB4
 7  | GUID-PATH2 | file1 | 00350 | N   | Y    | …        | GUID-BLOB4
 8  | GUID-PATH2 | file2 | 00400 | N   | Y    | …        | GUID-BLOB5
    | GUID-PATH2 | path3 | 00400 | N   | N    |          | GUID-PATH3
 10 | GUID-PATH3 | .     | 00400 | N   | Y    | …        | GUID-BLOB6
 11 | GUID-PATH3 | file3 | 00401 | N   | N    | …        | GUID-BLOB7

Example: row 2 ("path1") is stored under partition key Account;FileSystem;GUID-ROOT with row key (path1, 00100).
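A rough sketch of how these keys could be used; because all children of a directory share the parent's partition key, a directory listing becomes a single-partition range scan. The key format follows the example row above; everything else (the row subset, helper names) is illustrative.

```python
# Illustrative sketch of the key scheme above. Row layout follows the table;
# the in-memory list stands in for the partition layer's key-value store.

from collections import namedtuple

Row = namedtuple("Row", "parent_id name ct deleted is_file child_id")

rows = [
    Row("GUID-ROOT",  "path1", "00100", "N", "N", "GUID-PATH1"),
    Row("GUID-ROOT",  "path2", "00200", "N", "N", "GUID-PATH2"),
    Row("GUID-PATH2", "file1", "00350", "N", "Y", "GUID-BLOB4"),
    Row("GUID-PATH2", "file2", "00400", "N", "Y", "GUID-BLOB5"),
    Row("GUID-PATH2", "path3", "00400", "N", "N", "GUID-PATH3"),
]

def partition_key(account, filesystem, parent_id):
    # e.g. "Account;FileSystem;GUID-ROOT", as in the example row above
    return f"{account};{filesystem};{parent_id}"

def list_directory(parent_id):
    """All children share the parent's partition key, so this is one range scan."""
    return [r.name for r in rows if r.parent_id == parent_id and r.deleted == "N"]

print(partition_key("Account", "FileSystem", "GUID-PATH2"))
print(list_directory("GUID-PATH2"))   # ['file1', 'file2', 'path3']
```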

Hierarchical Namespace Flow

Create File: Staging\Oscars\Movie.mp4

[Diagram: the Front End routes the Create File request to the Hierarchical Namespace, which records one entry per path component:]

Parent | Name  | Label
       | Guid1 | Staging
Guid1  | Guid2 | Oscars
Guid2  | Guid3 | Movie.mp4
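A minimal sketch of the create-file flow: ensure an entry exists for every path component, creating missing directories along the way. The dict mirrors the Parent/Name/Label table above; GUID values and helper names are illustrative.

```python
import uuid

# Minimal sketch of Create File: ensure an entry exists for every path
# component, creating missing directories along the way. The dict mirrors the
# Parent/Name/Label table above; GUIDs and helper names are illustrative.

namespace = {}          # (parent GUID, label) -> GUID
ROOT = "GUID-ROOT"

def create_file(path: str) -> str:
    parent = ROOT
    parts = path.replace("\\", "/").split("/")
    for label in parts[:-1]:                      # directories: Staging, Oscars
        key = (parent, label)
        if key not in namespace:
            namespace[key] = str(uuid.uuid4())    # new directory entry
        parent = namespace[key]
    file_guid = str(uuid.uuid4())                 # file entry: Movie.mp4
    namespace[(parent, parts[-1])] = file_guid
    return file_guid

create_file(r"Staging\Oscars\Movie.mp4")
for (parent, label), guid in namespace.items():
    print(parent, label, guid)
```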

Hierarchical Namespace Flow

Rename Directory: Staging\Oscars\Movie.mp4 -> Master\Oscars\Movie.mp4

[Diagram: the rename only rewrites the renamed directory's label in the Hierarchical Namespace - Guid1's label changes from "Staging" to "Master"; the Oscars and Movie.mp4 entries and the file data are untouched:]

Parent | Name  | Label
       | Guid1 | Master
Guid1  | Guid2 | Oscars
Guid2  | Guid3 | Movie.mp4
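A minimal sketch of why this rename is cheap in the hierarchical namespace - a single label update - in contrast to the per-blob copy/delete loop sketched under "Challenges". The entries dict is illustrative.

```python
# Minimal sketch: in the hierarchical namespace a directory rename rewrites one
# label; children keep their GUIDs and the file data is never touched.
# The entries dict is illustrative, mirroring the Parent/Name/Label table.

entries = {
    "Guid1": {"parent": None,    "label": "Staging"},
    "Guid2": {"parent": "Guid1", "label": "Oscars"},
    "Guid3": {"parent": "Guid2", "label": "Movie.mp4"},
}

def rename_directory(guid: str, new_label: str) -> None:
    entries[guid]["label"] = new_label     # O(1): one row update, no data movement

rename_directory("Guid1", "Master")
print(entries["Guid1"])                    # {'parent': None, 'label': 'Master'}
```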

Scale Out & Load Balancing

- A Scale Unit contains hundreds of nodes, and an Azure region contains several Scale Units.
- Each node hosts many Namespace Processors (NPs); each NP manages a portion of the namespace (a GUID range, e.g. NP1: GUIDs 1-100, NP2: 101-200, ...).
- Hot nodes are load balanced against other nodes in the Scale Unit by splitting their GUID ranges among NPs (e.g. NP6's range 501-1000 is split into 501-750 and 751-1000, with the upper half moving to an NP on another node).
- When a majority of nodes in a Scale Unit become hot, load balancing occurs across Scale Units.
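A minimal sketch of range-based ownership and range splitting as described above; the data structures, the target processor name, and the choice of split point are illustrative assumptions.

```python
# Illustrative sketch of GUID-range ownership and load balancing by range
# splitting. Ranges mirror the slide's example; "NP-new" is a made-up target.

ownership = {
    "NP1": (1, 100),
    "NP2": (101, 200),
    "NP3": (201, 300),
    "NP4": (301, 400),
    "NP5": (401, 500),
    "NP6": (501, 1000),
}

def owner_of(guid_key: int) -> str:
    for np, (lo, hi) in ownership.items():
        if lo <= guid_key <= hi:
            return np
    raise KeyError(guid_key)

def split_range(hot_np: str, target_np: str) -> None:
    """Relieve a hot NP by handing the upper half of its range to another NP."""
    lo, hi = ownership[hot_np]
    mid = (lo + hi) // 2
    ownership[hot_np] = (lo, mid)          # e.g. NP6 keeps 501-750
    ownership[target_np] = (mid + 1, hi)   # e.g. another NP takes 751-1000

split_range("NP6", "NP-new")
print(owner_of(800))    # -> 'NP-new'
```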

Transaction Processing and Caching

[Diagram: resolving /path2/file1 walks one namespace entry per path component, and the entries can live on different Namespace Processors. Each NP caches which NP owns each entry - e.g. GUID-ROOT ("/") is owned by NP1, GUID-PATH2 ("path2") by NP7, and GUID-BLOB4 ("file1") by NP6 - alongside the directory rows (GUID-ROOT, "path2", File=N) -> GUID-PATH2 and (GUID-PATH2, "file1", File=Y) -> GUID-BLOB4.]
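A minimal sketch of per-component resolution with a cache in front of the owning Namespace Processors; the cache policy and data structures are illustrative.

```python
# Illustrative sketch: path resolution does one (parent GUID, name) lookup per
# component; a cache in front of the owning Namespace Processors absorbs
# repeated lookups for hot directories. Names and structures are illustrative.

directory_rows = {                      # authoritative rows, spread across NPs
    ("GUID-ROOT",  "path2"): "GUID-PATH2",
    ("GUID-PATH2", "file1"): "GUID-BLOB4",
}
cache = {}                              # (parent GUID, name) -> child GUID

def lookup(parent: str, name: str) -> str:
    key = (parent, name)
    if key not in cache:                # miss: go to the owning NP's partition
        cache[key] = directory_rows[key]
    return cache[key]

def resolve(path: str, root: str = "GUID-ROOT") -> str:
    guid = root
    for name in filter(None, path.split("/")):
        guid = lookup(guid, name)
    return guid

print(resolve("/path2/file1"))   # -> GUID-BLOB4 (a second call is fully cached)
```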

High Throughput

- A single object/file (e.g. Movie.mp4, 50 GB) contains multiple blocks.
- The block range is partitioned uniformly across partitions, so a single write can potentially be served by all partition nodes.
- Supports 100s of Gbps of ingress/egress for a single account, or even for a single file.
- Two layers of caching enable high-throughput read performance.
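A minimal sketch of spreading a file's block range uniformly across partitions so that a large write fans out; the block size, partition count, and mapping function are illustrative assumptions.

```python
# Illustrative sketch: a large file's blocks are spread uniformly across
# partition servers, so a single large write fans out to many of them.
# The block size, partition count, and mapping are assumptions.

BLOCK_SIZE = 256 * 1024 * 1024        # 256 MiB blocks (illustrative)
NUM_PARTITIONS = 8                    # partition servers (illustrative)

def partition_for_block(block_index: int) -> int:
    """Uniform spread of a file's block range across partitions."""
    return block_index % NUM_PARTITIONS

file_size = 50 * 1024**3              # the 50 GB Movie.mp4 from the slide
num_blocks = -(-file_size // BLOCK_SIZE)        # ceiling division
fan_out = {partition_for_block(b) for b in range(num_blocks)}
print(f"{num_blocks} blocks spread over {len(fan_out)} partitions")
```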

Performance & Scale Implications

- Hierarchical Namespace is only in the path of namespace traversal and metadata operations
- Data reads and writes don't go through the Hierarchical Namespace
- Hierarchical Namespace leverages SSD for persistence and memory for caching to minimize latency overhead
- Separation of the distributed cache and persistent state (Partition) layers is critical
- Load balancing is very efficient and fast
- Leverages the Partition Layer, with distinct partitioning for Blobs and HNS
- While distributed transactions are more expensive, they are less frequent

Opportunities

- Snapshots at any level of the hierarchy
- Time-travel operations with end-to-end built-in transaction timestamps
- Support for a wide variety of file systems - interop across all - zero data copying
- In-place upgrade from flat -> hierarchical namespace
- Cross-entity strongly consistent reads
- High-fidelity on-prem -> cloud migration / hybrid

Q & A
