Building File System Semantics for an Exabyte Scale Object Storage System
Shane Mainali, Raji Easwaran
Microsoft
2019 Storage Developer Conference. © Microsoft. All Rights Reserved.
Agenda
- Analytics Workloads (Access patterns & challenges)
- Azure Data Lake Storage overview
- Under the hood
- Q&A
Analytics Workloads
Access Patterns and Challenges
Analytics Workload Pattern
[Diagram: analytics pipeline with the stages INGEST, STORE, EXPLORE, PREP & TRAIN, MODEL & SERVE]
- Sources: Sensors and IoT (unstructured), Logs (unstructured), Media (unstructured), Files (unstructured), Business/custom apps (structured)
- STORE: Azure Data Lake Storage Gen2
- Services shown: Azure Data Factory, Azure Databricks, Azure Data Explorer, Azure SQL Data Warehouse, Azure Analysis Services, Cosmos DB, Power BI, Real-time Apps
Challenges
- Containers are mounted as filesystems on Analytics Engines like Hadoop and Databricks
- Client-side file system emulation impacts performance, semantics, and correctness
- Directory operations are expensive
- Coarse-grained Access Control
- Throughput is critical for Big Data
Storage for Analytics - Goals
- Address shortcomings of client-side design
- First-class hierarchical namespace
- Interoperability with Object Storage (Blobs)
- Object-level ACLs (POSIX)
- Platform for future filesystem-based protocols (e.g. NFS)
Azure Data Lake Storage
File System Semantics on Object Storage
Hierarchical Namespace (HNS)
Azure Data Lake Storage Architecture
[Diagram: two APIs on top of a common foundation]
- Blob API (Unstructured Object Data): Server Backups, Archive Storage, Semi-structured Data
- ADLS Gen2 API (File Data): Hadoop File System, File and Folder Hierarchy, Granular ACLs, Atomic File Transactions
- Common Blob Storage Foundation:
  - Object Tiering and Lifecycle Policy Management
  - AAD Integration, RBAC, Storage Account Security
  - HA/DR support through ZRS and RA-GRS
Blobs and Flat Namespace
Blob API
Flat Namespace
/foo/bar/file.txt
Data
Files and Folders in HNS
Blob API ADLS API
Hierarchical Namespace foo
bar
baz.txt
Data
Mapping the concepts
[Diagram: ADLS hierarchy (Account, File System, Directory, File) alongside the Blob hierarchy (Account, Container, Blob)]
- Same Storage account
- URIs are the same except for the endpoint
  http://account.dfs.core.windows.net/container/videos/movie.mp4
  http://account.blob.core.windows.net/container/videos/movie.mp4
- Filesystem == Container
  - Create File System and Create Container APIs do the same thing
  - Exactly the same metadata and objects under the covers
- Directory ~= Blob
  - Directories are first class entities; both implicit and explicit creation supported
  - Implicit creation when blobs are created
  - ACLs and Leases obeyed by both
- File == Blob
  - ADLS Gen 2 adds Append and Flush semantics
  - Existing Blob semantics supported as is
  - ACLs and Leases obeyed by both
API Interoperability
- Can use Blob or ADLS Gen 2 APIs to access the same data
- Existing Blob applications work without code changes and with no data movement on the Data Lake account
[Diagram: ADLS hierarchy (Account, File System, Directory, File) alongside the Blob hierarchy (Account, Container, Blob)]
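Since the two URIs for an object differ only in the endpoint label, switching between them can be sketched as a pure string rewrite. This is a minimal illustration, not part of any Azure SDK; the helper name `swap_endpoint` is hypothetical, and it assumes the host has the shape `<account>.<blob|dfs>.core.windows.net` shown on the slide.

```python
from urllib.parse import urlsplit, urlunsplit

def swap_endpoint(uri: str, target: str) -> str:
    """Rewrite a storage URI to the 'blob' or 'dfs' endpoint; the path is untouched."""
    if target not in ("blob", "dfs"):
        raise ValueError("target must be 'blob' or 'dfs'")
    parts = urlsplit(uri)
    labels = parts.netloc.split(".")
    # Assumed host shape: <account>.<blob|dfs>.core.windows.net
    if len(labels) < 3 or labels[1] not in ("blob", "dfs"):
        raise ValueError(f"unrecognized storage host: {parts.netloc}")
    labels[1] = target
    return urlunsplit(parts._replace(netloc=".".join(labels)))

blob_uri = "http://account.blob.core.windows.net/container/videos/movie.mp4"
dfs_uri = swap_endpoint(blob_uri, "dfs")
print(dfs_uri)
```

The container and object path are identical in both forms, which is the point of the "Filesystem == Container" and "File == Blob" mappings.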
Under the Hood
Designing for Performance, Scale & Throughput
Blob Storage Architecture
[Diagram: layered stack with example nodes FE 2, Partition 3 (F-J), Stream 2]
- Front End (FE): Stateless, front door for request handling (auth, request metering/throttling, validation)
- Partition Layer: Serves data in key-value fashion based on partitions, enables batch transactions and strong consistency
- Stream Layer: Stores multiple replicas of the data, deals with failures, bit rot, etc.
Blob Storage with HNS Architecture
[Diagram: layered stack with example nodes FE 2, Hierarchical Namespace, Partition 3 (G3-G4), Stream 2]
- Front End (FE): Stateless, front door for request handling (auth, request metering/throttling, validation)
- Hierarchical Namespace: Serves metadata based on partitions, including file names, directory structure and ACLs
- Partition Layer: Serves data in key-value fashion based on partitions, enables batch transactions and strong consistency
- Stream Layer: Stores multiple replicas of the data, deals with the media/devices, handles failures, bit rot, etc.
Hierarchical Namespace Topology
[Diagram: namespace tree]
/ (GUID1)
  path1/ (GUID2)
  path2/ (GUID3)
    file1
    file2
    path3/ (GUID4)
      file3

Parent-child links: GUID1 -> GUID2, GUID1 -> GUID3, GUID3 -> GUID4
Name mappings: GUID1 <=> “/”; GUID1, GUID2 <=> “path1”; GUID1, GUID3 <=> “path2”; GUID3, GUID4 <=> “path3”
Paths: /, /path1/, /path2/, /path2/file1, /path2/file2, /path2/path3/, /path2/path3/file3
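The topology above can be read as a map from (parent GUID, name) to child GUID, with a path resolved one component at a time from the root. The sketch below illustrates that reading only. The slide does not name the file GUIDs, so `GUID-file1`, `GUID-file2` and `GUID-file3` are hypothetical, as is the function name `resolve`.

```python
# Directory entries keyed by (parent GUID, child name) -> child GUID,
# mirroring the mappings on the slide. The file GUIDs are hypothetical.
ROOT = "GUID1"
ENTRIES = {
    (ROOT, "path1"): "GUID2",
    (ROOT, "path2"): "GUID3",
    ("GUID3", "path3"): "GUID4",
    ("GUID3", "file1"): "GUID-file1",
    ("GUID3", "file2"): "GUID-file2",
    ("GUID4", "file3"): "GUID-file3",
}

def resolve(path: str) -> str:
    """Walk the path one component at a time, starting from the root GUID."""
    current = ROOT
    for name in (p for p in path.split("/") if p):
        key = (current, name)
        if key not in ENTRIES:
            raise FileNotFoundError(path)
        current = ENTRIES[key]
    return current

print(resolve("/"))
print(resolve("/path2/path3/"))
print(resolve("/path2/path3/file3"))
```

Each lookup depends only on the parent's GUID, not on the parent's full path, so an entry does not need to know where its ancestors sit in the tree.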
Partition Layer Schema
Partition Key, Row Key, Columns

#    Parent ID    Name    CT      Del   File   Metadata   Child ID
1    GUID-ROOT    .       00001   N     Y      …          GUID-BLOB1
2    GUID-ROOT    path1   00100   N     N      …          GUID-PATH1
3    GUID-ROOT    path2   00200   N     N      …          GUID-PATH2
4    GUID-PATH1   .       00100   N     Y      …          GUID-BLOB2
6    GUID-PATH2   .       00200   N     N      …          GUID-BLOB3
     GUID-PATH2   file1   00300   N     Y      …          GUID-BLOB4
7    GUID-PATH2   file1   00350   N     Y      …          GUID-BLOB4
8    GUID-PATH2   file2   00400   N     Y      …          GUID-BLOB5
     GUID-PATH2   path3   00400   N     N      …          GUID-PATH3
10   GUID-PATH3   .       00400   N     Y      …          GUID-BLOB6
11   GUID-PATH3   file3   00401   N     N      …          GUID-BLOB7

Example key: Account;FileSystem;GUID-ROOT path1 00100
Paths shown beside the table: /, /path1/, /path2/, /path2/file1, /path2/file2, /path2/path3/, /path2/path3/file3
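The table holds two rows for (GUID-PATH2, file1) that differ only in CT, which suggests that rows are versioned. The sketch below assumes CT is a monotonically increasing commit timestamp (the slide does not define it) and shows how a read "as of" a given timestamp could pick the newest visible version. The `Row` type and `lookup` function are hypothetical names for illustration.

```python
from typing import NamedTuple, Optional

class Row(NamedTuple):
    parent_id: str
    name: str
    ct: int          # assumed: commit timestamp (not defined on the slide)
    deleted: bool    # "Del" column
    is_file: bool    # "File" column
    child_id: str

# A few rows copied from the table above.
ROWS = [
    Row("GUID-ROOT", "path1", 100, False, False, "GUID-PATH1"),
    Row("GUID-ROOT", "path2", 200, False, False, "GUID-PATH2"),
    Row("GUID-PATH2", "file1", 300, False, True, "GUID-BLOB4"),
    Row("GUID-PATH2", "file1", 350, False, True, "GUID-BLOB4"),
    Row("GUID-PATH2", "file2", 400, False, True, "GUID-BLOB5"),
]

def lookup(parent_id: str, name: str, as_of: int) -> Optional[Row]:
    """Return the newest version of (parent_id, name) with ct <= as_of, if not deleted."""
    versions = [r for r in ROWS
                if r.parent_id == parent_id and r.name == name and r.ct <= as_of]
    if not versions:
        return None
    latest = max(versions, key=lambda r: r.ct)
    return None if latest.deleted else latest

print(lookup("GUID-PATH2", "file1", as_of=320))
print(lookup("GUID-PATH2", "file1", as_of=999))
print(lookup("GUID-PATH2", "file2", as_of=320))
```

A linear scan keeps the sketch short; a real key-value layer would presumably seek directly on the (parent ID, name, CT) key.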
Hierarchical Namespace Flow
[Diagram: Create File, shown with Staging\Oscars\Movie.mp4. Components: FE 2, Hierarchical Namespace (Parent, Name, Label), Partition 3 (F-J)]
[Diagram: Rename Directory, shown with Master\Oscars\Movie.mp4. Components: FE 2, Hierarchical Namespace (Parent, Name, Label), Partition 3 (F-J)]
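One plausible reading of the two flows is that a directory rename touches only the renamed directory's own entry, because children are keyed by their parent's GUID rather than by a full path. The sketch below illustrates that idea with an in-memory map. It assumes the rename is from Staging to Master, and the GUID values and function names are hypothetical.

```python
ROOT = "GUID-ROOT"

# (parent GUID, name) -> child GUID. GUID-A, GUID-B, GUID-C are hypothetical.
entries = {
    (ROOT, "Staging"): "GUID-A",
    ("GUID-A", "Oscars"): "GUID-B",
    ("GUID-B", "Movie.mp4"): "GUID-C",
}

def resolve(path: str) -> str:
    """Resolve a backslash-separated path; raises KeyError if a component is missing."""
    current = ROOT
    for name in path.split("\\"):
        current = entries[(current, name)]
    return current

def rename(parent: str, old: str, new: str) -> None:
    """Re-key a single entry; descendants are not touched."""
    if (parent, new) in entries:
        raise FileExistsError(new)
    entries[(parent, new)] = entries.pop((parent, old))

print(resolve("Staging\\Oscars\\Movie.mp4"))
rename(ROOT, "Staging", "Master")
print(resolve("Master\\Oscars\\Movie.mp4"))
```

The file keeps the same GUID before and after the rename, and the number of entries changed is one regardless of how many files sit under the directory.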
Scale Out & Load Balancing
[Diagram: nodes hosting Namespace Processors, each owning a GUID range]
- NP1: GUID 1 to 100
- NP2: GUID 101 to 200
- NP3: GUID 201 to 300
- NP4: GUID 301 to 400
- NP5: GUID 401 to 500
- NP6: GUID 501 to 1000 (overlaid in the diagram with NP6: GUID 501 to 750)

- A Scale Unit contains hundreds of nodes
- Each node has many Namespace Processors (NP)
- Each NP manages a portion of the namespace (GUID range for each
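The range assignment above can be sketched as a sorted list of upper bounds with a binary search to find the owning NP. Load balancing then amounts to splitting a range, as the NP6 overlay (501 to 1000 becoming 501 to 750) suggests. This is an illustration only: GUIDs are treated as small integers as on the slide, and `NP7` is a hypothetical name for whichever NP takes over the upper half.

```python
import bisect

# (inclusive upper bound of the GUID range, owning NP), sorted by upper bound.
ranges = [(100, "NP1"), (200, "NP2"), (300, "NP3"),
          (400, "NP4"), (500, "NP5"), (1000, "NP6")]

def owner(guid: int) -> str:
    """Find the NP whose range contains guid."""
    uppers = [u for u, _ in ranges]
    i = bisect.bisect_left(uppers, guid)
    if guid < 1 or i == len(ranges):
        raise KeyError(guid)
    return ranges[i][1]

def split(np_name: str, new_upper: int, new_np: str) -> None:
    """Shrink np_name's range to end at new_upper and hand the rest to new_np."""
    i = next(k for k, (_, n) in enumerate(ranges) if n == np_name)
    old_upper = ranges[i][0]
    lower = ranges[i - 1][0] + 1 if i > 0 else 1
    if not lower <= new_upper < old_upper:
        raise ValueError("split point outside the NP's range")
    ranges[i:i + 1] = [(new_upper, np_name), (old_upper, new_np)]

print(owner(800))
split("NP6", 750, "NP7")   # NP7 is a hypothetical new owner of 751 to 1000
print(owner(800))
print(owner(700))
```

Only the routing table changes in a split; the GUIDs themselves stay put, which is consistent with the later claim that load balancing is fast.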
Transaction Processing and Caching
[Diagram: the namespace tree (GUID1 “/”, GUID2 path1, GUID3 path2, GUID4 path3, file1, file2, file3) spread across Namespace Processors NP1 to NP9, illustrated with the paths / and /path2/file1]

ID           Name    Owning NP
GUID-ROOT    /       NP1
GUID-PATH2   path2   NP7
GUID-BLOB4   file1   NP6

Parent ID    Name    File   Child ID
GUID-ROOT    path2   N      GUID-PATH2
GUID-PATH2   file1   Y      GUID-BLOB4

High Throughput
[Diagram: Movie.mp4 (50 GB) flowing through FE 2, Partition 3 (F-J), Stream 2]
- A single object/file contains multiple blocks
- The block range is partitioned uniformly across partitions
- A single write can potentially be served by all partition nodes
- Support 100s of Gbps of Ingress/Egress for a single account or to a single file
- 2 layers of caching to enable high throughput read performance
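To make "partitioned uniformly" concrete, the sketch below splits a file's block indices into contiguous, near-equal ranges, one per partition. The slide does not say how blocks are actually assigned, so contiguous ranges, the 100 MiB block size and the count of 8 partitions are all assumptions chosen for illustration.

```python
def partition_blocks(num_blocks: int, num_partitions: int) -> list[range]:
    """Split block indices 0..num_blocks-1 into contiguous, near-equal ranges."""
    base, extra = divmod(num_blocks, num_partitions)
    out, start = [], 0
    for p in range(num_partitions):
        size = base + (1 if p < extra else 0)   # first `extra` partitions get one more
        out.append(range(start, start + size))
        start += size
    return out

FILE_SIZE = 50 * 1024**3      # Movie.mp4 (50 GB) from the slide, taken as GiB
BLOCK_SIZE = 100 * 1024**2    # assumed block size
num_blocks = -(-FILE_SIZE // BLOCK_SIZE)   # ceiling division
parts = partition_blocks(num_blocks, 8)    # 8 partitions is an assumption
print(num_blocks, [len(r) for r in parts])
```

With every partition holding an equal share of the blocks, reads and writes against one large file can fan out across all of them instead of queueing on one.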
Performance & Scale Implications
- Hierarchical Namespace is only in the path of namespace traversal and metadata operations
- Data reads and writes don’t go through Hierarchical Namespace
- Hierarchical Namespace leverages SSD for persistence and Memory for Caching to minimize latency overhead
- Separation of Distributed Cache and Persistent State (Partition) layers is critical
- Load Balancing is very efficient and fast
- Leverage Partition Layer; distinct partitioning for Blobs and HNS
- While Distributed Transactions are more expensive, they are less frequent
Opportunities
- Snapshots at any level of the hierarchy
- Time travel operations with E2E built-in transaction timestamps
- Support a wide variety of File Systems
- Interop across all
- Zero data copying
- In-Place upgrade from Flat -> Hierarchical Namespace
- Cross-Entity Strongly Consistent Reads
- High-Fidelity On-Prem->Cloud Migration/Hybrid
Q & A