Building File System Semantics for an Exabyte Scale Object Storage System
Shane Mainali, Raji Easwaran
Microsoft
2019 Storage Developer Conference. © Microsoft. All Rights Reserved.

Agenda
- Analytics Workloads (Access patterns & challenges)
- Azure Data Lake Storage overview
- Under the hood
- Q&A

Analytics Workloads
Access Patterns and Challenges

Analytics Workload Pattern
[Diagram: data pipeline with the stages INGEST, STORE, EXPLORE, PREP & TRAIN, and MODEL & SERVE. Sources: Sensors and IoT (unstructured), Logs (unstructured), Media (unstructured), Files (unstructured), Business/custom apps (structured). Store: Azure Data Lake Storage Gen2. Services shown: Azure Data Factory, Azure Databricks, Azure Data Explorer, Azure SQL Data Warehouse, Cosmos DB, Azure Analysis Services, Power BI, Real-time Apps.]

Challenges
- Containers are mounted as filesystems on Analytics Engines like Hadoop and Databricks
- Client-side file system emulation impacts performance, semantics, and correctness
- Directory operations are expensive
- Coarse-grained Access Control
- Throughput is critical for Big Data

Storage for Analytics - Goals
- Address shortcomings of client-side design
- First-class hierarchical namespace
- Interoperability with Object Storage (Blobs)
- Object-level ACLs (POSIX)
- Platform for future filesystem-based protocols (e.g. NFS)

Azure Data Lake Storage
File System Semantics on Object Storage

Hierarchical Namespace (HNS)
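To make the "Directory operations are expensive" challenge concrete, here is a minimal sketch of what a client has to do to emulate a directory rename when the store only knows flat keys. The in-memory dict and the function name `rename_prefix_flat` are illustrative assumptions, not part of any Azure API; the point is only that the work grows with the number of objects under the prefix.

```python
def rename_prefix_flat(store: dict[str, bytes], old: str, new: str) -> int:
    """Emulate a directory rename over a flat key space; return keys touched."""
    # Snapshot the matching keys first so the dict is not mutated while iterating.
    matches = [key for key in store if key.startswith(old)]
    for key in matches:
        # Each object is moved individually: one operation per key under the prefix.
        store[new + key[len(old):]] = store.pop(key)
    return len(matches)


# Hypothetical flat store keyed by full path, as in the flat namespace picture.
store = {
    "/foo/bar/file.txt": b"a",
    "/foo/bar/other.txt": b"b",
    "/foo/baz.txt": b"c",
}
touched = rename_prefix_flat(store, "/foo/bar/", "/foo/qux/")
```

In this toy model the cost is linear in the number of keys under the prefix, and nothing makes the sequence of per-key moves atomic.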

Azure Data Lake Storage Architecture
[Diagram: two APIs over one foundation.
Blob API: Unstructured Object Data (Server Backups, Archive Storage, Semi-structured Data).
ADLS Gen2 API: File Data (Hadoop File System, File and Folder Hierarchy, Granular ACLs, Atomic File Transactions).
Common Blob Storage Foundation: Object Tiering and Lifecycle Policy Management; AAD Integration, RBAC, Storage Account Security; HA/DR support through ZRS and RA-GRS.]

Blobs and Flat Namespace
[Diagram: Blob API > Flat Namespace (/foo/bar/file.txt) > Data]

Files and Folders in HNS
[Diagram: Blob API and ADLS API > Hierarchical Namespace (foo > bar > baz.txt) > Data]

Mapping the concepts
- Same Storage account
- URIs are the same except for the endpoint
  ADLS: http://account.dfs.core.windows.net/container/videos/movie.mp4
  Blob: http://account.blob.core.windows.net/container/videos/movie.mp4
- Filesystem == Container
  - Create File System and Create Container APIs do the same thing
  - Exactly the same metadata and objects under the covers
- Directory ~= Blob
  - Directories are first class entities; both implicit and explicit creation supported
  - Implicit creation when blobs are created
  - ACLs and Leases obeyed by both
- File == Blob
  - ADLS Gen 2 adds Append and Flush semantics
  - Existing Blob semantics supported as is
  - ACLs and Leases obeyed by both
[Diagram: ADLS hierarchy Account > File System > Directory > File, next to Blob hierarchy Account > Container > Blob]

API Interoperability
- Can use Blob or ADLS Gen 2 APIs to access the same data
- Existing Blob applications work without code changes and with no data movement on the Data Lake account
[Diagram: ADLS hierarchy Account > File System > Directory > File, next to Blob hierarchy Account > Container > Blob]

Under the Hood
Designing for Performance, Scale & Throughput
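The endpoint rule from "Mapping the concepts" (same URI, only the `dfs` or `blob` label differs) can be sketched as a small helper. The function name `swap_endpoint` is an illustrative assumption, and the sketch only handles hosts shaped like the two example URIs on the slide.

```python
from urllib.parse import urlsplit, urlunsplit


def swap_endpoint(uri: str, target: str) -> str:
    """Return uri with its service label ('dfs' or 'blob') replaced by target."""
    if target not in ("dfs", "blob"):
        raise ValueError(f"unsupported endpoint: {target}")
    parts = urlsplit(uri)
    labels = parts.netloc.split(".")
    # Expect hosts of the form <account>.<dfs|blob>.core.windows.net, as on the slide.
    if len(labels) < 2 or labels[1] not in ("dfs", "blob"):
        raise ValueError(f"unexpected host: {parts.netloc}")
    labels[1] = target
    return urlunsplit(parts._replace(netloc=".".join(labels)))


adls_uri = "http://account.dfs.core.windows.net/container/videos/movie.mp4"
blob_uri = swap_endpoint(adls_uri, "blob")
```

Container and path segments are left untouched, which mirrors the slide's point that a file system is a container and a file is a blob.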

Blob Storage Architecture
- Front End (FE): Stateless, front door for request handling (auth, request metering/throttling, validation)
- Partition Layer: Serves data in key-value fashion based on partitions, enables batch transactions and strong consistency
- Stream Layer: Stores multiple replicas of the data, deals with failures, bit rot, etc.
[Diagram: a request passes through FE 2, Partition 3 (F-J), and Stream 2]

Blob Storage with HNS Architecture
- Front End (FE): Stateless, front door for request handling (auth, request metering/throttling, validation)
- Hierarchical Namespace: Serves metadata based on partitions, including file names, directory structure and ACLs
- Partition Layer: Serves data in key-value fashion based on partitions, enables batch transactions and strong consistency
- Stream Layer: Stores multiple replicas of the data, deals with the media/devices, handles failures, bit rot, etc.
[Diagram: a request passes through FE 2, Hierarchical Namespace, Partition 3 (G3-G4), and Stream 2]

Hierarchical Namespace Topology
GUID mappings:
  GUID1 <=> "/"
  GUID1, GUID2 <=> "path1"
  GUID1, GUID3 <=> "path2"
  GUID3, GUID4 <=> "path3"
Parent to child edges:
  GUID1 -> GUID2
  GUID1 -> GUID3
  GUID3 -> GUID4
Paths represented:
  /
  /path1/
  /path2/
  /path2/file1
  /path2/file2
  /path2/path3/
  /path2/path3/file3
[Diagram: tree with GUID1 (/) at the root; children GUID2 (path1) and GUID3 (path2); GUID3 holds file1, file2, and GUID4 (path3); GUID4 holds file3]

Partition Layer Schema
Key layout: Partition Key = Parent ID; Row Key = Name, CT; Columns = Del, File, Metadata, Child ID.
Row numbers missing from the extracted slide are left blank.

 #   Parent ID    Name    CT     Del  File  Metadata  Child ID
 1   GUID-ROOT    .       00001  N    Y     …         GUID-BLOB1
 2   GUID-ROOT    path1   00100  N    N     …         GUID-PATH1
 3   GUID-ROOT    path2   00200  N    N     …         GUID-PATH2
 4   GUID-PATH1   .       00100  N    Y     …         GUID-BLOB2
 6   GUID-PATH2   .       00200  N    N     …         GUID-BLOB3
     GUID-PATH2   file1   00300  N    Y     …         GUID-BLOB4
 7   GUID-PATH2   file1   00350  N    Y     …         GUID-BLOB4
 8   GUID-PATH2   file2   00400  N    Y     …         GUID-BLOB5
     GUID-PATH2   path3   00400  N    N     …         GUID-PATH3
 10  GUID-PATH3   .       00400  N    Y     …         GUID-BLOB6
 11  GUID-PATH3   file3   00401  N    N     …         GUID-BLOB7

Example key: Account;FileSystem;GUID-ROOT  path1  00100
Paths represented: /, /path1/, /path2/, /path2/file1, /path2/file2, /path2/path3/, /path2/path3/file3

Hierarchical Namespace Flow: Create File
Path: Staging\Oscars\Movie.mp4

 Parent   Name    Label
 <Null>   Guid1   Staging
 Guid1    Guid2   Oscars
 Guid2    Guid3   Movie.mp4

[Diagram: the request passes through FE 2, Hierarchical Namespace, Partition 3 (F-J), and Stream 2]

Hierarchical Namespace Flow: Rename Directory
Path: Master\Oscars\Movie.mp4

 Parent   Name    Label
 <Null>   Guid1   Master (was Staging)
 Guid1    Guid2   Oscars
 Guid2    Guid3   Movie.mp4

[Diagram: the request passes through FE 2, Hierarchical Namespace, Partition 3 (F-J), and Stream 2]

Scale Out & Load Balancing
- A Scale Unit contains hundreds of nodes
- Each node has many Namespace Processors (NP)
- Each NP manages a portion of the namespace (GUID range for each <Account, FileSystem>)
- Hot nodes are load balanced with other nodes in the Scale Unit by splitting managed GUID ranges among NPs
- An Azure region contains several Scale Units
- When a majority of nodes in a Scale Unit become hot, load balancing occurs across Scale Units
[Diagram: Scale Unit 1 and Scale Unit 2, each a grid of nodes. One node's Namespace Processors: NP1: GUID 1-100, NP2: GUID 101-200, NP3: GUID 201-300, NP4: GUID 301-400, NP5: GUID 401-500, NP6: GUID 501-1000. A second node's Namespace Processors: NP1: GUID 1001-1100, NP2: GUID 1101-1200, NP3: GUID 1201-1300, NP4: GUID 1301-1400. The range GUID 501-1000 is split: NP6 keeps GUID 501-750 and an NP5 on the second node takes GUID 751-1000.]
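The Create File and Rename Directory flows above can be sketched with a toy directory table keyed by (parent ID, name), loosely modeled on the Partition Layer Schema slide. The class `Namespace`, its method names, and the use of random UUIDs are assumptions for illustration only; the real service's schema carries more columns (CT, Del, metadata) and is not shown here. The sketch uses forward slashes where the slide uses backslashes.

```python
import uuid


class Namespace:
    """Toy directory table: (parent_id, name) -> (child_id, is_file)."""

    def __init__(self):
        self.root = "GUID-ROOT"
        self.rows = {}

    def create_file(self, path):
        """Create a file, implicitly creating any missing parent directories."""
        *dirs, leaf = path.strip("/").split("/")
        parent = self.root
        for name in dirs:
            row = self.rows.get((parent, name))
            if row is None:
                row = (str(uuid.uuid4()), False)  # new directory row
                self.rows[(parent, name)] = row
            parent = row[0]
        file_id = str(uuid.uuid4())
        self.rows[(parent, leaf)] = (file_id, True)
        return file_id

    def resolve(self, path):
        """Walk the path one component at a time and return the final ID."""
        current = self.root
        for name in (p for p in path.strip("/").split("/") if p):
            row = self.rows.get((current, name))
            if row is None:
                raise FileNotFoundError(path)
            current = row[0]
        return current

    def rename(self, parent_path, old, new):
        """Rename one entry: a single row is re-keyed; descendants are untouched."""
        parent = self.resolve(parent_path)
        self.rows[(parent, new)] = self.rows.pop((parent, old))


ns = Namespace()
movie_id = ns.create_file("/Staging/Oscars/Movie.mp4")
ns.rename("/", "Staging", "Master")
```

Because children point at their parent's ID rather than at its name, the rename touches one row regardless of how many files sit below the directory, which is the property the Rename Directory slide illustrates.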

Transaction Processing and Caching
Example path: /path2/file1, resolved as / > path2 > file1

 ID           Name    Owning NP
 GUID-ROOT    /       NP1
 GUID-PATH2   path2   NP7
 GUID-BLOB4   file1   NP6

 Parent ID    Name    File  Child ID
 GUID-ROOT    path2   N     GUID-PATH2
 GUID-PATH2   file1   Y     GUID-BLOB4

[Diagram: the tree GUID1 (/) > GUID2 (path1), GUID3 (path2) > file1, file2, GUID4 (path3) > file3, spread across Namespace Processors NP1 to NP9]

High Throughput
- A single object/file contains multiple blocks
- The block range is partitioned uniformly across partitions
- A single write can potentially be served by all partition nodes
- Support 100s of Gbps of Ingress/Egress for a single account or to a single file
- 2 layers of caching to enable high throughput read performance
[Diagram: Movie.mp4 (50 GB) passes through FE 2, Partition 3 (F-J), and Stream 2]

Performance & Scale Implications
- Hierarchical Namespace is only in the path of namespace traversal and metadata operations
- Data reads and writes don't go through Hierarchical Namespace
- Hierarchical Namespace leverages SSD for persistence and Memory for Caching to minimize latency overhead
- Separation of Distributed Cache and Persistent State (Partition) layers is critical
- Load Balancing is very efficient and fast
- Leverage Partition Layer; distinct partitioning for Blobs and HNS
- While Distributed Transactions are more expensive, they are less frequent

Opportunities
- Snapshots at any level of the hierarchy
- Time travel operations with E2E built-in transaction timestamps
- Support a wide variety of File Systems
- Interop across all
- Zero data copying
- In-Place upgrade from Flat -> Hierarchical Namespace
- Cross-Entity Strongly Consistent Reads
- High-Fidelity On-Prem->Cloud Migration/Hybrid

Q & A
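As a supplementary sketch for the High Throughput slide, the idea that a file's block range is spread uniformly across partitions can be illustrated by hashing each (file, block index) pair to a partition. The hash-based assignment, the function name `partition_for_block`, the partition count of 8, and the 12,800-block file (roughly 50 GB at an assumed 4 MiB per block) are all assumptions; the talk does not say how blocks are actually assigned.

```python
import hashlib
from collections import Counter


def partition_for_block(file_id: str, block_index: int, num_partitions: int) -> int:
    """Map one block of a file to a partition using a stable hash (assumed scheme)."""
    key = f"{file_id}:{block_index}".encode()
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce to a partition index.
    return int.from_bytes(digest[:8], "big") % num_partitions


# Hypothetical numbers: 12,800 blocks spread over 8 partitions.
NUM_BLOCKS = 12_800
NUM_PARTITIONS = 8
counts = Counter(
    partition_for_block("Movie.mp4", i, NUM_PARTITIONS) for i in range(NUM_BLOCKS)
)
```

With blocks of a single file landing on every partition, reads and writes to that file can be served by many partition nodes at once, which is the effect the slide attributes to uniform block partitioning.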