Building File System Semantics for an Exabyte Scale Object Storage System

Shane Mainali and Raji Easwaran, Microsoft
2019 Storage Developer Conference. © Microsoft. All Rights Reserved.

Agenda
- Analytics workloads (access patterns and challenges)
- Azure Data Lake Storage overview
- Under the hood
- Q&A

Analytics Workloads: Access Patterns and Challenges

Analytics Workload Pattern
- INGEST: sensors and IoT, real-time apps, logs, media, and files (all unstructured), plus business/custom apps (structured), ingested via Azure Data Factory
- EXPLORE: Azure Databricks, Azure Data Explorer
- PREP & TRAIN: Azure Databricks
- MODEL & SERVE: Cosmos DB, SQL Data Warehouse, Azure SQL Data Warehouse, Azure Analysis Services, Power BI
- STORE: Azure Data Lake Storage Gen2

Challenges
- Containers are mounted as file systems on analytics engines such as Hadoop and Databricks
- Client-side file system emulation impacts performance, semantics, and correctness
- Directory operations are expensive
- Access control is coarse-grained
- Throughput is critical for big data

Storage for Analytics: Goals
- Address the shortcomings of the client-side design
- First-class hierarchical namespace
- Interoperability with object storage (blobs)
- Object-level ACLs (POSIX)
- Platform for future filesystem-based protocols (e.g. NFS)

Azure Data Lake Storage: File System Semantics on Object Storage

Hierarchical Namespace (HNS)
Azure Data Lake Storage Architecture
- Blob API: unstructured object data (file server backups, archive storage, semi-structured data)
- ADLS Gen2 API: Hadoop File System data (file and folder hierarchy, granular ACLs, atomic transactions)
- Both sit on a common Blob Storage foundation: object tiering and lifecycle policy, AAD integration, RBAC, storage account management and security, HA/DR support through ZRS and RA-GRS

Blobs and Flat Namespace
- With the Blob API and a flat namespace, "/foo/bar/file.txt" is just a blob name; the path separators carry no structure

Files and Folders in HNS
- With the hierarchical namespace, the same kind of path is a real tree: directory "foo" contains directory "bar", which contains file "baz.txt"

Mapping the Concepts
- Same storage account; URIs are identical except for the endpoint:
  http://account.dfs.core.windows.net/container/videos/movie.mp4 (ADLS)
  http://account.blob.core.windows.net/container/videos/movie.mp4 (Blob)
- File System == Container: the Create File System and Create Container APIs do the same thing; exactly the same metadata and objects under the covers
- Directory ~= Blob: directories are first-class entities; both implicit and explicit creation are supported (implicit creation when blobs are created); ACLs and leases are obeyed by both APIs
- File == Blob: ADLS Gen2 adds Append and Flush semantics; existing Blob semantics are supported as-is; ACLs and leases are obeyed by both APIs

API Interoperability
- Either the Blob or the ADLS Gen2 API can be used to access the same data
- Existing Blob applications work against a Data Lake account without code changes and with no data movement

Under the Hood: Designing for Performance, Scale, and Throughput
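The endpoint mapping above can be sketched as a tiny helper: the two APIs address the same object in the same account, and only the endpoint host differs. This function is illustrative only, not part of any Azure SDK.

```python
from urllib.parse import urlparse, urlunparse

def dfs_to_blob_uri(dfs_uri: str) -> str:
    """Map an ADLS Gen2 (dfs) URI to the equivalent Blob URI.

    Both URIs name the same underlying object; only the endpoint
    ("dfs" vs "blob") selects which API front end handles the request.
    """
    parts = urlparse(dfs_uri)
    host = parts.netloc.replace(".dfs.core.windows.net",
                                ".blob.core.windows.net")
    return urlunparse(parts._replace(netloc=host))

print(dfs_to_blob_uri(
    "http://account.dfs.core.windows.net/container/videos/movie.mp4"))
# http://account.blob.core.windows.net/container/videos/movie.mp4
```

Because the mapping is purely an endpoint swap, no data movement is involved in switching APIs.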
Blob Storage Architecture
- Front End: stateless front door for request handling (auth, request metering/throttling, validation)
- Partition Layer: serves data in key-value fashion based on partitions; enables batch transactions and strong consistency
- Stream Layer: stores multiple replicas of the data; deals with the media/devices; handles failures, bit rot, etc.

Blob Storage with HNS Architecture
- The same three layers, plus a Hierarchical Namespace layer between the Front End and the Partition Layer
- Hierarchical Namespace: serves metadata based on partitions, including file names, directory structure, and ACLs

Hierarchical Namespace Topology
- Every directory and file is identified by a GUID; the namespace is a set of parent-to-child GUID edges with names attached
- Example: GUID1 <=> "/"; (GUID1, GUID2) <=> "path1"; (GUID1, GUID3) <=> "path2"; (GUID3, GUID4) <=> "path3"
- These edges resolve the paths /path1/, /path2/file1, /path2/file2, /path2/path3/, and /path2/path3/file3

Partition Layer Schema
- Partition Key: Account;FileSystem;Parent ID (e.g. Account;FileSystem;GUID-ROOT). Row Key: Name. Columns include commit time (CT), delete flag (Del), file flag, metadata, and Child ID

  #   Parent ID   Name    CT     Del  File  Metadata  Child ID
  1   GUID-ROOT   .       00001  N    Y     ...       GUID-BLOB1
  2   GUID-ROOT   path1   00100  N    N     ...       GUID-PATH1
  3   GUID-ROOT   path2   00200  N    N     ...       GUID-PATH2
  4   GUID-PATH1  .       00100  N    Y     ...       GUID-BLOB2
  6   GUID-PATH2  .       00200  N    N     ...       GUID-BLOB3
      GUID-PATH2  file1   00300  N    Y     ...       GUID-BLOB4
  7   GUID-PATH2  file1   00350  N    Y     ...       GUID-BLOB4
  8   GUID-PATH2  file2   00400  N    Y     ...       GUID-BLOB5
      GUID-PATH2  path3   00400  N    N     ...       GUID-PATH3
  10  GUID-PATH3  .       00400  N    Y     ...       GUID-BLOB6
  11  GUID-PATH3  file3   00401  N    N     ...       GUID-BLOB7

Hierarchical Namespace Flow: Create File
- Creating Staging\Oscars\Movie.mp4 resolves each path component through the namespace table, one (Parent, Name) lookup per component:
  Parent  Name   Label
  <Null>  Guid1  Staging
  Guid1   Guid2  Oscars
  Guid2   Guid3  Movie.mp4

Hierarchical Namespace Flow: Rename Directory
- Renaming Staging to Master (so the file is now Master\Oscars\Movie.mp4) rewrites a single namespace row; the children keep their GUIDs and no data moves:
  Parent  Name   Label
  <Null>  Guid1  Master
  Guid1   Guid2  Oscars
  Guid2   Guid3  Movie.mp4

Scale Out and Load Balancing
- An Azure region contains several Scale Units; a Scale Unit contains hundreds of nodes
- Each node hosts many Namespace Processors (NPs); each NP manages a portion of the namespace (a GUID range for each <Account, FileSystem>)
- Hot nodes are load balanced against other nodes in the Scale Unit by splitting managed GUID ranges among NPs (e.g. NP6's range 501-1000 splits into 501-750 and 751-1000)
- When a majority of nodes in a Scale Unit become hot, load balancing occurs across Scale Units
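The create and rename flows above can be sketched with a toy namespace table keyed by (parent GUID, name). The class and method names are illustrative, not the actual service code; the point is that renaming a directory rewrites one row, while every blob under it keeps its GUID.

```python
import uuid

class Namespace:
    """Toy hierarchical namespace: rows keyed by (parent GUID, name)."""

    def __init__(self):
        self.root = uuid.uuid4()            # GUID for "/"
        self.rows = {}                      # (parent_guid, name) -> child_guid

    def create(self, path: str) -> uuid.UUID:
        """Create any missing components of the path; return the leaf GUID."""
        parent = self.root
        for name in path.strip("/").split("/"):
            key = (parent, name)
            if key not in self.rows:
                self.rows[key] = uuid.uuid4()
            parent = self.rows[key]
        return parent

    def resolve(self, path: str) -> uuid.UUID:
        """Walk the table, one (parent, name) lookup per path component."""
        parent = self.root
        for name in path.strip("/").split("/"):
            parent = self.rows[(parent, name)]
        return parent

    def rename(self, parent_path: str, old: str, new: str) -> None:
        """Rename = rewrite one row; children keep their GUIDs, data never moves."""
        parent = self.resolve(parent_path) if parent_path else self.root
        self.rows[(parent, new)] = self.rows.pop((parent, old))

ns = Namespace()
file_guid = ns.create("Staging/Oscars/Movie.mp4")
ns.rename("", "Staging", "Master")
assert ns.resolve("Master/Oscars/Movie.mp4") == file_guid  # same object, new path
```

Contrast this with a flat namespace, where renaming a "directory" means rewriting the name of every blob whose key starts with that prefix.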
Transaction Processing and Caching
- Resolving a path such as /path2/file1 is a distributed transaction: each component can be owned by a different Namespace Processor, and each NP caches the entries it owns
  Owning ID   Name   NP
  GUID-ROOT   /      NP1
  GUID-PATH2  path2  NP7
  GUID-BLOB4  file1  NP6

  Parent ID   Name   File  Child ID
  GUID-ROOT   path2  N     GUID-PATH2
  GUID-PATH2  file1  Y     GUID-BLOB4

High Throughput
- A single object/file (e.g. a 50 GB Movie.mp4) contains multiple blocks
- The block range is partitioned uniformly across partitions, so a single write can potentially be served by all partition nodes
- Supports 100s of Gbps of ingress/egress for a single account, or to a single file
- Two layers of caching enable high-throughput read performance

Performance and Scale Implications
- The Hierarchical Namespace is only in the path of namespace traversal and metadata operations; data reads and writes don't go through it
- The Hierarchical Namespace leverages SSD for persistence and memory for caching to minimize latency overhead
- Separating the distributed cache from the persistent state (Partition) layer is critical
- Load balancing is very efficient and fast
- The design leverages the Partition Layer, with distinct partitioning for blobs and HNS
- Distributed transactions are more expensive, but they are less frequent

Opportunities
- Snapshots at any level of the hierarchy
- Time-travel operations with end-to-end built-in transaction timestamps
- Support for a wide variety of file systems, with interop across all and zero data copying
- In-place upgrade from flat to hierarchical namespace
- Cross-entity strongly consistent reads
- High-fidelity on-prem-to-cloud migration and hybrid scenarios

Q & A
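The "single write served by all partition nodes" claim rests on spreading a file's block range uniformly across partitions. The sketch below uses a simple round-robin placement to illustrate the idea; the block size, partition count, and round-robin rule are all made-up values for illustration, since the talk only states that the block range is partitioned uniformly.

```python
BLOCK_SIZE = 4 * 1024 * 1024   # assumed block size, illustrative only
NUM_PARTITIONS = 8             # assumed partition count, illustrative only

def partition_for_offset(offset: int) -> int:
    """Map a byte offset in a file to the partition serving its block.

    Because consecutive block ranges land on different partitions,
    a large sequential read or write fans out across all partition
    nodes instead of hammering a single one.
    """
    block_index = offset // BLOCK_SIZE
    return block_index % NUM_PARTITIONS

# A 64 MiB sequential write (16 blocks) touches every partition:
touched = {partition_for_offset(off)
           for off in range(0, 64 * 1024 * 1024, BLOCK_SIZE)}
assert touched == set(range(NUM_PARTITIONS))
```

This fan-out, combined with the two caching layers, is what lets a single account or even a single file sustain hundreds of Gbps.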
