Zoned Namespaces Use Cases, Standard and Ecosystem

SDC SNIA EMEA 20 | Javier González – Principal Software Engineer / R&D Center Lead

Why do we need a new interface?

SSDs are already mainstream:
• Great performance (bandwidth / latency) – combination of NAND + NVMe
• Easy to deploy – direct replacement for HDDs
• Acceptable price ($/GB)

But we have three recurrent problems:
1. Log-on-log problem (WAF + OP) → remove redundancies, leverage log data structures, reduce OP and WAF
2. Cost gap with HDDs → higher bit count (QLC), reduce DRAM
3. Multi-tenancy everywhere → address noisy neighbors, provide QoS

[Figure: the log-on-log stack. Log-structured applications (e.g., RocksDB) implement address mapping (L2L), data placement, garbage collection, and metadata management in user space. Log-structured file systems (e.g., F2FS, XFS, ZFS) repeat the same functions, plus I/O scheduling, in kernel space under the VFS and block layer. The host then issues reads / writes / trims / hints to a traditional SSD whose log-structured FTL implements address mapping (L2P), media management, I/O scheduling, data placement, garbage collection, and metadata management a third time.]

[1] J. Yang, N. Plasson, G. Gillis, N. Talagala, and S. Sundararaman. Don't Stack Your Log on My Log. 2nd Workshop on Interactions of NVM/Flash with Operating Systems and Workloads (INFLOW), 2014.
[2] https://blocksandfiles.com/2019/08/28/nearline-disk-drives-ssd-attack/

1 ZNS: Basic Concepts

Divide the LBA address space into fixed-size zones
• Write constraints
  - Writes must be strictly sequential within a zone (write pointer)
  - The write pointer is reset explicitly by the host
• No zone mapping constraints
  - The physical location of a zone is transparent to the host
  - Open design space for different device-internal configurations

Zone state machine
• Zone transitions managed by the host
  - Through direct commands or by crossing zone boundaries
• Zone transitions triggered by the device
  - Notified to the host through an exception
• States: Empty (ZSE), Implicitly Opened (ZSIO), Explicitly Opened (ZSEO), Closed (ZSC), Full (ZSF), Read Only (ZSRO), Offline (ZSO); open and active zones are bounded by the device's Open Resources and Active Resources limits

[Figure: zone mapping in the device — zones 0..n-1 can be laid out across dies 0..k-1 in different configurations (Config. A ... Config. Z). LBA mapping in a zone — LBAs ZSLBA .. ZSLBA+ZCAP-1, strict sequential writes at the write pointer, writing again only after a reset.]
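A minimal sketch (not from the talk) of the write-pointer rule as host software might model it: writes must land exactly at the write pointer, and a reset rewinds the pointer to the zone start (ZSLBA). The struct and values are illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct zone_model {
    uint64_t zslba;   /* zone start LBA */
    uint64_t zcap;    /* writable capacity in LBAs */
    uint64_t wp;      /* current write pointer */
};

/* A write is valid only if it starts at the write pointer and fits in the zone. */
static bool zone_write(struct zone_model *z, uint64_t slba, uint32_t nlb)
{
    if (slba != z->wp || z->wp + nlb > z->zslba + z->zcap)
        return false;          /* the device would reject this write */
    z->wp += nlb;
    return true;
}

/* Reset makes the zone empty again and rewinds the write pointer. */
static void zone_reset(struct zone_model *z)
{
    z->wp = z->zslba;
}

int main(void)
{
    struct zone_model z = { .zslba = 0, .zcap = 1024, .wp = 0 };
    printf("in-order write ok:     %d\n", zone_write(&z, 0, 8));   /* 1 */
    printf("out-of-order write ok: %d\n", zone_write(&z, 64, 8));  /* 0: not at WP */
    zone_reset(&z);
    printf("write after reset ok:  %d\n", zone_write(&z, 0, 8));   /* 1 */
    return 0;
}
```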

2 ZNS: Advanced Concepts

Append Command
• Implementation of nameless writes [1]
• Benefits
  - Increases the queue depth usable against a single zone, improving parallelism
• Host changes
  - Changes needed to the block allocator in the application / file system (the device, not the host, chooses the LBA)
  - Full reimplementation of LBA-based features in the submission path (e.g., snapshotting)

Write path vs. append path
• Write path requires QD=1 per zone
  - Reminder: NVMe gives no ordering guarantees across commands
• Append path allows QD <= X, where X = #LBAs in the zone
  - The host submits appends against the zone start LBA (ZSLBA); the completion returns the LBA where the data landed

[Figure: with writes, commands are serialized at QD=1 at the write pointer; with appends, multiple commands sit in the submission queue and the write pointer advances as they complete, across zones 0..n-1 laid out over dies 0..k-1.]

[1] Y. Zhang, L. P. Arulraj, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. De-indirection for flash-based SSDs with nameless writes. FAST, 2012.
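An illustrative in-memory model (my own sketch, not a driver API) of the contrast above: a regular write must match the write pointer exactly and therefore serializes at QD=1, while an append only names the zone and learns the assigned LBA from the completion, so several appends to one zone can be in flight.

```c
#include <stdint.h>
#include <stdio.h>

struct zone { uint64_t zslba, zcap, wp; };

/* Regular write: rejected unless it starts exactly at the write pointer. */
static int zone_write(struct zone *z, uint64_t slba, uint32_t nlb)
{
    if (slba != z->wp || z->wp + nlb > z->zslba + z->zcap) return -1;
    z->wp += nlb;
    return 0;
}

/* Zone Append: the caller does not choose the LBA; the zone assigns the
 * current write pointer and reports it back, mimicking the LBA returned
 * in the append completion. */
static int zone_append(struct zone *z, uint32_t nlb, uint64_t *assigned_lba)
{
    if (z->wp + nlb > z->zslba + z->zcap) return -1;
    *assigned_lba = z->wp;
    z->wp += nlb;
    return 0;
}

int main(void)
{
    struct zone z = { .zslba = 0, .zcap = 1024, .wp = 0 };
    uint64_t lba;
    /* Several appends against the same zone: each completion tells the
     * block allocator where its data actually landed. */
    for (int i = 0; i < 4; i++)
        if (zone_append(&z, 8, &lba) == 0)
            printf("append %d landed at LBA %llu\n", i, (unsigned long long)lba);
    /* A write that does not match the write pointer is rejected. */
    printf("stale write accepted: %d\n", zone_write(&z, 0, 8) == 0);
    return 0;
}
```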

3 ZNS: Host Responsibilities

ZNS requires changes to the host
• First step: changes to file systems
  - Advantages for log-structured file systems (e.g., F2FS, XFS, ZFS)
  - Still best-effort from the application's perspective
  - An important first step for adoption
• Second step: changes to applications
  - Only makes sense for log-structured applications (e.g., LSM databases such as RocksDB)
  - Real benefits in terms of WAF, OP, and data movement (GC)
  - Need a way to minimize application changes! (Spoiler alert: xNVMe)

[Figure: division of responsibilities. A traditional SSD FTL implements sector mapping (L2P), media management, exception management, over-provisioning, I/O scheduling, garbage collection, and metadata management behind a Read / Write / Trim / Hints interface. With ZNS, the host takes over data placement, address mapping (L2L), garbage collection, metadata management, and exception handling, and talks to the device through Read / Write / Append / Reset / Commit I/O plus Zone Management admin commands; the ZNS SSD's zone FTL keeps zone mapping (L2P), media management, exception management, I/O scheduling, and zone metadata management.]
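To make the responsibility shift concrete, here is a minimal sketch (my own illustration, not the talk's code) of one of the jobs that moves to the host: tracking per-zone validity so the host, not the device, can pick garbage-collection victims and drive data placement.

```c
#include <stdint.h>
#include <stdio.h>

#define NZONES 8

struct host_zone {
    uint64_t written;   /* LBAs written so far (write pointer offset) */
    uint64_t valid;     /* LBAs still referenced by the host's L2L mapping */
};

/* Host-side GC victim selection: the zone with the least valid data costs
 * the least to relocate before it can be reset and reused. */
static int pick_gc_victim(const struct host_zone zones[], int n)
{
    int victim = -1;
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < n; i++) {
        if (zones[i].written == 0) continue;   /* empty zone: nothing to reclaim */
        if (zones[i].valid < best) { best = zones[i].valid; victim = i; }
    }
    return victim;
}

int main(void)
{
    struct host_zone zones[NZONES] = {
        [0] = { .written = 1024, .valid = 900 },
        [1] = { .written = 1024, .valid = 120 },   /* mostly stale: cheap victim */
        [2] = { .written =  512, .valid = 512 },
    };
    printf("GC victim: zone %d\n", pick_gc_victim(zones, NZONES));
    return 0;
}
```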

4 ZNS Extensions: Zone Random Write Area (ZRWA)

Concept
• Expose the write buffer in front of a zone to the host

Benefits
• Enables in-place updates within the ZRWA window
  - Reduces WAF due to metadata updates
  - Aligns with the metadata layout of many file systems
  - Simplifies recovery (no changes to the metadata layout)
• Allows writes at QD > 1
  - No changes needed to the host stack (as opposed to append)
  - ZRWA size is balanced against the expected performance

Operation
• Writes are placed into the ZRWA
  - No write pointer constraint inside the window
• In-place updates are allowed within the ZRWA window
• The ZRWA can be committed explicitly
  - Through a dedicated commit command
• The ZRWA can be committed implicitly
  - When writing beyond sizeof(ZRWA)

[Figure: each zone (zone 0 .. zone n-1 over LBAs 0 .. m-1) has a ZRWA in front of it; the host issues reads, writes, and commits, and either an explicit commit or writing past the window moves data from the ZRWA into the zone.]
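A sketch of the ZRWA behavior described above as an in-memory model (assumed, simplified semantics, not the exact NVMe wording): LBAs inside the window starting at the write pointer may be written and rewritten in any order, an explicit commit advances the write pointer, and writing past the window implicitly commits enough of it to make room.

```c
#include <stdint.h>
#include <stdio.h>

struct zrwa_zone {
    uint64_t zslba, zcap, wp;
    uint64_t zrwa_size;         /* window size in LBAs */
};

/* A write is accepted anywhere inside the current ZRWA window. */
static int zrwa_write(struct zrwa_zone *z, uint64_t slba, uint32_t nlb)
{
    uint64_t end = slba + nlb;
    if (slba < z->wp || end > z->zslba + z->zcap)
        return -1;
    if (end > z->wp + z->zrwa_size)
        z->wp = end - z->zrwa_size;   /* implicit commit: slide the window forward */
    return 0;
}

/* Explicit commit (a dedicated command in the spec): data up to 'upto'
 * becomes immutable and the write pointer advances past it. */
static int zrwa_commit(struct zrwa_zone *z, uint64_t upto)
{
    if (upto < z->wp || upto > z->wp + z->zrwa_size) return -1;
    z->wp = upto;
    return 0;
}

int main(void)
{
    struct zrwa_zone z = { .zslba = 0, .zcap = 1024, .wp = 0, .zrwa_size = 16 };
    printf("in-place write ok:  %d\n", zrwa_write(&z, 4, 4) == 0);  /* write ... */
    printf("overwrite ok:       %d\n", zrwa_write(&z, 4, 4) == 0);  /* ... then update in place */
    zrwa_commit(&z, 8);
    printf("write below wp ok:  %d\n", zrwa_write(&z, 4, 4) == 0);  /* rejected: already committed */
    return 0;
}
```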

5 ZNS Extensions: Simple Copy

Concept
• Offload the copy operation to the SSD
• "Simple" copy, because SCSI XCOPY became too complex

Benefits
• Data is moved by the SSD controller
  - No data movement across the interconnect (e.g., PCIe)
  - No CPU cycles spent on data movement
• Especially relevant for ZNS host-side GC

Operation
• The host forms and submits the Simple Copy command
  - Selects a list of source LBA ranges
  - Selects a destination as a single LBA + length
• The SSD moves the data from the sources to the destination
  - Error-model complexity depends on the scope

Scope
• Under discussion in NVMe
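A sketch of what the host assembles for a Simple Copy style operation (an illustrative layout, not the exact NVMe descriptor format): a bounded list of source LBA ranges plus a single destination start LBA. For ZNS host GC, the sources would be the still-valid extents of a victim zone and the destination the write pointer of a fresh zone.

```c
#include <stdint.h>
#include <stdio.h>

struct copy_range {
    uint64_t slba;   /* source start LBA */
    uint32_t nlb;    /* number of logical blocks */
};

struct simple_copy_cmd {
    struct copy_range src[8];   /* bounded list of source ranges */
    int nr_src;
    uint64_t dst_slba;          /* single destination start LBA */
};

static uint64_t total_blocks(const struct simple_copy_cmd *c)
{
    uint64_t n = 0;
    for (int i = 0; i < c->nr_src; i++) n += c->src[i].nlb;
    return n;
}

int main(void)
{
    /* Relocate two valid extents of a victim zone into a destination zone. */
    struct simple_copy_cmd c = {
        .src = { { .slba = 4096, .nlb = 64 }, { .slba = 4352, .nlb = 32 } },
        .nr_src = 2,
        .dst_slba = 1 << 20,
    };
    printf("copying %llu blocks to LBA %llu without moving data through the host\n",
           (unsigned long long)total_blocks(&c), (unsigned long long)c.dst_slba);
    return 0;
}
```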

6 ZNS: Archival Use Case

Goal: reduce TCO
• Reduce WAF
• Reduce OP
• Reduce DRAM usage
• Facilitate consumption of QLC

Configuration optimized for cold data
• Large zones
• Semi-immutable per-zone data (all or nothing)
• Minimize metadata and mapping tables
• Need for large QDs (Append or ZRWA)

Architecture properties
• Host GC → less WAF and OP
• Zone mapping table → less DRAM (break the 1 GB per 1 TB rule); a back-of-the-envelope calculation follows below

Adoption
• Tiering of cold storage
  - "Denmark cold": ZNS SSDs
  - "Finland cold": high-capacity SMR HDDs
  - "Mars cold": tape
• Leverage the SMR ecosystem
  - Build on top of the same abstractions
  - Expand the use cases

[Figure: one zone per application, targeting low GC. Large zones imply a large erase block, large I/O sizes, and few "streams"; the device keeps the zone mapping metadata, internal data placement, and scheduling across dies 0..k-1.]
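The DRAM saving mentioned above in numbers (my own arithmetic, with assumed entry sizes and an assumed 1 GiB archival zone): a page-level L2P table needs roughly one entry per 4 KiB of media, while a zone-level table needs one entry per zone.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const uint64_t capacity = 1ULL << 40;  /* 1 TiB of media */
    const uint64_t page     = 4096;        /* mapping granularity of a page-level FTL */
    const uint64_t zone     = 1ULL << 30;  /* assumed 1 GiB "large" archival zone */
    const uint64_t entry    = 4;           /* assumed bytes per mapping entry */

    uint64_t page_table = (capacity / page) * entry;   /* ~1 GiB of DRAM */
    uint64_t zone_table = (capacity / zone) * entry;   /* ~4 KiB */

    printf("page-level L2P table: %llu MiB\n", (unsigned long long)(page_table >> 20));
    printf("zone-level map table: %llu KiB\n", (unsigned long long)(zone_table >> 10));
    return 0;
}
```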

7 ZNS: I/O Predictability Use Case

Goal: support multi-tenant workloads
• Inherit the archival benefits: ↓WAF, ↓OP, ↓DRAM
• Support QoS policies
  - Provide isolation properties across zones
  - Enable host I/O scheduling
  - Allow more control over I/O submission / completion

Configuration optimized for multi-tenancy
• Zones grouped into "isolation domains"
• Host data placement / striping
• Host I/O scheduling

Inherit the archival architecture properties
• Host GC → less WAF and OP
• Zone mapping table → less DRAM (break the 1 GB per 1 TB rule)

[Figure: one isolation domain per application, with the host managing striping and targeting low GC. Small zones imply a small erase block, variable I/O sizes, and tens to hundreds of "streams"; the host keeps the zone mapping metadata and performs data placement and I/O scheduling across zones laid out over dies 0..k-1.]
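A sketch (my own illustration, with hypothetical names) of host data placement for an isolation domain: a tenant's writes are striped round-robin across the small zones assigned to that domain, so one tenant's traffic stays on its own zones.

```c
#include <stdint.h>
#include <stdio.h>

struct isolation_domain {
    const uint64_t *zone_slbas;  /* start LBAs of the zones owned by this tenant */
    int nr_zones;
    uint64_t next;               /* round-robin cursor */
};

/* Pick the zone that the next write for this tenant should target. */
static uint64_t place_write(struct isolation_domain *d)
{
    uint64_t zslba = d->zone_slbas[d->next % d->nr_zones];
    d->next++;
    return zslba;
}

int main(void)
{
    static const uint64_t tenant_a_zones[] = { 0, 65536, 131072, 196608 };
    struct isolation_domain a = { tenant_a_zones, 4, 0 };
    for (int i = 0; i < 6; i++)
        printf("tenant A write %d -> zone starting at LBA %llu\n",
               i, (unsigned long long)place_write(&a));
    return 0;
}
```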

8 Linux Stack: Zoned Block Framework

Purpose
• Built for SMR drives
• Adds support for write constraints
• Exposes zoned devices to applications through the block layer

Status
• Merged in 4.9
• Block layer / drivers
  - Support in the block layer
  - Support in SCSI / ATA
  - Support in null_blk
• File systems
  - F2FS
  - Btrfs (patches posted)
  - ZFS / XFS (ongoing)
• Libraries
  - libzbc
• Tools
  - util-linux, fio

[Figure: applications sit on top of a zoned-aware file system (e.g., F2FS, Btrfs, ZFS) or dm-zoned, or use libzbc directly; the zoned block device is exposed through the block layer and the SCSI / ATA drivers down to SAS ZBC / SATA ZAC HDDs.]
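A small user-space probe of the zoned block framework described above: it asks the kernel for a zone report via the BLKREPORTZONE ioctl from <linux/blkzoned.h>. It requires a kernel with zoned block support and a zoned device (the /dev/nullb0 path is just an example, e.g., null_blk in zoned mode).

```c
#include <fcntl.h>
#include <linux/blkzoned.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "/dev/nullb0";   /* example device */
    int fd = open(dev, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    unsigned int nr = 16;   /* ask for the first 16 zones */
    struct blk_zone_report *rep =
        calloc(1, sizeof(*rep) + nr * sizeof(struct blk_zone));
    if (!rep) return 1;
    rep->sector = 0;
    rep->nr_zones = nr;

    if (ioctl(fd, BLKREPORTZONE, rep) < 0) { perror("BLKREPORTZONE"); return 1; }

    /* The kernel fills in the actual number of zones reported. */
    for (unsigned int i = 0; i < rep->nr_zones; i++)
        printf("zone %u: start=%llu len=%llu wp=%llu cond=%u\n", i,
               (unsigned long long)rep->zones[i].start,
               (unsigned long long)rep->zones[i].len,
               (unsigned long long)rep->zones[i].wp,
               (unsigned int)rep->zones[i].cond);

    free(rep);
    close(fd);
    return 0;
}
```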

9 Linux Stack: Adding ZNS support

ZNS support in the zoned block framework (kernel & tools)
• Add support for ZNS-specific features
  - Append, ZRWA, and Simple Copy
  - Asynchronous path for the new admin commands
• Relax assumptions inherited from SMR
  - Conventional zones
  - Zone sizes, state transitions, etc.

I/O paths
• Using a zoned file system (e.g., F2FS)
  - No changes to the application
• Using a zoned application
  - Support in the application backend
  - Explicit zone management
  - Data placement, scheduling, and zone management in the application

[Figure: the target software stack. Legacy applications run unchanged on a zoned file system (e.g., F2FS, XFS) over the zoned block layer with ZNS extensions, mq-deadline, and the NVMe driver. Zoned applications (e.g., RocksDB with a ZNS Env) use the raw block path: blkzoned ioctls for zone management (e.g., reset) plus io_uring / libaio / psync for reads, writes, and appends. xNVMe-based applications reach the device either through the kernel (io_uring / AIO / ioctl) or through SPDK in user space. Tooling includes fio, nvme-cli, and blktests; a QEMU ZNS SSD model provides device emulation.]

10 xNVMe: Supporting Non-Block Applications

xNVMe: a library to enable non-block applications across I/O backends
• Common API and helpers
• Kernel path: psync, libaio, io_uring
• SPDK: user space in Linux and FreeBSD
• Windows: future work – a new backend is required

Benefits
• Portable API across I/O backends and OSs (minimal benefit for block devices)
• Common API for new I/O interfaces (rewrite the application backend only once)
• File-system semantics on raw devices (block & non-block)
• No performance impact
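This is not xNVMe's actual API, just a hypothetical sketch of the idea it implements: the application codes against one small command interface, and a backend table decides whether commands travel through psync, libaio, io_uring, or SPDK, so swapping backends does not touch application code.

```c
#include <stdint.h>
#include <stdio.h>

/* One abstract submission interface for reads, writes, appends, zone mgmt... */
struct io_backend {
    const char *name;
    int (*submit)(uint64_t slba, void *buf, uint32_t nbytes);
};

/* Stand-in backends; a real library would route these to the kernel or SPDK. */
static int submit_psync(uint64_t slba, void *buf, uint32_t n)
{ (void)buf; printf("psync:    wrote %u bytes at LBA %llu\n", n, (unsigned long long)slba); return 0; }
static int submit_uring(uint64_t slba, void *buf, uint32_t n)
{ (void)buf; printf("io_uring: queued %u bytes at LBA %llu\n", n, (unsigned long long)slba); return 0; }

static const struct io_backend backends[] = {
    { "psync",    submit_psync },
    { "io_uring", submit_uring },
};

int main(void)
{
    char data[4096] = { 0 };
    /* The application sees only the abstract interface; the backend is a
     * runtime choice (e.g., derived from a device URI or an option). */
    for (unsigned i = 0; i < sizeof(backends) / sizeof(backends[0]); i++)
        backends[i].submit(0, data, sizeof(data));
    return 0;
}
```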

11 Takeaways

ZNS is a new interface in NVMe
• Builds on top of Namespace Types (e.g., KV)
• Driven by vendors and early customers

ZNS targets 2 main use cases
• Archival (the original use case)
• I/O predictability (ZNS extensions)

The host needs changes
• Basic functionality builds on top of the existing zoned block framework
  - Respect the write constraints
  - Follow the zone state machine: Reset -> Open -> Closed -> Full / -> Offline / -> Read Only
• Append needs major changes to block allocators in the I/O path
  - Might work for some file systems / applications, but not for all
• ZRWA gives an alternative to append
  - Same functionality, with no changes to the block allocator in the I/O path

We are building an open ecosystem
• Linux stack to be released on TP ratification
  - xNVMe, Linux kernel, tools (e.g., nvme-cli), fio
  - RocksDB backend on xNVMe
• Sponsoring TPs for ZNS extensions in NVMe
