Accelerating RocksDB with NVMe™ Zoned SSDs

Hans Holmberg, R&D Technologist Emerging System Architectures Group

2019 StorageWestern Developer Conference. Digital © Western Digital Corporation or its affiliates. All Rights Reserved. 1 RocksDB on Zoned NVMe™ SSDs

Optimal data placement

No drive Minimal Device Write garbage amplification collection

Longer Higher ZNS NVMe SSD Lower latency Development Platform life throughput

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 2 Agenda

▪ Zoned Namespaces 101 ▪ Adapting RocksDB for Zoned SSDs ▪ Demo ▪ Results ▪ What’s next?

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 3 Zoned Namespaces 101

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 4 What are Zoned Block Devices?

The new paradigm in storage Device LBA range divided in zones

▪ The storage device logical Zone 0 Zone 1 Zone 2 Zone 3 Zone X block addresses are divided into ranges of zones. ▪ Writes within a zone must Write commands be sequential. advance the write Write pointer pointer ▪ The zone must be erased position before it can be rewritten. Reset write pointer commands rewind the write pointer

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 5 Zoned Storage on SMR

Conventional HDD SMR HDD ▪ SMR (Shingled Magnetic Recording) Discrete Tracks Overlapped Tracks

▪ Enables areal density growth … ▪ Shares flash access model ▪ Erase before re-write

▪ Zoned Access Zone ▪ Zoned Block I/F standardized in INCITS ▪ Zoned Block Commands (ZBC): SAS ▪ Zoned ATA Commands (ZAC): SATA ▪ Host/Device cooperate to optimize RMW aspect of SMR by enforcing sequential writes and enabling host FTL model

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 6 Ubiquitous Workloads The cloud applies multiple workloads to a single SSD

SSDs write log-structured to the media that requires garbage collection Database s

Multiplex data streams onto the same Sensors garbage collection units

Analytics Increases Solid-State Write Amplification, Over-Provisioning Virtualization Drive and thereby Cost Decreases throughput and latency predictability Video

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 7 Zones for Solid State Drives

Application 1 Application 2 Application 3 Application 1 Application 2 Application 3

Conventional SSD Controller ZNS SSD Controller

LBA Space LBA Space

Zone Eliminate data streams multiplexing: • Significantly decreases write amplification, over-provisioning and thereby reduces cost • Increases throughput and latency predictability

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 8 ZNS: Synergies w/ ZAC/ZBC software ecosystem

User Space ▪ Device exposed as a Zoned Block Device (ZBD) Optimized Applications RocksDB ▪ Reuse existing work already done for ZAC/ZBC Traditional Block Storage Applications devices fio libzbd ▪ Existing ZBD-aware file systems & device mappers “just work” Regular File- File-System with Systems () ZBD Support Kernel (f2fs, , xfs) • Few additions to support to ZNS Logical Block ▪ Integrates with file-systems and applications Device (dm-zoned) ▪ RocksDB, Ceph, fio, libzbd, … Block Layer ▪ ZAC/ZBC devices are already in production at SCSI/ATA NVMe technology adopters and a mature storage stack is available through the Linux® eco-system ZNS NVMe SSD ZNSDevelopment NVMe PlatformSSD ZNSDevelopment NVMeZNS PlatformSSD SSD Development Platform

*= Enhanced data paths for SMR/ZNS drives

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 9 Zoned Namespaces

▪ Ongoing Technical Proposal in the NVMe working group ▪ New Zoned Command Set – Inherits the NVM Command Set and adds zone support. Under review ▪ Aligns to the existing host-managed models defined in the ZAC/ZBC specifications. ▪ Note that it does not map 1:1. Beware of the details. ▪ Optimized for Solid State Drives ▪ Zone Capacity ▪ Zone Append ▪ Zone Descriptors

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 10 Host-Managed Zoned Block Devices

▪ Zone States ▪ Empty, Implicitly Opened, Explicitly Opened, Closed, Full, Read Only, and Offline. ▪ Changes state upon writes, zone management commands, and device resets. ▪ Zone Management ▪ Open Zone, Close Zone, Finish Zone, and Reset Zone ▪ Zone Size & Zone Capacity(NEW) ▪ Zone Size is fixed ▪ Zone Capacity is the writeable area within a zone

Zone Size (e.g., 512MB)

Zone X - 1 Zone X Zone X + 1

Zone Start LBA Zone Capacity (E.g., 500MB)

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 11 ZonedStorage.IO

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 12 RocksDB on ZNS

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 13 RocksDB, a good fit for ZNS

▪ Persistent key-value store for fast storage environments

▪ Log-structured, flash ZNS NVMe SSD Development Platform friendly ▪ Customizable storage back ends

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 14 WA on a conventional SSD

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 15 Target: End-to-end-integration

RocksDB RocksDB

Data placement Data placement

File system

Block allocator

FTL

Media Management Media Management

MEDIA MEDIA

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 16 Challenges

▪ Multiple, parallel files being written to ▪ Map each file to a set of zones ▪ All writes must be sequential and ordered ▪ Use direct I/O and the deadline scheduler ▪ Limits on number of open zones ▪ Finish zones when done with writes

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 17 RocksDB on-disk data structures Writeahed-log (WAL)

WAL

Memtable

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 18 RocksDB on-disk data structures Sorted string tables

Append-only ata e.g. 64MB

In memoryIn sync Hot D Hot e.g. 128MB A Z A Z A Z A Z

Persisted Most Updated compaction

e.g. 1G A Z compaction

Grows 10X ata

e.g. 100G

A Z Cold D Cold Least Updated

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 19 Mapping files to zones

WAL File WAL File SST File SST File SST File SST File Zones

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 20 Approach

▪ Files are mapped to zones ▪ Zone management through file management ▪ Zones are allocated when creating a new file ▪ Zones are released after file deletion ▪ Zones can be rewritten after being reset No device-side garbage collection

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 21 Demo (Random Insert Workload)

ZNS NVMe SSD Development Platform

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 22 Results

Smart data placement device write amplification 3-6X WA’s measured on conventional drives (28% OP)

No on-drive garbage increase in capacity collection Compared to a conventional 28% OP SSD

TCO reduction Increases lifetime/writes significantly

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 23 Conclusions

▪ Easy to leverage flash-friendy data placement ▪ ZNS enables applications to become flash-optimal ▪ Zoned Block Device Software Eco-system already available ▪ Libraries, tools, emulation ▪ Easy to integrate with existing storage stack

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 24 What’s next?

▪ Upstream support to RocksDB ▪ More ZNS end-to-end-integration: ▪ Databases (LSM-based, logs, ...) ▪ Filesystems (btrfs, ceph, ...) ▪ Cloud infrastructure

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 25 Thanks!

Western Digital and the Western Digital logo are registered trademarks or trademarks of Western Digital Corporation or its affiliates in the US and/or other countries. The NVMe word mark and NVM Express design mark are trademarks of NVM Express, Inc. All other marks are the property of their respective owners.

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 26