Accelerating RocksDB with NVMe™ Zoned SSDs
Hans Holmberg, R&D Technologist Emerging System Architectures Group
2019 StorageWestern Developer Conference. Digital © Western Digital Corporation or its affiliates. All Rights Reserved. 1 RocksDB on Zoned NVMe™ SSDs
Optimal data placement
No drive Minimal Device Write garbage amplification collection
Longer Higher ZNS NVMe SSD Lower latency Development Platform life throughput
2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 2 Agenda
▪ Zoned Namespaces 101 ▪ Adapting RocksDB for Zoned SSDs ▪ Demo ▪ Results ▪ What’s next?
2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 3 Zoned Namespaces 101
2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 4 What are Zoned Block Devices?
The new paradigm in storage Device LBA range divided in zones
▪ The storage device logical Zone 0 Zone 1 Zone 2 Zone 3 Zone X block addresses are divided into ranges of zones. ▪ Writes within a zone must Write commands be sequential. advance the write Write pointer pointer ▪ The zone must be erased position before it can be rewritten. Reset write pointer commands rewind the write pointer
2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 5 Zoned Storage on SMR
Conventional HDD SMR HDD ▪ SMR (Shingled Magnetic Recording) Discrete Tracks Overlapped Tracks
▪ Enables areal density growth … ▪ Shares flash access model ▪ Erase before re-write
▪ Zoned Access Zone ▪ Zoned Block I/F standardized in INCITS ▪ Zoned Block Commands (ZBC): SAS ▪ Zoned ATA Commands (ZAC): SATA ▪ Host/Device cooperate to optimize RMW aspect of SMR by enforcing sequential writes and enabling host FTL model
2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 6 Ubiquitous Workloads The cloud applies multiple workloads to a single SSD
SSDs write log-structured to the media that requires garbage collection Database s
Multiplex data streams onto the same Sensors garbage collection units
Analytics Increases Solid-State Write Amplification, Over-Provisioning Virtualization Drive and thereby Cost Decreases throughput and latency predictability Video
2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 7 Zones for Solid State Drives
Application 1 Application 2 Application 3 Application 1 Application 2 Application 3
Conventional SSD Controller ZNS SSD Controller
LBA Space LBA Space
Zone Eliminate data streams multiplexing: • Significantly decreases write amplification, over-provisioning and thereby reduces cost • Increases throughput and latency predictability
2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 8 ZNS: Synergies w/ ZAC/ZBC software ecosystem
User Space ▪ Device exposed as a Zoned Block Device (ZBD) Optimized Applications RocksDB Ceph ▪ Reuse existing work already done for ZAC/ZBC Traditional Block Storage Applications devices fio libzbd ▪ Existing ZBD-aware file systems & device Linux mappers “just work” Regular File- File-System with Systems (xfs) ZBD Support Kernel (f2fs, btrfs, xfs) • Few additions to support to ZNS Logical Block ▪ Integrates with file-systems and applications Device (dm-zoned) ▪ RocksDB, Ceph, fio, libzbd, … Block Layer ▪ ZAC/ZBC devices are already in production at SCSI/ATA NVMe technology adopters and a mature storage stack is available through the Linux® eco-system ZNS NVMe SSD ZNSDevelopment NVMe PlatformSSD ZNSDevelopment NVMeZNS PlatformSSD SSD Development Platform
*= Enhanced data paths for SMR/ZNS drives
2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 9 Zoned Namespaces
▪ Ongoing Technical Proposal in the NVMe working group ▪ New Zoned Command Set – Inherits the NVM Command Set and adds zone support. Under review ▪ Aligns to the existing host-managed models defined in the ZAC/ZBC specifications. ▪ Note that it does not map 1:1. Beware of the details. ▪ Optimized for Solid State Drives ▪ Zone Capacity ▪ Zone Append ▪ Zone Descriptors
2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 10 Host-Managed Zoned Block Devices
▪ Zone States ▪ Empty, Implicitly Opened, Explicitly Opened, Closed, Full, Read Only, and Offline. ▪ Changes state upon writes, zone management commands, and device resets. ▪ Zone Management ▪ Open Zone, Close Zone, Finish Zone, and Reset Zone ▪ Zone Size & Zone Capacity(NEW) ▪ Zone Size is fixed ▪ Zone Capacity is the writeable area within a zone
Zone Size (e.g., 512MB)
Zone X - 1 Zone X Zone X + 1
Zone Start LBA Zone Capacity (E.g., 500MB)
2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 11 ZonedStorage.IO
2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 12 RocksDB on ZNS
2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 13 RocksDB, a good fit for ZNS
▪ Persistent key-value store for fast storage environments
▪ Log-structured, flash ZNS NVMe SSD Development Platform friendly ▪ Customizable storage back ends
2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 14 WA on a conventional SSD
2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 15 Target: End-to-end-integration
RocksDB RocksDB
Data placement Data placement
File system
Block allocator
FTL
Media Management Media Management
MEDIA MEDIA
2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 16 Challenges
▪ Multiple, parallel files being written to ▪ Map each file to a set of zones ▪ All writes must be sequential and ordered ▪ Use direct I/O and the deadline scheduler ▪ Limits on number of open zones ▪ Finish zones when done with writes
2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 17 RocksDB on-disk data structures Writeahed-log (WAL)
Memtable
2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 18 RocksDB on-disk data structures Sorted string tables
Append-only ata e.g. 64MB
In memoryIn sync Hot D Hot e.g. 128MB A Z A Z A Z A Z
Persisted Most Updated compaction
e.g. 1G A Z compaction
Grows 10X ata
e.g. 100G
A Z Cold D Cold Least Updated
2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 19 Mapping files to zones
WAL File WAL File SST File SST File SST File SST File Zones
2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 20 Approach
▪ Files are mapped to zones ▪ Zone management through file management ▪ Zones are allocated when creating a new file ▪ Zones are released after file deletion ▪ Zones can be rewritten after being reset No device-side garbage collection
2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 21 Demo (Random Insert Workload)
ZNS NVMe SSD Development Platform
2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 22 Results
Smart data placement device write amplification 3-6X WA’s measured on conventional drives (28% OP)
No on-drive garbage increase in capacity collection Compared to a conventional 28% OP SSD
TCO reduction Increases lifetime/writes significantly
2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 23 Conclusions
▪ Easy to leverage flash-friendy data placement ▪ ZNS enables applications to become flash-optimal ▪ Zoned Block Device Software Eco-system already available ▪ Libraries, tools, emulation ▪ Easy to integrate with existing storage stack
2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 24 What’s next?
▪ Upstream support to RocksDB ▪ More ZNS end-to-end-integration: ▪ Databases (LSM-based, logs, ...) ▪ Filesystems (btrfs, ceph, ...) ▪ Cloud infrastructure
2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 25 Thanks!
Western Digital and the Western Digital logo are registered trademarks or trademarks of Western Digital Corporation or its affiliates in the US and/or other countries. The NVMe word mark and NVM Express design mark are trademarks of NVM Express, Inc. All other marks are the property of their respective owners.
2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 26