Linux Zoned Block Device Ecosystem: No longer exotic

Dmitry Fomichev, Western Digital Research, System Software Group

October 2019

© 2019 Western Digital Corporation or its affiliates. All rights reserved. 10/2/2019

Outline

• Why zoned block devices (ZBD)?
  – SMR recording and zoned models
• Support in Linux - status overview
  – Standards and kernel
  – Application support
• Kernel support details
  – Overview, block layer, file systems and device-mapper, known problems
• Application support details
  – SG Tools, libzbc, fio, etc.
• ZNS
  – Why zones for flash? ZNS vs. ZBD, ZNS use cases
• Ongoing Work and Next Steps

What are Zoned Block Devices?
Zoned device access model

• The device logical block address range is divided into zones
  – (Diagram: device LBA range divided into zones Zone 0, Zone 1, Zone 2, Zone 3, … Zone X)
• Zone size is much larger than the LBA size
  – E.g. 256 MB on today's SMR disks
• Zone size is fixed
• Reads can be done in the usual manner
• Writes within a zone must be sequential
  – Write commands advance the write pointer position
• A zone must be erased before it can be rewritten
  – Reset write pointer commands rewind the write pointer
• Zones are identified by their start LBA

What are Zoned Block Devices?
Accommodate advanced recording technology

• Shingled Magnetic Recording (SMR) disks
  – Enables higher areal density
  – Wider write head produces a stronger field, enabling smaller grains and lower noise
  – Better sector erasure coding, more powerful data detection and recovery
• Zoned Access
  – Random reads to the device are allowed
  – But writes within zones must be sequential
  – Zones can be rewritten from the beginning after erasing
  – Additional commands are needed
• Some zones of the device can still be PMR
  – Conventional Zones
(Diagram: conventional PMR HDD with discrete tracks vs. SMR HDD with overlapped tracks)

What are Zoned Block Devices?
Standardized Zoned Device Models

• Drive firmware can be designed to alleviate and conceal zone write restrictions, but this comes at a cost
  – Garbage collection is necessary, resulting in lower performance

What are Zoned Block Devices?
Mainstream technology?

• WDC: Capacity enterprise market exabyte growth expectations for FY 2019: meaningfully exceed 30% y/y
  – Competitors expect similar growth
• WDC: By 2023, half of produced HDD capacity will be Host Managed SMR
• The T10 (SCSI), T13 (ATA) and SAT standards are now stable
  – The SCSI standard is called ZBC, the ATA standard ZAC
  – SAT is SCSI-to-ATA translation

(Timeline 2013–2017: ZBC r00; the ZBC, ZAC and SAT standards forwarded to INCITS. "Disk isn't dead, it has gone to the cloud.")

Host Managed Zoned Block Devices
What is needed for zoned operation?

• Need functionality to report the current zone state
  – REPORT ZONES command for SCSI
  – Zoned Device Information page of the IDENTIFY DEVICE log is read for ATA
• Need functionality for resource management
  – Zone resources (MaxOpen and MaxActive) are limited and need operations to manage them
• Zone operations – new commands
  – OPEN ZONE
    • a zone can also be opened implicitly by a write
  – CLOSE ZONE
  – FINISH ZONE
    • the write pointer becomes invalid!
  – RESET ZONE
(Diagram: zone state machine with conditions ZC1–ZC7)

I/O Path Support for Zoned Block Devices
The Big Picture

(Stack diagram:)
• User space: applications (blkzone, libzbc, fio, sg tools)
• Access paths: file access (any file system, or zoned: f2fs, zonefs), block access, zoned block access (dm-linear, dm-flakey, dm-zoned) and direct device access
• Kernel space: Block I/O layer with the block I/O scheduler (deadline and mq-deadline zone write-locking since 4.16.0), SCSI Generic, SCSI mid layer (sd driver scan code), SCSI low level drivers
• HW: HBA, ZBC/ZAC disk

Application vs Kernel Support
Dependent on kernel version and HBA compliance

• Kernel has zoned block device support? No (kernels pre-v4.10*): application direct management
  – If the HBA exposes the device and has a functional ZBC SAT: SG_IO based support (application specific implementation): sg3utils, libzbc, fio
    • This is the minimum support space
  – Otherwise: device not usable
• Kernel has zoned block device support? Yes (kernels v4.10** onward; HBA exposes the device, SAT optional): kernel based management
  – Application direct access (application support required): libzbc and regular system calls
  – Application indirect access (also legacy applications): file system, device mapper
    • This is the ideal support space

* "vanilla" kernels from www.kernel.org; enterprise distribution kernels may backport features to lower kernel versions
** some current enterprise distributions may use older kernels that have out-of-date ZBD support

Support Overview
ZBC and ZAC support timeline

• Initial work started with kernel version 3.18
• Full ZBC and ZAC command support was implemented in kernel version 4.10
  – Exposes Host Managed disks as block devices
  – API for REPORT ZONES and RESET WRITE POINTER
    • No kernel internal support for the OPEN, CLOSE and FINISH ZONE commands
  – libata supports ZBC to ZAC translation of all commands
• From 5.0: scsi-mq only

(Timeline 2014–2019, marking kernels 3.18, 4.10, 4.13, 4.16, 5.0, 5.1 and 5.2:)
– 3.18: SG nodes support (TYPE_ZBC SCSI, TYPE_ZAC ATA)
– 4.10: zoned block device support
– 4.13: dm-zoned device mapper
– 5.0: scsi-mq support
– 5.2: latest stable release

Linux Kernel Support Overview
Kernel Zoned Write Constraints for Host Managed devices

• The sequential write constraint is exposed to the user of the drive
  – File systems and applications MUST write sequentially to sequential zones
• Write ordering guarantees are implemented by limiting the per-zone write queue depth to 1
  – Also solves many HBA level command ordering problems, including AHCI
  – Implemented in the SCSI disk driver for kernels 4.10.0 to 4.15.x
  – Implemented with the "deadline" and "mq-deadline" schedulers since kernel 4.16.0
  – mq-deadline is mandatory with kernels 5.0 and above (legacy single queue I/O path removed)
• Since 4.16, the sequential write constraint is enforced in the deadline scheduler
  – Now, mq-deadline scheduler only

Linux Kernel Support Overview
Why deadline scheduler?

• Maintains shared state for all hardware queues
  – This simplifies the zone locking mechanism
  – Support for zoned block devices is only ~30 lines of code out of ~800
• Read and write queues are sorted by LBA
  – Creates favorable patterns for zone I/O
• Never reorders I/O requests
  – Merges are allowed as long as they don't cause chunk boundary crossing
• Atomic bit operations are used to lock zones
  – Small memory footprint, as the number of zones can be very large
• Maintaining QD=1 with the zone lock may cause contention
  – Only comes into play with certain I/O patterns
  – An example later in the presentation

Linux Kernel Support - Core Functionalities
Block layer API

• A set of ioctls defined in include/uapi/linux/blkzoned.h

Zone report is available via an ioctl call:

  /**
   * @BLKREPORTZONE: Get zone information. Takes a zone report as argument.
   *                 The zone report will start from the zone containing the
   *                 sector specified in the report request structure.
   */
  #define BLKREPORTZONE _IOWR(0x12, 130, struct blk_zone_report)

Zone reset is also available as an ioctl call:

  /**
   * @BLKRESETZONE: Reset the write pointer of the zones in the specified
   *                sector range. The sector range must be zone aligned.
   */
  #define BLKRESETZONE _IOW(0x12, 131, struct blk_zone_range)

Kernel 4.19 introduced two additional ioctl calls:

  /**
   * @BLKGETZONESZ: Get the device zone size in number of 512 B sectors.
   * @BLKGETNRZONES: Get the total number of zones of the device.
   */
  #define BLKGETZONESZ _IOR(0x12, 132, __u32)
  #define BLKGETNRZONES _IOR(0x12, 133, __u32)

Linux Kernel Support - Core Functionalities
SCSI layer

• Zone configuration information is printed as part of the disk scan
  – Kernel log messages:

  [    3.687797] scsi 5:0:0:0: Direct-Access-ZBC ATA HGST HSH721414AL TE8C PQ: 0 ANSI: 7
  [    3.696359] sd 5:0:0:0: Attached scsi generic sg4 type 20
  [    3.696485] sd 5:0:0:0: [sdd] Host-managed zoned block device
  [    3.865072] sd 5:0:0:0: [sdd] 27344764928 512-byte logical blocks: (14.0 TB/12.7 TiB)
  [    3.873046] sd 5:0:0:0: [sdd] 4096-byte physical blocks
  [    3.878343] sd 5:0:0:0: [sdd] 52156 zones of 524288 logical blocks
  [    3.884591] sd 5:0:0:0: [sdd] Write Protect is off
  [    3.889440] sd 5:0:0:0: [sdd] Mode Sense: 00 3a 00 00
  [    3.889458] sd 5:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
  [    4.253140] sd 5:0:0:0: [sdd] Attached SCSI disk

• Zone model and geometry are available to applications through sysfs files
  – Files in /sys/block/<disk>/queue
    • zoned file for the device zone model
    • chunk_sectors for the zone size
    • Kernel 4.19 added nr_zones for the total number of zones of a disk

  > cat /sys/block/sdd/queue/zoned
  host-managed
  > cat /sys/block/sdd/queue/chunk_sectors
  524288
  > cat /sys/block/sdd/queue/nr_zones
  52156

Linux Kernel Support - Device Mapper
Zoned block device handling and abstraction

• Support for zone aligned dm-linear
  – Splitting and concatenation of zone ranges of one or more devices into a logical zoned block device
  – Zone configuration reorganization
• Drive-managed emulation with the dm-zoned device mapper
  – Exposes a Host Managed disk as a regular disk
  – Allows using any file system
  – Minimum capacity loss for internal metadata
    • A few zones
  – Run-time overhead due to zone garbage collection
    • Dependent on workload
  – Needs conventional zones
• Support for zone aligned dm-flakey
  – Error injection for tests

Linux Kernel Support - File Systems
Native support for f2fs

• F2FS native support for zoned block devices introduced in kernel 4.10.0
  – Completely hides sequential write constraints from the application level
• ZoneFS, a new file system that exposes every zone as a POSIX file
  – Useful for environments using LSM trees, such as RocksDB or LevelDB
  – Sequential write access for every file; a file grows from 0 to the zone size
  – Truncate to zero = Reset Write Pointer
  – Development stage, well received and reviewed
• Btrfs native support development ongoing
  – Patch series posted as RFC
• Other file systems do not natively support zoned block devices
  – Ext2/3/4 and XFS do not have the capability of sequentially writing zones

Application Level Support
Management tools and libraries

• sg3utils (http://sg.danny.cz/sg/sg3utils.html)
  – Legacy, well known generic SCSI command interface and application tool chain
  – ZBC support since version 1.42
    • sg_rep_zones, sg_reset_wp, sg_zone tools
  – Rely on a SAT layer for ZBC to ZAC command translation
    • Kernel SAT layer (libata) or HBA SAT layer
• libzbc – a Linux-only library that provides a unified API for manipulating zoned block devices
  – libzbc internal processing translates API calls to SCSI commands, ATA commands or kernel system calls
  – The unified API hides the different interfaces to the kernel
  – Higher level functions simplify coding
  – Available at https://github.com/hgst/libzbc

libzbc: Internal Structure
Device type dependencies handled with internal backend drivers

• Automatic selection of internal backend drivers based on the device type
  – Block device, SCSI or ATA backend
  – Emulation backend for development

(Diagram: the application calls libzbc, which dispatches to the block driver (read/write system calls over the kernel block I/O layer), the ZBC or ZAC driver (SG_IO ioctl over SCSI Generic and the SCSI mid and low layers), or the emulation driver (file interface over a file system); the hardware below is a ZAC or ZBC disk, or a regular disk for emulation)

libzbc: Backend Driver Matrix
Automatic detection with manual override

• Backend driver detection is automatic
  – The user can manually override it if necessary (e.g. to bypass a buggy SAT layer)
  – CLI tools: select the driver by using the SG node name vs. the block device name
  – zbc_open() flags ZBC_O_xxx_DRIVER

Device identifier  | ZBC disk             | ZAC disk                                         | Regular disk or file
SG node file       | SCSI backend         | SCSI backend (if SAT OK) or ATA backend (no SAT) | x
Block device file  | Block device backend | Block device backend                             | x
Regular file       | x                    | x                                                | Emulation backend

libzbc: The Test Suite and Tools
Device conformance tests and CLI utilities

• The Test Suite
  – Built with scripts and a few simple C programs (calling the libzbc API)
  – configure --with-test, run with libzbc/test/zbc_test.sh
  – Tests with the SCSI or ATA drivers
  – 3 test sections, 108 tests
• CLI Tools
  – Convenient utilities for performing zone operations
  – Give examples of basic use of libzbc API functions
  – We'll see some screenshots at the end of the presentation

Command                 Description
zbc_info                Display disk information
zbc_report_zones        Display the list of disk zones
zbc_reset_zone          Reset a zone or all zone write pointers
zbc_read_zone           Read data from a zone
zbc_write_zone          Write data or a file to a zone
zbc_open_zone           Explicitly open a zone
zbc_close_zone          Close zone(s)
zbc_finish_zone         Finish zone(s)
zbc_set_zones           For emulation mode: configure the disk zones
zbc_set_write_pointer   For emulation mode: change the value of a disk write pointer

libzbc: Graphical Interface, gzbc
View and control zone state with convenience

(Screenshot: conventional zones and sequential write zones, showing written data and a full sequential zone)

Application Level Support - continued
Other system applications

• Linux util-linux (https://github.com/karelzak/util-linux)
  – Maintained by Red Hat
  – blkzone utility added with version 2.30.0
    • Supports zone report and zone reset
    • Implemented using the BLKREPORTZONE and BLKRESETZONE ioctls
    • No support for open, close and finish zone
      – No equivalent kernel-defined device ioctl
• ZBC support in tcmu-runner
  – The userspace component for the TCM-loop/TCM-user LIO kernel stack backend storage
  – Allows adding userspace command handlers for block devices created by LIO
    • Configuration is done with targetcli
  – The current version emulates ZBC devices

Application Level Support - continued
Benchmarking: fio

• fio supports Host Managed SMR disks since version 3.9
  – Current version is 3.15: https://github.com/axboe/fio/
  – The random-write pattern is changed into "random zone write"
    • Writes to the write pointer position of a randomly chosen zone
    • An automatic zone reset is executed when starting to write to a full zone
  – Zone locking to allow multiple job execution
    • Ensures sequential writes to zones with multiple jobs
  – The core ZBD fio code uses the block ioctls
• Example: random 16 KB writes to sequential zones at QD=16 with data verification

  fio --name=zbc --filename=/dev/sdc --zonemode=zbd --offset=140660178944 \
      --ioengine=libaio --iodepth=16 --rw=randwrite --bs=16K \
      --do_verify=1 --verify=md5 --size=1G

• Many more examples with the included tests: fio/t/zbc/test-zbc-support

ZNS
Let's finally talk about SSDs

Why Zones for Solid State Drives?

ZNS: SSD Architecture
Complex and highly parallel

• Host interfaces: NVMe, SATA, SAS
• Parallelism in the architecture
  – Tens or even hundreds of dies
  – Dies connected to parallel channels
• NAND Read/Program/Erase
  – Erase can only be done at erase block granularity (~1 MB)
  – NAND access latencies
• Flash Translation Layer
  – Mapping of LBAs to erase blocks
  – Wear leveling
  – Bad block management
  – Media error handling
  – Garbage collection

ZNS: Garbage Collection in SSDs
The world would have been a better place without it

• Discarded and overwritten data creates gaps in the erase block space, and data from partially full erase blocks needs to be moved to make empty erase blocks
  – Reduced write performance
  – Write amplification (internal)
  – Need for more overprovisioning (OP may reach 30% of advertised capacity)
• It is hard to establish a good upper bound for GC operations
  – Increased tail latencies
  – Increased controller RAM requirements
(Diagram: applications 1–3 write through a conventional SSD controller into a shared LBA space, mixing their data within erase-block-sized regions)

ZNS: Zones for Solid State Drives
Sequential data reduce the need for GC

(Diagram: applications 1–3 write through a conventional SSD controller into one shared LBA space; with a ZNS SSD controller, each application's stream maps to its own zone)

Eliminating data stream multiplexing:
• Significantly decreases write amplification and over-provisioning, thereby reducing cost
• Increases throughput and latency predictability

ZNS: Zoned Namespaces
Ongoing Technical Proposal in the NVMe™ working group

Under review
• New Zoned Command Set
  – Inherits the NVM Command Set and adds zone support
• Aligns to the existing Host Managed model defined in the ZAC/ZBC specifications
  – Note that it does not map 1:1; beware of the details
  – Zone information is read from a log, similar to ZAC
• Optimized for Solid State Drives
  – Zone Capacity
  – Zone Descriptors
  – Zone Append

ZNS: It is Host Managed
Similar, but what is different?

• Zone States (not conditions)
  – Empty, Implicitly Opened, Explicitly Opened, Closed, Full, Read Only, and Offline
  – Smaller zone size (1 MB), no conventional zones (NEW)
  – State changes upon writes, zone management commands, and device resets
• Zone Management
  – Open Zone, Close Zone, Finish Zone, and Reset Zone
  – All zone operations will be supported in the kernel
• Zone Size & Zone Capacity (NEW)
  – Zone Capacity is the writeable area within a zone
(Diagram: zones X-1, X, X+1; the Zone Capacity, e.g. 500 MB, spans from the Zone Start LBA over part of the zone)

ZNS: Zone Descriptors
Giving an ID to a zone

• Fixed-size data associated with a zone
  – Upon opening an empty zone, a fixed-size amount of data is assigned to it
  – Can be read in the "extended" zone log report
  – Invalidated upon zone reset
  – Same size for all zones
• Use cases:
  – Out-of-band recovery path by allocating UUIDs to each zone
  – Zones can be self-identifying (data key)
  – Zones can be timestamped with time-based UUIDs

ZNS: Zone Append
Major zoned I/O performance boost

• ZAC/ZBC requires strict write ordering
  – Limits write performance, increases host overhead
• Low scalability with multiple writers to a zone
  – One writer per zone -> good performance
  – Multiple writers per zone -> lock contention
  – Can improve by writing multiple zones, but…
    • There is still contention
    • Performance may be limited by MaxActiveZones
• A new NVMe command: Zone Append
  – Anonymous write concept
  – Write at the zone start LBA; the controller returns the actual LBA of the written data
  – No contention. With Zone Append, we scale!
(Chart: bare metal i7-6700K, 4K QD1 random write with libaio; KIOPS vs. number of writers for 1 zone, 4 zones, and Zone Append)

ZNS: Synergies with the ZAC/ZBC ecosystem
Very little effort required to integrate ZNS

• Reuse existing work already done for ZAC/ZBC devices
• Existing ZBD-aware file systems & device mappers "just work"
  – Minor changes to support ZNS
• Integrates with file systems and applications
  – RocksDB, Ceph, etc.
• ZAC/ZBC management software is being modified to support ZNS
• A new library, libzns, will be available to play the role of libzbc
(Diagram: user space optimized applications (RocksDB, Ceph, fio, libzns) and traditional block storage applications run over kernel file systems with ZBD support (f2fs, btrfs) or regular file systems (xfs) on a logical block device (dm-zoned), then the block layer and the SCSI/ATA and NVMe stacks, down to ZNS NVMe SSDs and a development platform; * = enhanced data paths for SMR/ZNS drives)

ZNS: Support in Linux
ZNS namespace appears as a Host Managed ZBD in the system

(Screenshot: the NVMe namespace exposed as a host-managed zoned block device)

Demo: libzbc
zbc_report_zones CLI command on a ZAC device

• 14 TB ZAC device, block backend driver chosen via device name

Demo: libzbc
zbc_report_zones CLI command on a ZAC device

• 14 TB ZAC device, SG node with SAT working

Demo: libzns
Emulated ZNS namespace, Zone Log read

(Screenshot: zone log output with an open zone highlighted)

Demo: libzns
Emulated ZNS namespace, Close Zone

(Screenshots: an open zone, the close command, and the resulting closed zone)

Conclusion and Ongoing Work

• ZNS
  – Linux kernel support, QEMU emulation (no tcmu-runner), nvme-cli, libzns, fio, blktest
  – The approved version of the standard will be available in 1H 2020
• Native support for Btrfs
  – RFC patches posted upstream and reviewed
  – Ongoing discussion to address maintainer concerns
• Native support for XFS
  – Design phase
• dm-zoned integration improvements and performance optimizations
  – Reduce zone reclaim overhead
  – Persistent setup initialization across reboots, LVM integration
• blktest
  – Implements portable, repeatable, full I/O stack tests
  – Zoned block device test support patches posted upstream and under review

Links

http://zonedstorage.io/ - a must see for ZBC ☺
https://blog.westerndigital.com/storage-architectures-zettabyte-age/
https://blogs.dropbox.com/tech/2019/07/smr-what-we-learned-in-our-first-year/
https://github.com/hgst/libzbc
https://github.com/open-iscsi/tcmu-runner
http://xfs.org/index.php/File:Xfs-smr-structure-0.2.pdf

Questions?
