Zoned Linux Ecosystem Overview

Linux Zoned Block Device Ecosystem: No longer exotic Dmitry Fomichev Western Digital Research, System Software Group October 2019 © 2019 Western Digital Corporation or its affiliates. All rights reserved. 10/2/2019 Outline • Why zoned block devices (ZBD)? – SMR recording and zoned models • Support in Linux - status overview – Standards and kernel – Application support • Kernel support details – Overview, Block layer, File systems and device-mapper, Known problems • Application support details – SG Tools, libzbc, fio, etc. • ZNS – Why zones for flash? ZNS vs. ZBD, ZNS use cases • Ongoing Work and Next Steps © 2019 Western Digital Corporation or its affiliates. All rights reserved. 10/2/2019 2 What are Zoned Block Devices? Zoned device access model • The storage device logical block addresses Device LBA range divided in zones are divided into ranges of zones • Zone size is much larger than LBA size Zone 0 Zone 1 Zone 2 Zone 3 Zone X – E.g. 256 MB on today’s SMR disks • Zone size is fixed • Reads can be done in the usual manner Write commands • Writes within a zone must be sequential Write pointer advance the write pointer • A zone must be erased before it can be position rewritten Reset write pointer commands • Zones are identified by their start LBA rewind the write pointer © 2019 Western Digital Corporation or its affiliates. All rights reserved. 10/2/2019 3 What are Zoned Block Devices? Accommodate advanced recording technology • Shingled Magnetic Recording (SMR) disks Conventional PMR HDD SMR HDD – Enables higher areal density Discrete Tracks Overlapped Tracks – Wider write head produces stronger field, enabling smaller grains and lower noise – Better sector erasure coding, more powerful … data detection and recovery • Zoned Access – Random reads to the device are allowed – But writes within zones must be sequential Zone – Zones can be rewritten from the beginning after erasing – Additional commands are needed • Some zones of the device can still be PMR – Conventional Zones © 2019 Western Digital Corporation or its affiliates. All rights reserved. 10/2/2019 4 What are Zoned Block Devices? Standardized Zoned Device Models • Drive firmware can be designed to alleviate and conceal zone write restrictions, but this comes at a cost – Garbage collection is necessary, resulting in lower performance © 2019 Western Digital Corporation or its affiliates. All rights reserved. 10/2/2019 5 What are Zoned Block Devices? Mainstream technology? • WDC: Capacity enterprise market exabyte growth expectations for FY 2019: Meaningfully exceed 30% y/y – Competitors expect similar growth • WDC: By 2023, half of HDD produced capacity will be Host Managed SMR • Both T10 (SCSI), T13 (ATA) and SAT standards are now stable – The SCSI standard is called ZBC, ATA - ZAC – SAT is SCSI to ATA translation 2013 2014 2015 2016 2017 10 11 12 1 2 3 10 11 12 1 2 3 10 11 12 1 2 3 Disk isn’t dead, it has gone to heaven cloud ZBC r00 ZBC SAT forward ZAC forwarded forwarded to to INCITS to INCITS INCITS © 2019 Western Digital Corporation or its affiliates. All rights reserved. 10/2/2019 6 Host Managed Zoned Block Devices What is needed for zoned operation? • Need functionality to read the current zone state – REPORT ZONES command for SCSI ZC2 – Zoned Device Information page of IDENTIFY DEVICE log is read for ATA ZC1 ZC5 ZC4 • Need functionality for resource management – Zone resources are limited, need operations like open and close to manage them. MaxOpen and MaxActive. ZC6 ZC7 • Zone operations – new commands ZC3 – OPEN ZONE • zone can also be opened implicitly by a write – CLOSE ZONE – FINISH ZONE • write pointer becomes invalid! – RESET ZONE © 2019 Western Digital Corporation or its affiliates. All rights reserved. 10/2/2019 7 I/O Path Support for Zoned Block Devices The Big Picture User Applications Space (blkzone, libzbc, fio, sg tools) File access File access Block access Zoned block access Direct device access File System File System (any) (f2fs, zonefs) Device Mapper (dm-linear, dm-flakey, dm-zoned) Kernel Block I/O Layer Space Block I/O Scheduler (deadline and mq-deadline zone write-locking since 4.16.0) SCSI Generic SCSI Mid Layer (sd driver scan code) SCSI Low Level Drivers HBA HW ZBC/ZAC Disk © 2019 Western Digital Corporation or its affiliates. All rights reserved. 10/2/2019 8 Application vs Kernel Support Dependent on kernel version and HBA compliance ** Some current enterprise * ”vanilla” kernels from www.kernel.org Kernels pre-v4.10* Kernels v4.10** onward distributions may use older (Enterprise distribution kernels may kernels that have out-of-date backport features to lower kernel ZBD support version) No Kernel has zoned block device support Yes No No HBA exposes device and has HBA exposes device functional ZBC SAT Yes (SAT optional) Yes Application direct management Kernel based management SG_IO based support: Application direct access Application indirect access Device not usable • Application specific implementation (application support required): (also legacy applications): • sg3utils • ioctl and regular system calls • File system • libzbc • libzbc • Device mapper • fio Minimum support space Ideal support space © 2019 Western Digital Corporation or its affiliates. All rights reserved. 10/2/2019 9 Linux Kernel Support Overview ZBC and ZAC support timeline • Initial work started with kernel version 3.18 • Full ZBC and ZAC command support was implemented in kernel version 4.10 – Exposes Host Managed disks as block devices – API for REPORT ZONES and RESET WRITE POINTER • No kernel internal support for OPEN, CLOSE and FINISH ZONE commands – Support ZBC to ZAC command translation of all commands • libATA From 5.0: scsi-mq only 2014 2017 2018 2019 10 11 12 1 2 3 4 8 9 10 11 2 3 4 5 6 1 2 3 4 5 6 7 8 9 10 11 12 3.18 4.10 4.13 4.16 5.0 5.1 5.2 SG nodes support Zoned block dm-zoned scsi-mq Latest stable TYPE_ZBC SCSI device support device mapper support release TYPE_ZAC ATA © 2019 Western Digital Corporation or its affiliates. All rights reserved. 10/2/2019 10 Linux Kernel Support Overview Kernel Zoned Write Constraints for Host Managed devices • Sequential write constraint is exposed to the drive user – file systems and applications MUST write sequentially to sequential zones – Write ordering guarantees implemented by limiting per zone write queue depth to 1 User • Application Also solves many HBA level command ordering problems, including AHCI level • Implemented in the SCSI disk driver for kernels 4.10.0 to 4.15.x • Implemented with the “deadline” and “mq-deadline” schedulers since kernel 4.16.0 Kernel VFS • mq-deadline is mandatory with kernels 5.0 and above (legacy single queue I/O path File System removed) Device mapper Kernel Block Layer Block I/O scheduler • Since 4.16, sequential write constraint is enforced in deadline SCSI / ATA stack scheduler HBA Driver – Now, mq-deadline scheduler only © 2019 Western Digital Corporation or its affiliates. All rights reserved. 10/2/2019 11 Linux Kernel Support Overview Why deadline scheduler? • Maintains shared state for all hardware queues – This simplifies zone locking mechanism – Support for zoned block devices is only ~30 lines of code out of ~800 User Application • Read and write queues are sorted by LBA level – Creates favorable patterns for zone I/O Kernel VFS • Never reorders I/O requests File System – Merges are allowed as long as they don’t cause chunk boundary crossing Device mapper Kernel Block Layer • Atomic bit operations are used to lock zones Block I/O scheduler – Small memory amount as the number of zones can be very large SCSI / ATA stack HBA Driver • Maintaining QD=1 with the zone lock may cause contention – Only comes to play with certain I/O patterns – An example later in the presentation © 2019 Western Digital Corporation or its affiliates. All rights reserved. 10/2/2019 12 Linux Kernel Support - Core Functionalities Block layer API • A set of ioctls defined in include/uapi/linux/blkzoned.h Zone report is available via an ioctl call: /** ... * @BLKREPORTZONE: Get zone information. Takes a zone report as argument. * The zone report will start from the zone containing the User * sector specified in the report request structure. Application level */ #define BLKREPORTZONE _IOWR(0x12, 130, struct blk_zone_report) Kernel VFS Zone reset command is also available as an ioctl call: File System /** ... Device mapper * @BLKRESETZONE: Reset the write pointer of the zones in the specified * sector range. The sector range must be zone aligned. Kernel Block Layer ... */ Block I/O scheduler #define BLKRESETZONE _IOW(0x12, 131, struct blk_zone_range) SCSI / ATA stack Kernel 4.19 introduced two additional ioctl calls: HBA Driver /** ... * @BLKGETZONESZ: Get the device zone size in number of 512 B sectors. * @BLKGETNRZONES: Get the total number of zones of the device. ... */ #define BLKGETZONESZ _IOR(0x12, 132, __u32) #define BLKGETNRZONES _IOR(0x12, 133, __u32) © 2019 Western Digital Corporation or its affiliates. All rights reserved. 10/2/2019 13 Linux Kernel Support - Core Functionalities SCSI layer • Zone configuration information is printed as part of disk scan process – Kernel log messages [ 3.687797] scsi 5:0:0:0: Direct-Access-ZBC ATA HGST HSH721414AL TE8C PQ: 0 ANSI: 7 User [ 3.696359] sd 5:0:0:0: Attached scsi generic sg4 type 20 Application [ 3.696485] sd 5:0:0:0: [sdd] Host-managed zoned block device level [ 3.865072] sd 5:0:0:0: [sdd] 27344764928 512-byte logical blocks: (14.0 TB/12.7 TiB) [ 3.873046] sd 5:0:0:0: [sdd] 4096-byte physical blocks Kernel VFS [ 3.878343] sd 5:0:0:0: [sdd]

Zoned Linux Ecosystem Overview

Storage Administration Guide Storage Administration Guide SUSE Linux Enterprise Server 12 SP4

The Linux Kernel Module Programming Guide

The Xen Port of Kexec / Kdump a Short Introduction and Status Report

Anatomy of Linux Loadable Kernel Modules a 2.6 Kernel Perspective

Kdump, a Kexec-Based Kernel Crash Dumping Mechanism

Communicating Between the Kernel and User-Space in Linux Using Netlink Sockets

Detecting Exploit Code Execution in Loadable Kernel Modules

DM-Relay - Safe Laptop Mode Via Linux Device Mapper

The Linux Storage Stack Diagram

An Evolutionary Study of Linux Memory Management for Fun and Profit Jian Huang, Moinuddin K

Comparison of Kernel and User Space File Systems

Lab 1: Loadable Kernel Modules (Lkms)