ZEA, A Data Management Approach for SMR

Adam Manzanares

Co-Authors
• Western Digital Research – Cyril Guyot, Damien Le Moal, Zvonimir Bandic
• University of California, Santa Cruz – Noah Watkins, Carlos Maltzahn

Why SMR?

• HDDs are not going away
  – Exponential growth of data still exists
  – $/TB vs. flash is still much lower
  – We want to continue this trend!
• Traditional recording (PMR) is reaching scalability limits
  – SMR is a density-enhancing technology being shipped right now
• Future recording technologies may behave like SMR
  – Write constraint similarities
  – HAMR

Flavors of SMR

• SMR constraint
  – Writes must be sequential to avoid data loss
• Drive Managed
  – Transparent to the user
  – Comes at the cost of predictability and additional drive resources
• Host Aware
  – Host is aware of the SMR working model
  – If the user does something “wrong”, the drive will fix the problem internally
• Host Managed
  – Host is aware of the SMR working model
  – If the user does something “wrong”, the drive will reject the IO request

SMR Drive Device Model

• New SCSI standard: Zoned Block Commands (ZBC)
  – SATA equivalent: ZAC
• Drive is described by zones and their restrictions

Type                 | Write Restriction | Intended Use             | Con                  | Abbreviation
---------------------|-------------------|--------------------------|----------------------|-------------
Conventional         | None              | In-place updates         | Increased resources  | CZ
Sequential Preferred | None              | Mostly sequential writes | Variable performance | SPZ
Sequential Required  | Sequential write  | Only sequential writes   |                      | SRZ

• Our library (libzbc) queries zone information from the drive
  – https://github.com/hgst/libzbc
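For context, zone discovery looks roughly like the sketch below. zbc_open, zbc_list_zones, and zbc_close are real libzbc entry points, but the zone accessor names (zbc_zone_type, zbc_zone_start, zbc_zone_wp) have changed across libzbc releases, so treat the details as illustrative rather than exact.

```cpp
// Minimal sketch of zone discovery with libzbc. Accessor macro names
// differ between libzbc versions; check the zbc.h you build against.
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <libzbc/zbc.h>

int main(int argc, char **argv) {
    struct zbc_device *dev;
    struct zbc_zone *zones = NULL;
    unsigned int nr_zones, i;

    if (argc < 2 || zbc_open(argv[1], O_RDONLY, &dev) != 0)
        return 1;

    /* Ask the drive for all of its zones (REPORT ZONES under the hood). */
    if (zbc_list_zones(dev, 0, ZBC_RO_ALL, &zones, &nr_zones) == 0) {
        for (i = 0; i < nr_zones; i++)
            printf("zone %u: type %u start %llu wp %llu\n", i,
                   (unsigned)zbc_zone_type(&zones[i]),
                   (unsigned long long)zbc_zone_start(&zones[i]),
                   (unsigned long long)zbc_zone_wp(&zones[i]));
        free(zones);
    }
    zbc_close(dev);
    return 0;
}
```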

Why Host Managed SMR?

• We are chasing predictability
  – Large systems with complex data flows
    • Demand relatively tight latency windows from storage devices
    • Translates into latency guarantees from the larger system
• Drive Managed HDDs
  – When is GC triggered?
  – How long does GC take?
• Host Aware HDDs
  – Seem OK if you do the “right” thing
  – Degrade to drive managed behavior when you don’t
• Host Managed
  – Complexity and variance of drive-internal schemes are now gone

Host Managed SMR seems great, but ...

[Figure: The Linux Storage Stack Diagram, version 4.0 (2015-06-01), covering kernel 4.0. Created by Werner Fischer and Georg Schönberger, http://www.thomas-krenn.com/en/wiki/Linux_Storage_Stack_Diagram, license CC-BY-SA 3.0, http://creativecommons.org/licenses/by-sa/3.0/]

• All of these layers must be SMR compatible
  – Any IO reordering causes a problem
  – Reordering must not happen at any point between the user and the device
• What about my new SMR FS?
  – What is your GC policy?
  – Is it tunable?
  – Does your FS introduce latency at your discretion?
• What about my new KV store that is SMR compatible?
  – See above
  – In addition, is it built on top of a FS?

What Did We Do?

• Introduce a zone and block based allocator [ZEA]
• Write Logical Extent [Zone Block Address]
  – Returns drive LBA
• Read Extent [Logical Block Address]
  – Returns data if the extent is valid
• Iterators over extent metadata
  – Allow one to build the ZBA -> LBA mapping

[Figure: ZEA + SMR Drive, with per-write metadata]
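The operations above map onto a small extent-allocator interface. The following C++ sketch only illustrates that shape; the names (Zea, WriteExtent, ExtentMeta) are hypothetical, not ZEA's actual code.

```cpp
#include <cstdint>

// Hypothetical sketch of the interface the slide describes; all names
// are illustrative, not ZEA's actual implementation.
using zba_t = std::uint64_t;  // zone block address, chosen by the caller
using lba_t = std::uint64_t;  // drive logical block address

struct ExtentMeta {
    zba_t zba;          // logical address of the extent
    lba_t lba;          // drive LBA where the extent was appended
    std::uint32_t len;  // extent length in bytes
    bool valid;         // cleared when the extent is superseded
};

class Zea {
public:
    virtual ~Zea() = default;

    // Append an extent at the write pointer of the open zone, persist
    // per-write metadata alongside it, and return the drive LBA used.
    virtual lba_t WriteExtent(zba_t zba, const void *buf,
                              std::uint32_t len) = 0;

    // Return the extent's data if its ZBA -> LBA mapping is still valid.
    virtual bool ReadExtent(zba_t zba, void *buf, std::uint32_t len) = 0;

    // Visit per-write metadata in on-disk order; enough to rebuild the
    // ZBA -> LBA map at startup or to feed a garbage collector.
    virtual const ExtentMeta *MetaBegin() const = 0;
    virtual const ExtentMeta *MetaEnd() const = 0;
};
```

Because the application addresses extents by ZBA while ZEA chooses the physical LBA, relocating an extent (e.g. during GC) only updates the mapping, which is the addressing flexibility called out later in the talk.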

ZEA Performance vs. Block Device

ZEA Is Just an Extent Allocator, What Next?

• ZEA + LevelDB
• LevelDB is a widely used KV store library
  – The LevelDB backend API is a good fit for SMR:
    • Write files append-only, read randomly, create & delete files, flat directory
  – Let’s build a simple FS compatible with LevelDB (sketched below)
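LevelDB reaches storage through its leveldb::Env abstraction, which is why the short backend API above is all a minimal FS has to provide. A sketch of the write path follows, assuming the hypothetical Zea interface from the earlier slide; leveldb::WritableFile and its methods are real LevelDB API, while everything ZEA-side is illustrative.

```cpp
#include "leveldb/env.h"
#include "leveldb/slice.h"
#include "leveldb/status.h"

// Sketch: an append-only "file" that turns LevelDB writes into ZEA
// extent appends. Zea, zba_t, and WriteExtent are the hypothetical
// interface sketched earlier, not LevelDB or ZEA source.
class ZeaWritableFile : public leveldb::WritableFile {
 public:
  ZeaWritableFile(Zea *zea, zba_t base) : zea_(zea), next_(base) {}

  // LevelDB only appends to logs and SSTables, which matches the
  // sequential-write constraint of an SMR zone.
  leveldb::Status Append(const leveldb::Slice &data) override {
    zea_->WriteExtent(next_, data.data(),
                      static_cast<std::uint32_t>(data.size()));
    next_ += data.size();
    return leveldb::Status::OK();
  }
  leveldb::Status Close() override { return leveldb::Status::OK(); }
  leveldb::Status Flush() override { return leveldb::Status::OK(); }
  leveldb::Status Sync() override  { return leveldb::Status::OK(); }

 private:
  Zea *zea_;    // extent allocator backing this file
  zba_t next_;  // next logical address to append at
};
```

A complete backend would also implement Env hooks such as NewSequentialFile, NewRandomAccessFile, DeleteFile, and GetChildren over the flat extent namespace.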

[Figure: ZEA + LevelDB architecture]

ZEA + LevelDB Performance

Lessons Learned

• ZEA is a lightweight abstraction
  – Hides the sequential write constraint from the application
  – Low overhead vs. a block device when the extent size is reasonable
  – Provides addressing flexibility to the application
    • Useful for GC
• LevelDB integration opens up the usability of Host Managed SMR HDDs
• Unfortunately, LevelDB is not so great for large objects
  – The ideal use case for an SMR drive would be large static blobs

What Is Left?

• What is a “good” interface above ZEA?
• Garbage collection policies
  – When and how (one possible shape is sketched below)
• How to use multiple zones efficiently
  – Allocation
  – Garbage collection
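As one illustration of the “when and how” question, a greedy policy could reclaim the zone holding the least live data. This hypothetical sketch reuses the Zea/ExtentMeta types sketched earlier; it is not a policy this work implemented or evaluated.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical greedy zone-reclamation sketch (all names illustrative).
struct LiveZone {
    std::uint32_t id;
    std::vector<ExtentMeta> extents;  // per-write metadata in this zone
    std::uint64_t LiveBytes() const {
        std::uint64_t n = 0;
        for (const ExtentMeta &e : extents)
            if (e.valid) n += e.len;
        return n;
    }
};

// "When": trigger once free zones drop below a watermark.
// "How": pick the zone with the least live data, rewrite its live
// extents through ZEA (which appends them to the open zone), after
// which the victim zone can be reset and reused.
void MaybeCollect(Zea &zea, std::vector<LiveZone> &zones,
                  std::size_t free_zones, std::size_t watermark) {
    if (free_zones >= watermark || zones.empty()) return;

    auto victim = std::min_element(
        zones.begin(), zones.end(),
        [](const LiveZone &a, const LiveZone &b) {
            return a.LiveBytes() < b.LiveBytes();
        });

    std::vector<std::uint8_t> buf;
    for (const ExtentMeta &e : victim->extents) {
        if (!e.valid) continue;
        buf.resize(e.len);
        if (zea.ReadExtent(e.zba, buf.data(), e.len))
            zea.WriteExtent(e.zba, buf.data(), e.len);  // relocate extent
    }
    victim->extents.clear();  // zone now empty; reset it via ZBC elsewhere
}
```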

• Thanks for listening

[email protected]

©2016 Western Digital Corporation or affiliates. All rights reserved.