ZEA, A Data Management Approach for SMR

Adam Manzanares

Co-Authors
• Western Digital Research – Cyril Guyot, Damien Le Moal, Zvonimir Bandic
• University of California, Santa Cruz – Noah Watkins, Carlos Maltzahn

Why SMR?

• HDDs are not going away
  – Exponential growth of data still exists
  – $/TB vs. flash is still much lower
  – We want to continue this trend!
• Traditional recording (PMR) is reaching scalability limits
  – SMR is a density-enhancing technology being shipped right now
• Future recording technologies may behave like SMR
  – Write constraint similarities
  – HAMR

Flavors of SMR

• SMR constraint
  – Writes must be sequential to avoid data loss
• Drive Managed
  – Transparent to the user
  – Comes at the cost of predictability and additional drive resources
• Host Aware
  – Host is aware of the SMR working model
  – If the user does something “wrong”, the drive will fix the problem internally
• Host Managed
  – Host is aware of the SMR working model
  – If the user does something “wrong”, the drive will reject the IO request

SMR Drive Device Model

• New SCSI standard: Zoned Block Commands (ZBC)
  – SATA equivalent: ZAC
• Drive is described by zones and their restrictions

Type                 | Write Restriction | Intended Use             | Con                  | Abbreviation
---------------------|-------------------|--------------------------|----------------------|-------------
Conventional         | None              | In-place updates         | Increased resources  | CZ
Sequential Preferred | None              | Mostly sequential writes | Variable performance | SPZ
Sequential Required  | Sequential write  | Only sequential writes   |                      | SRZ

• Our library (libzbc) queries zone information from the drive
  – https://github.com/hgst/libzbc
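For context, zone discovery looks roughly like the sketch below. zbc_open, zbc_list_zones, and zbc_close are real libzbc entry points, but the zone accessor names (zbc_zone_type, zbc_zone_start, zbc_zone_wp) have changed across libzbc releases, so treat the details as illustrative rather than exact.

```cpp
// Minimal sketch of zone discovery with libzbc. Accessor macro names
// differ between libzbc versions; check the zbc.h you build against.
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <libzbc/zbc.h>

int main(int argc, char **argv) {
    struct zbc_device *dev;
    struct zbc_zone *zones = NULL;
    unsigned int nr_zones, i;

    if (argc < 2 || zbc_open(argv[1], O_RDONLY, &dev) != 0)
        return 1;

    /* Ask the drive for all of its zones (REPORT ZONES under the hood). */
    if (zbc_list_zones(dev, 0, ZBC_RO_ALL, &zones, &nr_zones) == 0) {
        for (i = 0; i < nr_zones; i++)
            printf("zone %u: type %u start %llu wp %llu\n", i,
                   (unsigned)zbc_zone_type(&zones[i]),
                   (unsigned long long)zbc_zone_start(&zones[i]),
                   (unsigned long long)zbc_zone_wp(&zones[i]));
        free(zones);
    }
    zbc_close(dev);
    return 0;
}
```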

Why Host Managed SMR?

• We are chasing predictability
  – Large systems with complex data flows
    • Demand relatively tight latency windows from storage devices
    • Translates into latency guarantees from the larger system
• Drive Managed HDDs
  – When is GC triggered?
  – How long does GC take?
• Host Aware HDDs
  – Seem OK if you do the “right” thing
  – Degrade to drive managed behavior when you don’t
• Host Managed
  – Complexity and variance of drive-internal schemes are now gone

Host Managed SMR seems great, but ...

[Figure: The Linux Storage Stack Diagram, version 4.0 (2015-06-01), covering kernel 4.0. Created by Werner Fischer and Georg Schönberger, http://www.thomas-krenn.com/en/wiki/Linux_Storage_Stack_Diagram, license CC-BY-SA 3.0, http://creativecommons.org/licenses/by-sa/3.0/]

• All of these layers must be SMR compatible
  – Any IO reordering causes a problem
  – Reordering must not happen at any point between the user and the device
• What about my new SMR FS?
  – What is your GC policy?
  – Is it tunable?
  – Does your FS introduce latency at your discretion?
• What about my new KV store that is SMR compatible?
  – See above
  – In addition, is it built on top of a FS?

What Did We Do?

• Introduce a zone and block based allocator [ZEA]
• Write Logical Extent [Zone Block Address]
  – Returns drive LBA
• Read Extent [Logical Block Address]
  – Returns data if the extent is valid
• Iterators over extent metadata
  – Allow one to build the ZBA -> LBA mapping

[Figure: ZEA + SMR Drive, with per-write metadata]
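The operations above map onto a small extent-allocator interface. The following C++ sketch only illustrates that shape; the names (Zea, WriteExtent, ExtentMeta) are hypothetical, not ZEA's actual code.

```cpp
#include <cstdint>

// Hypothetical sketch of the interface the slide describes; all names
// are illustrative, not ZEA's actual implementation.
using zba_t = std::uint64_t;  // zone block address, chosen by the caller
using lba_t = std::uint64_t;  // drive logical block address

struct ExtentMeta {
    zba_t zba;          // logical address of the extent
    lba_t lba;          // drive LBA where the extent was appended
    std::uint32_t len;  // extent length in bytes
    bool valid;         // cleared when the extent is superseded
};

class Zea {
public:
    virtual ~Zea() = default;

    // Append an extent at the write pointer of the open zone, persist
    // per-write metadata alongside it, and return the drive LBA used.
    virtual lba_t WriteExtent(zba_t zba, const void *buf,
                              std::uint32_t len) = 0;

    // Return the extent's data if its ZBA -> LBA mapping is still valid.
    virtual bool ReadExtent(zba_t zba, void *buf, std::uint32_t len) = 0;

    // Visit per-write metadata in on-disk order; enough to rebuild the
    // ZBA -> LBA map at startup or to feed a garbage collector.
    virtual const ExtentMeta *MetaBegin() const = 0;
    virtual const ExtentMeta *MetaEnd() const = 0;
};
```

Because the application addresses extents by ZBA while ZEA chooses the physical LBA, relocating an extent (e.g. during GC) only updates the mapping, which is the addressing flexibility called out later in the talk.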

ZEA Performance vs. Block Device

ZEA Is Just an Extent Allocator, What Next?

• ZEA + LevelDB
• LevelDB is a widely used KV store library
  – The LevelDB backend API is a good fit for SMR:
    • Write files append-only, read randomly, create & delete files, flat directory
  – Let’s build a simple FS compatible with LevelDB (sketched below)
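LevelDB reaches storage through its leveldb::Env abstraction, which is why the short backend API above is all a minimal FS has to provide. A sketch of the write path follows, assuming the hypothetical Zea interface from the earlier slide; leveldb::WritableFile and its methods are real LevelDB API, while everything ZEA-side is illustrative.

```cpp
#include "leveldb/env.h"
#include "leveldb/slice.h"
#include "leveldb/status.h"

// Sketch: an append-only "file" that turns LevelDB writes into ZEA
// extent appends. Zea, zba_t, and WriteExtent are the hypothetical
// interface sketched earlier, not LevelDB or ZEA source.
class ZeaWritableFile : public leveldb::WritableFile {
 public:
  ZeaWritableFile(Zea *zea, zba_t base) : zea_(zea), next_(base) {}

  // LevelDB only appends to logs and SSTables, which matches the
  // sequential-write constraint of an SMR zone.
  leveldb::Status Append(const leveldb::Slice &data) override {
    zea_->WriteExtent(next_, data.data(),
                      static_cast<std::uint32_t>(data.size()));
    next_ += data.size();
    return leveldb::Status::OK();
  }
  leveldb::Status Close() override { return leveldb::Status::OK(); }
  leveldb::Status Flush() override { return leveldb::Status::OK(); }
  leveldb::Status Sync() override  { return leveldb::Status::OK(); }

 private:
  Zea *zea_;    // extent allocator backing this file
  zba_t next_;  // next logical address to append at
};
```

A complete backend would also implement Env hooks such as NewSequentialFile, NewRandomAccessFile, DeleteFile, and GetChildren over the flat extent namespace.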

[Figure: ZEA + LevelDB architecture]

ZEA + LevelDB Performance

Lessons Learned

• ZEA is a lightweight abstraction
  – Hides the sequential write constraint from the application
  – Low overhead vs. a block device when the extent size is reasonable
  – Provides addressing flexibility to the application
    • Useful for GC
• LevelDB integration opens up the usability of Host Managed SMR HDDs
• Unfortunately, LevelDB is not so great for large objects
  – The ideal use case for an SMR drive would be large static blobs

What Is Left?

• What is a “good” interface above ZEA?
• Garbage collection policies
  – When and how (one possible shape is sketched below)
• How to use multiple zones efficiently
  – Allocation
  – Garbage collection
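As one illustration of the “when and how” question, a greedy policy could reclaim the zone holding the least live data. This hypothetical sketch reuses the Zea/ExtentMeta types sketched earlier; it is not a policy this work implemented or evaluated.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical greedy zone-reclamation sketch (all names illustrative).
struct LiveZone {
    std::uint32_t id;
    std::vector<ExtentMeta> extents;  // per-write metadata in this zone
    std::uint64_t LiveBytes() const {
        std::uint64_t n = 0;
        for (const ExtentMeta &e : extents)
            if (e.valid) n += e.len;
        return n;
    }
};

// "When": trigger once free zones drop below a watermark.
// "How": pick the zone with the least live data, rewrite its live
// extents through ZEA (which appends them to the open zone), after
// which the victim zone can be reset and reused.
void MaybeCollect(Zea &zea, std::vector<LiveZone> &zones,
                  std::size_t free_zones, std::size_t watermark) {
    if (free_zones >= watermark || zones.empty()) return;

    auto victim = std::min_element(
        zones.begin(), zones.end(),
        [](const LiveZone &a, const LiveZone &b) {
            return a.LiveBytes() < b.LiveBytes();
        });

    std::vector<std::uint8_t> buf;
    for (const ExtentMeta &e : victim->extents) {
        if (!e.valid) continue;
        buf.resize(e.len);
        if (zea.ReadExtent(e.zba, buf.data(), e.len))
            zea.WriteExtent(e.zba, buf.data(), e.len);  // relocate extent
    }
    victim->extents.clear();  // zone now empty; reset it via ZBC elsewhere
}
```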

• Thanks for listening

[email protected]

©2016 Western Digital Corporation or affiliates. All rights reserved.