
Lustre, ZFS, and Data Integrity


Andreas Dilger, Sr. Staff Engineer, Lustre ZFS Team, Sun Microsystems
ZFS BOF, Supercomputing 2009

Topics
• Overview
• ZFS/DMU features
• ZFS and Lustre data integrity
• ZFS-Lustre integration
• Current status

2012 Lustre Requirements

Filesystem Limits
• 240 PB size
• 1 trillion (10^12) files per file system
• 1.5 TB/s aggregate bandwidth
• 100k clients

Single File/Directory Limits
• 10 billion files in a single directory
• 0 to 1 PB file size range

Reliability
• 100h filesystem integrity check
• End-to-end data integrity

ZFS Meets Future Requirements

Capacity
• Single filesystems of 512TB+ (theoretically 2^64 devices × 2^64 bytes)
• Trillions of files in a single file system (theoretically 2^48 files per fileset)
• Dynamic addition of capacity/performance

Reliability and Integrity
• Transaction based, copy-on-write
• Internal redundancy (up to triple parity/mirror and/or 3 copies)
• End-to-end checksums of all data/metadata
• Online integrity verification and reconstruction

Functionality
• Snapshots, filesets, compression, encryption
• Online incremental backup/restore
• Hybrid storage pools (HDD + SSD)
• Portable code base (BSD, Solaris, ... , OS/X?)

Lustre/ZFS Project Background

ZFS-FUSE (2006)
● The first port of ZFS to Linux, running in user space via the FUSE framework

Kernel DMU (2008)
● An LLNL engineer ported the core parts of ZFS to Red Hat Linux as a kernel module

Lustre-kDMU (2008 – present)
● Modify Lustre to work on a ZFS/DMU storage backend in a production environment

ZFS Overview

ZFS POSIX Layer (ZPL)
● Interface to user applications
● Directories, namespace, ACLs, xattrs

Data Management Unit (DMU)
● Objects, datasets, snapshots
● Copy-on-write atomic transactions
● name->value lookup tables (ZAP)

Adaptive Replacement Cache (ARC)
● Manages RAM and flash cache space

Storage Pool Allocator (SPA)
● Allocates blocks at IO time
● Manages multiple disks directly
● Mirroring, parity, compression

Lustre/Kernel DMU Modules

[Diagram: the Lustre and kernel DMU module stack.]

Lustre Software Stack Updates

                 1.8 MDS    1.8/2.0 OSS   2.0 MDS   2.1 MDS   2.1 OSS
  Target         MDT        OST           MDT       MDT       OST
  Object layer   MDS        obdfilter     MDD       MDD       OFD
  Backend API    fsfilt     fsfilt        OSD       OSD       OSD
  Backend fs     ldiskfs    ldiskfs       ldiskfs   DMU       DMU

How Safe Is Your Data?

• There are many sources of data corruption:
  – Software/firmware: client, NIC, router, HBA, disk
  – RAM, CPU, disk cache, media
  – Client network, storage network, cables

[Diagram: the I/O path from the application on the client, through the client, router, and server network stacks (transport, network, and link layers), to the storage behind the server.]

ZFS: Industry-Leading Data Integrity

End-to-End Data Integrity

Lustre Network Checksum
• Detects data corruption over the network
• ldiskfs (ext3/4) does not checksum data on disk

ZFS stores checksums of data/metadata
• Fast (Fletcher-4 default, or none)
• Strong (SHA-256)

Combine for End-to-End Integrity
• Integrate Lustre and ZFS checksums
• Avoid recomputing the full checksum over the data
• Always overlap checksum coverage
• Use a scalable tree-hash method

ZFS BOF Supercomputing 2009 11 Hash Tree and Multiple Sizes

1024kB LUSTRE RPC SIZE ROOT HASH

512kB

256kB

128kB ZFS BLOCK 64kB MAX PAGE

32kB

16kB

8kB INTERIOR HASH 4kB LEAF HASH E N I G DATA A M

P SEGMENT
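The point of the diagram is that one tree serves every granularity: the same 4kB leaf hashes roll up into 16kB page hashes on the client, 128kB block hashes on the server, and a single root hash per 1MB RPC. Below is a minimal sketch of that construction; the choice of SHA-256, the leaf/interior domain prefixes, and the pairing rule are assumptions for illustration only, not the actual Lustre/ZFS code (LH/IH are as defined on the "Hash Tree For Non-contiguous Data" slide near the end of the deck).

import hashlib

SEGMENT = 4 * 1024          # 4kB leaf data segment, as in the diagram

def leaf_hash(data: bytes) -> bytes:
    """LH(x): hash of one data segment."""
    return hashlib.sha256(b"L" + data).digest()

def interior_hash(left: bytes, right: bytes) -> bytes:
    """IH(x+y): hash of two concatenated child hashes."""
    return hashlib.sha256(b"I" + left + right).digest()

def hash_tree_levels(buf: bytes) -> list:
    """Build every level of the tree, from 4kB leaves up to the root.

    For a 1MB buffer: levels[0] has 256 leaf hashes (4kB each),
    levels[2] has 64 page hashes (16kB), levels[5] has 8 block hashes
    (128kB), and levels[-1][0] is the root hash over the whole RPC.
    """
    level = [leaf_hash(buf[off:off + SEGMENT])
             for off in range(0, len(buf), SEGMENT)]
    levels = [level]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                nxt.append(interior_hash(level[i], level[i + 1]))
            else:
                nxt.append(level[i])    # odd node carried up unchanged
        levels.append(nxt)
        level = nxt
    return levels

rpc = bytes(1024 * 1024)                 # one 1MB Lustre write RPC worth of data
levels = hash_tree_levels(rpc)
page_hashes  = levels[2]                 # 16kB granularity (client page size)
block_hashes = levels[5]                 # 128kB granularity (ZFS block size)
K = levels[-1][0]                        # root hash carried in the RPC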

End-to-End Integrity: Client Write

[Diagram: the application calls write(fd, buffer, size=1MB); the kernel copies or references the buffer as 64 pages of 16kB (page 0 through page 63). The client computes 4kB leaf hashes C0..C255, combines them into 16kB page hashes P0..P63, and continues up the hash tree to the 1MB RPC root hash K. The 1024kB of data is transferred to the server by bulk RDMA, and K travels in the write RPC.]

End-to-End Integrity: Server Write

[Diagram: the server receives the 1024kB of bulk data by RDMA and the root hash K in the write RPC. The OST recomputes 4kB leaf hashes S0..S255 over the received data, combines them into 128kB block hashes B0..B7 and up to its own root hash K', and checks K = K' before committing. Because the block hashes align with the 128kB ZFS block size, B0..B7 can be stored with the DMU block pointers (0-128kB → B0, ..., 896-1024kB → B7).]
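Continuing the sketch from the hash-tree slide above (same illustrative helpers, not the real implementation), the write path reduces to: the client sends the root hash K with the RPC, the server recomputes K' from the bulk data it actually received and rejects the write on a mismatch, and the 128kB-level hashes can be handed down with the DMU write so the stored block checksums match what was just verified, without rehashing the data.

# Client: hash the 1MB write buffer once and send K with the write RPC.
client_levels = hash_tree_levels(rpc)         # rpc = the 1MB buffer being written
K = client_levels[-1][0]

# Server: recompute the tree over the data received via bulk RDMA.
received = rpc                                # stand-in for the RDMA'd data
server_levels = hash_tree_levels(received)
K_prime = server_levels[-1][0]

if K_prime != K:
    raise IOError("client/server checksum mismatch - reject the write")

# The 128kB-level hashes B0..B7 align with the ZFS block size, so they can
# accompany the DMU write instead of recomputing checksums over the data.
block_hashes = server_levels[5]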

Integrity Checking Targets

Scale of the proposed filesystem is enormous
• 1 trillion files checked in 100h
• 2-4PB of MDT filesystem metadata
• ~3 million files/sec, 3GB/s+ for one pass (see the back-of-the-envelope check below)

Need to handle CMD (Clustered Metadata) coherency as well
• Distributed link counts on files and directories
• Directory parent/child relationships
• Filename to FID to inode mapping
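A quick back-of-the-envelope check of those rates; the ~1kB of metadata read per file is an assumption used only to reproduce the bandwidth figure.

files  = 10**12                 # 1 trillion files to verify
window = 100 * 3600             # 100 hour window, in seconds

files_per_sec = files / window           # ~2.8 million files/sec
bandwidth = files_per_sec * 1024         # ~2.8 GB/s if ~1kB is read per file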

Distributed Coherency Checking

Traverse the MDT filesystem(s) only once
• Piggyback on ZFS resilver/scrub to avoid extra IO
• Lustre processes blocks after checksum validation

Validate OST/remote MDT data opportunistically
• Track which objects have not been validated (1 bit/object; see the toy sketch below)
• Each object has a back-reference to its user for validation
• The back-reference avoids the need to keep a global reference table

Lazily scan OST/MDT in the background
• Load-sensitive scanning
• Continuous online verification/fix of distributed state
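The bookkeeping can be pictured with a toy model; this is purely illustrative, and the names and structures below are hypothetical rather than Lustre data structures. One bit per OST object records whether it has been cross-checked, and the back-reference stored with each object is compared against the MDT inode that claims it as the scrub pass encounters the reference.

num_objects = 1_000_000      # example OST object count
back_refs = {}               # obj_id -> owning MDT FID (toy stand-in for the
                             # back-reference stored with each OST object)
validated = bytearray((num_objects + 7) // 8)   # 1 bit per object, no global table

def is_validated(obj_id):
    return bool(validated[obj_id >> 3] & (1 << (obj_id & 7)))

def mark_validated(obj_id):
    validated[obj_id >> 3] |= 1 << (obj_id & 7)

def check_object(obj_id, owner_fid):
    """Cross-check one OST object when the MDT scrub finds a layout naming it."""
    if is_validated(obj_id):
        return True                          # already verified in this pass
    ok = back_refs.get(obj_id) == owner_fid  # back-reference must point at the owner
    mark_validated(obj_id)
    return ok                                # False -> flag for repair handling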

Lustre DMU Development Status

Data Alpha
● OST basic functionality

MD Alpha
● MDS basic functionality, beginning recovery support

Beta
● Layouts, performance improvement, full recovery

GA
● Full performance in production release

Thank You

Questions?

Hash Tree For Non-contiguous Data

Definitions:
  LH(x) = hash of data x (leaf)
  IH(x) = hash of data x (interior)
  x + y = concatenation of x and y

Example with data segments 0, 1, 4, 5, 7 present:
  C0   = LH(data segment 0)
  C1   = LH(data segment 1)
  C4   = LH(data segment 4)
  C5   = LH(data segment 5)
  C7   = LH(data segment 7)
  C01  = IH(C0 + C1)
  C45  = IH(C4 + C5)
  C457 = IH(C45 + C7)
  K    = IH(C01 + C457) = ROOT hash
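The same construction in code, as a small self-contained sketch. Only the sibling-pairing rule comes from the slide; SHA-256 and the leaf/interior domain prefixes are assumptions matching the earlier sketch.

import hashlib

def LH(data: bytes) -> bytes:
    """Leaf hash of one data segment."""
    return hashlib.sha256(b"L" + data).digest()

def IH(x: bytes, y: bytes) -> bytes:
    """Interior hash of the concatenation x + y."""
    return hashlib.sha256(b"I" + x + y).digest()

def sparse_root(segments: dict) -> bytes:
    """Root hash over non-contiguous segments keyed by segment index.

    At each level, siblings (2i, 2i+1) are combined with IH() when both
    exist; a lone child is carried up unchanged, so for segments
    {0, 1, 4, 5, 7} this computes K = IH(IH(C0+C1) + IH(IH(C4+C5) + C7)),
    exactly as on the slide.
    """
    level = {idx: LH(data) for idx, data in segments.items()}
    while len(level) > 1:
        nxt = {}
        for idx in sorted(level):
            parent = idx // 2
            nxt[parent] = (level[idx] if parent not in nxt
                           else IH(nxt[parent], level[idx]))
        level = nxt
    return next(iter(level.values()))

segments = {i: bytes(4096) for i in (0, 1, 4, 5, 7)}   # the 4kB segments present
K = sparse_root(segments)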

T10-DIF vs. Hash Tree

T10-DIF
CRC-16 Guard Word
• All 1-bit errors
• All adjacent 2-bit errors
• Single 16-bit burst error
• 10^-5 bit error rate
32-bit Reference Tag
• Misplaced write, except offsets of n·2TB
• Misplaced read, except offsets of n·2TB

Lustre/ZFS Hash Tree
Fletcher-4 Checksum (sketched below)
• All 1-, 2-, 3-, 4-bit errors
• All errors affecting 4 or fewer 32-bit words
• Single 128-bit burst error
• 10^-13 bit error rate
Hash Tree
• Misplaced read
• Misplaced write
• Phantom write
• Bad RAID reconstruction
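For reference, Fletcher-4 (the ZFS default block checksum) is four running 64-bit sums over the 32-bit words of a block, which is what gives it the stronger burst and multi-bit coverage listed above. A minimal sketch, assuming the input length is a multiple of 4 bytes and native byte order; the real implementation lives in zfs_fletcher.c in the ZFS source.

import struct

def fletcher4(data: bytes):
    """Fletcher-4 sketch: four 64-bit running sums over 32-bit words.

    ZFS keeps the four sums as the 256-bit checksum stored in the block pointer.
    """
    mask = (1 << 64) - 1
    a = b = c = d = 0
    for (word,) in struct.iter_unpack("=I", data):
        a = (a + word) & mask
        b = (b + a) & mask
        c = (c + b) & mask
        d = (d + c) & mask
    return a, b, c, d

checksum = fletcher4(bytes(128 * 1024))     # e.g. one 128kB ZFS block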

ZFS Hybrid Pool Example

Common hardware: 4 Xeon 7350 processors (16 cores), 32GB FB DDR2 ECC DRAM, running OpenSolaris with ZFS

Configuration A: seven 146GB 10,000 RPM SAS drives
Configuration B: five 400GB 4200 RPM SATA drives, plus a 32GB SSD ZIL device and an 80GB SSD cache device

ZFS Hybrid Pool Example: Results

Hybrid storage (DRAM + SSD + 5x 4200 RPM SATA) vs. traditional storage (DRAM + 7x 10K RPM SAS):
• Read IOPS: 3.2x
• Write IOPS: 1.1x
• Power: 1/5
• Raw capacity: 2x (2TB)
