VerifyFS in Btrfs Style (Btrfs End-to-End Data Integrity)

Liu Bo ([email protected])

Btrfs community
• Filesystems span many different use cases
• Btrfs has contributors from many different companies (including Facebook, Fujitsu, FusionIO, Intel, Linux Foundation, Netgear, Novell/SUSE, Oracle, Red Hat, STRATO AG) and many individuals
• A broad community ensures that btrfs is full of interesting features

Btrfs
• Copy on write (COW)
• Writable snapshots, read-only snapshots
• Transparent compression (zlib, lzo)
• Integrated multiple device support
• Built-in RAID with restriping (RAID 0, 1, 10, 5, 6)
• Checksums on data and metadata (crc32c)
• Space-efficient packing of small files
• Conversion of existing ext3/4 file systems
• Subvolume-aware quota support
• Etc.

Data corruptions
• Data from disk != the expected contents
• Why do they happen?
  • At different layers of the storage stack
  • Disk firmware bugs
  • Software bugs
    • Library / kernel errors, e.g. bugs in filesystems and device drivers

Data Integrity
• Why do we need "end to end data integrity" in btrfs?
  • Most filesystems depend on the disk/hardware to detect and report errors
  • Disk firmware is a black box
  • Most filesystems don't guarantee that the data returned is the data you asked for

How to verify data integrity
• Store the checksum with the disk block
  • A disk can be formatted with 520- or 528-byte sectors rather than 512
  • The extra bytes can be used to store a checksum (block-appended checksum)
  • Data and checksum are stored as a unit, so they are self-consistent
[Figure: a sector holding 512 bytes of data followed by 8 or 16 bytes of checksum]

How to verify data integrity (cont.)
• It is harder than it sounds to make good use of a block-level checksum
  • It only proves that a block is self-consistent
  • It doesn't prove that it's the right block
  • The rest of the I/O path from the disk to the host remains unprotected

Solutions
• Fault isolation: store the data block and its checksum separately (e.g. btrfs, zfs)
• Add more information in the extra bytes (e.g. T10's Protection Information, DIF)

Btrfs checksum
• Checksums of data blocks are stored in the checksum tree
• Checksums of metadata blocks and the superblock are stored inside their own blocks
[Figures 1 and 2: the checksum tree (root, leaves holding "data crc ... data crc") and a metadata block / superblock holding its own crc]

Btrfs checksum (cont.)
• Already supports the crc32c algorithm
• Checksumming on all things
  • Superblock, metadata blocks and data blocks
• Fast but insecure
  • crc32c isn't suitable for detecting malicious data in general
  • The goal is just to find blocks that are not correctly returned by the storage
• Recently added support for sha256 as an alternative algorithm

Why sha256?
• Fairly strong
• Slower but secure
• Intel has already developed acceleration instructions for sha256
• The btrfs disk format has a checksum size limit

Another checksum: sha256
• For the superblock and metadata blocks, btrfs has reserved 32 bytes (256 bits) for the checksum
• For data blocks, btrfs stores checksums in the crc tree, with no size limit
• No need to change the disk format!

Schemes
• Schemes to detect malicious changes to the FS data
• The Merkle tree?
  • Root hash

Schemes cont. (1)
• Btrfs + Merkle tree, sounds great?
• Does it work?
  • Unfortunately, no.
• Merkle tree requires...
  • We wouldn't be allowed to write a tree node until all of its children had been checksummed
  • These write-ordering rules for metadata blocks make things difficult under memory pressure

Schemes cont. (2)
• Checksum + 'btrfs scrub'
• Data scrubbing will ...
  • Read the superblocks, metadata blocks and data blocks on disk
  • Verify integrity by checking their checksums
  • If errors occur (checksum failure or EIO), search for a good copy
  • If one is found, overwrite the bad copy with it
• There is a read-only option
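
The scrub pass described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the btrfs kernel implementation: in-memory dictionaries stand in for mirrored devices, a plain dictionary stands in for the checksum tree (so checksums are kept apart from the data they cover), a missing block stands in for EIO, and the names `sha256_of`, `scrub`, `mirrors` and `checksums` are hypothetical.

```python
import hashlib


def sha256_of(block: bytes) -> bytes:
    """Return the 32-byte SHA-256 digest of a block."""
    return hashlib.sha256(block).digest()


def scrub(mirrors, checksums, readonly=False):
    """Verify every block on every mirror against the separately stored checksum.

    mirrors   -- list of dicts mapping block number -> bytes (one dict per copy)
    checksums -- dict mapping block number -> expected digest
    readonly  -- if True, only report bad copies; never overwrite them
    """
    report = {"repaired": [], "detected": [], "unrecoverable": []}
    for block_no, expected in checksums.items():
        good = None
        bad = []
        for idx, mirror in enumerate(mirrors):
            data = mirror.get(block_no)  # None models a read error (EIO)
            if data is not None and sha256_of(data) == expected:
                if good is None:
                    good = data
            else:
                bad.append(idx)
        if not bad:
            continue  # every copy verified
        if good is None:
            report["unrecoverable"].append(block_no)
        elif readonly:
            report["detected"].append(block_no)
        else:
            for idx in bad:
                mirrors[idx][block_no] = good  # overwrite bad copy with good one
            report["repaired"].append(block_no)
    return report


if __name__ == "__main__":
    blocks = {0: b"superblock", 1: b"metadata", 2: b"file data"}
    checksums = {n: sha256_of(d) for n, d in blocks.items()}
    mirror_a, mirror_b = dict(blocks), dict(blocks)
    mirror_b[2] = b"file dat\x00"  # simulate silent corruption on one copy

    print(scrub([mirror_a, mirror_b], checksums, readonly=True))
    print(scrub([mirror_a, mirror_b], checksums))
```

In the demo, the read-only pass only reports block 2 as bad, while the second pass overwrites the corrupted copy with the verified one. Note that repair is possible only because the checksum is stored apart from the block and at least one copy still matches it.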

Demo
• Checksum sha256 + btrfs scrub

Limitations
• For btrfs's superblock and metadata blocks, the checksum is stored inside the block, so there is no fault isolation
  • But they have two or more copies
  • Superblocks have up to 3 copies
  • Metadata blocks have 2 copies
• Filesystem checksums are best at read-time error detection
  • The read could be months later, when the original buffer is lost
  • The redundant copy may also be bad if the buffer was incorrect
• DIF/DIX checksums catch errors at write time, while we still have a chance to recover with the good data in memory

Performance
• Heavily depends on the implementation of sha256 and btrfs scrub

Thank you!
• Questions?
