VerifyFS in Style (Btrfs end to end Integrity)

Liu Bo ([email protected]) Btrfs community

• Filesystems span many different use cases

• Btrfs has contributors from many different companies (including FusionIO, Netgear, Novell/SUSE, Oracle, Red Hat, and STRATO AG, among others) and many individuals

• A broad community ensures that btrfs is full of interesting features

Btrfs

• Copy on Write (COW)
• Writable snapshots, read-only snapshots
• Transparent compression (zlib, lzo)
• Integrated multiple device support
• Built-in RAID with restriping (0, 1, 10, 5, 6)
• Checksums on data and metadata (crc32c)
• Space-efficient packing of small files
• Conversion of existing ext3/4 file systems
• Subvolume-aware quota support
• Etc.

Data corruptions

• Data from disk != the expected contents
• Why do they happen?
• At different layers of the storage stack
• Disk firmware bugs
• Kernel errors, e.g. bugs in filesystems and device drivers

• Why do we need "end-to-end data integrity" in btrfs?
• Most filesystems depend on the disk/hardware to detect and report errors
• Disk firmware is a black box
• Most filesystems don't guarantee that the data returned is the data you asked for

How to verify data integrity

• Store a checksum with each disk block
• The disk can be formatted with 520- or 528-byte sectors rather than 512
• The extra bytes can be used to store a checksum (block-appended checksum)
• Data and checksum are stored as a unit, so they are self-consistent

[Figure: one sector, 512 bytes of data followed by 8 or 16 bytes of checksum]

How to verify data integrity (cont.)
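As a rough sketch of the block-appended layout described above, the snippet below builds a 520-byte sector from 512 bytes of data plus an 8-byte checksum field and checks it for self-consistency. The helper names (`make_sector`, `sector_is_self_consistent`) are made up for illustration, and `zlib.crc32` (plain CRC-32, padded to 8 bytes) stands in for whatever checksum a real device would use.

```python
import zlib

SECTOR_DATA = 512   # payload bytes per sector
CSUM_BYTES = 8      # extra bytes available in a 520-byte sector


def _csum(data: bytes) -> bytes:
    # Assumption: CRC-32 padded to 8 bytes; real hardware formats differ.
    return zlib.crc32(data).to_bytes(CSUM_BYTES, "little")


def make_sector(data: bytes) -> bytes:
    """Return data followed by its checksum (a 520-byte unit)."""
    if len(data) != SECTOR_DATA:
        raise ValueError("expected exactly 512 bytes of data")
    return data + _csum(data)


def sector_is_self_consistent(sector: bytes) -> bool:
    """True if the stored checksum matches the data in the same sector."""
    data, stored = sector[:SECTOR_DATA], sector[SECTOR_DATA:]
    return _csum(data) == stored


good = make_sector(b"A" * SECTOR_DATA)
# Flip one bit in the payload to simulate on-media corruption.
flipped = bytes([good[0] ^ 0x01]) + good[1:]
```

Note that the check only compares a sector against itself; nothing in it says which sector the caller meant to read.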

• It is harder than it sounds to make good use of a block-level checksum
• It only proves that a block is self-consistent
• It doesn't prove that it's the right block
• The rest of the I/O path from the disk to the host remains unprotected

Solutions

• Fault isolation: store the data block and its checksum separately (e.g. btrfs)

• Add more information in the extra bytes (e.g. T10's Protection Information, DIF)

Btrfs checksum
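The fault-isolation idea (keeping the checksum apart from the data block) can be sketched with a toy model. Everything here is hypothetical: `ToyDisk`, `fs_write`, `fs_read` and the `misdirect_to` knob exist only to simulate a device that returns a valid but wrong block, and the dictionary of checksums is not btrfs's actual on-disk structure.

```python
import zlib


class ToyDisk:
    """Blocks addressed by number; can be told to return the wrong block."""

    def __init__(self):
        self.blocks = {}

    def write(self, blockno, data):
        self.blocks[blockno] = data

    def read(self, blockno, misdirect_to=None):
        # misdirect_to simulates a firmware bug that fetches another block.
        return self.blocks[blockno if misdirect_to is None else misdirect_to]


disk = ToyDisk()
checksums = {}  # stored separately from the data, keyed by block number


def fs_write(blockno, data):
    disk.write(blockno, data)
    checksums[blockno] = zlib.crc32(data)


def fs_read(blockno, misdirect_to=None):
    data = disk.read(blockno, misdirect_to)
    # The expected checksum is looked up by the block we asked for,
    # so a self-consistent but wrong block is still rejected.
    if zlib.crc32(data) != checksums[blockno]:
        raise IOError(f"checksum mismatch on block {blockno}")
    return data


fs_write(1, b"block one")
fs_write(2, b"block two")
```

Calling `fs_read(1)` returns the data, while `fs_read(1, misdirect_to=2)` raises, because the checksum is tied to the requested block rather than to whatever the device handed back.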

• Checksums of data blocks are stored in the checksum tree
• Checksums of metadata blocks and the superblock are stored inside their own blocks

[Figures 1 and 2: a checksum tree whose root leads to leaves holding data crcs; a metadata block / superblock holding its own crc]

Btrfs checksum cont.

• Already supports the crc32c algorithm
• Checksumming of everything
• Superblock, metadata blocks and data blocks
• Fast but insecure
• crc32c isn't suitable for detecting malicious data in general
• The goal is just to find blocks that are not correctly returned by the storage

• sha256 was recently added as an alternative algorithm

Why sha256?
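One way to see why a CRC is "fast but insecure" is that it is an affine function of its input: for three equal-length messages, the CRC of their XOR equals the XOR of their CRCs, so an attacker can predict how edits change the checksum. The sketch below checks that property. It uses `zlib.crc32` (plain CRC-32, not the crc32c polynomial btrfs uses, though the same algebra applies) and compares against `hashlib.sha256`; the message contents and the `xor_bytes` helper are arbitrary illustration choices.

```python
import hashlib
import zlib


def xor_bytes(*chunks: bytes) -> bytes:
    """XOR equal-length byte strings together."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, value in enumerate(chunk):
            out[i] ^= value
    return bytes(out)


a = b"first  block of data"
b = b"second block of data"
c = b"third  block of data"
d = xor_bytes(a, b, c)

# CRC: the checksum of the combined message is predictable from the parts.
crc_predicted = zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(c)
crc_is_affine = zlib.crc32(d) == crc_predicted

# sha256: the same trick does not carry over.
sha_xor = xor_bytes(
    hashlib.sha256(a).digest(),
    hashlib.sha256(b).digest(),
    hashlib.sha256(c).digest(),
)
sha_is_affine = hashlib.sha256(d).digest() == sha_xor

sha_size = hashlib.sha256().digest_size  # digest length in bytes
```

The last line also shows the digest length, which matters for any on-disk field that has to hold the checksum.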

• Fairly strong

• Slower but secure
• Intel has already developed acceleration instructions for sha256

• Btrfs disk format has a checksum size limit

Another checksum: sha256

• For the superblock and metadata blocks, btrfs has reserved 32 bytes (256 bits) for the checksum.

• For data blocks, btrfs stores checksums in the crc tree, which has no size limit.

• No need to change the disk format!

Schemes

• Schemes to detect malicious changes to the FS data.

• The Merkle tree?
• Root hash

Schemes cont. (1)
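For readers unfamiliar with the root-hash idea, here is a minimal sketch of a binary Merkle tree over a list of blocks. It is a generic construction, not a description of any btrfs code: `merkle_root` is a hypothetical helper, sha256 is used for every node, and duplicating the last hash on odd-sized levels is just one common convention.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(blocks):
    """Hash the leaves, then hash pairs upward until one root remains."""
    if not blocks:
        raise ValueError("need at least one block")
    level = [h(block) for block in blocks]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # pad odd levels by repeating the last hash
        # Each parent is the hash of its two children's hashes.
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


blocks = [b"block0", b"block1", b"block2", b"block3"]
root = merkle_root(blocks)
tampered = merkle_root([b"block0", b"blockX", b"block2", b"block3"])
```

Because every parent is computed from its children, changing any single block changes the root, and a parent's hash cannot be finalized before its children's hashes are known.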

• Btrfs + Merkle tree, sounds great?
• Does it work?
• Unfortunately, no.
• A Merkle tree requires...
• We wouldn't be allowed to write a tree node until all of its children had been checksummed
• These write-ordering rules for metadata blocks would make things difficult under memory pressure

Schemes cont. (2)

• Checksum + 'btrfs scrub'
• btrfs scrub will ...
• Read the superblock, metadata blocks and data blocks on disk
• Verify integrity by checking their checksums
• If errors occur (checksum failure or EIO), a good copy is searched for
• If one is found, the bad copy is overwritten
• There is a READONLY option

Demo
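The scrub behaviour listed above can be modelled in a few lines. This is a toy, in-memory sketch under stated assumptions: `scrub` is a hypothetical function, each "mirror" is a plain dictionary from block number to bytes, expected checksums are sha256 digests held in a separate dictionary, and EIO is not modelled at all.

```python
import hashlib


def csum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def scrub(mirrors, expected, readonly=False):
    """Verify every block on every mirror; repair from a good copy if allowed.

    Returns (errors_found, copies_repaired).
    """
    errors = repaired = 0
    for blockno, want in expected.items():
        # Look for any copy whose checksum matches the expected value.
        good = next((m[blockno] for m in mirrors if csum(m[blockno]) == want), None)
        for mirror in mirrors:
            if csum(mirror[blockno]) != want:
                errors += 1
                if good is not None and not readonly:
                    mirror[blockno] = good  # overwrite the bad copy
                    repaired += 1
    return errors, repaired


data = {0: b"superblock", 1: b"metadata", 2: b"file data"}
expected = {blockno: csum(block) for blockno, block in data.items()}
copy_a = dict(data)
copy_b = dict(data)
copy_b[2] = b"file dat\x00"  # corrupt one copy of block 2

report_ro = scrub([copy_a, copy_b], expected, readonly=True)
after_ro = copy_b[2]
report_rw = scrub([copy_a, copy_b], expected)
```

In read-only mode the error is counted but the bad copy is left alone; the second pass rewrites it from the good mirror.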

• sha256 checksums + btrfs scrub

Limitations

• For btrfs's superblock and metadata blocks, there is no fault isolation
• But they have two or more copies
• Superblocks have up to 3 copies
• Metadata blocks have 2 copies

• Filesystem checksums are much better at READ-time error detection
• That could be months later, when the original buffer is lost
• The redundant copy may also be bad if the buffer was already incorrect
• DIF/DIX checksums catch errors at write time, while we still have a chance to recover from good data in memory

Performance
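To get a rough feel for the cost gap behind the "fast" versus "slower" trade-off mentioned earlier, a user-space micro-benchmark such as the one below can be run. It is only indicative: it measures Python's `zlib.crc32` (CRC-32, not crc32c) and `hashlib.sha256`, not the kernel implementations btrfs would use, and `throughput_mib_s` is a hypothetical helper.

```python
import hashlib
import time
import zlib


def throughput_mib_s(fn, payload: bytes, rounds: int = 20) -> float:
    """Run fn over payload several times and return MiB processed per second."""
    start = time.perf_counter()
    for _ in range(rounds):
        fn(payload)
    elapsed = time.perf_counter() - start
    return (len(payload) * rounds) / (1024 * 1024) / max(elapsed, 1e-9)


payload = bytes(1024 * 1024)  # 1 MiB of zeros
crc_rate = throughput_mib_s(zlib.crc32, payload)
sha_rate = throughput_mib_s(lambda d: hashlib.sha256(d).digest(), payload)

print(f"crc32 : {crc_rate:8.0f} MiB/s")
print(f"sha256: {sha_rate:8.0f} MiB/s")
```

The absolute numbers vary widely with the CPU and with whether hardware acceleration is available to the library in use.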

• Heavily depends on the implementation of sha256 and btrfs scrub

Thank you!

• Questions?

References