Quick viewing(Text Mode)

Tolera&Ng File‐System Mistakes with Envyfs

Tolera&Ng File‐System Mistakes with Envyfs

Tolerang ‐System Mistakes with EnvyFS

Swaminathan Sundararaman Lakshmi N. Bairavasundaram Andrea C. Arpaci‐Dusseau NetApp, Inc. Remzi H. Arpaci‐Dusseau University of Wisconsin Madison File Systems in Today’s World

• Modern file systems are complex – Tens of thousands of lines of code (e.g., XFS 45K LOC)

• Storage stack is also geng deeper – Hypervisor, network, logical volume manager

• Need to handle a gamut of failures – Memory allocaon, disk faults, bit flips, system crashes

• Preserve integrity of its meta‐data and user data

6/18/09 Tolerang File‐System Mistakes with EnvyFS 2 Bugs

• Bug reports for 2.6 series from Bugzilla – ext3: 64, JFS: 17, ReiserFS: 38 – Some are FS corrupon causing permanent data loss

• FS bugs broadly classified into two categories – “fail‐stop”: System immediately crashes • Soluons: Nooks [Swi 04], CuriOS [David08] – “fail‐silent”: Accidentally corrupt on‐disk state • Many such bugs uncovered [Prabhakaran05, Gunawi08, Yang04, Yang06b]

6/18/09 Tolerang File‐System Mistakes with EnvyFS 3 Bugs are inevitable in file systems

Challenge: how to cope with them?

6/18/09 Tolerang File‐System Mistakes with EnvyFS 4 N‐Version File Systems

• Based on N‐version programming [Avizienis77] – NFS servers [Rodrigues01], databases [Vandiver07], security [Cox06]

• EnvyFS: Simple soware layer Applicaon EnvyFS layer – Store data in N child file systems – Operaons performed on all children … Child N Child 2 • Rely on a simple soware layer Child 1 • Challenge: reducing overheads while Disk driver SIS layer retaining reliability – SubSIST: Novel Single Instance Store Disk

6/18/09 Tolerang File‐System Mistakes with EnvyFS 5 Results

• Robustness – Tradional file systems handle few corrupons (< 4%)

– EnvyFS3 tolerates 98.9% of single file system mistakes

• Performance

– Desktop workloads: EnvyFS3 has comparable performance – I/O intensive workloads:

• Normal mode: EnvyFS3 + SubSIST acceptable performance • Under memory pressure: EnvyFS3 + SubSIST large overheads • Potenal as a debugging tool for FS developers – Pinpoint the source of “fail‐silent” bug in ext3

6/18/09 Tolerang File‐System Mistakes with EnvyFS 6 Outline

• Introducon • Building reliable file systems • Reducing overheads with SubSIST • Evaluaon • Conclusion

6/18/09 Tolerang File‐System Mistakes with EnvyFS 7 N‐Version Systems

Development process: 1. Producing the specificaon of soware 2. Implemenng N versions of the soware 3. Creang N‐version layer — Executes different versions — Determines the consensus result

6/18/09 Tolerang File‐System Mistakes with EnvyFS 8 1. Producing Specificaon • Our own specificaon ? – Imprac: Requires wide scale changes to file systems – Specificaons take years to get accepted

• Can we leverage exisng specificaon ? – Yes, can leverage VFS, but there are some issues

• VFS not precise for N‐versioning purpose – Needs to handle cases where specificaon is not precise – e.g., Ordering directory entries, inode number allocaon

6/18/09 Tolerang File‐System Mistakes with EnvyFS 9 Imprecise VFS Specificaon

File 1 Ordering directory entries File 2 File 3 Dir: • Issue: File 1 – No specified return order File 2 Readdir: test No Entries File 3 – Can’t blindly compare entries EnvyFS layer File 1 File 2 File 3 • Soluon: – Read all entries from a directory File 1 File 2 File 3 (dir: test in our case) from all FSes File 2 File 3 File 1 FS X … FS Y FS Z – Match entries from FSes File 3 File 1 File 2 – Return majority results Dir: test Dir: test Dir: test

6/18/09 Tolerang File‐System Mistakes with EnvyFS 10 Imprecise VFS Specificaon (cont) • Inode number allocaon – Inode numbers returned through system calls – Each child file system issues different inode numbers – Possible soluon: Force file systems to use same algorithm? – Our soluon: Issue inode numbers EnvyFS layer

Stat: File 1 File 1 | ?? 15 Virt # FS 1 FS 2 FS 3 EnvyFS layer 15 10 36 65 File 1 | 10 File 1 | 36 File 1 |65

Inode Mapping Table File 1 10 File 2 04 File 3 99 File 2 15 File 3 44 File 1 65 FS Z FS X File 3 16 File 1 36 FS Y File 2 43 Inode Mapping Table not persistently stored Dir: test Dir: test Dir: test

6/18/09 Inode Numbers Tolerang File‐System Mistakes with EnvyFS 11 2. Implemenng N versions of FS

• Painful process – High cost of development, long me delays

• Lucky! Hard work already done for us – 30 different disk based file systems in Linux 2.6

• Which file systems to use? – ext3, JFS, ReiserFS in a three‐version FS – Others should work without modificaons

6/18/09 Tolerang File‐System Mistakes with EnvyFS 12 3. Creang N‐Version Layer

• N‐Version layer (EnvyFS) Applicaon

– Inserted beneath VFS Read (file, 1 block) err , D – Simple design to avoid bugs VFS layer

Read (file, 1 block) err , D EnvyFS Inode Mapping Table • Example: Reading a file Layer Wrappers Comparators – Allocate N data buffers err = Read (…) err = Read (…) err = Read (…) – Read data block from the disk D D D – Compare: data, return code, file posion F F F – Return: data, return code pos: x pos: x pos: x

… JFS ext3 ReiserFS

• Issues: D D D – Allocate memory for each read operaon – Extra copy from allocated buffer to applicaon Disk – Comparison overheads

6/18/09 Tolerang File‐System Mistakes with EnvyFS 13 Reading a File in EnvyFS

• Soluon: Applicaon Read (file, 1 block) err , – Same applicaon buffer for all FS D – TCP‐like checksums for data comparison VFS layer Read (file, 1 block) err , D – Compare: checksums, return code, file EnvyFS Inode Mapping Table posion Layer Wrappers Comparators – Read data unl majority err = Read (…) err = Read (…) err = Read (…) D D D

F F F pos: x pos: x pos: x

… JFS ext3 ReiserFS

D D D FS 1 # FS 2 # … FS N #

435 435 … 436 Disk Checksums

6/18/09 Tolerang File‐System Mistakes with EnvyFS 14 Outline

• Introducon • Building reliable file systems • Reducing overheads with SubSIST • Evaluaon • Conclusion

6/18/09 Tolerang File‐System Mistakes with EnvyFS 15 Case for Single Instance Storage (SIS)

Applicaon • Ideal: One disk per FS VFS layer

1 • Praccal: One disk for all FS EnvyFS layer

1 2 N • Overheads FS 1 FS 2 – Effecve storage space: 1/N … FS N – N mes I/O (Read/) 1 2 N Disk 1 Disk Req. Queue Disk 2 … Disk N • Challenge: Maintain diversity Part 1 Part 2 Disk … Part N while minimizing overheads

6/18/09 Tolerang File‐System Mistakes with EnvyFS 16 SubSIST: Single Instance Store

Applicaon • Variant of an Single Instance Store – Selecvely merges data blocks VFS layer D • Block addressable SIS EnvyFS layer

– Exports virtual disks to FSes D D D – Manages mapping, free space info.

FS 1 … FS 2 – Not persistently stored on disk FS N M D M D M D • EnvyFS writes through N file systems Vdisk 1 Vdisk 2 Vdisk N – N data blocks merged to 1 data block – Content hashes not stored persistently CHash Layer Read Cache SubSIST – Meta‐data blocks not merged Free Space Management – Inter FS blocks and not intra FS

Disk 6/18/09 Tolerang File‐System Mistakes with EnvyFS 17 6/18/09 • × 

Handling Data Block Corrupons?

Single copy of data Corrupon to data block inside disk Corrupon to data in a single FS – – – – – –

Different on‐disk structures Different code paths Corrupt data block fixed at next read All other N‐1 data blocks merged Corrupt data blocks not merged Due to bugs, bit flips, storage stack Tolerang File‐System Mistakes with EnvyFS Vdisk 1 CHash Free Space Management D D

D D FS 1 Layer

EnvyFS layer Applicaon VFS layer Vdisk 2 Disk D D

FS 2 D Read Cache

… D D

Vdisk N D

D FS N

18 SubSIST Outline

• Introducon • Building reliable file systems • Reducing overheads with SubSIST • Evaluaon – Reliability – Performance • Conclusion

6/18/09 Tolerang File‐System Mistakes with EnvyFS 19 Reliability Evaluaon: Fault Injecon

VFS Robustness of EnvyFS in recovering from a child file system’s mistake? EnvyFS layer

• Corrupon: bugs in FS / storage stack JFS ext 3

• Types of disk blocks ReiserFS

– superblock, inode, block bitmap, file data, … B B Pseudo • Perform different file ops ‐aware fault injecon [Prabhakaran05] Device Driver – , stat, creat, unlink, read, … B B B • Report user visible results Block Driver • All results are applicable with SubSIST except corrupon to data blocks B B B Disk

6/18/09 Tolerang File‐System Mistakes with EnvyFS 20 ) ) Normal )

ext3 stat, … fsync Data E loss

path traversal SET‐1 ( SET‐2 ( read readlink getdirentries creat link rename symlink write truncate unlink mount SET‐3 ( umount Cannot INODE mount DIR Ops fail BMAP Data IMAP corrupt INDIRECT Result Matrix Crash DATA SUPER Read‐only

JSUPER Depends GDESC e N/A 6/18/09 Tolerang File‐System Mistakes with EnvyFS 21 ) ) )

ext3 stat, … chmod fsync E

path traversal SET‐1 ( SET‐2 ( read readlink getdirentries creat link mkdir rename symlink write truncate rmdir unlink mount SET‐3 ( umount Data INODE loss DIR Cannot mount BMAP Ext3 stores many superblock copies; IMAP but, does not handle superblock corrupon N/A INDIRECT

DATA SUPER

JSUPER

GDESC

6/18/09 Tolerang File‐System Mistakes with EnvyFS 22 ) ) )

ext3 stat, … chmod fsync E path traversal SET‐1 ( SET‐2 ( read readlink getdirentries creat link mkdir rename symlink write truncate rmdir unlink mount SET‐3 ( umount INODE Data DIR loss BMAP • In addion to operaons failing, inode Cannot mount IMAP corrupon leads to data loss INDIRECT Ops fail • Unlink: system crash during unmount DATA Crash SUPER

JSUPER N/A

GDESC

6/18/09 Tolerang File‐System Mistakes with EnvyFS 23 ) ) Normal )

ext3 stat, … chmod fsync Data E loss

path traversal SET‐1 ( SET‐2 ( read readlink getdirentries creat link mkdir rename symlink write truncate rmdir unlink mount SET‐3 ( umount Cannot INODE mount DIR Ops fail BMAP Data IMAP corrupt INDIRECT Crash DATA SUPER Read‐only JSUPER Depends GDESC e N/A 6/18/09 Tolerang File‐System Mistakes with EnvyFS 24 ) )

EnvyFS3 ) EnvyFS stat, … chmod fsync Kernel E J R panic in path traversal SET‐1 ( SET‐2 ( read readlink getdirentries creat link mkdir rename symlink write truncate rmdir unlink mount SET‐3 ( umount ext3 INODE

DIR

BMAP

IMAP Normal INDIRECT

DATA N/A

SUPER

JSUPER

GDESC

EnvyFS3 works in every scenario 6/18/09 Tolerang File‐System Mistakes with EnvyFS 25 Potenal for Bug Isolaon

ext3 EnvyFS3

Unlink on corrupt inode: Unlink on corrupt inode: ‐ ext3_lookup (bug) ‐ ext3_lookup (bug) ‐ ext3_unlink ‐ ext3 inode does not match others

Time ‐ Further ops not issued

Unmount (panic)

In typical use, a problem is In EnvyFS3, a problem is noced noced only on panic the first me child file system returns wrong results

6/18/09 Tolerang File‐System Mistakes with EnvyFS 26 Normal JFS Data J loss

path traversal SET‐1 SET‐2 read readlink getdirentries creat link mkdir rename symlink write truncate rmdir unlink mount SET‐3 umount Cannot INODE mount DIR Ops fail BMAP IMAP Data INTERNAL corrupt DATA Crash SUPER JSUPER Read‐only JDATA AGGR‐INODE Depends a IMAPDESC N/A 6/18/09 IMAPCNTL Tolerang File‐System Mistakes with EnvyFS 27 EnvyFS3 EnvyFS

E J R path traversal SET‐1 SET‐2 read readlink getdirentries creat link mkdir rename symlink write truncate rmdir unlink mount SET‐3 umount INODE DIR BMAP IMAP Normal INTERNAL Crash DATA SUPER N/A JSUPER JDATA AGGR‐INODE Kernel IMAPDESC panic in EnvyFS3 6/18/09 IMAPCNTL Tolerang File‐System Mistakes with EnvyFS 28 Performance Evaluaon OpenSSH Benchmark

•30 Experimental setup 3 % overhead • CPU Intensive • 25 – AMD Opteron 2.2 GHz Processor OpenSSH 4.5 ‐‐ Copy, untar and 20 – 2GB RAM 15 – 80 GB Hitachi Deskstar 7200‐rpm SATA disk

10 – Linux 2.6.12 – 5 4GB disk paron for each file system Elapsed Time (in Seconds) 0 ext3 JFS ReiserFS EnvyFS EnvyFS+SIS File Systems Performance of EnvyFS3 is comparable to a single file system 6/18/09 Tolerang File‐System Mistakes with EnvyFS 29 Postmark Benchmark

EnvyFS : 8x 900 3 851 • I/O Intensive + SubSIST: 4x 800 700 – Mimics busy mail server workload EnvyFS3: 1.7x 600 – Transacon: creates, deletes, reads, appends, … + SubSIST: 11.5% 500 430 • Postmark Configuraon 406 400 EnvyFS : 3.3x – 3 271 300 2500 files + SubSIST: ‐32% 243 200 – File size: 4Kb – 40Kb 129.0 107 128 Elapsed Time (in Seconds) 100 78 39.0 34 – No. of transacons: 10K and 100K 14.7 9.6 26.4 29 0 Postmark‐10K Postmark‐100K Postmark‐100K* ext3 JFS ReiserFS EnvyFS EnvyFS+SIS

6/18/09 Tolerang File‐System Mistakes with EnvyFS 30 Summary of Results

• Robustness – Tradional file systems vulnerable to corrupons

– EnvyFS3 tolerates almost all mistakes in one FS

• Performance

– Desktop workloads: EnvyFS3 has comparable performance – I/O intensive workloads:

• Regular Operaons: EnvyFS3 + SubSIST acceptable performance

• Memory pressure: EnvyFS3 + SubSIST has large overhead

6/18/09 Tolerang File‐System Mistakes with EnvyFS 31 Outline

• Introducon • Building reliable file systems • Reducing overheads with SubSIST • Evaluaon • Conclusion

6/18/09 Tolerang File‐System Mistakes with EnvyFS 32 Conclusion

• Bugs/mistakes are inevitable in any soware – Must cope, not just hope to avoid

• EnvyFS: N‐version approach to tolerang FS bugs – Built using exisng specificaon and file systems

• SubSIST: single instance store – Decreases overheads while retaining reliability

6/18/09 Tolerang File‐System Mistakes with EnvyFS 33 Thank You!

Advanced Systems Lab (ADSL) University of Wisconsin‐Madison hp://www.cs.wisc.edu/adsl

6/18/09 Tolerang File‐System Mistakes with EnvyFS 34 Future Work

• Debugging tool for developers – Run older and newer version of file systems – Compare results with older version

• File system repair – Simple repair: copy data from other file system – Complex repair: recreate enre file system tree – How to do micro repair ?

6/18/09 Tolerang File‐System Mistakes with EnvyFS 35

) ) )

ext3 stat, … chmod fsync

E path traversal SET‐1 ( SET‐2 ( read readlink getdirentries creat link mkdir rename symlink write truncate rmdir unlink mount SET‐3 ( umount INODE Data loss DIR BMAP Read‐only IMAP N/A INDIRECT

DATA SUPER • Ext3 detects corrupon for rmdir, unlink JSUPER • creat , mkdir, symlink cause ext3 to reuse an GDESC inode, resulng in data loss

6/18/09 Tolerang File‐System Mistakes with EnvyFS 36