A Study of Evolution Lanyue Lu, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Shan Lu

Motivation How We Studied ? What We Did ? Local file systems are important File systems are evolving over time Our method: manual inspection • Desktop and laptop: Windows, Mac, Linux • Code base is not static • XFS, , , , Reiserfs and JFS • Data center / Cloud: Google FS, Hadoop DFS • New features, bug-fixing, performance, reliability • All Linux 2.6 major versions • Mobile devices: Andriod, iPhone • 5079 patches, multiple passes Patches describe such evolution Why study is useful • How one version transforms to the next Quantitatively analyze file systems • Study drives system designs • Every patch is available • Patch types, bug patterns, bug consequences • Answer important questions quantitatively • “System software archeology” is possible • Performance improvement • Valuable for different communities • Reliability enhancement • File system developers Study with other rich information • System researchers • Source code, design documents Provide an annotated dataset • Tool builders • Forum, mailing list • Rich data for further analysis

Key Questions 1 Patch Overview 2 Bug Pattern 2004 1154 809 537 384 191 5079 511 450 358 229 158 80 1786 Q1: What do patches do ? 100% 100%

Q2: What do bugs look like ? 80% Maintenance 80% Error Code Feature Q3: Do bugs diminish over time ? 60% 60% Memory Reliability Q4: What consequences do bugs have ? 40% 40% Concurrency Performance Semantic 20% 20% Q5: Where does complexity lie ? Bug

0% 0% Q6: Do bugs occur on normal paths or XFS Ext4 Btrfs Ext3 Reiser JFS All XFS Ext4 Btrfs Ext3 Reiser JFS All

failure paths ? A1: 45% of patches are maintenance patches; A2: Semantic bugs dominate other types (about Q7: What performance techniques are 35% of patches are bug fixings. Even stable file 55% of total bugs). Concurrency bugs account systems, such as Ext3, have a large percentage for about 20% (higher than user-level software, used across file systems ? of bug patches. which has about 3% of concurrency bugs)

3 Bug Trend 4 Bug Consequence 5 Correlation

Semantic Concurrency Memory Error Code file super trans tree 525 461 366 235 166 80 1833 100% balloc dir other 40 XFS Ext4 80 Btrfs 0.4 0.4 0.4 40 XFS Ext4 Btrfs 30 60 Wrong 0.3 0.3 0.3 30 80% 20 40 Leak 0.2 0.2 0.2 20

10 10 20 60% Hang 0.1 0.1 0.1

0 0 0 0.0 0.0 0.0 0 10 20 30 40 0 10 20 30 40 0 10 20 30 40 Deadlock 0.0 0.1 0.2 0.3 0.4 0.0 0.1 0.2 0.3 0.4 0.0 0.1 0.2 0.3 0.4 40% Ext3 ReiserFS JFS Error 15 40 10 0.4 Ext3 0.4 ReiserFS 0.4 JFS Number of Bugs 30 Crash 0.3 0.3 0.3 10 20% Percentage of Bugs 20 5 Corruption 0.2 0.2 0.2 5 10 0% 0.1 0.1 0.1 XFS Ext4 Btrfs Ext3 Reiser JFS All 0 0 0 0.0 0.0 0.0 0 10 20 30 40 0 10 20 30 40 0 10 20 30 40 0.0 0.1 0.2 0.3 0.4 0.0 0.1 0.2 0.3 0.4 0.0 0.1 0.2 0.3 0.4 Linux Version A4: Corruption and crash are most common, Percentage of Code A3: The total number of bugs do not diminish (about 40% and 20%). Wrong behavior accounts A5: Metadata management has high bug over time (even for stable file systems), rather for only 5% to 10%, while it is dominant in user- density (e.g., inode and super block). Tree ebbing and flowing over time. level applications. structures are not particularly error-prone.

6 Failure Path 7 Performance Conclusions

200 149 144 88 63 28 672 135 80 126 41 24 9 415 100% 100% A large-scale study is feasible Other • Time consuming, but manageable 80% 80% Locality • Similar study for other OS components

60% Scalability 60% Research should match reality Scheduling 40% 40% • New tools are required for semantic bugs Access Opt 20% • More attention for failure paths 20% Sync 0% 0% XFS Ext4 Btrfs Ext3 Reiser JFS All History repeats itself XFS Ext4 Btrfs Ext3 Reiser JFS All • Same mistakes, same performance improvement A6: 38% of bugs are on failure paths. Common A7: A wide variety of performance techniques are bug examples: wrong or miss state update used across file systems. Performance likely justifies Dataset is available (semantic); miss unlock (concurrency); resource the use of more complicated and time saving • http://research.cs.wisc.edu/adsl/Traces/fs-patch/ leak, null dereference (memory). synchronization methods. • More data in our paper and dataset

Sunday, 10 February 2013