Error Propagation Analysis for File Systems

Cindy Rubio-González, Haryadi S. Gunawi, Ben Liblit, Remzi H. Arpaci-Dusseau, and Andrea C. Arpaci-Dusseau
University of Wisconsin-Madison
PLDI’09

ECS 289C Seminar on Program Analysis February 17th, 2015

Motivation

• Systems software plays an important role
  o Designed to operate the computer hardware
  o Provides platform for running application software
• Particular focus on file systems
  o Store massive amounts of data
  o Used everywhere!
    § Home use: photos, movies, tax returns
    § Servers: network file servers, search engines
• Incorrect error handling → serious consequences
  o Silent data corruption, data loss, system crashes, etc.

• Broken systems software → Broken user applications!

Error Handling

Error Propagation + Error Recovery

[Figure: run-time errors (ERROR) pass between three layers: HARDWARE, SYSTEMS SOFTWARE, and USER-LEVEL APPLICATION]

Error Recovery:
• Logging
• Notifying
• Re-trying

Incorrect error handling → longstanding problem

Return-Code Idiom

• Widely used in systems software written in C

• Also used in large C++ applications

• Run-time errors represented as integer values

• Propagated through variable assignments and function return values
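The idiom can be sketched in a few lines of C. The functions `read_block` and `load_record` are hypothetical names chosen for illustration, not code from the analyzed file systems; the sketch only assumes the usual convention that 0 means success and a negated `errno` constant means failure.

```c
#include <errno.h>

/* Hypothetical low-level routine: fails with -EIO for negative block numbers. */
int read_block(int block)
{
    if (block < 0)
        return -EIO;   /* run-time error represented as an integer value */
    return 0;          /* 0 means success */
}

/* Hypothetical caller: saves the error code and passes it upward. */
int load_record(int block)
{
    int rc = read_block(block);   /* propagation through a variable assignment */
    if (rc)
        return rc;                /* propagation through the function return value */
    return 0;
}
```

Every caller up the chain has to repeat this save-and-return step by hand, which is why a single missed assignment is enough to lose an error.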

Error Codes in

Example of Error-Propagation Bug

    int txCommit(...) {
        ...
        if (isReadOnly(...)) {
            rc = -EROFS;            /* read-only file-system error */
            ...
            goto TheEnd;
        }
        ...
        if (rc = diWrite(...))      /* diWrite may return EIO error */
            txAbort(...);
    TheEnd:
        return rc;                  /* function returns variable rc */
    }

    int diFree(...) {
        ...
        rc = txCommit(...);
        ...                         /* rc not acknowledged by error reporting/recovery code */
        return 0;                   /* error stored in rc is lost! */
    }

Silent data loss!

Incorrect error propagation in IBM JFS Linux file system
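The shape of this bug can be reproduced in a small, compilable sketch. The names `commit_stub`, `free_dropping`, and `free_propagating` are hypothetical stand-ins rather than JFS code, and the second caller shows only one possible way to keep the error, not necessarily the fix applied to JFS.

```c
#include <errno.h>

/* Hypothetical stand-in for txCommit: fails with -EIO when asked to. */
int commit_stub(int fail)
{
    return fail ? -EIO : 0;
}

/* Mirrors the buggy shape of diFree: rc is assigned but never acted upon. */
int free_dropping(int fail)
{
    int rc = commit_stub(fail);
    (void)rc;      /* silences the compiler warning; the error is still lost */
    return 0;      /* the caller always sees success */
}

/* One possible repair: hand the saved error code back to the caller. */
int free_propagating(int fail)
{
    int rc = commit_stub(fail);
    return rc;
}
```

With a failing commit, `free_dropping` still reports success while `free_propagating` reports `-EIO`.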

High-Level Approach

Source code
  Source code + error code definitions

Error Propagation Analysis
  Static program analysis
  Set of error codes each variable may contain at each program point

Find Error-Propagation Bugs
  Second pass over code to find:
  1) Overwritten errors
  2) Out-of-scope errors
  3) Unsaved errors

Bug Reports
  Sample path for each bug found

Error Propagation Analysis

• Interprocedural, flow- and context-sensitive static program analysis

• Finds set of error codes each variable may contain at each program point

• Formulated and solved using weighted pushdown systems (WPDS)

Weighted Pushdown Systems

• Control flow is encoded using a Pushdown System (PDS)

• Data flow is encoded as transfer functions (or weights) on PDS rules
  o Weights are elements of a set $D$ that, together with the combine and extend operators, constitutes a bounded idempotent semiring
    $S = (D, \oplus, \otimes, \bar{0}, \bar{1})$

• Computes “meet-over-all-paths” solution

Mappings Describing Transfer Functions

Definition
    $D = \{\, w \mid w : V \cup C \to 2^{V \cup C} \,\}$
    V: set of variables
    C: set of error-code constants

Example
    if (...) {
        x = EROFS;
    }
    else {
        x = EIO;
    }
    y = x;

Example of a mapping w
    [Figure: mapping diagram over the elements x, y, EIO, EROFS, OK]
    Ident[x ↦ {EROFS}]

Combining Mappings: $\oplus$
Modeling merging of control flow paths

Definition of combine operator
    $(w_1 \oplus w_2)(e) \equiv w_1(e) \cup w_2(e)$
    where $w_1, w_2 \in D$ and $e \in V \cup C$

Example (merging branches)
    if (...) {
        x = EROFS;
    }
    else {
        x = EIO;
    }
    y = x;

    [Figure: mapping diagrams for $w_1$, $w_2$, and $w_1 \oplus w_2$ over x, y, EIO, EROFS, OK]
    $w_1$ = Ident[x ↦ {EROFS}]
    $w_2$ = Ident[x ↦ {EIO}]
    $w_1 \oplus w_2$ = Ident[x ↦ {EROFS, EIO}]
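A minimal sketch of mappings and the combine operator, assuming a bitmask encoding that is not taken from the slides: each element of V ∪ C gets an index, and a mapping stores one bitmask per element. The names `mapping`, `ident`, `ident_with`, and `combine` are illustrative only.

```c
#define N_ELEMS 5

/* Assumed encoding: V = {x, y}, C = {EIO, EROFS, OK}. */
enum elem { X, Y, E_IO, E_ROFS, E_OK };

#define BIT(e) (1u << (e))

/* set[e] is a bitmask over enum elem: the set that e is mapped to. */
typedef struct { unsigned set[N_ELEMS]; } mapping;

/* Ident maps every element to the singleton containing itself. */
mapping ident(void)
{
    mapping w;
    for (int e = 0; e < N_ELEMS; e++)
        w.set[e] = BIT(e);
    return w;
}

/* Ident[e -> s]: the identity everywhere except at e, which maps to s. */
mapping ident_with(enum elem e, unsigned s)
{
    mapping w = ident();
    w.set[e] = s;
    return w;
}

/* Combine: pointwise union, (w1 (+) w2)(e) = w1(e) U w2(e). */
mapping combine(mapping w1, mapping w2)
{
    mapping w;
    for (int e = 0; e < N_ELEMS; e++)
        w.set[e] = w1.set[e] | w2.set[e];
    return w;
}
```

Combining `ident_with(X, BIT(E_ROFS))` with `ident_with(X, BIT(E_IO))` maps x to {EROFS, EIO} and leaves every other element mapped to itself, matching the merged-branch example.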

Extending Mappings: $\otimes$
Modeling sequential control flow

Definition of extend operator
    $(w_1 \otimes w_2)(e) \equiv \bigcup_{e' \in w_2(e)} w_1(e')$
    where $w_1, w_2 \in D$ and $e \in V \cup C$

Example (extending)
    if (...) {
        x = EROFS;
    }
    else {
        x = EIO;
    }
    y = x;
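The extend operator can be sketched the same way, again assuming a bitmask encoding that is not part of the slides (one index per element of V ∪ C, one bitmask per element). The names `mapping`, `ident_with`, and `extend` are illustrative only. The sketch is exercised on the example above, where the mapping for the branches sends x to {EIO, EROFS} and the mapping for `y = x` sends y to {x}.

```c
#define N_ELEMS 5

/* Assumed encoding: V = {x, y}, C = {EIO, EROFS, OK}. */
enum elem { X, Y, E_IO, E_ROFS, E_OK };

#define BIT(e) (1u << (e))

/* set[e] is a bitmask over enum elem: the set that e is mapped to. */
typedef struct { unsigned set[N_ELEMS]; } mapping;

/* Ident[e -> s]: the identity everywhere except at e, which maps to s. */
mapping ident_with(enum elem e, unsigned s)
{
    mapping w;
    for (int i = 0; i < N_ELEMS; i++)
        w.set[i] = BIT(i);
    w.set[e] = s;
    return w;
}

/* Extend: (w1 (x) w2)(e) is the union of w1(e') over every e' in w2(e). */
mapping extend(mapping w1, mapping w2)
{
    mapping w;
    for (int e = 0; e < N_ELEMS; e++) {
        w.set[e] = 0;
        for (int ep = 0; ep < N_ELEMS; ep++)
            if (w2.set[e] & BIT(ep))
                w.set[e] |= w1.set[ep];
    }
    return w;
}
```

The second argument describes the later statement: each element is first looked up in `w2`, and the elements found there are resolved through `w1`, so y inherits whatever x held before the assignment.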

    [Figure: mapping diagrams for $w_1$, $w_2$, and $w_1 \otimes w_2$ over x, y, EIO, EROFS, OK]
    $w_1$ = Ident[x ↦ {EIO, EROFS}]
    $w_2$ = Ident[y ↦ {x}]
    $w_1 \otimes w_2$ = Ident[x ↦ {EIO, EROFS}, y ↦ {EIO, EROFS}]

Detecting Handled Errors

• Handled errors vs. unhandled errors
• Adopt simple definition of “correct handling”

Example
    if (...) {
        x = EROFS;
    }
    else {
        x = EIO;
    }
    y = x;
    printk("error %d\n", y);    /* logging error: Ident[y ↦ {OK}] */

    [Figure: mapping diagrams over x, y, EIO, EROFS, OK]

• Error-handling functions are file-system specific
  o ext3_error, jfs_error, etc.

Analysis Result

• Mappings representing execution from the beginning of the program to any program point
  o Meet-over-all-paths value
  o Errors each variable may contain at each program point

• For each mapping, a subset of inspected paths whose combine is equal to that mapping
  o Explains why the value is as claimed

Dropped Errors

Error-code instances that vanish before being handled

OVERWRITTEN
Error overwritten with new value

OUT OF SCOPE
Variable storing error goes out of scope

UNSAVED
Error returned by a function is not saved by the caller

Program Transformations

1. Transform unsaved errors into out-of-scope errors

    int bar(...) {
        return -EIO;    /* EIO is returned */
    }

    int foo(...) {
        ...
        bar();          /* EIO is not saved */
        ...
        return ...;
    }

After transformation 1:

    int bar(...) {
        return -EIO;
    }

    int foo(...) {
        ...
        int tmp = bar();    /* EIO is saved in temporary variable */
        ...
        return ...;         /* tmp is out of scope */
    }

2. Transform out-of-scope errors into overwritten errors

    int bar(...) {
        return -EIO;
    }

    int foo(...) {
        ...
        int tmp = bar();
        ...
        tmp = OK;       /* assigning OK at the end of the function: EIO is overwritten */
        return ...;
    }

Finding Dropped Errors

1. Transform all error-propagation bugs into overwrites

2. Apply error-propagation analysis

3. Find whether each assignment may overwrite an error
   • Before each assignment t = s, retrieve the associated mapping w
   • Let T = w(t), S = w(s), and E be the set of all error codes
     (1) If $T \cap E = \emptyset$: no error overwrite
     (2) If $T \cap E = S = \{e\}$: OK to overwrite same error
     (3) Otherwise: error overwrite!
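The three-way check in step 3 can be sketched as a single function, assuming sets are encoded as bitmasks with one bit per element. The encoding and the names `verdict` and `classify_assignment` are assumptions made for illustration, not part of the tool.

```c
enum verdict { NO_OVERWRITE, SAME_ERROR, ERROR_OVERWRITE };

/*
 * t      = T = w(t), the values the assignment target may hold
 * s      = S = w(s), the values the assignment source may hold
 * errors = E, the set of all error codes
 * All three are bitmasks under the assumed encoding.
 */
enum verdict classify_assignment(unsigned t, unsigned s, unsigned errors)
{
    unsigned te = t & errors;                /* T intersected with E */

    if (te == 0)
        return NO_OVERWRITE;                 /* (1) target holds no error */
    if (te == s && (te & (te - 1)) == 0)
        return SAME_ERROR;                   /* (2) T & E == S == {e}, a single error */
    return ERROR_OVERWRITE;                  /* (3) an unhandled error may be lost */
}
```

The expression `te & (te - 1)` is zero exactly when `te` has a single bit set, which is how the sketch tests that the set is a singleton {e}.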

Sample Output

     1  int nextId() {
     2      static int id;
     3      return id++;               /* “result” has unchecked error */
     4  }
     5
     6  int getError() {
     7      return -EIO;               /* unchecked error “EIO” is returned */
     8  }
     9
    10  int load() {
    11      int status, result = 0;
    12
    13      if (nextId())
    14          status = getError();   /* “status” receives unchecked error from function “getError” */
    15
    16      result = status;           /* “result” receives unchecked error from “status” */
    17
    18      if (nextId())              /* “result” has unchecked error */
    19          result = -EPIPE;       /* overwriting unchecked error in “result” */
    20
    21      return result;
    22  }

Complete diagnostic path trace

    example.c:7: unchecked error “EIO” is returned
    example.c:14: “status” receives unchecked error from function “getError”
    example.c:16: “result” receives unchecked error from “status”
    example.c:18: “result” has unchecked error
    example.c:3: “result” has unchecked error
    example.c:18: “result” has unchecked error
    example.c:19: overwriting unchecked error in “result”

Diagnostic path slice

    example.c:7: unchecked error “EIO” is returned
    example.c:14: “status” receives unchecked error from function “getError”
    example.c:16: “result” receives unchecked error from “status”
    example.c:19: overwriting unchecked error in “result”

Experimental Results

• Case studies: CIFS, ext3, ext4, IBM JFS, ReiserFS and VFS in Linux 2.6.27 kernel

• 501 bug reports in total
  Many errors actually dropped!

[Pie chart: True Bugs Per Category, 312 reports. Unsaved: 269 (86%). Remaining two slices (Overwritten, Out of Scope): 25 (8%) and 18 (6%)]

[Pie chart: False Positives Per Category, 189 reports. Unsaved: 97 (51%). Remaining two slices (Overwritten, Out of Scope): 48 (26%) and 44 (23%)]

Unsaved Errors

• 269 true bugs, 97 false positives

[Pie chart: unsaved true bugs for CIFS, ext3, ext4, IBM JFS, ReiserFS, and VFS; slice values 12, 24, 38, 58, 68, 69]

• Most common unsaved errors: EIO, ENOSPC and ENOMEM

Inconsistencies

• Inconsistent use of function return values

• For example, a function whose return value is:
  o Saved at 17 call sites
  o Unsaved at 35 other call sites
    § 9 out of 35 identified as true bugs
    § The rest are false positives (errors are intentionally dropped!)

• Some developers have suggested using annotations to mark intentionally dropped errors

False Positives – Unsaved Errors

• Redundant error reporting (85%)

    buffer_head *b = NULL;
    ...
    if (buffer_dirty(b))
        sync_dirty_buffer(b);      /* an error may not be saved */

    if (!buffer_uptodate(b)) {     /* checking “b” */
        reiserfs_warning(...);
        retval = -EIO;
    }

Remaining 15%:

• Error paths: already failing

• Met preconditions: cannot possibly fail

(Updated) Running Time

• 3.2 GHz Intel Pentium 4 processor workstation with 3 GB RAM

[Bar chart: running time in minutes (axis 0 to 2.5) for CIFS, ext3, ext4, IBM JFS, and ReiserFS]
KLOC: CIFS 91, ext3 83, ext4 97, IBM JFS 93, ReiserFS 88

(Updated) Memory Usage

• 3.2 GHz Intel Pentium 4 processor workstation with 3 GB RAM

[Bar chart: memory usage in MB (axis 0 to 400) for CIFS, ext3, ext4, IBM JFS, and ReiserFS]
KLOC: CIFS 91, ext3 83, ext4 97, IBM JFS 93, ReiserFS 88

Other File Systems

• 43 other Linux file systems
  o Bug reports to be examined

• Different Linux versions
  o Significant evolution in each release

• NASA/JPL Laboratory for Reliable Software
  o Using tool to check code in the Mars Science Laboratory
  o Found one real error propagation bug so far

“We’ve found a legitimate problem…We call a non-void function (that can return some critical error codes) and don’t assign the return value, dropping some nice things such as EASSERT, EABOUND, and EEBADHDR on the ground. We would have expected the compiler or [another code-checking tool] to catch that, actually…We’re going to rerun on a big update to the code soon.”

Summary

• Proposed an interprocedural, flow- and context-sensitive static analysis that detects overwritten, out-of-scope and unsaved unchecked errors

• Analyzed five Linux file systems and the VFS

• Produced 501 bug reports, of which 312 are true bugs

• Eliminating error propagation bugs increases the trustworthiness of file systems and, in turn, of computer systems as a whole

Developers Feedback

“I think this is an excellent way of detecting bugs that happen rarely enough that there are no good reproduction cases, but likely hit users on occasion and are otherwise impossible to diagnose.” - Andreas Dilger (ext4)

“Ew, this [bug] is hard to figure out.” - Matthew Wilcox (FS)

“Thank you for helping to improve JFS!” - David Kleikamp (IBM JFS)

“So that is a nice find.” - Jan Harkes ()

“Thank you for looking into this. It’s a great idea.” - Matthew Wilcox (FS)

“Thanks for your efforts!” - Jeff Mahoney (ReiserFS)

“This sounds interesting - please forward them to me.” - Steve French (CIFS)
