Physical Disentanglement in a Container-Based File System

Lanyue Lu, Yupu Zhang, Thanh Do, Samer Al-Kiswany, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
University of Wisconsin–Madison

Motivation

Isolation in various subsystems
➡ resources: virtual machines, containers
➡ security: BSD jail, sandbox
➡ reliability: address spaces

File systems lack isolation
➡ physical entanglement in modern file systems

Three problems due to entanglement
➡ global failures, slow recovery, bundled performance

Global Failures

Definition
➡ a failure which impacts all users of the file system or even the operating system

Read-Only
➡ mark the file system as read-only
➡ e.g., metadata corruption, I/O failure

Crash
➡ crash the file system or the operating system
➡ e.g., unexpected states, pointer faults

[Figure: two charts of the number of read-only and crash failure instances across Linux 3.0–3.15 versions]

Slow Recovery

Checker
➡ repair a corrupted file system
➡ usually offline

Current checkers are not scalable
➡ increasing disk capacities and file system sizes
➡ scan the whole file system
➡ checking is unacceptably long

Scalability of fsck on Ext3

[Figure: Ext3 fsck time vs. file-system capacity: 231s at 200GB, 476s at 400GB, 723s at 600GB, 1007s at 800GB]

Bundled Performance

Shared transaction
➡ all updates share a single transaction
➡ unrelated workloads affect each other

Consistency guarantee
➡ the same journal mode for all files
➡ limited flexibility for different tradeoffs

Bundled Performance on Ext3

[Figure: throughput (MB/s) of SQLite and Varmail on Ext3, run alone vs. together; one workload drops from 146.7 to 20 MB/s, the other from 76.1 to 1.9 MB/s]

Server Virtualization

[Figure: five VMs (VM1–VM5) whose virtual disks (VD1–VD5) live on one shared file system; metadata corruption makes the whole file system read-only or crashes it, so all VMs crash]

Virtual Machines

[Figure: throughput (IOPS) of three VMs over time on Ext3; after a failure, recovery takes 496s of fsck plus 68s of VM bootup before throughput returns]

Our Solution

A new abstraction for disentanglement
A disentangled file system: IceFS

Three major benefits of IceFS
➡ failure isolation
➡ localized recovery
➡ specialized journaling

Outline: Motivation, IceFS, Evaluation, Conclusion

A New Abstraction: Cube

What is a cube
➡ an independent container for a group of files
➡ physically isolated

Interface
➡ create a cube: (cube_flag)
➡ delete a cube: ()
➡ add files to a cube
➡ remove files from a cube
➡ set cube attributes

Cube Example

[Figure: example directory tree; under the root, directories c1 and c2 are the top directories of cube 1 and cube 2, alongside ordinary entries a, b, and d outside the cubes]

A Disentangled File System

No shared physical resources
➡ no shared metadata: e.g., block groups
➡ no shared disk blocks or buffers

No access dependency
➡ partition linked lists or trees
➡ avoid directory hierarchy dependency

No bundled transactions
➡ use separate transactions
➡ enable customized journaling modes

IceFS

A prototype based on Ext3 in Linux 3.5

Disentanglement
➡ directory indirection
➡ transaction splitting

Directory Indirection

[Figure: the example tree again; a file's cube is found by matching its pathname against the cubes' top-directory paths c1 and c2]

Pathname matching for cubes’ top directory paths
Cubes’ dentries are pinned in memory

Transaction Splitting

[Figure: (a) Ext3/Ext4: applications 1–3 share one in-memory transaction committed to a single on-disk journal; (b) IceFS: each cube has its own in-memory, preallocated, and committed transactions and its own on-disk journal area]

Separate transactions from different cubes commit to the disk journal in parallel

Benefits of Disentanglement

Localized reactions to failures
➡ per-cube read-only and crash

Localized recovery
➡ only check faulty cubes
➡ offline and online

Specialized journaling
➡ parallel journaling
➡ diverse journal modes
➡ no journal, no fsync, writeback, ordered, data

Outline: Motivation, IceFS, Evaluation, Conclusion

Fast Recovery

[Figure: fsck time vs. file-system capacity; Ext3: 231s/476s/723s/1007s at 200/400/600/800GB, IceFS: 35s/64s/91s/122s]

Specialized Journaling

OR: ordered, NJ: no journal

[Figure: SQLite and Varmail throughput (MB/s) under different journal-mode combinations; Ext3 with ordered mode for both: 76.1 and 1.9; IceFS with per-cube modes (ordered/ordered, no-journal/ordered, ordered/no-journal): values including 220.3, 120.6, 125.4, 103.4, 9.8, and 5.6]

Server Virtualization

[Figure: the five-VM setup with one virtual disk per cube; metadata corruption in one virtual disk affects only that cube of the shared file system]

[Figure: throughput (IOPS) of three VMs over time; IceFS-Offline recovers with 35s of fsck plus 67s of bootup, IceFS-Online with 74s of checking plus 39s of bootup]

Conclusion

File systems lack physical isolation

Our contribution
➡ a new abstraction: cube
➡ a disentangled file system: IceFS
➡ demonstrate its benefits:
➡ isolated failures
➡ localized recovery
➡ specialized journaling
➡ improving both reliability and performance

Questions?

IceFS Disk Layout

[Figure: disk layout; a super block (SB) followed by sub-super blocks S0…Sn, then block groups (bg) grouped per cube (cube 0, cube 1, cube 2)]

Each sub-super block records the cube inode number, cube pathname, orphan inode list, and cube attributes
Each cube has one sub-super block (Si) and its own block groups (bg)