EROFS: A Compression-friendly Readonly File System for Resource-scarce Devices

Xiang Gao#, Mingkai Dong*, Xie Miao#, Wei Du#, Chao Yu#, Haibo Chen*,#

# Huawei Technologies Co., Ltd.   * IPADS, Shanghai Jiao Tong University

USENIX ATC 2019, Renton, WA, USA

System resources in Android

[Figure: /system partition size (MB) across Android versions 2.3.6–8.0, growing from roughly 512 MB to 3072 MB; /system, /oem, /odm, and /vendor are read-only partitions.]

System resources consume significant storage! Read-only system partitions → compressed read-only file systems

Existing solution

Squashfs is the state-of-the-art compressed read-only file system. We tried to use it for system resources in Android:

[Figure: /system, /oem, /odm, and /vendor each formatted with Squashfs.]

Result: the system lagged, even froze for seconds, and then rebooted.

Why does Squashfs fail?

Fixed-sized input compression

1. Divide the data into fixed-sized chunks
2. Compress each chunk
3. Concatenate the compressed chunks


This fixed-sized input compression causes two problems.

Read amplification: reading even 1 byte forces decompression of an entire fixed-sized chunk (decompression amplification) and reading the whole compressed chunk from storage (I/O amplification).

Massive memory consumption: the read path allocates buffer_head pages for the compressed data, temporary pages for decompression, and page cache pages for the result, with extra data copies between them.

EROFS

Fixed-sized output compression:
1. Prepare a large amount of data
2. Compress it into a fixed-sized block
3. Repeat


✓ Reduced read amplification
✓ Better compression ratio
✓ Allows in-place decompression
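To make the idea concrete, here is a rough userspace sketch of fixed-sized output compression built on liblz4's LZ4_compress_destSize(), which consumes as much input as fits into a fixed-sized destination block. Block padding and all on-disk index metadata are simplified away; this is illustrative only, not the mkfs.erofs code.

```c
/* Sketch of fixed-sized output compression (userspace, simplified);
 * illustrative only, not the mkfs.erofs implementation. */
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <lz4.h>

#define BLKSIZ 4096                      /* fixed-sized output block */

/* Compress srclen bytes from src into a stream of 4KB compressed blocks
 * written to out; returns the number of blocks, or -1 on error. */
static int compress_fixed_output(const char *src, size_t srclen, FILE *out)
{
    char blk[BLKSIZ];
    int nblocks = 0;

    while (srclen > 0) {
        /* 1. Prepare a large amount of data: offer all remaining input. */
        int consumed = srclen > INT_MAX ? INT_MAX : (int)srclen;

        /* 2. Compress to a fixed-sized block: LZ4_compress_destSize()
         *    consumes as much of src as fits into one 4KB output block. */
        int written = LZ4_compress_destSize(src, blk, &consumed, BLKSIZ);
        if (written <= 0 || consumed <= 0)
            return -1;                   /* incompressible tail or error */

        memset(blk + written, 0, BLKSIZ - written);   /* pad to 4KB */
        if (fwrite(blk, 1, BLKSIZ, out) != BLKSIZ)
            return -1;

        /* 3. Repeat with the input that has not been consumed yet. */
        src += consumed;
        srclen -= (size_t)consumed;
        nblocks++;
    }
    return nblocks;
}
```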

Choosing the page for I/O

Cached I/O — for compressed blocks that will be only partially decompressed: allocate a page in a dedicated page cache for the I/O.
✓ Following decompressions can reuse the cached page

In-place I/O — for compressed blocks that will be fully decompressed: reuse the page already allocated by VFS in the requested file's page cache, if possible.
✓ Memory allocation and consumption are reduced
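As a minimal, self-contained illustration of this choice (the function name and boolean inputs are hypothetical, not the fs/erofs API), the decision reduces to:

```c
/* Minimal sketch of the page-choice policy for reading one compressed
 * block; a simplification of the idea, not the fs/erofs kernel code. */
#include <stdbool.h>
#include <stdio.h>

enum io_page_source {
    IO_PAGE_INPLACE,   /* reuse the page VFS allocated for the file      */
    IO_PAGE_CACHED,    /* allocate in a dedicated compressed-page cache  */
};

/* A block fully decompressed by this request can borrow the VFS page
 * (in-place I/O); a block only partially needed is kept in a dedicated
 * cache so later requests can reuse it (cached I/O). */
static enum io_page_source choose_io_page(bool fully_decompressed,
                                          bool vfs_page_available)
{
    if (fully_decompressed && vfs_page_available)
        return IO_PAGE_INPLACE;   /* no extra page allocation or copy */
    return IO_PAGE_CACHED;        /* cached page can serve future reads */
}

int main(void)
{
    printf("fully decompressed + VFS page -> %s\n",
           choose_io_page(true, true) == IO_PAGE_INPLACE ? "in-place I/O"
                                                         : "cached I/O");
    printf("partially decompressed        -> %s\n",
           choose_io_page(false, true) == IO_PAGE_INPLACE ? "in-place I/O"
                                                          : "cached I/O");
    return 0;
}
```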

Decompression

Vmap decompression:
1. Count data blocks to decompress.
2. Allocate temporary physical pages, or choose pages already allocated by VFS.
3. Allocate a contiguous VM area via vmap(), and map the physical pages into the area.
4. For in-place I/O, copy the compressed block to a temporary per-CPU page.
5. Decompress into the VM area.

[Figure: how the data is compressed; the output pages in the VM area and the page cache pages are the same physical pages.]
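A kernel-context sketch of steps 3 and 5, assuming the destination pages have already been gathered (step 2) and any in-place compressed block has already been copied to a per-CPU page (step 4); this is a simplification of the idea, not the actual fs/erofs code:

```c
#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <linux/lz4.h>

/* Decompress one 4KB LZ4-compressed block `in` into `nr` destination
 * pages (VFS page-cache pages and/or temporary pages), which may be
 * physically discontiguous. */
static int vmap_decompress(const char *in, struct page **pages,
                           unsigned int nr, unsigned int outlen)
{
    void *out;
    int ret;

    /* Step 3: map the physical pages into one virtually contiguous VM
     * area so the decompressor can write its output linearly. */
    out = vmap(pages, nr, VM_MAP, PAGE_KERNEL);
    if (!out)
        return -ENOMEM;

    /* Step 5: decompress the fixed-sized compressed block into the VM
     * area; the backing physical pages are the page-cache pages, so no
     * extra copy into the page cache is needed afterwards. */
    ret = LZ4_decompress_safe(in, out, PAGE_SIZE, outlen);

    vunmap(out);        /* the per-request vmap/vunmap is the main cost */
    return ret < 0 ? -EIO : 0;
}
```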


The four decompression approaches compared:

Vmap decompression
✓ Works for all cases
✘ Frequent vmap/vunmap
✘ Unbounded physical page allocations
✘ Data copy for in-place I/O

Buffer decompression (pre-allocate four-page per-CPU buffers)
✓ No vmap/vunmap
✓ No physical page allocation
✓ No data copy for in-place I/O
✘ Only for decompressions of < 4 pages

Rolling decompression (details in the paper)
✓ No vmap/vunmap
✓ No physical page allocation
✓ For decompressions smaller than a pre-allocated VM area
✘ Data copy for in-place I/O

In-place decompression (details in the paper)
✓ No data copy for in-place I/O

Also in the paper: decompression policy, optimizations.
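For example, buffer decompression can be sketched in userspace as follows (a single static scratch buffer stands in for the pre-allocated per-CPU buffers, and liblz4 stands in for the kernel LZ4 decompressor; this is not the fs/erofs implementation):

```c
/* Userspace sketch of "buffer decompression": for outputs of at most four
 * pages, decompress into a pre-allocated scratch buffer and copy the
 * result out, so no vmap and no page allocation are needed. */
#include <string.h>
#include <lz4.h>

#define PAGE_SZ   4096
#define BUF_PAGES 4

/* In the kernel this buffer would be per-CPU and allocated at init time. */
static char scratch[BUF_PAGES * PAGE_SZ];

/* `in` is one 4KB compressed block; `out_pages[i]` points to the i-th
 * destination page (possibly discontiguous). Returns 0 on success. */
static int buffer_decompress(const char *in, char **out_pages,
                             unsigned int nr_out)
{
    int n;
    unsigned int i;

    if (nr_out > BUF_PAGES)
        return -1;                    /* fall back to vmap/rolling path */

    n = LZ4_decompress_safe(in, scratch, PAGE_SZ, nr_out * PAGE_SZ);
    if (n < 0)
        return -1;

    for (i = 0; i < nr_out; i++)      /* copy out page by page */
        memcpy(out_pages[i], scratch + (size_t)i * PAGE_SZ, PAGE_SZ);
    return 0;
}
```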

Evaluation Setup

Platform             | CPU                                         | DRAM | Storage
HiKey 960            | Kirin 960 (Cortex-A73 x 4 + Cortex-A53 x 4) | 3 GB | 32 GB UFS
Low-end smartphone   | MT6765 (Cortex-A53 x 8)                     | 2 GB | 32 GB eMMC
High-end smartphone  | Kirin 980 (Cortex-A76 x 4 + Cortex-A55 x 4) | 6 GB | 64 GB UFS

Micro-benchmarks
• Platform: HiKey 960
• Tool: FIO
• Workload: enwik9, silesia.tar

Application benchmarks
• Platform: smartphones
• Workload: 13 popular applications

Evaluated file systems
• EROFS: LZ4, 4KB-sized output
• Squashfs: LZ4, {4, 8, 16, 128}KB chunk size
• Btrfs: LZO, 128KB chunk size, readonly mode w/o integrity checks
• Ext4: no compression
• F2FS: no compression

More results in the paper.

Micro-benchmark: FIO Throughput

[Figure: FIO throughput (MB/s) on enwik9, HiKey 960 A73 @ 2362 MHz, for sequential, random, and stride reads, comparing EROFS, Squashfs-{4,8,16,128}K, Ext4, F2FS, and Btrfs.]

• Btrfs performs worst in all cases.
• A larger chunk size brings better performance for Squashfs, if the cached results are reused.
• EROFS performs comparably to, or even better than, Ext4.

Read Amplification and Resource Consumption

                 |        I/O (MB)          | Consumption (MB)
                 | Seq.   | Random | Stride | Storage | Memory
Requested/Ext4   | 16     | 16     | 16     | 953.67  | 988.51
Squashfs-4K      | 10.65  | 26.19  | 26.23  | 592.43  | 1597.50
Squashfs-8K      | 9.82   | 33.52  | 34.08  | 530.43  | 1534.09
Squashfs-16K     | 9.05   | 46.42  | 48.32  | 479.38  | 1481.12
Squashfs-128K    | 7.25   | 165.27 | 203.91 | 379.76  | 1379.84
EROFS            | 10.14  | 26.12  | 25.93  | 533.67  | 1036.88


Real-world Application Boot Time

[Figure: boot time of 13 popular applications with EROFS, relative to Ext4. Average reduction: 3.2% on the low-end smartphone, 10.9% on the high-end smartphone.]

Deployment

✓ Deployed in HUAWEI EMUI 9.1 as a top feature
✓ Upstreamed to Linux 4.19
✓ System storage consumption decreased by >30%
✓ Performance comparable to or even better than Ext4
✓ Running on 10,000,000+ smartphones

Conclusion

EROFS: an Enhanced Read-Only File System with compression support
• Fixed-sized output compression, with four decompression approaches: vmap decompression, buffer decompression, rolling decompression, and in-place decompression
• Running on 10,000,000+ smartphones

✓ Reduces system storage consumption by >30%
✓ Provides comparable or even better performance than Ext4

Thank you & questions?

Backup Slides

Throughput and space savings

[Figure: throughput (MB/s) vs. space savings (%) for EROFS and Ext4, sequential and random reads; FIO, customized workload generated from enwik9, HiKey 960 A73 @ 2362 MHz. The chart contrasts the region where computation outweighs I/O with the region where I/O outweighs computation; in the latter, less I/O means better throughput.]

Decompression: rolling decompression

Observation: during decompression, LZ4 looks backward at most 64KB into the decompressed data.

Rolling decompression: pre-allocate a large VM area and 17 physical pages per CPU, and use the 17 physical pages in turn.

✓ For decompressions smaller than the VM area
✓ No vmap/vunmap
✓ No physical page allocation
✘ Data copy for in-place I/O

[Figure: the 17 physical pages are used in turn; after page 17, page 1 is reused.]
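The page-reuse arithmetic behind the 17-page figure can be checked with a tiny sketch (illustrative only, not the fs/erofs code): since LZ4 match copies reach back at most 64KB = 16 pages, cycling through 17 physical pages guarantees that a page is reused only once its contents can no longer be referenced.

```c
/* Sketch of the rolling-decompression page-reuse arithmetic. */
#include <stdio.h>

#define PAGE_SZ    4096
#define LZ4_WINDOW (64 * 1024)                 /* max backward reference  */
#define ROLL_PAGES (LZ4_WINDOW / PAGE_SZ + 1)  /* 16 + 1 = 17 pages       */

int main(void)
{
    /* For each logical output page, print which of the 17 pre-allocated
     * physical pages backs it; the mapping simply cycles. */
    for (unsigned int lpage = 0; lpage < 40; lpage++)
        printf("output page %2u -> physical page %2u\n",
               lpage, lpage % ROLL_PAGES);

    /* Physical page p is reused at logical pages p, p+17, p+34, ...; by
     * the time it is reused, its previous contents lie more than 64KB
     * behind the decompression cursor and can no longer be referenced,
     * so they can already have been copied to their destination. */
    return 0;
}
```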

Decompression: in-place decompression

Decompress in place if no corruption will happen, i.e., if the output will not overwrite compressed data (such as sequence A) before it has been consumed.
✓ For decompressions smaller than the VM area
✓ No vmap/vunmap
✓ No physical page allocation
✓ No data copy for in-place I/O

[Figure: before/after view of decompressing in place; if the decompressed output overtook the not-yet-read compressed sequence A, the data would be corrupted.]
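A userspace sketch of one sufficient safety condition, borrowing the in-place decompression margin documented by upstream LZ4 ((compressedSize >> 8) + 32 bytes); this mirrors the spirit of the check but is not the actual fs/erofs logic, which decides per block:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <lz4.h>

/* Margin documented by upstream LZ4 for safe in-place decompression;
 * redefined here so the sketch does not rely on lz4.h internals. */
#define INPLACE_MARGIN(csize) (((size_t)(csize) >> 8) + 32)

/* buf holds csize compressed bytes at its start and is bufsize bytes
 * long. Move the compressed data to the tail, then decompress toward the
 * head; safe (no corruption of not-yet-read input) as long as the buffer
 * leaves enough headroom past the decompressed size. */
static bool decompress_inplace(char *buf, size_t bufsize, int csize, int dsize)
{
    if (bufsize < (size_t)dsize + INPLACE_MARGIN(csize))
        return false;           /* output could overtake unread input */

    char *tail = buf + bufsize - (size_t)csize;
    memmove(tail, buf, (size_t)csize);       /* park the input at the end */
    return LZ4_decompress_safe(tail, buf, csize, dsize) == dsize;
}
```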

