ParaFS: A Log-Structured File System to Exploit the Internal Parallelism of Flash Devices

Jiacheng Zhang, Jiwu Shu, Youyou Lu

Tsinghua University

Outline
• Background and Motivation
• ParaFS Design
• Evaluation
• Conclusion

Solid State Drives – Internal Parallelism
• Internal Parallelism – Channel Level, Chip Level, Die Level, Plane Level
– Chips in one package share the same 8/16-bit I/O bus, but have separate chip enable (CE) and ready/busy (R/B) control signals.
– Each die has its own internal R/B signal.
– Each plane contains thousands of flash blocks and one data register.
✓ Internal Parallelism → High Bandwidth.

[Figure: SSD internal hierarchy — the host and FTL connect over an interconnect to flash chips; each chip holds dies, each die holds planes, and each plane holds blocks plus a data register]

Flash File Systems
• Log-structured File System
– Duplicate Functions: Space Allocation, Garbage Collection.
– Semantic Isolation: FTL Abstraction, Block I/O Interface, Log on Log.
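The channel/chip/die/plane hierarchy above can be pictured as an address decomposition. The sketch below is purely illustrative — the unit counts and the decomposition order are assumptions, not taken from any real device:

```python
# Hypothetical sketch: decomposing a flat physical page number into the
# channel/chip/die/plane hierarchy described above. All level counts
# below are assumed values for illustration only.
CHANNELS, CHIPS, DIES, PLANES, PAGES_PER_BLOCK = 8, 4, 2, 2, 256

def decompose(ppn):
    """Map a flat physical page number to its parallel units."""
    ppn, page = divmod(ppn, PAGES_PER_BLOCK)   # page within a block
    ppn, plane = divmod(ppn, PLANES)           # plane within a die
    ppn, die = divmod(ppn, DIES)               # die within a chip
    channel, chip = divmod(ppn, CHIPS)         # chip within a channel
    return {"channel": channel, "chip": chip, "die": die,
            "plane": plane, "page": page}
```

A real FTL would typically choose the decomposition order so that consecutive writes land on different channels first, maximizing parallelism; the ordering here is only one possibility.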

[Figure: storage stack — the log-structured file system (Namespace, Alloc., GC) issues READ / WRITE / TRIM to the FTL (Mapping, Alloc., GC, WL, ECC), which drives flash over Channels 0…N]

Observation
• F2FS vs. Ext4 (under heavy write traffic)
– YCSB: 10 million random Read and Update operations
– 16 GB flash space + 24 GB write traffic

[Figure: normalized throughput, GC efficiency (%), and number of recycled blocks of EXT4 and F2FS, for 1, 4, 8, 16, and 32 channels]

F2FS has poorer performance than Ext4 on SSDs.

Problem

• Internal Parallelism Conflicts
– Broken Data Grouping: grouped data are broken up and dispatched to different locations.
– Uncoordinated GC Operations: the GC processes at the two levels run out of order.
– Ineffective I/O Scheduling: erase operations block read/write operations, and writes delay reads.

• The flash storage architecture blocks these optimizations.

Current Approaches (1)
• Log-structured File System
– Duplicate Functions: Space Allocation, Garbage Collection.
– Semantic Isolation: FTL Abstraction, Block I/O Interface, Log on Log.

[Figure: prior work mapped onto the stack — NILFS2 and F2FS [FAST'15] at the FS level; SFS [FAST'12] and Multi-streamed SSD [HotStorage'14] at the block interface; DFS [FAST'10] and Nameless Write [FAST'12] at the FTL]

Current Approaches (2)
• Object-based File System [FAST'13]
– Moves the semantics from the FS into the FTL.
– Difficult to adopt due to dramatic interface changes.
– Internal parallelism remains under-explored.

[Figure: log-structured file system over a full FTL (READ / WRITE / TRIM) vs. object-based file system over an object-based FTL (GET / PUT)]

ParaFS Architecture
Goal: exploit the internal parallelism of flash devices while ensuring effective data grouping, efficient garbage collection, and consistent performance.

[Figure: a conventional log-structured file system over a full FTL, side by side with ParaFS — Namespace, 2-D Allocation, Coordinated GC, and Parallelism-Aware Scheduler issuing READ / WRITE / ERASE to S-FTL (Block Mapping, WL, ECC) over flash Channels 0…N]

Outline
• Background and Motivation
• ParaFS Design
– ParaFS Architecture
– 2-D Data Allocation
– Coordinated Garbage Collection
– Parallelism-Aware Scheduling
• Evaluation
• Conclusion

ParaFS Architecture
• Simplified FTL (S-FTL)
– Exposes the physical layout to the FS: number of flash channels, flash block size, flash page size.
– Static Block Mapping: block-level, rarely modified.
– Data allocation functionality is removed.
– The GC process is simplified to an erase process.
– WL, ECC: functions that need hardware support are retained.
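As a rough illustration of the S-FTL idea (all names and structure here are my own sketch, not the ParaFS code): the mapping stays static at block granularity, an FS-issued erase passes straight through with no page migration, and only rare events such as a wear-leveling swap change the table.

```python
# Hypothetical sketch of S-FTL (not the actual ParaFS implementation):
# a block-level mapping that stays static in normal operation, so the
# FTL's "GC" is just erasing the flash block the file system names.

class SimpleFTL:
    def __init__(self, num_blocks):
        # Static block mapping: logical block i -> physical block i.
        self.block_map = list(range(num_blocks))
        self.erase_count = [0] * num_blocks

    def translate(self, logical_block):
        return self.block_map[logical_block]

    def erase(self, logical_block):
        # FS-issued ERASE: no valid-page migration here -- the FS
        # already moved live data during its own GC.
        phys = self.block_map[logical_block]
        self.erase_count[phys] += 1
        return phys

    def wear_swap(self, a, b):
        # Rare mapping change, e.g. driven by wear leveling.
        self.block_map[a], self.block_map[b] = self.block_map[b], self.block_map[a]
```

Because the mapping is block-level and rarely changes, the FTL's metadata footprint and its GC logic both shrink — which is exactly what the slide means by "GC process is simplified to Erase process".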

• Interface
– READ / WRITE / ERASE (the Erase command replaces Trim); the physical layout is queried via IOCTL.

[Figure: ParaFS issues READ / WRITE / ERASE to S-FTL (Block Mapping, WL, ECC) on top of flash memory]

ParaFS Architecture

• ParaFS: Allocation, Garbage Collection, Scheduling

[Figure: ParaFS internals — 2-D Allocation writes into Regions 0…N, Coordinated GC runs Threads 0…N, and the Parallelism-Aware Scheduler keeps a per-channel request queue of read/write/erase requests; READ / WRITE / ERASE pass through S-FTL (Block Mapping, WL, ECC) to flash Channels 0…N]

Problem #1: Grouping vs. Parallelism
• Hot/Cold Data Grouping (hot segments vs. cold segments)

[Figure: the FS view groups hot pages A–D into one segment and cold pages E–H into another, but in the FTL view page striping scatters them across the planes and dies of Chips 0 and 1, so each flash block mixes hot and cold pages]

1. 2-D Data Allocation
• Current Data Allocation Schemes
– Page Stripe
– Block Stripe
– Super Block

[Figure: page stripe — FS segment 0 (A–D) and segment 1 (E–H) are striped page by page across Parallel Blocks 0–3; each recycle unit (one flash block) holds pages of both segments]

• Exploiting internal parallelism with page-unit striping causes heavy garbage collection overhead.

[Figure: block stripe — each FS segment is written into a single parallel block, so the recycle unit keeps grouped data together]

• Block striping does not fully exploit the internal parallelism of the device, and performs badly under small synchronous writes (e.g., a mail server workload).

[Figure: super block — segments are striped page by page across Parallel Blocks 0–3, and the four blocks are recycled together as one large GC unit]

• The larger GC unit causes extra garbage collection overhead.

1. 2-D Data Allocation
• Aligned Data Layout in ParaFS
– Region → Flash Channel
– Segment → Flash Block
– Page → Flash Page
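A minimal sketch of how the two allocation dimensions might fit together (the region count, segment size, hotness levels, and the round-robin striping order are all illustrative assumptions, not the ParaFS implementation):

```python
# Hypothetical sketch of 2-D allocation: dimension 1 stripes written
# data across regions (one region per flash channel) at page
# granularity; dimension 2 keeps one allocation head per hotness level
# inside each region, so a segment (= one flash block) only ever holds
# pages of a single hotness group.

NUM_REGIONS = 4        # one region per flash channel (assumed)
PAGES_PER_SEGMENT = 4  # segment size = flash block size (assumed)

class Region:
    def __init__(self, hotness_levels):
        # One log-structured allocation head per hotness level.
        self.heads = {h: [] for h in range(hotness_levels)}
        self.segments = []  # sealed segments: (hotness, pages)

    def append(self, hotness, page):
        head = self.heads[hotness]
        head.append(page)
        if len(head) == PAGES_PER_SEGMENT:            # segment full:
            self.segments.append((hotness, head[:]))  # seal it
            head.clear()

class TwoDAllocator:
    def __init__(self, hotness_levels=2):
        self.regions = [Region(hotness_levels) for _ in range(NUM_REGIONS)]
        self.next_region = 0

    def write(self, hotness, page):
        # Dimension 1: page-unit striping across regions/channels
        # (simplified here to round-robin).
        region = self.regions[self.next_region]
        self.next_region = (self.next_region + 1) % NUM_REGIONS
        # Dimension 2: group by hotness within the region.
        region.append(hotness, page)
```

The point of the alignment is that when a segment is sealed, it corresponds exactly to one flash block of one channel, so FS-level cleaning frees whole erasable blocks while page-level striping is preserved.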

[Figure: 2-D allocation — in the channel-level dimension, written data are striped page by page across Regions 0…N, one region per flash channel; in the hotness-level dimension, each region keeps allocation heads 0…m, one log-structured segment per hotness level, fed from a free segment list]

1. 2-D Data Allocation
• Data Allocation Schemes Comparison

Scheme           Stripe Granularity   Parallelism Level   GC Granularity    Grouping Maintenance   GC Overhead
Page Stripe      Page                 High                Block             No                     High
Block Stripe     Block                Low                 Block             Yes                    Low
Super Block      Page                 High                Multiple Blocks   Yes                    Medium
2-D Allocation   Page                 High                Block             Yes                    Low

Problem #2: GC vs. Parallelism
• Uncoordinated Garbage Collection

[Figure: FTL GC runs early — pages A–H sit in Blocks 0/1 across the planes and dies of Chips 0 and 1; the FTL migrates pages without knowing which ones the FS has already invalidated]

Problem #2: GC vs. Parallelism
• Uncoordinated Garbage Collection (FS GC early)

[Figure: FS GC runs early — the FS picks a victim segment (first choice vs. second choice) without knowing the physical layout, so its cleaning does not line up with whole flash blocks]

2. Coordinated Garbage Collection

• Garbage collection at two levels brings overheads
– The FTL only learns of page invalidations from Trim commands.
• Due to the no-overwrite write pattern.
• Pages that are already invalid in the FS get moved during FTL-level GC.
– Unexpected GC timing across the two levels.
• FTL GC may start before FS GC starts.
• FTL GC blocks FS I/O and causes unexpected I/O latency.
– Wasted space.
• Each level keeps its own over-provisioned space for GC efficiency.
• Each level keeps metadata to record page and block (segment) utilization and remapping.

2. Coordinated Garbage Collection

• Coordinated GC process across the two levels
– FS-level GC
• Foreground GC or Background GC (trigger condition, selection policy).
• Migrates the valid pages out of victim segments.
• Checkpoints to guard against crashes.
• Sends an Erase request to S-FTL.
– FTL-level GC
• Finds the corresponding flash block through the static block mapping table.
• Erases the flash block directly.
• Multi-threaded Optimization
– One GC thread per Region (flash channel).
– One manager thread to perform the checkpoint after GC.

3. Parallelism-Aware Scheduling

• Request Dispatching Phase
– Select the least busy channel to dispatch each write request.

[Figure: a new write request is dispatched to the least busy of request queues 0–3, each holding pending read and write requests]

W = (N_read, N_write) — a channel's busyness is derived from the numbers of read and write requests pending in its queue.
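The dispatching phase can be sketched as below. The relative request costs are assumed values, since the slides only indicate that a channel's busyness is derived from its pending read and write counts:

```python
# Hypothetical sketch of the request-dispatching phase: estimate each
# channel's busyness from its queued read and write requests and send
# a new write to the least busy channel. The cost weights C_READ and
# C_WRITE are assumptions for illustration, not values from the talk.
C_READ, C_WRITE = 1, 2   # writes assumed costlier than reads

def busyness(queue):
    """queue: list of 'R'/'W' request tags pending on one channel."""
    return sum(C_READ if r == 'R' else C_WRITE for r in queue)

def dispatch_write(queues):
    """Return the index of the least busy channel for a new write."""
    return min(range(len(queues)), key=lambda i: busyness(queues[i]))
```

With 2-D allocation keeping every channel's log independent, any channel is a valid target, so the dispatcher is free to balance load this way.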

3. Parallelism-Aware Scheduling

• Request Dispatching Phase
– Select the least busy channel to dispatch each write request.
• Request Scheduling Phase
– Time slices alternate between read request scheduling and write/erase request scheduling.
– Schedule a write or erase request according to the space utilization and the number of concurrently erasing channels.

W = α·U + β·E

where U is the channel's space utilization and E is the fraction of channels that are currently erasing.

– In implementation: α = 2, β = 1, and an erase is scheduled on a channel when W = 2U + E > 1 (writes are scheduled otherwise).
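A minimal sketch of this write/erase decision, using the implementation constants above (α = 2, β = 1, threshold 1):

```python
# Sketch of the write/erase scheduling rule: schedule an erase on a
# channel when W = 2*U + E exceeds 1, where U is the channel's space
# utilization and E is the fraction of channels currently erasing.
ALPHA, BETA, THRESHOLD = 2.0, 1.0, 1.0

def next_op(utilization, erasing_fraction):
    """Decide whether a channel should erase or keep writing."""
    w = ALPHA * utilization + BETA * erasing_fraction
    return "erase" if w > THRESHOLD else "write"
```

One reading of the positive β·E term: when other channels are already erasing, a channel is nudged to erase too, clustering erases in time rather than spreading their latency across the whole run.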

3. Parallelism-Aware Scheduling
• Request Scheduling Phase
– Example (U = each channel's space utilization).

[Figure: four channels at 75%, 25%, 30%, and 30% space utilization, each queueing write and erase requests]

Channel 0 (U = 75%): W = 2 × 75% + 0 = 1.5 > 1 → Erase
Channel 1 (U = 25%): W = 2 × 25% + 0 = 0.5 < 1 → Write
Channel 2 (U = 30%): W = 2 × 30% + 25% = 0.85 < 1 → Write
Channel 3 (U = 30%): W = 2 × 30% + 50% = 1.1 > 1 → Erase

Outline
• Background and Motivation
• ParaFS Design
• Evaluation
• Conclusion

Evaluation

• ParaFS is implemented on kernel 2.6.32, based on F2FS.
– Compared with Ext4 (in-place update), BtrFS (copy-on-write), and back-ported F2FS (log-structured update).
– F2FS_SB: back-ported F2FS with a large-segment configuration.
• Testbed:
– PFTL (page-level mapping FTL) and S-FTL.

• Workloads:

Workload     Pattern                                  R:W     FSYNC   I/O Size
Fileserver   random read and write files              33/66   N       1 MB
Postmark     create, delete, read and append files    20/80   Y       512 B
MobiBench    random update records in SQLite          1/99    Y       4 KB
YCSB         read and update records in MySQL         50/50   Y       1 KB

Evaluation – Light Write Traffic

• Normalized to F2FS in the 8-channel case.

[Figure: throughput of BtrFS, EXT4, F2FS, F2FS_SB, and ParaFS on Fileserver, Postmark, MobiBench, and YCSB, normalized to F2FS, with 8 channels (left) and 32 channels (right)]

- Outperforms the other file systems in all cases.
- Improves performance by up to 13% (Postmark, 32 channels).

Evaluation – Heavy Write Traffic

[Figure: throughput on Fileserver, Postmark, MobiBench, and YCSB under heavy write traffic, for 1, 4, 8, 16, and 32 channels]

- Achieves the best throughput among all evaluated file systems.
- Outperforms Ext4 by 1.0X ~ 2.5X, F2FS by 1.6X ~ 3.1X, and F2FS_SB by 1.2X ~ 1.5X.

Evaluation – Endurance

[Figure: recycled block count and GC efficiency for 1, 4, 8, 16, and 32 channels]

[Figure: write traffic (GB) from the FS and from the FTL for 1, 4, 8, 16, and 32 channels]

ParaFS decreases the write traffic to the flash memory by 31.7% ~ 58.1%, compared with F2FS.

Evaluation – Performance Consistency

• Two Optimizations
– MGC: multi-threaded GC process.
– PS: parallelism-aware scheduling.

- Multi-threaded GC improves performance by 18.5% (MGC vs. Base).
- Parallelism-aware scheduling brings a 20% improvement in the first stage (PS vs. Base) and much more consistent performance in the GC stage (ParaFS vs. MGC).

Conclusion
• Internal parallelism has not been leveraged at the FS level.
– The mechanisms: data grouping, garbage collection, I/O scheduling.
– The architecture: semantic isolation & redundant functions.

• ParaFS bridges the semantic gap and exploits internal parallelism at the FS level.
– (1) 2-D Allocation: page-unit striping and data grouping.
– (2) Coordinated GC: improves GC efficiency and speed.
– (3) Parallelism-Aware Scheduling: more consistent performance.

• ParaFS improves performance by up to 3.1X and reduces write traffic by up to 58.1%, compared to F2FS.

Thanks

ParaFS: A Log-Structured File System to Exploit the Internal Parallelism of Flash Devices

Jiacheng Zhang, Jiwu Shu, Youyou Lu [email protected] http://storage.cs.tsinghua.edu.cn/~lu
