Parafs: a Log-Structured File System to Exploit the Internal Parallelism of Flash Devices
Total Page:16
File Type:pdf, Size:1020Kb
ParaFS: A Log-Structured File System to Exploit the Internal Parallelism of Flash Devices Jiacheng Zhang, Jiwu Shu, Youyou Lu Tsinghua University 1 Outline • Background and Motivation • ParaFS Design • Evaluation • Conclusion 2 Solid State Drives – Internal Parallelism • Internal Parallelism – Channel Level, Chip Level, Die Level, Plane Level – Chips in one package share the same 8/16-bit-I/O bus, but have separated chip enable (CE) and ready/busy (R/B) control signals. – Each die has one internal R/B signal. – Each plane contains thousands of flash blocks and one data register. ü Internal Parallelism à High Bandwidth. Die Level Plane Level Channel Level Block 0 Block 0 Block 1 Block 1 H/W Interface Flash Flash Chip Chip Block ... Block ... Host FTL Register Register Interconnect Plane 0 Plane 1 Flash Flash Die 0 Die 1 Chip Chip Chip Level Flash File Systems • Log-structured File System – Duplicate Functions: Space Allocation, Garbage Collection. – Semantic Isolation: FTL Abstraction, Block I/O Interface, Log on Log. Log-structured File System Namespace Alloc. GC READ / WRITE / TRIM FTL Mapping Alloc. GC WL ECC Channel 0 Channel 1 Channel N … Flash Flash Flash Flash Memory 4 Observation • F2FS vs. EXt4 (under heavy write traffic) – YCSB: 1000w random Read and Update operations – 16GB flash space + 24GB write traffic 100 EXT4 F2FS 75 7 50 6 5 25 4 (%) Efficiency GC 0 3 1 4 8 16 32 25000 2 20000 1 Blocks Normalized Throughput Normalized 0 15000 1 4 8 16 32 10000 Number of Channels 5000 F2FS has poorer performance than Ext4 Recycled of # 0 on SSDs. 1 4 8 16 32 Problem • Internal Parallelism Conflicts – Broken Data Grouping : Grouped data are broken and dispatched to different locations. – Uncoordinated GC Operations: GC processes in two levels are performed out-of-order. – Ineffective I/O Scheduling: erase operations always block the read/write operations, while the writes always delay the reads. • The flash storage architecture block the optimizations. 6 Current Approaches (1) • Log-structured File System – Duplicate Functions: Space Allocation, Garbage Collection. – Semantics Isolation: FTL Abstraction, Block I/O Interface, Log on Log. Log-structured File System NILFS2, F2FS[FAST’15] Namespace Alloc. GC SFS [FAST’12] , READ / WRITE / TRIM Multi-streamed SSD [HotStorage’14] FTL Mapping Alloc. GC WL ECC DFS [FAST’10] , Nameless Write [FAST’12] Channel 0 Channel 1 Channel N … Flash Flash Flash Flash Memory 7 Current Approaches (2) • Object-based File System [FAST’13] – Move the semantics from FS to FTL. – Difficult to be adopted due to dramatic changes – Internal parallelism under-explored Log-structured File System Object-based File System Namespace Namespace Alloc. GC GET / PUT READ / WRITE / TRIM Object-based FTL FTL Mapping Mapping Alloc. GC WL ECC Alloc. GC WL ECC Channel 0 Channel 1 Channel N Channel 0 Channel 1 Channel N … … Flash Flash Flash Flash Flash Flash Flash Memory Flash Memory ParaFS Architecture Goal: How to exploit the internal parallelism of the flash devices while ensuring effective data grouping, efficient garbage collection, and consistent performance? ParaFS Log-structured File System Namespace Namespace Alloc. GC 2-D Allocation Coordinated GC Parallelism-Aware Scheduler READ / WRITE / TRIM FTL READ / WRITE / ERASE … Mapping S-FTL Alloc. GC WL ECC Block Mapping WL ECC … Channel 0 Channel 1 Channel N … Channel 0 Channel 1 Channel N Flash Flash Flash … Flash Memory Flash Flash Flash Flash Memory Outline • Background and Motivation • ParaFS Design – ParaFS Architecture – 2-D Data Allocation – Coordinated Garbage Collection – Parallelism-Aware Scheduling • Evaluation • Conclusion 10 ParaFS Architecture • Simplified FTL (S-FTL) – Exposing Physical Layout to FS: # of flash channels, Size of flash block, Size of flash page. – Static Block Mapping: Block-level, rarely modified. – Data Allocation Functionality is removed. – GC process is simplified to Erase process. – WL, ECC: functions which need hardware supports. IOCTL ParaFS READ / WRITE / ERASE … • Interface S-FTL – ioctl Block Mapping WL ECC – Erase à Trim … Flash Memory ParaFS Architecture • ParaFS: Allocation, Garbage Collection, Scheduling 2-D Allocation Region 0 Region N ParaFS Namespace … 2-D Allocation Coordinated GC Parallelism-Aware Scheduler Coordinated GC Thread 0 Thread N READ / WRITE / ERASE … S-FTL … Block Mapping WL ECC … Parallelism-Aware Scheduler Channel 0 Channel 1 Channel N Req. Que. 0 Req. Que. N … Read Read Flash Flash Flash Write … Write Erase Erase Flash Memory Read Read Problem #1: Grouping vs. Parallelism • Hot/Cold Data Grouping Hot Segment Cold Segment FS Vision of Data A B C D E F G H FTL Vision of Data Plane 0 Plane 1 Plane 0 Plane 1 Block 0 Block 0 Block 0 Block 0 A0 B0 C0 D0 … … 1E… F1… G1 H1 N N N N … … … … … Block N Block N Block N Block N 0 0 0 0 … … 1… 1… 1 1 N N N N Die 0 Die 1 Die 0 Die 1 Chip 0 Chip 1 1. 2-D Data Allocation • Current Data Allocation Schemes – Page Stripe FS Vision of Data – Block Stripe Segment 0 Segment 1 – Super Block A B C D E F G H FTL Vision of Data Recycle Unit A B C D E F G H C C C C G G G G Parallel Block 0 1 2 3 • Exploiting internal parallelism with page unit striping causes heavy garbage collection overhead. 1. 2-D Data Allocation • Current Data Allocation Schemes – Page Stripe FS Vision of Data – Block Stripe Segment 0 Segment 1 – Super Block A B C D E F G H FTL Vision of Data Recycle Unit A E C D B F G H C G C C D H G G Parallel Block 0 1 2 3 • Not fully eXploiting internal parallelism of the device, and performs badly in small sync write situation, mailserver. 1. 2-D Data Allocation • Current Data Allocation Schemes – Page Stripe FS Vision of Data – Block Stripe Segment 0 Segment 1 – Super Block A B C D E F G H FTL Vision of Data Recycle Unit A BE C D BE F G H C G C C D H G G Parallel Block 0 1 2 3 • Larger GC unit causes extra garbage collection overhead. 1. 2-D Data Allocation • Aligned Data Layout in ParaFS – Region à Flash Channel Segment à Flash Block Page à Flash Page Channel-level Log-structured Segment Write Request Dimension 1 2 3 4 5…N Written Data Pag e Region 0 Region 1 Region N Alloc. Head 0 Hotness-level Alloc. Dimension … Alloc. Head 1 … … Space … … Alloc. Head m Free Free Seg. List Space Channel 0 Channel 1 Channel N Flash Memory … Flash Flash Flash 1. 2-D Data Allocation • Data Allocation Schemes Comparison – Page Stripe – Block Stripe – Super Block – 2-D Allocation Parallelism Garbage Collection Stripe Parallelism GC Grouping GC Granularity Level Granularity Maintenance Overhead Page Stripe Page High Block No High Block Stripe Block Low Block Yes Low Super Block Page High Multiple Yes Medium Blocks 2-D Allocation Page High Block Yes Low Problem #2: GC vs. Parallelism • Uncoordinated Garbage Collection FS Vision of Data A B C D E F G H FTL Vision of Data Plane 0 Plane 1 Plane 0 Plane 1 Block 0 Block 1 Block 0 Block 1 A B E G C D F H FTL GC early … … … … … … Die 0 Die 0 Chip 0 Chip 1 Problem #2: GC vs. Parallelism • Uncoordinated Garbage Collection FS GC early FS Vision of Data A B C D E F G H FTL Vision of Data Plane 0 Plane 1 Plane 0 Plane 1 Block 0 Block 1 Block 0 Block 1 A B E G First C D F H Second Choice Choice … … … … … … Die 0 Die 0 Chip 0 Chip 1 2. Coordinated Garbage Collection • Garbage Collection in Two levels brings overheads – FTL only gets page invalidation after Trim commands. • Due to the no-overwrite pattern. • Pages that are invalid in the FS, moved during FTL-level GC. – UneXpected timing of GC starts in two levels. • FTL GC starts before FS GC starts. • FTL GC blocks the FS I/O and causes uneXpected I/O latency. – Waste Space. • Each level keeps over-provision space for GC efficiency. • Each level keeps meta data space, to record page and block(segment) utilization, to remapping. 21 2. Coordinated Garbage Collection • Coordinated GC process in two levels – FS-level GC • Foreground GC or Background GC (trigger condition, selection policy) • Migration the valid pages from victim segments. • Do the checkpoint in case of crash. • Send the Erase Request to S-FTL. – FTL-level GC • Find the corresponding flash block by static block mapping table. • Erase the flash block directly. • Multi-threaded Optimization – One GC process per Region (Flash Channel). – One Manager process to do the checkpoint after GC. 22 3. Parallelism-Aware Scheduling • Request Dispatching Phase – Select the least busy channel to dispatch write request New Write Request Queue Write Write Read Write Write Read Write Read Read Read Req. Que. 0 Req. Que. 1 Req. Que. 2 Req. Que. 3 �"#$%%&' = )(�+&$, ×����+&$, , �3+45& ×����3+45&) 23 3. Parallelism-Aware Scheduling • Request Dispatching Phase – Select the least busy channel to dispatch write request • Request Scheduling Phase – Time Slice for Read Request Scheduling and Write/Erase Request Scheduling. – Schedule Write or Erase Request according to Space Utilization and Number of Concurrent Erasing Channels. � = �×� + �×�& – In implementation 1 ?= 2×� + 1×�& 24 3. Parallelism-Aware Scheduling • Request Scheduling Phase – Example. Free Space Used Space �D = 2 ∗ 75% + 0 > 1 Write Erase Erase Write �G = 2 ∗ 25% + 0 < 1 �I = 2 ∗ 30% + 25% < 1 �J = 2 ∗ 30% + 50% > 1 50% Channel 0 Channel 1 Channel 2 Channel 3 � = 75% � = 25% � = 30% � = 30% Outline • Background and Motivation • ParaFS Design • Evaluation • Conclusion 26 Evaluation • ParaFS implemented on LinuX kernel 2.6.32 based on F2FS – Ext4 (In-place update), BtrFS (CoW), back-ported F2FS (Log-structured update) – F2FS_SB: back-ported F2FS with large segment configuration • Testbed: – PFTL, S-FTL • Workloads: Workload Pattern R:W FSYNC I/O Size Fileserver random read and write files 33/66 N 1 MB Postmark create, delete, read and append files 20/80 Y 512 B MobiBench random update records in SQLite 1/99 Y 4 KB YCSB read and update records in MySQL 50/50 Y 1 KB Evaluation – Light Write Traffic • Normalized to F2FS in 8-Channel case 1.2 3 BtrFS EXT4 F2FS F2FS_SB ParaFS BtrFS EXT4 F2FS F2FS_SB ParaFS 1 2.5 0.8 2 0.6 1.5 0.4 1 Normalized Throughput Normalized 0.2 Throughput Normalized0.5 0 0 Fileserver Postmark MobiBench YCSB Fileserver Postmark MobiBench YCSB Channel = 8 Channel = 32 - Outperforms other file systems in all cases.