Efficient Memory Mapped File I/O for In-Memory File Systems
Jungsik Choi†, Jiwon Kim, Hwansoo Han

Storage Latency Close to DRAM

[Figure: operations per second vs. access latency, from milliseconds down to nanoseconds: SATA HDD (millisec), SATA/SAS Flash SSD (~100 μs), PCIe Flash SSD (~60 μs), 3D-XPoint & Optane (~10 μs), NVDIMM-N (~10 ns)]

Large Overhead in Software

• Existing OSes were designed for fast CPUs and slow block devices
• With low-latency storage, SW overhead becomes the largest burden

• Software overhead includes
  – Complicated I/O stacks
  – Redundant memory copies
  – Frequent user/kernel mode switches

[Figure: normalized latency decomposition into software / storage / NIC; software share: HDD + 10Gb NIC: 7%, NVMe SSD + 10Gb NIC: 50%, PCM + 10Gb NIC: 77%, PCM + 100Gb NIC: 82% (Source: Ahn et al., MICRO-48)]

• SW overhead must be addressed to fully exploit low-latency NVM storage

Eliminate SW Overhead with Memory Mapped File I/O

• In recent studies, memory mapped I/O has been commonly proposed
  – Memory mapped I/O can expose the NVM storage's raw performance

• Mapping files onto user memory space
  – Provide memory-like access
  – Avoid complicated I/O stacks
  – Minimize user/kernel mode switches
  – No data copies
• A mmap syscall will be a critical interface

[Figure: traditional I/O issues r/w syscalls that cross into the kernel and down to disk/flash; memory mapped I/O issues load/store instructions directly against persistent memory]
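To make this concrete, here is a minimal sketch of the memory mapped path in C; the function and its checksum loop are illustrative, and error handling is trimmed:

```c
/* Minimal sketch: read a file with load instructions instead of read()
 * syscalls. Illustrative only; error handling is trimmed. */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

long sum_file(const char *path)   /* e.g., a file on an Ext4-DAX mount */
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    /* One mmap syscall; afterwards every access is a plain load, with
     * no per-access kernel crossing and no copy into a user buffer. */
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;

    long sum = 0;
    for (off_t i = 0; i < st.st_size; i++)
        sum += p[i];

    munmap(p, st.st_size);
    close(fd);
    return sum;
}
```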

Memory Mapped I/O is Not as Fast as Expected

• Microbenchmark: sequential read, a 4GB file (Ext4-DAX on NVDIMM-N)

[Figure: elapsed time (sec), broken into file access, page table entry construction, and page fault overhead: read: 1.06, default-mmap: 1.42, populate-mmap: 1.02]

• read : read system calls
• default-mmap : mmap without any special flags
• populate-mmap : mmap with a MAP_POPULATE flag
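The three bars correspond to access paths like the following sketch (the timing harness is omitted; MAP_POPULATE is the stock Linux flag that pre-builds all page table entries at mmap time):

```c
/* The three variants under test; sizes match the 4GB benchmark file. */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define FILE_SIZE (4UL << 30)   /* 4GB */
#define PAGE      4096UL

/* read: a syscall and a copy per chunk, but no faults on a mapping */
static void seq_read(int fd)
{
    char *buf = malloc(1 << 20);
    while (read(fd, buf, 1 << 20) > 0)
        ;
    free(buf);
}

/* default-mmap: page faults and PTE construction happen lazily,
 * inside the access loop */
static void seq_default_mmap(int fd)
{
    volatile char *p = mmap(NULL, FILE_SIZE, PROT_READ, MAP_SHARED, fd, 0);
    for (unsigned long i = 0; i < FILE_SIZE; i += PAGE)
        (void)p[i];
    munmap((void *)p, FILE_SIZE);
}

/* populate-mmap: all PTEs are built up front, so the loop runs
 * fault-free */
static void seq_populate_mmap(int fd)
{
    volatile char *p = mmap(NULL, FILE_SIZE, PROT_READ,
                            MAP_SHARED | MAP_POPULATE, fd, 0);
    for (unsigned long i = 0; i < FILE_SIZE; i += PAGE)
        (void)p[i];
    munmap((void *)p, FILE_SIZE);
}
```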

Overhead in Memory Mapped File I/O

• Memory mapped I/O can avoid the SW overhead of traditional file I/O

• However, memory mapped I/O causes another SW overhead
  – Page fault overhead
  – TLB miss overhead
  – PTE construction overhead

• This overhead diminishes the advantages of memory mapped file I/O

Techniques to Alleviate Storage Latency

• Readahead ⇒ Map-ahead
  – Preload pages that are expected to be accessed
• Page cache ⇒ Mapping cache
  – Cache frequently accessed pages
• fadvise/madvise interfaces ⇒ Extended madvise
  – Utilize user hints to manage pages (madvise/fadvise)

[Figure: app issuing user hints to main memory, which holds the page cache and runs readahead against secondary storage]

• However, these can't be used in memory based FSs
  ⇒ New optimization is needed in memory based FSs

Map-ahead Constructs PTEs in Advance

• When a page fault occurs, the page fault handler handles
  – The page that caused the page fault (existing demand paging)
  – Pages that are expected to be accessed (map-ahead)
• Kernel analyzes page fault patterns to predict pages to be accessed (see the sketch below)
  – Sequential fault : map-ahead window size ↑
  – Random fault : map-ahead window size ↓
• Map-ahead can reduce # of page faults

[Figure: fault-pattern state machine: PURE_SEQ (winsize ↑), NEARLY_SEQ (winsize unchanged), NON_SEQ (winsize ↓), with seq/rnd faults driving the transitions]
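A rough rendition of this window policy in C follows; the state names mirror the figure, while the doubling/halving rule and the window bounds are assumptions rather than the authors' exact kernel logic:

```c
/* Sketch of the map-ahead window policy. Names and thresholds are
 * illustrative, not the actual kernel implementation. */
enum fault_pattern { PURE_SEQ, NEARLY_SEQ, NON_SEQ };

struct map_ahead_state {
    enum fault_pattern pattern;
    unsigned long last_fault_pfn;  /* page of the previous fault */
    unsigned long winsize;         /* pages to pre-map on the next fault */
};

#define WIN_MIN 1UL
#define WIN_MAX 512UL   /* assumed cap */

/* Classify the new fault and grow/shrink the map-ahead window. */
static void update_window(struct map_ahead_state *s, unsigned long fault_pfn)
{
    int seq = (fault_pfn == s->last_fault_pfn + 1);

    switch (s->pattern) {
    case PURE_SEQ:                   /* sequential run: winsize ↑ */
        if (seq) {
            if (s->winsize * 2 <= WIN_MAX)
                s->winsize *= 2;
        } else {
            s->pattern = NEARLY_SEQ; /* one stray fault: hold winsize */
        }
        break;
    case NEARLY_SEQ:                 /* undecided: winsize unchanged */
        s->pattern = seq ? PURE_SEQ : NON_SEQ;
        break;
    case NON_SEQ:                    /* random run: winsize ↓ */
        if (seq)
            s->pattern = NEARLY_SEQ;
        else if (s->winsize / 2 >= WIN_MIN)
            s->winsize /= 2;
        break;
    }
    s->last_fault_pfn = fault_pfn;
}
```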

Comparison of Demand Paging & Map-ahead

• Existing demand paging: overhead in user/kernel mode switch, page fault, TLB miss

[Figure: timeline: the application faults on every page a-f, and the page fault handler runs once per page]

• Map-ahead

[Figure: timeline: one page fault maps the whole window (map-ahead window size == 5), so the application continues without further faults: performance gain]

Async Map-ahead Hides Mapping Overhead

• Sync map-ahead

[Figure: timeline: the page fault handler maps the entire window (pages 1-5) before returning to the application]

• Async map-ahead (see the sketch below)

[Figure: timeline: the handler maps only pages 1-2 and returns; kernel threads map pages 3-5 in the background, overlapping with the application: performance gain. Map-ahead window size == 5]
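The division of labor can be imitated in userspace with a helper thread standing in for the kernel threads; this is only an analogy of the mechanism, which really lives inside the page fault handler:

```c
/* Userspace analogy of async map-ahead: after the pages needed right
 * now are mapped, a helper thread pre-faults the rest of the window so
 * PTE construction overlaps with the application's work. */
#include <pthread.h>

#define PAGE 4096UL

struct window { volatile const char *base; unsigned long npages; };

/* Helper: touch each page so the kernel builds its PTE in parallel. */
static void *prefault_worker(void *arg)
{
    struct window *w = arg;
    for (unsigned long i = 0; i < w->npages; i++)
        (void)w->base[i * PAGE];
    return NULL;
}

/* Fault pages 1-2 synchronously, hand pages 3..winsize to the helper. */
static pthread_t start_async_mapahead(volatile const char *addr,
                                      unsigned long winsize,
                                      struct window *w)
{
    pthread_t tid;
    (void)addr[0];                /* sync: page 1 */
    (void)addr[PAGE];             /* sync: page 2 */
    w->base = addr + 2 * PAGE;    /* async: pages 3..winsize */
    w->npages = winsize - 2;
    pthread_create(&tid, NULL, prefault_worker, w);
    return tid;                   /* caller joins when convenient */
}
```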

Extended madvise Makes User Hints Available

• Taking advantage of user hints can maximize the app's performance
  – However, the existing interfaces only optimize paging
• Extended madvise makes user hints available in memory based FSs

Hint                           | Existing madvise             | Extended madvise
MADV_SEQUENTIAL, MADV_WILLNEED | Run readahead aggressively   | Run map-ahead aggressively
MADV_RANDOM                    | Stop readahead               | Stop map-ahead
MADV_DONTNEED                  | Release pages in page cache  | Release mappings in mapping cache
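From the application's side the interface is unchanged: extended madvise reuses the existing flags, so the same calls that steer readahead on a stock kernel steer map-ahead and the mapping cache on the modified one. A usage sketch with a hypothetical file path:

```c
/* Hinted memory mapped scan. The path is hypothetical; the madvise
 * flags are the standard Linux ones reused by extended madvise. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/ext4-dax/data.bin", O_RDONLY);  /* hypothetical */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Hint a sequential scan: run map-ahead aggressively. */
    madvise(p, st.st_size, MADV_SEQUENTIAL);

    long sum = 0;
    for (off_t i = 0; i < st.st_size; i += 4096)
        sum += p[i];

    /* Hint that the range is done: release it (into the mapping cache
     * on the modified kernel). */
    madvise(p, st.st_size, MADV_DONTNEED);
    munmap(p, st.st_size);
    close(fd);
    printf("%ld\n", sum);
    return 0;
}
```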

Mapping Cache Reuses Mapping Data

• When munmap is called, the existing kernel releases the VMA and PTEs
  – In memory based FSs, this mapping overhead is very large
• With a mapping cache, the mapping overhead can be reduced (see the sketch below)
  – VMAs and PTEs are cached in the mapping cache
  – When mmap is called, they are reused if possible (cache hit)

[Figure: VMAs in the mm_struct red-black tree cover the mmaped region of the address space; their page table entries (VA | PA) point into physical memory]
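A simplified sketch of the cache itself; the names are illustrative, and a linked list stands in for the kernel's real data structures (the VMAs live in a red-black tree, as the figure shows):

```c
/* Sketch of the mapping cache: on munmap, stash the VMA and its PTEs
 * instead of tearing them down; on a later mmap of the same file
 * range, reuse them. Illustrative only. */
#include <stddef.h>

struct cached_mapping {
    unsigned long inode_no;       /* which file */
    unsigned long offset, len;    /* which range of it */
    void *vma;                    /* preserved VMA */
    void *ptes;                   /* preserved page-table subtree */
    struct cached_mapping *next;
};

static struct cached_mapping *mapping_cache;

/* munmap path: keep the mapping instead of destroying it. */
void cache_on_munmap(struct cached_mapping *m)
{
    m->next = mapping_cache;
    mapping_cache = m;
}

/* mmap path: probe the cache before building new PTEs. */
struct cached_mapping *lookup_on_mmap(unsigned long ino,
                                      unsigned long off, unsigned long len)
{
    for (struct cached_mapping *m = mapping_cache; m; m = m->next)
        if (m->inode_no == ino && m->offset == off && m->len >= len)
            return m;             /* hit: reuse the VMA and PTEs */
    return NULL;                  /* miss: fall back to a normal mmap */
}
```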

Experimental Environment

• Experimental machine
  – Intel Xeon E5-2620
  – 32GB DRAM, 32GB NVDIMM-N (Netlist DDR4 NVDIMM-N, 16GB × 2)
  – Linux kernel 4.4
  – Ext4-DAX filesystem
• Benchmarks
  – fio : sequential & random read
  – YCSB on MongoDB : load workload
  – Apache HTTP server with httperf

fio : Sequential Read

[Figure: bandwidth (GB/s) vs. record size (4, 8, 16, 32, 64 KB) for default-mmap, populate-mmap, map-ahead, and extended madvise, with annotated gains of 46%, 38%, and 23%. Page faults drop from 16 million (default-mmap) to 14 thousand (map-ahead) and 13 thousand (extended madvise)]

fio : Random Read

[Figure: bandwidth (GB/s) vs. record size (4, 8, 16, 32, 64 KB) for default-mmap, populate-mmap, map-ahead, and extended madvise; with madvise(MADV_RANDOM), extended madvise behaves the same as default-mmap]

Web Server Experimental Setting

• Apache HTTP server
  – Memory mapped file I/O
  – 10 thousand 1MB-size HTML files (from Wikipedia)
  – Total size is about 10GB
• httperf clients
  – 8 additional machines
  – Zipf-like distribution

[Figure: the Apache HTTP server and the httperf clients share a 24-port 1Gbps switch (iperf: 8,800 Mbps aggregate bandwidth)]

Apache HTTP Server

[Figure: CPU cycles (trillions) vs. request rate (600-1000 reqs/s), broken down into kernel / libphp / libc / others, for no mapping cache, mapping cache (2GB), and mapping cache (no limit)]

Conclusion

• SW latency is becoming larger than the storage latency itself
  – Memory mapped file I/O can avoid this SW overhead
• Memory mapped file I/O still incurs expensive additional overhead
  – Page fault, TLB miss, and PTE construction overhead
• To exploit the benefits of memory mapped I/O, we propose
  – Map-ahead, extended madvise, mapping cache
• Our techniques demonstrate good performance by mitigating the mapping overhead
  – Map-ahead : 38% ↑
  – Map-ahead + extended madvise : 23% ↑
  – Mapping cache : 18% ↑

QnA
[email protected]
