Efficient Memory Mapped File I/O for In-Memory File Systems
Jungsik Choi†, Jiwon Kim, Hwansoo Han
Storage Latency Close to DRAM

[Figure: storage latency spectrum. SATA HDDs sit at milliseconds, SATA/SAS and PCIe flash SSDs at tens to hundreds of microseconds (~100 μs, ~60 μs, ~10 μs), 3D-XPoint/Optane below that, and NVDIMM-N at roughly 10 ns, close to DRAM.]

Large Overhead in Software
• Existing OSes were designed for fast CPUs and slow block devices
• With low-latency storage, SW overhead becomes the largest burden
• Software overhead includes
  – Complicated I/O stacks
  – Redundant memory copies
  – Frequent user/kernel mode switches

[Figure: normalized latency decomposition into software, storage, and NIC. The software share grows from about 7% with an HDD and a 10Gb NIC to roughly 50% with an NVMe SSD and 77-82% with PCM and 10Gb/100Gb NICs. Source: Ahn et al., MICRO-48]

• SW overhead must be addressed to fully exploit low-latency NVM storage

Eliminate SW Overhead with mmap
• In recent studies, memory mapped I/O has been commonly proposed
  – Memory mapped I/O can expose the NVM storage's raw performance
• Mapping files onto user memory space
  – Provide memory-like access
  – Avoid complicated I/O stacks
  – Minimize user/kernel mode switches
  – No data copies

[Figure: traditional I/O goes from the app through read/write syscalls, the file system, and the device driver to disk/flash; memory mapped I/O lets the app access persistent memory directly with load/store instructions.]

• A mmap syscall will be a critical interface

Memory Mapped I/O is Not as Fast as Expected
• Microbenchmark: sequential read of a 4GB file (Ext4-DAX on NVDIMM-N)
  – read (read system calls): 1.06 sec
  – default-mmap (mmap without any special flags): 1.42 sec
  – populate-mmap (mmap with a MAP_POPULATE flag): 1.02 sec
• Elapsed time breaks down into file access, page table entry construction, and page fault overhead: default-mmap spends about 0.63 sec on file access, 0.64 sec on page table entry construction, and 0.16 sec on page fault handling, while populate-mmap spends about 0.39 sec on file access and 0.64 sec on page table entry construction
• (A user-level sketch of this comparison appears after the map-ahead slide below.)

Overhead in Memory Mapped File I/O
• Memory mapped I/O can avoid the SW overhead of traditional file I/O
• However, memory mapped I/O causes other SW overhead
  – Page fault overhead
  – TLB miss overhead
  – PTE construction overhead
• This overhead reduces the advantages of memory mapped file I/O

Techniques to Alleviate Storage Latency
• Readahead ⇒ Map-ahead
  – Preload pages that are expected to be accessed
• Page cache ⇒ Mapping cache
  – Cache frequently accessed pages
• fadvise/madvise interfaces ⇒ Extended madvise
  – Utilize user hints to manage pages
• However, these existing techniques can't be used in memory based FSs
  ⇒ New optimization is needed in memory based FSs

[Figure: the app passes user hints (madvise/fadvise) to the kernel, which manages the page cache in main memory and runs readahead against secondary storage.]

Map-ahead Constructs PTEs in Advance
• When a page fault occurs, the page fault handler handles
  – The page that caused the page fault (existing demand paging)
  – Pages that are expected to be accessed (map-ahead)
• The kernel analyzes page fault patterns to predict the pages to be accessed
  – Sequential fault: map-ahead window size ↑
  – Random fault: map-ahead window size ↓
• Map-ahead can reduce the number of page faults

[Figure: fault-pattern state machine with PURE_SEQ, NEARLY_SEQ, and NON_SEQ states; sequential faults grow the window size and random faults shrink it.]
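As referenced on the "Memory Mapped I/O is Not as Fast as Expected" slide above, the comparison of read, default mmap, and MAP_POPULATE mmap can be reproduced at user level. The following is a minimal sketch, not the authors' benchmark code; the file path is a placeholder, and in practice each mode should run in a fresh process so page cache and mapping state do not carry over.

    /* Minimal sketch of the sequential-read microbenchmark: read() system calls,
     * a default mmap, and an mmap with MAP_POPULATE. The path is a placeholder. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <time.h>
    #include <unistd.h>

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    /* Touch one byte per page so every page of the mapping is really faulted in. */
    static unsigned long touch_pages(const volatile unsigned char *p, size_t len)
    {
        unsigned long sum = 0;
        for (size_t off = 0; off < len; off += 4096)
            sum += p[off];
        return sum;
    }

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "/mnt/ext4-dax/testfile"; /* placeholder */
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
        size_t len = st.st_size;
        unsigned long sink = 0;

        /* 1. read(): one system call per 1MB record, data copied into a user buffer */
        char *buf = malloc(1 << 20);
        double t = now_sec();
        for (off_t off = 0; off < (off_t)len; ) {
            ssize_t n = pread(fd, buf, 1 << 20, off);
            if (n <= 0) break;
            off += n;
        }
        printf("read:          %.2f sec\n", now_sec() - t);

        /* 2. default mmap: pages are mapped lazily, so the scan pays one page
         *    fault (plus PTE construction) per page */
        t = now_sec();
        unsigned char *m = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (m == MAP_FAILED) { perror("mmap"); return 1; }
        sink += touch_pages(m, len);
        printf("default-mmap:  %.2f sec\n", now_sec() - t);
        munmap(m, len);

        /* 3. MAP_POPULATE: the kernel builds the page tables inside mmap(), so
         *    the scan avoids most page faults but PTE construction remains */
        t = now_sec();
        m = mmap(NULL, len, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);
        if (m == MAP_FAILED) { perror("mmap"); return 1; }
        sink += touch_pages(m, len);
        printf("populate-mmap: %.2f sec\n", now_sec() - t);
        munmap(m, len);

        printf("(checksum: %lu)\n", sink);
        free(buf);
        close(fd);
        return 0;
    }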
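The deck describes the map-ahead window policy only at the level of the state diagram on the map-ahead slide: grow the window on sequential faults, shrink it on random faults. The toy C program below simulates that idea in user space; the constants, the doubling/halving rule, and the sequentiality test are assumptions for illustration, not the authors' kernel implementation.

    /* Hypothetical illustration of the map-ahead window policy: sequential
     * faults grow the window, random faults shrink it. Not the kernel patch. */
    #include <stddef.h>
    #include <stdio.h>

    #define WIN_MIN 1UL      /* fall back to plain demand paging       */
    #define WIN_MAX 512UL    /* cap on pages mapped ahead of a fault   */

    struct mapahead_state {
        unsigned long last_fault;  /* page index of the previous fault   */
        unsigned long winsize;     /* current map-ahead window, in pages */
    };

    /* Called per page fault: returns how many pages to map starting at 'page'. */
    static unsigned long mapahead_window(struct mapahead_state *st, unsigned long page)
    {
        if (page == st->last_fault + st->winsize) {
            /* Fault lands right past the previously mapped window:
             * sequential pattern, so double the window (bounded by WIN_MAX). */
            st->winsize = st->winsize * 2 > WIN_MAX ? WIN_MAX : st->winsize * 2;
        } else {
            /* Fault lands somewhere else: treat as random and shrink the window. */
            st->winsize = st->winsize / 2 < WIN_MIN ? WIN_MIN : st->winsize / 2;
        }
        st->last_fault = page;
        return st->winsize;   /* caller would construct PTEs for this many pages */
    }

    int main(void)
    {
        struct mapahead_state st = { .last_fault = 0, .winsize = WIN_MIN };
        /* Simulated fault pages: a sequential run followed by a random jump. */
        unsigned long faults[] = { 0, 1, 3, 7, 15, 1000 };
        for (size_t i = 0; i < sizeof(faults) / sizeof(faults[0]); i++)
            printf("fault at page %-5lu -> map %lu page(s)\n",
                   faults[i], mapahead_window(&st, faults[i]));
        return 0;
    }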
Comparison of Demand Paging & Map-ahead
• Existing demand paging
  – Every access to an unmapped page (a, b, c, ...) traps into the page fault handler, one page at a time
  – Overhead in user/kernel mode switches, page faults, and TLB misses
• Map-ahead
  – The first fault maps a whole window of pages (map-ahead window size == 5 in the example), so the following accesses proceed without faulting: performance gain

[Figure: timelines contrasting demand paging, with one fault per accessed page, against map-ahead, where one fault maps the next five pages.]

Async Map-ahead Hides Mapping Overhead
• Sync map-ahead
  – The faulting thread maps the entire map-ahead window (size == 5) before returning to the application
• Async map-ahead
  – The page fault handler maps only the first pages and kernel threads map the rest in the background: performance gain

[Figure: timelines contrasting sync map-ahead with async map-ahead, where kernel threads construct the remaining mappings concurrently with the application.]

Extended madvise Makes User Hints Available
• Taking advantage of user hints can maximize the app's performance
  – However, the existing interfaces optimize only paging
• Extended madvise makes user hints available in memory based FSs (a usage sketch follows the fio results below)
  – MADV_SEQUENTIAL / MADV_WILLNEED: existing madvise runs readahead aggressively ⇒ extended madvise runs map-ahead aggressively
  – MADV_RANDOM: existing madvise stops readahead ⇒ extended madvise stops map-ahead
  – MADV_DONTNEED: existing madvise releases pages in the page cache ⇒ extended madvise releases mappings in the mapping cache

Mapping Cache Reuses Mapping Data
• When munmap is called, the existing kernel releases the VMA and PTEs
  – In memory based FSs, this mapping overhead is very large
• With the mapping cache, mapping overhead can be reduced
  – VMAs and PTEs are cached in the mapping cache
  – When mmap is called, they are reused if possible (cache hit)

[Figure: the mm_struct's red-black tree of VMAs in the process address space, the page table (VA to PA entries), and the mmaped region in physical memory are kept so they can be reused.]

Experimental Environment
• Experimental machine
  – Intel Xeon E5-2620
  – 32GB DRAM, 32GB NVDIMM-N (Netlist DDR4 NVDIMM-N, 16GB × 2)
  – Linux kernel 4.4
  – Ext4-DAX filesystem
• Benchmarks
  – fio: sequential & random read
  – YCSB on MongoDB: load workload
  – Apache HTTP server with httperf

fio: Sequential Read

[Figure: bandwidth (GB/s) vs. record size (4KB to 64KB) for default-mmap, populate-mmap, map-ahead, and extended madvise; annotated improvements of 23%, 38%, and 46%, with the page fault count dropping from about 16 million (default-mmap) to 13-14 thousand.]

fio: Random Read

[Figure: bandwidth (GB/s) vs. record size (4KB to 64KB) for the same four configurations; with the madvise(MADV_RANDOM) hint, performance matches default-mmap.]
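As noted on the extended madvise slide, applications pass these hints through the standard madvise(2) call; only the kernel-side handling changes in the proposed design. The sketch below shows plain madvise usage on a memory mapped file, assuming a placeholder path on a DAX-mounted file system; on a stock kernel the same flags steer readahead and the page cache rather than map-ahead and the mapping cache.

    /* Minimal sketch of passing access hints for a memory mapped file with the
     * standard madvise(2) flags. With extended madvise, the same flags would
     * drive map-ahead and the mapping cache. The file path is a placeholder. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/mnt/ext4-dax/data.bin", O_RDONLY);   /* placeholder path */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
        size_t len = st.st_size;

        char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* Sequential scan ahead: run map-ahead (readahead on a stock kernel)
         * aggressively over the whole mapping. */
        madvise(p, len, MADV_SEQUENTIAL);
        unsigned long sum = 0;
        for (size_t off = 0; off < len; off += 4096)
            sum += (unsigned char)p[off];

        /* Switching to random lookups: stop map-ahead (readahead) so the kernel
         * does not map pages the application will never touch. */
        madvise(p, len, MADV_RANDOM);

        /* Done with this range: release the associated mapping data (pages in
         * the page cache, or cached VMAs/PTEs in the mapping cache). */
        madvise(p, len, MADV_DONTNEED);

        printf("checksum: %lu\n", sum);
        munmap(p, len);
        close(fd);
        return 0;
    }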
Web Server Experimental Setting
• Apache HTTP server
  – Memory mapped file I/O
  – 10 thousand 1MB HTML files (from Wikipedia)
  – Total size is about 10GB
• httperf clients
  – 8 additional machines
  – Zipf-like distribution
• 1Gbps switch, 24 ports (iperf: 8,800Mbps bandwidth)

Apache HTTP Server

[Figure: CPU cycles (trillions, 10^12) decomposed into kernel, libphp, libc, and others at request rates of 600 to 1000 reqs/s, comparing no mapping cache, a 2GB mapping cache, and an unlimited mapping cache.]

Conclusion
• SW latency is becoming larger than the storage latency
  – Memory mapped file I/O can avoid this SW overhead
• Memory mapped file I/O still incurs expensive additional overhead
  – Page fault, TLB miss, and PTE construction overhead
• To exploit the benefits of memory mapped I/O, we propose
  – Map-ahead, extended madvise, and mapping cache
• Our techniques demonstrate good performance by mitigating the mapping overhead
  – Map-ahead: 38% ↑
  – Map-ahead + extended madvise: 23% ↑
  – Mapping cache: 18% ↑

QnA
[email protected]