Efficient Memory Mapped File I/O for In-Memory File Systems
Jungsik Choi†, Jiwon Kim, Hwansoo Han

Storage Latency Close to DRAM
[Figure: storage latency spectrum, milliseconds to nanoseconds — SATA HDD (ms), SATA/SAS Flash SSD (~100 μs), PCIe Flash SSD (~60 μs), 3D-XPoint & Optane (~10 μs), NVDIMM-N (~10 ns); y-axis: operations per second]

Large Overhead in Software
• Existing OSes were designed for fast CPUs and slow block devices
• With low-latency storage, software overhead becomes the largest burden
• Software overhead includes
  – Complicated I/O stacks
  – Redundant memory copies
  – Frequent user/kernel mode switches
[Figure: normalized latency decomposition into software / storage / NIC — software accounts for as little as 7% of latency with an HDD and 10Gb NIC, but 50–82% with NVMe SSD or PCM and 10–100Gb NICs (Source: Ahn et al., MICRO-48)]
• SW overhead must be addressed to fully exploit low-latency NVM storage
Eliminate SW Overhead with mmap
• In recent studies, memory mapped I/O has been commonly proposed
  – Memory mapped I/O can expose the NVM storage's raw performance
• Mapping files onto user memory space
  – Provides memory-like access via load/store instructions
  – Avoids complicated I/O stacks
  – Minimizes user/kernel mode switches
  – No data copies
[Figure: traditional I/O (app → r/w syscalls → file system → device driver → disk/flash) vs. memory mapped I/O (app → load/store instructions → persistent memory)]
• The mmap syscall will be a critical interface
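As a sketch of the idea above, the snippet below maps a file and reads it through plain memory loads instead of per-read syscalls, using Python's `mmap` wrapper over the same Linux primitive (the talk itself targets C programs; function name is illustrative):

```python
import mmap
import os
import tempfile

def read_via_mmap(path: str) -> bytes:
    """Map a file into the address space and read it with memory loads,
    avoiding per-read syscalls and kernel-to-user data copies."""
    with open(path, "rb") as f:
        # length=0 maps the whole file; ACCESS_READ gives a read-only mapping
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[:]  # plain memory access, no read() syscalls

# usage: write a small file, then read it back through the mapping
if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.write(fd, b"hello, mmap")
    os.close(fd)
    assert read_via_mmap(path) == b"hello, mmap"
    os.unlink(path)
```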
Memory Mapped I/O is Not as Fast as Expected
• Microbenchmark: sequential read of a 4GB file (Ext4-DAX on NVDIMM-N)
[Figure: elapsed time (sec), broken into file access, page fault overhead, and page table entry construction]
  – read (read system calls): 1.06 s
  – default-mmap (mmap without any special flags): 1.42 s ≈ 0.64 s file access + 0.63 s page fault overhead + 0.16 s page table entry construction
  – populate-mmap (mmap with the MAP_POPULATE flag): 1.02 s ≈ 0.64 s file access + 0.39 s page table entry construction
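The populate-mmap variant can be reproduced roughly as follows (a sketch: `MAP_POPULATE` is Linux-only and exposed only in newer Python versions, so the flag falls back to 0 where unavailable; function name is illustrative):

```python
import mmap
import os
import tempfile

# MAP_POPULATE asks the kernel to build page table entries at mmap time,
# trading a slower mmap() for fault-free accesses afterwards.
MAP_POPULATE = getattr(mmap, "MAP_POPULATE", 0)  # 0 = plain mmap where unsupported

def populate_mmap_read(path: str) -> int:
    """Map a file with MAP_POPULATE (when available) and touch every page."""
    total = 0
    with open(path, "rb") as f:
        m = mmap.mmap(f.fileno(), 0,
                      flags=mmap.MAP_SHARED | MAP_POPULATE,
                      prot=mmap.PROT_READ)
        try:
            page = mmap.PAGESIZE
            for off in range(0, len(m), page):
                total += m[off]  # a load per page; no fault if pre-populated
        finally:
            m.close()
    return total
```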
Overhead in Memory Mapped File I/O
• Memory mapped I/O avoids the SW overhead of traditional file I/O
• However, memory mapped I/O introduces its own SW overhead
  – Page fault overhead
  – TLB miss overhead
  – PTE construction overhead
• This overhead erodes the advantages of memory mapped file I/O
Techniques to Alleviate Storage Latency
• Readahead ⇒ Map-ahead
  – Preload pages that are expected to be accessed
• Page cache ⇒ Mapping cache
  – Cache frequently accessed pages
• fadvise/madvise interfaces ⇒ Extended madvise
  – Utilize user hints (madvise/fadvise) to manage pages
[Figure: the app issues user hints to the page cache in main memory, which performs readahead from secondary storage]
• However, these techniques can't be used in memory based FSs
  ⇒ New optimizations are needed for memory based FSs
Map-ahead Constructs PTEs in Advance
• When a page fault occurs, the page fault handler handles
  – the page that caused the fault (existing demand paging)
  – pages that are expected to be accessed soon (map-ahead)
• The kernel analyzes page fault patterns to predict which pages will be accessed
  – Sequential faults: map-ahead window size ↑
  – Random faults: map-ahead window size ↓
[Figure: state machine — sequential faults promote NON_SEQ → NEARLY_SEQ → PURE_SEQ and grow the window; random faults demote toward NON_SEQ and shrink it]
• Map-ahead reduces the number of page faults
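The window-size policy above can be modeled as a small state machine. This is a sketch under assumed growth/shrink rules (doubling/halving and the bounds are illustrative; the slide names only the states and the direction of change):

```python
# States of the fault-pattern detector, as named on the slide.
PURE_SEQ, NEARLY_SEQ, NON_SEQ = "PURE_SEQ", "NEARLY_SEQ", "NON_SEQ"

class MapAheadWindow:
    """Tracks page-fault patterns and adapts the map-ahead window size.
    Growth/shrink factors and bounds are illustrative assumptions."""

    def __init__(self, min_size=1, max_size=512):
        self.state = NON_SEQ
        self.size = min_size
        self.min_size, self.max_size = min_size, max_size

    def on_fault(self, sequential: bool) -> int:
        if sequential:
            # seq fault: promote the state and grow the window
            if self.state == NON_SEQ:
                self.state = NEARLY_SEQ
            elif self.state == NEARLY_SEQ:
                self.state = PURE_SEQ
            self.size = min(self.size * 2, self.max_size)
        else:
            # rnd fault: demote the state and shrink the window
            self.state = NEARLY_SEQ if self.state == PURE_SEQ else NON_SEQ
            self.size = max(self.size // 2, self.min_size)
        return self.size
```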
Comparison of Demand Paging & Map-ahead
• Existing demand paging — every page access pays a user/kernel mode switch, a page fault, and TLB misses
[Timeline: the application touches pages a–f; the page fault handler runs once per page (faults 1–5)]
• Map-ahead
[Timeline: the first fault maps the whole window at once, so accesses to the following pages proceed without faulting ⇒ performance gain]
• Map-ahead window size == 5
Async Map-ahead Hides Mapping Overhead
• Sync map-ahead
[Timeline: the faulting thread maps the entire window (pages 1–5) before the application resumes]
• Async map-ahead
[Timeline: the page fault handler maps only the first pages (1–2) and hands the rest (3–5) to kernel threads, so the application resumes sooner ⇒ performance gain]
• Map-ahead window size == 5
Extended madvise Makes User Hints Available
• Taking advantage of user hints can maximize an app's performance
  – However, the existing interfaces optimize only paging
• Extended madvise makes user hints available in memory based FSs

  Hint                               Existing madvise             Extended madvise
  MADV_SEQUENTIAL / MADV_WILLNEED    Run readahead aggressively   Run map-ahead aggressively
  MADV_RANDOM                        Stop readahead               Stop map-ahead
  MADV_DONTNEED                      Release pages in page cache  Release mappings in mapping cache
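From the application side the interface is unchanged. The snippet below issues the same hints through Python's `mmap.madvise` wrapper (POSIX-only, Python 3.8+); on a kernel with extended madvise these hints would steer map-ahead and the mapping cache instead of readahead and the page cache. The function name is illustrative:

```python
import mmap
import os
import tempfile

def scan_with_hint(path: str) -> int:
    """Advise the kernel before a sequential scan, then drop the pages."""
    with open(path, "rb") as f:
        m = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
        try:
            if hasattr(m, "madvise"):            # Python 3.8+, POSIX only
                m.madvise(mmap.MADV_SEQUENTIAL)  # hint: aggressive (map-)ahead
            total = sum(m[off] for off in range(0, len(m), mmap.PAGESIZE))
            if hasattr(m, "madvise"):
                m.madvise(mmap.MADV_DONTNEED)    # hint: release pages/mappings
            return total
        finally:
            m.close()
```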
Mapping Cache Reuses Mapping Data
• When munmap is called, the existing kernel releases the VMA and PTEs
  – In memory based FSs, this mapping overhead is very large
• With a mapping cache, mapping overhead can be reduced
  – VMAs and PTEs are cached in the mapping cache
  – When mmap is called, they are reused if possible (cache hit)
[Figure: mm_struct red-black tree of VMAs → mmaped region in the process address space → page table (VA | PA entries) → physical memory]
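A user-space model of the idea — park the (file, length) → mapping association on munmap and hand it back on a matching mmap — might look like this. It is purely illustrative: the real cache lives in the kernel and retains VMAs and PTEs, and the key, capacity, and eviction policy here are assumptions:

```python
class MappingCache:
    """Model of a mapping cache: munmap parks the mapping instead of
    destroying it; a later mmap of the same file reuses it (cache hit)."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.cache = {}          # (file_id, length) -> mapping object
        self.hits = self.misses = 0

    def mmap(self, file_id, length, build_mapping):
        key = (file_id, length)
        if key in self.cache:    # hit: reuse the cached VMA/PTEs
            self.hits += 1
            return self.cache.pop(key)
        self.misses += 1
        return build_mapping(file_id, length)   # miss: construct PTEs

    def munmap(self, file_id, length, mapping):
        if len(self.cache) < self.capacity:     # park instead of releasing
            self.cache[(file_id, length)] = mapping
        # else: actually release (evict), as the real kernel would

# usage: the second mmap of the same file is a cache hit
cache = MappingCache()
m1 = cache.mmap("a.txt", 4096, lambda f, n: {"file": f, "len": n})
cache.munmap("a.txt", 4096, m1)
m2 = cache.mmap("a.txt", 4096, lambda f, n: {"file": f, "len": n})
assert m2 is m1 and cache.hits == 1
```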
Experimental Environment
• Experimental machine
  – Intel Xeon E5-2620
  – 32GB DRAM, 32GB NVDIMM-N (Netlist DDR4 NVDIMM-N, 16GB × 2)
  – Linux kernel 4.4
  – Ext4-DAX filesystem
• Benchmarks
  – fio: sequential & random read
  – YCSB on MongoDB: load workload
  – Apache HTTP server with httperf
fio : Sequential Read
[Bar chart: bandwidth (GB/s, 0–7) vs. record size (4–64 KB) for default-mmap, populate-mmap, map-ahead, and extended madvise]
• default-mmap incurs ~16 million page faults; map-ahead cuts this to ~14 thousand, and extended madvise to ~13 thousand
• Annotated gains on the chart: 23%, 38%, and 46% bandwidth improvements
fio : Random Read
[Bar chart: bandwidth (GB/s, 0–4) vs. record size (4–64 KB) for default-mmap, populate-mmap, map-ahead, and extended madvise]
• madvise(MADV_RANDOM) == default-mmap: with a random-access hint, extended madvise behaves like default-mmap
Web Server Experimental Setting
• Apache HTTP server
  – Memory mapped file I/O
  – 10 thousand 1MB-size HTML files (from Wikipedia); total size is about 10GB
• httperf clients
  – 8 additional machines
  – Zipf-like distribution
  – Connected via a 24-port 1Gbps switch (iperf: 8,800 Mbps aggregate bandwidth)
Apache HTTP Server
[Stacked bar chart: CPU cycles (trillions, 10^12) spent in kernel, libphp, libc, and others vs. request rate (600–1000 reqs/s), comparing no mapping cache, mapping cache (2GB), and mapping cache (no limit)]
Conclusion
• SW latency is becoming larger than the storage latency
  – Memory mapped file I/O can avoid this SW overhead
• Memory mapped file I/O still incurs expensive additional overhead
  – Page fault, TLB miss, and PTE construction overhead
• To exploit the benefits of memory mapped I/O, we propose
  – Map-ahead, extended madvise, and mapping cache
• Our techniques perform well by mitigating the mapping overhead
  – Map-ahead: 38% ↑
  – Map-ahead + extended madvise: 23% ↑
  – Mapping cache: 18% ↑
QnA: [email protected]