Efficient Memory Mapped File I/O for In-Memory File Systems
Jungsik Choi†, Jiwon Kim, Hwansoo Han

Storage Latency Close to DRAM

[Figure: operations per second vs. access latency, from milliseconds down to nanoseconds: SATA HDD (millisec), SATA/SAS Flash SSD (~100 μs), PCIe Flash SSD (~60 μs), 3D-XPoint & Optane (~10 μs), NVDIMM-N (~10 ns)]

Large Overhead in Software

• Existing OSes were designed for fast CPUs and slow block devices
• With low-latency storage, SW overhead becomes the largest burden

• Software overhead includes
  – Complicated I/O stacks
  – Redundant memory copies
  – Frequent user/kernel mode switches

[Figure: normalized latency decomposition into software / storage / NIC; software share: HDD + 10Gb NIC: 7%, NVMe SSD + 10Gb NIC: 50%, PCM + 10Gb NIC: 77%, PCM + 100Gb NIC: 82% (Source: Ahn et al., MICRO-48)]

• SW overhead must be addressed to fully exploit low-latency NVM storage

Eliminate SW Overhead with Memory Mapped File I/O

• In recent studies, memory mapped I/O has been commonly proposed
  – Memory mapped I/O can expose the NVM storage's raw performance

• Mapping files onto user memory space
  – Provide memory-like access
  – Avoid complicated I/O stacks
  – Minimize user/kernel mode switches
  – No data copies
• A mmap syscall will be a critical interface

[Figure: traditional I/O issues r/w syscalls that cross into the kernel and down to disk/flash; memory mapped I/O issues load/store instructions directly against persistent memory]
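To make this concrete, here is a minimal sketch of the memory mapped path in C; the function and its checksum loop are illustrative, and error handling is trimmed:

```c
/* Minimal sketch: read a file with load instructions instead of read()
 * syscalls. Illustrative only; error handling is trimmed. */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

long sum_file(const char *path)   /* e.g., a file on an Ext4-DAX mount */
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    /* One mmap syscall; afterwards every access is a plain load, with
     * no per-access kernel crossing and no copy into a user buffer. */
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;

    long sum = 0;
    for (off_t i = 0; i < st.st_size; i++)
        sum += p[i];

    munmap(p, st.st_size);
    close(fd);
    return sum;
}
```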

Memory Mapped I/O is Not as Fast as Expected

• Microbenchmark: sequential read, a 4GB file (Ext4-DAX on NVDIMM-N)

[Figure: elapsed time (sec), broken into file access, page table entry construction, and page fault overhead: read: 1.06, default-mmap: 1.42, populate-mmap: 1.02]

• read : read system calls
• default-mmap : mmap without any special flags
• populate-mmap : mmap with a MAP_POPULATE flag
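The three bars correspond to access paths like the following sketch (the timing harness is omitted; MAP_POPULATE is the stock Linux flag that pre-builds all page table entries at mmap time):

```c
/* The three variants under test; sizes match the 4GB benchmark file. */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define FILE_SIZE (4UL << 30)   /* 4GB */
#define PAGE      4096UL

/* read: a syscall and a copy per chunk, but no faults on a mapping */
static void seq_read(int fd)
{
    char *buf = malloc(1 << 20);
    while (read(fd, buf, 1 << 20) > 0)
        ;
    free(buf);
}

/* default-mmap: page faults and PTE construction happen lazily,
 * inside the access loop */
static void seq_default_mmap(int fd)
{
    volatile char *p = mmap(NULL, FILE_SIZE, PROT_READ, MAP_SHARED, fd, 0);
    for (unsigned long i = 0; i < FILE_SIZE; i += PAGE)
        (void)p[i];
    munmap((void *)p, FILE_SIZE);
}

/* populate-mmap: all PTEs are built up front, so the loop runs
 * fault-free */
static void seq_populate_mmap(int fd)
{
    volatile char *p = mmap(NULL, FILE_SIZE, PROT_READ,
                            MAP_SHARED | MAP_POPULATE, fd, 0);
    for (unsigned long i = 0; i < FILE_SIZE; i += PAGE)
        (void)p[i];
    munmap((void *)p, FILE_SIZE);
}
```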

Overhead in Memory Mapped File I/O

• Memory mapped I/O can avoid the SW overhead of traditional file I/O

• However, memory mapped I/O causes another SW overhead
  – Page fault overhead
  – TLB miss overhead
  – PTE construction overhead

• This overhead diminishes the advantages of memory mapped file I/O

Techniques to Alleviate Storage Latency

• Readahead ⇒ Map-ahead
  – Preload pages that are expected to be accessed
• Page cache ⇒ Mapping cache
  – Cache frequently accessed pages
• fadvise/madvise interfaces ⇒ Extended madvise
  – Utilize user hints to manage pages (madvise/fadvise)

[Figure: app issuing user hints to main memory, which holds the page cache and runs readahead against secondary storage]

• However, these can't be used in memory based FSs
  ⇒ New optimization is needed in memory based FSs

Map-ahead Constructs PTEs in Advance

• When a page fault occurs, the page fault handler handles
  – The page that caused the page fault (existing demand paging)
  – Pages that are expected to be accessed (map-ahead)
• Kernel analyzes page fault patterns to predict pages to be accessed (see the sketch below)
  – Sequential fault : map-ahead window size ↑
  – Random fault : map-ahead window size ↓
• Map-ahead can reduce # of page faults

[Figure: fault-pattern state machine: PURE_SEQ (winsize ↑), NEARLY_SEQ (winsize unchanged), NON_SEQ (winsize ↓), with seq/rnd faults driving the transitions]
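A rough rendition of this window policy in C follows; the state names mirror the figure, while the doubling/halving rule and the window bounds are assumptions rather than the authors' exact kernel logic:

```c
/* Sketch of the map-ahead window policy. Names and thresholds are
 * illustrative, not the actual kernel implementation. */
enum fault_pattern { PURE_SEQ, NEARLY_SEQ, NON_SEQ };

struct map_ahead_state {
    enum fault_pattern pattern;
    unsigned long last_fault_pfn;  /* page of the previous fault */
    unsigned long winsize;         /* pages to pre-map on the next fault */
};

#define WIN_MIN 1UL
#define WIN_MAX 512UL   /* assumed cap */

/* Classify the new fault and grow/shrink the map-ahead window. */
static void update_window(struct map_ahead_state *s, unsigned long fault_pfn)
{
    int seq = (fault_pfn == s->last_fault_pfn + 1);

    switch (s->pattern) {
    case PURE_SEQ:                   /* sequential run: winsize ↑ */
        if (seq) {
            if (s->winsize * 2 <= WIN_MAX)
                s->winsize *= 2;
        } else {
            s->pattern = NEARLY_SEQ; /* one stray fault: hold winsize */
        }
        break;
    case NEARLY_SEQ:                 /* undecided: winsize unchanged */
        s->pattern = seq ? PURE_SEQ : NON_SEQ;
        break;
    case NON_SEQ:                    /* random run: winsize ↓ */
        if (seq)
            s->pattern = NEARLY_SEQ;
        else if (s->winsize / 2 >= WIN_MIN)
            s->winsize /= 2;
        break;
    }
    s->last_fault_pfn = fault_pfn;
}
```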

Comparison of Demand Paging & Map-ahead

• Existing demand paging: overhead in user/kernel mode switch, page fault, TLB miss

[Figure: timeline: the application faults on every page a-f, and the page fault handler runs once per page]

• Map-ahead

[Figure: timeline: one page fault maps the whole window (map-ahead window size == 5), so the application continues without further faults: performance gain]

Async Map-ahead Hides Mapping Overhead

• Sync map-ahead

[Figure: timeline: the page fault handler maps the entire window (pages 1-5) before returning to the application]

• Async map-ahead (see the sketch below)

[Figure: timeline: the handler maps only pages 1-2 and returns; kernel threads map pages 3-5 in the background, overlapping with the application: performance gain. Map-ahead window size == 5]
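The division of labor can be imitated in userspace with a helper thread standing in for the kernel threads; this is only an analogy of the mechanism, which really lives inside the page fault handler:

```c
/* Userspace analogy of async map-ahead: after the pages needed right
 * now are mapped, a helper thread pre-faults the rest of the window so
 * PTE construction overlaps with the application's work. */
#include <pthread.h>

#define PAGE 4096UL

struct window { volatile const char *base; unsigned long npages; };

/* Helper: touch each page so the kernel builds its PTE in parallel. */
static void *prefault_worker(void *arg)
{
    struct window *w = arg;
    for (unsigned long i = 0; i < w->npages; i++)
        (void)w->base[i * PAGE];
    return NULL;
}

/* Fault pages 1-2 synchronously, hand pages 3..winsize to the helper. */
static pthread_t start_async_mapahead(volatile const char *addr,
                                      unsigned long winsize,
                                      struct window *w)
{
    pthread_t tid;
    (void)addr[0];                /* sync: page 1 */
    (void)addr[PAGE];             /* sync: page 2 */
    w->base = addr + 2 * PAGE;    /* async: pages 3..winsize */
    w->npages = winsize - 2;
    pthread_create(&tid, NULL, prefault_worker, w);
    return tid;                   /* caller joins when convenient */
}
```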

Extended madvise Makes User Hints Available

• Taking advantage of user hints can maximize the app's performance
  – However, the existing interfaces only optimize paging
• Extended madvise makes user hints available in memory based FSs

Hint                           | Existing madvise             | Extended madvise
MADV_SEQUENTIAL, MADV_WILLNEED | Run readahead aggressively   | Run map-ahead aggressively
MADV_RANDOM                    | Stop readahead               | Stop map-ahead
MADV_DONTNEED                  | Release pages in page cache  | Release mappings in mapping cache
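From the application's side the interface is unchanged: extended madvise reuses the existing flags, so the same calls that steer readahead on a stock kernel steer map-ahead and the mapping cache on the modified one. A usage sketch with a hypothetical file path:

```c
/* Hinted memory mapped scan. The path is hypothetical; the madvise
 * flags are the standard Linux ones reused by extended madvise. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/ext4-dax/data.bin", O_RDONLY);  /* hypothetical */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Hint a sequential scan: run map-ahead aggressively. */
    madvise(p, st.st_size, MADV_SEQUENTIAL);

    long sum = 0;
    for (off_t i = 0; i < st.st_size; i += 4096)
        sum += p[i];

    /* Hint that the range is done: release it (into the mapping cache
     * on the modified kernel). */
    madvise(p, st.st_size, MADV_DONTNEED);
    munmap(p, st.st_size);
    close(fd);
    printf("%ld\n", sum);
    return 0;
}
```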

Mapping Cache Reuses Mapping Data

• When munmap is called, the existing kernel releases the VMA and PTEs
  – In memory based FSs, this mapping overhead is very large
• With a mapping cache, the mapping overhead can be reduced (see the sketch below)
  – VMAs and PTEs are cached in the mapping cache
  – When mmap is called, they are reused if possible (cache hit)

[Figure: VMAs in the mm_struct red-black tree cover the mmaped region of the address space; their page table entries (VA | PA) point into physical memory]
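A simplified sketch of the cache itself; the names are illustrative, and a linked list stands in for the kernel's real data structures (the VMAs live in a red-black tree, as the figure shows):

```c
/* Sketch of the mapping cache: on munmap, stash the VMA and its PTEs
 * instead of tearing them down; on a later mmap of the same file
 * range, reuse them. Illustrative only. */
#include <stddef.h>

struct cached_mapping {
    unsigned long inode_no;       /* which file */
    unsigned long offset, len;    /* which range of it */
    void *vma;                    /* preserved VMA */
    void *ptes;                   /* preserved page-table subtree */
    struct cached_mapping *next;
};

static struct cached_mapping *mapping_cache;

/* munmap path: keep the mapping instead of destroying it. */
void cache_on_munmap(struct cached_mapping *m)
{
    m->next = mapping_cache;
    mapping_cache = m;
}

/* mmap path: probe the cache before building new PTEs. */
struct cached_mapping *lookup_on_mmap(unsigned long ino,
                                      unsigned long off, unsigned long len)
{
    for (struct cached_mapping *m = mapping_cache; m; m = m->next)
        if (m->inode_no == ino && m->offset == off && m->len >= len)
            return m;             /* hit: reuse the VMA and PTEs */
    return NULL;                  /* miss: fall back to a normal mmap */
}
```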

Experimental Environment

• Experimental machine
  – Intel Xeon E5-2620
  – 32GB DRAM, 32GB NVDIMM-N (Netlist DDR4 NVDIMM-N, 16GB × 2)
  – Linux kernel 4.4
  – Ext4-DAX filesystem
• Benchmarks
  – fio : sequential & random read
  – YCSB on MongoDB : load workload
  – Apache HTTP server with httperf

fio : Sequential Read

[Figure: bandwidth (GB/s) vs. record size (4, 8, 16, 32, 64 KB) for default-mmap, populate-mmap, map-ahead, and extended madvise, with annotated gains of 46%, 38%, and 23%. Page faults drop from 16 million (default-mmap) to 14 thousand (map-ahead) and 13 thousand (extended madvise)]

fio : Random Read

[Figure: bandwidth (GB/s) vs. record size (4, 8, 16, 32, 64 KB) for default-mmap, populate-mmap, map-ahead, and extended madvise; with madvise(MADV_RANDOM), extended madvise behaves the same as default-mmap]

Web Server Experimental Setting

• Apache HTTP server
  – Memory mapped file I/O
  – 10 thousand 1MB-size HTML files (from Wikipedia)
  – Total size is about 10GB
• httperf clients
  – 8 additional machines
  – Zipf-like distribution

[Figure: the Apache HTTP server and the httperf clients share a 24-port 1Gbps switch (iperf: 8,800 Mbps aggregate bandwidth)]

Apache HTTP Server

[Figure: CPU cycles (trillions) vs. request rate (600-1000 reqs/s), broken down into kernel / libphp / libc / others, for no mapping cache, mapping cache (2GB), and mapping cache (no limit)]

Conclusion

• SW latency is becoming larger than the storage latency itself
  – Memory mapped file I/O can avoid this SW overhead
• Memory mapped file I/O still incurs expensive additional overhead
  – Page fault, TLB miss, and PTE construction overhead
• To exploit the benefits of memory mapped I/O, we propose
  – Map-ahead, extended madvise, mapping cache
• Our techniques demonstrate good performance by mitigating the mapping overhead
  – Map-ahead : 38% ↑
  – Map-ahead + extended madvise : 23% ↑
  – Mapping cache : 18% ↑

QnA
[email protected]
