UMap: Enabling Application-driven Optimizations for Page Management
UMap: Enabling Application-driven Optimizations for Page Management

Ivy B. Peng, Marty McFadden, Eric Green
Lawrence Livermore National Laboratory, Livermore, USA

Keita Iwabuchi
Lawrence Livermore National Laboratory, Livermore, USA

Kai Wu, Dong Li
University of California, Merced, Merced, USA

Roger Pearce, Maya Gokhale
Lawrence Livermore National Laboratory, Livermore, USA

ABSTRACT
Leadership supercomputers feature a diversity of storage, from node-local persistent memory and NVMe SSDs to network-interconnected flash memory and HDD. Memory mapping files on different tiers of storage provides a uniform interface in applications. However, system-wide services like mmap are optimized for generality and lack flexibility for enabling application-specific optimizations. In this work, we present UMap to enable user-space page management that can be easily adapted to access patterns in applications and storage characteristics. UMap uses the userfaultfd mechanism to handle page faults in multi-threaded applications efficiently. By providing a data object abstraction layer, UMap is extensible to support various backing stores. The design of UMap supports dynamic load balancing and I/O decoupling for scalable performance. UMap also uses application hints to improve the selection of caching, prefetching, and eviction policies. We evaluate UMap in five benchmarks and real applications on two systems. Our results show that leveraging application knowledge for page management could substantially improve performance. On average, UMap achieved 1.25 to 2.5 times improvement using the adapted configurations compared to the system service.

KEYWORDS
memory mapping, memmap, page fault, user-space paging, userfaultfd, page management

ACM Reference Format:
Ivy B. Peng, Marty McFadden, Eric Green, Keita Iwabuchi, Kai Wu, Dong Li, Roger Pearce, and Maya Gokhale. 2019. UMap: Enabling Application-driven Optimizations for Page Management. MCHPC'19, Denver, USA. 2019. ACM ISBN 978-x-xxxx-xxxx-x/YY/MM.

1 INTRODUCTION
Recently, leadership supercomputers provide enormous storage resources to cope with expanding data sets in applications. The storage resources come in a hybrid format for balanced cost and performance [9, 11, 13]. Fast and small storage, which is implemented using advanced technologies like persistent memory and NVMe SSDs, often co-locates with computing units inside the compute node. Storage with massive capacity, on the other hand, uses cost-effective technologies like HDD and is interconnected to compute nodes through the network. In between, burst buffers use fast memory technologies and are accessible through the network. Memory mapping provides a uniform interface to access files on different types of storage as if they were dynamically allocated memory. For instance, out-of-core data analytic workloads often need to process large datasets that exceed the memory capacity of a compute node [17]. Using memory mapping to access these datasets shifts the burden of paging, prefetching, and caching data between storage and memory to the operating system.

Currently, operating systems provide the mmap system call to map files or devices into memory. This system service performs well in loading dynamic libraries and could also support out-of-core execution. However, as a system-level service, it has to be tuned for performance, reliability, and consistency over a broad range of workloads. Therefore, it may reduce opportunities for optimizing performance based on application characteristics. Moreover, backing stores on different storage exhibit distinctive performance characteristics. Consequently, configurations tuned for one type of storage will need to be adjusted when mapping on another type of storage. In this work, we provide UMap to enable application-specific optimizations for page management in memory mapping various backing stores. UMap is highly configurable to adapt user-space paging to suit application needs. It facilitates application control over caching, prefetching, and eviction policies with minimal porting effort from the programmer. As a user-level solution, UMap confines changes within an application without impacting other applications sharing the platform, which is unachievable in system-level approaches.

We prioritize four design choices for UMap based on surveying realistic use cases. First, we choose to implement UMap as a user-level library so that it can maintain compatibility with the fast-moving Linux kernel without the need to track and modify for frequent kernel updates. Second, we employ the recent userfaultfd [7] mechanism, rather than the signal-handling-plus-callback approach, to reduce overhead and performance variance in multi-threaded applications. Third, we target an adaptive solution that sustains performance even at high concurrency for data-intensive applications, which often employ a large number of threads for hiding data access latency. Our design pays particular attention to load imbalance among service threads to improve the utilization of shared resources even when data accesses to pages are skewed. UMap dynamically balances workloads among all service threads to eliminate bottlenecks in serving hot pages. Finally, for flexible and portable tuning on different computing systems, UMap provides both API and environment controls to enable configurable page sizes, eviction strategy, application-specific prefetching, and detailed diagnosis information to the programmer.

We evaluate the effectiveness of UMap in five use cases, including two data-intensive benchmarks, i.e., a synthetic sort benchmark and a breadth-first search (BFS) kernel, and three real applications, i.e., Lrzip [8], the N-Store database [2], and an asteroid detection application that processes massive data sets from telescopes. We conduct out-of-core experiments on two systems with node-local SSD and network-interconnected HDD storage. Our results show that UMap can enable flexible user-space page management in data-intensive applications. On the AMD testbed with local NVMe SSD, applications achieved 1.25 to 2.5 times improvement compared to the standard system service. On the Intel testbed with network-interconnected HDD, UMap brings the performance of the asteroid detection application close to that using local SSD for 500 GB data sets. In summary, our main contributions are as follows:

• We propose an open-source library¹, called UMap, that leverages the lightweight userfaultfd mechanism to enable application-driven page management.
• We describe the design of UMap for achieving scalable performance in multi-threaded data-intensive applications.
• We demonstrate five use cases of UMap and show that enabling configurable page size is essential for performance tuning in data-intensive applications.
• UMap improves the performance of tested applications by 1.25 to 2.5 times compared to the standard mmap system service.

¹ UMAP v2.0.0 https://github.com/LLNL/umap.

2 BACKGROUND AND MOTIVATION
In this section, we introduce memory mapping, prospective benefits from user-space page management, and the enabling mechanism, userfaultfd.

2.1 Memory Mapping
Memory mapping links images and files in persistent storage to the virtual address space of a process. The operating system employs demand paging to bring only accessed virtual pages into physical memory because virtual memory can be much larger than physical memory. An access to a memory-mapped region triggers a page fault if no page table entry (PTE) is present for the accessed page. When such a page fault is raised, the operating system resolves it by copying the physical data page from storage into the in-memory page cache.

Common strategies for optimizing memory mapping in the operating system include the page cache, read-ahead, and madvise hints. The page cache is used to keep frequently used pages in memory, while less important pages may need to be evicted from memory to make room for newly requested pages. A Least Recently Used (LRU) policy is commonly used for selecting pages to be evicted. The operating system may proactively flush dirty pages, i.e., modified pages in the page cache, into storage when the ratio of dirty pages exceeds a threshold value [19]. Read-ahead preloads pages into physical memory to avoid the overheads associated with page fault handling, TLB misses, and user-to-kernel mode transitions. Finally, the madvise interface takes hints to allow the operating system to make informed decisions for managing pages.

2.2 User-space Page Management
[Figure: an application's virtual address space is registered with UMap; page faults are delivered to UMap filler threads (Filler 0–3), which copy physical pages in from the backing stores.]

User-space page management uses application threads to resolve page faults and manage virtual memory in the background as defined by the application. userfaultfd is a lightweight mechanism for enabling user-space paging compared to the traditional SIGSEGV signal and callback function approach [7]. Applications register address ranges to be managed in user space, and specify the types of events, e.g., page faults and events in non-cooperative mode, to be tracked. Page faults in the address ranges are delivered