Tracing Access Pattern with Bounded Overhead and Best-effort Accuracy
SeongJae Park
These slides were presented during the Kernel Summit 2019 (https://linuxplumbersconf.org/event/4/contributions/548/)
This work by SeongJae Park is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/.

I, SeongJae Park

● Interested in parallel programming for OS kernels
● Recently received a PhD from Seoul National University
● Will start working at Amazon next week

Overview

● The Need for Data Access Monitoring
● Existing Schemes and Their Limitations
● DAMON: Data Access MONitor
● Evaluations of DAMON
● Future Plans
● Conclusion

Increasing Working Sets and Decreasing DRAM

● Modern workloads (e.g., cloud, big data, ML) have common characteristics
○ Huge working sets and low locality (different from traditional workloads)
● DRAM capacity is decreasing relative to the working set size (WSS) increase
● A survey of memory:CPU ratio trends clearly shows this change
○ The increasing ratio in virtual machines reflects growing working set sizes
○ The decreasing ratio in physical machines reflects relatively shrinking DRAM
● Therefore, DRAM will not be able to fully accommodate working sets

Nitu, Vlad, et al. "Welcome to zombieland: Practical and energy-efficient memory disaggregation in a datacenter." Proceedings of the Thirteenth EuroSys Conference. ACM, 2018.

Multi-tier Memory Systems Arrive at Last

● Storage devices (using flash and/or PCM) have continuously evolved
○ Much faster than HDDs, so they can be used as memory
○ Much larger than DRAM, so they can accommodate the huge modern working sets
○ That said, they are obviously slower than DRAM and smaller than HDDs
● Consequently, multi-tier memory systems will be prevalent
○ Similar concepts exist, such as heterogeneous or hierarchical memory
○ Configure DRAM as main memory and modern fast storage devices as secondary memory
○ Swap, which was considered harmful in the past, becomes a straightforward configuration

[Figure: three guest VMs, each with 1 GB of DRAM, running on a host physical machine with 2 GB of DRAM and a 3 GB swap device]

How Should Multi-tier Memory Be Managed?

● Applications have multiple data objects and multiple execution phases
● Each execution phase has a different data access pattern (DAP)
○ The term DAP in this talk means: access frequencies of each data object over a given time
● For each phase, objects that will be frequently accessed should be in the main memory while the others stay in the secondary memory

[Figure: for `def foo(a, b, c, d): t = sum(a, b); t = mult(t, c); d = t`, objects a, b, and t are hot in the first phase, c and t in the second, and d and t in the third; the hot objects of each phase belong in DRAM, the others in secondary memory]

Outdated Memory Management Mechanisms

● Designed upon old assumptions
○ Assumes small working sets, high locality, and slow secondary memory
● Relies on DAP estimation algorithms based on LRU or simple heuristics
● Needs to be optimized for modern workloads and memory
● Several problems have been reported, along with solutions
○ Let's explore such optimization cases (two cases related to kernel space and one related to user space)

Case 1: LRU-based Page Reclamation

● Linux selects pages to be reclaimed based on a pseudo-LRU algorithm
● Problems
○ Not useful if workloads have no strong locality
○ Modern workloads tend to have complex and dynamic access patterns
○ Swapping out pages that will be used in the near future degrades performance

[Figure: a good selection evicts pages that are not swapped in again for a while; a bad selection evicts pages that are swapped in again soon]

● A proposed DAP-based optimization: proactive reclamation [1,2]
○ Identify cold pages (idle for >2 minutes) and reclaim them to secondary memory (zswap)
○ Use page table Accessed bits to identify the cold pages
○ Successfully applied to Google-internal clusters
○ Nevertheless, the idle page identification overhead is not trivial; one dedicated CPU is required
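The following is a minimal sketch of the cold-page scanning idea described above, not the actual implementation from [1]; struct page_info, scan_page_accessed(), and reclaim_page() are hypothetical stand-ins for real kernel machinery.

    /* A minimal sketch of cold page scanning; helpers are hypothetical. */
    #define COLD_THRESHOLD_SECS 120 /* "idle for >2 minutes" */

    struct page_info {
            unsigned long last_access; /* seconds; last time the Accessed bit was seen set */
    };

    static void proactive_reclaim_scan(struct page_info *pages, int nr_pages,
                                       unsigned long now)
    {
            int i;

            for (i = 0; i < nr_pages; i++) {
                    if (scan_page_accessed(&pages[i]))      /* hypothetical: test and clear the Accessed bit */
                            pages[i].last_access = now;     /* the page is warm */
                    else if (now - pages[i].last_access > COLD_THRESHOLD_SECS)
                            reclaim_page(&pages[i]);        /* hypothetical: move the cold page out */
            }
    }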

[1] Lagar-Cavilla, Andres, et al. "Software-Defined Far Memory in Warehouse-Scale Computers." Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 2019.
[2] Jonathan Corbet, "Proactively reclaiming idle memory." https://lwn.net/Articles/787611/

Case 2: Aggressive THP Promotions

● THP does promotions and demotions between regular pages and huge pages
● Problems
○ The selection of candidate pages to be promoted is very simplistic; it aggressively promotes any page as long as a 2 MiB contiguous memory region can be allocated
○ If promoted huge pages are not fully accessed, memory is wasted
● A proposed DAP-based optimization, Ingens [1]
○ Starts from regular pages and promotes to a huge page only if the adjacent 2 MiB region is really used
○ Demotes a huge page into 512 regular pages if those are not fully accessed
○ Tracks accesses using page table Accessed bits, so the overhead grows arbitrarily with the working set size
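Below is a rough sketch of the utilization-tracking idea; the ~90% threshold and the per-subpage Accessed-bit array are illustrative assumptions, not Ingens's actual code.

    #include <stdbool.h>

    #define SUBPAGES_PER_HUGE_PAGE 512 /* 2 MiB / 4 KiB */
    #define PROMOTE_THRESHOLD 460      /* ~90% utilization; an assumed threshold */

    /* Count how many 4 KiB subpages of a huge-page-sized region were
     * accessed, based on their page table Accessed bits. */
    static int count_accessed_subpages(const bool accessed[SUBPAGES_PER_HUGE_PAGE])
    {
            int i, n = 0;

            for (i = 0; i < SUBPAGES_PER_HUGE_PAGE; i++)
                    n += accessed[i];
            return n;
    }

    /* Promote only when most of the 2 MiB region is really used. */
    static bool should_promote(const bool accessed[SUBPAGES_PER_HUGE_PAGE])
    {
            return count_accessed_subpages(accessed) >= PROMOTE_THRESHOLD;
    }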

[1] Kwon, Youngjin, et al. "Coordinated and efficient huge page management with Ingens." 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 2016.

Case 3: madvise() Improvements

● User programs are allowed to voluntarily report their data access pattern hints
○ Programmers would know the data access pattern of their program, right?
■ Sadly, not always, especially in the case of huge and complex programs
○ mlock(), madvise(), …
○ Proper use of these system calls significantly improves performance under memory pressure
● Problems
○ The hints are not rich enough to be useful for modern systems
○ Only voluntary reports are allowed
● A proposed optimization: memory hinting API for external processes [1]
○ Implements more hints: MADV_COLD and MADV_PAGEOUT
○ Allows external processes to report hints on another process's memory
○ Successfully adopted for Android phones
○ Still leaves the data access pattern monitoring task to user space
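As a quick illustration, the snippet below hints that an example buffer has just finished being used heavily; MADV_COLD and MADV_PAGEOUT are real hints but require Linux 5.4 or later headers, and the buffer and phase here are assumptions for the example.

    #include <stdlib.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 64 << 20; /* a 64 MiB example buffer */
            void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (buf == MAP_FAILED)
                    return 1;
            /* ... an execution phase that used buf heavily has finished ... */
            madvise(buf, len, MADV_COLD);    /* deactivate: an easy reclaim target */
            madvise(buf, len, MADV_PAGEOUT); /* or reclaim it right away */
            return 0;
    }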

[1] Minchan Kim, "introduce memory hinting API for external process." https://lore.kernel.org/lkml/[email protected]/

The Unsolved Problem: Data Access Monitoring

● None of the three optimizations we explored has a solution for scalable, efficient, and effective data access monitoring
● Proactive reclamation and Ingens have unbounded monitoring overhead
○ In other words, they are not scalable
○ It would be hard for them to be merged upstream
○ LSFMM19 discussions about proactive reclamation also pointed out the scalability problem
● The madvise() improvements simply leave the task to users

Overview

● The Need for Data Access Monitoring
● Existing Schemes and Their Limitations
● DAMON: Data Access MONitor
● Evaluations of DAMON
● Future Plans
● Conclusion

Workload Instrumentation

● Install a hook function, which will be invoked for every memory access, into target workloads at compile time or using binary rewriting techniques
● The hook function then records all related information
○ Including the time, the type of the access, the target address, and the value written or read
● Pros
○ Can monitor all information about every memory access
○ Can be used for various purposes, including safety checks
● Cons
○ Incurs high overhead
■ A workload with 10 minutes of runtime and a 600 MiB working set (433.milc) consumed more than 24 hours of CPU time and 500 GiB of storage space when an instrumentation-based memory monitoring scheme (PIN) was applied

Original code:
        r1 = X; /* LOAD X */
        Y = r2; /* STORE Y */

Instrumented code:
        record_load(X, r1);  r1 = X; /* LOAD X */
        record_store(Y, r2); Y = r2; /* STORE Y */

Page Table Accessed-bit Tracking

● When a page is accessed, the Accessed bit of the PTE for the page is set
● By periodically clearing and monitoring the bits for the target workloads, we can monitor memory accesses
● Many approaches, including proactive reclamation and Ingens, use this
● Pros
○ Incurs lower overhead compared to workload instrumentation
○ Simple to apply
● Cons
○ Cannot capture all information about every access
■ Knows only whether an access occurred
■ That said, this is not a big problem for multi-tier memory focused performance optimization
○ The overhead arbitrarily increases as the size of the target working set grows
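A minimal sketch of the mechanism follows, assuming a hypothetical walk_to_pte() helper; real kernel code would use the page table walking infrastructure and would also need to flush the TLB after clearing the bit.

    /* Minimal sketch: test and clear the Accessed bit of one page's PTE.
     * walk_to_pte() is a hypothetical helper; pte_present(), pte_young(),
     * and pte_mkold() are real kernel primitives. */
    static bool check_and_clear_accessed(struct mm_struct *mm, unsigned long addr)
    {
            pte_t *pte = walk_to_pte(mm, addr); /* hypothetical */

            if (!pte || !pte_present(*pte))
                    return false;
            if (pte_young(*pte)) {          /* Accessed bit set? */
                    *pte = pte_mkold(*pte); /* clear it for the next period */
                    return true;
            }
            return false;
    }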

Idle Page Tracking

● A page table Accessed-bit tracking based Linux kernel feature
● Allows users to clear and monitor the Accessed bits using a sysfs interface
○ Pages are pointed to by PFN
● Complicated to use from user space
○ Need to read and write up to 4 files (maps, pagemap, kpageflags, and idle_page_tracking/bitmap)
■ Finally read/write idle_page_tracking/bitmap
■ I tried this to measure WSS [1] and found it was the wrong tool for my needs
■ A few improvements for a better interface have been proposed [2]
● Of course, the overhead arbitrarily increases because this feature uses page table Accessed-bit tracking

[Figure: /proc/<pid>/maps gives the virtual address layout → /proc/<pid>/pagemap translates a virtual address to a PFN → /proc/kpageflags checks the properties of the page → /sys/kernel/.../bitmap gets/sets the idleness of the page]
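For illustration, here is a user-space sketch of the last step, reading the idleness bit of one page; a full tool would first translate virtual addresses to PFNs via /proc/<pid>/pagemap. Each 64-bit word of the bitmap file covers 64 consecutive PFNs.

    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    /* Returns 1 if the page at `pfn` is idle, 0 if not, -1 on error. */
    int is_page_idle(uint64_t pfn)
    {
            uint64_t bits;
            int fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDONLY);

            if (fd < 0)
                    return -1;
            /* bit (pfn % 64) of the 8-byte word at offset (pfn / 64) * 8 */
            if (pread(fd, &bits, sizeof(bits), pfn / 64 * 8) != sizeof(bits)) {
                    close(fd);
                    return -1;
            }
            close(fd);
            return (bits >> (pfn % 64)) & 1;
    }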

[1] SeongJae Park, "Idle Page Tracking Tools." https://sjp38.github.io/post/idle_page_tracking/
[2] Joel Fernandes, "[PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing." https://lore.kernel.org/lkml/[email protected]/

Existing Memory Access Monitoring Schemes


● Instrumentation-based monitoring
○ Incurs unacceptably high overhead for unnecessarily detailed information
● Page table Accessed-bit tracking
○ The overhead grows arbitrarily as the size of the target workload increases (unscalable for huge working sets)
● For performance optimization of modern memory workloads, we need…
○ Monitoring overhead that is bounded regardless of the size of the target workloads
○ Quality high enough to be useful for the performance optimization

[Chart: quality vs. overhead; instrumentation-based monitoring gives high quality at high overhead, page table Accessed-bit tracking gives lower quality at overhead that grows with the target size, and what we need is bounded overhead with usable quality]

Overview

● The Need for Data Access Monitoring
● Existing Schemes and Their Limitations
● DAMON: Data Access MONitor
● Evaluations of DAMON
● Future Plans
● Conclusion

DAMON: Data Access MONitor

● A kernel subsystem for data access monitoring that I am developing
● Allows users to set an upper bound on the monitoring overhead while preserving the quality of the outputs
● Aims to be usable from both user space and kernel space
● Receives the pid of a target process and returns the data access patterns of the process in real time
○ Again, the definition of DAP in this talk: access frequencies of each data object

0-1s: <0x0001-0x0010, 10>, <0x0010-0x100, 5>, <0x100-0x200, 1000>
1-2s: <0x0001-0x0010, 0>, <0x0010-0x100, 200>, <0x100-0x200, 200>
2-3s: <0x0001-0x0010, 10>, <0x0010-0x100, 200>, <0x100-0x200, 200>
3-4s: <0x0001-0x0010, 0>, <0x0010-0x100, 0>, <0x100-0x200, 200>
...

(Each tuple: the virtual address range of a data object, and the number of accesses to the data object)
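In code terms, each entry could be represented like the struct below; this is a hypothetical shape for illustration, not the prototype's actual record format.

    /* Hypothetical shape of one monitoring result entry (illustration only) */
    struct damon_region_result {
            unsigned long start;      /* start virtual address of the region */
            unsigned long end;        /* end virtual address of the region */
            unsigned int nr_accesses; /* accesses observed during the interval */
    };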

Region-based Sampling

● DAMON defines data objects in an access-pattern-oriented way
○ A data object is a memory region in which all page frames have similar hotness
○ By this definition, if a page in a data object is accessed, every other page in the region has probably been accessed as well, and vice versa
● Owing to the definition, DAMON samples only one page per region instead of every page, while preserving the quality of the monitoring
○ The monitoring overhead of DAMON thus becomes proportional to the number of regions
○ DAMON therefore lets users set the maximum number of regions as the upper bound on the overhead

● Nonetheless, the quality cannot be preserved if the regions are not constructed as defined above

[Figure: one region spanning both a hot object and a cold object fails (X), while regions aligned to the hot and cold objects work (O)]
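A sketch of the per-region sampling loop, reusing the hypothetical check_and_clear_accessed() helper from the Accessed-bit sketch earlier; pick_random_addr() is likewise an assumed helper.

    struct damon_region {
            unsigned long start, end;  /* virtual address range of the region */
            unsigned long sample_addr; /* the one page currently being sampled */
            unsigned int nr_accesses;  /* access count in the current interval */
    };

    static void sample_regions(struct mm_struct *mm, struct damon_region *rs,
                               int nr_regions)
    {
            int i;

            for (i = 0; i < nr_regions; i++) {
                    /* one check per region: the overhead is proportional to
                     * nr_regions, not to the working set size */
                    if (check_and_clear_accessed(mm, rs[i].sample_addr))
                            rs[i].nr_accesses++;
                    rs[i].sample_addr = pick_random_addr(rs[i].start, rs[i].end);
            }
    }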

Adaptive Regions Adjustment

● Starts with regions covering the entire virtual memory areas of the target workload; periodically checks VMA updates and applies the changes
● Periodically merges adjacent regions having similar access frequencies into one region
● After the merge process, splits each region into two regions of a random size ratio, if the number of regions is smaller than half of the maximum number of regions
● If a split was meaningless, the next merge process will merge the two halves back
● If a split was meaningful, the next merge process will keep them separate
● This procedure finds well-constructed regions at last

[Figure: repeated merge and split steps gradually align the regions to the hot and cold objects]
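A sketch of the merge and split steps under stated assumptions: the access-frequency difference threshold and the merge_two(), split_at(), and random_ratio() helpers are all illustrative, not the prototype's actual code.

    #include <stdlib.h> /* abs() */

    #define MERGE_THRESHOLD 10 /* assumed max access-frequency difference */

    /* Merge adjacent regions whose access frequencies are similar. */
    static void merge_similar_neighbors(struct damon_region *rs, int *nr)
    {
            int i = 0;

            while (i < *nr - 1) {
                    int diff = (int)rs[i].nr_accesses - (int)rs[i + 1].nr_accesses;

                    if (abs(diff) <= MERGE_THRESHOLD)
                            merge_two(rs, nr, i); /* hypothetical: rs[i] absorbs rs[i+1] */
                    else
                            i++;
            }
    }

    /* Split each region in two at a random ratio while the region count
     * is below half of the user-specified maximum. */
    static void split_regions(struct damon_region *rs, int *nr, int max_regions)
    {
            int i, orig_nr = *nr;

            if (*nr >= max_regions / 2)
                    return;
            for (i = 0; i < orig_nr; i++)
                    split_at(rs, nr, i, random_ratio()); /* hypothetical helpers */
    }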

DAMON Prototype and a Toolset Implementation

● The DAMON prototype is implemented as a kernel module, of course
○ 1200 lines of code and 529 lines of kunit-based test code
○ Provides an interface for user space using debugfs and regular files
○ The interface for kernel space is not implemented yet
● For convenient use and evaluation of DAMON, my colleague, Yunjae, implemented several user space tools
○ They provide human-friendly interfaces, access pattern visualization, and output analysis

[Figure: the user space tool controls the DAMON kernel module through its debugfs interface; DAMON monitors the target workload via its page table and writes trace results to a file]

Demo Time!

● Video link: https://asciinema.org/a/268004
● The result of the last command is shown below:

Overview

● The Need for Data Access Monitoring
● Existing Schemes and Their Limitations
● DAMON: Data Access MONitor
● Evaluations of DAMON
● Future Plans
● Conclusion

Evaluation Setup

● Server machine
○ Xeon E7-8837
○ 128 GB DRAM
○ Ubuntu 18.04 LTS Server
○ Linux kernel v4.20
● Workloads
○ SPEC CPU 2006, NAS, Tensorflow Benchmark, SPLASH-2X, and PARSEC 3
● DAMON configuration
○ Maximum number of regions: 1000
○ Sampling interval: 1 millisecond
○ Aggregation interval: 100 milliseconds

Visualized Data Access Patterns (1/2)
[Figure: visualized access patterns of the evaluated workloads]

Visualized Data Access Patterns (2/2)
[Figure: visualized access patterns of the evaluated workloads, continued]

Clear Access Patterns for Each Phase

● 433.milc and fft show clearly different access patterns for each phase

[Figures: access patterns of 433.milc and fft]

Sequential Accesses

● lu_ncb from SPLASH-2X shows sequential accesses to specific objects

[Figure: access pattern of SPLASH-2X lu_ncb]

Uniform Data Access Pattern

● 470.lbm and 462.libquantum show uniform access patterns to specific objects

[Figures: access patterns of 470.lbm and 462.libquantum]

Detailed Analysis of 433.milc

● Main function’s big loops shown in the visualized access pattern ● Upperside hot region of 330 MiB was allocated by make_lattice() function

Annotated 433.milc source (line numbers are from the original file):

     20 int main( int argc, char **argv ){
        ...
     31   /* loop over input sets */
     32   while( readin(prompt) == 0){ /* outer loop iterates 2 times;
                                          make_lattice() allocated the 330 MiB region */
        ...
     44     for( traj_done=0; traj_done < trajecs; traj_done++ ){ /* inner loop iterates 4 times */
        ...
     93     } /* end loop over trajectories */
        ...
    118   }
    119   return 0;
    120 }

Bounded Monitoring Overhead

● Measured the number of regions constructed while monitoring
● Only 132.88 regions on average in total
● DAMON not only bounds the overhead but also minimizes it

Manual Optimizations for Memory Pressure

● Based on DAMON-monitored patterns, we classified hot objects and inserted mlock() calls in the source code so that those hot objects stay in DRAM
○ The optimizations are very simple because this evaluates DAMON, not the optimizations themselves
● Measured runtime speedups for varying levels of memory pressure
● Consistently improved performance, without considerable overhead
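As a simple illustration of such an optimization, the snippet below pins a hot object in DRAM; hot_obj is a hypothetical stand-in for an object DAMON reported as frequently accessed.

    #include <sys/mman.h>

    static double hot_obj[1 << 22]; /* hypothetical hot data object (32 MiB) */

    /* Keep the hot object resident in DRAM; cold objects stay evictable. */
    static int pin_hot_object(void)
    {
            return mlock(hot_obj, sizeof(hot_obj));
    }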

Future Plans

● The goal is to be merged upstream, so that DAMON can hopefully be used by
○ User space tools for data access monitoring, and
○ Kernel space subsystems requiring data access pattern based optimizations
● Design and implement the interface for kernel-internal use
● Adapt it to other kernel memory management schemes
● Code cleanup and patchset submission

Summary

● Use of multi-tier memory systems will be common
○ The working sets of modern workloads are growing beyond what DRAM can fully accommodate
○ Fast storage devices have evolved and are available now
● No proper memory monitoring tool for this modern environment seems to exist
○ Instrumentation incurs too high an overhead for unnecessarily detailed monitoring
○ Accessed-bit tracking overhead arbitrarily increases as the size of the workload grows
● DAMON monitors data accesses with bounded monitoring overhead regardless of the size and complexity of the target workloads
● Our evaluation with 20 realistic workloads shows its efficiency and effectiveness

Thanks