A Case for NUMA-Aware Contention Management on Multicore Systems
Sergey Blagodurov, Sergey Zhuravlev, Mohammad Dashti, Alexandra Fedorova
Simon Fraser University

Abstract

On multicore systems, contention for shared resources occurs when memory-intensive threads are co-scheduled on cores that share parts of the memory hierarchy, such as last-level caches and memory controllers. Previous work investigated how contention could be addressed via scheduling. A contention-aware scheduler separates competing threads onto separate memory hierarchy domains to eliminate resource sharing and, as a consequence, to mitigate contention. However, all previous work on contention-aware scheduling assumed that the underlying system is UMA (uniform memory access latencies, single memory controller). Modern multicore systems, however, are NUMA, which means that they feature non-uniform memory access latencies and multiple memory controllers.

We discovered that state-of-the-art contention management algorithms fail to be effective on NUMA systems and may even hurt performance relative to a default OS scheduler. In this paper we investigate the causes of this behavior and design the first contention-aware algorithm for NUMA systems.

1 Introduction

Contention for shared resources on multicore processors is a well-known problem. Consider a typical multicore system, schematically depicted in Figure 1, where cores share parts of the memory hierarchy, which we term memory domains, and compete for resources such as last-level caches (LLC), system request queues and memory controllers. Several studies investigated ways of reducing resource contention, and one of the promising approaches that emerged recently is contention-aware scheduling [23, 10, 16]. A contention-aware scheduler identifies threads that compete for the shared resources of a memory domain and places them into different domains. In doing so, the scheduler can improve the worst-case performance of individual applications or threads by as much as 80% and the overall workload performance by as much as 12% [23].

Figure 1: A schematic view of a system with four memory domains and four cores per domain. There are 16 cores in total, and a shared L3 cache per domain.

Unfortunately, studies of contention-aware algorithms focused primarily on UMA (Uniform Memory Access) systems, where there are multiple shared LLCs but only a single memory node, equipped with a single memory controller, and memory can be accessed with the same latency from any core. However, new multicore systems increasingly use the Non-Uniform Memory Access (NUMA) architecture, due to its decentralized and scalable nature. In modern NUMA systems there are multiple memory nodes, one per memory domain (see Figure 1). Local nodes can be accessed in less time than remote ones, and each node has its own memory controller.

When we ran the best known contention-aware schedulers on a NUMA system, we discovered that not only do they fail to manage contention effectively, but they sometimes even hurt performance when compared to a default contention-unaware scheduler (on our experimental setup we observed as much as 30% performance degradation caused by a NUMA-agnostic contention-aware algorithm relative to the default Linux scheduler). The focus of our study is to (1) investigate why contention-management schedulers that targeted UMA systems fail to work on NUMA systems, and (2) devise an algorithm that works effectively on NUMA systems.

Why existing contention-aware algorithms may hurt performance on NUMA systems: Existing state-of-the-art contention-aware algorithms work as follows on NUMA systems. They identify threads that are sharing a memory domain and hurting each other's performance, and migrate one of the threads to a different domain. This may lead to a situation where a thread's memory is located in a different domain than the one in which the thread is running. (E.g., consider a thread being migrated from core C1 to core C5 in Figure 1, with its memory located in Memory Node #1.) We refer to migrations that may place a thread into a domain remote from its memory as NUMA-agnostic migrations.
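To make the scenario concrete, the following sketch (ours, not code from the paper) shows what a NUMA-agnostic migration amounts to on Linux: sched_setaffinity(2) re-pins the thread to a core in another domain, while its pages stay behind on the original node. The core numbers mirror Figure 1 and are an assumption about the machine's topology.

    /* Minimal sketch of a NUMA-agnostic migration on Linux (ours,
     * not the paper's code): only the thread moves; its memory does
     * not. Core numbering is assumed to follow Figure 1. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <sys/types.h>

    /* Pin thread `tid` (0 = the calling thread) to a single core. */
    static int pin_to_core(pid_t tid, int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return sched_setaffinity(tid, sizeof(set), &set);
    }

    int main(void)
    {
        /* Suppose this thread ran on core 1 in Domain 1 and its pages
         * were allocated locally in Memory Node #1. Re-pinning it to
         * core 5 in Domain 2 leaves the pages behind, so every memory
         * access now crosses the inter-domain interconnect. */
        if (pin_to_core(0, 5) != 0)
            perror("sched_setaffinity");
        return 0;
    }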
NUMA-agnostic migrations create several problems, an obvious one being that the thread now incurs a higher latency when accessing its memory. However, contrary to a commonly held belief that remote access latency – i.e., the higher latency incurred when accessing a remote domain relative to accessing a local one – would be the key concern in this scenario, we discovered that NUMA-agnostic migrations create other problems, which are far more serious than remote access latency. In particular, NUMA-agnostic migrations fail to eliminate contention for some of the key hardware resources on multicore systems and create contention for additional resources. That is why existing contention-aware algorithms that perform NUMA-agnostic migrations not only fail to be effective, but can substantially hurt performance on modern multicore systems.

Challenges in designing contention-aware algorithms for NUMA systems: To address this problem, a contention-aware algorithm on a NUMA system must migrate the memory of a thread to the same domain to which it migrates the thread itself. However, the need to move memory along with the thread makes thread migrations costly. So the algorithm must minimize thread migrations, performing them only when they are likely to significantly increase performance, and when migrating memory it must carefully decide which pages are most profitable to migrate. Our work addresses these challenges.

The contributions of our work can be summarized as follows:

• We discover that contention-aware algorithms known to work well on UMA systems may actually hurt performance on NUMA systems.

• We identify NUMA-agnostic migration as the cause of this phenomenon and identify the reasons why performance degrades. We also show that remote access latency is not the key reason why NUMA-agnostic migrations hurt performance.

• We design and implement Distributed Intensity NUMA Online (DINO), a new contention-aware algorithm for NUMA systems. DINO prevents superfluous thread migrations, but when it does perform migrations, it moves the memory of the threads along with the threads themselves. DINO performs up to 20% better than the default Linux scheduler and up to 50% better than Distributed Intensity, which is the best contention-aware scheduler known to us [23].

• We devise a page migration strategy that works online, uses Instruction-Based Sampling, and eliminates on average 75% of remote accesses.

Our algorithms were implemented at user level, since modern operating systems typically export the interfaces needed to implement the desired functionality (a sketch of these interfaces appears at the end of this section). If needed, the algorithms can also be moved into the kernel itself.

The rest of this paper is organized as follows. Section 2 demonstrates why existing contention-aware algorithms fail to work on NUMA systems. Section 3 presents and evaluates DINO. Section 4 analyzes memory migration strategies. Section 5 provides the experimental results. Section 6 discusses related work, and Section 7 summarizes our findings.
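On Linux, the user-level interfaces in question are sched_setaffinity(2) for thread placement and libnuma calls such as numa_migrate_pages(3) for memory placement. The sketch below is our illustration of a NUMA-aware migration built on those interfaces, not the paper's DINO implementation; the core and node numbers are placeholders, and the program must be linked with -lnuma.

    /* Sketch (ours) of a NUMA-aware migration using standard Linux
     * user-level interfaces: move the thread AND its memory together.
     * Link with -lnuma. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <numa.h>
    #include <sys/types.h>

    /* Hypothetical helper: move thread `tid` to `core` inside NUMA
     * node `to_node`, then migrate its pages off `from_node` after it. */
    static int numa_aware_migrate(pid_t tid, int core,
                                  int from_node, int to_node)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        if (sched_setaffinity(tid, sizeof(set), &set) != 0)
            return -1;

        struct bitmask *from = numa_allocate_nodemask();
        struct bitmask *to   = numa_allocate_nodemask();
        numa_bitmask_setbit(from, from_node);
        numa_bitmask_setbit(to, to_node);
        /* Migrates all of the process's pages resident on `from_node`;
         * move_pages(2) could instead migrate a selected subset. */
        int rc = numa_migrate_pages(tid, from, to);
        numa_free_nodemask(from);
        numa_free_nodemask(to);
        return rc < 0 ? -1 : 0;
    }

    int main(void)
    {
        if (numa_available() < 0)
            return 1;   /* no NUMA support on this system */
        /* Example: move the calling thread to core 5 and pull its
         * memory from node 0 to node 1 (placeholder numbering). */
        return numa_aware_migrate(0, 5, 0, 1) == 0 ? 0 : 1;
    }

Migrating a thread's entire resident set this way is exactly what makes migrations costly, which is why the algorithm must migrate sparingly and, when it does, prefer the pages that are most profitable to move.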
2 Why existing algorithms do not work on NUMA systems

As we explained in the introduction, existing contention-aware algorithms perform NUMA-agnostic migrations, and so a thread may end up running on a node remote from its memory. This creates additional problems besides introducing remote-latency overhead. In particular, NUMA-agnostic migrations fail to eliminate memory controller contention and create additional interconnect contention. The focus of this section is to experimentally demonstrate why this is the case.

To this end, in Section 2.1 we quantify how contention for various shared resources contributes to the performance degradation that an application may experience as it shares the hardware with other applications. We show that memory controller contention and interconnect contention are the most important causes of performance degradation when an application is running remotely from its memory. Then, in Section 2.2 we use these findings to explain the behavior of existing contention-aware algorithms on NUMA systems.

Our experimental system has four AMD Barcelona processors running at 2.3GHz and 64GB of RAM, 16GB per domain. The operating system is Linux 2.6.29.6. Figure 2 schematically represents the architecture of each processor in this system.

Figure 2: A schematic view of a system used in this study. A single domain is shown: four cores with private L2 caches share an L3 cache and reach DRAM through the system request interface, the crossbar switch and the memory controller, while HyperTransport links connect the chip to the other chips.

We identify four sources of performance degradation that can occur on modern NUMA systems, such as those shown in Figures 1 and 2:

• Contention for the shared last-level cache (CA). This also includes contention for the system request queue and the crossbar.

• Contention for the memory controller (MC). This also includes contention for the DRAM prefetching unit.

• Contention for the inter-domain interconnect (IC).

• Remote access latency, occurring when a thread's memory is placed in a remote node (RL).

To quantify the performance degradation caused by these factors we use the methodology depicted in Figure 3. We run a target application, denoted T, with a set of three competing applications, denoted C. The memory of the target application is denoted MT, and the memory of the competing applications is denoted MC. We vary (1) how the target application is placed with respect to its memory, (2) how it is placed with respect to the competing applications, and (3) how the memory of the target is placed with respect to the memory of the competing applications.

Figure 3: The placement schemes used in our methodology. Schemes 0-6 vary the placement of T, C, MT and MC; scheme 0 is the contention-free base case (NONE), and the remaining schemes are grouped into no-cache-contention and cache-contention configurations, each annotated with the degradation factors it exercises (CA, MC, IC, RL).
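As an illustration of how such a placement might be pinned down in practice, the sketch below (our construction under assumed node numbering, not the paper's experimental harness) uses libnuma to run a thread on one node while binding its memory to another, approximating the remote-latency (RL) factor; the competing applications C and their memory MC would be pinned analogously. The 512MB working-set size is hypothetical.

    /* Sketch (ours) of one placement: T runs on node 0 while MT is
     * bound to node 1, so every access to MT is remote (RL).
     * Link with -lnuma. */
    #include <numa.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        if (numa_available() < 0)
            return 1;                        /* no NUMA support */

        numa_run_on_node(0);                 /* T executes on node 0 */

        size_t sz = 512UL << 20;             /* hypothetical working set */
        char *mt = numa_alloc_onnode(sz, 1); /* MT bound to node 1 */
        if (mt == NULL)
            return 1;
        memset(mt, 1, sz);                   /* touch pages to place them */

        /* ... the target's workload would run over mt here; with no
         * competitors on either node, this isolates factor RL ... */

        numa_free(mt, sz);
        return 0;
    }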