Coherence Stalls or Latency Tolerance: Informed CPU Scheduling for Socket and Core Sharing
Sharanyan Srikanthan, Sandhya Dwarkadas, and Kai Shen, University of Rochester
https://www.usenix.org/conference/atc16/technical-sessions/presentation/srikanthan

This paper is included in the Proceedings of the 2016 USENIX Annual Technical Conference (USENIX ATC ’16). June 22–24, 2016 • Denver, CO, USA 978-1-931971-30-0

Open access to the Proceedings of the 2016 USENIX Annual Technical Conference (USENIX ATC ’16) is sponsored by USENIX.

Sharanyan Srikanthan    Sandhya Dwarkadas    Kai Shen
Department of Computer Science, University of Rochester
{srikanth,sandhya,kshen}@cs.rochester.edu

Abstract

The efficiency of modern multiprogrammed multicore machines is heavily impacted by traffic due to data sharing and contention due to competition for shared resources. In this paper, we demonstrate the importance of identifying latency tolerance coupled with instruction-level parallelism on the benefits of colocating threads on the same socket or physical core for parallel efficiency. By adding hardware counted CPU stall cycles due to cache misses to the measured statistics, we show that it is possible to infer latency tolerance at low cost. We develop and evaluate SAM-MPH, a multicore CPU scheduler that combines information on sources of traffic with tolerance for latency and need for computational resources. We also show the benefits of using a history of past intervals to introduce hysteresis when making mapping decisions, thereby avoiding oscillatory mappings and transient migrations that would impact performance. Experiments with a broad range of multiprogrammed parallel, graph processing, and data management workloads on 40-CPU and 80-CPU machines show that SAM-MPH obtains ideal performance for standalone applications and improves performance by up to 61% over the default Linux scheduler for mixed workloads.

1 Introduction

Modern multi-socket multicore machines present a complex challenge in terms of performance portability and isolation, especially for parallel applications in multiprogrammed workloads. Performance is heavily impacted by traffic due to data sharing and contention due to competition for shared resources, including physical cores, caches, memory, and the interconnect.

Significant effort has been devoted to mitigating the impact on performance of competition for shared resources [7, 15, 17, 18, 23, 26] for applications that do not share data. Our own past work, Sharing-Aware Mapper (SAM) [22], has shown that inter-socket coherence activity among threads can be used as a criterion for same-socket thread colocation for improved system efficiency and parallel application performance. Using existing performance counters, SAM is able to separate inter-socket traffic due to coherence from that due to memory access, favoring colocation for threads experiencing high coherence traffic, and distribution for threads with high memory bandwidth demand. Although SAM is able to infer inter-socket coherence traffic, it cannot determine the impact of the coherence activity on application performance. This inability to relate traffic to performance limits its effectiveness in prioritizing tasks in multiprogrammed workloads.

In this paper, we develop and evaluate a multicore CPU scheduler that combines information on sources of traffic with tolerance for latency and need for computational resources. First, we demonstrate the importance of identifying latency tolerance in determining the benefits of colocating threads on the same socket or physical core on parallel efficiency in multiprogrammed workloads. High inter-socket coherence activity does not translate to proportional benefit from colocation for different applications or threads within an application. The higher the latency hiding ability of the thread, the lower the impact of inter-socket coherence activity on performance. We infer the benefits of colocation using commonly available hardware performance counters, in particular, CPU stall cycles on cache misses. A low value for CPU stall cycles is an indication of latency tolerance, making stall cycles an appropriate metric for prioritizing threads for colocation.

Second, we focus on the computational needs of individual threads. Hyperthreading [13], where hardware computational contexts share a physical core, presents complex tradeoffs for applications that share data. Colocating threads on the same physical core allows direct access to a shared cache, thereby resulting in low-latency data communication when fine-grain sharing (sharing of cache-resident data) is exhibited across threads. At the same time, competition for functional units can reduce the instructions per cycle (IPC) for the individual threads relative to running on independent physical cores. The benefits of colocation are therefore a function of the granularity of sharing (whether the reads and writes by different threads occur while the data is still cache resident) as well as the instruction-level parallelism available within each thread.

We find that a combination of IPC and coherence activity thresholds is sufficient to inform this tradeoff.

Finally, we show that utilizing interval history in phase classification can avoid the oscillatory and transient task migrations that may result from merely reacting to immediate past behavior [8]. In particular, we keep track of the level of coherence activity that was incurred by a thread in prior intervals, as well as its tolerance for latency, and use this information to introduce hysteresis when identifying a phase transition.

The combination of these three optimizations enables us to obtain ideal performance for standalone applications and improve performance by up to 61% over Linux for multiprogrammed workloads. Performance of multiprogrammed workloads improves on average by 27% and 43% over standard Linux on 40- and 80-CPU machines respectively. When compared with SAM [22], our approach yields an average improvement of 9% and 21% on the two machines, with a peak improvement of 24% and 27%. We also reduce performance disparity across the applications in each workload. These performance benefits are achieved with very little increase in monitoring overhead.

2 Background: Separating Traffic due to Sharing and Memory Access

This work builds on our prior effort, SAM [22], a Sharing-Aware Mapper that monitors individual task behavior using hardware performance counters. (In this paper, a task refers to an operating system-level schedulable entity such as a process or a thread.) SAM identifies and combines information from commonly available hardware performance counter events to separate traffic due to data sharing from that due to memory access. Further, the non-uniformity in traffic is captured by separately characterizing intra- and inter-socket coherence activity, and local versus remote memory access.

Following an iterative, interval-based approach, SAM uses the information about individual task traffic patterns to retain colocation for tasks with high intra-socket coherence activity, and consolidate tasks with high inter-socket coherence activity. At the same time, SAM distributes tasks with high memory bandwidth needs, colocating them with the memory they access. These decisions reduce communication and contention for resources by localizing communication whenever possible.

While SAM is able to separate and identify coherence traffic from memory bandwidth needs, it does not currently determine the impact of the traffic on performance; in other words, a task's ability to tolerate the latency of communication, which would allow task prioritization for constrained resources. Additionally, SAM currently does not differentiate between logical hardware contexts and physical cores. Furthermore, while SAM's successive iterations are able to capture changes in application behavior and workload mixes to effect task placement changes, it merely reacts to the current state of task placement to effect a more efficient task placement. Thus, it misses opportunities to learn from past placement decisions as well as to adapt to periodicity in application behavior. Our goal in this paper is to address these shortcomings in SAM and realize multiprogrammed performance much closer to the standalone static best placement.

3 Identifying Latency Tolerance and Computational Needs

In this section, we demonstrate the importance of identifying latency tolerance and computational needs in multicore task placement, and show how this information may be inferred from widely available performance counter events.

3.1 Tolerance for Coherence Activity Latency

Data sharing across tasks can result in varying communication latencies, primarily dictated by task placement. The closer the tasks sharing the data, the lower the latency. For example, when tasks that share data are placed across sockets, the need to move data across sockets results in substantially increased latency. Hence, the natural choice would be to prioritize tasks with high coherence activity for colocation on the same socket.

However, the performance impact of coherence activities depends in reality on the latency tolerance of the application. We focus here on identifying this latency tolerance in addition to being able to measure and identify data sharing.

We introduce two additional metrics to augment inter-socket coherence activity as a measure of sharing behavior: IPC (Instructions per Cycle) and SPC (Stalls per inter-socket Coherence event). The Intel platforms provide access to a counter that tracks cycles stalled on all last-level cache misses. While these stalls include those due to coherence activity, they also include stalls on other forms of misses. When coherence activity is high, stalls are dominated by coherence activity, and thus stalls on cache misses can be used as an approximation of stalls due to coherence misses. SPC is thus approximated to be stalls on all last-level cache misses divided by the number of coherence events in the specific time interval.

For low to moderate coherence activity levels, the above approximation for SPC can no longer be justified. In such cases, we use IPC as an indicator of latency tolerance. Higher IPC is generally achieved with high instruction-level parallelism, which provides the ability to hide access latency.
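To make the two metrics concrete, the C sketch below shows one way the per-interval computation could look. The structure fields, the threshold constant, and the function name are our own illustrative choices (the paper does not list its kernel data structures); only the idea of dividing LLC-miss stall cycles by inter-socket coherence events, and of falling back to IPC when coherence activity is low, comes from the text.

/* Minimal sketch of the SPC/IPC computation described above.
 * Field and constant names are illustrative, not the paper's. */
#include <stdint.h>

struct interval_sample {
    uint64_t llc_stall_cycles;   /* stalls on last-level cache misses */
    uint64_t inter_socket_coh;   /* inter-socket coherence events     */
    uint64_t instructions;       /* instructions retired              */
    uint64_t unhalted_cycles;    /* unhalted CPU cycles               */
};

/* Coherence events per cycle above which SPC is considered reliable
 * (0.78e-3 in the paper's experiments). */
static const double COH_HIGH_PER_CYCLE = 0.78e-3;

/* Returns 1 if the task should be prioritized by SPC (high sharing),
 * 0 if IPC should be used instead (low or moderate sharing). */
static int classify(const struct interval_sample *s, double *spc, double *ipc)
{
    if (s->unhalted_cycles == 0)
        return 0;

    double coh_per_cycle = (double)s->inter_socket_coh / s->unhalted_cycles;

    *ipc = (double)s->instructions / s->unhalted_cycles;
    /* SPC approximates stalls caused by coherence misses by charging all
     * LLC-miss stalls to coherence events; this is only meaningful when
     * coherence activity dominates the misses. */
    *spc = s->inter_socket_coh ?
           (double)s->llc_stall_cycles / s->inter_socket_coh : 0.0;

    return coh_per_cycle > COH_HIGH_PER_CYCLE;
}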

Figure 1: SPC (stalls per coherence event), IPC (instructions per cycle), and speedup (relative to running on separate sockets) when consolidating threads onto the same socket on different physical cores (blue curve) and the same physical core (red dashed curve), as a function of inter-socket coherence activity controlled by a microbenchmark with a (left) higher instruction-level parallelism (ILP) and (right) lower ILP computational loop.

In general, higher SPC implies lower latency hiding potential; higher IPC implies more instruction-level parallelism that can be used to hide latency. We use a combination of IPC and SPC in order to be able to predict a task's latency hiding potential. In order to demonstrate the importance of SPC and IPC in making placement decisions, we designed a microbenchmark that is parameterized to control coherence activity. Coherence activity is generated by a coherence activity loop that alternates increments by two threads to a set of shared counters that are guaranteed to fit in the level of cache closest to the processor. The frequency of coherence activity is controlled by adding to the coherence activity loop a computational loop consisting of (a) additions on registers or (b) additions/multiplications on independent registers, thereby increasing instruction-level parallelism, and varying the number of iterations of this loop. Figure 1 presents SPC, IPC, and speedup obtained by consolidating threads on the same socket (or physical core) relative to running on different sockets, as the ratio of computation to coherence activity in each outer loop is varied using the two different computational loops. As coherence activity (inter-socket coherence events per cycle) increases, speedup from consolidation mirrors SPC, but the performance gains and inflection points depend on the application. For example, at roughly the same coherence activity of 1.2 × 10^-3 coherence events/cycle, consolidation on different physical cores in the same socket results in a speedup of 1.5 for the microbenchmark on the left and 2.4 for the one on the right. The corresponding SPC is 436 and 677, respectively, allowing clear prioritization of the microbenchmark on the right for consolidation.
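The following is a minimal pthread sketch of a microbenchmark in this spirit, not the authors' actual code: two threads repeatedly update a small shared array to generate coherence traffic, while a thread-private compute loop of independent operations, controlled by a command-line knob, varies the instruction-level parallelism available to hide that latency.

/* Sketch of a coherence-activity microbenchmark (illustrative only).
 * Build with: cc -O2 -pthread microbench.c */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NCOUNTERS 8              /* small enough to stay cache resident */
#define OUTER_ITERS 10000000UL

static volatile long shared_counters[NCOUNTERS];  /* intentionally unsynchronized */
static unsigned long compute_iters;               /* knob: ILP work per sharing step */

static void *worker(void *arg)
{
    long id = (long)arg;
    double a = 1.0, b = 2.0, c = 3.0, d = 4.0;     /* independent registers */

    for (unsigned long i = 0; i < OUTER_ITERS; i++) {
        /* coherence activity loop: writes bounce cache lines between cores */
        shared_counters[(i + id) % NCOUNTERS]++;

        /* computational loop: independent operations expose ILP that can
         * hide the latency of the coherence misses above */
        for (unsigned long j = 0; j < compute_iters; j++) {
            a += 1.0; b += 2.0; c *= 1.000001; d *= 1.000002;
        }
    }
    return (void *)(long)(a + b + c + d);          /* defeat dead-code removal */
}

int main(int argc, char **argv)
{
    pthread_t t[2];
    compute_iters = argc > 1 ? strtoul(argv[1], NULL, 10) : 0;

    for (long i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (long i = 0; i < 2; i++)
        pthread_join(t[i], NULL);

    printf("done (compute_iters=%lu)\n", compute_iters);
    return 0;
}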

3.2 Placement on Hyperthreads

Modern processors are often equipped with hyperthreads or other equivalent logical hardware contexts that share the processor pipeline and private caches. Hyperthreads improve processor occupancy and efficiency by providing multiple instruction streams/hardware contexts to keep functional units busy. At the same time, since the hardware contexts share pipeline resources and private caches, contention for these resources can slow the performance of the individual contexts, presenting performance isolation challenges.

Placing tasks that share data on hyperthreads that share a physical core allows shared data to be accessed directly from the private caches (without traffic on the intra- and inter-socket interconnects), with the potential for a significant performance improvement. This advantage is conditional on the data being retained in the cache until the time of access by the sharing task, requiring temporal proximity of the accesses to the shared data.

Figure 1 shows the impact on performance of different task placement strategies. When coherence activity is moderate (< 0.78 × 10^-3) and IPC is sufficiently high (> 0.9), colocating tasks on different physical cores in the same socket results in the best performance. The resource contention introduced by hyperthreading results in slowdowns that are not overcome by the potential for direct access to shared data (the shared counter) from the private cache shared by the tasks. Hence, if placement on the same socket requires sharing physical cores, distributing tasks across sockets works better than colocating them on hyperthreads. However, when coherence activity increases further and IPC is less than 0.9, the benefits of colocating tasks that share data exceed the cost of contention. At very high levels of data sharing, indicated by high coherence activity (> 0.78 × 10^-3) and SPC values in excess of 550, the benefits of hyperthreading greatly exceed those of using different physical cores on the same socket. In such cases, consolidation on hyperthreads in the same physical core can provide both performance and energy savings using Intel's powerstepping (the latter has not been explored in this paper).

To summarize, when coherence activity is high (> 0.78 × 10^-3), (higher) SPC is used to prioritize applications for consolidation and (lower) IPC is used to determine whether hyperthreading is beneficial. When coherence activity is moderate or low (< 0.78 × 10^-3), (higher) IPC is used to prioritize distribution over consolidation.

4 Latency Tolerance and Hyperthread Informed Mapping

4.1 Implementation Mechanisms

Our mapper is implemented as a kernel module that is executed by a daemon process with root privilege. Hardware performance counters are read at every system timer interrupt (tick) and the information gathered is stored in the task control block of the currently running task. Our mapper is invoked every 100ms, at which time data from the currently executing tasks is consolidated into per-application and per-socket data structures. The daemon maintains a per-application data structure for currently active (executing) applications in order to allow grouping of tasks belonging to the same application (based on address space). This data structure keeps track of application history, including the current application classification. The per-socket data structure used for determining memory bandwidth requirements and remote memory accesses is identical to that used in SAM [22]. Once the mapper decides on a task to core mapping, task migration is effected by updating processor affinity. We use the sched_setaffinity kernel call to update the processor affinity of a task. Note that task migration takes place at the next Linux scheduling operation and is required only if tasks are not already colocated. Our decision making logic thereby co-exists with other load balancing operations in Linux.

We use two machines to evaluate our performance: a dual-socket IvyBridge and a quad-socket Haswell. Each processor on the two machines contains 10 physical cores with 2 hardware contexts/hyperthreads each. Each physical core has a private L1 and L2 cache and shares a last-level L3 cache with the other cores in the processor. We use SAM's infrastructure [22] to access the hardware performance counters provided by Intel's PMU (Performance Monitoring Unit). Each hardware context has only four programmable counters in addition to the fixed counters: instructions, cycles, and unhalted cycles. We use time multiplexing across two time periods to sample eight performance counters.

We use the same counters as SAM to monitor intra-socket coherence activity, inter-socket coherence activity, remote DRAM accesses, and overall bandwidth consumption. Additionally, we also monitor the CYCLE_ACTIVITY:STALL_CYCLES_L2_PENDING event to count the stalls on cache accesses. The counts are normalized using unhalted cycles for the interval of their collection and accumulated in a data structure maintained in the task control block.

4.2 Performance Counter Consolidation

The hardware performance counter values collected and stored in task control blocks are used to infer inter- and intra-socket coherence activity, memory bandwidth utilization, remote memory accesses, and latency tolerance. The inferred values are then used to categorize tasks as having: 1) low, medium, or high coherence activity; 2) low or high memory bandwidth demand; and 3) low or high IPC, based on thresholds as discussed in Section 3.

The per-task information is aggregated to obtain per-socket information on memory bandwidth and inter-socket coherence activity. The per-task information is also used to categorize the parent application as incurring 1) low, medium, or high coherence activity and 2) low or high IPC, based on the task with the most coherence activity.
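As an illustration of the migration mechanism in Section 4.1, the sketch below shows a user-level equivalent of the affinity update: building a cpu_set_t for one socket's hardware contexts and applying it with sched_setaffinity, so that the stock Linux scheduler performs the actual move at its next scheduling operation. The socket-to-CPU mapping and the helper name are hypothetical; SAM-MPH itself does this from a kernel module driven by a root daemon.

/* Illustrative user-level sketch of affinity-based migration. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

/* Restrict task `pid` to the hardware contexts of one socket.
 * `socket_cpus` lists that socket's CPU ids, `n` is their count. */
static int colocate_on_socket(pid_t pid, const int *socket_cpus, int n)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int i = 0; i < n; i++)
        CPU_SET(socket_cpus[i], &set);

    if (sched_setaffinity(pid, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;   /* migration happens lazily, at the next schedule */
}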

4.3 Mode-Based Phase Identification

SAM's original adaptation strategy is reactive in that changes in application behavior (phase identification) in one interval trigger a potential re-mapping in the next interval. Reactive adaptation works well when application behavior is relatively stable with few transitions. However, frequent phase transitions can potentially lead to oscillating placement decisions with a resulting reduction rather than improvement in performance.

In this paper, we explore the use of history over multiple past intervals [8]. For each application, a history of interval classification, i.e., whether the interval was classified as incurring medium or high coherence activity and high compute intensity (IPC) levels, is maintained for the last n intervals. This history is maintained in three 64-bit integers using shift operations (to shift in a 1 when a particular interval exhibits the behavior). Bit masking and counting are used to determine the occurrence count of each of the bottlenecks. If the occurrence count for a bottleneck exceeds a threshold, we identify the application as suffering from the bottleneck in the next interval. The recent inclusion of the popcnt instruction in the Intel ISA results in very fast bit counting operations, allowing low overhead examination. The occurrence threshold builds hysteresis into this feedback control loop, thereby preventing oscillatory behavior. In our implementation, we set n to 10 and the occurrence threshold to 6. We analyze the sensitivity of this threshold and its impact on performance in Section 5.4.

If the number of intervals with high coherence activity exceeds the predefined threshold, the current interval is classified as incurring high coherence activity even if the performance counters for the current interval reflect low coherence activity. This strategy for classification helps avoid task migrations due to transient application behavior as well as oscillatory mappings due to frequent phase transitions.

A cumulative count of the stalls due to inter-socket coherence activity, instructions executed, and cycles elapsed since the last phase transition is maintained in order to calculate SPC and IPC. In an interval with low inter-socket coherence activity and high intra-socket coherence activity, accumulation is suppressed in order to retain SPC and IPC information from the interval where the phase transition was detected. The goal is to retain SPC and IPC information gathered during an inter-socket placement (prior to colocation) for the purposes of prioritization. Based on thresholds, if the classification changes, the cumulative counters are reset to the values for the current interval.
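A small sketch of this history mechanism, under our own naming (the n = 10 intervals and the occurrence threshold of 6 are from the text): each bottleneck gets a 64-bit history word, one bit is shifted in per 100ms interval, and a population count decides whether the bottleneck is considered active.

/* Sketch of per-application phase history with popcount-based voting. */
#include <stdint.h>

#define HISTORY_LEN   10   /* intervals of history (n)           */
#define OCCUR_THRESH   6   /* occurrences needed to flag a phase */

struct app_history {
    uint64_t coherence;    /* 1 bit per interval: medium/high coherence */
    uint64_t high_ipc;     /* 1 bit per interval: high compute (IPC)    */
    uint64_t high_mem;     /* 1 bit per interval: high bandwidth demand */
};

static inline void record_interval(uint64_t *h, int observed)
{
    /* shift in this interval's observation, keep only the last n bits */
    *h = ((*h << 1) | (observed ? 1u : 0u)) & ((1ull << HISTORY_LEN) - 1);
}

static inline int phase_active(uint64_t h)
{
    /* __builtin_popcountll compiles to the popcnt instruction on x86 */
    return __builtin_popcountll(h) >= OCCUR_THRESH;
}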
4.4 Hyperthread and Latency Tolerance Aware Mapping Policy

Presuming that all hardware contexts are busy, the mapping task consists of placing m tasks on m hardware contexts so that there is a 1:1 correspondence between tasks and hardware contexts. Applications in the highest coherence activity phase are prioritized and mapped first. These applications are sorted by their SPC values and scheduled in order. Applications whose SPC values are not known are placed at the end of the list, but are still scheduled ahead of applications with low data sharing. For each application, tasks that share data with each other are selected for colocation by updating their core affinities. If tasks do need colocation, we look for a socket that has not been assigned any task during the current round of mapping. If a sufficient number of cores in a single socket cannot be found, we colocate the threads on the least number of sockets possible.

We then look at applications with moderate levels of activity. They are sorted in order of IPC and applications with the smallest IPC are prioritized for colocation. When we encounter threads with IPC values greater than the IPC threshold (0.9), we alter the mapping to prohibit the threads from sharing the same physical core. If such a situation is unavoidable inside the same socket, we look to other sockets to determine if performance loss can be avoided. Alternately, if the SPC value of high data sharing applications is more than 550, tasks of that application are preferentially placed on hyperthreads to derive benefits from very high data sharing.
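The following compact C sketch illustrates the ordering logic of one mapping round under stated assumptions: the application descriptor, the (commented-out) placement helpers, and the constant names are ours; only the prioritization rules and the 0.9 IPC and 550 SPC thresholds come from the text above.

/* Sketch of one SAM-MPH-style mapping round (ordering logic only). */
#include <stdlib.h>

struct app {
    int    coherence_level;   /* 0 = low, 1 = moderate, 2 = high         */
    double spc;               /* stalls per inter-socket coherence event */
    double ipc;               /* instructions per cycle                  */
    int    spc_known;
};

#define IPC_THRESHOLD   0.9
#define SPC_HYPERTHREAD 550.0

static int by_spc_desc(const void *a, const void *b)
{
    const struct app *x = a, *y = b;
    if (!x->spc_known && !y->spc_known) return 0;
    if (!x->spc_known) return 1;          /* unknown SPC goes last */
    if (!y->spc_known) return -1;
    return (y->spc > x->spc) - (y->spc < x->spc);
}

static int by_ipc_asc(const void *a, const void *b)
{
    const struct app *x = a, *y = b;
    return (x->ipc > y->ipc) - (x->ipc < y->ipc);
}

/* High-sharing applications first (ordered by SPC), then moderate ones
 * (ordered by IPC); the placement helpers are assumed, not shown. */
void map_round(struct app *high, int nh, struct app *moderate, int nm)
{
    qsort(high, nh, sizeof(*high), by_spc_desc);
    for (int i = 0; i < nh; i++) {
        int use_hyperthreads = high[i].spc_known &&
                               (high[i].spc > SPC_HYPERTHREAD ||
                                high[i].ipc < IPC_THRESHOLD);
        /* colocate_app(&high[i], use_hyperthreads);  -- hypothetical helper */
        (void)use_hyperthreads;
    }

    qsort(moderate, nm, sizeof(*moderate), by_ipc_asc);
    for (int i = 0; i < nm; i++) {
        int avoid_same_core = moderate[i].ipc > IPC_THRESHOLD;
        /* place_app(&moderate[i], avoid_same_core);  -- hypothetical helper */
        (void)avoid_same_core;
    }
}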

5 Evaluation

Our evaluation was conducted on two machines. The first is our development platform and is a dual-socket machine, equipped with 2.2 GHz Intel Xeon E5-2660 v2 processors from the Ivy Bridge architecture. Each socket can accommodate up to 20 hardware contexts on 10 physical cores, sharing a last-level cache of 25MB. Each socket is directly connected to 8GB of local DRAM memory, resulting in non-uniform access to a total of 16GB DRAM memory. This machine is labeled 40-CPU IvyBridge.

The second machine contains four sockets, equipped with 1.9 GHz Intel Xeon E7-4820 v3 processors from the Haswell architecture. Each processor accommodates up to 20 hardware contexts and has up to 64 GB of local DRAM per socket for a total of 256GB DRAM. This machine is labeled 80-CPU Haswell.

The operating system we use is Fedora 19 and the kernel was compiled using GCC 4.8.2. The Linux kernel (version 3.14.8) was modified to accommodate the changes needed for our techniques.

We compare the performance of our sharing-aware mapper, SAM-MPH, with that of SAM [22] and default Linux. In order to attribute the performance improvements in SAM-MPH, we also show the incremental performance gained due to identifying and prioritizing task placement based on latency tolerance (SAM-M), hyperthreading aware mapping (SAM-H), and using history across multiple intervals to identify phase changes (SAM-P).

5.1 Benchmarks

We use a combination of microbenchmarks, SPECCPU [1], PARSEC [3], and several parallel and graph-based benchmarks in order to evaluate SAM-MPH. Similar to SAM, we use microbenchmarks to stress the coherence protocol and memory bandwidth. HuBench and LuBench contain pairs of threads sharing data with each other to generate coherence traffic. HuBench generates high coherence activity and is very sensitive to data sharing latencies, while LuBench has enough thread-private computation to hide its data sharing latency. MemBench is a memory intensive microbenchmark that uses multiple threads to access thread-private memory in a streaming fashion. These threads saturate memory bandwidth on one socket, therefore benefiting from distribution across sockets to maximize resource utilization.

GraphLab [16] and GraphChi [14] are recent application development tools specially suited for graph-based parallel applications. Unlike the SPECCPU and most PARSEC applications, graph-based processing involves considerable data sharing across several workers. These applications are also much more dynamic, with tasks actively being created and deleted, and going through phases of computation that are vastly different in characteristics, depending on the type of problem and the active parallelism available in the input data. The unpredictable and dynamic nature of these applications makes them good candidates for evaluating the effectiveness of our mapper.

We use a variety of machine learning, data mining, and data filtering applications for our evaluation: TunkRank (Twitter influence ranking), Alternating Least Squares (ALS) [28], Stochastic gradient descent (SGD) [12], Singular Value Decomposition (SVD) [11], Restricted Boltzmann Machines (RBM) [10], Probabilistic Matrix Factorization (PMF) [21], Biased SGD [11], and Lossy SGD [11].

In addition to the above workloads, we also evaluate our implementation on MongoDB, a very widely used data management server. The load for MongoDB is generated by YCSB (Yahoo Cloud Serving Benchmark).

5.2 Standalone Application Evaluation

Figure 2 shows that for most cases, standalone application performance on SAM-MPH is as good as and sometimes better than a static optimum schedule. SAM-MPH and the other variants significantly outperform Linux in almost all cases. Linux generally distributes load across sockets and cores, while SAM can detect and respond to data sharing and resource contention but not at the hyperthread level. For these standalone applications, SAM is already able to identify and isolate data sharing to achieve close to the best static schedule. SAM-M and SAM-MP thus add little extra benefit. SAM-MPH is able to identify all the bottlenecks exposed by SAM and outperforms it in 5 cases, demonstrating the importance of considering resource contention at the hyperthread level and of eliminating migrations due to transient application behavior.

With LuBench, PMF, RBM, SVD, and ALS, SAM-MPH performs better than SAM. SAM underperforms Linux in the case of LuBench. LuBench with 20 threads incurs non-trivial data sharing, which prompts SAM to colocate the threads if possible. However, when LuBench executes on two hardware contexts on the same physical core, single thread performance is significantly affected due to contention for pipeline resources. Since Linux by default spreads load out across sockets, it avoids the resource contention on hardware contexts. SAM-MPH identifies both data sharing and pipeline resource contention in LuBench and prioritizes pipeline resource contention as the bigger bottleneck in this case.

For applications PMF, SVD, RBM, and ALS, both Linux and SAM perform very close to the static optimum schedule, with SAM being slightly faster. However, SAM-MPH outperforms the best static schedule by a significant margin for PMF and a slight margin for the rest. The best static schedule, as the name suggests, does not adapt to dynamic phases in the application. These four applications exhibit phases that share data and phases that contend for pipeline resources when colocated on hyperthreads. Since SAM-MPH is able to identify these different phases of computation and adapt accordingly to the bigger bottleneck, it is able to perform better than the best static schedule.

Figure 3 shows the intra- and inter-socket coherence activity for the standalone applications on the 40-CPU IvyBridge. In general, SAM is able to suppress inter-socket coherence activity slightly better than SAM-MPH. This slight reduction is attributed to SAM's decision making based on a single interval at a time. SAM-MPH relies on past history (consisting of several intervals) to detect application characteristics, resulting in higher hysteresis. The hysteresis has negligible impact on performance.

Figure 2: Performance of standalone applications using SAM-MPH, SAM-MP, SAM-M, SAM, and default Linux on the 40-CPU IvyBridge. The performance metric is the execution time speedup (the higher the better) compared to that of the best static task → CPU mapping determined through offline testing.

Figure 3: Intra- and inter-socket traffic for standalone applications. Fig. (A): per-thread inter-socket coherence activity; Fig. (B): per-thread intra-socket coherence activity. All values are normalized to unhalted CPU cycles.

For the five applications discussed above, SAM-MPH has less intra-socket coherence activity than SAM, since it sometimes avoids colocating threads onto hyperthreads to reduce resource contention on the hyperthreads.

Table 1 outlines information about each application, with the major and minor factors that influence its performance. It also shows the stalls incurred per inter-socket coherence event (SPC). The SPC value reported here is averaged across all high communication phases of the application. We can see that for SGD, LSGD, SVD, and BSGD, higher SPC translates to higher performance improvement on colocation.

RBM also exhibits high SPC values, but its performance improvement doesn't directly correlate to SPC. This is attributed to the fact that RBM also has compute heavy phases which do not get significant speedup on colocation. Additionally, it would have to be placed on separate physical cores rather than on hyperthreads. Due to these factors, the overall speedup gained during the high coherence phase does not fully translate to very high performance gain.

Application      SPC   Major Bottleneck   Minor Bottleneck
HuBench (20t)    745   DS                 None
LuBench (10t)    367   IPC                DS
LuBench (20t)    367   IPC                DS
MemBench (10t)   -     Memory             None
MemBench (20t)   -     Memory             None
SGD (20t)        398   DS                 None
BSGD (20t)       421   DS                 None
LSGD (20t)       455   DS                 None
RBM (20t)        403   DS                 IPC
SVD (20t)        442   DS                 IPC
PMF (10t)        -     None               IPC and DS
PMF (20t)        -     None               IPC and DS
ALS (10t)        -     None               IPC and DS
ALS (20t)        -     None               IPC and DS

Table 1: Application characteristics and SPC values. DS: Data Sharing (high coherence activity); IPC: Instructions Per Cycle (instruction-level parallelism with high CPU demand); Memory: memory bound (high memory bandwidth demand).

Figure 4: Performance of standalone applications using SAM-MPH, SAM, and default Linux on the 80-CPU Haswell. The performance metric is the execution time speedup (the higher the better) compared to that of the best static task → CPU mapping determined through offline testing.

PMF and ALS have moderate levels of coherence activity, during which IPC is used for colocation decisions rather than SPC, since SPC cannot be obtained reliably at these levels, as explained in Section 3. Hence the SPC value is not reported. When IPC is > 0.9, which is frequently the case for these applications, threads are preferentially placed on separate physical cores in the same socket, with placement across sockets preferred over placement on hyperthreads.

In addition to parallel data sharing workloads, we evaluate SAM-MPH on MongoDB, generating load with YCSB threads, both running on the same machine. For this workload, SAM-MPH and SAM perform very similarly. We observe an improvement of about 3.67% and 6.6% on the larger and smaller evaluation platforms respectively. The marginal improvement is a result of a small amount of data being shared by the threads of the application.

Our experiments with the PARSEC and SPECCPU benchmarks show very similar results to SAM and thus we do not discuss them further in this paper.

Figure 4 shows the results of SAM-MPH, SAM, and the default Linux scheduler on the 80-CPU Haswell. Overall, these results are very similar to the results on the 40-CPU IvyBridge: SAM-MPH is able to match and sometimes exceed the performance of the best static schedule. The 80-CPU Haswell, with twice the number of sockets and cores as the 40-CPU IvyBridge, shows the performance gap between SAM-MPH and Linux widening further. SAM-MPH halves the execution time of applications compared to the default Linux scheduler. On average, for standalone workloads, SAM-MPH is 57% faster than Linux. It also matches or exceeds the performance of the best static schedule.

5.3 Multiprogrammed Workload Evaluation

Applications in multiprogrammed workloads interfere with each other in different ways, depending on the characteristics of the applications in the mix and the phase of their execution. This interference may result in slowdown of some or all of the applications.

Table 2 shows the various application mixes that are used in the evaluation of SAM-MPH. The workload mixes cover a wide range of application characteristics. Individual applications can be affected due to contention for the processor pipeline, cache space contention, contention for memory bandwidth, communication due to data sharing, and non-uniform communication latencies. We expect SAM-MPH to be able to identify each of these bottlenecks and perform task to core mapping in such a way that would minimize the negative impact on performance due to resource contention and non-uniformity in communication.

Figure 5 shows the performance of SAM-MPH for the multiprogrammed workloads on the 40-CPU IvyBridge. Our performance metric for application mixes is the geometric mean of the individual application speedups, calculated for each application in a workload mix by comparing its runtime to that of its best standalone static runtime.

On average, SAM-MPH is about 27% faster than stock Linux and 9% faster than SAM. More importantly, applications managed by SAM-MPH show very little degradation in performance when compared with the best standalone static schedule. It must be noted that for many workload mixes, it is not possible to get numbers matching the standalone static schedule due to resource contention. The average speedup for SAM-MPH is 0.976, proving that applications seldom show signs of slowdown. The minimum speedup with SAM-MPH is 0.93.


Figure 5: Performance of multiprogrammed workloads using SAM-MPH, SAM-MP, SAM-M, SAM, and default Linux on the 40-CPU IvyBridge. The performance metric is the geometric mean of the individual application speedups (higher is better) compared to execution time obtained for a standalone run using the best static task → CPU mapping determined through offline testing.

SAM-MPH is able to improve on SAM's performance primarily due to two factors. First, SAM-MPH can prioritize applications that are more sensitive to bottlenecks based on latency tolerance, while SAM cannot distinguish tasks that are more sensitive to communication from others that are capable of absorbing the latency. Using application phase detection and accurate metrics that are deduced in these phases, SAM-MPH is able to understand the performance impact of communication due to data sharing.

Second, SAM-MPH identifies potential slowdown when putting two tasks on logical threads on the same physical core. Using this knowledge, it attempts to pair up applications such that they benefit from being placed on logical threads. If that is not possible, it attempts to schedule tasks so that they do not contend highly for the processor pipeline.

SAM-MPH's ability to identify data sharing and its impact on performance is the primary reason for the observed speedups in workloads 1–11. In each of these workloads, all applications exhibit data sharing. It is not possible to schedule all tasks such that the tasks of an application remain inside the socket. SAM-MPH is able to prioritize applications that are more sensitive to the latency of communication due to data sharing.

In these workloads, ALS and PMF are the applications that are given the least priority and hence spread out across sockets. As discussed previously, these applications exhibit high ILP that is able to absorb the data communication latency. In addition to leveraging the tasks' ability to hide latency, SAM-MPH also identifies that ALS and PMF contend for the processor pipeline and avoids pairing their tasks together on the same physical core. Instead, SAM-MPH pairs each of their tasks with a task from the other applications to minimize resource contention. It is the combination of these optimizations that provides a consistent increase in performance of over 26% for these workloads.

SAM-M, being able to prioritize applications, can perform better than SAM. For workloads 1–11, SAM-M improves performance over Linux and SAM by 21% and 5% respectively. However, SAM-MP is faster than Linux and SAM by 25% and 8% respectively, demonstrating that SAM-MP's robustness adds value. SAM-MPH additionally mitigates contention on hyperthreads (due to ALS, RBM, and PMF), and is able to improve over SAM-MP to achieve performance closest to the standalone static schedule. SAM-MPH outperforms Linux and SAM by 30% and 13%.

For workloads 12–18, SAM-MPH identifies two applications with data sharing. The ideal decision for these workloads is to pin one application on each socket to localize all communication within a socket, which SAM-MPH and its variants correctly arrive at. SAM performs significantly slower since it does not have the notion of task groups and therefore is not able to separate the tasks. SAM relies on iteratively moving tasks onto the same socket since successful migrations will not cause additional inter-socket communication. Though this method works well in comparison with Linux, it does not achieve runtimes close to the optimal static runtime.

Workload   Application mix
1          12 ALS, 14 SGD, 14 LSGD
2          12 ALS, 14 SGD, 14 BSGD
3          12 ALS, 14 BSGD, 14 LSGD
4          12 ALS, 14 SVD, 14 BSGD
5          12 ALS, 14 SVD, 14 LSGD
6          12 ALS, 14 SVD, 14 SGD
7          12 ALS, 14 SVD, 14 RBM
8          12 ALS, 14 SGD, 14 RBM
9          12 PMF, 14 SGD, 14 RBM
10         12 PMF, 14 SGD, 14 BSGD
11         12 PMF, 14 SGD, 14 LSGD
12         20 SGD, 20 BSGD
13         20 SGD, 20 LSGD
14         20 SGD, 20 SVD
15         20 BSGD, 20 LSGD
16         20 LSGD, 20 ALS
17         20 LSGD, 20 SVD
18         20 BSGD, 20 SVD
19         6 SGD, 6 BSGD, 4 Mem, 4 CPU
20         6 BSGD, 6 LSGD, 4 Mem, 4 CPU
21         6 SGD, 6 LSGD, 4 Mem, 4 CPU
22         10 SGD, 10 BSGD
23         10 SGD, 10 LSGD
24         10 LSGD, 10 BSGD
25         10 LSGD, 10 ALS
26         10 SVD, 10 SGD
27         10 SVD, 10 BSGD
28         10 SVD, 10 LSGD
29         8 SVD, 8 LSGD
30         6 SVD, 6 LSGD

Table 2: Multiprogrammed application mixes. For each mix, the number preceding the application's name indicates the number of tasks it spawns. We use several combinations of applications to evaluate scenarios with varying data sharing and memory utilization.

Figure 6: Performance of multiprogrammed workloads using SAM-MPH, SAM, and default Linux on the 80-CPU Haswell. The performance metric is the geometric mean of the individual application speedups (higher is better) compared to execution time obtained for a standalone run using the best static task → CPU mapping determined through offline testing. Whiskers represent the max-min speedup range for the individual applications within each workload.

Workloads 19–21 contain applications with data sharing running simultaneously with other memory and CPU bound tasks. These cases demonstrate the capability to balance load and resource utilization alongside reducing latency due to communication. In these cases, SAM-MPH and its variants achieve close to the standalone performance. SAM underperforms due to its inability to form groups between the two data sharing applications. It does, however, balance load and resource utilization.

Workloads 23–30 also exhibit data sharing characteristics but use only 20 out of the 40 hardware contexts that are available. In these cases, SAM attempts to colocate all threads onto the same socket. This eliminates all inter-socket coherence traffic but increases pressure on and contention for the last-level cache. Since SAM-MPH and its variants identify task grouping, they separate the two groups onto the two available sockets, further reducing contention and eliminating communication due to data sharing.

Figure 6 shows the results obtained for the workload mixes listed in Table 3 on the four-socket 80-CPU Haswell. The workload mixes test SAM-MPH's ability to identify phases in applications that are most sensitive to data sharing. It also examines how SAM-MPH scales to a bigger machine with twice the number of processors and cores. We can see that SAM-MPH is able to achieve significantly better performance for all the workload mixes.

On average, we observe a 21% improvement over SAM and a 43% improvement over stock Linux. While a reduction in performance compared to standalone execution is inevitable due to resource contention in multiprogrammed workloads, SAM-MPH is able to reduce the penalty. Performance improvement over Linux can be as high as 61% for our multiprogrammed workloads, while the minimum improvement was 29% for workload 10.

Equally important, as the whisker plots in Figure 6 (showing the minimum and maximum speedups for the individual applications in each workload) show for the 4-socket 80-CPU Haswell machine, SAM-MPH reduces performance disparity (a measure of fairness) across applications in a workload in comparison to default Linux. The geometric mean of the minimum application speedup across all workload mixes is 0.889, 0.734, and 0.571 for SAM-MPH, SAM, and default Linux respectively. The corresponding values for the maximum speedup are 0.989, 0.822, and 0.795. On the 2-socket 40-CPU IvyBridge machine, the geometric mean of the minimum speedup is 0.953, 0.860, and 0.710, and that of the maximum is 1.0003, 0.932, and 0.839 respectively. Both SAM and SAM-MPH show a compressed spread.

Workload   Tasks per app   Application mix
1          20              SGD, BSGD, SVD
2          20              SGD, BSGD, SVD, ALS
3          20              SGD, BSGD, SVD, LSGD
4          20              SGD, BSGD, RBM, LSGD
5          20              SGD, BSGD, RBM, SVD
6          20              SGD, BSGD, RBM, ALS
7          16              SGD, SVD, ALS, LSGD, BSGD
8          16              SGD, SVD, PMF, BSGD, LSGD
9          16              RBM, LSGD, SVD, PMF, BSGD
10         16              RBM, LSGD, SVD, PMF, ALS
11         16              SVD, SGD, RBM, BSGD, LSGD

Table 3: Multiprogrammed application mixes for experiments on the 80-CPU Haswell.

5.4 Sensitivity Analysis

SAM-MPH relies on parameter thresholds to identify bottlenecks. In addition to the thresholds used in SAM, we use SPC and IPC values to prioritize task colocation based on latency tolerance and contention for pipeline resources. In this section, we look at the sensitivity of SAM-MPH's behavior to these parameter thresholds. We increase/decrease the thresholds in steps of 5% to analyze the sensitivity of performance to these thresholds.

IPC is used to decide if tasks can be colocated on the same physical core. If the IPC threshold is too high, it is possible to map two compute intensive tasks on the same physical core, thereby slowing both down. A threshold increase of 20% (new IPC threshold of 1.08) can result in a performance reduction of about 7% on PMF. If the IPC threshold is too low, the mapper can miss a potential window to improve performance by colocating tasks that share data on the same physical core, thereby improving their performance and reducing contention by eliminating traffic on intra- and inter-socket interconnects. In our experiments, lowering the IPC threshold by 30% (new IPC threshold of 0.63) results in a loss in performance of 18% for the SGD application with 20 threads. The reduction in threshold created a false need to spread tasks across sockets in order to avoid colocating them on the same physical core, resulting in the slowdown.

SPC is used to prioritize applications when they are observed to have high coherence activity. Since SPC is approximated by attributing stalls due to all cache misses to coherence activity, SPC is reliable only at higher coherence activity levels. The coherence activity threshold used to identify when SPC is reliable is important to performance. We find a performance reduction of over 15% for mixed workload 8 when the threshold was increased by 30% (from 0.78 × 10^-3 to 1.01 × 10^-3) due to missed opportunities. When the threshold is reduced by 50% (from 0.78 × 10^-3 to 0.39 × 10^-3), we lose over 30% performance for workload 1 due to improper prioritization of applications.

Overall, we observe that while the value of the parameters is important to performance, SAM-MPH shows stable behavior over a reasonably broad range of values for these important parameters. In fact, we used the same thresholds, scaled for frequency, on the two platforms.

5.5 Overhead Assessment

SAM-MPH functionality can be divided into three distinct parts. The implementation complexity of each of these dictates the overall overhead of SAM-MPH. First, performance counters are read every 1 mSec. Second, every 100 mSecs, performance counter data is consolidated: application and socket-level bottlenecks are identified to be used to map tasks to cores. Finally, task mapping decisions are taken in order to improve performance. In order to measure the overhead of SAM-MPH, we perform a piecewise estimation since the implementation overhead is well within measurement error.

Reading performance counters is done at intervals of 1 mSec and consumes 8.89 µSecs per call. This overhead is constant and does not vary with the number of processors/active tasks. Data consolidation, performed every 100 mSecs, can consume a varying amount of time, primarily depending on the number of active applications. The worst case time consumption per SAM-MPH mapping call, including decision making and thread migration, is 230 µSecs. Worst case behavior can be observed when each active task is its own application. The same overhead when all cores are utilized but by only one application is about 14 µSecs. The additional overhead of over 200 µSecs is added by code that groups tasks into applications using address space information. The more distinct applications, the more time is spent on attributing tasks to applications. In most practical situations, however, the number of applications will be significantly fewer than the number of cores. Overall, SAM-MPH adds a worst case overhead of just over 1%, which is far outweighed by its benefits.

SAM-MPH's data consolidation and decision making is implemented in a centralized fashion using a daemon process. On our machines, with 40 and 80 hardware threads, this implementation methodology works well. In the future, if SAM-MPH's overheads become a limitation, a distributed implementation may be warranted.

6 Related Work

Multicore resource contention and interference (particularly for the shared last-level cache, off-chip bandwidth, and memory) has been well studied in previous work. Suh et al. [23] focus on minimizing cache misses using hardware counter-assisted marginal gain analysis. Page coloring [7, 26] has been used in the operating system memory allocator to effectively partition cache space without the need for specialized hardware features. Inter-task interference at the DRAM memory level has been mitigated using parallelism-aware batch scheduling [18]. In an offline approach, Mars et al. [17] designed co-running microbenchmarks to control pressure on shared resources, and thereby predict the performance interference between colocated applications. While these techniques help manage resource contention, they do not address the impact of non-uniform topology on traffic due to data sharing.

ESTIMA [6] uses stall cycles to learn and predict application scalability on larger core counts using an offline approach. In contrast, our focus is on online multiprogrammed workload scheduling. Rao et al. [20] discuss using processor uncore pressure to minimize NUMA-induced bottlenecks when scheduling virtual machines. While their approach of minimizing overall uncore pressure works to mitigate resource contention, it is not effective in eliminating resource pressure caused by data sharing.

Several efforts have also been made to automatically determine sharing among tasks. Tam et al. [24] utilized address sampling (available on Power processors) to identify task groups with strong data sharing. Tang et al. [25] relied on the number of accesses to "shared" cache lines to identify intra-application data sharing. Our past work [22] monitored and separated inter-CPU coherence activity from memory traffic to determine the benefits of consolidating tasks on the same socket versus distributing tasks across CPU sockets. Our work in this paper makes two new contributions. First, we identify latency tolerance in some workloads where inter-CPU coherence activity does not necessarily lead to CPU stalls and performance degradation. Second, we identify when the benefits of consolidating tasks on the same physical core due to data sharing outweigh the performance loss due to contention for functional units and cache space.

Scheduling for simultaneous hardware multithreading, e.g., Intel's hyperthreads [13], has not been ignored in the past. Early work by Nakajima and Pallipadi [19] proposed two simple scheduling heuristics: 1) task cache affinity to one hyperthread infers affinity to its sibling hyperthread; 2) the scheduler should prefer a CPU whose sibling hyperthread is idle. Bulpin and Pratt [5] calibrated a blackbox linear model that predicts the hyperthreading performance impact on a range of processor metrics. Their blackbox model provides no semantics on the hypothetical linear relationship and it is unclear how it applies broadly to other processors. Work in this paper monitors the cache coherence traffic, resulting stalls, and instruction retirement rates to understand inter-CPU data sharing and potential latency tolerance, and thereby inform hyperthread colocation decisions.

The scalability of multicore and hardware multithreading has also been an emphasis in software system designs. For instance, Zhang et al. [27] presented user-space techniques (in the OpenMP runtime) to optimize inter-hyperthread data locality, instruction mix, and load balance. Multicore operating systems like Corey [4] and Multikernel [2] are designed to minimize cross-CPU sharing and synchronization for enhanced scalability. More recently, Callisto [9] is an OpenMP runtime system to handle synchronization and balance load on multicores. These efforts to improve software scalability are complementary to our CPU scheduling work: e.g., reduced data sharing traffic in some software tasks presents more flexibility to the scheduler, which must consider resource contention, data sharing, and load balancing issues among all system and application tasks.

7 Conclusions

This paper presents new advances in resolving the tension between data sharing and resource contention in multicore task to core mapping. We make three specific contributions. First, we demonstrate the importance of identifying application latency tolerance, in addition to capturing data sharing traffic [22, 24, 25], in determining the true benefits of application and thread colocation. Second, we recognize that core-level sharing must pay attention to resource contention between hardware threads [5, 19] and show that a combination of IPC and coherence activity thresholds can inform the performance tradeoffs of core sharing. Third, we build an adaptive CPU socket and core sharing scheduler, called SAM-MPH, that uses history to avoid ineffective migrations due to oscillatory or transient behavior.

We perform experiments with a broad range of applications including SPEC CPU2000 [1], the PARSEC parallel benchmark suite [3], and GraphLab [16] / GraphChi [14] graph processing applications. Evaluation on a dual-socket, 40-CPU IvyBridge machine shows that SAM-MPH is 25% faster than Linux for standalone applications. On a larger 80-CPU Haswell machine with 4 sockets, SAM-MPH can halve the runtime of standalone workloads and can improve performance over Linux by up to 61% for multiprogrammed workloads. While SAM-MPH relies on thresholds to identify resource bottlenecks, our results show that performance is not sensitive to precise threshold values. Finally, SAM-MPH's runtime overhead in performance counter collection, analysis, and decision making is ∼1%, making it suitable for production use.

Acknowledgments: This work was supported in part by the U.S. National Science Foundation grants CNS-1217372, CCF-1217920, CNS-1239423, CCF-1255729, CNS-1319353, CNS-1319417, and CCF-137224. We also thank the anonymous USENIX ATC reviewers and our shepherd Andy Tucker for comments that helped improve this paper.

References

[1] SPECCPU2006 benchmark. www.spec.org.

[2] A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schupbach, and A. Singhania. The multikernel: A new OS architecture for scalable multicore systems. In 22nd ACM Symp. on Operating Systems Principles (SOSP), 2009.

[3] C. Bienia. Benchmarking Modern Multiprocessors. PhD thesis, Princeton University, January 2011.

[4] S. Boyd-Wickizer, H. Chen, R. Chen, Y. Mao, F. Kaashoek, R. Morris, A. Pesterev, L. Stein, M. Wu, Y. Dai, Y. Zhang, and Z. Zhang. Corey: An operating system for many cores. In 8th USENIX Symp. on Operating Systems Design and Implementation (OSDI), San Diego, CA, Dec. 2008.

[5] J. R. Bulpin and I. A. Pratt. Hyper-threading aware process scheduling heuristics. In USENIX Annual Technical Conf., pages 399–402, Anaheim, CA, Apr. 2005.

[6] G. Chatzopoulos, A. Dragojevic, and R. Guerraoui. ESTIMA: Extrapolating scalability of in-memory applications. In 21st ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming (PPoPP '16), pages 27:1–27:11, New York, NY, USA, 2016. ACM.

[7] S. Cho and L. Jin. Managing distributed, shared L2 caches through OS-level page allocation. In 39th Int'l Symp. on Microarchitecture (MICRO), pages 455–468, Orlando, FL, Dec. 2006.

[8] E. Duesterwald, C. Cascaval, and S. Dwarkadas. Characterizing and predicting program behavior and its variability. In Int'l Conf. on Parallel Architectures and Compilation Techniques, Sept. 2003.

[9] T. Harris, M. Maas, and V. J. Marathe. Callisto: Co-scheduling parallel runtime systems. In 9th EuroSys Conf., Amsterdam, Netherlands, Apr. 2014.

[10] G. E. Hinton. A practical guide to training restricted Boltzmann machines. In Neural Networks: Tricks of the Trade, Second Edition, pages 599–619. 2012.

[11] Y. Koren. Factorization meets the neighborhood: A multifaceted collaborative filtering model. In 14th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining (SIGKDD), pages 426–434, Las Vegas, NV, 2008.

[12] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, 42(8):30–37, Aug. 2009.

[13] D. Koufaty and D. T. Marr. Hyperthreading technology in the NetBurst microarchitecture. IEEE Micro, 23(2):56–65, Apr. 2003.

[14] A. Kyrola, G. Blelloch, and C. Guestrin. GraphChi: Large-scale graph computation on just a PC. In 10th USENIX Symp. on Operating Systems Design and Implementation (OSDI), pages 31–46, Hollywood, CA, 2012.

[15] B. Lepers, V. Quema, and A. Fedorova. Thread and memory placement on NUMA systems: Asymmetry matters. In USENIX Annual Technical Conf., Santa Clara, CA, July 2015.

[16] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: A new parallel framework for machine learning. In Conf. on Uncertainty in Artificial Intelligence (UAI), Catalina Island, CA, July 2010.

[17] J. Mars, L. Tang, R. Hundt, K. Skadron, and M. L. Soffa. Bubble-Up: Increasing utilization in modern warehouse scale computers via sensible co-locations. In 44th Int'l Symp. on Microarchitecture (MICRO), Porto Alegre, Brazil, Dec. 2011.

[18] O. Mutlu and T. Moscibroda. Parallelism-aware batch scheduling: Enhancing both performance and fairness of shared DRAM systems. In 35th Int'l Symp. on Computer Architecture (ISCA), pages 63–74, Beijing, China, June 2008.

[19] J. Nakajima and V. Pallipadi. Enhancements for hyper-threading technology in the operating system: Seeking the optimal scheduling. In Second Workshop on Industrial Experiences with Systems Software, pages 25–38, Boston, MA, Dec. 2002.

[20] J. Rao, K. Wang, X. Zhou, and C. Z. Xu. Optimizing virtual machine scheduling in NUMA multicore systems. In 19th IEEE Int'l Symp. on High Performance Computer Architecture (HPCA), pages 306–317, Feb. 2013.

[21] R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov Chain Monte Carlo. In 25th Int'l Conf. on Machine Learning (ICML), pages 880–887, Helsinki, Finland, 2008.

[22] S. Srikanthan, S. Dwarkadas, and K. Shen. Data sharing or resource contention: Toward performance transparency on multicore systems. In USENIX Annual Technical Conf., Santa Clara, CA, July 2015.


[23] G. E. Suh, L. Rudolph, and S. Devadas. Dynamic partitioning of shared cache memory. The Journal of Supercomputing, 28(1):7–26, Apr. 2004.

[24] D. Tam, R. Azimi, and M. Stumm. Thread clustering: Sharing-aware scheduling on SMP-CMP-SMT multiprocessors. In Second EuroSys Conf., pages 47–58, Lisbon, Portugal, Mar. 2007.

[25] L. Tang, J. Mars, N. Vachharajani, R. Hundt, and M. L. Soffa. The impact of memory subsystem resource sharing on datacenter applications. In 38th Int'l Symp. on Computer Architecture (ISCA), pages 283–294, San Jose, CA, June 2011.

[26] X. Zhang, S. Dwarkadas, and K. Shen. Towards practical page coloring-based multi-core cache management. In 4th EuroSys Conf., pages 89–102, Nuremberg, Germany, Apr. 2009.

[27] Y. Zhang, M. Burcea, V. Cheng, R. Ho, and M. Voss. An adaptive OpenMP loop scheduler for hyperthreaded SMPs. In Int'l Conf. on Parallel and Distributed Computing Systems, pages 256–263, San Francisco, CA, Sept. 2004.

[28] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the Netflix prize. In Proc. 4th Int'l Conf. on Algorithmic Aspects in Information and Management, LNCS 5034, pages 337–348. Springer, 2008.
