Data Sharing or Resource Contention: Toward Performance Transparency on Multicore Systems

Sharanyan Srikanthan, Sandhya Dwarkadas, and Kai Shen, University of Rochester

https://www.usenix.org/conference/atc15/technical-session/presentation/srikanthan

This paper is included in the Proceedings of the 2015 USENIX Annual Technical Conference (USENIX ATC '15). July 8–10, 2015 • Santa Clara, CA, USA. ISBN 978-1-931971-225

Open access to the Proceedings of the 2015 USENIX Annual Technical Conference (USENIX ATC '15) is sponsored by USENIX.

Data Sharing or Resource Contention: Toward Performance Transparency on Multicore Systems

Sharanyan Srikanthan    Sandhya Dwarkadas    Kai Shen
Department of Computer Science, University of Rochester
{srikanth,sandhya,kshen}@cs.rochester.edu

Abstract

Modern multicore platforms suffer from inefficiencies due to contention and communication caused by sharing resources or accessing shared data. In this paper, we demonstrate that information from low-cost hardware performance counters commonly available on modern processors is sufficient to identify and separate the causes of communication traffic and performance degradation. We have developed SAM, a Sharing-Aware Mapper that uses the aggregated coherence and bandwidth event counts to separate traffic caused by data sharing from that due to memory accesses. When these counts exceed pre-determined thresholds, SAM effects task to core assignments that colocate tasks that share data and distribute tasks with high demand for cache capacity and memory bandwidth. Our new mapping policies automatically improve execution speed by up to 72% for individual parallel applications compared to the default Linux scheduler, while reducing performance disparities across applications in multiprogrammed workloads.

1 Introduction

Multicore processors share substantial hardware resources including last-level cache (LLC) space and memory bandwidth. At the same time, a parallel application with multiple tasks¹ running on different CPU cores may simultaneously access shared data. Both data and resource sharing can result in performance slowdowns and symptoms including high data traffic (bandwidth consumption) and high LLC miss rates. To maximize efficiency, multicore platforms in server, desktop, and mobile environments must intelligently map tasks to CPU cores for multiprogrammed and parallel workloads.

¹ In this paper, a task refers to an OS-schedulable entity such as a process or a thread.

Despite similar symptoms, data and resource sharing behaviors require very different task → CPU mapping policies—in particular, applications with strong data sharing benefit from colocating the related tasks on cores that are in proximity to each other (e.g., cores on one socket), while applications with high memory demand or large working sets might best be distributed across sockets with separate caches and memory bandwidth resources. The mapping problem is further complicated by the dynamic nature of many workloads.

A large body of prior work has devised techniques to mitigate resource contention [5–7, 14, 15, 17, 18, 21, 22] in the absence of data sharing. Relatively few have investigated data sharing issues [19, 20] for parallel applications. Tam et al. [19] used direct sampling of data access addresses to infer data sharing behavior. While more recent multicore platforms do sometimes provide such sampled tracing capability, trace processing in software comes at a significant cost. Tang et al. [20] demonstrated the behavior of latency-critical datacenter applications under different task placement strategies. Their results reinforce the need for a low-cost, online, automated approach to place simultaneously executing tasks on CPUs for high efficiency.

This paper presents our strategy for task placement that manages data sharing and resource contention in an integrated fashion. In order to separate the impact of data sharing from resource contention, we use aggregate information from existing hardware performance counters along with a one-time characterization of event thresholds that impact performance significantly. Specifically, we use performance counters that identify on- and off-chip traffic due to coherence activity (when data for a cache miss is sourced from another core) and combine this knowledge with LLC miss rates and bandwidth consumption to separate sharing-related slowdown from slowdown due to resource contention.

Our adaptive online Sharing-Aware Mapper (SAM) uses an iterative, interval-based approach. Based on the sampled counter values in the previous interval, as well as thresholds measured in terms of performance impact using microbenchmarks, we identify tasks that share data and those that have high memory and/or cache demand. We then perform a rebalance in order to colocate the tasks that share data on cores that are in proximity and with potentially shared caches. This decision is weighed against the need to distribute tasks with high bandwidth use across cores that share fewer resources.

SAM improves the execution speed by up to 72% for stand-alone parallel applications compared to the default Linux scheduler, without the need for user input or manipulation.

For concurrent execution of multiple parallel and sequential applications, our performance improvements are up to 36%, while at the same time reducing performance disparities across applications. The rest of this paper presents the design, implementation, and evaluation of our sharing-aware mapping of tasks to CPUs on multicore platforms.

2 Sharing and Contention Tracking

Hardware counters are commonplace on modern processors, providing detailed information such as the instruction mix, rate of execution, and cache/memory access behavior. These counters can also be read at low latency (on the order of a µSec). We explore the use of commonly available event counters to efficiently track data sharing as well as resource contention on multicores.

Our work addresses several challenges. First, pre-defined event counters may not precisely suit our information tracking needs. In particular, no single counter reports the data sharing activities on multicores. Secondly, it may be challenging to identify event thresholds that should trigger important control decisions (such as saturating uses of a bottleneck resource). Finally, while many events may be defined in a performance counter architecture, usually only a small number can be observed at one time. For example, the Intel Ivy Bridge architecture used in our evaluation platform can only monitor four programmable counters at a time (in addition to three fixed-event counters).

In this paper, we use these low-cost performance counters to infer valuable information on various bottlenecks in the system. Particularly, we focus on intra- and inter-socket coherence activity, memory bandwidth utilization, and access to remote memory (NUMA). We use the counters and microbenchmarks to analyze the effect of these factors on performance. We obtain thresholds for memory bandwidth utilization and coherence activity that result in significant performance degradation. These thresholds further enable our system to identify and mitigate sharing bottlenecks during execution.

2.1 Coherence Activities

Coherence can be a significant bottleneck in large scale systems. In multithreaded applications, accesses to shared data and synchronization variables trigger coherence activities when the threads are distributed across multiple cores. When data-sharing threads are colocated on the same multicore socket, coherence is handled within the socket using a high speed bus or a ring. When threads are distributed across sockets, the coherence cost increases significantly due to the higher latency of off-chip access.

Despite the gamut of events monitored by modern day processors, accounting for coherence activity in a manner portable across platforms can still be a challenge. Our goal is to identify performance counters that are available across a range of architectural performance monitoring units, and that can help isolate intra- and inter-socket coherence. We use the following four counters—last-level cache (LLC) hits, LLC misses, misses at the last private level of cache, and remote memory accesses.

In multi-socket machines, there is a clear increase in overhead when coherence activities cross the socket boundary. Any coherence request that can be resolved from within the socket is handled using an intra-socket protocol. If the request cannot be satisfied locally, it is treated as a last-level cache (LLC) miss and handed over to an inter-socket coherence protocol.

We use the cache misses at the last private level, as well as LLC hits and misses, to indirectly infer the intra-socket coherence activities. LLC hit counters count the number of accesses served directly by the LLC and do not include data accesses satisfied by intra-socket coherence activity. Thus, by subtracting LLC hits and misses from the last private level cache misses, we can determine the number of LLC accesses that were serviced by the intra-socket coherence protocol.

To measure inter-socket coherence activity, we exploit the fact that the LLC treats accesses serviced by both off-socket coherence as well as by memory as misses. The difference between LLC misses and memory accesses gives us the inter-socket coherence activity. In our implementation, the Intel Ivy Bridge processor directly supports counting LLC misses that were not serviced by memory, separating and categorizing them based on coherence state. We sum these counters to determine inter-socket coherence activities.

We devise a synthetic microbenchmark to help analyze the performance impact of cross-socket coherence activities. The microbenchmark creates two threads that share data. We ensure that the locks and data do not induce any false sharing. Using a dual-socket machine, we compare the performance of the microbenchmark when consolidating both threads on a single socket against execution when distributing them across sockets (a common result of Linux's default scheduling strategy). The latter induces inter-socket coherence traffic, while in the former case all coherence traffic uses the on-chip network.

We vary the rate of coherence activity by changing the ratio of computation to shared data access within each loop in order to study its performance impact. Figure 1 shows the performance of consolidating the threads onto the same socket relative to distributing them across sockets. At low coherence traffic rates, the consolidation and distribution strategies do not differ significantly in performance, but when the traffic increases, the performance improvement from colocation can be quite substantial.
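For concreteness, the following is a minimal sketch of such a two-thread sharing microbenchmark (our own illustration, not the authors' code); COMPUTE_ITERS is a hypothetical knob that sets the ratio of private computation to shared-data access, and the padding guards the shared line against false sharing.

    /* Two threads repeatedly update a lock-protected shared counter;
     * raising COMPUTE_ITERS lowers the rate of coherence activity. */
    #include <pthread.h>
    #include <stdio.h>

    #define COMPUTE_ITERS 1000          /* hypothetical knob: higher => less coherence traffic */
    #define UPDATES_PER_THREAD 1000000L

    struct shared_state {
        pthread_mutex_t lock;
        long counter;
        char pad[64];                   /* keep the shared line away from unrelated data */
    };

    static struct shared_state s = { PTHREAD_MUTEX_INITIALIZER, 0, {0} };

    static void *worker(void *arg)
    {
        volatile long sink = 0;
        (void)arg;
        for (long i = 0; i < UPDATES_PER_THREAD; i++) {
            for (int j = 0; j < COMPUTE_ITERS; j++)   /* private computation phase */
                sink += j;
            pthread_mutex_lock(&s.lock);              /* shared-access phase: the lock and */
            s.counter++;                              /* counter line ping-pong between     */
            pthread_mutex_unlock(&s.lock);            /* the two threads                    */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", s.counter);
        return 0;
    }

Pinning the two threads onto one socket versus one thread per socket (e.g., with taskset) then reproduces the consolidation/distribution comparison of Figure 1.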

Figure 1: Speedup when consolidating two threads that share data onto the same socket (eliminating inter-socket coherence) in comparison to distributing them across sockets, as coherence activity is varied (x-axis: coherence events per CPU cycle, up to 1.5 × 10^-3; y-axis: performance speedup, 1 to 1.8).

Figure 2: Socket-level memory bandwidth usage (measured by the LLC_MISSES performance counter) for workloads of high and low row buffer hit ratios (RBHRs) at increasing number of tasks (x-axis: number of tasks, 0 to 10; y-axis: LLC misses per cycle, 0 to 0.06).

We use these experiments to identify a per-thread coherence activity threshold at which inter-socket coherence causes substantial performance degradation (which we define as a degradation of >5%). Specifically, on our experimental platform with 2.2 GHz processors, the threshold is 2.5 × 10^-4 coherence events per CPU cycle, or 550,000 coherence events per second.

2.2 Memory Bandwidth Utilization

Our approach requires the identification of a memory bandwidth utilization threshold that signifies resource exhaustion and likely performance degradation. Since all cores on a socket share the access to memory, we account for this resource threshold on a per-socket basis—aggregating the memory bandwidth usage of all tasks running on each particular socket.

One challenge we face is that the maximum bandwidth utilization is not a static hardware property but further depends on the row buffer hit ratio (RBHR) of the memory access loads. DRAM rows must be pre-charged and activated before they can be read or written. For spatial locality, DRAMs often activate an entire row of data (on the order of 4 KB) instead of just the cache line being accessed. Once this row is opened, subsequent accesses to the same row incur much lower latency and consume less device time. Hence, the spatial locality in an application's memory access pattern plays a key role in its performance and resource utilization.

Figure 2 shows the aggregate memory bandwidth used with increasing numbers of tasks. All tasks in each test run on one multicore socket in our machine. We show results for both low- and high-RBHR workloads, using a memory copy microbenchmark where the number of contiguous words copied is one cache line within a row or an entire row, respectively. Behaviors of most real-world applications lie between these two curves. We can clearly see that the bandwidth available to a task is indeed affected by its RBHR. For example, on our machine, three tasks with high RBHR may utilize more memory bandwidth than ten low-RBHR tasks do. If we use the maximum bandwidth usage of high-RBHR workloads as the available memory resource, then a low-RBHR workload would never reach it and therefore always be determined as not having used the available resource (even when the opposite is true).

In this paper, we are more concerned with detecting and mitigating memory bandwidth bottlenecks than utilizing the memory bandwidth at its maximum possible level. We can infer from Figure 2 that the difference between the low/high-RBHR bottleneck bandwidths (at 10 tasks) is about 10%. We conservatively use the low-RBHR bandwidth as the maximum available without attempting to track RBHR. Our high-resource-use threshold is set at 20% below the above-defined maximum available memory bandwidth. Specifically, on our experimental platform with 2.2 GHz processors, the threshold is 0.034 LLC misses per cycle, or 75,000,000 LLC misses per second. A conservative setting might result in detecting bottlenecks prematurely but will avoid missing ones that will actually result in performance degradation.
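A sketch of how this per-socket test can be applied follows (our illustration; the structure and names are hypothetical, and it assumes cores are mostly unhalted so that per-core normalized rates can simply be summed):

    #include <stdbool.h>

    #define CORES_PER_SOCKET 10
    #define MEM_BW_THRESHOLD 0.034   /* LLC misses per unhalted cycle (Section 2.2) */

    struct core_sample {             /* one interval's counts for one core */
        unsigned long llc_misses;    /* LLC_MISSES over the interval */
        unsigned long unhalted_cycles;
    };

    /* True when the socket's aggregate off-chip traffic exceeds the
     * high-resource-use threshold, i.e., bandwidth is near saturation. */
    static bool socket_bandwidth_saturated(const struct core_sample c[CORES_PER_SOCKET])
    {
        double misses_per_cycle = 0.0;
        for (int j = 0; j < CORES_PER_SOCKET; j++)
            if (c[j].unhalted_cycles > 0)
                misses_per_cycle += (double)c[j].llc_misses / c[j].unhalted_cycles;
        return misses_per_cycle > MEM_BW_THRESHOLD;
    }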

2.3 Performance Counters on Ivy Bridge

Our experimental platform contains processors from Intel's Ivy Bridge line. On our machine, cores have private L1 and L2 caches, and an on-chip shared L3 (LLC). The Intel Performance Monitoring Unit (PMU) provides the capability of monitoring certain events (each of which may have several variants) at the granularity of individual hardware contexts. Instructions, cycles, and unhalted cycles can be obtained from fixed counters in the PMU. The remaining events must use the programmable counters.

We encounter two constraints in using the programmable counters—there are only four programmable counters, and we can monitor only two variants of any particular event using these counters. Solutions to both constraints require multiplexing the programmable counters. We separate the counters into disjoint groups and then alternate the groups monitored during successive intervals of application execution, effectively sampling each counter group from a partial execution.

In order to monitor intra-socket coherence activity, we use the main event MEM_LOAD_UOPS_RETIRED. One caveat is that since this is a load-based event, only reads are monitored. In our experiments, we found that since typical long-term data use involves a read followed by a write, these counters were a good indicator of coherence activity. We use three variants of this event, namely L2_MISS (misses at the last private level of cache), L3_HIT (hits at, or accesses serviced by, the LLC), and L3_MISS (misses at the LLC), and use these events to infer the intra-socket coherence activity, as described in Section 2.1.

To obtain inter-socket coherence activity, we use the MEM_LOAD_UOPS_LLC_MISS_RETIRED event. We get the inter-socket coherence activity by summing up two variants of this event—REMOTE_FWD and REMOTE_HITM. We read the remote DRAM access event count from MEM_LOAD_UOPS_LLC_MISS_RETIRED:REMOTE_DRAM.

We use the LLC_MISSES event (which includes both reads and writes) as an estimate of overall bandwidth consumption. Although the LLC_MISSES event count includes inter-socket coherence and remote DRAM access events, since these latter are prioritized in our algorithm, further division of bandwidth was deemed unnecessary.

In total, we monitor seven programmable events, with three variants for each of two particular events, allowing measurement of one sample for each event across two iterations. Since different interval sizes can have statistical variation, we normalize the counter values with the measured unhalted cycles for the interval.

3 SAM: Sharing-Aware Mapping

We have designed and implemented SAM, a performance monitoring and adaptive mapping system that simultaneously reduces costly communication and maximizes resource utilization efficiency for the currently executing tasks. We identify resource sharing bottlenecks and their associated costs/performance degradation impact. Our mapping strategy attempts to shift load away from such bottleneck resources.

3.1 Design

Using the thresholds for each bottleneck described in Section 2, we categorize each task based on whether its characteristics exceed these thresholds. The four activity categories of interest are: inter-socket coherence, intra-socket coherence, remote DRAM access, and per-socket memory bandwidth demand.

Figure 3 defines the symbols that are used for SAM's task → CPU mapping algorithm, illustrated in Figure 4. Reducing inter-socket coherence activity by colocating tasks has the highest priority. Once tasks with high inter-socket coherence traffic are identified, we make an attempt to colocate the tasks by moving them to sockets with idle cores that already contain tasks with high inter-socket coherence activity, if there are any. If not, we use our task categorization to distinguish CPU or memory bound tasks—SAM swaps out memory intensive tasks only after swapping CPU bound tasks, and avoids moving tasks with high intra-socket coherence activity. Moving memory-intensive tasks is our reluctant last choice since such moves might lead to expensive remote memory accesses.

Our second priority is to reduce remote memory accesses, which are generally more expensive than local memory accesses. Since a reluctant migration might cause remote memory accesses, we register the task's original CPU placement prior to the migration. We then use this information to relocate the task back to its original socket whenever we notice that it incurs remote memory accesses. To avoid disturbing other tasks, we attempt to swap tasks causing remote memory accesses with each other whenever possible. Otherwise, we search for idle or CPU-bound tasks to swap for the task that generates remote memory accesses.

After reducing remote memory accesses, we look to balancing memory bandwidth utilization. SAM identifies memory-intensive tasks on sockets where the bandwidth is saturated. The identified tasks are relocated to other sockets whose bandwidth is not entirely consumed. We track the increase in bandwidth utilization after every migration to avoid overloading a socket with too many memory-intensive tasks. In some situations, such relocation can cause an increase in remote memory accesses, but this is normally less damaging than saturating the memory bandwidth.
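The classification that feeds these priorities can be sketched as follows (our paraphrase of the design with hypothetical types and field names; the paper does not quantify the remote-access threshold, so it is left as a parameter, and the order of the checks mirrors the priority order described above):

    struct task_stats {          /* per-task counter values, normalized to unhalted cycles */
        double inter_coherence;  /* inter-socket coherence events per cycle */
        double intra_coherence;  /* intra-socket coherence events per cycle */
        double llc_misses;       /* LLC misses per cycle (bandwidth estimate) */
        double remote_accesses;  /* remote DRAM accesses per cycle */
    };

    enum task_class {
        TASK_INTER_SHARER,       /* colocate with its sharers (priority 1) */
        TASK_RMA,                /* restore to its original socket (priority 2) */
        TASK_MEM_BOUND,          /* spread across unsaturated sockets (priority 3) */
        TASK_INTRA_SHARER,       /* already sharing within a socket: avoid moving */
        TASK_CPU_BOUND           /* flexible: preferred candidate for swaps */
    };

    /* Thresholds measured on the authors' 2.2 GHz platform (Section 2). */
    #define COHERENCE_THRESHOLD 2.5e-4   /* coherence events per CPU cycle */
    #define MEM_THRESHOLD       0.034    /* LLC misses per cycle */

    static enum task_class classify(const struct task_stats *t, double rma_threshold)
    {
        if (t->inter_coherence > COHERENCE_THRESHOLD) return TASK_INTER_SHARER;
        if (t->remote_accesses > rma_threshold)       return TASK_RMA;
        if (t->llc_misses > MEM_THRESHOLD)            return TASK_MEM_BOUND;
        if (t->intra_coherence > COHERENCE_THRESHOLD) return TASK_INTRA_SHARER;
        return TASK_CPU_BOUND;
    }

Each SAM interval then applies the three remapping passes in priority order: colocate TASK_INTER_SHARER tasks, restore TASK_RMA tasks to their original sockets, and finally spread TASK_MEM_BOUND tasks across unsaturated sockets.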

Figure 3: SAM algorithm definitions. (The definitions include per-task thresholds for inter-socket coherence, memory utilization, and remote memory accesses; per-core inter-/intra-socket coherence, memory bandwidth, and remote access values; the per-socket sets of idle, data-sharing, memory-intensive, remote-access, and CPU-bound cores; and the original socket recorded for each reluctantly migrated task.)

Figure 4: SAM task → CPU mapping algorithm.

3.2 Implementation Notes

Performance counter statistics are collected on a per-task (process or thread) basis, with the values accumulated in the task control block. Performance counter values are read on every operating system tick and attributed to the currently executing task.

Our mapping strategy is implemented as a kernel module that is invoked by a privileged daemon process. The collected counter values for currently executing tasks are examined at regular intervals. Counter values are first normalized using unhalted cycles and then used to derive values for memory bandwidth utilization, remote memory accesses, and intra- and inter-socket coherence activity for each running task. These values are attributed to the corresponding core and used to update/maintain per-core and per-socket data structures that store the consolidated information. Per-core values are added to determine per-socket values.

Task relocation is accomplished by manipulating task affinity (using sched_setaffinity on Linux) to restrict scheduling to specific cores or sockets. This allows seamless inter-operation with Linux's default scheduler and load balancer.

SAM's decisions are taken at 100-mSec intervals. We performed a sensitivity analysis on the impact of the interval length and found that while SAM was robust to changes in interval size (varied from 10 mSecs to over 1 second), a 100-mSec interval hit the sweet spot in terms of balancing reaction times. We find that we can detect and decide on task migrations effectively at this rate. We were able to obtain good control and response for intervals up to one second; effecting placement changes at intervals longer than a second can reduce the performance benefits due to lack of responsiveness.
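A minimal user-level sketch of this mechanism (our illustration; the contiguous per-socket core numbering is an assumption, not something the paper states):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <sys/types.h>

    #define CORES_PER_SOCKET 10

    /* Restrict `pid` to the physical cores of `socket`. Linux's default
     * scheduler and load balancer continue to operate, but only within
     * this mask, which is how SAM-style relocation coexists with them. */
    static int bind_task_to_socket(pid_t pid, int socket)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        for (int c = 0; c < CORES_PER_SOCKET; c++)
            CPU_SET(socket * CORES_PER_SOCKET + c, &mask);
        return sched_setaffinity(pid, sizeof(mask), &mask);
    }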

4 Evaluation

We assess the effectiveness of our hardware counter-based sharing and contention tracking, and evaluate the performance of our adaptive task → CPU mapper. Our performance monitoring infrastructure was implemented in Linux 3.14.8. Our software environment is Fedora 19 running GCC 4.8.2. We conducted experiments on a dual-socket machine, with each socket containing an Intel Xeon E5-2660 v2 "Ivy Bridge" processor (10 physical cores with 2 hyperthreads each, 2.20 GHz, 25 MB of L3 cache). The machine has a NUMA configuration in which each socket has an 8 GB local DRAM partition.

4.1 Benchmarks

We use a variety of benchmarks from multiple domains in our evaluation. First, we use the synthetic microbenchmarks described in Sections 2.1 and 2.2 that stress the inter- and intra-socket coherence and memory bandwidth, respectively. We create two versions of the coherence activity microbenchmark—one (Hubench) generating a near-maximum rate of coherence activity (1.3 × 10^-3 coherence events per CPU cycle) and another (Lubench) generating coherence traffic close to the threshold we identified in Section 2 (2.6 × 10^-4 coherence events per cycle). We also use the high-RBHR memcpy microbenchmark (MemBench). MemBench saturates the memory bandwidth on one socket and needs to be distributed across sockets to maximize memory bandwidth utilization.

PARSEC 3.0 [2] is a parallel benchmark suite containing a range of applications including image processing, chip design, data compression, and content similarity search. We use a subset of the PARSEC benchmarks that exhibit non-trivial data sharing. In particular, Canneal uses cache-aware simulated annealing to minimize the routing cost of a chip design. Bodytrack is a computer vision application that tracks a human body through an image sequence. Both benchmarks depend on data that is shared among their worker threads, although we observe that the data sharing is not as high as in the other workloads discussed below.

The SPEC CPU2006 [1] benchmark suite has a good blend of memory-intensive and CPU-bound applications. Many of its applications have significant memory utilization with high computation load as well. The memory-intensive applications we use are libq, soplex, mcf, milc, and omnetpp. The CPU-bound applications we use are sjeng, bzip2, h264ref, hmmer, and gobmk.

GraphLab [13] and GraphChi [12] are emerging systems that support graph-based parallel applications. Unlike most PARSEC applications, the graph-based applications tend to have considerable sharing across worker threads. In addition, the number of worker threads is not always static and depends on the phase of execution and the amount of parallelism available in the application. Such phased behaviors are a good stress test to ascertain SAM's stability of control. Our evaluation uses a range of machine learning and filtering applications—TunkRank (Twitter influence ranking), Alternating Least Squares (ALS) [23], Stochastic Gradient Descent (SGD) [11], Singular Value Decomposition (SVD) [10], Restricted Boltzmann Machines (RBM) [8], Probabilistic Matrix Factorization (PMF) [16], Biased SGD (BSGD) [10], and Lossy SGD (LSGD) [10].

Figure 5: Performance of standalone applications with SAM and without (default Linux) SAM. The performance metric is the execution speed (the higher the better) normalized to that of the best static task → CPU mapping determined through offline testing.

Table 1: Actions taken by our scheduler for each standalone application run. "Coherence" indicates the number of task migrations performed to reduce inter-socket coherence; "Remote memory" indicates the number of task migrations performed to reduce remote memory accesses. "Workload characteristic" classifies each application as either CPU bound, data sharing intensive, or memory bound.

    Application              Coherence   Remote memory   Workload characteristic
    Hubench (8t)             4           0               Data sharing
    Lubench (8t)             4           0               Data sharing
    MemBench (8t)            0           0               Memory bound
    MemBench (6t)            0           0               Memory bound
    Canneal (4t)             8           3               CPU bound
    Bodytrack (4t)           2           0               CPU bound
    TunkRank (4t)            6           1               Data sharing
    TunkRank (10t)           13          1               Data sharing
    RBM (4t)                 12          0               CPU bound
    ALS (4t)                 1           0               CPU bound
    ALS Small Dataset (4t)   6           3               CPU bound
    ALS (10t)                9           1               CPU bound
    SGD (4t)                 12          2               Data sharing
    SGD (10t)                14          3               Data sharing
    BSGD (4t)                8           2               Data sharing
    BSGD (10t)               20          3               Data sharing
    SVD (4t)                 12          1               Data sharing
    SVD (10t)                24          7               Data sharing
    PMF (4t)                 20          0               CPU bound
    LSGD (4t)                6           3               Data sharing
    LSGD (10t)               18          3               Data sharing
    SVD (8t)                 22          6               Data sharing
    SVD (6t)                 18          6               Data sharing

4.2 Standalone Application Evaluation

We first evaluate the impact of SAM's mapping decisions on standalone parallel application executions. Figure 5 shows the performance obtained by SAM and the default Linux scheduler. The performance is normalized to that of the best static task → CPU mapping obtained offline for each application.

We can see that SAM considerably improves application performance. Our task placement mirrors that of the best configuration obtained offline, thereby resulting in almost identical performance. Improvement over the default Linux scheduler can reach 72% in the best case and varies in the range of 30–40% for most applications. This performance improvement does not require any application changes, application user/deployer knowledge of system topology, or recompilation.

Parallel applications that have nontrivial data sharing among their threads benefit from SAM's remapping strategy. Figure 6(A) shows the average per-thread instructions per unhalted cycle (IPC). SAM results in increased IPC for almost all the applications. Figure 6(B) demonstrates that we have almost eliminated all the per-thread inter-socket coherence traffic, replacing it with intra-socket coherence traffic (Figure 6(C)).

Figure 6: Measured hardware metrics for standalone applications. Fig. (A): per-thread instructions per unhalted cycle (IPC); Fig. (B): per-thread inter-socket coherence activity; Fig. (C): per-thread intra-socket coherence activity; Fig. (D): per-socket LLC misses per cycle. All values are normalized to unhalted CPU cycles.

Figure 6(D) uses LLC misses per socket per cycle to represent aggregate per-socket off-chip traffic. We can see that off-chip traffic for applications with high data sharing is reduced. Table 1 summarizes the mapping actions (whether due to inter-socket coherence or remote memory access) taken for every workload shown in Figure 5.

Interestingly, although Figure 6(B) shows that TunkRank (Twitter influence ranking) has fairly low average inter-socket coherence activity, it shows the most performance boost with SAM. In this application, there are phases of high coherence activity between phases of heavy computation, which SAM is able to capture and eliminate. The application also scales well with additional processors (high inherent parallelism). The better a data-sharing application scales, the higher the speedup when employing SAM. With SAM, TunkRank scales almost linearly on our multicore platform.

While most workloads achieve good speedup, in some cases SAM's performance is only on par with (or slightly better than) default Linux. Such limited benefits are due to two primary reasons. First, applications may be CPU bound without much coherence or memory traffic. ALS, RBM, and PMF are examples of such applications. Although inter-socket coherence activity for these applications is above threshold, resulting in some migrations (see Table 1) due to SAM's affinity control, the relative impact of this colocation is small. Second, some workloads are memory intensive but contain low inter-socket coherence activity. Linux uses a static heuristic of distributing tasks across sockets, which is already the ideal action for these workloads. In such cases, SAM does not have additional room to improve performance.

Table 1 does not contain a column for migrations to balance memory bandwidth because for standalone applications, SAM does not effect such migrations. First, for purely memory intensive workloads (MemBench), the default strategy to distribute tasks works ideally and therefore we do not perform any more migrations. Second, for workloads that are both memory intensive and have high data sharing, SAM prioritizes inter-socket coherence activity avoidance over balancing memory bandwidth utilization.

Graph-based machine learning applications exhibit phased execution and dynamic task creation and parallelism (mainly in the graph construction and distribution stages). The burst of task creation triggers load balancing in Linux. While Linux respects core affinity, tasks may be migrated to balance load. Linux task migrations do not sufficiently consider application characteristics and may result in increased remote DRAM accesses, inter-socket communication, resource contention, and bandwidth saturation. SAM is able to mitigate the negative impact of these load balancing decisions, as demonstrated by the non-trivial number of migrations due to coherence activity and remote memory accesses performed by SAM (Table 1). SAM constantly monitors the system and manipulates processor affinity (which the Linux load balancer respects) to colocate or distribute tasks across sockets and cores.

7 USENIX Association 2015 USENIX Annual Technical Conference 535 uration. SAM is able to mitigate the negative impact of Multiprog. these load balancing decisions, as demonstrated by the workload # Application mixes non-trivial number of migrations due to coherence activ- 1 12 MemBench, 8 HuBench ity and remote memory accesses performed by SAM (Ta- 2 14 MemBench, 6 HuBench ble 1). SAM constantly monitors the system and manip- 3 10 MemBench, 6 HuBench, 4 CPU ulates processor affinity (which the Linux load balancer 4 2 libq, 2 bzip2, 2 sjeng, 2 omnetpp respects) to colocate or distribute tasks across sockets and 5 2 libq, 2 soplex, 2 gobmk, 2 hmmer cores. 6 2 mcf, 2 milc, 2 sjeng, 2 h264ref 7 2 milc, 2 libq, 2 h264ref, 2 sjeng 4.3 Multiprogrammed Workload Evaluation 8 2 mcf, 2 libq, 2 h264ref, 2 sjeng, 4 TunkRank 9 10 SGD, 10 BSGD Table 2 summarizes the various application mixes we 10 10 SGD, 10 LSGD employed to evaluate SAM’s performance on multipro- 11 10 LSGD, 10 BSGD grammed workloads. They are designed to produce dif- 12 10 LSGD, 10 ALS ferent levels of data sharing and memory utilization. Two 13 10 SVD, 10 SGD 14 10 SVD, 10 BSGD factors come into play. First, applications incur perfor- 15 10 SVD, 10 LSGD mance penalties due to resource competition. Second, 16 10 SVD, 10 LSGD bursty and phased activities disturb the system frequently. 17 20 SGD, 20 BSGD In order to compare the performance of different mixes 18 20 SGD, 20 LSGD of applications, we first normalize each application’s run- 19 20 LSGD, 20 BSGD time with its offline optimum standalone runtime, as in 20 20 LSGD, 20 ALS Figure 5. We then take the geometric mean of the nor- 21 20 SVD, 20 SGD malized performance of all the applications in the work- 22 20 SVD, 20 BSGD load mix (counting each application once regardless of 23 20 SVD, 20 LSGD the number of tasks employed) to derive a single met- 24 16 BSGD, 10 MemBench, 14 CPU ric that represents the speedup of the mix of applications. 25 16 LSGD, 10 MemBench, 14 CPU We use this metric to compare the performance of SAM 26 16 SGD, 10 MemBench, 14 CPU against the default Linux scheduler. Table 2: Multiprogrammed application mixes. For each The performance achieved is shown in Figure 7. Nat- mix, the number preceding the application’s name indi- urally, due to competition for resources, most applica- cates the number of tasks it spawns. We generate various tions will run slower than their offline best standalone combinations of applications to evaluate scenarios with execution. Even if applications utilize completely differ- varying data sharing and memory utilization. ence resources, they may still suffer performance degra- dation. For example, when inter-socket coherence miti- performance to the case that all microbenchmarks reach gation conflicts with memory load balancing, no sched- respective optimum offline standalone performance si- uler can realize the best standalone performance for each multaneously. SAM is able to preferentially colocate co-executing application. tasks that share data over tasks that are memory inten- SAM outperforms the default Linux scheduler by 2% sive. Since the memory benchmark saturates memory to 36% as a result of two main strategies. First, we try bandwidth at fairly low task counts, colocation does not to balance resource utilization whenever possible with- impact performance. out affecting coherence traffic. 
This benefits application The speedups obtained for the SPEC CPU benchmark mixes that have both applications that share data and use mixes relative to Linux are not high. The SPEC CPU mix memory. Second, we have information on inter-socket of applications tend to be either memory or CPU bound. coherence activity and can therefore use it to colocate Since Linux prefers to split the load among sockets, it can tasks that share data. However, we do not have exact in- perform well for memory intensive workloads. We per- formation to indicate which tasks share data with which form significantly better than the Linux scheduler when other tasks. We receive the validation of a successful mi- the workload mix comprises applications that share data gration if it produces less inter-socket and more intra- and use memory bandwidth. socket activity after the migration. Higher intra-socket A caveat is that Linux’s static task CPU mapping pol- → activity helps us identify partial or full task groupings in- icy is very dependent on the order of task creation. For side a socket. example, different runs of the same SPECCPU mix of For the mixes of microbenchmarks (#1–#3), SAM ex- applications discussed above can result in a very differ- ecutes the job about 25% faster than Linux. SAM’s rel- ent sequence of task creation, forcing CPU-bound tasks ative performance approaches 1—indicating comparable to be colocated on a socket and memory-intensive tasks
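For reference, the aggregate mix metric described earlier in this section reduces to a small computation (our sketch of the stated formula; array names are hypothetical):

    #include <math.h>

    /* Speedup of a workload mix: each application's runtime is normalized
     * to its offline optimum standalone runtime, and the geometric mean is
     * taken over the n distinct applications in the mix. Values below 1.0
     * indicate slowdown relative to running alone under the best mapping. */
    static double mix_speedup(const double runtime[], const double standalone[], int n)
    {
        double log_sum = 0.0;
        for (int i = 0; i < n; i++)
            log_sum += log(standalone[i] / runtime[i]); /* per-application speedup */
        return exp(log_sum / n);                        /* geometric mean */
    }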

Figure 7: Speedup (execution time normalized to the offline optimum standalone execution: higher is better) of multiprogrammed workloads with (SAM) and without (default Linux) SAM. We show the geometric mean of all application speedups within each workload as well as the max-min range (whiskers) for the individual applications.

Figure 7 also plots the minimum and maximum speedups of applications in each workload mix using whiskers. The minimum and maximum speedups are important to understanding the fairness of our mapping strategy. The application that has the minimum speedup is slowed down the most in the workload mix; the application that has the maximum speedup is the least affected by the contention in the system. We can see that SAM improves upon the fairness of the Linux scheduler in a significant fashion. The geometric mean of the minimum speedup over all the workload mixes for SAM is 0.83; the same for the default Linux scheduler is 0.69. Similarly, the geometric means of the maximum speedup over all workload mixes for SAM and Linux are 0.96 and 0.87, respectively. From this analysis, we can conclude that in addition to improving overall performance, SAM reduces the performance disparity (a measure of fairness) among multiple applications when run together.

Figure 8(A) plots the per-thread instructions per cycle for the mixed workloads. We can see that, largely, the effect of our actions is to increase the IPC. As with the standalone applications, Figure 8(B) shows that SAM significantly reduces inter-socket coherence activity, replacing it with intra-socket coherence activity (Figure 8(C)). Note that for workloads 14 and 15, SAM shows reductions in both intra- and inter-socket coherence. This is likely due to working sets that exceed the capacity of the private caches, resulting in hits in the LLC when migrations are effected to reduce inter-socket coherence.

Figure 8(D) uses LLC misses per cycle to represent aggregate per-socket off-chip traffic. Our decision to prioritize coherence activity can lead to reduced off-chip traffic, but this may be counteracted by an increase in the number of remote memory accesses. At the same time, our policy may also reduce main memory accesses by sharing the last level cache among tasks that actually share data (thereby reducing pressure on LLC cache capacity). We also improve memory bandwidth utilization when possible without disturbing the colocated tasks that share data. Workload mixes #24–#26 have a combination of data-sharing, memory-intensive, and CPU-bound tasks. In these cases, SAM improves memory bandwidth utilization by moving the CPU-bound tasks to accommodate distribution of the memory-intensive tasks.

In our experiments, both the hardware prefetcher and hyperthreading are turned on by default. Hyperthreads add an additional layer of complexity to the mapping process due to resource contention for logical computational units as well as the private caches. Since SAM utilizes one hardware context on each physical core before utilizing the second, when the number of tasks is less than or equal to the number of physical cores, SAM's policy decisions are not affected by hyperthreading.

In order to determine the interaction of the prefetcher with the SAM mapper, we compare the relative performance of SAM when turning off prefetching. We observe that on average, prefetching is detrimental to the performance of multiprogrammed workloads with or without the use of SAM. The negative impact of prefetching when using SAM is slightly lower than with default Linux.

4.4 Overhead Assessment

SAM's overhead has three contributors: accessing performance counters, making mapping decisions based on the counter values, and migrating tasks to reflect the decisions. In our prototype, reading the performance counters, a cost incurred on every hardware context, takes 8.89 µSecs. Counters are read at a 1-mSec interval. Mapping decisions are centralized and are taken at 100-mSec intervals. Each call to the mapper, including the subsequent task migrations, takes about 9.97 µSecs. The overall overhead of our implementation is below 1%. We also check the overall system overhead by running all our applications with our scheduler but without performing any actions on the decisions taken. There was no discernible difference between the two runtimes, meaning that the overhead is within measurement error.

Currently, SAM's policy decisions are centralized in a single daemon process, which works well for our 40-CPU machine, since most of the overhead is attributed to periodically reading the performance counters. Data consolidation in the daemon process has a linear complexity in the number of processors and dominates in cost over the mapping decisions. For all practical purposes, SAM scales linearly with the number of processors. In the future, if the total number of cores is very large, mapping policies might need to be distributed for scalability.
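As a sanity check on these figures (our arithmetic, not the paper's): reading the counters costs 8.89 µSecs out of every 1-mSec interval on each hardware context, roughly 0.9%, and each 100-mSec mapper invocation adds 9.97 µSecs, about 0.01% more, which is consistent with the stated sub-1% overall overhead.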

Figure 8: Measured hardware metrics for multiprogrammed application workloads. Fig. (A): per-thread instructions per unhalted cycle (IPC); Fig. (B): per-thread inter-socket coherence activity; Fig. (C): per-thread intra-socket coherence activity; Fig. (D): per-socket LLC misses per cycle. All values are normalized to unhalted CPU cycles.

4.5 Sensitivity Analysis

We study the sensitivity of SAM's performance gains to the thresholds identified in Section 2 for determining high coherence activity and memory bandwidth consumption.

Too low a coherence activity threshold can classify applications with little benefit from colocation as ones with high data sharing; it can also result in misclassifying tasks that have just been migrated as ones with high data sharing, due to the coherence activity generated as a result of migration. For some workloads, this may reduce the ability to beneficially map tasks with truly high data sharing. Too high a coherence activity threshold can result in not identifying a task as high data sharing when it could benefit from colocation.

We found that SAM's performance for our workloads was relatively resilient to a wide range of values for the threshold. The coherence activity threshold could be varied from 10% of the value determined in Section 2 to 2 times this value. Using threshold values below the low end results in a loss of performance of up to 9% for one mixed workload. Using threshold values >2 times the determined threshold for the mixed workloads, and >3 times for the standalone applications, results in a loss of up to 18% and 30% performance, respectively, for a couple of workloads.

SAM's resilience with respect to the memory bandwidth threshold is also quite good, although the range of acceptable values is tighter than for coherence activity. For the memory bandwidth threshold, too low a value can lead to prematurely assuming that a socket's memory bandwidth is saturated, resulting in lost opportunities for migration. Too high a value can result in performance degradation due to bandwidth saturation. In the case of the SPEC CPU benchmarks, lowering our determined threshold by 30% resulted in both sockets being considered bandwidth saturated, and a performance loss of up to 27% due to lost migration opportunities. Similarly, for MemBench, raising the memory threshold by 25% leads to non-recognition of bandwidth saturation, which produces up to an 85% loss in performance.

5 Related Work

The performance impact of contention and interference on shared multicore resources (particularly the LLC cache, off-chip bandwidth, and memory) has been well recognized in previous work. Suh et al. [18] used hardware counter-assisted marginal gain analysis to minimize the overall cache misses. Software page coloring [5, 22] can effectively partition the cache space without special hardware features. Mutlu et al. [15] proposed parallelism-aware batch scheduling in DRAM to reduce inter-task interference at the memory level. Mars et al. [14] utilized active resource pressures to predict the performance interference between colocated applications. These techniques, however, manage multicore resource contention without addressing the data sharing issues in parallel applications.

Blagodurov et al. [3] examined data placement in non-uniform memory partitions but did not address data sharing-induced coherence traffic either within or across sockets. Tam et al. [19] utilized address sampling (available on Power processors) to identify task groups with strong data sharing. Address sampling is a relatively expensive mechanism to identify data sharing (compared to our performance counter-based approach) and is not available in many processors in production today.

Calandrino and Anderson [4] proposed cache-aware scheduling for real-time schedulers. The premise of their work is that working sets that do not fit in the shared cache will cause thrashing. The working set size of an application was approximated by the number of misses incurred at the shared cache. Their scheduling is aided by job start times and execution time estimates, which non-real-time systems often do not have. Knauerhase et al. [9] analyze the run queues of all processors of a system and schedule tasks so as to minimize cache interference. Their scheduler is both fair and less cache contentious. They do not, however, consider parallel workloads and the effect of data sharing on the last level cache.

Tang et al. [20] demonstrate the impact of task placement on latency-critical datacenter applications. They show that whether applications share data and/or have high memory demand can dramatically affect performance. They suggest the use of information on the number of accesses to "shared" cache lines to identify intra-application data sharing. While this metric is a useful indicator of interference between tasks, it measures only one aspect of data sharing—the cache footprint—but misses another important aspect—off-chip traffic in the case of active read-write sharing. Additionally, their approach requires input statistics from an offline stand-alone execution of each application.

Previous work has also pursued fair use of shared multicore resources among simultaneously executing tasks. Ebrahimi et al. [6] proposed a new hardware design to track contention at different cache/memory levels and throttle tasks with unfair resource usage or disproportionate progress. At the software level, fair resource use can be accomplished through scheduling quantum adjustment [7] or duty cycle modulation-enabled speed balancing [21]. These techniques are complementary and orthogonal to our placement strategies, and can be used in conjunction with our proposed approach for improved fairness and quality of service.

6 Conclusions

In this paper, we have designed and implemented a performance monitoring and sharing-aware adaptive mapping system. Our system works in conjunction with Linux's default scheduler to simultaneously reduce costly communications and improve resource utilization efficiency. The performance monitor uses commonly available hardware counter information to identify and separate data sharing from DRAM memory access. The adaptive mapper uses a cost-sensitive approach based on the performance monitor's behavior identification to relocate tasks in an effort to improve both parallel and mixed workload application efficiency on a general-purpose machine. We show that performance counter information allows us to develop effective and low-cost mapping techniques without the need for heavy-weight access traces. For stand-alone parallel applications, we observe performance improvements as high as 72% without requiring user awareness of machine configuration or load, or compiler profiling. For multiprogrammed workloads consisting of a mix of parallel and sequential applications, we achieve up to 36% performance improvement while reducing performance disparity across applications in each mix. Our approach is effective even in the presence of workload variability.

Acknowledgments

This work was supported in part by the U.S. National Science Foundation grants CNS-1217372, CCF-1217920, CNS-1239423, CCF-1255729, CNS-1319353, CNS-1319417, and CCF-137224, and by the Semiconductor Research Corporation Contract No. 2013-HJ-2405.

References

[1] SPEC CPU2006 benchmark. www.spec.org.

[2] C. Bienia. Benchmarking Modern Multiprocessors. PhD thesis, Princeton University, January 2011.

[3] S. Blagodurov, S. Zhuravlev, M. Dashti, and A. Fedorova. A case for NUMA-aware contention management on multicore systems. In USENIX Annual Technical Conf., Portland, OR, June 2011.

[4] J. Calandrino and J. H. Anderson. On the design and implementation of a cache-aware multicore real-time scheduler. In 21st EUROMICRO Conf. on Real-Time Systems (ECRTS), Dublin, Ireland, July 2009.

[5] S. Cho and L. Jin. Managing distributed, shared L2 caches through OS-level page allocation. In 39th Int'l Symp. on Microarchitecture (MICRO), pages 455–468, Orlando, FL, Dec. 2006.

[6] E. Ebrahimi, C. J. Lee, O. Mutlu, and Y. Patt. Fairness via source throttling: A configurable and high-performance fairness substrate for multi-core memory systems. In 15th Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 335–346, Pittsburgh, PA, Mar. 2010.

[7] A. Fedorova, M. Seltzer, and M. Smith. Improving performance isolation on chip multiprocessors via an operating system scheduler. In 16th Int'l Conf. on Parallel Architecture and Compilation Techniques (PACT), pages 25–38, Brasov, Romania, Sept. 2007.

[8] G. E. Hinton. A practical guide to training restricted Boltzmann machines. In Neural Networks: Tricks of the Trade, Second Edition, pages 599–619. 2012.

[9] R. Knauerhase, P. Brett, B. Hohlt, T. Li, and S. Hahn. Using OS observations to improve performance in multicore systems. IEEE Micro, 28(3):54–66, May 2008.

[10] Y. Koren. Factorization meets the neighborhood: A multifaceted collaborative filtering model. In 14th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining (SIGKDD), pages 426–434, Las Vegas, NV, 2008.

[11] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, 42(8):30–37, Aug. 2009.

[12] A. Kyrola, G. Blelloch, and C. Guestrin. GraphChi: Large-scale graph computation on just a PC. In 10th USENIX Symp. on Operating Systems Design and Implementation (OSDI), pages 31–46, Hollywood, CA, 2012.

[13] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: A new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence (UAI), Catalina Island, CA, July 2010.

[14] J. Mars, L. Tang, R. Hundt, K. Skadron, and M. L. Soffa. Bubble-up: Increasing utilization in modern warehouse scale computers via sensible co-locations. In 44th Int'l Symp. on Microarchitecture (MICRO), Porto Alegre, Brazil, Dec. 2011.

[15] O. Mutlu and T. Moscibroda. Parallelism-aware batch scheduling: Enhancing both performance and fairness of shared DRAM systems. In 35th Int'l Symp. on Computer Architecture (ISCA), pages 63–74, Beijing, China, June 2008.

[16] R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov Chain Monte Carlo. In 25th Int'l Conf. on Machine Learning (ICML), pages 880–887, Helsinki, Finland, 2008.

[17] K. Shen. Request behavior variations. In 15th Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 103–116, Pittsburgh, PA, Mar. 2010.

[18] G. E. Suh, L. Rudolph, and S. Devadas. Dynamic partitioning of shared cache memory. The Journal of Supercomputing, 28(1):7–26, Apr. 2004.

[19] D. Tam, R. Azimi, and M. Stumm. Thread clustering: Sharing-aware scheduling on SMP-CMP-SMT multiprocessors. In Second EuroSys Conf., pages 47–58, Lisbon, Portugal, Mar. 2007.

[20] L. Tang, J. Mars, N. Vachharajani, R. Hundt, and M. L. Soffa. The impact of memory subsystem resource sharing on datacenter applications. In 38th Int'l Symp. on Computer Architecture (ISCA), pages 283–294, San Jose, CA, June 2011.

[21] X. Zhang, S. Dwarkadas, and K. Shen. Hardware execution throttling for multi-core resource management. In USENIX Annual Technical Conf., San Diego, CA, June 2009.

[22] X. Zhang, S. Dwarkadas, and K. Shen. Towards practical page coloring-based multi-core cache management. In 4th EuroSys Conf., pages 89–102, Nuremberg, Germany, Apr. 2009.

[23] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the Netflix prize. In 4th Int'l Conf. on Algorithmic Aspects in Information and Management, LNCS 5034, pages 337–348. Springer, 2008.
