Data Sharing or Resource Contention: Toward Performance Transparency on Multicore Systems

Sharanyan Srikanthan, Sandhya Dwarkadas, Kai Shen

Department of Computer Science, University of Rochester

The Performance Transparency Challenge

Modern multicore systems…
• IBM Power8’s 12-core…, AMD’s 16-core…, Qualcomm Snapdragon…, Intel’s 10-core (20-…)
They share two characteristics:
• Resource sharing – caches, on- and off-chip interconnects, memory
• Non-uniform access latencies

Modern Multicore Systems: Resource Sharing

Problem: Resource contention

[Figure: a two-socket system. Each socket contains multiple CPUs with private L1 and L2 caches, a shared last-level cache, and a memory controller; the two sockets are connected by an inter-processor interconnect.]

Modern Multicore Systems: Non-Uniform Access Latencies

Problem: Communication costs are a function of thread/data placement

[Figure: the same two-socket system annotated with approximate access latencies – L1 hit ~4 cycles, L2 hit ~10 cycles, shared LLC hit ~40 cycles, LLC hit with the line held in another core on the same socket ~65–75 cycles, line in the other socket’s cache ~100–300 cycles, local DRAM ~130 cycles, remote DRAM ~220 cycles.]

Ref: https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf

Impact of Thread Placement on Data Sharing Costs

[Figure: the two-socket system (CPUs, private L1/L2 caches, shared LLCs, DRAM, inter-processor interconnect) used to illustrate how placing data-sharing threads on the same socket or on different sockets changes their communication cost.]


Solutions?

• Tam et al. [EuroSys 2007]: address sampling to identify data sharing; costly and specific to some processors
• Tang et al. [ISCA 2011]: performance counters to identify resource contention; does not consider non-uniform communication
• Lepers et al. [USENIX ATC 2015]: placement driven by non-uniform memory bandwidth utilization; does not consider data sharing

Our approach, the Sharing-Aware Mapper (SAM), separates data sharing from resource contention using performance counters and accounts for non-uniformity when mapping parallel and multiprogrammed workloads.

Sharing-Aware Mapper (SAM)

• Monitoring
• Identifying bottlenecks
• Taking remedial action

Sharing-Aware Mapper (SAM)

• Monitoring: periodically reading hardware performance counters to characterize application behavior
• Identifying bottlenecks
• Taking remedial action

Monitoring Using Performance Counters

• 4 metrics identified; 8 counter events need to be monitored:
  – Inter-socket coherence activity: last-level cache misses served by a remote cache
  – Intra-socket coherence activity: last-private-level cache misses minus the sum of LLC hits and misses
  – Local memory accesses: approximated by LLC misses
  – Remote memory accesses: LLC misses served by remote DRAM
• Only 4 hardware programmable counters are available, so the events must be multiplexed
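This style of monitoring maps naturally onto the Linux perf_event_open(2) interface. The following is a minimal sketch, not SAM's implementation: the single sampled CPU, the 100 ms period, and the generic LLC reference/miss events are stand-in assumptions, since the metrics above (remote-cache-served and remote-DRAM-served misses in particular) need model-specific raw or offcore event encodings and multiplexing across sampling intervals.

/* Minimal sketch (not SAM's code): periodically sample per-CPU hardware
 * counters with perf_event_open(2).  The generic events below are
 * placeholders for SAM's model-specific remote-cache / remote-DRAM events. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                           int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

static int open_counter(int cpu, uint32_t type, uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config = config;
    /* CPU-wide counting (pid = -1) typically needs root or a relaxed
     * kernel.perf_event_paranoid setting. */
    return perf_event_open(&attr, -1, cpu, -1, 0);
}

int main(void)
{
    int cpu = 0;   /* this sketch samples CPU 0 only */
    /* Placeholder events: LLC read references and generic cache misses. */
    int fd_ref  = open_counter(cpu, PERF_TYPE_HW_CACHE,
                               PERF_COUNT_HW_CACHE_LL |
                               (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                               (PERF_COUNT_HW_CACHE_RESULT_ACCESS << 16));
    int fd_miss = open_counter(cpu, PERF_TYPE_HARDWARE,
                               PERF_COUNT_HW_CACHE_MISSES);
    if (fd_ref < 0 || fd_miss < 0) { perror("perf_event_open"); return 1; }

    for (int i = 0; i < 10; i++) {
        uint64_t ref = 0, miss = 0;
        usleep(100 * 1000);                 /* 100 ms sampling period */
        read(fd_ref, &ref, sizeof(ref));
        read(fd_miss, &miss, sizeof(miss));
        /* With more events than the 4 programmable counters, the sampler
         * would rotate (multiplex) event sets across periods here. */
        printf("cpu%d: LLC refs=%llu misses=%llu\n", cpu,
               (unsigned long long)ref, (unsigned long long)miss);
        ioctl(fd_ref, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd_miss, PERF_EVENT_IOC_RESET, 0);
    }
    return 0;
}

Running one such sampler per core and rotating the event set every period would approximate the multiplexing the slide refers to.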

Sharing-Aware Mapper (SAM)

• Monitoring: periodically reading hardware performance counters to characterize application behavior
• Identifying bottlenecks: using pre-characterized thresholds
• Taking remedial action

Identifying Bottlenecks: Coherence Activity Threshold

• Compare the performance of two tasks running on the same socket against running on different sockets
• A microbenchmark forces data to move from one task to the other, generating coherence activity
• The rate of coherence activity is varied by varying the ratio of private to shared variable accesses

[Figure: speedup of same-socket over cross-socket placement versus cross-socket coherence activity per second, with the threshold for flagging a coherence bottleneck marked.]
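A minimal sketch of a microbenchmark in this spirit (assumptions throughout, not the original thresholding code): two pinned threads mix updates to a shared counter with updates to private arrays; the private-to-shared ratio and the core IDs given on the command line control the coherence rate and whether the traffic stays within a socket.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ITERS              (50 * 1000 * 1000L)
#define PRIVATE_PER_SHARED 8    /* higher ratio => less coherence activity */

static volatile long shared_counter;   /* cache line ping-pongs between cores */

struct arg { int cpu; long priv[1024]; };

static void *worker(void *p)
{
    struct arg *a = p;

    /* Pin this thread to its assigned core. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(a->cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    for (long i = 0; i < ITERS; i++) {
        if (i % (PRIVATE_PER_SHARED + 1) == 0)
            __sync_fetch_and_add(&shared_counter, 1);  /* shared access */
        else
            a->priv[i % 1024]++;                       /* private access */
    }
    return NULL;
}

int main(int argc, char **argv)
{
    /* Core IDs are machine-specific assumptions, e.g. 0 and 1 for the same
     * socket, 0 and 10 for different sockets of a 2 x 10-core machine. */
    struct arg a0 = { .cpu = argc > 1 ? atoi(argv[1]) : 0 };
    struct arg a1 = { .cpu = argc > 2 ? atoi(argv[2]) : 1 };

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    pthread_t th0, th1;
    pthread_create(&th0, NULL, worker, &a0);
    pthread_create(&th1, NULL, worker, &a1);
    pthread_join(th0, NULL);
    pthread_join(th1, NULL);

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("runtime %.2f s, shared increments %ld\n", secs, shared_counter);
    return 0;
}

Comparing runtimes for same-socket versus cross-socket core pairs while sweeping PRIVATE_PER_SHARED reproduces the kind of speedup-versus-coherence-rate curve used to pick the threshold.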

Identifying Bottlenecks: Memory Bandwidth Threshold

• A variable number of memory-intensive threads is run on the same processor
• ~10% difference in available bandwidth at saturation
• A conservative estimate suffices to identify memory bandwidth bottlenecks

[Figure: memory accesses per second (up to ~120,000,000) versus number of threads (1–10), for "high LBHR" and "low RBHR" workloads, with the threshold for flagging a memory bandwidth bottleneck marked near saturation.]
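A bandwidth-saturation probe in this spirit can be sketched as follows (an assumption-laden stand-in, not the original microbenchmark): each thread streams through a private array much larger than the LLC, and sweeping the thread count while confining the run to one socket (e.g. with taskset or numactl) shows where the aggregate access rate flattens out.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ARRAY_BYTES (256UL * 1024 * 1024)   /* much larger than a 25 MB LLC */
#define PASSES      4

static void *stream(void *arg)
{
    (void)arg;
    /* Private array per thread, touched one cache line at a time, so that
     * essentially every access misses in the LLC and goes to DRAM. */
    volatile char *buf = malloc(ARRAY_BYTES);
    if (!buf) return NULL;
    for (int p = 0; p < PASSES; p++)
        for (size_t i = 0; i < ARRAY_BYTES; i += 64)
            buf[i]++;
    free((void *)buf);
    return NULL;
}

int main(int argc, char **argv)
{
    int nthreads = argc > 1 ? atoi(argv[1]) : 1;
    if (nthreads < 1 || nthreads > 64) return 1;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    pthread_t tid[64];
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, stream, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double touches = (double)nthreads * PASSES * (ARRAY_BYTES / 64);
    printf("%d threads: %.3e cache-line touches per second\n",
           nthreads, touches / secs);
    return 0;
}

The curve flattens where the socket's bandwidth saturates; the threshold is then set conservatively below that point, which the slide notes is sufficient despite the ~10% spread in achievable bandwidth.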

Sharing-Aware Mapper (SAM)

• Monitoring: periodically reading hardware performance counters to characterize application behavior
• Identifying bottlenecks: using pre-characterized thresholds
• Taking remedial action: periodically remapping tasks to manipulate task-to-processor affinity

SAM Decision Making

• Prioritize reduction of:
  – Inter-socket coherence > remote memory access > socket bandwidth contention
• Co-locate threads with inter-socket coherence activity if:
  – Idle cores are available > CPU-bound tasks can be moved > memory-bound tasks can be moved
  – But not at the expense of separating threads with intra-socket coherence activity
• Return threads with remote memory accesses to their original socket if possible
• Balance memory accesses across sockets:
  – Transfer excess memory-intensive threads to other sockets
• Idle processors > CPU-bound tasks
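The priority order above can be made concrete with a small sketch (illustrative only; the task_stats fields, thresholds, and action names are assumptions standing in for SAM's counter-derived metrics and pre-characterized thresholds):

/* Illustrative sketch of the priority order (not SAM's actual code). */
#include <stdio.h>

struct task_stats {
    double inter_socket_coherence;   /* events/sec, from remote-cache-served misses */
    double intra_socket_coherence;
    double remote_mem_accesses;      /* LLC misses served by remote DRAM */
    double local_mem_accesses;
};

enum action { COLOCATE_SHARERS, RESTORE_MEMORY_LOCALITY,
              BALANCE_BANDWIDTH, NO_ACTION };

/* Assumed thresholds from the characterization step. */
static const double COHERENCE_THRESHOLD  = 1e6;
static const double REMOTE_MEM_THRESHOLD = 1e6;
static const double SOCKET_BW_THRESHOLD  = 1e8;

enum action choose_action(const struct task_stats *t, double socket_mem_rate)
{
    /* Priority 1: cross-socket data sharing. */
    if (t->inter_socket_coherence > COHERENCE_THRESHOLD)
        return COLOCATE_SHARERS;          /* prefer idle cores, then CPU-bound,
                                             then memory-bound victims */
    /* Priority 2: remote memory accesses. */
    if (t->remote_mem_accesses > REMOTE_MEM_THRESHOLD)
        return RESTORE_MEMORY_LOCALITY;   /* move back to the original socket */
    /* Priority 3: per-socket bandwidth saturation. */
    if (socket_mem_rate > SOCKET_BW_THRESHOLD)
        return BALANCE_BANDWIDTH;         /* spread memory-intensive threads */
    return NO_ACTION;
}

int main(void)
{
    struct task_stats t = { .inter_socket_coherence = 5e6 };
    printf("action = %d\n", choose_action(&t, 0.0));
    return 0;
}

In SAM itself the chosen action then translates into concrete task-to-core moves, subject to the constraints listed above (prefer idle cores, avoid splitting intra-socket sharers, and so on).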

Example: Data Sharing and Resource Contention

• There are 16 ways to map tasks to cores
• Linux’s default load balancing can result in:
  – Inter-socket communication
  – Imbalanced memory bandwidth utilization
  – Remote memory accesses

[Figure: Socket 1 runs T1 and T2, Socket 2 runs T3 and T4, with the remaining cores idle. Tasks T1 and T3 share data; tasks T2 and T4 use memory.]

Example: Data Sharing and Resource Contention

Step 1: Reduce inter-socket coherence
• Co-locate T1 and T3 to reduce cross-socket coherence

[Figure: Socket 1 now runs T1, T2, and T3; Socket 2 runs only T4. Tasks T1 and T3 share data; tasks T2 and T4 use memory.]

Example: Data Sharing and Resource Contention

Step 2: Reduce remote memory accesses
• Swap T2 and T4 to eliminate remote memory accesses

[Figure: Socket 1 now runs T1, T4, and T3; Socket 2 runs T2. Tasks T1 and T3 share data; tasks T2 and T4 use memory.]

Example: Data Sharing and Resource Contention

Step 3: Balance memory bandwidth utilization
• Memory bandwidth is already balanced, so no additional action is performed

[Figure: placement unchanged – Socket 1 runs T1, T4, and T3; Socket 2 runs T2. Tasks T1 and T3 share data; tasks T2 and T4 use memory.]

Example: Data Sharing and Resource Contention

Optimal Solution
• SAM identifies task characteristics with performance counters
• It minimizes inter-processor communication and remote memory accesses while maximizing memory bandwidth utilization

[Figure: the initial placement (Socket 1: T1, T2; Socket 2: T3, T4) and, after SAM remaps, the resulting placement (Socket 1: T1, T2, T3; Socket 2: T4).]
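The “16 ways” in this example can be read as the 2^4 socket assignments of the four tasks, treating cores within a socket as interchangeable. The toy search below scores each assignment with arbitrary assumed weights; it only illustrates why the co-located, locality-preserving, balanced placement wins, and does not represent SAM, which is threshold-driven and incremental rather than exhaustive.

/* Toy exhaustive search over the 2^4 = 16 socket assignments of the four
 * example tasks.  Sharing/memory relationships mirror the example; the
 * weights (100, 10, 1) are arbitrary assumptions, not SAM's cost model. */
#include <stdio.h>
#include <stdlib.h>

enum { T1, T2, T3, T4, NTASKS };

int main(void)
{
    /* From the example: T1 and T3 share data; T2 and T4 are memory-bound
     * with their data on their original sockets (T2 on 0, T4 on 1). */
    const int shares_with[NTASKS] = { T3, -1, T1, -1 };
    const int home_socket[NTASKS] = { -1, 0, -1, 1 };

    int best = -1, best_cost = 1 << 30;
    for (int mask = 0; mask < (1 << NTASKS); mask++) {
        int socket_of[NTASKS], mem_load[2] = { 0, 0 }, cost = 0;
        for (int t = 0; t < NTASKS; t++)
            socket_of[t] = (mask >> t) & 1;

        for (int t = 0; t < NTASKS; t++) {
            if (shares_with[t] > t &&                 /* count each pair once */
                socket_of[t] != socket_of[shares_with[t]])
                cost += 100;                          /* cross-socket sharing */
            if (home_socket[t] >= 0) {
                if (socket_of[t] != home_socket[t])
                    cost += 10;                       /* remote memory access */
                mem_load[socket_of[t]]++;
            }
        }
        cost += abs(mem_load[0] - mem_load[1]);       /* bandwidth imbalance */

        if (cost < best_cost) { best_cost = cost; best = mask; }
    }

    printf("best assignment (bit t = socket of task t+1): 0x%x, cost %d\n",
           best, best_cost);
    return 0;
}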

Experimental Environment

• Fedora 19, Linux 3.14.8
• Dual-socket machine
  – Intel Xeon E5-2660 v2 “Ivy Bridge” processors (10 physical cores, 2 contexts each, 2.20 GHz, 25 MB of L3 cache)
• Benchmarks
  – Microbenchmarks (used for thresholding)
  – SPEC CPU2006 (CPU- and memory-bound workloads)
  – PARSEC 3.0 (parallel workloads – light on data sharing)
  – ALS, Stochastic Gradient Descent, Singular Value Decomposition, and other parallel workloads (stronger data sharing, dynamic behavior, and emerging workloads)

Implementation Context

Implemented as a daemon that runs periodically and uses CPU affinity masks to control placement.

[Figure: tasks to be scheduled (A, B, C, D, …, X, …) are handed to the Linux scheduler, which selects which applications run together; SAM then maps the chosen threads to cores.]
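The remapping mechanism itself is just the standard CPU affinity interface. A minimal sketch (not SAM's daemon) that pins an existing task to a chosen set of cores:

/* Sketch of the remapping mechanism: restrict a task, identified by PID,
 * to a set of cores via a CPU affinity mask.  The PID and core list come
 * from the command line here; the real daemon would derive them from its
 * counter-based decisions, and a multi-threaded application needs one call
 * per thread ID. */
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <pid> <core> [<core> ...]\n", argv[0]);
        return 1;
    }
    pid_t pid = (pid_t)atoi(argv[1]);

    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int i = 2; i < argc; i++)
        CPU_SET(atoi(argv[i]), &mask);       /* allow each listed core */

    /* Restrict the task's future scheduling to the mask. */
    if (sched_setaffinity(pid, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned pid %d to %d core(s)\n", (int)pid, CPU_COUNT(&mask));
    return 0;
}

The daemon would recompute and reapply such masks at each decision interval.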

Standalone Applications

Baseline (normalization factor): best static mapping determined by exhaustive search

Speedup relative to Linux: mean = 1.21, min = 0.99, max = 1.72

Multiple Applications

Speedup relative to Linux: mean = 1.16, min = 1.02, max = 1.36

SAM improves fairness: the average speedup spread per workload is 0.12 for SAM versus 0.18 for Linux.

Detailed Analysis

Overheads and Scaling

• SAM’s overhead is <1% (40 cores)
  – Performance counter reading: invoked every tick; 8.9 µs per tick = 8.9 ms per second (at a 1000 Hz tick); constant-time overhead
  – Data consolidation and bottleneck identification: invoked every 100 ms; 9.9 µs per call = 99 µs per second; scales linearly with the number of processing cores
  – Decision making: invoked every 100 ms; negligible time relative to data consolidation; O(n²) complexity, but for n ≤ 40 the time spent is within measurement error
• SAM’s data consolidation and decision making are centralized

Conclusions

• Performance counters provide sufficient information to identify and separate data sharing from resource contention
• SAM improves performance and reduces performance disparities across applications by:
  – Minimizing cross-socket coherence
  – Minimizing remote memory accesses
  – Maximizing memory bandwidth utilization
• Future work on sharing-aware mapping:
  – Benefits of hyper-threading
  – Distributed decision making for better scalability

Thank you. Questions?
