Speedup Stacks: Identifying Scaling Bottlenecks in Multi-Threaded Applications

Speedup Stacks: Identifying Scaling Bottlenecks in Multi-Threaded Applications Stijn Eyerman Kristof Du Bois Lieven Eeckhout ELIS Department, Ghent University, Belgium Abstract 16 14 blackscholes Multi-threaded workloads typically show sublinear 12 facesim cholesky speedup on multi-core hardware, i.e., the achieved speedup 10 is not proportional to the number of cores and threads. Sub- linear scaling may have multiple causes, such as poorly 8 scalable synchronization leading to spinning and/or yield- speedup 6 ing, and interference in shared resources such as the last- 4 level cache (LLC) as well as the main memory subsystem. 2 It is vital for programmers and processor designers to un- 0 derstand scaling bottlenecks in existing and emerging work- 1 thread 2 threads 4 threads 8 threads 16 threads loads in order to optimize application performance and design future hardware. Figure 1. Speedup as a function of the num- In this paper, we propose the speedup stack, which quan- ber of cores for blackscholes, facesim (both tifies the impact of the various scaling delimiters on multi- PARSEC) and cholesky (SPLASH-2). threaded application speedup in a single stack. We describe a mechanism for computing speedup stacks on a multi-core processor, and we find speedup stacks to be accurate within 5.1% on average for sixteen-threaded applications. We allel programming has become inevitable for mainstream present several use cases: we discuss how speedup stacks computing. One of the key needs to efficient programming can be used to identify scaling bottlenecks, classify bench- is to have the appropriate tools to analyze parallel perfor- marks, optimize performance, and understand LLC performance. In particular, a software developer needs analysis mance. tools to identify the performance scaling bottlenecks, not only on current hardware but also on future hardware with many more cores than are available today; likewise, com- 1 Introduction puter architects need analysis tools to understand the be- havioral characteristics of existing and future workloads to Power efficiency and diminishing returns in improving design and optimize future hardware. single-core performance have driven the computer indus- Speedup curves which report speedup as a function of try towards multi-core processors. Current general-purpose the number cores, as exemplified in Figure 1, are often used processors employ a limited number of cores in the typical to understand scaling behavior of an application. Although range of 4 to 8 cores, see for example Intel Nehalem, Intel a speedup curve gives a high-level view on application scal- Westmere, IBM POWER7, AMD Bulldozer, etc. It is to be ing behavior, it does not provide any insight with respect to expected that the number of cores will increase in the com- why an application does or does not scale. There are many ing years given the continuous transistor density improve- possible causes for poor scaling behavior, such as synchro- ments predicted by Moore’s law, as exemplified by Intel’s nization, as well as interference in both shared on-chip re- Many Integrated Core architecture with more than 50 cores sources (e.g., last-level cache) and off-chip resources (e.g., on a chip. main memory). Unfortunately, a speedup curve provides no A major challenge with increasing core counts is the clue whatsoever why an application exhibits poor scaling ability to program multi-core and many-core systems. Al- behavior. though parallel programming has been a challenge for many In this paper, we propose the speedup stack which is a years in the scientific computing community, given the re- novel representation that provides insight into an applica- cent advent of multi-core and many-core processors, par- tion’s scaling behavior on multi-core hardware. The height max theoretical speedup hardware. The concept of a speedup stack is broadly N speedup imbalance applicable to multiprocessor, chip-multiprocessor and spinning various forms of multi-threading architectures. yielding • We extend a previously proposed per-thread cycle ac- cache coherency counting architecture [7] for computing speedup stacks parallelization overhead on chip-multiprocessors. The extensions include sup- net negative interference negative interference actual speedup port for quantifying the impact of positive interference positive interference in shared caches on multi-threaded application perfor- base speedup mance along with support for quantifying the impact 0 of spinning and yielding. The overall accounting architecture can be implemented in hardware. Figure 2. Illustrative speedup stack. 2 Speedup stack of the speedup stack is defined as N with N the number For explaining the key concept of a speedup stack, we of cores or threads. The different components in a speedup refer to Figure 3. To simplify the discussion we focus on stack define the actual speedup plus a number of perfor- the parallelizable part of a program. Amdahl’s law already mance delimiters: last-level cache (LLC) and memory in- explains the impact of the sequential part on parallel perfor- terference components represent both positive and negative mance, hence, we do not consider it further in the remainder interference in the LLC and main memory; the spinning of the paper. If of interest, including the sequential part in component denotes time spent spinning on lock and bar- the speedup stack can easily be done. rier variables; the yield component denotes performance de- We define Ts as the execution time of (the parallelizable ficiency due to yielding on barriers and highly contended part of) a program under single-threaded execution. The ex- lock variables; additional components are due to cache co- ecution time of the same program during multi-threaded ex- herency, work imbalance and parallelization overhead. Fig- ecution will (most likely) be shorter, say Tp. We now break ure 2 shows an example speedup stack. The intuition is that up the execution time of a thread during multi-threaded ex- the scaling delimiters, such as negative LLC and memory ecution in various cycle components; note that the total ex- interference, spinning and yielding, and their relative con- ecution time is identical for all threads under this break- tributions, are immediately clear from the speedup stack. up. The idealized multi-threaded execution time, assuming Optimizing the largest scaling delimiters is likely to yield perfect parallelization, equals Ts/N with N the number of the largest speedup, hence, a speedup stack is an intuitive threads or cores. Note we use the terms ‘thread’ and ‘core’ and useful tool for both software and hardware optimiza- interchangeably as we assume chip-multiprocessors in this tions. paper, however, the concept of a speedup stack can also be Next to introducing the concept of the speedup stack, this applied to shared-memory multiprocessors (SMP) as well paper also describes a method for computing speedup stacks as simultaneous multi-threading (SMT) and other forms of from a single run, either in simulation software or real hard- multi-threading. ware; the cost for a hardware implementation is limited to Obviously, the idealized multi-threaded execution time 1.1KB per core or 18KB for a 16-core CMP. Our exper- Ts/N is not achieved in practice, hence multi-threaded ex- imental results demonstrate the accuracy of the approach: ecution time is typically larger, for a number of reasons. we achieve an average absolute error of 5.1% on 16-core Parallelizing an application incurs overhead in the form of processors across a set of SPLASH-2, PARSEC and Ro- additional instructions being executed to communicate data dinia benchmarks. Finally, we describe several applications between threads, recompute data, etc. This is referred to as for speedup stacks apart from the obvious application of an- parallelization overhead in the speedup stack. Other over- alyzing performance scaling bottlenecks. We use speedup head factors include spinning (active spinning on lock and stacks to classify benchmarks based on their scaling bottle- barrier variables), yielding (the operating system scheduling necks, we identify optimization opportunities, and we ana- out threads that are waiting on barriers or highly contended lyze LLC performance. locks), and imbalance (threads waiting for other threads to In summary, we make the following two major contribu- finish their execution). Finally, there are interference effects tions in this paper: in the memory hierarchy, both positive and negative in both the LLC and memory subsystem, as well as performance • We introduce the speedup stack which is a novel repre- penalties due to cache coherency. Positive interference ob- sentation that quantifies the impact of various scaling viously offsets negative interference. In a rare case, posi- bottlenecks in multi-threaded workloads on multi-core tive interference could lead to superlinear speedups in case (a) Single-threaded execution tion overhead) for thread i, and Pi positive interference for single-threaded execution time Ts thread i. Given the estimated single-threaded execution time, we can now estimate the achieved speedup Sˆ: (b) Multi-threaded execution N N P P per-thread execution time Tp Tˆ P Tˆ i Tp − j Oi,j + Pi Sˆ = s = i i = . (3) Tp Tp Tp per-thread execution time breakup: We now reformulate the above formula to: idealized execution time T /N positive interference N P PN s X j Oi,j Pi Sˆ = N − + i . (4) T T i p p parallelization overhead yielding negative interference This formula immediately leads to the speedup stack by spinning imbalance showing the different terms in the above formula in a cache coherency stacked bar. The height of the bar equals the maximum achievable speedup, namely N. The various terms de- Figure 3. Breaking up per-thread performance note the aggregate overhead components across all threads, for computing speedup stacks. along with the aggregate positive interference component. In summary, the speedup stack consists of the base speedup plus a number of components, see also Figure 2.

Speedup Stacks: Identifying Scaling Bottlenecks in Multi-Threaded Applications

Balanced Multithreading: Increasing Throughput Via a Low Cost Multithreading Hierarchy

Amdahl's Law Threading, Openmp

Parallel Generation of Image Layers Constructed by Edge Detection

CS650 Computer Architecture Lecture 10 Introduction to Multiprocessors

High-Performance Message Passing Over Generic Ethernet Hardware with Open-MX Brice Goglin

An Investigation of Symmetric Multi-Threading Parallelism for Scientific Applications

Computer Systems Architecture

Computer Architecture: Parallel Processing Basics

Your Speedup Is a Megaflop!

Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures

Vector Processors

SIMD the Good, the Bad and the Ugly