Identifying Performance Bottlenecks on Modern Microarchitectures Using an Adaptable Probe
Lawrence Berkeley National Laboratory
Permalink: https://escholarship.org/uc/item/9203n4p5
Publication Date: 2004-01-20

Gorden Griem*, Leonid Oliker*, John Shalf*, and Katherine Yelick*+
*Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720
+Computer Science Division, University of California, 387 Soda Hall #1776, Berkeley, CA 94720
{ggriem, loliker, jshalf, kayelick}@lbl.gov

Abstract

The gap between peak and delivered performance for scientific applications running on microprocessor-based systems has grown considerably in recent years. The inability to achieve the desired performance even on a single processor is often attributed to an inadequate memory system, but without identification or quantification of a specific bottleneck. In this work, we use an adaptable synthetic benchmark to isolate application characteristics that cause a significant drop in performance, giving application programmers and architects information about possible optimizations. Our adaptable probe, called sqmat, uses only four parameters to capture key characteristics of scientific workloads: working-set size, computational intensity, indirection, and irregularity. This paper describes the implementation of sqmat and uses its tunable parameters to evaluate four leading 64-bit microprocessors that are popular building blocks for current high-performance systems: the Intel Itanium2, AMD Opteron, IBM Power3, and IBM Power4.

1. INTRODUCTION

There is a growing gap between the peak speed of microprocessor-based systems and the delivered performance of scientific computing applications. This gap has raised the importance of developing better benchmarking methods to improve performance understanding and prediction, while identifying hardware and application features that work well or poorly together.

Benchmarks are typically designed with two competing interests in mind – capturing real application workloads and identifying the specific architectural bottlenecks that inhibit performance. Benchmarking suites like the NAS Parallel Benchmarks [7] and the Standard Performance Evaluation Corporation (SPEC) suites [1] emphasize the first goal of representing real applications, but they are typically too large to run on simulated architectures and too complex to employ for identifying specific architectural bottlenecks. Likewise, the complexity of these benchmarks often means they end up measuring the quality of the compiler's optimizations as much as the underlying hardware architecture. At the other extreme, microbenchmarks such as STREAM [6] are used to measure the performance of a single specific feature of a given computer architecture. Such synthetic benchmarks are often easy to optimize, which minimizes their dependence on the maturity of the compiler technology. However, the simplicity and narrow focus of these codes often makes them difficult to relate to real application codes. Indeed, it is rare that such probes offer any predictive value for the performance of full-fledged scientific applications.

Benchmarks often present a narrow view of a broad, multidimensional parameter space of machine characteristics. We therefore differentiate a "probe" from a "microbenchmark" or synthetic benchmark on the basis that the latter typically offers a single-valued result in order to rank processor performance consecutively – a few points of reference in a multidimensional space. A probe, by contrast, is used to explore a continuous, multidimensional parameter space. The probe's parameterization helps the researcher uncover the peaks and valleys in a continuum of performance characteristics and explore the ambiguities of computer architectural comparisons that cannot be captured by a single-valued ranking.

In this paper, we introduce the sqmat probe [9], which attempts to bridge the gap between these competing requirements. It maintains the simplicity of a microbenchmark while offering four distinct parameters to capture different types of application workloads: working-set size (parameter "N"), computational intensity (parameter "M"), indirection (parameter "I"), and irregularity (parameter "S").

By varying the parameters of sqmat, one can capture the memory-system behavior of a more diverse set of algorithms, as shown in Table 1. With a high computational intensity (M), the benchmark matches the characteristics of dense linear solvers that can be tiled into matrix-matrix operations (the so-called "BLAS-3" operations). For example, PARATEC [10] is a materials science application that performs ab-initio quantum-mechanical total-energy calculations using pseudopotentials and a plane-wave basis set. This code relies on BLAS-3 libraries with direct memory addressing, and thus has high computational intensity, little indirection, and low irregularity. However, not all dense linear algebra problems can be tiled in this manner; some are instead organized as dense matrix-vector (BLAS-2) or vector-vector (BLAS-1) operations, which require fewer operations on each element. This behavior is captured in sqmat by reducing the computational intensity, possibly in combination with a reduced working-set size (N).

Indirection, sometimes called scatter/gather-style memory access, occurs in sparse linear algebra, particle methods, and grid-based applications with irregular domain boundaries. Most such codes are characterized by noncontiguous memory access, which places additional stress on memory systems that rely on large cache lines to mask memory latency. The amount of irregularity varies enormously in practice. For example, sparse matrices that arise in finite element applications often contain small dense sub-blocks, which cause a string of consecutive indexes in an indirect access of a sparse matrix-vector product (SPMV). Table 1 shows an example of SPMV where one-third of the data are irregularly accessed; in general, the irregularity (S) would depend on the sparse matrix structure. Another example of algorithmic irregularity can be found in GTC, a magnetic fusion code that solves the gyro-averaged Vlasov-Poisson system of equations using the particle-in-cell (PIC) approach [11]. PIC is a sparse method for calculating particle motion and is characterized by relatively low computational intensity, array indirection, and high irregularity. In both cases, the stream of memory accesses may contain sequences of contiguous memory accesses broken up by random-access jumps.

Table 1: Mapping Sqmat parameters onto algorithms

  Algorithm     M  N  CI (orig:sqmat)  S  % irreg.
  DAXPY         1  1  0.5:0.5          -  0%
  DGEMM         1  4  3.5:3.5          -  0%
  MADCAP [12]   2  4  7.5:7.0          -  0%
  SPMV          1  4  3.5:3.5          3  33%

[Figure 1: Example of Sqmat indirection for S=4. A pointer memory array (entries 1-8) indexes into a floating-point memory array (entries 1-4, ..., 73-76).]

In this paper, we describe the implementation of the sqmat probe and focus on how its four parameters enable us to evaluate the behavior of four microprocessors that are popular building blocks for current high-performance machines. The processors are compared on the basis of the delivered percentage of peak performance rather than absolute performance, so as to limit the bias inherent in comparisons between different generations of microprocessor implementations. We evaluate these processors and isolate the architectural features responsible for performance bottlenecks, giving application developers valuable hints on where to optimize their codes. Future work will focus on correlating sqmat parameters across a spectrum of scientific applications.

2. SQMAT OVERVIEW

The Sqmat benchmark is based on matrix multiplication and is therefore related to the Linpack benchmark. [...] the results are written back to memory. We use a Java program to generate optimally hand-unrolled C code [9], greatly reducing the influence of the C compiler's code generation and thereby ensuring that the hardware architecture, rather than the compiler, is benchmarked. The innermost loop is unrolled enough to ensure that all available floating-point registers are occupied by operands during each cycle of the loop, but not so far as to cause processor stalls, whether due to the increased number of instructions or to additional branch-prediction penalties. If enough registers are available on the target machine, several matrices are squared at the same time. Since squaring a matrix cannot be done in situ, an input and an output matrix are needed for the computation, so a total of 2·N² registers are required per matrix.

For direct access, each double-precision floating-point value has to be loaded and stored, creating 8 bytes of memory-load and 8 bytes of memory-store traffic. For indirect access, the value and the pointer have to be loaded. As we always use 64-bit pointers in 64-bit mode, each entry then creates 16 bytes of memory-load and 8 bytes of memory-store traffic.

To allow a comparison between the different architectures, we introduce the algorithmic peak performance (AP) metric. The AP is defined as the performance that could be achieved on the underlying hardware for a given algorithm if all the floating-point units were optimally used. The AP is always equal to or less than the machine peak performance (MP).