Applying the Roofline Model

Applying the Roofline Model Georg Ofenbeck, Ruedi Steinmann, Victoria Caparros, Daniele G. Spampinato, Markus Puschel¨ Department of Computer Science ETH Zurich, Switzerland fofenbeck, caparrov, danieles, [email protected] Abstract—The recently introduced roofline model plots the Such a plot reveals new information: the memory bandwidth performance of executed code against its operational intensity becomes an additional upper bound for computations with low (operations count divided by memory traffic). It also includes intensity, the plot clearly distinguishes memory and compute two platform-specific performance ceilings: the processor’s peak performance and a ceiling derived from the memory bandwidth, bound computations (terms that are often used casually), and which is relevant for code with low operational intensity. The it shows the behavior of intensity across sizes, just to name a model thus makes more precise the notions of memory- and few. The downside is that the necessary data is considerably compute-bound and, despite its simplicity, can provide an insight- harder to obtain in a reliable and meaningful way than in ful visualization of bottlenecks. As such it can be valuable to guide an ordinary performance plot. Indeed, the original paper [37] manual code optimization as well as in education. Unfortunately, to date the model has been used almost exclusively with back- shows only four programs with a fixed input size, and the many of-the-envelope calculations and not with measured data. In this papers that have used the model to date have done so through paper we show how to produce roofline plots with measured back-of-the-envelope calculations rather than measurements. data on recent generations of Intel platforms. We show how to Contribution. The first contribution of this paper is a accurately measure the necessary quantities for a given program strategy on how to produce roofline plots with measured data using performance counters, including threaded and vectorized code, and for warm and cold cache scenarios. We explain the on recent generations of Intel platforms. We show that by a measurement approach, its validation, and discuss limitations. suitable measurement strategy and the use of the right set of Finally, we show, to this extent for the first time, a set of roofline performance counters, meaningful and reliable results can be plots with measured data for common numerical functions on a achieved. This includes roofline plots for single- and multi- variety of platforms and discuss their possible uses. threaded code and for cold and warm data cache scenarios. The second contribution is a detailed performance analysis I. INTRODUCTION of common numerical kernels such as BLAS functions and Software performance is determined by the complex inter- the fast Fourier transform (FFT) with the roofline model. We action of source code, the compiler used, and the architecture show for the first time roofline plots (which are quite different and microarchitecture on which the code is run. Because of from the usual performance plots) for these in various scenar- this, performance is hard to estimate, understand, and opti- ios including single/multithreaded and cold/warm data cache. mize. This is particularly true for numerical or mathematical The plots clearly reveal the different operational intensities functions that often are the bottleneck in applications from between kernels, the effect of common optimizations, but also scientific computing, machine learning, multimedia process- some surprising behavior not visible in ordinary performance ing, and other domains. Common practice in performance plots. We show and discuss to which extent a roofline plot can optimization is to try various optimization techniques together be used to study optimizations. with repeated runtime measurements until a desired perfor- Organization. We first provide background on operational mance is achieved. The result is then reported in a graph intensity and the roofline model. Then we introduce our that plots the achieved performance (say, in floating point measurement methodology and explain how we validated our operations per second or flop/s) against a range of input sizes. approach. Finally, we show a set of experiments on various This plot reveals the efficiency achieved (percentage of peak platforms. After a discussion of related work we conclude. performance) and shows the trend as the size increases, but nothing more. On the other hand, tools like Intel VTune [3] II. BACKGROUND provide large amounts of data for a given code extracted from performance counters [13], [20], such as various types of cache In this section we provide the necessary background on the and TLB misses and others. Extracting meaning from this roofline model. First, we explain the concept of operational data is daunting for most. As aid in the understanding and intensity and then discuss the actual roofline model (see [37] optimization of performance it hence seems important to have for the original paper). We also briefly discuss performance more ways of visualizing performance that have the simplicity counters, which are necessary to turn the model into roofline of a performance plot but that provide additional insights. The plots. Throughout the paper, properties of executed programs recently introduced roofline model [37] tries to provide exactly (runtime, performance, etc.) are written using upper case that. It plots performance against operational intensity, which letters, whereas platform properties (peak performance, cache is the quotient of flop count and the memory traffic in bytes. size, etc.) are written using lower case Greek letters. 1 BLAS function W (n) Qr(n) Qw(n) Q(n) = Qr+w(n) Ir(n) I(n) = Ir+w(n) 1 1 daxpy y αx + y = 2n ≥ 16n ≥ 8n ≥ 24n ≤ 8 ≤ 12 2 2 2 n+1 1 n+1 1 dgemv y αAx + βy = 2n + 2n ≥ 8n + 16n ≥ 8n ≥ 8n + 24n ≤ 4n+8 ≈ 4 ≤ 4n+12 ≈ 4 3 2 2 2 2 n+1 n+1 dgemm C αAB + βC = 2n + 2n ≥ 24n ≥ 8n ≥ 32n ≤ 12 ≤ 16 TABLE I OPERATIONAL INTENSITY ANALYSIS FOR SOME BLAS1–3 ROUTINES. performance P A. Operational Intensity [flops/cycle] Work. As commonly done, we denote with W the work, bound based on β i.e., the number of operations performed by a given program. 4 bound based on π Although W may refer to any type of operation, such as π = 2 comparisons or integer arithmetic, the focus of this paper β: 1 some program is numerical code and thus W will count the number of 1/2 x run on some input floating point additions and multiplications. This is meant in 1/4 a mathematical sense, i.e., for example one AVX addition of 1/4 1/2 1 2 4 8 operational vectors of four doubles will count as four. In our case studies, π/β intensity I [flops/byte ] the work will depend only on the sizes of the inputs, for Fig. 1. Roofline plot for π = 2 and β = 1. example when adding two vectors of length n. In this case we write W = W (n). The work is a property of the chosen algorithm and does not depend on the platform. Operational intensity: Examples. In simple cases, I can be When run on a platform with some input, the program will estimated by analysis. As example, consider the daxpy routine achieve a runtime of T and thus a performance of P = W=T . in BLAS [1], which implements Memory traffic. We denote with Q the number of bytes of memory traffic incurred by executing a given numerical y αx + y program. Again we write Q = Q(n) to express the dependency where x and y are vectors of doubles of length n and on the input size n. In contrast to W , Q heavily depends α is a scalar. The work is W (n) = 2n. Since there is on the properties of the platform such as the details of the no reuse in the computation, the data traffic is determined cache hierarchy. Because of this, Q can be estimated only by the compulsory misses and the write-back of the result, asymptotically in most cases (e.g., [15]), and exact values i.e., Q (n) ≥ 16n, Q (n) ≥ 8n, and thus Q(n) ≥ 24n. have to be determined by measurements. We will distinguish r w This analysis yields a lower bound since it excludes, for reads and writes as Q and Q : Q = Q + Q . The read r w r w example, memory traffic due to loading the code and the cache data traffic Q (n) for sizes for which the data fits within r coherency protocol in case of threading. A similar analysis can the cache is equal to the number of compulsory misses, i.e., be done for other mathematical kernels as shown in Table I reading the input data. For example, for dgemm (third row for representative BLAS 1–3 functions. Note that only dgemm in Table I) the three matrices, each of size n2, have to be is not bound by O(1). However, the shown bound is too loose read. Therefore, the compulsory read data traffic in bytes is once 3n2 exceeds the cache size. The achievable optimum 24n2. For larger sizes, there will also be reads associated to p (using suitable blocking) is Θ( γ), where γ is the cache size data that has been evicted from the cache, but this number [15]; [14] provides a non-asymptotic bound. Analysis of a cannot be determined analytically. Hence, Q (n) ≥ 24n2. In r simple triple loop implementation yields I(n) = O(1) due to dgemm, only the output matrix C has to be written (plus all bad locality. the evicted data). Therefore, the write data traffic in bytes is 2 Qw(n) ≥ 8n . B. The Roofline Model Typically, we assume a cold data cache unless stated other- The roofline model [37] visually relates performance P and wise.

Load more