Cache-Aware Roofline Model: Upgrading the Loft
Total Page:16
File Type:pdf, Size:1020Kb
1 Cache-aware Roofline model: Upgrading the loft Aleksandar Ilic, Frederico Pratas, and Leonel Sousa INESC-ID/IST, Technical University of Lisbon, Portugal ilic,fcpp,las @inesc-id.pt f g Abstract—The Roofline model graphically represents the attainable upper bound performance of a computer architecture. This paper analyzes the original Roofline model and proposes a novel approach to provide a more insightful performance modeling of modern architectures by introducing cache-awareness, thus significantly improving the guidelines for application optimization. The proposed model was experimentally verified for different architectures by taking advantage of built-in hardware counters with a curve fitness above 90%. Index Terms—Multicore computer architectures, Performance modeling, Application optimization F 1 INTRODUCTION in Tab. 1. The horizontal part of the Roofline model cor- Driven by the increasing complexity of modern applica- responds to the compute bound region where the Fp can tions, microprocessors provide a huge diversity of compu- be achieved. The slanted part of the model represents the tational characteristics and capabilities. While this diversity memory bound region where the performance is limited is important to fulfill the existing computational needs, it by BD. The ridge point, where the two lines meet, marks imposes challenges to fully exploit architectures’ potential. the minimum I required to achieve Fp. In general, the A model that provides insights into the system performance original Roofline modeling concept [10] ties the Fp and the capabilities is a valuable tool to assist in the development theoretical bandwidth of a single memory level, with the I and optimization of applications and future architectures. expressed in φ per accessed byte at that memory level. For In order to model the interrelation between architecture example, if working sets fit into the L2 cache, this concept capabilities and application characteristics, it is usually re- considers the peak L2 bandwidth (BL2) and I is defined as quired to develop architecture-specific testing and simula- φ per L2 bytes accessed (i.e., between L1 and L2 caches). tion environments, which result in accurate but complex, Hence, when improving the memory access pattern of ap- difficult to develop and hard to use models. However, plications, it is required to construct and simultaneously use simpler “bound and bottleneck” approaches can provide several instances of the model (one for each memory level) valuable insights into the primary factors that affect the in order to assess the optimization gains. system performance, and give useful guidelines for im- As shown in this paper, the original Roofline modeling proving applications and architectures [3]. In particular, the concept is usually not sufficient to fully describe the per- Roofline model [10] provides insights into inherent archi- formance of modern applications and architectures which tectural bottlenecks and potential application optimizations. rely on the on-chip memory hierarchy. This paper proposes Its usefulness is patent in several works [9], both at the the Cache-aware Roofline model, that is based on a different application [4], [6], [8] and at the architectural level [5], [7]. Core-centric concept where FP operations, data traffic and Modern multi-core systems can be represented as a set of memory bandwidth at different levels are perceived from a central processing units (Cores) with on-chip memory hier- consistent architectural point of view, i.e., as seen by the archy connected to a DRAM memory (see Fig. 1). Hence, the Core. These changes introduce a fundamentally different original Roofline model [10] shows the maximum attainable view on the attainable performance and behavior of modern performance of a computer architecture as a line in the form applications and architectures, thus providing more insight- of a roof. It relates the peak floating-point (FP) performance ful guidelines for application optimization. Moreover, the (Fp in flops=s), the peak DRAM memory bandwidth (BD proposed model is a single-plot model that translates into in βD=s), and the application’s operational intensity (I in an increase of the maximum attainable performance when flops/βD), where βD represents the data traffic in DRAM directly compared to the original DRAM roof, which can be bytes-accessed. It is based on the assumption that memory considered as an upgrade to the loft. transfers and computation can overlap, so the application execution time, T (I), is limited by Fp or BD: 2 CACHE-AWARE ROOFLINE MODEL φ 1 1 ff T (I) = T ( ) = φ max ; (1) βD × BD I Fp As stated before, in contrast to the original modeling con- × where φ is the number of performed FP operations (flops). cept that ties the Fp with the bandwidth and data traffic at a Thus, the original Roofline model defines the maximum single memory hierarchy level, the proposed Cache-aware attainable performance (Fa(I)), as: Roofline modeling concept differs in how memory traffic φ 1 and bandwidth are observed. There are two fundamental Fa(I) = = n o = min BD I;Fp (2) key differences: i) we account for all memory operations T (I) max 1 ; 1 f × g BD ×I Fp including accesses to the different cache levels; and ii) the Figure 2 shows the original Roofline model for the quad- bandwidth depends on the accessed memory levels and it core Intel 3770K processor with the characteristics presented is defined relatively to the Core. Since the proposed model 2 Original Cache-aware 1024 128 4 Cores MAD (Peak Performance) memory traffic memory traffic 4 Cores L1 C Measured → Theoretical F (φ ) 512 B(β ) 2 Cores on-chip 64 2 Cores ADD/MUL 256 1 Core L2 C BL1C B BLLC B L2 D DRAM → L3 C Core L1 ... LLC 128 → 32 1 Core (off-chip) 64 32 DRAM LLC 16 B → L2 C Performance [GFlops/s] Memory Bandwidth [GB/s] DRAM C Measured BLLCC LLC - Last Level Cache 16 Lx - Level X Cache → Theoretical B DC 0.125 1 8 64 512 4096 32768 262144 16 128 1024 8192 65536 524288 Data Traffic [KBytes] Double FP operations [Flops] Fig. 1: General computer architecture Fig. 3: Performance and bandwidth variation on the Intel 3770K th MAD (Peak Performance F ) 128 128 MAD (Peak Performance Fp) 128 id p w d ADD/MUL 64 64 ADD/MUL 64 an ) th b C id 1 32 32 w 32 L 1→ d k L C n a ( a 16 e 2 C 16 16 b P L → C L Intel 3770K 3 L C 8 L → 8 8 L M→ C Ivy Bridge L1 to Core (MAD) L 4 A R M→ 4 L1 to Core (ADD) 4 M→ D A A 2 R L1 to Core (MUL) 2 R D 2 D 1 L2 to Core k rRMSE=0.0819 a 1 1 e 0.5 L3 to Core Performance [GFlops/s] Performance [GFlops/s] Performance [GFlops/s] fitness=92.43% P DRAM to Core 0.5 0.5 0.25 DRAM to LLC 0.0078125 0.0625 0.5 4 32 256 2048 16384 0.0078125 0.0625 0.5 4 32 256 2048 16384 0.0078125 0.0625 0.5 4 32 256 2048 16384 Operational Intensity [Flops/DRAM Byte] Operational Intensity [Flops/Byte] Operational Intensity [Flops/Byte] Fig. 2: Original Roofline model (3770K) Fig. 4: Cache-aware Roofline model (3770K) Fig. 5: Model validation for Intel 3770K 128 128 128 64 64 64 32 32 32 16 16 16 Intel 2600K Intel 3820 Intel 3930K 8 8 Sandy Bridge Sandy Bridge−E 8 Sandy Bridge−E 4 4 4 2 2 2 rRMSE=0.0850 rRMSE=0.0871 rRMSE=0.0766 1 1 1 Performance [GFlops/s] Performance [GFlops/s] fitness=92.17% Performance [GFlops/s] fitness=91.98% fitness=92.88% 0.5 0.5 0.5 0.0078125 0.0625 0.5 4 32 256 2048 16384 0.0078125 0.0625 0.5 4 32 256 2048 16384 0.0078125 0.0625 0.5 4 32 256 2048 16384 Operational Intensity [Flops/Byte] Operational Intensity [Flops/Byte] Operational Intensity [Flops/Byte] Fig. 6: Cache-aware Roofline model validation for different general-purpose machines considers the complete volume of memory traffic, i.e., the value can be obtained from the processor specifications by total number of transferred bytes (β), the I in the Cache- multiplying the number of cores, the L1 bus width, and the aware Roofline Model is uniquely defined for all levels of clock frequency (see Tab. 1). Moreover, the insightfulness of the memory hierarchy, thus resulting in a single-plot model. the Cache-aware Roofline model can be further extended In fact, in the proposed model, I can be the arithmetic inten- by introducing additional memory ceilings depending on sity (φ/β). However, we use the term operational intensity the achievable bandwidth for each memory level (see Fig. 3 to reflect the possibility of applying the proposed model for and 4). It is worth mentioning that the bandwidth from the other types of operations (not necessarily arithmetic). DRAM to the Core (BD C ) is lower than the bandwidth In order to experimentally assess the bandwidth and considered in the original Roofline model (BD) due to the performance, we executed a series of tests based on Alg. 1. fact that the data goes through all cache levels before arriv- In particular, the bandwidth was characterized by vary- ing to the processor Cores (see Fig. 3 and 4). Overall, the ing the number of memory operations (test code A) to proposed model reveals a previously unexplored area, thus hit different memory levels by accessing contiguous and allowing even better insights on the attainable performance. increasing memory addresses. Figure 3 shows the results for the quad-core Intel 3770K architecture1 (see Tab.