A Benchmark-based Performance Model for Memory-bound HPC Applications Bertrand Putigny, Brice Goglin, Denis Barthou

To cite this version:

Bertrand Putigny, Brice Goglin, Denis Barthou. A Benchmark-based Performance Model for Memory-bound HPC Applications. International Conference on High Performance Computing & Simulation (HPCS 2014), Jul 2014, Bologna, Italy. DOI: 10.1109/HPCSim.2014.6903790. hal-00985598.

HAL Id: hal-00985598 https://hal.inria.fr/hal-00985598 Submitted on 30 Apr 2014

Bertrand Putigny∗†, Brice Goglin∗†, Denis Barthou∗†‡
∗INRIA Bordeaux Sud-Ouest, †LaBRI, ‡Polytechnic Institute of Bordeaux
[email protected], [email protected], [email protected]

Abstract—The increasing computation capability of servers comes with a dramatic increase of their complexity through many cores, multiple levels of caches and NUMA architectures. Exploiting the computing power is increasingly harder and programmers need ways to understand the performance behavior. We present an innovative approach for predicting the performance of memory-bound multi-threaded applications. It relies on micro-benchmarks and a compositional model, combining measures of micro-benchmarks in order to model larger codes. Our memory model takes into account cache sizes and cache coherence protocols, which have a large impact on the performance of multi-threaded codes. Applying this model to real-world HPC kernels shows that it can predict their performance with good accuracy, helping to take optimization decisions to increase the application's performance.

Keywords—memory model; timing prediction; micro-benchmarks; caches; multicore.

I. Introduction

High performance computing requires proper tuning to better exploit the hardware abilities. The increasing complexity of the hardware leads to a growing need to understand and to model its behavior so as to optimize applications for performance. While the gap between memory and processor performance keeps growing, complex memory designs are introduced in order to hide the memory latency. Modern multi-core processors feature deep memory hierarchies with multiple levels of caches, with one or more levels shared between cores.

Performance models are essential because they provide feedback to programmers and tools, and give a way to debug, understand and predict the behavior and performance of applications. Coherence protocols, prefetchers and replacement policies are key automatic hardware mechanisms that improve peak performance, but they also make parallel architecture modeling much harder. Besides, the detailed algorithms used by these mechanisms, the latencies involved or the size of the buffers implied are not precisely quantified by hardware manufacturers in general. Finally, even if the hardware resource organization may be known at runtime thanks to tools such as hwloc [1], the way these resources are actually used depends on the application, namely how it allocates, reuses and shares buffers between the computing tasks.

These difficulties hinder precise performance modeling, in particular for memory performance. Many previous works have abstracted away the cache behavior by modeling essentially compulsory, capacity and conflict misses (the 3C). The resulting analytical models, through the computation of stack distances, capture data locality and offer guidance for optimizations (see [2], [3], [4] for instance). Most of these previous works target the cache behavior of single-threaded codes. Some recent works focus on how to model cache misses [4] for multi-threaded codes, or on how to model cache coherence traffic [5]. However, they do not consider simultaneously the impact of cache misses, of coherence and of contention coming from shared caches. A 5C model, accounting for compulsory, capacity and conflict misses, contention and coherence, remains to be found in order to help study the impact of factors such as the data set size for each thread or the code scalability on multi-cores.

In this paper we present a novel performance model that allows detailed performance prediction for memory-bound multi-threaded codes. This differs from the previous approaches by predicting the cumulated latency required to perform the data accesses of a multi-threaded code. The model resorts to micro-benchmarks in order to take into account the effects of the hierarchy of any number of caches, compulsory misses, capacity misses (to some extent), coherence traffic and contention. Micro-benchmarks offer the advantage of making the approach easy to calibrate on a new platform and able to take complex mechanisms into account. In addition, these benchmarks can be used as hardware test-beds, e.g. to choose which of several architectures best suits the needs.

This paper is organized as follows: first, Section II presents the scope and an overview of our work. Our model is then detailed in Section III as a combination of hardware and software models. Section IV presents the benchmarks used to build the hardware models. Finally, Section V details the use of our model to predict real-world code performance.

II. Scope and Model's Overview

Our model takes as input the source code to analyze. The code can be structured with any number of loops and statements, and we assume the data structures accessed are only scalars and arrays. Parallelism is assumed to be expressed in OpenMP parallel loops. The iteration count of the loops has to be known at the time of the analysis, and the array indexes have to be affine functions of the surrounding loop counters and of the thread ids.

Predicting the time to access memory usually requires building a full theoretical performance model of the memory hierarchy. This work is however difficult due to the complexity of all the hardware mechanisms involved in modern architectures. Instead we choose to build a memory model calibrated with micro-benchmark measurements.

Arrays accessed within a parallel loop are split into chunks, one per thread, corresponding to the elements each thread touches in the parallel region. For instance, the following DAXPY computation with n threads, assuming N is a multiple of n:

    double X[N], Y[N];
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        Y[i] = a * X[i] + Y[i];
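To make the chunk decomposition concrete, here is how the two arrays of this DAXPY split per thread, assuming a static OpenMP schedule so that each thread gets one contiguous block of N/n iterations. This is a worked illustration of the chunk notion, not text from the paper, and the helper name daxpy_chunk is ours:

    #include <stddef.h>

    /* Block of elements accessed by thread t in the DAXPY above,
     * assuming schedule(static) and N a multiple of n. */
    typedef struct { size_t first, size; } chunk;

    chunk daxpy_chunk(int t, size_t N, int n) {
        chunk c = { (size_t)t * (N / n), N / n };  /* contiguous, stride 1 */
        return c;
    }
    /* Thread t reads X[c.first .. c.first+c.size-1] and Y[same range],
     * and writes Y[same range]: one read chunk of X, plus one read and
     * one write chunk of Y, each of 8*N/n bytes. */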

The objective of the memory model is to predict the time required to perform the access statements, taking into account the impact of the coherency protocol. To this end, we associate a state in the coherency protocol to chunks and tabulate the time to perform an access statement to a chunk of any given size, with any number of threads and in any state. As the impact of the cache hierarchy is taken into account by a latency function, the memory hierarchy is entirely modeled as one level of coherent, private and infinite capacity caches, with one cache per core, as depicted in Figure 1. We will assume in the following that threads are pinned to cores. A thread has therefore a corresponding cache. We define a latency function, giving the time to access a chunk of data depending on the size of the chunk and its stride, on the state of this data in the caches, and on the number of threads that access the data simultaneously.

[Figure 1: The memory hierarchy is modeled as one level of coherent, private and infinite capacity caches: one cache of unbounded size per core (Core #0 ... Core #5).]

[Figure 2: Automaton for tracking chunk's states and for selecting the benchmarks that represent each memory access step. States pair a coherency state with the set of threads holding the chunk, such as (M, {t}), (E, {t}), (S, T) and (I, ∅); transitions are labeled by load (L) and store (S) statements, e.g. a store S_t leads to (M, {t}), and a load L_ω with ω ⊄ T leads to a shared state with sharer set ω ∪ T.]

At the beginning of the program, the data chunks are in the state (I, ∅). Executing read and write statements in our program model changes the chunk states, and the latencies for these transitions are defined by micro-benchmarks.
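To make the state tracking concrete, the following C sketch shows one possible encoding of the chunk states and of a transition function for load and store statements. It is only an illustration of the automaton of Figure 2 under common MESI semantics (in particular, a first load of an invalid chunk is assumed to go to Exclusive); the names chunk_state, delta_load and delta_store are ours, not part of the paper's framework:

    enum coherency { I_STATE, S_STATE, E_STATE, M_STATE };

    typedef struct {
        enum coherency coh;   /* MESI-like coherency state            */
        unsigned sharers;     /* bitmask of threads holding the chunk */
    } chunk_state;

    /* Load L_t by thread t: a hit keeps the state, a miss on a valid
     * chunk moves it to Shared with t added to the sharer set, and a
     * miss on an Invalid chunk moves it to Exclusive. */
    static chunk_state delta_load(chunk_state s, unsigned t) {
        unsigned bit = 1u << t;
        if (s.coh != I_STATE && (s.sharers & bit))
            return s;                                     /* hit: no change */
        if (s.coh == I_STATE)
            return (chunk_state){ E_STATE, bit };         /* first access   */
        return (chunk_state){ S_STATE, s.sharers | bit }; /* shared copies  */
    }

    /* Store S_t by thread t: the chunk becomes Modified, owned by t only. */
    static chunk_state delta_store(chunk_state s, unsigned t) {
        (void)s;
        return (chunk_state){ M_STATE, 1u << t };
    }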

C. Time Prediction

The prediction of the memory access time of an OpenMP parallel code relies on two steps: the time measurement of all transitions in Automaton 2, for all data sizes and thread numbers, and the definition of a semantics connecting the read and write statements with the transitions within this automaton.

All transitions of the Automaton 2 are associated to timing functions obtained through micro-benchmarks. These transitions depend on the type of access (read/write) and on the number of threads performing this same transition. Indeed, some read transitions L_T can be taken by threads in a subset T of threads. The time to execute a transition additionally depends on the initial state of the chunk, on its size and stride, and on all transitions taken by other threads concurrently (sharing bandwidth to memory and last-level caches). In order to limit the number of micro-benchmarks, the timing function considered for a transition only depends on the initial state of the chunk, its size and stride, and the total number of threads, assuming they are all performing the same transition on a different chunk with the same characteristics. These benchmarks thus correctly take into account capacity misses and limited bandwidth when threads simultaneously access different chunks. However, the latency of a parallel access to the same data chunk is not considered and is approximated by a parallel access to multiple chunks of the same size. Considering a data chunk c read by thread t in a parallel region with T threads, the time to perform this read will be denoted L(c.state, c.stride, c.size, T). The description of how the benchmarks are built is detailed in the following section.

The timing functions provided by the benchmarks are then used to build a timing function for the program. The states are computed after each read and write statement. To aggregate these timing functions, we consider separately the sequential phases of the program from the parallel phases (within OpenMP parallel loops). Besides, two considerations are taken into account:

• At least one read and one write memory access can be executed in parallel on modern architectures. For each phase, the total duration for a sequence of memory access statements is computed as the maximum of the aggregated time to perform the reads and of the aggregated time to perform the writes.

• A sequence of read/write chunk statements can represent element-wise accesses that are interleaved. We therefore consider the size of the data accessed during a read/write chunk statement as the total size of all chunks accessed during this phase.

Now, the aggregation time for reads (or writes) in sequential phases of the program simply corresponds to the sum of the individual times for read/write statements.

Algorithm 1 shows how the time is computed and states are updated. L and S are the functions obtained from the micro-benchmarks. To take into account the interleaving of accesses, the total working set size for each thread is first computed (line 1). Read/write accesses are assumed to be on data of this size (parameter s(t) in L and S, lines 7 and 10), and the time is scaled down to the actual size of the chunk. States are then updated according to the type of access. Read (and write) statements can be executed in parallel. As all chunks may not be in the same state, these memory accesses may not take the same time for all chunks. Zhang et al. [7] have shown that the sharing patterns of threads in multi-threaded applications are often very regular: in this case all chunks accessed within a read or write in a parallel phase are in the same state. In order to ensure prediction even when different chunks are in different states, the latency for all chunk accesses is simply assumed to be the longest one among all accesses (max in line 12).

Algorithm 1: Compute the time θ of read/write statements in a parallel region.
  Input: S1 ... Sn   // Sequence of statements in a parallel region
  Input: L, S        // Functions giving read and write transition timings
   1  forall t = 1..T do                        // Compute working set of thread t
   2      s(t) ← Σ_{i=1..n} chunk(Si, t).size
   3  for i = 1..n do
   4      forall t = 1..T do
   5          c ← chunk(Si, t)                  // Chunk accessed by thread t in Si
   6          if Si is a read then
   7              θL(t) += L(c.state, c.stride, s(t), T) · c.size / s(t)
   8              c.state ← δ(c.state, L_t)
   9          else
  10              θR(t) += S(c.state, c.stride, s(t), T) · c.size / s(t)
  11              c.state ← δ(c.state, S_t)
  12  θ = max_{t=1..T} (θL(t), θR(t))
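The aggregation performed by Algorithm 1 can be written down directly. The following C sketch is a minimal illustration of that computation; the lookup functions bench_L() and bench_S() standing for the benchmarked timing functions, and delta() standing for the automaton, are hypothetical names, not the paper's API:

    #include <stddef.h>

    #define MAX_THREADS 64

    typedef struct {            /* one chunk accessed by one thread */
        int    state;           /* state in the coherency automaton */
        size_t stride, size;
    } chunk_t;

    typedef struct {            /* one read or write statement      */
        int     is_read;
        chunk_t *per_thread;    /* chunk accessed by each thread    */
    } stmt_t;

    /* Benchmarked timing functions and automaton, provided elsewhere. */
    double bench_L(int state, size_t stride, size_t s, int T); /* read  */
    double bench_S(int state, size_t stride, size_t s, int T); /* write */
    int    delta(int state, int is_read, int t);

    /* Predicted time of a sequence of n statements run by T threads. */
    double predict_parallel_region(stmt_t *stmt, int n, int T) {
        size_t s[MAX_THREADS] = {0};
        double theta_L[MAX_THREADS] = {0}, theta_R[MAX_THREADS] = {0};

        for (int t = 0; t < T; t++)            /* working set of thread t */
            for (int i = 0; i < n; i++)
                s[t] += stmt[i].per_thread[t].size;

        for (int i = 0; i < n; i++)
            for (int t = 0; t < T; t++) {
                chunk_t *c = &stmt[i].per_thread[t];
                double scale = (double)c->size / (double)s[t];
                if (stmt[i].is_read)
                    theta_L[t] += bench_L(c->state, c->stride, s[t], T) * scale;
                else
                    theta_R[t] += bench_S(c->state, c->stride, s[t], T) * scale;
                c->state = delta(c->state, stmt[i].is_read, t);
            }

        double theta = 0;                      /* longest thread (line 12) */
        for (int t = 0; t < T; t++) {
            if (theta_L[t] > theta) theta = theta_L[t];
            if (theta_R[t] > theta) theta = theta_R[t];
        }
        return theta;
    }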

IV. Benchmarking Framework

This section presents the benchmarking framework we developed [8]. In the following, memory latencies are presented as bandwidths. These metrics are equivalent since the data size accessed is known, but bandwidth makes performance comparison easier on graphical plots.

A. Benchmark Design

We build a set of elementary benchmarks to measure the performance of each memory level (i.e. cache levels and main memory). In order to achieve near-peak memory performance, these codes are written in assembly and perform read or write accesses to a buffer of a given size. These building blocks are used to time each of the 15 transitions of Automaton 2. To do so, these benchmarks first set a buffer into the initial state of the automaton, then access the buffer (by a read or a write) and record the bandwidth achieved. Moreover, another set of benchmarks evaluates the possible slowdown due to contention when the transitions are taken in parallel by several sets of threads. Our approach differs significantly from other tools such as the Stream benchmark [9] because it takes coherence and possible contention into account.

The performance of benchmarks involving either the state (S, T) or a transition L_T depends on the number of caches holding the same chunk of memory or on the number of threads accessing this data, hence it depends on their location in the machine. For instance, executing the transition from state S to state M with a store requires at least 2 threads on an 8-core socket, hence there are 28 possible configurations. The same benchmark with 3 threads in parallel theoretically requires 56 tests. In order to avoid the combinatorial explosion of running all these configurations for every benchmark, only one configuration is tested for a given number of threads. This leads to n different runs in total instead of 2^n − 1 if we ran all combinations for all numbers of threads. Fortunately, the performance of a benchmark with a given number of threads does not depend on their location within the socket, since the inner-socket architecture of our target platform is flat.

B. Benchmark Outputs

All results in the remaining sections of the paper are obtained on a dual-socket 8-core 2 GHz Xeon E5-2650 host based on the Intel Sandy Bridge architecture.

Figure 3 presents the timing for the transitions corresponding to the same action, a load L_{u} when u is not among the threads of the initial state (either M, E or S). This transition hence corresponds to a miss, with possible coherency messages. Note that in this case, counting the number of cache hits and misses, as commonly done [10], is not sufficient to predict performance. Indeed, for these three configurations, the numbers of hits and misses are exactly the same, while the effective latencies are very different.

[Figure 3: Load Miss benchmarks. Bandwidth (GB/s) as a function of data size (kB), for data initially in the Modified, Exclusive or Shared state.]

Latency highly depends here on the initial coherency state of the data. There is almost a factor 2 difference between a load to modified data (6-8 GB/s for up to 64 KB of data) and a load to shared data (14-18 GB/s for the same range), when this data fits in the L1. The reason is that a read request for data in state M in another L1 cache first requires invalidating it in that L1 (and possibly in the L2). When the data is large enough, it moves during the write from the L1 to the L3. The read then shows L3 performance (around 18 GB/s for sizes larger than 8 MB).

To assess the effects of contention occurring when the same caches are accessed simultaneously, we generate benchmarks where all threads run the same code, on different data. Threads do not share data, only bandwidth and the capacity of the last-level cache. For a given transition, we analyze the performance degradation, defined as the ratio between the bandwidth measured for one benchmark running alone and the bandwidth measured when n similar benchmarks run in parallel. A ratio of 1 means that there is no contention: each thread can use the same bandwidth as it would if it were running alone on the machine. A ratio greater than 1 is the factor dividing the available bandwidth per thread.

This ratio does not necessarily represent contention within caches or on the memory bus. We actually observed no cache contention on the Intel Sandy Bridge micro-architecture used for our tests, while the AMD Bulldozer micro-architecture shows some. The performance loss can also result from the limited capacity of the shared L3, observed when threads saturate it. They cause some parallel capacity misses, which look like cache contention on the benchmark outputs.

Figure 4 shows the bandwidth degradation ratio when a store hit occurs.

[Figure 4: Store Hit bandwidth degradation for 8 threads. Contention ratio as a function of data size (kB), for Modified, Exclusive and Shared cache lines.]

As shown on Figure 4, there is almost no contention on private caches: every independent cache can deliver the same bandwidth whether it is accessed alone or when all private caches are accessed at the same time. This is particularly interesting because no contention appears even when coherence traffic is involved. However, as one would expect, contention appears in shared resources such as the L3 and memory. As explained above, the L3 cache contention is actually caused by parallel capacity misses due to multiple threads sharing the overall L3 size. More interestingly, the ratio depends on the state of the cache lines accessed: accessing modified cache lines (in L3) exhibits lower performance than accessing clean lines. Also, we can see on Figure 4 that performance is more degraded by writing to exclusive cache lines than by writing to shared ones. We could not explain this behavior, and it further justifies the idea of hiding hardware complexity behind benchmarks: they capture such puzzling hardware behaviors better than abstracted analytical models, and we just use their outputs in our model.
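As an illustration of the benchmark structure described in Section IV-A, the following C sketch times a read traversal of a buffer that has first been put into a chosen initial state and reports the resulting bandwidth. It is only a simplified stand-in: the actual framework [8] uses hand-written assembly to reach near-peak bandwidth, and the helper names used here (set_initial_state, now_seconds, read_bandwidth) are ours:

    #include <stdint.h>
    #include <stdlib.h>
    #include <time.h>

    static double now_seconds(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    /* Put the buffer into a known initial state: writing it here leaves
     * the lines Modified in the cache of the calling thread. Running this
     * step on another pinned thread would instead produce the "Modified
     * in a remote cache" configuration. */
    static void set_initial_state(volatile uint64_t *buf, size_t n) {
        for (size_t i = 0; i < n; i++)
            buf[i] = i;
    }

    /* Read bandwidth (GB/s) achieved on a buffer of 'bytes' bytes. */
    static double read_bandwidth(size_t bytes, int repeat) {
        size_t n = bytes / sizeof(uint64_t);
        volatile uint64_t *buf = malloc(bytes);
        uint64_t sink = 0;

        set_initial_state(buf, n);

        double t0 = now_seconds();
        for (int r = 0; r < repeat; r++)
            for (size_t i = 0; i < n; i++)
                sink += buf[i];          /* volatile reads cannot be elided */
        double t1 = now_seconds();

        free((void *)buf);
        (void)sink;
        return (double)bytes * repeat / (t1 - t0) / 1e9;
    }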

V. Experiments

In this section we present several memory-bound applications that we modeled in order to predict their performance. Compute-bound applications are not considered because their memory accesses are overlapped with computation and thus do not need to be optimized much.

One possible use of our model is to select the best working set size to achieve best performance. Another would be to select the minimal number of threads to use for a given computation in order to reach some level of performance. In order to illustrate these two approaches in the next sections, we present comparisons between our predictions and real applications. We only selected a few graphs in order to show interesting or unexpected results.

A. A Simple Code: Dot Product

The dot product computation is a simple code, consisting in the load of 2 different chunks. For our experiment, we assume all chunks are written beforehand by thread 0, therefore all chunks are in state (M, {0}). We choose to initialize vectors this way because it shows, first, that the initialization phase can be critical for further performance, and second, because that is what many users would do in the first place. The performance prediction is compared to measurements of the MKL library dot product.

For both arrays read by the dot product and for all of their chunks, the transition taken in Automaton 2 is an L_T from state (M, {0}). When there is more than one thread, this leads to state (S, T), while there is no state change with only thread 0. Since there is no chunk store in the dot product, the timing function is the sum of the timing functions for these transitions.

[Figure 5: Dotproduct performance for 1 and 4 threads. Bandwidth (GB/s) as a function of data size (kB), for the real code and the model with 1 and 4 threads.]

Figure 5 shows that our model is able to predict the behavior of dot product function calls from the MKL library. The figures show that when only using one thread the dot product computation is faster when the data set fits in the L1 cache, while it is faster in L3 when using several threads. Again, this comes from coherence overhead: with a single thread, there is no coherence to maintain, thus we get better performance with faster memory. However with more threads, coherence gets involved due to the initialization of the vectors and performance is degraded. But when data only fits in the last level of cache, which is shared between all cores, then no cache coherence is required anymore, and the code achieves better performance.

However, if we are careful with vector initialization (i.e. computing threads initialize the chunks they read), the dot product kernel can exhibit super-linear speedup, as shown on Figure 6. This super-linear speedup is due to the size of the chunks accessed by the threads. With a single thread, two 512 KB-wide chunks are accessed and they only fit in the L3 cache. However with more threads the chunk size becomes smaller and the chunks fit in lower cache levels, leading to better performance.

[Figure 6: Dotproduct strong scalability on 512 KB vectors. Speedup as a function of the number of threads (1 to 8): predicted speedup, measured dotproduct speedup and linear speedup.]

Yet, with very large chunks, the dot product kernel can have a poor speedup even with careful data initialization, as shown on Figure 7. The speedup increases up to 4 with 6 threads and then stagnates. The predicted speedup is close to the measured one. As seen in Section IV-B, parallel capacity misses appear when all threads use the same shared cache, leading to poor scalability. Even with 8 threads running this kernel, each of them still manipulates 8 MB which does not fit in local caches. This is an example showing that our model handles parallel capacity misses: when multiple threads compete for the memory blocks of a shared cache, our performance model is able to take into account the capacity misses that occur. Moreover, while this is not observed here, memory contention could be predicted likewise.

[Figure 7: Dotproduct strong scalability on 32 MB vectors. Predicted and measured speedup as a function of the number of threads (1 to 8).]
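The two initialization strategies discussed above differ only in which thread writes the vectors before the computation. The C/OpenMP sketch below contrasts them; it is a minimal illustration of the experiment's setup, not the MKL-based code actually measured:

    #include <stddef.h>

    /* Sequential initialization: thread 0 writes every element, so all
     * chunks start in state (M, {0}) and the later parallel reads pay
     * the coherence traffic of pulling them out of thread 0's caches. */
    void init_by_thread0(double *x, double *y, size_t n) {
        for (size_t i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }
    }

    /* Parallel initialization: each computing thread writes the chunk it
     * will later read, so every chunk already sits in that thread's cache
     * (same static schedule as the dot product below). */
    void init_by_computing_threads(double *x, double *y, size_t n) {
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }
    }

    double dotproduct(const double *x, const double *y, size_t n) {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum) schedule(static)
        for (size_t i = 0; i < n; i++)
            sum += x[i] * y[i];
        return sum;
    }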

B. Comparing Iterative Methods

Iterative Krylov methods are crucial algorithmic patterns for many numerical simulations, in particular for solving Partial Differential Equations. They are all built from the same building blocks (BLAS-1 and BLAS-2 functions) but exhibit different behaviors in terms of convergence (application dependent) and time per iteration (architecture dependent). We compare in the following the performance prediction and measurement of one iteration of different Krylov methods, solving the equation Ax = b where x is an unknown vector, b a constant vector and A a matrix. The methods considered are Conjugate Gradient Normal Residual (CGNR), Conjugate Residual Normal Error (CRNE) and Biconjugate Gradient Stabilized (BiCGSTAB) [11]. The code we use is adapted from a Lattice QCD simulation code [12] and uses OpenMP C code. The matrix A is an 8000 × 8000 sparse matrix of double-precision complex numbers, consisting of small blocks of dense matrices on the diagonal and subdiagonals. As we are focusing only on the time per iteration, the iterative methods are arbitrarily stopped after 200 iterations.

[Figure 9: Performance for one iteration of different iterative methods (CR, CGNR, CRNE and BiCGSTAB with 1, 2, 4 and 8 threads). Performance figures are normalized; 1 corresponds to the time of the best sequential method. Two bars are given for each method and thread count n, corresponding to the normalized time when using n threads: in black the measured time, in red the predicted time.]
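To give an idea of what "built from the same building blocks" means in practice, the sketch below shows the kind of BLAS-1/BLAS-2 kernels (dot product, axpy, block-sparse matrix-vector product) that one iteration of these methods chains together. It is a schematic illustration only, not the Lattice QCD code of [12], and the block-diagonal SpMV is a deliberately simplified stand-in:

    #include <complex.h>
    #include <stddef.h>

    /* BLAS-1: complex dot product, reduced through real/imaginary parts. */
    double complex zdot(const double complex *x, const double complex *y, size_t n) {
        double re = 0.0, im = 0.0;
        #pragma omp parallel for reduction(+:re,im)
        for (size_t i = 0; i < n; i++) {
            double complex p = conj(x[i]) * y[i];
            re += creal(p);
            im += cimag(p);
        }
        return re + im * I;
    }

    /* BLAS-1: y <- a*x + y. */
    void zaxpy(double complex a, const double complex *x, double complex *y, size_t n) {
        #pragma omp parallel for
        for (size_t i = 0; i < n; i++)
            y[i] += a * x[i];
    }

    /* BLAS-2-like kernel: y <- A*x for a matrix made of dense b*b blocks on
     * the diagonal (the real matrix also carries subdiagonal blocks). */
    void block_spmv(const double complex *A, const double complex *x,
                    double complex *y, size_t nblocks, size_t b) {
        #pragma omp parallel for
        for (size_t k = 0; k < nblocks; k++)
            for (size_t i = 0; i < b; i++) {
                double complex s = 0;
                for (size_t j = 0; j < b; j++)
                    s += A[(k * b + i) * b + j] * x[k * b + j];
                y[k * b + i] = s;
            }
    }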

All these methods use different access patterns and different working sets per thread, and have possible coherency issues (with the dot product in particular). The scalability of the time per iteration for the BiCGSTAB method is shown in Figure 8. The performance increases up to a maximum of 5 threads according to our model (6 according to the experimental results).

[Figure 8: Speedup of the Bi-Conjugate Gradient Stabilized (BiCGSTAB) method, predicted and measured, for 1 to 8 threads.]

The timings of the different methods are compared for different thread numbers. Figure 9 shows that the time prediction is accurate enough to predict how performance changes when parallelism increases, and how the different methods compare.

VI. Related Work

Several methods are commonly used to optimize software by observing and predicting performance. One is to simulate the full hardware, for instance with cycle-accurate simulators [13], [14]. Such predictions are very precise and permit the collection of large amounts of performance metrics. However they are time consuming and require a deep knowledge of the hardware in order to implement all architecture features, including prefetchers or cache replacement policies, with enough precision to provide cycle-accurate simulation. Developing such simulation software is a long process for each newly supported platform, and it highly depends on the hardly-available hardware documentation. However, such simulation is able to explain precisely what happens for a given code fragment. Our approach hides this complexity in the benchmarks and tries to remain portable by using a memory model that matches most widely-available modern processors and coherence protocols.

Profiling can also be used in order to record performance metrics for HPC application tuning. Profiling has the same drawbacks as simulation since it slows down application performance. Tools such as Valgrind or Cachegrind [15] can present an overhead of up to 100 times the normal program execution time. Our approach is only time consuming when running the benchmarks once for each platform. Again, these approaches are complementary to the technique we propose, since profiling and simulation are able to provide more details concerning the hardware mechanisms triggered during a code execution.

Hardware counters are another common tool for performance tuning. More elaborate tools have been developed in order to ease the use of hardware counters [16], [17]. The advantage of counters compared to simulators is that they are lightweight: there is no overhead aside from the library initialization. However, counters are not enough to optimize software. Indeed, once a bottleneck of the application is found (let us say, too many TLB misses), one needs a way to link the information back to the source code in order to tackle the problem. Also, as discussed in Section IV-B, the overhead of misses significantly varies with cache states.

Hackenberg et al. compared the cache coherence of real-world CPUs in [18]. They show that cache coherency and cache data states are to be taken into account when modeling the memory hierarchy. Williams et al. proposed an ingenious approach to modeling both memory and computation in order to predict the best achievable performance of a given code depending on its arithmetic intensity [19]. Ilic et al. extended the model to support caches and data reuse [20]. Compute-bound applications are handled, while we are not able to predict computation performance.

However, our model is able to better predict applications with heavy coherence traffic. This also allows us to point out that the bad performance of some applications can come from a huge overhead due to cache coherence. The references confirm the relevance of our approach for modeling memory access performance.

Our approach differs from all existing ones as the memory model is based on benchmarks. Benchmarks, especially the ones focusing on memory, have been developed in order to understand memory or application performance [21], [22]. They are a great way to understand architecture behavior, however they cannot be directly used to optimize software. Once our model is built for a given architecture, we are able to predict both software scalability and achievable memory bandwidth. By understanding the memory model or the predicted scalability, one can see whether performance is limited by memory contention or by a cache-coherence-unfriendly memory access pattern. Also, as our contribution is based on a model, there is no need to run the targeted application to predict its performance.

VII. Conclusion and Future Work

Code simulation and performance prediction become critical for performance analysis and software tuning. This paper presented an innovative model that predicts the performance of memory-bound OpenMP applications by composing the output of micro-benchmarks based on the state of data buffers in hardware caches. The model successfully predicts the execution time for several commonly-used application patterns, in particular those based on BLAS, and the effects of coherency and contention on performance. Additionally, our model incorporates the effects of compulsory and, to some extent, capacity misses. The systematic use of micro-benchmarks makes the model accurate without paying the cost of modeling complex hardware mechanisms such as prefetchers and cache coherency protocol implementations.

For future work, we plan to add support for multiple processor sockets. Losses of performance resulting from inter-socket interactions are ignored in our current model. While the benchmarks are able to measure threads bound to different sockets, we need to plug them into the model while avoiding the combinatorial explosion coming from the exploration of the different bindings. Besides, we are thinking of adding automatic ways to detect coherence issues, for instance by defining some metrics based on the model and looking at application feedback with hardware counters.

References

[1] F. Broquedis, J. Clet-Ortega, S. Moreaud, N. Furmento, B. Goglin, G. Mercier, S. Thibault, and R. Namyst, "hwloc: a Generic Framework for Managing Hardware Affinities in HPC Applications," in Intl. Conf. on Parallel, Distributed and Network-Based Processing. Pisa: IEEE, Feb. 2010, pp. 180–186.

[2] J. Xue and X. Vera, "Efficient and Accurate Analytical Modeling of Whole-Program Data Cache Behavior," IEEE Transactions on Computers, vol. 53, no. 5, pp. 547–566, 2004.

[3] C. Cascaval and D. A. Padua, "Estimating cache misses and locality using stack distances," in Intl. Conf. on Supercomputing. New York, NY, USA: ACM, 2003, pp. 150–159. [Online]. Available: http://doi.acm.org/10.1145/782814.782836

[4] D. Andrade, B. B. Fraguela, and R. Doallo, "Accurate prediction of the behavior of multithreaded applications in shared caches," Parallel Computing, vol. 39, no. 1, pp. 36–57, 2013.

[5] J. Lee, H. Wu, M. Ravichandran, and N. Clark, "Thread Tailor: Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications," in Intl. Symp. on Computer Architecture. New York: ACM, 2010, pp. 270–279. [Online]. Available: http://doi.acm.org/10.1145/1815961.1815996

[6] M. Papamarcos and J. Patel, "A low-overhead coherence solution for multiprocessors with private cache memories," SIGARCH Comput. Archit. News, vol. 12, no. 3, pp. 348–354, Jan. 1984. [Online]. Available: http://doi.acm.org/10.1145/773453.808204

[7] E. Zhang, Y. Jiang, and X. Shen, "Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs?" SIGPLAN Not., vol. 45, no. 5, pp. 203–212, Jan. 2010. [Online]. Available: http://doi.acm.org/10.1145/1837853.1693482

[8] B. Putigny, "Mbench: memory benchmarking framework for multicores," https://github.com/bputigny/mbench.

[9] J. McCalpin, "Memory Bandwidth and Machine Balance in Current High Performance Computers," IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, pp. 19–25, Dec. 1995.

[10] A. Pesterev, N. Zeldovich, and R. T. Morris, "Locating Cache Performance Bottlenecks Using Data Profiling," in European Conf. on Computer Systems. New York, NY, USA: ACM, 2010, pp. 335–348. [Online]. Available: http://doi.acm.org/10.1145/1755913.1755947

[11] Y. Saad, Iterative Methods for Sparse Linear Systems. SIAM, May 2007.

[12] D. Barthou, O. Brand-Foissac, R. Dolbeau, G. Grosdidier, C. Eisenbeis, M. Kruse, O. Pène, K. Petrov, and C. Tadonki, "Automated code generation for lattice quantum chromodynamics and beyond," CoRR, vol. abs/1401.2039, 2014.

[13] C. Hughes, V. Pai, P. Ranganathan, and S. Adve, "Rsim: Simulating Shared-Memory Multiprocessors with ILP Processors," Computer, vol. 35, no. 2, pp. 40–49, 2002.

[14] R. Covington, S. Dwarkada, J. R. Jump, J. B. Sinclair, and S. Madala, "The Efficient Simulation of Parallel Computer Systems," in Intl. J. in Comp. Simulation, 1991, pp. 31–58.

[15] N. Nethercote and J. Seward, "Valgrind: A Program Supervision Framework," in Workshop on Runtime Verification, 2003.

[16] P. Mucci, S. Browne, C. Deane, and G. Ho, "PAPI: A Portable Interface to Hardware Performance Counters," in Proc. of the Department of Defense HPCMP Users Group Conference, 1999, pp. 7–10.

[17] S. Jarp, R. Jurga, and A. Nowak, "Perfmon2: A leap forward in Performance Monitoring," J. Phys. Conf. Ser., vol. 119, p. 042017, 2008.

[18] D. Hackenberg, D. Molka, and W. E. Nagel, "Comparing Cache Architectures and Coherency Protocols on x86-64 Multicore SMP Systems," in Intl. Symp. on Microarchitecture. New York: ACM, 2009, pp. 413–422. [Online]. Available: http://doi.acm.org/10.1145/1669112.1669165

[19] S. Williams, A. Waterman, and D. Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures," Commun. ACM, vol. 52, no. 4, pp. 65–76, Apr. 2009. [Online]. Available: http://doi.acm.org/10.1145/1498765.1498785

[20] A. Ilic, F. Pratas, and L. Sousa, "Cache-aware Roofline model: Upgrading the loft," IEEE Computer Architecture Letters, vol. 99, no. RapidPosts, p. 1, 2013.

[21] J. Treibig, G. Hager, and G. Wellein, "Performance patterns and hardware metrics on modern multicore processors: Best practices for performance engineering," in Intl. Conf. on Parallel Processing, ser. Euro-Par'12. Berlin, Heidelberg: Springer-Verlag, 2013, pp. 451–460. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-36949-0_50

[22] J. McCalpin, "STREAM: Sustainable Memory Bandwidth in High Performance Computers," University of Virginia, Charlottesville, Virginia, Tech. Rep., 1991–2007, a continually updated technical report. http://www.cs.virginia.edu/stream/