SESH framework: A Space Exploration Framework for GPU Application and Hardware Codesign

Joo Hwan Lee
School of Computer Science
Georgia Institute of Technology
Atlanta, GA, USA
[email protected]

Jiayuan Meng
Leadership Computing Facility
Argonne National Laboratory
Argonne, IL, USA
[email protected]

Hyesoon Kim
School of Computer Science
Georgia Institute of Technology
Atlanta, GA, USA
[email protected]

Abstract—Graphics processing units (GPUs) have become increasingly popular accelerators in supercomputers, and this trend is likely to continue. With their disruptive architecture and a variety of optimization options, it is often desirable to understand the dynamics between potential application transformations and potential hardware features when designing future GPUs for scientific workloads. However, current codesign efforts have been limited to manual investigation of benchmarks on microarchitecture simulators, which is labor-intensive and time-consuming. As a result, system designers can explore only a small portion of the design space. In this paper, we propose the SESH framework, a model-driven codesign framework for GPUs that is able to automatically search the design space by simultaneously exploring prospective application and hardware implementations and evaluating potential software-hardware interactions.

I. INTRODUCTION

As demonstrated by the supercomputers Titan and Tianhe-1A, graphics processing units (GPUs) have become integral components in supercomputers. This trend is likely to continue, as more workloads are exploiting data-level parallelism and as their problem sizes increase.

A major challenge in designing future GPU-enabled systems for scientific computing is to gain a holistic understanding of the dynamics between the workloads and the hardware. Conventionally built for graphics applications, GPUs have various hardware features that can improve performance if carefully managed; however, GPU hardware designers may not be sufficiently informed about scientific workloads to evaluate specialized hardware features. On the other hand, since GPU architectures embrace massive parallelism with limited L1 storage, legacy codes must be redesigned in order to be ported to GPUs. Even codes for earlier GPU generations may have to be recoded in order to fully exploit new GPU architectures. As a result, an increasing effort has been made in codesigning the application and the hardware.

In a typical codesign effort, a set of benchmarks is proposed by application developers and is then manually studied by hardware designers in order to understand the potential. However, such a process is labor-intensive and time-consuming. In addition, several factors challenge system designers' endeavors to explore the design space. First, the number of hardware configurations is exploding as the complexity of the hardware increases. Second, the solution has to meet several design constraints, such as area and power. Third, benchmarks are often provided in a specific implementation, yet one often needs to attempt tens of transformations in order to fully understand the performance potential of a specific hardware configuration. Fourth, evaluating the performance of a particular implementation on future hardware may take significant time using simulators. Fifth, current performance tools (e.g., simulators, hardware models, profilers) investigate either hardware or applications in a separate manner, treating the other as a black box, and therefore offer limited insights for codesign.

To efficiently explore the design space and provide first-order insights, we propose SESH, a model-driven framework that automatically searches the design space by simultaneously exploring prospective application and hardware implementations and evaluating potential software-hardware interactions. SESH recommends the optimal combination of application optimizations and hardware implementations according to user-defined objectives with respect to performance, area, and power. The technical contributions of the SESH framework are as follows.

1) It evaluates various software optimization effects and hardware configurations using decoupled workload models and hardware models.
2) It integrates GPU performance, area, and power models into a single framework.
3) It automatically proposes optimal hardware configurations given multi-facet metrics in aspects of performance, area, and power.

We evaluate our work using a set of representative scientific workloads. A large design space is explored that considers both application transformations and hardware configurations. We evaluate potential solutions using various metrics, including performance, area efficiency, and energy consumption. Then, we summarize the overall insights gained from such space exploration.

The paper is organized as follows. In Section II, we provide an overview of our work. In Section III, we introduce the integrated hardware models for power and area. Section IV describes the space exploration process. Evaluation methodology and results are described in Sections V and VI, respectively. After the related work is discussed in Section VII, we conclude.

II. OVERVIEW AND BACKGROUND

The SESH framework is a codesign tool for GPU system designers and performance engineers. It recommends the optimal combination of hardware configurations and application implementations. Different from existing performance models or architecture simulators, SESH considers how applications may transform and adapt to potential hardware architectures.

A. Overall Framework

Fig. 1. Framework Overview. (Source code and workload input are abstracted, with user effort, into code skeletons that feed the SESH framework: a workload modeling & transformation engine, a hardware modeling & transformation engine, and a projection engine for performance, energy, and area, producing the optimal transformation & hardware.)

As Figure 1 shows, the major components of the SESH framework include (i) a workload modeling and transformation engine, (ii) a hardware modeling and transformation engine, and (iii) a projection engine. Using the framework involves the following work flow:

1) The user abstracts high-level characteristics of the source code into a code skeleton that summarizes control flows, potential parallelism, instruction mix, and data access patterns. An example code skeleton is shown in Figure 2. This step can be automated by SKOPE [1], and the user can amend the output code skeleton with additional notions such as for_all or reduction.
2) Given the code skeleton and the specified power and area constraints, SESH explores the design space by automatically proposing potential application transformations and hardware configurations.
3) SESH projects the energy consumption and execution time for each combination of transformations and hardware configurations, and recommends the best solution according to user-specified metrics, without manual coding or lengthy simulations. Such metrics can be a combination of power, area, and performance. For example, one metric can be "given an area budget, what would be the most power efficient hardware considering potential transformations?".

The SESH framework is built on top of existing performance modeling frameworks. We integrated GROPHECY [2] as the GPU workload modeling and transformation engine. We also adopted area and power models from previous work in the projection engine. Below we provide a brief description of them.

B. GROPHECY-Based Code Transformation Engine

GROPHECY [2], a GPU code transformation framework, has been proposed to explore various transformations and to estimate the GPU performance of a CPU kernel. Provided with a code skeleton, GROPHECY is able to transform the code skeleton to mimic various optimization strategies. Transformations explored include spatial and temporal loop tiling, loop fusion, unrolling, shared memory optimizations, and memory request coalescing. GROPHECY then analytically computes the characteristics of each transformation. The resulting characteristics are used as inputs to a GPU performance model to project the performance of the corresponding transformation. The best achievable performance and the transformations necessary to reach that performance are then projected.

Fig. 2. A pedagogical example of a code skeleton in the case of matrix multiplication (C = A*B). A code skeleton is used as the input to our framework. (The figure pairs the C source of the nested for loop with its skeleton: the array declarations float A[N][K], B[K][M] and float C[N][M], the loop space "forall i=0:N, j=0:M", and the loop body summarized by its instruction count.)

C. Hardware Models

We utilize the performance model from the work of Sim et al. [3]. We make it tunable to reflect different GPU architecture specifications. The tunable parameters are register file entry count, SIMD width, SFU width, L1D/SHMEM size, and LLC cache size. The model takes several workload characteristics as its inputs, including the instruction mix and the number of memory operations.

We utilize the work by Lim et al. [4] for chip-level area and power models. They model GPU power based on McPAT [5] and an energy introspection interface (EI) [6] and integrate the power-modeling functionality in MacSim [7], a trace-driven and cycle-level GPU simulator. McPAT enables users to configure a microarchitecture by rearranging circuit-level models. EI creates pseudo components that are linked to McPAT's circuit-level models and utilizes access counters to calculate the power consumption of each component. We use this model to estimate the power value for the baseline architecture configuration. Then we adopt simple heuristics to estimate the power consumption for different hardware configurations.
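Taken together, the workflow in Section II-A amounts to an exhaustive, model-driven search over the cross product of candidate transformations and hardware configurations. The Python sketch below illustrates that search loop under stated assumptions: the candidate lists, the project() function, and its toy cost formulas are hypothetical placeholders standing in for GROPHECY and the hardware models, not SESH's actual interfaces.

```python
from itertools import product

# Hypothetical candidate sets; in SESH these would come from GROPHECY's
# transformation engine and the tunable hardware knobs (Table I).
TRANSFORMATIONS = ["tile_16x16", "tile_16x16_shmem", "fused_shmem"]
HW_CONFIGS = [{"simd_width": w, "shmem_kb": s}
              for w in (16, 32, 64, 128) for s in (16, 48, 64, 128)]

def project(transform, hw):
    """Placeholder for the analytical models: returns projected
    cycles, chip power (W), and chip area (mm^2). Toy numbers only,
    so that the search below is runnable."""
    cycles = 1e6 / hw["simd_width"] + 1e4 * len(transform)
    power = 100.0 + 0.5 * hw["simd_width"] + 0.1 * hw["shmem_kb"]
    area = 500.0 + 1.0 * hw["simd_width"] + 0.2 * hw["shmem_kb"]
    return cycles, power, area

# User-specified objectives (Section IV); lower score is better.
objectives = {
    "time":      lambda c, p, a: c,
    "energy":    lambda c, p, a: p * c,   # power x time
    "perf/area": lambda c, p, a: c * a,   # minimizing cycle*area
}

def search(objective, area_budget=None):
    """Exhaustively evaluate every (transformation, hardware) pair
    and return the best one under the chosen objective."""
    best = None
    for t, hw in product(TRANSFORMATIONS, HW_CONFIGS):
        c, p, a = project(t, hw)
        if area_budget is not None and a > area_budget:
            continue  # respect the user's area constraint
        score = objectives[objective](c, p, a)
        if best is None or score < best[0]:
            best = (score, t, hw)
    return best

print(search("energy", area_budget=700.0))
```

The point of the sketch is the structure, not the numbers: because every design point is scored by cheap analytical projections rather than simulation, the full cross product can be swept in seconds.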
III. EXPLORATORY, MULTI-FACET HARDWARE MODEL

In order to explore the hardware design space and evaluate tradeoffs, the SESH framework integrates performance, power, and area models, all parameterized and tunable according to the hardware configuration. In this section, we first describe the process used to prepare the reference models for the NVIDIA GTX 580 architecture. These models include chip-level power and area models. We then integrate these models and parameterize them so that they can reflect changes in hardware configurations.

A. The Reference Model for Chip-Level Power

To get reference values for chip-level power consumption, we use the detailed power simulator described in Section II-C. As Hong and Kim [8] showed, the GPU power consumption is not significantly affected by the dynamic power consumption, except for the DRAM power values. Furthermore, the static power is often the dominant factor in on-chip power consumption. Hence, to get a first-order approximation of the GPU power model, we focus on the static power only.

To estimate the static power for a baseline architecture of NVIDIA GTX 580, we collect power statistics from a detailed power simulator [4]. The Sepia benchmark [9] is used as an example input for the simulation. Note that the choice of benchmark does not significantly affect the value of static power. In Sepia's on-chip power consumption, the static power consumes 106.0 W while the dynamic power consumes 50.0 W. The DRAM power is 52.0 W in Sepia, although it is usually more than 90.0 W in other benchmarks. As a result, the dynamic power accounts for 24.0% of Sepia's total power consumption of the chip and DRAM. Taking the dynamic power into account will be included in our future work.

The variation in power consumption caused by thermal changes is not modeled for the purpose of simplicity. As measured in [8], we assume a constant temperature of 340 K (67 °C) in the power model, considering that this operation-time temperature is higher than the cold-state temperature of 57 °C [8].

Figure 3 shows how much total power is consumed by each component according to our model. The GPU's on-chip power is decomposed into two categories: SM-private components and SM-shared components. The total on-chip power consumption is 156.0 W. SM-private components account for 93% (144.7 W) of the overall power, and shared components between SMs account for 7% (11.4 W). Of all the SM-private components we model, the EX ALUs and EX FPUs consume the most power (52.7% and 11.1%, respectively). SFU (17.6%), LLC (6.5%), and RF (3.7%) also account for large portions of the overall power consumption.

Fig. 3. Power consumption for non-DRAM components. (Major shares: EX_ALUs 52.7%, SFU 17.6%, EX_FPUs 11.1%, LLC 6.5%, RF 3.7%; the remaining components each consume under 2%.)

B. The Reference Model for Chip-Level Area

For the area model, we utilize the area outcome from the energy introspection interface integrated with MacSim [4]. The energy introspection interface in MacSim utilizes area size to calculate power. It also estimates area sizes for different hardware configurations. We use the area outcomes that are based on NVIDIA GTX 580.

Figure 4 (top) shows the area breakdown of GTX 580 based on our area model. The total area consumption of the chip is 3588.0 mm². LLC (61.7%) accounts for the majority of the chip area. Figure 4 (bottom) shows the breakdown of the area for SM-private components. The main processing part, the 32 SPs (EX ALUs and EX FPUs), occupies the largest portion of the area (69.9% and 2.6%, respectively). The L1D/SHMEM is the second largest module (12.0%). SFU (4.6%) and RF (2.2%) also account for a large portion of the area consumption.

Fig. 4. Area consumption for all non-DRAM components (top) and for SM-private components (bottom). (Top: LLC 61.7%, MC 26.3%, NoC 4.5%. Bottom: EX_ALUs 69.9%, L1D/SHMEM 12.0%, SFU 4.6%, EX_FPUs 2.6%, RF 2.2%.)

C. Integrated, Tunable Hardware Model

To estimate how changes in hardware configurations affect the overall power consumption, we employ a heuristic that the per-component power consumption and area scale linearly with the size and the number of components (e.g., doubling the shared memory size would double its power consumption and also its area). Given a target hardware configuration, we can compute the per-component area and power according to Equations (1) and (2), respectively, where knob refers to the size or number of the component in the corresponding architecture:

Area_target = Area_baseline × (knob_target / knob_baseline)      (1)

Power_target = Power_baseline × (knob_target / knob_baseline)    (2)

The baseline data is collected as described in Sections III-A and III-B. The per-component metrics are then aggregated to project the system-level area and power consumption.

According to our analysis in Figures 3 and 4, the major components consuming power and area include the register files, ALUs, FPUs, SFUs, the L1 cache, and the last-level cache. The quantities of these components become tunable variables, or knobs, in the integrated model. Table I lists the knobs and the values of knob_baseline in Equations (1) and (2). The area and power consumption of other components are approximated as constant values obtained from modeling NVIDIA GTX 580. These components are summarized in Table II.

TABLE I. HARDWARE COMPONENTS THAT ARE MODELED AS TUNABLE KNOBS.

Stage                Knob                       Default (NVIDIA GTX 580)
Per-SM  RF           Register file entry count  32,768 / SM
        ALU          SIMD width                 32 / SM
        FPU          SIMD width                 32 / SM
        SFU          SFU width                  4 / SM
        L1D/SHMEM    L1D size + SHMEM size      64 KB / SM
Shared  LLC          L2 cache size              768 KB

TABLE II. HARDWARE COMPONENTS THAT ARE MODELED WITH CONSTANT AREA AND POWER CONSUMPTION.

Category                          Stage
Per-SM (w/ fixed number of SMs)   Fetch, Decode, Schedule, Execution (except ALU, FPU, SFU), MMU, Const$, Tex$
Shared                            1 MemCon, 1 NoC, 1 DRAM

D. DRAM Power Model

DRAM power depends on memory access patterns; the number of DRAM row buffer hits/misses can affect power consumption noticeably. However, these numbers can be obtained only from detailed simulation, which often takes several hours for each configuration (100s of software optimizations × 10s of hardware configurations easily create 1000s of different configurations). To mitigate the overhead, we use a simple empirical approach to compute a first-order estimate of the DRAM power consumption:

P_DRAM = MaxDynP × (Trans_Intensity / Max_Trans_Intensity) + StatP    (3)

The total DRAM power (P_DRAM) is computed by adding up the static power (StatP) and the dynamic power [8]. The dynamic power is computed as a fraction of the maximum dynamic power (MaxDynP), which can be reached only in the theoretical condition where every instruction generates a DRAM transaction. The number of DRAM transactions per second on each SM is referred to as the DRAM transaction intensity, whose value is Max_Trans_Intensity under the aforementioned theoretical condition.

Trans_Intensity = #DRAM_Accesses / Exec_time    (4)

In this work, the actual DRAM transaction intensity, Trans_Intensity, is approximated by Equation (4). The total number of DRAM transactions per SM (#DRAM_Accesses) and the execution time in seconds (Exec_time) are estimated values given by the performance model.

In order to construct Equation (3) as a function of the workload characteristics, the values of StatP and MaxDynP / Max_Trans_Intensity need to be obtained as constant coefficients. We therefore use the power simulator to obtain the DRAM power for SVM and Sepia in the Merge benchmarks [9] and solve for the values of these two coefficients. Equation (5) represents the resulting DRAM model:

P_DRAM = α × Trans_Intensity + β    (5)

where α = 1.6 × 10⁻⁶ and β = 24.4.

IV. SPACE EXPLORATION

Application transformations and hardware configurations pose a design space that is daunting to explore. First, they are inter-related; different hardware configurations may prefer different application transformations. Second, there is a gigantic number of options. In our current framework, we explore each of them independently and then calculate which combination yields the desired solution. Note that this process is made possible largely because of the fast evaluation enabled by modeling.

The application transformations explored include spatial and temporal loop tiling, unrolling, shared memory optimizations, and memory request coalescing. The hardware configurations explored include SIMD width and shared memory size, which play significant roles in performance, area, and power. We plan to explore more dimensions in our future work.

To compare different solutions, we utilize multiple objective functions that represent different design goals. Those objective functions include the following:

1) Shortest execution time
2) Minimal energy consumption
3) Largest performance per area
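As a concrete illustration of the integrated model of Section III-C, the sketch below implements the linear scaling heuristic of Equations (1) and (2) and the aggregation into chip-level totals. The baseline knob values follow Table I; the per-component baseline area/power pairs and the constant terms are made-up placeholders, since the calibrated McPAT/EI reference values are not reproduced here.

```python
# Baseline knob values per Table I (NVIDIA GTX 580).
BASELINE_KNOBS = {
    "RF": 32768,        # register file entries / SM
    "ALU": 32,          # SIMD width / SM
    "FPU": 32,          # SIMD width / SM
    "SFU": 4,           # SFU width / SM
    "L1D/SHMEM": 64,    # KB / SM
    "LLC": 768,         # KB, shared across SMs
}

# Placeholder baseline (area mm^2, power W) per component; the real
# values come from the calibrated McPAT/EI reference models.
BASELINE_AREA_POWER = {name: (10.0, 5.0) for name in BASELINE_KNOBS}

def scale_component(name, knob_target):
    """Equations (1)-(2): per-component area and power scale linearly
    with the knob (size or count) of the component."""
    area_b, power_b = BASELINE_AREA_POWER[name]
    ratio = knob_target / BASELINE_KNOBS[name]
    return area_b * ratio, power_b * ratio

def project_chip(target_knobs, constant_area=100.0, constant_power=20.0):
    """Aggregate per-component metrics into system-level area/power.
    Components outside Table I contribute constant terms (Table II)."""
    area, power = constant_area, constant_power
    for name, knob in target_knobs.items():
        a, p = scale_component(name, knob)
        area += a
        power += p
    return area, power

# Doubling shared memory doubles its area and power contribution.
print(scale_component("L1D/SHMEM", 128))
```

With the placeholder baselines, doubling the 64 KB L1D/SHMEM knob to 128 KB simply doubles that component's (area, power) pair, which is exactly the first-order behavior the heuristic assumes.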

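The empirical DRAM model of Section III-D reduces to a couple of lines of code. The sketch below reuses the paper's fitted coefficients (α = 1.6 × 10⁻⁶, β = 24.4 W) and assumes, as in Equation (4), that the access count and execution time are estimates supplied by the performance model; the example numbers at the end are invented for illustration.

```python
ALPHA = 1.6e-6   # fitted slope: MaxDynP / Max_Trans_Intensity
BETA = 24.4      # fitted intercept: static DRAM power (W)

def trans_intensity(dram_accesses, exec_time_s):
    """Equation (4): DRAM transactions per second on an SM; both
    inputs are estimated values from the performance model."""
    return dram_accesses / exec_time_s

def dram_power(dram_accesses, exec_time_s):
    """Equations (3)/(5): first-order DRAM power estimate in watts."""
    return ALPHA * trans_intensity(dram_accesses, exec_time_s) + BETA

# Illustrative input: 2e7 transactions over 1 s of execution gives
# 24.4 W static plus about 32 W dynamic.
print(dram_power(2e7, 1.0))
```

Because the two coefficients fold MaxDynP and Max_Trans_Intensity together, the model needs no per-configuration DRAM simulation: any (transaction count, execution time) pair projected by the performance model maps directly to a power estimate.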
V. METHODOLOGY

A. Workloads

The benchmarks used for our evaluation and their key properties are summarized in Table III. HotSpot and SRAD are applications from the Rodinia benchmark suite [10]. Stassuij is extracted from a DOE INCITE application performing Monte Carlo calculations to study light nuclei [11], [12]. It has two kernels: IspinEx and SpinFlap. We separately evaluate each kernel of Stassuij and also evaluate both kernels together. The sizes of the matrices in Stassuij are according to real input data. To reduce the space of code transformations to explore, for each benchmark we set a fixed thread block size large enough to saturate a wide SIMD.

TABLE III. WORKLOAD PROPERTIES.

Benchmark   Key Properties                                                 Input Size
HotSpot     Structured grid. Iterative, self-dependent kernels. A deep     1024 × 1024
            dependency chain among dependent kernels
SRAD        Structured grid. Data dependency involves multiple arrays;     4096 × 4096
            each points to different producer iterations
IspinEx     Sparse linear algebra, A × B                                   A: 132 × 132, B: 132 × 2048
SpinFlap    Irregular data exchange similar to spectral methods            132 × 2048
Stassuij    Nested loops. Dependent parallel loops with different          -
            shapes. Dependency involves indirectly accessed arrays

HotSpot: HotSpot is an ordinary differential equation solver used in simulating microarchitecture temperature. It has a stencil computation kernel with a structured grid. Kernels are executed iteratively, and each iteration consumes a neighborhood of array elements. As a result, each iteration depends on the previous one. Programmers can utilize shared memory (ShM) by caching neighborhood data for inter-thread data sharing. Folding, which assigns multiple tasks to one thread, improves inter-thread data reuse by allowing a thread to process more neighborhood-gathering tasks. Fusing loop partitions across several iterations can be applied to achieve better locality and reduce global halo exchanges. We also provide a hint that indicates only one of the arrays used in double buffering is the necessary output for the fused kernel. In our experiments, we fuse two dependent iterations and use a 16 × 16 partition for the consumer loop. The thread block size is set to 16 × 16.

SRAD: SRAD performs speckle-reducing anisotropic diffusion on an image. It has two kernels: the first generates diffusion coefficients and the second updates the image. We use a 16 × 16 thread block size and indicate the output array that needs to be committed.

IspinEx: IspinEx is a sparse matrix multiplication kernel which multiplies a 132 × 132 sparse matrix of real numbers with a 132 × 2048 dense matrix of complex numbers. We provide a hint that the average number of nonzero elements in one row of the sparse matrix is 14 in order to estimate the overall workload size. Because the numbers of elements associated with different rows may vary, we force a thread to process all elements in columns to balance the workload among threads. We treat the real part and imaginary part of the complex number as individual numbers and employ a 1 × 64 thread block size in our evaluation. Due to the irregularity in sparse data accesses, we provide a hint about the average degree of coalescing, which is obtained from offline characterization of the input data.

SpinFlap: SpinFlap exchanges matrix elements in groups of four. Each group is scattered in a butterfly pattern in the same row, similar to spectral methods. Which columns are to be grouped together is determined by values in another array. SpinFlap is a memory-bounded kernel. By utilizing shared memory, programmers can expect performance improvement. There is data reuse by multiple threads on the matrix, which are used for indirect indices for other matrices. The performance can also be improved by folding. Performance is highly dependent on the degree of coalescing, and it varies according to the values of indirect indices. To assess the degree of coalescing, we profiled the average data stride of indirect indices and provide this hint to the framework. We assume a 12 × 16 thread block size with no folding.

Stassuij (Fused): Fusion increases the reuse in shared memory. But since the data dependency caused by indirect indices in SpinFlap requires IspinEx to be partitioned accordingly, the loop index in IspinEx now becomes a value pointed to by indirect accesses in the fused kernel, introducing irregular strides that can become un-coalesced memory accesses. We assume a thread block size of 16 × 4 × 2 and provide a hint that indicates the output array.

B. Evaluation Metric

To study how application performance is affected by code transformations and architectural changes, we utilize metrics from previous work [3] to understand the potential optimization benefits and the performance bottlenecks.

1) B_serial: Benefits of removing serialization effects such as synchronization and resource contention.
2) B_itilp: Benefits of increasing inter-thread instruction-level parallelism (ITILP). ITILP represents global ILP (ILP among warps).
3) B_memlp: Benefits of increasing memory-level parallelism (MLP).
4) B_fp: Benefits of improving computing efficiency. Computing efficiency represents the ratio of floating point instructions over the total instructions.

VI. EVALUATION

While we have explored a design space with both code transformations and hardware configurations, we present our results according to SIMD widths and shared memory sizes in order to shed more light on hardware designs. We model 64 transformations for HotSpot; 64 and 128 transformations for the two kernels of SRAD; and 576, 1728, and 2816 transformations for IspinEx, SpinFlap, and Stassuij (Fused), respectively.

A. SIMD Width

We evaluate different SIMD widths from 16 to 128. Figure 5 represents the execution time, energy consumption, possible benefits, and performance per area of the optimal transformation for HotSpot with increasing SIMD width. The performance-optimal SIMD width is 64, and the optimal SIMD width for minimal energy consumption and maximal performance per area is 32, which is the same as for NVIDIA GTX 580. The reason is that the increased inefficiency in power and area is bigger than the benefit of shorter execution time, even though minimal execution time helps reduce energy consumption in general. Performance increases from 16 to 64 but decreases from 64 to 128. The application becomes more memory bound with increased SIMD width, and the benefit of less computation time becomes smaller, which we can see from the increasing B_memlp.

Fig. 5. Execution time, energy consumption, possible benefits, and performance per area for the optimal transformation for HotSpot with increasing SIMD width. (Width 32 is perf/area- and energy-optimal; width 64 is cycle- and DRAM-energy-optimal.)

For HotSpot, SRAD, and Stassuij, the optimal transformation remains the same regardless of SIMD width and objective function. However, depending on SIMD width, the optimal transformation changes from 289 (16, 32, 64) to 370 (128) for IspinEx and from 513 (16, 32) to 1210 (64, 128) for SpinFlap. Figure 6 compares the optimal transformations with increasing SIMD width for IspinEx and SpinFlap. The optimal transformation on a narrow SIMD width is selected because it incurs less computation, even though it incurs more memory traffic due to un-coalesced accesses on a wide SIMD width than the optimal transformation on a wide SIMD width. However, the application becomes more memory bound with increased SIMD width, therefore reducing the benefit of less computation.

Fig. 6. Comparison between optimal transformations with increasing SIMD width for IspinEx (top) and SpinFlap (bottom).

The transformation IDs we use in this paper differ depending on the degree of loop tiling, loop fusion, unrolling, shared memory optimizations, and memory request coalescing. Figure 7 compares transformations 289 and 370 for IspinEx. The two have the same code structure; the only difference is the decision of which loads are cached. Transformation 370 utilizes shared memory for the load ld cr[njp][ir], while transformation 289 doesn't. The difference between transformations 513 and 1210 for SpinFlap is likewise which loads utilize shared memory.

Fig. 7. Comparison of transformations 289 & 370 for IspinEx. (The figure shows the IspinEx skeleton; the load ld cr[njp][ir] is the only line that differs between 289 and 370.)

The optimal SIMD width differs depending on the workload and objective function. Table IV presents the optimal SIMD width for HotSpot, SRAD, IspinEx, SpinFlap, and Stassuij regarding minimal execution time, minimal energy consumption, and maximal performance per area. We also find a strong correlation between minimal energy consumption and largest performance per area. Except for SpinFlap and Stassuij, the optimal SIMD width for minimal energy consumption and the one for largest performance per area are the same.

TABLE IV. OPTIMAL SIMD WIDTH REGARDING MINIMAL EXECUTION TIME, MINIMAL ENERGY CONSUMPTION, AND MAXIMAL PERFORMANCE PER AREA.

Benchmark       Performance   Energy   Perf/Area
HotSpot              64         32        32
SRAD (first)         32         32        32
SRAD (second)        32         16        16
IspinEx             128         16        16
SpinFlap            128         16       128
Stassuij             32         16        32

TABLE V. OPTIMAL SIMD WIDTH FOR SPINFLAP USING FIXED TRANSFORMATION OR OPTIMAL TRANSFORMATION ON EACH SIMD WIDTH.

Transformation   Performance   Energy   Perf/Area
Fixed                 16         16        16
Variable             128         16       128

Considering source code transformation or not changes the optimal SIMD width for SpinFlap. Table V compares the optimal SIMD width for SpinFlap when using the optimal transformation for NVIDIA GTX 580 versus using the optimal transformation for each SIMD width. The optimal SIMD width for performance and performance per area is 128 and the energy-optimal SIMD width is 16 when we consider source code transformation. However, 16 is the optimal SIMD width for all objective functions when we do not consider source code transformation and use the optimal transformation for NVIDIA GTX 580 instead. Figure 8 compares the execution time, energy consumption, and possible benefits for SpinFlap
Stassuij(Fusion) : MemorySIZE. Bound Benchmark 16 KB 48 KB 64 KB 128 KB B_ilp: B_fp: B_serial: B_memlp: energy: cycles: IspinEx 120 192 192 288 energy_dram: SpinFlapenergy: cycles:432 1304 1304 1728 0.05 300000 1 Stassuij 2240 60000002240 2240 2816 0.04 250000 0.03 200000 150000 4000000 0.02 100000

energy (J) 0.01 0.1 50000 cycle count 0 0 per SM for optimal transformations2000000 for these applications is 16(energy 32 64 128(perf/area, already less than 16 KB. Figure 9 represents the performance opmal) cycle, dram 0.01 per area for optimal transformation0 for HotSpot on increasing energy Transformation opmal) shared memory size.

B_ilp: B_fp: B_serial: B_memlp: energy: cycles: Shared_memory_per_SM(KB): 0.1 300000 100 0.08 250000 200000 80

0.06 (KB) 150000 0.04 60

100000 size

energy (J) 0.02 cycle count 50000 40 0 0 16(cycle, perf/ 32 64 128(dram 20 SHMEM area, energy energy 0 opmal) opmal) Transformation

Fig. 8. Execution time, energy consumption and possible benefits for optimal Fig. 10. Shared memory requirement for transformations for Stassuij. transformation for SpinFlap with increasing SIMD width: (top) considering source code transformation; (bottom) using fixed transformation. For HotSpot and SRAD, none of the transformations are disabled even when the shared memory size per SM is reduced In summary, increasing SIMD width helps performance. to 16 KB. Such is not the case for IspinEx, SpinFlap and But the benefit of large SIMD width degrades because of Stassuij. Figure 10 presents the shared memory requirement for increased inefficiency in power and area, and the application all transformations for Stassuij. Some transformation require becomes more memory bound with increased SIMD width. less than 20 KB of shared memory, but other transformations The optimal SIMD width is depends on workload objective require more than 80 KB of shared memory. Therefore the functions, with strong correlation between minimal energy number of valid transformations is different depending on consumption and largest performance per area. The optimal the shared memory size. Table VI presents the number of transformation changes depend on SIMD width for IspinEx transformations available for IspinEx, SpinFlap and Stassuij and SpinFlap. Considering source code transformation or not depending on shared memory size. changes the optimal SIMD width for SpinFlap.

B. Shared Memory Size

GPU's shared memory is a software-managed L1 storage on an SM. A larger shared memory would enable new transformations with more aggressive caching. We evaluate shared memory sizes from 16 KB to 128 KB; the values of all other parameters remain constant.

Fig. 9. Performance per area for optimal transformation for HotSpot with increasing shared memory size.

The optimal transformation for HotSpot, SRAD and Stassuij and their performance remain the same regardless of shared memory size. The reason is that the shared memory usage per SM for the optimal transformation is less than 9 KB. However, the optimal transformation changes depending on shared memory size for IspinEx and SpinFlap. New transformations become available as we increase the shared memory size. The optimal transformation changes from 281 (16 KB) to 289 (48, 64 KB) and 317 (128 KB) for IspinEx, and it changes from 513 (16, 48, 64 KB) to 756 (128 KB) for SpinFlap. Figure 11 compares the optimal transformations with increasing shared memory size for IspinEx and SpinFlap. The difference between those transformations is which loads utilize shared memory.

Fig. 11. Comparison between the optimal transformations when increasing shared memory size for IspinEx (top) and SpinFlap (bottom).

Figure 12 represents the execution time and energy consumption for the optimal transformations of IspinEx and SpinFlap with increasing shared memory size. When we consider code transformations, the optimal shared memory size for all objective functions for IspinEx is 128 KB; without considering transformations, however, the optimal shared memory size is 48 KB. For SpinFlap, when we consider source code transformations, the optimal shared memory size for energy and performance per area is 16 KB, and the performance-optimal shared memory size is 128 KB. Without considering transformations, the optimal shared memory size remains 16 KB for all objective functions.

Fig. 12. Execution time, energy consumption and possible benefits for optimal transformation on increasing shared memory size for IspinEx (top) and SpinFlap (bottom).

In summary, shared memory sizes determine the number of possible transformations in terms of how the shared memory is used. For IspinEx, SpinFlap and Stassuij, some transformations are disabled because of the limited shared memory size. These applications prefer either small shared memory or very large shared memory, as we can see in Figure 10. For IspinEx and SpinFlap, the optimal transformation changes depending on shared memory size since new transformations become available with increased shared memory size. Whether source code transformations are considered changes the optimal shared memory size for IspinEx and SpinFlap. The optimal transformation for Stassuij remains the same regardless of shared memory size since its shared memory usage is small.

C. Discussion

The lessons learned from our model-driven space exploration are summarized below.

1) For a given hardware, the code transformation with minimal execution time often leads to minimal energy consumption as well. This can be observed from Figure 13, which represents the execution time and energy consumption of possible transformations of each application on NVIDIA GTX 580.
2) The optimal hardware configuration depends on the objective function. In general, performance increases with more resources (wider SIMD width or bigger shared memory size). However, the performance benefit of more resources may be outweighed by their cost in terms of energy or area. Therefore, the SIMD width and shared memory size that are optimal for energy consumption and performance per area are smaller than those for performance. We also observe that the SIMD width and shared memory size that minimize energy consumption also maximize performance per area.
3) The optimal transformation can differ across hardware configurations. The variation in hardware configuration has two effects on code transformations: it shifts the performance bottleneck, or it enables/disables potential transformations because of resource availability. In the examples of IspinEx and SpinFlap, a computation-intensive transformation becomes memory-bound with wide SIMD, and a new transformation that requires large L1 storage is enabled when the shared memory size increases.
4) The optimal hardware configuration varies according to whether code transformations are considered. For example, when searching for the performance-optimal SIMD width for SpinFlap, the legacy implementation would suggest a SIMD width of 16, while the performance-optimal SIMD width is 128 if transformations are considered, which would perform 2.6× better. The optimal shared memory size would also change for IspinEx and SpinFlap by taking transformations into account.
5) In order to gain performance, it is generally better to increase the SIMD width rather than the shared memory size. A larger SIMD width increases performance at the expense of area and power consumption. A larger shared memory does not have a significant impact on performance until it can accommodate a larger working set, therefore enabling a new transformation; however, we found that a shared memory size of 48 KB is already able to accommodate a reasonably sized working set in most cases. This coincides with GPU hardware design trends from Tesla [13] to Fermi [14] and Kepler [15]. Moreover, we found that the energy-optimal shared memory size for all evaluated benchmarks is less than 48 KB, which is the value for the current Fermi architecture.

Our work can be improved to enable broader space exploration in less time. A main challenge in space exploration is the large number of possible transformations and architectural parameters. Instead of brute-force exploration, we plan to


build a feedback mechanism to probe the space judiciously. For example, if an application is memory bound, the SESH framework can try transformations that result in fewer memory operations but more computation, or hardware configurations with larger shared memory and smaller SIMD width.

Fig. 13. Execution time and energy consumption of possible transformations of each application on NVIDIA GTX 580. From top to bottom: HotSpot, SRAD (first), SRAD (second), IspinEx, SpinFlap and Stassuij (fused).

VII. RELATED WORK

Multiple software frameworks have been proposed to help GPU programming [16], [17]. Workload characterization studies [18], [19] and parallel programming models including PRAM, BSP, CTA and LogP [20], [21] are also relevant to our work. These techniques do not explore the hardware design space.

To the best of our knowledge, there has been no previous work studying the relationship between GPU code transformation and power reduction. Valluri et al. [22] performed a quantitative study of the effect of the optimizations performed by DEC Alpha's cc compiler. Brandolese et al. [23] explored source code transformation in terms of energy consumption using the SimpleScalar simulator.

Modeling power consumption of CPUs has been widely studied. Joseph's technique relies on performance counters [24]. Bellosa et al. [25] also used performance counters to determine the energy consumption and estimate the temperature for dynamic thermal management. Wu et al. [26] proposed utilizing the phasic behavior of programs to build a linear system of equations for component unit power estimation. Peddersen and Parameswaran [27] proposed a processor that estimates its own power/energy consumption at runtime. CAMP [28] used a linear regression model to estimate activity factor and power; it provides insights that relate microarchitectural statistics to activity factor and power. Jacobson et al. [29] built various levels of abstract models and proposed a systematic way to find a utilization metric for estimating power numbers and a scaling method to evaluate new microarchitectures. Czechowski and Vuduc [30] studied the relationship between architectural features and algorithm characteristics and proposed a modeling framework that can be used for tradeoff analysis of performance and power.

While architectural studies and performance improvements with GPUs have been explored widely, power modeling of GPUs has received little attention. A few works use a functional simulator for power modeling. Wang [31] extended GPGPUSim with Wattch and Orion to compute GPU power. PowerRed [32], a modular architectural power estimation framework, combined both analytical and empirical models; it also modeled interconnect power dissipation by employing area cost. A few GPU power modeling works use a statistical linear regression method on empirical data. Ma et al. [33] dynamically predicted the runtime power of an NVIDIA GeForce 8800 GT using recorded power data and a trained statistical model. Nagasaka et al. [34] used linear regression on information about the application collected from performance counters. A tree-based random forest method was used in the works by Chen et al. [35] and by Zhang et al. [36]. Since those works are based on empirical data obtained from existing hardware, they do not provide insights in terms of space exploration.

Simulators have been widely used to search the hardware design space. Genetic algorithms and regression have been proposed to reduce the search space by learning from a relatively small number of simulations [37]. Our work extends their work by considering code transformations, integrating area and power estimations, and employing models instead of simulations. Nevertheless, their learning-based approach is complementary to our approach and may help the SESH framework prune the space as well.

VIII. CONCLUSIONS

We propose the SESH framework, a model-driven framework that automatically searches the design space by simultaneously exploring prospective application and hardware implementations and evaluating potential software-hardware interactions. It recommends the optimal combination of application optimizations and hardware implementations according to user-defined objectives with respect to performance, area, and power. We explored the GPU hardware design space with different SIMD widths and shared memory sizes, and we evaluated each design point using four benchmarks, each with hundreds of transformations. The evaluation criteria include performance, energy consumption, and performance per area. Several codesign lessons were learned from the framework, and our findings point to the importance of considering code transformations in designing future GPU hardware.

REFERENCES

[1] J. Meng, X. Wu, V. A. Morozov, V. Vishwanath, K. Kumaran, V. Taylor, and C.-W. Lee, "SKOPE: A Framework for Modeling and Exploring Workload Behavior," Argonne National Laboratory, Tech. Rep., 2012.
[2] J. Meng, V. Morozov, K. Kumaran, V. Vishwanath, and T. Uram, "GROPHECY: GPU Performance Projection from CPU Code Skeletons," in SC '11.
[3] J. W. Sim, A. Dasgupta, H. Kim, and R. Vuduc, "GPUPerf: A Performance Analysis Framework for Identifying Performance Benefits in GPGPU Applications," in PPoPP '12.
[4] J. Lim, N. Lakshminarayana, H. Kim, W. Song, S. Yalamanchili, and W. Sung, "Power Modeling for GPU Architecture using McPAT," Georgia Institute of Technology, Tech. Rep., 2013.
[5] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, "McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures," in MICRO 42.
[6] W. J. Song, M. Cho, S. Yalamanchili, S. Mukhopadhyay, and A. F. Rodrigues, "Energy introspector: Simulation infrastructure for power, temperature, and reliability modeling in manycore processors," in SRC TECHCON '11.
[7] "MacSim," http://code.google.com/p/macsim/.
[8] S. Hong and H. Kim, "IPP: An Integrated GPU Power and Performance Model that Predicts Optimal Number of Active Cores to Save Energy at Static Time," in ISCA '10.
[9] M. D. Linderman, J. D. Collins, H. Wang, and T. H. Meng, "Merge: a programming model for heterogeneous multi-core systems," in ASPLOS XIII.
[10] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, and K. Skadron, "A performance study of general-purpose applications on graphics processors using CUDA," Journal of Parallel and Distributed Computing, 2008.
[11] S. C. Pieper and R. B. Wiringa, "Quantum Monte Carlo calculations of light nuclei," Annu. Rev. Nucl. Part. Sci. 51, 53, 2001.
[12] S. C. Pieper, K. Varga, and R. B. Wiringa, "Quantum Monte Carlo calculations of A=9,10 nuclei," Phys. Rev. C 66, 044310-1:14, 2002.
[13] NVIDIA Corporation, "GeForce GTX 280 specifications," 2008. [Online]. Available: http://www.nvidia.com/object/product_geforce_gtx_280_us.html
[14] NVIDIA, "Fermi: Nvidia's next generation compute architecture," http://www.nvidia.com/fermi.
[15] "NVIDIA's next generation CUDA compute architecture: Kepler GK110," NVIDIA Corporation, 2012.
[16] T. B. Jablin, P. Prabhu, J. A. Jablin, N. P. Johnson, S. R. Beard, and D. I. August, "Automatic CPU-GPU communication management and optimization," in PLDI '11.
[17] T. B. Jablin, J. A. Jablin, P. Prabhu, F. Liu, and D. I. August, "Dynamically managed data for cpu-gpu architectures," in CGO '12.
[18] K. Spafford and J. Vetter, "Aspen: A domain specific language for performance modeling," in SC '12.
[19] S. Williams, A. Waterman, and D. Patterson, "Roofline: an insightful visual performance model for multicore architectures," Communications of the ACM, 2009.
[20] L. G. Valiant, "A bridging model for parallel computation," Commun. ACM, 1990.
[21] R. M. Karp and V. Ramachandran, "A survey of parallel algorithms for shared-memory machines," EECS Department, University of California, Berkeley, Tech. Rep., 1988.
[22] M. Valluri and L. John, "Is Compiling for Performance == Compiling for Power," in INTERACT-5.
[23] C. Brandolese, W. Fornaciari, F. Salice, and D. Sciuto, "The Impact of Source Code Transformations on Software Power and Energy Consumption," Journal of Circuits, Systems, and Computers, 2002.
[24] R. Joseph and M. Martonosi, "Run-time power estimation in high performance microprocessors," in ISLPED '01.
[25] F. Bellosa, S. Kellner, M. Waitz, and A. Weissel, "Event-driven energy accounting for dynamic thermal management," in COLP '03.
[26] W. Wu, L. Jin, J. Yang, P. Liu, and S.-D. Tan, "A systematic method for functional unit power estimation in microprocessors," in DAC '06.
[27] J. Peddersen and S. Parameswaran, "CLIPPER: Counter-based Low Impact Processor Power Estimation at Run-time," in ASP-DAC '07.
[28] M. Powell, A. Biswas, J. Emer, S. Mukherjee, B. Sheikh, and S. Yardi, "CAMP: A technique to estimate per-structure power at run-time using a few simple parameters," in HPCA '09.
[29] H. Jacobson, A. Buyuktosunoglu, P. Bose, E. Acar, and R. Eickemeyer, "Abstraction and microarchitecture scaling in early-stage power modeling," in HPCA '11.
[30] K. Czechowski and R. Vuduc, "A theoretical framework for algorithm-architecture co-design," in IPDPS '13.
[31] G. Wang, "Power analysis and optimizations for GPU architecture using a power simulator," in ICACTE '10.
[32] K. Ramani, A. Ibrahim, and D. Shimizu, "PowerRed: A Flexible Power Modeling Framework for Power Efficiency Exploration in GPUs," in GPGPU '07.
[33] X. Ma, M. Dong, L. Zhong, and Z. Deng, "Statistical Power Consumption Analysis and Modeling for GPU-based Computing," in HotPower '09.
[34] H. Nagasaka, N. Maruyama, A. Nukada, T. Endo, and S. Matsuoka, "Statistical power modeling of GPU kernels using performance counters," in IGCC '10.
[35] J. Chen, B. Li, Y. Zhang, L. Peng, and J.-K. Peir, "Tree structured analysis on GPU power study," in ICCD '11.
[36] Y. Zhang, Y. Hu, B. Li, and L. Peng, "Performance and Power Analysis of ATI GPU: A Statistical Approach," in NAS '11.
[37] W. Wu and B. C. Lee, "Inferred Models for Dynamic and Sparse Hardware-Software Spaces," in MICRO '12.