A First Look at Integrated Gpus for Green High Performance Computing

Noname manuscript No. (will be inserted by the editor) A First Look at Integrated GPUs for Green High-Performance Computing T. R. W. Scogland, H. Lin and W. Feng Dept. of Computer Science Virginia Tech {tom.scogland, hlin2, wfeng}@vt.edu Received: date / Accepted: date Abstract counterpart, and it can still outperform the CPUs of a multicore system The graphics processing unit (GPU) has evolved from a single- at a fraction of the power. purpose graphics accelerator to a tool that can greatly accelerate the performance of high-performance computing (HPC) applications. Pre- Keywords: GPU, performance evaluation, power-aware computing, vious work evaluated the energy efficiency of discrete GPUs for compute- energy-efficient computing, green, embedded high-performance com- intensive scientific computing and found them to be energy efficient but very high power. In fact, a compute-capable discrete GPU can draw puting more than 200 watts by itself, which can be as much as an entire compute node (without a GPU). This massive power draw presents a se- rious roadblock to the adoption of GPUs in low-power environments, 1 Introduction such as embedded systems. Even when being considered for data cen- ters, the power draw of a GPU presents a problem as it increases the demand placed on support infrastructure such as cooling and available Through the mid-2000s, performance improvements for the supplies of power, driving up cost. With the advent of compute-capable general-purpose processor, also known as the CPU, were integrated GPUs with power consumption in the tens of watts, we be- driven by increasing clock frequencies (f) that had to be lieve it is time to re-evaluate the notion of GPUs being power-hungry. supported by correspondingly higher supply voltages to the In this paper, we present the first evaluation of the energy efficiency of integrated GPUs for green HPC. We make use of four spe- processor (V ). However, because the thermal design power cific workloads, each representative of a different computational dwarf, (TDP) of such processors is directly proportional to the clock and evaluate them across three different platforms: a multicore system, frequency f and the square of the supply voltage V , these a high-performance discrete GPU, and a low-power integrated GPU. higher frequencies and voltages led to increasingly higher We find that the integrated GPU delivers superior energy savings and a TM comparable energy-delay product (EDP) when compared to its discrete TDPs, as exemplified by the Itanium processor at 130 watts. This excessive power density led many to “brag” about the This work supported in part by the Department of Defense (DoD) fact that it approached the power density of a nuclear reactor, through the National Defense Science & Engineering Graduate Fel- as noted in (Feng et al, 2002; Hsu and Feng, 2005). lowship (NDSEG) Program and NSF I/UCRC Grant IIP-0804155 with equipment support from the Air Force Research Laboratory. Due in large part to the aforementioned power density, clock frequencies finally stalled around 3 GHz in the mid- T. R. W. Scogland 2202 Kraft Drive to-late 2000s, and instead, the number of cores started dou- Knowledge Works II Building bling every 24 months. Thus, performance improvements would no longer come from increased clock frequencies but H. Lin rather would come from harnessing the parallelism of multi- 2202 Kraft Drive core and many-core processors. Though the stalling of clock Knowledge Works II Building frequencies and advances in process technology (65 nm → Room 2219 45 nm → 32 nm) have kept the TDP at bay, despite the doubling of cores every 24 months, the TDP has resumed W. Feng R TM 2202 Kraft Drive its upward march with the dual-core Intel Xeon 7130 Knowledge Works II Building CPU at 150 watts and emerging graphics processing units Room 2209 (GPUs) for general-purpose computation such as the 1600- core AMD/ATI R RadeonTM5870 at 188 watts, the 240-core 2 NVIDIA R GTX 280 at 236 watts, and the 512-core NVIDIA R crete GPU, while still delivering better performance than Fermi at 295 watts. multicore CPUs at a fraction of their power consumption. Despite the arguably voracious power appetite of the These properties make compute-capable, integrated GPUs GPU, its popularity continues to grow quite rapidly because very promising for green HPC applications. it is a commodity solution that can deliver personal super- The remainder of the paper is organized as follows, we computing to the desktop for everything from a common present a background on the NVIDIA GPU architecture and desktop operating system (Munshi, 2008) to scientific ap- CUDATMas well as other related work in Section 2 and Sec- plications (Eastman and Pande, 2010; Sandes and de Melo, tion 3. Section 4 discusses the methodology we used for this 2010; Zhang et al, 2010; Tan et al, 2009; Anandakrishnan study, followed by an evaluation of the energy consumption et al, 2010; Hardy et al, 2009; Stone et al, 2007). However, and performance trade offs of a modern multicore architec- the adoption of the GPU in high-performance embedded ture and two popular versions of the GPU architectures in environments has been limited due to the aforementioned Section 5. Finally, Section 6 presents our conclusions and TDPs. In addition, future exascale computing environments Section 7 our plans for future work. face daunting power consumption issues, as detailed in (Kogge et al, 2008) — power issues that may be exacerbated by GPUs. 2 Background While the most common and most studied GPUs for general-purpose computation are large, dedicated, and dis- This section provides a brief overview of the CUDA archi- crete many-core processors with 240 to 3200 cores, their tecture, programming model, and compute capability. TDPs are similarly large. As a consequence, such discrete many-core GPUs cannot be used in embedded systems. 2.1 CUDA Architecture Recently, however, low-power GPUs have materialized for general-purpose computation as well. Such GPUs have The CUDA architecture consists of a set of multiproces- been designed as part of the chipset of the system mother- sors, each of which contains 8 scalar processors. Examples board, i.e., an integrated GPU and system chipset, or here- of such GPU devices include the 240-core NVIDIA GTX after referred to simply as an integrated GPU. This inte- 280 or the 16-core NVIDIA IONTMGPU. Meaning that the grated GPU provides full support for general-purpose com- NVIDIA GTX 280 has 30 multiprocessors and the TM putational acceleration via CUDA (NVIDIA, 2010) but at NVIDIA ION GPU only 2. a TDP that is an order of magnitude better (i.e., less) than While each core could execute a sequential thread, the a traditionally large, dedicated, and discrete many-core pro- cores generally execute in what NVIDIA calls single in- cessor such as the NVIDIA GTX 280. struction, multiple-thread (SIMT) fashion. That is, all cores Historically, integrated GPUs have been designed to pro- in the same group execute the same instruction at the same vide 2D and 3D rendering within a low power and energy time, much like SIMD. Where SIMT differs is in its ability budget, especially laptops and miniature PCs. Now that these to transparently handle divergent instructions. In addition, integrated GPUs also have compute capabilities, can they rather than scheduling at the granularity of a thread (as a deliver the energy efficiency of their larger brethren while standard CPU would) or at the granularity of 8 threads (as maintaining the low-power characteristics that made them one might expect from SIMD-like execution), CUDA sched- attractive in portable and embedded systems from the start? ules threads in groups of 32, known as warps, which execute Our results show that integrated GPUs with low TDPs pos- as a single unit across four time steps on a multiprocessor. sess the capacity for higher performance than their power The memory model is similarly different from standard draw suggests and provide a compelling performance com- system memory, but its structure is only tangentially relevant promise between traditional CPUs and full-sized GPUs with to the results in this paper. For more information on NVIDIA a power draw lower than either. GPUs and their architecture in general, see the CUDA pro- The main contribution of this paper is a performance gramming guide (NVIDIA, 2007). evaluation and demonstration of low-power integrated GPUs in enhancing performance and energy efficiency for green high-performance computing (HPC). Specifically, we com- 2.2 CUDA Programming Model pare the performance, power consumption, and energy efficiency for a set of scientific applications across three ac- Certain extensions are necessary to program the architecture celerated architectures: an 8-core CPU system, a 240-core described above within current programming languages. In discrete GPU, and a low-power integrated GPU. Our results the case of CUDA C was chosen and a set of language exten- show that the integrated GPU offers a comparable energy- sions have been released along with a compiler and various delay product (EDP) and superior energy savings over a dis- examples by NVIDIA. 3 The primary concept in CUDA is the kernel, which rep- by intercepting OpenGL calls. One of the hardware improve- resents a set of computations to be run in parallel on the ments investigated is dynamic voltage and frequency scaling GPU, and is specified much like a function in C. A kernel (DVFS) across multiple clock domains, which achieved an is a grid of blocks arranged in a 1 or 2 dimensional array, estimated power reduction of 10%. which are each one- or two-dimensional arrays of threads. Though little work has been done in direct relation to Threads in this case, while they are implemented differently, general-purpose computing on low-power GPUs, a great deal behave for consistency purposes the same as threads in a of effort has been expended on creating low-power embed- multicore or multiprocessor scenario.

A First Look at Integrated Gpus for Green High Performance Computing

CUDA by Example

Bitfusion Guide to CUDA Installation Bitfusion Guides Bitfusion: Bitfusion Guide to CUDA Installation

INTEGRA II € ,00 C269od

Powervr SGX Series5xt IP Core Family

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27

Parallel Architectures and Algorithms for Large-Scale Nonlinear Programming

Informatikum Számítástechnika Értékesítés És Szerviz 6000 Kecskemét, Petőfi Sándor Utca 3

Driver Riva Tnt2 64

(GPU) Computing

Massively Parallel Computing with CUDA

CUDA Flux: a Lightweight Instruction Profiler for CUDA Applications

A Configurable General Purpose Graphics Processing Unit for Power, Performance, and Area Analysis