Quantifying the Impact of GPUs on Performance and Energy Efficiency in HPC Clusters

Jeremy Enos, Craig Steffen, Joshi Fullop, Michael Showerman, Guochun Shi, Kenneth Esler, Volodymyr Kindratenko
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
Urbana, IL, USA
{jenos|csteffen|jfullop|mshow|gshi|esler|kindr}@ncsa.illinois.edu

John E. Stone, James C. Phillips
Theoretical and Computational Biophysics Group, Beckman Institute
University of Illinois at Urbana-Champaign
Urbana, IL, USA
{johns|jim}@ks.uiuc.edu

Abstract—We present an inexpensive hardware system for monitoring power usage of individual CPU hosts and externally attached GPUs in HPC clusters, and the software stack for integrating the power usage data streamed in real time by the power monitoring hardware with the cluster management software tools. We introduce a measure for quantifying the overall improvement in performance-per-watt for applications that have been ported to work on GPUs. We use the developed hardware/software infrastructure to demonstrate the overall improvement in performance-per-watt for several HPC applications implemented to work on GPUs.

Keywords-GPU; cluster; power monitoring; power efficiency

I. INTRODUCTION

One of the latest trends in the high-performance computing (HPC) community is to use application accelerators, such as graphical processing units (GPUs), to speed up the execution of computationally intensive codes. Several GPU-enhanced large-scale HPC clusters have been deployed recently, e.g., the Lincoln cluster at the National Center for Supercomputing Applications (NCSA) [1], and the scientific computing community is actively porting many codes to run on these systems. GPUs are known to consume a significant amount of power; e.g., a state-of-the-art NVIDIA Tesla C2050 compute-optimized GPU consumes up to 225 watts. The addition of one or more GPUs in each cluster node thus significantly alters the power requirements of the HPC systems in which they are deployed. A key question that must then be addressed is to what degree the additional power consumption of GPUs is offset by the application performance increases they provide, and whether they are a net gain or a net loss in terms of application performance per watt.

In this study, we evaluate the impact of GPU acceleration on performance and electrical power consumption using four production HPC applications as case studies. We introduce a measure that quantifies the overall improvement in performance-per-watt for the applications under test when comparing CPU-only and GPU-accelerated versions, and we describe measurement methodologies appropriate for typical HPC workloads.

We also describe two inexpensive hardware systems for monitoring the power consumption by the host and GPU nodes of a compute cluster and their integration with the compute cluster job management system. This enables users running applications on power-monitored cluster nodes to receive power consumption statistics at the end of the job execution and to examine the power consumption profiles of their applications.

To our knowledge, this is the first attempt to integrate power measurement and analysis tools with a GPU-enhanced HPC cluster and to supply the end user with the application power usage profile. Access to such information is vital to enable the development of power-aware applications, runtime systems that assist in power saving, and resource management schemas that optimize performance and power usage.

The work presented in this paper utilizes the accelerator cluster (AC) deployed at NCSA's Innovative Systems Laboratory (ISL) at the University of Illinois at Urbana-Champaign [1]. The AC cluster consists of 32 Hewlett Packard xw9400 workstations, each with two dual-core 2.4 GHz Opteron 2216 processors and 8 GB of memory (2 GB/core). Each node hosts an external Tesla S1070 unit containing four Tesla C1060 GPUs, resulting in 4 GPUs per node and a cumulative total of 128 GPUs across the entire cluster. The current AC cluster configuration includes power monitoring on one node and its attached Tesla S1070 unit.

The paper is organized as follows: In Section II we analyze related work; in Section III we present the power measurement hardware that we developed for use with the GPU cluster; in Section IV we describe the software infrastructure developed to collect data from the power monitoring sensors and correlate it with user applications executed on the GPU cluster under control of the cluster job management system; in Section V we introduce a measure that quantifies the overall improvement in performance-per-watt for applications using GPUs and compute its value for four test case applications; and finally in Section VI we discuss advantages and limitations of our solution and future enhancements to the developed system.
II. RELATED WORK

In [2], the authors use various GPU kernels that are known to stress a particular set of GPU resources, e.g., ALUs or memory, to characterize the power consumption of various GPU functional blocks using physical measurements. In [3], the authors go a step further by presenting a statistical GPU power consumption model. The proposed statistical regression model is based on the power consumption measured when executing various benchmarks and on detecting which GPU functional blocks are involved in their execution. Based on these observations, the proposed statistical model can be used to predict the power consumption of the target GPU for a given application.

In [4], the authors present an experimental investigation of the power and energy cost of GPU operations and a cost/performance comparison with a CPU-only system. The authors developed a server architecture that incorporated real-time energy monitoring of individual system components, such as the CPU, GPU, motherboard, and memory, and used it in conjunction with an application performing convolution with separable kernels to acquire experimental data. They show that using GPUs is more energy efficient when the application performance improvement is above a certain threshold.

In [5], the authors describe an empirical study in which they use the GPU-enabled GEM application as a test case to measure power consumption and energy efficiency during its execution on a GPU-based system.

In [20], the authors present a comprehensive hardware-software framework, called PowerPack, for performing an in-depth analysis of the energy consumption of parallel applications on a multi-core system. PowerPack's hardware consists of probes to measure the power consumption of individual system components, such as the CPU, memory, and disk. The software framework allows the power consumption data to be collected during the application run. This data can later be visualized using "power vs. time" plots.

While our work overlaps with some of the prior efforts, there are distinct differences. First, our power measurement hardware is designed to be very inexpensive and non-intrusive, enabling monitoring of individual nodes of large-scale systems at a minimal cost and with a minimal deployment overhead. Second, our data collection framework is fully integrated with the cluster management software, enabling any user to effortlessly collect data for their applications without any involvement on their side. Finally, the data presentation framework allows power usage data to be extracted for both the host and the GPU subsystems over an entire application run. The end result is a production-quality framework for characterizing the power efficiency of GPU-enabled applications on GPU-enhanced HPC clusters.

While power consumption measurements have previously been published for GPU-accelerated workloads, to date they have been limited to small microbenchmark kernels or single-GPU test cases. In this work we provide results for several complete scientific codes that run on HPC clusters. The selected applications use the message passing interface (MPI) for distributed application execution and CUDA for GPU-accelerated computation. This work also extends our previous effort on analyzing power usage for different stages of system operation, ranging from system boot, to idle nodes, to a full application run [1].

III. HARDWARE FOR POWER MONITORING

A. Current solution

Results presented in this paper were obtained with the help of a "Tweet-a-Watt" remote power monitor [12]. The "Tweet-a-Watt" is a wireless transmitting power monitor based on the "Kill-a-Watt" plug-in power monitor [13] and an RF-transmitting "XBee" remote sensor [14]. The XBee transmitter is integrated with the Kill-a-Watt to use its voltage signals and to draw power from its internal op-amp. We built three power-monitoring transmitters and a fourth XBee unit to receive the signals on a separate machine. The three power-monitoring XBee units were programmed via USB-serial registers to have separate addresses and to lower the frequency of their monitoring and data transmission. Since an XBee transmitting continuously draws far more power than the Kill-a-Watt has available to its internal circuitry, the XBee is configured to sleep for two-second intervals and only then transmit a packet of monitoring information.
Part of the Tweet-a-Watt kit is a large capacitor that stores up charge to power the XBee's transmission bursts without bringing down the internal power supply of the Kill-a-Watt. With the XBee and power-smoothing capacitor installed, the Tweet-a-Watt unit takes about 30 seconds to power up to the point that its display is readable.

In the Kill-a-Watt units that we purchased there was not a convenient place inside the case to put the XBee module and power capacitor without touching any active components. Therefore, we attached a generic electronics box to the side of the Kill-a-Watt's chassis and installed the XBee there, with a four-conductor ribbon cable between the two spaces to deliver power and the two voltage references. Figure 1 shows this arrangement.

The receiving XBee unit requires a script on the computer to communicate with it and read the data transmitted from the power monitors. We created a heavily modified version of the sample data-taking script from [12]. The Tweet-a-Watt software was created to upload its results live to Twitter (thus the name); we took the data collection code and modified it to upload the power use values to a database instead. Our Python script transmits the raw power monitor data and timestamps into entries in a database that is later used to match the power used with job information to create graphs of power used during a job. The monitoring scheme we have used in the prototype has been to use two of the power monitors to log the power used by one node in the AC cluster: one monitor reading the power used by the node itself, and the other monitoring the power drawn by its associated Tesla GPU accelerator unit.

The overall cost of the parts used to assemble the power monitoring unit and the corresponding receiving unit was under $100.
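The data-collection script itself is not reproduced in the paper; the following is a minimal sketch of the kind of collector described above, assuming the pyserial and sqlite3 modules and hypothetical packet and table layouts that differ from the production system.

```python
# Minimal sketch of a Tweet-a-Watt collector: read packets from the receiving
# XBee over USB-serial and store timestamped readings in a local database.
# The packet format, table, and column names here are hypothetical.
import sqlite3
import time

import serial  # pyserial

db = sqlite3.connect("power.db")
db.execute("""CREATE TABLE IF NOT EXISTS power_data (
                  sensor_id INTEGER, collected_at REAL,
                  volts REAL, amps REAL, raw TEXT)""")

port = serial.Serial("/dev/ttyUSB0", 9600, timeout=5)

while True:
    line = port.readline().decode(errors="replace").strip()
    if not line:
        continue                     # XBee sleeps ~2 s between packets
    try:
        sensor_id, volts, amps = line.split(",")[:3]   # assumed packet layout
    except ValueError:
        continue                     # skip malformed packets
    db.execute("INSERT INTO power_data VALUES (?, ?, ?, ?, ?)",
               (int(sensor_id), time.time(), float(volts), float(amps), line))
    db.commit()
```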
Figure 1. Internal arrangement of our Tweet-a-Watts; the XBee transmitter is in a generic box attached to the side of the Kill-a-Watt.

B. Development of a dedicated power monitoring unit

We have designed and built a new power monitoring unit that will be used with the same software and display infrastructure. The Tweet-a-Watt has a low rate of readout, it requires a monitoring device per single power circuit, and it is limited by the design of the Kill-a-Watt to a 20 amp, 120 volt circuit, while the typical system in our machine room runs on 208-volt power. The new power monitor is built in the form of a power distribution unit (PDU), with a C-20 power input port (208 V, 16 A), and distributes power to four C-13 power plugs (Figure 2).

The current drawn by each of the four output ports is monitored separately, and the other leg of the input is monitored as well to aid in cross-calibration and to check for stray unbalanced currents. Current monitoring is done with AC current transformer coils and burden resistors; the voltage readout of the coils is done by an Arduino Duemilanove unit [15] read out externally via USB. Each monitored current is coupled to its readout by an MN-220 current transformer from Manutech, Inc. [16] (Figure 3). Unlike most current pickup transformers, which are meant to measure the current for a full distribution panel, the MN-220 is physically very small and is designed for use with currents between 1 and 20 amps AC. These transformers are built without burden resistors, so we were able to set the current sensitivities to match the loads we wish to measure. The burden resistors can be changed to alter the current sensitivity range of each channel individually. The burden resistors are sized to maximize the dynamic range of measuring the target load without saturating the voltage monitoring range of the Arduino's inputs. The initial configuration has burden resistors configured for 600 watts on two of the output channels and 1000 watts on the other two channels, and is able to read the entire range of possible currents, up to 16 amps, on the opposite leg. The power distribution portion of the power monitor is wired with the same protections as a standard PDU would have, including a 16 amp double-pole circuit breaker to protect the wiring inside the unit.
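As a rough illustration of the burden-resistor sizing described above, the sketch below works through the arithmetic for one 600-watt channel. The 1000:1 transformer ratio is a hypothetical stand-in; the actual MN-220 ratio and the resistor values used in the unit are not given here.

```python
# Rough burden-resistor sizing for one monitored channel.
# The 1000:1 CT turns ratio is hypothetical; the 208 V mains, 600 W target
# load, and ~2.5 V half-range ADC swing come from the unit's description.
import math

MAINS_V = 208.0        # volts RMS
TARGET_W = 600.0       # full-scale load for this channel
TURNS_RATIO = 1000.0   # hypothetical primary:secondary current ratio
ADC_HALF_RANGE = 2.5   # volts: signal biased to 2.5 V, measured 0-5 V

i_primary_rms = TARGET_W / MAINS_V              # ~2.9 A RMS at full scale
i_secondary_rms = i_primary_rms / TURNS_RATIO   # current through the burden
i_secondary_peak = i_secondary_rms * math.sqrt(2)

# Largest burden resistor whose peak voltage still fits the ADC swing.
r_burden = ADC_HALF_RANGE / i_secondary_peak
print(f"full-scale primary current: {i_primary_rms:.2f} A RMS")
print(f"burden resistor upper bound: {r_burden:.0f} ohms")
```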

Figure 3. MN-220 current pickup transformers from Manutech, Inc. are used to monitor the current going through the power cables.

The output voltages from the pickup coils are measured by the analog voltage input channels of an Arduino microcontroller (Figure 4). One side of each transformer coil is anchored at 2.5 V above the Arduino's ground; the other is connected to a voltage monitoring channel, which in the default configuration measures a 10-bit value between ground and 5 volts above ground. We wrote a program for the Arduino that measures 2000 samples from each analog channel, computes the RMS total of each channel for the sampling period, and outputs the result as a current in milliamps in ASCII format. The output is transmitted via a USB-serial stream; the receiving computer only needs USB-serial capability to read the data.

Figure 2. New power monitor prototype.

We chose 2000 samples per reported value as a compromise between taking enough individual voltage samples to calculate a clean RMS value and few enough that the time reporting granularity stays small. At 2000 samples, the error of the current measurement was +/- 6 mA, or +/- 1 watt of output power. Monitoring 5 channels (4 outputs plus the total) at 2000 samples per report allows the Arduino to output a set of values every 1.4 seconds. The total cost of the parts used to assemble this unit was just under $200, or $50 per monitored power connector.
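The Arduino firmware itself is C code and is not reproduced in the paper; the following Python sketch only illustrates the per-channel RMS arithmetic it performs, with a hypothetical counts-to-milliamps calibration constant standing in for the value set by the burden resistor and transformer ratio.

```python
# Illustration of the per-channel RMS calculation the Arduino firmware
# performs (the real firmware is Arduino C). ADC counts are 10-bit (0-1023)
# around a 2.5 V bias; MA_PER_VOLT is a hypothetical calibration constant.
import math

ADC_BIAS_COUNTS = 512          # 2.5 V bias point on a 0-5 V, 10-bit scale
VOLTS_PER_COUNT = 5.0 / 1024.0
MA_PER_VOLT = 2000.0           # hypothetical: set by burden resistor and CT ratio

def channel_rms_ma(samples):
    """RMS current in milliamps for one reporting period (~2000 samples)."""
    acc = 0.0
    for s in samples:
        v = (s - ADC_BIAS_COUNTS) * VOLTS_PER_COUNT   # volts relative to bias
        acc += v * v
    return math.sqrt(acc / len(samples)) * MA_PER_VOLT

# Example: a synthetic sine whose RMS voltage across the burden is 0.5 V,
# i.e. about 1000 mA with the hypothetical scale factor above.
amp_counts = 0.5 * math.sqrt(2) / VOLTS_PER_COUNT
samples = [ADC_BIAS_COUNTS + amp_counts * math.sin(2 * math.pi * 60 * i / 2000)
           for i in range(2000)]
print(f"{channel_rms_ma(samples):.0f} mA")
```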

The burden resistors are sized to maximize the sensitivity given a power envelope. If a larger current is drawn than the burden resistor is set for, the power monitor will not be damaged, but the peaks of the voltage waveform will exceed the analog measurement range of the Arduino's input channels. The program that the Arduino runs tests for this and outputs an "OF" flag on the appropriate channel if that happens; this information is propagated on to the database so that the data is not used. The resulting current values will still approximately follow the actual current being drawn, but the reported current will be lower because the waveform used to calculate the current values will have been "clipped".

Figure 4. Arduino microcontroller inside the power monitoring unit.

We are currently in the process of calibrating the device and writing a script that reads the currents and outputs them into the power monitoring database, just like the existing Tweet-a-Watt monitors.

IV. SOFTWARE INTEGRATION

Regardless of the power monitoring device used, the collected data needs to be stored in a database and made accessible to the application user. Each power monitor has some associated software which streams, at the very least, raw data that can be assembled into power consumed in some way. The Tweet-a-Watt collector software we created streams computed voltage and current values, as well as the raw numbers used to compute them. The raw numbers are included in case we later need to make a calibration correction retroactively, and are not absolutely required.

The SQL database tables are essentially categorized into sensor configuration, collected power data, and the batch system (PBS, aka Torque) job data. Each individual sensor has a unique id assigned, along with a description of the devices attached to it and its expected collection interval (data rate). The expected collection interval for a sensor can be used for detecting gaps in data that may indicate a functional problem. The power data includes everything that is streamed by the sensor controller, as well as an association to the sensor id, the devices attached to that sensor, and the collection time. The job data is composed of the start and end time of any job which utilized a power-monitored resource exclusively, an association with a sensor or device, and some basic job details like user name and job name. With all sensors' data streams being continuously collected, any time span for any sensor can be extracted.
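The paper does not give the exact schema; the sketch below shows one plausible arrangement of the three table categories described above, with hypothetical column names (SQLite is used here only for brevity; the production database behind PBS/Torque may differ).

```python
# Plausible layout of the three table categories described above.
# Column names are hypothetical; the production schema is not specified.
import sqlite3

schema = """
CREATE TABLE sensor (
    sensor_id          INTEGER PRIMARY KEY,
    description        TEXT,   -- e.g. 'node host' or 'attached Tesla S1070'
    collect_interval_s REAL    -- expected data rate; used to spot gaps
);
CREATE TABLE power_data (
    sensor_id    INTEGER REFERENCES sensor(sensor_id),
    device       TEXT,         -- which attached device the reading is for
    collected_at REAL,         -- unix timestamp
    volts        REAL,
    amps         REAL,
    raw          TEXT          -- raw stream, kept for re-calibration
);
CREATE TABLE job (
    job_id     TEXT PRIMARY KEY,   -- PBS job identifier
    user_name  TEXT,
    job_name   TEXT,
    sensor_id  INTEGER REFERENCES sensor(sensor_id),
    start_time REAL,
    end_time   REAL                -- prologue estimate, corrected by epilogue
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(schema)
```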

The batch system (PBS) executes custom scripts before and after user job execution, the prologue and epilogue, respectively. In the prologue, a node makes an entry in the central SQL database if it detects that it hosts a power-monitored resource being used exclusively. The initial job end time record is set to the job start time plus the requested wall clock time, which the epilogue will reset more precisely when it runs. Additionally, if the job is to be power monitored, a link to a live, updated plot of the monitored resources relevant to the job is provided to the user in standard out.

Provisioning the monitor data to a user begins with the link provided to them in standard out, which presents a chart plotting their power usage of each device over time, with mouse-over data exposure, as well as under-the-curve totals for the job (Figure 5). This is accomplished using the Open Flash Chart PHP library. On an additional page, a user may view all their historical job profiles listed by job name and number, or download a CSV file per job if preferred in that form. Since the power monitoring setup on AC currently monitors a workstation and its Tesla S1070 GPU unit independently, a user can include either or both as part of their job and get a meaningful comparison of FLOP/watt efficiency between two separate runs, with and without GPU acceleration.

Figure 5. Example output.
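The prologue and epilogue are site-specific scripts and are not listed in the paper; the sketch below only illustrates the kind of bookkeeping the prologue described above performs, with hypothetical paths, table names, and plot URL.

```python
#!/usr/bin/env python
# Sketch of the bookkeeping a PBS/Torque prologue might do on a power-monitored
# node: record the job with an estimated end time, and print the plot URL.
# Paths, table names, and the walltime lookup are hypothetical.
import sqlite3
import sys
import time

DB_PATH = "/var/lib/powermon/power.db"   # hypothetical location
SENSOR_ID = 1                            # sensor attached to this node

def main():
    job_id, user, _group, job_name = sys.argv[1:5]   # Torque prologue arguments
    requested_walltime_s = 3600          # would be parsed from the resource list
    start = time.time()

    db = sqlite3.connect(DB_PATH)
    db.execute("INSERT INTO job VALUES (?, ?, ?, ?, ?, ?)",
               (job_id, user, job_name, SENSOR_ID,
                start, start + requested_walltime_s))  # epilogue refines end_time
    db.commit()

    # Appears in the job's standard out so the user can watch power live.
    print("Power profile: http://powermon.example/plot?job=%s" % job_id)

if __name__ == "__main__":
    main()
```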
V. APPLICATION CASE STUDIES

In order to evaluate the impact of GPU acceleration on performance and electrical power consumption, four applications (NAMD, VMD, QMCPACK, and MILC) were benchmarked on the AC cluster [1]. Performance and power measurements were made for CPU-only benchmark runs and for accelerated runs using both CPUs and GPUs. Since the AC cluster nodes use externally powered NVIDIA S1070 GPUs, the power consumption of the attached GPUs is measured independently from that of the host CPUs. With the exception of QMCPACK, for each application we compared the wall clock execution time of the CPU-only (not accelerated) run, t, against that of the GPU-accelerated run, ta, to derive the GPU acceleration speedup factor, s = t/ta. In the case of QMCPACK, the speedup is computed based on the number of walker generations per second. Using the speedup factor and the power measurements for the runs, p for the CPU-only run and pa for the GPU-accelerated run, we compute the overall improvement in performance-per-watt for the applications under test as e = (p/pa)*s. This measure shows how many times a GPU-accelerated implementation is more power-efficient than the CPU-only version of the same application. A summary of the final results for the four tested applications is presented in Table I.

In each of the test cases described below, power consumption was averaged over at least 20 samples. All of the test cases were based on abbreviated runs of representative problems, with sufficient runtime to facilitate accurate power measurement. One issue that arises with relatively short test runs is that the power measurements at the very start and end of the run must typically be discarded, as one or more samples may occur before the host CPUs and attached GPUs have awoken from reduced power-saving modes. It is also frequently the case that several of the first and last power measurements need to be discarded, as these may occur during application startup phases that represent a disproportionately high fraction of the runtime in a short run, whereas they would be a vanishingly small fraction of the runtime (and power consumption) for a full-length simulation. Since there is presently no mechanism for instrumenting application codes such that they record timestamps marking the start and end of particular application phases, we manually filtered or discarded power measurement samples known to be associated with application startup or shutdown phases, so that averaged power results would be representative of full-length simulations. In the future, we hope to use program instrumentation to automate the annotation and selection of power measurement data associated with different phases of application execution.

A. NAMD

NAMD is a widely used parallel molecular dynamics simulation package based on the Charm++ parallel programming environment [7], [6], with support for GPU acceleration [11], [9]. NAMD performance and power consumption were measured for "STMV", a 1-million-atom virus simulation representative of the size of simulations well suited to GPU clusters. Performance was benchmarked in terms of seconds per simulation timestep.

The CPU-only tests ran in 6.6 seconds per timestep, and the GPU-accelerated tests ran in 1.1 seconds per timestep, yielding a speedup factor of 6. Power measurements were made for the benchmark runs, excluding the brief startup and finalization stages. The CPU-only power consumption averaged 316 watts over the course of the benchmark. The GPU-accelerated benchmarks yielded an average GPU power consumption of 391 watts and an average CPU power consumption of 290 watts. Thus, for the NAMD GPU STMV run, the performance/watt ratio is 316 / (290 + 391) * 6 = 2.78. The NAMD GPU implementation is thus 2.78 times more power-efficient than the CPU-only version for the benchmarked STMV test case.

B. VMD

VMD is a popular molecular visualization and analysis tool that runs on both desktop workstations and HPC clusters. VMD makes use of GPU acceleration for computationally demanding tasks such as the computation of electrostatic potential maps and the visualization of molecular orbitals [8], [10], [11]. VMD performance and power consumption were measured for the computation of a time-averaged electrostatic potential field for a 685,000-atom STMV trajectory, using the multilevel summation method [10]. Each VMD MPI rank computed the time-averaged electrostatic potential map for 10 trajectory frames.

The CPU-only test case ran in 1,465.2 seconds and the GPU-accelerated test ran in 57.5 seconds, yielding a GPU speedup factor of 25.5. The CPU-only power consumption averaged 299 watts. The GPU-accelerated tests gave an average GPU power consumption of 433 watts and an average CPU power consumption of 309 watts. For the VMD benchmark, the performance/watt ratio is 299 / (309 + 433) * 25.5 = 10.48. The GPU-accelerated VMD is 10.48 times more power-efficient than the CPU-only version for the benchmarked STMV test case.
C. QMCPACK

Quantum Monte Carlo (QMC) [18] is a class of methods for solving the many-body problem of interacting quantum particles. Because it combines very high accuracy with extreme parallel scalability, it has become a major user of leadership computing facilities. Recently we have ported a QMC simulation suite, QMCPACK, to run on the CUDA platform. For our typical problem sizes we have observed speedups ranging from 10x to 15x over the performance of quad-core CPUs from the same generation. For the present study, we have simulated a bulk diamond crystal in a 128-atom simulation cell with 512 valence electrons, first using CPUs only, then using both CPUs and GPUs. Note that the CPU-only version uses purely double precision, while the GPU version uses primarily single precision, reserving double precision for select operations. We have found that the GPU version reproduces the double-precision results within the statistical accuracy of the results. We measure the simulation throughput in Monte Carlo (MC) generations per unit time, as well as the average power consumed by the CPUs and GPUs. In both cases, we propagate several MC states, called "walkers", in parallel in a manner appropriate to each platform. Our test runs employed two forms of QMC, variational Monte Carlo and diffusion Monte Carlo. Since the speedups and power usage were very similar in each case, we report only the figures for DMC, which usually dominates the total run time.

The CPU-only test case ran at a rate of 1.16 walker generations per second, while the GPU-accelerated test ran at a rate of 71.1 walker generations per second, yielding a speedup factor of 61.5. Power measurements were made for the benchmark runs, excluding the brief startup. The CPU-only power consumption averaged 314 watts, while the GPU-accelerated power consumption averaged 269 and 584 watts for the CPUs and GPUs, respectively. Thus, for the QMCPACK benchmark, the resulting performance/watt ratio is 314 / (269 + 584) * 61.5 = 22.6. The QMCPACK results demonstrate that the GPU-accelerated version is over 22 times more power-efficient than the CPU-only version.

D. MILC

The MIMD Lattice Computation (MILC) code [17], a Quantum Chromodynamics (QCD) application used to simulate four-dimensional SU(3) lattice gauge theory, is one of the largest compute cycle users at the national supercomputing centers. Advances in hardware and algorithm development have led to a remarkable growth of Lattice QCD, and calculations have become so precise that it is becoming necessary to take into account effects not only from the strong force, but from electromagnetism as well [19]. In this test, we use a GPU-accelerated package that includes the quantum electrodynamics effects in the lattice computation. MILC performance and power consumption are measured on an input lattice of size 28^3 x 96 with pre-computed spinors and gauge links.

Since the GPU-accelerated MILC only supports running on one GPU at this stage, the performance for the CPU-only and accelerated MILC code is from running on one CPU core, or one CPU core and one GPU, whereas the power consumption data is for all four CPU cores and all four GPUs. With that in mind, the CPU-only test ran in 77,324 seconds and the GPU-accelerated test ran in 3,881 seconds, giving a 19.9x speedup. The average host power usage during the CPU run is 225 watts, while the average host and GPU power usage is 222 watts and 332 watts, respectively. Therefore, the overall efficiency improvement for the MILC application is 225 / (222 + 332) * 19.9 = 8.1, compared with the CPU-only version.

TABLE I. IMPROVEMENT IN PERFORMANCE-PER-WATT FOR THE FOUR CONSIDERED APPLICATIONS

Application   t (sec)    ta (sec)   s      p (watt)   pa (watt)   e
NAMD          6.6        1.1        6      316        681         2.78
VMD           1,465.2    57.5       25.5   299        742         10.48
QMCPACK       -          -          61.5   314        853         22.6
MILC          77,324     3,881      19.9   225        555         8.1
(For QMCPACK, s is derived from walker generation rates rather than wall clock times, so t and ta are not listed.)

One may argue that the power efficiency measure shown in Table I can be simply estimated based on the host and GPU specifications. Such an estimation, however, is not that simple and is not accurate. The S1070 GPU system is rated at 800 W. The HP xw9400 workstation has a 1050 W power supply, but its actual power consumption depends on the processor type, the amount of memory, etc. Using the maximum spec numbers thus will not yield a meaningful value of e.
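The entries in Table I, by contrast, follow directly from the measurements reported in this section; the snippet below simply recomputes e = (p/pa)*s from those published numbers, with no new data.

```python
# Recompute e = (p / pa) * s from the measurements reported in Section V.
# pa is the sum of the measured host and GPU power for the accelerated run.
def efficiency_gain(p_cpu_only, p_host_accel, p_gpu_accel, speedup):
    return p_cpu_only / (p_host_accel + p_gpu_accel) * speedup

cases = {                    # p (W), host pa (W), GPU pa (W), s
    "NAMD":    (316, 290, 391,  6.0),
    "VMD":     (299, 309, 433, 25.5),
    "QMCPACK": (314, 269, 584, 61.5),
    "MILC":    (225, 222, 332, 19.9),
}
for name, args in cases.items():
    print(f"{name:8s} e = {efficiency_gain(*args):.2f}")
# NAMD -> 2.78, QMCPACK -> 22.6, MILC -> 8.1; the VMD value lands near 10.3
# with these rounded inputs rather than the 10.48 quoted in Table I.
```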
VI. DISCUSSION AND FUTURE WORK

Figure 6 shows an analytical model for the performance-per-watt improvement as a function of the application speedup when using GPUs to accelerate the execution of the code. Three different scenarios of host and host+GPU power consumption are considered: 1) 300 watts drawn by the host for the non-accelerated application and 600 watts drawn by the host and GPU combined for the GPU-accelerated application, 2) 300 watts versus 700 watts, and 3) 300 watts versus 800 watts. Two main observations can be made from this plot: 1) the higher the host+GPU power consumption, the lower the achievable performance-per-watt improvement is, and 2) there is a threshold value of the speedup at which the performance-per-watt improvement becomes larger than one. For the power consumption examples shown in Figure 6, this threshold value is between 2x and 3x. This indicates that achieving only a small speedup, e.g., 2-3x, when using a GPU to accelerate the application does not lead to a power-efficient execution of the application. On the other hand, achieving speedups over 3x results in power savings.

Our current power monitoring approach suffers from limitations, some of which will be removed with the new power monitoring hardware, while some will still remain. One such limitation is due to the limited ability to monitor each GPU individually. The NVIDIA Tesla S1070 unit contains four Tesla C1060 GPUs, all powered by the same power supply. Thus, we can only reliably monitor applications that use all 4 GPUs. If some GPUs are not used, we do not currently have a reliable way to subtract the amount of power they consume when idling. Similarly, the power consumption of idle CPU cores is also included in the measurements. On the other hand, we obtain the true power consumption of the system when running an application, without regard to the application's ability to efficiently utilize all the available resources.
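The analytical model behind Figure 6 follows directly from the definition of e: for a CPU-only host power p and a combined accelerated power pa, e(s) = (p/pa)*s, so the break-even speedup (e = 1) is s = pa/p. A short sketch of the three scenarios discussed above:

```python
# Analytical model from Figure 6: e(s) = (p / pa) * s for three power scenarios.
# Break-even (e = 1) occurs at s = pa / p, i.e. 2.0x, 2.33x, and 2.67x here.
P_HOST = 300.0                      # watts, CPU-only run
SCENARIOS = [600.0, 700.0, 800.0]   # watts, host + GPU for the accelerated run

def improvement(speedup, p_accel, p_host=P_HOST):
    return p_host / p_accel * speedup

for p_accel in SCENARIOS:
    threshold = p_accel / P_HOST
    print(f"{P_HOST:.0f}-{p_accel:.0f} W scenario: break-even speedup "
          f"{threshold:.2f}x, e at 10x = {improvement(10, p_accel):.1f}")
```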

Figure 6. Performance-per-watt as a function of the speedup factor for three different power consumption scenarios.

There are a number of enhancements we are considering for the future, pending user feedback and resource availability. These include:

• Additional sensors: Currently, only the Tweet-a-Watt hardware is fully implemented and integrated into the cluster. The Arduino solution has the advantages of greater accuracy, greater collection granularity, and more flexible voltage support. We also have an ever increasing variety of hardware to monitor and compare.

• Database optimization: The current database implementation has some prototype limitations in its design. We may also consider reducing data collection to job timeframes only, instead of continuous collection, depending on the performance impact and resource requirements imposed by additional sensors with higher data rates.

• Support for user-controlled tagging: Depending on user feedback, we may implement a call available to users which would arbitrarily flag a power data record for more convenient identification of various application stages on the graphical power consumption plot.

VII. CONCLUSIONS

In this work, we presented hardware, software, and a methodology for characterizing the performance-per-watt advantage of GPU-based systems. The results for the four tested applications indicate that although GPUs significantly increase power consumption, the acceleration they provide results in a reduction of the overall energy consumption. In some cases, GPU-accelerated codes are an order of magnitude more power-efficient than their CPU-only versions. We also show how power efficiency depends on the application speedup.

ACKNOWLEDGMENT

This work utilized the AC cluster [1] operated by the Innovative Systems Laboratory (ISL) at the National Center for Supercomputing Applications (NCSA) at the University of Illinois. The cluster was funded by NSF SCI 05-25308 and CNS 05-51665 grants, along with generous donations of hardware from NVIDIA, Nallatech, and AMD. NAMD and VMD development is supported by the National Institutes of Health under grant P41-RR05969.

REFERENCES

[1] V. Kindratenko, J. Enos, G. Shi, M. Showerman, G. Arnold, J. Stone, J. Phillips, W. Hwu, "GPU Clusters for High-Performance Computing," Proc. IEEE International Conference on Cluster Computing, Workshop on Parallel Programming on Accelerator Clusters, Dec. 2009, doi: 10.1109/CLUSTR.2009.5289128.
[2] S. Collange, D. Defour, A. Tisserand, "Power consumption of GPUs from a software perspective," G. Allen et al. (Eds.): ICCS 2009, Part I, LNCS 5544, pp. 914-923, 2009.
[3] X. Ma, M. Dong, L. Zhong, Z. Deng, "Statistical power consumption analysis and modeling for GPU-based computing," Proc. ACM Workshop on Power Aware Computing and Systems (HotPower), co-located with SOSP, October 2009.
[4] M. Rofouei, T. Stathopoulos, S. Ryffel, W. Kaiser, M. Sarrafzadeh, "Energy-Aware High Performance Computing with Graphic Processing Units," Workshop on Power Aware Computing and Systems (HotPower 2008), San Diego, December 8-10, 2008.
[5] S. Huang, S. Xiao, W. Feng, "On the energy efficiency of graphics processing units for scientific computing," Proc. 2009 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), IEEE Computer Society, Washington, DC, pp. 1-8, May 2009.
[6] L. V. Kale and S. Krishnan, "Charm++: Parallel Programming with Message-Driven Objects," in Parallel Programming using C++, G. V. Wilson and P. Lu, Eds., MIT Press, 1996, pp. 175-213.
[7] J. C. Phillips, R. Braun, W. Wang, J. Gumbart, E. Tajkhorshid, E. Villa, C. Chipot, R. D. Skeel, L. Kale, and K. Schulten, "Scalable molecular dynamics with NAMD," J. Comp. Chem., vol. 26, pp. 1781-1802, 2005.
[8] C. Rodrigues, D. Hardy, J. Stone, K. Schulten, W. Hwu, "GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications," in CF'08: Proc. 2008 Conference on Computing Frontiers, pp. 273-282, New York, NY, USA, 2008, ACM.
[9] J. Phillips, J. Stone, K. Schulten, "Adapting a Message-Driven Parallel Application to GPU-Accelerated Clusters," in SC'08: Proc. 2008 ACM/IEEE Conference on Supercomputing, pp. 1-9, Piscataway, NJ, USA, 2008, IEEE Press.
[10] D. Hardy, J. Stone, K. Schulten, "Multilevel Summation of Electrostatic Potentials Using Graphics Processing Units," Parallel Computing, 28:164-177, 2009.
[11] J. Stone, J. Saam, D. Hardy, K. Vandivort, W. Hwu, K. Schulten, "High Performance Computation and Interactive Display of Molecular Orbitals on GPUs and Multi-core CPUs," Proc. 2nd Workshop on General-Purpose Processing on Graphics Processing Units, ACM International Conference Proceeding Series, vol. 383, pp. 9-18, 2009.
[12] Tweet-a-Watt, http://www.ladyada.net/make/tweetawatt/
[13] P3 Kill A Watt Electricity Monitor, http://www.p3international.com/products/special/P4400/P4400-CE.html
[14] XBee & XBee-PRO 802.15.4 OEM RF Modules, http://www.digi.com/products/wireless/point-multipoint/xbee-series1-module.jsp
[15] Arduino, http://www.arduino.cc/
[16] MN220 AC Current Sense Transformers, http://www.manutech.us/_Pdf/out/MN220.pdf
[17] The MIMD Lattice Computation (MILC) Collaboration, http://www.physics.utah.edu/~detar/milc/
[18] J. Kim, K. Esler, J. McMinis, B. Clark, J. Gergely, S. Chiesa, K. Delaney, J. Vincent, D. Ceperley, QMCPACK simulation suite, http://www.mcc.uiuc.edu/qmcpack
[19] S. Basak, A. Bazavov, C. Bernard, C. DeTar, W. Freeman, S. Gottlieb, U.M. Heller, J.E. Hetrick, J. Laiho, L. Levkova, J. Osborn, R. Sugar, D. Toussaint, "Electromagnetic splittings of hadrons from improved staggered quarks in full QCD," Proc. XXVI International Symposium on Lattice Field Theory, Williamsburg, Virginia, July 2008, PoS LATTICE2008:127, 2008.
[20] R. Ge, X. Feng, S. Song, H. Chang, D. Li, K. Cameron, "PowerPack: energy profiling and analysis of high-performance systems and applications," IEEE Transactions on Parallel and Distributed Systems, vol. 21, no. 5, pp. 658-671, 2010.