Quantifying the Impact of GPUs on Performance and Energy Efficiency in HPC Clusters

Jeremy Enos, Craig Steffen, Joshi Fullop, Michael Showerman, Guochun Shi, Kenneth Esler, Volodymyr Kindratenko
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
Urbana, IL, USA
{jenos|csteffen|jfullop|mshow|gshi|esler|kindr}@ncsa.illinois.edu

John E. Stone, James C. Phillips
Theoretical and Computational Biophysics Group, Beckman Institute
University of Illinois at Urbana-Champaign
Urbana, IL, USA
{johns|jim}@ks.uiuc.edu

Abstract—We present an inexpensive hardware system for monitoring power usage of individual CPU hosts and externally attached GPUs in HPC clusters, and the software stack for integrating the power usage data streamed in real time by the power monitoring hardware with the cluster management software tools. We introduce a measure for quantifying the overall improvement in performance-per-watt for applications that have been ported to work on GPUs. We use the developed hardware/software infrastructure to demonstrate the overall improvement in performance-per-watt for several HPC applications implemented to work on GPUs.

Keywords-GPU; cluster; power monitoring; power efficiency

I. INTRODUCTION

One of the latest trends in the high-performance computing (HPC) community is to use application accelerators, such as graphical processing units (GPUs), to speed up the execution of computationally intensive codes. Several GPU-enhanced large-scale HPC clusters have been deployed recently, e.g., the Lincoln cluster at the National Center for Supercomputing Applications (NCSA) [1], and the scientific computing community is actively porting many codes to run on these systems. GPUs are known to consume a significant amount of power; e.g., a state-of-the-art NVIDIA Tesla C2050 compute-optimized GPU consumes up to 225 watts. The addition of one or more GPUs in each cluster node thus significantly alters the power requirements of the HPC systems in which they are deployed. A key question that must then be addressed is to what degree the additional power consumption of GPUs is offset by the application performance increases they provide, and whether they are a net gain or a net loss in terms of application performance per watt.

In this study, we evaluate the impact of GPU acceleration on performance and electrical power consumption using four production HPC applications as case studies. We introduce a measure that quantifies the overall improvement in performance-per-watt for the applications under test when comparing CPU-only and GPU-accelerated versions, and we describe measurement methodologies appropriate for typical HPC workloads.

We also describe two inexpensive hardware systems for monitoring the power consumption by the host and GPU nodes of a compute cluster and their integration with the compute cluster job management system. This enables users running applications on power-monitored cluster nodes to receive power consumption statistics at the end of the job execution and to examine the power consumption profiles of their applications.

To our knowledge, this is the first attempt to integrate power measurement and analysis tools with a GPU-enhanced HPC cluster and to supply the end user with the application power usage profile. Access to such information is vital to enable the development of power-aware applications, runtime systems that assist in power saving, and resource management schemas that optimize performance and power usage.

The work presented in this paper utilizes the accelerator cluster (AC) deployed at NCSA's Innovative Systems Laboratory (ISL) at the University of Illinois at Urbana-Champaign [1]. The AC cluster consists of 32 Hewlett Packard xw9400 workstations, each with two dual-core 2.4 GHz Opteron 2216 processors and 8 GB of memory (2 GB/core). Each node hosts an external Tesla S1070 unit containing four Tesla C1060 GPUs, resulting in 4 GPUs per node and a cumulative total of 128 GPUs across the entire cluster. The current AC cluster configuration includes power monitoring on one node and its attached Tesla S1070 unit.

The paper is organized as follows: In Section II we analyze related work; in Section III we present the power measurement hardware that we developed for use with the GPU cluster; in Section IV we describe the software infrastructure developed to collect data from the power monitoring sensors and correlate it with user applications executed on the GPU cluster under control of the cluster job management system; in Section V we introduce a measure that quantifies the overall improvement in performance-per-watt for applications using GPUs and compute its value for four test case applications; and finally in Section VI we discuss advantages and limitations of our solution and future enhancements to the developed system.
II. RELATED WORK

In [2], the authors use various GPU kernels that are known to stress a particular set of GPU resources, e.g., ALUs or memory, to characterize the power consumption of various GPU functional blocks using physical measurements. In [3], the authors go a step further by presenting a statistical GPU power consumption model. The proposed statistical regression model is based on the power consumption measured when executing various benchmarks and on detecting which GPU functional blocks are involved in their execution. Based on these observations, the proposed statistical model can be used to predict the power consumption of the target GPU for a given application.

In [4], the authors present an experimental investigation of the power and energy cost of GPU operations and a cost/performance comparison with a CPU-only system. The authors developed a server architecture that incorporated real-time energy monitoring of individual system components, such as the CPU, GPU, motherboard, and memory, and used it in conjunction with an application performing convolution with separable kernels to acquire experimental data. They show that using GPUs is more energy efficient when the application performance improvement is above a certain threshold.

In [5], the authors describe an empirical study in which they use the GPU-enabled GEM application as a test case to measure power consumption and energy efficiency during its execution on a GPU-based system.

In [20], the authors present a comprehensive hardware-software framework, called PowerPack, for performing an in-depth analysis of the energy consumption of parallel applications on a multi-core system. PowerPack's hardware consists of probes to measure the power consumption of individual system components, such as the CPU, memory, and disk. The software framework allows the power consumption data to be collected during the application run. This data can later be visualized using "power vs. time" plots.

While our work overlaps with some of the prior efforts, there are distinct differences. First, our power measurement hardware is designed to be very inexpensive and non-intrusive, enabling monitoring of individual nodes of large-scale systems at a minimal cost and with a minimal deployment overhead. Second, our data collection framework is fully integrated with the cluster management software, enabling any user to effortlessly collect data for their applications without any involvement on their side. Finally, the data presentation framework allows power usage data to be extracted for both the host and the GPU subsystems over an entire application run. The end result is a production-quality framework for characterizing the power efficiency of GPU-enabled applications on GPU-enhanced HPC clusters.

While power consumption measurements have previously been published for GPU-accelerated workloads, to date they have been limited to small microbenchmark kernels or single-GPU test cases. In this work we provide results for several complete scientific codes that run on HPC clusters. The selected applications use the message passing interface (MPI) for distributed application execution and CUDA for GPU-accelerated computation. This work also extends our previous effort on analyzing power usage for different stages of system operation, ranging from system boot, to idle nodes, to a full application run [1].

III. HARDWARE FOR POWER MONITORING

A. Current solution

Results presented in this paper were obtained with the help of a "Tweet-a-Watt" remote power monitor [12]. The "Tweet-a-Watt" is a wireless transmitting power monitor based on the "Kill-a-Watt" plug-in power monitor [13] and an RF-transmitting "XBee" remote sensor [14]. The XBee transmitter is integrated with the Kill-a-Watt to use its voltage signals and to draw power from its internal op-amp. We built three power-monitoring transmitters and a fourth XBee unit to receive the signals on a separate machine. The three power-monitoring XBee units were programmed via USB-serial registers to have separate addresses and to lower the frequency of their monitoring and data transmission. Since an XBee transmitting continuously draws far more power than the Kill-a-Watt has available to its internal circuitry, the XBee is configured to sleep for two-second intervals and only then transmit a packet of monitoring information.
Part of the Tweet-a-Watt kit is a large capacitor that stores up charge to power the XBee's transmission bursts without bringing down the internal power supply of the Kill-a-Watt. With the XBee and power-smoothing capacitor installed, the Tweet-a-Watt unit takes about 30 seconds to power up to the point that its display is readable.

In the Kill-a-Watt units that we purchased there was not a convenient place inside the case to put the XBee module and power capacitor without touching any active components. Therefore, we attached a generic electronics box to the side of the Kill-a-Watt's chassis and installed the XBee there, with a four-conductor ribbon cable between the two spaces to deliver power and the two voltage references. Figure 1 shows this arrangement.

The receiving XBee unit requires a script on the computer to communicate with it and read the data transmitted from the power monitors. We created a heavily modified version of the sample data-taking script from [12]. The Tweet-a-Watt software was created to upload its results live to Twitter (thus the name); we took the data collection code and modified it to upload the power use values to a database instead. Our Python script transmits the raw power monitor data and timestamps into entries in a database that is later used to match the power used with job information to create graphs of power used during a job. The monitoring scheme we have used in the prototype has been to use two of the power monitors to log the power used by one node in the AC cluster: one monitor reading the power used by the node itself, and the other monitoring the power drawn by its associated Tesla GPU accelerator unit.

The overall cost of the parts used to assemble the power monitoring unit and the corresponding receiving unit was under $100.
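The data-collection script itself is not reproduced in the paper; the following is a minimal sketch of the kind of collector described above, assuming the pyserial and sqlite3 modules and hypothetical packet and table layouts that differ from the production system.

```python
# Minimal sketch of a Tweet-a-Watt collector: read packets from the receiving
# XBee over USB-serial and store timestamped readings in a local database.
# The packet format, table, and column names here are hypothetical.
import sqlite3
import time

import serial  # pyserial

db = sqlite3.connect("power.db")
db.execute("""CREATE TABLE IF NOT EXISTS power_data (
                  sensor_id INTEGER, collected_at REAL,
                  volts REAL, amps REAL, raw TEXT)""")

port = serial.Serial("/dev/ttyUSB0", 9600, timeout=5)

while True:
    line = port.readline().decode(errors="replace").strip()
    if not line:
        continue                     # XBee sleeps ~2 s between packets
    try:
        sensor_id, volts, amps = line.split(",")[:3]   # assumed packet layout
    except ValueError:
        continue                     # skip malformed packets
    db.execute("INSERT INTO power_data VALUES (?, ?, ?, ?, ?)",
               (int(sensor_id), time.time(), float(volts), float(amps), line))
    db.commit()
```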
Figure 1. Internal arrangement of our Tweet-a-Watts; the XBee transmitter is in a generic box attached to the side of the Kill-a-Watt.

B. Development of a dedicated power monitoring unit

We have designed and built a new power monitoring unit that will be used with the same software and display infrastructure. The Tweet-a-Watt has a low rate of readout, it requires a monitoring device per single power circuit, and it is limited by the design of the Kill-a-Watt to a 20 amp, 120 volt circuit, while the typical system in our machine room runs on 208-volt power. The new power monitor is built in the form of a power distribution unit (PDU), with a C-20 power input port (208 V, 16 A), and distributes power to four C-13 power plugs (Figure 2).

The current drawn by each of the four output ports is monitored separately, and the other leg of the input is monitored as well to aid in cross-calibration and to check for stray unbalanced currents. Current monitoring is done with AC current transformer coils and burden resistors; the voltage readout of the coils is done by an Arduino Duemilanove unit [15] read out externally via USB. Each monitored current is coupled to its readout by an MN-220 current transformer from Manutech, Inc. [16] (Figure 3). Unlike most current pickup transformers, which are meant to measure the current for a full distribution panel, the MN-220 is physically very small and is designed for use with currents between 1 and 20 amps AC. These transformers are built without burden resistors, so we were able to set the current sensitivities to match the loads we wish to measure. The burden resistors can be changed to alter the current sensitivity range of each channel individually. The burden resistors are sized to maximize the dynamic range of measuring the target load without saturating the voltage monitoring range of the Arduino's inputs. The initial configuration has burden resistors configured for 600 watts on two of the output channels and 1000 watts on the other two channels, and is able to read the entire range of possible currents, up to 16 amps, on the opposite leg. The power distribution portion of the power monitor is wired with the same protections as a standard PDU would have, including a 16 amp double-pole circuit breaker to protect the wiring inside the unit.
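As a rough illustration of the burden-resistor sizing described above, the sketch below works through the arithmetic for one 600-watt channel. The 1000:1 transformer ratio is a hypothetical stand-in; the actual MN-220 ratio and the resistor values used in the unit are not given here.

```python
# Rough burden-resistor sizing for one monitored channel.
# The 1000:1 CT turns ratio is hypothetical; the 208 V mains, 600 W target
# load, and ~2.5 V half-range ADC swing come from the unit's description.
import math

MAINS_V = 208.0        # volts RMS
TARGET_W = 600.0       # full-scale load for this channel
TURNS_RATIO = 1000.0   # hypothetical primary:secondary current ratio
ADC_HALF_RANGE = 2.5   # volts: signal biased to 2.5 V, measured 0-5 V

i_primary_rms = TARGET_W / MAINS_V              # ~2.9 A RMS at full scale
i_secondary_rms = i_primary_rms / TURNS_RATIO   # current through the burden
i_secondary_peak = i_secondary_rms * math.sqrt(2)

# Largest burden resistor whose peak voltage still fits the ADC swing.
r_burden = ADC_HALF_RANGE / i_secondary_peak
print(f"full-scale primary current: {i_primary_rms:.2f} A RMS")
print(f"burden resistor upper bound: {r_burden:.0f} ohms")
```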

Figure 3. MN-220 current pickup transformers from Manutech, Inc. are used to monitor the current going through the power cables.

The output voltages from the pickup coils are measured by the analog voltage input channels of an Arduino microcontroller (Figure 4). One side of each transformer coil is anchored at 2.5 V above the Arduino's ground; the other is connected to a voltage monitoring channel, which in the default configuration measures a 10-bit value between ground and 5 volts above ground. We wrote a program for the Arduino that measures 2000 samples from each analog channel, computes the RMS total of each channel for the sampling period, and outputs the result as a current in milliamps in ASCII format. The output is transmitted via a USB-serial stream; the receiving computer only needs USB-serial capability to read the data.

Figure 2. New power monitor prototype.

We chose 2000 samples per reported value as a compromise between taking enough individual voltage samples to calculate a clean RMS value and few enough that the time reporting granularity stays small. At 2000 samples, the error of the current measurement was +/- 6 mA, or +/- 1 watt of output power. Monitoring 5 channels (4 outputs plus the total) at 2000 samples per report allows the Arduino to output a set of values every 1.4 seconds. The total cost of the parts used to assemble this unit was just under $200, or $50 per monitored power connector.
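The Arduino firmware itself is C code and is not reproduced in the paper; the following Python sketch only illustrates the per-channel RMS arithmetic it performs, with a hypothetical counts-to-milliamps calibration constant standing in for the value set by the burden resistor and transformer ratio.

```python
# Illustration of the per-channel RMS calculation the Arduino firmware
# performs (the real firmware is Arduino C). ADC counts are 10-bit (0-1023)
# around a 2.5 V bias; MA_PER_VOLT is a hypothetical calibration constant.
import math

ADC_BIAS_COUNTS = 512          # 2.5 V bias point on a 0-5 V, 10-bit scale
VOLTS_PER_COUNT = 5.0 / 1024.0
MA_PER_VOLT = 2000.0           # hypothetical: set by burden resistor and CT ratio

def channel_rms_ma(samples):
    """RMS current in milliamps for one reporting period (~2000 samples)."""
    acc = 0.0
    for s in samples:
        v = (s - ADC_BIAS_COUNTS) * VOLTS_PER_COUNT   # volts relative to bias
        acc += v * v
    return math.sqrt(acc / len(samples)) * MA_PER_VOLT

# Example: a synthetic sine whose RMS voltage across the burden is 0.5 V,
# i.e. about 1000 mA with the hypothetical scale factor above.
amp_counts = 0.5 * math.sqrt(2) / VOLTS_PER_COUNT
samples = [ADC_BIAS_COUNTS + amp_counts * math.sin(2 * math.pi * 60 * i / 2000)
           for i in range(2000)]
print(f"{channel_rms_ma(samples):.0f} mA")
```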

The burden resistors are sized to maximize the sensitivity given a power envelope. If a larger current is drawn than the burden resistor is set for, the power monitor will not be damaged, but the peaks of the voltage waveform will exceed the analog measurement range of the Arduino's input channels. The program that the Arduino runs tests for this and outputs an "OF" flag on the appropriate channel if that happens; this information is propagated on to the database so that the data is not used. The resulting current values will still approximately follow the actual current being drawn, but the reported current will be lower because the waveform used to calculate the current values will have been "clipped".

Figure 4. Arduino microcontroller inside the power monitoring unit.

We are currently in the process of calibrating the device and writing a script that reads the currents and outputs them into the power monitoring database, just like the existing Tweet-a-Watt monitors.

IV. SOFTWARE INTEGRATION

Regardless of the power monitoring device used, the collected data needs to be stored in a database and made accessible to the application user. Each power monitor has some associated software which streams, at the very least, raw data that can be assembled into power consumed in some way. The Tweet-a-Watt collector software we created streams computed voltage and current values, as well as the raw numbers used to compute them. The raw numbers are included in case we later need to make a calibration correction retroactively, and are not absolutely required.

The SQL database tables are essentially categorized into sensor configuration, collected power data, and the batch system (PBS, aka Torque) job data. Each individual sensor has a unique id assigned, along with a description of the devices attached to it and its expected collection interval (data rate). The expected collection interval for a sensor can be used for detecting gaps in data that may indicate a functional problem. The power data includes everything that is streamed by the sensor controller, as well as an association to the sensor id, the devices attached to that sensor, and the collection time. The job data is composed of the start and end time of any job which utilized a power-monitored resource exclusively, an association with a sensor or device, and some basic job details like user name and job name. With all sensors' data streams being continuously collected, any time span for any sensor can be extracted.
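The paper does not give the exact schema; the sketch below shows one plausible arrangement of the three table categories described above, with hypothetical column names (SQLite is used here only for brevity; the production database behind PBS/Torque may differ).

```python
# Plausible layout of the three table categories described above.
# Column names are hypothetical; the production schema is not specified.
import sqlite3

schema = """
CREATE TABLE sensor (
    sensor_id          INTEGER PRIMARY KEY,
    description        TEXT,   -- e.g. 'node host' or 'attached Tesla S1070'
    collect_interval_s REAL    -- expected data rate; used to spot gaps
);
CREATE TABLE power_data (
    sensor_id    INTEGER REFERENCES sensor(sensor_id),
    device       TEXT,         -- which attached device the reading is for
    collected_at REAL,         -- unix timestamp
    volts        REAL,
    amps         REAL,
    raw          TEXT          -- raw stream, kept for re-calibration
);
CREATE TABLE job (
    job_id     TEXT PRIMARY KEY,   -- PBS job identifier
    user_name  TEXT,
    job_name   TEXT,
    sensor_id  INTEGER REFERENCES sensor(sensor_id),
    start_time REAL,
    end_time   REAL                -- prologue estimate, corrected by epilogue
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(schema)
```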

The batch system (PBS) executes custom scripts before and after user job execution, the prologue and epilogue, respectively. In the prologue, a node makes an entry in the central SQL database if it detects that it hosts a power-monitored resource being used exclusively. The initial job end time record is set to the job start time plus the requested wall clock time, which the epilogue will reset more precisely when it runs. Additionally, if the job is to be power monitored, a link to a live, updated plot of the monitored resources relevant to the job is provided to the user in standard out.

Provisioning the monitor data to a user begins with the link provided to them in standard out, which presents a chart plotting their power usage of each device over time, with mouse-over data exposure, as well as under-the-curve totals for the job (Figure 5). This is accomplished using the Open Flash Chart PHP library. On an additional page, a user may view all their historical job profiles listed by job name and number, or download a CSV file per job if preferred in that form. Since the power monitoring setup on AC currently monitors a workstation and its Tesla S1070 GPU unit independently, a user can include either or both as part of their job and get a meaningful comparison of FLOP/watt efficiency between two separate runs, with and without GPU acceleration.

Figure 5. Example output.
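The prologue and epilogue are site-specific scripts and are not listed in the paper; the sketch below only illustrates the kind of bookkeeping the prologue described above performs, with hypothetical paths, table names, and plot URL.

```python
#!/usr/bin/env python
# Sketch of the bookkeeping a PBS/Torque prologue might do on a power-monitored
# node: record the job with an estimated end time, and print the plot URL.
# Paths, table names, and the walltime lookup are hypothetical.
import sqlite3
import sys
import time

DB_PATH = "/var/lib/powermon/power.db"   # hypothetical location
SENSOR_ID = 1                            # sensor attached to this node

def main():
    job_id, user, _group, job_name = sys.argv[1:5]   # Torque prologue arguments
    requested_walltime_s = 3600          # would be parsed from the resource list
    start = time.time()

    db = sqlite3.connect(DB_PATH)
    db.execute("INSERT INTO job VALUES (?, ?, ?, ?, ?, ?)",
               (job_id, user, job_name, SENSOR_ID,
                start, start + requested_walltime_s))  # epilogue refines end_time
    db.commit()

    # Appears in the job's standard out so the user can watch power live.
    print("Power profile: http://powermon.example/plot?job=%s" % job_id)

if __name__ == "__main__":
    main()
```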
V. APPLICATION CASE STUDIES

In order to evaluate the impact of GPU acceleration on performance and electrical power consumption, four applications (NAMD, VMD, QMCPACK, and MILC) were benchmarked on the AC cluster [1]. Performance and power measurements were made for CPU-only benchmark runs and for accelerated runs using both CPUs and GPUs. Since the AC cluster nodes use externally powered NVIDIA S1070 GPUs, the power consumption of the attached GPUs is measured independently from that of the host CPUs. With the exception of QMCPACK, for each application we compared the wall clock execution time of the CPU-only (not accelerated) run, t, against that of the GPU-accelerated run, ta, to derive the GPU acceleration speedup factor, s = t/ta. In the case of QMCPACK, the speedup is computed based on the number of walker generations per second. Using the speedup factor and the power measurements for the runs, p for the CPU-only run and pa for the GPU-accelerated run, we compute the overall improvement in performance-per-watt for the applications under test as e = (p/pa)*s. This measure shows how many times a GPU-accelerated implementation is more power-efficient than the CPU-only version of the same application. A summary of the final results for the four tested applications is presented in Table I.

In each of the test cases described below, power consumption was averaged over at least 20 samples. All of the test cases were based on abbreviated runs of representative problems, with sufficient runtime to facilitate accurate power measurement. One issue that arises with relatively short test runs is that the power measurements at the very start and end of the run must typically be discarded, as one or more samples may occur before the host CPUs and attached GPUs have awoken from reduced power-saving modes. It is also frequently the case that several of the first and last power measurements need to be discarded, as these may occur during application startup phases that represent a disproportionately high fraction of the runtime in a short run, whereas they would be a vanishingly small fraction of the runtime (and power consumption) for a full-length simulation. Since there is presently no mechanism for instrumenting application codes such that they record timestamps marking the start and end of particular application phases, we manually filtered or discarded power measurement samples known to be associated with application startup or shutdown phases, so that averaged power results would be representative of full-length simulations. In the future, we hope to use program instrumentation to automate the annotation and selection of power measurement data associated with different phases of application execution.

A. NAMD

NAMD is a widely used parallel molecular dynamics simulation package based on the Charm++ parallel programming environment [7], [6], with support for GPU acceleration [11], [9]. NAMD performance and power consumption were measured for "STMV", a 1-million-atom virus simulation representative of the size of simulations well suited to GPU clusters. Performance was benchmarked in terms of seconds per simulation timestep.

The CPU-only tests ran in 6.6 seconds per timestep, and the GPU-accelerated tests ran in 1.1 seconds per timestep, yielding a speedup factor of 6. Power measurements were made for the benchmark runs, excluding the brief startup and finalization stages. The CPU-only power consumption averaged 316 watts over the course of the benchmark. The GPU-accelerated benchmarks yielded an average GPU power consumption of 391 watts and an average CPU power consumption of 290 watts. Thus, for the NAMD GPU STMV run, the performance/watt ratio is 316 / (290 + 391) * 6 = 2.78. The NAMD GPU implementation is thus 2.78 times more power-efficient than the CPU-only version for the benchmarked STMV test case.

B. VMD

VMD is a popular molecular visualization and analysis tool that runs on both desktop workstations and HPC clusters. VMD makes use of GPU acceleration for computationally demanding tasks such as the computation of electrostatic potential maps and the visualization of molecular orbitals [8], [10], [11]. VMD performance and power consumption were measured for the computation of a time-averaged electrostatic potential field for a 685,000-atom STMV trajectory, using the multilevel summation method [10]. Each VMD MPI rank computed the time-averaged electrostatic potential map for 10 trajectory frames.

The CPU-only test case ran in 1,465.2 seconds and the GPU-accelerated test ran in 57.5 seconds, yielding a GPU speedup factor of 25.5. The CPU-only power consumption averaged 299 watts. The GPU-accelerated tests gave an average GPU power consumption of 433 watts and an average CPU power consumption of 309 watts. For the VMD benchmark, the performance/watt ratio is 299 / (309 + 433) * 25.5 = 10.48. The GPU-accelerated VMD is 10.48 times more power-efficient than the CPU-only version for the benchmarked STMV test case.
C. QMCPACK

Quantum Monte Carlo (QMC) [18] is a class of methods for solving the many-body problem of interacting quantum particles. Because it combines very high accuracy with extreme parallel scalability, it has become a major user of leadership computing facilities. Recently we have ported a QMC simulation suite, QMCPACK, to run on the CUDA platform. For our typical problem sizes we have observed speedups ranging from 10x to 15x over the performance of quad-core CPUs from the same generation. For the present study, we have simulated a bulk diamond crystal in a 128-atom simulation cell with 512 valence electrons, first using CPUs only, then using both CPUs and GPUs. Note that the CPU-only version uses purely double precision, while the GPU version uses primarily single precision, reserving double precision for select operations. We have found that the GPU version reproduces the double-precision results within the statistical accuracy of the results. We measure the simulation throughput in Monte Carlo (MC) generations per unit time, as well as the average power consumed by the CPUs and GPUs. In both cases, we propagate several MC states, called "walkers", in parallel in a manner appropriate to each platform. Our test runs employed two forms of QMC, variational Monte Carlo and diffusion Monte Carlo. Since the speedups and power usage were very similar in each case, we report only the figures for DMC, which usually dominates the total run time.

The CPU-only test case ran at a rate of 1.16 walker generations per second, while the GPU-accelerated test ran at a rate of 71.1 walker generations per second, yielding a speedup factor of 61.5. Power measurements were made for the benchmark runs, excluding the brief startup. The CPU-only power consumption averaged 314 watts, while the GPU-accelerated power consumption averaged 269 and 584 watts for the CPUs and GPUs, respectively. Thus, for the QMCPACK benchmark, the resulting performance/watt ratio is 314 / (269 + 584) * 61.5 = 22.6. The QMCPACK results demonstrate that the GPU-accelerated version is over 22 times more power-efficient than the CPU-only version.

D. MILC

The MIMD Lattice Computation (MILC) code [17], a Quantum Chromodynamics (QCD) application used to simulate four-dimensional SU(3) lattice gauge theory, is one of the largest compute cycle users at the national supercomputing centers. Advances in hardware and algorithm development have led to a remarkable growth of Lattice QCD, and calculations have become so precise that it is becoming necessary to take into account effects not only from the strong force, but from electromagnetism as well [19]. In this test, we use a GPU-accelerated package that includes the quantum electrodynamics effects in the lattice computation. MILC performance and power consumption are measured on an input lattice of size 28^3 x 96 with pre-computed spinors and gauge links.

Since the GPU-accelerated MILC only supports running on one GPU at this stage, the performance for the CPU-only and accelerated MILC code is from running on one CPU core, or one CPU core and one GPU, whereas the power consumption data is for all four CPU cores and all four GPUs. With that in mind, the CPU-only test ran in 77,324 seconds and the GPU-accelerated test ran in 3,881 seconds, giving a 19.9x speedup. The average host power usage during the CPU run is 225 watts, while the average host and GPU power usage is 222 watts and 332 watts, respectively. Therefore, the overall efficiency improvement for the MILC application is 225 / (222 + 332) * 19.9 = 8.1, compared with the CPU-only version.

TABLE I. IMPROVEMENT IN PERFORMANCE-PER-WATT FOR THE FOUR CONSIDERED APPLICATIONS

Application   t (sec)    ta (sec)   s      p (watt)   pa (watt)   e
NAMD          6.6        1.1        6      316        681         2.78
VMD           1,465.2    57.5       25.5   299        742         10.48
QMCPACK       -          -          61.5   314        853         22.6
MILC          77,324     3,881      19.9   225        555         8.1
(For QMCPACK, s is derived from walker generation rates rather than wall clock times, so t and ta are not listed.)

One may argue that the power efficiency measure shown in Table I can be simply estimated based on the host and GPU specifications. Such an estimation, however, is not that simple and is not accurate. The S1070 GPU system is rated at 800 W. The HP xw9400 workstation has a 1050 W power supply, but its actual power consumption depends on the processor type, the amount of memory, etc. Using the maximum spec numbers thus will not yield a meaningful value of e.
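The entries in Table I, by contrast, follow directly from the measurements reported in this section; the snippet below simply recomputes e = (p/pa)*s from those published numbers, with no new data.

```python
# Recompute e = (p / pa) * s from the measurements reported in Section V.
# pa is the sum of the measured host and GPU power for the accelerated run.
def efficiency_gain(p_cpu_only, p_host_accel, p_gpu_accel, speedup):
    return p_cpu_only / (p_host_accel + p_gpu_accel) * speedup

cases = {                    # p (W), host pa (W), GPU pa (W), s
    "NAMD":    (316, 290, 391,  6.0),
    "VMD":     (299, 309, 433, 25.5),
    "QMCPACK": (314, 269, 584, 61.5),
    "MILC":    (225, 222, 332, 19.9),
}
for name, args in cases.items():
    print(f"{name:8s} e = {efficiency_gain(*args):.2f}")
# NAMD -> 2.78, QMCPACK -> 22.6, MILC -> 8.1; the VMD value lands near 10.3
# with these rounded inputs rather than the 10.48 quoted in Table I.
```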
VI. DISCUSSION AND FUTURE WORK

Figure 6 shows an analytical model for the performance-per-watt improvement as a function of the application speedup when using GPUs to accelerate the execution of the code. Three different scenarios of host and host+GPU power consumption are considered: 1) 300 watts drawn by the host for the non-accelerated application and 600 watts drawn by the host and GPU combined for the GPU-accelerated application, 2) 300 watts versus 700 watts, and 3) 300 watts versus 800 watts. Two main observations can be made from this plot: 1) the higher the host+GPU power consumption, the lower the achievable performance-per-watt improvement is, and 2) there is a threshold value of the speedup at which the performance-per-watt improvement becomes larger than one. For the power consumption examples shown in Figure 6, this threshold value is between 2x and 3x. This indicates that achieving only a small speedup, e.g., 2-3x, when using a GPU to accelerate the application does not lead to a power-efficient execution of the application. On the other hand, achieving speedups over 3x results in power savings.

Our current power monitoring approach suffers from limitations, some of which will be removed with the new power monitoring hardware, while some will still remain. One such limitation is due to the limited ability to monitor each GPU individually. The NVIDIA Tesla S1070 unit contains four Tesla C1060 GPUs, all powered by the same power supply. Thus, we can only reliably monitor applications that use all 4 GPUs. If some GPUs are not used, we do not currently have a reliable way to subtract the amount of power they consume when idling. Similarly, the power consumption of idle CPU cores is also included in the measurements. On the other hand, we obtain the true power consumption of the system when running an application, without regard to the application's ability to efficiently utilize all the available resources.
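The analytical model behind Figure 6 follows directly from the definition of e: for a CPU-only host power p and a combined accelerated power pa, e(s) = (p/pa)*s, so the break-even speedup (e = 1) is s = pa/p. A short sketch of the three scenarios discussed above:

```python
# Analytical model from Figure 6: e(s) = (p / pa) * s for three power scenarios.
# Break-even (e = 1) occurs at s = pa / p, i.e. 2.0x, 2.33x, and 2.67x here.
P_HOST = 300.0                      # watts, CPU-only run
SCENARIOS = [600.0, 700.0, 800.0]   # watts, host + GPU for the accelerated run

def improvement(speedup, p_accel, p_host=P_HOST):
    return p_host / p_accel * speedup

for p_accel in SCENARIOS:
    threshold = p_accel / P_HOST
    print(f"{P_HOST:.0f}-{p_accel:.0f} W scenario: break-even speedup "
          f"{threshold:.2f}x, e at 10x = {improvement(10, p_accel):.1f}")
```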

Figure 6. Performance-per-watt as a function of the speedup factor for three different power consumption scenarios.

There are a number of enhancements we are considering for the future, pending user feedback and resource availability. These include:

• Additional sensors: Currently, only the Tweet-a-Watt hardware is fully implemented and integrated into the cluster. The Arduino solution has the advantages of greater accuracy, greater collection granularity, and more flexible voltage support. We also have an ever increasing variety of hardware to monitor and compare.

• Database optimization: The current database implementation has some prototype limitations in its design. We may also consider reducing data collection to job timeframes only, instead of continuous collection, depending on the performance impact and resource requirements imposed by additional sensors with higher data rates.

• Support for user-controlled tagging: Depending on user feedback, we may implement a call available to users which would arbitrarily flag a power data record for more convenient identification of various application stages on the graphical power consumption plot.

VII. CONCLUSIONS

In this work, we presented hardware, software, and a methodology for characterizing the performance-per-watt advantage of GPU-based systems. The results for the four tested applications indicate that although GPUs significantly increase power consumption, the acceleration they provide results in a reduction of the overall energy consumption. In some cases, GPU-accelerated codes are an order of magnitude more power-efficient than their CPU-only versions. We also show how power efficiency depends on the application speedup.

ACKNOWLEDGMENT

This work utilized the AC cluster [1] operated by the Innovative Systems Laboratory (ISL) at the National Center for Supercomputing Applications (NCSA) at the University of Illinois. The cluster was funded by NSF SCI 05-25308 and CNS 05-51665 grants, along with generous donations of hardware from NVIDIA, Nallatech, and AMD. NAMD and VMD development is supported by the National Institutes of Health under grant P41-RR05969.

REFERENCES

[1] V. Kindratenko, J. Enos, G. Shi, M. Showerman, G. Arnold, J. Stone, J. Phillips, W. Hwu, "GPU Clusters for High-Performance Computing," Proc. IEEE International Conference on Cluster Computing, Workshop on Parallel Programming on Accelerator Clusters, Dec. 2009, doi: 10.1109/CLUSTR.2009.5289128.
[2] S. Collange, D. Defour, A. Tisserand, "Power consumption of GPUs from a software perspective," G. Allen et al. (Eds.): ICCS 2009, Part I, LNCS 5544, pp. 914-923, 2009.
[3] X. Ma, M. Dong, L. Zhong, Z. Deng, "Statistical power consumption analysis and modeling for GPU-based computing," Proc. ACM Workshop on Power Aware Computing and Systems (HotPower), co-located with SOSP, October 2009.
[4] M. Rofouei, T. Stathopoulos, S. Ryffel, W. Kaiser, M. Sarrafzadeh, "Energy-Aware High Performance Computing with Graphic Processing Units," Workshop on Power Aware Computing and Systems (HotPower 2008), San Diego, December 8-10, 2008.
[5] S. Huang, S. Xiao, W. Feng, "On the energy efficiency of graphics processing units for scientific computing," Proc. 2009 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), IEEE Computer Society, Washington, DC, pp. 1-8, May 2009.
[6] L. V. Kale and S. Krishnan, "Charm++: Parallel Programming with Message-Driven Objects," in Parallel Programming using C++, G. V. Wilson and P. Lu, Eds., MIT Press, 1996, pp. 175-213.
[7] J. C. Phillips, R. Braun, W. Wang, J. Gumbart, E. Tajkhorshid, E. Villa, C. Chipot, R. D. Skeel, L. Kale, and K. Schulten, "Scalable molecular dynamics with NAMD," J. Comp. Chem., vol. 26, pp. 1781-1802, 2005.
[8] C. Rodrigues, D. Hardy, J. Stone, K. Schulten, W. Hwu, "GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications," in CF'08: Proc. 2008 Conference on Computing Frontiers, pp. 273-282, New York, NY, USA, 2008, ACM.
[9] J. Phillips, J. Stone, K. Schulten, "Adapting a Message-Driven Parallel Application to GPU-Accelerated Clusters," in SC'08: Proc. 2008 ACM/IEEE Conference on Supercomputing, pp. 1-9, Piscataway, NJ, USA, 2008, IEEE Press.
[10] D. Hardy, J. Stone, K. Schulten, "Multilevel Summation of Electrostatic Potentials Using Graphics Processing Units," Parallel Computing, 28:164-177, 2009.
[11] J. Stone, J. Saam, D. Hardy, K. Vandivort, W. Hwu, K. Schulten, "High Performance Computation and Interactive Display of Molecular Orbitals on GPUs and Multi-core CPUs," Proc. 2nd Workshop on General-Purpose Processing on Graphics Processing Units, ACM International Conference Proceeding Series, vol. 383, pp. 9-18, 2009.
[12] Tweet-a-Watt, http://www.ladyada.net/make/tweetawatt/
[13] P3 Kill A Watt Electricity Monitor, http://www.p3international.com/products/special/P4400/P4400-CE.html
[14] XBee & XBee-PRO 802.15.4 OEM RF Modules, http://www.digi.com/products/wireless/point-multipoint/xbee-series1-module.jsp
[15] Arduino, http://www.arduino.cc/
[16] MN220 AC Current Sense Transformers, http://www.manutech.us/_Pdf/out/MN220.pdf
[17] The MIMD Lattice Computation (MILC) Collaboration, http://www.physics.utah.edu/~detar/milc/
[18] J. Kim, K. Esler, J. McMinis, B. Clark, J. Gergely, S. Chiesa, K. Delaney, J. Vincent, D. Ceperley, QMCPACK simulation suite, http://www.mcc.uiuc.edu/qmcpack
[19] S. Basak, A. Bazavov, C. Bernard, C. DeTar, W. Freeman, S. Gottlieb, U.M. Heller, J.E. Hetrick, J. Laiho, L. Levkova, J. Osborn, R. Sugar, D. Toussaint, "Electromagnetic splittings of hadrons from improved staggered quarks in full QCD," Proc. XXVI International Symposium on Lattice Field Theory, Williamsburg, Virginia, July 2008, PoS LATTICE2008:127, 2008.
[20] R. Ge, X. Feng, S. Song, H. Chang, D. Li, K. Cameron, "PowerPack: energy profiling and analysis of high-performance systems and applications," IEEE Transactions on Parallel and Distributed Systems, vol. 21, no. 5, pp. 658-671, 2010.