Power Aware Tactical Computing
Song J Park¹, Dale R Shires¹, Brian J Henz¹, James A Ross², David A Richie³, and Jordan J Ruloff²
¹U.S. Army Research Laboratory, APG, MD
²Dynamic Research Corp., Andover, MA
³Brown Deer Technology, Forest Hill, MD

Abstract - Power consumption has become a chief impediment in the advancement of digital computing. To improve performance amid power limitations, accelerators are being applied to everyday systems. In particular, due to their mix of popularity and raw computational power, graphics processing units (GPUs) have extended the applicability of digital computing in a multitude of sectors, from ubiquitous smartphones to environmentally responsible supercomputing. This study explores the usefulness of a high performance graphics processor system in a tactical environment, operating within the mobile power constraint. A line-of-sight optimization algorithm serves as a compute-intensive application with characteristics relating to a tactical scenario. Power-efficient hardware options and achievable parallel computing capabilities are analyzed for assisting tactical operations.

Keywords: GPGPU, line-of-sight, mobile HPC

1 Introduction

Computing power continues to increase as new generations of processors are released to consumers. Following Moore's Law, where the number of transistors doubles roughly every 18 months, computing elements within processing units have grown exponentially since the birth of digital computing. By common logic, a higher transistor count translates into higher performance potential.

The TOP500 maintains statistical lists of supercomputers around the world. High performance systems are ranked according to their performance on the Linpack benchmark, which tests the ability to solve a dense system of linear equations. In a nutshell, the benchmark algorithm evaluates the compute speed of double-precision arithmetic on a given system [1]. The TOP500 table is released biannually and includes technical details such as processor type, number of processors, theoretical peak performance, and maximum Linpack performance. Since 2008, power data has been included among the statistical fields, a hint of the important role power will play in future systems as the computing field advances toward exascale performance.

The U.S. Army Research Laboratory (ARL) Department of Defense Supercomputing Resource Center (DSRC) supports and maintains high performance computing (HPC) resources. The ARL DSRC provides state-of-the-art computational solutions for the DoD research and development community. Among the systems available at ARL is the Harold system, which consists of 10,752 cores with a processing capability of 120 trillion floating-point operations per second (TFLOPS). A predecessor, the JVN system, decommissioned in 2009, had 2048 cores with a theoretical peak of 14.7 TFLOPS. Shifting the focus to single precision, the peak processing power is 240 TFLOPS for the Harold system and 29.5 TFLOPS for the JVN system. Given that a single graphics card, the Radeon HD 6990, is rated at 5.1 TFLOPS for peak single-precision arithmetic, a single video card is equivalent to roughly 1/6 of JVN's theoretical compute power. In other words, networking six AMD Radeon HD 6990 PCI-E boards (6 × 5.1 = 30.6 TFLOPS) can provide a combined peak performance exceeding the 32-bit floating-point power available in JVN, albeit in single precision only. A graph showing peak performance values for GPUs and their relationship to ARL's previous-generation supercomputers is presented in Figure 1. The peak comparison figure reflects the current state of raw compute abilities offered by common GPU cards and systems, which rival decommissioned systems that were once ranked in the TOP500. Moreover, an equivalently powerful system built today using accelerators requires less space, power, and cost to operate. This is the motivation behind leveraging GPU accelerators to enhance power-constrained tactical computing.

Figure 1. Single-precision comparison of GPU devices with previous-generation supercomputers at ARL

Originally designed to drive displays, GPU cards have become quite powerful at executing single-precision calculations because of the parallel nature of drawing graphics on a screen. Since 2007, general purpose computing on graphics processors has introduced the CUDA and OpenCL languages, which let general C-based algorithms access the underlying parallel hardware. This effectively and seamlessly boosted performance in systems that most likely already had graphics processors installed. In this respect, the GPU approach relies on an affordable and readily available consumer product and requires only some programming effort, since its languages are similar to C. The continuing advancement of consumer-grade GPUs has made HPC available to the masses.

2 Influenced by Power

Examples of power influencing design can be observed at multiple levels of the computing hierarchy, from supercomputing centers down to transistor designs. This section summarizes power issues in the computing market and briefly visits a number of options for reducing power consumption.

The cost to power and cool a system is usually one of the limiting factors and is a challenge for hosting large scale HPC systems. The average electrical power consumption for petaflop systems in the TOP500 in November 2010 was 3.3 MW, which translates to $2.89 million per year (3.3 MW × 8,760 h/yr × $0.10/kWh), assuming $0.10/kWh. As the supercomputing community strives to achieve exascale computing, efficiently managing energy and power will be one of the major challenges. The emphasis on power is evident in DARPA's ubiquitous HPC project, which solicits a prototype of a petaflop-class rack system drawing 57 kW by 2014 [2].

The shift toward multiple cores at semiconductor giants Intel and AMD is another example of how power and heat have affected central processor architecture design. Multicore was introduced as a way to continue improving performance without producing more heat per square centimeter than the surface of the sun. Clock frequency did not advance during this period; performance gains relied instead on resource parallelism. Additional attention to power is manifested in Intel Sandy Bridge's implementation of on-die power meters that can measure power use on the chip and dynamically distribute power [3]. Intel's 3D transistor gate is another technique for power reduction. Novel vertical fins of the semiconductor substrate provide better energy efficiency through lower voltage operation and lower leakage [4]. In addition, the largest transistor allocation inside an Intel Nehalem die relates to cache. The distributed L1 and L2 caches inside each core and the shared L3 cache clearly dominate the transistor usage. This implies that the majority of power is dissipated in the cache circuitry of the CPU. Coupled with the energy required to transfer data from main memory, data movement appears to affect power consumption at multiple levels of the memory hierarchy.

Similarly, features to optimize power consumption are evident in graphics architecture. For example, advanced power management techniques available in Nvidia GPUs include multiple levels of clock gating, dynamic voltage scaling, and adaptive clocking. Clock gating shuts off clocks during idle conditions, voltage scaling drops voltage levels during idle states, and adaptive clocking automatically adjusts core and memory frequency to save power. Scientists and engineers at ARL have collaborated with researchers at the University of Texas at El Paso to examine the selective use of double-precision functions on Nvidia compute resources as a way to reduce energy consumption. To measure power usage accurately, probing equipment is installed to capture voltage and current signals. Power-efficient computing is addressed from a software perspective in [5].

Field-programmable gate arrays (FPGAs) are an option for application-tailored, low-power processing. Altera's Stratix V FPGAs are a low-power option that supports floating-point implementations with a variable-precision DSP architecture. An Altera white paper reports that the single-precision multiplier density in Stratix V has increased to 4096, which works out to floating-point processing rates in excess of 1 teraflops [6]. In terms of flops per watt, this equates to 10 GFLOPS/W. However, the programming model and development process are less friendly to those accustomed to working in C.

In the fourth quarter of 2010, sales of ARM-based smartphones exceeded the sales of personal computers [7]. ARM has historically emphasized power efficiency, having evolved in a low-power, battery-operated environment, and ARM processors have come to dominate the mobile market. Additionally, ARM is looking to enter the server market, with efforts underway by companies like Calxeda to implement large-scale ARM-powered servers. Current Calxeda products offer server nodes operating at 5 W using Cortex-A9 cores [8]. For hosting and servicing web applications, the low power consumption of ARM processors offers a promising alternative to traditional server platforms.

3 Hardware for Tactical Computing

A workstation-footprint system is a manageably sized PC augmented with high-end graphics processors. This concept of HPC in a box to aid in-field computations was investigated from two angles. First, the upper limit on how much computing power can be packed into a workstation form factor was explored. Secondly,