Power Aware Tactical Computing

Song J Park¹, Dale R Shires¹, Brian J Henz¹, James A Ross², David A Richie³, and Jordan J Ruloff²
¹U.S. Army Research Laboratory, APG, MD
²Dynamic Research Corp., Andover, MA
³Brown Deer Technology, Forest Hill, MD

Abstract - Power consumption has become a chief impediment in the advancement of digital computing. To improve performance amid power limitations, accelerators are being applied to everyday systems. In particular, due to the mix of popularity and raw computational power, graphics processing units (GPUs) have extended the applicability of digital computing in a multitude of sectors, from ubiquitous smartphones to environmentally responsible supercomputing. Operating within the mobile power constraint, the usefulness of a high performance graphics processor system in a tactical environment is explored in this study. A line-of-sight optimization algorithm serves as a compute-intensive application with characteristics relating to a tactical scenario. Power-efficient hardware options and achievable capabilities are analyzed for assisting tactical operations.

Keywords: GPGPU, line-of-sight, mobile HPC

1 Introduction

Computing power continues to increase as new generations of processors are released to consumers. Following Moore’s Law, where the number of transistors doubles every 18 months, computing elements within processing units have grown exponentially since the birth of digital computing. By common logic, a higher transistor count translates into higher performance potential.

The TOP500 maintains statistical lists of supercomputers around the world. High performance systems are ranked according to the performance outcome of the Linpack benchmark, which tests the ability to solve a dense system of linear equations. In a nutshell, the benchmark algorithm evaluates the compute speed of double-precision arithmetic on a given system [1]. The TOP500 table is released biannually and includes technical details such as processor type, number of processors, theoretical peak performance, and maximum Linpack performance. Since 2008, detail on power has been included as part of the statistical fields, a hint of the important role power will play in future systems as the computing field advances to reach exascale performance.

The U.S. Army Research Laboratory (ARL) Department of Defense Supercomputing Resource Center (DSRC) supports and maintains high performance computing (HPC) resources. The ARL DSRC provides state-of-the-art computational solutions for the DoD research and development community. Among the systems available at ARL is the Harold system, which consists of 10,752 cores with a processing capability of 120 trillion floating-point operations per second (TFLOPS). A predecessor, the JVN system, decommissioned in 2009, had 2048 cores with a theoretical peak of 14.7 TFLOPS. Shifting the focus to single precision, the peak processing power is 240 TFLOPS for the Harold system and 29.5 TFLOPS for the JVN system. Given that a single graphics card, the HD 6990, is rated at 5.1 TFLOPS for peak single-precision arithmetic, a single card is equivalent to roughly 1/6 of the JVN’s theoretical compute power. In other words, networking six AMD Radeon HD 6990 PCI-E boards can provide a combined peak performance exceeding that of the available 32-bit floating-point power in JVN, albeit in single precision.

A graph showing peak performance values for GPUs and their relationship to ARL’s previous-generation supercomputers is presented in Figure 1. The peak comparison figure reflects the current state of raw compute abilities offered by common GPU cards and systems, which rival decommissioned yet once TOP500-ranked systems. Moreover, an equivalently powerful system using accelerators built today requires less space, power, and cost to operate. This is the motivation behind leveraging GPU accelerators to enhance power-constrained tactical computing.

Figure 1. Single-precision comparison of GPU devices with previous-generation supercomputers at ARL

Originally designed to drive displays, GPU cards have become quite powerful in executing single-precision calculations to account for the parallel nature of drawing graphics on a screen. Since 2007, general purpose computing on graphics processors has introduced the CUDA and OpenCL languages, which let general C-based algorithms access the underlying parallel hardware. This effectively and seamlessly boosted performance in systems that most likely already had graphics processors installed. In this respect, the GPU approach relies on an affordable and readily available consumer product and requires only some programming effort, since the programming characteristics are similar to the C language. The continuing advancement of consumer-grade GPUs has made HPC available to the masses.

2 Influenced by Power

Examples of power influencing design can be observed at multiple levels of the computing hierarchy, from supercomputing centers to transistor designs. Topics relating to power issues in the computing market are summarized in this section, where a multitude of options for reducing power consumption are briefly visited.

The cost to power and cool a system is usually one of the limiting factors and is a challenge for hosting large-scale HPC systems. The average electrical power consumption for petaflop systems in the TOP500 in November 2010 was 3.3 MW, which translates to $2.89 million per year, assuming $0.10/kWh. As the supercomputing community strives to achieve exascale computing, efficiently managing energy and power will be one of the major challenges. The emphasis on power is evident in the DARPA Ubiquitous HPC project, which solicits a prototype of a petaflop-class rack system drawing 57 kW by 2014 [2].

The shift toward multiple cores at semiconductor giants Intel and AMD is another example of how power and heat have affected central processor architecture design. In order to avoid producing more heat per square centimeter than the surface of the sun while continuing to improve performance, multicore was introduced as a solution. The clock frequency did not advance during this period; performance gains relied solely on resource parallelism. Additional attention to power is manifested in Intel Sandy Bridge’s implementation of on-die power meters that can measure power use on the chip and dynamically distribute power [3]. Intel’s 3D transistor gate is another technique for power reduction. Novel vertical fins of the semiconductor substrate mean more energy efficiency resulting from lower voltage operation and lower leakage [4]. Moving forward, the largest transistor allocation inside an Intel Nehalem die relates to cache. Distributed L1 and L2 cache inside each core and the shared L3 cache clearly dominate the transistor usage. This implies that the majority of power is dissipated in the cache circuitry of the CPU. Coupled with the energy required to transfer data from main memory, data movement seems to impact power consumption at multiple levels of the memory hierarchy.

Similarly, features to optimize power consumption are evident in graphics architecture. For example, advanced power management techniques available in graphics hardware include multiple levels of clock gating, dynamic voltage scaling, and adaptive clocking. Clock gating shuts off clocks during idle conditions, voltage scaling drops voltage levels during idle states, and adaptive clocking automatically adjusts core and memory frequency to save power. Scientists and engineers at ARL have collaborated with researchers at the University of Texas at El Paso in looking at selective use of double-precision functions on Nvidia compute resources to reduce energy consumption. To measure power usage accurately, probing equipment is installed to capture voltage and current signals. Power-efficient computing is addressed from a software perspective in [5].

Field-programmable gate arrays (FPGAs) are an option for application-tailored, low-power processing. Altera’s Stratix V FPGAs are a low-power option that supports floating-point implementations with a variable-precision DSP architecture. An Altera white paper reports that the single-precision multiplier density in Stratix V has increased to 4096, which calculates to floating-point processing rates in excess of 1 teraflops [6]. In terms of the performance-per-watt metric, this equates to 10 GFLOPS/W. Yet the programming model and development processes are not as friendly to those familiar with working in C.

In the fourth quarter of 2010, sales of ARM-based smartphones exceeded the sales of personal computers [7]. ARM has historically emphasized the metric of power efficiency due to having evolved in a low-power, battery-operated environment. ARM processors have gained dominance in the mobile market segment. Additionally, ARM is looking to enter the server market, as efforts are underway by companies like Calxeda to implement large-scale ARM-powered servers. Current Calxeda products offer server nodes operating at 5 W using Cortex-A9 cores [8]. For hosting and servicing web applications, the ARM processor’s low power consumption offers a promising alternative to traditional server platforms.

3 Hardware for Tactical Computing

A workstation-footprint system is a manageable-sized PC augmented with high-end graphics processors. This concept of HPC in a box to aid on-field computations was investigated from two angles. First, the upper limit as to how much computing power can be packed into a workstation form factor was explored. Second, development environments and algorithms were analyzed for feasibility and possibilities.

Initial attempts at the asymmetric workstation system involved replacing heat sinks and fans with custom-fitted water blocks on the GPU cards in order to populate the system with seven Radeon HD 4870X2 cards. Since each Radeon HD 4870X2 card contains dual graphics engines, this is equivalent to having 14 GPU devices inside a single workstation. Additionally, liquid cooling allowed for overclocking beyond the reference clocks to achieve extra performance. The target was to reach the 20 TFLOPS mark for single-precision operations.

In regard to the software development environment, both the CUDA and OpenCL frameworks, which extend the power of GPUs beyond graphics, were evaluated for parallel computing. CUDA is proprietary and thus specifically supports Nvidia GPU architectures. CUDA was made public in 2007 and was originally based on the Open64 C compiler until the recent switch to the Low Level Virtual Machine (LLVM) in CUDA release 4.1. Nvidia’s earlier start has served to acquire initial momentum within the GPGPU community, extending to Matlab, Python, and Fortran, to name a few. OpenCL, on the other hand, is vendor agnostic and can target x86 processors, AMD GPUs, and Nvidia GPUs. In both frameworks, a language is provided for writing kernels representing core functions, along with application programming interfaces (APIs) that are used to manage the platforms. Ongoing translation research continues for compiling OpenCL programs to run on FPGAs and ARM processors.

More recently acquired test bed systems at ARL include an Nvidia GeForce 580 system and an AMD Radeon HD 6970 GPU system. The configuration of each 4U workstation platform is a dual-socket hex-core Xeon system, 24 GB of DDR3 memory, double-width PCIe slots for holding four GPU cards, a solid-state drive, and a 1400 W power supply. The goal is for this system to target compute-intensive calculations inside a mobile platform. Typically, an average alternator in a vehicle is rated at 100 A at 12 V, which computes to 1200 W. Thus, the power requirement of the four-GPU workstation seems to be within an achievable range for automobile operation.

ARM-powered devices are explored for integrating with the GPU workstation platforms for a tactical computing scenario. One of the purposes for mobile ARM products would be to serve as an end-user interface to display supercomputing-calculated results. For instance, an iPhone can request ray-tracing calculations from a nearby GPU-enhanced workstation to achieve near real-time computation. To learn the development process and the compute capabilities of an ARM processor, a PandaBoard ES was procured for evaluation and its possible role in mobile HPC. The PandaBoard ES has the OMAP 4460 processor designed by Texas Instruments, which combines the ARM Cortex-A9 series architecture with a POWERVR graphics core. Once a Linux operating system is loaded on a PandaBoard, Army applications written in C can be benchmarked. Currently, OpenCL support for ARM is still in its initial stage and STDCL support is under development. Two Linux distributions, Angstrom and Ubuntu, were successfully tested on a PandaBoard. Noticeably slow behavior was observed in the GUI mode of Ubuntu. Currently under examination is the open-source Android operating system, which has quickly gained popularity in smartphones and tablets. Although still relatively immature, Android has the potential to be the Microsoft Windows of the decade.

4 Army-centric Application

An algorithm for ballistic field calculation based on the first-hit ray-tracing method was developed using OpenCL. The ray-tracing method allows for the calculation of line-of-sight ballistic threat locations within a specified area. A three-dimensional representation of a small part of a town was selected for testing the ballistic hit probability field calculations. The input for the algorithm is a triangle data format describing a three-dimensional surface and layout. For user interaction and display of results, functionality was leveraged from and added to the World Wind program. World Wind was developed by NASA and is an open-source interactive world viewer. Figure 2 shows the World Wind interface window with the loaded elevation map and the ballistic hit probability field, denoted by a red shaded overlay on the surface. In the figure, the end user can insert entities, which are color coded to designate red as a shooter, green as an observation point, and blue as a watcher. A shooter was placed on top of a roof and an observation point was set near the front wall of the building to represent a door of interest. Executing the ballistic threat minimization algorithm generates the optimal locations for a watcher that minimize threat level while maintaining line-of-sight to the observation point.

Figure 2. Interface to ballistic threat minimization

The example case consisting of two shooters and two observation points was benchmarked on a single Nvidia GPU machine and an AMD multi-GPU workstation. The Nvidia system contained a Tesla C2050 graphics card and the multi-GPU workstation had four Radeon HD 6970 cards. The code was written to support multi-GPU execution, but to avoid serialization between GPUs, an older version of the AMD graphics driver was required for the CentOS Linux operating system. Timing results for ballistic threat calculation and placement optimization are presented in Figure 3. Written in OpenCL, the algorithm is portable to x86 processors as well. The calculation on a dual-core Xeon 5160 completed in 25 minutes.

Figure 3. Timing measurements for ballistic threat minimization

5 Conclusion

Armed with mass-marketed processors, Army applications are targeted for enhancement with heterogeneous computing solutions. With the assistance of GPUs, computationally powerful systems can be placed closer to field operations in a small form factor and can handle processing requirements onboard. Furthermore, ARM-based devices with portability and low-power characteristics are a naturally appropriate interface in tactical operations. A GPU workstation will be responsible for complex and time-consuming computations of algorithms, while an ARM product will provide an intuitive interface and abstraction. Future work for ARM development boards would be to assess and analyze ARM systems in a networked configuration, running cloud computing software.

This research investigated the applicability of asymmetric core computing in a battlefield environment. Across computing disciplines, power-related issues play a dominant role in architecture, design, and systems. This work hopes to complement and facilitate the transition of heterogeneous computing to battlefield operating systems.

6 References

[1] TOP500 Supercomputer Sites. Available: http://top500.org

[2] DARPA, Broad Agency Announcement Ubiquitous High Performance Computing Transformational Convergence Technology Office, March 2010.

[3] R. Merritt, “Intel offers first tour of its bridge to heterogeneous computer processors,” EE Times. Sep 2010.

[4] Intel Newsroom 22nm 3-D Tri-Gate Transistor Technology. Available: http://newsroom.intel.com/docs/DOC-2032

[5] Ricardo Portillo, Sarala Arunagiri, Patricia J. Teller, Song J. Park, Lam H. Nguyen, Joseph C. Deroba and Dale Shires, “Power versus performance tradeoffs of GPU-accelerated backprojection-based synthetic aperture radar image formation,” Proc. SPIE 8060, 2011.

[6] Altera Corporation, “Achieving One TeraFLOPS with 28-nm FPGAs,” Sep 2010.

[7] “Flying Robots Designed to Form Emergency Network,” Computer, vol. 44, no. 5, pp. 14-16, May 2011.

[8] Calxeda. Available: http://www.calxeda.com
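The power and throughput figures quoted in Sections 1 through 3 follow from simple arithmetic. The sketch below reproduces three of them as a sanity check: the yearly electricity cost of a 3.3 MW system at $0.10/kWh, the 1200 W alternator budget, and the number of 5.1 TFLOPS cards needed to exceed the 29.5 TFLOPS single-precision peak of JVN. All function names are illustrative and not part of the original study, and the year is assumed to have 8760 hours of continuous operation.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: non-leap year, system powered continuously


def annual_energy_cost_usd(power_mw, usd_per_kwh):
    """Yearly electricity cost for a system drawing power_mw megawatts."""
    power_kw = power_mw * 1000.0
    return power_kw * HOURS_PER_YEAR * usd_per_kwh


def electrical_power_w(volts, amps):
    """DC power in watts from a voltage and current rating."""
    return volts * amps


def cards_to_exceed(target_tflops, card_tflops):
    """Smallest number of cards whose combined peak exceeds target_tflops."""
    count = 1
    while count * card_tflops <= target_tflops:
        count += 1
    return count


if __name__ == "__main__":
    # Figures taken from the text: 3.3 MW at $0.10/kWh, 12 V at 100 A,
    # and HD 6990 (5.1 TFLOPS) versus JVN (29.5 TFLOPS single precision).
    print("annual cost (USD):", annual_energy_cost_usd(3.3, 0.10))
    print("alternator budget (W):", electrical_power_w(12, 100))
    print("cards needed:", cards_to_exceed(29.5, 5.1))
```

Under these assumptions the cost comes to about $2.89 million per year and six cards are needed, matching the values stated in the text.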
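Section 4 describes a first-hit ray-tracing method over triangle data that decides whether two positions share a line of sight. The sketch below illustrates one way such a visibility test can be written; it is not the OpenCL kernel used in the study. It relies on the Möller-Trumbore ray-triangle intersection, which is an assumption here because the paper does not name an intersection routine, and the function names and the tolerance value are hypothetical.

```python
EPS = 1e-9  # assumed tolerance for parallel segments and endpoint contact


def sub(a, b):
    """Component-wise difference of two 3D points."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    """Dot product of two 3D vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    """Cross product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def segment_hits_triangle(p, q, tri):
    """True if the segment from p to q crosses triangle tri (Moller-Trumbore)."""
    v0, v1, v2 = tri
    d = sub(q, p)  # unnormalized direction, so t in (0, 1) spans the segment
    e1, e2 = sub(v1, v0), sub(v2, v0)
    h = cross(d, e2)
    a = dot(e1, h)
    if abs(a) < EPS:  # segment is parallel to the triangle plane
        return False
    f = 1.0 / a
    s = sub(p, v0)
    u = f * dot(s, h)  # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return False
    qv = cross(s, e1)
    v = f * dot(d, qv)  # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return False
    t = f * dot(e2, qv)  # position of the hit along the segment
    return EPS < t < 1.0 - EPS


def has_line_of_sight(p, q, triangles):
    """True when no triangle blocks the straight segment between p and q."""
    return not any(segment_hits_triangle(p, q, tri) for tri in triangles)
```

In a placement search of the kind described in Section 4, a candidate watcher position would be kept only if `has_line_of_sight` to the observation point is true, while the same test against each shooter position could contribute to a threat estimate. On a GPU, each candidate position maps naturally to one work item.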