A Feasibility Analysis of Power Awareness in Commodity-Based High-Performance Clusters∗

Chung-hsing Hsu and Wu-chun Feng Computer & Computational Sciences Division Los Alamos National Laboratory {chunghsu, feng}@lanl.gov

Abstract

We present a feasibility study of a power-reduction scheme that reduces the thermal power of processors by lowering frequency and voltage in the context of high-performance computing. The study revolves around a 16-processor Opteron-based Beowulf cluster, configured as four nodes of quad-processors, and shows that one can easily reduce a significant amount of CPU and system power dissipation and its associated energy costs while still maintaining high performance. Specifically, our study shows that a 5% performance slowdown can be traded off for an average of 19% system energy savings and 24% system power reduction. These preliminary empirical results, via real measurements, are encouraging because hardware failures often occur when the cluster is running hot, i.e., when the workload is heavy, and the new power-reduction scheme can effectively reduce a cluster's power demands during these busy periods.

1. Introduction

Power efficiency is critical for developing cost-effective, small-footprint clusters. When cluster nodes consume and dissipate more power, they must be spaced out and aggressively cooled; otherwise, undissipated power causes the temperature to increase rapidly, and for every 10°C increase in temperature, the failure rate doubles [1]. Because the steep power demands of high-performance clusters mean that massive cooling facilities are needed to keep the clusters from failing, the total cost of ownership of these clusters can be quite high. For example, at IEEE Hot Interconnects 2004, Mark Seager of Lawrence Livermore National Laboratory (LLNL) noted the following:

1. For every watt of power consumed at LLNL, 0.7 watts of cooling is needed to dissipate the power.

2. The cooling bill for supercomputers at LLNL is $6M per year.

Consequently, ignoring the costs of acquisition, integration, upgrading, and maintenance, the annual cost to simply power and cool the cluster supercomputers at LLNL amounts to $14.6M per year (roughly $6M/0.7 ≈ $8.6M for power plus $6M for cooling)!

The power efficiency of a cluster can be addressed by reducing the power consumption and dissipation of the processors in the cluster. For example, at IEEE Cluster 2002, Feng et al. [2] presented a bladed Beowulf, dubbed Green Destiny, that is based on low-power Transmeta processors running high-performance code-morphing software.¹ Green Destiny has proven to be extraordinarily reliable even though it operates in a harsh 85°F warehouse environment. The recently announced Blue Gene/L [8] and Orion Multisystems workstations [10] arguably follow in the footsteps of Green Destiny.

For more traditional AMD- and Intel-based Beowulf clusters, power reduction can be achieved by means of dynamic voltage and frequency scaling (DVFS), a mechanism that allows system software to increase or decrease processor frequency and voltage at run time. Despite a handful of empirical studies published earlier this year [3, 4, 5, 6], the power-efficiency benefits that the DVFS mechanism can provide in high-end Beowulf clusters are still not clear. Consequently, we present a feasibility study on a high-end Opteron-based Beowulf cluster.
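To first order, the dynamic power of a CMOS processor scales as V² · f [9], so lowering voltage and frequency together compounds the savings. A back-of-the-envelope sketch (the frequency-voltage endpoints are those reported later in Table 1; the model ignores static and non-CPU power, so it overstates what a whole system can save):

```python
def dynamic_power_ratio(f_from, v_from, f_to, v_to):
    """Relative dynamic CPU power after scaling from (f_from, v_from)
    to (f_to, v_to), using the first-order model P ~ V^2 * f [9]."""
    return (v_to ** 2 * f_to) / (v_from ** 2 * f_from)

# Scaling from 2.0 GHz at 1.5 V down to 0.8 GHz at 0.9 V (the endpoints
# of Table 1) cuts dynamic CPU power to about 14% of its peak value.
print(round(dynamic_power_ratio(2.0, 1.5, 0.8, 0.9), 3))  # prints 0.144
```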

∗ This work was supported by the DOE LDRD Exploratory Research Program through Los Alamos National Laboratory contract W-7405-ENG-36 and by AMD, Inc. Available as LANL technical report LA-UR 05-5649.

¹ The high-performance code-morphing software improved floating-point performance by as much as 100% over the standard code-morphing software.

Since the debut of Green Destiny, similar solutions have appeared, e.g., Sun Microsystems' SunFire server, which uses AMD XP-M processors; HP's ProLiant BL server, which uses Intel processors; Orion Multisystems' desktop cluster, which uses Transmeta Efficeon processors; and finally, IBM's BlueGene/L, which uses IBM PowerPC 440 processors. All these processors are primarily used for mobile computing. Hence, a Green Destiny type of solution refers to a cluster design that achieves good power efficiency by using many low-power embedded or mobile processors (rather than fewer but more powerful server processors).

However, the Green Destiny type of solution has two major drawbacks. First, many cluster workloads do not scale as the number of cluster nodes increases. The inherently sequential part of workloads and network bandwidth limitations prohibit performance scalability. Second, this type of solution is not entirely based on commodity technologies, and hence, may not be cost-effective. For example, BlueGene/L uses an extensively stripped-down version of the 700-MHz PowerPC 440 embedded processor, while Green Destiny relies on a customized high-performance version of code-morphing software (CMS)² to achieve good performance, e.g., 12.6 Gflops on 24 processors. In contrast, the 16-processor MegaProto cluster [11], which uses the same type of processor as Green Destiny, achieves only 5.62 Gflops on Linpack because it does not have the customized high-performance version of CMS that Green Destiny had.

Figure 1. The RLX ServerBlade.

2. Background

In this section, we present two different approaches towards reducing power consumption, and hence, improving reliability in clusters: (i) the low-power approach of Green Destiny and (ii) the power-aware approach that uses dynamic voltage and frequency scaling (DVFS) to reduce power consumption.

2.1. Green Destiny

Green Destiny is the first large-scale instantiation of the "Supercomputing in Small Spaces" project [14] at Los Alamos National Laboratory. The origins of this work actually date back to September 2001 with a 24-node bladed Beowulf dubbed MetaBlade that also used Transmeta processors, albeit the previous-generation TM5600.

Green Destiny has 240 nodes that fit into an industry-standard 19-inch rack (with a footprint of five square feet) and sip only three kilowatts of power when booted diskless. Each cluster node is an RLX ServerBlade 1000t, as shown in Figure 1, that contains a 933-MHz/1-GHz Transmeta TM5800 CPU, 128-MB DDR SDRAM, 512-MB SDRAM, a 20-GB hard disk, and three 100-Mb/s Fast Ethernet network interfaces. Twenty-four such nodes are mounted into a chassis that fits in a 3U space. Putting ten such chassis together along with a standard tree-based network results in the Green Destiny cluster.

Each populated chassis consumes about 512 watts at load, i.e., 21 watts per cluster node. The low power of each cluster node can be attributed to the low power of the Transmeta processor, which consumes no more than 7.5 watts. The excellent power efficiency of Green Destiny made it extremely reliable. The cluster ran in a dusty 85°F warehouse environment at 7,400 feet above sea level without any unscheduled downtime over its two-year lifetime, and it did so without any special facilities, i.e., no air conditioning, no humidification control, no air filtration, and no ventilation. In contrast, our more traditional 100-processor Beowulf cluster that preceded Green Destiny failed on a weekly basis in the same hot environment. (This Beowulf cluster was actually a 128-processor cluster, but we were never able to get the entire cluster up and running reliably.)

² Each Transmeta processor has a software layer, called code-morphing software, that dynamically morphs x86 instructions into VLIW instructions. This provides x86 software with the impression that it is being run on native x86 hardware.

2.2. DVFS-Enabled Clusters

The idea of DVFS on commodity operating systems can be traced back as early as 1994 [15]. Processor power is reduced by lowering frequency and voltage because power is proportional to frequency and to the square of voltage [9]. However, commodity processors that actually supported DVFS did not appear until six years later, in 2000, and did so only in the mobile computing market. It was not until 2003 that DVFS made its way into desktop processors, specifically the AMD Athlon64. By late 2004, DVFS gained support on server-class processors such as the AMD Opteron and Intel Xeon EM64T.

To date, we are only aware of two DVFS-enabled Beowulf clusters that have been built and evaluated [4, 5, 6]. The first cluster [6] uses sixteen notebook computers as cluster nodes because most notebook computers are powered by DVFS-enabled processors. (In this case, the mobile processor used in each node is a 600-1400MHz Intel Pentium M.) In contrast, the second cluster [4, 5] is

based on ten desktop motherboards, each of which has an 800-2000MHz AMD Athlon64 processor. For both clusters, each node has 1GB of main memory.

Though the two DVFS-enabled clusters are commodity-based, we argue that they are neither high-performance nor balanced. Performance-wise, the two clusters use slow 100-Mb/s Fast Ethernet networks. In contrast, none of the supercomputers on the Top500 list (http://www.top500.org) use Fast Ethernet, while nearly half of them use Gigabit Ethernet. More importantly, using fast processors with a slow network creates an imbalanced machine that allows the processor frequency to be lowered further than on a balanced machine for the same level of performance impact, thus leading to noticeably more power and energy reduction than one would realize in a more balanced, high-performance cluster. Therefore, we use a 16-processor Opteron-based Beowulf cluster, interconnected by Gigabit Ethernet, to present a feasibility study on the power-efficiency benefits that the DVFS mechanism can deliver.

Figure 2. Celestica A8440.

3. CAFfeine: A DVFS-Enabled Opteron Cluster

This section presents technical details about our DVFS-enabled Opteron cluster dubbed CAFfeine. Specifically, we discuss both its commodity and high-performance attributes.

3.1. Commodity-Based CAFfeine

Currently, CAFfeine is configured as a 4-node, 16-processor cluster connected via a Gigabit-Ethernet (GbE) network. Each node is a Celestica A8440, a 4U rack-mountable quad-Opteron server, as shown in Figure 2. Inside each node, there are sixteen memory slots, four in front of each processor to enable low-latency access. The current configuration has two 512-MB registered ECC DDR-333 SDRAM memory modules for each Opteron processor. In terms of storage, four hot-swap SCSI drive bays are available, and CAFfeine currently has one 73GB hard disk for each node along with one DVD-ROM drive and one floppy drive. In addition to up to three hot-swap 500W power supplies with 2+1 redundancy, each A8440 has an ATI Rage XL VGA adapter, two 133MHz PCI-X slots, and two 66MHz PCI slots, all 64 bits, plus three rear-mounted Ethernet ports. Figure 3 shows a block diagram of the A8440.

Figure 3. The Block Diagram of the Celestica A8440.

Each processor in CAFfeine is an Opteron 846 that can run between 800MHz and 2000MHz in steps of 200MHz. The processor contains a 1MB same-speed on-chip L2 cache, and, unlike Intel processors, a memory controller is also built into the chip. The processors in the A8440 are interconnected with coherent HyperTransport links running at 800MHz for fast I/O access. Each Opteron 846 is fabricated using a 0.13-µm SOI process and consumes no more than 89W. With respect to the valid frequency-voltage combinations needed in a DVFS-enabled processor, AMD has not yet published any documentation. We experimentally set the combinations as in Table 1. Hence, our DVFS-enabled CAFfeine cluster has seven different power-performance combinations.

f (GHz)  2.0  1.8  1.6  1.4  1.2  1.0  0.8
V (V)    1.5  1.4  1.3  1.2  1.1  1.0  0.9

Table 1. Valid Frequency-Voltage Combinations in CAFfeine.

With respect to software, CAFfeine runs the Linux 2.6.7 operating system. All the benchmark codes mentioned in this paper are compiled using GNU compilers 3.3.3. For DVFS, the change between different frequency-voltage combinations is performed through the cpufreq interface provided by the powernow-k8 kernel module distributed with the Linux 2.6.7 kernel.

The cpufreq interface allows system software to set a desired CPU clock frequency by writing the frequency value (in terms of megahertz) to a particular /sys file. However, not all CPU frequencies are directly supported by a DVFS-enabled processor due to hardware constraints. The information about which CPU frequencies are supported is stored in the powernow-k8 kernel module in a tabular form similar to Table 1, as is the selection scheme used when the desired frequency is not supported.

3.2. High-Performance CAFfeine

CAFfeine is also high-performance with respect to the two existing DVFS-enabled clusters mentioned in Section 2. Figure 4 shows a performance comparison of running the NAS MPI benchmarks on the Opteron-based CAFfeine versus the Athlon64-based cluster [4, 5]. The comparison is made in terms of total execution time on two different workload classes, B and C. (Class C represents a larger workload as well as a larger memory footprint.) The execution times on the Athlon64-based cluster were derived from the figures in [4, 5]. From this figure, we clearly see that CAFfeine performs better than the Athlon64-based cluster for the same cluster size. Since both clusters have processors of the same speed, i.e., 2GHz, we attribute the higher performance of CAFfeine to a larger L2 cache (1MB versus 512KB), a faster network (Gigabit Ethernet versus Fast Ethernet), and its SMP-based architecture.

Figure 4. Performance Comparison to Existing DVFS-Enabled Clusters (NAS MPI benchmarks, 8-processor configurations).

For the Pentium M-based cluster [6], since there is not enough timing information in [6], we are unable to compare CAFfeine with this cluster directly. Nevertheless, we expect CAFfeine to deliver higher performance because CAFfeine has faster processors installed (2GHz versus 1.4GHz).

CAFfeine also performs well when compared to Green Destiny within the same power budget. The power consumed by a chassis in Green Destiny and by a CAFfeine node is roughly the same, i.e., around 500 watts. However, in terms of performance, the CAFfeine node runs much faster in almost every aspect, from floating-point performance to memory bandwidth. Figure 5 shows a performance comparison for all seven measurements from the HPC Challenge (HPCC) benchmark suite [7]. All the performance measurements are normalized with respect to Green Destiny. Note that the BW number is not shown because the communication methods are different: the CAFfeine node performs intra-node communication via HyperTransport links, whereas the Green Destiny chassis has to do inter-node communication via a much slower 100-Mb/s network. (But in terms of raw bandwidth, the CAFfeine nodes perform 157 times better than Green Destiny.)

Figure 5. Performance Comparison to Green Destiny (HPC Challenge benchmarks within a 500-W power budget).

4. Feasibility Analysis on Power Awareness

Here we present a feasibility analysis on the power awareness of CAFfeine via DVFS. But before we do so, we briefly describe our measurement infrastructure.

In this study, the reported execution time refers to the wall-clock time of program execution. The power and energy numbers are with respect to the entire cluster: the reported power numbers refer to the average system wattage, while the energy numbers refer to the total system energy consumption.

To measure the execution time of a program, we use the Linux time command. To get the power- and energy-consumption numbers, we use an industry-strength power meter, a Yokogawa WT210/WT230 series. This power meter is connected to power strips that pass electrical energy from the wall power outlets to CAFfeine, as shown in Figure 6. The power meter periodically samples the instantaneous system wattage every 20 µs. The total energy consumption is then calculated as the integration of these wattages over time. The average power consumption is the total energy consumption divided by the execution time.

Figure 6. The Measurement Infrastructure.

4.1. HPL Benchmark

To begin with, we conduct an in-depth study on the potential DVFS-induced power reduction at the cluster-node level. We choose the High-Performance Linpack (HPL) benchmark code [13] to stress-test the processor. HPL is an open-source implementation of the Linpack benchmark that solves a random, dense linear system of equations in double-precision arithmetic in parallel. HPL has long been argued to have exceptional temporal locality, which makes its performance number (in terms of Gflops) much higher than what can be observed in real-life scientific applications. Nevertheless, HPL's high CPU utilization makes it a good CPU stress test for a CAFfeine SMP node.

How does the power consumption of HPL compare to that of other benchmarks? Figure 7 shows the power consumption of running the NAS MPI benchmarks and HPL on a quad-Opteron CAFfeine node. The workloads for the NAS MPI benchmarks are carefully chosen to be class B so that they only require in-core execution. Thus, all benchmarks in Figure 7 only stress the CPUs and their respective memory subsystems.
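The energy accounting used by the measurement infrastructure above (instantaneous wattage sampled at a fixed interval, integrated over the run, then divided by execution time for average power) can be sketched in a few lines. The sample trace below is hypothetical, not actual WT210 output:

```python
def energy_and_avg_power(samples, interval_s):
    """Integrate equally spaced instantaneous-power samples (watts) over a
    run of length len(samples) * interval_s seconds, as done with the power
    meter in Figure 6: total energy in joules, then average power = E / T."""
    total_energy = sum(p * interval_s for p in samples)  # rectangle rule
    total_time = len(samples) * interval_s
    return total_energy, total_energy / total_time

# Hypothetical 1-second trace sampled every 20 microseconds (50,000 samples):
# half the run at 480 W, half at 420 W.
trace = [480.0] * 25_000 + [420.0] * 25_000
energy_j, avg_w = energy_and_avg_power(trace, 20e-6)
print(energy_j, avg_w)  # approximately 450 J over 1 s, i.e., ~450 W average
```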
We can see from Figure 7 that the wattage of HPL stands out.

Figure 7. Power Consumption of NAS+HPL MPI Benchmarks (CPU stress test on a CAFfeine node).

By plotting the system wattage of HPL execution for each frequency-voltage combination in CAFfeine, as shown in Figure 8, we find that the DVFS-induced system power savings can be as high as 61%! The power savings are achieved at the expense of performance degradation, in this case a slowdown of 2.32-fold.

More importantly, system power reduction does not always lead to system energy reduction. For the HPL example, Figure 8 shows that the most power-efficient execution occurs when the CPUs run at 0.8GHz, whereas the most energy-efficient execution occurs when the CPUs run at 1.4GHz. Why doesn't the most energy-efficient execution also occur at 0.8GHz? Because at that frequency, HPL takes significantly longer to run.³ As a result, Figure 8 shows that the resulting curve for system energy consumption is U-shaped. In fact, not only does HPL possess this U-shaped curve for system energy consumption; all the NAS MPI benchmarks do as well, as we will see in Section 5.

³ Remember that energy consumption is the product of power consumption and execution time.

Figure 8. Power and Performance Trend of HPL Execution (on a CAFfeine node).

Why is the above observation important? It turns out that many DVFS utilization schemes are based on the assumption that energy consumption will be reduced whenever the CPU frequency is reduced. Our HPL example has shown that this assumption is invalid for some programs on certain hardware platforms. As a result, the misuse of these schemes may produce unsatisfactory results.

It may be argued that running HPL at 1.4GHz, though energy-efficient, sacrifices too much performance. Even running at 1.8GHz, which results in a 15% performance degradation, may not be considered conducive to high performance. Hence, we seek to answer the question of how much energy savings we can achieve at, say, a 5% performance loss. To do so, we use the following approach. We solve a linear-programming problem for a particular performance-slowdown requirement D: given a set of tuples {(Pi, Ti)} for each frequency-voltage combination i, 1 ≤ i ≤ n, where Pi and Ti denote the system wattage and the total execution time of the target application, respectively, find an optimal solution vector (r1*, r2*, ..., rn*) for the minimization problem

    E*(D) = min { Σi ri·(Pi·Ti) :  Σi ri·Ti ≤ D,  Σi ri = 1,  ri ≥ 0 }        (1)

with respect to a given deadline D (in seconds). By varying D, we can derive the energy-performance curve (i.e., the E*-D curve) for the target application.

The energy-performance curve for HPL is shown in Figure 9 at two different granularities with respect to performance slowdown. For a 5% performance-slowdown requirement, CAFfeine can reduce system energy by 8% (12% for system wattage). The figure also shows that as the performance requirement is relaxed, the rate of system energy savings drops.

Figure 9. Energy-Performance Tradeoffs of HPL Execution.

Finally, the energy savings derived from Equation (1) are realizable. Basically, we calculate the desired frequency f* as follows:

    f* = Σi ri*·fi        (2)

For HPL execution at a 5% performance slowdown, the desired frequency f* is about 1.9GHz. Since CAFfeine (or more specifically, the AMD Opteron processor) does not support this frequency directly, we have to emulate the frequency using 1.8GHz and 2.0GHz, e.g., run iteratively at 1.8GHz for one second and then at 2.0GHz for the following second. The reason this scheme works is because E*(Ti) = Pi·Ti holds. In short, the D-E* curve is a piecewise-linear combination of the points (Ti, Pi·Ti) when this condition holds, and the condition holds when the total execution time is a linear function of CPU cycle time, which is the case for HPL execution.

4.2. NAS MPI Benchmarks

As we mentioned earlier, HPL is an atypical scientific application in that it possesses exceptional memory locality. For a more typical scientific application, which is oftentimes bottlenecked by memory and network performance, CAFfeine can achieve even higher system energy savings within the same performance constraint. To support our claim, we run the entire NAS MPI benchmark suite on CAFfeine.

The NAS MPI benchmarks [12] consist of eight benchmark codes. Together they mimic the computation and data-movement characteristics of large-scale computational fluid dynamics (CFD) applications. These benchmarks cover a wide range of sensitivity to CPU speed changes. For example, the MG benchmark measures memory bandwidth, in contrast to HPL, which measures the floating-point performance of CPUs.

Figure 10 plots the normalized execution time with respect to normalized processor cycle time for all NAS MPI benchmarks in version 3.2 using workload class C. Without knowing what specific problem each benchmark solves, the figure tells us that EP is the most sensitive to CPU frequency changes. In contrast, MG and SP are the least sensitive.

        EP    FT    BT    IS
slope   0.99  0.60  0.54  0.51
        LU    CG    SP    MG
slope   0.40  0.31  0.28  0.26

Figure 10. Sensitivity Analysis on Entire CAFfeine Using Workload Class C.

Using the same analysis as we did for HPL, we can derive the range of DVFS-induced system energy savings for the NAS MPI benchmarks. Table 2 shows the potential savings for the entire CAFfeine cluster. At a 5% performance loss, one can save an average of 19% system energy for CAFfeine. In other words, CAFfeine is capable of reducing a significant amount of system power and energy while still maintaining high performance.

These preliminary empirical results, via real measurements, are encouraging because hardware failures often occur when the cluster is running hot, i.e., when the workload is heavy, and CAFfeine can effectively reduce the occurrences of such overheating-induced failures in a high-performance commodity-based cluster during these busy periods.

5. An Analysis of Opportunities for Power Awareness

For completeness, we present in Figure 11 the performance and power trend of each NAS MPI benchmark running on the entire CAFfeine cluster.

Figure 11. Power and Performance Trend of NAS MPI Benchmarks on Workload C. (One panel per benchmark: BT, CG, EP, FT, IS, LU, MG, and SP on the entire CAFfeine cluster; each panel plots normalized power, energy, and execution time against the 2.0-0.8GHz frequency settings.)
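The U-shaped energy curves in Figure 11 can be reproduced qualitatively with simple first-order reasoning: run time splits into a CPU-bound part that scales with frequency and a memory/network-bound part that does not, while system power has a frequency-independent component plus a V²·f CPU term (cf. the model of Equation (3) in Section 5.2). The sketch below uses made-up constants (work, mem_time, p_rest, c1), not measured CAFfeine values:

```python
def run_time(f_ghz, work=100.0, mem_time=40.0):
    """Hypothetical split: CPU-bound work scales with frequency,
    memory/network-bound time does not (constants are made up)."""
    return work / f_ghz + mem_time

def system_power(f_ghz, v, p_rest=300.0, c1=30.0):
    """First-order model: frequency-independent rest-of-system power
    plus a c1 * V^2 * f CPU term (constants again hypothetical)."""
    return p_rest + c1 * v ** 2 * f_ghz

# The seven frequency-voltage pairs of Table 1
table1 = [(2.0, 1.5), (1.8, 1.4), (1.6, 1.3), (1.4, 1.2),
          (1.2, 1.1), (1.0, 1.0), (0.8, 0.9)]
energies = {f: system_power(f, v) * run_time(f) for f, v in table1}
# Energy falls from 2.0 GHz to a minimum at an intermediate frequency,
# then rises again at the low end: a U-shaped curve.
print(min(energies, key=energies.get))  # prints 1.8 under these constants
```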

Class C Workload on Entire CAFfeine

D     BT   CG   EP   FT   IS   LU   MG   SP   Average
5%    15%  18%   5%  18%  26%  17%  27%  22%  19%
10%   21%  27%  10%  25%  29%  22%  32%  31%  25%
∞     28%  35%  14%  28%  30%  30%  36%  34%  29%

Table 2. DVFS-Induced System Energy Reduction for NAS MPI Benchmarks.
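The savings in Table 2 come from solving the linear program of Equation (1) for each benchmark. Because that LP has only two constraints besides non-negativity, an optimal basic solution mixes at most two frequency settings, so a small exact solver needs nothing beyond pair enumeration. This is an illustrative sketch, not the authors' code, and the (P, T) values below are hypothetical rather than measured CAFfeine data:

```python
from itertools import combinations_with_replacement

def min_energy(settings, deadline):
    """Solve Eq. (1): minimize sum r_i*(P_i*T_i) subject to
    sum r_i*T_i <= deadline, sum r_i = 1, r_i >= 0.
    settings is a list of (P_i watts, T_i seconds) tuples.  An optimal
    basic solution of this LP has at most two nonzero r_i, so exact
    enumeration over pairs of settings suffices."""
    best = None
    for (p1, t1), (p2, t2) in combinations_with_replacement(settings, 2):
        e1, e2 = p1 * t1, p2 * t2
        candidates = [0.0, 1.0]          # pure settings
        if t1 != t2:                     # mix that exactly meets the deadline
            r = (deadline - t2) / (t1 - t2)
            if 0.0 < r < 1.0:
                candidates.append(r)
        for r in candidates:             # r on setting 1, (1 - r) on setting 2
            if r * t1 + (1 - r) * t2 <= deadline + 1e-9:
                e = r * e1 + (1 - r) * e2
                if best is None or e < best:
                    best = e
    return best

# Hypothetical per-setting (power W, time s) measurements for one benchmark
settings = [(500.0, 100.0), (430.0, 108.0), (380.0, 120.0)]
t_fastest = min(t for _, t in settings)
print(min_energy(settings, 1.05 * t_fastest))  # minimal energy, 5% slowdown
```

The mixing ratio r that attains the minimum is what Equation (2) turns into an emulated target frequency by time-slicing between the two neighboring settings.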

5.1. Application Characteristics

First, as discussed in the previous section, system power reduction does not necessarily result in system energy savings. The U-shaped curve for system energy usage is exhibited by every NAS MPI benchmark. However, the lowest point of this U-shaped curve varies from benchmark to benchmark. As with HPL execution, system wattage savings tend to flatten out as the CPU frequency decreases, whereas the total execution time starts to climb at the same time. Hence, Figure 11 further supports the thesis that running at the lowest frequency on CAFfeine is not a good way to reduce system energy costs in a high-performance cluster.

Second, an application that is sensitive to CPU frequency changes does not necessarily dissipate much heat. For example, both EP and HPL (in Figure 8) are sensitive to CPU frequency changes. However, Figure 7 shows that one consumes the most power (and thus dissipates much heat) whereas the other consumes the least power among all tested benchmarks.

Similarly, a memory-intensive application does not always consume less power. Both MG and SP measure memory bandwidth, and both have the least sensitivity to CPU frequency changes. Yet they consume more power, according to Figure 12, than EP. Obviously, the above profiles translate into opportunities for a DVFS-enabled cluster to significantly reduce energy consumption (and hence, costs) without needing to sacrifice high performance.

Figure 12. Power Consumption of the NAS MPI Benchmarks on the Entire CAFfeine Cluster.

5.2. CPU Power Reduction

So far, we have been using system wattage to evaluate the power awareness of CAFfeine. It is also interesting to know how much wattage is reduced on the processor chips, as DVFS can only change processor frequency and voltage and thus logically only affects processor wattage. To estimate this in a non-intrusive way, we use the following approach. We first collect the system wattage of the NAS MPI benchmarks on a CAFfeine node for each frequency-voltage combination, {(Pi, fi, Vi)}. Then we use a regression method to fit the measurement data to the following first-order power model [9]:

    P(f, V) = c1·V²·f + c0        (3)

in order to compute the two constants c1 and c0. Since DVFS only changes the first term, c1·V²·f, we can then estimate the processor wattage.

On average, the processors consume about 77 watts and account for 69% of the total system wattage. This percentage is quite large. Thus, we conclude that DVFS can effectively reduce power and energy consumption on this type of hardware platform.

5.3. DVFS-Induced Phase-Oriented Scheduling

There have been attempts (e.g., [4, 6]) to exploit execution-phase characteristics for DVFS-induced system power reduction and energy savings. For example, the FT benchmark on the Athlon64-based cluster has a power profile (shown by the bottom curve in Figure 13) that consists of a regular spike-and-valley power usage pattern corresponding to the interleaved computation and communication phases of the FT code. Since a reduction in CPU frequency has little performance effect on communication phases, where the network is the performance bottleneck, researchers propose to execute the communication phases of FT at a non-peak frequency and the computation phases at the peak frequency. This is called phase-oriented DVFS scheduling.

However, when the network becomes faster, such as the 1-Gb/s network in CAFfeine, such strategies have a diminishing energy-saving effect due to shorter communication phases (e.g., the top curve in Figure 13). Hence, lowering the CPU frequency whenever the program execution enters a communication phase may generate a negative performance effect, as each DVFS call introduces additional performance overhead, currently on the order of milliseconds when made via the cpufreq interface. (This performance overhead includes the /sys file access, the table-entry search, various assertion checks, and the real transition time.) A similar argument holds for MPI collective operations as well.

Figure 13. Power Profile of NAS MPI Benchmark FT (Normalized to 4-Processor Setting).

6. Conclusion

Steep power demands, their subsequent energy costs, and thermal-related reliability problems are a serious design issue for building commodity-based high-performance clusters. We address this challenge by presenting a DVFS-enabled Opteron cluster dubbed CAFfeine and a feasibility study on its potential power awareness. While CAFfeine has better performance than two existing DVFS-enabled Beowulf clusters, one based on the 2GHz Athlon64 and the other based on the 1.4GHz Pentium M, we show that one can still reduce a significant amount of CPU and system power dissipation and the associated energy costs (an average of 19%) while still maintaining high performance (at most a 5% performance slowdown).

7. Acknowledgements

First and foremost, we would like to thank Douglas O'Flaherty of AMD for his tremendous support of our research efforts. His contributions to the project were invaluable. We are also indebted to Paul Devriendt and Mark Langsdorf for providing technical details about PowerNow! on Opteron. Next, we acknowledge Western Scientific for building our CAFfeine cluster and for providing technical support. Finally, we wish to recognize Jeremy S. Archuleta for his tireless efforts in building, configuring, and administering all the computing platforms that were used in this paper.

References

[1] W. Feng. Making a case for efficient supercomputing. ACM Queue, 1(7):54-64, Oct. 2003.
[2] W. Feng, M. Warren, and E. Weigle. The bladed Beowulf: A cost-effective alternative to traditional Beowulfs. Proc. IEEE Int'l Conf. Cluster Computing (CLUSTER 2002), Sept. 2002.
[3] X. Feng, R. Ge, and K. Cameron. Power and energy profiling of scientific applications on distributed systems. Proc. IEEE Int'l Parallel & Distributed Processing Symp. (IPDPS 2005), Apr. 2005.
[4] V. Freeh, D. Lowenthal, F. Pan, and N. Kappiah. Using multiple energy gears in MPI programs on a power-scalable cluster. Proc. ACM SIGPLAN Symp. Principles and Practices of Parallel Programming (PPoPP'05), June 2005.
[5] V. Freeh, D. Lowenthal, R. Springer, F. Pan, and N. Kappiah. Exploring the energy-time tradeoff in MPI programs on a power-scalable cluster. Proc. IEEE Int'l Parallel & Distributed Processing Symp. (IPDPS 2005), Apr. 2005.
[6] R. Ge, X. Feng, and K. Cameron. Improvement of power-performance efficiency for high-end computing. Proc. 1st Workshop on High-Performance, Power-Aware Computing (HP-PAC 2005), Apr. 2005.
[7] HPC Challenge Benchmark. http://icl.cs.utk.edu/hpcc/.
[8] IBM Research Blue Gene Project. http://www.research.ibm.com/bluegene.
[9] T. Mudge. Power: A first class design constraint for future architectures. IEEE Computer, 34(4):52-58, Apr. 2001.
[10] Orion Multisystems. http://www.orionmulti.com/.
[11] H. Nakashima, H. Nakamura, M. Sato, T. Boku, S. Matsuoka, D. Takahashi, and Y. Hotta. MegaProto: A low-power and compact cluster for high-performance computing. Proc. 1st Workshop on High-Performance, Power-Aware Computing (HP-PAC 2005), Apr. 2005.
[12] NAS Parallel Benchmarks. http://www.nas.nasa.gov/software/NPB/.
[13] A. Petitet, R. Whaley, J. Dongarra, and A. Cleary. HPL - a portable implementation of the high-performance Linpack benchmark for distributed-memory computers.
[14] Supercomputing in Small Spaces Project. http://sss.lanl.gov.
[15] M. Weiser, B. Welch, A. Demers, and S. Shenker. Scheduling for reduced CPU energy. Proc. 1st Symp. Operating Systems Design and Implementation (OSDI'94), Nov. 1994.
