Measuring Power Consumption on IBM Blue Gene/P
Comput Sci Res Dev, DOI 10.1007/s00450-011-0192-y
SPECIAL ISSUE PAPER

Michael Hennecke · Wolfgang Frings · Willi Homberg · Anke Zitz · Michael Knobloch · Hans Böttiger

M. Hennecke, IBM Deutschland GmbH, Karl-Arnold-Platz 1a, 40474 Düsseldorf, Germany (e-mail: [email protected])
W. Frings · W. Homberg · A. Zitz · M. Knobloch, Forschungszentrum Jülich GmbH, Wilhelm-Johnen-Strasse, 52425 Jülich, Germany
H. Böttiger, IBM Deutschland Research & Development GmbH, Schönaicher Str. 220, 71032 Böblingen, Germany

© The Author(s) 2011. This article is published with open access at Springerlink.com

Abstract  Energy efficiency is a key design principle of the IBM Blue Gene series of supercomputers, and Blue Gene systems have consistently gained top GFlops/Watt rankings on the Green500 list. The Blue Gene hardware and management software provide built-in features to monitor power consumption at all levels of the machine's power distribution network. This paper presents the Blue Gene/P power measurement infrastructure and discusses the operational aspects of using this infrastructure on Petascale machines. We also describe the integration of Blue Gene power monitoring capabilities into system-level tools like LLview, and highlight some results of analyzing the production workload at Research Center Jülich (FZJ).

Keywords  Blue Gene · Energy efficiency · Power consumption

1 Introduction and background

Power consumption of supercomputers is becoming increasingly important: Since 2007, the Green500 list has published supercomputer rankings based on the Flops/Watt metric. The Top10 supercomputers on the November 2010 Top500 list [1] alone (which coincidentally are also the 10 systems with an Rpeak of at least one PFlops) consume a total power of 33.4 MW [2]. These levels of power consumption are already a concern for today's Petascale supercomputers (with operational expenses becoming comparable to the capital expenses for procuring the machine), and addressing the energy challenge is clearly one of the key issues when approaching Exascale.

While the Flops/Watt metric is useful, its emphasis on LINPACK performance and thus computational load neglects the fact that the energy costs of memory references and the interconnect are becoming more and more important [3]. It has also been pointed out that a stronger focus on optimizing time to solution will likely result in a different ranking of competing algorithms for a given scientific problem than when solely optimizing for Flops/Watt [4]. It is therefore important to better understand the energy characteristics of current production workloads on Petascale systems. Those insights can then be used as input to future hardware design as well as for algorithmic optimizations with respect to overall energy efficiency.
In this work we focus on the IBM* Blue Gene* series of supercomputers. The guiding design principles for Blue Gene are simplicity, efficiency, and familiarity [5]. Regarding energy efficiency, the key feature of Blue Gene is its judiciously chosen low-frequency, low-voltage design, which results in both high-performance and highly energy-efficient supercomputers. Blue Gene/L [6–8] and Blue Gene/P [9] systems have consistently gained top MFlops/Watt rankings on the Green500 list, and an early prototype of the next-generation Blue Gene/Q system has recently set a new record at 2.1 GFlops/Watt [2].

One important aspect of the familiarity design principle is that the well-established MPI parallel programming paradigm on homogeneous nodes is maintained (augmented by OpenMP parallelism as the number of cores per node increases). This distinguishes Blue Gene from other current supercomputers, which often achieve high Flops/Watt efficiency by relying on accelerator technologies like the IBM PowerXCell* 8i [15] or GPGPUs. The simplicity principle includes packaging a large number of less powerful and less complex chips into a rack, and integrating most system functions including the interconnect into the compute chips [10].

* IBM, Blue Gene, DB2, POWER and PowerXCell are trademarks of IBM in the USA and/or other countries.

A direct consequence of the Blue Gene system design is that additional energy optimization techniques like dynamic voltage and frequency scaling, which are typical for more complex processors operating at much higher frequencies [11–13], are both less feasible (because the Blue Gene chips do not include comparable infrastructure) and less important (as Blue Gene already operates at highly optimized voltage and frequency ranges).

On the other hand, a scalable environmental monitoring infrastructure is an integral part of the Blue Gene software environment [16]. While this is primarily used to satisfy the reliability, availability and serviceability (RAS) requirements of operating large Blue Gene systems with their huge number of components, it can also be used to analyze the machine's power consumption at scale while running production workloads.

The rest of this paper is organized as follows: In Sect. 2 we present the Blue Gene system architecture and power distribution network, followed by the environmental monitoring infrastructure in Sect. 3 and a detailed breakdown of Blue Gene/P power consumption in Sect. 4. Integration of Blue Gene energy information into system-level tools is described in Sect. 5. In Sect. 6, we present some initial results of analyzing job history data on the Petascale Blue Gene/P system operated by FZJ, before concluding the paper in Sect. 7.

2 Blue Gene/P architecture and power flow

The Blue Gene/P system architecture and packaging are described in detail in [9]. A Blue Gene/P node consists of the quad-core Blue Gene/P Compute Chip (BPC) and forty DDR2 DRAM chips, all soldered onto a printed circuit board for reliability. The BPC ASIC also includes a large L3 cache built from embedded DRAM [14], a 3D torus interconnect for MPI point-to-point operations, a tree network for MPI collectives, and a barrier network. A node card (NC) contains 32 compute nodes and up to two I/O nodes (IONs). Within a midplane, 16 node cards with 512 compute nodes are connected to form an 8 × 8 × 8 torus (without any additional active components). Each midplane also contains a service card (SC) for bringup and management. To interconnect multiple midplanes, Blue Gene/P link chips (BPLs) are used, which are packaged onto four link cards (LC) per midplane. The BPL ASICs can be programmed either to connect the surfaces of the 8 × 8 × 8 cube to copper torus cables attached to other midplanes, or to close that torus dimension within the midplane. Figure 1 shows this system buildup.

Fig. 1  Blue Gene/P system buildup
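The part counts that follow from this packaging hierarchy can be cross-checked with a few lines of code. The sketch below is not part of the original paper; it simply restates the per-card figures from the text as constants and derives the per-rack and 72-rack totals that reappear in Tables 1 and 2 below, using the 8-IONs-per-rack configuration assumed there.

```python
# Back-of-the-envelope part counts for a Blue Gene/P installation,
# derived from the packaging hierarchy described in Sect. 2.
# Illustrative only: the constants restate figures from the text
# (8 I/O nodes per rack as in Table 2, 72 racks for ~1 PFlops).

NODE_CARDS_PER_MIDPLANE = 16     # 16 node cards form one midplane
COMPUTE_NODES_PER_NODE_CARD = 32
MIDPLANES_PER_RACK = 2
LINK_CARDS_PER_MIDPLANE = 4
BPL_CHIPS_PER_LINK_CARD = 6
DRAM_CHIPS_PER_NODE = 40
IO_NODES_PER_RACK = 8            # configurable; Table 2 assumes 8 IONs/rack
RACKS_FOR_1_PFLOPS = 72

def counts_per_rack():
    compute_nodes = MIDPLANES_PER_RACK * NODE_CARDS_PER_MIDPLANE * COMPUTE_NODES_PER_NODE_CARD
    bpc_chips = compute_nodes + IO_NODES_PER_RACK   # every node (compute or I/O) has one BPC
    dram_chips = bpc_chips * DRAM_CHIPS_PER_NODE
    bpl_chips = MIDPLANES_PER_RACK * LINK_CARDS_PER_MIDPLANE * BPL_CHIPS_PER_LINK_CARD
    return {"BPC chips": bpc_chips, "DRAM chips": dram_chips, "BPL chips": bpl_chips}

if __name__ == "__main__":
    for part, n in counts_per_rack().items():
        print(f"{part}: {n:,} per rack, {n * RACKS_FOR_1_PFLOPS:,} for 72 racks")
    # Expected values (cf. Tables 1 and 2):
    #   BPC chips: 1,032 per rack, 74,304 for 72 racks
    #   DRAM chips: 41,280 per rack, 2,972,160 for 72 racks
    #   BPL chips: 48 per rack, 3,456 for 72 racks
```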
On top of each Blue Gene/P rack, bulk power modules (BPMs) convert AC power to 48 V DC power, which is then distributed to the cards through the two midplanes. Service cards, node cards and link cards include a number of DC/DC voltage regulator modules (VRMs) to provide the different voltages required on the cards. All BPMs and VRMs are N + 1 redundant.

Heat is removed from the rack by side-to-side air cooling, with fan assemblies on the left side of the rack. The fan assemblies are powered directly from the BPMs at 48 V. Blue Gene/P also has an option for hydro-air cooling: Heat exchangers between the racks in a row are used to cool down the hot air exhausted from one rack before it enters the next rack in the row. This reduces sub-floor airflow requirements by up to 8×, and is also more efficient than using external computer room air conditioning (CRAC) units.

Figure 2 shows the power flow within a Blue Gene/P rack, including the main 48 V power distribution, the DC/DC voltage regulator modules, and the main energy consumers. Table 1 shows the part counts for the AC/DC and DC/DC voltage regulators, and Table 2 summarizes the part counts for the Blue Gene/P energy consumers.

Fig. 2  Power flow within a Blue Gene/P rack

Table 1  Blue Gene/P voltage regulators

Blue Gene/P component   Count per card   Count per rack   Count for 1 PFlops (72 racks)
BPMs                    –                9                648
SC VRMs                 7                14               1,008
LC VRMs                 2                16               1,152
NC VRMs                 8                256              18,432

Table 2  Blue Gene/P consumers (using 8 IONs/rack)

Blue Gene/P component   Count per card   Count per rack   Count for 1 PFlops (72 racks)
Fans                    3                60               4,320
BPL chips               6                48               3,456
BPC chips               1                1,032            74,304
DRAM chips              40               41,280           2,972,160

Service cards, node cards and link cards can be accessed from an external Blue Gene service node through a 1 GbE hardware control network. Through this path, the DC/DC voltage regulators of the respective cards can be monitored as described in the next section. BPMs are monitored by the service card of the bottom midplane, and fans in a midplane are monitored by the service card of that midplane.

3 Blue Gene/P environmental monitoring

The Blue Gene service node uses an IBM DB2* relational database to store information about the Blue Gene machine:

– an operational database which records information about blocks (partitions), jobs, and their history;
– an environmental database which keeps current and past values for environmentals like temperature, voltages and currents;
– and a RAS database which collects hard errors, soft errors, machine checks, and software problems.

As described in Chap.
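As a conceptual illustration of how the voltage and current environmentals stored in this database relate to power figures, the following sketch (not from the paper; the record layout, domain names, rack/card identifiers and readings are all made up, and the real DB2 schema is not shown in this excerpt) sums voltage times current over the DC/DC regulator domains of a rack. The detailed, measured breakdown is the subject of Sect. 4.

```python
# Illustrative sketch (not from the paper): aggregating card-level power from
# the kind of voltage/current environmentals the environmental database stores.
# The record layout and all identifiers below are hypothetical.

from collections import defaultdict
from dataclasses import dataclass

@dataclass
class VrmSample:
    rack: str         # e.g. "R00" (hypothetical rack identifier)
    card: str         # e.g. "N00" for a node card (hypothetical)
    domain: str       # voltage domain of the DC/DC regulator (hypothetical label)
    voltage_v: float  # output voltage reading
    current_a: float  # output current reading

def rack_power_watts(samples):
    """Sum P = V * I over all sampled VRM domains, grouped by rack."""
    totals = defaultdict(float)
    for s in samples:
        totals[s.rack] += s.voltage_v * s.current_a
    return dict(totals)

# Two made-up node-card readings in one rack:
samples = [
    VrmSample("R00", "N00", "logic", 0.80, 150.0),
    VrmSample("R00", "N00", "DRAM",  1.80,  40.0),
]
print(rack_power_watts(samples))   # {'R00': 192.0}
```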