Overview of the K computer System
Hiroyuki Miyazaki, Yoshihiro Kusano, Naoki Shinjou, Fumiyoshi Shoji, Mitsuo Yokokawa, Tadashi Watanabe

RIKEN and Fujitsu have been working together to develop the K computer, with the aim of beginning shared use by the fall of 2012, as part of the High-Performance Computing Infrastructure (HPCI) initiative led by Japan's Ministry of Education, Culture, Sports, Science and Technology (MEXT). Since the K computer involves over 80 000 compute nodes, building it with low power consumption and high reliability was important from the availability point of view. This paper describes the K computer system and the measures taken for reducing power consumption and achieving high reliability and high availability. It also presents the results of implementing those measures.

1. Introduction

Fujitsu has been actively developing and providing advanced supercomputers for over 30 years, since its development of the FACOM 230-75 APU, Japan's first supercomputer, in 1977 (Figure 1). As part of this effort, it has been developing its own hardware, including original processors, as well as its own software, building up its technical expertise in supercomputers along the way.

Figure 1. History of supercomputer development at Fujitsu.

The sum total of this technical expertise has been applied to developing a massively parallel computer system, the K computer,1) note) which has been ranked as the top-performing supercomputer in the world.

Note: "K computer" is the English name that RIKEN has been using for the supercomputer of this project since July 2010. "K" comes from the Japanese word "Kei," which means ten peta, or 10 to the 16th power.

The K computer was developed jointly by RIKEN and Fujitsu as part of the HPCI initiative overseen by MEXT. As the name "Kei" in Japanese implies, one objective of this project was to achieve a computing performance of 10^16 floating-point operations per second (10 PFLOPS). The K computer, moreover, was developed not just to achieve peak performance in benchmark tests but also to ensure high effective performance in the applications used in actual research. Furthermore, to enable the entire system to be installed and operated at one location, it was necessary to reduce power consumption and provide a level of reliability that could ensure the total operation of a large-scale system.

To this end, four development targets were established:
• A high-performance CPU for scientific computation
• A new interconnect architecture for massively parallel computing
• Low power consumption
• High reliability and high availability

The CPU and interconnect architecture are described in detail in other papers of this special issue.2),3) In this paper, we present an overview of the K computer system, describe the measures taken for reducing power consumption and achieving high reliability and high availability at the system level, and present the results of implementing those measures.
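As a rough cross-check of the "Kei" target (a back-of-the-envelope sketch, not from the paper itself, using only figures quoted in the text: over 80 000 compute nodes and, as described in the next section, one 128-GFLOPS CPU per node):

```python
# Back-of-the-envelope check of the 10^16 FLOPS ("Kei") target.
# Both inputs are figures quoted in the text; one CPU per compute node.
NODES = 80_000               # "over 80 000 compute nodes" (lower bound)
PEAK_PER_NODE_FLOPS = 128e9  # SPARC64 VIIIfx theoretical peak, 128 GFLOPS

system_peak = NODES * PEAK_PER_NODE_FLOPS
print(f"{system_peak:.3e} FLOPS")  # 1.024e+16, i.e. just over 10 PFLOPS
assert system_peak >= 1e16
```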
2. Compute-node configuration in the K computer

We first provide an overview of the compute nodes, which lie at the center of the K computer system. A compute node consists of a CPU, memory, and an interconnect.

1) CPU

We developed an 8-core CPU with a theoretical peak performance of 128 GFLOPS, called the "SPARC64 VIIIfx," as the CPU for the K computer (Figure 2).4) The SPARC64 VIIIfx features Fujitsu's advanced 45-nm semiconductor process technology and a world-class power performance of 2.2 GFLOPS/W, achieved by implementing power-reduction measures from both process-technology and design points of view. This CPU applies the High Performance Computing–Arithmetic Computational Extensions (HPC-ACE), applicable to scientific computing and analysis. It also uses a 6-MB/12-way sector cache as an L2 cache and uses the Virtual Single Processor by Integrated Multicore Architecture (VISIMPACT), whose effectiveness has previously been demonstrated on Fujitsu's FX1 high-end technical computing server. As a result of these features, the SPARC64 VIIIfx achieves high execution performance in the HPC field. In addition, the system-control and memory-access-control (MAC) functions, which were previously implemented on separate chips, are now integrated in the CPU, resulting in high memory throughput and low latency.

Figure 2. SPARC64 VIIIfx.

2) Memory

Commercially available DDR3-SDRAM DIMMs are used as main memory. The dual in-line memory module (DIMM) is a commodity module that is also used in servers and PC clusters, which provides a number of advantages. For example, it can be multi-sourced, a stable supply of about 700 000 modules with a stable level of quality can be obtained in a roughly one-year manufacturing period, and modules with superior power-consumption specifications can be selected. Besides this, an 8-channel memory interface per CPU provides a peak memory bandwidth of 64 GB/s, surpassing that of competitors' computers and providing the high memory throughput deemed necessary for scientific computing.
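A few derived figures help put these numbers in context. The per-channel bandwidth, machine balance (bytes per FLOP), and implied CPU power in the sketch below are computed from the values quoted above; they are not stated explicitly in the paper.

```python
# Figures derived from the numbers quoted above (not stated in the paper).
PEAK_FLOPS = 128e9   # SPARC64 VIIIfx theoretical peak
CORES = 8            # cores per CPU
MEM_BW = 64e9        # peak memory bandwidth per CPU, bytes/s
CHANNELS = 8         # memory-interface channels per CPU
GFLOPS_PER_W = 2.2   # quoted power performance

print(PEAK_FLOPS / CORES / 1e9, "GFLOPS per core")                       # 16.0
print(MEM_BW / CHANNELS / 1e9, "GB/s per channel")                       # 8.0
print(MEM_BW / PEAK_FLOPS, "bytes per FLOP")                             # 0.5
print(round(PEAK_FLOPS / 1e9 / GFLOPS_PER_W, 1), "W per CPU (implied)")  # 58.2
```

The resulting 0.5-byte/FLOP balance is consistent with the paper's emphasis on high memory throughput for effective performance in real scientific applications.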
3) Interconnect

To provide a computing network (interconnect), we developed and implemented an interconnect architecture called "Tofu" for massively parallel computers in excess of 80 000 nodes.5) The Tofu interconnect (Figure 3) constitutes a direct interconnection network that provides scalable connections with low latency and high throughput for a massively parallel group of CPUs, and it achieves high availability and operability by using a 6D mesh/torus configuration. It employs an extended dimension-order routing algorithm that enables the provision of a CPU group joined in a 1–3 dimensional torus, so that a faulty node in the system becomes non-existent as far as the user is concerned. The Tofu interconnect also makes it possible to allocate to the user a complete CPU group joined in a 1–3 dimensional torus even if part of the system has been extracted for use.

Figure 3. Tofu interconnect.

A compute node of the K computer consists of the CPU and DIMMs described above and an InterConnect Controller (ICC) chip for implementing the interconnect. Since the Tofu interconnect forms a direct interconnection network, the ICC includes a function for relaying packets between other nodes (Figure 4). If the node section of a compute node should fail, only the job using that compute node will be affected, but if the router section of that compute node should fail, all jobs using that compute node as a packet relay point will be affected. The failure rate of the router section is nearly proportional to the amount of circuitry that it contains, and it needs to be made significantly lower than the failure rate of the node section. ICC-chip faults are therefore categorized in accordance with their impact, which determines the action taken when a fault occurs. The end result is a mechanism for continuing system operation in the event of a fault in a node section.

Figure 4. Conceptual diagram of node and router sections. (MAC: memory access controller; ICC: InterConnect Controller; TNI: Tofu network interface; IF: interface; PCIe: Peripheral Component Interconnect Express)

3. Rack configuration in the K computer

Compute nodes of the K computer are installed in specially developed racks. A compute rack can accommodate 24 system boards (SBs), each of which mounts 4 compute nodes (Figure 5), and 6 I/O system boards (IOSBs), each of which mounts an I/O node for system booting and file-system access. A total of 102 nodes can therefore be mounted in one rack.

Figure 5. System board and I/O system board.

Water cooling is applied to the CPU/ICCs used for the compute nodes and I/O nodes and to the on-board power-supply devices. Specifically, chilled water is supplied to each SB and IOSB via water-cooling pipes and hoses mounted on the side of the rack. Water cooling is used to maintain system reliability, enable a high-density implementation, and lower power consumption (by decreasing leakage current). IOSBs, in contrast, are air cooled since they make use of commercially available components. Since these boards are mounted horizontally inside the rack, air must be blown in a horizontal direction. To enable the racks themselves to be arranged in a high-density configuration, the SBs are mounted vertically.
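The per-rack total can be cross-checked from the board counts above. The last line of the sketch below also derives a lower bound on the number of compute racks implied by a system of over 80 000 nodes; this is a derived figure, not one stated in this section.

```python
# Per-rack node count recomputed from the board figures quoted above.
SBS_PER_RACK = 24     # system boards (SBs), 4 compute nodes each
NODES_PER_SB = 4
IOSBS_PER_RACK = 6    # I/O system boards (IOSBs), 1 I/O node each
NODES_PER_IOSB = 1

nodes_per_rack = SBS_PER_RACK * NODES_PER_SB + IOSBS_PER_RACK * NODES_PER_IOSB
print(nodes_per_rack)  # 102, matching the total stated in the text

# Derived: 96 compute nodes per rack means "over 80 000 compute nodes"
# requires at least ceil(80000 / 96) compute racks.
compute_per_rack = SBS_PER_RACK * NODES_PER_SB
print(-(-80_000 // compute_per_rack))  # 834
```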