Overview of TSUBAME3.0, Green Cloud Supercomputer for Convergence of HPC, AI and Big-Data
Satoshi Matsuoka1  Toshio Endo1  Akira Nukada1  Shinichi Miura1  Akihiro Nomura1  Hitoshi Sato2  Hideyuki Jitsumoto1  Aleksandr Drozd1
1 Global Scientific Information and Computing Center, Tokyo Institute of Technology
2 Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology

TSUBAME3.0 is one of the largest supercomputers in Japan, offering 47.2PFlops of half-precision performance. It drives a broad range of workloads, not only in the traditional HPC area but also in big-data and AI. It is expected to achieve a theoretical PUE of 1.033, as a result of our long-standing efforts to improve density and energy efficiency. This report introduces the architecture of TSUBAME3.0, while discussing issues that appeared in previous TSUBAME systems and the new operation strategies adopted as their solutions.

1 Introduction

The Global Scientific Information and Computing Center (GSIC) at Tokyo Institute of Technology operates the TSUBAME supercomputer series as "Everybody's supercomputer", with a focus on high performance and ease of use. Since TSUBAME1.0 was introduced in 2006, TSUBAME has been used by around 2,000 users from Tokyo Tech, other research institutes, and industry. After the partial adoption of GPU accelerators in TSUBAME1.2, GPUs were fully deployed in TSUBAME2.0, introduced in 2010. Owing largely to their higher energy efficiency, TSUBAME2.0 achieved 30 times the performance of TSUBAME1.0 within a similar power budget [1]. TSUBAME2.0 was ranked No.4 in the Top500 supercomputer ranking and was awarded the title of "the Greenest Production Supercomputer in the World". On the peta-scale application side, 2PFlops performance was observed in a dendritic solidification simulation, which won the ACM Gordon Bell Prize [2]. In 2013, all the GPUs in TSUBAME2.0 were replaced with K20X GPUs; the resulting system, called TSUBAME2.5, has a peak performance of 5.6PFlops (double precision).

GSIC started operating the newest system in this series, TSUBAME3.0, in August 2017. The proposal submitted by SGI Japan was accepted in January; since then, GSIC, SGI Japan and the related vendors have cooperatively prepared for the start of operation. Figure 1 is an illustration of the entire TSUBAME3.0 system.

Fig. 1 Illustration of TSUBAME3.0 system

The target application area of TSUBAME3.0 is not limited to typical HPC, but also includes big-data and artificial intelligence (AI). The main components of the system are 540 compute nodes, which are a customized version of the SGI ICE XA. The total numbers of CPUs and GPUs are 1,080 and 2,160, respectively, and the total performance is 12.15PFlops in double precision and 47.2PFlops (or more) in half precision. Half precision is a 16-bit floating-point format that is expected to be useful in the big-data and AI areas, and it is implemented in hardware by the GPU accelerators adopted in TSUBAME3.0. As another feature, each compute node has a high-speed, large-capacity (2TB) NVMe SSD, for a total SSD capacity of 1.08PB. The shared large-scale storage is also more powerful than ever: its total capacity is 15.9PB and its transfer speed is 150GB/s. This storage hierarchy can be harnessed for extremely high-speed big-data analysis.
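As a concrete illustration of the half-precision support mentioned above, the following is a minimal CUDA sketch, not code from TSUBAME3.0 itself: it fills two FP16 arrays on the device and adds them element-wise using the __half type from cuda_fp16.h. The kernel names (init_fp16, add_fp16) and the array size are illustrative only. On Pascal-class GPUs such as the Tesla P100 (compute capability 6.0; compile with nvcc -arch=sm_60), the __hadd intrinsic is executed by native half-precision hardware.

    #include <cstdio>
    #include <cuda_fp16.h>

    // Fill two half-precision input arrays on the device.
    __global__ void init_fp16(__half *a, __half *b, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            a[i] = __float2half(1.5f);
            b[i] = __float2half(2.0f);
        }
    }

    // Element-wise FP16 addition; on Pascal GPUs such as the Tesla P100
    // (sm_60), __hadd is executed by native half-precision hardware.
    __global__ void add_fp16(const __half *a, const __half *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = __half2float(__hadd(a[i], b[i]));
    }

    int main()
    {
        const int n = 1 << 20;
        __half *a, *b;
        float *c;
        cudaMalloc(&a, n * sizeof(__half));
        cudaMalloc(&b, n * sizeof(__half));
        cudaMallocManaged(&c, n * sizeof(float));

        const int threads = 256, blocks = (n + threads - 1) / threads;
        init_fp16<<<blocks, threads>>>(a, b, n);
        add_fp16<<<blocks, threads>>>(a, b, c, n);
        cudaDeviceSynchronize();

        printf("c[0] = %f\n", c[0]);   // expected: 3.500000
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }

In practice, big-data and deep-learning workloads usually reach this FP16 path through libraries such as cuBLAS or cuDNN rather than hand-written kernels; the sketch only shows that half precision is a first-class hardware data type on these GPUs.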
TSUBAME3.0 also has excellent power efficiency, largely owing to its latest-generation GPU accelerators. We conducted performance measurements of the HPL benchmark on 144 compute nodes. With the parameters optimized for power efficiency rather than raw speed, a performance of 1.998PFlops was achieved at a power consumption of 141.6kW. The resulting power-performance ratio of 14.11GFlops/W (1.998PFlops divided by 141.6kW) won world No.1 in the Green500 list of June 2017 [3].

The cooling system of TSUBAME3.0 is also highly optimized for energy saving. We have used and analyzed indirect liquid cooling in TSUBAME2.0/2.5 and liquid submersion cooling in the TSUBAME-KFC prototype supercomputer. Based on this experience, the TSUBAME3.0 cooling system combines warm-water cooling for the CPUs and GPUs, the main heat sources, with indirect liquid cooling for the remaining parts. This achieves high power efficiency, stability and maintainability.

This report gives an overview of the TSUBAME3.0 system, including its cooling system and facility, and describes its operation strategy.

2 TSUBAME3.0 Architecture

2.1 Computing Nodes

TSUBAME3.0 comprises 540 uniform compute nodes, each of which combines the latest CPUs, GPU accelerators, a large main memory, a high-speed interconnect and fast SSDs. Two Intel Xeon E5-2680 v4 processors (Broadwell-EP, 14 cores, 2.4GHz) are adopted as the CPUs. The main memory capacity is 256GiB, composed of eight 32GB DDR4-2400 ECC registered DIMM modules, and the main memory bandwidth is 154GB/s. As GPU accelerators, four NVIDIA Tesla P100 for NVLink-Optimized Servers (16GB HBM2 at 732GB/s; 5.3TFLOPS at FP64, 10.6TFLOPS at FP32, 21.2TFLOPS at FP16) are installed per node. They are the fourth generation of GPUs in the TSUBAME series, succeeding the Tesla S1070 (GT200) in TSUBAME1.2, the Tesla M2050 (Fermi) in TSUBAME2.0 and the Tesla K20X (Kepler) in TSUBAME2.5. Each node is also equipped with four Host Fabric Interfaces (HFIs) based on the Intel Omni-Path Architecture; the injection bandwidth is 400Gbps unidirectional and 800Gbps bidirectional. As local storage, an NVMe-based high-speed SSD with 2TB capacity per node is used. The high-speed interconnect and SSDs are especially important for big-data/AI usage.

Fig. 2 is a block diagram of a TSUBAME3.0 compute node. The node has an almost symmetric structure with even numbers of CPUs, GPUs and HFIs. In the figure, PCIe denotes PCI-Express Gen3, whose transfer speed is about 1GB/s per lane. A Broadwell-EP CPU has 40 PCI-Express lanes. Of these, 16 lanes are connected to one of the Omni-Path HFIs, and another 16 lanes are connected to a PLX PCI-Express switch, which in turn connects two GPUs and another Omni-Path HFI. The remaining lanes of CPU1 are connected to the NVMe SSD.

Fig. 2 Block diagram of a TSUBAME3.0 compute node

The GPU accelerators of TSUBAME3.0 are Tesla P100 modules in the SXM2 form factor. Each GPU has four 20GB/s data links, called NVLink, which are used for direct data transfer among the GPUs.
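To make the role of these links concrete, below is a minimal CUDA sketch of a peer-to-peer copy between two GPUs; it is a generic example assuming two peer-capable devices, not TSUBAME-specific code, and the 64MiB buffer size is arbitrary. Inside a node whose GPUs are connected by NVLink, such a device-to-device copy travels over NVLink rather than PCI-Express.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Copy a buffer directly from GPU 0 to GPU 1. With peer access enabled,
    // the copy is device-to-device; on NVLink-connected GPUs it uses NVLink.
    int main()
    {
        const size_t bytes = 64UL << 20;            // 64 MiB test buffer
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, 1, 0);  // can GPU 1 access GPU 0?
        if (!canAccess) {
            printf("peer access between GPU 0 and GPU 1 is not available\n");
            return 1;
        }

        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);           // GPU 0 may access GPU 1
        void *src;
        cudaMalloc(&src, bytes);

        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);           // GPU 1 may access GPU 0
        void *dst;
        cudaMalloc(&dst, bytes);

        // Direct device-to-device transfer: dst lives on GPU 1, src on GPU 0.
        cudaMemcpyPeer(dst, 1, src, 0, bytes);
        cudaDeviceSynchronize();
        printf("peer copy of %zu bytes completed\n", bytes);

        cudaFree(dst);
        cudaSetDevice(0);
        cudaFree(src);
        return 0;
    }

CUDA-aware MPI implementations and collective-communication libraries such as NCCL use the same peer-to-peer mechanism internally, so most applications benefit from NVLink without writing this kind of code by hand.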
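The per-node figures quoted above can also be checked with simple back-of-the-envelope arithmetic; the assumptions here, which are not stated explicitly in this report, are the standard four DDR4 channels per Broadwell-EP socket and 100Gbps per Omni-Path HFI:

    2 sockets × 4 channels × 2400 MT/s × 8 bytes = 153.6 GB/s ≈ 154 GB/s   (main memory bandwidth)
    4 HFIs × 100 Gbps = 400 Gbps per direction, i.e. 800 Gbps bidirectional (injection bandwidth)
    4 GPUs × 5.3 TFLOPS (FP64) = 21.2 TFLOPS per node from the GPUs alone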
Table 1 shows a comparison between a TSUBAME2.5 node and a TSUBAME3.0 node; the numbers shown are theoretical peaks. The computation speed of the GPUs is greatly improved on TSUBAME3.0, even though the GPUs of TSUBAME2.5 were themselves upgraded in 2013, and the half-precision (FP16) performance is improved the most. The capacity and bandwidth of the memory and the SSD are also largely enhanced with big-data applications in mind.

Table 1 Comparison between a TSUBAME2.5 node and a TSUBAME3.0 node. The aggregated numbers within a node are shown.

2.2 Interconnect

All compute nodes, storage systems and management nodes are connected via an interconnect based on the Intel Omni-Path Architecture. The most important characteristic of the interconnect is that the compute nodes are connected in a fat-tree topology with full bisection bandwidth. The tree consists of director switches, edge switches and compute nodes as leaves. Each chassis contains nine compute nodes and two edge switches, and each edge switch has 18 downlinks connected to the nine nodes and 18 uplinks connected to three director switches; since the numbers of uplinks and downlinks are equal, the uplinks are not oversubscribed.

The design of the interconnect topology of a supercomputer has a heavy impact especially on the performance of large-scale computations. TSUBAME3.0 adopts a full-bisection fat tree, following its success in TSUBAME2.0. On this topology, communication paths between arbitrary pairs of ports can theoretically be established without contention. Since communication in one job does not affect the performance of other jobs, this topology is well suited to university supercomputers, which accommodate applications with a wide variety of scales and communication patterns. While the basic topology is common to TSUBAME2.0/2.5 and TSUBAME3.0, the former used a static routing method with a fixed routing table, and we observed performance degradation caused by network contention in certain communication patterns [4]. On TSUBAME3.0, which uses an adaptive routing method that finds communication paths dynamically, such degradation is expected to be resolved.

2.3 Storage

2.3.1 Global Storage Area

Shared storage with peta-scale capacity and high access speed is an essential component of a supercomputer. On TSUBAME2.5, a global storage system equally accessible from all compute nodes contributed greatly to the convenience of using the supercomputer. TSUBAME3.0 inherits this direction, while capacity and access speed are largely enhanced. Considering flexibility in operation and localization of the effects of failures, the TSUBAME3.0 global storage, shown in Fig. 3, consists of three systems based on the Lustre parallel file system. Each system has 5.3PB of effective capacity, for a total capacity of 15.9PB.

Fig. 3 The global storage of TSUBAME3.0

2.3.2 Scratch Area

Modern supercomputers must support extremely frequent file I/O requests from compute nodes, which is especially important for recent workloads including big-data applications. Global file systems are often designed to optimize I/O performance for large files; however, some workloads including

In order to reduce operation costs, the cooling system of TSUBAME3.0 is highly power efficient, at the world's top level.

3.1 The Server Room

Modern supercomputers have tended to increase the number of nodes for performance improvement, which requires a larger footprint area.