Special Issue on High Performance Computing
Architecture and Hardware for HPC

The Earth Simulator System

By Shinichi HABATA,* Mitsuo YOKOKAWA† and Shigemune KITAWAKI‡

*Computers Division
†National Institute of Advanced Industrial Science and Technology (AIST)
‡Earth Simulator Center, Japan Marine Science and Technology Center

ABSTRACT  The Earth Simulator, developed by the Japanese government's initiative "Earth Simulator Project," is a highly parallel vector system that consists of 640 processor nodes and an interconnection network. Each processor node is a shared memory parallel vector supercomputer, in which 8 vector processors, each delivering 8GFLOPS, are tightly connected to a shared memory, for a peak performance of 64GFLOPS per node. The interconnection network is a huge non-blocking crossbar switch linking the 640 processor nodes and supports global addressing and synchronization. The aggregate peak vector performance of the Earth Simulator is 40TFLOPS, and the intercommunication bandwidth between every two processor nodes is 12.3GB/s in each direction. The aggregate switching capacity of the interconnection network is 7.87TB/s. To realize a high-performance and high-efficiency computer system, three architectural features are applied in the Earth Simulator: vector processors, shared memory, and a high-bandwidth non-blocking interconnection crossbar network. The Earth Simulator achieved 35.86TFLOPS, or 87.5% of the peak performance of the system, in the LINPACK benchmark, and has been proven to be the most powerful supercomputer in the world. It also achieved 26.58TFLOPS, or 64.9% of the peak performance of the system, for a global atmospheric circulation model with the spectral method. This record-breaking sustained performance makes this innovative system a very effective scientific tool for providing solutions for the sustainable development of humankind and its symbiosis with the planet earth.

KEYWORDS  Supercomputer, HPC (High Performance Computing), Parallel processing, Shared memory, Distributed memory, Vector processor, Crossbar network

1. INTRODUCTION

The Japanese government's initiative "Earth Simulator project" started in 1997. Its target was to promote research for global change predictions by using computer simulation. One of the most important project activities was to develop a supercomputer at least 1,000 times as powerful as the fastest one available at the beginning of the "Earth Simulator project." After five years of research and development activities, the Earth Simulator was completed and came into operation at the end of February 2002.
Within only two months from the beginning of its operational run, the Earth Simulator was proven to be the most powerful supercomputer in the world by achieving 35.86TFLOPS, or 87.5% of the peak performance of the system, in the LINPACK benchmark. It also achieved 26.58TFLOPS, or 64.9% of the peak performance of the system, for a global atmospheric circulation model with the spectral method[3]. This simulation model was also developed in the "Earth Simulator project."

2. SYSTEM OVERVIEW

The Earth Simulator is a highly parallel vector supercomputer system consisting of 640 processor nodes (PN) and an interconnection network (IN) (Fig. 1). Each processor node is a shared memory parallel vector supercomputer, in which 8 arithmetic processors (AP) are tightly connected to a main memory system (MS). Each AP is a vector processor that can deliver 8GFLOPS, and each MS is a shared memory of 16GB. The total system thus comprises 5,120 APs and 640 MSs, and the aggregate peak vector performance and memory capacity of the Earth Simulator are 40TFLOPS and 10TB, respectively.
The interconnection network is a huge 640 × 640 non-blocking crossbar switch linking the 640 PNs. The interconnection bandwidth between every two PNs is 12.3GB/s in each direction. The aggregate switching capacity of the interconnection network is 7.87TB/s.

Fig. 1 General configuration of the Earth Simulator.

The operating system manages the Earth Simulator as a two-level cluster system, called the Super-Cluster System. In the Super-Cluster System, the 640 nodes are grouped into 40 clusters (Fig. 2). Each cluster consists of 16 processor nodes, a Cluster Control Station (CCS), an I/O Control Station (IOCS) and system disks. Each CCS controls 16 processor nodes and an IOCS. The IOCS is in charge of data transfer between the system disks and the Mass Storage System: it transfers user file data from the Mass Storage System to a system disk, and the processing results from a system disk to the Mass Storage System.
The supervisor, called the Super-Cluster Control Station (SCCS), manages all 40 clusters and provides a Single System Image (SSI) operational environment. For efficient resource management and job control, the 40 clusters are classified as one S-cluster and 39 L-clusters. In the S-cluster, two nodes are used for interactive use and another for small-size batch jobs. User disks are connected only to S-cluster nodes, and are used for storing user files. The aggregate capacities of the system disks and user disks are 415TB and 225TB, respectively. The Mass Storage System is a cartridge tape library system, and its capacity is more than 1.5PB.

Fig. 2 System configuration of the Earth Simulator.

To realize a high-performance and high-efficiency computer system, three architectural features are applied in the Earth Simulator:

· Vector processor
· Shared memory
· High-bandwidth and non-blocking interconnection crossbar network

From the standpoint of parallel programming, three levels of parallelizing paradigms are provided to gain high sustained performance (a minimal illustrative sketch follows the list):

· Vector processing on a processor
· Parallel processing with shared memory within a node
· Parallel processing among distributed nodes via the interconnection network
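The three levels correspond to the programming model typically used on machines of this class: message passing between nodes, shared-memory threading within a node, and compiler vectorization of inner loops. The following minimal sketch is illustrative only and is not taken from Earth Simulator software; it assumes a generic MPI library and an OpenMP compiler, with one MPI process per node and the threads standing in for the 8 APs of a node.

    /* Illustration (hypothetical) of the three parallelization levels:
     *   - MPI        : parallelism among distributed nodes (interconnection network)
     *   - OpenMP     : shared-memory parallelism among the APs within one node
     *   - inner loop : simple stride-1 work that a vectorizing compiler maps
     *                  onto the vector pipelines of each AP
     * Example build: mpicc -fopenmp -O3 levels.c -o levels
     */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N_PER_NODE (1 << 20)   /* elements owned by each node (arbitrary) */

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double *x = malloc(N_PER_NODE * sizeof *x);
        double *y = malloc(N_PER_NODE * sizeof *y);
        for (long i = 0; i < N_PER_NODE; i++) { x[i] = 1.0; y[i] = 2.0; }

        double local_sum = 0.0;
        /* Level 2: threads sharing the node's memory. */
        #pragma omp parallel for reduction(+:local_sum)
        for (long i = 0; i < N_PER_NODE; i++) {
            /* Level 1: vectorizable loop body. */
            y[i] = y[i] + 3.0 * x[i];
            local_sum += y[i];
        }

        /* Level 3: nodes cooperate through the network. */
        double global_sum = 0.0;
        MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

        if (rank == 0)
            printf("nodes=%d  global sum=%e\n", nprocs, global_sum);

        free(x); free(y);
        MPI_Finalize();
        return 0;
    }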


3. PROCESSOR NODE

The processor node is a shared memory parallel vector supercomputer, in which 8 arithmetic processors, each of which can deliver 8GFLOPS, are tightly connected to a main memory system, giving a peak performance of 64GFLOPS (Fig. 3). It consists of 8 arithmetic processors, a main memory system, a Remote-access Control Unit (RCU), and an I/O processor (IOP).
The most advanced hardware technologies are adopted in the Earth Simulator development. One of the main features is a "One chip Vector Processor" with a peak performance of 8GFLOPS. This highly integrated LSI is fabricated using 0.15µm CMOS technology with copper interconnection. The vector pipeline units on the chip operate at 1GHz, while the other parts, including the external interface circuits, operate at 500MHz. The Earth Simulator uses ordinary air cooling, even though the processor chip dissipates approximately 140W, because a high-efficiency heat sink using heat pipes is adopted.
A high-speed main memory device was also developed to reduce the memory access latency and access cycle time. In addition, 500MHz source-synchronous transmission is used for the data transfer between the processor and main memory to increase the memory bandwidth. The data transfer rate between the processor and main memory is 32GB/s, and the aggregate bandwidth of the main memory is 256GB/s.
Two levels of parallel programming paradigms are provided within a processor node:

· Vector processing on a processor
· Parallel processing with shared memory

Fig. 3 Processor node configuration.
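A quick back-of-the-envelope check, using only the bandwidth and peak-performance figures quoted above, shows the resulting machine balance per arithmetic processor and per node (this calculation is an added illustration, not part of the original text):

    \[
    \frac{32\ \mathrm{GB/s}}{8\ \mathrm{GFLOPS}} = 4\ \mathrm{bytes/FLOP}
    \quad\text{(per AP)},
    \qquad
    \frac{256\ \mathrm{GB/s}}{64\ \mathrm{GFLOPS}} = 4\ \mathrm{bytes/FLOP}
    \quad\text{(per node)}.
    \]

That is, half an 8-byte word of memory traffic can be sustained for every floating-point operation at peak, which is the balance that makes the high sustained efficiency reported later plausible for memory-intensive vector codes.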

4. INTERCONNECTION NETWORK

The interconnection network is a huge 640 × 640 non-blocking crossbar switch, supporting global addressing and synchronization.
To realize such a huge crossbar switch, a byte-slicing technique is applied: the 640 × 640 non-blocking crossbar switch is divided into a control unit and 128 data switch units, as shown in Fig. 4. Each data switch unit is a one-byte-wide 640 × 640 non-blocking crossbar switch.

Fig. 4 Interconnection network configuration.

Physically, the Earth Simulator comprises 320 PN cabinets and 65 IN cabinets. Each PN cabinet contains two processor nodes, and the 65 IN cabinets contain the interconnection network. These PN and IN cabinets are installed in a building 65m long and 50m wide (Fig. 5). The interconnection network is positioned in the center of the computer room. The area occupied by the interconnection network is approximately 180m2, or 14m long and 13m wide, and the Earth Simulator as a whole occupies an area of approximately 1,600m2, or 41m long and 40m wide.

Fig. 5 Bird's-eye view of the Earth Simulator building.
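To picture the byte-slicing described above in concrete terms, the toy program below spreads a wide transfer unit over 128 independent single-byte lanes and reassembles it at the receiver. It is a purely conceptual model with invented names (LANES, UNITS); it says nothing about the actual switch hardware or data format.

    /* Conceptual model of byte-slicing: one wide unit travels over LANES
     * independent one-byte paths and is reassembled on arrival.
     * LANES = 128 mirrors the 128 data switch units; the rest is invented.
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define LANES 128           /* one lane per data switch unit      */
    #define UNITS 4             /* how many LANES-byte units are sent */
    #define LEN   (LANES * UNITS)

    int main(void)
    {
        uint8_t src[LEN], lane[LANES][UNITS], dst[LEN];

        for (int i = 0; i < LEN; i++) src[i] = (uint8_t)i;

        /* "Transmit": byte i travels through lane i % LANES. */
        for (int i = 0; i < LEN; i++)
            lane[i % LANES][i / LANES] = src[i];

        /* "Receive": the destination reassembles the unit from all lanes. */
        for (int i = 0; i < LEN; i++)
            dst[i] = lane[i % LANES][i / LANES];

        printf("reassembled correctly: %s\n",
               memcmp(src, dst, LEN) == 0 ? "yes" : "no");
        return 0;
    }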

No supervisor exists in the interconnection network. The control unit and the 128 data switch units operate asynchronously, so a processor node controls the total sequence of an inter-node communication. For example, the sequence of a data transfer from node A to node B is shown in Fig. 6.

1) Node A requests the control unit to reserve a data path from node A to node B; the control unit reserves the data path, then replies to node A.
2) Node A begins the data transfer to node B.
3) Node B receives all the data, then sends a data transfer completion code to node A.

Fig. 6 Inter-node communication mechanism.

In the Earth Simulator, 83,200 pairs of 1.25GHz serial transmission lines through copper cable are used to realize the aggregate switching capacity of 7.87TB/s, and 130 pairs are used to connect a processor node to the interconnection network. Thus, the error occurrence rate cannot be ignored in realizing stable inter-node communication. To resolve this problem, ECC codes are added to the transferred data, as shown in Fig. 7. A receiver node detects an intermittent inter-node communication failure by checking the ECC codes, and the erroneous byte data can almost always be corrected by the RCU within the receiver node.
ECC codes are also used for recovering from a continuous inter-node communication failure resulting from a data switch unit malfunction. In this case the erroneous byte data are continuously corrected by the RCU within every receiver node until the broken data switch unit is repaired.

Fig. 7 Inter-node interface with ECC codes.

To realize high-speed synchronization for parallel processing among nodes, a special feature is introduced. Counters within the interconnection network's control unit, called Global Barrier Counters (GBC), and flag registers within a processor node, called Global Barrier Flags (GBF), are used. This barrier synchronization mechanism is shown in Fig. 8 and proceeds as follows (a toy software model of the handshake is sketched below):

1) The master node sets the number of nodes used for the parallel program into the GBC within the IN's control unit.
2) The control unit resets all GBFs of the nodes used for the program.
3) A node on which the task completes decrements the GBC within the control unit, and then repeatedly checks its GBF until the GBF is asserted.
4) When GBC = 0, the control unit asserts all GBFs of the nodes used for the program.
5) All the nodes begin to process their next tasks.

Fig. 8 Barrier synchronization mechanism using GBC.
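The following toy program models this handshake in software, with one thread per node, an atomic counter standing in for the GBC and one flag per node standing in for the GBF. It only illustrates the protocol logic listed above, not the hardware: in particular, the last-arriving thread plays the control unit's role of asserting the flags, and all names and the thread-based setup are invented for the sketch.

    /* Toy software model of the GBC/GBF barrier (illustrative only). */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NODES 8                      /* stand-in for participating nodes */

    static atomic_int gbc;               /* Global Barrier Counter (control unit) */
    static atomic_int gbf[NODES];        /* Global Barrier Flag, one per node     */

    /* Steps 1 and 2, done once by the "master node": load GBC, clear GBFs. */
    static void barrier_init(int nodes)
    {
        for (int i = 0; i < nodes; i++)
            atomic_store(&gbf[i], 0);
        atomic_store(&gbc, nodes);
    }

    /* Steps 3-5, executed by every node when its task has finished. */
    static void barrier_wait(int node)
    {
        /* Step 3: report completion by decrementing GBC ... */
        if (atomic_fetch_sub(&gbc, 1) == 1) {
            /* Step 4: in hardware the control unit does this when GBC = 0;
             * here the last arrival asserts every GBF. */
            for (int i = 0; i < NODES; i++)
                atomic_store(&gbf[i], 1);
        }
        /* ... and spin on the local GBF until it is asserted. */
        while (atomic_load(&gbf[node]) == 0)
            ;                            /* busy-wait, as the text describes */
    }

    static void *node_main(void *arg)
    {
        int id = (int)(long)arg;
        /* the node's task would run here */
        barrier_wait(id);                /* step 5: the next task may begin */
        printf("node %d passed the barrier\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NODES];
        barrier_init(NODES);             /* master node: steps 1 and 2 */
        for (long i = 0; i < NODES; i++)
            pthread_create(&t[i], NULL, node_main, (void *)i);
        for (int i = 0; i < NODES; i++)
            pthread_join(t[i], NULL);
        return 0;
    }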

Therefore the barrier synchronization time is constantly less than 3.5µsec, with the number of nodes varying from 2 to 512, as shown in Fig. 9.

5. PERFORMANCE

Several performance measurements of the interconnection network, as well as the LINPACK benchmark, have been made.

Fig. 9 Scalability of MPI_Barrier (barrier time [µsec] vs. number of processor nodes).
Fig. 10 Inter-node communication bandwidth (Read/Write bandwidth [GB/s] vs. transfer data size [MB]).
Fig. 11 Scalability of MPI_Barrier with and without GBC (time [µsec] vs. number of processor nodes).
Fig. 12 LINPACK performance [TFLOPS] and peak ratio [%] vs. number of processors.

First, the interconnection bandwidth between every two nodes is shown in Fig. 10. The horizontal axis shows the transfer data size, and the vertical axis shows the bandwidth. Because the interconnection network is a single-stage crossbar switch, this performance is always achieved for two-node communication.
Second, the effectiveness of the Global Barrier Counter is summarized in Fig. 11. The horizontal axis shows the number of processor nodes, and the vertical axis shows the MPI_Barrier synchronization time. The line labeled "with GBC" shows the barrier synchronization time using the GBC, and "without GBC" shows software barrier synchronization. By using the GBC feature, the MPI_Barrier synchronization time is constantly less than 3.5µsec. On the other hand, the software barrier synchronization time increases in proportion to the number of nodes.
The performance of the LINPACK benchmark was measured to verify that the Earth Simulator is a high-performance and high-efficiency computer system. The result is shown in Fig. 12, with the number of processors varying from 1,024 to 5,120. The ratio to peak performance is more than 85%, and the LINPACK performance is proportional to the number of processors.
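For reference, measurements of this kind are typically taken with simple MPI micro-benchmarks such as the hedged sketch below, which times a ping-pong transfer between two ranks and an MPI_Barrier across all ranks. It is not the benchmark code behind Figs. 10 and 11; the message size, repetition count and output format are arbitrary choices made for the sketch.

    /* Minimal MPI micro-benchmark sketch: ping-pong bandwidth and barrier time. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define REPS 100

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Ping-pong between ranks 0 and 1: bandwidth for one message size. */
        size_t bytes = 1 << 20;                     /* 1MB example message */
        char *buf = malloc(bytes);
        MPI_Barrier(MPI_COMM_WORLD);
        if (size >= 2 && rank < 2) {
            double t0 = MPI_Wtime();
            for (int i = 0; i < REPS; i++) {
                if (rank == 0) {
                    MPI_Send(buf, (int)bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, (int)bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else {
                    MPI_Recv(buf, (int)bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(buf, (int)bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
                }
            }
            double t = (MPI_Wtime() - t0) / (2.0 * REPS);   /* one-way time */
            if (rank == 0)
                printf("point-to-point bandwidth: %.2f GB/s\n", bytes / t / 1e9);
        }

        /* Barrier time across all ranks, averaged over many repetitions. */
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < REPS; i++)
            MPI_Barrier(MPI_COMM_WORLD);
        double tb = (MPI_Wtime() - t0) / REPS;
        if (rank == 0)
            printf("MPI_Barrier (%d ranks): %.2f usec\n", size, tb * 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }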

6. CONCLUSION

An overview of the Earth Simulator system has been given in this paper, focusing on its three architectural features and their hardware realization, especially the details of the interconnection network.
The three architectural features (i.e., vector processors, shared memory and a high-bandwidth non-blocking interconnection crossbar network) provide three levels of parallel programming paradigms: vector processing on a processor, parallel processing with shared memory within a node, and parallel processing among distributed nodes via the interconnection network.
The Earth Simulator was thus proven to be the most powerful supercomputer, achieving 35.86TFLOPS, or 87.5% of the peak performance of the system, in the LINPACK benchmark, and 26.58TFLOPS, or 64.9% of the peak performance of the system, for a global atmospheric circulation model with the spectral method[3]. This record-breaking sustained performance makes this innovative system a very effective scientific tool for providing solutions for the sustainable development of humankind and its symbiosis with the planet earth.

ACKNOWLEDGMENTS

The authors would like to offer their sincere condolences for the late Dr. Hajime Miyoshi, who initiated the Earth Simulator project with outstanding leadership.
The authors would also like to thank all members who were engaged in the Earth Simulator project, and especially H. Uehara and J. Li for the performance measurements.

REFERENCES

[1] M. Yokokawa, S. Shingu, et al., "Performance Estimation of the Earth Simulator," Towards Teracomputing, Proc. of 8th ECMWF Workshop, pp.34-53, World Scientific, 1998.
[2] K. Yoshida and S. Shingu, "Research and development of the Earth Simulator," Proc. of 9th ECMWF Workshop, pp.1-13, World Scientific, 2000.
[3] M. Yokokawa, S. Shingu, et al., "A 26.58 TFLOPS Global Atmospheric Simulation with the Spectral Transform Method on the Earth Simulator," Proc. of SC2002, 2002.

Received November 1, 2002

* * * * * * * * * * * * * * *

Shinichi HABATA received his B.E. degree in electrical engineering and M.E. degree in information engineering from Hokkaido University in 1980 and 1982, respectively. He joined NEC Corporation in 1982, and is now Manager of the 4th Engineering Department, Computers Division. He is engaged in the development of supercomputer hardware.
Mr. Habata is a member of the Information Processing Society of Japan.

Shigemune KITAWAKI received his B.S. degree and M.S. degree in mathematical engineering from Kyoto University in 1966 and 1968, respectively. He joined NEC Corporation in 1968 and was then involved in developing compilers for NEC's computers. He joined the Earth Simulator project in 1998, and is now Group Leader of the "User Support Group" of the Earth Simulator Center, Japan Marine Science and Technology Center.
Mr. Kitawaki belongs to the Information Processing Society of Japan and the Association for Computing Machinery.

Mitsuo YOKOKAWA received his B.S. and M.S. degrees in mathematics from Tsukuba University in 1982 and 1984, respectively. He also received a D.E. degree in computer engineering from Tsukuba University in 1991. He joined the Japan Atomic Energy Research Institute in 1984, and was engaged in high-performance computation in nuclear engineering. He was engaged in the development of the Earth Simulator from 1996 to 2002. He joined the National Institute of Advanced Industrial Science and Technology (AIST) in 2002, and is Deputy Director of the Grid Technology Research Center of AIST. He was a visiting researcher at the Theory Center of Cornell University in the US from 1994 to 1995.
Dr. Yokokawa is a member of the Information Processing Society of Japan and the Japan Society for Industrial and Applied Mathematics.

* * * * * * * * * * * * * * *
