Special Issue on High Performance Computing
Architecture and Hardware for HPC

The Earth Simulator System

By Shinichi HABATA,* Mitsuo YOKOKAWA† and Shigemune KITAWAKI‡

*Computers Division
†National Institute of Advanced Industrial Science and Technology (AIST)
‡Earth Simulator Center, Japan Marine Science and Technology Center

ABSTRACT  The Earth Simulator, developed by the Japanese government's initiative "Earth Simulator Project," is a highly parallel vector system that consists of 640 processor nodes and an interconnection network. Each processor node is a shared memory parallel vector supercomputer, in which 8 vector processors, each delivering 8GFLOPS, are tightly connected to a shared memory, for a peak performance of 64GFLOPS per node. The interconnection network is a huge non-blocking crossbar switch linking the 640 processor nodes and supports global addressing and synchronization. The aggregate peak vector performance of the Earth Simulator is 40TFLOPS, and the intercommunication bandwidth between every two processor nodes is 12.3GB/s in each direction. The aggregate switching capacity of the interconnection network is 7.87TB/s. To realize a high-performance and high-efficiency computer system, three architectural features are applied in the Earth Simulator: vector processors, shared memory, and a high-bandwidth non-blocking interconnection crossbar network. The Earth Simulator achieved 35.86TFLOPS, or 87.5% of the peak performance of the system, in the LINPACK benchmark, and has been proven to be the most powerful supercomputer in the world. It also achieved 26.58TFLOPS, or 64.9% of the peak performance of the system, for a global atmospheric circulation model with the spectral method. This record-breaking sustained performance makes this innovative system a very effective scientific tool for providing solutions for the sustainable development of humankind and its symbiosis with the planet earth.

KEYWORDS  Supercomputer, HPC (High Performance Computing), Parallel processing, Shared memory, Distributed memory, Vector processor, Crossbar network

1. INTRODUCTION

The Japanese government's initiative "Earth Simulator project" started in 1997. Its target was to promote research for global change predictions by using computer simulation. One of the most important project activities was to develop a supercomputer at least 1,000 times as powerful as the fastest one available at the beginning of the "Earth Simulator project." After five years of research and development activities, the Earth Simulator was completed and came into operation at the end of February 2002.
Within only two months from the beginning of its operational run, the Earth Simulator was proven to be the most powerful supercomputer in the world by achieving 35.86TFLOPS, or 87.5% of the peak performance of the system, in the LINPACK benchmark. It also achieved 26.58TFLOPS, or 64.9% of the peak performance of the system, for a global atmospheric circulation model with the spectral method[3]. This simulation model was also developed in the "Earth Simulator project."

2. SYSTEM OVERVIEW

The Earth Simulator is a highly parallel vector supercomputer system consisting of 640 processor nodes (PN) and an interconnection network (IN) (Fig. 1). Each processor node is a shared memory parallel vector supercomputer, in which 8 arithmetic processors (AP) are tightly connected to a main memory system (MS). Each AP is a vector processor that can deliver 8GFLOPS, and each MS is a shared memory of 16GB. The total system thus comprises 5,120 APs and 640 MSs, and the aggregate peak vector performance and memory capacity of the Earth Simulator are 40TFLOPS and 10TB, respectively.
The interconnection network is a huge 640 × 640 non-blocking crossbar switch linking the 640 PNs. The interconnection bandwidth between every two PNs is 12.3GB/s in each direction. The aggregate switching capacity of the interconnection network is 7.87TB/s.

Fig. 1 General configuration of the Earth Simulator.

The operating system manages the Earth Simulator as a two-level cluster system, called the Super-Cluster System. In the Super-Cluster System, the 640 nodes are grouped into 40 clusters (Fig. 2). Each cluster consists of 16 processor nodes, a Cluster Control Station (CCS), an I/O Control Station (IOCS) and system disks. Each CCS controls 16 processor nodes and an IOCS. The IOCS is in charge of data transfer between the system disks and the Mass Storage System: it transfers user file data from the Mass Storage System to a system disk, and the processing results from a system disk to the Mass Storage System.
The supervisor, called the Super-Cluster Control Station (SCCS), manages all 40 clusters and provides a Single System Image (SSI) operational environment. For efficient resource management and job control, the 40 clusters are classified as one S-cluster and 39 L-clusters. In the S-cluster, two nodes are used for interactive use and another for small-size batch jobs. User disks are connected only to S-cluster nodes, and are used for storing user files. The aggregate capacities of the system disks and user disks are 415TB and 225TB, respectively. The Mass Storage System is a cartridge tape library system, and its capacity is more than 1.5PB.

Fig. 2 System configuration of the Earth Simulator.

To realize a high-performance and high-efficiency computer system, three architectural features are applied in the Earth Simulator:

· Vector processor
· Shared memory
· High-bandwidth and non-blocking interconnection crossbar network

From the standpoint of parallel programming, three levels of parallelizing paradigms are provided to gain high sustained performance (a minimal illustrative sketch follows the list):

· Vector processing on a processor
· Parallel processing with shared memory within a node
· Parallel processing among distributed nodes via the interconnection network
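The three levels correspond to the programming model typically used on machines of this class: message passing between nodes, shared-memory threading within a node, and compiler vectorization of inner loops. The following minimal sketch is illustrative only and is not taken from Earth Simulator software; it assumes a generic MPI library and an OpenMP compiler, with one MPI process per node and the threads standing in for the 8 APs of a node.

    /* Illustration (hypothetical) of the three parallelization levels:
     *   - MPI        : parallelism among distributed nodes (interconnection network)
     *   - OpenMP     : shared-memory parallelism among the APs within one node
     *   - inner loop : simple stride-1 work that a vectorizing compiler maps
     *                  onto the vector pipelines of each AP
     * Example build: mpicc -fopenmp -O3 levels.c -o levels
     */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N_PER_NODE (1 << 20)   /* elements owned by each node (arbitrary) */

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double *x = malloc(N_PER_NODE * sizeof *x);
        double *y = malloc(N_PER_NODE * sizeof *y);
        for (long i = 0; i < N_PER_NODE; i++) { x[i] = 1.0; y[i] = 2.0; }

        double local_sum = 0.0;
        /* Level 2: threads sharing the node's memory. */
        #pragma omp parallel for reduction(+:local_sum)
        for (long i = 0; i < N_PER_NODE; i++) {
            /* Level 1: vectorizable loop body. */
            y[i] = y[i] + 3.0 * x[i];
            local_sum += y[i];
        }

        /* Level 3: nodes cooperate through the network. */
        double global_sum = 0.0;
        MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

        if (rank == 0)
            printf("nodes=%d  global sum=%e\n", nprocs, global_sum);

        free(x); free(y);
        MPI_Finalize();
        return 0;
    }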


3. PROCESSOR NODE

The processor node is a shared memory parallel vector supercomputer, in which 8 arithmetic processors, each of which can deliver 8GFLOPS, are tightly connected to a main memory system, giving a peak performance of 64GFLOPS (Fig. 3). It consists of 8 arithmetic processors, a main memory system, a Remote-access Control Unit (RCU), and an I/O processor (IOP).
The most advanced hardware technologies are adopted in the Earth Simulator development. One of the main features is a "One chip Vector Processor" with a peak performance of 8GFLOPS. This highly integrated LSI is fabricated using 0.15µm CMOS technology with copper interconnection. The vector pipeline units on the chip operate at 1GHz, while the other parts, including the external interface circuits, operate at 500MHz. The Earth Simulator uses ordinary air cooling, even though the processor chip dissipates approximately 140W, because a high-efficiency heat sink using heat pipes is adopted.
A high-speed main memory device was also developed to reduce the memory access latency and access cycle time. In addition, 500MHz source-synchronous transmission is used for the data transfer between the processor and main memory to increase the memory bandwidth. The data transfer rate between the processor and main memory is 32GB/s, and the aggregate bandwidth of the main memory is 256GB/s.
Two levels of parallel programming paradigms are provided within a processor node:

· Vector processing on a processor
· Parallel processing with shared memory

Fig. 3 Processor node configuration.
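A quick back-of-the-envelope check, using only the bandwidth and peak-performance figures quoted above, shows the resulting machine balance per arithmetic processor and per node (this calculation is an added illustration, not part of the original text):

    \[
    \frac{32\ \mathrm{GB/s}}{8\ \mathrm{GFLOPS}} = 4\ \mathrm{bytes/FLOP}
    \quad\text{(per AP)},
    \qquad
    \frac{256\ \mathrm{GB/s}}{64\ \mathrm{GFLOPS}} = 4\ \mathrm{bytes/FLOP}
    \quad\text{(per node)}.
    \]

That is, half an 8-byte word of memory traffic can be sustained for every floating-point operation at peak, which is the balance that makes the high sustained efficiency reported later plausible for memory-intensive vector codes.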

4. INTERCONNECTION NETWORK

The interconnection network is a huge 640 × 640 non-blocking crossbar switch, supporting global addressing and synchronization.
To realize such a huge crossbar switch, a byte-slicing technique is applied: the 640 × 640 non-blocking crossbar switch is divided into a control unit and 128 data switch units, as shown in Fig. 4. Each data switch unit is a one-byte-wide 640 × 640 non-blocking crossbar switch.

Fig. 4 Interconnection network configuration.

Physically, the Earth Simulator comprises 320 PN cabinets and 65 IN cabinets. Each PN cabinet contains two processor nodes, and the 65 IN cabinets contain the interconnection network. These PN and IN cabinets are installed in a building 65m long and 50m wide (Fig. 5). The interconnection network is positioned in the center of the computer room. The area occupied by the interconnection network is approximately 180m2, or 14m long and 13m wide, and the Earth Simulator as a whole occupies an area of approximately 1,600m2, or 41m long and 40m wide.

Fig. 5 Bird's-eye view of the Earth Simulator building.
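To picture the byte-slicing described above in concrete terms, the toy program below spreads a wide transfer unit over 128 independent single-byte lanes and reassembles it at the receiver. It is a purely conceptual model with invented names (LANES, UNITS); it says nothing about the actual switch hardware or data format.

    /* Conceptual model of byte-slicing: one wide unit travels over LANES
     * independent one-byte paths and is reassembled on arrival.
     * LANES = 128 mirrors the 128 data switch units; the rest is invented.
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define LANES 128           /* one lane per data switch unit      */
    #define UNITS 4             /* how many LANES-byte units are sent */
    #define LEN   (LANES * UNITS)

    int main(void)
    {
        uint8_t src[LEN], lane[LANES][UNITS], dst[LEN];

        for (int i = 0; i < LEN; i++) src[i] = (uint8_t)i;

        /* "Transmit": byte i travels through lane i % LANES. */
        for (int i = 0; i < LEN; i++)
            lane[i % LANES][i / LANES] = src[i];

        /* "Receive": the destination reassembles the unit from all lanes. */
        for (int i = 0; i < LEN; i++)
            dst[i] = lane[i % LANES][i / LANES];

        printf("reassembled correctly: %s\n",
               memcmp(src, dst, LEN) == 0 ? "yes" : "no");
        return 0;
    }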

No supervisor exists in the interconnection network. The control unit and the 128 data switch units operate asynchronously, so a processor node controls the total sequence of an inter-node communication. For example, the sequence of a data transfer from node A to node B is shown in Fig. 6.

1) Node A requests the control unit to reserve a data path from node A to node B; the control unit reserves the data path, then replies to node A.
2) Node A begins the data transfer to node B.
3) Node B receives all the data, then sends a data transfer completion code to node A.

Fig. 6 Inter-node communication mechanism.

In the Earth Simulator, 83,200 pairs of 1.25GHz serial transmission lines through copper cable are used to realize the aggregate switching capacity of 7.87TB/s, and 130 pairs are used to connect a processor node to the interconnection network. Thus, the error occurrence rate cannot be ignored in realizing stable inter-node communication. To resolve this problem, ECC codes are added to the transferred data, as shown in Fig. 7. A receiver node detects an intermittent inter-node communication failure by checking the ECC codes, and the erroneous byte data can almost always be corrected by the RCU within the receiver node.
ECC codes are also used for recovering from a continuous inter-node communication failure resulting from a data switch unit malfunction. In this case the erroneous byte data are continuously corrected by the RCU within every receiver node until the broken data switch unit is repaired.

Fig. 7 Inter-node interface with ECC codes.

To realize high-speed synchronization for parallel processing among nodes, a special feature is introduced. Counters within the interconnection network's control unit, called Global Barrier Counters (GBC), and flag registers within a processor node, called Global Barrier Flags (GBF), are used. This barrier synchronization mechanism is shown in Fig. 8 and proceeds as follows (a toy software model of the handshake is sketched below):

1) The master node sets the number of nodes used for the parallel program into the GBC within the IN's control unit.
2) The control unit resets all GBFs of the nodes used for the program.
3) A node on which the task completes decrements the GBC within the control unit, and then repeatedly checks its GBF until the GBF is asserted.
4) When GBC = 0, the control unit asserts all GBFs of the nodes used for the program.
5) All the nodes begin to process their next tasks.

Fig. 8 Barrier synchronization mechanism using GBC.
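The following toy program models this handshake in software, with one thread per node, an atomic counter standing in for the GBC and one flag per node standing in for the GBF. It only illustrates the protocol logic listed above, not the hardware: in particular, the last-arriving thread plays the control unit's role of asserting the flags, and all names and the thread-based setup are invented for the sketch.

    /* Toy software model of the GBC/GBF barrier (illustrative only). */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NODES 8                      /* stand-in for participating nodes */

    static atomic_int gbc;               /* Global Barrier Counter (control unit) */
    static atomic_int gbf[NODES];        /* Global Barrier Flag, one per node     */

    /* Steps 1 and 2, done once by the "master node": load GBC, clear GBFs. */
    static void barrier_init(int nodes)
    {
        for (int i = 0; i < nodes; i++)
            atomic_store(&gbf[i], 0);
        atomic_store(&gbc, nodes);
    }

    /* Steps 3-5, executed by every node when its task has finished. */
    static void barrier_wait(int node)
    {
        /* Step 3: report completion by decrementing GBC ... */
        if (atomic_fetch_sub(&gbc, 1) == 1) {
            /* Step 4: in hardware the control unit does this when GBC = 0;
             * here the last arrival asserts every GBF. */
            for (int i = 0; i < NODES; i++)
                atomic_store(&gbf[i], 1);
        }
        /* ... and spin on the local GBF until it is asserted. */
        while (atomic_load(&gbf[node]) == 0)
            ;                            /* busy-wait, as the text describes */
    }

    static void *node_main(void *arg)
    {
        int id = (int)(long)arg;
        /* the node's task would run here */
        barrier_wait(id);                /* step 5: the next task may begin */
        printf("node %d passed the barrier\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NODES];
        barrier_init(NODES);             /* master node: steps 1 and 2 */
        for (long i = 0; i < NODES; i++)
            pthread_create(&t[i], NULL, node_main, (void *)i);
        for (int i = 0; i < NODES; i++)
            pthread_join(t[i], NULL);
        return 0;
    }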

Therefore the barrier synchronization time is constantly less than 3.5µsec, with the number of nodes varying from 2 to 512, as shown in Fig. 9.

5. PERFORMANCE

Several performance measurements of the interconnection network, as well as the LINPACK benchmark, have been made.

Fig. 9 Scalability of MPI_Barrier (barrier time [µsec] vs. number of processor nodes).
Fig. 10 Inter-node communication bandwidth (Read/Write bandwidth [GB/s] vs. transfer data size [MB]).
Fig. 11 Scalability of MPI_Barrier with and without GBC (time [µsec] vs. number of processor nodes).
Fig. 12 LINPACK performance [TFLOPS] and peak ratio [%] vs. number of processors.

First, the interconnection bandwidth between every two nodes is shown in Fig. 10. The horizontal axis shows the transfer data size, and the vertical axis shows the bandwidth. Because the interconnection network is a single-stage crossbar switch, this performance is always achieved for two-node communication.
Second, the effectiveness of the Global Barrier Counter is summarized in Fig. 11. The horizontal axis shows the number of processor nodes, and the vertical axis shows the MPI_Barrier synchronization time. The line labeled "with GBC" shows the barrier synchronization time using the GBC, and "without GBC" shows software barrier synchronization. By using the GBC feature, the MPI_Barrier synchronization time is constantly less than 3.5µsec. On the other hand, the software barrier synchronization time increases in proportion to the number of nodes.
The performance of the LINPACK benchmark was measured to verify that the Earth Simulator is a high-performance and high-efficiency computer system. The result is shown in Fig. 12, with the number of processors varying from 1,024 to 5,120. The ratio to peak performance is more than 85%, and the LINPACK performance is proportional to the number of processors.
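For reference, measurements of this kind are typically taken with simple MPI micro-benchmarks such as the hedged sketch below, which times a ping-pong transfer between two ranks and an MPI_Barrier across all ranks. It is not the benchmark code behind Figs. 10 and 11; the message size, repetition count and output format are arbitrary choices made for the sketch.

    /* Minimal MPI micro-benchmark sketch: ping-pong bandwidth and barrier time. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define REPS 100

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Ping-pong between ranks 0 and 1: bandwidth for one message size. */
        size_t bytes = 1 << 20;                     /* 1MB example message */
        char *buf = malloc(bytes);
        MPI_Barrier(MPI_COMM_WORLD);
        if (size >= 2 && rank < 2) {
            double t0 = MPI_Wtime();
            for (int i = 0; i < REPS; i++) {
                if (rank == 0) {
                    MPI_Send(buf, (int)bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, (int)bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else {
                    MPI_Recv(buf, (int)bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(buf, (int)bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
                }
            }
            double t = (MPI_Wtime() - t0) / (2.0 * REPS);   /* one-way time */
            if (rank == 0)
                printf("point-to-point bandwidth: %.2f GB/s\n", bytes / t / 1e9);
        }

        /* Barrier time across all ranks, averaged over many repetitions. */
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < REPS; i++)
            MPI_Barrier(MPI_COMM_WORLD);
        double tb = (MPI_Wtime() - t0) / REPS;
        if (rank == 0)
            printf("MPI_Barrier (%d ranks): %.2f usec\n", size, tb * 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }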

6. CONCLUSION

An overview of the Earth Simulator system has been given in this paper, focusing on its three architectural features and their hardware realization, especially the details of the interconnection network.
The three architectural features (i.e., vector processors, shared memory and a high-bandwidth non-blocking interconnection crossbar network) provide three levels of parallel programming paradigms: vector processing on a processor, parallel processing with shared memory within a node, and parallel processing among distributed nodes via the interconnection network.
The Earth Simulator was thus proven to be the most powerful supercomputer, achieving 35.86TFLOPS, or 87.5% of the peak performance of the system, in the LINPACK benchmark, and 26.58TFLOPS, or 64.9% of the peak performance of the system, for a global atmospheric circulation model with the spectral method[3]. This record-breaking sustained performance makes this innovative system a very effective scientific tool for providing solutions for the sustainable development of humankind and its symbiosis with the planet earth.

ACKNOWLEDGMENTS

The authors would like to offer their sincere condolences for the late Dr. Hajime Miyoshi, who initiated the Earth Simulator project with outstanding leadership.
The authors would also like to thank all members who were engaged in the Earth Simulator project, and especially H. Uehara and J. Li for the performance measurements.

REFERENCES

[1] M. Yokokawa, S. Shingu, et al., "Performance Estimation of the Earth Simulator," Towards Teracomputing, Proc. of 8th ECMWF Workshop, pp.34-53, World Scientific, 1998.
[2] K. Yoshida and S. Shingu, "Research and development of the Earth Simulator," Proc. of 9th ECMWF Workshop, pp.1-13, World Scientific, 2000.
[3] M. Yokokawa, S. Shingu, et al., "A 26.58 TFLOPS Global Atmospheric Simulation with the Spectral Transform Method on the Earth Simulator," Proc. of SC2002, 2002.

Received November 1, 2002

* * * * * * * * * * * * * * *

Shinichi HABATA received his B.E. degree in electrical engineering and M.E. degree in information engineering from Hokkaido University in 1980 and 1982, respectively. He joined NEC Corporation in 1982, and is now Manager of the 4th Engineering Department, Computers Division. He is engaged in the development of supercomputer hardware.
Mr. Habata is a member of the Information Processing Society of Japan.

Shigemune KITAWAKI received his B.S. degree and M.S. degree in mathematical engineering from Kyoto University in 1966 and 1968, respectively. He joined NEC Corporation in 1968 and was then involved in developing compilers for NEC's computers. He joined the Earth Simulator project in 1998, and is now Group Leader of the "User Support Group" of the Earth Simulator Center, Japan Marine Science and Technology Center.
Mr. Kitawaki belongs to the Information Processing Society of Japan and the Association for Computing Machinery.

Mitsuo YOKOKAWA received his B.S. and M.S. degrees in mathematics from Tsukuba University in 1982 and 1984, respectively. He also received a D.E. degree in computer engineering from Tsukuba University in 1991. He joined the Japan Atomic Energy Research Institute in 1984, and was engaged in high-performance computation in nuclear engineering. He was engaged in the development of the Earth Simulator from 1996 to 2002. He joined the National Institute of Advanced Industrial Science and Technology (AIST) in 2002, and is Deputy Director of the Grid Technology Research Center of AIST. He was a visiting researcher at the Theory Center of Cornell University in the US from 1994 to 1995.
Dr. Yokokawa is a member of the Information Processing Society of Japan and the Japan Society for Industrial and Applied Mathematics.

* * * * * * * * * * * * * * *
