RIKEN Super Combined Cluster (RSCC) System
Kouichi Kumon, Toshiyuki Kimura, Kohichiro Hotta, Takayuki Hoshiya
(Manuscript received June 14, 2004)

Recently, Linux cluster systems have been replacing conventional vector processors in the area of high-performance computing. A typical Linux cluster system consists of high-performance commodity CPUs, such as the Intel Xeon processor, and commodity networks, such as Gigabit Ethernet and Myrinet. Therefore, Linux cluster systems directly benefit from the rapid performance improvements of these commodity components. It seems easy to build a large-scale Linux cluster by connecting these components, and the theoretical peak performance may be high. However, obtaining stable operation and the expected performance from a large-scale cluster requires dedicated technologies from the hardware level up to the software level. In February 2004, Fujitsu shipped one of the world's largest clusters to a customer. This system is currently the fastest Linux cluster in Japan and the seventh fastest supercomputer in the world. In this paper, we describe the technologies for realizing such a high-performance cluster system and then describe the implementation of an example large-scale cluster and its performance.

1. Introduction

A Linux cluster (sometimes called a PC cluster) consists of Intel Architecture (IA) based servers, interconnect networks, and cluster middleware. Formerly, the PC cluster system was regarded as a low-budget supercomputer because it uses inexpensive PCs as computing engines and Ethernet for the interconnections. Many PC boards now have Ethernet built in, so such a PC cluster can be built using just PCs and Ethernet hubs. Because of the improvements in the performance of IA CPUs, supercomputers can now be replaced with PC clusters by adding extra functionality or small amounts of non-commodity components. Therefore, we use the term "Linux Cluster" to distinguish such a supercomputer from a PC cluster system.

The major components of a Linux cluster system are the IA servers and the interconnect networks, and the number of CPUs can vary from several to more than a thousand, depending on the performance requirements. For a large-scale cluster, the network's performance becomes more important because, if it is poor, scalability will be limited regardless of the number of nodes. Therefore, network performance is a key issue in a large-scale system. Improving it requires not only hardware enhancements but also optimization of the driver software and the network middleware. In addition to the performance issue, system management poses another problem. When a Linux cluster system is used by a relatively small, experienced group or by a single person, the system's stability is not a prime issue because stability is maintained by the users' programming skills. For example, a well-written program periodically dumps its internal state (checkpointing) so that it can be restarted from the latest dump (recovered) when the system crashes.
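As a concrete illustration of this application-level style of checkpointing, the short C sketch below saves the loop counter and the working array to a file at a fixed interval and reloads them at start-up if a dump exists. It is only a sketch of the general technique: the state layout, the file name state.ckpt, and the checkpoint interval are assumptions made for the example, not details of the RSCC software.

    /* Sketch of application-level checkpointing. The state layout,
     * file name, and interval below are illustrative assumptions.  */
    #include <stdio.h>

    #define N        (1 << 20)       /* size of the working array       */
    #define STEPS    10000           /* total iterations to run         */
    #define INTERVAL 200             /* checkpoint every 200 iterations */
    #define CKPT     "state.ckpt"    /* dump file (assumed name)        */

    static double x[N];

    int main(void)
    {
        int step = 0;

        /* Recovery: if a previous dump exists, restore the counter
         * and the working data and resume from that point.          */
        FILE *fp = fopen(CKPT, "rb");
        if (fp != NULL) {
            fread(&step, sizeof step, 1, fp);
            fread(x, sizeof x[0], N, fp);
            fclose(fp);
        }

        for (; step < STEPS; step++) {
            /* Checkpoint: dump the state before applying this step,
             * so a restart simply redoes the interrupted step.       */
            if (step % INTERVAL == 0) {
                fp = fopen(CKPT, "wb");
                if (fp != NULL) {
                    fwrite(&step, sizeof step, 1, fp);
                    fwrite(x, sizeof x[0], N, fp);
                    fclose(fp);
                }
            }
            for (int i = 0; i < N; i++)   /* the actual computation */
                x[i] += 1.0 / (double)(step + 1);
        }
        return 0;
    }

The limitation is that every application must carry this recovery logic itself, which is the motivation for the system-level checkpoint-restart facility described below.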
However, as the application domain is extended, clusters are being used as the central machines in laboratories and even in companies. As a result, cluster systems must be able to perform checkpointing and recovery without the help of the application programs. Therefore, the system software must strengthen the system's stability as well as its performance.

The SCore cluster middleware 1), note 1) is system software for solving these problems. It provides both high-performance communication interfaces and a checkpoint-restart function note 2) that can augment the scalability and reliability of a large-scale system without application modification. In February 2004, Fujitsu shipped a large-scale Linux cluster system to a customer. This system, the RIKEN Super Combined Cluster (RSCC), is the fastest Linux cluster in Japan and the seventh fastest supercomputer in the world. It is currently operating as the main computer at the customer's supercomputer center. In this paper, we describe technologies for realizing a high-performance Linux cluster. Some examples of these technologies are 1) high-performance interconnection networks, 2) programming support technology that extends cluster usage so users can migrate their code from a vector computer environment to a cluster environment, and 3) management features that improve system reliability. We then introduce the RSCC system as an example of our Linux cluster technologies and describe its performance.

note 1) SCore was developed by the Real World Computing Project, which was funded by the Japanese Government. It is currently supported by the Linux Cluster Consortium.
note 2) A checkpoint-restart function dumps the status of a running process so it can be used to restore the status at a later time. It is mainly used to recover the status after a system has crashed.

2. Issues of Linux cluster systems for a high-performance computing facility

To realize a high-performance, large-scale Linux cluster and widen its application area, at least three goals must be reached: 1) a high-performance interconnect network to gain the best application performance, 2) language support to widen the application area, and 3) management support for a large-scale cluster to realize stable operation. In the following sections, we describe the technologies we have developed to reach these goals.

2.1 InfiniBand: A high-performance cluster interconnect

On a Linux cluster system, a large-scale computation is divided into small blocks of tasks, the blocks are distributed to the nodes of the cluster, and the nodes then execute the tasks while communicating with the other nodes through the interconnect network. If the application needs frequent communication to execute, the overall system performance will mainly be determined by the network performance. Therefore, a high-performance interconnection network is needed to build a high-performance cluster.
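To make this execution model concrete, the following sketch uses the standard MPI interface (the library layered over PMv2 in SCore, as described below) to split a simple reduction over a large index space across the nodes: each rank computes its own block, and a single collective call combines the partial results over the interconnect. The problem, its size, and the block decomposition are assumptions chosen only to illustrate the pattern.

    /* Sketch: dividing work into per-node blocks and combining the
     * results over the interconnect with MPI. Illustrative only.   */
    #include <mpi.h>
    #include <stdio.h>

    #define TOTAL 10000000L          /* global problem size (assumed) */

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Each node is assigned one contiguous block of the index space. */
        long begin = TOTAL * rank / nprocs;
        long end   = TOTAL * (rank + 1) / nprocs;

        double local = 0.0;
        for (long i = begin; i < end; i++)   /* local computation */
            local += 1.0 / (double)(i + 1);

        /* One collective over the interconnect combines the blocks. */
        double global = 0.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %.6f on %d processes\n", global, nprocs);

        MPI_Finalize();
        return 0;
    }

In a tightly coupled application, exchanges like the final collective occur on every iteration, so the throughput and latency of the interconnect directly bound the achievable speed-up, which leads to the two measures discussed next.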
Two common measures are used to evaluate an interconnect network: the network throughput and the roundtrip latency time (RTT). For a cluster interconnect, Gigabit Ethernet and Myrinet are typically used. Gigabit Ethernet is cheap and is commonly used in systems ranging from desktop computers to large server systems. Its theoretical maximum throughput is 1 Gbps (125 MB/s), and its roundtrip latency, which depends on the network interface controller (NIC) and the communication protocol, is usually from 30 to 100 µs. Myricom's Myrinet interconnect is widely used as a high-performance interconnect and has a 2 Gbps link speed and a roundtrip latency of less than 10 µs. As the cluster scale increases, a much higher-performance interconnect is needed. To meet this need, we developed support for the InfiniBand interconnect.

InfiniBand is a general-purpose interconnect network specified by the InfiniBand Trade Association (IBTA). The specification defines three data transfer speeds: 2 Gbps, 8 Gbps, and 24 Gbps, called the 1x, 4x, and 12x links, respectively. To gain superior performance, the 4x and 12x InfiniBand links are the candidates. Fujitsu already ships a 4x InfiniBand Host Channel Adaptor (HCA) for its PRIMEPOWER server, so we decided to develop a high-performance driver for use with a Linux cluster system. To provide good network performance, the interface between a user program and the hardware is also important. We therefore chose the SCore cluster middleware to realize the required performance and stability. The main components of the SCore middleware are the PMv2 2) high-performance, low-level communication interface and the SCore-D 3) cluster operating system (Figure 1).

[Figure 1: SCore cluster middleware architecture. User level: applications (SCASH, MPI/C++) over MPICH-SCore, the SCore-D global OS, and the PMv2 layer (PM/InfiniBand-FJ, PM/Shmem, PM/Ethernet, PM/UDP, PM/Myrinet, sockets); Linux kernel level: the corresponding drivers and UDP/IP; NIC level: Ethernet NICs, InfiniBand HCA, and Myrinet NIC. IB: InfiniBand, UDP: User Datagram Protocol, HCA: Host Channel Adaptor.]

PMv2 provides a short-latency, high-throughput network infrastructure to the MPI (Message Passing Interface) library and abstracts the implementation of the underlying hardware, such as Ethernet, Myrinet, shared memory, and SCI (Scalable Coherent Interface). Therefore, by adding the PM/InfiniBand-FJ driver, application programs can run on these different kinds of interconnect hardware without modification or recompilation.

The PM/InfiniBand-FJ library communicates directly with the InfiniBand HCA; therefore, no kernel overhead is added to the communication overhead (Figure 2). Conventional networks need a driver to prevent malicious access to the hardware and to prevent interference between user processes. The InfiniBand HCA has a hardware mechanism that implements this protection; therefore, the user-level library can safely access the hardware directly.

[Figure 2: Architecture of PM/InfiniBand-FJ. The PMv2 and PM/IB-FJ libraries in the user process bypass the Linux kernel (PM/IB-FJ driver, firmware control, memory management) and communicate directly with the PM/IB-FJ firmware on the IB HCA hardware, which implements the InfiniBand link protocol.]

Table 1
PM-level roundtrip latency.

                      RTT (µs)   Ratio
  PM/InfiniBand-FJ      16.8     2.02
  PM/MyrinetXP           8.3     1
  PM/Ethernet           29.6     3.37

[Figure: Communication bandwidth (MB/s, 0 to 1000) versus the horizontal axis values 1.E+00 to 1.E+07, for InfiniBand/GC-LE, InfiniBand/RX200A, MyrinetXP/RX200A, and Ethernet/RX200A.]
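Latency and bandwidth figures such as those above are usually obtained with a ping-pong microbenchmark. The numbers in Table 1 are measured at the PM level; the sketch below shows the same idea one layer higher, using only standard MPI calls, so the values it reports would also include the MPI overhead. The message sizes and repetition count are arbitrary choices for the illustration.

    /* Ping-pong sketch between ranks 0 and 1: round-trip time for small
     * messages, bandwidth for large ones. MPI-level, illustrative only. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int reps = 1000;
        for (long bytes = 1; bytes <= 1L << 20; bytes *= 16) {
            char *buf = malloc(bytes);
            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();

            for (int i = 0; i < reps; i++) {
                if (rank == 0) {            /* send, then wait for the echo */
                    MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else if (rank == 1) {     /* echo the message back */
                    MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }

            double rtt = (MPI_Wtime() - t0) / reps;  /* seconds per round trip */
            if (rank == 0)
                printf("%8ld bytes: RTT %8.1f us, bandwidth %8.1f MB/s\n",
                       bytes, rtt * 1e6, 2.0 * bytes / rtt / 1e6);
            free(buf);
        }

        MPI_Finalize();
        return 0;
    }

Timing many repetitions and dividing by the count, rather than timing a single exchange, keeps the measurement well above the resolution and jitter of MPI_Wtime.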