Massive Supercomputing Coping with Heterogeneity of Modern Accelerators

Toshio Endo and Satoshi Matsuoka
Tokyo Institute of Technology/JST, Japan

Abstract

Heterogeneous supercomputers with combined general-purpose and accelerated CPUs promise to be the future major architecture due to their wide-ranging generality and superior performance/power ratio. However, developing applications that achieve effective scalability is still very difficult, and in fact unproven on large-scale machines in such a combined setting. We show an effective method for such heterogeneous systems, so that porting from applications written with homogeneous assumptions can be achieved. For this goal, we divide the porting of applications into several steps: analyze the performance of the kernel computation, create processes that virtualize the underlying processors, tune parameters with preference given to accelerators, and balance the load between heterogeneous nodes. We apply our method to the parallel Linpack benchmark on the TSUBAME heterogeneous supercomputer. We efficiently utilize both 10,000 general-purpose CPU cores and 648 SIMD accelerators in a combined fashion; the resulting 56.43 TFlops utilized the entire machine, and not only ranked significantly on the Top500 supercomputer list, but is also the highest Linpack performance on heterogeneous systems in the world.

1 Introduction

Although massively-parallel homogeneous supercomputers using low-power CPUs and/or multi-core CPUs, such as BlueGene/L [8], are one promising road to petascale, another approach is heterogeneous supercomputers. They consist of general-purpose CPUs and more specific, dedicated programmable accelerated processors. CPUs offer flexibility and generality over wide-ranging classes of applications, while accelerators provide a high performance/power ratio for specific computation patterns, so their combined use would be ideal. Several research and commercial projects have taken diverse approaches to heterogeneous architecture; AMD Torrenza and a vector acceleration project by HP [4] connect heterogeneous resources via the front side bus, although their scalability at massive scale is unproved. An example of petascale heterogeneous supercomputers will be the IBM Roadrunner, which will combine AMD Opteron CPUs and Cell Broadband Engines [7], to be built in 2008 and boasting 1.6 PetaFlops.

The largest heterogeneous supercomputer to date is our TSUBAME supercomputer, installed at the Tokyo Institute of Technology in April 2006. TSUBAME is also currently the fastest supercomputer in the Asia-Pacific region, based on 5,240 dual-core Opteron CPUs (10,480 CPU cores) and 648 SIMD vector accelerator boards from ClearSpeed [1]. The peak performance of the Opterons is approximately 50.4 TFlops, and that of the 648 accelerator boards is 52.2 TFlops, totaling 102.6 TFlops of computing resources.

The major issue that arises on heterogeneous supercomputers is how users can develop programs that effectively use the hybrid computing resources. Despite the fairly long history of supercomputers with vector acceleration options, such as the CM-5 and the Meiko CS-2, and the recent high interest in SIMD-vector programming, there have been few results on the scalability of tightly-coupled parallel code on large heterogeneous machines. On the software side, recent research pursuing easy programming with heterogeneity has focused on small-scale (often single-node) systems only, including Sequoia [10], CUDA [2] and Accelerator [18]. We note that such difficulty is rather fundamental; the vast differences in architecture between CPUs and accelerators not only make programming challenging in terms of coding, but also make proper load balancing difficult. Even a small deviation from the optimally balanced load may have an adverse effect on the overall scalability, especially for tightly-coupled parallel codes. As a result, previous results have been for loosely-coupled codes and/or small-scale computation. Our work, however, demonstrates that combined usage of resources can be achieved with high efficiency and scaling with existing tightly-coupled parallel codes.


Our approach consists of four steps, namely: (1) carefully analyze and model the kernel of the code, (2) create virtual processes to emulate apparent homogeneity on the underlying heterogeneous resources, (3) tune parameters with preference given to accelerators, and (4) balance the load taking heterogeneity and the overhead of accelerators into account. We demonstrate the effectiveness of our approach by modifying High Performance Linpack (HPL) [17] for heterogeneous systems consisting of massive numbers of general-purpose and acceleration CPUs, and demonstrate its efficiency and scalability on TSUBAME.

Without a careful strategy and tuning, HPL performance can even become slower due to the introduction of accelerators, as shown in Figure 1; the graph compares the performance of a 'CPU-only' case and cases with acceleration on two TSUBAME nodes. When we adopt improper granules of compute size ('Bad granularity' in the graph) or ignore the overhead of introducing accelerators ('Bad load balance'), the overall performance lags behind the CPU-only case. With careful parameter configuration ('Good configuration'), HPL with accelerators gets significantly faster than CPU-only. And in fact it scales on TSUBAME with varying node configurations, up to the scale of the full machine, achieving 56.43 TFlops on more than 10,000 CPU cores and 648 accelerators, a significant improvement over our CPU-only result of 38.18 TFlops. As far as we know, this is the first time that a heterogeneous supercomputer has been ranked high (9th to 16th) on the Top500 list [3].

Figure 1. HPL performance on two TSUBAME nodes. In '+n Acc', we use n accelerator boards in addition to CPUs.

2 TSUBAME: The Heterogeneous 100 TeraFlops Supercomputer

NEC/Sun TSUBAME is a 'fat-node' supercomputer cluster that consists of 655 SunFire X4600 compute nodes. Each compute node has 16 2.4 GHz AMD Opteron CPU cores with 32 GBytes of shared memory, while the storage server nodes total 1.6 PBytes in raw capacity. Both the computing and storage nodes have Voltaire InfiniBand 4x HCAs, interconnected in a restricted fat-tree topology with two core switches and six edge switches (288-port Voltaire ISR9288), as shown in Figure 2. Each node is connected to one of the edge switches via two 10 Gbps InfiniBand links, and the switches are mutually interconnected via 24 trunked InfiniBand links. Each compute node runs 64-bit SuSE Linux Enterprise Server with kernel 2.6.5. Users invoke their jobs via the Sun N1 Grid Engine scheduler, though our large-scale experiments are done in a dedicated situation.

Among the 655 TSUBAME compute nodes, 648 are equipped with ClearSpeed Advance X620 accelerator boards [1] on their PCI-X slots. Each board hosts two ClearSpeed CSX600 SIMD processors, each of which has 96 SIMD processing elements; its theoretical peak performance is 80.6 GFlops (double precision) at a 210 MHz clock speed (the clock speed is configurable up to 250 MHz, but a faster clock may make computational results unstable). A board also hosts 1 GB of DDR-SDRAM, which the CSX600 processors can directly access. Since this memory is separate from host memory, communication of input/output data via PCI-X is necessary. A notable feature of the boards is low power; the consumption of a board is only about 25 W, or 16 kW in total for the entire 648 boards, accounting for less than 2% of the power consumption of TSUBAME while offering 50% of the overall compute power. Currently ClearSpeed provides the following software: a SIMD parallel programming language Cn, the CSXL library for basic linear algebra (BLAS), mainly used in this paper, and the CSFFT library for fast Fourier transforms. Recently a molecular dynamics package has also been provided.

In heterogeneously accelerated supercomputers such as TSUBAME, we may be faced with two types of heterogeneity, namely intra-node heterogeneity and inter-node heterogeneity. The former means that accelerated nodes have both general-purpose CPUs and SIMD accelerators with different characteristics, with the promise of generality and low power/space consumption. We may be faced with the latter for less technical reasons; if part of the nodes are equipped with accelerators and others are not, we have to take the difference between accelerated and non-accelerated nodes into account.

Figure 2. Overview of the TSUBAME architecture. Currently, shaded SunFire nodes are equipped with ClearSpeed accelerators.

This was the case with TSUBAME, since only about 55% of the nodes were equipped with accelerators until October 2007. In general, computer centers with large-scale clusters may come across similar situations, due to changes in commercial products or budgetary reasons. Therefore techniques to tackle this situation are becoming important for efficient large-scale computing. In more general cases, the numbers and/or the types of accelerators may differ among the nodes.

As already discussed, there are several approaches to introducing heterogeneity into supercomputers. TSUBAME's approach of equipping accelerators on PCI-X slots has the following advantages. First, it is more portable; accelerators tightly coupled with general-purpose CPUs, such as AMD Torrenza, are attractive for their superior bandwidth, but they depend heavily on the CPU architecture. Secondly, it is power and space efficient; if we combine separately designed systems, such as a PC cluster and a vector machine, we may suffer from doubled power consumption and space. On the other hand, the ClearSpeed boards consume only 2% of the power of the TSUBAME system and no additional racking space.

3 HPL and Preliminary Experiment

3.1 HPL and its Performance on a Homogeneous Environment

Our target program for porting, High Performance Linpack (HPL), is an MPI-based parallel benchmark that solves dense linear equations Ax = b of order N. HPL computes the LU decomposition of the matrix A with partial pivoting, then obtains a solution x by backward substitution. The amount of computation of the whole algorithm is (2/3)N^3 + O(N^2).

Figure 3. (Left) Process grid with P x Q = 2 x 3 processes. (Right) Two-dimensional block cyclic distribution of an N x N matrix on 6 processes. The block size is B.

HPL computing processes conceptually compose a process grid of size P x Q, on which the matrix is distributed according to a two-dimensional block cyclic distribution, as in Figure 3. Here, we let N and B be the size of the matrix and the size of a block, respectively. Almost all of the computation time of HPL is occupied by the LU decomposition, where an iteration of the outermost loop corresponds to a block column. On the k-th iteration, HPL performs panel decomposition, panel broadcast, row exchange communication, and update [17]. The overall computation amount for panel decomposition is O(N^2 B), while communication (due to panel broadcast and row exchange) is O(N^2 (P + Q)).
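To make the block cyclic mapping concrete, the following small sketch (our own illustration, not HPL code; all names are hypothetical) shows how a block coordinate is mapped to its owner in a P x Q process grid, reproducing the pattern of Figure 3.

```c
#include <stdio.h>

/* Illustrative sketch (not HPL code): map a global block coordinate to its
 * owner in a P x Q process grid under a 2D block cyclic distribution.
 * Block (bi, bj) covers rows bi*B .. bi*B+B-1 and columns bj*B .. bj*B+B-1
 * of the N x N matrix. */
typedef struct { int row, col; } GridCoord;

static GridCoord block_owner(int bi, int bj, int P, int Q) {
    GridCoord g = { bi % P, bj % Q };   /* cyclic in both dimensions */
    return g;
}

int main(void) {
    const int P = 2, Q = 3, B = 240, N = 1440;   /* small example values */
    for (int bi = 0; bi < N / B; bi++) {
        for (int bj = 0; bj < N / B; bj++) {
            GridCoord g = block_owner(bi, bj, P, Q);
            printf("%d ", g.row * Q + g.col);     /* row-major process id */
        }
        printf("\n");
    }
    return 0;
}
```

Running this prints the repeating pattern of process ids 0-5 shown on the right of Figure 3.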

The update phase is usually the most time-consuming at O(N^3), being dominated by matrix multiplication.

To discuss porting HPL to a heterogeneous architecture, the following properties have to be taken into account. First, HPL performance is largely determined by that of matrix multiplication (dgemm). Thus the common strategy for high performance is to use the fastest basic linear algebra software (BLAS) optimized for the processors, such as the GOTO BLAS library [12] or Intel MKL in the case of the x86 architecture.

Next, HPL distributes tasks equally to all processes, since it is designed for homogeneous architectures. As processes are tightly coupled by data dependencies, the overall performance is degraded if the speeds of processes are uneven. HPL introduces an optimization called 'look-ahead', which overlaps computation and panel broadcast communication. Although it makes HPL more tolerant to communication latency or temporary delays of processes, it alone does not address heterogeneous speeds among processes.

We evaluated the original HPL with 10,368 CPU cores on 648 TSUBAME nodes without acceleration, using GOTO BLAS. The result was 38.18 TFlops with matrix size N = 1,334,160 on P x Q = 36 x 144 = 5,184 processes, achieving an efficiency of 76.6% with a runtime of 11.5 hours. HPL provides many tuning parameters, such as broadcast topology and block size; the parameters in this evaluation, in which eight processes with two threads each are invoked on each node, are shown in Table 1. The result was ranked No. 7 on the Top500 list in June 2006.

3.2 Kernel Performance on CPUs and Accelerators

Performance of HPL is largely determined by the kernel BLAS libraries. Here we compare the characteristics of GOTO BLAS for the Opterons and CSXL for the ClearSpeed accelerators. The former is one of the fastest BLAS libraries for Intel/AMD processors, developed by K. Goto. Graph (1) in Figure 4 shows its performance with four threads for multiplications of M x B and B x M matrices, where M is usually much larger than B. This kind of computation is dominant in HPL. We see that the performance ranges from 15.5 to 17.6 GFlops, which is 81 to 92% efficiency with respect to the 19.2 GFlops peak performance of four 2.4 GHz Opteron cores. An interesting characteristic is the stability of the performance for varying B and M, compared to CSXL described next.

Figure 4. Performance of matrix multiply by the GOTO BLAS library and the CSXL library (beta 2.50, beta 2.21). Matrix sizes are (M x B) and (B x M).
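The kernel shape measured in Figure 4 can be reproduced with a small harness such as the sketch below. This is our own illustration, assuming a CBLAS interface (as provided, for example, by GOTO BLAS or another optimized BLAS linked at build time); it is not part of HPL, and the sizes are example values only.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

/* Sketch of a dgemm micro-benchmark for the HPL kernel shape:
 * C (M x M) += A (M x B) * Bmat (B x M).  Link against an optimized BLAS
 * to obtain measurements of the kind shown in Figure 4. */
int main(void) {
    const int M = 4096, B = 864;               /* example sizes */
    double *A  = malloc((size_t)M * B * sizeof *A);
    double *Bm = malloc((size_t)B * M * sizeof *Bm);
    double *C  = malloc((size_t)M * M * sizeof *C);
    if (!A || !Bm || !C) return 1;
    for (long i = 0; i < (long)M * B; i++) A[i] = Bm[i] = 1.0;
    for (long i = 0; i < (long)M * M; i++) C[i] = 0.0;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                M, M, B, 1.0, A, M, Bm, B, 1.0, C, M);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double gflops = 2.0 * M * M * B / sec / 1e9;   /* 2*M*M*B flops */
    printf("M=%d B=%d: %.1f GFlops\n", M, B, gflops);
    free(A); free(Bm); free(C);
    return 0;
}
```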

Table 1. HPL parameters used in the evaluation with 10,368 CPU cores

  Matrix size           1334160          Panel broadcast    1ring
  Block size            240              Look-ahead depth   1
  Process mapping       Row-major        Swap               Mix (threshold=240)
  # of processes        36 x 144         Matrix form        L1 trans, U trans
  Panel factorization   Right-looking    Equilibration      yes
  NBMIN, NDIV           4, 2             Alignment          8 double words

CSXL for the ClearSpeed accelerators consists of a wrapper (.so) running on the host CPU and a computing component (.csx) running on the accelerator. When a BLAS function is called, the wrapper sends the input matrix data to the accelerator via the PCI-X bus, then the accelerator computes and returns the results to the CPU. For efficiency, the wrapper overlaps computation and communication. It also has a facility to dispatch the computation to both the CPUs and the accelerator.

Since CSXL performance has been largely improved by software updates, we examined the two versions used in Section 5. Figure 4 includes the performance of matrix multiply for CSXL beta 2.50; for reference, that of beta 2.21, an older and slower version, is also shown. We see that beta 2.50 achieves about 60 GFlops with M = 10,368 and B >= 864. We also see that, despite its high performance for large matrices, CSXL degrades significantly with smaller matrices, unlike GOTO BLAS; it exhibits only about 20 GFlops. We consider this degradation to be due to the communication overhead over the PCI-X bus and the startup cost of the SIMD-vector computation. The sensitivity of CSXL to matrix sizes suggests that finer tuning is required in a heterogeneous setting.

4 Making Tightly-Coupled Applications Scalable on Heterogeneous Supercomputers

4.1 Coping with Heterogeneity

HPL, like many tightly-coupled parallel applications, assumes the underlying computational elements to be homogeneous. However, on heterogeneous supercomputers such as TSUBAME, we are faced with intra-node and inter-node heterogeneity, though the latter has been relieved recently. Both cause load imbalances leading to inefficiency, with the overall performance potentially being worse than in non-accelerated cases, as we saw in Section 1.

Although there have been various proposals for tightly-coupled numerical algorithms on heterogeneous compute nodes, none we have seen to date has dealt with these two types of heterogeneity at the same time at scale. In fact, a typical method of simply assigning data regions of different sizes according to processor performance would not work well, due to the divergent characteristics of general-purpose CPUs vs. accelerators. Rather, we propose to cope with the problem with a methodology consisting of the following steps. Although it is currently difficult to automate the entire process, it has proven to be a good guideline for massive scalability on TSUBAME, as well as serving as a basis for the future automation we are currently working on:

Step 1 Analyze the performance of general-purpose CPUs and accelerators, in particular their computational kernels. Obtain analytical and/or empirical performance models according to problem sizes and other parameters (a small sketch after this list illustrates such an empirical model).

Step 2 Create virtual processes of equal granularity to abstract out the differences between general-purpose CPUs and accelerators, so that tightly-coupled programs with homogeneity assumptions can be almost directly executed. Since accelerators are usually faster than general-purpose CPUs but less flexible, a single accelerator corresponds to multiple instances of virtual processes. Here some general-purpose CPUs may have to supplement parts of an accelerator-based virtual process that the accelerators themselves cannot execute or would execute very slowly.

Step 3 Since accelerators prefer larger data sizes, tune the parameters so that a compromise between the general-purpose CPUs and the accelerators is reached. In the case of HPL, we also remove parts of the code where granularity is sacrificed.

Step 4 Balance the load taking inter-node heterogeneity into account, by configuring the numbers of virtual processes on accelerated and non-accelerated nodes. We also have to take care of overheads caused by introducing accelerators, such as CPU load for PCI-X communication.
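To make Step 1 concrete, here is a minimal sketch of an empirical performance model (our own illustration; the two sample values are placeholders only roughly in the spirit of Section 3.2, about 20 GFlops for small matrices and about 60 GFlops for large ones with B >= 864, and a real model would be filled in from measurements such as Figure 4). Steps 2 and 4 then consult such a model when deciding process counts.

```c
#include <stdio.h>

/* Step 1 illustration (ours, not part of HPL or CSXL): an empirical dgemm
 * performance model built from measured (matrix size, GFlops) samples and
 * evaluated by linear interpolation. */
typedef struct { double m; double gflops; } Sample;

static double model_gflops(const Sample *s, int n, double m) {
    if (m <= s[0].m) return s[0].gflops;
    for (int i = 1; i < n; i++)
        if (m <= s[i].m) {
            double t = (m - s[i-1].m) / (s[i].m - s[i-1].m);
            return s[i-1].gflops + t * (s[i].gflops - s[i-1].gflops);
        }
    return s[n-1].gflops;
}

int main(void) {
    /* placeholder samples; replace with actual measurements */
    const Sample csxl[] = { { 2000.0, 20.0 }, { 10368.0, 60.0 } };
    for (double m = 2000.0; m <= 12000.0; m += 2000.0)
        printf("M=%6.0f -> ~%.0f GFlops (model)\n",
               m, model_gflops(csxl, 2, m));
    return 0;
}
```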

4.2 Applying the Methodology to HPL

As Step 1 for HPL, we consider the performance characteristics of general-purpose CPUs and accelerators, in particular to handle intra-node heterogeneity. One important point in the analysis is the Flops/Byte ratio of the kernel computation, which is dgemm (multiplication) of matrices of sizes M x B and B x M', where B is the predefined block size and typically smaller than M and M'. The amount of computation is O(M M' B) while the data size is O(M M'), so the Flops/Byte ratio is O(B). The ratio is reasonably high as long as we adopt a sufficiently large B. Therefore, in the following steps, we can expect a large block size to work well in driving CSXL efficiently, which is also empirically confirmed in Section 3.2.

As already mentioned, the CSXL library can dispatch computation to both accelerators and CPUs. However, this alone cannot account for inter-node heterogeneity directly. Rather, our approach is to use process virtualization to control the granularity of each process, while using dispatching as a supplemental method.

For Step 2, Figure 5 illustrates our virtual process configuration, albeit for a smaller number of nodes for simplicity. Here, we assume that only the two nodes on the right are equipped with accelerators among the four. We introduce two kinds of virtualized processes: CPU processes and SIMD processes, each of which is regarded as a homogeneous process element by the overarching homogeneous HPL. CPU processes behave in the same way as in the original HPL; they are linked with the GOTO BLAS library and all computations are done on CPUs. SIMD processes also behave similarly, but exist only on nodes with accelerators. They off-load the kernel BLAS computation onto accelerators, while other computations such as panel factorization are done on general-purpose CPUs.

We assign appropriate numbers of virtual processes so that the workload per process is nearly equivalent, so as to support inter-node heterogeneity. In the figure, an accelerated node contains six virtual processes, while a non-accelerated node has only four.

Such virtualization requires a single accelerator to represent several processes; however, a naive implementation of this would be difficult, since the current ClearSpeed accelerators do not have the multiprocessing capabilities of general-purpose CPUs. To solve this problem, we implement a SIMD server, which is a daemon process linked with the CSXL library and invoked per available accelerator for direct access. It arbitrates matrix multiply requests from multiple SIMD processes and calls the dgemm of the CSXL library on their behalf, achieving time-shared usage of the accelerators. As an optimization, a SIMD server and its SIMD processes share matrix data with the mmap system call to reduce the copying of data per function call.

4.3 Accelerator-Centric Granularity Tuning

As Step 3 of our methodology, we conduct parameter tuning so that CSXL receives preference through sufficiently large granularity, as GOTO BLAS is much more resilient to changes in the parameters. We basically reuse the HPL parameters described in Table 1, but the block size B and the process granularity have to be carefully tuned in order to achieve the best compromise between intra- and inter-node heterogeneity. In practice, this step is conducted in parallel with Step 4, since the granularity and the number of processes are tightly related. We describe how we have tuned the parameters in our experiments with CSXL beta 2.50 and beta 2.51 (beta 2.51 is a minor update of 2.50, and its performance is similar).

Block size: If the block size B is small, CSXL performance degrades significantly as shown in Section 3.2, and thus the overall performance suffers considerably. On the other hand, B being too large is generally harmful for HPL and is thus avoided, because it increases computational costs other than matrix multiplication, such as panel factorization. Thus we have to find a compromise; by conducting empirical performance analysis and modeling, we decided on B = 864 with beta 2.50, which is significantly larger than in our CPU-only experiment, where B = 240.

Process granularity: We have another 'knob to turn', process granularity, as GOTO BLAS allows multiple threads to exist within a single process. When the number of threads per CPU process (hereafter T) is smaller, the granularity of the data set per process also gets smaller, which makes CSXL performance worse in SIMD processes. On the other hand, T being too large would make load balancing among nodes more difficult. With extensive preliminary measurements, we decided on T = 4, while it was two in our CPU-only experiment.

Additionally, we found that a modification to the HPL code was necessary, for the following reason. Our preliminary experiments indicated that heterogeneous HPL performance was much lower than expected, as the dgemm function calls became unexpectedly 'fragmented'. The fragmentation is hardwired into HPL for the look-ahead optimization that overlaps computation and communication; HPL repeatedly checks for the existence of messages after a small amount of dgemm computation. Whereas this works well with GOTO BLAS, it is disastrous for the performance of CSXL. Instead, we have implemented a simple solution that creates a separate thread per process which makes dgemm calls on large-granule matrix portions; meanwhile, the main thread calls the communication function. Thus we avoid the fragmentation of dgemm calls while keeping the look-ahead optimization alive.
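The sketch below illustrates the structure of this fix. It is our own simplified rendering, not the actual HPL modification: a worker thread issues a single large-granule dgemm, while the main thread keeps probing for the panel broadcast. It assumes a CBLAS interface and an MPI library initialized with at least MPI_THREAD_FUNNELED, since only the main thread touches MPI.

```c
#include <pthread.h>
#include <mpi.h>
#include <cblas.h>

/* Simplified illustration of the look-ahead fix (our sketch, not the actual
 * HPL patch).  The caller fills the argument block (with done = 0); the
 * worker thread then performs one large-granule dgemm while the main thread
 * keeps the panel broadcast moving, so the dgemm call is not fragmented and
 * look-ahead is preserved. */
struct dgemm_args {
    int m, n, k;
    const double *A, *B;
    double *C;
    int done;
};

static void *dgemm_worker(void *p) {
    struct dgemm_args *a = p;
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                a->m, a->n, a->k, 1.0, a->A, a->m, a->B, a->k,
                1.0, a->C, a->m);
    __atomic_store_n(&a->done, 1, __ATOMIC_RELEASE);
    return NULL;
}

static void update_with_lookahead(struct dgemm_args *a, MPI_Comm comm) {
    pthread_t th;
    pthread_create(&th, NULL, dgemm_worker, a);      /* one big granule */
    while (!__atomic_load_n(&a->done, __ATOMIC_ACQUIRE)) {
        int flag; MPI_Status st;
        /* overlap: advance the panel broadcast while the dgemm runs */
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &st);
        /* ... receive and forward the panel here when flag is set ... */
    }
    pthread_join(th, NULL);
}
```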


Figure 5. Virtual process configuration to harness CPUs and accelerators
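To make the SIMD server in Figure 5 concrete, here is a minimal sketch of the arbitration idea (our own illustration; the shared-region layout, names, and single request slot are assumptions, not the actual implementation): SIMD processes post dgemm requests into a shared memory region, and the per-board daemon, linked against the accelerator's BLAS, serves them one at a time.

```c
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <semaphore.h>
#include <unistd.h>
#include <cblas.h>

/* Sketch of the SIMD server idea in Figure 5 (our illustration, not the
 * actual implementation).  A single request slot is shown; a real server
 * would arbitrate a queue of requests from several SIMD processes. */
struct request {
    sem_t ready, done;     /* process-shared; sem_init(..., 1, 0) by creator */
    int m, n, k;
    double data[];         /* A (m x k), B (k x n), C (m x n) follow */
};

static struct request *attach(const char *name, size_t bytes, int create) {
    int fd = shm_open(name, create ? O_CREAT | O_RDWR : O_RDWR, 0600);
    if (fd < 0) return NULL;
    if (create && ftruncate(fd, (off_t)bytes) != 0) return NULL;
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return p == MAP_FAILED ? NULL : p;
}

/* daemon side: serve requests forever, time-sharing the board */
static void serve(struct request *r) {
    for (;;) {
        sem_wait(&r->ready);                       /* wait for a client */
        double *A = r->data,
               *B = A + (size_t)r->m * r->k,
               *C = B + (size_t)r->k * r->n;
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    r->m, r->n, r->k, 1.0, A, r->m, B, r->k, 1.0, C, r->m);
        sem_post(&r->done);                        /* wake the client */
    }
}
```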

4.4 Load Balancing Considering Inter-Node Heterogeneity

As Step 4, we determine the numbers of CPU and SIMD processes to be invoked on both non-accelerated and accelerated nodes. First, since the number of threads per CPU process T is four, four (= 16/T) CPU processes are invoked to utilize the 16 CPU cores on a non-accelerated node. At first glance, it might seem reasonable to invoke an identical number of CPU processes on an accelerated node as well. Unfortunately, we found that this does not work well in practice, as a SIMD server consumes 40-60% of a CPU load for communication with the accelerator, severely degrading the overall performance ('Bad load balance' in Figure 1). In order to eliminate this overhead, we assign a semi-dedicated CPU core to the SIMD server and decrease the number of CPU processes by one, to three in our case. On the other hand, this introduces idle CPU cores on accelerated nodes due to the relatively large process granularity. To exploit the idle cores, we let CSXL on the SIMD servers use both the accelerators and the idle cores.

After the number of CPU processes, we determine the number of SIMD processes by considering the balance between the performance of accelerators and CPUs. We found that, thanks to the tuning in Step 3, a simple method based on the peak performance of CSXL works fairly well. We also consider the performance of the idle cores used by the SIMD servers. After all such considerations, the number of SIMD processes on an accelerated node is four with beta 2.50 (it was three with beta 2.21).
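The resulting per-node layout can be summarised in a few lines of arithmetic; the sketch below is our own recap of the figures just discussed for CSXL beta 2.50, not code from the actual system.

```c
#include <stdio.h>

/* Recap of the per-node layout from Section 4.4 (CSXL beta 2.50 figures). */
int main(void) {
    const int cores = 16;   /* Opteron cores per node           */
    const int T = 4;        /* threads per CPU process (Step 3) */

    int cpu_procs_plain = cores / T;        /* non-accelerated node: 4      */
    int cpu_procs_accel = (cores - 1) / T;  /* one core kept for the
                                               SIMD server: 3               */
    int simd_procs_accel = 4;               /* from the CSXL peak ratio,
                                               also crediting the idle cores
                                               driven by the SIMD server    */

    printf("plain node:       %d CPU processes\n", cpu_procs_plain);
    printf("accelerated node: %d CPU + %d SIMD processes\n",
           cpu_procs_accel, simd_procs_accel);
    return 0;
}
```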

Table 2. History of our Linpack experiments on the whole TSUBAME system.

                         spring 2006   autumn 2006   spring 2007   autumn 2007
  # of CPU cores         10368         10368         10368         10368
  # of accelerators      0             360           360           648
  CSXL version           -             beta 2.21     beta 2.50     beta 2.51
  Matrix size            1334160       1148160       1057536       1123200
  Speed (TFlops)         38.18         47.38         48.88         56.43
  Speedup to CPU-only    0%            +24.1%        +28.0%        +47.8%
  Rank in Top500         7th           9th           14th          16th

5 Experimental Results

We have conducted large-scale experiments on the whole TSUBAME system four times, including the homogeneous (CPU-only) case, as shown in Table 2. With our methodology and tuning to cope with heterogeneity, we have achieved 56.43 TFlops on 648 nodes and 648 accelerators with CSXL beta 2.51. This is 47.8% faster than the 38.18 TFlops of the CPU-only case, and each accelerator contributes a speedup of 28.2 GFlops. In the third experiment, in spring 2007, we showed that our methodology achieves good acceleration even in the inter-node heterogeneous case. Our result is currently the highest Linpack performance for heterogeneous systems in the world.

In the rest of this section, we show detailed results of the third and fourth experiments, unless explicitly stated otherwise. First, we show the relation between the number of accelerators and performance. In Figure 6, 'Full Acc' corresponds to the homogeneously accelerated cases. 'Half Acc' corresponds to the inter-node heterogeneous cases, where half of the nodes are accelerated; exceptionally, 360 nodes (55%) are accelerated in the case of 648 nodes. The left graph in the figure shows speeds in TFlops, and the right shows relative performance normalized to 'CPU only'.

Figure 6. HPL performance on TSUBAME in the heterogeneous setting. Half the nodes are equipped with accelerators in 'Half Acc', and all the nodes are accelerated in 'Full Acc'. The left graph shows speeds, and the right one shows relative performance normalized to the 'CPU only' case.

When all the nodes are accelerated, the performance improves by 48 to 52% over the CPU-only case. Interestingly, the right graph shows that the speedup is almost linear in the number of accelerated nodes, though the number of runs is not yet sufficient. As far as we have observed, these results indicate that we are properly handling inter-node heterogeneity, and that an increase in accelerated nodes allows extrapolated acceleration. This observation suggests that further acceleration that introduces several accelerator boards, or several kinds of accelerators, per node is promising, though we would have to consider the additional communication overhead with accelerators.

Next, Figure 7 shows the relationship between the matrix size and HPL performance on 40 nodes. The performance is obviously better with larger N in all cases, which is natural given the characteristics of HPL. We also notice that the tendency is stronger in the accelerated cases, because of the characteristics of CSXL, which favors larger granularity as shown in Section 3.2. In fact, the accelerated cases are slower than 'CPU only' when the matrix size is about 90,000 or less. This indicates that our methodology has room for improvement, so that load balancing is optimized according to data sizes.

Figure 7. HPL performance on 40 nodes with varying matrix sizes N.

6 Related Work

Heterogeneous parallel computing has a long history, especially for heterogeneity in CPU clock speed. To execute a data-parallel program on hybrid CPUs, a traditional approach is to assign data of an appropriate size to each CPU [5, 6]. However, since many existing codes are designed for homogeneous systems, porting them requires heavy modification.

Our approach, which virtualizes and controls the number of processes per node, is more closely related to AMPI on Charm++ [14], although it would entail similar problems to the untuned version of our HPL when applied to environments with accelerators. The Charm++ team has started to accommodate the Cell processor. Kishimoto et al. [15] have constructed a performance model of HPL on hybrid CPUs that allows one to determine tuning parameters automatically by using feedback from measurements on each class of CPU. The model takes the matrix size and the number of processors on each node into account; however, with accelerators, we have observed that the tuning space gets much broader. It would be interesting to adapt their model to support accelerators.


Although heterogeneous architectures with accelerators are considered promising, research projects on the scalability of heterogeneous supercomputers are still rare. Recently, scientific computations on SIMD and/or multi-core processors, including general-purpose GPUs (GPGPU) and the Cell Broadband Engine, have attracted considerable attention. There have been many reports on successful classes of algorithms on GPUs [11, 13, 16] and the Cell BE [9]; however, most projects focus on single-processor or single-node systems, and thus only a low degree of parallelism is involved.

Beyond the design and implementation of algorithms, some recent projects also pursue easy programming. Sequoia by Fatahalian et al. [10] is a programming system that supports processors with different memory hierarchies. Although it supports portable programming for the Cell BE and ordinary CPUs, porting existing parallel codes requires rewriting them heavily to conform to Sequoia's divide-and-conquer style. CUDA [2] allows programmers to write C-like programs in a SIMD style, and Accelerator by Tarditi et al. [18] provides high-level matrix/vector operations, which are automatically executed on a GPU. Currently, the above projects also focus on single-node systems, and thus the scalability of accelerated computation has been unproven. On the other hand, our focus is to prove the scalability of our methodology on large-scale machines.

7 Conclusion

Heterogeneous supercomputers with combined general-purpose CPUs and accelerators are expected to be one of the major architectures in the near future for their generality and superior performance/power ratio. We have shown a methodology whereby such supercomputers can be harnessed, and demonstrated its effectiveness with a large-scale Linpack evaluation. By using the TSUBAME supercomputer, equipped with 10,480 Opteron cores and 648 ClearSpeed SIMD accelerators, we obtained 56.43 TFlops of Linpack performance, which was ranked 16th in the Top500 ranking of November 2007. This result is currently the highest Linpack performance on heterogeneous systems in the world.

We have demonstrated the scalability of our techniques; however, dispatching tasks at the library level requires the application's Flops/Byte ratio to be moderately high, which was the case for HPL. The question then is, is our methodology generally applicable to wide-ranging classes of applications? We believe so, as the situation will not be so different for other classes of problems where the computational kernels may differ but the workload is still divisible. In fact, we are currently working on applying our methods to other applications based on multi-dimensional FFT kernels. Although we have shown a step-by-step methodology that can serve as a guideline, for general applications we need to tune the load in an automated fashion, perhaps at runtime. We also hope that our work will expand the research area of applications to identify their feasibility on heterogeneous architectures.

Acknowledgement

This work is partly supported by a Grant-in-Aid for Scientific Research (No. 18049028) from MEXT, Japan, and by JST-CREST.

References

[1] ClearSpeed Technology Inc. http://www.clearspeed.com/.
[2] NVIDIA CUDA Homepage. http://developer.nvidia.com/cuda.
[3] TOP500 supercomputer sites. http://www.top500.org/.
[4] C. Lemuet, J. Sampson, J.-F. Collard, and N. Jouppi. The potential energy efficiency of vector acceleration. In Proceedings of the IEEE/ACM SC06 Conference, 2006.

[5] P. E. Crandall and M. J. Quinn. Block data decomposition for data-parallel programming on a heterogeneous workstation network. In Proceedings of the IEEE International Symposium on High Performance Distributed Computing (HPDC), pages 42-49, 1993.
[6] T. Endo, K. Kaneda, K. Taura, and A. Yonezawa. High performance LU factorization for non-dedicated clusters. In Proceedings of the IEEE International Symposium on Cluster Computing and the Grid (CCGrid), pages 678-685, 2004.
[7] D. Pham et al. The design and implementation of a first-generation CELL processor. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), 2005.
[8] N. R. Adiga et al. An overview of the BlueGene/L supercomputer. In Proceedings of the IEEE/ACM SC02 Conference, 2002.
[9] F. Petrini, G. Fossum, J. Fernandez, A. L. Varbanescu, M. Kistler, and M. Perrone. Multicore surprises: Lessons learned from optimizing Sweep3D on the Cell Broadband Engine. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2007.
[10] K. Fatahalian, T. J. Knight, M. Houston, M. Erez, D. R. Horn, L. Leem, J. Y. Park, M. Ren, A. Aiken, W. J. Dally, and P. Hanrahan. Sequoia: Programming the memory hierarchy. In Proceedings of the IEEE/ACM SC06 Conference, 2006.
[11] N. Galoppo, N. K. Govindaraju, M. Henson, and D. Manocha. LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware. In Proceedings of the IEEE/ACM SC05 Conference, 2005.
[12] K. Goto. GOTO BLAS. http://www.tacc.utexas.edu/resources/software/.
[13] N. K. Govindaraju, S. Larsen, J. Gray, and D. Manocha. A memory model for scientific algorithms on graphics processors. In Proceedings of the IEEE/ACM SC06 Conference, 2006.
[14] C. Huang, G. Zheng, S. Kumar, and L. V. Kale. Performance evaluation of adaptive MPI. In Proceedings of the ACM Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 12-21, 2006.
[15] Y. Kishimoto and S. Ichikawa. An execution-time estimation model for heterogeneous clusters. In Proceedings of the IEEE International Heterogeneous Computing Workshop (HCW), 2004.
[16] S. Ohshima, K. Kise, T. Katagiri, and T. Yuba. Parallel processing of matrix multiplication in a CPU and GPU heterogeneous environment. In 7th International Meeting on High Performance Computing for Computational Science (VECPAR'06), pages 41-50, July 2006.
[17] A. Petitet, R. C. Whaley, J. Dongarra, and A. Cleary. HPL - a portable implementation of the high-performance Linpack benchmark for distributed-memory computers. http://www.netlib.org/benchmark/hpl/.
[18] D. Tarditi, S. Puri, and J. Oglesby. Accelerator: Using data parallelism to program GPUs for general-purpose uses. In Proceedings of the ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XII), 2006.
