
The Total Picture of TSUBAME 2.0

Satoshi Matsuoka*  Toshio Endo**  Naoya Maruyama*  Hitoshi Sato*  Shin'ichiro Takizawa*
* Global Scientific Information and Computing Center, Tokyo Institute of Technology
** Graduate School of Information Science and Engineering, Tokyo Institute of Technology

In November 2010, Tokyo Tech will launch a new supercomputer named TSUBAME 2.0. It is the first petascale supercomputer in Japan, with a computing power of 2.4 petaflops and a storage capacity of 7.1 petabytes, making it one of the fastest systems in the world.

Introduction

In 2006, the Tokyo Tech Global Scientific Information and Computing Center (GSIC) introduced a supercomputer called TSUBAME 1.0 as a "supercomputer for everyone"; the goal of this system was to make supercomputing available even to non-specialist users. TSUBAME 1.0 was the No. 1 supercomputer in Asia from 2006 to 2007, and has been widely used by about 2,000 users for four and a half years. In November 2010, GSIC will start operation of a new supercomputer, called TSUBAME 2.0, with a computing performance of 2.4 petaflops (PFLOPS, 2.4×10¹⁵ operations per second), which is about 30 times larger than that of TSUBAME 1.0. The new system is designed based on our research findings and operational experiences with TSUBAME 1.0/1.2, and will be introduced in cooperation with NEC, HP, NVIDIA, Microsoft, and other partner companies. With TSUBAME 2.0, which becomes the first petascale supercomputer in Japan, we also promote innovative energy-aware and cloud-style operations.

Key Technologies toward Petaflops

Basically, TSUBAME 1.0/1.2 and 2.0 are supercomputing clusters, which consist of a large number of commodity processors such as Intel-compatible CPUs. Additionally, they are equipped with processors designed for specific purposes, called accelerators, in order to significantly improve the performance of scientific applications based on vector computing. We have experience in operating ClearSpeed accelerators on TSUBAME 1.0 and 680 GPUs on TSUBAME 1.2, and have learned various technologies for energy-efficient high-performance computing with accelerators. However, in terms of the number of processors, CPUs have been much more dominant. In TSUBAME 2.0, each node is equipped with three GPU accelerators, which enable a great leap not only in performance but also in energy efficiency.

TSUBAME 2.0's main compute node, manufactured by Hewlett-Packard, has two Intel Westmere EP 2.93 GHz processors, three Fermi M2050 GPUs and about 50 GB of memory. Each node provides a computation performance of 1.7 teraflops (TFLOPS), which is about 100 times larger than that of a typical laptop PC. The main part of the system consists of 1,408 computing nodes, and the total peak performance will reach 2.4 PFLOPS, which will surpass the aggregated performance of all the major supercomputers in Japan.

Each CPU in TSUBAME 2.0 has six physical cores and supports up to 12 hardware threads with hyper-threading technology, achieving up to 76 gigaflops (GFLOPS). Each GPU sports NVIDIA's new processor architecture, called Fermi, and contains 448 small cores with 3 GB of GDDR5 memory, with a maximum performance of 515 GFLOPS.

In general, efficient use of GPUs requires different programming methodologies than CPUs. For this purpose, CUDA and OpenCL are supported so that users can execute programs designed for TSUBAME 1.2 on the new system. The Tesla M2050 GPUs also have advantages in both performance and programmability; the adoption of a true hardware cache will make performance tuning of programs much easier.

One of the key features of the TSUBAME 2.0 architecture is the adoption of hardware with highly improved bandwidth in order to enable efficient data communication. To keep the performance of intra-node communication high, the memory bandwidth reaches up to 32 GB/s on the CPU and 150 GB/s on the GPU. Communication between CPUs and GPUs is supported by the latest PCI-Express 2.0 x16 technology with a bandwidth of 8 GB/s.

As the interconnect that combines more than 1,400 computing nodes and the storage described later, TSUBAME 2.0 uses the latest QDR InfiniBand (IB) network, which features 40 Gbps of bandwidth per link. Each computing node will be connected with two IB links; thus the communication speed of a node is about 80 times larger than that of a typical LAN (1 Gbps). Not only the link speed at end-point nodes, but also the network topology of the whole system heavily affects the performance of large scale computations. TSUBAME 2.0 adopts a full-bisection fat-tree topology, which accommodates a wider range of applications than other topologies such as torus/mesh; the adopted topology will be much more advantageous for applications with implicit methods, such as spectral methods.

Despite the significant improvements in performance and capacity, the power consumption of TSUBAME 2.0 will be comparable to that of the current system, thanks to many research and engineering advances in power efficiency. Cooling is also greatly improved by exploiting much more efficient water-cooling systems. We expect the power usage effectiveness (PUE, an index to evaluate the energy efficiency of a computing facility) of TSUBAME 2.0 to be as low as 1.2 on average.

Large Scale Storage for e-Science

As a supercomputing system that supports e-Science, a large scale storage system that supports fast data access is necessary. TSUBAME 2.0 includes a storage system with a capacity of 7.1 petabytes (PB), which is six times larger than that of TSUBAME 1.0. As typical usage, users can store the large scale data used by their tasks; additionally, the storage system is used to provide a Web-based storage service, which Tokyo Tech users can use easily.

The storage system mainly consists of two parts: a 1.2 PB home storage volume and a 5.9 PB parallel file system volume. The home storage volume is designed to provide high reliability, availability and performance based on a redundant structure. In particular, it provides up to 1100 MB/s of accelerated NFS performance via QDR IB and 10 Gbps networks. The volume also supports other protocols, such as CIFS and iSCSI, in order to support transparent data access from all computing nodes running both Linux and Windows. It is also used for various storage services for educational and administrative purposes at Tokyo Tech.


The central component of the home volume is a DDN SFA 10000 high density storage system. It is connected to four HP DL380 G6 servers and two BlueArc Mercury servers that accept access requests from clients.

The focus of the other volume, the parallel file system volume, is scalability, so that it accommodates access requests from a large number of nodes smoothly. Based on our operational experience with TSUBAME 1.0/1.2, it supports the Lustre protocol and consists of five Lustre subsystems. The aggregate read I/O throughput of each subsystem will be over 200 GB/s. Like the home volume, each subsystem contains a DDN SFA 10000 system. To achieve high performance, each SFA 10000 is connected to six HP DL360 G6 servers.

Data on these TSUBAME 2.0 storage systems will be backed up to 4 PB (uncompressed) Sun SL8500-based tape libraries. We are currently planning to introduce a hierarchical file system in 2010, in order to provide transparent on-demand data access between the parallel file systems and the tape libraries for data-intensive application users, as well as to increase the tape capacity to accommodate the growth in data volume.

In addition to these system-wide storage systems, each compute node of TSUBAME 2.0 has 120-240 GB solid state drives (SSDs) instead of hard disk drives. They are used to store temporary files created by applications and checkpoint files. It is very challenging to introduce SSDs to large scale supercomputers, and our innovative research on this topic has been leading the international efforts on reliability, which we have published and will continue to publish in international conferences and journals.

Cloud Operation of TSUBAME 2.0

As a "supercomputer for everyone", TSUBAME 1.0/1.2 has provided availability so that users who have been using PCs or small clusters can harness supercomputing power. In TSUBAME 2.0, while inheriting this availability, we promote the following services in order to broaden the range of users.

● While the typical usage of the computing power of TSUBAME 2.0 remains submitting jobs via a batch queue system, it supports not only Linux (SUSE Linux Enterprise 11) but also Windows HPC Server 2008. This new operation is supported by virtual machine (VM) technology.
● We continue the hosting service using VM technology on the Tokyo Tech campus. Additionally, with VM technology, we plan to improve the efficiency of computing resources by suspending some low-priority jobs and dividing a single node into several virtual nodes.
● We have provided the large computing service (HPC queue) for so-called capability jobs, where a user group can use up to 1,000 CPU cores and 120 GPUs exclusively on TSUBAME 1.2. For TSUBAME 2.0, we will further enhance this feature, providing up to 10,000 CPU cores and thousands of GPUs to selected user groups under a regulated reservation and/or peer review process.
● We will federate a web portal, called the TSUBAME portal, with the Tokyo Tech portal to enhance usability, for example through paperless account application. For Tokyo Tech users, TSUBAME accounts are unified with the accounts of the Tokyo Tech portal. Additionally, we will provide account services across Japan as one of the nine leading national university supercomputing centers and their national alliance.
● A network storage service will be provided to Tokyo Tech users by using the large scale storage of TSUBAME 2.0. Users can easily utilize the storage from their PCs without being aware of the existence of TSUBAME.
● We promote data-oriented e-Science using TSUBAME 2.0's high-end storage resources, which has been difficult for traditional Japanese supercomputing centers. For this purpose, TSUBAME 2.0 and the national supercomputing centers are tightly coupled with 10 Gbps class networks, called SINET. This will enable data sharing and transportation services via the GFarm file system and GridFTP using our newly developed RENKEI-PoP technology, which are now being deployed at various national supercomputing centers.

Summary

As of the writing of this article, system construction and detailed investigation of the new operational methods are ongoing. When operation starts in November, please utilize TSUBAME 2.0, one of the largest supercomputers in the world, for leading innovation in science and technology.

TSUBAME 2.0 News Web: http://www.gsic.titech.ac.jp/tsubame2
Official TSUBAME 2.0 Twitter account: Tsubame20

GPU Computing for Dendritic Solidification based on Phase-Field Model

Takayuki Aoki*  Satoi Ogawa**  Akinori Yamanaka**
* Global Scientific Information and Computing Center, Tokyo Institute of Technology
** Graduate School of Science and Engineering, Tokyo Institute of Technology

Mechanical properties of metallic materials like steel depend on the solidification process. In order to study the morphology of the microstructure in the materials, the phase-field model derived from non-equilibrium statistical physics is applied, and the interface dynamics is solved by GPU computing. Since very high performance is required, 3-dimensional simulations have rarely been carried out on conventional supercomputers. By using 60 GPUs installed on TSUBAME 1.2, a performance of 10 TFlops is achieved for the dendritic solidification based on the phase-field model.

1 Introduction

The mechanical properties of metallic materials are strongly characterized by the distribution and morphology of the microstructure in the materials. In order to improve the mechanical performance of materials and to develop new materials, it is essential to understand the microstructure evolution during solidification and phase transformation. Recently, the phase-field model [1] has been developed as a powerful method to simulate microstructure evolution. In phase-field modeling, time-dependent Ginzburg-Landau type equations, which describe the interface dynamics and solute diffusion during the microstructure evolution, are solved by finite difference and finite element methods. This microstructure modeling has been applied to numerical simulations of solidification, phase transformation and precipitation in various materials. However, a large computational cost is required to perform realistic and quantitative three-dimensional phase-field simulations at the typical scales of the microstructure pattern. To overcome this computational task, we utilize GPGPU (General-Purpose Graphics Processing Unit) computing [2], which has been developed as an innovative accelerator technology [3] in high performance computing (HPC).

A GPU is a special processor often used in personal computers (PCs) to render graphics on the display. The demand for high-quality computer graphics and PC games has led to great progress in GPU performance and made it possible to apply GPUs to general-purpose computation. Since the GPU is quite different from former accelerators, offering not only high floating-point performance but also a wide memory bandwidth, GPGPU is applicable to various types of HPC applications. In 2006, the CUDA [3] framework was released by NVIDIA, and it has enabled GPU programming in standard C language without taking graphics functions into account.

In this article, we study the growth of the dendritic solidification of a pure metal in a supercooled state by solving the equations derived from the phase-field model. Finite difference discretization is employed, and the GPU code developed in CUDA is executed on the GPUs of TSUBAME 1.2. Remarkably high performance is shown in comparison with conventional CPU computing. Although most GPGPU applications run on a single GPU, we develop a multiple-GPU code and show the strong scalability for large-scale problems.

2 Phase-Field Model

The phase-field model is a phenomenological simulation method to describe the microstructure evolution at the sub-micron scale based on the diffuse-interface concept. In this article, to describe the interface between the solid phase and the liquid phase, we define a non-conserved order parameter (phase field) φ taking a value of 0 in the liquid phase and 1 in the solid phase. In the interface region, φ gradually changes from 0 to 1. The position at φ = 0.5 can be defined as the solid/liquid interface. Using this diffuse-interface approach, the phase-field method does not require explicit tracking of the moving interface.

In this study, a time-dependent Ginzburg-Landau equation for the phase field φ and a heat conduction equation are solved [4]. The governing equations for the phase field φ and the temperature T are given by the following equations:

(1)


(2)

Here, M is the mobility of the phase field φ, ε is the gradient coefficient, W is the potential height and β is the driving force term. These parameters are related to the material parameters as follows.

(3)

(4)

(5)

(6)

In this article, the material parameters for pure nickel are used.

The melting temperature is Tm = 1728 K, the kinetic coefficient μ = 2.0 m/(K·s), the interface thickness δ = 0.08 μm, the latent heat L = 2.35×10⁹ J/m³, the interfacial energy σ = 0.37 J/m², the heat conduction coefficient κ = 1.55×10⁻⁵ m²/s and the specific heat C = 5.42×10⁶ J/(K·m³). Furthermore, the strength of the interfacial anisotropy is γ = 0.04. Assuming the interface region λ < φ < 1 − λ, we obtain b = 2 tanh⁻¹(1 − 2λ). λ is set to 0.1, so that b reduces to 2.20. χ is a random number distributed uniformly in the interval [-1, 1]. α is the amplitude of the fluctuation and is set to 0.4.
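The typeset forms of Eqs. (1)-(6) survive in this text version only as numbered placeholders. As a hedged point of reference, a commonly used Kobayashi-type formulation [4], written with the symbols defined above, would read as follows (an assumed representative form, not necessarily the exact equations of this article):

\[ \frac{\partial \phi}{\partial t} = M \left[ \nabla \cdot \left( \varepsilon^{2} \nabla \phi \right) + 4 W \phi (1-\phi) \left( \phi - \tfrac{1}{2} + \beta \right) + \alpha \chi \phi (1-\phi) \right] \]

\[ \frac{\partial T}{\partial t} = \kappa \nabla^{2} T + \frac{L}{C} \frac{\partial \phi}{\partial t} \]

In formulations of this type the coefficients are typically tied to the material constants through relations such as \( \varepsilon = \sqrt{3\delta\sigma/b} \), \( W = 6\sigma b/\delta \) and \( M = b T_m \mu / (3 \delta L) \), with β driven by the local supercooling \( T_m - T \) and the anisotropy strength γ entering through an orientation-dependent ε; Eqs. (3)-(6) presumably encode relations of this kind.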

3 GPU Computing

Fig. 1 Snapshots of the dendritic solidification growth

The calculations in this article were carried out on the TSUBAME Grid Cluster 1.2 in the Global Scientific Information and Computing Center (GSIC) at the Tokyo Institute of Technology. Each node consists of a Sun Fire X4600 (AMD Opteron 2.4 GHz, 16 cores, 32 GByte DDR2 memory) and is connected by two InfiniBand SDR networks with 10 Gbps. Two GPUs of an NVIDIA Tesla S1070 (4 GPUs, 1.44 GHz, 4 GByte VRAM, 1036 GFlops, 102 GByte/s memory bandwidth) are attached to 340 nodes (680 GPUs in total) through the PCI-Express bus (Generation 1.0 x8). In our computations, one of the two GPUs is used per node. The Opteron CPU (2.4 GHz) on each node has a performance of 4.8 GFlops and a memory bandwidth of 6.4 GByte/s (DDR-400). CUDA version 2.2 and NVIDIA Kernel Module 185.18.14 are installed, and the OS is SUSE Enterprise Linux 10.

3-1 Tuning Techniques

Equations (1) and (2) are discretized by the second-order Finite Difference Method and time-integrated with first-order accuracy (Euler scheme). The arrays for the dependent variable φ at the n and n+1 time steps are allocated on the VRAM, called the global memory in CUDA. We minimize the data transfer between the host (CPU) memory and the device memory (global memory) through the narrow PCI-Express bus, which becomes a large overhead of GPU computing.

In CUDA programming, the computational domain of an nx×ny×nz mesh is divided into L×M×N smaller domains of an MX×MY×MZ mesh, where MX = nx/L, MY = ny/M and MZ = nz/N. We assign the CUDA threads (MX, MY, 1) to each small domain in the x- and y-directions. Each thread computes MZ grid points in the z-direction using a loop. The GPU performance strongly depends on the block size, and we optimize it to be MX = 64 and MY = 4.

The discretized equation for the phase field variable φ reduces to a stencil calculation referring to 18 neighbor mesh points. In order to suppress the global memory access, we use the shared memory as a software-managed cache, as illustrated by the sketch below.
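As a hedged illustration of this blocking strategy (not the authors' actual code; the kernel name update_phi, the variable names and the simplified isotropic update are assumptions, and the anisotropy, noise and temperature coupling are omitted), a minimal CUDA kernel might look as follows. Each thread block covers an MX×MY tile of the x-y plane, each thread marches along z, and a (MX+2)×(MY+2) shared-memory tile caches the current z-plane of φ:

#define MX 64
#define MY 4

// Sketch of the thread-block decomposition of Sec. 3-1 (hypothetical code).
// Launch configuration: dim3 block(MX, MY), grid((nx-2)/MX, (ny-2)/MY);
// the outermost layer of the mesh is treated as a fixed boundary.
__global__ void update_phi(const float *phi, float *phi_new,
                           int nx, int ny, int nz, float dx, float dt,
                           float M, float eps2, float W, float beta)
{
    __shared__ float tile[MY + 2][MX + 2];          // current z-plane + halo

    int i  = blockIdx.x * MX + threadIdx.x + 1;     // interior x index
    int j  = blockIdx.y * MY + threadIdx.y + 1;     // interior y index
    int tx = threadIdx.x + 1, ty = threadIdx.y + 1;

    for (int k = 1; k < nz - 1; ++k) {              // each thread loops over z
        int idx = (k * ny + j) * nx + i;

        // Load the current z-plane (plus x/y halos) into shared memory.
        tile[ty][tx] = phi[idx];
        if (threadIdx.x == 0)      tile[ty][0]      = phi[idx - 1];
        if (threadIdx.x == MX - 1) tile[ty][MX + 1] = phi[idx + 1];
        if (threadIdx.y == 0)      tile[0][tx]      = phi[idx - nx];
        if (threadIdx.y == MY - 1) tile[MY + 1][tx] = phi[idx + nx];
        __syncthreads();

        float p = tile[ty][tx];
        // 7-point Laplacian: x/y neighbors from shared memory,
        // z neighbors from global memory.
        float lap = (tile[ty][tx - 1] + tile[ty][tx + 1]
                   + tile[ty - 1][tx] + tile[ty + 1][tx]
                   + phi[idx - nx * ny] + phi[idx + nx * ny]
                   - 6.0f * p) / (dx * dx);

        // Simplified isotropic update; the kernel described in the article
        // references 18 neighbor points, recycles three z-planes in shared
        // memory and fuses the temperature update (described next).
        phi_new[idx] = p + dt * M * (eps2 * lap
                     + 4.0f * W * p * (1.0f - p) * (p - 0.5f + beta));
        __syncthreads();                            // before reloading the tile
    }
}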

Recycling three arrays of size (MX+2)×(MY+2) saves shared memory usage. In computing the temperature, the shared memory is used similarly; however, in this case the time derivative term ∂φ/∂t at step n appears in the right-hand side of Eq. (2). We fuse the computational kernel that updates φ from step n to n+1 with the kernel that updates T from step n to n+1, so that keeping the value of ∂φ/∂t in a temporary variable inside the kernel makes an additional global memory access unnecessary.

3-2 Performance of Single GPU

In order to evaluate the performance of the GPU computing and to check the numerical results in comparison with the CPU, we also built the CPU code simultaneously. Since integer calculations are also done by the streaming processors of the GPU, we count the number of floating point operations of the calculation for the dendritic solidification by using the hardware counter of PAPI (Performance API) [5] on the CPU code.

The maximum mesh size of a run on a single GPU is 640×640×640, because one GPU board of the Tesla S1070 has 4 GByte of VRAM (GDDR3). By changing the mesh size, we measured the performance of the GPU computing: 116.8 GFlops for a 64×64×64 mesh, 161.6 GFlops for a 128×128×128 mesh, 169.8 GFlops for a 256×256×256 mesh, 168.5 GFlops for a 512×512×512 mesh and 171.4 GFlops for a 640×640×640 mesh. The performance of a single CPU core (Opteron 2.4 GHz) is 898 MFlops, so a 190x speedup was obtained on TSUBAME 1.2.

The phase-field calculation consists of 373 floating point operations and 28 global memory accesses (26 reads and 2 writes) per mesh point. The same calculation is carried out on every mesh point in single precision. The arithmetic intensity is estimated to be 3.33 Flop/Byte. When the shared memory is used, the number of memory reads reduces to 2 and the arithmetic intensity increases to 23.31 Flop/Byte. This shows that the calculation is compute-intensive by Computational Fluid Dynamics standards. Therefore, high performances such as 171.4 GFlops can be achieved in the GPU computing.

4 Multiple-GPU Computing

4-1 GPU Computing on Multiple Nodes

Multiple-GPU computing is carried out for the following two purposes: (1) enabling large-scale computing beyond the memory limitation of a single GPU card, and (2) speeding up a fixed problem, pursuing strong scalability. Multiple-GPU computing requires GPU-level parallelization on top of a hierarchical parallel computation, since the blocks and the threads in CUDA are already parallelized inside the GPU. The computational domain is decomposed and a sub-domain is assigned to each GPU.

Using the MPI library for the communication between GPU nodes, we run the same number of processes as the number of GPUs. Direct GPU-to-GPU data transfer is not available, and a three-hop communication is required: global memory to host memory, MPI communication, and host memory to global memory. This communication overhead of multi-GPU computing is relatively much larger than that of CPUs. In this article, a 1-dimensional domain decomposition is examined for simplicity.

4-2 Overlap between Communication and Computation

In order to improve the performance of the multiple-GPU computing, an overlapping technique between communication and computation is introduced. Since Eqs. (1) and (2) are explicitly time-integrated, only the data of one mesh layer at the sub-domain boundary has to be transferred. The GPU kernel is divided into two, and the first kernel computes the boundary mesh points. At this moment, the data is ready to be transferred, and the CUDA memory copy from the global memory to the host memory starts asynchronously as stream 0. Simultaneously, the second kernel, which computes the inner mesh points, starts as stream 1. The overlapping of stream 0 with stream 1 can hide the communication time (a code sketch of this scheme is given at the end of this section).

4-3 Performance of Multiple-GPU Computing

For four cases, 512×512×512, 960×960×960, 1920×1920×1920 and 2400×2400×2400 meshes, the performance is examined while changing the number of GPUs for both the overlapping and the non-overlapping cases. The strong scalabilities are shown in Fig. 2.

Fig. 2 Strong scalabilities of multi-GPU computing

In every case, the performance of the overlapping computation is greatly improved compared with that of the non-overlapping one. The ideal strong scalabilities are achieved up to 8 GPUs for the 512×512×512 mesh, from 4 to 24 GPUs for the 960×960×960 mesh, and from 30 to 48 GPUs for the 1920×1920×1920 mesh. We have perfect weak scalability over the range of GPU numbers used in our runs. In the overlapping cases, the ideal strong scalabilities suddenly saturate as the number of GPUs increases; once the computation time becomes shorter than the communication time, the communication can no longer be hidden.

It should be highlighted that a performance of 10 TFlops is achieved with 60 GPUs, which is comparable to the performance of applications running on world top-class supercomputers. We directly compare the GPU performance with the CPU performance on TSUBAME 1.2 for the same test case of a 960×960×960 mesh. In the overlapping case, 24 GPUs show a performance of 3.7 TFlops in Fig. 3, and it is noticed that 24 GPUs are comparable with 4,000 Opteron (2.4 GHz) CPU cores, even if we assume perfect strong scalability for the CPU code.

Fig. 3 Performance comparison between GPU and CPU on TSUBAME 1.2 for the case of a 960×960×960 mesh
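The following is a hedged sketch of one overlapped time step as described in Sec. 4-2 (hypothetical code, not the authors' implementation; the kernel names update_boundary/update_inner, the function overlapped_step and the buffer layout are assumptions, with a 1-D decomposition along z and one ghost plane on each side of the sub-domain):

// Sketch of one overlapped time step (Sec. 4-2). Pinned host buffers are
// assumed so that cudaMemcpyAsync can proceed asynchronously; rank_lo/rank_hi
// may be MPI_PROC_NULL at the ends of the decomposition.
#include <mpi.h>
#include <cuda_runtime.h>

// Hypothetical kernels standing in for the split phase-field update:
__global__ void update_boundary(const float *phi, float *phi_new, int nx, int ny, int nz);
__global__ void update_inner   (const float *phi, float *phi_new, int nx, int ny, int nz);

void overlapped_step(float *d_phi, float *d_phi_new,
                     float *h_send_lo, float *h_send_hi,   // pinned host buffers
                     float *h_recv_lo, float *h_recv_hi,
                     int nx, int ny, int nz,               // local mesh incl. ghost planes
                     int rank_lo, int rank_hi,
                     cudaStream_t s0, cudaStream_t s1)
{
    size_t plane = (size_t)nx * ny;                        // one z-plane
    size_t bytes = plane * sizeof(float);
    dim3 block(64, 4), grid(nx / 64, ny / 4);

    // Stream 0: update only the two boundary planes of the sub-domain.
    update_boundary<<<grid, block, 0, s0>>>(d_phi, d_phi_new, nx, ny, nz);
    // Stream 1: update the inner planes concurrently.
    update_inner<<<grid, block, 0, s1>>>(d_phi, d_phi_new, nx, ny, nz);

    // Hop 1: boundary planes, device -> host (asynchronous on stream 0).
    cudaMemcpyAsync(h_send_lo, d_phi_new + 1 * plane,        bytes, cudaMemcpyDeviceToHost, s0);
    cudaMemcpyAsync(h_send_hi, d_phi_new + (nz - 2) * plane, bytes, cudaMemcpyDeviceToHost, s0);
    cudaStreamSynchronize(s0);

    // Hop 2: MPI exchange with the two neighboring ranks.
    MPI_Status st;
    MPI_Sendrecv(h_send_lo, (int)plane, MPI_FLOAT, rank_lo, 0,
                 h_recv_hi, (int)plane, MPI_FLOAT, rank_hi, 0, MPI_COMM_WORLD, &st);
    MPI_Sendrecv(h_send_hi, (int)plane, MPI_FLOAT, rank_hi, 1,
                 h_recv_lo, (int)plane, MPI_FLOAT, rank_lo, 1, MPI_COMM_WORLD, &st);

    // Hop 3: received halo planes, host -> device, while the inner kernel
    // (stream 1) may still be running.
    cudaMemcpyAsync(d_phi_new,                    h_recv_lo, bytes, cudaMemcpyHostToDevice, s0);
    cudaMemcpyAsync(d_phi_new + (nz - 1) * plane, h_recv_hi, bytes, cudaMemcpyHostToDevice, s0);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);   // the step is complete when both streams finish
}

The key point is that the three-hop transfer (device to host, MPI, host to device) runs on stream 0 while the inner-point kernel proceeds on stream 1, so the communication is hidden as long as the inner computation takes longer than the exchange.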


5 Summary

GPU computing for the dendritic solidification process of a pure metal was carried out on the NVIDIA Tesla S1070 GPUs of TSUBAME 1.2 by solving a time-dependent Ginzburg-Landau equation coupled with the heat conduction equation based on the phase-field model. The GPU code was developed in CUDA, and a performance of 171 GFlops was achieved on a single GPU. It was found that multiple-GPU computing with domain decomposition has a large communication overhead. Both the strong and the weak scalabilities were shown. A performance of 10 TFlops was achieved with 60 GPUs when the overlapping technique was introduced. GPU computing greatly contributes to low electric-power consumption and is a promising candidate for next-generation supercomputing.

Acknowledgements

This research was supported in part by KAKENHI, Grant-in-Aid for Scientific Research (B) 19360043 from the Ministry of Education, Culture, Sports, Science and Technology (MEXT), the JST-CREST project "ULP-HPC: Ultra Low-Power, High Performance Computing via Modeling and Optimization of Next Generation HPC Technologies" from the Japan Science and Technology Agency (JST), and the JSPS Global COE program "Computationism as a Foundation for the Sciences" from the Japan Society for the Promotion of Science (JSPS).

References

[1] Tomohiro Takaki, Toshimichi Fukuoka and Yoshihiro Tomita, "Phase-field simulation during directional solidification of a binary alloy using adaptive finite element method", J. Crystal Growth, 283 (2005), pp. 263-278.
[2] Toshio Endo and Satoshi Matsuoka, "Massive Supercomputing Coping with Heterogeneity of Modern Accelerators", in IEEE International Parallel & Distributed Processing Symposium (IPDPS 2008), April 2008.
[3] NVIDIA Corporation, "NVIDIA CUDA Compute Unified Device Architecture Programming Guide Version 2.0", NVIDIA Corporation, California, 2008.
[4] Ryo Kobayashi, "Modeling and numerical simulations of dendritic crystal growth", Physica D, 63 (3-4), pp. 410-423, 1993.
[5] PAPI, http://icl.cs.utk.edu/papi/


TSUBAME e-Science Journal No. 1 (Premier issue)
Published 09/18/2010 by the Global Scientific Information and Computing Center, Tokyo Institute of Technology
Design & Layout: Caiba & Kick and Punch
Editor: TSUBAME e-Science Journal Editorial room
Takayuki AOKI, Toshio WATANABE, Masakazu SEKIJIMA, Thirapong PIPATPONGSA, Fumiko MIYAMA
Address: 2-12-1 i7-3 O-okayama, Meguro-ku, Tokyo 152-8550
Tel: +81-3-5734-2087  Fax: +81-3-5734-3198
E-mail: [email protected]
URL: http://www.gsic.titech.ac.jp/

International Research Collaboration

The high performance of supercomputer TSUBAME has been extended to the international arena. We promote international research collaborations using TSUBAME between researchers of Tokyo Institute of Technology and overseas research institutions as well as research groups worldwide.

Recent research collaborations using TSUBAME
1. Simulation of Tsunamis Generated by Earthquakes using Parallel Computing Technique
2. Numerical Simulation of Energy Conversion with MHD Plasma-fluid Flow
3. GPU Computing for Computational Fluid Dynamics

Application Guidance

Candidates wishing to initiate research collaborations are expected to conclude an MOU (Memorandum of Understanding) with the partner organizations/departments. A committee reviews the "Agreement for Collaboration" for joint research to ensure that the proposed research meets academic qualifications and contributes to international society. Overseas users must observe the rules and regulations on using TSUBAME. User fees are paid by the Tokyo Tech researcher as part of the research collaboration. The results of joint research are expected to be released in academic publications.

Inquiry

Please see the following website for more details. http://www.gsic.titech.ac.jp/en/InternationalCollaboration