Early Evaluation of the XT3 at ORNL

J. S. Vetter, S. R. Alam, T. H. Dunigan, Jr., M. R. Fahey, P. C. Roth, P. H. Worley, and dozens of others
Oak Ridge National Laboratory, Oak Ridge, TN, USA 37831

EARLY EVALUATION: This paper contains preliminary results from our early delivery system, which is smaller in scale than the final delivery system and which uses early versions of the system software.

Highlights

➢ ORNL is installing a 25 TF, 5,200-processor Cray XT3
  – We currently have a 3,800-processor system
  – We are using CRMS 0406
➢ This is a snapshot!
➢ We are evaluating its performance using
  – Microbenchmarks
  – Kernels
  – Applications
➢ The system is running applications at scale
➢ This summer, we expect to install the final system, scale up some applications, and continue evaluating its performance


Acknowledgments

➢ Cray
  – Supercomputing Center of Excellence
  – Jeff Beckleheimer, John Levesque, Luiz DeRose, Nathan Wichmann, and Jim Schwarzmeier
➢ Sandia National Laboratories
  – Ongoing collaboration
➢ Pittsburgh Supercomputing Center
  – Exchanged early-access accounts to promote cross testing and shared experiences
➢ ORNL staff
  – Don Maxwell
➢ This research was sponsored by the Office of Mathematical, Information, and Computational Sciences, Office of Science, U.S. Department of Energy under Contract No. DE-AC05-00OR22725 with UT-Battelle, LLC. Accordingly, the U.S. Government retains a non-exclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes.


Cray XT3 – Jaguar: Current System Configuration

➢ 40 cabinets
➢ 3,784 compute processors
➢ 46 service and I/O processors
➢ 2 GB memory per processor
➢ Topology (X,Y,Z): 10 torus, 16 mesh, 24 torus
➢ CRMS software stack


Cray XT3 – Jaguar: Final System Configuration – ETA June

➢ 56 cabinets
➢ 5,212 compute processors (25 TF)
➢ 82 service and I/O processors
➢ 2 GB memory per processor
➢ 10.7 TB aggregate memory
➢ 120 TB disk space in file system
➢ Topology (X,Y,Z): 14 torus, 16 mesh/torus, 24 torus


Cray XT3 System Overview

➢ Cray’s third-generation MPP, following the T3D and T3E
➢ Key features build on the previous design philosophy
  – Single processor per node
    • Commodity processor: AMD Opteron
  – Customized interconnect
    • SeaStar ASIC
    • 3-D mesh/torus
  – Lightweight OS (Catamount) on compute PEs; Linux on service and I/O PEs


Cray XT3 PE Design

Image courtesy of Cray.


AMD Opteron

➢ AMD Opteron Model 150
  – Processor core at 2.4 GHz
    • Three integer units
    • One floating-point unit capable of two floating-point operations per cycle
    • 4.8 GFLOPS peak
  – Integrated memory controller
  – Three 16-bit, 800 MHz HyperTransport (HT) links
  – L1 cache: 64 KB instruction and 64 KB data caches
  – L2 cache: 1 MB unified
➢ The Model 150 has three HT links, but none support coherent HT (for SMPs)
  – Lower latency to main memory than SMP-capable processors

Image courtesy of AMD.


Cray SeaStar Interconnect ASIC

➢ Routing and communications ASIC
➢ Connects to the Opteron via a 6.4 GBps HT link
➢ Connects to six neighbors via 7.6 GBps links
➢ Topologies include torus and mesh
➢ Contains
  – PowerPC 440 chip, DMA engine, service port, router
➢ Notice
  – No PCI bus in the transfer path
  – Interconnect link bandwidth is greater than the Opteron link bandwidth
  – Carries all message traffic in addition to I/O traffic


Software

➢ Operating systems
  – Catamount on compute PEs
    • Lightweight kernel w/ limited functionality to improve reliability, performance, etc.
  – Linux on service PEs
➢ Filesystems
  – Scratch space through Yod
  – Lustre
➢ Math libraries
  – ACML 2.5, Goto library
➢ Portals communication library
➢ Scalable application launch using Yod
➢ Programming environments
  – Apprentice, PAT, PAPI, mpiP
  – Totalview
➢ Details
  – CRMS 0406
  – 0413AA firmware
  – PIC 0x12


Evaluations

➢ Goals
  – Determine the most effective approaches for using each system
  – Evaluate benchmark and application performance, both in absolute terms and in comparison with other systems
  – Predict scalability, both in terms of problem size and in number of processors

➢ We employ a hierarchical, staged, and open approach
  – Hierarchical
    • Microbenchmarks
    • Kernels
    • Applications
  – Interact with many others to get the best results
  – Share those results


Recent and Ongoing Evaluations

➢ Cray X1
  – P.A. Agarwal, R.A. Alexander et al., “Cray X1 Evaluation Status Report,” ORNL, Oak Ridge, TN, Technical Report ORNL/TM-2004/13, 2004.
  – T.H. Dunigan, Jr., M.R. Fahey et al., “Early Evaluation of the Cray X1,” Proc. ACM/IEEE Conference on High Performance Networking and Computing (SC03), 2003.
  – T.H. Dunigan, Jr., J.S. Vetter et al., “Performance Evaluation of the Cray X1 Distributed Shared Memory Architecture,” IEEE Micro, 25(1):30-40, 2005.
➢ SGI Altix
  – T.H. Dunigan, Jr., J.S. Vetter, and P.H. Worley, “Performance Evaluation of the SGI Altix 3700,” Proc. International Conf. Parallel Processing (ICPP), 2005.
➢ Cray XD1
  – M.R. Fahey, S.R. Alam et al., “Early Evaluation of the Cray XD1,” Proc. Cray User Group Meeting, 2005, pp. 12.
➢ SRC
  – M.C. Smith, J.S. Vetter, and X. Liang, “Accelerating Scientific Applications with the SRC-6 Reconfigurable Computer: Methodologies and Analysis,” Proc. Reconfigurable Architectures Workshop (RAW), 2005.

➢ Underway
  – XD1 FPGAs
  – ClearSpeed
  – EnLight
  – Multicore processors
  – IBM BlueGene/L
  – IBM Cell


Microbenchmarks

➢ Microbenchmarks characterize specific components of the architecture

➢ Microbenchmark suite tests
  – arithmetic performance,
  – memory-hierarchy performance,
  – task and thread performance,
  – message-passing performance,
  – system and I/O performance, and
  – parallel I/O


Memory Performance

Measured latency to main memory:

  Platform                           Latency (ns)
  Cray XT3 / Opteron 150 / 2.4 GHz     51.41
  Cray XD1 / Opteron 248 / 2.2 GHz     86.51
  IBM p690 / POWER4 / 1.3 GHz          90.57
  Xeon / 3.0 GHz                      140.57

Compiler Issue uncovered
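The slides do not reproduce the memory microbenchmark itself; a pointer-chasing loop is one common way to measure load-to-use latency of this kind. The following is a minimal sketch only, with the array size, stride, and timer chosen for illustration (not taken from the actual suite):

/* Pointer-chasing latency sketch: each load depends on the previous one,
 * so the average time per iteration approximates main-memory load latency
 * once the working set is far larger than the L2 cache (1 MB on the Opteron 150). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (16 * 1024 * 1024)   /* 128 MB of pointers */

int main(void)
{
    void **chain = malloc(N * sizeof(void *));
    size_t i, iters = 100 * 1000 * 1000;
    struct timespec t0, t1;

    /* Build a chain with a large, odd stride so the chase visits every slot
     * and hardware prefetching is largely ineffective. */
    for (i = 0; i < N; i++)
        chain[i] = &chain[(i + 4099) % N];

    void **p = &chain[0];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < iters; i++)
        p = *p;                 /* serialized dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%p  avg latency: %.2f ns\n", (void *)p, ns / iters);  /* print p so the loop is kept */
    free(chain);
    return 0;
}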


DGEMM Performance
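DGEMM is normally run against the vendor BLAS available on the system (ACML 2.5 and the Goto library are listed on the Software slide). The measurement harness is not reproduced here; a minimal timing sketch that calls the standard Fortran BLAS dgemm symbol from C, with an illustrative matrix size, could look like this:

/* Minimal DGEMM timing sketch against a Fortran BLAS (e.g., ACML or Goto).
 * Link against the BLAS library; dgemm_ is the standard Fortran symbol. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb,
                   const double *beta, double *c, const int *ldc);

int main(void)
{
    const int n = 2000;                  /* illustrative problem size */
    const double alpha = 1.0, beta = 0.0;
    double *a = malloc((size_t)n * n * sizeof *a);
    double *b = malloc((size_t)n * n * sizeof *b);
    double *c = malloc((size_t)n * n * sizeof *c);

    for (long i = 0; i < (long)n * n; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    dgemm_("N", "N", &n, &n, &n, &alpha, a, &n, b, &n, &beta, c, &n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double gflops = 2.0 * n * (double)n * n / sec / 1e9;   /* 2*n^3 flops */
    printf("DGEMM %dx%d: %.2f s, %.2f GFLOPS (peak 4.8)\n", n, n, sec, gflops);

    free(a); free(b); free(c);
    return 0;
}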


FFT Performance


Message Passing Performance

[Figures: MPI PingPong bandwidth (MBps) and latency (microseconds) vs. payload size (bytes), reaching ~1.1 GBps and ~30 us; MPI Exchange across 3,648 processors, reaching ~0.8 GBps and ~70 us. A minimal ping-pong sketch follows.]
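The exact MPI benchmark harness is not shown on the slide; a minimal ping-pong sketch between ranks 0 and 1, with the payload size and repetition count as illustrative assumptions, is:

/* Minimal MPI ping-pong sketch: rank 0 sends, rank 1 echoes back.
 * Half the round-trip time approximates one-way latency; payload/time
 * gives bandwidth for large messages. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, reps = 1000;
    int bytes = 1 << 20;                 /* 1 MB payload, illustrative */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = malloc(bytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = (MPI_Wtime() - t0) / reps / 2.0;   /* one-way time */

    if (rank == 0)
        printf("%d bytes: %.1f us one-way, %.1f MBps\n",
               bytes, t * 1e6, bytes / t / 1e6);
    free(buf);
    MPI_Finalize();
    return 0;
}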

Message Passing Performance (2)

➢ AllReduce across 3,648 processors (a minimal timing sketch follows the figure)

[Figure: MPI_Allreduce latency (microseconds) vs. payload size (bytes) across 3,648 processors; latency on the order of ~600 us.]
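The benchmark itself is not reproduced on the slide; a minimal sketch of how such an all-reduce can be timed, with the vector length and repetition count as illustrative assumptions, is:

/* Minimal MPI_Allreduce timing sketch across all ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs, reps = 100, n = 1024;   /* 8 KB of doubles, illustrative */
    double in[1024], out[1024];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (int i = 0; i < n; i++) in[i] = rank + i;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++)
        MPI_Allreduce(in, out, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t = (MPI_Wtime() - t0) / reps;

    if (rank == 0)
        printf("Allreduce of %d doubles on %d procs: %.1f us\n", n, nprocs, t * 1e6);
    MPI_Finalize();
    return 0;
}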


Note

➢ As mentioned earlier, we are using preliminary versions of the system software for these tests. We expect future versions of the software to improve both the latency and bandwidth of these MPI operations. In fact, other sites are reporting much-improved MPI results.


HPC Challenge Benchmark

➢ http://icl.cs.utk.edu/hpcc/
➢ Version 0.8b
➢ 2,048 processors, 9.8 TFLOPS peak
  – HPL: 7.4 TF of 9.8 TF peak (75%)
  – MPI RandomAccess: 0.055 GUPS

➢ These are unofficial numbers – not recorded at the HPCC website
  – Stay tuned
  – More details from the HPCC presentation and the HPCC website later this year


Kernels

➢ SMG2000
➢ PSTSWM (see paper)


SMG2000 / Multigrid solver

➢ SMG2000 is a driver for the linear solver Hypre
  – 7-point Laplacian on a structured grid
➢ Hypre
  – Parallel semicoarsening multigrid solver for the linear systems arising from finite difference, finite volume, or finite element discretizations of the diffusion equation
➢ Benchmark includes both setup and solve of the linear system
  – We only measure the solve phase

[Figure: SMG2000 runtime vs. NPROCS (1 to 10,000) for XT3, XD1, SP4, and SP3. Lower is better.]


ORNL has Major Efforts Focusing on Grand Challenge Scientific Applications

➢ SciDAC Astrophysics
➢ Genomes to Life
➢ Nanophase Materials
➢ SciDAC Climate
➢ SciDAC Fusion
➢ SciDAC Chemistry

Climate Modeling

➢ The Community Climate System Model (CCSM) is the primary model for global climate simulation in the USA
  – Community Atmosphere Model (CAM)
  – Community Land Model (CLM)
  – Parallel Ocean Program (POP)
  – Los Alamos Sea Ice Model (CICE)
  – Coupler (CPL)
➢ Running Intergovernmental Panel on Climate Change (IPCC) experiments


Climate / Parallel Ocean Program / POP


Fusion

➢ Advances in understanding tokamak plasma behavior are necessary for the design of large-scale reactor devices (like ITER)
➢ Multiple applications are used to simulate various phenomena w/ different algorithms
  – GYRO
  – NIMROD
  – AORSA3D
  – GTC


Fusion / GYRO


sPPM

➢ 3-D gas dynamics problem on a uniform Cartesian mesh, using a simplified version of the Piecewise Parabolic Method

[Figure: sPPM grind time (us) vs. NPROCS (1 to 10,000) for XT3, XD1, SP4, and SP3.]


Modeling Performance Sensitivities

➢ Estimate the change in performance due to lower latencies (a sketch of one such model follows the figure)

[Figures: GYRO (B1-std benchmark) and POP (X1 grid, 4 simulation days) runtimes (sec), split into computation and communication, at 16 to 1,024 processors, comparing an assumed 5 us MPI latency against 30 us.]
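The slides do not spell out the model used to produce these estimates. One simple way to project communication time under a different latency, assuming the measured communication time splits into a per-message latency term and a bandwidth term taken from profiling data, is sketched below; the function name and all numbers are hypothetical illustrations, not measurements from the slides.

/* Hypothetical latency-sensitivity sketch: project a run's communication
 * time under a new per-message latency, assuming
 *   t_comm ~= n_msgs * latency + bytes / bandwidth.
 * The message counts and volumes would come from a profiler such as mpiP. */
#include <stdio.h>

static double project_runtime(double t_comp, double n_msgs, double bytes,
                              double bandwidth, double latency)
{
    double t_comm = n_msgs * latency + bytes / bandwidth;
    return t_comp + t_comm;
}

int main(void)
{
    /* Illustrative numbers only. */
    double t_comp = 250.0;        /* seconds of computation               */
    double n_msgs = 5.0e6;        /* messages sent by the slowest process */
    double bytes  = 4.0e10;       /* bytes sent by the slowest process    */
    double bw     = 1.1e9;        /* ~1.1 GBps sustained MPI bandwidth    */

    printf("30 us latency: %.1f s\n",
           project_runtime(t_comp, n_msgs, bytes, bw, 30e-6));
    printf(" 5 us latency: %.1f s\n",
           project_runtime(t_comp, n_msgs, bytes, bw, 5e-6));
    return 0;
}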


Next steps

➢ Eager to work on the full system
  – 25 TF, 5,200-processor Cray XT3 this summer
➢ Preliminary evaluation results
  – Need to continue performance optimization of system software and applications
  – GYRO and POP are running well – strong scaling
  – SMG and sPPM are running well across the system – weak scaling
  – Installing the parallel filesystem
➢ Actively porting many other applications


Other related presentations

➢ Comparative Analysis of Interprocess Communication on the X1, XD1, and XT3, Worley
➢ Cray and HPCC: Benchmark Developments and Results from the Past Year, Wichmann
➢ Porting and Performance of the Community Climate System Model (CCSM3) on the Cray X1, Carr
➢ Optimization of the PETSc Toolkit and Application Codes on the Cray X1, Mills
➢ GYRO Performance on a Variety of MPP Systems, Fahey
➢ Early Evaluation of the Cray XD1, Fahey
➢ Towards Petacomputing in Nanotechnology, Wang
➢ System Integration Experience Across the Cray Product Line, Bland
➢ High-Speed Networking with Cray at ORNL, Carter


The Details

➢ We are running
  – CRMS 0406
  – 0413AA firmware
  – PIC 0x12

$ module list
Currently Loaded Modulefiles:
  1) acml/2.5            6) xt-mpt/1.0        11) xt-boot/1.0
  2) pgi/5.2.4           7) xt-service/1.0    12) xt-pbs/1.4
  3) totalview/6.8.0-0   8) xt-libc/1.0       13) xt-crms/1.0
  4) xt-pe/1.0           9) xt-os/1.0         14) PrgEnv/1.0
  5) xt-libsci/1.0      10) xt-catamount/1.15

