Cray Centre of Excellence for HECToR: Activities and Future Projects
HECToR User Group Meeting 2010
12 October 2010, Manchester

Overview
• Phase 2b – Cray XT6
• Current and past projects: SBLI, LUDWIG, HiGEM, GADGET, HELIUM
• Future projects: Cray XE6, Gemini results
• New Cray Centre of Excellence for HECToR website!
• Cray Exascale Research Initiative Europe
• Training and conferences

HECToR hardware
• Cray XT4 compute nodes: 3072 x 2.3 GHz quad-core Opteron nodes, 12,288 cores, 8 GB per node (2 GB per core)
• Cray X2 compute nodes: 28 x 4-processor Cray X2 compute nodes, 32 GB per node (8 GB per core)
• Cray XT6 compute nodes: 1856 x 2.1 GHz dual 12-core Opteron nodes, 44,544 cores, 32 GB per node (1.33 GB per core)

Cray XT6 compute node (AMD Magny-Cours)
[Diagram: node built from two multi-chip modules holding four Opteron dies; each die has six "Greyhound" cores sharing 6 MB of L3 cache, DDR3 memory channels per die, HT3 links between dies and an HT1/HT3 link to the interconnect]
• 2 multi-chip modules, 4 Opteron dies
• 8 channels of DDR3 bandwidth to 8 DIMMs
• 24 (or 16) computational cores, 24 MB of L3 cache
• Dies are fully connected with HT3
• Snoop filter feature allows the 4-die SMP to scale well

Cray XT6 node characteristics
• Number of cores: 24 (two 2.1 GHz MC-12 Opterons)
• Peak performance: 202 Gflops
• Memory size: 32 GB per node
• Memory bandwidth: 83.5 GB/sec direct-connect memory
• HyperTransport: 6.4 GB/sec direct connect
• Interconnect: Cray SeaStar2+

SBLI
• A user reported that large runs on jaguarpf were dying with MPI buffer overflows.
• Pre-posting MPI receives, together with appropriate MPI environment variable settings, alleviates this issue and provides much better scalability (a minimal sketch of the pre-posting pattern appears after the HiGEM notes below).
[Chart: SBLI performance on jaguarpf, UU dataset – performance (Mgrid points x iterations/sec) against core counts up to ~250,000, for the optimized and unoptimized code]

LUDWIG
• Lattice-Boltzmann code by Dr Kevin Stratford (EPCC) and Prof. Mike Cates (University of Edinburgh, Dept of Physics)
• Scaling very well up to 216,000 cores
• Smaller global lattice than run previously on barcelona and jaguarpf – really pushing the strong scaling
[Chart: Ludwig strong scaling on jaguarpf – parallel speedup against core counts up to ~250,000, reduced-halo code versus perfect scaling]

HiGEM
• Dr Len Shaffrey, University of Reading
• High-resolution coupled climate model
• The CoE is collaborating with NCAS researchers to refine UM I/O performance
• I/O servers now overlap communication and computation; the first stages are complete
• Currently applied to a single version of the Unified Model; a moderate effort is required to back-port the code to earlier versions, and a large number of different UM versions are in use on HECToR
• The I/O server technique has been employed in other applications as well, e.g. HELIUM and CARP (a sketch of the idea follows below)
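The SBLI fix above relies on posting receives before the matching sends arrive, so messages land in user buffers rather than in the MPI library's unexpected-message buffers. The following is a minimal sketch of that pattern in C with MPI; the neighbour layout, halo size and tag are invented for illustration and are not taken from SBLI itself.

#include <mpi.h>
#include <stdlib.h>

#define NNEIGH 2      /* hypothetical number of halo neighbours   */
#define HALO   1024   /* hypothetical halo size in doubles        */
#define TAG    99     /* arbitrary message tag for this sketch    */

int main(int argc, char **argv)
{
    int rank, size;
    double *recvbuf[NNEIGH], *sendbuf[NNEIGH];
    MPI_Request req[2 * NNEIGH];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Toy 1-D periodic neighbour layout, standing in for the real
       domain decomposition.                                       */
    int neigh[NNEIGH] = { (rank + 1) % size, (rank + size - 1) % size };

    for (int i = 0; i < NNEIGH; i++) {
        recvbuf[i] = malloc(HALO * sizeof(double));
        sendbuf[i] = calloc(HALO, sizeof(double));
    }

    /* 1. Pre-post every receive first, so incoming halo messages
          match a posted receive instead of piling up in the
          library's unexpected-message buffers.                    */
    for (int i = 0; i < NNEIGH; i++)
        MPI_Irecv(recvbuf[i], HALO, MPI_DOUBLE, neigh[i], TAG,
                  MPI_COMM_WORLD, &req[i]);

    /* 2. Only then start the sends, and wait for everything.      */
    for (int i = 0; i < NNEIGH; i++)
        MPI_Isend(sendbuf[i], HALO, MPI_DOUBLE, neigh[i], TAG,
                  MPI_COMM_WORLD, &req[NNEIGH + i]);

    MPI_Waitall(2 * NNEIGH, req, MPI_STATUSES_IGNORE);

    for (int i = 0; i < NNEIGH; i++) { free(recvbuf[i]); free(sendbuf[i]); }
    MPI_Finalize();
    return 0;
}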
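The I/O server idea used in HiGEM above, and in HELIUM below, is to dedicate a few ranks to writing files so that the compute ranks hand off their output and carry on. The actual UM and HELIUM servers are more elaborate; this is only a minimal sketch under a toy assumption that the last rank is the single I/O server and all file names and sizes are invented.

#include <mpi.h>
#include <stdio.h>

#define NSTEPS 4      /* hypothetical number of output steps      */
#define CHUNK  1024   /* hypothetical output size per rank/step   */
#define IOTAG  7

int main(int argc, char **argv)   /* run with at least 2 ranks    */
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int io_rank = size - 1;       /* dedicated I/O server rank     */

    if (rank == io_rank) {
        /* I/O server loop: receive chunks and write them while the
           compute ranks carry on with the next timestep.          */
        double buf[CHUNK];
        FILE *fp = fopen("output.dat", "wb");
        for (int msg = 0; msg < NSTEPS * (size - 1); msg++) {
            MPI_Status st;
            MPI_Recv(buf, CHUNK, MPI_DOUBLE, MPI_ANY_SOURCE, IOTAG,
                     MPI_COMM_WORLD, &st);
            fwrite(buf, sizeof(double), CHUNK, fp);
        }
        fclose(fp);
    } else {
        /* Compute ranks: hand each step's output to the I/O server
           with a non-blocking send, then keep computing while the
           server does the actual file write.                      */
        double field[CHUNK];
        MPI_Request req = MPI_REQUEST_NULL;
        for (int step = 0; step < NSTEPS; step++) {
            MPI_Wait(&req, MPI_STATUS_IGNORE);   /* buffer reusable */
            for (int i = 0; i < CHUNK; i++)      /* fake computation */
                field[i] = rank + step + i * 1e-6;
            MPI_Isend(field, CHUNK, MPI_DOUBLE, io_rank, IOTAG,
                      MPI_COMM_WORLD, &req);
        }
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}

A production server would use double buffering and several I/O ranks, but the overlap principle is the same: the expensive file write happens on the server while the compute ranks have already moved on.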
GADGET
• Dr Adrian Jenkins, Prof. Richard Bower, Dr Tom Theuns (Durham University); Dr Robert Crain (Swinburne University)
• The CoE is collaborating with Institute for Computational Cosmology (ICC) researchers to better understand and improve the scaling of GADGET3
• The code exhibits significant load imbalance with current datasets at small core counts; 50% of the time is spent waiting at MPI collectives on 64 cores
• The CoE is looking to improve the load balancing in the code
• Performance characterization of the new hybrid MPI/pthreads code
[Chart: GADGET3 scaling on Cray XT6 and Cray XE6 – cost (time(s) x cores) against core counts up to ~2000, for the MPI and MPI/pthreads versions]

HELIUM
• Collaboration with Prof. Ken Taylor, Queen's University Belfast
• CoE work involved the implementation of an I/O server
• Work reported at the Cray User Group in Edinburgh, May 2010: "Optimising Application Performance using I/O Servers on Cray XT Technology"
• Use of the I/O server allows for full overlap of I/O with computation and communication; I/O time is hidden as I/O requests are offloaded to dedicated I/O servers
[Chart: HELIUM I/O performance – time (s) against core counts from 100 to 10,000, with and without I/O servers]

HELIUM results
• Both of the 800 nm runs were successful; both required 16,110 cores, near the maximum available on HECToR.
• The results confirmed the present understanding of the high-energy cut-off law and were consistent with results at 390 nm, 248 nm and 195 nm.
• Calculations at higher intensities are not really feasible on 16,110 cores of HECToR and remain very much a grand challenge.

Gemini interconnect
[Diagram: Gemini ASIC – two HyperTransport 3 interfaces feeding two NICs, a Netlink block, an LO processor and a 48-port YARC router]
• Supports 2 nodes per ASIC
• 168 GB/sec routing capacity
• Scales to over 100,000 network endpoints
• Link-level reliability and adaptive routing
• Advanced resiliency features
• Provides a global address space
• Advanced NIC designed to efficiently support MPI, one-sided MPI, SHMEM, UPC and Coarray Fortran

Cray XE6 node characteristics
• Number of cores: 16 or 24
• Peak performance: 210 Gflops
• Memory size: 32 or 64 GB
• Memory bandwidth: 85 GB/s
• 10 12X Gemini channels; each Gemini acts like two nodes on the 3D torus
• High-radix YARC router with adaptive routing, 168 GB/sec capacity

[Diagram: 3D torus (X, Y, Z) – module with SeaStar versus module with Gemini]

Gemini features
• Like SeaStar, Gemini has a DMA offload engine allowing large transfers to proceed asynchronously
• Gemini provides low-overhead OS-bypass features for short transfers; MPI latency is around 1.4 µs in the current release
• The NIC provides for many millions of MPI messages per second – 20 times better than SeaStar
• Much improved one-sided communication mechanism with hardware support; AMOs provide a faster synchronization method for barriers
• Gemini supports adaptive routing, which reduces problems with network hot spots and allows MPI to survive link failures
• Much improved injection bandwidth – 6 GB/s user data injection with an HT3 link

Gemini and PGAS
• Globally addressable memory provides efficient support for UPC, Coarray Fortran, SHMEM and Global Arrays; the Cray compiler targets this capability directly
• Pipelined global loads and stores allow for fast irregular communication patterns
• Atomic memory operations provide the fast synchronization needed for one-sided communication models
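The one-sided model that Gemini's global address space and AMOs accelerate can be illustrated with standard MPI one-sided calls. This sketch deliberately uses portable MPI-2 windows rather than Cray SHMEM or compiler-generated UPC/coarray code (an assumption made here for self-containedness): one rank exposes memory, the others deposit data with a remote put and bump a shared counter with an atomic remote add, with no matching receives anywhere.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Rank 0 exposes a window: element 0 is a shared counter and
       element r is a result slot for rank r.  The other ranks
       expose no memory.                                            */
    int *base = NULL;
    MPI_Aint winsize = 0;
    if (rank == 0) {
        winsize = (MPI_Aint)size * sizeof(int);
        MPI_Alloc_mem(winsize, MPI_INFO_NULL, &base);
        for (int i = 0; i < size; i++) base[i] = 0;
    }

    MPI_Win win;
    MPI_Win_create(base, winsize, sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                 /* open the access epoch */
    if (rank != 0) {
        int one = 1, val = 100 + rank;
        /* Remote store into this rank's slot; the target side is
           not involved.                                            */
        MPI_Put(&val, 1, MPI_INT, 0, rank, 1, MPI_INT, win);
        /* Atomic remote add to the shared counter - the kind of
           operation that hardware AMOs make cheap.                 */
        MPI_Accumulate(&one, 1, MPI_INT, 0, 0, 1, MPI_INT, MPI_SUM, win);
    }
    MPI_Win_fence(0, win);                 /* close the epoch       */

    if (rank == 0)
        printf("counter = %d (expected %d)\n", base[0], size - 1);

    MPI_Win_free(&win);
    if (rank == 0) MPI_Free_mem(base);
    MPI_Finalize();
    return 0;
}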
SBLI on Cray XE6
[Chart: SBLI, T1 dataset, on Cray XE6 and Cray XT6 – Mgrid points x iterations/sec against core counts up to ~2500]

New Cray Centre of Excellence for HECToR website
http://www.hector.ac.uk/coe/

Cray Exascale Research Initiative Europe
• Launched December 2009
• Initial research partners are EPCC and CSCS
• Cray increased the Edinburgh-based team with Alistair Hart and Harvey Richardson
• Exploring how real applications can exploit future exascale architectures, specifically: programming models for PGAS languages, programming GPU accelerators, improved algorithms for FFTs, and network and I/O profiling
• Strong interactions with Cray R&D

Training and conferences
• CUG 2010 (24-27 May): "Using I/O servers to improve performance on Cray XT technology" – Tom Edwards and Kevin Roy
• NAIS HPC Workshop (23-24 June): PGAS programming with UPC and Fortran coarrays – Jason Beech-Brandt and Harvey Richardson
• DEISA Training Course (14-16 September): Performance optimization on Cray XT systems; Coarray Fortran tutorial
• Hybrid Manycore Programming and Accelerators in High Performance Computing (22 October): "Challenges in Programming Multicore and Accelerated Nodes" – John Levesque and Alistair Hart
• SC10: poster "Performance Tuning of Scientific Applications on HPC Systems"; poster "GPU Acceleration of Scientific Applications"; tutorial "Introduction to Coarray Fortran"
• EPCC MSc project supervision: Global Arrays and NWChem tuning on the Cray XT – Jason Beech-Brandt; Co-arrays for sparse matrix methods – Tom Edwards
• Visits to NAG, Daresbury and various user sites (Reading, Southampton, Durham, Cambridge, amongst others)

Questions?
Jason Beech-Brandt: [email protected]
Tom Edwards: [email protected]
Kevin Roy: [email protected]