HECToR User Group Meeting 2010, 12 Oct 2010, Manchester

 Phase 2b – XT6
 Current and Past Projects
   SBLI
   LUDWIG
   HiGEM
   GADGET
   HELIUM
 Future Projects
 Cray XE6
 Gemini results
 New Cray Centre of Excellence for HECToR website!
 Cray Exascale Research Initiative Europe
 Training and Conferences

• Cray XT4 Compute Nodes
  • 3072 x 2.3 GHz quad-core Opteron nodes
  • 12,288 cores
  • 8 GB per node – 2 GB per core

• Cray X2 Compute Nodes
  • 28 x 4 Cray X2 compute nodes
  • 32 GB per node – 8 GB per core

• Cray XT6 Compute Nodes
  • 1856 x 2.1 GHz dual 12-core Opteron nodes
  • 44,544 cores
  • 32 GB per node – 1.33 GB per core

[Diagram: Cray XT6 compute node – two AMD Magny-Cours multi-chip modules; each 6-core "Greyhound" die has 6 MB of shared L3 cache and its own DDR3 memory channels, with HT3 links between dies and HT1/HT3 to the interconnect]

 2 multi-chip modules, 4 Opteron dies
 8 channels of DDR3 bandwidth to 8 DIMMs
 24 (or 16) computational cores, 24 MB of L3 cache
 Dies are fully connected with HT3
 Snoop filter feature allows the 4-die SMP to scale well

Cray XT6 Node Characteristics
  Number of cores: 24 (MC-12, 2.1 GHz)
  Peak performance: 202 Gflops/sec
  Memory size: 32 GB per node
  Memory bandwidth: 83.5 GB/sec (direct connect memory)
  Interconnect: Cray SeaStar2+, attached via 6.4 GB/sec direct connect HyperTransport

• Users reported that large SBLI runs on jaguarpf were dying with MPI buffer overflows
• Preposting MPI receives, together with appropriate MPI environment variable settings, alleviates this issue and provides much better scalability (see the sketch below)

[Chart: SBLI performance on jaguarpf, UU dataset – performance (Mgrid points * iterations/sec) versus cores (up to 250,000), optimized versus unoptimized code]
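
A minimal sketch of the receive-preposting pattern described above (illustrative only – the routine, buffer layout, and neighbour list are hypothetical and not taken from SBLI):

```c
#include <mpi.h>
#include <stdlib.h>

/* Post all halo receives before any sends are issued, so incoming
 * messages land directly in user buffers instead of accumulating in
 * the MPI library's unexpected-message buffers. */
void halo_exchange(double *recvbuf, double *sendbuf, int count,
                   const int *neighbours, int nneigh, MPI_Comm comm)
{
    MPI_Request *reqs = malloc(2 * nneigh * sizeof(MPI_Request));

    /* 1. Prepost all receives */
    for (int i = 0; i < nneigh; ++i)
        MPI_Irecv(&recvbuf[i * count], count, MPI_DOUBLE,
                  neighbours[i], 0, comm, &reqs[i]);

    /* 2. Only then start the matching sends */
    for (int i = 0; i < nneigh; ++i)
        MPI_Isend(&sendbuf[i * count], count, MPI_DOUBLE,
                  neighbours[i], 0, comm, &reqs[nneigh + i]);

    /* 3. Complete the whole exchange */
    MPI_Waitall(2 * nneigh, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}
```

On the Cray XT MPT of that era, environment variables such as MPICH_UNEX_BUFFER_SIZE and MPICH_MAX_SHORT_MSG_SIZE were the usual knobs for the unexpected-message path; the exact settings used for SBLI are not given here.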

 Lattice-Boltzmann code by Dr Kevin Stratford (EPCC) and Prof. Mike Cates (University of Edinburgh – Dept of Physics)
 Scaling very well up to 216,000 cores
 Smaller global lattice than was run previously on Barcelona jaguarpf – really pushing the strong scaling

[Chart: Ludwig strong scaling on jaguarpf – parallel speedup versus cores (up to 250,000); reduced-halo code compared against perfect scaling]

HiGEM

 Dr Len Shaffrey, University of Reading
 High-resolution coupled climate model
 The CoE is collaborating with NCAS researchers to refine UM I/O performance
  I/O servers now overlap communication and computation (see the sketch below)
 First stages complete
  Currently applied to a single version of the Unified Model
  Requires a moderate effort to back-port the code to earlier versions of the UM
  Large numbers of different versions of the UM are being used on HECToR
 The I/O server technique has been employed in other applications as well – HELIUM, CARP
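
A minimal sketch of the I/O-server idea referred to above (illustrative only – the split policy, the number of server ranks, and all names are assumptions, not the actual Unified Model implementation):

```c
#include <mpi.h>

/* Split MPI_COMM_WORLD into compute ranks and a few dedicated I/O
 * server ranks; compute ranks hand output to a server and carry on. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nio = 4;                    /* hypothetical: 4 I/O servers */
    int is_io = (rank >= size - nio);     /* last nio ranks become servers */

    /* Separate communicator for each role */
    MPI_Comm role_comm;
    MPI_Comm_split(MPI_COMM_WORLD, is_io, rank, &role_comm);

    if (is_io) {
        /* I/O server loop: receive buffers from compute ranks, write them
         * to disk, repeat until told to shut down. */
    } else {
        /* Compute rank: post a non-blocking send of the output buffer to
         * its I/O server, then return straight to computation. */
    }

    MPI_Comm_free(&role_comm);
    MPI_Finalize();
    return 0;
}
```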

 Dr Adrian Jenkins, Prof. Richard Bower, Dr Tom Theuns – Durham University
 Dr Robert Crain – Swinburne University
 The CoE is collaborating with Institute for Computational Cosmology (ICC) researchers to better understand and improve the scaling of GADGET3
 The code exhibits significant load imbalance with current datasets at small core counts
  50% of the time is spent waiting at MPI collectives on 64 cores
 The CoE is looking to improve the load balancing in the code
 Performance characterization of the new hybrid MPI/pthreads code (see the sketch below)

[Chart: GADGET3 scaling on Cray XT6 and Cray XE6 – cost (time(s) * cores) versus cores (up to ~2000), comparing pure-MPI and hybrid MPI/pthreads runs on each system]
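
A minimal sketch of the hybrid MPI/pthreads structure mentioned above (illustrative only – not GADGET3 source; the thread count and work split are hypothetical). One MPI rank spawns a team of POSIX threads for node-local work while only the main thread makes MPI calls:

```c
#include <mpi.h>
#include <pthread.h>

#define NTHREADS 6            /* hypothetical: one thread per die */

static void *worker(void *arg)
{
    int tid = *(int *)arg;
    (void)tid;
    /* ... this thread's share of the node-local computation ... */
    return NULL;
}

int main(int argc, char **argv)
{
    /* FUNNELED: only the main thread will make MPI calls */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    pthread_t threads[NTHREADS];
    int ids[NTHREADS];
    for (int i = 0; i < NTHREADS; ++i) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < NTHREADS; ++i)
        pthread_join(threads[i], NULL);

    /* Inter-node communication involves fewer, larger ranks, which keeps
     * collective sizes (and the waiting noted above) down. */
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```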

 Collaboration with Prof. Ken Taylor, Queen's University Belfast
 CoE work involved implementation of an I/O server
 Work reported at the Cray User Group in Edinburgh, May 2010: "Optimising Application Performance using I/O Servers on Cray XT Technology"

[Chart: HELIUM I/O performance – time (s) versus cores (100 to 10,000), with and without I/O servers]

 Use of the I/O server allows for full overlap of I/O with computation and communication
 I/O time is hidden as I/O requests are offloaded to dedicated I/O servers (compute-rank side sketched below)

 Both of the 800 nm runs were successful.
 Both required 16,110 cores – near the maximum available on HECToR.
 The results confirmed present understanding of the high-energy cutoff law.
 The results were consistent with results at 390 nm, 248 nm, and 195 nm.

 Calculations at higher intensities are not really feasible on 16,110 cores of HECToR, and remain very much a grand challenge.

Gemini

 Supports 2 nodes per ASIC
 168 GB/sec routing capacity
 Scales to over 100,000 network endpoints
 Link-level reliability and adaptive routing
 Advanced resiliency features
 Provides global address space
 Advanced NIC designed to efficiently support
  MPI
  One-sided MPI
  Shmem
  UPC, Coarray Fortran

[Diagram: Gemini ASIC – two HyperTransport 3 interfaces, NIC 0 and NIC 1, Netlink block, SB, L0 processor, and a 48-port YARC router]

Cray XE6 Node Characteristics
  Number of cores: 16 or 24 (MC)
  Peak performance: 210 Gflops
  Memory size: 32 or 64 GB
  Memory bandwidth: 85 GB/s

Gemini interconnect
  10 12X Gemini channels; each Gemini acts like two nodes on the 3D torus
  High-radix YARC router with adaptive routing, 168 GB/sec capacity

[Diagram: 3D torus axes X, Y, Z – module with SeaStar compared with module with Gemini]

 Like SeaStar, Gemini has a DMA offload engine allowing large transfers to proceed asynchronously

 Gemini provides low-overhead OS-bypass features for short transfers
 MPI latency around 1.4 µs in the current release
 NIC provides for many millions of MPI messages per second – 20 times better than SeaStar
 Much improved one-sided communication mechanism with hardware support (see the sketch below)
 AMOs provide a faster synchronization method for barriers
 Gemini supports adaptive routing, which
  Reduces problems with network hot spots
  Allows MPI to survive link failures
 Much improved injection bandwidth – 6 GB/s user data injection with an HT3 link
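
The kind of one-sided transfer that Gemini accelerates, sketched with standard MPI-2 RMA calls (illustrative; the window contents and values are made up – run with at least two ranks):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank exposes one double through an RMA window */
    double local = (double)rank;
    MPI_Win win;
    MPI_Win_create(&local, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        /* Rank 0 writes into rank 1's memory; rank 1 posts no receive */
        double value = 42.0;
        MPI_Put(&value, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);

    if (rank == 1)
        printf("rank 1 now holds %.1f\n", local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```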

 Globally addressable memory provides efficient support for UPC, Coarray Fortran, Shmem and Global Arrays
  Cray compiler targets this capability directly

 Pipelined global loads and stores
  Allows for fast irregular communication patterns

 Atomic memory operations
  Provide the fast synchronization needed for one-sided communication models (see the sketch below)
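
A minimal sketch of what a hardware-backed atomic memory operation is used for, written with the OpenSHMEM C API (modern OpenSHMEM 1.4 naming – the Cray SHMEM of this era spelt the call shmem_long_finc; the counter and its use are illustrative):

```c
#include <shmem.h>
#include <stdio.h>

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* The counter lives in the symmetric heap, so every PE can reach
     * the copy on PE 0 with a remote atomic. */
    long *counter = shmem_malloc(sizeof(long));
    *counter = 0;
    shmem_barrier_all();

    /* Every PE atomically increments PE 0's counter and gets a ticket */
    long ticket = shmem_long_atomic_fetch_inc(counter, 0);
    printf("PE %d got ticket %ld\n", me, ticket);

    shmem_barrier_all();
    if (me == 0)
        printf("PE 0: counter = %ld (expected %d)\n", *counter, npes);

    shmem_free(counter);
    shmem_finalize();
    return 0;
}
```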

[Chart: SBLI T1 dataset, Cray XE6 versus Cray XT6 – performance (Mgrid points * iterations/sec) against cores (up to 2500)]

New Cray Centre of Excellence for HECToR website

 http://www.hector.ac.uk/coe/

Cray Exascale Research Initiative Europe

 Launched December 2009
 Initial research partners are EPCC and CSCS
 Cray increased the Edinburgh-based team with Alistair Hart and Harvey Richardson
 Exploring how real applications can exploit future exascale architectures, specifically:
  Programming models for PGAS languages
  Programming GPU accelerators
  Improved algorithms for FFTs
  Network and I/O profiling
 Strong interactions with Cray R&D

Training and Conferences

 CUG 2010, 24-27 May
  Using I/O servers to improve performance on Cray XT technology – Tom Edwards and Kevin Roy
 NAIS HPC Workshop, 23-24 June
  PGAS Programming with UPC and Fortran coarrays – Jason Beech-Brandt and Harvey Richardson
 DEISA Training Course, 14-16 September
  Performance Optimization on Cray XT Systems
  Coarray Fortran Tutorial
 Hybrid Manycore Programming and Accelerators in High Performance Computing, 22 October
  Challenges in Programming Multicore and Accelerated Nodes – John Levesque and Alistair Hart
 SC10
  Poster – Performance Tuning of Scientific Applications on HPC Systems
  Poster – GPU Acceleration of Scientific Applications
  Tutorial – Introduction to Coarray Fortran
 EPCC MSc project supervision
  Global Arrays and NWChem Tuning on the Cray XT – Jason Beech-Brandt
  Co-arrays for sparse matrix methods – Tom Edwards
 Visits to NAG, Daresbury, and various user sites (Reading, Southampton, Durham, Cambridge amongst others)

 Questions?

 Jason Beech-Brandt: [email protected]
 Tom Edwards: [email protected]
 Kevin Roy: [email protected]
