Application Performance on Multi-core Processors

Scaling, Throughput and an Historical Perspective

M.F. Guest≠, C.A. Kitchen≠, M. Foster† and D. Cho§

≠ Cardiff University, † Atos, § Mellanox Technologies

Outline

I. Performance Benchmarks and Cluster Systems
   a. Synthetic code performance: STREAM and IMB
   b. Application code performance: DL_POLY, GROMACS, AMBER, GAMESS-UK, VASP and Quantum Espresso
   c. Interconnect performance: Intel MPI and Mellanox's HPCX
   d. Processor family and interconnect – "core to core" and "node to node" benchmarks
II. Impact of Environmental Issues in Cluster Acceptance Tests
   a. Security patches, turbo mode and throughput testing
III. Performance profile of DL_POLY and GAMESS-UK over the past two decades
IV. Acknowledgements and Summary

Contents

I. Review of parallel application performance featuring synthetics and end-user applications across a variety of clusters
   ¤ End-user codes – DL_POLY, GROMACS, AMBER, NAMD, LAMMPS, GAMESS-UK, Quantum Espresso, VASP, CP2K, ONETEP & OpenFOAM
   • Ongoing focus on Intel's Xeon Scalable processors ("Skylake") and AMD's Naples EPYC processor, plus NVIDIA GPUs, including:
   ¤ Clusters with dual-socket nodes – Intel Xeon Gold 6148 (20c, 27.5 MB cache, 2.40 GHz) & Xeon Gold 6138 (20c, 27.5 MB cache, 2.00 GHz) + AMD Naples EPYC 7551 (2.00 GHz) & EPYC 7601 (2.20 GHz) CPUs
   ¤ Updated review of the Intel MPI and Mellanox HPCX performance analysis
II. How these benchmarks have been deployed in the framework of procurement and acceptance testing, dealing with a variety of issues, e.g. (a) security patches, turbo mode etc. and (b) throughput testing.
III. An historical perspective of two of these codes – DL_POLY and GAMESS-UK – briefly overviewing the development and performance profile of both over the past two decades.

The Xeon Skylake Architecture

• The architecture of Skylake is very different from that of the prior "Haswell" and "Broadwell" Xeon chips.
• Three basic variants now cover what were formerly the Xeon E5 and Xeon E7 product lines, with Intel converging the Xeon E5 and E7 chips into a single socket.
• Product segmentation – Platinum, Gold, Silver & Bronze – with 51 variants of the SP chip.
• Also custom versions requested by hyperscale and OEM customers.
• All of these chips differ from each other in a number of ways, including number of cores, clock speed, L3 cache capacity, number and speed of UltraPath links between sockets, number of sockets supported, main memory capacity, width of the AVX vector units etc.
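A quick way to see which of these features a given node actually exposes (sockets, cores, threads and AVX level) is the standard Linux lscpu tool; a minimal sketch, assuming a Linux compute node:

  #!/bin/bash
  # Report socket/core/thread topology of the node.
  lscpu | grep -E 'Model name|Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core'
  # List the AVX variants advertised in the CPU flags (avx, avx2, avx512f, ...).
  lscpu | grep -o 'avx[^ ]*' | sort -u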

Intel Xeon: Westmere – Skylake

Per-socket comparison, in the order Xeon 5600 "Westmere-EP"; Xeon E5-2600 "Sandy Bridge-EP"; Xeon E5-2600 v4 "Broadwell-EP"; Intel Xeon Scalable Processor "Skylake":

Cores / Threads: up to 6 / 12; up to 8 / 16; up to 22 / 44; up to 28 / 56
Last-level cache: 12 MB; up to 20 MB; up to 55 MB; up to 38.5 MB (non-inclusive)
Max memory channels / speed per socket: 3 × DDR3 @ 1333 MHz; 4 × DDR3 @ 1600 MHz; 4 channels of up to 3 RDIMMs, LRDIMMs or 3DS LRDIMMs @ 2400 MHz; 6 channels of up to 2 RDIMMs, LRDIMMs or 3DS LRDIMMs @ 2666 MHz
New instructions: AES-NI; AVX 1.0 (8 DP FLOPs/clock); AVX 2.0 (16 DP FLOPs/clock); AVX-512 (32 DP FLOPs/clock)
QPI / UPI speed: 1 QPI channel @ 6.4 GT/s; 2 QPI channels @ 8.0 GT/s; 2 QPI channels @ 9.6 GT/s; up to 3 UPI @ 10.4 GT/s
PCIe lanes / controllers / speed: 36 lanes PCIe 2.0 on chipset; 40 lanes per socket, integrated PCIe 3.0; 40 / 10 / PCIe 3.0 (2.5, 5, 8 GT/s); 48 / 12 / PCIe 3.0 (2.5, 5, 8 GT/s)
Server / Workstation TDP: up to 130 W; Server 130 W, Workstation 150 W; 55 – 145 W; 70 – 205 W

AMD® EPYC™ 7000 Series – SKU Map and FLOP/cycle

EPYC 7000 SKU map:
  SKU:                 7601   7551   7501      7451   7401      7351      7301
  Freq (base, GHz):    2.2    2.0    2.0       2.3    2.0       2.4       2.2
  Turbo, all cores:    2.7    2.6    2.6       2.9    2.8       2.9       2.7
  Turbo, one core:     3.2    3.0    3.0       3.2    3.0       2.9       2.7
  Cores/socket:        32     32     32        24     24        16        16
  TDP (W):             180    180    155/170   180    155/170   155/170   155/170
All SKUs: 64 MB L3 cache, 8 memory channels at 2667 MT/s.

FLOP/cycle comparison (DP):
  Architecture:              Sandy Bridge        Haswell     Skylake     EPYC
  ISA*:                      AVX                 AVX2        AVX-512     AVX2
  Ops/cycle:                 2 (1 ADD, 1 MUL)    4 (2 FMA)   4 (2 FMA)   4 (2 ADD, 2 MUL)
  Vector size (DP words):    4                   4           8           2
  FLOP/cycle:                8                   16          32          8
  * Instruction Set Architecture

The AMD EPYC only supports 2 × 128-bit AVX natively, so there is a large gap with Intel SKL and its 2 × 512-bit FMAs. Thus the FP peak on AMD is 4 × lower than on Intel SKL.

EPYC Architecture – Naples, Zeppelin & CCX

[Block diagram: a Naples package – four Zeppelin dies, each with two CCXs (Zen cores with private L1/L2 and an 8 MB shared L3), two DDR4 channels and 2 × 16 PCIe lanes per die, linked by coherent Infinity Fabric links.]

• Zen cores
  ¤ Private L1/L2 cache
• CCX
  ¤ 4 Zen cores (or fewer)
  ¤ 8 MB shared L3 cache
• Zeppelin
  ¤ 2 CCX (or fewer)
  ¤ 2 DDR4 channels
  ¤ 2 × PCIe x16
• Naples
  ¤ 4 Zeppelin SoC dies fully connected by Infinity Fabric
  ¤ 4 NUMA nodes!
• Delivers 32 cores / 64 threads, 16 MB L2 cache and 64 MB L3 cache per socket.
• The design also means that there are four NUMA nodes per socket, or eight NUMA nodes in a dual-socket system, i.e. different memory latencies depending on whether a die needs data from memory attached to that die or to another die on the fabric.
• The key difference with Intel's Skylake SP architecture is that AMD needs to go off-die within the same socket, whereas Intel stays on a single piece of silicon.
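This NUMA layout can be verified directly on a node; a minimal sketch using standard Linux tools (assuming numactl and hwloc are installed on the system):

  #!/bin/bash
  # Show the NUMA nodes, the cores and memory local to each, and the node distance matrix.
  numactl --hardware
  # Optional: text rendering of the package/CCX/cache hierarchy via hwloc.
  lstopo-no-graphics --no-io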

Intel Skylake and AMD EPYC Cluster Systems

Cluster / Configuration

"Hawk" – Supercomputing Wales cluster at Cardiff comprising 201 nodes, totalling 8,040 cores and 46.08 TB total memory
• CPU: 2 × Intel Xeon Gold 6148 @ 2.40 GHz with 20 cores each; RAM: 192 GB (384 GB on high-memory and GPU nodes); GPU: 26 × NVIDIA P100 with 16 GB of RAM on 13 nodes.

"Helios" – 32-node HPC Advisory Council cluster running SLURM
• Supermicro SYS-6029U-TR4 / Foxconn Groot 1A42USF00-600-G 32-node cluster; dual-socket Intel Xeon Gold 6138 @ 2.00 GHz
• Mellanox ConnectX-5 EDR 100 Gb/s InfiniBand/VPI adapters with Socket Direct; Mellanox Switch-IB 2 SB7800 36-port 100 Gb/s EDR InfiniBand switches
• Memory: 192 GB DDR4 2667 MHz RDIMMs per node

20-node Bull|ATOS AMD EPYC cluster running SLURM: AMD EPYC 7551; 32 CPU cores / 64 threads per socket; Base Clock: 2.0 GHz; Max Boost Clock: 3.0 GHz; Default TDP: 180 W; Mellanox EDR 100 Gb/s

32-node Dell|EMC PowerEdge R7425 AMD EPYC cluster running SLURM: AMD EPYC 7601; 32 CPU cores / 64 threads per socket; Base Clock: 2.2 GHz; Max Boost Clock: 3.2 GHz; Default TDP: 180 W; Mellanox EDR 100 Gb/s

Baseline Cluster Systems

Cluster / Configuration

Intel Sandy Bridge Clusters

"Raven" – 128 × Bull|ATOS b510 EP nodes, each with 2 Intel Sandy Bridge E5-2670 (2.6 GHz), with Mellanox QDR InfiniBand.

Supercomputing Wales – 384 × Fujitsu CX250 EP nodes, each with 2 Intel Sandy Bridge E5-2670 (2.6 GHz), with Mellanox QDR InfiniBand.

Intel Broadwell Clusters

Dell PE R730/R630, Broadwell E5-2697A v4 2.6 GHz 16C – HPC Advisory Council "Thor" cluster: Dell PowerEdge R730/R630 36-node cluster; 2 × Xeon E5-2697A v4 @ 2.6 GHz, 16 cores, 145 W TDP, 40 MB cache; 256 GB DDR4 2400 MHz; Interconnect: ConnectX-4 EDR

ATOS Broadwell E5-2680 v4 2.4 GHz 16C – 32-node cluster; node config: 2 × Xeon E5-2680 v4 @ 2.4 GHz, 16 cores, 145 W TDP, 40 MB cache; 128 GB DDR4 2400 MHz; Interconnect: Mellanox ConnectX-4 EDR and Intel OPA

IBM Power 8 S822LC

IBM Power 8 S822LC with Mellanox EDR – 20 cores, 3.49 GHz with performance CPU governor; 256 GB memory; 1 × IB (EDR) port; 2 × NVIDIA K80 GPU; IBM PE (Parallel Environment); Operating system: RHEL 7.2 LE; Compilers: xlC 13.1.3, xlf 15.1.3, gcc 4.8.5 (Red Hat), gcc 5.2.1 (from IBM Advance Toolchain 9.0)

The Performance Benchmarks

• The test suite comprises both synthetics and end-user applications. Synthetics include HPCC (http://icl.cs.utk.edu/hpcc/) and the IMB benchmarks (http://software.intel.com/en-us/articles/intel-mpi-benchmarks), IOR and STREAM.
• A variety of "open source" and commercial end-user application codes:
  GROMACS, LAMMPS, AMBER, NAMD, DL_POLY classic & DL_POLY 4

Quantum Espresso, Siesta, CP2K, ONETEP, CASTEP and VASP (ab initio Materials properties)

NWChem, GAMESS-US and GAMESS-UK (molecular electronic structure)

• These stress various aspects of the architectures under consideration and should provide a level of insight into why particular levels of performance are observed, e.g. memory bandwidth and latency, node floating-point performance, interconnect performance (both latency and bandwidth) and sustained I/O performance.

EPYC – Compiler and Run-time Options

STREAM (Atos clusters):
  module load AMD/amd-cputype/1.0
  icc -o stream.x stream.c -DSTATIC -Ofast -xCORE-AVX2 -qopenmp \
      -DSTREAM_ARRAY_SIZE=800000000 -mcmodel=large -shared-intel
  export OMP_NUM_THREADS=16
  export OMP_PROC_BIND=true
  export OMP_PLACES="{0:4:1}:16:4"   # 1 thread per CCX
  export OMP_DISPLAY_ENV=true

STREAM (Dell|EMC EPYC):
  export OMP_NUM_THREADS=32
  export OMP_PROC_BIND=true
  export OMP_DISPLAY_ENV=true
  export OMP_PLACES="{0},{16},{8},{24},{2},{18},{10},{26},{4},{20},{12},{28},{6},{22},{14},{30},{1},{17},{9},{25},{3},{19},{11},{27},{5},{21},{13},{29},{7},{23},{15},{31}"

# Compilation: Intel compilers 2018, Intel MPI 2017 Update 3, FFTW 3.3.5
  Intel SKL: -O3 -xCORE-AVX512
  AMD EPYC:  -O3 -xAVX2   (or -axCORE-AVX-I)

# Preload the amd-cputype library to navigate the Intel "Genuine CPU" test
  module use /opt/amd/modulefiles
  module load AMD/amd-cputype/1.0
  export LD_PRELOAD=$AMD_CPUTYPE_LIB
  export OMP_PROC_BIND=true
  # export KMP_AFFINITY=granularity=fine
  export I_MPI_DEBUG=5
  export MKL_DEBUG_CPU_TYPE=5
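Pulled together, a minimal build-and-run sketch for STREAM on one of the EPYC nodes, reusing the module names and flags above (paths are assumptions tied to the test clusters):

  #!/bin/bash
  # Build STREAM with the Intel compiler using AVX2 code paths on EPYC.
  module use /opt/amd/modulefiles
  module load AMD/amd-cputype/1.0
  icc -o stream.x stream.c -DSTATIC -Ofast -xCORE-AVX2 -qopenmp \
      -DSTREAM_ARRAY_SIZE=800000000 -mcmodel=large -shared-intel
  # Run with one OpenMP thread per core, threads bound, placement reported.
  export LD_PRELOAD=$AMD_CPUTYPE_LIB     # bypass the "Genuine Intel" CPU check
  export OMP_NUM_THREADS=32
  export OMP_PROC_BIND=true
  export OMP_DISPLAY_ENV=true
  ./stream.x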

Memory B/W – STREAM Performance

[Bar chart: STREAM TRIAD rate (MB/s per node), OMP_NUM_THREADS with KMP_AFFINITY=physical, across the Sandy Bridge ("Raven"), Ivy Bridge, Haswell, Broadwell (Thor, ATOS), Skylake Gold 6138/6142/6148 ("Hawk"), IBM Power8 S822LC and AMD EPYC 7551/7601 systems. Measured rates span ~74,300 to ~303,800 MB/s.]

Memory B/W – STREAM / Core Performance

[Bar chart: STREAM TRIAD rate per core (MB/s) for the same systems and thread placement; per-core rates span ~4,100 to ~9,200 MB/s.]

MPI Performance – PingPong

[Chart: IMB (Intel) PingPong bandwidth (MB/s) and latency vs. message length, 1 PE per node, across EPYC 7601 EDR, SKL Gold 6148/6150, IBM Power8, Broadwell EDR/OPA, Haswell/Ivy Bridge Connect-IB, Azure A9 IB RDMA and an older Xeon E5472 QC + IB (mvapich2 1.4) system. Peak bandwidths reach ~11,466 MB/s, with latency annotations of 1.7 and 3.8 µs. Note: export I_MPI_DAPL_TRANSLATION_CACHE=1 enables the memory-resident cache feature in DAPL.]

MPI Collectives – Alltoallv (128 PEs)

[Chart: IMB Alltoallv measured time (µsec) vs. message length at 128 PEs across the Sandy Bridge, Broadwell, Skylake ("Helios", "Hawk") and AMD EPYC clusters. The time-consuming messages are those called by Alltoall & Alltoallv (IPM); EPYC latency with Intel MPI is ~4-6 × worse than that seen with SKL processors.]

Application Performance on Multi-core Processors

I.1 THE CODES: DL_POLY, GROMACS, NAMD, LAMMPS, GAMESS, NWChem, GAMESS-UK, ONETEP, VASP, SIESTA, CASTEP, Quantum Espresso, CP2K – on a variety of HPC systems.

Allinea (ARM) Performance Reports
Allinea Performance Reports provides a mechanism to characterize and understand the performance of HPC application runs through a single-page HTML report.
• Based on Allinea MAP's adaptive sampling technology, which keeps the data volumes collected and the application overhead low.
• Modest application slowdown (ca. 5%) even with 1000s of MPI processes.
• Runs on existing codes: a single command added to execution scripts.
• If submitted through a batch queuing system, the submission script is modified to load the Allinea module and add the 'perf-report' command in front of the required mpiexec command:
  perf-report mpiexec -n 4 $code
• A Report Summary characterizes how the application's wallclock time was spent, broken down into CPU, MPI and I/O.
• All examples updated on the Broadwell Mellanox cluster (E5-2697A v4).

Molecular Simulation I. DL_POLY

Molecular dynamics codes: AMBER, DL_POLY, CHARMM, NAMD, LAMMPS, GROMACS etc.

DL_POLY
• Developed as the CCP5 parallel MD code by W. Smith, T.R. Forester and I. Todorov
• UK CCP5 + international user community
• DL_POLY Classic (replicated data) and DL_POLY 3 & 4 (distributed data – domain decomposition)
• Areas of application: liquids, solutions, spectroscopy, ionic solids, molecular crystals, polymers, glasses, membranes, proteins, metals, solid and liquid interfaces, catalysis, clathrates, liquid crystals, biopolymers, polymer electrolytes.

DL_POLY Classic – NaCl Simulation: Performance Data (32-256 PEs)

[Bar chart: DL_POLY Classic NaCl benchmark (27,000 atoms, 500 time steps); performance at 32, 64, 128 and 256 PEs relative to the Fujitsu HTC X5650 2.67 GHz 6-C (16 PEs), comparing the Fujitsu CX250 Sandy Bridge QDR, Intel Broadwell E5-2690 v4 OPA, "Helios" SKL Gold 6138 EDR, "Hawk" Atos SKL Gold 6148 EDR, Dell SKL Gold 6150 EDR, ATOS AMD EPYC 7551 EDR and Dell|EMC AMD EPYC 7601 EDR clusters. Relative performance values range from ~3.5 to ~19.2.]

DL_POLY 4 – Distributed Data

Domain decomposition – distributed data:
• Distribute atoms and forces across the nodes
  ¤ More memory efficient; can address much larger cases (10^5 – 10^7 atoms)
• SHAKE and short-range forces require only neighbour communication
  ¤ Communications scale linearly with the number of nodes
• Coulombic energy remains global
  ¤ Adopt the Smooth Particle Mesh Ewald scheme
  • includes a Fourier transform of the smoothed charge density (reciprocal-space grid typically 64x64x64 – 128x128x128)

W. Smith and I. Todorov

Benchmarks
1. NaCl simulation: 216,000 ions, 200 time steps, cutoff = 12 Å
2. Gramicidin in water; rigid bonds + SHAKE: 792,960 ions, 50 time steps

http://www.scd.stfc.ac.uk//research/app/ccg/software/DL_POLY/44516.aspx
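For reference, a minimal sketch of how one of these DL_POLY 4 benchmark runs is typically launched under SLURM (the module name is an assumption; DL_POLY 4 conventionally reads CONTROL, CONFIG and FIELD from the working directory and writes timings to OUTPUT):

  #!/bin/bash
  #SBATCH --nodes=4
  #SBATCH --ntasks-per-node=32
  #SBATCH --time=01:00:00
  module load dlpoly/4.08            # hypothetical module name
  cd $SLURM_SUBMIT_DIR               # directory containing CONTROL, CONFIG and FIELD
  # One MPI rank per core.
  mpirun -np $SLURM_NTASKS DLPOLY.Z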

DL_POLY 4 – Gramicidin Simulation: Performance Data (64-256 PEs)

[Bar chart: DL_POLY 4 Gramicidin benchmark (792,960 atoms, 50 time steps); performance at 64, 128 and 256 PEs relative to the Fujitsu CX250 E5-2670 2.6 GHz 8-C (32 PEs), comparing the Sandy Bridge, Broadwell (Bull|ATOS, Thor) and Skylake (Gold 6130, 6138 "Helios", 6142, 6148 Intel OPA and "Hawk" EDR, 6150, Platinum 8170) clusters. Relative performance ranges from ~1.8 to ~9.5; SKL 6142 2.6 GHz is ~1.06 × faster than E5-2697 v4 2.6 GHz.]

DL_POLY 4 – Gramicidin Simulation – EPYC

Performance Relative to the Fujitsu CX250 e5-2670/ 2.6 GHz 8-C (32 PEs)

[Bar chart: as above, with the ATOS AMD EPYC 7551 2.0 GHz (T) EDR and Dell|EMC AMD EPYC 7601 2.2 GHz (T) EDR clusters added; Gramicidin, 792,960 atoms, 50 time steps. Relative performance ranges from ~1.7 to ~8.4.]

DL_POLY 4 – Gramicidin Simulation Performance Report

[Radar charts: Allinea Performance Report breakdown for DL_POLY 4 Gramicidin at 32, 64, 128 and 256 PEs – total wallclock time split into CPU (%) and MPI (%), with CPU time broken down into scalar numeric ops (%), vector numeric ops (%) and memory accesses (%); the Smooth Particle Mesh Ewald scheme is highlighted in the breakdown.]

"DL_POLY_4 and Xeon Phi: Lessons Learnt", Alin Marin Elena, Christian Lalanne, Victor Gamayunov, Gilles Civario, Michael Lysaght and Ilian Todorov.

Molecular Simulation II: GROMACS

GROMACS (GROningen MAchine for Chemical Simulations) is a molecular dynamics package designed for simulations of proteins, lipids and nucleic acids [University of Groningen] . • Single and Double Precision • Efficient GPU Implementations

Versions under test:
• Version 4.6.1 – 5 March 2013
• Version 5.0.7 – 14 October 2015
• Version 2016.3 – 14 March 2017
• Version 2018.2 – 14 June 2018 (optimised for "Hawk" by Ade Fewings)

Berk Hess et al., "GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation", Journal of Chemical Theory and Computation 4 (3): 435-447. http://manual.gromacs.org/documentation/

GROMACS Benchmark Cases

Ion channel system
• The 142k-particle ion channel system is the membrane protein GluCl – a pentameric chloride channel embedded in a DOPC membrane and solvated in TIP3P water, using the Amber ff99SB-ILDN force field. This system is a challenging parallelization case due to its small size, but is one of the most wanted target sizes for biomolecular simulations.

Lignocellulose
• GROMACS Test Case B from the UEA Benchmark Suite: a model of cellulose and lignocellulosic biomass in an aqueous solution. This system of 3.3M atoms is inhomogeneous and uses reaction-field electrostatics instead of PME, and therefore should scale well.
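A minimal sketch of how these benchmarks are typically run with an MPI build of GROMACS (module name, input file name and job geometry are assumptions; the .tpr files come from the benchmark distributions):

  #!/bin/bash
  #SBATCH --nodes=4
  #SBATCH --ntasks-per-node=40        # 2 x 20-core Skylake Gold 6148 node, one rank per core
  module load gromacs/2018.2          # hypothetical module name
  mpirun -np $SLURM_NTASKS gmx_mpi mdrun \
      -s ion_channel.tpr -deffnm ion_channel \
      -maxh 0.5 -resethway            # cap wallclock; reset timers halfway for clean ns/day figures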

GROMACS – Ion-channel Performance Report

[Radar charts: Allinea Performance Report breakdown for the GROMACS ion-channel benchmark at 32, 64, 128 and 256 PEs – total wallclock time split into CPU (%) and MPI (%), with CPU time broken down into scalar numeric ops (%), vector numeric ops (%) and memory accesses (%).]

GROMACS – Ion Channel Simulation, Single Precision

[Bar chart: GROMACS ion-channel benchmark (142k-particle system), performance in ns/day on the "Hawk" Atos cluster (SKL Gold 6148 2.4 GHz (T) nodes, EDR interconnect, plus dual-P100 GPU nodes) at 64, 128 and 256 PEs, comparing GROMACS 4.6.1, 5.0, 2016.3 (single precision, AVX) and 2018.2. Throughput rises from ~45 ns/day to ~192 ns/day.]

Ion Channel Simulation – Impact of Single Precision

[Bar chart: GROMACS 5.0.7, 142k-particle ion-channel system, performance in ns/day at 64, 128 and 256 PEs. Double-precision results on the Sandy Bridge, Broadwell (Thor HPCX) and Skylake (Gold 6130, 6142, 6148, 6150) systems are compared with single-precision {S} runs on the "Helios" SKL Gold 6138, Intel SKL Gold 6148 OPA and "Hawk" Atos SKL Gold 6148 EDR systems. Throughput ranges from ~20.6 to ~151.5 ns/day, with the single-precision {S} runs delivering the highest rates.]

GROMACS – GPU Performance: Ion Channel Simulation

[Bar chart: GROMACS 2018.2, 142k-particle ion-channel system on the "Hawk" Atos cluster (SKL Gold 6148 2.4 GHz (T) nodes, EDR, plus dual-P100 GPU nodes). Relative performance at 64-320 CPU PEs and on 1, 2, 4 and 6 GPU nodes (2, 4, 8 and 12 GPUs) ranges from 1.0 up to ~3.5.]

GROMACS – Lignocellulose Simulation, Single Precision

[Bar chart: GROMACS lignocellulose benchmark (3,316,463-atom system, reaction-field electrostatics instead of PME), performance in ns/day on the "Hawk" Atos cluster (SKL Gold 6148 2.4 GHz (T), EDR) at 64, 128 and 256 PEs, comparing GROMACS 4.6.1, 5.0, 2016.3 (single precision, AVX) and 2018.2. Throughput rises from ~2.4 to ~13.5 ns/day.]

Lignocellulose Simulation – Impact of Single Precision

[Bar chart: GROMACS 5.0.7 lignocellulose benchmark (3,316,463 atoms, reaction-field electrostatics), performance in ns/day at 64, 128 and 256 PEs for the same Sandy Bridge, Broadwell and Skylake systems as the ion-channel comparison, with single-precision {S} runs on the "Helios" SKL Gold 6138, Intel SKL Gold 6148 and "Hawk" SKL Gold 6148 systems. Throughput ranges from ~0.9 to ~10.1 ns/day.]

GROMACS – GPU Performance: Lignocellulose Simulation

[Bar chart: GROMACS 2018.2, lignocellulose system (3,316,463 atoms, reaction-field electrostatics instead of PME) on the "Hawk" Atos cluster (SKL Gold 6148 2.4 GHz (T), EDR, plus dual-P100 GPU nodes). Relative performance at 64-320 CPU PEs and on 1, 2, 4 and 6 GPU nodes (2, 4, 8 and 12 GPUs) ranges from 1.0 up to ~7.1.]

Molecular Simulation III: The AMBER Benchmarks

• AMBER 16/1 is used, specifically PMEMD and GPU-accelerated PMEMD.
• M01 benchmark: Major Urinary Protein (MUP) + IBM ligand (21,736 atoms)
• M06 benchmark: cluster of six MUPs (134,013 atoms)
• M27 benchmark: cluster of 27 MUPs (657,585 atoms)
• M45 benchmark: cluster of 45 MUPs (932,751 atoms)

All test cases run 30,000 steps × 2 fs = 60 ps of simulation time; periodic boundary conditions, constant pressure, T = 300 K; position data written every 500 steps.

R. Salomon-Ferrer, D.A. Case, R.C. Walker, "An overview of the Amber biomolecular simulation package", WIREs Comput. Mol. Sci. 3, 198-210 (2013).
D.A. Case, T.E. Cheatham III, T. Darden, H. Gohlke, R. Luo, K.M. Merz Jr., A. Onufriev, C. Simmerling, B. Wang and R. Woods, "The Amber biomolecular simulation programs", J. Comput. Chem. 26, 1668-1688 (2005).
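A minimal sketch of launching one of these PMEMD benchmarks (module name and file names are assumptions; -i/-p/-c point at the benchmark's MD input, topology and coordinate files):

  #!/bin/bash
  #SBATCH --nodes=4
  #SBATCH --ntasks-per-node=40
  module load amber/16               # hypothetical module name
  # CPU run of the M45 case: 30,000 steps of 2 fs, NPT at 300 K.
  mpirun -np $SLURM_NTASKS pmemd.MPI -O \
      -i mdin.m45 -p m45.prmtop -c m45.inpcrd \
      -o mdout.m45 -x m45.nc
  # GPU-accelerated equivalent on a P100 node:
  #   pmemd.cuda -O -i mdin.m45 -p m45.prmtop -c m45.inpcrd -o mdout.m45 -x m45.nc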

AMBER – SKL vs. SNB: M06, M27 and M45

[Bar chart: AMBER M06, M27 and M45 benchmarks; relative performance of SKL Gold 6148 2.4 GHz // EDR vs. SNB E5-2670 2.6 GHz // QDR at 64-320 PEs. Speedup factors range from ~1.13 to ~1.70.]

AMBER – GPU Performance: M45 Simulation

[Bar chart: AMBER M45 benchmark (cluster of 45 MUPs + IBM ligand, 932,751 atoms); performance of the "Hawk" Atos cluster (SKL Gold 6148 2.4 GHz (T), EDR, plus dual-P100 GPU nodes) relative to 64 SNB E5-2670 PEs on "Raven", at 64-320 PEs and on the GPU nodes. Speedups range from ~1.36 up to ~4.21.]

GAMESS-UK – Moving to Distributed Data: The MPI/ScaLAPACK Implementation of the GAMESS-UK SCF/DFT Module

• Pragmatic approach to the replicated-data constraints:
  ¤ MPI-based tools (such as ScaLAPACK) used in place of Global Arrays
  ¤ All data structures except those required for the Fock matrix build (F, P) are fully distributed
• The partially distributed model was chosen because, in the absence of efficient one-sided communications, it is difficult to load balance a distributed Fock matrix build efficiently.
• Obvious drawback – some large replicated data structures are required.
  ¤ These are kept to a minimum. For a closed-shell HF or DFT calculation only 2 replicated matrices are required, 1 × Fock and 1 × density (doubled for UHF).

"The GAMESS-UK electronic structure package: algorithms, developments and applications", M.F. Guest, I.J. Bush, H.J.J. van Dam, P. Sherwood, J.M.H. Thomas, J.H. van Lenthe, R.W.A. Havenith, J. Kendrick, Mol. Phys. 103, No. 6-8, 2005, 719-747.

GAMESS-UK Performance – Zeolite Y Cluster

Performance Relative to the Fujitsu HTC X5650 2.67 GHz 6-C (128 PEs)

[Bar chart: GAMESS-UK DFT on the zeolite Y cluster SioSi7, DZVP (Si,O) / DZVP2 (H) basis, B3LYP (3,975 GTOs); performance at 128 and 256 PEs relative to the Fujitsu HTC X5650 2.67 GHz 6-C (128 PEs) for the Sandy Bridge, Haswell, Broadwell, Skylake and IBM Power8 systems in the legend. Relative performance ranges from ~1.1 to ~3.5; SKL 6142 2.6 GHz is ~1.05 × faster than E5-2697 v4 2.6 GHz.]

GAMESS-UK MPI/ScaLAPACK Code – EPYC Performance

Performance Relative to the Fujitsu HTC X5650 2.67 GHz 6-C (128 PEs)

[Bar chart: as above, with the ATOS AMD EPYC 7551 2.0 GHz (T) EDR and Dell|EMC AMD EPYC 7601 2.2 GHz (T) EDR clusters added; zeolite Y cluster SioSi7, DZVP (Si,O) / DZVP2 (H), B3LYP (3,975 GTOs), 128 and 256 PEs. Relative performance ranges from ~1.1 to ~3.5.]

GAMESS-UK.MPI DFT – Performance Report

[Radar charts: Allinea Performance Report breakdown for the GAMESS-UK.MPI DFT benchmark – cyclosporin, 6-31G** basis (1,855 GTOs), DFT B3LYP – at 32, 64, 128 and 256 PEs: total wallclock time split into CPU (%) and MPI (%), with CPU time broken down into scalar numeric ops (%), vector numeric ops (%) and memory accesses (%).]

Advanced Materials Software

Computational Materials

• VASP – performs ab initio QM molecular dynamics (MD) simulations using pseudopotentials or the projector-augmented wave method and a plane-wave basis set.
• Quantum Espresso – an integrated suite of open-source computer codes for electronic-structure calculations and materials modelling at the nanoscale. It is based on density-functional theory (DFT), plane waves and pseudopotentials.
• SIESTA – an O(N) DFT code for electronic-structure calculations and ab initio molecular dynamics simulations of molecules and solids. It uses norm-conserving pseudopotentials and a linear combination of numerical atomic orbitals (LCAO) basis set.
• CP2K – a program to perform atomistic and molecular simulations of solid-state, liquid, molecular and biological systems. It provides a framework for different methods, e.g. DFT using a mixed Gaussian & plane waves approach (GPW), and classical pair and many-body potentials.
• ONETEP (Order-N Electronic Total Energy Package) – a linear-scaling code for quantum-mechanical calculations based on DFT.

Quantum Espresso

Quantum Espresso is an integrated suite of open-source computer codes for electronic-structure calculations and materials modelling at the nanoscale. It is based on density-functional theory, plane waves and pseudopotentials. Capabilities: ground-state calculations; structural optimization; transition states and minimum-energy paths; ab initio molecular dynamics; response properties (DFPT); spectroscopic properties; quantum transport. The benchmarks mark the transition from v5.2 to v6.1.

Benchmark details:
• DEISA AU112 – Au complex (Au112), 2,158,381 G-vectors, 2 k-points, FFT dimensions (180, 90, 288)
• PRACE GRIR443 – carbon-iridium complex (C200Ir243), 2,233,063 G-vectors, 8 k-points, FFT dimensions (180, 180, 192)

Quantum Espresso – Au112

[Chart: Quantum Espresso v5.2, AU112 benchmark; performance at 32-320 PEs relative to the Fujitsu CX250 E5-2670 2.6 GHz 8-C (32 PEs), comparing the Sandy Bridge, Broadwell (Bull|ATOS, Thor EDR/OPA) and Skylake (Gold 6130, 6142, 6148) clusters. Relative performance reaches ~8.8 at the highest core counts.]

Quantum Espresso – Au112 Performance Report

[Radar charts: Allinea Performance Report breakdown for AU112 (Au complex Au112, 2,158,381 G-vectors, 2 k-points, FFT dimensions (180, 90, 288)) at 32, 64, 128 and 256 PEs: total wallclock time split into CPU (%) and MPI (%), with CPU time broken down into scalar numeric ops (%), vector numeric ops (%) and memory accesses (%).]

Parallelism in Quantum Espresso

• Quantum ESPRESSO implements several MPI parallelization levels, with processors organized in a hierarchy of groups identified by different MPI communicator levels. Group hierarchy:
• Images: processors are divided into different "images", each corresponding to a different SCF or linear-response calculation, loosely coupled to the others.
• Pools and bands: each image can be sub-partitioned into "pools", each taking care of a group of k-points. Each pool is sub-partitioned into "band groups", each taking care of a group of Kohn-Sham orbitals.
• PW parallelisation: orbitals in the PW basis set, as well as charges and densities in either reciprocal or real space, are distributed across processors. All linear-algebra operations on arrays of PW / real-space grids are automatically and effectively parallelized.
• Tasks: allows good parallelization of the 3D FFT when the number of CPUs exceeds the number of FFT planes; FFTs on Kohn-Sham states are redistributed to "task groups".

Parallelism in Quantum Espresso (continued)

• Linear-algebra group: a further level, independent of the PW or k-point parallelization, is the parallelization of the subspace diagonalization / iterative orthonormalization.
• About communications: images and pools are loosely coupled and CPUs communicate between different images and pools only once in a while, whereas CPUs within each pool are tightly coupled and communications are significant.
• Choosing parameters: to control the number of CPUs in each group, use the command-line switches -nimage, -npools, -nband, -ntg, -ndiag or -northo. Thus for Au112, the following command line is used:
  mpirun $code -inp ausurf.in -npool $NPOOL -ntg $NT -ndiag $ND
• This executes an energy calculation on $NP processors, with k-points distributed across $NPOOL pools of $NP/$NPOOL processors each; the 3D FFT is performed using $NT task groups, and the diagonalization of the subspace Hamiltonian is distributed over a square grid of $ND processors.
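As a concrete illustration, a hedged sketch of a 128-rank AU112 run with 2 k-point pools, 4 FFT task groups and an 8 × 8 diagonalization grid, assuming the standard pw.x executable (the specific values are illustrative, not the tuned settings from this study):

  #!/bin/bash
  #SBATCH --nodes=4
  #SBATCH --ntasks-per-node=32
  module load espresso/6.1          # hypothetical module name
  NP=$SLURM_NTASKS                  # 128 MPI ranks
  NPOOL=2                           # one pool per k-point group
  NT=4                              # 3D-FFT task groups
  ND=64                             # 8 x 8 grid for the subspace diagonalization
  mpirun -np $NP pw.x -inp ausurf.in -npool $NPOOL -ntg $NT -ndiag $ND > ausurf.out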

Impact of npool – Au112

[Chart: Quantum Espresso v5.2, AU112; relative performance at 64-320 PEs on "Hawk" and "Raven" with NPOOL=1 vs. NPOOL=2 (ND=nP). The NPOOL=2 (ND=nP) runs reach the highest values, ~7.7 on Hawk.]

Impact of npool – GRIR443

[Chart: as above for the GRIR443 benchmark; relative performance at 64-320 PEs, with the NPOOL=2 (ND=nP) runs again reaching the highest values, ~6.8 on Hawk.]

Quantum Espresso – GRIR443: Performance Data (96-160 PEs)

[Bar chart: Quantum Espresso GRIR443 benchmark; performance at 96, 128 and 160 PEs relative to the Fujitsu E5-2670 2.6 GHz 8-C (96 PEs), comparing the Fujitsu CX250 Sandy Bridge QDR, Bull|ATOS Broadwell E5-2680 v4 OPA, Thor Dell|EMC E5-2697A v4 EDR IMPI DAPL, Dell|EMC Skylake Gold 6130 OPA, 6142 EDR and 6150 EDR, and ATOS AMD EPYC 7601 EDR systems. Relative performance ranges from ~1.3 to ~5.2.]

VASP – Vienna Ab-initio Simulation Package

Pd-O Benchmark

VASP (5.4.4) performs ab initio QM molecular dynamics (MD) simulations using pseudopotentials or the projector-augmented wave method and a plane-wave basis set.

• Pd-O complex: Pd75O12, a 5 × 4 3-layer supercell, running a single-point calculation with a plane-wave cutoff of 400 eV. Uses the RMM-DIIS algorithm for the SCF and is calculated in real space.
  ¤ 10 k-points; maximum number of plane waves: 34,470
  ¤ FFT grid: NGX=31, NGY=49, NGZ=45, giving a total of 68,355 points

Zeolite benchmark
• Zeolite with the MFI-structure unit cell, running a single-point calculation with a plane-wave cutoff of 400 eV using the PBE functional.
  ¤ 2 k-points; maximum number of plane waves: 96,834
  ¤ FFT grid: NGX=65, NGY=65, NGZ=43, giving a total of 181,675 points

Benchmark details:
• MFI zeolite – Zeolite (Si96O192), 2 k-points, FFT grid (65, 65, 43); 181,675 points
• Palladium-oxygen complex – Pd-O complex (Pd75O12), 10 k-points, FFT grid (31, 49, 45), 68,355 points

VASP 5.4.4 – Pd-O Benchmark

Performance Relative to the Fujitsu CX250 Sandy Bridge e5-2670 2.6 GHz (32 PEs)

[Bar chart: VASP 5.4.4 Pd-O benchmark (Pd75O12, 8 k-points, FFT grid (31, 49, 45), 68,355 points); performance at 64, 128 and 256 PEs relative to the Fujitsu CX250 Sandy Bridge E5-2670 2.6 GHz (32 PEs), comparing the Sandy Bridge, Broadwell (Bull|ATOS, Thor HPCX) and Skylake (Gold 6130, 6138 "Helios" with and without HPCX 2.3.0, 6142, 6148 Intel OPA and "Hawk" EDR, 6150) clusters. Relative performance ranges from ~1.7 to ~6.8.]

VASP 5.4.4 – Pd-O Benchmark: Parallelisation on k-points

[Bar chart: VASP 5.4.4 Pd-O benchmark at 64, 128 and 256 PEs, relative to the Fujitsu CX250 Sandy Bridge E5-2670 2.6 GHz (32 PEs), adding k-point parallel (KPAR=2) runs on the "Helios" Mellanox SKL 6138, "Hawk" Atos SKL Gold 6148 and Dell|EMC AMD EPYC 7601 clusters to the Sandy Bridge, Broadwell and Skylake results. The best relative performance reaches ~9.1 with KPAR=2.]

k-point parallelisation settings used:
  NPEs   KPAR   NPAR
  64     2      2
  128    2      4
  256    2      8
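A hedged sketch of how these settings map onto the VASP input: KPAR and NPAR are set in the INCAR (values mirror the 128-PE row of the table above; the executable name may differ per build):

  #!/bin/bash
  # Append the parallelisation tags to the INCAR for a 128-rank run.
  cat >> INCAR <<'EOF'
  KPAR = 2      ! split the 128 ranks into 2 groups, each handling half the k-points
  NPAR = 4      ! bands distributed over 4 groups within each k-point group
  EOF
  mpirun -np 128 vasp_std > vasp.out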

Palladium-Oxygen complex (Pd75O12), 8 k-points, FFT grid (31, 49, 45), 68,355 points.

VASP – Pd-O Benchmark Performance Report

[Radar charts: Allinea Performance Report breakdown for the VASP Pd-O benchmark (Pd75O12, 8 k-points, FFT grid (31, 49, 45), 68,355 points) at 32, 64, 128 and 256 PEs: total wallclock time split into CPU (%) and MPI (%), with CPU time broken down into scalar numeric ops (%), vector numeric ops (%) and memory accesses (%).]

VASP 5.4.4 – Zeolite Benchmark

[Bar chart: VASP 5.4.4 zeolite benchmark – Zeolite (Si96O192) with the MFI-structure unit cell, single-point calculation, 400 eV plane-wave cutoff, PBE functional; 96,834 plane waves, 2 k-points, FFT grid (65, 65, 43), 181,675 points. Performance at 64, 128 and 256 PEs relative to the Fujitsu CX250 Sandy Bridge E5-2670 2.6 GHz (64 PEs), for the Sandy Bridge, Broadwell (Thor EDR/OPA) and Skylake (Gold 6130, 6138 "Helios", 6142, 6148 Intel OPA and "Hawk" EDR, 6150) systems. Relative performance ranges from ~1.0 to ~4.7.]

VASP 5.4.4 – Zeolite Benchmark: Parallelisation on k-points

[Bar chart: as above, adding KPAR=2 runs on the "Helios" Mellanox SKL 6138 and "Hawk" Atos SKL Gold 6148 clusters; relative performance again ranges from ~1.0 to ~4.7.]

VASP 5.4.1 – Zeolite Benchmark

[Bar chart: VASP 5.4.1 zeolite benchmark (Si96O192, MFI structure, 400 eV cutoff, PBE, 96,834 plane waves, 2 k-points, FFT grid (65, 65, 43), 181,675 points); performance at 64, 96 and 128 PEs relative to the Fujitsu CX250 Sandy Bridge E5-2670 2.6 GHz (64 PEs), comparing the Dell|EMC Skylake Gold 6130 2.1 GHz (T) OPA, Intel Skylake Gold 6148 2.4 GHz (T) OPA and ATOS AMD EPYC 7601 2.2 GHz (T) EDR (full socket and 16c/socket) systems. Relative performance ranges from ~0.9 to ~2.7.]

Application Performance on Multi-core Processors:

I.2. Selecting Fabrics and Optimising Performance:

Intel MPI and Mellanox HPCX

Selecting Fabrics – MPI Optimisation

• The Intel MPI Library can select a communication fabric at runtime without having to recompile the application. By default it automatically selects the most appropriate fabric based on both the software and hardware configuration, i.e. in most cases you do not have to select a fabric manually.
• Specifying a particular fabric can boost performance. Fabrics can be specified for both intra-node and inter-node communications. The following fabrics are available (fabric – network hardware and software used):
  shm – shared memory (for intra-node communication only)
  dapl – Direct Access Programming Library (DAPL) fabrics, such as InfiniBand (IB) and iWarp (through DAPL)
  ofa – OpenFabrics Alliance (OFA) fabrics, e.g. InfiniBand (through OFED verbs)
  tcp – TCP/IP network fabrics, such as Ethernet and InfiniBand (through IPoIB)
  tmi – Tag Matching Interface (TMI) fabrics, such as Intel True Scale Fabric, Intel Omni-Path Architecture and Myrinet (through TMI)
  ofi – OpenFabrics Interfaces (OFI)-capable fabrics, such as Intel True Scale Fabric, Intel Omni-Path Architecture, IB and Ethernet (through the OFI API)
• For inter-node communication, Intel MPI uses the first available fabric from the default fabric list. The list is defined automatically for each hardware and software configuration (see I_MPI_FABRICS_LIST). For most configurations this list is: dapl, ofa, tcp, tmi, ofi.
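A minimal sketch of forcing a specific fabric choice for a run (I_MPI_FABRICS takes an intra-node:inter-node pair; the application name is a placeholder):

  #!/bin/bash
  # Shared memory within a node, DAPL (InfiniBand) between nodes.
  export I_MPI_FABRICS=shm:dapl
  # Alternatively, let Intel MPI walk a custom fabric preference list.
  # export I_MPI_FABRICS_LIST=ofa,dapl,tcp
  # Report the fabric actually selected at start-up.
  export I_MPI_DEBUG=5
  mpirun -np 256 ./app.exe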

Mellanox HPC-X Toolkit

The Mellanox HPC-X Toolkit provides an MPI, SHMEM and UPC software suite for HPC environments. It delivers "enhancements to significantly increase the scalability & performance of message communications in the network". It includes:
¤ Complete MPI, SHMEM and UPC package, including the Mellanox MXM and FCA acceleration engines
¤ Offload of collectives communication from the MPI process onto Mellanox interconnect hardware
¤ Maximized application performance with the underlying hardware architecture; optimized for Mellanox InfiniBand and VPI interconnects
¤ Increased application scalability and resource efficiency
¤ Multiple transport support including RC, DC and UD
¤ Intra-node shared-memory communication
• The performance comparison was conducted on the Mellanox SKL 6138 / 2.00 GHz EDR-based "Helios" cluster.
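For reference, a hedged sketch of switching a job onto HPC-X on such a cluster (the install path is an assumption; HPC-X ships an hpcx-init.sh helper that defines hpcx_load):

  #!/bin/bash
  # Load the HPC-X environment (the Open MPI build bundled with MXM/FCA/UCX).
  source $HPCX_HOME/hpcx-init.sh   # HPCX_HOME assumed to point at the HPC-X install
  hpcx_load
  # Launch with the HPC-X Open MPI, one rank per core.
  mpirun -np 320 --map-by core --bind-to core ./app.exe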

Application Performance & MPI Libraries

A performance comparison exercise was undertaken to capture the impact of the latest releases of Intel MPI and Mellanox's HPCX.
¤ In 2017, on the Mellanox HP ProLiant E5-2697A v4 EDR-based Thor cluster, Intel MPI and Mellanox HPCX were compared for the following applications (and associated data sets):
  – DL_POLY 4 (NaCl and Gramicidin) & GROMACS (ion channel and lignocellulose)
  – VASP (Pd-O complex & zeolite system)
  – Quantum ESPRESSO (Au112 and GRIR443)
  – OpenFOAM (Cavity 3D-3M)
¤ The comparison simply takes the ratio of the time to solution for each application, i.e. T_HPCX / T_Intel-MPI, across multiple core counts.
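A trivial sketch of producing that ratio table from two timing files (file names and the "cores seconds" per-line format are assumptions; both files sorted by core count):

  #!/bin/bash
  # hpcx.dat and impi.dat: "<cores> <wallclock_seconds>" per line, matching core counts.
  join hpcx.dat impi.dat | awk '{ printf "%5d cores  T_HPCX/T_IntelMPI = %.2f\n", $1, $2/$3 }'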

Application Performance & MPI Libraries

• Optimum performance was found to be a function of both application and core count.
  ¤ With the materials-based codes and OpenFOAM, at high core counts (> 512 cores), HPCX exhibited a clear performance advantage over Intel MPI.
  ¤ This was not the case for the classical MD codes, where Intel MPI showed a distinct advantage at all but the highest core counts.
• The exercise was repeated on the Helios partition of the Skylake cluster using the latest releases of HPCX, v2.2.0 and 2.3.0-pre.

http://www.mellanox.com/related-docs/prod_acceleration_software/PB_HPC-X.pdf

DL_POLY 4 – Intel MPI vs. HPCX – December 2017

[Line chart: DL_POLY 4 NaCl and Gramicidin; Intel MPI performance as a percentage of HPCX (85-120%) vs. processor core count, up to 1,024 cores. Intel MPI is seen to outperform HPC-X for the DL_POLY 4 NaCl test case at all core counts, and at lower core counts for Gramicidin.]

DL_POLY 4 – Intel MPI vs. HPCX – December 2018

[Line chart: as above, December 2018 results. The advantage of Intel MPI is now reduced at most core counts for both NaCl and Gramicidin.]

GROMACS – Intel MPI vs. HPCX – December 2017

[Line chart: GROMACS ion channel and lignocellulose; Intel MPI performance as a percentage of HPCX (95-125%) vs. processor core count, up to 1,024 cores. At no point does the HPC-X build of GROMACS outperform that using Intel MPI.]

GROMACS – Intel MPI vs. HPCX – December 2018

[Line chart: as above, December 2018 results. Similar findings to DL_POLY, with the advantage of Intel MPI over the HPC-X build of GROMACS significantly reduced compared to the 2017 findings.]

VASP 5.4.1 – Intel MPI vs. HPCX – December 2017

[Line chart: VASP Palladium complex and Zeolite cluster; Intel MPI performance as a percentage of HPCX (60-120%) vs. processor core count, up to 512 cores. Significantly different to the classical MD codes – HPCX is seen to outperform Intel MPI for the Zeolite cluster at all core counts, and at larger core counts for the Palladium complex.]


VASP 5.4.4 – Intel MPI vs. HPCX – December 2018

[Line chart: as above, VASP 5.4.4, December 2018 results. Significantly different to the 2017 findings – little difference between Intel MPI and HPCX at larger core counts, with Intel MPI superior at lower core counts.]

Quantum Espresso v5.2 – Intel MPI vs. HPCX – Dec. 2017

[Line chart: Quantum Espresso GRIR443 and Au112; Intel MPI performance as a percentage of HPCX (65-125%) vs. processor core count, up to 768 cores. Significantly different to the classical MD codes – as with VASP, HPCX is seen to outperform Intel MPI at the larger core counts.]


Quantum Espresso v6.1 – Intel MPI vs. HPCX – Dec. 2018

[Line chart: as above, Quantum Espresso v6.1, December 2018 results. Significantly different to the 2017 findings – Intel MPI superior at lower core counts, with HPCX somewhat more effective at higher core counts.]


Application Performance on Multi-core Processors

I.3 Relative Performance as a Function of Processor Family and Interconnect – SKL and SNB Clusters

Target Codes and Data Sets – 128 PEs

[Radar chart: 128-PE application performance (normalized 0-1) across DL_POLY 4 Gramicidin, BSMBench Balance, DL_POLY 4 NaCl, GROMACS ion-channel, GROMACS lignocellulose, OpenFOAM 3d3M, QE GRIR443, QE Au112, VASP Pd-O complex and VASP Zeolite complex, comparing Bull b510 "Raven" SNB E5-2670 QDR, Fujitsu CX250 SNB E5-2670 QDR, ATOS Broadwell E5-2680 v4 OPA, Thor Dell|EMC E5-2697A v4 EDR (IMPI and HPCX), Dell Skylake Gold 6130 OPA, Intel Skylake Gold 6148 OPA, Dell Skylake Gold 6142 EDR and Dell Skylake Gold 6150 EDR.]

SKL "Gold" 6148 2.4 GHz EDR vs. SNB e5-2670 2.6 GHz QDR

NPEs = 80. Improved performance of Hawk (Dell|EMC Skylake Gold 6148 2.4 GHz (T) EDR) vs. Raven (ATOS b510 Sandy Bridge E5-2670 2.6 GHz IB-QDR); average factor = 1.49. Per-code factors: VASP 5.4.4 PdO Complex 1.95; QE 5.2 GRIR443 1.71; DL_POLY 4.08 Gramicidin 1.67; DL_POLY 4.08 NaCl 1.65; DLPOLY Classic Bench4 1.59; DLPOLY Classic Bench5 1.58; GAMESS-UK DFT.cyclo.6-31G-dp 1.58; GAMESS-UK SiOSi7 1.54; DLPOLY Classic Bench7 1.53; VASP 5.4.4 Zeolite 1.53; WRF conus 2.5km 1.49; Gromacs 5.0 ion channel 1.45; CP2K H2O-256 1.43; QE 5.2 AU112 1.42; CP2K H2O-512 1.41; Gromacs 4.6.1 ion channel 1.40; Gromacs 4.6.1 lignocellulose 1.38; Gromacs 5.0 lignocellulose 1.37; Gromacs 2016-3 lignocellulose 1.36; Gromacs 2016-3 ion channel 1.33; WRF 4dbasic 1.29; OpenFOAM Cavity3d-3M 1.11.

SKL "Gold" 6148 2.4 GHz EDR vs. SNB e5-2670 2.6 GHz QDR

NPEs = 160. Improved performance of Hawk vs. Raven; average factor = 1.53. Per-code factors: VASP 5.4.4 PdO Complex 2.23; VASP 5.4.4 Zeolite 2.02; DL_POLY 4.08 Gramicidin 1.76; DL_POLY 4.08 NaCl 1.71; GAMESS-UK DFT.cyclo.6-31G-dp 1.59; GAMESS-UK SiOSi7 1.56; DLPOLY Classic Bench7 1.56; CP2K H2O-256 1.53; DLPOLY Classic Bench4 1.50; QE 5.2 GRIR443 1.49; CP2K H2O-512 1.49; WRF conus 2.5km 1.49; Gromacs 4.6.1 ion channel 1.48; Gromacs 5.0 ion channel 1.46; DLPOLY Classic Bench5 1.45; Gromacs 4.6.1 lignocellulose 1.39; Gromacs 5.0 lignocellulose 1.36; WRF 4dbasic 1.36; Gromacs 2016-3 lignocellulose 1.36; QE 5.2 AU112 1.33; Gromacs 2016-3 ion channel 1.32; OpenFOAM Cavity3d-3M 1.23.

SKL "Gold" 6148 2.4 GHz EDR vs. SNB e5-2670 2.6 GHz QDR

NPEs = 320. Improved performance of Hawk vs. Raven; average factor = 1.60. Per-code factors: VASP 5.4.4 PdO Complex 2.71; DLPOLY Classic Bench7 2.16; QE 5.2 AU112 1.97; VASP 5.4.4 Zeolite 1.88; DL_POLY 4.08 NaCl 1.80; DL_POLY 4.08 Gramicidin 1.74; GAMESS-UK SiOSi7 1.60; GAMESS-UK DFT.cyclo.6-31G-dp 1.58; CP2K H2O-256 1.56; Gromacs 4.6.1 ion channel 1.53; WRF conus 2.5km 1.45; OpenFOAM Cavity3d-3M 1.44; DLPOLY Classic Bench4 1.41; Gromacs 4.6.1 lignocellulose 1.41; Gromacs 5.0 ion channel 1.40; CP2K H2O-512 1.40; WRF 4dbasic 1.39; Gromacs 5.0 lignocellulose 1.39; Gromacs 2016-3 lignocellulose 1.38; DLPOLY Classic Bench5 1.37; Gromacs 2016-3 ion channel 1.34; QE 5.2 GRIR443 1.34.

Performance Benchmarks – Node to Node

• Analysis of performance metrics across a variety of data sets
  ¤ "Core to core" and "node to node" workload comparisons
• The previous charts were based on a core-to-core comparison, i.e. performance for jobs with a fixed number of cores.
• A node-to-node comparison is typical of the performance when running a workload (real-life production), and is expected to reveal the major benefit of increasing the core count per socket.
  ¤ Focus on a 4- and 6-node "node to node" comparison of the following:

1. Raven – Bull b510 Sandy Bridge E5-2670 2.6 GHz IB-QDR [64 cores] vs. Hawk – Dell|EMC Skylake Gold 6148 2.4 GHz (T) EDR [160 cores]
2. Raven – Bull b510 Sandy Bridge E5-2670 2.6 GHz IB-QDR [96 cores] vs. Hawk – Dell|EMC Skylake Gold 6148 2.4 GHz (T) EDR [240 cores]

¤ Benchmarks based on set of 10 applications & 19 data sets.

SKL "Gold" 6148 2.4 GHz EDR vs. SB e5-2670 2.6 GHz QDR

4-node comparison: improved performance of Dell|EMC Skylake Gold 6148 2.4 GHz (T) EDR [160 cores] vs. Bull b510 Sandy Bridge E5-2670 2.6 GHz IB-QDR [64 cores]; average factor = 3.05. Per-code factors: OpenFOAM Cavity3d-3M 3.50; WRF 3.4 conus 2.5km 3.46; GAMESS-UK (DFT.siosi7.3975) 3.40; DLPOLY-4 Gramicidin 3.31; GROMACS 2016.3 lignocellulose 3.28; QE 5.2 GRIR443 3.26; GROMACS 2016.3 ion-channel 3.11; DLPOLY-4 NaCl 3.09; WRF 3.4 4dbasic 2.98; VASP 5.4.4 Zeolite complex 2.96; VASP 5.4.4 Pd-O complex 2.95; GAMESS-UK (DFT.cyclo.6-31G-dp) 2.94; DLPOLYclassic Bench4 2.81; CP2K H2O-512 2.70; QE 5.2 Au112 2.62; CP2K H2O-256 2.50.

SKL "Gold" 6148 2.4 GHz EDR vs. SNB e5-2670 2.6 GHz QDR

6-node comparison: improved performance of Hawk – Dell|EMC Skylake Gold 6148 2.4 GHz (T) EDR [240 cores] vs. Bull b510 Sandy Bridge E5-2670 2.6 GHz IB-QDR [96 cores]; average factor = 2.98. Per-code factors: OpenFOAM Cavity3d-3M 3.88; GROMACS 2016.3 lignocellulose 3.27; QE 5.2 Au112 3.19; GAMESS-UK (DFT.siosi7.3975) 3.19; WRF 3.4 conus 2.5km 3.18; QE 5.2 GRIR443 3.14; DLPOLY-4 NaCl 3.01; DLPOLYclassic Bench4 2.96; DLPOLY-4 Gramicidin 2.96; CP2K H2O-512 2.79; GAMESS-UK (DFT.cyclo.6-31G-dp) 2.78; WRF 3.4 4dbasic 2.78; VASP 5.4.4 Pd-O complex 2.67; GROMACS 2016.3 ion-channel 2.64; CP2K H2O-256 2.63; VASP 5.4.4 Zeolite complex 2.59.

EPYC – Target Codes and Data Sets – 128 PEs

[Radar chart: 128-PE application performance (normalized 0-1) across DLPOLYclassic Bench4, DL_POLY 4 Gramicidin, DL_POLY 4 NaCl, QE GRIR443, QE Au112, GAMESS-UK (cyc-sporin), GAMESS-UK (Siosi7), GROMACS lignocellulose, GROMACS ion-channel, VASP Pd-O complex and VASP Zeolite complex, comparing the Fujitsu CX250 SNB E5-2670 QDR, ATOS Broadwell E5-2680 v4 OPA, Thor Dell|EMC E5-2697A v4 EDR IMPI, Dell Skylake Gold 6130 OPA, Intel Skylake Gold 6148 OPA, Dell Skylake Gold 6142 EDR, Dell and Bull|ATOS Skylake Gold 6150 EDR and Dell|EMC AMD EPYC 7601 EDR systems.]

Application Performance on Multi-core Processors 12 December 2018 77 Performance Benchmarks – Node to Node

• Analysis of performance metrics across a variety of data sets ¤ "Core to core" and "node to node" workload comparisons • The previous EPYC charts were based on a core-to-core comparison, i.e. performance for jobs with a fixed number of cores • A node-to-node comparison is typical of the performance seen when running a real-life production workload, and is expected to reveal the major benefit of the increased core count per socket ¤ Focus on a "node to node" comparison of the following:

Fujitsu CX250 Sandy Bridge e5- Dell |EMC AMD EPYC 7601 2.2 GHz 1 2670/2.6 GHz IB-QDR [64 cores] (T) EDR [256 cores] Dell |EMC Skylake Gold 6130 2.1GHz Dell |EMC AMD EPYC 7601 2.2 GHz 2 (T) OPA [128 cores] (T) EDR [256 cores]

¤ Benchmarks based on set of 6 applications & 15 data sets.

Application Performance on Multi-core Processors 12 December 2018 78 Dell|EMC EPYC 7601 2.2 GHz (T) EDR vs. SB e5-2670 2.6 GHz QDR

Chart: Relative performance of Dell|EMC AMD EPYC 7601 2.2 GHz (T) EDR [256 cores] vs. Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR [64 cores]; 4 Node Comparison; Average Factor = 2.92.
GAMESS-UK (valino.A2)   4.19
GROMACS lignocellulose  4.15
GAMESS-UK (cyc-sporin)  3.62
QE Au112                3.33
GROMACS ion-channel     3.24
DLPOLYclassic Bench4    2.90
VASP Zeolite complex    2.88
DLPOLY-4 Gramicidin     2.69
DLPOLY-4 NaCl           2.30
DLPOLYclassic Bench5    2.13
VASP Pd-O complex       2.09
DLPOLYclassic Bench7    1.55

Application Performance on Multi-core Processors 12 December 2018 79 SKL "Gold" 6130 2.1 GHz OPA vs. AMD EPYC 7601 2.2 GHz (T) EDR

Chart: Relative performance of Dell|EMC AMD EPYC 7601 2.2 GHz (T) EDR [256 cores] vs. Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA [128 cores]; 4 Node Comparison; Average Factor = 1.28.
GAMESS-UK (hf12z)       1.83
GROMACS lignocellulose  1.78
GAMESS-UK (Siosi7)      1.78
GAMESS-UK (valino.A2)   1.64
GAMESS-UK (cyc-sporin)  1.51
DLPOLYclassic Bench4    1.51
GROMACS ion-channel     1.44
DLPOLYclassic Bench5    1.21
VASP Zeolite complex    1.13
DLPOLY-4 Gramicidin     1.07
DLPOLYclassic Bench7    1.00
DLPOLY-4 NaCl           0.94
QE GRIR443              0.84
QE Au112                0.80
VASP Pd-O complex       0.74

Application Performance on Multi-core Processors 12 December 2018 80 Summary

• Ongoing focus on performance benchmarks and clusters featuring Intel's SKL processors, with the addition of the "Gold" 6138, 2.0 GHz [20c] and 6148, 2.4 GHz [20c] alongside the 6142, 2.6 GHz [16c] and 6150, 2.7 GHz [18c]. • Performance comparison with current SNB systems and those based on dual Intel BDW processor EP nodes (16-core, 14-core) with Mellanox EDR and Intel's Omni-Path (OPA) interconnects. • Measurements of parallel application performance based on synthetic and end-user applications – DLPOLY, Gromacs, Amber, GAMESS-UK, Quantum ESPRESSO and VASP. ¤ Use of Allinea Performance Reports to guide analysis, and an updated comparison of Mellanox's HPC-X and Intel MPI on EDR-based systems. • Results augmented through consideration of two AMD Naples EPYC clusters, featuring the 7601 (2.20 GHz) and 7551 (2.00 GHz) processors.

Application Performance on Multi-core Processors 12 December 2018 81 Summary II

• Relative Code Performance: Processor Family and Interconnect – "core to core" and "node to node" benchmarks. • A core-to-core comparison focusing on the Skylake "Gold" 6148 cluster (EDR) across 19 data sets (7 applications) suggests average speedups from 1.49 (80 cores) to 1.60 (320 cores) when comparing to the Sandy Bridge-based "Raven" e5-2670 2.6GHz cluster with its QDR environment. ¤ Some applications, however, show much higher factors, e.g. GROMACS and VASP, depending on the level of optimisation undertaken on Hawk. • A node-to-node comparison, typical of the performance when running a workload, shows increased factors. ¤ A 4-node benchmark (160 cores) based on examples from 9 applications and 16 data sets shows an average improvement factor of 3.05 compared to the corresponding 4-node runs (64 cores) on the Raven cluster. ¤ This factor is reduced somewhat, to 2.98, when using the 6-node benchmarks, comparing 240 SKL cores to 96 SNB cores.

Application Performance on Multi-core Processors 12 December 2018 82 Summary III

• An updated comparison of Intel MPI and Mellanox's HPCX conducted on the "Helios" cluster suggests that the clear delineation between MD codes (DLPOLY, GROMACS) and materials-based codes (VASP, Quantum Espresso) is no longer evident. • Ongoing studies on the EPYC 7601 show a complex performance dependency on the EPYC architecture. ¤ Codes with high usage of vector instructions (Gromacs, VASP and Quantum Espresso) perform only modestly at best. ¤ The AMD EPYC only supports 2 × 128-bit AVX natively, so there is a large gap with Intel and its 2 × 512-bit FMAs. ¤ The floating-point peak on AMD is 4 × lower than Intel's, and given that e.g. GROMACS has a native AVX-512 kernel for Skylake, performance inevitably suffers (see the arithmetic sketch below).
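The 4× figure quoted above follows directly from the per-core FMA widths stated on the slide (a simple check, not additional measurement):

\text{Skylake:}\quad 2\ \mathrm{FMA} \times \tfrac{512}{64}\ \mathrm{DP\ lanes} \times 2\ \mathrm{flop} = 32\ \mathrm{DP\ flop/cycle/core}
\text{EPYC Naples:}\quad 2\ \mathrm{FMA} \times \tfrac{128}{64}\ \mathrm{DP\ lanes} \times 2\ \mathrm{flop} = 8\ \mathrm{DP\ flop/cycle/core}

i.e. a 32/8 = 4× gap in peak double-precision throughput per core at equal clock speed.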

Application Performance on Multi-core Processors 12 December 2018 83 Application Performance on Multi- core Processors

II. Acceptance Test Challenges and the Impact of Environment. Background - Supercomputing Wales, New HPC Systems

• Multi-million £ procurement exercise for new hubs agreed by all partners • Tender issued in May 2017 following 6-9 month review of research community requirements and development of technical reference design • Budgetary challenges due to currency devaluation and increase in component costs since budgets agreed in 2016 • Contracts awarded to Atos, March 2018. Hubs now installed and operational, based on Intel Skylake Gold 6148, supported by Nvidia GPU accelerators:  Lot 1 – “Hawk” system - Cardiff hub. 7,000 HPC + 1,040 HTC cores  Lot 2 – “Sunbird” system - Swansea hub. 5,000 HPC cores  Lot 3 – “Sparrow” – Cardiff High Performance Data Analytics development system  Suppliers to provide development opportunities and other activities through a programme of Community Benefits

Application Performance on Multi-core Processors 12 December 2018 85 Performance Acceptance Tests
1. Consideration of the performance acceptance tests undertaken as part of the Supercomputing Wales procurement, carried out by Atos on the "Hawk" HPC Skylake 6148 cluster at Cardiff University.
2. Performance targets were built on the benchmarks specified in the ITT, but subsequent developments impacted the testing, e.g. SPECTRE / Meltdown.
3. Performance is assessed through analyses of results generated under three distinct run-time environment settings, characterised by:
¤ Turbo mode – ON or OFF. The impact is considerably more complicated with Skylake than with previous Intel processor families.
¤ Security patches – DISABLED or ENABLED on the Skylake 6148 compute nodes.
¤ Distribution of processing cores – PACKED or UNPACKED on each node, e.g. 256 cores on either 7 or 8 × 40-core nodes.
4. Total of 8 combinations – impact on performance?
¤ The ITT defined that "Application benchmarks should be in PACKED mode; HPCC in non-turbo mode".
Application Performance on Multi-core Processors 12 December 2018 86 Process Adopted

1. Performance benchmark results were generated by Atos (Martyn Foster) on the Hawk HPC Skylake 6148 cluster at Cardiff University.
2. MF adopted a systematic approach to assessing performance through the analyses of results generated across four distinct environments (a subset of the 8 possible environments):
¤ "base (switch contained)" – Turbo mode off, security patches disabled on the Skylake 6148 compute nodes
¤ "turbo + packed" – Turbo mode activated, with packed nodes (the Slurm default, with 40 cores per Skylake 6148 node)
¤ "turbo + spread" – Turbo mode activated, de-populated nodes (32 cores / node)
¤ "base + spectre" – the base configuration above with security patches enabled
3. Identify those applications where the committed performance from the SCW ITT submission ("Target") is not achieved; a 10% shortfall was allowed.

Application Performance on Multi-core Processors 12 December 2018 87 GLOBAL_SETTINGS

export SPECTRE="clush -b -w $SLURM_NODELIST sudo /apps/slurm/disablekpti" ; export SPEC="disable"
##### OR #####
export SPECTRE="clush -b -w $SLURM_NODELIST sudo /apps/slurm/enablekpti" ; export SPEC="enable"

export TURBO="clush -b -w $SLURM_NODELIST sudo /apps/slurm/turbo_on" ; export TSTR=TURBO
##### OR #####
export TURBO="clush -b -w $SLURM_NODELIST sudo /apps/slurm/turbo_off" ; export TSTR=OFF

export SRUN_PACKING="-m Pack" ; export PSTR=Packed
##### OR #####
export SRUN_PACKING="-m NoPack" ; export PSTR=Spread

export LAUNCHER="srun ${SRUN_PACKING} --cpu_bind=verbose,cores --export LD_LIBRARY_PATH"
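A minimal sketch of how these settings combine inside a Slurm job (assumed usage; the actual Atos run scripts are not reproduced on the slides, and the benchmark executable name is purely illustrative):

# apply the selected node configuration on the allocated nodes
$SPECTRE        # enable or disable the KPTI (Spectre/Meltdown) patches
$TURBO          # switch turbo mode on or off
# launch with the chosen packing policy, tagging the log with the environment
$LAUNCHER ./dlpoly.x > run.${TSTR}.${PSTR}.${SPEC}.log 2>&1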

Application Performance on Multi-core Processors 12 December 2018 88 SCW Application Performance Benchmarks • The benchmark suite comprises both synthetics & end-user applications. Synthetics include the HPCC (http://icl.cs.utk.edu/hpcc) & IMB benchmarks (http://software.intel.com/en-us/articles/intel-mpi-benchmarks), IOR and STREAM • Variety of "open source" & commercial end-user application codes:

GROMACS and DL_POLY-4 (molecular dynamics)

Quantum Espresso and VASP (ab initio Materials properties)

BSMBench (particle physics – Lattice Gauge Theory Benchmarks)

OpenFOAM (computational engineering)

• These stress various aspects of the architectures under consideration and should provide a level of insight into why particular levels of performance are observed.

Application Performance on Multi-core Processors 12 December 2018 89 “Sunbird” Acceptance Tests – User Applications

Basket of synthetic (HPCC, IOR, STREAM, IMB) and end-user application codes (DL_POLY, GROMACS, VASP, ESPRESSO, OpenFOAM & BSMBench)

Chart: percentage performance values, ranging from about 78% to 113%, across the benchmark/core-count combinations: DLPOLY-Gramicidin (64/128/256), GROMACS-ionchannel (64/128), GROMACS-lignocellulose (128/256), VASP-PdO (64/128), VASP-Zeolite (128/256), QE-Au112 (64/128), QE-GRIR443 (256/512), OpenFOAM (128/256) and BSMBench Comms/Balance/Compute (256/512/1024).

Application Performance on Multi-core Processors 12 December 2018 90 Impact of Turbo Mode on Performance (Security Patches Enabled)

Chart: relative performance (%) of Turbo ON runs, normalised to the corresponding performance with Turbo OFF, with security patches enabled (T Turbo-OFF / T Turbo-ON), for each benchmark/core-count combination: DLPOLY-Gramicidin (64/128/256), GROMACS-ionchannel (64/128), GROMACS-ligno. (128/256), VASP-PdO (64/128), VASP-Zeolite (128/256), QE-Au112 (64/128), QE-GRIR443 (256/512), OpenFOAM (128/256) and BSM Comms/Balance/Compute (256/512/1024).

Application Performance on Multi-core Processors 12 December 2018 91 Impact of Turbo Mode on Performance (Security Patches Disabled)

Chart: relative performance (%) of Turbo ON runs, normalised to the corresponding performance with Turbo OFF, with security patches disabled (T Turbo-OFF / T Turbo-ON), for the same benchmark/core-count combinations as above.

Application Performance on Multi-core Processors 12 December 2018 92 Impact of Security Patches on Performance (Turbo Mode OFF)

Chart: relative performance (%) with security patches enabled, normalised to the corresponding performance with patches disabled on the compute nodes, with Turbo mode OFF (T DISABLED / T ENABLED), for the same benchmark/core-count combinations as above.

Application Performance on Multi-core Processors 12 December 2018 93 Impact of Security Patches on Performance (Turbo Mode ON)

Chart: relative performance (%) with security patches enabled, normalised to the corresponding performance with patches disabled on the compute nodes, with Turbo mode ON (T DISABLED / T ENABLED), for the same benchmark/core-count combinations as above.

Application Performance on Multi-core Processors 12 December 2018 94 Overall Impact of Environment on Performance

Chart: relative performance (%) for each benchmark/core-count combination, normalised with respect to the most constrained environment – Turbo OFF, security patches enabled, "packed" nodes (T CONSTRAIN / T MIN).

Application Performance on Multi-core Processors 12 December 2018 95 Workload validation and Throughput tests

• Aim: the throughput tests are designed to demonstrate the stability of the system over an observed period of a week, while hardening the system • Benchmarks are based on multiple, concurrent instantiations of a number of data sets associated with five of the end-user application codes and two of the synthetic benchmarks. • Each data set is run a number of times over a variety of processor (core) counts - typically 40, 80, 160, 320, 640 and 1024. This combination of jobs has been designed to run for approximately 6 hours (elapsed time) on a 2,720-core, 68-node cluster partition. • Note that the metrics for success of these tests are twofold: 1. all jobs comprising a given run complete successfully, and 2. run times are consistent across the tests. The measured time is simply the elapsed time from the launch of the first job to the completion of the last job (a minimal measurement sketch follows).
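A minimal sketch of that elapsed-time metric (my own illustration, not taken from the acceptance scripts), assuming the throughput jobs carry recognisable job-name prefixes:

START=$(date +%s)                      # time at which the first job is launched
# ... submit the full set of throughput jobs with sbatch ...
while squeue -u "$USER" -h | grep -q -E 'DLPOLY|GROMACS|QE|VASP|OpenFOAM|IMB|IOR'; do
    sleep 60                           # poll until the last job has finished
done
END=$(date +%s)
printf "Total elapsed time %d:%02d (hours:mins)\n" $(( (END-START)/3600 )) $(( ((END-START)%3600)/60 ))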

Application Performance on Multi-core Processors 12 December 2018 96 Workload validation and Throughput tests - SLURM Scripts
• Based around multiple instantiations of a number of data sets associated with the five codes, DLPOLY4, Gromacs (v5.2), Quantum Espresso, OpenFOAM and VASP, and the two synthetic benchmarks, IMB and IOR:
• DLPOLY4 - NaCl & Gramicidin
• Gromacs - ion_channel & lignocellulose
• QE 6.1 - AUSURF112 & GRIR443
• OpenFOAM - cavity3d-3M
• VASP 5.4.4 - PdO complex and Zeolite
SLURM scripts: DLPOLY4.test2+test8.SCW.{40,80,160,320,640}.q; GROMACS.All.SCW.{80,160,320,640,1024}.q; IMB3.SCW.{160,320}.q; IOR.SCW.{4,8}.q; OpenFOAM_cavity3d-3M.SCW.{80,160,320,640}.q; QE.AUSURF112.SCW.{160,320}.q; QE.GRIR443.SCW.{320,640}.q; VASP.example3.SCW.{80,160,320}.q; VASP.example4.SCW.{160,320}.q
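None of the .q scripts themselves are reproduced on the slides; the following is a minimal sketch of what one of them (e.g. DLPOLY4.test2+test8.SCW.160.q) might look like, with the module name, directory layout and executable assumed rather than taken from the system (the srun options follow the GLOBAL_SETTINGS above):

#!/bin/bash
#SBATCH --job-name=DLPOLY4.test2+test8.SCW.160
#SBATCH --ntasks=160                 # 4 x 40-core Skylake 6148 nodes
#SBATCH --time=06:00:00
#SBATCH --exclusive

module load dl_poly                  # assumed module name
for test in test2 test8; do          # NaCl and Gramicidin data sets
    cd "$SLURM_SUBMIT_DIR/$test"     # assumed per-data-set working directories
    srun -m Pack --cpu_bind=verbose,cores DLPOLY.Z
done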

Application Performance on Multi-core Processors 12 December 2018 97 Throughput Tests – Hawk System – Two partition Approach

The throughput tests were undertaken on two separate partitions of the Hawk cluster – compute64 and compute64b – to enable other testing and an early pilot user service. Each partition comprised 68 nodes.
Partition 1 – compute64 (68 nodes)
• The first set of trial runs was executed between 12-14 May. A number of the runs failed to complete, subsequently attributed to an apparent VASP-related error peculiar to the lustre file system:
forrtl: severe (121): Cannot access current working directory for unit 18, file "Unknown"
Image          PC                 Routine            Line      Source
vasp_std       00000000014F3E09   Unknown            Unknown   Unknown
vasp_std       000000000150E10F   Unknown            Unknown   Unknown
vasp_std       000000000134C950   Unknown            Unknown   Unknown
vasp_std       000000000040AF5E   Unknown            Unknown   Unknown
libc-2.17.so   00002B450F32EC05   __libc_start_main  Unknown   Unknown
vasp_std       000000000040AE69   Unknown            Unknown   Unknown
forrtl: error (76): Abort trap signal
• This transient error affected perhaps one in twenty identical jobs, and although reported into the appropriate Level 3 service regimes, has still not been formally addressed. A workaround module was developed by Cardiff's Tom Green when it became clear that the formal channels were struggling:
module load lustre_getcwd_fix
Application Performance on Multi-core Processors 12 December 2018 98 Throughput Tests – Hawk System II

Partition 1: • A second set of trial runs was carried out over the bank holiday weekend and successfully passed the associated tests over the period 30 May – 3 June.

Run #   Start Time     Finish Time    Total Elapsed Time (hours:mins)
6       30 May 21:21   31 May 03:25   6:02
7       31 May 23:33   01 Jun 05:38   6:05
8       02 Jun 00:04   02 Jun 06:06   6:02
9       02 Jun 13:24   02 Jun 19:27   6:04
10      02 Jun 22:57   03 Jun 05:00   6:03
11      03 Jun 05:45   03 Jun 11:47   6:02
12      03 Jun 15:57   03 Jun 22:12   6:15
• Partition 2: • Runs 11-22: Initial runs using compute64b, conducted between 7-10 June, revealed a number of issues pointing to the readiness of the nodes. Timings from the first completed run suggested some variability in run times for a given application/core count, with the total run time significantly longer than those on compute64. Application Performance on Multi-core Processors 12 December 2018 99 Throughput Tests – Hawk System III

• Following a lustre upgrade, a further set of runs was undertaken between 21 June and 25 June. Runs 8-12 ran without error, so formally compute64b, along with compute64, can be judged to have passed the acceptance-test throughput requirement of five consecutive error-free runs, although the variations in the individual run times are perhaps larger than hoped.

Run #   Start Time     Finish Time    Total Elapsed Time (hours:mins)
8       23 Jun 15:44   23 Jun 20:57   5:13
9       23 Jun 21:35   24 Jun 02:55   5:20
10      24 Jun 11:34   24 Jun 17:06   5:32
11      24 Jun 18:56   25 Jun 00:16   5:20
12      25 Jun 00:31   25 Jun 06:03   5:32

• Testing on Hawk commenced on 12 May 2018 and was finally completed on 25 June 2018.

Application Performance on Multi-core Processors 12 December 2018 100 Throughput Tests – Sunbird System – Two partition approach

Partition 1:
• Runs 1 – 4: ¤ Run 3 did not complete, with JOBID #11050 hanging, while JOBID #11372 of Run 4 suffered the same fate. Both jobs failed with the all-too-familiar VASP/lustre error diagnostics. The scripts used were identical to those used on Hawk in June, and did not include the workaround introduced at that time.
• Runs 5 – 10: Completed successfully, with two of the VASP/lustre failures trapped through the added module:
module load lustre_getcwd_fix
Partition 2:
• Runs 11 – 22: Three jobs in one of the runs hung when hitting problems on scs0105. That node had been taken out when setting up the user-facing file systems and needed the playbooks re-running. Several of the runs showed the impact of the lustre issue with VASP.
• There were, however, significant variations in the overall run times.
¤ At least three of the nodes appeared to be either defective or to have different BIOS settings (scs0064, scs0092 and scs0096). These were subsequently removed from service.
¤ Turbo mode was in an inconsistent state across the compute nodes; it is usually reset by the Slurm prologue scripts, but these appear to have been commented out.

Application Performance on Multi-core Processors 12 December 2018 101 Throughput Tests – Sunbird System

• Runs 23 – 30: These runs were certainly acceptable by the job-completion metric, for all completed successfully. Note there was no recurrence of the lustre-related issue during this set of runs.

Run #   Start Time     Finish Time    Total Elapsed Time (hours:mins)
23      17 Aug 17:52   17 Aug 23:07   5:15
24      17 Aug 23:54   18 Aug 05:11   5:17
25      18 Aug 05:19   18 Aug 10:36   5:17
26      18 Aug 14:00   18 Aug 19:16   5:16
27      18 Aug 19:51   19 Aug 01:08   5:17
28      19 Aug 02:45   19 Aug 07:57   5:12
29      19 Aug 13:21   19 Aug 18:32   5:11
30      19 Aug 19:06   20 Aug 00:26   5:20 (Slurm CG issue)

• Testing on Sunbird commenced on 10 August 2018, and was finally completed on the evening of 19 August 2018

Application Performance on Multi-core Processors 12 December 2018 102 Throughput Tests – Nottingham OCF Cluster
• Tests were modified to run on two partitions of the OCF cluster at Nottingham, "martyn" and "colin", each comprising 50 nodes with EDR interconnect. All component nodes comprised dual Gold 6138 2.0GHz 20c SKL processors.
• Initial runs of the workload failed to complete successfully, with each of the 8 x 320-core IMB jobs hanging and consuming all of their allocated time. This was traced to an issue with the gatherv collective, which failed to complete across all specified msglens.
• The issue was navigated around by removing those environment variables deemed likely to trigger the problem, specifically:
¤ export I_MPI_JOB_FAST_STARTUP=enable
¤ export I_MPI_SCALABLE_OPTIMIZATION=enable
¤ export I_MPI_DAPL_UD=enable
¤ export I_MPI_TIMER_KIND=rdtsc
• With these removed, runs proceeded to complete successfully.
• One of the allocated nodes (compute099) was rendered unusable as a result of the tests and removed from service. Thus the subsequent runs used 49 nodes, rather than the intended 50.
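A sketch of the workaround as it might appear in the IMB job scripts (illustrative only; the original scripts are not shown). The Intel MPI tuning variables suspected of triggering the gatherv hang are simply left unset before launching:

# export I_MPI_JOB_FAST_STARTUP=enable        # removed
# export I_MPI_SCALABLE_OPTIMIZATION=enable   # removed
# export I_MPI_DAPL_UD=enable                 # removed
# export I_MPI_TIMER_KIND=rdtsc               # removed
unset I_MPI_JOB_FAST_STARTUP I_MPI_SCALABLE_OPTIMIZATION I_MPI_DAPL_UD I_MPI_TIMER_KIND
$LAUNCHER IMB-MPI1 Gatherv   # re-run of the previously hanging collective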

Application Performance on Multi-core Processors 12 December 2018 103 Throughput Tests – Acceptance Achieved (OCF System)

Table. Overall run times for the throughput runs on the “martyn” partition.

Run #   Start Time     Finish Time    Total Elapsed Time (hours:mins)
2       31 Jul 18:04   01 Aug 00:49   6:45
3       01 Aug 01:22   01 Aug 08:07   6:45
4       01 Aug 08:32   01 Aug 15:18   6:46
5       01 Aug 17:08   01 Aug 23:52   6:44
6       02 Aug 03:20   02 Aug 10:06   6:46

Table. Overall run times for the throughput runs on the “colin” partition.

Run #   Start Time     Finish Time    Total Elapsed Time (hours:mins)
1       02 Aug 12:24   02 Aug 19:05   6:41
2       03 Aug 02:41   03 Aug 09:18   6:37
3       03 Aug 10:34   03 Aug 17:18   6:44
4       03 Aug 17:35   04 Aug 00:26   6:51
5       04 Aug 01:00   04 Aug 07:45   6:45
Results of "throughput benchmarks" carried out on the new OCF Skylake cluster at Nottingham University between 31 July and 4 August 2018.

Application Performance in Materials Science 12 December 2018 104 Application Performance on Multi- core Processors

III. The Performance Evolution of two Community Codes, DL_POLY and GAMESS-UK . Outline and Contents

1. Introduction – DL_POLY and GAMESS-UK ¤ Background and Flagship community codes for the UK’s CCP5 & CCP1 – Collaboration! 2. HPC Technology – Impact of Processor & Interconnect developments ¤ The last 10 years of Intel dominance – Nehalem to Skylake 3. DL_POLY and GAMESS-UK Performance ¤ Benchmarks & Test Cases ¤ Overview of two decades of Code Performance: From the Cray T3E/900 to Intel Skylake clusters

“DL_POLY - A Performance Overview. Analysing, Understanding and Exploiting available HPC Technology”, Martyn F Guest, Alin M Elena and Aidan B G Chalk, Molecular Simulation, Accepted for publication (2019).

Application Performance on Multi-core Processors 12 December 2018 106 The Story of Two Community Codes DL_POLY and GAMESS-UK - A Performance Overview

HPC Technology – Processor and Networks Computer Systems

• Benchmark timings cover a wide variety of systems, starting with the Cray T3E/1200 in 1999. Access was initially undertaken as part of Daresbury's Distributed Computing support programme (DiSCO), with the benchmarks presented at the annual Machine Evaluation Workshops (1989-2014) and STFC's successor Computing Insight (CIUK) conferences (2015 onwards). ¤ Access was typically short-lived, as systems were provided by suppliers to enhance their profile at the MEW workshops, leaving limited opportunity for in-depth benchmarking. • The systems include a wide range of CPU offerings: representatives from over a dozen generations of Intel processors, from the early days of single-processor nodes housing Pentium 3 and Pentium 4 CPUs, through dual-processor nodes featuring dual-core Woodcrest, quad-core Clovertown & Harpertown processors, along with the Itanium and Itanium2 CPUs, through to the extensive range of multi-core offerings from Westmere to Skylake.

Application Performance on Multi-core Processors 12 December 2018 108 Computer Systems

• A variety of processors from AMD (Athlon, Opteron, MagnyCours, Interlagos etc.) along with the "power" processors from the IBM pSeries have also featured (typically in dual-processor configurations). • Just as a wide variety of processors feature, so too does a range of network interconnects. Fast Ethernet and GBit Ethernet were rapidly superseded by the increasing capabilities of the family of Infiniband interconnects from Voltaire and Mellanox (SDR, DDR, QDR, FDR, EDR and soon HDR), along with the now defunct offerings from Myrinet, Quadrics and QLogic. The Truescale interconnect from Intel, along with its successor, Omnipath, also features. • Dating from the appearance of Intel's SNB processors, many of the timings were generated with the Turbo mode feature enabled by the system administrators. Such systems are tagged with the "(T)" notation. • As for software, most of the commodity clusters featuring Intel CPUs used successive generations of Intel compilers along with Intel MPI, although a range of MPI libraries have been used – OpenMPI, MPICH, MVAPICH and MVAPICH2. Proprietary systems (Cray and IBM) used system-specific compilers and associated MPI libraries.

Application Performance on Multi-core Processors 12 December 2018 109 Intel Xeon : Westmere - Skylake

Intel Xeon generations compared: Xeon 5600 (Westmere-EP); Xeon E5-2600 (Sandy Bridge-EP); Xeon E5-2600 v4 (Broadwell-EP); Intel Xeon Scalable Processor (Skylake).
• Cores / Threads: up to 6 / 12; up to 8 / 16; up to 22 / 44; up to 28 / 56.
• Last-level cache: 12 MB; up to 20 MB; up to 55 MB; up to 38.5 MB (non-inclusive).
• Max memory channels, speed / socket: 3 x DDR3 channels, 1333; 4 x DDR3 channels, 1600; 4 channels of up to 3 RDIMMs, LRDIMMs or 3DS LRDIMMs, 2400 MHz; 6 channels of up to 2 RDIMMs, LRDIMMs or 3DS LRDIMMs, 2666 MHz.
• New instructions: AES-NI; AVX 1.0 (8 DP Flops/Clock); AVX 2.0 (16 DP Flops/Clock); AVX 512 (32 DP Flops/Clock).
• QPI / UPI Speed (GT/s): 1 QPI channel @ 6.4 GT/s; 2 QPI channels @ 8.0 GT/s; 2 QPI channels @ 9.6 GT/s; up to 3 UPI @ 10.4 GT/s.
• PCIe Lanes / Controllers / Speed (GT/s): 36 lanes PCIe 2.0 on chipset; 40 lanes / socket, integrated PCIe 3.0; 40 / 10 / PCIe 3.0 (2.5, 5, 8 GT/s); 48 / 12 / PCIe 3.0 (2.5, 5, 8 GT/s).
• Server / Workstation TDP: up to 130W server, 150W workstation; server / workstation 130W; 55 - 145W; 70 - 205W.

Application Performance on Multi-core Processors 12 December 2018 110 The Story of Two Community Codes DL_POLY and GAMESS-UK - A Performance Overview

Overview of two decades of DL_POLY Performance
DL_POLY 3/4 – Domain Decomposition, Distributed data (W. Smith and I. Todorov):
• Distribute atoms and forces across the nodes ¤ more memory efficient, can address much larger cases (10^5 - 10^7 particles)
• Shake and short-range forces require only neighbour communication ¤ communications scale linearly with the number of nodes
• Coulombic energy remains global ¤ adopt the Smooth Particle Mesh Ewald scheme, which includes a Fourier transform of the smoothed charge density (reciprocal space grid typically 64x64x64 - 128x128x128); a sketch of the underlying Ewald split follows the link below.
Benchmarks: 1. NaCl Simulation; 216,000 ions, 200 time steps, Cutoff=12Å 2. Gramicidin in water; rigid bonds + SHAKE: 792,960 ions, 50 time steps

https://www.scd.stfc.ac.uk/Pages/DL_POLY.aspx
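For context, the standard Ewald splitting on which SPME builds (a textbook form, not reproduced from the slides) separates the Coulomb sum into a short-ranged real-space part plus a smooth reciprocal-space term evaluated on the FFT grid:

E_{\mathrm{recip}} \;=\; \frac{1}{2V\varepsilon_{0}} \sum_{\mathbf{k}\neq\mathbf{0}} \frac{e^{-k^{2}/4\alpha^{2}}}{k^{2}} \Bigl|\sum_{j} q_{j}\, e^{\,i\mathbf{k}\cdot\mathbf{r}_{j}}\Bigr|^{2}

with \alpha the Ewald splitting parameter. SPME approximates the structure factor by interpolating the charges onto the regular grid (hence the 64^3 - 128^3 grids quoted above), so the sum reduces to 3D FFTs; the all-to-all communication of those FFTs is what keeps this contribution global while the real-space terms remain neighbour-only.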

Application Performance on Multi-core Processors 12 December 2018 112 The DLPOLY Benchmarks

DL_POLY Classic:
• Bench4 ¤ NaCl melt simulation with Ewald sum electrostatics & a MTS algorithm; 27,000 atoms, 500 time steps
• Bench5 ¤ Potassium disilicate glass (with 3-body forces); 8,640 atoms, 3,000 time steps
• Bench7 ¤ Simulation of a gramicidin A molecule in 4012 water molecules using neutral group electrostatics; 12,390 atoms, 5,000 time steps
DL_POLY 4:
• Test2 Benchmark ¤ NaCl simulation; 216,000 ions, 200 time steps, Cutoff=12Å
• Test8 Benchmark ¤ Gramicidin in water; rigid bonds + SHAKE; 792,960 ions, 50 time steps

Application Performance on Multi-core Processors 12 December 2018 113 DL_POLY Classic: Bench 4

Chart: DLPOLY 2 - Bench 4 (32 PEs), performance relative to the Cray T3E/1200 (32 CPUs), across a wide range of systems spanning the Cray T3E/1200, IBM SP/Winterhawk2, SGI Origin 3800, Pentium 3/4, Itanium/Itanium2 and Opteron clusters, successive Intel Xeon generations (Woodcrest, Clovertown/Harpertown, Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell, Broadwell), the IBM Power8 S822LC and Skylake Gold 6142/6150 and Platinum 8170 clusters; peak relative performance of approximately 112.

Application Performance on Multi-core Processors 12 December 2018 114 DL_POLY V2: Bench 7

Chart: DLPOLY 2 - Bench 7 (32 PEs), performance relative to the Cray T3E/1200 (32 CPUs), over essentially the same range of systems as the Bench 4 chart; peak relative performance of approximately 47.

Application Performance on Multi-core Processors 12 December 2018 115 DL_POLY 3/4 – Gramicidin (128 Cores)

Chart: DLPOLY 3/4 - Gramicidin (128 cores), performance relative to the IBM e326 Opteron 280/2.4GHz + GbitE, with separate DL_POLY 3 and DL_POLY 4 series, across systems from dual-core Opteron and Woodcrest nodes through successive Intel Xeon generations to Skylake Gold 6142/6150, Platinum 8170 and IBM Power8 clusters; peak relative performance (DL_POLY 4) of approximately 61.5.

Application Performance on Multi-core Processors 12 December 2018 116 DL_POLY 4 – Gramicidin (128 cores)

Chart: DLPOLY 4 - Gramicidin (128 cores), performance relative to the IBM e326 Opteron 280/2.4GHz / GbitE, with the systems grouped by processor family (E5-26xx, E5-26xx v2, v3, v4 and Intel SKL, plus the Atos AMD EPYC 7601); peak relative performance of approximately 61.5.

Application Performance on Multi-core Processors 12 December 2018 117 DL_POLY4 – Gramicidin Perf Report

Chart: Performance Report data (32-256 PEs) for the DL_POLY 4 Gramicidin benchmark (Smooth Particle Mesh Ewald scheme). Two radar plots over 32, 64, 128 and 256 PEs: the total wallclock time breakdown (CPU % vs. MPI %) and the CPU time breakdown (scalar numeric ops, vector numeric ops and memory accesses, %).

Application Performance on Multi-core Processors 12 December 2018 118 The Story of Two Community Codes DL_POLY and GAMESS-UK - A Performance Overview

Overview of two decades of GAMESS-UK Performance Large-Scale Parallel Ab-Initio Calculations

• GAMESS-UK now has two parallelisation schemes: ¤ The traditional version based on the Global Array tools • retains a lot of replicated data • limited to about 4000 atomic basis functions

¤ Subsequent developments by Ian Bush (High Performance Applications Group, Daresbury, now at Oxford University via NAG Ltd.) have extended the system sizes available for treatment by both GAMESS-UK (molecular systems) and CRYSTAL (periodic systems) • Partial introduction of “Distributed Data” architecture… • MPI/ScaLAPACK based
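To put the ~4,000 basis-function ceiling quoted above in context (a back-of-envelope estimate of my own, not from the slides): each replicated N x N matrix held per MPI process in double precision costs

N^{2} \times 8\ \mathrm{bytes} \;=\; 4000^{2} \times 8\ \mathrm{B} \;\approx\; 128\ \mathrm{MB\ per\ matrix\ per\ process}

so a handful of replicated matrices on every core of a 40-core node already claims tens of GB of node memory, whereas the MPI/ScaLAPACK distributed-data route spreads each matrix over the P processes at roughly N^2 x 8 / P bytes each.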

Application Performance on Multi-core Processors 12 December 2018 120 The GAMESS-UK Benchmarks

Five representative examples of increasing complexity.

• Cyclosporin 6-31g basis (1000 GTOs) DFT B3LYP (direct SCF) • Cyclosporin 6-31g-dp basis (1855 GTOs) DFT B3LYP (direct SCF) • Valinomycin (dodecadepsipeptide) in water; DZVP2 DFT basis, HCTH functional (1620 GTOs) (direct SCF)

• Mn(CO)5H TZVP/DZP MP2 - geometry optimization

• (C6H4(CF3))2 6-31g basis DFT B3LYP geometry optimisation + analytic 2nd derivatives

Application Performance on Multi-core Processors 12 December 2018 121 GAMESS-UK. DFT B3LYP Performance

Chart: GAMESS-UK DFT B3LYP performance for Valinomycin (DZVP2 basis, 1620 GTOs; basis DZVP2_A2 (Dgauss)), relative to the Cray T3E/1200 (32 CPUs), across systems ranging from the T3E, SGI Origin and IBM pSeries through Opteron and successive Intel Xeon generations to Skylake clusters. Labelled points include the Atos Skylake Gold 6148 2.4GHz (T) // IB/EDR (factor ~65.1) and the CS48 Bull R422 Harpertown E5472 3.0GHz QC // IB (factor ~28.5).

Application Performance on Multi-core Processors 12 December 2018 122 Performance of MP2 Gradient Module

Chart: GAMESS-UK MP2 gradient module performance for the Mn(CO)5H MP2 geometry optimisation (basis: TZVP + f, 217 GTOs), relative to the Cray T3E/900 (32 CPUs), across systems from the T3E, IBM SP and SGI Origin through Pentium, Itanium and Opteron clusters and successive Intel Xeon generations to the Atos Skylake Gold 6148 cluster. Callout values of 55.3 and 45.8 are shown; the labelled systems include the Intel SNB e5-2670 [8c] 2.6GHz // IB/QDR (ppn=8) and the CS48 Bull Xeon E5472 3.0GHz QC + DDR.

Application Performance on Multi-core Processors 12 December 2018 123 GAMESS-UK – DFT Performance Report

Chart: Performance Report data (32-256 PEs) for Cyclosporin, 6-31G** basis (1855 GTOs), DFT B3LYP. Two radar plots over 32, 64, 128 and 256 PEs: the total wallclock time breakdown (CPU % vs. MPI %) and the CPU time breakdown (scalar numeric ops, vector numeric ops and memory accesses, %).

Application Performance on Multi-core Processors 12 December 2018 124 Summary

1. Introduction – DL_POLY and GAMESS-UK ¤ Background and Flagship codes for the UK’s CCP5 & CCP1 ¤ Critical role of collaborative developments 2. HPC Technology - Processor & Interconnect Technologies ¤ The last 10 years of Intel dominance – Nehalem to Skylake 3. DL_POLY and GAMESS-UK Performance ¤ Benchmarks & Test Cases ¤ Overview of two decades of Code Performance: From T3E/1200E to Intel Skylake clusters 4. Understanding Performance – Useful Tools 5. Acknowledgements and Summary

Application Performance on Multi-core Processors 12 December 2018 125 Acknowledgements

• Ludovic Sauge, Enguerrand Petit, Martyn Foster, Nick Allsopp and John Humphries (Bull/ATOS) for informative discussions and access to the Skylake & EPYC clusters at the Bull HPC Competency Centre. • David Cho, Gilad Shainer, Colin Bridger & Steve Davey for access to, and considerable assistance with, the "Helios" cluster at the HPC Advisory Council. • Joshua Weage, Martin Hilgeman, Dave Coughlin, Gilles Civario and Christopher Huggins for access to, and assistance with, the variety of Skylake and EPYC SKUs at the Dell Benchmarking Centre. • Alin Marin Elena and Ilian Todorov (STFC) for discussions around the DL_POLY software. • The DisCO programme at Daresbury Laboratory.

Application Performance on Multi-core Processors 12 December 2018 126 Final Thoughts & Summary

I. Performance Benchmarks and Cluster Systems a. Synthetic Code Performance: STREAM and IMB b. Application Code Performance: DLPOLY, GROMACS, AMBER,GAMESS_UK, VASP and Quantum Espresso c. Interconnect Performance: Intel MPI and Mellanox’s HPCX d. Processor Family and Interconnect – “core to core” and “node to node” benchmarks II. Impact of Environmental Issues in Cluster acceptance tests. a. Security patches, turbo mode and Throughput testing III. Performance profile of DL_POLY and GAMESS-UK over the past two decades

Application Performance on Multi-core Processors 12 December 2018 127