Best Practices: Application Profiling at the HPCAC High Performance Center

Pak Lui

164 Applications Best Practices Published

• Abaqus, ABySS, AcuSolve, Amber, AMG, AMR, ANSYS CFX, ANSYS Fluent, ANSYS Mechanical, BQCD, BSMBench, CAM-SE, CCSM 4.0, CESM, COSMO, CP2K, CPMD, Dacapo, DL-POLY, Eclipse, FLOW-3D, GADGET-2, Graph500, GROMACS, Himeno, HIT3D, HOOMD-blue, HPCC, HPCG, HYCOM, ICON, LAMMPS, Lattice QCD, LS-DYNA, MILC, miniFE, MM5, MPQC, MrBayes, MSC Nastran, NAMD, Nekbone, NEMO, NWChem, OpenAtom, OpenFOAM, OpenMX, OptiStruct, PAM-CRASH / VPS, PARATEC, PFLOTRAN, Pretty Fast Analysis, Quantum ESPRESSO, RADIOSS, RFD tNavigator, SNAP, SPECFEM3D, STAR-CCM+, STAR-CD, VASP, WRF

For more information, visit: http://www.hpcadvisorycouncil.com/best_practices.php

38 Applications Installation Best Practices Published

• Adaptive Mesh Refinement (AMR), Amber (for CPU), Amber (for GPU/CUDA), ANSYS Fluent 15.0.7, ANSYS Fluent 17.1, BQCD, Caffe, CASTEP 16.1, CESM, CP2K, CPMD, DL-POLY 4, ESI PAM-CRASH 2015.1, ESI PAM-CRASH / VPS 2013.1, GADGET-2, GROMACS 4.5.4, GROMACS 5.0.4 (GPU/CUDA), GROMACS 5.1.2, Himeno, HOOMD-blue, LAMMPS, LAMMPS-KOKKOS, LS-DYNA, MrBayes, NAMD, NEMO, NWChem, NWChem 6.5, Octopus, OpenFOAM, OpenMX, PyFR, Quantum ESPRESSO 4.1.2, Quantum ESPRESSO 5.1.1, Quantum ESPRESSO 5.3.0, TensorFlow 0.10.0, WRF 3.2.1, WRF 3.8

For more information, visit: http://www.hpcadvisorycouncil.com/subgroups_hpc_works.php

HPC Advisory Council HPC Center

• HPE Apollo 6000 10-node cluster
• HPE ProLiant SL230s Gen8 4-node cluster
• HPE Cluster Platform 3000SL 16-node cluster
• Dell™ PowerEdge™ C6145 6-node cluster
• Dell™ PowerEdge™ R815 11-node cluster
• Dell™ PowerEdge™ R730 32-node GPU cluster
• Dell™ PowerEdge™ R720xd/R720 36-node cluster
• Dell™ PowerEdge™ M610 38-node cluster
• IBM POWER8 GPU 8-node cluster
• Dell PowerVault MD3420 / MD3460 InfiniBand storage (Lustre)
• Dell™ PowerEdge™ C6100 4-node cluster
• Two 4-node GPU clusters

Agenda – Example of HPCAC Applications Activity

• Overview of HPC Applications Performance

• Ways to inspect, profile, and optimize HPC applications – CPU, memory, file I/O, network

• System Configurations and Tuning

• Case Studies, Performance Comparisons, Optimizations and Highlights

• Conclusions

Note

• The following research was performed under HPC Advisory Council activities – Compute resource: HPC Advisory Council Cluster Center

• The following was done to provide best practices – HPC application performance overview – Understanding HPC application communication patterns – Ways to increase HPC application productivity

Test Clusters

• HPE ProLiant DL360 Gen9 128-node (4096-core) “Hercules” cluster
 – Dual-Socket 16-Core Intel E5-2697A v4 @ 2.60 GHz CPUs
 – Memory: 256GB DDR4 2400 MHz; Memory Snoop Mode in BIOS set to Home Snoop
 – OS: RHEL 7.2, MLNX_OFED_LINUX-3.4-2.0.0.0 InfiniBand SW stack
 – Mellanox ConnectX-4 EDR 100Gb/s InfiniBand Adapters, Mellanox Switch-IB SB7800 36-port EDR 100Gb/s InfiniBand Switch
 – Intel® Omni-Path Host Fabric Interface (HFI) 100Gb/s Adapter, Intel® Omni-Path Edge Switch 100 Series
 – MPI: Intel MPI 2017, Open MPI 2.0.2

• IBM OpenPOWER 8-node “Telesto” cluster – IBM Power System S822LC (8335-GTA)
 – IBM: Dual-Socket 10-Core @ 3.491 GHz CPUs; Memory: 256GB DDR3 PC3-14900
 – Wistron OpenPOWER servers (Dual-Socket 8-Core @ 3.867 GHz CPUs; Memory: 224GB DDR3 PC3-14900)
 – OS: RHEL 7.2, MLNX_OFED_LINUX-3.4-1.0.0.0 InfiniBand SW stack
 – Mellanox ConnectX-4 EDR 100Gb/s InfiniBand Adapters, Mellanox Switch-IB SB7800 36-port EDR 100Gb/s InfiniBand Switch
 – Compilers: GNU compilers 4.8.5, IBM XL Compilers 13.1.3
 – MPI: Open MPI 2.0.2, IBM Spectrum MPI 10.1.0.2; MPI Profiler: IPM

• Dell PowerEdge R730 32-node (1024-core) “Thor” cluster
 – Dual-Socket 16-Core Intel E5-2697A v4 @ 2.60 GHz CPUs (BIOS: Maximum Performance, Home Snoop)
 – Memory: 256GB DDR4 2400 MHz; Memory Snoop Mode in BIOS set to Home Snoop
 – OS: RHEL 7.2, MLNX_OFED_LINUX-3.4-1.0.0.0 InfiniBand SW stack
 – Mellanox ConnectX-4 EDR 100Gb/s InfiniBand Adapters, Mellanox Switch-IB SB7800 36-port EDR 100Gb/s InfiniBand Switch
 – Intel® Omni-Path Host Fabric Interface (HFI) 100Gb/s Adapter, Intel® Omni-Path Edge Switch 100 Series
 – Dell InfiniBand-based Lustre storage built on Dell PowerVault MD3460 and Dell PowerVault MD3420
 – Compilers: Intel Compilers 2016.4.258
 – MPI: Intel Parallel Studio XE 2016 Update 4, Mellanox HPC-X MPI Toolkit v1.8; MPI Profiler: IPM (from Mellanox HPC-X)
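The MPI-level profiles shown in the following slides were collected with IPM. As a rough illustration of the PMPI interposition technique such profilers rely on (a minimal sketch in C using the MPI standard's profiling interface, not IPM's actual implementation), a wrapper can intercept an MPI call, time it, and forward it to the real implementation:

/* Minimal PMPI interposition sketch (illustrative only, not IPM source).
 * The MPI standard guarantees every MPI_Xxx call has a PMPI_Xxx entry
 * point, so a wrapper linked ahead of the MPI library can time calls. */
#include <mpi.h>
#include <stdio.h>

static long   allreduce_calls = 0;
static double allreduce_time  = 0.0;

int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
    allreduce_time  += MPI_Wtime() - t0;
    allreduce_calls += 1;
    return rc;
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("MPI_Allreduce: %ld calls, %.3f s\n",
               allreduce_calls, allreduce_time);
    return PMPI_Finalize();
}

Per-rank counters like these, aggregated at MPI_Finalize, are the basis for the per-call percentages and message-size distributions shown in the profiling slides.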

GROMACS (GROningen MAchine for Chemical Simulation)

• A molecular dynamics simulation package
• Primarily designed for biochemical molecules like proteins, lipids and nucleic acids
 – A lot of algorithmic optimizations have been introduced in the code
 – Extremely fast at calculating the nonbonded interactions
• Ongoing development to extend GROMACS with interfaces to both quantum chemistry and bioinformatics/databases
• Open source software released under the GPL

GROMACS Performance – MPI Libraries

• Small performance gain in the latest GROMACS version – About 3% better performance with GROMACS 2016.2 than with 5.1.2

Higher is better; optimized parameters used.

GROMACS Performance – Interconnects

• EDR InfiniBand enables higher scalability than Omni-Path for GROMACS – InfiniBand delivers 136% better scaling versus Omni-Path for 128 nodes

Higher is better; Intel MPI.

GROMACS Performance – MPI

• Intel MPI includes multiple transport providers for running on InfiniBand fabrics – Native transport provides up to 31% better scaling than uDAPL provider at 128 nodes

Higher is better; optimized parameters used.

GROMACS Profiling – % of MPI Calls

• For the most time consuming MPI calls (as % of MPI time): – MPI_Iprobe (51%), MPI_Allreduce (23%), MPI_Bcast (16%), MPI_Waitall (9%)

32 Nodes / 1024 Processes
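The dominant MPI_Iprobe share is typical of codes that poll for incoming messages while overlapping communication with computation: wall time accumulates in the probe loop rather than in data transfer. A hedged C sketch of that polling pattern (illustrative only, not GROMACS source):

/* Polling-for-messages pattern (illustrative sketch, not GROMACS code).
 * Time accrues under MPI_Iprobe even though little data is moving. */
#include <mpi.h>

void poll_and_receive(MPI_Comm comm, double *buf, int expected)
{
    int received = 0;
    while (received < expected) {
        int flag = 0;
        MPI_Status status;

        /* Non-blocking check for any incoming message; profiler time
         * attributed to MPI_Iprobe accumulates in calls like this one. */
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &status);

        if (flag) {
            int count = 0;
            MPI_Get_count(&status, MPI_DOUBLE, &count);
            /* buf is assumed large enough for any incoming message. */
            MPI_Recv(buf, count, MPI_DOUBLE, status.MPI_SOURCE,
                     status.MPI_TAG, comm, MPI_STATUS_IGNORE);
            received++;
        } else {
            /* ... advance local computation, then poll again ... */
        }
    }
}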

NAMD Performance – MPI Libraries

• Different MPI options deliver the best performance on different benchmarks

Higher is better; optimized parameters used.

NAMD Profiling – % of MPI Calls

• For the most time consuming MPI calls (as % of MPI time): – MPI_Iprobe (90%), MPI_Isend (4%), MPI_Test (3%), MPI_Recv (3%)

64 Nodes / 2048 Processes

BSMBench Profiling – % of MPI Calls

• Major MPI calls (as % of wall time): – Balance: MPI_Barrier (26%), MPI_Allreduce (6%), MPI_Waitall (5%), MPI_Isend (4%) – Communications: MPI_Barrier (14%), MPI_Allreduce (5%), MPI_Waitall (5%), MPI_Isend (2%) – Compute: MPI_Barrier (14%), MPI_Allreduce (5%), MPI_Waitall (5%), MPI_Isend (1%)

Balance / Communications / Compute

32 Nodes / 1024 Processes

BSMBench Profiling – MPI Message Distribution

• Similar communication pattern seen across all 3 examples: – Balance: MPI_Barrier: 0-byte, 22% wall, MPI_Allreduce: 8-byte, 5% wall – Communications: MPI_Barrier: 0-byte, 26% wall, MPI_Allreduce: 8-byte, 5% wall – Compute: MPI_Barrier: 0-byte, 13% wall, MPI_Allreduce: 8-byte, 5% wall

Balance / Communications / Compute

32 Nodes / 1024 Processes
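The 8-byte MPI_Allreduce messages are consistent with global reductions of a single double-precision value (for example, a global sum computed every iteration), often preceded by a barrier as in the profile above. A hedged C sketch of that pattern (illustrative only, not BSMBench source):

/* Global-scalar reduction pattern (illustrative sketch, not BSMBench code).
 * Reducing one double across all ranks produces the 8-byte MPI_Allreduce
 * messages seen in the message-size distribution. */
#include <mpi.h>

double global_sum(double local_value, MPI_Comm comm)
{
    double global = 0.0;

    /* Explicit synchronization before the reduction, matching the heavy
     * MPI_Barrier usage reported in the profile. */
    MPI_Barrier(comm);

    MPI_Allreduce(&local_value, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}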

BSMBench Profiling – Time Spent in MPI

• Communication time across the MPI processes is mostly balanced – There does not appear to be any significant load imbalance in the communication layer

Balance / Communications / Compute

32 Nodes / 1024 Processes

BSMBench Performance – MPI Libraries

• Comparison between two commercially available MPI libraries • Intel MPI and HPC-X deliver similar performance – HPC-X demonstrates a 5% advantage at 32 nodes

Higher is better; 32 MPI processes per node.

BSMBench Summary

• Benchmark for BSM (Beyond the Standard Model) lattice physics – Utilizes both compute and network communications
• MPI profiling – Most MPI time is spent on collective operations and non-blocking communications
 – Heavy use of MPI collective operations (MPI_Allreduce, MPI_Barrier)
 – Similar communication patterns seen across all three examples:
   • Balance: MPI_Barrier: 0-byte, 22% wall; MPI_Allreduce: 8-byte, 5% wall
   • Communications: MPI_Barrier: 0-byte, 26% wall; MPI_Allreduce: 8-byte, 5% wall
   • Compute: MPI_Barrier: 0-byte, 13% wall; MPI_Allreduce: 8-byte, 5% wall

HPCG Performance – SMT

• Simultaneous Multithreading (SMT) provides additional hardware threads for compute
• Additional performance gain is seen with SMT enabled
 – Up to 45% higher performance with 5 SMT threads per core versus no SMT
 – More MPI ranks can run on the SMT threads (see the rank arithmetic below), but they require more memory
 – Memory bandwidth appears to saturate at around 5 SMT threads

Higher is better.
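As a rough illustration (assuming the dual-socket 10-core POWER8 “Telesto” nodes from the test-cluster slide and one MPI rank per hardware thread in use), the rank count per node grows linearly with the number of SMT threads used per core, which is why more memory is needed as SMT increases:

\text{MPI ranks per node} \;=\; \underbrace{2 \times 10}_{\text{20 physical cores}} \times \, t \qquad t=1 \Rightarrow 20,\quad t=5 \Rightarrow 100,\quad t=8 \Rightarrow 160

POWER8 supports up to eight hardware threads per core; the observation that gains level off near t = 5 is consistent with memory bandwidth, rather than thread count, becoming the bottleneck.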

HPCG Performance – System Architecture

• Power CPU demonstrates 8% higher performance compared to x86 – Performance gain on a single node is approximately 8% for Power8 – 32 cores per node used for x86, versus 20 physical cores used per node for Power

Higher is better; SMT=5 for IBM POWER8.

HPCG Performance – Matrix Size

• The sparse matrix size specified determines the amount of memory consumed
 – The memory used for the sparse matrix computation is bounded by the matrix size specified (see the sketch below)
 – Using a slightly smaller matrix size appeared to have no effect on performance
• A shorter run duration appeared to have no effect on performance
 – The standard runtime for HPCG is 30 minutes; shorter runs appear to perform the same

Higher is better; SMT=1 for IBM POWER8.
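A hedged sketch of why memory use is bounded by the specified matrix size: the HPCG reference benchmark builds its sparse system from a local n_x × n_y × n_z grid with a 27-point stencil, so per-rank memory grows roughly linearly with the local grid volume. The constant c below (covering matrix values, column indices and the work vectors) is on the order of a few hundred bytes per row; it is an estimate for illustration, not a figure from these slides:

\text{rows per rank} = n_x\, n_y\, n_z, \qquad \text{nonzeros per row} \le 27, \qquad \text{memory per rank} \;\approx\; c \cdot n_x\, n_y\, n_z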

HPCG Profiling – % of MPI Calls

• For the most time-consuming MPI calls (as % of wall time): – MPI_Wait (1.5%), MPI_Send (%), MPI_Waitall (0.5%) • The percentage of time spent in communication is small

HPCG Summary

• HPCG project – A potential replacement for the High Performance LINPACK (HPL) benchmark
• Simultaneous Multithreading (SMT) provides additional benefit for compute
 – Up to 45% higher performance with 5 SMT threads per core versus no SMT
• Power CPU shows 8% higher performance than x86
 – 32 cores per node used for x86, versus 20 cores per node for Power
• The sparse matrix size specified determines the amount of memory consumed
 – The memory used for the sparse matrix computation is bounded by the matrix size specified
 – Using a slightly smaller matrix size appeared to have no effect on performance
 – A shorter run duration appeared to have no effect on performance
 – The standard runtime for HPCG is 30 minutes; shorter runs appear to perform the same
• The percentage of time spent in communication is small
 – For the most time-consuming MPI calls (as % of wall time): MPI_Wait (1.5%), MPI_Send (%), MPI_Waitall (0.5%)

Thank You

All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC Advisory Council makes no representation to the accuracy and completeness of the information contained herein. HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein