MPI Profiling
Best Practices: Application Profiling at the HPCAC High Performance Center
Pak Lui

164 Applications Best Practices Published
• Abaqus • ABySS • AcuSolve • Amber • AMG • AMR • ANSYS CFX • ANSYS Fluent • ANSYS Mechanical • BQCD • BSMBench • CAM-SE • CCSM 4.0 • CESM • COSMO • CP2K • CPMD • Dacapo • Desmond • DL-POLY • Eclipse • FLOW-3D • GADGET-2 • Graph500 • GROMACS • Himeno • HIT3D • HOOMD-blue • HPCC • HPCG • HYCOM • ICON • LAMMPS • Lattice QCD • LS-DYNA • MILC • miniFE • MM5 • MPQC • MR Bayes • MSC Nastran • NAMD • Nekbone • NEMO • NWChem • Octopus • OpenAtom • OpenFOAM • OpenMX • OptiStruct • PAM-CRASH / VPS • PARATEC • PFLOTRAN • Pretty Fast Analysis • Quantum ESPRESSO • RADIOSS • RFD tNavigator • SNAP • SPECFEM3D • STAR-CCM+ • STAR-CD • VASP • WRF
For more information, visit: http://www.hpcadvisorycouncil.com/best_practices.php

38 Applications Installation Best Practices Published
• Adaptive Mesh Refinement (AMR) • Amber (for CPU) • Amber (for GPU/CUDA) • ANSYS Fluent 15.0.7 • ANSYS Fluent 17.1 • BQCD • Caffe • CASTEP 16.1 • CESM • CP2K • CPMD • DL-POLY 4 • ESI PAM-CRASH 2015.1 • ESI PAM-CRASH / VPS 2013.1 • GADGET-2 • GROMACS 4.5.4 • GROMACS 5.0.4 (GPU/CUDA) • GROMACS 5.1.2 • Himeno • HOOMD Blue • LAMMPS • LAMMPS-KOKKOS • LS-DYNA • MrBayes • NAMD • NEMO • NWChem • NWChem 6.5 • Octopus • OpenFOAM • OpenMX • PyFR • Quantum ESPRESSO 4.1.2 • Quantum ESPRESSO 5.1.1 • Quantum ESPRESSO 5.3.0 • TensorFlow 0.10.0 • WRF 3.2.1 • WRF 3.8
For more information, visit: http://www.hpcadvisorycouncil.com/subgroups_hpc_works.php

HPC Advisory Council HPC Center
[Figure: photo grid of HPC Center systems, 4-node to 38-node clusters, including HPE Apollo 6000, HPE ProLiant SL230s Gen8, HPE Cluster Platform 3000SL, Dell PowerEdge C6145, R815, R730, M610, R720xd/R720, C6100, an IBM POWER8 GPU cluster, and InfiniBand-based Lustre storage built on Dell PowerVault MD3420/MD3460.]

Agenda – Example of HPCAC Applications Activity
• Overview of HPC Applications Performance
• Ways to Inspect, Profile, Optimize HPC Applications
  – CPU, memory, file I/O, network
• System Configurations and Tuning
• Case Studies, Performance Comparisons, Optimizations and Highlights
• Conclusions

Note
• The following research was performed under the HPC Advisory Council activities
  – Compute resource: HPC Advisory Council Cluster Center
• The following was done to provide best practices
  – HPC application performance overview
  – Understanding HPC application communication patterns
  – Ways to increase HPC application productivity

Test Clusters
• HPE ProLiant DL360 Gen9 128-node (4096-core) "Hercules" cluster
  – Dual-Socket 16-Core Intel E5-2697A v4 @ 2.60 GHz CPUs
  – Memory: 256GB DDR4 2400 MHz; Memory Snoop Mode in BIOS set to Home Snoop
  – OS: RHEL 7.2, MLNX_OFED_LINUX-3.4-2.0.0.0 InfiniBand SW stack
  – Mellanox ConnectX-4 EDR 100Gb/s InfiniBand adapters, Mellanox Switch-IB SB7800 36-port EDR 100Gb/s InfiniBand switch
  – Intel® Omni-Path Host Fabric Interface (HFI) 100Gb/s adapter, Intel® Omni-Path Edge Switch 100 Series
  – MPI: Intel MPI 2017, Open MPI 2.0.2
• IBM OpenPOWER 8-node "Telesto" cluster – IBM Power System S822LC (8335-GTA)
  – IBM: Dual-Socket 10-Core @ 3.491 GHz CPUs, Memory: 256GB DDR3 PC3-14900
  – Wistron OpenPOWER servers (Dual-Socket 8-Core @ 3.867 GHz CPUs, Memory: 224GB DDR3 PC3-14900)
  – OS: RHEL 7.2, MLNX_OFED_LINUX-3.4-1.0.0.0 InfiniBand SW stack
  – Mellanox ConnectX-4 EDR 100Gb/s InfiniBand adapters, Mellanox Switch-IB SB7800 36-port EDR 100Gb/s InfiniBand switch
  – Compilers: GNU compilers 4.8.5, IBM XL Compilers 13.1.3
  – MPI: Open MPI 2.0.2, IBM Spectrum MPI 10.1.0.2; MPI profiler: IPM
• Dell PowerEdge R730 32-node (1024-core) "Thor" cluster
  – Dual-Socket 16-Core Intel E5-2697A v4 @ 2.60 GHz CPUs (BIOS: Maximum Performance, Home Snoop)
  – Memory: 256GB DDR4 2400 MHz; Memory Snoop Mode in BIOS set to Home Snoop
  – OS: RHEL 7.2, MLNX_OFED_LINUX-3.4-1.0.0.0 InfiniBand SW stack
  – Mellanox ConnectX-4 EDR 100Gb/s InfiniBand adapters, Mellanox Switch-IB SB7800 36-port EDR 100Gb/s InfiniBand switch
  – Intel® Omni-Path Host Fabric Interface (HFI) 100Gb/s adapter, Intel® Omni-Path Edge Switch 100 Series
  – Dell InfiniBand-based Lustre storage built on Dell PowerVault MD3460 and Dell PowerVault MD3420
  – Compilers: Intel Compilers 2016.4.258
  – MPI: Intel Parallel Studio XE 2016 Update 4, Mellanox HPC-X MPI Toolkit v1.8; MPI profiler: IPM (from Mellanox HPC-X; see the PMPI sketch below)
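Both the Telesto and Thor test beds list IPM as the MPI profiler. IPM, like most lightweight MPI profilers, works through the standard PMPI profiling interface: every MPI routine also exists under a PMPI_ name, so a profiling layer can redefine the MPI_ symbol, record timing, and forward the call to the PMPI_ version. The wrapper below is a minimal sketch of that mechanism for MPI_Allreduce only; it is illustrative and not IPM's actual code.

```c
/* Minimal PMPI interposition sketch: time every MPI_Allreduce call.
 * Illustrative only; real profilers such as IPM wrap the full MPI API
 * and aggregate statistics per call site and message size. */
#include <mpi.h>
#include <stdio.h>

static double allreduce_time  = 0.0;   /* accumulated seconds in MPI_Allreduce */
static long   allreduce_calls = 0;

int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
    allreduce_time += MPI_Wtime() - t0;
    allreduce_calls++;
    return rc;
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Report per-rank totals just before the real finalize */
    fprintf(stderr, "[rank %d] MPI_Allreduce: %ld calls, %.3f s\n",
            rank, allreduce_calls, allreduce_time);
    return PMPI_Finalize();
}
```

Built as a shared library with mpicc and preloaded at run time (or linked ahead of the application), such a wrapper records call counts and time per rank without modifying or recompiling the application, which is the same basic mechanism behind the IPM profiles shown in the following slides.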
GROMACS (GROningen MAchine for Chemical Simulation)
• A molecular dynamics simulation package
• Primarily designed for biochemical molecules like proteins, lipids and nucleic acids
  – Many algorithmic optimizations have been introduced in the code
  – Extremely fast at calculating the nonbonded interactions
• Ongoing development to extend GROMACS with interfaces to both Quantum Chemistry and Bioinformatics/databases
• Open-source software released under the GPL

GROMACS Performance – MPI Libraries
• Small performance gain in the latest GROMACS version
  – About 3% better performance with GROMACS 2016.2 than with 5.1.2
[Chart: higher is better; optimized parameters used]

GROMACS Performance – Interconnects
• EDR InfiniBand enables higher scalability than Omni-Path for GROMACS
  – InfiniBand delivers 136% better scaling than Omni-Path at 128 nodes
[Chart: higher is better; Intel MPI]

GROMACS Performance – MPI
• Intel MPI includes multiple transport providers for running on InfiniBand fabrics
  – The native transport provides up to 31% better scaling than the uDAPL provider at 128 nodes
[Chart: higher is better; optimized parameters used]

GROMACS Profiling – % of MPI Calls
• The most time-consuming MPI calls (as % of MPI time):
  – MPI_Iprobe (51%), MPI_Allreduce (23%), MPI_Bcast (16%), MPI_Waitall (9%)
[Profile: 32 nodes / 1024 processes]

NAMD Performance – MPI Libraries
• Different MPI options deliver the best performance on different benchmarks
[Chart: higher is better; optimized parameters used]

NAMD Profiling – % of MPI Calls
• The most time-consuming MPI calls (as % of MPI time):
  – MPI_Iprobe (90%), MPI_Isend (4%), MPI_Test (3%), MPI_Recv (3%)
[Profile: 64 nodes / 2048 processes]

BSMBench Profiling – % of MPI Calls
• Major MPI calls (as % of wall time):
  – Balance: MPI_Barrier (26%), MPI_Allreduce (6%), MPI_Waitall (5%), MPI_Isend (4%)
  – Communications: MPI_Barrier (14%), MPI_Allreduce (5%), MPI_Waitall (5%), MPI_Isend (2%)
  – Compute: MPI_Barrier (14%), MPI_Allreduce (5%), MPI_Waitall (5%), MPI_Isend (1%)
[Profiles for the Balance, Communications and Compute tests; 32 nodes / 1024 processes]

BSMBench Profiling – MPI Msg Distribution
• Similar communication pattern seen across all three tests:
  – Balance: MPI_Barrier: 0-byte, 22% of wall time; MPI_Allreduce: 8-byte, 5% of wall time
  – Communications: MPI_Barrier: 0-byte, 26% of wall time; MPI_Allreduce: 8-byte, 5% of wall time
  – Compute: MPI_Barrier: 0-byte, 13% of wall time; MPI_Allreduce: 8-byte, 5% of wall time
[Message-size distributions for the Balance, Communications and Compute tests; 32 nodes / 1024 processes]
(A minimal timing sketch for these two collectives follows below.)
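The profiles above are dominated by small, latency-bound operations: MPI_Iprobe polling in GROMACS and NAMD, and 0-byte MPI_Barrier plus 8-byte MPI_Allreduce in BSMBench. A quick way to see what those operations cost on a given fabric and node count is a micro-benchmark that times exactly those two collectives. The sketch below is a minimal example; the iteration count and output format are arbitrary choices and are not part of BSMBench.

```c
/* Minimal latency check for the collectives that dominate the BSMBench
 * profiles: 0-byte MPI_Barrier and 8-byte (single double) MPI_Allreduce. */
#include <mpi.h>
#include <stdio.h>

#define ITERS 10000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Warm up the communication paths before timing */
    for (int i = 0; i < 100; i++)
        MPI_Barrier(MPI_COMM_WORLD);

    /* Average MPI_Barrier latency */
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double barrier_us = (MPI_Wtime() - t0) / ITERS * 1e6;

    /* Average latency of an 8-byte MPI_Allreduce */
    double in = (double)rank, out = 0.0;
    t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double allreduce_us = (MPI_Wtime() - t0) / ITERS * 1e6;

    if (rank == 0)
        printf("avg MPI_Barrier: %.2f us, avg 8-byte MPI_Allreduce: %.2f us\n",
               barrier_us, allreduce_us);

    MPI_Finalize();
    return 0;
}
```

Run at the same scale as the application, the per-call latencies reported by a sketch like this can help explain why an interconnect or MPI library with lower small-message and collective latency scales better for these codes.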
BSMBench Profiling – Time Spent in MPI
• Communication time is mostly balanced across the MPI processes
  – No significant load imbalance appears in the communication layer
[Time-in-MPI breakdown for the Balance, Communications and Compute tests; 32 nodes / 1024 processes]

BSMBench Performance – MPI Libraries
• Comparison between two commercially available MPI libraries
• Intel MPI and HPC-X deliver similar performance
  – HPC-X demonstrates a 5% advantage at 32 nodes
[Chart: higher is better; 32 MPI processes per node]

BSMBench Summary
• Benchmark for BSM Lattice Physics
  – Utilizes both compute and network communications
• MPI profiling
  – Most MPI time is spent in MPI collective operations and non-blocking communications
    • Heavy use of MPI collective operations (MPI_Allreduce, MPI_Barrier)
  – Similar communication patterns seen across all three tests
    • Balance: MPI_Barrier: 0-byte, 22% of wall time; MPI_Allreduce: 8-byte, 5% of wall time
    • Comms: MPI_Barrier: 0-byte, 26% of wall time; MPI_Allreduce: 8-byte, 5% of wall time
    • Compute: MPI_Barrier: 0-byte, 13% of wall time; MPI_Allreduce: 8-byte, 5% of wall time

HPCG Performance – SMT
• Simultaneous Multithreading (SMT) allows additional hardware threads for compute
• Additional performance gain is seen with SMT enabled
  – Up to 45% performance gain between no SMT and 5 SMT threads
  – Using more MPI ranks on the SMT cores requires more memory per node
  – Memory bandwidth saturation appears at around 5 SMT threads
[Chart: higher is better]

HPCG Performance – System Architecture
• The POWER8 CPU demonstrates 8% higher performance than x86
  – The single-node gain for POWER8 is approximately 8%
  – 32 cores per node used for x86, versus 20 physical cores per node for POWER8
[Chart: higher is better; SMT=5 for IBM POWER8]

HPCG Performance – Matrix Size
• The specified sparse-matrix size determines the amount of memory consumed
  – The memory used for the sparse-matrix computation is bounded by the specified matrix size
  – Using a slightly smaller matrix size appeared to have no effect on performance
• A shorter run duration also appeared to have no effect on performance
  – The standard HPCG runtime is 30 minutes; shorter runs performed the same
[Chart: higher is better; SMT=1 for IBM POWER8]

HPCG Profiling – % of MPI Calls
• The most time-consuming MPI calls (as % of wall time):
  – MPI_Wait (1.5%), MPI_Send, MPI_Waitall (0.5%)
• The percentage of time spent in communication is small (see the halo-exchange sketch below)

HPCG Summary
• HPCG Project
  – A potential replacement for the High Performance LINPACK (HPL) benchmark
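The small MPI footprint in the HPCG profile (MPI_Wait, MPI_Send, MPI_Waitall at a few percent of wall time) is consistent with HPCG's communication structure: a nearest-neighbour halo exchange around the sparse matrix-vector product, plus an MPI_Allreduce per dot product. The sketch below shows that exchange pattern in schematic form only; the 1-D ring topology, buffer length and tags are invented for illustration and are not taken from the HPCG source.

```c
/* Schematic halo exchange in the style of a sparse iterative solver:
 * post MPI_Irecv for each neighbour, MPI_Send our boundary data, then wait.
 * The neighbour topology here is a toy 1-D ring; HPCG derives its real
 * neighbour lists from the 3-D partitioning of the sparse matrix. */
#include <mpi.h>
#include <stdio.h>

#define HALO_LEN 1024   /* illustrative boundary-buffer length */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank - 1 + size) % size;   /* toy neighbours on a ring */
    int right = (rank + 1) % size;

    double send_left[HALO_LEN], send_right[HALO_LEN];
    double recv_left[HALO_LEN], recv_right[HALO_LEN];
    for (int i = 0; i < HALO_LEN; i++)
        send_left[i] = send_right[i] = (double)rank;

    MPI_Request reqs[2];
    /* 1. Post non-blocking receives first (time spent completing them shows
     *    up as MPI_Wait / MPI_Waitall in the profile) */
    MPI_Irecv(recv_left,  HALO_LEN, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(recv_right, HALO_LEN, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);

    /* 2. Send our boundary values (shows up as MPI_Send time) */
    MPI_Send(send_right, HALO_LEN, MPI_DOUBLE, right, 0, MPI_COMM_WORLD);
    MPI_Send(send_left,  HALO_LEN, MPI_DOUBLE, left,  1, MPI_COMM_WORLD);

    /* 3. Wait for the halos to arrive before the next sparse mat-vec */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    if (rank == 0)
        printf("halo exchange complete on %d ranks\n", size);

    MPI_Finalize();
    return 0;
}
```

Because each rank exchanges only thin boundary layers of a large local subdomain, this traffic stays small relative to the memory-bandwidth-bound local computation, which is why HPCG's time in MPI is limited even at scale.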