Scalability Performance Analysis and Tuning on OpenFOAM
Pak Lui – HPC Advisory Council

4th ESI OpenFOAM User Conference 2016 – October 11 & 12 – Cologne, Germany
Copyright © ESI Group, 2016. All rights reserved.

Agenda

• Introduction to HPC Advisory Council
• Benchmark Configurations
• Performance Benchmark Testing
• Results and MPI Profiles
• Summary
• Q&A / For More Information

The HPC Advisory Council

Mission Statement
• World-wide HPC non-profit organization (420+ members)
• Bridges the gap between HPC usage and its potential
• Provides best practices and a support/development center
• Explores future technologies and future developments
• Leading edge solutions and technology demonstrations

HPC Advisory Council Members

HPCAC Centers

Centers: HPCAC HQ, Swiss (CSCS), China, Austin

HPC Advisory Council HPC Center

• HP Cluster Platform 3000SL – 16-node cluster
• Dell™ PowerEdge™ R730 – 32-node cluster
• HP ProLiant XL230a Gen9 – 10-node cluster
• Dell PowerVault MD3420 / MD3460 – InfiniBand Storage (Lustre)
• Dell™ PowerEdge™ C6145 – 6-node cluster
• Dell™ PowerEdge™ R815 – 11-node cluster
• Dell™ PowerEdge™ R720xd/R720 – 32-node cluster
• Dell™ PowerEdge™ M610 – 38-node cluster
• Dell™ PowerEdge™ C6100 – 4-node cluster
• White-box InfiniBand-based Storage (Lustre)

158 Applications Best Practices Published

• CPMD • Lattice QCD • OpenFOAM • ABySS • Dacapo • LAMMPS • OpenMX • AcuSolve • Desmond • LS-DYNA • OptiStruct • Amber • DL-POLY • miniFE • PARATEC • Eclipse • AMG • MILC • PFA • AMR • FLOW-3D • MSC Nastran • PFLOTRAN • CFX • GADGET-2 • MR Bayes • Quantum ESPRESSO • ANSYS FLUENT • Graph500 • MM5 • RADIOSS • ANSYS Mechanical • GROMACS • MPQC • SNAP • BQCD • Himeno • NAMD • SPECFEM3D • CAM-SE • HIT3D • Nekbone • STAR-CCM+ • CCSM • HOOMD-blue • CESM • HPCC • NEMO • STAR-CD • COSMO • HPCG • NWChem • VASP • CP2K • HYCOM • Octopus • VSP • ICON • OpenAtom • WRF

For more information, visit: http://www.hpcadvisorycouncil.com/best_practices.php

University Award Program

• University award program
 ‣ Universities / individuals are encouraged to submit proposals for advanced research

• The selected proposal will be provided with:
 ‣ Exclusive computation time on the HPC Advisory Council’s Compute Center
 ‣ An invitation to present at one of the HPC Advisory Council’s worldwide workshops
 ‣ Publication of the research results on the HPC Advisory Council website

• 2010 award winner: Dr. Xiangqian Hu, Duke University
 ‣ Topic: “Massively Parallel Quantum Mechanical Simulations for Liquid Water”
• 2011 award winner: Dr. Marco Aldinucci, University of Torino
 ‣ Topic: “Effective Streaming on Multi-core by Way of the FastFlow Framework”
• 2012 award winner: Jacob Nelson, University of Washington
 ‣ Topic: “Runtime Support for Sparse Graph Applications”
• 2013 award winner: Antonis Karalis
 ‣ Topic: “Music Production using HPC”
• 2014 award winner: Antonis Karalis
 ‣ Topic: “Music Production using HPC”
• 2015 award winner: Christian Kniep
 ‣ Topic: Docker

• To submit a proposal – please check the HPC Advisory Council web site

HPCAC - ISC’16 Student Cluster Competition

• University-based teams compete and demonstrate the capabilities of state-of-the-art HPC systems and applications on the ISC High Performance (ISC HPC) show floor
• The Student Cluster Competition is designed to introduce the next generation of students to the high performance computing world and community
• Submission for participation in the ISC’17 SCC is now open

ISC'16 – Student Cluster Competition Teams

Getting Ready for the 2017 Student Cluster Competition

2016 HPC Advisory Council Conferences – Introduction

• HPC Advisory Council (HPCAC)
 ‣ 429+ members, http://www.hpcadvisorycouncil.com/
 ‣ Application best practices, case studies
 ‣ Benchmarking center with remote access for users
 ‣ World-wide workshops
 ‣ Value add for your customers to stay up to date and in tune with the HPC market
• 2016 Conferences
 ‣ USA (Stanford University) – February 24-25
 ‣ Switzerland (CSCS) – March 21-23
 ‣ Spain (BSC) – September 21
 ‣ China (HPC China) – October 26
• For more information
 ‣ www.hpcadvisorycouncil.com
 ‣ [email protected]

Objectives

• The following was done to provide best practices:
 ‣ OpenFOAM performance benchmarking
 ‣ Interconnect performance comparisons
 ‣ MPI performance comparison
 ‣ Understanding OpenFOAM communication patterns

• The presented results will demonstrate
 ‣ The scalability of the compute environment to provide nearly linear application scalability
 ‣ The capability of OpenFOAM to achieve scalable productivity

Test Cluster Configuration 1: Dell PowerEdge R730 32-node (896-core) “Thor” cluster

• Dell PowerEdge R730 32-node (896-core) “Thor” cluster
 ‣ Dual-Socket 14-Core Intel E5-2697 v3 @ 2.60 GHz CPUs
 ‣ BIOS: Maximum Performance, Home Snoop
 ‣ Memory: 64GB, DDR4 2133 MHz
 ‣ OS: RHEL 6.5, MLNX_OFED_LINUX-3.2 InfiniBand SW stack
 ‣ Hard Drives: 2x 1TB 7.2K RPM SATA 2.5”
• Mellanox ConnectX-4 EDR 100Gb/s InfiniBand Adapters
• Mellanox Switch-IB SB7700 36-port EDR 100Gb/s InfiniBand Switch
• Dell InfiniBand-based Lustre Storage based on Dell PowerVault MD3460 and MD3420
 ‣ Based on Intel Enterprise Edition for Lustre (IEEL) software version 2.1
• NVIDIA Tesla K80 GPUs
• MPI: Mellanox HPC-X v1.6.392 (based on Open MPI 1.10)
• Application: OpenFOAM 2.2.2
• Benchmark:
 ‣ Automotive benchmark – 670 Million elements, DP, pimpleFoam solver

OpenFOAM Performance: pimpleFoam Solver – Interconnect Comparisons

• EDR InfiniBand delivers better application performance
 ‣ EDR IB provides 13% higher performance than FDR IB
 ‣ The performance benefit comes from the concentration of large-message transfers
 ‣ EDR IB uses a newer driver for enhanced scalability compared to FDR IB

[Chart: solver performance vs. number of nodes (higher is better); annotations show 13% and 5% gains for EDR IB over FDR IB]

OpenFOAM Profiling: pimpleFoam Solver – Number of MPI Calls

• The pimpleFoam solver is the transient solver for incompressible flow
• MPI time is spent mostly in non-blocking and collective communication, plus other MPI operations
• MPI_Waitall and MPI_Allreduce are used the most (a generic sketch of this pattern follows below)
• MPI_Waitall (33%) and MPI_Allreduce (37%) at 32 nodes / 1K cores
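The call mix above is what a non-blocking neighbour exchange followed by a small global reduction looks like at the MPI level. The sketch below is a generic illustration of that pattern, not OpenFOAM's Pstream code; the function name, buffer layout and neighbour list are hypothetical.

```c
/* Generic sketch (not OpenFOAM source): each rank exchanges halo data with
 * its mesh neighbours via non-blocking sends/receives completed by
 * MPI_Waitall, then joins a small MPI_Allreduce (e.g. a global residual)
 * every iteration. */
#include <mpi.h>
#include <stdlib.h>

void exchange_and_reduce(int nNeighbours, const int *neighbours,
                         double **sendBuf, double **recvBuf,
                         const int *faceCount, double *localResidual,
                         MPI_Comm comm)
{
    MPI_Request *req = malloc(2 * nNeighbours * sizeof(MPI_Request));

    /* Post all receives and sends up front. */
    for (int i = 0; i < nNeighbours; ++i) {
        MPI_Irecv(recvBuf[i], faceCount[i], MPI_DOUBLE, neighbours[i],
                  0, comm, &req[i]);
        MPI_Isend(sendBuf[i], faceCount[i], MPI_DOUBLE, neighbours[i],
                  0, comm, &req[nNeighbours + i]);
    }

    /* All point-to-point wait time is concentrated here, which is why
     * MPI_Waitall dominates the call-count and time profiles. */
    MPI_Waitall(2 * nNeighbours, req, MPI_STATUSES_IGNORE);

    /* Small global reduction per iteration: the MPI_Allreduce share. */
    double globalResidual = 0.0;
    MPI_Allreduce(localResidual, &globalResidual, 1, MPI_DOUBLE, MPI_SUM, comm);
    *localResidual = globalResidual;

    free(req);
}
```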

OpenFOAM Profiling: pimpleFoam Solver – % MPI Time

• MPI profiling shows that a large share of the MPI time is spent on large messages
 ‣ MPI_Probe and MPI_Isend occur mostly with large messages (see the probe-then-receive sketch below)
• An interconnect with higher large-message bandwidth is advantageous for pimpleFoam performance
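One reason MPI_Probe appears alongside large transfers is the probe-then-receive idiom used when the receiver does not know the incoming message size in advance. The following is a generic sketch of that idiom under that assumption, not code taken from OpenFOAM; the function name is hypothetical.

```c
/* Generic probe-then-receive sketch: the receiver probes the pending
 * message, sizes its buffer from the status, and then receives. With large
 * transfers, the wait inside MPI_Probe accounts for much of its MPI time. */
#include <mpi.h>
#include <stdlib.h>

double *recv_unknown_size(int src, int tag, MPI_Comm comm, int *count)
{
    MPI_Status status;

    /* Block until a matching message arrives; inspect it without receiving. */
    MPI_Probe(src, tag, comm, &status);
    MPI_Get_count(&status, MPI_DOUBLE, count);

    double *buf = malloc((size_t)(*count) * sizeof(double));
    MPI_Recv(buf, *count, MPI_DOUBLE, src, tag, comm, MPI_STATUS_IGNORE);
    return buf;   /* caller frees the buffer */
}
```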

OpenFOAM Performance: pimpleFoam Solver – MPI Tuning

• The choice of transport method provides a significant advantage at larger core counts
• Mellanox HPC-X shows 10% higher performance than Intel MPI at 32 nodes
 ‣ HPC-X can use the MXM UD or RC transport and hcoll for collective offload
 ‣ Intel MPI can use OFA or DAPL as the transport

[Chart: HPC-X vs. Intel MPI performance on EDR InfiniBand (higher is better); 10% gain at 32 nodes]

Test Cluster Configuration 2: Dell PowerEdge R730 32-node (1024-core) “Thor” cluster

• Dell PowerEdge R730 32-node (1024-core) “Thor” cluster
 ‣ Dual-Socket 16-Core Intel E5-2697A v4 @ 2.60 GHz CPUs
 ‣ BIOS: Maximum Performance, Home Snoop
 ‣ Memory: 256GB, DDR4 2400 MHz
 ‣ OS: RHEL 7.2, MLNX_OFED_LINUX-3.3-1.0.4.0 InfiniBand SW stack
 ‣ Hard Drives: 2x 1TB 7.2K RPM SATA 2.5”
• Mellanox ConnectX-4 EDR 100Gb/s InfiniBand Adapters
• Mellanox Switch-IB SB7700 36-port EDR 100Gb/s InfiniBand Switch
• Intel Omni-Path 100Gb/s Host Fabric Adapter and Switch
• Dell InfiniBand-based Lustre Storage based on Dell PowerVault MD3460 and MD3420
 ‣ Based on Intel Enterprise Edition for Lustre (IEEL) software version 3.0
• NVIDIA Tesla K80 GPUs
• MPI: Mellanox HPC-X v1.6.392 (based on Open MPI 1.10)
• Application: OpenFOAM 2.2.2
• Benchmarks:
 ‣ Automotive benchmark – 670 Million elements, DP, pimpleFoam solver
 ‣ Automotive benchmark 2 – 85 Million elements, DP, simpleFoam solver

OpenFOAM Performance: simpleFoam Solver – MPI Tuning

• simpleFoam uses the SIMPLE (Semi-Implicit Method for Pressure-Linked Equations) algorithm
 ‣ Couples pressure and velocity in the Navier-Stokes equations (the system is written out below)
• Mellanox HPC-X shows 13% higher performance than Open MPI at 32 nodes
 ‣ HPC-X is based on Open MPI, with additional libraries that support offloads
 ‣ HPC-X uses the MXM UD transport for messaging and hcoll for collective offload
 ‣ The communication share of overall runtime drops from 34% (Open MPI) to 24% (HPC-X)
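For reference, a standard statement of the steady, incompressible Navier-Stokes system whose pressure-velocity coupling SIMPLE resolves (textbook form with constant density and an effective kinematic viscosity, not taken from the benchmark case):

```latex
\nabla \cdot \mathbf{u} = 0, \qquad
(\mathbf{u}\cdot\nabla)\,\mathbf{u}
  = -\frac{1}{\rho}\,\nabla p + \nu_{\mathrm{eff}}\,\nabla^{2}\mathbf{u}
```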

[Chart: HPC-X vs. Open MPI performance vs. number of nodes on EDR InfiniBand (higher is better); 13% gain at 32 nodes]

OpenFOAM Profiling: simpleFoam Solver – Time Spent in MPI Calls

• Mostly in non-blocking point-to-point communication and collective ops
 ‣ MPI_Waitall (33.8% MPI, 8% wall)
 ‣ MPI_Probe (25% MPI, 6% wall)
 ‣ MPI_Allreduce (19% MPI, 5% wall)

OpenFOAM Performance: pimpleFoam Solver – Interconnect Comparisons

• EDR IB provides an advantage at scale compared to OPA
• EDR IB outperforms Omni-Path by nearly 10% on the pimpleFoam simulation
• MPI runtime options used:
 ‣ EDR IB: -bind-to core -mca coll_hcoll_enable 1 -x MXM_TLS=ud,shm,self -x HCOLL_ENABLE_MCAST_ALL=1 -x HCOLL_CONTEXT_CACHE_ENABLE=1 -x MALLOC_MMAP_MAX_=0 -x MALLOC_TRIM_THRESHOLD_=-1
 ‣ OPA: -genv I_MPI_PIN on -PSM2

[Chart: EDR InfiniBand vs. Omni-Path performance (higher is better); ~10% advantage for EDR IB]

OpenFOAM Profiling: pimpleFoam Solver – Interconnect Comparisons

• MPI time is mostly in non-blocking point-to-point communication, plus other MPI operations
• EDR shows lower MPI communication time
 ‣ EDR IB: 12.58% wall time in MPI_Allreduce at 1K cores
 ‣ OPA: 14.49% wall time in MPI_Allreduce at 1K cores

[Charts: MPI time profiles on Mellanox EDR InfiniBand and Intel Omni-Path]

Test Cluster Configuration 3: 128-node (3584-core) “Orion” cluster

• HP ProLiant DL360 Gen9 128-node “Orion” cluster
 ‣ Dual-Socket 14-core Intel E5-2697 v3 @ 2.60 GHz CPUs
 ‣ Memory: 256GB, DDR4 2133 MHz
 ‣ OS: RHEL 7.2, MLNX_OFED_LINUX-3.2-1.0.1.1 InfiniBand SW stack
• Mellanox ConnectX-4 EDR InfiniBand adapters
• Mellanox Switch-IB 2 EDR InfiniBand switches
• MPI: HPC-X v1.6.392 (based on Open MPI 1.10)
• Compilers: Intel Parallel Studio XE 2015
• Application: OpenFOAM 3.1
• Benchmark dataset:
 ‣ Lid Driven Cavity Flow – 1 million elements, DP, 2D, icoFoam solver for laminar, isothermal, incompressible flow

OpenFOAM Performance: icoFoam Solver – Open MPI Tuning

• The HPC-X MPI library provides switch-based collective offload in the network
• MPI collective operations (MPI_Allreduce and MPI_Barrier) are executed in the network switches
 ‣ Up to 75% performance benefit observed on a 64-node simulation
• The benefit of FCA depends on how heavily a solver uses MPI collective operations

[Chart: icoFoam performance with and without FCA collective offload (higher is better); up to 75% gain at 64 nodes]

OpenFOAM Profiling: icoFoam Solver – Time Spent in MPI Calls

• icoFoam
 ‣ Solves the incompressible, laminar Navier-Stokes equations using the PISO algorithm
• MPI time mostly involves collective communication
• MPI_Allreduce at 16 bytes is used almost exclusively (see the sketch below)
• MPI_Allreduce (85%), MPI_Waitall (5%)
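The 16-byte message size is consistent with reducing two doubles per call, for example a residual numerator and its normalisation factor. The sketch below illustrates that pattern under this assumption; it is not icoFoam source and the function name is hypothetical.

```c
/* Illustrative sketch (not icoFoam source): reducing two doubles at once
 * produces exactly the 16-byte MPI_Allreduce messages that dominate the
 * profile. */
#include <mpi.h>

double global_residual(double localNum, double localDenom, MPI_Comm comm)
{
    double local[2]  = { localNum, localDenom };   /* 2 x 8 bytes = 16 bytes */
    double global[2];

    MPI_Allreduce(local, global, 2, MPI_DOUBLE, MPI_SUM, comm);

    return global[0] / global[1];                  /* normalised residual */
}
```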

OpenFOAM Performance: icoFoam Solver – MPI Comparisons

• The tuned HPC-X MPI library has a significant performance advantage at scale
• HPC-X with SHArP delivers 120% greater scalability than Intel MPI at 64 nodes
 ‣ SHArP: Scalable Hierarchical Aggregation Protocol
• The benefit is mostly due to MPI collective offloads in HPC-X
 ‣ Offloading MPI_Allreduce to the networking hardware

[Chart: tuned HPC-X (SHArP) vs. Intel MPI scalability (higher is better); 120% advantage at 64 nodes]

OpenFOAM Summary

• OpenFOAM scales well to thousands of cores with a suitable setup
• The OpenFOAM solvers scale well for the solvers/cases tested
• Summary:
 ‣ EDR IB vs FDR IB:
  • EDR IB provides 13% higher performance than FDR IB on 32 nodes
 ‣ EDR vs OPA:
  • EDR IB provides nearly 10% higher performance with the pimpleFoam simulation on 32 nodes
 ‣ Network offload:
  • 75% performance gain from using offload in HPC-X on 64 nodes
 ‣ MPI:
  • HPC-X shows 13% higher performance than Open MPI on the simpleFoam solver
  • HPC-X provides 120% greater scalability than Intel MPI on the icoFoam solver

Contact Us

• Web: www.hpcadvisorycouncil.com
• Email: [email protected]
• Facebook: http://www.facebook.com/HPCAdvisoryCouncil
• Twitter: www.twitter.com/hpccouncil
• YouTube: www.youtube.com/user/hpcadvisorycouncil

All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy and completeness of the information contained herein. The HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.
