Software Ecosystem for Arm-based HPC

CUG 2018 - Stockholm

© 2018 Arm Limited [email protected]

Ecosystem for HPC

List of components needed:
• OS availability
• Compilers
• Libraries
• Job schedulers
• Debuggers
• Profilers

Mix of open source and commercial products and applications… https://developer.arm.com/hpc/hpc-software

Arm development tools portfolio for HPC

Arm Allinea Studio: develop and run on today’s hardware
• Arm Compiler for HPC: Linux user space compiler for HPC applications
• Arm Performance Libraries: BLAS, LAPACK and FFT
• Arm Forge Professional: multi-node interoperable profiler and debugger
• Arm Performance Reports: interoperable application performance insight

and also… Explore tomorrow’s architecture today
• Arm Code Advisor: understand what the compiler could/could not do
• Arm Instruction Emulator: run SVE binaries on today’s Armv8-A hardware

Arm Compiler – Building on LLVM, Clang and Flang projects

Arm C/C++/Fortran Compiler

The compiler pipeline has three stages:
• Language specific frontend: Clang based C/C++ frontend for C/C++ files (.c/.cpp); PGI Flang based Fortran frontend for Fortran files (.f/.f90). Both emit LLVM IR.
• Language agnostic optimization: LLVM based optimizer working on LLVM IR (IR optimizations, auto-vectorization, enhanced optimization for Armv8-A and SVE)
• Architecture specific backend: LLVM based Armv8-A backend producing an Armv8-A binary; LLVM based SVE backend producing an SVE binary

Arm Compiler – OpenMP scaling

Better scaling at higher thread count

• Arm Compiler uses a libomp based optimized OpenMP runtime
• For Lulesh (Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics), Arm Compiler shows better scaling than GCC at higher thread counts

[Figure: Lulesh, size 40. Zones per second vs. number of threads, for armclang 18.0 and gcc 7.1]
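To make the chart's metric concrete, here is a minimal sketch in C of an OpenMP loop over zones and the zones-per-second figure derived from a timed run. It is not taken from Lulesh: `update_zone`, `sweep_zones`, `zones_per_second` and `thread_count` are hypothetical names, and the formula (zones times iterations divided by elapsed seconds) is an assumption about how such a throughput figure is usually formed.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical stand-in for the per-zone work; not Lulesh's physics. */
static double update_zone(double e) { return 0.5 * e + 1.0; }

/* Update every zone and return the sum of the new values.
 * When compiled with OpenMP the iterations are shared among threads;
 * otherwise the pragma is ignored and the loop runs serially. */
double sweep_zones(double *zones, size_t n)
{
    double total = 0.0;
    long i;
#pragma omp parallel for reduction(+ : total)
    for (i = 0; i < (long)n; ++i) {
        zones[i] = update_zone(zones[i]);
        total += zones[i];
    }
    return total;
}

/* Assumed throughput metric: zones processed per second of wall time. */
double zones_per_second(size_t zones, size_t iterations, double seconds)
{
    assert(seconds > 0.0);
    return (double)zones * (double)iterations / seconds;
}

/* Threads the OpenMP runtime would use (1 when built without OpenMP). */
int thread_count(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

Timing `sweep_zones` at several thread counts (for example by setting OMP_NUM_THREADS) and plotting `zones_per_second` against `thread_count` would give a curve of the same kind as the one on the slide.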

DGEMM performance on Cavium ThunderX2
Excellent serial and parallel performance

• Achieving very high performance at the node level, leveraging high core counts and large memory bandwidth
• Single core performance at 95% of peak for DGEMM
• Parallel performance significantly higher than OpenBLAS

[Figure: DGEMM, 56 threads on Cavium ThunderX2 CN99. Percentage of peak (0% to 100%) vs. matrix dimension (M=N=K, 0 to 10000), for Arm Performance Libraries and OpenBLAS]
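The "percentage of peak" axis can be read as achieved floating-point rate divided by the machine's theoretical rate. The sketch below is a plain reference DGEMM in C with that arithmetic, assuming row-major storage, no transposes, and the usual 2·M·N·K operation count. It is for illustration only and is not how Arm Performance Libraries or OpenBLAS implement the routine; the names `dgemm_ref`, `dgemm_flops` and `percent_of_peak` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Reference DGEMM: C = alpha * A * B + beta * C.
 * Row-major, no transposes: A is m x k, B is k x n, C is m x n. */
void dgemm_ref(size_t m, size_t n, size_t k, double alpha,
               const double *a, const double *b, double beta, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t p = 0; p < k; ++p)
                sum += a[i * k + p] * b[p * n + j];
            c[i * n + j] = alpha * sum + beta * c[i * n + j];
        }
    }
}

/* Conventional operation count: one multiply and one add per inner step. */
double dgemm_flops(size_t m, size_t n, size_t k)
{
    return 2.0 * (double)m * (double)n * (double)k;
}

/* Achieved rate as a percentage of an assumed theoretical peak rate. */
double percent_of_peak(double flops, double seconds, double peak_flops_per_s)
{
    assert(seconds > 0.0 && peak_flops_per_s > 0.0);
    return 100.0 * (flops / seconds) / peak_flops_per_s;
}
```

The peak rate passed to `percent_of_peak` has to come from the hardware specification (cores, clock, FLOPs per cycle); the slide does not state the value used.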

Arm Performance Libraries

Arm HPC ecosystem

Porting to Arm

• Arm is engaging directly with partners and HPC scientific code developers to support porting and optimisation of common HPC libraries, tools and applications
• Initial focus on successfully building with both Arm and GCC compilers across a broad front
• Often only modest changes to environment variables, build scripts and architecture files are needed
• Degree of commonality between codes

Example: Particle in Cell codes
Two different approaches

VPIC (https://github.com/lanl/vpic)
• Explicit 2nd order push, charge conserving
• FDTD fields
• C & C++ with MPI & pthreads
• Low particle order
• Heavily optimised push, previously tuned for specific platforms
• Vector kernel

EPOCH (http://www.ccpp.ac.uk)
• Explicit 2nd order push, charge conserving
• FDTD fields
• Fortran with MPI
• High order particles
• Flexible, extensible, versatile
• Linked list storage
• Dependencies: SDF

Example: Leveraging Arm intrinsics from C
VPIC

• VPIC’s v4 kernel pushes four particles at a time, optimised with SSE SIMD calls
• Arm’s NEON instructions offer similar functionality
• Datatypes and intrinsic calls from SSE can be mapped over to NEON in many cases
• Projects like SIMD Everywhere (https://github.com/nemequ/simde) may help generate portable code able to exploit Arm’s vector calls
• Could such vectorised kernels stand to benefit from Arm’s SVE instructions?

[Figure: pushes (normalised) on Arm, Standard vs. NEON]
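As an illustration of how SSE datatypes and calls can map onto NEON, the sketch below wraps a few four-wide float operations behind one hypothetical interface (`v4f`, `v4_load`, `v4_store`, `v4_splat`, `v4_madd`) and uses it to advance positions four at a time. This is not VPIC's v4 class or its particle push; it only shows the one-to-one mapping for a simple x += v·dt update, with a scalar fallback when neither instruction set is available.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
typedef float32x4_t v4f;
static inline v4f v4_load(const float *p) { return vld1q_f32(p); }
static inline void v4_store(float *p, v4f v) { vst1q_f32(p, v); }
static inline v4f v4_splat(float s) { return vdupq_n_f32(s); }
/* a + b * c */
static inline v4f v4_madd(v4f a, v4f b, v4f c) { return vmlaq_f32(a, b, c); }
#elif defined(__SSE__)
#include <xmmintrin.h>
typedef __m128 v4f;
static inline v4f v4_load(const float *p) { return _mm_loadu_ps(p); }
static inline void v4_store(float *p, v4f v) { _mm_storeu_ps(p, v); }
static inline v4f v4_splat(float s) { return _mm_set1_ps(s); }
/* a + b * c */
static inline v4f v4_madd(v4f a, v4f b, v4f c)
{
    return _mm_add_ps(a, _mm_mul_ps(b, c));
}
#else
/* Scalar fallback with the same interface. */
typedef struct { float f[4]; } v4f;
static inline v4f v4_load(const float *p)
{
    v4f r;
    for (int i = 0; i < 4; ++i) r.f[i] = p[i];
    return r;
}
static inline void v4_store(float *p, v4f v)
{
    for (int i = 0; i < 4; ++i) p[i] = v.f[i];
}
static inline v4f v4_splat(float s)
{
    v4f r;
    for (int i = 0; i < 4; ++i) r.f[i] = s;
    return r;
}
static inline v4f v4_madd(v4f a, v4f b, v4f c)
{
    v4f r;
    for (int i = 0; i < 4; ++i) r.f[i] = a.f[i] + b.f[i] * c.f[i];
    return r;
}
#endif

/* x[i] += v[i] * dt, four elements per step, scalar loop for the tail. */
void advance_positions(float *x, const float *v, float dt, size_t n)
{
    assert(x != NULL && v != NULL);
    const v4f vdt = v4_splat(dt);
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        v4_store(x + i, v4_madd(v4_load(x + i), v4_load(v + i), vdt));
    for (; i < n; ++i)
        x[i] += v[i] * dt;
}
```

Keeping the instruction-set-specific calls behind a small set of wrappers means the kernel itself is written once; a further backend (for example SVE) would only need to supply the same handful of functions.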

Example: Leveraging Arm intrinsics from Fortran
EPOCH

Particle prefetch (src/housekeeping/prefetch.f90)
• Uses Intel’s _mm_prefetch to improve performance of the linked list

    SUBROUTINE prefetch_particle(p)
      TYPE(particle), INTENT(INOUT) :: p
    #ifdef PREFETCH
      CALL mm_prefetch(p%part_p(1))
      CALL mm_prefetch(p%weight)
    #endif
    END SUBROUTINE prefetch_particle

Arm C compiler preload
• Use __pld in place of _mm_prefetch
• Requires Fortran 2003’s C-binding

    INTERFACE
      SUBROUTINE arm_prefetch(p, x, w) BIND(C)
        USE, INTRINSIC :: iso_c_binding
        REAL(c_double), DIMENSION(3) :: p
        REAL(c_double), DIMENSION(c_ndims) :: x
        REAL(c_double) :: w
      END SUBROUTINE arm_prefetch
    END INTERFACE

C wrapper (src/housekeeping/arm_intrinsics.c)

    #include
    void arm_prefetch(void const* p)
    {
      __pld(p);
      return;
    }

A similar approach can be used to call GCC’s __builtin_prefetch
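A minimal sketch of that GCC variant, in C: a wrapper around `__builtin_prefetch` plus a linked-list walk that prefetches the next node while working on the current one. The `particle` struct, `gnu_prefetch`, `link_particles` and `total_weight` are hypothetical and only mirror the field names seen in the Fortran above; they are not EPOCH's actual data layout or code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical particle node, loosely mirroring part_p and weight. */
struct particle {
    double part_p[3];
    double weight;
    struct particle *next;
};

/* Wrapper in the spirit of arm_prefetch, using the GCC/Clang builtin.
 * Arguments 0 and 3 request a read prefetch with high temporal locality.
 * On other compilers this does nothing. */
void gnu_prefetch(const void *p)
{
#if defined(__GNUC__)
    __builtin_prefetch(p, 0, 3);
#else
    (void)p;
#endif
}

/* Chain an array of particles into a singly linked list. */
void link_particles(struct particle *nodes, size_t n)
{
    assert(nodes != NULL && n > 0);
    for (size_t i = 0; i + 1 < n; ++i)
        nodes[i].next = &nodes[i + 1];
    nodes[n - 1].next = NULL;
}

/* Walk the list, prefetching the next node before using the current one. */
double total_weight(const struct particle *head)
{
    double sum = 0.0;
    for (const struct particle *cur = head; cur != NULL; cur = cur->next) {
        if (cur->next != NULL)
            gnu_prefetch(cur->next);
        sum += cur->weight;
    }
    return sum;
}
```

A prefetch is only a hint, so the result of `total_weight` is the same with or without it; whether it helps depends on how scattered the nodes are in memory and how much work is done per node.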

Example: Performance improvement
Speed-up memory-bound code

[Figure: three bar charts on a normalised axis from 0 to 1.2. Panels: Armflang vs. GNU (standard arm, standard gnu); Armflang with preload (standard, prefetch); GNU with prefetch (standard, prefetch). Values shown: 108%, 86%, 93%]

Meets the requirements of HPC developers on Arm

• Develop and build: Arm Compiler for HPC, for C, C++ and Fortran codes
• Debug: Arm DDT, cross-platform parallel debugger
• Profile: Arm MAP, cross-platform lightweight profiler
• Optimize: Arm Performance Libraries, BLAS, LAPACK, FFT
• Maximize System Efficiency: Arm Performance Reports

Community building

Outside the people we collaborate with, various complementary Arm HPC communities already exist:
• Arm HPC User Group (SC) and GoingArm (ISC/ArmRS)
• Arm HPC Google Group (https://groups.google.com/forum/#!forum/arm-hpc)
• Arm HPC GitLab pages (https://gitlab.com/arm-hpc/)

Encouraging our partners to use GitLab is a priority

Our application work engages with code owners and users to obtain suitable test cases, to get Arm support built in, and to help them make AArch64 testing part of their development processes

Community site – gitlab.com/arm-hpc
https://gitlab.com/arm-hpc/packages/wikis/home

• Dynamic list of common HPC applications
• Up-to-date summary of package status
• Provides focus for porting progress
• Community driven. Maintained by Arm, but anyone can join and contribute.
• Allows developers to share recipes, and learn from progress on other applications
• Provides a mechanism for tracking status of applications and package sets (e.g. OpenHPC packages, Mantevo, etc.)

Tack! Thank You! Danke! Merci! 谢谢! ありがとう! Gracias! Kiitos! 감사합니다 धन्यवाद

Migrate and debug application to Arm

Switch between OpenMP threads Visualise data structures

Integrate with continuous integration tools

Display pending communications

Optimise for Arm platforms

Detect MPI load imbalance

Understand CPU usage

Identify regions of high OpenMP synchronisation

Maximize System Efficiency

Aggregate data
