Arm Compiler for HPC Arm Performance Libraries Arm Forge Professional Arm Performance Reports

Software Ecosystem for Arm-based HPC
CUG 2018 - Stockholm
© 2018 Arm Limited

Ecosystem for HPC
List of components needed:
• Linux OS availability
• Compilers
• Libraries
• Job schedulers
• Debuggers
• Profilers
Mix of open source and commercial products and applications…
https://developer.arm.com/hpc/hpc-software

Arm development tools portfolio for HPC
Arm Allinea Studio: develop and run on today’s hardware
• Arm Compiler for HPC: Linux user space compiler for HPC applications
• Arm Performance Libraries: BLAS, LAPACK and FFT
• Arm Forge Professional: multi-node interoperable profiler and debugger
• Arm Performance Reports: interoperable application performance insight
And also: explore tomorrow’s architecture today
• Arm Code Advisor: understand what the compiler could/could not do
• Arm Instruction Emulator: run SVE binaries on today’s Armv8-A hardware

Arm Compiler – Building on LLVM, Clang and Flang projects
Arm C/C++/Fortran Compiler
• Language specific frontend: C/C++ files (.c/.cpp) pass through a Clang based C/C++ frontend; Fortran files (.f/.f90) pass through a PGI Flang based Fortran frontend. Both emit LLVM IR.
• Language agnostic optimization: LLVM based optimizer (IR optimizations, auto-vectorization), with enhanced optimization for Armv8-A and SVE. The optimizer emits LLVM IR.
• Architecture specific backend: LLVM based Armv8-A and SVE backends produce Armv8-A and SVE binaries.

Arm Compiler – OpenMP scaling
Better scaling at higher thread count
• Arm Compiler uses libomp based optimized OpenMP runtime
• For Lulesh (Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics), Arm Compiler shows better scaling than GCC for higher thread count
[Chart: Lulesh – size 40; zones per second vs. number of threads; series: armclang 18.0, gcc 7.1]

DGEMM performance on Cavium ThunderX2
Excellent serial and parallel performance
[Chart title: DGEMM – 56 threads on Cavium ThunderX2]
• Achieving very high performance at the CN99 node level
  leveraging high core counts and large memory bandwidth
• Single core performance at 95% of peak for DGEMM
• Parallel performance significantly higher than OpenBLAS
[Chart: percentage of peak (0% to 100%) vs. matrix dimension (M=N=K, 0 to 10000); series: Arm Performance Libraries, OpenBLAS]

Arm Performance Libraries

Arm HPC ecosystem
Porting to Arm
• Arm is engaging directly with partners and HPC scientific code developers to support porting and optimisation of common HPC libraries, tools and applications
• Initial focus on successfully building with both Arm and GCC compilers across a broad front
• Often only modest changes to environment variables, build scripts and architecture files are needed
• Degree of commonality between codes

Example: Particle in Cell codes
Two different approaches
VPIC (https://github.com/lanl/vpic)
• Explicit 2nd order push, charge conserving
• FDTD fields
• C & C++ with MPI & pthreads
• Low particle order
• Heavily optimised push, previously tuned for specific platforms
• Vector kernel
EPOCH (http://www.ccpp.ac.uk)
• Explicit 2nd order push, charge conserving
• FDTD fields
• Fortran with MPI
• High order particles
• Flexible, extensible, versatile
• Linked list storage
• Dependencies: SDF

Example: Leveraging Arm intrinsics from C
VPIC
• VPIC’s v4 kernel pushes four particles at a time, optimised with SSE SIMD calls
• Arm’s NEON instructions offer similar functionality
• Datatypes and intrinsic calls from SSE can be mapped over to NEON in many cases
• Projects like SIMD Everywhere (https://github.com/nemequ/simde) may help generate portable code able to exploit Arm’s vector calls
[Chart: pushes (normalised), axis 0 to 2; bars: Standard, NEON]
• Could such vectorised kernels stand to benefit from Arm’s SVE instructions?
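The SSE-to-NEON mapping described on the VPIC slide can be illustrated with a small sketch. This is not VPIC's v4 kernel: it is a simplified position update (x += v * dt) over four values at a time, and the names `vec4`, `vec4_*` and `push_positions` are hypothetical. It assumes a compiler that defines `__ARM_NEON` on Arm targets or `__SSE__` on x86 targets, and falls back to scalar code otherwise.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 4-wide float wrapper: one name per operation, mapped to the
 * matching NEON or SSE intrinsic, with a scalar fallback elsewhere. */
#if defined(__ARM_NEON)
#include <arm_neon.h>
typedef float32x4_t vec4;
static inline vec4 vec4_load(const float *p)    { return vld1q_f32(p); }
static inline void vec4_store(float *p, vec4 v) { vst1q_f32(p, v); }
static inline vec4 vec4_splat(float s)          { return vdupq_n_f32(s); }
static inline vec4 vec4_add(vec4 a, vec4 b)     { return vaddq_f32(a, b); }
static inline vec4 vec4_mul(vec4 a, vec4 b)     { return vmulq_f32(a, b); }
#elif defined(__SSE__)
#include <xmmintrin.h>
typedef __m128 vec4;
static inline vec4 vec4_load(const float *p)    { return _mm_loadu_ps(p); }
static inline void vec4_store(float *p, vec4 v) { _mm_storeu_ps(p, v); }
static inline vec4 vec4_splat(float s)          { return _mm_set1_ps(s); }
static inline vec4 vec4_add(vec4 a, vec4 b)     { return _mm_add_ps(a, b); }
static inline vec4 vec4_mul(vec4 a, vec4 b)     { return _mm_mul_ps(a, b); }
#else
typedef struct { float f[4]; } vec4;
static inline vec4 vec4_load(const float *p)
{ vec4 r; for (int i = 0; i < 4; ++i) r.f[i] = p[i]; return r; }
static inline void vec4_store(float *p, vec4 v)
{ for (int i = 0; i < 4; ++i) p[i] = v.f[i]; }
static inline vec4 vec4_splat(float s)
{ vec4 r; for (int i = 0; i < 4; ++i) r.f[i] = s; return r; }
static inline vec4 vec4_add(vec4 a, vec4 b)
{ vec4 r; for (int i = 0; i < 4; ++i) r.f[i] = a.f[i] + b.f[i]; return r; }
static inline vec4 vec4_mul(vec4 a, vec4 b)
{ vec4 r; for (int i = 0; i < 4; ++i) r.f[i] = a.f[i] * b.f[i]; return r; }
#endif

/* Simplified "push": advance n positions by v * dt, four at a time,
 * with a scalar loop for any remaining elements. */
void push_positions(float *x, const float *v, float dt, size_t n)
{
    assert(x != NULL && v != NULL);
    const vec4 vdt = vec4_splat(dt);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        vec4 xi = vec4_load(x + i);
        vec4 vi = vec4_load(v + i);
        vec4_store(x + i, vec4_add(xi, vec4_mul(vi, vdt)));
    }
    for (; i < n; ++i)
        x[i] += v[i] * dt;
}
```

Keeping the intrinsics behind a thin wrapper like this is one way to port an SSE kernel without touching the loop body; SIMD Everywhere takes a different route by providing the SSE names themselves on non-x86 targets.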
Example: Leveraging Arm intrinsics from Fortran
EPOCH

Particle prefetch
• Uses Intel’s _mm_prefetch to improve performance of linked-list
• src/housekeeping/prefetch.f90

    SUBROUTINE prefetch_particle(p)
      TYPE(particle), INTENT(INOUT) :: p
    #ifdef PREFETCH
      CALL mm_prefetch(p%part_p(1))
      CALL mm_prefetch(p%weight)
    #endif
    END SUBROUTINE prefetch_particle

Arm C compiler preload
• Use __pld in place of _mm_prefetch
• Requires Fortran 2003’s C-binding

    INTERFACE
      SUBROUTINE arm_prefetch(p, x, w) BIND(C)
        USE, INTRINSIC :: iso_c_binding
        REAL(c_double), DIMENSION(3) :: p
        REAL(c_double), DIMENSION(c_ndims) :: x
        REAL(c_double) :: w
      END SUBROUTINE arm_prefetch
    END INTERFACE

C wrapper
• src/housekeeping/arm_intrinsics.c

    #include <arm_acle.h>
    void arm_prefetch(void const* p)
    {
      __pld(p);
      return;
    }

A similar approach can be used to call GCC’s __builtin_prefetch

Example: Performance improvement
Speed-up memory-bound code
[Charts: three panels with a common axis from 0 to 1.2. Panel titles: Armflang vs. GNU (bars: standard arm, standard gnu); Armflang with preload (bars: standard, prefetch); GNU with prefetch (bars: standard, prefetch). Annotated values: 108%, 86%, 93%]

Meets the requirements of HPC developers on Arm
Workflow stages: Develop and build, Debug, Profile, Optimize
• Arm Compiler for HPC: for C, C++ and Fortran codes
• Arm DDT: cross-platform parallel debugger
• Arm MAP: cross-platform lightweight profiler
• Arm Performance Libraries: BLAS, LAPACK, FFT
• Arm Performance Reports: maximize system efficiency

Community building
Outside the people we collaborate with, various complementary Arm HPC communities already exist:
• Arm HPC User Group (SC) and GoingArm (ISC/ArmRS)
• Arm HPC Google Group (https://groups.google.com/forum/#!forum/arm-hpc)
• Arm HPC GitLab pages (https://gitlab.com/arm-hpc/)
Encouraging our partners to use GitLab is a priority
Our app work is engaging with code owners and users to get suitable test cases, to get Arm support built in, and helping
them make AArch64 testing part of their development processes

Community site – gitlab.com/arm-hpc
https://gitlab.com/arm-hpc/packages/wikis/home
• Dynamic list of common HPC applications
• Up-to-date summary of package status
• Provides focus for porting progress
• Community driven. Maintained by Arm, but anyone can join and contribute.
• Allows developers to share recipes, and learn from progress on other applications
• Provides a mechanism for tracking status of applications and package sets (e.g. OpenHPC packages, Mantevo, etc.)

Tack! Thank You! Danke! Merci! 谢谢! ありがとう! Gracias! Kiitos! 감사합니다 धन्यवाद

Migrate and debug application to Arm
• Switch between OpenMP threads
• Visualise data structures
• Integrate to continuous integration tools
• Display pending communications

Optimise for Arm platforms
• Detect MPI load imbalance
• Understand CPU usage
• Identify regions of high OpenMP synchronisation

Maximize System Efficiency
• Aggregate data
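The EPOCH slides note that a similar wrapper approach can call GCC’s __builtin_prefetch. A minimal sketch of that idea for a linked-list traversal follows. The `struct particle` layout and the functions `prefetch_particle`, `link_particles` and `total_weight` are hypothetical and are not taken from EPOCH; the sketch assumes a GCC-compatible compiler and turns the prefetch into a no-op elsewhere.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical particle node stored in a singly linked list. */
struct particle {
    double pos[3];
    double weight;
    struct particle *next;
};

/* Hint that the node at p will be read soon. __builtin_prefetch is a
 * GCC/Clang builtin; other compilers get a no-op. */
static inline void prefetch_particle(const struct particle *p)
{
#if defined(__GNUC__)
    __builtin_prefetch(p, 0, 3);  /* 0 = read access, 3 = keep in cache */
#else
    (void)p;
#endif
}

/* Link an array of n nodes into a list and return its head. */
struct particle *link_particles(struct particle *nodes, size_t n)
{
    assert(nodes != NULL && n > 0);
    for (size_t i = 0; i + 1 < n; ++i)
        nodes[i].next = &nodes[i + 1];
    nodes[n - 1].next = NULL;
    return nodes;
}

/* Sum the weights, prefetching the next node while the current one
 * is being processed. */
double total_weight(const struct particle *head)
{
    double sum = 0.0;
    for (const struct particle *p = head; p != NULL; p = p->next) {
        if (p->next != NULL)
            prefetch_particle(p->next);
        sum += p->weight;
    }
    return sum;
}
```

Whether the hint helps depends on how much work is done per node and on how the nodes are laid out in memory, so any benefit should be measured on the target platform.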
