Software Ecosystem for Arm-based HPC
CUG 2018 - Stockholm
© 2018 Arm Limited [email protected]
Ecosystem for HPC
List of components needed:
• Linux OS availability
• Compilers
• Libraries
• Job schedulers
• Debuggers
• Profilers
Mix of open source and commercial products and applications… https://developer.arm.com/hpc/hpc-software
Arm development tools portfolio for HPC
Arm Allinea Studio: Develop and run on today’s hardware
• Arm Compiler for HPC: Linux user space compiler for HPC applications
• Arm Performance Libraries: BLAS, LAPACK and FFT
• Arm Forge Professional: Multi-node interoperable profiler and debugger
• Arm Performance Reports: Interoperable application performance insight
and also… Explore tomorrow’s architecture today
• Arm Code Advisor: Understand what the compiler could/could not do
• Arm Instruction Emulator: Run SVE binaries on today’s Armv8-A hardware
Arm Compiler – Building on LLVM, Clang and Flang projects
[Figure: compiler pipeline in three stages]
• Language specific frontend: C/C++ Files (.c/.cpp) enter the Clang based C/C++ Frontend; Fortran Files (.f/.f90) enter the PGI Flang based Fortran Frontend. Both emit LLVM IR.
• Language agnostic optimization: LLVM based Optimizer (IR Optimizations, Auto-vectorization), with enhanced optimization for Armv8-A and SVE. It takes LLVM IR and emits LLVM IR.
• Architecture specific backend: LLVM based Armv8-A Backend producing an Armv8-A binary; LLVM based SVE Backend producing an SVE binary.
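The auto-vectorization stage in the pipeline above works best on loops whose iterations are visibly independent. The following is a minimal sketch of such a loop, not code from the slides: the function name `daxpy` and its signature are chosen here for illustration, and whether a given compiler version actually vectorizes it depends on the optimisation level and target options used.

```c
#include <assert.h>
#include <stddef.h>

/* Computes y[i] = a * x[i] + y[i] for i in [0, n).
 * `restrict` tells the compiler that x and y do not overlap, which
 * removes the aliasing concern that can otherwise block vectorization. */
void daxpy(size_t n, double a, const double *restrict x, double *restrict y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Because the loop body has no cross-iteration dependency, the same source can in principle be lowered by either backend, to fixed-width Armv8-A vector instructions or to SVE.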
Arm Compiler – OpenMP scaling
Better scaling at higher thread count
• Arm Compiler uses libomp based optimized OpenMP runtime
• For Lulesh (Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics), Arm Compiler shows better scaling than GCC for higher thread count
[Figure: Lulesh – size 40. Zones per second against number of threads, armclang 18.0 vs. gcc 7.1]
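For readers unfamiliar with the kind of loop an OpenMP runtime schedules, here is a small sketch of a parallel reduction. It is not taken from Lulesh; the function name `sum_of_squares` is hypothetical. When OpenMP is not enabled at compile time the pragma is ignored and the loop runs serially, so the result is the same either way.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of squares over an array, with iterations divided among OpenMP
 * threads. Each thread accumulates a private partial sum and the
 * reduction clause combines them when the loop ends. */
double sum_of_squares(const double *v, long n)
{
    assert(v != NULL || n == 0);
    double total = 0.0;
    #pragma omp parallel for reduction(+:total) schedule(static)
    for (long i = 0; i < n; ++i) {
        total += v[i] * v[i];
    }
    return total;
}
```

The thread count is normally controlled with the standard `OMP_NUM_THREADS` environment variable, which is how a scaling curve like the one on this slide would typically be swept.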
DGEMM performance on Cavium ThunderX2
Excellent serial and parallel performance
• Achieving very high performance at the node level leveraging high core counts and large memory bandwidth
• Single core performance at 95% of peak for DGEMM
• Parallel performance significantly higher than OpenBLAS
[Figure: DGEMM – 56 threads on Cavium ThunderX2 CN99. Percentage of peak (0% to 100%) against matrix dimension (M=N=K, 0 to 10000), Arm Performance Libraries vs. OpenBLAS]
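The "percentage of peak" metric on this slide can be made concrete with a short sketch. A DGEMM with M = N = K = n performs 2n³ floating-point operations, so the achieved rate is that count divided by the elapsed time. The code below is an unoptimised reference written for illustration only; it is not how Arm Performance Libraries or OpenBLAS implement DGEMM, and the function names and the peak figure passed in are assumptions supplied by the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Reference C = A * B for square n x n row-major matrices. */
void naive_dgemm(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double acc = 0.0;
            for (size_t k = 0; k < n; ++k) {
                acc += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}

/* One multiply and one add per inner iteration: 2 * n^3 operations. */
double dgemm_flops(size_t n)
{
    double d = (double)n;
    return 2.0 * d * d * d;
}

/* Achieved FLOP/s as a percentage of a caller-supplied peak FLOP/s. */
double percent_of_peak(size_t n, double seconds, double peak_flops_per_second)
{
    assert(seconds > 0.0 && peak_flops_per_second > 0.0);
    return 100.0 * (dgemm_flops(n) / seconds) / peak_flops_per_second;
}
```

As a worked case, a run with n = 1000 that takes one second on hardware with a hypothetical peak of 4 GFLOP/s achieves 2 GFLOP/s, which is 50% of peak.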
Arm Performance Libraries
Arm HPC ecosystem
Porting to Arm
• Arm is engaging directly with partners and HPC scientific code developers to support porting and optimisation of common HPC libraries, tools and applications
• Initial focus on successfully building with both Arm and GCC compilers across a broad front
• Often only modest changes to environment variables, build scripts and architecture files are needed
• Degree of commonality between codes
Example: Particle in Cell codes
Two different approaches
VPIC
• Explicit 2nd order push, charge conserving
• FDTD fields
• C & C++ with MPI & pthreads
• Low particle order
• Heavily optimised push, previously tuned for specific platforms
• Vector kernel
• https://github.com/lanl/vpic
EPOCH
• Explicit 2nd order push, charge conserving
• FDTD fields
• Fortran with MPI
• High order particles
• Flexible, extensible, versatile
• Linked list storage
• Dependencies: SDF
• http://www.ccpp.ac.uk
Example: Leveraging Arm intrinsics from C
VPIC
• VPIC’s v4 kernel pushes four particles at a time – optimised with SSE SIMD calls
• Arm’s NEON instructions offer similar functionality
• Datatypes and intrinsic calls from SSE can be mapped over to NEON in many cases
• Projects like SIMD Everywhere: https://github.com/nemequ/simde may help generate portable code able to exploit Arm’s vector calls
• Could such vectorised kernels stand to benefit from Arm’s SVE instructions?
[Figure: pushes (normalised), axis 0 to 2, for the Standard and NEON kernels on Arm]
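The SSE-to-NEON mapping described above can be sketched with a thin wrapper layer. This is an illustration under stated assumptions, not VPIC's actual v4 class: the names `vec4f`, `v4_load`, `v4_store`, `v4_splat`, `v4_add`, `v4_mul` and `push4` are hypothetical, and the update shown (x += v·dt) is a simplified stand-in for a real particle push. The intrinsics themselves (`vld1q_f32`, `_mm_loadu_ps` and so on) are the standard NEON and SSE ones.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
typedef float32x4_t vec4f;                 /* NEON counterpart of __m128 */
static inline vec4f v4_load(const float *p)     { return vld1q_f32(p); }
static inline void  v4_store(float *p, vec4f v) { vst1q_f32(p, v); }
static inline vec4f v4_splat(float s)           { return vdupq_n_f32(s); }
static inline vec4f v4_add(vec4f a, vec4f b)    { return vaddq_f32(a, b); }
static inline vec4f v4_mul(vec4f a, vec4f b)    { return vmulq_f32(a, b); }
#elif defined(__SSE__)
#include <xmmintrin.h>
typedef __m128 vec4f;
static inline vec4f v4_load(const float *p)     { return _mm_loadu_ps(p); }
static inline void  v4_store(float *p, vec4f v) { _mm_storeu_ps(p, v); }
static inline vec4f v4_splat(float s)           { return _mm_set1_ps(s); }
static inline vec4f v4_add(vec4f a, vec4f b)    { return _mm_add_ps(a, b); }
static inline vec4f v4_mul(vec4f a, vec4f b)    { return _mm_mul_ps(a, b); }
#else
/* Scalar fallback so the sketch builds on any target. */
typedef struct { float f[4]; } vec4f;
static inline vec4f v4_load(const float *p)
{ vec4f r; for (int i = 0; i < 4; ++i) r.f[i] = p[i]; return r; }
static inline void v4_store(float *p, vec4f v)
{ for (int i = 0; i < 4; ++i) p[i] = v.f[i]; }
static inline vec4f v4_splat(float s)
{ vec4f r; for (int i = 0; i < 4; ++i) r.f[i] = s; return r; }
static inline vec4f v4_add(vec4f a, vec4f b)
{ vec4f r; for (int i = 0; i < 4; ++i) r.f[i] = a.f[i] + b.f[i]; return r; }
static inline vec4f v4_mul(vec4f a, vec4f b)
{ vec4f r; for (int i = 0; i < 4; ++i) r.f[i] = a.f[i] * b.f[i]; return r; }
#endif

/* Advance four particle positions at once: x[i] += v[i] * dt. */
void push4(float *x, const float *v, float dt)
{
    assert(x != NULL && v != NULL);
    vec4f pos = v4_load(x);
    vec4f vel = v4_load(v);
    v4_store(x, v4_add(pos, v4_mul(vel, v4_splat(dt))));
}
```

Keeping the architecture-specific calls behind one small set of inline functions means the kernel body is written once, which is broadly the approach portability layers such as SIMD Everywhere take.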
Example: Leveraging Arm intrinsics from Fortran
EPOCH
Particle prefetch
• Uses Intel’s _mm_prefetch to improve performance of linked-list
Arm C compiler preload
• Use __pld in place of _mm_prefetch
• Requires Fortran 2003’s C-binding
C wrapper
• src/housekeeping/arm_intrinsics.c [listing truncated: #include]
• src/housekeeping/prefetch.f90 [listing truncated: SUBROUTINE prefetch_particle(p) INTERFACE]
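Since the slide's listings did not survive, the following is a sketch of what a C prefetch wrapper of this kind might look like; it is not EPOCH's code. The slide names `__pld` as the Arm replacement for `_mm_prefetch`; this sketch swaps in the GCC/Clang builtin `__builtin_prefetch` so that it builds on any target. The `particle` struct and the names `prefetch_particle_c` and `sum_weights` are hypothetical. A Fortran caller would reach such a function through a Fortran 2003 `INTERFACE` block with `BIND(C)`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a particle held in a linked list. */
typedef struct particle {
    double weight;
    struct particle *next;
} particle;

/* Hints that the data at addr will be read soon. Compilers without the
 * builtin fall through to a no-op, which keeps the code correct but
 * gives no prefetch benefit. */
void prefetch_particle_c(const void *addr)
{
    assert(addr != NULL);
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(addr, 0, 3);   /* 0 = read access, 3 = high locality */
#else
    (void)addr;
#endif
}

/* Walks the list, requesting the next node before working on the
 * current one so the memory fetch can overlap with computation. */
double sum_weights(const particle *head)
{
    double total = 0.0;
    for (const particle *p = head; p != NULL; p = p->next) {
        if (p->next != NULL) {
            prefetch_particle_c(p->next);
        }
        total += p->weight;
    }
    return total;
}
```

A prefetch is only a hint, so the result of the traversal is identical with or without it; any gain shows up in run time on memory-bound loops.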
Example: Performance improvement
Speed-up memory-bound code
[Figure: three bar charts, vertical axis 0 to 1.2. Panels: Armflang vs. GNU; Armflang with preload; GNU with prefetch. Bars: standard and prefetch, for arm and gnu. Annotated values: 108%, 86%, 93%]
Meets the requirements of HPC developers on Arm
• Develop and build: Arm Compiler for HPC. For C, C++ and Fortran codes
• Debug: Arm DDT. Cross-platform parallel debugger
• Profile: Arm MAP. Cross-platform lightweight profiler
• Optimize: Arm Performance Libraries. BLAS, LAPACK, FFT
• Arm Performance Reports: Maximize System Efficiency
Community building
Outside the people we collaborate with, various complementary Arm HPC communities already exist:
• Arm HPC User Group (SC) and GoingArm (ISC/ArmRS)
• Arm HPC Google Group (https://groups.google.com/forum/#!forum/arm-hpc)
• Arm HPC GitLab pages (https://gitlab.com/arm-hpc/)
Encouraging our partners to use GitLab is a priority
Our application work engages with code owners and users to obtain suitable test cases, to get Arm support built in, and to help them make AArch64 testing part of their development processes
Community site – gitlab.com/arm-hpc
https://gitlab.com/arm-hpc/packages/wikis/home
• Dynamic list of common HPC applications
• Up-to-date summary of package status
• Provides focus for porting progress
• Community driven. Maintained by Arm, but anyone can join and contribute.
• Allows developers to share recipes, and learn from progress on other applications
• Provides a mechanism for tracking status of applications and package sets (e.g. OpenHPC packages, Mantevo, etc.)
Tack! Thank You! Danke! Merci! 谢谢! ありがとう! Gracias! Kiitos! 감사합니다 धन्यवाद
Migrate and debug application to Arm
Switch between OpenMP threads
Visualise data structures
Integrate to continuous integration tools
Display pending communications
Optimise for Arm platforms
Detect MPI load imbalance
Understand CPU usage
Identify regions of high OpenMP synchronisation
Maximize System Efficiency
Aggregate data