Arm HPC Ecosystem: Hardware, Software and Tools

Srinath Vadlamani, Field Application Engineer SEA, April 8, 2019

Arm Technology Already Connects the World

Arm is ubiquitous
• 21 billion chips sold by partners in 2017 alone
• Mobile/Embedded/IoT/Automotive/Server/GPUs

Partnership is key
• We design IP, not manufacture chips
• Partners build products for their target markets

Choice is good
• One size is not always the best fit for all
• HPC is a great fit for co-design and collaboration

© 2019 Arm Limited

Armv8-A Architecture Evolution

RISC architecture
• Only 32 bits are available for encoding all instructions
• Supports the development of efficient implementations

64-bit capable since 2012
• Known as AArch64 (or AArch32 when run in a 32-bit mode)
• 128-bit vector unit (aka NEON Advanced SIMD)
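To make the fixed 32-bit encoding concrete, here is a minimal Python sketch that splits a raw code buffer into 4-byte little-endian instruction words. The two words used are the A64 encodings of NOP (0xD503201F) and RET (0xD65F03C0); the helper name `split_a64_words` is hypothetical and only serves this illustration.

```python
import struct


def split_a64_words(code: bytes) -> list:
    """Split a raw A64 code buffer into 32-bit little-endian instruction words."""
    if len(code) % 4 != 0:
        raise ValueError("A64 code must be a multiple of 4 bytes")
    count = len(code) // 4
    return list(struct.unpack(f"<{count}I", code))


# NOP followed by RET, in the byte order they would have in memory.
code = bytes.fromhex("1f2003d5" "c0035fd6")
words = split_a64_words(code)

for offset, word in zip(range(0, len(code), 4), words):
    # Every instruction starts on a 4-byte boundary, so decoding never
    # needs to inspect an instruction to find where the next one begins.
    print(f"{offset:#06x}: {word:#010x}")
```

Because every instruction has the same width, the decoder can locate instruction N at byte offset 4·N without looking at any earlier instruction.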

• AArch64 execution state
• A64 instruction set
• Atomic memory ops
• Type2 hypervisor support
• Half-precision float
• RAS support
• Statistical profiling
• Pointer authentication
• Nested virtualization
• Complex float

Arm business model

Arm develops technology that is licensed to semiconductor companies. Arm receives an upfront license fee and a royalty on every chip that contains its technology.

• Business Development: Arm licenses technology to the SemiCo Partner (License Fee)
• Partner develops chips (Per-Chip Royalty)
• OEM Customer: OEM sells consumer products

CPU Engagement Models With Arm

Core License (Standard CPU)
• Partner licenses complete microarchitecture design
• Wide choices available
• Many different A, R & M products
• CPU differentiation through flexible configuration options and a wide implementation envelope with different process technologies
• Range of licensing & engagement models possible

Architecture License (Proprietary CPU)
• Partner designs complete CPU microarchitecture from scratch
• Clean room – no reference to Arm core designs
• Freedom to develop any design
• Must conform to the rules & model of a given architecture variant
• Must pass Arm architecture validation to preserve software compatibility
• Long term strategic investment

HPC on Arm – What’s new in 2018/19

Powerful hardware for now and future
• Marvell ThunderX2 now GA
• Fujitsu announced details of A64FX (with SVE) for Post-K
• Arm announces Neoverse brand for infrastructure and core IP roadmap (Ares, Zeus, Poseidon) with each generation delivering 30% perf boost. N1 platform details announced.

Mature toolchains and ISV Software
• Three mature toolchains available – Arm Commercial, GNU and Cray CE
• ISVs start porting to Arm – Altair RADIOSS, ANSYS Fluent and LS-DYNA

Deployments
• New deployments across the EU and USA
• USA – Sandia Astra (Top 500), Comanche Clusters
• EU – Catalyst and Isambard in UK, GENCI and Dibona (MontBlanc 3) in France

Arm Hardware for Infrastructure (including HPC)

AWS Graviton by Amazon

Huawei unveils KunPeng 920 CPU and TaiShan Servers
Industry’s Highest Performance 2.6GHz 64-cores 7nm based ARM v8 Server SoC & Servers

Big Data, Distributed Storage and Arm-Native applications

TaiShan 2280 Balanced Server

TaiShan 5280/5290 Storage Server

TaiShan X6000 High-Density Server

“Use ARM-based CPU in areas like cloud and servers where they are better.” – William XU, Chief Strategy Marketing Officer, Huawei

The Cloud to Edge Infrastructure Foundation for a World of 1T Intelligent Devices

Broad SoC system design options within Arm Ecosystem

• Arm IP: High performance CPUs, Data plane CPUs, CMN Fabric, Other IP
• Arm Architectural design: Custom Arm High performance CPU, Custom Fabric & IP
• Accelerators: ML, on-die FPGA, Networking, security, encryption, Video, Custom
• Memory: DDR, HBM, Flash, Storage Class memory
• IO: PCIe, CCIX, 100G+ ethernet
• Foundry: TSMC 7FF, Samsung 7LPP, UMC
• Common Software Platform and Ecosystem: Arm Architecture v8.x-A

Arm IP: Commitment to Infrastructure segment

~30% per generation: faster performance and new features
• Today: Cosmos Platform (A72, A75), 16nm
• 2019: Ares (N1) Platform, 7nm
• 2020: Zeus Platform, 7nm+
• 2021: Poseidon Platform, 5nm
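As a rough worked example of what "~30% per generation" would add up to, the sketch below compounds the gain over the three announced generations. It assumes the gain is exactly 30% each time and that the baseline is the Cosmos platform; the slide only gives an approximate figure, so the results are indicative at best.

```python
def cumulative_speedup(per_gen_gain: float, generations: int) -> float:
    """Compound a fixed per-generation gain over a number of generations."""
    return (1.0 + per_gen_gain) ** generations


# Assumption: each generation delivers exactly +30% over the previous one.
platforms = ["Ares (N1)", "Zeus", "Poseidon"]
for n, name in enumerate(platforms, start=1):
    print(f"{name}: {cumulative_speedup(0.30, n):.2f}x relative to Cosmos")
```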

Neoverse N1 platform
Accelerating the transformation to a scalable cloud to edge infrastructure

Revolutionary compute performance

Platform features specific to infrastructure

Extreme range of scale and diversity of compute

Neoverse N1 platform: Revolutionary compute performance

• NGINX: 2.5x
• Java*: 1.7x
• MemcacheD: 2.5x

Improved cloud to edge TCO through revolutionary workload performance

Data shown for Neoverse N1 has been collected/projected from an array of platforms, and is relative to Cortex A72 “Cosmos”.
*Based on an industry standard Java-based benchmark

Arm Hardware for HPC

Arm Architecture Partner SoC for HPC
Available or Announced in 2018-19

HPC Software Ecosystem

Arm HPC Ecosystem – Overview

• HPC Applications: Open-source, Owned, and Commercial ISV codes
• App/ISA specific optimizations, optimized libs and intrinsics: Arm PL, BLAS, FFTW, etc.
• Job schedulers and Resource Management: SLURM, IBM LSF, Altair PBS Pro, etc.
• User-space utilities, scripting, container, and other packages: Singularity, Openstack, OpenHPC, Python, NumPy, SciPy, etc.
• Cluster Management Tools: Bright, HPE CMU, xCat, Warewulf
• Parallelism standards: OpenMP (omp / gomp), MPI, SHMEM (see below)
• HPC Programming Languages: C, C++ and Fortran via GNU, LLVM, Arm & OEMs
• Debug and performance analysis tools: Arm Forge, Rogue Wave, TAU, etc.
• Filesystems: BeeGFS, LUSTRE, ZFS, HDFS, GPFS
• Communication Stacks and run-times: Mellanox IB/OFED/HPC-X, OpenMPI, MPICH, MVAPICH2, OpenSHMEM, OpenUCX, HPE MPI
• Silicon Suppliers: Marvell, Fujitsu, Mellanox
• OS Distro of choice: RHEL, SUSE, CENTOS,…
• OEM/ODM’s: Cray, HPE, ATOS-Bull, Fujitsu, Gigabyte, Inventec, Foxconn
• Arm Server Ready Platform: Standard OS compatible FW and RAS features

Common HPC applications now available

GROMACS LAMMPS CESM2 MrBayes Bowtie

NAMD AMBER Paraview SIESTA UM

Quantum ESPRESSO WRF VASP MILC GEANT4

OpenFOAM GAMESS VisIT DL-Poly NEMO

BLAST NWCHEM Abinit BWA QMCPACK

Build recipes online at https://gitlab.com/arm-hpc/packages/wikis/home

Categories: Chem/Phys, Weather, CFD, Visualization, Genomics

ISV codes on Arm

Porting underway Available

OpenHPC: Typical HPC packages available for Arm

OpenHPC is a community effort to provide a common, verified set of open source packages for HPC deployments

Arm and partners actively involved:
• Arm is a silver member of OpenHPC
• Linaro is on Technical Steering Committee
• Arm-based machines in the OpenHPC build infrastructure

Status: 1.3.6 release out now
• Packages built on Armv8-A for CentOS and SUSE

Functional areas and the components they include:
• Base OS: CentOS 7.5, SLES 12 SP3
• Administrative Tools: Conman, Ganglia, Lmod, LosF, Nagios, pdsh, pdsh-mod-slurm, prun, EasyBuild, ClusterShell, mrsh, Genders, Shine, test-suite
• Provisioning: Warewulf
• Resource Mgmt.: SLURM, Munge
• I/O Services: Lustre client (community version)
• Numerical/Scientific Libraries: Boost, GSL, FFTW, Metis, PETSc, Trilinos, Hypre, SuperLU, SuperLU_Dist, Mumps, OpenBLAS, Scalapack, SLEPc, PLASMA, ptScotch
• I/O Libraries: HDF5 (pHDF5), NetCDF (including C++ and Fortran interfaces), Adios
• Compiler Families: GNU (gcc, g++, gfortran), LLVM
• MPI Families: OpenMPI, MPICH
• Development Tools: Autotools (autoconf, automake, libtool), Cmake, Valgrind, R, SciPy/NumPy, hwloc
• Performance Tools: PAPI, IMB, pdtoolkit, TAU, Scalasca, Score-P, SIONLib

Arm HPC Ecosystem website: www.arm.com/hpc
Starting point for developers and end-users of Arm for HPC

Latest events, news, blogs, and collateral including whitepapers, webinars, and presentations

Links to HPC open-source & commercial SW packages

Guides for porting HPC applications

Quick-start guides to Arm tools

Links to community collaboration sites

Curated and moderated by Arm

Arm HPC Community: community.arm.com/tools/hpc/
HPC Community-driven Content

Blogs by Arm and our HPC community

Calendar of upcoming events such as workshops and webinars

HPC Forum with questions & posts curated and moderated by Arm HPC technical specialists

Ask, answer, share progress and expertise

Arm HPC Packages wiki
www.gitlab.com/arm-hpc/packages/wikis
• Dynamic list of common HPC packages
• Status and porting recipes
• Community driven
• Anyone can join and contribute
• Provides focus for porting progress
• Allows developers to share and learn

Open source libraries for helping increase performance

Arm Optimized Routines
https://github.com/ARM-software/optimized-routines
These routines provide high performing versions of many math.h functions
• Algorithmically better performance than standard library calls
• No loss of accuracy

SLEEF library
https://github.com/shibatch/sleef/
Vectorized math.h functions
• Provided as an option for use in Arm Compiler

Perf-libs-tools
https://github.com/ARM-software/perf-libs-tools
Understanding an application’s needs for BLAS, LAPACK and FFT calls
• Used in conjunction with Arm Performance Libraries, it can generate logging info to help profile applications for specific case breakdowns
• Example visualization: DGEMM cases called

Arm HPC deployments

Deployments

Arm Supercomputer Makes Top500 List!

“Astra, the world’s fastest Arm-based supercomputer according to the TOP500 list, has achieved a speed of 1.529 petaflops, placing it 203rd on a ranking of top computers …”

Vanguard Astra at Sandia
MOST POWERFUL ARM SUPERCOMPUTER, IN TOP 500 (#203 in HPL and #36 in HPCG)

• 2,592 HPE Apollo 70 compute nodes
• 5,184 CPUs, 145,152 cores, 2.3 PFLOPs (peak)
• Marvell ThunderX2 Arm SoC, 28 core, 2.0 GHz
• Memory per node: 128 GB (16 x 8 GB DR DIMMs)
• Aggregate capacity: 332 TB, 885 TB/s (peak)
• Mellanox IB EDR, ConnectX-5
• 112 36-port edges, 3 648-port spine switches
• Red Hat RHEL for Arm
• HPE Apollo 4520 All–flash Lustre storage
• Storage Capacity: 403 TB (usable)
• Storage Bandwidth: 244 GB/s
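The headline Astra figures can be cross-checked from the per-node numbers. The sketch below is a back-of-the-envelope calculation only; the value of 8 double-precision FLOPs per cycle per core is an assumption that is not stated on the slide (it corresponds to two 128-bit FMA pipes), and the two-sockets-per-node figure is inferred from 5,184 CPUs across 2,592 nodes.

```python
NODES = 2592
SOCKETS_PER_NODE = 2       # inferred: 5,184 CPUs / 2,592 nodes
CORES_PER_SOCKET = 28
CLOCK_GHZ = 2.0
MEM_PER_NODE_GB = 128

# Assumption (not on the slide): 8 double-precision FLOPs per cycle per core,
# i.e. two 128-bit FMA pipes x 2 doubles x 2 operations (multiply + add).
FLOPS_PER_CYCLE = 8

cpus = NODES * SOCKETS_PER_NODE
cores = cpus * CORES_PER_SOCKET
peak_pflops = cores * CLOCK_GHZ * FLOPS_PER_CYCLE / 1e6   # GFLOPs -> PFLOPs
memory_tb = NODES * MEM_PER_NODE_GB / 1000                # GB -> TB

print(f"CPUs:   {cpus:,}")
print(f"Cores:  {cores:,}")
print(f"Peak:   {peak_pflops:.2f} PFLOPs")
print(f"Memory: {memory_tb:.0f} TB")
```

Under these assumptions the derived values line up with the 5,184 CPUs, 145,152 cores, 2.3 PFLOPs and 332 TB quoted above.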

Isambard in Production at Bristol/GW4
Largest EU Arm HPC cluster to date
• Cray XC50 system w/ 168 nodes with Marvell ThunderX2 (32C)
• 10,752 total cores
• High-speed ARIES interconnect
• Cray HPC SW Stack including CCE, CrayPAT, Cray MPI, libs, ...
• Production deployment reached @ SC18

Deployments: Catalyst UK

• HPE, in conjunction with Arm and SUSE, announced in April the “Catalyst UK” program: deployments to accelerate the growth of the Arm HPC ecosystem into three universities
• Each machine will have:
  • 64 HPE Apollo 70 systems, each with two 32-core Cavium ThunderX2 processors (i.e. 4096 cores per machine), 128GB of memory and Mellanox InfiniBand interconnects
  • SUSE Linux Enterprise Server for HPC

Bristol: VASP, CASTEP, Gromacs, CP2K, Unified Model, Hydra, NAMD, Oasis, NEMO, OpenIFS, CASINO, LAMMPS
EPCC: WRF, OpenFOAM, Rolls Royce Hydra opt, 2 PhD candidates
Leicester: Data-intensive apps, genomics, MOAB Torque, DiRAC collab

Deployment: Mont Blanc

Deployments: HPE’s Comanche Collaboration
Early access to Cavium ThunderX2 systems that became Apollo 70

Engagements in the HPE Comanche program have accelerated adoption
• We have been able to assess the state of fundamental software stacks, such as MPI and NUMA capabilities
• Collaborative work has been especially strong here, with all partners focusing on interoperability issues
• Examples include fixing bugs with kernels, MPI drivers and OpenMP thread placement
• Optimization of packages, environment and execution configurations

Over 1,000 processors delivered | LLNL TOSS stack ported and demoed | InfiniBand optimized

Performance results on Arm

Single node results from GENCI - France

Isambard, UK – Single node results

Isambard, UK – Multi-node results
Gromacs (42M atoms) on Horizon (Intel Skylake, 20C) vs Isambard (Marvell ThunderX2, 32C)

Commercial Tools for HPC by Arm

Our Solution for any Architecture, at any Scale
Commercial tools for AArch64, x86_64, ppc64 and accelerators

Arm Cross-Platform Tools: debug, optimise and analyse any platform
• Arm DDT: Slash your time to debug on any hardware, at any scale.
• Arm MAP: Speed-up applications with a lightweight scalable profiler
• Arm Forge Professional: Arm DDT and MAP in One Single Package
• Arm Performance Reports: Find the most efficient settings for your workloads.

All-inclusive development toolkit for Arm hardware
• Arm HPC Compiler: Linux user space compiler for HPC applications
• Arm Performance Libraries: BLAS, LAPACK and FFT
• Arm Forge Professional: Multi-node interoperable profiler and debugger
• Arm Performance Reports: Interoperable application performance insight

Arm Allinea Studio
Built for developers to achieve best performance on Arm with minimal effort
• Comprehensive and integrated tool suite for Scientific computing, HPC and Enterprise developers
• Seamless end-to-end workflow from getting started to advanced optimization of your workloads
• Commercially supported by Arm engineers
• Frequent releases with continuous performance improvements
• Ready for current and future generations of server-class Arm-based platforms
• Available for a wide variety of Arm-based server-class platforms

Meets the requirements of HPC developers on Arm
• Develop and build: Arm Linux Compiler, for C, C++ and Fortran codes
• Debug: Arm DDT, cross-platform parallel debugger
• Profile: Arm MAP, cross-platform lightweight profiler
• Optimize: Arm Performance Libraries (BLAS, LAPACK, FFT; scalar and vector math functions)
• Maximize System Efficiency: Arm Performance Reports

Arm Allinea Studio
A quick glance at what is in Arm Allinea Studio

C/C++ Compiler
• C++ 14 support
• OpenMP 4.5 without offloading
• SVE ready

Fortran Compiler
• Fortran 2003 support
• Partial Fortran 2008 support
• OpenMP 3.1
• SVE ready

Performance Libraries
• Optimized math libraries
• BLAS, LAPACK and FFT
• Threaded parallelism with OpenMP
• Scalar math routines

Forge (DDT and MAP)
• Profile, Tune and Debug
• Scalable debugging with DDT
• Parallel Profiling with MAP

Performance Reports
• Analyze your application
• Memory, MPI, Threads, I/O, CPU metrics

Tuned by Arm for a wide range of server-class Arm-based platforms

Progress in the last year
A fully integrated tools suite for deployment on Arm systems

Arm C/C++ Compiler
• Porting and tuning guides for common applications
• Optimizations and bug fixes

Arm Fortran Compiler
• New Fortran Directives
• Improved Fortran 2008 support
• Support for vectorization of loops with math calls

Arm Perf Libraries
• BLAS, FFT and LAPACK Improvements
• Sparse routine SPMV support
• Scalar math routines

Forge and Perf Reports
• General cross-platform improvements
• Python profiling
• Better interop with Arm Compiler and Libraries

GNU8 toolchain
• GCC and Gfortran
• 2nd toolchain in the studio
• Better suited for certain applications
• Beta support for HPC users

Support and tuning for Arm server-class platforms

Commercial C/C++/Fortran compiler with best-in-class performance

Compilers tuned for Scientific Computing and HPC: tuned for Scientific Computing, HPC and Enterprise workloads
• Processor-specific optimizations for various server-class Arm-based platforms
• Optimal shared-memory parallelism using latest Arm-optimized OpenMP runtime

Latest features and performance optimizations: Linux user-space compiler with latest features
• C++ 14 and Fortran 2003 language support with OpenMP 4.5
• Support for Armv8-A and SVE architecture extension
• Based on LLVM and Flang, leading open-source compiler projects

Commercially supported by Arm
• Available for a wide range of Arm-based platforms running leading Linux distributions – RedHat, SUSE and Ubuntu

Arm Compiler – Building on LLVM, Clang and Flang projects

Arm C/C++/Fortran Compiler pipeline:
• Language specific frontend: C/C++ Files (.c/.cpp) go through the Clang based C/C++ Frontend; Fortran Files (.f/.f90) go through the Flang based Fortran Frontend. Both emit LLVM IR.
• Language agnostic optimization: the LLVM based Optimizer applies IR Optimizations, Auto-vectorization and Enhanced optimization for Armv8-A and SVE, and emits LLVM IR.
• Architecture specific backend: the LLVM based Armv8-A code-gen produces an Armv8-A binary; the LLVM based SVE code-gen produces an SVE binary.

Arm Linux Compiler – What’s new in 2018/19?

Overall - Better code generation
• For Arm platforms for current generation (Marvell ThunderX2) and future (SVE based)
• Base compiler technology upgrade (Clang/LLVM 7, GNU8, Latest Flang)
• Vectorization of loops with math function calls

Fortran – Increase in maturity
• Enable key Fortran applications (open source, in house and commercial)
• Improved auto vectorization
• Fortran vectorization directives like IVDEP

Optimized BLAS, LAPACK and FFT

Commercially supported by Arm: commercial 64-bit Armv8-A math libraries
• Commonly used low-level math routines - BLAS, LAPACK and FFT
• Provides FFTW compatible interface for FFT routines
• Batched BLAS support

Best in class performance: best-in-class serial and parallel performance
• Generic Armv8-A optimizations by Arm
• Tuning for specific platforms like Cavium ThunderX2 in collaboration with silicon vendors

Validated with NAG test suite: validated and supported by Arm
• Available for a wide range of server-class Arm-based platforms
• Validated with NAG’s test suite, a de-facto standard

Arm Performance Libraries progress
Progress and additions since SC17

Key improvements since 18.0
• Massive improvements in FFT performance
• All basic, advanced and guru interface FFTW calls now supported
• Many functions have had extra serial and parallel performance improvements targeting ThunderX2
• Particular focus on GEMMs and POTRF, GETRF and GETQR

New features in 19.0
• Sparse linear algebra for higher performing SpMV calls
• FFTW MPI interface for FFT calls added
• Parallelisation of many FFTW plans
• Parallel scaling improvements, especially for ThunderX2
• Addition of libamath
• High performing implementations of certain key math.h functions

Compiler and Libraries - Future roadmap
Focus on current and next generation hardware

More features in compilers and libraries
• Libraries: Vector Math routines and more scalar math routines
• Fortran Compiler: Directives & new Fortran 2008/OpenMP features
• All compilers: Vectorization and optimization report improvements

More optimizations for current hardware
• Application specific tuning and optimization
• For Marvell ThunderX2 and other server-class Arm-based platforms

Getting ready for SVE-based future hardware
• SVE enabled Performance Libraries
• Application specific tuning and optimization in Compilers and Libraries for SVE

Toolchain performance results

Arm Compiler and Libraries – 19.1 release
Progress and additions since SC18 (19.0 release)

Arm C/C++/Fortran Compilers
• Fortran: TRAILZ intrinsic, a Fortran 2008 feature, now supported
• Fortran: Runtime I/O performance improvement when handling formatted text data
• Fortran: New UNROLL directive to provide unrolling hints to the compiler
• Bug fixes

Arm Perf Libraries
• BLAS - Improved GEMV and GEMM (SCZ variants)
• FFTW Fortran MPI interface now supported
• FFT MPI parallel scaling has been improved
• SpMV - Support for CSC and COO formats; improved single-precision performance; Fortran interface now supported
• Math routines (in libamath) – Vector routines support with optimized logf and expf; Arm Compiler uses libamath by default; a GNU compatible version provided
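For readers unfamiliar with the two sparse formats named above, the following is a small pure-Python sketch of sparse matrix-vector multiplication (SpMV) in COO and CSC form. It only illustrates the data layouts; the function names are hypothetical and this is not the Arm Performance Libraries interface.

```python
def spmv_coo(n_rows, rows, cols, vals, x):
    """y = A @ x with A stored as coordinate (row, col, value) triples."""
    y = [0.0] * n_rows
    for r, c, v in zip(rows, cols, vals):
        y[r] += v * x[c]
    return y


def spmv_csc(n_rows, col_ptr, row_idx, vals, x):
    """y = A @ x with A stored column by column (compressed sparse column)."""
    y = [0.0] * n_rows
    for col in range(len(col_ptr) - 1):
        # Entries col_ptr[col] .. col_ptr[col + 1] - 1 belong to this column.
        for k in range(col_ptr[col], col_ptr[col + 1]):
            y[row_idx[k]] += vals[k] * x[col]
    return y


# A = [[4, 0, 1],
#      [0, 3, 0],
#      [2, 0, 5]]
coo_rows = [0, 0, 1, 2, 2]
coo_cols = [0, 2, 1, 0, 2]
coo_vals = [4.0, 1.0, 3.0, 2.0, 5.0]

csc_col_ptr = [0, 2, 3, 5]
csc_row_idx = [0, 2, 1, 0, 2]
csc_vals = [4.0, 2.0, 3.0, 1.0, 5.0]

x = [1.0, 2.0, 3.0]
y_coo = spmv_coo(3, coo_rows, coo_cols, coo_vals, x)
y_csc = spmv_csc(3, csc_col_ptr, csc_row_idx, csc_vals, x)
print(y_coo, y_csc)
```

COO is the simplest format to assemble, while CSC groups entries by column, so each element of x is read once per column.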

BLAS improvements to many GEMM routines in 19.1

[Figure: CGEMM on Marvell ThunderX2 run using 56 threads. Performance (GFLOPs) against matrix size M=N=K (0 to 5000), comparing 19.0 and 19.1.]
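GFLOPs figures for GEMM are normally derived from the matrix sizes and a measured run time. The sketch below uses the common convention of 2·M·N·K floating-point operations for a real GEMM and 8·M·N·K for a complex one such as CGEMM; whether the plot uses exactly this convention is an assumption, and the 0.4 s timing is an invented input for illustration, not a measurement.

```python
def gemm_gflops(m, n, k, seconds, complex_data=False):
    """Convert a GEMM run time into GFLOPs.

    A real multiply-add counts as 2 FLOPs; a complex multiply-add counts
    as 8 real FLOPs (4 multiplies + 4 adds). This is a convention, and
    an assumption about how the plotted numbers were derived.
    """
    flops_per_mac = 8 if complex_data else 2
    return flops_per_mac * m * n * k / seconds / 1e9


# Hypothetical timing: a 4000 x 4000 x 4000 CGEMM taking 0.4 s.
rate = gemm_gflops(4000, 4000, 4000, seconds=0.4, complex_data=True)
print(f"{rate:.0f} GFLOPs")
```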

BLAS improvements to GEMV routines in 19.1
All cases improved for both serial and parallel. Comparison shown on ThunderX2 for serial SGEMV and DGEMV against OpenBLAS.

[Figure: two panels, SGEMV on ThunderX2 and DGEMV on ThunderX2. Performance (MFLOPs) against matrix size M=N (0 to 10000) for OpenBLAS - N, OpenBLAS - T, Arm PL 19.1 - N and Arm PL 19.1 - T.]

FFT MPI performance in 19.1
Scaling using FFTW MPI interface improved; now similar scaling to FFTW

[Figure: two panels, FFT MPI performance on ThunderX2 for the 3-d cases 512x512x512 and 1024x1024x1024. Solution time (s) against number of MPI processes (1 to 64) for ArmPL 19.1, FFTW 3.3.8 and perfect scaling.]
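The "perfect scaling" reference line in a strong-scaling plot is simply the single-process time divided by the process count. A minimal sketch, with hypothetical helper names and invented timings used purely as inputs:

```python
def ideal_times(t1, procs):
    """Perfect strong scaling: time halves every time the process count doubles."""
    return [t1 / p for p in procs]


def parallel_efficiency(t1, tp, p):
    """Fraction of perfect scaling achieved on p processes (1.0 is ideal)."""
    return t1 / (p * tp)


procs = [1, 2, 4, 8, 16, 32, 64]
# Hypothetical timings in seconds, for illustration only (not measured data).
measured = [10.0, 5.2, 2.7, 1.5, 0.8, 0.45, 0.27]

for p, ideal, tp in zip(procs, ideal_times(measured[0], procs), measured):
    eff = parallel_efficiency(measured[0], tp, p)
    print(f"{p:3d} procs: ideal {ideal:6.3f} s, measured {tp:6.3f} s, efficiency {eff:.0%}")
```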

Libamath – increased performance for math.h functions
ELEFUNT run on ThunderX2: cases no libamath, Arm Compiler with libamath 19.0 and 19.1

[Figure: Math performance measured by ELEFUNT. Performance as a percentage of 19.0 for gfortran/libm, libamath 19.0 and libamath 19.1 across ALOG, EXP, PWR, SIN, NINT, DLOG, DEXP, DPWR, DSIN, DNINT and DRECI.]

Cross-Platform tools
Arm Forge and Arm Performance Reports

By Choosing Arm, You Choose a State-of-the-art Solution

• Interoperable: Available on the vast majority of HPC platforms, including AMD, IBM, Intel, Nvidia… and of course Arm!
• Performant: Fast, lightweight and transparent tools that help focus on the real issues that count
• Comprehensive: Packed with the best features to slash the development overhead spent on debugging and optimising issues

Arm Forge Professional
A cross-platform toolkit for debugging and profiling

Commercially supported by Arm: the de-facto standard for HPC development
• Available on the vast majority of the Top500 machines in the world
• Fully supported by Arm on x86, IBM Power, Nvidia GPUs, etc.

Fully Scalable: state-of-the-art debugging and profiling capabilities
• Powerful and in-depth error detection mechanisms (including memory debugging)
• Sampling-based profiler to identify and understand bottlenecks
• Available at any scale (from serial to petaflopic applications)

Very user-friendly: easy to use by everyone
• Unique capabilities to simplify remote interactive sessions
• Innovative approach to present quintessential information to users

Arm Performance Reports
Characterize and understand the performance of HPC application runs

Commercially supported by Arm: gathers a rich set of data
• Analyses metrics around CPU, memory, IO, hardware counters, etc.
• Possibility for users to add their own metrics

Accurate and astute insight: build a culture of application performance & efficiency awareness
• Analyses data and reports the information that matters to users
• Provides simple guidance to help improve workloads’ efficiency

Relevant advice to avoid pitfalls: adds value to typical users’ workflows
• Define application behaviour and performance expectations
• Integrate outputs to various systems for validation (e.g. continuous integration)

• Can be automated completely (no user intervention)

Key highlights in Forge & Performance Reports
Latest 19.0 version released in Dec 2018

Packaging (Forge DDT, Forge MAP and Performance Reports)
• Creation of Allinea Studio: a new solution for aarch64 platforms that includes the Arm Compiler, Arm Performance Libraries, and the former Allinea tools!

Platforms (Forge DDT, Forge MAP and Performance Reports)
• Full support for IBM systems
• Arm v8 support
• CUDA 9 support

Improvements
• DDT: Usability improvements; memory debugging optimizations
• MAP: Optimizations for many-core systems
• Performance Reports: Optimizations for many-core systems

New Features
• DDT: Python Debugging
• MAP: Python profiling; combined C/C++/Fortran and Python performance analysis; on-kernel GPU profiling; ability to profile selected ranks
• Performance Reports: Backfill Custom Metrics; ability to profile selected ranks

Forge and Performance Reports – Future roadmap
Why do our tools matter and what will we focus on this year?

Reduce migration costs and increase portability
• Finding and using the right hardware is hard, even more so because of porting and migration costs.
• We will keep providing cross-platform tools to enable choice and innovation in HPC.

Slash down code validation costs and time
• For every run in production, codes are run 3 to 5 times to validate they meet standards.
• We will help the community reduce its testing costs by promoting best practices and tightening the link between tools and agile continuous delivery.

Provide capabilities on demand
• Too often, users are stopped in their work by licence size limitations.
• We will work on providing capabilities to users on demand at any time.

Forge/Performance Reports Roadmap 2018-2019
Key highlights for Forge/PR 19.1 and 19.2

Continuous work
• Support for latest software environments (MPI, compilers, etc.)
• Support for popular HPC systems (Intel, Arm, Power, GPUs…)
• Developing exclusive features in collaboration with vendors (e.g. HPE, etc.)

19.1
• Arithmetic evaluations of CPU metrics
• Assembly views to Forge
• Integration with DynamoRIO for low-level instrumentation of operations

19.2/20.0
• Addition of a “burst mode” in the tools
• Simplify the integration of tools within scripts
• Add the json, xml, csv outputs of the “offline” tools features

SVE - Introduction, tools and workflow

Scalable Vector Extension (SVE)
A vector extension to the ARMv8-A architecture with some major new features

Gather-load and scatter-store: loads a single register from several non-contiguous memory locations.
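As an informal model of what gather-load and scatter-store do, the sketch below treats memory as a Python list and a vector register as a list of lanes. The function names are hypothetical, and inactive lanes are modelled as zero on load; this illustrates the idea rather than describing any particular instruction.

```python
def gather_load(memory, indices, pred):
    """Build one vector from non-contiguous memory locations.

    Lanes whose predicate bit is false are modelled as zero.
    """
    return [memory[i] if p else 0 for i, p in zip(indices, pred)]


def scatter_store(memory, indices, values, pred):
    """Write active lanes of a vector to non-contiguous memory locations."""
    for i, v, p in zip(indices, values, pred):
        if p:
            memory[i] = v


memory = [10, 11, 12, 13, 14, 15, 16, 17]
indices = [7, 0, 3, 5]

gathered = gather_load(memory, indices, [True, True, True, True])
print(gathered)

# Store back only lanes 0 and 2; lanes 1 and 3 leave memory untouched.
scatter_store(memory, indices, [70, 0, 30, 50], [True, False, True, False])
print(memory)
```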

Per-lane predication: operations work on individual lanes under control of a predicate register. Example: 1 2 3 4 + 5 5 5 5 with pred 1 0 1 0 = 6 2 8 4.
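The predication example (1 2 3 4 + 5 5 5 5 under predicate 1 0 1 0 giving 6 2 8 4) can be reproduced with a few lines of Python. This is a model of merging predication, where inactive lanes keep the first operand's value; the function name is hypothetical.

```python
def predicated_add(a, b, pred):
    """Add b to a only in lanes where the predicate is set.

    Inactive lanes keep the value from `a` (merging predication).
    """
    return [x + y if p else x for x, y, p in zip(a, b, pred)]


result = predicated_add([1, 2, 3, 4], [5, 5, 5, 5], [1, 0, 1, 0])
print(result)
```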

Predicate-driven loop control and management: eliminate scalar loop heads and tails by processing partial vectors. Example: for the loop for (i = 0; i < n; ++i), INDEX i gives n-2 n-1 n n+1 and CMPLT n gives the predicate 1 1 0 0.
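The loop-control idea can be sketched as follows: a predicate is generated from the loop index and the trip count, so the final partial vector is handled by the same loop body and no scalar tail is needed. The same code then gives identical results for any vector length. The names `whilelt` and `vla_scale` are hypothetical helpers for this model, and `lanes` stands in for the hardware vector length.

```python
def whilelt(i, n, lanes):
    """Predicate that is true for lanes whose element index i + lane is below n."""
    return [(i + lane) < n for lane in range(lanes)]


def vla_scale(data, factor, lanes):
    """Scale every element, processing `lanes` elements per iteration.

    The last iteration may be a partial vector; the predicate masks off
    the lanes that would run past the end, so there is no scalar tail loop.
    """
    out = list(data)
    n = len(data)
    i = 0
    while i < n:
        pred = whilelt(i, n, lanes)
        for lane, active in enumerate(pred):
            if active:
                out[i + lane] = data[i + lane] * factor
        i += lanes
    return out


data = list(range(10))
# The result does not depend on the vector length chosen.
results = [vla_scale(data, 2, lanes) for lanes in (2, 4, 8)]
print(results[0])
print(whilelt(8, 10, 4))
```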

Vector partitioning and software-managed speculation: First Faulting Load instructions allow memory accesses to cross into invalid pages. Example: loading 1 2 followed by two inaccessible elements yields 1 2 0 0 with pred 1 1 0 0.
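A simplified model of a first-faulting load: elements are loaded in order until the first inaccessible address, and a predicate records which lanes were actually loaded so that software can continue with the valid partition. Only a fault on the very first element is reported as a real fault. The function name, the flat "valid up to a limit" memory model and the zero fill for unloaded lanes are all simplifying assumptions.

```python
def first_faulting_load(memory, valid_limit, base, lanes):
    """Model of a first-faulting contiguous load.

    Addresses below `valid_limit` are accessible. Returns (values, ffr),
    where ffr marks the lanes that were loaded successfully.
    """
    if base >= valid_limit:
        # A fault on the first element is a genuine fault.
        raise MemoryError(f"fault at address {base}")
    values, ffr = [], []
    faulted = False
    for lane in range(lanes):
        addr = base + lane
        if not faulted and addr < valid_limit:
            values.append(memory[addr])
            ffr.append(True)
        else:
            # Suppress the fault and mark this and all later lanes as not loaded.
            faulted = True
            values.append(0)   # modelled as 0 here
            ffr.append(False)
    return values, ffr


memory = [1, 2]            # only two accessible elements
values, ffr = first_faulting_load(memory, valid_limit=2, base=0, lanes=4)
print(values, ffr)
```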

Extended floating-point horizontal reductions: in-order and tree-based reductions trade off performance and repeatability. Example: 1 + 2 + 3 + 4 can be summed in order, or as a tree, (1 + 2) + (3 + 4) = 3 + 7.
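Why the reduction order matters for repeatability can be seen with ordinary double-precision arithmetic: floating-point addition is not associative, so an in-order sum and a pairwise tree sum can give different answers for the same data. A small sketch with hypothetical helper names:

```python
def in_order_sum(values):
    """Strictly sequential reduction: ((v0 + v1) + v2) + v3 ..."""
    total = 0.0
    for v in values:
        total += v
    return total


def tree_sum(values):
    """Pairwise tree reduction: (v0 + v1) + (v2 + v3) ..."""
    vals = list(values)
    while len(vals) > 1:
        if len(vals) % 2:
            vals.append(0.0)
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0]


# Small integers are exact, so both orders agree.
print(in_order_sum([1.0, 2.0, 3.0, 4.0]), tree_sum([1.0, 2.0, 3.0, 4.0]))

# With widely differing magnitudes the two orders round differently.
data = [1e16, 1.0, -1e16, 1.0]
print(in_order_sum(data), tree_sum(data))
```

The tree form exposes more parallelism, while the in-order form reproduces the result of the original scalar loop.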

SVE is Arm’s next generation SIMD ISA

• Gather-load and scatter-store
• Per-lane predication
• Predicate-driven loop control and management
• Vector partitioning and software-managed speculation
• Extended floating-point horizontal reductions

SVE: HPGMG & Lulesh

SVE: Optimizing Stencil

• What are the effects of Vector Length Agnosticism?
• How well suited is the ISA to express the semantics of stencil codes?

[Figure: three loop-nest diagrams over axes i, j, k. Baseline: Vectorise on k; Unroll j×2; Unroll j×3, i×2.]

Version       element/iteration   load/iteration   load/element
Baseline      1234                7(6)             7(6)
Unroll j2     2×1234              12(10)           6(5)
Unroll i2j3   2×3×1234            28(22)           5.6(4.6)

Open source support

• Arm actively posting SVE open source patches upstream
  • Beginning with first public announcement of SVE at HotChips 2016.
• Available upstream
  • GNU Binutils-2.28: released Feb 2017, includes SVE assembler & disassembler.
  • GCC 8: Full assembly, disassembly and basic auto-vectorization
  • GDB 8.2 SVE support
  • LLVM 7: Full assembly, disassembly
  • Linux kernel: since Mar 2017
  • QEMU 3.1: SVE support (user-space and system mode)
• Under upstream review
  • LLVM: since Nov 2016, as presented at LLVM conference.

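Compiler support

Compilers compared: Upstream GCC, Upstream LLVM, Arm Compiler 6 (for bare metal), Arm Linux Compiler (for Linux user-space)

• SVE asm and disasm: Upstream GCC: Yes; Upstream LLVM: Yes; Arm Compiler 6: Yes; Arm Linux Compiler: Yes
• SVE code generation: Upstream GCC: Yes; Upstream LLVM: No (planned for 2019-20); Arm Compiler 6: Yes; Arm Linux Compiler: Yes
• SVE ACLE: Upstream GCC: No (planned for GCC10, 2020); Upstream LLVM: No (planned for 2019-20); Arm Compiler 6: Yes; Arm Linux Compiler: Yes
• Auto-vectorization: Upstream GCC: Basic (more improvements planned for GCC9); Upstream LLVM: None (planned for 2019-20); Arm Compiler 6: Advanced; Arm Linux Compiler: Advanced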

Getting ready for SVE

Port to Arm
• Port to current Arm hardware – Single node and multi-node
• Tune it for current Arm hardware

Get ready for SVE
• Port to SVE using QEMU and/or ArmIE on current Arm hardware

Tune for SVE
• On real SVE hardware

Co-work with Arm tools and professional services team

Arm Instruction Emulator for SVE
Develop tomorrow’s software on today’s hardware
• Simple “black box” tool aimed at userspace software developers

    $ armclang hello.c -march=armv8-a+sve
    $ ./a.out
    Illegal instruction
    $ armie -msve-vector-bits=256 -- ./a.out
    Hello

• Runs userspace application binaries at close to native speed
• Runs multithreaded applications
• Transparent to system calls
• Intercepts and emulates use of Arm instructions newer than the hardware

Arm Instruction Emulator
Develop your user-space applications for future hardware today

Develop software for tomorrow’s hardware today: start porting and tuning for future architectures early
• Reduce time to market; save development and debug time with Arm support

Runs at close to native speed: run 64-bit user-space Linux code that uses new hardware features on current Arm hardware
• SVE support available now
• Tested with Arm Architecture Verification Suite (AVS)

Commercially Supported by Arm: near native speed with commercial support
• Emulates only unsupported instructions
• Maintained and supported by Arm for a wide range of Arm-based SoCs

DynamoRIO

Dynamic Binary Instrumentation
Fast code translation in userspace
Originally developed at MIT
Now managed by Google
Used for
• profiling
• valgrind-like checking
• architecture emulation

Key points of contact
Visit www.arm.com/hpc-tools for further information

Product team
• David Lecomber – Sr Director, Infrastructure tools
• Ashok Bhat – Sr Product manager – Compiler and Libraries
• Patrick Wohlschlegel – Sr Product manager – Forge and Perf Reports

Sales team
• Rob Rick and Andrew Westergren – Americas
• Marcin Krzysztofik – EMEA, India and China
• Toshinori Kujiraoka – Japan