Arm HPC Ecosystem Hardware, Software and tools
Srinath Vadlamani, Field Application Engineer SEA, April 8, 2019
Arm Technology Already Connects the World
Arm is ubiquitous
• 21 billion chips sold by partners in 2017 alone
• Mobile/Embedded/IoT/Automotive/Server/GPUs
Partnership is key
• We design IP, not manufacture chips
• Partners build products for their target markets
Choice is good
• One size is not always the best fit for all
• HPC is a great fit for co-design and collaboration
© 2019 Arm Limited
Armv8-A Architecture Evolution
RISC architecture
• Every instruction is encoded in 32 bits
• Supports the development of efficient implementations
64-bit capable since 2012
• Known as AArch64 (or AArch32 when run in a 32-bit mode)
• 128-bit vector unit (aka NEON Advanced SIMD)
• AArch64 execution state
• A64 instruction set
• Atomic memory ops
• Type2 hypervisor support
• Half-precision float
• RAS support
• Statistical profiling
• Pointer authentication
• Nested virtualization
• Complex float
Arm business model
Arm develops technology that is licensed to semiconductor companies.
Arm receives an upfront license fee and a royalty on every chip that contains its technology.
Business development flow:
• Arm licenses technology to a SemiCo Partner (License Fee)
• Partner develops chips (Per-Chip Royalty)
• OEM Customer sells consumer products
CPU Engagement Models With Arm
Core License
Partner licenses complete microarchitecture design
• Wide choices available
• Many different A, R & M products
Standard CPU
CPU differentiation through:
• Flexible configuration options
• Wide implementation envelope with different process technologies

Architecture License
Partner designs complete CPU microarchitecture from scratch
• Clean room – no reference to Arm core designs
Proprietary CPU
Freedom to develop any design
• Must conform to the rules & programmers model of a given architecture variant
• Must pass Arm architecture validation to preserve software compatibility
Long term strategic investment

Range of licensing & engagement models possible, from Core license to Architecture license
HPC on Arm – What’s new in 2018/19
Powerful hardware for now and future
• Marvell ThunderX2 now GA • Fujitsu announced details of A64FX (with SVE) for Post-K • Arm announces Neoverse brand for infrastructure and core IP roadmap (Ares, Zeus, Poseidon) with each generation delivering 30% perf boost. N1 platform details announced.
Mature toolchains and ISV Software
• Three mature toolchains available – Arm Commercial, GNU and Cray CE
• ISVs start porting to Arm – Altair RADIOSS, ANSYS Fluent and LS-DYNA
Deployments
• New deployments across the EU and USA • USA - Sandia Astra (Top 500), Comanche Clusters • EU – Catalyst and Isambard in UK, GENCI and Dibona (MontBlanc 3) in France
Arm Hardware for Infrastructure (including HPC)
AWS Graviton by Amazon
Huawei unveils KunPeng 920 CPU and TaiShan Servers
Industry’s Highest Performance 2.6GHz 64-cores 7nm based ARM v8 Server SoC & Servers
Big Data, Distributed Storage and Arm-Native applications
TaiShan 2280 Balanced Server
TaiShan 5280/5290 Storage Server
TaiShan X6000 High-Density Server
“Use ARM-based CPU in areas like cloud and servers where they are better.” – William XU, Chief Strategy Marketing Officer, Huawei
The Cloud to Edge Infrastructure Foundation for a World of 1T Intelligent Devices
Broad SoC system design options within Arm Ecosystem
Arm IP: High performance CPUs, Data plane CPUs, CMN Fabric, Other IP
Arm Architectural design: Custom Arm High performance CPU, Custom Fabric & IP
Accelerators: ML, on-die FPGA, Networking, security, encryption, Video, Custom
Memory: DDR, HBM, Flash, Storage Class memory
IO: PCIe, CCIX, 100G+ ethernet
Foundry: TSMC 7FF, Samsung 7LPP, UMC
Common Software Platform and Ecosystem: Arm Architecture v8.x-A
Arm IP: Commitment to Infrastructure segment
• Today: Cosmos Platform (A72, A75), 16nm
• 2019: Ares (N1) Platform, 7nm
• 2020: Zeus Platform, 7nm+
• 2021: Poseidon Platform, 5nm
~30% per generation: faster performance & new features
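To see what "~30% per generation" adds up to across the roadmap, the short sketch below compounds that figure from Cosmos through Poseidon. It assumes the gain is applied uniformly at every step, which the slide only states approximately, so the output is illustrative rather than a performance projection.

```python
# Compound the roadmap's "~30% per generation" figure.
# Assumption: the same 30% gain applies at every generation step.
GENERATIONS = ["Cosmos", "Ares (N1)", "Zeus", "Poseidon"]  # roadmap order
GAIN_PER_GEN = 0.30


def projected_speedups(names, gain):
    """Return {platform: speedup relative to the first platform in names}."""
    return {name: (1.0 + gain) ** step for step, name in enumerate(names)}


if __name__ == "__main__":
    for name, speedup in projected_speedups(GENERATIONS, GAIN_PER_GEN).items():
        print(f"{name:10s} {speedup:.2f}x")
```

Under this assumption three generations compound to roughly 2.2x over the starting platform.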
Neoverse N1 platform
Accelerating the transformation to a scalable cloud to edge infrastructure
Revolutionary compute performance
Platform features specific to infrastructure
Extreme range of scale and diversity of compute
Neoverse N1 platform: Revolutionary compute performance
• NGINX: 2.5x
• Java*: 1.7x
• MemcacheD: 2.5x
Improved cloud to edge TCO through revolutionary workload performance
Data shown for Neoverse N1 has been collected/projected from an array of platforms, and relative to Cortex A72 “Cosmos”
*Based on an industry standard Java-based benchmark
Arm Hardware for HPC
Arm Architecture Partner SoC for HPC
Available or Announced in 2018-19
HPC Software Ecosystem
Arm HPC Ecosystem – Overview
• HPC Applications: Open-source, Owned, and Commercial ISV codes
• App/ISA specific optimizations, optimized libs and intrinsics: Arm PL, BLAS, FFTW, etc.
• Job schedulers and Resource Management: SLURM, IBM LSF, Altair PBS Pro, etc.
• User-space utilities, scripting, container, and other packages: Singularity, Openstack, OpenHPC, Python, NumPy, SciPy, etc.
• Cluster Management Tools: Bright, HPE CMU, xCat, Warewulf
• Parallelism standards: OpenMP (omp / gomp), MPI, SHMEM (see below)
• HPC Programming Languages: Fortran, C, C++ via GNU, LLVM, Arm & OEMs
• Debug and performance analysis tools: Arm Forge, Rogue Wave, TAU, etc.
• Filesystems: BeeGFS, LUSTRE, ZFS, HDFS, GPFS
• Communication Stacks and run-times: Mellanox IB/OFED/HPC-X, OpenMPI, MPICH, MVAPICH2, OpenSHMEM, OpenUCX, HPE MPI
• Silicon Suppliers: Marvell, Fujitsu, Mellanox
• Linux OS Distro of choice: RHEL, SUSE, CENTOS,…
• OEM/ODM’s: Cray, HPE, ATOS-Bull, Fujitsu, Gigabyte, Inventec, Foxconn
• Arm Server Ready Platform: Standard OS compatible FW and RAS features
Common HPC applications now available
GROMACS LAMMPS CESM2 MrBayes Bowtie
NAMD AMBER Paraview SIESTA UM
Quantum ESPRESSO WRF VASP MILC GEANT4
OpenFOAM GAMESS VisIT DL-Poly NEMO
BLAST NWCHEM Abinit BWA QMCPACK
Build recipes online at https://gitlab.com/arm-hpc/packages/wikis/home
Categories: Chem/Phys, Weather, CFD, Visualization, Genomics
ISV codes on Arm
Porting underway Available
OpenHPC: Typical HPC packages available for Arm
OpenHPC is a community effort to provide a common, verified set of open source packages for HPC deployments
Arm and partners actively involved:
• Arm is a silver member of OpenHPC
• Linaro is on Technical Steering Committee
• Arm-based machines in the OpenHPC build infrastructure
Status: 1.3.6 release out now
• Packages built on Armv8-A for CentOS and SUSE

Functional Areas and the components they include:
• Base OS: CentOS 7.5, SLES 12 SP3
• Administrative Tools: Conman, Ganglia, Lmod, LosF, Nagios, pdsh, pdsh-mod-slurm, prun, EasyBuild, ClusterShell, mrsh, Genders, Shine, test-suite
• Provisioning: Warewulf
• Resource Mgmt.: SLURM, Munge
• I/O Services: Lustre client (community version)
• Numerical/Scientific Libraries: Boost, GSL, FFTW, Metis, PETSc, Trilinos, Hypre, SuperLU, SuperLU_Dist, Mumps, OpenBLAS, Scalapack, SLEPc, PLASMA, ptScotch
• I/O Libraries: HDF5 (pHDF5), NetCDF (including C++ and Fortran interfaces), Adios
• Compiler Families: GNU (gcc, g++, gfortran), LLVM
• MPI Families: OpenMPI, MPICH
• Development Tools: Autotools (autoconf, automake, libtool), Cmake, Valgrind, R, SciPy/NumPy, hwloc
• Performance Tools: PAPI, IMB, pdtoolkit, TAU, Scalasca, Score-P, SIONLib
Arm HPC Ecosystem website: www.arm.com/hpc
Starting point for developers and end-users of Arm for HPC
Latest events, news, blogs, and collateral including whitepapers, webinars, and presentations
Links to HPC open-source & commercial SW packages
Guides for porting HPC applications
Quick-start guides to Arm tools
Links to community collaboration sites
Curated and moderated by Arm
Arm HPC Community: community.arm.com/tools/hpc/
HPC Community-driven Content
Blogs by Arm and our HPC community
Calendar of upcoming events such as workshops and webinars
HPC Forum with questions & posts curated and moderated by Arm HPC technical specialists
Ask, answer, share progress and expertise
Arm HPC Packages wiki
www.gitlab.com/arm-hpc/packages/wikis
• Dynamic list of common HPC packages
• Status and porting recipes
• Community driven
• Anyone can join and contribute
• Provides focus for porting progress
• Allows developers to share and learn
Open source libraries for helping increase performance
Arm Optimized Routines
https://github.com/ARM-software/optimized-routines
These routines provide high performing versions of many math.h functions
• Algorithmically better performance than standard library calls
• No loss of accuracy
Perf-libs-tools
https://github.com/ARM-software/perf-libs-tools
Understanding an application’s needs for BLAS, LAPACK and FFT calls
• When used with Arm Performance Libraries, generates logging info to help profile applications for specific case breakdowns
• Example visualization: DGEMM cases called
SLEEF library
https://github.com/shibatch/sleef/
Vectorized math.h functions
• Provided as an option for use in Arm Compiler
Arm HPC deployments
Arm Supercomputer Makes Top500 List!
“Astra, the world’s fastest Arm-based supercomputer according to the TOP500 list, has achieved a speed of 1.529 petaflops, placing it 203rd on a ranking of top computers …”
Vanguard Astra at Sandia
MOST POWERFUL ARM SUPERCOMPUTER, IN TOP 500 (#203 in HPL and #36 in HPCG)
• 2,592 HPE Apollo 70 compute nodes
• 5,184 CPUs, 145,152 cores, 2.3 PFLOPs (peak)
• Marvell ThunderX2 Arm SoC, 28 core, 2.0 GHz
• Memory per node: 128 GB (16 x 8 GB DR DIMMs)
• Aggregate capacity: 332 TB, 885 TB/s (peak)
• Mellanox IB EDR, ConnectX-5
• 112 36-port edges, 3 648-port spine switches
• Red Hat RHEL for Arm
• HPE Apollo 4520 All–flash Lustre storage
• Storage Capacity: 403 TB (usable)
• Storage Bandwidth: 244 GB/s
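The aggregate Astra figures can be cross-checked from the per-node numbers. The sketch below is a back-of-envelope check, not an official derivation: node count, core count, clock and memory per node come from the slide, while the 8 double-precision flops per cycle per core and the memory configuration (8 channels of DDR4-2666 per socket) are assumptions added here.

```python
# Figures taken from the slide.
NODES = 2592
SOCKETS_PER_NODE = 2          # 5,184 CPUs / 2,592 nodes
CORES_PER_SOCKET = 28
CLOCK_HZ = 2.0e9
MEM_PER_NODE_GB = 128

# Assumptions, not stated on the slide.
FLOPS_PER_CYCLE = 8           # assumed: two 128-bit FMA pipes, 2 doubles each, 2 flops per FMA
DDR_CHANNELS_PER_SOCKET = 8   # assumed memory configuration
DDR_TRANSFERS_PER_S = 2666e6  # assumed DDR4-2666
BYTES_PER_TRANSFER = 8        # 64-bit channel

cpus = NODES * SOCKETS_PER_NODE
cores = cpus * CORES_PER_SOCKET
peak_pflops = cores * CLOCK_HZ * FLOPS_PER_CYCLE / 1e15
memory_tb = NODES * MEM_PER_NODE_GB / 1000
bandwidth_tbs = (cpus * DDR_CHANNELS_PER_SOCKET
                 * DDR_TRANSFERS_PER_S * BYTES_PER_TRANSFER) / 1e12

if __name__ == "__main__":
    print(f"CPUs: {cpus}, cores: {cores}")
    print(f"Peak: {peak_pflops:.2f} PFLOPs")
    print(f"Memory: {memory_tb:.0f} TB, bandwidth: {bandwidth_tbs:.0f} TB/s")
```

With these assumptions the results round to the slide's 2.3 PFLOPs, 332 TB and 885 TB/s.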
Isambard in Production at Bristol/GW4
Largest EU Arm HPC cluster to date
• Cray XC50 system w/ 168 nodes with Marvell ThunderX2 (32C)
• 10,752 total cores
• High-speed ARIES interconnect
• Cray HPC SW Stack including CCE, CrayPAT, Cray MPI, libs, ...
• Production deployment reached @ SC18
Deployments: Catalyst UK
• HPE, in conjunction with Arm and SUSE, announced in April the “Catalyst UK” program: deployments to accelerate the growth of the Arm HPC ecosystem into three universities
• Each machine will have:
  • 64 HPE Apollo 70 systems, each with two 32-core Cavium ThunderX2 processors (i.e. 4096 cores per machine), 128GB of memory and Mellanox InfiniBand interconnects
  • SUSE Linux Enterprise Server for HPC
Sites and workloads:
• Bristol: VASP, CASTEP, Gromacs, CP2K, Unified Model, Hydra, NAMD, Oasis, NEMO, OpenIFS, CASINO, LAMMPS
• EPCC: WRF, OpenFOAM, Rolls Royce Hydra opt, 2 PhD candidates
• Leicester: Data-intensive apps, genomics, MOAB Torque, DiRAC collab
Deployment: Mont Blanc
Deployments: HPE’s Comanche Collaboration
Early access to Cavium ThunderX2 systems that became Apollo 70
Engagements in HPE Comanche program have accelerated adoption
• We have been able to assess the state of fundamental software stacks, such as MPI and NUMA capabilities
• Collaborative work here especially great with all partners focusing on interoperability issues
• Examples include fixing bugs with kernels, MPI drivers and OpenMP thread placement
• Optimization of packages, environment and execution configurations
Over 1,000 processors delivered | LLNL TOSS stack ported and demoed | InfiniBand optimized
Performance results on Arm
Single node results from GENCI - France
Isambard, UK – Single node results
Isambard, UK – Multi-node results
Gromacs (42M atoms) on Horizon (Intel Skylake, 20C) vs Isambard (Marvell ThunderX2, 32C)
Commercial Tools for HPC by Arm
Our Solution for any Architecture, at any Scale
Commercial tools for AArch64, x86_64, ppc64 and accelerators

Arm Cross-Platform Tools: Debug, optimise and analyse any platform
• Arm DDT: Slash your time to debug on any hardware, at any scale.
• Arm MAP: Speed-up applications with a lightweight scalable profiler
• Arm Forge Professional: Arm DDT and MAP in One Single Package
• Arm Performance Reports: Find the most efficient settings for your workloads.

All-inclusive development toolkit for Arm hardware
• Arm HPC Compiler: Linux user space compiler for HPC applications
• Arm Performance Libraries: BLAS, LAPACK and FFT
• Arm Forge Professional: Multi-node interoperable profiler and debugger
• Arm Performance Reports: Interoperable application performance insight
Arm Allinea Studio
Built for developers to achieve best performance on Arm with minimal effort
• Comprehensive and integrated tool suite for Scientific computing, HPC and Enterprise developers
• Seamless end-to-end workflow from getting started to advanced optimization of your workloads
• Commercially supported by Arm engineers
• Frequent releases with continuous performance improvements
• Ready for current and future generations of server-class Arm-based platforms
• Available for a wide-variety of Arm-based server-class platforms
Meets the requirements of HPC developers on Arm
Workflow stages: Develop and build, Debug, Profile, Optimize
• Arm Linux Compiler: For C, C++ and Fortran codes
• Arm Performance Libraries: BLAS, LAPACK, FFT; Scalar and vector math functions
• Arm DDT: Cross-platform parallel debugger
• Arm MAP: Cross-platform lightweight profiler
• Arm Performance Reports: Maximize System Efficiency
A quick glance at what is in Arm Allinea Studio
C/C++ Compiler
• C++ 14 support
• OpenMP 4.5 without offloading support
• SVE ready
Fortran Compiler
• Fortran 2003 support
• Partial Fortran 2008 support
• OpenMP 3.1
• SVE ready
Performance Libraries
• Optimized math libraries
• BLAS, LAPACK and FFT
• Threaded parallelism with OpenMP
• Scalar math routines
Forge (DDT and MAP)
• Profile, Tune and Debug
• Scalable debugging with DDT
• Parallel Profiling with MAP
Performance Reports
• Analyze your application
• Memory, MPI, Threads, I/O, CPU metrics
Tuned by Arm for a wide-range of server-class Arm-based platforms
Progress in the last year
A fully integrated tools suite for deployment on Arm systems
Arm C/C++ Compiler
• Porting and tuning guides for common applications
• Optimizations and bug fixes
Arm Fortran Compiler
• New Fortran Directives
• Improved Fortran 2008 support
• Support for vectorization of loops with math calls
Arm Perf Libraries
• BLAS, FFT and LAPACK Improvements
• Sparse routine SPMV support
• Scalar math routines
Forge and Perf Reports
• General cross-platform improvements
• Python profiling
• Better interop with Arm Compiler and Libraries
GNU8 toolchain
• GCC and Gfortran
• 2nd toolchain in the studio
• Better suited for certain applications
• Beta support for HPC users
Support and tuning for Arm server-class platforms
Commercial C/C++/Fortran compiler with best-in-class performance
Compilers tuned for Scientific Computing and HPC
Tuned for Scientific Computing, HPC and Enterprise workloads
• Processor-specific optimizations for various server-class Arm-based platforms
• Optimal shared-memory parallelism using latest Arm-optimized OpenMP runtime
Latest features and performance optimizations
Linux user-space compiler with latest features
• C++ 14 and Fortran 2003 language support with OpenMP 4.5
• Support for Armv8-A and SVE architecture extension
• Based on LLVM and Flang, leading open-source compiler projects
Commercially supported by Arm
• Available for a wide range of Arm-based platforms running leading Linux distributions – RedHat, SUSE and Ubuntu
Arm Compiler – Building on LLVM, Clang and Flang projects
Arm C/C++/Fortran Compiler
• Language specific frontend: C/C++ Files (.c/.cpp) go through the Clang based C/C++ Frontend; Fortran Files (.f/.f90) go through the Flang based Fortran Frontend. Both emit LLVM IR.
• Language agnostic optimization: the LLVM based Optimizer takes LLVM IR and applies IR Optimizations, Auto-vectorization and Enhanced optimization for Armv8-A and SVE, again emitting LLVM IR.
• Architecture specific backend: LLVM based Armv8-A code-gen produces an Armv8-A binary; LLVM based SVE code-gen produces an SVE binary.
Arm Linux Compiler – What’s new in 2018/19?
Overall - Better code generation
• For Arm platforms for current generation (Marvell ThunderX2) and future (SVE based) • Base compiler technology upgrade (Clang/LLVM 7, GNU8, Latest Flang) • Vectorization of loops with math function calls
Fortran – Increase in maturity
• Enable key Fortran applications (open source, in house and commercial) • Improved auto vectorization • Fortran vectorization directives like IVDEP
Optimized BLAS, LAPACK and FFT
Commercially supported by Arm
Commercial 64-bit Armv8-A math libraries
• Commonly used low-level math routines - BLAS, LAPACK and FFT
• Provides FFTW compatible interface for FFT routines
• Batched BLAS support
Best in class performance
Best-in-class serial and parallel performance
• Generic Armv8-A optimizations by Arm
• Tuning for specific platforms like Cavium ThunderX2 in collaboration with silicon vendors
Validated with NAG test suite
Validated and supported by Arm
• Available for a wide range of server-class Arm-based platforms
• Validated with NAG’s test suite, a de-facto standard
Arm Performance Libraries progress
Progress and additions since SC17
Key improvements since 18.0
• Massive improvements in FFT performance
• All basic, advanced and guru interface FFTW calls now supported
• Many functions have had extra serial and parallel performance improvements targeting ThunderX2
• Particular focus on GEMMs and POTRF, GETRF and GEQRF
New features in 19.0
• Sparse linear algebra for higher performing SpMV calls
• FFTW MPI interface for FFT calls added
• Parallelisation of many FFTW plans
• Parallel scaling improvements, especially for ThunderX2
• Addition of libamath
  • High performing implementations of certain key math.h functions
Compiler and Libraries - Future roadmap
Focus on current and next generation hardware
More features in compilers and libraries
• Libraries: Vector Math routines and more scalar math routines
• Fortran Compiler: Directives & new Fortran 2008/OpenMP features
• All compilers: Vectorization and optimization report improvements
More optimizations for current hardware
• Application specific tuning and optimization
• For Marvell ThunderX2 and other server-class Arm-based platforms
Getting ready for SVE-based future hardware
• SVE enabled Performance Libraries
• Application specific tuning and optimization in Compilers and Libraries for SVE
Toolchain performance results
Arm Compiler and Libraries – 19.1 release
Progress and additions since SC18 (19.0 release)
Arm C/C++/Fortran Compilers
• Fortran: TRAILZ intrinsic, a Fortran 2008 feature, now supported
• Fortran: Runtime I/O performance improvement when handling formatted text data
• Fortran: New UNROLL directive to provide unrolling hints to the compiler
• Bug fixes
Arm Perf Libraries
• BLAS - Improved GEMV and GEMM (SCZ variants)
• FFTW Fortran MPI interface now supported
• FFT MPI parallel scaling has been improved.
• SpMV - Support for CSC and COO formats; Improved single-precision performance; Fortran Interface now supported.
• Math routines (in libamath) – Vector routines support with optimized logf and expf; Arm Compiler uses libamath by default; A GNU compatible version provided
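For readers unfamiliar with the two sparse formats named above, the following sketch shows what a sparse matrix-vector product looks like in COO (coordinate triplets) and CSC (compressed sparse column) storage. It is plain Python written for illustration; the function names and argument layout are hypothetical and are not the Arm Performance Libraries interface.

```python
def spmv_coo(n_rows, rows, cols, vals, x):
    """y = A @ x with A stored as COO triplets (row, column, value)."""
    y = [0.0] * n_rows
    for r, c, v in zip(rows, cols, vals):
        y[r] += v * x[c]
    return y


def spmv_csc(n_rows, col_ptr, row_idx, vals, x):
    """y = A @ x with A stored column by column (CSC).

    Entries of column c live in vals[col_ptr[c]:col_ptr[c + 1]], with their
    row numbers in the matching slice of row_idx.
    """
    y = [0.0] * n_rows
    for c in range(len(col_ptr) - 1):
        for k in range(col_ptr[c], col_ptr[c + 1]):
            y[row_idx[k]] += vals[k] * x[c]
    return y


if __name__ == "__main__":
    # A = [[4, 0, 1],
    #      [0, 3, 0],
    #      [2, 0, 5]]
    x = [1.0, 2.0, 3.0]
    print(spmv_coo(3, [0, 0, 1, 2, 2], [0, 2, 1, 0, 2], [4.0, 1.0, 3.0, 2.0, 5.0], x))
    print(spmv_csc(3, [0, 2, 3, 5], [0, 2, 1, 0, 2], [4.0, 2.0, 3.0, 1.0, 5.0], x))
```

Both calls describe the same matrix, so they produce the same vector; the formats differ only in how the non-zeros are indexed.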
BLAS improvements to many GEMM routines in 19.1
Shown below: CGEMM on Marvell ThunderX2 run using 56 threads
[Figure: CGEMM on 56 ThunderX2 threads. Performance (GFLOPs) against matrix size M=N=K, for releases 19.0 and 19.1.]
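GEMM performance is usually reported in GFLOPs derived from a nominal operation count rather than a measured one. The sketch below uses the customary counts (2·M·N·K for real data, 8·M·N·K for complex data such as CGEMM) to turn a run time into GFLOPs; the timing in the example is hypothetical and is not read from the figure.

```python
def gemm_flops(m, n, k, complex_data=False):
    """Nominal flop count of C = A @ B with A (m x k) and B (k x n).

    Real data: one multiply and one add per inner-product term (2 flops).
    Complex data: a complex multiply-add costs 8 real flops.
    """
    per_term = 8 if complex_data else 2
    return per_term * m * n * k


def gflops(flops, seconds):
    """Convert an operation count and a run time into GFLOP/s."""
    return flops / seconds / 1e9


if __name__ == "__main__":
    size = 4000
    seconds = 0.5  # hypothetical timing, for illustration only
    print(f"{gflops(gemm_flops(size, size, size, complex_data=True), seconds):.0f} GFLOP/s")
```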
BLAS improvements to GEMV routines in 19.1
All cases improved for both serial and parallel. Comparison shown on ThunderX2 for serial SGEMV and DGEMV against OpenBLAS
[Figure: SGEMV on ThunderX2 and DGEMV on ThunderX2. Performance (MFLOPs) against matrix size M=N, for OpenBLAS and Arm PL 19.1, each in N and T variants.]
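GEMV touches every matrix element once, so for large matrices its speed is typically limited by memory bandwidth rather than arithmetic. The sketch below estimates the arithmetic intensity of GEMV and the resulting bandwidth-bound ceiling. This is a simple roofline-style model added for illustration, and the 100 GB/s bandwidth in the example is a hypothetical value, not a ThunderX2 measurement.

```python
def gemv_intensity(m, n, bytes_per_element):
    """Flops per byte for y = alpha*A@x + beta*y with A of size m x n."""
    flops = 2 * m * n
    # A and x are read once; y is read and written.
    bytes_moved = bytes_per_element * (m * n + n + 2 * m)
    return flops / bytes_moved


def bandwidth_bound_mflops(intensity, bandwidth_gb_s):
    """Upper bound on MFLOP/s when limited by memory bandwidth (GB/s)."""
    return intensity * bandwidth_gb_s * 1e3


if __name__ == "__main__":
    for name, width in (("SGEMV", 4), ("DGEMV", 8)):
        ai = gemv_intensity(10000, 10000, width)
        print(f"{name}: {ai:.3f} flop/byte, "
              f"ceiling {bandwidth_bound_mflops(ai, 100.0):.0f} MFLOP/s at 100 GB/s")
```

In this model single precision has about twice the intensity of double precision, because each element moves half as many bytes.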
FFT MPI performance in 19.1
Scaling using FFTW MPI interface improved; now similar scaling to FFTW
[Figure: FFT MPI performance on ThunderX2, 3-d cases 512x512x512 and 1024x1024x1024. Solution time (s) against number of MPI processes (1 to 64), for ArmPL 19.1, FFTW 3.3.8 and perfect scaling.]
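A "perfect scaling" reference line corresponds to run time halving each time the process count doubles. The sketch below shows one way to turn measured solution times into speedup and parallel efficiency against that ideal. The timings in the example are hypothetical placeholders and are not values from the figure.

```python
def scaling_table(timings):
    """Return rows of (processes, speedup, efficiency).

    timings maps a process count to a solution time in seconds. Speedup is
    measured against the smallest process count, and efficiency is speedup
    divided by the ideal (perfect scaling) speedup.
    """
    base_p = min(timings)
    base_t = timings[base_p]
    rows = []
    for p in sorted(timings):
        speedup = base_t / timings[p]
        ideal = p / base_p
        rows.append((p, speedup, speedup / ideal))
    return rows


if __name__ == "__main__":
    hypothetical = {1: 64.0, 2: 32.0, 4: 20.0}  # seconds, illustration only
    for p, speedup, eff in scaling_table(hypothetical):
        print(f"{p:3d} processes: speedup {speedup:.2f}, efficiency {eff:.0%}")
```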
Libamath – increased performance for math.h functions
ELEFUNT run on ThunderX2: cases no libamath, Arm Compiler with libamath 19.0 and 19.1
[Figure: Math performance measured by ELEFUNT. Performance as a percentage of 19.0 for gfortran/libm, libamath 19.0 and libamath 19.1 across ALOG, EXP, PWR, SIN, NINT, DLOG, DEXP, DPWR, DSIN, DNINT, DRECI.]
Cross-Platform tools
Arm Forge and Arm Performance Reports
By Choosing Arm, You Choose a State-of-the-art Solution
• Interoperable: Available on the vast majority of HPC platforms, including AMD, IBM, Intel, Nvidia… and of course Arm!
• Performant: Fast, lightweight and transparent tools that help focus on the real issues that count
• Comprehensive: Packed with the best features to slash the development overhead spent on debugging and optimising issues
Arm Forge Professional
A cross-platform toolkit for debugging and profiling
Commercially supported by Arm
The de-facto standard for HPC development
• Available on the vast majority of the Top500 machines in the world
• Fully supported by Arm on x86, IBM Power, Nvidia GPUs, etc.
Fully Scalable
State-of-the art debugging and profiling capabilities
• Powerful and in-depth error detection mechanisms (including memory debugging)
• Sampling-based profiler to identify and understand bottlenecks
• Available at any scale (from serial to petascale applications)
Very user-friendly
Easy to use by everyone
• Unique capabilities to simplify remote interactive sessions
• Innovative approach to present quintessential information to users
Arm Performance Reports
Characterize and understand the performance of HPC application runs
Commercially supported by Arm
Gathers a rich set of data
• Analyses metrics around CPU, memory, IO, hardware counters, etc.
• Possibility for users to add their own metrics
Accurate and astute insight
Build a culture of application performance & efficiency awareness
• Analyses data and reports the information that matters to users
• Provides simple guidance to help improve workloads’ efficiency
Relevant advice to avoid pitfalls
Adds value to typical users’ workflows
• Define application behaviour and performance expectations
• Integrate outputs to various systems for validation (e.g. continuous integration)
• Can be automated completely (no user intervention)
Key highlights in Forge & Performance Reports
Latest 19.0 version released in Dec 2018
Packaging
• Creation of Allinea Studio: a new solution for aarch64 platforms that includes the Arm Compiler, Arm Performance Libraries, and the former Allinea tools!
Platforms (DDT, MAP and Performance Reports)
• Full support for IBM systems
• Arm v8 support
• CUDA 9 support
Improvements
• DDT: Usability Improvements, Memory debugging optimizations
• MAP: Optimizations for many-core systems
• Performance Reports: Optimizations for many-core systems
New Features
• DDT: Python Debugging
• MAP: Python profiling, Combined C/C++/Fortran and Python performance analysis, On-kernel GPU profiling, Ability to profile selected ranks
• Performance Reports: Backfill Custom Metrics, Ability to profile selected ranks
Forge and Performance Reports – Future roadmap
Why do our tools matter and what will we focus on this year?
Reduce migration costs and increase portability
• Finding and using the right hardware is hard, even more so because of porting and migration costs.
• We will keep providing cross-platform tools to enable choice and innovation in HPC.
Slash down code validation costs and time
• For every run in production, codes are run 3 to 5 times to validate they meet standards.
• We will help the community reduce their testing costs by promoting best practices and tightening the link between tools and agile continuous delivery.
Provide capabilities on demand
• Too often, users are stopped in their work by licence size limitations.
• We will work on providing capabilities to users on demand at any time.
Forge/Performance Reports Roadmap 2018-2019
Key highlights for Forge/PR 19.1 and 19.2
Continuous work
• Support for latest software environments (MPI, compilers, etc.)
• Support for popular HPC systems (Intel, Arm, Power, GPUs…)
• Developing exclusive features in collaboration with vendors (e.g. HPE, etc.)
19.1
• Arithmetic evaluations of CPU metrics
• Assembly views to Forge
• Integration with DynamoRIO for low-level instrumentation of operations
19.2/20.0
• Addition of a “burst mode” in the tools
• Simplify the integration of tools within scripts
• Add json, xml and csv outputs to the “offline” features of the tools
SVE - Introduction, tools and workflow
Scalable Vector Extension (SVE)
A vector extension to the ARMv8-A architecture with some major new features
Gather-load and scatter-store
• Loads a single register from several non-contiguous memory locations.
Per-lane predication
• Operations work on individual lanes under control of a predicate register.
• Illustration: 1 2 3 4 + 5 5 5 5 under pred 1 0 1 0 = 6 2 8 4
Predicate-driven loop control and management
• Eliminate scalar loop heads and tails by processing partial vectors.
• Illustration: for (i = 0; i < n; ++i) with INDEX i = n-2 n-1 n n+1 gives CMPLT n = 1 1 0 0
Vector partitioning and software-managed speculation
• First Faulting Load instructions allow memory accesses to cross into invalid pages.
• Illustration: 1 2 under pred 1 1 0 0 gives 1 2 0 0
Extended floating-point horizontal reductions
• In-order and tree-based reductions trade-off performance and repeatability.
• Illustration: in-order 1 + 2 + 3 + 4; tree-based (1 + 2) + (3 + 4) = 3 + 7
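The per-lane predication and predicate-driven loop control features can be mimicked in a few lines of scalar code. The sketch below is a behavioural model only: the helper names are made up for illustration, Python lists stand in for vector registers, and the `lanes` parameter stands in for the hardware vector length.

```python
def whilelt(start, limit, lanes):
    """Build a predicate whose lane j is active while start + j < limit."""
    return [start + j < limit for j in range(lanes)]


def predicated_add(a, b, pred):
    """Per-lane predication: inactive lanes keep their value from a."""
    return [x + y if p else x for x, y, p in zip(a, b, pred)]


def vla_add(x, y, lanes):
    """Vector-length-agnostic x + y: no scalar tail, the last vector is partial."""
    n = len(x)
    out = [0] * n
    i = 0
    while i < n:
        pred = whilelt(i, n, lanes)
        for j, active in enumerate(pred):  # stands in for one predicated vector op
            if active:
                out[i + j] = x[i + j] + y[i + j]
        i += lanes
    return out


if __name__ == "__main__":
    print(predicated_add([1, 2, 3, 4], [5, 5, 5, 5], [1, 0, 1, 0]))
    print(whilelt(8, 10, 4))
    data = list(range(10))
    print(vla_add(data, data, lanes=4) == vla_add(data, data, lanes=8))
```

The same `vla_add` gives identical results for any `lanes` value, which is the property that lets one SVE binary run on implementations with different vector lengths.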
SVE is Arm’s next generation SIMD ISA
• Gather-load and scatter-store
• Per-lane predication
• Predicate-driven loop control and management
• Vector partitioning and software-managed speculation
• Extended floating-point horizontal reductions
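The repeatability side of the in-order versus tree-based reduction trade-off comes from floating-point addition not being associative. The sketch below models both orderings in plain Python (the function names are illustrative, not SVE instructions) and uses a deliberately extreme input to make the difference visible.

```python
def in_order_sum(values):
    """Strictly sequential reduction: ((v0 + v1) + v2) + ..."""
    total = 0.0
    for v in values:
        total += v
    return total


def tree_sum(values):
    """Pairwise reduction: (v0 + v1) + (v2 + v3), then combine the partial sums."""
    vals = list(values)
    while len(vals) > 1:
        if len(vals) % 2:
            vals.append(0.0)  # pad odd lengths with the additive identity
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0]


if __name__ == "__main__":
    print(in_order_sum([1.0, 2.0, 3.0, 4.0]), tree_sum([1.0, 2.0, 3.0, 4.0]))
    tricky = [1e16, 1.0, -1e16, 1.0]
    print(in_order_sum(tricky), tree_sum(tricky))
```

For small integers both orders agree, while the second input gives 1.0 in order and 0.0 as a tree, because 1e16 + 1.0 rounds back to 1e16 in double precision.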
SVE: HPGMG & Lulesh
SVE: Optimizing Stencil
• What are the effects of Vector Length Agnosticism?
• How well suited is the ISA to express the semantics of stencil codes?
Variants, each vectorised on k (axes i, j, k):
• Baseline
• Unroll j×2
• Unroll j×3, i×2

Version: element per iteration; load ./0 per iteration; load ./0 per element
• Baseline: 1234; 7(6); 7(6)
• Unroll j2: 2×1234; 12(10); 6(5)
• Unroll i2j3: 2×3×1234; 28(22); 5.6(4.6)
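The benefit of unrolling in i and j is that neighbouring outputs share loaded vectors. The sketch below counts distinct vector loads per unrolled iteration, assuming a 7-point stencil vectorised on k in which every shift along k needs its own load. The stencil shape and this load model are assumptions (neither is stated on the slide), but under them the counts come out as 7, 12 and 28, matching the leading figures in the table. The parenthesised figures are not modelled.

```python
from itertools import product

# Offsets (di, dj, dk) of an assumed 7-point stencil: centre plus six face neighbours.
SEVEN_POINT = [
    (0, 0, 0),
    (1, 0, 0), (-1, 0, 0),
    (0, 1, 0), (0, -1, 0),
    (0, 0, 1), (0, 0, -1),
]


def vector_loads_per_iteration(unroll_i=1, unroll_j=1, stencil=SEVEN_POINT):
    """Count distinct vector loads for one unrolled iteration vectorised on k.

    Each (i, j, k-offset) triple is one vector load. Outputs in the unrolled
    block share any triple they have in common, so it is counted once.
    """
    loads = set()
    for bi, bj in product(range(unroll_i), range(unroll_j)):
        for di, dj, dk in stencil:
            loads.add((bi + di, bj + dj, dk))
    return len(loads)


if __name__ == "__main__":
    print("Baseline:   ", vector_loads_per_iteration())
    print("Unroll j2:  ", vector_loads_per_iteration(unroll_j=2))
    print("Unroll i2j3:", vector_loads_per_iteration(unroll_i=2, unroll_j=3))
```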
Open source support
• Arm actively posting SVE open source patches upstream
  • Beginning with first public announcement of SVE at HotChips 2016.
• Available upstream
  • GNU Binutils-2.28: released Feb 2017, includes SVE assembler & disassembler.
  • GCC 8: Full assembly, disassembly and basic auto-vectorization
  • GDB 8.2 SVE support
  • LLVM 7: Full assembly, disassembly
  • Linux kernel: since Mar 2017
  • QEMU 3.1: SVE support (user-space and system mode)
• Under upstream review
  • LLVM: since Nov 2016, as presented at LLVM conference.
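Compiler support
Compilers compared: Upstream GCC, Upstream LLVM, Arm Compiler 6 (for bare metal), Arm Linux Compiler (for Linux user-space)
• SVE asm and disasm: Upstream GCC: Yes; Upstream LLVM: Yes; Arm Compiler 6: Yes; Arm Linux Compiler: Yes
• SVE code generation: Upstream GCC: Yes; Upstream LLVM: No (planned for 2019-20); Arm Compiler 6: Yes; Arm Linux Compiler: Yes
• SVE ACLE: Upstream GCC: No (planned for GCC10, 2020); Upstream LLVM: No (planned for 2019-20); Arm Compiler 6: Yes; Arm Linux Compiler: Yes
• Auto-vectorization: Upstream GCC: Basic (more improvements planned for GCC9); Upstream LLVM: None (planned for 2019-20); Arm Compiler 6: Advanced; Arm Linux Compiler: Advanced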
Getting ready for SVE
Port to Arm
• Port to current Arm hardware – Single node and multi-node
• Tune it for current Arm hardware
Get ready for SVE
• Port to SVE using QEMU and/or ArmIE on current Arm hardware
Tune for SVE
• On real SVE hardware
Co-work with Arm tools and professional services team
Arm Instruction Emulator for SVE
Develop tomorrow’s software on today’s hardware
• Simple “black box” tool aimed at userspace software developers
    $ armclang hello.c -march=armv8-a+sve
    $ ./a.out
    Illegal instruction
    $ armie -msve-vector-bits=256 -- ./a.out
    Hello
• Runs userspace application binaries at close to native speed
  • runs multithreaded applications
  • transparent to system calls
• Intercepts and emulates use of Arm instructions newer than hardware
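Because the emulator takes the vector length as a command-line option, one possible use is to run the same binary at several vector lengths and check that the results agree. The sketch below only builds the command lines. It reuses the `-msve-vector-bits` flag and `./a.out` binary from the example on the slide, and assumes the SVE architectural range of 128 to 2048 bits in steps of 128; whether a given emulator version accepts every length should be checked against its documentation.

```python
def sve_vector_lengths(max_bits=2048, step=128):
    """SVE vector lengths in bits: multiples of 128 up to 2048 (assumed range)."""
    return list(range(step, max_bits + 1, step))


def armie_commands(binary, lengths):
    """Build one emulator command line per vector length.

    The flag spelling follows the slide's example; nothing is executed here.
    """
    return [["armie", f"-msve-vector-bits={bits}", "--", binary] for bits in lengths]


if __name__ == "__main__":
    for cmd in armie_commands("./a.out", sve_vector_lengths(max_bits=512)):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run` on a system where the emulator is installed.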
Arm Instruction Emulator
Develop your user-space applications for future hardware today
Develop software for tomorrow’s hardware today
Start porting and tuning for future architectures early
• Reduce time to market, Save development and debug time with Arm support
Runs at close to native speed
Run 64-bit user-space Linux code that uses new hardware features on current Arm hardware
• SVE support available now.
• Tested with Arm Architecture Verification Suite (AVS)
Commercially Supported by Arm
Near native speed with commercial support
• Emulates only unsupported instructions
• Maintained and supported by Arm for a wide range of Arm-based SoCs
DynamoRIO
Dynamic Binary Instrumentation
• Fast code translation in userspace
• Originally developed at MIT
• Now managed by Google
Used for
• profiling
• valgrind-like checking
• architecture emulation
Key points of contact
Visit www.arm.com/hpc-tools for further information
Product team
• David Lecomber: Sr Director, Infrastructure tools
• Ashok Bhat: Sr Product manager – Compiler and Libraries
• Patrick Wohlschlegel: Sr Product manager – Forge and Perf Reports
Sales team
• Rob Rick and Andrew Westergren: Americas
• Marcin Krzysztofik: EMEA, India and China
• Toshinori Kujiraoka: Japan