Using Arm Scalable Vector Extension to Optimize OPEN MPI

Dong Zhong1,2, Pavel Shamis4, Qinglei Cao1,2, George Bosilca1,2, Shinji Sumimoto3, Kenichi Miura3, and Jack Dongarra1,2

1 Innovative Computing Laboratory, The University of Tennessee, US
2 {dzhong, qcao3}@vols.utk.edu, {bosilca, dongarra}@icl.utk.edu
3 Fujitsu Ltd., {sumimoto.shinji, k.miura}@jp.fujitsu.com
4 Arm, [email protected]

Abstract—As the scale of high-performance computing (HPC) systems continues to grow, increasing levels of parallelism must be employed to achieve optimal performance. Recently, as processors have come to support ever wider vector extensions, vectorization has become much more important for exploiting the potential peak performance of the target architecture. Novel processor architectures, such as the Armv8-A architecture, introduce the Scalable Vector Extension (SVE), an optional separate architectural extension with a new set of A64 instruction encodings that enables even greater parallelism. In this paper, we analyze the usage and performance of the instructions in the Arm SVE vector Instruction Set Architecture (ISA), and utilize those instructions to improve memcpy and various local reduction operations. Furthermore, we propose new strategies to improve the performance of MPI operations, including datatype packing/unpacking and MPI reduction. With these optimizations, we not only provide higher parallelism on a single node, but also achieve a more efficient communication scheme for message exchange. The resulting efforts have been implemented in the context of OPEN MPI, providing efficient and scalable SVE capabilities and extending the possible uses of SVE to a more extensive range of programming and execution paradigms. The evaluation of the resulting software stack under different scenarios, with both a simulator and Fujitsu's A64FX processor, demonstrates that the solution is at the same time generic and efficient.

Index Terms—SVE, Vector Length Agnostic, ARMIE, datatype pack and unpack, non-contiguous accesses, local reduction

I. INTRODUCTION

The need to satisfy the scientific computing community's increasing computational demands leads to larger HPC systems with more complex architectures, which provide more opportunities to enrich multiple levels of parallelism. One of these opportunities is to exploit data-level parallelism through code vectorization [1], which has caused HPC systems to be equipped with vector processors. Compared to scalar processors, vector processors support Single Instruction Multiple Data (SIMD) [2] Instruction Set Architectures and operate on one-dimensional arrays (vectors) rather than single elements. There are continuing efforts to improve vector processors by increasing the vector length and adding new vector instructions. Intel Knights Landing [3] introduced the Advanced Vector Extensions instruction set AVX-512, which provides 512-bit-wide vector instructions and more vector registers; and Arm announced the new Armv8 architecture, which embraces SVE together with extension instruction sets.

SVE is a vector extension of the AArch64 execution mode for the A64 instruction set of the Armv8 architecture [4], [5]. Unlike other SIMD architectures, SVE does not define the size of the vector registers; instead it provides a range of permitted values, so that vector code adapts automatically to the current vector length at runtime, a feature known as Vector Length Agnostic (VLA) programming [6], [7]. The vector length is constrained to the range from a minimum of 128 bits up to a maximum of 2048 bits, in increments of 128 bits.

SVE not only takes advantage of long vectors but also enables powerful vectorization features that can achieve significant speedup. Those features include, but are not limited to:
1) rich addressing modes that enable non-linear data accesses, which can deal with non-contiguous data;
2) a valuable set of horizontal reduction operations that applies to more types of reducible loop-carried dependencies, covering logical, integer, and floating-point reductions;
3) vectorization of loops with more complex loop-carried dependencies and more complex control flow.

Those extensions largely expand the uses of the Arm architecture and increase opportunities for vector processing; as a result, HPC platforms and software can benefit significantly from SVE features.

Message Passing Interface (MPI) [8] is a popular and efficient parallel programming model for distributed memory systems, widely used in scientific applications. As many scientific applications operate on multi-dimensional data, manipulating these data becomes complicated because the underlying memory layout is not contiguous. The MPI standard proposes a rich set of interfaces to define regular and irregular memory patterns, the so-called Derived Datatypes (DDT).

DDT provides excellent functionality and flexibility by allowing the programmer to create arbitrary (contiguous and non-contiguous) structures from the MPI primitive datatypes. It is also useful for constructing messages that contain values with different datatypes and for sending non-contiguous data (such as a sub-matrix, or a matrix with irregular shape [9]), which eliminates the overhead of sending and receiving multiple small messages and improves bandwidth utilization: multiple small messages can be described by a derived datatype and sent/received as a single large message.

Once constructed and committed, an MPI datatype can be used as an argument for any point-to-point, collective, I/O, and one-sided function. With DDT, the MPI datatype engine automatically packs and unpacks data based on the datatype, which is convenient for the user since it hides the low-level details. However, the cost of packing and unpacking in the datatype engine is high; to reduce this cost, MPI implementations need to design more powerful and efficient pack and unpack strategies. We contribute our efforts to investigating and improving the performance of datatype pack and unpack.
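As an illustration of the DDT mechanism (a minimal sketch of ours, not taken from the paper's benchmarks; the matrix shape, destination, and tag are assumptions), the following sends one column of a row-major matrix as a single message, letting the datatype engine pack the strided elements internally:

    #include <mpi.h>

    /* Send column `col` of an nrows x ncols double matrix. */
    void send_column(const double *matrix, int nrows, int ncols,
                     int col, int dest, MPI_Comm comm)
    {
        MPI_Datatype column;
        /* nrows blocks of 1 element each, ncols elements apart */
        MPI_Type_vector(nrows, 1, ncols, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);
        /* One send instead of nrows small sends. */
        MPI_Send(matrix + col, 1, column, dest, /*tag=*/0, comm);
        MPI_Type_free(&column);
    }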
Computation-oriented collective operations like MPI_Reduce perform reductions on data along with the communications performed by the collectives. These collectives typically require intensive CPU compute resources, which forces the computation to become the bottleneck and limits performance. With the advanced architecture technologies introduced by wide vector extensions and specialized arithmetic operations, MPI libraries are called upon to provide state-of-the-art designs based on these advanced vector extensions (SVE and AVX [3], [10]). We tackle the above challenges and provide designs and implementations for reduction operations, which are most commonly used by computation-intensive collectives such as MPI_Reduce, MPI_Reduce_local, and MPI_Allreduce. We propose extensions to multiple MPI reduction methods to fully take advantage of Arm SVE capabilities, such as vector load and product, to perform these operations efficiently.

This paper makes the following contributions:
1) presenting a detailed analysis of how the Arm SVE vector Instruction Set Architecture (ISA) can be used to optimize memory copy for both contiguous and non-contiguous memory regions;
2) analyzing SVE hardware arithmetic instructions to speed up a variety of reduction operations;
3) exploring the usage of the rich Arm SVE vector memory access patterns to increase the performance of MPI datatype pack and unpack operations. We describe how different optimizations, such as multiple vector load and store and predicated non-contiguous gather load and scatter store, can be used, and we implemented the SVE-enabled optimizations in OPEN MPI using the related SVE intrinsics;
4) optimizing MPI local reduction operations using SVE arithmetics, which greatly increases performance;
5) and performing experiments with our SVE-intrinsics-based pack/unpack and reduction operations in the scope of OPEN MPI on Fujitsu's A64FX processor, the first processor of the Armv8-A SVE architecture. The experimental results demonstrate the efficiency of SVE instructions and of our implementation, and furthermore provide useful insight and guidelines on how the Arm SVE vector ISA can be used in high-performance computing platforms and software.

The rest of this paper is organized as follows. Section II presents related research taking advantage of Arm SVE for specific mathematical applications, together with a survey of optimizations of MPI DDT pack and unpack and of MPI local reduction using novel hardware. Section III provides details about SVE features and about related tools that can simulate SVE instructions and evaluate their performance. Section IV describes the implementation details of our generic memory copy and reduction methods using Arm SVE intrinsics and instructions. Section V shows how our optimized implementation in the scope of OPEN MPI takes advantage of Arm SVE intrinsics and instructions. Section VI describes the performance difference between OPEN MPI and SVE-optimized OPEN MPI and provides distinct insights on how SVE can benefit OPEN MPI.

II. RELATED WORK

In this section, we survey related work on techniques taking advantage of advanced hardware or architectures. Petrogalli [11] gives instructions on how SVE can be used to replace and optimize some commonly used general functions. A later work [12] explores the usage of SVE vector multiply instructions to optimize matrix multiplication in machine learning, such as the GEMM algorithm. Another work [13] leverages the characteristics of SVE to implement and optimize stencil computations, ubiquitous in scientific computing, and shows that SVE enables easy deployment of optimizations like loop fusion, load trading, or data reuse. However, all those works focus on using SVE for a specific application. In our work we study SVE-enabled features in a more comprehensive way, and we also provide a detailed analysis of the efficiency achieved by using SVE instructions. We focus on the networking runtime rather than on generic compute-bound applications.

There have been several efforts using similar techniques to improve MPI communication by optimizing the pack and unpack procedure. Wu et al. [14] proposed a GPU datatype engine which offloads the pack and unpack work to the GPU in order to take advantage of the GPU's parallel capability and provide highly efficient in-GPU pack and unpack. Another work [15] presented a new zero-copy scheme to efficiently implement datatype communication over InfiniBand scatter/gather work requests, optimizing non-contiguous point-to-point communication. Mellanox's InfiniBand [16] explored the use of hardware scatter/gather capabilities to selectively eliminate CPU memory copies and to offload the handling of data scatter and gather to a supported Host Channel Adapter; this capability is used to optimize the small-data all-to-all collective. Dosanjh et al. [17] took advantage of AVX vector operations for MPI message matching to accelerate matches, which demonstrated the efficiency of long vectors. We expand the investigation of SVE capabilities to optimize MPI from different aspects directly at the processor instruction level, which is more straightforward and, compared to previous work [14], [16], does not need external or extra hardware.

Additionally, different techniques and efforts have been studied to optimize MPI reduction operations. Träff [18] proposed a simple implementation of MPI library internal functionality that enables MPI reduction operations to be performed more efficiently with increasing sparsity of the input vectors. Other works [19], [20] provided designs and implementations of computation-oriented collectives using GPU accelerators. Luo [20] offloaded the reduction operations to the GPU asynchronously by using multiple CUDA streams, which allowed the overlap of communications and reduction operations. Hofmann [21] presented a pipeline algorithm for MPI_Reduce that used a run-length encoding scheme to improve the global reduction of sparse floating-point data. Compared to those works, our SVE arithmetic reduction optimization is more general: it places no limitation on the data representation and uses CPU resources only.

III. ARM SVE OVERVIEW

Remarkable features of SVE include: (1) support for scalable vector and predicate registers, which together deliver a set of instructions that operate on wide vectors and predicates, and extend efficient vectorization to an even broader scope with more control for operating on active elements only; (2) VLA programming, which allows the same SVE code to run on platforms with different vector lengths, with no need to modify or recompile the code; this feature provides high portability of software across various combinations of hardware and software stacks, to embrace massive parallelism as well as a deeper and more complex component hierarchy to continue the growth in compute capabilities; and (3) gather load and scatter store, which strengthen the capability to selectively transfer non-contiguous data, resulting in better-performing pack and unpack algorithms that can dramatically reduce the number of memory copies for non-contiguous data.
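To make the VLA property (2) concrete, the following minimal sketch (ours, not from the paper's code base) performs y[i] += a * x[i] without ever hard-coding a vector width: svcntw() reports the number of 32-bit lanes at run time, and the predicate produced by svwhilelt masks off the tail, so the same binary runs unchanged for VL = 128 up to 2048 bits:

    #include <arm_sve.h>
    #include <stddef.h>

    void saxpy_vla(float a, const float *x, float *y, size_t n)
    {
        for (size_t i = 0; i < n; i += svcntw()) {
            svbool_t pg = svwhilelt_b32_u64(i, n);   /* active lanes  */
            svfloat32_t vx = svld1_f32(pg, x + i);
            svfloat32_t vy = svld1_f32(pg, y + i);
            /* vy + vx * a on active lanes only */
            svst1_f32(pg, y + i, svmla_n_f32_m(pg, vy, vx, a));
        }
    }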
In embrace massive parallelism as well as a deeper and more addition, Gem5 supports many existing processor architectures complex component hierarchy to continue the growth in com- such as Alpha, Arm, SPARC and . For the interests of this pute capabilities. (3) Gather load and scatter store strengthen paper, Gem5 can model up to 64 heterogeneous cores of an the capabilities to selectively transfer non-contiguous data, Arm platform, and Arm implementation supports 32 or 64-bit resulting in better performing pack and unpack algorithms kernels and applications. It provides support for modeling Arm which can dramatically reduce the number of memory copies SVE instructions which decouple SVE ISA semantics from for non-contiguous data. its CPU models, enabling effective support of SVE instruction and providing cycle-accurate counts for those instructions. A. Arm Intrinsic Instructions Gem5 has a full system mode and a system emulation mode. Intrinsics are C or C++ pseudo-function calls that the Under system emulation mode, memory management and replaces with the appropriate SIMD instructions. The system calls are executed as services in Gem5. Gem5 provides Arm C language extensions (ACLE) [22] for SVE provide statistical information of accurate cycle counts which is not a set of types, accessors and intrinsic functions that a C available using ARMIE. In our work, we investigate using and C++ compiler can directly convert into SVE assembly. both methods to get instruction count and cycle count. It provides function-to-instruction mapping of SVE vectors IV.O PEN MPI OPERATION WITH SVE and predicates, and function interfaces for relevant SVE vectorization instructions. It enables high level language user A. SVE rich memory access addressing for pack and unpack to explicitly use the datatypes and operations available in the SVE vector load and store instructions transfer data in Arm SVE ISA. Our SVE enabled optimized implementation memory to or from elements of one or more vector or predicate is written in ACLE. transfer registers. It provides several types of instruction that B. Arm Instruction Emulator and Gem5 can be used to optimize memory operations as: 1) Predicated single vector contiguous element accesses load SVE vector instruction extension offers many opportunities and store (ld1 and st1). to optimize memory access and compute workloads, but with 2) Predicated multiple vector contiguous structure load and the availability of SVE-enabled hardware is not publicly store (ldNx and stNx, N = 2,3,4). available, we have rely on simulation techniques in order to 3) Predicated non-contiguous element accesses Gather Load evaluate our implementations. and Scatter Store. ARMIE [23] is a tool that converts instructions not sup- ported by hardware to native Armv8-A instructions, such as 1) Contiguous memory layout those from the SVE instruction set. ARMIE enables develop- Memory copy is the most widely used memory operation. ers to run SVE executable on existing Armv8-A hardware C standard library provides function as memcpy, and with the paired with dynamic binary instrumentation, enabling full help of modern compiler it is converted to assembly code which represented as a loop of load and store instructions using memory that is separated from common base register by the vector registers. 
2) Non-Contiguous memory layout: Among the non-contiguous datatype layouts shown in Figure 1, the vector type, Fig. 1(a), is the most regular and certainly the most widely used MPI datatype constructor. Vector allows replication of a datatype into locations that consist of equally spaced blocks, describing the data layout using block-length, stride, and count. Block-length refers to the number of primitive datatypes a block contains, stride refers to the number of primitive datatypes between blocks, and count defines the number of blocks to be processed. A distinctive flavor of the vector datatype, frequently used in computational sciences and machine learning, accesses a single column of a matrix, as presented in Fig. 1(c); it can be represented by a specialized vector type with block-length equal to one.

Datatypes other than vector expose less and less regularity, where neither the size of each block nor the displacements between successive blocks are constant. In order of growing complexity, MPI supports INDEXED_BLOCK (constant block-length, different displacements), INDEXED (different block-lengths and different displacements), and finally STRUCT (different block-lengths, different displacements, and different composing datatypes). Such datatypes, Fig. 1(b), cannot be described in a concise format using only block-length and stride.

Fig. 1: Memory layout of non-contiguous datatypes: (a) Vector; (b) Block_Index; (c) Index.

Vector loads and stores of most processors can only access elements stored in consecutive locations. Thus, for non-contiguous data one must either issue loads and stores for each block using one register, or possibly issue several scalar load and store operations to fill a vector register and then execute in vector mode; the resulting vector code obtains low speedups, or even slowdowns, with respect to its scalar counterpart. However, SVE introduces new subsets of instructions that provide multiple addressing modes to enable gather load and scatter store for non-contiguous memory. There are two kinds of addressing that share the same format of a base component plus a displacement component: vector plus immediate, and scalar plus vector. In vector plus immediate addressing, the base addresses of all primitive data elements are held in a vector register and displaced by a common offset described by an immediate value; in scalar plus vector addressing, the instruction accesses memory locations separated from a common base register by the per-element offsets held in an offsets vector, with an option to shift the offsets according to the element size to be loaded. In our case, we use the mode with a vector of offsets. As a concrete example,

    svint32_t svld1_gather_u32base_offset_s32(svbool_t pg, svuint32_t bases, int64_t offset)

is a gather load (ld1_gather) of signed 32-bit integers (_s32) from a vector of unsigned 32-bit integer base addresses (_u32base) plus an offset in bytes (_offset).

We developed an optimized pack and unpack algorithm specialized for vector-like datatypes. Gather load and scatter store process multiple non-contiguous small blocks simultaneously, instead of copying block by block, and are therefore ideal for the pack and unpack of the derived regular vector type: we generate the offsets vector once, based on the block length and the gaps, and it can then be used repeatedly. For less regular memory that still exhibits a repeating layout pattern, we can generate offset vectors for the repeating pattern and then apply multiple gather loads and scatter stores to each repetition. For the column access pattern, SVE has a special instruction to generate the offsets vector for this particular need:

    svint32_t svindex_s32(int32_t base, int32_t step)

which produces the pattern {base, base + step, base + step*2, ...}. With gather load and scatter store, users can copy a whole vector of data at a time, which is much more efficient than cherry-picking a single element per vector. To summarize, gather load and scatter store can efficiently pack and unpack non-contiguous data by generating suitable offset vectors. Due to space constraints, we focus our efforts on the regular vector type in the rest of this paper.
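As a hedged illustration of this scheme for the column pattern of Fig. 1(c), the sketch below (ours, not the component's source) packs 32-bit elements spaced `stride` elements apart. It uses the scalar-base-plus-index-vector form of the gather intrinsics, a sibling of the vector-of-bases form quoted above; the index vector is generated once with svindex and reused, and the unpack direction is symmetric, replacing the gather with svst1_scatter_u32index_s32:

    #include <arm_sve.h>
    #include <stdint.h>

    /* Pack a strided column of 32-bit elements into a contiguous
     * buffer; assumes the element offsets fit in 32 bits. */
    void sve_pack_column32(int32_t *dst, const int32_t *src,
                           uint32_t count, uint32_t stride)
    {
        svuint32_t idx = svindex_u32(0, stride);  /* {0, stride, ...} */
        uint32_t lanes = (uint32_t)svcntw();      /* 32-bit lanes/vec */
        for (uint32_t i = 0; i < count; i += lanes) {
            svbool_t pg = svwhilelt_b32_u32(i, count);
            svint32_t v = svld1_gather_u32index_s32(
                pg, src + (uint64_t)i * stride, idx);
            svst1_s32(pg, dst + i, v);  /* one vector per loop trip */
        }
    }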
B. Reduction operation

A reduction is a typical operation found in many scientific applications. Those applications have large amounts of data-level parallelism and should be able to benefit from SIMD support for the reduction operation. A traditional reduction operation processes the input buffer element by element, executing as a sequential operation; it might be vectorized under particular circumstances, but it can suffer from dependencies across loop iterations. SVE reduction instructions perform arithmetic horizontally across the active elements of a single source vector and deliver a scalar result. SVE provides arithmetic reduction operations for integer and floating-point types, and also supports logical reduction operations for integer types. An example format is

    svint32_t svmul[_s32]_x(svbool_t pg, svint32_t op1, svint32_t op2)

which produces the product of two vectors. This gives us the chance to create Arm intrinsic reductions in MPI, which greatly increases the parallelization and performance of MPI local reductions. Additionally, SVE can perform scattered reduction operations with the support of the predicate registers, still behaving in a vectorized manner. This profoundly relaxes the restriction of reduction operations to consecutive memory layouts, extending them to non-contiguous layouts while remaining generic and efficient.
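As a hedged illustration (our sketch, not the actual source of our reduction module), the fragment below shows both flavors for MPI_MAX on floats: a predicated element-wise combine of two buffers, as used by MPI_Reduce_local, and the horizontal svmaxv form that collapses one vector into a scalar:

    #include <arm_sve.h>
    #include <stddef.h>

    /* inout[i] = max(inout[i], in[i]), a full vector at a time. */
    void sve_max_combine(float *inout, const float *in, size_t n)
    {
        for (size_t i = 0; i < n; i += svcntw()) {
            svbool_t pg = svwhilelt_b32_u64(i, n);
            svfloat32_t a = svld1_f32(pg, inout + i);
            svfloat32_t b = svld1_f32(pg, in + i);
            svst1_f32(pg, inout + i, svmax_f32_m(pg, a, b));
        }
    }

    /* Horizontal form: reduce one buffer to a single scalar
     * maximum; assumes n > 0. */
    float sve_max_scalar(const float *in, size_t n)
    {
        float m = in[0];
        for (size_t i = 0; i < n; i += svcntw()) {
            svbool_t pg = svwhilelt_b32_u64(i, n);
            float v = svmaxv_f32(pg, svld1_f32(pg, in + i));
            m = (v > m) ? v : m;
        }
        return m;
    }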

C. Evaluation of a generic ACLE memory copy

To understand the performance of SVE instructions and extend their usage to the OPEN MPI implementation, we implemented several micro-benchmarks using ACLE and compared them with the related kernel code extracted from OPEN MPI. Experiments are conducted to demonstrate the efficiency of the customized ACLE memory copy algorithms for both contiguous memory and the non-contiguous vector type. Furthermore, we evaluate the effectiveness of SVE vector-based reduction operations. All experiments are conducted under the same experimental setting. We present the average and standard deviation of 30 runs (the deviation is too small to be noticeable). For the configuration, we use the Arm HPC compiler 19.2 with gcc 8.2.0, compiling with the -O3 and -march=armv8-a+sve options, which trigger automatic vectorization targeting SVE VLA techniques; it supports auto-vectorization for SVE as well as in-line assembly. For the SVE instruction simulator and the instrumentation clients, we use ARMIE version 19.1. For Gem5 we use Arm's internal version (the corresponding upstream is gem5/sve/beta1), which has SVE-enabled support. Furthermore, under -O3 mode, each memory access instruction is divided into the processing that generates the cache request and the processing after the data is received from the cache; this means that the Gem5 simulated results represent the complete data-fetching operation instead of just the issuing of the cache request. Table I summarizes the notation we employ to describe the algorithms and the performance analysis.

TABLE I: Parameters and notations.

    Symbol      Description
    ARRAYSIZE   Number of elements for copying
    Z0-Z31      Scalable vector registers
    P0-P15      Predicate registers
    Step        4 * vector length in bytes
    VL          Vector length

Equation 1 describes how we calculate the instruction count for all experiments: in order to evaluate the number of instructions executed by the target code section, we need to eliminate the baseline, i.e., the number of instructions executed besides the target code (e.g., initialization and finalization). The baseline is calculated by running the same test codes with no data being processed.

    Instruction count = Actual - Base                          (1)

    Actual: count when the number of elements processed is > 0
    Base:   count when the number of elements processed is = 0

1) Contiguous memory: For a contiguous large data buffer, our ACLE customized memory copy implementation adopts two features of SVE: it not only uses the long scalable vector registers, but also uses multiple vectors to load and store simultaneously. Listing 1 compares the two methods: the memcpy method uses only z0 for load and store, while the ACLE method uses z0 through z3. The two methods produce the same total amount of assembly code per iteration, but the ACLE method copies four times as much data as memcpy, which greatly decreases the total number of instructions; it also needs fewer control and branching instructions, which is reflected in the performance gain.

Listing 1: memcpy and ACLE code with assembly

    /* Region of interest */
    memcpy(&dst, &src, ARRAYSIZE);
      400780: ld1b    z0.b, p0/z, [x10, x8]
      400784: st1b    z0.b, p0, [x11, x8]
      400788: addvl   x8, x8, #1
      40078c: whilelo p0.b, x8, x9
      400790: b.mi    400780

    /* Region of interest: copy with 4 vectors simultaneously */
    /* (loop body is a sketch; 4-vector copy via svld4/svst4) */
    for (i = 0; i < ARRAYSIZE; i += Step)
        svst4(pg, &dst[i], svld4(pg, &src[i]));
Fig. 2: Instruction counts of MEMCPY and ACLE customized memory copy for a 1MB contiguous buffer.

Fig. 3: Gem5 simulated time of MEMCPY and ACLE customized memory copy for 1MB contiguous memory.
Figure 2 shows the instruction count results reported by ARMIE. Experiments are conducted with vector lengths varying from 128 to 2048 bits for both the memcpy and ACLE implementations, copying 1MB of data. For both cases, the instruction count shows a linear decrease as VL increases, which demonstrates the efficiency of the Arm SVE long vectors. Also, the total number of instructions executed by memcpy is 400% that of ACLE, which is consistent with the assembly code and illustrates that using multiple vectors for load and store greatly decreases the number of instructions. The only exception is that at VL = 128 bits memcpy uses Advanced SIMD instead of SVE vectors, which shows that explicitly using ACLE intrinsics helps the compiler assemble better code.

Figure 3 displays the comparison of the accurately simulated Gem5 cycles for both methods. The size of the L1 cache is set to 32 Kbytes with 4 ways, and the size of the L2 cache is set to 2 Mbytes with 16 ways. One load and one store operation are available per cycle, and contiguous load and store instructions access the L1 cache at the vector length per cycle. For both methods, the configuration uses the same number of physical registers. Copying the same amount of data, both cases need to create equal amounts of load and store requests, but our customized ACLE method largely reduces the number of control and branching instructions, which results in the performance gain of 5% shown in the figure; at VL = 128 bits, ACLE is 25% faster than memcpy, which reinforces that the SVE instructions are more efficient than the general ones.

2) Non-Contiguous memory: The SVE instruction-level gather load and scatter store addressing capabilities provide a mechanism for defining new memory operations for non-contiguous memory (e.g., for sparse linear algebra operations), which can potentially deal with multiple small gapped memory pieces or indexed elements simultaneously, based on the length of the SVE vector. As such, multiple memory copies can be compressed into a single copy. To understand the performance of the gather and scatter instructions relative to packing the data into a contiguous buffer before sending and unpacking it into a non-contiguous buffer upon receiving, several experiments were performed. The source data is non-contiguous but with a regular pattern, as in the vector type, and the destination is contiguous. As above, we compare the performance of memcpy and the ACLE code with different VL.

Figure 4 shows the instruction counts when copying 1MB of non-contiguous data with blocklen = 1 and gap = 1. The memcpy implementation cannot take advantage of different vector lengths: its instruction counts stay the same across VL, limited by the fact that it copies block by block in a loop. The ACLE code, however, exploits the SVE long vectors together with the gather load and scatter store feature, which copies more data per instruction. It shows a decrease in instruction counts growing from 10x to 100x as the VL increases, compared to the non-optimized implementation. It also shows a linear decrease as the vector length increases, which confirms that the load and store instructions fully fill the vector register. Moreover, for the ACLE method the ratio of SVE instructions to total instructions is higher than for the memcpy method.

Fig. 4: Instruction counts of MEMCPY and ACLE customized memory copy for non-contiguous memory.

Figure 5 illustrates the Gem5 simulated cycle results for both methods when packing and unpacking non-contiguous data. With the same configuration, the ACLE gather and scatter implementation is faster than memcpy under all VL settings, showing a speedup of 15% to 20%. Compared to the decrease in instruction counts, the gain in simulated cycles is not as significant; this is because the SVE support in the Gem5 implementation is not very efficient. We believe that on actual hardware the clock cycles of the gather and scatter instructions will be further reduced. Thus, we can see that by using Arm SVE instructions, our optimized ACLE memory copy algorithms provide good performance for both contiguous and non-contiguous memory layouts.

Fig. 5: Gem5 simulated time of MEMCPY and ACLE customized memory copy for non-contiguous memory.

D. Evaluation of SVE arithmetic reduction operations

This section compares the performance of the maximum reduction operation across two implementations. The kernel extracted from OPEN MPI, written in C, performs an element-wise maximum operation across two input buffers, comparing two elements per loop iteration. Our ACLE implementation uses the SVE vector reduction instructions to execute the maximum reduction on the same inputs, but each iteration processes two vectors containing all the elements within the vectors, i.e., a vector-wise operation.

Figure 6 shows the instruction counts reported by ARMIE for both cases, with a buffer size of 4MB of floating-point data. For both methods, with VL = 128 bits the instruction count is 16x that of VL = 2048 bits, which demonstrates the efficiency of SVE long vectors. Also, by explicitly using the ACLE vector instructions, the instruction count is only 66% of that of the element-wise C version. With the -O3 option, the vectorizing compiler automatically translates sequences of scalar operations, represented in the form of loops, into vector instructions, substituting the element-wise operation with a vectorized behavior; our ACLE code uses intrinsics, which gives us complete control of the low-level details at the expense of productivity and portability.

Fig. 6: Instruction counts of C element-wise and ACLE vector-wise maximum reduction operation for floating point.

Figure 7 shows the accurate execution cycles of the two experiments from Gem5. With the same configuration, the ACLE implementation is 30% faster, and it shows a linear decrease as the vector length increases. Similar investigations were conducted for other reduction operations, including but not limited to SUM, PROD, and the logical operation BXOR, which show a similar performance benefit.

Fig. 7: Gem5 simulated time of C element-wise and ACLE vector-wise maximum reduction operation for floating point.
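For reference, the element-wise baseline in this comparison is essentially the following scalar loop (a paraphrase of the extracted kernel, not its verbatim source):

    #include <stddef.h>

    /* Scalar MPI_MAX-style kernel: one comparison per iteration. */
    void max_elementwise(float *inout, const float *in, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            if (in[i] > inout[i])
                inout[i] = in[i];
    }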

V. DESIGN AND IMPLEMENTATION IN OPEN MPI

We implemented our SVE optimization work as a set of components in OPEN MPI. While a full depiction of the architecture and feature set of OPEN MPI is out of the scope of this paper, some aspects are relevant to our implementation effort. OPEN MPI is based on a Modular Component Architecture (MCA) [26], which permits easily extending or substituting the core subsystems with experimental features. As shown in Figure 8, within this architecture each of the major subsystems is defined as an MCA framework with a well-defined interface, and multiple components implementing that framework can coexist.

We added our SVE optimization work in two components of the OPEN MPI architecture. The SVE Pack/Unpack component is in charge of the highly parallel ACLE memory copy service presented in Section IV-C. The improvement includes the optimized pack and unpack for contiguous data, using four SVE vectors to load and store simultaneously, as well as the gather load and scatter store instructions for non-contiguous small-block data, as shown in Algorithm 1. The Arm_SVE_OP component implements several reduction operations (e.g., max, sum, prod) with Arm SVE vector reduction instructions; to be noted, this component can be extended beyond the scope of local reduction to general mathematics and logic operations. To the best of our knowledge, our work is the first implementation to populate Arm SVE within an MPI implementation.

Fig. 8: SVE OPEN MPI architecture. The orange boxes represent components with added SVE features. The dark blue colored boxes are new modules or existing related modules.

Algorithm 1 SVE-based packing algorithm

    svldN          > using N vectors to load
    svstN          > using N vectors to store
    svp            > SVE predicate type
    svcntb         > vector length in bytes
    blocklen       > number of bytes of a contiguous memory block
    offset_vector  > displacement vector; each element specifies an offset

    procedure MemcpyWithMultipleVectors(DST, SRC, blocklen)
        full_vector_copies = blocklen / (svcntb * N)
        for k = 0 to full_vector_copies do
            svldN from SRC
            svstN to DST
        if remaining != 0 then
            generate svp
            partially ld/st using svp

    procedure SveBasedPack(Count, blocklen, Extent)   > example for the vector type of Fig. 1(c)
        if blocklen > svcntb then
            for k = 0 to Count do
                MemcpyWithMultipleVectors(DST, SRC, blocklen)
        else
            blocks_per_vector = svcntb / blocklen
            generate offset_vector
            for k = 0 to (Count / blocks_per_vector) do
                SVE gather load using offset_vector
            generate svp
            process remaining blocks
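A possible C rendering of MemcpyWithMultipleVectors with N = 4 (our hedged sketch of the pseudocode above, not the component's exact source) is shown below. The main loop moves 4 x VL bytes per iteration; svld4/svst4 de-interleave and re-interleave symmetrically, so the byte order is preserved, and a predicated single-vector loop handles the remainder:

    #include <arm_sve.h>
    #include <stdint.h>
    #include <stddef.h>

    void memcpy_multi_vec(uint8_t *dst, const uint8_t *src,
                          size_t blocklen)
    {
        size_t vl = svcntb();       /* bytes per vector            */
        size_t step = 4 * vl;       /* bytes per 4-vector transfer */
        size_t i = 0;
        svbool_t all = svptrue_b8();
        for (; i + step <= blocklen; i += step)
            svst4_u8(all, dst + i, svld4_u8(all, src + i));
        for (; i < blocklen; i += vl) {   /* predicated remainder  */
            svbool_t pg = svwhilelt_b8_u64(i, blocklen);
            svst1_u8(pg, dst + i, svld1_u8(pg, src + i));
        }
    }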
VI. EVALUATION USING ARMIE

In this section, we discuss our experimental setup. We experimented on an Arm HPC system, a ThunderX2-based server running at 2 GHz. Our work is based upon the OPEN MPI master branch, revision #75a539. Each experiment is repeated 30 times, and we present the average. For all experiments we use a single node with one process, because our optimization aims to improve the performance of local operations for all processes, whether for pack/unpack or for reduction. We compile and install OPEN MPI using the Arm HPC compiler 19.2 with the flag "-march=armv8-a+sve". With no SVE-supporting hardware publicly available at the moment, we use ARMIE to evaluate the performance of our optimization. Instead of presenting the time to completion of the different benchmarks, we rely instead on the number of instructions to be executed, a metric that correctly highlights the performance implications.

We evaluate the performance of our packing and unpacking SVE optimization methodology using the MPI datatype benchmark in the OPEN MPI code base. We present here the results using a non-contiguous memory layout of vector type, created by MPI_Type_vector with blocklen = 1 and gap = 1, with repetition 2048 to make sure the data can fully fill the SVE vector even with VL = 2048 bits. For the message size we use 1MB. In order to eliminate unnecessary communication between processes and focus on our local pack and unpack optimization, we intentionally use one process sending to itself via the MPI_Send and MPI_Recv functions.

Figure 9 shows the comparison benchmark results using OPEN MPI and the SVE-optimized OPEN MPI. Experiments are conducted with VL = 256 to VL = 2048 bits; for VL = 128 bits our implementation does not use the gather and scatter feature, due to its limited benefit there. For OPEN MPI, different vector lengths show a constant instruction count, unable to take advantage of the long vector extension. For the SVE-optimized implementation, the instruction count is 16% lower starting at VL = 256 bits and decreases dramatically as the vector length increases, because we optimized the code to use full vectors for load and store. With VL = 2048 bits it is almost 10x lower.

Fig. 9: Instruction counts of OMPI and OMPI with SVE datatype pack and unpack.

For the reduction benchmark we use the MPI_Reduce_local function call to perform the local reduction for all supported MPI operations, using an array of 4M bytes. Figure 10 shows the result for the MPI_MAX reduction, but we have observed a similar outcome for all the other reduction operations. First, it should be noted that the compiler, despite the optimization flags provided, did not generate auto-vectorized code for the default OPEN MPI, leading to a constant number of instructions directly related to the number of elements in the input array. As our code explicitly uses the ACLE intrinsic vector reduction instructions, the reduction in the number of instructions is directly proportional to the vector length on which the instructions operate. Thus, with VL = 128 bits the instruction count is reduced by 50%; when the vector length reaches 2048 bits, the instruction count is reduced by a factor of 30, a clear indicator of more concise generated code and potentially of shorter execution time.

Fig. 10: Instruction counts of OMPI and OMPI with SVE local reduction of the max operation.
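A minimal form of this reduction benchmark call (illustrative buffer size and harness; the setup details are our assumption) is:

    #include <mpi.h>
    #include <stdlib.h>

    #define N (4 * 1024 * 1024 / sizeof(float))  /* 4 MB of floats */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        float *in    = malloc(N * sizeof(float));
        float *inout = malloc(N * sizeof(float));
        /* ... fill both buffers ... */
        /* inout[i] = max(in[i], inout[i]), performed by the op module */
        MPI_Reduce_local(in, inout, N, MPI_FLOAT, MPI_MAX);
        free(in); free(inout);
        MPI_Finalize();
        return 0;
    }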

500

450 NO_Gather/Scatter 106 400 VL=256bits VL=512bits 350 Instruction Counts (K) Bandwidth (MB/s) 300 OMPI_with_ACLE_SVE OMPI OMPI_with_ACLE_TOTAL 105 250 256 512 1024 2048 Vector Length (bits) 8k 16k 32k 64k 128k 256k 512k 1M 2M Fig. 9: Instruction counts of OMPI and OMPI with SVE Mesage size (Bytes) datatype pack and unpack Fig. 11: MPI PACK/UNPACK using Gather/Scatter with dif- hosts 4 Core Memory Group (CMG). A CMG consists of 13 ferent vector length on A64FX processor cores, a L2 cache (8MiB, 16 way) and a memory controller. Figure 12 compares the performance of our SVE based mpi The new processor supports enhanced SIMD and predicate reduction operation. For the experiments, we flushed cache operations that including: in order to make sure we are not reusing cache for a fair 1) 512 bits SVE vectors for 512-bits wise load/store and comparison. Our results demonstrate that with SVE-enabled unaligned load crossing cache line. operation it is 4× faster than element-wise operation. We also 2) Enhanced gather load and scatter store, enabling to return compare MPI operation performance together with memcpy up to two consecutive elements in a 128-byte aligned which indicates the peak memory bandwidth. For MPI reduc- block simultaneously. tion operation it needs 2loads + 1store + computation , for 3) Predicate operations by predicate register and predicate memcpy it only needs 1load+1store. It shows that even with execution unit. computation included our SVE reduction operation achieves a The pack/unpack operations have been highlighted as a similar level of memory bandwidth as memcpy. major bottleneck for most applications using non-contiguous VIII.C ONCLUSION datatypes. Our work focuses on the low-level pack/unpack routines and any performance improvements on these routines In this paper, we demonstrated the benefits of Arm SVE will automatically transfer to MPI non-contiguous communi- vector operations. We addressed the performance advantages cations. of different features introduced by SVE with multiple vector Figure 11 presents the performance of pack and unpack lengths compared to non-SVE implementations. Furthermore, using gather/scatter feature with different vector length for we extended the implementation of our investigation and non-contiguous buffer. The green and yellow line indicates analysis to introduced an optimistic MPI optimization from the performance using vector length 256 bits and 512 bits two aspects. For our first optimization, we adapted SVE rich respectively with our gather and scatter strategy. Compared to feature to improve the pack and unpack the blue line which is not using gather scatter feature. We can operations for regular MPI datatype. We use multiple vectors see that that optimized algorithm is 2× faster which validates to simultaneously load and stores large contiguous memory the Gem5 simulated results. no_SVE (2ld+st) 60 4 SVE (2ld+st) 4*104 3.3*10 MEMCPY (ld+st) 50 1.7*104 40 104

VIII. CONCLUSION

In this paper, we demonstrated the benefits of Arm SVE vector operations. We addressed the performance advantages of the different features introduced by SVE with multiple vector lengths, compared to non-SVE implementations. Furthermore, we extended our investigation and analysis into an MPI optimization along two directions. For the first optimization, we adapted the rich SVE features to improve the pack and unpack operations for regular MPI datatypes: we use multiple vectors to simultaneously load and store large contiguous memory blocks, and we use gather load and scatter store to copy multiple small blocks of non-contiguous memory with a single SVE instruction. We used ARMIE to simulate and calculate the actual number of instructions executed with the MPI DDT benchmark, and we reduced the instruction counts by between 16% and 10x, depending on the vector instruction length. Second, we introduced a new reduction operation module in OPEN MPI using Arm SVE intrinsics, supporting the different kinds of MPI reduce operations for multiple MPI types. We demonstrated the efficiency of our vector reduction operation with a benchmark calling MPI_Reduce_local: from VL = 128 to VL = 2048 bits, we decreased the instruction count by between 50% and 30x. To further validate the performance improvements, experiments were conducted using Fujitsu's A64FX processor: with our gather- and scatter-based pack and unpack, the new algorithm is 2x faster, and for MPI_Reduce_local with VL = 512 bits, the SVE-based reduction operation is 4x faster. Our analysis and implementation of the OPEN MPI optimizations provide useful insights and guidelines on how the Arm SVE vector ISA can be used in actual high-performance computing platforms and software to improve the efficiency of parallel runtimes and applications.
REFERENCES

[1] S. Maleki, Y. Gao, M. Garzarán, T. Wong, and D. Padua, "An evaluation of vectorizing compilers," Oct 2011, pp. 372–382.
[2] O. Polychroniou, A. Raghavan, and K. A. Ross, "Rethinking SIMD vectorization for in-memory databases," in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD '15. New York, NY, USA: ACM, 2015, pp. 1493–1508. [Online]. Available: http://doi.acm.org/10.1145/2723372.2747645
[3] A. Sodani, R. Gramunt, J. Corbal, H. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, and Y. Liu, "Knights Landing: Second-Generation Intel Xeon Phi Product," IEEE Micro, vol. 36, no. 2, pp. 34–46, Mar 2016.
[4] ARM. (2018) Arm Architecture Reference Manual Armv8, for Armv8-A architecture profile. [Online]. Available: https://developer.arm.com/docs/ddi0487/latest/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile
[5] S. Flur, K. E. Gray, C. Pulte, S. Sarkar, A. Sezgin, L. Maranget, W. Deacon, and P. Sewell, "Modelling the ARMv8 Architecture, Operationally: Concurrency and ISA," in Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, ser. POPL '16. New York, NY, USA: ACM, 2016, pp. 608–621. [Online]. Available: http://doi.acm.org/10.1145/2837614.2837615
[6] M. Boettcher, B. M. Al-Hashimi, M. Eyole, G. Gabrielli, and A. Reid, "Advanced SIMD: Extending the reach of contemporary SIMD architectures," in 2014 Design, Automation & Test in Europe Conference & Exhibition (DATE), March 2014, pp. 1–4.
[7] A. Armejach, H. Caminal, J. M. Cebrian, R. González-Alberquilla, C. Adeniyi-Jones, M. Valero, M. Casas, and M. Moretó, "Stencil codes on a vector length agnostic architecture," in Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, ser. PACT '18. New York, NY, USA: ACM, 2018, pp. 13:1–13:12. [Online]. Available: http://doi.acm.org/10.1145/3243176.3243192
[8] MPI Forum. (September 2012) MPI: A Message-Passing Interface Standard. [Online]. Available: https://www.mpi-forum.org
[9] L. S. Blackford, J. Choi, A. Cleary, E. D'Azeuedo, J. Demmel, I. Dhillon, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley, ScaLAPACK User's Guide, J. J. Dongarra, Ed. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 1997.
[10] J. M. Cebrian, L. Natvig, and M. Jahre, "Scalability analysis of AVX-512 extensions," The Journal of Supercomputing, Apr 2019.
[11] F. Petrogalli. (2018) A sneak peek into SVE and VLA programming. [Online]. Available: https://developer.arm.com/solutions/hpc/resources/hpc-white-papers/a-sneak-peek-into-sve-and-vla-programming
[12] D. A. Iliescu. (2018) Arm Scalable Vector Extension and application to Machine Learning. [Online]. Available: https://developer.arm.com/solutions/hpc/resources/hpc-white-papers/arm-scalable-vector-extensions-and-application-to-machine-learning
[13] A. Armejach, H. Caminal, J. M. Cebrian, R. Langarita, R. González-Alberquilla, C. Adeniyi-Jones, M. Valero, M. Casas, and M. Moretó, "Using Arm's scalable vector extension on stencil codes," The Journal of Supercomputing, Apr 2019.
[14] W. Wu, G. Bosilca, R. vandeVaart, S. Jeaugey, and J. Dongarra, "GPU-Aware Non-contiguous Data Movement In Open MPI," in Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, ser. HPDC '16. New York, NY, USA: ACM, 2016, pp. 231–242. [Online]. Available: http://doi.acm.org/10.1145/2907294.2907317
[15] G. Santhanaraman, J. Wu, and D. K. Panda, "Zero-Copy MPI Derived Datatype Communication over InfiniBand," in Recent Advances in Parallel Virtual Machine and Message Passing Interface, D. Kranzlmüller, P. Kacsuk, and J. Dongarra, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2004, pp. 47–56.
[16] A. Gainaru, R. L. Graham, A. Polyakov, and G. Shainer, "Using InfiniBand Hardware Gather-Scatter Capabilities to Optimize MPI All-to-All," in Proceedings of the 23rd European MPI Users' Group Meeting, ser. EuroMPI 2016. New York, NY, USA: ACM, 2016, pp. 167–179. [Online]. Available: http://doi.acm.org/10.1145/2966884.2966918
[17] M. G. F. Dosanjh, W. Schonbein, R. E. Grant, P. G. Bridges, S. M. Gazimirsaeed, and A. Afsahi, "Fuzzy Matching: Hardware Accelerated MPI Communication Middleware," in 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), May 2019, pp. 210–220.
[18] J. L. Träff, "Transparent Neutral Element Elimination in MPI Reduction Operations," in Recent Advances in the Message Passing Interface, R. Keller, E. Gabriel, M. Resch, and J. Dongarra, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, pp. 275–284.
[19] C. Chu, K. Hamidouche, A. Venkatesh, A. A. Awan, and D. K. Panda, "CUDA Kernel Based Collective Reduction Operations on Large-scale GPU Clusters," in 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), May 2016, pp. 726–735.
[20] X. Luo, W. Wu, G. Bosilca, T. Patinyasakdikul, L. Wang, and J. Dongarra, "ADAPT: An Event-based Adaptive Collective Communication Framework," in Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, ser. HPDC '18. New York, NY, USA: ACM, 2018, pp. 118–130. [Online]. Available: http://doi.acm.org/10.1145/3208040.3208054
[21] M. Hofmann and G. Rünger, "MPI Reduction Operations for Sparse Floating-point Data," in Recent Advances in Parallel Virtual Machine and Message Passing Interface, A. Lastovetsky, T. Kechadi, and J. Dongarra, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008, pp. 94–101.
[22] ARM. (2017) ARM C Language Extensions for SVE. [Online]. Available: https://developer.arm.com/docs/100987/latest/arm-c-language-extensions-for-sve
[23] M. Tairum. (2018) Emulating SVE on existing Armv8-A hardware using DynamoRIO and ArmIE. [Online]. Available: https://community.arm.com/developer/tools-software/hpc/b/hpc-blog/posts/emulating-sve-on-armv8-using-dynamorio-and-armie
[24] A. A. Abudaqa, T. M. Al-Kharoubi, M. F. Mudawar, and A. Kobilica, "Simulation of ARM and x86 microprocessors using in-order and out-of-order CPU models with Gem5 simulator," in 2018 5th International Conference on Electrical and Electronic Engineering (ICEEE), May 2018, pp. 317–322.
[25] Y. Kodama, T. Odajima, M. Matsuda, M. Tsuji, J. Lee, and M. Sato, "Preliminary Performance Evaluation of Application Kernels Using ARM SVE with Multiple Vector Lengths," in 2017 IEEE International Conference on Cluster Computing (CLUSTER), Sep. 2017, pp. 677–684.
[26] D. Zhong, A. Bouteiller, X. Luo, and G. Bosilca, "Runtime Level Failure Detection and Propagation in HPC Systems," in Proceedings of the 26th European MPI Users' Group Meeting, ser. EuroMPI '19. New York, NY, USA: ACM, 2019, pp. 14:1–14:11. [Online]. Available: http://doi.acm.org/10.1145/3343211.3343225