Introducon to Arm for network

stack developers Pavel Shamis/Pasha Principal Research Engineer

Mvapich User Group 2017

© 2017 Arm Limited Columbus, OH Outline

• Arm Overview • HPC Soware Stack • Porng on Arm • Evaluaon

2 © 2017 Arm Limited Arm Overview

© 2017 Arm Limited An introducon to Arm

Arm is the world's leading semiconductor intellectual property supplier. We license to over 350 partners, are present in 95% of smart phones, 80% of digital cameras, 35% of all electronic devices, and a total of 60 billion Arm cores have been shipped since 1990.

Our CPU business model: License technology to partners, who use it to create their own system-on-chip (SoC) products. We may license an instrucon set architecture (ISA) such as “ARMv8-A”) or a specific implementaon, such as “Cortex-A72”. …and our IP extends beyond the CPU

Partners who license an ISA can create their own implementaon, as long as it passes the compliance tests. 4 © 2017 Arm Limited A partnership business model

A business model that shares success Business Development • Everyone in the value chain benefits Arm Licenses technology to Partner • Long term sustainability SemiCo Design once and reuse is fundamental IP Partner Licence fee • Spread the cost amongst many partners Provider • Technology reused across mulple applicaons Partners develop • Creates market for ecosystem to target chips – Re-use is also fundamental to the ecosystem Royalty Upfront license fee OEM • Covers the development cost Customer Ongoing royales OEM sells • Typically based on a percentage of chip price consumer products • Vested interest in success of customers Approximately 1350 licenses More than 440 potenal 14.8bn+ Arm-powered chips in Grows by ~120 every year royalty payers 2015

5 © 2017 Arm Limited Partnership

6 © 2017 Arm Limited Range of SoCs addressing infrastructure

Highly Accelerated Massively Mulcore

QorIQ® Layerscape 2080A

One size does not fit all 7 © 2017 Arm Limited Serious Arm HPC deployments starng in 2017 Two big announcements about Arm in HPC in Europe:

8 © 2017 Arm Limited Japan

slides from Fujitsu at ISC’16

9 © 2017 Arm Limited Soware Stack

© 2017 Arm Limited Parallelism to enable opmal HPC performance § OpenMP § We are adding enhancements to the LLVM OpenMP implementaon to get beer AArch64 performance § Arm is acve member of the OpenMP Standards Commiee

§ Auto-vectorizaon § Arm acvely works on vectorizaon in GCC and LLVM, and encourages work with vectorizaon support in the community. § PathScale’s compiler has vectorizaon support built in

11 © 2017 Arm Limited Open source in the Arm HPC ecosystem

Over the past 12 months many more packages and applicaons have been successfully ported to Arm HPC

12 © 2017 Arm Limited

Linux / FreeBSD w/ AARCH64 support

12.04LTS & 14.04LTS ß Also 14.10 & 15.04 releases Debian 8 adds AARCH64 – April 2015 released

Fedora 22 released – May 2015 Enterprise Server for Arm CentOS Linux 7 for AArch64 Fedora 23 released – Nov 2015 7.2 BETA – Sept, 2015 GA – August 2015

OpenSUSE 13.2 – Nov 2014 SUSE Launches Partner Program to Bring SUSE Linux Enterprise 12 to 64-bit Arm July 2015 @ ISC

ß Engaged with FreeBSD foundaon / Semi-half & Cavium to get FreeBSD on ARMv8 FreeBSD Beta version demo’d by Semihalf – Nov. 2015

13 © 2017 Arm Limited HPC filesystems

Soware Status LUSTRE Ported HDFS Ported CEPH Ported BeeGFS Ported

14 © 2017 Arm Limited Workload and cluster managers

Soware Status IBM LSF Ported HP CMU Ported SLURM Ported Adapve Compung (Moab) Ported Altair PBS Works Ported OpenLava (LSF port) Ported

15 © 2017 Arm Limited

© 2017 Arm Limited Open source and commercial compilers

§ GCC § NAG § , C++, § Fortran § OpenMP 4.0 § OpenMP 3.1

§ LLVM § C, C++, Fortran § Arm C/C++/Fortran § OpenMP 3.1, (4.0 coming soon) Compiler § Fortran coming Q1 2017 § LLVM based § Includes SVE

17 © 2017 Arm Limited Arm C/C++ Compiler Commercially supported C/C++/Fortran compiler for Linux user-space HPC applicaons LLVM-based § Arm-on-Arm compiler § For applicaon development (not bare-metal/embedded)

Regularly pulls from upstream LLVM, adding: § SVE support in the assembler, disassembler, intrinsics and autovectorizer § Compiler Insights to support Arm Code Advisor

OpenMP § Uses latest open source (now Arm-opmized) LLVM OpenMP runme § Changes pushed back to the community

18 © 2017 Arm Limited Arm C/C++/Fortran Compiler Linux user-space compiler tailored for HPC on Arm • Maintained and supported by Arm for a wide range of Arm-based SoCs running leading Linux distribuons Commercially supported by Arm • Based on LLVM, the leading compiler framework

Latest features go into the commercial releases first

• Ahead of upstream LLVM by up to an year with latest performance improvement Latest features and patches performance opmizaons • SVE support in the assembler, disassembler, intrinsics and autovectorizer

OpenMP • Uses latest open source (now Arm-opmized) LLVM OpenMP runme Opmized OpenMP • Changes pushed back to the community 19 © 2017 Arm Limited

Arm C/C++ Compiler – usage

To compile C code:

% armclang -O3 file.c –o file

To compile C++ code:

% armclang++ -O3 file.cpp –o file

20 © 2017 Arm Limited Common armclang opons

Flag Descripon

--help Describes the most common opons supported by Arm C/C++ Compiler --vsn Displays version informaon and license details --version -O Specifies the level of opmizaon to use when compiling source files. The default is -O0 -c Performs the compilaon step, but does not perform the link step. Produces an ELF object .o file. Run armclang again, passing in the object files to link -o Specifies the name of the output file -fopenmp Use OpenMP -S Outputs assembly code, rather than object code. Produces a text .s file containing annotated assembly code

21 © 2017 Arm Limited Experimental tools to support SVE With Arm HPC Compiler, Instrucon Emulator and Code Advisor

Compile Emulate Analyse

Arm HPC Compiler Instrucon Emulator Code Advisor

C/C++/Fortran Runs userspace binaries Console or web-based output shows priorized for future Arm advice in-line with original source code. SVE via auto-vectorizaon, architectures on today’s intrinsics and assembly. systems.

Compiler Insight: Compiler Supported instrucons places results of compile- run unmodified. me decisions and analysis in the resulng binary. Unsupported instrucons are trapped and emulated.

22 © 2017 Arm Limited Arm Instrucon Emulator Run SVE binaries at near nave speed on exisng Armv8-A hardware

Trap-and-emulate of illegal userspace instrucons Armv8-A Binary Armv8-A SVE Binary § For applicaon development (not bare-metal/embedded like Arm Fast Models) § Navely supported instrucons run at full speed

§ Unsupported instrucons are faithfully emulated in soware Arm Instrucon Emulator

Converts SVE Full integraon with Arm Code Advisor instrucons to § nave Armv8-A Plugin allows Arm Instrucon Emulator to provide hotspot instrucons informaon and other metrics § Command-line integraon allows Arm Code Advisor Linux workflows to seamlessly integrate with Arm Instrucon Emulator Armv8-A

23 © 2017 Arm Limited Arm Code Advisor Combines stac and dynamic informaon to produce aconable insights

Performance Advice § Compiler vectorizaon hints § Compilaon flags advice § Fortran subarray warnings OpenMP instrumentaon § Insights from compilaon and runme § Compiler Insights are embedded into the applicaon

binary by the Arm HPC Compilers § OMPT interface used to instrument OpenMP runme Extensible Architecture § Users can write plugins to add their own analysis informaon § Data accessible via command-line, web browser and REST API to support new user interfaces 24 © 2017 Arm Limited

Libraries

© 2017 Arm Limited – Now on Arm

OpenHPC is a community effort to provide a common, Funconal Areas Components include verified set of open source packages for HPC Base OS RHEL/CentOS 7.1, SLES 12 deployments Administrave Conman, Ganglia, Lmod, LosF, ORCM, Nagios, pdsh, Tools prun Arm’s parcipaon: Provisioning Warewulf Resource Mgmt. SLURM, Munge. Altair PBS Pro* • Silver member of OpenHPC I/O Services Lustre client (community version) • Arm is on the OpenHPC Technical Steering Commiee Numerical/Scienfic Boost, GSL, FFTW, Mes, PETSc, Trilinos, Hypre, Libraries SuperLU, Mumps

in order to drive Arm build support I/O Libraries HDF5 (pHDF5), NetCDF (including C++ and Fortran interfaces), Adios Status: 1.3.1 release out now Compiler Families GNU (gcc, g++, gfortran) MPI Families OpenMPI, MVAPICH2 • All packages built on Armv8-A for CentOS and SUSE Development Tools Autotools (autoconf, automake, libtool), Valgrind,R, SciPy/NumPy • Arm-based machines are being used for building and Performance Tools PAPI, IMB, mpiP, pdtoolkit TAU also in the OpenHPC build infrastructure 26 © 2017 Arm Limited Open source library AArch64 inbuilt tuning work

OpenBLAS • ARMv8 kernels included BLIS • BLIS developers have close relaonship with Arm • BLIS supports various Arm processors by default (e.g. Arm Cortex-A53, Cortex-A57 CPUs) • Also currently conducng Arm big.LITTLE development ATLAS • Work ongoing with Arm Research team • Cortex-A57/A53 patches went into ATLAS FFTW • Just works. NEON opons built into v3.3.5

27 © 2017 Arm Limited Arm Performance Libraries Opmized BLAS, LAPACK and FFT

Commercial 64-bit Armv8-A math libraries

• Commonly used low-level math rounes - BLAS, LAPACK and FFT Performance on par with best-in-class math libraries • Validated with NAG’s test suite, a de-facto standard Best-in-class performance with commercial support

• Tuned by Arm for Cortex-A72, Cortex-A57 and Cortex-A53 Commercially Supported by Arm • Maintained and supported by Arm for a wide range of Arm-based SoCs

• Including Cavium ThunderX and ThunderX2 CN99 cores Silicon partners can provide tuned micro-kernels for their SoCs • Partners can contribute directly through open source route Validated with NAG test suite • Parallel tuning within our library increases overall applicaon performance 28 © 2017 Arm Limited

RDMA

© 2017 Arm Limited RDMA Support

Mellanox OFED 2.4 and above supports Arm Linux Kernel 4.5.0 and above (maybe even earlier) OFED – No support Linux Distribuon – on going process

30 © 2017 Arm Limited UCX Framework • Collaboraon between industry, laboratories, and academia • Create open-source producon grade communicaon framework for HPC applicaons • Enable the highest performance through co-design of soware-hardware interfaces • Unify industry - naonal laboratories - academia efforts

API Performance oriented Producon quality

Exposes broad semancs that target Opmizaon for low-soware Developed, maintained, tested, and data centric and HPC programming overheads in communicaon path used by industry and researcher models and applicaons allows near nave-level performance community

Community driven Research Cross plaorm

Collaboraon between industry, The framework concepts and ideas are Support for Infiniband, , various laboratories, and academia driven by research in academia, shared memory (-64, Power, Arm), laboratories, and industry GPUs Co-design of Exascale Network

31 © 2017 Arm Limited UCX – a high-level overview

Applications

OpenSHMEM, UPC, CAF, X10, MPICH, Open-MPI, etc. Parsec, OCR, Legions, etc. Burst buffer, ADIOS, etc. Chapel, etc.

UC-P (Protocols) - High Level API Transport selection, cross-transrport multi-rail, fragmentation, operations not supported by hardware

Message Passing API Domain: PGAS API Domain: Task Based API Domain: I/O API Domain: tag matching, randevouze RMAs, Atomics Active Messages Stream

UC-T (Hardware Transports) - Low Level API UC-S RMA, Atomic, Tag-matching, Send/Recv, Active Message (Services) UCX Common utilities Transport for InfiniBand VERBs Transport for Transport for intra-node host memory communication Transport for Data driver Gemini/Aries Accelerator Memory Utilities drivers communucation stractures RC UD XRC DCT GNI SYSV POSIX KNEM CMA XPMEM GPU Memory Management

OFA Verbs Driver Cray Driver OS Kernel Cuda

Hardware

32 © 2017 Arm Limited OpenUCX v1.2

• The first official release from OpenUCX community

• hps://github.com/openucx/ucx/releases/tag/v1.2.0 • Features

• Support for InfiniBand and RoCE • Transports RC, UD, DC

• Support for Accelerated Verbs – 40% speedup on Arm compared to vanilla Verbs • Support for Cray Aries and Gemini

• Support for Shared Memory: KNEM, CMA, XPMEM, Posix, SySV • Support for x86, ARMv8 (with NEON), Power • Efficient memory polling – 36% increase in efficiency on Arm

• UCX interface is integrated with MPICH, OpenMPI, OSHMEM, ORNL-SHMEM, etc.

Pavel Shamis, M. Graham Lopez, and Gilad Shainer. “Enabling One-sided Communicaon Semancs on Arm“

33 © 2017 Arm Limited Porng Experience

© 2017 Arm Limited What is AArch64?

ARM’s 64-bit instrucon set, part of ARMv8 64-bit pointers and registers 31 general purpose registers Fixed length (32-bit) instrucons Load/Store architecture Lile endian (big endian is an opon) Hardware floang point Advanced SIMD Weakly ordered memory

35 © 2017 Arm Limited My First Development Plaorm

36 © 2017 Arm Limited Autoconf for AArch64

./configure is broken for older builds of autoconf:

• “Invalid configuraon `aarch64-linux': machine `aarch64' not recognized”

Running autoconf/autoreconf with Autoconf releases aer 2012/02/10 should fix ./ configure

Alternavely you can upgrade the config.guess and config.sub files to the latest versions from:

• hp://git.savannah..org/r/config.git

37 © 2017 Arm Limited 64k Page Sizes

The linux kernel config for AArch64 supports 4k and 64k page sizes

• Most AArch64 linux distribuons now default to 64k pages in their kernel config

Shared libraries built to align to 4k pages will not load if their inializaon code does not happen to align to a 64k boundary

If you are building your own compiler toolchains be aware that binuls prior to 2.26 default to producing binaries aligned 4k pages

• The binuls source rpm/deb packages for the distribuons using 64k pages have the AArch64 page size patched to always use 64k

38 © 2017 Arm Limited Weakly Ordered Memory Model

Weakly ordered memory access means that changes to memory can be applied in any order as long as single-core execuon sees the data needed for program correctness Benefits:

• The processor can make many opmizaons to reduce memory access

– This has power (pushing bits is expensive) and memory bandwidth benefits • The opmizaons are transparent to single-core execuon Challenges:

• Synchronizaon of data between cores must be explicit • Popular legacy architectures (EG: x86_64, x86) provide “almost” strongly ordered memory access

– This means that exisng mul-core codes may be dependent on strongly ordered accesses

39 © 2017 Arm Limited Relaxed memory ordering

RDMA Write Payload Busy-wait Read Nofy

Write Barrier Read Barrier

RDMA Write Nofy Read Payload

Maranget, Luc, Susmit Sarkar, and Peter Sewell. "A tutorial introducon to the Arm and POWER relaxed memory models." Dra available from hp://www. cl. cam. ac. uk/~ pes20/ppc-supplemental/test7. pdf (2012).

40 © 2017 Arm Limited Weakly Ordered Memory In Porng Applicaons

Most parallel HPC applicaons we encountered used GCC’s libgomp (-fopenmp)

• These behaved correctly on AArch64 using GCC 5.2

Some HPC codes we came across had their own parallelizaon implementaons

• Usually based directly on top of pthreads

• Wrien to have more control over the threads of execuon and how they synchronize • Some had no problems working with AArch64’s weakly ordered memory system

• Others exhibited issues in mul-threaded modes that were parcularly hard to diagnose without a detailed invesgaon into how the mul-threaded mode was implemented

– Problems are almost always down to a lock-free interacon implementaon

– The key symptom is correct operaon on a strongly ordered architecture, failure on weakly ordered

41 © 2017 Arm Limited Weakly Ordered Memory In Porng MPI and PGAS

Typical MPI implementaon pialls: • MPI shared memory code • Collecve operaons within shared memory • RDMA polling code ! (YES, not just interacon between devices)

Typical PGAS/OpenSHMEM pialls: • All the above…and more • Memory (local and remote) synchronizaon rounes

42 © 2017 Arm Limited Weakly Ordered Memory In Porng Drivers

User and kernel level drivers for “OS bypass” devices • Memory barriers in doorbell flow

43 © 2017 Arm Limited Being explicit with your memory access

Ideally; use a modern synchronizaon implementaon to do it for you

• OpenMP, C++11 atomics, C mutexes, various other libraries

Otherwise: barrier assembler instrucons allow you to be explicit in what happens to memory before execuon connues:

Load-Acquire, Store-Release instrucons allow for atomic operaon without the use of explicit barriers

44 © 2017 Arm Limited Memory Barriers

DSB • Compleon semancs • Data Synchronizaon Barriers halt execuon unl:

• All explicit memory accesses before the instrucon complete

• All cache, branch predicon and TLB operaons before the instrucon complete • Interacon with external devices (PCIe doorbells)

• Device drivers

Examples hps://github.com/openucx/ucx/blob/master/src/ucs/arch/aarch64/cpu.h#L25 45 © 2017 Arm Limited Memory Barriers - connued

DMB • ISH* domain on Linux • Poll-flag, barrier, data

Examples hps://github.com/openucx/ucx/blob/master/src/ucs/arch/aarch64/cpu.h#L25 46 © 2017 Arm Limited Low-level mers

Low-level mers

• Typically found in benchmarks and MPI

• Code examples hps://github.com/openucx/ucx/blob/master/src/ucs/arch/aarch64/cpu.h#L35

47 © 2017 Arm Limited Cache line on Arm

Not all cache-lines are 64Byte !

• Implementaon dependent

• Don’t make assumpons about 64B cache line size

hp://xeroxnostalgia.com/duplicators/xerox-9200/

48 © 2017 Arm Limited Ideas for Opmizaon

© 2017 Arm Limited Network driver opmizaons

MLX5 – specially opmized transport implemented on top of ConnectX Hardware Abstracon Layer The layer inialize translates UCP request to InifniBand request and rings the doorbell • The code responsible for inializaon of the request was updated to leverage Arm vector instrucons (NEON): hps://github.com/openucx/ucx/blob/891e20ef90257d1e2721da52461b0261220c82d8/src/uct/ib/mlx5/ ib_mlx5.inl#L160

50 © 2017 Arm Limited OpenSHMEM Opmizaons

SHMEM_WAIT/SHMEM_WAIT_UNTIL block unl memory is updated by remote process

51 © 2017 Arm Limited WFE

Typically implemented as a busy-wait loop

• Arm Wait-For-Event (WFE) – provides an opportunity to pause the core unl the memory is updates (or an interrupt occurs) • It is used in linux spinlock and it is perfect fit for SHMEM

52 © 2017 Arm Limited WFE

53 © 2017 Arm Limited Preliminary Results

© 2017 Arm Limited Testbed

• 2 x Soiron Overdrive 3000 servers with AMD Opteron A1100 / 2GHz • ConnectX-4 IB/VPI EDR (PCIe gen2 x8) • 16.04 • MOFED 3.3-1.5.0.0 • UCX [0558b41] • XPMEM [bdfcc52] • OSHMEM/OPEN-MPI [fed4849]

Pavel Shamis, M. Graham Lopez, and Gilad Shainer. “Enabling One-sided Communicaon Semancs on Arm“

55 © 2017 Arm Limited Hardware Soware Stack Overview

56 © 2017 Arm Limited OpenUCX IB: MLX5 vs Verbs

��� ������� ��� ��� ������� ���� ���� ��� ���

��� ��� ��� ���� ��� ������ ����� ����� ��� 40% �� � ���� ��� ��� �� ����� ��� �� ���������� ��� �

������������ ���� ��� � ����� ��� � ��� �������� � � � �

�� �� �� ��� ��� ��� ��� ���� ���� ���� ���� � � � � ����� ����� ����� �� �� �� ������ ������ ������ ��� ��� ��� ������� ������� ���� ����� ����� ��� ���

��� ��� ��������� ��� ������ ������ ���������� ����� ���� ��� ���� ���� � �����

���� ���

���� � ���� ���

���� ���� ����� ��� ��� ������������ ���� �

��� ��� � � � �� �� �� ��� ��� ��� ���� ���� ���� ���� ����� ����� ����� ����� ����� ������ ������ ������ ������ ������ ������ ������ ������� ������� ������� ������� ����� ��������� ��� ��� 57 © 2017 Arm Limited OpenUCX: XPMEM ��� ������� ��� ������� ���� ���� ��� ����� ��� ��� ����� �� �� ����� ����� �� ����� ��� ����� ��� ����� ��� � ������ ����� � ����� ��� � � ����� � ������������ ��� ����� �������� ���� ����� ����� ������ ����� � � � �� �� �� ��� ��� ��� ���� ���� ���� ���� � � � � ����� ����� ����� �� �� �� ������ ������ ������ ��� ��� ��� ������� ������� ���� ����� ����� ��� ���

��� ��������� ��� ������ ������ ���������� ����� ����� ���

����� ���� ���� ����� ���� ����� ����� ��� ���� ����� ��� ���� ���� ���� ���� ���� ������������ ���� ���� ���� ���� ���� � � � �� �� �� ��� ��� ��� ���� ���� ���� ���� ����� ����� ����� ����� ����� ������ ������ ������ ������ ������ ������ ������ ������� ������� ������� ������� ����� ��������� 58 © 2017 Arm Limited ��� ��� SHMEM_WAIT()

������� �� ����� ��������� ����� ��������� ����� ��������� ���� ������������ ����������� ����� ����� ����� ��� ������� ����� ������� 35% � ������� ������� 73% ��� ����� �����

����� ����� ����� �������������� �

����� ����� ����������� �����

������� ����� ��� �����

� � ��� ��������� ������� ��� ��������� ������� ��� ��������� ������� ������������ ������������ ������������ ������������ ������������ ������������ ��� ��� ���

59 © 2017 Arm Limited OpenSHMEM SSCA ���� ��������� ���

���

��� ��� ����� ��� ����

��������� ��� �����

��� ���� 7-30%

���

� � � � �� ��� 60 © 2017 Arm Limited OpenSHMEM GUPs ��������� ���� ��������� �������

������� 21%

������� ������ ��� ������� �������

������� ������� ��� ���� ������� ��� ����� ������� ��� ���� ������������� ������� ��� ����� �������������

������� � � � �� ���

61 © 2017 Arm Limited OpenSHMEM ISx ��� ��������� ��

�� ��� ����� ��� ���� ��

��

��

�� ���������

�� ����

� � � � �� ���

62 © 2017 Arm Limited Summary

© 2017 Arm Limited Resources for Porng to AArch64

ARMv8 instrucon set overview: hp://infocenter.arm.com/help/topic/com.arm.doc.den0024a/index.html Arm C Language Extensions hp://infocenter.arm.com/help/topic/com.arm.doc.ihi0053c/IHI0053C_acle_2_0.pdf Arm ABI: hp://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.subset.swdev.abi/index.html Cortex-A57 Soware Opmizaon Guide hp://infocenter.arm.com/help/topic/com.arm.doc.uan0015b/index.html Introducon to Arm memory access ordering: hps://community.arm.com/groups/processors/blog/2011/03/22/memory-access-ordering--an-introducon 64 © 2017 Arm Limited Developer website : www.arm.com/hpc

A HPC-specific microsite This is home to our HPC ecosystem offering: • technical reference material • how-to guides • latest news and updates from partners • downloads of HPC libraries • third-party soware recommendaons • web forum for community discussion and help

Parcipate and help drive the community

65 © 2017 Arm Limited The Arm trademarks featured in this presentaon are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respecve owners.

www.arm.com/company/policies/trademarks

66 © 2017 Arm Limited