Introduc on to Arm for network
stack developers Pavel Shamis/Pasha Principal Research Engineer
Mvapich User Group 2017
© 2017 Arm Limited Columbus, OH Outline
• Arm Overview • HPC So ware Stack • Por ng on Arm • Evalua on
2 © 2017 Arm Limited Arm Overview
© 2017 Arm Limited An introduc on to Arm
Arm is the world's leading semiconductor intellectual property supplier. We license to over 350 partners, are present in 95% of smart phones, 80% of digital cameras, 35% of all electronic devices, and a total of 60 billion Arm cores have been shipped since 1990.
Our CPU business model: License technology to partners, who use it to create their own system-on-chip (SoC) products. We may license an instruc on set architecture (ISA) such as “ARMv8-A”) or a specific implementa on, such as “Cortex-A72”. …and our IP extends beyond the CPU
Partners who license an ISA can create their own implementa on, as long as it passes the compliance tests. 4 © 2017 Arm Limited A partnership business model
A business model that shares success Business Development • Everyone in the value chain benefits Arm Licenses technology to Partner • Long term sustainability SemiCo Design once and reuse is fundamental IP Partner Licence fee • Spread the cost amongst many partners Provider • Technology reused across mul ple applica ons Partners develop • Creates market for ecosystem to target chips – Re-use is also fundamental to the ecosystem Royalty Upfront license fee OEM • Covers the development cost Customer Ongoing royal es OEM sells • Typically based on a percentage of chip price consumer products • Vested interest in success of customers Approximately 1350 licenses More than 440 poten al 14.8bn+ Arm-powered chips in Grows by ~120 every year royalty payers 2015
5 © 2017 Arm Limited Partnership
6 © 2017 Arm Limited Range of SoCs addressing infrastructure
Highly Accelerated Massively Mul core
QorIQ® Layerscape 2080A
One size does not fit all 7 © 2017 Arm Limited Serious Arm HPC deployments star ng in 2017 Two big announcements about Arm in HPC in Europe:
8 © 2017 Arm Limited Japan
slides from Fujitsu at ISC’16
9 © 2017 Arm Limited So ware Stack
© 2017 Arm Limited Parallelism to enable op mal HPC performance § OpenMP § We are adding enhancements to the LLVM OpenMP implementa on to get be er AArch64 performance § Arm is ac ve member of the OpenMP Standards Commi ee
§ Auto-vectoriza on § Arm ac vely works on vectoriza on in GCC and LLVM, and encourages work with vectoriza on support in the compiler community. § PathScale’s compiler has vectoriza on support built in
11 © 2017 Arm Limited Open source in the Arm HPC ecosystem
Over the past 12 months many more packages and applica ons have been successfully ported to Arm HPC
12 © 2017 Arm Limited
Linux / FreeBSD w/ AARCH64 support
12.04LTS & 14.04LTS ß Also 14.10 & 15.04 releases Debian 8 adds AARCH64 – April 2015 released
Fedora 22 released – May 2015 Red Hat Enterprise Linux Server for Arm CentOS Linux 7 for AArch64 Fedora 23 released – Nov 2015 7.2 BETA – Sept, 2015 GA – August 2015
OpenSUSE 13.2 – Nov 2014 SUSE Launches Partner Program to Bring SUSE Linux Enterprise 12 to 64-bit Arm July 2015 @ ISC
ß Engaged with FreeBSD founda on / Semi-half & Cavium to get FreeBSD on ARMv8 FreeBSD Beta version demo’d by Semihalf – Nov. 2015
13 © 2017 Arm Limited HPC filesystems
So ware Status LUSTRE Ported HDFS Ported CEPH Ported BeeGFS Ported
14 © 2017 Arm Limited Workload and cluster managers
So ware Status IBM LSF Ported HP CMU Ported SLURM Ported Adap ve Compu ng (Moab) Ported Altair PBS Works Ported OpenLava (LSF port) Ported
15 © 2017 Arm Limited Compilers
© 2017 Arm Limited Open source and commercial compilers
§ GCC § NAG § C, C++, Fortran § Fortran § OpenMP 4.0 § OpenMP 3.1
§ LLVM § C, C++, Fortran § Arm C/C++/Fortran § OpenMP 3.1, (4.0 coming soon) Compiler § Fortran coming Q1 2017 § LLVM based § Includes SVE
17 © 2017 Arm Limited Arm C/C++ Compiler Commercially supported C/C++/Fortran compiler for Linux user-space HPC applica ons LLVM-based § Arm-on-Arm compiler § For applica on development (not bare-metal/embedded)
Regularly pulls from upstream LLVM, adding: § SVE support in the assembler, disassembler, intrinsics and autovectorizer § Compiler Insights to support Arm Code Advisor
OpenMP § Uses latest open source (now Arm-op mized) LLVM OpenMP run me § Changes pushed back to the community
18 © 2017 Arm Limited Arm C/C++/Fortran Compiler Linux user-space compiler tailored for HPC on Arm • Maintained and supported by Arm for a wide range of Arm-based SoCs running leading Linux distribu ons Commercially supported by Arm • Based on LLVM, the leading compiler framework
Latest features go into the commercial releases first
• Ahead of upstream LLVM by up to an year with latest performance improvement Latest features and patches performance op miza ons • SVE support in the assembler, disassembler, intrinsics and autovectorizer
OpenMP • Uses latest open source (now Arm-op mized) LLVM OpenMP run me Op mized OpenMP • Changes pushed back to the community 19 © 2017 Arm Limited
Arm C/C++ Compiler – usage
To compile C code:
% armclang -O3 file.c –o file
To compile C++ code:
% armclang++ -O3 file.cpp –o file
20 © 2017 Arm Limited Common armclang op ons
Flag Descrip on
--help Describes the most common op ons supported by Arm C/C++ Compiler --vsn Displays version informa on and license details --version -O
21 © 2017 Arm Limited Experimental tools to support SVE With Arm HPC Compiler, Instruc on Emulator and Code Advisor
Compile Emulate Analyse
Arm HPC Compiler Instruc on Emulator Code Advisor
C/C++/Fortran Runs userspace binaries Console or web-based output shows priori zed for future Arm advice in-line with original source code. SVE via auto-vectoriza on, architectures on today’s intrinsics and assembly. systems.
Compiler Insight: Compiler Supported instruc ons places results of compile- run unmodified. me decisions and analysis in the resul ng binary. Unsupported instruc ons are trapped and emulated.
22 © 2017 Arm Limited Arm Instruc on Emulator Run SVE binaries at near na ve speed on exis ng Armv8-A hardware
Trap-and-emulate of illegal userspace instruc ons Armv8-A Binary Armv8-A SVE Binary § For applica on development (not bare-metal/embedded like Arm Fast Models) § Na vely supported instruc ons run at full speed
§ Unsupported instruc ons are faithfully emulated in so ware Arm Instruc on Emulator
Converts SVE Full integra on with Arm Code Advisor instruc ons to § na ve Armv8-A Plugin allows Arm Instruc on Emulator to provide hotspot instruc ons informa on and other metrics § Command-line integra on allows Arm Code Advisor Linux workflows to seamlessly integrate with Arm Instruc on Emulator Armv8-A
23 © 2017 Arm Limited Arm Code Advisor Combines sta c and dynamic informa on to produce ac onable insights
Performance Advice § Compiler vectoriza on hints § Compila on flags advice § Fortran subarray warnings OpenMP instrumenta on § Insights from compila on and run me § Compiler Insights are embedded into the applica on
binary by the Arm HPC Compilers § OMPT interface used to instrument OpenMP run me Extensible Architecture § Users can write plugins to add their own analysis informa on § Data accessible via command-line, web browser and REST API to support new user interfaces 24 © 2017 Arm Limited
Libraries
© 2017 Arm Limited – Now on Arm
OpenHPC is a community effort to provide a common, Func onal Areas Components include verified set of open source packages for HPC Base OS RHEL/CentOS 7.1, SLES 12 deployments Administra ve Conman, Ganglia, Lmod, LosF, ORCM, Nagios, pdsh, Tools prun Arm’s par cipa on: Provisioning Warewulf Resource Mgmt. SLURM, Munge. Altair PBS Pro* • Silver member of OpenHPC I/O Services Lustre client (community version) • Arm is on the OpenHPC Technical Steering Commi ee Numerical/Scien fic Boost, GSL, FFTW, Me s, PETSc, Trilinos, Hypre, Libraries SuperLU, Mumps
in order to drive Arm build support I/O Libraries HDF5 (pHDF5), NetCDF (including C++ and Fortran interfaces), Adios Status: 1.3.1 release out now Compiler Families GNU (gcc, g++, gfortran) MPI Families OpenMPI, MVAPICH2 • All packages built on Armv8-A for CentOS and SUSE Development Tools Autotools (autoconf, automake, libtool), Valgrind,R, SciPy/NumPy • Arm-based machines are being used for building and Performance Tools PAPI, Intel IMB, mpiP, pdtoolkit TAU also in the OpenHPC build infrastructure 26 © 2017 Arm Limited Open source library AArch64 inbuilt tuning work
OpenBLAS • ARMv8 kernels included BLIS • BLIS developers have close rela onship with Arm • BLIS supports various Arm processors by default (e.g. Arm Cortex-A53, Cortex-A57 CPUs) • Also currently conduc ng Arm big.LITTLE development ATLAS • Work ongoing with Arm Research team • Cortex-A57/A53 patches went into ATLAS FFTW • Just works. NEON op ons built into v3.3.5
27 © 2017 Arm Limited Arm Performance Libraries Op mized BLAS, LAPACK and FFT
Commercial 64-bit Armv8-A math libraries
• Commonly used low-level math rou nes - BLAS, LAPACK and FFT Performance on par with best-in-class math libraries • Validated with NAG’s test suite, a de-facto standard Best-in-class performance with commercial support
• Tuned by Arm for Cortex-A72, Cortex-A57 and Cortex-A53 Commercially Supported by Arm • Maintained and supported by Arm for a wide range of Arm-based SoCs
• Including Cavium ThunderX and ThunderX2 CN99 cores Silicon partners can provide tuned micro-kernels for their SoCs • Partners can contribute directly through open source route Validated with NAG test suite • Parallel tuning within our library increases overall applica on performance 28 © 2017 Arm Limited
RDMA
© 2017 Arm Limited RDMA Support
Mellanox OFED 2.4 and above supports Arm Linux Kernel 4.5.0 and above (maybe even earlier) OFED – No support Linux Distribu on – on going process
30 © 2017 Arm Limited UCX Framework • Collabora on between industry, laboratories, and academia • Create open-source produc on grade communica on framework for HPC applica ons • Enable the highest performance through co-design of so ware-hardware interfaces • Unify industry - na onal laboratories - academia efforts
API Performance oriented Produc on quality
Exposes broad seman cs that target Op miza on for low-so ware Developed, maintained, tested, and data centric and HPC programming overheads in communica on path used by industry and researcher models and applica ons allows near na ve-level performance community
Community driven Research Cross pla orm
Collabora on between industry, The framework concepts and ideas are Support for Infiniband, Cray, various laboratories, and academia driven by research in academia, shared memory (x86-64, Power, Arm), laboratories, and industry GPUs Co-design of Exascale Network APIs
31 © 2017 Arm Limited UCX – a high-level overview
Applications
OpenSHMEM, UPC, CAF, X10, MPICH, Open-MPI, etc. Parsec, OCR, Legions, etc. Burst buffer, ADIOS, etc. Chapel, etc.
UC-P (Protocols) - High Level API Transport selection, cross-transrport multi-rail, fragmentation, operations not supported by hardware
Message Passing API Domain: PGAS API Domain: Task Based API Domain: I/O API Domain: tag matching, randevouze RMAs, Atomics Active Messages Stream
UC-T (Hardware Transports) - Low Level API UC-S RMA, Atomic, Tag-matching, Send/Recv, Active Message (Services) UCX Common utilities Transport for InfiniBand VERBs Transport for Transport for intra-node host memory communication Transport for Data driver Gemini/Aries Accelerator Memory Utilities drivers communucation stractures RC UD XRC DCT GNI SYSV POSIX KNEM CMA XPMEM GPU Memory Management
OFA Verbs Driver Cray Driver OS Kernel Cuda
Hardware
32 © 2017 Arm Limited OpenUCX v1.2
• The first official release from OpenUCX community
• h ps://github.com/openucx/ucx/releases/tag/v1.2.0 • Features
• Support for InfiniBand and RoCE • Transports RC, UD, DC
• Support for Accelerated Verbs – 40% speedup on Arm compared to vanilla Verbs • Support for Cray Aries and Gemini
• Support for Shared Memory: KNEM, CMA, XPMEM, Posix, SySV • Support for x86, ARMv8 (with NEON), Power • Efficient memory polling – 36% increase in efficiency on Arm
• UCX interface is integrated with MPICH, OpenMPI, OSHMEM, ORNL-SHMEM, etc.
Pavel Shamis, M. Graham Lopez, and Gilad Shainer. “Enabling One-sided Communica on Seman cs on Arm“
33 © 2017 Arm Limited Por ng Experience
© 2017 Arm Limited What is AArch64?
ARM’s 64-bit instruc on set, part of ARMv8 64-bit pointers and registers 31 general purpose registers Fixed length (32-bit) instruc ons Load/Store architecture Li le endian (big endian is an op on) Hardware floa ng point Advanced SIMD Weakly ordered memory
35 © 2017 Arm Limited My First Development Pla orm
36 © 2017 Arm Limited Autoconf for AArch64
./configure is broken for older builds of autoconf:
• “Invalid configura on `aarch64-linux': machine `aarch64' not recognized”
Running autoconf/autoreconf with Autoconf releases a er 2012/02/10 should fix ./ configure
Alterna vely you can upgrade the config.guess and config.sub files to the latest versions from:
• h p://git.savannah.gnu.org/r/config.git
37 © 2017 Arm Limited 64k Page Sizes
The linux kernel config for AArch64 supports 4k and 64k page sizes
• Most AArch64 linux distribu ons now default to 64k pages in their kernel config
Shared libraries built to align to 4k pages will not load if their ini aliza on code does not happen to align to a 64k boundary
If you are building your own compiler toolchains be aware that binu ls prior to 2.26 default to producing binaries aligned 4k pages
• The binu ls source rpm/deb packages for the distribu ons using 64k pages have the AArch64 page size patched to always use 64k
38 © 2017 Arm Limited Weakly Ordered Memory Model
Weakly ordered memory access means that changes to memory can be applied in any order as long as single-core execu on sees the data needed for program correctness Benefits:
• The processor can make many op miza ons to reduce memory access
– This has power (pushing bits is expensive) and memory bandwidth benefits • The op miza ons are transparent to single-core execu on Challenges:
• Synchroniza on of data between cores must be explicit • Popular legacy architectures (EG: x86_64, x86) provide “almost” strongly ordered memory access
– This means that exis ng mul -core codes may be dependent on strongly ordered accesses
39 © 2017 Arm Limited Relaxed memory ordering
RDMA Write Payload Busy-wait Read No fy
Write Barrier Read Barrier
RDMA Write No fy Read Payload
Maranget, Luc, Susmit Sarkar, and Peter Sewell. "A tutorial introduc on to the Arm and POWER relaxed memory models." Dra available from h p://www. cl. cam. ac. uk/~ pes20/ppc-supplemental/test7. pdf (2012).
40 © 2017 Arm Limited Weakly Ordered Memory In Por ng Applica ons
Most parallel HPC applica ons we encountered used GCC’s libgomp (-fopenmp)
• These behaved correctly on AArch64 using GCC 5.2
Some HPC codes we came across had their own paralleliza on implementa ons
• Usually based directly on top of pthreads
• Wri en to have more control over the threads of execu on and how they synchronize • Some had no problems working with AArch64’s weakly ordered memory system
• Others exhibited issues in mul -threaded modes that were par cularly hard to diagnose without a detailed inves ga on into how the mul -threaded mode was implemented
– Problems are almost always down to a lock-free thread interac on implementa on
– The key symptom is correct opera on on a strongly ordered architecture, failure on weakly ordered
41 © 2017 Arm Limited Weakly Ordered Memory In Por ng MPI and PGAS
Typical MPI implementa on pi alls: • MPI shared memory code • Collec ve opera ons within shared memory • RDMA polling code ! (YES, not just interac on between devices)
Typical PGAS/OpenSHMEM pi alls: • All the above…and more • Memory (local and remote) synchroniza on rou nes
42 © 2017 Arm Limited Weakly Ordered Memory In Por ng Drivers
User and kernel level drivers for “OS bypass” devices • Memory barriers in doorbell flow
43 © 2017 Arm Limited Being explicit with your memory access
Ideally; use a modern synchroniza on implementa on to do it for you
• OpenMP, C++11 atomics, C mutexes, various other libraries
Otherwise: barrier assembler instruc ons allow you to be explicit in what happens to memory before execu on con nues:
Load-Acquire, Store-Release instruc ons allow for atomic opera on without the use of explicit barriers
44 © 2017 Arm Limited Memory Barriers
DSB • Comple on seman cs • Data Synchroniza on Barriers halt execu on un l:
• All explicit memory accesses before the instruc on complete
• All cache, branch predic on and TLB opera ons before the instruc on complete • Interac on with external devices (PCIe doorbells)
• Device drivers
Examples h ps://github.com/openucx/ucx/blob/master/src/ucs/arch/aarch64/cpu.h#L25 45 © 2017 Arm Limited Memory Barriers - con nued
DMB • ISH* domain on Linux • Poll-flag, barrier, data
Examples h ps://github.com/openucx/ucx/blob/master/src/ucs/arch/aarch64/cpu.h#L25 46 © 2017 Arm Limited Low-level mers
Low-level mers
• Typically found in benchmarks and MPI
• Code examples h ps://github.com/openucx/ucx/blob/master/src/ucs/arch/aarch64/cpu.h#L35
47 © 2017 Arm Limited Cache line on Arm
Not all cache-lines are 64Byte !
• Implementa on dependent
• Don’t make assump ons about 64B cache line size
h p://xeroxnostalgia.com/duplicators/xerox-9200/
48 © 2017 Arm Limited Ideas for Op miza on
© 2017 Arm Limited Network driver op miza ons
MLX5 – specially op mized transport implemented on top of ConnectX Hardware Abstrac on Layer The layer ini alize translates UCP request to InifniBand request and rings the doorbell • The code responsible for ini aliza on of the request was updated to leverage Arm vector instruc ons (NEON): h ps://github.com/openucx/ucx/blob/891e20ef90257d1e2721da52461b0261220c82d8/src/uct/ib/mlx5/ ib_mlx5.inl#L160
50 © 2017 Arm Limited OpenSHMEM Op miza ons
SHMEM_WAIT/SHMEM_WAIT_UNTIL block un l memory is updated by remote process
51 © 2017 Arm Limited WFE
Typically implemented as a busy-wait loop
• Arm Wait-For-Event (WFE) – provides an opportunity to pause the core un l the memory is updates (or an interrupt occurs) • It is used in linux spinlock and it is perfect fit for SHMEM
52 © 2017 Arm Limited WFE
53 © 2017 Arm Limited Preliminary Results
© 2017 Arm Limited Testbed
• 2 x So iron Overdrive 3000 servers with AMD Opteron A1100 / 2GHz • ConnectX-4 IB/VPI EDR (PCIe gen2 x8) • Ubuntu 16.04 • MOFED 3.3-1.5.0.0 • UCX [0558b41] • XPMEM [bdfcc52] • OSHMEM/OPEN-MPI [fed4849]
Pavel Shamis, M. Graham Lopez, and Gilad Shainer. “Enabling One-sided Communica on Seman cs on Arm“
55 © 2017 Arm Limited Hardware So ware Stack Overview
56 © 2017 Arm Limited OpenUCX IB: MLX5 vs Verbs
��� ������� ��� ��� ������� ���� ���� ��� ���
��� ��� ��� ���� ��� ������ ����� ����� ��� 40% �� � ���� ��� ��� �� ����� ��� �� ���������� ��� �
������������ ���� ��� � ����� ��� � ��� �������� � � � �
�� �� �� ��� ��� ��� ��� ���� ���� ���� ���� � � � � ����� ����� ����� �� �� �� ������ ������ ������ ��� ��� ��� ������� ������� ���� ����� ����� ��� ���
��� ��� ��������� ��� ������ ������ ���������� ����� ���� ��� ���� ���� � �����
���� ���
���� � ���� ���
���� ���� ����� ��� ��� ������������ ���� �
��� ��� � � � �� �� �� ��� ��� ��� ���� ���� ���� ���� ����� ����� ����� ����� ����� ������ ������ ������ ������ ������ ������ ������ ������� ������� ������� ������� ����� ��������� ��� ��� 57 © 2017 Arm Limited OpenUCX: XPMEM ��� ������� ��� ������� ���� ���� ��� ����� ��� ��� ����� �� �� ����� ����� �� ����� ��� ����� ��� ����� ��� � ������ ����� � ����� ��� � � ����� � ������������ ��� ����� �������� ���� ����� ����� ������ ����� � � � �� �� �� ��� ��� ��� ���� ���� ���� ���� � � � � ����� ����� ����� �� �� �� ������ ������ ������ ��� ��� ��� ������� ������� ���� ����� ����� ��� ���
��� ��������� ��� ������ ������ ���������� ����� ����� ���
����� ���� ���� ����� ���� ����� ����� ��� ���� ����� ��� ���� ���� ���� ���� ���� ������������ ���� ���� ���� ���� ���� � � � �� �� �� ��� ��� ��� ���� ���� ���� ���� ����� ����� ����� ����� ����� ������ ������ ������ ������ ������ ������ ������ ������� ������� ������� ������� ����� ��������� 58 © 2017 Arm Limited ��� ��� SHMEM_WAIT()
������� �� ����� ��������� ����� ��������� ����� ��������� ���� ������������ ����������� ����� ����� ����� ��� ������� ����� ������� 35% � ������� ������� 73% ��� ����� �����
����� ����� ����� �������������� �
����� ����� ����������� �����
������� ����� ��� �����
� � ��� ��������� ������� ��� ��������� ������� ��� ��������� ������� ������������ ������������ ������������ ������������ ������������ ������������ ��� ��� ���
59 © 2017 Arm Limited OpenSHMEM SSCA ���� ��������� ���
���
��� ��� ����� ��� ����
��������� ��� �����
��� ���� 7-30%
���
� � � � �� ��� 60 © 2017 Arm Limited OpenSHMEM GUPs ��������� ���� ��������� �������
������� 21%
������� ������ ��� ������� �������
������� ������� ��� ���� ������� ��� ����� ������� ��� ���� ������������� ������� ��� ����� �������������
������� � � � �� ���
61 © 2017 Arm Limited OpenSHMEM ISx ��� ��������� ��
�� ��� ����� ��� ���� ��
��
��
�� ���������
�� ����
�
�
�
� � � � �� ���
62 © 2017 Arm Limited Summary
© 2017 Arm Limited Resources for Por ng to AArch64
ARMv8 instruc on set overview: h p://infocenter.arm.com/help/topic/com.arm.doc.den0024a/index.html Arm C Language Extensions h p://infocenter.arm.com/help/topic/com.arm.doc.ihi0053c/IHI0053C_acle_2_0.pdf Arm ABI: h p://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.subset.swdev.abi/index.html Cortex-A57 So ware Op miza on Guide h p://infocenter.arm.com/help/topic/com.arm.doc.uan0015b/index.html Introduc on to Arm memory access ordering: h ps://community.arm.com/groups/processors/blog/2011/03/22/memory-access-ordering--an-introduc on 64 © 2017 Arm Limited Developer website : www.arm.com/hpc
A HPC-specific microsite This is home to our HPC ecosystem offering: • technical reference material • how-to guides • latest news and updates from partners • downloads of HPC libraries • third-party so ware recommenda ons • web forum for community discussion and help
Par cipate and help drive the community
65 © 2017 Arm Limited The Arm trademarks featured in this presenta on are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respec ve owners.
www.arm.com/company/policies/trademarks
66 © 2017 Arm Limited