OpenSHMEM on Arm Pavel Shamis/Pasha Principal Research Engineer
OpenSHMEM Workshop 2017
© 2017 Arm Limited Annapolis, MD OpenSHMEM on Arm ?
2 © 2017 Arm Limited Arm servers from mul ple manufacturers
3 © 2017 Arm Limited Arm Overview
© 2017 Arm Limited An introduc on to Arm
ARM is the world's leading semiconductor intellectual property supplier. We license to over 350 partners, are present in 95% of smart phones, 80% of digital cameras, 35% of all electronic devices, and a total of 60 billion ARM cores have been shipped since 1990.
Our CPU business model: License technology to partners, who use it to create their own system-on-chip (SoC) products. We may license an instruc on set architecture (ISA) such as “ARMv8-A”) or a specific implementa on, such as “Cortex-A72”. …and our IP extends beyond the CPU
Partners who license an ISA can create their own implementa on, as long as it passes the compliance tests. 5 © 2017 Arm Limited A partnership business model
A business model that shares success Business Development • Everyone in the value chain benefits ARM Licenses technology to Partner • Long term sustainability SemiCo Design once and reuse is fundamental IP Partner Licence fee • Spread the cost amongst many partners Provider • Technology reused across mul ple applica ons Partners develop • Creates market for ecosystem to target chips – Re-use is also fundamental to the ecosystem Royalty Upfront license fee OEM • Covers the development cost Customer Ongoing royal es OEM sells • Typically based on a percentage of chip price consumer products • Vested interest in success of customers Approximately 1350 licenses More than 440 poten al 14.8bn+ ARM-powered chips in Grows by ~120 every year royalty payers 2015
6 © 2017 Arm Limited Partnership
7 © 2017 Arm Limited Range of SoCs addressing infrastructure
Highly Accelerated Massively Mul core
QorIQ® Layerscape 2080A
One size does not fit all 8 © 2017 Arm Limited Serious Arm HPC deployments star ng in 2017 Two big announcements about ARM in HPC in Europe:
9 © 2017 Arm Limited Japan
slides from Fujitsu at ISC’16
10 © 2017 Arm Limited Integra on with Network Interconnects
© 2017 Arm Limited CCIX
Accelerators and Network (NIC/HCA/etc.) as a first class “ci zen” in the system Seamless process and accelerator hardware cache coherence support Low-latency and high-bandwidth Allow in-line accelera on
• Bump in the wire processing (network packet processing, storage accelera on, etc.) Allows “off-line” accelera on (co-processor model) Driver-less / interrupt-less usage model h p://www.ccixconsor um.com 12 © 2017 Arm Limited
Scale-up server node
DMC-620 DMC-620 DMC-620 DMC-620
Accelerator
Smart CoreLink CMN-600 CoreLink CMN-600
Network
Persistent Memory DMC-620 DMC-620 DMC-620 DMC-620
Shared virtual memory system
13 © 2017 Arm Limited CCIX mul chip connec vity and topologies
New class of interconnect providing high performance, low Accelerator latency for new accelerators use cases
Smart • CCIX defines 25GT/s (3x performance*) Switch Network • Examining 56GT/s (7x performance*) and beyond Compute Node Persistent • Enabling low latency via light transac on layer Memory Flexible, scalable interconnect topologies
• Flexible point-to-point, daisy chained and switched topologies Simplified deployment by leveraging exis ng PCIe hardware and so ware infrastructure
• Runs on exis ng PCIe transport layer and management stack
• Coexist with legacy PCIe designs
* Note: Based on PCIe Gen3 Performance
14 © 2017 Arm Limited Building CCIX devices
Cadence® IP for CCIX Cadence Controller & PHY IP • Built upon silicon proven PCIe solu ons
IP products:
• Controller IP – Provides the CCIX transac on and data link layers. • PHY IP – Provides the high performance SERDES physical layer suppor ng speeds up to 25Gpbs. • Verifica on IP – Provides the necessary test infrastructure to verify CCIX designs.
15 © 2017 Arm Limited Cadence CCIX integra on
CML converts CHI to Low latency CCIX Support up to 25Gbps vs Example CMN-600 mesh design CCIX messages transaction layer 16Gbps PCIe Gen4
DMC-620 DMC-620
Cadence IP
CXS
) CML CCIX Layer
Transac on 25Gpbs CoreLink CMN-600 XP 16 Lanes
Data Link Layer AXI PHY (up to
RNI PCIe Layer
Transac on
DMC-620 DMC-620
PCIe IP connects to a CMN IO interface via AXI
16 © 2017 Arm Limited Gen-Z
All data is accessed by some form of a Read or a Write Example of reads: DDR Row + Column Read, PCI DMA Read, SCSI Write, Socket Read, File Read, RDMA Read Example of writes: DDR Row + Column Write, PCI DMA Write, SCSI Read, Socket Write, File Write, RDMA Write The Goal: Simplify world to memory seman c Reads & Writes
17 © 2017 Arm Limited Gen-Z Overview
An open, standards-based, scalable, system interconnect and protocol. Op mized to support memory seman c communica ons Breaks Processor-Memory Interlock Split controller model • Memory controller – Ini ates high-level requests—Read, Write, Atomic, Put / Get, etc. – Enforces ordering, reliability, path selec on, etc. • Media controller – Abstracts memory media – Supports vola le / non-vola le / mixed-media – Performs media-specific opera ons – Executes requests and returns responses – Enables data-centric compu ng (accelerator, compute, etc.)
18 © 2017 Arm Limited So ware Eco-system
© 2017 Arm Limited Released Planned Arm HPC ecosystem roadmap Concept
AppliedMicro X-Gene 1 & 2 Qualcomm Centriq AppliedMicro X-Gene 3 Hardware AMD Sea le Phy um Mars Fujitsu – Post K (SVE) Cavium ThunderX Cavium ThunderX2
OpenHPC 1.2 ARM Op mized Rou nes Open-Source ARM Op mized Rou nes – vector versions so ware Altair PBS Pro GCC (gcc/g++/gfortran) LLVM - clang LLVM – Flang
ARM C/C++ Compiler – ahead of LLVM trunk ARM Fortran Compiler
ARM HPC tools ARM Performance Libraries ARM Code Advisor (Beta) ARM Code Advisor (Full release) ARM Instruc on Emulator
Allinea DDT and MAP Rogue Wave TotalView ISV so ware NAG Library & Compiler ISV so ware PathScale ENZO 2016 2017 Future 20 © 2017 Arm Limited Linux / FreeBSD w/ AARCH64 support
12.04LTS & 14.04LTS ß Also 14.10 & 15.04 releases Debian 8 adds AARCH64 – April 2015 released
Fedora 22 released – May 2015 Red Hat Enterprise Linux Server for ARM CentOS Linux 7 for AArch64 Fedora 23 released – Nov 2015 7.2 BETA – Sept, 2015 GA – August 2015
OpenSUSE 13.2 – Nov 2014 SUSE Launches Partner Program to Bring SUSE Linux Enterprise 12 to 64-bit ARM July 2015 @ ISC
ß Engaged with FreeBSD founda on / Semi-half & Cavium to get FreeBSD on ARMv8 FreeBSD Beta version demo’d by Semihalf – Nov. 2015
21 © 2017 Arm Limited Open source and commercial compilers
§ GCC § PathScale § C, C++, Fortran § C, C++ Fortran § OpenMP 4.0 § OpenACC § OpenMP 4.0
§ LLVM § NAG § C, C++ § Fortran § OpenMP 3.1, (4.0 coming soon) § OpenMP 3.1 § Fortran coming Q1 2017
§ ARM C/C++ Compiler § LLVM based § Includes SVE
22 © 2017 Arm Limited 23 © 2017 Arm Limited OpenSHMEM
© 2017 Arm Limited Hardware So ware Stack Overview
25 © 2017 Arm Limited XPMEM
26 © 2017 Arm Limited XPMEM
XPMEM KNEM CMA MMAP System Call No Yes Yes Yes Low Latency V X X V
High Bandwidth V V V V
Any Virtual Address V V V X
Open Source V V V V Linux default X X V V support
XPMEM - Cross Memory A ach
• The code was updated to compile and run on ARMv8 (h ps://github.com/hjelmn/xpmem)
• 3rd party kernel module based on SGI/Cray XPMEM and maintained by LANL
27 © 2017 Arm Limited XPMEM
28 © 2017 Arm Limited VERBs API support on Arm
• Besides bug fixes not much work was required • Mellanox OFED 2.4 and above supports Arm • Linux Kernel 4.5.0 and above (maybe even earlier) • Linux Distribu on Support – on going process • OFED – no ARMv8 support
29 © 2017 Arm Limited UCX
30 © 2017 Arm Limited OpenUCX v1.2
• The first official release from OpenUCX community
• h ps://github.com/openucx/ucx/releases/tag/v1.2.0 • Features
• Support for InfiniBand and RoCE • Transports RC, UD, DC
• Support for Accelerated Verbs – 40% speedup on ARM compared to vanilla Verbs • Support for Cray Aries and Gemini
• Support for Shared Memory: KNEM, CMA, XPMEM, Posix, SySV • Support for x86 (with AVX), ARMv8 (with NEON), Power • Efficient memory polling – 36% increase in efficiency on ARM
• UCX interface is integrated with MPICH, OpenMPI, OSHMEM, ORNL-SHMEM, etc.
31 © 2017 Arm Limited Relaxed memory ordering
Memory Barriers
• Mul thread environment
• So ware-hardware interac on • You can “fish” for these bugs in MPI implementa ons around Eager-RDMA and shared memory protocols
RDMA Write Payload Busy-wait Read No fy
Write Barrier Read Barrier
RDMA Write No fy Read Payload
Maranget, Luc, Susmit Sarkar, and Peter Sewell. "A tutorial introduc on to the ARM and POWER relaxed memory models." Dra available from h p://www. cl. cam. ac. uk/~ pes20/ppc-supplemental/test7. pdf (2012).
32 © 2017 Arm Limited Memory Barriers
DSB • Comple on seman cs • Data Synchroniza on Barriers halt execu on un l:
• All explicit memory accesses before the instruc on complete
• All cache, branch predic on and TLB opera ons before the instruc on complete • Interac on with external devices (PCIe doorbells)
• Device drivers
Examples h ps://github.com/openucx/ucx/blob/master/src/ucs/arch/aarch64/cpu.h#L25 33 © 2017 Arm Limited Memory Barriers - con nued
DMB • ISH* domain on Linux • Poll-flag, barrier, data
Examples h ps://github.com/openucx/ucx/blob/master/src/ucs/arch/aarch64/cpu.h#L25 34 © 2017 Arm Limited Low-level mers
Low-level mers
• Typically found in benchmarks and MPI
• Code examples h ps://github.com/openucx/ucx/blob/master/src/ucs/arch/aarch64/cpu.h#L35
35 © 2017 Arm Limited Cache lines on ARM
Not all cache-lines are 64Byte !
• Implementa on dependent
• Don’t make assump ons about 64B cache line size
h p://xeroxnostalgia.com/duplicators/xerox-9200/
36 © 2017 Arm Limited MLX5 UCX Transport
MLX5 – specially op mized transport implemented on top of ConnectX Hardware Abstrac on Layer The layer ini alize translates UCP request to InifniBand request and rings the doorbell • The code responsible for ini aliza on of the request was updated to leverage ARM vector instruc ons (NEON): h ps://github.com/openucx/ucx/blob/891e20ef90257d1e2721da52461b0261220c82d8/src/uct/ib/mlx5/ ib_mlx5.inl#L160
37 © 2017 Arm Limited OSHMEM
38 © 2017 Arm Limited OSHMEM
OSHMEM – a very thin layer on top of UCX UCP API Worked out of the box, besides minor bugs in the integra on layer
39 © 2017 Arm Limited OSHMEM
SHMEM_WAIT/SHMEM_WAIT_UNTIL block un l memory is updated by remote process
Typically implemented as a busy-wait loop
• ARM Wait-For-Event (WFE) – provides an opportunity to pause the core un l the memory is updates (or an interrupt occurs)
• It is used in linux spinlock and it is perfect fit for SHMEM
40 © 2017 Arm Limited Preliminary Results
© 2017 Arm Limited Testbed
• 2 x So iron Overdrive 3000 servers with AMD Opteron A1100 / 2GHz • ConnectX-4 IB/VPI EDR (PCIe gen2 x8) • Ubuntu 16.04 • MOFED 3.3-1.5.0.0 • UCX [0558b41] • XPMEM [bdfcc52] • OSHMEM/OPEN-MPI [fed4849]
Pavel Shamis, M. Graham Lopez, and Gilad Shainer. “Enabling One-sided Communica on Seman cs on ARM“
42 © 2017 Arm Limited OpenUCX IB: MLX5 vs Verbs
��� ������� ��� ��� ������� ���� ���� ��� ���
��� ��� ��� ���� ��� ������ ����� ����� ��� 40% �� � ���� ��� ��� �� ����� ��� �� ���������� ��� �
������������ ���� ��� � ����� ��� � ��� �������� � � � �
�� �� �� ��� ��� ��� ��� ���� ���� ���� ���� � � � � ����� ����� ����� �� �� �� ������ ������ ������ ��� ��� ��� ������� ������� ���� ����� ����� ��� ���
��� ��� ��������� ��� ������ ������ ���������� ����� ���� ��� ���� ���� � �����
���� ���
���� � ���� ���
���� ���� ����� ��� ��� ������������ ���� �
��� ��� � � � �� �� �� ��� ��� ��� ���� ���� ���� ���� ����� ����� ����� ����� ����� ������ ������ ������ ������ ������ ������ ������ ������� ������� ������� ������� ����� ��������� ��� ��� 43 © 2017 Arm Limited OpenUCX: XPMEM ��� ������� ��� ������� ���� ���� ��� ����� ��� ��� ����� �� �� ����� ����� �� ����� ��� ����� ��� ����� ��� � ������ ����� � ����� ��� � � ����� � ������������ ��� ����� �������� ���� ����� ����� ������ ����� � � � �� �� �� ��� ��� ��� ���� ���� ���� ���� � � � � ����� ����� ����� �� �� �� ������ ������ ������ ��� ��� ��� ������� ������� ���� ����� ����� ��� ���
��� ��������� ��� ������ ������ ���������� ����� ����� ���
����� ���� ���� ����� ���� ����� ����� ��� ���� ����� ��� ���� ���� ���� ���� ���� ������������ ���� ���� ���� ���� ���� � � � �� �� �� ��� ��� ��� ���� ���� ���� ���� ����� ����� ����� ����� ����� ������ ������ ������ ������ ������ ������ ������ ������� ������� ������� ������� ����� ��������� 44 © 2017 Arm Limited ��� ��� SHMEM_WAIT()
������� �� ����� ��������� ����� ��������� ����� ��������� ���� ������������ ����������� ����� ����� ����� ��� ������� ����� ������� 35% � ������� ������� 73% ��� ����� �����
����� ����� ����� �������������� �
����� ����� ����������� �����
������� ����� ��� �����
� � ��� ��������� ������� ��� ��������� ������� ��� ��������� ������� ������������ ������������ ������������ ������������ ������������ ������������ ��� ��� ���
45 © 2017 Arm Limited OpenSHMEM SSCA ���� ��������� ���
���
��� ��� ����� ��� ����
��������� ��� �����
��� ���� 7-30%
���
� � � � �� ��� 46 © 2017 Arm Limited OpenSHMEM GUPs ��������� ���� ��������� �������
������� 21%
������� ������ ��� ������� �������
������� ������� ��� ���� ������� ��� ����� ������� ��� ���� ������������� ������� ��� ����� �������������
������� � � � �� ���
47 © 2017 Arm Limited OpenSHMEM ISx ��� ��������� ��
�� ��� ����� ��� ���� ��
��
��
�� ���������
�� ����
�
�
�
� � � � �� ���
48 © 2017 Arm Limited Summary
Open UCX over Mellanox InfiniBand reaches 90% of prac cal bandwidth with a single 2Ghz ARM core Open UCX with MLX5 transport on Arm demonstrate 40% increase in message rate when compared to Verbs transport OpenSHMEM shmem_wait demonstrate 35% decrease in number of cycles using ARM WFE instruc on GUPs and SSCA OpenSHMEM codes demonstrate between 7-35% accelera on in execu on me using Open UCX MLX5 transport
49 © 2017 Arm Limited + =
50 © 2017 Arm Limited The Arm trademarks featured in this presenta on are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respec ve owners.
www.arm.com/company/policies/trademarks
51 © 2017 Arm Limited