
HORIZON 2020 TOPIC FETHPC-02-2017 Transition to Exascale Computing

Exascale Programming Models for Heterogeneous Systems 801039

D4.1 Report on state of the art of novel compute elements and gap analysis in MPI and GASPI

WP4: Productive computing with FPGAs, GPUs and low-power

Date of preparation (latest version): [DATE]
Copyright © 2018-2021 The EPiGRAM-HS Consortium

The opinions of the authors expressed in this document do not necessarily reflect the official opinion of the EPiGRAM-HS partners nor of the European Commission.

DOCUMENT INFORMATION

Deliverable Number: D4.1
Deliverable Name: Report on state of the art of novel compute elements and gap analysis in MPI and GASPI
Due Date: 31/01/2019 (PM5)
Deliverable Lead: FhG
Authors: Martin Kühn, Dominik Loroch, Carsten Lojewski, Valeria Bartsch (Fraunhofer ITWM), Gilbert Netzer (KTH), Daniel Holmes (EPCC), Alan Stokes (EPCC), Oliver Brown (EPCC)
Responsible Author: Valeria Bartsch, Fraunhofer ITWM, [email protected]
Keywords: Novel compute, programming models
WP/Task: WP4 / Task 4.3
Nature: R
Dissemination Level: PU
Final Version Date: 31/01/2019
Reviewed by: Sergio Rivas-Gomez (KTH), Oliver Brown (EPCC), Luis Cebamanos (EPCC)
MGT Board Approval: YES


DOCUMENT HISTORY

Version | Partner | Date | Comment
0.1 | FhG | 04.12.2018 | First draft (Multi/Many core, GPU gap analysis, specific HW for DL)
0.2 | FhG, KTH | 19.12.2018 | Extension to Multi/Many core for ARM and RISC5, FPGA section, Introduction added
0.3 | FhG | 28.12.2018 | Executive Summary, Introduction into GPUs added
0.4 | FhG | 04.01.2019 | Glossary added, Conclusion added
0.5 | FhG | 09.01.2019 | MPI gap analysis added, table on GPUs added, GASPI gap analysis for FPGAs added
0.6 | EPCC | 09.01.2019 | Added 4.1 Introduction to Beyond V-N
0.7 | FhG, KTH, EPCC | 25.01.2019 | Implement changes to address comments by reviewers
0.8 | FhG, KTH | 28.01.2019 | Discussion of comments on BE and big.LITTLE, etc.
1.0 | KTH | 31.01.2019 | Final version


Executive Summary

The purpose of this document is to survey the state of the art of novel compute elements and to carry out a gap analysis in MPI and GASPI. Currently a major hurdle in large supercomputers is the huge amount of energy necessary to operate these systems. The power consumption and efficiency of some novel compute devices is very attractive, and therefore a rise in heterogeneous HPC clusters is expected in the future. For this deliverable we chose to analyse the following types of compute elements: Multi and Many Core architectures, Graphical Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs) and specialised hardware for Deep Learning (DL). The choice is motivated by the current market share, the uptake in HPC clusters and the impact that some of the novel compute elements would have on HPC systems.

The current hardware trends are the following:

• Multi core architectures are still used in many of the supercomputers in the Top500 list. However, the product pipeline of the main suppliers is currently uncertain. MPI and GASPI support for such architectures is mature.
• GPUs are becoming more important in HPC and are already predominant in DL applications. Part of the success of GPUs can be explained by the rise of the CUDA programming model, which delivers easy-to-use interfaces for application developers. The integration of GPUs as accelerators in the HPC network stack can be done on various levels. Basic MPI and GASPI functionalities exist; a tighter integration is possible, but the benefit is questionable.
• FPGAs are known to be difficult to program. This has been one of the biggest hurdles for the integration of FPGAs in HPC architectures. High Level Synthesis (HLS) tools are starting to become mature, making FPGAs easier to program. Currently there is no notable MPI or GASPI support for FPGAs. However, this situation is changing: for GASPI, for example, there are efforts dedicated to better support in the scope of the EuroExa project.
• Neuromorphic computing is well suited for neural networks and very energy-efficient. The first commercial neuromorphic platforms could be used as accelerators for a limited number of applications. There is no MPI or GASPI support foreseen for this type of compute element.
• Specific hardware for deep learning is on the rise. There is an emergence of highly efficient hardware IP on servers/accelerators on one hand and edge/SoC devices on the other hand. For HPC, accelerators are the better alternative for DL applications due to their superior bandwidth and performance.

To summarize: while MPI and GASPI support is mature for Multi Core and some Many Core architectures, most other accelerators (GPUs, FPGAs, etc.) are not tightly integrated in HPC workflows. In the scope of EPiGRAM-HS, we care especially about the integration into inter-node networks. The integration of novel accelerators in HPC network stacks can be done on various levels, depending on how tight such an integration is intended to be. Most accelerators (such as GPUs or FPGAs) are connected via PCI-Express, though a direct integration into a data-centre network, like Ethernet or InfiniBand, is possible. The steps towards an integration of accelerators in HPC workflows are described below (from loose to tight):

• Communication only possible via the host CPU, with a multi-level communication hierarchy at node-level.
• Zero-copy data transfers possible by exposing GPU and FPGA resources as memory.
• Modification of the network stack to program the NIC to directly transfer data using peer-to-peer transactions, with control of the communication still at the host (e.g. GPUDirect-RDMA technology for GPUs).
• Own identity (rank) on the interconnect, which allows the accelerator to initiate communication on its own (GPUs: dCUDA technology [dCUDA]).
• Creation of custom communication instances.

Currently the integration of GPUs is the most sophisticated compared to FPGAs or any other novel accelerators. For upcoming accelerators, the components allowing a tighter integration still need to be developed.

In the scope of the EPiGRAM-HS project, the gap analysis of MPI and GASPI will be used to guide our future work. Though not all identified gaps can and will be closed within the duration of the project, we will focus on the most important accelerators (GPUs and FPGAs) and the requirements of the project’s applications.


Contents

1 Introduction
  1.1 Glossary
2 Multi Core / Many Core
  2.1 Introduction to the hardware
  2.2 Instruction-Set Architecture Considerations
  2.3 Micro Architecture Implementations
  2.4 Gap analysis
3 Graphical Processor Units
  3.1 Introduction to the hardware
  3.2 Gap analysis
4 Field Programmable Gate-Arrays
  4.1 Introduction to the hardware
  4.2 FPGA usage as accelerators in data-center applications
  4.3 Integration of FPGAs and HPC Network Stacks
  4.4 Gap analysis
5 Beyond von Neumann
  5.1 Introduction to the hardware
  5.2 Gap analysis
6 Specific hardware for Deep Learning
  6.1 Introduction to the hardware
  6.2 Chips for Edge Devices
  6.3 Chips for Accelerator Devices
  6.4 Programming Devices
  6.5 Gap analysis
7 Conclusion and Future Work
8 References
  8.1 References for Multi / Many Core hardware
  8.2 References for GPUs
  8.3 References for FPGAs
  8.4 References for Beyond von-Neumann
  8.5 References used for specific hardware for DL
A. Distributed programming models

1 Introduction

Hardware support for exascale is widely expected to be accomplished by increased energy-friendly heterogeneity (i.e. specialized resources). Very often, programming models specialized for certain hardware1 are combined with programming models targeting inter-node communication. EPiGRAM-HS relies on the results concerning interoperability and composability of programming models developed in the scope of the INTERTWinE project2. In the scope of EPiGRAM-HS, we will focus our work on the inter-node programming models, which enable extreme scale computing.

Most high-performance codes are based on only two distributed programming models: message passing and partitioned global address space (PGAS). In EPiGRAM-HS, these models are represented by MPI (Message Passing Interface) and GASPI (Global Address Space Programming Interface). MPI is one of the oldest and most ubiquitous distributed programming models. Traditionally in message passing the sender and the receiver are present throughout the whole communication, i.e. MPI relies on two-sided and synchronized communication. Nowadays several communication concepts are provided in MPI. The GASPI communication standard is newer and naturally supports RDMA (Remote Direct Memory Access) capabilities of HPC hardware. It runs one-sided, asynchronous communication instructions on an address space partitioned into segments that can be either local or remote. The asynchronous approach allows an overlap of computation and communication. While there are many implementations of MPI, there is only one implementation of the GASPI standard, called GPI-2. A description of MPI and GASPI can be found in Appendix A.

The deliverable will focus on the following elements:

• multi / many core (x86, ARM, RISC-V)
• GPUs
• FPGAs
• Neuromorphic computing
• Specific hardware for Deep Learning

While Multi Core architectures are wide-spread and GPUs are nowadays part of the largest HPC systems, FPGAs, neuromorphic computing and some of the specific hardware for Deep Learning have not yet established themselves as part of HPC mainstream applications. This is due to the fact that novel hardware accelerators typically lack easy programmability, an efficient integration into inter-node networks and/or support for diverse HPC applications. Nonetheless it is important to evaluate novel hardware components for their future use in HPC systems. For FPGAs there are several EC-funded HPC projects (such as ExaNoDe3 or EuroExa4) evaluating FPGAs for HPC workflows. This is due to the rise of high-level synthesis (HLS) programming tools which ease the programmability of FPGAs. Neuromorphic computing and some of the specific hardware for Deep Learning are more difficult to integrate into HPC workflows; here our analysis is more speculative. Deep Learning is a new and rising field of computing which converges with HPC. The specific challenges of Deep Learning give rise to new hardware approaches. It is not yet clear which of these approaches will be successful or find a sustainable niche market. However, the potential impact of such accelerators is quite high, so it makes sense to monitor the market and build a strategy for their integration into HPC workflows. The same is true for neuromorphic computing, where we see the first large systems (such as SpiNNaker5) and commercial approaches (such as TrueNorth/SyNAPSE) appearing.

1 Programming models specialized for a certain type of hardware are e.g.: CUDA for GPUs, OpenCL for FPGAs, threading for multicore machines.
2 INTERTWinE project: https://www.intertwine-project.eu/

In each section, we describe the current hardware elements and give an outlook on future developments as announced by vendors. The following points have guided the HW description for each of the described accelerators:

1. Architecture description
2. Performance
3. Energy consumption
4. Network interface to distributed systems & bandwidth
5. Direct / host communication
6. Uptake in industry

At the end of each section describing a specific hardware compute element, the gaps in MPI and GASPI are analysed and described. The gap analysis has been guided by the following questions:

• Are distributed systems achievable (concerning the given latency, bandwidth, and other restrictions)?
• Is a direct communication or a host communication possible?
• Is the novel hardware already used with either MPI or GASPI?

3 ExaNoDe project: http://exanode.eu/
4 EuroExa project: https://euroexa.eu/
5 SpiNNaker project: http://apt.cs.manchester.ac.uk/projects/SpiNNaker

The gap analysis will be used in the EPiGRAM-HS project to close some of the identified gaps and to make sure that applications using MPI and GASPI can achieve extreme scalability.

1.1 Glossary

ASIC: Application-Specific Integrated Circuit (hardware)
CPU: Central Processing Unit (hardware)
DL: Deep Learning
DMA: Direct Memory Access
FPGA: Field Programmable Gate Array (hardware)
HBM: High Bandwidth Memory
HCA: Host Channel Adapter (hardware)
GASPI: Global Address Space Programming Interface (communication standard)
GPI-2: Global Address Space Programming Interface (implementation of GASPI)
GPGPU: General Purpose GPU (hardware)
GPU: Graphics Processing Unit (hardware)
HDL: Hardware Description Language (tool for FPGA programming)
HLS: High Level Synthesis (tool for FPGA programming)
HPC: High Performance Computing
IP: Intellectual Property
IP: Internet Protocol
ISA: Instruction Set Architecture
LUT: Lookup Table
MAC: Multiply-ACcumulate
MPI: Message Passing Interface (communication standard)
NIC: Network Interface Controller (hardware)
NVDLA: NVIDIA Deep Learning Accelerator
PCI: Peripheral Component Interconnect (hardware)
PGAS: Partitioned Global Address Space (communication concept)
RDMA: Remote Direct Memory Access
RTL: Register Transfer Level (methodology of FPGA programming)
SIMD: Single Instruction Multiple Data
SoC: System on Chip (hardware)


2 Multi Core / Many Core

2.1 Introduction to the hardware

The Multi Core architecture is still the default architecture in High Performance Computing, as can be easily gathered from the Top500 list (see Table 1). The term Multi Core refers to the fact that each CPU has several independent compute entities, often referred to as general purpose cores or general purpose CPUs. Multi Core CPUs are very flexible, versatile, easy to program and they deliver a decent level of performance. They are the all-rounder of HPC.

The term Many Core, on the other hand, usually describes a hardware architecture that comprises a high number of compute entities, higher than that of typical Multi Core CPUs. Each of these entities is typically less powerful than a common general purpose core. The power of this architecture stems from the aggregated compute power of many entities. The goal of Many Core architectures is to use resources like transistors, energy or cost in general in the most efficient way possible. The disadvantages are a higher degree of explicit parallelism that the programmer must handle and the necessity of an application to provide that high level of parallelism. The distinction between the Many Core and the general-purpose GPU architecture is fluid, because both follow the same motivations and technical principles, as explained in the following. From a technical point of view there is therefore no clear distinction between the two.

Processor                      | Share of Systems | Performance Share (Rmax)
Sunway SW26010 (Many Core)     | 0.2%             | 6.6%
Intel Xeon Phi (Many Core)     | 3.6%             | 15.6%
Intel Xeon (Multi Core)        | 91.6%            | 61.2%
Other Multi Core Processors    | 4.6%             | 16.6%

Table 1: Share of different multi core and many core systems in the Top500 list

The main point of Many Core versus Multi Core CPUs is efficiency. Efficiency is an important issue for supercomputers for two reasons. Firstly, the aggregated power consumption and investment costs are higher for bigger machines, so the operators of supercomputers are sensitive to efficiency. Secondly, supercomputers are often used by a few applications that each use a considerable part of the machine. The development costs of these applications are constant, while the benefit of efficiency rises with the amount of computational resources used. This means that the effort of sophisticated and manual optimizations that is necessary to get the efficiency out of Many Core CPUs is especially worthwhile for the big applications that are running on supercomputers.

Today’s programming languages still think in terms of threads which execute a well-defined sequence of commands. However, this sequence is only partially known at compile time and the commands have data dependencies that have to be met at runtime. The classical general purpose Multi Core CPU uses several sophisticated strategies to cope with these dependencies and to speed up the execution of a single thread. Among the most prominent is a multi-level cache hierarchy to improve latency and bandwidth of data accesses. The caches are supported by hardware prefetchers that reduce the latencies of memory accesses, assisted by so-called branch prediction that predicts upcoming commands in the threads. Out-of-order execution and highly superscalar cores are additional means to exploit implicit parallelism within the threads to speed up their execution. All these measures have in common that they aim primarily at a reduction of latencies of a single compute thread. They make the cores an all-rounder and are especially an advantage for applications that are difficult to parallelize, think for example of an ILU preconditioner. The downside is that they are expensive in terms of transistors and energy consumption.

So, the approach of a Many Core CPU is to downsize or omit these performance enhancers and use the saved budget to build even more cores. Since the Many Core does less automatically, the programmer must step in and do more manually: for example, using a local scratch pad memory instead of a transparent, coherent cache, or using software branching hints and software prefetching instead of relying on the respective hardware counterpart. As explained earlier, high throughput is cheaper to get than low latency, so the Many Core architecture concentrates on the former, accepting higher amounts of parallelism in return. One downside of this deal is that the application has to deliver enough parallel tasks that are data independent. What’s more, the overhead of communication and synchronization typically rises with the number of threads, so we have a trade-off here. On one hand the total compute power rises with more but simpler cores, but on the other hand the communication overhead rises too. As the communication overhead strongly depends on the specific algorithm, there are applications that clearly profit from a Many Core CPU while others do not. The other downside is that more often than not the programmer has to orchestrate the data flow much more explicitly than on a general purpose CPU. This leads to higher efforts in software optimization, more hardware-specific code, less readable code, more error-prone code and more extensive benchmarking and debugging.

More recent representatives are the Xeon Phi processors Knights Corner (KNC) and Knights Landing (KNL) from Intel. Their approach is a simpler core, like a typical Many Core, that is still quite easily programmable. It has smaller but nevertheless transparent caches and 16 GB of high bandwidth memory that can be programmed manually or used as a transparent L4 cache. In that sense it can be considered a Many Core “light”. However, the performance was not good enough to be a base for exascale clusters. So, like the well-performing but difficult-to-program Cell BE, the easy-to-program but not well-performing Xeon Phi was buried. The successor Knights Hill (KNH) has been cancelled, but some KNL clusters are still in the Top500 list (see Table 1).
Some components of the Xeon Phi architecture might be reused for future Intel processors, but currently there are no announcements for new Intel Many Core processors.

Some years ago, the Sunway SW26010 architecture was developed by the NRCPC. The CPU consists of four core groups (CG) connected by a network on chip (NoC). Each CG consists of a cluster of 64 (8x8) compute processing elements (CPE) connected by a mesh network, one management processing element (MPE) and one main memory controller accessing 8 GB of DDR3 memory. The MPE is a more complex core and ideal for managing the communication, as it supports interrupt functions and out-of-order execution. The CPE is a simpler core running only in user mode with no support for interrupt functions. The SIMD vector length is 256 bit, of which only 128 bits are used for single precision floating point operations. There is one compute pipeline per core with FMA, providing 8 FLOP per cycle in single or double precision. The CPE has a 16 kB L1 instruction cache and no L1 data cache. Instead it has a 64 kB software-controlled scratch pad memory (SPM). A DMA engine is used to transfer the data between the SPM and the main memory. Additionally, a register level communication (RLC) mechanism exists that allows direct communication between CPEs that are in the same row or in the same column of the mesh network. Currently the SW26010 is used by one system on the Top500 list. The system was ranked first place on the list in 2017 and still remains within the top 10 systems today. The availability of the processor is as unclear as its commercial success. The future product pipeline of its producer is also unclear.

Currently the European Union plans to create a new Many Core processor in the European Processor Initiative (EPI). Unfortunately, little information is publicly available on the processor’s specifications at the time of writing. According to a presentation by Jean-Marc Denis (ATOS) at the HPC User Forum in Detroit 2018, the EPI is moving from ARM to RISC-V in the next 5-10 years, as RISC-V is not mature enough at the moment. A first generation of the European processor is supposed to be developed by the end of 2020 or beginning of 2021, which can be used in one of Europe’s first exascale machines.
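Returning to the SW26010 figures above: the per-core numbers translate into chip-level peak performance via the usual relation

    R_peak = N_cores × (FLOP per cycle) × f_clock

As a back-of-the-envelope illustration only (the clock frequency is an assumption made here for the sake of the example, not a figure given in this section): one CG with 64 CPEs at 8 FLOP per cycle and an assumed clock of 1.5 GHz would reach 64 × 8 × 1.5 GHz ≈ 0.77 TFLOPS, i.e. roughly 3 TFLOPS for the four CGs of a chip, neglecting the MPE contribution.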


2.2 Instruction-Set Architecture Considerations

The Instruction Set Architecture (ISA) of a processor, either a core of a general-purpose CPU or of a GPGPU, is the interface between hardware and software. Both the programmer’s or compiler’s ability to efficiently describe a complex problem in terms of elementary instructions and the hardware design’s ability to carry out those instructions depend on the richness of the ISA. While decoupling techniques such as splitting instructions into several micro-operations or fusing several instructions into a single operation have worked efficiently for generic, scalar instructions, the desire to increase the amount of work described by a single instruction, and hence amortize the cost of fetching and decoding, has required extensions beyond the reach of these techniques. As a result, extensions for single instruction, multiple data (SIMD) vector operations, and lately also for simple matrix-multiplication operations, have been added to existing ISAs.

In this respect Intel has pursued an evolutionary development, adding new instructions to the ISA with new processor generations that expose the improved hardware capabilities. This has resulted in various SIMD instruction extensions for fixed-length vectors from 128 to 512 bits (MMX, SSE, AVX) as well as specific capabilities like the vector neural network instructions (VNNI) announced for future Intel processors [Intel 2018a]. The burden of efficiently utilizing these extensions is placed on software, which has to discover the level of support for each extension on a given processor as well as select appropriate implementations that make use of the discovered capabilities. Since optimal support may require tuning and maintaining different implementations for each ISA extension, many software packages bet instead on support from optimized libraries, e.g. Intel’s Math Kernel Library (MKL), or from compilers, or simply settle on a commonly supported subset and ignore further extensions.

A different approach was taken by ARM with their scalable vector extension (SVE) instruction set, which complements the existing 128-bit NEON SIMD extension [Stephens 2017a, Yoshida 2018a]. The SVE instructions do not operate on an architecturally fixed vector length; instead, support is provided to formulate programs in a vector-length agnostic way, as well as means for software to discover the currently used vector length. The upside of this design choice is that a single executable version of a procedure may suffice on a large number of different hardware implementations. The downside is that only algorithms that can efficiently be formulated without consideration of the vector length can utilize this strategy. Other algorithms, which may include specific optimizations for a certain fixed vector length, will still have to supply a number of different implementation choices and decide at run time which variant to use.
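To illustrate the vector-length agnostic style enabled by SVE, the sketch below uses the Arm C Language Extensions intrinsics (as defined in arm_sve.h); it processes an array without ever hard-coding the hardware vector width, so the same binary can run on implementations with different vector lengths:

#include <arm_sve.h>
#include <stdint.h>

/* Vector-length agnostic "saxpy" (y[i] += a * x[i]). svcntw() returns the
 * number of 32-bit lanes of the hardware at hand, and the predicate masks
 * off the loop tail, so no fixed vector width appears in the source. */
void saxpy_vla(float a, const float *x, float *y, int64_t n)
{
    for (int64_t i = 0; i < n; i += svcntw()) {
        svbool_t    pg = svwhilelt_b32_s64(i, n);  /* active lanes for this iteration */
        svfloat32_t vx = svld1_f32(pg, &x[i]);
        svfloat32_t vy = svld1_f32(pg, &y[i]);
        svst1_f32(pg, &y[i], svmla_f32_x(pg, vy, vx, svdup_n_f32(a)));
    }
}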

2.3 Micro Architecture Implementations

A wide range of different micro architectures and implementations that understand a common ISA can co-exist, often offering widely different levels of performance. Simple in-order designs like Intel’s Atom processors or ARM’s Cortex-A35 may be suitable for low-power applications, while complex out-of-order multi-core processors like Intel’s Xeon SP or Cavium’s ThunderX2 processors are used in server applications. The observation that the same software can execute on both low-power and high-performance implementations of the same ISA is the basis for ARM’s big.LITTLE technology, which combines a number of small, low-power cores together with larger, high-performance cores in a single multi-core SoC [ARM 2013a]. This technology is, however, only used in mobile devices as a way to boost energy-efficiency in standby operation, while server and HPC processors rely on high-performance CPUs in conjunction with external high-throughput accelerators like GPGPUs. For software, both approaches require different optimizations, and in the case of accelerators also different programs, to fully utilize the performance of such heterogeneous systems. Additional care may also be needed in the case of big.LITTLE-type approaches to prevent a thread of execution being executed on the wrong kind of core and causing massive load-imbalance in parallel HPC applications. This could require more advanced scheduling techniques that account for the application-wide impact of local thread placement. Furthermore, underlying communication and scheduling software may have to support information gathering, e.g. an MPI or GASPI implementation may have to gather statistics on waiting times, to allow a scheduler to improve its strategy.
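As a rough illustration of the kind of information gathering mentioned above, an application (or a PMPI-based profiling wrapper) could time its blocking receives and feed the accumulated waiting time to a scheduler. The helper below is purely hypothetical and only meant as a sketch of the idea:

#include <mpi.h>

/* Hypothetical helper (not part of MPI): accumulate the time a rank spends
 * waiting in blocking receives, e.g. as input for load-balancing or thread
 * placement decisions on big.LITTLE-style systems. */
static double total_wait_time = 0.0;

int timed_recv(void *buf, int count, MPI_Datatype type, int src, int tag,
               MPI_Comm comm, MPI_Status *status)
{
    double t0 = MPI_Wtime();
    int err = MPI_Recv(buf, count, type, src, tag, comm, status);
    total_wait_time += MPI_Wtime() - t0;   /* crude proxy for load imbalance */
    return err;
}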

2.4 Gap analysis

2.4.1 MPI

MPI is strongly process-oriented, which leads programmers to use a “flat MPI” model for Multi Core and Many Core. MPI does also have support for multi-threading, including a fully thread-safe mode, allowing hybrid models where MPI is combined with some other programming model that handles threads within each MPI process. The choice between these approaches is still more of an art than an engineering discipline. Partly this is because benchmarking the thread-safe mode of MPI against the other modes is often done by comparing non-concurrent and concurrent micro-benchmark codes, for example comparing a ping-pong benchmark using single-threaded MPI with the same benchmark using multi-threaded MPI. This test instructs MPI to protect itself against multiple threads accessing MPI concurrently during a benchmark that does not need this protection, and compares against a benchmark that does not contain similar protection in user code. This type of comparison tends to suggest a large overhead in the thread-safe MPI mode, but does not offset that against the equivalent thread synchronization in the user application for a concurrent code. A single-threaded application that uses the single-threaded MPI mode is highly likely to perform better than a multi-threaded application using the thread-safe MPI mode. However, a multi-threaded application code using the funnelled or serialized MPI modes is much less likely to perform better than when using the thread-safe MPI mode.

Comparisons between “flat MPI” (one MPI process per core) and hybrid MPI (one thread per core, with fewer MPI processes, e.g. one per NUMA region) mostly favour the “flat MPI” approach. This is likely to be due to the implementation overheads in the MPI library and in the threading runtime. The implementations of the thread-safe mode in major MPI libraries have been improving of late in response to demand from users for better support of hybrid programming. For example, MPICH is currently introducing internal endpoints to increase the achievable concurrency within MPI and thereby lower synchronization overheads. There is ongoing debate regarding the best way to expose this additional concurrency to programmers. Opinions range from no change to the current MPI interface, through thread-oriented interface changes (partitioned communication, also known as Finepoints), to process-oriented interface changes (user-visible, dynamic Endpoints).
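The thread-support level discussed above is chosen at initialization time; a minimal sketch (standard MPI, nothing implementation-specific) of requesting the fully thread-safe mode and detecting when the library grants less:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    /* Request the fully thread-safe mode; the library may grant a lower
     * level (e.g. FUNNELED or SERIALIZED), in which case a hybrid code
     * must restrict which threads are allowed to call MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "MPI_THREAD_MULTIPLE not granted (got %d)\n", provided);

    /* ... hybrid MPI + threads application ... */

    MPI_Finalize();
    return 0;
}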

2.4.2 Gap analysis for GASPI

The GASPI standard is well adapted to Multi Core and Many Core architectures because it is fully thread safe – it naturally supports hybrid SMP / distributed memory implementations. The predominant GASPI implementation, GPI-2, is quite lightweight. To build GPI-2, a full C compiler and/or an F90 Fortran compiler is needed. Additionally, an interface to the underlying network is needed. GPI-2 supports two kinds of interconnects: the InfiniBand network is connected via the IBVERBS library, and the standard Ethernet network is connected via the IP-stack. So, as long as a coherent memory exists and the dependencies stated above are fulfilled, GPI-2 should work more or less out of the box. This includes of course the standard Multi Core CPUs and some Many Core CPUs like e.g. the Intel Knights Landing.

If the conditions stated above are not fulfilled, which might be the case for some Many Core CPUs, the situation depends strongly on the specific hardware layout. Let us assume in the following that a coherent memory exists in principle but that the compute entities are not directly connected to it. Let us further assume that the compute entities have a smaller local scratch pad memory that is connected by a network on chip or a DMA engine to the coherent global memory. This would basically be the situation we have seen in previous Many Core architectures like the Cell BE or the Sunway SW26010 processor (see Section 2.1). Then two scenarios should be considered.

In the first scenario the network communication is managed on the level of the global memory. This could, for example, mean that a general purpose management core that fulfils all the requirements explained earlier exists, and executes a service thread. A read transfer would be handled as follows. An application core would notify the management core that a specific range of memory is needed. The management core would execute the specific GASPI routines to transfer data to the global memory. After arrival the application core would transfer the data from global memory to the scratch pad memory. In this scenario GPI-2 would work out of the box. The downside is that the programmer would have to manually orchestrate not only the transfers of the “last mile” to local scratch pad memory, but also the communication with the management core. This might be an efficient way to communicate, but it is also cumbersome and error prone.

In the second scenario the Many Core would directly use the GASPI routines. This is easier to handle for the programmer, but it would require a very specific and extensive adaptation of the GPI-2 library to the underlying hardware. As the communication inside the Many Core CPU is done automatically, it would be important to have a well-matched infrastructure that supports efficient memory transfers and synchronization primitives. Otherwise a considerable part of the communication efficiency that GASPI can usually provide would be lost on that “last mile”. For example, one important aspect would be the implementation of GASPI notifications. Notifications are guaranteed to not arrive before the respective payload. It would be challenging to transfer the notification to the application core without having to resort to polling mechanisms. A second challenge would be to maintain the order of data transfers and payload without significant delays. Whether this scenario makes sense clearly depends on the hardware details of the specific Many Core architecture.

In conclusion, GPI-2 and therefore GASPI is already perfectly adapted to Multi Core CPUs. In many cases, Many Core architectures should profit from GPI-2 without too many modifications. Depending on hardware specifications and specific use cases, more extensive adaptation might be necessary.
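To make the notification mechanism discussed above concrete, the sketch below shows the usual GPI-2 pattern of a one-sided write followed by a notification that the receiver waits on. Initialization (gaspi_proc_init) and segment creation (gaspi_segment_create) are omitted; segment ids, offsets and sizes are arbitrary illustrative values, and return codes are not checked:

#include <GASPI.h>

/* One-sided write with notification: GASPI guarantees that the notification
 * (id 0, value 1) does not become visible at the target before the payload. */
void send_block(gaspi_rank_t target)
{
    gaspi_write_notify(0, 0,            /* local segment id and offset   */
                       target, 0, 0,    /* remote rank, segment, offset  */
                       4096,            /* payload size in bytes         */
                       0, 1,            /* notification id and value     */
                       0, GASPI_BLOCK); /* queue and timeout             */
    gaspi_wait(0, GASPI_BLOCK);         /* wait until the queue is flushed */
}

void recv_block(void)
{
    gaspi_notification_id_t id;
    gaspi_notification_t    val;

    gaspi_notify_waitsome(0, 0, 1, &id, GASPI_BLOCK); /* wait for notification 0 */
    gaspi_notify_reset(0, id, &val);                  /* payload now resides in segment 0 */
}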


3 Graphical Processor Units

3.1 Introduction to the hardware

Graphics processing units (GPUs) are specialized processors designed in the early 90s to rapidly accelerate memory transactions (textures, device memory, framebuffers etc.) as well as 2D and 3D transformations. These hardware features were used in the beginning to lessen the work of the CPU and to speed up video and graphics workloads. Over the last 10 years, the hardwired GPU pipeline became more and more programmable, and the linear algebra units as well as the fast memory subsystem were utilized for general purpose tasks. This trend is still ongoing and general purpose computing on graphics processing units (GPGPU) is applied in many domains. Today GPUs are the de-facto standard when it comes to massively parallel computing. The leading systems in the Top500 list are using GPUs to accelerate floating point operations within numerical computation kernels. These high performance cluster applications currently offer the highest raw performance and performance per dollar. Due to the fact that GPUs are using a fine-grained parallel approach internally (thousands of GPU-threads), there are some drawbacks when it comes to real parallel applications with complex communication patterns, namely:

• the need to subdivide real-world problems into such small work units that all of the GPU cores can be held busy all the time with useful work;
• the need to keep the synchronization and communication overhead low for applications with complex communication patterns or workloads with small load-balancing units.

GPUs generally delegate more complex control tasks to the host system, and while it is possible to directly program at least certain InfiniBand NICs from a GPU, research shows that a hybrid approach with some involvement of the host system CPUs offers the best performance in today’s systems [dCUDA]. This dependency between GPU and host system leads to a multi-level communication hierarchy at node-level in current systems. Starting at the GPU, the communication has to pass the PCI- and system-bus as well as the network, and everything again in reverse at the remote side. Nvidia has tried to overcome this limitation with their proprietary NVLINK technology, but only IBM has adopted this communication interface within their latest Power CPUs today. Better scalability can be reached with NVLink-based systems like Nvidia’s DGX-2. Within these GPU sub-clusters with up to 16 GPUs, much more complex communication patterns are possible, but data transport between those sub-cluster nodes suffers from the same limitations as single GPU-to-GPU communications over networks like InfiniBand or Ethernet.

Besides the communication problems at the GPU level, the specialization of modern GPUs can be a huge burden for programmers. The latest generation of Nvidia’s GPUs has different specialized hardware units and is no longer focussed on single or double precision floating point performance alone. In addition to the graphics units, there are physics engines, CUDA cores, tensor cores, and ray tracing cores available. To get the maximum performance out of these GPUs, a programmer has to use and program several units in parallel. This can be a challenging task for most of the users in a multi-level cluster-like environment.

3.2 Gap analysis

3.2.1 MPI

Typically, MPI libraries support GPUs by permitting memory addresses that refer to GPU memory for the buffer arguments in communication API calls. This permits the programmer to specify the desired communication, e.g. from GPU memory on one node to GPU memory at another node. The implementation options available to MPI library developers include copying the message data via the host CPU or using hardware or system support for more direct routing. Technologies such as GPU-Direct and NCCL can be used when appropriate to implement data transfers and collective aggregation operations involving GPU memory via MPI communication semantics.

One possible improvement to this approach is to implement critical MPI functions using native GPU code so that those functions can be called within a GPU kernel. The current assumption is that the host CPU makes all MPI function calls, even when those calls involve only GPU memory and GPU-to-network hardware routing. Avoiding any involvement of the host CPU has several potential benefits:

• Frees the CPU to do other work, such as overlapped computation, communication, or I/O operations
• Removes the transfer of control (both GPU-to-CPU to call the MPI function and CPU-to-GPU to restart the GPU kernel code) from the critical path
• Eliminates the need to map portions of GPU memory into the virtual address space of the CPU

A further suggestion in this direction is to add new ranks so that GPUs are represented as different MPI processes, distinct from their host CPU(s). This is akin to the current practice of representing each CPU core with a different MPI process/rank (the “flat MPI” model) or representing each NUMA region with its own MPI process/rank (one popular variant of the “hybrid MPI” model). In this way, each GPU is treated as a distinct NUMA region, albeit with more computational power and heterogeneous internal hardware and connection infrastructure to the rest of the system. Support for heterogeneous MPI processes is at a nascent stage in MPI. Most MPI libraries recognize that some process-to-process connections are different to others, e.g. by implementing a hierarchy of shared-memory intra-node transports that are different to the network-fabric inter-node transport. However, there is still a basic assumption in most MPI implementations that all MPI processes are homogeneous in all other respects. The implications of breaking this assumption are not yet fully clear.
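For reference, the baseline discussed at the beginning of this section, in which the host CPU issues MPI calls on buffers that live in GPU memory, looks roughly as follows. A CUDA-aware MPI build is assumed; whether the transfer is staged through the host or routed peer-to-peer (GPU-Direct RDMA) is left to the library and the system:

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *d_buf;                                   /* buffer in GPU memory */
    cudaMalloc((void **)&d_buf, 1024 * sizeof(double));

    /* The host CPU drives the communication, but the buffer arguments
     * refer to device memory on both sides. */
    if (rank == 0)
        MPI_Send(d_buf, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}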

3.2.2 GASPI

Native accelerator support with GPI-2/GASPI is realized by mapping device memory into the global virtual address space of a parallel application. In the start-up phase the host will normally set up all data structures (ranks, memory regions, topology, etc.) needed to run a parallel job. As the memory for the data structures can reside on the device itself, efficient communication patterns can be triggered directly at device level (e.g. triggered by a CUDA thread). To enable hardware components from different vendors to interact with each other, extra and independent arbitrator kernel modules or OS-extensions must be installed. The most stable and best performing hardware combination today is GPU-Direct on InfiniBand. The following paragraph shows how this setup is integrated into GPI-2/GASPI today.

GPU-Direct enables peer-to-peer communication between different devices on subsystem buses like PCI-e or NVLINK. The performance of GPU-Direct RDMA depends on a lot of different factors (e.g. PCI-e configuration and settings) and does not reach the range of direct host-to-host RDMA via HCAs (Host Channel Adapters). Furthermore, the design and implementation of compute kernels (running on the accelerators) can influence the performance parameters significantly. The integration of peer-to-peer RDMA into GPI-2/GASPI is therefore based on (and restricted to) memory segments. Nevertheless, this clean and portable interface allows for any kind of synchronization and data transport between hosts and accelerators.

If a production system is used and third-party kernel modules and/or OS-extensions cannot be installed, Nvidia’s NCCL (NVIDIA, 2018) can be an alternative, albeit one that imposes a performance penalty. With this approach, all hardware-to-hardware interactions are resolved by forwarding these operations to the host.

4 Field Programmable Gate-Arrays

4.1 Introduction to the hardware

In contrast to the instruction-based approach utilized by CPUs and GPGPUs, where a complex problem is decomposed into a number of threads of elementary instructions, Field-Programmable Gate-Arrays (FPGAs) require the decomposition of the problem into a network of combinatorial logic and registers. The defining feature of this approach is that all of these logic elements work concurrently, and any necessary synchronization has to be created by the interconnection of these elements. The resulting design workflow is similar to the design of application specific integrated circuits (ASICs), and FPGAs are often used as runtime programmable substitutes for hard-wired ASICs, especially in low-volume applications or when design changes in already deployed products are anticipated.

To realize combinatorial logic elements, FPGAs store the truth-tables of the corresponding Boolean functions in small static RAM (SRAM) based lookup-tables (LUTs). To connect those LUTs with each other and with the registers that are used to implement sequential logic, pre-defined wires are used that are stitched together with the help of programmable switches, or multiplexors. In this way arbitrarily complex digital logic designs can be constructed, as long as they can be realized with the finite resources provided in a given FPGA chip or chips. Since using LUTs and programmable multiplexors is relatively expensive in terms of chip area and power when compared to hardwired ASIC-style implementations, modern FPGAs also contain a number of hardwired composite macro blocks to realize for instance larger RAM memories, or arithmetic functions like multipliers or adders. Since the hardware design of an FPGA is fixed at manufacturing time, an important aspect of designing logic for FPGAs is to formulate the design in such a way that it can be efficiently implemented with the provided resources. For example, if a design requires a 1100-word deep buffer, but the FPGA’s RAM blocks have a size of 1024 words, then 2 blocks will be needed to implement the buffer and about 46% of the possible capacity will be unused. In terms of hardware resources, large FPGAs from both Intel [Intel 2018a] and Xilinx [Xilinx 2018a] offer several millions of LUTs, a comparable number of registers, thousands of hardware multiplier units and several hundred million bits of on-chip SRAM memories. Both manufacturers have also announced FPGAs with several gigabytes of on-package high-bandwidth memory (HBM2).

The most widely used approach to design digital logic using FPGAs is the register-transfer-level (RTL) methodology, also popular for designing ASICs. In this method the function of the design is specified at the level of clock-cycles and described in a hardware description language (HDL) like VHDL or Verilog.

High-productivity approaches to RTL design like Chisel [Bachrach 2012a] attempt to automate most of the mechanical aspects of RTL design, while intellectual property (IP) blocks are used to share and reuse designs of whole sub-modules, perhaps between otherwise unrelated organizations. Finally, IP-compilers allow the generation of RTL descriptions for regular problems like checksum computations, or complex arithmetic functions like sine or cosine. In contrast, high-level synthesis (HLS) workflows automate the creation of RTL-level designs directly from timing-abstract behavioural or algorithmic-level descriptions. The input to these tools, offered by Intel, Xilinx and third parties, is often given in an imperative programming language like C, OpenCL, or Matlab. The biggest challenge in this respect is the fact that algorithms are often formulated in an inherently sequential style suitable for instruction-based processing, and the tools have to extensively use techniques like loop-unrolling to extract the required fine-grained concurrency.
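As a small illustration of the HLS input style mentioned above, the following C function could be fed to an HLS tool. The pragma spelling follows Xilinx Vivado/Vitis HLS and is an illustrative assumption rather than a portable standard; Intel's tools use their own, differently spelled directives:

/* HLS-style C kernel: element-wise multiply-accumulate. The pragma is a
 * tool directive, not ISO C; it asks the HLS tool to pipeline the loop so
 * that a new iteration starts every clock cycle, turning the sequential
 * loop into fine-grained hardware parallelism. */
void fpga_mac(const float *a, const float *b, float *out, int n)
{
    for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE II=1
        out[i] += a[i] * b[i];
    }
}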

4.2 FPGA usage as accelerators in data-center applications

Perhaps the biggest challenge to the usage of FPGAs in data-centres is the integration with other components, like servers and network devices, usually required in a data-centre application. Here a thin interface layer, in the form of a shell design in the FPGA and a corresponding software component consisting of the necessary drivers and libraries executing on the server(s), is often provided either by the hardware or tool manufacturers or sometimes by a specialized design team. Often HLS tools, especially those offered for FPGAs, have the ability to integrate with these predefined shells and can create the necessary interface code both on the software and hardware side, thus effectively hiding these intricate aspects of data-transfer.

In terms of hardware integration, the most straightforward option is to connect FPGAs directly via PCI-Express to the CPUs in a server, the same way as with GPUs. Support for this form is present in the HLS tools offered by both Intel and Xilinx, as well as by third-party vendors such as BittWare or Gidel. Direct integration into a data-centre network, like Ethernet, is also possible, and for instance fast key-value stores using this model are offered by Algologics. Amazon also offers F1 instances that use Xilinx FPGAs in this way.

A more ambitious approach is used by Microsoft in their Catapult projects. In the first version [Putnam 2014a] FPGAs were connected to servers by PCI-Express, and a direct FPGA-to-FPGA network was added to be able to build a distributed accelerator that utilizes the aggregate compute and memory capabilities of 48 FPGAs. The second version of the Catapult hardware [Caulfield 2016a] instead uses a hybrid approach where the FPGAs are not only connected to the server via PCIe, but additionally interposed in the Ethernet connection from the server to the top-of-rack switch. This strategy extends the possible reach of inter-FPGA communication to the whole data-centre and allows the FPGA to be used as a software-defined network accelerator. Finally, the Novo-G# project uses a cluster of FPGAs as an accelerator machine for molecular dynamics applications [George 2016a].

4.3 Integration of FPGAs and HPC Network Stacks

The most basic level of integration is simple co-existence of the network stack and the FPGA accelerator. This is by and large possible today, as it only requires that no conflicting dependencies, for instance on different versions of the same shared libraries, exist. In this model only the hosts participate in the communication. Any data that needs to be transferred between the network and the FPGA is first staged in a host buffer and copied in a second operation.

A step towards tighter integration would be to implement zero-copy data-transfers by suitably exposing FPGA resources, for instance as memory, and modifying the network stack to program the NIC to directly transfer data using peer-to-peer transactions. This model is similar to Nvidia’s GPU-Direct RDMA technology. Control of the communication is still handled by the host CPU and the FPGA cannot initiate data transfers on its own. Potential pitfalls would be the capabilities of the NIC and the rather limited capacity of the on-chip SRAM of the FPGA, which could require data-transfers to first be directed towards off-chip DRAM directly attached to the FPGA, thereby consuming valuable external memory bandwidth. A limited way to participate in flow-control between the NIC and the FPGA, or a model in which the FPGA actively pushes or pulls data from the NIC, could help mitigate this issue. So far, no known implementations of this technology for MPI or GASPI exist.

A next logical step would be to elevate the FPGA to become a first-class citizen with its own identity (rank) on the interconnect network. This would allow the accelerator to initiate communications on its own and also to directly utilize the high-speed network to reach other FPGA-based accelerators. It is likely that some assistance from the host side is necessary to implement this model. The approach is similar to the dCUDA technology.

Finally, semantic understanding of the HPC communication primitives by HLS tools could allow the synthesis of specialized hardware to create custom communication instances. One example would be that two co-located MPI ranks both participating in a distributed algorithm could implement communication between themselves in the form of a custom FIFO channel. In that way, it would be possible to use message-passing algorithms as a basis for FPGA accelerators. Within the time frame of this project, we consider such approaches unlikely due to the required extension of the HLS tools.

4.4 Gap analysis

4.4.1 MPI

To some extent, MPI libraries already make use of FPGAs. Where the FPGA is used to offload some of the software protocol into hardware, e.g. the point-to-point matching rules, then MPI can simply delegate its implementation to that FPGA hardware offload. This approach is seen in Portals 4 and the ConnectX series of products from Mellanox. No substantial change to MPI is necessary to achieve this type of usage of FPGAs.

Using FPGAs as accelerators poses the same problems to MPI as the usage of GPUs as accelerators. Specifically, MPI could be enabled to accept FPGA memory mapped into the virtual address space of the CPU, and critical MPI functions could be engineered into the FPGA so that it can act natively as an MPI process/rank without the involvement of a host CPU.

4.4.2 GASPI

Due to their flexibility, FPGAs can change from implementing hardware-like NIC devices to becoming special accelerators by replacing their configuration bitstream. Thus, they can be seen as “hardware-on-demand” for special purpose tasks. GASPI, on the other hand, needs a compute environment that is much too complex for a complete implementation on FPGAs. Besides high-level network protocols for the communication part, a lot of control and flow-management functionality must be developed. By splitting the total design into a pre-manufactured APU-Part and Memory-Part together with an FPGA-Part, a powerful one-sided communication subsystem can be implemented, as shown in Figure 1 and Figure 2 below.

Figure 1: Communication and memory subsystem APU part

Figure 2: Communication subsystem FPGA part

By using the AXI bus as a network on chip (NoC), different IPs (commercial and non-commercial) and varying hardware devices can be combined to form a complex system for one-sided communication in much shorter development times than before. To run GASPI on FPGAs in native mode, the following components and devices need to be implemented. Most of the development work will be done within the EuroExa project6.

• OS driver and extensions
• Firmware, BIOS, Devicetree
• Memory layout (DDR, HBM, BlockRAM, etc.)
• AXI interface (CCI)
• Several AXI interfaces
• BlockRAM FIFOs
• Self-programming DMA controllers
• Flow management and headers (routing/switching)
• Arbiter modules
• GBit transceivers and optical fibre links (LWL)
• Concurrent access management APU/FPGA
• HLS2AXI mappings: to trigger communications directly out of FPGA kernels

6 EuroExa project: https://euroexa.eu/

5 Beyond von Neumann

5.1 Introduction to the hardware

5.1.1 Neuromorphic architectures – SpiNNaker

The SpiNNaker platform [APT 2018] is a general purpose computing platform that has been designed to simulate biological neural networks. Spiking neural networks (SNNs) are the third generation of computational models designed to closely mimic the behaviour of a brain [Maass 1997], and within the computational neuroscience field SpiNNaker is therefore often referred to as a neuromorphic computing platform. SpiNNaker has performed well within this field – last year the team was able to demonstrate a microcortical column running on neuromorphic hardware for the first time, and they are now working on improving its performance. The SpiNNaker platform has evolved over 15 years from conception to the completion of the 1 million processor machine in November 2018.

In computational neuroscience, neurons within an SNN usually have approximately 10,000 individual incoming connections from other neurons and can have many more outgoing connections. To simulate these networks in real time, large amounts of parallelism are needed, as well as an efficient communication fabric to transmit the “I have spiked” messages. The SpiNNaker platform achieves this through a novel hardware architecture, consisting of a collection of SpiNNaker chips wired together in a torus mesh network. Each chip can connect to 6 local neighbours through a bespoke interconnect. SpiNNaker chips consume 1 W when operating at full capacity, and each consists of 18 ARM968 processor cores, which run at 200 MHz. Each processor has 32 KB of instruction memory (ITCM) and 64 KB of local RAM (DTCM). Each chip has 128 MB of synchronous DRAM (SDRAM), which is accessible to all processors on the chip. When processors wish to communicate, they send packets to the SpiNNaker router, which forwards them in a multicast fashion to the destinations. SpiNNaker only supports integer computations and has no floating point units (FPUs). This is a consequence of the desired energy budget, as FPUs consume more power. A detailed graphical representation of a SpiNNaker chip can be seen in Figure 3.

Figure 3: A SpiNNaker chip in detail.

Given that even simple biological neural networks have very many neurons, energy efficiency is a key feature of the SpiNNaker platform. A SpiNNaker board contains 48 chips and 3 FPGAs that support communications between boards, but can also be used to process incoming packets, for example to inject external device feeds into the network. The total power usage of a single board is just 60 W: 1 W per chip plus 12 W for everything else. In November 2018 a million processor machine, constructed by connecting together 1200 boards, was successfully booted. Operating at full load the entire machine uses only approximately 120 kW. The machine is pictured in Figure 4.

Figure 4: A one million core SpiNNaker machine.

It is in essence a large heterogeneous system, with three principal components – ARM CPUs, a network, and FPGAs. This is conceptually similar to already existing experimental supercomputer architectures [Akram 2018]. A specialist software stack has been created [SpiNNaker 2018] to support application development on the system, which requires the application to be represented as an acyclic graph. In the graph, nodes represent computation, and edges represent communication of data. This can be a challenging programming model for the application developer to work with.

5.1.2 Neuromorphic architectures – other

IBM’s TrueNorth [Merolla 2014] and Intel’s Loihi [Davies 2018] commercial neuromorphic platforms both take hardware approaches similar to SpiNNaker, but are set apart by their dedication to one specific application domain, namely neural networks. One could envisage their use as accelerators, but their more limited applicability makes them less attractive than SpiNNaker.

Alternative platforms such as BrainScaleS [BrainScaleS 2018] and the Reconfigurable On-Line Learning Spiking (ROLLS) processor [Qiao 2015] utilise more limited analogue computing models, making them even less suitable for integration with traditional HPC architectures.

5.2 Gap analysis

5.2.1 MPI

There is currently no MPI library implementation for any of the neuromorphic hardware architectures, or for any other “beyond von Neumann” architectures. There are no immediate problems with implementing the majority of MPI communication functionality on the SpiNNaker architecture. However, the fundamental hardware communication operation in SpiNNaker is broadcast rather than point-to-point send/receive, which will require a conceptual shift for MPI library developers.
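To make the conceptual shift concrete, the self-contained C sketch below emulates how point-to-point semantics could be layered on a broadcast-style transport: a packet carrying the destination rank is multicast, and receivers filter locally. All names (multicast_send, recv_filtered, the in-memory “bus”) are hypothetical stand-ins, not part of any MPI implementation or of the SpiNNaker software stack; a real implementation would programme the multicast routing tables instead of filtering at every destination.

```c
#include <stdio.h>

#define NRANKS 4

typedef struct { int dst; int tag; int payload; } packet_t;

static packet_t bus[16];   /* stands in for the multicast fabric */
static int      bus_len;

/* "Point-to-point send" realised as a multicast carrying the destination. */
static void multicast_send(int dst, int tag, int payload)
{
    bus[bus_len++] = (packet_t){ dst, tag, payload };
}

/* "Receive" on a given rank: scan the fabric, keep only matching packets. */
static int recv_filtered(int my_rank, int tag, int *out)
{
    for (int i = 0; i < bus_len; ++i)
        if (bus[i].dst == my_rank && bus[i].tag == tag) {
            *out = bus[i].payload;
            return 1;
        }
    return 0;
}

int main(void)
{
    multicast_send(2, 99, 42);              /* rank 0 "sends" 42 to rank 2 */
    for (int r = 0; r < NRANKS; ++r) {      /* every rank sees the fabric  */
        int v;
        if (recv_filtered(r, 99, &v))
            printf("rank %d received %d\n", r, v);
    }
    return 0;
}
```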

5.2.2 GASPI

There is currently no GASPI implementation for any of the neuromorphic hardware architectures, or for any other “beyond von Neumann” architectures. At the time of writing, such architectures have not yet reached the critical mass at which it would make sense to spend effort on a GASPI implementation.


6 Specific hardware for Deep Learning

6.1 Introduction to the hardware

There is high interest in deep learning (DL) applications across industry and service providers. Two main markets can be identified:

• Cloud services: applications in fields like network routing, querying and language translation;
• IoT and the edge: apps with high popular appeal, like augmented and virtual reality, face detection, voice recognition and autonomous driving, but also applications of interest to industry, such as smart sensors (e.g. process control).

Computation patterns in DL are highly homogeneous and are therefore good candidates for hardware acceleration. The earliest widely adopted accelerators are NVidia GPUs (CUDA, cuDNN, cuBLAS, etc.), which are nowadays the quasi-standard for DL acceleration. Although GPUs are very fast, they are not as power-efficient as they could be for the given operations. Moreover, the power consumption of regular GPUs is far too high for mobile and other edge devices. This has triggered the emergence of highly efficient hardware IP, which can be subdivided into the two categories above (i.e. server and edge). This survey gives an overview of novel hardware that targets DL acceleration and evaluates its usefulness for HPC systems.
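To illustrate why DL workloads map so well onto specialised hardware, the C sketch below shows the multiply-accumulate (MAC) pattern at the heart of a dense layer: the same operation is repeated over a regular grid with no data-dependent control flow. Sizes and values are arbitrary and the code is purely illustrative.

```c
#include <stdio.h>

#define IN  4
#define OUT 3

int main(void)
{
    float x[IN]      = { 1.f, 2.f, 3.f, 4.f };
    float w[OUT][IN] = { { .1f, .2f, .3f, .4f },
                         { .5f, .6f, .7f, .8f },
                         { .9f, 1.f, 1.1f, 1.2f } };
    float b[OUT]     = { 0.f, 0.f, 0.f };
    float y[OUT];

    for (int o = 0; o < OUT; ++o) {          /* every output neuron           */
        float acc = b[o];
        for (int i = 0; i < IN; ++i)         /* identical MAC for every input */
            acc += w[o][i] * x[i];
        y[o] = acc > 0.f ? acc : 0.f;        /* ReLU activation               */
    }

    for (int o = 0; o < OUT; ++o)
        printf("y[%d] = %f\n", o, y[o]);
    return 0;
}
```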

6.2 Chips for Edge Devices

Table 2 gives an overview of the latest emerging hardware for DL on the edge7. Typical for these devices are multiple heterogeneous CPU cores with different performance-power trade-offs, which minimizes energy consumption. RAM consumes a relatively high amount of energy; therefore, mobile devices use specialized low-power memory (LPDDR), which has lower power consumption at the cost of lower memory bandwidth. SoCs usually have external low-power DDR memory, which supports memory bandwidths on the order of 10 GB/s or less.

7 The tables given here report peak performance, i.e. the maximum theoretically obtainable number of operations per second (FLOPS, or OPS in the case of integer data). Peak performance is not the performance obtainable for a given task; usually 60%-90% of peak performance can be obtained for optimized code.


Hardware | Technology | Performance | Power | Cores | ML Accelerator
Qualcomm Vision Intelligence Platform [qualcomm] | 10nm | 2.1 TOPS | 1W | 8x Kryo CPU, Hexagon DSP, Adreno GPU | Hexagon VLIW processor (not dedicated for ML)
Kirin 980 [kirin] | 7nm | 8 TOPS | 5 TOPS/W | 4x ARM A76, 4x ARM A55, Mali G76 | Cambricon Neural Processing Unit (Cambricon 1M)
Apple A12 [appleA12] | 7nm | 5 TOPS | – | 2x Vortex CPU (HP), 4x Tempest CPU (LP), 4-core GPU, 8-core Neural Engine | Neural Engine
Bitmain Sophon BM1682 [bitmain] | 12nm | 3 TOPS | ~40W | 2x ARM, 1x MCU, 1x TPU | Neural Processing Unit
Intel Movidius Myriad X [movidius] | 16nm | 4 TOPS | <0.5W | 2x SPARC V8, 16x SHAVE processors | SHAVE processors
NVidia Xavier [xavier] | 12nm | 30 TOPS | 30W | 8x Carmel (ARM based), Volta Tensor Core GPU (512 cores) | NVidia Deep Learning Accelerator (NVDLA)
ARM ML Processor (Project Trillium) [trillium] | 7nm | 4.6 TOPS | >3 TOPS/W | IP core | Based on NVidia Deep Learning Accelerator

Table 2: Chips for edge devices

Since the main markets are mobile phones and smart cameras, these systems often have an additional video processing subsystem or Image Signal Processor (ISP). Lately, mobile devices are also used for resource-demanding video games, so an embedded GPU can also be found on several chips. Low energy consumption has the highest priority for mobile devices, which run on batteries; a small physical chip size is also important. These two factors mainly drive the design. The devices are Systems on Chip (SoC), which combine at least one general-purpose computation unit (a CPU) with several specialized subsystems, one of which can be a DL accelerator. All subsystems are integrated monolithically on a single die.

Another strategy for running DL algorithms on SoCs is not to introduce another DL subsystem, but to distribute the DL workload over the existing accelerators in a smart way. Qualcomm's chips, for example, do not have a dedicated DL accelerator; instead they use the Hexagon digital signal processor (DSP), which is a VLIW processor. For programmers, Qualcomm offers a Neural Processing SDK which distributes the workload of an application automatically.

NVidia have open-sourced their DL accelerator IP (NVDLA), making it available for any hardware developer to integrate into their chips. NVidia uses this IP in its own product, Xavier. One of the first big adopters is ARM, which is developing its ML Processor within Project Trillium using the NVidia Deep Learning Accelerator (NVDLA) IP.

The target application is inference, not training: the idea is to run pre-trained models on the embedded device. Model reduction techniques can be used to lower the memory footprint of the model parameters and the computational effort. DL subsystems for mobile phones are a novelty, and it seems that software developers struggle to make sensible use of them in applications that are not related to image processing.

From the HPC point of view, the power consumption and efficiency of SoC devices is very attractive. For example, within the 300W power budget of a single NVidia Volta GPU, which achieves 120 Teraflops, one could obtain a peak performance of over one Petaflop by using multiple Huawei Kirin 980 chips. However, the communication overhead of hundreds of SoC devices and the low off-chip bandwidth make this setup impractical. Memory bandwidth is crucial for DL computations, since the number of operations per byte of data is low. In conclusion, using SoCs for DL computations in an HPC setup is not very reasonable. The target market of SoCs is clearly not HPC, so there is little hope that their memory bandwidth will increase in the near future. It is not advisable to use SoCs for HPC systems.
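The bandwidth argument can be made concrete with a simple roofline-style estimate. The C sketch below is purely illustrative: the 10 GB/s bandwidth figure reflects the LPDDR discussion above, the 8 TOPS peak is taken from Table 2, and the arithmetic intensity of 25 operations per byte is an assumed example value for a DL kernel.

```c
#include <stdio.h>

/* Illustrative roofline-style estimate (all numbers are examples):
 * attainable = min(peak, memory bandwidth * arithmetic intensity). */
int main(void)
{
    double peak_tops      = 8.0;    /* peak of an edge SoC NPU, from Table 2        */
    double bandwidth_gbs  = 10.0;   /* off-chip LPDDR bandwidth, order of magnitude */
    double intensity_op_b = 25.0;   /* assumed operations per byte of a DL kernel   */

    double memory_bound_tops = bandwidth_gbs * intensity_op_b / 1000.0; /* GOPS -> TOPS */
    double attainable = memory_bound_tops < peak_tops ? memory_bound_tops : peak_tops;

    printf("memory-bound limit: %.2f TOPS of %.2f TOPS peak (%.0f%%)\n",
           attainable, peak_tops, 100.0 * attainable / peak_tops);
    return 0;
}
```

Under these example numbers, only about 0.25 TOPS of the 8 TOPS peak would be attainable, which illustrates why the low off-chip bandwidth dominates.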

D4.1: Report on state of the art of novel compute elements and gap analysis in MPI and GASPI 32

6.3 Chips for Accelerator Devices

Table 3 gives an overview of chips for accelerator devices.

Name | Technology | Performance | Power | Memory BW
Cambricon MLU-100 [cambricon] | 16nm | 166.4 TOPS | 110W | 102.4 GB/s
Baidu Kunlun | 14nm | 260 TOPS | >100W | 512 GB/s
Google TPU [googleTPU] | 16/12nm? | 90 TOPS per chip, 420 TOPS per board | 200W? per chip | 2.4 TB/s?
Intel Nervana Lake Crest [nervana] | 28nm | 38 TOPS actual | 210W | 400 GB/s off-chip
NVidia Volta V100 [volta] | 12nm | 120 TOPS | 300W | 900 GB/s
NVidia Turing T4 [turing] | 12nm | 8.1 TFLOPS, 260 TOPS INT4 | 70W | 320 GB/s
Graphcore Colossus [colossus] | 16nm | >200 TOPS | 300W | 384 GB/s off-chip, 90 TB/s on-chip
NEC SX-Aurora TSUBASA [tsubasa] | 16nm | 2.45 TFLOPS | <300W | 1.2 TB/s
Huawei Ascend 910/310 [ascend] | 7nm | 512 TOPS, 256 TFLOPS | 350W | –

Table 3: Chips for accelerator devices

A recent trend is that many service providers such as Facebook, Google, Alibaba and Baidu are designing their own ASICs for their data-centre workloads. There is a strong trend in China to become independent of western IP.

Cambricon, a Chinese startup with the aim of making China more independent of foreign IP, also designed the DL IP for Huawei's Kirin 980 SoC. Google's TPU is probably the most powerful AI chip used in production. The latest devices are of the third generation and few details have been published; the data in the table are a conjecture by nextplatform.com. There are several chips per board, resulting in 420 TFLOPS per board. A Google Edge TPU variant for edge applications is in development.

The architectures of these accelerators are quite similar: several cores, each able to calculate hundreds of MAC operations in parallel, coupled with a scheduling unit. Scheduling can be smart, for example with zero-gating (ARM Trillium), which skips multiplications by zero and thereby saves energy. There can also be additional units on the chip for functions other than MAC, for example the activation functions (e.g. the Programmable Layer Engine in Trillium).

GPUs reformulate DL operations into huge matrix-matrix operations and calculate those on several thousand processing cores in parallel. Usually, this requires the inputs and filters to be transformed into a layout which allows the GPU to calculate the operation efficiently. Especially for convolution operations, matrix entries need to be repeated several times in memory, which is not efficient. NVidia Volta has additional computation units (Tensor Cores), which calculate matrix-matrix operations faster, but not more efficiently, than in the way described above.

An entirely different approach is taken by Graphcore's Colossus. There is no global memory: the chip comprises over 2000 processor tiles with 256 KB of local memory each, 600 MB in total. Processing is done in a two-phase cycle, first local computation, then all-to-all communication between all the processor tiles with an aggregate bandwidth of 90 TB/s. There is no offloading to RAM, since all the information is kept in local memory. For some applications (LSTM inference), this architecture is over 100 times faster than a GPU. The big disadvantage is that the entire model needs to fit into 600 MB of local memory. Compared to SoC devices, the efficiency is worse, but the total performance is orders of magnitude higher.

The latest chips use HBM memory with very wide memory interfaces, allowing for bandwidths of several hundred GB/s. Most accelerators use some form of memory hierarchy (i.e. global vs. local memory) to access the data, which hides memory latencies. The superior memory bandwidth and performance make hardware accelerators clearly the better alternative for DL applications in large clusters.
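The memory duplication caused by the matrix reformulation of convolutions can be seen in a minimal im2col sketch. The C example below is purely illustrative (single channel, unit stride, no padding, arbitrary sizes); each input element appears in up to K*K columns of the resulting matrix, which is the repetition referred to above.

```c
#include <stdio.h>

/* Minimal im2col for a single-channel H x W input and a K x K filter,
 * unit stride, no padding (illustrative sizes only). Each column of the
 * output matrix holds one K*K patch, so input values are duplicated. */
#define H 4
#define W 4
#define K 3
#define OH (H - K + 1)
#define OW (W - K + 1)

int main(void)
{
    float in[H][W];
    float cols[K * K][OH * OW];      /* (K*K) x (number of output pixels) */

    for (int i = 0; i < H; ++i)
        for (int j = 0; j < W; ++j)
            in[i][j] = (float)(i * W + j);

    for (int oy = 0; oy < OH; ++oy)
        for (int ox = 0; ox < OW; ++ox)
            for (int ky = 0; ky < K; ++ky)
                for (int kx = 0; kx < K; ++kx)
                    cols[ky * K + kx][oy * OW + ox] = in[oy + ky][ox + kx];

    /* The convolution is now a (1 x K*K) * (K*K x OH*OW) matrix product. */
    printf("input elements: %d, im2col elements: %d\n",
           H * W, K * K * OH * OW);   /* 16 vs. 36: data is repeated */
    return 0;
}
```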

6.4 Programming Devices

For all accelerators, regardless of whether they target edge devices or servers, the manufacturer provides some kind of SDK which transforms the neural network model into a representation that can be understood by the accelerator. The common approach is to import a pre-trained model from a framework like TensorFlow or Caffe, either in its native format or in an open format like ONNX. A runtime engine on the device (like TensorFlow Lite for mobile devices) can then execute the transformed model. This tool flow is often possible for inference only. The accelerator devices are less accessible for training, since the manufacturers do not provide direct means to program them for this task. For training, many frameworks support NVidia GPUs, making them the de facto standard accelerators for DNN training. In the case of TensorFlow and TPUs, the support for the accelerator is provided by the framework itself: operations are offloaded to the accelerator and the communication is handled by the framework.
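As an illustration of such a tool flow, the sketch below runs inference on a converted model through the TensorFlow Lite C API. The model file name and the tensor sizes are placeholders, and whether a particular vendor SDK exposes exactly this API is an assumption; it is meant only to show the "import pre-trained model, then execute on the device" pattern described above.

```c
#include <stdio.h>
#include "tensorflow/lite/c/c_api.h"

/* Minimal inference sketch using the TensorFlow Lite C API. The model file
 * "model.tflite" and the input/output sizes are placeholders; error handling
 * is reduced to the bare minimum. */
int main(void)
{
    TfLiteModel *model = TfLiteModelCreateFromFile("model.tflite");
    if (!model) { fprintf(stderr, "could not load model\n"); return 1; }

    TfLiteInterpreterOptions *opts = TfLiteInterpreterOptionsCreate();
    TfLiteInterpreter *interp = TfLiteInterpreterCreate(model, opts);
    TfLiteInterpreterAllocateTensors(interp);

    static float input[224 * 224 * 3];                  /* placeholder input */
    TfLiteTensor *in = TfLiteInterpreterGetInputTensor(interp, 0);
    TfLiteTensorCopyFromBuffer(in, input, sizeof(input));

    TfLiteInterpreterInvoke(interp);                    /* run inference     */

    static float output[1000];                          /* placeholder size  */
    const TfLiteTensor *out = TfLiteInterpreterGetOutputTensor(interp, 0);
    TfLiteTensorCopyToBuffer(out, output, sizeof(output));
    printf("first output value: %f\n", output[0]);

    TfLiteInterpreterDelete(interp);
    TfLiteInterpreterOptionsDelete(opts);
    TfLiteModelDelete(model);
    return 0;
}
```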

6.5 Gap analysis

6.5.1 MPI

Novel edge and DL compute devices are not explicitly supported in MPI. However, the most likely route to incorporating such devices would be to view them in the same way as GPUs, i.e. either mediate MPI communication via a host CPU or produce native implementations of critical MPI functions that execute directly on the device.
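A minimal C sketch of the host-mediated option follows: data is staged through a host buffer before and after the MPI call. The copy_device_to_host/copy_host_to_device functions are hypothetical stand-ins for a vendor DMA routine and are emulated with memcpy so the example remains self-contained; the MPI calls themselves are standard.

```c
#include <mpi.h>
#include <string.h>
#include <stdio.h>

#define N 1024

static float device_buf[N];                 /* pretend this lives on the device */

/* Hypothetical stand-ins for a vendor DMA copy, emulated with memcpy. */
static void copy_device_to_host(float *dst, const float *src, int n)
{ memcpy(dst, src, (size_t)n * sizeof(float)); }

static void copy_host_to_device(float *dst, const float *src, int n)
{ memcpy(dst, src, (size_t)n * sizeof(float)); }

int main(int argc, char **argv)
{
    int rank, size;
    float host_buf[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0 && size > 1) {
        for (int i = 0; i < N; ++i) device_buf[i] = (float)i;
        copy_device_to_host(host_buf, device_buf, N);          /* stage out */
        MPI_Send(host_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(host_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        copy_host_to_device(device_buf, host_buf, N);          /* stage in  */
        printf("rank 1: received, device_buf[10] = %f\n", device_buf[10]);
    }

    MPI_Finalize();
    return 0;
}
```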

6.5.2 GASPI

Support for Edge Devices

No support for the novel specific DL hardware exists in GASPI/GPI-2. In the case of SoCs, GASPI support would be possible if a CPU core handled the GASPI communication. The generation and transfer of the messages would create computational overhead on the CPU, which also increases the latency of the messages. Additionally, the bandwidth to an SoC is much smaller than what one can expect of a PCIe card. It is very unlikely that future generations of SoCs will increase the communication bandwidth or include an RDMA-capable core, since they are aimed at the mobile market. Very poor scaling properties make SoCs a bad option for HPC environments. Thus, we do not think that the identified gap in GASPI/GPI-2 is relevant for HPC systems.

Support for Accelerator Devices

Support for accelerator devices in GASPI/GPI-2 would be similar to the support of GPUs. Some of the accelerators mentioned are actually GPUs (such as NVidia Volta and NVidia Turing).

When supporting new accelerators in GPI-2, one has to distinguish between the case where the RDMA-capable NIC can communicate directly with the accelerator memory and the case where it communicates indirectly via the host system. In the direct setup, the accelerator triggers its own transfers, and the memory of the accelerator becomes part of the available GASPI address space. In the indirect setup, the host system triggers transfers on behalf of the accelerator, but the memory of the accelerator is still mapped into the global address space, so no copies to host memory are required. The support of novel DL accelerator devices depends on the requirements and needs of the GASPI user community as well as on the accessibility of the novel hardware.
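A minimal sketch, assuming the GPI-2 C API, of the segment-based integration described above: here the segment is ordinary host memory and a one-sided write with notification is performed; in the direct setup the same segment would instead be backed by accelerator memory. Segment IDs, sizes and offsets are arbitrary example values.

```c
#include <GASPI.h>
#include <stdio.h>

int main(void)
{
    gaspi_rank_t rank, num;
    const gaspi_segment_id_t seg  = 0;
    const gaspi_size_t       size = 1 << 20;   /* 1 MiB example segment */

    gaspi_proc_init(GASPI_BLOCK);
    gaspi_proc_rank(&rank);
    gaspi_proc_num(&num);

    /* In the direct setup this memory would be accelerator memory. */
    gaspi_segment_create(seg, size, GASPI_GROUP_ALL, GASPI_BLOCK,
                         GASPI_MEM_INITIALIZED);

    if (num > 1 && rank == 0) {
        gaspi_pointer_t ptr;
        gaspi_segment_ptr(seg, &ptr);
        ((char *)ptr)[0] = 42;                 /* data to transfer */

        /* One-sided write of 1 byte to rank 1, followed by a notification. */
        gaspi_write_notify(seg, 0, 1, seg, 0, 1,
                           /* notification id */ 0, /* value */ 1,
                           /* queue */ 0, GASPI_BLOCK);
        gaspi_wait(0, GASPI_BLOCK);            /* local completion */
    } else if (rank == 1) {
        gaspi_notification_id_t id;
        gaspi_notification_t    val;
        gaspi_notify_waitsome(seg, 0, 1, &id, GASPI_BLOCK);
        gaspi_notify_reset(seg, id, &val);     /* remote completion observed */
        printf("rank 1: notified, value %u\n", (unsigned)val);
    }

    gaspi_proc_term(GASPI_BLOCK);
    return 0;
}
```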


7 Conclusion and Future Work

This deliverable has analysed the gaps in GASPI and MPI with respect to the novel hardware surveyed here, namely Multi and Many Core architectures, GPUs, FPGAs, neuromorphic computing, and specific hardware for Deep Learning. For MPI the following conclusions can be drawn about accelerator integration:

• Multi and Many Core architectures: MPI allows the choice between a “flat MPI” approach and multi-threading, including a thread-safe mode. There is an ongoing debate regarding the best way to expose concurrency to programmers.
• GPUs: MPI libraries permit memory addresses that refer to GPU memory as the data buffer arguments of communication API calls, either mediated by the host CPU or using hardware or system support for more direct routing. GPUDirect and NCCL can be used. Proposals exist for a tighter integration, similar to that of CPU cores.
• FPGAs: MPI implementations already offload some of the software protocol into hardware (e.g. in Portals 4 and ConnectX). Otherwise, using FPGAs poses the same issues for MPI as using GPUs as accelerators.
• Specific hardware for DL applications: Novel edge and DL compute devices are not explicitly supported by MPI. Support would be similar to that of GPUs.

EPiGRAM-HS will further investigate Finepoints vs. Endpoints and evaluate whether either is an improvement over flat MPI when applied to multi-core, many-core or GPU architectures. EPiGRAM-HS will, in collaboration with WP2, investigate using FPGAs to offload more MPI protocols. EPiGRAM-HS will follow the work of the University of Manchester as they attempt to implement MPI natively on the SpiNNaker architecture and will provide consultancy for that effort as needed. For GASPI the following conclusions can be drawn about accelerator integration:

• Multi and Many Core architectures: GPI-2 is thread-safe and therefore already well adapted to Multi Core CPUs. In many cases Many Core architectures should profit from GPI-2 without too many modifications. Depending on the hardware specifications of future compute elements, some adaptation might be necessary.
• GPUs: The integration of RDMA into GPI-2/GASPI is based on memory segments. A clean and portable interface allows for any kind of synchronization and data transport between hosts and accelerators.
• FPGAs: Plans to integrate GPI directly on FPGAs for point-to-point connections between FPGAs will be realised in the scope of the EuroExa project.
• Specific hardware for DL applications: Support for accelerator devices in GASPI/GPI-2 would be similar to the support of GPUs.

All in all, the integration of accelerators is quite advanced in GASPI/GPI-2, and no new developments for a tighter integration are expected within the scope of the EPiGRAM-HS project. EPiGRAM-HS will, however, profit from new developments of the GPI-2 FPGA support in the scope of the EuroExa project. Neuromorphic computing is not based on the familiar von Neumann computing architecture. Due to its disruptive nature, communication concepts such as GASPI or MPI will no longer hold. Neuromorphic computing will be very loosely coupled, if at all, to HPC clusters – most likely as a novel type of accelerator in the future, separating concerns between the more traditional HPC cluster and the neuromorphic compute clusters.

D4.1: Report on state of the art of novel compute elements and gap analysis in MPI and GASPI 38

8 References

8.1 References for Multi / Many Core hardware

[ARM 2013a] ARM, big.LITTLE Technology: The Future of Mobile, accessed December 2018, https://www.arm.com/files/pdf/big_LITTLE_Technology_the_Future_of_Mobile.pdf?_ga=2.165659071.1680287691.1544806408-1854080846.1544196127

[Intel 2018a] Intel, Intel Architecture Instruction Set Extensions and Future Features Programming Reference, 319433-035, October 2018, https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf

[Stephens 2017a] N. Stephens et al., "The ARM Scalable Vector Extension," in IEEE Micro, vol. 37, no. 2, pp. 26-39, Mar.-Apr. 2017. doi: 10.1109/MM.2017.35

[Yoshida 2018a] Toshio Yoshida, “Fujitsu High Performance CPU for the Post-K Computer”, Hot Chips 30, August 21, 2018

8.2 References for GPUs

[dCUDA] T. Gysi, J. Baer, T. Hoefler, “dCUDA: Hardware Supported Overlap of Computation and Communication”, 2016, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC16), Salt Lake City, Utah

8.3 References for FPGAs

[Bachrach 2012a] J. Bachrach et al., "Chisel: Constructing hardware in a Scala embedded language," DAC Design Automation Conference 2012, San Francisco, CA, 2012, pp. 1212-1221. doi: 10.1145/2228360.2228584

[Caulfield 2016a] A. M. Caulfield et al., "A cloud-scale acceleration architecture," 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Taipei, 2016, pp. 1-13. doi: 10.1109/MICRO.2016.7783710, http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7783710&isnumber=7783693


[George 2016a] A. D. George, M. C. Herbordt, H. Lam, A. G. Lawande, J. Sheng and C. Yang, "Novo-G#: Large-scale reconfigurable computing with direct and programmable interconnects," 2016 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, 2016, pp. 1-7. doi: 10.1109/HPEC.2016.7761639, http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7761639&isnumber=7761574

[Intel 2018a] Intel, Intel Stratix 10 GX/SX Product Family Overview Table, Gen-1023-1.6, accessed December 2018, https://www.intel.com/content/dam/altera-www/global/en_US/pdfs/literature/pt/stratix-10-product-table.pdf

[Nane 2016a] R. Nane et al., "A Survey and Evaluation of FPGA High-Level Synthesis Tools," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 35, no. 10, pp. 1591-1604, Oct. 2016, doi: 10.1109/TCAD.2015.2513673, http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7368920&isnumber=7563474

[Putnam 2014a] Andrew Putnam et al., A reconfigurable fabric for accelerating large-scale datacenter services. SIGARCH Comput. Archit. News 42, 3 (June 2014), 13-24. DOI: https://doi.org/10.1145/2678373.2665678

[Xilinx 2018a] Xilinx, UltraScale+ FPGAs Product Tables and Product Selection Guide, XMP103(v1.14), accessed December 2018, https://www.xilinx.com/support/documentation/selection-guides/ultrascale-plus-fpga-product-selection-guide.pdf

8.4 References for Beyond von-Neumann

[APT 2018] APT Advanced Processor Technologies Research Group, University of Manchester, SpiNNaker Home Page, accessed January 2019, http://apt.cs.manchester.ac.uk/projects/SpiNNaker/

[Maass 1997] W. Maass, “Networks of Spiking Neurons: The Third Generation of Neural Network Models”, Neural Networks, 10 (1997), pp. 1659-1671, https://doi.org/10.1016/S0893-6080(97)00011-7

[Akram 2018] W. Akram, T. Hussain, and E. Ayguade, “FPGA and ARM processor based supercomputing”, 2018 International Conference on Computing, Mathematics and Engineering Technologies (iCoMET), Sukkur, 2018, pp. 1-5, doi: 10.1109/ICOMET.2018.8346363

[SpiNNaker 2018] Software for SpiNNaker, accessed January 2019, http://spinnakermanchester.github.io/

[Merolla 2014] Merolla et al., “A million spiking-neuron integrated circuit with a scalable communication network and interface”, Science, 345 6197 (2014), pp. 668-673, doi: 10.1126/science.1254642

[Davies 2018] Davies et al., “Loihi: A Neuromorphic Manycore Processor with On-Chip Learning”, IEEE Micro, 38 1 (2018), pp. 82-89, doi: 10.1109/MM.2018.112130359

[BrainScaleS 2018] BrainScaleS – Neuromorphic processors, accessed January 2019, http://www.artificialbrains.com/brainscales

[Qiao 2015] Qiao et al., “A reconfigurable on-line learning spiking neuromorphic processor comprising 256 neurons and 128K synapses”, Frontiers in Neuroscience, 9 141 (2015), doi: 10.3389/fnins.2015.00141

8.5 References used for specific hardware for DL

[qualcomm] Information about Qualcomm Vision Intelligence Platform:
• https://www.qualcomm.com/products/snapdragon-850-mobile-compute-platform
• https://www.qualcomm.com/invention/research/projects/deep-learning (05.11.18)
• https://developer.qualcomm.com/docs/snpe/overview.html
• https://developer.qualcomm.com/software/hexagon-dsp-sdk/dsp-processor
• https://wccftech.com/qualcomms-snapdragon-855-mass-production-q4-and-feature-npu-features-specifications-launch/
• https://www.golem.de/news/qualcomm-snapdragon-855-wird-zum-snapdragon-8150-1808-136081.html
• https://www.qualcomm.com/products/platforms/consumer-electronics
• https://www.qualcomm.com/products/snapdragon-845-mobile-platform

[kirin] Information about Huawei Kirin:

• https://www.golem.de/news/huawei-kirin-980-7-nm-chip-hat-doppelte-ai-performance-1808-136322.html
• https://consumer.huawei.com/en/press/news/2018/huawei-launches-kirin-980-the-first-commercial-7nm-soc/
• https://fuse.wikichip.org/news/1297/cambricon-reaches-for-the-cloud-with-a-custom-ai-accelerator-talks-7nm-ips/
• https://www.anandtech.com/show/13298/hisilicon-announces-the-kirin-980-first-a76-g76-on-7nm

[appleA12] Apple A12:
• https://www.macwelt.de/a/was-wir-vom-a12-chip-in-kommenden-iphones-erwarten-koennen,3439556
• https://www.apple.com/de/iphone-xs/a12-bionic/
• https://www.anandtech.com/show/13392/the-iphone-xs-xs-max-review-unveiling-the-silicon-secrets/2
• https://medium.com/syncedreview/ai-chip-duel-apple-a12-bionic-vs-huawei-kirin-980-ec29cfe68632

[cambricon] Cambricon MLU100:
• https://fuse.wikichip.org/news/1297/cambricon-reaches-for-the-cloud-with-a-custom-ai-accelerator-talks-7nm-ips/
• https://www.golem.de/news/cambricon-mlu100-entwickler-von-huaweis-npu-bringt-ai-beschleuniger-1805-134605.html
• https://www.anandtech.com/show/12815/cambricon-makers-of-huaweis-kirin-npu-ip-build-a-big-ai-chip-and-pcie-card

[bitmain] Bitmain Sophon: • https://sophon.ai/ • https://en.wikichip.org/wiki/bitmain/sophon

[googleTPU] Information about Google TPU:
• https://cloud.google.com/tpu/docs/tpus
• https://cloud.google.com/edge-tpu/
• https://cloud.google.com/tpu/
• https://www.golem.de/news/machine-learning-google-bringt-mini-tpu-zur-modell-anwendung-1807-135707.html
• https://www.nextplatform.com/2018/05/10/tearing-apart-googles-tpu-3-0-ai-/
• https://arxiv.org/pdf/1704.04760.pdf

[nervana] Intel Nervana:

• https://www.golem.de/news/nervana-nnp-l1000-spring-crest-soll-dreifache-ai-leistung-aufweisen-1805-134549.html
• https://www.hpcwire.com/2018/05/24/intel-pledges-first-commercial-nervana-product-spring-crest-in-2019/
• https://newsroom.intel.com/editorials/artificial-intelligence-requires-holistic-approach/

[movidius] Information about Intel Movidius:
• https://www.movidius.com/myriad2
• https://www.movidius.com/myriadx
• https://www.golem.de/news/movidius-myriad-x--ai-chip-schafft-4-teraops-1708-129727.html
• https://www.anandtech.com/show/11771/intel-announces-movidius-myriad-x-vpu

[volta] Nvidia Tesla V100:
• https://images.nvidia.com/content/technologies/volta/pdf/tesla-volta-v100-datasheet-letter-fnl-web.pdf
• https://www.heise.de/newsticker/meldung/Tesla-V100-Nvidia-uebergibt-erste-Volta-Rechenkarten-an-Deep-Learning-Forscher-3781130.html

[turing] NVidia Turing T4:
• https://nvidianews.nvidia.com/news/nvidia-announces-record-adoption-of-new-turing-t4-cloud-gpu

• https://www.nvidia.com/en-us/data-center/tesla-t4/

• https://cloud.google.com/blog/products/compute/google-cloud-first-to-offer-nvidia-tesla-t4-gpus

[xavier] Information about NVidia Xavier:
• https://fuse.wikichip.org/news/1618/hot-chips-30-nvidia-xavier-soc/
• https://en.wikichip.org/wiki/nvidia/microarchitectures/nvdla
• https://www.forbes.com/sites/moorinsights/2018/08/24/nvidia-reveals-xavier-soc-details/

• http://nvdla.org/
• https://nvidianews.nvidia.com/news/nvidia-and-arm-partner-to-bring-deep-learning-to-billions-of-iot-devices
• https://wccftech.com/nvidia-drive-xavier-soc-detailed/

[trillium] Information about ARM Trillium:
• https://nvidianews.nvidia.com/news/nvidia-and-arm-partner-to-bring-deep-learning-to-billions-of-iot-devices

[colossus] Graphcore Colossus:
• https://www.graphcore.ai/
• http://www.eenewsanalog.com/news/graphcores-two-chip-colossus-close-launch

[tsubasa] NEC SX-Aurora TSUBASA:
• https://www.nec.com/en/global/solutions/hpc/sx/vector_engine.html
• https://www.nextplatform.com/2017/11/22/deep-dive-necs-aurora-vector-engine/

[ascend] Huawei Ascend 910/310:
• https://www.golem.de/news/ascend-910-310-huawei-ai-chip-soll-google-und-nvidia-schlagen-1810-137047.html
• https://mybroadband.co.za/news/technology/279227-huawei-ascend-910-ai-chip-unveiled-the-greatest-computing-density-on-a-single-chip.html


A. Distributed programming models

MPI

MPI has assumed a dominant and ubiquitous role in programming HPC systems for the last 25 years. It represents the distributed-memory message-passing programming model using a standardized library-based API. MPI enables high-performance message-based communication by providing two-sided, one-sided, and collective messaging functionality, process subsets with virtual topologies, user-defined datatypes, parallel file I/O, and an extensive introspection capability to support tools, such as debuggers and trace analyzers. Substantial effort has been invested in developing and maintaining open-source, widely available implementations, such as MPICH and Open MPI. In particular, significant work has focused on identifying and reducing or removing barriers to scalability.

GASPI

GASPI stands for Global Address Space Programming Interface and is a Partitioned Global Address Space (PGAS) API. It aims at extreme scalability, high flexibility and failure tolerance for parallel computing environments. GASPI aims to initiate a paradigm shift from bulk-synchronous two-sided communication patterns towards an asynchronous communication and execution model. To that end, GASPI leverages remote completion and one-sided, RDMA-driven communication in a Partitioned Global Address Space. The asynchronous communication allows computation and communication to be fully overlapped. The main design idea of GASPI is to have a lightweight API ensuring high performance, flexibility and failure tolerance. GPI-2 is an open-source implementation of the GASPI standard, freely available to application developers and researchers.