
HORIZON 2020 TOPIC FETHPC-02-2017 Transition to Exascale Computing

Exascale Programming Models for Heterogeneous Systems 801039

D4.1 Report on state of the art of novel compute elements and gap analysis in MPI and GASPI

WP4: Productive computing with FPGAs, GPUs and low-power

Date of preparation (latest version): [DATE]
Copyright © 2018-2021 The EPiGRAM-HS Consortium

The opinions of the authors expressed in this document do not necessarily reflect the official opinion of the EPiGRAM-HS partners nor of the European Commission.

DOCUMENT INFORMATION

Deliverable Number: D4.1
Deliverable Name: Report on state of the art of novel compute elements and gap analysis in MPI and GASPI
Due Date: 31/01/2019 (PM5)
Deliverable Lead: FhG
Authors: Martin Kühn, Dominik Loroch, Carsten Lojewski, Valeria Bartsch (Fraunhofer ITWM), Gilbert Netzer (KTH), Daniel Holmes (EPCC), Alan Stokes (EPCC), Oliver Brown (EPCC)
Responsible Author: Valeria Bartsch, Fraunhofer ITWM, [email protected]
Keywords: Novel compute, programming models
WP/Task: WP4 / Task 4.3
Nature: R
Dissemination Level: PU
Final Version Date: 31/01/2019
Reviewed by: Sergio Rivas-Gomez (KTH), Oliver Brown (EPCC), Luis Cebamanos (EPCC)
MGT Board Approval: YES


DOCUMENT HISTORY

Version | Partner | Date | Comment
0.1 | FhG | 04.12.2018 | First draft (Multi/Many core, GPU gap analysis, specific HW for DL)
0.2 | FhG, KTH | 19.12.2018 | Extension to Multi/Many core for ARM and RISC5, FPGA section, Introduction added
0.3 | FhG | 28.12.2018 | Executive Summary, Introduction into GPUs added
0.4 | FhG | 04.01.2019 | Glossary added, Conclusion added
0.5 | FhG | 09.01.2019 | MPI gap analysis added, table on GPUs added, GASPI gap analysis for FPGAs added
0.6 | EPCC | 09.01.2019 | Added 4.1 Introduction to Beyond V-N
0.7 | FhG, KTH, EPCC | 25.01.2019 | Implement changes to address comments by reviewers
0.8 | FhG, KTH | 28.01.2019 | Discussion of comments on BE and big.LITTLE, etc.
1.0 | KTH | 31.01.2019 | Final version


Executive Summary

The purpose of this document is to survey the state of the art of novel compute elements and to carry out a gap analysis in MPI and GASPI. Currently a major hurdle in large supercomputers is the huge amount of energy necessary to operate these systems. The power consumption and efficiency of some novel compute devices is very attractive, and therefore a rise in heterogeneous HPC clusters is expected in the future. For this deliverable we chose to analyse the following types of compute elements: Multi and Many Core architectures, Graphical Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs) and specialised hardware for Deep Learning (DL). The choice is motivated by the current market share, the uptake in HPC clusters and the impact that some of the novel compute elements would have on HPC systems.

The current hardware trends are the following:

• Multi core architectures are still used in many of the supercomputers in the Top500 list. However, the product pipeline of the main suppliers is currently uncertain. MPI and GASPI support for such architectures is mature.
• GPUs are becoming more important in HPC and are already predominant in DL applications. Part of the success of GPUs can be explained by the rise of the CUDA programming model, which delivers easy-to-use interfaces for application developers. The integration of GPUs as accelerators in the HPC network stack can be done on various levels. Basic MPI and GASPI functionalities exist; a tighter integration is possible, but the benefit is questionable.
• FPGAs are known to be difficult to program. This has been one of the biggest hurdles for the integration of FPGAs in HPC architectures. High Level Synthesis (HLS) tools are starting to become mature, making FPGAs easier to program. Currently there is no notable MPI or GASPI support for FPGAs. However, this situation is changing: for GASPI, for example, there are efforts dedicated to better support in the scope of the EuroExa project.
• Neuromorphic computing is well suited for neural networks and very energy-efficient. The first commercial neuromorphic platforms could be used as accelerators for a limited number of applications. There is no MPI or GASPI support foreseen for this type of compute element.
• Specific hardware for deep learning is on the rise. There is an emergence of highly efficient hardware IP on servers/accelerators on one hand and edge/SoC devices on the other hand. For HPC, accelerators are the better alternative for DL applications due to their superior bandwidth and performance.

To summarize: while MPI and GASPI support is mature for Multi Core and some Many Core architectures, most other accelerators (GPUs, FPGAs, etc.) are not tightly integrated in HPC workflows. In the scope of EPiGRAM-HS, we care especially about the integration into inter-node networks. The integration of novel accelerators in HPC network stacks can be done on various levels, depending on how tight such an integration is intended to be. Most accelerators (such as GPUs or FPGAs) are connected via PCI-Express, though a direct integration into a data-centre network, like Ethernet or InfiniBand, is possible. The steps towards an integration of accelerators in HPC workflows are described below (from loose to tight):

• Communication only possible via the host CPU, with a multi-level communication hierarchy at node-level.
• Zero-copy data transfers possible by exposing GPU and FPGA resources as memory.
• Modification of the network stack to program the NIC to directly transfer data using peer-to-peer transactions, with control of the communication still at the host (e.g. GPUDirect-RDMA technology for GPUs).
• Own identity (rank) on the interconnect, which allows the accelerator to initiate communication on its own (GPUs: dCUDA technology [dCUDA]).
• Creation of custom communication instances.

Currently the integration of GPUs is the most sophisticated compared to FPGAs or any other novel accelerators. For upcoming accelerators, the components allowing a tighter integration still need to be developed.

In the scope of the EPiGRAM-HS project, the gap analysis of MPI and GASPI will be used to guide our future work. Though not all identified gaps can and will be closed within the duration of the project, we will focus on the most important accelerators (GPUs and FPGAs) and the requirements of the project’s applications.


Contents

1 Introduction
  1.1 Glossary
2 Multi Core / Many Core
  2.1 Introduction to the hardware
  2.2 Instruction-Set Architecture Considerations
  2.3 Micro Architecture Implementations
  2.4 Gap analysis
3 Graphical Processor Units
  3.1 Introduction to the hardware
  3.2 Gap analysis
4 Field Programmable Gate-Arrays
  4.1 Introduction to the hardware
  4.2 FPGA usage as accelerators in data-center applications
  4.3 Integration of FPGAs and HPC Network Stacks
  4.4 Gap analysis
5 Beyond von Neumann
  5.1 Introduction to the hardware
  5.2 Gap analysis
6 Specific hardware for Deep Learning
  6.1 Introduction to the hardware
  6.2 Chips for Edge Devices
  6.3 Chips for Accelerator Devices
  6.4 Programming Devices
  6.5 Gap analysis
7 Conclusion and Future Work
8 References
  8.1 References for Multi / Many Core hardware
  8.2 References for GPUs
  8.3 References for FPGAs
  8.4 References for Beyond von-Neumann
  8.5 References used for specific hardware for DL
A. Distributed programming models

1 Introduction

Hardware support for exascale is widely expected to be accomplished by increased energy-friendly heterogeneity (i.e. specialized resources). Very often, programming models specialized for certain hardware1 are combined with programming models targeting inter-node communication. EPiGRAM-HS relies on the results concerning interoperability and composability of programming models developed in the scope of the INTERTWinE project2. In the scope of EPiGRAM-HS, we will focus our work on the inter-node programming models, which enable extreme scale computing.

Most high-performance codes are based on only two distributed programming models: message passing and partitioned global address space (PGAS). In EPiGRAM-HS, these models are represented by MPI (Message Passing Interface) and GASPI (Global Address Space Programming Interface). MPI is one of the oldest and most ubiquitous distributed programming models. Traditionally in message passing the sender and the receiver are present throughout the whole communication, i.e. MPI relies on two-sided and synchronized communication. Nowadays several communication concepts are provided in MPI. The GASPI communication standard is newer and naturally supports RDMA (Remote Direct Memory Access) capabilities of HPC hardware. It runs one-sided, asynchronous communication instructions on an address space partitioned into segments that can be either local or remote. The asynchronous approach allows an overlap of computation and communication. While there are many implementations of MPI, there is only one implementation of the GASPI standard, called GPI-2. A description of MPI and GASPI can be found in Appendix A.

The deliverable will focus on the following elements:

• multi / many core (x86, ARM, RISC-V)
• GPUs
• FPGAs
• Neuromorphic computing
• Specific hardware for Deep Learning

While Multi Core architectures are wide-spread and GPUs are nowadays part of the largest HPC systems, FPGAs, neuromorphic computing and some of the specific hardware for Deep Learning have not yet established themselves as part of HPC mainstream applications. This is due to the fact that novel hardware accelerators typically lack easy programmability, an efficient integration into inter-node networks and/or support for diverse HPC applications. Nonetheless it is important to evaluate novel hardware components for their future use in HPC systems. For FPGAs there are several EC-funded HPC projects (such as ExaNoDe3 or EuroExa4) evaluating FPGAs for HPC workflows. This is due to the rise of high-level synthesis (HLS) programming tools which ease the programmability of FPGAs. Neuromorphic computing and some of the specific hardware for Deep Learning are more difficult to integrate into HPC workflows; here our analysis is more speculative. Deep Learning is a new and rising field of computing which converges with HPC. The specific challenges of Deep Learning give rise to new hardware approaches. It is not yet clear which of these approaches will be successful or find a sustainable niche market. However, the potential impact of such accelerators is quite high, so it makes sense to monitor the market and build a strategy for their integration into HPC workflows. The same is true for neuromorphic computing, where we see the first large systems (such as SpiNNaker5) and commercial approaches (such as TrueNorth/SyNAPSE) appearing.

1 Programming models specialized for a certain type of hardware are e.g.: CUDA for GPUs, OpenCL for FPGAs, threading for multicore machines.
2 INTERTWinE project: https://www.intertwine-project.eu/

In each section, we describe the current hardware elements and give an outlook on future developments as announced by vendors. The following points have guided the HW description for each of the described accelerators:

1. Architecture description
2. Performance
3. Energy consumption
4. Network interface to distributed systems & bandwidth
5. Direct / host communication
6. Uptake in industry

At the end of each section describing a specific hardware compute element, the gaps in MPI and GASPI are analysed and described. The gap analysis has been guided by the following questions:

• Are distributed systems achievable (concerning the given latency, bandwidth, and other restrictions)?
• Is a direct communication or a host communication possible?
• Is the novel hardware already used with either MPI or GASPI?

3 ExaNoDe project: http://exanode.eu/
4 EuroExa project: https://euroexa.eu/
5 SpiNNaker project: http://apt.cs.manchester.ac.uk/projects/SpiNNaker

The gap analysis will be used in the EPiGRAM-HS project to close some of the identified gaps and to make sure that applications using MPI and GASPI can achieve extreme scalability.

1.1 Glossary

ASIC: Application-Specific Integrated Circuit (hardware)
CPU: Central Processing Unit (hardware)
DL: Deep Learning
DMA: Direct Memory Access
FPGA: Field Programmable Gate Array (hardware)
HBM: High Bandwidth Memory
HCA: Host Channel Adapter (hardware)
GASPI: Global Address Space Programming Interface (communication standard)
GPI-2: Global Address Space Programming Interface (implementation of GASPI)
GPGPU: General Purpose GPU (hardware)
GPU: Graphics Processing Unit (hardware)
HDL: Hardware Description Language (tool for FPGA programming)
HLS: High Level Synthesis (tool for FPGA programming)
HPC: High Performance Computing
IP: Intellectual Property
IP: Internet Protocol
ISA: Instruction Set Architecture
LUT: Lookup Table
MAC: Multiply-ACcumulate
MPI: Message Passing Interface (communication standard)
NIC: Network Interface Controller (hardware)
NVDLA: NVIDIA Deep Learning Accelerator
PCI: Peripheral Component Interconnect (hardware)
PGAS: Partitioned Global Address Space (communication concept)
RDMA: Remote Direct Memory Access
RTL: Register Transfer Level (methodology of FPGA programming)
SIMD: Single Instruction Multiple Data
SoC: System on Chip (hardware)


2 Multi Core / Many Core

2.1 Introduction to the hardware

The Multi Core architecture is still the default architecture in High Performance Computing, as can be easily gathered from the Top500 list (see Table 1). The term Multi Core refers to the fact that each CPU has several independent compute entities, often referred to as general purpose cores or general purpose CPUs. Multi Core CPUs are very flexible, versatile, easy to program and they deliver a decent level of performance. They are the all-rounder of HPC.

The term Many Core, on the other hand, usually describes a hardware architecture that comprises a high number of compute entities, higher than that of typical Multi Core CPUs. Each of these entities is typically less powerful than a common general purpose core. The power of this architecture stems from the aggregated compute power of many entities. The goal of Many Core architectures is to use resources like transistors, energy or cost in general in the most efficient way possible. The disadvantages are a higher degree of explicit parallelism that the programmer must handle and the necessity of an application to provide that high level of parallelism. The distinction between the Many Core and the general-purpose GPU architecture is fluid, because both follow the same motivations and technical principles, as explained in the following. From a technical point of view there is therefore no clear distinction between the two.

Processor                      | Share of Systems | Performance Share (Rmax)
Sunway SW26010 (Many Core)     | 0.2%             | 6.6%
Intel Xeon Phi (Many Core)     | 3.6%             | 15.6%
Intel Xeon (Multi Core)        | 91.6%            | 61.2%
Other Multi Core Processors    | 4.6%             | 16.6%

Table 1: Share of different multi core and many core systems in the Top500 list

The main point of Many Core versus Multi Core CPUs is efficiency. Efficiency is an important issue for supercomputers for two reasons. Firstly, the aggregated power consumption and investment costs are higher for bigger machines, so the operators of supercomputers are sensitive to efficiency. Secondly, supercomputers are often used by a few applications that each use a considerable part of the machine. The development costs of these applications are constant, while the benefit of efficiency rises with the amount of computational resources used. This means that the effort of sophisticated and manual optimizations that is necessary to get the efficiency out of Many Core CPUs is especially worthwhile for the big applications that are running on supercomputers.

Today’s programming languages still think in terms of threads which execute a well-defined sequence of commands. However, this sequence is only partially known at compile time and the commands have data dependencies that have to be met at runtime. The classical general purpose Multi Core CPU uses several sophisticated strategies to cope with these dependencies and to speed up the execution of a single thread. Among the most prominent is a multi-level cache hierarchy to improve latency and bandwidth of data accesses. The caches are supported by hardware prefetchers that reduce the latencies of memory accesses, assisted by so-called branch prediction that predicts upcoming commands in the threads. Out-of-order execution and highly superscalar cores are additional means to exploit implicit parallelism within the threads to speed up their execution. All these measures have in common that they aim primarily at a reduction of latencies of a single compute thread. They make the cores an all-rounder and are especially an advantage for applications that are difficult to parallelize, think for example of an ILU preconditioner. The downside is that they are expensive in terms of transistors and energy consumption.

So, the approach of a Many Core CPU is to downsize or omit these performance enhancers and use the saved budget to build even more cores. Since the Many Core does less automatically, the programmer must step in and do more manually: for example, using a local scratch pad memory instead of a transparent, coherent cache, or using software branching hints and software prefetching instead of relying on the respective hardware counterpart. As explained earlier, high throughput is cheaper to get than low latency, so the Many Core architecture concentrates on the former, accepting higher amounts of parallelism in return. One downside of this deal is that the application has to deliver enough parallel tasks that are data independent. What’s more, the overhead of communication and synchronization typically rises with the number of threads, so we have a trade-off here. On one hand the total compute power rises with more but simpler cores, but on the other hand the communication overhead rises too. As the communication overhead strongly depends on the specific algorithm, there are applications that clearly profit from a Many Core CPU while others do not. The other downside is that more often than not the programmer has to orchestrate the data flow much more explicitly than on a general purpose CPU. This leads to higher efforts in software optimization, more hardware-specific code, less readable code, more error-prone code and more extensive benchmarking and debugging.

More recent representatives are the Xeon Phi processors Knights Corner (KNC) and Knights Landing (KNL) from Intel. Their approach is a simpler core, like a typical Many Core, that is still quite easily programmable. It has smaller but nevertheless transparent caches and 16 GB of high bandwidth memory that can be programmed manually or used as a transparent L4 cache. In that sense it can be considered a Many Core “light”. However, the performance was not good enough to be a base for exascale clusters. So, like the well-performing but difficult-to-program Cell BE, the easy-to-program but not well-performing Xeon Phi was buried. The successor Knights Hill (KNH) has been cancelled, but some KNL clusters are still in the Top500 list (see Table 1).
Some components of the Xeon Phi architecture might be reused for future Intel processors, but currently there are no announcements for new Intel Many Core processors.

Some years ago, the Sunway SW26010 architecture was developed by the NRCPC. The CPU consists of four core groups (CG) connected by a network on chip (NoC). Each CG consists of a cluster of 64 (8x8) compute processing elements (CPE) connected by a mesh network, one management processing element (MPE) and one main memory controller accessing 8 GB of DDR3 memory. The MPE is a more complex core and ideal for managing the communication, as it supports interrupt functions and out-of-order execution. The CPE is a simpler core running only in user mode with no support for interrupt functions. The SIMD vector length is 256 bit, of which only 128 bits are used for single precision floating point operations. There is one compute pipeline per core with FMA, providing 8 FLOP per cycle in single or double precision. The CPE has a 16 kB L1 instruction cache and no L1 data cache. Instead it has a 64 kB software-controlled scratch pad memory (SPM). A DMA engine is used to transfer the data between the SPM and the main memory. Additionally, a register level communication (RLC) mechanism exists that allows direct communication between CPEs that are in the same row or in the same column of the mesh network. Currently the SW26010 is used by one system on the Top500 list. The system was ranked first place on the list in 2017 and still remains within the top 10 systems today. The availability of the processor is as unclear as its commercial success. The future product pipeline of its producer is also unclear.

Currently the European Union plans to create a new Many Core processor in the European Processor Initiative (EPI). Unfortunately, little information is publicly available on the processor’s specifications at the time of writing. According to a presentation by Jean-Marc Denis (ATOS) at the HPC User Forum in Detroit 2018, the EPI is moving from ARM to RISC-V in the next 5-10 years, as RISC-V is not mature enough at the moment. A first generation of the European processor is supposed to be developed by the end of 2020 or beginning of 2021, which can be used in one of Europe’s first exascale machines.
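Returning to the SW26010 figures above: the per-core numbers translate into chip-level peak performance via the usual relation

    R_peak = N_cores × (FLOP per cycle) × f_clock

As a back-of-the-envelope illustration only (the clock frequency is an assumption made here for the sake of the example, not a figure given in this section): one CG with 64 CPEs at 8 FLOP per cycle and an assumed clock of 1.5 GHz would reach 64 × 8 × 1.5 GHz ≈ 0.77 TFLOPS, i.e. roughly 3 TFLOPS for the four CGs of a chip, neglecting the MPE contribution.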


2.2 Instruction-Set Architecture Considerations

The Instruction Set Architecture (ISA) of a processor, either a core of a general-purpose CPU or of a GPGPU, is the interface between hardware and software. Both the programmer’s or compiler’s ability to efficiently describe a complex problem in terms of elementary instructions and the hardware design’s ability to carry out those instructions depend on the richness of the ISA. While decoupling techniques such as splitting instructions into several micro-operations or fusing several instructions into a single operation have worked efficiently for generic, scalar instructions, the desire to increase the amount of work described by a single instruction, and hence amortize the cost of fetching and decoding, has required extensions beyond the reach of these techniques. As a result, extensions for single instruction, multiple data (SIMD) vector operations, and lately also for simple matrix-multiplication operations, have been added to existing ISAs.

In this respect Intel has pursued an evolutionary development, adding new instructions to the ISA with new processor generations that expose the improved hardware capabilities. This has resulted in various SIMD instruction extensions for fixed-length vectors from 128 to 512 bits (MMX, SSE, AVX) as well as specific capabilities like the vector neural network instructions (VNNI) announced for future Intel processors [Intel 2018a]. The burden of efficiently utilizing these extensions is placed on software, which has to discover the level of support for each extension on a given processor as well as select appropriate implementations that make use of the discovered capabilities. Since optimal support may require tuning and maintaining different implementations for each ISA extension, many software packages bet instead on support from optimized libraries, e.g. Intel’s Math Kernel Library (MKL), or from compilers, or simply settle on a commonly supported subset and ignore further extensions.

A different approach was taken by ARM with their scalable vector extension (SVE) instruction set, which complements the existing 128-bit NEON SIMD extension [Stephens 2017a, Yoshida 2018a]. The SVE instructions do not operate on an architecturally fixed vector length; instead, support is provided to formulate programs in a vector-length agnostic way, as well as means for software to discover the currently used vector length. The upside of this design choice is that a single executable version of a procedure may suffice on a large number of different hardware implementations. The downside is that only algorithms that can efficiently be formulated without consideration of the vector length can utilize this strategy. Other algorithms, which may include specific optimizations for a certain fixed vector length, will still have to supply a number of different implementation choices and decide at run time which variant to use.
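To illustrate the vector-length agnostic style enabled by SVE, the sketch below uses the Arm C Language Extensions intrinsics (as defined in arm_sve.h); it processes an array without ever hard-coding the hardware vector width, so the same binary can run on implementations with different vector lengths:

#include <arm_sve.h>
#include <stdint.h>

/* Vector-length agnostic "saxpy" (y[i] += a * x[i]). svcntw() returns the
 * number of 32-bit lanes of the hardware at hand, and the predicate masks
 * off the loop tail, so no fixed vector width appears in the source. */
void saxpy_vla(float a, const float *x, float *y, int64_t n)
{
    for (int64_t i = 0; i < n; i += svcntw()) {
        svbool_t    pg = svwhilelt_b32_s64(i, n);  /* active lanes for this iteration */
        svfloat32_t vx = svld1_f32(pg, &x[i]);
        svfloat32_t vy = svld1_f32(pg, &y[i]);
        svst1_f32(pg, &y[i], svmla_f32_x(pg, vy, vx, svdup_n_f32(a)));
    }
}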

2.3 Micro Architecture Implementations

A wide range of different micro architectures and implementations that understand a common ISA can co-exist, often offering widely different levels of performance. Simple in-order designs like Intel’s Atom processors or ARM’s Cortex-A35 may be suitable for low-power applications, while complex out-of-order multi-core processors like Intel’s Xeon SP or Cavium’s ThunderX2 processors are used in server applications. The observation that the same software can execute on both low-power and high-performance implementations of the same ISA is the basis for ARM’s big.LITTLE technology, which combines a number of small, low-power cores together with larger, high-performance cores in a single multi-core SoC [ARM 2013a]. This technology is, however, only used in mobile devices as a way to boost energy-efficiency in standby operation, while server and HPC processors rely on high-performance CPUs in conjunction with external high-throughput accelerators like GPGPUs. For software, both approaches require different optimizations, and in the case of accelerators also different programs, to fully utilize the performance of such heterogeneous systems. Additional care may also be needed in the case of big.LITTLE-type approaches to prevent a thread of execution being executed on the wrong kind of core and causing massive load-imbalance in parallel HPC applications. This could require more advanced scheduling techniques that account for the application-wide impact of local thread placement. Furthermore, underlying communication and scheduling software may have to support information gathering, e.g. an MPI or GASPI implementation may have to gather statistics on waiting times, to allow a scheduler to improve its strategy.
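As a rough illustration of the kind of information gathering mentioned above, an application (or a PMPI-based profiling wrapper) could time its blocking receives and feed the accumulated waiting time to a scheduler. The helper below is purely hypothetical and only meant as a sketch of the idea:

#include <mpi.h>

/* Hypothetical helper (not part of MPI): accumulate the time a rank spends
 * waiting in blocking receives, e.g. as input for load-balancing or thread
 * placement decisions on big.LITTLE-style systems. */
static double total_wait_time = 0.0;

int timed_recv(void *buf, int count, MPI_Datatype type, int src, int tag,
               MPI_Comm comm, MPI_Status *status)
{
    double t0 = MPI_Wtime();
    int err = MPI_Recv(buf, count, type, src, tag, comm, status);
    total_wait_time += MPI_Wtime() - t0;   /* crude proxy for load imbalance */
    return err;
}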

2.4 Gap analysis

2.4.1 MPI

MPI is strongly process-oriented, which leads programmers to use a “flat MPI” model for Multi Core and Many Core. MPI does also have support for multi-threading, including a fully thread-safe mode, allowing hybrid models where MPI is combined with some other programming model that handles threads within each MPI process. The choice between these approaches is still more of an art than an engineering discipline. Partly this is because benchmarking the thread-safe mode of MPI against the other modes is often done by comparing non-concurrent and concurrent micro-benchmark codes, for example comparing a ping-pong benchmark using single-threaded MPI with the same benchmark using multi-threaded MPI. This test instructs MPI to protect itself against multiple threads accessing MPI concurrently during a benchmark that does not need this protection, and compares against a benchmark that does not contain similar protection in user code. This type of comparison tends to suggest a large overhead in the thread-safe MPI mode, but does not offset that against the equivalent thread synchronization in the user application for a concurrent code. A single-threaded application that uses the single-threaded MPI mode is highly likely to perform better than a multi-threaded application using the thread-safe MPI mode. However, a multi-threaded application code using the funnelled or serialized MPI modes is much less likely to perform better than when using the thread-safe MPI mode.

Comparisons between “flat MPI” (one MPI process per core) and hybrid MPI (one thread per core, with fewer MPI processes, e.g. one per NUMA region) mostly favour the “flat MPI” approach. This is likely to be due to the implementation overheads in the MPI library and in the threading runtime. The implementations of the thread-safe mode in major MPI libraries have been improving of late in response to demand from users for better support of hybrid programming. For example, MPICH is currently introducing internal endpoints to increase the achievable concurrency within MPI and thereby lower synchronization overheads. There is ongoing debate regarding the best way to expose this additional concurrency to programmers. Opinions range from no change to the current MPI interface, through thread-oriented interface changes (partitioned communication, also known as Finepoints), to process-oriented interface changes (user-visible, dynamic Endpoints).
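The thread-support level discussed above is chosen at initialization time; a minimal sketch (standard MPI, nothing implementation-specific) of requesting the fully thread-safe mode and detecting when the library grants less:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    /* Request the fully thread-safe mode; the library may grant a lower
     * level (e.g. FUNNELED or SERIALIZED), in which case a hybrid code
     * must restrict which threads are allowed to call MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "MPI_THREAD_MULTIPLE not granted (got %d)\n", provided);

    /* ... hybrid MPI + threads application ... */

    MPI_Finalize();
    return 0;
}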

2.4.2 Gap analysis for GASPI

The GASPI standard is well adapted to Multi Core and Many Core architectures because it is fully thread safe – it naturally supports hybrid SMP / distributed memory implementations. The predominant GASPI implementation, GPI-2, is quite lightweight. To build GPI-2, a full C compiler and/or an F90 Fortran compiler is needed. Additionally, an interface to the underlying network is needed. GPI-2 supports two kinds of interconnects: the InfiniBand network is connected via the IBVERBS library, and the standard Ethernet network is connected via the IP-stack. So, as long as a coherent memory exists and the dependencies stated above are fulfilled, GPI-2 should work more or less out of the box. This includes of course the standard Multi Core CPUs and some Many Core CPUs like e.g. the Intel Knights Landing.

If the conditions stated above are not fulfilled, which might be the case for some Many Core CPUs, the situation depends strongly on the specific hardware layout. Let us assume in the following that a coherent memory exists in principle but that the compute entities are not directly connected to it. Let us further assume that the compute entities have a smaller local scratch pad memory that is connected by a network on chip or a DMA engine to the coherent global memory. This would basically be the situation we have seen in previous Many Core architectures like the Cell BE or the Sunway SW26010 processor (see Section 2.1). Then two scenarios should be considered.

In the first scenario the network communication is managed on the level of the global memory. This could, for example, mean that a general purpose management core that fulfils all the requirements explained earlier exists, and executes a service thread. A read transfer would be handled as follows. An application core would notify the management core that a specific range of memory is needed. The management core would execute the specific GASPI routines to transfer data to the global memory. After arrival the application core would transfer the data from global memory to the scratch pad memory. In this scenario GPI-2 would work out of the box. The downside is that the programmer would have to manually orchestrate not only the transfers of the “last mile” to local scratch pad memory, but also the communication with the management core. This might be an efficient way to communicate, but it is also cumbersome and error prone.

In the second scenario the Many Core would directly use the GASPI routines. This is easier to handle for the programmer, but it would require a very specific and extensive adaptation of the GPI-2 library to the underlying hardware. As the communication inside the Many Core CPU is done automatically, it would be important to have a well-matched infrastructure that supports efficient memory transfers and synchronization primitives. Otherwise a considerable part of the communication efficiency that GASPI can usually provide would be lost on that “last mile”. For example, one important aspect would be the implementation of GASPI notifications. Notifications are guaranteed to not arrive before the respective payload. It would be challenging to transfer the notification to the application core without having to resort to polling mechanisms. A second challenge would be to maintain the order of data transfers and payload without significant delays. Whether this scenario makes sense clearly depends on the hardware details of the specific Many Core architecture.

In conclusion, GPI-2 and therefore GASPI is already perfectly adapted to Multi Core CPUs. In many cases, Many Core architectures should profit from GPI-2 without too many modifications. Depending on hardware specifications and specific use cases, more extensive adaptation might be necessary.
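To make the notification mechanism discussed above concrete, the sketch below shows the usual GPI-2 pattern of a one-sided write followed by a notification that the receiver waits on. Initialization (gaspi_proc_init) and segment creation (gaspi_segment_create) are omitted; segment ids, offsets and sizes are arbitrary illustrative values, and return codes are not checked:

#include <GASPI.h>

/* One-sided write with notification: GASPI guarantees that the notification
 * (id 0, value 1) does not become visible at the target before the payload. */
void send_block(gaspi_rank_t target)
{
    gaspi_write_notify(0, 0,            /* local segment id and offset   */
                       target, 0, 0,    /* remote rank, segment, offset  */
                       4096,            /* payload size in bytes         */
                       0, 1,            /* notification id and value     */
                       0, GASPI_BLOCK); /* queue and timeout             */
    gaspi_wait(0, GASPI_BLOCK);         /* wait until the queue is flushed */
}

void recv_block(void)
{
    gaspi_notification_id_t id;
    gaspi_notification_t    val;

    gaspi_notify_waitsome(0, 0, 1, &id, GASPI_BLOCK); /* wait for notification 0 */
    gaspi_notify_reset(0, id, &val);                  /* payload now resides in segment 0 */
}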


3 Graphical Processor Units

3.1 Introduction to the hardware

Graphics processing units (GPUs) are specialized processors designed in the early 90s to rapidly accelerate memory transactions (textures, device memory, framebuffers etc.) as well as 2D and 3D transformations. These hardware features were used in the beginning to lessen the work of the CPU and to speed up video and graphics workloads. Over the last 10 years, the hardwired GPU pipeline became more and more programmable, and the linear algebra units as well as the fast memory subsystem were utilized for general purpose tasks. This trend is still ongoing and general purpose computing on graphics processing units (GPGPU) is applied in many domains. Today GPUs are the de-facto standard when it comes to massively parallel computing. The leading systems in the Top500 list are using GPUs to accelerate floating point operations within numerical computation kernels. These high performance cluster applications currently offer the highest raw performance and performance per dollar. Due to the fact that GPUs are using a fine-grained parallel approach internally (thousands of GPU-threads), there are some drawbacks when it comes to real parallel applications with complex communication patterns, namely:

• the need to subdivide real-world problems into such small work units that all of the GPU cores can be held busy all the time with useful work;
• the need to keep the synchronization and communication overhead low for applications with complex communication patterns or workloads with small load-balancing units.

GPUs generally delegate more complex control tasks to the host system, and while it is possible to directly program at least certain InfiniBand NICs from a GPU, research shows that a hybrid approach with some involvement of the host system CPUs offers the best performance in today’s systems [dCUDA]. This dependency between GPU and host system leads to a multi-level communication hierarchy at node-level in current systems. Starting at the GPU, the communication has to pass the PCI- and system-bus as well as the network, and everything again in reverse at the remote side. Nvidia has tried to overcome this limitation with their proprietary NVLINK technology, but only IBM has adopted this communication interface within their latest Power CPUs today. Better scalability can be reached with NVLink-based systems like Nvidia’s DGX-2. Within these GPU sub-clusters with up to 16 GPUs, much more complex communication patterns are possible, but data transport between those sub-cluster nodes suffers from the same limitations as single GPU-to-GPU communications over networks like InfiniBand or Ethernet.

Besides the communication problems at the GPU level, the specialization of modern GPUs can be a huge burden for programmers. The latest generation of Nvidia’s GPUs has different specialized hardware units and is no longer focussed on single or double precision floating point performance alone. In addition to the graphics units, there are physics engines, CUDA cores, tensor cores, and ray tracing cores available. To get the maximum performance out of these GPUs, a programmer has to use and program several units in parallel. This can be a challenging task for most of the users in a multi-level cluster-like environment.

3.2 Gap analysis

3.2.1 MPI

Typically, MPI libraries support GPUs by permitting memory addresses that refer to GPU memory for the buffer arguments in communication API calls. This permits the programmer to specify the desired communication, e.g. from GPU memory on one node to GPU memory at another node. The implementation options available to MPI library developers include copying the message data via the host CPU or using hardware or system support for more direct routing. Technologies such as GPU-Direct and NCCL can be used when appropriate to implement data transfers and collective aggregation operations involving GPU memory via MPI communication semantics.

One possible improvement to this approach is to implement critical MPI functions using native GPU code so that those functions can be called within a GPU kernel. The current assumption is that the host CPU makes all MPI function calls, even when those calls involve only GPU memory and GPU-to-network hardware routing. Avoiding any involvement of the host CPU has several potential benefits:

• Frees the CPU to do other work, such as overlapped computation, communication, or I/O operations
• Removes the transfer of control (both GPU-to-CPU to call the MPI function and CPU-to-GPU to restart the GPU kernel code) from the critical path
• Eliminates the need to map portions of GPU memory into the virtual address space of the CPU

A further suggestion in this direction is to add new ranks so that GPUs are represented as different MPI processes, distinct from their host CPU(s). This is akin to the current practice of representing each CPU core with a different MPI process/rank (the “flat MPI” model) or representing each NUMA region with its own MPI process/rank (one popular variant of the “hybrid MPI” model). In this way, each GPU is treated as a distinct NUMA region, albeit with more computational power and heterogeneous internal hardware and connection infrastructure to the rest of the system. Support for heterogeneous MPI processes is at a nascent stage in MPI. Most MPI libraries recognize that some process-to-process connections are different to others, e.g. by implementing a hierarchy of shared-memory intra-node transports that are different to the network-fabric inter-node transport. However, there is still a basic assumption in most MPI implementations that all MPI processes are homogeneous in all other respects. The implications of breaking this assumption are not yet fully clear.
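For reference, the baseline discussed at the beginning of this section, in which the host CPU issues MPI calls on buffers that live in GPU memory, looks roughly as follows. A CUDA-aware MPI build is assumed; whether the transfer is staged through the host or routed peer-to-peer (GPU-Direct RDMA) is left to the library and the system:

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *d_buf;                                   /* buffer in GPU memory */
    cudaMalloc((void **)&d_buf, 1024 * sizeof(double));

    /* The host CPU drives the communication, but the buffer arguments
     * refer to device memory on both sides. */
    if (rank == 0)
        MPI_Send(d_buf, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}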

3.2.2 GASPI

Native accelerator support with GPI-2/GASPI is realized by mapping device memory into the global virtual address space of a parallel application. In the start-up phase the host will normally set up all data structures (ranks, memory regions, topology, etc.) needed to run a parallel job. As the memory for the data structures can reside on the device itself, efficient communication patterns can be triggered directly at device level (e.g. triggered by a CUDA thread). To enable hardware components from different vendors to interact with each other, extra and independent arbitrator kernel modules or OS-extensions must be installed. The most stable and best performing hardware combination today is GPU-Direct on InfiniBand. The following paragraph shows how this setup is integrated into GPI-2/GASPI today.

GPU-Direct enables peer-to-peer communication between different devices on subsystem buses like PCI-e or NVLINK. The performance of GPU-Direct RDMA depends on a lot of different factors (e.g. PCI-e configuration and settings) and does not reach the range of direct host-to-host RDMA via HCAs (Host Channel Adapters). Furthermore, the design and implementation of compute kernels (running on the accelerators) can influence the performance parameters significantly. The integration of peer-to-peer RDMA into GPI-2/GASPI is therefore based on (and restricted to) memory segments. Nevertheless, this clean and portable interface allows for any kind of synchronization and data transport between hosts and accelerators.

If a production system is used and third-party kernel modules and/or OS-extensions cannot be installed, Nvidia’s NCCL (NVIDIA, 2018) can be an alternative, albeit one that imposes a performance penalty. With this approach, all hardware-to-hardware interactions are resolved by forwarding these operations to the host.

4 Field Programmable Gate-Arrays

4.1 Introduction to the hardware

In contrast to the instruction-based approach utilized by CPUs and GPGPUs, where a complex problem is decomposed into a number of threads of elementary instructions, Field-Programmable Gate-Arrays (FPGAs) require the decomposition of the problem into a network of combinatorial logic and registers. The defining feature of this approach is that all of these logic elements work concurrently, and any necessary synchronization has to be created by the interconnection of these elements. The resulting design workflow is similar to the design of application specific integrated circuits (ASICs), and FPGAs are often used as runtime programmable substitutes for hard-wired ASICs, especially in low-volume applications or when design changes in already deployed products are anticipated.

To realize combinatorial logic elements, FPGAs store the truth-tables of the corresponding Boolean functions in small static RAM (SRAM) based lookup-tables (LUTs). To connect those LUTs with each other and with the registers that are used to implement sequential logic, pre-defined wires are used that are stitched together with the help of programmable switches, or multiplexors. In this way arbitrarily complex digital logic designs can be constructed, as long as they can be realized with the finite resources provided in a given FPGA chip or chips. Since using LUTs and programmable multiplexors is relatively expensive in terms of chip area and power when compared to hardwired ASIC-style implementations, modern FPGAs also contain a number of hardwired composite macro blocks to realize for instance larger RAM memories, or arithmetic functions like multipliers or adders. Since the hardware design of an FPGA is fixed at manufacturing time, an important aspect of designing logic for FPGAs is to formulate the design in such a way that it can be efficiently implemented with the provided resources. For example, if a design requires a 1100-word deep buffer, but the FPGA’s RAM blocks have a size of 1024 words, then 2 blocks will be needed to implement the buffer and about 46% of the possible capacity will be unused. In terms of hardware resources, large FPGAs from both Intel [Intel 2018a] and Xilinx [Xilinx 2018a] offer several millions of LUTs, a comparable number of registers, thousands of hardware multiplier units and several hundred million bits of on-chip SRAM memories. Both manufacturers have also announced FPGAs with several gigabytes of on-package high-bandwidth memory (HBM2).

The most widely used approach to design digital logic using FPGAs is the register-transfer-level (RTL) methodology, also popular for designing ASICs. In this method the function of the design is specified at the level of clock-cycles and described in a hardware description language (HDL) like VHDL or Verilog.

High-productivity approaches to RTL design like Chisel [Bachrach 2012a] attempt to automate most of the mechanical aspects of RTL design, while intellectual property (IP) blocks are used to share and reuse designs of whole sub-modules, perhaps between otherwise unrelated organizations. Finally, IP-compilers allow the generation of RTL descriptions for regular problems like checksum computations, or complex arithmetic functions like sine or cosine. In contrast, high-level synthesis (HLS) workflows automate the creation of RTL-level designs directly from timing-abstract behavioural or algorithmic-level descriptions. The input to these tools, offered by Intel, Xilinx and third parties, is often given in an imperative programming language like C, OpenCL, or Matlab. The biggest challenge in this respect is the fact that algorithms are often formulated in an inherently sequential style suitable for instruction-based processing, and the tools have to extensively use techniques like loop-unrolling to extract the required fine-grained concurrency.
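As a small illustration of the HLS input style mentioned above, the following C function could be fed to an HLS tool. The pragma spelling follows Xilinx Vivado/Vitis HLS and is an illustrative assumption rather than a portable standard; Intel's tools use their own, differently spelled directives:

/* HLS-style C kernel: element-wise multiply-accumulate. The pragma is a
 * tool directive, not ISO C; it asks the HLS tool to pipeline the loop so
 * that a new iteration starts every clock cycle, turning the sequential
 * loop into fine-grained hardware parallelism. */
void fpga_mac(const float *a, const float *b, float *out, int n)
{
    for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE II=1
        out[i] += a[i] * b[i];
    }
}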

4.2 FPGA usage as accelerators in data-center applications

Perhaps the biggest challenge to the usage of FPGAs in data-centres is the integration with other components, like servers and network devices, usually required in a data-centre application. Here a thin interface layer, in the form of a shell design in the FPGA and a corresponding software component consisting of the necessary drivers and libraries executing on the server(s), is often provided either by the hardware or tool manufacturers or sometimes by a specialized design team. Often HLS tools, especially those offered for FPGAs, have the ability to integrate with these predefined shells and can create the necessary interface code both on the software and hardware side, thus effectively hiding these intricate aspects of data-transfer.

In terms of hardware integration, the most straightforward option is to connect FPGAs directly via PCI-Express to the CPUs in a server, the same way as with GPUs. Support for this form is present in the HLS tools offered by both Intel and Xilinx, as well as by third-party vendors such as BittWare or Gidel. Direct integration into a data-centre network, like Ethernet, is also possible, and for instance fast key-value stores using this model are offered by Algologics. Amazon also offers F1 instances that use Xilinx FPGAs in this way.

A more ambitious approach is used by Microsoft in their Catapult projects. In the first version [Putnam 2014a] FPGAs were connected to servers by PCI-Express, and a direct FPGA-to-FPGA network was added to be able to build a distributed accelerator that utilizes the aggregate compute and memory capabilities of 48 FPGAs. The second version of the Catapult hardware [Caulfield 2016a] instead uses a hybrid approach where the FPGAs are not only connected to the server via PCIe, but additionally interposed in the Ethernet connection from the server to the top-of-rack switch. This strategy extends the possible reach of inter-FPGA communication to the whole data-centre and allows the FPGA to be used as a software-defined network accelerator. Finally, the Novo-G# project uses a cluster of FPGAs as an accelerator machine for molecular dynamics applications [George 2016a].

4.3 Integration of FPGAs and HPC Network Stacks

The most basic level of integration is simple co-existence of the network stack and the FPGA accelerator. This is by and large possible today, as it only requires that no conflicting dependencies, for instance on different versions of the same shared libraries, exist. In this model only the hosts participate in the communication. Any data that needs to be transferred between the network and the FPGA is first staged in a host buffer and copied in a second operation.

A step towards tighter integration would be to implement zero-copy data-transfers by suitably exposing FPGA resources, for instance as memory, and modifying the network stack to program the NIC to directly transfer data using peer-to-peer transactions. This model is similar to Nvidia’s GPU-Direct RDMA technology. Control of the communication is still handled by the host CPU and the FPGA cannot initiate data transfers on its own. Potential pitfalls would be the capabilities of the NIC and the rather limited capacity of the on-chip SRAM of the FPGA, which could require data-transfers to first be directed towards off-chip DRAM directly attached to the FPGA, thereby consuming valuable external memory bandwidth. A limited way to participate in flow-control between the NIC and the FPGA, or a model in which the FPGA actively pushes or pulls data from the NIC, could help mitigate this issue. So far, no known implementations of this technology for MPI or GASPI exist.

A next logical step would be to elevate the FPGA to become a first-class citizen with its own identity (rank) on the interconnect network. This would allow the accelerator to initiate communications on its own and also to directly utilize the high-speed network to reach other FPGA-based accelerators. It is likely that some assistance from the host side is necessary to implement this model. The approach is similar to the dCUDA technology.

Finally, semantic understanding of the HPC communication primitives by HLS tools could allow the synthesis of specialized hardware to create custom communication instances. One example would be that two co-located MPI ranks both participating in a distributed algorithm could implement communication between themselves in the form of a custom FIFO channel. In that way, it would be possible to use message-passing algorithms as a basis for FPGA accelerators. Within the time frame of this project, we consider such approaches unlikely due to the required extension of the HLS tools.

4.4 Gap analysis

4.4.1 MPI

To some extent, MPI libraries already make use of FPGAs. Where the FPGA is used to offload some of the software protocol into hardware, e.g. the point-to-point matching rules, then MPI can simply delegate its implementation to that FPGA hardware offload. This approach is seen in Portals 4 and the ConnectX series of products from Mellanox. No substantial change to MPI is necessary to achieve this type of usage of FPGAs.

Using FPGAs as accelerators poses the same problems to MPI as the usage of GPUs as accelerators. Specifically, MPI could be enabled to accept FPGA memory mapped into the virtual address space of the CPU, and critical MPI functions could be engineered into the FPGA so that it can act natively as an MPI process/rank without the involvement of a host CPU.

4.4.2 GASPI

Due to their flexibility, FPGAs can change from implementing hardware-like NIC devices to becoming special accelerators by replacing their configuration bitstream. Thus, they can be seen as “hardware-on-demand” for special purpose tasks. GASPI, on the other hand, needs a compute environment that is much too complex for a complete implementation on FPGAs. Besides high-level network protocols for the communication part, a lot of control and flow-management functionality must be developed. By splitting the total design into a pre-manufactured APU-Part and Memory-Part together with an FPGA-Part, a powerful one-sided communication subsystem can be implemented, as shown in Figure 1 and Figure 2 below.

Figure 1: Communication and memory subsystem APU part

Figure 2: Communication subsystem FPGA part

By using the AXI bus as a network on chip (NoC), different IPs (commercial and non-commercial) and varying hardware devices can be combined to form a complex system for one-sided communication in much shorter development times than before. To run GASPI on FPGAs in native mode, the following components and devices need to be implemented. Most of the development work will be done within the EuroExa project6.

• OS driver and extensions
• Firmware, BIOS, Devicetree
• Memory layout (DDR, HBM, BlockRAM, etc.)
• AXI interface (CCI)
• Several AXI interfaces
• BlockRAM FIFOs
• Self-programming DMA controllers
• Flow management and headers (routing/switching)
• Arbiter modules
• GBit transceivers and optical fibre links (LWL)
• Concurrent access management APU/FPGA
• HLS2AXI mappings: to trigger communications directly out of FPGA kernels

6 EuroExa project: https://euroexa.eu/

5 Beyond von Neumann

5.1 Introduction to the hardware

5.1.1 Neuromorphic architectures – SpiNNaker

The SpiNNaker platform [APT 2018] is a general purpose computing platform that has been designed to simulate biological neural networks. Spiking neural networks (SNNs) are the third generation of computational models designed to closely mimic the behaviour of a brain [Maass 1997], and within the computational neuroscience field SpiNNaker is therefore often referred to as a neuromorphic computing platform. SpiNNaker has performed well within this field – last year the team was able to demonstrate a microcortical column running on neuromorphic hardware for the first time, and they are now working on improving its performance. The SpiNNaker platform has evolved over 15 years from conception to the completion of the 1 million processor machine in November 2018.

In computational neuroscience, neurons within an SNN usually have approximately 10,000 individual incoming connections from other neurons and can have many more outgoing connections. To simulate these networks in real time, large amounts of parallelism are needed, as well as an efficient communication fabric to transmit the “I have spiked” messages. The SpiNNaker platform achieves this through a novel hardware architecture, consisting of a collection of SpiNNaker chips wired together in a torus mesh network. Each chip can connect to 6 local neighbours through a bespoke interconnect. SpiNNaker chips consume 1 W when operating at full capacity, and each consists of 18 ARM968 processor cores, which run at 200 MHz. Each processor has 32 KB of instruction memory (ITCM) and 64 KB of local RAM (DTCM). Each chip has 128 MB of synchronous DRAM (SDRAM), which is accessible to all processors on the chip. When processors wish to communicate, they send packets to the SpiNNaker router, which forwards them in a multicast fashion to the destinations. SpiNNaker only supports integer computations and has no floating point units (FPUs). This is a consequence of the desired energy budget, as FPUs consume more power. A detailed graphical representation of a SpiNNaker chip can be seen in Figure 3.

Figure 3: A SpiNNaker chip in detail.

Given that even simple biological neural networks have very many neurons, energy efficiency is a key feature of the SpiNNaker platform. A SpiNNaker board contains 48 chips and 3 FPGAs that support communications between boards, but can also be used to process incoming packets, for example to inject external device feeds into the network. The total power usage of a single board is just 60 W: 1 W per chip plus 12 W for everything else. In November 2018 a million processor machine, constructed by connecting together 1200 boards, was successfully booted. Operating at full load the entire machine uses only approximately 120 kW. The machine is pictured in Figure 4.

Figure 4: A one million core SpiNNaker machine.

It is in essence a large heterogeneous system, with three principal components – ARM CPUs, a network, and FPGAs. This is conceptually similar to already existing experimental supercomputer architectures [Akram 2018]. A specialist software stack has been created [SpiNNaker 2018] to support application development on the system, which requires the application to be represented as an acyclic graph. In the graph, nodes represent computation, and edges represent communication of data. This can be a challenging programming model for the application developer to work with.

5.1.2 Neuromorphic architectures – other

IBM’s TrueNorth [Merolla 2014] and Intel’s Loihi [Davies 2018] commercial neuromorphic platforms both take hardware approaches similar to SpiNNaker, but are set apart by their dedication to one specific application domain, namely neural networks. One could envisage their use as accelerators, but their more limited applicability makes them less attractive than SpiNNaker.

Alternative platforms such as BrainScaleS [BrainScaleS 2018] and the Reconfigurable On-Line Learning Spiking (ROLLS) processor [Qiao 2015] utilise more limited analogue computing models, making them even less suitable for integration with traditional HPC architectures.

5.2 Gap analysis

5.2.1 MPI

There is currently no MPI library implementation for any of the neuromorphic hardware architectures, or for any other “beyond von Neumann” architectures. There are no immediate problems with implementing the majority of MPI communication functionality on the SpiNNaker architecture. However, the fundamental hardware communication operation in SpiNNaker is broadcast rather than point-to-point send/receive, which will require a conceptual shift for MPI library developers.
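To make the conceptual shift concrete, the self-contained C sketch below emulates how point-to-point semantics could be layered on a broadcast-style transport: a packet carrying the destination rank is multicast, and receivers filter locally. All names (multicast_send, recv_filtered, the in-memory “bus”) are hypothetical stand-ins, not part of any MPI implementation or of the SpiNNaker software stack; a real implementation would programme the multicast routing tables instead of filtering at every destination.

```c
#include <stdio.h>

#define NRANKS 4

typedef struct { int dst; int tag; int payload; } packet_t;

static packet_t bus[16];   /* stands in for the multicast fabric */
static int      bus_len;

/* "Point-to-point send" realised as a multicast carrying the destination. */
static void multicast_send(int dst, int tag, int payload)
{
    bus[bus_len++] = (packet_t){ dst, tag, payload };
}

/* "Receive" on a given rank: scan the fabric, keep only matching packets. */
static int recv_filtered(int my_rank, int tag, int *out)
{
    for (int i = 0; i < bus_len; ++i)
        if (bus[i].dst == my_rank && bus[i].tag == tag) {
            *out = bus[i].payload;
            return 1;
        }
    return 0;
}

int main(void)
{
    multicast_send(2, 99, 42);              /* rank 0 "sends" 42 to rank 2 */
    for (int r = 0; r < NRANKS; ++r) {      /* every rank sees the fabric  */
        int v;
        if (recv_filtered(r, 99, &v))
            printf("rank %d received %d\n", r, v);
    }
    return 0;
}
```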

5.2.2 GASPI

There is currently no GASPI implementation for any of the neuromorphic hardware architectures, or for any other “beyond von Neumann” architectures. At the time of writing, such architectures have not yet reached the critical mass at which it would make sense to spend effort on a GASPI implementation.


6 Specific hardware for Deep Learning

6.1 Introduction to the hardware

There is high interest in deep learning (DL) applications across industry and service providers. Two main markets can be identified:

• Cloud services: applications in fields like network routing, querying and language translation;
• IoT and the edge: apps with high popular appeal, like augmented and virtual reality, face detection, voice recognition and autonomous driving, but also applications of interest to industry, such as smart sensors (e.g. process control).

Computation patterns in DL are highly homogeneous and are therefore good candidates for hardware acceleration. The earliest widely adopted accelerators are NVidia GPUs (CUDA, cuDNN, cuBLAS, etc.), which are nowadays the quasi-standard for DL acceleration. Although GPUs are very fast, they are not as power-efficient as they could be for the given operations. Moreover, the power consumption of regular GPUs is far too high for mobile and other edge devices. This has triggered the emergence of highly efficient hardware IP, which can be subdivided into the two categories above (i.e. server and edge). This survey gives an overview of novel hardware that targets DL acceleration and evaluates its usefulness for HPC systems.
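To illustrate why DL workloads map so well onto specialised hardware, the C sketch below shows the multiply-accumulate (MAC) pattern at the heart of a dense layer: the same operation is repeated over a regular grid with no data-dependent control flow. Sizes and values are arbitrary and the code is purely illustrative.

```c
#include <stdio.h>

#define IN  4
#define OUT 3

int main(void)
{
    float x[IN]      = { 1.f, 2.f, 3.f, 4.f };
    float w[OUT][IN] = { { .1f, .2f, .3f, .4f },
                         { .5f, .6f, .7f, .8f },
                         { .9f, 1.f, 1.1f, 1.2f } };
    float b[OUT]     = { 0.f, 0.f, 0.f };
    float y[OUT];

    for (int o = 0; o < OUT; ++o) {          /* every output neuron           */
        float acc = b[o];
        for (int i = 0; i < IN; ++i)         /* identical MAC for every input */
            acc += w[o][i] * x[i];
        y[o] = acc > 0.f ? acc : 0.f;        /* ReLU activation               */
    }

    for (int o = 0; o < OUT; ++o)
        printf("y[%d] = %f\n", o, y[o]);
    return 0;
}
```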

6.2 Chips for Edge Devices

Table 2 gives an overview of the latest emerging hardware for DL on the edge7. Typical for these devices are multiple heterogeneous CPU cores with different performance-power trade-offs, which minimizes energy consumption. RAM consumes a relatively high amount of energy; therefore, mobile devices use specialized low-power memory (LPDDR), which has lower power consumption at the cost of lower memory bandwidth. SoCs usually have external low-power DDR memory, which supports memory bandwidths on the order of 10 GB/s or less.

7 The tables given here report peak performance, i.e. the maximum theoretically obtainable number of operations per second (FLOPS, or OPS in the case of integer data). Peak performance is not the performance obtainable for a given task; usually 60%-90% of peak performance can be obtained for optimized code.


Hardware | Technology | Performance | Power | Cores | ML Accelerator
Qualcomm Vision Intelligence Platform [qualcomm] | 10nm | 2.1 TOPS | 1W | 8x Kryo CPU, Hexagon DSP, Adreno GPU | Hexagon VLIW processor (not dedicated for ML)
Kirin 980 [kirin] | 7nm | 8 TOPS | 5 TOPS/W | 4x ARM A76, 4x ARM A55, Mali G76 | Cambricon Neural Processing Unit (Cambricon 1M)
Apple A12 [appleA12] | 7nm | 5 TOPS | – | 2x Vortex CPU (HP), 4x Tempest CPU (LP), 4-core GPU, 8-core Neural Engine | Neural Engine
Bitmain Sophon BM1682 [bitmain] | 12nm | 3 TOPS | ~40W | 2x ARM, 1x MCU, 1x TPU | Neural Processing Unit
Intel Movidius Myriad X [movidius] | 16nm | 4 TOPS | <0.5W | 2x SPARC V8, 16x SHAVE processors | SHAVE processors
NVidia Xavier [xavier] | 12nm | 30 TOPS | 30W | 8x Carmel (ARM based), Volta Tensor Core GPU (512 cores) | NVidia Deep Learning Accelerator (NVDLA)
ARM ML Processor (Project Trillium) [trillium] | 7nm | 4.6 TOPS | >3 TOPS/W | IP core | Based on NVidia Deep Learning Accelerator

Table 2: Chips for edge devices

Since the main markets are mobile phones and smart cameras, these systems often have an additional video processing subsystem or Image Signal Processor (ISP). Lately, mobile devices are also used for resource-demanding video games, so an embedded GPU can also be found on several chips. Low energy consumption has the highest priority for mobile devices, which run on batteries; a small physical chip size is also important. These two factors mainly drive the design. The devices are Systems on Chip (SoC), which combine at least one general-purpose computation unit (a CPU) with several specialized subsystems, one of which can be a DL accelerator. All subsystems are integrated monolithically on a single die.

Another strategy for running DL algorithms on SoCs is not to introduce another DL subsystem, but to distribute the DL workload over the existing accelerators in a smart way. Qualcomm's chips, for example, do not have a dedicated DL accelerator; instead they use the Hexagon digital signal processor (DSP), which is a VLIW processor. For programmers, Qualcomm offers a Neural Processing SDK which distributes the workload of an application automatically.

NVidia have open-sourced their DL accelerator IP (NVDLA), making it available for any hardware developer to integrate into their chips. NVidia uses this IP in its own product, Xavier. One of the first big adopters is ARM, which is developing its ML Processor within Project Trillium using the NVidia Deep Learning Accelerator (NVDLA) IP.

The target application is inference, not training: the idea is to run pre-trained models on the embedded device. Model reduction techniques can be used to lower the memory footprint of the model parameters and the computational effort. DL subsystems for mobile phones are a novelty, and it seems that software developers struggle to make sensible use of them in applications that are not related to image processing.

From the HPC point of view, the power consumption and efficiency of SoC devices is very attractive. For example, within the 300W power budget of a single NVidia Volta GPU, which achieves 120 Teraflops, one could obtain a peak performance of over one Petaflop by using multiple Huawei Kirin 980 chips. However, the communication overhead of hundreds of SoC devices and the low off-chip bandwidth make this setup impractical. Memory bandwidth is crucial for DL computations, since the number of operations per byte of data is low. In conclusion, using SoCs for DL computations in an HPC setup is not very reasonable. The target market of SoCs is clearly not HPC, so there is little hope that their memory bandwidth will increase in the near future. It is not advisable to use SoCs for HPC systems.
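The bandwidth argument can be made concrete with a simple roofline-style estimate. The C sketch below is purely illustrative: the 10 GB/s bandwidth figure reflects the LPDDR discussion above, the 8 TOPS peak is taken from Table 2, and the arithmetic intensity of 25 operations per byte is an assumed example value for a DL kernel.

```c
#include <stdio.h>

/* Illustrative roofline-style estimate (all numbers are examples):
 * attainable = min(peak, memory bandwidth * arithmetic intensity). */
int main(void)
{
    double peak_tops      = 8.0;    /* peak of an edge SoC NPU, from Table 2        */
    double bandwidth_gbs  = 10.0;   /* off-chip LPDDR bandwidth, order of magnitude */
    double intensity_op_b = 25.0;   /* assumed operations per byte of a DL kernel   */

    double memory_bound_tops = bandwidth_gbs * intensity_op_b / 1000.0; /* GOPS -> TOPS */
    double attainable = memory_bound_tops < peak_tops ? memory_bound_tops : peak_tops;

    printf("memory-bound limit: %.2f TOPS of %.2f TOPS peak (%.0f%%)\n",
           attainable, peak_tops, 100.0 * attainable / peak_tops);
    return 0;
}
```

Under these example numbers, only about 0.25 TOPS of the 8 TOPS peak would be attainable, which illustrates why the low off-chip bandwidth dominates.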

D4.1: Report on state of the art of novel compute elements and gap analysis in MPI and GASPI 32

6.3 Chips for Accelerator Devices

Table 3 gives an overview of chips for accelerator devices.

Name | Technology | Performance | Power | Memory BW
Cambricon MLU-100 [cambricon] | 16nm | 166.4 TOPS | 110W | 102.4 GB/s
Baidu Kunlun | 14nm | 260 TOPS | >100W | 512 GB/s
Google TPU [googleTPU] | 16/12nm? | 90 TOPS per chip, 420 TOPS per board | 200W? per chip | 2.4 TB/s?
Intel Nervana Lake Crest [nervana] | 28nm | 38 TOPS actual | 210W | 400 GB/s off-chip
NVidia Volta V100 [volta] | 12nm | 120 TOPS | 300W | 900 GB/s
NVidia Turing T4 [turing] | 12nm | 8.1 TFLOPS, 260 TOPS INT4 | 70W | 320 GB/s
Graphcore Colossus [colossus] | 16nm | >200 TOPS | 300W | 384 GB/s off-chip, 90 TB/s on-chip
NEC SX-Aurora TSUBASA [tsubasa] | 16nm | 2.45 TFLOPS | <300W | 1.2 TB/s
Huawei Ascend 910/310 [ascend] | 7nm | 512 TOPS, 256 TFLOPS | 350W | –

Table 3: Chips for accelerator devices

A recent trend is that many service providers such as Facebook, Google, Alibaba and Baidu are designing their own ASICs for their data-centre workloads. There is a strong trend in China to become independent of western IP.

Cambricon, a Chinese startup with the aim of making China more independent of foreign IP, also designed the DL IP for Huawei's Kirin 980 SoC. Google's TPU is probably the most powerful AI chip used in production. The latest devices are of the third generation and few details have been published; the data in the table are a conjecture by nextplatform.com. There are several chips per board, resulting in 420 TFLOPS per board. A Google Edge TPU variant for edge applications is in development.

The architectures of these accelerators are quite similar: several cores, each able to calculate hundreds of MAC operations in parallel, coupled with a scheduling unit. Scheduling can be smart, for example with zero-gating (ARM Trillium), which skips multiplications by zero and thereby saves energy. There can also be additional units on the chip for functions other than MAC, for example the activation functions (e.g. the Programmable Layer Engine in Trillium).

GPUs reformulate DL operations into huge matrix-matrix operations and calculate those on several thousand processing cores in parallel. Usually, this requires the inputs and filters to be transformed into a layout which allows the GPU to calculate the operation efficiently. Especially for convolution operations, matrix entries need to be repeated several times in memory, which is not efficient. NVidia Volta has additional computation units (Tensor Cores), which calculate matrix-matrix operations faster, but not more efficiently, than in the way described above.

An entirely different approach is taken by Graphcore's Colossus. There is no global memory: the chip comprises over 2000 processor tiles with 256 KB of local memory each, 600 MB in total. Processing is done in a two-phase cycle, first local computation, then all-to-all communication between all the processor tiles with an aggregate bandwidth of 90 TB/s. There is no offloading to RAM, since all the information is kept in local memory. For some applications (LSTM inference), this architecture is over 100 times faster than a GPU. The big disadvantage is that the entire model needs to fit into 600 MB of local memory. Compared to SoC devices, the efficiency is worse, but the total performance is orders of magnitude higher.

The latest chips use HBM memory with very wide memory interfaces, allowing for bandwidths of several hundred GB/s. Most accelerators use some form of memory hierarchy (i.e. global vs. local memory) to access the data, which hides memory latencies. The superior memory bandwidth and performance make hardware accelerators clearly the better alternative for DL applications in large clusters.
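The memory duplication caused by the matrix reformulation of convolutions can be seen in a minimal im2col sketch. The C example below is purely illustrative (single channel, unit stride, no padding, arbitrary sizes); each input element appears in up to K*K columns of the resulting matrix, which is the repetition referred to above.

```c
#include <stdio.h>

/* Minimal im2col for a single-channel H x W input and a K x K filter,
 * unit stride, no padding (illustrative sizes only). Each column of the
 * output matrix holds one K*K patch, so input values are duplicated. */
#define H 4
#define W 4
#define K 3
#define OH (H - K + 1)
#define OW (W - K + 1)

int main(void)
{
    float in[H][W];
    float cols[K * K][OH * OW];      /* (K*K) x (number of output pixels) */

    for (int i = 0; i < H; ++i)
        for (int j = 0; j < W; ++j)
            in[i][j] = (float)(i * W + j);

    for (int oy = 0; oy < OH; ++oy)
        for (int ox = 0; ox < OW; ++ox)
            for (int ky = 0; ky < K; ++ky)
                for (int kx = 0; kx < K; ++kx)
                    cols[ky * K + kx][oy * OW + ox] = in[oy + ky][ox + kx];

    /* The convolution is now a (1 x K*K) * (K*K x OH*OW) matrix product. */
    printf("input elements: %d, im2col elements: %d\n",
           H * W, K * K * OH * OW);   /* 16 vs. 36: data is repeated */
    return 0;
}
```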

6.4 Programming Devices

For all accelerators, regardless of whether they target edge devices or servers, the manufacturer provides some kind of SDK which transforms the neural network model into a representation that can be understood by the accelerator. The common approach is to import a pre-trained model from a framework like TensorFlow or Caffe, either in its native format or in an open format like ONNX. A runtime engine on the device (like TensorFlow Lite for mobile devices) can then execute the transformed model. This tool flow is often possible for inference only. The accelerator devices are less accessible for training, since the manufacturers do not provide direct means to program them for this task. For training, many frameworks support NVidia GPUs, making them the de facto standard accelerators for DNN training. In the case of TensorFlow and TPUs, the support for the accelerator is provided by the framework itself: operations are offloaded to the accelerator and the communication is handled by the framework.
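As an illustration of such a tool flow, the sketch below runs inference on a converted model through the TensorFlow Lite C API. The model file name and the tensor sizes are placeholders, and whether a particular vendor SDK exposes exactly this API is an assumption; it is meant only to show the "import pre-trained model, then execute on the device" pattern described above.

```c
#include <stdio.h>
#include "tensorflow/lite/c/c_api.h"

/* Minimal inference sketch using the TensorFlow Lite C API. The model file
 * "model.tflite" and the input/output sizes are placeholders; error handling
 * is reduced to the bare minimum. */
int main(void)
{
    TfLiteModel *model = TfLiteModelCreateFromFile("model.tflite");
    if (!model) { fprintf(stderr, "could not load model\n"); return 1; }

    TfLiteInterpreterOptions *opts = TfLiteInterpreterOptionsCreate();
    TfLiteInterpreter *interp = TfLiteInterpreterCreate(model, opts);
    TfLiteInterpreterAllocateTensors(interp);

    static float input[224 * 224 * 3];                  /* placeholder input */
    TfLiteTensor *in = TfLiteInterpreterGetInputTensor(interp, 0);
    TfLiteTensorCopyFromBuffer(in, input, sizeof(input));

    TfLiteInterpreterInvoke(interp);                    /* run inference     */

    static float output[1000];                          /* placeholder size  */
    const TfLiteTensor *out = TfLiteInterpreterGetOutputTensor(interp, 0);
    TfLiteTensorCopyToBuffer(out, output, sizeof(output));
    printf("first output value: %f\n", output[0]);

    TfLiteInterpreterDelete(interp);
    TfLiteInterpreterOptionsDelete(opts);
    TfLiteModelDelete(model);
    return 0;
}
```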

6.5 Gap analysis

6.5.1 MPI

Novel edge and DL compute devices are not explicitly supported in MPI. However, the most likely route to incorporating such devices would be to view them in the same way as GPUs, i.e. either mediate MPI communication via a host CPU or produce native implementations of critical MPI functions that execute directly on the device.
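A minimal C sketch of the host-mediated option follows: data is staged through a host buffer before and after the MPI call. The copy_device_to_host/copy_host_to_device functions are hypothetical stand-ins for a vendor DMA routine and are emulated with memcpy so the example remains self-contained; the MPI calls themselves are standard.

```c
#include <mpi.h>
#include <string.h>
#include <stdio.h>

#define N 1024

static float device_buf[N];                 /* pretend this lives on the device */

/* Hypothetical stand-ins for a vendor DMA copy, emulated with memcpy. */
static void copy_device_to_host(float *dst, const float *src, int n)
{ memcpy(dst, src, (size_t)n * sizeof(float)); }

static void copy_host_to_device(float *dst, const float *src, int n)
{ memcpy(dst, src, (size_t)n * sizeof(float)); }

int main(int argc, char **argv)
{
    int rank, size;
    float host_buf[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0 && size > 1) {
        for (int i = 0; i < N; ++i) device_buf[i] = (float)i;
        copy_device_to_host(host_buf, device_buf, N);          /* stage out */
        MPI_Send(host_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(host_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        copy_host_to_device(device_buf, host_buf, N);          /* stage in  */
        printf("rank 1: received, device_buf[10] = %f\n", device_buf[10]);
    }

    MPI_Finalize();
    return 0;
}
```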

6.5.2 GASPI

Support for Edge Devices

No support for the novel specific DL hardware exists in GASPI/GPI-2. In the case of SoCs, GASPI support would be possible if a CPU core handled the GASPI communication. The generation and transfer of the messages would create computational overhead on the CPU, which also increases the latency of the messages. Additionally, the bandwidth to an SoC is much smaller than what one can expect of a PCIe card. It is very unlikely that future generations of SoCs will increase the communication bandwidth or include an RDMA-capable core, since they are aimed at the mobile market. Very poor scaling properties make SoCs a bad option for HPC environments. Thus, we do not think that the identified gap in GASPI/GPI-2 is relevant for HPC systems.

Support for Accelerator Devices

Support for accelerator devices in GASPI/GPI-2 would be similar to the support of GPUs. Some of the accelerators mentioned are actually GPUs (such as NVidia Volta and NVidia Turing).

When supporting new accelerators in GPI-2, one has to distinguish between the case where the RDMA-capable NIC can communicate directly with the accelerator memory and the case where it communicates indirectly via the host system. In the direct setup, the accelerator triggers its own transfers, and the memory of the accelerator becomes part of the available GASPI address space. In the indirect setup, the host system triggers transfers on behalf of the accelerator, but the memory of the accelerator is still mapped into the global address space, so no copies to host memory are required. The support of novel DL accelerator devices depends on the requirements and needs of the GASPI user community as well as on the accessibility of the novel hardware.
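A minimal sketch, assuming the GPI-2 C API, of the segment-based integration described above: here the segment is ordinary host memory and a one-sided write with notification is performed; in the direct setup the same segment would instead be backed by accelerator memory. Segment IDs, sizes and offsets are arbitrary example values.

```c
#include <GASPI.h>
#include <stdio.h>

int main(void)
{
    gaspi_rank_t rank, num;
    const gaspi_segment_id_t seg  = 0;
    const gaspi_size_t       size = 1 << 20;   /* 1 MiB example segment */

    gaspi_proc_init(GASPI_BLOCK);
    gaspi_proc_rank(&rank);
    gaspi_proc_num(&num);

    /* In the direct setup this memory would be accelerator memory. */
    gaspi_segment_create(seg, size, GASPI_GROUP_ALL, GASPI_BLOCK,
                         GASPI_MEM_INITIALIZED);

    if (num > 1 && rank == 0) {
        gaspi_pointer_t ptr;
        gaspi_segment_ptr(seg, &ptr);
        ((char *)ptr)[0] = 42;                 /* data to transfer */

        /* One-sided write of 1 byte to rank 1, followed by a notification. */
        gaspi_write_notify(seg, 0, 1, seg, 0, 1,
                           /* notification id */ 0, /* value */ 1,
                           /* queue */ 0, GASPI_BLOCK);
        gaspi_wait(0, GASPI_BLOCK);            /* local completion */
    } else if (rank == 1) {
        gaspi_notification_id_t id;
        gaspi_notification_t    val;
        gaspi_notify_waitsome(seg, 0, 1, &id, GASPI_BLOCK);
        gaspi_notify_reset(seg, id, &val);     /* remote completion observed */
        printf("rank 1: notified, value %u\n", (unsigned)val);
    }

    gaspi_proc_term(GASPI_BLOCK);
    return 0;
}
```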


7 Conclusion and Future Work

This deliverable has analysed the gaps in GASPI and MPI with respect to the novel hardware surveyed here, namely Multi and Many Core architectures, GPUs, FPGAs, neuromorphic computing, and specific hardware for Deep Learning. For MPI the following conclusions can be drawn about accelerator integration:

• Multi and Many Core architectures: MPI allows the choice between a “flat MPI” approach and multi-threading, including a thread-safe mode. There is an ongoing debate regarding the best way to expose concurrency to programmers.
• GPUs: MPI libraries permit memory addresses that refer to GPU memory as the data buffer arguments of communication API calls, either mediated by the host CPU or using hardware or system support for more direct routing. GPUDirect and NCCL can be used. Proposals exist for a tighter integration, similar to that of CPU cores.
• FPGAs: MPI implementations already offload some of the software protocol into hardware (e.g. in Portals 4 and ConnectX). Otherwise, using FPGAs poses the same issues for MPI as using GPUs as accelerators.
• Specific hardware for DL applications: Novel edge and DL compute devices are not explicitly supported by MPI. Support would be similar to that of GPUs.

EPiGRAM-HS will further investigate Finepoints vs. Endpoints and evaluate whether either is an improvement over flat MPI when applied to multi-core, many-core or GPU architectures. EPiGRAM-HS will, in collaboration with WP2, investigate using FPGAs to offload more MPI protocols. EPiGRAM-HS will follow the work of the University of Manchester as they attempt to implement MPI natively on the SpiNNaker architecture and will provide consultancy for that effort as needed. For GASPI the following conclusions can be drawn about accelerator integration:

• Multi and Many Core architectures: GPI-2 is thread-safe and therefore already well adapted to Multi Core CPUs. In many cases Many Core architectures should profit from GPI-2 without too many modifications. Depending on the hardware specifications of future compute elements, some adaptation might be necessary.
• GPUs: The integration of RDMA into GPI-2/GASPI is based on memory segments. A clean and portable interface allows for any kind of synchronization and data transport between hosts and accelerators.
• FPGAs: Plans to integrate GPI directly on FPGAs for point-to-point connections between FPGAs will be realised in the scope of the EuroExa project.
• Specific hardware for DL applications: Support for accelerator devices in GASPI/GPI-2 would be similar to the support of GPUs.

All in all, the integration of accelerators is quite advanced in GASPI/GPI-2, and no new developments for a tighter integration are expected within the scope of the EPiGRAM-HS project. EPiGRAM-HS will, however, profit from new developments of the GPI-2 FPGA support in the scope of the EuroExa project. Neuromorphic computing is not based on the familiar von Neumann computing architecture. Due to its disruptive nature, communication concepts such as GASPI or MPI will no longer hold. Neuromorphic computing will be very loosely coupled, if at all, to HPC clusters – most likely as a novel type of accelerator in the future, separating concerns between the more traditional HPC cluster and the neuromorphic compute clusters.

D4.1: Report on state of the art of novel compute elements and gap analysis in MPI and GASPI 38

8 References

8.1 References for Multi / Many Core hardware

[ARM 2013a] ARM, big.LITTLE Technology: The Future of Mobile, accessed December 2018, https://www.arm.com/files/pdf/big_LITTLE_Technology_the_Future_of_Mobile.pdf?_ga=2.165659071.1680287691.1544806408-1854080846.1544196127

[Intel 2018a] Intel, Intel Architecture Instruction Set Extensions and Future Features Programming Reference, 319433-035, October 2018, https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf

[Stephens 2017a] N. Stephens et al., "The ARM Scalable Vector Extension," in IEEE Micro, vol. 37, no. 2, pp. 26-39, Mar.-Apr. 2017. doi: 10.1109/MM.2017.35

[Yoshida 2018a] Toshio Yoshida, “Fujitsu High Performance CPU for the Post-K Computer”, Hot Chips 30, August 21, 2018

8.2 References for GPUs

[dCUDA] T. Gysi, J. Baer, T. Hoefler, “dCUDA: Hardware Supported Overlap of Computation and Communication”, 2016, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC16), Salt Lake City, Utah

8.3 References for FPGAs

[Bachrach 2012a] J. Bachrach et al., "Chisel: Constructing hardware in a Scala embedded language," DAC Design Automation Conference 2012, San Francisco, CA, 2012, pp. 1212-1221. doi: 10.1145/2228360.2228584

[Caulfield 2016a] A. M. Caulfield et al., "A cloud-scale acceleration architecture," 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Taipei, 2016, pp. 1-13. doi: 10.1109/MICRO.2016.7783710, http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7783710&isnumber=7783693


[George 2016a] A. D. George, M. C. Herbordt, H. Lam, A. G. Lawande, J. Sheng and C. Yang, "Novo-G#: Large-scale reconfigurable computing with direct and programmable interconnects," 2016 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, 2016, pp. 1-7. doi: 10.1109/HPEC.2016.7761639, http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7761639&isnumber=7761574

[Intel 2018a] Intel, Intel Stratix 10 GX/SX Product Family Overview Table, Gen-1023-1.6, accessed December 2018, https://www.intel.com/content/dam/altera-www/global/en_US/pdfs/literature/pt/stratix-10-product-table.pdf

[Nane 2016a] R. Nane et al., "A Survey and Evaluation of FPGA High-Level Synthesis Tools," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 35, no. 10, pp. 1591-1604, Oct. 2016, doi: 10.1109/TCAD.2015.2513673, http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7368920&isnumber=7563474

[Putnam 2014a] Andrew Putnam et al., A reconfigurable fabric for accelerating large-scale datacenter services. SIGARCH Comput. Archit. News 42, 3 (June 2014), 13-24. DOI: https://doi.org/10.1145/2678373.2665678

[Xilinx 2018a] Xilinx, UltraScale+ FPGAs Product Tables and Product Selection Guide, XMP103(v1.14), accessed December 2018, https://www.xilinx.com/support/documentation/selection-guides/ultrascale-plus-fpga-product-selection-guide.pdf

8.4 References for Beyond von-Neumann

[APT 2018] APT Advanced Processor Technologies Research Group, University of Manchester, SpiNNaker Home Page, accessed January 2019, http://apt.cs.manchester.ac.uk/projects/SpiNNaker/

[Maass 1997] W. Maass, “Networks of Spiking Neurons: The Third Generation of Neural Network Models”, Neural Networks, 10 (1997), pp. 1659-1671, https://doi.org/10.1016/S0893-6080(97)00011-7

[Akram 2018] W. Akram, T. Hussain, and E. Ayguade, “FPGA and ARM processor based supercomputing”, 2018 International Conference on Computing, Mathematics and Engineering Technologies (iCoMET), Sukkur, 2018, pp. 1-5, doi: 10.1109/ICOMET.2018.8346363

[SpiNNaker 2018] Software for SpiNNaker, accessed January 2019, http://spinnakermanchester.github.io/

[Merolla 2014] Merolla et al., “A million spiking-neuron integrated circuit with a scalable communication network and interface”, Science, 345 6197 (2014), pp. 668-673, doi: 10.1126/science.1254642

[Davies 2018] Davies et al., “Loihi: A Neuromorphic Manycore Processor with On-Chip Learning”, IEEE Micro, 38 1 (2018), pp. 82-89, doi: 10.1109/MM.2018.112130359

[BrainScaleS 2018] BrainScaleS – Neuromorphic processors, accessed January 2019, http://www.artificialbrains.com/brainscales

[Qiao 2015] Qiao et al., “A reconfigurable on-line learning spiking neuromorphic processor comprising 256 neurons and 128K synapses”, Frontiers in Neuroscience, 9 141 (2015), doi: 10.3389/fnins.2015.00141

8.5 References used for specific hardware for DL

[qualcomm] Information about Qualcomm Vision Intelligence Platform:
• https://www.qualcomm.com/products/snapdragon-850-mobile-compute-platform
• https://www.qualcomm.com/invention/research/projects/deep-learning (05.11.18)
• https://developer.qualcomm.com/docs/snpe/overview.html
• https://developer.qualcomm.com/software/hexagon-dsp-sdk/dsp-processor
• https://wccftech.com/qualcomms-snapdragon-855-mass-production-q4-and-feature-npu-features-specifications-launch/
• https://www.golem.de/news/qualcomm-snapdragon-855-wird-zum-snapdragon-8150-1808-136081.html
• https://www.qualcomm.com/products/platforms/consumer-electronics
• https://www.qualcomm.com/products/snapdragon-845-mobile-platform

[kirin] Information about Huawei Kirin:

• https://www.golem.de/news/huawei-kirin-980-7-nm-chip-hat-doppelte-ai-performance-1808-136322.html
• https://consumer.huawei.com/en/press/news/2018/huawei-launches-kirin-980-the-first-commercial-7nm-soc/
• https://fuse.wikichip.org/news/1297/cambricon-reaches-for-the-cloud-with-a-custom-ai-accelerator-talks-7nm-ips/
• https://www.anandtech.com/show/13298/hisilicon-announces-the-kirin-980-first-a76-g76-on-7nm

[appleA12] Apple A12:
• https://www.macwelt.de/a/was-wir-vom-a12-chip-in-kommenden-iphones-erwarten-koennen,3439556
• https://www.apple.com/de/iphone-xs/a12-bionic/
• https://www.anandtech.com/show/13392/the-iphone-xs-xs-max-review-unveiling-the-silicon-secrets/2
• https://medium.com/syncedreview/ai-chip-duel-apple-a12-bionic-vs-huawei-kirin-980-ec29cfe68632

[cambricon] Cambricon MLU100:
• https://fuse.wikichip.org/news/1297/cambricon-reaches-for-the-cloud-with-a-custom-ai-accelerator-talks-7nm-ips/
• https://www.golem.de/news/cambricon-mlu100-entwickler-von-huaweis-npu-bringt-ai-beschleuniger-1805-134605.html
• https://www.anandtech.com/show/12815/cambricon-makers-of-huaweis-kirin-npu-ip-build-a-big-ai-chip-and-pcie-card

[bitmain] Bitmain Sophon: • https://sophon.ai/ • https://en.wikichip.org/wiki/bitmain/sophon

[googleTPU] Information about Google TPU:
• https://cloud.google.com/tpu/docs/tpus
• https://cloud.google.com/edge-tpu/
• https://cloud.google.com/tpu/
• https://www.golem.de/news/machine-learning-google-bringt-mini-tpu-zur-modell-anwendung-1807-135707.html
• https://www.nextplatform.com/2018/05/10/tearing-apart-googles-tpu-3-0-ai-/
• https://arxiv.org/pdf/1704.04760.pdf

[nervana] Intel Nervana:

• https://www.golem.de/news/nervana-nnp-l1000-spring-crest-soll-dreifache-ai-leistung-aufweisen-1805-134549.html
• https://www.hpcwire.com/2018/05/24/intel-pledges-first-commercial-nervana-product-spring-crest-in-2019/
• https://newsroom.intel.com/editorials/artificial-intelligence-requires-holistic-approach/

[movidius] Information about Intel Movidius:
• https://www.movidius.com/myriad2
• https://www.movidius.com/myriadx
• https://www.golem.de/news/movidius-myriad-x--ai-chip-schafft-4-teraops-1708-129727.html
• https://www.anandtech.com/show/11771/intel-announces-movidius-myriad-x-vpu

[volta] Nvidia Tesla V100:
• https://images.nvidia.com/content/technologies/volta/pdf/tesla-volta-v100-datasheet-letter-fnl-web.pdf
• https://www.heise.de/newsticker/meldung/Tesla-V100-Nvidia-uebergibt-erste-Volta-Rechenkarten-an-Deep-Learning-Forscher-3781130.html

[turing] NVidia Turing T4:
• https://nvidianews.nvidia.com/news/nvidia-announces-record-adoption-of-new-turing-t4-cloud-gpu

• https://www.nvidia.com/en-us/data-center/tesla-t4/

• https://cloud.google.com/blog/products/compute/google-cloud-first-to-offer-nvidia-tesla-t4-gpus

[xavier] Information about NVidia Xavier:
• https://fuse.wikichip.org/news/1618/hot-chips-30-nvidia-xavier-soc/
• https://en.wikichip.org/wiki/nvidia/microarchitectures/nvdla
• https://www.forbes.com/sites/moorinsights/2018/08/24/nvidia-reveals-xavier-soc-details/

• http://nvdla.org/
• https://nvidianews.nvidia.com/news/nvidia-and-arm-partner-to-bring-deep-learning-to-billions-of-iot-devices
• https://wccftech.com/nvidia-drive-xavier-soc-detailed/

[trillium] Information about ARM Trillium:
• https://nvidianews.nvidia.com/news/nvidia-and-arm-partner-to-bring-deep-learning-to-billions-of-iot-devices

[colossus] Graphcore Colossus:
• https://www.graphcore.ai/
• http://www.eenewsanalog.com/news/graphcores-two-chip-colossus-close-launch

[tsubasa] NEC SX-Aurora TSUBASA:
• https://www.nec.com/en/global/solutions/hpc/sx/vector_engine.html
• https://www.nextplatform.com/2017/11/22/deep-dive-necs-aurora-vector-engine/

[ascend] Huawei Ascend 910/310:
• https://www.golem.de/news/ascend-910-310-huawei-ai-chip-soll-google-und-nvidia-schlagen-1810-137047.html
• https://mybroadband.co.za/news/technology/279227-huawei-ascend-910-ai-chip-unveiled-the-greatest-computing-density-on-a-single-chip.html


A. Distributed programming models

MPI

MPI has assumed a dominant and ubiquitous role in programming HPC systems for the last 25 years. It represents the distributed-memory message-passing programming model using a standardized library-based API. MPI enables high-performance message-based communication by providing two-sided, one-sided, and collective messaging functionality, process subsets with virtual topologies, user-defined datatypes, parallel file I/O, and an extensive introspection capability to support tools, such as debuggers and trace analyzers. Substantial effort has been invested in developing and maintaining open-source, widely available implementations, such as MPICH and Open MPI. In particular, significant work has focused on identifying and reducing or removing barriers to scalability.

GASPI

GASPI stands for Global Address Space Programming Interface and is a Partitioned Global Address Space (PGAS) API. It aims at extreme scalability, high flexibility and failure tolerance for parallel computing environments. GASPI aims to initiate a paradigm shift from bulk-synchronous two-sided communication patterns towards an asynchronous communication and execution model. To that end, GASPI leverages remote completion and one-sided, RDMA-driven communication in a Partitioned Global Address Space. The asynchronous communication allows computation and communication to be fully overlapped. The main design idea of GASPI is to have a lightweight API ensuring high performance, flexibility and failure tolerance. GPI-2 is an open-source implementation of the GASPI standard, freely available to application developers and researchers.