Enhancing Programmability in NoC-Based Lightweight Manycore Processors with a Portable MPI Library
To cite this version: João Fellipe Uller, João Souto, Pedro Henrique Penna, Márcio Castro, Henrique Freitas, et al. Enhancing Programmability in NoC-Based Lightweight Manycore Processors with a Portable MPI Library. XXI Simpósio em Sistemas Computacionais de Alto Desempenho, Oct 2020, Online, Brazil. hal-02975094

HAL Id: hal-02975094
https://hal.archives-ouvertes.fr/hal-02975094
Submitted on 22 Oct 2020

João Fellipe Uller1, João Vicente Souto1, Pedro Henrique Penna2,3, Márcio Castro1, Henrique Freitas3, Jean-François Méhaut2

1 Laboratório de Pesquisa em Sistemas Distribuídos (LaPeSD), Universidade Federal de Santa Catarina (UFSC), Brazil
2 Computer Architecture and Parallel Processing Team (CArT), Pontifícia Universidade Católica de Minas Gerais (PUC Minas), Brazil
3 Laboratoire d'Informatique de Grenoble (LIG), Université Grenoble Alpes (UGA), France

Abstract. The performance and energy efficiency provided by lightweight manycores are undeniable.
However, the lack of rich and portable support for these processors makes software development challenging. To address this problem, we propose a portable and lightweight MPI library (LWMPI) designed from scratch to cope with the restrictions and intricacies of lightweight manycores. We integrated LWMPI into a distributed OS that targets these processors and evaluated it on the Kalray MPPA-256 processor. Results obtained with three applications from a representative benchmark suite show that LWMPI achieves performance scalability similar to that of the low-level vendor-specific API tailored to the MPPA-256, while exposing a richer programming interface.

1. Introduction

Lightweight manycore processors emerged to address demands for high performance and energy efficiency [Francesquini et al. 2015]. On the one hand, to deliver high performance and scalability, these processors rely on a distributed memory architecture and a rich Network-on-Chip (NoC). On the other hand, to achieve energy efficiency, they are built with simple low-power Multiple Instruction Multiple Data (MIMD) cores and Scratchpad Memories (SPMs) with no hardware coherency support. Moreover, they exploit heterogeneity by combining cores with different capabilities. Some industry-successful examples of lightweight manycores are the Kalray MPPA-256 [de Dinechin et al. 2013a], the Adapteva Epiphany [Olofsson 2016], and the Sunway SW26010 [Fu et al. 2016].

While the aforementioned architectural features make lightweight manycores more scalable than other parallel processors in both performance and energy efficiency, they introduce several challenges in software programmability. For instance, the distributed memory architecture requires a non-trivial software design to handle data access across multiple physical address spaces. Hence, software must explicitly fetch data from remote memories into local ones before manipulating it [Francesquini et al. 2015].
Furthermore, the small amount of on-chip memory requires software to explicitly tile the working data set into chunks and manipulate them locally one at a time [Souza et al. 2017]. Additionally, it is up to the software to take care of data caching and replication to boost performance. Finally, the rich NoC exposes mechanisms for asynchronous programming to overlap communication with computation [Hascoët et al. 2017], as well as hand-operated routing to guarantee uniform communication latencies.

Currently, two approaches are employed to address programmability challenges in lightweight manycores: Operating Systems (OSes) [Kluge et al. 2014, Asmussen et al. 2016, Penna et al. 2019] and baremetal runtime systems [de Dinechin et al. 2013b, Varghese et al. 2014, Richie et al. 2017]. The former is meant to bridge critical programmability gaps and hardware intricacies. The latter aims to expose a rich, performance-oriented programming environment tailored to the underlying architecture.

While these two approaches are effective for some use cases, each has a significant drawback that mirrors the other's strength. Developing applications directly on top of OS interfaces yields software that is portable across architectures, but the programming interface provided is complex and slows down the development process. In contrast, baremetal and vendor-specific runtime systems expose richer interfaces that accelerate development, but they are tied exclusively to the software stack of a specific lightweight manycore. As an immediate consequence, software written on top of these higher-level interfaces ends up being non-portable. The software stack for lightweight manycores thus lacks programmability, since it fails to support both a fast development process and software portability.
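
To make the explicit tiling requirement mentioned in this section more concrete, the following is a minimal sketch in C, not code taken from LWMPI or Nanvix. It assumes a hypothetical tile size (TILE_ELEMS) and a hypothetical helper (fetch_tile) that stands in for a remote read. On a real lightweight manycore, that helper would be a NoC transfer; here it is a plain memcpy so that the sketch runs on any machine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical tile size: elements that fit in the small local buffer. */
#define TILE_ELEMS 256u

/*
 * Hypothetical stand-in for a remote read. On a real lightweight manycore
 * this would be a transfer over the NoC; here it is a plain memcpy.
 */
static void fetch_tile(int *local, const int *remote, size_t count)
{
    memcpy(local, remote, count * sizeof(int));
}

/* Sums a large "remote" array by bringing in one tile at a time. */
long tiled_sum(const int *remote, size_t n)
{
    int local[TILE_ELEMS]; /* small buffer standing in for local memory */
    long total = 0;

    assert(remote != NULL || n == 0);

    for (size_t offset = 0; offset < n; offset += TILE_ELEMS) {
        /* The last tile may hold fewer than TILE_ELEMS elements. */
        size_t remaining = n - offset;
        size_t count = (remaining < TILE_ELEMS) ? remaining : TILE_ELEMS;

        /* Explicitly copy the tile into local memory before using it. */
        fetch_tile(local, remote + offset, count);

        for (size_t i = 0; i < count; i++)
            total += local[i];
    }

    return total;
}
```

Calling tiled_sum(data, n) walks the array in chunks of at most TILE_ELEMS elements, so the working set held locally never exceeds one tile regardless of n. Overlapping the fetch of the next tile with the computation on the current one would require an asynchronous transfer, which this sketch deliberately leaves out.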
In this work, we address the programmability and portability challenges in lightweight manycores by combining both approaches: a lightweight implementation of the Message Passing Interface (MPI) standard (named LWMPI) on top of Nanvix, a Portable Operating System Interface (POSIX)-compliant distributed OS that targets lightweight manycores [Penna et al. 2019]. LWMPI is compatible with the MPI specification and can be extended to support new features and other OSes with little effort.

To assess LWMPI with representative computing workloads, we carried out experiments with three applications extracted from the CAP Bench suite [Souza et al. 2017]. All experiments were executed on the Kalray MPPA-256 processor, a baremetal lightweight manycore. Our results show that the proposed implementation delivers performance scalability similar to that of a vendor-specific low-level Application Programming Interface (API) for the Kalray MPPA-256, while exposing a richer programming interface.

The remainder of this work is organized as follows. In Section 2, we cover the background on lightweight manycores. In Section 3, we present our proposal. In Section 4, we detail our evaluation methodology. In Section 5, we discuss our experimental results. In Section 6, we discuss related work. In Section 7, we draw our conclusions.

2. Lightweight Manycore Processors

In this section, we cover the background on lightweight manycore processors. First, we present an architectural discussion of these processors and how they differ from other parallel architectures. Then, we discuss the current support for software development in lightweight manycores.

2.1. Architectural Blueprints

Lightweight manycores aim to deliver high performance and energy efficiency in a single die.
To achieve this, these processors rely on the following architectural features: (i) thousands of low-power cores; (ii) MIMD capability; (iii) tightly-coupled groups of cores (aka clusters); (iv) distributed memory architecture and small local memories; (v) reliable and fast NoCs for message-passing; and (vi) heterogeneous processing capabilities in I/O and computing clusters.

To provide substantial insight into lightweight manycores, we consider in this paper an industry-successful example of this type of processor: the Kalray MPPA-256 [de Dinechin et al. 2013a]. However, the following discussion extends to other lightweight manycores [Olofsson 2016, Fu et al. 2016]. Figure 1a presents an overview of this processor. Overall, the Kalray MPPA-256 integrates 288 cores arranged in 20 clusters. Clusters have heterogeneous and limited hardware capabilities, according to the role they play. For instance, I/O Clusters have four Resource Managers (RMs), four NoC interfaces, and 4 MB of local Static Random Access Memory (SRAM) to exchange data with external resources and internal clusters. In contrast, Compute Clusters have one RM, 16 Processing Elements (PEs), one NoC interface, and only 2 MB of local SRAM to run user workloads. Cores within a cluster share the cluster's hardware resources and have uniform access to them.

Because address spaces are physically separated, communication between clusters is achieved explicitly and exclusively by exchanging hardware-level messages through two NoCs. Specifically, the Control NoC (C-NoC) enables synchronization and the handover of small control messages, whereas the Data NoC (D-NoC) supports arbitrary-sized data exchanges. At this point, the I/O heterogeneity among clusters becomes more evident. I/O Clusters have direct access to the attached Dynamic Random Access Memory (DRAM) or a device, while Compute Clusters must tile their data into messages and send them through the NoC, using an I/O Cluster as an intermediary to access these resources.
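
The per-cluster figures quoted above can be cross-checked against the stated total of 288 cores with a short calculation, sketched below in C. The text does not say how the 20 clusters split between the two kinds. The sketch assumes 4 I/O Clusters and 16 Compute Clusters, which is an assumption on our part, and all identifiers are hypothetical.

```c
#include <assert.h>

/* Per-cluster figures quoted in the text. */
enum {
    TOTAL_CLUSTERS          = 20,
    RMS_PER_IO_CLUSTER      = 4,
    RMS_PER_COMPUTE_CLUSTER = 1,
    PES_PER_COMPUTE_CLUSTER = 16,
    SRAM_MB_PER_IO_CLUSTER      = 4,
    SRAM_MB_PER_COMPUTE_CLUSTER = 2,
    /* Assumption (not stated in the text): 4 of the 20 are I/O Clusters. */
    IO_CLUSTERS = 4
};

/* Number of Compute Clusters implied by the assumed split. */
int mppa256_compute_clusters(void)
{
    return TOTAL_CLUSTERS - IO_CLUSTERS;
}

/* Total cores: RMs in I/O Clusters plus RMs and PEs in Compute Clusters. */
int mppa256_total_cores(void)
{
    int io_cores = IO_CLUSTERS * RMS_PER_IO_CLUSTER;
    int compute_cores = mppa256_compute_clusters()
        * (RMS_PER_COMPUTE_CLUSTER + PES_PER_COMPUTE_CLUSTER);

    return io_cores + compute_cores;
}

/* Aggregate local SRAM across all clusters, in MB. */
int mppa256_total_sram_mb(void)
{
    return IO_CLUSTERS * SRAM_MB_PER_IO_CLUSTER
        + mppa256_compute_clusters() * SRAM_MB_PER_COMPUTE_CLUSTER;
}
```

Under this assumed split, the I/O Clusters contribute 4 x 4 = 16 cores and the Compute Clusters contribute 16 x 17 = 272, which adds up to the 288 cores quoted in the text. The same split gives 48 MB of local SRAM spread over the whole chip, which illustrates why working data sets have to be tiled.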
To improve communication