Towards a Portable Hierarchical View of Distributed Shared Memory Systems: Challenges and Solutions
Millad Ghane, Department of Computer Science, University of Houston, TX, USA
Sunita Chandrasekaran, Department of Computer and Information Sciences, University of Delaware, DE, USA
Margaret S. Cheung, Physics Department, University of Houston and Center for Theoretical Biological Physics, Rice University, TX, USA

Abstract
An ever-growing diversity in the architecture of modern supercomputers has led to challenges in developing scientific software. Utilizing heterogeneous and disruptive architectures (e.g., off-chip and, in the near future, on-chip accelerators) has increased the software complexity and worsened its maintainability. To that end, we need a productive software ecosystem that improves the usability and portability of applications for such systems while allowing every parallelism opportunity to be exploited.

In this paper, we outline several challenges that we encountered in the implementation of Gecko, a hierarchical model for distributed shared memory architectures, using a directive-based programming model, and discuss our solutions. Such challenges include: 1) inferred kernel execution with respect to the data placement, 2) workload distribution, 3) hierarchy maintenance, and 4) memory management.

We performed the experimental evaluation of our implementation using the Stream and Rodinia benchmarks. These benchmarks represent several major scientific software applications commonly used by domain scientists. Our results reveal that the Stream benchmark reaches a sustainable bandwidth of 80 GB/s and 1.8 TB/s for a single Intel Xeon processor and four NVIDIA V100 GPUs, respectively. Additionally, srad_v2 in the Rodinia benchmark reaches 88% speedup efficiency while using four GPUs.

CCS Concepts • Computer systems organization → Heterogeneous (hybrid) systems;

Keywords Hierarchy, Heterogeneous, Portable, Shared Memory, Programming Model, Abstraction

ACM Reference Format:
Millad Ghane, Sunita Chandrasekaran, and Margaret S. Cheung. 2020. Towards A Portable Hierarchical View of Distributed Shared Memory Systems: Challenges and Solutions. In The 11th International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM'20), February 22, 2020, San Diego, CA, USA. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3380536.3380542

1 Introduction
Heterogeneity has become increasingly prevalent in recent years given its promising role in tackling the energy and power consumption crisis of high-performance computing (HPC) systems [15, 20]. The end of Dennard scaling [14] has instigated the adoption of heterogeneous architectures in the design of supercomputers and clusters by the HPC community. The July 2019 TOP500 [36] report shows that 126 systems on the list are heterogeneous systems configured with one or more GPUs. This is the prevailing trend in the current generation of supercomputers. As an example, Summit [31], the fastest supercomputer according to the TOP500 list (June 2019) [36], has two IBM POWER9 processors and six NVIDIA Volta V100 GPUs in each of its nodes.

Computer architects have also started to integrate accelerators and conventional processors on a single chip, hence moving from node-level parallelism (e.g., modern supercomputers) to chip-level parallelism (e.g., systems-on-chip, SoCs). Most likely, in the near future, computing nodes will possess chips with diverse types of computational cores and memory units on them [23]. Figure 1a displays a potential, envisioned schematic architecture of a future SoC. Fat cores in the figure are characterized by their sophisticated branch prediction units, deep pipelines, instruction-level parallelism, and other architectural features that optimize serial execution to its full extent. They are similar to the conventional processors in current systems.
Thin cores, however, are the alternative class of cores; they have a less complex design, consume less energy, and provide higher memory bandwidth and throughput. Such cores are designed to boost the performance of data-parallel algorithms [4].

The increasing discrepancy between memory bandwidth and computation speed [39] has led computer architects to seek disruptive methods like processing-in-memory (PIM) [22, 24] to address this problem. PIM-enabled methods bring memory modules closer to the processing elements and place them on the same chip to minimize data transfers with off-chip components. Such adaptation is manifesting itself in the form of die-stacked memories (e.g., 3D-stacked memories [27, 38]) and scratchpad memories [6]. Figure 1a shows such an on-chip memory organization. The memory modules on the chip form a network among themselves to ensure data consistency [30], and data transfer among the computational cores on the chip is enabled by the network-on-chip (NoC) component [8, 17].

To that end, we are in dire need of a simple yet robust model to efficiently express the implicit memory hierarchy and utilize the parallelization opportunities in potential up-and-coming exascale systems [37] while improving the programmability, usability, and portability of scientific applications for the exascale era.
Figure 1. (a) A system-on-a-chip (SoC) processor, (b) an abstract machine model (AMM), adopted from [4], for an exascale-class compute node, and (c) its corresponding Gecko model. Every component of the model in (b) has a corresponding location in the Gecko model (c). Virtual locations are colored in gray.

With simplicity and portability as their main goals, Ghane et al. [18] proposed a hierarchical model, Gecko, that helps software developers abstract the underlying hardware of computing systems and create portable solutions. To demonstrate the feasibility of Gecko, they also implemented it as a directive-based programming model of the same name. Now that Gecko is in place, this paper discusses the model's feature set while addressing the challenges along the way. Our main contributions in this paper are to identify those challenges and provide our solutions for them. Those challenges are:

1. Given a set of variables scattered in various locations on the hierarchy tree, in which of these locations should Gecko execute a kernel? We propose a novel algorithm, Most Common Descendent (MCD), to address this challenge. Discussed in Section 3.
2. Having chosen a location, how does Gecko distribute execution among the children of that location? We propose five distribution policies to address this scenario. Discussed in Section 4.
3. How does Gecko's runtime library maintain the tree hierarchy and the workload distribution internally? We use a Reverse Hash Table in our solution. Discussed in Section 5.
4. Considering the dynamism that we introduce, how is memory allocated when the targeted location is only known at execution time? We use Gecko's multiCoreAlloc to address this challenge.

Computer architects use abstract machine models (AMMs) to describe computer architectures. AMMs are intended as communication aids between computer architects and software developers to study the performance trade-offs of a model. Figure 1b shows a promising AMM for potential heterogeneous exascale systems [4]. This model has the potential to represent both node- and chip-level systems.

Figure 1c shows the equivalent Gecko model for the abstract model of Figure 1b. The fat cores, depicted as FCx, have the potential to represent both the conventional processors in current systems (e.g., IBM POWER9 in Summit) and modern on-chip processors in the SoC processor (e.g., the "fat cores" in Figure 1a). Other components in Figure 1c (locations, in Gecko's terminology) have similar physical manifestations in real scenarios.

2.2 Brief Overview of Gecko
The principal components of a Gecko model are its locations. They are an abstraction of the available memory and computational resources in a system and are connected to each other in a hierarchical tree structure. Each location represents a particular memory subsystem, e.g., the host memory, the device memory on the accelerators, and so on. Memory locations can be grouped together to form a virtual location. The virtual locations in our model have no physical manifestation in the real world. Their role is to simplify the management of similar locations and to minimize code modifications. Figure 1c is an example of a Gecko model for the SoC shown in Figure 1a.
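To make the notion of locations more concrete, below is a minimal, self-contained C++ sketch of such a hierarchy. It only illustrates the concept described above; it is not Gecko's actual implementation or directive syntax, and the type Location, the helpers attach and collect_physical_leaves, and the location names (LocA, LocHost, LocGPUs, GPU0, GPU1) are invented for this example.

#include <iostream>
#include <string>
#include <vector>

// One node of the hierarchy. A location abstracts a memory subsystem
// (host DRAM, device memory, a 3D-stacked module, ...) and the compute
// resources attached to it. A virtual location owns no physical resource;
// it only groups similar children under a single name.
struct Location {
    std::string name;
    bool is_virtual;
    Location* parent;
    std::vector<Location*> children;
};

// Grow the tree edge by edge by attaching a child location to its parent.
void attach(Location& parent, Location& child) {
    child.parent = &parent;
    parent.children.push_back(&child);
}

// Collect the physical (non-virtual) leaves reachable from `loc`. Work
// addressed to a virtual location would, conceptually, fan out over this set.
void collect_physical_leaves(const Location& loc,
                             std::vector<const Location*>& out) {
    if (loc.children.empty()) {
        if (!loc.is_virtual) out.push_back(&loc);
        return;
    }
    for (const Location* c : loc.children)
        collect_physical_leaves(*c, out);
}

int main() {
    // A miniature model in the spirit of Figure 1c: a virtual root grouping
    // the host memory and a virtual location that groups two GPUs.
    Location root{"LocA",    true,  nullptr, {}};
    Location host{"LocHost", false, nullptr, {}};
    Location gpus{"LocGPUs", true,  nullptr, {}};
    Location gpu0{"GPU0",    false, nullptr, {}};
    Location gpu1{"GPU1",    false, nullptr, {}};

    attach(root, host);
    attach(root, gpus);
    attach(gpus, gpu0);
    attach(gpus, gpu1);

    std::vector<const Location*> targets;
    collect_physical_leaves(gpus, targets);
    for (const Location* t : targets)
        std::cout << t->name << "\n";   // prints GPU0 and GPU1
    return 0;
}

The is_virtual flag captures the grouping role described above: a workload addressed to the single name LocGPUs can be fanned out over whatever physical children currently sit beneath it, which is what keeps code modifications minimal when the underlying hardware changes.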