Experiences at Scale with PGAS Versions of a Hydrodynamics Application

Experiences at scale with PGAS versions of a Hydrodynamics Application A. C. Mallinson, S. A. Jarvis W. P. Gaudin, J. A. Herdman Performance Computing and Visualisation High Performance Computing Department of Computer Science Atomic Weapons Establishment University of Warwick, UK Aldermaston, UK [email protected] {Wayne.Gaudin,Andy.Herdman}@awe.co.uk ABSTRACT 1. INTRODUCTION In this work we directly evaluate two PGAS programming Due to power constraints it is recognised that emerging HPC models, CAF and OpenSHMEM, as candidate technologies system architectures continue to enhance overall computa- for improving the performance and scalability of scientific tional capabilities through significant increases in hardware applications on future exascale HPC platforms. PGAS ap- concurrency [14]. With CPU clock-speeds constant or even proaches are considered by many to represent a promising reducing, node-level computational capabilities are being research direction with the potential to solve some of the ex- improved through increased CPU core and thread counts. isting problems preventing codebases from scaling to exas- At the system-level, however, aggregate computational ca- cale levels of performance. The aim of this work is to better pabilities are being advanced through increased overall node inform the exacsale planning at large HPC centres such as counts. Effectively utilising this increased concurrency will AWE. Such organisations invest significant resources main- therefore be vital if existing scientific applications are to taining and updating existing scientific codebases, many of harness the increased computational capabilities available which were not designed to run at the scales required to in future supercomputer architectures. reach exascale levels of computational performance on future system architectures. We document our approach for imple- It has also been recognised that the main memory capacity menting a recently developed Lagrangian-Eulerian explicit available per CPU core is likely to continue to decrease in hydrodynamics mini-application in each of these PGAS lan- future system architectures. Additionally for many appli- guages. Furthermore, we also present our results and expe- cations memory bandwidth and latency are already the key riences from scaling these different approaches to high node resource constraints limiting performance. Potentially fur- counts on two state-of-the-art, large scale system architec- ther necessitating the need to scale applications to higher tures from Cray (XC30) and SGI (ICE-X), and compare node counts in order to utilise greater aggregate memory their utility against an equivalent existing MPI implemen- resources. tation. It is therefore highly likely that the ability to effectively Categories and Subject Descriptors scale applications across multi-petascale or exascale plat- [Computing methodologies]: Massively parallel and high- forms will be essential if these classes of machine are to be performance simulations; [Software and its engineer- utilised for improved science. Irrespective of the nodal hard- ing]: Software organization and properties|Interoperabil- ware employed in a particular supercomputer architecture, ity, Software performance there is a common requirement for improving the scalability of communication mechanisms within future systems. These trends present a significant challenge to large HPC centres General Terms such as the Atomic Weapons Establishment (AWE), as in Languages, Performance, Measurement order to reach exascale levels of performance on future HPC platforms, many of their applications require significant scal- Keywords ability improvements. PGAS, Co-array Fortran, OpenSHMEM, MPI, HPC, Exas- cale, Hydrodynamics The established method of utilising current supercomputer architectures is based on an MPI-only approach, which utilises a two-sided model of communication for both intra- and inter-node communication. It is argued that this programming approach is starting to reach its scalability limits due to increasing nodal CPU core counts and the increased con- gestion caused by the number of MPI tasks involved in a large-scale distributed simulation [8]. Unlike Message Passing, PGAS (Partitioned Global Address Space) based approaches such as CAF (Co-array Fortran), OpenSHMEM or UPC (Unified Parallel C) rely on a lightweight one-sided communication model and a global memory ad- larger more complex production codes [18]. Thus their use dress space. This model represents a promising area of re- provides a viable way to develop and evaluate new methods search for improving the performance and scalability of ap- and technologies. plications as well as programmer productivity. Additionally they may also help to reduce the overall memory footprint Weak- and strong-scaling experimental scenarios are likely of applications through e.g. the elimination of communi- to be important on multi-petascale machines. The former cation buffers, potentially leading to further performance are likely to scale well as the communication to computation advantages. The DARPA1 recently funded the development ratio remains relatively constant. While the latter is more of several PGAS-based languages through the HPCS2 pro- challenging because the amount of computation per node re- gramme, including X10 [28], Chapel [11] and Fortress [6]. duces with increased scale and communications eventually dominate. Historically, the use of a shared, globally acces- Historically, effectively utilising a PGAS-based approach of- sible memory has not been vital for this class of application ten required the use of a proprietary interconnect technol- and treating each core/thread as a separate address space ogy, incorporating explicit hardware support, such as those has been a valid strategy. commercialised in the past by Cray and Quadrics. Although the body of work which examines PGAS-based applications In this work we utilise a simplified but still representative on these technologies is still relatively small, substantially structured, explicit hydrodynamic mini-app known as Clover- less works exists which examines their performance on sys- Leaf [18] to investigate the utility of these PGAS-based ap- tems constructed from commodity-based technologies such proach for this class of application. CloverLeaf forms part of as Infiniband. It is the view of the authors that this anal- the R&D Top 100 award winning Mantevo software suite [3]. ysis may become increasingly important in the future given We document our experiences porting the existing MPI code- that Intel recently procured both the Cray Aries and QLogic base to CAF and OpenSHMEM and compare the perfor- Infiniband interconnect technologies and the potential for mance and scalability of each approach, under a strong- these technologies to converge within future Intel system- scaling experimental scenario, on two state-of-the-art system on-a-chip designs. Research is therefore needed to assess architectures and vendor implementations. the relative merits of PGAS-based programming models and future hardware evolutions to ensure that the performance Specifically, in this paper we make the following key contri- of scientific applications is optimised. To date, insufficient butions: work has been conducted to directly evaluate both the CAF • We document the implementation of each of our Open- and OpenSHMEM models at scale on current high-end HPC SHMEM and CAF versions of CloverLeaf; 10 and 8 dis- platforms, particularly within the sphere of explicit Lagrangian tinct versions respectively. With the aim that this in- Eulerian hydrodynamics, which is a key focus of this work. formation will be useful to developers of future Open- SHMEM and CAF applications. Large HPC sites are required to make significant investments maintaining their existing production scientific codebases. • We present a performance analysis of these versions, at Production codes are usually legacy codes which are gener- considerable scale (up to 49,152 cores), to provide both ally old, large and inflexible, with many man-years of de- a comparison of each programming model but also to velopment resources invested in them. This effort cannot be assess how the communication constructs within each discarded and applications cannot be rewritten from scratch can best be incorporated into parallel applications. for the latest hardware. These organisations are therefore Based on these results we also make recommendations challenged with the task of deciding how best to develop to improve the OpenSHMEM specification and poten- their existing applications to effectively harness future HPC tially future CAF compiler and runtime systems. system architectures. • We provide a performance comparison of our PGAS versions against an equivalent MPI implementation in The optimal approach is unknown and likely to be appli- order to assess their utility against the dominant parad- cation and machine architecture specific. Evaluating the igm used in existing parallel scientific applications. strengths of each potential approach using existing production codes is problematic due to their size and complexity. • Finally, we analyse the performance of our PGAS im- Together with limited development and financial resources, plementations on two systems incorporating different this complexity means that it is not possible to explore every interconnect topologies and technologies. Namely the software development option for each HPC application. De- Cray Aries (Dragonfly) interconnect in the XC30 plat- cisions made

Experiences at Scale with PGAS Versions of a Hydrodynamics Application

Mellanox Scalableshmem Support the Openshmem Parallel Programming Language Over Infiniband

Introduction to Parallel Programming

An Open Standard for SHMEM Implementations Introduction

Enabling Efficient Use of UPC and Openshmem PGAS Models on GPU Clusters

Introduction to Parallel Computing

Bridging Parallel and Reconfigurable Computing with Multilevel PGAS and SHMEM+

Distributed Shared Memory – a Survey and Implementation Using Openshmem

NUG Monthly Meeting

A Space Exploration Framework for GPU Application and Hardware Codesign

Parallel Computing, Models and Their Performances

Benchmarking Parallel Performance on Many-Core Processors

SIMD Programmingprogramming