Area Exam: General-Purpose Performance Portable Programming Models for Productive Exascale Computing

Alister Johnson Department of Computer and Information Sciences University of Oregon Spring 2020

Abstract best way to program them. In addition, the wide va- riety of architectures means code written for one archi- Modern architectures have grown increas- tecture may not run on another. Over the last couple ingly complex and diverse since the end of Moore’s law decades, portable programming models, such as OpenMP in the mid-2000s, and are far more difficult to program and MPI, have matured to the point that most code can than their predecessors. While HPC programming mod- now be ported to multiple architectures with minimal ef- els have improved such that applications are now gen- fort from developers. However, these models do not guar- erally portable between architectures, their performance antee that ported code will run well on a new architec- can still vary wildly, and developers now need to spend ture. a great deal of time tuning or even rewriting their ap- Performance tuning and optimization are now the most plications for each new machine to get the performance expensive parts of porting an application. The explo- they need. New performance portable programming mod- sion of new architectures also included an explosion of els aim to solve this problem and give high performance micro-architectures that can have wildly different perfor- on all architectures with minimal effort from developers. mance characteristics, and to make the best use of micro- This area exam will survey many of these proposed architectural features, developers often have to rewrite general-purpose programming models, including libraries, large portions of their applications. For very large ap- parallel languages, directive-based language extensions, plications (100,000+ lines of code), this is prohibitively and source-to-source translators, and compare them in expensive, and developers need a better option. terms of use cases, performance, portability, and de- Even when developers could rewrite their application, veloper productivity. It will also discuss compiler and they often want to run on multiple architectures, and general-purpose language standard (e.g., C++) support must therefore keep multiple versions of their code. Main- for performance portability features. taining multiple versions of the same code can be pro- hibitively expensive as well, since fixes and updates must be added to each version and these versions can diverge 1 Introduction as time goes on. Keeping multiple code paths within the same version runs into the same problem, so developers Since the end of Moore’s law and Dennard scaling in the in this situation also need more options. mid-2000s, chip designers are hitting the physical limits of In short, porting applications to new architectures cur- what can be done with single- and multi-core processors. rently requires excessive developer effort and portable The high performance computing (HPC) community has programming solutions are not enough: we need perfor- realized simply scaling up existing hardware will no longer mance portable programming models that will increase work. As a result, there has been an explosion of new su- developer productivity. This paper will discuss several po- percomputer architectures in the last two decades that tential general-purpose, performance portable program- attempt to innovate around these physical limits to reach ming models that have been introduced in recent years, higher levels of performance. 
Many of these new architec- including: tures use accelerators, such as graphics processing units (GPUs) or Intel’s Xeon Phi, as well as traditional CPUs, – Libraries, such as RAJA and SkelCL (Section 3) to achieve higher performance than CPUs alone. – Parallel languages, like Chapel and (Section 4) Unfortunately, these types of architectures are very dif- ficult to program, and there is still no consensus on the – Directive-based language extensions, including

1 OpenACC and OpenMP (Section 5) research in many domains. It will allow simulations to have higher resolution, scientific computations to get re- – Source-to-source translators, such as Omni and Clacc sults more quickly, and machine learning applications to (Section 6) train on more data. Current can do on the order of 100 This paper will prioritize discussion of models that are petaFLOPs (1 exaFLOP = 1,000 petaFLOPs). Summit popular in the HPC community, are currently under de- (Oak Ridge National Lab) can do ∼150–200 petaFLOPs; velopment, were recently proposed, and/or provide some Sierra (Lawrence Livermore National Lab) and Sunway novelty, and future directions such models could take, TaihuLight (National Supercomputing Center in Wuxi, where applicable. This paper will not discuss: domain China) can both do ∼100–125 petaFLOPs [171]. Aurora, specific languages (DSLs), since these by definition are arriving at Argonne National Lab in 2021, will theoreti- not general-purpose (although there have been promis- cally be an exascale machine. China and Japan are aim- ing results in using DSLs for performance portability); ing to have their own exascale machines by 2020 and 2023, autotuning or other machine learning-based methods for respectively [65]. automatic optimization, since those are meant to be used with the programming models, but are not programming models themselves; or generic programming frameworks 2.1.1 Goals for Exascale such as MapReduce or StarPU, since these are not truly programming models, but efficient runtime systems. The U.S. Department of Energy (DoE) set a goal to build In addition, Section 7 will discuss current compiler an exascale machine that has a hardware cost of less than support for parallelism and potential improvements that $200M and uses less than 20MW of power [149]. Aurora these models could make use of. Section 8 will summarize will not meet the cost goal, and it’s still unknown if it and compare these models in terms of use cases, perfor- will meet the power goal, but there is hope for future mance, portability, and developer productivity. Section machines. 2 will begin by providing background information, and Other (implicit) goals for exascale computing include Section 9 will conclude. (1) making exascale machines “easy” to program, (2) ver- The main sections of this paper (3, 4, 5, and 6) will ifying that these machines can do a “useful” exaFLOP, be structured as follows. A brief introduction will be and (3) verifying they can perform sustained exaFLOPs. given, then each programming model will be described. The first of these is also a goal of performance portability, Each model will be discussed in terms of the “3 P’s” and will be discussed further in Sec. 2.3 (on productiv- (portability, performance, and productivity) and given a ity). (qualitative, subjective) “P3 Ranking” based on how well As for (2) and (3), the origin of these questions goes it meets the criteria for being performance portable and back to how supercomputer performance is measured. productive, as defined later in Sec. 2.4. All of this will The measurement method used by Top500 is the LIN- be summarized in two tables at the end. PACK Benchmark, a dense linear algebra solver [170]. LINPACK has been criticized for being overly specific and thus not representative of real applications that will be 2 Background run on these machines. 
For example, LINPACK does not account for data transfers, which is one of the bottlenecks The first part of this section will describe the goals and on current machines. Very few applications can achieve challenges of exascale computing, as well as how it re- even close to the peak performance of LINPACK because lates to performance portability. The second part will dis- they cannot make use of all the floating point units on a cuss background information on performance portability, chip and/or they have to wait on data movement [73]. including its most common mathematical definition and The HPC community wants an exascale machine that several criticisms of that definition. The section will con- can perform a “useful” exaFLOP with a real application clude with some definitions and brief descriptions of pop- that has these kinds of problems. If the machine can ular non-(performance) portable programming models, only do exaFLOPs with highly tuned, compute-bound which will be compared to various performance portable programs like LINPACK, that isn’t helpful for domain sci- models throughout this paper. entists, whose applications are much more varied. If the machine cannot perform sustained exaFLOPs, but only 2.1 Exascale Computing burst to an exaFLOP under some circumstances (e.g., the kind of dense math performed by LINPACK), that The aim of exascale computing is to build a machine that also isn’t helpful. Building an exascale machine that can can do 1018 floating point operations in a second — 1 meet goals (2) and (3) will be a challenge beyond merely exaFLOP. Exascale computing is essential for doing new building an exascale machine.

2 2.1.2 Challenges of Exascale 10. Scientific productivity — we want to increase productivity of domain scientists with new A DoE report on exascale computing [94] identified the tools and environments that let them work following as the top ten challenges to building an exascale on exascale machines easily. supercomputer. Many other works have also identified a subset of these as major difficulties for exascale systems The two bold challenges, (5) and (10), are of particular [149, 74, 107]. interest to performance portability research. Performance portability is concerned with creating programming mod- 1. Energy efficiency — the goal is to use only 20 MW els that run equally well on multiple architectures/ma- of power, but simply scaling up current technology chines; the more difficult question is, how can we do so would use far more than this. while allowing developers to express parallelism, data lo- cality, and fault tolerance in ways that won’t tie them to 2. Interconnect technology — we need communication a particular machine or get them bogged down in details? to be fast and energy efficient, otherwise an exas- cale machine “would be more like the millions of in- dividual computers in a data center, rather than a 2.1.3 A Brief History of Supercomputing supercomputer” [94]. Before discussing how exascale computing and perfor- mance portability impact and inform each other, we need 3. Memory technology — we need to minimize data a brief digression to the history of supercomputer archi- movement in our programs, make movement energy tectures. efficient, and have affordable high-capacity and high- In the early years, processor performance improve- bandwidth memory. ments were mainly due to technological advancements, 4. Scalable system software — current system software and processor performance doubled roughly every 3.5 was not designed to handle as many cores and nodes years. In the mid-1980s, reduced instruction set (RISC) as exascale systems will have. Systems also need bet- architectures became prevalent, and this led to great in- ter power management and resilience to faults. creases in processor performance – Moore’s law was born, saying performance (based on transistor count) would 5. Programming systems — we need better pro- double every 2 years. Dennard scaling, which relates the gramming environments that allow develop- power density of transistors to transistor size, allowed ers to express parallelism, data locality, and chip manufacturers to drastically increase the number resilience, if they so choose. of transistors per chip while increasing clock frequency, giving large performance gains with essentially the same 6. Data management — our software needs to be able base architecture. Figure 1 shows this steady growth be- to handle the volume, velocity, and diversity of data ginning around 1986. The following decades of steady that will be produced by applications. progress got developers used to performance improve- ments for “free” – if their application was too slow, they 7. Exascale algorithms — current algorithms weren’t could just wait for the next generation of processor to designed with billion-way1 parallelism in mind, and come out, and (without modifying their code!) it would we need to rework them or design completely new run faster. algorithms. However, in the early to mid-2000s, Dennard scaling 8. Algorithms for discovery, design, and decision — we began to break down. 
Chip designers began hitting phys- need software to be able to reason about uncertainty ical limits, like the power wall: the power density of pro- and optimizations (e.g., error propagation in physics cessors grew so high that scaling any further would make simulations or the optimal instruction set to use for it physically impossible to dissipate the excess heat. Pro- a machine learning algorithm). cessor and compiler developers began seeing diminishing returns from instruction-level parallelism, and, as Fig. 1 9. Resilience and correctness — exascale computers will shows, progress began to slow. Processor performance have many more nodes than current petascale com- was only doubling every 3.5 years, and it was clear an- puters, and hardware faults will therefore be more other solution would be needed, which began the transi- frequent. We need both machines and applications tion to multicore chips. After this, performance growth to be able to recover from these faults and guarantee came from increasing the number of cores per chip while correctness. clock rates stabilized [74, 149], and we continue to see decreases in performance gains. If the trend of the mid- 1To get 1018 FLOPs with cores running at 1 GHz (a reasonable approximation of core frequencies in current petascale supercom- 2010s continues, processor performance will only double puters), we would need at least 1 billion cores. every 10-20 years.

3 Figure 1: Forty years of processor performance improvements, as measured by SPECint. Image source: [56].

Before the mid-2000s, supercomputers were made up of This proliferation of architectures is shown in Table 1, simple nodes (if they had nodes at all) and programmed which lists the architectures of current and future super- with regular programming languages like C and Fortran computers from around the world. The HPC community or, as and cluster architectures became hasn’t agreed yet on the best way to program these pre- more popular, with C and Fortran plus the MPI message exascale machines, but we can agree that we don’t know passing library, for communication between nodes. Af- what future exascale (and larger) machines will look like. terwards, as intra-node parallelism increased and MPI’s The US DoE machines most recently announced, Aurora was called into question [166], MPI+X (where [65], Perlmutter [108], and Frontier [15], have wildly dif- “X” is an intra-node parallel programming model) be- ferent architectures that use different programming mod- came the default. els, and users will likely want to run their applications on all of them at some point.2 Future machines may even have multiple types of accelerators in each node, 2.1.4 Modern Architectures each specialized for a different type of computation, or To cope with the end of Moore’s law and Dennard programmable coprocessors, like FPGAs [184], that users scaling, three families of supercomputer architectures want to use at the same time. Therefore, the HPC com- have emerged: heavyweight, based on pre-2004 mod- munity wants to “future-proof” its applications by devel- els with a few powerful cores with high clock speeds; oping new performance portable programming models. lightweight, with many less powerful cores and slower clock speeds (e.g., IBM’s BlueGene architecture); and 2.1.5 Why Performance Portability Matters heterogeneous, a mix of heavy- and lightweight proces- sors, like a CPU+GPU system. Performance projections Currently, developers need to expend effort to port their from 2013 [74] implied that only heterogeneous systems applications to a new machine, and is no longer had a hope of making the exascale compute goal within guaranteed. Sometimes application development teams the power limit. Recent developments have borne this out spend months porting and optimizing for a new architec- – 7 out of the current top 10 systems have some kind of 2 accelerator, and 5 of these use GPUs as their accelerator Aurora will have Intel CPUs and Intel GPUs, Perlmutter will have AMD CPUs and Nvidia GPUs, and Frontier will have AMD [171]. 8 of the top 10 most power efficient machines use CPUs and AMD GPUs. Each type of GPU uses a different (incom- GPU accelerators [169]. patible) programming model.

4 Lifespan Architecture CPU vendor CPU version Accel. vendor Accel. version Aurora 2021– Cray Shasta Intel Xeon Intel Xe Frontier 2021– Cray AMD Epyc AMD Radeon Instinct Fugaku 2021– Fujitsu Fujitsu A64FX N/A N/A Perlmutter 2020– Cray Shasta AMD Epyc Nvidia Unknown Tesla Frontera 2019– Dell C6420 Intel Xeon (Cascade Lake) N/A N/A Summit 2018– IBM AC922 IBM Power9 Nvidia V100 Sierra 2018– IBM AC922 IBM Power9 Nvidia V100 ABCI 2018– Fujitsu CX2560 M4 Intel Xeon (Skylake) Nvidia V100 Theta 2017– Cray XC40 Intel Xeon Phi (KNL) N/A N/A Gyoukou 2017–2018 ZettaScaler-2.2 Intel Xeon (Broadwell) PEZY PEZY-SC2 TaihuLight 2016– Sunway MPP Sunway SW26010 N/A N/A Cori 2016– Cray XC40 Intel Xeon Phi (KNL) N/A N/A Oakforest-PACS 2016– Fujitsu CX1640 M1 Intel Xeon Phi (KNL) N/A N/A Trinity 2015– Cray XC40 Intel Xeon (Haswell) Intel Xeon Phi (KNL) Tianhe-2A 2013– TH-IVB-FEP Intel Xeon (Ivy Bridge) NUDT Matrix-2000 Titan 2012–2019 Cray XK7 AMD Opteron Nvidia K20X Piz Daint 2012– Cray XC50 Intel Xeon (Haswell) Nvidia P100 Sequoia 2012– BlueGene/Q IBM Power BQC N/A N/A Mira 2012– BlueGene/Q IBM Power BQC N/A N/A K computer 2011–2019 Fujitsu Fujitsu SPARC64 VIIIfx N/A N/A

Table 1: Recent supercomputers, from the 2015-2019 Top500 lists and various announcements.

ture, only to have to repeat all that work again a year porting effort. In addition, they want their application or two later when the next machine comes out. Many to perform well on new architectures with minimal opti- HPC applications have a lifespan measured in decades, mization and performance tuning. Developers want their while supercomputers usually last far less than that. Be- applications to be performance portable. ing forced to refactor for new machines every few years leaves these applications fragile and error-prone, since the 2.2.1 Definitions time developers have to test, debug, and otherwise im- prove their code decreases significantly – the current sit- Several different definitions of performance portability uation is actively harming work done by domain scientists have been proposed in recent years. Some definitions, [148, 96]. as listed by Pennycook et al. [137]: The end goal of performance portability is to solve this 1. An approach to application development, in which problem and minimize the work users need to put into developers focus on providing portability between porting and optimizing their programs for future archi- platforms without sacrificing performance [137]. tectures. Instead of rewriting the same code again and again, developers will have time to improve the function- 2. The ability of the same source code to run produc- ality of their application. To quote the OpenACC web- tively on a variety of different architectures [78]. t site: “more science, less programming” [120]. P Sn 3. Pn (b → t) = b ×100% for program P, base system Sn b, target system t, and speed-up on n nodes Sn [186]. 2.2 Performance Portability 4. The ability of an application to achieve a similar high Performance portability is not new, but increased inter- fraction of peak performance across target devices est in it is. Computer architectures have gone through [102]. (and are continuing to go through) so many shifts that 5. The ability of an application to obtain the same (or programmers have had to port or rewrite their code mul- nearly the same) performance as a variant of the code tiple times, which isn’t sustainable in the long run (see that is written specifically for that device [37]. Sec. 2.1.5). When developers port an application to a new architec- While each of these definitions has its strengths, each ture, they do not want to be locked into that architecture also has weaknesses. Definition (1) is intuitive and pro- – they want to move between architectures with minimal vides a good baseline for determining if an application

5 can claim to be performance portable, but it is subjec- The PP metric can be used by comparing application tive and provides no way to measure how performance performance to either the theoretical peak performance portable an application is. Definition (2) restricts the of the architecture — the architectural efficiency — or application code to a single version, which is desirable the best known performance of the application on any ar- for many reasons, but suffers from the same problems as chitecture — the application efficiency. Pennycook et al. (1) (how should we define “productively?”). Definition note that both these efficiencies are important since just (3) does provide a metric, but this metric is difficult to looking at one can bias results. Only considering archi- compare between applications, systems, and program in- tectural efficiency can give artificially low values for PP if puts. Definition (4)’s metric is problematic because it is an application physically cannot take advantage of archi- somewhat subjective, peak efficiency is difficult to mea- tectural features (e.g., fused multiply-add instructions or sure (see Sec. 2.2.2 on architectural efficiency), and two vectorization), and only looking at application efficiency architectures might be sufficiently different that achieving can give artificially high results when no truly efficient peak on one is simple, but on another is impossible (e.g., implementation exists. one machine has vector units that the application can’t The rest of this section will discuss some criticisms and utilize). Definition (5) is similarly somewhat subjective, improvements suggested since Pennycook et al. first pub- and might be difficult to measure if code version for a lished their metric. certain device doesn’t exist. An ideal definition of performance portability would Architectural Efficiency. Dreuning et al. [34] note be objective and provide an easy way to both measure that, when using architectural efficiency, we must choose and compare values for different applications. The most between comparing the application’s achieved FLOPs or commonly utilized performance portability definition is achieved memory bandwidth to the architecture’s theo- from Pennycook et al. [137], because it comes with such retical peaks. Choosing the wrong peak can lead to in- a metric. However, there are still several criticisms of correctly high or low values of PP depending on whether Pennycook et al.’s metric, which will be discussed in the an application is compute or memory bound. Dreuning next section. et al. suggest fixing this by using a performance porta- bility model similar to the Roofline model [176]: compute 2.2.2 Primary Metric bound applications should be compared to the platform’s Pennycook et al.’s definition of performance portability theoretical peak FLOPs, and memory bound applications is “a measurement of an application’s performance effi- to theoretical peak bandwidth. The operational intensity ciency for a given problem that can be executed correctly (operations per memory access) of the application com- on all platforms in a given set” [137]. Pennycook et al. de- pared to the hardware can be used to determine whether signed their definition to reflect both the performance and applications are compute or memory bound. portability aspects. In addition, they specifically mention Yang et al. [183] add that, when using the Roofline executing correctly on a given problem to ensure that ap- model, we need accurate theoretical ceilings. 
In partic- plications ported incorrectly are not considered portable ular, they note that vendors can be optimistic when re- and to note that different inputs can different per- porting numbers, so real measurements should be used. formance characteristics. In addition, application FLOPs should not be counted by The corresponding metric is defined as the harmonic hand, because compiler optimizations and special hard- mean of the efficiency of the application on all supported ware units can make hand-counted FLOPs inaccurate. platforms in a set (see Smith [152] for the logic behind Yang et al. demonstrate exactly how far off vendor es- choosing the harmonic mean). When one or more plat- timates of their hardware’s performance can be, as well forms are not supported, the metric goes to zero: as how different hand-counted FLOPs can be from mea- sured FLOPs. Ofenbeck et al. [119] similarly describe  |H| why using hardware counters is more accurate than (ven-  , if i ∈ H is supported P 1 dor) estimates, although still problematic for reasons that P(P a, p, H) = i∈H are outside the scope of this survey.  ei(a, p) 0, otherwise Yang et al. also note that we need to choose relevant ceilings when measuring performance portability. Using where a is the application, p is the problem/input, H is an unrealistic theoretical peak will give artificially bad the set of platforms, and ei(a, p) is the efficiency of appli- results. For example, if the application is not using fused cation a solving problem p on platform i ∈ H. Higher is multiply-add (FMA) instructions, the ceiling measure- better — PP will be high when an application ports well ment should also not use FMA. Knowing how close an ap- to all architectures in H, and low when it only ports well plication truly is to the platform’s most relevant roofline to a few (or none) of them. will help developers decide how and where to optimize

6 their code. 2.2.3 Other Metrics Other metrics for performance portability have been pro- Application Efficiency. Architectural efficiency can posed as alternatives or extensions to Pennycook et al.’s give a theoretical upper bound for application perfor- metric. This section will describe two recent proposals. mance, but does not tell us whether an application is actually efficient. Application efficiency can give us a PPMD. The PPMD metric is motivated by the experi- practical upper bound and does not require estimation, ences of Sedova et al. [148] investigating the performance but can be prone to bias. Sometimes there is no good, ef- portability of various molecular dynamics applications. ficient implementation, or the developers are unaware of The applications they looked at are all best-of-class (or an implementation that performs better than their chosen nearly so) and get high performance on many HPC plat- reference. Dreuning et al. note that the best implementa- forms (so they all score very highly in PP under appli- tion of an application may be from another programming cation efficiency). However, these applications got there model that developers are unfamiliar with [34]. There with high-effort ports and are thus not truly performance is also a long history in high performance computing of portable, since migrating to a new platform would likely using statistics to obfuscate application performance re- result in a great deal more work specific to that platform. sults or make numbers appear more favorable [7, 47, 131], Sedova et al. desired a metric that would take different which only makes matters more murky. source versions and program components into account, so they developed their PPMD metric. Platform Set Choice. Another criticism from Dreun- Sedova et al.’s metric is based on the sources of speedup ing et al. is the need for an additional metric for plat- in a particular code. If speedup comes mostly from form set diversity [34]. If an application does well on one portable program components, like standard libraries, the type of architecture (e.g., a Xeon Phi), then a platform application’s PPMD value will be high. If non-portable set containing only that architecture will give artificially components, such as CUDA kernels, are responsible for high P.P Indeed, in their original paper, Pennycook et al. most of the speedup, PPMD will be low. The mathemat- note that their metric is only useful when the platform ical definition of their metric is similar to Pennycook et set is known [137]. Researchers using Pennycook et al.’s al.’s, using the harmonic mean of the fraction of speedup metric need to carefully consider their choice of platform from non-portable components: set.  |Q| Ideally, applications would support every architecture, P , if G 6= Q and Q 6= ∅  Si(a, p) but there are often trade-offs between optimizing for per- PP (a, p, Q) = i∈Q formance and optimizing for performance portability – MD 1, if Q = ∅  what’s good for one platform is not necessarily good for 0, if G = Q another. Optimizing for one platform will either improve performance on all platforms, or improve performance on where G is the set of all program components that con- a subset of the platforms to the detriment of the others. 
tribute to speedup for application a on input p, Q is the Optimization is likely to improve performance portability subset of G that is non-portable, and Si is the speedup overall, but may actually decrease it, and there’s no way over baseline application performance that component of knowing what will happen beforehand. i ∈ Q is responsible for. Ideally, all program compo- Based on these observations, Dreuning et al. give three nents responsible for speedup will be portable, so Q will ways to improve an application’s performance portability: be empty and PPMD will be 1. PPMD doesn’t depend on any concept of “peak performance,” which is helpful – Add platforms to the set, especially ones similar to since measuring peaks is difficult, but assumes that pro- platforms it already performs well on, since the appli- gram components don’t interact with or influence each cation is more likely to perform well on those, and/or other and can be measured independently, which may require less work to perform well. not be the case. Sedova et al. calculated PPMD for each of their applications, and none scored particularly well; – Remove poorly performing platforms from the set, the best only got around 40%, which is because many of if there are sufficiently few of them and/or the opti- them use CUDA or other vendor-specific libraries (vendor mization effort required would outweigh the perfor- implementations of standard libraries can vary in perfor- mance benefits. mance). Sedova et al.’s PPMD metric might be a good companion to Pennycook et al.’s P,P since it does penalize – Do the work of improving performance on some or applications for using non-portable programming models, all platforms. whereas PP does not.

7 PD. Pennycook et al.’s metric gives different values for (see Sec. 3.3), going back to placing more of the bur- the same application on different inputs, which can make den on applications. This survey will discuss many other it difficult to see whether an application is truly perfor- potential solutions. mance portable, since different platforms might do better on different problem sizes. Daniel et al. [25] propose an 2.3 Productivity alternative metric to better cope with problem size vari- ations, which they call PD, performance portability di- Developer productivity is an important (though often vergence. They compute PD as the root-mean-square of overlooked) aspect and motivating factor of performance the relative error in performance compared to the “best” portability. If it takes years to port an application to a performance (using either architectural or application ef- high-performance portable model, that model is not as ficiency): useful as another that can be adopted more quickly (even P ∆RMS at the cost of lower performance), especially if the port P = i∈H D |H| cannot be done incrementally so the application can still be used in the meantime. Performance portable models s P δ(a, α)2 should increase developer productivity overall and allow ∆ = s∈S RMS |S| developers to spend less time writing (and rewriting) code during the porting . This section will discuss is- where a and α are the PP values for two applications on sues relating to defining and measuring productivity, as the same input, δ(a, α) is the relative error, S is the set of it relates to HPC and performance portability. all input sizes, and H is the set of all platforms. Ideally, an application will perform identically on all inputs across 2.3.1 Defining Productivity all architectures, and performance portability divergence will be zero. Productivity is a very subjective, qualitative concept and While this does give users more information on how an it means different things in different contexts, making it application behaves across problem sizes, Daniel et al.’s difficult to define and measure. An intuitive definition definition of application does not exclude programs with might be “getting higher performance with less time/ef- multiple source versions, so applications can still get good fort,” but how do we define (and measure) effort? PD scores by writing code specific to one architecture. A combination of this metric and PPMD, perhaps, might 2.3.2 Measuring Productivity be a better extension to PP than this metric alone, since There have been numerous attempts to measure software it’s already been decided keeping multiple code versions developer effort and productivity across both industry is not a sustainable solution. and academia, but measuring HPC developer productiv- ity is a very different problem. Most industry productiv- 2.2.4 More History ity measurement tools look at how much code a developer Past research placed the burden of performance porta- writes, but in HPC, developer productivity isn’t just time bility solely on the application, but more current work spent writing code – it’s time spent optimizing, tuning, has shifted it to languages, compilers, and other program- parallelizing, and porting existing code. ming tools [179]. For example, when the developers of the There are two main ways to measure developer effort Weather Research and Forecast (WRF) model were try- and productivity: direct logging (e.g., Wienke et al. 
[175] ing to make the application more performance portable and Harrell et al. [53]) and indirect approximations. across vector-based and RISC-based computers in the late Lines of code (LOC) and code divergence (how “differ- 1990s/early 2000s [106], they primarily looked into re- ent” code versions are) are common metrics for approxi- ordering the loops to improve memory access patterns, mating code complexity, and hence, developer effort, but not introducing a new library or compiler that would they don’t take into account how difficult writing a sin- do it for them.3 However, most application teams to- gle line of code can be (e.g., a complex OpenMP pragma day are looking for solutions that won’t require modi- versus a simple variable declaration) or development time fying their code (or at least, won’t require modifying it not spent coding, such as time spent making performance more than once), which (interestingly) may involve writ- measurements or designing new features with team mem- ing an application-specific performance portability layer bers [175]. It is not always clear whether these approx- imations are accurate, and they don’t include all of the 3Interestingly, they did discuss looking into source-to-source productivity data we want, which is why direct logging translators that would automatically reorder loops based on the approaches are also important. type of machine, but at the time decided not to since they were able to get satisfactory performance without translation. This is an Development time logs would be the best metric for avenue new work is exploring though; see Secs. 5.6 and 6 for more. developer effort, since they contain data on what devel-

8 opers were working on and what their results were, but machines) was to avoid “version bifurcation,” which Jou- this data is very difficult to collect. Developer diaries bert et al. equated with a loss of maintainability [68]. sound like a good idea in theory, but in practice there are Maintaining separate code paths or separate optimiza- numerous problems. They require a high level of commit- tions in a single version runs into similar problems. This ment from developers to collect the type, regularity, and is a good representation of the trade-off between program- granularity of data that would be useful, but this often ming for performance and programming for productivity doesn’t happen because developers don’t want to take the and maintainability [68] – one goal of performance porta- time, are inconsistent in their entries, or simply forget. At bility is to do away with this trade-off. the other end of the spectrum, automated activity mon- Harrell et al. suggest there is a need for a metric to itoring software that tracks data such as keystrokes and measure developer effort to achieve performance. Such applications used requires no commitment from develop- a metric could be used to penalize applications and pro- ers, but misses out on development activities that happen gramming models that require high amounts of developer away from the computer, such as planning and training, effort to obtain and/or maintain performance portability. as well as performance data. They note again that there are often trade-offs between Wienke et al.’s [175] direct logging method tries to com- programming model abstraction levels (a proxy for porta- bat these problems by creating a journal that pops up at bility) and performance. Models with higher abstraction predefined intervals (so it can’t be forgotten) and provides levels are usually easier to port to new architectures, but a short form with multiple choice questions (so a quick have lower performance. Models with low levels of ab- entry still has useful data) and an open-ended comment straction have better performance, but are much more section. However, their tool doesn’t collect performance difficult to port. A metric for productivity would help data very often, and doesn’t seem associate that data quantify this trade-off. with the particular code version it came from. Harrell et al.’s direct logging method [53] was moti- 2.4 Some Definitions vated by the desire to keep productivity and performance data with their code and better track changes over time, For the purposes of this paper, qualitative definitions of improving on Wienke et al. They wanted to minimize both performance portability and productivity will be disruption for developers, so they integrated the logging used, since this is not a quantitative comparison of pro- process with their projects’ version control system (git gramming models, but rather a discussion of how each commit). Harrell et al. used these logs to confirm their model is working towards performance portability and intuition about using approximations such as LOC and productivity and how these models can inform and im- code divergence as proxies for developer effort; in partic- prove each other. 
ular, they demonstrate that low divergence ports (e.g., This paper will consider a model performance portable adding OpenMP or OpenACC directives) are often less and productive if: expensive in terms of developer effort than high diver- – it supports most or all major architectures currently gence ports (e.g., rewriting all compute loops as CUDA in use and can be easily extended to potential fu- or OpenCL kernels).4 ture architectures (if an application team suddenly discovers they need to use another architecture, they 2.3.3 Productivity and Performance Portability can easily do so, regardless of what the architecture is); Harrell et al.’s primary criticism [53] of PP is similar to – it can achieve reasonable performance on each archi- Sedova et al.’s [148]: it does not penalize applications tecture it supports with minimal source code changes that keep separate code versions for each platform, which (configuration changes are acceptable, as long as they are difficult to maintain and require much more developer occur outside the code); effort to create. Indeed, one of the primary goals when developers were first porting applications to the Titan – it reduces the burden on developers compared to supercomputer (one of the first major accelerator-based other models (e.g., CUDA) in significant ways (e.g., shorter code, fewer code changes for porting, easier 4 As an example of this intuition, a survey of the CAAR teams, debugging, etc.). who were porting applications to Titan for the first time, revealed that 85% of the developers (most of whom were using CUDA, which is known for being difficult to adopt) felt the amount of effort to 2.5 Non-(Performance) Portable Pro- needed to get good performance on accelerator architectures was “moderate to high.” Many expressed interest in moving to directive- gramming Models based models like OpenACC [68], which is generally considered eas- ier to adopt, even though there was little evidence to back this This paper will often mention various non-portable pro- interest up at the time. gramming models (or non-performance portable models)

9 as comparison points for the performance portable mod- CPUs. OpenMP 4.0 began to add support for other ar- els discussed. These non-portable models have generally chitectures, including GPUs (mostly Nvidia GPUs) and been highly tuned for their one architecture, and can pro- Intel’s Xeon Phi accelerator, making OpenMP 4+ a good vide a good estimate of peak application efficiency and candidate for being performance portable, and this is dis- performance. This section will describe the main non- cussed in more detail in Sec. 5.1. portable models mentioned. 2.5.4 Intel TBB and ArBB 2.5.1 CUDA Intel’s Threading Building Blocks (TBB) [63] and Array CUDA [118] is Nvidia’s proprietary C++-based GPU Building Blocks (ArBB) [115] are two libraries designed programming language, which naturally only works on for high performance, but not necessarily performance Nvidia devices. CUDA allows developers to write com- portability. ArBB was a C++ library pute kernels in a subset of C++ that will be run on that showed promise for being performance portable, but a machine’s GPU. The downsides of CUDA are that it Intel discontinued it in favor of TBB (and Cilk Plus, see requires users to manually manage memory movement Sec. 4.1) shortly after it was introduced, so it was never between the host and GPU, and the parallelism model extensively studied [1]. TBB is another C++ library for for writing kernels can be non-intuitive for newcomers. , but as it only supports accelerators via This makes CUDA code difficult to read and debug, and writing kernels in other languages [174], and does not ap- porting an existing application to CUDA requires major pear to be making any moves to change this,5 it is only a changes and restructuring. Nvidia has released several task-based runtime system instead of a true performance libraries for CUDA, including cuBLAS (dense linear al- portable model, so it will not be covered here. gebra), cuSOLVER (linear system solvers), cuDNN (neu- ral networks), and Thrust [10], a high-level, productivity- oriented library containing general purpose algorithms, in 2.5.5 MPI and SHMEM an attempt to resolve these problems, however, CUDA is still not portable. MPI [109] is the most commonly used model for communi- cation between processors and nodes on supercomputers. 2.5.2 OpenCL It is based on a two-sided message passing model, where both processes need to be involved in sending and receiv- OpenCL [72] is a programming language similar to CUDA ing messages. MPI is highly portable, and will generally and developed by the Khronos Group, but as an open get high performance on any machine without source code standard, it is implemented for far more devices. OpenCL changes, making it very performance portable by some supports GPUs, CPUs, and FPGAs from multiple ven- standards. However, MPI alone doesn’t allow users to dors, including AMD, Intel, and Nvidia. OpenCL re- take advantage of all a machine’s hardware (like acceler- quires a great deal of boilerplate code to set up data ators) and has some scalability problems [166], meaning structures and kernels, and kernels are represented as MPI alone is not enough to write performance portable strings in the application, which makes debugging dif- exascale programs. ficult. Like CUDA, OpenCL also has many libraries ded- SHMEM, standardized by OpenSHMEM [128], is a icated to reducing the severity of these problems. 
Un- newer communication model, based on the partitioned fortunately, since OpenCL is so low-level (arguably more global address space (PGAS) model, that is gaining pop- so than CUDA), different code is required to get good ularity. SHMEM uses a one-sided message passing model, performance on different architectures. While OpenCL is where only one process puts (or gets) data into (from) an- portable, it is not performance portable. other process’s memory space, without interfering with the other process’s execution. Like MPI, SHMEM is 2.5.3 OpenMP 3 highly portable, and in some ways performance portable, but it also cannot access all a machine’s hardware on its OpenMP [124] is an open standard for directive-based own. extensions to C, C++, and Fortran that enables devel- Both MPI and SHMEM need to be used with an- opers to add parallelism to their applications by anno- other programming model to become truly performance tating their code with various pragmas. Early versions of portable, but beyond that, both are too low-level to be OpenMP (≤ 3) were solely for CPUs, and since it is much very productive programming models. Many of the per- simpler than almost every other parallel programming formance portable models below are built on top of MPI model available, OpenMP became very popular. Vendors heavily optimized their implementations, and as a result 5Intel may in fact be choosing to deprecate this in favor of OpenMP is generally very high performance, but only on DPC++ (see Sec. 4.6).

10 or SHMEM, but work at an even higher abstraction level to worry about optimizing the library because it’s done to be productive as well. by Broadway. Broadway has three general classes of op- timizations it can apply: (1) optimizations that always improve performance, (2) optimizations that depend on 3 Libraries the machine, and (3) optimizations that depend on the machine and the application. Many of the optimizations This section will describe library-based programming that fall under classes (2) and (3) affect performance models. For the purposes of this paper, libraries are portability. Class (2) optimizations can be turned on considered to be add-ons to existing languages, and not more or less mechanically, based on the target architec- themselves languages or language extensions, like CUDA ture, but optimizations from class (3), such as optimiza- or OpenACC. Categorizing some of these programming tions for parallelism granularity and degree, require more models as one thing or another can be difficult, so in some work. To better support these kinds of optimizations, cases we default to what the authors/developers consider the annotation language was being modified to support their model to be. dynamic feedback at run time – when there’s no clear “best” option for a function at compile time, Broadway 3.1 Overview will make multiple versions of the function a try each at run time, then pick the best. The annotation language There is a plethora of high performance domain specific can even support varying how the best version is chosen libraries, many of which are highly scalable and portable at run time, further enabling performance portability. to many architectures. This paper will not discuss them, As a proof of concept, the Broadway authors tested however it will briefly discuss some methods for making their work various applications using the PLAPACK li- these libraries more performance portable. It will also brary [2] and saw significant performance improvement discuss multiple kinds of libraries, and will give a brief on many functions, with modest or small gains on the overview of classifications for them. rest, which demonstrates that even a library that has already been heavily tuned, like PLAPACK, can bene- 3.1.1 Making Domain Libraries Performance fit from some machine- and application-specific optimiza- Portable tion. Domain-specific libraries are very helpful for application developers in that they allow for high levels of code reuse 3.1.2 Library Patterns and enable code portability. However, to remain portable and general, a single library can’t be optimized for every The libraries discussed in the rest of this section share architecture it might be used on; while domain libraries many points in common, such as the parallelism patterns may be portable, they are often not performance portable. and data structures they support. The merged OPL and A potential solution to this is proposed by the Broad- PLPP projects [69], while never fully finished it appears, way compiler [11, 48]. The Broadway compiler can op- contain the vast majority of these patterns and provide timize an application’s use of libraries by using informa- a useful framework for discussing them in the rest of this tion about the library stored in an annotation language. section. The annotations are created by experts and kept with the The OPL and PLPP projects (from here on, only OPL) library (no source modifications to the library are neces- were both pattern languages, which provided high-level sary). 
While creating the annotations might be time- interfaces and functions to describe parallelism in appli- intensive, the cost of creating and tuning them is amor- cations. The main goal of OPL was to allow users to tized over many uses of the library, and the annotations develop correct and efficient programs that could be writ- are transparent to users, who only need to switch out their ten quickly, which the authors of OPL equated to giving compiler. Broadway reads in those annotations along users better ways to expose parallelism. The OPL lan- with the application and library, then performs source- guage provided a hierarchy of patterns, where the top to-source translations on the application’s usage of the levels were structural patterns (how the application was library, for example removing multiplications by zero or organized) and computational patterns (classes of algo- replacing an expensive library call with a more special- rithms) which had no explicit parallelism. Ideally, a user ized, inexpensive call when conditions described in the wouldn’t need to go beyond these levels, but just in case annotations are met. OPL also provided lower level interfaces which did deal The goal of Broadway is separation of concerns between directly with parallel constructs. The lower levels of OPL compiler, library, and application developers; the com- included strategies (how to exploit par- piler doesn’t need to worry about what’s in the library, allelism), implementation strategies (program and data since it is explicitly told, and the application doesn’t need structures that deal with parallelism), and parallel execu-

11 tion patterns (how parallelism gets mapped to hardware, – Sparse linear algebra. e.g., SIMD execution, pools, etc.). The lowest – Finite state machines. level is the foundations of parallelism, or how concurrency is actually implemented, e.g., threading models, message – Graph traversal algorithms. passing, and memory coherency models. – Graphical models – similar to state machines, OPL included the following patterns (ones particularly but modeling a collection of random variables relevant to HPC in bold – many of these also show up (e.g., hidden Markov models). throughout the rest of this paper, not just in this section): – Monte Carlo – a statistical method for – Structural patterns: estimating unknown values. – N-body – calculating pair-wise interac- – Pipe and filter – data is passed through a tions between N entities. series of stateless operations. – Spectral methods – changing how data – Agent and repository – a central collection of is represented to make sense of it (e.g., data (e.g., a database) is modified by several Fourier transforms). independent units of execution. – Structured mesh. – Process control – a running process needs to be monitored, and actions might need to be taken – Unstructured mesh. if the process enters certain states (e.g., turning – Parallel algorithm strategies: on the heat if a thermostat says a room gets too cold). – Task parallelism. – Event-based – processes watch an environment – Pipeline or stream. for events and take some action when events – Discrete events – tasks interact rarely and inter- occur. actions are handled as events by an event man- – Model-view-controller – an internal data model ager. has interfaces for viewing and modifying it or – Speculation – tasks interact rarely and are run controlling its behavior (e.g., many web-based speculatively; if errors/conflicts are detected, applications). the tasks at fault are rerun. – Iterative refinement – run an operation – Data parallelism. until some termination conditions are met. – Divide and conquer – split a large prob- lem in to progressively smaller sub- – Map-reduce – apply a function to data problems that can be more easily solved and aggregate the results. and the solutions aggregated. – Layered systems – different software lay- – Geometric decomposition – each process ers provide interfaces to each other for in- is responsible for a piece of the prob- teraction but must remain independent. lem domain, and data is communicated – Puppeteer – a set of processes must in- to neighboring processes as needed. teract to work on a problem and need to be managed by a puppet-master. – Program implementation strategies: – Arbitrary task graph – A catch-all for ap- – Single program, multiple data (SPMD). plications which don’t quite fit the other – Data parallelism over an index space. structures. – Fork/join. – Computational patterns (reminiscent of Berkeley’s – Actors – units of execution are associated with 13 dwarves [5]): data objects, and method calls amount to mes- – Backtrack/branch and bound – search al- sage passing between them. gorithms for a large search space. – Loop-level parallelism – loops are struc- – Circuits – boolean operations or logic circuits. tured such that each iteration is indepen- dent and can run in parallel. – Dynamic programming – a more efficient alter- native for recursive algorithms. – Task queue. – Dense linear algebra. – Data structure implementation strategies:

12 – Shared queue. /* Dot product*/ – Shared map. /* create skeletons*/ SkelCL::Reduce sum ( – Partitioned graph. "float sum(floatx,floaty){return – Distributed array. x+y;}"); SkelCL::Zip mult( – Generic shared data – catch-all for data sharing "float mult(floatx,floaty){return that doesn’t fall into the previous categories. x*y;}"); As can be seen, this is a great deal for any single library to attempt to implement, and in practice, some of these /* allocate, initialize host arrays*/ float *a_ptr = new float[ARRAY_SIZE]; patterns are much more commonly used than others, es- float *b_ptr = new float[ARRAY_SIZE]; pecially in HPC (for example, meshes and linear algebra fillArray(a_ptr, ARRAY_SIZE); are much more common in HPC than dynamic program- fillArray(b_ptr, ARRAY_SIZE); ming problems or finite state machines). Therefore, most performance portability libraries only implement a subset /* create input vectors*/ of the most popular patterns, which tend to be those well SkelCL::Vector A(a_ptr, ARRAY_SIZE); suited to solving domain science problems in fields that SkelCL::Vector B(b_ptr, ARRAY_SIZE); utilize HPC techniques, like physics, chemistry, and biol- ogy. These problems mostly fall under the linear algebra, /* execute skeletons*/ mesh, n-body, and spectral patterns, and the strategies SkelCL::Scalar C = sum( mult( A, B that work well with these patterns. The next section will )); go into detail on how libraries have implemented these /* fetch result*/ patterns and strategies. float c = C.getValue(); 3.2 Libraries for Performance Portability Listing 1: SkelCL code sample. This section describes several libraries designed specifi- cally to provide performance portability across multiple domains. Scan (prefix-sum). Later works [159, 14] add MapOverlap (simple stencil), Stencil (more complex stencils), and 3.2.1 Skeletons AllPairs (apply a function to each pair of elements in Skeleton programming is a concept borrowed from func- two sets; n-body) skeletons to better support some linear tional programming, where higher-order functions can algebra applications. The two different stencil-like skele- take other functions as arguments to specialize their op- tons allow users more flexibility in defining their stencil eration. Since imperative languages are much more com- computations – e.g., MapOverlap uses padding around the mon in HPC, developers have begun implementing them edges of the vector/matrix to minimize branching, while in these languages as well [39]. Skeleton libraries tend to Stencil allows users to specify the number of iterations be modeled as arbitrary task graphs that provide various to run the stencil and various synchronization properties. communication and computation patterns (skeletons) as All skeletons can take additional arguments aside from higher-order functions. These skeletons generally imple- the defaults to give users even more flexibility. For exam- ment data parallel patterns, but not loop parallelism, and ple, if a user wants to define a generic “add x” function for make use of various distributed data structures. vectors, they can use the map skeleton plus an additional argument for x. SkelCL. SkelCL [155] (code sample in Listing 1) is a SkelCL also provides a generic vector class that works high level library for programming heterogeneous sys- with both CPU and GPU code and hides data transfers tems, built on top of OpenCL. 
The primary design philos- from users – the vector class will do data copies lazily ophy of SkelCL is to abstract away all the tricky parts of (only when data is actually used) to eliminate unneces- writing (multi-)GPU code to make programming easier. sary data transfers. The vector class can also automati- SkelCL provides a modest set of data parallel skeletons cally distribute data to multiple GPUs based on a set of for some of the computation patterns above, which users predefined data distribution patterns, including single can specialize with their own C-like functions, including (only one GPU gets a copy), block (split evenly among Map (apply a function to all elements of a collection), Zip all GPUs), and copy (each GPU has a full copy) [156]. A (apply a function to a pair of collections), Reduce (con- later paper [159] adds an overlap distribution that au- dense the elements of a collection to a single value), and tomates halo exchanges for stencils, as well as a generic

13 2-dimensional matrix class that supports the same distri- // Vector sum butions along the rows of the matrix. // define kernel All skeletons support all data distributions, and the template GPUs automatically cooperate on block and overlap T add(T a, T b) { distributed data structures. If the user doesn’t specify a returna+b; data distribution with multiple GPUs, each skeleton has } preferred distributions for their pre-defined arguments, // create vectors but the user or runtime can override this (for additional, skepu2::Vector v1(N), v2(N), user-defined arguments, the user must always specify a res (N); distribution since the runtime can’t reason about the se- mantics of user-defined arguments). // init skeleton auto vec_sum = skepu2::Map<2>(add ); – Portability: Since SkelCL is built on top of OpenCL, it supports the many architectures that OpenCL vec_sum(res, v1, v2); does, but is also prone to the same performance vari- ations as OpenCL across architectures. The Lift Listing 2: SkePU code sample. project (see Sec. 7.3.3) is the spiritual successor to SkelCL (written by some of the same authors), which attempts to remedy these problems. duce, MapArray (similar to zip), and MapOverlap (sten- – Performance: Across several performance tests on cil), which users could customize with their own func- various mini-apps (utilizing all their skeletons), tions. Scan and Generate (initialization) skeletons were SkelCL consistently had only 3-5% overhead com- added later [96]. All of these skeletons can support multi- pared to optimized OpenCL (and was sometimes GPU execution, and even multi-node execution, via an faster than naive OpenCL), and trailed behind MPI layer to manage SkePU instances on multiple nodes CUDA by the same amount as OpenCL (usually [96]. SkePU also supports passing additional arguments around 20-40% on Nvidia GPUs) [155, 156, 159]. to each skeleton via a user-defined struct, like SkelCL. SkelCL also appears to scale well across multiple Early implementations of SkePU [38, 96] had users de- GPUs, getting only slightly less than linear speedup fine their functions using macros, which would be ex- on a Gaussian blur stencil operation (1.9x for 2 panded into structs containing versions of the function for GPUs, 2.66x for 3 GPUs, and 3.34x for 4 GPUs in each of the three back ends, but SkePU 2 [41] updated this one experiment) [14]. to use C++ templates and a source-to-source translator – Productivity: While SkelCL code is generally much (based on Clang) to create the struct of function variants. shorter and easier to read than OpenCL or CUDA This new version removes all the type-safety problems of code (due to removing the extensive boilerplate that the original SkePU, as well as improving SkePU’s flexibil- OpenCL is known for) [156, 159, 14], SkelCL is ity and the compiler’s ability to optimize SkePU code, and still subject to some of the productivity problems giving the users the option to define functions as lamb- of OpenCL. OpenCL takes in kernels as strings and das or template functions, instead of based on pre-defined compiles them as needed at run time, and SkelCL macros. The authors intend to extend their source-to- does the same, which makes debugging kernels very source compiler to let users define different specialized difficult, since errors are never caught at compile versions of their functions, and automatically choose be- time. tween them at compile time based on the target architec- ture. 
SkePU 2 also added the Call skeleton (which simply – P3 Ranking: fair – has all the same performance calls a user function) for even more flexibility and changed portability problems as OpenCL, and a few of the the semantics of the Map skeleton to handle cases which same productivity problems, but is much higher- had previously been handled by MapArray. level. Like SkelCL, SkePU provides generic vector and matrix objects (which they call “smart containers” [26]) that hide data transfers from users, manage memory consistency SkePU. Not to be confused with SkelCL, SkePU [38, with an MSI-like model, and do data movement between 41] (code sample in Listing 2) is another skeleton library GPUs and nodes lazily to minimize transfers [38]. Even based on C++, with back ends for CUDA, OpenCL, and for non-GPU back ends, this is useful for decreasing com- OpenMP. Similarly to SkelCL, SkePU provides data par- munication costs. These objects also handle data distri- allel computation skeletons for Map, Reduce, MapRe- bution to multiple GPUs and nodes and can do auto-

14 mated halo exchanges, although unfortunately users have //Cannon’s matrix multiplication algorithm no control over how data is distributed, unlike SkelCL template [156] – data is always divided equally among GPUs/n- DMatrix& matmult(DMatrix& A, odes. Their most recent implementation [26] also has sup- DMatrix& B, DMatrix* C) { port for using Nvidia’s GPUDirect (an RDMA library) to //initial shift improve performance, support for partial copies of arrays A.rotateRows(&negate); and matrices, and an improved that B.rotateCols(&negate); can handle overlapping ranges on multiple devices/nodes for(inti=0;i< correctly. When integrated with the StarPU task runtime A.getBlocksInRow(); i++) { [6], these containers can even help the runtime determine DotProduct dp(A, B); data dependencies and ensure asynchronous tasks are launched and completed in a correct order. The SkePU //submatrix multiplication developers believe smart containers and other memory C->mapIndexInPlace(dp); abstractions are going to become more important for per- formance portability, since they allow the same code to //stepwise shift run on systems with and without separate (device) mem- A.rotateRows(-1); ories; see Sec. 7.2 for further discussion on containers. B.rotateCols(-1); } – Portability: SkePU supports back ends for CUDA, return*C; } OpenCL, and OpenMP, which allows it to target a wide variety of CPUs and GPUs [38]. In addition, SkePU’s multi-node capabilities allow the same code Listing 3: Muesli code sample. to run on a laptop or on a cluster without modifi- cation, which is extremely helpful for debugging and boosting productivity [96]. for debugging and testing [41]. – Performance: Initial performance tests [38] on GPUs The SkePU 2 developers did a further usability sur- showed that on some kernels, SkePU matched the vey [41] on their new template interface compared to performance of a cuBLAS implementation, although the original macro interface with a group of gradu- hand written CUDA was slightly faster on other ker- ate students. They introduced half the students to nels because of implementation differences. Scalabil- SkePU 1 first, then to SkePU 2, and reversed this for ity tests across multiple nodes and multiple GPUs the other half, then asked the students which ver- show that SkePU scales well for various combina- sion’s code was clearer on a simple example and on a tions of OpenMP processes and numbers of GPUs, complex example. For the simple example, students and also demonstrate the performance benefits of us- sometimes preferred SkePU 1, especially when they ing their lazy memory transfer mechanism [96]. Af- weren’t familiar with C++11 features (this led the ter implementing their new containers and memory SkePU developers to reconsider some design choices). management model, the benefits of this increased For the complex example, students generally pre- a further 20x, on average [26]. Preliminary perfor- ferred SkePU 2, which the developers believe is be- mance tests of SkePU 2 show it surpassing SkePU 1.2 cause the reduced macros and user function declara- on several tests already, even though it hasn’t been tions make the code easier to read. fully optimized. The results for SkePU 2 do seem to vary by base compiler though, which demonstrates a – P3 Ranking: good – can avoid OpenCL’s perfor- need for more mature optimizing C++11 compilers mance portability problems by using OpenMP or [41] (see Sec. 7 for more about this). CUDA, and multi-node capabilities are good. 
Pro- ductivity is also good, as long as the developer is – Productivity: One of the goals of SkePU is to make good with C++. parallel code appear sequential by hiding parallelism, data transfers, and synchronization inside the li- brary, which can greatly increase usability and pro- Muesli. Muesli [22, 39, 40] (code sample in Listing 3) is ductivity, since developers don’t need to learn or another C++ template library for skeleton programming. worry about parallelism or heterogeneous program- Unlike the other skeleton libraries listed here, Muesli was ming [96]. In addition, with SkePU 2, using a normal originally designed for multi-node systems, and has fo- C++ compiler instead of their source-to-source com- cused on multi-node scalability, only adding multi-core piler will give valid sequential code, which is useful and GPU support recently. Muesli has back ends for

15 MPI, OpenMP, CUDA, and combinations thereof. Muesli // Producer-consumer pattern(Collatz also has a Java implementation utilizing MPJ (MPI for conjecture) Java), Aparapi (a GPU programming interface), and Java pipeline(parallel_execution_thr{} OpenCL, although support for GPUs and Xeon Phis in // Consumer function Java is not as mature. [&]() -> optional { Muesli provides both task parallel and data parallel auto value = read_value(instream); return (value > 0) ? value : {}; skeletons, with the data parallel skeletons provided as }, member functions of their distributed data structures for stream_iteration( vectors, dense matrices, and sparse matrices. Developers // Kernel functions can use the task parallel skeletons, including Pipe, Farm, pipeline ( DivideAndConquer, and BranchAndBound, to define their [](int e){ return 3*e; }, desired process organization as a task graph. The data [](int e){ return e-1; } parallel skeletons can be used for computation and com- ), munication between processes, and include fold, map, // Termination function scan, zip, and some variants of these. All these skele- [](int e){ return e < 100; }, tons can be arbitrarily nested to build a scalable program. // Output guard function The developers have been careful to make each skeleton [](int e){ return e%2 == 0; }, ), highly scalable for each back end – see their description // Producer function of implementing the Farm skeleton [39] for an example. [&](int e){ outstream << e << endl; } Users can pass functions to these skeletons as either ); normal functions or as functors (currently, the GPU back end requires functors [40]). Another unique feature of Muesli is support for currying user (and skeleton) func- Listing 4: GrPPI code sample. tions, which allows users to create function objects that have been partially applied to their arguments, and ap- ply it to the remaining arguments later. This can help to worry about these low-level details. Furthermore, encourage code reuse and modular programming. Unfor- the developers noted that when turning sequential tunately, currying isn’t supported with the CUDA back code into part of a parallel Pipe, only minor changes end, but users can achieve similar functionality by modi- have to be made to the sequential code (mostly, plac- fying member variables in functors. ing it into a functor), so porting sequential code to Muesli is fairly low-effort [39]. However, lambda – Portability: Muesli can target most CPUs, Nvidia support could make this better, since functors are GPUs, and Intel Xeon Phis with C++ and many known for being verbose and difficult to understand CPUs and GPUs with Java. Xeon Phis are only for those not used to C++. supported in “native” exectuion mode, not offload mode. Additionally, GPU performance with Java When adding OpenMP support, the Muesli develop- often suffers because the underlying GPU offload li- ers also put effort into ensuring Muesli scales down brary Muesli uses doesn’t use the correct data layout as well as up – they wanted Muesli code written for to enable coalesced memory accesses. multi-core to be able to run on a laptop or cluster without OpenMP as well as the other way around. – Performance: Performance tests [21, 39] show that To do this, they added an OpenMP abstraction layer all of Muesli’s skeletons scale well on multi-node, inside the library so that it can be compiled with or multi-core system, and scalability is very similar without OpenMP support, and users can run code to handwritten MPI+OpenMP. 
The combination of that would normally be multi-threaded in a single- task and data skeletons also allows users to experi- threaded mode, for easy testing and debugging. ment with task vs. data parallel implementations to handle load balancing and other concerns. Compar- – P3 Ranking: good (C++), fair (Java) – C++ is quite ing the Java and C++ implementations shows that performance portable, Java is less so because of poor Java performance is generally similar to C++, ex- library support. Both are quite productive, it ap- cept on some GPU benchmarks, but this is likely due pears. to the problems with Java’s GPU library mentioned earlier. GrPPI. GrPPI [31, 30] (code sample in Listing 4) is – Productivity: Muesli’s skeletons abstract away all somewhat different from the other skeleton libraries de- threading and communication, so users don’t need scribed previously. While those libraries describe data

16 parallel skeletons, GrPPI describes skeletons for stream an internal locking mechanism the other two back processing, which falls under the pipeline pattern from ends lack. A later performance study [30] confirmed above. works under the assumption these results, including TBB’s low performance with that the full sequence of data is not available, and is in or without GrPPI, demonstrating that GrPPI adds fact constantly arriving and needs to be processed as it ar- negligible overhead. rives. This model of data as a (possibly infinite) sequence or stream has become more relevant as IoT and big data – Productivity: When trying out various paralleliza- have grown more prominent, but is also very applicable tion schemes for a video processing application, to HPC programming problems, such as condensing large the authors counted how many lines of code each amounts of data from a simulation so it will fit on disk programming model (C++ threads, OpenMP, Intel or performing in situ analysis. GrPPI is built on C++ TBB, and GrPPI) added to the sequential version. templates, and has multiple back ends, including C++ They found that, even for the most complex par- threads, OpenMP, and Intel’s TBB. allelization scheme, GrPPI added far fewer lines of GrPPI provides skeletons for working on streams of code – <5%, compared to 26% for TBB, which was data, including Pipeline (run a series of operations on second [31]. GrPPI has the added benefit of sup- each element in the stream), Farm (similar to map, run a porting multiple back ends without tying users to single function in parallel on all elements of the stream), any single one. A later study [30] confirmed these findings on other applications, and added that the Filter (run a test on all items in parallel, only items that 6 pass are put in the output stream), and Stream-Reduce cyclomatic complexity of GrPPI code is much lower (like normal reduce; run a binary function on all items than its back ends, with the occasional exception of in the stream to create a single aggregate result, e.g., TBB. The authors of GrPPI, however, say they find a running average) [31]. A later paper [30] adds fur- GrPPI code more structured and readable than TBB ther skeletons for manipulating the flow of streams to regardless. give users more flexibility and control. The new stream – P3 Ranking: poor, but promising – GrPPI is very skeletons include Split-Join (divides a stream into sub- productive, but doesn’t yet support enough plat- streams, either duplicating the input stream or putting forms or get high enough performance to be called elements into sub-streams round-robin, and later recom- performance portable. However, with more work it bines the sub-streams), Window (groups elements, based could be very performance portable. on how much some value has changed, how much time has passed, how many elements have been seen, or whether 3.2.2 Parallel Loop Libraries some specific element, e.g., a period in a text stream, has been seen), Stream-Pool (for modeling evolution of a pop- This section will discuss libraries based on parallel loop ulation), Windowed-Farm (combination of Window and abstractions. These mostly fall under the data parallelism Farm, passes groups of elements to a farm operation), and index space abstraction strategies, and many use and Stream-Iterator (apply a function to an item until C++ templating to provide a more generic interface. 
In some condition is met, a rare example of iterative refine- general, these libraries do not provide computational pat- ment from the list above). All of these skeletons, old and terns, and users must write all computation themselves. new, are fully composable, allowing users to build a full application out of only skeleton calls. Kokkos. Kokkos [36, 37] is a loop-based data paral- lelism library for performance portability that uses C++ – Portability: While the current back ends (C++ templating capabilities. Kokkos has back ends for various threads, OpenMP, and Intel TBB) allow users to other programming models, including CUDA, OpenMP, target most CPUs, and test different models to see and pthreads, which allow users to target a wide vari- which works best for their case, GrPPI currently does ety of architectures, such as Intel’s Xeon Phi and Nvidia not support GPUs. The authors note that they wish GPUs, without modifying their code. These back ends to add support for GPUs in the future [31]. can be swapped out with minimal changes to application – Performance: The GrPPI developers ran prelimi- code so users can easily run on multiple architectures. nary performance tests on edge detection and syn- Kokkos abstracts away parallelism so architecture- thetic benchmarks to test each of their skeletons specific optimizations exist outside the main application. against code written natively in each back end Kokkos’ abstractions can be loosely grouped into mem- [31]. GrPPI performance generally matched C++ ory and execution abstractions. Memory layout and ex- threads, OpenMP, and TBB well, although TBB ecution environment are the most common program ele- (with our without GrPPI) performed worse than 6This is a metric for code complexity based on the number of C++ threads or OpenMP, likely because TBB uses independent paths through a program [101].

17 // Heat conduction stencil(TeaLeaf) to target AMD GPUs as well. #define DEVICE Kokkos::OpenMP – Performance: The Kokkos developers tested their // initializea view CUDA and OpenMP back ends against native/hand- Kokkos::View p("p", x*y); written CUDA and OpenMP mini-applications on an Intel CPU, Intel Xeon Phi, and Nvidia GPU // lambda-style loop [37]. While they did need to tweak some settings Kokkos::parallel_for(x*y, KOKKOS_LAMBDA in Kokkos (particularly for CUDA), Kokkos’ per- (int index) { const int kk = index; formance was comparable to native OpenMP and const int jj = index / x; CUDA in most cases, even outperforming native CUDA in one case. Kokkos also scaled similarly // if in view interior... to native OpenMP across multiple CPUs and Xeon if (kk >= pad & kk < x - pad && Phis. These experiments demonstrate that Kokkos jj >= pad && jj < y - pad) { has low overhead and is at least as performance portable as the back ends it is built on. // recalculate value p(index) = beta*p(index) + A later performance portability study [100] confirms r( index ); these results, showing Kokkos having low overhead } versus native OpenMP on Intel CPUs, although some }); tuning was needed to get the same performance on IBM’s Power8 CPUs and the Xeon Phi, which was Listing 5: Kokkos code sample. very new when the first performance tests were done, and likely had better compilers by the time of the second test. The more recent study showed slightly higher overheads on an Nvidia GPU, but still com- ments that need to change for each new architecture, so parable to native CUDA. the Kokkos developers chose to abstract them away to decrease porting effort. A third study [133] shows Kokkos can have perfor- mance comparable to OpenCL on Intel and AMD Kokkos’ abstractions are multidimensional array views CPUs for some benchmarks, although Kokkos strug- (array layouts) [35], memory spaces, and execution gled on most of the benchmarks. On Nvidia spaces. The array views allow developers to change the GPUs though, Kokkos’ performance is comparable memory layout and access patterns to best suit the un- to CUDA and OpenCL on all benchmarks. derlying architecture at compile time, so the compiler can make inlining and vectorizing optimizations. Memory – Productivity: Kokkos was designed to be added in- spaces keep track of where data resides in heterogeneous crementally to large applications (100,000+ lines of architectures, and similarly, execution spaces contain in- code) to aid the porting process. Developers can add formation on the type of hardware code should run on. Kokkos abstractions one loop at a time and check Execution spaces only have access to memory spaces that that their results are still correct before moving on make sense; e.g., a CPU execution space cannot access to the next loop. This makes it less of a monumen- a GPU-only memory space, but can access a host burst tal effort to port such large applications, and since buffer memory space. All of these can be modified to Kokkos allows users to swap out back ends at will, add optimizations for a particular application on a spe- porting becomes worth the effort since future opti- cific architecture. For example, users could write their mizations and ports to new architectures will happen own tiled matrix layout and add it to an application by inside Kokkos, not the application code. merely swapping out array views. 
Kokkos keeps these configurations consolidated and out of application code, However, a productivity study including Kokkos which is excellent for portability and productivity. [100] have note that the functor programming pat- tern Kokkos uses for kernels is difficult to understand – Portability: Kokkos supports back ends for CUDA, and very verbose. The increasing number of compil- HPX, OpenMP 3, pthreads, and serial execution, ers that implement C++ lambdas greatly alleviates with back ends for OpenMP+offloading (4+) and this problem, since it allows kernels to be written in- AMD’s HIP and ROCm in progress [75]. The current line instead of in a separate functor. The authors back ends allow Kokkos to target a wide array of de- of this study still note, though, that Kokkos’ array vices, including Nvidia GPUs, Intel’s Xeon Phi, and views aren’t terribly intuitive, and the Kokkos API many CPUs. The new back ends will allow Kokkos could be improved by simplifying them.

18 // Heat conduction stencil(TeaLeaf again) they implemented a new index set type that corresponded // set up execution policty to a 2-d section of an array so they could use tiled array typedef RAJA::IndexSet::ExecPolicy< accesses, which improved TeaLeaf’s performance on sev- RAJA::seq_segit,// sequential on eral architectures. segments RAJA::omp_parallel_for_exec// OpenMP – Portability: RAJA has full support for serial, parallel for OpenMP 3, CUDA, and “loop” back ends, with > policy ; “SIMD,” OpenMP+offloading (4+), and Intel TBB back ends in the works [9]. The “loop” back end al- // Custom box segment for index set RAJA::BoxSegment box (/*...*/); lows the compiler to do its own optimizations with- RAJA::IndexSet inner_domain_list; out interference, and the “SIMD” back end forces inner_domain_list.push_back(box); vectorization. These back ends allow RAJA to tar- get most CPUs, Xeon Phis, and Nvidia GPUs. // lambda-style loop RAJA::forall(inner_domain_list, – Performance: Early performance studies of RAJA [=] RAJA_DEVICE (int index) { versus pure C+OpenMP on a proxy application p[index] = beta*p[index] + r[index]; showed that RAJA+OpenMP scales better than }); pure C, but has higher overheads with low thread counts, likely due to higher OpenMP overheads in- Listing 6: RAJA code sample. side RAJA rather than any problem with RAJA it- self. A later study [100] demonstrated that RAJA has very low overheads and performs almost as well as – P3 Rating: very good – the selection of back ends native OpenMP 3 for Intel CPUs and Power8 CPUs, and Kokkos’ performance with many of them make although more work is needed to reduce overhead it very performance portable. However, its produc- on an Nvidia GPU. RAJA also needs some index tivity could perhaps be better. space/memory access pattern tweaks to get better performance on most platforms.

RAJA. RAJA [61, 9] (code sample in Listing 6) is very – Productivity: Again, like Kokkos, RAJA was de- similar to Kokkos in that it is a loop-based, data paral- signed for incremental porting, and has the same lel C++ template library with multiple, interchangeable benefits as Kokkos. Similarly, better lambda support back ends, including CUDA and OpenMP. RAJA also ab- has helped RAJA move away from using functors stracts away parallelism (via memory and execution pat- for kernels and improved productivity, although even terns) with the goal of hiding architecture-specific code early on the developers had feedback from domain inside the library where users won’t have to interact with scientists that the RAJA version of their code was it. more readable and easier to maintain (in addition to RAJA’s chosen abstractions are execution policies, it- running faster because it was now parallel) [61]. In eration spaces, and execution templates. The execution addition, one study [100] prefers RAJA’s loop index policy defines where a kernel will run and which back end spaces to Kokkos’ array views, finding them more it will use for parallelism. Iteration spaces describe which intuitive. This study also found that porting a code indices of an object the loop will access, and are made to RAJA adds fewer lines of code than porting to up of index sets and segments. Segments contain indices Kokkos. that can be treated as a single unit, and index sets are sets of arbitrary segments. These index sets can contain – P3 Ranking: good, and promising – RAJA could non-contiguous indices, as can segments, which makes support more platforms, but it gets good perfor- them good for representing sparse data structures. RAJA mance on the platforms it does support, making it was originally designed to work well with both structured a promising candidate for being very performance (regular) and unstructured (irregular) mesh-based codes, portable. which is why this particular design was used. Execu- tion templates define the type of operation being done in // triad, functor version the kernel, such as a standard parallel loop (forall) and phast::vector a(n); reductions. Similarly to Kokkos, all of these can be mod- phast::vector b(n, 1.0); ified to suit an application’s needs. For example, when phast::vector c(n, 2.0); Martineau et al. were porting TeaLeaf to RAJA [100], // computation functor

19 template tomatically maps standard types to vector types (when struct triad_functor : appropriate) and overloads their operations with vector phast::functor::func_scal_scal_scal m128i on SSE architectures. Users don’t have to work { with vector types, PHAST does this transparently with T scal ; no user interaction necessary. _PHAST_METHOD void operator()( phast::functor::scalar& a, const phast::functor::scalar& b, – Portability: PHAST supports back ends for CUDA const phast::functor::scalar& c) and C++ threads, which allows it to target Nvidia { GPUs and most CPUs. The developers intend to add a = b + scal * c; more back ends to allow targeting a wider variety of } devices. }; // instantiate functor – Performance: An early performance test of PHAST triad_functor triad; [132] compared it to CUDA and OpenCL on multi- triad.scal = 3.0; core CPUs and Nvidia GPUs. For the CUDA back // run triad calculation end, they noticed that each GPU they tested with phast::transform(a.begin(), a.end(), b.begin(), c.begin(), triad); needed different parameters to get the best perfor- mance, which reinforces the necessity of keeping pa- rameter choices out of application code. PHAST ac- Listing 7: PHAST code sample. tually outperformed CUDA on one benchmark, but that was due to major implementation differences. When testing against two different OpenCL vendor PHAST. The PHAST library [132, 133] (code sample implementations (Intel and AMD), they noticed that in Listing 7) is based heavily on C++11 features, which each implementation performed very differently, de- let users write parallel code in terms of containers, itera- spite running the same code on the same hardware. tors, and STL-like algorithms – these algorithms are data They note that this is likely due to Intel providing parallel and similar to skeletons, but PHAST’s focus is better auto-vectorization. parallel loops, which is why it’s in this section. PHAST Another performance test [133] versus SYCL, code looks like normal C++ with library calls. PHAST Kokkos, CUDA, and OpenCL revealed that on mul- has back ends for CUDA, .Threads, and C++11 ticore, all models except SYCL get similar perfor- threads, so it can target Nvidia GPUs and most CPUs. mance on simple benchmarks, while on a full ap- The main design goal of PHAST was to hide architecture plication SYCL and OpenCL perform the best, al- and language details from users and manage the degree though this is due to SYCL and OpenCL’s ability to and type of parallelism, so the same code can run well on reorder data structures. PHAST intends to add this multiple architectures, while still allowing users to tune capability in the future. On GPU tests, PHAST is whatever they wish. PHAST allows users to set all the pa- on par with CUDA and OpenCL; SYCL does much rameters for every platform at once and will only use the worse but this is likely because its implementation parameters relevant to the current platform. These pa- for Nvidia architectures is very new. rameters are kept separate from user code, so paralleliza- tion and data mappings can be changed without modi- – Productivity: When PHAST was first introduced, fying application code. When no parameter values are the developers noted that it makes for much cleaner specified, PHAST tries to infer good values. code than CUDA while still giving users a similar To do this, PHAST provides STL-like parallel algo- level of control [132]. 
A later productivity study [133] rithms, including for each (the primary algorithm), copy, calculated several complexity metrics over PHAST, and find, some of which take user-defined functors to spe- SYCL, Kokkos, CUDA, and OpenCL on several cialize the algorithm. While PHAST tries to encourage benchmarks and confirmed that PHAST is much sim- writing platform- and architecture-independent code, it is pler than OpenCL and CUDA. PHAST scores sim- still possible for users to further optimize their functors ilarly to Kokkos and SYCL on these metrics; it is by adding different code versions with #ifdefs. PHAST more concise, but arguably a bit more complex. The provides four containers that all of these algorithms can PHAST developers intend to add support for lamb- work with: vector, matrix, cube (a generic 3-d data struc- das and the parallel STL extensions from C++17 to ture, not strictly a cube), and grid, which can be defined further improve productivity and code simplicity.

20 __targetConst__ double a; with small inner loops that are difficult to vectorize – they __targetEntry__ void scale(double* field) { wanted a solution that could (correctly) force vectoriza- int baseIndex; tion. __targetTLP__(baseIndex, N) { targetDP targets OpenMP for CPUs and Xeon Phis int iDim, vecIndex; (only supported in “native” execution mode), and CUDA for (iDim = 0; iDim < 3; iDim++) { for Nvidia GPUs. To provide parallelism, targetDP __targetILP__(vecIndex) defines the macros TARGET ILP and TARGET TLP for field[INDEX(iDim,baseIndex+vecIndex)] = instruction-level and thread-level parallelism, respec- a*field[INDEX(iDim,baseIndex+vecIndex)]; tively, that expand to OpenMP and parallel loop constructs or CUDA kernel invocations. These macros al- } low users to control vector length and parallelism degree, } and can help force vectorization for loops that may not return; otherwise be vectorized. They also provide a macro for } abstracting away memory layout, similar to index macros from other scientific codes that treat a single-dimensional targetMalloc((void **) &t_field, datasize); vector as a multi-dimensional array. This allows users to change memory layouts without modifying their code. copyToTarget(t_field, field, datasize); Even when targeting CPUs, targetDP maintains a host copyConstToTarget(&t_a, &a, sizeof(double)); vs. device distinction, since this makes implementing the macros somewhat simpler, removes some memory con- scale __targetLaunch__(N) (t_field); sistency and data race concerns, and doesn’t add much targetSynchronize(); overhead in their experience (though decreasing memory usage is a possible future improvement). For example, copyFromTarget(field, t_field, datasize); their copyToTarget macro will expand to cudaMemcpy targetFree(t_field); when targeting CUDA or a plain memcpy when targeting OpenMP. Listing 8: targetDP code sample. – Portability: Targeting OpenMP and CUDA allows targetDP to target many CPUs, Intel Xeon Phis (although they current only use “native” execu- In addition, PHAST can be used as a source-to- tion, which doesn’t require using OpenMP’s offload source translator if desired. PHAST can produce model), and Nvidia GPUs. code that can be given to nvcc or other standard compilers; a skilled programmer could use PHAST – Performance: Performance tests [44] showed that to generate a baseline implementation and then op- targetDP improved application performance on both timize from there. an Intel CPU and an Nvidia GPU, likely due to bet- ter vectorization, which does make applications more – P3 Ranking: fair, but promising – for being so new, performance portable. Vectorization also improved PHAST is doing well with performance, although it performance on a Xeon Phi [45]. However, the could support more architectures. It is also very pro- authors never compare their performance to hand- ductive, for developers who know C++. written CUDA or OpenMP on any architecture, so it’s difficult to say whether the performance is actu- ally good. targetDP. Unlike the other libraries listed here, tar- getDP [44, 45] (code sample in Listing 8) uses pure C – Productivity: While targetDP’s macros do allow (not C++) with preprocessor macros and library func- users to only write kernel code once, they must still tions to provide data parallel loops. 
targetDP’s macros manually manage memory transfers between the host and library are designed to be maintainable and exten- and “device,” which is challenging and time con- sible; they were originally designed for structured mesh suming, and one of the major criticisms of CUDA codes, and work best with regular codes, but are appli- and even other performance portable models like cable to other codes as well. The motivation behind tar- OpenMP and OpenACC. C preprocessor macros are getDP was to provide a single-source implementation for also not type-safe, which may lead to errors; SkePU mesh-based applications that could target both CPUs and 2 [41] deliberately moved away from macros to allevi- GPUs. targetDP was developed in conjunction with the ate this problem. However, if application developers Ludwig lattice application, which has several loop nests are unwilling or unable to migrate from pure C to

21 C++, they also need a solution, and targetDP is one standardize an adoption process, and came up with gen- option. eral advice for porting to performance portable models: – P3 Ranking: poor – doesn’t support many models – Put a build configuration system in place before fully and performance hasn’t been compared to any other migrating to a PPL to avoid additional refactors. models, but if an application needs C, it’s an option. – Include a tagging system for loops to let the build 3.3 Application-Specific Libraries system know which loops have PPL implementations for which interfaces; this will make porting go much In a recent paper, Holmen et al. [59] suggest a way for smoother. large legacy codes to incrementally adopt performance – Have a thorough testing apparatus in place to verify portability models by adding a dedicated, application- correctness pre-, during, and post-port. specific performance portability layer (PPL), which acts as an application-specific performance portability library. The Uintah developers also acknowledge that, while The primary goal of a PPL is to improve long-term porta- adding a PPL solves many problems, it also raises new bility for legacy code, or even a newer application that questions, including how to use domain libraries (e.g., is expected to have a long lifetime. A PPL insulates PETSc) in both the application and PPL simultaneously, users from changes in the underlying programming mod- how to manage increasingly complex build configurations, els by consolidating all performance portability-related how to make intelligent, optimal use of underlying pro- code into a single place – the application only needs to gramming models, and how to efficiently manage mem- port to the PPL once, and all future ports take place in- ory and execution while potentially using multiple un- side the PPL. This allows applications to experiment with derlying programming models. In some ways, adding a multiple programming models without disrupting users PPL is moving responsibility for performance portability and domain developers. If the user base for the appli- back onto applications, where it was originally (see Sec. cation’s chosen model(s) dies out (as the user bases for 2.2.4), which raises many more questions about the roles some early programming models did [52, 181]), the appli- of applications, compilers, and programming models in cation can migrate to new models in the PPL, not in the performance portability. application code. The PPL also allows an application to specialize its use of performance portable models. Holmen et al. describe the adoption of a PPL into the 4 Parallel Languages Uintah fluid-structure interaction simulator. The Uin- tah PPL is based heavily on Kokkos and contains paral- In an ideal world, sequential languages could automat- lel loop abstractions similar to Kokkos’ and RAJA’s (in- ically be parallelized to get high performance on mod- cluding an iteration range, execution policy, and lambda ern machines, but this is an exceptionally difficult prob- 7 kernel function), as well as application-level tags that de- lem, and debatably not one worth solving, so numerous note which loops support which back ends (e.g., CUDA explicitly parallel languages have been created instead. vs. 
OpenMP), and build-level support for selective com- These languages vary widely in their feature sets and par- pilation of loops that enables incremental refactoring and allel constructs, from high-level languages like Chapel to even simultaneous use of multiple underlying models. low-level languages like OpenCL. The barrier to entry for Currently, their PPL only supports Kokkos, but the au- new languages is higher than for all the other models dis- thors see no reason why it couldn’t also support RAJA. cussed in this paper, since developers of new applications While their PPL is very similar to Kokkos, the de- are reluctant to use a language that might not exist in a velopers didn’t want to directly adopt Kokkos because few years, and incremental porting to a new language is they wanted to preserve legacy code, simplify the abstrac- difficult if not impossible [177]. This section describes the tions for their domain developers, and make re-working main languages that have managed to break into HPC, their implementation or adopting another performance as well as some new contenders that show promise. portable model later easier. They’ve succeeded at some // quicksort of these goals already, since their domain developers have template liked the new parallel loop interface and haven’t had much void qsort(T begin, T end) { trouble with it, even though they don’t know much about if (begin != end) { parallel programming. The new portable implementation T middle = partition(begin, end, also demonstrates better speedup and scalability on both bind2nd ( CPU and GPU architectures. 7As some have noted [17], sequential and parallel programming Since the Uintah code base is so large, the authors did are fundamentally different, so why should we try to use the same several small-scale refactors to validate their PPL and tools for both?

22 less :: ing it onto the critical path instead of places it would value_type >(),*begin)); increase the theoretical serial execution time. cilk_spawn qsort(begin, middle); Cilk [43] used the spawn keyword to create new asyn- qsort(max(begin + 1, middle), end); chronous child functions, and the sync keyword as a local cilk_sync ; barrier to force the parent function to wait until all its } children had completed (all functions implicitly synced } // Simple test code: before returning). When a child function returned, its int main() { return value was usually stored into a variable in the par- int n = 100; ent. If a user wanted to do something more complex with double a[n]; return values, like a reduction, they could pass the value cilk_for (int i=0; i(cout,"\n")); children would return. return 0; } Cilk++ [87] used the same scheduler and very similar semantics, but changed the keyword to cilk spawn and cilk sync, added easier parallel loops with cilk for, Listing 9: Cilk code sample. and removed inlets and abort, replacing them with re- ducer objects. Reducers provide a non-locking mecha- nism to perform reductions and other operations by al- 4.1 Cilk lowing each child thread to have its own local view of an object, and on return combine these local views in a sen- The Cilk language [13, 43] (code sample in Listing 9) sible (potentially user-defined) way. Cilk++ also added started as an MIT project extending C for parallelism, support for exceptions. spun off into a startup that created Cilk++ to extend Cilk Plus [144] uses the same semantics as Cilk++ C++ [87], was acquired by Intel and upgraded to Cilk (cilk spawn, cilk sync, cilk for, reducer objects), but Plus (which worked with C and C++) [144], and has now adds support for vector parallelism (ILP) with simd di- been deprecated by Intel, but picked up by an open source rectives and array notation similar to Fortran’s. While project called Open Cilk [145]. Cilk/Cilk++ is a task- array notation may be more familiar to Fortran develop- based language that is unique in that its work-stealing ers, Robison [144] notes that the simd directive is more task scheduler provides provable performance guarantees. flexible and portable between compilers. Cilk Plus also In general, the philosophy of Cilk has been to allow users adds a tool to check for race conditions; if a program is to define what can be run in parallel, not what should free of race conditions, it’s equivalent to its serial elision. be run in parallel; the latter can lead to resource con- Open Cilk [145] is fully compatible with Cilk Plus and tention. The scheduler decides the parallelism granular- built on LLVM and Tapir (see Sec. 7.3). Currently, Open ity and when/where tasks should be run without user Cilk uses the Cilk Plus runtime built by Intel, but there intervention. are plans to write a fully open source runtime. The Open Cilk/Cilk++ is built on a source-to-source translator Cilk group intends to continue innovating and adding new to C/C++ and Cilk runtime calls, which support intra- features to Cilk. node parallelism, but leave inter-node parallelism to other models like MPI. Cilk/Cilk++ is a faithful extension of – Portability: Cilk is based on C and C++, and as C/C++; the serial elision (the program with all Cilk key- such it is highly portable to most CPUs. While Cilk words removed) of any Cilk/Cilk++ program is valid se- does not support GPU programming itself, it seems rial C/C++. 
that it could be combined with another programming The original language [13] used explicit continuation model, such as OpenCL or CUDA, to launch tasks passing to perform synchronization and pass values be- that run on the GPU. tween threads, but this was quickly abandoned in favor – Performance: Cilk’s work-stealing scheduler is of more traditional spawn and join semantics, since it highly scalable, both to high numbers of cores (pro- was too convoluted for most developers [43]. The up- dated version of Cilk still used the same provably efficient, 8The serial elision of inlets is not valid ANSI C, but it is valid work-stealing scheduler, however, which is based on the GNU C.

23 viding near-linear speedup, as long as there is suffi- // Regent advection task cient work) and to small numbers of cores (negligi- task simulate(zones_all : region(zone), ble overhead) [87]. In a performance study of Cilk zones_all_p : partition(disjoint, [114], Cilk was consistently one of the fastest models, zones_all), on par with TBB, which has been highly optimized. points_all : region(point), Cilk had an average of 15-20x speedup on 32 cores points_all_private : region(point), over serial execution; its lightweight tasks and auto- points_all_private_p : partition(disjoint, points_all_private), matic load balancing give it excellent scalability even conf : config) for dynamic problems that give other models trouble. where reads(zones_all, points_all_private), – Productivity: While Cilk is somewhat more verbose writes(zones_all, points_all_private) than other languages described below, it is still sim- do pler and shorter than many other models, including var dt = conf.dtmax TBB; its code size is only about 1.5x larger than var time = 0.0 Chapel, which is possibly the most succinct model var tstop = conf.tstop discussed in this paper (see Sec. 4.4) [114]. In while time < tstop do one study, Cilk’s development time was compared to dt = calc_global_dt(dt, dtmax, models like TBB and Chapel, and Cilk scored simi- dthydro, time, tstop) larly to Chapel. for i = 0, conf.npieces do adv_pos_full( – P3 Rating: good – performance portable and pro- points_all_private_p[i], ductive, but could add direct support for more ac- dt) celerators. end time += dt end 4.2 Legion and Regent end Legion [8] is a task-based programming system embed- ded in C++ and based on logical regions, which describe Listing 10: Regent code sample. the data usage of a function/task and can be used to discover dependencies and task parallelism. Regions and tasks can both be partitioned to discover even more par- allelism. Regions don’t describe the physical data layout of tasks on a (possibly distributed) system. The SOOP or location on the machine, but which data a task uses runs ahead of the actual computation to schedule tasks and its access privileges (e.g., read-only, read-write, etc.); and distribute data to amortize the overhead; since each physical data locations are managed by the Legion run- task describes which regions it uses and the privileges time. At any given time, a logical region might be mapped it requires, the SOOP can dynamically discover depen- to zero or more physical instances of its data. The Le- dencies and parallelism at run time to create a correct gion runtime manages where these instances are located schedule. The SOOP itself is pipelined so different stages on multi-node machines to minimize communication and discover dependencies and map tasks to increase paral- keeps instances containing the same data coherent. lelism further. After a task is mapped, it will execute The main choices Legion developers need to make are once all its dependencies have completed, and if it spawns how to group data into regions and how to map the pro- subtasks, those are sent to the SOOP for scheduling. gram onto the hardware. Legion provides a mapping API Regent [151] (code sample in Listing 10) is a task-based so users can specify rules for how Legion should map tasks language built on top of Legion – Regent code appears to and regions to hardware. 
This mapping is where most be sequential task calls, but the tasks are run in paral- machine-specific Legion code ends up, which makes Le- lel by the Legion runtime. Regent tasks will be issued gion very (performance) portable, since all users have to in program order, but the Legion runtime is free to re- do to port to a new machine is write a new mapping, not order and run tasks in parallel, as long as all data de- modify source code (Legion also provides a default map- pendencies are met. Regent is implemented as a source- ping to get applications up and running quickly). The to-source optimizing translator to Legion. As in Legion, mapping only affects performance, not correctness, and users define tasks with the logical regions they use and makes it very easy to compose different Legion applica- what privileges they need on those regions; however, Re- tions, which can all use their own custom mappings. gent automates much that Legion does not. Regent only Legion’s execution model is based on a software out-of- exposes the logical level of Legion to users, hiding low- order processor (SOOP) that schedules and runs a series level details, including managing physical instances and

24 partitioning regions, which greatly reduces the amount of X10 adds support for distributed systems with places, code users need to write. The Regent translator performs which are logical processes that contain a collection of several optimizations on the Legion code it generates, in- (non-migrating) data objects and asynchronous activities cluding passing data as futures to avoid blocking, branch (tasks) that operate on that data – somewhat similar to elision, and vectorization, which help Regent get closer to Regent’s regions, but more like Chapel’s locales. While hand-written Legion performance. the data and computations inside places don’t migrate, the X10 runtime is free to move places as needed for – Portability: Since Regent/Legion is based only on load balancing. Tasks are similar to Cilk, but are cre- C++, it can run on a wide variety of CPUs. How- ated with the async keyword, and unlike Cilk, tasks may ever, the only GPU back end supported is CUDA, so terminate before their children. X10 handles this with accelerators are limited to Nvidia GPUs. finish blocks, which force a task to wait until all chil- dren spawned inside the block have completed before re- – Performance: Legion alone can get within 5% of turning itself; the main function is implicitly wrapped in hand-written application code when configured us- a finish block. The finish block also handles any ex- ing pthreads for CPU code, CUDA for GPU code, ceptions thrown by its child tasks. and GASNet for communication, and when run on a X10 has foreach and ateach loops that launch tasks single node. It also has good strong scaling [8]. With for parallel iteration over arrays within a place and across optimizations turned on, Regent gets similar perfor- places, respectively. Arrays can be mapped to multiple mance [151], however, there are optimizations that places with a distribution attribute, but the distribution have not yet been implemented that could further im- cannot change during the array’s lifetime. X10’s memory prove performance. Users must be careful when con- model is “globally asynchronous, locally synchronous;” figuring Legion/Regent’s CPU usage, though, since within a place, data accesses are synchronous and or- assigning too many threads/cores to the application dered, but to modify data in another place, a task must leaves too few for Legion to use, and the application launch another asynchronous task in that place, which and Legion will interfere with each other execution, has weak ordering semantics. degrading performance. To give users more control over this, X10 provides – Productivity: Legion alone requires a great deal of atomic regions and clocks for synchronization. Atomic code, but Regent can reduce this by half while also regions can contain multiple statements, and will behave decreasing code complexity. Both Legion and Regent as if those statements were executed atomically, with no are highly scalable (Regent perhaps a little less than interruptions. Clocks are a generalized barrier; at any Legion), so codes can be tested on small machines time, a clock is in a certain phase and can have multiple and scaled up to larger machines with little work. tasks registered on it. Tasks can indicate to the clock they Keeping the task and region mapping out of source would like to move to the next phase, and will block until code helps this a great deal. all tasks registered on that clock are ready to go on. 
Inter- estingly, if a program only uses async, finish, foreach, – P3 Rating: good – Legion is high performance, and ateach, clocks, and atomic regions, X10 guarantees that Regent is productive, but they could directly support the program will be free, which is a feature none more accelerators to be truly performance portable. of the other languages described here provide. Habanero-Java (HJ) [16] (code sample in Listing 11) 4.3 X10 and Habanero-Java is a teaching language for parallelism built on X10. Like X10, HJ is focused on being safe, productive, and effi- X10 [19] is an object-oriented (OO), task-based PGAS cient. HJ uses the same abstractions as X10, although it language based on Java and developed by IBM specifi- upgrades clocks to phasers, which are even more general cally to be high-productivity. The X10 developers were barriers. Tasks can register on phasers in different modes: dissatisfied with the productivity of other HPC languages signal-only and wait-only modes can be used to achieve and with OO languages’ support for cluster computing, producer-consumer behavior, or signal-wait mode can be so they developed X10 to solve those problems. The used to get the same behavior as X10 clocks. HJ also main design goals of X10 were safety (eliminating cer- adds array views similar to Kokkos array views, which tain classes of errors), analyzability (for compilers, tools, allow users to view a one dimensional array as multidi- and users), scalability, flexibility, and performance. They mensional using a custom index space. decided to use Java as a starting point because it already had most of those features, but its parallelism support – Portability: The JVM and runtime X10 and HJ are was clunky and slow, so they replaced it with their own built on are very portable, enabling both to run on parallelism constructs. most CPUs. In addition, X10 can call GPU code

25 // Population simulation initVectors(B, C); void sim_village_par(Village village) { // Traverse village hierarchy // an array of timings finish { var execTime: [1..numTrials] real; Iterator it=village.forward.iterator(); // loop over the trials while (it.hasNext()) { for trial in 1..numTrials { Village v = (Village)it.next(); const startTime = getCurrentTime(); async seq ((sim_level - forall (a, b, c) in zip(A, B, C) do village.level) a = b + alpha * c; >= bots_cutoff_value) sim_village_par(v); execTime(trial) = getCurrentTime() - }// while startTime ; [...] } }// finish const validAnswer = verifyResults(A, B, } C); } Listing 11: Habanero-Java code sample. Listing 12: Chapel code sample.

(and C or Fortran) via its extern keyword, although 4.4 Chapel the GPU code must be written separately. Chapel [17, 18] (code sample in Listing 12) is a parallel – Performance: While the performance of X10 and HJ language from Cray, built on top of a source-to-source has not been as extensively studied as some other translator to C. Like X10, Chapel is designed to be highly languages, X10 has been shown to get over 80% of productive and high performance (in that order), and to the performance of the optimized versions of some remedy many problems with existing parallel languages benchmarks [165]. and programming models. In brief, Chapel is intended to be as programmable as Python, as fast as Fortran, as – Productivity: A productivity study was done on X10 scalable as MPI, as portable as C, and as flexible as C++. [19], measuring the effort to port a serial Java bench- Chapel’s specific goals are: mark suite to parallelism and dis- tributed memory parallelism. They used LOC and – provide a global-view PGAS programming model source statement count (SSC) as proxies for produc- where users express their algorithms and data struc- tivity, since LOC can vary by programming style but tures cohesively, at a high level, and map them to SSC doesn’t, and compute the change ratio for both. hardware separately (as opposed to a “fragmented” Using existing Java features, the total code change model, where users write individual kernels specific ratio to achieve parallelism is 45- to the hardware and must manage many low-level 73%, which is extremely high. With X10, the change details, like CUDA9); ratio is only 15-23%, which is much less effort. X10’s deadlock free properties and other OO features also – support generalized, multi-level, composable paral- help productivity. lelism – both task and data parallelism;

– P3 Rating: fair – X10/HJ is a teaching language now, – separate algorithms from implementations, so user rather than a production language, so performance code is independent from the underlying hardware and portability aren’t as critical. configuration and data layout in memory (e.g., dense and sparse matrix multiplication can be done with the same code); // triad benchmark proc main() { – provide more modern language features, such as // make domain (index set) object-oriented programming, function overloading, const ProblemSpace: domain(1) dmapped 9The Chapel developers believe the prevalence of fragmented Block(boundingBox={1..m}) = {1..m}; models is one of the primary reasons remains so difficult. They do admit that writing compilers and runtimes for // create and init distributed vectors fragmented languages is much easier than for global-view languages, var A, B, C: [ProblemSpace] elemType; though.

26 garbage collection, generic/template programming, plans to provide various locale types itself (an experimen- and type inference; tal Xeon Phi locale was in the works as of 2018), but being able to define their own locales can help users get up and – provide more data abstractions than simple dense running more quickly than waiting for official language arrays and structs, like sets, graphs, and hash tables; support. Data parallelism is done with domains, which are in- – provide execution model transparency, since develop- dex sets that can contain traditional ranges of integers or ers can write better code when they have an idea how arbitrary types to provide key-value, hash map-like be- what they write will map to the hardware (for par- havior. Domains can be partitioned arbitrarily into sub- allel languages, this includes data distribution and domains, or a generic array slice. Domains can be divided communication); among locales with distributions; Chapel provides some – interoperate with legacy code, since there’s no way default distributions like block or cyclic, or users can de- all legacy code will be rewritten in a new language fine their own. Chapel’s forall loop can iterate over and users don’t want to throw away the effort that (sub)domains, arrays, or any expression that produces a went into writing and testing it; sequence or collection. If a distributed domain is part of the loop, Chapel will automatically distribute the com- – be portable; putation to all the relevant locales. Since domains and distributions abstract array accesses, this enables separa- – be high performance. tion of algorithms and implementations, as desired, and improves performance portability. To change to a new To achieve these goals, Chapel provides a global-view, architecture or data model, all users need to do is swap block-imperative structure, similar to C or Fortran, but distributions; no code changes are necessary. with rather different syntax. The Chapel developers de- Task parallelism is done with begin and cobegin liberately moved away from C- and Fortran-like syntax statements that launch tasks to run in parallel. These to make it more difficult for their users fall into old C tasks can perform one-sided, SHMEM-style communica- and Fortran habits and to force them to think about tion with synchronization variables that use full/empty the code they’re writing. While many languages tar- semantics – when a task writes to the variable, it be- get a “lowest common denominator” execution model, comes full and a task waiting on it can read it so it be- Chapel decided to target a more sophisticated, multi- comes empty again (users can modify these semantics as threaded model whose more complex components can be well). Chapel also provides support for atomics. emulated in software on systems that don’t fully support All these parallelism constructs can be composed arbi- them (albeit with a potential performance hit, which was trarily to provide highly nested parallelism. One of the later significantly reduced [18]), which improves portabil- Chapel developers’ complaints about other parallel pro- ity by allowing for higher level abstractions. 
Even though gramming models is that they only provide one or maybe Chapel is focused on providing high-level abstractions for two levels of parallelism before another model needs to the sake of productivity, it provides multiple lower-levels be brought in (e.g., using MPI+OpenMP+CUDA to pro- of abstractions for performance tuning, which they call gram a multicore+GPU cluster in the days before Open- “multi-resolution programming.” Chapel operates on the ACC and OpenMP 4); Chapel does not have this prob- “90/10” rule: if 90% of a program’s time is spent in 10% lem. The developers note that specifying what can be run of the code, the user should be able to optimize that 10% in parallel is distinct from specifying where it can run; this while keeping the other 90% simple – the common case philosophy allows Chapel to be highly portable, as well, 10 should be fast (to write). to multicore, multi-node, and heterogeneous systems. Chapel abstracts away hardware configurations as lo- In addition, Chapel provides all the modern features cales, where each locale is roughly equivalent to a node listed above, and then some, including: objects and in a cluster, where one or more cores all share the same classes, garbage collection, generic programming via la- memory. Chapel provides users with an array of locales tent types/type inference and type variables (similar to that they can partition and rearrange to mimic the logical C++ template types or Haskell type variables), iterators, arrangement of their computation. To cope with chang- currying, and modules/namespaces. Chapel is also inter- ing node architectures, Chapel added support for user- operable with C and Fortran, two of the most popular defined locales, which describe how to manage memory languages in HPC. and computations within a locale, so novel architectures Chapel has been extended to target Nvidia GPUs [150] can easily be supported with high performance. Chapel (while remaining performance portable to multicores) by 10This is similar to OPL’s (Sec. 3.1.2) philosophy of providing providing new domain distributions, a new back end many abstraction levels. for CUDA, and a set of transformations so that code

27 meant for GPUs still has high performance on multi- runtime in general. The Chapel developers are inves- cores.11 Chapel’s GPUDist domain distribution indicates tigating more optimizations to improve scalability, data should be moved/stored in GPU memory and that which lags slightly behind MPI at very large (1000+) computations should happen on the GPU; forall loops node counts. It was never Chapel’s design limiting over GPU data will map each loop iteration to a sin- its performance, only the implementation. gle GPU thread. GPU variables are created the same as normal variables, so no code changes are necessary to Chapel also offers users some compile-time support port to GPUs beside changing the distribution. Chapel for optimizing their programs via the void type, can automatically handle data movement to the GPU; if which has semantics quite different from void in the user declares a variable with a GPU distribution, it other languages [18]. Chapel supports setting vari- will be available on both the device and host, and the able types dynamically based on compile-time pa- Chapel compiler will do data dependence analysis to de- rameters, and if a type is void, all code that uses it termine when to copy it. If the user wants more control is ignored and not compiled. This gives users a way over data movement, they can change the distribution to specialize code while maintaining a single source, to GPUExplicitDist and perform data movement them- which also improves productivity. selves by creating a host and device version of the array Chapel’s CUDA support [150] also gives performance and assigning one to the other. Users can also specify equivalent to native CUDA, when explicit copies are that data should live in a special GPU memory (e.g., a used. Implicit copies add some overhead because fast read-only texture cache) by changing the distribu- their analysis is conservative. Their compiler is also tion. capable of compiling code meant for GPUs into mul- More recently, a new multi-level GPU interface was ticore code and getting excellent performance (much proposed [54, 55] that supports both Nvidia and AMD better than the other CUDA-to-multicore compilers GPUs and gives users more control over their device code. of the day), which makes Chapel very performance The new interface gives users several options, from writ- portable. ing CUDA or OpenCL kernels and managing memory transfers themselves to writing a kernel in a mid-level lan- – Productivity: Chapel is much shorter than other par- guage that Chapel translates to GPU code. This places allel code [114] (particularly CUDA [150]), and is also the GPU interface more in line with Chapel’s “90/10” cleaner and easier to understand. While Chapel was rule. If users wish to further optimize their most impor- lacking in traditional language features early in its tant GPU kernels, they can now do so. development since the team was more interested in its research features, that has since been remedied – Portability: While Chapel was originally designed and these features have been added and/or much im- by Cray for their multi-node multicore systems, it proved. The semantics of the language also under- has always been intended to be very portable, even went some changes to make things work as users ex- to non-Cray machines [18]. Support for Nvidia and pected. For example, previously, variables declared AMD GPUs has been added via CUDA and OpenCL outside parallel constructs were treated as global back ends and GPU data distributions [150]. 
User- variables and liable to cause data races, but now defined locales (described earlier) have also been these variables are treated as being “passed in” and added, making Chapel extremely portable [18]. each thread gets a private copy (users can modify this – Performance: Early on in the Chapel development with task/forall data intents if they desire different process, performance and scalability were not high behavior). Automatic communication is also a great priority, and thus Chapel often performed much productivity boost, since users don’t have to worry worse than other models [114]. However, as Chapel about managing those details. Porting between ar- began to mature, performance and scalability be- chitectures is also incredibly simple and requires very came concerns, and since the end of the HPCS few code changes. Overall, Chapel has many high project in 2013, have been a major development productivity features. focus. Chapel has improved a great deal, to be comparable with MPI+OpenMP [18]. Much opti- – P3 Rating: excellent – Chapel provides users with a mization was done to their array implementation way to increase its portability, and its performance is (now 100+x faster), communication implementation generally very high, making it exceptionally perfor- (moved from single-item to bulk messages), and the mance portable, and very productive, if developers are willing to take on a new language. 11Chapel was later extended with an OpenCL back end to support AMD GPUs [20].

28 // Vector addition PACXX was implemented as a custom compiler13 [50] // declare vectors that uses the Clang front end and LLVM and modifies std::vector a(N), b(N), c(n); host code to launch kernels automatically – the user only needs to write C++ as they normally would. PACXX // create kernel as lambda compiles C++ down to SPIR (to target AMD devices) auto vadd = kernel([](const auto& a, const or PTX (to target Nvidia devices). The PACXX com- auto& b auto& c){ piler has been updated [51] to support just-in-time (JIT) auto i = Thread::get().global; compilation for kernels; the compiler first makes the host if (i >= a.size()) return; c[i.x] = a[i.x] + b[i.x]; executable, then partially compiles the device kernels and // specify launch configuration adds them to the binary to be JIT compiled at run time }, {{N/1024 + 1}, {1024}};); when more values are known and can be used to optimize the kernels (see Sec. 7.2.1 for more about JIT). // call kernel, wait for result std::future F = std::async(vadd, a, – Portability: PACXX has back ends for both Nvidia b); and AMD GPUs (a somewhat rare, but important, F. wait (); combination to support), and, via the C++17 ad- ditions to the STL, multicore and sequential CPU Listing 13: PACXX code sample. configurations. Since the C++17 STL standard was released, PACXX has modified its STL usage to be fully compatible (by adding a PACXX execution pol- icy; see Sec. 7.2.1 for more on the C++ STL) so 4.5 PACXX PACXX code can fall back to multicore or sequential PACXX (Programming Accelerators with C++) [50, 51] execution on platforms without a GPU [51]. How- (code sample in Listing 13) is a language built on C++ ever, since PACXX still requires users to specify ker- that simplifies accelerator programming by allowing users nel launch configurations and allows users to spec- to write pure C++ code (with STL-like library calls) ify architecture-specific attributes, it may require that is compiled and run on GPUs. The main goal of changes to source code or #ifdefs to port to another PACXX is to simplify GPU programming and address machine. various other problems of OpenCL and CUDA. PACXX – Performance: PACXX performance has been com- is based on C++14 and the C++14 STL additions with pared against pure CUDA and OpenCL [50, 51] as no syntax extensions, so PACXX code is valid C++ (un- well as Thrust, a productivity-oriented template li- like CUDA, which extends C++ with triple angle-bracket brary for Nvidia GPUs [51]. PACXX matches CUDA syntax). performance with a small overhead of around 3% on PACXX kernel functions are specified as lambdas or Nvidia GPUs, and matches OpenCL performance on normal C++ functions in the style of CUDA kernels (with Xeon Phis. OpenCL slightly outperforms PACXX a C++ interface to get thread indices and such), not as on Nvidia GPUs and Intel CPUs, but this is likely strings like in OpenCL, which is both difficult to debug 12 because the OpenCL implementation they tested and can pose a security risk. Kernel invocations are against used lower precision, non-IEEE compliant done with a library call to a templated function that takes arithmetic on the GPU and the Intel compiler per- the kernel function and launch parameters. PACXX ker- forms extra optimizations to the generated SPIR ker- nels also work with standard STL containers, atomics, nels. 
PACXX outperforms Thrust on several bench- and other functions to simplify kernel writing, as well as marks because it has better heuristics for launch con- allowing users to pass by reference, return values, and do figurations and JIT-ing kernels gives them better im- synchronization with futures. Like many of the libraries plementations of transforms and reductions. described earlier, PACXX’s implementation of STL con- tainers can automatically manage data movement via a – Productivity: As mentioned earlier, requiring users lazy copying mechanism without user involvement. C++ to specify kernel launch configurations can hurt pro- attributes can also be used to specify kernel launch con- ductivity by making users change their code for each figurations and which memory data should reside in to architecture. However, since it uses plain C++ and allow users to further optimize their code. Users can man- the STL, PACXX code is much shorter and easier to age accelerators and memory through PACXX-provided write and understand than CUDA and other GPU classes if they want even more control. programming models, especially for users used to

12The source code is available at run time, and is more easily read 13Which is why it is in the languages section, not the libraries or modified by a malicious party. section.
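For context on the C++17 STL execution policies referenced in the PACXX portability discussion above, standard usage looks like the following. This is plain C++17; the PACXX-specific execution policy mentioned in [51] is not shown, since only its existence (not its exact spelling) is stated here.

#include <algorithm>
#include <execution>
#include <vector>

int main() {
    std::vector<float> a(1000, 1.0f), b(1000, 2.0f), c(1000);
    // The execution policy is the only part of the call that says "how"
    // to run; swapping it (or, per the text, supplying a PACXX policy)
    // retargets the same algorithm to sequential or parallel execution.
    std::transform(std::execution::par_unseq,
                   a.begin(), a.end(), b.begin(), c.begin(),
                   [](float x, float y) { return x + y; });
}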

29 // Buffer example // Shared memory example int b[N], c[N]; // create host and shared arrays int* hostArray = (int*) // put data in buffers malloc_host(N*sizeof(int)); buffer B(b, range<1>{N}); int* sharedArray = (int*) buffer C(c, range<1>{N}); malloc_shared(N*sizeof(int));

// submit this toa device queue // submit this toa device queue myQueue.submit([&](handler& h) { myQueue.submit([&](handler& h) { // request buffer access auto accB = // create kernel B.get_access(h); h.parallel_for(range<1>{N}, [=](id<1> ID) { C.get_access(h); int i = ID[0];

// create kernel // access shared and host array on h.parallel_for(range<1>{N}, [=](id<1> ID) { // shared array will be copied accC[ID] = accB[ID] + 2; over; host array will not }); sharedArray[i] = hostArray[i] + 1; }); }); }); Listing 14: SYCL code sample with buffers. Listing 15: DPC++ code sample with shared memory.
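To complement Listing 15, the explicit-movement style discussed later in this section can be sketched as follows. The fragment mirrors Listing 15's simplified form (myQueue, N, handler, and the allocation calls are assumed to be set up as in that listing), so it is illustrative rather than exact DPC++ API.

// Explicit data movement with USM device pointers (illustrative sketch).
int* hostArray   = (int*) malloc_host(N*sizeof(int));
int* deviceArray = (int*) malloc_device(N*sizeof(int));

// ... fill hostArray on the host ...

// Stage data explicitly: a kernel reads the host-resident pointer once
// and writes it into fast device memory.
myQueue.submit([&](handler& h) {
    h.parallel_for(range<1>{N}, [=](id<1> ID) {
        deviceArray[ID[0]] = hostArray[ID[0]];
    });
});

// Later kernels touch only deviceArray, so their accesses stay on-device.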

C++. Furthermore, since PACXX supports type in- ference in kernels via the new auto type inside lamb- SYCL takes an explicit data parallelism approach, das [51], it allows users to write kernel templates that where the user declares what can be run in paral- work with many types, which can significantly reduce lel with Kokkos-style parallel for templated function code duplication and improve productivity. calls. Kernels can be written as lambdas, functors, OpenCL code, or loaded as SPIR binaries. SYCL can – P3 Rating: good, promising – if the need for launch output kernels at compile time as device-specific executa- configurations was removed, PACXX’s performance bles, or as SPIR, so kernels can be compiled at run time portability and productivity would improve greatly, for any device. To manage data movement, SYCL uses but it’s already quite portable and productive, espe- buffers, which are an abstraction over C++ objects, and cially for C++ developers. can represent one or more objects at any given time (as opposed to a specific memory location). Inside a kernel, users can request access to a buffer with a set of permis- 4.6 SYCL and DPC++ sions (similar to Legion/Regent privileges), and SYCL SYCL [182] (code sample in Listing 14) is a high-level will automatically handle data movement to and from language built on OpenCL and designed to improve the device based on those permissions. While this can be OpenCL’s usability so it can be used to write single-source cumbersome, it does allow SYCL to optimize data trans- C++ for accelerators. This means host and device code fers to the device. live in the same source file, which can be analyzed and SYCL kernels are run by submitting them to device optimized by the same compiler for better performance. queues, which allow for some task parallelism, since by SYCL is built on C++14 and intended to follow the C++ default these queues impose no ordering on the execu- specification as much as it can, so it does not change or tion of kernels submitted to them. Each device can also extend C++ syntax in any way (i.e., a CPU-only im- have multiple queues running kernels on it (although each plementation could work with any C++ compiler). It queue can only be bound to one device). The data depen- allows C++ libraries to work with OpenCL, and allows dencies and access permissions of each kernel are used to OpenCL kernels to work with C++ features, such as tem- enforce an ordering on each queue, or users can explicitly plates and lambdas. SYCL is intended to be easy to use specify dependencies between kernels. SYCL queues only while still giving the performance and control of OpenCL, specify which kernels can be run in parallel, but provide and focuses on intra-node parallelism, leaving inter-node no guarantees about what kernels will be run in parallel. parallelism to other models like MPI. Data Parallel C++ (DPC++) [142] (code sample in

30 Listing 15) is a language from Intel built on top of SYCL (e.g., with unified memory), and will hopefully con- (it’s “SYCL with extensions”). Intel’s new oneAPI [62] tinue to do so. is also built primarily on DPC++, with many additional libraries and interoperability with languages other than – P3 Rating: very good, promising – SYCL and C++. DPC++ is intended to be a place to experiment DPC++ solve many problems with OpenCL’s per- with new parallelism features that could be added to the formance and productivity which were preventing it C++ standard. Some of the features already added in- from being a truly performance portable model. All clude support for hierarchical parallelism inside kernels, that’s currently lacking is robust vendor support. ordered device queues, per-device versions of kernels, and unified shared memory. Unified shared memory is perhaps the most interesting of these. It removes the need for specifying data use per- 4.7 OpenCL missions or manually managing data movement by allow- ing users to create host, device, and shared pointers via While OpenCL is not performance portable in and of it- malloc host, malloc device, and malloc shared func- self (as noted in Sec. 2.5 and above), due to its popularity, tions. Device pointers are allocated only on the device there have been efforts besides SYCL/DPC++ to make and can’t be accessed by the host, while host pointers are performance portable implementations, which have been allocated only the host, but can be accessed by the de- somewhat successful. pocl [64] is one such implementa- vice without being transferred, so accesses are likely to be tion, which implements new LLVM passes inside Clang very slow. Data can be explicitly moved by copying a host to transform OpenCL kernel code meant for one archi- pointer into a device pointer inside a kernel, or vice versa. tecture into kernels that will have high performance on Shared pointers can migrate data automatically between another architecture. PPOpenCL [93] is another imple- the host and device whenever the data is accessed, so the mentation built on Clang that performs whole-program first few accesses are slow, but afterwards accesses can transformations across host and device code for the same happen from fast device memory. This greatly lessens purpose. the burden on the developer, so they can quickly proto- pocl enables kernels written for one device to achieve type an application, then optimize memory transfers as performance close to that of a device-specific kernel on necessary. another device. PPOpenCL can achieve a geometric mean speedup of 1.55x with a single source code over – Portability: Since SYCL and DPC++ are built the baseline OpenCL implementation on several archi- on top of OpenCL and C++, they are extremely tectures, even outperforming OpenACC on some bench- portable, and can be used to target any device marks. Combining pocl and PPOpenCL (PPOpenCL OpenCL can target (which is a great deal). does no kernel-only transformations, but pocl does) out- performs each individually, so while writing new OpenCL – Performance: SYCL and DPC++’s performance is code with the intention of running it on multiple devices theoretically bounded above by OpenCL’s, but it might not be advisable, existing OpenCL can at least isn’t subject to the same performance variations as achieve some degree of performance portability. OpenCL. 
While in the past SYCL’s performance has lagged behind OpenCL’s due to implementation difficulties [24, 133], it now performs similarly to OpenCL on many architectures [28]. However, as 5 Directive-based Models it is very new, some vendors do not yet support it, or the performance of their implementation could use Directives are language extensions, although sometimes improvement. they’re considered languages in and of themselves. Direc- tives are annotations for base languages, usually C, C++, – Productivity: SYCL and DPC++ are certainly more and Fortran, that allow users to give the compiler extra productive than OpenCL, since they remove most information, such as which loops can be parallelized or of the boilerplate OpenCL is known for and update what arrays need to be moved to the GPU. While some many features to be compatible with C++. How- [159, 132] consider directives to be low-level because users ever, some SYCL syntax (buffers in particular) is must still specify data movement and describe parallelism still verbose, and compared to other languages (e.g., themselves, directives are still more concise, portable, and Chapel), a great deal of code is still required be- high-level than models like OpenCL or pthreads. This cause users must write and launch individual kernels. section will describe the most popular directive-based DPC++ has been remedying some of this already models.

// Heat conduction (TeaLeaf)
// set up data environment, copy r and p to device
#pragma omp target data map(to: r[:r_len]) map(tofrom: p[:p_len])
{
    // describe loop parallelism
    #pragma omp target teams distribute collapse(2)
    for(int jj = pad; jj < y-pad; ++jj) {
        for(int kk = pad; kk < x-pad; ++kk) {
            const int index = jj*x + kk;
            p[index] = beta*p[index] + r[index];
        }
    }
} // p gets copied back

Listing 16: OpenMP code sample.

5.1 OpenMP

As described in Sec. 2.5, OpenMP [124, 125, 126] (code sample in Listing 16) began as a way to simplify programming shared memory CPUs, specifically targeting parallel loops with directives to tell the compiler how loops should be parallelized. (Version 3 also added directives for task-based parallelism.) Version 4.0 began adding support for heterogeneous, accelerator-based computing and performance portability features, a trend which has continued for Version 5.0 – both of these are discussed further below. OpenMP follows a more "prescriptive" programming model, where developers explicitly define how their code should be mapped onto the hardware for parallel execution, but is beginning to add some "descriptive" directives that give the compiler more freedom. (This idea will be revisited in Sec. 5.2 on OpenACC.)

OpenMP's original goals when it was created in the mid-1990s were to provide portable, consistent, parallel, shared-memory computing for Fortran, C, and C++, maintain independence from the base language, make a minimal specification, and enable serial equivalence [27]. OpenMP has mostly kept to these goals, with the exception of adding support for models other than data parallelism, which has made it difficult to keep the specification small, and abandoning serial equivalence as unrealistic for modern architectures.14 There is an ongoing debate about which parallelism models OpenMP should support, and exactly how many basic programming constructs should be added to OpenMP to support these other forms of parallelism; some expect that OpenMP will move closer to becoming a general-purpose language in its own right in the future [27].

5.1.1 OpenMP 4.x

As mentioned earlier, OpenMP's primary abstraction is parallel loops, and OpenMP provides directives for users to describe how their loops should be mapped onto parallel hardware. OpenMP 4.0 added target and map directives to denote that code regions and data should be offloaded to an accelerator, as well as various clauses to modify how code is mapped to the device. Version 4.0 also added the simd directive to force vectorization (auto-vectorization can vary by compiler and greatly influence performance) and improvements for task-based parallelism and error handling [27].

OpenMP 4.5 further improved accelerator support by adding unstructured data regions, so map directives can be moved into functions, which improves readability and usability. However, OpenMP 4.5 also modified how certain types of data (e.g., scalars) are copied onto the device, which can make porting between OpenMP 4.0 and 4.5 error-prone, as noted by Martineau et al. [98]. The 4.x specification also does not define support for copying pointers in data structures to the device ("deep copy" support), but some compilers still support it, which can make porting between compilers difficult.

OpenMP 4+ has been struggling with implementation support. Even though the 4.0 specification was released in 2013, it has taken many years for some compilers to offer even basic support for offloading directives, and performance can still vary wildly between compilers. As Martineau et al. note, developers who were used to consistent high performance from OpenMP 3 will be in for a surprise, and may be better served by waiting until compiler vendors have had more time to improve their implementations.

5.1.2 OpenMP 5.0

OpenMP 5.0 adds several new features specifically for performance portability and to make supporting multiple architectures easier. Version 5.0 adds metadirective, declare variant, and requires constructs that allow users to denote that some directives or functions should only be used on certain hardware. While semantically similar to preprocessor directives like #ifdef, these constructs give the compiler more information to reason about which version should be used, instead of naively copy-pasting code [136]. Version 5 also adds support for deep copying (to handle data structures with pointers, as mentioned above), iterator-based ranges, and interacting with the memory hierarchy, all of which are becoming more widely used in modern code.

14 A subset of OpenMP does maintain serial equivalence, but keeping to this subset severely restricts what programmers can do.
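A hedged sketch of how metadirective and declare variant are intended to be used follows; the directive names come from the 5.0 specification, but the exact context-selector spelling (e.g., whether the architecture name is quoted) varies between implementations, and the function and variable names here are invented for illustration.

// Pick an offload mapping when compiling for an Nvidia target and a
// plain CPU mapping otherwise, from a single source loop.
#pragma omp metadirective \
    when(device={arch("nvptx")}: target teams distribute parallel for) \
    default(parallel for)
for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];

// Supply a target-specific implementation of a base function: when the
// context matches an Nvidia device, calls to saxpy resolve to saxpy_gpu.
void saxpy_gpu(int n, float a, float *x, float *y);
#pragma omp declare variant(saxpy_gpu) match(device={arch("nvptx")})
void saxpy(int n, float a, float *x, float *y);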

32 Version 5.0 is also adding some more “descriptive” con- – Productivity: OpenMP is generally considered to be structs, such as loop (similar to OpenACC’s parallel very expressive for parallelism, but without requir- loop construct) and order(concurrent), so users can ing details like CUDA or explicit threading, and the opt to let the compiler make choices for them. One goal of porting process is usually straightforward [97]. The OpenMP 5.0 is not to interfere with threading and mem- new features also allow C and Fortran users to ex- ory models that are currently being added to the base press concepts that would require C++ templates or languages, so allowing the compiler (which should know excessive code with directives, which is a great pro- more about these models) to make more choices could be ductivity boost [135]. Unfortunately, OpenMP cur- helpful. rently requires different directives to get optimal per- formance on CPUs and GPUs, so it doesn’t quite pro- 5.1.3 Future OpenMP vide a single-source solution for applications (which is not good for performance portability), but new ad- OpenMP is still evolving, and there are many features ditions such as the metadirective construct amelio- that may be added in the future [27]. Some of these rate that to some degree. features include: data transfer pipelining, memory affin- ity, user-defined memory spaces, event-driven program- The OpenMP 5.0 specification was released quite re- ming, and lambda support. The standards committee cently, near the end of 2018, and as of now there are is also considering adding “free-agent” threads that can no compilers that fully support it, but some work has help with load balancing by joining parallel regions. been done evaluating the new features’ usability. A study of some additions to OpenMP 5.0 [135] found – Portability: Since OpenMP was very popular for that metadirective is good for describing simple shared memory programming before it ever began patters, like target versus parallel for (for GPU to be used for heterogeneous performance portable vs. CPU), but not for more complex directives where programming, a great deal of effort has been put users might want to turn specific clauses on or off into porting it to a wide variety of architectures. depending on the target. The declare variant di- OpenMP 3 had implementations for a wide variety of rective was also useful, but still creates the same ver- CPUs, and OpenMP 4 has added support for GPUs sion bifurcation problems as #ifdef, and users need and Xeon Phis. There are numerous implementa- a way to specify which version to call, if necessary. tions of OpenMP [127] from both academia and in- The authors don’t believe there will be much use for dustry, including ones from Intel [135, 136], AMD, the requires directive in production codes, but dur- Cray [79], PGI, IBM, GNU, and LLVM [12, 4, 129]. ing the development and debugging process it could have been very useful. – Performance: OpenMP 3 has many mature imple- mentations that get extremely high performance, but – P3 Rating: good – OpenMP is almost performance OpenMP 4+ is still fairly new and its performance portable, but its performance consistency needs more varies. In one study [98], Martineau et al. compared work, and it would be preferable if accelerator ver- six OpenMP 4.5 implementations (from Intel, Cray, sions didn’t need different directives. It is, however, LLVM, PGI, IBM, and GNU) on multiple applica- highly productive. 
tions and architectures, and found that each com- piler implemented the standard slightly differently – directives that gave good performance with one com- 5.1.4 OpenMP 3 to GPGPU piler and architecture didn’t with another compiler, As an interesting aside, even before OpenMP moved to even on the same architecture. officially support heterogeneous computing, there were ef- OpenMP 4+’s performance can match the perfor- forts to enable OpenMP-based GPU computing. Lee et mance of OpenACC, another directive-based model, al. [86, 83] created OpenMPC, a source-to-source trans- but usually falls short of CUDA’s performance [97, lator from OpenMP 3 to CUDA. Their translator took in 98]. This isn’t surprising, since CUDA is much more OpenMP code, transformed it so it was better organized low level, but with further optimization to GPU code for the CUDA programming model, then translated par- generation, it may be possible to match CUDA. Giv- allel loops into optimized CUDA kernels with the appro- ing OpenMP a fair comparison can also be difficult, priate data transfers. They provided extra directives so since support for the specification has been lagging, users could control and further optimize the translation which may interfere with results since developers into CUDA, if desired. Lee et al. compared their opti- have to hack around deficiencies; for example, this mized, generated CUDA against hand-tuned CUDA, and impacted Pennycook et al.’s [135] study of OpenMP’s found that the average performance gap was less than performance portability compared to Kokkos. 12%. They noted that the generated CUDA took signif-

33 // Head conduction(TeaLeaf) 5.2.1 OpenACC 2.x // set up data environment, copyr andp to device The original OpenACC standard had fairly basic direc- #pragma acc data copyin(r[:r_len]) tives for annotating parallel loops and describing data copy(p[:p_len]) movement. The 2.x versions added support for asyn- { chronous compute regions, function calls within compute // describe loop parallelism regions, atomics, an interface for profiling tools, and other #pragma acc kernels loop independent usability improvements, such as modifying the semantics collapse(2) of copy/copyin/copyout to include checking whether for(int jj = pad; jj < y-pad; ++jj) { data was already present to minimize unnecessary data for(int kk = pad; kk < x-pad; transfers. Basic support for user-defined deep copy op- ++ kk) { erations on data structures containing pointers was also const inst index = jj*x + kk; added, which is very important for scientific applications p[index] = beta*p[index] + r[ index ]; that use deeply nested data structures. } } 5.2.2 OpenACC 3.0 and Future Versions }// copyp back Version 3.0 (very recently released) adds more support Listing 17: OpenACC code sample. for multi-GPU configurations, which are becoming more common, and lambdas in compute regions. It also in- troduces somewhat stricter rules for which parallelism clauses can appear together. OpenACC is generally icantly less effort to make, demonstrating that directives adding more prescriptive options, such as loop schedul- can greatly increase productivity, if developers are willing ing policies and optimization directives (e.g., unroll), to to take a small performance hit (though hopefully in the allow users to make decisions where the compiler can’t. future, performance will be more similar). – Portability: OpenACC has many implementations 5.2 OpenACC from industry and academia, including PGI, Cray, GNU, OpenARC [84, 85], OpenUH [89], accULL OpenACC [121, 122, 123] (code sample in Listing 17) is a [143], and IPMACC [80]. Together, these implemen- directive-based model originally designed for GPU com- tations support most hardware currently being used, puting, although there are now implementations that tar- including most CPUs, Nvidia GPUs, AMD GPUs, get multicore, Xeon Phis, and FPGAs [84, 85, 180, 77]. and various FPGAs. However, very few individual OpenACC’s main goals were to enable easy, directive- implementations support all of these, so to move be- based, portable accelerated computing (which OpenMP tween hardware, users might need to swap compilers, didn’t have at the time) and to merge several individual which can be problematic for performance. efforts into a single standard. OpenACC came out of a – Performance: While performance varies by compiler combination of CAPS’ OpenHMPP [33], PGI Accelerator vendor [57] and parallelism construct (kernels vs. [181], and Cray’s extensions to OpenMP. Like OpenMP, parallel directives), optimized OpenACC generally OpenACC supports C, C++, and Fortran as base lan- achieves performance close (within 80-90%) to native guages. CUDA or OpenCL [57, 85, 134, 99]. Choosing the Unlike OpenMP, however, OpenACC follows a more correct optimizations can be difficult, however, and “descriptive” approach, whereas OpenMP is “prescrip- choosing an incorrect optimization can hurt perfor- tive.” OpenMP requires developers to explicitly define mance on some architectures. 
In particular, loop or- how parallelism is mapped onto hardware, while Open- dering to coalesce memory accesses, minimizing data ACC leaves most choices up to the compiler. There has transfers, and using a sufficient number of threads been much discussion about which approach is best for can have a large impact on performance. performance portability, but in truth (as noted by de Supinski et al. [27]), this is a false binary, and these – Productivity: Like OpenMP, OpenACC code tends models exist on a spectrum. While it would be nice to be significantly shorter and simpler than code from if users could add a few descriptive directives and have other models. Unlike OpenMP, however, OpenACC things “just work,” often that isn’t possible and prescrip- doesn’t require different directives to target differ- tive models are still necessary. OpenACC has recently ent architectures, so a single implementation of an begun adding more prescriptive constructs so users can OpenACC program can run on any architecture (al- have more control over tuning their code. beit with a potential performance hit). OpenACC’s

34 //XACC distributed array example units of execution (not necessarily compute nodes), and a // set up distribution distribute directive to divide the index range across the #pragma xmp nodes n(1, NDY, NDX) nodes. The align directive declares a global array that #pragma xmp template t(0:MKMAX-1, uses this distribution. Both XMP and XACC support 0:MJMAX-1, 0:MIMAX-1) uniform block, cyclic, and block-cyclic distributions, as #pragma xmp distribute t(block, block, well as general, arbitrary-sized block distributions. The block) onto n shadow and reflect directives declare halo regions on distributions and synchronize halos, respectively. With // apply distribution to array float p[MIMAX][MJMAX][MKMAX]; XACC, all data directives can be used to control data #pragma xmp align p[k][j][i] with t(i, j, allocated on accelerators as well. XMP has its own di- k) rectives for parallelizing loops, and XACC uses normal OpenACC directives. // set up halo region #pragma xmp shadow p[1:2][1:2][0:1] – Portability: XMP and XACC both have back ends for CUDA and OpenCL, and can therefore tar- // computation with OpenACC directives get most CPUs and accelerators, including Nvidia omitted for brevity GPUs, AMD GPUs, and Intel Xeon Phis. By sup- porting both MPI-style two-sided communication // halo exchange(ik-plane and jk-plane) and PGAS-style one-sided communication (via coar- #pragma xmp reflect(p) width(1,1,0) acc rays), XMP and XACC allow users to have a great deal of control over communication patterns that can Listing 18: XACC code sample. improve portability and performance when moving between architectures.

– Performance: Performance tests have shown that descriptive philosophy also arguably makes it easier XMP has better performance than similar PGAS lan- to learn than OpenMP, since it requires fewer extra guages like UPC [112]. Early performance tests com- clauses per loop to get good performance, although paring XACC without coarrays to MPI+OpenACC some consider the two to be roughly equivalent [99]. across various node configurations show that XACC – P3 Rating: very good – OpenACC is very portable, outperforms MPI+OpenACC on small problem sizes and (while its performance might not be quite as (even when MPI+OpenACC uses Nvidia’s GPUDi- high as other models), it is performance portable. It rect RDMA utility) and performs similarly to is also very productive. MPI+OpenACC on larger problems sizes [113]. Later tests with coarrays (and more mature Open- ACC implementations) show that XACC usually 5.3 XcalableMP and XcalableACC gets over 90% of MPI+OpenACC performance, and is often within 97-99% [162]. XcalableMP (XMP) [112] and XcalableACC (XACC) [113, 162] (code sample in Listing 18) are directive-based, – Productivity: MPI can be difficult to program since PGAS models built on the Omni compiler for C and For- it is low-level and users have to do all communication tran. XcalableMP is based on OpenMP and High Perfor- manually. XMP replaces these with high-level direc- mance Fortran, and is meant to be a replacement for MPI, tives that greatly simplify the process. XMP adds providing high-level directives for data movement instead less code than other PGAS languages like UPC, and of low-level functions. XcalableACC is an extension to XACC adds less code than MPI+OpenACC. XMP that modifies some XMP directives to work with Since XACC is very similar to XMP+OpenACC, de- accelerators and defines how XMP interacts with Open- velopers can use existing OpenACC resources so the ACC to support heterogeneous computing. Both XMP learning curve isn’t too steep, and porting from XMP and XACC support C and Fortran and can use CUDA to XACC is as simple as adding OpenACC directives. and OpenCL to target Nvidia and AMD GPUs. XMP Going from OpenACC to XACC is also simple, since and XACC also provide coarray notation (another form users just need to add XMP directives. of distributed arrays) for C and are interoperable with coarrays in Fortran 2008 and later (for further discussion – P3 Rating: excellent – XcalableMP and Xcal- of coarrays, see Sec. 7.2.2). ableACC are very performance portable and produc- XMP and XACC use template directives to define a tive, although they may not give the level of control virtual index range, node directives to define a set of some users want, being so high-level.
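For completeness, the loop-parallelization directive mentioned above (which Listing 18 omits for brevity) fits into the same pattern. The sketch below follows the XcalableMP directive names described in this section, but clause details are simplified and should be checked against the XMP specification rather than taken as exact syntax.

#define N 1024

#pragma xmp nodes n(4)
#pragma xmp template t(0:N-1)
#pragma xmp distribute t(block) onto n

double a[N], b[N];
#pragma xmp align a[i] with t(i)
#pragma xmp align b[i] with t(i)

int i;
// Each node executes only the iterations that template t assigns to it;
// the loop body itself is ordinary C.
#pragma xmp loop on t(i)
for (i = 0; i < N; i++)
    a[i] = 2.0 * b[i];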

35 can be serial or parallel. OpenMC calls tasks “agents,” // OpenMC example and parallel regions “accs.” Workers are groups of hard- // splitMIC0 into two workers ware cores that can be partitioned if the hardware al- #pragma omc worker name(MIC:0:0-29:1, lows it (e.g., only using some cores of a Xeon Phi), which "M00") #pragma omc worker is a unique feature that allows OpenMC to take advan- name(MIC:0:30-59:1,"M01") tage of task parallelism by running more tasks on the #pragma omc worker name(CPU:0:0-15:1,"P0") subdivided hardware. OpenMC considers all tasks ca- #pragma omc worker name(CPU:1:0-15:1,"P1") pable of being run in parallel, unless the user specifies dependencies. OpenMC’s memory model is very similar // launch task to OpenACC and OpenMP’s; all agents’ serial code runs #pragma omc agent// Agent1 on the host in a shared memory environment (but can { create private copies of variables as needed), and each PF(0); BR(0); SW(0); agent can launch parallel kernels via accs and is respon- } sible for managing communication and synchronization #pragma omc wait all with its accs. Agent and acc directives can have data for(i = nb; i < N; i += nb)// main loop { clauses, but OpenMC has no stand-alone data directives, // launch two more tasks on part ofMIC like OpenMP and OpenACC do. #pragma omc agent flag(i) on(M00,M01) OpenMC has several directives that OpenMP and priority(1)// Agent2 OpenACC have no equivalent for, and which could greatly Upd (i, nb); improve their performance portability and readiness for exascale. OpenMC provides an implemented-with #pragma omc agent deps(i)// Agent3 clause for acc regions that allows users to tell the com- { piler which programming model has been used, which PF(i/nb); BR(i/nb); SW(i/nb); is similar to the annotations used by the Uintah PPL } (see Sec. 3.3) and could be modified to let users pro- #pragma omc wait all vide multiple implementations of a acc region, similar to } OpenMP 5.0’s declare variant. Agents and accs can also be annotated with a priority clause that lets the Listing 19: OpenMC code sample. user give hints to the scheduler to determine when to run a region, for greater control and flexibility. Most inter- estingly, OpenMC provides a time clause for agents and 5.4 OpenMC accs that can be used for both performance monitoring OpenMC [91] (code sample in Listing 19) is a directive- and resiliency. If an agent or acc takes longer (or shorter, based model meant for the TianHe series of supercom- if the user chooses) than the specified time, OpenMC will puters, but applicable to other architectures as well, and issue a warning, so the user can monitor the machine with several unique features that could improve the pro- for abnormal behavior that might indicate a soft fault or ductivity of other directive-based models. Liao et al.’s other hardware problem, or check how consistently a new motivation for creating OpenMC was some unusual ar- kernel implementation outperforms an old implementa- chitectural features of the TianHe supercomputers. The tion. Since exascale machines will be larger and more er- TianHe machines use multiple types of accelerators, in- ror prone than their predecessors, this kind of embedded cluding Xeon Phis and GPUs, and the primary way of support for performance monitoring will be increasingly programming them was MPI+OpenMP+X, where X is an important going forward. 
accelerator-specific language; if developers wanted their – Portability: OpenMC only supports C as a base lan- application to use the different kinds of accelerators, they guage, and OpenMP and CUDA for acc regions, had to keep multiple versions of their code, which (as which allows it to target many CPUs, Intel Xeon has been mentioned several times) isn’t sustainable. Liao Phis, and Nvidia GPUs. This is the hardware used et al. wanted to provide a way to make use of all the by TianHe machines, so it may be sufficient for its hardware, including the multicore CPUs, in each node, original purpose, but support for more architectures so they developed OpenMC as a new intra-node parallel is desirable. programming model.15 OpenMC treats hardware as a logical group of work- – Performance: OpenMC’s performance was com- ers, onto which a master thread launches tasks, which pared to native CUDA and OpenMP performance 15OpenMC was developed around the same time as OpenMP 4.0, on Nvidia GPUs and Intel Xeon Phis. The GPU and before an implementation of OpenMP offloading existed. version was slower than the hand-tuned CUDA, and

36 // triad inHSTREAM overlapped to increase parallelism. HSTREAM is meant #pragma hstream in(b, c, a, scalar) out(a) to be similar to OpenMP, and uses a source-to-source device(*) scheduling(4096) translator to target CPUs with OpenMP, Nvidia GPUs { with CUDA, and Xeon Phis with Intel’s Language Exten- a = b + scalar*c sions for Offloading (LEO). Unlike other directive mod- } els, HSTREAM allows developers to automatically use all the available hardware in a node simultaneously by han- Listing 20: HSTREAM code sample. dling data and computation distribution (and, it seems, load balancing). To determine how computations should be divided, HSTREAM requires an additional platform description file that contains relevant data about all the this is likely due to OpenMC’s lack of support for hardware in a system, such as the number of cores, mem- overlapping computation and communication, which ory size, and so on. is left for future work. On the Xeon Phi, OpenMC Since HSTREAM is so new, the following descriptions was able to get a higher percentage of hand-written of its portability, performance, and productivity will be performance, and splitting the Xeon Phi into mul- more sparse than for other models. tiple workers for more parallelism further improved performance. – Portability: HSTREAM uses OpenMP as a CPU back end, so it can target most CPUs, but it uses – Productivity: Unfortunately, the parallel regions CUDA and Intel’s LEO to target GPUs and Xeon of OpenMC agents must be implemented in a Phis, so its portability to other devices is limited. target-specific language (e.g., CUDA when targeting GPUs), which does limit OpenMC’s portability, but – Performance: HSTREAM’s performance hasn’t yet if OpenMC were to be integrated with more mod- been compared directly to OpenMP or CUDA, but ern programming models, this would not be neces- its performance on the STREAM benchmarks using sary. As it stood at the time, OpenMC required CPUs and GPUs combined is better than using ei- users to provide different worker directives and agen- ther separately. This is a strong proof-of-concept for t/acc implementations per architecture, and possibly high performance automatic load balancing across modify data movement clauses. To port an OpenMC devices, and is in many ways a continuation of the code designed for GPUs to OpenMC for Xeon Phis, work begun by the Qilin compiler (see Sec. 6.1.1). users would have to modify about 10% of their code, – Productivity: HSTREAM is approximately as com- which the authors of OpenMC believe to be accept- plex as other directive based models like OpenMP or able, but for large code bases may still be intractable OpenACC. It comes with the added bonus of auto- – 10% of 1 million lines of code is 100,000, after matically using multiple devices, at the cost of pro- all. However, OpenMC does require far less code viding a per-machine platform description file, which than hand-written CUDA, and only slightly more is a trade-off that could be worth it to some devel- than OpenMP. The additional performance monitor- opers. ing capabilities are also very intriguing and could greatly help productivity. – P3 Rating: promising – HSTREAM is an interesting model, but hasn’t been around long enough to judge. – P3 Rating: fair – OpenMC is a good prototype, but needs work with respect to requiring different worker 5.6 Customizable Directives directives and kernels. 
It is somewhat portable and gets good performance on TianHe machines, but is In addition to the standardized and otherwise pre- not as productive as it could be. packaged directives described above, there have been ef- forts to allow users to define their own, specialized di- rectives. Allowing users to define their own directives 5.5 HSTREAM can give them more control over their application. This section will describe a framework that helps users define HSTREAM [105] (code sample in Listing 20) is a new their own directives and an example of how user-defined directive-based extension for stream computing, as op- directives can improve performance portability. posed to traditional data parallel computing (similar to the GrPPI library, Sec. 3.2.1). Its programming model 5.6.1 Xevolver is based on a (possibly infinite) partially known stream of data, and contains a data producer, a data proces- Xevolver [163] is a source-to-source translation tool built sor, and a data store (data writer), all of which can be on top of ROSE [139] (see Sec. 6.3) that allows users

37 to define their own (parameterized) code transforma- The CLAW port of a portion of one climate model out- tions and directives to specify where those transforma- performed the original serial implementation and a naive tions should be applied. The main goal of Xevolver is OpenMP implementation, and matched the correspond- to separate optimizations/transformations from applica- ing hand-tuned OpenACC implementation. This proof of tion code – these transformations can be kept outside concept demonstrates how user-defined transformations application code bases, which removes the need to keep can improve the performance portability of an applica- platform-specific versions of code and consolidates knowl- tion. edge about how to optimize for a given platform. Trans- formations can be shared across applications as well, which reduces the need to re-implement optimizations for 6 Source-to-source Translators each program. The authors of Xevolver note that existing solutions This section will discuss source-to-source translators for writing specialized code for each architecture aren’t specifically designed to improve the performance portabil- sufficient, since they generally involve keeping separate ity of applications. Source-to-source translators are con- source versions or directly modifying application code. sidered here to be programs that transform one input As an example, using C preprocessor macros to condi- language into another, but do not themselves compile it tionally compile different code versions (even ones kept down to an executable; i.e., translators need another base in the same file) can quickly devolve into “the so-called compiler to work. #ifdef hell” [163]. Users want specialized code, but they Several translators have been mentioned already, par- don’t want to write specialized code, so having a set of ticularly when discussing the implementations of some transformations to pull from could be extremely helpful. directive based models in Sec. 5. This section will dis- Performance tests of Xevolver-enhanced applications cuss how those translators work, as well as some other on various architectures confirm that adding these trans- frameworks for code translation for performance porta- formations helps performance. One motivation for cre- bility. Note that, while previous sections have always ating Xevolver was to help users with applications op- contained a brief description of performance, portability, timized for vector machines port to other architectures and productivity for each model, this section will not al- (Xevolver also enables incremental porting). The authors ways do so, since for several translators this has already demonstrated that it takes a set of non-trivial but consis- been discussed or is not applicable. tent transformations to port these codes, and that trans- formations that help one architecture can be detrimental 6.1 Early Translators to another. Since these transformations don’t actually modify the source code, they help make applications more Source-to-source translators are not a new concept, and performance portable. several that did not directly address performance porta- Xevolver has been used to port part of a weather simu- bility were introduced that could nevertheless improve it. lation to OpenACC [76] and migrate a numerical turbine This section will discuss a couple of those. code [160], among other things. 
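For concreteness, the preprocessor-based specialization that Xevolver is meant to replace typically looks like the following. This is an illustrative pattern, not code from [163]: every new target multiplies the branches, whereas Xevolver keeps a single loop in the application and holds each target's transformation outside the source.

// One loop, three hand-maintained variants selected at compile time.
#ifdef TARGET_VECTOR
    for (int i = 0; i < n; i += 4) {
        /* manually strip-mined, vector-friendly body */
    }
#elif defined(TARGET_GPU)
    #pragma acc parallel loop
    for (int i = 0; i < n; i++) {
        /* offload-friendly body */
    }
#else
    for (int i = 0; i < n; i++) {
        /* portable baseline body */
    }
#endif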
6.1.1 Qilin 5.6.2 The CLAW DSL The Qilin compiler [95] was intended to solve the problem The CLAW DSL [23] is an example of application spe- of load balancing computation on heterogeneous systems cific, user-defined directives meant to enable performance – in other words, to decide what fraction of computation portability for weather and climate models. CLAW is should happen on the CPU vs. on the GPU. Even for a not based on Xevolver, but on the Omni source-to-source single application, the ideal fraction can change for dif- Fortran translator [110] (see Sec. 6.2) and uses a single- ferent inputs, and, of course, different hardware. Qilin column abstraction to take advantage of certain domain automates this mapping process by doing it adaptively properties of most climate modeling programs. The at run time. CLAW compiler outputs OpenMP or OpenACC anno- Qilin provides two for writing parallel applica- tated Fortran and is interoperable with normal OpenMP tions; the compiler doesn’t have to extract parallelism, or OpenACC code to enable incremental porting. only map it to the hardware. The first API is the stream Climate models have been using OpenMP and Open- API, which provides data parallel algorithms (similar to ACC for performance portability, but different architec- skeleton libraries), and the second is the threading API, tures require different directives for optimal performance. which allows users to provide parallel implementations The differences are consistent, however, so the DSL pro- in the underlying programming models, Intel TBB and vides directives to abstract away these differences. CUDA. The compiler dynamically translates these API

38 calls into native code and decides a mapping using an 6.2 Omni adaptive algorithm. This algorithm is based on a database of execution The Omni compiler [110] is a source-to-source transla- time projections that Qilin maintains for all programs tor for Fortran and C (C++ is in development) that is it has seen. The first time Qilin sees a program, it used by XcalableMP, XcalableACC, and the CLAW DSL, builds a model of how that code performs when differ- among others. Omni is based on the idea of metapro- ent percentages of work (per kernel) are run on the CPU gramming: it allows users to write code to transform vs. the GPU. For future runs (even on different prob- their code. Not all compilers support every optimiza- lem sizes), Qilin refers to that model to find the mapping tion, or can determine whether an optimization is safe, so that will minimize run time. This adaptive mapping is al- metaprogramming allows users to transform their code so ways faster than GPU-only or CPU-only execution, and it is easier for compilers to analyze or to directly apply within 94% of the best manual mapping (at granulari- optimizations themselves. ties of 10%). This kind of adaptive mapping can improve Omni is composed of three main pieces: a front end, performance portability by providing good performance which parses in C and Fortran and turns them into regardless of architecture changes, and it improves pro- XcodeML, Omni’s IR (based on XML); a translator, ductivity by automating the process. which turns the XcodeML IR into Xobject (Java-based XML objects) and applies transformations to it; and a 6.1.2 R-Stream back end, which translates XcodeML back into C or For- tran which can be compiled by a regular compiler. Omni The R-Stream compiler and translator for C [147, 103] is could theoretically support multiple “meta-languages” for somewhat unique among all the other models discussed users to define what transformations should be done, here, in that it requires no source code modifications at and where, including an Xobject-based meta-language, all, not even annotations like OpenMP or Bones (see an XML-based meta-language, or a new DSL; however, 6.4). R-Stream has been used to target several processors since the Xobject-based meta-language was simplest to and accelerators, including IBM’s Cell processor, Clear- implement, Murai et al. chose to only implement that Speed’s processors, and Nvidia GPUs [88]. The compiler one. To define a transformation in their Xobject-based takes in unmodified C and outputs C with parallelizable language, users must write a Java class that implements portions in the chosen parallel back end. R-Stream is an Omni-specific interface, which describes the transfor- based partially on the polyhedral optimization model.16 mation to perform on the Xobject(s). To choose where Its general compilation flow is as follows: that transformation is applied to their code, users add an – Parse in C, translate it to an SSA IR. Omni directive to their application. This is how Xcal- ableMP and XcalableACC were implemented (see Sec. – Run optimizations. 5.3). – Run analyses to determine which portions could be Unfortunately, this interface isn’t terribly user-friendly, mapped onto an accelerator. 
especially for non-compiler-expert users and domain sci- entists, although the vast majority of high-level com- – “Raise” map-able portions into a polyhedral IR and piler optimizations, including loop unrolling and array-of- optimize them under the polyhedral model to find struct to struct-of-array transformations, can be defined parallelism. using it. Murai et al. realize this, though, and note that – Lower map-able portions back to SSA IR. future work includes building a nicer interface. Perhaps another solution would be for compiler experts to write an – Emit non-mapped code as part of the master thread, open source collection of transformations that the HPC and emit mapped code in target language (e.g., community could use and modify for their own purposes. CUDA). Interestingly, R-Stream generates the best parallel code 6.3 ROSE when given what the authors call “textbook” C code – code without any optimizations or clever implementation The ROSE translator [139] is different from other trans- strategies, just the basic algorithm as you might find in lators listed here in that it is designed to specifically sup- a textbook. This implies that, to create performance port analysis and optimization for source code and bi- portable code, less is more, and simple, high-level expres- naries. The ROSE front end supports many languages, sions of algorithms can be more useful (in some ways, at including C, C++, Fortran, Python, OpenMP, UPC, and least) than optimized versions. Java, as well as both Linux and Windows binaries. ROSE 16It is not important to understand the polyhedral model for this is meant to support rewriting large DoE applications for survey, but more information can be found in Griebl et al. [46]. future architectures and programming models, as well as

39 research into new compiler optimizations, automatic par- pieces of template or boilerplate code into which the com- allelization, software-hardware codesign, and proof-based piler can insert pieces of user code; e.g., types are left as compilation techniques for software verification. ROSE parameters, loop bodies are empty, and so on. Multiple has multiple levels of interfaces for working with source species can map to a single skeleton, since species provide code ASTs, and many optimizations and analyses have much higher granularity than skeletons, in general, and been implemented to help users modernize their code. the number of species is near infinite. One of the more interesting features of ROSE, from a Bones can do many different optimizations based on compiler design perspective, is its IR, which is based on skeleton information, such as multi-dimensional array the Sage family of IRs. Even though ROSE supports a flattening, loop collapsing, and thread coarsening. The wide variety of languages with very different feature sets, optimizations Bones does based on the skeleton infor- around 80% of the IR is shared between all these lan- mation can also be target-dependent. For example, when guages, and only 10% is for special cases. That such a targeting a GPU (either with OpenCL or CUDA), Bones variety of languages can all be represented by one IR is can do data analysis on the species (kernels) identified to very promising for compilers that wish to support mul- determine which data needs to be moved to the GPU, tiple programming models. The topic of IRs is further as well as optimize data transfers and synchronization discussed in Sec. 7.3. events. One uncommon benefit of Bones’ optimizations Some projects that have been built on ROSE include is that they need not be just permutations of the orig- Xevolver (see Sec. 5.6.1), a fault-tolerance research tool inal code – Bones can add extra code to, e.g., ensure [92], a benchmark suite for data race detection [90], and data accesses on GPUs are properly coalesced. However, more. Bones does still rely on the optimizations the underly- ing C compiler can do, since not all optimizations (like vectorization) can be written as skeletons, and not all 6.4 Bones performance relevant data (like register pressure) can be Bones [117, 116] is a source-to-source compiler for C, contained in species or skeletons. and targeting OpenMP, OpenCL, and CUDA, based on the concepts of algorithmic species and skeletons, similar – Portability: Bones can target OpenMP, OpenCL, to the skeleton libraries described earlier in Sec. 3.2.1. and CUDA, which gives users access to the vast ma- Nugteren et al. [117, 116] criticize existing compilers for jority of CPUs and GPUs, however, since Bones parallelism as lacking in one of these three aspects: they only supports C, the number of codes that can use it aren’t fully automatic and require users to change their is restricted. code, they produce binaries or otherwise non-human- – Performance: Bones is based partially on the poly- readable code, or they don’t generate highly efficient code. hedral optimization and compilation model, but un- Bones is intended to fix these problems by using algo- like other polyhedral compilers, Bones can be used rithm species17 and knowledge of traditional compiler op- on non-polyhedral code (this requires manual algo- timizations to drive a skeleton-based translation process. rithm classification, which is not optimal). 
To test The translation process goes like this: Bones and validate Bones, Nugteren et al. [117, 116] 1. Extract algorithmic species information from source compare it against Par4All [3] and ppcg [173], code, either by hand or (preferably) using an auto- two other popular C-to-CUDA polyhedral compilers. mated tool like A-Darwin or Aset, and add species On average, Bones outperforms both Par4All and annotations to source code. ppcg, although Nugteren et al. do note that this is with Par4All and ppcg in fully automatic mode 2. Pass annotated source code to Bones compiler, and with all optimizations turned on for Bones, for which uses species annotations to pick appropriate a more fair comparison, since Bones plus an algo- algorithmic skeletons. rithm classifier is fully automatic. If users wanted to spend more time tuning parameters for Par4All 3. Bones uses skeleton information to perform source or ppcg, they could likely get much higher perfor- code optimizations and outputs transformed C code. mance and possibly outperform Bones, but with all compilers in full automatic mode, Bones performs Bones skeletons are not quite like the skeleton functions best. Bones even performs comparably to handwrit- found in the libraries in Sec. 3.2.1, but are more like ten CUDA on the Rodinia benchmark suite.
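The following sketch shows the flavor of this workflow on a trivial loop. The annotation is illustrative only and does not reproduce Bones' exact species notation; it is meant to convey the idea of labeling a loop's memory access pattern and letting the compiler instantiate a matching skeleton.

void scale(const double* a, double* r, int n) {
  // Step 1: an automated classifier (A-Darwin or Aset) or the user labels the
  // loop with its algorithmic species (schematic notation: each element of r
  // is produced from one element of a).
  // #pragma species  a[0:n-1]|element -> r[0:n-1]|element
  for (int i = 0; i < n; i++)
    r[i] = 2.0 * a[i];
}

// Steps 2-3: Bones matches the species to a skeleton for the chosen target and
// fills in the loop body; for an OpenMP target the emitted code would be
// roughly equivalent to:
void scale_omp(const double* a, double* r, int n) {
  #pragma omp parallel for
  for (int i = 0; i < n; i++)
    r[i] = 2.0 * a[i];
}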

17 Algorithm species are similar to skeletons, but at higher granu- – Productivity: One of the more common criticisms larity – species describe memory access patterns on all data struc- tures in a particular loop nest. See Nugteren et al. [117, 116] for of skeleton libraries is that it is difficult and time- more. consuming for humans to select the correct skeleton,

40 and libraries can be error-prone when users choose allelize within a thread with only parallel for unless the wrong skeleton. Bones significantly reduces the the user specifies it (this is why OpenMP needed to add chance for human error by automating this process the simd directive), while OpenACC can – the parallel using either Aset or A-Darwin. However, non- loop directive guarantees there are no data dependencies polyhedral codes may still need a human to add al- across iterations. This means that translating loops from gorithmic species annotations; Nugteren et al. be- OpenMP to OpenACC is non-trivial and requires data lieve their annotation structure is “descriptive and dependence analysis, while going the other way is simple intuitive,” so this process should not be difficult, but and always valid. users unfamiliar with their species concepts and clas- The difficult part of writing an OpenACC to OpenMP sifications may disagree. translator is adding the right prescriptive OpenMP key- Since Bones produces human-readable OpenMP, words to loop nests (to ensure the loops are properly OpenCL, or CUDA, it can improve productivity mapped to hardware). To make matters more interest- by providing a baseline implementation that expert ing, these keywords may be different for each particular users can then tune. Bones is also easy to extend device, as the next sections describe. with more skeletons, and users may be able to use this to optimize their code as well. 6.5.1 Sultana et al.’s Translator – P3 Rating: fair – Bones is only good at enhancing Sultana et al. [161] made a prototype tool for auto- performance portability of affine C programs, but it matically translating OpenACC to OpenMP, focusing on is very good at doing that. translating for the same target device, e.g., Nvidia GPUs. They demonstrated that some parts of the translation are 6.5 OpenACC to OpenMP indeed mechanical, but others require more work. One of their goals was to provide a deterministic translation, There have been several proposals for translating between i.e., the same set of OpenACC directives will always be OpenACC and OpenMP, including those from Pino et al. translated to the same set of OpenMP directives. To [138], Sultana et al. [161], and Denny et al. [32]. Many this end, they created a deterministic set of translation current machines have only one of OpenMP or OpenACC, rules for each OpenACC data and compute directive, as or have a poor implementation of one but a good imple- well as deterministic rules for adding gang, worker, and mentation of the other, so being able to translate between vector clauses to OpenACC loops that did not already them would alleviate the problems this causes. How- possess them. This was necessary because, while Open- ever, because of semantic differences between OpenMP ACC allows the compiler to decide how to parallelize and OpenACC, mechanical translation from OpenMP to nested loops, OpenMP requires more direction. In ad- OpenACC is generally not possible or advisable [178]. dition, OpenMP has no equivalent for OpenACC’s seq OpenMP was designed when most machines used multi- directive, so the translator needed to remove directives processor architectures, and processor vendors wanted from these sequential loops, and redistribute any reduc- to provide a unified interface for programming multi- tions or private clauses on these loops. processors. Therefore, the meaning of each OpenMP di- However, Sultana et al. 
ran into problems with changing compilers when going from OpenACC to OpenMP (specifically, OpenMP offloading), including large performance differences. Some of this is expected, as both OpenMP and OpenACC leave quite a bit up to the compiler. They used PGI as their baseline OpenACC compiler and Clang as their OpenMP compiler, and noticed that the kernels PGI generated were almost always faster than the kernels Clang generated. This could be because, at the time, PGI's OpenACC implementation was much more mature than Clang's OpenMP implementation, but it does show that compiler implementations of these models can have a significant impact on overall performance. Furthermore, as future work, Sultana et al. believe that they may need to devise device-specific translation rules, echoing Wolfe [178]. For example, the rules for an Nvidia GPU will probably not be optimal for a Xeon Phi, and vice versa.
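To make the flavor of such rules concrete, here is one plausible hand-written mapping of an OpenACC compute construct onto OpenMP offloading directives; the exact clauses a translator like Sultana et al.'s would emit may differ:

// OpenACC version: the compiler is largely free to decide how to map the loop.
#pragma acc parallel loop gang vector copyin(a[0:n], b[0:n]) copyout(c[0:n])
for (int i = 0; i < n; i++)
  c[i] = a[i] + b[i];

// One possible OpenMP 4.5+ offloading translation: the mapping onto teams,
// threads, and SIMD lanes must be spelled out prescriptively.
#pragma omp target teams distribute parallel for simd \
            map(to: a[0:n], b[0:n]) map(from: c[0:n])
for (int i = 0; i < n; i++)
  c[i] = a[i] + b[i];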

6.5.2 Clacc

A more recent project, Clacc [32], aims to build a production-quality, open-source OpenACC compiler, built on Clang, and to generally improve OpenACC and GPU support in Clang. Clacc translates OpenACC into OpenMP to take advantage of existing OpenMP support in Clang. The authors of Clacc note that this both improves code portability between machines without good OpenACC or OpenMP implementations (as Sultana et al. [161] also describe), but also opens up possibilities for using existing OpenMP tools to analyze OpenACC. As OpenACC is much newer than OpenMP, tool support for OpenACC is less mature, and this could significantly benefit OpenACC developers. Many developers are also worried that OpenACC will soon be subsumed into OpenMP, which is much more popular, and that porting to OpenACC will therefore be a waste of effort, so being able to translate from OpenACC to OpenMP easily and automatically would ease their concerns.

To implement OpenACC support in Clang, the Clacc developers first translate OpenACC code to an OpenACC AST inside Clang, then create shadow OpenMP sub-trees, which can be compiled to an executable or used to output an OpenMP AST (or source code) equivalent to the OpenACC input.

While their implementation is very new and doesn't yet fully support GPUs or languages other than C, the Clacc team has been able to get performance comparable to PGI's on multi-core, with the exception of one benchmark, as long as gang, worker, and vector clauses are specified.

7 Compiler Support

This section will discuss support for parallelism and performance portable features in compilers, as well as the language standards these compilers implement.

7.1 Current State of Support

Most general-purpose languages used in HPC have little to no support for parallelism, although, as described in Sec. 7.2, recent additions to both C++ and Fortran have been changing this. However, implementation of these standards has been slow in coming, which means different compilers often offer different levels of support and performance. In particular, support for lambda functions, generic/templated functions, and optimizations that work across these and through calls to performance portability abstraction layers like RAJA and Uintah's PPL [59] would be a great help to performance portability efforts. Unfortunately, these are some of the standard additions that have been most inconsistently implemented. The implementations are much better than they were when Hornung and Keasler called for better support in C++ [60], but there is still room for improvement.

Sadly, these compilers were first designed when most code was sequential and they have no easy way of internally representing parallelism. Most compilers turn parallel constructs into function calls in their internal representation (a process called outlining, see de Supinski et al. [27] for an example of this with OpenMP), and this causes several issues for users. The most important problem is that the compiler has no way to reason about parallelism and must make conservative choices, which often means not applying an optimization even when it is valid. It also contributes to the inconsistency described previously, and can mean that code changes are necessary to get the same performance out of different compilers. There have been several efforts lately to add parallelism to existing IRs, which are described in Sec. 7.3.

7.2 Language Standards

Language standards define exactly what a programming language is and does, and are very important for traditional portability. If compilers implement a standardized language, users know that, no matter which compiler they choose or what machine they run on, they can be reasonably confident their code will work as intended. Compilers often implement their own extensions to a language, or might have a more efficient implementation of standard features, but as long as users restrict themselves to the standard, they can know their code will compile and run correctly.

Adding features into a standardized programming language is a long and difficult process that begins when a developer proposes a new feature to a language standards committee. If the committee likes the idea, they iterate with the developer on design choices and wording (sometimes for years) before finally presenting it to the committee's full membership for approval. If the feature proposal is approved, it goes into the standard, and compilers that want to support the standard must implement it. Implementation and testing is also a long process, so the time lag between a feature first being proposed and a working implementation becoming available can be many years.

As mentioned previously, most languages have little or no explicit support for parallelism, since parallel programming is still comparatively new and the standardization process takes so long. Performance portability is an even more recent concept with even less language support. This section will discuss efforts by the C++ and Fortran standards to add explicitly parallel features and even some performance portability features.
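As a concrete illustration of the outlining issue raised in Sec. 7.1 above: a front end conceptually rewrites a parallel construct into an outlined function plus a call into the runtime, which later optimization passes then treat as an opaque call. This is a schematic sketch; the runtime entry point name is hypothetical, standing in for whatever an actual compiler emits.

// What the developer writes:
void scale(double* x, int n, double a) {
  #pragma omp parallel for
  for (int i = 0; i < n; i++)
    x[i] *= a;
}

// Roughly what the front end produces. The loop body is outlined...
static void scale_outlined(double* x, int n, double a, int tid, int nthreads) {
  for (int i = tid; i < n; i += nthreads)   // each thread takes a slice
    x[i] *= a;
}

// ...and the original function becomes an opaque call into the runtime.
// omp_runtime_fork is a hypothetical stand-in for the real entry point.
void omp_runtime_fork(void (*body)(double*, int, double, int, int),
                      double* x, int n, double a);
void scale_lowered(double* x, int n, double a) {
  omp_runtime_fork(&scale_outlined, x, n, a);
}

Because the optimizer only sees a call through a function pointer, loop transformations that would span the parallel region are usually lost, which is exactly the problem the parallel IRs of Sec. 7.3 try to solve.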

42 squared = map (\x -> x * x) nums template MyType add(MyType a, MyType b) { Listing 21: A lambda in Haskell. return a+b; } std::for_each(nums.start(), nums.end(), float x = add(4.2, 3.14); [](int& x){ x = x * x; }); int y = add(1, 2); Listing 22: The lambda from Listing 21 in C++. Listing 23: A simple C++ template function.
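Complementing the listings above, C++17's parallel algorithms (discussed below) attach an execution policy to otherwise familiar STL calls; for example (whether a call actually runs in parallel depends on the standard library implementation in use):

#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

int main() {
  std::vector<double> nums(1'000'000, 1.0);

  // Parallel transform: square each element in place.
  std::for_each(std::execution::par, nums.begin(), nums.end(),
                [](double& x) { x = x * x; });

  // Parallel (unordered) reduction.
  double sum = std::reduce(std::execution::par, nums.begin(), nums.end(), 0.0);
  return sum > 0.0 ? 0 : 1;
}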

7.2.1 C++ several parallel algorithms, such as std::reduce and a Lambdas. Lambda functions are a concept borrowed parallel version of std::for each, that are differentiated from functional programming, where higher-order func- from their sequential counterparts by an execution policy; tions can take other functions as arguments. Lambdas however, there are few implementations of the parallel are (usually short) functions that are unnamed and writ- STL. ten in-line, which is why they’re also called “anonymous” Many C++ libraries (see Sec. 3) have taken advan- functions. Listing 21 shows an example of a lambda tage of the new templating abilities to provide generic function in Haskell, a functional programming language. parallelism to their users. In fact, the SkePU develop- This example applies the map higher-order function to a ers redesigned their entire interface to take advantage of lambda that takes a number (and squares it) and a list templates and improve the type safety and usability of nums to produce a list of squared numbers. Listing 22 SkePU [41]. In some places templates enabled them to shows the same operation in C++, assuming nums con- reduce their internal runtime’s code size by 70%. Other tains integers. libraries like PHAST [132] and Muesli [40] use templates Lambdas are one of the most commonly cited C++11 to provide users with generic data containers and parallel features that enhance performance portability, but their algorithms. true purpose is increasing developer productivity within other performance portable models. Many models, par- Containers. Containers and data structures deserve ticularly libraries like RAJA, SkePU, and Kokkos, benefit separate discussion outside of templates in general due a great deal from the capability to specify functions in- to the sheer number of libraries that provide their own line. Their other option is to use C++ functors18 (func- implementations of the exact same generic vector/matrix tion objects), which are much more verbose, difficult to classes [38, 155, 159, 96, 14, 39, 26, 40, 41, 22]. Libraries understand, and must be placed elsewhere in the code, that provide so-called “smart containers” that automati- possibly far away from where they’re used. Numerous pa- cally handle memory transfers between the host and de- pers have noted their usefulness in writing clean, perfor- vice in particular each seem to have reimplemented the mance portable code, especially as implementations have wheel with identical lazy copying mechanisms. If the improved [60, 61, 100, 51, 41, 59, 133, 9]. standard library had an implementation of this, it would consolidate a great deal of knowledge and allow for more Templates. Templated functions and classes allow de- interoperability between various programming models. velopers to write generic code that works with many The Kokkos developers have taken note of the this and types, as long as those types support any operations used. have been working to add their View data container (see Primitive types in C++ function and class definitions are Sec. 3.2.2) to the C++ standard library as std::mdspan not interchangeable, despite the fact that the code to op- [58]. The proposed container would provide a multi- erate on each primitive is often exactly the same. Tem- dimensional view on a contiguous span of memory, while plates fix this. Listing 23 shows a short template function not truly owning that memory. 
The mdspan object would to add two numbers and a couple invocations of that func- contain a mapping from a multi-dimensional index space tion. onto a scalar offset from a base memory address which The C++ Standard Template Library (STL) was first would be used to access memory, but allow other memory added in C++98, and contains many templated classes management activities to happen elsewhere. This would and functions, including containers, like std::vector let applications interact with memory without knowing or and std::map, and algorithms, like std::sort and understanding the underlying layout of memory. For ex- std::binary search. As of C++17, the STL also has ample, the developers could experiment with row-major 18Not to be confused with the concept of functors in category versus column-major or tiled data layouts in memory by theory and functional languages. changing the mdspan object’s mapping function without

43 changing anything in the rest of their code. This would 7.2.2 Fortran be incredibly helpful for performance portability, since it would move programs closer to single-source implemen- Do concurrent. Fortran’s do concurrent construct tations that can be configured for multiple architectures. [140] was first added to the standard in Fortran 2008 to There are many other benefits to a standardized container provide users with language support for parallel loops. The version of Fortran’s traditional do-loop view like the proposed mdspan, such as better integra- concurrent tion with other libraries, combining static and dynamic indicates to the compiler that there are no data dependen- extents, and array slicing, which are fully described by cies between iterations, so each iteration can be done in parallel (similar to OpenACC’s loop mod- Hollman et al. [58]. Hollman et al.’s mdspan implementa- independent tion has very low or negligible overheads, although they ifier). This enables the compiler to do many optimiza- noted that compiler version (including the same version tions, such as vectorization, loop unrolling, and auto- of the same compiler with different options enabled) did mated threading, that otherwise might be applied if data impact overheads, since different optimizations were ap- dependencies were unclear. The contiguous attribute plied, which is another point in favor of standardizing (also added in Fortran 2008) is complementary; it de- implementation of these features as described earlier in notes that an array occupies a contiguous block of mem- Sec. 7.1. ory and allows the compiler to perform optimizations with that knowledge. Features such as this allow compilers to be more consistent and aggressive in which optimizations they apply, which is good for performance portability. Just-in-Time Compilation (JIT). Just-in-time While C++ has added support for performing parallel compilation is a technique that, as the name implies, operations on arrays with the parallel STL algorithms, waits until the last minute to compile code into an it’s unclear how well compilers can optimize those oper- executable. There are multiple ways to do this, but for ations since they happen inside library calls, and adding the compiled languages most commonly used in HPC, a generic parallel for-loop (similar to Kokkos/RAJA par- the general process is to link the main executable with allel loop abstractions) to both C and C++ would help a compiler library, and when a function to be JIT-ed both languages be more consistent in their optimizations is reached, invoke the compiler on that function, then as well. put in the arguments and run the result. A common optimization to this process is to cache JIT-ed functions Coarrays. Coarray Fortran was first proposed in 1998 to avoid repeated compilation overheads. as an extension to Fortran 95, but was not adopted into JIT compilation is one way to help compilers gener- the specification until Fortran 2008 [140, 104]. Coarrays ate faster code. It is often difficult for compilers to de- are a PGAS concept for sharing data, where a single array cide if an optimization is safe or helpful at compile time, is allocated across multiple processes, and all processes when some information (like loop trip counts or pointer can access each other’s coarray data. 
In Fortran terms, aliases) is unknown, but at run time, this information a collection of process “images” (all executing the same can be made available and the compiler can make better code in an SPMD manner) declare a coarray with the choices. To get any real improvement, though, users need syntax :: a(n,m)[*], which means that array to decide whether the overhead of compiling parts of their a is allocated across all images with local array dimen- application at run time is offset by faster execution times. sions n x m, and codimensions (the logical dimensions of Some languages and translators, like PACXX [51] and the set of images the coarray is allocated across) spec- Qilin [95], have used JIT techniques to provide run ified in brackets. Codimensions can be used to specify time optimization and specialization for their users’ code. logically multi-dimensional groups of processes, e.g., for However, a recent project, ClangJIT [42], has been mov- a 3-codimensional case, each process can be thought of as ing to add JIT capabilities to the C++ standard so being a point in a cube, potentially with neighbor images users can benefit from JIT specialization and optimiza- above and below, to the left and right, and to the front tion of their code without relying on an external library. and back. ClangJIT proposes adding some syntax to the language Additional coarray features added in Fortran 2018 [141] so users can specify which functions they want JIT-ed, (in response to criticisms from Mellor-Crummey et al. and goes on to demonstrate that the specialization pro- [104], among others) include the introduction of teams of vided by JIT-ed functions can result in shorter benchmark images and the ability to allocate local coarrays among only the images in the team, instead of across all images execution times. They also show JIT can reduce compile 19 times for heavily templated full applications (without im- as global variables. While there don’t appear to be any pacting the execution time!), since only the versions that 19Teams also have the ability to handle failed images, which is are actually used for a particular run are compiled. extremely important for resiliency going forward.
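A minimal sketch of the JIT-and-cache pattern described above; jit_compile here is a hypothetical stand-in for invoking a linked-in compiler library the way ClangJIT or PACXX do, and the stub body only exists to keep the sketch self-contained.

#include <functional>
#include <string>
#include <unordered_map>

using Kernel = std::function<void(double*, int)>;

// Hypothetical: invoke the linked-in compiler on one function, specialized
// for the situation described by key, and return something callable.
Kernel jit_compile(const std::string& key) {
  return [](double* x, int n) { for (int i = 0; i < n; i++) x[i] *= 2.0; };
}

// Cache JIT-ed functions so the compilation cost is paid only once per key.
Kernel get_kernel(const std::string& key) {
  static std::unordered_map<std::string, Kernel> cache;
  auto it = cache.find(key);
  if (it == cache.end())
    it = cache.emplace(key, jit_compile(key)).first;
  return it->second;
}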

44 implementations in major compilers yet, the OpenCoar- LLVM has been a common target for efforts to add par- ray library and wrapper [153] appears to be a mostly com- allelism to IRs. plete implementation that can be added to any compiler. OpenCoarray can be set to use either MPI or OpenSH- LLVM for Parallelization, Vectorization, and Off- MEM as a communication back end, giving users flexibil- load. Tian et al. [167, 168] proposed additions to ity. LLVM’s IR to enable more efficient parallelization, vec- Adding coarrays to the language specification enables torization, and offloading, primarily for OpenMP and performance portability by standardizing a parallel pro- other directive-based models. Their first proposal [167] gramming feature. As OpenCoarray demonstrates, the did this by adding four new intrinsics to LLVM that back end can be switched out to use a different paral- represented basic directives, directives with qualifiers, lel programming model without making changes to user and directives with various numbers of qualifiers with code, bringing Fortran closer to a single-source solution operands. Their second proposal [168] trimmed this down 20 for shared data. to two intrinsics, llvm.directive.region.entry() and llvm.directive.region.exit(), to represent when a 7.3 Parallel Internal Representations region is controlled by directives. These intrinsics can (PIRs) be attached to LLVM OperandBundles that contain data on which directive(s) they represent and any operands, Internal representations (IRs) are the data structures and entry/exit pairs are matched with an LLVM Token compilers use internally to represent and reason about that the entry creates, which is passed as an argument users’ code before translating it into assembly or ma- to the exit. Singleton directives (e.g., omp barrier) are chine language. All transformations and optimizations also represented by an entry/exit pair with no code in the compiler performs are done on the IR. Some compil- between to eliminate the need for a third intrinsic. Af- ers, such as Habanero-Java (see 7.3.4), use multiple IRs ter the front end adds these intrinsics, parallelization and that represent different levels of abstraction between the offloading are handled by transforming groups of basic high-level language being compiled and the machine’s in- blocks into “work regions,” which are single-entry single- struction set, since many optimizations are easier to rea- exit sub-graphs of the program’s control flow graph that son about at higher or lower levels of abstraction. can be put through their new optimizations on their own Ideally, an IR would be able to express all operations and then mapped onto OpenMP runtime calls (and pos- of the compiler’s target architecture(s) and all concepts sibly offloaded). Vectorization is handled similarly. in the high-level language(s) the compiler accepts, and Tian et al.’s second proposal better reflects their design would additionally be independent of architecture and goals of enabling more and better optimizations, mini- language, but in reality this is often not the case. Fea- mizing the impact on existing LLVM infrastructure, and tures are added to language standards, and compiler writ- providing a minimal, highly reusable threaded code gen- ers either can’t or won’t extend their IR, so they make do eration framework. They implemented their additions with what they have. 
This has been especially true for and modified existing LLVM passes to work with them parallelism features, and this section will discuss several in Intel’s OpenCL compiler, to see how the new IR struc- attempts to add support for parallelism into IRs. tures would affect existing FPGA simulator workloads, and found that very few changes were needed to existing 7.3.1 LLVM LLVM code. Their additions gave the FPGA simulator workloads good speedup – up to 35x for one workload – The LLVM compiler infrastructure is a group of projects, demonstrating the benefits of extending parallelism sup- including the Clang compiler, LLDB debugger, LLD port. This clearly improves performance, but can also linker, and an OpenMP runtime library used by Clang. improve productivity by removing the need for develop- LLVM’s primary feature is its IR, which has become ers to do these kinds of optimizations to their code; they widely used since its initial introduction. LLVM IR [81] can just let the compiler do it. was designed to be language independent and useful at all stages of program compilation, including linking and debugging. The IR itself is very low level and based on a Tapir. The Tapir project [146] added fork-join paral- RISC-like instruction set with type information, a control lelism to LLVM IR. The authors did this by adding three flow graph, and data flow graph via static single assign- instructions: detach, reattach, and sync. The detach ment (SSA) value representations. Due to its popularity, instruction is the final instruction for a block, and takes references to two blocks, a detached block and a continua- 20 There have also been several implementations of coarrays for tion block, as its arguments. These two blocks can be run C++ [164, 111, 66], but there has yet to be any attempt to add them to the standard. Unified Parallel C (UPC) [172] adds similar in parallel; the detached block must end with a reattach PGAS features to C, but not with coarrays. instruction that points to the continuation block. The

45 sync instruction enables synchronization by dynamically ecution, and barriers. SPIRE also adds an event primitive waiting for all detached blocks with its context to com- type which acts like a semaphore, for finer-grained syn- plete. To spawn an arbitrary number of tasks, the com- chronization, and two new intrinsic functions, send and piler can nest detaches and reattaches, as long as cer- recv, to represent point-to-point communication. The tain properties are obeyed. These properties (described authors of SPIRE also provide operational semantics for fully in their paper) ensure that the parallel representa- these additions, both as a proof of correctness and to sys- tion maintains serial semantics; this allows serial opti- tematically define how they work. mizations to “just work” on parallel IR with minimal or In their original work [70], Khaldi et al. applied their even no modifications. In fact, most of the code for Tapir methodology to the IR of PIPS, a source-to-source trans- was additions to the LLVM back end, lowering Tapir IR lator, and showed that it could represent parallelism into Cilk calls, and not modifications to existing code. concepts from a wide variety of models, including Cilk, The authors also added some new optimization passes, Chapel, OpenMP, OpenCL, and MPI. Later, they ap- such as parallel loop scheduling, synchronization elimina- plied it to LLVM IR and used OpenSHMEM to eval- tion, and “puny task” elimination. The authors evaluated uate their implementation, focusing on optimizing the the correctness and performance of their implementation performance of get and shmem put [71]. To trans- on 20 Cilk benchmarks. form LLVM, they added an execution attribute to func- Tapir has also been used to implement OpenMP tasks tions and blocks and a synchronization attribute to blocks [154]; the task portion of Clang’s OpenMP implementa- and instructions, similarly to their original paper. They tion was replaced with code to generate Tapir IR, and also added an event type to act as a semaphore, and then lower it into Cilk API calls, using the back end from send and recv intrinsics. In an addition to their orig- [146]. The authors compared their OpenMP task imple- inal work, which did not directly address programming mentation to the implementations in GCC, Intel, and un- models based on separate address spaces, they added a modified Clang using the Barcelona OpenMP Task Suite. location attribute to values and identifiers and an ex- While their incomplete implementation caused problems pression attribute to load/store operations. The loca- for one benchmark (they had yet to implement critical tion attribute describes whether data is resident locally, sections or atomics), the Tapir implementation matched must be fetched, or is shared. They decided to represent or exceeded, sometimes by 10x, the other three imple- shmem get and shmem put as load and store operations, mentations in all cases. Tapir does especially well on so the expression attribute can be used to mark loads benchmarks where the overhead to work ratio is high, and stores as volatile when they might be waiting on a since Tapir spends far much less time in overhead than remote SHMEM operation (this tells the compiler some the other three (30% compared to 80+%). They also optimizations might not be safe). did experiments to see whether their new optimizations As a proof-of-concept for how their additions improve improved performance at all, which they did. This im- the compiler, Khaldi et al. 
modified some existing LLVM plementation of OpenMP is interesting for performance compiler passes to optimize OpenSHMEM get and put portability, since it implies the possibility of having mul- operations, primarily turning single-element gets and tiple high-level parallel languages targeting multiple back puts into bulk operations. They did in fact get signifi- ends – a single IR can represent multiple programming cant speedup, especially when they were able to turn the models (Cilk and OpenMP) and still achieve good per- entire series of data transfers into a single transfer. This formance. is another point in favor of adding representations of par- allelism to IRs – current compilers can’t reason about data transfers like this because they’re treated as func- 7.3.2 SPIRE tion calls, which can’t be optimized. If compilers could The Sequential to Parallel Intermediate Representation reason about these kinds of optimizations, users wouldn’t Extension project (SPIRE) [70] provides a methodology have to and could spend their time on development in- for turning sequential IRs into parallel IRs that have rep- stead of performance tuning. resentations for both control parallelism (e.g., threading) and data parallelism (e.g., vectorization). To do this, 7.3.3 Lift SPIRE adds execution and synchronization attributes to the IR representations of functions and blocks. The ex- The Lift IR [157] is somewhat different from the IRs dis- ecution attribute describes whether the function/block cussed previously in that it is a functional IR. The authors should be run in parallel, and if so, how, and the syn- of Lift were deliberately targeting performance portabil- chronization attribute describes how that function/block- ity for their compiler, which takes in their functional lan- /instruction behaves in relation to others, and includes guage and outputs OpenCL which has been optimized for support for atomic operations, spawning new units of ex- a particular architecture [158]. The functional language
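The kind of rewrite involved can be pictured with a small hand-written sketch; the element-wise and bulk put routines below are standard OpenSHMEM calls, but the transformation is shown manually here rather than as their modified compiler passes would perform it.

#include <shmem.h>

// Before: one remote put per element; each call is an opaque library call
// that the compiler cannot combine or reorder.
void send_elementwise(double* dest, const double* src, int n, int pe) {
  for (int i = 0; i < n; i++)
    shmem_double_p(&dest[i], src[i], pe);
}

// After (the effect of the bulk-transfer rewrite): a single put of the
// whole buffer.
void send_bulk(double* dest, const double* src, int n, int pe) {
  shmem_double_put(dest, src, n, pe);
}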

46 is reminiscent of skeleton libraries – in fact, it is in may piler passes a, b, and the scalar add operation to the par ways the continuation of one of SkelCL’s authors’ work (parallel) distributor: c = par(add, a, b). PLASMA there. Lift attempts to remedy the problems SkelCL had is also able to optimize these vector operations by com- with OpenCL’s performance portability issues. posing them, so fewer/no intermediate vectors need to be Lift has several high-level operations (including map, created. zip, reduce, split, join, and stencil, which was added later To test their IR, the authors implemented a source-to- [49]) that take in a user-defined specialization function. source translator for C+PLASMA extensions to pure C, This high level language is translated into their functional with either SSE3 or CUDA for vectorized code. They IR, lowered using rewrite rules to a second, completely compare performance to state-of-the-art vectorized li- interoperable IR with semantics closer to OpenCL, and braries for both CPU and GPU, and while their perfor- finally compiled to OpenCL. mance does vary by platform, it is generally compara- The Lift project is interesting because of the capabil- ble to libraries for each. With respect to performance ities enabled by their rewrite rules. The rewrite rules al- portability, it’s interesting to note that they were able low them to explore the “implementation space” of high- to use the same vector abstraction for two very different level code by incrementally applying them, compiling to architectures (SSE and CUDA) and get reasonable per- OpenCL, running performance tests, and repeating un- formance, which implies the possibility of a single-source til they’ve found a semi-optimal implementation. Indeed, code being able to be translated for multiple vector ar- their work is in some ways proof that OpenCL is not per- chitectures. formance portable, because when compiling for an Nvidia GPU, an AMD GPU, and an Intel CPU, their compiler Habanero-Java. Habanero-Java (described in Sec. generated different code for each, and got good perfor- 4.3) uses three separate IRs at various levels of abstrac- mance on each (matching and sometimes exceeding the tion [185]. The different IRs each lend themselves to dif- state-of-the-art at the time) [158]. As future work, they ferent types of optimizations, which makes the process want to replace their current rewrite rule search algorithm of lowering through each IR and to machine code much with a machine learning model to decrease very high com- easier on compiler writers. Their high-level PIR (HPIR) pile times. has a tree structure (similar to a traditional AST) which One of the main goals of Lift (which other languages enables simpler parallelism optimizations, such as elimi- and IRs could learn from) is to decouple exploiting par- nation of redundant asyncs, strength reduction on syn- allelism in code from mapping parallelism to hardware, chronizations, and may-happen-in-parallel analyses. The which is also a goal of performance portability – make middle-level PIR (MPIR) linearizes HPIR by transform- the main code base architecture-independent, so it can ing some control flow structures and eases data flow anal- be modified to run well on any architecture. yses. The low-level PIR (LPIR) lowers MPIR to add run- time calls for concurrency, making runtime optimizations 7.3.4 Other PIRs21 easier, and transforms all IR constructs into a standard sequential IR that can be used for traditional compiler op- PLASMA. 
The PLASMA IR [130] was designed to extend an existing IR with IR-level support for SIMD parallelism (vectorization) by abstracting away details of any specific SIMD implementation, like SSE or CUDA. Their goals were to abstract away data parallelism granularity (vector length), vector-specific instructions (i.e., it must also be possible to mark vector-only instructions like a vector shuffle as parallelizable), and parallel idioms (like map and reduce). To do this, they include an operator abstraction in their IR that distinguishes between operators that only accept a single element and true vector-only operators that must take a vector. To vectorize a single-element operation, their IR contains distributors that take a single-element operation and vector(s), and map the elements of the vector(s) onto the operation according to the distributor type. For example, to do a parallel/vectorized add of vectors a and b, the com-

21 Note: not an exhaustive list, but a selection based on how recent, novel, and relevant each IR is.

timizations. Using multiple IR abstraction levels like this allows the compiler to optimize for parallelism while also reusing sequential optimization passes, which is a process other compilers could utilize to improve the consistency of their optimizations, and hence the performance portability of the code they produce.

INSPIRE. The INSPIRE IR [67] (no relation to SPIRE [70]) is a functional IR like Lift but for traditional HPC languages, like C and C++. INSPIRE has two parts: the core IR, including primitives for types, expressions, and statements, and extensions built on these. INSPIRE's parallel control flow is based on a thread group model, where thread groups collectively process a "job," which is a collection of local variables and a function for the threads to execute. The IR has primitives for threads to spawn and merge with other threads, and communicate with each other. It also has primitives for distributing data and work among threads. INSPIRE is different

47 from the other IRs listed here in that it doesn’t process attributes, and types into sets within an IR, similar to single translation units, it works over the entire program namespaces in C++. Dialects make IRs more flexible at the same time, enabling a number of whole-program and modular, reduce name conflicts, and allow for inter- optimizations. The authors’ implementation uses Clang’s esting interactions and reuse possibilities by mixing IRs. front end for OpenMP, Cilk, OpenCL, and MPI, and they For example, with an accelerator dialect, host and device run performance experiments to verify that their base code could be generated and optimized together instead of compiler (without any new optimizations) produces code separately, or an OpenMP dialect could be reused across that performs similarly to a normal C compiler, as this is many source languages instead of reimplemented for each. how developers researching new optimizations would use Both of these would improve performance portability, it and they want the initial state of any code to be the since compilers could work with a consistent (standard- same across compilers. For all the sequential and OpenCL ized?) IR, instead of their own home-grown IR. MLIR op- inputs, this does hold true, but for a few OpenMP inputs timization passes are designed to work with any dialect, (mostly using OpenMP’s tasks, which their IR is very even ones they weren’t intended for. Dialects and IR op- well suited for), the performance improves significantly, erations can register properties (like “legal to inline” or just by being translated through their IR. This demon- “no side effects”) with each other to help this process, or strates once again the importance of using appropriate the pass can treat operations from unfamiliar dialects con- IR constructs for compiling high performance (and per- servatively. MLIR has already been adopted by several formance portable) code. It can make all the difference groups, including TensorFlow, a polyhedral compiler, the for optimizing parallel constructs. Flang compiler, and various DSLs, demonstrating that this is something both industry and academia are very interested in and find useful. 7.3.5 MLIR The variety of home-grown IRs described above all have many features in common, and a great deal of effort could 8 Discussion have been saved if their authors all had some sort of This section will give a high level review of how all the shared IR-building library to work with, especially if that models described above support the various aspects of library came with built in debugging support and parser performance portability, beginning with a brief summary logic. The MLIR project [82] aims to solve exactly these of each section, and continuing with a discussion of how problems, along with others in the compiler community, each model can be useful and how well each type of model such as the difficulty of writing high-quality DSL compil- supports portability, performance, and productivity. ers and many mainstream compilers defining their own Tables 2 and 3 give a summary of how well each model IRs on top of LLVM. Instead of implementing many bet- meets some criteria for performance, portability, and pro- ter compilers, it seems easier to instead implement a bet- ductivity, and assign each model a score based on how ter compiler building infrastructure.22 well it meets these criteria. 
MLIR is designed to standardize SSA IRs and includes built in support for documentation generation, debugging infrastructure, and parsing logic, among other things. MLIR users can define their own operations, types, rewrite patterns (similar to Lift's rewrite rules), and optimization passes on top of MLIR's infrastructure, while keeping their IR extensible for the future. MLIR's design principles include having minimal built-ins in their IRs, allowing users to customize anything and everything, progressive lowering, maintaining more high-level semantics, IR validation, and source location tracking and traceability. All of these features would greatly improve modern compilers, but one of their most interesting features, and the one that will probably do the most to support performance portability, is IR dialects.

IR dialects are a way of logically grouping operations,

22 The authors of Tapir [154] agree that it would be helpful to move the HPC community towards a single, standardized IR. By having standard, common IRs, we avoid duplicating effort and can have multiple programming models work together easily.

Scores are calculated by simply summing up "points" according to these rules:

– × = 0 points

– ◦ = 0.5 points

– ✓ = 1 point

– High (when "High" is good, as with performance) = 2 points

– Moderate (good) = 1 point

– Low (good) = 0 points

– High (when "High" is bad, as with SLOC added) = 0 points

– Moderate (bad) = 1 point

– Low (bad) = 2 points

– N/A = 0 points

48 – Combined values (e.g., Low/Moderate) are worth 0.5 both have been moving closer to the middle. Other points in the appropriate direction directive-based models, like OpenMC and XcalableACC, haven’t been nearly as successful, but have unique fea- – Varied or Unknown values will be treated as “Mod- tures OpenMP and OpenACC could learn from. There erate” are also tools that let users define their own directives, These scores are intended to provide a (very) rough to give them more control over what transformations and comparison of the models discussed here, not to provide optimizations happen to their code. a new metric for performance portability or anything sim- ilar. Many readers may disagree with this scoring system, 8.1.4 Translators or care about different qualities. (As can be seen, even the qualitative assessments from earlier in this same paper Many of the other models in this paper were built on do not always agree with the numeric scores, as criteria source-to-source translators like Omni and ROSE, which were weighted differently, and human judgement was in- were both designed to enhance user productivity by help- volved.) Some may emphasize Fortran support over GPU ing users transform their code into more performance support, for example, or may not care how many lines of portable versions. Other translators, like Bones and code are added to their program. Those readers are wel- Clacc, were designed more as compilers than translators, comed to devise their own scoring system based on their but also provide translation capabilities so users can fur- own wants and needs. Again, the intent here is to provide ther tune their code before compiling it. a high level overview and comparison, not definitively say one model is better than another. 8.1.5 Compiler Support Most compilers currently have limited internal support for 8.1 Summary parallelism, but this is changing. Language standards, This section gives a brief summary of the content of the particularly for languages like C and Fortran that are previous five sections. popular in HPC, have been adding native support for more parallelism constructs, and compilers will need to 8.1.1 Libraries follow suit to be able to effectively analyze these codes. Many compilers are attempting to move to PIRs, and There are two main classes of libraries designed for per- MLIR in particular is an exciting project, since it opens formance portability: skeleton libraries and loop-based li- up a new world of IR building and composability. braries. Skeleton libraries provide higher-order functions based on common parallel patterns users can customize to run computations in parallel, while loop-based libraries 8.2 Use Cases abstract away iteration spaces and data structures to run Each model described here was meant to support a differ- loops in parallel. Application-specific performance porta- ent use case; it’s very difficult to make one programming bility layers can further insulate users and application model that does everything, and the effort usually isn’t code from the details of parallelism and changes in the worth it when there are so many other models that al- underlying performance portability models. ready exist. This section will discuss the options users have for the three main application-developing use cases. 8.1.2 Languages While the barriers to entry for parallel languages are 8.2.1 Porting Legacy Applications higher than those for other models, some languages have been successful. 
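To make the scoring rules above concrete with a hypothetical example: a model rated ✓ on two criteria (1 point each), ◦ on one criterion (0.5 points), and High on a criterion where high is good (2 points) would receive 1 + 1 + 0.5 + 2 = 4.5 points under this scheme.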
8.1.2 Languages

While the barriers to entry for parallel languages are higher than those for other models, some languages have been successful. All of these languages are primarily task-based, but most include data parallelism as well, since modern hardware relies heavily on vectorization for its performance. Some of these languages are very high-level (e.g., Chapel and HJ) while others are less so (e.g., DPC++).

8.1.3 Directives

The two most popular directive-based models are OpenMP and OpenACC, which fall on opposite ends of the prescriptive vs. descriptive spectrum, although both have been moving closer to the middle. Other directive-based models, like OpenMC and XcalableACC, haven't been nearly as successful, but have unique features OpenMP and OpenACC could learn from. There are also tools that let users define their own directives, to give them more control over what transformations and optimizations happen to their code.

8.1.4 Translators

Many of the other models in this paper were built on source-to-source translators like Omni and ROSE, which were both designed to enhance user productivity by helping users transform their code into more performance portable versions. Other translators, like Bones and Clacc, were designed more as compilers than translators, but also provide translation capabilities so users can further tune their code before compiling it.

8.1.5 Compiler Support

Most compilers currently have limited internal support for parallelism, but this is changing. Language standards, particularly for languages like C and Fortran that are popular in HPC, have been adding native support for more parallelism constructs, and compilers will need to follow suit to be able to effectively analyze these codes. Many compilers are attempting to move to PIRs, and MLIR in particular is an exciting project, since it opens up a new world of IR building and composability.

8.2 Use Cases

Each model described here was meant to support a different use case; it's very difficult to make one programming model that does everything, and the effort usually isn't worth it when there are so many other models that already exist. This section will discuss the options users have for the three main application-developing use cases.

8.2.1 Porting Legacy Applications

There are still many legacy HPC applications in use today (e.g., weather simulations) that were first written in C or Fortran decades ago, and these applications need to be ported to take advantage of hardware advances as well. However, wholesale rewrites of these programs are generally not possible, since they may contain millions of lines of code, and years of work have gone into verifying their correctness and adding a wide variety of features. Developers don't want to throw all this work away to start again with a new programming model, especially one they might have to abandon a few years later when the next great thing comes out.

Fortran developers in particular have a hard time; C developers can port to C++ (a non-trivial task, but also not a complete rewrite) and instantly gain access to a wide variety of libraries and languages for performance portability. Even if porting to C++ isn't an option, there are still some libraries and translators (like targetDP and Bones) that support pure C. Fortran doesn't have these options. Fortunately, some directive-based models, like OpenMP and OpenACC, have deliberately included support for Fortran, and directives are minimally invasive and allow for incremental porting, so they are a good option for porting legacy code to modern architectures.

8.2.2 Porting Newer Applications

Newer HPC applications are much more likely to use C++, so they have more options than older C and Fortran codes. There are many C++ libraries for performance portability, and even some languages (like PACXX and DPC++) that are built on C++ so a porting effort would be possible. If the porting process needs to be incremental though, developers should check that a language supports incremental ports, e.g., by allowing access to raw pointers for data stored in containers. Some applications may be better served by a library, since libraries are generally designed with incremental porting in mind. Most of the directive-based models support C++ as well.

8.2.3 Writing New Applications

New applications are free to choose whatever programming model they want, but developers still have a great deal to consider when making their choice, including:

– Does this model let us express what we want?
– Does it provide tuning capabilities we want?
– Does it target the architectures we want, and will it add support for new architectures we might need in the future?
– Does it have a user community we can turn to for help?
– Will this model still exist in ten years (or the lifetime of the application)?23

Footnote 23: There were many, many libraries and languages proposed in the last 10-20 years that are not included in this paper because they never got past the "proposal" stage of community adoption. Many more made it past that stage only to be deprecated and mostly abandoned by their user communities.

Developers may also want to consider using a PPL (see Sec. 3.3) to insulate the application code (and non-expert software developers, such as domain scientists) from changes in underlying programming models. Using a PPL goes a long way towards removing the need to ask the last question above; if community support for a model disappears, the PPL can migrate to another model without disturbing application code. Using a PPL is also something existing applications can and should consider – PPLs were originally proposed to help legacy applications port to new architectures.
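As an illustration of how small such a layer can be, the header sketched below is hypothetical (the app namespace, for_all, and the APP_BACKEND_OPENMP switch are invented names, not taken from any particular code); the point is that it is the only place that names a programming model, so kernels written against it never change when the back end does.

```cpp
// app_parallel.hpp -- a minimal sketch of an application-specific PPL.
#pragma once
#include <cstddef>

namespace app {

// The one entry point domain code uses for parallel loops. Swapping the
// back end (serial, OpenMP, or a performance portable library) only touches
// this header, never the application kernels.
template <class Body>
void for_all(std::size_t n, Body body) {
#if defined(APP_BACKEND_OPENMP)
  #pragma omp parallel for
  for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(n); ++i) body(i);
#else  // serial fallback; a Kokkos or RAJA back end would be another branch
  for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(n); ++i) body(i);
#endif
}

}  // namespace app

// Domain-science code stays backend-agnostic, e.g.:
//   app::for_all(n, [=](std::ptrdiff_t i) { temperature[i] += dt * flux[i]; });
```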
8.3 Portability

Most of the models described here are very portable, and many of the ones that aren't are actively working on adding support for more architectures. Portability is in many ways a prerequisite for performance portability (as defined for this paper, especially), since a model can't be performance portable if it only runs on one or two architectures. Regardless, all claims of portability should be taken with a grain of salt. A recent, large-scale study on performance portability by Deakin et al. [29] notes that one of the most difficult parts of comparing performance portability values for different programming models is simply getting applications to run on a variety of architectures in the first place. Immature implementations or lack of testing mean that, while small benchmarks might run, larger applications may still fail; Deakin et al.'s analysis had to cope with multiple failures on some of their applications. In addition, for many models, portability varies by implementation (see Table 2), and to migrate to a new architecture, users may have to switch implementations. This can be problematic since the implementations' support for the standard may vary, and performance can also vary, but this is an implementation problem and not a problem with the model itself.

Perhaps the most important aspect of the models themselves to consider, with respect to portability, is how easily they can be extended to support new architectures – particularly architectures that aren't just a variation on traditional CPUs, GPUs, or clusters. We don't know what HPC machines will look like in the future, so it's worth considering how extendable a model is and whether it has the right abstractions for parallelism, computation, and communication to be "future-proof," regardless of what happens to the underlying hardware, including CPUs, GPUs, the memory hierarchy, configurable computing such as FPGAs, and a host of other special purpose processors.

Higher-level models that internally utilize other, lower-level models, such as high-level libraries (e.g., SkePU, Kokkos) or high-level languages (e.g., Chapel), are a good example of extensibility – if a new architecture with a new programming model comes out, they can implement a new back end and their users can automatically take advantage of it. Other models, like OpenMP and the C++

STL, have had to work much harder to adapt, but have ultimately succeeded. The users of these models have also had to work to update their code, however, so this is not a sustainable path, as it can greatly impact the model's productivity during the changeover. If a programming model has to go through a major API change every time there is a shift in supercomputer architectures, it isn't truly portable.

8.4 Performance

Most of the models discussed here also give high performance, but this is much more inconsistent, since performance portability is still a struggle for many. (See Sec. 2.2 for a discussion of how optimizing for performance vs. for portability can be at odds.) Furthermore, optimizations that are good for one architecture might hurt performance on another, and this can be very difficult to reason about.

Because of this, it's important for programming models to give users lots of performance knobs to tune (if they so desire), but to keep those knobs out of the application code and preferably in one centralized location. Almost all the models described here do a good job of this, with the exceptions of OpenMP, OpenCL, and Legion, to some degree. The primary two ways models abstract out performance tuning are by providing high-level interfaces with specialized back ends (e.g., OpenACC and RAJA) or by allowing users to define their own specialized mappings to hardware (e.g., PPLs, Chapel, and custom directives). Both of these methods are functional, although the latter is more flexible at the cost of putting more responsibility on the user. Better support for parallelism in compilers could reduce this burden by making it easier for compilers to reason about parallelism, so users don't have to provide extra information about hardware mappings.
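For example, in the spirit of RAJA's policy-based interface, the execution policy can live in one header so that retargeting a machine changes one alias rather than every kernel; the snippet below is an illustrative sketch (the policy names follow RAJA, but exact spellings and build options vary by version), not a complete recipe.

```cpp
#include "RAJA/RAJA.hpp"

// tuning.hpp -- the one centralized location for per-machine choices.
#if defined(USE_OPENMP)
using exec_policy = RAJA::omp_parallel_for_exec;  // multicore CPU build
#else
using exec_policy = RAJA::seq_exec;               // portable serial fallback
#endif
// (A GPU build would swap in something like RAJA::cuda_exec<256> here.)

// Application kernels are written once, against the alias only:
void daxpy(double a, const double* x, double* y, int n) {
  RAJA::forall<exec_policy>(RAJA::RangeSegment(0, n), [=](int i) {
    y[i] = a * x[i] + y[i];
  });
}
```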
There isn’t yet a good definition or way to measure it – Separate expressing parallelism from mapping it to (see Sec. 2.3), which means most discussion of portability hardware. (even in this paper) is qualitative, based on individuals’ opinions and experiences. However, there is still a gen- eral consensus that even the performance portable mod- 9 Conclusion els with the steepest learning curves are still better than device-specific models like CUDA and OpenCL in terms In short, each category of model has its uses, and progress of productivity, if only because porting from one machine is being made along the path to performance portability. to another takes less work. Keeping these more portable The HPC community is better off now than it was even models productive is good for performance portability – five years ago, and developers have more (and better) op- it forces them to stay higher-level and avoid falling into tions than they used to. Some of these options are start- the “OpenCL trap” of becoming overly-specific, so they ing to converge; for example, OpenMP and OpenACC have a better chance of achieving consistent performance are both taking on some of each other’s characteristics, on multiple architectures. some libraries (e.g., Kokkos and RAJA) are doing the

same, significantly better compiler tools (MLIR) are being developed, and large user communities are starting to form around some of these models. The HPC community seems to be moving out of the "innovation" and "early adoption" phases of the technology adoption lifecycle and into the "early maturity" phase [68]. The next two to five years will be very exciting to watch unfold; the new supercomputers will hopefully achieve the first exaFLOP, and the programming models for these machines will hopefully continue to coalesce into truly performance portable (and productive!) models.

Table 2: Comparison table for performance and portability.

Model | Perf. CPU | Perf. GPU | Perf. Other | Port. Intel CPU | Port. AMD CPU | Port. IBM CPU | Port. Other CPU | Port. Nvidia GPU | Port. AMD GPU | Port. Xeon Phi | Port. FPGA | Port. Other | Single vs. Multi-node | Notes (see footnotes below)

Non-Portable:
CUDA | N/A | High | N/A | × | × | × | × | ✓ | × | × | × | × | Single |
OpenCL | High | High | Varied | ✓ | ✓ | ✓ | ◦ | ✓ | ✓ | ✓ | ✓ | ◦ | Single | a, b
OpenMP 3 | High | N/A | N/A | ✓ | ✓ | ✓ | ◦ | × | × | × | × | × | Single | b

Libraries:
TBB | High | N/A | High | ✓ | ✓ | ✓ | ✓ | × | × | ✓ | × | × | Single | d
SkelCL | High | High | Varied | ✓ | ✓ | ✓ | ◦ | ✓ | ✓ | ✓ | ✓ | ◦ | Single | a, b
SkePU 2 | High | High | Varied | ✓ | ✓ | ✓ | ◦ | ✓ | ✓ | ✓ | ✓ | ◦ | Multi | a, b
Muesli | High | Varied | High | ✓ | ✓ | ✓ | ◦ | ✓ | ◦ | ✓ | ◦ | ◦ | Multi | b, c, d
GrPPI | High | N/A | N/A | ✓ | ✓ | ✓ | ✓ | × | × | × | × | × | Single |
Kokkos | High | High | High | ✓ | ✓ | ✓ | ✓ | ✓ | ◦ | ✓ | × | × | Single | i
RAJA | High | High | High | ✓ | ✓ | ✓ | ✓ | ✓ | × | ✓ | × | × | Single | e
PHAST | High | High | N/A | ✓ | ✓ | ✓ | ✓ | ✓ | × | × | × | × | Single |
targetDP | Moderate | Moderate | Moderate | ✓ | ✓ | ✓ | ◦ | ✓ | × | ✓ | × | × | Single | b

Languages:
Cilk | High | N/A | High | ✓ | ✓ | ✓ | ✓ | × | × | ✓ | × | × | Single | d, f
Legion/Regent | High | High | N/A | ✓ | ✓ | ✓ | ✓ | ✓ | × | × | × | × | Multi |
X10/Habanero-Java | Moderate | N/A | N/A | ✓ | ✓ | ✓ | ✓ | × | × | × | × | × | Multi |
Chapel | High | High | Varied | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ◦ | × | × | Multi | d, i
PACXX | High | High | High | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ◦ | × | Single | d, g
SYCL/DPC++ | High | High | Varied | ✓ | ✓ | ✓ | ◦ | ✓ | ✓ | ✓ | ✓ | ◦ | Single | a, b

Directives:
OpenMP 4+ | High | Varied | Moderate | ✓ | ✓ | ✓ | ◦ | ✓ | ◦ | ✓ | ◦ | ◦ | Single | b, h, i
OpenACC | High | High | Moderate | ✓ | ✓ | ✓ | ◦ | ✓ | ◦ | ✓ | ◦ | ◦ | Single | b, i
XcalableMP/ACC | High | High | Moderate | ✓ | ✓ | ✓ | ◦ | ✓ | ✓ | ✓ | ✓ | ◦ | Multi | b
OpenMC | Moderate | Moderate | Moderate | ✓ | ✓ | ✓ | ◦ | ✓ | × | ✓ | × | × | Single | b, d
HSTREAM | Unknown | Unknown | Unknown | ✓ | ✓ | ✓ | ◦ | ✓ | × | ✓ | × | × | Single | b, j

Translators:
Bones | High | High | High | ✓ | ✓ | ✓ | ◦ | ✓ | ✓ | ✓ | ✓ | ◦ | Single | b
Clacc | High | High | Moderate | ✓ | ✓ | ✓ | ◦ | ✓ | ◦ | ✓ | ◦ | ◦ | Single | b, i

Notes for Table 2:
a. Support is experimental and performance varies by implementation.
b. Varies by implementation (or chosen back end), but many/most do not support.
c. C++ performance is good, Java performance is inconsistent.
d. Xeon Phi only.
e. Via the serial back end, at least.
f. Only with Cilk Plus.
g. If the FPGA supports SPIR.
h. GPU performance varies wildly by implementation, but is improving.
i. Support is experimental.
j. Performance has not been compared to other models yet.

Table 3: Comparison table for productivity, and overall P3 scores.

Model | C | C++ | Fortran | Other langs. | SLOC added (a, b) | Same code per platform (c) | Difficulty (d) | Incremental porting | Overall P3 (numeric) | Overall P3 (qualitative) | Notes (see footnotes below)

Non-Portable:
CUDA | ✓ | ✓ | ◦ | ✓ | High | ✓ | High | ✓ | 8.5 | Poor | e, f
OpenCL | ✓ | ✓ | × | ✓ | High | × | High | ✓ | 17 | Fair | g
OpenMP 3 | ✓ | ✓ | ✓ | ◦ | Low | ✓ | Moderate | ✓ | 14 | Fair | h

Libraries:
TBB | ◦ | ✓ | × | × | Moderate/High | ✓ | Moderate/High | ✓ | 13.5 | Fair | j
SkelCL | ◦ | ✓ | × | × | Moderate | ✓ | Low/Moderate | ✓ | 19 | Fair | j
SkePU 2 | ◦ | ✓ | × | × | Moderate | ✓ | Low/Moderate | ✓ | 19 | Good | j, k
Muesli | ◦ | ✓ | × | ✓ | Moderate | ✓ | Moderate | ✓ | 18.5 | Good/Fair | i, j
GrPPI | ◦ | ✓ | × | × | Moderate | ✓ | Moderate | ✓ | 11.5 | Poor, Promising | j
Kokkos | ◦ | ✓ | × | × | Moderate | ✓ | Low/Moderate | ✓ | 18.5 | Very Good | j, k
RAJA | ◦ | ✓ | × | × | Low/Moderate | ✓ | Moderate | ✓ | 18 | Good, Promising | j, k
PHAST | ◦ | ✓ | × | × | Low/Moderate | ✓ | Moderate | ✓ | 15 | Fair, Promising | j
targetDP | ✓ | × | × | × | Moderate | ✓ | Moderate/High | ✓ | 13 | Poor |

Languages:
Cilk | ◦ | ◦ | × | × | Low | ✓ | Moderate | ◦ | 14.5 | Good | l, m, n
Legion/Regent | ◦ | ✓ | × | × | Moderate/High | ✓ | Moderate/High | × | 12.5 | Good | j
X10/Habanero-Java | × | × | × | ✓ | Low/Moderate | ✓ | Moderate | × | 9.5 | Fair | i
Chapel | N/A | N/A | N/A | ✓ | Very Low | ✓ | Low/Moderate | × | 17 | Excellent | o
PACXX | ◦ | ✓ | × | × | Low/Moderate | ✓ | Low/Moderate | ✓ | 20 | Good, Promising | j
SYCL/DPC++ | ◦ | ✓ | × | × | Moderate | ✓ | Moderate | ✓ | 18.5 | Very Good, Promising | j

Directives:
OpenMP 4+ | ✓ | ✓ | ✓ | ◦ | Low | × | Moderate | ✓ | 18.5 | Good | h
OpenACC | ✓ | ✓ | ✓ | × | Low | ✓ | Low/Moderate | ✓ | 20.5 | Very Good |
XcalableMP/ACC | ✓ | ◦ | ✓ | × | Low | ✓ | Low/Moderate | ✓ | 21 | Excellent | p
OpenMC | ✓ | × | × | × | Low | × | Moderate | ✓ | 13.5 | Fair |
HSTREAM | ◦ | ✓ | × | × | Low | ✓ | Low/Moderate | ✓ | 15.5 | Promising | j

Translators:
Bones | ✓ | × | × | × | None | ✓ | Low | ✓ | 21 | Fair |
Clacc | ✓ | ◦ | × | × | Low | ✓ | Low/Moderate | × | 18 | Very Good | q

Notes for Table 3:
a. Compared to serial/base version.
b. For languages, this is code size increase compared to serial C/C++/Fortran/etc.
c. No code changes required to run or get reasonable (not necessarily high) performance on different platforms. E.g., OpenMP 4+ requires different pragmas for GPU and non-GPU platforms and thus does not have this property.
d. For a programmer to learn, understand, and/or maintain.
e. Only through PGI's CUDA Fortran.
f. Python and Haskell bindings, possibly others.
g. Fortran, Pascal, Go, Haskell, Java, Julia, Rust, Scala, Python, Ruby, Erlang, and .NET bindings, possibly others.
h. A partial Java implementation is available from the University of Auckland.
i. Java.
j. Only via porting to C++.
k. Depends on lambda support.
l. The serial elision of a Cilk or Cilk Plus program is valid (GNU) C.
m. The serial elision of a Cilk++ or Cilk Plus program is valid C++.
n. Must port entire translation units at the same time.
o. Chapel has unique syntax different from other languages.
p. C++ is in development for the underlying translator, Omni.
q. Support is in the works.

References

[1] "Intel Array Building Blocks," accessed via the Internet Archive, 6/1/2020. [Online]. Available: https://web.archive.org/web/20130129203900/http://software.intel.com/en-us/articles/intel-array-building-blocks/

[9] D. Beckingsale, J. Burmark, R. Hornung, H. Jones, W. Killian, A. Kunen, O. Pearce, P. Robinson, B. Ryujin, and T. Scogland, "RAJA: Portable performance for large-scale scientific applications," in 2019 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC), Nov. 2019, pp. 71–81.

[2] P. Alpatov, G. Baker, H. C. Edwards, J. Gun- [10] N. Bell and J. Hoberock, “Thrust: A productivity- nels, G. Morrow, J. Overfelt, and R. van de Geijn, oriented library for CUDA,” in GPU computing “PLAPACK parallel linear algebra package design gems Jade edition. Elsevier, 2012, pp. 359–371. overview,” in SC ’97: Proceedings of the 1997 ACM/IEEE Conference on Supercomputing, 1997. [11] E. D. Berger, S. Z. Guyer, and C. Lin, “Customiz- ing software libraries for performance portability,” [3] M. Amini, B. Creusillet, S. Even, R. Keryell, 2000. O. Goubier, S. Guelton, J. O. Mcmahon, F.-X. Pasquier, G. P´ean,and P. Villalon, “Par4All: From [12] C. Bertolli, S. F. Antao, G.-T. Bercea, A. C. Jacob, convex array regions to heterogeneous computing,” A. E. Eichenberger, T. Chen, Z. Sura, H. Sung, in IMPACT 2012 : Second International Workshop G. Rokos, D. Appelhans, and K. O’Brien, “Inte- on Polyhedral Compilation Techniques HiPEAC grating GPU support for OpenMP offloading di- 2012, Paris, France, Jan. 2012. rectives into Clang,” in Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in [4] S. F. Antao, A. Bataev, A. C. Jacob, G.-T. Bercea, HPC, ser. LLVM ’15. New York, NY, USA: ACM, A. E. Eichenberger, G. Rokos, M. Martineau, 2015, pp. 5:1–5:11. T. Jin, G. Ozen, Z. Sura, T. Chen, H. Sung, C. Bertolli, and K. O’Brien, “Offloading support [13] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. for OpenMP in Clang and LLVM,” in Proceedings Leiserson, K. H. Randall, and Y. Zhou, “Cilk: An of the Third Workshop on LLVM Compiler Infras- efficient multithreaded runtime system,” Journal of tructure in HPC, ser. LLVM-HPC ’16. Piscataway, Parallel and , vol. 37, no. 1, NJ, USA: IEEE Press, Nov. 2016, pp. 1–11. pp. 55 – 69, 1996.

[5] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. [14] S. Breuer, M. Steuwer, and S. Gorlatch, “Extend- Gebis, P. Husbands, K. Keutzer, D. A. Patterson, ing the SkelCL skeleton library for stencil computa- W. L. Plishker, J. Shalf, S. W. Williams, and K. A. tions on multi-GPU systems,” in Proceedings of the Yelick, “The landscape of parallel computing re- 1st International Workshop on High-Performance search: A view from Berkeley,” Technical Report Stencil Computations, 2014, pp. 15–21. UCB/EECS-2006-183, EECS Department, Univer- sity of California, Berkeley, Tech. Rep., 2006. [15] P. Bright, “Cray, AMD to build 1.5 exaflops supercomputer for US govern- [6] C. Augonnet, S. Thibault, R. Namyst, and P.- ment,” May 2019. [Online]. Available: A. Wacrenier, “StarPU: a unified platform for https://arstechnica.com/gadgets/2019/05/ task scheduling on heterogeneous multicore archi- cray-amd-to-build-1-5-exaflops-supercomputer-for-us-government/ tectures,” Concurrency and Computation: Practice and Experience, vol. 23, no. 2, pp. 187–198, 2011. [16] V. Cav´e, J. Zhao, J. Shirako, and V. Sarkar, “Habanero-Java: The new adventures of old X10,” [7] D. Bailey, “Twelve ways to fool the masses when in Proceedings of the 9th International Conference giving performance results on parallel computers,” on Principles and Practice of Programming in Java, Jun. 1991. ser. PPPJ ’11. New York, NY, USA: ACM, Aug. 2011, pp. 51–61. [8] M. Bauer, S. Treichler, E. Slaughter, and A. Aiken, “Legion: Expressing locality and independence [17] B. L. Chamberlain, D. Callahan, and H. P. Zima, with logical regions,” in SC ’12: Proceedings “Parallel programmability and the Chapel lan- of the International Conference on High Perfor- guage,” The International Journal of High Perfor- mance Computing, Networking, Storage and Anal- mance Computing Applications, vol. 21, no. 3, pp. ysis, Nov. 2012, pp. 1–11. 291–312, 2007.

55 [18] B. L. Chamberlain, E. Ronaghan, B. Albrecht, [27] B. R. de Supinski, T. R. W. Scogland, A. Duran, L. Duncan, M. Ferguson, B. Harshbarger, D. Iten, M. Klemm, S. M. Bellido, S. L. Olivier, C. Ter- D. Keaton, V. Litvinov, P. Sahabu, and G. Titus, boven, and T. G. Mattson, “The ongoing evolution “Chapel comes of age: Making scalable program- of OpenMP,” Proceedings of the IEEE, vol. 106, ming productive,” Cray User Group Meeting 2018, no. 11, pp. 2004–2019, Nov. 2018. 2018. [28] T. Deakin and S. McIntosh-Smith, “Evaluating [19] P. Charles, C. Grothoff, V. Saraswat, C. Don- the performance of HPC-style SYCL applications,” awa, A. Kielstra, K. Ebcioglu, C. von Praun, and in Proceedings of the International Workshop on V. Sarkar, “X10: An object-oriented approach to OpenCL, ser. IWOCL ’20. New York, NY, USA: non-uniform cluster computing,” in Proceedings of Association for Computing Machinery, Apr. 2020. the 20th Annual ACM SIGPLAN Conference on Object-oriented Programming, Systems, Languages, [29] T. Deakin, S. McIntosh-Smith, J. Price, A. Poe- and Applications, ser. OOPSLA ’05. New York, naru, P. Atkinson, C. Popa, and J. Salmon, “Per- NY, USA: ACM, Oct. 2005, pp. 519–538. formance portability across diverse computer archi- tectures,” in 2019 IEEE/ACM International Work- [20] M. Chu, A. Aji, D. Lowell, and K. Hamidouche, shop on Performance, Portability and Productivity “GPGPU support in Chapel with the Radeon Open in HPC (P3HPC), Nov. 2019, pp. 1–13. Compute platform,” Jun. 2017. [30] D. del Rio Astorga, M. F. Dolz, J. Fern´andez,and [21] P. Ciechanowicz and H. Kuchen, “Enhancing J. D. Garc´ıa, “Paving the way towards high-level Muesli’s data parallel skeletons for multi-core com- parallel pattern interfaces for data stream process- puter architectures,” in 2010 IEEE 12th Interna- ing,” Future Generation Computer Systems, vol. 87, tional Conference on High Performance Computing pp. 228 – 241, 2018. and Communications (HPCC), 2010, pp. 108–113. [31] D. del Rio Astorga, M. F. Dolz, L. M. Sanchez, [22] P. Ciechanowicz, M. Poldner, and H. Kuchen, “The J. G. Blas, and J. D. Garc´ıa,“A C++ generic par- M¨unsterskeleton library Muesli: A comprehensive allel pattern interface for stream processing,” in Al- overview,” Working Papers, ERCIS-European Re- gorithms and Architectures for Parallel Processing, search Center for Information Systems, Tech. Rep., J. Carretero, J. Garcia-Blas, R. K. Ko, P. Mueller, 2009. and K. Nakano, Eds. Cham: Springer Interna- tional Publishing, 2016, pp. 74–87. [23] V. Clement, S. Ferrachat, O. Fuhrer, X. Lapillonne, C. E. Osuna, R. Pincus, J. Rood, and W. Sawyer, [32] J. E. Denny, S. Lee, and J. S. Vetter, “Clacc: Trans- “The CLAW DSL: Abstractions for performance lating OpenACC to OpenMP in Clang,” in 2018 portable weather and climate models,” in Proceed- IEEE/ACM 5th Workshop on the LLVM Compiler ings of the Platform for Advanced Scientific Com- Infrastructure in HPC (LLVM-HPC), Nov. 2018, puting Conference, ser. PASC ’18. New York, NY, pp. 18–29. USA: ACM, Jul. 2018, pp. 2:1–2:10. [Online]. Avail- able: http://doi.acm.org/10.1145/3218176.3218226 [33] R. Dolbeau, S. Bihan, and F. Bodin, “HMPP: A hybrid multi-core parallel programming environ- [24] H. C. da Silva, F. Pisani, and E. Borin, “A com- ment,” in Workshop on general purpose process- parative study of SYCL, OpenCL, and OpenMP,” ing on graphics processing units (GPGPU 2007), in 2016 International Symposium on Computer Ar- vol. 28, 2007. 
chitecture and High Performance Computing Work- shops (SBAC-PADW), 2016, pp. 61–66. [34] H. Dreuning, R. Heirman, and A. L. Varbanescu, “A beginner’s guide to estimating and improv- [25] D. Daniel and J. Panetta, “On applying perfor- ing performance portability,” in High Performance mance portability metrics,” in 2019 IEEE/ACM Computing, R. Yokota, M. Weiland, J. Shalf, and International Workshop on Performance, Portabil- S. Alam, Eds. Cham: Springer International Pub- ity and Productivity in HPC (P3HPC), Nov. 2019, lishing, 2018, pp. 724–742. pp. 50–59. [35] H. C. Edwards, D. Sunderland, C. Amsler, and [26] U. Dastgeer and C. Kessler, “Smart containers and S. Mish, “Multicore/GPGPU portable computa- skeleton programming for GPU-based systems,” In- tional kernels via multidimensional arrays,” in 2011 ternational journal of parallel programming, vol. 44, IEEE International Conference on Cluster Comput- no. 3, pp. 506–530, 2016. ing, Sep. 2011, pp. 363–370.

56 [36] H. C. Edwards and D. Sunderland, “Kokkos ar- IEEE 6th Intl Symp on Cyberspace Safety and Secu- ray performance-portable manycore programming rity, 2014 IEEE 11th Intl Conf on Embedded Soft- model,” in Proceedings of the 2012 International ware and Syst (HPCC,CSS,ICESS), 2014, pp. 312– Workshop on Programming Models and Applica- 315. tions for Multicores and Manycores, ser. PMAM ’12. New York, NY, USA: ACM, Feb. 2012, pp. [45] A. Gray and K. Stratford, “A lightweight approach 1–10. to performance portability with targetDP,” The In- ternational Journal of High Performance Comput- [37] H. C. Edwards, C. R. Trott, and D. Sunderland, ing Applications, vol. 32, no. 2, pp. 288–301, 2018. “Kokkos: Enabling manycore performance porta- bility through polymorphic memory access pat- [46] M. Griebl, C. Lengauer, and S. Wetzel, “Code gen- terns,” Journal of Parallel and Distributed Comput- eration in the polytope model,” in In IEEE PACT. ing, vol. 74, no. 12, pp. 3202 – 3216, 2014, domain- IEEE Computer Society Press, 1998, pp. 106–111. Specific Languages and High-Level Frameworks for [47] J. Gustafson, “Twelve ways to fool the masses High-Performance Computing. when giving performance results on traditional vec- tor computers,” Jun. 1991. [38] J. Enmyren and C. W. Kessler, “SkePU: A multi- backend skeleton programming library for multi- [48] S. Z. Guyer and C. Lin, “Broadway: A compiler for GPU systems,” in Proceedings of the Fourth Inter- exploiting the domain-specific semantics of software national Workshop on High-level Parallel Program- libraries,” Proceedings of the IEEE, vol. 93, no. 2, ming and Applications, ser. HLPP ’10. New York, pp. 342–357, 2005. NY, USA: ACM, Sep. 2010, pp. 5–14. [49] B. Hagedorn, L. Stoltzfus, M. Steuwer, S. Gor- [39] S. Ernsting and H. Kuchen, “A scalable farm skele- latch, and C. Dubach, “High performance stencil ton for hybrid parallel and distributed program- code generation with Lift,” in Proceedings of the ming,” International Journal of Parallel Program- 2018 International Symposium on Code Generation ming, vol. 42, no. 6, pp. 968–987, Dec. 2014. and Optimization, ser. CGO 2018. New York, NY, USA: ACM, Feb. 2018, pp. 100–112. [40] ——, “Data parallel algorithmic skeletons with ac- celerator support,” International Journal of Paral- [50] M. Haidl and S. Gorlatch, “PACXX: Towards a uni- lel Programming, vol. 45, no. 2, pp. 283–299, Apr. fied programming model for programming acceler- 2017. ators using C++14,” in 2014 LLVM Compiler In- frastructure in HPC, 2014, pp. 1–11. [41] A. Ernstsson, L. Li, and C. Kessler, “SkePU 2: Flexible and type-safe skeleton programming [51] M. Haidl and S. Gorlatch, “High-level programming for heterogeneous parallel systems,” International for many-cores using C++14 and the STL,” Inter- Journal of Parallel Programming, vol. 46, no. 1, pp. national Journal of Parallel Programming, vol. 46, 62–80, 2018. no. 1, pp. 23–41, 2018. [42] H. Finkel, D. Poliakoff, J.-S. Camier, and D. F. [52] T. D. Han and T. S. Abdelrahman, “hiCUDA: A Richards, “ClangJIT: Enhancing C++ with just- high-level directive-based language for GPU pro- in-time compilation,” in 2019 IEEE/ACM Interna- gramming,” in Proceedings of 2Nd Workshop on tional Workshop on Performance, Portability and General Purpose Processing on Graphics Process- Productivity in HPC (P3HPC), Nov. 2019, pp. 82– ing Units, ser. GPGPU-2. New York, NY, USA: 96. ACM, 2009, pp. 52–61. [43] M. Frigo, C. E. Leiserson, and K. H. Randall, “The [53] S. L. Harrell, J. Kitson, R. Bird, S. J. 
Pennycook, implementation of the Cilk-5 multithreaded lan- J. Sewall, D. Jacobsen, D. N. Asanza, A. Hsu, H. C. guage,” in Proceedings of the ACM SIGPLAN 1998 Carrillo, H. Kim, and R. Robey, “Effective perfor- Conference on Programming Language Design and mance portability,” in 2018 IEEE/ACM Interna- Implementation, ser. PLDI ’98. New York, NY, tional Workshop on Performance, Portability and USA: ACM, 1998, pp. 212–223. Productivity in HPC (P3HPC), Nov. 2018, pp. 24– 36. [44] A. Gray and K. Stratford, “targetDP: an abstrac- tion of lattice based parallelism with portable per- [54] A. Hayashi, S. R. Paul, and V. Sarkar, “GPUIter- formance,” in 2014 IEEE Intl Conf on High Per- ator: Bridging the gap between Chapel and GPU formance Computing and Communications, 2014 platforms,” in Proceedings of the ACM SIGPLAN

57 6th Chapel Implementers and Users Workshop, ser. [64] P. J¨a¨askel¨ainen, C. S. de La Lama, E. Schnet- CHIUW 2019. New York, NY, USA: Association ter, K. Raiskila, J. Takala, and H. Berg, “pocl: for Computing Machinery, 2019, p. 2–11. A performance-portable OpenCL implementation,” International Journal of Parallel Programming, [55] ——, “Exploring a multi-resolution GPU program- vol. 43, no. 5, pp. 752–785, Oct. 2015. ming model for Chapel,” in 7th Chapel Imple- menters and Users Workshop, ser. CHIUW 2020, [65] J. Jeffrey-Wilensky, “New ‘Aurora’ su- New York, NY, USA, May 2020. percomputer poised to be fastest in U.S. history,” Mar. 2019. [Online]. Avail- [56] J. Hennessy and D. Patterson, Computer Architec- able: https://www.nbcnews.com/mach/science/ ture: A Quantitative Approach, 6th ed. Elsevier new-aurora-supercomputer-poised-be-fastest-u-s-history-ncna985121 Science, 2019. [66] T. A. Johnson, “Coarray C++,” in 7th Interna- [57] J. A. Herdman, W. P. Gaudin, O. Perks, D. A. tional Conference on PGAS Programming Models, Beckingsale, A. C. Mallinson, and S. A. Jarvis, 2013, p. 54. “Achieving portability and performance through OpenACC,” in 2014 First Workshop on Accelera- [67] H. Jordan, S. Pellegrini, P. Thoman, K. Kofler, tor Programming using Directives, Nov. 2014, pp. and T. Fahringer, “INSPIRE: The insieme paral- 19–26. lel intermediate representation,” in Proceedings of the 22nd International Conference on Parallel Ar- [58] D. Hollman, B. Lelbach, H. C. Edwards, M. Hoem- chitectures and Compilation Techniques, Sep. 2013, men, D. Sunderland, and C. Trott, “mdspan in pp. 7–17. C++: A case study in the integration of per- formance portable features into international lan- [68] W. Joubert, R. Archibald, M. Berrill, guage standards,” in 2019 IEEE/ACM Interna- W. Michael Brown, M. Eisenbach, R. Grout, tional Workshop on Performance, Portability and J. Larkin, J. Levesque, B. Messer, M. Norman, Productivity in HPC (P3HPC), Nov. 2019, pp. 60– B. Philip, R. Sankaran, A. Tharrington, and 70. J. Turner, “Accelerated application development: [59] J. Holmen, B. Peterson, and M. Berzins, “An The ORNL Titan experience,” Comput. Electr. approach for indirectly adopting a performance Eng., vol. 46, no. C, pp. 123–138, Aug. 2015. portability layer in large legacy codes,” in 2019 [69] K. Keutzer, B. L. Massingill, T. G. Mattson, and IEEE/ACM International Workshop on Perfor- B. A. Sanders, “A design pattern language for engi- mance, Portability and Productivity in HPC neering (parallel) software: Merging the PLPP and (P3HPC), Nov. 2019, pp. 36–49. OPL projects,” in Proceedings of the 2010 Work- [60] R. Hornung and J. Keasler, “A case for im- shop on Parallel Programming Patterns, ser. Para- proved C++ compiler support to enable per- PLoP ’10. New York, NY, USA: Association for formance portability in large physics simulation Computing Machinery, 2010. codes,” Lawrence Livermore National Lab.(LLNL), [70] D. Khaldi, P. Jouvelot, C. Ancourt, and F. Irigoin, Livermore, CA (United States), Tech. Rep., 2013. “SPIRE: A sequential to parallel intermediate rep- [61] R. D. Hornung and J. A. Keasler, “The RAJA resentation extension,” MINES ParisTech, Tech. portability layer: Overview and status,” Sep. 2014. Rep., 2012.

[62] Intel Corp., “Intel oneAPI program- [71] D. Khaldi, P. Jouvelot, F. Irigoin, C. Ancourt, and ming guide (beta),” May 2020. [On- B. Chapman, “LLVM parallel intermediate repre- line]. Available: https://software.intel.com/ sentation: Design and evaluation using OpenSH- content/www/us/en/develop/documentation/ MEM communications,” in Proceedings of the Sec- oneapi-programming-guide/top.html ond Workshop on the LLVM Compiler Infrastruc- ture in HPC, ser. LLVM ’15. New York, NY, USA: [63] ——, “Intel Threading Building Blocks doc- ACM, Nov. 2015, pp. 2:1–2:8. umentation,” Mar. 2020. [Online]. Available: https://software.intel.com/content/www/us/en/ [72] Khronos OpenCL Working Group, “The OpenCL develop/documentation/tbb-documentation/top/ specification,” Jul. 2019. [Online]. Avail- intel-threading-building-blocks-developer-reference. able: https://www.khronos.org/registry/OpenCL/ html specs/2.2/html/OpenCL API.html

58 [73] V. Kindratenko and P. Trancoso, “Trends in high- [84] S. Lee and J. S. Vetter, “OpenARC: Extensi- performance computing,” Computing in Science ble OpenACC compiler framework for directive- Engineering, vol. 13, no. 3, pp. 92–95, May 2011. based accelerator programming study,” in 2014 First Workshop on Accelerator Programming using [74] P. Kogge and J. Shalf, “Exascale computing trends: Directives, Nov. 2014, pp. 1–11. Adjusting to the “new normal” for computer ar- chitecture,” Computing in Science Engineering, [85] ——, “OpenARC: Open accelerator research com- vol. 15, no. 6, pp. 16–26, Nov. 2013. piler for directive-based, efficient heterogeneous computing,” in Proceedings of the 23rd Interna- [75] Kokkos Team, May 2020. [Online]. Available: tional Symposium on High-performance Parallel https://github.com/kokkos/kokkos and Distributed Computing, ser. HPDC ’14. New York, NY, USA: ACM, Jun. 2014, pp. 115–120. [76] K. Komatsu, R. Egawa, S. Hirasawa, H. Takizawa, K. Itakura, and H. Kobayashi, “Translation of [86] S. Lee, S.-J. Min, and R. Eigenmann, “OpenMP large-scale simulation codes for an OpenACC plat- to GPGPU: A compiler framework for automatic form using the Xevolver framework,” International translation and optimization,” in Proceedings of the Journal of Networking and Computing, vol. 6, no. 2, 14th ACM SIGPLAN Symposium on Principles and pp. 167–180, 2016. Practice of Parallel Programming, ser. PPoPP ’09. New York, NY, USA: ACM, Feb. 2009, pp. 101–110. [77] J. Lambert, S. Lee, J. Kim, J. S. Vetter, and A. D. Malony, “Directive-based, high-level programming [87] C. E. Leiserson, “The Cilk++ concurrency plat- and optimizations for high-performance computing form,” The Journal of Supercomputing, vol. 51, with FPGAs,” in Proceedings of the 2018 Interna- no. 3, pp. 244–257, Mar. 2010. tional Conference on Supercomputing, ser. ICS ’18. New York, NY, USA: ACM, Jun. 2018, pp. 160–171. [88] A. Leung, N. Vasilache, B. Meister, M. Baskaran, D. Wohlford, C. Bastoul, and R. Lethin, “A map- [78] J. Larkin, “Performance portability through de- ping path for multi-GPGPU accelerated comput- scriptive parallelism,” Apr. 2016. ers from a portable high level programming ab- straction,” in Proceedings of the 3rd Workshop on [79] V. V. Larrea, W. Joubert, M. G. Lopez, and O. Her- General-Purpose Computation on Graphics Pro- nandez, “Early experiences writing performance cessing Units, ser. GPGPU-3. New York, NY, portable OpenMP 4 codes,” in Proc. Cray User USA: ACM, Mar. 2010, pp. 51–61. Group Meeting, London, England, 2016. [89] C. Liao, O. Hernandez, B. Chapman, W. Chen, [80] A. Lashgar, A. Majidi, and A. Baniasadi, “IP- and W. Zheng, “OpenUH: an optimizing, portable MACC: Translating OpenACC API to OpenCL,” OpenMP compiler,” Concurrency and Computa- in Poster session of The 3rd International Work- tion: Practice and Experience, vol. 19, no. 18, pp. shop on OpenCL (IWOCL), 2015. 2317–2332, Dec. 2007.

[81] C. Lattner and V. Adve, “LLVM: a compilation [90] C. Liao, P.-H. Lin, M. Schordan, and I. Kar- framework for lifelong program analysis and trans- lin, “A semantics-driven approach to improving formation,” in International Symposium on Code DataRaceBench’s OpenMP standard coverage,” in Generation and Optimization, 2004. CGO 2004., Evolving OpenMP for Evolving Architectures, B. R. 2004, pp. 75–86. de Supinski, P. Valero-Lara, X. Martorell, S. Ma- teo Bellido, and J. Labarta, Eds. Cham: Springer [82] C. Lattner, J. Pienaar, M. Amini, U. Bondhugula, International Publishing, 2018, pp. 189–202. R. Riddle, A. Cohen, T. Shpeisman, A. Davis, N. Vasilache, and O. Zinenko, “MLIR: A compiler [91] X.-K. Liao, C.-Q. Yang, T. Tang, H.-Z. Yi, infrastructure for the end of Moore’s law,” ArXiv, F. Wang, Q. Wu, and J. Xue, “OpenMC: Towards vol. abs/2002.11054, 2020. simplifying programming for TianHe supercomput- ers,” Journal of Computer Science and Technology, [83] S. Lee and R. Eigenmann, “OpenMPC: Extended vol. 29, no. 3, pp. 532–546, May 2014. OpenMP programming and tuning for GPUs,” in SC ’10: Proceedings of the 2010 ACM/IEEE In- [92] J. Lidman, D. J. Quinlan, C. Liao, and S. A. ternational Conference for High Performance Com- McKee, “ROSE::FTTransform – a source-to-source puting, Networking, Storage and Analysis, Nov. translation framework for exascale fault-tolerance 2010, pp. 1–11. research,” in IEEE/IFIP International Conference

59 on Dependable Systems and Networks Workshops 7th International Workshop on Programming Mod- (DSN 2012), 2012, pp. 1–6. els and Applications for Multicores and Manycores, ser. PMAM’16. New York, NY, USA: ACM, Mar. [93] Y. Liu, L. Huang, M. Wu, H. Cui, F. Lv, X. Feng, 2016, pp. 1–10. and J. Xue, “PPOpenCL: A performance-portable OpenCL compiler with host and kernel thread [100] M. Martineau, S. McIntosh-Smith, and W. Gaudin, code fusion,” in Proceedings of the 28th Interna- “Assessing the performance portability of modern tional Conference on Compiler Construction, ser. parallel programming models using TeaLeaf,” Con- CC 2019. New York, NY, USA: ACM, Feb. 2019, currency and Computation: Practice and Experi- pp. 2–16. ence, vol. 29, no. 15, p. e4117, 2017, e4117 cpe.4117.

[94] R. Lucas, J. Ang, K. Bergman, S. Borkar, W. Carl- [101] T. J. McCabe, “A complexity measure,” IEEE son, L. Carrington, G. Chiu, R. Colwell, W. Dally, Transactions on Software Engineering, vol. SE-2, J. Dongarra, A. Geist, R. Haring, J. Hittinger, no. 4, pp. 308–320, 1976. A. Hoisie, D. M. Klein, P. Kogge, R. Lethin, [102] S. McIntosh-Smith, M. Boulton, D. Curran, and V. Sarkar, R. Schreiber, J. Shalf, T. Sterling, J. Price, “On the performance portability of struc- R. Stevens, J. Bashor, R. Brightwell, P. Coteus, tured grid codes on many-core computer architec- E. Debenedictus, J. Hiller, K. H. Kim, H. Langston, tures,” in Supercomputing, J. M. Kunkel, T. Lud- R. M. Murphy, C. Webster, S. Wild, G. Grider, wig, and H. W. Meuer, Eds. Cham: Springer In- R. Ross, S. Leyffer, and J. Laros III, “DOE Ad- ternational Publishing, 2014, pp. 53–75. vanced Scientific Computing Advisory Subcommit- tee (ASCAC) report: Top ten exascale research [103] B. Meister, A. Leung, N. Vasilache, D. Wohlford, challenges,” US Department of Energy, Tech. Rep., C. Bastoul, and R. Lethin, “Productivity via auto- Feb. 2014. matic code generation for PGAS platforms with the R-Stream compiler,” in Workshop on Asynchrony [95] C. Luk, S. Hong, and H. Kim, “Qilin: Exploit- in the PGAS Programming Model, 2009. ing parallelism on heterogeneous multiprocessors with adaptive mapping,” in 2009 42nd Annual [104] J. Mellor-Crummey, L. Adhianto, W. N. Scherer, IEEE/ACM International Symposium on Microar- and G. Jin, “A new vision for Coarray Fortran,” chitecture (MICRO), Dec. 2009, pp. 45–55. in Proceedings of the Third Conference on Par- titioned Global Address Space Programing Models, [96] M. Majeed, U. Dastgeer, and C. W. Kessler, ser. PGAS ’09. New York, NY, USA: Association “Cluster-SkePU: A multi-backend skeleton pro- for Computing Machinery, Oct. 2009. gramming library for GPU clusters,” in PDPTA 2013, 2013. [105] S. Memeti and S. Pllana, “HSTREAM: A directive- based language extension for heterogeneous stream [97] M. Martineau, S. McIntosh-Smith, and W. Gaudin, computing,” in 2018 IEEE International Confer- “Evaluating OpenMP 4.0’s effectiveness as a het- ence on Computational Science and Engineering erogeneous parallel programming model,” in 2016 (CSE), Oct. 2018, pp. 138–145. IEEE International Parallel and Distributed Pro- cessing Symposium Workshops (IPDPSW), May [106] J. Michalakes, R. Loft, and A. Bourgeois, 2016, pp. 338–347. “Performance-portability and the weather research and forecast model,” 2001. [98] M. Martineau and S. McIntosh-Smith, “The pro- [107] Z.-y. Mo, “Extreme-scale parallel computing: bot- ductivity, portability and performance of OpenMP tlenecks and strategies,” Frontiers of Informa- 4.5 for scientific applications targeting Intel CPUs, tion Technology & Electronic Engineering, vol. 19, IBM CPUs, and NVIDIA GPUs,” in Scaling no. 10, pp. 1251–1260, Oct. 2018. OpenMP for Exascale Performance and Portabil- ity (IWOMP ’17), B. R. de Supinski, S. L. Olivier, [108] S. Moss, “Lawrence Berkeley to install C. Terboven, B. M. Chapman, and M. S. M¨uller, Perlmutter supercomputer featuring Cray’s Eds. Cham: Springer International Publishing, Shasta system,” Oct. 2018. [Online]. Available: 2017, pp. 185–200. https://www.datacenterdynamics.com/news/ lawrence-berkeley-install-perlmutter-supercomputer-featuring-crays-shasta-system/ [99] M. Martineau, S. McIntosh-Smith, M. Boulton, and W. 
Gaudin, “An evaluation of emerging many-core [109] MPI Forum, “MPI: A message-passing interface parallel programming models,” in Proceedings of the standard,” Jun. 2015.

60 [110] H. Murai, M. Sato, M. Nakao, and J. Lee, [119] G. Ofenbeck, R. Steinmann, V. Caparros, D. G. “Metaprogramming framework for existing HPC Spampinato, and M. P¨uschel, “Applying the languages based on the Omni compiler infras- roofline model,” in 2014 IEEE International Sym- tructure,” in 2018 Sixth International Symposium posium on Performance Analysis of Systems and on Computing and Networking Workshops (CAN- Software (ISPASS), Mar. 2014, pp. 76–85. DARW), Nov. 2018, pp. 250–256. [120] OpenACC Group, “OpenACC.” [Online]. Avail- [111] F. M¨oßbauer, R. Kowalewski, T. Fuchs, and able: https://www.openacc.org/ K. F¨urlinger,“A portable multidimensional coarray for C++,” in 2018 26th Euromicro International [121] OpenACC Standards Group, “The OpenACC ap- Conference on Parallel, Distributed and Network- plication program interface, version 1.0,” Nov. based Processing (PDP), 2018, pp. 18–25. 2011. [122] ——, “The OpenACC application program inter- [112] M. Nakao, J. Lee, T. Boku, and M. Sato, “Produc- face, version 2.7,” Nov. 2018. tivity and performance of global-view programming with XcalableMP PGAS language,” in 2012 12th [123] ——, “The OpenACC application program inter- IEEE/ACM International Symposium on Cluster, face, version 3.0,” Nov. 2019. Cloud and (ccgrid 2012), May 2012, pp. 402–409. [124] OpenMP Architecture Review Board, “OpenMP application program interface, version 3.1,” Jul. [113] M. Nakao, H. Murai, T. Shimosaka, A. Tabuchi, 2011. T. Hanawa, Y. Kodama, T. Boku, and M. Sato, “XcalableACC: Extension of XcalableMP PGAS [125] ——, “OpenMP application program interface, ver- language using OpenACC for accelerator clusters,” sion 4.5,” Nov. 2015. in 2014 First Workshop on Accelerator Program- ming using Directives, Nov. 2014, pp. 27–36. [126] ——, “OpenMP application program interface, ver- sion 5.0,” Nov. 2018. [114] S. Nanz, S. West, K. S. d. Silveira, and B. Meyer, [127] ——, “OpenMP compilers and tools,” Nov. “Benchmarking usability and performance of mul- 2019. [Online]. Available: https://www.openmp. ticore languages,” in 2013 ACM / IEEE Interna- org/resources/-compilers-tools/ tional Symposium on Empirical Software Engineer- ing and Measurement, 2013, pp. 183–192. [128] OpenSHMEM Contributors Committee, “OpenSH- MEM application programming interface,” Dec. [115] C. J. Newburn, B. So, Z. Liu, M. McCool, A. Ghu- 2017. loum, S. D. Toit, Z. G. Wang, Z. H. Du, Y. Chen, G. Wu, P. Guo, Z. Liu, and D. Zhang, “Intel’s Array [129]G. Ozen,¨ S. Atzeni, M. Wolfe, A. Southwell, and Building Blocks: A retargetable, dynamic compiler G. Klimowicz, “OpenMP GPU offload in Flang and embedded language,” in International Sympo- and LLVM,” in 2018 IEEE/ACM 5th Workshop on sium on Code Generation and Optimization (CGO the LLVM Compiler Infrastructure in HPC (LLVM- 2011), 2011, pp. 224–235. HPC), Nov. 2018, pp. 1–9.

[116] C. Nugteren and H. Corporaal, “Bones: An au- [130] S. Pai, R. Govindarajan, and M. J. tomatic skeleton-based C-to-CUDA compiler for Thazhuthaveetil, “PLASMA: Portable pro- GPUs,” ACM Trans. Archit. Code Optim., vol. 11, gramming for SIMD heterogeneous accelerators,” no. 4, pp. 35:1–35:25, Dec. 2014. in Workshop on Language, Compiler, and Archi- tecture Support for GPGPU, held in conjunction [117] C. Nugteren, P. Custers, and H. Corporaal, “Auto- with HPCA/PPoPP 2010, Jan. 2010. matic skeleton-based compilation through integra- tion with an algorithm classification,” in Advanced [131] S. Pakin, “Ten ways to fool the masses when giving Parallel Processing Technologies, C. Wu and A. Co- performance results on GPUs,” Dec. 2011. hen, Eds. Berlin, Heidelberg: Springer Berlin Hei- delberg, 2013, pp. 184–198. [132] B. Peccerillo and S. Bartolini, “PHAST library — enabling single-source and high performance code [118] Nvidia, “CUDA Toolkit documentation,” Nov. for GPUs and multi-cores,” in 2017 International 2019. [Online]. Available: https://docs.nvidia. Conference on High Performance Computing Sim- com// ulation (HPCS), 2017, pp. 715–718.

61 [133] ——, “PHAST - a portable high-level modern C++ Euro-Par 2012 Parallel Processing, C. Kaklamanis, programming library for GPUs and multi-cores,” T. Papatheodorou, and P. G. Spirakis, Eds. Berlin, IEEE Transactions on Parallel and Distributed Sys- Heidelberg: Springer Berlin Heidelberg, 2012, pp. tems, vol. 30, no. 1, pp. 174–189, Jan. 2019. 871–882.

[134] H. Peng and J. J. Shann, “Translating OpenACC to [144] A. D. Robison, “Composable parallel patterns with LLVM IR with SPIR kernels,” in 2016 IEEE/ACIS Intel Cilk Plus,” Computing in Science Engineer- 15th International Conference on Computer and In- ing, vol. 15, no. 2, pp. 66–71, 2013. formation Science (ICIS), Jun. 2016, pp. 1–6. [145] T. B. Schardl, I.-T. A. Lee, and C. E. Leiserson, [135] S. J. Pennycook, J. D. Sewall, and J. R. Hammond, “Brief announcement: Open Cilk,” in Proceedings “Evaluating the impact of proposed OpenMP 5.0 of the 30th on Symposium on Parallelism in Al- features on performance, portability and produc- gorithms and Architectures, ser. SPAA ’18. New tivity,” in 2018 IEEE/ACM International Work- York, NY, USA: Association for Computing Ma- shop on Performance, Portability and Productivity chinery, 2018, p. 351–353. in HPC (P3HPC), Nov. 2018, pp. 37–46. [136] S. J. Pennycook, J. D. Sewall, and A. Duran, “Sup- [146] T. B. Schardl, W. S. Moses, and C. E. Leiser- porting function variants in OpenMP,” in Evolving son, “Tapir: Embedding fork-join parallelism into OpenMP for Evolving Architectures, B. R. de Supin- LLVM’s intermediate representation,” in Proceed- ski, P. Valero-Lara, X. Martorell, S. Mateo Bellido, ings of the 22Nd ACM SIGPLAN Symposium on and J. Labarta, Eds. Cham: Springer International Principles and Practice of Parallel Programming, Publishing, 2018, pp. 128–142. ser. PPoPP ’17. New York, NY, USA: ACM, 2017, pp. 249–265. [137] S. J. Pennycook, J. D. Sewall, and V. W. Lee, “A metric for performance portability,” in Proceed- [147] E. Schweitz, R. Lethin, A. Leung, and B. Meister, ings of the International Workshop on Performance “R-stream: A parametric high level compiler,” Pro- Modeling, Benchmarking, and Simulation, 2016. ceedings of HPEC, 2006. [138] S. Pino, L. Pollock, and S. Chandrasekaran, “Ex- [148] A. Sedova, J. D. Eblen, R. Budiardja, A. Tharring- ploring translation of OpenMP to OpenACC 2.5: ton, and J. C. Smith, “High-performance molec- lessons learned,” in 2017 IEEE International Par- ular dynamics simulation for biological and ma- allel and Distributed Processing Symposium Work- terials sciences: Challenges of performance porta- shops (IPDPSW), May 2017, pp. 673–682. bility,” in 2018 IEEE/ACM International Work- [139] D. Quinlan and C. Liao, “The ROSE source-to- shop on Performance, Portability and Productivity source compiler infrastructure,” in Cetus users and in HPC (P3HPC), Nov. 2018, pp. 1–13. compiler infrastructure workshop, in conjunction with PACT, 2011. [149] J. Shalf, S. Dosanjh, and J. Morrison, “Exascale computing technology challenges,” in High Per- [140] J. Reid, “The new features of Fortran 2008,” SIG- formance Computing for Computational Science – PLAN Fortran Forum, vol. 27, no. 2, p. 8–21, Aug. VECPAR 2010, J. M. L. M. Palma, M. Dayd´e, 2008. O. Marques, and J. C. Lopes, Eds. Berlin, Heidel- berg: Springer Berlin Heidelberg, 2011, pp. 1–25. [141] ——, “The new features of Fortran 2018,” SIG- PLAN Fortran Forum, vol. 37, no. 1, p. 5–43, Apr. [150] A. Sidelnik, S. Maleki, B. L. Chamberlain, M. J. 2018. Garzar’n, and D. Padua, “Performance portabil- [142] J. Reinders, B. Ashbaugh, J. Brodman, M. Kinsner, ity with the Chapel language,” in 2012 IEEE 26th J. Pennycook, and X. Tian, Data Parallel C++: International Parallel and Distributed Processing Mastering DPC++ for Programming of Heteroge- Symposium, May 2012, pp. 582–594. neous Systems using C++ and SYCL. Apress, Nov. 2019, unedited advance preview of chapters [151] E. 
Slaughter, W. Lee, S. Treichler, M. Bauer, and 1-4. A. Aiken, “Regent: a high-productivity program- ming language for HPC with logical regions,” in SC [143] R. Reyes, I. L´opez-Rodr´ıguez, J. J. Fumero, and ’15: Proceedings of the International Conference for F. de Sande, “accULL: An OpenACC implemen- High Performance Computing, Networking, Storage tation with CUDA and OpenCL support,” in and Analysis, Nov. 2015, pp. 1–12.

62 [152] J. E. Smith, “Characterizing computer performance XSEDE16 Conference on Diversity, Big Data, and with a single number,” Communications of the Science at Scale, ser. XSEDE16. New York, NY, ACM, vol. 31, no. 10, pp. 1202–1206, 1988. USA: ACM, Jul. 2016, pp. 44:1–44:8.

[153] Sourcery Institute, “OpenCoarrays,” Oct. [162] A. Tabuchi, M. Nakao, H. Murai, T. Boku, and 2019. [Online]. Available: https://github.com/ M. Sato, “Implementation and evaluation of one- sourceryinstitute/OpenCoarrays sided PGAS communication in XcalableACC for accelerated clusters,” in 2017 17th IEEE/ACM In- [154] G. Stelle, W. S. Moses, S. L. Olivier, and P. Mc- ternational Symposium on Cluster, Cloud and Grid Cormick, “OpenMPIR: Implementing OpenMP Computing (CCGRID), May 2017, pp. 625–634. tasks with Tapir,” in Proceedings of the Fourth Workshop on the LLVM Compiler Infrastructure in [163] H. Takizawa, S. Hirasawa, Y. Hayashi, R. Egawa, HPC, ser. LLVM-HPC’17. New York, NY, USA: and H. Kobayashi, “Xevolver: An XML-based code ACM, Nov. 2017, pp. 3:1–3:12. translation framework for supporting HPC appli- [155] M. Steuwer, P. Kegel, and S. Gorlatch, “SkelCL cation migration,” in 2014 21st International Con- - a portable skeleton library for high-level GPU ference on High Performance Computing (HiPC), programming,” in 2011 IEEE International Sympo- Dec. 2014, pp. 1–11. sium on Parallel and Distributed Processing Work- shops and Phd Forum, May 2011, pp. 1176–1182. [164] A. T. Tan and H. Kaiser, “Extending C++ with co- array semantics,” in Proceedings of the 3rd ACM [156] ——, “Towards high-level programming of multi- SIGPLAN International Workshop on Libraries, GPU systems using the SkelCL library,” in 2012 Languages, and Compilers for Array Programming, IEEE 26th International Parallel and Distributed ser. ARRAY 2016. New York, NY, USA: Associa- Processing Symposium Workshops PhD Forum, tion for Computing Machinery, 2016, p. 63–68. May 2012, pp. 1858–1865. [165] O. Tardieu, B. Herta, D. Cunningham, D. Grove, [157] M. Steuwer, T. Remmelg, and C. Dubach, “Lift: P. Kambadur, V. Saraswat, A. Shinnar, A functional data-parallel IR for high-performance M. Takeuchi, and M. Vaziri, “X10 and AP- GPU code generation,” in 2017 IEEE/ACM Inter- GAS at petascale,” in Proceedings of the 19th national Symposium on Code Generation and Op- ACM SIGPLAN Symposium on Principles and timization (CGO), Feb. 2017, pp. 74–85. Practice of Parallel Programming, ser. PPoPP ’14. New York, NY, USA: Association for Computing [158] M. Steuwer, C. Fensch, S. Lindley, and C. Dubach, Machinery, 2014, p. 53–66. “Generating performance portable code using rewrite rules: From high-level functional expres- [166] R. Thakur, P. Balaji, D. Buntinas, D. Goodell, sions to high-performance OpenCL code,” in Pro- W. Gropp, T. Hoefler, S. Kumar, E. Lusk, and ceedings of the 20th ACM SIGPLAN International J. Larsson Tr¨aff,“MPI at exascale,” Procceedings Conference on Functional Programming, ser. ICFP of SciDAC 2010, vol. 2, Jan. 2010. 2015. New York, NY, USA: ACM, Sep. 2015, pp. 205–217. [167] X. Tian, H. Saito, E. Su, A. Gaba, M. Masten, E. Garcia, and A. Zaks, “LLVM framework and [159] M. Steuwer and S. Gorlatch, “SkelCL: Enhanc- IR extensions for parallelization, SIMD vectoriza- ing OpenCL for high-level programming of multi- tion and offloading,” in 2016 Third Workshop on GPU systems,” in Parallel Computing Technologies, the LLVM Compiler Infrastructure in HPC (LLVM- V. Malyshkin, Ed. Berlin, Heidelberg: Springer HPC), Nov. 2016, pp. 21–31. Berlin Heidelberg, 2013, pp. 258–272.

[160] R. Suda, H. Takizawa, and S. Hirasawa, “Xevtgen: [168] X. Tian, H. Saito, E. Su, J. Lin, S. Guggilla, Fortran code transformer generator for high per- D. Caballero, M. Masten, A. Savonichev, M. Rice, formance scientific codes,” International Journal of E. Demikhovsky, A. Zaks, G. Rapaport, A. Gaba, Networking and Computing, vol. 6, no. 2, pp. 263– V. Porpodas, and E. Garcia, “LLVM compiler im- 289, 2016. plementation for explicit parallelization and SIMD vectorization,” in Proceedings of the Fourth Work- [161] N. Sultana, A. Calvert, J. L. Overbey, and shop on the LLVM Compiler Infrastructure in HPC, G. Arnold, “From OpenACC to OpenMP 4: To- ser. LLVM-HPC’17. New York, NY, USA: ACM, ward automatic translation,” in Proceedings of the Nov. 2017, pp. 4:1–4:11.

63 [169] “Green500 list,” Top500, Nov. 2019. [Online]. [182] M. Wong, A. Richards, M. Rovatsou, and R. Reyes, Available: https://www.top500.org/green500/ “Khronos’s to support heterogeneous lists/2019/11/ devices for c++,” 2016.

[170] “The LINPACK benchmark,” Top500, 2019. [On- [183] C. Yang, R. Gayatri, T. Kurth, P. Basu, Z. Ron- line]. Available: https://www.top500.org/project/ aghi, A. Adetokunbo, B. Friesen, B. Cook, D. Do- linpack/ erfler, L. Oliker, J. Deslippe, and S. Williams, “An empirical roofline methodology for quan- [171] “Top500 list,” Top500, Nov. 2019. [Online]. titatively assessing performance portability,” in Available: https://www.top500.org/lists/2019/11/ 2018 IEEE/ACM International Workshop on Per- formance, Portability and Productivity in HPC [172] UPC Consortium, D. Bonachea, and G. Funck, (P3HPC), Nov. 2018, pp. 14–23. “UPC language and library specifications, version 1.3,” Nov. 2013. [184] G.-W. Yang and H.-H. Fu, “Application software beyond exascale: challenges and possible trends,” [173] S. Verdoolaege, J. Carlos Juega, A. Cohen, J. Ig- Frontiers of Information Technology & Electronic nacio G´omez,C. Tenllado, and F. Catthoor, “Poly- Engineering, vol. 19, no. 10, pp. 1267–1272, Oct. hedral parallel code generation for CUDA,” ACM 2018. Trans. Archit. Code Optim., vol. 9, no. 4, Jan. 2013. [185] J. Zhao and V. Sarkar, “Intermediate language ex- [174] M. Voss, “CPUs, GPUs, FPGAs: Managing the al- tensions for parallelism,” in Proceedings of the Com- phabet soup with Intel Threading Building Blocks,” pilation of the Co-located Workshops on DSM’11, Intel Webinar, Jun. 2017. TMC’11, AGERE! 2011, AOOPES’11, NEAT’11, & VMIL’11, ser. SPLASH ’11 Workshops. New [175] S. Wienke, J. Miller, M. Schulz, and M. S. M¨uller, York, NY, USA: ACM, Oct. 2011, pp. 329–340. “Development effort estimation in HPC,” in SC [186] W. Zhu, Y. Niu, and G. R. Gao, “Performance ’16: Proceedings of the International Conference for portability on EARTH: a case study across several High Performance Computing, Networking, Storage parallel architectures,” Cluster Computing, vol. 10, and Analysis, Nov. 2016, pp. 107–118. no. 2, pp. 115–126, Jun. 2007. [176] S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful visual performance model for multicore architectures,” Commun. ACM, vol. 52, no. 4, pp. 65–76, Apr. 2009.

[177] M. Wolfe, “Compilers and more: Exascale program- ming requirements,” Apr. 2011.

[178] ——, “Compilers and more: OpenACC to OpenMP (and back again),” Jun. 2016.

[179] ——, “Compilers and more: What makes perfor- mance portable?” Apr. 2016.

[180] M. Wolfe, S. Lee, J. Kim, X. Tian, R. Xu, S. Chan- drasekaran, and B. Chapman, “Implementing the OpenACC data model,” in 2017 IEEE Interna- tional Parallel and Distributed Processing Sympo- sium Workshops (IPDPSW), May 2017, pp. 662– 672.

[181] M. Wolfe, “Implementing the PGI Accelerator model,” in Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Pro- cessing Units, ser. GPGPU-3. New York, NY, USA: ACM, Mar. 2010, pp. 43–50.
