
ParalleX: An Advanced Parallel Execution Model for Scaling-Impaired Applications

Hartmut Kaiser
Center for Computation and Technology
Louisiana State University
[email protected]

Maciej Brodowicz
Center for Computation and Technology
Louisiana State University
[email protected]

Thomas Sterling
Center for Computation and Technology and Department of Computer Science
Louisiana State University
[email protected]

1. ABSTRACT

High performance computing (HPC) is experiencing a phase change with the challenges of programming and management of heterogeneous multicore system architectures and large scale system configurations. It is estimated that by the end of the next decade Exaflops computing systems requiring hundreds of millions of cores and demanding multi-billion-way parallelism within a power budget of 50 Gflops/watt may emerge. At the same time, there are many scaling-challenged applications that, although taking many weeks to complete, cannot scale even to a thousand cores using conventional distributed programming models. This paper describes an experimental methodology, ParalleX, that addresses these challenges through a change in the fundamental model of parallel computation from that of communicating sequential processes (e.g., MPI) to an innovative synthesis of concepts involving message-driven work-queue execution in the context of a global address space. The focus of this work is a new runtime system required to test, validate, and evaluate the use of ParalleX concepts for extreme scalability. This paper describes the ParalleX model and the HPX runtime system and discusses how both strategies contribute to the goal of extreme computing through dynamic asynchronous execution. The paper presents the first early experimental results of tests using a proof-of-concept runtime-system implementation. These results are very promising and are guiding future work towards a full scale parallel programming and runtime environment.

2. INTRODUCTION

An important class of parallel applications is emerging as scaling impaired. These are problems that require substantial execution time, sometimes exceeding a month, but which are unable to make effective use of more than a few hundred processors. One such example is numerical relativity, used to model colliding neutron stars to simulate gamma ray bursts (GRB) and simultaneously identify the gravitational wave signature for detection with such massive instruments as LIGO (Laser Interferometer Gravitational-Wave Observatory). These codes exploit the efficiencies of Adaptive Mesh Refinement (AMR) to concentrate processing effort at the most active parts of the computation space at any one time. However, conventional parallel programming methods using MPI [1] and systems such as distributed memory MPPs and Linux clusters exhibit poor efficiency and constrained scalability, severely limiting scientific advancement. Many other applications exhibit similar properties. To achieve dramatic improvements for such problems and prepare them for exploitation of Petaflops systems comprising millions of cores, a new execution model and programming methodology is required [2]. This paper briefly presents such a model, ParalleX, and provides early results from an experimental implementation of the HPX runtime system that suggest the future promise of such a computing strategy.

It is recognized that technology trends have forced high performance system architectures into the new regime of heterogeneous multicore structures. With multicore becoming the new Moore's Law, performance advances even for general commercial applications require parallelism in what was once the domain of purely sequential computing. In addition, accelerators including, but not limited to, GPUs are being applied for significant performance gains, at least for certain applications where inner loops exhibit numerically intensive operation on relatively small data. Future high end systems will integrate thousands of "nodes", each comprising many hundreds of cores, by means of system area networks. Future applications like AMR algorithms will involve the processing of large time-varying graphs with embedded meta-data. Also of this general class are informatics problems that are of increasing importance for knowledge management and national security.

Critical bottlenecks to the effective use of new generation HPC systems include:

• Starvation – due to lack of usable application parallelism and means of managing it,
• Overhead – reduction to permit strong scalability, improve efficiency, and enable dynamic resource management,
• Latency – from remote access across the system or to local memories,
• Contention – due to multicore chip I/O pins, memory banks, and system interconnects.

The ParalleX model has been devised to address these challenges by enabling a new computing dynamic through the application of message-driven computation in a global address space context with lightweight synchronization. This paper describes the ParalleX model through the syntax of the PXI API and presents a runtime system architecture that delivers the middleware mechanisms required to support the parallel execution, synchronization, resource allocation, and name space management. Section 3 describes the ParalleX model and the performance drivers motivating it. Section 4 describes the HPX runtime system architecture and particular implementation details important to the results. Section 5 describes early experiments that demonstrate key functional attributes of the runtime system. Section 6 presents and discusses the results, with concluding comments and future work presented in Section 7.

3. THE PARALLEX MODEL OF PARALLEL EXECUTION

A phase change in high performance computing is driven by significant advances in enabling technologies requiring new computer architectures to exploit the performance benefits of the technologies, while compensating for the challenges they present. Fundamental to such effective change, the architectures that reflect them, and the programming models that enable access and user exploitation of them is the transformative execution model that establishes the semantic framework tying all levels together. Historically, HPC has experienced at least five such phase changes in models of computation over the last six decades, each enabling a successive technology generation. Currently, significant advances in technology are forcing dramatic changes in their usage, as reflected by early multicore heterogeneous architectures suggesting the need for new approaches to application programming.

The goal of the ParalleX model of computation is to address the key challenges of efficiency, scalability, sustained performance, and power consumption with respect to the limitations of conventional programming practices (e.g., MPI). ParalleX will improve efficiency by reducing average synchronization and scheduling overhead, improve utilization through asynchrony of work flow, and employ adaptive scheduling and routing to mitigate contention (e.g., memory bank conflicts). Scalability will be dramatically increased, at least for certain classes of problems, through data directed computing using message-driven computation and lightweight synchronization mechanisms that will exploit the parallelism intrinsic to dynamic directed graphs through their meta-data. As a consequence, sustained performance will be dramatically improved, both in absolute terms through extended scalability for those applications currently constrained, and in relative terms due to the enhanced efficiency achieved. Finally, power reductions will be achieved by reducing extraneous calculations and data movements; mechanisms such as speculative prefetching are largely eliminated, while dynamic adaptive methods and multithreading in combination serve many of the purposes these conventionally provide.

ParalleX replaces the conventional communicating sequential processes model to address the challenges imposed by post-Petascale computing, multicore arrays, heterogeneous accelerators, and the specific class of dynamic graph problems, while exploiting the opportunities of multithreaded multicore technologies and architectures. Instead of statically allocated processes using message-passing for communication and synchronization, dynamically scheduled multiple threads using message-driven mechanisms for moving the work to the data and local control objects for synchronization are employed. The result is a new dynamic based on local synchrony and global asynchronous operation. Key to the efficiency and latency hiding of ParalleX is the message-driven work-queue methodology of applying user tasks to physical processing resources. This separates the work from the resources and, given sufficient parallelism, allows the cores to continue to do useful work even in the presence of remote service requests and data accesses. This results in system-wide latency hiding intrinsic to the paradigm.

Global barriers are essentially eliminated as the principal means of synchronization and are instead replaced by lightweight Local Control Objects (LCOs) that can be used for a plethora of purposes, from simple mutex constructs to sophisticated futures semantics for anonymous producer-consumer operation. LCOs enable event-driven thread creation and can support in-place data structure protection and on-the-fly scheduling. One consequence of this is the elimination of conventional critical sections, with equivalent functionality achieved local to the data structure contended for by embedded local control objects and dynamic futures. The parallelism implicit in the topology of the directed graphs is exposed and exploited to address starvation by enabling more concurrency, dynamic scheduling, and adaptive control. Local Control Objects may be embedded within the graph structures themselves, guiding contention of concurrent actions on the same data and helping to expose fine grained parallelism.
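To make the futures semantics mentioned above concrete, the following minimal sketch expresses an anonymous producer-consumer exchange with standard C++ futures. It is purely illustrative and uses std::promise/std::future rather than the ParalleX/HPX primitives; the names value_promise, producer, and consumer are hypothetical.

#include <future>
#include <iostream>
#include <thread>

// Illustrative sketch only (standard C++, not the HPX/PXI API): a future used as an
// anonymous producer-consumer synchronization object.
int main()
{
    std::promise<double> value_promise;                     // write-once slot, plays the role of an LCO
    std::future<double> value = value_promise.get_future();

    std::thread producer([&value_promise]() {
        double result = 3.14159;                            // stand-in for a remote or expensive computation
        value_promise.set_value(result);                    // triggers any consumer waiting on the future
    });

    std::thread consumer([&value]() {
        std::cout << "consumed: " << value.get() << "\n";   // suspends only at the point where the value is needed
    });

    producer.join();
    consumer.join();
    return 0;
}

The consumer is created before the value exists and waits only at the single point where the value is actually required, instead of waiting at a global barrier; LCOs generalize exactly this behavior.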

Unlike conventional distributed memory practices, ParalleX exhibits an active global address space (AGAS) that is an extension of the experimental partitioned global address space (PGAS) [3]. The key value of AGAS is that it allows virtual objects to be moved in physical space (across localities) without having to change the virtual name. In some applications this is of no value, as a static distribution of data is sufficient. But where dynamic load balancing or dynamic directed graph problems are important to extreme scale application execution, the ability to employ a more flexible shared global name management scheme can greatly simplify resource allocation over time and make it more efficient.

ParalleX establishes a new relationship between virtual processes and the physical processing resources. Conventional practices assign a given process to a specified processor (or core), and "parallel processes" means multiple processes operating concurrently. ParalleX parallel processes incorporate substantial parallelism within a given process, map to multiple cores, and provide a shared name space across multiple localities or nodes. A ParalleX parallel process can define a name space shared across several localities supporting many concurrent threads and child processes. It allows application modules to be defined with a shared name space and to exploit many layers of parallelism within the same context.

ParalleX will support the phase change in HPC that is required to extend the value of future technology trends as embodied in the increase in the number of cores and the variability of core structures and ISAs. Multicore is the new Moore's Law. Post-Petascale computing towards Exascale will demand millions of cores and hundreds-of-million-way parallelism. New classes of applications based on dynamic graph structures, from Science, Technology, Engineering and Mathematics (STEM) to informatics problems, will require effective execution in form and functionality very different from conventional vector-like problems. ParalleX provides semantics and mechanisms that will support multicore systems performing graph-based problems for extremes in scalability.

4. HPX

The HPX (High Performance ParalleX) runtime system presented in this paper leverages the experience obtained from developing earlier versions of runtime systems for parallel execution models. This implementation in C++ represents a first attempt to develop a comprehensive API for a parallel runtime system supporting the ParalleX model. It provides a target for the development and implementation of the PXI specification, which is meant to define a low level API for ParalleX applications, very much as MPI [1] defines an API for communicating sequential processes. The implementation has a number of key features, described later in detail:

• It is a modular, feature-complete, and performance oriented representation of the ParalleX model targeted at conventional architectures and, currently, Linux based systems.
• Its modular architecture allows for easy compile time customization and minimizes the runtime memory footprint.
• It enables dynamically loaded application-specific modules to extend the available functionality at runtime. Static pre-binding at link time is also supported.
• Its strict adherence to Standard C++ [4] and the utilization of Boost [5] enable it to combine powerful optimization techniques and optimal code generation with excellent portability and a modular library structure.
• It has been designed for distributed applications handling very large dynamic graphs.

We designed HPX as a runtime system providing an alternative to conventional computation models, such as MPI, while attempting to overcome limitations such as global barriers, insufficient and too coarse grained parallelism, and poor latency hiding capabilities. The ParalleX model is intrinsically latency hiding, delivering an abundance of parallelism within a hierarchical distributed global shared name-space environment. This allows HPX to provide a multi-threaded, message-driven, split-phase transaction, non-cache coherent distributed shared memory programming model using futures based synchronization on large distributed system architectures.

4.1 General Design

The implementation level requirements of the HPX library as described in the previous section directly motivate a number of design objectives. Our most important objective was to design a state-of-the-art parallel runtime system providing a solid foundation for ParalleX applications while remaining as efficient, as portable, and as modular as possible. This efficiency and modularity of the implementation is central to the design and dominates the overall architecture of the library (see Figure 1).

Figure 1 shows a block diagram of our HPX implementation. It exposes the necessary modules and an API to create, manage, connect, and delete any ParalleX parts from an application; it is generally responsible for resource management. The current implementation of HPX provides the infrastructure for the following ParalleX concepts:

• AGAS (active global address space),
• Parcel transport and parcel management,
• Threads and thread management,
• LCO's (local control objects), and
• Parallel Processes.

Figure 1: Modular structure of the current HPX implementation. HPX implements the supporting functionality for all of the elements needed for the ParalleX model: AGAS (active global address space), parcel port and parcel handlers, thread manager, ParalleX threads, ParalleX processes, LCO's (local control objects), and the means of integrating application specific components.

The following sections will describe design considerations and implementation specific details for those elements.

4.2 AGAS – The Active Global Address Space

The requirements for dynamic load balancing and the support for dynamic graph related problems define the necessity for a single global address space across the system. The abstraction of localities is introduced as a means of defining a border between controlled synchronous (intra-locality) and fully asynchronous (inter-locality) operations. A locality is a contiguous physical domain managing intra-locality latencies, while guaranteeing compound atomic operations on local state. Different localities may expose diverse temporal locality attributes. Our implementation interprets a locality to be equivalent to a node in a conventional system.

Everything in ParalleX needs to be movable to a different locality without ever changing its name. At the same time, conventional SMP systems often don't have a uniform address space. The Partitioned Global Address Space [6], as a current attempt to solve this problem, lacks the global immutability of names. The Active Global Address Space (AGAS) assigns global names (ids, unstructured 128 bit integers) to all entities managed by HPX. It provides a means of resolving these global ids into the corresponding local virtual addresses (LVA's), while assuming no coherence between localities. LVA's consist of the locality id, the type of the entity, and its local memory address. Moving an entity to a different locality updates this mapping, keeping all references to the moved item valid.

Our current implementation is based on a centralized server/client architecture. As measurements showed this to be a bottleneck for larger numbers of localities, we implemented a local caching policy minimizing the number of required network roundtrips as much as possible. The current implementation allows the creation of globally unique id's autonomously in the locality where the entity is created, avoiding additional overhead. The modular system architecture will make it possible to replace the server/client architecture in the future with a more scalable solution without touching any of the other parts of the runtime system.
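The following is a minimal, standalone C++ sketch of the resolution scheme described above, not the actual HPX AGAS implementation; the types global_id, local_virtual_address, and address_space are hypothetical illustrations. It shows the one property that matters: moving an entity only rebinds its global id, so remote references stay valid.

#include <cstdint>
#include <iostream>
#include <map>
#include <string>

// Illustrative sketch only: a global id is an unstructured 128 bit integer; resolving it
// yields a local virtual address (locality id, entity type, local memory address).
struct global_id { std::uint64_t high, low; };

bool operator<(global_id const& l, global_id const& r)
{
    return l.high < r.high || (l.high == r.high && l.low < r.low);
}

struct local_virtual_address
{
    std::uint32_t locality;     // node on which the entity currently lives
    std::string   type;         // type of the entity
    void*         address;      // local memory address within that locality
};

class address_space
{
    std::map<global_id, local_virtual_address> table_;
public:
    void bind(global_id id, local_virtual_address lva) { table_[id] = lva; }

    // Moving an entity only updates the mapping; the global id (and therefore every
    // remote reference to the entity) stays valid.
    void move(global_id id, std::uint32_t new_locality, void* new_address)
    {
        local_virtual_address& lva = table_.at(id);
        lva.locality = new_locality;
        lva.address  = new_address;
    }

    local_virtual_address resolve(global_id id) const { return table_.at(id); }
};

int main()
{
    address_space agas;
    int datum = 42;
    agas.bind(global_id{0, 1}, local_virtual_address{0, "int", &datum});
    agas.move(global_id{0, 1}, 1, nullptr);   // relocated to locality 1 (new local address not modeled here)
    std::cout << agas.resolve(global_id{0, 1}).locality << "\n";
    return 0;
}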
4.3 Parcel Transport and Management

Any inter-locality messaging is based on Parcels. In ParalleX, parcels are an extended form of active messages. They consist of four parts: the global address of their destination, the action to perform, the arguments to pass on to the invoked action, and a continuation, which is a list of local control objects to be triggered after the action is executed. The continuation implements distributed flow control as it specifies the sequence of operations across localities.

Parcels are generated and sent whenever an operation has to be applied on a remote locality and are used either to move the work to the data or to gather small pieces of data back to the caller. Parcels enable message passing for distributed control flow and dynamic resource management, implementing a split phase transaction based execution model.

In the HPX implementation, parcels are represented as polymorphic C++ objects. They are serialized and de-serialized using a sophisticated serialization scheme, binding the actions to execute at compile time. This highly optimized implementation removes the need to look up the function to invoke based on the action description stored in the parcel.

Parcels are sent to their destination using the implemented parcel transport layer. Currently, it establishes asynchronous P2P connections between the source and destination localities. The parcel transport layer not only buffers incoming and outgoing parcels, but it also automatically resolves the global destination address of each parcel to the id of the destination locality. Additionally, it either dispatches incoming parcels to the local thread manager or forwards them in case the destination entity has been moved to a different locality.
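A schematic C++ rendering of the four parcel constituents, and of how continuations are triggered after the action has run, is given below. This is an illustration only, assuming simplified types (a plain integer argument and std::function members); it is not the polymorphic parcel class hierarchy used by HPX.

#include <cstdint>
#include <functional>
#include <iostream>
#include <vector>

// Illustrative sketch only: the four constituents of a parcel and the way its
// continuations are triggered once the action has been executed.
struct parcel
{
    std::uint64_t destination;                              // global id of the target entity
    std::function<int(int)> action;                         // operation to apply at the destination
    int argument;                                           // argument(s) passed to the action
    std::vector<std::function<void(int)>> continuations;    // LCO triggers, e.g. a waiting future
};

// On the receiving locality the parcel handler runs the action and feeds the result to
// every continuation, modeling event driven thread creation/reactivation.
void apply(parcel const& p)
{
    int result = p.action(p.argument);
    for (auto const& c : p.continuations)
        c(result);
}

int main()
{
    parcel p;
    p.destination = 42;
    p.action = [](int x) { return x * x; };
    p.argument = 7;
    p.continuations.push_back([](int r) { std::cout << "continuation got " << r << "\n"; });
    apply(p);                                               // stands in for send + remote execution
    return 0;
}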


While HPX is currently based on TCP/IP, we will improve the performance of the parcel transport layer by incorporating existing high performance messaging libraries, such as GASNet [7] and Converse [8].

4.4 PX-Threads and their Management

PX-threads are first class HPX objects in the sense that they have an immutable name enabling their management, even from remote localities. Although theoretically possible, we avoid moving threads to different localities. The preferred method is to send a parcel which creates a new thread continuing to work on the task at hand. We believe this scheme to be more efficient, especially on heterogeneous systems where moving threads is a very expensive operation. PX-threads maintain a thread state, an execution frame, and an (operation specific) set of registers. Allowed thread states are: pending, running, suspended, and terminated.

PX-threads are implemented as user level threads. These are cooperatively (non-preemptively) scheduled in user mode by the thread manager on top of one operating system thread (e.g., Pthread) per core. The threads can be scheduled without a kernel transition, which provides a performance boost. Additionally, the full use of the OS's time quantum per OS-thread can be achieved even if a PX-thread blocks for any reason. The scheduler is cooperative in the sense that it will not preempt a running PX-thread until it finishes execution or that thread cooperatively yields its execution on behalf of other tasks. This is particularly important since it avoids context switches and cache thrashing due to the randomization introduced by preemption.

The thread manager is currently implemented as a 'First Come First Served' scheduler, where all OS threads work from a single (global) queue of tasks. Measurements showed this to be sufficient for a relatively small number of concurrent OS threads, but the contention arising during push/pop operations on this queue quickly starts to limit scalability. We are working on a more scalable work stealing scheduler using one queue per OS thread (core) combined with a single global queue (very similar to Intel's Thread Building Blocks [9], CILK++ [10], or Microsoft's Concurrency Runtime [11]). At creation time the thread manager captures the machine topology and is parameterized with the number of resources it should use, the number of OS threads mapped to its allocated resources, its resource allocation policy, and whether it should give priority to executing threads to increase cache hits or improve fairness across threads. By default it will use all available cores and will create one static OS-thread per core.

The thread manager additionally keeps a map of all existing PX-threads of a locality. It exposes a set of functions to query and manipulate any of the attributes of the PX-threads.
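The sketch below illustrates the 'First Come First Served' policy just described using standard C++ threads: a fixed set of worker OS-threads draining one global task queue. It is a simplified stand-in, not the HPX thread manager; in particular it omits PX-thread states, suspension, and the per-core queues of the planned work stealing scheduler. The class name fcfs_thread_manager is hypothetical.

#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal sketch of an FCFS scheduler: every worker OS-thread pops work items from one
// global queue and runs each to completion (cooperative, non-preemptive).
class fcfs_thread_manager
{
    std::queue<std::function<void()>> queue_;   // single global task queue
    std::mutex mutex_;
    std::condition_variable cv_;
    std::vector<std::thread> workers_;
    bool stopping_ = false;

public:
    explicit fcfs_thread_manager(unsigned num_os_threads)
    {
        for (unsigned i = 0; i != num_os_threads; ++i)
            workers_.emplace_back([this] { run(); });
    }

    void schedule(std::function<void()> task)   // a new work item enters in 'pending' state
    {
        { std::lock_guard<std::mutex> l(mutex_); queue_.push(std::move(task)); }
        cv_.notify_one();
    }

    ~fcfs_thread_manager()                      // drain remaining work, then join the workers
    {
        { std::lock_guard<std::mutex> l(mutex_); stopping_ = true; }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }

private:
    void run()                                  // executed by each worker OS-thread
    {
        for (;;)
        {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> l(mutex_);
                cv_.wait(l, [this] { return stopping_ || !queue_.empty(); });
                if (queue_.empty()) return;     // stopping and no work left
                task = std::move(queue_.front());
                queue_.pop();
            }
            task();
        }
    }
};

int main()
{
    fcfs_thread_manager tm(4);                  // e.g. one OS-thread per core
    for (int i = 0; i != 8; ++i)
        tm.schedule([i] { std::cout << "task " << i << "\n"; });
    return 0;
}

Because all workers contend for the single queue mutex, this arrangement exhibits exactly the push/pop contention described above, which motivates the per-core queues of the planned work stealing scheduler.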
4.5 LCO's – Local Control Objects

A local control object is an abstraction of a multitude of different functionalities for event driven PX-thread creation, protection of data structures from inconsistent concurrent access (race conditions), and automatic on-the-fly scheduling of work with the goal of letting every single computation strand proceed as far as possible. Every object which may create (or reactivate) a PX-thread, as a result of receiving a parcel from a remote locality or of being invoked locally, exposes the necessary functionality of an LCO.

As mentioned in section 4.3, LCO's are used to organize flow control. Parcel continuations are implemented as lists of LCO's, enabling remote thread creation/reactivation combined with the delivery of a result or an error condition after the remote operation has finished. Figure 2 depicts a simple usage of continuations implementing the remote creation/reactivation and synchronization of two PX-threads using an eager future (a future refers to an object that acts as a proxy for a result that is initially not known, usually because the computation of its value has not yet completed) [12], [13].

Figure 2: Schematic Flow Diagram of a Future. At the point of its creation the future sends a parcel to initiate the remote operation. This parcel carries the id of the future itself as its continuation. As the result of the remote operation is automatically sent to the specified continuations, the future simply waits for the result in its get() function.

HPX provides specialized implementations of a full set of synchronization primitives (mutexes, conditions, semaphores, full-empty bits, etc.) usable to cooperatively block a PX-thread while informing the thread manager that other work can be run on the OS-thread (core). The thread manager can then make a scheduling decision to run other work. PX-threads require these special, cooperative synchronization primitives, as it is not possible to use the existing synchronization primitives provided by the operating system, because these are designed for OS-threads. While PX-threads are user level threads running on top of OS-threads, blocking on such a primitive stalls the whole OS-thread, inhibiting other PX-threads from being executed while waiting. We will explain the usage of other LCO's, dataflow templates, in section 5. It is interesting to note that suspended PX-threads are LCO's as well, as they "produce" an active thread when triggered (reactivated).
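To make the eager future of Figure 2 concrete, here is a minimal single-node C++ sketch, assuming the "remote" operation can be modeled by a helper thread started at construction time. It is not the HPX future implementation; the class eager_future and its fixed int result type are illustrative only.

#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

// Illustrative sketch only: creating the future immediately launches the "remote"
// operation; the operation delivers its result by triggering the future (which acts as
// the continuation); get() waits until the result has arrived.
class eager_future
{
    std::mutex mutex_;
    std::condition_variable cv_;
    bool ready_ = false;
    int value_ = 0;
    std::thread remote_;                      // stands in for the parcel sent at construction time

public:
    eager_future()
      : remote_([this] {
            std::this_thread::sleep_for(std::chrono::milliseconds(10));  // simulated remote work
            set(42);                          // remote side triggers the continuation (this future)
        })
    {}

    ~eager_future() { remote_.join(); }

    void set(int v)                           // invoked when the result parcel arrives
    {
        { std::lock_guard<std::mutex> l(mutex_); value_ = v; ready_ = true; }
        cv_.notify_all();
    }

    int get()                                 // waits until the value is available
    {
        std::unique_lock<std::mutex> l(mutex_);
        cv_.wait(l, [this] { return ready_; });
        return value_;
    }
};

int main()
{
    eager_future f;                           // construction initiates the remote operation
    std::cout << "result: " << f.get() << "\n";
    return 0;
}

In HPX, get() would not stall the worker OS-thread as the condition variable does here; it would cooperatively suspend the calling PX-thread so the thread manager can run other work, which is precisely why the specialized primitives described above are needed.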

5. EXPERIMENTS

This section describes the setup of two experiments conducted for this paper. First, we compared the results of Fibonacci number calculations using different programming environments. Second, we compared the runtime behavior of a synthetic MPI program doing single-level evolution on a one-dimensional periodic grid with an equivalent program written using HPX.

The implementation of the Fibonacci calculation uses a recursive algorithm expecting positive numbers. All written test programs (using Java native threads, Pthreads, and PX-threads) instantiate one thread per invocation of the function fib() (see Figure 4), while joining the threads before returning the result value.

int fib(int n)
{
    if (n == 1 || n == 2)
        return n;
    return fib(n - 1) + fib(n - 2);
}

Figure 4: Algorithm of the used Fibonacci calculation. Despite its poor performance, the recursive Fibonacci algorithm has been chosen because the code is trivial and allows a good analysis of the respective overheads introduced by the different programming models.
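For reference, the structure of the multithreaded benchmark (one thread per fib() invocation, joined before returning) can be sketched in portable C++ as follows. The actual test codes use PX-threads, Java threads, and Pthreads respectively, so this std::async version is only an illustration of the pattern.

#include <future>
#include <iostream>

// Sketch of the benchmark structure: every invocation spawns one new asynchronous task
// per recursive call and joins it before returning, mirroring Figure 4.
int fib_threaded(int n)
{
    if (n == 1 || n == 2)
        return n;
    // std::launch::async forces a dedicated thread per call, as in the experiments.
    std::future<int> left  = std::async(std::launch::async, fib_threaded, n - 1);
    std::future<int> right = std::async(std::launch::async, fib_threaded, n - 2);
    return left.get() + right.get();            // join both children before returning
}

int main()
{
    std::cout << fib_threaded(12) << "\n";      // small argument: the thread count grows exponentially
    return 0;
}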
To evaluate the efficiency of HPX in dealing with asymmetric parallel workloads (such as those characterizing the GRB code), a synthetic MPI program was developed to perform a single-level evolution on a one-dimensional periodic grid, mirroring typical programming practices adopted by the scientific computing community. This application equally distributes the grid points across the available MPI processes and assigns a predefined amount of work to each fixed-sized zone grouping adjacent points. The amount of work in each zone is distributed randomly, with mean and standard deviation specified by the user (a uniform probability distribution function was used). It needs to be emphasized that such a model is only a vastly simplified representation of real-world astrophysics simulations, in which 10-level AMR in three dimensions could produce as much as 16^10 times the computational effort per data point of that performed at the base grid level. At the end of each time step, the MPI program performs a boundary update involving explicit message passing between the processes, acting effectively as a barrier. Special care has been taken to assure that the cumulative workloads in both the MPI and HPX implementations discussed below are identical despite the randomization.

Figure 3: Schematic structure of a two stage static dataflow graph used for one dimensional adaptive mesh refinement (AMR) calculations. Each of the functional elements F corresponds to a single data point at a certain time step. Each functional element takes the 3 outputs generated at the previous time step by the adjacent elements, calculates the current time step, and provides the value to the adjacent functional elements responsible for the next time step. The outputs Out1,i,i-1, Out1,i,i, and Out1,i,i+1 are connected to the inputs I0,i-1,i-1, I0,i,i, and I0,i+1,i+1, forming a closed execution loop. Adaptive mesh refinement is achieved by instantiating a new static dataflow graph replacing several functional elements with a finer resolution graph.

The HPX test program uses a completely different approach to representing an equivalent data grid. Each data point is implemented as a separate dataflow template having 3 inputs and 3 outputs, where the inputs are connected to the outputs of adjacent data points in a way mirroring the required data transfer between data points after each time step. Figure 3 shows a schematic overview of the layout. The implemented two stage pipeline is configured such that the outputs of stage 1 are connected back to the inputs of stage 0, minimizing the required number of dataflow templates to 2N (with N being the number of data points). This approach completely avoids global barriers, as each data point can proceed as soon as the adjacent previous data points have calculated their result values.
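The following single-threaded C++ sketch illustrates the behavior of one such dataflow template: an element fires as soon as its three inputs for the current step have arrived and then pushes its result to the connected elements. It is a conceptual illustration under simplified assumptions (a trivial averaging update and synchronous triggering), not the HPX dataflow LCO; the type dataflow_element is hypothetical.

#include <functional>
#include <iostream>

// Illustrative sketch only: a three-input functional element that fires once all inputs
// for the current time step are present, with no global barrier involved.
struct dataflow_element
{
    double inputs[3];
    int received = 0;
    std::function<void(double)> targets[3];     // inputs of the adjacent elements in the next stage

    void set_input(int slot, double value)
    {
        inputs[slot] = value;
        if (++received == 3)                    // all inputs present: the element can proceed
            fire();
    }

    void fire()
    {
        double result = (inputs[0] + inputs[1] + inputs[2]) / 3.0;   // stand-in stencil update
        received = 0;
        for (auto const& t : targets)
            if (t) t(result);                   // trigger the dependent elements
    }
};

int main()
{
    dataflow_element f;
    f.targets[0] = [](double v) { std::cout << "next stage received " << v << "\n"; };
    f.set_input(0, 1.0);                        // values may arrive in any order...
    f.set_input(2, 3.0);
    f.set_input(1, 2.0);                        // ...and the element fires on the last one
    return 0;
}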

Figure 5: Runtimes for equivalent MPI and HPX test model runs. This figure shows the dependency between the overall runtime of a test model run and the standard devia- tion of a non-uniform workload executed at each of the data points of the data grid.

6. RESULTS

For the sake of consistency, all experiments were performed on a dedicated dual-Opteron workstation containing 4 execution cores and 8 GB of memory. Note that this configuration sensibly limits the number of underlying OS-threads executing the concurrent workloads in HPX to a maximum of four. The performance data were collected using multithreaded programs calculating the Fibonacci sequence written in HPX, Java, and Linux Pthreads (NPTL). While both the HPX and Java applications used their respective implementations of futures to handle the asynchrony, the Pthreads code relied on explicit child thread joins to extract the results of lower level calculations, thus preserving the tree-like structure of the computation.

Figure 6: Runtimes for different implementations of the Fibonacci sequence. This figure depicts the dependency of the wallclock time (logarithmic scale) needed for the calculation of a Fibonacci number using equivalent implementations based on different programming environments: HPX, Java, and Pthreads.

The graph of execution time shown in Figure 6 was obtained starting with the argument value of 5 and continuing until the system exhausted its physical memory (HPX), or application crashes started to occur due to the violation of preset physical resource limits (Java, Pthreads). As can be seen, HPX user-level threads consistently outperform the other two implementations because of the much lower overhead of thread creation and context switches, even when executing only on top of one OS-thread. Moreover, the per-thread memory requirements of HPX are substantially lower, shifting the "knee" of the performance curve towards greater argument values. This means that the point at which a substantial fraction of the working set is relegated to main memory (as opposed to running mostly from caches) occurs for a much higher number of concurrent threads in HPX, indicating greater scalability. However, HPX still has room for improvement, as the speedup obtained with four versus one worker OS thread was smaller than 4, which suggests some scheduling inefficiencies.

The graph of the execution times comparing the MPI and HPX AMR programs, collected for 50 iterations on 100 grid points and normalized to unit mean workload per point, is shown in Figure 5. Normalization was done to eliminate the skewing of results by the communication and scheduling overheads, as the amount of work per point in a single time step was substantial (a few tens of milliseconds). As expected, the performance of both applications is identical when the standard deviation is zero (MPI processes don't have to wait for each other in the update phase, as they execute the same amount of work). The situation changes radically when workload variation is introduced; since the increase in execution time of the busiest MPI process in each of the iterations is proportional to the standard deviation of the workload, the performance of the MPI application drops linearly with the standard deviation. The HPX implementation performed evenly in all cases (the run times remained constant to two significant digits), which proves the validity of the ParalleX asynchronous execution model for this class of applications.

7. FUTURE WORK

This paper presents results drawn from using an early version of HPX. Our future work will concentrate on implementing ParalleX processes. These will be first class HPX objects defining a distributed, not necessarily contiguous, execution environment. They are used to specify a broad task while transcending multiple localities and providing a namespace for the execution of threads.

As a parallel effort, we are currently working on a specification of a ParalleX API (called PXI), which will define language bindings for C, Fortran and C++ for ParalleX applications, very much like MPI specifies an API for communicating sequential processes. Implementation-wise, this environment will comprise a preprocessor and a set of support libraries, referred to in combination as PXlib, that will permit the integration of PXI/C codes with the existing HPX runtime system, native C++, and Unix (including Linux) operating system calls.

8. ACKNOWLEDGMENTS

We would like to acknowledge the sponsorship in the context of different programs from DoE, NSF, and DoD. Additionally we would like to thank Steve Brandt for the constructive discussions and for helping with the performance measurements for the Fibonacci example.

9. REFERENCES

[1] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, V2.1. [Internet]. 2008. Available from: http://www.mpi-forum.org/docs/mpi21-report.pdf.

[2] Kogge P, Bergman K, Borkar S, Campbell D, Carlson W, Dally W, Denneau M, Franzon P, Harrod W, Hill K, et al. ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems. DARPA IPTO; 2008.

[3] Yelick K, Bonachea D, Chen WY, Colella P, Datta K, Duell J, Graham SL, Hargrove P, Hilfinger P, Husbands P, et al. Productivity and performance using partitioned global address space languages. In: International Conference on Symbolic and Algebraic Computation, Proceedings of the 2007 International Workshop on Parallel Symbolic Computation; 2007; London, Ontario, Canada.

[4] The C++ Standards Committee. Working Draft, Standard for Programming Language C++. [Internet]. 2008. Available from: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2798.pdf.

[5] Boost: a collection of free peer-reviewed portable C++ source libraries. [Internet]. 2009. Available from: http://www.boost.org/.

[6] UPC Consortium. UPC Language Specifications, v1.2. Lawrence Berkeley National Lab; 2005 October. Tech Report LBNL-59208. Available from: http://upc.gwu.edu.

[7] Bonachea D. GASNet Specification. 2002. U.C. Berkeley Tech Report (UCB/CSD-02-1207). Available from: http://gasnet.cs.berkeley.edu/.

[8] Department of Computer Science, University of Illinois at Urbana-Champaign. CONVERSE Programming Manual. [Internet]. 2009. Available from: http://charm.cs.uiuc.edu/manuals/html/converse/manual-1p.html.

[9] TBB: Intel(R) Thread Building Blocks. [Internet]. 2009. Available from: http://www.threadingbuildingblocks.org/.

[10] CILK++: Parallelism for the Masses. [Internet]. 2009. Available from: http://www.cilk.com/.

[11] Microsoft Parallel Computing Platform Team. The Concurrency Runtime: Fine Grained Parallelism for C++. [Internet]. 2009. Available from: http://channel9.msdn.com/posts/Charles/The-Concurrency-Runtime-Fine-Grained-Parallelism-for-C/.

[12] Friedman D, Wise D. CONS should not evaluate its arguments. In: Automata, Languages and Programming; 1976. p. 257-284.

[13] Baker H, Hewitt C. The Incremental Garbage Collection of Processes. In: Proceedings of the Symposium on Artificial Intelligence and Programming Languages, SIGPLAN Notices 12; 1977.
