Distibuted Dense Numerical Linear Algebra Algorithms on Massively Parallel Architectures: DPLASMA

Distibuted Dense Numerical Linear Algebra Algorithms on massively parallel architectures: DPLASMA † George Bosilca∗, Aurelien Bouteiller∗, Anthony Danalis∗ , Mathieu Faverge∗, Azzam Haidar∗, ‡ §¶ Thomas Herault∗ , Jakub Kurzak∗, Julien Langou , Pierre Lemarinier∗, Hatem Ltaief∗, † Piotr Luszczek∗, Asim YarKhan∗ and Jack Dongarra∗ ∗University of Tennessee Innovative Computing Laboratory †Oak Ridge National Laboratory ‡University Paris-XI §University of Colorado Denver ¶Research was supported by the National Science Foundation grant no. NSF CCF-811520 Abstract—We present DPLASMA, a new project related first two, combined with the economic necessity of using to PLASMA, that operates in the distributed memory many thousands of CPUs to scale up to petascale and regime. It uses a new generic distributed Direct Acyclic larger systems. Graph engine for high performance computing (DAGuE). Our work also takes advantage of some of the features of More transistors and slower clocks means multicore DAGuE, such as DAG representation that is independent of designs and more parallelism required. The modus problem-size, overlapping of communication and computa- operandi of traditional processor design, increase the tion, task prioritization, architecture-aware scheduling and transistor density, speed up the clock rate, raise the management of micro-tasks on distributed architectures voltage has now been blocked by a stubborn set of that feature heterogeneous many-core nodes. The originality of this engine is that it is capable of translating physical barriers: excess heat produced, too much power a sequential nested-loop code into a concise and synthetic consumed, too much voltage leaked. Multicore designs format which it can be interpret and then execute in a are a natural response to this situation. By putting distributed environment. We consider three common dense multiple processor cores on a single die, architects linear algebra algorithms, namely: Cholesky, LU and QR can overcome the previous limitations, and continue factorizations, to investigate their data driven expression and execution in a distributed system. We demonstrate to increase the number of gates on the chip without from our preliminary results that our DAG-based approach increasing the power densities. However, despite obvious has the potential to bridge the gap between the peak and similarities, multicore processors are not equivalent to the achieved performance that is characteristic in the state- multiple-CPUs or to SMPs. Multiple cores on the same of-the-art distributed numerical softwares on current and chip can share various caches (including TLB!) and they emerging architectures. certainly share the memory bus. Extracting performance Index Terms—Linear systems, parallel algorithms, scheduling and task partitioning from such configurations of resources means that pro- grammers must exploit increased thread-level parallelism (TLP) and efficient mechanisms for inter-processor com- I. INTRODUCTION AND MOTIVATION munication and synchronization to manage resources Among the various factors that drive the momentous effectively. The complexity of parallel processing will changes occurring in the design of microprocessors and no longer be hidden in hardware by a combination of high end systems, three stand out as especially notable: increased instruction level parallelism (ILP) and pipeline 1) the number of transistors on the chip will continue the techniques, as it was with superscalar designs. It will current trend, i.e. double roughly every 18 months, while have to be addressed at an upper level, in software, the speed of processor clocks will cease to increase; 2) either directly in the context of the applications or in we are getting closer to the physical limit for the number the programming environment. As portability remains a and bandwidth of pins on the CPUs and 3) there will requirement, clearly the programming environment has be a strong drift toward hybrid/heterogeneous systems to drastically change. for petascale (and larger) systems. While the first two Thicker memory wall means that communication effi- involve fundamental physical limitations that the state- ciency will be even more essential. The pins that connect of-art research today is unlikely to prevail over in the the processor to main memory have become a strangle near term, the third is an obvious consequence of the point, with both the rate of pin growth and the bandwidth per pin slowing down, if not flattening out. Thus the characteristic in the state-of-the-art distributed numerical processor to memory performance gap, which is already software on current and emerging architectures. approaching a thousand cycles, is expected to grow, by The remainder of the paper is organized as follows. 50% per year according to some estimates. At the same Section II describes the related work, Section III recalls time, the number of cores on a single chip is expected the three one-sided factorization based on tile algorithms. to continue to double every 18 months, and since lim- Section IV presents the DAGuE framework. Finally, itations on space will keep the cache resources from Section V gives the experimental results and Section VI growing as quickly, cache per core ratio will continue to provides the conclusion and future work. go down. Problems with memory bandwidth and latency, and cache fragmentation will, therefore, tend to become II. RELATED WORK more severe, and that means that communication costs This paper reflects the convergence of algorithmic and will present an especially notable problem. To quantify implementation advancements in the area of dense linear the growing cost of communication, we can note that algebra in the recent years. This section presents the time per flop, network bandwidth (between parallel pro- solutions that laid the foundation for this work, which cessors), and network latency are all improving, but at include: the development of the class of tile algorithms, exponentially different rates: 59%/year, 26%/year and the application of performance-oriented matrix layout 15%/year, respectively. Therefore, it is expected to see a and the use of dynamic scheduling mechanisms based shift in algorithms’ properties, from computation-bound, on representing the computation as a Directed Acyclic i.e. running close to peak today, toward communication- Graph (DAG) [3]. bound in the near future. The same holds for communication between levels of the memory hierarchy: A. Tile Algorithms memory bandwidth is improving 23%/year, and memory The tile algorithms are based on the idea of processing latency only 5.5%/year. Many familiar and widely used the matrix by square submatrices, referred to as tiles, of algorithms and libraries will become obsolete, especially relatively small size. This makes the operation efficient dense linear algebra algorithms which try to fully exploit in terms of cache and TLB use. The Cholesky factoriza- all these architecture parameters; they will need to be tion lends itself readily to tile formulation, however the reengineered and rewritten in order to fully exploit the same is not true for the LU and QR factorizations. The power of the new architectures. tile algorithms for them are constructed by factorizing In this context, the PLASMA[1] project has devel- the diagonal tile first and then incrementally updating oped several new algorithms for dense linear algebra the factorization using the entries below the diagonal on shared memory system based on tile algorithms (see tile. This is a very well known concept, that dates back section II). In this paper, we present DPLASMA, a to the work by Gauss, and is clearly explained in the new project related to PLASMA, that operates in the classic book by Golub and Van Loan [4] and Stewart [5]. distributed-memory environment. DPLASMA introduces These algorithms were subsequently rediscovered as a novel approach to schedule dynamically dense linear very efficient methods for implementing linear algebra algebra algorithms on distributed systems. It is based operations on multicore processors [6], [7], [8], [9], [10]. on these tile algorithms, using DAGuE [2], a new It is crucial to note that the technique of processing generic distributed Direct Acyclic Graph Engine for the matrix by square tiles yields satisfactory performance high performance computing. This engine supports a only when accompanied by data organization based DAG representation independent of problem-size, over- on square tiles. This fact was initially observed by laps communications with computation, prioritizes tasks, Gustavson [11], [12] and recently investigated in depth schedules in an architecture-aware manner and manages by Gustavson, Gunnels and Sexton [13]. The layout is micro-tasks on distributed architectures featuring hetero- referred to as square block format by Gustavson et al. geneous many-core nodes. The originality of this engine and as tile layout in this work. The paper by Elmroth, resides in its capability of translating a sequential nested- Gustavson, Jonsson and Kagstr˚ om¨ [14] provides a sys- loop code into a concise and synthetic format which it tematic treatment of the subject. can interpret and then execute in a distributed environ- Finally, the well established computational model ment. We consider three common dense linear algebra al- that uses DAGs as its representation together with the gorithms, namely: Cholesky, LU and QR factorizations, dynamic task scheduling have gradually made their to investigate through the DAGuE framework

Distibuted Dense Numerical Linear Algebra Algorithms on Massively Parallel Architectures: DPLASMA

Recursive Approach in Sparse Matrix LU Factorization

Abstract 1 Introduction

Overview of Iterative Linear System Solver Packages

Accelerating the LOBPCG Method on Gpus Using a Blocked Sparse Matrix Vector Product

Over-Scalapack.Pdf

Prospectus for the Next LAPACK and Scalapack Libraries

Jack Dongarra: Supercomputing Expert and Mathematical Software Specialist

LAPACK WORKING NOTE 195: SCALAPACK's MRRR ALGORITHM 1. Introduction. Since 2005, the National Science Foundation Has Been Fund

Reciprocity Relations

On the Performance and Energy Efficiency of Sparse Linear Algebra on Gpus

Best Practice Guide

The BLAS API of BLASFEO: Optimizing Performance for Small Matrices