Memory-aware Algorithms and Scheduling Techniques for Matrix Computations

Julien Herrmann
To cite this version:
Julien Herrmann. Memory-aware Algorithms and Scheduling Techniques for Matrix Computations. Distributed, Parallel, and Cluster Computing [cs.DC]. Ecole normale supérieure de lyon - ENS LYON, 2015. English. NNT : 2015ENSL1043. tel-01241485
HAL Id: tel-01241485 https://tel.archives-ouvertes.fr/tel-01241485 Submitted on 10 Dec 2015
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Number assigned by the library: 2015ENSL1043
École Normale Supérieure de Lyon - Laboratoire de l'Informatique du Parallélisme - UMR 5668 - LIP
THESIS

submitted in fulfillment of the requirements for the degree of Doctor of the Université de Lyon, awarded by the École Normale Supérieure de Lyon

Speciality: Computer Science

Doctoral School: Informatique et Mathématiques

publicly presented and defended on 25 November 2015 by
Julien HERRMANN
Memory-aware Algorithms and Scheduling Techniques for Matrix Computations
Thesis advisor: Yves ROBERT
Thesis co-advisor: Loris MARCHAL

Before the examination committee composed of:

Luc GIRAUD, Examiner
Pierre MANNEBACK, Examiner
Loris MARCHAL, Co-advisor
Yves ROBERT, Advisor
Oliver SINNEN, Reviewer
Denis TRYSTRAM, Reviewer

Contents
Introduction
I Linear Algebra
1 Mixing LU and QR factorization algorithms
  1.1 Introduction
  1.2 Hybrid LU-QR algorithms
    1.2.1 LU step
    1.2.2 QR step
    1.2.3 LU step variants
    1.2.4 Comments
  1.3 Robustness criteria
    1.3.1 Max criterion
    1.3.2 Sum criterion
    1.3.3 MUMPS criterion
    1.3.4 Extending the MUMPS criterion to tiles
    1.3.5 Complexity
  1.4 Implementation
  1.5 Experiments
    1.5.1 Experimental framework
    1.5.2 Results for random matrices
    1.5.3 Results for different processor grid sizes
    1.5.4 Results for special matrices
    1.5.5 Results for varying the condition number
    1.5.6 Assessment of the three criteria
  1.6 Related work
    1.6.1 LUPP
    1.6.2 LU NoPiv
    1.6.3 LU IncPiv
    1.6.4 CALU
    1.6.5 Summary
  1.7 Conclusion
2 Cholesky factorization on heterogeneous platforms
  2.1 Introduction
  2.2 Context
    2.2.1 Cholesky factorization
    2.2.2 Multiprocessor scheduling
  2.3 Tools and libraries
    2.3.1 StarPU runtime system
    2.3.2 Chameleon dense linear algebra library
    2.3.3 Simgrid simulation engine
  2.4 Lower bounds
    2.4.1 Linear Programming formulation
    2.4.2 Constraint Programming formulation
    2.4.3 Upper bounds on performance
  2.5 Experiments and Results
    2.5.1 Schedulers
    2.5.2 Experimental Setup
    2.5.3 Results
  2.6 Discussion
    2.6.1 dmda vs dmdas scheduler
    2.6.2 Mapping from Constraint Programming solution
    2.6.3 Constraint Programming schedule in actual execution
  2.7 Conclusion
II Scheduling Under Memory Constraints
3 Memory-aware list scheduling for hybrid platforms
  3.1 Introduction
  3.2 Related work
    3.2.1 Task graph scheduling
    3.2.2 Pebble game and its variants
    3.2.3 Scheduling with memory constraints
    3.2.4 Hybrid computing
  3.3 Model and framework
    3.3.1 Flow and resources constraints
    3.3.2 Memory constraints
    3.3.3 Optimization problem
  3.4 Tree traversal with pre-assigned tasks
    3.4.1 Complexity results on trees
    3.4.2 Depth-first traversals
    3.4.3 Heuristics
    3.4.4 Experiments
  3.5 General problem
    3.5.1 ILP formulation
    3.5.2 Heuristics
    3.5.3 Simulation results
  3.6 Conclusion
4 Assessing the cost of redistribution
  4.1 Introduction
  4.2 Related work
    4.2.1 Communication model
    4.2.2 General data redistribution
    4.2.3 Array redistribution
  4.3 Model and framework
    4.3.1 Definitions
    4.3.2 Cost of a redistribution
    4.3.3 Optimization problems
  4.4 Redistribution
    4.4.1 Total communication volume
    4.4.2 Number of parallel communication steps
    4.4.3 Evaluation of optimal vs. arbitrary redistributions
  4.5 Coupling redistribution and stencil computations
    4.5.1 Application model
    4.5.2 Complexity
  4.6 Experiments
    4.6.1 Setup
    4.6.2 Stencil
    4.6.3 QR factorization
  4.7 Conclusion
III Auto-adjoint computations
5 Optimal multistage algorithms for adjoint computation
  5.1 Introduction
  5.2 Framework
    5.2.1 The AC problem
    5.2.2 Algorithm model
  5.3 Solution of PROB∞(l, cm, wd, rd)
    5.3.1 Properties of an optimal algorithm
    5.3.2 Optimal algorithm
  5.4 Simulations
    5.4.1 Stumm and Walther's algorithm (SWA⋆(l, cm, wd, rd))
    5.4.2 Simulation setup
    5.4.3 Simulation results
  5.5 Conclusion and future work
Conclusion
Bibliography
Publications

Introduction
The processing power of computers has always been inextricably linked to the characteristics of their memory. Since the early 1940s, many different technologies have been designed to store information. Built between 1943 and 1945, the world's first electronic digital computer, the ENIAC, was composed of twenty vacuum tube accumulators, each able to store one 10-digit decimal number. Vacuum tubes had considerable power consumption and frequently overheated, leaving the ENIAC non-functional most of the time. Instead of storing data in individual bits, the next generation of computer memory, delay line memory, turned electrical pulses into sound waves and sent them through a medium that slowed them down. It significantly increased the storage capacity and the reliability of computational systems. Delay line memory was sequential-access, as opposed to modern random-access memory, and required power to maintain the information. In the late 1940s, magnetic-core memory became the first non-volatile random-access memory, and it remained the main form of computer memory from the mid 1950s to the mid 1970s. It was gradually replaced by semiconductor-based memory, which is still used as primary storage in today's computers. The cost of this memory has led computer designers to introduce a secondary storage that is cheaper but slower and not directly accessible by the CPU. Since their introduction in 1956, hard disk drives have commonly been used as secondary storage in modern computers. With the rise of storage capacity, the memory architecture has become more and more convoluted. In the mid 1960s, the concept of virtual memory was introduced to give operating systems the ability to deal with these complex memory hierarchies. The need to properly manage the memory usage of computation systems increased with the introduction of multiprocessor architectures.
The Cray X-MP computer, released in 1982, was one of the first shared-memory parallel vector processors. It housed two CPUs embedded within a single machine that could concurrently access the same shared memory, for a peak system performance of 400 million floating-point operations per second (400 Megaflops). In the race for performance, the number of processors per machine kept growing but faced technical issues. Instead of endlessly adding computational units within a single computer, multiprocessor machines were interconnected to form computer clusters. In such systems, the processors hosted on different machines work together following a distributed-memory pattern, significantly complicating the memory architecture. In the 1990s, supercomputers with thousands of processors began to appear, and ASCI Red was the first system to break through the 1 Teraflops barrier in 1996. Twelve years later, the IBM Roadrunner was the first supercomputer to reach Petascale performance. As of today, the Tianhe-2 supercomputer, released in 2013, is ranked number one in the TOP500 list of supercomputers [1], at the speed of 33.86 Petaflops on the HPL Linpack benchmark [3]. Driven by scientific applications' increasing need for processing power, reaching the Exascale performance mark has become a common goal for computer designers. Recently, on 29 July 2015, U.S. President Obama established the National Strategic Computing Initiative (NSCI) to accelerate the development of an Exascale system [5]. It is an effort "to create a cohesive, multi-agency strategic vision and federal investment strategy in High Performance Computing
(HPC)". Such HPC systems are essential in science, engineering, technology, and industry. They are used in a wide range of computationally intensive tasks, like large data analysis or the simulation of complex physical systems. Many scientific applications rely on supercomputers, such as the simulation of seismic activity, evolving climate, brain activity, or nuclear fusion, among others. Most of these applications use subroutines to perform basic matrix computations. Developing efficient implementations of these linear algebra subroutines has been a crucial objective since the mid 1970s. In 1979, a specification for these common kernel operations using scalars and vectors was published: the level-1 Basic Linear Algebra Subprograms (BLAS). With the advent of vector processors, BLAS was augmented with level-2 kernel operations covering matrix-vector operations, such as matrix-vector multiplication or triangular linear system solution. Level-2 BLAS was designed to get the most out of vector processors, which operated on one-dimensional arrays of data. In 1987, level-3 BLAS introduced kernel operations covering matrix-matrix operations, such as matrix-matrix multiplication, or the building blocks of dense linear system solvers. These level-3 BLAS operations partition the matrices into blocks, keeping data manipulations localized and allowing a better usage of the cache. Many linear algebra libraries are available. Among them, we can highlight LAPACK [2], released in 1992, which relies on a well-tuned BLAS implementation to exploit the caches of modern cache-based architectures. LAPACK is written in Fortran and is the successor of LINPACK, which is still used as a benchmark for modern supercomputers. Indeed, linear algebra problems are computationally intensive jobs and provide a good way to compare the processing power of HPC systems.
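The difference between the three BLAS levels can be sketched with NumPy (an illustrative sketch; the routine names in the comments are the standard double-precision BLAS names, which NumPy typically dispatches to under the hood):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.random(4), rng.random(4)
A, B = rng.random((4, 4)), rng.random((4, 4))

# Level-1 BLAS (1979): vector-vector operations, e.g. DDOT.
# O(n) flops on O(n) data: no data reuse.
dot = x @ y

# Level-2 BLAS: matrix-vector operations, e.g. DGEMV.
# O(n^2) flops on O(n^2) data: still about one flop per memory access.
mv = A @ x

# Level-3 BLAS (1987): matrix-matrix operations, e.g. DGEMM.
# O(n^3) flops on O(n^2) data: each entry is reused n times, which
# blocked implementations exploit to keep the working set in cache.
mm = A @ B
```

The flop-to-memory-access ratio is what makes level-3 kernels the workhorse of cache-friendly factorizations.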
The TOP500 list ranks supercomputers twice a year based on their performance when solving a dense linear system using an LU factorization with partial pivoting and 64-bit floating-point arithmetic. Optimizing these linear algebra libraries for tomorrow's supercomputers is one of the main concerns in the HPC community. To build the next generation of supercomputers, new scientific breakthroughs are required to cope with extreme-scale computing. In February 2014, the Advanced Scientific Computing Advisory Committee (ASCAC) published the top ten Exascale research challenges [6]. This list suggests that good memory management in these increasingly complex architectures is a sine qua non for the development of an Exascale system. Indeed, there is a growing disparity between the speed of the CPU and the main memory throughput, mainly due to the limited communication bandwidth. From 1986 to 2000, processor speed improved at an average annual rate of 55% while memory speed only improved at 10% [109]. One of the main consequences is that processors may find themselves short of work, waiting for data to arrive from memory. To limit the number of memory accesses, processors are connected to an ever-increasing amount of high-speed cache memory. With the increasing size of the applications, multiprocessor systems tend to provide a separate dedicated memory for each processor. This is the Non-Uniform Memory Access (NUMA) architecture. On most supercomputers, processors are equipped with accelerators, such as GPUs, FPGAs or the more recent Xeon Phi. The CPU offloads compute-intensive portions of the code to these coprocessors, increasing the general performance of the system. Together with the increasing complexity of memory architectures and the general heterogeneity at every level, modern supercomputers offer an intricate challenge for software designers. Such systems are hard to model and lead to numerous algorithmic problems.
In this thesis, we focus on designing memory-aware algorithms and new scheduling techniques suited to modern storage architectures, for various applications. In the different chapters, we take data communications and memory usage into account while designing new algorithms for HPC systems. We also exploit the heterogeneity of these systems to improve the performance of generic matrix computations. Different approaches can be taken, and we classify them into three main parts in this thesis. In Chapters 1 and 2, we design new linear algebra solvers that squeeze the most out of future HPC architectures. We also improve existing schedulers to better deal with the heterogeneity of modern HPC architectures. In Chapters 3 and 4, we opt for a more general technique by adding memory-awareness to generic scheduling techniques. Indeed, the ordering of the computations and the mapping of the tasks onto the computational resources deeply impact the memory usage when computing a workflow application. Finally, in Chapter 5, we target another scientific application that deals with large data files. In this context, the memory is extremely limited, and it can be necessary to recompute some tasks whose output has not been stored because of memory constraints. Throughout the thesis, we delve into methods to improve memory management, using numerous techniques and targeting various scientific applications. The main contributions of the different chapters are sketched in the following paragraphs.
Chapter 1: Mixing LU and QR factorization algorithms to design high-performance dense linear algebra solvers
In this first chapter, we exhibit a new numerical algorithm to solve dense linear systems. Solving linear systems, and linear algebra problems in general, is a common task in the HPC community. The HPL Linpack benchmark uses the performance of linear algebra routines to compare the computational power of supercomputers. To solve a linear system, one usually factors the matrix using an LU or a QR factorization. Each of these two methods has advantages and drawbacks. The generic LU factorization is highly parallel and involves only broadcasts of information, which are efficient on modern distributed architectures. At the same time, the accuracy of the computed solution may deteriorate depending upon the condition number of the initial matrix. The QR factorization is twice as costly as the LU factorization and is slightly less parallel: it involves pipelined communications when factorizing the panel. However, the computed solution is more accurate than with an LU factorization. Here, we design a hybrid LU-QR algorithm that alternates LU and QR factorization steps to take advantage of the best characteristics of each method. The choice of which routine to run at each step is governed by a stability criterion that detects, before the actual computation of the step, whether performing an LU factorization step is safe from a stability point of view, or whether we should rather perform a QR step. We propose different stability criteria based on the numerical properties of the panel, and implement them in the PaRSEC runtime. The PaRSEC runtime is well suited to exploit the parallelism of large clusters of multiprocessor machines. It deals with task scheduling and communications, allowing us to concentrate on algorithm design. We run an extensive set of experiments to analyze both performance and stability results compared to state-of-the-art algorithms.
The experiments are run on a comprehensive set of random matrices, and on ill-conditioned matrices from Higham's Matrix Computation Toolbox. The work presented in this chapter has been done in collaboration with Jack Dongarra and Mathieu Faverge from the University of Tennessee, Knoxville, USA, and Julien Langou and Bradley R. Lowery from the University of Colorado, Denver, USA.
Chapter 2: Bridging the gap between experimental performance and theoretical bounds for the Cholesky factorization on heterogeneous platforms
While the previous chapter dealt with clusters of multiprocessor machines, in the second chapter of this thesis we target shared-memory multiprocessors accelerated with GPUs. In this context, specific schedulers have to be designed to exploit the heterogeneity of the platform. The StarPU runtime takes care of mapping tasks efficiently onto the computational resources, using well-known generic dynamic scheduling policies from the literature, while optimizing data transfers using prefetching and overlapping, in particular. Although these purely dynamic schedulers are efficient in practice, they are not guaranteed to perform optimally, mostly due to their local vision of the workflow at any moment. In this chapter, we explore how adding static rules to the purely dynamic schedulers implemented in StarPU can improve their performance. We focus on another classical dense linear algebra routine, namely the
Cholesky factorization. We introduce theoretical bounds on the best achievable performance on a fully heterogeneous platform made of CPUs and GPUs. We investigate how adding general information about the problem can help bridge the gap between these bounds and the actual performance. This work has been conducted in collaboration with Emmanuel Agullo, Olivier Beaumont, Lionel Eyraud-Dubois, Suraj Kumar, and Samuel Thibault from INRIA Bordeaux, Talence, France.
Chapter 3: Memory-aware list scheduling for hybrid platforms In this chapter, we study the complexity of scheduling task graph workflows whose tasks require large I/O files. We target a heterogeneous architecture with two resource types, each with a different memory, such as a multicore node equipped with a dedicated accelerator (FPGA or GPU). Each task has an unrelated processing time on either resource and can be processed only if all its input and output files can be stored in the corresponding memory. The amount of memory of each type used at a given execution step strongly depends upon the order in which the tasks are executed, and upon when communications between both memories are scheduled. The objective is to determine an efficient schedule taking into account the memory constraints for each type. First, we establish the complexity of minimizing memory usage when scheduling tree-shaped workflows whose tasks are already mapped on the resources. Even for this simplified version of the problem, we provide NP-completeness proofs and inapproximability results. Then, we design memory-aware heuristics that provide schedules achieving good completion times, without violating the memory constraints. These heuristics are evaluated on a comprehensive set of graphs, including random graphs as well as graphs arising in the context of matrix factorizations.
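How strongly the execution order drives the memory footprint can be seen on a toy instance (a hypothetical sketch with made-up file sizes, not the model of Chapter 3): each task produces an output file that must stay in memory until its parent consumes it.

```python
# Toy model: each task produces an output file of a given size.
# A task runs once all its children have run; while it runs, its
# children's outputs and its own output must reside in memory.
# Peak memory therefore depends on the traversal order.

out = {"a": 10, "b": 1, "c": 10, "d": 1, "root": 1}
children = {"a": [], "b": ["a"], "c": [], "d": ["c"], "root": ["b", "d"]}

def peak_memory(order):
    live, peak = {}, 0
    for task in order:
        live[task] = out[task]
        peak = max(peak, sum(live.values()))
        for child in children[task]:  # children outputs are now consumed
            del live[child]
    return peak

# Depth-first: finish one subtree before starting the other.
dfs = peak_memory(["a", "b", "c", "d", "root"])       # peak 12
# Interleaved: both large files "a" and "c" are alive at once.
mixed = peak_memory(["a", "c", "b", "d", "root"])     # peak 21
```

Both orders are valid schedules of the same tree, yet the interleaved one nearly doubles the peak memory; this is the effect the memory-aware heuristics have to control.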
Chapter 4: Assessing the cost of redistribution followed by a computational kernel In linear algebra applications, when dealing with a distributed computational platform, the data layout usually respects the owner-computes rule: each block of data is updated by the processor that hosts it. This strategy requires that data blocks are distributed onto the computational resources in a workload-balanced manner. If the initial data distribution is not suitable for the computational kernel, a data redistribution may be required before starting the computation. Usually, for a given application, a specific data distribution is known to be optimal, or close to optimal. Most of the time, this data distribution is not unique. In particular, if all the processors are identical, a processor permutation will not change the performance of a given data distribution. In this chapter, given an initial data distribution, we find, among all the optimal (or close-to-optimal) data distributions, the one that minimizes the cost of redistribution. We also show the NP-hardness of finding the optimal data partition and processor permutation that minimize the cost of redistribution followed by a simple computational kernel. Our new redistribution algorithms are experimentally validated using the PARSEC environment on dense linear algebra routines. The work presented in this chapter has been done with the help of Thomas Hérault from the University of Tennessee, Knoxville, USA.
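The role of the processor permutation can be illustrated on a toy instance (hypothetical block layout; brute force over all permutations, which is only feasible for tiny processor counts):

```python
import itertools

# Toy example: 6 data blocks on 2 identical processors. The target
# distribution is only defined up to a relabeling of the processors:
# any permutation gives the same kernel performance, but not the same
# redistribution cost.
initial = [0, 0, 1, 1, 1, 0]
target  = [1, 1, 0, 0, 0, 1]   # one labeling of the target distribution

def volume(src, dst):
    # communication volume = number of blocks that change owner
    return sum(s != d for s, d in zip(src, dst))

naive = volume(initial, target)
best = min(
    volume(initial, [perm[d] for d in target])
    for perm in itertools.permutations(range(2))
)
```

Here the naive labeling moves every block (volume 6), while swapping the two processor labels yields an equivalent target distribution with zero communication: choosing the labeling is the whole game.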
Chapter 5: Optimal multistage algorithms for adjoint computation In this last chapter, we target another scientific application, namely adjoint computation. In this context, the task graphs to schedule involve large I/O files and exhibit no parallelism across tasks. To cope with the memory constraints, some output files have to be deleted and recomputed later when needed. The objective of the scheduling algorithm is to determine which output files to store, in order to minimize the total processing time of the task graph. As in Chapter 3, we deal with a bi-memory model, but this time the two memories are of different natures: one has no size constraint but takes a significant time to write and read data, while the other is limited in volume but provides free write and read operations. We design the first optimal algorithm in the literature to solve this problem, and compare its performance with the state-of-the-art heuristics. This work has been conducted in collaboration with Guillaume Aupy, another PhD student of my research team, and Paul Hovland from the Argonne National Laboratory, Illinois, USA.

Part I
Linear Algebra
Chapter 1
Mixing LU and QR factorization algorithms to design high-performance dense linear algebra solvers
As said in the introduction of this thesis, optimizing the linear algebra libraries is a main concern for the HPC community. Many scientific applications rely on the numerical algorithms provided by these libraries. It is thus important to design new numerical algorithms that exploit the complex hierarchical architecture of the next generation of supercomputers. One of the main applications of the linear algebra libraries is to solve dense linear systems of the form Ax = b. It is indeed the benchmark used to assess the computational power of modern supercomputers. To solve a dense linear system Ax = b, one usually uses either an LU factorization or a QR factorization of the matrix A. LU elimination steps can be very efficiently parallelized, and cost half as many floating-point operations as QR steps. However, LU steps are not necessarily stable, while QR steps are always stable. In this first chapter, we introduce hybrid LU-QR algorithms for solving dense linear systems. Throughout a matrix factorization, these algorithms dynamically alternate LU with local pivoting and QR elimination steps based upon some robustness criterion. The hybrid algorithms execute a QR step when a robustness criterion detects some risk of instability, and they execute an LU step otherwise. The choice between LU and QR steps must have a small computational overhead and must provide a satisfactory level of stability with as few QR steps as possible. In this chapter, we introduce several robustness criteria and we establish upper bounds on the growth factor of the norm of the updated matrix incurred by each of these criteria. In addition, we describe the implementation of the hybrid algorithms through an extension of the PaRSEC software to allow for dynamic choices during execution. Finally, we analyze both stability and performance results compared to state-of-the-art linear solvers on parallel distributed multicore platforms.
A comprehensive set of experiments shows that hybrid LU-QR algorithms provide a continuous range of trade-offs between stability and performance.
1.1 Introduction
Consider a dense linear system Ax = b to solve, where A is a square tiled matrix, with n tiles per row and column. Each tile is a block of nb-by-nb elements, so that the actual size of A is N = n × nb. Here, nb is a parameter tuned to squeeze the most out of the arithmetic units and memory hierarchy. To solve the linear system Ax = b, with A a general matrix, one usually applies a series of transformations, premultiplying A by several elementary matrices. There are two main approaches: LU factorization, where one uses lower unit triangular matrices, and QR factorization, where one uses orthogonal Householder
matrices. To the best of our knowledge, this chapter is the first study to propose a mix of both approaches during a single factorization. The LU factorization update is based upon matrix-matrix multiplications, a kernel that can be very efficiently parallelized, and whose library implementations typically achieve close to peak CPU performance. Unfortunately, the efficiency of LU factorization is hindered by the need to perform partial pivoting at each step of the algorithm, to ensure numerical stability. On the contrary, the QR factorization is always stable, but requires twice as many floating-point operations, and a more complicated update step that is not as parallel as a matrix-matrix product. Tiled QR algorithms [27, 28, 90] greatly improve the parallelism of the update step since they involve no pivoting, but they are based upon more complicated kernels whose library implementations require twice as many operations as LU. The main objective of this chapter is to explore the design of hybrid algorithms that would combine the low cost and high CPU efficiency of the LU factorization, while retaining the numerical stability of the QR approach. In a nutshell, the idea is the following: at each step of the elimination, we perform a robustness test to know if the diagonal tile can be stably used to eliminate the tiles beneath it using an LU step. If the test succeeds, then we go for an elimination step based upon LU kernels, without any further pivoting involving sub-diagonal tiles in the panel. Technically, this is very similar to a step during a block LU factorization [37]. Otherwise, if the test fails, then we go for a step with QR kernels. At one extreme, if all tests succeed throughout the algorithm, we implement an LU factorization without pivoting. At the other extreme, if all tests fail, we implement a QR factorization. On average, some of the tests will fail and some will succeed.
If the fraction of the tests that fail remains small enough, we will reach a CPU performance close to that of LU without pivoting. Of course the challenge is to design a test that is accurate enough (and not too costly) so that LU kernels are applied only when it is numerically safe to do so. Implementing such a hybrid algorithm on a state-of-the-art distributed-memory platform, whose nodes are themselves equipped with multiple cores, is a programming challenge. Within a node, the architecture is a shared-memory machine, running many parallel threads on the cores. But the global architecture is a distributed-memory machine, and requires MPI communication primitives for inter-node communications. A slight change in the algorithm, or in the matrix layout across the nodes, might call for a time-consuming and error-prone process of code adaptation. For each version, one must identify, and adequately implement, inter-node versus intra-node kernels. This dramatically complicates the task of the programmers if they rely on a manual approach. We solve this problem by relying on the PaRSEC software [23, 22, 20], so that we can concentrate on the algorithm and forget about MPI and threads. Once we have specified the algorithm at a task level, the PaRSEC software will recognize which operations are local to a node (and hence correspond to shared-memory accesses), and which are not (and hence must be converted into MPI communications). Previous experiments show that this approach is very powerful, and that the use of a higher-level framework does not prevent our algorithms from achieving the same performance as state-of-the-art library releases [42]. However, implementing a hybrid algorithm requires the programmer to implement a dynamic task graph of the computation. Indeed, the task graph of the hybrid factorization algorithm is no longer known statically (contrary to a standard LU or QR factorization).
At each step of the elimination, we use either LU-based or QR-based tasks, but not both. This requires the algorithm to dynamically fork upon the outcome of the robustness test, in order to apply the selected kernels. The solution is to prepare a graph that includes both types of tasks, namely LU and QR kernels, to select the adequate tasks on the fly, and to discard the useless ones. We have to join both potential execution flows at the end of each step, symmetrically. Most of this mechanism is transparent to the user. We discuss this extension of PaRSEC in more detail in Section 1.4.

The major contributions of this chapter are the following:
• The introduction of new LU-QR hybrid algorithms;
• The design of several robustness criteria, with bounds on the induced growth factor;
• The extension of PaRSEC to deal with dynamic task graphs;
• A comprehensive experimental evaluation of the best trade-offs between performance and numerical stability.

The rest of the chapter is organized as follows. First, we explain the main principles of LU-QR hybrid algorithms in Section 1.2. Then we describe robustness criteria in Section 1.3. Next we detail the implementation within the PaRSEC framework in Section 1.4. We report experimental results in Section 1.5. We discuss related work in Section 1.6. Finally, we provide concluding remarks and future directions in Section 1.7.
1.2 Hybrid LU-QR algorithms
In this section, we describe hybrid algorithms to solve a dense linear system Ax = b, where A = (A_{i,j})_{(i,j) ∈ ⟦1..n⟧²} is a square tiled matrix, with n tiles per row or column. Each tile is a block of nb-by-nb elements, so that A is of order N = n × nb.

The common goal of a classical one-sided factorization (LU or QR) is to triangularize the matrix A through a succession of elementary transformations. Consider the first step of such an algorithm. We partition A by blocks such that

A = \begin{pmatrix} A_{11} & C \\ B & D \end{pmatrix}.

In terms of tiles, A_{11} is 1-by-1, B is (n − 1)-by-1, C is 1-by-(n − 1), and D is (n − 1)-by-(n − 1). The first block column \begin{pmatrix} A_{11} \\ B \end{pmatrix} is the panel of the current step. Traditional algorithms (LU or QR) perform the same type of transformation at each step. The key observation of this chapter is that any type of transformation (LU or QR) can be used for a given step, independently of what was used for the previous steps. The common framework of a step is the following:

\begin{pmatrix} A_{11} & C \\ B & D \end{pmatrix} \;\xrightarrow{\text{factor, apply, eliminate, update}}\; \begin{pmatrix} U_{11} & C' \\ 0 & D' \end{pmatrix}. \qquad (1.1)

First, A_{11} is factored and transformed into the upper triangular matrix U_{11}. Then, the transformation resulting from the factorization of A_{11} is applied to C. Then A_{11} is used to eliminate B. Finally, D is updated accordingly. Recursively factoring D' with the same framework completes the factorization to an upper triangular matrix. For each step, we can choose between an LU step and a QR step. The operation count for each kernel is given in Table 1.1.
                 LU step, var (A1)       QR step
factor A11       2/3      GETRF          4/3      GEQRT
eliminate B      (n − 1)  TRSM           2(n − 1) TSQRT
apply C          (n − 1)  TRSM           2(n − 1) TSMQR
update D         2(n − 1)² GEMM          4(n − 1)² UNMQR
Table 1.1: Computational cost of each kernel. The unit is nb³ floating-point operations.
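The factor-of-two ratio read off Table 1.1 can be checked by summing the per-step costs over all n panel steps. A small sketch (helper names are ours; costs in units of nb³ flops, with m = n − k the trailing size at step k):

```python
def lu_units(n):
    """Total cost of a factorization using only LU steps (variant A1),
    summing the Table 1.1 column over panels of trailing size m."""
    return sum(2/3 + m + m + 2*m**2 for m in range(n - 1, -1, -1))

def qr_units(n):
    """Same sum with the QR-step costs: every Table 1.1 row doubles."""
    return sum(4/3 + 2*m + 2*m + 4*m**2 for m in range(n - 1, -1, -1))
```

For any n, the QR total is exactly twice the LU total, and the LU total approaches 2n³/3 units, i.e. the familiar 2N³/3 flops once the nb³ unit is factored in.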
Generally speaking, QR transformations are twice as costly as their LU counterparts. The bulk of the computations take place in the update of the trailing matrix D. This obviously favors LU update kernels. In addition, the LU update kernels are fully parallel and can be applied independently on the
Algorithm 1: Hybrid LU-QR algorithm
for k = 1 to n do
    Factor: compute a factorization of the diagonal tile, either with LU partial pivoting or QR;
    Check: compute some robustness criterion (see Section 1.3) involving only tiles Ai,k, where k ≤ i ≤ n, in the elimination panel;
    Apply, Eliminate, Update:
        if the criterion succeeds then perform an LU step;
        else perform a QR step;
Algorithm 2: Step k of an LU step - var (A1)
Factor: Ak,k ← GETRF(Ak,k);
for i = k + 1 to n do
    Eliminate: Ai,k ← TRSM(Ak,k, Ai,k);
for j = k + 1 to n do
    Apply: Ak,j ← SWPTRSM(Ak,k, Ak,j);
for i = k + 1 to n do
    for j = k + 1 to n do
        Update: Ai,j ← GEMM(Ai,k, Ak,j, Ai,j);
(n − 1)² trailing tiles. Unfortunately, LU updates (using GEMM) are stable only when $\|A_{11}^{-1}\|^{-1}$ is larger than $\|B\|$ (see Section 1.3). If this is not the case, we have to resort to QR kernels. Not only are these twice as costly, but they also suffer from enforcing more dependencies: all columns can still be processed (apply and update kernels) independently, but inside a column, the kernels must be applied in sequence.

The hybrid LU-QR Algorithm uses the standard 2D block-cyclic distribution of tiles along a virtual p-by-q cluster grid. The 2D block-cyclic distribution nicely balances the load across resources for both LU and QR steps. Thus at step k of the factorization, the panel is split into p domains of approximately (n − k + 1)/p tile rows. Domains will be associated with physical memory regions, typically one domain per node in a distributed-memory platform. Thus an important design goal is to minimize the number of communications across domains, because these correspond to nonlocal communications between nodes. At each step k of the factorization, the domain of the node owning the diagonal tile Ak,k is called the diagonal domain.

The hybrid LU-QR Algorithm applies LU kernels when it is numerically safe to do so, and QR kernels otherwise. Coming back to the first elimination step, the sequence of operations is described in Algorithm 1.
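The 2D block-cyclic rule itself is one line: tile (i, j) is owned by process (i mod p, j mod q). A quick sketch (hypothetical helper names) illustrating the load balance claimed above:

```python
from collections import Counter

def tile_owner(i, j, p, q):
    """Process-grid coordinates owning tile (i, j) under a 2D block-cyclic
    distribution on a p-by-q process grid (all indices 0-based)."""
    return (i % p, j % q)

# Count tiles per process for an 8x8 tiled matrix on a 2x2 grid:
counts = Counter(tile_owner(i, j, 2, 2) for i in range(8) for j in range(8))
```

Every process owns exactly 16 of the 64 tiles, so both LU and QR updates stay load balanced; moreover, each panel column touches all p process rows, i.e. all p domains.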
1.2.1 LU step

We assume that the criterion validates an LU step (see Section 1.3). We describe the variant (A1) of an LU step given in Algorithm 2. The kernels for the LU step are the following:
Algorithm 3: Step k of the HQR factorization
for i = k + 1 to n do
    elim(i, eliminator(i, k), k);
• Factor: Ak,k ← GETRF(Ak,k) is an LU factorization with partial pivoting: Pk,k Ak,k = Lk,k Uk,k; the output matrices Lk,k and Uk,k are stored in place of the input Ak,k.
• Eliminate: Ai,k ← TRSM(Ak,k, Ai,k) solves in place the upper triangular system Ai,k ← Ai,k Uk,k⁻¹, where Uk,k is stored in the upper part of Ak,k.
• Apply: Ak,j ← SWPTRSM(Ak,k, Ak,j) solves the unit lower triangular system Ak,j ← Lk,k⁻¹ Pk,k Ak,j, where Lk,k is stored in the (strictly) lower part of Ak,k.
• Update: Ai,j ← GEMM(Ai,k, Ak,j, Ai,j) is a general matrix-matrix multiplication Ai,j ← Ai,j − Ai,k Ak,j.

In terms of parallelism, the factorization of the diagonal tile is followed by the TRSM kernels, which can be processed in parallel; then every GEMM kernel can be processed concurrently. These highly parallelizable updates constitute one of the two main advantages of the LU step over the QR step. The second main advantage is halving the number of floating-point operations.

During the factor step, one variant is to factor the whole diagonal domain instead of only factoring the diagonal tile. Considering Algorithm 2, the difference lies in the first line: rather than calling GETRF(Ak,k), thereby searching for pivots only within the diagonal tile Ak,k, we implemented a variant where we extend the search for pivots across the diagonal domain (the Apply step is modified accordingly). Working on the diagonal domain instead of the diagonal tile increases the smallest singular value of the factored region and therefore increases the likelihood of an LU step. Since all tiles in the diagonal domain are local to a single node, extending the search to the diagonal domain is done without any inter-domain communication. The stability analysis of Section 1.3 applies to both scenarios: the one where Ak,k is factored in isolation, and the one where it is factored with the help of the diagonal domain. In the experimental section, we will use the variant which factors the diagonal domain.
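To make the four kernels concrete, here is a minimal NumPy sketch of variant (A1) applied over all n steps. For brevity, the diagonal tile is factored without pivoting and the trailing tiles of a row or column are handled as one block (the actual algorithm uses GETRF with partial pivoting, SWPTRSM to apply the permutations, and per-tile kernels), so we feed it a diagonally dominant matrix:

```python
import numpy as np

def getrf_nopiv(T):
    """Packed in-place LU of one tile (no pivoting: a simplification)."""
    T = T.copy()
    for k in range(T.shape[0] - 1):
        T[k+1:, k] /= T[k, k]
        T[k+1:, k+1:] -= np.outer(T[k+1:, k], T[k, k+1:])
    return T

def lu_step_a1(A, k, nb):
    """One step of Algorithm 2 on tile (k, k) of a tiled matrix A (in place)."""
    s = slice(k*nb, (k+1)*nb)          # diagonal tile
    t = slice((k+1)*nb, None)          # trailing rows/columns
    A[s, s] = T = getrf_nopiv(A[s, s])
    Lk = np.tril(T, -1) + np.eye(nb)               # unit lower factor
    Uk = np.triu(T)                                # upper factor
    A[t, s] = np.linalg.solve(Uk.T, A[t, s].T).T   # Eliminate: A_ik U^{-1}
    A[s, t] = np.linalg.solve(Lk, A[s, t])         # Apply:     L^{-1} A_kj
    A[t, t] -= A[t, s] @ A[s, t]                   # Update:    GEMM (Schur)
    return A

nb, n = 4, 3
rng = np.random.default_rng(1)
A0 = rng.standard_normal((n*nb, n*nb)) + n*nb*np.eye(n*nb)  # dominant diag
A = A0.copy()
for k in range(n):
    lu_step_a1(A, k, nb)
L = np.tril(A, -1) + np.eye(n*nb)   # unpack the global factors
U = np.triu(A)
```

Since the block algorithm computes the same factors as scalar Gaussian elimination, `L @ U` reproduces the original matrix.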
1.2.2 QR step

If the decision to process a QR step is taken by the criterion, the LU decomposition of the diagonal domain is dropped, and the factorization of the panel starts over. This step of the factorization is then processed using orthogonal transformations. Every tile below the diagonal (matrix B in Equation (1.1)) is zeroed out using a triangular tile, or eliminator tile. In a QR step, the diagonal tile is factored (with a GEQRT kernel) and used to eliminate all the other tiles of the panel (with a TSQRT kernel). The trailing submatrix is updated with UNMQR and TSMQR kernels, respectively. To further increase the degree of parallelism of the algorithm, it is possible to use several eliminator tiles inside a panel, typically one (or more) per domain. The only condition is that concurrent elimination operations must involve disjoint tile pairs (the unique eliminator of tile Ai,k will be referred to as Aeliminator(i,k),k). Of course, in the end, there must remain only one non-zero tile on the panel diagonal, so all eliminators except the diagonal tile must themselves be eliminated later on (with a TTQRT kernel on the panel and TTMQR updates on the trailing submatrix), using a reduction tree of arbitrary shape. This reduction tree will involve inter-domain communications. In our hybrid LU-QR Algorithm, the QR step is processed following an instance of the generic hierarchical QR factorization HQR [42] described in Algorithms 3 and 4.
Each elimination elim(i, eliminator(i, k), k) consists of two sub-steps: first, in column k, tile (i, k) is zeroed out (or killed) by tile (eliminator(i, k), k); and in each following column j > k, tiles (i, j)
Algorithm 4: Elimination elim(i, eliminator(i, k), k)

(a) With TS kernels:
    Aeliminator(i,k),k ← GEQRT(Aeliminator(i,k),k);
    Ai,k, Aeliminator(i,k),k ← TSQRT(Ai,k, Aeliminator(i,k),k);
    for j = k + 1 to n − 1 do
        Aeliminator(i,k),j ← UNMQR(Aeliminator(i,k),j, Aeliminator(i,k),k);
        Ai,j, Aeliminator(i,k),j ← TSMQR(Ai,j, Aeliminator(i,k),j, Ai,k);
(b) With TT kernels:
    Aeliminator(i,k),k ← GEQRT(Aeliminator(i,k),k);
    Ai,k ← GEQRT(Ai,k);
    for j = k + 1 to n − 1 do
        Aeliminator(i,k),j ← UNMQR(Aeliminator(i,k),j, Aeliminator(i,k),k);
        Ai,j ← UNMQR(Ai,j, Ai,k);
    Ai,k, Aeliminator(i,k),k ← TTQRT(Ai,k, Aeliminator(i,k),k);
    for j = k + 1 to n − 1 do
        Ai,j, Aeliminator(i,k),j ← TTMQR(Ai,j, Aeliminator(i,k),j, Ai,k);

and (eliminator(i, k), j) are updated; all these updates are independent and can be triggered as soon as the elimination is completed. The algorithm is entirely characterized by its elimination list, which is the ordered list of all the eliminations elim(i, eliminator(i, k), k) that are executed. The orthogonal transformation elim(i, eliminator(i, k), k) uses either a TTQRT kernel or a TSQRT kernel, depending upon whether the tile to eliminate is triangular or square. In our hybrid LU-QR Algorithm, any combination of the reduction trees of the HQR algorithm described in [42] is available. It is then possible to use an intra-domain reduction tree to locally eliminate many tiles without inter-domain communication. A unique triangular tile is left on each node, and the reductions across domains are then performed following a second level of reduction tree.
1.2.3 LU step variants In the following, we describe several other variants of the LU step.
1.2.3.1 Variant (A2)

This variant first performs a QR factorization of the diagonal tile and proceeds pretty much as in (A1) thereafter.
• Factor: Ak,k ← GEQRF(Ak,k) is a QR factorization Ak,k = Qk,k Uk,k, where Qk,k is never constructed explicitly; we instead store the Householder reflectors Vk,k. The output matrices Vk,k and Uk,k are stored in place of the input Ak,k.
• Eliminate: Ai,k ← TRSM(Ak,k, Ai,k) solves in place the upper triangular system Ai,k ← Ai,k Uk,k⁻¹, where Uk,k is stored in the upper part of Ak,k.
• Apply: Ak,j ← ORMQR(Ak,k, Ak,j) performs Ak,j ← Qk,kᵀ Ak,j, where Qk,kᵀ is applied using Vk,k stored in the (strictly) lower part of Ak,k.
• Update: Ai,j ← GEMM(Ai,k, Ak,j, Ai,j) is a general matrix-matrix multiplication Ai,j ← Ai,j − Ai,k Ak,j.
The Eliminate and Update steps are the exact same as in (A1). The (A2) variant has the same data dependencies as (A1) and therefore the same level of parallelism. A benefit of (A2) over (A1) is that if the criterion test decides that the step is a QR step, then the factorization of Ak,k is not discarded but rather used to continue the QR step. A drawback of (A2) is that the Factor and Apply steps are twice as expensive as the ones in (A1).
1.2.3.2 Variants (B1) and (B2)

Another option is to use the so-called block LU factorization [37]. The result of this formulation is a factorization where the U factor is block upper triangular (as opposed to upper triangular), and the diagonal tiles of the L factor are identity tiles. The Factor step can either be done using an LU factorization (variant (B1)) or a QR factorization (variant (B2)). The Eliminate step is Ai,k ← Ai,k Ak,k⁻¹. There is no Apply step. And the Update step is Ai,j ← Ai,j − Ai,k Ak,j. The fact that row k is not updated provides two benefits: (i) Ak,k does not need to be broadcast to these tiles, simplifying the communication pattern; (ii) the stability of the LU step can be determined by considering only the growth factor in the Schur complement of Ak,k. One drawback of (B1) and (B2) is that the final matrix is not upper triangular but only block upper triangular, which complicates the use of these methods to solve a linear system of equations. The stability of (B1) and (B2) has been analyzed in [37].

We note that (A2) and (B2) use a QR factorization during the Factor step. Yet we still call these LU steps, because all four LU variants use the Schur complement to update the trailing sub-matrix. The mathematical operation is Ai,j ← Ai,j − Ai,k Ak,k⁻¹ Ak,j. In practice, the Update step for all four variants looks like Ai,j ← Ai,j − Ai,k Ak,j, since Ak,k⁻¹ is somehow applied to Ai,k and Ak,j during the preliminary apply and eliminate steps. The Schur update dominates the cost of an LU factorization, and therefore all variants are more efficient than a QR step. Also, we have the same level of parallelism for the update step: embarrassingly parallel. In terms of stability, all variants would follow closely the analysis of Section 1.5.4. We do not consider variants (A2), (B1), and (B2) further in this chapter, since they are all very similar, and only study Algorithm 2, (A1).
1.2.4 Comments

1.2.4.1 Solving systems of linear equations

To solve systems of linear equations, we augment A with the right-hand side b to get Ã = (A, b) and apply all transformations to Ã. Then an N-by-N triangular solve is needed. This is the approach we used in our experiments. We note that, at the end of the factorization, all the needed information about the transformations is stored in place of A, so, alternatively, one can apply the transformations to b during a second pass.
1.2.4.2 No restriction on N
In practice, N does not have to be a multiple of nb; we keep this restriction only for the sake of simplicity of the presentation. The algorithm can accommodate any N and nb with some clean-up code, which we have written.
1.2.4.3 Relation with threshold pivoting

The LU-QR Algorithm can be viewed as a tile version of standard threshold pivoting. Standard threshold pivoting works on 1-by-1 tiles (scalars, matrix elements). It checks whether the diagonal element passes a certain threshold. If the diagonal element passes the threshold, then elimination is done without pivoting, using this diagonal element as the pivot. If the diagonal element does not pass the threshold, then standard pivoting is done and elimination is performed with the largest element (in absolute value) in the column. Our hybrid LU algorithm can be seen as a variant with tiles: it checks whether the diagonal tile passes a certain threshold. If the diagonal tile passes the threshold, then elimination is done without pivoting, using this diagonal tile as the pivot. If the diagonal tile does not pass the threshold, then a stable elimination is performed. The fact that our algorithm works on tiles as opposed to scalars leads to two major differences: (1) new criteria for declaring a diagonal tile safe had to be developed; (2) in the event that a diagonal tile is declared not safe, we need to resort to a tile QR step.
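For reference, the scalar test underlying standard threshold pivoting can be written in a few lines (a hedged sketch; the helper name and threshold parameter are ours):

```python
import numpy as np

def threshold_pivot_ok(col, k, alpha=0.1):
    """Return True if the diagonal element col[k] may be used as pivot
    without pivoting: |a_kk| >= alpha * max_{i>k} |a_ik|."""
    below = np.abs(col[k+1:])
    return below.size == 0 or abs(col[k]) >= alpha * below.max()
```

The tile criteria of Section 1.3 replace the absolute values in this test with tile norms.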
1.3 Robustness criteria
The decision to process an LU or a QR step is made dynamically during the factorization, and constitutes the heart of the algorithm. Indeed, the decision criterion has to be able to detect a potentially "large" stability deterioration (according to a threshold) due to an LU step before its actual computation, in order to preventively switch to a QR step. As explained in Section 1.2, in our hybrid LU-QR algorithm, the diagonal tile is factored using an LU decomposition with partial pivoting. At the same time, some data (like the norms of non-local tiles belonging to other domains) are collected and exchanged (using Bruck's all-reduce algorithm [26]) between all nodes hosting at least one tile of the panel. Based upon this information, all nodes decide together either to continue the LU factorization step, or to drop the LU decomposition of the diagonal tile and process a full QR factorization step. The decision is broadcast to the other nodes not involved in the panel factorization within the next data communication.

The cost of the decision process depends on the choice of the criterion and must not imply a large computational overhead compared to the factorization cost. A good criterion will detect only the "worst" steps and will provide a good stability result with as few QR steps as possible. In this section, we present three criteria, going from the most elaborate (but also most costly) to the simplest one. The stability of a step is determined by the growth of the norm of the updated matrix. If a criterion detects the potential for an unacceptable growth due to an LU step, then a QR step is used. A QR step is stable, as there is no growth in the 2-norm since it is a unitary transformation. Each criterion depends on a threshold α that allows us to tighten or loosen the stability requirement, and thus to influence the number of LU steps that we can afford during the factorization. In Section 1.5.4, we experiment with different choices of α for each criterion.
1.3.1 Max criterion

LU factorization with partial pivoting chooses the largest element of a column as the pivot element. Partial pivoting is accepted as being numerically stable. However, pivoting across nodes is expensive. To avoid this pivoting, we generalize the criterion to tiles and determine whether the diagonal tile is an acceptable pivot. A step is an LU step if
$$\alpha \times \|(A_{k,k}^{(k)})^{-1}\|_1^{-1} \;\ge\; \max_{i>k} \|A_{i,k}^{(k)}\|_1. \tag{1.2}$$
For the analysis, we do not make an assumption as to how the diagonal tile is factored. We only assume that the diagonal tile is factored in a stable way (LU with partial pivoting or QR are acceptable). Note that, for the variant using pivoting in the diagonal domain (see Section 1.2.1), which is the variant we experiment with in Section 1.5, $A_{k,k}^{(k)}$ represents the diagonal tile after pivoting among tiles in the diagonal domain.
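Criterion (1.2) transcribes directly to code with NumPy's 1-norm (maximum absolute column sum). A sketch (helper name is ours); for a small tile we compute $\|(A_{k,k})^{-1}\|_1$ exactly, whereas the implementation estimates it from the L and U factors:

```python
import numpy as np

def max_criterion(panel_tiles, alpha):
    """panel_tiles[0] is the diagonal tile, panel_tiles[1:] the tiles
    below it in the panel; perform an LU step iff (1.2) holds."""
    inv_norm = np.linalg.norm(np.linalg.inv(panel_tiles[0]), 1)
    below = max(np.linalg.norm(T, 1) for T in panel_tiles[1:])
    return alpha / inv_norm >= below
```

A dominant diagonal tile passes the test; a near-singular one fails it and triggers a QR step.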
To assess the growth of the norm of the updated matrix, consider the update of the trailing sub-matrix. For all i, j > k we have:
$$\begin{aligned} \|A_{i,j}^{(k+1)}\|_1 &= \|A_{i,j}^{(k)} - A_{i,k}^{(k)}(A_{k,k}^{(k)})^{-1}A_{k,j}^{(k)}\|_1 \\ &\le \|A_{i,j}^{(k)}\|_1 + \|A_{i,k}^{(k)}\|_1\,\|(A_{k,k}^{(k)})^{-1}\|_1\,\|A_{k,j}^{(k)}\|_1 \\ &\le \|A_{i,j}^{(k)}\|_1 + \alpha\|A_{k,j}^{(k)}\|_1 \\ &\le (1+\alpha)\max\!\left(\|A_{i,j}^{(k)}\|_1, \|A_{k,j}^{(k)}\|_1\right) \\ &\le (1+\alpha)\max_{i\ge k}\|A_{i,j}^{(k)}\|_1. \end{aligned}$$
The growth of any tile in the trailing sub-matrix is bounded by 1 + α times the largest tile in the same column. If every step satisfies (1.2), then we have the following bound:
$$\frac{\max_{i,j,k}\|A_{i,j}^{(k)}\|_1}{\max_{i,j}\|A_{i,j}\|_1} \;\le\; (1+\alpha)^{n-1}.$$
The expression above is a growth factor on the norm of the tiles. For α = 1, the growth factor of 2^{n−1} is analogous to that of an LU factorization with partial pivoting (scalar case) [67]. Finally, note that we can obtain this bound by generalizing the standard example for partial pivoting. The following matrix will match the bound above:
$$A = \begin{pmatrix} \alpha^{-1} & 0 & 0 & 1 \\ -1 & \alpha^{-1} & 0 & 1 \\ -1 & -1 & \alpha^{-1} & 1 \\ -1 & -1 & -1 & 1 \end{pmatrix}.$$
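For α = 1 the matrix above generalizes to any order, and eliminating it without pivoting doubles the last column at every step, attaining the (1 + α)^{n−1} = 2^{n−1} bound. A quick numerical check (hypothetical helpers):

```python
import numpy as np

def growth_example(n):
    """Order-n version of the displayed matrix with alpha = 1:
    unit diagonal, -1 strictly below it, last column all ones."""
    A = np.tril(-np.ones((n, n)), -1) + np.eye(n)
    A[:, -1] = 1.0
    return A

def lu_nopiv(A):
    """Gaussian elimination without pivoting; returns the final U."""
    U = A.copy()
    for k in range(U.shape[0] - 1):
        m = U[k+1:, k] / U[k, k]
        U[k+1:, k:] -= np.outer(m, U[k, k:])
    return U
```

Starting from entries of magnitude at most 1, the trailing entry of U reaches 2^{n−1} exactly.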
1.3.2 Sum criterion
A stricter criterion is to compare the diagonal tile to the sum of the off-diagonal tiles:
$$\alpha \times \|(A_{k,k}^{(k)})^{-1}\|_1^{-1} \;\ge\; \sum_{i>k} \|A_{i,k}^{(k)}\|_1. \tag{1.3}$$
Again, for the analysis, we only assume that Ak,k is factored in a stable way. For α ≥ 1, this criterion (and the Max criterion) is satisfied at every step if A is block diagonally dominant [67]. That is, a general matrix A ∈ ℝⁿˣⁿ is block diagonally dominant by columns, with respect to a given partitioning A = (Ai,j) and a given norm ‖·‖, if:
$$\forall j \in \llbracket 1, n \rrbracket,\quad \|A_{j,j}^{-1}\|^{-1} \;\ge\; \sum_{i \ne j} \|A_{i,j}\|.$$
Again we need to evaluate the growth of the norm of the updated trailing sub-matrix. For all i, j > k, we have
$$\begin{aligned} \sum_{i>k}\|A_{i,j}^{(k+1)}\|_1 &= \sum_{i>k}\|A_{i,j}^{(k)} - A_{i,k}^{(k)}(A_{k,k}^{(k)})^{-1}A_{k,j}^{(k)}\|_1 \\ &\le \sum_{i>k}\|A_{i,j}^{(k)}\|_1 + \|A_{k,j}^{(k)}\|_1\,\|(A_{k,k}^{(k)})^{-1}\|_1 \sum_{i>k}\|A_{i,k}^{(k)}\|_1 \\ &\le \sum_{i>k}\|A_{i,j}^{(k)}\|_1 + \alpha\|A_{k,j}^{(k)}\|_1. \end{aligned}$$
Hence, the growth of the updated matrix can be bounded in terms of an entire column rather than just an individual tile. The only growth in the sum is due to the norm of a single tile. For α = 1, the inequality becomes
$$\sum_{i>k}\|A_{i,j}^{(k+1)}\|_1 \;\le\; \sum_{i\ge k}\|A_{i,j}^{(k)}\|_1.$$
If every step of the algorithm satisfies (1.3) (with α = 1), then by induction we have:
$$\sum_{i>k}\|A_{i,j}^{(k+1)}\|_1 \;\le\; \sum_{i\ge 1}\|A_{i,j}\|_1,$$
for all j and k. This leads to the following bound:
$$\frac{\max_{i,j,k}\|A_{i,j}^{(k)}\|_1}{\max_{i,j}\|A_{i,j}\|_1} \;\le\; n.$$
From this we see that the criterion eliminates the potential for exponential growth due to the LU steps. Note that for a diagonally dominant matrix, the bound on the growth factor can be reduced to 2 [67].
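The contrast between the two criteria can be seen on a small example: a panel accepted by the Max criterion may be rejected by the stricter Sum criterion. A sketch (helper name is ours; exact inverse norms, for illustration only):

```python
import numpy as np

def sum_criterion(panel_tiles, alpha):
    """Perform an LU step iff (1.3) holds: alpha * ||A_kk^{-1}||_1^{-1}
    dominates the *sum* of the 1-norms of the tiles below the diagonal."""
    inv_norm = np.linalg.norm(np.linalg.inv(panel_tiles[0]), 1)
    return alpha / inv_norm >= sum(np.linalg.norm(T, 1) for T in panel_tiles[1:])

# A panel where the Max criterion passes (3 >= 1) but the Sum criterion
# fails (3 < 4): one dominant diagonal tile above four identity tiles.
tiles = [3.0 * np.eye(2)] + [np.eye(2) for _ in range(4)]
max_ok = 1.0 / np.linalg.norm(np.linalg.inv(tiles[0]), 1) >= max(
    np.linalg.norm(T, 1) for T in tiles[1:])
```

With α = 1, this panel would take an LU step under the Max criterion but a QR step under the Sum criterion, illustrating why the Sum criterion yields the stronger (linear) growth bound.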
1.3.3 MUMPS criterion
In LU decomposition with partial pivoting, the largest element of the column is used as the pivot. This method is experimentally stable, but searching for the maximum and pivoting require a lot of communications in distributed memory. Thus, in an LU step of the LU-QR Algorithm, the LU decomposition with partial pivoting is limited to the local tiles of the panel (i.e., to the diagonal domain). The idea behind the MUMPS criterion is to estimate the quality of the pivot found locally compared to the rest of the column. The MUMPS criterion is one of the strategies available in MUMPS, although there it is used for symmetric indefinite matrices (LDLᵀ) [45]; Amestoy et al. [12] provided us with their scalar criterion for the LU case.

At step k of the LU-QR Algorithm, let L⁽ᵏ⁾U⁽ᵏ⁾ be the LU decomposition of the diagonal domain and A⁽ᵏ⁾i,j be the value of the tile Ai,j at the beginning of step k. Let local_maxk(j) be the largest element of column j of the panel in the diagonal domain, away_maxk(j) be the largest element of column j of the panel off the diagonal domain, and pivotk be the list of pivots used in the LU decomposition of the diagonal domain:
$$\mathrm{local\_max}_k(j) = \max_{\substack{\text{tiles } A_{i,k} \text{ on the}\\ \text{diagonal domain}}} \; \max_l \; |(A_{i,k})_{l,j}|,$$
$$\mathrm{away\_max}_k(j) = \max_{\substack{\text{tiles } A_{i,k} \text{ off the}\\ \text{diagonal domain}}} \; \max_l \; |(A_{i,k})_{l,j}|, \qquad \mathrm{pivot}_k(j) = |U_{j,j}^{(k)}|.$$

pivotk(j) represents the largest local element of column j at step j of the LU decomposition with partial pivoting on the diagonal domain. Thus, we can express the growth factor of the largest local element of column j at step j as growth_factork(j) = pivotk(j)/local_maxk(j). The idea behind the MUMPS criterion is to estimate whether the largest element outside the local domain would have grown the same way. Thus, we define a vector estimate_maxk, initialized to away_maxk and updated at each step i of the LU decomposition with partial pivoting as estimate_maxk(j) ← estimate_maxk(j) × growth_factork(i). We consider that the LU decomposition with partial pivoting of the diagonal domain can be used to eliminate the rest of the panel if and only if all pivots are larger than the estimated maximum of the column outside the diagonal domain times a threshold α. Thus, the MUMPS criterion (as we implemented it) decides that step k of the LU-QR Algorithm will be an LU step if and only if:
$$\forall j,\quad \alpha \times \mathrm{pivot}_k(j) \;\ge\; \mathrm{estimate\_max}_k(j). \tag{1.4}$$
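The scalar bookkeeping can be sketched as follows; this is one plausible reading of the update rule above (names are ours, and the production version is interleaved with the factorization kernel):

```python
def mumps_criterion(pivot, local_max, away_max, alpha):
    """Scalar sketch of criterion (1.4): scale each column's off-domain
    maximum by the growth observed locally at every earlier step, then
    require every pivot to dominate the resulting estimate."""
    n = len(pivot)
    estimate = list(away_max)
    for i in range(n):
        g = pivot[i] / local_max[i]        # growth_factor(i)
        for j in range(i + 1, n):
            estimate[j] *= g               # estimate_max(j) update
    return all(alpha * pivot[j] >= estimate[j] for j in range(n))
```

When the local growth factors are 1, the test degenerates to comparing each pivot directly against the off-domain maximum of its column.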
1.3.4 Extending the MUMPS criterion to tiles

In this section, we extend the MUMPS criterion [12, 45] to tiles. We did not implement this extension in software. We present the main idea here in the context of the Max criterion; it is possible to adapt it to the Sum criterion as well. The main idea is to maintain a local upper bound on the maximum norm of the tiles below the diagonal. The goal for MUMPS is to spare a search in the pivot column to see if the current pivot is acceptable. Our goal is to spare a search in the tile column to know whether we are going to apply an LU step or a QR step. In both cases, if the criterion is satisfied, the search will not happen, hence (1) this avoids the communication necessary for the search, and (2) this avoids synchronizing the processes in the column. In other words, if the initial matrix is such that the criterion is always satisfied at each step of the algorithm, the synchronization of the panel goes away and a pipeline is naturally instantiated.

Following our general framework, we have a matrix A partitioned in n tiles per row and column. At the start of the algorithm, all processes have a vector r of size n. If a process holds column j, j = 1, . . . , n, then it also holds rj such that
$$r_j^{(1)} = \max_{i=2,\dots,n} \|A_{i,j}\|_1.$$
If the process does not hold column j, then it does not have (and will not need) a value for rj. If
$$\alpha \times \|(A_{11}^{(1)})^{-1}\|_1^{-1} \;\ge\; \max_{i>1}\|A_{i,1}^{(1)}\|_1,$$
we know that an LU step will be performed according to the Max criterion (see Equation (1.2)). This condition is guaranteed if
$$\alpha \times \|(A_{11}^{(1)})^{-1}\|_1^{-1} \;\ge\; r_1^{(1)}.$$
We assume that this condition is satisfied, so that our first step is an LU step; the pivot process can decide this without any communication or synchronization, and an LU step is initiated. Now, following the MUMPS criterion [12, 45], we update the vector r as follows:
$$r_j^{(2)} = r_j^{(1)} + r_1^{(1)}\,\|(A_{11}^{(1)})^{-1}\|_1\,\|A_{1j}^{(1)}\|_1, \qquad j = 2, \dots, n.$$
We note that $\|(A_{11}^{(1)})^{-1}\|_1$ can be broadcast along with the L and U factors of $A_{11}^{(1)}$, and that $A_{1j}^{(1)}$ is broadcast for the update to all processes holding rj, so all processes holding rj can update it without further communication. Now we see that
$$r_j^{(2)} \;\ge\; \max_{i=3,\dots,n}\|A_{i,j}^{(2)}\|_1, \qquad j = 2, \dots, n.$$
Indeed, let j = 2, . . . , n and i = 3, . . . , n. We have
$$\begin{aligned} \|A_{i,j}^{(2)}\|_1 &= \|A_{i,j}^{(1)} - A_{i,1}^{(1)}(A_{1,1}^{(1)})^{-1}A_{1,j}^{(1)}\|_1 \\ &\le \|A_{i,j}^{(1)}\|_1 + \|A_{i,1}^{(1)}\|_1\,\|(A_{1,1}^{(1)})^{-1}\|_1\,\|A_{1,j}^{(1)}\|_1 \\ &\le \max_{i=2,\dots,n}\|A_{i,j}^{(1)}\|_1 + \left(\max_{i=2,\dots,n}\|A_{i,1}^{(1)}\|_1\right)\|(A_{1,1}^{(1)})^{-1}\|_1\,\|A_{1,j}^{(1)}\|_1 \\ &\le r_j^{(1)} + r_1^{(1)}\,\|(A_{1,1}^{(1)})^{-1}\|_1\,\|A_{1,j}^{(1)}\|_1. \end{aligned}$$
Hence, for j = 2, . . . , n, we have
$$\max_{i=3,\dots,n}\|A_{i,j}^{(2)}\|_1 \;\le\; r_j^{(1)} + r_1^{(1)}\,\|(A_{1,1}^{(1)})^{-1}\|_1\,\|A_{1,j}^{(1)}\|_1,$$
so that, as claimed,
$$\max_{i=3,\dots,n}\|A_{i,j}^{(2)}\|_1 \;\le\; r_j^{(2)}.$$
Therefore, at step 2, the process holding A22 can evaluate locally (without communication or synchronization) the condition
$$\alpha \times \|(A_{2,2}^{(2)})^{-1}\|_1^{-1} \;\ge\; r_2^{(2)},$$
and decide whether an LU or a QR step is appropriate.
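A numerical sanity check of the update above: after one tile LU step, the updated vector r⁽²⁾ still dominates the column norms of the trailing tiles (random tiles with a well-conditioned pivot tile; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
nb, n = 4, 5
T = [[rng.standard_normal((nb, nb)) for _ in range(n)] for _ in range(n)]
T[0][0] += 5 * np.eye(nb)                      # well-conditioned pivot tile

# r_j^{(1)} = max over below-diagonal rows of ||A_ij||_1
r1 = [max(np.linalg.norm(T[i][j], 1) for i in range(1, n)) for j in range(n)]
inv_norm = np.linalg.norm(np.linalg.inv(T[0][0]), 1)

# Schur update of the trailing tiles (the LU step); tile row 0 is untouched
W = np.linalg.inv(T[0][0])
for i in range(1, n):
    for j in range(1, n):
        T[i][j] = T[i][j] - T[i][0] @ W @ T[0][j]

# r_j^{(2)} = r_j^{(1)} + r_1^{(1)} ||A_11^{-1}||_1 ||A_1j||_1
r2 = [r1[j] + r1[0] * inv_norm * np.linalg.norm(T[0][j], 1)
      for j in range(1, n)]
bound_ok = all(
    r2[j - 1] >= max(np.linalg.norm(T[i][j], 1) for i in range(1, n)) - 1e-12
    for j in range(1, n))
```

The bound holds by the submultiplicativity of the 1-norm, exactly as in the derivation above.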
1.3.5 Complexity

All criteria require a reduction of information from the off-diagonal tiles to the diagonal tile. Criteria (1.2) and (1.3) require the norm of each tile to be calculated locally (our implementation uses the 1-norm) and then reduced to the diagonal tile. Both criteria also require computing ‖Ak,k⁻¹‖. Since the LU factorization of the diagonal tile is computed, this norm can be approximated from the L and U factors by an iterative method in O(nb²) floating-point operations. The overall complexity for both criteria is O(n × nb²). Criterion (1.4) requires the maximum of each column to be calculated locally and then reduced to the diagonal tile. The complexity of the MUMPS criterion is also O(n × nb²) comparisons.

The Sum criterion is the strictest of the three criteria. It also provides the best stability, with linear growth in the norm of the tiles in the worst case. The other two criteria have similar worst-case bounds. The growth factors of both are bounded by the growth factor of partial (threshold) pivoting. The Max criterion has a bound on the growth factor of the norm of the tiles that is analogous to partial pivoting. The MUMPS criterion does not operate at the tile level, but rather on scalars. If the estimated growth factor computed by the criterion is a good estimate, then the growth factor is no worse than that of partial (threshold) pivoting.
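The O(nb²) inverse-norm estimation can be done with a Hager-style iteration reusing the L and U factors: each sweep costs two triangular solves. A sketch (the function name is ours, and `np.linalg.solve` stands in for dedicated triangular solvers); the returned estimate is a lower bound on $\|A^{-1}\|_1$:

```python
import numpy as np

def est_inv_onenorm(L, U, sweeps=5):
    """Hager-style lower-bound estimate of ||A^{-1}||_1, given A = L U."""
    n = L.shape[0]
    x = np.full(n, 1.0 / n)
    est = 0.0
    for _ in range(sweeps):
        y = np.linalg.solve(U, np.linalg.solve(L, x))       # y = A^{-1} x
        est = np.abs(y).sum()                               # ||A^{-1} x||_1
        s = np.where(y >= 0, 1.0, -1.0)
        z = np.linalg.solve(L.T, np.linalg.solve(U.T, s))   # z = A^{-T} s
        j = int(np.argmax(np.abs(z)))
        if np.abs(z[j]) <= z @ x:                           # converged
            break
        x = np.zeros(n)
        x[j] = 1.0                                          # probe column j
    return est
```

Because the estimate is always ‖A⁻¹x‖₁ for some x with ‖x‖₁ = 1, it never exceeds the true norm; on small well-behaved matrices it is usually exact.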
1.4 Implementation
As discussed in Section 1.1, we have implemented the LU-QR Algorithm on top of the PARSEC runtime. There are two major reasons for this choice: (i) it allows for easily targeting distributed architectures while concentrating only on the algorithm, and not on implementation details such as data distribution and communications; (ii) previous implementations of the HQR algorithm [42] can be reused for QR elimination steps, and they include efficient reduction trees to reduce the critical path of these steps. Another advantage of using such a runtime is that it provides an efficient look-ahead algorithm without extra programming burden. This is illustrated by Figure 1.2, which shows the first steps of the factorization with the LU-QR Algorithm: QR steps (in green) and LU steps (in orange/red) are interleaved automatically by the runtime, and a panel factorization does not wait for the previous step to be finished before starting.

However, this choice implied major difficulties due to the parameterized task graph representation exploited by the PARSEC runtime. This representation being static, a solution had to be developed to allow for dynamism in the graph traversal. To solve this issue, we decided to fully statically describe both the LU and QR algorithms in the parameterized task graph. A layer of selection tasks has then been inserted between each iteration, to collect the data from the previous step and to propagate it to the correct following step. These tasks, which do no computation, are only executed once they have received a control flow after the criterion selection has been made. Thus, they delay the decision to send the data to the next elimination step until a choice has been made, in order to guarantee that data follow the correct path. These are the Propagate tasks in Figure 1.1. It is important to note that these tasks do not delay the computations, since no updates are available as long as the panel is not factorized.
Furthermore, these tasks, as well as the Backup Panel tasks, can receive the same data from two different paths, which could create conflicts. In the PARSEC runtime, tasks are created only when one of their dependencies is resolved; by graph construction, they are enabled only when the previous elimination step has already started, hence they will receive their data only from the correct path. It also means that tasks belonging to the neglected path will never be triggered, and so never be created. This implies that only one path will forward the data to the Propagate tasks.
Figure 1.1: Dataflow of one step of the algorithm.
Figure 1.1 describes the connection between the different stages of one elimination step of the algorithm. These stages are described below:
BACKUP PANEL This is a set of tasks that collect the tiles of the panel from the previous step. Since an LU factorization will be performed in place for the criterion computation, it is necessary to back up the modified data in case the criterion fails the test on numerical accuracy. Only tiles from the current panel belonging to the node owning the diagonal row are copied, and sent directly to the Propagate tasks in case a QR elimination step is needed. On other nodes, nothing is done. Then, all original tiles belonging to the panel are forwarded to the LU On Panel tasks.
LU ON PANEL Once the backup is done, the criterion is computed. Two kinds of work are performed in these tasks. On the first node, the U matrix related to this elimination step is computed. This can be done through an LU factorization with or without pivoting. We decided to exploit the multi-threaded recursive-LU kernel from the PLASMA library to enlarge the pivot search space while keeping good efficiency [43]. On all other nodes, the information required for the criterion is computed (see Section 1.3). Then, an all-reduce operation is performed to exchange the information, so that every node can take and store the decision in a local array. This all-reduce operation is implemented directly within the parameterized task graph with Bruck's algorithm [26] to optimize its cost. Once the decision is known by the nodes on the panel, it is stored in a global array by each process in order to give every worker thread access to the information. The information is then broadcast by row to all other nodes, so that everyone knows which kind of update to apply, and a control flow per process triggers all the local Propagate tasks, which can now access the decision and release the correct path in the dataflow.
PROPAGATE These tasks, one per tile, receive the decision from the previous stage through a control flow, and are responsible for forwarding the data to the computational tasks of the selected factorization. The tasks belonging to the panel (assigned to the first nodes) have to restore the data to their previous state if QR elimination is chosen. In all cases, the backup is destroyed upon exit of these tasks. We are now ready to complete the description of each step:

a) LU STEP If the numerical criterion is met by the panel computation, the update step is performed. On the nodes with the diagonal row, one task per panel is generated to apply the row permutation computed by the factorization; then the triangular solve is applied to the diagonal to compute the U part of the matrix. The result is broadcast per column to all other nodes, and a block LU algorithm is used to perform the update. This means that the panel is updated with TRSM tasks, and the trailing sub-matrix is updated with GEMM tasks. This avoids the row pivoting between the nodes usually performed by the classical LU factorization algorithm with partial pivoting, or by tournament pivoting algorithms [61]. Here this exchange is made within a single node only.

b) QR STEP If the numerical criterion is not met, a QR factorization has to be performed. Many solutions could be used for this elimination step. We chose to exploit the HQR method implementation presented in [42]. This allowed us to experiment with different kinds of reduction trees, so as to find the solution best adapted to our problem. The goal is to reduce inter-node communications to the minimum while keeping the critical path short. In [42], we have shown that the FLATTREE tree is very efficient for a good pipeline of the operations on square matrices, while FIBONACCI, GREEDY, or BINARYTREE are good for tall and skinny matrices because they reduce the length of the critical path.
In this algorithm, our default tree (which we use in all of our experiments) is a hierarchical tree made of GREEDY reduction trees inside nodes and a FIBONACCI reduction tree between the nodes. The goal is to perform as few QR steps as possible, so a FIBONACCI tree between nodes has been chosen for its short critical path and its good pipelining of consecutive trees if multiple QR steps are performed in sequence. Within a node, the GREEDY reduction tree is favored for similar reasons (see [42] for more details on the reduction trees). A two-level hierarchical approach is natural when considering multicore parallel distributed architectures, and those choices could be reconsidered according to the matrix size and numerical properties. To implement the LU-QR Algorithm within the PARSEC framework, two extensions had to be implemented within the runtime. The first extension allows the programmer to generate data during the execution with the OUTPUT keyword. This data is then inserted into the tracking system of the runtime to follow its path in the dataflow. This is what has been used to generate the backup on the fly and to limit the memory peak of the algorithm. A second extension has been made for the end detection of the algorithm. Due to its distributed nature, PARSEC detects the end of an algorithm by counting the remaining tasks to execute. At algorithm submission, PARSEC loops over the whole domain space of each type of task of the algorithm and uses a predicate, namely the owner-computes rule, to decide whether a task is local or not. Local tasks are counted, and the end of the algorithm is detected when all of them have been executed. As explained previously, to statically describe the dynamism of the LU-QR Algorithm, both LU and QR tasks exist in the parameterized graph. The size of the domain space is then larger than the number of tasks that will actually be executed.
Thus, a function to dynamically increase/decrease the number of local tasks has been added, so that the Propagate tasks decrease the local counter of each node by the number of update tasks associated with the non-selected algorithm. The implementation of the LU-QR Algorithm is publicly available in the latest DPLASMA release (1.2.0).
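To summarize the control flow described in this section, here is a toy, single-node NumPy sketch of the per-step decision (illustrative only: it works column by column rather than on tiles and distributed panels, and its Max-like acceptance test is a simplified stand-in for the criteria of Section 1.3):

```python
import numpy as np

def hybrid_factor(A, alpha=100.0):
    """Eliminate A column by column, choosing an LU or a QR step each time.

    The acceptance test below (largest entry of the current column no more
    than alpha times the diagonal pivot) is a simplified stand-in for the
    Max/Sum/MUMPS criteria, and no pivoting is performed at all."""
    A = A.astype(float).copy()
    n = A.shape[0]
    steps = []
    for k in range(n - 1):
        col = A[k:, k]
        pivot = abs(col[0])
        if pivot > 0 and np.abs(col).max() <= alpha * pivot:
            # LU step: Gaussian elimination below the diagonal.
            A[k+1:, k] /= A[k, k]
            A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
            A[k+1:, k] = 0.0  # keep only the updated trailing matrix
            steps.append("LU")
        else:
            # QR step: an (unconditionally stable) Householder reflector.
            v = col.copy()
            s = 1.0 if v[0] >= 0 else -1.0
            v[0] += s * np.linalg.norm(v)
            v /= np.linalg.norm(v)
            A[k:, k:] -= 2.0 * np.outer(v, v @ A[k:, k:])
            steps.append("QR")
    return A, steps

np.random.seed(0)
R, steps = hybrid_factor(np.random.randn(8, 8), alpha=50.0)
print(steps)
```

Either branch zeroes the current column below the diagonal, so the result is upper triangular; tightening `alpha` forces more QR steps, mirroring the stability/performance trade-off of the real algorithm.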
1.5 Experiments
The purpose of this section is to present numerical experiments for the hybrid LU-QR Algorithm, and to highlight the trade-offs between stability and performance that can be achieved by tuning the threshold α in the robustness criterion (see Section 1.3).
1.5.1 Experimental framework
We used Dancer, a parallel machine hosted at the Innovative Computing Laboratory (ICL) in Knoxville, to run the experiments. This cluster has 16 multi-core nodes, each equipped with 8 cores, and an Infiniband interconnection network with 10 GB/s bandwidth (MT25208 cards). The nodes feature two Intel Westmere-EP E5606 CPUs at 2.13 GHz. The system runs the 64-bit Linux operating system, version 3.7.2-x86_64. The software was compiled with the Intel Compiler Suite 2013.3.163. BLAS kernels were provided by the MKL library, and OpenMPI 1.4.3 has been used for the MPI communications by the PARSEC runtime. Each computational thread is bound to a single core using the HwLoc 1.7.1 library. Unless mentioned otherwise, we use all 16 nodes and the data is distributed according to a 4-by-4 2D-block-cyclic distribution. In all our experiments, this distribution performs better than an 8-by-2 or a 16-by-1 distribution for our hybrid LU-QR Algorithm, as expected, since square grids are known to perform better for LU and QR factorizations applied to square matrices [28]. The theoretical peak performance of the 16 nodes is 1091 GFLOP/sec.
For each experiment, we consider a square tiled matrix A of size N-by-N, where N = n × nb. The tile size nb has been fixed to 240 for the whole experiment set, because this value was found to achieve good performance for both LU and QR steps. We evaluate the backward stability by computing the
Figure 1.2: Execution trace of the first steps of the LU-QR Algorithm on a matrix of size N = 5000 with nb = 200, and a criterion that alternates between LU and QR steps. A grid of 2-by-2 processes with 4 threads each is used. The green tasks are the QR steps (dark for the panel factorization, light for the trailing submatrix update); the orange and red tasks are the LU steps (red for the panel update, orange for the trailing submatrix update); the blue tasks are the LU panel factorizations performed at every step; and the small black tasks, which do not show up on the figure because their duration is too small, are the criterion selections and the Propagate tasks.
HPL3 accuracy test of the High-Performance Linpack benchmark [44]:

HPL3 = ‖Ax − b‖∞ / (‖A‖∞ × ‖x‖∞ × ε × N),

where b is the right-hand side of the linear system, x is the computed solution, and ε is the machine precision. Each test is run with double precision arithmetic. In all our experiments, the right-hand side of the linear system is generated using the DPLASMA_dplrnt routine. It generates a random matrix (or a random vector) with each element uniformly drawn from [−0.5, 0.5]. For performance, we point out that the number of floating-point operations executed by the hybrid algorithm depends on the number of LU and QR steps performed during the factorization. Thus, for a fair comparison, we assess the efficiency by reporting the normalized GFLOP/sec performance computed as

GFLOP/sec = (2/3 × N^3) / EXECUTION TIME,

where 2/3 N^3 is the number of floating-point operations for LU with partial pivoting and EXECUTION TIME is the execution time of the algorithm. With this formula, the QR factorization will only achieve half of the performance, due to the 4/3 N^3 floating-point operations of the algorithm. Note that in all our experiments, the right-hand side b of the linear system is a vector. Thus, the cost of applying the transformations to b to solve the linear system is negligible, which is not necessarily the case for multiple right-hand sides.

Figure 1.3: Stability, performance, and percentage of LU steps obtained by the three criteria and by random choices, for random matrices, on the Dancer platform (4x4 grid).
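The HPL3 measure defined above can be reproduced in a few lines of NumPy (an illustrative sketch on a small dense system; the experiments use the distributed DPLASMA routines instead):

```python
import numpy as np

def hpl3(A, x, b):
    """||Ax - b||_inf / (||A||_inf * ||x||_inf * eps * N)."""
    N = A.shape[0]
    eps = np.finfo(A.dtype).eps
    residual = np.linalg.norm(A @ x - b, ord=np.inf)
    norm_A = np.linalg.norm(A, ord=np.inf)   # maximum absolute row sum
    norm_x = np.linalg.norm(x, ord=np.inf)
    return residual / (norm_A * norm_x * eps * N)

# Entries uniform in [-0.5, 0.5], mimicking DPLASMA_dplrnt.
np.random.seed(1)
N = 500
A = np.random.uniform(-0.5, 0.5, (N, N))
b = np.random.uniform(-0.5, 0.5, N)
x = np.linalg.solve(A, b)   # LAPACK LU with partial pivoting
print(hpl3(A, x, b))        # O(1) for a backward-stable solver
```

A backward-stable solver yields an HPL3 value of order 1; an unstable one yields values that grow with the matrix size.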
1.5.2 Results for random matrices
We start with the list of the algorithms used for comparison with the LU-QR Algorithm. All these methods are implemented within the PaRSEC framework:
• LU NoPiv, which performs pivoting only inside the diagonal tile but no pivoting across tiles (known to be both efficient and unstable);
• LU IncPiv, which performs incremental pairwise pivoting across all tiles in the elimination panel [28, 90] (still efficient but not stable either);
• Several instances of the hybrid LU-QR Algorithm, for different values of the robustness parameter α. Recall that the algorithm performs pivoting only across the diagonal domain, hence involving no remote communication nor synchronization;
• HQR, the Hierarchical QR factorization [42], with the same configuration as in the QR steps of the LU-QR Algorithm: GREEDY reduction trees inside nodes and FIBONACCI reduction trees between the nodes.
For reference, we also include a comparison with PDGEQRF: this is the LUPP algorithm (LU with partial pivoting across all tiles of the elimination panel) from the reference ScaLAPACK implementation [32].
Algorithm | α | Time (s) | % LU steps | Fake GFLOP/sec | True GFLOP/sec | Fake % Peak Perf. | True % Peak Perf.
LU NoPiv | — | 6.29 | 100.0 | 848.6 | 848.6 | 77.8 | 77.8
LU IncPiv | — | 9.25 | 100.0 | 576.4 | 576.4 | 52.9 | 52.9
LUQR (MAX) | ∞ | 7.87 | 100.0 | 677.7 | 677.7 | 62.1 | 62.1
LUQR (MAX) | 13000 | 7.99 | 94.1 | 667.7 | 707.4 | 61.2 | 64.9
LUQR (MAX) | 9000 | 8.62 | 83.3 | 619.0 | 722.2 | 56.8 | 66.2
LUQR (MAX) | 6000 | 10.95 | 61.9 | 486.9 | 672.4 | 44.6 | 61.7
LUQR (MAX) | 4000 | 12.43 | 51.2 | 429.0 | 638.4 | 39.3 | 58.5
LUQR (MAX) | 1400 | 13.76 | 35.7 | 387.6 | 636.9 | 35.5 | 58.4
LUQR (MAX) | 900 | 16.39 | 11.9 | 325.4 | 612.0 | 29.8 | 56.1
LUQR (MAX) | 0 | 18.05 | 0.0 | 295.5 | 590.9 | 27.1 | 54.2
HQR | — | 16.01 | 0.0 | 333.1 | 666.1 | 30.5 | 61.1
LUPP | — | 15.30 | 100.0 | 348.6 | 348.6 | 32.0 | 32.0
Table 1.2: Performance obtained by each algorithm, for N = 20,000, on the Dancer platform (4 × 4 grid). We only show the results for the LU-QR Algorithm with the Max criterion; the other criteria have similar performance. In column Fake GFLOP/sec, we assume all algorithms perform 2/3 N^3 floating-point operations. In column True GFLOP/sec, we compute the number of floating-point operations to be (2/3 fLU + 4/3 (1 − fLU)) N^3, where fLU is the fraction of the steps that are LU steps (column 4).
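The normalization used in the table can be written as a small helper (illustrative; the reported numbers come from the actual runs):

```python
def gflops(N, time_s, f_lu=1.0):
    """Fake and true GFLOP/sec for a run mixing LU and QR steps.

    'Fake' always charges the 2/3 N^3 flops of LU with partial pivoting;
    'true' charges 2/3 N^3 for the LU fraction f_lu of the steps and
    4/3 N^3 for the QR remainder."""
    fake = (2.0 / 3.0) * N**3 / time_s / 1e9
    true = ((2.0 / 3.0) * f_lu + (4.0 / 3.0) * (1.0 - f_lu)) * N**3 / time_s / 1e9
    return fake, true

# The HQR row of Table 1.2 (N = 20000, 16.01 s, 0% LU steps):
fake, true = gflops(20000, 16.01, f_lu=0.0)
print(round(fake, 1), round(true, 1))  # close to the reported 333.1 / 666.1
```

When every step is an LU step (f_lu = 1), the two measures coincide; when every step is a QR step (f_lu = 0), the true rate is exactly twice the fake one.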
Figure 1.3 summarizes all results for random matrices. The random matrices are generated using the DPLASMA_dplrnt routine. The figure is organized as follows: each of the first three rows corresponds to one criterion. Within a row:
• the first column shows the relative stability (ratio of the HPL3 value divided by the HPL3 value for LUPP);
• the second column shows the GFLOP/sec performance;
• the third column shows the percentage of LU steps during execution.
The fourth row corresponds to a random choice between LU and QR at each step, and is intended to assess the performance obtained for a given ratio of LU vs. QR steps. Plotted results are average values obtained on a set of 100 random matrices (we observe a very small standard deviation, less than 2%). For each criterion, we experimentally chose a set of values of α that provides a representative range of ratios for the number of LU and QR steps. As explained in Section 1.3, for each criterion, the smaller α is, the tighter the stability requirement. Thus, the numerical criterion is met less frequently and the hybrid algorithm processes fewer LU steps. A current limitation of our approach is that we do not know how to auto-tune the best range of values for α, which seems to depend heavily upon the matrix size and the available degree of parallelism. In addition, the range of useful α values is quite different for each criterion. For random matrices, we observe in Figure 1.3 that the stability of LU NoPiv and LU IncPiv is not satisfactory. We also observe that, for each criterion, small values of α result in better stability, to the detriment of performance. For α = 0, the LU-QR Algorithm processes only QR steps, which leads to exactly the same stability as the HQR Algorithm and almost the same performance results.
The difference between the performance of the LU-QR Algorithm with α = 0 and HQR comes from the cost of the decision-making steps (saving the panel, computing the LU factorization with partial pivoting on the diagonal domain, computing the choice, and restoring the panel). Figure 1.3 shows that the overhead due to the decision-making process is approximately 10% for the three criteria. This overhead, computed when QR eliminations are performed at each step, is primarily due to the backup/restore steps added to the critical path when QR is chosen. The performance impact of the criterion computation itself is negligible, as one can see by comparing the performance of the random criterion to the MUMPS and Max criteria. The LU-QR Algorithm with α = ∞ and LU NoPiv both process only LU steps. The only difference between the two algorithms in terms of error analysis is that LU NoPiv seeks a pivot in the diagonal tile, while the LU-QR Algorithm with α = ∞ seeks a pivot in the diagonal domain. This difference has a considerable impact in terms of stability, in particular on random matrices. The LU-QR Algorithm with α = ∞ has a stability slightly inferior to that of LUPP and significantly better than that of LU NoPiv. When the matrix size increases, the relative stability results of the LU-QR Algorithm with α = ∞ tend to 1, which means that, on random matrices, processing an LU factorization with partial pivoting on a diagonal domain followed by a direct elimination without pivoting for the rest of the panel is almost as stable as an LU factorization with partial pivoting on the whole panel. A hand-waving explanation would go as follows. The main instabilities are proportional to the small pivots encountered during a factorization. Using diagonal pivoting, as the factorization of the diagonal tile proceeds, one is left with fewer and fewer choices for a pivot in the tile. Ultimately, for the last entry of the tile in position (nb, nb), one is left with no choice at all.
When working on random matrices, after having performed several successive diagonal factorizations, one is bound to have encountered a few small pivots. These small pivots lead to bad stability. Using a domain (made of several tiles) for the factorization significantly increases the number of choices for the pivot, and it is no longer likely to encounter a small pivot. Consequently, diagonal domain pivoting significantly increases the stability of the LU-QR Algorithm with α = ∞. When the local domain gets large enough (while remaining significantly smaller than N), the stability obtained on random matrices is about the same as with partial pivoting. When α = ∞, our criterion is deactivated and our algorithm always performs LU steps. We note that, when α is reasonable (as opposed to α = ∞), the algorithm is stable whether we use a diagonal domain or a diagonal tile. However, using a diagonal domain increases the chance of a well-behaved pivot tile for the elimination; therefore, using a diagonal domain (as opposed to a diagonal tile) increases the chances of an LU step. Using random choices leads to results comparable to those obtained with the three criteria. However, since we are using random matrices in this experiment set, we need to be careful before drawing any conclusion on the stability of our algorithms. If an algorithm is not stable on random matrices, this is clearly bad. However, we cannot draw any definitive conclusion if an algorithm is stable for random matrices. Table 1.2 displays detailed performance results for the Max criterion with N = 20,000. In column Fake GFLOP/sec, we assume all algorithms perform 2/3 N^3 floating-point operations. In column True GFLOP/sec, we compute the number of floating-point operations to be (2/3 fLU + 4/3 (1 − fLU)) N^3, where fLU is the fraction of the steps that are LU steps (column 4). For this example, we see that the LU-QR Algorithm reaches a peak performance of 677.7 GFLOP/sec (62.1% of the theoretical peak) when
No. | Matrix | Description | Condition number (N = 4000)
1 | house | Householder matrix, A = eye(n) − β·v·v^H. | 1
2 | parter | Parter matrix, a Toeplitz matrix with most singular values near π; A(i, j) = 1/(i − j + 0.5). | 4.8
3 | ris | Ris matrix, with elements A(i, j) = 0.5/(n − i − j + 1.5); the eigenvalues cluster around −π/2 and π/2. | 4.8
4 | condex | Counter-example matrix for condition estimators. | 1.0 × 10^2
5 | circul | Circulant matrix. | 2.2 × 10^4
6 | hankel | Hankel matrix, A = hankel(c, r), where c = randn(n, 1), r = randn(n, 1), and c(n) = r(1). | 3.8 × 10^4
7 | compan | Companion matrix (sparse), A = compan(randn(n + 1, 1)). | 4.1 × 10^6
8 | lehmer | Lehmer matrix, a symmetric positive definite matrix such that A(i, j) = i/j for j ≥ i; its inverse is tridiagonal. | 1.7 × 10^7
9 | dorr | Dorr matrix, a diagonally dominant, ill-conditioned, tridiagonal matrix (sparse). | 1.6 × 10^11
10 | demmel | A = D·(eye(n) + 10^−7·rand(n)), where D = diag(10^14·(0 : n − 1)/n). | 9.8 × 10^20
11 | chebvand | Chebyshev Vandermonde matrix based on n equally spaced points on the interval [0, 1]. | 2.2 × 10^19
12 | invhess | Its inverse is an upper Hessenberg matrix. | —
13 | prolate | Prolate matrix, an ill-conditioned Toeplitz matrix. | 9.4 × 10^18
14 | cauchy | Cauchy matrix. | 1.9 × 10^21
15 | hilb | Hilbert matrix with elements 1/(i + j − 1); A = hilb(n). | 7.5 × 10^21
16 | lotkin | Lotkin matrix, the Hilbert matrix with its first row altered to all ones. | 2.1 × 10^23
17 | kahan | Kahan matrix, an upper trapezoidal matrix. | 1.4 × 10^27
18 | orthogo | Symmetric eigenvector matrix: A(i, j) = sqrt(2/(n + 1))·sin(i·j·π/(n + 1)). | 1
19 | wilkinson | Matrix attaining the upper bound of the growth factor of GEPP. | 9.4 × 10^3
20 | foster | Matrix arising from using the quadrature method to solve a certain Volterra integral equation. | 2.6 × 10^3
21 | wright | Matrix with an exponential growth factor when Gaussian elimination with partial pivoting is used. | 6.5
Table 1.3: Special matrices in the experiment set.
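For reference, a few of these matrices can be generated with NumPy/SciPy equivalents of the MATLAB gallery definitions quoted in the table (a sketch with a smaller n, so the computed condition numbers differ from the N = 4000 column):

```python
import numpy as np
from scipy.linalg import hankel, hilbert, circulant

def lehmer(n):
    i, j = np.indices((n, n)) + 1
    return np.minimum(i, j) / np.maximum(i, j)   # A(i, j) = i/j for j >= i

def parter(n):
    i, j = np.indices((n, n)) + 1
    return 1.0 / (i - j + 0.5)

n = 200
rng = np.random.default_rng(0)
c, r = rng.standard_normal(n), rng.standard_normal(n)
c[-1] = r[0]                       # c(n) = r(1), as in the table
mats = {
    "parter": parter(n),
    "circul": circulant(rng.standard_normal(n)),
    "hankel": hankel(c, r),
    "lehmer": lehmer(n),
    "hilb":   hilbert(n),
}
for name, A in mats.items():
    print(f"{name:7s} cond = {np.linalg.cond(A):.1e}")
```

For the severely ill-conditioned entries (hilb, lotkin, kahan, ...), the computed condition number saturates around 1/ε of double precision, which is why such matrices stress any factorization's stability.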
1.5.6 Assessment of the three criteria
With respect to stability, while the three criteria behave similarly on random matrices, we observe different behaviors for special matrices. The MUMPS criterion provides good results for most of the tested matrices, but not for all. If stability is the key concern, one may prefer to use the Max criterion (or the Sum criterion), which performs well for all special matrices (which means that the upper bound of (1 + α)^(n−1) on the growth factor is quite pessimistic). With respect to performance, we observe very comparable results, which means that the overhead induced by computing the criterion at each step is of the same order of magnitude for all criteria. The overall conclusion is that all criteria bring significant improvement over LUPP in terms of stability, and over HQR in terms of performance. Tuning the value of the robustness parameter α enables the exploration of a wide range of stability/performance trade-offs.
1.6 Related work
State-of-the-art QR factorizations use multiple eliminators per panel, in order to dramatically reduce the critical path of the algorithm. These algorithms are unconditionally stable, and their parallelization has been fairly well studied on shared-memory systems [27, 90, 24] and on parallel distributed systems [42]. The idea of mixing Gaussian transformations and orthogonal transformations has been considered once before. Irony and Toledo [73] present an algorithm for reducing a banded symmetric indefinite matrix to diagonal form, which uses symmetric Gaussian transformations and Givens rotations to preserve the banded symmetric structure while maintaining a stability similar to that of partial symmetric pivoting. The reason for using LU kernels instead of QR kernels is performance: (i) LU performs half the floating-point operations of QR; (ii) LU kernels rely on GEMM kernels, which are very efficient, while QR kernels are more complex and much less tuned, hence less efficient; and (iii) the LU update is much more parallel than the QR update. All in all, LU is much faster than QR (as observed in the performance results of Section 1.5). Because of the large number of communications and synchronizations induced by pivoting in the reference LUPP algorithm, communication-avoiding variants of LUPP have been introduced [36], but they have proven much more challenging to design because of stability issues. In the following, we review several approaches:
1.6.1 LUPP
LU with partial pivoting is not a communication-avoiding scheme, and its performance in a parallel distributed environment is low (see Section 1.5). However, the LUPP algorithm is stable in practice, and we use it as a reference for stability.
1.6.2 LU NoPiv
The most basic communication-avoiding LU algorithm is LU NoPiv. This algorithm is stable for block diagonally dominant matrices [67, 37], but breaks down if it encounters a nearly singular diagonal tile, or loses stability if it encounters a diagonal tile whose smallest singular value is too small. Baboulin et al. [16] propose to apply a random transformation to the initial matrix, in order to use LU NoPiv while maintaining stability. This approach achieves about the same performance as LU NoPiv, since preprocessing and postprocessing costs are negligible. It is hard to be fully satisfied with this approach [16] because, for any matrix which is rendered stable by it (i.e., a matrix on which LU NoPiv becomes stable), there exists a matrix which is rendered unstable. Nevertheless, in practice, this proves to be a valid approach.
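The structure of this randomization can be sketched as follows. Note this is a hypothetical stand-in: the actual preconditioner of Baboulin et al. uses recursive butterfly transformations, which are much cheaper than the dense random orthogonal matrices used here, but the preprocess/factorize/recover pattern is the same:

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

def lu_nopiv(B):
    """LU factorization without any pivoting; L and U packed in one array."""
    B = B.copy()
    n = B.shape[0]
    for k in range(n - 1):
        B[k+1:, k] /= B[k, k]
        B[k+1:, k+1:] -= np.outer(B[k+1:, k], B[k, k+1:])
    return B

def solve_randomized_nopiv(A, b, seed=0):
    """Solve Ax = b via LU NoPiv applied to U A V, with U, V random
    orthogonal (a stand-in for the butterfly transforms of [16])."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    U, _ = qr(rng.standard_normal((n, n)))
    V, _ = qr(rng.standard_normal((n, n)))
    LU = lu_nopiv(U @ A @ V)
    y = solve_triangular(LU, U @ b, lower=True, unit_diagonal=True)
    z = solve_triangular(LU, y, lower=False)
    return V @ z

# Plain LU NoPiv would divide by zero on this matrix; the randomized
# version never sees the zero diagonal entry.
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 50))
A[0, 0] = 0.0
b = rng.standard_normal(50)
x = solve_randomized_nopiv(A, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))
```

The random transformation makes all leading principal minors nonzero with probability one, which is exactly what LU without pivoting requires.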
1.6.3 LU IncPiv
LU IncPiv is another communication-avoiding LU algorithm [28, 90]. Incremental pivoting is also called pairwise pivoting. The stability of the algorithm [28] is not sufficient and degrades as the number of tiles in the matrix increases (see our experimental results on random matrices). The method also suffers from some of the same performance degradations as QR factorizations with multiple eliminators per panel, namely low-performing kernels and extra dependencies in the update phase.
1.6.4 CALU
CALU [61] is a communication-avoiding LU algorithm. It uses tournament pivoting, which has been proven to be stable in practice [61]. CALU shares the (good) properties of our LU steps: (i) a low number of floating-point operations; (ii) the use of efficient GEMM kernels; and (iii) an embarrassingly parallel update. The advantage of CALU over our algorithm is essentially that it performs only LU steps, while our algorithm might need to perform some (more expensive) QR steps. The disadvantage is that, at each step, CALU needs to perform global pivoting on the whole panel, which then needs to be applied during the update phase to the whole trailing submatrix. There is no publicly available implementation of parallel distributed CALU, so it was not possible to compare stability or performance. CALU is known to be stable in practice [60, 40]. Performance results of CALU on parallel distributed platforms are presented in [60]; performance results of CALU on a single multicore node are presented in [40].
1.6.5 Summary
Table 1.4 provides a summary of the key characteristics of the algorithms discussed in this section.
ALGORITHM | CA | KERNELS EFF. FOR UPDATE | PIPELINE | #FLOPS | STABLE
LU NoPiv | YES | GEMM-EFFICIENT | YES | 1x | NOT AT ALL
LU IncPiv | YES | LESS EFFICIENT | YES | 1x | SOMEWHAT
CALU | YES | GEMM-EFFICIENT | NO | 1x | PRACTICALLY
LUQR (α), LU steps only | YES | GEMM-EFFICIENT | NO | 1x | PRACTICALLY
LUQR (α), QR steps only | YES | LESS EFFICIENT | NO | 2x | UNCONDITIONALLY
HQR | YES | LESS EFFICIENT | YES | 2x | UNCONDITIONALLY
LUPP | NO | GEMM-EFFICIENT | NO | 1x | PRACTICALLY
Table 1.4: A summary of key characteristics of each algorithm. CA in column 2 stands for Communica- tion Avoiding. Other column titles are self-explanatory.
1.7 Conclusion
Linear algebra software designers have been struggling for years to improve the parallel efficiency of LUPP (LU with partial pivoting), the de-facto method of choice for solving dense linear systems. The search for good pivots throughout the elimination panel is the key to stability (and indeed both NoPiv and IncPiv fail to provide acceptable stability), but it induces many short-length communications that dramatically decrease the overall performance of the factorization. Communication-avoiding algorithms are a recent alternative which proves very relevant on today's architectures. For example, in our experiments, our HQR factorization [42] based on QR kernels achieves performance similar to that of ScaLAPACK LUPP while performing twice as many floating-point operations, using slower sequential kernels, and a less parallel update phase. In this chapter, stemming from the key observation that LU steps and QR steps can be mixed during a factorization, we presented the LU-QR Algorithm, whose goal is to accelerate the HQR algorithm by introducing some LU steps whenever these do not compromise stability. The hybrid algorithm represents dramatic progress on a long-standing research problem. By restricting pivoting to the diagonal domain, i.e., locally, but doing so only when the robustness criterion forecasts that it is safe (and resorting to a QR step otherwise), we improve performance while guaranteeing stability. And we provide a continuous range of trade-offs between LU NoPiv (efficient but only stable for diagonally dominant matrices) and QR (always stable but twice as costly and with lower performance). For some classes of matrices (e.g., tile diagonally dominant), the LU-QR Algorithm will only perform LU steps. This work opens several research directions.
First, as already mentioned, the choice of the robustness parameter α is left to the user, and it would be very interesting to be able to auto-tune a suitable range of values as a function of the problem and platform parameters. Second, there are many variants and extensions of the hybrid algorithm that can be envisioned. Several have been mentioned in Section 1.2, and many others could be tried. In particular, the tile extension of the MUMPS criterion looks promising and deserves to be implemented in software in future work. Another goal would be to derive LU algorithms with several eliminators per panel (just as for HQR) to decrease the critical path, provided a reliable robustness test is available to ensure stability.
Chapter 2
Bridging the gap between experimental performance and theoretical bounds for the Cholesky factorization on heterogeneous platforms
Many linear algebra libraries rely on runtime systems to deal with resource allocation and low-level data management, such as MPI communications and shared-memory accesses. These runtimes allow developers to focus on their algorithms at the task level. In the previous chapter, we designed a new dense linear solver relying on the dynamic scheduler provided by PARSEC. To increase the performance of linear solvers, it is thus important to optimize the resource allocation of these runtime systems. In this chapter, we consider the problem of allocating and scheduling dense linear algebra applications on fully heterogeneous platforms made of CPUs and GPUs. More specifically, we focus on the Cholesky factorization, since it is the simplest factorization algorithm that exhibits the main problems encountered in heterogeneous scheduling. Indeed, the relative performance of CPUs and GPUs highly depends on the sub-routine: GPUs are, for instance, much more efficient at processing regular kernels such as matrix-matrix multiplications than more irregular kernels such as matrix factorizations. In this context, one solution consists in relying on dynamic scheduling and resource allocation mechanisms such as the ones provided by PaRSEC or StarPU. In this chapter, we analyze the performance of dynamic schedulers based on both actual executions and simulations, and we investigate how adding static rules based on an offline analysis of the problem to their decision process can improve their performance, up to reaching some improved theoretical performance bounds which we introduce.
2.1 Introduction
Linear algebra operations are the basis of many scientific applications. Our objective is to optimize the performance of one of them (namely the Cholesky decomposition) on a hybrid computing platform. The use of GPUs and other accelerators such as the Xeon Phi is a common way to increase the computational power of computers at a limited cost. The large computing power available on such accelerators for regular computations makes them unavoidable for linear algebra operations. However, optimizing the performance of a complex computation on such a hybrid platform is very difficult, and a manual optimization seems out of reach given the wide variety of hybrid configurations. Thus, several runtime systems have been proposed to dynamically schedule a computation on hybrid platforms, by mapping parts of the computation to each processing element, either cores or accelerators. Among other successful projects,
we may cite StarPU [14] from INRIA Bordeaux (France), Quark [110] and PaRSEC [21] from ICL, Univ. of Tennessee Knoxville (USA), Supermatrix [30] from the University of Texas (USA), StarSs [87] from the Barcelona Supercomputing Center (Spain), and KAAPI [50] from INRIA Grenoble (France). Usually, the structure of the computation has to be described as a task graph, where vertices represent tasks and edges represent dependencies between them. Most of these tools enable, to a certain extent, the scheduling of an application described as a task graph onto a parallel platform, by mapping individual tasks onto computing resources and by performing data movements between memories when needed. There is an abundant literature on the problem of scheduling task graphs on parallel processors. This problem is known to be NP-complete [49]. Lower bounds based either on the length of the critical path (the longest path from an entry vertex to an exit vertex) or on the overall workload (assuming ideal parallelism) have been proposed, and simple list-scheduling algorithms are known to provide a (2 − 1/m)-approximation on homogeneous platforms, at least when communication times are negligible [56]. Several scheduling heuristics have also been proposed, and among them the best known certainly is HEFT [104], which inspired some of the dynamic scheduling strategies used in the above-mentioned runtimes. However, there remains a large gap between the theoretical lower bounds and the actual performance of dynamic HEFT-like heuristics. Another way to assess the quality of a scheduling strategy is to compare the actual performance to the machine peak performance of the computing platform, computed as the sum of the performance of its individual computational units. Rather than this machine peak performance, which is known to be unreachable, one usually considers the GEMM peak obtained by running matrix multiplication kernels (GEMMs).
For large matrices, the task graph of a Cholesky factorization exhibits a sufficient amount of parallelism, and a sufficient number of GEMM calls, for this bound to be reasonable. However, on small and medium size matrices, there is still a large gap between the GEMM peak performance and the best achievable Cholesky performance. In this chapter, we optimize the dynamic scheduling of the Cholesky decomposition of a dense, symmetric, positive-definite double-precision matrix A into the product LL^T, where L is lower triangular with positive diagonal elements, using one runtime system, StarPU, and we provide better bounds to prove the quality of our schedules. The contributions of this chapter are:
• Better lower bounds on the processing time of a Cholesky factorization on a parallel hybrid platform;
• Better dynamic schedules, based not only on HEFT but also on a hybridization of static and dynamic task assignments;
• A very efficient schedule for a simple hybrid platform model, achieved by constraint programming;
• Numerous experiments and simulations to assess the performance of our schedules using the StarPU runtime.
Note that what is done here using StarPU could have been done with other runtimes, provided that we are able to control their mapping and scheduling policies. Similarly, we could have chosen another dense linear algebra factorization such as the QR or LU decomposition.
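As background for the bounds developed later, the two classical lower bounds mentioned in this introduction (critical path and overall workload) can be sketched for an abstract task graph on m homogeneous processors, communications ignored. This is purely illustrative; it is not the improved bounds introduced in this chapter:

```python
from functools import lru_cache

def makespan_lower_bound(duration, succ, m):
    """max(critical path, total work / m) for a DAG given as adjacency lists."""
    @lru_cache(maxsize=None)
    def bottom_level(t):
        # longest weighted path from task t down to an exit task
        return duration[t] + max((bottom_level(s) for s in succ.get(t, ())),
                                 default=0.0)
    critical_path = max(bottom_level(t) for t in duration)
    area = sum(duration.values()) / m
    return max(critical_path, area)

# Fork-join graph of 5 unit tasks: a -> {b, c, d} -> e.
duration = {t: 1.0 for t in "abcde"}
succ = {"a": ["b", "c", "d"], "b": ["e"], "c": ["e"], "d": ["e"]}
print(makespan_lower_bound(duration, succ, 2))   # max(3, 5/2) = 3.0
```

On this tiny graph the critical-path bound dominates for m = 2, while for m = 1 the area bound (5 units of work) takes over; real Cholesky DAGs sit between the two regimes depending on the matrix size.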
2.2 Context
2.2.1 Cholesky factorization
The Cholesky factorization (or Cholesky decomposition) is mainly used to solve a system of linear equations Ax = b, where A is an N × N symmetric positive-definite matrix, b is a vector, and x is the
Algorithm 5: Pseudocode of the tile Cholesky factorization

for k = 0 to n − 1 do
    A[k][k] ← POTRF(A[k][k])
    for i = k + 1 to n − 1 do
        A[i][k] ← TRSM(A[k][k], A[i][k])
    for j = k + 1 to n − 1 do
        A[j][j] ← SYRK(A[j][k], A[j][j])
        for i = j + 1 to n − 1 do
            A[i][j] ← GEMM(A[i][k], A[j][k], A[i][j])
unknown solution vector to be computed. Such systems often arise in physics applications, especially when looking for numerical solutions of partial differential equations, where A is positive-definite due to the nature of the modeled physical phenomenon. One way to solve such a linear system is first to compute the Cholesky factorization A = LL^T, where L (referred to as the Cholesky factor) is an N × N real lower triangular matrix with positive diagonal elements. The solution vector x can then be computed by solving the two following triangular systems: Ly = b and L^T x = y. To take advantage of modern highly parallel architectures, state-of-the-art numerical algebra libraries implement tiled Cholesky factorizations. The matrix A = (Aij), 0 ≤ i, j < n, is divided into n × n tiles (or blocks) of nb × nb elements, and the tiled Cholesky algorithm can then be seen as a sequence of tasks that operate on small portions of the matrix. This approach greatly improves the parallelism of the algorithm and mostly involves BLAS3 kernels whose library implementations are really fast on modern architectures. The benefits of such an approach on parallel multicore systems have already been discussed in the past [63, 28, 89]. Following the BLAS and LAPACK terminology, the tiled algorithm for the Cholesky factorization is based on the following set of kernel subroutines:
• POTRF: This LAPACK subroutine is used to perform the Cholesky factorization of a symmetric positive definite tile Akk of size nb × nb producing a lower triangular tile Lkk of size nb × nb.
• TRSM: This BLAS subroutine is used to apply the transformation computed by POTRF to an Aik tile by solving a triangular system.
• GEMM: This BLAS subroutine computes a matrix-matrix multiplication of two tiles Aik, Ajk and subtracts the result from the tile Aij. The old value of Aij is overwritten by the new one.
• SYRK: This BLAS subroutine executes a symmetric rank-k update on a diagonal tile Akk.
Note that no extra memory area is needed to store the Lij tiles, since they can overwrite the corresponding Aij tiles from the original matrix. The tiled Cholesky algorithm can be written as in Algorithm 5. In this sequential pseudocode, we can notice that some kernel subroutines depend on each other, while others can be processed in parallel. Such an algorithm is commonly represented by its task graph (or DAG), which depicts its actual dependencies. In this well-established model, each vertex of the graph represents a call to one of the four kernel subroutines presented above. The edges between two task vertices represent a direct data dependency between tasks. Figure 2.1 depicts the task graph for the Cholesky decomposition of a 5 × 5 tiled matrix.
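As an illustration, Algorithm 5 and its four kernels can be prototyped in plain Python. This is a reference sketch operating on tiles stored as nested lists; the function names mirror the BLAS/LAPACK routines, but the loops below are naive, unoptimized stand-ins for the library calls:

```python
import math

def potrf(T):
    """Cholesky factorization of one SPD tile, in place: on return the
    lower triangle of T holds L with T_old = L L^T (upper part left stale)."""
    nb = len(T)
    for j in range(nb):
        for k in range(j):
            T[j][j] -= T[j][k] ** 2
        T[j][j] = math.sqrt(T[j][j])
        for i in range(j + 1, nb):
            for k in range(j):
                T[i][j] -= T[i][k] * T[j][k]
            T[i][j] /= T[j][j]

def trsm(L, B):
    """B <- B * L^{-T}: each row of B is obtained by forward substitution."""
    nb = len(B)
    for r in range(nb):
        for j in range(nb):
            for k in range(j):
                B[r][j] -= B[r][k] * L[j][k]
            B[r][j] /= L[j][j]

def gemm(Aik, Ajk, C):
    """C <- C - Aik * Ajk^T."""
    nb = len(C)
    for i in range(nb):
        for j in range(nb):
            C[i][j] -= sum(Aik[i][k] * Ajk[j][k] for k in range(nb))

def syrk(Ajk, C):
    """Symmetric rank-nb update of the lower triangle: C <- C - Ajk * Ajk^T."""
    nb = len(C)
    for i in range(nb):
        for j in range(i + 1):
            C[i][j] -= sum(Ajk[i][k] * Ajk[j][k] for k in range(nb))

def tile_cholesky(A):
    """In-place tiled Cholesky (Algorithm 5); A is an n x n grid of nb x nb tiles."""
    n = len(A)
    for k in range(n):
        potrf(A[k][k])
        for i in range(k + 1, n):
            trsm(A[k][k], A[i][k])
        for j in range(k + 1, n):
            syrk(A[j][k], A[j][j])
            for i in range(j + 1, n):
                gemm(A[i][k], A[j][k], A[i][j])
```

As in the text, the L tiles overwrite the corresponding A tiles, and only the lower triangle of the matrix is read or written.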
[Figure 2.1 here: the DAG of POTRF_k, TRSM_i_k, SYRK_j_k and GEMM_i_j_k tasks for a 5 × 5 tiled matrix.]
Figure 2.1: Task graph for the Cholesky decomposition of a 5 × 5 tiled matrix.
2.2.2 Multiprocessor scheduling
2.2.2.1 Static Mapping
In such an algorithm, the way the matrix tiles are distributed over the processors and the mapping of the tasks onto the computing resources have a strong impact on its performance and scalability. In the distributed context, the ScaLAPACK library [39] uses the standard 2D block-cyclic distribution of tiles along a virtual p-by-q homogeneous grid, together with the owner-compute strategy. In this layout, the p-by-q top-left tiles of the matrix are mapped topologically onto the processor grid, and the remaining tiles are distributed onto the processors in a round-robin manner. This layout has been incorporated into the High Performance Fortran standard [79]. Once the tiles of the matrix are distributed onto the processors, the mapping of the tasks follows the owner-compute strategy: each subroutine overwriting a tile Aij is executed on the processor hosting Aij. The only freedom during the execution lies in the scheduling of the tasks within each processor and in the communications between processors. This layout ensures a good load balance and an equal memory usage across homogeneous computing resources [39]. However, for heterogeneous resources, this layout is no longer an option, and dynamic scheduling is a widespread practice. These ideas also make sense in a shared-memory environment in order to take advantage of locality. For instance, the Plasma library provides an option for relying on such static schedules on multicore chips.
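As a side note, the 2D block-cyclic layout and the owner-compute rule reduce to a simple modular mapping. The helper names below are hypothetical, and a real ScaLAPACK descriptor carries more information (block sizes, grid context, etc.):

```python
def block_cyclic_owner(i, j, p, q):
    """Grid coordinates of the processor owning tile (i, j) in a p x q grid.
    Under the owner-compute rule, every task overwriting tile (i, j)
    runs on this processor."""
    return (i % p, j % q)

def tiles_of(rank, n, p, q):
    """All tiles of an n x n tiled matrix mapped to grid processor rank = (r, c)."""
    r, c = rank
    return [(i, j) for i in range(r, n, p) for j in range(c, n, q)]
```

For n a multiple of p and q, every processor owns exactly (n/p) × (n/q) tiles, which is the load-balance property mentioned above.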
2.2.2.2 Dynamic Scheduling
Dynamic strategies have been developed in order to design methods flexible enough to cope with unpredictable performance of resources, especially in the context of real-time systems, where on-line and adaptive scheduling strategies are required [31, 82]. More recently, the design of dynamic schedulers has received a lot of attention, since on modern heterogeneous and possibly shared systems, the actual prediction of either execution or communication times is very hard, thus justifying the design of the ad-hoc tools described in Section 2.3.
2.2.2.3 Task graph scheduling heuristics
As presented earlier, many scheduling heuristics have been proposed for DAGs since this problem is NP-complete. Most of these heuristics are list-scheduling heuristics: they sort tasks according to some criterion and then schedule them greedily. This makes them good candidates to be turned into dynamic scheduling heuristics. The best-known list-scheduling heuristic for DAGs on heterogeneous platforms is certainly HEFT [104]. It consists in sorting tasks by decreasing bottom-level, which is the weight of the longest path from a task to an exit task (a task without successors). In a heterogeneous environment, the weight of a task (or communication) is computed as the average computation (or communication) time over the whole platform. Then, each task is considered and scheduled on the resource on which it will finish the earliest. HEFT turns out to be an efficient heuristic for heterogeneous processors. Other approaches have been proposed to avoid data movement when taking communications into account, such as clustering tasks into larger granularity tasks before scheduling them [93].
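The core of HEFT as described above can be sketched as follows. This is a simplified model: communication costs are ignored, `cost` gives each task's execution time on each resource, and all names are illustrative rather than taken from any library:

```python
def heft(tasks, succ, cost):
    """tasks: list of task ids; succ: dict task -> list of successors;
    cost: dict (task, resource) -> execution time.
    Returns (makespan, mapping task -> (resource, start time))."""
    resources = sorted({r for (_, r) in cost})
    # task weight = average execution time over the platform
    avg = {t: sum(cost[(t, r)] for r in resources) / len(resources) for t in tasks}
    bottom = {}
    def bl(t):  # bottom-level: longest (average-weighted) path to an exit task
        if t not in bottom:
            bottom[t] = avg[t] + max((bl(s) for s in succ.get(t, [])), default=0.0)
        return bottom[t]
    order = sorted(tasks, key=bl, reverse=True)  # decreasing bottom-level
    pred = {t: [] for t in tasks}
    for t, ss in succ.items():
        for s in ss:
            pred[s].append(t)
    free = {r: 0.0 for r in resources}  # earliest idle time of each resource
    sched = {}
    for t in order:
        ready = max((sched[p][1] + cost[(p, sched[p][0])] for p in pred[t]),
                    default=0.0)
        # place the task on the resource where it finishes earliest
        best = min(resources, key=lambda r: max(free[r], ready) + cost[(t, r)])
        start = max(free[best], ready)
        sched[t] = (best, start)
        free[best] = start + cost[(t, best)]
    makespan = max(s + cost[(t, r)] for t, (r, s) in sched.items())
    return makespan, sched
```

With positive weights, sorting by decreasing bottom-level always yields a topological order, so each task's predecessors are scheduled before it is considered.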
2.3 Tools and libraries
For this study, we used the Chameleon [4] implementation of the Cholesky factorization, running on top of the StarPU runtime system. We performed real executions on the target platform, and we additionally used the Simgrid [29] simulator, in order to reduce the experimentation time, improve reproducibility of the experiments, and also be able to modify the execution platform.
2.3.1 StarPU runtime system
StarPU [14] is a runtime system that aims to allow programmers to exploit the computing power of the available CPUs and GPUs, while relieving them from the need to specifically adapt their programs to the target machine and processing units. The StarPU runtime supports a task-based programming model. Applications submit computational tasks, forming a task graph, with CPU and/or GPU implementations, and StarPU schedules these tasks and the associated data transfers on the available CPUs and GPUs. The data that a task manipulates is automatically transferred between the local memory of the accelerators and the main memory, so that application programmers are freed from the scheduling issues and technical details associated with these transfers. In particular, StarPU takes care of scheduling tasks efficiently, using well-known generic dynamic task-graph scheduling policies from the literature (see Section 2.2.2), and optimizing data transfers, in particular through prefetching and overlapping. In addition, it allows scheduling experts, such as compiler or computational library developers, to implement custom scheduling policies in a portable fashion. In this study, we specialize the StarPU scheduling algorithms to include a mixture of static and dynamic task assignments, based on the knowledge of the Cholesky task graph, to improve performance on small and medium size matrices. In the following, we call “small” a matrix with fewer than 10 × 10 tiles, “medium” a matrix with between 10 × 10 and 20 × 20 tiles, and “large” a matrix with more than 20 × 20 tiles.
2.3.2 Chameleon dense linear algebra library
To cope with the increased degree of parallelism, a new class of dense linear algebra algorithms has been proposed, often referred to as tile algorithms in the literature [28, 89]. These algorithms led to the design of new libraries in the past five years, such as Plasma [28], Flame [72] and DPlasma [19]. Although both static and dynamic versions of the algorithms were initially implemented, the dynamic codes are now predominant, since they proved to provide more flexibility. These dynamic codes rely on runtime systems (Quark [110], Supermatrix [30], PaRSEC [21]) that have been specifically designed for the corresponding numerical software (Plasma, Flame and DPlasma, respectively). The advantage of relying on specialized runtime systems is that they can be optimized for both the numerical algorithm and the target architecture. On the other hand, designing and maintaining a runtime system is a highly time-consuming task, which makes it difficult to design a fully-featured specialized runtime system. The Chameleon library is based on the Plasma tile algorithms and code but relies on the StarPU generic runtime system instead of the specialized Quark runtime system. One advantage is that it allows for handling heterogeneous architectures (whereas Plasma and Quark were initially designed for multicore chips). Another advantage, when focusing on the impact of scheduling strategies, is that it allows for running in simulation mode with the field-proven combination [99] of StarPU and Simgrid.
2.3.3 Simgrid simulation engine
Simgrid [29] is a versatile simulation toolkit initially designed to study the behavior of large-scale distributed systems like grids, clouds, or peer-to-peer systems. It builds on fluid network models that have been shown to be a reasonable alternative to both simple analytic models and expensive, difficult-to-instantiate packet-level simulations. The Simgrid version of StarPU [99] uses Simgrid to simulate the execution of an application within a single machine. The idea is to run the application normally, except that data transfers and computation kernel calls are replaced by a simple procedure accounting for the time they are expected to take, gathered coherently by Simgrid. StarPU models each execution unit (CPUs and GPUs) by defining the time taken by each execution unit on each possible task/kernel [14]. It also models the PCI buses between them, using offline bus bandwidth measurements, and relies on Simgrid to compute the interferences on PCI buses between the different transfers. The resulting simulated times are very close to actual measurements on the real platforms [99], and properly reproduce the various behaviors that can be observed for the various schedulers. This allows one to run experiments with the Simgrid version of StarPU, which provides several advantages:
• The time to simulate an execution is reduced, since no actual computation or data transfer is done. The Simgrid simulator itself is not parallel, so the whole execution gets serialized, but several simulations can be run in parallel, e.g. for various matrix sizes or schedulers, so that all results are still obtained in parallel.
• The experiments do not depend on the availability of the platform, both in terms of quotas and in terms of versions of the installed software, thus allowing reproducible experiments. This proved useful while performing the experiments for this very chapter, since the platform became unavailable for a couple of weeks due to air-conditioning issues.
• The platform can be modified, for instance to change the available PCI bandwidth, the execution times of the kernels, etc. In Section 2.5.3.2, we use this feature in order to build a virtual “related” heterogeneous platform.
2.4 Lower bounds
Performance results for linear algebra computations are often accompanied by an upper bound in terms of GFLOP/s, in order to assess the achieved efficiency. Since the theoretical peak performance is usually unreachable, particularly with GPUs, the common bound is the performance of a simple matrix multiplication (GEMM): since this is the most efficient dense linear algebra operation, it provides a good hint of the achievable performance. This bound takes the heterogeneity of the platform into account by summing up the GFLOP/s rates obtained on the various processing elements. However, it does not take into account the heterogeneity of the application, which is particularly important for small and medium matrices, for which a fair amount of the tasks are not GEMMs but much less efficient tasks such as POTRFs, especially on accelerators. We here propose much more accurate bounds that account for the heterogeneity of both the computation resources and the application kernels, by taking as input the execution time of any kernel on any type of resource. To a certain extent, they also take the task graph itself into account, in terms of task dependencies.
2.4.1 Linear Programming formulation The lower bound computation is based on a relaxation of the scheduling problem, in which almost all precedence constraints are ignored. This formulation focuses on the number of tasks nrt of each type t (GEMM, SYRK, TRSM, POTRF) which are executed on each resource type r (CPU, GPU, ...). From the Cholesky task graph, we know the number Nt of tasks of each type t that need to be performed, and from the platform we know the number Mr of processing elements of each type r available to schedule the tasks. For each task type t and resource type r, the calibration mechanisms inside StarPU provide the execution time Trt of these tasks on this resource type. The basic area bound is obtained by solving the following linear problem:
minimize l such that
    ∀t, Σ_r n_{rt} = N_t                  (all N_t tasks of type t get executed)
    ∀r, Σ_t n_{rt} T_{rt} ≤ l × M_r       (resources of type r complete their tasks)
    ∀r, ∀t, n_{rt} ∈ ℕ
It is clear that the optimal value l∗ of this linear program is a lower bound on the total execution time of the task graph, since any execution needs to execute all tasks. Ignoring the task graph precedences in this bound allows one to handle tasks of the same type with a couple of variables (one per resource type), instead of having one variable for each task in the graph, thus limiting the number of variables and reducing symmetries in the solution space. StarPU is able, without any input from the application beyond the normal task submission, to automatically generate this program and solve it on the fly, right after the application execution, which thus allows one to print this theoretical bound alongside the measured performance in the application output. Due to the actual timings of the different task types, this linear program always decides that all POTRF tasks should be executed on CPUs, since all other task types make much more efficient use of the GPU resources. However, in practice all POTRF tasks are on the critical path of the Cholesky graph, and hence the resulting lower bound is too optimistic for small matrix sizes, since it does not take dependencies into account. This interesting feature of the Cholesky task graph, namely that it contains a path with all n POTRF tasks, can be used to strengthen the bound, without adding other variables in the linear program. In addition to the n POTRF tasks, this path contains n − 1 of the n(n−1)/2 TRSM tasks, and n − 1 of the n(n−1)/2 SYRK tasks. We can thus add the following constraint, which states that the execution time is necessarily larger than the time to execute all these tasks in sequence:
Σ_r n_{rP} T_{rP} + (n − 1) × T*_T + (n − 1) × T*_S ≤ l

In this constraint, T*_T and T*_S denote the fastest execution times of TRSM and SYRK tasks: we do not model exactly on which resources these TRSM and SYRK tasks are executed, and thus underestimate their completion times, ignoring which resource they actually run on¹. The resulting lower bound is called the mixed bound in the rest of the chapter. This linear program has a very small number of variables and constraints (in particular, they are independent of the matrix size), and it can thus be solved very quickly.
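For small instances, this mixed bound can be cross-checked by brute force over the integer variables n_{rt}, instead of calling an LP solver as StarPU does. The sketch below is a hypothetical helper restricted to two resource types (0 and 1), enumerating all integer splits of the task counts and keeping the smallest value that satisfies both the area constraints and the critical-path constraint:

```python
from itertools import product
import math

def cholesky_task_counts(n):
    """Numbers of POTRF, TRSM, SYRK and GEMM tasks in an n x n tiled Cholesky."""
    return {"POTRF": n,
            "TRSM": n * (n - 1) // 2,
            "SYRK": n * (n - 1) // 2,
            "GEMM": n * (n - 1) * (n - 2) // 6}

def mixed_bound(n, M, T):
    """M[r]: number of resources of type r (r in {0, 1});
    T[r][t]: time of a task of type t on a resource of type r.
    Enumerates how many tasks of each type run on type 0 (the rest on type 1)."""
    N = cholesky_task_counts(n)
    types = list(N)
    best = math.inf
    for split in product(*(range(N[t] + 1) for t in types)):
        counts = {0: dict(zip(types, split)),
                  1: {t: N[t] - s for t, s in zip(types, split)}}
        # area constraints: each resource class must finish its share of the work
        area = max(sum(counts[r][t] * T[r][t] for t in types) / M[r] for r in (0, 1))
        # critical path: the n POTRFs in sequence, plus n-1 fastest TRSMs and SYRKs
        path = sum(counts[r]["POTRF"] * T[r]["POTRF"] for r in (0, 1)) \
            + (n - 1) * (min(T[r]["TRSM"] for r in (0, 1))
                         + min(T[r]["SYRK"] for r in (0, 1)))
        best = min(best, max(area, path))
    return best
```

On a homogeneous two-resource platform with unit task times and n = 3, the area bound alone gives 5 (10 tasks on 2 resources), while the critical-path constraint raises the bound to 7, illustrating why the mixed bound is tighter for small matrices.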
2.4.2 Constraint Programming formulation
In addition to this lower bound computation, we have used a Constraint Programming formulation of the scheduling problem, in order to obtain good feasible solutions. These solutions provide both a comparison point for StarPU schedules and a limit for possible improvements of the lower bound. The formulation contains one boolean variable b_{ir} for each task i and each resource type r (only one of them can be true for a given task), and one integer variable s_i for each task i, which represents the starting time of the task. The constraints are the following:
minimize l such that
    ∀i, OnlyOne(b_{i1}, . . . , b_{iR})          (only one type of resource executes task i)
    ∀i, s_i + Σ_r b_{ir} T_{ir} ≤ l              (task i completes)
    ∀r, ∀t, |{i running on a resource of type r at time t}| ≤ M_r
                                                 (at any time t, the resources of type r execute at most M_r tasks)
    ∀(i → j), s_i + Σ_r b_{ir} T_{ir} ≤ s_j      (dependency i → j is respected)
We have implemented this constraint programming formulation using CP Optimizer v12.4. The first constraint is expressed using the alternative constraint, and the third constraint uses the concept of cumulative functions to express the number of tasks which use resources of type r at time t. The other constraints are simple linear constraints and are easily expressed. The solver explores the solution space with an exhaustive search and backtracking, using constraint propagation to reduce the search space as much as possible. Furthermore, providing the result of a HEFT heuristic as an initial solution allows the solver to explore good solutions more rapidly. We let the solver optimize for 23 hours and keep the best solution
found in this duration. The obtained solutions are quite good compared to what is obtained with other heuristics, but the solver is unable to prove optimality. Because it would otherwise be extremely costly to solve², this formulation does not take data transfers into account. With the usual platforms and the dense linear algebra operation being studied (the Cholesky factorization), data transfers are indeed not a concern: the computation is dense enough for transfers to be largely overlapped with kernel computation.
¹It is possible to include additional variables in the linear program to obtain more precise values, but this does not provide a better bound unless we take more dependencies into account, which requires adding too many variables and constraints and makes the linear program intractable.
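On toy instances, the behavior of such an exact search can be mimicked by enumerating resource-type assignments (the b_{ir} variables) and topological priority orders, then list-scheduling greedily. This is only an illustration of the model above, with illustrative names and no data transfers; CP Optimizer's constraint propagation makes the real search far more efficient. The helper returns the best makespan found:

```python
from itertools import permutations, product

def exact_makespan(tasks, edges, M, T):
    """Brute-force search: try every resource-type assignment and every
    topological priority order, list-schedule greedily, keep the best makespan.
    M[r]: units of resource type r; T[(i, r)]: time of task i on type r."""
    types = sorted(M)
    pred = {t: [] for t in tasks}
    for i, j in edges:
        pred[j].append(i)
    best = float("inf")
    for assign in product(types, repeat=len(tasks)):
        a = dict(zip(tasks, assign))  # one resource type per task, as with b_ir
        for order in permutations(tasks):
            pos = {t: k for k, t in enumerate(order)}
            if any(pos[i] > pos[j] for i, j in edges):
                continue  # not a topological order
            free = {r: [0.0] * M[r] for r in types}  # idle time of each unit
            finish = {}
            for t in order:
                # dependency constraint: all predecessors must have completed
                ready = max((finish[p] for p in pred[t]), default=0.0)
                units = free[a[t]]
                u = min(range(len(units)), key=lambda k: units[k])
                start = max(units[u], ready)
                finish[t] = start + T[(t, a[t])]
                units[u] = finish[t]
            best = min(best, max(finish.values()))
    return best
```

The capacity constraint of the CP model (at most M_r tasks of type r at any time) is enforced here by tracking the idle time of each individual unit.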
2.4.3 Upper bounds on performance
Lower bounds on the execution time directly translate into upper bounds on performance. We have therefore plotted several theoretical performance upper bounds for the Cholesky factorization in Figure 2.2, based on real execution timings of the different tasks on the Mirage machine (described in Section 2.5.2).