TEL-AVIV UNIVERSITY
RAYMOND AND BEVERLY SACKLER FACULTY OF EXACT SCIENCES
SCHOOL OF COMPUTER SCIENCE

Designing Communication-Efficient Matrix Algorithms in Distributed-Memory

Thesis submitted in partial fulfillment of the requirements for the M.Sc. degree of Tel-Aviv University by Eyal Baruch

The research work for this thesis has been carried out at Tel-Aviv University under the direction of Dr. Sivan Toledo

November 2001

Abstract

This thesis studies the relationship between parallelism, space and communication in dense matrix algorithms. We study existing matrix multiplication algorithms, specifically those that are designed for shared-memory multiprocessor machines (SMPs). These machines are rapidly becoming commodity in the computer industry, but exploiting their computing power remains difficult. We improve algorithms that were originally designed using an algorithmic multithreaded language called Cilk (pronounced silk), and we present new algorithms. We analyze the algorithms under Cilk's dag-consistent memory model. We show that by dividing the matrix multiplication into phases that are performed in a sequence, we can obtain a lower communication bound without significantly limiting parallelism and without consuming significantly more space. Our new algorithms are inspired by distributed-memory matrix algorithms. In particular, we have developed algorithms that mimic the so-called two-dimensional and three-dimensional matrix multiplication algorithms, which are typically implemented using message-passing mechanisms, not using shared-memory programming. We focus on three key matrix algorithms: matrix multiplication, solution of triangular linear systems of equations, and the factorization of matrices into triangular factors.

Contents

Abstract
Chapter 1. Introduction
  1.1. Two New Matrix Multiplication Algorithms
  1.2. New Triangular Solver and LU Algorithms
  1.3. Outline of The Thesis
Chapter 2. Background
  2.1. Parallel Matrix Multiplication Algorithms
  2.2. The Cilk Language
  2.3. The Cilk Work Stealing Scheduler
  2.4. Cilk's Memory Consistency Model
  2.5. The BACKER Coherence Algorithm
  2.6. A Model of Multithreaded Computation
Chapter 3. Communication-Efficient Dense Matrix Multiplication in Cilk
  3.1. Space-Efficient Parallel Matrix Multiplication
  3.2. Trading Space for Communication in Parallel Matrix Multiplication
  3.3. A Comparison of Message-Passing and Cilk Matrix-Multiplication Algorithms
Chapter 4. A Communication-Efficient Triangular Solver in Cilk
  4.1. Triangular Solvers in Cilk
  4.2. Auxiliary Routines
  4.3. Dynamic Cache-Size Control
  4.4. Analysis of the New Solver with Dynamic Cache-Size Control
Chapter 5. LU Decomposition
Chapter 6. Conclusion and Open Problems
Bibliography

CHAPTER 1

Introduction

The purpose of parallel processing is to perform computations faster than can be done with a single processor by using a number of processors concurrently. The need for faster solutions and for solving large problems arises in a wide variety of applications. These include fluid dynamics, weather prediction, image processing, artificial intelligence and automated manufacturing. Parallel computers can be classified according to a variety of architectural features and modes of operation. In particular, most of the existing machines may be broadly grouped into two classes: machines with shared-memory architectures (examples include most of the small multiprocessors in the market, such as Pentium-based servers, and several large multiprocessors, such as the SGI Origin 2000) and machines with distributed-memory architectures (examples include the IBM SP systems and clusters of workstations and servers). In a shared-memory architecture, processors communicate by reading from and writing into the shared memory. In distributed-memory architectures, processors communicate by sending messages to each other.

This thesis focuses on the efficiency of parallel programs that run under the Cilk programming environment. Cilk is a parallel programming system that offers the programmer a shared-memory abstraction on top of a distributed-memory hardware. Cilk includes a compiler for its programming language, which is also referred to as Cilk, and a run-time system consisting of a scheduler and a memory consistency protocol. (The memory consistency protocol, which this thesis focuses on, is only part of one version of Cilk; the other versions assume a shared-memory hardware.)

The Cilk parallel multithreaded language has been developed in order to make high-performance parallel shared-memory programming easier. Cilk is built around a provably efficient algorithm for scheduling the execution of fully strict multithreaded computations, based on the technique of work stealing [21][4][26][22][5]. In his PhD thesis [21], Randall developed a memory-consistency protocol for running Cilk programs on distributed-memory parallel computers and clusters. His protocol allows the algorithm designer to analyze the amount of communication in a Cilk program and the impact of this communication on the total running time of the program. The analytical tools that he developed, along with earlier tools, also allow the designer to estimate the space requirements of a program. Randall demonstrated the power of these results by implementing and analyzing several algorithms, including matrix multiplication and LU factorization algorithms. However, the communication bounds of Randall's algorithms are quite loose compared to known distributed-memory message-passing algorithms.


This is alarming, since extensive communication between processors may significantly slow down parallel computations even if the work and communication are equally distributed between processors. In this thesis we show that it is possible to tighten the communication bound with respect to the cache size using new Cilk algorithms that we have designed. We demonstrate new algorithms for matrix multiplication, for the solution of triangular linear systems of equations, and for the factorization of matrices into triangular factors.

By the term Cilk algorithms we essentially mean Cilk implementations of conventional matrix algorithms. Programming languages allow the programmer to specify a computation (how to compute intermediate and final results from previously-computed results). But most programming languages also force the designer to constrain the schedule of the computation. For example, a C program essentially specifies a complete ordering of the primitive operations. The compiler may change the order of computations only if it can prove that the new ordering produces equivalent results. Parallel message-passing programs fully specify the schedule of the parallel computation. Cilk programs, in contrast, declare that some computations may be performed in parallel but let a run-time scheduler decide on the exact schedule. Our analysis, as well as previous analyses of Cilk programs, essentially shows that a given program admits an efficient schedule and that Cilk's run-time scheduler is indeed likely to choose such a schedule.

1.1. Two New Matrix Multiplication Algorithms

The main contribution of this thesis is in presenting a new approach for designing algorithms implemented in Cilk that achieve lower communication bounds. In the distributed-memory application world there exists a traditional classification of matrix multiplication algorithms. So-called two-dimensional (2D) algorithms, such as those of Cannon [7], or of Ho, Johnsson and Edelman [18], use only a small amount of extra memory. Three-dimensional (3D) algorithms use more memory but perform asymptotically less communication; examples include the algorithms of Gupta and Sadayappan [14], of Berntsen [3], of Dekel, Nassimi and Sahni [8] and of Fox, Otto and Hey [10].

Cilk's shared-memory abstraction, in comparison to message-passing mechanisms, simplifies programming by allowing each procedure, no matter which processor runs it, to access the entire memory space of the program. The Cilk runtime system provides support for scheduling decisions, and the programmer need not specify which processor executes which procedure, nor exactly when each procedure should be executed. These factors make it substantially easier to develop parallel programs using Cilk than with other parallel-programming environments. One may suspect that the ease of programming comes at a cost: reduced performance. We show in this thesis that this is not necessarily the case, at least theoretically (up to logarithmic factors), but that careful programming is required in order to match existing bounds. More naive implementations of algorithms, including those proposed by Randall, do indeed suffer from relatively poor theoretical performance bounds.

We give tighter communication bounds for new Cilk matrix multiplication algorithms that can be classified as 2D and 3D algorithms, and we prove that it is possible to design such algorithms within the simple programming environment of Cilk almost without compromising on performance. In the 3D case we have even slightly improved parallelism. The analysis shows that we can implement a 2D-like algorithm for multiplying n × n matrices on a machine with P processors, each with n²/P memory, with communication bound O(√P n² log n). In comparison, Randall's notempmul algorithm, which is equivalent in the sense that it uses little space beyond the space required for the input and output, performs O(n³) communication. We also present a 3D-like algorithm with communication bound O(n³/√C + CP log(n/√C) log n), where C is the memory size of each processor, which is lower than existing Cilk implementations for any amount of memory per processor.

1.2. New Triangular Solver and LU Algorithms

Solving a linear system of equations is one of the most fundamental problems in numerical linear algebra. The classic Gaussian elimination scheme for solving an arbitrary linear system of equations reduces the given system to a triangular form and then generates the solution by using the standard forward and backward substitution algorithm. This essentially factors the coefficient matrix into two triangular factors, one lower triangular and the other upper triangular. The contribution of this thesis is showing that if we can dynamically control the amount of memory that processors use to cache data locally, then we can design communication-efficient algorithms for solving dense linear systems. In other words, to achieve low communication bounds, we limit the amount of data that a processor may cache during certain phases of the algorithm. Our algorithms perform asymptotically a factor of √C / log(n√C) less communication than Randall's (where √C > log(n√C)), but our algorithms have somewhat less parallelism.

1.3. Outline of The Thesis

The rest of the thesis is organized as follows. In Chapter 2 we present an overview of existing parallel linear-algebra algorithms, and we present Cilk, an algorithmic multithreaded language. Chapter 2 also introduces the tools that Randall and others have developed for analysing the performance of Cilk programs. In Chapter 3 we present new Cilk algorithms for parallel matrix multiplication and analyze our algorithms. In Chapter 4 we present a new triangular solver and demonstrate how controlling the size of the cache can reduce communication. In Chapter 5 we use the results concerning the triangular solver to design a communication-efficient LU decomposition algorithm. We present our conclusions and discuss open problems in Chapter 6.

CHAPTER 2

Background

This chapter provides background material required in the rest of the thesis. The first section describes parallel distributed-memory matrix multiplication algorithms. Our new Cilk algorithms are inspired by these algorithms. The other sections describe Cilk and the analytical tools that allow us to analyze the performance of Cilk programs. Some of the material on Cilk follows quite closely the Cilk documentation and papers.

2.1. Parallel Matrix Multiplication Algorithms

The product R = AB is defined by rij = Σk aik bkj, where the sum ranges over k = 1, ..., n and n is the number of columns of A and of rows of B. Implementing matrix multiplication according to the definition requires n³ multiplications and n²(n−1) additions when the matrices are n-by-n. In this thesis we ignore o(n³) algorithms, such as Strassen's, which are not widely used in practice.

Matrix multiplication is a regular computation that parallelizes well. The first issue when implementing such algorithms on parallel machines is how to assign tasks to processors. We can compute all the elements of the product in parallel, so we can clearly employ n² processors for n time steps. We can actually compute all the products in a matrix multiplication computation in one step if we can use n³ processors. But to compute the n² sums of products, we need an additional log n steps for the summations. Note that with n³ processors, most of them would remain idle during most of the time, since there are only 2n³ − 1 arithmetic operations to perform during log n + 1 time steps.

Another issue when implementing parallel algorithms is the mechanism used to support communication among different processors. In a distributed-memory architecture each processor has its own local memory, which it can address directly and quickly. A processor may or may not be able to address the memory of other processors directly, and in any case, accessing remote memories is slower than accessing its own local memory.

Programming a distributed-memory machine with message passing poses two challenges. The first challenge is a software-engineering one: since the memory of the computer is distributed and since the running program is composed of multiple processes, each with its own variables, we must distribute data structures among the processors. The second and more fundamental challenge is to choose the assignment of data-structure elements and computational tasks to processors in a way that minimizes communication. Since transferring data between memories of different processors is much slower than accessing data in a processor's own local memory, reducing data transfers usually reduces the running time of a program. Therefore, we

must analyze the amount of communication in matrix algorithms when we attempt to design efficient parallel algorithms and predict their performance.

There are two well known implementation concepts for parallel matrix multiplication on distributed-memory machines. The first and more natural implementation concept is to lay out the matrix in blocks. The P processors are arranged in a 2-dimensional √P-by-√P grid (we assume for simplicity here that √P is an integer), the three matrices are split into √P-by-√P block matrices, and each block is stored on the corresponding processor. The grid of processors is simply a map from 2-dimensional processor indices to the usual 1-dimensional rank (processor indexing). This form of distributing a matrix is called a 2-dimensional (2D) block distribution, because we distribute both the rows and the columns of the matrix among processors. The basic idea of the algorithm is to assign processor (i, j) the computation of Rij = Σk AikBkj, where k runs from 1 to √P (here Rij is a block of the matrix R, and similarly for A and B). The algorithm consists of √P main phases. In each phase, every processor sends two messages of size n²/P words (and receives two such messages as well), and performs 2(n/√P)³ floating-point operations, resulting in √P · (n²/P) = n²/√P communication cost and n²/P memory space per processor.

A second kind of distributed-memory matrix multiplication algorithm uses less communication but more space than the 2D algorithm. The basic idea is to arrange the processors in a p-by-p-by-p 3D grid, where p = P^(1/3), and to split the matrices into p-by-p block matrices. The first phase of the algorithm distributes the matrices so that processor (i, j, k) stores Aik and Bkj. The next phase computes on processor (i, j, k) the product AikBkj. In the third and last phase of the algorithm the processors sum up the products AikBkj to produce Rij. More specifically, the group of processors with indices (i, j, k), k = 1..p, sums up Rij. The computational load in the 3D algorithm is nearly perfectly balanced. Each processor multiplies two blocks and adds at most two. Some processors add none.

The 2D algorithm requires each processor to store exactly 3 submatrices of order n/√P during the algorithm and performs a total of P · √P · (n/√P)² = n²√P communication. The 3D algorithm stores at each processor 3 submatrices of order n/P^(1/3) and performs a total of P · (n²/P^(2/3)) = n²P^(1/3) communication.

2.2. The Cilk Language

The philosophy behind Cilk is that a programmer should concentrate on structuring his program to expose parallelism and exploit locality, leaving the runtime system with the responsibility of scheduling the computation to run efficiently on the given platform. Cilk's runtime system takes care of details like load balancing and communication protocols. Unlike other multithreaded languages, however, Cilk is algorithmic in that the runtime system's scheduler guarantees provably efficient and predictable performance. Cilk's algorithmic multithreaded language for parallel programming generalizes the semantics of C by introducing linguistic constructs for parallel control. The basic Cilk language is simple. It consists of C with the addition of three keywords, cilk, spawn and sync, to indicate parallelism and synchronization. A Cilk program, when run on one processor, has the same semantics as the C program that results when the Cilk keywords are deleted.
Cilk extends the semantics of C in a natural way for parallel execution, so that a procedure may spawn subprocedures in parallel and synchronize upon their completion. A Cilk procedure definition is identified by the keyword cilk and has an argument list and body just like a C function, and its declaration can be used anywhere an ordinary C function declaration can be used. The main procedure must be named main, as in C; unlike C, however, Cilk insists that the return type of main be int. Since the main procedure must also be a Cilk procedure, it must be defined with the cilk keyword.

Most of the work in a Cilk procedure is executed serially, just like C, but parallelism is created when the invocation of a Cilk procedure is immediately preceded by the keyword spawn. A spawn is the parallel analog of a C function call, and like a C function call, when a Cilk procedure is spawned, execution proceeds to the child. Unlike a C function call, however, where the parent is not resumed until after its child returns, in the case of a Cilk spawn, the parent can continue to execute in parallel with the child. Indeed, the parent can continue to spawn off children, producing a high degree of parallelism. Cilk's scheduler takes the responsibility of scheduling the spawned procedures on the processors of the parallel computer.

A Cilk procedure cannot safely use the return values (or data written to shared data structures) of the children it has spawned until it executes a sync statement. If all of its children have not completed when it executes a sync, the procedure suspends and does not resume until all of its children have completed. In Cilk, a sync waits only for the spawned children of the procedure to complete and not for all procedures currently executing. When all its children return, execution of the procedure resumes at the point immediately following the sync statement. As an aid to programmers, Cilk inserts an implicit sync before every return, if it is not present already. As a consequence, a procedure never terminates while it has outstanding children. The program in Figure 2.2.1 demonstrates how Cilk works. The figure shows a Cilk procedure that computes the n-th Fibonacci number.

In Cilk's terminology, a thread is a maximal sequence of instructions that ends with a spawn, sync or return (either explicit or implicit) statement (the evaluation of arguments to these statements is considered part of the thread preceding the statement). Therefore, we can visualize a Cilk program execution as a directed acyclic graph, or dag, in which vertices are threads (instructions) and edges denote ordering constraints imposed by control statements. A Cilk program execution consists of a collection of procedures, each of which is broken into a sequence of nonblocking threads. The first thread that executes when a procedure is called is the procedure's initial thread, and the subsequent threads are successor threads. At runtime, the binary spawn relation causes procedure instances to be structured as a rooted tree, and the dependencies among their threads form a dag embedded in this spawn tree. For example, the computation generated by the execution of fib(4) from the program in Figure 2.2.1 generates the dag shown in Figure 2.2.2. A correct execution of a Cilk program must obey all the dependencies in the dag, since a thread cannot be executed until all the threads on which it depends have completed.

cilk int fib(int n)
{
    if (n < 2)
        return n;
    else {
        int x, y;
        x = spawn fib(n-1);
        y = spawn fib(n-2);
        sync;
        return (x + y);
    }
}

Figure 2.2.1. A simple Cilk procedure to compute the nth Fibonacci number in parallel (using an exponential-work method, while logarithmic-time methods are known). Deleting the cilk, spawn, and sync keywords would reduce the procedure to a valid and correct C procedure.


Figure 2.2.2. A dag of threads representing the multithreaded computation of fib(4) from Figure 2.2.1. Each procedure, shown as a rounded rectangle, is broken into sequences of threads, shown as circles. A downward edge indicates the spawning of a subprocedure. A horizontal edge indicates the continuation to a successor thread. An upward edge indicates the returning of a value to a parent procedure. All three types of edges are dependencies which constrain the order in which threads may be scheduled. The figure is from [22].

Note that the use of the term thread here is different from the common use in programming environments such as Win32 or POSIX threads, where the same term refers to a process-like object that shares an address space with other threads and which competes with other threads and processes for CPU time.

2.3. The Cilk Work Stealing Scheduler

The spawn and sync keywords specify logical parallelism, as opposed to actual parallelism. That is, these keywords indicate which code may possibly execute in parallel, but what actually runs in parallel is determined by the scheduler, which maps dynamically unfolding computations onto the available processors. To execute a Cilk program correctly, Cilk's underlying scheduler must obey all the dependencies in the dag, since a thread cannot be executed until all the threads on which it depends have completed. These dependencies form a partial order, permitting many ways of scheduling the threads in the dag. A scheduling algorithm must ensure that enough threads remain concurrently active to keep the processors busy. Simultaneously, it should ensure that the number of concurrently active threads remains within reasonable limits so that memory requirements can be bounded. Moreover, the scheduler should also try to maintain related threads on the same processor, if possible, so that communication between them can be minimized. Needless to say, achieving all these goals simultaneously can be difficult.

Two scheduling paradigms address the problem of scheduling multithreaded computations: work sharing and work stealing. In work sharing, whenever a processor generates new threads, the scheduler attempts to migrate some of them to other processors in hopes of distributing the work to underutilized processors. In work stealing, however, underutilized processors take the initiative: they attempt to steal threads from other processors. Intuitively, the migration of threads occurs less frequently with work stealing than with work sharing, since if all processors have plenty of work to do, no threads are migrated by a work-stealing scheduler, but threads are always migrated by a work-sharing scheduler.

Cilk's work-stealing scheduler executes any Cilk computation in nearly optimal time [21][4][5]. During the execution of a Cilk program, when a processor runs out of work, it asks another processor, chosen at random, for work to do. Locally, a processor executes procedures in ordinary serial order (just like C), exploring the spawn tree in a depth-first manner. When a child procedure is spawned, the processor saves the local variables of the parent (its activation frame) at the bottom of a stack, which is a ready deque (a doubly ended queue from which procedures can be added or deleted), and commences work on the child (the convention is that the stack grows downward, and items are pushed and popped from the bottom of the stack). When the child returns, the bottom of the stack is popped (just like C) and the parent resumes. When another processor requests work, however, work is stolen from the top of the stack, that is, from the end opposite to the one normally used by the worker.
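The deque discipline just described can be made concrete with a small sketch. The following C fragment illustrates only the access pattern (the owner works at the bottom, thieves take from the top); it is sequential and omits the locking that Cilk's real implementation uses to make these operations safe under concurrent steals, and all the names in it are ours, not Cilk's.

#include <stdlib.h>

/* A schematic ready deque of activation frames. */
typedef struct {
    void **frames;    /* saved activation frames (suspended parents) */
    int    top;       /* index of the oldest frame: the steal end    */
    int    bottom;    /* one past the newest frame: the owner's end  */
    int    capacity;
} deque;

static deque *deque_create(int capacity)
{
    deque *d = malloc(sizeof *d);
    d->frames = malloc(capacity * sizeof *d->frames);
    d->top = d->bottom = 0;
    d->capacity = capacity;
    return d;
}

/* Owner, at a spawn: save the parent's frame at the bottom. */
static void push_bottom(deque *d, void *frame)
{
    if (d->bottom < d->capacity)
        d->frames[d->bottom++] = frame;
}

/* Owner, when a child returns: pop the bottom, just like a C call stack. */
static void *pop_bottom(deque *d)
{
    return (d->bottom > d->top) ? d->frames[--d->bottom] : NULL;
}

/* Thief, on a randomly chosen victim: steal from the top, the end
   opposite to the one the owner uses.                               */
static void *steal_top(deque *d)
{
    return (d->top < d->bottom) ? d->frames[d->top++] : NULL;
}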

2.4. Cilk's Memory Consistency Model

Cilk's shared-memory abstraction greatly enhances the programmability of a multiprocessor. In comparison to a message-passing architecture, the ability of each processor to access the entire memory simplifies programming by reducing the need for explicit data partitioning and data movement. The single address space also provides better support for parallelizing compilers and standard operating systems. Since shared-memory systems allow multiple processors to simultaneously read and write the same memory locations, programmers require a conceptual model for the semantics of memory operations to allow them to correctly use the shared memory. This model is typically referred to as a memory consistency model or memory model. To maintain the programmability of shared-memory systems, such a model should be intuitive and simple to use. The intuitive memory model assumed by most programmers requires the execution of a parallel program on a multiprocessor to appear as some interleaving of the execution of the parallel processes on a uniprocessor. This intuitive model was formally defined by Lamport as sequential consistency [19]:

Definition 2.4.1. A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order and the operations of each individual processor appear in this sequence in the order specified by its program.

Sequential consistency maintains the memory behavior that is intuitively expected by most programmers. Each processor is required to issue its memory operations in program order. Operations are serviced by memory one at a time, and thus they appear to execute atomically with respect to other memory operations. The memory services operations from different processors based on an arbitrary but fair global schedule. This leads to an arbitrary interleaving of operations from different processors into a single sequential order. The fairness criterion guarantees eventual completion of all processor requests. The above requirements lead to a total order on all memory operations that is consistent with the program order dictated by each processor's program.

Unfortunately, architects of shared-memory systems for parallel computers who have attempted to support Lamport's strong model of sequential consistency have generally found that Lamport's model is difficult to implement efficiently, and hence relaxed models of shared-memory consistency have been developed [9][11][12]. These models adopt weaker semantics to allow a faster implementation. By and large, all of these consistency models have had one thing in common: they are processor-centric in the sense that they define consistency in terms of actions by physical processors. In contrast, Cilk's dag consistency is defined on the abstract computation dag of a Cilk program and hence is computation-centric. To define a computation-centric memory model like dag consistency it suffices to define what values are allowed to be returned by a read.

We now define dag consistency in terms of the computation. A computation is represented by its graph G = (V, E), where V is a set of vertices representing threads of the computation, and E is a set of edges representing ordering constraints on the threads. For two threads u and v, we say that u (strictly) precedes v, which we write u ≺ v, if u ≠ v and there is a directed path in G from u to v.

Definition 2.4.2. The shared memory M of a computation G = (V, E) is dag consistent if for every object x in the shared memory, there exists an observer function fx : V → V such that the following conditions hold:
1. For all instructions u ∈ V, the instruction fx(u) writes to x.
2. If an instruction u writes to x, then we have fx(u) = u.
3. If an instruction u reads x, it receives the value written by fx(u).
4. For all instructions u ∈ V, we have u ⊀ fx(u).
5. For each triple u, v, w of instructions such that u ≺ v ≺ w, if fx(v) ≠ fx(u) holds, then we have fx(w) ≠ fx(u).

Informally, the observer function fx(u) represents the viewpoint of instruction u on the content of object x. For deterministic programs, this definition implies the intuitive notion that a read can see a write in the dag consistency model only if there is some serial execution order consistent with the dag in which the read sees the write. Unlike sequential consistency, but similar to certain processor-centric models [11][12], dag consistency allows different reads to return values that are based on different serial orders, but the values returned must respect the dependencies in the dag. Thus, the writes performed by a thread are seen by its successors, but threads that are incomparable in the dag may or may not see each other's writes.

The primary motivation for any weak consistency model, including dag consistency, is performance. In addition, however, a memory model must be understandable by a programmer. In the dag consistency model, if the programmer wishes to ensure that a read sees a write, he must ensure that there is a path in the computation dag from the write to the read. Using Cilk, one can ensure that such a path exists by placing a sync statement between the write and the read in his program.
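As a small illustration of this rule, consider the following hypothetical Cilk fragment (the procedure and variable names are ours). Before the sync, the child's write and the parent's read are incomparable in the dag, so under dag consistency the read may return either the old or the new value; after the sync there is a path in the dag from the write to the read, so the write must be seen.

int x = 0;                     /* object in dag-consistent shared memory */

cilk void writer(void)
{
    x = 1;                     /* write performed by the spawned child */
}

cilk int reader(void)
{
    int before, after;

    spawn writer();
    before = x;                /* incomparable with the write: may see 0 or 1 */
    sync;
    after = x;                 /* a path from the write now exists: sees 1 */
    return before + after;
}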

2.5. The BACKER Coherence Algorithm

Cilk maintains dag consistency using a coherence protocol called BACKER¹ [21]. In this protocol, versions of shared-memory objects can reside simultaneously in any of the processors' local caches or in the global backing store. Each processor's cache contains objects recently used by the threads that have executed on that processor, and the backing store provides a global storage location for each object. In order for a thread executing on the processor to read or write an object, the object must be in the processor's cache. Each object in the cache has a dirty bit to record whether the object has been modified since it was brought into the cache.

Three basic actions are used by the BACKER to manipulate shared-memory objects: fetch, reconcile and flush. A fetch copies an object from the backing store to a processor cache and marks the cached object as clean. A reconcile copies a dirty object from a processor cache to the backing store and marks the cached object as clean. Finally, a flush removes a clean object from a processor cache. Unlike implementations of other models of consistency, all three actions are bilateral between a processor's cache and the backing store, and other processors' caches are never involved.

¹ The BACKER coherence algorithm was designed and implemented as part of Keith Randall's PhD thesis, but it is not included in the Cilk versions that are actively maintained.

The BACKER coherence algorithm operates as follows. When the program performs a read or write action on an object, the action is performed directly on a cached copy of the object. If the object is not in the cache, it is fetched from the backing store before the action is performed. If the action is a write, the dirty bit of the object is set. To make space in the cache for a new object, a clean object can be removed by flushing it from the cache. To remove a dirty object, it is reconciled and then flushed.

Besides performing these basic operations in response to user reads and writes, the BACKER performs additional reconciles and flushes to enforce dag consistency. For each edge i → j in the computation dag, if threads i and j are scheduled on different processors, say P and Q, then BACKER reconciles all of P's cached objects after P executes i but before P enables j, and it reconciles and flushes all of Q's cached objects before Q executes j.

The key reason BACKER works is that it is always safe, at any point during the execution, for a processor P to reconcile an object or to flush a clean object. The BACKER algorithm uses this safety property to guarantee dag consistency even when there is communication. BACKER causes P to reconcile all its cached objects after executing i but before enabling j, and it causes Q to reconcile and flush its entire cache before executing j. At this point, the state of Q's cache (empty) is the same as it would be if j had executed with i on processor P, but a reconcile and flush had occurred between them. Consequently, BACKER ensures dag consistency.
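The three actions can be made concrete with the following schematic C fragment. It is only meant to illustrate the fetch/reconcile/flush vocabulary: the cache-line structure and the backing-store accessors are our own assumptions, and object granularity, eviction policy and concurrency are all ignored.

typedef struct {
    long value;    /* cached copy of a shared-memory object          */
    int  valid;    /* is the object currently present in the cache?  */
    int  dirty;    /* modified since it was brought into the cache?  */
} cache_line;

/* Hypothetical accessors for the global backing store. */
extern long backing_store_read(long object_id);
extern void backing_store_write(long object_id, long value);

/* fetch: copy an object from the backing store into the cache, clean. */
static void fetch(cache_line *c, long object_id)
{
    c->value = backing_store_read(object_id);
    c->valid = 1;
    c->dirty = 0;
}

/* reconcile: copy a dirty cached object back to the backing store and
   mark the cached copy as clean.                                       */
static void reconcile(cache_line *c, long object_id)
{
    if (c->valid && c->dirty) {
        backing_store_write(object_id, c->value);
        c->dirty = 0;
    }
}

/* flush: drop a clean object from the cache. */
static void flush(cache_line *c)
{
    if (c->valid && !c->dirty)
        c->valid = 0;
}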

2.6. A Model of Multithreaded Computation

Cilk supports an algorithmic model of multithreaded computation which equips us with an algorithmic foundation for predicting the performance of Cilk programs. A multithreaded computation is composed of a set of threads, each of which is a sequential ordering of unit-size instructions. A processor takes one unit of time to execute one instruction. The instructions of a thread must execute in sequential order from the first instruction to the last instruction.

From an abstract theoretical perspective, there are two fundamental limits to how fast a Cilk program could run [21][4][5]. Let us denote by Tp the execution time of a given computation on P processors. The work of the computation, denoted T1, is the total number of instructions in the dag, which corresponds to the amount of time required by a one-processor execution (ignoring cache misses and other complications). Notice that with T1 work and P processors, the lower bound Tp ≥ T1/P must hold, since in one step, a P-processor computer can do at most P work (this, again, ignores cache misses). The second limit is based on the program's critical path length, denoted by T∞, which is the maximum number of instructions on any directed path in the dag, which corresponds to the amount of time required by an infinite-processor execution, or equivalently, the time needed to execute threads along the longest path of dependency. The corresponding lower bound is simply Tp ≥ T∞, since a P-processor computer can do no more work in one step than an infinite-processor computer.

The work T1 and the critical path length T∞ are not intended to denote the execution time on any real single-processor or infinite-processor machine.

These quantities are abstractions of a computation and are independent of any real machine characteristics such as communication latency. We can think of T1 and T∞ as execution times on an ideal machine with no scheduling overhead and with a unit-access-time memory system. Nevertheless, Cilk's work-stealing scheduler executes a Cilk computation that does not use locks on P processors in expected time [21]

TP = T1/P + O(T∞),

which is asymptotically optimal, since T1/P and T∞ are both lower bounds. Empirically, the constant factor hidden by the big O is often close to 1 or 2 [22], and the formula Tp = T1/P + T∞ provides a good approximation of the running time on shared-memory multiprocessors. This performance model holds for Cilk programs that do not use locks. If locks are used, Cilk cannot guarantee anything [22]. This simple performance model allows the programmer to reason about the performance of his Cilk program by examining two simple metrics: work and critical path.

The speedup of the computation on P processors is the ratio T1/Tp, which indicates how many times faster the P-processor execution is than a one-processor execution. If T1/Tp = Θ(P), then we say that the P-processor execution exhibits linear speedup. The maximum possible speedup is T1/T∞, which is also called the parallelism of the computation, because it represents the average amount of work that can be done in parallel in each time step along the critical path. We denote the parallelism of a computation by P̄.

In order to model performance for Cilk programs that use dag-consistent shared memory, we observe that running times will vary as a function of the size C of the cache that each processor uses. Therefore, we must introduce metrics that account for this dependence. We define a new work measure, the total work, that accounts for the cost of cache misses in the serial execution. Let Γ be the time to service a cache miss in the serial execution. We assign weights to the instructions of the dag. Each instruction that generates a cache miss in the one-processor execution with the standard, depth-first serial execution order and with a cache of size C has weight Γ + 1, and all other instructions have weight 1. The total work, denoted T1(C), is the total weight of all instructions in the dag, which corresponds to the serial execution time if cache misses take Γ units of time to be serviced. The work term T1, which was defined before, corresponds to the serial execution time if all cache misses take zero time to be serviced. Unlike T1, T1(C) depends on the serial execution order of the computation. It further differs from T1 in that T1(C)/P is not a lower bound on the execution time for P processors. Consequently, the ratio T1(C)/T∞ is defined to be the average parallelism of the computation.

We can bound the amount of space used by a parallel Cilk execution in terms of its serial space. Denote by Sp the space required for a P-processor execution. Then S1 is the space required for an execution on one processor. Cilk guarantees [21] that for a P-processor execution we have SP ≤ S1·P. This bound implies that if a computation uses a certain amount of memory on one processor, it will use no more space per processor on average when it runs in parallel.

The amount of interprocessor communication can be related to the number of cache misses that a Cilk computation incurs when it runs on P processors using the implementation of the BACKER coherence algorithm with cache size C. Let us denote by Fp(C) the number of cache misses incurred by a P-processor Cilk computation. Randall [21] shows that Fp(C) ≤ F1(C) + 2Cs, where s is the total number of steals executed by the scheduler. The 2Cs term represents cache misses due to warming up the processors' caches. Randall has performed empirical measurements that indicated that the warm-up events are much smaller in practice than the theoretical bound.
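These bounds can be combined into a simple back-of-the-envelope predictor. The C fragment below only illustrates how one might plug numbers into the model TP ≈ T1/P + T∞ and into the bound FP(C) ≤ F1(C) + 2Cs; the constant in front of T∞, the number of steals s, and all the sample figures are inputs of our own choosing, not quantities the model supplies.

#include <stdio.h>

/* Predicted P-processor running time, T_P ~ T1/P + c_inf * T_inf. */
static double predicted_time(double t1, double t_inf, double p, double c_inf)
{
    return t1 / p + c_inf * t_inf;
}

/* Upper bound on P-processor cache misses, F_P(C) <= F_1(C) + 2*C*s. */
static double cache_miss_bound(double f1, double cache_size, double steals)
{
    return f1 + 2.0 * cache_size * steals;
}

int main(void)
{
    /* Hypothetical figures, for illustration only. */
    double t1 = 1e9, t_inf = 1e4, p = 16.0, c_inf = 1.5;
    double f1 = 1e6, cache = 1e4, steals = 200.0;

    printf("predicted time   : %g\n", predicted_time(t1, t_inf, p, c_inf));
    printf("cache-miss bound : %g\n", cache_miss_bound(f1, cache, steals));
    return 0;
}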
Randall shows that this bound can be further tightened if we assume that the accesses to the backing store behave as if they were random and independent. Under this assumption, the following theorem predicts the performance of a distributed-memory Cilk program [21]:

Theorem 2.6.1. Consider any Cilk program executed on P processors, each with an LRU cache of C elements, using Cilk's work stealing scheduler in conjunction with the BACKER coherence algorithm. Assume that accesses to the backing store are random and independent. Suppose the computation has F1(C) serial cache misses and T∞ critical path length. Then, for any ε > 0, the number of cache misses is at most F1(C) + O(CPT∞ + CP log(1/ε)) with probability at least 1 − ε. Moreover, the expected number of cache misses is at most F1(C) + O(CPT∞).

The standard assumption in [21] is that the backing store consists of half the physical memory of each processor, and that the other half is used as a cache. In other words, C is roughly a 1/2P fraction of the total memory of the machine. It is, therefore, convenient to assess the communication requirements of algorithms under this assumption, although C can, of course, be smaller. Finally, from here on we focus on the expected performance measures (communication, time, cache misses, and space).

CHAPTER 3

Communication-Efficient Dense Matrix Multiplication in Cilk

Dense matrix multiplication is used in a variety of applications and is one of the core components of many scientific computations. The standard way of multiplying two matrices of size n × n requires O(n³) floating-point operations on a sequential machine. Since dense matrix multiplication is computationally expensive, the development of efficient algorithms is of great interest.

This chapter discusses two types of parallel algorithms for multiplying n × n dense matrices A and B to yield the product matrix R = A × B using Cilk programs. We analyze the communication cost and space requirements of specific Cilk algorithms and show new algorithms that are efficient with respect to the measures of communication and space. Specifically, we prove upper bounds on the amount of communication on SMP machines with P processors and shared-memory caches of size C when dag consistency is maintained by the BACKER coherence algorithm and under the assumption that accesses to the backing store are random and independent.

3.1. Space-Efficient Parallel Matrix Multiplication

Previous papers on Cilk [21][20][6] presented two divide-and-conquer algorithms for multiplying n-by-n matrices. The first algorithm uses Θ(n²) memory and has Θ(n) critical-path length (as stated above, we only focus on conventional Θ(n³)-work algorithms). In [21, page 56], this algorithm is called notempmul, which is the name we will use to refer to it. This algorithm divides the two input matrices into four n/2-by-n/2 blocks or submatrices, computes recursively the first four products and stores the results in the output matrix, then computes recursively, in parallel, the last four products and then concurrently adds the new results to the output matrix. The notempmul algorithm is shown in Figure 3.1.1. In essence, the algorithm uses the following formulation:

    [ R11  R12 ]   [ A11B11  A11B12 ]
    [ R21  R22 ] = [ A21B11  A21B12 ] ,

    [ R11  R12 ]    [ A12B21  A12B22 ]
    [ R21  R22 ] += [ A22B21  A22B22 ] .

Under the assumption that C is 1/2P of the total memory of the machine, and that the backing store's size is Θ(n²) (so C = n²/P), the communication upper bound for notempmul that Theorem 2.6.1 implies is O(n³), which is a lot more than the Θ(n²√P) bound for 2D message-passing algorithms.
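To see where the O(n³) figure comes from, recall that Theorem 2.6.1 bounds the communication by F1(C) + O(CPT∞); for notempmul the serial miss count is O(n³/√C) and T∞ = Θ(n) (these are the entries that appear later in Table 1). Substituting C = n²/P gives n³/√C = n²√P for the first term and CPT∞ = (n²/P) · P · n = n³ for the second, so the second term dominates for P ≤ n² and the overall bound is O(n³).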


cilk void notempmul(long nb, block *A, block *B, block *R)
{
    if (nb == 1)
        multiplyadd_block(A, B, R);
    else {
        block *C, *D, *E, *F, *G, *H, *I, *J;
        block *CGDI, *CHDJ, *EGFI, *EHFJ;

        /* get pointers to input submatrices */
        partition(nb, A, &C, &D, &E, &F);
        partition(nb, B, &G, &H, &I, &J);

        /* get pointers to result submatrices */
        partition(nb, R, &CGDI, &CHDJ, &EGFI, &EHFJ);

        /* solve subproblems recursively */
        spawn notempmul(nb/2, C, G, CGDI);
        spawn notempmul(nb/2, C, H, CHDJ);
        spawn notempmul(nb/2, E, H, EHFJ);
        spawn notempmul(nb/2, E, G, EGFI);
        sync;
        spawn notempmul(nb/2, D, I, CGDI);
        spawn notempmul(nb/2, D, J, CHDJ);
        spawn notempmul(nb/2, F, J, EHFJ);
        spawn notempmul(nb/2, F, I, EGFI);
        sync;
    }
    return;
}

Figure 3.1.1. Cilk code for the notempmul algorithm. It is a no-temporary version of recursive blocked matrix multiplication.
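The serial base-case routine multiplyadd_block is not shown in the thesis. A minimal sketch, under the assumption that a block is an NB-by-NB submatrix of doubles stored contiguously in row-major order (the constant NB and the storage convention are our assumptions), could look as follows:

#define NB 64   /* assumed base-case block dimension */

/* R += A * B, where A, B and R are NB-by-NB blocks in row-major order. */
static void multiplyadd_block(const double *A, const double *B, double *R)
{
    for (int i = 0; i < NB; i++)
        for (int k = 0; k < NB; k++) {
            double a = A[i * NB + k];
            for (int j = 0; j < NB; j++)
                R[i * NB + j] += a * B[k * NB + j];
        }
}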

We suggest another way to perform n × n matrix multiplication. Our new algorithm, called CilkSUMMA, is inspired by the SUMMA matrix multiplication algorithm [23]. The algorithm divides the multiplication process into phases that are performed one after the other, not concurrently; all the parallelism is within phases. Figure 3.1.2 illustrates the algorithm. For some constant r, 0 < r ≤ n, the algorithm splits A into n/r block columns of size n × r and B into n/r block rows of size r × n and performs n/r rank-r updates of the product matrix R, one per phase. The main loop of CilkSUMMA is:


Figure 3.1.2. The CilkSUMMA algorithm for parallel matrix multiplication. CilkSUMMA divides A into n/r vertical blocks of size n × r and B into n/r horizontal blocks of size r × n. The corresponding blocks of A and B are iteratively multiplied to produce an n × n product. Each such product is accumulated into the matrix R together with the results of the previous iterations. Finally, R holds the product of A and B.


Figure 3.1.3. Recursive rank-r update to an n-by-n matrix R.

cilk CilkSUMMA(A, B, R, n, r)
{
    R = 0
    for k = 1..n/r {
        spawn RankRUpdate(Ak, Bk, R, n, r)
        sync
    }
}

The RankRUpdate procedure recursively divides an n × r block of A into two blocks of size (n/2) × r and an r × n block of B into two blocks of size r × (n/2), and multiplies the corresponding blocks in parallel, recursively. If n = 1, then the multiplication reduces to a dot product, for which we give the code later. The algorithm is shown in Figure 3.1.3, and its code is given below:

cilk RankRUpdate(A, B, R, n, r)
{
    if n = 1
        R += DotProd(A, B, r)
    else {
        spawn RankRUpdate(A1, B1, R11, n/2, r)
        spawn RankRUpdate(A1, B2, R12, n/2, r)
        spawn RankRUpdate(A2, B1, R21, n/2, r)
        spawn RankRUpdate(A2, B2, R22, n/2, r)
    }
}

The recursive Cilk procedure DotProd, shown below, is executed at the bottom of the rank-r recursion. If r = 1, the code returns the scalar product of the inputs. Otherwise, the code splits each of the r-length input vectors a and b into two subvectors of r/2 elements, multiplies the two halves recursively, and returns the sum of the two dot products. Clearly, the code performs Θ(r) work and has critical path Θ(log r). The details are as follows:

cilk DotProd(a, b, r)
{
    if (r = 1)
        return a1 · b1
    else {
        x = spawn DotProd(a[1,...,r/2], b[1,...,r/2], r/2)
        y = spawn DotProd(a[r/2+1,...,r], b[r/2+1,...,r], r/2)
        sync
        return (x + y)
    }
}

The analysis of communication cost is organized as follows. First, we prove a lemma describing the amount of communication performed by RankRUpdate. Next, we obtain a bound on the amount of communication in CilkSUMMA.

Lemma 3.1.1. The amount of communication in RankRUpdate, FP^RRU(C, n), incurred by BACKER running on P processors, each with a shared-memory cache of C elements, and with block size r = √(C/3), when solving a problem of size n, is O(n² + CP log(n√C)).

Proof. To find the number of RankRUpdate cache misses, we use Theorem 2.6.1. The work and critical path for RankRUpdate can be computed using recurrences. We find the number of cache misses incurred when the RankRUpdate algorithm is executed on a single processor and then substitute them into Theorem 2.6.1.

The work bound T1^RRU(n, r) satisfies T1^RRU(1, r) = T1^DotProd(r) = Θ(r) and T1^RRU(n, r) = 4 T1^RRU(n/2, r) for n > 1. Therefore, T1^RRU(n, r) = Θ(n²r).

To derive a recurrence for the critical path length T∞^RRU(n, r), we observe that with an infinite number of processors the 4 block multiplications can execute in parallel; therefore T∞^RRU(n, r) = T∞^RRU(n/2, r) + Θ(1) for n > 1.

For n = 1, T∞^RRU(1, r) = T∞^DotProd(r) + Θ(1) = Θ(log r). Consequently, the critical path satisfies T∞^RRU(n, r) = Θ(log n + log r).

Next, we bound F1^RRU(C, n), the number of cache misses that occur when the RankRUpdate algorithm is used to solve a problem of size n with the standard, depth-first serial execution order on a single processor with an LRU cache of size C. At each node of the computational tree of RankRUpdate, k² elements of R in a k × k block are updated using the results of k² dot products of size r. To perform such an operation entirely in the cache, the cache must store k² elements of R, kr elements of A, and kr elements of B. When k ≤ √(C/3), the three submatrices fit into the cache. Let k = r = √(C/3). Clearly, the (n/k)² updates to k-by-k blocks can be performed entirely in the cache, so the total number of cache misses is at most F1^RRU(C, n) ≤ (n/k)² · [k² + 2kr] ≤ Θ(n²).

By Theorem 2.6.1 the amount of communication that RankRUpdate performs, when run on P processors using Cilk's scheduler and the BACKER coherence algorithm, is

    FP^RRU(C, n) = F1^RRU(C, n) + O(CP T∞^RRU).

Since the critical path length of RankRUpdate is Θ(log nr), the total number of cache misses is O(n² + CP log(n√C)). □

Next, we analyze the amount of communication in CilkSUMMA.

Theorem 3.1.2. The number of CilkSUMMA cache misses FP^CS(C, n), incurred by BACKER running on P processors, each with a shared-memory cache of C elements and block size r = √(C/3), when solving a problem of size n, is O((n/√C)(n² + CP log(n√C))). In addition, the total amount SP^CS(n) of space taken by the algorithm is O(n² + P log(n√C)).

Proof. Notice that the CilkSUMMA algorithm only performs sequential calls to the parallel algorithm RankRUpdate. The sync statement at the end of each iteration guarantees that the procedure suspends and does not resume until all the RankRUpdate children have completed. Each such iteration is a phase in which only one call to RankRUpdate is invoked, so the only parallel execution is of the parent procedure and its own children. Thus, we can bound the total number of cache misses by the sum of the cache misses incurred during the n/r phases,

    FP^CS(C, n) ≤ (n/r) · FP^RRU(C, n).

By Lemma 3.1.1, we have FP^RRU(C, n) = O(n² + CP log(n√C)), yielding

    FP^CS(C, n) = O((n/r)(n² + CP log(nr))) = O((n/√C)(n² + CP log(n√C))).

The total amount of space used by the algorithm is the space allocated for the product matrix, plus the space for the activation frames allocated by the runtime system. The Cilk runtime system uses activation frames to represent procedure instances. Each such representation is of constant size, including the program counter and all live, dirty variables. The frame is pushed onto a deque on the heap and it is deallocated at the end of the procedure call. Therefore, in the worst case the total space reserved for the activation frames is the longest possible chain of procedure instances, for each processor, which is the critical path length, resulting in O(P log nr) total space allocated at any time; thus SP^CS(n) = Θ(n²) + O(P log nr) = O(n² + P log(n√C)). □

We have bounded the work and critical path of CilkSUMMA. Using these values we can compute the total work and estimate the total running time TP^CS(C, n). The computational work of CilkSUMMA is T1^CS(n) = Θ(n³), so the total work is T1^CS(C, n) = T1^CS(n) + Γ F1^CS(C, n) = Θ(n³), assuming Γ is a constant.
The critical path length is T∞^CS(C, n) = (n/√C) log(n√C), so using the performance model in [21], the total expected time for CilkSUMMA on P processors is

    TP(C, n) = O(T1(C, n)/P + Γ C T∞(C, n)) = O(n³/P + Γ n √C log(n√C)).

Consequently, if P = O(n²/(Γ √C log(n√C))), the algorithm runs in O(n³/P) time, obtaining linear speedup. CilkSUMMA uses the processor cache more effectively than notempmul whenever √C > log(n√C), which holds asymptotically for all C = Ω(n).

If we consider the size of the cache to be C = n²/P, which is the memory size of each one of the P processors in a distributed-memory machine when solving n × n matrix multiplication with 2D algorithms, then FP^CS(C, n) = Θ(√P n² log n) and SP^CS(n) = O(n²). These results are comparable to the communication and space requirements of 2D distributed-memory matrix multiplication algorithms, and they are significantly better than the Θ(n³) communication bound. We also improved the average parallelism of the algorithm over notempmul for r = Ω(n), since T1(C, n)/T∞(n) = n³/((n/r) log(nr)) > n².
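For completeness, here is the short calculation behind these figures: with C = n²/P we have n/√C = √P and CP = n², so FP^CS(C, n) = O(√P (n² + n² log(n²/√P))) = O(√P n² log n), matching the Θ(√P n² log n) stated above; in the space bound O(n² + P log(n√C)), the n² term dominates for any reasonable P.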

3.2. Trading Space for Communication in Parallel Matrix Multiplication

The matrix-multiplication algorithm shown in Figure 3.2.1 is perhaps the most natural matrix-multiplication algorithm in Cilk. The code is from [21, page 55], but the same algorithm also appears in [20]. The motivation for this algorithm, called blockedmul, is to increase parallelism, at the expense of using more memory. Its critical path is only Θ(log² n), as opposed to Θ(n) in notempmul, but it uses Θ(n²P^(1/3)) space [21, page 148], as opposed to only Θ(n²) in notempmul.

In the message-passing literature on matrix algorithms, space is traded for a reduction in communication, not for parallelism. So-called 3D matrix multiplication algorithms [1, 2, 3, 13, 14, 17] replicate the input matrices P^(1/3) times in order to reduce the total amount of communication from Θ(n²P^(1/2)) in 2D algorithms down to Θ(n²P^(1/3)).

cilk void blockedmul(long nb, block *A, block *B, block *R)
{
    if (nb == 1)
        multiply_block(A, B, R);
    else {
        block *C, *D, *E, *F, *G, *H, *I, *J;
        block *CG, *CH, *EG, *EH, *DI, *DJ, *FI, *FJ;
        block tmp[nb*nb];

        /* get pointers to input submatrices */
        partition(nb, A, &C, &D, &E, &F);
        partition(nb, B, &G, &H, &I, &J);

        /* get pointers to result submatrices */
        partition(nb, R, &CG, &CH, &EG, &EH);
        partition(nb, tmp, &DI, &DJ, &FI, &FJ);

        /* solve subproblems recursively */
        spawn blockedmul(nb/2, C, G, CG);
        spawn blockedmul(nb/2, C, H, CH);
        spawn blockedmul(nb/2, E, H, EH);
        spawn blockedmul(nb/2, E, G, EG);
        spawn blockedmul(nb/2, D, I, DI);
        spawn blockedmul(nb/2, D, J, DJ);
        spawn blockedmul(nb/2, F, J, FJ);
        spawn blockedmul(nb/2, F, I, FI);
        sync;

        /* add results together into R */
        spawn matrixadd(nb, tmp, R);
        sync;
    }
    return;
}

Figure 3.2.1. Cilk code for recursive blocked matrix multiplication. It uses divide-and-conquer to solve one n × n multiplication problem by splitting it into 8 (n/2) × (n/2) multiplication subproblems and combining the results with one n × n addition. A temporary matrix of size n × n is allocated at each divide step. A serial matrix multiplication routine is called to do the base case.

Irony and Toledo have shown that the additional memory is necessary for reducing communication, and that the tradeoff is asymptotically tight [16].

Substituting C = n²/P^(2/3) in Randall's communication analysis for blockedmul, we find that with that much memory, the algorithm performs O(n²P^(1/3) log² n) communication. That is, if we provide the program with caches large enough to replicate the input Θ(P^(1/3)) times, as in 3D message-passing algorithms, the amount of communication that it performs is at most a factor of Θ(log² n) more than message-passing 3D algorithms. In other words, blockedmul is a Cilk analog of 3D algorithms.

We propose a slightly more communication-efficient algorithm than blockedmul. Like our previous algorithm, CilkSUMMA, obtaining optimal performance from this algorithm requires explicit knowledge and use of the cache-size parameter C. This makes the algorithm more efficient but less elegant than blockedmul, which exploits a large cache automatically without explicit use of the cache-size parameter. On the other hand, blockedmul may simply fail if it cannot allocate temporary storage (a real-world implementation should probably synchronize after 4 recursive calls, as notempmul does, if it cannot allocate a temporary matrix).

The code for SpaceMul is given below. It uses an auxiliary procedure, MatrixAdd, which is not shown here, to sum an array of n × n matrices. We assume that MatrixAdd sums k matrices of dimension n using Θ(kn²) work and critical path Θ(log k log n); such an algorithm is trivial to implement in Cilk. For simplicity, we assume that n is a power of 2.

cilk spawnhelper(cilk procedure f, array [Y1, Y2, ..., Yk])
{
    if (k = 1)
        spawn f(Y1)
    else {
        spawn spawnhelper(f, [Y1, Y2, ..., Yk/2])
        spawn spawnhelper(f, [Yk/2+1, ..., Yk])
    }
}

cilk SpaceMul(A, B, R)
{
    /* comment: A, B, R are n-by-n */
    Allocate n/r matrices, each n-by-n, denoted R1..Rn/r
    Partition A into n/r block columns A1..An/r
    Partition B into n/r block rows B1..Bn/r
    spawn spawnhelper(RankRUpdate,
        [(A1, B1, R1), ..., (An/r, Bn/r, Rn/r)])
    sync
    spawn MatrixAdd(R, R1, R2, ..., Rn/r)
    return R
}
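The MatrixAdd procedure is not given in the thesis. One way to obtain the assumed Θ(kn²) work and Θ(log k log n) critical path is to sum the k matrices by a binary reduction tree, with each pairwise addition itself recursing over quadrants. The sketch below, written in the same pseudocode style as SpaceMul, is our own; the helper names and the exact recursion are assumptions, not code from the thesis.

cilk MatrixSum(R1, ..., Rk, n)            /* sums R1,...,Rk into R1 */
{
    if (k > 1) {
        spawn MatrixSum(R1, ..., Rk/2, n)        /* left half  -> R1     */
        spawn MatrixSum(Rk/2+1, ..., Rk, n)      /* right half -> Rk/2+1 */
        sync
        spawn AddInto(R1, Rk/2+1, n)             /* R1 += Rk/2+1         */
    }
}

cilk MatrixAdd(R, R1, ..., Rk, n)         /* R += R1 + ... + Rk */
{
    spawn MatrixSum(R1, ..., Rk, n)
    sync
    spawn AddInto(R, R1, n)
}

cilk AddInto(R, S, n)                     /* R += S for n-by-n matrices */
{
    if (n = 1)
        r11 += s11
    else {
        spawn AddInto(R11, S11, n/2)
        spawn AddInto(R12, S12, n/2)
        spawn AddInto(R21, S21, n/2)
        spawn AddInto(R22, S22, n/2)
    }
}

With this structure, AddInto has Θ(n²) work and Θ(log n) critical path, so MatrixSum satisfies T1(k) = 2 T1(k/2) + Θ(n²) = Θ(kn²) and T∞(k) = T∞(k/2) + Θ(log n) = Θ(log k log n), matching the bounds assumed above.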

Theorem 3.2.1. The number of SpaceMul cache misses FP^SM(C, n), incurred by BACKER running on P processors, each with a shared-memory cache of C elements and block size r = √(C/3), when solving a problem of size

n, is O(n³/√C + CP log(n/√C) log n). The total amount SP^SM(n) of space used by the algorithm is O(n³/√C).

Proof. The amount of communication in SpaceMul is bounded by the sum of the communication incurred until the sync and the communication incurred after the sync. Thus, FP^SM(C, n) = FP^1(C, n) + FP^2(C, n), where FP^1, FP^2 represent the communication in the two phases of the algorithm.

First, we compute the work and critical path of SpaceMul. Let TP^ADD(n, k) be the P-processor running time of MatrixAdd for summing k matrices of dimension n, let TP^RRU(n, r) be the P-processor running time of RankRUpdate to perform a rank-r update on an n × n matrix, and let TP^SM(n) be the total running time of SpaceMul.

Recall from the previous section that T1^RRU(n, r) = Θ(n²r) and that T∞^RRU(n, r) = Θ(log nr). As discussed above, it is trivial to implement MatrixAdd so that T1^ADD(n, n/r) = Θ(n³/r) and T∞^ADD(n, n/r) = Θ(log(n/r) log n).

We now bound the work and critical path of SpaceMul. The work for SpaceMul is

    T1^SM(n, r) = n/r + (n/r) T1^RRU(n, r) + T1^ADD(n, n/r) = n/r + (n/r) n²r + n³/r = Θ(n³)

(there is nothing surprising about this: this is essentially a schedule for the conventional algorithm). The critical path for SpaceMul is T∞^SM(n) = log(n/r) + T∞^RRU(n, r) + T∞^ADD(n, n/r). The first term accounts for spawning the n/r parallel rank-r updates. Therefore, T∞^SM(n) = Θ(log(n/r) + log nr + log(n/r) log n) = Θ(log(n/r) log n).

Next, we compute the amount of communication in SpaceMul. From the proof of Lemma 3.1.1 we know that F1^RRU(C, n, r) = O(n²) (recall that r = √(C/3)). A sequential execution of MatrixAdd performs O((n/r) n²) cache misses, at most 3n² during the addition of every pair of n-by-n matrices. Using Theorem 2.6.1, we can bound the total communication FP^SM(C, n) in SpaceMul,

F SM(C, n)=F 1 (C, n)+F 2 (C, n) P P P n3 n3 n O CP n O CP n = r + log + r + log r log n3 n = O √ + CP log √ log n . C C n The space used by the algorithm consists of the space for the r product matrices and the space of the activation frames which are bounded by PT∞. 3 O n n2 P n n O √n  Therefore the total space cost is ( r + log r log )= ( C ).

Conclusion 3.2.2. The communication upper bound of SpaceMul is smaller by a factor of Ω(log n / log(n/√C)) than the bound of blockedmul, for any cache size.

Algorithm     S_P          F_P                              T_1    T_∞
notempmul     n²           n³/√C + CPn                      n³     n
blockedmul    n²P^{1/3}    n³/√C + CP log² n                n³     log² n
CilkSUMMA     n²           n³/√C + (CPn/√C) log(n√C)        n³     (n/√C) log(n√C)
SpaceMul      n³/√C        n³/√C + CP log(n/√C) log n       n³     log(n/√C) log n

Table 1. Asymptotic upper bounds on the performance metrics of the four Cilk matrix-multiplication algorithms, when applied to n-by-n matrices on a computer with P processors and cache size C.

Proof. The amount of communication in blockedmul is bounded by Θ(n³/√C + CP log² n). The result follows from Theorem 3.2.1. □

In particular, for C = n²/P^{2/3} and r = √(C/3),

  F_P^{SM}(n) = O(n²P^{1/3} + (1/3) n²P^{1/3} log P log n) = O(n²P^{1/3} log P log n).

This bound is a factor of P^{1/6}/log P smaller than the corresponding bound for CilkSUMMA. The average parallelism is T_1^{SM}(C, n)/T_∞^{SM}(n) = Ω(n³/log² n), which is the same as in blockedmul.

3.3. A Comparison of Message-Passing and Cilk Matrix-Multiplication Algorithms

Table 1 summarizes the performance bounds of the four Cilk algorithms that we have discussed in this chapter. The table compares the space, communication, work, and critical path of the four algorithms as a function of the input size n and the cache size C.

Table 2 compares message-passing algorithms to their Cilk analogs. Message-passing algorithms have fixed space requirements, and the table shows the required amount of space and the corresponding bound on communication. In Cilk algorithms the amount of communication depends on the size of the local memories (the cache size), and the table fixes these sizes to match the space requirements of the message-passing algorithms. The communication bounds of the Cilk algorithms were derived by substituting the appropriate cache size C in the general bounds shown in Table 1. The main conclusions that we draw from the table are:
• The notempmul algorithm is communication inefficient. CilkSUMMA is a much better alternative.
• The communication upper bounds for even the best Cilk algorithms are worse by a factor of between log n and log² n than the communication in message-passing algorithms.
Can the performance of notempmul and blockedmul be improved by tuning the cache size to the problem size and machine size at hand?

Algorithm        Memory per Processor    Total Communication
Distributed 2D   Θ(n²/P)                 O(n²√P)
Distributed 3D   Θ(n²/P^{2/3})           O(n²P^{1/3})
notempmul        Θ(n²/P)                 O(n³)
blockedmul       Θ(n²/P^{2/3})           O(n²P^{1/3} log² n)
CilkSUMMA        Θ(n²/P)                 O(n²√P log n)
SpaceMul         Θ(n²/P^{2/3})           O(n²P^{1/3} log P log n)

Table 2. Communication overhead of the Cilk shared-memory algorithms when each processor's cache is as large as the per-processor memory used by the corresponding distributed-memory algorithm.
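The optimal cache sizes in Table 3 below are obtained by balancing the two terms of the corresponding bound in Table 1. A sketch of the calculation for notempmul (the one for blockedmul is analogous, with CP log² n in place of CPn):

\[
\frac{d}{dC}\left(\frac{n^3}{\sqrt{C}} + CPn\right)
  = -\frac{n^3}{2C^{3/2}} + Pn = 0
\quad\Longleftrightarrow\quad
C = \Theta\!\left(\left(\frac{n^2}{P}\right)^{2/3}\right)
  = \Theta\!\left(\frac{n^{4/3}}{P^{2/3}}\right),
\]
\[
\text{and at this value of } C \text{ both terms are }
\Theta\!\left(\frac{n^3}{\sqrt{C}}\right) = \Theta\!\left(n^{7/3}P^{1/3}\right),
\text{ the notempmul entry of Table 3.}
\]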

Algorithm    Optimal Cache Size                 Overall Communication
notempmul    C = Θ(n^{4/3}/P^{2/3})             O(n^{7/3}P^{1/3})
blockedmul   C = Θ(n²/(P^{2/3} log^{4/3} n))    O(n²P^{1/3} log^{2/3} n)

Table 3. Optimized communication overhead of the Cilk n × n matrix-multiplication algorithms and the cache-size values that achieve it.

Table 3 shows that the performance of notempmul can indeed be improved by reducing the caches slightly. Obviously, the size of the backing store cannot be shrunk if it is to hold the input and output, so the implication is that, for this algorithm, the size of the caches should be smaller than half the local memory of the processor. (This setting reduces the provable upper bound; whether it reduces communication in practice is another matter, and we conjecture that it does not.) However, even after the reduction, the communication upper bound is significantly worse than that of all the other algorithms. The table also shows that the performance of blockedmul can be improved slightly by reducing the size of the cache, but not by much. Since SpaceMul always performs less communication than blockedmul, the same observation applies to it.

CHAPTER 4

A Communication-Efficient Triangular Solver in Cilk

This chapter focuses on the solution of linear systems of equations with triangular coefficient matrices. Such systems are solved nearly always by substitution, which creates a relatively long critical path in the computation. In Randall's analyses of communication, long critical paths weaken the upper bounds because of the CPT_∞ term. In this chapter we show that by allowing the programmer to dynamically control the size of local caches, we can derive tighter communication upper bounds. More specifically, the combination of dynamic cache-size control and a new Cilk algorithm allows us to achieve performance bounds similar to those of state-of-the-art message-passing algorithms.

4.1. Triangular Solvers in Cilk

A lower triangular linear system of equations

  l_{11} x_1 = b_1
  l_{21} x_1 + l_{22} x_2 = b_2
  ...
  l_{n1} x_1 + l_{n2} x_2 + ··· + l_{nn} x_n = b_n,

which we can also write as a matrix equation Lx = b, is solved by substitution. Here L is a known coefficient matrix, b is a known vector, and x is a vector of unknowns to be solved for. This chapter actually focuses on the solution of multiple linear systems with the same coefficient matrix L but with different right-hand sides b, which we write LX = B. More specifically, we focus on the case in which B has exactly n columns, which is the case that comes up in the factorization of general matrices, the subject of the next chapter. We assume that L is nonsingular, which for a triangular matrix is equivalent to saying that it has no zeros on the diagonal. Although we focus on lower triangular systems, upper triangular systems are handled in exactly the same way.

We can solve such systems by substitution. In a lower triangular system the first equation involves only one variable, x_1. We can, therefore, solve it directly, x_1 = b_1/l_{11}. Now that we know the value of x_1, we can substitute its value in all the other equations. Then the second equation involves only one unknown, which we solve for, and so on. Randall presents and analyzes a recursive formulation of the substitution algorithm in Cilk [21, pg. 58]. This recursive solver partitions the matrix as shown in Figure 4.1.1. The code is as follows:


Figure 4.1.1. Recursive decomposition in a traditional triangular-solver algorithm. The three matrices are subdivided into n/2 × n/2 blocks. First, the recursive matrix equations L11X11 = B11 and, in parallel, L11X12 = B12 are solved for X11 and X12. Then B21 = B21 − L21X11 and, in parallel, B22 = B22 − L21X12 are computed. Finally, L22X21 = B21 and, in parallel, L22X22 = B22 are solved recursively.

cilk RecursiveTriSolver(L, X, B)
{
  if (L is 1-by-1)
    x11 = b11/l11
  else {
    partition L, B, and X as in Figure 4.1.1
    spawn RecursiveTriSolver(L11, X11, B11)
    spawn RecursiveTriSolver(L11, X12, B12)
    sync
    spawn multiply-and-update(B21 = B21 − L21X11)
    spawn multiply-and-update(B22 = B22 − L21X12)
    sync
    spawn RecursiveTriSolver(L22, X21, B21)
    spawn RecursiveTriSolver(L22, X22, B22)
  }
}
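Written out, the partitioning of Figure 4.1.1 is simply the block form of LX = B:

\[
\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}
\begin{pmatrix} X_{11} & X_{12} \\ X_{21} & X_{22} \end{pmatrix}
=
\begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix},
\]
which yields the four equations solved by the code above:
\[
L_{11}X_{11}=B_{11}, \qquad L_{11}X_{12}=B_{12}, \qquad
L_{22}X_{21}=B_{21}-L_{21}X_{11}, \qquad L_{22}X_{22}=B_{22}-L_{21}X_{12}.
\]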

This algorithm performs Θ(n³) work, and the length of its critical path is Θ(n log n) when the multiply-and-updates are performed using notempmul. Here too, the advantage of notempmul is that it uses no auxiliary space. But because the substitution algorithm solves for one unknown after the other, the critical path cannot be shorter than n even if the multiply-and-updates are performed using a short-critical-path multiplier, so notempmul does not worsen the parallelism significantly. For C = n²/P, Randall's result implies that the amount of communication is bounded by F_P(C) = O(n³ log n). This is a meaningless bound, since even without local caching at all, a Θ(n³) algorithm does not perform more than Θ(n³) communication.
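For reference, the substitution computation that all of these solvers parallelize can be written as a short sequential C routine; the in-place, row-major layout below is an assumption of this sketch and is not part of Randall's code or of the algorithms in this chapter.

/* Forward substitution: solve L X = B for X, where L is an n-by-n lower
   triangular matrix with nonzero diagonal and B is n-by-nrhs.  X
   overwrites B; both matrices are stored row-major.  Sequential
   reference sketch only. */
void forward_substitute(const double *L, double *B, int n, int nrhs)
{
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < i; k++)        /* substitute the known x_k */
            for (int j = 0; j < nrhs; j++)
                B[i * nrhs + j] -= L[i * n + k] * B[k * nrhs + j];
        for (int j = 0; j < nrhs; j++)     /* solve for x_i */
            B[i * nrhs + j] /= L[i * n + i];
    }
}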

Figure 4.1.2. The recursive partitioning in NewTriSolver. The algorithm first recursively solves L11X1 = B1 for X1, then updates B2 = B2 − L21X1, and then recursively solves L22X2 = B2.

We use a slightly different recursive formulation of the substitution algorithm, which will later allow us to develop a communication-efficient algorithm. Our new algorithm, NewTriSolver, partitions the matrices as shown in Figure 4.1.2. Its code is actually simpler than that of RecursiveTriSolver:

cilk NewTriSolver(L, X, B)
{
  if (L is 1-by-1)
    call VectorScale(1/l11, X, B)    /* X = B/l11 */
  else {
    partition L, B, and X as in Figure 4.1.2
    call NewTriSolver(L11, X1, B1)
    call RectangularMatMult(B2 = B2 − L21X1)
    call NewTriSolver(L22, X2, B2)
  }
}

This algorithm exposes no parallelism by itself: all the parallelism is exposed by the two auxiliary routines that it calls. We use the keyword call to emphasize that the caller is suspended until the callee returns, unlike a spawn, which allows the caller to continue to execute concurrently with the callee. The call keyword uses the normal C function-call mechanism. We could achieve the same scheduling result by spawning each of the called routines in NewTriSolver and immediately following each spawn with a sync.

4.2. Auxiliary Routines

We now turn to the description of the auxiliary routines that NewTriSolver uses. The VectorScale routine scales a vector Y = (y1, ..., yn) by a scalar α and returns the result in another vector X = (x1, ..., xn).

Figure 4.2.1. The recursive partitioning in RectangularMatMult. The algorithm recursively partitions the long matrices until they are square and then calls CilkSUMMA.

cilk VectorScale(α, [x1, ..., xn], [y1, ..., yn])
{
  if (n == 1)
    x1 = α y1
  else {
    spawn VectorScale(α, [x1, ..., x_{n/2}], [y1, ..., y_{n/2}]);
    spawn VectorScale(α, [x_{n/2+1}, ..., xn], [y_{n/2+1}, ..., yn]);
  }
}

Analyzing the performance of this simple algorithm is easy. Clearly, it per- forms Θ(n) work with critical path Θ(log n). The next lemma analyzes the amount of communication in the code. Intuitively, we expect the amount of communication to be Θ(n), since there is no data reuse in the algorithm.

Lemma 4.2.1. The amount of communication in VectorScale is bounded by F_P^{VS}(C = 3, n) = O(n + P log n).

Proof. We use Theorem 2.6.1 to show that F_P^{VS}(C, n) = O(n + CP log n). The work in VectorScale is T_1^{VS}(n) = 2T_1^{VS}(n/2) = Θ(n), and the critical-path length is T_∞^{VS}(n) = Θ(log n). The number of serial cache misses is O(n), independently of the cache size, because the work bounds the number of cache misses. Applying Theorem 2.6.1 yields F_P^{VS}(C, n) = O(n + CP log n). The result follows by substituting C = 3. □

The second algorithm that NewTriSolver uses, RectangularMatMult, is a rectangular matrix multiplier that calls CilkSUMMA. This algorithm is always called to multiply a square m-by-m matrix by an m-by-n matrix, where m ≤ n.

cilk RectangularMatMult(A, B, R)
{
  partition B and R as shown in Figure 4.2.1
  if (B and R have more columns than rows) {
    spawn RectangularMatMult(A, B1, R1)
    spawn RectangularMatMult(A, B2, R2)
  } else
    call CilkSUMMA(A, B, R, m)
}

Let us now analyze the performance of this algorithm. The next lemma analyzes the amount of work and the length of the critical path, and Lemma 4.2.3 that follows analyzes communication. To keep the analysis simple, we assume that n is a power of 2 and that m divides n.
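Since the recursion only peels off square panels, an equivalent sequential formulation is a loop over the n/m panels of B and R, with a square multiply-accumulate standing in for CilkSUMMA. The sketch below is such a reference under assumed row-major storage, with m dividing n; the accumulation sign here is +, whereas NewTriSolver uses the routine for a subtraction.

/* Square multiply-accumulate: R += A * B for m-by-m blocks.  A is stored
   with row stride m; the B and R panels are stored with row stride
   `stride` (stride >= m).  Stands in for CilkSUMMA in this sketch. */
static void square_mul_ref(const double *A, const double *B, double *R,
                           int m, int stride)
{
    for (int i = 0; i < m; i++)
        for (int k = 0; k < m; k++)
            for (int j = 0; j < m; j++)
                R[i * stride + j] += A[i * m + k] * B[k * stride + j];
}

/* R += A * B, where A is m-by-m and B, R are m-by-n (row-major), with m
   dividing n: handle one m-by-m panel of B and R at a time, mirroring
   the recursion in RectangularMatMult. */
void rectangular_matmult_ref(const double *A, const double *B, double *R,
                             int m, int n)
{
    for (int p = 0; p < n / m; p++)
        square_mul_ref(A, B + p * m, R + p * m, m, n);
}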

Lemma 4.2.2. Let A be m-by-m and let B and R be m-by-n. The amount of work in RectangularMatMult(A, B, R) is Θ(m²n), and the length of its critical path is O(log(n/m) + (m/√C) log(m√C)).

Proof. The work in RectangularMatMult is equal to the work in CilkSUMMA if n = m, and is T_1^{RMM}(m, n) = 2T_1^{RMM}(m, n/2) otherwise. Therefore,

  T_1^{RMM}(m, n) = 2T_1^{RMM}(m, n/2) = ... = 2^{log(n/m)} T_1^{CS}(m) = (n/m) Θ(m³) = Θ(m²n).

The critical path of RectangularMatMult is bounded by Θ(1) + T_∞^{CS}(C, m) if n = m and by Θ(1) + T_∞^{RMM}(C, m, n/2) otherwise. Therefore, the critical path is

  T_∞^{RMM}(C, m, n) = Θ(1) + T_∞^{RMM}(C, m, n/2)
                     = Θ(log(n/m)) + T_∞^{CS}(C, m)
                     = Θ(log(n/m)) + (m/√C) log(m√C). □



We now bound the amount of communication in RectangularMatMult. Although the bound seems complex, we actually need to use this result in only one very special case, which will allow us to simplify the expression.

Lemma 4.2.3. Let A be m-by-m and let B and R be m-by-n. The amount of communication in RectangularMatMult(A, B, R) is bounded by

  n · max(m, m²/√C) + O(CP (log(n/m) + (m/√C) log(m√C))).

Proof. The number of cache misses in a sequential execution is bounded by n/m times the number of cache misses in each call to CilkSUMMA. The number of cache misses in CilkSUMMA on matrices of size m is at most max(m², m³/√C), since even if C is large, we still have to read the arguments into the cache. Therefore, the amount of communication in a parallel execution is bounded by

  F_P^{RMM}(C, m, n) = (n/m) max(m², m³/√C) + O(CP T_∞^{RMM}(C, m, n))
                     = (n/m) max(m², m³/√C) + O(CP (log(n/m) + (m/√C) log(m√C))). □



4.3. Dynamic Cache-Size Control

The bound in Theorem 2.6.1 has two terms. In the first term, F_1(C), larger caches normally lead to a tighter communication bound, which is intuitive. The second term, CPT_∞, causes larger caches to weaken the communication bound. This happens because the cost of flushing the caches rises. Randall [21] addresses this issue in two ways. First, he suggests that good parallel algorithms have short critical paths, so the CPT_∞ term should usually be small. This argument fails in many numerical-linear-algebra algorithms, which have long critical paths but which parallelize well thanks to the amount of work they must perform. In particular, triangular solvers and triangular factorization algorithms, which have Ω(n) critical paths, parallelize well, but Randall's communication bounds for them are too loose. The second argument that Randall makes is empirical: in his experiments, the actual amount of communication that can be attributed to the CPT_∞ term is insignificant. While this empirical evidence is encouraging, we would like to have tighter provable bounds.

Our main observation is that most of the tasks along the (rather long) critical path in triangular solvers do not benefit from large caches. That is, the critical path is long, but most of the tasks along it perform little work on small amounts of data, and such tasks do not benefit from large caches. Consider square matrix multiplication: a data item is used at most n times, the dimension of the matrices, so caches of size n² already minimize the amount of communication. Larger caches do not reduce the F_1(C) term, but they do increase the CPT_∞ term.

This observation leads us to suggest a new feature in the Cilk run-time system. This feature allows us to temporarily instruct processors to use only part of their local caches to cache data.

Definition 4.3.1. The programmer can set the effective cache size when calling a Cilk procedure. This effective size is registered in the activation frame of the newly created procedure instance. When an effective cache size is specified in a procedure instance, it is inherited by all the procedures that it calls or spawns, unless a different cache size is explicitly set in the invocation of one of these descendant procedures. When a processor starts to execute a thread from an activation frame with a newly specified effective cache size, but its current effective cache size is larger, it flushes its local cache and limits its effective size to the specified size. When a child procedure returns, the cache size reverts to its parent's effective cache size, as stored in the parent's activation frame.
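The semantics of Definition 4.3.1 can be modeled with a small amount of per-worker state. The sketch below is only a model of the proposed feature, not an existing Cilk interface; the names worker_t, enter_frame and leave_frame, and the way the flush cost is charged, are assumptions of the sketch. It captures the three rules: children inherit the effective size, shrinking it forces a flush (charged as communication, as in the cost function ϕ of Section 4.4), and returning restores the parent's value.

/* Hypothetical model of dynamic cache-size control; not a real Cilk API. */
typedef struct {
    long physical_cache;    /* C: the size of the worker's local cache    */
    long effective_cache;   /* effective cache size currently in force    */
    long flushed_elements;  /* communication charged to explicit flushes  */
} worker_t;

/* Entering a frame that requests effective cache size `requested`.
   Returns the previous effective size so it can be restored on return. */
long enter_frame(worker_t *w, long requested)
{
    long saved = w->effective_cache;
    /* A flush is charged only when genuinely shrinking below both the
       current effective size and the physical cache size (the cases in
       which the cost function phi is nonzero). */
    if (requested < w->effective_cache && requested < w->physical_cache)
        w->flushed_elements += w->effective_cache;
    w->effective_cache = requested;
    return saved;
}

/* Returning from the frame: restore the parent's effective cache size,
   as stored in the parent's activation frame; growing back is free. */
void leave_frame(worker_t *w, long saved)
{
    w->effective_cache = saved;
}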

Although the ability to limit the effective cache size allows us to reduce the effect of the CPT∞ term, we need to account for the cost of the extra cache flush.

4.4. Analysis of the New Solver with Dynamic Cache-Size Control

We are now ready to add cache-size control to NewTriSolver. Our aim is simple: to set the cache size to zero during calls to VectorScale, which does not benefit from the cache, and to set the cache size in RectangularMatMult to the minimum size that ensures optimal data reuse. In particular, when multiplying an m-by-m matrix by an m-by-n matrix, a data item is used at most m times, so a cache of size C = m² ensures optimal data reuse. The complete algorithm is shown below.

cilk NewTriSolver(L, X, B)
{
  if (L is 1-by-1) {
    SetCacheSize(0)
    call VectorScale(1/l11, X, B)
  } else {
    partition L, B, and X as in Figure 4.1.2
    SetCacheSize(dim(L21)²)
    call NewTriSolver(L11, X1, B1)
    call RectangularMatMult(B2 = B2 − L21X1)
    call NewTriSolver(L22, X2, B2)
  }
}

The next lemma bounds the amount of communication that RectangularMatMult performs in the context of NewTriSolver, when it uses a cache of size at most m².

Lemma 4.4.1. The amount of communication that RectangularMatMult(A, B, R) performs with cache size at most m² (the dimension of A) is bounded by

  F_P^{RMM}(min(C, m²), m, n) ≤ O(nm²/√C + nm + m√C P log(n√C)).

Proof. In general, the amount of communication is bounded by

  n · max(m, m²/√C) + O(CP (log(n/m) + (m/√C) log(m√C))).

Since √C ≤ m, we have m/√C ≥ 1. We also have m ≤ n, so log(m√C) ≤ log(n√C). Therefore, log(n/m) + (m/√C) log(m√C) = O((m/√C) log(n√C)). The result follows from this bound and from replacing the max by the sum of its arguments. □

We can now use Lemma 4.4.1 to bound the amount of communication in NewTriSolver. Since the algorithm is essentially sequential, we do not use Theorem 2.6.1 directly, so we do not need to know the length of the critical path. (We do analyze the critical path later in this section, but the critical-path-length bound is not used to bound communication.)

The following theorem bounds the amount of communication, including the additional communication cost of the flushes performed when decreasing the cache size:

Theorem 4.4.2. The amount of communication F_P^{TS}(C, n) in NewTriSolver, using cache-size control, is bounded by O(n³/√C + n√C P log(n√C) log n).

Proof. NewTriSolver is essentially a sequential algorithm that calls parallel Cilk subroutines. Before calling a parallel subroutine, it performs a SetCacheSize, which sets the maximum cache size of the processor executing the algorithm. The other processors participating in the parallel sub-computation inherit this cache size when they steal work from it. This behavior eliminates interaction between parallel subroutines, in that it ensures that each parallel computation uses all P processors and that each parallel computation starts with a specified cache size and in a specific order. This allows us to simply sum the communication upper bounds of the parallel subcomputations in order to derive an upper bound for NewTriSolver as a whole.

Let ϕ be a cost function that accounts for the communication incurred by flushing the cache when the cache size is changed from m1 to m2, which is implemented by adding the command SetCacheSize(m2) to the NewTriSolver algorithm when the actual cache size is m1:

  ϕ(C, m1, m2) = 0 if m2 ≥ C or m2 ≥ m1, and ϕ(C, m1, m2) = m1 otherwise.

Then

  F_P^{TS}(C, n) ≤ Σ_{k=1}^{φ1} [ ϕ(C, m_{k−1}², m_k²) + F_P^{RMM}(C, m_k, n) + ϕ(C, m_k², m_{k−1}²) ] + φ2 · F_P^{VS}(C, n),

where φ1 is the number of SetCacheSize phases, m_k is the problem size at each such phase (so the effective cache size set for phase k is m_k²), and φ2 is the number of VectorScale phases. At each phase all P processors have the same cache size. We count the communication incurred by reducing the cache size, at each phase, for one processor only: all the other processors have clean caches before they steal work, so changing (reducing) their cache size does not incur an extra cache flush. Recall that a processor's cache size is not changed when it finishes the execution of a computation. The sequence of calls to RectangularMatMult forms a binary tree in which each phase has half the problem size of the previous one but occurs twice as many times. Notice that before VectorScale is called, the cache size is always set to a constant size (zero), and that VectorScale is called exactly n times. Also notice that ϕ(C, m_0², m_1²) = 0, since m_0 corresponds to the initial phase, before program execution, when each processor's cache is clean, and that ϕ(C, m_k², m_{k−1}²) = 0, since no communication is performed when the cache size is left unchanged or is increased (m_k ≤ m_{k−1}). Therefore,

  F_P^{TS}(C, n) ≤ Σ_{k=1}^{log n} 2^{k−1} [ ϕ(C, (n/2^{k−1})², (n/2^k)²) + F_P^{RMM}(C, n/2^k, n) + 0 ] + n · F_P^{VS}(C, n)
    = Σ_{k=1}^{log n} O( (n/2^{k−1})² + 2^{k−1} [ n³/(4^k √C) + n²/2^k + (n/2^k) √C P log(n√C) ] ) + O(n² + nP log n)
    = O( n² + n³/√C + n² log n + n √C P log(n√C) log n + nP log n )
    = O( n³/√C + n² log n + n √C P log(n√C) log n ).

If C < (n/log n)², then n² log n = O(n³/√C); otherwise n² log n = O(n √C P log(n√C) log n). Therefore,

  F_P^{TS}(C, n) = O(n³/√C + n √C P log(n√C) log n). □

We now bound the amount of work and parallelism in NewTriSolver.

Theorem 4.4.3. NewTriSolver performs Θ(n³) work when its arguments are all n-by-n, and the length of its critical path, when the cache-size-control feature is used, is O(n log(n√C) log n).

Proof. The work of NewTriSolver satisfies the recurrence

  T_1^{TS}(n, n) = 2T_1^{TS}(n/2, n) + T_1^{RMM}(n/2, n) = 2^{log n} n + Θ(n³) = Θ(n³),

since VectorScale performs Θ(n) work. The length of the critical path satisfies

  T_∞^{TS}(C, m, n) = log n                                            if m = 1,
  T_∞^{TS}(C, m, n) = 2T_∞^{TS}(C, m/2, n) + T_∞^{RMM}(m²/4, m/2, n)   if 1 < m ≤ 2√C,
  T_∞^{TS}(C, m, n) = 2T_∞^{TS}(C, m/2, n) + T_∞^{RMM}(C, m/2, n)      if m > 2√C.

Recall that

  T_∞^{RMM}(min(C, m²), m, n) = O( log(n/m) + (m/min(√C, m)) log(m · min(√C, m)) ).

Therefore,

  T_∞^{TS}(C, n, n) = 2T_∞^{TS}(C, n/2, n) + T_∞^{RMM}(min(C, n²/4), n/2, n)
    = O( n log n + Σ_{k=1}^{log n} 2^{k−1} [ log(2^k) + ((n/2^k)/min(√C, n/2^k)) log(n√C) ] )
    = O( n log n + (n/2) log(n√C) Σ_{k=1}^{log n} 1/min(√C, n/2^k) )
    = O( n log n + n log(n√C) log n )
    = O( n log(n√C) log n ). □

NewTriSolver performs less communication than Randall's RecursiveTriSolver [21] when √C > log(n√C). In particular, for C = O(n²/P) the new algorithm performs O(n²√P log² n) communication, a factor of n/(√P log n) less than RecursiveTriSolver. The critical path of the new algorithm, however, is slightly longer than that of RecursiveTriSolver. The new algorithm uses O(n²) space, since no external space is allocated by the algorithm, and the additional space taken by the runtime system's activation frames is O(Pn log(n√C) log n).

CHAPTER 5

LU Decomposition

In this chapter we describe a communication-efficient LU decomposition algorithm. A general linear system Ax = b can often be solved by factoring A into a product of triangular factors A = LU, where L is lower triangular and U is upper triangular. Not all matrices have such a decomposition, but many classes of matrices do, such as diagonally dominant matrices and symmetric positive-definite matrices. In this chapter we assume that A has an LU decomposition. Once the matrix has been factored, we can solve the linear system using forward and backward substitution, that is, by solving Ly = b for y and then Ux = y for x.

To achieve a high level of data reuse in a sequential factorization, the matrix must be factored in blocks or recursively (see [24, 25] and the references therein). A more conventional factorization by row or by column performs Θ(n³) cache misses when the dimension n of the matrix is twice √C or larger. Since factorization algorithms perform Θ(n³) work, it follows that data reuse in the cache in factorizations by row or by column is limited to a constant. Recursive factorizations, on the other hand, perform only Θ(n³/√C) cache misses.

Like triangular linear solvers, LU factorizations have long critical paths, Θ(n) or longer. Randall analyzed a straightforward recursive factorization in Cilk [21, section 4.1.2]. He used the formulation shown in Figure 5.0.1, which partitions A, L, and U into 2-by-2 block matrices, all square or nearly square. The algorithm begins by factoring A11 = L11U11, then solves in parallel for the off-diagonal blocks of L and U using the equations L21U11 = A21 and L11U12 = A12, updates the remaining equations, A22 = A22 − L21U12, and factors A22 = L22U22. The performance characteristics of this algorithm depend, of course, on the specific triangular solver and matrix multiplication subroutines. Randall's choices for these subroutines lead to an algorithm with overall critical path T_∞(n) = Θ(n log² n). When C = n²/P, the CPT_∞ term in Theorem 2.6.1 causes the communication bound to reach Θ(n³ log² n). This is a meaningless bound, since a Cilk algorithm never performs more communication than work. (Randall does show that this algorithm, like most matrix algorithms, can be arranged to achieve good spatial locality, but he does not show a meaningful temporal-locality bound.)

We propose a better Cilk LU-decomposition algorithm that performs only O(√P n² log³ n) communication. Our algorithm differs from Randall's in several ways.


Figure 5.0.1. A divide-and-conquer algorithm for LU decomposition: the 2-by-2 block partitioning A = LU, with A11 = L11U11, A12 = L11U12, A21 = L21U11, and A22 = L21U12 + L22U22.

The most important difference is that we rely on our communication-efficient triangular solver, which was described in the previous chapter. To allow the triangular solver to control cache sizes without interference, we perform the two triangular solves in the algorithm sequentially, not in parallel (the work of the algorithm remains Θ(n³)). It turns out that this lengthens the critical path, but not by much. We also use our communication- and space-efficient matrix multiplier, CilkSUMMA. The pseudocode of the algorithm follows.

cilk LUD(A, L, U, n)
{
  call LUD(A11, L11, U11)
  call NewTriSolver(L11, U12, A12)
  call NewTriSolver(U11, L21, A21)
  call CilkSUMMA(L21, U12, A22, n/2)
  call LUD(A22, L22, U22)
}
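As a correctness reference for the factorization that LUD computes, the following sequential, unblocked LU factorization without pivoting (valid under this chapter's assumption that A has an LU decomposition) produces an LU factorization of A, here with the convention that L has a unit diagonal. It is only a reference sketch, not the communication-efficient recursion.

/* Unblocked LU factorization without pivoting, in place.  A is n-by-n,
   row-major.  On return, U occupies the upper triangle (including the
   diagonal) and the unit-diagonal L occupies the strict lower triangle.
   Assumes every leading principal submatrix is nonsingular. */
void lu_nopivot_ref(double *A, int n)
{
    for (int k = 0; k < n; k++) {
        for (int i = k + 1; i < n; i++) {
            A[i * n + k] /= A[k * n + k];           /* column k of L    */
            for (int j = k + 1; j < n; j++)         /* Schur complement */
                A[i * n + j] -= A[i * n + k] * A[k * n + j];
        }
    }
}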

To bound the amount of communication, we do not use Theorem 2.6.1 directly (we do not use the critical-path length to bound communication, but we do analyze the critical path later in this chapter). Since no two of these procedures execute in parallel at the same time, we can count the number of cache misses by summing the cache misses incurred in each phase, that is, in each execution of NewTriSolver or CilkSUMMA.

Theorem 5.0.4. The amount of communication F_P^{LUD}(C, n) in LUD when solving a problem of size n is O(n³/√C + n √C P log² n log(n√C)).

Proof. We count the amount of communication by phases. There are two NewTriSolver phases and one CilkSUMMA phase, which lead to the following recurrence:

  F_P^{LUD}(C, n) = 2F_P^{LUD}(C, n/2) + 2F_P^{TS}(C, n/2) + F_P^{CS}(C, n/2) + Θ(1)
    = 2F_P^{LUD}(C, n/2) + O(n³/√C + n √C P log(n√C) log n) + O(n³/√C + n √C P log(n√C)) + Θ(1)
    = 2F_P^{LUD}(C, n/2) + O(n³/√C + n √C P log(n√C) log n) + Θ(1)
    = Σ_{k=1}^{log n} 2^k [ (n/2^k)³/√C + (n/2^k) √C P log((n/2^k)√C) log(n/2^k) ]
    = (n³/√C) Σ_{k=1}^{log n} 4^{−k} + n √C P Σ_{k=1}^{log n} log((n/2^k)√C) log(n/2^k)
    = O( n³/√C + n √C P log² n log(n√C) ). □

Theorem 5.0.5. The critical-path length of LUD when its arguments are all n-by-n is O(n log²(n√C) log n).

Proof. The critical path of LUD depends on the critical paths of the NewTriSolver algorithm for solving triangular systems and of CilkSUMMA for matrix multiplication, so the critical path satisfies

  T_∞^{LUD}(C, n) = 2T_∞^{LUD}(C, n/2) + 2T_∞^{TS}(C, n/2, n/2) + T_∞^{CS}(C, n/2) + Θ(1).

Recall that T_∞^{CS}(C, n) = (n/√C) log(n√C) and, from Theorem 4.4.3, that T_∞^{TS}(C, n, n) = O(n log(n√C) log n). Therefore,

  T_∞^{LUD}(C, n) = 2T_∞^{LUD}(C, n/2) + 2T_∞^{TS}(C, n/2, n/2) + T_∞^{CS}(C, n/2) + Θ(1)
    = O( 2^{log n} + Σ_{k=1}^{log n} 2^k (n/2^k) log²((n/2^k)√C) + Σ_{k=1}^{log n} 2^{k−1} (n/(2^k √C)) log((n/2^k)√C) )
    = O( n + n log²(n√C) log n + (n/√C) log(n√C) log n )
    = O( n log²(n√C) log n ). □

The amount of communication incurred by LUD for C = n²/P is O(√P n² log³ n), which for large n is much tighter than the O(n³ log² n) bound of the LU decomposition algorithm shown in [21]. The parallelism of the LUD algorithm is T_1^{LUD}/T_∞^{LUD} = Ω(n³/(n log³ n)) = Ω(n²/log³ n), which is a factor of log n smaller than that of the implementation presented in [21].

CHAPTER 6

Conclusion and Open Problems

Cilk’s dag consistency employs relaxed consistency model in order to re- alize performance gains, but unlike dag consistency most distributed shared memories take a low level view of parallel programs and can not give analyt- ical performance bound. In this thesis we used the analytical tools of Cilk to design algorithms with tighter communication bounds for existing dense matrix multiplication, triangular solver and LU factorization algorithms, than the bounds obtained by [21]. Several experimental versions, such as Cilk-3 with the implementation of BACKER coherence algorithm were developed for the runtime system of the Connection Machine Model CM5 parallel super computer. However, official distribution of Cilk version with dag-consistent shared memory was never released and therefore it was not feasible to implement the above algorithms for distributed-shared memory environment. We leave it as anopenquestionwhether it is possible to tightenthe bounds on the number of communication and memory requirements for a factor of log n than the existing bounds, without compromising the other performance parameters and whether the dynamic cache size control prop- erty is required for obtaining such low communication bounds.

Bibliography

[1] R. C. Agarwal, S. M. Balle, F. G. Gustavson, M. Joshi, and P. Palkar. A three-dimensional approach to parallel matrix multiplication. IBM Journal of Research and Development, 39:575-582, 1995.
[2] Alok Aggarwal, Ashok K. Chandra, and Marc Snir. Communication complexity of PRAMs. Theoretical Computer Science, 71:3-28, 1990.
[3] J. Berntsen. Communication efficient matrix multiplication on hypercubes. Parallel Computing, 12:335-342, 1989.
[4] Robert D. Blumofe. Executing Multithreaded Programs Efficiently. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, September 1995.
[5] Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS), pages 356-368, Santa Fe, New Mexico, November 1994.
[6] Robert D. Blumofe, Matteo Frigo, Christopher F. Joerg, Charles E. Leiserson, and Keith H. Randall. Dag-consistent distributed shared memory. In Proceedings of the 10th International Parallel Processing Symposium (IPPS), pages 132-141, Honolulu, Hawaii, April 1996.
[7] L. E. Cannon. A Cellular Computer to Implement the Kalman Filter Algorithm. PhD thesis, Montana State University, 1969.
[8] E. Dekel, D. Nassimi, and S. Sahni. Parallel matrix and graph algorithms. SIAM Journal on Computing, 10:657-673, 1981.
[9] Michel Dubois, Christoph Scheurich, and Faye Briggs. Memory access buffering in multiprocessors. In Proceedings of the 13th Annual International Symposium on Computer Architecture, pages 434-442, June 1986.
[10] G. C. Fox, S. W. Otto, and A. J. G. Hey. Matrix algorithms on a hypercube I: Matrix multiplication. Parallel Computing, 4:17-31, 1987.
[11] Guang R. Gao and Vivek Sarkar. Location consistency: Stepping beyond the barrier. In Proceedings of the 24th International Conference on Parallel Processing, Oconomowoc, Wisconsin, August 1995.
[12] Kourosh Gharachorloo, Daniel Lenoski, James Laudon, Phillip Gibbons, Anoop Gupta, and John Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. In Proceedings of the 17th Annual International Symposium on Computer Architecture (ISCA), pages 15-26, Seattle, Washington, June 1990.
[13] Anshul Gupta and Vipin Kumar. The Scalability of Matrix Multiplication Algorithms on Parallel Computers. Department of Computer Science, University of Minnesota, 1991. Available on the Internet from ftp://ftp.cs.umn.edu/users/kumar/matrix.ps.
[14] H. Gupta and P. Sadayappan. Communication efficient matrix multiplication on hypercubes. In Proceedings of the 6th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '94), pages 320-329, June 1994.
[15] Dror Irony and Sivan Toledo. Trading replication for communication in parallel distributed-memory dense solvers. Submitted to Parallel Processing Letters, July 2001.
[16] Dror Irony and Sivan Toledo. Communication lower bounds for distributed-memory matrix multiplication. Submitted to Journal of Parallel and Distributed Computing, April 2001.


[17] S. Lennart Johnsson. Minimizing the communication time for matrix multiplication on multiprocessors. Parallel Computing, 19:1235-1257, 1993.
[18] C.-T. Ho, S. L. Johnsson, and A. Edelman. Matrix multiplication on hypercubes using full bandwidth and constant storage. In Proceedings of the Sixth Distributed Memory Computing Conference, pages 447-451, 1991.
[19] Leslie Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, C-28(9):690-691, September 1979.
[20] Charles E. Leiserson and Harald Prokop. A Minicourse on Multithreaded Programming. MIT Laboratory for Computer Science, Cambridge, Massachusetts, July 1998.
[21] Keith H. Randall. Cilk: Efficient Multithreaded Computing. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1998.
[22] Cilk-5.3.1 Reference Manual. Supercomputing Technologies Group, MIT Laboratory for Computer Science, June 2000. Available on the Internet from http://supertech.lcs.mit.edu/Cilk.
[23] Robert van de Geijn and Jerrell Watts. SUMMA: Scalable universal matrix multiplication algorithm. Concurrency: Practice and Experience, 9:255-274, 1997.
[24] Sivan Toledo. Locality of reference in LU decomposition with partial pivoting. SIAM Journal on Matrix Analysis and Applications, 18:1065-1081, 1997.
[25] Sivan Toledo. A survey of out-of-core algorithms in numerical linear algebra. In External Memory Algorithms, James M. Abello and Jeffrey Scott Vitter, eds., DIMACS Series in Discrete Mathematics and Theoretical Computer Science, American Mathematical Society, 1999, pages 161-179.
[26] I-Chen Wu and H. T. Kung. Communication complexity for parallel divide-and-conquer. School of Computer Science, Carnegie Mellon University, Pittsburgh.