TEL-AVIV UNIVERSITY RAYMOND AND BEVERLY SACKLER FACULTY OF EXACT SCIENCES SCHOOL OF COMPUTER SCIENCE
Designing Communication-Efficient Matrix Algorithms in Distributed-Memory Cilk
Thesis submitted in partial fulfillment of the requirements for the M.Sc. degree of Tel-Aviv University by Eyal Baruch
The research work for this thesis has been carried out at Tel-Aviv University under the direction of Dr. Sivan Toledo
November 2001
Abstract
This thesis studies the relationship between parallelism, space, and communication in dense matrix algorithms. We study existing matrix multiplication algorithms, specifically those designed for shared-memory multiprocessor machines (SMPs). These machines are rapidly becoming commodities in the computer industry, but exploiting their computing power remains difficult. We improve algorithms that were originally designed using an algorithmic multithreaded language called Cilk (pronounced "silk"), and we present new algorithms. We analyze the algorithms under Cilk's dag-consistent memory model. We show that by dividing the matrix-multiplication process into phases that are performed in a sequence, we can obtain a lower communication bound without significantly limiting parallelism and without consuming significantly more space. Our new algorithms are inspired by distributed-memory matrix algorithms. In particular, we have developed algorithms that mimic the so-called two-dimensional and three-dimensional matrix multiplication algorithms, which are typically implemented using message-passing mechanisms, not shared-memory programming. We focus on three key matrix algorithms: matrix multiplication, the solution of triangular linear systems of equations, and the factorization of matrices into triangular factors.
Contents

Abstract
Chapter 1. Introduction
  1.1. Two New Matrix Multiplication Algorithms
  1.2. New Triangular Solver and LU Algorithms
  1.3. Outline of The Thesis
Chapter 2. Background
  2.1. Parallel Matrix Multiplication Algorithms
  2.2. The Cilk Language
  2.3. The Cilk Work Stealing Scheduler
  2.4. Cilk's Memory Consistency Model
  2.5. The BACKER Coherence Algorithm
  2.6. A Model of Multithreaded Computation
Chapter 3. Communication-Efficient Dense Matrix Multiplication in Cilk
  3.1. Space-Efficient Parallel Matrix Multiplication
  3.2. Trading Space for Communication in Parallel Matrix Multiplication
  3.3. A Comparison of Message-Passing and Cilk Matrix-Multiplication Algorithms
Chapter 4. A Communication-Efficient Triangular Solver in Cilk
  4.1. Triangular Solvers in Cilk
  4.2. Auxiliary Routines
  4.3. Dynamic Cache-Size Control
  4.4. Analysis of the New Solver with Dynamic Cache-Size Control
Chapter 5. LU Decomposition
Chapter 6. Conclusion and Open Problems
Bibliography
CHAPTER 1
Introduction
The purpose of parallel processing is to perform computations faster than can be done with a single processor, by using a number of processors concurrently. The need for faster solutions and for solving larger problems arises in a wide variety of applications. These include fluid dynamics, weather prediction, image processing, artificial intelligence, and automated manufacturing. Parallel computers can be classified according to a variety of architectural features and modes of operation. In particular, most existing machines may be broadly grouped into two classes: machines with shared-memory architectures (examples include most of the small multiprocessors on the market, such as Pentium-based servers, and several large multiprocessors, such as the SGI Origin 2000) and machines with distributed-memory architectures (examples include the IBM SP systems and clusters of workstations and servers). In a shared-memory architecture, processors communicate by reading from and writing into the shared memory. In distributed-memory architectures, processors communicate by sending messages to each other.

This thesis focuses on the efficiency of parallel programs that run under the Cilk programming environment. Cilk is a parallel programming system that offers the programmer a shared-memory abstraction on top of distributed-memory hardware. Cilk includes a compiler for its programming language, which is also referred to as Cilk, and a run-time system consisting of a scheduler and a memory-consistency protocol. (The memory-consistency protocol, which this thesis focuses on, is only part of one version of Cilk; the other versions assume shared-memory hardware.) The Cilk parallel multithreaded language has been developed in order to make high-performance parallel shared-memory programming easier.
Cilk is built around a provably efficient algorithm for scheduling the execution of fully strict multithreaded computations, based on the technique of work stealing [21][4][26][22][5]. In his PhD thesis [21], Randall developed a memory-consistency protocol for running Cilk programs on distributed-memory parallel computers and clusters. His protocol allows the algorithm designer to analyze the amount of communication in a Cilk program and the impact of this communication on the total running time of the program. The analytical tools that he developed, along with earlier tools, also allow the designer to estimate the space requirements of a program. Randall demonstrated the power of these results by implementing and analyzing several algorithms, including matrix multiplication and LU factorization algorithms. However, the communication bounds of Randall's algorithms are quite loose compared to those of known distributed-memory message-passing algorithms.
This is alarming, since extensive communication between processors may significantly slow down parallel computations even if the work and communication are equally distributed between processors. In this thesis we show that it is possible to tighten the communication bound with respect to the cache size using new Cilk algorithms that we have designed. We demonstrate new algorithms for matrix multiplication, for the solution of triangular linear systems of equations, and for the factorization of matrices into triangular factors.

By the term Cilk algorithms we essentially mean Cilk implementations of conventional matrix algorithms. Programming languages allow the programmer to specify a computation (how to compute intermediate and final results from previously-computed results). But most programming languages also force the designer to constrain the schedule of the computation. For example, a C program essentially specifies a complete ordering of the primitive operations. The compiler may change the order of computations only if it can prove that the new ordering produces equivalent results. Parallel message-passing programs fully specify the schedule of the parallel computation. Cilk programs, in contrast, declare that some computations may be performed in parallel but let a run-time scheduler decide on the exact schedule. Our analysis, as well as previous analyses of Cilk programs, essentially shows that a given program admits an efficient schedule and that Cilk's run-time scheduler is indeed likely to choose such a schedule.
1.1. Two New Matrix Multiplication Algorithms

The main contribution of this thesis is in presenting a new approach for designing algorithms implemented in Cilk that achieve lower communication bounds. In the distributed-memory application world there exists a traditional classification of matrix multiplication algorithms. So-called two-dimensional (2D) algorithms, such as those of Cannon [7], or of Ho, Johnsson and Edelman [18], use only a small amount of extra memory. Three-dimensional (3D) algorithms use more memory but perform asymptotically less communication; examples include the algorithms of Gupta and Sadayappan [14], of Berntsen [3], of Dekel, Nassimi and Sahni [8], and of Fox, Otto and Hey [10].

Cilk's shared-memory abstraction, in comparison to message-passing mechanisms, simplifies programming by allowing each procedure, no matter which processor runs it, to access the entire memory space of the program. The Cilk runtime system makes all scheduling decisions, so the programmer need not specify which processor executes which procedure, nor exactly when each procedure should be executed. These factors make it substantially easier to develop parallel programs using Cilk than with other parallel-programming environments. One may suspect that the ease of programming comes at a cost: reduced performance. We show in this thesis that this is not necessarily the case, at least theoretically (up to logarithmic factors), but that careful programming is required in order to match existing bounds. More naive implementations of algorithms, including those proposed by Randall, do indeed suffer from relatively poor theoretical performance bounds.
We give tighter communication bounds for new Cilk matrix multiplication algorithms that can be classified as 2D and 3D algorithms, and prove that it is possible to design such algorithms within the simple programming environment of Cilk almost without compromising performance. In the 3D case we have even slightly improved parallelism. The analysis shows we can implement a 2D-like algorithm for multiplying n × n matrices on a machine with P processors, each with n²/P memory, with communication bound O(√P n² log n). In comparison, Randall's notempmul algorithm, which is equivalent in the sense that it uses little space beyond the space required for the input and output, performs O(n³) communication. We also present a 3D-like algorithm with communication bound O(n³/√C + CP log(n/√C) log n), where C is the memory size of each processor, which is lower than that of existing Cilk implementations for any amount of memory per processor.
1.2. New Triangular Solver and LU Algorithms

Solving a linear system of equations is one of the most fundamental problems in numerical linear algebra. The classic Gaussian elimination scheme for solving an arbitrary linear system of equations reduces the given system to a triangular form and then generates the solution by using the standard forward and backward substitution algorithms. This essentially factors the coefficient matrix into two triangular factors, one lower triangular and the other upper triangular. The contribution of this thesis is showing that if we can dynamically control the amount of memory that processors use to cache data locally, then we can design communication-efficient algorithms for solving dense linear systems. In other words, to achieve low communication bounds, we limit the amount of data that a processor may cache during certain phases of the algorithm. Our algorithms perform asymptotically a factor of √C/log(n√C) less communication than Randall's (where √C > log(n√C)), but our algorithms have somewhat less parallelism.
1.3. Outline of The Thesis

The rest of the thesis is organized as follows. In Chapter 2 we present an overview of existing parallel linear-algebra algorithms, and we present Cilk, an algorithmic multithreaded language. Chapter 2 also introduces the tools that Randall and others have developed for analysing the performance of Cilk programs. In Chapter 3 we present new Cilk algorithms for parallel matrix multiplication and analyze our algorithms. In Chapter 4 we present a new triangular solver and demonstrate how controlling the size of the cache can reduce communication. In Chapter 5 we use the results concerning the triangular solver to design a communication-efficient LU decomposition algorithm. We present our conclusions and discuss open problems in Chapter 6.

CHAPTER 2
Background
This chapter provides background material required in the rest of the thesis. The first section describes parallel distributed-memory matrix multiplication algorithms. Our new Cilk algorithms are inspired by these algorithms. The other sections describe Cilk and the analytical tools that allow us to analyze the performance of Cilk programs. Some of the material on Cilk follows quite closely the Cilk documentation and papers.
2.1. Parallel Matrix Multiplication Algorithms

The product R = AB is defined as r_ij = Σ_{k=1}^{n} a_ik b_kj, where n is the number of columns of A and rows of B. Implementing matrix multiplication according to the definition requires n³ multiplications and n²(n−1) additions when the matrices are n-by-n. In this thesis we ignore o(n³) algorithms, such as Strassen's, which are not widely used in practice.

Matrix multiplication is a regular computation that parallelizes well. The first issue when implementing such algorithms on parallel machines is how to assign tasks to processors. We can compute all the elements of the product in parallel, so we can clearly employ n² processors for n time steps. We can actually compute all the products in a matrix multiplication computation in one step if we can use n³ processors. But to compute the n² sums of products, we need an additional log n steps for the summations. Note that with n³ processors, most of them would remain idle during most of the time, since there are only 2n³ − 1 arithmetic operations to perform during log n + 1 time steps.

Another issue when implementing parallel algorithms is the mechanism used to support communication among different processors. In a distributed-memory architecture each processor has its own local memory which it can address directly and quickly. A processor may or may not be able to address the memory of other processors directly, and in any case, accessing remote memories is slower than accessing its own local memory.

Programming a distributed-memory machine with message passing poses two challenges. The first challenge is a software-engineering one: since the memory of the computer is distributed and since the running program is composed of multiple processes, each with its own variables, we must distribute data structures among the processors.
The second and more fundamental challenge is to choose the assignment of data-structure elements and computational tasks to processors in a way that minimizes communication. Since transferring data between the memories of different processors is much slower than accessing data in a processor's own local memory, reducing data transfers usually reduces the running time of a program. Therefore, we
must analyze the amount of communication in matrix algorithms when we attempt to design efficient parallel algorithms and predict their performance.

There are two well-known implementation concepts for parallel matrix multiplication on distributed-memory machines. The first and more natural implementation concept is to lay out the matrix in blocks. The P processors are arranged in a 2-dimensional √P-by-√P grid (we assume for simplicity here that √P is an integer), the three matrices are split into √P-by-√P block matrices, and each block is stored on the corresponding processor. The grid of processors is simply a map from 2-dimensional processor indices to the usual 1-dimensional rank (processor indexing). This form of distributing a matrix is called a 2-dimensional (2D) block distribution, because we distribute both the rows and the columns of the matrix among processors. The basic idea of the algorithm is to assign processor (i, j) the computation of R_ij = Σ_{k=1}^{√P} A_ik B_kj (here R_ij is a block of the matrix R, and similarly for A and B). The algorithm consists of √P main phases. In each phase, every processor sends two messages of n²/P words (and receives two such messages as well), and performs 2(n/√P)³ floating-point operations, resulting in √P · n²/P = n²/√P communication cost and n²/P memory space per processor.

A second kind of distributed-memory matrix multiplication algorithm uses less communication but more space than the 2D algorithm. The basic idea is to arrange the processors in a p-by-p-by-p 3D grid, where p = P^{1/3}, and to split the matrices into p-by-p block matrices. The first phase of the algorithm distributes the matrix so that processor (i, j, k) stores A_ik and B_kj. The next phase computes on processor (i, j, k) the product A_ik B_kj. In the third and last phase of the algorithm the processors sum up the products A_ik B_kj to produce R_ij.
More specifically, the group of processors with indices (i, j, k) with k = 1..p sums up R_ij. The computational load in the 3D algorithm is nearly perfectly balanced: each processor multiplies two blocks and adds at most two, and some processors add none.

The 2D algorithm requires each processor to store exactly 3 submatrices of order n/√P during the algorithm and performs a total of P(n²/√P) = n²√P communication. The 3D algorithm stores at each processor 3 submatrices of order n/P^{1/3} and performs a total of P(n²/P^{2/3}) = n²P^{1/3} communication.

2.2. The Cilk Language

The philosophy behind Cilk is that a programmer should concentrate on structuring his program to expose parallelism and exploit locality, leaving the runtime system with the responsibility of scheduling the computation to run efficiently on the given platform. Cilk's runtime system takes care of details like load balancing and communication protocols. Unlike other multithreaded languages, however, Cilk is algorithmic in that the runtime system's scheduler guarantees provably efficient and predictable performance. Cilk's algorithmic multithreaded language for parallel programming generalizes the semantics of C by introducing linguistic constructs for parallel control. The basic Cilk language is simple. It consists of C with the addition of three keywords: cilk, spawn and sync, which indicate parallelism and synchronization. A Cilk program, when run on one processor, has the same semantics as the C program that results when the Cilk keywords are deleted. Cilk extends the semantics of C in a natural way for parallel execution, so a procedure may spawn subprocedures in parallel and synchronize upon their completion. A Cilk procedure definition is identified by the keyword cilk and has an argument list and body just like a C function, and its declaration can be used anywhere an ordinary C function declaration can be used.
The main procedure must be named main, as in C, but unlike C, Cilk insists that the return type of main be int. Since the main procedure must also be a Cilk procedure, it must be defined with the cilk keyword. Most of the work in a Cilk procedure is executed serially, just like C, but parallelism is created when the invocation of a Cilk procedure is immediately preceded by the keyword spawn. A spawn is the parallel analog of a C function call, and like a C function call, when a Cilk procedure is spawned, execution proceeds to the child. Unlike a C function call, however, where the parent is not resumed until after its child returns, in the case of a Cilk spawn, the parent can continue to execute in parallel with the child. Indeed, the parent can continue to spawn off children, producing a high degree of parallelism. Cilk's scheduler takes the responsibility of scheduling the spawned procedures on the processors of the parallel computer.

A Cilk procedure cannot safely use the return values (or data written to shared data structures) of the children it has spawned until it executes a sync statement. If all of its children have not completed when it executes a sync, the procedure suspends and does not resume until all of its children have completed. In Cilk, a sync waits only for the spawned children of the procedure to complete, and not for all procedures currently executing. When all its children return, execution of the procedure resumes at the point immediately following the sync statement. As an aid to programmers, Cilk inserts an implicit sync before every return, if it is not present already. As a consequence, a procedure never terminates while it has outstanding children. The program in Figure 2.2.1 demonstrates how Cilk works. The figure shows a Cilk procedure that computes the n-th Fibonacci number.
In Cilk's terminology, a thread is a maximal sequence of instructions that ends with a spawn, sync or return (either explicit or implicit) statement (the evaluation of arguments to these statements is considered part of the thread preceding the statement). Therefore, we can visualize a Cilk program execution as a directed acyclic graph, or dag, in which vertices are threads (instructions) and edges denote ordering constraints imposed by control statements. A Cilk program execution consists of a collection of procedures, each of which is broken into a sequence of non-blocking threads. The first thread that executes when a procedure is called is the procedure's initial thread, and the subsequent threads are successor threads. At runtime, the binary spawn relation causes procedure instances to be structured as a rooted tree, and the dependencies among their threads form a dag embedded in this spawn tree. For example, the execution of fib(4) from the program in Figure 2.2.1 generates the dag shown in Figure 2.2.2. A correct execution of a Cilk program must obey all the dependencies in the dag, since a thread cannot be executed until all the threads on which it depends have completed. Note that the use of the term thread here is different from the common use in programming environments such as Win32 or POSIX threads, where the same term refers to a process-like object that shares an address space with other threads and which competes with other threads and processes for CPU time.

cilk int fib(int n)
{
    if (n < 2)
        return n;
    else {
        int x, y;
        x = spawn fib(n - 1);
        y = spawn fib(n - 2);
        sync;
        return (x + y);
    }
}

Figure 2.2.1. A simple Cilk procedure to compute the nth Fibonacci number in parallel (using an exponential-work method, while logarithmic-time methods are known). Deleting the cilk, spawn, and sync keywords would reduce the procedure to a valid and correct C procedure.

Figure 2.2.2. A dag of threads representing the multithreaded computation of fib(4) from Figure 2.2.1. Each procedure, shown as a rounded rectangle, is broken into sequences of threads, shown as circles. A downward edge indicates the spawning of a subprocedure. A horizontal edge indicates the continuation to a successor thread. An upward edge indicates the returning of a value to a parent procedure. All three types of edges are dependencies which constrain the order in which threads may be scheduled. The figure is from [22].
2.3. The Cilk Work Stealing Scheduler

The spawn and sync keywords specify logical parallelism, as opposed to actual parallelism. That is, these keywords indicate which code may possibly execute in parallel, but what actually runs in parallel is determined by the scheduler, which maps dynamically unfolding computations onto the available processors. To execute a Cilk program correctly, Cilk's underlying scheduler must obey all the dependencies in the dag, since a thread cannot be executed until all the threads on which it depends have completed. These dependencies form a partial order, permitting many ways of scheduling the threads in the dag. A scheduling algorithm must ensure that enough threads remain concurrently active to keep the processors busy. Simultaneously, it should ensure that the number of concurrently active threads remains within reasonable limits so that memory requirements can be bounded. Moreover, the scheduler should also try to maintain related threads on the same processor, if possible, so that communication between them can be minimized. Needless to say, achieving all these goals simultaneously can be difficult.

Two scheduling paradigms address the problem of scheduling multithreaded computations: work sharing and work stealing. In work sharing, whenever a processor generates new threads, the scheduler attempts to migrate some of them to other processors in hopes of distributing the work to underutilized processors. In work stealing, however, underutilized processors take the initiative: they attempt to steal threads from other processors. Intuitively, the migration of threads occurs less frequently with work stealing than with work sharing, since if all processors have plenty of work to do, no threads are migrated by a work-stealing scheduler, but threads are always migrated by a work-sharing scheduler. Cilk's work-stealing scheduler executes any Cilk computation in nearly optimal time [21][4][5].
During the execution of a Cilk program, when a processor runs out of work, it asks another processor, chosen at random, for work to do. Locally, a processor executes procedures in ordinary serial order (just like C), exploring the spawn tree in a depth-first manner. When a child procedure is spawned, the processor saves the local variables of the parent (its activation frame) on the bottom of a stack, which is a ready deque (a doubly-ended queue from which procedures can be added or deleted), and commences work on the child (the convention is that the stack grows downward, and items are pushed and popped from the bottom of the stack). When the child returns, the bottom of the stack is popped (just like C) and the parent resumes. When another processor requests work, however, work is stolen from the top of the stack, that is, from the end opposite to the one normally used by the worker.
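The deque discipline described above can be sketched in plain C as follows (a toy illustration; the names Deque, push_bottom, pop_bottom and steal_top are ours and do not correspond to Cilk's internal interfaces):

```c
#include <assert.h>

/* A minimal ready deque: the owning worker pushes and pops frames at the
   bottom, while a thief steals from the top -- the opposite end. */
enum { DEQUE_CAP = 64 };

typedef struct {
    int frames[DEQUE_CAP];  /* stand-ins for activation frames */
    int top, bottom;        /* live frames occupy [top, bottom) */
} Deque;

static void deque_init(Deque *d) { d->top = d->bottom = 0; }

static void push_bottom(Deque *d, int frame)   /* worker: spawn a child */
{
    d->frames[d->bottom++] = frame;
}

static int pop_bottom(Deque *d)                /* worker: child returns */
{
    return d->bottom > d->top ? d->frames[--d->bottom] : -1;
}

static int steal_top(Deque *d)                 /* thief: take oldest work */
{
    return d->bottom > d->top ? d->frames[d->top++] : -1;
}
```

Because the worker operates at the bottom while thieves take from the top, a steal removes the oldest frame, which typically represents the largest piece of remaining work; this is part of why steals are infrequent when all processors are busy.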
2.4. Cilk's Memory Consistency Model

Cilk's shared-memory abstraction greatly enhances the programmability of a multiprocessor. In comparison to a message-passing architecture, the ability of each processor to access the entire memory simplifies programming by reducing the need for explicit data partitioning and data movement. The single address space also provides better support for parallelizing compilers and standard operating systems. Since shared-memory systems allow multiple processors to simultaneously read and write the same memory locations, programmers require a conceptual model for the semantics of memory operations in order to use the shared memory correctly. This model is typically referred to as a memory consistency model or memory model. To maintain the programmability of shared-memory systems, such a model should be intuitive and simple to use.

The intuitive memory model assumed by most programmers requires the execution of a parallel program on a multiprocessor to appear as some interleaving of the execution of the parallel processes on a uniprocessor. This intuitive model was formally defined by Lamport as sequential consistency [19]:
Definition 2.4.1. A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.
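Definition 2.4.1 can be illustrated with the standard two-processor example. The following C sketch (our own construction, not from the Cilk literature) enumerates every sequentially consistent interleaving of P1: x = 1; r1 = y and P2: y = 1; r2 = x, starting from x = y = 0, and records which outcomes (r1, r2) occur; under sequential consistency the outcome r1 = r2 = 0 can never occur, because whichever read executes last must observe the other processor's earlier write:

```c
#include <assert.h>

/* seen[r1][r2] = 1 if some sequentially consistent interleaving of
   P1: x = 1; r1 = y;   and   P2: y = 1; r2 = x;   yields (r1, r2). */
static int seen[2][2];

/* i1 and i2 count how many instructions of P1 and P2 have executed. */
static void run(int i1, int i2, int x, int y, int r1, int r2)
{
    if (i1 == 2 && i2 == 2) { seen[r1][r2] = 1; return; }
    if (i1 < 2) {                    /* P1 executes its next instruction */
        if (i1 == 0) run(1, i2, 1, y, r1, r2);   /* x = 1  */
        else         run(2, i2, x, y, y,  r2);   /* r1 = y */
    }
    if (i2 < 2) {                    /* P2 executes its next instruction */
        if (i2 == 0) run(i1, 1, x, 1, r1, r2);   /* y = 1  */
        else         run(i1, 2, x, y, r1, x);    /* r2 = x */
    }
}

static void enumerate(void) { run(0, 0, 0, 0, 0, 0); }
```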
Sequential consistency maintains the memory behavior that is intuitively expected by most programmers. Each processor is required to issue its memory operations in program order. Operations are serviced by memory one at a time, and thus they appear to execute atomically with respect to other memory operations. The memory services operations from different processors based on an arbitrary but fair global schedule. This leads to an arbitrary interleaving of operations from different processors into a single sequential order. The fairness criterion guarantees eventual completion of all processor requests. The above requirements lead to a total order on all memory operations that is consistent with the program order dictated by each processor's program.

Unfortunately, architects of shared-memory systems for parallel computers who have attempted to support Lamport's strong model of sequential consistency have generally found that Lamport's model is difficult to implement efficiently, and hence relaxed models of shared-memory consistency have been developed [9][11][12]. These models adopt weaker semantics to allow a faster implementation. By and large, all of these consistency models have had one thing in common: they are processor-centric in the sense that they define consistency in terms of actions by physical processors. In contrast, Cilk's dag consistency is defined on the abstract computation dag of a Cilk program and hence is computation-centric. To define a computation-centric memory model like dag consistency, it suffices to define what values are allowed to be returned by a read.

We now define dag consistency in terms of the computation. A computation is represented by its graph G = (V, E), where V is a set of vertices representing threads of the computation, and E is a set of edges representing ordering constraints on the threads.
For two threads u and v, we say that u (strictly) precedes v, which we write u ≺ v, if u ≠ v and there is a directed path in G from u to v.
Definition 2.4.2. The shared memory M of a computation G = (V, E) is dag consistent if for every object x in the shared memory, there exists an observer function f_x : V → V such that the following conditions hold:
1. For all instructions u ∈ V, the instruction f_x(u) writes to x.
2. If an instruction u writes to x, then we have f_x(u) = u.
3. If an instruction u reads x, it receives the value written by f_x(u).
4. For all instructions u ∈ V, we have u ⊀ f_x(u).
5. For each triple u, v, w of instructions such that u ≺ v ≺ w, if f_x(v) ≠ u holds, then we have f_x(w) ≠ u.
Informally, the observer function f_x(u) represents the viewpoint of instruction u on the content of object x. For deterministic programs, this definition implies the intuitive notion that a read can see a write in the dag consistency model only if there is some serial execution order consistent with the dag in which the read sees the write. Unlike sequential consistency, but similar to certain processor-centric models [11][12], dag consistency allows different reads to return values that are based on different serial orders, but the values returned must respect the dependencies in the dag. Thus, the writes performed by a thread are seen by its successors, but threads that are incomparable in the dag may or may not see each other's writes.

The primary motivation for any weak consistency model, including dag consistency, is performance. In addition, however, a memory model must be understandable by a programmer. In the dag consistency model, if the programmer wishes to ensure that a read sees a write, he must ensure that there is a path in the computation dag from the write to the read. In Cilk, the programmer can ensure that such a path exists by placing a sync statement between the write and the read in his program.
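To make the definition concrete, the following C sketch (a toy of our own construction) checks conditions 1, 2, 4 and 5 of Definition 2.4.2 on the smallest interesting dag, a chain u ≺ v ≺ w in which u and v write x and w reads x. Condition 5 rules out any observer function in which w still sees u's write, because v's later write masks it on the only path to w:

```c
#include <assert.h>

/* Threads of the chain u < v < w (in dag order).  u and v write x; w reads. */
enum { U, V, W, N };

static const int writes_x[N] = {1, 1, 0};
static int precedes(int a, int b) { return a < b; }  /* chain order */

/* Return 1 if the observer assignment f[t] satisfies conditions 1, 2, 4
   and 5 of Definition 2.4.2 on this chain (condition 3 concerns the
   values read, which this toy model does not track). */
static int valid_observer(const int f[N])
{
    for (int t = 0; t < N; t++) {
        if (!writes_x[f[t]]) return 0;            /* 1: f_x(t) writes x    */
        if (writes_x[t] && f[t] != t) return 0;   /* 2: a writer sees self */
        if (precedes(t, f[t])) return 0;          /* 4: never a later write */
    }
    /* 5: once f_x(v) != u for some u < v, no w after v may observe u. */
    for (int u = 0; u < N; u++)
        for (int v = 0; v < N; v++)
            for (int w = 0; w < N; w++)
                if (precedes(u, v) && precedes(v, w) &&
                    f[v] != u && f[w] == u)
                    return 0;
    return 1;
}
```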
2.5. The BACKER Coherence Algorithm

Cilk maintains dag consistency using a coherence protocol called BACKER¹ [21]. In this protocol, versions of shared-memory objects can reside simultaneously in any of the processors' local caches or in the global backing store. Each processor's cache contains objects recently used by the threads that have executed on that processor, and the backing store provides a global storage location for each object. In order for a thread executing on a processor to read or write an object, the object must be in the processor's cache. Each object in the cache has a dirty bit to record whether the object has been modified since it was brought into the cache.

Three basic actions are used by BACKER to manipulate shared-memory objects: fetch, reconcile and flush. A fetch copies an object from the backing store to a processor cache and marks the cached object as clean. A reconcile copies a dirty object from a processor cache to the backing store and marks the cached object as clean. Finally, a flush removes a clean object from a processor cache. Unlike implementations of other models of consistency, all three actions are bilateral between a processor's cache and the backing store; other processors' caches are never involved.
1The BACKER coherence algorithm was designed and implemented as part of Keith Randall's PhD thesis, but it is not included in the Cilk versions that are actively maintained.
The BACKER coherence algorithm operates as follows. When the program performs a read or write action on an object, the action is performed directly on a cached copy of the object. If the object is not in the cache, it is fetched from the backing store before the action is performed. If the action is a write, the dirty bit of the object is set. To make space in the cache for a new object, a clean object can be removed by flushing it from the cache. To remove a dirty object, it is reconciled and then flushed. Besides performing these basic operations in response to user reads and writes, BACKER performs additional reconciles and flushes to enforce dag consistency. For each edge i → j in the computation dag, if threads i and j are scheduled on different processors, say P and Q, then BACKER reconciles all of P's cached objects after P executes i but before P enables j, and it reconciles and flushes all of Q's cached objects before Q executes j. The key reason BACKER works is that it is always safe, at any point during the execution, for a processor P to reconcile an object or to flush a clean object. BACKER uses this safety property to guarantee dag consistency even when there is communication: after Q reconciles and flushes its entire cache before executing j, the state of Q's cache (empty) is the same as it would have been had j executed after i on processor P with a reconcile and flush occurring between them. Consequently, BACKER ensures dag consistency.
2.6. A Model of Multithreaded Computation

Cilk supports an algorithmic model of multithreaded computation, which equips us with an algorithmic foundation for predicting the performance of Cilk programs. A multithreaded computation is composed of a set of threads, each of which is a sequential ordering of unit-size instructions. A processor takes one unit of time to execute one instruction. The instructions of a thread must execute in sequential order from the first instruction to the last. From an abstract theoretical perspective, there are two fundamental limits on how fast a Cilk program can run [21][4][5]. Let us denote by TP the execution time of a given computation on P processors. The work of the computation, denoted T1, is the total number of instructions in the dag, which corresponds to the amount of time required by a one-processor execution (ignoring cache misses and other complications). Notice that with T1 work and P processors, the lower bound TP ≥ T1/P must hold, since in one step a P-processor computer can do at most P work (this, again, ignores cache misses). The second limit is based on the program's critical-path length, denoted T∞, which is the maximum number of instructions on any directed path in the dag. It corresponds to the amount of time required by an infinite-processor execution, or equivalently, the time needed to execute threads along the longest path of dependency. The corresponding lower bound is simply TP ≥ T∞, since a P-processor computer can do no more work in one step than an infinite-processor computer. The work T1 and the critical-path length T∞ are not intended to denote the execution time on any real single-processor or infinite-processor machine.
These quantities are abstractions of a computation and are independent of any real machine characteristics such as communication latency. We can think of T1 and T∞ as execution times on an ideal machine with no scheduling overhead and with a unit-access-time memory system. Nevertheless, Cilk's work-stealing scheduler executes a Cilk computation that does not use locks on P processors in expected time [21]
TP = T1/P + O(T∞),
which is asymptotically optimal, since T1/P and T∞ are both lower bounds. Empirically, the constant factor hidden by the big O is often close to 1 or 2 [22], and the formula TP = T1/P + T∞ provides a good approximation of the running time on shared-memory multiprocessors. This performance model holds for Cilk programs that do not use locks; if locks are used, Cilk cannot guarantee anything [22]. This simple performance model allows the programmer to reason about the performance of his Cilk program by examining two simple metrics: work and critical path. The speedup of a computation on P processors is the ratio T1/TP, which indicates how many times faster the P-processor execution is than a one-processor execution. If T1/TP = Θ(P), then we say that the P-processor execution exhibits linear speedup. The maximum possible speedup is T1/T∞, which is also called the parallelism of the computation, because it represents the average amount of work that can be done in parallel in each time step along the critical path. In order to model the performance of Cilk programs that use dag-consistent shared memory, we observe that running times will vary as a function of the size C of the cache that each processor uses. Therefore, we must introduce metrics that account for this dependence. We define a new work measure, the total work, that accounts for the cost of cache misses in the serial execution. Let Γ be the time to service a cache miss in the serial execution. We assign weights to the instructions of the dag: each instruction that generates a cache miss in the one-processor execution, with the standard depth-first serial execution order and with a cache of size C, has weight Γ + 1, and all other instructions have weight 1. The total work, denoted T1(C), is the total weight of all instructions in the dag, which corresponds to the serial execution time if cache misses take Γ units of time to be serviced.
The work term T1, defined earlier, corresponds to the serial execution time if all cache misses take zero time to be serviced. Unlike T1, T1(C) depends on the serial execution order of the computation. It further differs from T1 in that T1(C)/P is not a lower bound on the execution time on P processors. The ratio T1(C)/T∞ is defined to be the average parallelism of the computation. We can bound the amount of space used by a parallel Cilk execution in terms of its serial space. Denote by SP the space required for a P-processor execution, so that S1 is the space required for an execution on one processor. Cilk guarantees [21] that for a P-processor execution we have SP ≤ S1P. This bound implies that if a computation uses a certain amount of memory on one processor, it will use no more space per processor on average when it runs in parallel. The amount of interprocessor communication can be related to the number of cache misses that a Cilk computation incurs when it runs on P processors using the implementation of the BACKER coherence algorithm with cache size C. Let us denote by FP(C) the number of cache misses incurred by a P-processor Cilk computation. Randall [21] shows that FP(C) ≤ F1(C) + 2Cs, where s is the total number of steals executed by the scheduler. The 2Cs term represents cache misses due to warming up the processors' caches. Randall has performed empirical measurements indicating that the warm-up costs are much smaller in practice than the theoretical bound. Randall shows that this bound can be further tightened if we assume that the accesses to the backing store behave as if they were random and independent. Under this assumption, the following theorem predicts the performance of a distributed-memory Cilk program [21]:

Theorem 2.6.1.
Consider any Cilk program executed on P processors, each with an LRU cache of C elements, using Cilk's work-stealing scheduler in conjunction with the BACKER coherence algorithm. Assume that accesses to the backing store are random and independent. Suppose the computation has F1(C) serial cache misses and critical-path length T∞. Then, for any ε > 0, the number of cache misses is at most F1(C) + O(CPT∞ + CP log(1/ε)) with probability at least 1 − ε. Moreover, the expected number of cache misses is at most F1(C) + O(CPT∞).

The standard assumption in [21] is that the backing store consists of half the physical memory of each processor, and that the other half is used as a cache. In other words, C is roughly a 1/(2P) fraction of the total memory of the machine. It is, therefore, convenient to assess the communication requirements of algorithms under this assumption, although C can, of course, be smaller. Finally, from here on we focus on the expected performance measures (communication, time, cache misses, and space).

CHAPTER 3
Communication-Efficient Dense Matrix Multiplication in Cilk
Dense matrix multiplication is used in a variety of applications and is one of the core components of many scientific computations. The standard way of multiplying two matrices of size n × n requires O(n³) floating-point operations on a sequential machine. Since dense matrix multiplication is computationally expensive, the development of efficient parallel algorithms is of great interest. This chapter discusses two types of parallel algorithms for multiplying n × n dense matrices A and B to yield the product matrix R = A × B using Cilk programs. We analyze the communication cost and space requirements of specific Cilk algorithms and present new algorithms that are efficient with respect to both communication and space. Specifically, we prove upper bounds on the amount of communication on SMP machines with P processors and shared-memory caches of size C when dag consistency is maintained by the BACKER coherence algorithm, under the assumption that accesses to the backing store are random and independent.
3.1. Space-Efficient Parallel Matrix Multiplication

Previous papers on Cilk [21][20][6] presented two divide-and-conquer algorithms for multiplying n-by-n matrices. The first algorithm uses Θ(n²) memory and has Θ(n) critical-path length (as stated above, we only focus on conventional Θ(n³)-work algorithms). In [21, page 56], this algorithm is called notempmul, which is the name we will use to refer to it. This algorithm divides the two input matrices into four n/2-by-n/2 blocks, or submatrices, computes recursively the first four products and stores the results in the output matrix, then computes recursively, in parallel, the last four products and concurrently adds the new results to the output matrix. The notempmul algorithm is shown in Figure 3.1.1. In essence, the algorithm uses the following formulation:

    [ R11  R12 ]   [ A11*B11  A11*B12 ]
    [ R21  R22 ] = [ A21*B11  A21*B12 ]

    [ R11  R12 ]    [ A12*B21  A12*B22 ]
    [ R21  R22 ] += [ A22*B21  A22*B22 ] .

Under the assumption that C is 1/(2P) of the total memory of the machine, and that the backing store's size is Θ(n²) (so C = n²/P), the communication upper bound for notempmul that Theorem 2.6.1 implies is O(n³), which is much more than the Θ(n²√P) bound for 2D message-passing algorithms.
cilk void notempmul(long nb, block *A, block *B, block *R)
{
  if (nb == 1)
    multiplyadd_block(A, B, R);
  else {
    block *C, *D, *E, *F, *G, *H, *I, *J;
    block *CGDI, *CHDJ, *EGFI, *EHFJ;

    /* get pointers to input submatrices */
    partition(nb, A, &C, &D, &E, &F);
    partition(nb, B, &G, &H, &I, &J);

    /* get pointers to result submatrices */
    partition(nb, R, &CGDI, &CHDJ, &EGFI, &EHFJ);

    /* solve subproblems recursively */
    spawn notempmul(nb/2, C, G, CGDI);
    spawn notempmul(nb/2, C, H, CHDJ);
    spawn notempmul(nb/2, E, H, EHFJ);
    spawn notempmul(nb/2, E, G, EGFI);
    sync;
    spawn notempmul(nb/2, D, I, CGDI);
    spawn notempmul(nb/2, D, J, CHDJ);
    spawn notempmul(nb/2, F, J, EHFJ);
    spawn notempmul(nb/2, F, I, EGFI);
    sync;
  }
  return;
}
Figure 3.1.1. Cilk code for the notempmul algorithm. It is a no-temporary version of recursive blocked matrix multiplication.
We suggest another way to perform n × n matrix multiplication. Our new algorithm, called CilkSUMMA, is inspired by the SUMMA matrix multiplication algorithm [23]. The algorithm divides the multiplication process into phases that are performed one after the other, not concurrently; all the parallelism is within phases. Figure 3.1.2 illustrates the algorithm. For some constant r, the algorithm performs a sequence of n/r rank-r updates, as the following top-level code shows:

{
  R = 0
  for k = 1 .. n/r {
    spawn RankRUpdate(Ak, Bk, R, n, r)
    sync
  }
}

Figure 3.1.2. The CilkSUMMA algorithm for parallel matrix multiplication. CilkSUMMA divides A into n/r vertical blocks of size n × r and B into n/r horizontal blocks of size r × n. The corresponding blocks of A and B are iteratively multiplied to produce an n × n product matrix; each such matrix is accumulated into the matrix R with the previous iterations' result values. Finally, R holds the product of A and B.

Figure 3.1.3. Recursive rank-r update to an n-by-n matrix R: A is split into A1 and A2, B into B1 and B2, and the four block products AiBj are computed recursively.

The RankRUpdate procedure recursively divides an n × r block of A into two blocks of size (n/2) × r and an r × n block of B into two blocks of size r × (n/2), and multiplies the corresponding blocks in parallel, recursively. If n = 1, then the multiplication reduces to a dot product, for which we give the code later. The algorithm is shown in Figure 3.1.3, and its code is given below:

cilk RankRUpdate(A, B, R, n, r)
{
  if (n == 1)
    R += DotProd(A, B, r)
  else {
    spawn RankRUpdate(A1, B1, R11, n/2, r)
    spawn RankRUpdate(A1, B2, R12, n/2, r)
    spawn RankRUpdate(A2, B1, R21, n/2, r)
    spawn RankRUpdate(A2, B2, R22, n/2, r)
  }
}

The recursive Cilk procedure DotProd, shown below, is executed at the bottom of the rank-r recursion. If r = 1, the code returns the scalar product of the inputs.
Otherwise, the code splits each of the r-length input vectors a and b into two subvectors of r/2 elements, multiplies the two halves recursively, and returns the sum of the two partial dot products. Clearly, the code performs Θ(r) work and has critical path Θ(log r). The details are as follows:

cilk DotProd(a, b, r)
{
  if (r == 1)
    return a1 · b1
  else {
    x = spawn DotProd(a[1,...,r/2], b[1,...,r/2], r/2)
    y = spawn DotProd(a[r/2+1,...,r], b[r/2+1,...,r], r/2)
    sync
    return (x + y)
  }
}

The analysis of communication cost is organized as follows. First, we prove a lemma describing the amount of communication performed by RankRUpdate. Next, we obtain a bound on the amount of communication in CilkSUMMA.

Lemma 3.1.1. The amount of communication FP^RRU(C, n) in RankRUpdate, incurred by BACKER running on P processors, each with a shared-memory cache of C elements, and with block size r = √(C/3), when solving a problem of size n, is O(n² + CP log(n√C)).

Proof. To find the number of RankRUpdate cache misses, we use Theorem 2.6.1. The work and critical path of RankRUpdate can be computed using recurrences; we also bound the number of cache misses incurred when RankRUpdate is executed on a single processor, and then substitute these quantities into Theorem 2.6.1. The work T1^RRU(n, r) satisfies T1^RRU(1, r) = T1^DOTPROD(r) = Θ(r) and T1^RRU(n, r) = 4 T1^RRU(n/2, r) for n > 1. Therefore, T1^RRU(n, r) = Θ(n²r). To derive a recurrence for the critical-path length T∞^RRU(n, r), we observe that with an infinite number of processors the 4 block multiplications can execute in parallel; therefore T∞^RRU(n, r) = T∞^RRU(n/2, r) + Θ(1) for n > 1. For n = 1, T∞^RRU(1, r) = T∞^DOTPROD(r) + Θ(1) = Θ(log r). Consequently, the critical path satisfies T∞^RRU(n, r) = Θ(log n + log r).
Next, we bound F1^RRU(C, n), the number of cache misses that occur when the RankRUpdate algorithm is used to solve a problem of size n with the standard, depth-first serial execution order on a single processor with an LRU cache of size C. At each node of the computational tree of RankRUpdate, the k² elements of R in a k × k block are updated using the results of k² dot products of size r. To perform such an operation entirely in the cache, the cache must store the k² elements of R, kr elements of A, and kr elements of B. When k ≤ √(C/3), the three submatrices fit into the cache. Let k = r = √(C/3). Clearly, the (n/k)² updates to k-by-k blocks can be performed entirely in the cache, so the total number of cache misses is at most F1^RRU(C, n) ≤ (n/k)² × [k² + 2kr] = Θ(n²). By Theorem 2.6.1, the amount of communication that RankRUpdate performs, when run on P processors using Cilk's scheduler and the BACKER coherence algorithm, is

    FP^RRU(C, n) = F1^RRU(C, n) + O(C P T∞^RRU).

Since the critical-path length of RankRUpdate is Θ(log nr), the total number of cache misses is O(n² + CP log(n√C)).

Next, we analyze the amount of communication in CilkSUMMA.

Theorem 3.1.2. The number of CilkSUMMA cache misses FP^CS(C, n), incurred by BACKER running on P processors, each with a shared-memory cache of C elements, and with block size r = √(C/3), when solving a problem of size n, is O((n/√C)(n² + CP log(n√C))). In addition, the total amount SP^CS(n) of space taken by the algorithm is O(n² + P log(n√C)).

Proof. Notice that the CilkSUMMA algorithm only performs sequential calls to the parallel procedure RankRUpdate. The sync statement at the end of each iteration guarantees that the procedure suspends and does not resume until all the RankRUpdate children have completed. Each such iteration is a phase in which only one call to RankRUpdate is invoked, so the only parallel execution is of the parent procedure and its own children.
Thus, we can bound the total number of cache misses by the sum of the cache misses incurred during each of the n/r phases:

    FP^CS(C, n) ≤ Σ_{k=1}^{n/r} FP^RRU(C, n).

By Lemma 3.1.1, we have FP^RRU(C, n) = O(n² + CP log(n√C)), yielding

    FP^CS(C, n) = O((n/r)(n² + CP log(nr)))
                = O((n/√C)(n² + CP log(n√C))).

The total amount of space used by the algorithm is the space allocated for the product matrix plus the space for the activation frames allocated by the runtime system. The Cilk runtime system uses activation frames to represent procedure instances. Each such representation is of constant size, including the program counter and all live, dirty variables. The frame is pushed onto a deque on the heap and is deallocated at the end of the procedure call. Therefore, in the worst case the total space used for activation frames is the longest possible chain of procedure instances on each processor, which is the critical-path length, resulting in O(P log nr) total space allocated at any time. Thus SP^CS(n) = Θ(n²) + O(P log nr) = O(n² + P log(n√C)).

We have bounded the work and critical path of CilkSUMMA. Using these values we can compute the total work and estimate the total running time TP^CS(C, n). The computational work of CilkSUMMA is T1^CS(n) = Θ(n³), so the total work is T1^CS(C, n) = T1^CS(n) + Γ F1^CS(C, n) = Θ(n³), assuming Γ is a constant. The critical-path length is T∞^CS(C, n) = (n/√C) log(n√C), so using the performance model in [21], the total expected time for CilkSUMMA on P processors is

    TP(C, n) = O(T1(C, n)/P + Γ C T∞(n)) = O(n³/P + Γ √C n log(n√C)).

Consequently, if P = O(n² / (Γ √C log(n√C))), the algorithm runs in O(n³/P) time, obtaining linear speedup. CilkSUMMA uses the processor cache more effectively than notempmul whenever √C > log(n√C), which holds asymptotically for all C = Ω(n).
If we consider the size of the cache to be C = n²/P, which is the memory size of each of the P processors in a distributed-memory machine when solving n × n matrix multiplication with 2D algorithms, then FP^CS(C, n) = Θ(√P n² log n) and SP^CS(n) = O(n²). These results are comparable to the communication and space requirements of 2D distributed-memory matrix multiplication algorithms, and they are significantly better than the Θ(n³) communication bound of notempmul. We have also improved the average parallelism of the algorithm over notempmul for r = Ω(n), since T1(C, n)/T∞(n) = n³/((n/r) log(nr)) > n².

3.2. Trading Space for Communication in Parallel Matrix Multiplication

The matrix-multiplication algorithm shown in Figure 3.2.1 is perhaps the most natural matrix-multiplication algorithm in Cilk. The code is from [21, page 55], but the same algorithm also appears in [20]. The motivation for this algorithm, called blockedmul, is to increase parallelism at the expense of using more memory. Its critical path is only Θ(log² n), as opposed to Θ(n) in notempmul, but it uses Θ(n²P^(1/3)) space [21, page 148], as opposed to only Θ(n²) in notempmul. In the message-passing literature on matrix algorithms, space is traded for a reduction in communication, not for parallelism. So-called 3D matrix multiplication algorithms [1, 2, 3, 13, 14, 17] replicate the input matrices P^(1/3) times in order to reduce the total amount of communication from Θ(n²P^(1/2)) in 2D algorithms down to Θ(n²P^(1/3)). Irony and Toledo have
shown that the additional memory is necessary for reducing communication, and that the tradeoff is asymptotically tight [16].

cilk void blockedmul(long nb, block *A, block *B, block *R)
{
  if (nb == 1)
    multiply_block(A, B, R);
  else {
    block *C, *D, *E, *F, *G, *H, *I, *J;
    block *CG, *CH, *EG, *EH, *DI, *DJ, *FI, *FJ;
    block tmp[nb*nb];

    /* get pointers to input submatrices */
    partition(nb, A, &C, &D, &E, &F);
    partition(nb, B, &G, &H, &I, &J);

    /* get pointers to result submatrices */
    partition(nb, R, &CG, &CH, &EG, &EH);
    partition(nb, tmp, &DI, &DJ, &FI, &FJ);

    /* solve subproblems recursively */
    spawn blockedmul(nb/2, C, G, CG);
    spawn blockedmul(nb/2, C, H, CH);
    spawn blockedmul(nb/2, E, H, EH);
    spawn blockedmul(nb/2, E, G, EG);
    spawn blockedmul(nb/2, D, I, DI);
    spawn blockedmul(nb/2, D, J, DJ);
    spawn blockedmul(nb/2, F, J, FJ);
    spawn blockedmul(nb/2, F, I, FI);
    sync;

    /* add results together into R */
    spawn matrixadd(nb, tmp, R);
    sync;
  }
  return;
}

Figure 3.2.1. A Cilk code for recursive blocked matrix multiplication. It uses divide-and-conquer to solve one n × n multiplication problem by splitting it into 8 (n/2) × (n/2) multiplication subproblems and combining the results with one n × n addition. A temporary matrix of size n × n is allocated at each divide step. A serial matrix multiplication routine is called to do the base case.

Substituting C = n²/P^(2/3) in Randall's communication analysis for blockedmul, we find that with that much memory the algorithm performs O(n²P^(1/3) log² n) communication. That is, if we provide the program with caches large enough to replicate the input Θ(P^(1/3)) times, as in 3D message-passing algorithms, the amount of communication that it performs is at most a factor of Θ(log² n) more than message-passing 3D algorithms. In other words, blockedmul is a Cilk analog of 3D algorithms.
We propose an algorithm that is slightly more communication-efficient than blockedmul. As with our previous algorithm, CilkSUMMA, obtaining optimal performance from this algorithm requires explicit knowledge and use of the cache-size parameter C. This makes the algorithm more efficient but less elegant than blockedmul, which exploits a large cache automatically, without explicit use of the cache-size parameter. On the other hand, blockedmul may simply fail if it cannot allocate temporary storage (a real-world implementation should probably synchronize after 4 recursive calls, as notempmul does, if it cannot allocate a temporary matrix). The code for SpaceMul is given below. It uses an auxiliary procedure, MatrixAdd, which is not shown here, to sum an array of n × n matrices. We assume that MatrixAdd sums k matrices of dimension n using Θ(kn²) work and critical path Θ(log k log n); such an algorithm is trivial to implement in Cilk. For simplicity, we assume that n is a power of 2.

cilk spawnhelper(cilk procedure f, array [Y1, Y2, ..., Yk])
{
  if (k == 1)
    spawn f(Y1)
  else {
    spawn spawnhelper(f, [Y1, Y2, ..., Yk/2])
    spawn spawnhelper(f, [Yk/2+1, ..., Yk])
  }
}

cilk SpaceMul(A, B, R)
{
  /* comment: A, B, R are n-by-n */
  Allocate n/r matrices, each n-by-n, denoted R1..Rn/r
  Partition A into n/r block columns A1..An/r
  Partition B into n/r block rows B1..Bn/r
  spawn spawnhelper(RankRUpdate, [(A1, B1, R1), ..., (An/r, Bn/r, Rn/r)])
  sync
  spawn MatrixAdd(R, R1, R2, ..., Rn/r)
  return R
}

Theorem 3.2.1. The number of SpaceMul cache misses FP^SM(C, n), incurred by BACKER running on P processors, each with a shared-memory cache of C elements, and with block size r = √(C/3), when solving a problem of size n, is O(n³/√C + CP log(n/√C) log n). The total amount SP^SM(n) of space used by the algorithm is O(n³/√C).

Proof.
The amount of communication in SpaceMul is bounded by the sum of the communication incurred until the sync and the communication incurred after the sync. Thus, FP^SM(C, n) = FP^1(C, n) + FP^2(C, n), where FP^1 and FP^2 represent the communication in the two phases of the algorithm. First, we compute the work and critical path of SpaceMul. Let TP^ADD(n, k) be the P-processor running time of MatrixAdd for summing k matrices of dimension n, let TP^RRU(n, r) be the P-processor running time of RankRUpdate for performing a rank-r update on an n × n matrix, and let TP^SM(n) be the total running time of SpaceMul. Recall from the previous section that T1^RRU(n, r) = Θ(n²r) and that T∞^RRU(n, r) = Θ(log nr). As discussed above, it is trivial to implement MatrixAdd so that T1^ADD(n, n/r) = Θ(n³/r) and T∞^ADD(n, n/r) = Θ(log(n/r) log n). We now bound the work and critical path of SpaceMul. The work of SpaceMul is

    T1^SM(n, r) = n/r + (n/r) T1^RRU(n, r) + T1^ADD(n, n/r)
                = n/r + (n/r) n²r + n³/r = Θ(n³)

(there is nothing surprising about this: it is essentially a schedule for the conventional algorithm). The critical path of SpaceMul is T∞^SM(n) = log(n/r) + T∞^RRU(n, r) + T∞^ADD(n, n/r), where the first term accounts for spawning the n/r parallel rank-r updates. Therefore, T∞^SM(n) = Θ(log(n/r) + log nr + log(n/r) log n) = Θ(log(n/r) log n). Next, we compute the amount of communication in SpaceMul. From the proof of Lemma 3.1.1 we know that F1^RRU(C, n, r) = Θ(n²) (recall that r = √(C/3)). A sequential execution of MatrixAdd performs O((n/r) n²) cache misses, at most 3n² during the addition of each pair of n-by-n matrices. Using Theorem 2.6.1, we can bound the total communication FP^SM(C, n) of SpaceMul:

    FP^SM(C, n) = FP^1(C, n) + FP^2(C, n)
                = [n³/r + O(CP log n)] + [n³/r + O(CP log(n/r) log n)]
                = O(n³/√C + CP log(n/√C) log n).

The space used by the algorithm consists of the space for the n/r product matrices and the space for the activation frames, which is bounded by PT∞.
Therefore the total space cost is O((n/r) n² + P log(n/r) log n) = O(n³/√C).

Conclusion 3.2.2. The communication upper bound of SpaceMul is smaller by a factor of Ω(log n / log(n/√C)) than the bound of blockedmul, for any cache size.

    Algorithm   | SP         | FP                           | T1  | T∞
    ------------+------------+------------------------------+-----+--------------------
    notempmul   | n²         | n³/√C + CPn                  | n³  | n
    blockedmul  | n²P^(1/3)  | n³/√C + CP log² n            | n³  | log² n
    CilkSUMMA   | n²         | n³/√C + (CPn/√C) log(n√C)    | n³  | (n/√C) log(n√C)
    SpaceMul    | n³/√C      | n³/√C + CP log(n/√C) log n   | n³  | log(n/√C) log n

Table 1. Asymptotic upper bounds on the performance metrics of the four Cilk matrix multiplication algorithms, when applied to n-by-n matrices on a computer with P processors and cache size C.