TEL-AVIV UNIVERSITY
RAYMOND AND BEVERLY SACKLER FACULTY OF EXACT SCIENCES
SCHOOL OF COMPUTER SCIENCE

Designing Communication-Efficient Matrix Algorithms in Distributed-Memory

Thesis submitted in partial fulfillment of the requirements for the M.Sc. degree of Tel-Aviv University by Eyal Baruch

The research work for this thesis has been carried out at Tel-Aviv University under the direction of Dr. Sivan Toledo

November 2001

Abstract

This thesis studies the relationship between parallelism, space and communication in dense matrix algorithms. We study existing matrix multiplication algorithms, specifically those that are designed for shared-memory multiprocessor machines (SMPs). These machines are rapidly becoming commodity in the computer industry, but exploiting their computing power remains difficult. We improve algorithms that were originally designed using an algorithmic multithreaded language called Cilk (pronounced silk), and we present new algorithms. We analyze the algorithms under Cilk's dag-consistent memory model. We show that by dividing the matrix multiplication into phases that are performed in a sequence, we can obtain a lower communication bound without significantly limiting parallelism and without consuming significantly more space. Our new algorithms are inspired by distributed-memory matrix algorithms. In particular, we have developed algorithms that mimic the so-called two-dimensional and three-dimensional matrix multiplication algorithms, which are typically implemented using message-passing mechanisms, not using shared-memory programming. We focus on three key matrix algorithms: matrix multiplication, solution of triangular linear systems of equations, and the factorization of matrices into triangular factors.

Contents

Abstract
Chapter 1. Introduction
  1.1. Two New Matrix Multiplication Algorithms
  1.2. New Triangular Solver and LU Algorithms
  1.3. Outline of The Thesis
Chapter 2. Background
  2.1. Parallel Matrix Multiplication Algorithms
  2.2. The Cilk Language
  2.3. The Cilk Work Stealing Scheduler
  2.4. Cilk's Memory Consistency Model
  2.5. The BACKER Coherence Algorithm
  2.6. A Model of Multithreaded Computation
Chapter 3. Communication-Efficient Dense Matrix Multiplication in Cilk
  3.1. Space-Efficient Parallel Matrix Multiplication
  3.2. Trading Space for Communication in Parallel Matrix Multiplication
  3.3. A Comparison of Message-Passing and Cilk Matrix-Multiplication Algorithms
Chapter 4. A Communication-Efficient Triangular Solver in Cilk
  4.1. Triangular Solvers in Cilk
  4.2. Auxiliary Routines
  4.3. Dynamic Cache-Size Control
  4.4. Analysis of the New Solver with Dynamic Cache-Size Control
Chapter 5. LU Decomposition
Chapter 6. Conclusion and Open Problems
Bibliography

CHAPTER 1

Introduction

The purpose of parallel processing is to perform computations faster than can be done with a single processor by using a number of processors concurrently. The need for faster solutions and for solving large problems arises in a wide variety of applications. These include fluid dynamics, weather prediction, image processing, artificial intelligence and automated manufacturing. Parallel computers can be classified according to a variety of architectural features and modes of operation. In particular, most of the existing machines may be broadly grouped into two classes: machines with shared-memory architectures (examples include most of the small multiprocessors in the market, such as Pentium-based servers, and several large multiprocessors, such as the SGI Origin 2000) and machines with distributed-memory architectures (examples include the IBM SP systems and clusters of workstations and servers). In a shared-memory architecture, processors communicate by reading from and writing into the shared memory. In distributed-memory architectures, processors communicate by sending messages to each other.

This thesis focuses on the efficiency of parallel programs that run under the Cilk programming environment. Cilk is a parallel programming system that offers the programmer a shared-memory abstraction on top of a distributed-memory hardware. Cilk includes a compiler for its programming language, which is also referred to as Cilk, and a run-time system consisting of a scheduler and a memory consistency protocol. (The memory consistency protocol, which this thesis focuses on, is only part of one version of Cilk; the other versions assume a shared-memory hardware.)

The Cilk parallel multithreaded language has been developed in order to make high-performance parallel shared-memory programming easier. Cilk is built around a provably efficient algorithm for scheduling the execution of fully strict multithreaded computations, based on the technique of work stealing [21][4][26][22][5]. In his PhD thesis [21], Randall developed a memory-consistency protocol for running Cilk programs on distributed-memory parallel computers and clusters. His protocol allows the algorithm designer to analyze the amount of communication in a Cilk program and the impact of this communication on the total running time of the program. The analytical tools that he developed, along with earlier tools, also allow the designer to estimate the space requirements of a program. Randall demonstrated the power of these results by implementing and analyzing several algorithms, including matrix multiplication and LU factorization algorithms. However, the communication bounds of Randall's algorithms are quite loose compared to known distributed-memory message-passing algorithms.


This is alarming, since extensive communication between processors may significantly slow down parallel computations even if the work and communication are equally distributed between processors. In this thesis we show that it is possible to tighten the communication bound with respect to the cache size using new Cilk algorithms that we have designed. We demonstrate new algorithms for matrix multiplication, for the solution of triangular linear systems of equations, and for the factorization of matrices into triangular factors.

By the term Cilk algorithms we essentially mean Cilk implementations of conventional matrix algorithms. Programming languages allow the programmer to specify a computation (how to compute intermediate and final results from previously-computed results). But most programming languages also force the designer to constrain the schedule of the computation. For example, a C program essentially specifies a complete ordering of the primitive operations. The compiler may change the order of computations only if it can prove that the new ordering produces equivalent results. Parallel message-passing programs fully specify the schedule of the parallel computation. Cilk programs, in contrast, declare that some computations may be performed in parallel but let a run-time scheduler decide on the exact schedule. Our analysis, as well as previous analyses of Cilk programs, essentially shows that a given program admits an efficient schedule and that Cilk's run-time scheduler is indeed likely to choose such a schedule.

1.1. Two New Matrix Multiplication Algorithms

The main contribution of this thesis is in presenting a new approach for designing algorithms implemented in Cilk that achieve lower communication bounds. In the distributed-memory application world there exists a traditional classification of matrix multiplication algorithms. So-called two-dimensional (2D) algorithms, such as those of Cannon [7], or of Ho, Johnsson and Edelman [18], use only a small amount of extra memory. Three-dimensional (3D) algorithms use more memory but perform asymptotically less communication; examples include the algorithms of Gupta and Sadayappan [14], of Berntsen [3], of Dekel, Nassimi and Sahni [8] and of Fox, Otto and Hey [10].

Cilk's shared-memory abstraction, in comparison to message-passing mechanisms, simplifies programming by allowing each procedure, no matter which processor runs it, to access the entire memory space of the program. The Cilk runtime system provides support for scheduling decisions, and the programmer need not specify which processor executes which procedure, nor exactly when each procedure should be executed. These factors make it substantially easier to develop parallel programs using Cilk than with other parallel-programming environments. One may suspect that the ease of programming comes at a cost: reduced performance. We show in this thesis that this is not necessarily the case, at least theoretically (up to logarithmic factors), but that careful programming is required in order to match existing bounds. More naive implementations of algorithms, including those proposed by Randall, do indeed suffer from relatively poor theoretical performance bounds.

We give tighter communication bounds for new Cilk matrix multiplication algorithms that can be classified as 2D and 3D algorithms, and we prove that it is possible to design such algorithms within the simple programming environment of Cilk almost without compromising on performance. In the 3D case we have even slightly improved parallelism. The analysis shows that we can implement a 2D-like algorithm for multiplying n × n matrices on a machine with P processors, each with n²/P memory, with communication bound O(√P n² log n). In comparison, Randall's notempmul algorithm, which is equivalent in the sense that it uses little space beyond the space required for the input and output, performs O(n³) communication. We also present a 3D-like algorithm with communication bound O(n³/√C + CP log(n/√C) log n), where C is the memory size of each processor, which is lower than existing Cilk implementations for any amount of memory per processor.

1.2. New Triangular Solver and LU Algorithms

Solving a linear system of equations is one of the most fundamental problems in numerical linear algebra. The classic Gaussian elimination scheme for solving an arbitrary linear system of equations reduces the given system to a triangular form and then generates the solution by using the standard forward and backward substitution algorithm. This essentially factors the coefficient matrix into two triangular factors, one lower triangular and the other upper triangular. The contribution of this thesis is showing that if we can dynamically control the amount of memory that processors use to cache data locally, then we can design communication-efficient algorithms for solving dense linear systems. In other words, to achieve low communication bounds, we limit the amount of data that a processor may cache during certain phases of the algorithm. Our algorithms perform asymptotically a factor of √C / log(n√C) less communication than Randall's (where √C > log(n√C)), but our algorithms have somewhat less parallelism.

1.3. Outline of The Thesis

The rest of the thesis is organized as follows. In Chapter 2 we present an overview of existing parallel linear-algebra algorithms, and we present Cilk, an algorithmic multithreaded language. Chapter 2 also introduces the tools that Randall and others have developed for analysing the performance of Cilk programs. In Chapter 3 we present new Cilk algorithms for parallel matrix multiplication and analyze our algorithms. In Chapter 4 we present a new triangular solver and demonstrate how controlling the size of the cache can reduce communication. In Chapter 5 we use the results concerning the triangular solver to design a communication-efficient LU decomposition algorithm. We present our conclusions and discuss open problems in Chapter 6.

CHAPTER 2

Background

This chapter provides background material required in the rest of the thesis. The first section describes parallel distributed-memory matrix multiplication algorithms. Our new Cilk algorithms are inspired by these algorithms. The other sections describe Cilk and the analytical tools that allow us to analyze the performance of Cilk programs. Some of the material on Cilk follows quite closely the Cilk documentation and papers.

2.1. Parallel Matrix Multiplication Algorithms

The product R = AB is defined by rij = Σk aik bkj, where the sum ranges over k = 1, ..., n and n is the number of columns of A and of rows of B. Implementing matrix multiplication according to the definition requires n³ multiplications and n²(n−1) additions when the matrices are n-by-n. In this thesis we ignore o(n³) algorithms, such as Strassen's, which are not widely used in practice.

Matrix multiplication is a regular computation that parallelizes well. The first issue when implementing such algorithms on parallel machines is how to assign tasks to processors. We can compute all the elements of the product in parallel, so we can clearly employ n² processors for n time steps. We can actually compute all the products in a matrix multiplication computation in one step if we can use n³ processors. But to compute the n² sums of products, we need an additional log n steps for the summations. Note that with n³ processors, most of them would remain idle during most of the time, since there are only 2n³ − 1 arithmetic operations to perform during log n + 1 time steps.

Another issue when implementing parallel algorithms is the mechanism used to support communication among different processors. In a distributed-memory architecture each processor has its own local memory, which it can address directly and quickly. A processor may or may not be able to address the memory of other processors directly, and in any case, accessing remote memories is slower than accessing its own local memory.

Programming a distributed-memory machine with message passing poses two challenges. The first challenge is a software-engineering one: since the memory of the computer is distributed and since the running program is composed of multiple processes, each with its own variables, we must distribute data structures among the processors. The second and more fundamental challenge is to choose the assignment of data-structure elements and computational tasks to processors in a way that minimizes communication. Since transferring data between memories of different processors is much slower than accessing data in a processor's own local memory, reducing data transfers usually reduces the running time of a program. Therefore, we

must analyze the amount of communication in matrix algorithms when we attempt to design efficient parallel algorithms and predict their performance.

There are two well known implementation concepts for parallel matrix multiplication on distributed-memory machines. The first and more natural implementation concept is to lay out the matrix in blocks. The P processors are arranged in a 2-dimensional √P-by-√P grid (we assume for simplicity here that √P is an integer), the three matrices are split into √P-by-√P block matrices, and each block is stored on the corresponding processor. The grid of processors is simply a map from 2-dimensional processor indices to the usual 1-dimensional rank (processor indexing). This form of distributing a matrix is called a 2-dimensional (2D) block distribution, because we distribute both the rows and the columns of the matrix among processors. The basic idea of the algorithm is to assign processor (i, j) the computation of Rij = Σk AikBkj, where k runs from 1 to √P (here Rij is a block of the matrix R, and similarly for A and B). The algorithm consists of √P main phases. In each phase, every processor sends two messages of size n²/P words (and receives two such messages as well), and performs 2(n/√P)³ floating-point operations, resulting in √P · (n²/P) = n²/√P communication cost and n²/P memory space per processor.

A second kind of distributed-memory matrix multiplication algorithm uses less communication but more space than the 2D algorithm. The basic idea is to arrange the processors in a p-by-p-by-p 3D grid, where p = P^(1/3), and to split the matrices into p-by-p block matrices. The first phase of the algorithm distributes the matrices so that processor (i, j, k) stores Aik and Bkj. The next phase computes on processor (i, j, k) the product AikBkj. In the third and last phase of the algorithm the processors sum up the products AikBkj to produce Rij. More specifically, the group of processors with indices (i, j, k), k = 1..p, sums up Rij. The computational load in the 3D algorithm is nearly perfectly balanced. Each processor multiplies two blocks and adds at most two. Some processors add none.

The 2D algorithm requires each processor to store exactly 3 submatrices of order n/√P during the algorithm and performs a total of P · √P · (n/√P)² = n²√P communication. The 3D algorithm stores at each processor 3 submatrices of order n/P^(1/3) and performs a total of P · (n²/P^(2/3)) = n²P^(1/3) communication.

2.2. The Cilk Language

The philosophy behind Cilk is that a programmer should concentrate on structuring his program to expose parallelism and exploit locality, leaving the runtime system with the responsibility of scheduling the computation to run efficiently on the given platform. Cilk's runtime system takes care of details like load balancing and communication protocols. Unlike other multithreaded languages, however, Cilk is algorithmic in that the runtime system's scheduler guarantees provably efficient and predictable performance. Cilk's algorithmic multithreaded language for parallel programming generalizes the semantics of C by introducing linguistic constructs for parallel control. The basic Cilk language is simple. It consists of C with the addition of three keywords, cilk, spawn and sync, to indicate parallelism and synchronization. A Cilk program, when run on one processor, has the same semantics as the C program that results when the Cilk keywords are deleted.
Cilk extends the semantics of C in a natural way for parallel execution, so that a procedure may spawn subprocedures in parallel and synchronize upon their completion. A Cilk procedure definition is identified by the keyword cilk and has an argument list and body just like a C function, and its declaration can be used anywhere an ordinary C function declaration can be used. The main procedure must be named main, as in C; unlike C, however, Cilk insists that the return type of main be int. Since the main procedure must also be a Cilk procedure, it must be defined with the cilk keyword.

Most of the work in a Cilk procedure is executed serially, just like C, but parallelism is created when the invocation of a Cilk procedure is immediately preceded by the keyword spawn. A spawn is the parallel analog of a C function call, and like a C function call, when a Cilk procedure is spawned, execution proceeds to the child. Unlike a C function call, however, where the parent is not resumed until after its child returns, in the case of a Cilk spawn, the parent can continue to execute in parallel with the child. Indeed, the parent can continue to spawn off children, producing a high degree of parallelism. Cilk's scheduler takes the responsibility of scheduling the spawned procedures on the processors of the parallel computer.

A Cilk procedure cannot safely use the return values (or data written to shared data structures) of the children it has spawned until it executes a sync statement. If all of its children have not completed when it executes a sync, the procedure suspends and does not resume until all of its children have completed. In Cilk, a sync waits only for the spawned children of the procedure to complete and not for all procedures currently executing. When all its children return, execution of the procedure resumes at the point immediately following the sync statement. As an aid to programmers, Cilk inserts an implicit sync before every return, if it is not present already. As a consequence, a procedure never terminates while it has outstanding children. The program in Figure 2.2.1 demonstrates how Cilk works. The figure shows a Cilk procedure that computes the n-th Fibonacci number.

In Cilk's terminology, a thread is a maximal sequence of instructions that ends with a spawn, sync or return (either explicit or implicit) statement (the evaluation of arguments to these statements is considered part of the thread preceding the statement). Therefore, we can visualize a Cilk program execution as a directed acyclic graph, or dag, in which vertices are threads (instructions) and edges denote ordering constraints imposed by control statements. A Cilk program execution consists of a collection of procedures, each of which is broken into a sequence of nonblocking threads. The first thread that executes when a procedure is called is the procedure's initial thread, and the subsequent threads are successor threads. At runtime, the binary spawn relation causes procedure instances to be structured as a rooted tree, and the dependencies among their threads form a dag embedded in this spawn tree. For example, the computation generated by the execution of fib(4) from the program in Figure 2.2.1 generates the dag shown in Figure 2.2.2. A correct execution of a Cilk program must obey all the dependencies in the dag, since a thread cannot be executed until all the threads on which it depends have completed.

cilk int fib(int n)
{
    if (n < 2)
        return n;
    else {
        int x, y;
        x = spawn fib(n-1);
        y = spawn fib(n-2);
        sync;
        return (x + y);
    }
}

Figure 2.2.1. A simple Cilk procedure to compute the nth Fibonacci number in parallel (using an exponential-work method, while logarithmic-time methods are known). Deleting the cilk, spawn, and sync keywords would reduce the procedure to a valid and correct C procedure.


Figure 2.2.2. A dag of threads representing the multithreaded computation of fib(4) from Figure 2.2.1. Each procedure, shown as a rounded rectangle, is broken into sequences of threads, shown as circles. A downward edge indicates the spawning of a subprocedure. A horizontal edge indicates the continuation to a successor thread. An upward edge indicates the returning of a value to a parent procedure. All three types of edges are dependencies which constrain the order in which threads may be scheduled. The figure is from [22].

Note that the use of the term thread here is different from the common use in programming environments such as Win32 or POSIX threads, where the same term refers to a process-like object that shares an address space with other threads and which competes with other threads and processes for CPU time.

2.3. The Cilk Work Stealing Scheduler

The spawn and sync keywords specify logical parallelism, as opposed to actual parallelism. That is, these keywords indicate which code may possibly execute in parallel, but what actually runs in parallel is determined by the scheduler, which maps dynamically unfolding computations onto the available processors. To execute a Cilk program correctly, Cilk's underlying scheduler must obey all the dependencies in the dag, since a thread cannot be executed until all the threads on which it depends have completed. These dependencies form a partial order, permitting many ways of scheduling the threads in the dag. A scheduling algorithm must ensure that enough threads remain concurrently active to keep the processors busy. Simultaneously, it should ensure that the number of concurrently active threads remains within reasonable limits so that memory requirements can be bounded. Moreover, the scheduler should also try to maintain related threads on the same processor, if possible, so that communication between them can be minimized. Needless to say, achieving all these goals simultaneously can be difficult.

Two scheduling paradigms address the problem of scheduling multithreaded computations: work sharing and work stealing. In work sharing, whenever a processor generates new threads, the scheduler attempts to migrate some of them to other processors in hopes of distributing the work to underutilized processors. In work stealing, however, underutilized processors take the initiative: they attempt to steal threads from other processors. Intuitively, the migration of threads occurs less frequently with work stealing than with work sharing, since if all processors have plenty of work to do, no threads are migrated by a work-stealing scheduler, but threads are always migrated by a work-sharing scheduler.

Cilk's work-stealing scheduler executes any Cilk computation in nearly optimal time [21][4][5]. During the execution of a Cilk program, when a processor runs out of work, it asks another processor, chosen at random, for work to do. Locally, a processor executes procedures in ordinary serial order (just like C), exploring the spawn tree in a depth-first manner. When a child procedure is spawned, the processor saves the local variables of the parent (its activation frame) at the bottom of a stack, which is a ready deque (a doubly ended queue from which procedures can be added or deleted), and commences work on the child (the convention is that the stack grows downward, and items are pushed and popped from the bottom of the stack). When the child returns, the bottom of the stack is popped (just like C) and the parent resumes. When another processor requests work, however, work is stolen from the top of the stack, that is, from the end opposite to the one normally used by the worker.
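The deque discipline just described can be made concrete with a small sketch. The following C fragment illustrates only the access pattern (the owner works at the bottom, thieves take from the top); it is sequential and omits the locking that Cilk's real implementation uses to make these operations safe under concurrent steals, and all the names in it are ours, not Cilk's.

#include <stdlib.h>

/* A schematic ready deque of activation frames. */
typedef struct {
    void **frames;    /* saved activation frames (suspended parents) */
    int    top;       /* index of the oldest frame: the steal end    */
    int    bottom;    /* one past the newest frame: the owner's end  */
    int    capacity;
} deque;

static deque *deque_create(int capacity)
{
    deque *d = malloc(sizeof *d);
    d->frames = malloc(capacity * sizeof *d->frames);
    d->top = d->bottom = 0;
    d->capacity = capacity;
    return d;
}

/* Owner, at a spawn: save the parent's frame at the bottom. */
static void push_bottom(deque *d, void *frame)
{
    if (d->bottom < d->capacity)
        d->frames[d->bottom++] = frame;
}

/* Owner, when a child returns: pop the bottom, just like a C call stack. */
static void *pop_bottom(deque *d)
{
    return (d->bottom > d->top) ? d->frames[--d->bottom] : NULL;
}

/* Thief, on a randomly chosen victim: steal from the top, the end
   opposite to the one the owner uses.                               */
static void *steal_top(deque *d)
{
    return (d->top < d->bottom) ? d->frames[d->top++] : NULL;
}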

2.4. Cilk's Memory Consistency Model

Cilk's shared-memory abstraction greatly enhances the programmability of a multiprocessor. In comparison to a message-passing architecture, the ability of each processor to access the entire memory simplifies programming by reducing the need for explicit data partitioning and data movement. The single address space also provides better support for parallelizing compilers and standard operating systems. Since shared-memory systems allow multiple processors to simultaneously read and write the same memory locations, programmers require a conceptual model for the semantics of memory operations to allow them to correctly use the shared memory. This model is typically referred to as a memory consistency model or memory model. To maintain the programmability of shared-memory systems, such a model should be intuitive and simple to use. The intuitive memory model assumed by most programmers requires the execution of a parallel program on a multiprocessor to appear as some interleaving of the execution of the parallel processes on a uniprocessor. This intuitive model was formally defined by Lamport as sequential consistency [19]:

Definition 2.4.1. A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order and the operations of each individual processor appear in this sequence in the order specified by its program.

Sequential consistency maintains the memory behavior that is intuitively expected by most programmers. Each processor is required to issue its memory operations in program order. Operations are serviced by memory one at a time, and thus they appear to execute atomically with respect to other memory operations. The memory services operations from different processors based on an arbitrary but fair global schedule. This leads to an arbitrary interleaving of operations from different processors into a single sequential order. The fairness criterion guarantees eventual completion of all processor requests. The above requirements lead to a total order on all memory operations that is consistent with the program order dictated by each processor's program.

Unfortunately, architects of shared-memory systems for parallel computers who have attempted to support Lamport's strong model of sequential consistency have generally found that Lamport's model is difficult to implement efficiently, and hence relaxed models of shared-memory consistency have been developed [9][11][12]. These models adopt weaker semantics to allow a faster implementation. By and large, all of these consistency models have had one thing in common: they are processor-centric in the sense that they define consistency in terms of actions by physical processors. In contrast, Cilk's dag consistency is defined on the abstract computation dag of a Cilk program and hence is computation-centric. To define a computation-centric memory model like dag consistency it suffices to define what values are allowed to be returned by a read.

We now define dag consistency in terms of the computation. A computation is represented by its graph G = (V, E), where V is a set of vertices representing threads of the computation, and E is a set of edges representing ordering constraints on the threads. For two threads u and v, we say that u (strictly) precedes v, which we write u ≺ v, if u ≠ v and there is a directed path in G from u to v.

Definition 2.4.2. The shared memory M of a computation G = (V, E) is dag consistent if for every object x in the shared memory, there exists an observer function fx : V → V such that the following conditions hold:
1. For all instructions u ∈ V, the instruction fx(u) writes to x.
2. If an instruction u writes to x, then we have fx(u) = u.
3. If an instruction u reads x, it receives the value written by fx(u).
4. For all instructions u ∈ V, we have u ⊀ fx(u).
5. For each triple u, v, w of instructions such that u ≺ v ≺ w, if fx(v) ≠ fx(u) holds, then we have fx(w) ≠ fx(u).

Informally, the observer function fx(u) represents the viewpoint of instruction u on the content of object x. For deterministic programs, this definition implies the intuitive notion that a read can see a write in the dag consistency model only if there is some serial execution order consistent with the dag in which the read sees the write. Unlike sequential consistency, but similar to certain processor-centric models [11][12], dag consistency allows different reads to return values that are based on different serial orders, but the values returned must respect the dependencies in the dag. Thus, the writes performed by a thread are seen by its successors, but threads that are incomparable in the dag may or may not see each other's writes.

The primary motivation for any weak consistency model, including dag consistency, is performance. In addition, however, a memory model must be understandable by a programmer. In the dag consistency model, if the programmer wishes to ensure that a read sees a write, he must ensure that there is a path in the computation dag from the write to the read. Using Cilk, one can ensure that such a path exists by placing a sync statement between the write and the read in his program.
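As a small illustration of this rule, consider the following hypothetical Cilk fragment (the procedure and variable names are ours). Before the sync, the child's write and the parent's read are incomparable in the dag, so under dag consistency the read may return either the old or the new value; after the sync there is a path in the dag from the write to the read, so the write must be seen.

int x = 0;                     /* object in dag-consistent shared memory */

cilk void writer(void)
{
    x = 1;                     /* write performed by the spawned child */
}

cilk int reader(void)
{
    int before, after;

    spawn writer();
    before = x;                /* incomparable with the write: may see 0 or 1 */
    sync;
    after = x;                 /* a path from the write now exists: sees 1 */
    return before + after;
}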

2.5. The BACKER Coherence Algorithm

Cilk maintains dag consistency using a coherence protocol called BACKER¹ [21]. In this protocol, versions of shared-memory objects can reside simultaneously in any of the processors' local caches or in the global backing store. Each processor's cache contains objects recently used by the threads that have executed on that processor, and the backing store provides a global storage location for each object. In order for a thread executing on the processor to read or write an object, the object must be in the processor's cache. Each object in the cache has a dirty bit to record whether the object has been modified since it was brought into the cache.

Three basic actions are used by the BACKER to manipulate shared-memory objects: fetch, reconcile and flush. A fetch copies an object from the backing store to a processor cache and marks the cached object as clean. A reconcile copies a dirty object from a processor cache to the backing store and marks the cached object as clean. Finally, a flush removes a clean object from a processor cache. Unlike implementations of other models of consistency, all three actions are bilateral between a processor's cache and the backing store, and other processors' caches are never involved.

¹ The BACKER coherence algorithm was designed and implemented as part of Keith Randall's PhD thesis, but it is not included in the Cilk versions that are actively maintained.

The BACKER coherence algorithm operates as follows. When the program performs a read or write action on an object, the action is performed directly on a cached copy of the object. If the object is not in the cache, it is fetched from the backing store before the action is performed. If the action is a write, the dirty bit of the object is set. To make space in the cache for a new object, a clean object can be removed by flushing it from the cache. To remove a dirty object, it is reconciled and then flushed.

Besides performing these basic operations in response to user reads and writes, the BACKER performs additional reconciles and flushes to enforce dag consistency. For each edge i → j in the computation dag, if threads i and j are scheduled on different processors, say P and Q, then BACKER reconciles all of P's cached objects after P executes i but before P enables j, and it reconciles and flushes all of Q's cached objects before Q executes j.

The key reason BACKER works is that it is always safe, at any point during the execution, for a processor P to reconcile an object or to flush a clean object. The BACKER algorithm uses this safety property to guarantee dag consistency even when there is communication. BACKER causes P to reconcile all its cached objects after executing i but before enabling j, and it causes Q to reconcile and flush its entire cache before executing j. At this point, the state of Q's cache (empty) is the same as it would be if j had executed with i on processor P, but a reconcile and flush had occurred between them. Consequently, BACKER ensures dag consistency.
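The three actions can be made concrete with the following schematic C fragment. It is only meant to illustrate the fetch/reconcile/flush vocabulary: the cache-line structure and the backing-store accessors are our own assumptions, and object granularity, eviction policy and concurrency are all ignored.

typedef struct {
    long value;    /* cached copy of a shared-memory object          */
    int  valid;    /* is the object currently present in the cache?  */
    int  dirty;    /* modified since it was brought into the cache?  */
} cache_line;

/* Hypothetical accessors for the global backing store. */
extern long backing_store_read(long object_id);
extern void backing_store_write(long object_id, long value);

/* fetch: copy an object from the backing store into the cache, clean. */
static void fetch(cache_line *c, long object_id)
{
    c->value = backing_store_read(object_id);
    c->valid = 1;
    c->dirty = 0;
}

/* reconcile: copy a dirty cached object back to the backing store and
   mark the cached copy as clean.                                       */
static void reconcile(cache_line *c, long object_id)
{
    if (c->valid && c->dirty) {
        backing_store_write(object_id, c->value);
        c->dirty = 0;
    }
}

/* flush: drop a clean object from the cache. */
static void flush(cache_line *c)
{
    if (c->valid && !c->dirty)
        c->valid = 0;
}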

2.6. A Model of Multithreaded Computation

Cilk supports an algorithmic model of multithreaded computation which equips us with an algorithmic foundation for predicting the performance of Cilk programs. A multithreaded computation is composed of a set of threads, each of which is a sequential ordering of unit-size instructions. A processor takes one unit of time to execute one instruction. The instructions of a thread must execute in sequential order from the first instruction to the last instruction.

From an abstract theoretical perspective, there are two fundamental limits to how fast a Cilk program could run [21][4][5]. Let us denote by Tp the execution time of a given computation on P processors. The work of the computation, denoted T1, is the total number of instructions in the dag, which corresponds to the amount of time required by a one-processor execution (ignoring cache misses and other complications). Notice that with T1 work and P processors, the lower bound Tp ≥ T1/P must hold, since in one step, a P-processor computer can do at most P work (this, again, ignores cache misses). The second limit is based on the program's critical path length, denoted by T∞, which is the maximum number of instructions on any directed path in the dag, which corresponds to the amount of time required by an infinite-processor execution, or equivalently, the time needed to execute threads along the longest path of dependency. The corresponding lower bound is simply Tp ≥ T∞, since a P-processor computer can do no more work in one step than an infinite-processor computer.

The work T1 and the critical path length T∞ are not intended to denote the execution time on any real single-processor or infinite-processor machine.

These quantities are abstractions of a computation and are independent of any real machine characteristics such as communication latency. We can think of T1 and T∞ as execution times on an ideal machine with no scheduling overhead and with a unit-access-time memory system. Nevertheless, Cilk's work-stealing scheduler executes a Cilk computation that does not use locks on P processors in expected time [21]

TP = T1/P + O(T∞),

which is asymptotically optimal, since T1/P and T∞ are both lower bounds. Empirically, the constant factor hidden by the big O is often close to 1 or 2 [22], and the formula Tp = T1/P + T∞ provides a good approximation of the running time on shared-memory multiprocessors. This performance model holds for Cilk programs that do not use locks. If locks are used, Cilk cannot guarantee anything [22]. This simple performance model allows the programmer to reason about the performance of his Cilk program by examining two simple metrics: work and critical path.

The speedup of the computation on P processors is the ratio T1/Tp, which indicates how many times faster the P-processor execution is than a one-processor execution. If T1/Tp = Θ(P), then we say that the P-processor execution exhibits linear speedup. The maximum possible speedup is T1/T∞, which is also called the parallelism of the computation, because it represents the average amount of work that can be done in parallel in each time step along the critical path. We denote the parallelism of a computation by P̄.

In order to model performance for Cilk programs that use dag-consistent shared memory, we observe that running times will vary as a function of the size C of the cache that each processor uses. Therefore, we must introduce metrics that account for this dependence. We define a new work measure, the total work, that accounts for the cost of cache misses in the serial execution. Let Γ be the time to service a cache miss in the serial execution. We assign weights to the instructions of the dag. Each instruction that generates a cache miss in the one-processor execution with the standard, depth-first serial execution order and with a cache of size C has weight Γ + 1, and all other instructions have weight 1. The total work, denoted T1(C), is the total weight of all instructions in the dag, which corresponds to the serial execution time if cache misses take Γ units of time to be serviced. The work term T1, which was defined before, corresponds to the serial execution time if all cache misses take zero time to be serviced. Unlike T1, T1(C) depends on the serial execution order of the computation. It further differs from T1 in that T1(C)/P is not a lower bound on the execution time for P processors. Consequently, the ratio T1(C)/T∞ is defined to be the average parallelism of the computation.

We can bound the amount of space used by a parallel Cilk execution in terms of its serial space. Denote by Sp the space required for a P-processor execution. Then S1 is the space required for an execution on one processor. Cilk guarantees [21] that for a P-processor execution we have SP ≤ S1·P. This bound implies that if a computation uses a certain amount of memory on one processor, it will use no more space per processor on average when it runs in parallel.

The amount of interprocessor communication can be related to the number of cache misses that a Cilk computation incurs when it runs on P processors using the implementation of the BACKER coherence algorithm with cache size C. Let us denote by Fp(C) the number of cache misses incurred by a P-processor Cilk computation. Randall [21] shows that Fp(C) ≤ F1(C) + 2Cs, where s is the total number of steals executed by the scheduler. The 2Cs term represents cache misses due to warming up the processors' caches. Randall has performed empirical measurements that indicated that the warm-up events are much smaller in practice than the theoretical bound.
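These bounds can be combined into a simple back-of-the-envelope predictor. The C fragment below only illustrates how one might plug numbers into the model TP ≈ T1/P + T∞ and into the bound FP(C) ≤ F1(C) + 2Cs; the constant in front of T∞, the number of steals s, and all the sample figures are inputs of our own choosing, not quantities the model supplies.

#include <stdio.h>

/* Predicted P-processor running time, T_P ~ T1/P + c_inf * T_inf. */
static double predicted_time(double t1, double t_inf, double p, double c_inf)
{
    return t1 / p + c_inf * t_inf;
}

/* Upper bound on P-processor cache misses, F_P(C) <= F_1(C) + 2*C*s. */
static double cache_miss_bound(double f1, double cache_size, double steals)
{
    return f1 + 2.0 * cache_size * steals;
}

int main(void)
{
    /* Hypothetical figures, for illustration only. */
    double t1 = 1e9, t_inf = 1e4, p = 16.0, c_inf = 1.5;
    double f1 = 1e6, cache = 1e4, steals = 200.0;

    printf("predicted time   : %g\n", predicted_time(t1, t_inf, p, c_inf));
    printf("cache-miss bound : %g\n", cache_miss_bound(f1, cache, steals));
    return 0;
}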
Randall shows that this bound can be further tightened if we assume that the accesses to the backing store behave as if they were random and independent. Under this assumption, the following theorem predicts the performance of a distributed-memory Cilk program [21]:

Theorem 2.6.1. Consider any Cilk program executed on P processors, each with an LRU cache of C elements, using Cilk's work stealing scheduler in conjunction with the BACKER coherence algorithm. Assume that accesses to the backing store are random and independent. Suppose the computation has F1(C) serial cache misses and T∞ critical path length. Then, for any ε > 0, the number of cache misses is at most F1(C) + O(CPT∞ + CP log(1/ε)) with probability at least 1 − ε. Moreover, the expected number of cache misses is at most F1(C) + O(CPT∞).

The standard assumption in [21] is that the backing store consists of half the physical memory of each processor, and that the other half is used as a cache. In other words, C is roughly a 1/2P fraction of the total memory of the machine. It is, therefore, convenient to assess the communication requirements of algorithms under this assumption, although C can, of course, be smaller. Finally, from here on we focus on the expected performance measures (communication, time, cache misses, and space).

CHAPTER 3

Communication-Efficient Dense Matrix Multiplication in Cilk

Dense matrix multiplication is used in a variety of applications and is one of the core components of many scientific computations. The standard way of multiplying two matrices of size n × n requires O(n³) floating-point operations on a sequential machine. Since dense matrix multiplication is computationally expensive, the development of efficient algorithms is of great interest.

This chapter discusses two types of parallel algorithms for multiplying n × n dense matrices A and B to yield the product matrix R = A × B using Cilk programs. We analyze the communication cost and space requirements of specific Cilk algorithms and show new algorithms that are efficient with respect to the measures of communication and space. Specifically, we prove upper bounds on the amount of communication on SMP machines with P processors and shared-memory caches of size C when dag consistency is maintained by the BACKER coherence algorithm and under the assumption that accesses to the backing store are random and independent.

3.1. Space-Efficient Parallel Matrix Multiplication

Previous papers on Cilk [21][20][6] presented two divide-and-conquer algorithms for multiplying n-by-n matrices. The first algorithm uses Θ(n²) memory and has Θ(n) critical-path length (as stated above, we only focus on conventional Θ(n³)-work algorithms). In [21, page 56], this algorithm is called notempmul, which is the name we will use to refer to it. This algorithm divides the two input matrices into four n/2-by-n/2 blocks or submatrices, computes recursively the first four products and stores the results in the output matrix, then computes recursively, in parallel, the last four products and then concurrently adds the new results to the output matrix. The notempmul algorithm is shown in Figure 3.1.1. In essence, the algorithm uses the following formulation:

    [ R11  R12 ]   [ A11B11  A11B12 ]
    [ R21  R22 ] = [ A21B11  A21B12 ] ,

    [ R11  R12 ]    [ A12B21  A12B22 ]
    [ R21  R22 ] += [ A22B21  A22B22 ] .

Under the assumption that C is 1/2P of the total memory of the machine, and that the backing store's size is Θ(n²) (so C = n²/P), the communication upper bound for notempmul that Theorem 2.6.1 implies is O(n³), which is a lot more than the Θ(n²√P) bound for 2D message-passing algorithms.
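To see where the O(n³) figure comes from, recall that Theorem 2.6.1 bounds the communication by F1(C) + O(CPT∞); for notempmul the serial miss count is O(n³/√C) and T∞ = Θ(n) (these are the entries that appear later in Table 1). Substituting C = n²/P gives n³/√C = n²√P for the first term and CPT∞ = (n²/P) · P · n = n³ for the second, so the second term dominates for P ≤ n² and the overall bound is O(n³).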


cilk void notempmul(long nb, block *A, block *B, block *R)
{
    if (nb == 1)
        multiplyadd_block(A, B, R);
    else {
        block *C, *D, *E, *F, *G, *H, *I, *J;
        block *CGDI, *CHDJ, *EGFI, *EHFJ;

        /* get pointers to input submatrices */
        partition(nb, A, &C, &D, &E, &F);
        partition(nb, B, &G, &H, &I, &J);

        /* get pointers to result submatrices */
        partition(nb, R, &CGDI, &CHDJ, &EGFI, &EHFJ);

        /* solve subproblems recursively */
        spawn notempmul(nb/2, C, G, CGDI);
        spawn notempmul(nb/2, C, H, CHDJ);
        spawn notempmul(nb/2, E, H, EHFJ);
        spawn notempmul(nb/2, E, G, EGFI);
        sync;
        spawn notempmul(nb/2, D, I, CGDI);
        spawn notempmul(nb/2, D, J, CHDJ);
        spawn notempmul(nb/2, F, J, EHFJ);
        spawn notempmul(nb/2, F, I, EGFI);
        sync;
    }
    return;
}

Figure 3.1.1. Cilk code for the notempmul algorithm. It is a no-temporary version of recursive blocked matrix multiplication.
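The serial base-case routine multiplyadd_block is not shown in the thesis. A minimal sketch, under the assumption that a block is an NB-by-NB submatrix of doubles stored contiguously in row-major order (the constant NB and the storage convention are our assumptions), could look as follows:

#define NB 64   /* assumed base-case block dimension */

/* R += A * B, where A, B and R are NB-by-NB blocks in row-major order. */
static void multiplyadd_block(const double *A, const double *B, double *R)
{
    for (int i = 0; i < NB; i++)
        for (int k = 0; k < NB; k++) {
            double a = A[i * NB + k];
            for (int j = 0; j < NB; j++)
                R[i * NB + j] += a * B[k * NB + j];
        }
}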

We suggest another way to perform n × n matrix multiplication. Our new algorithm, called CilkSUMMA, is inspired by the SUMMA matrix multiplication algorithm [23]. The algorithm divides the multiplication process into phases that are performed one after the other, not concurrently; all the parallelism is within phases. Figure 3.1.2 illustrates the algorithm. For some constant r, 0 < r ≤ n, the algorithm splits A into n/r block columns of size n × r and B into n/r block rows of size r × n and performs n/r rank-r updates of the product matrix R, one per phase. The main loop of CilkSUMMA is:


Figure 3.1.2. The CilkSUMMA algorithm for parallel matrix multiplication. CilkSUMMA divides A into n/r vertical blocks of size n × r and B into n/r horizontal blocks of size r × n. The corresponding blocks of A and B are iteratively multiplied to produce an n × n product. Each such product is accumulated into the matrix R together with the results of the previous iterations. Finally, R holds the product of A and B.


Figure 3.1.3. Recursive rank-r update to an n-by-n matrix R.

cilk CilkSUMMA(A, B, R, n, r)
{
    R = 0
    for k = 1..n/r {
        spawn RankRUpdate(Ak, Bk, R, n, r)
        sync
    }
}

The RankRUpdate procedure recursively divides an n × r block of A into two blocks of size (n/2) × r and an r × n block of B into two blocks of size r × (n/2), and multiplies the corresponding blocks in parallel, recursively. If n = 1, then the multiplication reduces to a dot product, for which we give the code later. The algorithm is shown in Figure 3.1.3, and its code is given below:

cilk RankRUpdate(A, B, R, n, r)
{
    if n = 1
        R += DotProd(A, B, r)
    else {
        spawn RankRUpdate(A1, B1, R11, n/2, r)
        spawn RankRUpdate(A1, B2, R12, n/2, r)
        spawn RankRUpdate(A2, B1, R21, n/2, r)
        spawn RankRUpdate(A2, B2, R22, n/2, r)
    }
}

The recursive Cilk procedure DotProd, shown below, is executed at the bottom of the rank-r recursion. If r = 1, the code returns the scalar product of the inputs. Otherwise, the code splits each of the r-length input vectors a and b into two subvectors of r/2 elements, multiplies the two halves recursively, and returns the sum of the two dot products. Clearly, the code performs Θ(r) work and has critical path Θ(log r). The details are as follows:

cilk DotProd(a, b, r)
{
    if (r = 1)
        return a1 · b1
    else {
        x = spawn DotProd(a[1,...,r/2], b[1,...,r/2], r/2)
        y = spawn DotProd(a[r/2+1,...,r], b[r/2+1,...,r], r/2)
        sync
        return (x + y)
    }
}

The analysis of communication cost is organized as follows. First, we prove a lemma describing the amount of communication performed by RankRUpdate. Next, we obtain a bound on the amount of communication in CilkSUMMA.

Lemma 3.1.1. The amount of communication in RankRUpdate, FP^RRU(C, n), incurred by BACKER running on P processors, each with a shared-memory cache of C elements, and with block size r = √(C/3), when solving a problem of size n, is O(n² + CP log(n√C)).

Proof. To find the number of RankRUpdate cache misses, we use Theorem 2.6.1. The work and critical path for RankRUpdate can be computed using recurrences. We find the number of cache misses incurred when the RankRUpdate algorithm is executed on a single processor and then substitute them into Theorem 2.6.1.

The work bound T1^RRU(n, r) satisfies T1^RRU(1, r) = T1^DotProd(r) = Θ(r) and T1^RRU(n, r) = 4 T1^RRU(n/2, r) for n > 1. Therefore, T1^RRU(n, r) = Θ(n²r).

To derive a recurrence for the critical path length T∞^RRU(n, r), we observe that with an infinite number of processors the 4 block multiplications can execute in parallel; therefore T∞^RRU(n, r) = T∞^RRU(n/2, r) + Θ(1) for n > 1.

For n = 1, T∞^RRU(1, r) = T∞^DotProd(r) + Θ(1) = Θ(log r). Consequently, the critical path satisfies T∞^RRU(n, r) = Θ(log n + log r).

Next, we bound F1^RRU(C, n), the number of cache misses that occur when the RankRUpdate algorithm is used to solve a problem of size n with the standard, depth-first serial execution order on a single processor with an LRU cache of size C. At each node of the computational tree of RankRUpdate, k² elements of R in a k × k block are updated using the results of k² dot products of size r. To perform such an operation entirely in the cache, the cache must store k² elements of R, kr elements of A, and kr elements of B. When k ≤ √(C/3), the three submatrices fit into the cache. Let k = r = √(C/3). Clearly, the (n/k)² updates to k-by-k blocks can be performed entirely in the cache, so the total number of cache misses is at most F1^RRU(C, n) ≤ (n/k)² · [k² + 2kr] ≤ Θ(n²).

By Theorem 2.6.1 the amount of communication that RankRUpdate performs, when run on P processors using Cilk's scheduler and the BACKER coherence algorithm, is

    FP^RRU(C, n) = F1^RRU(C, n) + O(CP T∞^RRU).

Since the critical path length of RankRUpdate is Θ(log nr), the total number of cache misses is O(n² + CP log(n√C)). □

Next, we analyze the amount of communication in CilkSUMMA.

Theorem 3.1.2. The number of CilkSUMMA cache misses FP^CS(C, n), incurred by BACKER running on P processors, each with a shared-memory cache of C elements and block size r = √(C/3), when solving a problem of size n, is O((n/√C)(n² + CP log(n√C))). In addition, the total amount SP^CS(n) of space taken by the algorithm is O(n² + P log(n√C)).

Proof. Notice that the CilkSUMMA algorithm only performs sequential calls to the parallel algorithm RankRUpdate. The sync statement at the end of each iteration guarantees that the procedure suspends and does not resume until all the RankRUpdate children have completed. Each such iteration is a phase in which only one call to RankRUpdate is invoked, so the only parallel execution is of the parent procedure and its own children. Thus, we can bound the total number of cache misses by the sum of the cache misses incurred during the n/r phases,

    FP^CS(C, n) ≤ (n/r) · FP^RRU(C, n).

By Lemma 3.1.1, we have FP^RRU(C, n) = O(n² + CP log(n√C)), yielding

    FP^CS(C, n) = O((n/r)(n² + CP log(nr))) = O((n/√C)(n² + CP log(n√C))).

The total amount of space used by the algorithm is the space allocated for the product matrix, plus the space for the activation frames allocated by the runtime system. The Cilk runtime system uses activation frames to represent procedure instances. Each such representation is of constant size, including the program counter and all live, dirty variables. The frame is pushed onto a deque on the heap and it is deallocated at the end of the procedure call. Therefore, in the worst case the total space reserved for the activation frames is the longest possible chain of procedure instances, for each processor, which is the critical path length, resulting in O(P log nr) total space allocated at any time; thus SP^CS(n) = Θ(n²) + O(P log nr) = O(n² + P log(n√C)). □

We have bounded the work and critical path of CilkSUMMA. Using these values we can compute the total work and estimate the total running time TP^CS(C, n). The computational work of CilkSUMMA is T1^CS(n) = Θ(n³), so the total work is T1^CS(C, n) = T1^CS(n) + Γ F1^CS(C, n) = Θ(n³), assuming Γ is a constant.
The critical path length is T∞^CS(C, n) = (n/√C) log(n√C), so using the performance model in [21], the total expected time for CilkSUMMA on P processors is

    TP(C, n) = O(T1(C, n)/P + Γ C T∞(C, n)) = O(n³/P + Γ n √C log(n√C)).

Consequently, if P = O(n²/(Γ √C log(n√C))), the algorithm runs in O(n³/P) time, obtaining linear speedup. CilkSUMMA uses the processor cache more effectively than notempmul whenever √C > log(n√C), which holds asymptotically for all C = Ω(n).

If we consider the size of the cache to be C = n²/P, which is the memory size of each one of the P processors in a distributed-memory machine when solving n × n matrix multiplication with 2D algorithms, then FP^CS(C, n) = Θ(√P n² log n) and SP^CS(n) = O(n²). These results are comparable to the communication and space requirements of 2D distributed-memory matrix multiplication algorithms, and they are significantly better than the Θ(n³) communication bound. We also improved the average parallelism of the algorithm over notempmul for r = Ω(n), since T1(C, n)/T∞(n) = n³/((n/r) log(nr)) > n².
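For completeness, here is the short calculation behind these figures: with C = n²/P we have n/√C = √P and CP = n², so FP^CS(C, n) = O(√P (n² + n² log(n²/√P))) = O(√P n² log n), matching the Θ(√P n² log n) stated above; in the space bound O(n² + P log(n√C)), the n² term dominates for any reasonable P.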

3.2. Trading Space for Communication in Parallel Matrix Multiplication

The matrix-multiplication algorithm shown in Figure 3.2.1 is perhaps the most natural matrix-multiplication algorithm in Cilk. The code is from [21, page 55], but the same algorithm also appears in [20]. The motivation for this algorithm, called blockedmul, is to increase parallelism, at the expense of using more memory. Its critical path is only Θ(log² n), as opposed to Θ(n) in notempmul, but it uses Θ(n²P^(1/3)) space [21, page 148], as opposed to only Θ(n²) in notempmul.

In the message-passing literature on matrix algorithms, space is traded for a reduction in communication, not for parallelism. So-called 3D matrix multiplication algorithms [1, 2, 3, 13, 14, 17] replicate the input matrices P^(1/3) times in order to reduce the total amount of communication from Θ(n²P^(1/2)) in 2D algorithms down to Θ(n²P^(1/3)).

cilk void blockedmul(long nb, block *A, block *B, block *R)
{
    if (nb == 1)
        multiply_block(A, B, R);
    else {
        block *C, *D, *E, *F, *G, *H, *I, *J;
        block *CG, *CH, *EG, *EH, *DI, *DJ, *FI, *FJ;
        block tmp[nb*nb];

        /* get pointers to input submatrices */
        partition(nb, A, &C, &D, &E, &F);
        partition(nb, B, &G, &H, &I, &J);

        /* get pointers to result submatrices */
        partition(nb, R, &CG, &CH, &EG, &EH);
        partition(nb, tmp, &DI, &DJ, &FI, &FJ);

        /* solve subproblems recursively */
        spawn blockedmul(nb/2, C, G, CG);
        spawn blockedmul(nb/2, C, H, CH);
        spawn blockedmul(nb/2, E, H, EH);
        spawn blockedmul(nb/2, E, G, EG);
        spawn blockedmul(nb/2, D, I, DI);
        spawn blockedmul(nb/2, D, J, DJ);
        spawn blockedmul(nb/2, F, J, FJ);
        spawn blockedmul(nb/2, F, I, FI);
        sync;

        /* add results together into R */
        spawn matrixadd(nb, tmp, R);
        sync;
    }
    return;
}

Figure 3.2.1. Cilk code for recursive blocked matrix multiplication. It uses divide-and-conquer to solve one n × n multiplication problem by splitting it into 8 (n/2) × (n/2) multiplication subproblems and combining the results with one n × n addition. A temporary matrix of size n × n is allocated at each divide step. A serial matrix multiplication routine is called to do the base case.

Irony and Toledo have shown that the additional memory is necessary for reducing communication, and that the tradeoff is asymptotically tight [16].

Substituting C = n²/P^(2/3) in Randall's communication analysis for blockedmul, we find that with that much memory, the algorithm performs O(n²P^(1/3) log² n) communication. That is, if we provide the program with caches large enough to replicate the input Θ(P^(1/3)) times, as in 3D message-passing algorithms, the amount of communication that it performs is at most a factor of Θ(log² n) more than message-passing 3D algorithms. In other words, blockedmul is a Cilk analog of 3D algorithms.

We propose a slightly more communication-efficient algorithm than blockedmul. Like our previous algorithm, CilkSUMMA, obtaining optimal performance from this algorithm requires explicit knowledge and use of the cache-size parameter C. This makes the algorithm more efficient but less elegant than blockedmul, which exploits a large cache automatically without explicit use of the cache-size parameter. On the other hand, blockedmul may simply fail if it cannot allocate temporary storage (a real-world implementation should probably synchronize after 4 recursive calls, as notempmul does, if it cannot allocate a temporary matrix).

The code for SpaceMul is given below. It uses an auxiliary procedure, MatrixAdd, which is not shown here, to sum an array of n × n matrices. We assume that MatrixAdd sums k matrices of dimension n using Θ(kn²) work and critical path Θ(log k log n); such an algorithm is trivial to implement in Cilk. For simplicity, we assume that n is a power of 2.

cilk spawnhelper(cilk procedure f, array [Y1, Y2, ..., Yk])
{
    if (k = 1)
        spawn f(Y1)
    else {
        spawn spawnhelper(f, [Y1, Y2, ..., Yk/2])
        spawn spawnhelper(f, [Yk/2+1, ..., Yk])
    }
}

cilk SpaceMul(A, B, R)
{
    /* comment: A, B, R are n-by-n */
    Allocate n/r matrices, each n-by-n, denoted R1..Rn/r
    Partition A into n/r block columns A1..An/r
    Partition B into n/r block rows B1..Bn/r
    spawn spawnhelper(RankRUpdate,
        [(A1, B1, R1), ..., (An/r, Bn/r, Rn/r)])
    sync
    spawn MatrixAdd(R, R1, R2, ..., Rn/r)
    return R
}
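The MatrixAdd procedure is not given in the thesis. One way to obtain the assumed Θ(kn²) work and Θ(log k log n) critical path is to sum the k matrices by a binary reduction tree, with each pairwise addition itself recursing over quadrants. The sketch below, written in the same pseudocode style as SpaceMul, is our own; the helper names and the exact recursion are assumptions, not code from the thesis.

cilk MatrixSum(R1, ..., Rk, n)            /* sums R1,...,Rk into R1 */
{
    if (k > 1) {
        spawn MatrixSum(R1, ..., Rk/2, n)        /* left half  -> R1     */
        spawn MatrixSum(Rk/2+1, ..., Rk, n)      /* right half -> Rk/2+1 */
        sync
        spawn AddInto(R1, Rk/2+1, n)             /* R1 += Rk/2+1         */
    }
}

cilk MatrixAdd(R, R1, ..., Rk, n)         /* R += R1 + ... + Rk */
{
    spawn MatrixSum(R1, ..., Rk, n)
    sync
    spawn AddInto(R, R1, n)
}

cilk AddInto(R, S, n)                     /* R += S for n-by-n matrices */
{
    if (n = 1)
        r11 += s11
    else {
        spawn AddInto(R11, S11, n/2)
        spawn AddInto(R12, S12, n/2)
        spawn AddInto(R21, S21, n/2)
        spawn AddInto(R22, S22, n/2)
    }
}

With this structure, AddInto has Θ(n²) work and Θ(log n) critical path, so MatrixSum satisfies T1(k) = 2 T1(k/2) + Θ(n²) = Θ(kn²) and T∞(k) = T∞(k/2) + Θ(log n) = Θ(log k log n), matching the bounds assumed above.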

Theorem 3.2.1. The number of SpaceMul cache misses FP^SM(C, n), incurred by BACKER running on P processors, each with a shared-memory cache of C elements and block size r = √(C/3), when solving a problem of size

n, is O(n³/√C + CP log(n/√C) log n). The total amount SP^SM(n) of space used by the algorithm is O(n³/√C).

Proof. The amount of communication in SpaceMul is bounded by the sum of the communication incurred until the sync and the communication incurred after the sync. Thus, FP^SM(C, n) = FP^1(C, n) + FP^2(C, n), where FP^1, FP^2 represent the communication in the two phases of the algorithm.

First, we compute the work and critical path of SpaceMul. Let TP^ADD(n, k) be the P-processor running time of MatrixAdd for summing k matrices of dimension n, let TP^RRU(n, r) be the P-processor running time of RankRUpdate to perform a rank-r update on an n × n matrix, and let TP^SM(n) be the total running time of SpaceMul.

Recall from the previous section that T1^RRU(n, r) = Θ(n²r) and that T∞^RRU(n, r) = Θ(log nr). As discussed above, it is trivial to implement MatrixAdd so that T1^ADD(n, n/r) = Θ(n³/r) and T∞^ADD(n, n/r) = Θ(log(n/r) log n).

We now bound the work and critical path of SpaceMul. The work for SpaceMul is

    T1^SM(n, r) = n/r + (n/r) T1^RRU(n, r) + T1^ADD(n, n/r) = n/r + (n/r) n²r + n³/r = Θ(n³)

(there is nothing surprising about this: this is essentially a schedule for the conventional algorithm). The critical path for SpaceMul is T∞^SM(n) = log(n/r) + T∞^RRU(n, r) + T∞^ADD(n, n/r). The first term accounts for spawning the n/r parallel rank-r updates. Therefore, T∞^SM(n) = Θ(log(n/r) + log nr + log(n/r) log n) = Θ(log(n/r) log n).

Next, we compute the amount of communication in SpaceMul. From the proof of Lemma 3.1.1 we know that F1^RRU(C, n, r) = O(n²) (recall that r = √(C/3)). A sequential execution of MatrixAdd performs O((n/r) n²) cache misses, at most 3n² during the addition of every pair of n-by-n matrices. Using Theorem 2.6.1, we can bound the total communication FP^SM(C, n) in SpaceMul,

F SM(C, n)=F 1 (C, n)+F 2 (C, n) P P P n3 n3 n O CP n O CP n = r + log + r + log r log n3 n = O √ + CP log √ log n . C C n The space used by the algorithm consists of the space for the r product matrices and the space of the activation frames which are bounded by PT∞. 3 O n n2 P n n O √n  Therefore the total space cost is ( r + log r log )= ( C ).

Conclusion 3.2.2. The communication upper bound of SpaceMul is smaller by a factor of Ω(log n / log(n/√C)) than the bound of blockedmul, for any cache size.

Algorithm     S_P          F_P                              T_1    T_∞
notempmul     n²           n³/√C + CPn                      n³     n
blockedmul    n²P^{1/3}    n³/√C + CP log² n                n³     log² n
CilkSUMMA     n²           n³/√C + (CPn/√C) log(n√C)        n³     (n/√C) log(n√C)
SpaceMul      n³/√C        n³/√C + CP log(n/√C) log n       n³     log(n/√C) log n

Table 1. Asymptotic upper bounds on the performance metrics of the four Cilk matrix-multiplication algorithms, when applied to n-by-n matrices on a computer with P processors and cache size C.

Proof. The amount of communication in blockedmul is bounded by Θ(n³/√C + CP log² n). The result follows from Theorem 3.2.1. □

In particular, for C = n²/P^{2/3} and r = √(C/3),

  F_P^{SM}(n) = O(n²P^{1/3} + (1/3) n²P^{1/3} log P log n) = O(n²P^{1/3} log P log n).

This bound is a factor of P^{1/6}/log P smaller than the corresponding bound for CilkSUMMA. The average parallelism is T_1^{SM}(C, n)/T_∞^{SM}(n) = Ω(n³/log² n), which is the same as in blockedmul.

3.3. A Comparison of Message-Passing and Cilk Matrix-Multiplication Algorithms

Table 1 summarizes the performance bounds of the four Cilk algorithms that we have discussed in this chapter. The table compares the space, communication, work, and critical path of the four algorithms as a function of the input size n and the cache size C.

Table 2 compares message-passing algorithms to their Cilk analogs. Message-passing algorithms have fixed space requirements, and the table shows the required amount of space and the corresponding bound on communication. In Cilk algorithms the amount of communication depends on the size of the local memories (the cache size), and the table fixes these sizes to match the space requirements of the message-passing algorithms. The communication bounds of the Cilk algorithms were derived by substituting the appropriate cache size C in the general bounds shown in Table 1. The main conclusions that we draw from the table are:
• The notempmul algorithm is communication inefficient. CilkSUMMA is a much better alternative.
• The communication upper bounds for even the best Cilk algorithms are worse by a factor of between log n and log² n than the communication in message-passing algorithms.
Can the performance of notempmul and blockedmul be improved by tuning the cache size to the problem size and machine size at hand?

Algorithm        Memory per Processor    Total Communication
Distributed 2D   Θ(n²/P)                 O(n²√P)
Distributed 3D   Θ(n²/P^{2/3})           O(n²P^{1/3})
notempmul        Θ(n²/P)                 O(n³)
blockedmul       Θ(n²/P^{2/3})           O(n²P^{1/3} log² n)
CilkSUMMA        Θ(n²/P)                 O(n²√P log n)
SpaceMul         Θ(n²/P^{2/3})           O(n²P^{1/3} log P log n)

Table 2. Communication overhead of the Cilk shared-memory algorithms when each processor's cache is as large as the per-processor memory used by the corresponding distributed-memory algorithm.
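The optimal cache sizes in Table 3 below are obtained by balancing the two terms of the corresponding bound in Table 1. A sketch of the calculation for notempmul (the one for blockedmul is analogous, with CP log² n in place of CPn):

\[
\frac{d}{dC}\left(\frac{n^3}{\sqrt{C}} + CPn\right)
  = -\frac{n^3}{2C^{3/2}} + Pn = 0
\quad\Longleftrightarrow\quad
C = \Theta\!\left(\left(\frac{n^2}{P}\right)^{2/3}\right)
  = \Theta\!\left(\frac{n^{4/3}}{P^{2/3}}\right),
\]
\[
\text{and at this value of } C \text{ both terms are }
\Theta\!\left(\frac{n^3}{\sqrt{C}}\right) = \Theta\!\left(n^{7/3}P^{1/3}\right),
\text{ the notempmul entry of Table 3.}
\]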

Algorithm    Optimal Cache Size                 Overall Communication
notempmul    C = Θ(n^{4/3}/P^{2/3})             O(n^{7/3}P^{1/3})
blockedmul   C = Θ(n²/(P^{2/3} log^{4/3} n))    O(n²P^{1/3} log^{2/3} n)

Table 3. Optimized communication overhead of the Cilk n × n matrix-multiplication algorithms and the cache-size values that achieve it.

Table 3 shows that the performance of notempmul can indeed be improved by reducing the caches slightly. Obviously, the size of the backing store cannot be shrunk if it is to hold the input and output, so the implication is that, for this algorithm, the size of the caches should be smaller than half the local memory of the processor. (This setting reduces the provable upper bound; whether it reduces communication in practice is another matter, and we conjecture that it does not.) However, even after the reduction, the communication upper bound is significantly worse than that of all the other algorithms. The table also shows that the performance of blockedmul can be improved slightly by reducing the size of the cache, but not by much. Since SpaceMul always performs less communication than blockedmul, the same observation applies to it.

CHAPTER 4

A Communication-Efficient Triangular Solver in Cilk

This chapter focuses on the solution of linear systems of equations with triangular coefficient matrices. Such systems are solved nearly always by substitution, which creates a relatively long critical path in the computation. In Randall's analyses of communication, long critical paths weaken the upper bounds because of the CPT_∞ term. In this chapter we show that by allowing the programmer to dynamically control the size of local caches, we can derive tighter communication upper bounds. More specifically, the combination of dynamic cache-size control and a new Cilk algorithm allows us to achieve performance bounds similar to those of state-of-the-art message-passing algorithms.

4.1. Triangular Solvers in Cilk

A lower triangular linear system of equations

  l_{11} x_1 = b_1
  l_{21} x_1 + l_{22} x_2 = b_2
  ...
  l_{n1} x_1 + l_{n2} x_2 + ··· + l_{nn} x_n = b_n,

which we can also write as a matrix equation Lx = b, is solved by substitution. Here L is a known coefficient matrix, b is a known vector, and x is a vector of unknowns to be solved for. This chapter actually focuses on the solution of multiple linear systems with the same coefficient matrix L but with different right-hand sides b, which we write LX = B. More specifically, we focus on the case in which B has exactly n columns, which is the case that comes up in the factorization of general matrices, the subject of the next chapter. We assume that L is nonsingular, which for a triangular matrix is equivalent to saying that it has no zeros on the diagonal. Although we focus on lower triangular systems, upper triangular systems are handled in exactly the same way.

We can solve such systems by substitution. In a lower triangular system the first equation involves only one variable, x_1. We can, therefore, solve it directly, x_1 = b_1/l_{11}. Now that we know the value of x_1, we can substitute its value in all the other equations. Then the second equation involves only one unknown, which we solve for, and so on. Randall presents and analyzes a recursive formulation of the substitution algorithm in Cilk [21, pg. 58]. This recursive solver partitions the matrix as shown in Figure 4.1.1. The code is as follows:


Figure 4.1.1. Recursive decomposition in a traditional triangular-solver algorithm. The three matrices are subdivided into n/2 × n/2 blocks. First, the recursive matrix equations L11X11 = B11 and, in parallel, L11X12 = B12 are solved for X11 and X12. Then B21 = B21 − L21X11 and, in parallel, B22 = B22 − L21X12 are computed. Finally, L22X21 = B21 and, in parallel, L22X22 = B22 are solved recursively.

cilk RecursiveTriSolver(L, X, B)
{
  if (L is 1-by-1)
    x11 = b11/l11
  else {
    partition L, B, and X as in Figure 4.1.1
    spawn RecursiveTriSolver(L11, X11, B11)
    spawn RecursiveTriSolver(L11, X12, B12)
    sync
    spawn multiply-and-update(B21 = B21 − L21X11)
    spawn multiply-and-update(B22 = B22 − L21X12)
    sync
    spawn RecursiveTriSolver(L22, X21, B21)
    spawn RecursiveTriSolver(L22, X22, B22)
  }
}
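Written out, the partitioning of Figure 4.1.1 is simply the block form of LX = B:

\[
\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}
\begin{pmatrix} X_{11} & X_{12} \\ X_{21} & X_{22} \end{pmatrix}
=
\begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix},
\]
which yields the four equations solved by the code above:
\[
L_{11}X_{11}=B_{11}, \qquad L_{11}X_{12}=B_{12}, \qquad
L_{22}X_{21}=B_{21}-L_{21}X_{11}, \qquad L_{22}X_{22}=B_{22}-L_{21}X_{12}.
\]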

This algorithm performs Θ(n³) work, and the length of its critical path is Θ(n log n) when the multiply-and-updates are performed using notempmul. Here too, the advantage of notempmul is that it uses no auxiliary space. But because the substitution algorithm solves for one unknown after the other, the critical path cannot be shorter than n even if the multiply-and-updates are performed using a short-critical-path multiplier, so notempmul does not worsen the parallelism significantly. For C = n²/P, Randall's result implies that the amount of communication is bounded by F_P(C) = O(n³ log n). This is a meaningless bound, since even without local caching at all, a Θ(n³) algorithm does not perform more than Θ(n³) communication.
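For reference, the substitution computation that all of these solvers parallelize can be written as a short sequential C routine; the in-place, row-major layout below is an assumption of this sketch and is not part of Randall's code or of the algorithms in this chapter.

/* Forward substitution: solve L X = B for X, where L is an n-by-n lower
   triangular matrix with nonzero diagonal and B is n-by-nrhs.  X
   overwrites B; both matrices are stored row-major.  Sequential
   reference sketch only. */
void forward_substitute(const double *L, double *B, int n, int nrhs)
{
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < i; k++)        /* substitute the known x_k */
            for (int j = 0; j < nrhs; j++)
                B[i * nrhs + j] -= L[i * n + k] * B[k * nrhs + j];
        for (int j = 0; j < nrhs; j++)     /* solve for x_i */
            B[i * nrhs + j] /= L[i * n + i];
    }
}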

Figure 4.1.2. The recursive partitioning in NewTriSolver. The algorithm first recursively solves L11X1 = B1 for X1, then updates B2 = B2 − L21X1, and then recursively solves L22X2 = B2.

We use a slightly different recursive formulation of the substitution algorithm, which will later allow us to develop a communication-efficient algorithm. Our new algorithm, NewTriSolver, partitions the matrices as shown in Figure 4.1.2. Its code is actually simpler than that of RecursiveTriSolver:

cilk NewTriSolver(L, X, B)
{
  if (L is 1-by-1)
    call VectorScale(1/l11, X, B)    /* X = B/l11 */
  else {
    partition L, B, and X as in Figure 4.1.2
    call NewTriSolver(L11, X1, B1)
    call RectangularMatMult(B2 = B2 − L21X1)
    call NewTriSolver(L22, X2, B2)
  }
}

This algorithm exposes no parallelism by itself: all the parallelism is exposed by the two auxiliary routines that it calls. We use the keyword call to emphasize that the caller is suspended until the callee returns, unlike a spawn, which allows the caller to continue to execute concurrently with the callee. The call keyword uses the normal C function-call mechanism. We could achieve the same scheduling result by spawning each of the called routines in NewTriSolver and immediately following each spawn with a sync.

4.2. Auxiliary Routines

We now turn to the description of the auxiliary routines that NewTriSolver uses. The VectorScale routine scales a vector Y = (y1, ..., yn) by a scalar α and returns the result in another vector X = (x1, ..., xn).

Figure 4.2.1. The recursive partitioning in RectangularMatMult. The algorithm recursively partitions the long matrices until they are square and then calls CilkSUMMA.

cilk VectorScale(α, [x1, ..., xn], [y1, ..., yn])
{
  if (n == 1)
    x1 = α y1
  else {
    spawn VectorScale(α, [x1, ..., x_{n/2}], [y1, ..., y_{n/2}]);
    spawn VectorScale(α, [x_{n/2+1}, ..., xn], [y_{n/2+1}, ..., yn]);
  }
}

Analyzing the performance of this simple algorithm is easy. Clearly, it per- forms Θ(n) work with critical path Θ(log n). The next lemma analyzes the amount of communication in the code. Intuitively, we expect the amount of communication to be Θ(n), since there is no data reuse in the algorithm.

Lemma 4.2.1. The amount of communication in VectorScale is bounded by F_P^{VS}(C = 3, n) = O(n + P log n).

Proof. We use Theorem 2.6.1 to show that F_P^{VS}(C, n) = O(n + CP log n). The work in VectorScale is T_1^{VS}(n) = 2T_1^{VS}(n/2) = Θ(n), and the critical-path length is T_∞^{VS}(n) = Θ(log n). The number of serial cache misses is O(n), independently of the cache size, because the work bounds the number of cache misses. Applying Theorem 2.6.1 yields F_P^{VS}(C, n) = O(n + CP log n). The result follows by substituting C = 3. □

The second algorithm that NewTriSolver uses, RectangularMatMult, is a rectangular matrix multiplier that calls CilkSUMMA. This algorithm is always called to multiply a square m-by-m matrix by an m-by-n matrix, where m ≤ n.

cilk RectangularMatMult(A, B, R)
{
  partition B and R as shown in Figure 4.2.1
  if (B and R have more columns than rows) {
    spawn RectangularMatMult(A, B1, R1)
    spawn RectangularMatMult(A, B2, R2)
  } else
    call CilkSUMMA(A, B, R, m)
}

Let us now analyze the performance of this algorithm. The next lemma analyzes the amount of work and the length of the critical path, and Lemma 4.2.3 that follows analyzes communication. To keep the analysis simple, we assume that n is a power of 2 and that m divides n.
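Since the recursion only peels off square panels, an equivalent sequential formulation is a loop over the n/m panels of B and R, with a square multiply-accumulate standing in for CilkSUMMA. The sketch below is such a reference under assumed row-major storage, with m dividing n; the accumulation sign here is +, whereas NewTriSolver uses the routine for a subtraction.

/* Square multiply-accumulate: R += A * B for m-by-m blocks.  A is stored
   with row stride m; the B and R panels are stored with row stride
   `stride` (stride >= m).  Stands in for CilkSUMMA in this sketch. */
static void square_mul_ref(const double *A, const double *B, double *R,
                           int m, int stride)
{
    for (int i = 0; i < m; i++)
        for (int k = 0; k < m; k++)
            for (int j = 0; j < m; j++)
                R[i * stride + j] += A[i * m + k] * B[k * stride + j];
}

/* R += A * B, where A is m-by-m and B, R are m-by-n (row-major), with m
   dividing n: handle one m-by-m panel of B and R at a time, mirroring
   the recursion in RectangularMatMult. */
void rectangular_matmult_ref(const double *A, const double *B, double *R,
                             int m, int n)
{
    for (int p = 0; p < n / m; p++)
        square_mul_ref(A, B + p * m, R + p * m, m, n);
}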

Lemma 4.2.2. Let A be m-by-m and let B and R be m-by-n. The amount of work in RectangularMatMult(A, B, R) is Θ(m²n), and the length of its critical path is O(log(n/m) + (m/√C) log(m√C)).

Proof. The work in RectangularMatMult is equal to the work in CilkSUMMA if n = m, and is T_1^{RMM}(m, n) = 2T_1^{RMM}(m, n/2) otherwise. Therefore,

  T_1^{RMM}(m, n) = 2T_1^{RMM}(m, n/2) = ... = 2^{log(n/m)} T_1^{CS}(m) = (n/m) Θ(m³) = Θ(m²n).

The critical path of RectangularMatMult is bounded by Θ(1) + T_∞^{CS}(C, m) if n = m and by Θ(1) + T_∞^{RMM}(C, m, n/2) otherwise. Therefore, the critical path is

  T_∞^{RMM}(C, m, n) = Θ(1) + T_∞^{RMM}(C, m, n/2)
                     = Θ(log(n/m)) + T_∞^{CS}(C, m)
                     = Θ(log(n/m)) + (m/√C) log(m√C). □



We now bound the amount of communication in RectangularMatMult. Although the bound seems complex, we actually need to use this result in only one very special case, which will allow us to simplify the expression.

Lemma 4.2.3. Let A be m-by-m and let B and R be m-by-n. The amount of communication in RectangularMatMult(A, B, R) is bounded by

  n · max(m, m²/√C) + O(CP (log(n/m) + (m/√C) log(m√C))).

Proof. The number of cache misses in a sequential execution is bounded by n/m times the number of cache misses in each call to CilkSUMMA. The number of cache misses in CilkSUMMA on matrices of size m is at most max(m², m³/√C), since even if C is large, we still have to read the arguments into the cache. Therefore, the amount of communication in a parallel execution is bounded by

  F_P^{RMM}(C, m, n) = (n/m) max(m², m³/√C) + O(CP T_∞^{RMM}(C, m, n))
                     = (n/m) max(m², m³/√C) + O(CP (log(n/m) + (m/√C) log(m√C))). □



4.3. Dynamic Cache-Size Control

The bound in Theorem 2.6.1 has two terms. In the first term, F_1(C), larger caches normally lead to a tighter communication bound, which is intuitive. The second term, CPT_∞, causes larger caches to weaken the communication bound. This happens because the cost of flushing the caches rises. Randall [21] addresses this issue in two ways. First, he suggests that good parallel algorithms have short critical paths, so the CPT_∞ term should usually be small. This argument fails in many numerical-linear-algebra algorithms, which have long critical paths but which parallelize well thanks to the amount of work they must perform. In particular, triangular solvers and triangular factorization algorithms, which have Ω(n) critical paths, parallelize well, but Randall's communication bounds for them are too loose. The second argument that Randall makes is empirical: in his experiments, the actual amount of communication that can be attributed to the CPT_∞ term is insignificant. While this empirical evidence is encouraging, we would like to have tighter provable bounds.

Our main observation is that most of the tasks along the (rather long) critical path in triangular solvers do not benefit from large caches. That is, the critical path is long, but most of the tasks along it perform little work on small amounts of data, and such tasks do not benefit from large caches. Consider square matrix multiplication: a data item is used at most n times, the dimension of the matrices, so caches of size n² already minimize the amount of communication. Larger caches do not reduce the F_1(C) term, but they do increase the CPT_∞ term.

This observation leads us to suggest a new feature in the Cilk run-time system. This feature allows us to temporarily instruct processors to use only part of their local caches to cache data.

Definition 4.3.1. The programmer can set the effective cache size when calling a Cilk procedure. This effective size is registered in the activation frame of the newly created procedure instance. When an effective cache size is specified in a procedure instance, it is inherited by all the procedures that it calls or spawns, unless a different cache size is explicitly set in the invocation of one of these descendant procedures. When a processor starts to execute a thread from an activation frame with a newly specified effective cache size, but its current effective cache size is larger, it flushes its local cache and limits its effective size to the specified size. When a child procedure returns, the cache size reverts to its parent's effective cache size, as stored in the parent's activation frame.
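The semantics of Definition 4.3.1 can be modeled with a small amount of per-worker state. The sketch below is only a model of the proposed feature, not an existing Cilk interface; the names worker_t, enter_frame and leave_frame, and the way the flush cost is charged, are assumptions of the sketch. It captures the three rules: children inherit the effective size, shrinking it forces a flush (charged as communication, as in the cost function ϕ of Section 4.4), and returning restores the parent's value.

/* Hypothetical model of dynamic cache-size control; not a real Cilk API. */
typedef struct {
    long physical_cache;    /* C: the size of the worker's local cache    */
    long effective_cache;   /* effective cache size currently in force    */
    long flushed_elements;  /* communication charged to explicit flushes  */
} worker_t;

/* Entering a frame that requests effective cache size `requested`.
   Returns the previous effective size so it can be restored on return. */
long enter_frame(worker_t *w, long requested)
{
    long saved = w->effective_cache;
    /* A flush is charged only when genuinely shrinking below both the
       current effective size and the physical cache size (the cases in
       which the cost function phi is nonzero). */
    if (requested < w->effective_cache && requested < w->physical_cache)
        w->flushed_elements += w->effective_cache;
    w->effective_cache = requested;
    return saved;
}

/* Returning from the frame: restore the parent's effective cache size,
   as stored in the parent's activation frame; growing back is free. */
void leave_frame(worker_t *w, long saved)
{
    w->effective_cache = saved;
}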

Although the ability to limit the effective cache size allows us to reduce the effect of the CPT∞ term, we need to account for the cost of the extra cache flush.

4.4. Analysis of the New Solver with Dynamic Cache-Size Control

We are now ready to add cache-size control to NewTriSolver. Our aim is simple: to set the cache size to zero during calls to VectorScale, which does not benefit from the cache, and to set the cache size in RectangularMatMult to the minimum size that ensures optimal data reuse. In particular, when multiplying an m-by-m matrix by an m-by-n matrix, a data item is used at most m times, so a cache of size C = m² ensures optimal data reuse. The complete algorithm is shown below.

cilk NewTriSolver(L, X, B)
{
  if (L is 1-by-1) {
    SetCacheSize(0)
    call VectorScale(1/l11, X, B)
  } else {
    partition L, B, and X as in Figure 4.1.2
    SetCacheSize(dim(L21)²)
    call NewTriSolver(L11, X1, B1)
    call RectangularMatMult(B2 = B2 − L21X1)
    call NewTriSolver(L22, X2, B2)
  }
}

The next lemma bounds the amount of communication that RectangularMatMult performs in the context of NewTriSolver, when it uses a cache of size at most m².

Lemma 4.4.1. The amount of communication that RectangularMatMult(A, B, R) performs with cache size at most m² (the dimension of A) is bounded by

  F_P^{RMM}(min(C, m²), m, n) ≤ O(nm²/√C + nm + m√C P log(n√C)).

Proof. In general, the amount of communication is bounded by

  n · max(m, m²/√C) + O(CP (log(n/m) + (m/√C) log(m√C))).

Since √C ≤ m, we have m/√C ≥ 1. We also have m ≤ n, so log(m√C) ≤ log(n√C). Therefore, log(n/m) + (m/√C) log(m√C) = O((m/√C) log(n√C)). The result follows from this bound and from replacing the max by the sum of its arguments. □

We can now use Lemma 4.4.1 to bound the amount of communication in NewTriSolver. Since the algorithm is essentially sequential, we do not use Theorem 2.6.1 directly, so we do not need to know the length of the critical path. (We do analyze the critical path later in this section, but the critical-path-length bound is not used to bound communication.)

The following theorem bounds the amount of communication, including the additional communication cost of the flushes performed when decreasing the cache size:

Theorem 4.4.2. The amount of communication F_P^{TS}(C, n) in NewTriSolver, using cache-size control, is bounded by O(n³/√C + n√C P log(n√C) log n).

Proof. NewTriSolver is essentially a sequential algorithm that calls parallel Cilk subroutines. Before calling a parallel subroutine, it performs a SetCacheSize, which sets the maximum cache size of the processor executing the algorithm. The other processors participating in the parallel sub-computation inherit this cache size when they steal work from it. This behavior eliminates interaction between parallel subroutines, in that it ensures that each parallel computation uses all P processors and that each parallel computation starts with a specified cache size and in a specific order. This allows us to simply sum the communication upper bounds of the parallel subcomputations in order to derive an upper bound for NewTriSolver as a whole.

Let ϕ be a cost function that accounts for the communication incurred by flushing the cache when the cache size is changed from m1 to m2, which is implemented by adding the command SetCacheSize(m2) to the NewTriSolver algorithm when the actual cache size is m1:

  ϕ(C, m1, m2) = 0 if m2 ≥ C or m2 ≥ m1, and ϕ(C, m1, m2) = m1 otherwise.

Then

  F_P^{TS}(C, n) ≤ Σ_{k=1}^{φ1} [ ϕ(C, m_{k−1}², m_k²) + F_P^{RMM}(C, m_k, n) + ϕ(C, m_k², m_{k−1}²) ] + φ2 · F_P^{VS}(C, n),

where φ1 is the number of SetCacheSize phases, m_k is the problem size at each such phase (so the effective cache size set for phase k is m_k²), and φ2 is the number of VectorScale phases. At each phase all P processors have the same cache size. We count the communication incurred by reducing the cache size, at each phase, for one processor only: all the other processors have clean caches before they steal work, so changing (reducing) their cache size does not incur an extra cache flush. Recall that a processor's cache size is not changed when it finishes the execution of a computation. The sequence of calls to RectangularMatMult forms a binary tree in which each phase has half the problem size of the previous one but occurs twice as many times. Notice that before VectorScale is called, the cache size is always set to a constant size (zero), and that VectorScale is called exactly n times. Also notice that ϕ(C, m_0², m_1²) = 0, since m_0 corresponds to the initial phase, before program execution, when each processor's cache is clean, and that ϕ(C, m_k², m_{k−1}²) = 0, since no communication is performed when the cache size is left unchanged or is increased (m_k ≤ m_{k−1}). Therefore,

  F_P^{TS}(C, n) ≤ Σ_{k=1}^{log n} 2^{k−1} [ ϕ(C, (n/2^{k−1})², (n/2^k)²) + F_P^{RMM}(C, n/2^k, n) + 0 ] + n · F_P^{VS}(C, n)
    = Σ_{k=1}^{log n} O( (n/2^{k−1})² + 2^{k−1} [ n³/(4^k √C) + n²/2^k + (n/2^k) √C P log(n√C) ] ) + O(n² + nP log n)
    = O( n² + n³/√C + n² log n + n √C P log(n√C) log n + nP log n )
    = O( n³/√C + n² log n + n √C P log(n√C) log n ).

If C < (n/log n)², then n² log n = O(n³/√C); otherwise n² log n = O(n √C P log(n√C) log n). Therefore,

  F_P^{TS}(C, n) = O(n³/√C + n √C P log(n√C) log n). □

We now bound the amount of work and parallelism in NewTriSolver.

Theorem 4.4.3. NewTriSolver performs Θ(n³) work when its arguments are all n-by-n, and the length of its critical path, when the cache-size-control feature is used, is O(n log(n√C) log n).

Proof. The work of NewTriSolver satisfies the recurrence

  T_1^{TS}(n, n) = 2T_1^{TS}(n/2, n) + T_1^{RMM}(n/2, n) = 2^{log n} n + Θ(n³) = Θ(n³),

since VectorScale performs Θ(n) work. The length of the critical path satisfies

  T_∞^{TS}(C, m, n) = log n                                            if m = 1,
  T_∞^{TS}(C, m, n) = 2T_∞^{TS}(C, m/2, n) + T_∞^{RMM}(m²/4, m/2, n)   if 1 < m ≤ 2√C,
  T_∞^{TS}(C, m, n) = 2T_∞^{TS}(C, m/2, n) + T_∞^{RMM}(C, m/2, n)      if m > 2√C.

Recall that

  T_∞^{RMM}(min(C, m²), m, n) = O( log(n/m) + (m/min(√C, m)) log(m · min(√C, m)) ).

Therefore,

  T_∞^{TS}(C, n, n) = 2T_∞^{TS}(C, n/2, n) + T_∞^{RMM}(min(C, n²/4), n/2, n)
    = O( n log n + Σ_{k=1}^{log n} 2^{k−1} [ log(2^k) + ((n/2^k)/min(√C, n/2^k)) log(n√C) ] )
    = O( n log n + (n/2) log(n√C) Σ_{k=1}^{log n} 1/min(√C, n/2^k) )
    = O( n log n + n log(n√C) log n )
    = O( n log(n√C) log n ). □

NewTriSolver performs less communication than Randall's RecursiveTriSolver [21] when √C > log(n√C). In particular, for C = O(n²/P) the new algorithm performs O(n²√P log² n) communication, a factor of n/(√P log n) less than RecursiveTriSolver. The critical path of the new algorithm, however, is slightly longer than that of RecursiveTriSolver. The new algorithm uses O(n²) space, since no external space is allocated by the algorithm, and the additional space taken by the runtime system's activation frames is O(Pn log(n√C) log n).

CHAPTER 5

LU Decomposition

In this chapter we describe a communication-efficient LU decomposition algorithm. A general linear system Ax = b can often be solved by factoring A into a product of triangular factors A = LU, where L is lower triangular and U is upper triangular. Not all matrices have such a decomposition, but many classes of matrices do, such as diagonally dominant matrices and symmetric positive-definite matrices. In this chapter we assume that A has an LU decomposition. Once the matrix has been factored, we can solve the linear system using forward and backward substitution, that is, by solving Ly = b for y and then Ux = y for x.

To achieve a high level of data reuse in a sequential factorization, the matrix must be factored in blocks or recursively (see [24, 25] and the references therein). A more conventional factorization by row or by column performs Θ(n³) cache misses when the dimension n of the matrix is twice √C or larger. Since factorization algorithms perform Θ(n³) work, it follows that data reuse in the cache in factorizations by row or by column is limited to a constant. Recursive factorizations, on the other hand, perform only Θ(n³/√C) cache misses.

Like triangular linear solvers, LU factorizations have long critical paths, Θ(n) or longer. Randall analyzed a straightforward recursive factorization in Cilk [21, section 4.1.2]. He used the formulation shown in Figure 5.0.1, which partitions A, L, and U into 2-by-2 block matrices, all square or nearly square. The algorithm begins by factoring A11 = L11U11, then solves in parallel for the off-diagonal blocks of L and U using the equations L21U11 = A21 and L11U12 = A12, updates the remaining equations, A22 = A22 − L21U12, and factors A22 = L22U22. The performance characteristics of this algorithm depend, of course, on the specific triangular solver and matrix multiplication subroutines. Randall's choices for these subroutines lead to an algorithm with overall critical path T_∞(n) = Θ(n log² n). When C = n²/P, the CPT_∞ term in Theorem 2.6.1 causes the communication bound to reach Θ(n³ log² n). This is a meaningless bound, since a Cilk algorithm never performs more communication than work. (Randall does show that this algorithm, like most matrix algorithms, can be arranged to achieve good spatial locality, but he does not show a meaningful temporal-locality bound.)

We propose a better Cilk LU-decomposition algorithm that performs only O(√P n² log³ n) communication. Our algorithm differs from Randall's in several ways.


Figure 5.0.1. A divide-and-conquer algorithm for LU decomposition: the 2-by-2 block partitioning A = LU, with A11 = L11U11, A12 = L11U12, A21 = L21U11, and A22 = L21U12 + L22U22.

The most important difference is that we rely on our communication-efficient triangular solver, which was described in the previous chapter. To allow the triangular solver to control cache sizes without interference, we perform the two triangular solves in the algorithm sequentially, not in parallel (the work of the algorithm remains Θ(n³)). It turns out that this lengthens the critical path, but not by much. We also use our communication- and space-efficient matrix multiplier, CilkSUMMA. The pseudocode of the algorithm follows.

cilk LUD(A, L, U, n)
{
  call LUD(A11, L11, U11)
  call NewTriSolver(L11, U12, A12)
  call NewTriSolver(U11, L21, A21)
  call CilkSUMMA(L21, U12, A22, n/2)
  call LUD(A22, L22, U22)
}
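As a correctness reference for the factorization that LUD computes, the following sequential, unblocked LU factorization without pivoting (valid under this chapter's assumption that A has an LU decomposition) produces an LU factorization of A, here with the convention that L has a unit diagonal. It is only a reference sketch, not the communication-efficient recursion.

/* Unblocked LU factorization without pivoting, in place.  A is n-by-n,
   row-major.  On return, U occupies the upper triangle (including the
   diagonal) and the unit-diagonal L occupies the strict lower triangle.
   Assumes every leading principal submatrix is nonsingular. */
void lu_nopivot_ref(double *A, int n)
{
    for (int k = 0; k < n; k++) {
        for (int i = k + 1; i < n; i++) {
            A[i * n + k] /= A[k * n + k];           /* column k of L    */
            for (int j = k + 1; j < n; j++)         /* Schur complement */
                A[i * n + j] -= A[i * n + k] * A[k * n + j];
        }
    }
}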

To bound the amount of communication, we do not use Theorem 2.6.1 directly (we do not use the critical-path length to bound communication, but we do analyze the critical path later in this chapter). Since no two of these procedures execute in parallel at the same time, we can count the number of cache misses by summing the cache misses incurred in each phase, that is, in each execution of NewTriSolver or CilkSUMMA.

Theorem 5.0.4. The amount of communication F_P^{LUD}(C, n) in LUD when solving a problem of size n is O(n³/√C + n √C P log² n log(n√C)).

Proof. We count the amount of communication by phases. There are two NewTriSolver phases and one CilkSUMMA phase, which lead to the following recurrence:

  F_P^{LUD}(C, n) = 2F_P^{LUD}(C, n/2) + 2F_P^{TS}(C, n/2) + F_P^{CS}(C, n/2) + Θ(1)
    = 2F_P^{LUD}(C, n/2) + O(n³/√C + n √C P log(n√C) log n) + O(n³/√C + n √C P log(n√C)) + Θ(1)
    = 2F_P^{LUD}(C, n/2) + O(n³/√C + n √C P log(n√C) log n) + Θ(1)
    = Σ_{k=1}^{log n} 2^k [ (n/2^k)³/√C + (n/2^k) √C P log((n/2^k)√C) log(n/2^k) ]
    = (n³/√C) Σ_{k=1}^{log n} 4^{−k} + n √C P Σ_{k=1}^{log n} log((n/2^k)√C) log(n/2^k)
    = O( n³/√C + n √C P log² n log(n√C) ). □

Theorem 5.0.5. The critical-path length of LUD when its arguments are all n-by-n is O(n log²(n√C) log n).

Proof. The critical path of LUD depends on the critical paths of the NewTriSolver algorithm for solving triangular systems and of CilkSUMMA for matrix multiplication, so the critical path satisfies

  T_∞^{LUD}(C, n) = 2T_∞^{LUD}(C, n/2) + 2T_∞^{TS}(C, n/2, n/2) + T_∞^{CS}(C, n/2) + Θ(1).

Recall that T_∞^{CS}(C, n) = (n/√C) log(n√C) and, from Theorem 4.4.3, that T_∞^{TS}(C, n, n) = O(n log(n√C) log n). Therefore,

  T_∞^{LUD}(C, n) = 2T_∞^{LUD}(C, n/2) + 2T_∞^{TS}(C, n/2, n/2) + T_∞^{CS}(C, n/2) + Θ(1)
    = O( 2^{log n} + Σ_{k=1}^{log n} 2^k (n/2^k) log²((n/2^k)√C) + Σ_{k=1}^{log n} 2^{k−1} (n/(2^k √C)) log((n/2^k)√C) )
    = O( n + n log²(n√C) log n + (n/√C) log(n√C) log n )
    = O( n log²(n√C) log n ). □

The amount of communication incurred by LUD for C = n²/P is O(√P n² log³ n), which for large n is much tighter than the O(n³ log² n) bound of the LU decomposition algorithm shown in [21]. The parallelism of the LUD algorithm is T_1^{LUD}/T_∞^{LUD} = Ω(n³/(n log³ n)) = Ω(n²/log³ n), which is a factor of log n smaller than that of the implementation presented in [21].

CHAPTER 6

Conclusion and Open Problems

Cilk’s dag consistency employs relaxed consistency model in order to re- alize performance gains, but unlike dag consistency most distributed shared memories take a low level view of parallel programs and can not give analyt- ical performance bound. In this thesis we used the analytical tools of Cilk to design algorithms with tighter communication bounds for existing dense matrix multiplication, triangular solver and LU factorization algorithms, than the bounds obtained by [21]. Several experimental versions, such as Cilk-3 with the implementation of BACKER coherence algorithm were developed for the runtime system of the Connection Machine Model CM5 parallel super computer. However, official distribution of Cilk version with dag-consistent shared memory was never released and therefore it was not feasible to implement the above algorithms for distributed-shared memory environment. We leave it as anopenquestionwhether it is possible to tightenthe bounds on the number of communication and memory requirements for a factor of log n than the existing bounds, without compromising the other performance parameters and whether the dynamic cache size control prop- erty is required for obtaining such low communication bounds.

Bibliography

[1] R. C. Agarwal, S. M. Balle, F. G. Gustavson, M. Joshi, and P. Palkar. A three-dimensional approach to parallel matrix multiplication. IBM Journal of Research and Development, 39:575-582, 1995.
[2] Alok Aggarwal, Ashok K. Chandra, and Marc Snir. Communication complexity of PRAMs. Theoretical Computer Science, 71:3-28, 1990.
[3] J. Berntsen. Communication efficient matrix multiplication on hypercubes. Parallel Computing, 12:335-342, 1989.
[4] Robert D. Blumofe. Executing Multithreaded Programs Efficiently. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, September 1995.
[5] Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS), pages 356-368, Santa Fe, New Mexico, November 1994.
[6] Robert D. Blumofe, Matteo Frigo, Christopher F. Joerg, Charles E. Leiserson, and Keith H. Randall. Dag-consistent distributed shared memory. In Proceedings of the 10th International Parallel Processing Symposium (IPPS), pages 132-141, Honolulu, Hawaii, April 1996.
[7] L. E. Cannon. A Cellular Computer to Implement the Kalman Filter Algorithm. PhD thesis, Montana State University, 1969.
[8] E. Dekel, D. Nassimi, and S. Sahni. Parallel matrix and graph algorithms. SIAM Journal on Computing, 10:657-673, 1981.
[9] Michel Dubois, Christoph Scheurich, and Faye Briggs. Memory access buffering in multiprocessors. In Proceedings of the 13th Annual International Symposium on Computer Architecture, pages 434-442, June 1986.
[10] G. C. Fox, S. W. Otto, and A. J. G. Hey. Matrix algorithms on a hypercube I: Matrix multiplication. Parallel Computing, 4:17-31, 1987.
[11] Guang R. Gao and Vivek Sarkar. Location consistency: Stepping beyond the barrier. In Proceedings of the 24th International Conference on Parallel Processing, Oconomowoc, Wisconsin, August 1995.
[12] Kourosh Gharachorloo, Daniel Lenoski, James Laudon, Phillip Gibbons, Anoop Gupta, and John Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. In Proceedings of the 17th Annual International Symposium on Computer Architecture (ISCA), pages 15-26, Seattle, Washington, June 1990.
[13] Anshul Gupta and Vipin Kumar. The Scalability of Matrix Multiplication Algorithms on Parallel Computers. Department of Computer Science, University of Minnesota, 1991. Available on the Internet from ftp://ftp.cs.umn.edu/users/kumar/matrix.ps.
[14] H. Gupta and P. Sadayappan. Communication efficient matrix multiplication on hypercubes. In Proceedings of the 6th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '94), pages 320-329, June 1994.
[15] Dror Irony and Sivan Toledo. Trading replication for communication in parallel distributed-memory dense solvers. Submitted to Parallel Processing Letters, July 2001.
[16] Dror Irony and Sivan Toledo. Communication lower bounds for distributed-memory matrix multiplication. Submitted to Journal of Parallel and Distributed Computing, April 2001.


[17] S. Lennart Johnsson. Minimizing the communication time for matrix multiplication on multiprocessors. Parallel Computing, 19:1235-1257, 1993.
[18] C.-T. Ho, S. L. Johnsson, and A. Edelman. Matrix multiplication on hypercubes using full bandwidth and constant storage. In Proceedings of the Sixth Distributed Memory Computing Conference, pages 447-451, 1991.
[19] Leslie Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, C-28(9):690-691, September 1979.
[20] Charles E. Leiserson and Harald Prokop. A Minicourse on Multithreaded Programming. MIT Laboratory for Computer Science, Cambridge, Massachusetts, July 1998.
[21] Keith H. Randall. Cilk: Efficient Multithreaded Computing. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1998.
[22] Cilk-5.3.1 Reference Manual. Supercomputing Technologies Group, MIT Laboratory for Computer Science, June 2000. Available on the Internet from http://supertech.lcs.mit.edu/Cilk.
[23] Robert van de Geijn and Jerrell Watts. SUMMA: Scalable universal matrix multiplication algorithm. Concurrency: Practice and Experience, 9:255-274, 1997.
[24] Sivan Toledo. Locality of reference in LU decomposition with partial pivoting. SIAM Journal on Matrix Analysis and Applications, 18:1065-1081, 1997.
[25] Sivan Toledo. A survey of out-of-core algorithms in numerical linear algebra. In External Memory Algorithms, James M. Abello and Jeffrey Scott Vitter, eds., DIMACS Series in Discrete Mathematics and Theoretical Computer Science, American Mathematical Society, 1999, pages 161-179.
[26] I-Chen Wu and H. T. Kung. Communication complexity for parallel divide-and-conquer. School of Computer Science, Carnegie Mellon University, Pittsburgh.