appeared in IEEE Computer, Volume 28, Number 10, October 1995

An Overview of the PARADIGM Compiler for
Distributed-Memory Multicomputers

Prithviraj Banerjee, John A. Chandy, Manish Gupta†, Eugene W. Hodges IV, John G. Holm,
Antonio Lain, Daniel J. Palermo, Shankar Ramaswamy, and Ernesto Su

Center for Reliable and High-Performance Computing
University of Illinois at Urbana-Champaign
W. Main St., Urbana, IL
banerjee@crhc.uiuc.edu

Abstract

Distributed-memory multicomputers such as the Intel Paragon, the IBM SP, and the Thinking Machines CM-5 offer significant advantages over shared-memory multiprocessors in terms of cost and scalability. Unfortunately, extracting all the computational power from these machines requires users to write efficient software for them, which is a laborious process. The PARADIGM compiler project provides an automated means to parallelize programs written using a serial programming model for efficient execution on distributed-memory multicomputers. In addition to performing traditional compiler optimizations, PARADIGM is unique in that it addresses many other issues within a unified platform: automatic data distribution, communication optimizations, support for irregular computations, exploitation of functional and data parallelism, and multithreaded execution. This paper describes the techniques used and provides experimental evidence of their effectiveness on the Intel Paragon, the IBM SP, and the Thinking Machines CM-5.

Keywords: compilers, distributed-memory multicomputers, parallelizing compilers, parallel processing.

This research was supported in part by the Office of Naval Research, by the National Aeronautics and Space Administration (NASA NAG), and in part by an AT&T graduate fellowship, a Fulbright/MEC fellowship, an IBM graduate fellowship, and an ONR graduate fellowship. We are also grateful to the National Center for Supercomputing Applications, the San Diego Supercomputing Center, and the Argonne National Laboratory for providing access to their machines.

† Currently working at the IBM T. J. Watson Research Center, Yorktown Heights, NY.

A preliminary version of this paper appeared in the First International Workshop on Parallel Processing, Bangalore, India.

Introduction

Distributed-memory multicomputers can provide the high levels of performance required to solve the Grand Challenge computational science problems. Multicomputers such as the Intel Paragon, the IBM SP-1/SP-2, and the Thinking Machines CM-5 offer significant advantages over shared-memory multiprocessors in terms of cost and scalability. Unfortunately, extracting all the computational power from these machines requires users to write efficient software for them, which is a laborious process. One major reason for this difficulty is the absence of a global address space. As a result, the programmer has to manually distribute computations and data across processors and manage communication explicitly. The PARADIGM (PARAllelizing compiler for DIstributed-memory General-purpose Multicomputers) project at the University of Illinois addresses this problem by developing an automated means to parallelize and optimize sequential programs for efficient execution on distributed-memory multicomputers.

To understand the complexity of writing programs in a message-passing model, it is useful to examine the parallel code generated by the compiler, which roughly corresponds to what an experienced programmer might write, for a small example. Figure (a) shows an example serial program for Jacobi's iterative method for solving systems of linear equations, and Figure (b) shows a highly efficient parallel program for an Intel Paragon with a variable number of processors. From this example it is apparent that if a programmer were required to manually parallelize even a moderately sized application, the effort would be tremendous. Furthermore, coding with explicit communication operations commonly results in errors which are notoriously hard to find. By automating the parallelization process, it will become possible to offer high levels of performance to the scientific computing community at large.

Some of the other major research efforts in this area include Fortran D, Fortran 90D, and the Superb compiler. In order to standardize parallel programming with data distribution directives, High Performance Fortran (HPF) has also been developed in a collaborative effort between researchers in industry and academia. A number of commercial HPF compilers are beginning to appear, including products from Applied Parallel Research, Convex, Cray, Digital, IBM, The Portland Group, Silicon Graphics, Thinking Machines, and others.

What sets the PARADIGM project apart from other compiler efforts for distributed-memory multicomputers is the broad range of research topics being addressed.

Figure: Example program — Jacobi's iterative method. (a) Serial version: a simple relaxation kernel that repeatedly updates each interior element of an array from its four neighbors. (b) Parallel version for a variable number of processors with a minimum mesh configuration: the compiler-generated code configures the processor mesh (mgetnum, mgridinit, mgridcoord, mgridrel), computes local block bounds, packs and unpacks border elements, and exchanges them with neighboring processors through explicit csend/crecv message-passing calls.
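For concreteness, the serial input to the compiler is essentially the following relaxation kernel. This is a minimal sketch: the array size np, the iteration count ncycles, the initialization, and the 0.25 averaging are placeholder choices, since the constants of the published listing are not reproduced above.

      program jacobi
c     minimal sketch of the serial Jacobi relaxation kernel of
c     Figure (a); np, ncycles, and the initialization are
c     placeholder values chosen only to make the sketch runnable
      integer np, ncycles
      parameter (np = 64, ncycles = 10)
      real A(np, np), B(np, np)
      integer i, j, k
c     arbitrary initialization
      do j = 1, np
         do i = 1, np
            B(i, j) = real(i + j)
         end do
      end do
      do k = 1, ncycles
c        update each interior element from its four neighbors
         do j = 2, np - 1
            do i = 2, np - 1
               A(i, j) = 0.25 * (B(i-1, j) + B(i+1, j) +
     &                           B(i, j-1) + B(i, j+1))
            end do
         end do
c        copy the new values back for the next sweep
         do j = 2, np - 1
            do i = 2, np - 1
               B(i, j) = A(i, j)
            end do
         end do
      end do
      write (*, *) 'B(2,2) after', ncycles, ' sweeps:', B(2, 2)
      end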

In addition to performing traditional compiler optimizations to distribute computations and to reduce communication overheads, the research in the PARADIGM project aims at combining the following aspects: performing automatic data distribution for regular computations, optimizing communication for regular computations, supporting irregular computations using a combination of compile-time analysis and run-time support, exploiting functional and data parallelism simultaneously, and generating multithreaded, message-driven code to hide communication latencies. Current efforts in the project aim at integrating all of these capabilities into the PARADIGM framework. In this article we briefly describe the techniques used and provide experimental results to demonstrate their utility.

Sidebar: Distributed-Memory Compiler Terminology

Data Parallelism: Parallelism that exists by performing similar operations simultaneously across different elements of a data set (SPMD: Single Program, Multiple Data).

Functional Parallelism: Parallelism that exists by performing potentially different operations on different data sets simultaneously (MPMD: Multiple Program, Multiple Data).

Regular Computations: Computations that typically use dense, regular matrix structures; regular accesses can usually be characterized using compile-time analysis.

Irregular Computations: Computations that typically use sparse, irregular matrix structures; irregular accesses are usually input-data dependent, requiring run-time analysis.

Data Partitioning: The physical distribution of data across the processors of a parallel machine in order to efficiently use available memory and improve the locality of reference in parallel programs.

Global Index/Address: Index used to access an element of an array dimension when the entire dimension is physically allocated on the processor; equivalent to the index used in a serial program.

Local Index/Address: Index pair (processor, index) used to access an element of an array dimension when the dimension is partitioned across multiple processors; the local index can also refer to just the index portion of the pair.

Owner Computes Rule: States that all computations modifying the value of a data element are to be performed by the processor to which the element is assigned by the data partitioning.

User-Level Thread: A context of execution under user control which has its own stack and registers.

Multithreading: A set of user-level threads that share the same user data space and cooperatively execute a program.

Figure: PARADIGM compiler overview. A sequential FORTRAN 77 or HPF program is processed by Parafrase-2 (program analysis and dependence passes), followed by automatic data partitioning, irregular pattern analysis, and task graph synthesis. Regular-pattern static analysis and optimizations, irregular-pattern run-time support, and task allocation and scheduling produce SPMD or MPMD code, optionally transformed for multithreading with its run-time support. A generic library interface and code generation phase emits the optimized parallel program.

Compiler Framework

The figure above shows a functional illustration of how we envision the complete PARADIGM compilation system. The compiler accepts either a sequential FORTRAN 77 or High Performance Fortran (HPF) program and produces an explicit message-passing FORTRAN 77 program with calls to the selected message-passing library and our run-time system. The following are brief descriptions of the major phases in the parallelization process.

Program Analysis: Parafrase-2 is used as a preprocessing platform to parse the sequential program into an intermediate representation and to analyze the code to generate flow, dependence, and call graphs. Various code transformations, such as constant propagation and induction variable substitution, are also performed at this stage.

Automatic Data Partitioning: For regular computations, the data distribution is determined automatically by the compiler. It configures the machine into an abstract multidimensional mesh of processors and decides how program data is to be distributed on the mesh. Estimates of the time spent in computation and communication drive the selection of the data distribution. High-level communication operations and other communication optimizations performed in the compiler are reflected in the estimates in order to correctly determine the best distribution.

Regular Computations: Using the owner computes rule, the compiler divides computation across processors according to the selected data distribution and generates interprocessor communication for required nonlocal data. To avoid the overhead of computing ownership at run time, static analysis is used to partition loops at compile time (loop bounds reduction), as sketched below. In addition, a number of optimizations are performed to reduce the overhead of communication.
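As an illustration of loop bounds reduction, consider a one-dimensional iteration space of n elements block-distributed over p processors. Rather than guarding every iteration with an ownership test, each processor computes the bounds of the block it owns and iterates only over that range. The sketch below is illustrative only; n, p, and the block-size formula are assumptions, not PARADIGM's generated code.

      program bounds
c     illustrative sketch of loop bounds reduction for a BLOCK
c     distribution: each processor iterates only over the indices
c     it owns instead of testing ownership on every iteration
      integer n, p, b, myp, lo, hi
      parameter (n = 100, p = 4)
c     block size (the last processor may own a partial block)
      b = (n + p - 1) / p
      do myp = 0, p - 1
         lo = myp * b + 1
         hi = min((myp + 1) * b, n)
         write (*, *) 'processor', myp, ' executes i =', lo, ' to', hi
      end do
      end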

Irregular Computations: In many important applications, compile-time analysis is insufficient when communication patterns are data dependent and known only at run time. A subset of these applications has the interesting property that the communication pattern repeats across several steps. PARADIGM approaches these problems through a combination of flexible irregular run-time support and compile-time analysis. Novel features in our approach are the exploitation of spatial locality and the overlapping of computation and communication.

Functional Parallelism: Recent research has shown the benefits of simultaneous exploitation of functional and data parallelism for some applications. Such applications can be viewed as a graph composed of a set of data-parallel tasks with precedence relationships which describe the functional parallelism that exists among the tasks. Using this task graph, PARADIGM exploits functional parallelism by determining the number of processors to allocate for each data-parallel task and scheduling the tasks such that the overall execution time is minimized. The techniques used for regular and irregular data-parallel compilation are used to generate code for each of the data-parallel tasks.

Multithreading: Message-passing programs normally send messages asynchronously and block when waiting for messages, resulting in lower efficiency. One solution is to run multiple threads on each processor to overlap computation and communication. Multithreading allows one thread to utilize the unused cycles which would otherwise be wasted waiting for messages. Compiler transformations are used to convert message-passing code into a message-driven model, thereby simplifying the multithreading run-time system. Multithreading is most beneficial for programs with a high percentage of idle cycles, such that the overhead of switching between threads can be hidden.

Generic Library Interface: Support for specific communication libraries is provided through a generic library interface. For each supported library, abstract functions are mapped to corresponding library-specific code generators at compile time. Library interfaces have been implemented for Thinking Machines CMMD, Parasoft Express, MPI, Intel NX, PVM, and PICL. Execution tracing, as well as support for multiple platforms, is also provided in Express, PVM, and PICL. The portability of this interface allows the compiler to easily generate code for a wide variety of machines.

In the remainder of this paper, each of the major areas within the compiler is described in more detail. The sections that follow outline the techniques used in automatic data partitioning, describe the analysis and optimizations performed for regular computations, present our approach for irregular computations, explore the simultaneous use of functional and data parallelism, and report on multithreaded message-driven code.

Sidebar: Data Distribution

Figure: Examples of data distributions for a two-dimensional array — (a) block, (b) cyclic(k), (c) (block, block), and (d) (cyclic(k), cyclic(k)), with each array element labeled by the processor (0-3) that owns it.

Arrays are physically distributed across processors to efficiently use available memory and improve the locality of reference in parallel programs. In High Performance Fortran, either the programmer or the compiler must specify the distribution of program data. Several examples of data distributions are shown for a two-dimensional array; each dimension of an array can be given a specific distribution. Blocked and cyclic distributions are actually two extremes of a general cyclic(k) distribution, commonly referred to as block-cyclic, where k is the block size. A block distribution is equivalent to a block-cyclic distribution in which the block size is the size of the original array divided by the number of processors, cyclic(N/P). A cyclic distribution is simply a block-cyclic distribution with a block size of one, cyclic(1).
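The mapping implied by these distributions is mechanical. The following sketch, with n, p, and k chosen arbitrarily for illustration, enumerates the processor and local index that a cyclic(k) distribution assigns to each global index of a one-dimensional array; setting k = N/P or k = 1 covers the block and cyclic cases, respectively.

      program gl2loc
c     illustrative sketch: map each global index g (1..n) of a
c     one-dimensional array to its (processor, local index) under a
c     cyclic(k) distribution over p processors numbered 0..p-1;
c     k = n/p gives a block distribution and k = 1 a cyclic one
      integer n, p, k, g, blk, proc, loc
      parameter (n = 16, p = 4, k = 2)
      do g = 1, n
c        block number, owning processor, and local index within the
c        processor's own sequence of blocks
         blk  = (g - 1) / k
         proc = mod(blk, p)
         loc  = (blk / p) * k + mod(g - 1, k) + 1
         write (*, *) g, ' -> processor', proc, ' local index', loc
      end do
      end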

Figure: Automatic data partitioning overview. The sequential program's internal Parafrase-2 representation is processed by detector, driver, and solver modules in successive phases (array alignment, block/cyclic distribution, block size selection, and mesh configuration), guided by computational and communication cost estimators, to produce the data distribution specifications.

Automatic Data Partitioning

Determining the best data partitioning for a given application is a difficult task that requires careful examination of numerous tradeoffs. Since communication tends to be more expensive relative to local computation, a partitioning should be selected to maintain high data locality for each processor. Excessive communication can easily offset any gains made through the use of available parallelism in the program. At the same time, the partitioning should also evenly distribute the workload among the processors, making full use of the parallelism present in the computation. Since the programmer may not be (and should not have to be) aware of all the interactions between distribution decisions and compiler optimizations, automatic data partitioning:

- reduces the burden on the programmer,
- improves program portability and machine independence, and
- improves the selection of data distributions.

In the compiler, data partitioning decisions are made in a number of distinct phases, as illustrated in the figure above. Often there is a tradeoff between minimizing interprocessor communication and exploiting all available parallelism; the communication and the computational costs imposed by the underlying architecture must both be taken into account. These costs are generated using architectural parameters for each target machine. With the exception of the architecture-specific costs, the partitioning algorithm is machine independent.

Below are brief descriptions of each of the major phases performed during the partitioning pass. Each phase involves identification of data distribution preferences by a detector module, assignment of costs by a driver to quantify the estimated performance impact of those preferences, and the resolution of any conflicts by a solver.

Array Alignment: The alignment pass identifies which array dimensions should be mapped to the same processor mesh dimension. The alignment preferences between two arrays can be between different pairings of dimensions (interdimensional alignment) as well as by an offset or stride within a given pair of dimensions (intradimensional alignment). Currently, only interdimensional alignment is performed in the partitioning pass.

Block/Cyclic Distribution: Once array alignment has been performed, the distribution pass determines whether each array dimension should be distributed in a blocked or cyclic manner. Array dimensions are first classified by their communication requirements. If the communication in a mesh dimension is recognized as a nearest-neighbor pattern, it indicates the need for a blocked distribution. For dimensions that are only partially traversed (less than a certain threshold), a cyclic distribution may be more desirable for load balancing. Using alignment information from the previous phase, array dimensions that cross-reference each other are assigned the same kind of partitioning to ensure the intended alignment.

Block Size Selection: When a cyclic distribution is chosen, the compiler is able to make further adjustments to the block size, giving rise to block-cyclic partitionings. Since a cyclic distribution is chosen in the previous phase only to improve the load balance for partially traversed array dimensions, a closer examination of the communication costs must be performed. This analysis is sometimes needed when arrays are used to simulate record-like structures (not supported directly in FORTRAN 77) or when lower-dimensional arrays play the role of higher-dimensional arrays.

Mesh Configuration: After all the distribution parameters have been determined, the cost estimates are functions of only the number of processors in each mesh dimension. For each set of aligned array dimensions, the compiler determines if there are any parallel operations performed. If no parallelism exists in a given dimension, it is collapsed onto a single processor. If there is only one dimension that has not been collapsed, all processors are assigned to this dimension. In the case of multiple dimensions of parallelism, the compiler determines the best arrangement of processors by evaluating the cost expression to estimate execution time for each feasible configuration.
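The last step amounts to a small search. The sketch below enumerates the feasible two-dimensional mesh shapes for a fixed processor count and keeps the one minimizing a stand-in cost expression; both the cost formula and its constants are placeholders, not PARADIGM's estimator.

      program meshcfg
c     illustrative sketch: enumerate feasible 2-D mesh configurations
c     p1 x p2 (with p1*p2 = p) and keep the one with the lowest
c     estimated execution time; the cost function is a placeholder
      integer p, p1, p2, b1, b2
      real cost, best
      parameter (p = 16)
      best = 1.0e30
      b1 = 0
      b2 = 0
      do p1 = 1, p
         if (mod(p, p1) .eq. 0) then
            p2 = p / p1
c           placeholder estimate: computation shrinks with p1*p2,
c           communication grows with the mesh perimeter
            cost = 1.0e6 / real(p1 * p2) + 5.0e3 * real(p1 + p2)
            if (cost .lt. best) then
               best = cost
               b1 = p1
               b2 = p2
            end if
         end if
      end do
      write (*, *) 'selected mesh:', b1, ' x', b2, ' cost', best
      end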

At this point, the distribution information is passed on to direct the remainder of the parallelization process in the compiler. The user may also desire to generate an HPF program containing the directives which specify the selected partitioning. This technique allows the partitioning pass to be used as an independent tool while remaining integrated with the compilation system. Furthermore, being integrated ensures that the partitioning pass is always aware of the optimizations performed by the compiler. For more complex programs it is also possible to further improve performance by redistributing data at selected points in the program. The static partitioner is currently being extended to automatically determine when such dynamic data partitionings are useful.

Regular Computations

For regular computations, in which the communication pattern can be determined at compile time, PARADIGM uses static analysis to partition computation across processors and to generate optimized interprocessor communication. To do this analysis efficiently, the compiler needs a mechanism to describe partitioned data and iteration sets. Processor Tagged Descriptors (PTDs) are used to provide a uniform representation of these sets for every processor. Operations on PTDs are extremely efficient, simultaneously capturing the effect on all processors in a given dimension.

PTDs, however, are not general enough to handle the most complicated array distributions, references, and loop bounds that are occasionally found in real codes (see the figure below). For these complex cases, PARADIGM represents partitioned data and iteration sets by symbolic linear inequalities and generates loops to scan these regions using a technique known as Fourier-Motzkin projection. To implement Fourier-Motzkin projection, Mathematica, a powerful off-the-shelf symbolic analysis system, is linked with the compiler to provide a high level of symbolic support as well as rapid prototyping. Thus, by using the efficient PTD representation for the simplest and most frequent cases and a more general inequality-based representation for the difficult cases, PARADIGM is able to compile a larger proportion of programs without jeopardizing compilation speed.

Figure: Example loop with complex array references and distributions — a triangular loop nest (the bounds of the inner DO j loop depend on i) in which array A, aligned through an HPF ALIGN directive with a template T distributed (CYCLIC, BLOCK) onto a processor mesh, is assigned from array B, distributed (CYCLIC, CYCLIC) onto the same mesh, through strided subscript expressions in i and j.

The performance of the resulting parallel program also greatly depends on how well its interprocessor communication has been optimized. Since the start-up cost of communication (overhead) tends to be several orders of magnitude greater than either the per-element computation cost or the per-byte transmission cost (rate), frequent communication can easily dominate the execution time. A linear point-to-point transfer cost of a message of m bytes is used as a basis for the communication model:

    transfer(m) = overhead + rate * m

where the values for overhead and rate are empirically measured for a given machine.
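The imbalance between the start-up and per-byte terms is what makes message combining pay off. The following sketch evaluates the model for n per-element messages versus one combined message of the same total size; the overhead, rate, and message parameters are invented for illustration rather than measured values.

      program commcost
c     illustrative comparison of n per-element messages versus one
c     vectorized message under transfer(m) = overhead + rate*m;
c     overhead and rate below are made-up values (microseconds)
      real overhead, rate, bytes, tsingle, tvect
      integer n
      parameter (n = 1000)
      overhead = 100.0
      rate = 0.05
      bytes = 4.0
c     n separate single-element messages pay the start-up cost n times
      tsingle = real(n) * (overhead + rate * bytes)
c     one combined message pays it once for the same data volume
      tvect = overhead + rate * (real(n) * bytes)
      write (*, *) 'per-element (us):', tsingle
      write (*, *) 'vectorized  (us):', tvect
      end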

Several optimizations are employed that combine messages in different ways to amortize the start-up cost, thereby reducing the total amount of communication overhead in the program. In loops where there are no cross-iteration dependencies, parallelism is extracted by independently executing groups of iterations on separate processors. For independent references, message coalescing, message vectorization, and message aggregation are used to reduce the overhead associated with frequent communication. For references within loops that contain cross-iteration dependencies, coarse-grain pipelining is employed to optimize communication across loops while balancing the overhead with the available parallelism.

Message Coalescing: Redundant communication for different references to the same data is unnecessary if the data has not been modified between uses. When statically analyzing individual references, redundant communication is detected and coalesced into a single message, allowing the data to be reused rather than communicated for every reference. For different sections of a given array, individual elements are coalesced by performing a union of the different sections, thereby ensuring that overlapping data elements are communicated only once. Since coalescing will either eliminate entire communication operations or reduce the size of messages containing array sections, it is always beneficial.

Figure: Optimizations used to reduce the overhead associated with frequent communication — (a) message vectorization and (b) message aggregation, each illustrated before and after for array elements communicated between processors P1 and P2.

Message Vectorization: Nonlocal elements of an array that are indexed within a loop nest can also be vectorized into a single larger message instead of being communicated individually (see part (a) of the figure above). Dependence analysis is used to determine the outermost loop at which the combining can be applied. The element-wise messages are combined, or vectorized, as they are lifted out of the enclosing loop nests to the selected vectorization level. Vectorization reduces the total number of communication operations (and hence the total overhead) while increasing the message length.

Message Aggregation: Multiple messages to be communicated between the same source and destination can also be aggregated into a single larger message. Communication operations are first sorted by their destinations during the analysis. Messages with identical source and destination pairs are then combined into a single communication operation (see part (b) of the figure above). Aggregation can be performed on communication operations of individual data references as well as on vectorized communication operations. The gain from aggregation is similar to vectorization in that the total overhead is reduced at the cost of increasing the message length.

Figure: Comparison of message coalescing, vectorization, and aggregation on the CM-5. Execution time, normalized to the serial code and separated into busy and overhead components, is shown for ADI, EXPL, and Jacobi (including a 1-D distributed Jacobi variant). SERIAL is the original unmodified serial code; COAL is the statically partitioned parallel program with message coalescing; VECT adds message vectorization to COAL; AGGR adds message aggregation to COAL; ALL adds both vectorization and aggregation to COAL.

To illustrate the efficacy of these optimizations, the performance of several program fragments executed on the CM-5 is shown in the figure above. The automatic data distribution pass selected linear 1-D partitionings for both ADI Integration and Explicit Hydrodynamics (two Livermore kernels), and a 2-D partitioning for Jacobi's iterative method, similar to that previously shown in the example program.

For comparison purposes, the reported execution times have been normalized to the serial execution of the corresponding program and are further separated into two quantities:

- Busy: time spent performing the actual computation.
- Overhead: time spent executing code related to computation partitioning and communication.

The relative effectiveness of each optimization can be seen by examining the amount of overhead eliminated as the optimizations are incrementally applied.

Figure: Pipelined execution of recurrences — (a) sequential, (b) fine-grain, and (c) coarse-grain execution of a row-partitioned computation on processors P0, P1, and P2, with an inset showing speedup as a function of granularity and the optimal granularity lying between the fine and coarse extremes.

It is also interesting to notice that an additional run of a 1-D partitioned version of Jacobi shows a higher overhead compared to the compiler-selected 2-D version. This shows the effectiveness of the automatic data partitioning pass, since it was able to select the best distribution despite the small differences in performance. For larger machine sizes and more complex programs, the utility of automatic data distribution will be even more apparent, as the communication costs become greater for inferior data distributions.

Coarse-Grain Pipelining: In cases where there are cross-iteration dependencies due to recurrences, it is not possible to immediately execute every iteration in parallel. Often, however, there is the opportunity to overlap parts of the loop execution, synchronizing to ensure that the data dependencies are enforced.

To illustrate this technique, assume an array is block-partitioned by rows and dependencies exist from the previous row and previous column. In Figure (a), each processor performs an operation on every element of the rows it owns before sending the border row to the waiting processor, thereby serializing execution of the entire computation. Instead, in Figure (b), the first processor can compute the elements of one partitioned column and then send the border element of that column to the next processor, such that it can begin its computation immediately. Ideally, if communication has zero overhead, this is the most efficient form of computation, since no processor will wait unnecessarily. However, as discussed earlier, the cost of performing numerous single-element communications can be quite expensive compared to the small strips of computation. To address this problem, this overhead can be reduced by increasing the granularity of the communication (see Figure (c)). An analytic pipeline model has been developed, using estimates of computation and communication, to allow the compiler to automatically select a granularity that results in near-optimal performance.
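The granularity selection can be pictured with a toy version of such a model: with p pipeline stages, n columns per processor, a per-column compute time, and strips of width w, execution takes roughly (n/w + p - 1) strip steps, each costing the strip's computation plus one message. The sketch below minimizes this expression over w; the model and all of its constants are illustrative stand-ins, not the compiler's analytic pipeline model.

      program pipemdl
c     toy model of coarse-grain pipelining: pick the strip width w
c     minimizing (n/w + p - 1) * (w*tc + overhead + rate*w*bytes);
c     every constant here is an illustrative placeholder
      integer p, n, w, bestw
      real tc, overhead, rate, bytes, steps, tstep, t, best
      parameter (p = 8, n = 512)
      tc = 2.0
      overhead = 100.0
      rate = 0.05
      bytes = 4.0
      best = 1.0e30
      bestw = 1
      do w = 1, n
         if (mod(n, w) .eq. 0) then
            steps = real(n / w + p - 1)
            tstep = real(w) * tc + overhead + rate * real(w) * bytes
            t = steps * tstep
            if (t .lt. best) then
               best = t
               bestw = w
            end if
         end if
      end do
      write (*, *) 'selected strip width', bestw, ' modeled time', best
      end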

Irregular Computations

In many important applications, compile-time analysis is insufficient when the required communication patterns are data dependent and thus are only known at run time. For example, the computation of airflow and surface stress over an airfoil may utilize an irregular finite element grid such as the one shown in the figure. To efficiently run such irregular applications on a massively parallel multicomputer, run-time compilation techniques can be used. The dependency structure of the program is analyzed in a preprocessing step before the actual computation occurs. If the same structure of a computation is maintained across several steps, this preprocessing can be reused, amortizing its cost. In practice this concept is implemented with two sequences of code: an inspector for preprocessing and an executor for performing the actual computation.

Figure: Finite element airfoil grid.
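A minimal sketch of the inspector/executor structure, from a single processor's point of view, is shown below. The processor owns a contiguous range of grid-node values and sweeps over edges given by indirection arrays; the inspector records once which referenced nodes are off-processor, and the executor reuses that schedule on every step. All of the data, the ownership range, and the simulated gather are invented for illustration and stand in for what the irregular run-time support actually manages.

      program insexec
c     illustrative inspector/executor sketch for one processor that
c     owns grid nodes lo..hi of a value array x; edges reference
c     nodes through indirection arrays, and off-processor references
c     are collected once by the inspector and reused by the executor;
c     all data below are invented for illustration
      integer nedge, lo, hi, maxfch
      parameter (nedge = 5, lo = 1, hi = 4, maxfch = 16)
      integer e1(nedge), e2(nedge)
      integer fetch(maxfch), nfetch, i, j, k, step
      real x(8), buf(maxfch), sum
      data e1 /1, 2, 3, 4, 2/
      data e2 /2, 5, 6, 1, 7/
      data x /1., 2., 3., 4., 5., 6., 7., 8./
c     --- inspector: build the list of distinct off-processor nodes ---
      nfetch = 0
      do i = 1, nedge
         k = e2(i)
         if (k .lt. lo .or. k .gt. hi) then
            do j = 1, nfetch
               if (fetch(j) .eq. k) goto 10
            end do
            nfetch = nfetch + 1
            fetch(nfetch) = k
 10         continue
         end if
      end do
      write (*, *) 'inspector: fetch', nfetch, ' nonlocal nodes'
c     --- executor: reuse the schedule on every computation step ---
      do step = 1, 3
c        gather nonlocal values (a real run would communicate here)
         do j = 1, nfetch
            buf(j) = x(fetch(j))
         end do
c        sweep over local edges using local or gathered values
         sum = 0.
         do i = 1, nedge
            k = e2(i)
            if (k .ge. lo .and. k .le. hi) then
               sum = sum + x(e1(i)) * x(k)
            else
               do j = 1, nfetch
                  if (fetch(j) .eq. k) sum = sum + x(e1(i)) * buf(j)
               end do
            end if
         end do
         write (*, *) 'executor step', step, ' result', sum
      end do
      end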

The preprocessing step performed by the inspector can be very complex: the unstructured grid is partitioned, the resulting communication patterns are optimized, and global indices are translated into local indices. During the executor phase, elements are communicated based on this preprocessing analysis. In order to simplify the implementation of inspectors and executors, irregular run-time support (IRTS) is used to provide primitives for these operations.

Figure: Edge redistribution for Rotor on the IBM SP — time (in seconds) versus number of processors for (a) the schedule (inspector) phase and (b) the redistribute (executor) phase, comparing CHAOS/PARTI, PILAR with enumeration, and PILAR with intervals.

There are several ways to improve a state-of-the-art IRTS such as CHAOS/PARTI. The internal representation of communication patterns in such systems is somewhat restricted: they represent irregular patterns that are completely enumerated, or regular block patterns. Neither optimizes regular and irregular accesses together, nor efficiently supports the small regular blocks that arise in irregular applications written for the exploitation of spatial cache locality. Moreover, systems such as CHAOS/PARTI do not provide nonblocking communication primitives, which can further increase performance.

All of these problems are addressed in the Parallel Irregular Library with Application of Regularity (PILAR), PARADIGM's IRTS for irregular computations. PILAR is written in C++ to easily support different internal representations of communication patterns, allowing for efficient handling of a wide range of applications, from fully irregular to regular, using a common framework. PILAR uses intervals for the description of small regular blocks, as discussed earlier, and enumeration for patterns with little or no regularity. The object-oriented nature of the library simplifies the implementation of new representations as well as the interactions among objects that have different internal representations.

An experiment is performed to evaluate the effectiveness of PILAR in exploiting spatial regularity in irregular applications. The overhead of redistributing the edges of an unstructured grid is measured after a partitioner has assigned nodes to processors. We assume a typical CSR (Compressed Sparse Row, or Harwell-Boeing) initial layout, in which the edges of a given grid node are contiguous in memory. Redistribution is done in two phases: the first phase (inspector) computes a schedule that captures the redistribution of the edges and sorts the new global indices; the second phase (executor) redistributes the array with the previously computed schedule using a global data exchange primitive.

The experiment uses a large unstructured grid (Rotor) from NASA. A large ratio between the maximum degree and the average degree of a node in this grid would cause a two-dimensional matrix representation of the edges to be very inefficient, so multidimensional optimizations in CHAOS, or in PILAR with enumeration, cannot be used. The performance of CHAOS/PARTI is compared against PILAR with both enumeration and intervals during the two phases of the redistribution. Results for the IBM SP appear in the figure above, which clearly shows the benefit of using the more compact interval representation. Further experiments also show that only three edges per grid node are required to benefit from an interval-based representation on the SP.

Even with adequate IRTS, the generation of efficient inspector/executor code for irregular applications is fairly complex. In PARADIGM, compiler analysis for irregular computations will be used to detect reuse of preprocessing, insert communication primitives, and highlight opportunities to exploit spatial locality. After performing this analysis, the compiler will generate inspector/executor code with embedded calls to PILAR routines.

Functional and Data Parallelism

The efficiency of data-parallel execution tends to drop off for larger numbers of processors for a given problem size, or for smaller problem sizes for a given number of processors. By exploiting functional parallelism in addition to data parallelism, the overall execution efficiency of a program can sometimes be improved. A task graph known as a Macro Dataflow Graph (MDG) is used to represent both the functional and data parallelism available in a program. The MDG for a given program is a weighted directed acyclic graph (DAG), with nodes representing data-parallel routines in the program and edges representing precedence constraints among these routines. In the MDG, data parallelism is implicit in the weight functions of the nodes, while functional parallelism is captured by the precedence constraints among nodes.

Figure: Example of functional parallelism — (a) a macro dataflow graph with three nodes N1, N2, and N3, together with their execution times and efficiencies as functions of the number of processors; (b) allocation and scheduling on four processors: a pure data-parallel (SPMD) schedule with t1 = 16.6 s versus an MPMD schedule with t2 = 15.3 s in which N2 and N3 run concurrently.

The weights of the nodes and edges are based on the processing and data redistribution costs. The processing cost is the computation and communication time required for the execution of a data-parallel routine and depends on the number of processors used to execute the routine. Scheduling may make it necessary to redistribute an array between the execution of a pair of routines; the time required for this data redistribution depends on the number of processors as well as the data distributions used by the routines.

To determine the best execution strategy for a given program, an allocation and scheduling approach is used on the MDG. Allocation determines the number of processors to use for each node, while scheduling results in an execution scheme for the allocated nodes on the target multicomputer. Figure (a) shows an MDG with three nodes, N1, N2, and N3, along with the processing costs and efficiencies of the nodes as a function of the number of processors they use. For this example, assume there are no data redistribution costs among the three routines. Given a four-processor system, two schemes of execution for the program are shown pictorially in Figure (b). The first scheme exploits pure data parallelism, with all routines using four processors. The second scheme exploits both functional and data parallelism, with routines N2 and N3 executing concurrently and using two processors each. As shown by this example, good allocation and scheduling can decrease the program execution time.
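The comparison the scheduler makes can be mimicked with hypothetical processing-cost functions: the pure data-parallel finish time is t1(4) + t2(4) + t3(4), while the mixed schedule costs t1(4) + max(t2(2), t3(2)), and imperfect efficiency decides which wins. The sketch below uses an invented cost model (not the values plotted in the figure) purely to illustrate the arithmetic.

      program mdgsched
c     illustrative comparison of a pure data-parallel (SPMD) schedule
c     with an MPMD schedule in which N2 and N3 run concurrently on
c     half of the machine each; the cost model and its constants are
c     invented placeholders, not the values in the figure
      real t1, t2, t3, tspmd, tmpmd, tnode
c     SPMD: every routine uses all 4 processors, one after another
      t1 = tnode(10.0, 4)
      t2 = tnode(8.0, 4)
      t3 = tnode(8.0, 4)
      tspmd = t1 + t2 + t3
c     MPMD: N1 on 4 processors, then N2 and N3 on 2 processors each
      t2 = tnode(8.0, 2)
      t3 = tnode(8.0, 2)
      tmpmd = t1 + max(t2, t3)
      write (*, *) 'SPMD finish time:', tspmd
      write (*, *) 'MPMD finish time:', tmpmd
      end

      real function tnode(tserial, p)
c     placeholder processing-cost model: parallel time improves more
c     slowly than 1/p because efficiency drops as processors are added
      real tserial, eff
      integer p
      eff = 1.0 / (1.0 + 0.15 * real(p - 1))
      tnode = tserial / (real(p) * eff)
      return
      end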

The allocation and scheduling algorithms in PARADIGM are based on the mathematical forms of the processing and data redistribution cost functions: it can be shown that they belong to a class of functions known as posynomials. This property is used to formulate the problem with a form of convex programming for optimal allocation. After allocation, a list scheduling policy is used for scheduling the nodes on a given system. The finish time obtained using this scheme has been

shown to be within a constant factor of the optimal finish time in theory; in practice this factor is small.

Figure: SPMD/MPMD performance comparison — speedup versus number of processors (32, 64, and 128) for Strassen's matrix multiplication (256x256 matrices) and a computational fluid dynamics code (128x128 mesh) on (a) the CM-5 and (b) the Paragon.

In the figure, the performance of the allocation and scheduling approach is compared to that of a pure data-parallel approach. Speedups are computed for the Paragon and the CM-5 for a pair of applications; performance using the allocation and scheduling approach is identified as MPMD, and performance for the pure data-parallel scheme as SPMD. The first application shown is Strassen's matrix multiplication algorithm; the second is a computational fluid dynamics code using a spectral method. For machines with a large number of processors, the performance of the MPMD execution relative to SPMD is improved by a factor of about two to three. These results demonstrate the utility of the allocation and scheduling approach.

Multithreading

When the resulting parallel program has a high percentage of idle cycles, multithreading can be used to further improve performance. By running multiple threads on each processor, one of the threads can utilize the cycles which would otherwise be wasted waiting for messages. To support multithreaded execution, message-passing code is first generated by the compiler for a number of virtual processors greater than the number of physical processors in a given machine. Multiple virtual processors are then mapped onto physical processors, resulting in multiple threads of execution for each physical processor.

In order to execute multithreaded code efficiently, compiler transformations are used to convert message-passing code into a message-driven model, thereby simplifying the multithreading run-time system (MRTS). The transformation required is simple for code without conditionals, but becomes more complex when conditionals and loops are included. Although this section only presents the transformation for converting while loops to message-driven code, similar transformations can be performed on other control structures.

Figure: Transformation of the while statement — (a) original control flow graph: code A, then a while loop whose body contains code B, a blocking receive, and code C, followed by code D; (b) transformed control flow graph, in which the loop is split into routines main1 and main2 as described below.

The transformation of the while loop is shown in the figure above: part (a) shows the control flow graph of a message-passing program containing a receive in a while loop, and part (b) shows the transformed code. In the transformed code, main1 is constructed such that code A is executed, followed by the while condition check. If the while loop condition is true, code B is executed, the receive is started, the routine enables main2, and main1 returns to the MRTS to execute other threads. If the while loop condition is false, code D is executed and the thread ends. If main2 is enabled, it will receive its message and execute code C, which is the code after the receive inside the while loop. At this point the original code would check for loop completion; therefore the transformed code must also perform this check, and if it is true, main2 enables another invocation of itself and returns to the MRTS. Otherwise, code D is executed and the thread ends. Multiple copies of main1 and main2 can be executed on the processors to increase the degree of multithreading.
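This control-flow restructuring can be mimicked in miniature: each thread becomes a small state machine, a blocking receive becomes "return to the scheduler and wait to be re-enabled", and a trivial scheduler delivers messages and resumes whichever continuation is ready. Everything in the sketch below (the scheduler loop, mailbox counters, thread states, and trip count) is an invented stand-in for the MRTS, intended only to show the shape of message-driven execution.

      program msgdrv
c     toy simulation of the message-driven form of a while loop:
c     instead of blocking in a receive, each thread returns to a tiny
c     scheduler and is resumed when its message arrives; the
c     scheduler, mailboxes, states, and trip count are all invented
c     for illustration and are not PARADIGM's MRTS
      integer nthr, niter
      parameter (nthr = 2, niter = 3)
      integer state(nthr), count(nthr), mail(nthr)
      integer t, done
c     state 1 = run main1 next, 2 = main2 waiting for a message,
c     state 3 = thread finished
      do t = 1, nthr
         state(t) = 1
         count(t) = 0
         mail(t) = 0
      end do
      done = 0
      do while (done .lt. nthr)
c        deliver one message to every thread still running
         do t = 1, nthr
            if (state(t) .ne. 3) mail(t) = mail(t) + 1
         end do
c        resume whichever continuations are enabled
         do t = 1, nthr
            if (state(t) .eq. 1) then
c              main1: code A, then the while condition check
               write (*, *) 'thread', t, ': main1, code A'
               if (count(t) .lt. niter) then
c                 code B, start the receive, enable main2, return
                  write (*, *) 'thread', t, ': code B'
                  state(t) = 2
               else
                  write (*, *) 'thread', t, ': code D, end'
                  state(t) = 3
                  done = done + 1
               end if
            else if (state(t) .eq. 2 .and. mail(t) .gt. 0) then
c              main2: message arrived; code C, then the loop check
               mail(t) = mail(t) - 1
               count(t) = count(t) + 1
               write (*, *) 'thread', t, ': code C, iter', count(t)
               if (count(t) .lt. niter) then
c                 loop again: code B, start receive, enable main2
                  write (*, *) 'thread', t, ': code B'
               else
                  write (*, *) 'thread', t, ': code D, end'
                  state(t) = 3
                  done = done + 1
               end if
            end if
         end do
      end do
      end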

This transformation was performed on the following four scientific applications, written in the SPMD programming model with blocking receives (in the message-driven model, receive operations are used to switch between threads and therefore must return control to the multithreading run-time system):

- GS: Gauss-Seidel iterative solver
- QR: Givens QR factorization of a dense matrix
- IMPL-2D: 2-D distribution of Implicit Hydrodynamics (Livermore kernel 23)
- IMPL-1D: 1-D distribution of Implicit Hydrodynamics (Livermore kernel 23)

All were run with large matrices on the CM-5.

Figure: Speedup of message-driven threads on the CM-5 — (a) Gauss-Seidel solver (2048x2048 matrix), (b) Givens QR factorization (512x512 matrix), (c) IMPL-2D Livermore kernel 23 (1024x1024 matrix), and (d) IMPL-1D Livermore kernel 23 (1024x1024 matrix), comparing SPMD code with message-driven code using 1, 2, 4, and 8 threads per processor.

The figure above compares the speedup of SPMD code with that of message-driven code with varying numbers of threads per processor. For the Gauss-Seidel and Givens QR applications, the amount of available parallelism inhibits the speedup; on the other hand, cache effects produce superlinear speedup for Implicit Hydrodynamics. The message-driven versions of the code outperform the SPMD versions in all cases except IMPL-1D, where multithreading causes a significant increase in communication costs. For the other applications, improvement is seen when two to eight threads are used, but the exact number of threads that produces the maximum speedup varies. This indicates that the number of threads required for optimal speedup is somewhat application dependent.

Conclusions

PARADIGM is a flexible parallelizing compiler for multicomputers. It can automatically distribute program data and perform a variety of communication optimizations for regular computations, as well as provide support for irregular computations using compilation and run-time techniques. For programs which have functional parallelism, the compiler can increase performance through proper resource allocation and scheduling. PARADIGM can also use multithreading to further increase the efficiency of codes by overlapping communication and computation. We believe that all these methods are useful for compiling a wide range of applications for distributed-memory multicomputers.

References

S. Hiranandani, K. Kennedy, and C. Tseng, "Compiling Fortran D for MIMD Distributed-Memory Machines," Communications of the ACM, Aug.

Z. Bozkus, A. Choudhary, G. Fox, T. Haupt, and S. Ranka, "Fortran 90D/HPF Compiler for Distributed Memory MIMD Computers: Design, Implementation, and Performance Results," in Proceedings of the ACM International Conference on Supercomputing, Tokyo, Japan, July.

B. Chapman, P. Mehrotra, and H. Zima, "Programming in Vienna Fortran," Scientific Programming, Aug.

C. Koelbel, D. Loveman, R. Schreiber, G. Steele Jr., and M. Zosel, The High Performance Fortran Handbook. Cambridge, MA: The MIT Press.

C. D. Polychronopoulos, M. Girkar, M. R. Haghighat, C. L. Lee, B. Leung, and D. Schouten, "Parafrase-2: An Environment for Parallelizing, Partitioning, Synchronizing, and Scheduling Programs on Multiprocessors," in Proceedings of the International Conference on Parallel Processing, St. Charles, IL, Aug.

M. Gupta and P. Banerjee, "PARADIGM: A Compiler for Automated Data Partitioning on Multicomputers," in Proceedings of the ACM International Conference on Supercomputing, Tokyo, Japan, July.

D. J. Palermo, E. Su, J. A. Chandy, and P. Banerjee, "Compiler Optimizations for Distributed Memory Multicomputers used in the PARADIGM Compiler," in Proceedings of the International Conference on Parallel Processing, St. Charles, IL, Aug.

A. Lain and P. Banerjee, "Exploiting Spatial Regularity in Irregular Iterative Applications," in Proceedings of the International Parallel Processing Symposium, Santa Barbara, CA, Apr.

S. Ramaswamy, S. Sapatnekar, and P. Banerjee, "A Convex Programming Approach for Exploiting Data and Functional Parallelism on Distributed Memory Multicomputers," in Proceedings of the International Conference on Parallel Processing, St. Charles, IL, Aug.

J. G. Holm, A. Lain, and P. Banerjee, "Compilation of Scientific Programs into Multithreaded and Message Driven Computation," in Proceedings of the Scalable High Performance Computing Conference, Knoxville, TN, May.

E. Su, D. J. Palermo, and P. Banerjee, "Processor Tagged Descriptors: A Data Structure for Compiling for Distributed-Memory Multicomputers," in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, Montreal, Canada, Aug.

C. Ancourt and F. Irigoin, "Scanning Polyhedra with DO Loops," in Proceedings of the Third ACM SIGPLAN Symposium on Principles and Practices of Parallel Programming, Williamsburg, VA, Apr.

R. Ponnusamy, J. Saltz, and A. Choudhary, "Runtime-Compilation Techniques for Data Partitioning and Communication Schedule Reuse," in Proceedings of Supercomputing, Portland, OR, Nov.