Efficient High-Level Parallel Programming

Theoretical Computer Science EISEVIER Theoretical Computer Science 196 (1998) 71-107 Efficient high-level parallel programming George Horaliu Botoroga,*,‘, Herbert Kuchenb aAachen University of Technology, Lehrstuhl fiir Informatik IL Ahornstr. 55, D-52074 Aachen, Germany b University of Miinster, Institut fir Wirtschaftsinformatik, Grevener Str. 91, D-48159 Miinster, Germany Abstract Algorithmic skeletons are polymorphic higher-order functions that represent common parallel- ization patterns and that are implemented in parallel. They can be used as the building blocks of parallel and distributed applications by embedding them into a sequential language. In this paper, we present a new approach to programming with skeletons. We integrate the skeletons into an imperative host language enhanced with higher-order functions and currying, as well as with a polymorphic type system. We thus obtain a high-level programming language, which can be implemented very efficiently. We then present a compile-time technique for the implementation of the functional features which has an important positive impact on the efficiency of the language. Afier describing a series of skeletons which work with distributed arrays, we give two examples of parallel algorithms implemented in our language, namely matrix multiplication and Gaussian elimination. Run-time measurements for these and other applications show that we approach the efficiency of message-passing C up to a factor between 1 and 1.5. @ 1998-Elsevier Science B.V. All rights reserved Keywords: Algorithmic skeletons; High-level parallel programming; Efficient implementation of functional features; Distributed arrays 1. Introduction Although parallel and distributed systems gain more and more importance nowadays, the state of the art in the field of parallel software is far from being satisfactory. Not only is the programming of such systems a tedious and time-consuming task, but its outcome is usually machine-dependent and hence its portability is restricted. One of the main reasons for these difficulties is the rather low level of parallel programming languages available today, in which the user has to account explicitly * Corresponding author. E-mail: [email protected]. ’ The work of this author was supported by the “Graduiertenkolleg Informatik und Technik” at the Aachen University of Technology. 0304-3975/98/$19.00 @ 1998-Elsevier Science B.V. All rights reserved PZZ SO304-3975(97)00196-S 72 G.H. Botorog, H. Kuchenl Theoretical Computer Science 196 (1998) 71-107 for all aspects of parallelism, i.e., for communication, synchronization, data and task distribution, load balancing, etc. Moreover, program testing is impeded by deadlocks, non-deterministic program runs, non-reproducible errors, and sparse debugging facili- ties. Finally, programs run only on certain machines or on restricted classes of machines. However, due to the nearness to the machine, a high performance can be achieved. Attempts were made to design high-level languages that hide the details of parallelism and allow the programmer to concentrate on his problem. Such high-level approaches would naturally lead to less efficient programs than those written directly in low-level languages, and the crux is to find a good trade-off between efficiency losses and gains in easiness of programming, as well as in safety and portability of programs. In this paper, we present such an approach, namely a language with algorithmic skeletons, which combines the advantages of high-level (declarative) languages and those of efficient (imperative) ones. The main aim is to enable easy parallel programming and at the same time to get an efficient implementation. A skeleton is an algorithmic abstraction common to a series of applications, which can be implemented in parallel [6]. Skeletons are embedded into a sequential host language, thus being the only source of parallelism in a program. Most skeletons, like for instance map, farm, and divide&conquer [7] are polymorphic higher-order functions, and can thus be defined in functional languages in a straightforward way. This is why most languages with skeletons employ a functional host [8,14]. However, programs written in these languages are still 5 to 10 times slower than their low-level counterparts, e.g., C with message passing [14]. A series of approaches use an imperative language as a host, like for instance P3L [ 11, which builds on top of C++, and in which skeletons are internal language constructs. The main drawbacks are the difficulty to add new skeletons and the fact that only a restricted number of skeletons can be used. A related approach is PCN [9], where templates provide a means to define reusable parallel program structures. The language also contains lower-level features, such as dejinitional variables, by which processes can communicate, but which can lead to problems such as deadlocks. In the language WAS [18], arithmetic and logic operators are extended by overloading to global operators that can be applied pointwise to the elements of matrices. These operations are no skeletons in the sense of the former definition, but their functionality resembles that of certain skeletons. SPP(X) [8] uses a ‘two-layer’ language: a high-level, functional language for the application and a low-level language (at present Fortran) for efficient sequential code that is coordinated by the skeletons. In this paper, we present a new approach to parallel programming with algorithmic skeletons. We detine an imperative language called SkiI,’ which we use as a framework in which we embed the skeletons. Skeletons are thus not part of the language, but they are implemented in it. In order to enable a straightforward integration 2 S/cd is an acronym for Skeleton Imperative Language. G.H. Botorog, H. Kuchenl Theoretical Computer Science 196 (1998) 71-107 73 of the skeletons, Ski1 contains a series of functional features, mainly higher-order tunctions, currying and polymorphic types. These features are eliminated at compile- time by an instantiation procedure, which generates for each polymorphic higher-order function one or more monomorphic first-order instances. Thus, as our results show, an efficient overall implementation can be achieved. On the other hand, since the transformation is static, a restriction has to be imposed on certain recursively defined higher-order functions. However, this restriction applies only to special cases, which seldom appear in practice. Skil does not support nested parallelism, like NESL [2] or P3L [l] does, because the emphasis was placed here on the efficiency of the implementation and on the integration of the functional and imperative features, rather than on the composition of skeletons. Using Skil, we construct the data structure ‘distributed array’ and define a series of data-parallel skeletons operating on it. We then show how two parallel applications can be programmed in a sequential style by using these skeletons. Run-time measurements show that our results are considerably better than those obtained by using pure functional languages with skeletons, approaching those of direct low-level (C) implementations. The paper is organized as follows. Section 2 presents the language SkiZ with an emphasis on its functional features. Section 3 describes distributed arrays and Section 4 presents a series of skeletons working on this data structure. Section 5 describes the core of the implementation of Skil, namely the instantiation procedure. Section 6 then presents two parallel applications implemented by means of skeletons and gives run- time results, as well as comparisons with other implementations. Finally, Section 7 concludes the paper. 2. The language Ski1 As mentioned in the introduction, Ski1 aims at combining a high programming level with an efficient implementation. We have therefore started with an imperative language, namely C3 and enhanced it by a series of functional constructs. The goal was to allow skeletons, which are polymorphic higher-order functions, to be specified in Ski1 as naturally as in functional languages. As an example, consider the function map, which applies a given argument function to a given list of items, yielding a list of results. In Haskell [ll], this function can be defined as follows: map :: (a 4 b) 4 [a] -+ [b] maPf[l=[l map fx:xs= fx: mapfxs 3 This choice is only a pragmatic one, based on the widespread use of C, for which mature compilers are available (we use a C-compiler as a back-end in our implementation). However, other imperative languages can equally well be used instead. 74 G.H. Botorog, H. Kuchenl Theoretical Computer Science 196 (1998) 71-107 We shall now present the language enhancements that are necessary in order to make such function definitions possible. 2.1. Higher-order functions Ski1 allows the definition of higher-order functions (HOFs), i.e., of functions with functional arguments or result. Although this feature could be simulated in C by passing/returning pointers to functions, this simulation would not allow the use of functional arguments created dynamically by partial applications. Hence, we allow the explicit definition of HOFs. For instance, (a for the time being monomorphic version of) the function map - working on lists of integers - can be defined in Ski1 as follows, provided the standard list functions nil, cons, and empty are available: intlist map (int f (int), intlist 1) { if (empty (I)) return nil () ; else return cow ( f (1+ elem), map ( f, 1 -+ next)) ; } Although functions can be arguments of other functions, they cannot occur in data structures and certain expressions. More exactly, the following restrictions apply: l No element of a data structure may have a functional type. Pointers to functions are nevertheless allowed for compatibility to C. l Expressions with functional type may occur only in function applications and in return statements. In this paper, we concentrate on HOFs with functional arguments, both in the description of the implementation of HOFs and in the examples. HOFs with functional results are not so important from the point of view of algorithmic skeletons, therefore we shall only outline them here.

Efficient High-Level Parallel Programming

CMPS 161 – Algorithm Design and Implementation I Current Course Description

Skepu 2: Flexible and Type-Safe Skeleton Programming for Heterogeneous Parallel Systems

Software Requirements Implementation and Management

Map, Reduce and Mapreduce, the Skeleton Way$

Cleanroom Software Engineering Implementation of the Capability Maturity Model (Cmmsm) for Software

Computer Science (CS) Implementation Guidance

Software Design Pattern Using : Algorithmic Skeleton Approach S

5 Programming Language Implementation

Skeleton Programming for Heterogeneous GPU-Based Systems

Theory of Computational Implementation

Implementation Phase During Software Development Life Cycle: Interpreting Tools and Methods

Blossom V: a New Implementation of a Minimum Cost Perfect Matching Algorithm