FEATURE ARTICLE
OpenMP: An Industry-Standard API for Shared-Memory Programming
LEONARDO DAGUM AND RAMESH MENON, SILICON GRAPHICS INC.

OpenMP, the portable alternative to message passing, offers a powerful new way to achieve scalability in software. This article compares OpenMP to existing parallel-programming models.
Application developers have long recognized that scalable hardware and software are necessary for parallel scalability in application performance. Both have existed for some time in their lowest common denominator form, and scalable hardware—as physically distributed memories connected through a scalable interconnection network (such as a multistage interconnect, k-ary n-cube, or fat tree)—has been commercially available since the 1980s. When developers build such systems without any provision for cache coherence, the systems are essentially "zeroth order" scalable architectures: they provide only a scalable interconnection network, and the burden of scalability falls on the software. As a result, scalable software for such systems exists, at some level, only in a message-passing model. Message passing is the native model for these architectures, and developers can only build higher-level models on top of it.

Unfortunately, many in the high-performance computing world implicitly assume that the only way to achieve scalability in parallel software is with a message-passing programming model. This is not necessarily true. A class of multiprocessor architectures is now emerging that offers scalable hardware support for cache coherence. These are generally called scalable shared-memory multiprocessor (SSMP) architectures.1 For SSMP systems, the native programming model is shared memory, and message passing is built on top of the shared-memory model. On such systems, software scalability is straightforward to achieve with a shared-memory programming model.

In a shared-memory system, every processor has direct access to the memory of every other processor, meaning it can directly load or store any shared address. The programmer also can declare certain pieces of memory as private to the processor, which provides a simple yet powerful model for expressing and managing parallelism in an application.

Despite its simplicity and scalability, many parallel applications developers have resisted adopting a shared-memory programming model for one reason: portability. Shared-memory system vendors have created their own proprietary extensions to Fortran or C for parallel-software development. However, the absence of portability has forced many developers to adopt a portable message-passing model such as the Message Passing Interface (MPI) or Parallel Virtual Machine (PVM). This article presents a portable alternative to message passing: OpenMP.
OpenMP was designed to exploit certain characteristics of shared-memory architectures. The ability to directly access memory throughout the system (with minimum latency and no explicit address mapping), combined with fast shared-memory locks, makes shared-memory architectures best suited for supporting OpenMP.

Table 1: Comparing standard parallel-programming models.

                                X3H5    MPI    Pthreads    HPF     OpenMP
Scalable                        no      yes    sometimes   yes     yes
Incremental parallelization     yes     no     no          no      yes
Portable                        yes     yes    yes         yes     yes
Fortran binding                 yes     yes    no          yes     yes
High level                      yes     no     no          yes     yes
Supports data parallelism       yes     no     no          yes     yes
Performance oriented            no      yes    no          tries   yes

Why a new standard?

The closest approximation to a standard shared-memory programming model is the now-dormant ANSI X3H5 standards effort.2 X3H5 was never formally adopted as a standard, largely because interest waned as distributed-memory message-passing systems (MPPs) came into vogue. However, even though hardware vendors support it to varying degrees, X3H5 has limitations that make it unsuitable for anything other than loop-level parallelism. Consequently, applications adopting this model are often limited in their parallel scalability.

MPI has effectively standardized the message-passing programming model. It is a portable, widely available, and accepted standard for writing message-passing programs. Unfortunately, message passing is generally a difficult way to program. It requires that the program's data structures be explicitly partitioned, so the entire application must be parallelized to work with the partitioned data structures; there is no incremental path to parallelizing an application. Furthermore, modern multiprocessor architectures increasingly provide hardware support for cache coherence, so message passing is becoming unnecessary and overly restrictive for these systems.

Pthreads is an accepted standard for shared memory in low-end systems, but it is not targeted at the technical, high-performance computing space. There is little Fortran support for pthreads, and it is not a scalable approach. Even for C applications, the pthreads model is awkward, because it is lower-level than necessary for most scientific applications and is targeted more at providing task parallelism than data parallelism. Also, portability to unsupported platforms requires a stub library or equivalent workaround.

Researchers have defined many new languages for parallel computing, but these have not found mainstream acceptance. High Performance Fortran (HPF) is the most popular multiprocessing derivative of Fortran, but it is mostly geared toward distributed-memory systems.

Independent software developers of scientific applications, as well as government laboratories, have a large volume of Fortran 77 code that needs to be parallelized in a portable fashion. The rapid and widespread acceptance of shared-memory multiprocessor architectures—from the desktop to "glass houses"—has created a pressing demand for a portable way to program these systems. Developers need to parallelize existing code without completely rewriting it, but this is not possible with most existing parallel-language standards. Only OpenMP and X3H5 allow incremental parallelization of existing code, and of the two, only OpenMP is scalable (see Table 1). OpenMP is targeted at developers who need to quickly parallelize existing scientific code, but it remains flexible enough to support a much broader application set. It provides an incremental path for parallel conversion of any existing software, as well as scalability and performance for a complete rewrite or entirely new development.

What is OpenMP?

At its most elemental level, OpenMP is a set of compiler directives and callable runtime library routines that extend Fortran (and separately, C and C++) to express shared-memory parallelism. It leaves the base language unspecified, and vendors can implement OpenMP in any Fortran compiler. Naturally, to support pointers and allocatables, Fortran 90 and Fortran 95 require the OpenMP implementation to include additional semantics over Fortran 77.
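To make the combination of directives and library routines concrete, the following minimal sketch (not taken from the article) marks a region of a Fortran program as parallel and queries the runtime for each thread's identity; omp_get_thread_num and omp_get_num_threads are standard OpenMP library functions, while the program itself is purely illustrative.

      program hello_omp
      integer tid, nthreads
      integer omp_get_thread_num, omp_get_num_threads
c the directive marks a parallel region; each team member
c works with its own private copy of tid
!$OMP PARALLEL PRIVATE(tid) SHARED(nthreads)
      tid = omp_get_thread_num()
      if (tid .eq. 0) nthreads = omp_get_num_threads()
      print *, 'hello from thread ', tid
!$OMP END PARALLEL
c only the initial process continues past the implied barrier
      print *, 'team size was ', nthreads
      stop
      end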
OpenMP leverages many of the X3H5 concepts while extending them to support coarse-grain parallelism. Table 2 compares OpenMP with the directive bindings specified by X3H5 and the MIPS Pro Doacross model,3 and it summarizes the language extensions into one of three categories: control structure, data environment, or synchronization. The standard also includes a callable runtime library with accompanying environment variables.

Table 2: Comparing X3H5 directives, OpenMP, and MIPS Pro Doacross functionality.

                            X3H5                  OpenMP                 MIPS Pro
Overview
Orphan scope                None, lexical         Yes, binding           Yes, through
                            scope only            rules specified        callable runtime
Query functions             None                  Standard               Yes
Runtime functions           None                  Standard               Yes
Environment variables       None                  Standard               Yes
Nested parallelism          Allowed               Allowed                Serialized
Throughput mode             Not defined           Yes                    Yes
Conditional compilation     None                  _OPENMP, !$            C$
Sentinel                    C$PAR, C$PAR&         !$OMP, !$OMP&          C$, C$&

Control structure
Parallel region             Parallel              Parallel               Doacross
Iterative                   Pdo                   Do                     Doacross
Noniterative                Psection              Section                User coded
Single process              Psingle               Single, Master         User coded
Early completion            Pdone                 User coded             User coded
Sequential ordering         Ordered PDO           Ordered                None

Data environment
Autoscope                   None                  Default(private),      shared default
                                                  Default(shared)
Global objects              Instance Parallel     Threadprivate          Linker: -Xlocal
                            (p + 1 instances)     (p instances)          (p instances)
Reduction attribute         None                  Reduction              Reduction
Private initialization      None                  Firstprivate           None
                                                  Copyin                 Copyin
Private persistence         None                  Lastprivate            Lastlocal

Synchronization
Barrier                     Barrier               Barrier                mp_barrier
Synchronize                 Synchronize           Flush                  synchronize
Critical section            Critical Section      Critical               mp_setlock,
                                                                         mp_unsetlock
Atomic update               None                  Atomic                 None
Locks                       None                  Full functionality     mp_setlock,
                                                                         mp_unsetlock

Several vendors have products—including compilers, development tools, and performance-analysis tools—that are OpenMP aware. Typically, these tools understand the semantics of OpenMP constructs and hence aid the process of writing programs. The OpenMP Architecture Review Board includes representatives from Digital, Hewlett-Packard, Intel, IBM, Kuck and Associates, and Silicon Graphics. All of these companies are actively developing compilers and tools for OpenMP. OpenMP products are available today from Silicon Graphics and other vendors. In addition, a number of independent software vendors plan to use OpenMP in future products. (For information on individual products, see www.openmp.org.)
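As one concrete illustration of the Table 2 entries, OpenMP's conditional-compilation sentinel lets a program reference the runtime library yet still compile cleanly without OpenMP. The fragment below is a hedged sketch of that idiom, assuming only the standard library function omp_get_max_threads and the standard OMP_NUM_THREADS environment variable; the subroutine itself is invented for illustration.

      subroutine report_team
      integer nthreads
c lines beginning with the !$ sentinel are compiled only when
c OpenMP is enabled; otherwise they are treated as comments
!$    integer omp_get_max_threads
      nthreads = 1
!$    nthreads = omp_get_max_threads()
      print *, 'running with up to ', nthreads, ' thread(s)'
      return
      end

At run time, the team size reported here can be controlled through the standard OMP_NUM_THREADS environment variable rather than by editing the source.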
A simple example

Figure 1 presents a simple example of computing π using OpenMP. This example illustrates how to parallelize a simple loop in a shared-memory programming model. The code would look similar with either the Doacross or the X3H5 set of directives (except that X3H5 does not have a reduction attribute, so you would have to code it yourself).

      program compute_pi
      integer n, i
      double precision w, x, sum, pi, f, a
c function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      print *, 'Enter number of intervals: '
      read *, n
c calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
!$OMP PARALLEL DO PRIVATE(x), SHARED(w)
!$OMP& REDUCTION(+: sum)
      do i = 1, n
         x = w * (i - 0.5d0)
         sum = sum + f(x)
      enddo
      pi = w * sum
      print *, 'computed pi = ', pi
      stop
      end

Figure 1. Computing π in parallel using OpenMP.

Program execution begins as a single process. This initial process executes serially, and we can set up our problem in a standard sequential manner, reading and writing stdout as necessary.

When we first encounter a parallel construct, in this case a parallel do, the runtime forms a team of one or more processes and creates the data environment for each team member. The data environment consists of one private variable, x, one reduction variable, sum, and one shared variable, w. All references to x and sum inside the parallel region address private, nonshared copies. The reduction attribute takes an operator, such that at the end of the parallel region it reduces the private copies to the master copy using the specified operator. All references to w in the parallel region address the single master copy. The loop index variable, i, is private by default. The compiler takes care of assigning the appropriate iterations to the individual team members, so in parallelizing this loop the user need not even know how many processors it runs on.

There might be additional control and synchronization constructs within the parallel region, but not in this example. The parallel region terminates with the end do, which has an implied barrier. On exit from the parallel region, the initial process resumes execution using its updated data environment. In this case, the only change to the master's data environment is the reduced value of sum.

This model of execution is referred to as the fork/join model. Throughout the course of a program, the initial process can fork and join many times. The fork/join execution model makes it easy to get loop-level parallelism out of a sequential program. Unlike in message passing, where the program must be completely decomposed for parallel execution, the shared-memory model makes it possible to parallelize just at the loop level without decomposing the data structures. Given a working sequential program, it becomes fairly straightforward to parallelize individual loops incrementally and thereby immediately realize the performance advantages of a multiprocessor system.

For comparison with message passing, Figure 2 presents the same example using MPI. Clearly, there is additional complexity just in setting up the problem, because we must begin with a team of parallel processes. Consequently, we need to isolate a root process to read and write stdout. Because there is no globally shared data, we must explicitly broadcast the input parameters (in this case, the number of intervals for the integration) to all the processors. Furthermore, we must explicitly manage the loop bounds, which requires identifying each processor (myid) and knowing how many processors will be used to execute the loop.

      program compute_pi
      include 'mpif.h'
      double precision mypi, pi, w, sum, x, f, a
      integer n, myid, numprocs, i, rc
c function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)

      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )

      if ( myid .eq. 0 ) then
         print *, 'Enter number of intervals:'
         read *, n
      endif
      call MPI_BCAST(n,1,MPI_INTEGER,0,MPI_COMM_WORLD,ierr)
c calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
      do i = myid+1, n, numprocs
         x = w * (i - 0.5d0)
         sum = sum + f(x)
      enddo
      mypi = w * sum
c collect all the partial sums
      call MPI_REDUCE(mypi,pi,1,MPI_DOUBLE_PRECISION,MPI_SUM,0,
     $                MPI_COMM_WORLD,ierr)
c node 0 prints the answer.
      if (myid .eq. 0) then
         print *, 'computed pi = ', pi
      endif
      call MPI_FINALIZE(rc)
      stop
      end

Figure 2. Computing π in parallel using MPI.
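Figures 1 and 2 express the parallelism at the loop level. OpenMP's binding rules for directives that fall outside the lexical extent of a parallel region (the "Orphan scope" row of Table 2) also permit a coarser-grain structure, in which the work-sharing directive is orphaned inside a called subroutine. The sketch below is not taken from the article's figures; the subroutine name and organization are invented simply to recast the Figure 1 computation in that style.

      program compute_pi_coarse
      integer n
      double precision w, sum, pi
      print *, 'Enter number of intervals: '
      read *, n
      w = 1.0d0/n
      sum = 0.0d0
c the parallel region is opened here, but the work-sharing
c directive lives in the subroutine (an orphaned directive)
!$OMP PARALLEL SHARED(n, w, sum)
      call integrate(n, w, sum)
!$OMP END PARALLEL
      pi = w * sum
      print *, 'computed pi = ', pi
      stop
      end

      subroutine integrate(n, w, sum)
      integer n, i
      double precision w, sum, x, f, a
c function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
c the orphaned DO directive binds to the parallel region that is
c active in the caller; x and i are subroutine locals, so each
c thread automatically works with private copies
!$OMP DO REDUCTION(+: sum)
      do i = 1, n
         x = w * (i - 0.5d0)
         sum = sum + f(x)
      enddo
!$OMP END DO
      return
      end

The fork/join behavior is exactly as described above; only the lexical placement of the directives changes.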