FEATURE ARTICLE
OpenMP: An Industry-Standard API for Shared-Memory Programming
LEONARDO DAGUM AND RAMESH MENON, SILICON GRAPHICS INC.

OpenMP, the portable alternative to message passing, offers a powerful new way to achieve scalability in software. This article compares OpenMP to existing parallel-programming models.
Application developers have long recognized that scalable hardware and software are necessary for parallel scalability in application performance. Both have existed for some time in their lowest common denominator form, and scalable hardware—as physically distributed memories connected through a scalable interconnection network (such as a multistage interconnect, k-ary n-cube, or fat tree)—has been commercially available since the 1980s. When developers build such systems without any provision for cache coherence, the systems are essentially "zeroth order" scalable architectures: they provide only a scalable interconnection network, and the burden of scalability falls on the software. As a result, scalable software for such systems exists, at some level, only in a message-passing model. Message passing is the native model for these architectures, and developers can only build higher-level models on top of it.

Unfortunately, many in the high-performance computing world implicitly assume that the only way to achieve scalability in parallel software is with a message-passing programming model. This is not necessarily true. A class of multiprocessor architectures is now emerging that offers scalable hardware support for cache coherence. These are generally called scalable shared-memory multiprocessor (SSMP) architectures.1 For SSMP systems, the native programming model is shared memory, and message passing is built on top of the shared-memory model. On such systems, software scalability is straightforward to achieve with a shared-memory programming model.

In a shared-memory system, every processor has direct access to the memory of every other processor, meaning it can directly load or store any shared address. The programmer also can declare certain pieces of memory as private to the processor, which provides a simple yet powerful model for expressing and managing parallelism in an application.

Despite its simplicity and scalability, many parallel applications developers have resisted adopting a shared-memory programming model for one reason: portability. Shared-memory system vendors have created their own proprietary extensions to Fortran or C for parallel-software development. However, the absence of portability has forced many developers to adopt a portable message-passing model such as the Message Passing Interface (MPI) or Parallel Virtual Machine (PVM). This article presents a portable alternative to message passing: OpenMP.
OpenMP was designed to exploit certain characteristics of shared-memory architectures. The ability to directly access memory throughout the system (with minimum latency and no explicit address mapping), combined with fast shared-memory locks, makes shared-memory architectures best suited for supporting OpenMP.

Table 1: Comparing standard parallel-programming models.

                                X3H5    MPI    Pthreads    HPF     OpenMP
Scalable                        no      yes    sometimes   yes     yes
Incremental parallelization     yes     no     no          no      yes
Portable                        yes     yes    yes         yes     yes
Fortran binding                 yes     yes    no          yes     yes
High level                      yes     no     no          yes     yes
Supports data parallelism       yes     no     no          yes     yes
Performance oriented            no      yes    no          tries   yes

Why a new standard?

The closest approximation to a standard shared-memory programming model is the now-dormant ANSI X3H5 standards effort.2 X3H5 was never formally adopted as a standard, largely because interest waned as distributed-memory message-passing systems (MPPs) came into vogue. However, even though hardware vendors support it to varying degrees, X3H5 has limitations that make it unsuitable for anything other than loop-level parallelism. Consequently, applications adopting this model are often limited in their parallel scalability.

MPI has effectively standardized the message-passing programming model. It is a portable, widely available, and accepted standard for writing message-passing programs. Unfortunately, message passing is generally a difficult way to program. It requires that the program's data structures be explicitly partitioned, so the entire application must be parallelized to work with the partitioned data structures; there is no incremental path to parallelizing an application. Furthermore, modern multiprocessor architectures increasingly provide hardware support for cache coherence, so message passing is becoming unnecessary and overly restrictive for these systems.

Pthreads is an accepted standard for shared memory in low-end systems, but it is not targeted at the technical, high-performance computing space. There is little Fortran support for pthreads, and it is not a scalable approach. Even for C applications, the pthreads model is awkward, because it is lower-level than necessary for most scientific applications and is targeted more at providing task parallelism than data parallelism. Also, portability to unsupported platforms requires a stub library or equivalent workaround.

Researchers have defined many new languages for parallel computing, but these have not found mainstream acceptance. High Performance Fortran (HPF) is the most popular multiprocessing derivative of Fortran, but it is mostly geared toward distributed-memory systems.

Independent software developers of scientific applications, as well as government laboratories, have a large volume of Fortran 77 code that needs to be parallelized in a portable fashion. The rapid and widespread acceptance of shared-memory multiprocessor architectures—from the desktop to "glass houses"—has created a pressing demand for a portable way to program these systems. Developers need to parallelize existing code without completely rewriting it, but this is not possible with most existing parallel-language standards. Only OpenMP and X3H5 allow incremental parallelization of existing code, and of the two, only OpenMP is scalable (see Table 1). OpenMP is targeted at developers who need to quickly parallelize existing scientific code, but it remains flexible enough to support a much broader application set. It provides an incremental path for parallel conversion of any existing software, as well as scalability and performance for a complete rewrite or entirely new development.

What is OpenMP?

At its most elemental level, OpenMP is a set of compiler directives and callable runtime library routines that extend Fortran (and separately, C and C++) to express shared-memory parallelism. It leaves the base language unspecified, and vendors can implement OpenMP in any Fortran compiler. Naturally, to support pointers and allocatables, Fortran 90 and Fortran 95 require the OpenMP implementation to include additional semantics over Fortran 77.
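To make the combination of directives and library routines concrete, the following minimal sketch (not taken from the article) marks a region of a Fortran program as parallel and queries the runtime for each thread's identity; omp_get_thread_num and omp_get_num_threads are standard OpenMP library functions, while the program itself is purely illustrative.

      program hello_omp
      integer tid, nthreads
      integer omp_get_thread_num, omp_get_num_threads
c the directive marks a parallel region; each team member
c works with its own private copy of tid
!$OMP PARALLEL PRIVATE(tid) SHARED(nthreads)
      tid = omp_get_thread_num()
      if (tid .eq. 0) nthreads = omp_get_num_threads()
      print *, 'hello from thread ', tid
!$OMP END PARALLEL
c only the initial process continues past the implied barrier
      print *, 'team size was ', nthreads
      stop
      end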
OpenMP leverages many of the X3H5 concepts while extending them to support coarse-grain parallelism. Table 2 compares OpenMP with the directive bindings specified by X3H5 and the MIPS Pro Doacross model,3 and it summarizes the language extensions into one of three categories: control structure, data environment, or synchronization. The standard also includes a callable runtime library with accompanying environment variables.

Table 2: Comparing X3H5 directives, OpenMP, and MIPS Pro Doacross functionality.

                            X3H5                  OpenMP                 MIPS Pro
Overview
Orphan scope                None, lexical         Yes, binding           Yes, through
                            scope only            rules specified        callable runtime
Query functions             None                  Standard               Yes
Runtime functions           None                  Standard               Yes
Environment variables       None                  Standard               Yes
Nested parallelism          Allowed               Allowed                Serialized
Throughput mode             Not defined           Yes                    Yes
Conditional compilation     None                  _OPENMP, !$            C$
Sentinel                    C$PAR, C$PAR&         !$OMP, !$OMP&          C$, C$&

Control structure
Parallel region             Parallel              Parallel               Doacross
Iterative                   Pdo                   Do                     Doacross
Noniterative                Psection              Section                User coded
Single process              Psingle               Single, Master         User coded
Early completion            Pdone                 User coded             User coded
Sequential ordering         Ordered PDO           Ordered                None

Data environment
Autoscope                   None                  Default(private),      shared default
                                                  Default(shared)
Global objects              Instance Parallel     Threadprivate          Linker: -Xlocal
                            (p + 1 instances)     (p instances)          (p instances)
Reduction attribute         None                  Reduction              Reduction
Private initialization      None                  Firstprivate           None
                                                  Copyin                 Copyin
Private persistence         None                  Lastprivate            Lastlocal

Synchronization
Barrier                     Barrier               Barrier                mp_barrier
Synchronize                 Synchronize           Flush                  synchronize
Critical section            Critical Section      Critical               mp_setlock,
                                                                         mp_unsetlock
Atomic update               None                  Atomic                 None
Locks                       None                  Full functionality     mp_setlock,
                                                                         mp_unsetlock

Several vendors have products—including compilers, development tools, and performance-analysis tools—that are OpenMP aware. Typically, these tools understand the semantics of OpenMP constructs and hence aid the process of writing programs. The OpenMP Architecture Review Board includes representatives from Digital, Hewlett-Packard, Intel, IBM, Kuck and Associates, and Silicon Graphics. All of these companies are actively developing compilers and tools for OpenMP. OpenMP products are available today from Silicon Graphics and other vendors. In addition, a number of independent software vendors plan to use OpenMP in future products. (For information on individual products, see www.openmp.org.)
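As one concrete illustration of the Table 2 entries, OpenMP's conditional-compilation sentinel lets a program reference the runtime library yet still compile cleanly without OpenMP. The fragment below is a hedged sketch of that idiom, assuming only the standard library function omp_get_max_threads and the standard OMP_NUM_THREADS environment variable; the subroutine itself is invented for illustration.

      subroutine report_team
      integer nthreads
c lines beginning with the !$ sentinel are compiled only when
c OpenMP is enabled; otherwise they are treated as comments
!$    integer omp_get_max_threads
      nthreads = 1
!$    nthreads = omp_get_max_threads()
      print *, 'running with up to ', nthreads, ' thread(s)'
      return
      end

At run time, the team size reported here can be controlled through the standard OMP_NUM_THREADS environment variable rather than by editing the source.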
A simple example

Figure 1 presents a simple example of computing π using OpenMP. This example illustrates how to parallelize a simple loop in a shared-memory programming model. The code would look similar with either the Doacross or the X3H5 set of directives (except that X3H5 does not have a reduction attribute, so you would have to code it yourself).

      program compute_pi
      integer n, i
      double precision w, x, sum, pi, f, a
c function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      print *, 'Enter number of intervals: '
      read *, n
c calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
!$OMP PARALLEL DO PRIVATE(x), SHARED(w)
!$OMP& REDUCTION(+: sum)
      do i = 1, n
         x = w * (i - 0.5d0)
         sum = sum + f(x)
      enddo
      pi = w * sum
      print *, 'computed pi = ', pi
      stop
      end

Figure 1. Computing π in parallel using OpenMP.

Program execution begins as a single process. This initial process executes serially, and we can set up our problem in a standard sequential manner, reading and writing stdout as necessary.

When we first encounter a parallel construct, in this case a parallel do, the runtime forms a team of one or more processes and creates the data environment for each team member. The data environment consists of one private variable, x, one reduction variable, sum, and one shared variable, w. All references to x and sum inside the parallel region address private, nonshared copies. The reduction attribute takes an operator, such that at the end of the parallel region it reduces the private copies to the master copy using the specified operator. All references to w in the parallel region address the single master copy. The loop index variable, i, is private by default. The compiler takes care of assigning the appropriate iterations to the individual team members, so in parallelizing this loop the user need not even know how many processors it runs on.

There might be additional control and synchronization constructs within the parallel region, but not in this example. The parallel region terminates with the end do, which has an implied barrier. On exit from the parallel region, the initial process resumes execution using its updated data environment. In this case, the only change to the master's data environment is the reduced value of sum.

This model of execution is referred to as the fork/join model. Throughout the course of a program, the initial process can fork and join many times. The fork/join execution model makes it easy to get loop-level parallelism out of a sequential program. Unlike in message passing, where the program must be completely decomposed for parallel execution, the shared-memory model makes it possible to parallelize just at the loop level without decomposing the data structures. Given a working sequential program, it becomes fairly straightforward to parallelize individual loops incrementally and thereby immediately realize the performance advantages of a multiprocessor system.

For comparison with message passing, Figure 2 presents the same example using MPI. Clearly, there is additional complexity just in setting up the problem, because we must begin with a team of parallel processes. Consequently, we need to isolate a root process to read and write stdout. Because there is no globally shared data, we must explicitly broadcast the input parameters (in this case, the number of intervals for the integration) to all the processors. Furthermore, we must explicitly manage the loop bounds, which requires identifying each processor (myid) and knowing how many processors will be used to execute the loop.

      program compute_pi
      include 'mpif.h'
      double precision mypi, pi, w, sum, x, f, a
      integer n, myid, numprocs, i, rc
c function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)

      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )

      if ( myid .eq. 0 ) then
         print *, 'Enter number of intervals:'
         read *, n
      endif
      call MPI_BCAST(n,1,MPI_INTEGER,0,MPI_COMM_WORLD,ierr)
c calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
      do i = myid+1, n, numprocs
         x = w * (i - 0.5d0)
         sum = sum + f(x)
      enddo
      mypi = w * sum
c collect all the partial sums
      call MPI_REDUCE(mypi,pi,1,MPI_DOUBLE_PRECISION,MPI_SUM,0,
     $                MPI_COMM_WORLD,ierr)
c node 0 prints the answer.
      if (myid .eq. 0) then
         print *, 'computed pi = ', pi
      endif
      call MPI_FINALIZE(rc)
      stop
      end

Figure 2. Computing π in parallel using MPI.
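Figures 1 and 2 express the parallelism at the loop level. OpenMP's binding rules for directives that fall outside the lexical extent of a parallel region (the "Orphan scope" row of Table 2) also permit a coarser-grain structure, in which the work-sharing directive is orphaned inside a called subroutine. The sketch below is not taken from the article's figures; the subroutine name and organization are invented simply to recast the Figure 1 computation in that style.

      program compute_pi_coarse
      integer n
      double precision w, sum, pi
      print *, 'Enter number of intervals: '
      read *, n
      w = 1.0d0/n
      sum = 0.0d0
c the parallel region is opened here, but the work-sharing
c directive lives in the subroutine (an orphaned directive)
!$OMP PARALLEL SHARED(n, w, sum)
      call integrate(n, w, sum)
!$OMP END PARALLEL
      pi = w * sum
      print *, 'computed pi = ', pi
      stop
      end

      subroutine integrate(n, w, sum)
      integer n, i
      double precision w, sum, x, f, a
c function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
c the orphaned DO directive binds to the parallel region that is
c active in the caller; x and i are subroutine locals, so each
c thread automatically works with private copies
!$OMP DO REDUCTION(+: sum)
      do i = 1, n
         x = w * (i - 0.5d0)
         sum = sum + f(x)
      enddo
!$OMP END DO
      return
      end

The fork/join behavior is exactly as described above; only the lexical placement of the directives changes.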