
FEATURE ARTICLE

OpenMP: An Industry-Standard API for Shared-Memory Programming

LEONARDO DAGUM AND RAMESH MENON, SILICON GRAPHICS INC.

OpenMP, the portable alternative to message passing, offers a powerful new way to achieve scalability in software. This article compares OpenMP to existing parallel-programming models.

Application developers have long recognized that scalable hardware and software are necessary for parallel scalability in application performance. Both have existed for some time in their lowest common denominator form, and scalable hardware—as physically distributed memories connected through a scalable interconnection network (as a multistage interconnect, k-ary n-cube, or fat tree)—has been commercially available since the 1980s. When developers build such systems without any provision for cache coherence, the systems are essentially "zeroth order" scalable architectures. They provide only a scalable interconnection network, and the burden of scalability falls on the software. As a result, scalable software for such systems exists, at some level, only in a message-passing model. Message passing is the native model for these architectures, and developers can only build higher-level models on top of it.

Unfortunately, many in the high-performance computing world implicitly assume that the only way to achieve scalability in parallel software is with a message-passing programming model. This is not necessarily true. A class of multiprocessor architectures is now emerging that offers scalable hardware support for cache coherence. These are generally called scalable shared-memory multiprocessor (SSMP) architectures.1 For SSMP systems, the native programming model is shared memory, and message passing is built on top of the shared-memory model. On such systems, software scalability is straightforward to achieve with a shared-memory programming model.

In a shared-memory system, every processor has direct access to the memory of every other processor, meaning it can directly load or store any shared address. The programmer also can declare certain pieces of memory as private to the processor, which provides a simple yet powerful model for expressing and managing parallelism in an application.

Despite its simplicity and scalability, many parallel applications developers have resisted adopting a shared-memory programming model for one reason: portability. Shared-memory system vendors have created their own proprietary extensions to Fortran or C for parallel-software development. However, the absence of portability has forced many developers to adopt a portable message-passing model such as the Message Passing Interface (MPI) or Parallel Virtual Machine (PVM). This article presents a portable alternative to message passing: OpenMP.


OpenMP was designed to exploit certain characteristics of shared-memory architectures. The ability to directly access memory throughout the system (with minimum latency and no explicit address mapping), combined with fast shared-memory locks, makes shared-memory architectures best suited for supporting OpenMP.

Table 1: Comparing standard parallel-programming models.

                              X3H5    MPI    Pthreads    HPF     OpenMP
Scalable                      no      yes    sometimes   yes     yes
Incremental parallelization   yes     no     no          no      yes
Portable                      yes     yes    yes         yes     yes
Fortran binding               yes     yes    no          yes     yes
High level                    yes     no     no          yes     yes
Supports data parallelism     yes     no     no          yes     yes
Performance oriented          no      yes    no          tries   yes

Why a new standard?
The closest approximation to a standard shared-memory programming model is the now-dormant ANSI X3H5 standards effort.2 X3H5 was never formally adopted as a standard largely because interest waned as distributed-memory message-passing systems (MPPs) came into vogue. However, even though hardware vendors support it to varying degrees, X3H5 has limitations that make it unsuitable for anything other than loop-level parallelism. Consequently, applications adopting this model are often limited in their parallel scalability.

MPI has effectively standardized the message-passing programming model. It is a portable, widely available, and accepted standard for writing message-passing programs. Unfortunately, message passing is generally a difficult way to program. It requires that the program's data structures be explicitly partitioned, so the entire application must be parallelized to work with the partitioned data structures. There is no incremental path to parallelize an application. Furthermore, modern multiprocessor architectures increasingly provide hardware support for cache coherence; therefore, message passing is becoming unnecessary and overly restrictive for these systems.

Pthreads is an accepted standard for shared memory in low-end systems. However, it is not targeted at the technical, HPC space. There is little Fortran support for pthreads, and it is not a scalable approach. Even for C applications, the pthreads model is awkward, because it is lower-level than necessary for most scientific applications and is targeted more at providing task parallelism, not data parallelism. Also, portability to unsupported platforms requires a stub library or equivalent workaround.

Researchers have defined many new languages for parallel computing, but these have not found mainstream acceptance. High-Performance Fortran (HPF) is the most popular parallel derivative of Fortran, but it is mostly geared toward distributed-memory systems.

Independent software developers of scientific applications, as well as government laboratories, have a large volume of Fortran 77 code that needs to get parallelized in a portable fashion. The rapid and widespread acceptance of shared-memory multiprocessor architectures—from the desktop to "glass houses"—has created a pressing demand for a portable way to program these systems. Developers need to parallelize existing code without completely rewriting it, but this is not possible with most existing parallel-language standards. Only OpenMP and X3H5 allow incremental parallelization of existing code, of which only OpenMP is scalable (see Table 1). OpenMP is targeted at developers who need to quickly parallelize existing scientific code, but it remains flexible enough to support a much broader application set. OpenMP provides an incremental path for parallel conversion of any existing software. It also provides scalability and performance for a complete rewrite or entirely new development.

What is OpenMP?
At its most elemental level, OpenMP is a set of compiler directives and callable runtime library routines that extend Fortran (and separately, C and C++) to express shared-memory parallelism. It leaves the base language unspecified, and vendors can implement OpenMP in any Fortran compiler. Naturally, to support pointers and allocatables, Fortran 90 and Fortran 95 require the OpenMP implementation to include additional semantics over Fortran 77.
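To make the directive-plus-library flavor concrete, here is a minimal sketch of our own (the program name hello_omp is made up and is not taken from the examples that follow): a single parallel region in which each team member uses the runtime library to query its thread number and the team size.

      program hello_omp
      integer id, nthreads
      integer omp_get_thread_num, omp_get_num_threads
c each team member works on private copies of id and nthreads
!$OMP PARALLEL PRIVATE(id, nthreads)
      id = omp_get_thread_num()
      nthreads = omp_get_num_threads()
      print *, 'thread ', id, ' of ', nthreads
!$OMP END PARALLEL
      end

A compiler without OpenMP support simply treats the !$OMP lines as Fortran comments; the later section on the runtime library shows how the library calls themselves can be guarded so the same source still compiles serially.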


OpenMP leverages many of the X3H5 concepts while extending them to support coarse-grain parallelism. Table 2 compares OpenMP with the directive bindings specified by X3H5 and the MIPS Pro Doacross model,3 and it summarizes the language extensions into one of three categories: control structure, data environment, or synchronization. The standard also includes a callable runtime library with accompanying environment variables.

Table 2: Comparing X3H5 directives, OpenMP, and MIPS Pro Doacross functionality.

                            X3H5                OpenMP                 MIPS Pro
Overview
  Orphan scope              None, lexical       Yes, binding           Yes, through
                            scope only          rules specified        callable runtime
  Query functions           None                Standard               Yes
  Runtime functions         None                Standard               Yes
  Environment variables     None                Standard               Yes
  Nested parallelism        Allowed             Allowed                Serialized
  Throughput mode           Not defined         Yes                    Yes
  Conditional compilation   None                _OPENMP, !$            C$
  Sentinel                  C$PAR, C$PAR&       !$OMP, !$OMP&          C$, C$&
Control structure
  Parallel region           Parallel            Parallel               Doacross
  Iterative                 Pdo                 Do                     Doacross
  Noniterative              Psection            Section                User coded
  Single process            Psingle             Single, Master         User coded
  Early completion          Pdone               User coded             User coded
  Sequential ordering       Ordered PDO         Ordered                None
Data environment
  Autoscope                 None                Default(private),      shared default
                                                Default(shared)
  Global objects            Instance Parallel   Threadprivate          Linker: -Xlocal
                            (p + 1 instances)   (p instances)          (p instances)
  Reduction attribute       None                Reduction              Reduction
  Private initialization    None                Firstprivate           None
                                                Copyin                 Copyin
  Private persistence       None                Lastprivate            Lastlocal
Synchronization
  Barrier                   Barrier             Barrier                mp_barrier
  Synchronize               Synchronize         Flush                  synchronize
  Critical section          Critical Section    Critical               mp_setlock,
                                                                       mp_unsetlock
  Atomic update             None                Atomic                 None
  Locks                     None                Full functionality     mp_setlock,
                                                                       mp_unsetlock

Several vendors have products—including compilers, development tools, and performance-analysis tools—that are OpenMP aware. Typically, these tools understand the semantics of OpenMP constructs and hence aid the process of writing programs. The OpenMP Architecture Review Board includes representatives from Digital, Hewlett-Packard, Intel, IBM, Kuck and Associates, and Silicon Graphics.

All of these companies are actively developing compilers and tools for OpenMP. OpenMP products are available today from Silicon Graphics and other vendors. In addition, a number of independent software vendors plan to use OpenMP in future products. (For information on individual products, see www.openmp.org.)

A simple example
Figure 1 presents a simple example of computing π using OpenMP.4 This example illustrates how to parallelize a simple loop in a shared-memory programming model. The code would look similar with either the Doacross or the X3H5 set of directives (except that X3H5 does not have a reduction attribute, so you would have to code it yourself).

Program execution begins as a single process. This initial process executes serially, and we can set up our problem in a standard sequential manner, reading and writing stdout as necessary.


      program compute_pi
      integer n, i
      double precision w, x, sum, pi, f, a
c function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      print *, 'Enter number of intervals: '
      read *, n
c calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
!$OMP PARALLEL DO PRIVATE(x), SHARED(w)
!$OMP& REDUCTION(+: sum)
      do i = 1, n
         x = w * (i - 0.5d0)
         sum = sum + f(x)
      enddo
      pi = w * sum
      print *, 'computed pi = ', pi
      stop
      end

Figure 1. Computing π in parallel using OpenMP.

When we first encounter a parallel construct, in this case a parallel do, the runtime forms a team of one or more processes and creates the data environment for each team member. The data environment consists of one private variable, x, one reduction variable, sum, and one shared variable, w. All references to x and sum inside the parallel region address private, nonshared copies. The reduction attribute takes an operator, such that at the end of the parallel region it reduces the private copies to the master copy using the specified operator. All references to w in the parallel region address the single master copy. The loop index variable, i, is private by default. The compiler takes care of assigning the appropriate iterations to the individual team members, so in parallelizing this loop the user need not even know how many processors it runs on.

There might be additional control and synchronization constructs within the parallel region, but not in this example. The parallel region terminates with the end do, which has an implied barrier. On exit from the parallel region, the initial process resumes execution using its updated data environment. In this case, the only change to the master's data environment is the reduced value of sum.

This model of execution is referred to as the fork/join model. Throughout the course of a program, the initial process can fork and join many times. The fork/join execution model makes it easy to get loop-level parallelism out of a sequential program. Unlike in message passing, where the program must be completely decomposed for parallel execution, the shared-memory model makes it possible to parallelize just at the loop level without decomposing the data structures. Given a working sequential program, it becomes fairly straightforward to parallelize individual loops incrementally and thereby immediately realize the performance advantages of a multiprocessor system.

For comparison with message passing, Figure 2 presents the same example using MPI. Clearly, there is additional complexity just in setting up the problem, because we must begin with a team of parallel processes. Consequently, we need to isolate a root process to read and write stdout. Because there is no globally shared data, we must explicitly broadcast the input parameters (in this case, the number of intervals for the integration) to all the processors. Furthermore, we must explicitly manage the loop bounds. This requires identifying each processor (myid) and knowing how many processors will be used to execute the loop (numprocs).

      program compute_pi
      include 'mpif.h'
      double precision mypi, pi, w, sum, x, f, a
      integer n, myid, numprocs, i, rc
c function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
      if ( myid .eq. 0 ) then
         print *, 'Enter number of intervals: '
         read *, n
      endif
      call MPI_BCAST(n,1,MPI_INTEGER,0,MPI_COMM_WORLD,ierr)
c calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
      do i = myid+1, n, numprocs
         x = w * (i - 0.5d0)
         sum = sum + f(x)
      enddo
      mypi = w * sum
c collect all the partial sums
      call MPI_REDUCE(mypi,pi,1,MPI_DOUBLE_PRECISION,MPI_SUM,0,
     $                MPI_COMM_WORLD,ierr)
c node 0 prints the answer.
      if (myid .eq. 0) then
         print *, 'computed pi = ', pi
      endif
      call MPI_FINALIZE(rc)
      stop
      end

Figure 2. Computing π in parallel using MPI.


When we finally get to the loop, we can only sum into our private value for mypi. To reduce across processors we use the MPI_Reduce routine and sum into pi. The storage for pi is replicated across all processors, even though only the root process needs it. As a general rule, message-passing programs waste more storage than shared-memory programs.5 Finally, we can print the result, again making sure to isolate just one process for this step to avoid printing numprocs messages.

It is also interesting to see how this example looks using pthreads (see Figure 3). Naturally, it's written in C, but we can still compare functionality with the Fortran examples given in Figures 1 and 2.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

pthread_mutex_t reduction_mutex;
pthread_t *tid;
int n, num_threads;
double pi, w;

double f(a)
double a;
{
  return (4.0 / (1.0 + a*a));
}

void *PIworker(void *arg)
{
  int i, myid;
  double sum, mypi, x;

  /* set individual id to start at 0 */
  myid = pthread_self()-tid[0];
  /* integrate function */
  sum = 0.0;
  for (i=myid+1; i<=n; i+=num_threads) {
    x = w*((double)i - 0.5);
    sum += f(x);
  }
  mypi = w*sum;
  /* reduce value */
  pthread_mutex_lock(&reduction_mutex);
  pi += mypi;
  pthread_mutex_unlock(&reduction_mutex);
  return(0);
}

void main(argc,argv)
int argc;
char *argv[];
{
  int i;

  /* check command line */
  if (argc != 3) {
    printf("Usage: %s Num-intervals Num-threads\n", argv[0]);
    exit(0);
  }
  /* get num intervals and num threads from command line */
  n = atoi(argv[1]);
  num_threads = atoi(argv[2]);
  w = 1.0 / (double) n;
  pi = 0.0;
  tid = (pthread_t *) calloc(num_threads, sizeof(pthread_t));
  /* initialize lock */
  if (pthread_mutex_init(&reduction_mutex, NULL))
    fprintf(stderr, "Cannot init lock\n"), exit(1);
  /* create the threads */
  for (i=0; i<num_threads; i++)
    if (pthread_create(&tid[i], NULL, PIworker, NULL))
      fprintf(stderr, "Cannot create thread %d\n", i), exit(1);
  /* wait for all threads to finish, then print the result */
  for (i=0; i<num_threads; i++)
    pthread_join(tid[i], NULL);
  printf("computed pi = %.16f\n", pi);
}

Figure 3. Computing π in parallel using pthreads.

The pthreads version is more complex than either the OpenMP or the MPI versions:

• First, pthreads is aimed at providing task parallelism, whereas the example is one of data parallelism—parallelizing a loop. The example shows why pthreads has not been widely used for scientific applications.
• Second, pthreads is somewhat lower-level than we need, even in a task- or threads-based model. This becomes clearer as we go through the example.


As with the MPI version, we need to know how many threads will execute the loop, and we again have the need for reducing our partial sums into a global sum for π. Our basic approach is to start a worker thread, PIworker, for every processor we want to work on the loop. In PIworker, we first compute a zero-based thread ID and use this to map the loop iterations. The loop then computes the partial sums into mypi. We add these into the global result pi, making sure to protect against a race condition by locking. Finally, we need to explicitly join all our threads before we can print out the result of the integration.

All the data scoping is implicit; that is, global variables are shared and automatic variables are private. There is no simple mechanism in pthreads for making global variables private. Also, implicit scoping is more awkward in Fortran because the language is not as strongly scoped as C.

In terms of performance, all three models are comparable for this simple example. Table 3 presents the elapsed time in seconds for each program when run on a Silicon Graphics Origin2000 server, using 10^9 intervals for each integration. All three models are exhibiting excellent scalability on a per node basis (there are two CPUs per node in the Origin2000), as expected for this embarrassingly parallel algorithm.

Table 3: Time (in seconds) to compute π using 10^9 intervals with three standard parallel-programming models.

CPUs    OpenMP    MPI      Pthreads
1       107.7     121.4    115.4
2        53.9      60.7     62.5
4        27.0      30.3     32.4
6        17.9      20.4     22.0
8        13.5      15.2     16.7

Scalability
Although simple and effective, loop-level parallelism is usually limited in its scalability, because it leaves some constant fraction of sequential work in the program that by Amdahl's law can quickly overtake the gains from parallel execution. It is important, however, to distinguish between the type of parallelism (for example, loop-level versus coarse-grained) and the programming model. The type of parallelism exposed in a program depends on the algorithm and data structures employed and not on the programming model (to the extent that those algorithms and data structures can be reasonably expressed within a given model). Therefore, given a parallel algorithm and a scalable shared-memory architecture, a shared-memory implementation scales as well as a message-passing implementation.

OpenMP introduces the powerful concept of orphan directives that simplify the task of implementing coarse-grain parallel algorithms. Orphan directives are directives encountered outside the lexical extent of the parallel region. Coarse-grain parallel algorithms typically consist of only a few parallel regions, with most of the execution taking place within those regions. In implementing a coarse-grained parallel algorithm, it becomes desirable, and often necessary, to be able to specify control or synchronization from anywhere inside the parallel region, not just from the lexically contained portion. OpenMP provides this functionality by specifying binding rules for all directives and allowing them to be encountered dynamically in the call chain originating from the parallel region. In contrast, X3H5 does not allow directives to be orphaned, so all the control and synchronization for the program must be lexically visible in the parallel construct. This limitation restricts the programmer and makes any nontrivial coarse-grained parallel application virtually impossible to write.

A coarse-grain example
To highlight additional features in the standard, Figure 4 presents a slightly more complicated example, computing the energy spectrum for a field. This is essentially a histogramming problem with a slight twist—it also generates the sequence in parallel. We could easily parallelize the histogramming loop and the sequence generation as in the previous example, but in the interest of performance we would like to histogram as we compute in order to preserve locality.

The program goes immediately into a parallel region with a parallel directive, declaring the variables field and ispectrum as shared, and making everything else private with a default clause. The default clause does not affect common blocks, so setup remains a shared data structure.


      parameter(N = 512, NZ = 16)
      common /setup/ npoints, nzone
      dimension field(N), ispectrum(NZ)
      data npoints, nzone / N, NZ /
!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(field, ispectrum)
      call initialize_field(field, ispectrum)
      call compute_field(field, ispectrum)
      call compute_spectrum(field, ispectrum)
!$OMP END PARALLEL
      call display(ispectrum)
      stop
      end

      subroutine initialize_field(field, ispectrum)
      common /setup/ npoints, nzone
      dimension field(npoints), ispectrum(nzone)
!$OMP DO
      do i=1, nzone
         ispectrum(i) = 0.0
      enddo
!$OMP END DO NOWAIT
!$OMP DO
      do i=1, npoints
         field(i) = 0.0
      enddo
!$OMP END DO NOWAIT
!$OMP SINGLE
      field(npoints/4) = 1.0
!$OMP END SINGLE
      return
      end

      subroutine compute_spectrum(field, ispectrum)
      common /setup/ npoints, nzone
      dimension field(npoints), ispectrum(nzone)
!$OMP DO
      do i= 1, npoints
         index = field(i)*nzone + 1
!$OMP ATOMIC
         ispectrum(index) = ispectrum(index) + 1
      enddo
!$OMP END DO NOWAIT
      return
      end

Figure 4. A coarse-grained example.

Within the parallel region, we call initialize_field() to initialize the field and ispectrum arrays. Here we have an example of orphaning the do directive. With the X3H5 directives, we would have to move these loops into the main program so that they could be lexically visible within the parallel directive. Clearly, that restriction makes it difficult to write good modular parallel programs. We use the nowait clause on the end do directives to eliminate the implicit barrier. Finally, we use the single directive when we initialize a single internal field point. The end single directive also can take a nowait clause, but to guarantee correctness we need to synchronize here.

The field gets computed in compute_field(). This could be any parallel Laplacian solver, but in the interest of brevity we don't include it here. With the field computed, we are ready to compute the spectrum, so we histogram the field values using the atomic directive to eliminate race conditions in the updates to ispectrum. The end do here has a nowait because the parallel region ends after compute_spectrum() and there is an implied barrier when the threads join.

OpenMP design objective
OpenMP was designed to be a flexible standard, easily implemented across different platforms. As we discussed, the standard comprises four distinct parts:

• control structure,
• the data environment,
• synchronization, and
• the runtime library.

Control structure
OpenMP strives for a minimalist set of control structures. Experience has indicated that only a few control structures are necessary for writing most parallel applications. For example, in the Doacross model, the only control structure is the doacross directive, yet this is arguably the most widely used shared-memory programming model for scientific computing. Many of the control structures provided by X3H5 can be trivially programmed in OpenMP with no performance penalty. OpenMP includes control structures only in those instances where a compiler can provide both functionality and performance over what a user could reasonably program.

Our examples used only three control structures: parallel, do, and single. Clearly, the compiler adds functionality in the parallel and do directives. For single, the compiler adds performance by allowing the first thread reaching the single directive to execute the code. This is nontrivial for a user to program.

Data environment
Associated with each process is a unique data environment providing a context for execution. The initial process at program start-up has an initial data environment that exists for the duration of the program. It constructs new data environments only for new processes created during program execution. The objects constituting a data environment might have one of three basic attributes: shared, private, or reduction.

The concept of reduction as an attribute is generalized in OpenMP. It allows the compiler to efficiently implement reduction operations. This is especially important on cache-based systems where the compiler can eliminate any false sharing. On large-scale SSMP architectures, the compiler also might choose to implement tree-based reductions for even better performance.

OpenMP has a rich data environment. In addition to the reduction attribute, it allows private initialization with firstprivate and copyin, and private persistence with lastprivate. None of these features exist in X3H5, but experience has indicated a real need for them.

Global objects can be made private with the threadprivate directive. In the interest of performance, OpenMP implements a "p-copy" model for privatizing global objects: threadprivate will create p copies of the global object, one for each of the p members in the team executing the parallel region. Often, however, it is desirable either from memory constraints or for algorithmic reasons to privatize only certain elements of a compound global object. OpenMP allows individual elements of a compound global object to appear in a private list.
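As a brief sketch of how these attributes appear in code (our own fragment; the names scale_data and /work/ are made up, and the scratch block is declared only to show the directive form), firstprivate gives each thread an initialized private copy of an incoming value, lastprivate copies the value from the sequentially last iteration back to the master copy, and copyin initializes each thread's copy of a threadprivate common block:

      subroutine scale_data(a, n, fac)
      integer n, i
      double precision a(n), fac, x
      double precision scratch(100)
      common /work/ scratch
c one private copy of /work/ per team member (the p-copy model)
!$OMP THREADPRIVATE(/work/)
c fac: each private copy starts with the master copy's value
c x:   the sequentially last iteration's value is copied back out
c copyin: each thread's /work/ is initialized from the master copy
!$OMP PARALLEL DO FIRSTPRIVATE(fac) LASTPRIVATE(x) COPYIN(/work/)
      do i = 1, n
         x = fac * a(i)
         a(i) = x
      enddo
      return
      end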


Synchronization
There are two types of synchronization: implicit and explicit. Implicit synchronization points exist at the beginning and end of parallel constructs and at the end of control constructs (for example, do and single). In the case of do, sections, and single, the implicit synchronization can be removed with the nowait clause.

The user specifies explicit synchronization to manage order or data dependencies. Synchronization is a form of interprocess communication and, as such, can greatly affect program performance. In general, minimizing a program's synchronization requirements (explicit and implicit) achieves the best performance. For this reason, OpenMP provides a rich set of synchronization features so developers can best tune the synchronization in an application.

We saw an example using the atomic directive. This directive allows the compiler to take advantage of available hardware for implementing atomic updates to a variable. OpenMP also provides a flush directive for creating more complex synchronization constructs such as point-to-point synchronization. For ultimate performance, point-to-point synchronization can eliminate the implicit barriers in the energy-spectrum example. All the OpenMP synchronization directives can be orphaned. As discussed earlier, this is critically important for implementing coarse-grained parallel algorithms.

Runtime library and environment variables
In addition to the directive set described, OpenMP provides a callable runtime library and accompanying environment variables. The runtime library includes query and lock functions. The runtime functions allow an application to specify the mode in which it should run. An application developer might wish to maximize the system's throughput performance, rather than time to completion. In such cases, the developer can tell the system to dynamically set the number of processes used to execute parallel regions. This can have a dramatic effect on the system's throughput performance with only a minimal impact on the program's time to completion.

The runtime functions also allow a developer to specify when to enable nested parallelism, which allows the system to act accordingly when it encounters a nested parallel construct. On the other hand, by disabling it, a developer can write a parallel library that will perform in an easily predictable fashion whether encountered dynamically from within or outside a parallel region.

OpenMP also provides a conditional compilation facility both through the C language preprocessor (CPP) and with a Fortran comment sentinel. This allows calls to the runtime library to be protected as compiler directives, so OpenMP code can be compiled on non-OpenMP systems without linking in a stub library or using some other awkward workaround.

OpenMP provides standard environment variables to accompany the runtime library functions where it makes sense and to simplify the start-up scripts for portable applications. This helps application developers who, in addition to creating portable applications, need a portable runtime environment.
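The fragment below is our own sketch of the conditional-compilation sentinel at work (the program name tune_runtime is made up). Lines beginning with !$ are compiled only when OpenMP is enabled, so the runtime-library calls disappear on a non-OpenMP system and no stub library is required; OMP_NUM_THREADS and OMP_DYNAMIC are the corresponding environment-variable controls.

      program tune_runtime
      integer nmax
c the !$ sentinel: compiled only when OpenMP is enabled,
c otherwise treated as an ordinary comment line
!$    integer omp_get_max_threads
      nmax = 1
c ask the system to vary the team size for better throughput
!$    call omp_set_dynamic(.true.)
!$    nmax = omp_get_max_threads()
      print *, 'using at most ', nmax, ' threads'
      end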
OpenMP is supported by a number of hardware and software vendors, and we expect support to grow. OpenMP has been designed to be extensible and evolve with user requirements. The OpenMP Architecture Review Board was created to provide long-term support and enhancements of the OpenMP specifications. The OARB charter includes interpreting OpenMP specifications, developing future OpenMP standards, addressing issues of validation of OpenMP implementations, and promoting OpenMP as a de facto standard.

Possible extensions for Fortran include greater support for nested parallelism and support for shaped arrays. Nested parallelism is the ability to create a new team of processes from within an existing team. It can be useful in problems exhibiting both task and data parallelism. For example, a natural application for nested parallelism would be parallelizing a task queue wherein the tasks involve large matrix multiplies. Shaped arrays refers to the ability to explicitly assign the storage for arrays to specific memory nodes. This ability is useful for improving performance on Non-Uniform Memory Access (NUMA) architectures by reducing the number of non-local memory references made by a processor.

The OARB is currently developing the specification of C and C++ bindings and is also developing validation suites for testing OpenMP implementations. ♦

References
1. D.E. Lenoski and W.D. Weber, Scalable Shared-Memory Multiprocessing, Morgan Kaufmann, San Francisco, 1995.

2. B. Leasure, ed., Parallel Processing Model for High-Level Programming Languages, proposed draft, American National Standard for Information Processing Systems, Apr. 5, 1994.
3. MIPSpro Fortran77 Programmer's Guide, Silicon Graphics, Mountain View, Calif., 1996; http://techpubs.sgi.com/library/dynaweb_bin/0640/bi/nph-dynaweb.cgi/dynaweb/SGI_Developer/MproF77_PG/.
4. S. Ragsdale, ed., Parallel Programming Primer, Intel Scientific Computers, Santa Clara, Calif., Mar. 1990.
5. J. Brown, T. Elken, and J. Taft, Silicon Graphics Technical Servers in the High Throughput Environment, Silicon Graphics Inc., 1995; http://www.sgi.com/tech/challenge.html.

Leonardo Dagum works for Silicon Graphics in the System Performance group, where he helped define the OpenMP Fortran API. His research interests include parallel algorithms and performance modelling for parallel systems. He is the author of over 30 refereed publications relating to these subjects. He received his MS and PhD in aeronautics and astronautics from Stanford. Contact him at M/S 580, 2011 N. Shoreline Blvd., Mountain View, CA 94043-1389; [email protected].

Ramesh Menon is Silicon Graphics' representative to the OpenMP Architecture Review Board and served as the board's first chairman. He managed the writing of the OpenMP Fortran API. His research interests include parallel-programming models, performance characterization, and computational mechanics. He received an MS in mechanical engineering from Duke University and a PhD in aerospace engineering from Texas A&M. He was awarded a National Science Foundation Fellowship and was a principal contributor to the NSF Grand Challenge Coupled Fields project at the University of Colorado, Boulder. Contact him at [email protected].
