Advances, Applications and Performance of The

1 Introduction The two predominant classes of programming models for MIMD concurrent computing are distributed memory and ADVANCES, APPLICATIONS AND shared memory. Both shared memory and distributed mem- PERFORMANCE OF THE GLOBAL ory models have advantages and shortcomings. Shared ARRAYS SHARED MEMORY memory model is much easier to use but it ignores data PROGRAMMING TOOLKIT locality/placement. Given the hierarchical nature of the memory subsystems in modern computers this character- istic can have a negative impact on performance and scal- Jarek Nieplocha1 1 ability. Careful code restructuring to increase data reuse Bruce Palmer and replacing fine grain load/stores with block access to Vinod Tipparaju1 1 shared data can address the problem and yield perform- Manojkumar Krishnan ance for shared memory that is competitive with mes- Harold Trease1 2 sage-passing (Shan and Singh 2000). However, this Edoardo Aprà performance comes at the cost of compromising the ease of use that the shared memory model advertises. Distrib- uted memory models, such as message-passing or one- Abstract sided communication, offer performance and scalability This paper describes capabilities, evolution, performance, but they are difficult to program. and applications of the Global Arrays (GA) toolkit. GA was The Global Arrays toolkit (Nieplocha, Harrison, and created to provide application programmers with an inter- Littlefield 1994, 1996; Nieplocha et al. 2002a) attempts to face that allows them to distribute data while maintaining offer the best features of both models. It implements a the type of global index space and programming syntax shared-memory programming model in which data local- similar to that available when programming on a single ity is managed by the programmer. This management is processor. The goal of GA is to free the programmer from achieved by calls to functions that transfer data between a the low level management of communication and allow global address space (a distributed array) and local stor- them to deal with their problems at the level at which they age (Figure 1). In this respect, the GA model has similarities were originally formulated. At the same time, compatibility to the distributed shared-memory models that provide an of GA with MPI enables the programmer to take advan- tage of the existing MPI software/libraries when available explicit acquire/release protocol, e.g. Zhou, Iftode, and Li and appropriate. The variety of applications that have (1996). However, the GA model acknowledges that remote been implemented using Global Arrays attests to the data is slower to access than local data and allows data attractiveness of using higher level abstractions to write locality to be specified by the programmer and hence parallel code. managed. GA is related to the global address space languages such as UPC (Carlson et al. 1999), Titanium (Yelick Key words: Global Arrays, Global Address Space, Shared et al. 1998), and, to a lesser extent, Co-Array Fortran1 (Num- Memory, Data Distribution, Parallel Programming, Data rich and Reid 1998). In addition, by providing a set of Abstraction data-parallel operations, GA is also related to data-parallel languages such as HPF (High Performance Fortran Forum 1993), ZPL (Snyder 1999), and Data Parallel C (Hatcher and Quinn 1991). However, the Global Array programming model is implemented as a library that works with most languages used for technical computing and does not rely on compiler technology for achieving parallel efficiency. It also supports a combination of task- and data- parallelism and is available as an extension of the mes- 1COMPUTATIONAL SCIENCES AND MATHEMATICS DEPARTMENT, PACIFIC NORTHWEST NATIONAL LABORATORY, RICHLAND, WA 99352. The International Journal of High Performance Computing Applications, ([email protected]) Volume 20, No. 2, Summer 2006, pp. 203–231 2WILLIAM R. WILEY ENVIRONMENTAL MOLECULAR DOI: 10.1177/1094342006064503 © 2006 SAGE Publications SCIENCES LABORATORY, PACIFIC NORTHWEST Figures 7–9, 13, 24 appear in color online: http://hpc.sagepub.com NATIONAL LABORATORY, RICHLAND, WA 99352 ARRAYS SHARED MEMORY PROGRAMMING 203 Fig. 1 Dual view of GA data structures (left). Any part of GA data can be accessed independently by any process at any time (right). sage-passing (MPI) model. The GA model exposes to the separation of the GA internal one-sided communication programmer the hierarchical memory of modern high- engine from the high-level data structure. A new portable, performance computer systems (Nieplocha, Harrison, and general, and GA-independent communication library called Foster 1996), and by recognizing the communication over- ARMCI was created (Nieplocha and Carpenter 1999). head for remote data transfer, it promotes data reuse and New capabilities were later added to GA without the need locality of reference. Virtually all the scalable architectures to modify the ARMCI interfaces. The GA toolkit evolved possess non-uniform memory access characteristics that in multiple directions: reflect their multi-level memory hierarchies. These hierarchies typically comprise processor registers, multiple • Adding support for a wide range of data types and vir- levels of cache, local memory, and remote memory. Over tually arbitrary array ranks (note that the Fortran limit time, both the number of levels and the cost (in processor for array rank is seven). cycles) of accessing deeper levels has been increasing. It • Adding advanced or specialized capabilities that address is important for any scalable programming model to address the needs of some new application areas, e.g. ghost cells memory hierarchy since it is critical to the efficient exe- or operations for sparse data structures. cution of scalable applications. • Expansion and generalization of the existing basic func- Before the DoE-2000 ACTS program was established tionality. For example, mutex and lock operations were (ACTS; DOE ACTS), the original GA package (Nieplocha, added to better support the development of shared mem- Harrison, and Littlefield 1994, 1996; Nieplocha et al. ory style application codes. They have proven useful 2002a) offered basic one-sided communication operations, for applications that perform complex transformations along with a limited set of collective operations on arrays of shared data in task parallel algorithms, such as com- in the style of BLAS (Dongarra et al. 1990). Only two- pressed data storage in the multireference configuration dimensional arrays and two data types were supported. The interaction calculation in the COLUMBUS package underlying communication mechanisms were implemented (Dachsel, Nieplocha, and Harrison 1998). on top of vendor specific interfaces. In the course of ten • Increased language interoperability and interfaces. In years, the package has evolved substantially and the under- addition to the original Fortran interface, C, Python, lying code has been completely rewritten. This included and a C++ class library were developed. These efforts 204 COMPUTING APPLICATIONS were further extended by developing a Common Com- ory, for example networks of SMP workstations or component Architecture (CCA) component version of GA. putational grids, the message-passing model’s classification • Developing additional interfaces to third party librar- of main memory as local or remote can be inadequate. A ies that expand the capabilities of GA, especially in the hybrid model that extends MPI with OpenMP attempts to parallel linear algebra area. Examples are ScaLAPACK address this problem but is very hard to use and often (Blackford et al. 1997) and SUMMA (VanDeGeijn and offers little or no advantages over the MPI-only approach Watts 1997). More recently, interfaces to the TAO opti- (Loft, Thomas, and Dennis 2001; Henty 2000). mization toolkit have also been developed (Benson, In the shared-memory programming model, data is McInnes, and Moré). located either in “private” memory (accessible only by a • Developed support for multi-level parallelism based on specific process) or in “global” memory (accessible to all processor groups in the context of a shared memory pro- processes). In shared-memory systems, global memory is gramming model, as implemented in GA (Nieplocha accessed in the same manner as local memory. Regardless et al. 2005; Krishnan et al. 2005). of the implementation, the shared-memory paradigm elim- inates the synchronization that is required when message- These advances generalized the capabilities of the GA passing is used to access shared data. A disadvantage of toolkit and expanded its appeal to a broader set of appli- many shared-memory models is that they do not expose cations. At the same time the programming model, with the NUMA memory hierarchy of the underlying distrib- its emphasis on a shared memory view of the data struc- uted-memory hardware (Nieplocha, Harrison, and Foster tures in the context of distributed memory systems with a 1996). Instead, they present a flat view of memory, making hierarchical memory, is as relevant today as it was in it hard for programmers to understand how data access 1993 when the project started. This paper describes the patterns affect the application performance or how to exploit characteristics of the Global Arrays programming model, data locality. Hence, while the programming effort involved capabilities of the toolkit, and discusses its evolution. In in application development tends to be much lower than addition, performance and application experience are pre- in the message-passing

Advances, Applications and Performance of The

Benchmarking the Intel FPGA SDK for Opencl Memory Interface

BCL: a Cross-Platform Distributed Data Structures Library

Enabling Efficient Use of UPC and Openshmem PGAS Models on GPU Clusters

Automatic Handling of Global Variables for Multi-Threaded MPI Programs

Overview of the Global Arrays Parallel Software Development Toolkit: Introduction to Global Address Space Programming Models

Exascale Computing Project -- Software

The Global Arrays User Manual

Recent Activities in Programming Models and Runtime Systems at ANL

The Opengl ES Shading Language

The Opengl ES Shading Language

Enabling Efficient Use of UPC and Openshmem PGAS Models on GPU Clusters

Gain: Distributed Array Computation with Python