An investigation of distributed programming

Prashobh Balasundaram

August 22, 2007

MSc in High Performance Computing
The University of Edinburgh
Year of Presentation: 2007

Abstract

This report presents the results and observations of a study on using the Global Arrays toolkit and Unified Parallel C (Berkeley UPC and IBM XL UPC) for distributed shared memory programming. A goal of this project was to improve the memory usage and performance of a molecular dynamics application on massively parallel processing machines. The main target system for this effort was an IBM Bluegene/L service operated by the Edinburgh Parallel Computing Centre. Since the original code implemented a data replication strategy, the problem size that could be solved on the Bluegene/L was limited, and the scaling of the original application was restricted to 128 processors. Through the use of distributed shared memory programming with the Global Arrays toolkit, the maximum system size that could be solved on the Bluegene/L improved by 77.7%. The original application also suffered from poor scaling due to the presence of serial components. By parallelizing the serial routines, the scaling issue was resolved: the distributed shared memory based code is capable of scaling up to 512 processors, while the original application was restricted to 128 processors. In order to compare and contrast the usability and performance of the Global Arrays toolkit against Unified Parallel C, Global Arrays versions of well known benchmarks and Global Arrays and Unified Parallel C versions of an image processing application were developed. The experience gained from this effort was used in a detailed comparison of the features, performance and productivity of Global Arrays and Unified Parallel C.

Contents

1 Introduction

2 Background
  2.1 Distributed Shared Memory programming model
  2.2 Hardware environment
  2.3 Global Arrays toolkit
    2.3.1 Global Arrays Architecture
    2.3.2 Structure of a GA program
    2.3.3 Productivity of GA programming compared to MPI
    2.3.4 Global Arrays on Bluegene/L
  2.4 Unified Parallel C
    2.4.1 UPC Architecture
    2.4.2 Structure of a UPC program
    2.4.3 Productivity of UPC programming compared to MPI
  2.5 Molecular dynamics

3 Results of preliminary investigation on Global Arrays
  3.1 Installation of the Global Arrays toolkit
    3.1.1 Installation of the Global Arrays toolkit on Lomond
    3.1.2 Installation of the Global Arrays toolkit on Bluegene/L
    3.1.3 Installation of the Global Arrays toolkit on HPCx
  3.2 Global Arrays benchmarks
  3.3 Image processing benchmarks using Global Arrays on Bluegene/L

4 Optimizing memory usage of molecular dynamics code using Global Arrays
  4.1 Molecular Dynamics
  4.2 The physical system and its effect on memory
  4.3 Structure of the molecular dynamics application
  4.4 Performance characteristics of original code
  4.5 Design of the GA based molecular dynamics application
  4.6 Performance of application using non blocking communication
  4.7 Result of memory optimization using Global Arrays
    4.7.1 Simulation of an Aqueous system
  4.8 Test results

5 Performance optimization of Molecular dynamics code
  5.1 Analysis of load imbalance and serial components
  5.2 Parallelizing RDF routines
    5.2.1 Result of performance optimization

6 UPC
  6.1 Installation of UPC
    6.1.1 Installation of UPC on Lomond
    6.1.2 Installation of IBM XL UPC on HPCx
  6.2 UPC syntax and common usage patterns
  6.3 UPC benchmarks
    6.3.1 Image reconstruction using shared memory and upc_forall
    6.3.2 Image processing using local memory and halo swaps
    6.3.3 Summary

7 Comparison of Global Arrays and UPC
  7.1 Portability and availability
  7.2 Comparison of GA and UPC syntax
  7.3 Effect of compiler optimisation
  7.4 Interoperability with MPI
  7.5 Comparison of communication latency of GA, UPC and MPI on HPCx

8 Conclusions
  8.1 Future Work

List of Figures

2.1 Shared memory model
2.2 Message passing model
2.3 Distributed shared memory model
2.4 Architecture of the GA toolkit
2.5 Structure of a generic program using Global Arrays
2.6 Comparison of MPI two sided communication to one sided communication
2.7 Complexity and communication pattern
2.8 Architecture of UPC compiler
2.9 Structure of a generic program using UPC syntax
3.1 Ping-pong benchmark using GA & MPI on Lomond
3.2 Ping-pong benchmark using GA & MPI on Bluegene/L
3.3 Ping-pong benchmark using GA & MPI on HPCx
3.4 Program structure of the image reconstruction application
3.5 Application performance - Image processing using GA with different input sizes on Bluegene/L
3.6 Application performance - Image processing using GA vs MPI
3.7 Application performance - Effect of Virtual node mode on Global Arrays based image processing benchmark
4.1 The physical system under simulation
4.2 Memory usage for storing position, velocity related data
4.3 Scaling of original MD code
4.4 Main Loop of the molecular dynamics code
4.5 A detailed view of the conjgradwall routine
4.6 Usage of the Global Array in the molecular dynamics application
4.7 Effect of using large buffers
4.8 Comparison of GA and MPI application performance
4.9 Blocking Vs Non Blocking GA program
4.10 Result of memory optimisation using GA
5.1 Analysis of load imbalance
5.2 The scaling of the MPI application before and after rdf routine was parallelised
5.3 Performance of GA and MPI version of MD code on HPCx
6.1 Ping-pong benchmark of UPC vs MPI communication on Lomond
6.2 Ping-pong benchmark of IBM XL UPC on HPCx compared to MPI performance
6.3 Ping-pong benchmark of Berkeley UPC on Lomond, Intel core duo compared to IBM XL UPC on HPCx
6.4 Image processing using UPC compared to MPI
6.5 Elapsed time for UPC program using shared memory vs serial program
6.6 XL UPC vs XLC compiler optimisation
6.7 Halo swap implemented using upc_memput and upc_memget
6.8 UPC - shared memory vs UPC - Message passing
6.9 Speed up of image processing benchmark using IBM XL UPC on HPCx
7.1 Comparison of communication latency of UPC, GA and MPI on HPCx

Acknowledgements

I thank Dr. Alan Gray for his support and guidance throughout all phases of this project. I also thank Dr. Mark Bull for the guidance and reviews provided.

My sincere thanks go to Mr. Aristides Papadopoulos. His MSc dissertation was used as the documentation of the original molecular dynamics code on which this project is based.

I thank the authors of the Global Arrays toolkit for the excellent documentation and tools support provided throughout the project.

Chapter 1

Introduction

Recent developments in massively parallel computing systems have contributed to the growing popularity of the distributed shared memory (DSM) programming model [1]. Modern massively parallel processing (MPP) systems like the IBM Bluegene/L are comprised of hundreds of thousands of lightweight nodes networked through specialized interconnects, and the interconnects of these machines are equipped with remote direct memory access hardware. Programming models like DSM can be used to exploit the aggregate compute power of these machines effectively. The DSM programming model offers many advantages over the message passing programming model. Its ease of use leads to high developer productivity while providing good performance and application scalability. The DSM toolkits and compilers available today are optimised for a wide range of machines and are therefore portable. Using the DSM programming model, an application developer can extend the capabilities of existing message passing applications with minimal modifications to the code base. This project focuses on two popular DSM programming systems: the Global Arrays (GA) toolkit [2] and Unified Parallel C (UPC) [3]. The main objectives of this project are to investigate DSM programming and to use the GA toolkit to make a molecular dynamics code, currently being used in research, more suitable for modern MPP machines. The project also investigates the ease of use and performance of UPC and compares it with that of the GA toolkit.

A data replication strategy is often used in parallel programs to avoid complex parallel decomposition techniques when implementing message passing programs. Data replication limits the scalability of the parallel code as it leads to higher memory usage per node, and many modern MPP supercomputers have limited memory per node; each chip on the IBM Bluegene/L addresses 512 MB of memory. This project investigates the use of the DSM model, through the GA toolkit, to reduce the memory footprint of a molecular dynamics application. This code was developed at the School of Chemistry, University of Edinburgh by Dr. Paul Madden and Dr. Stewart Reed. By using the DSM programming model the parallel decomposition strategy remains simple and unaltered, leading to higher developer productivity. The main target machine for this code was the IBM Bluegene/L operated by the Edinburgh Parallel Computing Centre (EPCC).

The second chapter of this report details the background and project goals, and provides an overview of the DSM programming model, the hardware used in this project, the GA toolkit, UPC and the molecular dynamics application.

The third chapter presents the observations from the installation and benchmarking of the GA toolkit on Lomond and Bluegene/L. A well understood image processing application was modified to use the GA toolkit. The performance of this application is also presented in the third chapter of this report.

The fourth chapter introduces the physical system represented by the molecular dynamics application, its structure, performance characteristics and memory utilization. The design of the GA based molecular dynamics application and the results of the memory optimization effort are presented in the fourth chapter.

One of the main issues observed in the original molecular dynamics code was its poor scalability. The fifth chapter presents the results of an investigation into the root cause of the scalability issue, the modifications implemented and the outcome of the performance optimization effort.

The sixth chapter of this report presents the results of an investigation of UPC programs on Lomond and HPCx. The image processing code was implemented in UPC and its performance was compared to that of the GA toolkit version. This chapter presents the outcome of this effort.

A comparison between the features of the GA toolkit and UPC is detailed in the seventh chapter of this report. This chapter compares GA and UPC in terms of developer productivity and ease of use of the syntax, and compares the performance achieved.

Finally, the conclusions from the observations of this project are summarized in the eighth chapter. Future work identified from this project is also detailed as part of this chapter.

Chapter 2

Background

The most common parallel programming models are the SMP (shared memory programming) model and the message passing programming model. While the SMP programming model is used widely on symmetric multiprocessor machines, the message passing model is used on symmetric multiprocessor machines, MPP (massively parallel processing) machines and SMP clusters [4].

The symmetric multiprocessor architecture (Figure 2.1) allows all processors to access a central shared memory, and all processors incur almost the same communication time to access it. The shared memory programming model relies on this common shared memory for inter-processor communication. OpenMP is the most commonly used programming language for SMP.

Figure 2.1: Shared memory model

MPP systems (Figure 2.2) are designed to use many nodes connected together using a low latency interconnect. Each node is an independent processing unit complete with a dedicated processor and memory. These distributed nodes communicate with each other by sending messages through the low latency interconnect. The individual nodes of the MPP system may be SMP servers, and such an MPP system is referred to as an SMP cluster.

MPI (Message Passing Interface) is the de facto standard used for programming distributed memory machines. The complexity of programming applications in MPI is considerably higher than with OpenMP, because message passing requires regular communication patterns in which a send operation on one processor must be matched to a receive operation on another processor. However, MPI is the most widely used parallel programming model, since it provides portability across machine architectures.

Figure 2.2: Message passing model

2.1 Distributed Shared Memory programming model

Figure 2.3: Distributed shared memory model

The DSM programming model (also referred to as the partitioned global address space programming model) attempts to provide a shared memory style of programming on a distributed memory machine (Figure 2.3). The memory of each node of the distributed memory machine is partitioned into a global (shared) area and a local area. While the local memory is private to its owner, the shared memory can be accessed by all processes. This is accomplished by using remote direct memory access (RDMA) facilities provided by the underlying hardware; many modern supercomputers, like HPCx, provide RDMA facilities at the hardware level [5]. If the hardware does not support RDMA, message passing is used to implement the distributed shared memory. The complexity of accessing remote memory is hidden by the DSM model. However, the cost of accessing remote memory is significantly higher than accessing local memory, so the DSM model exposes the non-uniform memory characteristics of the underlying hardware.

2.2 Hardware environment

The investigations described in this report were conducted mainly on three HPC facilities.

Lomond: Sun Fire e15k

Lomond [6] is a Sun Fire e15k SMP server, owned and operated by Edinburgh Parallel Computing Center. It is partitioned into a 48 processor back end and a 4 processor front end. It was used as the primary development server for GA programs as well as Berkeley UPC based code.

BlueSky service: Bluegene/L

The Bluesky HPC service [7] was used primarily for investigations of the molecular dynamics application using both GA and MPI. This service is comprised of a single rack Bluegene/L system. The Bluegene/L system uses 1024 IBM PowerPC 440 dual core chips interconnected using five specialized networks. Each node of the Bluegene/L can access 512 MB of memory. Bluegene/L supports two modes of operation - communication coprocessor mode and virtual node mode. In communication coprocessor (CO) mode, one core of each chip can access the full 512 MB of memory, while the other core handles communication operations. In virtual node (VN) mode, each core accesses 256 MB of memory and both cores actively participate in computation; this mode requires two MPI processes to be loaded onto a single node.

The architecture of Bluegene/L [8] is unique. Each of the 1024 chips in the Bluegene/L rack is a system-on-a-chip designed as an Application Specific Integrated Circuit (ASIC). The ASIC holds two 32 bit IBM PowerPC 440 cores and provides all the functionality of a compute node. The main memory is external to the chip and is mounted on the compute card. Each chip is equipped with a 64 bit double floating point unit and is capable of operating in SIMD (single instruction multiple data) mode. Each PowerPC 440 core has its own instruction and data caches and a small L2 pre-fetch buffer, while the two cores share a 4 MB L3 cache built from embedded dynamic random access memory (DRAM) and a double data rate memory controller. Each core of the ASIC operates at 700 MHz and is designed to minimize heat dissipation.

Two ASIC chips are mounted on each compute card, which also holds 512 MB of DDR memory per chip. Sixteen such compute cards are packed onto a single node card, and each node card supports up to 2 I/O cards. Each rack is comprised of 32 node cards.

Bluegene/L internal networking is comprised of five different networks integrated into the ASIC chip: a three dimensional torus network, a global tree network, a global barrier and interrupt network, gigabit ethernet and a JTAG control network. The torus network supports point to point communication, while the global tree network provides fast global operations such as reductions and broadcasts. A separate network offloads barrier and interrupt traffic from the tree network used for other global operations. I/O operations are performed over the gigabit ethernet through the I/O nodes. The JTAG network is also an ethernet network and is used for system administration and monitoring purposes.

HPCx: IBM P575 SMP Cluster

The HPCx service [9] is the primary national supercomputing service of the United Kingdom. It is a cluster of IBM SMP servers.

Each compute node of the HPCx system is a 16 processor IBM eServer 575 server. The IBM POWER5 processors on the HPCx nodes operate at 1.5 GHz. The processors have on-chip L1 and L2 caches; the L2 cache is 1.9 MB and is shared between 2 processors. Four chips are packaged into a multi-chip module, and two multi-chip modules form a frame. Each multi-chip module can address 128 MB of L3 cache and 16 GB of main memory. The total memory of each frame is shared between the 16 processors of the frame. The frames are interconnected using the IBM HPS (High Performance Switch).

The main HPCx service is comprised of 160 compute nodes, and provides access to a total of 2560 processors.

2.3 Global Arrays toolkit

The GA toolkit [2] provides a shared memory interface for distributed memory machines. Using the GA toolkit, distributed arrays which span multiple nodes can be created. All nodes can access the shared array using a common array indexing scheme, irrespective of the locality of the shared array segment. This feature simplifies parallel software development. The GA toolkit implements the communication routines needed to enable the shared memory abstraction on distributed memory computers, and is able to use one sided communication on machines which support remote direct memory access. The GA toolkit is designed to coexist with MPI, and hybrid programming using GA with MPI is recommended [10]. The GA toolkit can be installed to use MPI, TCGMSG or TCGMSG-MPI as its communication layer; TCGMSG [11] is a portable message passing toolkit, while TCGMSG-MPI is an implementation of the TCGMSG interface over MPI. The GA toolkit is available on a wide range of machines and offers excellent portability.

The GA toolkit mainly supports four distributed shared memory operations: get, put, scatter and gather. It also implements atomic read-and-increment and accumulate operations. Synchronization is made possible through the use of fences, locks and global barriers. These operations are one-sided, made possible by the global-array-index to address translation implemented in the GA toolkit. Global Arrays thus provides a higher level interface to distributed shared memory.

The GA toolkit is recommended by its designers for use in algorithms and applications with dynamic or irregular communication patterns, and for applications which need dynamic load balancing. Applications with dynamic communication patterns are hard to parallelise using pure MPI; MPI is more suitable for applications where the communication patterns are known in advance. MPI communication calls are two sided and need matching send and receive operations. GA communication routines, on the other hand, use one sided communication: the target processor is not declared by the programmer, who instead specifies the location of the array segment in the global address space. Data locality must be known in advance to use GA based communication. Global Arrays is not well suited to applications with regular point to point communication patterns, as it is optimised for bulk data transfer.

When a global array is created, the programmer can choose to distribute the global array across the available nodes by specifying chunk sizes, or can allow the library to select the data distribution. The data distribution block size can be specified explicitly for all dimensions of the multidimensional array, or one dimension can be distributed explicitly while the library distributes the rest of the dimensions implicitly. Irregular distribution of the array dimensions can be implemented by providing a map of the desired distribution pattern. The locality information of a global array can be queried through user friendly interfaces, which allows programmers to control data locality explicitly and minimize communication costs.
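As an illustration of the locality query interface, the short C sketch below asks the library which patch of a previously created global array is owned by the calling process. The handle g_a and the two dimensional shape are assumptions made for the example; NGA_Distribution and GA_Nodeid are part of the GA C interface.

    /* Query the bounds of the patch of a 2-D global array owned by this
       process; g_a is assumed to have been created earlier with NGA_Create(). */
    int me = GA_Nodeid();
    int lo[2], hi[2];

    NGA_Distribution(g_a, me, lo, hi);  /* patch owned by process 'me'          */
    /* GA reports an empty range (e.g. a negative lower bound) when 'me' owns
       no elements; otherwise the patch can be touched through NGA_Access().    */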

The GA toolkit implements a rich set of functions that operate on whole arrays as well as array subsections. These operations are collective, data parallel operations. Copying a global array to another global array can be accomplished with a single function call; copying a patch of a global array into another global array is likewise a data parallel collective operation, and the patch can be transposed while it is transferred. Many elementary matrix operations, such as transpose, scaling, elemental addition, matrix multiplication, dot products and basic linear algebra operations, are also available as part of the GA toolkit.

To support halo swaps, the GA toolkit allows a shared array to be defined with ghost cells at its boundaries, whose width is controlled by the programmer. A halo swap between adjacent patches can then be performed with a single GA function call. On SMP clusters the GA toolkit can leverage shared memory to cache data from neighbouring nodes and improve application performance; this technique is referred to as mirroring.
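A minimal C sketch of this ghost cell facility is shown below. The array size, ghost width and array name are illustrative assumptions; NGA_Create_ghosts and GA_Update_ghosts are the relevant GA calls.

    /* Create a 2-D global array with a one-cell ghost region on every
       boundary and refresh the halos with a single collective call. */
    int dims[2]  = {1024, 1024};
    int width[2] = {1, 1};      /* ghost cell width in each dimension      */
    int chunk[2] = {-1, -1};    /* let the library choose the distribution */

    int g_field = NGA_Create_ghosts(C_DBL, 2, dims, width, "field", chunk);

    /* ... each process updates its local patch ... */

    GA_Update_ghosts(g_field);  /* halo swap between neighbouring patches  */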

2.3.1 Global Arrays Architecture

The layered architecture of the GA toolkit [12] is depicted in Figure 2.4. The GA toolkit supports many system level interfaces and networking protocols, including IBM LAPI, GM/Myrinet, Infiniband/VIA, threads, Quadrics Elan4/Elan3, and BGML (IBM Bluegene/L). This gives the GA toolkit excellent portability across machine architectures.

The GA toolkit can be installed to use either MPI or ARMCI as its communication layer, and provides complete interoperability between the ARMCI and MPI layers. On many systems it is recommended to use the vendor supplied MPI instead of ARMCI. For example, on the IBM Bluegene/L the MPI layer is customized to exploit the novel architecture of the Bluegene/L hardware, so it is beneficial to use the MPI layer for communication.

The GA toolkit uses the Aggregate Remote Memory Copy Interface (ARMCI) [13] to implement one-sided communication. ARMCI is a general purpose, portable library that allows remote memory access to contiguous and non contiguous data. It provides high performance one-sided communication by exploiting native communication interfaces; on clustered systems it uses high performance network protocols, and supports Myrinet (GM), Quadrics, Giganet (VIA) and Infiniband. It is compatible with MPI, and is supported on many systems such as the IBM SP, Cray (X1, T3E, SV1, J90), Fujitsu (VX/VPP, PRIMEPOWER), NEC SX-5, Hitachi SR8000, and IBM Bluegene/L. ARMCI implements much of the critical functionality exposed by the GA toolkit, such as data transfer operations, atomic operations, locks, memory management and synchronization operations. The very low overhead implementation of ARMCI is capable of achieving near peak bandwidth on cluster interconnects and provides significant performance benefits to Global Arrays. The GA toolkit routines can issue multiple non blocking ARMCI data transfer operations to overlap computation and one sided communication.

Figure 2.4: Architecture of the GA toolkit (application programming interfaces in F77, F90, C, C++, Python and Java; distributed arrays layer for memory management and index translation; MPI and ARMCI; system specific interfaces such as LAPI, GM/Myrinet and threads)

The distributed arrays layer of the GA toolkit implements distributed memory management and index translation. Distributed memory management is performed by the memory allocator module, while the index translation layer implements a user friendly interface for addressing the distributed global array.

Application programming interfaces are available in many leading HPC languages including Fortran 77, Fortran 90, C, C++, Python and Java.

The GA routines depend on the execution environment provided by MPI, and are designed to enhance the usability of MPI. This allows the GA routines to interoperate with many MPI libraries and to utilise popular HPC numerical packages; for example, the GA toolkit relies on ScaLAPACK [14] for linear algebra functionality.

2.3.2 Structure of a GA program

The structure of a generic GA program is shown in Figure 2.5. GA based programs need to include header files from the GA toolkit: for C programs the files global.h, ga.h and macdecls.h are included, while Fortran programs include mafdecls.fh and global.fh. If the GA toolkit was installed to use MPI as the communication layer, an MPI environment is required for the GA program: MPI is initialised first, then the GA environment, and finally the memory allocator. The global array is then created and updated by the program. After the global array has been updated, each process can either copy a part of the global array into a local buffer and work on it, or access the locally resident segment of the global array using pointers. After the work is completed, the GA environment is terminated, following which the MPI environment is terminated.

If the GA toolkit was compiled with TCGMSG, or with TCGMSG built over MPI, the program structure changes only slightly: the TCGMSG environment (PBEGIN_/PEND_) is used in place of the MPI environment.
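The C sketch below illustrates the MPI-based structure. The array shape, the memory allocator limits and the array name are assumptions chosen for the example, and the calls shown (GA_Initialize, MA_init, NGA_Create, GA_Sync, GA_Terminate) follow the GA C interface.

    /* Minimal sketch of a GA program built on an MPI-based installation. */
    #include <mpi.h>
    #include "ga.h"
    #include "macdecls.h"

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);              /* start the MPI environment       */
        GA_Initialize();                     /* start the GA environment        */
        MA_init(MT_DBL, 1000000, 1000000);   /* initialise the memory allocator */

        int dims[2]  = {1024, 1024};
        int chunk[2] = {-1, -1};             /* let the library pick the layout */
        int g_a = NGA_Create(C_DBL, 2, dims, "work_array", chunk);

        /* Each process updates its part of the array with NGA_Put(), copies
           remote patches into a local buffer with NGA_Get(), or touches the
           locally resident segment directly with NGA_Access(), then computes. */

        GA_Sync();                           /* global barrier before shutdown  */
        GA_Destroy(g_a);
        GA_Terminate();                      /* terminate GA, then MPI          */
        MPI_Finalize();
        return 0;
    }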

Figure 2.5: Structure of a generic program using Global Arrays

9 2.3.3 Productivity of GA programming compared to MPI

Developer productivity in high performance computing can be measured in terms of the ease of use of the parallel programming model and the parallel efficiency of the resulting program. GA programs differ from MPI based programs in terms of developer productivity, and the difference stems primarily from the differences in the nature of MPI and GA communication routines.

Figure 2.6 represents the two sided communication model followed by the MPI communication routines, in which both the sending and the receiving processors are involved in communication. To transfer a block of data from a data structure on processor N to another data structure on processor 0, the following operations need to be performed when using MPI. First, processor N identifies the location and size of the data blocks to be transferred, and processor 0 identifies the location and size of the data it needs to receive from processor N. The data that processor N needs to send is then packaged into a buffer and sent by calling an MPI send routine. Processor 0 issues a receive operation after allocating a receive buffer. Once the matching sends and receives are paired, the communication takes place. After the communication completes, the received data is copied out of the buffer for use in computation.

When using the GA toolkit, the operation described above is reduced to a single call of the nga_get routine [15]. To transfer data from a segment of the global array residing on processor N, the nga_get(g_a,lo,hi,buffer,ld) call is used, where the lo and hi arrays mark the data to be transferred from the global array to the calling processor. This syntax can therefore also handle data transfers from multidimensional array segments.
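A short C sketch of such a one-sided fetch is given below; the handle g_a, the patch bounds and the buffer size are illustrative assumptions, and no matching call is needed on the process that owns the patch.

    /* Fetch a 100x64 patch of a 2-D global array into a local buffer. */
    int    lo[2] = {100, 0};    /* first element of the patch (inclusive) */
    int    hi[2] = {199, 63};   /* last element of the patch (inclusive)  */
    int    ld[1] = {64};        /* leading dimension of the local buffer  */
    double buffer[100 * 64];

    NGA_Get(g_a, lo, hi, buffer, ld);  /* one-sided: the owner is not involved */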

Figure 2.7 depicts how the complexity of an MPI program increases with the shape and location of the data to be transferred within the global array. As the pattern of the data to be accessed from the distributed array becomes more complex, the complexity of the MPI program increases. The GA program does not become more complex in such a situation, as the only change needed is in the values of the lo and hi arrays.

The GA toolkit is completely interoperable with MPI. The GA toolkit can use MPI as its communication layer, and distributed arrays created with GA can be used in MPI applications. Using such hybrid programming, applications can achieve a trade off between performance and the memory footprint of the application.

Program complexity can arise from syntactic complexity, the length of the program, or conceptual/semantic complexity. Syntactic complexity is a measure of the complexity of translating the algorithm into code; in the context of parallel computing, it depends heavily on the complexity of converting the serial code into parallel code. The global array indexing scheme provided by the GA toolkit reduces the syntactic complexity of parallel code.

As discussed earlier in this section, GA syntax is more concise than MPI code when complex data transfers need to be performed. The number of lines of parallel code is therefore significantly reduced.

Conceptual or semantic complexity refers to the complexity arising from the difference between the structure of the parallel program and the structure of the original program. The structure of a GA based parallel program is simpler than that of the MPI program due to the nature of the GA toolkit.

Figure 2.6: Comparison of MPI two sided communication to one sided communication

Figure 2.7: Complexity and communication pattern

However, the difference between GA based code and an equivalent serial code remains high, because GA programs still need to perform parallel decomposition in order to achieve good performance. In contrast, parallel programming languages like OpenMP score highly in reducing conceptual/semantic complexity, as they rely on task parallel programming by distributing serial loops.

2.3.4 Global Arrays on Bluegene/L

The IBM Bluegene/L supercomputer supports the ARMCI operations through a low level communication interface that exposes the communication hardware features of Bluegene [16]. The GA toolkit became available on Bluegene/L only recently; previously the Bluegene/L software stack supported only MPICH2. The low level communication interface allows ARMCI, MPI-2 one-sided communication and UPC to take advantage of the communication hardware features of Bluegene/L. This message layer is a protocol engine that processes incoming and outgoing packets in cooperation with the hardware layer. The protocols defined on the message layer enable one-sided communication on the Bluegene/L.

Three protocols enable one-sided communication on Bluegene/L. The PUT protocol copies data from the origin node to the target node, while the GET protocol is the reverse operation. In the PUT protocol the target sends an acknowledgement on completion of the data transfer; after a GET operation completes, the target does not send an acknowledgement. The third protocol is the RMW (Read-Modify-Write) protocol: on receipt of the task initialization packet, the target node performs the operation requested by the source node, and on completion it sends a response packet back to the origin node to signal message completion. The RMW protocol also supports a swap operation, in which the data at the target node is exchanged with the origin node's data. The fetch operation implemented as part of the RMW protocol is used by ARMCI to implement sum and product operations, which are performed on the target node with the result stored there.

Special optimisations are implemented on the Bluegene/L to improve the bandwidth and latency of one-sided operations. Bandwidth is optimised by sending data on all available networks exiting a node; this achieves the maximum bandwidth available for communication and proves beneficial when large data transfers are needed. To reduce communication latency, the communication operation is divided into two overlapping phases. In the first phase the message context is established; since the cost of establishing the message context is high, this phase simultaneously transfers data in self-describing packets. Once the message context is established, full payload data packets can be issued. This latency hiding helps implement a low latency, high bandwidth interconnect system. ARMCI can sustain greater bandwidth for large message sizes than MPI messaging. GA inherits the above performance optimisations, and this promotes application scalability.

2.4 Unified Parallel C

Unified Parallel C is a distributed shared memory programming language that provides a parallel extension to the C programming language [3]. Many distributions of UPC are available today. This project investigated the use of IBM XL UPC [17] and Berkeley UPC: Berkeley UPC was installed on Lomond, and IBM XL UPC was investigated on HPCx.

UPC supports both task parallel and data parallel programming. The primary goal of UPC is to provide an easy to use programming interface to distributed memory machines without sacrificing performance and scalability. Using UPC, a collection of threads can be spawned to operate on a global address space partitioned across the threads. The threads are data locality aware and have an affinity towards their private space and a portion of the global address space.

The UPC memory model [18] divides the distributed memory of a cluster of machines into private and shared spaces. Each thread has an affinity to a partition of the shared memory. UPC allows the creation of shared and private variables, as well as shared and private pointers. A shared pointer is used to point to addresses in the shared address space, while a private pointer can reference addresses in the thread's private space or in the local portion of the shared space.

Using UPC, the programmer can easily distribute data across processors using simple declarations. The UPC syntax allows easy exploitation of data locality: it is easy to identify the locality of the elements in the shared space, and this information can be used to implement parallel programs that leverage data affinity. UPC also provides synchronization facilities in the form of barriers and locks. Since data is shared between threads, UPC implements two memory consistency models - strict and relaxed - where the strict mode enforces ordering of independent shared access operations. UPC also supports dynamic memory allocation in both the shared address space and the private space.
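The following C sketch illustrates these declarations; the block size, array name and assignments are assumptions made for the example.

    /* Shared and private data in UPC. Each thread owns one contiguous
       block of BLOCK elements of the shared array. */
    #include <upc_relaxed.h>

    #define BLOCK 256

    shared [BLOCK] double grid[BLOCK * THREADS];

    shared double *sp;  /* pointer-to-shared: may reference any thread's partition */
    double *lp;         /* private pointer: local data or the locally owned part   */

    int main(void) {
        sp = &grid[0];                           /* address in the shared space     */
        lp = (double *)&grid[MYTHREAD * BLOCK];  /* cast is valid only for elements
                                                    with affinity to this thread    */
        lp[0] = (double)MYTHREAD;                /* fast access to local elements   */
        upc_barrier;                             /* synchronise all threads         */
        return 0;
    }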

Figure 2.8: Architecture of UPC compiler (UPC code translator; translator generated C code, a platform independent layer; Berkeley UPC runtime, a network independent layer; GASNet communication layer; network hardware)

2.4.1 UPC Architecture

The Berkeley UPC compiler is comprised of multiple layers of abstraction (Figure 2.8). Above the network layer sits GASNet [19] (Global-Address Space Networking), a language independent layer that provides network independent, low level, high performance communication primitives used to implement distributed shared memory programming languages like UPC, Titanium [20] and Co-Array Fortran [21]. A network independent layer called the Berkeley UPC runtime provides the communication functionality needed by UPC programs. UPC programs are translated by the UPC translator into C code embedded with calls to the Berkeley UPC runtime, so the translator generated C code is platform independent. The layered architecture of UPC makes it easier to port the code across different platforms.

2.4.2 Structure of a UPC program

The UPC syntax is very different from that of MPI. UPC attempts to provide a shared memory programming syntax over a distributed memory hardware architecture. Unlike the GA toolkit, which uses a library based approach in which calls need a large number of parameters to identify data locality, UPC implements data locality aware threads: each UPC thread possesses affinity to the processor which owns the data that it operates upon.

Figure 2.9 represents the structure of a generic UPC program. UPC programs can be implemented in three different ways. The first approach uses loop parallelism and the memory affinity of the threads to distribute work across many threads; this is the simplest way of implementing a UPC program when a serial program is provided as the starting point. The second approach is to parallelise the application using domain decomposition techniques and to implement the halo swaps using the upc_memput and upc_memget routines (see the sketch below). This technique increases the syntactic complexity of the parallel program, since the parallel program uses buffers updated by one sided communication calls to the shared memory; however, the presence of the globally accessible shared memory still makes the code structure simpler than MPI, and in this mode the UPC program structure is comparable to the GA program structure. The third approach is to access the shared data structures using shared and private pointers, which is more complex than the prior two modes.
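A hedged C sketch of the second approach follows. The decomposition (one block of rows per thread), the array names and the sizes are assumptions for illustration; upc_memput and upc_memget are the standard UPC bulk copy routines.

    /* Each thread keeps a private block with a halo row above its interior
       rows; halos are exchanged through a shared boundary buffer. */
    #include <upc_relaxed.h>

    #define NX 256                 /* interior rows per thread */
    #define NY 256                 /* columns                  */

    /* One boundary row per thread, with affinity to the thread publishing it. */
    shared [NY] double boundary_row[NY * THREADS];

    double local[NX + 2][NY];      /* rows 1..NX plus halo rows 0 and NX+1 */

    void swap_halo_from_above(void) {
        /* Publish my last interior row into my slot of the shared buffer. */
        upc_memput(&boundary_row[MYTHREAD * NY], &local[NX][0],
                   NY * sizeof(double));
        upc_barrier;
        /* Read the row published by the thread above me into my top halo. */
        if (MYTHREAD > 0)
            upc_memget(&local[0][0], &boundary_row[(MYTHREAD - 1) * NY],
                       NY * sizeof(double));
        upc_barrier;
    }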

Figure 2.9: Structure of a generic program using UPC syntax

2.4.3 Productivity of UPC programming compared to MPI

The UPC syntax is tightly integrated with the C programming language. This eliminates one level of complexity faced when using Global Arrays or MPI - the complexity of installing the library and linking it with the parallel code.

The UPC syntax is very simple [22], as it attempts to implement an OpenMP style programming model. Loop parallelism is implemented in UPC through the upc_forall statement, which allows a loop to be parallelised easily among the threads. Data structures declared as shared are distributed across the nodes, and each thread's affinity to a part of the shared array is used automatically to parallelise the loop. This eliminates the need to implement halo swaps, which MPI programs as well as GA programs need to implement explicitly. The syntactic complexity of UPC programs is therefore very low.
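A small C sketch of this style is given below; the array names, their sizes and the scaling operation are illustrative assumptions. Each iteration of the upc_forall loop executes on the thread that has affinity to the element named in the affinity expression.

    /* Data parallel loop over a shared array using upc_forall. */
    #include <upc_relaxed.h>

    #define N 1024

    shared double a[N * THREADS];
    shared double b[N * THREADS];

    void scale(double factor) {
        upc_forall (int i = 0; i < N * THREADS; i++; &a[i]) {
            a[i] = factor * b[i];      /* run by the thread that owns a[i] */
        }
        upc_barrier;                   /* wait for all threads to finish   */
    }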

UPC programs implemented using the distributed shared memory and the upc_forall statement do not use explicit halos, and this greatly reduces the length of the program.

The UPC syntax scores highly in the reduction of semantic complexity as well. Since the serial program structure can be retained in most practical scenarios, the conceptual complexity of the parallel code is almost the same as that of the serial code.

A major part of this project investigated the productivity of GA as well as UPC. In a high performance computing environment, productivity also depends on the ease of achieving good parallel performance. This investigation was performed using a well known image processing example, implemented using both the GA toolkit and UPC. The results of this investigation are presented in the seventh chapter of this report.

2.5 Molecular dynamics

Molecular dynamics is a powerful computer simulation technique which solves the classical many body problem at the atomic level by allowing atoms and molecules to interact over time, constrained by the known laws of physics [23]. Molecular dynamics simulation uses numerical methods to determine the properties of complex systems and is considered a simulation experiment. It is a statistical mechanics method in which the evolution of many interacting atoms is determined by integrating their equations of motion. Molecular dynamics simulations are widely used to study many physical systems, such as the behavior of liquids [24], defects in crystals, fracture of solids, surfaces, friction, the behavior of clusters of molecules, biomolecules, and the electronic properties and dynamics of materials.

The laws of classical mechanics, like Newton's laws of motion, are applied to each atom in a complex system of atoms. Long range and short range forces act on each atom due to its interaction with every other atom in the system. Since the atoms are in motion, their relative positions and the forces acting on them change continuously. A potential energy function, which depends on the positions of the particles in 3D space, is used to calculate the forces acting on the atoms. The potential function is chosen to reflect the nature of the material as well as the physical conditions of the simulation, and is a key component of the model of the physical system. The Lennard-Jones (L-J) pair potential [23] is widely used to represent the interaction model in molecular dynamics simulations. The L-J potential is a mathematical model that approximates the behavior of atoms under attractive long range forces (van der Waals forces) and repulsive short range forces (Pauli repulsion). In systems used to study electronic properties, Coulomb (electrostatic) forces also play a role in determining the state of the system.
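For reference, the Lennard-Jones pair potential has the standard form below, where r is the interatomic separation, epsilon is the depth of the potential well and sigma is the separation at which the potential crosses zero (the specific parameterisation used by the application is not given here):

    V_{\mathrm{LJ}}(r) = 4\varepsilon \left[ \left( \frac{\sigma}{r} \right)^{12} - \left( \frac{\sigma}{r} \right)^{6} \right]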

Molecular dynamics simulation is therefore a many body problem. These simulations are compute intensive and scale as O(N²), where the size of the system, N, is the number of atoms interacting with each other: the computational cost increases as the square of the number of interacting atoms. The accuracy of the simulation improves with the number of compute iterations performed.

A molecular dynamics simulation proceeds in discrete steps: the evolution of the system is discretized into small time steps whose duration is chosen to avoid discretization effects. A trade off is made between the execution time of the simulation and the desired accuracy, so that accurate solutions can be obtained within acceptable time frames. As the computation progresses, statistical data is collected and written out periodically. Upon completion of the simulation, the configuration of the final state of the system is written to restart files, which can be used to start the next run.

Molecular dynamics simulations are very compute intensive, and many molecular dynamics codes use parallel computing techniques to decrease the execution time; this enables more accurate simulations to be performed in the same time frame. The maximum size of the system that can be simulated depends on the memory available on the computer: larger systems can be simulated on machines with more memory. However, SMP systems with large memories are often more expensive than clusters of commodity servers or MPP systems of the same size. This motivates the use of DSM programming in molecular dynamics codes, since with DSM a trade off can be reached between aggregate memory usage and the performance of the MD code. This project investigates the use of DSM programming on a molecular dynamics code currently used by researchers at the School of Chemistry, University of Edinburgh to study electron transfer from electrodes to an electrolyte. The code has many applications in industry; one such application is the study of the electrolytic smelting of aluminium, a process that is very energy inefficient and for which research is aimed at reducing energy use.

Molecular dynamics simulations can be parallelised because the computation of the forces acting on each particle, as well as the updates to the velocities and positions of each particle, can be computed independently [25]. The most significant loop that is parallelised is the calculation of the forces. Many parallel decomposition techniques are available for use in molecular dynamics codes; the most common are atom decomposition, pair decomposition, checkerboard pair decomposition, force decomposition, systolic loops and domain decomposition. These decomposition techniques vary significantly in complexity and ease of implementation, as well as in the memory requirements and performance of the parallel code. The work by Aristides Papadopoulos reviewed several of these decomposition techniques while parallelising the original serial molecular dynamics code. A few of the observations made by Aristides [26] are described in the following paragraph.

Atom decomposition uses a replicated data strategy. This technique is easy to implement, assuming that the developer starts from the serial molecular dynamics code: work is distributed over N processors by parallelising the iterative loops used in the molecular dynamics code. The technique wastes memory, since the data is replicated on all nodes of the parallel computer. The cyclic-pair decomposition technique can reduce the memory requirements by using only the elements of the force matrix below the main diagonal; however, the memory requirements of the code are still high enough to prevent the simulation of large systems. The force decomposition technique uses a block decomposition of the force matrix and aims to reduce the global communication needed to parallelise the code, but it increases the complexity of the parallelisation process while the performance gains on modern MPP machines are not significant. The domain decomposition technique splits the simulation volume into many cells and assigns each cell to a processor. This helps reduce the memory requirements of the code, as each node does not need to store the full configuration data, but it adds complexity since particles need to be reassigned to new processors as they cross sub-domain boundaries. The data replication strategy was used by Aristides to parallelise the serial molecular dynamics application, considering the simplicity of the parallel decomposition strategy and the time constraints under which that project was executed.

This project investigates the use of DSM programming to reduce the memory usage of the molecular dynamics application per node. By storing the large arrays in the distributed shared memory using the GA toolkit, the memory consumption of the code can be reduced significantly, without any drastic alteration to the code structure. This allows larger system sizes to be simulated on modern MPP hardware. This is especially important for machine architectures like the IBM Bluegene/L.

Chapter 3

Results of preliminary investigation on Global Arrays

The GA toolkit [2] is open source software available for free download. It is developed and maintained by the William R. Wiley Environmental Molecular Sciences Laboratory at the Pacific Northwest National Laboratory, and has been available in the public domain since 1994. The GA toolkit can be used on a wide range of hardware architectures including distributed memory machines, shared memory machines and networks of workstations. To use the GA toolkit, the source code is downloaded and compiled on the target machine. The compilation process generates a set of include files and libraries, which can be included in and linked against the application code to utilise the distributed shared memory functionality provided by the GA toolkit.

3.1 Installation of the Global Arrays toolkit

The GA toolkit was installed on Lomond, HPCx and Bluegene/L. On Lomond and HPCx, versions of the GA toolkit from 4-0-2 onwards were installed successfully. However, on the Bluegene/L the versions of the GA toolkit prior to 4-0-6 failed to install successfully, even though Bluegene/L is available as a target from GA version 4-0-2 onwards. The issues faced in the installation of the GA toolkit on Bluegene/L were attributed to bugs in the GA toolkit releases prior to GA 4-0-6.

The GA toolkit can be installed in three configurations - GA using MPI, GA using TCGMSG and GA using TCGMSG-MPI (refer to section 2.3 for details of the GA architecture). GA using MPI is recommended as the primary mode of installation, since this allows GA to benefit from system specific MPI optimizations. All installations of the GA toolkit were performed using MPI as the communication layer.

3.1.1 Installation of the Global Arrays toolkit on Lomond

The Lomond service was used as the primary development machine for investigating GA performance. The GA toolkit installation was found to be straightforward on Lomond.

A makefile is provided with the GA toolkit; on Lomond, it can be invoked by issuing the make command from the root folder of the GA toolkit after setting certain environment variables. These environment variables differ from system to system. On Lomond, the following environment variables were used to compile GA 4-0-6.

• TARGET=SOLARIS
• USE_MPI=yes
• MPI_INCLUDE=/opt/SUNWhpc/include
• MPI_LIB=/opt/SUNWhpc/lib
• FC=mpf77
• CC=mpcc

The submission script for a GA based program is similar to that of an MPI program on Lomond.

3.1.2 Installation of the Global Arrays toolkit on Bluegene/L

Installation of the GA toolkit on Bluegene/L was not straightforward. Though Bluegene/L was supported by the GA toolkit from version 4-0-2 onwards, the installation was stable only with version 4-0-6.

On the EPCC Bluegene/L, the following options were used to compile the GA toolkit.

• BGLSYS_DRIVER=/bgl/BlueLight/ppcfloor
• BGLSYS_ROOT=$BGLSYS_DRIVER/bglsys
• BLRTS_GNU_ROOT=$BGLSYS_DRIVER/blrts-gnu
• BGDRIVER=$BGLSYS_DRIVER
• BGCOMPILERS=$BLRTS_GNU_ROOT/bin
• TARGET=BGL
• USE_MPI=yes
• ARMCI_NETWORK=BGMLMPI
• MSG_COMMS=BGMLMPI
• MPI_LIB=$BGLSYS_ROOT/lib
• MPI_INCLUDE=$BGLSYS_ROOT/include
• LIBMPI="-L$MPI_LIB -lfmpich_.rts -lmpich.rts -lmsglayer.rts -lrts.rts -ldevices.rts"
• BGMLMPI_INCLUDE=$MPI_INCLUDE
• BGMLLIBS=$MPI_LIB
• make FC="blrts_xlf90" CC="gcc"

These installation options allow the GA toolkit to utilise the MPI network for GA's internal communication. Since the MPI on Bluegene/L is designed to utilise the internal networks of Bluegene/L efficiently, GA benefits from the use of the vendor supplied MPI libraries.

The choice of compilers used to compile the GA toolkit was made based on feedback from other users and through trial and error experimentation. The GNU GCC compiler for Bluegene was used to avoid memory alignment problems reported with the blrts_xlc compiler [27].

The submission script of a GA program is similar to that of an MPI program on Bluegene/L. It was observed that for successful compilation and operation of GA programs on Bluegene/L, the -qEXTNAME compiler option needed to be supplied with the names of the GA routines used in the program.

3.1.3 Installation of the Global Arrays toolkit on HPCx

Global Arrays is designed to use LAPI (Low-level Application Programming Interface) on IBM P575 SMP clusters. The following options were used to install the Global Arrays toolkit on HPCx; these settings allow end users to exploit the performance benefits offered by the LAPI interface.

• TARGET=LAPI
• USE_MPI=yes
• FC=mpxlf
• CC=mpcc

The performance as well as the stability of GA programs depends on the submission script on HPCx. The environment variable RT_GRQ=ON needs to be specified in the submission script; without this option, GA based programs will fail on HPCx. In addition, the following settings were used when executing hybrid programs using both MPI and GA on HPCx.

• RT_GRQ=ON
• LDR_CNTRL=MAXDATA=0x80000000@DSA
• MP_SHARED_MEMORY=yes
• MP_CSS_INTERRUPT=yes
• AIXTHREAD_SCOPE=S
• MP_POLLING_INTERVAL=25000

When using hybrid programming with both MPI and GA on HPCx, both messaging layers need to be enabled. This is accomplished by adding the network.LAPI=csss,not_shared,us option to the submission script, in addition to network.MPI=csss,not_shared,us.

3.2 Global Arrays benchmarks

The ping-pong benchmark is widely used to quantify and compare the performance of inter-process communication. The elapsed time to transfer a message from the memory addressed by one processor to the memory addressed by another yields a measure of the latency of the network connection between two processes; the ping-pong benchmark performs this test for increasing message sizes. Communication routines of different message passing libraries have differing capabilities and characteristics in terms of latency, and parallel application performance is heavily dependent on the latency of data transfer.

To study the communication performance of the GA routines and to compare it to the performance of the corresponding MPI routines, an existing MPI ping-pong benchmark was modified to include tests for the GA communication routines. The original benchmarks were included with the Berkeley UPC software and were available for MPI and UPC.

The ping-pong benchmark used in this test performs the following steps. First, two arrays of size equal to the maximum message size requested by the user are created. Then, random lists of processor communication patterns are generated; this ensures that under test conditions involving many processors, all processors are accounted for when deriving the test results. After the sending and receiving processors are identified, a few small messages are transferred to warm up the communication, which reduces the effect of start-up costs on the final benchmark result. Data is then transferred from the array on the source processor to the target processor, with source and target processors identified and mapped in random order. The elapsed time to complete a fixed number of inter-processor messages is measured by the program, and the results are written to files in CSV format. By graphing the elapsed time as a function of the message size, the inter-processor communication characteristics of different machines can be compared. In this section, the results of inter-processor communication on Lomond, Bluegene and HPCx when using the GA toolkit are presented.
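A hedged C sketch of the timing kernel for the GA variant of the test is shown below. The one dimensional array layout, the patch arithmetic and the helper name are assumptions for illustration and not the exact benchmark code; the calls used (GA_Nodeid, GA_Sync, NGA_Put, NGA_Get, MPI_Wtime) are standard GA and MPI routines.

    /* Time repeated one-sided put/get operations against the patch of a
       1-D global array owned by the target process. */
    #include <stdlib.h>
    #include <mpi.h>
    #include "ga.h"

    double ga_pingpong(int g_a, int src, int dst, int nelems, int repeats) {
        int me = GA_Nodeid();
        int lo = dst * nelems;              /* patch assumed to be owned by dst */
        int hi = lo + nelems - 1;
        int ld = 1;
        double *buf = malloc((size_t)nelems * sizeof(double));
        double t0, t1;

        for (int i = 0; i < nelems; i++) buf[i] = (double)i;

        GA_Sync();                          /* start all processes together     */
        t0 = MPI_Wtime();
        if (me == src) {
            for (int r = 0; r < repeats; r++) {
                NGA_Put(g_a, &lo, &hi, buf, &ld);   /* "ping"                   */
                NGA_Get(g_a, &lo, &hi, buf, &ld);   /* "pong"                   */
            }
        }
        GA_Sync();                          /* ensure remote completion         */
        t1 = MPI_Wtime();
        free(buf);
        return (t1 - t0) / (2.0 * repeats); /* approximate time per transfer    */
    }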

The results of the benchmarking tests on Lomond are presented in Figure 3.1. Similar tests were repeated on Bluegene/L and their outcome is presented in Figure 3.2.

Figure 3.1: Ping-pong benchmark using GA & MPI on Lomond

On Lomond the GA blocking routines ga_get and ga_put displayed good performance for small message sizes. However, for messages above 0.1 MB the performance of the blocking routines deteriorated considerably. The GA non-blocking routines ga_nbget and ga_nbput outperformed their blocking counterparts for large messages. The MPI routines outperformed the GA routines for message sizes exceeding 0.1 MB.

Figure 3.2: Ping-pong benchmark using GA & MPI on Bluegene/L

On Bluegene/L a large difference in communication performance was observed between the GA routines and the MPI routines. There was no significant difference in performance between blocking and non-blocking routines in MPI, and the same was true of the GA routines. GA non-blocking performance was observed to deteriorate for a few message sizes; this behavior was consistent when the tests were repeated multiple times. Both GA and MPI non-blocking communications incurred high costs at very small message sizes.

The characteristics of the ping-pong benchmark on HPCx were found to be different from those on Bluegene/L and Lomond. HPCx supports RDMA through LAPI, and an environment variable in the submission script can enable RDMA support for MPI as well. The GA toolkit is designed to work optimally using the LAPI communication layer directly; this provides it with significantly better performance than MPI with RDMA enabled, because the one sided model offered by the GA toolkit is able to utilise RDMA effectively. For message sizes below 0.1 MB, the GA get and put routines outperformed the equivalent MPI send-receive routines. Above this limit the GA non-blocking communication outperforms MPI communication, whereas the GA blocking routines failed to provide significant improvements. As the message sizes increased to over 1 MB, MPI offered better bandwidth than the GA routines. On Lomond the GA non-blocking communication routines suffered a performance degradation compared to the GA blocking routines for messages smaller than 0.1 MB; this characteristic was not observed on HPCx. Figure 3.3 presents the communication characteristics of GA and compares them against those of MPI on HPCx. Both blocking and non-blocking routines provided equivalent performance on HPCx for messages smaller than 0.1 MB.

Figure 3.3: Ping-pong benchmark using GA & MPI on HPCx

3.3 Image processing benchmarks using Global Arrays on Bluegene/L

A simple image reconstruction program was parallelised using both GA and MPI. This allows the performance of the GA based code to be compared to that of the MPI based code. The input image for the image processing is the output of an edge detection algorithm applied to a grey scale image of size MxN. In order to reconstruct the original image from the edge image, the value of each cell is updated using the difference between the sum of the values of the four neighboring cells and the corresponding edge value. This process is performed iteratively on the resulting image to reconstruct the original. The image processing code uses nearest neighbor communication and has fixed communication patterns. The number of iterations required to reconstruct the image with satisfactory quality depends on the size of the image. The program can therefore be used to study the effect of the communication/computation ratio on the overall performance and scaling of the parallel code.
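For clarity, one iteration of this update can be written as the following serial kernel (a sketch only; the 0.25 factor assumes the average of the four neighbours is taken, as in the usual formulation of this reconstruction, and the array names and sizes are illustrative):

    #define M 600                         /* image height (illustrative) */
    #define N 840                         /* image width  (illustrative) */
    double oldimg[M][N], newimg[M][N], edge[M][N];

    void reconstruct_step(void)
    {
        for (int i = 1; i < M - 1; i++)
            for (int j = 1; j < N - 1; j++)
                newimg[i][j] = 0.25 * (oldimg[i-1][j] + oldimg[i+1][j]
                                     + oldimg[i][j-1] + oldimg[i][j+1]
                                     - edge[i][j]);

        for (int i = 1; i < M - 1; i++)   /* the result feeds the next iteration */
            for (int j = 1; j < N - 1; j++)
                oldimg[i][j] = newimg[i][j];
    }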

Figure 3.4 depicts the program structure of the image processing application using the Global Arrays toolkit. Both the MPI and GA image processing codes used a one dimensional decomposition. In order to parallelise the application, the edge image was partitioned into as many strips as the number of processors. Each processor works on its chunk of edge data. After updating all cells holding the image data, a halo swap is performed using ga_get and ga_put calls. The image reconstruction algorithm is then applied; this step is the compute intensive loop. The reconstruction step iterates until the maximum number of iterations is reached. Then the reconstructed image is written out in portable grey map format by the application. To verify the accuracy of the output, the reconstructed image can be compared against the output from the serial application. There were no differences between the output of the serial code and the GA based parallel code.

Figure 3.4: Program structure of the image reconstruction application

Figure 3.5: Application performance - Image processing using GA with different input sizes on Bluegene/L

The scalability of the GA based image processing application improves with increasing image size. The performance for three image sizes - 192x360, 600x840 and 2568x1841 - is presented in Figure 3.5. Figure 3.6 presents the results of a comparison between the MPI based and GA based image processing applications. It was observed that the MPI application provides better performance than GA, especially at higher numbers of processors. From the GA communication benchmarks it can be observed that the performance difference between GA and MPI is greater for smaller messages. The performance difference between the GA and MPI versions of the image processing application is small when the code is run on four processors, and it increases with increasing numbers of processors. This confirms that the performance difference between the two applications arises mainly from the more costly communication routines of the GA program.

Bluegene/L supports two modes of operation - co-processor mode (CO) and virtual node mode (VN) [7]. In virtual node mode both cores of every chip are used for computation, while in co-processor mode one core is used for computation and the other core handles the communication tasks for the compute core. A significant improvement (up to 50%) in application performance was observed when the application was executed in virtual node mode. This improvement was largest for smaller numbers of processors; as the communication cost increased with increasing numbers of processors, the savings due to the use of virtual node mode decreased. The application performance was found to be more stable under co-processor mode (Figure 3.7).

Figure 3.6: Application performance - Image processing using GA vs MPI

Figure 3.7: Application performance - Effect of Virtual node mode on Global Arrays based image processing benchmark

The GA program was easier to develop than the MPI program. When using MPI programming it was necessary to determine the neighboring process before send and receive routines were called. The presence of the global index space in GA programs allows each processor to issue a get and put call using the global address of the elements. The GA get and put syntax makes it easy to specify the chunk of the global array to be copied, as illustrated below.
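For example, fetching the boundary row of the neighbouring strip needs only the global coordinates of that row (a sketch using the GA C API; the handle g_img, the row variables and the buffers are illustrative, and the benchmark code may equally use the Fortran interface):

    #include "ga.h"

    /* Read global row (first_row - 1), i.e. the last row of the strip above,
       into a private halo buffer of ncols doubles.                          */
    void fetch_halo_above(int g_img, int first_row, int ncols, double *halo)
    {
        int lo[2] = { first_row - 1, 0 };          /* global coordinates of the chunk */
        int hi[2] = { first_row - 1, ncols - 1 };
        int ld[1] = { ncols };                     /* leading dimension of the buffer */
        NGA_Get(g_img, lo, hi, halo, ld);          /* one-sided read by global index  */
    }

    /* Publish this processor's own last row so the strip below can read it. */
    void publish_last_row(int g_img, int last_row, int ncols, double *row)
    {
        int lo[2] = { last_row, 0 };
        int hi[2] = { last_row, ncols - 1 };
        int ld[1] = { ncols };
        NGA_Put(g_img, lo, hi, row, ld);           /* one-sided write by global index */
    }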

Chapter 4

Optimizing memory usage of molecular dynamics code using Global Arrays

4.1 Molecular Dynamics

A molecular dynamics application used in current research at the University of Edinburgh was used to study the characteristics of an application which uses GAs to store frequently accessed configuration data. Using the DSM to store configuration data reduces data replication and saves memory on each node. The molecular dynamics code used in this study [26] is used to simulate electron transfer from an electrode to an electrolyte. The parallel version of the code was developed at Edinburgh Parallel Computing Center (EPCC) by Mr. Aristides Papadopoulos and Dr. Alan Gray.

The replicated configuration data restricted the problem size that could be solved on MPP machines. Using the GA toolkit these common data structures can be maintained across multiple nodes. For a given problem size, as the number of nodes increases, the memory required on each node decreases. This is expected to improve the memory scalability of the application.

The original MD application displayed poor scaling, limited to 128 processors. An attempt was made to improve the scalability of the code. The scaling issue was attributed to poor load balance in the study conducted by Aristides Papadopoulos. As part of this project the poor scaling of the application was studied in detail. The result of this study is presented in the next chapter.

4.2 The physical system and its effect on memory

The molecular dynamics code attempts to simulate a simple system comprising two walls of atoms, forming the cathode and the anode [26]. The melt ions are placed in between these electrodes. The walls of the electrodes are assumed to be three layers thick for the simulation. The dimensions of the cathode and anode can be altered by the user. The size of the system under simulation depends on the dimensions of the electrodes and the number of melt ions.

Figure 4.1: The physical system under simulation

The system depicted in Figure 4.1 is a 5x5 system with N melt ions of each species between the anode and the cathode. The main factors that influence the memory usage of the program are the dimensions of the walls and the number of melt ions between them.

To define the physical system a three-stage approach is used. First the walls forming the anode and cathode are generated. Then the melt ion configuration files are generated. In the third step these files are combined to form the configuration of the entire system. Three programs - makewall, makemelt, and wallchange3 - were made available by the research team to generate the input files used to define the system. Using these programs it was possible to generate systems of varying input sizes, and to change the number of wall ions, the number of melt ions, or both.

The molecular dynamics code uses the number of wall ions and the number of melt ions to define many of its data structures. These arrays hold information about the ions, including position, velocity, potential, forces and the electric field. To hold the information related to position and velocity in three dimensions, three separate one-dimensional arrays were used. These were the largest arrays of the program and accounted for a major share of its total memory usage. Older versions of the molecular dynamics program used arrays of size equal to the square of the total number of ions; these were optimised in the latest code to use less memory, by using arrays of size totalnum*(totalnum-1)/2. Even so, these arrays still dominated the high memory requirements of the program.
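For reference, such a packed triangular array stores the pair (i, j) with i > j at a single linear index; a sketch of the mapping in C (0-based indices; the ordering used by the Fortran code may differ):

    #include <stddef.h>

    /* Linear index of the unordered pair (i, j), i > j, in a packed array
       of length N*(N-1)/2. Row i = 1..N-1 holds columns j = 0..i-1.        */
    size_t pair_index(size_t i, size_t j)
    {
        return i * (i - 1) / 2 + j;
    }
    /* For N = 4 the pairs (1,0), (2,0), (2,1), (3,0), (3,1), (3,2) map to
       indices 0..5, i.e. N*(N-1)/2 = 6 entries in total.                   */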

As a result, the maximum system size solvable on the Bluegene/L was limited to 4500 ions (Figure 4.2). At 4500 ions the arrays holding velocity and position information used more than 200 MB of memory, while each Bluegene/L node has only 512 MB of memory. The largest system size solvable on the Bluegene/L was therefore restricted to a great extent by these large arrays.

Figure 4.2: Memory usage for storing position and velocity related data

4.3 Structure of the molecular dynamics application

This section briefly describes the structure of the original MPI based molecular dynamics application [26]. The steps performed by the main routine (main.F90) are listed below.

• Initialisation

The molecular dynamics application initialises the system by reading the configuration data. Initialisation is performed by calling the routines readin, setup, rstrun and velset. The readin routine reads runtime parameters from the input files and allocates some of the dynamic arrays of the program according to the sizes described in the input files. The setup routine opens the output files, and then allocates and initialises most of the dynamic arrays of the program. The rstrun routine is called only if a restart is required; a restart is used to continue a simulation from a previous run, with the restart file storing the positions and velocities of the ions for reuse. The velset routine is called only if the parameter velinit is set to true in the input file. It sets up the velocities on a Gaussian distribution and then calls the rescale routine.

• Re-scaling

The rescale routine changes the velocities of the ions to the desired temperature.

• Calculation of the induced wall charges

The routine named conjgradwall determines the induced wall charges subject to the constant potential constraint. It uses two routines to perform this calculation - separations and wallCG.

Calculation of the ion separation

The separations routine calculates the separations between ions. It considers the minimum image criterion as well as the two-dimensional periodic boundaries.

Calculation of wall ion potential

The wallCG subroutine calculates the induced charges on the wall ions using the conjugate gradient method. This method determines a set of charges for the wall ions such that the potential of these ions equals the desired constant potential. This step is an iterative process.

• Calculation of the total energy of the system

The ener subroutine calculates the total energy of the system, the total forces on the walls, the total forces applied on the liquid ions, the potential of these ions, and the electric field. The routines rgdrecipE, rdrealE and sr_energy are called by the ener subroutine. The subroutine rgdrecipE calculates the reciprocal space contributions to the Ewald sum, while rdrealE calculates the real space contributions to the Ewald sum. The subroutine sr_energy calculates the short-range energy and the short-range forces.

• Transchains

The trans_chains subroutine is invoked for each timestep. It updates the positions and velocities of the ions, and calls conjgradwall and ener to calculate the new wall charges, the energy, the potential, the forces and the electric field. The output obtained is periodically written to the output files.

4.4 Performance characteristics of original code

The original code was analyzed in detail in the MSc dissertation by Aristides Papadopoulos [26]. The original code achieved a maximum scaling of 128 processors with a parallel efficiency of 70% on the IBM Bluegene/L. The code that formed the basis of the work done by Aristides Papadopoulos was modified by the research community at the University of Edinburgh. In order to baseline the performance of the latest version of the MPI application, a performance evaluation was performed.

Figure 4.3 shows the scaling characteristics of the original code for two system sizes. The performance scaling of the code was limited to 128 processes, especially for smaller system sizes. The maximum system size that could be solved on the Bluegene/L using the original MD code was experimentally found to be 4272 ions, counting both melt ions and wall ions. The scaling of the code improves with the system size: the 10x10 system with a total of 4272 ions exhibited better scaling than the 5x5 system with a total of 1380 ions.

Figure 4.3: Scaling of original MD code

4.5 Design of the GA based molecular dynamics application

Molecular dynamics applications perform the following sequence of steps. Initially the configuration data is read from the restart files. The configuration data is comprised of the positions and velocities of the individual particles of the system. For the molecular dynamics system represented by the original code, the configuration data includes the state of the wall ions and melt ions as well as the potential applied. The second step calculates the forces applied on every atom. These forces arise from both bonded and non-bonded atomic interactions. After computation of the forces, Newton’s laws of motion are used to calculate the positions and velocities of all particles. This whole process is repeated for as many time steps as defined by the runtime parameters. The simulation results are periodically written to files by the molecular dynamics application. A final report is also written by the application at the end of the simulation.
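The integration step referred to above is typically a Verlet-type update of the form shown below (given only for illustration; the exact scheme used by the application's trans_chains and verlet_shake routines is not reproduced here):

\[
\mathbf{r}_i(t+\Delta t) = \mathbf{r}_i(t) + \mathbf{v}_i(t)\,\Delta t + \frac{\mathbf{F}_i(t)}{2 m_i}\,\Delta t^2 ,
\qquad
\mathbf{v}_i(t+\Delta t) = \mathbf{v}_i(t) + \frac{\mathbf{F}_i(t) + \mathbf{F}_i(t+\Delta t)}{2 m_i}\,\Delta t .
\]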

Figure 4.4 depicts the main loop of the molecular dynamics application. The trans_chains routine is repeated over many time steps; the total number of time steps is defined as a runtime parameter. The ions are moved to their latest positions, and the conjgradwall routine is invoked. The conjgradwall routine, depicted in Figure 4.5, is an iterative routine. It computes the separation of each particle from every other particle and stores the separations information in data arrays. A routine named wallCG is then invoked by the conjgradwall routine. This routine sets the initial wall charges and computes the wall energies. If the conjugate gradient method used for this computation converges, the wall charges are updated. Then the ener routine is invoked to update the energy information. These steps are repeated until the last time step.

Figure 4.4: Main Loop of the molecular dynamics code

Figure 4.5: A detailed view of the conjgradwall routine

Figure 4.6: Usage of the Global Array in the molecular dynamics application

After an analysis of the data structures created by the molecular dynamics application, it was observed that the largest data structures were the ones holding the separations information. This data is updated only once per time step by the separations routine, and was replicated on all processors. A major share of the total memory usage of the code could be attributed to three one dimensional arrays (dxsav, dysav and dzsav) of size N*(N-1)/2, where N is the total number of ions. These arrays are updated by the separations routine and are used in many routines called by conjgradwall as well as by ener. In order to reduce the memory usage of the application, the three large arrays were distributed using Global Arrays (Figure 4.6). This reduces the memory footprint of the application: the memory used per process for this data is approximately 3 * N*(N-1)/2 * sizeof(double) divided by the number of processes, plus the memory used internally by the GA toolkit.

To implement the GA version many routines were modified. The separations routine, which updates the dxsav, dysav and dzsav arrays, was modified to operate on temporary buffer segments, each of size equal to the size of dxsav divided by the number of processors. After computing the values to be stored in its segment, each processor updates its segment by issuing a ga_put call. Before the start of the separations routine a ga_sync call ensures that all processors are ready to load the positions and velocities of particles in case the particles have been moved. Although the original code used three separate arrays to store these data, in the GA version a single Global Array was used. Using GA syntax a multi dimensional chunk of data can be updated and retrieved with a single call, which reduces the number of communication calls. Since GA routines are optimised for bulk data transfer, combining the three one dimensional arrays into a single array improved the performance of the application. The separations routine is invoked once every timestep.
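A sketch of this update pattern using the GA C API is shown below (the application itself is written in Fortran; the handle g_dsav, the helper compute_separations and the assumption that the number of pairs divides evenly are all illustrative):

    int me     = GA_Nodeid();                     /* rank of this process             */
    int nproc  = GA_Nnodes();
    int seglen = npairs / nproc;                  /* npairs = N*(N-1)/2               */
    int lo[1]  = { me * seglen };                 /* my segment, in global indices    */
    int hi[1]  = { me * seglen + seglen - 1 };
    int ld[1]  = { 1 };

    GA_Sync();                                    /* new particle positions in place  */
    compute_separations(local_buf, lo[0], hi[0]); /* hypothetical helper: fill buffer */
    NGA_Put(g_dsav, lo, hi, local_buf, ld);       /* one-sided write of my segment    */
    GA_Sync();                                    /* segments visible to all          */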

The cgwallrealE, rgdrealE and sr_energy routines used the data stored in the GA to perform their part of the computation. The part of the Global Array that each processor operates upon is known prior to the start of the nested computation loops. Each processor copies its part of the data to a temporary buffer by issuing a ga_get call, and then operates on this data. Initially it was decided to use a large temporary buffer to store the entire data required by the nested computational loops. The prototype built using this approach showed very poor scaling and performance, as it resulted in significant load imbalance (Figure 4.7).

Figure 4.7: Effect of using large buffers

When the difference in the size of the data that needs to be transferred to each processor increases, the GA routines add significant communication overhead to the processors that transfer more data. This prototype was not developed further. In order to eliminate the load imbalance caused by the GA communication, it was decided to use smaller buffers, fetched inside the outer loop of the nested computation routines. Although there were many other smaller arrays that contributed to the memory usage of the code, it was decided that GA should not be used to store them, because they were involved in more active communication.

The loops parallelised to use the Global Arrays share a common loop structure. The structure of the computation loops using GA blocking communication can be generalized as follows:

• Program structure of a computational loop using blocking communication

allocate a local buffer

do Jiter=Jbegin,Jend

Read a segment of global array to local buffer

do Iiter=Ibegin,Iend

Perform computation using local buffer

end do

end do

deallocate buffer

As described by the pseudo code given above, the GA communication is placed inside the outer (J) loop. Though this increases the communication cost, it was chosen to avoid the significant load imbalance arising from GA communication of large buffers of varying sizes.
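Rendered with the GA C API, the blocking pattern looks roughly as follows (a sketch; the chunk size, the offset() and work() helpers and the 1-D layout are illustrative assumptions):

    double *buf = malloc(chunk * sizeof(double)); /* local buffer for one chunk   */

    for (int j = jbegin; j <= jend; j++) {
        int lo[1] = { offset(j) };                /* global start of chunk j      */
        int hi[1] = { offset(j) + chunk - 1 };
        int ld[1] = { 1 };
        NGA_Get(g_dsav, lo, hi, buf, ld);         /* blocking read of chunk j     */

        for (int i = ibegin; i <= iend; i++)
            work(i, j, buf);                      /* computation using the buffer */
    }
    free(buf);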

Figure 4.8 depicts the high performance cost incurred by the GA version of the molecular dynamics application compared against the performance of the MPI version. Non blocking communication can be used to overlap computation with communication, leading to latency hiding. An attempt was made to use non blocking communication to improve the performance of the GA application, and the results of this study are presented in the next section.

Figure 4.8: Comparison of GA and MPI application performance

4.6 Performance of application using non blocking communication

In order to reduce the high communication cost, a study was performed using non blocking (NB) communication. The structure of a generic computational loop using non blocking communication is given below:

• Program structure of a computational loop using non blocking communication

allocate two local buffers

Read a segment of global array to local buffer1

do Jiter=Jbegin,Jend

read data from GA to local buffer2 using ga_nbget

do Iiter=Ibegin,Iend

Perform computation using local buffer1

end do

wait for nonblocking communication to complete

copy data from local buffer2 to local buffer 1

end do

deallocate buffers
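A concrete rendering of this double-buffered scheme with the GA non-blocking C interface might look as follows (a sketch; offset(), work() and the chunk size are illustrative assumptions):

    double    *cur  = malloc(chunk * sizeof(double));
    double    *next = malloc(chunk * sizeof(double));
    ga_nbhdl_t handle;
    int        ld[1] = { 1 };

    int lo[1] = { offset(jbegin) }, hi[1] = { offset(jbegin) + chunk - 1 };
    NGA_Get(g_dsav, lo, hi, cur, ld);                 /* blocking read of the first chunk */

    for (int j = jbegin; j <= jend; j++) {
        if (j < jend) {                               /* prefetch the next chunk          */
            int nlo[1] = { offset(j + 1) };
            int nhi[1] = { offset(j + 1) + chunk - 1 };
            NGA_NbGet(g_dsav, nlo, nhi, next, ld, &handle);
        }
        for (int i = ibegin; i <= iend; i++)
            work(i, j, cur);                          /* compute while data is in flight  */

        if (j < jend) {
            NGA_NbWait(&handle);                      /* complete the prefetch            */
            double *tmp = cur; cur = next; next = tmp;/* swap buffers instead of copying  */
        }
    }
    free(cur); free(next);

Swapping the buffer pointers avoids the explicit copy shown in the pseudocode, although the branching and buffer-management overhead that hurt performance in these routines remains.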

The most costly routines in the molecular dynamics simulation are the cgwallrecipE and rgdrecipE routines. These two routines contribute more than 61% of the total compute time. Though they are parallelised using MPI, Global Arrays were not used in them and no modification was made to this code. The rgdrealE and cgwallrealE routines were found to be less expensive than the reciprocal space routines, contributing less than 2% each to the total time. Using non blocking communication in cgwallrealE was found to be counterproductive: its computational work load is very low, and the additional overhead of the data copy and branching when using non blocking communication led to a degradation of the application performance. Though the computational work load in rgdrealE was slightly higher than in cgwallrealE, the performance of the application did not improve as a result of implementing non blocking communication. This is mainly because each routine of the application is composed of many computational loops, each with a low computational payload per loop. The additional cost incurred by these loops when implementing non blocking communication leads to poor performance, especially at lower processor numbers. Figure 4.9 compares the performance of the GA code with non blocking communication and with blocking communication.

4.7 Result of memory optimization using Global Arrays

In order to investigate the savings in memory achieved through the use of the Global Arrays toolkit, a trial and error approach was used. Though counting the size of variables and arrays used was an alternative method, the counting approach could not be used as the internal memory consumption of the GA toolkit could not be predicted in advance. The largest system size that could be executed on the GA version of the program was also dependent on the total memory requirements of other replicated data structures used in the program.

In practice it was observed that system sizes of around 8000 ions could be executed successfully on the Bluegene/L at 128 processors with the GA based code (Figure 4.10). This is a significant improvement over the MPI based code with data replication. The position and velocity arrays for 8000 ions consume more than 800 MB of memory, while the Bluegene/L has only 512 MB of memory per node; the original code therefore fails to run such systems even on larger numbers of processors.
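As a rough consistency check (illustrative arithmetic only), the three packed separation arrays alone occupy

\[
3 \times \frac{N(N-1)}{2} \times 8\ \text{bytes} \;\approx\; 0.77\ \text{GB} \quad \text{for } N = 8000,
\]

which cannot be replicated within the 512 MB available on a node, but amounts to only about 6 MB per node when distributed over 128 processors, which is consistent with the figures quoted in this chapter.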

The savings in memory were evident at higher numbers of processors. The introduction of the GA increased the memory usage of the code compared to the MPI version when run on small numbers of processors. As a result, the GA code could not be used on smaller numbers of processors.

Figure 4.9: Blocking vs. non-blocking GA program

Figure 4.10: Result of memory optimisation using GA

The system with 7200 ions requires the code to be executed on at least 64 processors. As the system size increases, the smallest number of processors on which the code can run also increases. A hard coded upper limit of 8,500 ions was placed on the number of ions. With this setting, the code should be able to execute systems of up to about 8500 ions on 128 processors in communications co-processor mode. The user of the code can change this setting to experiment with systems of different sizes.

When the code runs on 128 processors, the GA program consumes only 6 MB per processor for the storage of the position and velocity arrays. Systems much larger than 8000 ions could still not be executed on the Bluegene/L, as the remaining replicated arrays consume more memory than is available on a node.

The GA based program incurred a heavy performance loss. This is attributed to the cost of adding communication calls within the outer loop of two nested loops. Since this was implemented in routines which are repeatedly invoked (mainly cgwallrealE), the performance impact was very high.

4.7.1 Simulation of an Aqueous system

In order to verify the capability of the GA based program, a system with water as the melt was simulated on the Bluegene/L. The system had 5340 ions in total, including 4752 melt and 588 wall ions. The MPI program was not able to execute this system on the Bluegene/L: it crashed with the error code 1525-108 on the root node, which indicates that an error was encountered while attempting to allocate a data object [28].

The GA code on the other hand executed this system successfully. As a result a new type of simulation could be performed on the Bluegene/L using the GA based molecular dynamics code. The output of this system was verified by the research team [29] by comparison with the results from the MPI code executed on a Linux cluster, with more memory per node.

4.8 Test results

The testing of the output from the GA based code was performed mainly by comparing its output files with the output from the original code. A small system executable with the MPI version was also used to test the output from the GA code. Since the modification to the code only changed the storage of the arrays, the result of the computation was expected to match the result of the original code exactly. All the output files were compared to the output from the original code and were verified to be accurate. The compiler optimization flags used to compile the code were -O3 -qstrict -qtune=440 -qarch=440. Compiling the GA toolkit as well as the molecular dynamics program at higher optimisation levels without the -qstrict option gave better performance, but resulted in differences in the output.

The test systems used to verify the GA code were also included in the delivery of the code. The 5x5 system uses 1080 melt ions and 300 wall ions, whereas the larger 10x10 system uses 6000 melt ions and 1200 wall ions. The 5x5 system was used to verify the output of the GA version. Both of these systems use the trans_chains routine. While the trans_chains routine was used for the investigation of the performance of the code, the water based system was used to verify that systems using the verlet_shake routine instead of the trans_chains routine also produce accurate results.

Chapter 5

Performance optimization of Molecular dynamics code

The maximum speed-up a parallel program can achieve is limited by the serial component of the program and by its load imbalance. The maximum speed-up S that a parallel program with a serial fraction F can achieve on N processors is given by equation 5.1.

\[
S = \frac{1}{F + \frac{1 - F}{N}} \qquad (5.1)
\]

Equation 5.1 is the simplified form of Amdahl's law [4].
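For example, with an illustrative serial fraction of F = 0.02 (not a measured value for this code),

\[
S(128) = \frac{1}{0.02 + 0.98/128} \approx 36, \qquad \lim_{N \to \infty} S = \frac{1}{0.02} = 50,
\]

so even a small serial component caps the achievable speed-up well below the processor count.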

5.1 Analysis of load imbalance and serial components

The original MPI based code suffered poor scaling above 128 processors. The root cause of this behavior was identified as the load imbalance between the processors when computing the cgwallrealE and cgwallrecipE routines [26]. The cause of the load imbalance in cgwallrealE was attributed to the short-range cut off applied in the reciprocal space loops.

In order to determine the root cause of this behavior, the most expensive and frequently invoked routines of the code were timed. The rgdrealE and rgdrecipE routines did not display any significant load imbalance between the processors. The cgwallrealE and cgwallrecipE routines individually displayed very high load imbalance between the processors. However, these routines form a complementing pair, and the load imbalance of the total time spent in the two routines was found to be insignificant. This is shown in Figure 5.1. Load imbalance was therefore eliminated as the root cause of the poor scaling demonstrated by the code.

The original MPI code had three routines - rdf, potdump and dump - which were invoked only on the root node. These routines represent the serial component on the root node, and were identified as the main reason for the poor scaling behavior of the original code. Of these three routines, the rdf routine contributed 80% of the compute time. The rdf routine collects the data required to calculate the radial distribution functions; another routine named rdfout writes the calculated data to multiple output files. Since the rdf routine is compute intensive, it was decided to parallelise it. The serial rdf routine affects the GA based program even more adversely: because it acts on the velocity and position data, it would have to access the Global Array, which would have introduced additional performance and scalability issues. This made parallelising the routine all the more important.

Figure 5.1: Analysis of load imbalance

5.2 Parallelizing RDF routines

In order to parallelise the rdf routine, two versions of it were developed: one for the GA based program and one for the MPI program. In the MPI version, each processor identifies its chunk of the work by reusing the distribution of ion pairs across nodes. In the GA version, the distribution of the Global Array was used to implement the parallelisation. By using the data locality information of the Global Array it is ensured that each processor acts only on the data located in its own memory, which yields good performance by avoiding inter-node data transfer. After each processor has acted on its chunk of data, a global reduction is performed and the root node writes the data to files using the rdfout routine.
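The pattern used is the usual local-accumulation-plus-reduction one; a sketch in C with MPI is given below (the application itself is Fortran, and NBINS, the pair bounds, separation() and the bin width dr are illustrative assumptions):

    #include <mpi.h>

    #define NBINS 512                              /* number of histogram bins (illustrative) */

    void rdf_parallel(int first_pair, int last_pair, double dr)
    {
        double local_rdf[NBINS] = { 0.0 };         /* partial histogram on this process */
        double global_rdf[NBINS];

        /* accumulate contributions only for the ion pairs owned by this process */
        for (int p = first_pair; p <= last_pair; p++) {
            int bin = (int)(separation(p) / dr);   /* separation() is an assumed helper */
            if (bin >= 0 && bin < NBINS)
                local_rdf[bin] += 1.0;
        }

        /* combine the partial histograms on the root, which then calls rdfout */
        MPI_Reduce(local_rdf, global_rdf, NBINS, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    }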

43 5.2.1 Result of performance optimization

The performance of the MPI application with the parallel rdf routine is compared to the performance of the original MPI application in Figure 5.2. The scaling of the MPI application improved significantly at higher numbers of processors, showing performance improvements up to 512 processors. The GA program incurred a 2x performance hit compared to the MPI program with the parallelised rdf routine. However, the GA program was able to solve much larger systems that could not be solved on the Bluegene/L with the MPI code.

Figure 5.2: The scaling of the MPI application before and after rdf routine was parallelised

On HPCx, the performance characteristics of the GA code were quite different from those on Bluegene/L (Figure 5.3). The scaling of the GA as well as the MPI application was affected significantly by inter-node communication. On HPCx the intra-node performance is significantly better than the inter-node performance, as intra-node communication avoids the high-latency switch. For the 5x5 system a significant performance improvement was obtained by using 64 processors compared to 32 processors; this improvement deteriorated and eventually levelled off as the application was run on more processors, whereas the MPI application performance did not deteriorate when using larger numbers of processors. At runtime, before the initialisation of the Global Array, the ARMCI layer is configured to adapt to the communication characteristics of the machine; using more nodes can therefore adversely affect the performance of GA based code on HPCx. The same code exhibits better scaling when large systems are solved, which indicates that ARMCI is configured to use larger messages for larger systems. At lower processor counts (32 processors) the GA code suffered very poor performance compared to the MPI version. This contrasts with the behavior of the GA based program on the Bluegene/L, where the difference in performance between the GA code and the MPI based code remained roughly constant as the number of processors increased from 8 to 128. This can be attributed to the better network capabilities of the Bluegene/L.

Figure 5.3: Performance of GA and MPI version of MD code on HPCx

By packaging the GA program along with the MPI program, the end user is provided with two options. The user can opt for the MPI program to compute small systems which fit in the Bluegene/L memory and benefit from better performance and scaling, or use the GA application to simulate systems with larger numbers of ions, which could not be executed on the Bluegene/L with the original MPI version.

Chapter 6

UPC

Investigations of the performance characteristics and ease of use of UPC for distributed shared memory programming were performed on Lomond and HPCx. This chapter presents the details and results of this investigation.

6.1 Installation of UPC

In order to study the performance characteristics of applications using the UPC language, two implementations of UPC were experimented with. Berkeley UPC [3] runtime version 2.4.0 was installed on Lomond by compiling it from the open source distribution. This runtime provides the user access to the UPC environment through the commands upcc and upcrun. IBM XL UPC was installed on HPCx by the HPCx technical support team. Apart from IBM XL UPC and Berkeley UPC, several other UPC implementations for various hardware architectures are available today, including CRAY UPC [30], HP UPC [31] and GCC UPC [32].

6.1.1 Installation of UPC on Lomond

The Berkeley UPC compiler is made up of two main components - the UPC to C translator and the Berkeley UPC runtime. These components are available for download as separately archived files, and the UPC to C translator is also available as an online service. The core component that needs to be installed is therefore the Berkeley UPC runtime; by default this runtime uses the UPC to C translator available online. It is also possible to use the Berkeley UPC runtime with the GCC UPC binary compiler to build UPC programs. It was decided to use the online UPC to C translator, since neither the UPC to C translator installer nor the GCC UPC compiler was available for Solaris. The GCC UPC compiler is available mainly for Intel/Linux based uniprocessor and SMP clusters, as well as Cray T3E Alpha based and Cray XT3 AMD based systems.

As part of this project Berkeley UPC was investigated using the default installation. Only the Berkeley UPC runtime was installed on Lomond. In order to install the Berkeley UPC runtime on Lomond, the following steps were performed.

First, the installation files were downloaded from the Berkeley UPC download site [33] and extracted using the tar utility. Prior to installation of Berkeley UPC, it is necessary to configure the installation. The configuration step identifies the conduits to be used by the UPC communication layer. Since Lomond is a shared memory machine, it was decided to use the SMP conduit. The configuration was performed by invoking the configure script from the base folder.

/berkeley_upc-2.4.0/configure CC=gcc --without-cxx --without-mpi-cc \
    --disable-mpi --enable-smp

The above command configures the Berkeley UPC runtime to use shared memory for inter-processor communication, while disabling the MPI as well as the UDP conduits. To build the Berkeley UPC runtime, the gmake command is invoked after the configuration process has completed. It is possible to control the shared memory settings when the UPC runtime is installed; these settings play a crucial part in determining the performance characteristics of applications compiled with the UPC compiler. The default configuration governing the operation of the UPC runtime is maintained in the upcc.conf file. A setting named shared_heap governs the default amount of each UPC process's memory that is dedicated to shared memory. If this setting is too low, UPC programs can fail due to memory depletion. To prevent this, the installation notes recommend setting this value to half the available physical memory divided by the number of threads.

Even though MPI is available as a conduit for the installation of UPC, it is not recommended on ethernet based clusters. Placing the MPI layer between the network hardware and the UPC runtime degrades the performance of UPC communication, and the Berkeley UPC installation notes recommend the use of UDP on ethernet based clusters rather than an installation over MPI. Specialized interconnects such as LAPI, Quadrics Elan, Myrinet/GM and Infiniband/vapi are recognized as conduits by the GASNet layer of UPC.

To compile UPC source code with the Berkeley UPC runtime, the upcc script is invoked with suitable runtime options. The options used for the investigation presented in this chapter are given below.

upcc -T=N -pthreads=N -opt -o output source.upc

The -T flag specifies the number of static threads to be used and -pthreads the number of pthreads spawned per process, while -opt turns on the experimental optimizations of Berkeley UPC.

6.1.2 Installation of IBM XL UPC on HPCx

While the Berkeley UPC runtime is a script file which can be invoked to compile the UPC source code, IBM XL UPC is a binary compiler based on the IBM XLC compiler. It can be installed on AIX as well as Linux machines. The installation procedure and requirements are documented at the IBM XL UPC website [34]. IBM XL UPC was installed by the HPCx support team on HPCx.

6.2 UPC syntax and common usage patterns

The UPC language is designed to simplify parallel programming on distributed memory computers. It aims to provide the convenience of the shared memory programming model on distributed memory machines. UPC supports a powerful and versatile syntax, and the performance of a UPC based application depends strongly on the programming style used. The main features of UPC that reduce syntactical complexity and promote usability are described below.

• Thread and process layout

Berkeley UPC provides users the flexibility to lay out UPC processes and UPC threads in many different ways. In a cluster of SMPs, or a cluster of servers with multicore chips, the end user can lay out one process per node and spawn multiple UPC threads within each node. These threads use pthreads for intra-node communication, while inter-node communication is handled through the conduit of choice; UDP is typically the conduit of choice on a cluster with an ethernet interconnect, and a wide range of other network conduits is available.

IBM XL UPC provides a simpler means of thread layout, as the user specifies only the number of threads; the spawning of UPC processes is performed internally by the IBM XL UPC compiler. IBM XL UPC is available mainly for IBM AIX and Linux machines. In many applications, specifying a static number of threads at compile time enables additional compiler optimizations.

Within a UPC program, the THREADS keyword provides the total number of UPC threads and the MYTHREAD keyword gives the index of the local thread.

• Shared Variables

UPC provides the programmer with two distinct memory spaces - private memory and shared memory. The standard C local variable uses the private space while the shared space is used by variables qualified at declaration using the shared qualifier.

shared int array[N];

While every UPC thread creates a local copy of each private variable, the shared variables are accessible by all threads. Pointers need to be qualified with the shared keyword to be able to point to shared memory.

• Data distribution using UPC

It is very easy to set the data distribution of the shared array within the distributed shared memory. This is accomplished by setting the distribution block size while declaring the shared array.

shared [blocksize] int array[N];

By varying the block size it is possible to redistribute the data across N threads in different patterns. The default value of block size is 1 which leads to cyclic decomposition of N shared array elements across all the threads available. For example, by using the syntax

shared [N/THREADS] int array[N];

blocks of N/THREADS contiguous elements of the array are spread across the available threads, giving a block distribution. This syntax also works for multi dimensional arrays.

shared [(M*N)/THREADS] int array[N][M];

In order to place all the array elements on thread 0, the shared array can be declared with a blank block size, or with N as the block size.

shared [] int array[N];

shared [N] int array[N];

Pointers can also be declared as private or shared.

shared int *myptr;

shared int *shared ourptr;

The power of the UPC programming language is derived from the use of the shared variables and shared arrays. Shared array elements can be assigned values from local variables and vice versa. This coupled with work sharing and synchronization syntax makes parallel programming using UPC easier than the message passing model.

• Worksharing using UPC

The UPC runtime spawns as many threads as specified by the runtime environment settings, and all threads execute in parallel by default. The primary form of work distribution across threads is the upc_forall statement. Its syntax is similar to that of the C for statement, the only difference being an additional fourth field, the affinity expression, which indicates the thread that should perform the work for that iteration. The upc_forall statement executes the loop iterations concurrently and independently of each other; it is therefore critical to ensure that the operations performed in the loop body do not have loop dependencies.

upc_forall (iter = 0; iter < N; iter++; iter) {

    Work ....

}

The syntax above performs each unit of work concurrently on the thread selected by the affinity expression. UPC programs can be developed to use either the shared memory model or the message passing model; as part of this study the performance of applications using both models is investigated. A complete example program is sketched at the end of this list.

• Data locality awareness

UPC provides powerful syntax capable of exploiting data locality automatically. This is implemented as a variant of the upc_forall syntax. Using the syntax below, each iteration is executed by the thread that owns the shared array element named in the affinity field.

shared [N/THREADS] int array[N];

upc_forall (iter = 0; iter < N; iter++; &array[iter])
    array[iter] = sqrt(iter);

Data locality awareness provides significant ease of use to the programmer. The array syntax can be best utilised on multidimensional arrays distributed using block distribution.

• Synchronization

UPC provides powerful synchronization functionality in the form of barriers and locks, allowing protection of shared data through mutual exclusion and inter-process synchronization. UPC also provides two memory consistency modes - strict mode and relaxed mode. Strict mode enforces data synchronization before each access to a shared variable, while relaxed mode allows threads to access data asynchronously. Strict mode adds additional serialisation and therefore leads to significant overhead; it is best avoided in favour of locks.

• UPC one sided communication

UPC also allows one sided inter-process communication through the upc_memput and upc_memget routines. These routines allow a message passing style of programming using UPC.
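Tying these features together, a minimal complete UPC program might look as follows (an illustrative sketch, not taken from the codes used in this project; it assumes a static thread count, for example compilation with upcc -T=4):

    #include <upc_relaxed.h>
    #include <stdio.h>
    #include <math.h>

    #define N 1024

    shared [N/THREADS] double a[N];         /* block-distributed shared array            */

    int main(void)
    {
        int i;

        upc_forall (i = 0; i < N; i++; &a[i])   /* each thread fills the elements it owns */
            a[i] = sqrt((double)i);

        upc_barrier;                        /* wait until every thread has finished      */

        if (MYTHREAD == 0)                  /* thread 0 reads a remote element directly  */
            printf("a[%d] = %f, computed by %d threads\n", N - 1, a[N - 1], THREADS);
        return 0;
    }

The program would then be launched with upcrun, specifying the number of threads chosen at compile time.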

6.3 UPC benchmarks

The ping-pong benchmark was performed on both HPCx and Lomond to measure the performance of UPC inter-processor communication. The code used to perform this benchmark was available within the Berkeley UPC distribution. The benchmark measures the time elapsed to copy data from a shared variable resident on a remote processor to a local variable, and vice versa. The remote read is represented in the graphs in this section as upc get, and the remote write as upc put. On Lomond, UPC was installed to use the SMP conduit, which enables UPC to use shared memory for communication. The largest shared memory allocatable per process was limited to 2 GB on Lomond; this is shared between all the threads of the UPC program. The shared_heap setting is available as a compiler flag and can be set per program.

Figure 6.1 compares the communication characteristics of the UPC routines with those of the corresponding MPI routines. The communication patterns reveal major differences between MPI and UPC. At very small message sizes Berkeley UPC outperformed MPI; Berkeley UPC communication routines are optimised for small messages, since the language is intended for applications dominated by random point to point communication. MPI outperforms Berkeley UPC for larger message sizes, where UPC suffered nearly an order of magnitude worse communication performance on Lomond than MPI. The benchmark program uses direct assignment between shared memory elements and local memory to transfer data; this method of using shared memory was found to be very expensive on Lomond.

The performance of IBM XL UPC communication routines is compared to MPI performance, in Figure 6.2. The performance of IBM XL UPC communication was found to be very poor compared to MPI performance. The difference in performance between MPI and IBM XL UPC was nearly two orders of magnitude. The reasons for this poor performance are unknown. This result could be due to the choice of communication conduit used for the installation of UPC on HPCx. The Berkeley UPC on Lomond demonstrated better communication characteristics than IBM XL UPC on HPCx.

The behavior of IBM XL UPC and Berkeley UPC differed greatly when compared using the ping-pong benchmark. Figure 6.3 contrasts the performance of Berkeley UPC on Lomond and on an Intel Core Duo chip with the performance of IBM XL UPC on HPCx. Almost an order of magnitude difference was observed between the communication latencies of Berkeley UPC on Lomond and IBM XL UPC on HPCx. On Lomond, the UPC communication routines appear able to use the shared memory more effectively than IBM XL UPC does on HPCx. Berkeley UPC, like MPI, suffered high costs for very small messages, while IBM XL UPC did not. Berkeley UPC achieved very low latencies when used on a chip multiprocessor. This suggests a high variance in the communication latency of UPC implementations as well as of their installations on different systems.

Figure 6.1: Ping-pong benchmark of UPC vs MPI communication on Lomond

Figure 6.2: Ping-pong benchmark of IBM XL UPC on HPCx compared to MPI performance


Figure 6.3: Ping-pong benchmark of Berkeley UPC on Lomond and an Intel Core Duo compared to IBM XL UPC on HPCx

6.3.1 Image reconstruction using shared memory and upc_forall

The image processing program described in section 3.3 was also implemented using UPC. The UPC syntax simplified the implementation of the parallel program. Figure 6.4 depicts the structure of the UPC program, which uses shared memory and the upc_forall statement, and compares it against a similar program implemented using MPI. Please refer to section 3.3 for the structure of the equivalent GA program.

The serial image reconstruction program declares three main arrays on which the reconstruction algorithm relies, named new, old and edge. To implement the UPC program, these arrays were redistributed by declaring them with the shared qualifier. In order to implement a block distribution, the shared clause used a block size of MxN/THREADS, where M and N represent the width and height of the image.

The for loops used for the computation were replaced with their parallel equivalent, the upc_forall statement. To ensure that the thread that owns the data performs the computation on it, the array element itself was used as the affinity expression.
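A sketch of the resulting reconstruction loop is shown below (array names follow the description above; the 0.25 factor, the fixed boundary and the compile-time image dimensions are assumptions):

    #define M 600                         /* image height (illustrative)    */
    #define N 840                         /* image width  (illustrative)    */
    #define MAXITER 1000                  /* iteration count (illustrative) */

    /* block-distributed shared images, as described above */
    shared [M*N/THREADS] double old[M][N], new[M][N], edge[M][N];

    void reconstruct(void)
    {
        int i, j, iter;
        for (iter = 0; iter < MAXITER; iter++) {
            for (i = 1; i < M - 1; i++)
                upc_forall (j = 1; j < N - 1; j++; &old[i][j])   /* owner computes */
                    new[i][j] = 0.25 * (old[i-1][j] + old[i+1][j]
                                      + old[i][j-1] + old[i][j+1]
                                      - edge[i][j]);
            upc_barrier;                                         /* updates complete */
            for (i = 1; i < M - 1; i++)
                upc_forall (j = 1; j < N - 1; j++; &old[i][j])
                    old[i][j] = new[i][j];
            upc_barrier;
        }
    }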

The overall layout of the program did not differ from the serial code; the UPC program therefore had very low syntactical complexity compared to its GA and MPI versions. Figure 6.5 presents the performance of the UPC program compared to the performance of the original serial program with the same compiler flags enabled (-O3 -qarch=pwr4 -qtune=pwr4).

Figure 6.4: Image processing using UPC compared to MPI

Figure 6.5: Elapsed time for UPC program using shared memory vs serial program

It was observed that the UPC program suffered very high performance degradation compared to the serial program compiled using the XL UPC compiler with equivalent compiler flags. To eliminate compiler optimisation issues as the root cause of this degradation, the serial image processing program was compiled using both the IBM XL UPC compiler and the IBM XLC compiler. The results of this test are presented in Figure 6.6. The use of compiler optimisation options in the XL UPC compiler displayed considerable improvement in the performance of the program. While the XLC compiler provided significant improvements at each compiler optimisation level from -O1 to -O5, the XL UPC compiler already provided good optimisation of the serial code at -O2, and this performance was observed to be marginally better than that of the XLC compiler. This is not surprising, since the XL UPC compiler is a newer release than the XLC compiler and may be better optimised for the POWER5 architecture.

The UPC program was not able to scale above 16 processors, indicating severe performance loss when crossing the switch. The root cause of this issue is not known. This may be the result of installation issues.

The root cause of the performance degradation relative to the serial program, discussed above, was identified as the high overhead of accessing shared memory compared to the cost of accessing local memory. While local memory access is optimised by the compiler to use prefetching, there is a high overhead associated with address translation for the shared memory segment, and an additional overhead due to the data locality identification. These two factors led to the poor performance of the UPC code and add a serial overhead to the program. Nevertheless, the program did exhibit scaling as the number of processors increased.

Figure 6.6: XL UPC vs XLC compiler optimisation (elapsed compute time in seconds against compiler optimisation level, -O1 to -O5).

This indicates that the inter-processor communication cost associated with the UPC shared memory is minimal. To validate this, another version of the image processing example, which uses local arrays for storage, was implemented. The results of this study are presented in the next section.

6.3.2 Image processing using local memory and halo swaps

The UPC image processing program was reimplemented in a message passing style. This version is very similar to the MPI and GA versions of the image processing benchmark. The communication routines used were upc_memput and upc_memget, and shared memory was used only to store the image halos. Figure 6.7 illustrates the halo swap implemented using UPC shared memory.
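
A hedged sketch of this halo swap pattern is shown below. The layout of the shared halo buffer (two rows per thread) and all names and sizes are assumptions made for illustration; they are not taken from the benchmark code.

#include <upc.h>

#define NCOLS 360          /* illustrative image width     */
#define NROWS_LOCAL 64     /* illustrative rows per thread */

/* Shared halo buffer: two rows per thread (block size 2*NCOLS keeps both
   rows on the owning thread), in the spirit of the 2*THREADS shared array
   of Figure 6.7. */
shared [2*NCOLS] double halo[2*THREADS][NCOLS];

/* Private image block with two extra halo rows, 0 and NROWS_LOCAL+1. */
double local_img[NROWS_LOCAL + 2][NCOLS];

void halo_swap(void)
{
    /* Publish our boundary rows into our own slots of the shared buffer. */
    upc_memput(&halo[2*MYTHREAD][0],     &local_img[1][0],
               NCOLS * sizeof(double));
    upc_memput(&halo[2*MYTHREAD + 1][0], &local_img[NROWS_LOCAL][0],
               NCOLS * sizeof(double));
    upc_barrier;

    /* Fetch the neighbours' boundary rows into our private halo rows. */
    if (MYTHREAD > 0)
        upc_memget(&local_img[0][0], &halo[2*(MYTHREAD-1) + 1][0],
                   NCOLS * sizeof(double));
    if (MYTHREAD < THREADS - 1)
        upc_memget(&local_img[NROWS_LOCAL + 1][0], &halo[2*(MYTHREAD+1)][0],
                   NCOLS * sizeof(double));
    upc_barrier;
}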

The program exhibited good scaling, especially as the computation cost increased relative to the communication cost. The overall performance improved by more than an order of magnitude when local memory, rather than shared memory, was used to store the arrays accessed in the compute intensive loops. Figure 6.8 depicts the difference in performance between the UPC code which uses local arrays and the UPC code which uses shared arrays. Figure 6.9 presents the scaling of the UPC image processing benchmark implemented to use local memory. As the image size increased, the scaling also improved, owing to the higher ratio of parallel to serial computation.

[Figure 6.7 diagram: the image data held locally on each processor is put into, and the neighbouring boundary data got from, a shared array of size 2*THREADS that is used for the halo swap.]

Figure 6.7: Halo swap implemented using upc_memput and upc_memget.

6.3.3 Summary

The communication characteristics of the UPC implementations varied widely between hardware architectures and installations. Berkeley UPC delivered very low latency when using a chip multiprocessor, while the latency was much poorer when using the SMP server (Lomond). The IBM XL UPC installation on HPCx demonstrated very high latency; the root cause of this behavior is unknown. While IBM XL UPC was able to provide excellent compiler optimisation, the Berkeley UPC installation on Lomond was not able to use the native compiler optimisation effectively, mainly because neither the source to source translator nor a GCC UPC compiler optimised for Solaris was available. As a result of the poor latency of the IBM XL UPC installation on HPCx, it was not possible to obtain performance improvements when using distributed shared memory to parallelise the code. The code did, however, scale better when using a message passing style of program structure.

Figure 6.8: UPC shared memory vs UPC message passing (elapsed time in seconds against number of processors, 2 to 16).

Figure 6.9: Speedup of the image processing benchmark using IBM XL UPC on HPCx, for image sizes 192x360, 384x576 and 600x840 (speedup against number of processors, 2 to 16).

Chapter 7

Comparison of Global Arrays and UPC

Although the Global Arrays toolkit and UPC both follow the DSM programming model, significant differences exist in their usage characteristics. This chapter compares the GA toolkit and the UPC language on several fronts: availability and portability across machine architectures, ease of installation, availability and clarity of documentation, syntax and its ease of use, ability to utilise compiler optimisation, interoperability with MPI, communication latency, parallel performance gained, and ease of implementing different parallel decomposition strategies. The analysis of the UPC language is based on the characteristics of the Berkeley UPC and IBM XL UPC implementations on Lomond and HPCx; these results can vary from one implementation to another.

7.1 Portability and availability

The Global Arrays toolkit can be installed on many hardware platforms, including SMP servers, clusters of SMP servers, MPP servers and networks of workstations. It is supported on Solaris, AIX, HPUX, Linux, Windows NT, IRIX and Digital/TRU64 Unix. As part of this investigation, the GA toolkit was installed on a stand alone Solaris server (Lomond), an IBM Bluegene/L, and an IBM eSeries 575 SMP cluster (HPCx). Many networks are supported by the Global Arrays toolkit, including ethernet, Quadrics/QsNet Elan3 and Elan4, Shmem, VAPI (Infiniband), OpenIB (Infiniband), GM (Myrinet) and Giganet cLAN (VIA). The GA toolkit is open source and is supported by the HPC tools support team at Pacific Northwest National Laboratory. Programming interfaces to the Global Arrays toolkit functionality are available in Fortran, C, C++ and Python.

The GA toolkit is implemented as a library and is designed to co-exist with MPI. In this investigation MPI was used as the communication medium for the installations of the GA toolkit on Lomond and the Bluegene/L; this mode of installation offers maximum stability. On HPCx, the GA toolkit was installed to use LAPI. The test programs packaged with the GA toolkit were run to verify that each installation worked correctly. The Global Arrays toolkit is also available on the CRAY XT3/XT4, and can potentially be installed on HECTOR [35].

Berkeley UPC is also an open source solution and comprises a source to source translator, which converts UPC code to C code, and a UPC runtime, which provides the UPC environment to the user. The UPC runtime can be installed on many operating systems, including Linux, FreeBSD, NetBSD, Tru64, AIX, IRIX, HPUX, Microsoft Windows, Mac OS X, Cray Unicos and NEC SuperUX. It also supports many processor architectures, such as x86, Itanium, Opteron, Athlon, Alpha, PowerPC, MIPS, PA-RISC, SPARC, Cray T3E, Cray X1/X1E, Cray XD1, Cray XT3, SX-6 and SGI Altix. This is mainly due to the excellent support of the ANSI C standard across a wide range of machines. Berkeley UPC claims to be installable with a wide range of compilers, such as GNU GCC, Intel C, Portland Group C, SunPro C, Compaq C, HP C, MIPSPro C, IBM VisualAge C, Cray C, NEC C and Pathscale C.

As part of this investigation the Berkeley UPC runtime was installed on Lomond. An attempt to build the Berkeley UPC runtime with the native Sun MPI compilers failed; the installation succeeded when performed with the GCC compiler chain. On HPCx the same issue prevented the installation of the Berkeley UPC runtime, so IBM XL UPC was used for the study on HPCx.

The Berkeley UPC to C (source to source) translator is not supported on as many platforms as the runtime. The translator was not available on Solaris, which prevented a complete installation of the Berkeley UPC platform on Lomond. An online translator provided by Berkeley could be used to compile code for Lomond, but this setup has the drawback that the native compiler could not be utilised effectively. Because of this, most of the tests and studies were performed using the IBM XL UPC installation on HPCx. Unlike the Berkeley UPC compiler, the IBM XL UPC compiler is closed source and is a binary compiler. The IBM XL UPC compiler was found to be capable of providing excellent compiler optimisation, sometimes rivaling the performance of the XLC compiler on serial C code.

It was necessary to compile the GA toolkit on each of the three machines separately; however, the source codes were found to be completely portable between the three machines. This was not the case between Berkeley UPC and IBM XL UPC. The Berkeley UPC compiler was observed to be less forgiving than the IBM XL UPC compiler: while the IBM XL UPC compiler allowed direct assignment between shared and private variables and allowed their references to be passed as function arguments, Berkeley UPC issued warning messages and, in more complex scenarios, failed to compile the code.

The Global Arrays toolkit documentation was very clear and easy to understand. Berkeley UPC also provided excellent documentation with the product. The IBM documentation, however, was not available for download, which made the investigation of XL UPC more difficult.

The portability, availability and installability of the GA code was therefore observed to be higher than that of UPC on the three machines on which these experiments were conducted. Many leading molecular dynamics applications are implemented using Global Arrays, and the maturity and performance of the GA toolkit is therefore higher than that of UPC.

7.2 Comparison of GA and UPC syntax

The GA and UPC routines used for the investigations detailed in this report are presented below, and their syntax and programmability are compared in this section.

The ga_initialize subroutine is used to initialise the GA environment and is called after the MPI environment is initialised. The ga_terminate subroutine is called prior to the finalisation of the MPI environment and is used to terminate the GA environment. UPC does not require an explicit runtime initialisation in code; an environment variable is used to set the number of threads that the UPC program runs on.
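
A minimal sketch of this lifecycle is shown below, written with the GA C bindings rather than the Fortran interface names quoted in this section; it is illustrative only.

#include <mpi.h>
#include "ga.h"

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);    /* the MPI environment comes up first      */
    GA_Initialize();           /* then the GA environment on top of it    */

    /* ... work with global arrays ... */

    GA_Terminate();            /* GA is shut down before MPI is finalised */
    MPI_Finalize();
    return 0;
}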

In order to create a global array, the ga_create routine is used. This routine accepts as parameters the datatype of the array, the number of elements in dimension 1, the number of elements in dimension 2, the name of the array, the block size (chunk) for dimension 1, the chunk for dimension 2, and the integer handle to the global array. The chunks can be specified for each dimension to allow a custom decomposition of the array across all available nodes.

ga_create(type, dim1, dim2, arrayname, chunk1, chunk2, ga)

The GA toolkit also allows creation of multi-dimensional arrays and provides the ability to partition them on every dimension. Up to seven dimensions are supported by the GA toolkit. Using the GA toolkit it is possible to partition each dimension irregularly across nodes.
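
As an illustration of this chunk-based decomposition, the sketch below creates a two-dimensional global array through the GA C bindings; the dimensions and chunk values are placeholders, and the availability of the C_DBL type constant from the GA headers is an assumption.

#include "ga.h"
#include "macdecls.h"   /* assumed source of the C_DBL type constant */

#define M 600
#define N 840

/* chunk[i] is the minimum block size on dimension i; -1 lets GA choose.
   Requesting a chunk of M on the first dimension keeps that dimension
   whole, so the array is split along the second dimension only. */
int create_edge_array(void)
{
    int dims[2]  = { M, N };
    int chunk[2] = { M, -1 };
    int g_a = NGA_Create(C_DBL, 2, dims, "edge", chunk);  /* returns a handle */
    return g_a;
}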

shared [N/THREADS] type array[N];

shared [(M*N)/THREADS] type array[M][N];

UPC, on the other hand, offers only restricted partitioning of two dimensional arrays, owing to the nature of its partitioning syntax. The dependency on a single block size parameter severely limits the partitioning capability of the UPC syntax, and irregular partitioning of arrays across nodes is not supported. The GA syntax therefore provides more control over the partitioning of the distributed shared memory than UPC.

The Global Arrays toolkit provides routines to explicitly check the distribution of a global array. The ga_distribution routine returns the layout of the global array across the distributed memory of the parallel computer. This collective routine is used to obtain the lower and upper index bounds of the portion of a multi-dimensional array stored on processor iproc.

ga_distribution(ga, iproc, ilo, ihi, jlo, jhi)

UPC lacks a mechanism to explicitly check the extent of an array on a processor. Instead, it provides a data locality check to test whether an array element resides on a processor.

upc_forall (iter = 0; iter < N; iter++; &array[iter])

Evaluating this affinity test for every iteration is compute intensive, and its use is not recommended. The Global Arrays routines, on the other hand, can be used effectively to parallelise compute intensive loops after performing the data affinity check in the outer loop.
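
A hedged sketch of the GA pattern just described is given below, using the C bindings (NGA_Distribution and NGA_Access) rather than the Fortran names quoted above; the update inside the loop is purely illustrative.

#include "ga.h"

void compute_on_local_patch(int g_a)
{
    int me = GA_Nodeid();
    int lo[2], hi[2], ld[1];
    double *buf;
    int i, j;

    NGA_Distribution(g_a, me, lo, hi);    /* bounds of the locally held patch */
    NGA_Access(g_a, lo, hi, &buf, ld);    /* direct pointer to that patch     */

    for (i = 0; i <= hi[0] - lo[0]; i++)
        for (j = 0; j <= hi[1] - lo[1]; j++)
            buf[i * ld[0] + j] *= 2.0;    /* illustrative local update        */

    NGA_Release_update(g_a, lo, hi);      /* mark the patch as modified       */
}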

Though UPC provides an easy to use syntax to implement loop level parallelism, its performance was found to be poor due to the high cost of access to distributed shared memory. Therefore, for problems with regular communication patterns and a high amount of communication, UPC programs need to be redesigned to use an explicit parallel decomposition, storing data in local memory and using shared memory only for halo swaps.

The GA communication routines provide more control over access to segments of the global array space than the UPC syntax. It is possible to access multi-dimensional chunks of a global array using single communication calls; the GA toolkit internally converts them into multiple messages, which simplifies parallel programming. The upc_memget and upc_memput routines can only accomplish one dimensional data transfers from the shared array, which requires careful index calculation and more complex programming compared to the GA syntax.
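
The difference can be seen in the sketch below: a single NGA_Get (the C binding of the get operation) fetches a rectangular 16x16 patch, whereas the equivalent UPC transfer would need one upc_memget per row of the patch. The handle, indices and buffer sizes are placeholders.

#include "ga.h"

void fetch_patch(int g_a, double patch[16][16])
{
    int lo[2] = { 0, 0 };      /* global corner indices of the patch    */
    int hi[2] = { 15, 15 };
    int ld[1] = { 16 };        /* leading dimension of the local buffer */
    NGA_Get(g_a, lo, hi, patch, ld);   /* whole 2-D block in one call   */
}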

UPC syntax allows users to directly assign a shared array element to a local array element, which simplifies parallel programming. However, care needs to be taken to avoid frequent accesses to shared memory, especially on large shared arrays, as this leads to performance degradation.
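
For example, in a sketch such as the one below (the shared array declaration is an assumption), every element assignment may turn into a remote read, which is the source of the degradation referred to above.

#include <upc.h>

#define N 360
shared [N] double row_img[THREADS][N];   /* one row per thread (sketch) */

void copy_neighbour_row(void)
{
    static double private_copy[N];
    int j;
    /* Each assignment below reads a shared element that lives on another
       thread, so the loop generates per-element remote traffic. */
    for (j = 0; j < N; j++)
        private_copy[j] = row_img[(MYTHREAD + 1) % THREADS][j];
}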

Both GA and UPC provide collective communication and features to gather and scatter data across processors. The GA toolkit additionally provides explicit routines to perform simple arithmetic, data transposes, the solution of linear algebraic equations and similar operations on the shared memory in a data parallel manner, and thus offers more user friendly functionality. When using the UPC programming language, such routines need to be implemented by the programmer.
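
The sketch below gives a flavour of these whole-array operations through the GA C bindings; g_a, g_b and g_c are assumed to be handles to existing square global arrays of doubles with identical shapes, so that the transpose and add are conformable.

#include "ga.h"

void whole_array_operations(int g_a, int g_b, int g_c)
{
    double one = 1.0, half = 0.5;

    GA_Transpose(g_a, g_b);              /* g_b = transpose of g_a      */
    GA_Add(&one, g_a, &half, g_b, g_c);  /* g_c = 1.0*g_a + 0.5*g_b     */
    GA_Sync();                           /* collective completion point */
}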

7.3 Effect of compiler optimisation

GA programs benefit from the compiler optimisation offered by the native compiler. They were able to do so on all three hardware architectures on which they were tested, without sacrificing the accuracy of the output.

Berkeley UPC, on the other hand, was unable to utilise compiler optimisation on Solaris, as the source to source translator was not available for native compilation on this platform. The IBM XL UPC compiler was able to benefit from compiler optimisation.

7.4 Interoperability with MPI

The Global Arrays toolkit is designed from the ground up to interoperate with MPI. The ranks of the processes assigned by MPI are guaranteed to match the ranks of the processes reported by the GA routines. The GA synchronisation calls can co-exist with MPI synchronisation without leading to deadlocks.
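
A small sketch of this interoperability is shown below, assuming the program has already called MPI_Init and GA_Initialize as in the skeleton given in section 7.2.

#include <mpi.h>
#include "ga.h"

void mixed_mpi_ga_step(void)
{
    int mpi_rank, ga_rank, total;

    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
    ga_rank = GA_Nodeid();                     /* expected to equal mpi_rank */

    GA_Sync();                                 /* GA synchronisation ...     */
    MPI_Allreduce(&ga_rank, &total, 1, MPI_INT,
                  MPI_SUM, MPI_COMM_WORLD);    /* ... mixed freely with MPI  */
}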

Though UPC can be installed to use MPI as its communication conduit, parallel programming using both MPI and UPC syntax is not fully supported. While Berkeley UPC supports this mode of programming experimentally, no information was available on the support of MPI in IBM XL UPC. The Berkeley UPC team also recommends using a separate installation of the UPC runtime for compiling UPC-MPI hybrid programs, as this mode can degrade the performance of the UPC communication.

7.5 Comparison of communication latency of GA, UPC and MPI on HPCx

HPCx was used as the platform to investigate the intra-node communication latency of UPC, MPI and GA (Figure 7.1). MPI offered the best overall communication performance for large messages. The GA communication performance was better than MPI for small messages, but for messages over 1MB in size MPI offered better bandwidth. In comparison to both GA and MPI, the UPC performance was poor. However, the communication characteristics of the UPC routines were found to be very stable compared to those of GA or MPI.

[Figure 7.1 plot: elapsed time in seconds against message size in megabytes for GA, UPC get, UPC put and MPI.]

Figure 7.1: Comparison of communication latency of UPC, GA and MPI on HPCx.

Chapter 8

Conclusions

This chapter gives the conclusions derived from the experience of using distributed shared memory programming, represented by the Global Arrays toolkit and IBM XL UPC, on varying hardware architectures.

Distributed shared memory programming using the GA toolkit was found to be easy and yielded high developer productivity. The GA communication routines were, however, found to be more costly than the equivalent communication implemented using MPI. As a result, GA based programs suffer performance degradation when compared to MPI programs.

Distributed shared memory programming can offer a high productivity development environment that enables the straightforward implementation of memory intensive applications on novel MPP architectures such as the IBM Bluegene/L. This was successfully demonstrated by extending the system size of the molecular dynamics application by 77.7% through the use of the Global Arrays toolkit. This allows the IBM Bluegene/L to be used as a target for simulations that could not be executed using the previous version of the molecular dynamics application.

The root cause of the poor scalability faced by the molecular dynamics code on IBM Bluegene/L was attributed to the serial component. By parallelising the routines which contributed to this issue, the scalability of the molecular dynamics application was extended from 160 processors to more than 512 processors.

A comparative study of the use of Unified Parallel C and the Global Arrays toolkit to parallelise an image processing application with nearest neighbor communication was undertaken on HPCx. The results from this study indicate that the use of UPC shared memory can lead to poor application performance. This result is based on the installations of the IBM XL UPC and Berkeley UPC implementations on Lomond and HPCx. Programming UPC in a message passing mode can lead to better performance for applications with nearest neighbor communication.

The developer productivity for high performance computing was found to be higher when using Global Arrays than when using UPC, for applications with nearest neighbor communication. While both UPC and GA programs needed the message passing programming model to yield good performance, the GA syntax was found to be simpler and better suited than UPC for message passing style programming. The Global Arrays toolkit was also observed to be more portable than its open source counterpart, Berkeley UPC.

The communication performance of the MPI and Global Arrays routines was found to be better, by an order of magnitude, than that of UPC shared memory to local memory assignment on Lomond and HPCx. UPC programs that use local memory and explicit halo swaps did not suffer from this issue and yielded good performance. This was demonstrated using an image processing application implemented in UPC with both the shared memory model and the message passing model.

8.1 Future Work

The investigations performed as part of this project revealed several areas which need further investigation. These areas are highlighted in this section.

As part of this project, three large arrays were distributed across the MPP machine using the Global Arrays toolkit. It is possible to further improve the memory scalability of the code on the Bluegene by migrating additional arrays for which the communication cost would not harm application performance. A more detailed investigation may uncover more such opportunities.

The investigation of UPC performance was performed using applications which do not fully utilise the potential of UPC. UPC is better suited to applications that have irregular communication patterns and require dynamic load balancing. A more detailed investigation of UPC on large systems, using applications with irregular communication patterns, is therefore a potential area for future work.

An investigation of UPC on the CRAY XT4 could help identify whether the poor performance of shared memory to local memory assignment reported in this investigation also exists in other versions of UPC on other platforms.

Experimental versions of Berkeley UPC and IBM XL UPC for the IBM Bluegene/L are currently available. An investigation into their performance is also identified as a potential area for future work.

Bibliography

[1] M. Rasit Eskicioglu, University of New Orleans, 1995, A Comprehensive Bibliography of Distributed Shared Memory, ACM SIGOPS Operating Systems Review http://portal.acm.org/citation.cfm?doid=218646.218651

[2] PNNL, 2003, The Global Arrays Toolkit., Online documentation http://www.emsl.pnl.gov/docs/global/

[3] LBNL,UC Berkeley, 2006, Berkeley UPC - Unified Parallel C., Online documentation http://upc.lbl.gov/

[4] Ian Foster, Designing and Building Parallel Programs (Version 1.3), Online tutorial from Addison-Wesley Inc., Argonne National Laboratory, and the NSF Center for Research on Parallel Computation. http://www-unix.mcs.anl.gov/dbpp/

[5] Alan Gray, Joachim Hein and Stephen Booth, 2005, Improved MPI with RDMA, HPCx Technical report http://www.hpcx.ac.uk/research/hpc/technical_reports/HPCxTR0505.pdf

[6] EPCC, 2005, Introduction to the University of Edinburgh HPC Service (Version 3.0), User guide http://www2.epcc.ed.ac.uk/computing/services/sun/documents/hpc-intro/html/index.html

[7] EPCC, 2007, User guide to EPCC’s Bluegene/L Service. (Version 1.0), User guide http://www2.epcc.ed.ac.uk/ bgapps/UserGuide/BGuser/

[8] IBM, 2005, Unfolding the IBM eServer Blue Gene Solution., IBM Redbook http://www.redbooks.ibm.com/abstracts/sg246686.html

[9] HPCx, 2007, User’s Guide to the HPCx Service. (Version 2.02), User guide http://www.hpcx.ac.uk/

[10] Jarek Nieplocha, Manojkumar Krishnan, Bruce Palmer, Vinod Tipparaju, 2007, The Global Arrays User’s Manual, User manual: http://www.emsl.pnl.gov/docs/global/user.html

[11] PNNL, TCGMSG Message Passing Library, Online documentation http://www.emsl.pnl.gov/docs/parsoft/tcgmsg/tcgmsg.html

[12] Jarek Nieplocha, Bruce Palmer, Vinod Tipparaju, Manojkumar Krishnan, Harold Trease, Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit, International Journal of High Performance Computing Applications http://hpc.sagepub.com/cgi/reprint/20/2/203?ck=nck

[13] Jarek Nieplocha, M. Krishnan, Vinod Tipparaju, D. K. Panda, 2007, High Performance Remote Memory Access Communication: The ARMCI Approach, International Journal of High Performance Computing Applications http://hpc.sagepub.com/cgi/reprint/20/2/233

[14] 2007, The ScaLAPACK Project, Online documentation http://www.netlib.org/scalapack/

[15] PNNL, Overview of the Global Arrays Parallel Software Development Toolkit., PNNL Tutorial http://www.emsl.pnl.gov/docs/global/tutorial/ga-sc06.ppt

[16] SC2006, 2007, Design and Implementation of a One-Sided Communication Interface for the IBM eServer Blue Gene Supercomputer, Supercomputing, 2006. SC ’06. Proceedings of the ACM/IEEE SC 2006 Conference http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4090228

[17] IBM, 2005, IBM XL UPC Compilers, Online documentation http://www.alphaworks.ibm.com/tech/upccompiler

[18] UPC consortium, 2005, UPC Language Specifications (Version 1.2), Language specification http://www.gwu.edu/ upc/docs/upc_specs_1.2.pdf

[19] U.C. Berkeley, 2003, GASNET, Online documentation http://gasnet.cs.berkeley.edu/

[20] U.C. Berkeley, 2006, Titanium, Online documentation http://titanium.cs.berkeley.edu/

[21] 2006, Co-Array Fortran, Online documentation http://www.co-array.org/

[22] Francois Cantonnet, Yiyi Yao, Mohamed Zahran and Tarek El-Ghazawi, 2004, Productivity Analysis of the UPC Language, Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1303318

[23] D. C. Rapaport, 2003, The Art of Molecular Dynamics Simulation, Cambridge University Press. p 4.

[24] M. P. Allen and D. J. Tildesley, 1987, Computer Simulation of Liquids Clarendon Press, Oxford.

[25] Plimpton, S. and Hendrickson, 1995, Parallel Molecular Dynamics Algorithms for Simulations of Molecular Systems, Parallel Computing in Computational Chemistry American Chemical Society, Symposium Series 592

[26] Aristedes Papadopoulos, 2006, Simulation of Fundamental Electrochemical Events, MSc Dissertation EPCC http://www2.epcc.ed.ac.uk/msc/dissertations/dissertations-0506/8550532-9e-dissertation1.1.pdf

[27] Dr. Huub Van Dam, CCLRC Daresbury Laboratory, Private communication

[28] IBM, 2006, Port Fortran applications, Online documentation http://www.ibm.com/developerworks/aix/library/au-portfortan.html

[29] Dr. Stewart Reed, School of chemistry, University of Edinburgh, Private communication

[30] CRAY, CRAY Unified Parallel C, User documentation http://docs.cray.com/books/S-2179-50/html-S-2179-50/z1035483822pvl.html

[31] Hewlett-Packard, 2005, HPUPC, User documentation http://h30097.www3.hp.com/upc/

[32] GCC, 2005, GCC UPC, GCC website http://www.intrepid.com/upc.html

[33] UC Berkeley, 2006, Berkeley UPC downloads, Software download http://upc.lbl.gov/download/

[34] IBM, 2006, IBM XL UPC installation requirements, Online documentation http://www.alphaworks.ibm.com/tech/upccompiler/requirements

[35] HECTOR, High-End Computing Terascale Resource, HPCx website http://www.hector.ac.uk
