On Load Balancing of Hybrid OpenCL/Global Arrays Applications on Heterogeneous Platforms

Ekasit Kijsipongse and Suriya U-ruekolan
Large-Scale Simulation Research Laboratory
National Electronics and Computer Technology Center (NECTEC)
112 Thailand Science Park, Pahon Yothin Rd., Klong 1, Klong Luang, Pathumthani 12120, Thailand
Tel: 662-564-6900 Ext. 2276, Fax: 662-564-6772
Email: ekasit.kijsipongse

Abstract—In recent years, high performance computing (HPC) resources have grown rapidly and diversely. The next generation of HPC platforms is assembled from resources of various types, such as multi-core CPUs and GPUs. Thus, the development of a parallel program that fully utilizes heterogeneously distributed resources in an HPC environment is a challenge. A parallel program should be portable and able to run efficiently on all types of computing resources with the least effort. We combine the advantages of Global Arrays and OpenCL for such parallel programs. We employ OpenCL to implement parallel applications at fine-grain level so that they can execute across heterogeneous platforms. At coarse-grain level, we utilize Global Arrays for efficient data communication between computing resources in terms of virtually shared memory. In addition, we propose a load balancing technique based on the task pool model for hybrid OpenCL/Global Arrays applications on heterogeneous platforms to improve the performance of the applications.

I. INTRODUCTION

As technology advances, computing resources gain benefits in many aspects: larger capacity, increased capability, as well as rapidity. Developers therefore regularly upgrade their computing resources, expand the existing ones, or even recruit new computing devices to serve more demand, especially in the high performance computing (HPC) area where speed is crucial. Graphics Processing Units (GPUs), for example, which were previously used in computer graphics, have recently become common components of the HPC resources at many organizations and institutes. This trend will sooner or later reach other organizations as well. The next generation of HPC platforms is then assembled from resources of various types such as multi-core CPUs and GPUs.

Thus, the development of a parallel program that fully utilizes heterogeneously distributed resources in an HPC environment is a challenge. A parallel program should be portable and able to run on all types of computing resources with the least effort. The OpenCL [7] framework is one promising key that developers use to unlock the computing power of those diverse resources. As HPC resources can consist of hardware of different types from various vendors, it is critical for developers to treat load balancing as another major concern so as to achieve higher resource utilization and lower execution time of the applications. Some solutions assign load to resources statically at compile time, while others distribute load dynamically during runtime. Due to the highly heterogeneous nature of future computing platforms, dynamic load balancing techniques are the viable choice. As a recent example, the dynamic load balancing for multi-GPU systems discussed in [3] is based on the task pool model, in which the host enqueues tasks into a queue and an idle GPU attached to the host dequeues a task for processing. Another example [4] used dynamic load balancing based on the task pool model to distribute tasks to GPU devices in a cluster computer.

In this paper, we add two extensions to our previous work [4] to achieve load balancing of hybrid OpenCL/Global Arrays applications on heterogeneous computing platforms. Firstly, dynamic load balancing based on the task pool model is used to improve the performance of Global Arrays (GA) applications at coarse-grain level. Although GA provides a high-level programming interface in terms of virtually shared memory, which reduces the developer's effort required to write parallel programs in distributed environments, it is left to the application to do load balancing. Our dynamic load balancing based on the task pool model should complement what is needed by typical GA applications. Secondly, we deploy OpenCL to implement the parallel applications at fine-grain level so that they can execute across heterogeneous platforms utilizing the full resource capacity. The rest of this paper is organized as follows. Section II provides details of the related technologies used in our work. Section III discusses the design and implementation of an efficient hybrid OpenCL/GA programming model with dynamic load balancing by the task pool model. Section IV presents the evaluation results of our work. Finally, we conclude this paper in Section V.
II. RELATED TECHNOLOGIES

Our study on the load balancing of hybrid OpenCL/GA applications on heterogeneous platforms involves two main technologies, as follows.

A. Global Arrays

Global Arrays (GA) [1] was designed to simplify the programming methodology on distributed memory systems. The most innovative idea of GA is that it provides an asynchronous, one-sided, shared-memory programming environment for distributed memory systems. GA includes the ARMCI (Aggregate Remote Memory Copy Interface) library, which provides one-sided communication capabilities for distributed array libraries and compiler runtime systems. ARMCI offers a simpler and lower-level model of one-sided communication than MPI-2 [2]. GA reduces the effort required to write parallel programs for clusters since they can assume a virtual shared memory. Part of the user's task is to explicitly define the physical data locality for the virtual shared memory and the appropriate data access patterns.
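To make the one-sided programming model concrete, the following minimal sketch of ours (not code from [1]) creates a distributed array and accesses an arbitrary patch of it with put/get operations; it assumes GA 5.x built on MPI, and the array size and patch indices are illustrative only.

    #include <mpi.h>
    #include "ga.h"
    #include "macdecls.h"

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        GA_Initialize();
        MA_init(C_DBL, 1000000, 1000000);      /* memory allocator used internally by GA */

        int dims[2]  = {1000, 1000};           /* a 1000 x 1000 array of doubles ...     */
        int chunk[2] = {-1, -1};               /* ... distributed across all processes   */
        int g_a = NGA_Create(C_DBL, 2, dims, "A", chunk);
        GA_Zero(g_a);

        /* one-sided access: any process may read or write any patch of the
           array with put/get, regardless of which node physically owns it   */
        double patch[10 * 10];
        int lo[2] = {0, 0}, hi[2] = {9, 9}, ld[1] = {10};
        for (int i = 0; i < 100; i++) patch[i] = (double)i;
        NGA_Put(g_a, lo, hi, patch, ld);
        GA_Sync();
        NGA_Get(g_a, lo, hi, patch, ld);

        GA_Destroy(g_a);
        GA_Terminate();
        MPI_Finalize();
        return 0;
    }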

B. OpenCL

The Open Computing Language (OpenCL) [7] has been adopted as a standard programming framework for parallel computing. It allows a parallel program to be portable across heterogeneous platforms from many vendors. In OpenCL, computing devices of various types, such as CPUs, GPUs, and others, are modeled as abstract devices, as illustrated in Figure 1. The compute-capable hardware is defined as an array of compute units, each of which operates independently of the others. Each compute unit is further divided into multiple processing elements (PE). The vendors map this abstract device to specific physical hardware in their implementations, as exemplified in Table I.

The part of a program that is to be executed on an OpenCL device is called a kernel. The memory space of an OpenCL device is separated into global memory, local memory, and private memory. Global memory is generally the largest-capacity memory space on a device. For most devices, global memory has lower throughput than the local and private memory. Local memory is small (KB) but faster than global memory, so it is often used as a user-managed cache. Private memory is the smallest but the fastest. The size of the private memory is not defined by OpenCL; when too much private memory is requested, it is up to the compiler to decide which data to keep in private memory and which to spill to slower memory.
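As an illustration of the abstract device model (a sketch of ours, not code from the paper), a host program can enumerate the devices it finds and query their compute units and memory sizes through the standard OpenCL API:

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platforms[8];
        cl_uint num_platforms = 0;
        clGetPlatformIDs(8, platforms, &num_platforms);

        for (cl_uint p = 0; p < num_platforms; p++) {
            cl_device_id devices[16];
            cl_uint num_devices = 0;
            clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 16, devices, &num_devices);

            for (cl_uint d = 0; d < num_devices; d++) {
                char name[128];
                cl_uint cus;          /* number of compute units            */
                cl_ulong gmem, lmem;  /* global and local memory, in bytes  */
                clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
                clGetDeviceInfo(devices[d], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cus), &cus, NULL);
                clGetDeviceInfo(devices[d], CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(gmem), &gmem, NULL);
                clGetDeviceInfo(devices[d], CL_DEVICE_LOCAL_MEM_SIZE, sizeof(lmem), &lmem, NULL);
                printf("%s: %u compute units, %lu MB global, %lu KB local\n",
                       name, cus, (unsigned long)(gmem >> 20), (unsigned long)(lmem >> 10));
            }
        }
        return 0;
    }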

Fig. 1. OpenCL abstract device

TABLE I
OPENCL DEVICE MAPPING

    OpenCL               ATI GPU               Nvidia GPU                  AMD CPU
    Compute Unit         SIMD Core             Streaming Multiprocessor    Processor Core
    Processing Element   Streaming Processor   Streaming Processor         -
    Global Memory        Global Memory         Device Memory               RAM
    Local Memory         Local Data Share      Shared Memory               RAM
    Private Memory       Register              Register                    Register or Stack

III. DESIGN AND IMPLEMENTATION

In this paper, we focus on the load balancing of embarrassingly parallel programs on heterogeneous platforms. A task is decomposed into a set of independent subtasks of equal size, all of which are stored in a task pool waiting for an idle compute device to fetch a subtask for execution, as illustrated in Figure 2. After a device has completed the subtask, it requests a new subtask from the task pool. With this task pool model, load is dynamically distributed to all devices according to the capability of each device. Consequently, faster devices get more subtasks to work on, thus resulting in load balance.

Fig. 2. Task pool dynamic load balancing for hybrid OpenCL/GA

In the implementation, the definition of task and subtask is application specific. However, in general, a task consists of the input data and the kernel that transforms the input data into the output data. For hybrid OpenCL/GA programs, the logic is written as an OpenCL kernel for portability, so that it can be executed on any OpenCL-capable device. We use GA operations for the task and task pool management. Transferring the input and output data between the task pool and the compute devices through the communication network is encapsulated in GA put/get operations. In addition, by virtue of GA, the location that stores the input and output data is transparent to the application. If the data are too large to fit in a single machine, they can be stored distributedly over multiple machines. The outline of the GA implementation of the task pool is given as follows.

    Initialize Global Arrays
    Initialize OpenCL device
    Create kernel
    Create device buffer
    GA_Sync()
    Do
        GA_Read_inc(subtask_id)
        if (subtask_id > num_subtasks) then break
        GA_Get(input[subtask_id])
        Transfer input data into device buffer
        Call OpenCL kernel
        Transfer output data from device buffer
        GA_Put(output[subtask_id])
    While true
    GA_Sync()
    Release OpenCL resources
    Terminate Global Arrays
The number of subtasks is defined by num_subtasks. The GA operation GA_Read_inc() atomically reads and increments the integer subtask_id by one, such that each subtask is exclusively processed by one device. GA_Get(input[subtask_id]) and GA_Put(output[subtask_id]) are GA operations that read the input data and write the output data to the memory space occupied by the current subtask, respectively. GA_Sync() is the barrier synchronization.
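As a minimal sketch of how this outline might look in C (our illustration, not the authors' code), assume that g_counter is a one-element integer global array used as the shared subtask counter, g_in and g_out are 1-D global arrays holding the packed input and output blocks, and run_subtask() is a hypothetical helper that wraps the OpenCL buffer transfers and kernel launch; GA and OpenCL are assumed to be initialized elsewhere.

    #include "ga.h"

    /* hypothetical helper (not shown): copies in_buf to the OpenCL device,
       launches the kernel, and copies the result back into out_buf          */
    extern void run_subtask(const float *in_buf, float *out_buf, int in_elems);

    void worker_loop(int g_counter, int g_in, int g_out, long num_subtasks,
                     int in_elems, int out_elems, float *in_buf, float *out_buf)
    {
        int zero[1] = {0}, ld[1] = {1};
        GA_Sync();                                /* all workers start together  */
        for (;;) {
            /* atomic fetch-and-increment of the shared counter: each subtask
               id is handed out exactly once across all processes              */
            long id = NGA_Read_inc(g_counter, zero, 1);
            if (id >= num_subtasks) break;

            int lo_in[1]  = { (int)id * in_elems };
            int hi_in[1]  = { (int)id * in_elems + in_elems - 1 };
            int lo_out[1] = { (int)id * out_elems };
            int hi_out[1] = { (int)id * out_elems + out_elems - 1 };

            NGA_Get(g_in, lo_in, hi_in, in_buf, ld);     /* pull the input block   */
            run_subtask(in_buf, out_buf, in_elems);      /* OpenCL transfer+kernel */
            NGA_Put(g_out, lo_out, hi_out, out_buf, ld); /* push the output block  */
        }
        GA_Sync();                                /* wait until the pool is drained */
    }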

IV. EVALUATION

For the performance evaluation, we selected the Pearson correlation coefficient as the testing application. This application has a high computation-to-communication ratio, so it is appropriate to be executed in distributed environments. We have carried out the experiments and the details are given as follows.

A. Pearson Correlation

The Pearson correlation coefficient [6] gives a measure of how similar two objects are. It can be used in data analysis, signal processing, pattern recognition, image processing, and bioinformatics. Let X and Y be objects that contain m attributes; the Pearson correlation coefficient, r_{X,Y}, between the two objects X and Y is defined mathematically as:

    r_{X,Y} = \frac{\sum_{i=1}^{m} (X_i - \bar{X})(Y_i - \bar{Y})}
                   {\sqrt{\sum_{i=1}^{m} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{m} (Y_i - \bar{Y})^2}}

where \bar{X} and \bar{Y} are defined as \bar{X} = \frac{1}{m}\sum_{i=1}^{m} X_i and \bar{Y} = \frac{1}{m}\sum_{i=1}^{m} Y_i, respectively. The value of r_{X,Y} ranges from -1 to 1. It is close to zero if the two objects are uncorrelated. When it is positive, X and Y are correlated; the higher the value, the stronger the correlation. If the value of r_{X,Y} is negative, then X and Y are negatively correlated.

The calculation of the pairwise Pearson correlation coefficients on a dataset, known as the correlation matrix, becomes computing intensive with the rapid growth of data in the digital era. Assume that the data set contains n objects, each of which has m attributes; the correlation between all pairs of objects can be expressed as the correlation matrix, in which each element is the Pearson correlation coefficient, r_{X,Y}, of a different object pair (X, Y). The calculation of the correlation matrix is highly parallelizable, as each r_{X,Y} can be computed independently. Figure 3 shows how the correlation matrix is partitioned into blocks for parallel computing. The output block (A, B) in the correlation matrix is calculated from input blocks A and B of the dataset. Note that the correlation matrix is always symmetric.

Fig. 3. Block decomposition of correlation matrix for parallel computing
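The per-block computation maps naturally onto an OpenCL kernel. The following sketch is ours (the paper does not list its kernel); it assumes each work item computes one element of a bs × bs output block from two input blocks A and B of bs objects with m attributes each, and it uses the algebraically expanded form of the formula above.

    /* each work item computes one element of a bs x bs output block R from
       input blocks A and B, each holding bs objects of m attributes (row-major) */
    __kernel void corr_block(__global const float *A,
                             __global const float *B,
                             __global float *R,
                             const int m)
    {
        const int x  = get_global_id(0);        /* object index within block A */
        const int y  = get_global_id(1);        /* object index within block B */
        const int bs = get_global_size(0);      /* block size                   */

        float sx = 0.0f, sy = 0.0f, sxx = 0.0f, syy = 0.0f, sxy = 0.0f;
        for (int i = 0; i < m; i++) {
            const float xi = A[x * m + i];
            const float yi = B[y * m + i];
            sx  += xi;       sy  += yi;
            sxx += xi * xi;  syy += yi * yi;  sxy += xi * yi;
        }
        /* expanded form of the definition:
           r = (m*sxy - sx*sy) / sqrt((m*sxx - sx*sx) * (m*syy - sy*sy))          */
        const float num = m * sxy - sx * sy;
        const float den = sqrt((m * sxx - sx * sx) * (m * syy - sy * sy));
        R[x * bs + y] = (den > 0.0f) ? num / den : 0.0f;
    }

In our experiments the block size would correspond to 1024 and m to 100; caching the attribute rows of A and B in local memory is a natural further optimization, subject to the portability concerns discussed later in this section.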

B. System Setting

We provide a summary of the experimental setting used in the performance evaluation. The heterogeneous platform used here consists of four different nodes: three of them are attached with GPU devices, and one node has only multi-core CPUs. Table II shows the hardware specification of these devices. Note that, for testing purposes, we did not use the CPU devices of a node if the node has a GPU. All nodes are connected through a Gigabit network. The software specification of all devices is shown in Table III.

TABLE II
HARDWARE SPECIFICATION

              Device A      Device B        Device C        Device D
    Processor Xeon          Nvidia GTX460   Nvidia GTS250   ATI HD5450
    Cores     8             336             128             80
    Clock     2.4 GHz       1.3 GHz         1.8 GHz         650 MHz
    Memory    16 GB DDR3    1 GB GDDR5      1 GB GDDR3      1 GB GDDR3

TABLE III
SOFTWARE SPECIFICATION

    Software        Version
    Linux kernel    2.6.4
    Global Arrays   GA 5.0.2
    OpenCL          Nvidia 4.0, AMD 2.6

C. Results and Discussion

To see the effect of the task pool dynamic load balancing, we compared the load on all devices with that of static load balancing, which simply assigns an equal number of subtasks to all devices. The input data set consists of 20480 objects, each of which has 100 attributes. The size of the output correlation matrix is 20480 × 20480. All elements are of floating point type. The output matrix is decomposed into blocks of 1024 × 1024 for parallel execution. Thus, there exist 400 subtasks in the task pool.
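As a quick check, the subtask count follows directly from this block decomposition:

    \frac{20480}{1024} = 20 \text{ blocks per dimension}, \qquad 20 \times 20 = 400 \text{ subtasks}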

Fig. 4. Device load (load ratio per OpenCL device) on static and dynamic load balancing

In the case of static load balancing, each device receives an equal share of 100 subtasks to compute. This static load balancing is easily implemented by the NGA_Distribution() operation in GA. We measured the load ratio between the number of subtasks processed by a device and the maximum throughput of the device (in subtasks/second, which we had measured off-line for each device). A load ratio approaching 1 means that the number of subtasks is close to the device capability. We repeated the experiment 10 times and report the average values. Figure 4 shows that with the task pool dynamic load balancing, the loads are closer to 1, though they are not perfectly balanced due to the latency of the Get/Put operations in GA on a non-dedicated network. For each task, the Get operation has to transfer 2 × 1024 × 100 × sizeof(float) (≈ 800 KB) of input blocks A and B from GA to the device, and the Put operation transfers 1024 × 1024 × sizeof(float) (≈ 4 MB) of each output block from the device to GA. Note that on device A, the load under static load balancing is near 30.

Next, we studied whether the block size affects the performance of the application. We used the same data set but varied the block size, and then measured the overall execution time. The result is illustrated in Figure 5. When the block size is too small, especially when it is smaller than 1024, the execution time increases sharply due to the overhead associated with each task, which includes the data communication in GA as well as transferring data to/from the device buffer. As the block size increases, the execution time decreases. However, when the block size becomes larger than 2048, there is not much gain, and if it is set too large, there is a possibility of load imbalance due to the completion of the last subtask on slow devices. The performance of the application can be simply improved by selecting a proper block size, but the exact size is very application specific.

Fig. 5. Parallel execution time (seconds) versus block size

For other performance improvements, developers need to understand the implementation details of the specific hardware, which may vary from vendor to vendor. Although OpenCL programs are meant to be portable, adding device-specific optimizations to the programs may cause undesirable results if they are executed on other devices. Fang et al. [5] discussed OpenCL's portability versus performance issues and summarized that OpenCL is sufficient for portability while still achieving good performance. Thus, developers must trade off between portability and the highest performance across devices.

V. CONCLUSIONS

In this paper, we have proposed an efficient load balancing scheme for hybrid OpenCL/Global Arrays applications on heterogeneous platforms. We employ OpenCL to implement parallel applications at fine-grain level so that they can execute across heterogeneous platforms which consist of various computing resources such as GPUs and multi-core CPUs. At coarse-grain level, we utilize Global Arrays as it provides a high-level programming interface with an asynchronous one-sided communication protocol, so that the performance of data communication between nodes can be improved and the developer's effort in writing a parallel program for distributed environments is reduced. To improve the application performance, we developed the task pool dynamic load balancing that distributes tasks to all computing resources, fully utilizing their capacity. In the future, we will consider overlapping computation with communication to serve applications with higher communication demand.

REFERENCES

[1] J. Nieplocha, R. J. Harrison and R. J. Littlefield, “Global Arrays: A Nonuniform Memory Access Programming Model for High-Performance Computers,” Journal of Supercomputing, pp. 169-189, 1997.
[2] J. Nieplocha and B. Carpenter, “ARMCI: A Portable Remote Memory Copy Library for Distributed Array Libraries and Compiler Run-Time Systems,” in Proceedings of the IPPS/SPDP'99 Workshops Held in Conjunction with the 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing, Springer, Heidelberg, 1999, pp. 533-546.
[3] L. Chen, O. Villa, S. Krishnamoorthy, and G. R. Gao, “Dynamic Load Balancing on Single- and Multi-GPU Systems,” in Proceedings of the 24th IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2010.
[4] E. Kijsipongse, S. U-ruekolan, C. Ngamphiw and S. Tongsima, “Efficient Large Pearson Correlation Matrix Computing using Hybrid MPI/CUDA,” in 8th International Joint Conference on Computer Science and Software Engineering (JCSSE), 2011, pp. 237-241.
[5] J. Fang, A. L. Varbanescu and H. Sips, “A Comprehensive Performance Comparison of CUDA and OpenCL,” in International Conference on Parallel Processing (ICPP), 2011, pp. 216-225.
[6] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2006, pp. 67-68.
[7] OpenCL - The open standard for parallel programming of heterogeneous systems. http://www.khronos.org/opencl, 2012.