On Load Balancing of Hybrid OpenCL/Global Arrays Applications on Heterogeneous Platforms

Ekasit Kijsipongse and Suriya U-ruekolan
Large-Scale Simulation Research Laboratory
National Electronics and Computer Technology Center (NECTEC)
112 Thailand Science Park, Pahon Yothin Rd., Klong 1, Klong Luang, Pathumthani 12120, Thailand
Tel: 662-564-6900 Ext. 2276, Fax: 662-564-6772
Email: ekasit.kijsipongse

Abstract—In recent years, high performance computing (HPC) resources have grown rapidly and diversely. The next generation of HPC platforms is assembled from resources of various types, such as multi-core CPUs and GPUs. Thus, the development of a parallel program that fully utilizes heterogeneously distributed resources in an HPC environment is a challenge. A parallel program should be portable and able to run efficiently on all types of computing resources with the least effort. We combine the advantages of Global Arrays and OpenCL for such parallel programs. We employ OpenCL to implement parallel applications at fine-grain level so that they can execute across heterogeneous platforms. At coarse-grain level, we utilize Global Arrays for efficient data communication between computing resources in terms of virtually shared memory. In addition, we propose a load balancing technique based on the task pool model for hybrid OpenCL/Global Arrays applications on heterogeneous platforms to improve the performance of the applications.

I. INTRODUCTION

As technology advances, computing resources gain benefits in many aspects: larger capacity, increased capability, as well as rapidity. Developers therefore regularly upgrade their computing resources, expand the existing ones, or even recruit new computing devices to serve more demand, especially in the high performance computing (HPC) area where speed is crucial. Graphics Processing Units (GPUs), for example, which were previously used in computer graphics, have recently become common components of the HPC resources at many organizations and institutes. This trend will sooner or later reach other organizations as well. The next generation of HPC platforms is then assembled from resources of various types such as multi-core CPUs and GPUs.

Thus, the development of a parallel program that fully utilizes heterogeneously distributed resources in an HPC environment is a challenge. A parallel program should be portable and able to run on all types of computing resources with the least effort. The OpenCL [7] framework is one promising key that developers use to unlock the computing power of those diverse resources. As HPC resources can consist of hardware of different types from various vendors, it is critical for developers to treat load balancing as another major concern so as to achieve higher resource utilization and lower execution time of the applications. Some solutions assign load to resources statically at compile time, while others distribute load dynamically during runtime. Due to the highly heterogeneous nature of future computing platforms, dynamic load balancing techniques are the viable choice. As a recent example, the dynamic load balancing for multi-GPU systems discussed in [3] is based on the task pool model, in which the host enqueues tasks into a queue and an idle GPU attached to the host dequeues a task for processing. Another example [4] used dynamic load balancing based on the task pool model to distribute tasks to GPU devices in a cluster computer.

In this paper, we add two extensions to our previous work [4] to achieve load balancing of hybrid OpenCL/Global Arrays applications on heterogeneous computing platforms. Firstly, dynamic load balancing based on the task pool model is used to improve the performance of Global Arrays (GA) applications at coarse-grain level. Although GA provides a high-level programming interface in terms of virtually shared memory, which reduces the developer's effort required to write parallel programs in distributed environments, it is left to the application to do load balancing. Our dynamic load balancing based on the task pool model should complement what is needed by typical GA applications. Secondly, we deploy OpenCL to implement the parallel applications at fine-grain level so that they can execute across heterogeneous platforms utilizing the full resource capacity. The rest of this paper is organized as follows. Section II provides details of the related technologies used in our work. Section III discusses the design and implementation of an efficient hybrid OpenCL/GA programming model with dynamic load balancing by the task pool model. Section IV presents the evaluation results of our work. Finally, we conclude this paper in Section V.
II. RELATED TECHNOLOGIES

Our study on the load balancing of hybrid OpenCL/GA applications on heterogeneous platforms involves two main technologies, as follows.

A. Global Arrays

Global Arrays (GA) [1] was designed to simplify the programming methodology on distributed memory systems. The most innovative idea of GA is that it provides an asynchronous, one-sided, shared-memory programming environment for distributed memory systems. GA includes the ARMCI (Aggregate Remote Memory Copy Interface) library, which provides one-sided communication capabilities for distributed array libraries and compiler runtime systems. ARMCI offers a simpler and lower-level model of one-sided communication than MPI-2 [2]. GA reduces the effort required to write parallel programs for clusters since they can assume a virtual shared memory. Part of the user's task is to explicitly define the physical data locality for the virtual shared memory and the appropriate data access patterns.
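To make the one-sided programming model concrete, the following minimal sketch of ours (not code from [1]) creates a distributed array and accesses an arbitrary patch of it with put/get operations; it assumes GA 5.x built on MPI, and the array size and patch indices are illustrative only.

    #include <mpi.h>
    #include "ga.h"
    #include "macdecls.h"

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        GA_Initialize();
        MA_init(C_DBL, 1000000, 1000000);      /* memory allocator used internally by GA */

        int dims[2]  = {1000, 1000};           /* a 1000 x 1000 array of doubles ...     */
        int chunk[2] = {-1, -1};               /* ... distributed across all processes   */
        int g_a = NGA_Create(C_DBL, 2, dims, "A", chunk);
        GA_Zero(g_a);

        /* one-sided access: any process may read or write any patch of the
           array with put/get, regardless of which node physically owns it   */
        double patch[10 * 10];
        int lo[2] = {0, 0}, hi[2] = {9, 9}, ld[1] = {10};
        for (int i = 0; i < 100; i++) patch[i] = (double)i;
        NGA_Put(g_a, lo, hi, patch, ld);
        GA_Sync();
        NGA_Get(g_a, lo, hi, patch, ld);

        GA_Destroy(g_a);
        GA_Terminate();
        MPI_Finalize();
        return 0;
    }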

B. OpenCL

The Open Computing Language (OpenCL) [7] has been adopted as a standard programming framework for parallel computing. It allows a parallel program to be portable across heterogeneous platforms from many vendors. In OpenCL, computing devices of various types, such as CPUs, GPUs, and others, are modeled as abstract devices, as illustrated in Figure 1. The compute-capable hardware is defined as an array of compute units, each of which operates independently of the others. Each compute unit is further divided into multiple processing elements (PE). The vendors map this abstract device to specific physical hardware in their implementations, as exemplified in Table I.

The part of a program that is to be executed on an OpenCL device is called a kernel. The memory space of an OpenCL device is separated into global memory, local memory, and private memory. Global memory is generally the largest-capacity memory space on a device. For most devices, global memory has lower throughput than the local and private memory. Local memory is small (KB) but faster than global memory, so it is often used as a user-managed cache. Private memory is the smallest but the fastest. The size of the private memory is not defined by OpenCL; when too much private memory is requested, it is up to the compiler to decide which data to keep in private memory and which to spill to slower memory.
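As an illustration of the abstract device model (a sketch of ours, not code from the paper), a host program can enumerate the devices it finds and query their compute units and memory sizes through the standard OpenCL API:

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platforms[8];
        cl_uint num_platforms = 0;
        clGetPlatformIDs(8, platforms, &num_platforms);

        for (cl_uint p = 0; p < num_platforms; p++) {
            cl_device_id devices[16];
            cl_uint num_devices = 0;
            clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 16, devices, &num_devices);

            for (cl_uint d = 0; d < num_devices; d++) {
                char name[128];
                cl_uint cus;          /* number of compute units            */
                cl_ulong gmem, lmem;  /* global and local memory, in bytes  */
                clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
                clGetDeviceInfo(devices[d], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cus), &cus, NULL);
                clGetDeviceInfo(devices[d], CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(gmem), &gmem, NULL);
                clGetDeviceInfo(devices[d], CL_DEVICE_LOCAL_MEM_SIZE, sizeof(lmem), &lmem, NULL);
                printf("%s: %u compute units, %lu MB global, %lu KB local\n",
                       name, cus, (unsigned long)(gmem >> 20), (unsigned long)(lmem >> 10));
            }
        }
        return 0;
    }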

Fig. 1. OpenCL abstract device

TABLE I
OPENCL DEVICE MAPPING

    OpenCL               ATI GPU               Nvidia GPU                  AMD CPU
    Compute Unit         SIMD Core             Streaming Multiprocessor    Processor Core
    Processing Element   Streaming Processor   Streaming Processor         -
    Global Memory        Global Memory         Device Memory               RAM
    Local Memory         Local Data Share      Shared Memory               RAM
    Private Memory       Register              Register                    Register or Stack

III. DESIGN AND IMPLEMENTATION

In this paper, we focus on the load balancing of embarrassingly parallel programs on heterogeneous platforms. A task is decomposed into a set of independent subtasks of equal size, all of which are stored in a task pool waiting for an idle compute device to fetch a subtask for execution, as illustrated in Figure 2. After a device has completed the subtask, it requests a new subtask from the task pool. With this task pool model, load is dynamically distributed to all devices according to the capability of each device. Consequently, faster devices get more subtasks to work on, thus resulting in load balance.

Fig. 2. Task pool dynamic load balancing for hybrid OpenCL/GA

In the implementation, the definition of task and subtask is application specific. However, in general, a task consists of the input data and the kernel that transforms the input data into the output data. For hybrid OpenCL/GA programs, the logic is written as an OpenCL kernel for portability, so that it can be executed on any OpenCL-capable device. We use GA operations for the task and task pool management. Transferring the input and output data between the task pool and the compute devices through the communication network is encapsulated in GA put/get operations. In addition, by virtue of GA, the location that stores the input and output data is transparent to the application. If the data are too large to fit in a single machine, they can be stored distributedly over multiple machines. The outline of the GA implementation of the task pool is given as follows.

    Initialize Global Arrays
    Initialize OpenCL device
    Create kernel
    Create device buffer
    GA_Sync()
    Do
        GA_Read_inc(subtask_id)
        if (subtask_id > num_subtasks) then break
        GA_Get(input[subtask_id])
        Transfer input data into device buffer
        Call OpenCL kernel
        Transfer output data from device buffer
        GA_Put(output[subtask_id])
    While true
    GA_Sync()
    Release OpenCL resources
    Terminate Global Arrays
The number of subtasks is defined by num_subtasks. The GA operation GA_Read_inc() atomically reads and increments the integer subtask_id by one, such that each subtask is exclusively processed by one device. GA_Get(input[subtask_id]) and GA_Put(output[subtask_id]) are GA operations that read the input data and write the output data to the memory space occupied by the current subtask, respectively. GA_Sync() is the barrier synchronization.
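As a minimal sketch of how this outline might look in C (our illustration, not the authors' code), assume that g_counter is a one-element integer global array used as the shared subtask counter, g_in and g_out are 1-D global arrays holding the packed input and output blocks, and run_subtask() is a hypothetical helper that wraps the OpenCL buffer transfers and kernel launch; GA and OpenCL are assumed to be initialized elsewhere.

    #include "ga.h"

    /* hypothetical helper (not shown): copies in_buf to the OpenCL device,
       launches the kernel, and copies the result back into out_buf          */
    extern void run_subtask(const float *in_buf, float *out_buf, int in_elems);

    void worker_loop(int g_counter, int g_in, int g_out, long num_subtasks,
                     int in_elems, int out_elems, float *in_buf, float *out_buf)
    {
        int zero[1] = {0}, ld[1] = {1};
        GA_Sync();                                /* all workers start together  */
        for (;;) {
            /* atomic fetch-and-increment of the shared counter: each subtask
               id is handed out exactly once across all processes              */
            long id = NGA_Read_inc(g_counter, zero, 1);
            if (id >= num_subtasks) break;

            int lo_in[1]  = { (int)id * in_elems };
            int hi_in[1]  = { (int)id * in_elems + in_elems - 1 };
            int lo_out[1] = { (int)id * out_elems };
            int hi_out[1] = { (int)id * out_elems + out_elems - 1 };

            NGA_Get(g_in, lo_in, hi_in, in_buf, ld);     /* pull the input block   */
            run_subtask(in_buf, out_buf, in_elems);      /* OpenCL transfer+kernel */
            NGA_Put(g_out, lo_out, hi_out, out_buf, ld); /* push the output block  */
        }
        GA_Sync();                                /* wait until the pool is drained */
    }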

IV. EVALUATION

For the performance evaluation, we selected the Pearson correlation coefficient as the testing application. This application has a high computation-to-communication ratio, so it is appropriate to be executed in distributed environments. We have carried out the experiments and the details are given as follows.

A. Pearson Correlation

The Pearson correlation coefficient [6] gives a measure of how similar two objects are. It can be used in data analysis, signal processing, pattern recognition, image processing, and bioinformatics. Let X and Y be objects that contain m attributes; the Pearson correlation coefficient, r_{X,Y}, between the two objects X and Y is defined mathematically as:

    r_{X,Y} = \frac{\sum_{i=1}^{m} (X_i - \bar{X})(Y_i - \bar{Y})}
                   {\sqrt{\sum_{i=1}^{m} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{m} (Y_i - \bar{Y})^2}}

where \bar{X} and \bar{Y} are defined as \bar{X} = \frac{1}{m}\sum_{i=1}^{m} X_i and \bar{Y} = \frac{1}{m}\sum_{i=1}^{m} Y_i, respectively. The value of r_{X,Y} ranges from -1 to 1. It is close to zero if the two objects are uncorrelated. When it is positive, X and Y are correlated; the higher the value, the stronger the correlation. If the value of r_{X,Y} is negative, then X and Y are negatively correlated.

The calculation of the pairwise Pearson correlation coefficients on a dataset, known as the correlation matrix, becomes computing intensive with the rapid growth of data in the digital era. Assume that the data set contains n objects, each of which has m attributes; the correlation between all pairs of objects can be expressed as the correlation matrix, in which each element is the Pearson correlation coefficient, r_{X,Y}, of a different object pair (X, Y). The calculation of the correlation matrix is highly parallelizable, as each r_{X,Y} can be computed independently. Figure 3 shows how the correlation matrix is partitioned into blocks for parallel computing. The output block (A, B) in the correlation matrix is calculated from input blocks A and B of the dataset. Note that the correlation matrix is always symmetric.

Fig. 3. Block decomposition of correlation matrix for parallel computing
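The per-block computation maps naturally onto an OpenCL kernel. The following sketch is ours (the paper does not list its kernel); it assumes each work item computes one element of a bs × bs output block from two input blocks A and B of bs objects with m attributes each, and it uses the algebraically expanded form of the formula above.

    /* each work item computes one element of a bs x bs output block R from
       input blocks A and B, each holding bs objects of m attributes (row-major) */
    __kernel void corr_block(__global const float *A,
                             __global const float *B,
                             __global float *R,
                             const int m)
    {
        const int x  = get_global_id(0);        /* object index within block A */
        const int y  = get_global_id(1);        /* object index within block B */
        const int bs = get_global_size(0);      /* block size                   */

        float sx = 0.0f, sy = 0.0f, sxx = 0.0f, syy = 0.0f, sxy = 0.0f;
        for (int i = 0; i < m; i++) {
            const float xi = A[x * m + i];
            const float yi = B[y * m + i];
            sx  += xi;       sy  += yi;
            sxx += xi * xi;  syy += yi * yi;  sxy += xi * yi;
        }
        /* expanded form of the definition:
           r = (m*sxy - sx*sy) / sqrt((m*sxx - sx*sx) * (m*syy - sy*sy))          */
        const float num = m * sxy - sx * sy;
        const float den = sqrt((m * sxx - sx * sx) * (m * syy - sy * sy));
        R[x * bs + y] = (den > 0.0f) ? num / den : 0.0f;
    }

In our experiments the block size would correspond to 1024 and m to 100; caching the attribute rows of A and B in local memory is a natural further optimization, subject to the portability concerns discussed later in this section.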

B. System Setting

We provide a summary of the experimental setting used in the performance evaluation. The heterogeneous platform used here consists of four different nodes: three of them are attached with GPU devices, and one node has only multi-core CPUs. Table II shows the hardware specification of these devices. Note that, for testing purposes, we did not use the CPU devices of a node if the node has a GPU. All nodes are connected through a Gigabit network. The software specification of all devices is shown in Table III.

TABLE II
HARDWARE SPECIFICATION

              Device A      Device B        Device C        Device D
    Processor Xeon          Nvidia GTX460   Nvidia GTS250   ATI HD5450
    Cores     8             336             128             80
    Clock     2.4 GHz       1.3 GHz         1.8 GHz         650 MHz
    Memory    16 GB DDR3    1 GB GDDR5      1 GB GDDR3      1 GB GDDR3

TABLE III
SOFTWARE SPECIFICATION

    Software        Version
    Linux kernel    2.6.4
    Global Arrays   GA 5.0.2
    OpenCL          Nvidia 4.0, AMD 2.6

C. Results and Discussion

To see the effect of the task pool dynamic load balancing, we compared the load on all devices with that of static load balancing, which simply assigns an equal number of subtasks to all devices. The input data set consists of 20480 objects, each of which has 100 attributes. The size of the output correlation matrix is 20480 × 20480. All elements are of floating point type. The output matrix is decomposed into blocks of 1024 × 1024 for parallel execution. Thus, there exist 400 subtasks in the task pool.
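As a quick check, the subtask count follows directly from this block decomposition:

    \frac{20480}{1024} = 20 \text{ blocks per dimension}, \qquad 20 \times 20 = 400 \text{ subtasks}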

Fig. 4. Device load (load ratio per OpenCL device) on static and dynamic load balancing

In the case of static load balancing, each device receives an equal share of 100 subtasks to compute. This static load balancing is easily implemented by the NGA_Distribution() operation in GA. We measured the load ratio between the number of subtasks processed by a device and the maximum throughput of the device (in subtasks/second, which we had measured off-line for each device). A load ratio approaching 1 means that the number of subtasks is close to the device capability. We repeated the experiment 10 times and report the average values. Figure 4 shows that with the task pool dynamic load balancing, the loads are closer to 1, though they are not perfectly balanced due to the latency of the Get/Put operations in GA on a non-dedicated network. For each task, the Get operation has to transfer 2 × 1024 × 100 × sizeof(float) (≈ 800 KB) of input blocks A and B from GA to the device, and the Put operation transfers 1024 × 1024 × sizeof(float) (≈ 4 MB) of each output block from the device to GA. Note that on device A, the load under static load balancing is near 30.

Next, we studied whether the block size affects the performance of the application. We used the same data set but varied the block size, and then measured the overall execution time. The result is illustrated in Figure 5. When the block size is too small, especially when it is smaller than 1024, the execution time increases sharply due to the overhead associated with each task, which includes the data communication in GA as well as transferring data to/from the device buffer. As the block size increases, the execution time decreases. However, when the block size becomes larger than 2048, there is not much gain, and if it is set too large, there is a possibility of load imbalance due to the completion of the last subtask on slow devices. The performance of the application can be simply improved by selecting a proper block size, but the exact size is very application specific.

Fig. 5. Parallel execution time (seconds) versus block size

For other performance improvements, developers need to understand the implementation details of the specific hardware, which may vary from vendor to vendor. Although OpenCL programs are meant to be portable, adding device-specific optimizations to the programs may cause undesirable results if they are executed on other devices. Fang et al. [5] discussed OpenCL's portability versus performance issues and summarized that OpenCL is sufficient for portability while still achieving good performance. Thus, developers must trade off between portability and the highest performance across devices.

V. CONCLUSIONS

In this paper, we have proposed an efficient load balancing scheme for hybrid OpenCL/Global Arrays applications on heterogeneous platforms. We employ OpenCL to implement parallel applications at fine-grain level so that they can execute across heterogeneous platforms which consist of various computing resources such as GPUs and multi-core CPUs. At coarse-grain level, we utilize Global Arrays as it provides a high-level programming interface with an asynchronous one-sided communication protocol, so that the performance of data communication between nodes can be improved and the developer's effort in writing a parallel program for distributed environments is reduced. To improve the application performance, we developed the task pool dynamic load balancing that distributes tasks to all computing resources, fully utilizing their capacity. In the future, we will consider overlapping computation with communication to serve applications with higher communication demand.

REFERENCES

[1] J. Nieplocha, R. J. Harrison and R. J. Littlefield, “Global Arrays: A Nonuniform Memory Access Programming Model for High-Performance Computers,” Journal of Supercomputing, pp. 169-189, 1997.
[2] J. Nieplocha and B. Carpenter, “ARMCI: A Portable Remote Memory Copy Library for Distributed Array Libraries and Compiler Run-Time Systems,” in Proceedings of the IPPS/SPDP'99 Workshops Held in Conjunction with the 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing, Springer, Heidelberg, 1999, pp. 533-546.
[3] L. Chen, O. Villa, S. Krishnamoorthy, and G. R. Gao, “Dynamic Load Balancing on Single- and Multi-GPU Systems,” in Proceedings of the 24th IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2010.
[4] E. Kijsipongse, S. U-ruekolan, C. Ngamphiw and S. Tongsima, “Efficient Large Pearson Correlation Matrix Computing using Hybrid MPI/CUDA,” in 8th International Joint Conference on Computer Science and Software Engineering (JCSSE), 2011, pp. 237-241.
[5] J. Fang, A. L. Varbanescu and H. Sips, “A Comprehensive Performance Comparison of CUDA and OpenCL,” in International Conference on Parallel Processing (ICPP), 2011, pp. 216-225.
[6] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2006, pp. 67-68.
[7] OpenCL - The open standard for parallel programming of heterogeneous systems. http://www.khronos.org/opencl, 2012.