ACCELERATED CODE GENERATOR FOR PROCESSING OCEAN COLOR REMOTE SENSING DATA ON GPU

Jae-Moo Heo1, Gangwon Jo2, Hee-Jeong Han1, Hyun Yang 1

1Korea Ocean Satellite Center, Korea Institute of Ocean Science and Technology, Pusan, South Korea 2Seoul National University and ManyCoreSoft Co., Ltd., Seoul, South Korea

ABSTRACT GOCI-II ground system in July 2017 and decided to process As the satellite data gradually increases, the processing time algorithms on the Graphics Processing Unit (GPU) in order of the algorithm for remote sensing also increases. It is now to process large amounts of data. essential to make efforts to improve the processing Nowadays, we can take advantage of computation devices performance in various accelerators such as Graphics of various architectures: Many Integrated Core (i.e., Intel Processing Unit (GPU). However, it is very difficult to use Xeon Phi), GPU, and Filed Programmable Gate Array programming models that make programs to run on (FPGA). These accelerators are constantly improving and accelerators. Especially for scientists, easy programming changing. And, parallel programming models that are and performance are often more important than providing compatible with these various accelerators are being many optimization functions and directives. To meet these developed at the same time. However, algorithm source requirements, we developed the accelerated code generator codes have been generally improved for several years or based on Open Computing Language (OpenCL) for several decades and it is not easy to develop a large amount processing ocean color remote sensing data. We easily of these source codes into the existing parallel programming applied our generator to atmospheric correction program for models. OpenCL [4] and Compute Unified Device GOCI data as a case study and the program’s performance Architecture (CUDA) [5] are extensions to existing achieved about 29.2x that is as good as the hand-written programming languages (i.e., /C++ and Fortran) and the OpenCL program. We look forward to many scientists that most popular parallel programming models among want to find a tool mentioned above taking advantage of our programmers. These models are the best option if generator. performance improvement is the most important. Conversely, these are difficult to use because it is a low- Index Terms— GOCI-II, GPU, Ocean color data pro- level programming model and we should also have good cessing, OpenCL, Source-to-source translator knowledge of hardware architecture. Open Multi-Processing (OpenMP) [6] and OpenACC [7] are directives provided 1. INTRODUCTION additional information to compiler for parallelization and have relatively recently been designed for a various Amount of ocean color satellite data is rapidly increased in computing devices and are currently undergoing vigorous recent years. This increase is due to the development of improvements. Both programming models are aimed at multiple satellite sensors and improvement of sensor portability and easy programming. However, these are performance. There are many studies using improved relatively lower performance improvement than the computation devices for high-resolution and multi-sensor OpenCL or CUDA [8] and we have to be familiar with the data processing [1] and it is now commonplace to improve various directives in order to achieve good performance. As the processing performance of the algorithm for remote a result, these models are still difficult for general scientists sensing through an accelerator such as GPU [2], [3]. Korea to use. Ocean Satellite Center (KOSC) has been operating We concentrated on producing a good performance as well Geostationary Ocean Color Imager (GOCI) since 2010 and as parallelizing scientific algorithms easily. In addition, is also developing ground system for GOCI-II launched in these generated programs must be available in future 2019. KOSC with many universities and research institutes accelerators. OpenCL is not only a portable programming is being developed 26 kinds of ocean, air, and land products. model, it also has a good performance. Therefore, we In addition, amount of data is increased by about 24 times decided to base on OpenCL 1.2 and support only C due to 1.5 times the number of observation bands, 4 times programming [9]. Finally, we developed the accelerated the spatial resolution, and full-disk observation, etc. code generator for processing ocean color remote sensing Therefore, we completed the critical design review of data and named it as OCAccel. It is implemented based on annotations for easier programming rather than complier sequentially. This helps the scientists to easily verify the directives or C/C++ extensions. OCAccel generate behavior of the program before parallelizing it. automatically parallel code from serial code. It internally 1 void KERNEL generateRrs(int i, int j, translates the given serial code into an OpenCL code. 2 INPUT float *arrayA, OUTPUT float *arrayB, Therefore, scientists do no need to use a complicated 3 float x, float y, float z) 4 { parallel programming model because it is automatically 5 arrayB[i] = fabs(sin(arrayA[i] * x) + done during compilation. In this paper, the atmospheric 6 cos(arrayB[i] * y) + z); correction algorithm for GOCI data was used for case study 7 } and the presented experiment has been based on NVIDIA 8 9 void main() { GTX Titan X Pascal GPU. Consequently, most of parallel 10 ... sections showed good performance and it is much easier to 11 paralleize: programming. 12 for (int i = 0; i < N; i++) { 13 for (int j = 0; j < M; j++) { 14 generateRrs(i, j, arrayA, arrayB, 2. FRAMEWORK 15 x, y, z); 16 } OCAccel has a source-to-source translator that 17 } 18 ... automatically offloads parallelizable loops (i.e., loops with 19 } no loop-carried dependences) in a C/C++ program to an accelerator (e.g., a GPU). OCAccel internally translates the Fig. 1. An example code of OCAccel given program into an OpenCL program, and then executes it on top of the vendor-provided OpenCL driver. OpenCL is an open standard for heterogeneous computing and many industry-leading hardware vendors including AMD, ARM, IBM, Intel, NVIDIA, and Xilinx supports OpenCL for their processors and accelerators. Thus, OCAccel also provides portability between different accelerators. Fig. 1 shows an example code that is given to OCAccel. A for-loop (or perfectly nested for-loops), that is annotated by a label whose name starts with parallelize, is offloaded to an accelerator by OCAccel. We call this a kernel loop. Multiple work-items (i.e., threads running on an accelerator) execute different loop iterations simultaneously [10]. All other parts of the program are executed on a CPU. The body of a kernel loop contains only a single function call. It may pass arrays, memory blocks allocated by malloc() or new, and scalar values as arguments to the function. The target function is defined with the keyword KERNEL. In addition, every pointer parameter passed from the kernel loop is classified into one of INPUT, OUTPUT, and INOUT (Line 1 Fig. 2. The compilation and execution flow of OCAccel of Fig. 1). They indicate that the given array or memory block is read-only, write-only, or read-write. A KERNEL We implement the source-to-source translator by modifying function should be written in , even if the other parts of 3.7.1 and writing a python script. OCAccel consists of the program are written in C++. It may call other KERNEL an extract program, a patch program, and a runtime library. functions, C mathematical functions (e.g., sin, cos, and so We split the extract program and the patch program to make on), and built-in functions defined by OCAccel. The it easier for users to understand the program translated by program translated by OCAccel calls only a small number OCAccel and to handle complex programs with multiple of designed runtime functions, and then internally calls source codes. Fig. 2 shows how a program is executed on a vendor-provided OpenCL libraries with the runtime library. heterogeneous system using OCAccel. The source-to-source All of the user annotations rid OCAccel of the need for loop translator of OCAccel replaces every kernel loop in a given dependence analysis and pointer analysis that are source code with an OpenCL host code that copies all impractical for real-world applications. Note that if we erase INPUT/INOUT arrays and memory blocks to the accelerator, four keywords (i.e., KERNEL, INPUT, OUTPUT, and INOUT) executes all iterations of the kernel loop on the accelerator, using the , the program becomes a valid and copies all OUTPUT/INOUT arrays and memory blocks C/C++ source code and can be executed solely on a CPU from the accelerator. In addition, the source-to-source translator aggregates all KERNEL functions into an OpenCL kernel code. Then, the program can be executed like a usual founded that the GPU-based method is more efficient than OpenCL program. the multi-core CPU in the atmospheric correction algorithm. 3. CASE STUDY AND EVALUATION

In ocean color remote sensing, atmospheric correction program must be necessarily performed before processing other ocean color algorithms. Although the atmospheric correction algorithms are not exactly identical due to the differences of satellite sensor performance or trajectory, the overall procedure is similar. For this reason, we evaluate OCAccel with the atmospheric correction algorithm for GOCI data that has a theoretical basis for standard atmospheric correction methods applied to NASA's polar orbiting satellites, Sea-Viewing Wide Field-of-view Sensor (SeaWiFS) and Moderater-Resolution Imaging Spectro- radiometer (MODIS) [11] and it is included in GOCI Data Processing System (GDPS) [12] 2.0 version. This algorithm program is divided into 4 parallel sections. Section 1 generates and store information corresponding to each slot of spectral bands (for GOCI only) and section 2 calculates the geometric angles of the sun and the satellite (depending on satellites). And then section 3 generates Rayleigh Fig. 3. A comparison of OpenMP, OCAccel, and OpenCL corrected reflectance. Finally, section 4 generates remote with respect to execution time sensing reflectance [13], [14]. Case study was on the system that has dual-socket Intel Table 1. Internal execution time of each section of the Xeon E5-2630 v3 CPUs (total 16-Cores) and an NVIDIA Titan X Pascal GPU. We compared the OCAccel program CPU host Memory copy GPU kernel Total with the serial program, the OpenMP program (multi-core CPU version), and the hand-written OpenCL program (GPU Section 1 0.42 0.61 0.68 1.71 version). In this experiment, we used GOCI L1B data Section 2 0.28 0.85 0.75 1.88 consisting of bands and each band has 5567 x 5685 pixels. Section 3 0.30 1.23 1.34 2.87 Each experiment took an average time of 10 runs and the Section 4 1.55 3.51 15.43 20.49 execution time of each run was almost the same within several tens of milliseconds. During the case study, we OCAccel program confirmed that there are some limitations for a good performance. ACAccel does not provide detailed Table 1 summarizes the internal execution time of each information about the memory copy to the GPU. Therefore, section of the OCAccel program. It measures the time inside if the same kernel is repeatedly executed continuously, of each for-loop that is annotated by a label parallelize. meaningless copy transfer may occur. This issue is also The time of CPU host is included before calling the present in OpenACC [15]. Nevertheless, it is difficult to “clEnqueueNDRangeKernel” OpenCL API function [16]. solve with an abstracted directive or an annotation base. We also used the “clGetEventProfilingInfo” function to Thus, we need to configure the kernel loop in the outermost measure the time that spent in the command queue and for-loop so that each kernel can be executed only once. defined it as the time of GPU kernel. The remaining time is Fig. 3 shows the execution time of each program. As you defined as the memory copy between CPU and GPU. This can see, the OCAccel program has almost the same time, of course, includes the time it takes to call OpenCL performance as the hand-written OpenCL program and it API functions. Because it takes 0.5 seconds to call the also showed much better performance than the multi-core OpenCL function even if no memory copy occurs, there is CPU version OpenMP program. We achieved the little overhead due to memory copy. The pure OCAccel’s total performance improvements of 29.2x and computational time within the GPU kernel is 212x, 202x, 2.3x compared to a serial program and OpenMP program 90x, and 58x faster for each section compared to the serial running on a multi-core CPU, respectively. For the OpenMP program, respectively. The reason for the relatively low program performs best with 32 threads. The performance performance of section 3 and 4 is that all of code branches was evenly measured 13-14x since the number of cores of are internally performed in the GPU kernel [17]. Therefore, the multi-core CPU is physically 16. At this point, we the smallest performance improvement is shown in section 4 since the amount of computation deviates greatly according to the branch statement and the computation amount are [8] S. Memeti, L. Li, S. Pllana, J. Kolodziej, and C. Kessler, “Benchmarking OpenCL, OpenACC, OpenMP, and CUDA: basically large. programming productivity, performance, and energy consumption”, in Proceedings of the 2017 Workshop on Adaptive Resource 4. CONCLUSION Management and Scheduling for Cloud Computing, pages 1–6, 2017.

We have presented the accelerated code generator for [9] OpenCL 1.2 Official Reference Webpage, [Online], Available” processing ocean color satellite data and call it as OCAccel. https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/ It is based on OpenCL programming model considering portability, scalability, easy programming, and performance. [10] J. E. Stone, . Gohara, and G. Shi, “OpenCL: A parallel programming standard for heterogeneous computing systems”, OCAccel internally translates a serial C/C++ program into a Computing in science & engineering, vol. 12, no. 3, pp. 66-73, 2010. program that can operate in accelerators through a few lines of simple annotations. We easily applied OCAccel to [11] J. H., Ahn, Y. J. Park, J. H. Ryu, and B. Lee, “Development of atmospheric correction program for GOCI data and the atmospheric correction algorithm for Geostationary Ocean Color experimental result showed the performance as good as the Imager (GOCI)”, Ocean Science Journal, vol. 47, no. 3, pp. 247-259, hand-written OpenCL program. In the future, we will apply 2012. it to the ocean color algorithms that have been developed [12] J.H. Ryu, H.J. Han, S. Cho, Y.J. Park, and Y.H. Ahn, “Overview of for GOCI-II ground system. Then we will develop the Geostationary Ocean Color Imager (GOCI) and GOCI Data common parts of remote sensing algorithms as OCAccel Processing System (GDPS)”, Ocean Science Journal, vol. 47, no. 3, runtime API functions and release it. pp. 223-233, 2012.

ACKNOWLEDGMENT [13] J. H. Ahn, Y. J. Park, W. Kim, and B. Lee, “Vicarious calibration of the geostationary ocean color imager”. Optics express, vol. 23, no. 18, This research was supported by the ‘Development of the pp. 23236-23258, 2015. integrated data processing system for GOCI-II’ funded by the Ministry of Ocean and Fisheries, Korea. This research [14] J. H. Ahn, Y. J. Park, W. Kim, and B. Lee, “Simple aerosol correction technique based on the spectral relationships of the aerosol multiple- was also supported by the ‘Operation in Korea Ocean scattering reflectances for atmospheric correction over the oceans”, Satellite Center’ funded by the Korea Institute of Ocean Optics express, vol. 24, no. 26, pp. 29659-29669, 2016 Science and Technology (KIOST). [15] R. Reyes, I. López, J. J. Fumero, and F. de Sande, “A preliminary 5. REFERENCES evaluation of OpenACC implementations”, The Journal of Supercomputing, vol. 65, no.3, pp. 1063-1075, 2013 [1] A. Plaza, Q. Du, Y. L. Chang, and R. L. King, ”High performance computing for hyperspectral remote sensing”, IEEE Journal of [16] A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Selected Topics in Applied Earth Observations and Remote Sensing, Spafford, V. Tipparaju, and J. S. Vetter, “The scalable heterogeneous vol. 4, no. 3, pp. 528-544, 2011. computing (SHOC) benchmark suite”. in Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing [2] J. M. Nascimento, J. M. Bioucas-Dias, J. M. R. Alves, V. Silva, and A. Units, 2010 ACM, pp. 63-74. Plaza, “Parallel hyperspectral unmixing on GPUs”, IEEE Geoscience and Remote Sensing Letters, vol. 11, no. 3, 666-670, 2014. [17] J. Lee, M. Samadi, Y. Park, and S. Mahlke, “Transparent CPU-GPU collaboration for data-parallel kernels on heterogeneous systems”, in [3] E. M. Sigurdsson, A. Plaza, and J. A. Benediktsson, “GPU Parallel Architectures and Compilation Techniques (PACT), 2013 implementation of iterative-constrained endmember extraction from 22nd International Conference on, 2013 IEEE, pp. 245-255. remotely sensed hyperspectral images”, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 8, no. 6, pp. 2939-2949, 2015.

[4] OpenCL Official Webpage, [Online]. Available: https://www.khronos.org/opencl.

[5] NVIDIA Dveloper Webpage, [Online], Available: https://developer.nvidia.com

[6] OpenMP Official Webpage, [Online]. Available: http://www.openmp.org.

[7] OpenACC Official Webpage, [Online], Available: https://www.openacc.org