Mapping Streaming Applications to OpenCL

Abhishek Ray*1
*Student Author
1Nanyang Technological University

ABSTRACT

Graphics processing units (GPUs) have been gaining popularity in general-purpose and high-performance computing. A GPU is made up of a number of streaming multiprocessors (SMs), each of which consists of many processing cores. A large number of general-purpose applications have been mapped onto GPUs efficiently. Stream processing applications, however, exhibit properties such as unfavorable data movement patterns and a low computation-to-communication ratio that can lead to poor performance on a GPU. We describe an automated mapping framework, developed earlier, that maps most stream processing applications efficiently onto NVIDIA GPUs by taking their architectural characteristics into account. We then discuss the implementation details of porting the mapping framework to OpenCL running on AMD GPUs and evaluate the performance of the framework by running several benchmarks. The performance of the generated CUDA and OpenCL code is compared based on different heuristics.

1. INTRODUCTION

A GPU consists of a number of multi-threaded processors called streaming multiprocessors (SMs). Each SM consists of execution cores that execute in SIMD mode. On each SM, threads are grouped into scheduling units called wavefronts and warps in AMD and NVIDIA terminology, respectively. Several warps/wavefronts can be active on each SM at the same time.

Streaming applications are an important domain of applications. They can be represented as graphs composed of nodes that communicate independently over data channels [1][6].

An automated mapping framework [2] was developed that maps stream processing applications efficiently onto GPUs by taking into account the architectural characteristics of NVIDIA GPUs. The mapping flow captures the parallelism expressed in the streaming language, models the parallel architecture of the GPU, and uses a novel execution model of heterogeneous threads, which is essentially a mix of compute and memory access threads.

1.1 Motivation

CUDA and OpenCL are two different frameworks for programming GPUs. CUDA is NVIDIA's proprietary technology and is specific to NVIDIA GPUs, whereas OpenCL is an open and free standard managed by the Khronos Group [4] that can be used to program different devices such as CPUs, GPUs, and DSPs. This portability is one of the prime motivations for porting the framework to OpenCL.

2. DESIGN

The mapping framework is divided into two parts.

StreamIt language – In StreamIt, the basic programmable unit is called a filter. A filter has a single input and a single output, and its body is essentially Java-like code. StreamIt programs have a hierarchical graph structure in which each filter is represented by a leaf node. The flow of a StreamIt program can be further structured by placing filters into composite blocks such as pipelines, splitters, and joiners.

Figure 1: Different hierarchical streams provided by StreamIt

Automated mapping onto GPUs – A filter executes in a sequence of steps. First, it reads data from memory. Second, it performs its computation on the data. Lastly, it writes the result back to memory to pass it to the next filter. Usually the ratio between computation and memory access is small. Therefore, if global memory is used to store a filter's input and output data, most of the time will be spent on memory accesses. It is thus beneficial to bring this data into shared memory, which threads can access much faster.
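To illustrate this staging pattern, the following is a minimal OpenCL kernel sketch, not the framework's generated code: each work-group copies its slice of the input from global memory into local (shared) memory, computes on the fast local copy, and writes the result back. The kernel name filter_stage and the scaling operation standing in for the filter body are hypothetical.

    __kernel void filter_stage(__global const float *in_stream,
                               __global float *out_stream,
                               __local float *tile)  /* shared memory */
    {
        size_t gid = get_global_id(0);
        size_t lid = get_local_id(0);

        /* 1. Read: stage this work-group's slice into local memory. */
        tile[lid] = in_stream[gid];
        barrier(CLK_LOCAL_MEM_FENCE);

        /* 2. Compute: operate on the local copy (a stand-in for the
              filter's real work function). */
        tile[lid] = tile[lid] * 2.0f + 1.0f;

        /* 3. Write: pass the result back to global memory for the next
              filter in the stream graph. */
        out_stream[gid] = tile[lid];
    }

The host sets the size of the tile argument with clSetKernelArg(kernel, 2, localBytes, NULL), since __local buffers are allocated per work-group at launch time.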
In the framework, two different types of threads are used: memory access (M) threads and compute (C) threads. M threads prefetch the data for the next stream execution, while C threads perform computation on the data fetched into shared memory by the M threads during the previous execution. As C threads are always ready for execution, they keep the SM busy.

The automated mapping framework transforms the code in the following ways:

• Memory transfer operations with large latency are clustered into dedicated threads.
• The data flow is transformed based on the parallelism exhibited by StreamIt.

Figure 2: Automated mapping flow of the framework

The mapping framework is implemented as an extension to the back-end of the StreamIt compiler, and it generates C code that can be compiled by the GPU compiler. The StreamIt compiler flattens the hierarchical stream graph and generates a schedule, which consists of the order and the number of executions of each operator. After StreamIt generates the schedule, the mapping framework takes over, since the generated schedule is sequential and meant for single-threaded execution.

The requirements of each operator in the schedule are analyzed and a buffer layout is produced. After this, various mapping parameters, such as the number of stream schedules to execute in parallel, the number of C threads required for the execution of each stream schedule, and the number of M threads accessing global memory, are determined according to the stream schedule structure and the specification of the target GPU.

Two components are built after this: a kernel loader, which runs on the CPU and performs all the initialization, and the GPU kernel code, which executes on the device and performs the mapping and computation.

3. MAPPING TO OPENCL

Porting an application from CUDA to OpenCL is straightforward, as both programming frameworks have similar syntax [3]. However, there are a few major differences. In CUDA, the host code and the device code are compiled together ahead of time, whereas in OpenCL the host code is compiled statically and the device code is compiled at run time. As a result, an additional header file is automatically generated in the OpenCL port, which stores all the macros required by both the host code and the device code; in CUDA, these macros were defined in the device code.

Another key difference is device initialization. Since OpenCL targets many different platforms, it has a more complicated initialization process than CUDA, as Table 1 shows.

Table 1: Syntax differences – device initialization

CUDA:

    Schedule<<<grid, threads, sharedSize>>>(in_stream, out_stream,
                                            thread_step, iterations);

OpenCL:

    status = clGetContextInfo(context, CL_CONTEXT_DEVICES, ..);
    commandQueue = clCreateCommandQueue(context, devices[0], ..);
    in_stream = clCreateBuffer(context,
        CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, ..);
    out_stream = clCreateBuffer(context,
        CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, ..);
    clBuildProgram(program, 1, devices, NULL, NULL, NULL);
    kernel = clCreateKernel(program, "Schedule", &status);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&in_stream);
    clEnqueueNDRangeKernel(commandQueue, kernel, ..);
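Table 1 elides most arguments. For reference, the following is a complete, minimal sketch of such a host-side initialization sequence; it is our own illustration rather than the framework's generated kernel loader, and kernelSrc, the buffer size n, and the work-group size of 64 are assumed values.

    #include <CL/cl.h>
    #include <stdio.h>

    /* Hypothetical device code; the framework would emit the real source. */
    static const char *kernelSrc =
        "__kernel void Schedule(__global uint *in, __global uint *out) { }";

    int main(void) {
        cl_int status;
        cl_platform_id platform;
        cl_device_id device;

        /* Discover a platform and a GPU device on it. */
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        /* Create a context and a command queue for the device. */
        cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &status);
        cl_command_queue queue = clCreateCommandQueue(context, device, 0, &status);

        /* Allocate the input and output stream buffers. */
        size_t n = 1024;
        cl_mem in_stream = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                          n * sizeof(cl_uint), NULL, &status);
        cl_mem out_stream = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                           n * sizeof(cl_uint), NULL, &status);

        /* Compile the device code at run time and fetch the kernel. */
        cl_program program = clCreateProgramWithSource(context, 1, &kernelSrc,
                                                       NULL, &status);
        clBuildProgram(program, 1, &device, NULL, NULL, NULL);
        cl_kernel kernel = clCreateKernel(program, "Schedule", &status);

        /* Bind the arguments and launch. */
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &in_stream);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &out_stream);
        size_t global = n, local = 64;
        status = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                        &global, &local, 0, NULL, NULL);
        clFinish(queue);

        printf("launch status: %d\n", status);
        return 0;
    }

In a real loader every returned status would be checked; the checks are omitted here for brevity.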
Table 2 shows the corresponding differences in the kernel code itself.

Table 2: Syntax differences – kernel code

CUDA:

    __global__ void Schedule(unsigned int *in_stream,
                             unsigned int *out_stream)
    {
        extern __shared__ volatile unsigned int shared_mem[];
        int iterations_start = iterations_total * blockIdx.x / gridDim.x;
        int iterations_stop = iterations_total * (blockIdx.x + 1) / gridDim.x;
        if ((threadIdx.x < WORKERS) && (threadIdx.x % 32 == 0))
            sync_work[threadIdx.x / 32] = 0;
        __syncthreads();
    }

OpenCL:

    __kernel void Schedule(__global unsigned int *in_stream,
                           __global unsigned int *out_stream,
                           const int thread_step, const int iterations_total,
                           __local volatile unsigned int *shared_mem)
    {
        int iterations_start = iterations_total * get_group_id(0)
                               / get_num_groups(0);
        int iterations_stop = iterations_total * (get_group_id(0) + 1)
                              / get_num_groups(0);
        if ((get_local_id(0) < WORKERS) && (get_local_id(0) % 32 == 0))
            sync_work[get_local_id(0) / 32] = 0;
        barrier(CLK_LOCAL_MEM_FENCE);
    }

4. EXPERIMENTS

After mapping the streaming applications to OpenCL, different benchmarks were run to check the correctness of the generated code and to compare the results of the OpenCL code with those of the CUDA code. The devices compared were an AMD Radeon HD 6970 running OpenCL and an NVIDIA Tesla C2050 running CUDA.

Table 3: Architectural differences between the devices compared

                            AMD Radeon HD 6970    NVIDIA Tesla C2050
    #Compute Units          24                    14
    #Cores                  -                     448
    #Processing Elements    1536                  -
    #Core Clock (MHz)       800                   1500
    #Max BW (GB/s)          176                   144
    #Max GFLOPs             2703                  1288

Parameters used for the graphs:

• S, the number of C threads per execution.
• F, the number of M threads that transfer data between global and SM memory.
• X-axis: the number of parallel stream executions in each SM.
• Performance is measured in FLOPS [5], calculated as the number of floating-point operations per second.
• Performance is compared between OpenCL and CUDA by taking into account their respective FLOPS using the same parameters; it is calculated as FLOPS_OpenCL / FLOPS_CUDA * 100.

4.1 PERFORMANCE COMPARISON BETWEEN GPU & CPU

The performance of the GPU (Radeon HD 6970) was compared with that of the CPU (Intel Xeon E5540 @ 2.53 GHz). As expected, the GPU easily outperformed the CPU.

Figure 3: Performance comparison – GPU vs CPU (speedup)

4.2 PERFORMANCE COMPARISON BETWEEN OPENCL & CUDA

Performance between OpenCL and CUDA was compared by taking into account the FLOPS (floating-point operations per second) achieved by each device. Several factors affect this comparison for our application: memory bandwidth, floating-point performance, and architecture-related differences.

Figure 4: Performance comparison (%) – OpenCL vs CUDA (S=1, F=64)

Table 4: Comparison between the two devices based on different heuristics for the FFT benchmark

                AMD Radeon HD 6970    NVIDIA Tesla C2050
    #BW         0.0105
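As a concrete illustration of the comparison metric above, the following small C sketch computes FLOPS and the OpenCL-to-CUDA percentage; the operation count and timings are made-up values, not measured results from the benchmarks.

    #include <stdio.h>

    /* FLOPS = floating-point operations / elapsed seconds. */
    static double flops(double float_ops, double seconds) {
        return float_ops / seconds;
    }

    int main(void) {
        /* Hypothetical numbers for one benchmark run. */
        double ops = 4.2e9;       /* floating-point ops in the run */
        double t_opencl = 0.050;  /* seconds on the OpenCL device  */
        double t_cuda = 0.040;    /* seconds on the CUDA device    */

        double f_opencl = flops(ops, t_opencl);
        double f_cuda = flops(ops, t_cuda);

        /* Relative performance: FLOPS_OpenCL / FLOPS_CUDA * 100. */
        double comparison = f_opencl / f_cuda * 100.0;

        printf("OpenCL: %.1f GFLOPS, CUDA: %.1f GFLOPS, ratio: %.1f%%\n",
               f_opencl / 1e9, f_cuda / 1e9, comparison);
        return 0;
    }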