A Middleware for Efficient Stream Processing in CUDA

Shinta Nakagawa · Fumihiko Ino · Kenichi Hagihara

Received: date / Accepted: date

Abstract This paper presents a middleware capable of out-of-order execution of kernels and data transfers for efficient stream processing in the compute unified device architecture (CUDA). Our middleware runs on the CUDA-compatible graphics processing unit (GPU). Using the middleware, application developers can easily overlap kernel computation with data transfer between the main memory and the video memory. To maximize the efficiency of this overlap, our middleware performs out-of-order execution of commands such as kernel invocations and data transfers. This run-time capability can be used by simply replacing the original CUDA API calls with our API calls. We have applied the middleware to a practical application to understand the run-time overhead in performance. It reduces execution time by 19% and allows us to process large data that cannot be entirely stored in the video memory.

Keywords Stream processing · Overlap · CUDA · GPU

S. Nakagawa · F. Ino · K. Hagihara
Graduate School of Information Science and Technology,
Osaka University,
1-5 Yamadaoka, Suita, Osaka 565-0871, Japan
Tel.: +81-6-6879-4353
Fax: +81-6-6879-4354
E-mail: {s-nakagw, [email protected]}

1 Introduction

Stream processing [4] has become increasingly important with the emergence of stream applications such as audio/video processing [9] and time-varying visualization [5]. In the stream programming model, applications are decomposed into stages of sequential computations, which are expressed as kernels. Input/output data, on the other hand, is organized as streams, namely sequences of similar data records. Input streams are passed through the chain of kernels in a pipelined fashion, producing output streams. One advantage of stream processing is that it can exploit the parallelism inherent in the pipeline. For example, the execution of the stages can be overlapped with each other to exploit task parallelism. Furthermore, different stream elements can be processed simultaneously to exploit data parallelism.

One stream architecture that benefits from the advantages mentioned above is the graphics processing unit (GPU), originally designed for the acceleration of graphics applications. However, it can now accelerate many scientific applications, typically with a 10-fold speedup over CPU implementations [2]. Most of these applications are implemented using a flexible development framework called compute unified device architecture (CUDA) [1]. This vendor-specific framework allows developers to directly manage the memory hierarchy, which is an advantage over graphics APIs such as OpenGL. Using this C-like language, developers can implement their GPU programs without knowledge of computer graphics.

In general, CUDA-based applications consist of host code and device code, running on the CPU and the GPU, respectively. The host code invokes the device code that implements the kernels of stream applications. Thus, the GPU is treated as a coprocessor of the CPU. Input/output streams, on the other hand, must be stored in video memory, called device memory. Since device memory is separated from main memory, namely host memory, streams have to be transferred between them. Data transfer from host memory to device memory is called "download," and transfer in the opposite direction is called "readback."
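As a concrete illustration of these terms, the following minimal host program sketches the download/compute/readback pattern. It is a hypothetical example, not taken from the paper: the kernel doubleElements and the buffer size are placeholders, while the API calls themselves are standard CUDA.

#include <cuda_runtime.h>
#include <cstdio>

// Device code: a trivial placeholder kernel that doubles every record.
__global__ void doubleElements(float *d, int len)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len) d[i] *= 2.0f;
}

int main(void)
{
    const int len = 1 << 20;                // records per stream element (illustrative)
    const size_t bytes = len * sizeof(float);

    float *h = (float *)malloc(bytes);      // host memory (main memory)
    for (int i = 0; i < len; i++) h[i] = (float)i;

    float *d;
    cudaMalloc((void **)&d, bytes);         // device memory (video memory)

    // Download: host memory -> device memory.
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

    // The host code invokes the device code (kernel).
    doubleElements<<<(len + 255) / 256, 256>>>(d, len);

    // Readback: device memory -> host memory.
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);

    printf("h[3] = %f\n", h[3]);            // expect 6.0

    cudaFree(d);
    free(h);
    return 0;
}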
Data transfer can be overlapped with kernel computation using the asynchronous CUDA API, whose calls return immediately, before the requested commands complete.

Notice here that the term "stream" in CUDA has a definition different from that mentioned above. To avoid confusion, we denote the former as "CUDA stream" in this paper. A CUDA stream is a stream object, which denotes a sequence of commands that execute in order [1]. Such commands include data downloads, kernel executions, and data readbacks. Using multiple CUDA streams, we can implement stream applications on the GPU. For example, a typical implementation will associate each stream element with a CUDA stream, overlapping kernel computation for one CUDA stream with data transfer for another CUDA stream.

One technical problem in CUDA is that data transfer commands from CUDA streams are processed in an in-order sequence. Similarly, kernels are also executed in an in-order sequence. These constraints can result in non-overlapping execution even though there is no data dependence between different stream elements. For example, a readback command from one CUDA stream can prevent a download command of another CUDA stream from being overlapped with kernel computation. Therefore, developers have to call asynchronous API functions in the appropriate order to realize overlapping execution. Finding the appropriate order is a time-consuming task because it depends on run-time situations such as API call timings and application code.
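To make the constraint concrete, the sketch below (a minimal, hypothetical example, not the paper's code) issues the commands of two stream elements element by element through the asynchronous API. On hardware that processes transfer commands in a single in-order queue, the readback of stream 0 is queued ahead of the download of stream 1, so that download cannot overlap with kernel execution even though the two elements are independent. Issuing the very same calls grouped by type (all downloads first) would restore the overlap.

#include <cuda_runtime.h>

// Placeholder kernel standing in for the stream application's device code.
__global__ void process(float *d, int len)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len) d[i] += 1.0f;
}

int main(void)
{
    const int l = 2;              // number of CUDA streams (one per in-flight element)
    const int len = 1 << 20;      // records per stream element (illustrative)
    const size_t bytes = len * sizeof(float);

    cudaStream_t str[l];
    float *h[l], *d[l];
    for (int s = 0; s < l; s++) {
        cudaStreamCreate(&str[s]);
        cudaMallocHost((void **)&h[s], bytes);  // page-locked memory, required for async copies
        cudaMalloc((void **)&d[s], bytes);
        for (int j = 0; j < len; j++) h[s][j] = 0.0f;
    }

    // Per-element issue order: the readback of str[0] enters the in-order
    // transfer queue before the download of str[1], blocking that download
    // behind a kernel it does not depend on.
    for (int s = 0; s < l; s++) {
        cudaMemcpyAsync(d[s], h[s], bytes, cudaMemcpyHostToDevice, str[s]);
        process<<<(len + 255) / 256, 256, 0, str[s]>>>(d[s], len);
        cudaMemcpyAsync(h[s], d[s], bytes, cudaMemcpyDeviceToHost, str[s]);
    }
    cudaDeviceSynchronize();      // wait for all streams to finish

    for (int s = 0; s < l; s++) {
        cudaStreamDestroy(str[s]);
        cudaFreeHost(h[s]);
        cudaFree(d[s]);
    }
    return 0;
}

Section 3 presents the two naive issue orders as pseudocode; the middleware proposed below removes the burden of finding a good order by hand.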
In this paper, we propose a middleware capable of out-of-order execution of kernels and data transfers, aiming at efficiently overlapped stream processing in CUDA. The proposed middleware is designed to reduce the development effort needed for GPU-accelerated stream processing. It automatically changes the execution order to maximize performance by overlapping data transfer with kernel execution. This run-time optimization can easily be enabled by replacing the original CUDA API calls with our middleware API calls in host code. Our middleware currently assumes that (1) the application can be adapted to the stream programming model and (2) the kernel execution time and the data transfer time do not vary significantly between different stream elements. The middleware is implemented as a class in the C++ language.

The rest of the paper is organized as follows. Section 2 introduces related work. Section 3 presents preliminaries needed to understand our middleware. Section 4 then describes the details of the middleware. Section 5 shows experimental results obtained using a practical application. Finally, Section 6 concludes the paper with future work.

2 Related Work

To the best of our knowledge, there is no prior work that tries to automate the overlap for CUDA-enabled stream applications. However, some development frameworks focus on GPU-accelerated stream processing. For example, Yamagiwa et al. [8] propose a stream-based distributed computing environment. Their environment is implemented using the DirectX graphics library. Our middleware differs from their environment in that it focuses on the CUDA framework and optimizes the interaction between the GPU and the CPU.

Hou et al. [3] present bulk-synchronous GPU programming (BSGP), a programming language for stream processing on the CUDA-compatible GPU. Using their compiler, application developers can automatically translate sequential C programs into stream kernels with host code. Thus, the main focus of this language is to free application developers from the tedious chore of temporary stream management. Our middleware has almost the same purpose as their language but provides an out-of-order execution capability to maximize the effects of stream processing.

Some practical applications [6,7] are implemented using the stream programming model and achieve the overlap. However, their purpose is to accelerate specific applications. In contrast, our middleware provides a general framework to application developers.

3 Preliminaries

This section explains how stream processing can be done in CUDA.

3.1 Stream Programming Model

We consider here a simple stream model that produces an output stream by applying a chain of commands, f1, f2, ..., fm, to an input stream. Parameter m here represents the number of commands that compose the chain. Let I and O be the input stream and the output stream, respectively. The input stream can then be written as I = {e1, e2, ..., en}, where ei (1 ≤ i ≤ n) represents an element of the input stream and n represents the number of elements in the stream. Similarly, we have the output stream O = {g1, g2, ..., gn}, where gi (1 ≤ i ≤ n) represents an element of the output stream. The computation can then be expressed as

gi = fm ∘ fm-1 ∘ ··· ∘ f1(ei),    (1)

where 1 ≤ i ≤ n and ∘ represents the composition operator. In other words, the computation consists of n independent tasks, each applying the chain of commands to a single stream element.

[Figure: timeline views (a) and (b), each showing download, kernel execution, and readback activity for CUDA streams 1 and 2 over time.]
Fig. 1 Timeline view of naive methods running on a single GPU. (a) Overlapping execution continuously runs kernels, but (b) non-overlapping execution causes waiting time due to data transfer commands. In both cases, four tasks are processed using two CUDA streams (n = 4 and l = 2).

The naive methods compared in Fig. 1 can be written in pseudocode as follows.

(a)
Input: Input stream I = {e1, e2, ..., en}, size n, and number l of CUDA streams.
Output: Output stream O = {g1, g2, ..., gn}.
1: cudaStream_t str[l];
2: for i = 1 to n do
3:   Download stream element ei using str[i mod l];
4: end
5: for i = 1 to n do
6:   Launch kernel to compute gi from ei using str[i mod l];
7: end
8: for i = 1 to n do
9:   Readback output stream element gi using str[i mod l];
10: end
11: cudaThreadSynchronize();

(b)
Input: Input stream I = {e1, e2, ..., en}, size n, and number l of CUDA streams.
Output: Output stream O = {g1, g2, ..., gn}.
1: cudaStream_t str[l];
2: for i = 1 to n do
3:   Download stream element ei using str[i mod l];
4:   Launch kernel to compute gi from ei using str[i mod l];
5:   Readback output stream element gi using str[i mod l];
6: end
7: cudaThreadSynchronize();
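A possible CUDA rendering of listing (a) is sketched below. The listing itself is pseudocode, so everything concrete here is an assumption: the kernel computeElement, the buffer sizes, and the choice of one host and one device buffer per element (which keeps a later download from overwriting data still in use). Commands are issued grouped by type, and element i is assigned to CUDA stream i mod l (0-based here, where the listing counts from 1).

#include <cuda_runtime.h>
#include <cstdio>

// Placeholder kernel: computes output element gi from input element ei in place.
__global__ void computeElement(float *d, int len)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len) d[i] = d[i] * 2.0f + 1.0f;
}

int main(void)
{
    const int n = 4;                // number of stream elements (as in Fig. 1)
    const int l = 2;                // number of CUDA streams (as in Fig. 1)
    const int len = 1 << 20;        // records per element (illustrative)
    const size_t bytes = len * sizeof(float);

    cudaStream_t str[l];
    for (int s = 0; s < l; s++) cudaStreamCreate(&str[s]);

    // One page-locked host buffer and one device buffer per element.
    float *h[n], *d[n];
    for (int i = 0; i < n; i++) {
        cudaMallocHost((void **)&h[i], bytes);
        cudaMalloc((void **)&d[i], bytes);
        for (int j = 0; j < len; j++) h[i][j] = (float)i;
    }

    for (int i = 0; i < n; i++)     // downloads
        cudaMemcpyAsync(d[i], h[i], bytes, cudaMemcpyHostToDevice, str[i % l]);
    for (int i = 0; i < n; i++)     // kernel executions
        computeElement<<<(len + 255) / 256, 256, 0, str[i % l]>>>(d[i], len);
    for (int i = 0; i < n; i++)     // readbacks
        cudaMemcpyAsync(h[i], d[i], bytes, cudaMemcpyDeviceToHost, str[i % l]);

    // The listing calls cudaThreadSynchronize(); cudaDeviceSynchronize() is
    // its current name.
    cudaDeviceSynchronize();

    printf("g2[0] = %f\n", h[2][0]);  // expect 5.0

    for (int i = 0; i < n; i++) { cudaFreeHost(h[i]); cudaFree(d[i]); }
    for (int s = 0; s < l; s++) cudaStreamDestroy(str[s]);
    return 0;
}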