A Parallel Algorithm Development Model for the GPU Architecture
J. Steven Kirtzic, Ovidiu Daescu
Department of Computer Science, University of Texas at Dallas
Richardson, TX USA
{jsk061000, daescu}@utdallas.edu

Abstract— Parallel computing has been in use for decades, and throughout that time many researchers have sought to define a model for algorithm design on such platforms. Valiant developed a model for parallel computing, which was later extended to include multi-core processors, but it still may not be best suited to the unique GPU architecture. With the current advances in high performance computing, it is easy to see the role that GPUs can play, and even easier to see the need for a model for GPU algorithm development. Here we propose a parallel GPU model which offers both a general design and a fine-grained approach, intended to accommodate nearly any GPU architecture. We show how our model can result in significant increases in performance when algorithms are designed based on its principles.

Keywords: GPU, parallel processing, algorithm design

1. Introduction

The rapid advancement of the Graphics Processing Unit, or GPU, over the last few years has opened up a new world of possibilities for high-speed computation, ranging from biomedical to computer vision applications. Recent examples include [1], [2], and [3]. However, the GPU architecture is unlike any other, and designing algorithms that fully harness the capabilities of a GPU is not an easy task, especially when one considers the advantages and disadvantages of the various resources that a GPU has available to it. In this paper we introduce a parallel algorithm design model for the GPU architecture which addresses these issues. In Section 2 we discuss related work; in Section 3 we present a brief overview of the GPU architecture, focusing on NVIDIA's CUDA architecture; in Section 4 we present the model in its entirety; in Section 5 we illustrate the use of our model as we apply it to template and shape matching algorithms; in Section 6 we discuss the results of our model as applied to these template and shape matching algorithms; and finally in Section 7 we conclude and remark on future work.

1.1 Contribution

We believe that our main contribution with this work is to provide an easily accessible parallel algorithm design model for the GPU architecture. Our model addresses the limitations of other parallel models in that it accounts for the unique architecture of the GPU, in particular the various types of memory that the GPU possesses and their individual attributes. Our model is also designed to include single or multi-core CPUs as part of the system, if the designer chooses to do so. Finally, our model is intended to be easily accessible to a wide variety of researchers from all scientific fields interested in GPU algorithm design, ranging from the novice to the experienced.

2. Related work

We present the following parallel model designs in succession to demonstrate the evolution of our Parallel GPU Model (PGM) and give it proper context.

2.1 The PRAM model

The PRAM model is generally regarded as one of the original parallel algorithm design models. The main shortcoming of the PRAM model lies in its unrealistic assumptions of zero communication overhead and instruction-level synchronization. Another drawback of the PRAM model is that the time complexity of a PRAM algorithm is often expressed in big-O notation, which is often misleading because the machine size n is usually small in existing parallel computers. Consequently, the PRAM model is generally not used as a machine model for real-life parallel computers.

2.2 The BSP model

The BSP, or bulk-synchronous parallel, model was proposed by Leslie Valiant [4] to overcome the limitations of the PRAM model [5] while maintaining its simplicity. In the BSP model, a BSP computer consists of a set of n processor/memory pairs (nodes) that are interconnected by a communication network. The BSP model is Multiple Instruction Multiple Data (MIMD) in nature, and uses the concept of a superstep, which is comprised of a computation step, a communication step, and a synchronization step. The BSP model is also variable grained, loosely synchronous, has non-zero overhead, and uses message passing or shared variables for communication.

The program executes as a strict sequence of supersteps. In each superstep, a process executes its computation operations in at most w cycles, a communication operation that takes gh cycles, and a barrier synchronization that takes l cycles. Note that in the communication overhead gh, g is the proportional coefficient for realizing an h relation. The value of g is platform-dependent, but independent of the communication pattern; in other words, gh is the time that it takes to execute the most time-consuming h relation. Within a superstep, each computation operation uses only data in its local memory. This data is put into the local memory either at program start-up time or by the communication operations of previous supersteps. Therefore, the communication operations of a process are independent of those of other processes.

The BSP model is more realistic than the PRAM model because it accounts for all overheads except the parallelism overhead for process management. The time for a superstep is estimated by the sum

    w + gh + l    (1)

This model is highly regarded and has formed the basis for other parallel models, such as the parallel phase model [5], which we will briefly discuss next. However, its generality is its shortcoming when one attempts to apply it to more specific architectures, such as that of the GPU. Valiant recently extended his model to include multi-core CPUs [6]. While this model is much more akin to the architectural nature of the GPU, it still does not take into consideration the complexities of the typical GPU architecture, in particular the various types of memory, which as we will demonstrate in later sections have a tremendous impact on the performance of a given GPU algorithm.

2.3 The parallel phase model

Kai Hwang and Zhiwei Xu [7] proposed a phase parallel model for parallel computation that further refines the above two abstract models. This model is similar to the BSP model, with the following distinction: a parallel program is executed as a sequence of phases: the parallelism phase, the computation phase, and the interaction phase. The total execution time of a superstep on n processors is expressed by

    Tn = Tcomp + Tinteract + Tpar
       = (w + σ√(2 log n)) · tf + t0(n) + α · w · tc(n) + tp(n)    (2)

where w is the number of cycles, as with the BSP model, α is the communication-to-computation ratio (CCR) of each superstep, and tf is the average time to execute a flop on a processor. Improved from the PRAM and BSP models, the phase parallel model is closer to covering real machine/program behavior, as it accounts for the interaction overhead (the t0 and tc terms) and the parallelism overhead (the tp term).

While these models represent the evolution of parallel algorithm design in general terms, they are limited in scope, as they ultimately fall short when applied to the unique architecture of the modern GPU. The need for a model suited to this architecture was vocalized in a paper from MIT [8], in which the authors identify that official documentation for CUDA from NVIDIA was rather sparse, the forums required a lot of searching to find an answer to a particular problem, and the trade-offs between various programming options were difficult to discern. We attempt to address these issues by providing a model which was designed not only to encompass the more general models identified above, but also to take into consideration the unique nature of the GPU architecture, as it differs considerably from the CPU architecture.

3. GPU architecture

In this paper we will often refer to the machine containing the GPU as the "host" and the GPU itself as the "device". The NVIDIA GeForce 8800 series is an example of a typical GPGPU (General Purpose GPU) device, which utilizes NVIDIA's CUDA (Compute Unified Device Architecture) GPU design. The GeForce 8800 contains 16 multiprocessors, each containing 8 semi-independent cores, for a total of 128 processing units. Each of the 128 processors can run as many as 96 threads concurrently, for a maximum of 12,288 threads executing in parallel. The computing model is SIMD (Single Instruction Multiple Data), and the memory model is NUMA (Non-Uniform Memory Access) with a semi-shared address space. This stands in contrast to a modern CPU, which is typically either SISD (Single Instruction Single Data) or, in the case of a multi-processor or multi-core machine, MIMD. Additionally, from the perspective of the programmer, all memory on a desktop machine is either explicitly shared (in multi-threading environments) or explicitly separate (in multi-processing environments).

3.1 GPU instruction throughput versus memory access

The GPU architecture is much more optimized for performing calculations than for memory accesses. Considering the multiple types of memory that the GPU architecture typically includes, it is important to keep this in mind when accessing these types of memory, particularly the slower, off-chip ones such as the GPU's global memory and the host's main memory. The most costly memory access is by far the host-to-device (CPU to GPU) data transfer, and reducing that transfer can have a tremendous impact on the overall performance of any algorithm that is implemented in part or fully on a GPU.

As an example of our research, we present the case of a
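The BSP superstep cost of Eq. (1) can be illustrated with a small cost estimator. This is only a sketch; the parameter values below are hypothetical and chosen purely for illustration, not drawn from any particular machine:

```python
# Sketch of the BSP superstep cost model from Eq. (1): w + g*h + l.
# All parameter values below are hypothetical, for illustration only.

def superstep_time(w, g, h, l):
    """Estimated cycles for one BSP superstep.

    w -- computation cycles
    g -- platform-dependent coefficient for realizing an h relation
    h -- size of the largest h relation in the superstep
    l -- barrier synchronization cost in cycles
    """
    return w + g * h + l

def program_time(supersteps):
    """A BSP program executes as a strict sequence of supersteps."""
    return sum(superstep_time(w, g, h, l) for (w, g, h, l) in supersteps)

# Example: three supersteps on a machine with g = 4 and l = 50.
steps = [(1000, 4, 16, 50), (500, 4, 32, 50), (2000, 4, 8, 50)]
print(program_time(steps))  # 3874
```

Note that g and l stay fixed across supersteps (they are properties of the platform), while w and h vary with the communication pattern of each superstep.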
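Eq. (2) of the phase parallel model can likewise be sketched as code. The overhead functions t0, tc, and tp are platform-specific; the linear forms passed in below are hypothetical placeholders, and the natural logarithm is assumed for the log n term:

```python
import math

# Sketch of the phase parallel model's superstep time, Eq. (2):
#   Tn = (w + sigma * sqrt(2 log n)) * tf + t0(n) + alpha * w * tc(n) + tp(n)

def phase_parallel_time(n, w, sigma, alpha, tf, t0, tc, tp):
    """Estimated execution time of one superstep on n processors.

    n     -- number of processors
    w     -- computation cycles, as in the BSP model
    sigma -- spread of the computation load across processors
    alpha -- communication-to-computation ratio (CCR) of the superstep
    tf    -- average time to execute one flop
    t0, tc, tp -- overhead functions of n (interaction startup,
                  per-unit interaction cost, parallelism overhead)
    """
    t_comp = (w + sigma * math.sqrt(2 * math.log(n))) * tf
    t_interact = t0(n) + alpha * w * tc(n)
    t_par = tp(n)
    return t_comp + t_interact + t_par

# Hypothetical overheads that grow with machine size n:
t = phase_parallel_time(
    n=64, w=1000, sigma=10, alpha=0.1, tf=1.0,
    t0=lambda n: 5 * n, tc=lambda n: 0.01 * n, tp=lambda n: 2 * n)
```

Separating the three terms makes the model's point visible: as n grows, the tf (computation) term barely changes while the t0, tc, and tp overhead terms grow, which is exactly the behavior the PRAM model fails to capture.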
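The cost of host-to-device transfers discussed above can be made concrete with a back-of-the-envelope model. The latency and bandwidth figures below are hypothetical, chosen only to show why batching transfers (reducing the number of host-device crossings) pays off:

```python
# Back-of-the-envelope model of host-to-device transfer cost.
# Latency and bandwidth numbers are hypothetical, for illustration only.

BUS_LATENCY_S = 10e-6    # assumed fixed per-transfer latency (seconds)
BUS_BANDWIDTH = 4e9      # assumed host-device bandwidth (bytes/second)

def transfer_time(num_transfers, bytes_per_transfer):
    """Total time to move data between host and device:
    a fixed latency term per transfer plus a bandwidth term."""
    return num_transfers * (BUS_LATENCY_S + bytes_per_transfer / BUS_BANDWIDTH)

total_bytes = 64 * 1024 * 1024   # 64 MB of input data

# Moving the data in 1024 small chunks versus one batched transfer:
chunked = transfer_time(1024, total_bytes // 1024)
batched = transfer_time(1, total_bytes)
print(chunked > batched)  # True: the per-transfer overhead dominates
```

Under this model the bandwidth term is identical in both cases; the difference is entirely the fixed per-transfer overhead, which is why an algorithm designed to copy its working set to the device once outperforms one that shuttles data back and forth every iteration.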