Optimizing Parallel Reduction in CUDA to Reach GPU Peak Performance

1Adityo Mahardito 2Adang Suhendra 3Deni Tri Hasta

1Gunadarma University([email protected]) 2Gunadarma University([email protected]) 3Gunadarma University([email protected])

Abstract

GPUs are massively multithreaded many-core chips that have hundreds of cores and thousands of concurrent threads with high performance and memory bandwidth. Nowadays, GPUs are already used to accelerate numerically intensive high performance computing applications in parallel, not only for graphics processing. This thesis aims primarily to demonstrate programming model approaches that can maximize the performance of GPUs. This is accomplished by a proof of the maximum reachable memory bandwidth and the speedup obtained from a GPU used for parallel computation. The programming environment used is NVIDIA's CUDA, a parallel architecture and model for general-purpose computing on a GPU.

1 Introduction

Currently, the demand for hardware resources for processing data on computers is increasing. This can be seen in the continuously increasing minimum requirements of software applications. The advent of multi-core CPUs and many-core GPUs means that mainstream processor chips are now parallel systems. Many applications that process large data sets can use a data parallel programming model to speed up their computations. Applications outside the field of computation itself are also accelerated by data parallel processing, from general image rendering and processing, signal processing, and physics simulation to computational finance or computational biology. For data parallel processing, the GPU can be preferred over the CPU, because there are very significant differences between the capabilities of CPUs and GPUs when processing large data sets in parallel computation. On the other side, GPUs are very efficient not only at manipulating computer graphics; their highly parallel structure also makes them more effective than general-purpose CPUs for a range of complex algorithms [2].

The GPU is specialized for compute-intensive, highly parallel computation and is therefore designed such that more transistors are devoted to data processing rather than to data caching and flow control. Many companies have produced GPUs under a number of brand names; Intel, NVIDIA, AMD/ATI, S3 Graphics, and VIA Technologies all produce GPUs, but NVIDIA and ATI control nearly 100% of the market. Recently, NVIDIA began releasing cards supporting an API extension to the C programming language called CUDA ("Compute Unified Device Architecture"), which allows specified functions from a normal C program to run on the GPU's stream processors [8].

The performance offered by the GPU in fact depends on the approach used when creating a CUDA program, even when the same algorithm is used, because in parallel programming, and especially on CUDA, the things that must be considered to obtain optimal results are memory usage and instruction usage. The main focus of this thesis is the optimization of parallel programs in the CUDA environment, running on a GPU, using a parallel reduction algorithm. The parallel reduction algorithm that is implemented is a summation algorithm that computes the sum of large arrays of values.

2 Literature Review

2.1 Graphic Processor Unit

A Graphics Processing Unit (GPU) is a processor attached to a graphics card dedicated to calculating floating point operations. The GPU has evolved into a highly parallel, multithreaded, many-core processor with tremendous computational horsepower and very high memory bandwidth.
The reason behind the discrepancy in floating-point capability between the CPU and the GPU is that the GPU is specialized for compute-intensive, highly parallel computation and is therefore designed such that more transistors are devoted to data processing rather than to data caching and flow control. This difference makes the CPU and GPU suited for different kinds of problems. In a simplified way, one can say that a CPU performs best when a few, small pieces of data are processed in a complex but sequential way. This lets the CPU utilize the many transistors used for caching, branch prediction and instruction-level parallelism. The GPU, on the other hand, needs massively data parallel problems to work efficiently. The programming model most commonly used when programming a GPU is based on the stream programming model. In the stream programming model, input to and output from a computation come in the form of streams. A stream is a collection of homogeneous data elements on which some operation, called a kernel, is to be performed, and the operation on one element is independent of the other elements in the stream.

Another important difference between a general-purpose processor and a typical GPU is the memory bandwidth. Because of simpler memory models and no requirements from legacy operating systems, the GPU can support more than 100 GB/s of memory bandwidth, while the bandwidth of general-purpose processors is around 20 GB/s [7].

2.1.1 GPU Architecture

Starting from the Streaming Processor (SP, see Figure 1): it is a fully pipelined, single-issue, in-order microprocessor complete with two ALUs and an FPU. An SP doesn't have any cache, so it's not particularly great at anything other than cranking through tons of mathematical operations. Since an SP spends most of its time working on pixel or vertex data, the fact that it doesn't have a cache doesn't really matter; if you add up enough of these cores you can start to get something productive, given that graphics rendering is a highly parallelizable task [3].

Figure 1: Streaming Processor (SP)

The GPU is a many-core processor containing an array of Streaming Multiprocessors (SMs). An SM is an array of SPs; it consists of 8 Streaming Processors along with two additional processors called Special Function Units (SFUs). Each SFU has four FP multiply units which are used for transcendental operations (e.g. sin, cosine) and interpolation, the latter being used in some of the calculations for things like anisotropic texture filtering. NVIDIA isn't specific about the SFU architecture; just assume that each SFU is also a fully pipelined, single-issue, in-order microprocessor. There's an MT issue unit that dispatches instructions to all of the SPs and SFUs in the group.

In addition to the processor cores in an SM, there's a very small instruction cache, a read-only data cache and 16KB of read/write shared memory. These cache sizes are kept purposefully small because, unlike a conventional desktop microprocessor, the datasets we're trying to cache here are small. Each SP ends up working on an individual pixel and, despite the move to 32-bit floating point values, there's only so much data associated with a single pixel. The 16KB memory is akin to Cell's local stores in that it's not a cache but a software-managed data store, so that latency is always predictable. With this many cores in a single SM, control and predictability are very important to making the whole thing work efficiently.

The next cluster is the Texture Processor Cluster (TPC), shown in Figure 2. NVIDIA purposefully designed its GPU architecture to be modular, so a single TPC can be made up of any number of SMs; in the GT200 architecture it is made up of three SMs, and this can differ in other series [4].

Figure 2: Texture Processor Cluster

The components of the TPC, however, haven't changed: a TPC is made up of SMs, some control logic and a texture block. Remember that an SM is a total of 8 SPs and 2 SFUs, so that brings the total up to 24 SPs and 6 SFUs per cluster in GT200. The texture block includes texture addressing and filtering logic as well as an L1 texture cache. The modular theme continues with the Streaming Processor Array (SPA), which is composed of a number of TPCs, see Figure 3 [5].

Figure 3: Streaming Processor Array

Input data must first be copied into the device's global memory, and the processed data elements need to be stored back in global memory for copying back to host memory. The task of effectively hiding the global memory access latency and managing the memory hierarchy is very crucial for obtaining maximal performance from the GPU. Each SM manages the creation, execution, synchronization and destruction of concurrent threads in hardware, with zero scheduling overhead, and this is one of the key factors in achieving very high execution throughput. Each parallel thread is mapped to an SP for execution, and each thread maintains its own register state. All the SPs in an SM execute their threads in lock-step, according to the order of instructions issued by the per-SM instruction unit. The SM creates and manages threads in groups of 32, and each such group is called a warp [6]. A warp is the smallest unit of scheduling within each SM. The GPU achieves efficiency by splitting its workload into multiple warps and multiplexing many warps onto the same SM.

2.2 CUDA Overview

CUDA, the Compute Unified Device Architecture, is a general-purpose hardware interface designed to let programmers use NVIDIA graphics hardware for purposes other than graphics in a more familiar way. In general, the hardware need not be a graphics-related card at all, as there are cards designed specifically for general-purpose calculations. CUDA defines a programming model and a memory model that are consistent across all CUDA devices. The programming model describes how parallel code is written, launched and executed on a device and how threads are grouped into blocks. The memory model defines the different types of memories that are available to a CUDA program.

2.3 CUDA Programming Model

NVIDIA's Compute Unified Device Architecture (CUDA) is a programming model and hardware/software environment for GPGPU, which consists of the following components:

• An extension to the C programming language that allows programmers to define GPGPU functions called kernels, which are executed by multiple threads on the GPU.

• A compiler and related tools for translating CUDA source code to GPU device-specific binary.

• A software stack consisting of GPGPU application libraries, CUDA runtime libraries and the CUDA device driver.

• A CUDA-enabled GPU device.

2.3.1 CUDA Kernels

The CUDA programming language provides a means for programmers to express a problem that exhibits significant data-parallelism as a CUDA kernel. A CUDA kernel is a function that is executed in Single Program Multiple Data (SPMD) fashion on a large set of data elements. Kernel code is written at the thread level, with access to built-in variables that identify the executing thread. A kernel is defined using the __global__ declaration specifier, and the number of CUDA threads for each call is specified using a new <<<...>>> syntax.
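As a concrete illustration of the __global__ specifier and the <<<...>>> launch syntax described above, a minimal CUDA program might look like the sketch below; the kernel name, array size and launch configuration are illustrative assumptions, not code from the paper.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: each thread adds 1.0f to one element of the array.
__global__ void incrementKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against out-of-range threads
        data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;                 // 1M elements (illustrative size)
    size_t bytes = n * sizeof(float);

    float *d_data;
    cudaMalloc(&d_data, bytes);
    cudaMemset(d_data, 0, bytes);

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    incrementKernel<<<blocks, threadsPerBlock>>>(d_data, n);  // kernel launch
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```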

2.3.2 CUDA Memory Hierarchy

CUDA threads may access data from multiple memory spaces during their execution, as illustrated by Figure 4. Each thread has a private local memory. Each thread block has a shared memory visible to all threads of the block and with the same lifetime as the block.

Figure 4: Memory Access

2.3.3 Thread Hierarchy

The CUDA programming model organizes threads into a three-level hierarchy, as shown in Figure 5. At the highest level of the hierarchy is the grid. A grid is a 2D array of thread blocks, and thread blocks are in turn 3D arrays of threads.

Figure 5: Hierarchy of threads in the CUDA Programming Model

The sizes of the grid and of the thread blocks are determined by the programmer, according to the size of the problem being solved, and communicated to the driver at kernel launch time. Each thread block in a grid has its own unique identifier and each thread has a unique identifier within a block. Using a combination of block ID and thread ID, it is possible to distinguish each individual thread running on the entire device. Only a single grid of thread blocks can be launched on the GPU at once, and the hardware limits on the number of thread blocks and threads vary across different GPU architectures.

A kernel is executed by a grid, which contains blocks, and these blocks contain the threads. A thread block is a batch of threads that can cooperate by sharing data through shared memory and synchronizing their execution. Threads from different blocks operate independently.
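The block and thread identifiers from this hierarchy are typically combined into per-thread indices as in the following sketch; the matrix-scaling example, kernel name and block dimensions are illustrative assumptions.

```cuda
#include <cuda_runtime.h>

// Kernel indexed with a 2D grid of 2D blocks; each thread handles one matrix cell.
__global__ void scaleMatrix(float *m, int width, int height, float factor) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    if (x < width && y < height)
        m[y * width + x] *= factor;
}

void launchScale(float *d_m, int width, int height) {
    dim3 block(16, 16);                              // 256 threads per block
    dim3 grid((width  + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);     // enough blocks to cover the matrix
    scaleMatrix<<<grid, block>>>(d_m, width, height, 2.0f);
}
```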

2.4 CUDA Optimization Strategy

In a CUDA performance optimization strategy there are four approaches that can be used, i.e.:

• Instruction performance

• Number of threads per block

• Data transfer between host (CPU) and device (GPU)

• Texture fetch versus global or constant memory read

In this case, the strategy that will be used is the instruction performance approach.

2.4.1 Instruction Performance

To process an instruction for a warp of threads, a multiprocessor must read the instruction operands for each thread of the warp, execute the instruction, then write the result for each thread of the warp. Therefore, the effective instruction throughput depends on the nominal instruction throughput as well as on the memory latency and bandwidth. It is maximized by minimizing the use of instructions with low throughput and by maximizing the use of the available memory bandwidth for each category of memory.

2.4.2 Instruction Throughput

• Arithmetic instructions. To issue one instruction for a warp, a multiprocessor takes 4 clock cycles for single-precision floating-point add, multiply and multiply-add, integer add, bitwise operations, compare, min, max and type conversion instructions. Integer division and modulo operations are particularly costly and should be avoided if possible or replaced with bitwise operations whenever possible. If n is a power of 2, (i/n) is equivalent to (i >> log2(n)) and (i % n) is equivalent to (i & (n − 1)); the compiler will perform these conversions if n is a literal (see the sketch after this list).

• Control flow instructions. Any flow control instruction (if, switch, do, for, while) can significantly impact the effective instruction throughput by causing threads of the same warp to diverge, that is, to follow different execution paths. If this happens, the different execution paths have to be serialized, increasing the total number of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to the same execution path. To obtain the best performance in cases where the control flow depends on the thread ID, the controlling condition should be written so as to minimize the number of divergent warps.

• Memory instructions. Memory instructions include any instruction that reads from or writes to shared, local or global memory. A multiprocessor takes 4 clock cycles to issue one memory instruction for a warp. When accessing local or global memory, there are, in addition, 400 to 600 clock cycles of memory latency. Much of this global memory latency can be hidden by the thread scheduler if there are sufficient independent arithmetic instructions that can be issued while waiting for the global memory access to complete.

• Synchronization. The __syncthreads instruction takes 4 clock cycles to issue for a warp if no thread has to wait for any other threads.
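The sketch below illustrates the division/modulo replacement and the warp-divergence point from the bullets above; the helper and kernel names and the power-of-two assumption are illustrative, not taken from the paper.

```cuda
// Assumes n = 1 << log2n is a power of two (illustrative helpers, not from the paper).
__device__ int fastDiv(int i, int log2n) {
    return i >> log2n;            // i / n  when n == 1 << log2n
}

__device__ int fastMod(int i, int n) {
    return i & (n - 1);           // i % n  when n is a power of two
}

// Divergent version: threads of one warp take different branches depending on tid % 2.
__global__ void divergentKernel(float *data) {
    int tid = threadIdx.x;
    if (tid % 2 == 0)             // even/odd split inside a warp -> serialized paths
        data[tid] *= 2.0f;
    else
        data[tid] += 1.0f;
}

// Warp-aligned version: the branch condition is constant within a warp (warp size 32),
// so no warp diverges even though different warps do different work.
__global__ void warpAlignedKernel(float *data) {
    int tid = threadIdx.x;
    if ((tid / 32) % 2 == 0)
        data[tid] *= 2.0f;
    else
        data[tid] += 1.0f;
}
```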
2.4.3 Memory Bandwidth

The effective bandwidth of each memory space depends significantly on the memory access pattern. Since device memory has much higher latency and lower bandwidth than on-chip memory, device memory accesses should be minimized. A typical programming pattern is to stage data coming from device memory into shared memory; in other words, to have each thread of a block (a sketch of this pattern follows the list):

1. Load data from device memory to shared memory.

2. Synchronize with all the other threads of the block so that each thread can safely read shared memory locations that were written by different threads.

3. Process the data in shared memory.

4. Synchronize again if necessary to make sure that shared memory has been updated with the results.

5. Write the results back to device memory.
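A minimal sketch of the five-step staging pattern, assuming a hypothetical kernel that reverses each 256-element block in place (the kernel name, block size and operation are illustrative assumptions):

```cuda
#define BLOCK_SIZE 256

// Reverses each BLOCK_SIZE-element segment of g_data in place.
// For simplicity, n is assumed to be a multiple of BLOCK_SIZE.
__global__ void reverseBlocks(float *g_data, int n) {
    __shared__ float s_data[BLOCK_SIZE];

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    // 1. Load data from device (global) memory into shared memory.
    if (i < n)
        s_data[tid] = g_data[i];

    // 2. Synchronize so every element of the block is visible to every thread.
    __syncthreads();

    // 3. Process the data in shared memory: read an element written by another thread.
    float value = s_data[blockDim.x - 1 - tid];

    // 4. A second __syncthreads() would only be needed if s_data were written again here.

    // 5. Write the results back to device memory.
    if (i < n)
        g_data[i] = value;
}
```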

2.5 Parallel Reduction

Parallel reduction is one of the kinds of Parallel Random Access Machine (PRAM) algorithms. A Parallel Random Access Machine (PRAM) is a shared-memory abstract machine which is used by parallel algorithm designers to estimate algorithm performance (such as its time complexity). A parallel reduction is therefore a PRAM process that manipulates data stored in the memory of global registers. There are several operations used in parallel reduction, i.e. the addition operation, the subtraction operation, and the multiplication operation. Parallel reduction can be described with a binary tree, where a group of n values is added in log n addition steps in parallel.

2.5.1 Kahan Summation Algorithm

In numerical analysis, the Kahan summation algorithm (see Algorithm 1), also known as compensated summation, significantly reduces the numerical error in the total obtained by adding a sequence of finite-precision floating-point numbers, compared to the obvious approach. This is done by keeping a separate running compensation (a variable that accumulates small errors). In particular, simply summing n numbers in sequence has a worst-case error that grows proportionally to n, and a root mean square error that grows as √n for random inputs (the roundoff errors form a random walk). With compensated summation, the worst-case error bound is independent of n, so a large number of values can be summed with an error that depends only on the floating-point precision.

Algorithm 1 Kahan Summation Algorithm
function kahanSum(input)
    sum ← input[1]
    c ← 0.0                  ▷ running compensation for lost low-order bits
    for i = 2 to input.length
        y ← input[i] − c
        t ← sum + y
        c ← (t − sum) − y
        sum ← t
    next i
    return sum
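A direct C rendering of Algorithm 1, usable in the host code of a CUDA program, might look as follows; the function signature is an illustrative assumption.

```cuda
// Kahan (compensated) summation of n floats; mirrors Algorithm 1.
float kahanSum(const float *input, int n) {
    float sum = input[0];        // running total
    float c   = 0.0f;            // running compensation for lost low-order bits
    for (int i = 1; i < n; ++i) {
        float y = input[i] - c;  // corrected next term
        float t = sum + y;       // new (rounded) total
        c = (t - sum) - y;       // recover what was lost when rounding t
        sum = t;
    }
    return sum;
}
```

Note that aggressive fast-math compiler optimizations can remove the compensation term, so such code is normally compiled with strict floating-point semantics.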
2.5.2 Parallel Reduction Complexity and Cost

In parallel processing there are three kinds of complexity, i.e. step complexity, work complexity, and time complexity.

• Step complexity: there are log N parallel steps, and each step S performs N/2^S independent operations, so the step complexity is O(log N).

• Work complexity: for N = 2^D the reduction performs Σ_{S=1..D} 2^(D−S) = N − 1 operations, so the work complexity is O(N).

• Time complexity: O(N/P + log N) with P threads physically in parallel (P processors).

3 Methodology

3.1 General Scenario

A reduction is the process of combining the elements of a vector or array to yield a single aggregate element. It is commonly used in scientific computations and in parallel processing. Therefore, this research tries to implement this method to find a way of optimizing the parallel device. The reduction has several operations that can be used, i.e. addition, multiplication, subtraction, and division. The author chose the addition operation with the consideration that it will not produce elements that are negative, fractional, or too large. This reduction operation is then implemented to process a large set of array elements on the GPU device. The case used is the sum of incrementing array elements in the CUDA environment; the CUDA programming model is then explored to obtain different implementations, in an attempt to find an efficient and scalable implementation.

3.2 GPU Device

In this research, the GPU device to be optimized is a GeForce GT 240. This device is a desktop-class card, commonly used to play games on a PC. It has 96 CUDA cores to process in parallel, with 54.4 GB/s of memory bandwidth. The targets to reach on the GPU are the maximum memory bandwidth and the minimum processing time.

3.3 Research Method

There are several steps carried out in this research, i.e.:

1. Use the recent example program from Ian Buck's research [1].

2. Find a weakness or problem in the program.

3. Try to solve the problem, then upgrade it to a better program.

4. Test and compare all of them.

3.4 Program Implementation

3.4.1 Interleaved Addressing with Divergent Branching by Ian Buck

The program, from here on called the kernel, performs a reduction where each thread operates on two elements loaded from shared memory. It is an implementation of a sum EREW PRAM reduction process, where the indexing method uses a modulo operation in every iteration of the loop.

This method is simple but has a weakness: thread indexing is wasteful, because there is a hop between one active thread and the next, which produces highly divergent warps and an inefficient process. The solution, according to the performance strategy, is to revise this program with the arithmetic instruction and control flow instruction approaches, replacing the modulo test if (tid % (2*s) == 0) to minimize the number of divergent branches and warps. It can be replaced with an if (index < blockDim.x) control flow, where int index = 2 * s * tid.

3.4.2 Reduction kernel with bank conflict method

This method starts from the interleaved addressing solution with divergent branching and simply replaces the thread indexing with a strided index and a non-divergent branch, so the thread ID indexing is no longer based on the order of the element values in shared memory. By using this method, the thread indices are ordered correctly without hops between them, so no threads are wasted.

But the new problem is shared memory bank conflicts. To achieve high memory bandwidth, shared memory is divided into equally sized memory modules, called banks, which can be accessed simultaneously. Any memory read or write request made of n addresses that fall in n distinct memory banks can be serviced simultaneously, yielding an effective bandwidth that is n times as high as the bandwidth of a single module. However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. The hardware splits a memory request with bank conflicts into as many separate conflict-free requests as necessary, decreasing the effective bandwidth by a factor equal to the number of separate memory requests. If the number of separate memory requests is n, the initial memory request is said to cause n-way bank conflicts. To get maximum performance, it is therefore important to understand how memory addresses map to memory banks in order to schedule the memory requests so as to minimize bank conflicts.

For all threads of a warp (32 threads), accessing shared memory is as fast as accessing a register as long as there are no bank conflicts between the threads. So, the solution is to use the sequential addressing method for indexing the thread ID.

3.4.3 Reduction kernel with sequential addressing

The sequential addressing method avoids the bank conflict problem, because the thread IDs are indexed sequentially, so threads will not access the same address in a memory bank; each thread points to its own shared memory address. In the source code, the only thing that must be replaced is the strided indexing in the inner loop, which becomes a reversed loop with thread-ID-based indexing.

However, this approach still has a problem: half of the threads are idle in the first loop iteration, which is wasteful and of course affects performance. The solution is obtained by performing a reduction during the load of the elements from global memory to shared memory.
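The three inner-loop variants described in 3.4.1 to 3.4.3 are sketched below. The kernel skeleton follows the layout implied by the text (a shared-memory array sdata indexed by tid); the concrete names and the fixed block size are illustrative assumptions rather than the paper's exact source code.

```cuda
#define BLOCK_SIZE 256   // launch with BLOCK_SIZE threads per block

__global__ void reduceSum(const float *g_idata, float *g_odata, int n) {
    __shared__ float sdata[BLOCK_SIZE];

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? g_idata[i] : 0.0f;     // one element per thread
    __syncthreads();

    // --- Variant 1 (3.4.1): interleaved addressing, divergent branching ---
    // for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    //     if (tid % (2 * s) == 0)                // modulo test -> highly divergent warps
    //         sdata[tid] += sdata[tid + s];
    //     __syncthreads();
    // }

    // --- Variant 2 (3.4.2): strided index, non-divergent branch, but bank conflicts ---
    // for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    //     unsigned int index = 2 * s * tid;
    //     if (index < blockDim.x)                // contiguous threads stay active
    //         sdata[index] += sdata[index + s];  // strided access -> bank conflicts
    //     __syncthreads();
    // }

    // --- Variant 3 (3.4.3): sequential addressing (reversed loop, conflict free) ---
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)                               // first half of the threads stay active
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];            // one partial sum per block
}
```

The kernel produces one partial sum per block, so the output array must itself be reduced again (for example by a second launch) to obtain the final total.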

3.4.4 Reduction kernel with first add during load

Idle threads in the sequential addressing method are avoided by performing the reduction while loading the elements from global memory to shared memory, which can be called the first add during load method. In this method, during the loop that loads elements from global memory, the reduction is also carried out on elements that have already been loaded before the result is finally written to shared memory, so the element processed in the kernel function is already the result of a sum computation. The point is to replace the single load with two loads and to perform the first add during the load.

This should be the best method for reaching GPU peak performance, but in fact the achieved memory bandwidth is still far from the bandwidth bound; this may be caused by an instruction bottleneck or instruction latency. An instruction's latency is the number of clock cycles it takes for the instruction to pass through the pipeline. To avoid this condition, the strategy that can be used is unrolling loops, because the reduction has low arithmetic intensity.

3.4.5 Reduction kernel with unroll last warp

Sometimes the compiler may unroll loops or optimize out if or switch statements by using branch predication instead; in these cases, no warp can ever diverge. The programmer can also control loop unrolling. As the reduction proceeds, the number of active threads decreases. In the GPU architecture, a kernel or an instruction is SIMD-synchronous within a warp; a warp is a group of threads executed physically in parallel, and each warp consists of 32 threads. That means that when s <= 32, __syncthreads() is no longer needed and the if (tid < s) test is no longer needed because it does not save any work, so the last iterations can be unrolled by hand. Unrolling the last 6 iterations of the inner loop saves useless work in all warps, not just the last one; without unrolling, all warps execute every iteration of the for loop and the if statement.

3.4.6 Reduction kernel with completely unrolled

If the number of iterations is known at compile time, the reduction can be completely unrolled. Luckily, the block size is limited by the GPU to 512 threads, and block sizes that are powers of 2 are used, so the loop can easily be unrolled for a fixed block size, replacing the iterative for loop with a completely unrolled version.

3.4.7 Reduction kernel with multiple elements per thread

Given a total number of threads per grid, the number of threads per block, or equivalently the number of blocks, should be chosen to maximize the utilization of the available computing resources. This means that there should be at least as many blocks as there are multiprocessors in the device. Furthermore, running only one block per multiprocessor will force the multiprocessor to idle during thread synchronization and also during device memory reads if there are not enough threads per block to cover the load latency. It is therefore usually better to allow two or more blocks to be active on each multiprocessor, to allow overlap between blocks that wait and blocks that can run. For this to happen, not only should there be at least twice as many blocks as there are multiprocessors in the device, but the amount of allocated shared memory per block should also be at most half the total amount of shared memory available per multiprocessor.
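Building on the sequential addressing kernel sketched earlier, the following sketch combines the ideas of 3.4.4, 3.4.5 and 3.4.7: the first add is performed during the load, the last warp is unrolled, and each thread sums multiple elements with a grid-stride loop before the shared-memory reduction. The names, the fixed block size of 256 and the use of a grid-stride loop are illustrative assumptions rather than the paper's exact code.

```cuda
#define BLOCK_SIZE 256   // launch with BLOCK_SIZE threads per block

// Unrolled final warp: within a warp no __syncthreads() is needed on the
// warp-synchronous hardware discussed here; volatile prevents register caching.
__device__ void warpReduce(volatile float *sdata, unsigned int tid) {
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

__global__ void reduceSumOpt(const float *g_idata, float *g_odata, unsigned int n) {
    __shared__ float sdata[BLOCK_SIZE];
    unsigned int tid      = threadIdx.x;
    unsigned int i        = blockIdx.x * (blockDim.x * 2) + tid;
    unsigned int gridSize = blockDim.x * 2 * gridDim.x;

    // Multiple elements per thread: first add during load, then stride over the grid.
    float sum = 0.0f;
    while (i < n) {
        sum += g_idata[i];
        if (i + blockDim.x < n)
            sum += g_idata[i + blockDim.x];    // the "first add during load"
        i += gridSize;
    }
    sdata[tid] = sum;
    __syncthreads();

    // Sequential addressing reduction down to the last warp.
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Unroll the last warp (s <= 32).
    if (tid < 32)
        warpReduce(sdata, tid);

    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];         // one partial sum per block
}
```

The complete unrolling of 3.4.6 would additionally make the block size a compile-time template parameter so that the remaining for loop disappears entirely; on GPU generations newer than the GT200-class hardware discussed here, the unrolled warp would also require explicit __syncwarp() calls.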
4 Result and Comparison

4.1 Parameter

Before running the programs for each method, the following parameters were used on CUDA for all methods:

• Number of array elements: 2^17 to 2^25

• Number of threads per block: 512

• Number of blocks: 512

4.2 Result and Comparison

For the 2^24-element array, the results of running all of the methods, compared by memory bandwidth and time consumed, are shown in Figure 6. The figure describes the performance of each method when processing 2^24 array elements, where the "Time" column is the processing time needed in milliseconds, the "Bandwidth" column is the rate at which data can be read from or stored into memory (memory bandwidth is usually expressed in bytes/second), the "Step Speedup" column is the speedup of the current kernel relative to the previous kernel, and the "Cumulative Speedup" column is the speedup of the current kernel relative to kernel 1 (the divergent branching method).

Figure 6: Result performance by all of the methods
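Time and effective bandwidth figures of this kind are commonly obtained by timing the kernel with CUDA events and dividing the bytes read by the elapsed time. The host-side sketch below assumes the illustrative reduceSumOpt kernel from the previous sketch; it is not the measurement code used in the paper.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Defined in the earlier sketch.
__global__ void reduceSumOpt(const float *g_idata, float *g_odata, unsigned int n);

// Times one reduction launch and reports effective bandwidth in GB/s.
void timeReduction(const float *d_in, float *d_out, unsigned int n,
                   int blocks, int threads) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    reduceSumOpt<<<blocks, threads>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);          // elapsed time in milliseconds

    // Each input element is read once: n * sizeof(float) bytes moved.
    double gbPerSec = (n * sizeof(float)) / (ms * 1.0e6);
    printf("time = %.3f ms, bandwidth = %.2f GB/s\n", ms, gbPerSec);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```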

For arrays of 2^17 to 2^25 elements, Figure 7 shows the performance comparison by time in graphical form. In that figure it can be seen that from 131,072 to 1,048,576 (2^17 to 2^20) array elements, all of the methods are still reliable to use. At 2,097,152 to 16,777,216 (2^21 to 2^24) array elements, the Divergent Branching, Bank Conflict, and Sequential Addressing methods were no longer reliable, because they need a time of over 1.0 second. Furthermore, at 2^25 elements only four methods remain reliable. At the end of the results, the best method is Multiple Elements per Thread, because it consumes less time than the others.

Figure 7: Result comparison by time

Figure 8 shows the performance comparison by bandwidth in graphical form; here the best method is the one that reaches the highest bandwidth, or comes closest to the GPU bandwidth boundary of 54.4 GB/s. The figure shows performance from 131,072 to 33,554,432 (2^17 to 2^25) elements.

Figure 8: Result comparison by bandwidth

5 Conclusions

5.1 Conclusion

According to the results, GPU peak performance has been reached. This is shown by the achieved memory bandwidth value (43.3 GB/s) compared with the memory bandwidth bound (54.4 GB/s) of the GPU device (GeForce GT 240).

To optimize the GPU for the parallel reduction process, keep the multiprocessors on the device as busy as possible. A device in which work is poorly balanced across the multiprocessors will deliver suboptimal performance. Hence, it is important to design your application to use threads and blocks in a way that maximizes hardware utilization and to limit practices that impede the free distribution of work. One of the keys to good performance is keeping the multiprocessors busy; another important concept is the management of the system resources allocated for a particular task. The CUDA optimization strategy can be used to maximize hardware utilization. In this case of the reduction process, the optimizations address divergent branching, bank conflicts, memory coalescing with sequential addressing, and latency hiding with loop unrolling.

Actually, many types of approaches can be used to optimize performance on a GPU, but especially for CUDA there are two types of strategy, i.e. instruction throughput and memory bandwidth.

5.2 Future Work

The author wishes to continue the research on GPU performance in future work and to try some different things, i.e.: a different type of GPU device, with higher specifications, to process larger data; the thread and block parameters used, to find the ideal portion of shared memory usage; and the algorithm processing, to break through the memory bandwidth boundary.

References

[1] Ian Buck. Parallel programming with CUDA. NVIDIA, 2008.

[2] NVIDIA. NVIDIA CUDA Compute Unified Device Architecture. In NVIDIA CUDA Programming Guide 2.0, volume 2, page 1. 2008.

[3] Anand Lal Shimpi and Derek Wilson. NVIDIA's 1.4 billion transistor GPU. Page 2, 2008. http://www.anandtech.com/show/2549/2, accessed August 2010.

[4] Anand Lal Shimpi and Derek Wilson. NVIDIA's 1.4 billion transistor GPU. 2008. http://www.anandtech.com/show/2549/4, accessed April 2010.

[5] Anand Lal Shimpi and Derek Wilson. NVIDIA's 1.4 billion transistor GPU. Page 4, 2008. http://www.anandtech.com/show/2549/4, accessed April 2010.

[6] Anand Lal Shimpi and Derek Wilson. NVIDIA's 1.4 billion transistor GPU. Page 2, 2008. http://www.anandtech.com/show/2549/2, accessed April 2010.

[7] Wikipedia. GPGPU, 2010. http://en.wikipedia.org/wiki/GPGPU.

[8] Wikipedia. Graphics processing unit, 2010. http://en.wikipedia.org/wiki/Graphics_processing_unit, accessed August 2010.