Quick viewing(Text Mode)

Linear Algebra Package on the NVIDIA G80 Processor

Linear Algebra Package on the NVIDIA G80 Processor

PACKage on the G80 Robert Liao and Tracy Wang Science 252 Project University of California, Berkeley Berkeley, CA 94720 liao_r [at] berkeley.edu, tracyx [at] berkeley.edu

Abstract primarily on running instructions faster. This brought forth the frequency race between The race for performance improvement manufacturers. Programmers previously depended on running one application could write code, wait 6 months, and their code or sequence of codereally quickly. Now, the would suddenly run faster. Additionally, many processor industry and universities focus on techniques for increasing instruction level running many things really quickly at the same parallelism were introduced to improve time in an idea collectively known as parallelism. performance. Today, the next performance Traditional achieve this by obstacle is not clear, but the general consensus is building more cores onto the architecture. A single that the next big thing includes putting many processor may have as many as 2 to 8 cores that processing cores on a processor. This will enable can execute independently of one another. applications to take advantage of some higher However, many of these processor systems form of parallelism beyond instruction parallelism. include an underutilized . NVIDIA’s G80 Processor represents the first step During all of this time, Graphics to improve performance of an application through Processing Units (GPUs) have been doing many the use of the inherent parallel structure of a operations in parallel due to the relatively graphics processing unit. independent nature of its computations. Graphics scenes can be decomposed into objects, which can This paper explores the performance of a be decomposed into various rendering steps that subset of the Linear Algebra Package running on are independent from one another. However, this the NVIDIA G80 Processor. Results from the led to a very specialized processor that was exploration show that if utilized properly, the optimized for rendering. performance of linear algebra operations is improved by a factor of 70 with a suitable input Performing computations on a GPU is a size. Additionally, the paper discusses the issues relatively new field full of opportunities thanks to involved with running general programs on the the NVIDIA G80 GPU. This processor represents GPU, a relatively new ability provided by the G80 one of the first processors to expose as many as Processor. Finally, the paper discusses limitations 128 computation cores to the programmer. As a in terms of the use of a GPU. result, a comparison with respect to the CPU to see how the GPU performs is more than 1 Introduction appropriate to determine if this is right direction for parallelism. The effort to improve performance on microprocessors for much of the 1990s focused Though any program can be loaded onto the (CPU). The the GPU for this exploration, we decided to relationship between the CPU and GPU is closely linear algebra operations as a starting intertwined as most of the output from a point to evaluate the GPU for other research areas computer is received through video. In the past, like the Berkeley View. The Berkeley View has architects optimized the CPU and GPU bus routes compiled many kernels called dwarves that they because of the high demand to display video to an think should perform well on parallel platforms. like a monitor or LCD. These include dense and sparse operations. Many programs can be reduced purely to these GPU manufacturers have kept up with the dwarves. As a starting point to examining efficacy demand by offering more advanced capabilities in of using the GPU in this capacity, we benchmark their GPUs beyond putting text and windows to a the performance of general linear algebra screen. GPUs today are typically capable of taking operations. geometric information in the form of polygons from an application like a game and performing This paper is organized as follows. Section many different transformations to provide some 2 provides a general introduction to traditional sort of realistic or artistic output. GPUs and how they worked prior to the NVIDIA G80 processor. Section 3 discusses details about This video processing is embarrassingly the NVIDIA G80 processor and outlines its parallel. The representation of a on the capabilities. Section 4 discusses the benchmarks screen often can be rendered independently of used to profile the performance of the G80 with other . As a result, GPU manufacturers have respect to two CPUs used in this paper. Section 5 provided many superscalar features in their provides results, discussion, and speculation on processors to take advantage of this parallelism. the performance runs on the GPU. Section 6 brings This push for parallelism has come to a point forth the issues associated with GPU where a GPU is basically a specialized vector along with a discussion on issues on running processor. applications on the G80 platform. Finally, the paper concludes in Section 7 with a summary of Motivation for Change the results, future directions for research, as well A fixed characterizes the as related works on GPUs. traditional GPU. Many have a fixed number of special such as vertex shaders and Traditional GPUs shaders. NVIDIA noticed that during certain rendering scenarios, many of the specialized Background shaders remain dormant. For instance, a scene with many geometric features will use many Many modern personal computing and vertex shaders, but not very many pixel shaders. architectures include a GPU to off- As a result, NVIDIA began to look for a load the task of rendering graphical objects from reconfigurable solution.

Figure 1 shows a typical high-altitude pipeline of a GPU. Data flows forward from the CPU through the GPU and ultimately on to the display. GPUs typically contain many of these pipelines to process scenes in parallel. Additionally, the pipeline is designed to flow Figure 1: The Traditional GPU Pipeline forward, and as a result, certain stages of the pipeline have features like write-only registers to Each processor also has local memory as avoid hazards like read after write hazards found well as with other processors. in typical CPU pipelines. According to the NVIDIA guide, accessing local and shared memory on- is as fast as accessing Additionally, the right registers. next to the CPU is quiet during heavy computations performed on the CPU. Most Compute Unified Device Architecture developers do not send parallelizable The Compute Unified Device Architecture computations to the GPU because the make (CUDA) is NVIDIA’s API for exposing the it too difficult to do so. The typical interfaces like processing features of the G80 GPU. This OpenGL and DirectX are designed for graphics, Language API provides services ranging from not computation. As a result, the programmer common GPU operations in the CUDA Library to cannot tap into the GPU’s vast vector resources. traditional C memory management semantics in the CUDA runtime and layers. 3 The NVIDIA G80 GPU Additionally, NVIDIA provides a specialized C The G80 GPU is found in NVIDIA’s compiler to build programs targeted for the GPU. GeForce 8 Series graphics cards as well as the NVIDIA FX 4600 and 5600. The NVIDIA Code compiled for the GPU is executed on Quadro FX 5600 is the card used in this the GPU. Likewise, memory allocated on the GPU exploration. resides on the GPU. This introduces complications in interfacing programs running in CPU space Architecture with programs running in GPU space. The programmer must keep track of the pointers used The G80 GPU is NVIDIA’s answer to in each processor. Many programs, including the many of the aforementioned concerns and issues. ones used to benchmark the GPU in this paper, It represents a large departure from traditional reasonably assume that all pointers and execution GPU architectures. A block diagram of the code reside on one memory space and one architecture is shown in Figure 2. The GPU . Porting this style of programming contains 8 blocks of 16 stream processors with a to a separate format is a non-trivial task. total of 128 stream processors. Each stream processor can execute floating point instructions. From the block below, each group of 16 shares a L1 . From there, each block has access to 6 L2 caches. This architecture arrangement also allows one processor to directly feed results into another processor for continued .

Each processor can be configured to be a part of some unit in the traditional GPU sense. This reconfigurability also means that the processors can be dedicated to performing general computations. This capability is exposed in NVIDIA’s Compute Unified Device Architecture.

Figure 2: The NVIDIA G80 Graphics Processor Architecture quickly LAPACK can perform this factorization with respect to matrix size in both the CPU and GPU.

BLAS and CUBLAS The LAPACK tools rely on the Basic Linear Algebra Subprograms (BLAS) library. Figure 3: Organization of the modules. These subprograms are a set of primitive operations that operate on matrices. The original A Note on Scarce Specifications BLAS can be run on the CPU. NVIDIA provides its Due to the secretive nature of the industry, own version called CUBLAS (Compute Unified NVIDIA has not released much information about BLAS). CUBLAS is designed to run on the G80 the G80 processor beyond a high level overview. GPU, and abstracts much of the CUDA As a result, we can only speculate on specific like programming API in a succinct mathematical L1 and L2 cache sizes in this benchmark. package. The only major change is the inclusion of allocation and freeing function to deal with the 4Benchmarking separation of CPU and GPU memory. CULAPACK LAPACK and CLAPACK To perform this exploration, we took the The Linear Algebra PACKage (LAPACK) general solve function and altered it to be consists of a library of programs that operate on compatible with CUBLAS. We called the resulting matrices. This exploration uses LAPACK to gauge port CULAPACK (Compute Unified LAPACK). the performance of the GPU with respect to the Figure 3 shows the organization of the CPU. Originally written in Fortran 77, LAPACK is aforementioned packages. designed to solve simultaneous systems of equations, determine least-squares solutions for Performing a port on LAPACK to systems of linear equations, and eigenvalues of CULAPACK is non-trivial due to the assumption matrices among other linear algebra problems. It of one memory space. LAPACK often makes not only solves those problems, but provides the references to CPU memory for branching basic tools necessary for solving those problems. conditions before calling BLAS. Our exploration deals with CLAPACK, a C version of LAPACK run through F2C, a Fortran to There are several approaches to this port. C converter. One approach is to wrap the memory transfer code around each BLAS call. Unfortunately, this We focus on the sgesv function in comes at the expense of performance since LAPACK for measuring the performance of the repeated copies are not necessary for each BLAS GPU with respect to the CPU. sgesv solves a call. To lessen this performance hit, the second linear system of equations in the form of approach moves the memory transfer to the level of the LAPACK call. The third and ideal solution A × X = B would be to move the GPU and CPU boundary where A is a matrix of coefficients, X is the between LAPACK and BLAS to above LAPACK. variable vector, and B, is a constants vector. This requires LAPACK to be re-architected to LAPACK uses triangular factoring to solve this account for two memory spaces. system of equations. Our performance tests how Because the first approach brings a heavy memory Memory Overhead on the NVIDIA Quadro FX 5600 overhead hit, and the third approach is more time- 5 consuming and involved, we chose to implement 4.5 4 the second approach in the following steps: 3.5

3

2.5

1. Perform a preprocessing step: Take (Milliseconds)

2 Time

matrices supplied by our testing 1.5

framework in CPU memory, and copy 1 them to GPU memory. Additionally, 0.5 0 perform any computations that LAPACK 0 100000 200000 300000 400000 500000 600000 700000 Number of FORTRAN Real Elements needs before invoking BLAS calls. Figure 4: Graph showing memory overhead on the 2. Compute: Invoke as many CUBLAS calls NVIDIA Quadro FX 5600.. as is possible given the preprocessing. benchmark and the red line represents a linear 3. Postprocess the results: Copy relevant regression of these points. parts of the GPU matrix to allow LAPACK The memory overhead is linear as to continue processing without changing indicated from the graph. This means that behavior. Additionally, if LAPACK needs memory operations for subsequent benchmarks to return the actual matrix, copy the result will scale linearly with size. This is important for out of the GPU and into the CPU. the GPU since later findings will show that the Even though step 3 requires some amount of GPU performs better with larger input sizes. copying from GPU to CPU in the program flow, it Additionally, this also means that the is minimal and becomes negligible as the matrix programmer must be careful when dealing with size grows larger. We found this approach to be the CPU and GPU interface. A naïve approach the most sensible and an equally telling method to with copying memory between the CPU and GPU evaluate the GPU for this paper. can result in poor performance in computations. 5 Performance and Analysis LAPACK Mathematical Performance Memory Allocation Overhead The benchmarks targeted the triangular factoring step of the general solve function in CULAPACK requires many, potentially LAPACK. Figure 5 shows the speedup of the large, memory transfers between the CPU and the NVIDIA Quadro FX 5600 over the Core 2 Duo GPU. The benchmark first allocates a specified 6700. The speedup is calculated by taking the time amount of memory on the CPU and the GPU. to triangle factor a given matrix of a particular Randomized data is generated and is used to input size on the CPU divided by the time populate the CPU array. Next, the array is copied required of the same operation on the GPU. Since to the GPU. After that, the array is copied back memory copying is required to perform these into the GPU and both arrays are freed. operations on the GPU, this memory overhead is Figure 4 shows the memory overhead for included in the chart. In the surface graph, a various amounts of elements transferred between higher the surface means a greater speedup for a the CPU and GPU. The blue line represents the particular row and column size. We performed average of three different trial runs of this three time trials for each of the dimensions and Speedup of the NVIDIA Quadro FX 5600 over a Core 2 Duo 6700 (CPU Time / GPU Time)

3

2.5

2 2.5­3 2­2.5 1.5

Speedup 1.5­2 1­1.5 1 0.5­1 900 0­0.5

0.5 700

500 Input Column Size 0 100 200 300 300 400 500 600 700 800 100 900 1000 Input Row Size

Figure 5: Surface chart showing speedup of the NVIDIA Quadro FX 5600 over the Core 2 Duo 6700 averaged the time to help get rid of random run on a 820 processor, which contains factors affecting any particular time trial. a different core design from the Core 2 Duo. The same peak occurs at around 600 rows. The next As shown from the graph, most of the large peak does not occur until around 1600 rows, data points lie below a speedup of 1.0 (red and which is not shown in the graph above. Figure 6 blue regions. This means that the GPU was slower shows a slice along this interesting axis. than the CPU in performing an identical calculation. An interesting peak occurs at a row To further investigate this unusual peak, size of 600 in the graph. The same benchmark was we plotted the time required to perform this Speedup of the NVIDIA Quadro FX 5600 calculation. This is shown in Figure 7. Notice how 600 Rows the graph is relatively linear for larger input sizes. 8 For the CPU, there is quite an execution time peak 7 at 600. 6

5 We speculate that the BLAS memory 4 Speedup 3 access and instruction patterns exacerbate some

2 architectural feature on the Core 2 Duo 6700. The

1 Core 2 Duo 6700 has a shared 4MB L2 cache. This 0 means that it should be able to hold the entire 100 200 300 400 500 600 700 800 900 1000 Input Column Size matrix inside of the cache with plenty of space to

Core 2 Duo 6700 Pentium D 820 spare. While there are no specifications on the L1 and L2 cache for the G80 processor, NVIDIA has Figure 6: A Slice of the Surface Graph stated that an L1 cache is associated with a group Time Required for the Speedup of the NVIDIA Quadro FX 5600 NVIDIA Quadro FX 5600 5000 Rows

80

0.6 70

60 0.5 50 0.4 0.5­0.6 0.4­0.5 40 0.3 Speedup (Seconds)

0.3­0.4 30

Time 0.2 0.2­0.3 1000 20 0.1 0.1­0.2 700 0­0.1 10 0 400 0 100

200 1000 2000 3000 4000 5000 6000 7000 8000 9000 300 100 400

500 Input Column 600 700 800 900 Size Input Column Size 1000

Input Row Size Core 2 Duo 6700

Figure 7: Time required to perform triangular factorization. Figure 8: A Slice of the Surface Graph of 16 stream processors and 6 L2 caches can be found that the processor seemed to be branch accessed by all groups. This grouped cache mispredicting at an unusually high rate for a row probably large enough to hold the entire matrix size of 600. The Pentium D 820 also exhibited a for input sizes tested in this benchmark. similar performance hit at the same row size. If both processors use the same , We used the VTune Performance then this may be the cause of the large Environment profiler to further investigate this performance hit. However, determining why the unusual hit in performance. We ran the profiler on branch predictor on the processor mispredicts on the program operating on matrix row sizes of 500, that particular row size is beyond the scope of this 600, and 700. For cache misses, the numbers seem paper. to be in order. The program cache misses proportionally to the matrix row size. However, For even larger matrix sizes, the GPU upon examination on the branch predictor, we simply blows the CPU out of the water. Figure 8 Swapping Matrices shows the extraordinary speedup calculated in the same fashion. For a 5000x9000 matrix, the CPU 90 took over two hours to determine the result. The GPU, in contrast, took under 2 minutes. This 70 benchmark shows two different wins for the GPU. First, the vast number of cores on the GPU means (Seconds) 50 it can perform many more operations at the same time compared to the CPU. Additionally, these Required

30 large matrices take up quite a bit of memory on Time both respective platforms. In a CPU, a cache miss 10 means a penalty of many cycles. On the NVIDIA G80, much of the memory is as fast as registers. As 0 500 1000 1500 2000 2500 ­10 a result, the G80 can process more data faster. We Size of Square Matrix NVIDIA Quadro FX 5600 Core 2 Duo 6700 would like to note that the total time to execute the benchmark on the Core 2 Duo was on the order of Figure 9: Time required to swap two matrices. 10 hours. The Pentium D 820 was not halfway through the benchmark after 8 hours, and as a result, we decided not to show its speedup in 6 GPU Programming Issues Figure 8. GPU programming is not without its It is worthy to note that the memory issues. The biggest issue is that of memory overhead in our ported version of CULAPACK separation between the GPU and CPU. quickly becomes negligible as the matrix size Minimizing the memory overhead in using GPU grows larger. Because we have minimized the in CPU code either requires large memory amount of memory transfers, we can approximate transfers at the border of GPU and CPU or an most of the performance results to actual involved reorganization of the CPU code to avoid computation time. many memory copies. This means that while the GPU performs better for large matrix operations, Non-computational Performance to fully take advantage of the performance gain, For completeness, we decided to see how current applications need to be aware of the GPU memory space. well the GPU performed on non-computational operations. We decided to find out how quickly In the first iteration of this project for the the GPU could swap two matrices. The results of project presentations, we were not able to obtain this benchmark are shown in Figure 9, and the any sort of speed up for matrix sizes below 1000 results are quite surprising. The time required for by 1000. That iteration simply wrapped CUBLAS the GPU reasonably increases quadratically since calls to copy memory at the appropriate places the number of swaps required increases and then call CUBLAS. After that, the second quadratically as the matrix square size increases. iteration optimized the memory copy. We copied However, the CPU seems to be able to perform the matrices into GPU memory and made as many this swap very quickly if near constant time is CUBLAS calls as could be grouped together. This considered to be quick. allowed some of the benchmarks to go on par if Without more information about not better with the GPU for matrices smaller than 1000 by 1000. NVIDIA’s architecture, it is difficult to determine the cause of the slow down. However, it is not Another issue involved with operating on unreasonable to speculate that the slow down the G80 GPU involves issues with float point might be a difference in optimization goals. For arithmetic. A close examination of the solutions typical graphics computations, swaps may not obtained from the linear algebra operations occur often in the write-once semantics of reveals a level of floating point imprecision (e.g. 0 traditional GPUs. As a result, it is not a common becomes 0.0001). The NVIDIA CUDA use scenario. On the other hand, swaps occur all Programming Guide explains many the time on the CPU. An excellent example comes implementation deviations from IEEE-754. There from sorting. Many data sets performed on the is no support for denormalized numbers, and computer require some sort of sorting, and as a underflowed numbers are flushed to zero. This result, the CPU can optimize for that. As a result, lack of precision indicates that the GPU is NVIDIA may not have optimized for this new unsuitable for scenarios where high precision is scenario during design. required. This may impact its usefulness in scientific computing, where parallelism is king. From this example, the G80 shows that there is little benefit derived from executing non- computational operations on the GPU. 7Conclusion The primary goal of this project was to explore the GPU programming space as well as assess the performance of a GPU running some Department at the University of North Carolina at type of non-graphical computation. In this paper, Chapel Hill. They currently run their we have just begun the exploration of GPU computations on the NVIDIA GeForce 7 Series computing, and there is much that others can do graphics cards. in this very new field. Future Directions The G80 demonstrated that it is up to the We have only benchmarked a small subset task of handling many of the computations that of the CLAPACK library with limited are typically left to the CPU. For small matrix optimization in terms of bringing CLAPACK in to operations, the computation is better left to the the GPU. Since the results are promising, there are CPU since the CPU is usually as fast, if not faster, many directions that can be taken. than the GPU. For matrices larger than 1000x1000, the GPU exhibits some very large performance One large scale project can port all BLAS gains. It is up to 70 times faster with matrices on calls to CUBLAS in CLAPACK functions in the the order of 5000 rows when compared to a Core 2 form described above. Benchmarks can be Duo CPU. performed for this wider sample of linear algebra operations. This will provide better insight to the Additionally, NVIDIA’s new CUDA API advantages of GPU for various operations. offers a very convenient way for the programmer to leverage the GPU for non-graphical Additionally, porting CLAPACK to the computations. This also opens the potential for the GPU program space would provide more GPU and CPU to work in parallel where the GPU interesting results for a program specifically serves primarily as a math and the designed with the GPU in mind. This is more CPU coordinates tasks for the GPU or does some involved as it requires that the computations other useful work. outside of CUBLAS also use CUDA to execute directly on the GPU. However, this would Related Work minimize the memory copying overhead and “The Landscape of provide a more accurate benchmark of GPU Research: A View from Berkeley” Project provided performance. the inspiration for this exploration. One of their main goals is to explore a parallel landscape with A more detailed analysis of the processors with many cores. The GPU provides a benchmarks and further optimizations can be promising step towards their goal of thousands of performed on CULAPACK with more cores on a processor. specifications of the G80 processor from NVIDIA. With this information, the project can optimize Professor James Demmel in the Electrical allocations based on cache sizes in the GPU among Engineering and Department at and code ordering among other optimizations. the University of California, Berkeley also does Optimized performance for linear algebra mathematical work on GPUs. One of his research operations may open new areas for the VIEW projects involves extracting useful computational project. information from an ATI GPU with the DirectX graphics API. Acknowledgements The authors would like to thank Professor Vinay Bondhugula et al., currently is James Demmel from the UC Berkeley Electrical exploring Fast Singular Value Decomposition in Engineering and Computer Science department Graphics Processors in the Computer Science for his guidance on the project. Additionally, the authors would like to thank Professor Sara McMains from the UC Berkeley Mechanical Engineering department for supplying the NVIDIA Quadro FX 5600 graphics board used in this exploration. Finally, the authors would like to thank the CITRIS Tele-immersion Group for providing the used to host the NVIDIA Quadro FX 5600 card. The card requires two PCI- Express 16x power supply sources and around 12 inches of clearance for the card. Not many support these requirements.

References 1. “NVIDIA CUDA Compute Unified Device Architecture.” Available at http://developer.download.nvidia.com/ compute/cuda/0_8/NVIDIA_CUDA_Programm ing_Guide_0.8.pdf

2. “NVIDIA CUDA CUBLAS Library.” Available at http://developer.download.nvidia.com/ compute/cuda/0_8/NVIDIA_CUBLAS_Librar y_0.8.pdf

3. Linear Algebra Package. Available at http://www.netlib.org/lapack/

4. A. Stephin, Y. Lyssenko, and A. Shilov. “Directly Unified: Nvidia GeForce 8800 Architecture Review.” X-bit Laboratories. Available at http://www.xbitlabs.com/articles/vide o/display/gf8800.html

5. 2 Duo Processor Technical Details. Available at http://www.intel.com/design/core2duo/ documentation.htm