Linear Algebra Package on the NVIDIA G80 Processor

Linear Algebra PACKage on the NVIDIA G80 Processor Robert Liao and Tracy Wang Computer Science 252 Project University of California, Berkeley Berkeley, CA 94720 liao_r [at] berkeley.edu, tracyx [at] berkeley.edu Abstract primarily on running instructions faster. This brought forth the frequency race between The race for performance improvement microprocessor manufacturers. Programmers previously depended on running one application could write code, wait 6 months, and their code or sequence of codereally quickly. Now, the would suddenly run faster. Additionally, many processor industry and universities focus on techniques for increasing instruction level running many things really quickly at the same parallelism were introduced to improve time in an idea collectively known as parallelism. performance. Today, the next performance Traditional microprocessors achieve this by obstacle is not clear, but the general consensus is building more cores onto the architecture. A single that the next big thing includes putting many processor may have as many as 2 to 8 cores that processing cores on a processor. This will enable can execute independently of one another. applications to take advantage of some higher However, many of these processor systems form of parallelism beyond instruction parallelism. include an underutilized graphics processing unit. NVIDIA’s G80 Processor represents the first step During all of this time, Graphics to improve performance of an application through Processing Units (GPUs) have been doing many the use of the inherent parallel structure of a operations in parallel due to the relatively graphics processing unit. independent nature of its computations. Graphics scenes can be decomposed into objects, which can This paper explores the performance of a be decomposed into various rendering steps that subset of the Linear Algebra Package running on are independent from one another. However, this the NVIDIA G80 Processor. Results from the led to a very specialized processor that was exploration show that if utilized properly, the optimized for rendering. performance of linear algebra operations is improved by a factor of 70 with a suitable input Performing computations on a GPU is a size. Additionally, the paper discusses the issues relatively new field full of opportunities thanks to involved with running general programs on the the NVIDIA G80 GPU. This processor represents GPU, a relatively new ability provided by the G80 one of the first processors to expose as many as Processor. Finally, the paper discusses limitations 128 computation cores to the programmer. As a in terms of the use of a GPU. result, a comparison with respect to the CPU to see how the GPU performs is more than 1 Introduction appropriate to determine if this is right direction for parallelism. The effort to improve performance on microprocessors for much of the 1990s focused Though any program can be loaded onto the Central Processing Unit (CPU). The the GPU for this exploration, we decided to relationship between the CPU and GPU is closely benchmark linear algebra operations as a starting intertwined as most of the output from a point to evaluate the GPU for other research areas computer is received through video. In the past, like the Berkeley View. The Berkeley View has architects optimized the CPU and GPU bus routes compiled many kernels called dwarves that they because of the high demand to display video to an think should perform well on parallel platforms. output device like a monitor or LCD. These include dense and sparse matrix operations. Many programs can be reduced purely to these GPU manufacturers have kept up with the dwarves. As a starting point to examining efficacy demand by offering more advanced capabilities in of using the GPU in this capacity, we benchmark their GPUs beyond putting text and windows to a the performance of general linear algebra screen. GPUs today are typically capable of taking operations. geometric information in the form of polygons from an application like a game and performing This paper is organized as follows. Section many different transformations to provide some 2 provides a general introduction to traditional sort of realistic or artistic output. GPUs and how they worked prior to the NVIDIA G80 processor. Section 3 discusses details about This video processing is embarrassingly the NVIDIA G80 processor and outlines its parallel. The representation of a pixel on the capabilities. Section 4 discusses the benchmarks screen often can be rendered independently of used to profile the performance of the G80 with other pixels. As a result, GPU manufacturers have respect to two CPUs used in this paper. Section 5 provided many superscalar features in their provides results, discussion, and speculation on processors to take advantage of this parallelism. the performance runs on the GPU. Section 6 brings This push for parallelism has come to a point forth the issues associated with GPU computing where a GPU is basically a specialized vector along with a discussion on issues on running processor. applications on the G80 platform. Finally, the paper concludes in Section 7 with a summary of Motivation for Change the results, future directions for research, as well A fixed pipeline characterizes the as related works on GPUs. traditional GPU. Many have a fixed number of special shaders such as vertex shaders and pixel 2 Traditional GPUs shaders. NVIDIA noticed that during certain rendering scenarios, many of the specialized Background shaders remain dormant. For instance, a scene with many geometric features will use many Many modern personal computing and vertex shaders, but not very many pixel shaders. workstation architectures include a GPU to off- As a result, NVIDIA began to look for a load the task of rendering graphical objects from reconfigurable solution. Figure 1 shows a typical high-altitude pipeline of a GPU. Data flows forward from the CPU through the GPU and ultimately on to the display. GPUs typically contain many of these pipelines to process scenes in parallel. Additionally, the pipeline is designed to flow Figure 1: The Traditional GPU Pipeline forward, and as a result, certain stages of the pipeline have features like write-only registers to Each processor also has local memory as avoid hazards like read after write hazards found well as shared memory with other processors. in typical CPU pipelines. According to the NVIDIA guide, accessing local and shared memory on-chip is as fast as accessing Additionally, the vector processor right registers. next to the CPU is quiet during heavy computations performed on the CPU. Most Compute Unified Device Architecture developers do not send parallelizable The Compute Unified Device Architecture computations to the GPU because the APIs make (CUDA) is NVIDIA’s API for exposing the it too difficult to do so. The typical interfaces like processing features of the G80 GPU. This C OpenGL and DirectX are designed for graphics, Language API provides services ranging from not computation. As a result, the programmer common GPU operations in the CUDA Library to cannot tap into the GPU’s vast vector resources. traditional C memory management semantics in the CUDA runtime and device driver layers. 3 The NVIDIA G80 GPU Additionally, NVIDIA provides a specialized C The G80 GPU is found in NVIDIA’s compiler to build programs targeted for the GPU. GeForce 8 Series graphics cards as well as the NVIDIA Quadro FX 4600 and 5600. The NVIDIA Code compiled for the GPU is executed on Quadro FX 5600 is the card used in this the GPU. Likewise, memory allocated on the GPU exploration. resides on the GPU. This introduces complications in interfacing programs running in CPU space Architecture with programs running in GPU space. The programmer must keep track of the pointers used The G80 GPU is NVIDIA’s answer to in each processor. Many programs, including the many of the aforementioned concerns and issues. ones used to benchmark the GPU in this paper, It represents a large departure from traditional reasonably assume that all pointers and execution GPU architectures. A block diagram of the code reside on one memory space and one architecture is shown in Figure 2. The GPU execution unit. Porting this style of programming contains 8 blocks of 16 stream processors with a to a separate format is a non-trivial task. total of 128 stream processors. Each stream processor can execute floating point instructions. From the block below, each group of 16 shares a L1 cache. From there, each block has access to 6 L2 caches. This architecture arrangement also allows one processor to directly feed results into another processor for continued stream processing. Each processor can be configured to be a part of some shader unit in the traditional GPU sense. This reconfigurability also means that the processors can be dedicated to performing general computations. This capability is exposed in NVIDIA’s Compute Unified Device Architecture. Figure 2: The NVIDIA G80 Graphics Processor Architecture quickly LAPACK can perform this factorization with respect to matrix size in both the CPU and GPU. BLAS and CUBLAS The LAPACK tools rely on the Basic Linear Algebra Subprograms (BLAS) library. Figure 3: Organization of the modules. These subprograms are a set of primitive operations that operate on matrices. The original A Note on Scarce Specifications BLAS can be run on the CPU. NVIDIA provides its Due to the secretive nature of the industry, own version called CUBLAS (Compute Unified NVIDIA has not released much information about BLAS). CUBLAS is designed to run on the G80 the G80 processor beyond a high level overview. GPU, and abstracts much of the CUDA As a result, we can only speculate on specific like programming API in a succinct mathematical L1 and L2 cache sizes in this benchmark. package. The only major change is the inclusion of allocation and freeing function to deal with the 4Benchmarking separation of CPU and GPU memory.

Linear Algebra Package on the NVIDIA G80 Processor

Gs-35F-4677G

Powervr SGX Series5xt IP Core Family

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27

Data Sheet: Quadro GV100

Driver Riva Tnt2 64

(GPU) Computing

NVIDIA Quadro P4000

Nvidia Tesla P40 Gpu Accelerator

A Configurable General Purpose Graphics Processing Unit for Power, Performance, and Area Analysis

4 Reasons Why Pny Is the Right Choice Nvidia Quadro Rtx 8000

Datasheet Quadro K600

NVIDIA Quadro P620