GPU Multiprocessing

GPU multiprocessing
Manuel Ujaldón Martínez
Computer Architecture Department, University of Malaga (Spain)

Outline
1. Multichip solutions [10 slides]
2. Multicard solutions [2 slides]
3. Multichip + multicard [3 slides]
4. Performance on matrix decompositions [2 slides]
5. CUDA programming [5 slides]
6. Scalability on 3DFD [4 slides]

A world of possibilities
From lower to higher cost, we have:
1. Multichip: Voodoo5 (3dfx), 3D1 (Gigabyte).
2. Multicard: SLI (Nvidia) / CrossFire (ATI).
3. Combination: two chips per card and/or two cards per connector.
(Pictured examples: Evans & Sutherland (2004), Gigabyte (2005), NVIDIA (2007), ATI (2007), NVIDIA (2008).)

I. Multichip solutions

First choice: Multichip. A retrospective
- Voodoo 5 5500 (3dfx, 1999) and Rage Fury Maxx (ATI, 2000).
- Volari V8 Duo (XGI, 2002) and Radeon 9800 Duo prototype (Sapphire, 2003).

First choice: Multichip. Example 1: 3D1 (Gigabyte, 2005)
A dual GeForce 6600 GT: two GPUs on the same card (December 2005), each endowed with 128 MB of memory and a 128-bit bus.

First choice: Multichip. Example 2: GeForce 7950 GX2 (Nvidia, 2006)

First choice: Multichip. Example 3: GeForce 9800 GX2 (Nvidia, 2008)
Dual GeForce 8800 GPU, dual printed circuit board and dual 512 MB video memory, all behind a single PCI-express connector.

First choice: Multichip. 3D1 (Gigabyte): cost and performance

                              3DMark 2003             3DMark 2005
  Card                        1024x768   1600x1200    1024x768   1600x1200
  1. GeForce 6600 GT            8234       2059         3534       2503
  2. 3D1 using a single GPU     8529       2063         3572       2262
  3. GeForce 6800 GT           11493       3846         4858       3956
  4. GeForce 6600 GT SLI       14049       3924         6122       3542
  5. 3D1 using two GPUs        14482       4353         6307       3609

Cost: row 3 > row 4 > row 5 > row 2 > row 1.

First choice: Multichip. 3D1 (Gigabyte): analysis
As compared to a single GeForce 6800 GT, the 3D1 has:
- Lower cost.
- Higher arithmetic performance; better at lower resolutions and with software innovations (shaders).
- Similar bandwidth.
- Lower memory space and usability: vertices and textures must be replicated, and a GPU cannot see the memory of its twin.
As compared to two GeForce 6600 GT cards connected through SLI, it has:
- Slightly lower cost.
- Greater performance, without demanding CPU bandwidth.
- Less versatility regarding future expansion and/or single-card use.

First choice: Multichip. GeForce 7950 GX2 (2006)
- Developed by Nvidia in June 2006. The GPU has a "twin soul" (duality affects the design).
- Clocks are slower than in the single-GPU model: GPU at 500 MHz (twin) versus 650 MHz (stand-alone); memory at 2x600 MHz (twin) versus 2x800 MHz (stand-alone).
- Drivers were released almost a year later, which initially penalized the popularity of this card.
- It offers 48 pixel processors (24 on each GPU) and 1 GB of video memory (512 MB attached to each GPU through a pair of 256-bit buses).

First choice: Multichip (2006). Transistors
A smaller chip built from smaller transistors allows growth through GPU replication.

First choice: Multichip (2006). Frequency
A dual GPU allows clocks to be relaxed, producing less heat and power consumption.

First choice: Multichip (2006). Bandwidth
Two GPUs placed on parallel planes make it easier to double the bus width to 512 bits.

II. Multicard solutions

Second choice: Multicard. A couple of GPUs
- SLI (Nvidia, on GeForces).
- CrossFire (ATI, on Radeons).

Second choice: Multicard. SLI (Nvidia): elements
- The motherboard must have several PCI-express 2.0 / PCI-express x16 slots.
- The power supply must deliver at least 700 Watts.
- Performance issues: a twin card may increase performance by 60-80%, while a new generation of GPUs may increase it even more, so the time frame becomes crucial!
III. Multichip + multicard

First + second choice: Multichip + multicard
- First solution available on the marketplace: Gigabyte (2005), based on GeForce 6 GPUs.
- It allows heterogeneous graphics cards, but workload balancing gets complicated.

First + second choice: Multichip + multicard. Newer designs
Combining GeForce 9800 GX2 cards with a motherboard offering several PCI-express slots allows configurations up to quad-SLI: 2 GPUs/card x up to 4 cards = 8 GPUs (that is, 2, 4 or 8 GPUs).

IV. Performance on matrix decompositions

Multicard performance versus a newer generation (LU decomposition)
A second (twin) GPU improves performance by 1.6x, but does not reach the performance of a single card of the next generation.

CPU+GPU performance versus a single quad-core CPU (more on this later)
The benchmark is composed of three popular matrix decompositions used in linear algebra.

V. CUDA programming for multi-GPU applications

Device management
The CPU can query and select GPU devices:
- cudaGetDeviceCount(int *count)
- cudaSetDevice(int device)
- cudaGetDevice(int *current_device)
- cudaGetDeviceProperties(cudaDeviceProp *prop, int device)
- cudaChooseDevice(int *device, cudaDeviceProp *prop)
Multi-GPU setup:
- Device 0 is used by default.
- One CPU thread can control only one GPU.
- Multiple CPU threads can control the same GPU; their calls are serialized by the driver.

Multiple CPU threads and CUDA
CUDA resources allocated by a CPU thread can be consumed only by CUDA calls issued from that same CPU thread. Violation example: CPU thread 2 allocates GPU memory and stores the address in p; thread 3 issues a CUDA call that accesses memory via p.

When using several GPUs, the implementation gets complicated
GPUs do not share video memory, so the programmer must move data across PCI-express, even when the GPUs belong to the same graphics card (as in the GeForce 9800 GX2). Steps to follow (see the sketch at the end of this section):
1. Copy data from GPU A to CPU thread A.
2. Copy data from CPU thread A to CPU thread B using MPI.
3. Copy data from CPU thread B to GPU B.
We can use asynchronous copies to overlap kernel execution on the GPU with the data copies, and pinned memory to share copies among CPU threads (use cudaHostAlloc()).

Host synchronization
- All kernel launches are asynchronous: control returns to the CPU immediately, and the kernel executes after all previous CUDA calls have completed.
- cudaMemcpy is synchronous: control returns to the CPU after the copy completes, and the copy starts after all previous CUDA calls have completed.
- cudaThreadSynchronize() blocks until all previous CUDA calls complete.

CPU ↔ GPU interactions: conclusions
- CPU ↔ GPU memory bandwidth is much lower than GPU memory bandwidth. Use page-locked host memory (cudaMallocHost()) for maximum CPU ↔ GPU bandwidth: 3.2 GB/s is common on PCI-e x16, and ~4 GB/s has been measured on nForce 680i chipsets (8 GB/s for PCI-e 2.0). Be cautious, however: allocating too much page-locked memory can reduce overall system performance.
- Minimize CPU ↔ GPU data transfers by moving more code from the CPU to the GPU, even if that means running kernels with low parallelism. Intermediate data structures can be allocated, operated on, and deallocated without ever copying them to CPU memory.
- Group data transfers: one large transfer is much better than many small ones.
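The three-step transfer pattern above can be summarized in a short sketch. The following is a minimal, illustrative example and not code from the talk: it assumes one MPI process per GPU, and the buffer size N_BYTES and the rank-pairing rule are arbitrary choices made for illustration. It combines device selection, pinned host memory (cudaHostAlloc()), asynchronous copies on a stream, and an MPI exchange between the two CPU processes.

```c
/* Illustrative sketch: one MPI rank per GPU, exchanging a buffer with a peer rank. */
#include <mpi.h>
#include <cuda_runtime.h>

#define N_BYTES (4 * 1024 * 1024)   /* illustrative buffer size */

int main(int argc, char **argv)
{
    int rank, nranks, ndevices;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Device management: select one GPU per CPU process
       (device 0 would be used by default if cudaSetDevice were never called). */
    cudaGetDeviceCount(&ndevices);
    cudaSetDevice(rank % ndevices);

    /* Page-locked (pinned) host memory maximizes CPU<->GPU bandwidth
       and allows asynchronous copies. */
    char *h_send, *h_recv, *d_buf;
    cudaHostAlloc((void **)&h_send, N_BYTES, cudaHostAllocDefault);
    cudaHostAlloc((void **)&h_recv, N_BYTES, cudaHostAllocDefault);
    cudaMalloc((void **)&d_buf, N_BYTES);
    /* ... in a real application, kernels launched here would fill d_buf ... */

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    /* Step 1: copy data from GPU A to its controlling CPU process
       (asynchronous, so it can overlap with kernels in other streams). */
    cudaMemcpyAsync(h_send, d_buf, N_BYTES, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    /* Step 2: copy data from CPU process A to CPU process B using MPI. */
    int peer = rank ^ 1;                       /* illustrative pairing: 0-1, 2-3, ... */
    if (peer >= nranks) peer = MPI_PROC_NULL;  /* last rank idles if nranks is odd   */
    MPI_Sendrecv(h_send, N_BYTES, MPI_CHAR, peer, 0,
                 h_recv, N_BYTES, MPI_CHAR, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Step 3: copy the received data from CPU process B to GPU B. */
    cudaMemcpyAsync(d_buf, h_recv, N_BYTES, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_send);
    cudaFreeHost(h_recv);
    MPI_Finalize();
    return 0;
}
```

Because the host buffers are page-locked, the asynchronous copies could overlap with kernels running in other streams; the explicit stream synchronizations mark the points where the data must be complete before MPI, or the GPU, reads it.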
VI. Scalability for 3DFD (Nvidia code)

Example: multi-GPU implementation of 3DFD
- 3DFD is a finite-difference code for the discretization of the seismic wave equation: 8th order in space, 2nd order in time, on a regular mesh.
- The X and Y dimensions are fixed and Z varies. Data is partitioned among GPUs along the Z axis, so computation grows with Z while communication (per node) stays constant.
- Each GPU has to exchange 4 xy-planes (ghost nodes) with each of its neighbors (see the sketch at the end of this section).
- Executed on a cluster with 2 GPUs per node and an Infiniband SDR network.

Performance with a couple of GPUs
Linear scaling is achieved when computation time exceeds communication time.

Three or more cluster nodes
Times are per cluster node. At least one cluster node needs two MPI communications, one with each of its neighbors.

Performance with 8 GPUs
The 8x improvement factor is sustained for Z > 1300, exactly where computation exceeds communication.
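To make the Z-axis partitioning concrete, here is a hedged sketch of the ghost-plane exchange; it is not the original Nvidia 3DFD source. The grid sizes NX and NY, the RADIUS of 4 planes (matching the 8th-order stencil), the field layout with RADIUS ghost planes on each side, and the function and buffer names are all assumptions made for illustration; the pinned buffer h_halo is presumed to have been allocated with cudaHostAlloc() by the caller.

```c
/* Illustrative sketch (not the original 3DFD code): exchange of 4 ghost
   xy-planes with the lower and upper neighbors along the Z axis. */
#include <mpi.h>
#include <cuda_runtime.h>

#define NX 480          /* fixed X dimension (illustrative) */
#define NY 480          /* fixed Y dimension (illustrative) */
#define RADIUS 4        /* 8th order in space -> 4 ghost planes per side */
#define PLANE (NX * NY)

/* d_field holds RADIUS lower ghost planes, local_nz interior planes,
   and RADIUS upper ghost planes. h_halo is a pinned buffer of
   4 * RADIUS * PLANE floats. */
void exchange_ghost_planes(float *d_field, float *h_halo, int local_nz,
                           int rank, int nranks, cudaStream_t stream)
{
    size_t halo_bytes = (size_t)RADIUS * PLANE * sizeof(float);
    int lower = (rank > 0) ? rank - 1 : MPI_PROC_NULL;
    int upper = (rank < nranks - 1) ? rank + 1 : MPI_PROC_NULL;

    float *h_send_lo = h_halo;                       /* pinned sub-buffers */
    float *h_recv_lo = h_halo + 1 * RADIUS * PLANE;
    float *h_send_hi = h_halo + 2 * RADIUS * PLANE;
    float *h_recv_hi = h_halo + 3 * RADIUS * PLANE;

    /* Download the boundary planes that the neighbors need. */
    cudaMemcpyAsync(h_send_lo, d_field + RADIUS * PLANE, halo_bytes,
                    cudaMemcpyDeviceToHost, stream);
    cudaMemcpyAsync(h_send_hi, d_field + (size_t)local_nz * PLANE, halo_bytes,
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    /* Swap halos with each neighbor; the volume is constant per step,
       independent of local_nz. Boundary ranks simply keep whatever is
       in their outer ghost planes (MPI_PROC_NULL makes those calls no-ops). */
    MPI_Sendrecv(h_send_lo, RADIUS * PLANE, MPI_FLOAT, lower, 0,
                 h_recv_hi, RADIUS * PLANE, MPI_FLOAT, upper, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(h_send_hi, RADIUS * PLANE, MPI_FLOAT, upper, 1,
                 h_recv_lo, RADIUS * PLANE, MPI_FLOAT, lower, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Upload the received ghost planes back to the GPU. */
    cudaMemcpyAsync(d_field, h_recv_lo, halo_bytes,
                    cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(d_field + (size_t)(local_nz + RADIUS) * PLANE, h_recv_hi,
                    halo_bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);
}
```

The volume exchanged per step is 2 x RADIUS x NX x NY values regardless of local_nz, which is why the slides observe linear scaling once the per-GPU computation (which grows with Z) exceeds this fixed communication cost.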