
TESLA: A UNIFIED GRAPHICS AND COMPUTING ARCHITECTURE

To enable flexible, programmable graphics and high-performance computing, NVIDIA has developed the Tesla scalable unified graphics and parallel computing architecture. Its scalable parallel array of processors is massively multithreaded and programmable in C or via graphics APIs.

Erik Lindholm
John Nickolls
Stuart Oberman
John Montrym
NVIDIA

The modern 3D graphics processing unit (GPU) has evolved from a fixed-function graphics pipeline to a programmable parallel processor with computing power exceeding that of multicore CPUs. Traditional graphics pipelines consist of separate programmable stages of vertex processors executing vertex shader programs and pixel-fragment processors executing pixel shader programs. (Montrym and Moreton provide additional background on the traditional graphics processor architecture.1)

NVIDIA's Tesla architecture, introduced in November 2006 in the GeForce 8800 GPU, unifies the vertex and pixel processors and extends them, enabling high-performance parallel computing applications written in the C language using the Compute Unified Device Architecture (CUDA) parallel programming model and development tools.2-4 The Tesla unified graphics and computing architecture is available in a scalable family of GeForce 8-series GPUs and Quadro GPUs for laptops, desktops, workstations, and servers. It also provides the processing architecture for the Tesla GPU computing platforms introduced in 2007 for high-performance computing.

In this article, we discuss the requirements that drove the unified graphics and computing processor architecture, describe the Tesla architecture, and show how it is enabling widespread deployment of parallel computing and graphics applications.

The road to unification
The first GPU was the GeForce 256, introduced in 1999. It contained a fixed-function 32-bit floating-point vertex transform and lighting processor and a fixed-function integer pixel-fragment pipeline, which were programmed with OpenGL and the DX7 API.5 In 2001, the GeForce 3 introduced the first programmable vertex processor executing vertex shaders, along with a configurable 32-bit floating-point fragment pipeline, programmed with DX8 and OpenGL.6 The Radeon 9700, introduced in 2002, featured a programmable 24-bit floating-point pixel-fragment processor programmed with DX9 and OpenGL.7,8 The GeForce FX added 32-bit floating-point pixel-fragment processors. The Xbox 360 introduced an early unified GPU in 2005, allowing vertices and pixels to execute on the same processor.9


Vertex processors operate on the vertices of primitives such as points, lines, and triangles. Typical operations include transforming coordinates into screen space, which are then fed to the setup unit and the rasterizer, and setting up lighting and texture parameters to be used by the pixel-fragment processors. Pixel-fragment processors operate on rasterizer output, which fills the interior of primitives, along with the interpolated parameters.

Vertex and pixel-fragment processors have evolved at different rates: Vertex processors were designed for low-latency, high-precision math operations, whereas pixel-fragment processors were optimized for high-latency, lower-precision texture filtering. Vertex processors have traditionally supported more-complex processing, so they became programmable first. For the last six years, the two processor types have been functionally converging as the result of a need for greater programming generality. However, the increased generality also increased the design complexity, area, and cost of developing two separate processors.

Because GPUs typically must process more pixels than vertices, pixel-fragment processors traditionally outnumber vertex processors by about three to one. However, typical workloads are not well balanced, leading to inefficiency. For example, with large triangles, the vertex processors are mostly idle, while the pixel processors are fully busy. With small triangles, the opposite is true. The addition of more-complex primitive processing in DX10 makes it much harder to select a fixed processor ratio.10 All these factors influenced the decision to design a unified architecture.

A primary design objective for Tesla was to execute vertex and pixel-fragment shader programs on the same unified processor architecture. Unification would enable dynamic load balancing of varying vertex- and pixel-processing workloads and permit the introduction of new graphics shader stages, such as geometry shaders in DX10. It also let a single team focus on designing a fast and efficient processor and allowed the sharing of expensive hardware such as the texture units. The generality required of a unified processor opened the door to a completely new GPU parallel-computing capability. The downside of this generality was the difficulty of efficient load balancing between different shader types.

Other critical hardware design requirements were architectural scalability, performance, power, and area efficiency.

The Tesla architects developed the graphics feature set in coordination with the development of the Microsoft DirectX 10 graphics API.10 They developed the GPU's computing feature set in coordination with the development of the CUDA C parallel programming language, compiler, and development tools.

Tesla architecture
The Tesla architecture is based on a scalable processor array. Figure 1 shows a block diagram of a GeForce 8800 GPU with 128 streaming-processor (SP) cores organized as 16 streaming multiprocessors (SMs) in eight independent processing units called texture/processor clusters (TPCs). Work flows from top to bottom, starting at the host interface with the system PCI-Express bus. Because of its unified-processor design, the physical Tesla architecture doesn't resemble the logical order of graphics pipeline stages. However, we will use the logical graphics pipeline flow to explain the architecture.

Figure 1. Tesla unified graphics and computing GPU architecture. TPC: texture/processor cluster; SM: streaming multiprocessor; SP: streaming processor; Tex: texture; ROP: raster operation processor.

At the highest level, the GPU's scalable streaming processor array (SPA) performs all the GPU's programmable calculations. The scalable memory system consists of external DRAM control and fixed-function raster operation processors (ROPs) that perform color and depth frame buffer operations directly on memory. An interconnection network carries computed pixel-fragment colors and depth values from the SPA to the ROPs. The network also routes texture memory read requests from the SPA to DRAM and read data from DRAM through a level-2 cache back to the SPA.

The remaining blocks in Figure 1 deliver input work to the SPA. The input assembler collects vertex work as directed by the input command stream.

The vertex work distribution block distributes vertex work packets to the various TPCs in the SPA. The TPCs execute vertex shader programs, and (if enabled) geometry shader programs. The resulting output data is written to on-chip buffers. These buffers then pass their results to the viewport/clip/setup/raster/zcull block to be rasterized into pixel fragments. The pixel work distribution unit distributes pixel fragments to the appropriate TPCs for pixel-fragment processing. Shaded pixel-fragments are sent across the interconnection network for processing by depth and color ROP units. The compute work distribution block dispatches compute thread arrays to the TPCs. The SPA accepts and processes work for multiple logical streams simultaneously. Multiple clock domains for GPU units, processors, DRAM, and other units allow independent power and performance optimizations.

Command processing
The GPU host interface unit communicates with the host CPU, responds to commands from the CPU, fetches data from system memory, checks command consistency, and performs context switching.

The input assembler collects geometric primitives (points, lines, triangles, line strips, and triangle strips) and fetches associated vertex input attribute data. It has peak rates of one primitive per clock and eight scalar attributes per clock at the GPU core clock, which is typically 600 MHz.

The work distribution units forward the input assembler's output stream to the array of processors, which execute vertex, geometry, and pixel shader programs, as well as computing programs. The vertex and compute work distribution units deliver work to processors in a round-robin scheme.


Figure 2. Texture/processor cluster (TPC).

Pixel work distribution is based on the pixel location.

Streaming processor array
The SPA executes graphics shader thread programs and GPU computing programs and provides thread control and management. Each TPC in the SPA roughly corresponds to a quad-pixel unit in previous architectures.1 The number of TPCs determines a GPU's programmable processing performance and scales from one TPC in a small GPU to eight or more TPCs in high-performance GPUs.

Texture/processor cluster
As Figure 2 shows, each TPC contains a geometry controller, an SM controller (SMC), two streaming multiprocessors (SMs), and a texture unit. Figure 3 expands each SM to show its eight SP cores. To balance the expected ratio of math operations to texture operations, one texture unit serves two SMs. This architectural ratio can vary as needed.

Geometry controller
The geometry controller maps the logical graphics vertex pipeline into recirculation on the physical SMs by directing all primitive and vertex attribute and topology flow in the TPC. It manages dedicated on-chip input and output vertex attribute storage and forwards contents as required.

DX10 has two stages dealing with vertex and primitive processing: the vertex shader and the geometry shader. The vertex shader processes one vertex's attributes independently of other vertices. Typical operations are position space transforms and color and texture coordinate generation. The geometry shader follows the vertex shader and deals with a whole primitive and its vertices. Typical operations are edge extrusion for stencil shadow generation and cube map texture generation. Geometry shader output primitives go to later stages for clipping, viewport transformation, and rasterization into pixel fragments.

Streaming multiprocessor
The SM is a unified graphics and computing multiprocessor that executes vertex, geometry, and pixel-fragment shader programs and parallel computing programs. As Figure 3 shows, the SM consists of eight streaming processor (SP) cores, two special-function units (SFUs), a multithreaded instruction fetch and issue unit (MT Issue), an instruction cache, a read-only constant cache, and a 16-Kbyte read/write shared memory.

Figure 3. Streaming multiprocessor (SM).

The shared memory holds graphics input buffers or shared data for parallel computing. To pipeline graphics workloads through the SM, vertex, geometry, and pixel threads have independent input and output buffers. Workloads can arrive and depart independently of thread execution. Geometry threads, which generate variable amounts of output per thread, use separate output buffers.

Each SP core contains a scalar multiply-add (MAD) unit, giving the SM eight MAD units. The SM uses its two SFU units for transcendental functions and attribute interpolation—the interpolation of pixel attributes from vertex attributes defining a primitive. Each SFU also contains four floating-point multipliers. The SM uses the TPC texture unit as a third execution unit and uses the SMC and ROP units to implement external memory load, store, and atomic accesses. A low-latency interconnect network between the SPs and the shared-memory banks provides shared-memory access.

The GeForce 8800 Ultra clocks the SPs and SFU units at 1.5 GHz, for a peak of 36 Gflops per SM. To optimize power and area efficiency, some SM non-data-path units operate at half the SP clock rate.

SM multithreading. A graphics vertex or pixel shader is a program for a single thread that describes how to process a vertex or a pixel. Similarly, a CUDA kernel is a C program for a single thread that describes how one thread computes a result. Graphics and computing applications instantiate many parallel threads to render complex images and compute large result arrays. To dynamically balance shifting vertex and pixel shader thread workloads, the unified SM concurrently executes different thread programs and different types of shader programs.

To efficiently execute hundreds of threads in parallel while running several different programs, the SM is hardware multithreaded. It manages and executes up to 768 concurrent threads in hardware with zero scheduling overhead.

To support the independent vertex, primitive, pixel, and thread programming model of graphics shading languages and the CUDA C/C++ language, each SM thread has its own thread execution state and can execute an independent code path. Concurrent threads of computing programs can synchronize at a barrier with a single SM instruction. Lightweight thread creation, zero-overhead thread scheduling, and fast barrier synchronization support very fine-grained parallelism efficiently.


Single-instruction, multiple-thread. To manage and execute hundreds of threads running several different programs efficiently, the Tesla SM uses a new processor architecture we call single-instruction, multiple-thread (SIMT). The SM's SIMT multithreaded instruction unit creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps. The term warp originates from weaving, the first parallel-thread technology. Figure 4 illustrates SIMT scheduling. The SIMT warp size of 32 parallel threads provides efficiency on plentiful fine-grained pixel threads and computing threads. Each SM manages a pool of 24 warps, with a total of 768 threads.

Figure 4. Single-instruction, multiple-thread (SIMT) warp scheduling.

Individual threads composing a SIMT warp are of the same type and start together at the same program address, but they are otherwise free to branch and execute independently. At each instruction issue time, the SIMT multithreaded instruction unit selects a warp that is ready to execute and issues the next instruction to that warp's active threads. A SIMT instruction is broadcast synchronously to a warp's active parallel threads; individual threads can be inactive due to independent branching or predication.

The SM maps the warp threads to the SP cores, and each thread executes independently with its own instruction address and register state. A SIMT processor realizes full efficiency and performance when all 32 threads of a warp take the same execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads reconverge to the original execution path. The SM uses a branch synchronization stack to manage independent threads that diverge and converge. Branch divergence only occurs within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths. As a result, Tesla architecture GPUs are dramatically more efficient and flexible on branching code than previous-generation GPUs, as their 32-thread warps are much narrower than the SIMD width of prior GPUs.1

The SIMT architecture is similar to single-instruction, multiple-data (SIMD) design, which applies one instruction to multiple data lanes. The difference is that SIMT applies one instruction to multiple independent threads in parallel, not just multiple data lanes. A SIMD instruction controls a vector of multiple data lanes together and exposes the vector width to the software, whereas a SIMT instruction controls the execution and branching behavior of one thread.

In contrast to SIMD vector architectures, SIMT enables programmers to write thread-level parallel code for independent threads as well as data-parallel code for coordinated threads. For program correctness, programmers can essentially ignore SIMT execution attributes such as warps; however, they can achieve substantial performance improvements by writing code that seldom requires threads in a warp to diverge. In practice, this is analogous to the role of cache lines in traditional codes: Programmers can safely ignore cache line size when designing for correctness but must consider it in the code structure when designing for peak performance. SIMD vector architectures, on the other hand, require the software to manually coalesce loads into vectors and to manually manage divergence.
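To make the divergence behavior concrete, here is a hedged CUDA sketch; the kernels and their names are illustrative, not taken from the article. The first kernel contains a data-dependent branch, so a warp whose 32 threads disagree on the condition executes both paths serially. The second computes the same result with a conditional select that the compiler can predicate, keeping the whole warp on one path.

// Illustrative CUDA kernels (hypothetical example, not from the article).
// A warp whose threads disagree on the sign test executes both branch
// paths serially, with the off-path threads disabled.
__global__ void scale_divergent(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (data[i] < 0.0f)              // data-dependent branch: may diverge
            data[i] = -2.0f * data[i];
        else
            data[i] = 0.5f * data[i];
    }
}

// Same result written as a conditional select; every thread follows one
// path, so no warp divergence occurs.
__global__ void scale_uniform(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float s = (data[i] < 0.0f) ? -2.0f : 0.5f;
        data[i] *= s;
    }
}

For a body this short the cost either way is small; the serialization matters when the divergent paths are long or unbalanced.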
SIMT warp scheduling. The SIMT approach of scheduling independent warps is simpler than previous GPU architectures' complex scheduling. A warp consists of up to 32 threads of the same type—vertex, geometry, pixel, or compute. The basic unit of pixel-fragment shader processing is the 2 × 2 pixel quad. The SM controller groups eight pixel quads into a warp of 32 threads. It similarly groups vertices and primitives into warps and packs 32 computing threads into a warp. The SIMT design shares the SM instruction fetch and issue unit efficiently across 32 threads but requires a full warp of active threads for full performance efficiency.

As a unified graphics processor, the SM schedules and executes multiple warp types concurrently—for example, concurrently executing vertex and pixel warps. The SM warp scheduler operates at half the 1.5-GHz processor clock rate. At each cycle, it selects one of the 24 warps to execute a SIMT warp instruction, as Figure 4 shows. An issued warp instruction executes as two sets of 16 threads over four processor cycles. The SP cores and SFU units execute instructions independently, and by issuing instructions between them on alternate cycles, the scheduler can keep both fully occupied.

Implementing zero-overhead warp scheduling for a dynamic mix of different warp programs and program types was a challenging design problem. A scoreboard qualifies each warp for issue each cycle. The instruction scheduler prioritizes all ready warps and selects the one with highest priority for issue. Prioritization considers warp type, instruction type, and "fairness" to all warps executing in the SM.

SM instructions. The Tesla SM executes scalar instructions, unlike previous GPU vector instruction architectures. Shader programs are becoming longer and more scalar, and it is increasingly difficult to fully occupy even two components of the prior four-component vector architecture. Previous architectures employed vector packing—combining sub-vectors of work to gain efficiency—but that complicated the scheduling hardware as well as the compiler. Scalar instructions are simpler and compiler friendly. Texture instructions remain vector based, taking a source coordinate vector and returning a filtered color vector.

High-level graphics and computing-language compilers generate intermediate instructions, such as DX10 vector or PTX scalar instructions,10,2 which are then optimized and translated to binary GPU instructions. The optimizer readily expands DX10 vector instructions to multiple Tesla SM scalar instructions. PTX scalar instructions optimize to Tesla SM scalar instructions about one to one. PTX provides a stable target ISA for compilers and provides compatibility over several generations of GPUs with evolving binary instruction set architectures. Because the intermediate languages use virtual registers, the optimizer analyzes data dependencies and allocates real registers. It eliminates dead code, folds instructions together when feasible, and optimizes SIMT branch divergence and convergence points.

Instruction set architecture. The Tesla SM has a register-based instruction set including floating-point, integer, bit, conversion, transcendental, flow control, memory load/store, and texture operations.

Floating-point and integer operations include add, multiply, multiply-add, minimum, maximum, compare, set predicate, and conversions between integer and floating-point numbers. Floating-point instructions provide source operand modifiers for negation and absolute value. Transcendental function instructions include cosine, sine, binary exponential, binary logarithm, reciprocal, and reciprocal square root. Attribute interpolation instructions provide efficient generation of pixel attributes. Bitwise operators include shift left, shift right, logic operators, and move.


Control flow includes branch, call, return, trap, and barrier synchronization. The floating-point and integer instructions can also set per-thread status flags for zero, negative, carry, and overflow, which the thread program can use for conditional branching.

Memory access instructions. The texture instruction fetches and filters texture samples from memory via the texture unit. The ROP unit writes pixel-fragment output to memory.

To support computing and C/C++ language needs, the Tesla SM implements memory load/store instructions in addition to graphics texture fetch and pixel output. Memory load/store instructions use integer byte addressing with register-plus-offset address arithmetic to facilitate conventional compiler code optimizations.

For computing, the load/store instructions access three read/write memory spaces:

- local memory for per-thread, private, temporary data (implemented in external DRAM);
- shared memory for low-latency access to data shared by cooperating threads in the same SM; and
- global memory for data shared by all threads of a computing application (implemented in external DRAM).

The memory instructions load-global, store-global, load-shared, store-shared, load-local, and store-local access global, shared, and local memory. Computing programs use the fast barrier synchronization instruction to synchronize threads within the SM that communicate with each other via shared and global memory.

To improve memory bandwidth and reduce overhead, the local and global load/store instructions coalesce individual parallel thread accesses from the same warp into fewer memory block accesses. The addresses must fall in the same block and meet alignment criteria. Coalescing memory requests boosts performance significantly over separate requests. The large thread count, together with support for many outstanding load requests, helps cover load-to-use latency for local and global memory implemented in external DRAM.

The latest Tesla architecture GPUs provide efficient atomic memory operations, including integer add, minimum, maximum, logic operators, swap, and compare-and-swap operations. Atomic operations facilitate parallel reductions and parallel data structure management.
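As an illustration of the coalescing rule just described, the following hedged CUDA sketch (kernel names and the stride parameter are hypothetical) contrasts a warp whose 32 threads read 32 consecutive words, which the hardware can merge into a few block transfers, with a strided pattern whose accesses land in many different blocks and cannot be merged.

// Hypothetical CUDA kernels contrasting coalesced and strided global accesses.

// Coalesced: thread k of a warp accesses element (warp base + k), so the
// warp's 32 loads and stores fall in consecutive addresses of one block.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: adjacent threads touch addresses 'stride' elements apart, so the
// warp's accesses spread across many memory blocks and issue separately.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}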
Streaming processor. The SP core is the primary thread processor in the SM. It performs the fundamental floating-point operations, including add, multiply, and multiply-add. It also implements a wide variety of integer, comparison, and conversion operations. The floating-point add and multiply operations are compatible with the IEEE 754 standard for single-precision FP numbers, including not-a-number (NaN) and infinity values. The unit is fully pipelined, and latency is optimized to balance delay and area.

The add and multiply operations use IEEE round-to-nearest-even as the default rounding mode. The multiply-add operation performs a multiplication with truncation, followed by an add with round-to-nearest-even. The SP flushes denormal source operands to sign-preserved zero and flushes results that underflow the target output exponent range to sign-preserved zero after rounding.

Special-function unit. The SFU supports computation of both transcendental functions and planar attribute interpolation.11 A traditional vertex or pixel shader design contains a functional unit to compute transcendental functions. Pixels also need an attribute-interpolating unit to compute the per-pixel attribute values at the pixel's (x, y) location, given the attribute values at the primitive's vertices.

For functional evaluation, we use quadratic interpolation based on enhanced minimax approximations to approximate the reciprocal, reciprocal square root, log2(x), 2^x, and sin/cos functions. Table 1 shows the accuracy of the function estimates. The SFU unit generates one 32-bit floating-point result per cycle.
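As background for the accuracy figures in Table 1, the general form of table-based quadratic approximation (the technique named above; the exact SFU datapath and coefficient tables are described in reference 11 and are not reproduced here) splits the argument x into upper bits x_u, which index small coefficient tables, and lower bits x_l, which enter a degree-2 polynomial:

f(x) ≈ C0(x_u) + C1(x_u) × x_l + C2(x_u) × x_l²

where the per-interval coefficients C0, C1, and C2 are chosen by minimax optimization so that the worst-case error stays within bounds such as those Table 1 reports.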

Table 1. Function approximation statistics.

Function     Input interval   Accuracy (good bits)   ULP error*   % exactly rounded   Monotonic
1/x          [1, 2)           24.02                  0.98         87                  Yes
1/sqrt(x)    [1, 4)           23.40                  1.52         78                  Yes
2^x          [0, 1)           22.51                  1.41         74                  Yes
log2(x)      [1, 2)           22.57                  N/A**        N/A                 Yes
sin/cos      [0, π/2)         22.47                  N/A          N/A                 No

* ULP: unit-in-the-last-place.
** N/A: not applicable.

The SFU also supports attribute interpolation, to enable accurate interpolation of attributes such as color, depth, and texture coordinates. The SFU must interpolate these attributes in the (x, y) screen space to determine the values of the attributes at each pixel location. We express the value of a given attribute U in an (x, y) plane in plane equations of the following form:

U(x, y) = (A_U × x + B_U × y + C_U) / (A_W × x + B_W × y + C_W)

where A, B, and C are interpolation parameters associated with each attribute U, and W is related to the distance of the pixel from the viewer for perspective projection. The attribute interpolation hardware in the SFU is fully pipelined, and it can interpolate four samples per cycle.

In a shader program, the SFU can generate perspective-corrected attributes as follows (a small C sketch of these steps appears at the end of this section):

- Interpolate 1/W, and invert to form W.
- Interpolate U/W.
- Multiply U/W by W to form perspective-correct U.

SM controller. The SMC controls multiple SMs, arbitrating the shared texture unit, load/store path, and I/O path. The SMC serves three graphics workloads simultaneously: vertex, geometry, and pixel. It packs each of these input types into the warp width, initiating shader processing, and unpacks the results.

Each input type has independent I/O paths, but the SMC is responsible for load balancing among them. The SMC supports static and dynamic load balancing based on driver-recommended allocations, current allocations, and relative difficulty of additional resource allocation. Load balancing of the workloads was one of the more challenging design problems due to its impact on overall SPA efficiency.

Texture unit
The texture unit processes one group of four threads (vertex, geometry, pixel, or compute) per cycle. Texture instruction sources are texture coordinates, and the outputs are filtered samples, typically a four-component (RGBA) color. Texture is a separate unit external to the SM, connected via the SMC. The issuing SM thread can continue execution until a data dependency stall.

Each texture unit has four texture address generators and eight filter units, for a peak GeForce 8800 Ultra rate of 38.4 gigabilerps/s (a bilerp is a bilinear interpolation of four samples). Each unit supports full-speed 2:1 anisotropic filtering, as well as high-dynamic-range (HDR) 16-bit and 32-bit floating-point data format filtering.

The texture unit is deeply pipelined. Although it contains a cache to capture filtering locality, it streams hits mixed with misses without stalling.
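The three perspective-correction steps listed in the SFU discussion above can be written out directly in scalar code. The following hedged C sketch (a hypothetical helper, not hardware or driver code) evaluates the two plane equations at a pixel's screen position and recovers the perspective-correct attribute; its coefficients correspond to A_U, B_U, C_U and A_W, B_W, C_W in the equation above.

/* Hypothetical illustration of perspective-correct attribute interpolation.
 * (a_u, b_u, c_u) are the plane-equation coefficients for the attribute term
 * and (a_w, b_w, c_w) the coefficients for the 1/W term, set up per primitive. */
float interpolate_attribute(float x, float y,
                            float a_u, float b_u, float c_u,
                            float a_w, float b_w, float c_w)
{
    float u_over_w   = a_u * x + b_u * y + c_u;   /* interpolate U/W       */
    float one_over_w = a_w * x + b_w * y + c_w;   /* interpolate 1/W       */
    float w = 1.0f / one_over_w;                  /* invert to form W      */
    return u_over_w * w;                          /* perspective-correct U */
}

In hardware these evaluations map onto the SFU's pipelined interpolation datapath rather than onto scalar code; the sketch only makes the arithmetic explicit.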


Rasterization
Geometry primitives output from the SMs go in their original round-robin input order to the viewport/clip/setup/raster/zcull block. The viewport and clip units clip the primitives to the standard view frustum and to any enabled user clip planes. They transform postclipping vertices into screen (pixel) space and reject whole primitives outside the view volume as well as back-facing primitives.

Surviving primitives then go to the setup unit, which generates edge equations for the rasterizer. Attribute plane equations are also generated for linear interpolation of pixel attributes in the pixel shader. A coarse-rasterization stage generates all pixel tiles that are at least partially inside the primitive.

The zcull unit maintains a hierarchical z surface, rejecting pixel tiles if they are conservatively known to be occluded by previously drawn pixels. The rejection rate is up to 256 pixels per clock. The screen is subdivided into tiles; each TPC processes a predetermined subset. The pixel tile address therefore selects the destination TPC. Pixel tiles that survive zcull then go to a fine-rasterization stage that generates detailed coverage information and depth values for the pixels.

OpenGL and Direct3D require that a depth test be performed after the pixel shader has generated final color and depth values. When possible, for certain combinations of API state, the Tesla GPU performs the depth test and update ahead of the fragment shader, possibly saving thousands of cycles of processing time, without violating the API-mandated semantics.

The SMC assembles surviving pixels into warps to be processed by an SM running the current pixel shader. When the pixel shader has finished, the pixels are optionally depth tested if this was not done ahead of the shader. The SMC then sends surviving pixels and associated data to the ROP.

Raster operations processor
Each ROP is paired with a specific memory partition. The TPCs feed data to the ROPs via an interconnection network. ROPs handle depth and stencil testing and updates and color blending and updates. The ROP uses lossless color compression (up to 8:1) and depth compression (up to 8:1) to reduce bandwidth. Each ROP has a peak rate of four pixels per clock and supports 16-bit floating-point and 32-bit floating-point HDR formats. ROPs support double-rate-depth processing when color writes are disabled.

Each memory partition is 64 bits wide and supports double-data-rate DDR2 and graphics-oriented GDDR3 protocols at up to 1 GHz, yielding a bandwidth of about 16 Gbytes/s.

Antialiasing support includes up to 16× multisampling and supersampling. HDR formats are fully supported. Both algorithms support 1, 2, 4, 8, or 16 samples per pixel and generate a weighted average of the samples to produce the final pixel color. Multisampling executes the pixel shader once to generate a color shared by all pixel samples, whereas supersampling runs the pixel shader once per sample. In both cases, depth values are correctly evaluated for each sample, as required for correct interpenetration of primitives.

Because multisampling runs the pixel shader once per pixel (rather than once per sample), multisampling has become the most popular antialiasing method. Beyond four samples, however, storage cost increases faster than image quality improves, especially with HDR formats. For example, a single 1,600 × 1,200 pixel surface, storing 16 four-component, 16-bit floating-point samples, requires 1,600 × 1,200 × 16 × (64 bits color + 32 bits depth) = 368 Mbytes.
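As a check on the storage figure above (a worked restatement of the article's own arithmetic, not new data): each stored sample holds a 64-bit color and a 32-bit depth value, or 12 bytes, so

1,600 × 1,200 × 16 samples × 12 bytes = 368,640,000 bytes ≈ 368 Mbytes.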

For the vast majority of edge pixels, two colors are enough; what matters is more-detailed coverage information. The coverage-sampling antialiasing (CSAA) algorithm provides low-cost-per-coverage samples, allowing upward scaling. By computing and storing Boolean coverage at up to 16 samples and compressing redundant color, depth, and stencil information into the memory footprint and bandwidth of four or eight samples, 16× antialiasing quality can be achieved at 4× antialiasing performance. CSAA is compatible with existing rendering techniques, including HDR and stencil algorithms. Edges defined by the intersection of interpenetrating polygons are rendered at the stored sample count quality (4× or 8×). Table 2 summarizes the storage requirements of the three algorithms.

Table 2. Comparison of antialiasing modes.

Feature                       Brute-force supersampling   Multisampling     Coverage sampling
Quality level                 1×   4×   16×               1×   4×   16×     1×   4×   16×
Texture and shader samples    1    4    16                1    1    1       1    1    1
Stored color and z samples    1    4    16                1    4    16      1    4    4
Coverage samples              1    4    16                1    4    16      1    4    16

Memory and interconnect
The DRAM memory data bus width is 384 pins, arranged in six independent partitions of 64 pins each. Each partition owns 1/6 of the physical address space. The memory partition units directly enqueue requests. They arbitrate among hundreds of in-flight requests from the parallel stages of the graphics and computation pipelines. The arbitration seeks to maximize total DRAM transfer efficiency, which favors grouping related requests by DRAM bank and read/write direction, while minimizing latency as far as possible. The memory controllers support a wide range of DRAM clock rates, protocols, device densities, and data bus widths.

Interconnection network. A single hub unit routes requests to the appropriate partition from the nonparallel requesters (PCI-Express, host and command front end, input assembler, and display). Each memory partition has its own depth and color ROP units, so ROP memory traffic originates locally. Texture and load/store requests, however, can occur between any TPC and any memory partition, so an interconnection network routes requests and responses.

Memory management unit. All processing engines generate addresses in a virtual address space. A memory management unit performs virtual to physical translation. Hardware reads the page tables from local memory to respond to misses on behalf of a hierarchy of translation look-aside buffers spread out among the rendering engines.

Parallel computing architecture
The Tesla scalable parallel computing architecture enables the GPU processor array to excel in throughput computing, executing high-performance computing applications as well as graphics applications. Throughput applications have several properties that distinguish them from CPU serial applications:

- extensive data parallelism—thousands of computations on independent data elements;
- modest task parallelism—groups of threads execute the same program, and different groups can run different programs;
- intensive floating-point arithmetic;
- latency tolerance—performance is the amount of work completed in a given time;
- streaming data flow—requires high memory bandwidth with relatively little data reuse; and
- modest inter-thread synchronization and communication—graphics threads do not communicate, and parallel computing applications require limited synchronization and communication.

GPU parallel performance on throughput problems has doubled every 12 to 18 months, pulled by the insatiable demands of the 3D game market.


Now, Tesla GPUs in laptops, desktops, workstations, and systems are programmable in C with CUDA tools, using a simple parallel programming model.

Data-parallel problem decomposition
To map a large computing problem effectively to a highly parallel processing architecture, the programmer or compiler decomposes the problem into many small problems that can be solved in parallel. For example, the programmer partitions a large result data array into blocks and further partitions each block into elements, so that the result blocks can be computed independently in parallel, and the elements within each block can be computed cooperatively in parallel. Figure 5 shows the decomposition of a result data array into a 3 × 2 grid of blocks, in which each block is further decomposed into a 5 × 3 array of elements.

Figure 5. Decomposing result data into a grid of blocks partitioned into elements to be computed in parallel.

The two-level parallel decomposition maps naturally to the Tesla architecture: Parallel SMs compute result blocks, and parallel threads compute result elements. The programmer or compiler writes a program that computes a sequence of result grids, partitioning each result grid into coarse-grained result blocks that are computed independently in parallel. The program computes each result block with an array of fine-grained parallel threads, partitioning the work among threads that compute result elements.

Cooperative thread array or thread block
Unlike the graphics programming model, which executes parallel shader threads independently, parallel-computing programming models require that parallel threads synchronize, communicate, share data, and cooperate to efficiently compute a result. To manage large numbers of concurrent threads that can cooperate, the Tesla computing architecture introduces the cooperative thread array (CTA), called a thread block in CUDA terminology.

A CTA is an array of concurrent threads that execute the same thread program and can cooperate to compute a result. A CTA consists of 1 to 512 concurrent threads, and each thread has a unique thread ID (TID), numbered 0 through m. The programmer declares the 1D, 2D, or 3D CTA shape and dimensions in threads. The TID has one, two, or three dimension indices. Threads of a CTA can share data in global or shared memory and can synchronize with the barrier instruction. CTA thread programs use their TIDs to select work and index shared data arrays. Multidimensional TIDs can eliminate integer divide and remainder operations when indexing arrays.

Each SM executes up to eight CTAs concurrently, depending on CTA resource demands. The programmer or compiler declares the number of threads, registers, shared memory, and barriers required by the CTA program. When an SM has sufficient available resources, the SMC creates the CTA and assigns TID numbers to each thread. The SM executes the CTA threads concurrently as SIMT warps of 32 parallel threads.
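A hedged CUDA sketch of a CTA (thread block) cooperating through shared memory follows; the kernel, its names, and the block size are illustrative rather than taken from the article. Each thread loads one element selected by its CTA ID and TID into per-CTA shared memory, the block synchronizes at a barrier, and the threads then combine the values with a tree reduction, producing one result block per CTA.

// Illustrative CUDA kernel: each CTA of 256 threads reduces 256 input
// elements to a single partial sum using shared memory and the barrier.
// Launched, for example, as: block_sum<<<nBlocks, 256>>>(in, sums, n);
#define CTA_SIZE 256

__global__ void block_sum(const float *in, float *block_sums, int n)
{
    __shared__ float buf[CTA_SIZE];            // per-CTA shared memory

    int tid = threadIdx.x;                     // thread ID within the CTA
    int i   = blockIdx.x * CTA_SIZE + tid;     // element selected by CTA ID + TID

    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                           // barrier: all loads are visible

    // Tree reduction in shared memory; each step halves the active threads.
    for (int stride = CTA_SIZE / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        block_sums[blockIdx.x] = buf[0];       // one result per CTA
}

Because CTAs do not communicate with one another, the per-CTA results can be combined by a second, sequentially dependent grid, matching the grid-level decomposition described in the following sections.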

Figure 6. Nested granularity levels: thread (a), cooperative thread array (b), and grid (c). These have corresponding memory-sharing levels: local per-thread, shared per-CTA, and global per-application.

CTA grids
To implement the coarse-grained block and grid decomposition of Figure 5, the GPU creates CTAs with unique CTA ID and grid ID numbers. The compute work distributor dynamically balances the GPU workload by distributing a stream of CTA work to SMs with sufficient available resources.

To enable a compiled binary program to run unchanged on large or small GPUs with any number of parallel SM processors, CTAs execute independently and compute result blocks independently of other CTAs in the same grid. Sequentially dependent application steps map to two sequentially dependent grids. The dependent grid waits for the first grid to complete; then the CTAs of the dependent grid read the result blocks written by the first grid.

Parallel granularity
Figure 6 shows levels of parallel granularity in the GPU computing model. The three levels are

- thread—computes result elements selected by its TID;
- CTA—computes result blocks selected by its CTA ID;
- grid—computes many result blocks, and sequential grids compute sequentially dependent application steps.

Higher levels of parallelism use multiple GPUs per CPU and clusters of multi-GPU nodes.

Parallel memory sharing
Figure 6 also shows levels of parallel read/write memory sharing:


- local—each executing thread has a private per-thread local memory for register spill, stack frame, and addressable temporary variables;
- shared—each executing CTA has a per-CTA shared memory for access to data shared by threads in the same CTA;
- global—sequential grids communicate and share large data sets in global memory.

Threads communicating in a CTA use the fast barrier synchronization instruction to wait for writes to shared or global memory to complete before reading data written by other threads in the CTA. The load/store memory system uses a relaxed memory order that preserves the order of reads and writes to the same address from the same issuing thread and from the viewpoint of CTA threads coordinating with the barrier synchronization instruction. Sequentially dependent grids use a global intergrid synchronization barrier between grids to ensure global read/write ordering.

Transparent scaling of GPU computing
Parallelism varies widely over the range of GPU products developed for various market segments. A small GPU might have one SM with eight SP cores, while a large GPU might have many SMs totaling hundreds of SP cores.

The GPU computing architecture transparently scales parallel application performance with the number of SMs and SP cores. A GPU computing program executes on any size of GPU without recompiling, and is insensitive to the number of SM multiprocessors and SP cores. The program does not know or care how many processors it uses.

The key is decomposing the problem into independently computed blocks as described earlier. The GPU compute work distribution unit generates a stream of CTAs and distributes them to available SMs to compute each independent block. Scalable programs do not communicate among CTA blocks of the same grid; the same grid result is obtained if the CTAs execute in parallel on many cores, sequentially on one core, or partially in parallel on a few cores.

CUDA programming model
CUDA is a minimal extension of the C and C++ programming languages. A programmer writes a serial program that calls parallel kernels, which can be simple functions or full programs. The CUDA program executes serial code on the CPU and executes parallel kernels across a set of parallel threads on the GPU. The programmer organizes these threads into a hierarchy of thread blocks and grids as described earlier. (A CUDA thread block is a GPU CTA.)

Figure 7 shows a CUDA program executing a series of parallel kernels on a heterogeneous CPU–GPU system. KernelA and KernelB execute on the GPU as grids of nBlkA and nBlkB thread blocks (CTAs), which instantiate nTidA and nTidB threads per CTA.

Figure 7. CUDA program sequence of kernel A followed by kernel B on a heterogeneous CPU–GPU system.

The CUDA compiler nvcc compiles an integrated application C/C++ program containing serial CPU code and parallel GPU kernel code. The CUDA runtime API manages the GPU as a computing device that acts as a coprocessor to the host CPU with its own memory system.

The CUDA programming model is similar in style to a single-program multiple-data (SPMD) software model—it expresses parallelism explicitly, and each kernel executes on a fixed number of threads. However, CUDA is more flexible than most SPMD implementations because each kernel call dynamically creates a new grid with the right number of thread blocks and threads for that application step.

CUDA extends C/C++ with the declaration specifier keywords __global__ for kernel entry functions, __device__ for global variables, and __shared__ for shared-memory variables. A CUDA kernel's text is simply a C function for one sequential thread. The built-in variable threadIdx.{x, y, z} provides the thread ID within a thread block (CTA), while blockIdx.{x, y, z} provides the CTA ID within a grid. The extended function call syntax kernel<<<nBlocks, nThreads>>>(args); invokes a parallel kernel function on a grid of nBlocks blocks, where each block instantiates nThreads concurrent threads, and args are ordinary arguments to function kernel().

Figure 8 shows an example serial C program and a corresponding CUDA C program. The serial C program uses two nested loops to iterate over each array index and compute c[idx] = a[idx] + b[idx] each trip. The parallel CUDA C program has no loops. It uses parallel threads to compute the same array indices in parallel, and each thread computes only one sum.

Figure 8. Serial C (a) and CUDA C (b) examples of programs that add arrays.
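Figure 8 itself is not reproduced in this text-only version; the sketch below shows one plausible form of the pair of programs it describes, with array names, sizes, and launch parameters that are illustrative rather than taken from the figure. The serial version walks every index with two nested loops; the CUDA version has no loops, each thread computes a single element, and the arrays are assumed to be already resident in GPU global memory for the kernel.

/* Serial C version: two nested loops visit every (y, x) index and compute
 * one element of c per trip. Names and sizes are illustrative. */
void add_matrix_serial(const float *a, const float *b, float *c, int nx, int ny)
{
    for (int y = 0; y < ny; y++)
        for (int x = 0; x < nx; x++) {
            int idx = y * nx + x;
            c[idx] = a[idx] + b[idx];
        }
}

/* CUDA C version: the kernel is the loop body written for one thread, and
 * the launch creates one thread per element (a, b, c are device pointers). */
__global__ void add_matrix_kernel(const float *a, const float *b, float *c,
                                  int nx, int ny)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < nx && y < ny) {
        int idx = y * nx + x;
        c[idx] = a[idx] + b[idx];
    }
}

void add_matrix(const float *a, const float *b, float *c, int nx, int ny)
{
    dim3 block(16, 16);   /* illustrative block shape */
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    add_matrix_kernel<<<grid, block>>>(a, b, c, nx, ny);
}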



Scalability and performance
The Tesla unified architecture is designed for scalability. Varying the number of SMs, TPCs, ROPs, caches, and memory partitions provides the right mix for different performance and cost targets in the value, mainstream, enthusiast, and professional market segments. NVIDIA's Scalable Link Interconnect (SLI) enables multiple GPUs to act together as one, providing further scalability.

CUDA C/C++ applications executing on Tesla computing platforms, Quadro workstations, and GeForce GPUs deliver compelling computing performance on a range of large problems, including more than 100× speedups on molecular modeling, more than 200 Gflops on n-body problems, and real-time 3D magnetic-resonance imaging.12-14 For graphics, the GeForce 8800 GPU delivers high performance and image quality for the most demanding games.15

Figure 9 shows the GeForce 8800 Ultra physical die layout implementing the Tesla architecture shown in Figure 1. Implementation specifics include

- 681 million transistors, 470 mm²;
- TSMC 90-nm CMOS;
- 128 SP cores in 16 SMs;
- 12,288 processor threads;
- 1.5-GHz processor clock rate;
- peak 576 Gflops in processors;
- 768-Mbyte GDDR3 DRAM;
- 384-pin DRAM interface;
- 1.08-GHz DRAM clock;
- 104-Gbyte/s peak bandwidth; and
- typical power of 150 W at 1.3 V.

Figure 9. GeForce 8800 Ultra die layout.
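One way to reconcile the peak arithmetic figure in this list with the per-SM numbers quoted earlier (a consistency check on the article's own figures, counting a multiply-add as two floating-point operations and each SFU's four multipliers as additional multiplies): 8 SP MADs contribute 16 flops per clock and the 2 SFUs contribute 8 more, for 24 flops per clock per SM; 24 × 1.5 GHz = 36 Gflops per SM, and 16 SMs × 36 Gflops = 576 Gflops for the chip.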

The Tesla architecture is the first ubiquitous supercomputing platform. NVIDIA has shipped more than 50 million Tesla-based systems. This wide availability, coupled with C programmability and the CUDA software development environment, enables broad deployment of demanding parallel-computing and graphics applications. With future increases in transistor density, the architecture will readily scale processor parallelism, memory partitions, and overall performance. An increased number of multiprocessors and memory partitions will support larger data sets and richer graphics and computing, without a change to the programming model.

We continue to investigate improved scheduling and load-balancing algorithms for the unified processor. Other areas of improvement are enhanced scalability for derivative products, reduced synchronization and communication overhead for compute programs, new graphics features, increased realized memory bandwidth, and improved power efficiency.

Acknowledgments
We thank the entire NVIDIA GPU development team for their extraordinary effort in bringing Tesla-based GPUs to market.

References
1. J. Montrym and H. Moreton, "The GeForce 6800," IEEE Micro, vol. 25, no. 2, Mar./Apr. 2005, pp. 41-51.
2. CUDA Technology, NVIDIA, 2007; http://www.nvidia.com/CUDA.
3. CUDA Programming Guide 1.1, NVIDIA, 2007; http://developer.download.nvidia.com/compute//1_1/NVIDIA_CUDA_Programming_Guide_1.1.pdf.
4. J. Nickolls, I. Buck, K. Skadron, and M. Garland, "Scalable Parallel Programming with CUDA," ACM Queue, vol. 6, no. 2, Mar./Apr. 2008, pp. 40-53.
5. DX Specification, Microsoft; http://msdn.microsoft.com/directx.
6. E. Lindholm, M.J. Kilgard, and H. Moreton, "A User-Programmable Vertex Engine," Proc. 28th Ann. Conf. Computer Graphics and Interactive Techniques (Siggraph 01), ACM Press, 2001, pp. 149-158.
7. G. Elder, "Radeon 9700," Eurographics/Siggraph Workshop Graphics Hardware, Hot 3D Session, 2002; http://www.graphicshardware.org/previous/www_2002/presentations/Hot3D-RADEON9700.ppt.
8. Microsoft DirectX 9 Programmable Graphics Pipeline, Microsoft Press, 2003.
9. J. Andrews and N. Baker, "Xbox 360 System Architecture," IEEE Micro, vol. 26, no. 2, Mar./Apr. 2006, pp. 25-37.
10. D. Blythe, "The Direct3D 10 System," ACM Trans. Graphics, vol. 25, no. 3, July 2006, pp. 724-734.
11. S.F. Oberman and M.Y. Siu, "A High-Performance Area-Efficient Multifunction Interpolator," Proc. 17th IEEE Symp. Computer Arithmetic (Arith-17), IEEE Press, 2005, pp. 272-279.
12. J.E. Stone et al., "Accelerating Molecular Modeling Applications with Graphics Processors," J. Computational Chemistry, vol. 28, no. 16, 2007, pp. 2618-2640.
13. L. Nyland, M. Harris, and J. Prins, "Fast N-Body Simulation with CUDA," GPU Gems 3, H. Nguyen, ed., Addison-Wesley, 2007, pp. 677-695.
14. S.S. Stone et al., "How GPUs Can Improve the Quality of Magnetic Resonance Imaging," Proc. 1st Workshop on General Purpose Processing on Graphics Processing Units, 2007; http://www.gigascale.org/pubs/1175.html.
15. A.L. Shimpi and D. Wilson, "NVIDIA's GeForce 8800 (G80): GPUs Re-architected for DirectX 10," AnandTech, Nov. 2006; http://www.anandtech.com/video/showdoc.aspx?i=2870.

Erik Lindholm is a distinguished engineer at NVIDIA, working in the architecture group. His research interests include graphics processor design and parallel graphics architectures. Lindholm has an MS in electrical engineering from the University of British Columbia.

John Nickolls is director of GPU computing architecture at NVIDIA. His interests include parallel processing systems, languages, and architectures. Nickolls has a BS in electrical engineering and computer science from the University of Illinois and MS and PhD degrees in electrical engineering from Stanford University.

Stuart Oberman is a design manager in the GPU hardware group at NVIDIA. His research interests include computer arithmetic, processor design, and parallel architectures. Oberman has a BS in electrical engineering from the University of Iowa and MS and PhD degrees in electrical engineering from Stanford University. He is a senior member of the IEEE.

John Montrym is a chief architect at NVIDIA, where he has worked in the development of several GPU product families. His research interests include graphics processor design, parallel graphics architectures, and hardware-software interfaces. Montrym has a BS in electrical engineering from the Massachusetts Institute of Technology.

Direct questions and comments about this article to Erik Lindholm or John Nickolls, NVIDIA, 2701 San Tomas Expressway, Santa Clara, CA 95050; [email protected] or jnickolls@nvidia.com.

For more information on this or any other computing topic, please visit our Digital Library at http://computer.org/csdl.

