NVIDIA Tesla: A Unified Graphics and Computing Architecture
Erik Lindholm, John Nickolls, Stuart Oberman, John Montrym
NVIDIA

IEEE MICRO, HOT CHIPS 19
0272-1732/08/$20.00 © 2008 IEEE. Published by the IEEE Computer Society.

The modern 3D graphics processing unit (GPU) has evolved from a fixed-function graphics pipeline to a programmable parallel processor with computing power exceeding that of multicore CPUs. Traditional graphics pipelines consist of separate programmable stages of vertex processors executing vertex shader programs and pixel fragment processors executing pixel shader programs. (Montrym and Moreton provide additional background on the traditional graphics processor architecture [1].)

NVIDIA’s Tesla architecture, introduced in November 2006 in the GeForce 8800 GPU, unifies the vertex and pixel processors and extends them, enabling high-performance parallel computing applications written in the C language using the Compute Unified Device Architecture (CUDA [2–4]) parallel programming model and development tools. The Tesla unified graphics and computing architecture is available in a scalable family of GeForce 8-series GPUs and Quadro GPUs for laptops, desktops, workstations, and servers. It also provides the processing architecture for the Tesla GPU computing platforms introduced in 2007 for high-performance computing.

In this article, we discuss the requirements that drove the unified graphics and parallel computing processor architecture, describe the Tesla architecture, and how it is enabling widespread deployment of parallel computing and graphics applications.

The road to unification

The first GPU was the GeForce 256, introduced in 1999. It contained a fixed-function 32-bit floating-point vertex transform and lighting processor and a fixed-function integer pixel-fragment pipeline, which were programmed with OpenGL and the Microsoft DX7 API [5]. In 2001, the GeForce 3 introduced the first programmable vertex processor executing vertex shaders, along with a configurable 32-bit floating-point fragment pipeline, programmed with DX8 [5] and OpenGL [6]. The Radeon 9700, introduced in 2002, featured a programmable 24-bit floating-point pixel-fragment processor programmed with DX9 and OpenGL [7,8]. The GeForce FX added 32-bit floating-point pixel-fragment processors. The XBox 360 introduced an early unified GPU in 2005, allowing vertices and pixels to execute on the same processor [9].

Vertex processors operate on the vertices of primitives such as points, lines, and triangles. Typical operations include transforming coordinates into screen space, which are then fed to the setup unit and the rasterizer, and setting up lighting and texture parameters to be used by the pixel-fragment processors. Pixel-fragment processors operate on rasterizer output, which fills the interior of primitives, along with the interpolated parameters.

Vertex and pixel-fragment processors have evolved at different rates: Vertex processors were designed for low-latency, high-precision math operations, whereas pixel-fragment processors were optimized for high-latency, lower-precision texture filtering. Vertex processors have traditionally supported more-complex processing, so they became programmable first. For the last six years, the two processor types have been functionally converging as the result of a need for greater programming generality. However, the increased generality also increased the design complexity, area, and cost of developing two separate processors.

Because GPUs typically must process more pixels than vertices, pixel-fragment processors traditionally outnumber vertex processors by about three to one. However, typical workloads are not well balanced, leading to inefficiency. For example, with large triangles, the vertex processors are mostly idle, while the pixel processors are fully busy. With small triangles, the opposite is true. The addition of more-complex primitive processing in DX10 makes it much harder to select a fixed processor ratio [10]. All these factors influenced the decision to design a unified architecture.

A primary design objective for Tesla was to execute vertex and pixel-fragment shader programs on the same unified processor architecture. Unification would enable dynamic load balancing of varying vertex- and pixel-processing workloads and permit the introduction of new graphics shader stages, such as geometry shaders in DX10. It also let a single team focus on designing a fast and efficient processor and allowed the sharing of expensive hardware such as the texture units. The generality required of a unified processor opened the door to a completely new GPU parallel-computing capability. The downside of this generality was the difficulty of efficient load balancing between different shader types.

Other critical hardware design requirements were architectural scalability, performance, power, and area efficiency.

The Tesla architects developed the graphics feature set in coordination with the development of the Microsoft Direct3D DirectX 10 graphics API [10]. They developed the GPU’s computing feature set in coordination with the development of the CUDA C parallel programming language, compiler, and development tools.

Tesla architecture

The Tesla architecture is based on a scalable processor array. Figure 1 shows a block diagram of a GeForce 8800 GPU with 128 streaming-processor (SP) cores organized as 16 streaming multiprocessors (SMs) in eight independent processing units called texture/processor clusters (TPCs). Work flows from top to bottom, starting at the host interface with the system PCI-Express bus. Because of its unified-processor design, the physical Tesla architecture doesn’t resemble the logical order of graphics pipeline stages. However, we will use the logical graphics pipeline flow to explain the architecture.

Figure 1. Tesla unified graphics and computing GPU architecture. TPC: texture/processor cluster; SM: streaming multiprocessor; SP: streaming processor; Tex: texture; ROP: raster operation processor.

At the highest level, the GPU’s scalable streaming processor array (SPA) performs all the GPU’s programmable calculations. The scalable memory system consists of external DRAM control and fixed-function raster operation processors (ROPs) that perform color and depth frame buffer operations directly on memory. An interconnection network carries computed pixel-fragment colors and depth values from the SPA to the ROPs. The network also routes texture memory read requests from the SPA to DRAM and read data from DRAM through a level-2 cache back to the SPA.

The remaining blocks in Figure 1 deliver input work to the SPA. The input assembler collects vertex work as directed by the input command stream. The vertex work distribution block distributes vertex work packets to the various TPCs in the SPA. The TPCs execute vertex shader programs, and (if enabled) geometry shader programs. The resulting output data is written to on-chip buffers. These buffers then pass their results to the viewport/clip/setup/raster/zcull block to be rasterized into pixel fragments. The pixel work distribution unit distributes pixel fragments to the appropriate TPCs for pixel-fragment processing. Shaded pixel fragments are sent across the interconnection network for processing by depth and color ROP units. The compute work distribution block dispatches compute thread arrays to the TPCs. The SPA accepts and processes work for multiple logical streams simultaneously. Multiple clock domains for GPU units, processors, DRAM, and other units allow independent power and performance optimizations.

Command processing

The GPU host interface unit communicates with the host CPU, responds to commands from the CPU, fetches data from system memory, checks command consistency, and performs context switching.

The input assembler collects geometric primitives (points, lines, triangles, line strips, and triangle strips) and fetches associated vertex input attribute data. It has peak rates of one primitive per clock and eight scalar attributes per clock at the GPU core clock, which is typically 600 MHz.

The work distribution units forward the input assembler’s output stream to the array of processors, which execute vertex, geometry, and pixel shader programs, as well as computing programs. The vertex and compute work distribution units deliver work to processors in a round-robin