How a GPU Works
Kayvon Fatahalian
15-462 (Fall 2011)
Today
1. Review: the graphics pipeline
2. History: a few old GPUs
3. How a modern GPU works (and why it is so fast!)
4. Closer look at a real GPU design: NVIDIA GTX 285

Part 1: The graphics pipeline (an abstraction)

Vertex processing
Vertices are transformed into "screen space". Each vertex is transformed independently.

Primitive processing
The transformed vertices are then organized into primitives (triangles) that are clipped and culled.

Rasterization
Primitives are rasterized into "pixel fragments". Each primitive is rasterized independently.

Fragment processing
Fragments are shaded to compute a color at each pixel. Each fragment is processed independently.

Pixel operations
Fragments are blended into the frame buffer at their pixel locations (the z-buffer determines visibility).

Pipeline entities
Vertices -> Primitives -> Fragments -> Fragments (shaded) -> Pixels

Graphics pipeline
Vertex Generation -> Vertex Processing -> Primitive Generation -> Primitive Processing -> Fragment Generation -> Fragment Processing -> Pixel Operations.
Vertex Generation reads vertex data buffers from memory; Vertex Processing, Primitive Processing, and Fragment Processing are programmable stages that can read textures; the remaining stages are fixed-function; Pixel Operations writes the output image (pixels).

Part 2: Graphics architectures (implementations of the graphics pipeline)

Independent
What's so important about "independent" computations?
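Why "independent" matters: each pipeline stage is defined per element, with no element depending on another element's result, so every stage is a data-parallel map that hardware can run very wide. A minimal sketch of this structure (the Vertex/Fragment structs and the stage bodies below are hypothetical placeholders, not the lecture's definitions):

    #include <vector>

    // Hypothetical stand-ins for pipeline data; real stages carry positions,
    // attributes, interpolants, etc.
    struct Vertex   { float x, y, z; };
    struct Fragment { float r, g, b; };

    // Vertex processing: each vertex is transformed independently.
    Vertex transformVertex(const Vertex& v) {
        return Vertex{ v.x * 0.5f, v.y * 0.5f, v.z };           // placeholder transform
    }

    // Fragment processing: each fragment is shaded independently.
    Fragment shadeFragment(const Fragment& f) {
        return Fragment{ f.r * 0.8f, f.g * 0.8f, f.b * 0.8f };  // placeholder shading
    }

    void runStages(std::vector<Vertex>& verts, std::vector<Fragment>& frags) {
        // No iteration reads another iteration's result, so every iteration of
        // each loop could run on a different execution unit at the same time.
        for (size_t i = 0; i < verts.size(); ++i) verts[i] = transformVertex(verts[i]);
        for (size_t i = 0; i < frags.size(); ++i) frags[i] = shadeFragment(frags[i]);
    }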
Silicon Graphics RealityEngine (1993)
A "graphics supercomputer": dedicated hardware for every stage, from Vertex Generation through Vertex Processing, Primitive Generation, Primitive Processing, Fragment Generation, Fragment Processing, and Pixel Operations.

Pre-1999 PC 3D graphics accelerator
The CPU performs Vertex Generation, Vertex Processing, Primitive Generation, and Primitive Processing. The accelerator (3dfx Voodoo, NVIDIA RIVA TNT) handles clip/cull/rasterize (Fragment Generation), texturing (Fragment Processing), and Pixel Operations.

GPU circa 1999
NVIDIA GeForce 256: the CPU performs only Vertex Generation; the GPU handles everything from Vertex Processing through Pixel Operations.

Direct3D 9 programmability: 2002
ATI Radeon 9700: programmable vertex units (four) and programmable fragment units (eight, paired with texture units), alongside fixed-function clip/cull/rasterize and pixel-operation hardware.

Direct3D 10 programmability: 2006
NVIDIA GeForce 8800, a "unified shading" GPU: a pool of identical programmable cores fed by a scheduler, with shared texture units, pixel-operation units, and clip/cull/rasterize hardware. The same cores run vertex, primitive, and fragment work.

Part 3: How a shader core works (three key ideas)

GPUs are fast
Intel Core i7 (quad core): ~100 GFLOPS peak, 730 million transistors (obtainable if you code your program to use 4 threads and SSE vector instructions).
AMD Radeon HD 5870: ~2.7 TFLOPS peak, 2.2 billion transistors (obtainable if you write OpenGL programs like you've done in this class).

A diffuse reflectance shader

    sampler mySamp;
    Texture2D<float3> myTex;
    float3 lightDir;

    float4 diffuseShader(float3 norm, float2 uv)
    {
      float3 kd;
      kd = myTex.Sample(mySamp, uv);
      kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
      return float4(kd, 1.0);
    }

Shader programming model: fragments are processed independently, but there is no explicit parallel programming. Each fragment gets its own independent logical sequence of control.
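In this model the parallelism stays implicit: you write the body for one fragment and the system runs many instances of it. As a rough sketch only (the FragIn/FragOut structs, sampleAlbedo, and dot3 below are simplified stand-ins, not the lecture's code), here is the same per-fragment logic written in CUDA with the parallelism made explicit, one thread per fragment:

    // One CUDA thread shades one fragment; each thread is an independent
    // logical sequence of control, mirroring the HLSL diffuseShader above.
    struct FragIn  { float3 norm; float2 uv; };
    struct FragOut { float4 color; };

    __device__ float dot3(float3 a, float3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

    __device__ float3 sampleAlbedo(float2 uv) {
        // Stand-in for myTex.Sample(mySamp, uv): a real shader fetches a texel here.
        return make_float3(0.8f, 0.8f, 0.8f);
    }

    __global__ void diffuseShaderKernel(const FragIn* in, FragOut* out,
                                        int numFragments, float3 lightDir) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's fragment
        if (i >= numFragments) return;

        float3 kd    = sampleAlbedo(in[i].uv);
        float  ndotl = fminf(fmaxf(dot3(lightDir, in[i].norm), 0.0f), 1.0f);
        out[i].color = make_float4(kd.x * ndotl, kd.y * ndotl, kd.z * ndotl, 1.0f);
    }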
Big Guy, lookin' diffuse
(The diffuse shader applied to the "Big Guy" model.)

Compile shader
One unshaded fragment input record goes in; one shaded fragment output record comes out. The HLSL above compiles to a short instruction stream:

    <diffuseShader>:
    sample r0, v4, t0, s0
    mul    r3, v0, cb0[0]
    madd   r3, v1, cb0[1], r3
    madd   r3, v2, cb0[2], r3
    clmp   r3, r3, l(0.0), l(1.0)
    mul    o0, r0, r3
    mul    o1, r1, r3
    mul    o2, r2, r3
    mov    o3, l(1.0)

Execute shader
A simple processing core is just a Fetch/Decode unit, an ALU (Execute), and an Execution Context holding the fragment's registers; it steps through the instruction stream above one instruction at a time (a toy sketch of this fetch/decode/execute loop appears below).

"CPU-style" cores
Beyond Fetch/Decode, the ALU, and the Execution Context, a CPU core spends its area on a big data cache, out-of-order control logic, a fancy branch predictor, and a memory pre-fetcher.

Slimming down
Idea #1: remove the components that help a single instruction stream run fast.

Two cores (two fragments in parallel)
Each core runs the diffuseShader instruction stream on its own fragment.

Four cores (four fragments in parallel)

Sixteen cores (sixteen fragments in parallel)
16 cores = 16 simultaneous instruction streams.

Instruction stream sharing
But ... many fragments should be able to share an instruction stream, since they all run the same diffuseShader code!
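One of these slimmed-down cores is little more than the Fetch/Decode unit, a single ALU, and an execution context holding the current fragment's registers. A toy sketch of that loop, with an instruction encoding invented purely for illustration (it is not the actual ISA in the listing above):

    #include <algorithm>

    // Invented, simplified instruction encoding: op plus register indices.
    enum Op { MUL, MADD, CLMP, MOV };
    struct Instr { Op op; int dst, a, b, c; };

    struct ExecContext { float r[16]; };   // one fragment's registers

    void executeShader(const Instr* program, int count, ExecContext& ctx) {
        for (int pc = 0; pc < count; ++pc) {                   // fetch
            const Instr& I = program[pc];                      // decode
            switch (I.op) {                                    // execute on the ALU
                case MUL:  ctx.r[I.dst] = ctx.r[I.a] * ctx.r[I.b];                    break;
                case MADD: ctx.r[I.dst] = ctx.r[I.a] * ctx.r[I.b] + ctx.r[I.c];       break;
                case CLMP: ctx.r[I.dst] = std::min(std::max(ctx.r[I.a], 0.0f), 1.0f); break;
                case MOV:  ctx.r[I.dst] = ctx.r[I.a];                                 break;
            }
        }
    }
    // Replicating whole cores replicates this fetch/decode work per fragment;
    // Idea #2 (next) amortizes it across many ALUs instead.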
Recall: simple processing core
Fetch/Decode, ALU (Execute), Execution Context.

Add ALUs
Idea #2: amortize the cost/complexity of managing an instruction stream across many ALUs (SIMD processing). One Fetch/Decode unit now drives eight ALUs, each with its own context, plus shared context data.

Modifying the shader
The original compiled shader processes one fragment using scalar ops on scalar registers. The new compiled shader processes eight fragments at a time using vector ops on vector registers:

    <VEC8_diffuseShader>:
    VEC8_sample vec_r0, vec_v4, t0, vec_s0
    VEC8_mul    vec_r3, vec_v0, cb0[0]
    VEC8_madd   vec_r3, vec_v1, cb0[1], vec_r3
    VEC8_madd   vec_r3, vec_v2, cb0[2], vec_r3
    VEC8_clmp   vec_r3, vec_r3, l(0.0), l(1.0)
    VEC8_mul    vec_o0, vec_r0, vec_r3
    VEC8_mul    vec_o1, vec_r1, vec_r3
    VEC8_mul    vec_o2, vec_r2, vec_r3
    VEC8_mov    o3, l(1.0)

128 fragments in parallel
16 cores x 8 ALUs per core = 128 ALUs, with 16 simultaneous instruction streams. The same machinery processes 128 vertices, primitives, fragments, OpenCL work items, or CUDA threads in parallel.

But what about branches?
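Before getting to branches, a rough software picture of what the VEC8 transformation means (the vec8 type and the lane loop are illustrative; the hardware applies one decoded instruction to eight ALUs at once rather than looping):

    // Eight fragments share one instruction stream: a single decoded madd is
    // applied to all eight lanes. vec8 is an illustrative 8-wide register.
    const int WIDTH = 8;
    struct vec8 { float lane[WIDTH]; };

    vec8 vec8_madd(const vec8& a, const vec8& b, const vec8& c) {
        vec8 r;
        for (int i = 0; i < WIDTH; ++i)   // conceptually simultaneous, one lane per ALU
            r.lane[i] = a.lane[i] * b.lane[i] + c.lane[i];
        return r;
    }
    // With 16 cores of 8 ALUs each, 16 instruction streams keep 16 x 8 = 128 ALUs
    // busy: 128 fragments (or vertices, primitives, work items, CUDA threads) at once.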