10× Faster Transparency from Low Level Shader Optimisation
Pyarelal Knowles, Geoff Leach, Fabio Zambetta
RMIT University, Australia
[Figure: example renders labelled "Opaque" and "Transparent"]

[Figure: example renders with frame rates]
● 460 (2010): 6 FPS
● Titan X (now): 142 FPS

12 Million Polygons
● 460 (base): 1 FPS
● Titan X (base): 3 FPS
● Titan X (ours): 30 FPS

10× Faster Transparency (Software)
● Gained with two techniques
  ○ Backwards Memory Allocation: better occupancy
  ○ Register-Based Block Sort: faster sorting
● Involves low level optimizations (OpenGL+GLSL)
● Interesting technical details
● Insight from CUDA, similarities to OpenGL
● Important to know hardware and language
● Now within 10× of opaque rendering speed

Transparency
● Objects, glass, visualization
● Antialiasing
● Particles
● Shadow maps
Images from: Creo Parametric 2, Portal 2, Shadow Of Mordor, marmoset.co, GRID 2, Unity3D, Chen et al.'s Real-Time Volumetric Shadows using 1D Min-Max Mipmaps, Lokovic and Veach's Deep Shadow Maps.

Transparency
● Transparency uses alpha blending
● Weighted average
● Based on surface order
[Figure: the same scene blended "Not Sorted" and "Sorted"]

Sorting for Transparency
Sort triangles
● Geometry dependent
Sort fragments (potential pixel colors)
● Rasterize and store
● Geometry independent
● Order Independent Transparency (OIT)
...

Order Independent Transparency (OIT)
● Two passes
  ○ Build a deep image
  ○ Sort and blend fragments
● Exact OIT: sort all fragments
● Code snippets
  ○ On my poster
  ○ https://github.com/pknowles/oit

1. Deep Image
● Many fragments per pixel
● Construct in fragment shader
  ○ Race conditions
  ○ Different data structures
Knowles, P.: Real-Time deep image rendering and order independent transparency. PhD Thesis, RMIT University, 2015.

2. Sort and composite
Full screen pass:
1. Read all fragments
2. Sort
3. Blend
● Bottleneck for large scenes

    vec2 frags[MAX_FRAGS];
    void main() {
        int count = loadFragments(gl_FragCoord.xy);
        sortFragments(count); //insertion sort
        colour = vec4(1.0);
        for (int i = 0; i < count; ++i) {
            vec4 c = unpackColour(frags[i].y);
            colour = mix(colour, c, c.a);
        }
    }

OpenGL+GLSL vs CUDA
    OpenGL                               CUDA
    Graphics                             Compute
    GLSL shaders                         Kernels
    Specification, not implementation    Implementation well documented
    Improving Nsight support             Good Nsight support
    (Both run on the same hardware)
● CUDA gives insight into GLSL execution
● Some significant architectural differences...

GPU Architecture
[Diagram: GPU memory hierarchy]
● Slow: global memory, L2 cache
● Faster: L1 cache / shared / “local” memory (per SM/SMX)
● Fastest: registers (next to the SPs)

OpenGL vs CUDA - An Interesting Example

    #define SIZE set_by_application
    vec4 myArray[SIZE];
    uniform int zero;
    out vec4 fragColour;
    void main() {
        fragColour = myArray[zero];
    }

● Why would allocating more memory make a shader slower?

OpenGL vs CUDA - An Interesting Example
● In GLSL, local memory is reserved
● The more memory each thread requires, the fewer threads can be active
● The result is low occupancy
[Diagram: threads occupying L1 cache / shared / “local” memory and registers within each SM/SMX]

Sorting in OIT
● Local memory is fixed
● Use conservative maximum
● Want dynamic size

    #define MAX_FRAGS set_by_application
    vec2 frags[MAX_FRAGS]; //conservative max
    void main() {
        int count = loadFragments(gl_FragCoord.xy);
        sortFragments(count); //insertion sort
        ...

Backwards Memory Allocation
Knowles, P., Leach, G., Zambetta, F.: Backwards Memory Allocation and Improved OIT. In Proceedings of Pacific Graphics 2013, pages 59–64, October 2013.

Register-Based Block Sort
● Local memory still slow
● External sort in registers
  ○ From local memory
  ○ Copy blocks to registers
  ○ Sort
  ○ Copy back
  ○ k-way merge
Knowles, P., Leach, G., Zambetta, F.: Fast Sorting for Exact OIT of Complex Scenes. The Visual Computer (TVCJ), vol. 30, no. 6-8, pages 603–613, June 2014.
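
The block sort steps listed above can be sketched on the CPU as follows. This is a minimal illustration in C++, not the authors' GLSL implementation: the `Frag` struct, the `BLOCK` size of 4, the ascending-depth order, and all function names are assumptions made for the example. A small fixed-size array stands in for registers and the `std::vector` stands in for local memory.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical fragment layout: a depth key plus a packed colour.
struct Frag { float depth; unsigned colour; };

constexpr std::size_t BLOCK = 4; // assumed block size, small enough to fit in registers

// Steps 1-3: copy each block into a fixed-size buffer (the "registers"),
// insertion sort it there by ascending depth, then copy it back.
void sortBlocks(std::vector<Frag>& frags) {
    for (std::size_t base = 0; base < frags.size(); base += BLOCK) {
        const std::size_t n = std::min(BLOCK, frags.size() - base);
        std::array<Frag, BLOCK> regs{};
        for (std::size_t i = 0; i < n; ++i) regs[i] = frags[base + i];
        for (std::size_t i = 1; i < n; ++i) {
            const Frag key = regs[i];
            std::size_t j = i;
            while (j > 0 && regs[j - 1].depth > key.depth) {
                regs[j] = regs[j - 1];
                --j;
            }
            regs[j] = key;
        }
        for (std::size_t i = 0; i < n; ++i) frags[base + i] = regs[i];
    }
}

// Step 4: k-way merge. Keep one read position per sorted block and
// repeatedly emit the smallest head among the blocks that are not exhausted.
std::vector<Frag> mergeBlocks(const std::vector<Frag>& frags) {
    const std::size_t k = (frags.size() + BLOCK - 1) / BLOCK;
    std::vector<std::size_t> head(k);
    for (std::size_t b = 0; b < k; ++b) head[b] = b * BLOCK;

    std::vector<Frag> out;
    out.reserve(frags.size());
    for (std::size_t n = 0; n < frags.size(); ++n) {
        std::size_t best = k; // k means "no block chosen yet"
        for (std::size_t b = 0; b < k; ++b) {
            const std::size_t end = std::min((b + 1) * BLOCK, frags.size());
            if (head[b] < end &&
                (best == k || frags[head[b]].depth < frags[head[best]].depth)) {
                best = b;
            }
        }
        out.push_back(frags[head[best]++]);
    }
    return out;
}

// Full pipeline: sort blocks in place, then merge them.
std::vector<Frag> registerBlockSort(std::vector<Frag> frags) {
    sortBlocks(frags);
    return mergeBlocks(frags);
}
```

In a shader, the merge output could be blended as it is produced instead of being written to a new array; the sketch returns a vector only to keep the result easy to inspect.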
Intermediate Compiler Output
● glGetProgramBinary
● Provided by Nvidia driver
● Poor man’s --keep (CUDA)
● Callouts on the slide: TEMP R0, R1; and TEMP lmem[8];

    OPTION NV_bindless_texture;
    PARAM c[9] = { program.local[0..8] };
    TEMP R0, R1;
    TEMP RC, HC;
    TEMP lmem[8];
    MOV.F lmem[0].x, c[0];
    MOV.F lmem[1].x, c[1];
    MOV.F lmem[2].x, c[2];
    MOV.F lmem[3].x, c[3];
    MOV.S R0.y, {1, 0, 0, 0}.x;
    REP.S ;
    SGE.S.CC HC.x, R0.y, c[8];
    BRK (NE.x);
    MOV.S R0.z, R0.y;
    REP.S ;
    SLE.S.CC HC.x, R0.y, {0, 0, 0, 0};
    BRK (NE.x);
    ADD.S R0.w, R0.z, -{1, 0, 0, 0}.x;
    ...
    MOV.U R0.x, R0.z;
    MOV.U R0.w, R0;
    MOV.F R0.x, lmem[R0.x].x;
    MOV.F R1.x, lmem[R0.w].x;
    SGT.F R0.x, R1, R0;
    TRUNC.U.CC HC.x, R0;
    ADD.S R0.z, R0, -{1, 0, 0, 0}.x;
    ENDREP;
    ADD.S R0.y, R0, {1, 0, 0, 0}.x;
    ENDREP;
    ADD.F result.position, R0.y, R0.x;
    END

Results - Milliseconds per frame
Titan X, 1920x1080

    Scene              Atrium    Hairball (front / back)    Power Plant
    Baseline           7         170 / 652                  374
    BMA+RBS            6         195 / 212                  30
    Opaque (no OIT)    1         5 / 3                      9

● Up to 10× improvement, at worst minor overhead

GPU Progression
Power plant scene (milliseconds per frame)

    GPU (year)    460 (2010)    670 (2012)    Titan (2013)    Titan X (2015)
    Baseline      1004          670           476             374
    BMA+RBS       258           94            56              30
    Speedup       3.9           7.1           8.5             12.3

● Speedup improves with each new GPU

Conclusion
● Low level optimizations are necessary despite the trend towards higher level languages
● Need to be exposed to hardware architecture via language and tools
● Perhaps increasingly necessary with newer GPUs
● 10× faster OIT with BMA+RBS
● Much bigger scenes possible (also bigger displays, e.g. 4K/8K)
● Better sorting and deep image rendering
● Much closer to opaque rendering speeds
● Sorting is no longer the bottleneck in many scenes

Questions?
