10× Faster Transparency from Low Level Shader Optimisation
Pyarelal Knowles, Geoff Leach, Fabio Zambetta
RMIT University, Australia

[Figure: 12 million polygon scene, opaque vs. transparent]
Opaque: 460 (2010): 6 FPS; Titan X (now): 142 FPS
Transparent: 460 (base): 1 FPS; Titan X (base): 3 FPS; Titan X (ours): 30 FPS

10× Faster Transparency (Software)
● Gained with two techniques
  ○ Backwards Memory Allocation - better occupancy
  ○ Register-Based Block Sort - faster sorting
● Involves low level optimizations (OpenGL+GLSL)
● Interesting technical details
● Insight from CUDA, similarities to OpenGL
● Important to know hardware and language
● Now within 10× opaque rendering

Transparency
[Images: objects, glass, visualization; antialiasing; particles; shadow maps]
Images from: Creo Parametric 2, Portal 2, Shadow Of Mordor, marmoset.co, GRID 2, Unity3D, Chen et al.'s Real-Time Volumetric Shadows using 1D Min-Max Mipmaps, Lokovic and Veach's Deep Shadow Maps.

Transparency
● Transparency uses alpha blending
● Weighted average
● Based on surface order
[Figure: Not Sorted vs. Sorted]

Sorting for Transparency
Sort triangles
● Geometry dependent
Sort fragments (potential pixel colors)
● Rasterize and store
● Geometry independent
● Order Independent Transparency (OIT)
...

Order Independent Transparency (OIT)
● Two passes
  ○ Build a deep image
  ○ Sort and blend fragments
● Exact OIT: sort all fragments
● Code snippets
  ○ On my poster
  ○ https://github.com/pknowles/oit

1. Deep Image
● Many fragments per pixel
● Construct in fragment shader
  ○ Race conditions
  ○ Different data structures
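The slides do not say which data structure is used. As one illustration, a per-pixel linked list is a common choice for deep images. The C++ sketch below models it on the CPU, with `std::atomic` standing in for GPU atomic operations to avoid the race conditions mentioned above. `DeepImage`, `Fragment` and the fixed node pool are assumptions made for this example.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// One stored fragment; next is the index of the next node, or -1 at the end.
struct Fragment { float depth; std::uint32_t colour; std::int32_t next; };

// Hypothetical CPU model of a deep image built from per-pixel linked lists.
class DeepImage {
public:
    DeepImage(std::size_t pixels, std::size_t maxFragments)
        : heads(pixels), nodes(maxFragments), counter(0)
    {
        for (auto& h : heads) h.store(-1);
    }

    // Build pass: an atomic counter hands out unique node slots, and an
    // atomic exchange links the new node in as the pixel's list head.
    // Returns false when the node pool is full.
    bool insert(std::size_t pixel, float depth, std::uint32_t colour)
    {
        std::uint32_t index = counter.fetch_add(1);
        if (index >= nodes.size()) return false;
        std::int32_t previous =
            heads[pixel].exchange(static_cast<std::int32_t>(index));
        nodes[index] = { depth, colour, previous };
        return true;
    }

    // Second pass: walk the list (newest first). Call only after all inserts.
    std::vector<Fragment> fragments(std::size_t pixel) const
    {
        std::vector<Fragment> out;
        for (std::int32_t i = heads[pixel].load(); i != -1; i = nodes[i].next)
            out.push_back(nodes[i]);
        return out;
    }

private:
    std::vector<std::atomic<std::int32_t>> heads;
    std::vector<Fragment> nodes;
    std::atomic<std::uint32_t> counter;
};
```

The list comes back in insertion order reversed, not depth order, so the second pass still has to sort.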
Knowles, P.: Real-Time deep image rendering and order independent transparency. PhD Thesis, RMIT University, 2015.

2. Sort and composite
Full screen pass:
1. Read all fragments
2. Sort
3. Blend
Bottleneck for large scenes

vec2 frags[MAX_FRAGS];

void main()
{
    int count = loadFragments(gl_FragCoord.xy);
    sortFragments(count); // insertion sort

    colour = vec4(1.0);
    for (int i = 0; i < count; ++i) {
        vec4 c = unpackColour(frags[i].y);
        colour = mix(colour, c, c.a);
    }
}

OpenGL+GLSL vs CUDA
OpenGL                                 CUDA
                 Same hardware
Graphics                               Compute
GLSL Shaders                           Kernels
Specification, not implementation      Implementation well documented
Improving Nsight support               Good Nsight support

● CUDA gives insight into GLSL execution
● Some significant architectural differences...

GPU Architecture
[Diagram: GPU memory hierarchy]
● GPU
  ○ Slow: Global Memory
  ○ L2 Cache
● SM/SMX (several per GPU)
  ○ Faster: L1 Cache / Shared / "Local"
  ○ Fastest: Registers
  ○ SP SP SP SP …

OpenGL vs CUDA - An Interesting Example
#define SIZE set_by_application
vec4 myArray[SIZE];
uniform int zero;
out vec4 fragColour;

void main()
{
    fragColour = myArray[zero];
}

● Why would allocating more memory make a shader slower?

OpenGL vs CUDA - An Interesting Example
● In GLSL, local memory is reserved
● The more required
● The fewer active threads
● Low occupancy
[Diagram: GPU memory hierarchy, with each thread's reserved region shown in L1 Cache / Shared / "Local"]

Sorting in OIT
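To put rough numbers on why a larger array lowers occupancy, the C++ sketch below uses a deliberately simplified model: each thread reserves its whole array out of a fixed per-SM budget. The budget and thread limit are hypothetical inputs, not figures for any real GPU, and actual drivers may place local memory differently.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical per-SM limits; illustrative inputs, not real hardware figures.
struct SmLimits {
    std::size_t localBytes;  // memory available for per-thread arrays
    std::size_t maxThreads;  // hardware cap on resident threads
};

// Threads that fit when each one reserves maxFrags * bytesPerFrag bytes.
std::size_t activeThreads(const SmLimits& sm, std::size_t maxFrags,
                          std::size_t bytesPerFrag)
{
    std::size_t perThread = maxFrags * bytesPerFrag;
    if (perThread == 0) return sm.maxThreads;
    return std::min(sm.maxThreads, sm.localBytes / perThread);
}

// Fraction of the thread cap that is actually resident.
double occupancy(const SmLimits& sm, std::size_t maxFrags,
                 std::size_t bytesPerFrag)
{
    return static_cast<double>(activeThreads(sm, maxFrags, bytesPerFrag)) /
           static_cast<double>(sm.maxThreads);
}
```

Under this model, with an assumed 48 KiB budget, a 2048-thread cap and 8-byte `vec2` fragments, `MAX_FRAGS = 8` leaves 768 threads resident while `MAX_FRAGS = 256` leaves only 24, even if most pixels hold far fewer fragments.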
● Local memory is fixed
● Use conservative maximum
● Want dynamic size

#define MAX_FRAGS set_by_application

vec2 frags[MAX_FRAGS]; // conservative max

void main()
{
    int count = loadFragments(gl_FragCoord.xy);
    sortFragments(count); // insertion sort
    ...

Backwards Memory Allocation
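The slides give only a citation for this technique. As a hedged illustration of how fixed-size shaders can approximate a dynamic array size, the C++ sketch below buckets pixels by fragment count into power-of-two intervals, so each bucket could be processed by a shader variant compiled with a matching `MAX_FRAGS`. The power-of-two intervals, the minimum size of 8 and all names here are assumptions for this example; the cited paper is the authority on the actual method.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Smallest power-of-two capacity (starting at an assumed minimum of 8)
// that can hold count fragments.
std::size_t intervalCapacity(std::size_t count)
{
    std::size_t capacity = 8;
    while (capacity < count) capacity *= 2;
    return capacity;
}

// Group pixel indices by the array size their shader variant would need.
// Pixels with no fragments are skipped, as there is nothing to sort.
std::map<std::size_t, std::vector<std::size_t>>
bucketPixels(const std::vector<std::size_t>& fragmentCounts)
{
    std::map<std::size_t, std::vector<std::size_t>> buckets;
    for (std::size_t pixel = 0; pixel < fragmentCounts.size(); ++pixel) {
        if (fragmentCounts[pixel] == 0) continue;
        buckets[intervalCapacity(fragmentCounts[pixel])].push_back(pixel);
    }
    return buckets;
}
```

With this grouping, only the few deepest pixels pay for the largest array, while shallow pixels run with a small reservation and higher occupancy.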
Knowles, P., Leach, G., Zambetta, F.: Backwards Memory Allocation and Improved OIT. In Proceedings of Pacific Graphics 2013, pages 59–64, October 2013.

Register-Based Block Sort
● Local memory still slow
● External sort in registers
  ○ From local memory
  ○ Copy blocks to registers
  ○ Sort
  ○ Copy back
  ○ k-way merge
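The steps above can be sketched on the CPU as follows. This C++ example is an illustration rather than the authors' shader code: a small `std::array` stands in for registers, the block size of 4 is an arbitrary assumption, and the merge writes to an output vector for simplicity.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t BLOCK = 4;  // hypothetical block size

// Copy one block into a small fixed-size array (the "registers"),
// insertion sort it there, then copy it back. Requires n <= BLOCK.
void sortBlock(std::vector<float>& data, std::size_t begin, std::size_t n)
{
    std::array<float, BLOCK> regs{};
    for (std::size_t i = 0; i < n; ++i) regs[i] = data[begin + i];
    for (std::size_t i = 1; i < n; ++i) {
        float key = regs[i];
        std::size_t j = i;
        while (j > 0 && regs[j - 1] > key) { regs[j] = regs[j - 1]; --j; }
        regs[j] = key;
    }
    for (std::size_t i = 0; i < n; ++i) data[begin + i] = regs[i];
}

// Sort every block, then k-way merge the sorted blocks (ascending).
std::vector<float> blockSort(std::vector<float> data)
{
    const std::size_t n = data.size();
    const std::size_t blocks = (n + BLOCK - 1) / BLOCK;
    for (std::size_t b = 0; b < blocks; ++b)
        sortBlock(data, b * BLOCK, std::min(BLOCK, n - b * BLOCK));

    std::vector<std::size_t> head(blocks);  // next unread item per block
    for (std::size_t b = 0; b < blocks; ++b) head[b] = b * BLOCK;

    std::vector<float> out;
    out.reserve(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::size_t best = blocks;  // "none found yet"
        for (std::size_t b = 0; b < blocks; ++b) {
            std::size_t end = std::min(n, (b + 1) * BLOCK);
            if (head[b] < end &&
                (best == blocks || data[head[b]] < data[head[best]]))
                best = b;
        }
        out.push_back(data[head[best]++]);
    }
    return out;
}
```

On a GPU the per-block loops would need compile-time bounds so the block can live in registers; that detail is not modelled here.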
Knowles, P., Leach, G., Zambetta, F.: Fast Sorting for Exact OIT of Complex Scenes. The Visual Computer (TVCJ), vol. 30, no. 6-8, pages 603–613, June 2014.

Intermediate Compiler Output

● glGetProgramBinary
● Provided by Nvidia driver
● Poor man's --keep (CUDA)

##########################################
OPTION NV_bindless_texture;
PARAM c[9] = { program.local[0..8] };
TEMP R0, R1;
TEMP RC, HC;
TEMP lmem[8];
MOV.F lmem[0].x, c[0];
MOV.F lmem[1].x, c[1];
MOV.F lmem[2].x, c[2];
MOV.F lmem[3].x, c[3];
MOV.S R0.y, {1, 0, 0, 0}.x;
REP.S ;
SGE.S.CC HC.x, R0.y, c[8];
BRK (NE.x);
MOV.S R0.z, R0.y;
REP.S ;
SLE.S.CC HC.x, R0.y, {0, 0, 0, 0};
BRK (NE.x);
ADD.S R0.w, R0.z, -{1, 0, 0, 0}.x;
MOV.U R0.x, R0.z;
MOV.U R0.w, R0;
MOV.F R0.x, lmem[R0.x].x;
MOV.F R1.x, lmem[R0.w].x;
SGT.F R0.x, R1, R0;
TRUNC.U.CC HC.x, R0;
ADD.S R0.z, R0, -{1, 0, 0, 0}.x;
ENDREP;
ADD.S R0.y, R0, {1, 0, 0, 0}.x;
ENDREP;
ADD.F result.position, R0.y, R0.x;
END
##########################################

[Highlighted in the listing: TEMP R0, R1; ... TEMP lmem[8];]

Results - Milliseconds per frame
Scene             Atrium   Hairball (front / back)   Power Plant
Baseline          7        170 / 652                 374
BMA+RBS           6        195 / 212                 30
Opaque (no OIT)   1        5 / 3                     9

● Up to 10× improvement, at worst minor overhead
Titan X, 1920x1080

GPU Progression
Power plant scene (milliseconds per frame)
● Speedup improves with each new GPU

GPU (year)   460 (2010)   670 (2012)   Titan (2013)   Titan X (2015)
Baseline     1004         670          476            374
BMA+RBS      258          94           56             30
Speedup      3.9          7.1          8.5            12.3

Conclusion
● Low level optimizations necessary despite trend for higher level languages
● Need to be exposed to hardware architecture via language and tools
● Perhaps increasingly necessary with newer GPUs
● 10× faster OIT with BMA+RBS
● Much bigger scenes possible (also displays, i.e. 4K/8K)
● Better sorting and deep image rendering
● Much closer to opaque rendering speeds
● Sorting is no longer the bottleneck in many scenes

Questions?