10× Faster Transparency from Low Level Optimisation

Pyarelal Knowles, Geoff Leach, Fabio Zambetta
RMIT University, Australia

[Images: Opaque, Transparent]

460 (2010): 6 FPS
Titan X (Now): 142 FPS
12 Million Polygons

460 (base): 1 FPS
Titan X (base): 3 FPS
Titan X (ours): 30 FPS

10× Faster Transparency (Software)

● Gained with two techniques
  ○ Backwards Memory Allocation - better occupancy
  ○ Register-Based Block Sort - faster sorting
● Involves low level optimizations (OpenGL+GLSL)
● Interesting technical details
● Insight from CUDA, similarities to OpenGL
● Important to know hardware and language
● Now within 10× of opaque rendering speed

Transparency

● Objects, glass, visualization
● Antialiasing
● Particles
● Shadow Maps

Images from: Creo Parametric 2, Portal 2, Shadow Of Mordor, marmoset.co, GRID 2, Unity3D, Chen et al.’s Real-Time Volumetric Shadows using 1D Min-Max Mipmaps, Lokovic and Veach’s Deep Shadow Maps.

Transparency

● Transparency uses alpha blending
● Weighted average
● Based on surface order

[Images: Not Sorted, Sorted]

Sorting for Transparency
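Because alpha blending is a weighted average applied one surface at a time, the result depends on the order in which surfaces are blended. The following is a minimal CPU-side sketch of that effect, not code from the talk: the `blend` helper, the white background and the two half-transparent surfaces are hypothetical values chosen for illustration.

```python
def blend(dst, src):
    """Alpha-blend src (r, g, b, a) over dst (r, g, b): a weighted average."""
    r, g, b, a = src
    return tuple(d * (1.0 - a) + s * a for d, s in zip(dst, (r, g, b)))


# Hypothetical scene: a white background and two 50% opaque surfaces.
background = (1.0, 1.0, 1.0)
red = (1.0, 0.0, 0.0, 0.5)   # assumed to be the farther surface
blue = (0.0, 0.0, 1.0, 0.5)  # assumed to be the nearer surface

# Sorted: blend back to front, farthest surface first.
sorted_result = blend(blend(background, red), blue)

# Not sorted: the same two surfaces blended in the opposite order.
unsorted_result = blend(blend(background, blue), red)

print(sorted_result)    # (0.5, 0.25, 0.75)
print(unsorted_result)  # (0.75, 0.25, 0.5)
```

The two results differ even though the inputs are identical, which is why the surfaces (or their fragments) have to be sorted before blending.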

Sort triangles
● Geometry dependent

Sort fragments (potential pixels)
● Rasterize and store
● Geometry independent
● Order Independent Transparency (OIT)
...

Order Independent Transparency (OIT)

● Two passes
  ○ Build a deep image
  ○ Sort and blend fragments
● Exact OIT: sort all fragments
● Code snippets
  ○ On my poster
  ○ https://github.com/pknowles/oit

1. Deep Image

● Many fragments per pixel
● Construct in fragment shader
  ○ Race conditions
  ○ Different data structures
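As an illustration of one possible deep image data structure, the sketch below stores fragments in per-pixel linked lists. This layout is an assumption made for the example; the slides do not say which structure is used. The `DeepImage` class and its method names are hypothetical, and the code is single threaded, so it only marks where a GPU version would need atomic operations to avoid the race conditions.

```python
class DeepImage:
    """Per-pixel linked lists of fragments (an assumed, illustrative layout)."""

    def __init__(self, width, height):
        self.width = width
        self.heads = [-1] * (width * height)  # head node per pixel, -1 = empty
        self.nodes = []                       # (next_index, depth, colour)

    def add_fragment(self, x, y, depth, colour):
        pixel = y * self.width + x
        # On a GPU, allocating the node and swapping the head pointer would
        # have to be atomic, because many fragments arrive concurrently.
        node_index = len(self.nodes)
        self.nodes.append((self.heads[pixel], depth, colour))
        self.heads[pixel] = node_index

    def fragments(self, x, y):
        """Return the (depth, colour) pairs stored for one pixel, unsorted."""
        out = []
        node = self.heads[y * self.width + x]
        while node != -1:
            next_node, depth, colour = self.nodes[node]
            out.append((depth, colour))
            node = next_node
        return out


image = DeepImage(2, 2)
image.add_fragment(0, 0, 0.7, "a")
image.add_fragment(1, 1, 0.2, "b")
image.add_fragment(0, 0, 0.3, "c")
```

Fragments come back in reverse insertion order, not depth order, so a separate sorting pass is still needed.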

Knowles, P.: Real-Time deep image rendering and order independent transparency. PhD Thesis, RMIT University, 2015.

2. Sort and composite
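The second pass reads each pixel's fragments, sorts them and blends them. A minimal CPU-side sketch of that per-pixel work follows. It rests on several assumptions that are not taken from the talk: each fragment is a `(depth, packed_colour)` pair, a larger depth means farther away, and the colour is packed as a 32-bit RGBA8 integer. The function names are hypothetical.

```python
def unpack_colour(packed):
    """Unpack an assumed 32-bit RGBA8 integer into four floats in [0, 1]."""
    return tuple(((packed >> shift) & 0xFF) / 255.0 for shift in (24, 16, 8, 0))


def insertion_sort_back_to_front(frags):
    """Sort (depth, packed_colour) pairs in place, farthest fragment first."""
    for i in range(1, len(frags)):
        key = frags[i]
        j = i - 1
        while j >= 0 and frags[j][0] < key[0]:
            frags[j + 1] = frags[j]
            j -= 1
        frags[j + 1] = key


def composite(frags):
    """Sort one pixel's fragments, then blend them over a white background."""
    insertion_sort_back_to_front(frags)
    colour = (1.0, 1.0, 1.0, 1.0)
    for _, packed in frags:
        c = unpack_colour(packed)
        alpha = c[3]
        # Same weighted average as GLSL's mix(colour, c, c.a).
        colour = tuple(d * (1.0 - alpha) + s * alpha for d, s in zip(colour, c))
    return colour


# Hypothetical pixel: an opaque blue fragment in front of an opaque red one.
pixel_frags = [(0.2, 0x0000FFFF), (0.8, 0xFF0000FF)]
result = composite(pixel_frags)
```

Insertion sort suits short lists, but its cost grows quadratically with the number of fragments per pixel.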

Full screen pass:
1. Read all fragments
2. Sort
3. Blend

Bottleneck for large scenes

    vec2 frags[MAX_FRAGS];

    void main() {
        int count = loadFragments(gl_FragCoord.xy);
        sortFragments(count); //insertion sort
        colour = vec4(1.0);
        for (int i = 0; i < count; ++i) {
            vec4 c = unpackColour(frags[i].y);
            colour = mix(colour, c, c.a);
        }
    }

OpenGL+GLSL vs CUDA

OpenGL                               CUDA
Same hardware (both)
Graphics                             Compute
GLSL                                 Kernels
Specification, not implementation    Implementation well documented
Improving Nsight support             Good Nsight support

● CUDA gives insight into GLSL execution
● Some significant architectural differences...

GPU Architecture

[Diagram: GPU memory hierarchy]
● Global memory: slow
● L2
● SM/SMX, SM/SMX, SM/SMX
● L1 cache / shared / “local”: faster
● Registers: fastest
● SP SP SP SP …

OpenGL vs CUDA - An Interesting Example

    #define SIZE set_by_application
    vec4 myArray[SIZE];
    uniform int zero;
    out vec4 fragColour;

    void main() {
        fragColour = myArray[zero];
    }

● Why would allocating more memory make a shader slower?

OpenGL vs CUDA - An Interesting Example

● In GLSL, local memory is reserved
● The more required
● The fewer active threads
● Low occupancy

[Diagram: GPU memory hierarchy (global memory, L2 cache, SM/SMX units), with threads sharing each SM/SMX's L1 cache / shared / “local” memory and registers; SP SP SP SP …]

Sorting in OIT
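The occupancy effect described above can be illustrated with a toy model. Every hardware figure in it is hypothetical (a 48 KiB per-SM budget and a 2048-thread limit), and real GPUs manage local memory in more complicated ways. The only fixed fact is that a `vec4` occupies 16 bytes. The function name `resident_threads` is also made up for this sketch.

```python
def resident_threads(array_len,
                     bytes_per_element=16,          # one vec4 = 4 x 32-bit floats
                     local_bytes_per_sm=48 * 1024,  # hypothetical budget
                     max_threads_per_sm=2048):      # hypothetical hardware limit
    """Toy model: threads that fit on one SM when each reserves a local array."""
    bytes_per_thread = array_len * bytes_per_element
    return min(max_threads_per_sm, local_bytes_per_sm // bytes_per_thread)


for size in (1, 8, 64, 256):
    print(size, resident_threads(size))
```

In this model a thread that reserves 256 elements leaves room for only 12 threads per SM, even if it reads just one element, which mirrors the slowdown from a larger `SIZE` in the example shader.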

● Local memory is fixed
● Use conservative maximum
● Want dynamic size

    #define MAX_FRAGS set_by_application

    vec2 frags[MAX_FRAGS]; //conservative max

    void main() {
        int count = loadFragments(gl_FragCoord.xy);

        sortFragments(count); //insertion sort

        ...

Backwards Memory Allocation
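A minimal sketch of one way to approximate a dynamic array size follows. It assumes that pixels are grouped by fragment count into power-of-two intervals, and that each group is then processed by a shader compiled with a matching `MAX_FRAGS`. This grouping scheme is an assumption for illustration and is not stated on the slide. The function names are hypothetical.

```python
def interval_size(count, max_frags):
    """Smallest power-of-two array size that holds `count` fragments."""
    size = 1
    while size < count:
        size *= 2
    return min(size, max_frags)


def group_pixels(frag_counts, max_frags):
    """Map each array size to the pixels whose fragment lists fit in it."""
    groups = {}
    for pixel, count in enumerate(frag_counts):
        if count == 0:
            continue  # nothing to sort or blend for this pixel
        groups.setdefault(interval_size(count, max_frags), []).append(pixel)
    return groups


# Hypothetical per-pixel fragment counts for six pixels.
counts = [0, 3, 8, 1, 20, 5]
groups = group_pixels(counts, max_frags=32)
print(groups)  # {4: [1], 8: [2, 5], 1: [3], 32: [4]}
```

Only pixel 4 pays for the 32-element array, so the other pixels can run in shaders that reserve less local memory per thread.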

Knowles, P., Leach, G., Zambetta, F.: Backwards Memory Allocation and Improved OIT. In Proceedings of Pacific Graphics 2013, pages 59–64, October 2013.

Register-Based Block Sort

● Local memory still slow
● External sort in registers
  ○ From local memory
  ○ Copy blocks to registers
  ○ Sort
  ○ Copy back
  ○ k-way merge
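The steps above follow the shape of an external sort, which can be sketched on the CPU as follows. Python has no notion of registers, so a small list stands in for the block held in registers. The block size of 4 and the function name are hypothetical, and `heapq.merge` is used here as a stand-in for the k-way merge.

```python
import heapq


def register_block_sort(frags, block_size=4):
    """External-sort sketch: sort fixed-size blocks, then k-way merge them."""
    blocks = []
    for start in range(0, len(frags), block_size):
        block = frags[start:start + block_size]  # copy a block to "registers"
        block.sort()                             # sort the small block
        blocks.append(block)                     # copy the sorted block back
    return list(heapq.merge(*blocks))            # k-way merge of sorted blocks


depths = [5, 1, 4, 2, 9, 7, 3, 8, 6]
print(register_block_sort(depths))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Each block is small enough to be sorted in fast storage, and the slower memory is only touched for the block copies and the final merge.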

Knowles, P., Leach, G., Zambetta, F.: Fast Sorting for Exact OIT of Complex Scenes. The Visual Computer (TVCJ), vol. 30, no. 6-8, pages 603–613, June 2014.

Intermediate Compiler Output

● glGetProgramBinary
● Provided by driver
● Poor man’s --keep (CUDA)

    ##########################################
    OPTION NV_bindless_texture;
    PARAM c[9] = { program.local[0..8] };
    TEMP R0, R1;
    TEMP RC, HC;
    TEMP lmem[8];
    MOV.F lmem[0].x, c[0];
    MOV.F lmem[1].x, c[1];
    MOV.F lmem[2].x, c[2];
    MOV.F lmem[3].x, c[3];
    MOV.S R0.y, {1, 0, 0, 0}.x;
    REP.S ;
    SGE.S.CC HC.x, R0.y, c[8];
    BRK (NE.x);
    MOV.S R0.z, R0.y;
    REP.S ;
    SLE.S.CC HC.x, R0.y, {0, 0, 0, 0};
    BRK (NE.x);
    ADD.S R0.w, R0.z, -{1, 0, 0, 0}.x;
    MOV.U R0.x, R0.z;
    MOV.U R0.w, R0;
    MOV.F R0.x, lmem[R0.x].x;
    MOV.F R1.x, lmem[R0.w].x;
    SGT.F R0.x, R1, R0;
    TRUNC.U.CC HC.x, R0;
    ADD.S R0.z, R0, -{1, 0, 0, 0}.x;
    ENDREP;
    ADD.S R0.y, R0, {1, 0, 0, 0}.x;
    ENDREP;
    ADD.F result.position, R0.y, R0.x;
    END
    ##########################################

Highlighted in the listing: TEMP R0, R1; ... TEMP lmem[8];

Results - Milliseconds per frame

Scene              Atrium    Hairball (front / back)    Power Plant
Baseline           7         170 / 652                  374
BMA+RBS            6         195 / 212                  30
Opaque (no OIT)    1         5 / 3                      9

Titan X, 1920x1080

● Up to 10x improvement, at worst minor overhead

GPU Progression

Power plant scene (milliseconds per frame)

● Speedup improves with each new GPU

GPU (year)    460 (2010)    670 (2012)    Titan (2013)    Titan X (2015)
Baseline      1004          670           476             374
BMA+RBS       258           94            56              30
Speedup       3.9           7.1           8.5             12.3

Conclusion

● Low level optimizations necessary despite trend for higher level languages
● Need to be exposed to hardware architecture via language and tools
● Perhaps increasingly necessary with newer GPUs
● 10× faster OIT with BMA+RBS
● Much bigger scenes possible (also displays, i.e. 4K/8K)
● Better sorting and deep image rendering
● Much closer to opaque rendering speeds
● Sorting is no longer the bottleneck in many scenes

Questions?
