Master GPU engineering in one hour
Andrey Volodin, Senior Matrix Multiplicator

Agenda
• Computer graphics history
• Modern rendering
• Apple side of things
• What is GPGPU?
• Metal Compute Shaders
• Hype train
History

1977
Empire Strikes Back (Atari 2600)
The first notable system to use sprite graphics
1977: Atari 2600
128 bytes of RAM, including the call stack and the state of the game world
• Typical resolution: 160x192
• 128 colors in palette
• A framebuffer would need 160 * 192 * 7 bits = 26,880 bytes per frame
No framebuffer, graphics were generated in real-time.
Literally.
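The arithmetic above can be checked with a short Swift sketch. The figures come from the slide; the variable names are made up for illustration.

```swift
// Atari 2600 figures taken from the slide above.
let width = 160
let height = 192
let bitsPerPixel = 7      // 128 colors = 2^7
let ramBytes = 128

// Size of a hypothetical full-frame buffer.
let framebufferBytes = width * height * bitsPerPixel / 8

print(framebufferBytes)             // 26880
print(framebufferBytes / ramBytes)  // 210
```

A full frame would be 210 times larger than the machine's entire RAM, which is why the picture had to be produced while the beam was drawing it.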
The VCS could only display five interactive objects at any one time:
• 2 «player» sprites
• 2 «missile» sprites
• 1 «ball» sprite
«Racing the beam»
The VCS could only display five interactive objects at any one time: two «player» sprites, two «missile» sprites, and one «ball». But once the electron beam had drawn a sprite, the program could shift the position of that sprite horizontally and redraw it.
Blind spots were the only times programmers could do anything that didn't involve drawing graphics on the screen, such as reading joystick input, moving the player, or keeping score.
1983: Nintendo Entertainment System
(Vice Project Doom)
• 8-bit colors
• Still no framebuffer
• PPU (Picture Processing Unit)
• Tiled graphics (aka «character graphics»)
• Operates on tiles of 8x8 (or 8x16) pixels
• 8 sprites per scanline
• Another advantage is collision detection (movable/non-movable sprites)
1987 … 1991: Second Generation - Shaded Solids
• Very expensive, mostly used in professional simulators
• Vertex lighting
• Rasterization of filled polygons
• Depth buffer and blending
1999: First «GPU»
NVIDIA releases the first «Graphics Processing Unit», the GeForce 256
• Defined what a GPU should be
• Processed 10 million polygons per second
• Vertex transform
• Lighting
• Barely programmable
2001: GeForce3 with GeForceFX, first programmable GPU
Introduced the concept of shaders
• Vertex and fragment operations
• Macro assembly language
• Very limited

    ADDR R0.xyz, eyePosition.xyzx, -f[TEX0].xyzx;
    DP3R R0.w, R0.xyzx, R0.xyzx;
    RSQR R0.w, R0.w;
    MULR R0.xyz, R0.w, R0.xyzx;
    ADDR R1.xyz, lightPosition.xyzx, -f[TEX0].xyzx;
    DP3R R0.w, R1.xyzx, R1.xyzx;
    RSQR R0.w, R0.w;
    MADR R0.xyz, R0.w, R1.xyzx, R0.xyzx;
    MULR R1.xyz, R0.w, R1.xyzx;
    DP3R R0.w, R1.xyzx, f[TEX1].xyzx;
    MAXR R0.w, R0.w, {0}.x;
Recent trends

Time    Transistors   MHz   GFLOPS
Aug02   121M          500   8
Jan03   130M          475   20
Dec03   222M          400   53

• 1.8x increase in transistors
• 20% decrease in clock speed
• 6.6x GFLOPS speedup
Modern rendering process
• GPUs are very limited in what they can do
• Can only draw primitives: triangles, lines, points
• Highly optimized for floating-point operations

3D objects are stored as a set of triangles
They are placed in the 3D scene, which is simply a coordinate space
Next, there is usually a camera to provide different view angles
Last step in the vertex stage: projection
Affine transform matrices
• Scale
• Translation
• Rotation (Y)
Same concepts as in UIKit (CGAffineTransform)
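The three matrices named above can be sketched in plain Swift. This is an illustration under assumptions not stated on the slides: matrices are row-major 4x4 arrays that multiply column vectors, and all function names are hypothetical. A real project would more likely use simd types.

```swift
import Foundation

// Row-major 4x4 matrix that multiplies column vectors (an assumed convention).
typealias Matrix4 = [[Float]]

func scale(_ x: Float, _ y: Float, _ z: Float) -> Matrix4 {
    return [[x, 0, 0, 0], [0, y, 0, 0], [0, 0, z, 0], [0, 0, 0, 1]]
}

func translation(_ x: Float, _ y: Float, _ z: Float) -> Matrix4 {
    return [[1, 0, 0, x], [0, 1, 0, y], [0, 0, 1, z], [0, 0, 0, 1]]
}

func rotationY(_ angle: Float) -> Matrix4 {
    let c = Float(cos(Double(angle)))
    let s = Float(sin(Double(angle)))
    return [[c, 0, s, 0], [0, 1, 0, 0], [-s, 0, c, 0], [0, 0, 0, 1]]
}

// Standard matrix product: the right-hand matrix is applied to the point first.
func multiply(_ a: Matrix4, _ b: Matrix4) -> Matrix4 {
    var result = Matrix4(repeating: [Float](repeating: 0, count: 4), count: 4)
    for row in 0..<4 {
        for col in 0..<4 {
            for k in 0..<4 {
                result[row][col] += a[row][k] * b[k][col]
            }
        }
    }
    return result
}

// Transforms a 3D point by treating it as (x, y, z, 1).
func transform(_ m: Matrix4, _ point: [Float]) -> [Float] {
    let v = point + [1]
    var result = [Float](repeating: 0, count: 3)
    for row in 0..<3 {
        for k in 0..<4 {
            result[row] += m[row][k] * v[k]
        }
    }
    return result
}

// Scale by 2 first, then move by (1, 2, 3).
let combined = multiply(translation(1, 2, 3), scale(2, 2, 2))
print(transform(combined, [1, 1, 1]))  // [3.0, 4.0, 5.0]
```

Combining transforms into one matrix up front is what lets a vertex shader apply the whole chain with a single multiplication per vertex.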
Typical vertex shader
    return uniforms.modelViewProjectionTransform * float4(newPosition, 1.0);
modelViewProjectionTransform is calculated on the CPU as a combination of all transforms
Next, the rasterizer comes into play
Fragment shaders
Two triangles that cover the screen
    fragColor = vec4(1.0, 0.0, 0.0, 1.0);
    fragColor = vec4(length(deltaToCenter), 0.0, 0.0, 1.0);
sin(time) oscillates between -1.0 and 1.0
    fragColor = vec4(length(deltaToCenter) * sin(time), 0.0, 0.0, 1.0);
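To make the last expression concrete, here is a CPU stand-in for that fragment shader in Swift. It is only a sketch: the slides do not define deltaToCenter, so it is assumed here to be the pixel's normalized coordinate (0...1) minus the screen center (0.5, 0.5), and the Color type and function name are hypothetical.

```swift
import Foundation

struct Color: Equatable {
    var r: Float
    var g: Float
    var b: Float
    var a: Float
}

// Evaluates the shader for one pixel; on a GPU this runs once per fragment.
// u and v are assumed to be normalized screen coordinates in 0...1.
func fragment(u: Float, v: Float, time: Float) -> Color {
    let dx = u - 0.5
    let dy = v - 0.5
    let distanceToCenter = (dx * dx + dy * dy).squareRoot()
    let pulse = Float(sin(Double(time)))  // oscillates between -1 and 1
    return Color(r: distanceToCenter * pulse, g: 0, b: 0, a: 1)
}

let center = fragment(u: 0.5, v: 0.5, time: 1)
let corner = fragment(u: 0, v: 0, time: Float.pi / 2)
print(center.r)  // 0.0
print(corner.r)
```

The center pixel stays black at any time, while pixels further out get a red channel that pulses with sin(time).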
Fragment shaders
Example shaders: 140 LOC, 66 LOC, 800 LOC
Business card challenge by Paul Heckbert (1984)
Custom shader example
Imagine we have a sphere
What if we displace every vertex by a smooth noise value?
And also offset the read coordinates every frame
Next, we take a gradient texture and read from it as high as the current displacement
Draw calls happen for every set of geometry with a unique render state
What does Apple have to offer?
2007: OpenGL ES 1.1 (iPhone 2G)
2010: OpenGL ES 2.0 (iPhone 4)
2016: OpenGL ES 3.0 (iPhone 7)
Around OpenGLES.framework: Unity, GLKit.framework, libGDX, Cocos2D-X, Cocos2D-ObjC, SceneKit, Unreal, SpriteKit
Announced by Khronos in 2015
Initially, Apple was part of the working group
• Low CPU overhead
• Modern GPU features
• Do expensive tasks less often
• Optimized for CPU behaviour
• Thinnest possible API
MTLDevice
• Represents a single GPU
• Created via MTLCreateSystemDefaultDevice()
NOTE: Don't treat this object as a singleton

MTLCommandQueue
• Retrieved from the device via device.makeCommandQueue()
• Often referred to as «Metal context»

MTLCommandBuffer
• Made on a per-queue basis:
    guard let commandBuffer = commandQueue.makeCommandBuffer() else {
        fatalError("Could not create command buffer")
    }
Command types:
1. Render commands: MTLRenderCommandEncoder
2. Blit commands: MTLBlitCommandEncoder
3. Compute commands: MTLComputeCommandEncoder
MTLRenderCommandEncoder is created via commandBuffer.makeRenderCommandEncoder(descriptor:)
Pipeline state
• Represents the GPU state that needs to be set for the current command
• Must be initialized with shader functions
• Each pipeline state type has its own optional parameters
• Usually cached and reused
    // Create a reusable pipeline state for rendering geometry
    let stateDescriptor = MTLRenderPipelineDescriptor()
    stateDescriptor.vertexFunction = vertexFunc
    stateDescriptor.fragmentFunction = fragmentFunc
    let ps = try device.makeRenderPipelineState(descriptor: stateDescriptor)
Encoding with MTLRenderCommandEncoder:
1. .setRenderPipelineState(pipelineState) sets the MTLRenderPipelineState
2. .setVertexBuffer(_:offset:index:) binds the geometry buffer
3. .setVertexBuffer(_:offset:index:) and .setFragmentBuffer(_:offset:index:) bind further buffers. Resources bound at this point: texture 0, uniform buffer, geometry buffer
4. .drawPrimitives(***) encodes a draw call
5. .drawPrimitives(***) is repeated for another geometry with another state
6. .endEncoding() finishes the encoder
    // Subscribe to the completion event (must be added before commit)
    commandBuffer.addCompletedHandler { _ in }
    // Send the buffer to the command queue
    commandBuffer.commit()
    // Wait until all commands are executed
    commandBuffer.waitUntilCompleted()
Metal Compute Shaders
1999-2001: Early GPGPU
• Hoff (1999): Voronoi diagrams on NVIDIA TNT2
• Larsen & McAllister (2001): first GPU matrix multiplication (8-bit)
• Rumpf & Strzodka (2001): first GPU PDEs (diffusion, image segmentation)
• NVIDIA SDK Game of Life, Shallow Water (Greg James, 2001)
• You needed a PhD in computer graphics to do this
• Financial companies hired game developers
2002: GPGPU.org
General-Purpose GPU
Data stored in RGBA texels:
R     G     B    A    R    G     B    A
0.17  0.21  0.1  0.2  0.1  0.21  0.2  0.0
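The RGBA layout above hints at how early GPGPU code carried arbitrary numbers through a graphics API: values were stored in the color channels of a texture. A minimal Swift sketch of that packing, assuming four floats per texel and zero padding for the last one (the Texel type and function name are hypothetical):

```swift
// One RGBA texel holding four arbitrary float values.
struct Texel: Equatable {
    var r: Float
    var g: Float
    var b: Float
    var a: Float
}

// Packs a flat array into texels, four values at a time.
// The last texel is padded with zeros when the count is not a multiple of 4.
func packIntoTexels(_ values: [Float]) -> [Texel] {
    return stride(from: 0, to: values.count, by: 4).map { start -> Texel in
        let end = min(start + 4, values.count)
        let chunk = Array(values[start..<end])
        let padded = chunk + [Float](repeating: 0, count: 4 - chunk.count)
        return Texel(r: padded[0], g: padded[1], b: padded[2], a: padded[3])
    }
}

let texels = packIntoTexels([0.17, 0.21, 0.1, 0.2, 0.1, 0.21, 0.2, 0.0])
print(texels.count)  // 2
```

A fragment program could then read one texel and process four values at once, which is part of why this trick required graphics expertise.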
2007: CUDA
• First GPU architecture and software platform designed for computing
• First C/C++ language and compiler for GPUs
• 2007 began a massive surge in GPGPU development

Fragment program: input registers → fragment program → output registers
Thread program: thread ID → thread program → output registers
Metal Compute Shaders
• Act just like fragment or vertex shaders, but general-purpose
• Declared with the kernel keyword
• Suitable for highly parallel tasks
• Can be put in the same command buffer as render/blit commands

Task: multiply every element of a float buffer by a certain value
A purely parallel task, well suited to compute shaders
Code time!
1. Declare class ArrayProcessor
2. Inject MTLDevice or MTLCommandQueue as a dependency
3. Cache static elements in init(:)
    public class ArrayProcessor {
        public let commandQueue: MTLCommandQueue
        public let device: MTLDevice
        public let bufferMultiplierPipelineState: MTLComputePipelineState

        public init(commandQueue: MTLCommandQueue) { … }
        …
    }
Next, implement encoding GPU work on the CPU side
4. Prepare a container type for the kernel's parameters
    fileprivate struct Uniforms {
        public let multiplier: Float
        public let count: UInt32
    }
NOTE: Be careful with Swift's memory layout; use C/C++ declarations to avoid tricky bugs
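The layout warning can be made concrete with MemoryLayout. The sketch below compares the Uniforms struct from the slide with a hypothetical PaddedUniforms struct (not from the talk) whose size and stride differ, which is the kind of mismatch that causes tricky bugs when copying bytes into a GPU buffer.

```swift
// Same fields as the Uniforms struct on the slide.
struct Uniforms {
    let multiplier: Float
    let count: UInt32
}

// Hypothetical struct, added only to show padding.
struct PaddedUniforms {
    let multiplier: Float
    let enabled: Bool
}

// size: bytes actually used; stride: distance between consecutive elements.
print(MemoryLayout<Uniforms>.size,
      MemoryLayout<Uniforms>.stride,
      MemoryLayout<Uniforms>.alignment)        // 8 8 4
print(MemoryLayout<PaddedUniforms>.size,
      MemoryLayout<PaddedUniforms>.stride,
      MemoryLayout<PaddedUniforms>.alignment)  // 5 8 4
```

A C compiler would report 8 bytes for the second struct, so copying MemoryLayout.size bytes (5) would leave the shader reading uninitialized padding. Using stride, or declaring the struct in a shared C header, avoids that class of problem.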
5. Encode the compute kernel command into a command buffer
    public class ArrayProcessor {
        …
        public func process(array: [Float], multiplier: Float) { … }
        …
    }
Inside process(array:multiplier:), the pieces come together in this order:
• MTLDevice creates the array buffer and the uniform buffer
• MTLCommandQueue creates a command buffer
• The command buffer creates an MTLComputeCommandEncoder
• The encoder receives the MTLComputePipelineState and both buffers

Threads and threadgroups
• Metal executes your kernel function over a 1D, 2D or 3D grid
• Each point in the grid represents a single instance of your kernel function
• That instance is called a thread
• Threads are organized into threadgroups that can share a common block of memory

    kernel void myKernel(uint2 threadgroup_position_in_grid [[ threadgroup_position_in_grid ]],
                         uint2 thread_position_in_threadgroup [[ thread_position_in_threadgroup ]],
                         uint2 threads_per_threadgroup [[ threads_per_threadgroup ]])
Threads in a threadgroup are executed in a SIMD way (Single Instruction, Multiple Data)
When threads diverge on an if, all of them execute both branches, so keep divergence to a minimum
The division of threadgroups into SIMD groups is defined by Metal
SIMD group size is returned by threadExecutionWidth of compute pipeline state object
All you have to do is define threadgroup size
6. Calculate threadgroup count and size
    let executionWidth = bufferMultiplierPipelineState.threadExecutionWidth
    let threadgroupsPerGrid = MTLSize(width: (array.count + executionWidth - 1) / executionWidth, height: 1, depth: 1)
    let threadsPerThreadgroup = MTLSize(width: executionWidth, height: 1, depth: 1)
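The width expression above is a ceiling division. The sketch below isolates it in plain Swift so the numbers can be inspected without a GPU. The execution width of 32 is an assumption for illustration; the real value comes from threadExecutionWidth, and the function name is hypothetical.

```swift
// Number of threadgroups needed to cover every element (ceiling division).
func threadgroupCount(elementCount: Int, threadsPerThreadgroup: Int) -> Int {
    precondition(threadsPerThreadgroup > 0, "threadgroup size must be positive")
    return (elementCount + threadsPerThreadgroup - 1) / threadsPerThreadgroup
}

let assumedExecutionWidth = 32  // assumed; query threadExecutionWidth on a real device

for count in [1_000_000, 1_000] {
    let groups = threadgroupCount(elementCount: count,
                                  threadsPerThreadgroup: assumedExecutionWidth)
    let idleThreads = groups * assumedExecutionWidth - count
    print(count, groups, idleThreads)
}
// 1000000 31250 0
// 1000 32 24
```

When the element count is not a multiple of the threadgroup size, the last group contains extra threads, so the kernel needs a bounds check before touching the buffer.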
Finally, dispatch the work from process(array:multiplier:):
    computeEncoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)

7. Write shaders
Write your kernels
    kernel void bufferMultiplier(device float* inputBuffer [[buffer(0)]],
                                 const device BufferMultiplierUniforms& uniforms [[buffer(1)]],
                                 const uint threadIndex [[ thread_position_in_grid ]]) {
        // Threads past the end of the buffer have nothing to do
        if (threadIndex >= uniforms.bufferSize) { return; }

        const float initialValue = inputBuffer[threadIndex];
        inputBuffer[threadIndex] = initialValue * uniforms.multiplier;
    }
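Steps 1 to 7 can be put together in a single function. This is a sketch rather than the ArrayProcessor from the talk, and it rests on several assumptions: it needs a machine with a Metal device, it compiles the kernel from a source string at runtime to stay self-contained, it passes uniforms with setBytes into the constant address space, and it recreates every object on each call where the talk recommends caching them. The function and error names are hypothetical.

```swift
import Foundation
import Metal

// Kernel source compiled at runtime (assumption: the talk most likely ships it in a .metal file).
let kernelSource = """
#include <metal_stdlib>
using namespace metal;

struct BufferMultiplierUniforms {
    float multiplier;
    uint bufferSize;
};

kernel void bufferMultiplier(device float* inputBuffer [[buffer(0)]],
                             constant BufferMultiplierUniforms& uniforms [[buffer(1)]],
                             uint threadIndex [[thread_position_in_grid]]) {
    if (threadIndex >= uniforms.bufferSize) { return; }
    inputBuffer[threadIndex] *= uniforms.multiplier;
}
"""

// Swift mirror of the kernel-side struct: Float and UInt32 match float and uint.
struct Uniforms {
    var multiplier: Float
    var count: UInt32
}

enum MetalSketchError: Error { case noDevice, setupFailed }

func multiplyOnGPU(_ array: [Float], by multiplier: Float) throws -> [Float] {
    guard !array.isEmpty else { return [] }
    guard let device = MTLCreateSystemDefaultDevice() else { throw MetalSketchError.noDevice }

    // In real code these objects would be created once and cached.
    let library = try device.makeLibrary(source: kernelSource, options: nil)
    guard let function = library.makeFunction(name: "bufferMultiplier"),
          let queue = device.makeCommandQueue() else { throw MetalSketchError.setupFailed }
    let pipeline = try device.makeComputePipelineState(function: function)

    let byteCount = array.count * MemoryLayout<Float>.stride
    guard let buffer = device.makeBuffer(bytes: array, length: byteCount, options: .storageModeShared),
          let commandBuffer = queue.makeCommandBuffer(),
          let encoder = commandBuffer.makeComputeCommandEncoder() else { throw MetalSketchError.setupFailed }

    var uniforms = Uniforms(multiplier: multiplier, count: UInt32(array.count))
    encoder.setComputePipelineState(pipeline)
    encoder.setBuffer(buffer, offset: 0, index: 0)
    encoder.setBytes(&uniforms, length: MemoryLayout<Uniforms>.stride, index: 1)

    // One threadgroup per SIMD group, rounded up so every element is covered.
    let width = pipeline.threadExecutionWidth
    let groups = MTLSize(width: (array.count + width - 1) / width, height: 1, depth: 1)
    encoder.dispatchThreadgroups(groups, threadsPerThreadgroup: MTLSize(width: width, height: 1, depth: 1))
    encoder.endEncoding()

    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()  // blocking keeps the sketch simple

    let pointer = buffer.contents().bindMemory(to: Float.self, capacity: array.count)
    return Array(UnsafeBufferPointer(start: pointer, count: array.count))
}
```

Calling try multiplyOnGPU([1, 2, 3], by: 2) should return the doubled values on a machine with a Metal device. A production version would hold the device, queue and pipeline state as properties and use a completion handler instead of blocking.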
Benchmarks
What we will be playing with:
    var inputBuffer = [Float](repeating: 1.0, count: 1_000_000)
    let multiplier: Float = 2.0
What we will be comparing to:
    // CPU Implementation
    for i in 0.. …

Benchmarks: 1_000_000 elements
• Metal finished in 0.006s
• CPU finished in 0.1s
• The CPU is ~17 times slower

Benchmarks: 1_000_000 → 1_000 elements
• Metal finished in 0.003s
• CPU finished in 0.0001s
• The CPU is ~30 times faster

Tips
1. Beware of memory alignment
2. Beware of CPU-side encoding overhead
3. Keep code divergence to a minimum
4. Use half instead of float whenever possible
5. Avoid using ints
6. Calculate threadgroup sizes thoughtfully
7. Cache reusable CPU-side objects
8. Don't wait for the GPU to finish execution

Metal Performance Shaders
• A framework of data-parallel algorithms for the GPU
• Optimized for iOS
• As simple as calling a library function

Metal Performance Shaders: iOS 9
• Thresholding
• Median Filter
• Lanczos Resampling
• Image Integral
• Histogram, Equalization, and Specification

Metal Performance Shaders: iOS 10
• MPSCNN

Metal NN Graph API

CoreML
• Easy to use
• Wide range of desktop frameworks
• Almost as fast as manual encoding
• GPU/CPU optimizations
• Is not customizable
• Sometimes buggy
• Zero control

Thanks!
@s1ddok