Master GPU-engineering in one hour

Andrey Volodin, Senior Matrix Multiplicator

Agenda

• Computer graphics history
• Modern rendering
• Apple side of things
• What is GPGPU?
• Metal Compute
• Hype train

History

1977

Empire Strikes Back (Atari 2600)

The first notable system to use sprite graphics

1977: Atari 2600

128 bytes of RAM including call stack and the state of game world

• Typical resolution: 160x192
• 128 colors in palette
• 160 * 192 * 7 bits = 26 880 bytes per frame

No framebuffer, graphics were generated in real-time.

Literally.
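The arithmetic behind the missing framebuffer can be checked in a few lines. This is a small sketch using only the numbers from the slide; the variable names are made up for illustration.

```swift
// Size of one frame if the Atari 2600 had stored it in memory.
let width = 160
let height = 192
let bitsPerPixel = 7                 // 128 colors = 2^7

let frameBits = width * height * bitsPerPixel
let frameBytes = frameBits / 8       // bits -> bytes
let ramBytes = 128                   // total RAM, including the call stack

print("One frame: \(frameBytes) bytes, RAM: \(ramBytes) bytes")
print("A frame is \(frameBytes / ramBytes) times larger than all of RAM")
```

A frame would not fit in RAM by a wide margin, which is why the picture had to be generated while the beam was drawing it.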


The VCS could only display five interactive objects at any one time:

• 2 «player» sprites
• 2 «missile» sprites
• 1 «ball» sprite

«Racing the beam»

The VCS could only display five interactive objects at any one time: two «player» sprites, two «missile» sprites, and one «ball». But once the electron beam had drawn a sprite, the program could shift the position of that sprite horizontally and redraw it.


Blind spots were the only times programmers could do anything that didn't involve drawing graphics on the screen: reading joystick input, moving players, keeping score.

1983: Nintendo Entertainment System

(Vice: Project Doom)

• 8-bit colors
• Still no framebuffer
• PPU (Picture Processing Unit)
• Tiled graphics (aka «character graphics»)
• Operates with tiles of 8x8 (or 8x16) pixels
• 8 sprites per scanline
• Another advantage is collision detection (movable/non-movable sprites)


1987-1991: Second Generation - Shaded Solids

• Very expensive, mostly used in professional simulators
• Vertex lighting
• Rasterization of filled polygons
• Depth buffer and blending


1999: First «GPU»

NVIDIA releases the first «GPU»: GeForce 256

• Defined what a GPU should be
• Processed 10 million polygons per second
• Vertex transform
• Lighting
• Barely programmable

2001: GeForce3 (and later GeForce FX), the first programmable GPUs

Introduced the concept of shaders

• Vertex and fragment operations
• Macro assembly language
• Very limited

    ADDR R0.xyz, eyePosition.xyzx, -f[TEX0].xyzx;
    DP3R R0.w, R0.xyzx, R0.xyzx;
    RSQR R0.w, R0.w;
    MULR R0.xyz, R0.w, R0.xyzx;
    ADDR R1.xyz, lightPosition.xyzx, -f[TEX0].xyzx;
    DP3R R0.w, R1.xyzx, R1.xyzx;
    RSQR R0.w, R0.w;
    MADR R0.xyz, R0.w, R1.xyzx, R0.xyzx;
    MULR R1.xyz, R0.w, R1.xyzx;
    DP3R R0.w, R1.xyzx, f[TEX1].xyzx;
    MAXR R0.w, R0.w, {0}.x;

Recent trends

    Time    Transistors   MHz   GFLOPS
    Aug02   121M          500   8
    Jan03   130M          475   20
    Dec03   222M          400   53

• 1.8x increase in transistors
• 20% decrease in clock speed
• 6.6x GFLOPS speedup
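The three ratios come straight from the first and last rows of the table. A minimal sketch of that arithmetic follows; the `Sample` type and its field names are made up for illustration.

```swift
// First and last rows of the trends table.
struct Sample {
    let date: String
    let transistorsMillions: Double
    let clockMHz: Double
    let gflops: Double
}

let first = Sample(date: "Aug02", transistorsMillions: 121, clockMHz: 500, gflops: 8)
let last = Sample(date: "Dec03", transistorsMillions: 222, clockMHz: 400, gflops: 53)

// Growth factors between the two samples.
let transistorGrowth = last.transistorsMillions / first.transistorsMillions
let clockChange = (last.clockMHz - first.clockMHz) / first.clockMHz
let gflopsGrowth = last.gflops / first.gflops

print("Transistors: x\((transistorGrowth * 100).rounded() / 100)")
print("Clock change: \((clockChange * 100).rounded())%")
print("GFLOPS: x\((gflopsGrowth * 100).rounded() / 100)")
```

Throughput grew much faster than transistor count while the clock went down, which points at parallelism rather than frequency as the source of the speedup.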

Modern rendering process

• GPUs are very limited in what they can do
• Can only draw primitives: triangles, lines, points
• Highly optimized for floating point operations

3D objects are stored as a set of triangles

They are placed in the 3D scene, which is simply a coordinate space

Next, there is usually a camera to provide different view angles

Last step in the vertex stage: projection

Affine transform matrices

• Scale
• Translation
• Rotation (Y)

Same concepts as in UIKit (CGAffineTransform)

The individual matrices are multiplied together into a single transform
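The matrices themselves did not survive on these slides, so here is a sketch of what scale, translation and rotation around Y usually look like, and how they combine by multiplication. It assumes row-major storage and column vectors (p' = M * p); the `Matrix4` alias and all function names are hypothetical, and real Metal code would more likely use simd types.

```swift
import Foundation

// 4x4 matrices stored row-major; points are column vectors, so p' = M * p.
typealias Matrix4 = [[Double]]

func scale(_ sx: Double, _ sy: Double, _ sz: Double) -> Matrix4 {
    return [[sx, 0, 0, 0],
            [0, sy, 0, 0],
            [0, 0, sz, 0],
            [0, 0, 0, 1]]
}

func translation(_ tx: Double, _ ty: Double, _ tz: Double) -> Matrix4 {
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]
}

func rotationY(_ angle: Double) -> Matrix4 {
    let c = cos(angle)
    let s = sin(angle)
    return [[c, 0, s, 0],
            [0, 1, 0, 0],
            [-s, 0, c, 0],
            [0, 0, 0, 1]]
}

func multiply(_ a: Matrix4, _ b: Matrix4) -> Matrix4 {
    var result = Matrix4(repeating: [Double](repeating: 0, count: 4), count: 4)
    for row in 0..<4 {
        for column in 0..<4 {
            for k in 0..<4 {
                result[row][column] += a[row][k] * b[k][column]
            }
        }
    }
    return result
}

func apply(_ m: Matrix4, to point: [Double]) -> [Double] {
    var result = [Double](repeating: 0, count: 4)
    for row in 0..<4 {
        for k in 0..<4 {
            result[row] += m[row][k] * point[k]
        }
    }
    return result
}

// The rightmost matrix is applied first: scale, then rotate, then translate.
let model = multiply(translation(1, 0, 0),
                     multiply(rotationY(Double.pi / 2), scale(2, 2, 2)))
let transformed = apply(model, to: [1, 0, 0, 1])
print(transformed)
```

The point (1, 0, 0) is scaled to (2, 0, 0), rotated a quarter turn around Y to (0, 0, -2), then moved to roughly (1, 0, -2). Swapping the multiplication order gives a different result, which is why the combined matrix has to be built deliberately.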

Typical vertex shader

return uniforms.modelViewProjectionTransform * float4(newPosition, 1.0);

modelViewProjectionTransform is calculated on the CPU as a combination of all transforms

Next, the rasterizer comes into play

Fragment shaders

Two triangles that cover the screen

fragColor = vec4(1.0, 0.0, 0.0, 1.0);

fragColor = vec4(length(deltaToCenter), 0.0, 0.0, 1.0);

sin(time) oscillates between -1.0 and 1.0

fragColor = vec4(length(deltaToCenter) * sin(time), 0.0, 0.0, 1.0);
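To see what this last shader produces, it can be emulated on the CPU for a single pixel. The sketch below makes two assumptions that are not on the slides: `deltaToCenter` is the pixel position (normalized to 0...1) minus the screen center (0.5, 0.5), and the render target clamps color values to 0...1. The function name is hypothetical.

```swift
import Foundation

// CPU stand-in for the fragment shader: returns the red channel of one pixel.
// u and v are the pixel coordinates normalized to 0...1.
func redChannel(u: Double, v: Double, time: Double) -> Double {
    // Assumed definition of deltaToCenter: offset from the screen center.
    let dx = u - 0.5
    let dy = v - 0.5
    let distance = (dx * dx + dy * dy).squareRoot()

    // Same expression as the shader: length(deltaToCenter) * sin(time).
    let value = distance * sin(time)

    // Assumed: the render target clamps colors to 0...1.
    return min(max(value, 0.0), 1.0)
}

let center = redChannel(u: 0.5, v: 0.5, time: Double.pi / 2)
let cornerBright = redChannel(u: 0.0, v: 0.0, time: Double.pi / 2)
let cornerDark = redChannel(u: 0.0, v: 0.0, time: 3 * Double.pi / 2)
print(center, cornerBright, cornerDark)
```

The center stays black, the corners pulse with sin(time), and during the negative half of the sine wave the whole screen is black under the clamping assumption.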

(Shader examples: 140 LOC, 66 LOC, 800 LOC)

Business card challenge by Paul Heckbert (1984)

Custom shader example

Imagine we have a sphere

What if we displace every vertex by a smooth noise value?

And also offset the read coordinates every frame

Next, we take a gradient texture and read it at a height that matches the current displacement

Draw calls happen for every set of geometry with a unique render state

What does Apple have to offer?

2007: OpenGL ES 1.1 (iPhone 2G)

2009: OpenGL ES 2.0 (iPhone 3GS)

2013: OpenGL ES 3.0 (iPhone 5s)

OpenGLES.framework and what is built around it: GLKit.framework, SceneKit, SpriteKit, Unity, Unreal, libGDX, Cocos2D-X, Cocos2D-ObjC

Vulkan: announced by Khronos in 2015

Initially, Apple was part of the working group

Metal

• Low CPU overhead
• Modern GPU features
• Do expensive tasks less often
• Optimized for CPU behaviour
• Thinnest possible API

MTLDevice

Represents a single GPU

MTLCreateSystemDefaultDevice()

NOTE: Don't treat this object as a singleton

MTLCommandQueue

Retrieved from the device via:

device.makeCommandQueue()

Often referred to as the «Metal context»

MTLCommandBuffer

Made on a per-queue basis:

    guard let commandBuffer = commandQueue.makeCommandBuffer() else {
        fatalError("Could not create command buffer")
    }

Command types:

1. Render commands: MTLRenderCommandEncoder

2. Blit commands: MTLBlitCommandEncoder

3. Compute commands: MTLComputeCommandEncoder

MTLRenderCommandEncoder

commandBuffer.makeRenderCommandEncoder(descriptor:)

Pipeline state

• Represents the GPU state that needs to be set for the current command

• Must be initialized with shader functions

• Each pipeline state type has its own optional parameters

• Usually cached and reused

    // Create a reusable pipeline state for rendering geometry
    let stateDescriptor = MTLRenderPipelineDescriptor()
    stateDescriptor.vertexFunction = vertexFunc
    stateDescriptor.fragmentFunction = fragmentFunc

    let ps = try device.makeRenderPipelineState(descriptor: stateDescriptor)

MTLRenderCommandEncoder

    .setRenderPipelineState(pipelineState)
    .setVertexBuffer(_:offset:index:)
    .setFragmentBuffer(_:offset:index:)

Attached to the encoder: MTLRenderPipelineState, geometry buffer, uniform buffer, texture 0

    .drawPrimitives(***)

Another geometry, another state: one more .drawPrimitives(***) call

    .endEncoding()

    // Subscribe to the completion event (must be added before commit)
    commandBuffer.addCompletedHandler { _ in }

    // Send the buffer to the command queue
    commandBuffer.commit()

    // Wait until all commands are executed
    commandBuffer.waitUntilCompleted()

Metal Compute Shaders

1999-2001: Early GPGPU

• Hoff (1999): Voronoi diagrams on NVIDIA TNT2
• Larsen & McAllister (2001): first GPU matrix multiplication (8-bit)
• Rumpf & Strzodka (2001): first GPU PDEs (diffusion, image segmentation)
• NVIDIA SDK Game of Life, Shallow Water (Greg James, 2001)

• You needed a PhD in computer graphics to do this
• Financial companies hired game developers

2002: GPGPU.org

General-Purpose GPU

    R     G     B     A     R     G     B     A
    0.17  0.21  0.1   0.2   0.1   0.21  0.2   0.0

2007: CUDA

• First GPU architecture and software platform designed for computing
• First C/C++ language and compiler for GPUs
• 2007 began a massive surge in GPGPU development

Fragment program: input registers -> fragment program -> output registers
Thread program: thread ID -> thread program -> output registers

Metal Compute Shaders

• Act just like fragment or vertex shaders, but are general-purpose
• Declared with the kernel keyword
• Suitable for highly parallel tasks
• Can be put in the same command buffer as render/blit commands

Task: multiply every element in a float buffer by a certain value

A purely parallel task, well suited to compute shaders

Code Time!

1. Declare class ArrayProcessor

2. Inject MTLDevice or MTLCommandQueue as a dependency

3. Set up static elements in init(:)

    public class ArrayProcessor {

        public let commandQueue: MTLCommandQueue
        public let device: MTLDevice
        public let bufferMultiplierPipelineState: MTLComputePipelineState

        public init(commandQueue: MTLCommandQueue) { … }

        …
    }

Next, implement the encoding of GPU work on the CPU side

4. Prepare a container type for the kernel's parameters

    fileprivate struct Uniforms {
        let multiplier: Float
        let count: UInt32
    }

NOTE: Be careful with Swift's memory layout; use C/C++ declarations to avoid tricky bugs
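One way to see why the note matters is to ask Swift how it lays out a struct. The sketch below uses `MemoryLayout` on the `Uniforms` struct from the slide and on a second, made-up struct (`WithVector`) that mixes a scalar with a 3-component vector. Swift does not formally promise a stable layout for its own structs, which is why declaring shared types in a C header is the safer route.

```swift
// Same fields as the Uniforms struct above.
struct Uniforms {
    let multiplier: Float
    let count: UInt32
}

// Hypothetical struct: a 3-component vector takes more room than three floats.
struct WithVector {
    let scale: Float
    let direction: SIMD3<Float>
}

func describe<T>(_ type: T.Type) -> String {
    return "\(type): size \(MemoryLayout<T>.size), "
        + "stride \(MemoryLayout<T>.stride), "
        + "alignment \(MemoryLayout<T>.alignment)"
}

print(describe(Uniforms.self))
print(describe(SIMD3<Float>.self))
print(describe(WithVector.self))
```

`stride`, not `size`, is the value to use when computing buffer lengths, since it includes the padding between consecutive elements. The vector case is the typical trap: the struct is larger than the sum of its payload because of alignment padding, and the shader-side declaration has to agree with that.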

5. Encode the compute kernel command into a command buffer

    public class ArrayProcessor {
        …
        public func process(array: [Float], multiplier: Float) { … }
        …
    }

Inside process(array:multiplier:), step by step:

• MTLDevice: create the array buffer and the uniform buffer
• MTLCommandQueue: create a command buffer
• Command buffer: create an MTLComputeCommandEncoder
• Encoder: set the MTLComputePipelineState and both buffers

Threads and threadgroups

• Metal executes your kernel function over a 1D, 2D or 3D grid
• Each point in the grid represents a single instance of your kernel function, called a thread
• Threads are organized into threadgroups that can share a common block of memory

    kernel void myKernel(uint2 threadgroup_position_in_grid [[ threadgroup_position_in_grid ]],
                         uint2 thread_position_in_threadgroup [[ thread_position_in_threadgroup ]],
                         uint2 threads_per_threadgroup [[ threads_per_threadgroup ]])
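These three values are enough to recover a thread's absolute position in the grid: group position times group size, plus the position inside the group. A minimal CPU-side sketch of that relationship follows; the `UInt2` struct is a made-up stand-in for the shader's `uint2`, and the function name is hypothetical.

```swift
// Stand-in for the shader-side uint2 type.
struct UInt2: Equatable {
    var x: Int
    var y: Int
}

// Position in grid = group position * group size + position inside the group.
func threadPositionInGrid(threadgroupPositionInGrid group: UInt2,
                          threadPositionInThreadgroup local: UInt2,
                          threadsPerThreadgroup size: UInt2) -> UInt2 {
    return UInt2(x: group.x * size.x + local.x,
                 y: group.y * size.y + local.y)
}

let position = threadPositionInGrid(
    threadgroupPositionInGrid: UInt2(x: 2, y: 1),
    threadPositionInThreadgroup: UInt2(x: 3, y: 5),
    threadsPerThreadgroup: UInt2(x: 8, y: 8))
print(position)
```

With 8x8 threadgroups, thread (3, 5) of group (2, 1) sits at (19, 13) in the grid. In a real kernel the same value is available directly through the `thread_position_in_grid` attribute.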


Threads in a threadgroup are executed in a SIMD (Single Instruction, Multiple Data) way

When threads diverge on an if, all of them step through both branches, so keep divergence to a minimum
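The cost of divergence can be illustrated with a deliberately simplified CPU model of one SIMD group: every branch that at least one lane takes becomes a pass over the whole group, with the other lanes masked off. This is only a sketch of the idea, not how any particular GPU schedules work; the function name and the 8-lane group are assumptions.

```swift
// One SIMD group running: if (x > 0) { y = x * 2 } else { y = -x }
// Each branch taken by at least one lane costs a full pass over the group.
func runSIMDGroup(_ lanes: [Float]) -> (result: [Float], passes: Int) {
    let mask = lanes.map { $0 > 0 }
    var result = [Float](repeating: 0, count: lanes.count)
    var passes = 0

    // Pass for the "then" branch: lanes where the mask is false sit idle.
    if mask.contains(true) {
        passes += 1
        for i in lanes.indices where mask[i] {
            result[i] = lanes[i] * 2
        }
    }

    // Pass for the "else" branch: lanes where the mask is true sit idle.
    if mask.contains(false) {
        passes += 1
        for i in lanes.indices where !mask[i] {
            result[i] = -lanes[i]
        }
    }

    return (result, passes)
}

let divergent = runSIMDGroup([1, -2, 3, -4, 5, -6, 7, -8])
let uniform = runSIMDGroup([1, 2, 3, 4, 5, 6, 7, 8])
print("divergent passes:", divergent.passes, "uniform passes:", uniform.passes)
```

When the lanes disagree, the group pays for both branches; when they all agree, it pays for one. That is the practical reason to keep neighbouring threads on the same code path.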

The division of threadgroups into SIMD groups is defined by Metal

The SIMD group size is returned by the threadExecutionWidth property of the compute pipeline state object

All you have to do is define the threadgroup size

6. Calculate threadgroup count and size

    let executionWidth = bufferMultiplierPipelineState.threadExecutionWidth
    let threadgroupsPerGrid = MTLSize(width: (array.count + executionWidth - 1) / executionWidth,
                                      height: 1,
                                      depth: 1)
    let threadsPerThreadgroup = MTLSize(width: executionWidth, height: 1, depth: 1)
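The width expression is an integer ceiling division: it rounds the element count up to a whole number of threadgroups. The sketch below runs it on a few sizes without touching Metal; the execution width of 32 is an assumed value (the real one comes from `threadExecutionWidth`), and the function name is hypothetical.

```swift
// Number of threadgroups needed to cover elementCount threads,
// rounding up so that a partial last group is still dispatched.
func threadgroupCount(elementCount: Int, executionWidth: Int) -> Int {
    return (elementCount + executionWidth - 1) / executionWidth
}

let executionWidth = 32   // assumed; query threadExecutionWidth on a real device

for elementCount in [1_000_000, 1_000, 33] {
    let groups = threadgroupCount(elementCount: elementCount,
                                  executionWidth: executionWidth)
    let idleThreads = groups * executionWidth - elementCount
    print("\(elementCount) elements -> \(groups) groups, \(idleThreads) idle threads")
}
```

Whenever the count is not a multiple of the width, the last group contains threads with no element to process. Those threads still run, so the kernel needs a bounds check before touching the buffer.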

    computeEncoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)

7. Write shaders

Write your kernels

    kernel void bufferMultiplier(device float* inputBuffer [[buffer(0)]],
                                 const device BufferMultiplierUniforms& uniforms [[buffer(1)]],
                                 const uint threadIndex [[ thread_position_in_grid ]]) {
        // The grid is rounded up to whole threadgroups, so skip the extra threads
        if (threadIndex >= uniforms.bufferSize) {
            return;
        }

        const float initialValue = inputBuffer[threadIndex];
        inputBuffer[threadIndex] = initialValue * uniforms.multiplier;
    }

Benchmarks

What we will be playing with:

    var inputBuffer = [Float](repeating: 1.0, count: 1_000_000)
    let multiplier: Float = 2.0

What we will be comparing to:

    // CPU Implementation
    for i in 0..
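The CPU loop is cut off in the source, so the baseline below is a guess at its shape, not the author's code. Alongside it is a CPU stand-in for a single kernel invocation, run over a grid padded to whole threadgroups, to check that both approaches produce the same buffer. All function names and the execution width of 32 are assumptions.

```swift
// Hypothetical CPU baseline: a plain sequential loop over the buffer.
func multiplyOnCPU(_ values: inout [Float], by multiplier: Float) {
    for i in 0..<values.count {
        values[i] *= multiplier
    }
}

// CPU stand-in for one kernel invocation, including the bounds check.
func bufferMultiplierThread(_ buffer: inout [Float],
                            bufferSize: Int,
                            multiplier: Float,
                            threadIndex: Int) {
    if threadIndex >= bufferSize {
        return
    }
    buffer[threadIndex] = buffer[threadIndex] * multiplier
}

let multiplier: Float = 2.0
var cpuBuffer = [Float](repeating: 1.0, count: 1_000)
var emulatedBuffer = cpuBuffer
let elementCount = emulatedBuffer.count

multiplyOnCPU(&cpuBuffer, by: multiplier)

// Pad the thread count up to whole threadgroups, as the dispatch does.
let executionWidth = 32   // assumed value
let threadCount = (elementCount + executionWidth - 1) / executionWidth * executionWidth
for threadIndex in 0..<threadCount {
    bufferMultiplierThread(&emulatedBuffer,
                           bufferSize: elementCount,
                           multiplier: multiplier,
                           threadIndex: threadIndex)
}

print("threads launched:", threadCount, "results match:", cpuBuffer == emulatedBuffer)
```

On a GPU the per-thread function runs for all indices in parallel instead of in a loop. Without the bounds check, the padding threads would index past the end of the array.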

With 1_000_000 elements:

• Metal finished in 0.006s
• CPU finished in 0.1s
• The CPU is ~17 times slower

With 1_000 elements instead of 1_000_000:

• Metal finished in 0.003s
• CPU finished in 0.0001s
• The CPU is ~30 times faster

Tips

1. Beware of memory alignment

2. Beware of CPU-side encoding overhead

3. Keep code divergence to minimum

4. Use half instead of float whenever possible

5. Avoid using ints

6. Calculate threadgroup sizes thoughtfully

7. Cache reusable CPU-side objects

8. Don’t wait for GPU to finish execution

Metal Performance Shaders

• A framework of data-parallel algorithms for the GPU

• Optimized for iOS

• As simple as calling a library function

iOS 9

• Thresholding
• Median Filter
• Lanczos Resampling
• Image Integral
• Histogram, Equalization, and Specification

iOS 10

• MPSCNN

Metal NN Graph API

CoreML

• Easy to use
• Wide range of desktop frameworks
• Almost as fast as manual encoding
• GPU/CPU optimizations
• Is not customizable
• Sometimes buggy
• Zero control

Thanks!

@s1ddok