Faculty of Science and Technology

Department of Computer Science

A closer look at problems related to the next-gen Vulkan API

Håvard Mathisen

INF-3981 Master’s Thesis in Computer Science June 2017

Abstract

Vulkan is a significantly lower-level graphics API than OpenGL and requires more effort from application developers for memory management, synchronization, and other low-level tasks that are specific to this API. The API is closer to the hardware and offers features that are not exposed in older APIs. For this thesis we will extend an existing game engine with a Vulkan back-end. This allows us to evaluate the API and compare it with OpenGL. We find ways to efficiently solve some challenges encountered when using Vulkan.

Contents

1 Introduction
  1.1 Goals

2 Background
  2.1 GPU Architecture
  2.2 GPU Drivers
  2.3 Graphics APIs
    2.3.1 What is Vulkan
    2.3.2 Why Vulkan

3 Vulkan Overview
  3.1 Vulkan Architecture
  3.2 Vulkan Execution Model
  3.3 Vulkan Tools

4 Vulkan Objects
  4.1 Instances, Physical Devices, Devices
    4.1.1 Lost Device
  4.2 Command buffers
  4.3 Queues
  4.4 Memory Management
    4.4.1 Memory Heaps
    4.4.2 Memory Types
    4.4.3 Host visible memory
    4.4.4 Memory Alignment, Aliasing and Allocation Limitations
  4.5 Synchronization
    4.5.1 Execution dependencies
    4.5.2 Memory dependencies
    4.5.3 Image Layout Transitions
    4.5.4 Queue Family Ownership Transfers
  4.6 Render Pass
  4.7 Shaders
  4.8 Pipeline State Objects
  4.9 Resource Descriptors

5 A Vulkan Game Engine
  5.1 Engine Overview
  5.2 Previous work on DirectX 12
  5.3 Designing a Vulkan Graphics Engine
  5.4 Command Buffers
    5.4.1 Multi-threading
  5.5 Memory Management
  5.6 Synchronization
  5.7 Vulkan API
  5.8 Debug Markers

6 Results
  6.1 Vulkan vs OpenGL
  6.2 Higher Graphics Settings
  6.3 Async Compute

7 Discussion
  7.1 Vulkan vs OpenGL
  7.2 Command buffers and multi-threading
  7.3 Queues
  7.4 Memory Management
  7.5 Synchronization
  7.6 The Vulkan API
  7.7 Validation layers

8 Conclusion

9 Figures

1 Introduction

GPUs have evolved significantly since their early history to meet the demand for better graphics and smoother frame-rates. New hardware features have been exposed to developers by extending graphics APIs like OpenGL. OpenGL had its initial release in 1992 and was designed for graphics hardware that was significantly different from the modern GPU. It has problems adapting to modern multi-core processors, new GPU architectures, and applications that require more efficient and predictable performance. One example of such an application is Gear VR, which combines mobile graphics with VR. VR requires low latency, high performance and predictable performance, while mobile graphics requires high efficiency, low power usage and support for tile-based GPU architectures. Even though OpenGL has aged quite well through added extensions and new versions, it was time for a ground-up redesign. Vulkan is an API that is designed to meet the demands of modern graphics applications. Not only does the API expose new graphics hardware features, it is also designed to be used efficiently from modern multi-core processors. Some advantages of Vulkan are:

• Designed to allow for more efficient use of CPU and GPU resources. The added efficiency comes from a closer mapping of the API to the hardware.

• It is a lower-level¹ API that gives more control to developers.

• Thinner drivers with less overhead and latency, which should remove some of the micro-stuttering that older drivers had. Drivers should not have to do any run-time shader recompilations.

• Intended to scale to multiple threads.

• Exposes new hardware queues for compute and DMA.

A good motivation for learning about Vulkan is gaining a better understanding of how GPU drivers work. In this project we will take a closer look at Vulkan.

¹ Note that a lower-level API is not the same as a low-level API.

1.1 Goals

The main goal of this thesis is to explore how we can design a graphics engine to better utilize the underlying hardware by using the Vulkan API. Central questions are:

• How can we multi-thread the tasks involved in generating commands for the GPU?

• How can we utilize new hardware features the API exposes, like the new queues for doing compute and DMA concurrently with the graphics engine?

• What is the best way to do memory management?

• How can we best manage and synchronize resources?

• How does Vulkan compare to OpenGL?

• How do we make useful abstractions that minimize the complexity of the API?

To answer these questions we will design and implement a Vulkan graphics engine for an existing game engine written by the author.

2 Background

In this chapter, we explain modern GPU architectures and drivers before introducing the Vulkan graphics API.

2.1 GPU Architecture

Modern GPU architectures come in multiple forms. We have both GPUs integrated on a SOC and standalone discrete graphics cards; we have GPUs for mobile and GPUs for desktop. Different GPUs have different technical capabilities. Integrated GPUs all have a unified memory architecture (UMA), meaning that the CPU and the GPU share memory, while discrete GPUs have dedicated memory in addition to sharing system memory with the CPU. Modern GPUs also share virtual memory space with the CPU. Desktop GPUs usually use a feed-forward rasterizing architecture while mobile GPUs use a tiled rendering architecture. Tiled rendering defers rasterization by storing the geometry data of the scene in a screen-space tiled cache that is later used to render the scene one tile at a time. By using this technique we can move the framebuffer out of main memory and into high-speed on-chip memory, which can reduce memory bandwidth usage [16]. Even discrete desktop GPUs are starting to use similar techniques to reduce memory bandwidth²,³. Modern GPUs have features not exposed in the OpenGL API. They can have multiple compute engines that can execute compute workloads asynchronously to the graphics engine. They can also execute memory copies using the DMA engine asynchronously to the other engines.

2.2 GPU Drivers

GPU drivers work by packing commands for the GPU into command buffers. There are two components in a graphics driver: a user space library and a kernel module. Commands like Draw*() or Dispatch() are not executed immediately on the GPU when the function is called, but rather staged for later execution in a command buffer in the user space driver. When the command buffer has filled up with enough commands, they are optimized and sent to the kernel. The kernel ensures that the commands are valid and don't access memory not belonging to the application before staging the

² http://www.realworldtech.com/tile-based-rasterization-nvidia-gpus/
³ http://www.anandtech.com/show/11002/the-amd-vega-gpu-architecture-teaser/

commands for execution on the GPU. When the GPU runs out of commands in the command buffer it is currently executing, an interrupt is sent to the OS requesting a new command buffer to execute. The GPU front-end can fetch its own commands from command buffers in system memory through DMA operations and executes commands at its own pace. Figure 1 shows an overview of GPU drivers.

Figure 1: OpenGL Driver

As an optimization, some drivers allow the optimization step of the command buffers to be done in a separate driver thread as shown in Figure 2. This makes draw and dispatch calls really fast in the application but comes at the cost of additional latency. To take full advantage of this technique it might also be necessary for the application to triple buffer per-frame resources, as opposed to the traditional double buffering: one buffer is used by the application, one by the driver thread and one by the GPU. This comes at the cost of extra memory usage. Marchesin [14] has an extensive but unfinished introduction to graphics drivers. There are multiple sources on Approaching Zero Driver Overhead (AZDO) techniques that shed light on the problems of the traditional graphics driver architecture and how to circumvent them [4] [5] [10].

Figure 2: Multi-threaded OpenGL Driver

AZDO is a collection of multiple different GPU techniques to remove driver overhead. The most predominant AZDO techniques are about moving the logic used to select which resources should be used by a shader from the CPU to the shader itself. It is often recommended to start with the AZDO techniques when learning about GPU drivers in-depth. McDonald [15] has a presentation about driver models and how to avoid sync points.

2.3 Graphics APIs

Modern graphics is based around a pipeline that specifies some fixed function stages and some programmable shader stages. Fixed function stages consist of steps like vertex fetching, rasterization, fragment operations and tessellation primitive generation. Programmable stages are the vertex shader, fragment shader, geometry shader, and tessellation control and evaluation shaders. The pipeline begins by fetching an index buffer used to look up a vertex buffer. Vertices are processed by vertex shaders and some optional stages before being assembled into polygons. Polygons are rasterized into fragments that are processed by the fragment shaders. The shaded fragments go through some fixed function stages like blending before being written to the framebuffer. Graphics APIs have been adapted to the modern graphics pipeline by adding extensions. OpenGL got extensions for replacing fixed function

glBegin(GL_TRIANGLES);
glColor3f(1.0, 0.0, 0.0); glVertex3f(-1.0, -1.0, 0.0);
glColor3f(0.0, 1.0, 0.0); glVertex3f( 0.0,  1.0, 0.0);
glColor3f(0.0, 0.0, 1.0); glVertex3f( 1.0, -1.0, 0.0);
glEnd();

Figure 3: Ancient OpenGL

shading stages with programmable shader stages. Vertex Array Objects and Vertex Buffer Objects are used to specify buffers for holding vertex data instead of specifying vertices with the immediate mode shown in Figure 3. Separate from the graphics pipeline is the inclusion of general purpose compute shaders, similar to CUDA and OpenCL kernels. What most of these changes are doing is making the graphics APIs more like general purpose compute APIs. The performance of GPUs has evolved roughly according to Moore's law. The same cannot be said about single-threaded CPU performance, which has stagnated in recent years. The result is a situation where single-threaded OpenGL applications are not fast enough to generate work for modern GPUs. This has prompted a handful of responses. AZDO techniques aim at reducing the driver overhead associated with generating work for the GPU by providing extensions to OpenGL that reduce the amount of work the driver has to do. Among these techniques is also the possibility to let the GPU generate its own commands through indirect draw commands and kernel launches. Another approach is a ground-up redesign of the API to be more efficient and allow applications to be more easily multi-threaded. This has resulted in next-gen APIs like Mantle, DirectX 12, Metal and Vulkan.

2.3.1 What is Vulkan

Vulkan is an API for programming graphics and compute hardware. It allows the programmer to specify commands for execution of graphics shaders and compute kernels on a GPU. Vulkan is based on a series of modern techniques and design decisions for making a more efficient API. In Vulkan, commands are recorded into command buffers for later submission to queues that a device consumes commands from. This forces the developer to be more aware of the asynchronous nature of modern GPUs; older APIs can give the illusion that a graphics command executes immediately on the GPU. The API is defined in the Vulkan Specification [8] released by the Khronos Group. LunarG provides an SDK⁴ for developers to get started using Vulkan

⁴ https://www.lunarg.com/vulkan-sdk/

and several IHVs implement the Vulkan API⁵.

2.3.2 Why Vulkan

Vulkan is the next-gen API that is supported on the most platforms. IHVs are free to support Vulkan on any OS, including Linux and any Windows version. This is contrary to DirectX 12, which is only supported on Windows 10, and Metal, which is only available on Apple platforms. The API is also the same across mobile and desktop, making it easier to port applications and increasing the number of developers familiar with the API. Mobile IHVs tend to promote Vulkan quite heavily due to the decreased battery usage gained from having a lower-overhead API. The API has also been proven to work with games like The Talos Principle, Dota 2 and DOOM.

⁵ https://www.khronos.org/vulkan/

3 Vulkan Overview

Vulkan is an API that consists of a set of structures and functions that are used to program a GPU. This chapter describes the architecture and the execution model used by those structures and functions. There is also a section describing the ecosystem and the tools that should be used when developing Vulkan applications.

3.1 Vulkan Architecture

The architecture of Vulkan is meant to correspond better to modern GPU architectures and give developers more control. Command buffers and queues give the application developer more control over the asynchronous nature of GPUs and also allow applications to take advantage of multi-core CPUs. The memory architecture of the API is designed with both discrete and integrated GPUs in mind and gives the application developer more control over what type of memory is used for what kind of resources. All the state associated with the traditional graphics pipeline is grouped together into a single pipeline state object to allow the driver to better optimize the operation of the GPU. Shaders access resources through descriptor sets, which give the developer more control over the indirection and layout of the structures used to look up resources. Synchronization is explicitly managed by the application developer, which keeps the driver from having to inspect the commands for what kind of resources they touch. Vulkan is also one of the few APIs that is explicitly designed with tile-based GPUs in mind, with the addition of render passes.

3.2 Vulkan Execution Model

Vulkan makes few implicit guarantees about the execution model, to allow the drivers to better optimize for performance. Command buffers submitted to a queue execute asynchronously to the CPU threads. Different queues can also execute commands asynchronously to each other. Even the different commands within a command buffer may overlap or execute out of order to better utilize the execution units in the GPU. It is the responsibility of the application developer to insert synchronization primitives to ensure correct execution of the commands.

3.3 Vulkan Tools

Vulkan drivers do not do any validation of the draw calls (except at the OS level). This is one of the primary reasons why the Vulkan API is so much more efficient than OpenGL. Every OpenGL call has to go through a series of state checking and validation code that checks for undefined behavior. This adds overhead to the draw calls. It is possible to limit this through extensions like GL_KHR_no_error, but support for this extension is limited and it does not make a significant difference on most implementations. We still need to validate our Vulkan applications even if the drivers themselves don't provide support for this. This is where the validation layers come in. The validation layers use callbacks to the application to notify developers when they make mistakes. They don't have 100% coverage yet, but do cover the most common mistakes. The validation layers come as part of the LunarG Vulkan SDK and are injected between the application and the driver library through the Vulkan loader. The Vulkan loader is a driver-independent loader library that allows multiple Vulkan drivers to coexist on the same system. There is a broad ecosystem of other Vulkan-associated tools. Most important when learning about the API are the numerous tutorials and examples that are online. RenderDoc is a graphics debugger that can be used to debug Vulkan applications. One of the most significant parts of GPU programming is writing shaders. Shaders in Vulkan are defined with the SPIR-V intermediate representation. There are numerous open tools for SPIR-V. Most important are the compilers glslang and shaderc. SPIR-V can be translated to other representations through projects like SPIRV-LLVM and SPIRV-Cross. SPIRV-Cross also supports shader introspection. The SPIRV-Tools project has tools like assemblers, disassemblers, optimizers and validators.
Any IHV needs to have its Vulkan implementation tested by the Vulkan conformance test suite (CTS) before it may publicly use the Vulkan trademark to promote its products. The Vulkan CTS is an open source test suite that anyone can contribute tests to. This improves the quality of Vulkan drivers, which is important since one of the most criticized aspects of OpenGL was buggy drivers.

4 Vulkan Objects

One of the fundamental differences between Vulkan and OpenGL is that Vulkan has an object-based design as opposed to OpenGL's global-state-based approach. An object-based design has some inherent advantages over a global-state-based approach. In Vulkan the connection between a GPU and the application is represented with a VkDevice object, which is used to issue commands for that specific GPU. Multiple VkDevice objects can be used to issue commands for multiple GPUs, and we can freely choose which thread we want to issue the commands in. It is even possible to create multiple VkDevice objects for a single GPU. This can be used to give each VkDevice object some advantages similar to multiple processes on the OS, like preemption and separate memory. While it is technically possible to issue commands for multiple GPUs in OpenGL, it has quite limited support since OpenGL ties the connection for a GPU to a specific thread. One advantage of a global-state-based approach like OpenGL is that the user doesn't have to keep track of which VkDevice object to use for every call to the driver, which can add up to quite a lot of extra code. Global-state-based APIs are often recommended for small projects since the user of the API is most likely to want the standard device with the default state, and having a simple API can speed up development. In Vulkan the application is responsible for synchronizing certain objects on the CPU, for example by using locks (and memory barriers on platforms where this is required, like ARM CPU architectures). Other objects are immutable, or the implementation handles the synchronization so the application doesn't have to, even if the objects are used on multiple threads.

4.1 Instances, Physical Devices, Devices

An object called VkInstance is used as the base for all Vulkan operations since the API does not have any global state. VkInstance represents a connection from the application to the API itself, and the Vulkan function pointers are fetched through the VkInstance object. Contrary to OpenGL, the Vulkan API is not exposed in some driver-specific dynamic library but through a common loader library that is shared by all drivers on the system. The Vulkan Loader allows multiple Vulkan implementations to be present on the same system and also allows layers to be injected between the application and the driver. The VkInstance object is used to enumerate resources on the system, like VkPhysicalDevice objects which represent GPUs, or VkDisplay objects

Figure 4: Outline of Vulkan objects

which represent monitors connected to the system. A VkSurface object is also created to allow the application to output rendered images for the user to view. The VkSurface is a connection to the desktop compositor or a physical display. When the application has found an appropriate VkPhysicalDevice, a VkDevice ("Logical Device" or simply "Device") object can be created from that physical device. The VkDevice object serves as the base for communication between the application and the GPU, and all rendering happens through that object. Most Vulkan objects depend either explicitly or implicitly on the VkDevice object, except for objects like VkInstance, VkPhysicalDevice and VkSurface. See Figure 4 for a rough outline of how these objects depend on each other.

4.1.1 Lost Device

Under certain circumstances a VkDevice may "become lost". This is usually caused by infinite loops in shaders that require a GPU reset, but can also be caused by hardware errors, driver errors, execution timeouts, power management events, platform-specific reasons or even errors in other processes on the system. It is not possible to use the VkDevice object any further when a device is lost. It is however possible to try to recover from a lost device by creating a new VkDevice object and restarting the GPU work on the new device. All data and memory on the device is however lost and needs to be uploaded again or recomputed on the GPU.

4.2 Command buffers

In Vulkan, commands for the GPU like vkCmdDispatch or vkCmdDraw* are recorded into a special object called VkCommandBuffer. This fits much more closely with how GPU drivers are structured and allows the application to choose when the commands are submitted to the GPU. Multiple command buffers can also be recorded concurrently on different threads to distribute the work of recording the commands to multiple cores. An application can also record a command buffer once and use it multiple times to avoid the overhead associated with recording. This is however hard to do, since when a rendered scene changes, the structure of the command buffer also tends to change. Reusing command buffers can be useful for post-processing or similar passes that usually do the same work each frame.

4.3 Queues

Command buffers can be submitted to VkQueue objects after they have been recorded. The commands in the command buffers are only executed on the GPU after they have been submitted to a queue. Queues execute commands asynchronously to the host CPU, and the driver is free to return from the vkQueueSubmit call before any work is done on the GPU. It is possible to create more than one VkQueue object for each device depending on hardware and driver support. Some hardware has support for separate queues for doing compute work or memory transfers in addition to the graphics queue. The different queue capabilities are exposed in queue families that are used to create VkQueue objects. Separate queue families usually correspond to hardware that can execute commands asynchronously to each other, like asynchronous compute engines (ACE) or DMA engines. It might also be possible to create multiple queues from a single queue family. Those types of VkQueue objects might be implemented with software multiplexing, preemption or, in other cases, separate hardware engines. Queues in the same family are compatible with one another and may share work and resources freely. It is also possible to assign a priority to queues when creating the VkQueue objects, to indicate that some queues should be assigned more processing time than others.

4.4 Memory Management

Memory is explicitly managed, except for a few pool objects used to manage memory for specific uses. This means that users of the API have to implement their own memory allocation strategy and explicitly map memory to objects. In addition to the problems inherent in implementing a general memory allocation strategy, the developer has to be aware of lots of additional restrictions that are specific to either Vulkan in general or a specific type of device that may be present.

4.4.1 Memory Heaps

Most discrete graphics cards use a non-unified memory architecture (NUMA)⁶ where the graphics card has access to on-board dedicated graphics memory in addition to the normal system memory shared by the rest of the system.

⁶ The terminology for non-unified memory architecture (NUMA) found in graphics cards should not be confused with the terminology for non-uniform memory access (NUMA) typically found in clusters.

This is in contrast to integrated and mobile GPUs, which have a unified memory architecture (UMA). UMA is where the GPU only uses system memory shared by the rest of the components in the system. Vulkan uses multiple memory heaps to allow the application to specify what memory to use. There is at least one heap that is "Device Local", and this might be the only heap, for example on UMA systems. NUMA systems also have access to a "Host Local" heap that refers to system memory, in addition to the on-board "Device Local" heap. On those systems the "Device Local" heap is usually not accessible to the host CPU. Some AMD graphics cards have two "Device Local" heaps, where one heap refers to graphics memory that the host CPU can access.

4.4.2 Memory Types

In addition to having multiple memory heaps to allocate from, an allocation of memory can have certain properties that change how the memory is cached on the CPU. Combinations of different memory properties and heaps are grouped into memory types that are used when allocating memory. The memory types determine whether memory is "Device Local" or "Host Local" and whether the memory can be accessed by the host CPU. If the memory is accessible on the host CPU, it may additionally be coherent with the GPU and cached on the CPU. If the memory is not coherent, the application has to explicitly flush the memory after it is written on the CPU and invalidate the memory before reading back results generated on the GPU. Whether memory is cached does not impose restrictions on the application, but can cause huge performance differences. Tile-based GPU architectures (particularly Mali GPUs) can also have a special memory type called "Lazily Allocated Memory". Lazily Allocated Memory can be used for transient data that may fit entirely in the tile cache and so may or may not need a physical allocation from the heap.

4.4.3 Host visible memory

There are two primary use-cases for host visible memory. The first is to upload data to the GPU and the second is to download data from the GPU. A coherent memory type is often recommended for uploads to the GPU, while downloading data from the GPU should be done with cached memory [1]. Coherent uncached memory is usually implemented with write combining (WC) techniques; this makes it important that memory writes are aligned and write enough memory to fill a WC burst [9]. Cached memory is required for prefetching to the CPU caches, and this speeds up readbacks

on the CPU. Cached memory is sometimes non-coherent. Non-coherent memory requires the developers to call vkFlushMappedMemoryRanges after any writes and vkInvalidateMappedMemoryRanges before any reads.

4.4.4 Memory Alignment, Aliasing and Allocation Limitations

There is a series of alignment requirements that place extra constraints on how sub-allocation can be implemented. Every buffer and image object created has a specific alignment requirement for the memory that can be bound to that object. This value can only be queried after the object has been created, and there are few guarantees except that it must be a power of two and that similar buffers have similar alignment requirements. The alignment is required because of hardware-specific reasons like texture fetching and filtering, caches and other optimizations. The values can be quite arbitrary (especially for image objects, which have multiple formats and layouts that require different alignments) and vary significantly across implementations. When making sub-allocations within a buffer, the application also has to make sure any uniform buffer, storage buffer or texel buffer adheres to the limits for minimum buffer offset alignment for that type of buffer. When buffer or linear image resources are bound to memory within bufferImageGranularity of optimal-tiling image resources, they are said to alias. Any resources are also aliasing if they are bound on top of overlapping memory. Resources that alias have several additional requirements that must be met; in particular, they cannot be used at the same time, and additional memory barriers are required. To avoid aliasing it is common to use separate sub-allocators for linear and optimal resources that allocate from different base allocations. Aliasing can also be useful to reduce the required memory usage. Not only is it slow to create base allocations to sub-allocate from, but there is also a limit, maxMemoryAllocationCount, on how many base allocations an application can make. This limit can be quite small, for example 4096 on all Windows platforms.

4.5 Synchronization

It is essential to insert the right synchronization primitives to ensure that a Vulkan application operates in a valid and well-defined manner across all implementations. There are few implicit synchronization guarantees, and developers are expected to insert explicit synchronization primitives like fences, semaphores, events, pipeline barriers and render passes.

4.5.1 Execution dependencies

Fences are used to make host CPU execution depend on the completion of command buffers on the GPU. This is mainly used to keep the CPU from generating new commands before the GPU has completed execution of previously submitted command buffers. Fences have to be used to keep the CPU from writing to memory the GPU is using. Double-buffering or n-buffering are common techniques to ensure that both the GPU and CPU can work concurrently, and each buffer would then be associated with a separate fence. Fences are a coarse-grained synchronization primitive and it is recommended to use as few as possible. A rule of thumb is to use fewer than five fences per frame; most applications will usually only need one fence per frame. Semaphores are used to ensure ordering of command buffers submitted to a single queue or command buffers submitted to different queues. The most common use of semaphores is to ensure that the GPU is finished rendering a scene before it is displayed on the screen. It is also necessary to insert semaphores when using multiple queues. Semaphores are also heavyweight synchronization primitives, so it is important to limit their usage to ensure the best performance. Pipeline barriers and events are used to ensure ordering of commands within a command buffer. Pipeline barriers provide synchronization at a single point, while events have separate signal and wait commands. Separate signal and wait commands are useful for ensuring that those commands can be overlapped with other independent commands, so the GPU is not starved for work when the synchronization takes place. Pipeline barriers can also have overlapping execution, since the dependent pipeline stages are not always immediately followed by the pipeline stages of the dependency. It is also possible to signal and wait on events on the CPU.

4.5.2 Memory dependencies

It is not enough for developers to ensure ordering of operations. Memory accesses have to be made available and visible to dependent commands. These dependencies roughly correspond to flushing and invalidating caches after they have been written or before they are read. When specifying pipeline barriers and events, the user adds lists of structs specifying the memory dependencies (both availability and visibility operations) that the application has. There are specific lists for buffers, images and global memory. Global memory barriers specify dependency operations for all memory. Buffer memory barriers are for specific buffers, or even parts of those buffers. Image memory barriers are for images, or sub-ranges of those images. Fences implicitly make all device memory accesses available. Semaphores implicitly make all device memory accesses available and visible for further commands on the queue. What this means is that when developing Vulkan applications it is necessary to have a scheme for keeping track of what resources are read and written by specific commands. This is used to solve data hazards like write-after-write (WAW) and read-after-write (RAW). Write-after-read (WAR) hazards can be solved with just execution dependencies.

4.5.3 Image Layout Transitions

Modern GPUs use compression techniques like delta color compression [2] for color buffers and Z-compression [17] for depth/stencil render targets to save memory bandwidth. Images also need special layouts when they are copied. Image layout transitions have to be explicitly managed by the application and are specified with image memory barriers. This means that the application has to keep track of which layout each image currently has and which layout it transitions to. Some implementations manage image layouts implicitly in the driver and don't respect the image layouts specified in Vulkan; vendors of such implementations recommend using the general layout. On AMD GPUs, image layouts are explicitly managed by the user. Differences like these are quite common, and developers should be aware that an application might not work on other implementations even if it works on one.

4.5.4 Queue Family Ownership Transfers

It is possible to create resources either for exclusive use in a single queue family or for concurrent use across queues. Resources created for a single queue can utilize more optimizations than other resources [11]. A queue family ownership transfer is needed when content stored in a resource created with the exclusive flag is going to be used on a different queue. The transfer is specified in the structures for buffer memory barriers and image memory barriers, and consists of a release operation on the queue giving up ownership and an acquire operation on the other queue. This means that the application needs to keep track of which queues are using which resources.

4.6 Render Pass

A major difference between Vulkan and other APIs (except for Metal) is that Vulkan groups rendering commands into render passes. A render pass is a set of framebuffer attachments of the same size, together with subpasses that read or write those attachments. When specifying a render pass we also specify what kind of synchronization should be inserted between the subpasses. The primary use case for render passes is to allow tile-based GPU architectures to better utilize high-speed on-chip memory, so that they don't have to write intermediate results back to device memory. Render passes can also be used by forward immediate-mode renderers for certain optimizations [18].

4.7 Shaders

Vulkan is similar to other graphics APIs when it comes to shader stages. There are two groups of shaders. Graphics shaders correspond to the usual pipeline stages: vertex, tessellation control, tessellation evaluation, geometry and fragment. The other group is compute shaders, which are mandatory in Vulkan. Shaders are specified in the SPIR-V intermediate representation (IR) format. SPIR-V can currently be generated using the Khronos-provided glslang or Google's shaderc. It is possible to generate SPIR-V from multiple shading languages, most notably GLSL and HLSL. GL_KHR_vulkan_glsl [7] is an extension that should be used when writing GLSL shaders for Vulkan. GLSL shaders written for Vulkan need to be extended to access resources from descriptor sets, as opposed to the traditional binding model used by OpenGL. There are also some new concepts like push constants, specialization constants and render pass specifics that can be utilized with Vulkan.

4.8 Pipeline State Objects

Shaders are grouped together with the state required for execution in an object called a VkPipeline. Having all of this merged into a single object allows the driver to optimize across this data when the object is created. In older APIs like OpenGL this had to be done at draw time, which could cause hitches if the driver decided that it needed to compile a shader at runtime. Pipeline state objects can be cached on disk to prevent the driver from having to recompile them each time the application is restarted. Multiple threads can also be utilized to speed up compilation.

4.9 Resource Descriptors

Shaders access resources through a data structure called a descriptor. Buffer and image views have corresponding descriptors. Multiple descriptors are grouped together into a descriptor set object before they can be bound to the pipeline. Descriptor sets can also contain high-speed variables called push constants. The layout of a descriptor set object is described using a descriptor set layout. The bindings described in descriptor set layouts have to correspond to the bindings specified in the shaders that use those descriptors. Descriptor set objects are usually placed into high-speed memory on the GPU before rendering can start. There is a limited amount of this high-speed memory, and it is possible that some descriptors spill into system memory when there are too many of them. This causes an extra indirection before shaders can access resources. Push constants can be accessed directly from the descriptor sets and can therefore be very fast on some implementations.

5 A Vulkan Game Engine

This section describes the design of a Vulkan graphics back-end for a game engine. It was designed around an existing game engine written by the author. The engine has an existing OpenGL implementation, and had a DirectX 12 back-end that was removed to reduce maintenance cost when adding the Vulkan back-end. The OpenGL implementation was based around AZDO (Approaching Zero Driver Overhead) techniques.

5.1 Engine Overview

The engine is based on tile-based light culling and has OpenGL back-ends for both deferred lighting and forward shading. Only forward shading is implemented in Vulkan, since the deferred shading path has low performance at modern screen resolutions; this thesis therefore focuses on the forward shading path. Rendering starts with a depth pre-pass where a depth buffer is generated. The depth buffer is used for screen space ambient occlusion (SSAO) and tile-based light culling. Shadow maps are generated for both directional lighting (the sun) and some omni-directional lights (point lights). All the aforementioned data is combined in a forward shading pass generating a high dynamic range (HDR) color buffer. A filter is used to resolve multi-sampling (MSAA) from the HDR buffer, with a tone-mapping function used to map the HDR values into the low dynamic range (LDR) that is presented to the user. The engine uses some external libraries; the most important for this project is the GLFW library, used for creating windows and surfaces to render to in an OS-independent manner. The Dear ImGui library is used for creating graphical user interfaces, and a back-end was created for rendering the GUI in Vulkan.

5.2 Previous work on DirectX 12

Much of the work done on the DirectX 12 implementation was around structuring the base engine to allow for multiple graphics back-ends. The DirectX 12 implementation also laid the foundation for how to structure an engine for the new lower-level graphics APIs. This allowed the Vulkan implementation to focus more on advanced techniques and optimizations like asynchronous queues and multi-threaded command buffer recording. A particularly important lesson learned from the DirectX 12 implementation was how not to do things. It is easy to over-design a new graphics engine, so the Vulkan engine was designed around a base engine that is as simple as possible but can be extended to do advanced things.

5.3 Designing a Vulkan Graphics Engine

The initial plan for the Vulkan implementation was to keep everything as simple as possible. A single command buffer was recorded on a single thread using a single fence per frame. Double buffering was used so the GPU and CPU could execute in parallel. Barriers were used for every resource that needed synchronization on the GPU and were added adjacent to the commands using those resources. One of the first challenges was coming up with a memory allocation scheme that was simple yet correct. The most time-consuming aspect of getting the early implementation up and running was augmenting all the preexisting GLSL shaders for Vulkan. Vulkan uses a different binding model than OpenGL: resources are bound to shaders using descriptor sets, and every shader has to specify from which descriptor set and at which binding each resource is accessed. Work on more advanced techniques could begin once the base implementation was stabilizing. One of the first design issues that came up was splitting the command buffer so that multiple queues could be used and work could be distributed to multiple threads. Some additional barriers were required when adding multiple queues. Multi-threading the command buffer recording also required rewriting how barriers were implemented, and the barriers were also optimized during this work. Refactoring and other optimizations were done after this. All the descriptor set layouts and pipeline layouts could be merged, which reduces the code size and is also recommended by some IHVs as an optimization [12].

5.4 Command Buffers

To allow command buffer recording to be multi-threaded, and to make use of the DMA and async compute engines, the frame was split into the following command buffers:

1. Copy
2. Depth Pass
3. Compute
4. Directional Shadows
5. Omni-Directional Shadows
6. GUI/Sky
7. Forward Pass
8. Post

The copy buffer can be executed on the graphics queue or the copy queue before the previous frame is done. The compute buffer can be executed on the async compute engine while the shadow buffers and the GUI/Sky buffer are executing on the graphics engine. Lastly, the post-process buffer can be executed on the async compute engine while the next frame is running on the graphics engine.

5.4.1 Multi-threading

Having multiple command buffers also allows them to be recorded on different threads. However, there are some issues to be aware of. Some command buffers, like the compute buffer, take a trivial amount of time to record, and creating tasks to record those on separate threads can cause more overhead than the recording itself. Recording of such command buffers is scheduled together with other buffers to amortize the cost. Other buffers, like the shadow buffers, contain lots of draw calls, take significantly longer to record, and can be split into smaller buffers with a preprocessor define. It is also important to note that the time it takes to record a command buffer does not correspond to the time it takes to execute it on the GPU. Creating too small command buffers adds overhead on the GPU; command buffers should at minimum take about 200-500 µs [19].

5.5 Memory Management

One of the advantages of the engine's design is that all allocations happen statically at initialization time and memory is only freed at shutdown. This means that there are never any hitches associated with doing base allocations and that only a linear allocation strategy has to be implemented. There are a couple of exceptions to this. Framebuffers and other resolution-dependent resources have to be reallocated when the window is resized or the user changes MSAA settings from the options. All those resources are given a dedicated base allocation to solve this problem. This is often recommended by IHVs for driver-related reasons; Nvidia even has a special extension called VK_NV_dedicated_allocation for this purpose.

There are four pools of memory that all sub-allocations happen from. The first is a pool for optimal image resources and uses device-local memory. The second is for static buffers and also uses device-local memory. The third is a host-local allocation used for per-frame resources and for uploading data to the GPU; it tries to find memory that is coherent but not cached (usually implemented as write-combining pinned memory). The fourth is for downloading data from the GPU to the CPU. It is also a host-local allocation, but tries to use cached memory, which can significantly speed up reads on the CPU. A single circular buffer is allocated to handle per-frame resources and uploads to the GPU. At the start of each frame, any allocations that were made from this buffer two frames ago are collectively freed. The only data that is currently downloaded from the GPU is timestamp data, so that system does its own allocations.

5.6 Synchronization

An important aspect of synchronization is that the application is responsible for tracking certain resource states and memory accesses. The problem with resource state tracking is that when using multiple threads to record command buffers, we lose the ability to know what state a resource is currently in, because another thread might change the state of a resource that we want to use on this thread. This is a consequence of the fact that commands might not be recorded in the same order across multiple threads as they are executed on the GPU. To overcome this we do a fast single-threaded pre-pass over the scene where all the state tracking happens and all the memory barriers are staged into temporary buffers, which are later inserted into the command buffers. An extra advantage of this approach is that we can batch memory barriers together at specific points in the frame, which can significantly reduce overhead and increase concurrent kernel execution. This solves one of the most problematic aspects of synchronization: redundant memory barriers that should be merged.

5.7 Vulkan API

Vulkan is a verbose API, and it is reasonable to find ways to remove some of this verbosity. While implementing the Vulkan back-end, some state was identified as mostly coherent between Vulkan function calls. This includes objects like VkInstance, VkPhysicalDevice, VkDevice, VkAllocationCallbacks, VkQueue, VkCommandBuffer and VkCommandPool. Functions that use those objects can be simplified by storing the currently used object in thread-local storage and providing simplified Vulkan commands. This removes a lot of unnecessary code, since almost every Vulkan command uses one of those objects.

// This Vulkan function
void vkCmdDispatchIndirect(VkCommandBuffer, VkBuffer, VkDeviceSize);
// Gets substituted with this engine function
void VkgCmdDispatchIndirect(VkBuffer, VkDeviceSize);

5.8 Debug Markers

Debug markers were added to better understand how the engine performs. Markers are manually pushed and popped when the engine does different tasks, and there is support for them on different threads. The markers were also extended with queries to measure how much time the tasks take on the GPU, as described by Lux [13] and Fuentes [6]. For Vulkan there is also support for markers on the compute queue in addition to the graphics queue. An ImGui widget was added to visualize how all the tasks relate to each other, as shown in Figure 6. This view was also extended to show what kind of barriers are used in between the GPU tasks. The markers from the GPU are approximately aligned to the markers on the CPU by measuring how the timestamps on the GPU drift relative to CPU time. While this works reasonably well for most GPUs, there is always the possibility that changes in the GPU power states invalidate this relation. The reason debug markers are favored for profiling the engine is that they are implemented to capture metrics about a single frame. Traditional sample-based profilers don't understand the concept of a frame, so it is hard to differentiate what happens at a specific time in a scene from other parts, including loading the engine.

6 Results

The experiments consist of running a benchmark scene. The benchmark scene is a series of cinematic clips with different performance characteristics. When running the benchmark we measure how much time it takes to generate a frame on both the CPU and the GPU. When the benchmark is done, average, minimum and maximum frame times are reported. The benchmark also reports how much time loading the engine takes. Note that all results report time per frame measured in milliseconds, as opposed to the frames per second (fps) often used elsewhere; this means that lower values are better. All tests are run in full-screen mode at 1440p resolution.

One problem with evaluating the performance of a graphics engine is that both the CPU and GPU are potential bottlenecks. Even the refresh rate of the monitor can be considered a bottleneck, but this is purely a hardware limitation. A modern monitor operates at a refresh rate of 60-240 Hz, corresponding to frame times of 16.6-4.1 ms. The CPU and GPU time per frame each count towards this limit separately, since we are using double buffering. Ideally we want to generate frames faster than this on both the CPU and the GPU.

6.1 Vulkan vs OpenGL

This test was done using an Nvidia GTX 1080 with driver version Linux-64 375.27.15. It was run on Xubuntu 16.10 with the 4.8.0-52-generic kernel release. The CPU was an i7-3930K. This test was run without async compute.

Vulkan
Timer  avg (ms)  min (ms)  max (ms)
CPU    2.07709   1.60082   5.71299
GPU    4.24093   3.69152   5.62586

OpenGL
Timer  avg (ms)  min (ms)  max (ms)
CPU    6.43733   2.89253   10.467
GPU    5.4776    4.15437   7.25606

Loading time was 1.63 s for Vulkan and 0.92 s for OpenGL. The debug markers were used to investigate this more closely after the benchmark. It was found that allocating memory (base allocations) with Vulkan takes 0.585 seconds. This is something that does not exist in OpenGL and accounts for pretty much the whole difference between the loading times.

It is also worth noting that the engine spends more than 1 ms per frame on work like physics and updating objects on the CPU. This is not part of the graphics-API-dependent code, which means the CPU frame time cannot be reduced much further by making the graphics API usage any faster in the Vulkan back-end.

6.2 Higher Graphics Settings

This test is similar to the previous, but we use 8x MSAA.

Vulkan
Timer  avg (ms)  min (ms)  max (ms)
CPU    2.1753    1.63672   8.95798
GPU    7.96721   6.97139   9.63891

OpenGL
Timer  avg (ms)  min (ms)  max (ms)
CPU    6.33906   2.87087   10.1907
GPU    8.69585   7.00006   10.6772

6.3 Async Compute

This test was done with an AMD R7 370 graphics card using driver Crimson ReLive 17.1.1. The OS is Windows 10 (64-bit) with an AMD FX-8320 CPU. The test consists of running the benchmark scene with and without async compute enabled.

Without Async Compute
Timer  avg (ms)  min (ms)  max (ms)
CPU    11.4426   4.14576   25.3467
GPU    26.4357   19.8317   34.6815

With Async Compute
Timer  avg (ms)  min (ms)  max (ms)
CPU    11.9048   4.85172   23.7548
GPU    24.2288   17.1314   30.5724

With Async Present
Timer  avg (ms)  min (ms)  max (ms)
CPU    11.6492   4.50107   19.8327
GPU    24.3833   18.1535   29.9009

With Async Compute and Present
Timer  avg (ms)  min (ms)  max (ms)
CPU    11.2626   4.73872   20.1495
GPU    23.1377   16.0412   28.4003

The debug markers were used to investigate further. The test with just async present adds about 6-8 ms of latency, while the test with both async compute and async present adds about 2-4 ms of latency. Using async present adds latency because running the tone-mapping shader on the async compute queue causes that shader to compete for GPU resources with the graphics queue working on the next frame. The test with just async present adds more latency because scheduling the tone-mapping shader against other shading work makes both compete for the same resources. If both compute and present are scheduled on the async compute queue, they are scheduled together with the shadow maps on the graphics queue. Shadow maps tend to be taxing on the rasterizer, hardware that cannot be used from the compute queue, making that workload a good match to schedule together with compute tasks.

7 Discussion

This section discusses some design- and performance-related aspects of Vulkan. Vulkan puts more demands on a good design process than older APIs, and many design-related issues, such as multi-threading, can be quite hard to fix late in the development process. The goal of using the Vulkan API is most often increased performance. Even though the API offers good performance, some developers have not gained any significant performance from using it. It is therefore important to evaluate how the API can be used effectively in a design process.

7.1 Vulkan vs OpenGL

Vulkan does a lot better when it comes to CPU performance, getting more than a 3.2x speedup on average in the first test case. It is also worth noting that this changes the bottleneck from the CPU to the GPU. The GPU performance also improves a bit with Vulkan (about a 1.3x speedup, or 1.2 ms faster). All this gives a total relative speedup of approximately 1.5x, which is a pretty good performance improvement for a game.

The relative performance improvement diminished when rendering the scene with higher graphics settings. In this case the scene is heavily GPU bound. There is still an absolute performance improvement of 0.7 ms, but this cannot be considered a big win compared to how much effort was put into developing the Vulkan back-end. Those two scenarios give a pretty good overview of how Vulkan performs when speeding up an existing application.

Another aspect that was not tested in this thesis is how the API performs when scaling up the workload. Vulkan is much more efficient at processing draw calls than OpenGL, which means that we can have scenes with many more distinct objects. We can also use the processing power we free up for more expensive physics or particle simulations. Considering that CPU usage improves in both test cases, we also expect lower power usage in those applications, which can be an important aspect for mobile applications.

7.2 Command buffers and multi-threading

There have been no performance reasons for multi-threading command buffer recording in this engine, since recording it single-threaded is so fast. It is still a good idea to face the design issues related to multi-threading early in the development of a product, since those issues have implications for almost all Vulkan-specific code. There are still some reasons users might not want to multi-thread command buffer recording at all:

• Vulkan is such a fast API that recording command buffers is not a bottleneck

• Kicking off multiple tasks causes overhead

• When multi-threading command buffer recording, we lose the ability to easily make decisions that require knowledge of parts of the scene that are local to other threads

• Switching between command buffers causes overhead on the GPU and in the OS kernel

• It might not be possible to do optimizations across different command buffers

It is also really hard to evenly distribute the work of recording commands across multiple command buffers; one of the command buffers tends to be an order of magnitude more expensive to record than the others. This does not mean that multi-threaded command buffer recording is useless. It is particularly useful when scaling up the workload. When rendering particularly rich scenes with many unique draw calls, it is important to distribute the workload to different threads. The problem is that the existing OpenGL back-end has no way to render similar scenes; we have to rely exclusively on the next-gen APIs to use those techniques. One example use of command buffers is depth peeling [20], where the scene is rendered multiple times to solve the order-independent transparency problem. It is also possible to reuse already-recorded command buffers when doing post-processing, which can save work on the CPU. We can also start submitting command buffers before all other command buffers have been recorded on the CPU, since there can be multiple command buffers for a single pass over the scene. This can reduce latency by not keeping the GPU waiting while the CPU is recording. Since the validation layers add significant overhead to command buffer recording, it could have been useful to use multi-threading in those cases. The problem is that the validation layers have a serializing effect on command buffer recording, so recording command buffers on multiple threads takes about the same amount of time as recording them on a single thread.

7.3 Queues

Finding workloads that are good to schedule on the async compute queues is hard. We want to schedule workloads with different bottlenecks together. What kinds of workloads should be scheduled together will vary depending on GPU bottlenecks, which means that different GPUs should schedule different workloads together. To do this properly we would have to dynamically profile the different tasks that could potentially be scheduled together. This is too hard to solve in practice, and we also don't have access to any open APIs for the relevant hardware performance counters. This is still not a catastrophic issue: tasks that have different bottlenecks on one GPU tend to have different bottlenecks on many GPUs. Depth passes and shadow maps tend to have rasterization as a bottleneck and can almost always be scheduled together with compute tasks for improved throughput.

We saw reasonably good performance improvements from async compute in this project. There were speedups both when scheduling the compute command buffer (tiled light culling and SSAO) and when scheduling post-processing and presentation on the asynchronous compute queue. It should also be noted that even though doing post-processing and presentation on the async compute queue improved throughput, it also causes extra latency for the frame. Usage of the copy queue did not provide any performance differences, even though the copy queue is the fastest way to transfer data over PCIe [3]. This was expected, since the application only uses about 2-3% of the PCIe bandwidth.

7.4 Memory Management

Memory sub-allocation is not that much of a problem for smaller, and even most reasonably large, applications. A characteristic of most graphics applications is that they allocate graphics memory at initialization time and only free that memory when the application shuts down. Most dynamic objects are fitted into already existing allocations, for example circular buffers, rather than given dedicated allocations. This means that we don't really need to implement anything much more advanced than a linear allocator, which is also the fastest possible sub-allocation strategy. A disadvantage of a linear allocator is that alignment requirements can be quite large, so much memory can be wasted between allocations. The most complicated part of Vulkan memory management is managing API-specific restrictions like general and object-specific alignment requirements, caching and coherence, different memory types, aliasing, and allocation limitations.

7.5 Synchronization

Most of the synchronization needed is actually quite simple. During a frame, a write to some memory is read later; this implies a RAW hazard that has to be managed by the application. There is also a WAR hazard between frames, but this is usually implicitly handled by the fences and semaphores between frames. It is really hard to find an abstraction where end-programmers don't have to think about synchronization at all. For this engine we had to separate command buffer recording and memory barrier generation into two passes.

7.6 The Vulkan API

There are a couple of open-source wrappers for Vulkan, like Vulkan-Hpp and AMD's Anvil library. While these libraries can make Vulkan easier to use for some users, they also tend to be complicated and can make Vulkan development overwhelming. Wrappers should ideally make Vulkan easier to learn and lead to less complicated code. A good alternative to the open-source wrappers is having application developers make their own abstractions as they encounter problems that can be simplified. This thesis identified some general ways to wrap Vulkan functions, but the most effective approaches are still expected to be engine-dependent. One example of this is how the memory barriers were implemented: to merge barriers and reduce overhead, they had to be generated in an engine-specific pass over the scene and staged in temporary buffers. Another example, not completed for this project, is to store the state associated with the PSOs in a graphics-API-independent abstraction. Such abstractions can be moved out of the API-specific code and into the common engine code.

7.7 Validation layers

While the validation layers warn about incorrect usage of the API, there are no warnings for API calls that are merely very likely wrong. A prototype for a validation layer (VkLayer_sane_parameters) that checks whether some Vulkan functions are called with unlikely arguments was made during development. One example that caused problems when developing the Vulkan back-end was calling draw functions with 0 as the draw count. There are also tests for values that are so high that they are most likely uninitialized variables.

8 Conclusion

Implementing a Vulkan back-end for a game engine is a difficult task, but it can also offer significant performance improvements for certain workloads. Even if most of the improvement comes from a much more efficient API on the CPU side, there was also better performance on the GPU, even before async compute was used. Features like async compute can be used to take advantage of hardware that OpenGL applications did not have access to, giving Vulkan applications an extra edge. There are also more subtle aspects of Vulkan that can be optimized better than in OpenGL applications. Pipeline state objects give the driver more opportunities for optimization. Descriptor sets are closer to the hardware and give developers more opportunities to optimize how shaders access resources. Memory and synchronization can also be optimized more in the application than in the driver, since the application has more information about how the scene is put together, but this also requires more effort from the developers. Command buffers allow work to be recorded on multiple threads and executed on multiple asynchronous queues, giving the developer more opportunities to design for parallelism.

The Vulkan back-end developed for this thesis shed light on the design challenges encountered when using the Vulkan API. Vulkan memory management and synchronization are challenging topics that require careful planning to ensure the quality expected from modern applications. Combining this with multi-threaded command buffer recording increases the complexity even further. It is important to find good abstractions to simplify development with the Vulkan API. Even if there is room for general-purpose abstraction libraries, the most useful abstractions are still expected to be engine-specific.

References

[1] ARM® Mali™ Application Developer Best Practices, Version 1.0, Developer Guide. https://static.docs.arm.com/100019/0100/arm_mali_application_developer_best_practices_developer_guide_100019_0100_00_en2.pdf.

[2] Chris Brennan. Delta Color Compression Overview. http://gpuopen.com/dcc-overview/.

[3] Matthäus G. Chajdas. D3D12 and Vulkan: Lessons Learned. Presentation, GDC 2016, San Francisco, March 2016. http://gpuopen.com/wp-content/uploads/2016/03/d3d12_vulkan_lessons_learned.pdf.

[4] Cass Everitt, Tim Foley, John McDonald, and Graham Sellers. Approaching Zero Driver Overhead in OpenGL (Presented by NVIDIA). Presentation, GDC 2014, San Francisco, March 2014. http://gdcvault.com/play/1020791/.

[5] Cass Everitt and John McDonald. Beyond Porting - How Modern OpenGL can Radically Reduce Driver Overhead. Presentation, Steam Dev Days 2014, Seattle, January 2014.

[6] Lionel Fuentes. A real-time profiling tool. In Patrick Cozzi and Christophe Riccio, editors, OpenGL Insights, pages 503-512. CRC Press, 2012.

[7] The Khronos Group Inc. GL KHR vulkan glsl, 2016. https: //www.khronos.org/registry/vulkan/specs/misc/GL_KHR_vulkan_ glsl.txt.

[8] The Khronos Group Inc. Vulkan R 1.0.39 - a specification (with many extensions). 2017. [9] Ph.D Liang-min Wang. How to Implement a 64B PCIe* Burst Transfer on Intel R Architecture. http://www.intel. com/content/dam/www/public/us/en/documents/white-papers/ pcie-burst-transfer-paper.pdf. [10] Tristan Lorach. OpenGL NVIDIA ”Command-List”:”Approaching Zero Driver Overhead”. Presentation, SIGGRAPH 2015, Los Angeles, August 2015. http://on-demand.gputechconf.com/siggraph/2015/video/ SIG512-Tristan-Lorach.html.

33 [11] Timothy Lottes. Vulkan and DOOM. http://gpuopen.com/ vulkan-and-doom/.

[12] Timothy Lottes, Graham Sellers, and Dr. Matth¨aus G. Chaj- das. Vulkan fast paths. Presentation, GDC 2016, San Francisco, March 2016. http://gpuopen.com/wp-content/uploads/2016/03/ VulkanFastPaths.pdf.

[13] Christopher Lux. The timer query. In Patrick Cozzi and Christophe Riccio, editors, OpenGL Insights, pages 493–502. CRC Press, 2012.

[14] St´ephaneMarchesin. Linux Graphics Drivers: an Introduction. Ver- sion 3. March 2012. https://people.freedesktop.org/~marcheu/ linuxgraphicsdrivers.pdf.

[15] John McDonald. Avoiding Catastrophic Performance Loss De- tecting CPU-GPU Sync Points. Presentation, GDC 2014, San Francisco, March 2014. https://developer.nvidia. com/sites/default/files/akamai/gameworks/events/gdc14/ AvoidingCatastrophicPerformanceLoss.pdf.

[16] Bruce Merry. Performance tuning for tile-based architectures. In Patrick Cozzi and Christophe Riccio, editors, OpenGL Insights, pages 323–336. CRC Press, 2012.

[17] Emil Persson. Depth In-depth. http://developer.amd.com/ wordpress/media/2012/10/Depth_in-depth.pdf.

[18] Graham Sellers. Vulkan Renderpasses. http://gpuopen.com/ vulkan-renderpasses/.

[19] Gareth Thomas and Alex Dunn. Practical 12 - programming model and hardware capabilities. Presentation, GDC 2016, San Francisco, March 2016. http://gpuopen.com/wp-content/uploads/ 2016/03/Practical_DX12_Programming_Model_and_Hardware_ Capabilities.pdf.

[20] Matthew Weelings. Depth peeling order independent trans- parency in vulkan, July 2016. https://matthewwellings.com/blog/ depth-peeling-order-independent-transparency-in-vulkan/.

9 Figures

Figure 5: Some of the passes in the engine: Upper left - Depth Pass, Upper right - SSAO, Lower left - Shadow Maps, Lower right - Final Render

Figure 6: Debug Markers
