An evaluation of hardware tessellation in a visibility buffer renderer in real-time graphics applications.

Cameron McPherson

BSc (Hons) Computer Games Technology, 2019

Abertay University

School of Design and Informatics

CONTENTS

Abstract
1. Introduction
1.1 Research Question
1.2 Aims and Objectives
2. Literature Review
2.1 Memory Bandwidth in Games
2.2 The Visibility Buffer
2.3 Hardware Tessellation
2.4 Visibility Buffer with Tessellation
3. Methodology
3.1 Application Overview
3.2 Development Environment
3.3 Application Framework
3.3.1 Renderers
3.4 Testing
3.4.1 Hardware
3.4.2 Evaluation Methods
3.4.3 Evaluation Parameters
3.4.4 Performance Metrics
4. Results
4.1 Memory Usage
4.1.1 Working Set
4.1.2 Forward Writes
4.1.3 Deferred Reads
4.1.4 Total Read/Writes
4.1.5 Coherency
4.2 Streaming Multi-Processor Usage
4.3 Net Performance Impact
4.3.1 Per Resolution
4.3.2 Per Triangle Count
4.3.3 Full Frame Performance
5. Discussion
5.1 Analysis of Results
5.1.1 Memory Usage
5.1.2 Processor Usage
5.1.3 Net Performance
5.2 Design Considerations and Future Work
5.2.1 Triangle Culling/Filtering
5.2.2 Support of Hardware Tessellation
5.2.3 Compute-based Tessellation
5.2.4 Turing Mesh
6. Conclusion
7. Acknowledgements
8. List of References
9. Appendices
Appendix 1

ABSTRACT

One of the most common performance bottlenecks in real-time graphics applications is memory bandwidth, especially on low-end and mobile hardware. Among multiple developments in both software design techniques and hardware architectures to alleviate this restriction, the visibility buffer and hardware tessellation are two rendering processes shown to reduce memory bandwidth usage — but there has been little research in employing the two techniques together. To determine the effects of supporting hardware tessellation in the visibility buffer on memory efficiency, a real-time graphics application with two distinct graphics pipelines was developed to render a procedurally generated terrain. Performance metrics pertaining to bandwidth usage, total working set size and pass times were gathered across multiple resolutions and geometric detail levels, both with and without hardware tessellation. The results show that while the support of hardware tessellation in the visibility buffer resulted in frame time improvements across all test cases, the associated bandwidth usage and compute cost would likely prove prohibitive to mobile and integrated platforms. Despite this, there remain opportunities for further improvement in both renderers, both to raise performance and to more closely reflect production rendering systems. Specifically, optimisations in geometry processing such as triangle culling, Mesh Shaders (NVIDIA, 2018a) or compute-based tessellation could drastically improve the visibility buffer's suitability for high-quality real-time applications on bandwidth-limited platforms.

1. INTRODUCTION

Since its emergence as a leading form of entertainment media, the video games industry has experienced a constant progression in visual fidelity. As processing hardware becomes more powerful, the demand for triple-A development studios to produce consistently improving graphics continues to grow. In order to achieve this, efficient use of available hardware resources is imperative, and this necessity has fuelled an explosion of research in the field of interactive rendering (Akenine-Moller, Haines and Hoffman, 2018). Therefore, the focus of research in real-time graphics is often concentrated on mitigating areas of the rendering pipeline and underlying hardware that are considered to be performance bottlenecks.

As the movement towards high-resolution and virtual-reality gaming intensifies, modern applications require huge amounts of data to be transferred to and from memory in the rendering of a 3D scene. For this reason, the amount of bandwidth available to the graphics hardware is key in determining how quickly data can be moved between memory and processing units. This is especially true for mobile platforms with integrated graphics architectures, where memory bandwidth capability is often the limiting factor as to the amount of detail that may be rendered per frame (Niessner et al., 2016). It is therefore necessary to develop rendering patterns which minimise the bandwidth required to process, shade, and render geometry. In doing so, systems of all capabilities may benefit from the freeing-up of GPU resources, allowing high-performance platforms to spend more of their general-purpose compute capability on other tasks and enabling improved visual quality on mobile and integrated systems.

For these reasons, there has been extensive research into the development of such rendering techniques and their subsequent impact on the performance of real-time applications. One such study (Burns and Hunt, 2013) introduces a method named the visibility buffer. This technique, while based on the commonly used deferred pattern, aims to drastically reduce the associated memory cost of separating geometry processing from shading — a cost which restricts traditional deferred renderers from achieving efficient performance on bandwidth-limited platforms.

Further to the development of novel design patterns (such as the visibility buffer) for the benefit of real-time rendering performance, graphics API vendors will often introduce new fixed-function stages to the graphics pipeline. These stages aim to provide hardware-accelerated functionality to optimise resource-intensive rendering and compute tasks. One such addition was hardware tessellation, introduced with the Direct3D 11 API (Microsoft Corporation, 2018). By automating the procedural subdivision of geometry meshes, tessellation allowed for less-detailed topologies to be stored in memory. In suitable use cases, this results in significant bandwidth savings when transferring the geometry to the graphics card for rendering.

While the retention of relevant data between passes is heavily optimised by the visibility buffer technique, the cost of processing large amounts of geometry in detailed scenes is unaccounted for. The proven benefits to memory efficiency of the visibility buffer method could thus be promoted by the addition of hardware tessellation. However, this feature of modern graphics pipelines is afforded little consideration in previous studies surrounding the

visibility buffer. Therefore, by evaluating the relative impact of combining these techniques on memory efficiency, a more robust and performant rendering system could potentially be developed, with the minimising of bandwidth usage being a primary concern.

1.1 Research Question

Following the implementation and subsequent evaluation of the discussed rendering techniques with respect to bandwidth usage, this paper proposes to answer the following research question:

“Does the addition of support for hardware tessellation improve or inhibit the memory efficiency of a typical visibility buffer renderer?”

1.2 Aims and Objectives

To answer this question, a series of objectives are presented:

• Analyse related work on the visibility buffer method, hardware tessellation, and the performance evaluation of graphics frameworks.
• Develop a real-time graphics application implementing visibility buffer rendering with tessellated geometry.
• Gather performance data from this application under various conditions; tracking both memory usage and frame times with and without tessellation.
• Evaluate the consequent impact of introducing hardware tessellation into the visibility buffer pipeline and discuss the value of its implementation with regards to memory efficiency and subsequent frame performance.

It is outwith the scope of this work to compare the visibility buffer to more traditional rendering methods such as deferred shading, as this area is well researched in the studies by Burns and Hunt (2013), Schied and Dachsbacher (2015), and Engel (2016).

2. LITERATURE REVIEW

The design and implementation of real-time graphics applications is a field in a constant state of research, with the goal of improving both power efficiency and final image quality. This chapter will examine GPU memory bandwidth as a key performance resource, and introduce previous studies which have developed methods for improving the efficiency of its use.


2.1 Memory Bandwidth in Games

In any high-quality rendering system, multiple textures and buffers will be loaded, sampled, and written to in a single pass, transferring data to and from memory in the process. The bandwidth between the processing cores and both off-chip memory and cache hierarchies is, therefore, a key indicator of how quickly a GPU can render memory-intensive frames such as those in high-resolution AAA games. As applications seek to render increasingly detailed geometry at higher resolutions, memory bandwidth quickly becomes a performance bottleneck (Niessner et al., 2016).

Hardware manufacturers have recently begun to address this bottleneck with the development of High Bandwidth Memory (HBM). HBM is a low-power, high-bandwidth memory technology that aims to address the power efficiency problems of GDDR memory (Advanced Micro Devices, 2015). Some high-end GPUs have adopted HBM2 memory to offer up to 1TB/s of memory bandwidth, more than double that of a competing card utilising GDDR6 memory, albeit at roughly one-seventh of the memory clock speed; in any case, memory bandwidth rarely causes a bottleneck in such high-end GPUs. Future iterations of these memory technologies will aim to balance bandwidth against power efficiency, and the trickle-down effect will subsequently reach mobile and integrated platforms. However, even with increasing bandwidth capabilities in graphics architectures, algorithms which favour compute operations over memory usage will still make the most efficient use of available power:

“While it may be the case that future architectures will have significantly more memory bandwidth than today’s, compute resources will likely scale faster, which means the relative cost of memory bandwidth will continue to increase.” (Burns and Hunt, 2013)

Hence, in addition to fundamental methods of mitigating bandwidth usage, such as texture compression and mipmapping, it is important that novel rendering techniques are developed to both limit memory traffic and promote high cache hit-rates (O’Conor, 2017).

2.2 The Visibility Buffer

The visibility buffer rendering technique is an alteration of the widely adopted multi-pass renderer known as deferred shading or deferred lighting. Bor-Sung Liang et al. (2000) present deferred lighting as an approach to rasterization that avoids unnecessary lighting and shading calculations on sections of geometry that are later overwritten by closer objects. It is possible to avoid this by writing each interpolated surface attribute of a fragment into an individual image buffer — collectively referred to as the g-buffer — and inherently retaining attributes

of visible fragments only. Lighting and texturing are then performed in a later pass by sampling attributes from the g-buffer. While Bor-Sung Liang et al. (2000) prove this method successfully reduces lighting calculations (and especially optimises complex scenes with many lights), the size of a typical g-buffer in a high-quality renderer can be as large as 32 bytes per sample. For this reason, the memory bandwidth required on the GPU to read and write all surface attributes of a mesh can be prohibitive on mobile and integrated platforms, especially at high resolutions (Burns and Hunt, 2013; Engel, 2016).

The visibility buffer was first proposed in the 2013 paper ‘The Visibility Buffer: A Cache- Friendly Approach to Deferred Shading’ (Burns and Hunt, 2013). The research cites memory bandwidth usage as a key drawback in traditional deferred renderers and proposes a trade- off which leverages the increasing compute:bandwidth ratio in modern graphics hardware. The proposed implementation aims to reduce the memory footprint of deferred-style renderers by storing only references to each primitive to the g-buffer — thus completely decoupling geometry processing from materials and lighting. While a deferred renderer may store position, colour, normal, and texture information into memory per fragment, the visibility buffer in its simplest form stores only the primitive ID and draw call ID packed into as little as 4 bytes (although Burns and Hunt (2013) suggest that 8 bytes would be necessary for scenes with a large amount of geometry or which implement hardware tessellation). The stored IDs describe where the referenced primitive is stored in memory, and allow the shading pass to index into the vertex buffer so as to retrieve and interpolate surface attributes manually. By doing so, the visibility buffer can be considered as a more memory-efficient variant of deferred shading.
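As a rough, illustrative calculation (ignoring multisampling and any auxiliary targets): at 1920 x 1080, a 4-byte-per-sample visibility buffer occupies 1920 × 1080 × 4 bytes ≈ 8.3 MB, whereas a 32-byte-per-sample g-buffer occupies roughly 66.4 MB, an eight-fold difference that must be written during geometry processing and read back again during shading.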

It is apparent from Burns and Hunt’s (2013) implementation that a visibility buffer renderer must carry out a certain amount of extra work in order to successfully decouple shading from geometry processing. The calculation of barycentric coordinates, re-computation of the vertex stage, and the read accesses into the vertex buffer are the computational overheads that pay for the compact memory usage of the visibility buffer. However, despite this cost, the research by Burns and Hunt (2013) reports up to 40% decreases in frame times over a traditional deferred renderer in a high-detail scene. The most pronounced gains in performance were observed on platforms with limited bandwidth, therefore confirming the value of trading increased computation for an efficient memory footprint on lower-end graphics architectures, such as those of mobile and integrated platforms.

The research carried out by Burns and Hunt (2013) is developed in the 2015 paper ‘Deferred Attribute Interpolation for Memory-Efficient Deferred Shading’ (Schied and Dachsbacher, 2015). The paper presents an alteration to the initial visibility buffer renderer proposed by Burns and Hunt (2013), and explicitly references the support of tessellation as a factor in their

reasoning. Schied and Dachsbacher's (2015) solution (DAIS) employs a Z pre-pass to first transform all visible triangles into screen-space before storing them into a triangle buffer. Thus, instead of referencing untransformed primitives in the visibility buffer (Burns and Hunt, 2013), DAIS stores transformed triangles separately, using the visibility buffer to store references to each triangle's address. Storing triangles in this way creates more memory traffic, and therefore higher bandwidth usage, but it also allows for the support of hardware tessellation and saves vertex transformations from being re-computed at per-sample frequency in the shading pass. Additionally, DAIS also stores the partial derivatives of attributes to a buffer in the first phase so as to simplify the calculation of barycentric coordinates in the shading pass. Schied and Dachsbacher (2015) argue that the extra memory requirement for storing the partial derivatives is not only minor, but is also compensated for by the gain in performance in the shading pass; however, it may pose a limitation on platforms with restricted bandwidth capabilities.

DAIS, therefore, represents a less memory-conscious solution to deferred shading when compared to the research in Burns and Hunt’s (2013) paper upon which it is based. Schied and Dachsbacher (2015) seem to place less value on the use of memory bandwidth in favour of more efficient computation, and while their findings report further performance gains over traditional deferred shading, they note that architectures with limited bandwidth may struggle with complex scenes. In this way, DAIS may be found to encounter many of the same problems as deferred shading on lower-end hardware.

Research into the visibility buffer technique is continued in the GDC presentation ‘The filtered and culled visibility buffer’ (Engel, 2016). It presents a practical implementation of the visibility buffer that combines the approaches of Burns and Hunt (2013) and Schied and Dachsbacher (2015). In a renderer that consists of multiple phases, Engel (2016) employs cluster culling and triangle filtering passes to eliminate invisible geometry before shading. Geometry that passes these tests is appended to a filtered triangle buffer, as per Schied and Dachsbacher (2015). The visibility buffer is then populated via a similar solution to that presented by Burns and Hunt (2013). In Table 1, Engel (2016) presents the estimated size of a visibility buffer at two resolutions, and compares these to a typical g-buffer in a deferred shading pipeline.


Pipeline             1080p        4K
Visibility Buffer    86.20 MB     276.61 MB
Deferred G-Buffer    160.49 MB    635.29 MB

Table 1 - Estimated memory requirements of the visibility buffer against a deferred g-buffer with 4x MSAA (Engel, 2016)

In his personal blog ‘Diary of a Graphics Programmer’ (Engel, 2018), Engel expands on his 2016 GDC presentation. In the post, he provides further information regarding the specifics of the implementation in code and offers further context for understanding the benefits of a visibility buffer renderer. Specifically, Engel (2018) describes the tendency towards increasing polygon counts in modern games, and how high amounts of geometry can cause bottlenecks in traditional graphics pipelines. To combat this in his implementation of the visibility buffer, Engel (2018) describes the implementation of a triangle removal compute pass as the first stage in rendering, allowing the graphics pipeline to be concerned only with visible geometry. Engel (2018) also details the execution of triangle cluster culling, a method which removes batches of triangles with similar normals that are back-facing to the camera. In removing geometry that will not appear on screen before the shading passes, further memory traffic is saved when accessing the triangle buffers in the final pass.

Despite the references to bandwidth usage in the aforementioned studies, there are no specific measurements given for reads and writes to memory as used by their respective visibility buffer implementations. Thus, the bandwidth efficiency of the technique is largely unassessed in previous research, and is instead implied through the compact size of the visibility buffer against a typical deferred g-buffer.

2.3 Hardware Tessellation

As the demand for photorealism in interactive graphics software continues to rise, so does the need for increasingly complex and detailed geometry to be viable in rendering pipelines. As with transferring multiple textures from GPU memory, rendering highly-detailed geometry can quickly become prohibitive on platforms with limited bandwidth (Engel, 2016, 2018; Niessner et al., 2016). Hardware tessellation is a stage of modern graphics pipelines which allows for the procedural generation of detailed geometry from a coarser representation stored in memory. It achieves this by taking a set of primitives as control patches and subdividing them to produce a mesh of higher primitive density.


In addition to the cheaper global memory cost of storing low-detail meshes, this approach also has the benefit of far fewer memory accesses being required to render a high-density mesh, and therefore a more efficient use of available bandwidth. The study by Niessner et al. (2016) outlines the benefits that the tessellation stage offers in computer games, as further detail can be added to tessellated geometry by displacing generated vertices. This technique is often utilised for creating complex terrains and more detailed character models, as shown in Figure 1 and Figure 2.

Figure 1 - An example image presented by Niessner et al. (2016) demonstrating an object with (right) and without (left) hardware tessellation. Use of a displacement map on the tessellated mesh allows for further detail to be added for a relatively low cost.

Figure 2 - Tessellation is utilised in Max Payne 3 (Rockstar Games, 2012) to add curvature to character and car models, as seen in Max's ear, collar and suit.

2.4 Visibility Buffer with Tessellation

Given the performance increases associated with bandwidth-efficient rendering techniques and the success of the visibility buffer technique in decreasing the resource cost of deferred shading, it follows that the addition of tessellation to the visibility buffer in suitable scenes should further improve performance. Within the papers previously discussed, however, tessellation is afforded only passing consideration, and its support within the visibility buffer has not been fully explored. While the potential benefits of combining the two techniques may initially seem pronounced, there are certain trade-offs that would have to be made in practice.

In a visibility buffer implementation akin to Burns and Hunt (2013), the complete decoupling of geometry processing from attribute interpolation and shading means that tessellated geometry created in the forward pass would be lost. Burns and Hunt (2013) offer a brief consideration as to how they would alter the solution to be able to support tessellation: following the tessellation of geometry in the forward pass, the instance ID, patch ID, and barycentric coordinates of the generated primitive within the input patch would be stored to the visibility buffer. This data would require double the memory at 8 bytes over the presented implementation, thereby increasing the relative memory traffic of the solution. While this figure is still less than that of a typical deferred renderer, it represents something of a backward step in terms of the original motivations driving the visibility buffer approach, and could have knock-on effects on the performance of lower-end graphics platforms. Further to this, Burns and Hunt (2013) note that the process of re-calculating the domain shader per fragment in highly tessellated scenes could incur a substantial computational cost. Indeed, for a pipeline which tightly compacts memory usage in exchange for increased computation, it may prove critical to its perceived benefits if both factors are negatively impacted by tessellation. Hence, it is the purpose of this work to determine if a typical use-case of tessellation within a visibility buffer pipeline would inhibit or improve the bandwidth efficiency observed in the original works.

Tessellation is offered greater consideration in Schied and Dachsbacher's (2015) implementation of DAIS. The paper corroborates Burns and Hunt's (2013) hypothesis that tessellation may incur a fair amount of overhead, especially in DAIS, since all visible triangles must be stored in memory. The two geometry passes DAIS requires also become costly for high triangle counts, such as with highly tessellated scenes. To demonstrate performance, a simple tessellation shader was implemented and tested (Figure 3). It is shown that the DAIS pipeline outperforms deferred shading at high resolution but fails to match it at low resolution.

Figure 3 - Graphs comparing the timings and memory usage of a geometry pass with a tessellated scene at two resolutions (Schied and Dachsbacher, 2015).

Despite this analysis of tessellation performance with DAIS against deferred shading, Schied and Dachsbacher's (2015) paper still falls short of measuring the direct impact of tessellation on the performance of the visibility buffer pipeline, in terms of both frame time and memory usage. Hence, this paper aims to evaluate the aforementioned performance factors of a visibility buffer renderer upon introducing hardware tessellation, and also to discuss the value of its implementation when compared against the rendering of high-detail static meshes.

3. METHODOLOGY

To evaluate the relative impact of adding hardware tessellation to a typical visibility buffer renderer, a real-time graphics application was developed which implements two distinct render pipelines.

3.1 Application Overview

The developed application renders a generated terrain mesh, lit by a single directional light. A terrain was chosen due to the ease of generating a triangle patch-grid at runtime to allow precise control over vertex density and detail. Two distinct meshes may therefore be generated to represent the same terrain but with varying detail stored in host memory.

The user may navigate the scene by moving and rotating the camera, and can access various functionality via the on-screen user interface. Specifically, the UI allows the user to switch between the visibility buffer pipeline (VB) and the visibility buffer with tessellation pipeline (VBT) in real-time, so as to directly compare the two. Intermediate render targets comprising the visibility buffer are viewable by toggling them on or off. The user may also alter the colour of the lighting and manually input camera transform values.

The overlaid statistics window provides real-time performance metrics as measured by the application. These include full frame times as well as GPU execution times for each subpass, and the user may obtain an average sample of the last 20 frames for each measurement. Also shown are the respective triangle counts of each terrain mesh.

Figure 4 - A screenshot of the final application. A procedurally generated terrain is rendered and lit via two distinct render pipelines which may be switched between in real-time.

3.2 Development Environment

The application is built in C++ using the Vulkan graphics and compute API (Khronos Group, 2019). The framework leverages a number of external libraries to accelerate development and provide reliability to low-level functionality. These libraries are included below.

• GLFW – Open-source library for creating and managing windows with OpenGL/Vulkan contexts (GLFW, 2019).
• VMA – Vulkan library for minimising boilerplate code when allocating memory for application resources (Vulkan Memory Allocator, 2019).


• Dear ImGui – Graphical user interface library for exposing application settings and outputs to the user. Also handles mouse and keyboard input (Dear ImGui, 2019).
• GLM – OpenGL-based mathematics library defining data structures and types consistent with the GLSL shader language (OpenGL Mathematics, 2019).
• STB – Open-source image loading library for loading textures and displacement maps into memory (STB Image, 2019).

3.3 Application Framework

A bespoke, lightweight Vulkan framework was built from the ground up to expose only necessary functionality and allow control over all aspects of the application's behaviour. The Vulkan instance and device interfaces are created and handled from a single core class, upon which all application functionality is based.

3.3.1 Renderers

Two distinct renderers were built within the same Vulkan and application instance to maximise the resources which may be shared between them. In doing so, any perceived differences in performance of the renderers are isolated as much as possible to only the characteristics which this paper aims to test and evaluate.

Shared Resources and Functionality

Both renderers are created within the same Vulkan instance, device context and swap chain. Renderer-specific objects are assigned from the same resource pools, and both pipelines share the same set of command buffers, which are re-recorded when switching between renderers. Where possible, attachments, resource descriptors and pipeline settings are shared between renderers to maximise overlap in areas not specific to the inclusion or exclusion of hardware tessellation.

When writing to the visibility buffer, the renderers make use of the same built-in GLSL variable, gl_PrimitiveID, to identify the triangle being stored. In the case of tessellation, this value refers to the input control patch of the mesh, not to the generated tessellation primitive. As such, both renderers can use these values to index into the vertex buffer in the deferred pass in order to access vertex attributes for each primitive, and thus use the same amount of bandwidth per sample to reference primitives in host memory. However, the renderers differ in the context of bandwidth usage when it comes to retaining the detail generated in the tessellation stages, which requires additional data to be stored.


To ensure the most efficient use of memory between passes, and to improve the readability of the solution, both renderers are defined within a single VkRenderPass. Vulkan allows render passes to consist of multiple sub-passes, each declaring which set of frame buffer attachments they will work with. In deferred-style renderers, this layout allows the graphics driver to keep attachments in on-chip memory for the duration of the render pass, thereby saving a substantial amount of bandwidth usage. On optimal hardware, this approach allows a fragment shader to directly load the same fragment of an Input Attachment from the previous sub-pass without any accesses to off-chip memory. The renderers in the developed application both make use of sub-passes and Input Attachments for this purpose.
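As a brief GLSL illustration (identifier names are not taken from the project source), the deferred sub-pass can declare the visibility buffer as an input attachment and read it with subpassLoad, which takes no coordinates because it always returns the value written for the current fragment in the previous sub-pass:

    layout (input_attachment_index = 0, set = 0, binding = 0)
        uniform subpassInput visibilityBuffer;

    // On hardware that supports it, this read can be serviced from
    // on-chip memory rather than off-chip VRAM.
    vec4 packedIDs = subpassLoad(visibilityBuffer);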

To displace the terrain geometry so that it more closely resembles a naturally occurring landscape, a heightmap image is loaded and sampled in both render pipelines. Since VB renders a static mesh which does not change at runtime, it would be most efficient to sample the heightmap data upon generation of the terrain. However, in the case of VBT, only patch control points are stored in memory, and therefore the heightmap must be loaded in the shaders to be sampled for generated vertices. To compound the memory cost of this operation, the height displacement must be carried out in both the visibility and shading phases to ensure consistency throughout the rendering process. The heightmap must be sampled, therefore, for each vertex of the loaded primitive in the fragment shader per sample. This incurs a substantial memory cost, despite usually resulting in high cache hit rates. It was deemed that such an imbalance in cost between the two renderers when supporting a largely cosmetic feature would unnecessarily obscure test results, and thus VB also samples the heightmap per frame in both subpasses.

In the deferred pass of each renderer, the vertex shader makes use of the built-in GLSL value gl_VertexIndex to generate a full screen quad, upon which the final image will be rendered. Such an approach negates the need for storing a quad in host memory and binding it to the draw call. This stage outputs only the screen position of each vertex, as the fragment shader is responsible for all the heavy lifting in the renderers’ deferred stages.
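A common form of this technique is sketched below in GLSL; the project's exact shader is not reproduced here, and this variant emits a single oversized triangle rather than a two-triangle quad, which works analogously.

    #version 450

    // Full-screen primitive generated purely from gl_VertexIndex, with no
    // vertex buffer bound; drawn with vkCmdDraw(cmd, 3, 1, 0, 0).
    void main()
    {
        // Vertices 0, 1, 2 map to (0,0), (2,0), (0,2), i.e. a triangle
        // covering the whole clip-space rectangle [-1, 1] x [-1, 1].
        vec2 uv = vec2((gl_VertexIndex << 1) & 2, gl_VertexIndex & 2);
        gl_Position = vec4(uv * 2.0 - 1.0, 0.0, 1.0);
    }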

Visibility Buffer (VB)

VB consists of two distinct phases. Firstly, in the forward pass, visibility is determined by projecting the generated terrain mesh into screen space before writing visible primitives to the visibility buffer. A packing function in the fragment shader takes the draw call ID and primitive ID and combines them into a single, unsigned integer. This value is then unpacked into a 4x8 vector value, which is written to the 8:8:8:8 render target being used as the visibility

buffer. These values act as references to each visible primitive in the final image, and allow the deferred pass to load surface attributes from the vertex buffer in order to shade them.
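A minimal sketch of this packing step is given below. The bit split between draw call ID and primitive ID, and the delivery of the draw call ID as a push constant, are assumptions made for illustration; the project's exact function is not reproduced here.

    #version 450

    layout (push_constant) uniform PushConstants { uint drawID; } pc;
    layout (location = 0) out vec4 outVisibility;   // 8:8:8:8 UNORM visibility buffer

    void main()
    {
        // Assumed split: top 8 bits for the draw call ID, lower 24 bits
        // for the primitive ID supplied by the rasteriser.
        uint packedID = (pc.drawID << 24) | (uint(gl_PrimitiveID) & 0x00FFFFFFu);

        // Unpack the 32-bit value into four 8-bit channels so that it
        // survives the trip through an RGBA8 colour attachment.
        outVisibility = unpackUnorm4x8(packedID);
    }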

This approach to implementing the visibility buffer closely resembles that of Burns and Hunt (2013) and Engel (2016, 2018) as it makes the most efficient use of bandwidth, requiring only 4 bytes per sample. At this same stage, DAIS (Schied and Dachsbacher, 2015) would store the transformed triangles to a triangle buffer, along with computed partial derivatives. However, this solution creates considerably more memory traffic and therefore did not seem to be a suitable solution for the implementation of VB.

In the deferred pass, the fragment shader firstly performs a subpassLoad operation to efficiently load the visibility buffer data for the current fragment directly from the previous sub-pass. The packing methods are reversed to recover the draw call ID and primitive ID of the referenced triangle, which are used to calculate the indices of its component vertices in the vertex buffer. In the case of both VB and VBT, the entire terrain mesh is rendered in a single draw call, and thus draw call ID will always be zero. In more complex scenes which utilise Multi* draw commands such as MultiDrawIndirect, the draw call ID would be used to calculate the start index of the current draw call within the bound vertex buffer.

To pass surface attributes to the fragment shader for loading, the vertex buffer is compressed into a collection of two vec4 values as shown in Figure 5. These attributes are accessed with the calculated indices, meaning that all information required about the visible triangle at the current fragment is now available. It is this process of only referencing visible geometry in the forward pass and deferring the loading of attributes that the aforementioned studies describe as decoupling geometry from shading (Burns and Hunt, 2013; Schied and Dachsbacher, 2015; Engel, 2016, 2018).

Figure 5 - The buffer alignment used to compress surface attributes into an efficient format for loading in the deferred shader stages.
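A combined sketch of the steps just described (unpacking the IDs and fetching the compressed attributes) might look as follows. The buffer bindings and the exact split of attributes across the two vec4s are assumptions for illustration, as is the placeholder shading at the end.

    #version 450

    layout (input_attachment_index = 0, set = 0, binding = 0)
        uniform subpassInput visibilityBuffer;

    struct PackedVertex
    {
        vec4 positionU;   // xyz = object-space position, w = texture u (assumed split)
        vec4 normalV;     // xyz = surface normal,        w = texture v
    };

    layout (std430, set = 0, binding = 1) readonly buffer VertexBuffer { PackedVertex vertices[]; };
    layout (std430, set = 0, binding = 2) readonly buffer IndexBuffer  { uint indexData[]; };

    layout (location = 0) out vec4 outColour;

    void main()
    {
        // Load this fragment's reference directly from the forward sub-pass.
        vec4 visSample = subpassLoad(visibilityBuffer);

        // Reverse the forward-pass packing (exact for 8-bit UNORM data).
        uint packedID    = packUnorm4x8(visSample);
        uint drawID      = packedID >> 24;
        uint primitiveID = packedID & 0x00FFFFFFu;

        // A single draw call is used here, so drawID is always zero; with
        // MultiDrawIndirect it would select a per-draw offset into indexData.
        uint firstIndex = primitiveID * 3u;

        // Fetch the three vertices of the referenced triangle.
        PackedVertex v0 = vertices[indexData[firstIndex + 0u]];
        PackedVertex v1 = vertices[indexData[firstIndex + 1u]];
        PackedVertex v2 = vertices[indexData[firstIndex + 2u]];

        // ... project the vertices, compute the barycentric derivatives
        // (Equations 1 and 2), interpolate attributes and shade.
        outColour = vec4(normalize(v0.normalV.xyz + v1.normalV.xyz + v2.normalV.xyz) * 0.5 + 0.5, 1.0); // placeholder
    }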

Once surface attributes are acquired, they must be manually interpolated across the primitive to shade the current fragment. This is achieved by projecting the vertices into screen space and, from them, calculating the partial derivatives of the fragment’s barycentric coordinates.


The implemented algorithm follows that which is given by Schied and Dachsbacher (2015), and is outlined in Equation 1 and Equation 2.

Having acquired the partial derivatives of the current fragment’s barycentric coordinates, the surface attributes of the primitive can now be interpolated to the point. In this way, the perspective-correct position, normal and texture coordinates are acquired for the point. Following the displacement of vertex height via heightmap sampling, surface shading is then executed as normal.

Equation 1 - The barycentric coordinates λi for a point (x, y) in relation to a triangle pi = (ui, vi), where D = det(p3 - p2, p1 - p2). (Schied and Dachsbacher, 2015)

Equation 2 - From Equation 1, the partial derivatives of the barycentric coordinates with respect to the point (x, y). (Schied and Dachsbacher, 2015)
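The equation images themselves do not reproduce here; a standard form consistent with the captions above, using the notation p_i = (u_i, v_i) and D = det(p_3 - p_2, p_1 - p_2), is:

\lambda_1(x, y) = \frac{(v_2 - v_3)(x - u_3) + (u_3 - u_2)(y - v_3)}{D}, \qquad
\lambda_2(x, y) = \frac{(v_3 - v_1)(x - u_3) + (u_1 - u_3)(y - v_3)}{D}, \qquad
\lambda_3 = 1 - \lambda_1 - \lambda_2

from which the partial derivatives follow as per-triangle constants:

\frac{\partial \lambda_1}{\partial x} = \frac{v_2 - v_3}{D}, \quad
\frac{\partial \lambda_1}{\partial y} = \frac{u_3 - u_2}{D}, \quad
\frac{\partial \lambda_2}{\partial x} = \frac{v_3 - v_1}{D}, \quad
\frac{\partial \lambda_2}{\partial y} = \frac{u_1 - u_3}{D}, \quad
\frac{\partial \lambda_3}{\partial x} = -\left(\frac{\partial \lambda_1}{\partial x} + \frac{\partial \lambda_2}{\partial x}\right)

and similarly for the derivative of \lambda_3 with respect to y.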

Tessellation + Visibility Buffer (VBT)

Although following a largely similar structure to that of VB, VBT must additionally consider how to preserve the detail generated by tessellation in the forward pass — a problem which diverges opinion in previous research concerning the visibility buffer. While Schied and Dachsbacher (2015) argue that storing transformed visible triangles to a new geometry buffer enables the support of tessellation, doing so increases the amount of memory traffic used, which may prove a limitation on lower-end hardware. Alternatively, Burns and Hunt (2013) suggest that tessellation may be supported by storing the barycentric coordinates of generated vertices to a buffer per sample. These coordinates describe each vertex’s location

within its input patch in terms of proximity to the patch's control points. By storing these per tessellated primitive, the domain shader may be recomputed for each in the deferred pass. It is noted by Burns and Hunt (2013) that this approach comes with a computational cost, as domain shading must be carried out per sample for each of the current primitive's vertices. However, doing so would save memory bandwidth usage versus the solution presented by Schied and Dachsbacher (2015) and thus was the favoured approach for this project.

In their consideration of tessellation within a visibility buffer renderer, Burns and Hunt (2013) suggest that the patch barycentric coordinates may be stored using an additional 4-byte buffer. However, when developing this solution for a detailed scene requiring a sufficient level of accuracy in the stored tessellation coordinates, it was not apparent how such a small format could be sufficient. The need to preserve the coordinates of all three vertices (each already 4 bytes) at a desirable accuracy meant that a larger format would be necessary for the tessellation coordinates buffer.

To preserve the tessellation coordinates, a geometry shader stage was introduced. The gl_TessCoords vector is passed out of the domain stage per vertex and subsequently collected per primitive in the geometry stage as an array of vec3s. Each vector representing the barycentric coordinates of a single vertex is then packed into a 32-bit unsigned integer and stored into a single component of a R32G32B32A32_UINT image buffer. This process is visualised in Figure 6. The net result is a 16-byte addition to the working set of the visibility buffer, equalling 20 bytes in total.
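A sketch of this geometry stage is given below. The 11:11:10 bit allocation used by packBary, and the explicit forwarding of the patch ID from the evaluation (domain) stage, are assumptions made for illustration; the thesis states only that each vec3 is packed into one 32-bit component.

    #version 450

    layout (triangles) in;
    layout (triangle_strip, max_vertices = 3) out;

    // Written per vertex by the tessellation evaluation shader:
    layout (location = 0) in vec3 teTessCoord[];   // gl_TessCoord, forwarded
    layout (location = 1) in int  tePatchID[];     // gl_PrimitiveID (input patch index), forwarded

    // One packed coordinate per corner of the generated triangle, later
    // written unchanged by the fragment shader to the R32G32B32A32_UINT target.
    layout (location = 0) flat out uvec4 gsPackedBary;

    // Hypothetical packing: 11 bits for x, 11 bits for y, 10 bits for z.
    uint packBary(vec3 b)
    {
        uint x = uint(clamp(b.x, 0.0, 1.0) * 2047.0 + 0.5);
        uint y = uint(clamp(b.y, 0.0, 1.0) * 2047.0 + 0.5);
        uint z = uint(clamp(b.z, 0.0, 1.0) * 1023.0 + 0.5);
        return (x << 21) | (y << 10) | z;
    }

    void main()
    {
        uvec4 packedCoords = uvec4(packBary(teTessCoord[0]),
                                   packBary(teTessCoord[1]),
                                   packBary(teTessCoord[2]),
                                   0u);                     // fourth component unused

        for (int i = 0; i < 3; ++i)
        {
            gl_Position    = gl_in[i].gl_Position;
            gl_PrimitiveID = tePatchID[i];   // keep the input patch ID for the visibility buffer
            gsPackedBary   = packedCoords;
            EmitVertex();
        }
        EndPrimitive();
    }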

This solution is somewhat wasteful in its memory usage, as the fourth component of each value is not required and is therefore left unwritten. This is in part due to valid usage features of available image formats in Vulkan, as the 3-component R32G32B32_UINT format may not be used as a colour attachment in a render pass. This could be partially alleviated by leveraging the fact that only three components are necessary to represent a barycentric coordinate, using multiple 8-bit UNORM image buffers to store coordinate vectors unpacked in a similar way to that presented in Figure 5. This approach could bring the total size of the working set down to only 13 bytes, but does introduce overheads for creating and binding multiple image buffers as opposed to one.


Figure 6 - A representation of the packing process to preserve the three barycentric coordinates of a generated tessellation primitive (white triangle) relative to its parent input patch (RGB triangle). The result is stored per-fragment to an image buffer to be sampled in the deferred pass.

In the deferred pass, the fragment shader receives the visibility buffer, tessellation coordinates buffer, and the vertex buffer of patch control points. In an identical operation to that of VB, the draw call ID and primitive ID are unpacked from the visibility buffer, allowing the attributes of the intersected input patch to be loaded subsequently. However, this pass must then additionally load barycentric coordinates from the tessellation coordinates buffer. These coordinates are used to interpolate the patch control point attributes to the vertices of the generated triangle. Following this, the algorithm may continue as normal. While this extra stage only seems to be a small conceptual addition to the algorithm seen in VB, it introduces large overheads.
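A sketch of the additional work inside VBT's deferred fragment shader is shown below. unpackBary mirrors the hypothetical packBary from the geometry-stage sketch above, and c0, c1 and c2 stand for the three patch control-point positions fetched from the vertex buffer in the same manner as the VB deferred sketch (e.g. v0.positionU.xyz).

    // Additional input attachment holding the packed tessellation coordinates.
    layout (input_attachment_index = 1, set = 0, binding = 3)
        uniform usubpassInput tessCoordBuffer;

    // Inverse of the hypothetical 11:11:10 packing used in the geometry stage.
    vec3 unpackBary(uint p)
    {
        return vec3(float(p >> 21)            / 2047.0,
                    float((p >> 10) & 0x7FFu) / 2047.0,
                    float(p & 0x3FFu)         / 1023.0);
    }

    // Inside main(), after the IDs and patch control points have been loaded:
    uvec4 packedCoords = subpassLoad(tessCoordBuffer);
    vec3  b0 = unpackBary(packedCoords.x);
    vec3  b1 = unpackBary(packedCoords.y);
    vec3  b2 = unpackBary(packedCoords.z);

    // Repeat the domain-shader interpolation per sample: each generated
    // vertex is a barycentric combination of the patch control points.
    vec3 p0 = b0.x * c0 + b0.y * c1 + b0.z * c2;
    vec3 p1 = b1.x * c0 + b1.y * c1 + b1.z * c2;
    vec3 p2 = b2.x * c0 + b2.y * c1 + b2.z * c2;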

Firstly, the interpolation of control point attributes to generated vertices is a repeat of the domain shader stage in the forward pass. Not only is this operation repeated, it is now carried out per sample as opposed to per vertex, and it must be performed for all vertices in the primitive for each sample. The associated computational expense may therefore prove substantial in high-resolution scenes.

In addition to the added compute cost, the chosen solution is unlikely to scale well to very high detail levels. Since the storing of tessellation coordinates is performed in the fragment shader to an image buffer, the accuracy with which primitive boundaries are preserved is bound to screen resolution. As the tessellation factor increases, generated triangles become smaller and saturate the rasterizer with detail that is too fine for the fragment size (Engel, 2019a). This can introduce artefacts in interpolated attributes as the generated triangles reach very small sizes. Observed texturing artefacts at high tessellation factors are shown in Figure 7.


Figure 7 - Texturing artefacts observed at high tessellation factors. The tessellation factor in the left image is 34 (c. 170,000 triangles), and 60 in the right (c. 529,000 triangles).

3.4 Testing

To test the efficacy of introducing hardware tessellation to a visibility buffer renderer, the primary measure is memory traffic and subsequent bandwidth usage. As a technique designed to lighten the load of deferred-style shading on platforms with limited GPU bandwidth, the visibility buffer relies on a compact working set per frame. Therefore, the increase in storage required to preserve tessellated geometry must be evaluated to determine whether the relative trade-off is worthwhile.

3.4.1 Hardware

As this work aims to assess the comparative memory efficiency of two renderers, it was considered out of scope to test on multiple configurations of underlying hardware. Save for minor inconsistencies between the drivers of independent hardware vendors, the performance measurements of the two techniques relative to each other should remain constant across platforms. Table 2 outlines the system configuration used to test the application.


Component      Specification
CPU            Intel i7-9700K @ 4.80GHz
GPU            NVIDIA GeForce RTX 2070
Memory         16GB DDR4 2667MHz
Motherboard    ASUS TUF Z370-Plus Gaming
OS             Windows 10 Education 64-bit

Table 2 - The system configuration used to build and test the application.

3.4.2 Evaluation Methods

Memory and processor performance metrics were acquired through the Range Profiler features of NVIDIA Nsight Graphics (NVIDIA, 2019). The application is launched through the Nsight software, at which point the profiling tools are injected into the process. GPU performance figures and usage statistics are captured directly from the hardware and presented to an output window. Frame and subpass timings are measured directly in the application using a high-resolution clock and Vulkan timestamp queries.

3.4.3 Evaluation Parameters

To fully assess the comparative performance of the two renderers, key parameters were identified to best reveal the strengths and weaknesses of both approaches. By varying these values during testing, a more complete picture of the characteristics of each renderer may be built up.

For testing the benefits of hardware tessellation at varying detail levels, the final triangle count of each terrain mesh is modified in set increments. The key benefit of hardware tessellation is the elimination of the need to store large amounts of detailed geometry in host memory, to then be passed to the GPU per frame. At each increasing triangle count, VB will be sending more data across the CPU-GPU bus, while VBT may rely on a single, far coarser triangle grid for all detail levels. The test cases for triangle count were:

• c. 250,000 triangles
• c. 580,000 triangles
• c. 1,000,000 triangles
• c. 2,000,000 triangles

As the renderers do not employ any form of triangle-culling, these counts will provide sufficient complexity for the renderers, especially VB.

As pointed out in previous studies (Burns and Hunt, 2013; Schied and Dachsbacher, 2015; Engel, 2016, 2018), the visibility buffer technique scales with resolution far more effectively than a traditional deferred renderer due to its memory efficiency per sample. To evaluate this scalability for this implementation, test results were taken at varying resolutions to assess how the performance of each renderer scales with higher pixel counts. These resolutions were:

• 1280 x 720 pixels (720p)
• 1920 x 1080 pixels (1080p)
• 2560 x 1440 pixels (1440p)
• 3840 x 2160 pixels (4K)

Since a visibility buffer renderer performs a large amount of computation per-fragment, higher resolutions will add noticeable compute overhead. In most cases, however, this is offset by the technique's memory efficiency, allowing it to scale well to higher sample rates, at which lower-end systems will often become bandwidth-limited (Burns and Hunt, 2013).

As visibility and shading are entirely de-coupled in a visibility buffer renderer, shading will only be carried out for geometry visible in the current frame. As such, it is important to measure the relative performance of each renderer when a varying amount of geometry is rasterized. Captures for frame execution times were therefore taken at these levels of geometry-frame occupancy:

• 0% Geometry, 100% Sky
• 50% Geometry, 50% Sky
• 100% Geometry, 0% Sky

3.4.4 Performance Metrics

The purpose of this implementation is to evaluate the memory efficiency of a visibility buffer renderer with and without the use of hardware tessellation. In assessing this, it is important to measure both precise memory usage metrics as well as net performance impacts. As such, measurements were taken for the following metrics:


Memory Usage/Efficiency

• Working set sizes (MiB)
• Throughput of the CROP unit (%)
• Throughput of the L1/Tex Cache (%)
• Video memory read throughput (%)
• Video memory write throughput (%)
• Level 2 cache hit rate (%)

Net Performance

• SM Throughput (%)
• GPU execution times of each subpass (ms)
• Net percentage speed-up of VBT (frame time ±%)

4. RESULTS

For samples requiring the capturing of single frames, metrics are averaged from a set of 20 captures to eliminate outliers in results. Measurements are then taken over varied application parameters as outlined in Section 3.4.3.

For measurements taken over multiple resolutions, a default detail level of c. 580,000 triangles was chosen. This value represents a stable workload for both renderers whilst limiting texturing artefacts in VBT. When varying triangle count, measurements were taken at a resolution of 1080p, which of the four resolutions chosen for this work is the most commonly used by desktop users at time of writing (Statcounter, 2019). For cases in which geometry- frame occupancy is not explicitly varied, all measurements are taken from an identical camera position, with roughly 50% of the frame containing terrain geometry.

Many of the low-level performance metrics are presented in the form of percentage occupancy or throughput. These measurements are accessed via the Nsight Graphics Range Profiler (NVIDIA, 2019) and are taken directly from hardware counters on the respective GPU unit. The values convey how close the unit is to its maximum theoretical throughput during the given measurement period, and thus in the case of memory systems provide a good metric for understanding bandwidth usage.


4.1 Memory Usage

4.1.1 Working Set

Table 3 outlines the combined working set size for each renderer under varying parameters. The values account for the size of the vertex and index attribute structures, primitive count in host memory, and the sample sizes and resolution of the required visibility buffer attachments. Appendix 1 outlines the calculation of these totals in further detail, presenting the count and size of each component of the working sets.

While both renderers share the same structure definitions for vertex and index attributes (and thus would consume the same amount of memory to store matching sets of geometry), VBT can store far less geometry in host memory to achieve the same detail level as VB. This saving is contrasted by the requirement of a 16-byte per sample colour buffer that must be stored in addition to the base visibility buffer attachment.

Resolution       720p                                 1080p
Triangle count   250k     580k     1,000k   2,000k    250k     580k     1,000k   2,000k
VB               10.08    19.18    30.86    57.30     14.47    23.57    35.26    61.70
VBT              17.59    17.59    17.59    17.59     39.56    39.56    39.56    39.56

Resolution       1440p                                4K
Triangle count   250k     580k     1,000k   2,000k    250k     580k     1,000k   2,000k
VB               20.63    29.73    41.41    67.85     38.20    47.30    58.99    85.43
VBT              70.32    70.32    70.32    70.32     158.21   158.21   158.21   158.21

Table 3 - The total working set size (MiB) of each renderer at varying triangle counts and resolutions.
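As a point of reference for these figures, VBT's per-sample cost of 20 bytes (the 4-byte visibility buffer plus the 16-byte tessellation coordinates attachment) accounts for almost the entire working set: at 1080p, 1920 × 1080 × 20 bytes ≈ 39.55 MiB of the 39.56 MiB reported, with the small remainder being the constant, coarse patch-grid geometry held in host memory.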

The relative increase or decrease of the tessellation working set against the standard visibility buffer is visualised in Figure 8. A key pattern to note is that VBT’s working set will remain constant for all triangle counts within each resolution, and as such can scale to higher detail levels far more effectively than VB.

However, the increased cost of reading and writing an extra 16-byte per sample colour texture to support tessellation becomes apparent as resolution increases. When compared to VB at 4K resolution, VBT can require over 4 times the memory to store the same level of geometric detail.


As proposed in previous studies (Burns and Hunt, 2013; Schied and Dachsbacher, 2015; Engel, 2016, 2018) VB handles the jump to higher resolutions effectively due to its per sample memory efficiency. Thus, while VBT requires almost 9 times the memory to render at 4K resolution versus 720p, VB can require less than double.

Figure 8 - The relative increase/decrease in working set size (MiB) of VBT vs VB.

4.1.2 Forward Writes

Figure 9 presents the percentage occupancy of the Colour Render Output Unit (CROP) used by the forward subpass of each renderer per resolution. The CROP performs colour writes and blending to render targets, and thus is responsible for the filling of the visibility buffer in the forward pass.

Figure 9 - Texture write usage of each renderer’s forward subpass per-resolution. A higher percentage denotes a larger consumption of the maximum bandwidth of the CROP unit.

These measurements in Figure 9 show that the memory bandwidth used in both renderers when writing to the visibility buffer is proportional to the resolution of the render targets. As resolution rises, VB experiences relatively inexpensive increases in write bandwidth usage, with 4K resolution only consuming around 5% more of maximum throughput when compared to 720p.

This is contrasted by VBT and its far larger visibility buffer implementation, which consumes nearly 15% more bandwidth to make the same jump in screen resolution. In general, both renderers require double the bandwidth to write to 4K colour buffers as compared to 720p.

Considering the two renderers alongside one another, these results show a dramatic increase in texture write bandwidth consumption when rendering using VBT. Since this renderer requires an extra 16 bytes of colour data to be written per sample, the occupancy of the CROP unit is generally triple that of VB at all resolutions when filling the visibility buffer in the forward pass.

4.1.3 Deferred Reads

Figure 10 reports each renderer’s percentage throughput of the Level 1 and texture cache during the deferred pass. For the most part, all global memory accesses are routed through either the L1 or texture cache, and therefore these values reflect the loading of uniform buffers as well as visibility buffer textures.

Figure 10 - L1 Cache and Texture unit occupancy of each renderer's deferred pass. A higher percentage denotes a larger consumption of available bandwidth between the L1/Tex cache and the SMs.

The percentages measured demonstrate comparable bandwidth usage between the two renderers’ deferred passes for a constant triangle count. When taking into account uniform buffer reads, and thus the size of the bound vertex buffers, the two renderers vary only slightly in L1/Tex throughput for all resolutions. VBT uniformly occupies roughly 3% more bandwidth than VB in the deferred subpass at each resolution.


4.1.4 Total Read/Writes

Figure 11 outlines the percentage throughputs of global video memory (VRAM) during a full frame per triangle count. The results show that VB consumes a considerable amount of bandwidth as triangle count increases due to the increased read throughput. This is somewhat offset by VBT's need to write additional data during the forward pass, but VBT's write throughput is, at worst, around 50% greater than that of VB. In contrast, VB occupies almost 500% more read bandwidth than VBT when processing 2 million triangles.

These results demonstrate that VB becomes read-limited as triangle count increases. The far larger vertex buffers must be passed to the forward pipeline for geometry processing, before also being bound to the deferred stage for attribute loading. VBT handles higher triangle counts with great efficiency, and in fact consumes less bandwidth in both reads and writes as geometry detail increases.

Figure 11 - The percentage throughput of VRAM of each renderer per triangle count. The left image shows the share of total throughput consumed by read operations, with the right image demonstrating write operations.

4.1.5 Coherency

Figure 12 presents the level 2 cache hit-rates of each renderer's deferred subpass per-resolution. This cache provides a small amount of high-bandwidth memory through which all global memory accesses are routed during shading. Rendering algorithms whose memory accesses are highly coherent are likely to experience high cache hit-rates, resulting in fewer fetches from VRAM and a more efficient GPU workload.


Figure 12 - Level 2 cache hit-rates of each renderer's deferred pass. The measurements indicate a higher level of memory access coherency when rendering with VB.

These results show a correlation between render target resolution and cache hits. In the case of VB, cache hits increase with resolution, and it consistently maintains a high hit rate of at least 87%. However, VBT suffers from a poorer hit rate across all resolutions and experiences fewer cache hits as resolution increases.

4.2 Streaming Multi-Processor Usage

Figure 13 displays the throughput of the GPU streaming-multiprocessors (SM) over a full frame per resolution. Besides memory usage, the two renderers differ in the calculations they must perform to load and interpolate attributes. The results reflect the workload the SM must execute per frame when running each renderer’s shaders.

Figure 13 - The percentage throughput of the GPU streaming-multiprocessors (SM) during the rendering of a full frame per resolution. A higher percentage denotes a workload which consumes a larger share of the SM's maximum theoretical throughput in the measurement period.


The results reflect the extra calculations that are required in VBT per fragment. The SM workload at 720p using VBT is roughly double that of VB. This appears to be an isolated case, however, as the two renderers scale uniformly to higher resolutions. As resolution increases, VBT typically uses an additional 5% of available throughput compared to VB. These values suggest that the pipelines are not compute-bound and may scale resolution without encountering substantial processing bottlenecks.

4.3 Net Performance Impact

Figure 14 presents the time in milliseconds the GPU takes to complete each subpass in the two renderers. Times are recorded both per triangle count and per resolution to capture how the renderers' net performance scales with each parameter.

It is important to note that since GPU processing is highly parallel, and the hardware seeks to fill stalled threads with work that is waiting to execute, there is likely to be overlap in the execution of the forward and deferred subpasses, despite the latter's dependency on the former's completion. This phenomenon was observed during testing, as the time reported for rendering a whole frame could frequently be less than the sum of its two subpass times.

Figure 14 - The GPU time (ms) to complete each subpass. Measured by binding GPU timestamp queries either side of each subpass in the command buffers and calculating the difference. The left image shows subpass times per resolution, while the right image shows timings per triangle count.

4.3.1 Per Resolution

These measurements demonstrate the correlation between resolution and pass time performance for the two renderers. It is observed how the forward pass of the visibility buffer handles the increase of render target resolution well, due to the light per-fragment workload. The computational complexity of the deferred pass is reflected, however, in its performance at higher resolutions. While the forward pass is observed to take roughly the same amount of time between 720p and 4K, the deferred pass takes over double the time to execute across the same range.


VBT is observed at lower resolutions to take a fraction of the time to execute its forward pass compared to VB. As resolution increases, however, the margin by which the forward pass outperforms that of the visibility buffer uniformly narrows. Similarly, while the deferred pass of VBT is consistently the fastest to execute across all resolutions (generally taking half the time of its preceding forward pass), its gains over the visibility buffer deferred pass become less pronounced at higher resolutions.

4.3.2 Per Triangle Count

The first trend that becomes apparent when varying triangle count is the considerable rise in the forward pass execution time of VB at higher triangle counts. This time increases exponentially with triangle count, as the forward pass must process more and more geometry to resolve visibility. In comparison, VBT’s forward times do not increase at anywhere near the same rate, and never rise above 1 millisecond.

It is important to note that there is no observable change in the execution time of VBT’s deferred pass as triangle count rises. Since the primitive count in the bound vertex buffers never changes and resolution is constant, the deferred pass carries out the same workload regardless of the final tessellated triangle count. In contrast, the deferred pass time of the visibility buffer is seen to nearly triple between the lowest and highest detail levels. These results therefore favour the performance of VBT in all measured cases.

4.3.3 Full Frame Performance

Figure 15 presents the net speed increase of VBT against VB in terms of total frame time per resolution. It is shown that VBT outperforms the visibility buffer in all cases, especially at lower resolutions. The performance gains at these resolutions are pronounced, with the best case being a speed increase of over 2000%.

Results vary with geometry frame occupancy as well as resolution. In this context, frame occupancy denotes the portion of the measured frame in which terrain geometry is visible. Since most calculations in the deferred pass of both pipelines will only run on fragments that have values stored in the visibility buffer, frames which contain more sky and less geometry on-screen will have lighter workloads.


The results show that VBT benefits from lower frame occupancy for all resolutions, as it must carry out more calculations per fragment than VB in order to acquire interpolated attributes. Gains are less pronounced at full-frame occupancy for this reason, and this trend scales with resolution; the tessellation frame time increase between 0% and 100% frame occupancy rises from around 0.15ms at 720p to over 1.4ms at 4K.

Figure 15 - The percentage speed increase in terms of total frame time of VBT vs VB. Measurements are taken at varying frame occupancy levels, dictating the portion of the rendered frame occupied by geometry.

In general, the gain in frame time performance of VBT becomes less noticeable at higher resolutions. In fact, the speed increase of rendering a fully occupied 4K frame is an order of magnitude less than that of a 720p frame at 0% occupancy. As resolution scales, therefore, VBT begins to be limited by the computational expense per fragment of its deferred pass.

5. DISCUSSION

To compare the practical application of any one rendering technique over another, each must be considered within the context of the job it aims to perform. This chapter will therefore consider the implementation of both VB and VBT within the context of video game development, as it remains one of the most widespread use cases for real-time graphics applications.

5.1 Analysis of Results

The results presented in Chapter 4 appear at face value to be far from conclusive in terms of their support for either of the two renderers. Whilst VBT vastly out-performed VB in terms of frame time in most test cases, this was achieved on a test system with very large bandwidth capabilities and a high ceiling in terms of hardware limitations. Therefore, in order to determine the impact of adding hardware tessellation to the visibility buffer on more bandwidth-limited platforms, further analysis is required.

5.1.1 Memory Usage

Section 4.1.1 contains the total working set sizes of the implemented pipelines at varying resolutions and triangle counts. While sets of these sizes do little to trouble the test system used, bandwidth-limited platforms such as mobile phones are likely to be more sensitive to total working set increases. Since the visibility buffer technique aims to minimise memory footprint for the benefit of these platforms, it is important to analyse the working set trade-offs that result from the support of hardware tessellation.

Consequently, it is noted that VBT is extremely resolution-dependent. Since the expense of this pipeline comes per-fragment in the form of a large visibility buffer, each incrementally higher resolution will add large amounts of data to the total working set, irrespective of final geometric quality. Therefore, with respect to total working set size, the situations in which VBT offers noticeable benefits are limited. Specifically, cases where high geometric detail is rendered at lower resolution appear to favour the use of hardware tessellation. However, in the context of a video game, this limits the flexibility of the software to perform consistently across multiple system configurations.

In contrast, VB not only scales well to high resolutions due to its compact size per sample, it may also sacrifice geometric detail to maintain low total working set sizes. This allows a game developer to make design trade-offs, potentially lowering geometric quality at high resolutions on memory-limited systems to maintain target performance. This allows the game to be flexible in an area which the developer has control over, as opposed to sacrificing performance due to screen resolution as would be required using VBT, which would likely limit the available player base. In general, it is determined that the per-fragment memory efficiency of VB results in more manageable working sets as rendering complexity increases.

This assessment is further supported by the results presented in Figure 9, where VBT is shown to consume a far greater amount of write bandwidth than VB in the forward pass. In writing a total of 20 bytes of colour data per sample, VBT is approaching a typical deferred g-buffer in terms of bandwidth consumption. While it is shown that this constitutes only 30% of available throughput on the test hardware, this is known to be a common bottleneck in mobile and integrated systems (Burns and Hunt, 2013). For this reason, this portion of VBT may be prohibitive to such platforms.
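To put the 20-byte figure in context, a rough calculation (an illustration using the per-sample sizes listed in Appendix 1, ignoring overdraw and the depth attachment) shows the forward-pass write volume that VBT implies at each test resolution:

#include <cstdio>

int main()
{
    // Per-sample forward-pass writes in VBT: 4-byte visibility ID plus 16-byte
    // tessellation coordinates, as listed in Appendix 1.
    const double bytesPerSample = 4.0 + 16.0;
    const struct { const char* name; double pixels; } resolutions[] = {
        {"720p", 921600.0}, {"1080p", 2073600.0}, {"1440p", 3686400.0}, {"4K", 8294400.0}};

    for (const auto& r : resolutions) {
        double mibPerFrame  = r.pixels * bytesPerSample / (1024.0 * 1024.0);
        double gibPerSecond = mibPerFrame * 60.0 / 1024.0;  // at a 60 fps target
        std::printf("%-5s %7.1f MiB/frame  %5.2f GiB/s at 60 fps\n",
                    r.name, mibPerFrame, gibPerSecond);
    }
    return 0;
}

Even as a lower bound, the resulting write rate at 4K is an order of magnitude beyond the 720p case, which illustrates why this cost scales so poorly towards bandwidth-limited hardware.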

The deferred subpass throughput measurements of the Level 1 and texture cache help to visualise the trade-off in working set between the two renderers. Despite the differences in their respective memory footprints, both renderers occupy comparable amounts of bandwidth through these units. In the case of VB, this highlights the bandwidth impact of loading a larger vertex buffer — which contains attribute information about the rendered geometry. Conversely, VBT calculates most vertex attributes through interpolation, and thus loads a smaller vertex buffer from memory. However, this comes at the cost of multiple samples of the large tessellation coordinates texture created in the forward pass.

The net result is a higher bandwidth cost in the deferred subpass when using VBT, but the margin is small. It is therefore likely that the renderers’ subpass structure levels the playing field in terms of texture fetches, as these are heavily optimised by the use of Vulkan Input Attachments. Furthermore, since modern graphics systems in general prefer compute-heavy workloads to bandwidth-intensive patterns (Burns and Hunt, 2013), it is unlikely that either renderer is particularly limited by its deferred subpass.
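For reference, the kind of subpass wiring this refers to can be sketched as follows (attachment indices and names are illustrative, not the application's actual render pass description):

#include <vulkan/vulkan.h>

// Illustrative sketch: the forward subpass writes the visibility buffer attachments,
// and the deferred subpass consumes them as input attachments, which allows the
// intermediate reads to stay on-chip on tiled hardware instead of full texture fetches.
void describeSubpasses(VkSubpassDescription (&subpasses)[2])
{
    static const VkAttachmentReference forwardColour[] = {
        {1, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL},   // visibility IDs
        {2, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL},   // tessellation coordinates
    };
    static const VkAttachmentReference deferredInputs[] = {
        {1, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL},
        {2, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL},
    };
    static const VkAttachmentReference swapchainWrite =
        {0, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL};

    subpasses[0] = {};
    subpasses[0].pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS;
    subpasses[0].colorAttachmentCount = 2;
    subpasses[0].pColorAttachments    = forwardColour;

    subpasses[1] = {};
    subpasses[1].pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS;
    subpasses[1].inputAttachmentCount = 2;
    subpasses[1].pInputAttachments    = deferredInputs;
    subpasses[1].colorAttachmentCount = 1;
    subpasses[1].pColorAttachments    = &swapchainWrite;
}

In the deferred fragment shader, such attachments are then read with subpassLoad rather than a sampled texture fetch.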

Indeed, in studying the GPU pass times presented in Section 4.3.1, it is observed that VBT’s deferred pass never takes longer than its preceding forward pass, and remains performant as both resolution and triangle count increase. Similarly, while the deferred pass in VB occasionally matches or exceeds the forward pass in execution time, it is the forward pass which quickly drops off in performance as geometric complexity increases. For VB, this suggests a bottleneck in the geometry processing phase. Therefore, its performance could be improved significantly by implementing triangle culling techniques in a preceding subpass to limit the amount of geometry which the renderer must process (Engel, 2016).

The decrease in performance of VB at higher triangle counts can be further explained by the total VRAM bandwidth usage presented in Figure 11. As triangle count rises, the bandwidth consumption of VBT for both read and write decreases, whereas VB experiences a sharp rise in read throughput at higher detail levels. This is due to the necessity of not only processing geometry once during the visibility phase, but a second time during shading when loading vertex attributes. Therefore, higher triangle counts, combined with the lack of geometry culling, compound the performance deficit between the two renderers. On bandwidth-limited platforms, it would be critical to control geometry detail to achieve acceptable performance with VB.

By contrast, VBT suffers smaller penalties for increasing final detail levels, as it relies on its use of hardware tessellation to generate new triangles on the fly. The situations in which VBT is outperformed by VB in terms of VRAM throughput are observed at lower detail levels. In these cases, the cost incurred by VB to transfer all geometry from host memory is not high enough to warrant the extra bandwidth consumed by VBT’s tessellation coordinates texture.


A further indicator of memory efficiency is cache hit rate. A larger hit rate demonstrates a high level of coherency in memory accesses, allowing the GPU to move the data that is likely to be imminently required to high-bandwidth memory. VB exhibits a consistently high cache hit rate of around 90%, with only 720p resolution resulting in a drop-off — it is likely that this smaller render target does not fully utilise the large cache hierarchy available on the testing hardware. Conversely, VBT suffers relatively poor hit rates, suggesting that memory accesses for the tessellation coordinate buffer are not very coherent. It is possible that the texture format is too large (16 bytes per sample) to be effectively loaded into cache. The fact that VBT suffers yet poorer hit rates as resolution increases seems to confirm this. In this case, it may be that splitting the tessellation coordinates by component into their own, smaller render targets – as suggested in Section 3.3.1 – would allow VBT to more effectively utilise cache hierarchies. In the case of bandwidth-limited systems, leveraging available cache volumes is critical for efficient performance.
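As a purely illustrative example of the sort of per-sample reduction this points towards (this is not the layout proposed in Section 3.3.1, and the precision chosen here is an assumption), two barycentric components could be quantised and packed, with the third reconstructed on unpack:

#include <cstdint>
#include <cmath>

// Hypothetical packing: because barycentric components sum to one, only two need
// to be stored, and 16-bit normalised values may be sufficient for interpolation.
uint32_t packDomainCoord(float u, float v)
{
    auto toUnorm16 = [](float x) {
        return static_cast<uint32_t>(
            std::round(std::fmin(std::fmax(x, 0.0f), 1.0f) * 65535.0f));
    };
    return (toUnorm16(u) << 16) | toUnorm16(v);   // 4 bytes per coordinate pair
}

void unpackDomainCoord(uint32_t packed, float& u, float& v, float& w)
{
    u = float(packed >> 16) / 65535.0f;
    v = float(packed & 0xFFFFu) / 65535.0f;
    w = 1.0f - u - v;                             // third component reconstructed
}

Whether the precision loss from such quantisation is acceptable would depend on the attribute fidelity required, which is the same trade-off noted in Section 5.2.2.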

In summary, it is clear that upon further examination the memory efficiency of VBT in its current form would likely prove prohibitive to bandwidth-limited platforms. Meanwhile, VB suffers heavily in terms of read bandwidth usage (especially at higher detail levels) due to transferring all geometry to video memory per-frame, although this is a factor which can be effectively optimised through culling techniques (Engel, 2016; 2018). By requiring a far larger visibility buffer to retain generated triangles between passes, VBT consumes large amounts of bandwidth at performance-critical stages of the pipeline, without much opportunity for optimisation. VBT would, therefore, likely face performance bottlenecks similar to those of deferred shading on systems with low bandwidth capabilities.

5.1.2 Processor Usage

The results presented in Section 4.2 visualise the consumption of available compute resources by each renderer. They demonstrate how VBT consistently requires a larger processing workload per resolution. This is because VBT must recompute the tessellation domain shader for 3 vertices per fragment to calculate the attributes of generated primitives — a substantial processing cost in addition to the VB deferred stage. Despite this, the streaming-multiprocessor (SM) throughput values show a steady correlation between increase in resolution and compute cost, as opposed to the exponential increase observed with the VB forward pass and triangle count. This further indicates that while the deferred passes of both renderers require substantial calculations, they do not represent the primary limitations of the techniques.
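Expressed on the CPU purely for illustration (the real work happens in VBT's deferred fragment shader, and the attribute set and data flow here are simplified assumptions), the extra per-fragment cost amounts to evaluating the domain function once for each of the generated triangle's three vertices and then interpolating the results with the fragment's own barycentrics:

#include <array>

struct Vertex { float position[3]; float uv[2]; };  // simplified attribute set

// Domain evaluation for one generated vertex: interpolate the patch's control
// points at domain coordinate (u, v, w). A real shader would also apply displacement.
Vertex evaluateDomain(const std::array<Vertex, 3>& patch, float u, float v, float w)
{
    Vertex out{};
    for (int i = 0; i < 3; ++i)
        out.position[i] = u * patch[0].position[i] + v * patch[1].position[i] + w * patch[2].position[i];
    for (int i = 0; i < 2; ++i)
        out.uv[i] = u * patch[0].uv[i] + v * patch[1].uv[i] + w * patch[2].uv[i];
    return out;
}

// Per-fragment work implied by VBT's deferred pass: three domain evaluations,
// then interpolation using the fragment's barycentrics within the generated triangle.
Vertex shadeFragmentAttributes(const std::array<Vertex, 3>& patch,
                               const float domainCoords[3][3],  // stored tessellation coordinates
                               float b0, float b1, float b2)    // fragment barycentrics
{
    Vertex v0 = evaluateDomain(patch, domainCoords[0][0], domainCoords[0][1], domainCoords[0][2]);
    Vertex v1 = evaluateDomain(patch, domainCoords[1][0], domainCoords[1][1], domainCoords[1][2]);
    Vertex v2 = evaluateDomain(patch, domainCoords[2][0], domainCoords[2][1], domainCoords[2][2]);

    Vertex out{};
    for (int i = 0; i < 3; ++i)
        out.position[i] = b0 * v0.position[i] + b1 * v1.position[i] + b2 * v2.position[i];
    for (int i = 0; i < 2; ++i)
        out.uv[i] = b0 * v0.uv[i] + b1 * v1.uv[i] + b2 * v2.uv[i];
    return out;
}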

It is important to note, however, that while this application is tasked only with rendering and shading geometry, modern video games will often leverage the GPU for other purposes besides just graphics. Due to the expense of simulating real-world physics and artificial intelligence (AI), games are beginning to apply the general-purpose compute power of GPUs to logical as well as graphical operations (Blewitt, Ushaw and Morgan, 2013). This means that while a compute-heavy graphics workload — such as that of VB and VBT — may not pose a problem within the scope of this work, the occupancy of the SM units may become substantial when considering other per-frame processes that must be carried out on the GPU. Therefore, real-world applications with demanding GPU workloads may find the respective compute cost of VB and VBT a more sizeable hurdle than these results necessarily suggest.

5.1.3 Net Performance

In real-time graphics applications such as video games, the key performance factor is how quickly a frame can be rendered to the screen. Alongside gameplay logic, AI computations, and physics simulations, a limited budget of target frame time is left over for graphics rendering. This makes any small saving in execution time bought by optimisation very valuable to a developer. As such, the dramatic improvement in overall frame time observed when using VBT is an indicator of its suitability to demanding real-time systems running on capable hardware.

However, the margin of this improvement is heavily dependent on test case, as shown in Figure 15. At low resolutions, it is observed that VBT takes a fraction of the time to execute its forward pass compared with VB. Despite this, cases are also shown in which VBT may start to underperform due to its extra write bandwidth usage. Specifically, in cases where moderate geometric detail is displayed at high resolution, the forward pass execution times of the two renderers are essentially equal. In these cases, appropriate geometry culling techniques could potentially cause VB to outperform VBT. This is likely to be compounded on lower-end platforms, as VBT’s far larger visibility buffer would begin to encounter bandwidth limitations even at lower resolutions.

As a result, the choice of one renderer over the other for implementation into a production graphics application would depend entirely on use case and target platform. This is further demonstrated by observing the difference in performance at varying frame occupancies. Since both pipelines will only shade a fragment which is written to in the visibility buffer, each system is sensitive to the amount of geometry visible on screen. Indeed, since VBT requires far more calculations per fragment in the deferred pass, its perceived advantages over VB are vastly more pronounced when less geometry is visible on screen. When considering practical use cases, this may suggest that VBT would be preferable in the case of rendering high-detail, outdoor scenes with large portions of the frame occupied by sky (as demonstrated by this implementation). Meanwhile, an enclosed, indoor scene requiring less geometric complexity would likely favour the use of VB with appropriate geometry culling, especially on bandwidth-limited platforms.

5.2 Design Considerations and Future Work

Due to both the timescale available for this work and the attempt to isolate specific performance factors of each renderer, design decisions were taken which are unlikely to reflect those of a production application utilising these techniques. This section will outline the performance implications of these choices, as well as the areas of potential improvement in the implementation of the application.

5.2.1 Triangle Culling/Filtering

It has been shown that while both the memory and compute efficiency of VBT is strongly linked to screen resolution (and as such the opportunities for optimising the workload are somewhat limited), VB is limited by the amount of geometry it must process. Methods to alleviate this have been addressed thoroughly in previous research surrounding the visibility buffer, with Schied and Dachsbacher (2015) and Engel (2016) both implementing a depth pre-pass to fill a filtered geometry buffer containing only triangles visible in the current frame. Engel (2016) then performs a further compute pass to cull back-facing and invalid triangles to further reduce the load on the eventual visibility buffer write pass. While such aggressive geometry filtering would probably be overkill for VBT (since such a small amount of geometry is stored in host memory), VB’s overall bandwidth usage and subsequent performance is likely to be significantly improved by such techniques.
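As an indication of the kind of per-triangle tests such a filtering pass applies (a hedged sketch in the spirit of Engel, 2016, not a reproduction of that implementation; frustum tests and near-plane clipping are omitted), each triangle can be rejected if it is back-facing, has zero area, or cannot cover a pixel centre:

#include <cmath>

struct Vec4 { float x, y, z, w; };

// Positions are assumed to be in clip space with positive w, and counter-clockwise
// winding is assumed to be front-facing.
bool triangleVisible(const Vec4& a, const Vec4& b, const Vec4& c,
                     float viewportWidth, float viewportHeight)
{
    // Back-face / zero-area test: signed area of the projected triangle.
    float det = (b.x / b.w - a.x / a.w) * (c.y / c.w - a.y / a.w)
              - (c.x / c.w - a.x / a.w) * (b.y / b.w - a.y / a.w);
    if (det <= 0.0f) return false;

    // Small-triangle test: if the screen-space bounding box never crosses a pixel
    // centre in either axis, the triangle cannot produce any fragments.
    auto toPixelsX = [&](const Vec4& p) { return (p.x / p.w * 0.5f + 0.5f) * viewportWidth; };
    auto toPixelsY = [&](const Vec4& p) { return (p.y / p.w * 0.5f + 0.5f) * viewportHeight; };
    float minX = std::fmin(toPixelsX(a), std::fmin(toPixelsX(b), toPixelsX(c)));
    float maxX = std::fmax(toPixelsX(a), std::fmax(toPixelsX(b), toPixelsX(c)));
    float minY = std::fmin(toPixelsY(a), std::fmin(toPixelsY(b), toPixelsY(c)));
    float maxY = std::fmax(toPixelsY(a), std::fmax(toPixelsY(b), toPixelsY(c)));
    if (std::round(minX) == std::round(maxX) || std::round(minY) == std::round(maxY))
        return false;

    return true;
}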

5.2.2 Support of Hardware Tessellation

The approach taken in developing VBT to support tessellated geometry in a visibility buffer renderer closely follows that suggested by Burns and Hunt (2013). This method relies solely on a g-buffer-like collection of render targets to retain data between passes, and thereby most closely resembles the implementation of the standard visibility buffer. This approach best represents the singular addition of support for hardware tessellation, whilst still maintaining as much of the design pattern of VB as possible.

However, whilst Burns and Hunt (2013) suggest that tessellation coordinates may be stored in an additional 4 bytes per sample, the solutions explored in the development of VBT found no practical way in which this was possible. VBT in its current form requires a total of 16 bytes to store this data, although this could be cut to 9 bytes by following the solution suggested in Section 3.3.1. Consequently, VBT would likely benefit from an alternative method of retaining generated primitives.


In the study conducted by Schied and Dachsbacher (2015), hardware tessellation is supported by additionally storing all visible triangles to a new vertex buffer from the forward pass — using the visibility buffer image to store each triangle’s address in the buffer. This means that generated tessellation primitives may also be stored, negating the need to retain the tessellation coordinates and recompute domain shading in the deferred pass. This method, however, would substantially increase bandwidth usage, and therefore was not considered in the implementation of VBT. Despite this, there are a number of reasons why taking this approach may be preferable in a more developed rendering system than has here been presented.

Firstly, the creation of a new, host-accessible vertex buffer containing only visible geometry allows the system to filter out invisible or invalid triangles in an intermediate compute pass, such as that presented by Engel (2018). Amongst other benefits, this would enable the system to exclude generated primitives which are too small in screen-space to be effectively rasterized, thus eliminating the texturing artefacts presented in Figure 7. Secondly, this method eliminates the need for storing barycentric coordinates of generated primitives in the first pass. In doing so, there is no precision lost in the packing and unpacking of tessellation coordinates, thereby making interpolated attributes as accurate as possible. Following the completion of VBT’s development, it is therefore clear that a solution more closely modelled on those of DAIS (Schied and Dachsbacher, 2015) and Engel (2016, 2018) would likely allow for greater optimisation and visual stability in a production-scale application.
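A greatly simplified, CPU-side analogue of that idea is sketched below (names and layout are hypothetical, and Schied and Dachsbacher additionally memoise triangles so that each is stored only once): visible triangles are appended to a compact buffer, and the returned address is what the visibility image stores.

#include <atomic>
#include <cstdint>
#include <vector>

struct PackedVertex   { float position[3]; float uv[2]; };
struct VisibleTriangle { PackedVertex v[3]; };

// In a shader this would be an atomicAdd on a counter buffer; the buffer itself
// would be sized for the worst-case number of visible triangles.
std::atomic<uint32_t> triangleCounter{0};

uint32_t storeVisibleTriangle(std::vector<VisibleTriangle>& visibleTriangles,
                              const VisibleTriangle& tri)
{
    uint32_t address = triangleCounter.fetch_add(1);
    visibleTriangles[address] = tri;
    return address;   // this value is written into the visibility buffer image
}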

5.2.3 Compute-based Tessellation

Whilst hardware tessellation offers a standardised, hardware-accelerated interface for refining meshes procedurally on the GPU, it allows only limited levels of subdivision, and incurs large memory costs at higher subdivision levels (Advanced Micro Devices, 2013). A recent study conducted by Dupuy, Khoury and Riccio (2019) reported success in leveraging the processing power of modern GPUs to run custom subdivision algorithms in compute shaders, refining geometry manually. This approach allows the developer to define a target screen-space size for generated triangles up to an arbitrary subdivision level, at constant memory cost. This both prevents saturation of the rasteriser and dynamically details geometry depending on its distance from the camera.
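The level-selection logic behind such an approach can be sketched as follows (an illustration of the general idea only, not the algorithm of Dupuy, Khoury and Riccio): an edge is subdivided until its projected length falls below the chosen target pixel size, clamped to whatever maximum depth the implementation supports.

#include <algorithm>
#include <cmath>
#include <cstdint>

// Each halving of an edge roughly halves its projected length, so the required
// number of subdivision steps grows with the log of the length-to-target ratio.
uint32_t subdivisionLevel(float edgeLengthPixels, float targetPixels, uint32_t maxLevel)
{
    if (edgeLengthPixels <= targetPixels) return 0;
    float level = std::ceil(std::log2(edgeLengthPixels / targetPixels));
    return std::min(static_cast<uint32_t>(level), maxLevel);
}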

In the context of this paper, a similar approach could be taken to both limit the memory usage of the subdivision stage and ensure efficient writes to the visibility buffer. If employing an approach such as that discussed in Section 5.2.2, this method might also be used to populate a triangle visibility buffer directly from the compute shader, without the need to go through the graphics pipeline.


5.2.4 Turing Mesh Shaders

In a traditional graphics pipeline, the hardware’s primitive distributor performs fixed-function vertex deduplication every frame, even if the topologies rendered do not change. This comes with a high bandwidth requirement and is wasteful for large, static topologies such as the terrain mesh rendered by VB. Additionally, vertex and attribute fetches are performed even for primitives which are not visible, due to either being outside the view frustum, back-facing, or too small to rasterize. Consequently — as observed in the results of this research — this vertex processing stage creates a considerable bottleneck when triangle-heavy meshes are rendered without geometry culling.

Mesh Shaders (NVIDIA, 2018a) were introduced as part of the Turing graphics architecture (NVIDIA, 2018b), offering an alternative geometry pipeline for the rendering of highly-detailed meshes. By decomposing large topologies into smaller meshlets and re-using these across many frames, the impact of fixed-function stages on geometry processing is reduced, thus providing sizeable bandwidth reduction and high scalability.
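A hedged sketch of what such an offline-built meshlet layout might look like is given below; the per-meshlet limits follow the commonly cited Turing recommendations, and the exact fields are assumptions rather than NVIDIA's specification.

#include <cstdint>
#include <vector>

struct Meshlet {
    uint32_t vertexOffset;     // start into a shared, deduplicated vertex-index list
    uint32_t vertexCount;      // commonly limited to 64 unique vertices per meshlet
    uint32_t primitiveOffset;  // start into a packed local triangle-index list
    uint32_t primitiveCount;   // commonly limited to 126 triangles per meshlet
    float    boundsCenter[3];  // bounding sphere used for per-meshlet culling
    float    boundsRadius;
};

struct MeshletMesh {
    std::vector<uint32_t> vertexIndices;    // indices into the full vertex buffer
    std::vector<uint8_t>  primitiveIndices; // 3 bytes per triangle, local to the meshlet
    std::vector<Meshlet>  meshlets;         // built once, reused every frame
};

Because the meshlets and their bounds are built once and reused, the per-frame deduplication and fetch cost described above is paid at build time rather than at draw time.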

Therefore, employing Mesh Shaders and appropriate geometry culling in the forward pass of a visibility buffer renderer such as VB is likely to vastly improve its performance (Engel, 2019b), and possibly even negate the appeal of hardware tessellation entirely. VB’s sensitivity to rendering large, static topologies indicates the potential of Mesh Shaders for further minimising the memory footprint of deferred renderers, even for extremely detailed scenes.

6. CONCLUSION

A real-time graphics application was here developed to evaluate the memory efficiency and subsequent performance impact of supporting hardware tessellation within a visibility buffer renderer. Two distinct renderers were implemented: the first, a standard visibility buffer pipeline (VB); the second, a novel solution for the additional support of hardware tessellation in the visibility buffer (VBT). The results show that for bandwidth-limited platforms, the associated memory costs of supporting tessellation would likely be impractical. While VBT showed dramatically improved frame-time performance in some test cases, this was found to be largely due to the lack of geometry culling techniques employed in VB, and came at the expense of substantial on-chip bandwidth consumption. This expense undermines the principal goal of the visibility buffer technique: minimising the memory footprint incurred by separating geometry processing from shading.


The test hardware presented a forgiving test environment for these renderers, possessing sizeable bandwidth capabilities, cache hierarchies, and processing power. Although VBT clearly benefitted from increasing the workload in these areas, the results suggest a trend that would see it encounter performance bottlenecks far sooner than VB on lower-end platforms. The associated memory costs of VBT therefore signify its dependency on high-performance hardware – and the solution must be considered amongst other techniques in this bracket when determining its worth.

Therefore, in answer to the proposed research question, this research concludes that while hardware tessellation may offer an attractive performance improvement to specific use cases such as terrain rendering, its suitability to memory-conscious deferred rendering systems is questionable. Certainly, the approach taken in the development of VBT was substandard for this purpose, and was found to inhibit the memory efficiency of the visibility buffer system upon which it was based. Optimisations have been discussed which may reduce the memory cost of VBT, but as the general-purpose compute power of GPUs continues to soar, it must be considered whether fixed-function pipeline stages such as tessellation remain relevant. Graphics pipelines are being continuously optimised to deal efficiently with huge triangle counts, through techniques such as Mesh Shaders (NVIDIA, 2018a) and compute-based geometry filtering, and it is possible that the processing of geometric workloads such as those presented in this paper will soon become trivial.

Nevertheless, the increasing demand for high-fidelity mobile games (Newzoo, 2017) will perpetuate the need for memory-efficient rendering algorithms such as the visibility buffer — and real-time graphics research should maintain a focus on developing methods to minimise the bandwidth requirements associated with rendering detailed 3D scenes. In doing so, games and other real-time applications may continue to push the limits of immersion and visual quality on platforms of all sizes and capabilities.


7. ACKNOWLEDGEMENTS

I would like to sincerely thank Dr Paul Robertson for providing the initial inspiration for this project, as well as constant motivation and support to help to see it through. It is hugely appreciated!

I would also like to extend my gratitude to Aurelio Reis, Seth Schneider, Jeff Kiel and An Yan at NVIDIA for helping to make the testing stage of this project possible, and going out of their way to help me get Nsight Graphics working with Vulkan.

Huge thanks also go to Rys Sommefeldt at AMD, who not only provided me a graphics card to use in the testing of this project, but also offered invaluable advice and guidance about all things graphics.

Finally, I would like to thank Wolfgang Engel for taking the time to speak with me about the visibility buffer, hardware tessellation, and graphics programming in general, and for providing me with an article from the unpublished GPU to help facilitate this research.


8. LIST OF REFERENCES

Akenine-Moller, T., Haines, E. and Hoffman, N. (2018) Real-time rendering. AK Peters/CRC Press.

Advanced Micro Devices (2013) GCN Performance Tweets. Available at: http://developer.amd.com/wordpress/media/2013/05/GCNPerformanceTweets.pdf (Accessed: 22nd April 2019).

Advanced Micro Devices (2015) High Bandwidth Memory. [Hardware Technology] Available at: https://www.amd.com/en/technologies/hbm (Accessed: 12th February 2019).

Blewitt, W., Ushaw, G., Morgan, G. (2013) Applicability of GPGPU Computing to Real-Time AI Solutions in Games, IEEE Transactions on Computational Intelligence and AI in Games, 5(3), pp. 265-275. DOI: 10.1109/TCIAIG.2013.2258156.

Liang, B.-S., Yeh, W.-C., Lee, Y.-C. and Jen, C.-W. (2000) Deferred lighting: a computation-efficient approach for real-time 3-D graphics.

Burns, C.A. and Hunt, W.A. (2013) The visibility buffer: a cache-friendly approach to deferred shading, Journal of Computer Graphics Techniques (JCGT), 2(2), pp. 55-69.

Dear ImGui (2019) [Software API] Available at: https://github.com/ocornut/imgui (Accessed: 20th January 2019).

Dupuy, J., Khoury, J. and Riccio, C. (2019). ‘Adaptive GPU Tessellation with Compute Shaders’, in Engel, W. (ed.) GPU Zen 2. Black Cat Publishing.

Engel, W. (2016). The filtered and culled Visibility Buffer. Available at: http://www.confettispecialfx.com/gdce-2016-the-filtered-and-culled-visibility-buffer-2/ (Accessed: 27th September 2018).

Engel, W. (2018). Diary of a Graphics Programmer – Triangle Visibility Buffer. Available at: https://diaryofagraphicsprogrammer.blogspot.com/2018/03/triangle-visibility-buffer.html (Accessed: 17th November 2018).

Engel, W. (2019a). Skype conversation with Wolfgang Engel, 22nd February.

Engel, W. (2019b). 6th March. Available at: https://twitter.com/wolfgangengel/status/1103368462489931776 (Accessed: 21st April 2019).

GLFW (2019) [Software API] Available at: https://www.glfw.org/ (Accessed: 8th October 2018).


Microsoft Corporation (2018). Direct3D 11 Features. Available at: https://docs.microsoft.com/en-us/windows/desktop/direct3d11/direct3d-11- features#tessellation (Accessed: 28th April 2019).

Newzoo (2017) High Fidelity Mobile Gaming Is on the Rise, Putting Pressure on GPUs. Available at: https://newzoo.com/insights/articles/high-fidelity-mobile-gaming-become- next-gpu-battle-ground/ (Accessed: 17th April 2019).

Niessner, M., Keinert, B., Fisher, M., Stamminger, M., Loop, C. and Schafer, H. (2016) 'Real- Time Rendering Techniques with Hardware Tessellation', Computer Graphics Forum, 35(1), pp. 113-137. DOI: 10.1111/cgf.12714.

NVIDIA (2018a) Introduction to Turing Mesh Shaders. Available at: https://devblogs.nvidia.com/introduction-turing-mesh-shaders/ (Accessed: 14th April 2019).

NVIDIA (2018b) NVIDIA Turing [Graphics Hardware Architecture]. Available at: https://www.nvidia.com/en-gb/geforce/turing/ (Accessed: 27th April 2019).

NVIDIA (2019) Nsight Graphics [Compute Software]. Available at: https://developer.nvidia.com/nsight-graphics (Accessed: 21st March 2019).

O’Conor, K. (2017) GPU Performance for Game Artists. Available at: http://fragmentbuffer.com/gpu-performance-for-game-artists/ (Accessed: 14th February 2019).

OpenGL Mathematics (2019) [Software API] Available at: https://glm.g- truc.net/0.9.9/index.html (Accessed: 8th October 2019).

Rockstar Games (2012) Max Payne 3 [Video game]. Rockstar Games.

Schied, C. and Dachsbacher, C. (2015) Deferred attribute interpolation for memory-efficient deferred shading. ACM, pp. 43.

Statcounter (2019) Desktop Screen Resolution Stats Worldwide. Available at: http://gs.statcounter.com/screen-resolution-stats/desktop/worldwide (Accessed: 17th April 2019).

STB Image (2019) [Software API] Available at: https://github.com/nothings/stb (Accessed: 16th October 2018).

The Khronos Group (2019) The Vulkan API. Available at: https://www.khronos.org/vulkan/ (Accessed: 5th October 2018).

Vulkan Memory Allocator (2019) [Software API] Available at: https://github.com/GPUOpen- LibrariesAndSDKs/VulkanMemoryAllocator (Accessed: 14th October 2018).


9. APPENDICES

Appendix 1

Presented here is the breakdown of the total working set of each renderer per resolution and per triangle count, as presented in Table 3. The working set was determined to include the size of geometry stored in host memory as well as the frame buffer attachments used to form the renderer’s visibility buffer. The tables below outline the individual sample/instance size of each resource, along with the number of instances required for each resolution and triangle count. Working set size was then determined by summing, for each resource, the product of its per-instance size and its instance count.

Geometry stored in host memory per triangle count:

Tri Count    No. Vertices (VB)    No. Indices (VB)    No. Vertices (VBT)    No. Indices (VBT)
250k         123,201              735,000             196                   1,014
580k         293,764              1,756,086           196                   1,014
1000k        512,656              3,067,350           196                   1,014
2000k        1,008,016            6,036,054           196                   1,014

Visibility buffer samples per resolution:

Resolution    No. Pixels
720p          921,600
1080p         2,073,600
1440p         3,686,400
4K            8,294,400

Size per sample/instance of each resource:

Resource       Size (bytes)
Vertex         32
Index          4
Visibility     4
Tess Coords    16
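Applying that calculation to one configuration from the tables (a small worked example; only the listed resources are counted, so any further attachments such as depth are excluded):

#include <cstdint>
#include <cstdio>

int main()
{
    // VB at 1000k triangles and 1080p, using the counts and per-instance sizes above.
    const uint64_t vertexCount = 512656,  vertexSize     = 32;  // bytes
    const uint64_t indexCount  = 3067350, indexSize      = 4;
    const uint64_t pixelCount  = 2073600, visibilitySize = 4;

    uint64_t workingSetBytes = vertexCount * vertexSize
                             + indexCount  * indexSize
                             + pixelCount  * visibilitySize;

    std::printf("VB working set (1000k tris, 1080p): %.1f MiB\n",
                workingSetBytes / (1024.0 * 1024.0));
    return 0;
}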
