Master of Science in Engineering: Game and Software Engineering June 2021

A Performance Comparison of Dynamic- and Inline Ray Tracing in DXR: An application in soft shadows

Joakim Sjöberg Filip Zachrisson

Faculty of Computing, Blekinge Institute of Technology, 371 79 Karlskrona, Sweden

This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfillment of the requirements for the degree of Master of Science in Engineering: Game and Software Engineering. The thesis is equivalent to 20 weeks of full-time studies.

The authors declare that they are the sole authors of this thesis and that they have not used any sources other than those listed in the bibliography and identified as references. They further declare that they have not submitted this thesis at any other institution to obtain a degree.

Contact Information: Author(s): Joakim Sjöberg E-mail: [email protected]

Filip Zachrisson E-mail: fi[email protected]

University advisor: Associate Professor, Veronica Sundstedt Department of Computer Science

External advisor at AMD: SMTS Software Development Engineer, Stefan Petersson

Faculty of Computing, Blekinge Institute of Technology, SE–371 79 Karlskrona, Sweden
Internet: www.bth.se
Phone: +46 455 38 50 00
Fax: +46 455 38 50 57

Abstract

Background. Ray tracing is a tool that can be used to increase the quality of the graphics in games. One application in graphics that ray tracing excels in is generating shadows, because ray tracing can simulate how shadows are generated in real life more accurately than rasterization techniques can. With the release of GPUs with hardware support for ray tracing, it can now be used in real-time graphics applications to some extent. However, it is still a computationally heavy task requiring performance improvements.

Objectives. This thesis will evaluate the difference in performance of three ray-tracing methods in DXR Tier 1.1, namely dynamic ray tracing and two forms of inline ray tracing. To further investigate the ray-tracing performance, soft shadows will be implemented to see if the driver can perform optimizations differently (depending on the choice of ray-tracing method) on the subsequent and/or preceding API interactions. With the pipelines implemented, benchmarks will be performed using different GPUs, scenes, and a varying number of shadow-casting lights.

Methods. The scientific method is based on an experimental approach, using both implementation and performance tests. The experimental approach will begin by extending an in-house DirectX 12 renderer. The extension includes ray-tracing functionality, so that hard shadows can be generated using both dynamic ray tracing and the inline forms of ray tracing. Afterwards, soft shadows are generated by implementing a state-of-the-art denoiser with some modifications, which will be added to each ray-tracing method. Finally, the renderer is used to perform benchmarks of various scenes with varying numbers of shadow-casting lights and object complexity to cover a broad range of scenarios that could occur in a game and/or in other similar applications.

Results and Conclusions. The results gathered in this experiment suggest that under the experimental conditions of the chosen scenes, objects, and number of lights, AMD's GPUs were faster when using dynamic ray tracing than when using inline ray tracing, whilst Nvidia's GPUs were faster when using inline ray tracing than when using dynamic ray tracing. Also, with an increasing number of shadow-casting lights, the choice of ray-tracing method had little to no impact except for linearly increasing the execution time in each test. Finally, adding soft shadows (subsequent and preceding API interactions) also had little to no relative impact on the results depending on the different ray-tracing methods.

Keywords: dynamic ray tracing, inline ray tracing, hard shadows, soft shadows, rendering


Sammanfattning (Summary)

Background. Ray tracing is a tool that can be used to increase the quality of the graphics in games. One application in graphics where ray tracing excels is the creation of shadows, since ray tracing can more easily simulate how shadows arise in reality, which earlier rasterization techniques were not able to do. With new hardware that has ray-tracing support built into the graphics cards, it is now possible to use ray tracing in real-time applications within certain limits. Heavy computations still have to be carried out, which is why improvements are needed.

Purpose. This thesis will evaluate the differences in performance between three ray-tracing methods in DXR Tier 1.1, namely dynamic ray tracing and two forms of inline ray tracing. To give a broader investigation of the performance of the ray-tracing methods, soft shadows will be implemented to see whether the driver can perform different optimizations (depending on the choice of ray-tracing method) on the subsequent and/or preceding API calls. Once these pipelines are implemented, performance tests will be carried out with different graphics cards, scenes, and numbers of shadow-casting lights.

Method. The scientific method is based on an experimental approach that contains both an experiment and several performance tests. The experimental approach begins by extending an in-house DirectX 12 renderer. The extension adds functionality for handling ray tracing so that hard shadows can be generated with both dynamic ray tracing and the different forms of inline ray tracing. After that, soft shadows are created by implementing a well-established denoising technique with some modifications, which is added to each ray-tracing stage. Finally, performance tests are measured with different graphics cards, different numbers of lights, and different scenes to cover different scenarios that could arise in a game and/or in other similar applications.

Results and Conclusions. The results from the tests in this experiment show that, under these conditions, AMD's graphics cards are faster with dynamic ray tracing than with inline ray tracing, while Nvidia's graphics cards are faster with inline ray tracing than with the dynamic variant. Increasing the number of shadow-casting lights showed little to no change apart from a linear increase of the execution time in most of the tests. Finally, the addition of soft shadows (subsequent and preceding API interactions) also turned out to have little to no influence on the choice of ray-tracing method.

Keywords: dynamic ray tracing, inline ray tracing, hard shadows, soft shadows, rendering


Acknowledgments

We wish to express our appreciation to our primary supervisor, Veronica Sundstedt, for the invaluable feedback, the fast replies, and the general guidance throughout this research. We would also like to sincerely thank Stefan Petersson for pitching the original research idea and for all the time and effort he has spent supporting us in many different ways throughout this study.


Nomenclature

AMD Advanced Micro Devices

API Application Programming Interface

AS Acceleration Structure

BLAS Bottom-Level Acceleration Structure

DRT Dynamic Ray Tracing

DXR DirectX Ray Tracing

EA Electronic Arts

GPU Graphics Processing Unit

IRT Inline Ray Tracing

IRTC Inline Ray Tracing using the Compute Shader

IRTP Inline Ray Tracing using the Pixel Shader

RPP Rays Per Pixel

TLAS Top-Level Acceleration Structure


Contents

Abstract

Sammanfattning

Acknowledgments

1 Introduction
  1.1 Motivation
  1.2 Aim
    1.2.1 Research Questions
    1.2.2 Objectives
  1.3 Scope and Delimitations
    1.3.1 Static Scenes
    1.3.2 Static Camera
    1.3.3 API
  1.4 Outline

2 Background
  2.1 Ray-Tracing Essentials
  2.2 DirectX Ray Tracing
    2.2.1 Ray-Tracing Pipeline
    2.2.2 Acceleration Structures
    2.2.3 Dynamic Ray Tracing
    2.2.4 Shader Tables
    2.2.5 Inline Ray Tracing
  2.3 Hard Shadows
  2.4 Soft Shadows

3 Related Work
  3.1 Inline Ray Tracing
  3.2 Ray Tracing in Games
  3.3 Soft Shadows using DXR

4 Method
  4.1 Soft-Shadow Selection
  4.2 Implementation
    4.2.1 Depth Pre-Pass
    4.2.2 Geometry Buffer
    4.2.3 Ray Tracing
    4.2.4 Spatial Accumulation
    4.2.5 Temporal Accumulation
    4.2.6 Final Blur
    4.2.7 Deferred Shading
    4.2.8 Temporal Anti-Aliasing
  4.3 Benchmarking
    4.3.1 System under Test
    4.3.2 Scenes
    4.3.3 Mean Squared Error
    4.3.4 Performance Measurement

5 Results and Analysis
  5.1 Performance Results
    5.1.1 Sponza
    5.1.2 Dragon
    5.1.3 Sponza and four Stanford Dragons
  5.2 Analysis
    5.2.1 Ray-Tracing Performance
    5.2.2 Whole Pipeline Performance
    5.2.3 Mean Squared Error Results

6 Discussion
  6.1 Ray Tracing in DXR
  6.2 Validity Threats
    6.2.1 Identical Visibility Buffers
    6.2.2 Method
    6.2.3 Spikes in the Tests
    6.2.4 Scene Setup
  6.3 Similarities with Related Benchmarks
  6.4 Contribution and Recommendations

7 Conclusions and Future Work
  7.1 Conclusions
    7.1.1 Scene and Object Complexity (RQ1)
    7.1.2 Graphic Processing Units (RQ2)
  7.2 Future Work
    7.2.1 Benchmarking
    7.2.2 Ray-Tracing Complexity
    7.2.3 Pipeline Complexity
    7.2.4 Vulkan
    7.2.5 Cast Fewer Rays
  7.3 Final Words

A Detailed Results
  A.1 Sponza
  A.2 Dragon
  A.3 Sponza4Dragons

B Code Snippets
  B.1 Random number generation
  B.2 Ray-Generation Shader
  B.3 IRT Function Part1
  B.4 IRT Function Part2
  B.5 IRT using the Compute Shader
  B.6 IRT using the Pixel Shader


List of Figures

2.1 The basic concept of a ray where O is the origin, D is the direction and t is the scalar.
2.2 A visualisation of a shader table with three unique objects with varying constants. Adapted from [12, p 41].
2.3 A visualisation of how shadow rays work. Adapted from [1].
2.4 A visualisation of the differences of hard- and soft shadows. Left: Hard Shadows, Right: Soft Shadows.
2.5 A visualisation of the shadows in the umbra and penumbra region.
2.6 Noisy shadows generated with 1 rpp.

4.1 The rendering pipeline with previews of the different stages.
4.2 A visualisation of the shader table used in this experiment. Adapted from [12, p 41].
4.3 Screenshot from how the camera was positioned during the benchmark of the Sponza scene.
4.4 Screenshot from how the camera was positioned during the benchmark of the Dragon scene.
4.5 Screenshot from how the camera was positioned during the benchmark of the Sponza4Dragons scene.

5.1 The Sponza scene with one light.
5.2 The Sponza scene with two lights.
5.3 The Sponza scene with four lights.
5.4 The Sponza scene with eight lights.
5.5 The Dragon scene with one light.
5.6 The Dragon scene with two lights.
5.7 The Dragon scene with four lights.
5.8 The Dragon scene with eight lights.
5.9 The Sponza4Dragons scene with one light.
5.10 The Sponza4Dragons scene with two lights.
5.11 The Sponza4Dragons scene with four lights.
5.12 The Sponza4Dragons scene with eight lights.


List of Tables

4.1 Benchmark Setup

6.1 GPU performance hierarchy according to Tom's Hardware [34].

A.1 A detailed performance comparison of the different methods in the scene Sponza with 1 & 2 lights.
A.2 A detailed performance comparison of the different methods in the scene Sponza with 4 & 8 lights.
A.3 A detailed performance comparison of the different methods in the scene Dragon with 1 & 2 lights.
A.4 A detailed performance comparison of the different methods in the scene Dragon with 4 & 8 lights.
A.5 A detailed performance comparison of the different methods in the scene Sponza4Dragons with 1 & 2 lights.
A.6 A detailed performance comparison of the different methods in the scene Sponza4Dragons with 4 & 8 lights.


Listings

B.1 Random Number Generation
B.2 Ray-Generation Shader
B.3 Inline Ray Tracing Part1
B.4 Inline Ray Tracing Part2
B.5 Compute Shader
B.6 Pixel Shader


Chapter 1 Introduction

This chapter explains the benefits of generating soft shadows with ray tracing and why it is important, followed by the aim, scope, and research questions of the thesis. Lastly, the outline provides a brief overview of the thesis.

1.1 Motivation

In recent times, the graphics in games have seen a surge in realism. With the release of GPUs with hardware-accelerated ray-tracing support, ray tracing can now be used in real-time graphics applications to some extent [15]. However, even with graphics cards that have hardware dedicated to ray tracing, there are still performance issues when using ray tracing in real-time graphics applications, and some games that adopted real-time ray tracing suffered from them. In the game The Medium (published by Bloober Team in 2021), ray tracing took approximately 1/4 of the execution time of each frame when the ray-tracing settings were set to ultra [29].

Ray tracing has not been an option for rendering in real-time graphics applications in the past because of hardware limitations. Still, it has been discussed as a future solution to the current problems with some rasterization techniques [11, p. 13]. One of the techniques that ray tracing excels in is shadow generation. This allows old techniques such as shadow mapping to be gradually discarded as future games and/or applications are created.

Shadow mapping has several issues such as peter panning, shadow acne, and aliasing, which arise due to imprecision in the depth test [11, p. 16]. Jon Story [33] discusses and visualizes these issues in his presentation at the Game Developers Conference. To minimize the imprecision of the depth test, one could increase the resolution of the shadow map, but this would also increase the computation time and the memory consumption of the application. Shadow acne occurs because shadow maps have a limited resolution: multiple fragments sometimes sample the same texel from the shadow map, leading to black lines in the scene. To solve this problem, one usually adds a shadow bias. However, with a shadow bias, the problem of peter panning occurs, a graphical artifact where the shadows are slightly detached from their objects. The scene designer often has to manually fine-tune the shadow bias to achieve sufficient results [12, p. 159], but the results will never be physically accurate. Ray tracing solves these cases accurately and elegantly because it simulates light more closely to real life than rasterization techniques are able to do [12]. With ray-traced shadows, there are no shadow maps, which are the main cause of shadow acne.


Hard shadows, in general, are an unrealistic way of rendering shadows. Hard shadows only come from point lights, which rarely exist in real life [5]. However, they can give decent graphical results anyway. The more physically correct way of rendering shadows is to make them softer at the edges. A soft shadow is a shadow that decreases in strength toward its edges. Soft shadows emerge from area lights, which emit light from an area instead of a single point as point lights do. This is closer to how shadows work in real life. See Figure 2.4 for an example of hard shadows versus soft shadows.

To use ray tracing within DirectX 12, developers can use the ray-tracing addition to DirectX 12 called DirectX Ray Tracing (DXR). To use the DXR Application Programming Interface (API), developers need to create a separate pipeline, which can then be used to cast rays [13]. This pipeline was designed so that several shaders can be bound at the same time, which is a must in ray tracing since a ray can intersect with any object at any time and consequently trigger any shader during ray traversal. However, with DXR Tier 1.1, a new feature called inline ray tracing enables a different way to trace rays in DirectX 12. Inline ray tracing allows for launching rays directly from the traditional rasterization shaders (vertex and pixel shaders), which means that there is no way to dynamically change shaders depending on the object hit during ray traversal. Inline ray tracing also works in the compute shader, a shader that is generally, but not exclusively, used for general-purpose GPU (GPGPU) calculations [16, p. 675]. The main benefit of using inline ray tracing is that there is no need to change the pipeline state and other resources in order to trigger the ray tracing, an overhead that can be expensive in terms of performance [24]. However, since the driver can perform optimizations when submitting several tasks to the GPU at once [19], it is hard to know for sure if the inline forms of ray tracing perform better than dynamic ray tracing for all rendering techniques such as generating shadows, reflections, and/or ambient occlusion.

1.2 Aim

This thesis aims to evaluate the difference in performance of dynamic ray tracing (DRT), inline ray tracing using the compute shader (IRTC), and inline ray tracing using the pixel shader (IRTP) featured in DXR Tier 1.1. These ray-tracing methods were chosen because they are similar in how they work and because it is not obvious which one to use in different rendering situations. Furthermore, all three methods are included in the DXR API, and all of them benefit from hardware-accelerated ray tracing.

To further investigate the ray-tracing performance, soft shadows will be implemented to see if the driver can perform optimizations differently (depending on the choice of ray-tracing method) on the subsequent and/or preceding API interactions. Ray-traced shadows are expected to be faster using IRT than DRT [24]. However, to the authors' knowledge, there are no benchmarks that confirm this, which is the reason why soft shadows were explicitly chosen as the use case for ray tracing. Furthermore, the benchmarks will be performed on multiple GPUs with hardware-accelerated support for ray tracing to decrease the risk of getting hardware-specific results. Also, by using GPUs from both AMD and Nvidia, the results will generalize better and target a broader audience.

1.2.1 Research Questions

The research questions (RQ1 and RQ2) will assess how DRT, IRTC, and IRTP compare in ray-tracing performance with and without generating soft shadows.

1. How do DRT, IRTC, and IRTP compare in performance with different scene complexity (objects and lights)?

2. How do DRT, IRTC, and IRTP compare in performance with different GPUs (RTX2070, RTX3080, RX6700XT, RX6900XT)?

1.2.2 Objectives

The objectives that will be carried out to answer the research questions are the following:

• Add ray-tracing support for an in-house DirectX 12 renderer

• Implement a tool to measure execution time on the GPU

• Implement soft shadows with DRT/IRTC/IRTP

• Construct test scenes

• Perform tests and collect performance results from the three different implementations

1.3 Scope and Delimitations

In this section, the scope and the delimitations of the thesis are described.

1.3.1 Static Scenes

The scenes used for testing were all static, meaning that no objects or lights moved during the performance tests. This delimitation was added to make it easier to ensure that the three ray-tracing methods generated the same graphical result every frame during the performance tests. Also, having a dynamic scene would mean that all three methods would add the same instruction BuildRaytracingAccelerationStructure to the command list every frame, which would probably have little to no relative performance impact. See Microsoft's documentation for more information [17].

1.3.2 Static Camera

Instead of having a camera moving through the scenes, a decision was made to manually choose a position and direction for the camera for each scene, where the camera would not change during the entire measurement. Furthermore, since this delimitation was chosen, there was no need to add motion vectors to the temporal accumulation stage in the soft-shadow technique, making the testing simpler and more focused on the ray tracing.

1.3.3 API

Due to the limited time available in this work, other APIs such as Vulkan will not be investigated even though Vulkan supports ray tracing similarly to DirectX 12 [10].

1.4 Outline

This section aims to give a brief overview of what the thesis contains.

1. Introduction: This chapter introduces the subject and explains why the work is of importance. It also lists the delimitations and explains the scope of the thesis.

2. Background: This chapter explains the background theory of concepts that must be understood to follow along in the paper. The reader is expected to have a basic understanding of DirectX 12.

3. Related Work: In this chapter, related work and research in shadows, games, and DXR will be presented.

4. Method: This chapter explains the implementation details of how the soft shadows were implemented. It also explains how the three different methods (DRT, IRTP, and IRTC) differ in implementation. Finally, this chapter explains how the performance was measured.

5. Results and Analysis: This chapter provides graphs obtained during the benchmarking stage, followed by an analysis of the results.

6. Discussion: This chapter compares the results with related work and discusses the results of the performance tests.

7. Conclusions and Future Work: This chapter summarizes the answers to the research questions of the study and suggests future work.

Chapter 2 Background

This chapter explains the key concepts the reader needs to understand to follow the thesis. The reader is expected to have basic knowledge of rasterization, deferred shading, and some understanding of DirectX 12. For more information, see Luna's [16] book.

2.1 Ray-Tracing Essentials

A ray is defined by two parts, an origin (O) and a direction (D). Ray casting is the idea of taking the origin O and casting the ray in the direction D. A scalar, often referred to as t, is multiplied with D to control the distance travelled along the ray, so that a point on the ray is P(t) = O + tD. An example of a ray can be seen in Figure 2.1. Ray tracing is a tool that can be used to render effects such as reflections, ambient occlusion, and shadows with high-quality results. It works by computing the intersections between rays and the scene geometry, which is computationally heavy. For more information, see Alarcon's [1] blog post, where Eric Haines (Nvidia engineer) goes deeper into the subject.

Figure 2.1: The basic concept of a ray where O is the origin, D is the direction and t is the scalar.
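To make the ray definition concrete, the following is a minimal CPU-side sketch in C++ of a ray and a ray-triangle intersection test. It uses the well-known Möller–Trumbore algorithm, which is an assumption chosen for illustration and not necessarily what GPU hardware or the DXR runtime uses internally. All type and function names (Vec3, Ray, intersectTriangle) are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical minimal vector type and helpers, for illustration only.
using Vec3 = std::array<double, 3>;

inline Vec3 sub(const Vec3& a, const Vec3& b) {
    return {a[0] - b[0], a[1] - b[1], a[2] - b[2]};
}
inline Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]};
}
inline double dot(const Vec3& a, const Vec3& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

struct Ray {
    Vec3 origin;     // O
    Vec3 direction;  // D
    // A point on the ray: P(t) = O + t * D.
    Vec3 pointAt(double t) const {
        return {origin[0] + t * direction[0],
                origin[1] + t * direction[1],
                origin[2] + t * direction[2]};
    }
};

// Moller-Trumbore ray-triangle test. On a hit inside [tMin, tMax],
// writes the ray parameter to tOut and returns true.
inline bool intersectTriangle(const Ray& ray, const Vec3& v0, const Vec3& v1,
                              const Vec3& v2, double tMin, double tMax,
                              double& tOut) {
    const double eps = 1e-9;
    const Vec3 e1 = sub(v1, v0);
    const Vec3 e2 = sub(v2, v0);
    const Vec3 p = cross(ray.direction, e2);
    const double det = dot(e1, p);
    if (std::fabs(det) < eps) return false;  // ray parallel to the triangle
    const double invDet = 1.0 / det;
    const Vec3 s = sub(ray.origin, v0);
    const double u = dot(s, p) * invDet;     // first barycentric coordinate
    if (u < 0.0 || u > 1.0) return false;
    const Vec3 q = cross(s, e1);
    const double v = dot(ray.direction, q) * invDet;  // second barycentric
    if (v < 0.0 || u + v > 1.0) return false;
    const double t = dot(e2, q) * invDet;    // distance along the ray
    if (t < tMin || t > tMax) return false;
    tOut = t;
    return true;
}
```

A scene with many triangles repeats this test for every candidate triangle, which is why acceleration structures are used to reduce the number of tests per ray.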


2.2 DirectX Ray Tracing

In 2018, DirectX Ray Tracing (DXR) Tier 1.0 was introduced as an extension to the DirectX 12 API. This is a step toward increasing the quality of 3D graphics [9]. DXR Tier 1.1 adds inline ray tracing, a feature that allows developers to integrate ray tracing into their applications more easily.

2.2.1 Ray-Tracing Pipeline

The ray-tracing pipeline is different from the traditional rasterization pipeline, which often only uses vertex and pixel shaders. The ray-tracing pipeline in DXR contains a new set of shaders called the ray-generation shader, miss shader, intersection shader, any-hit shader, and finally, closest-hit shader. The last three shaders (intersection, any-hit, and closest-hit) are tied together in a so-called hit group. These five new shaders have different purposes regarding launching rays and shading the identified hits [12, p. 23].

Ray Generation

The ray-generation shader can be seen as the launcher of the rays. The actual ray tracing begins when the developer calls TraceRay() [21] in this shader, which takes a few parameters. One of them is the RayDesc, which specifies the ray: its origin, its direction, and the smallest and largest allowed values of the scalar t from Section 2.1. Other parameters describe which hit group and which miss shader to execute. Also, some flags (such as RAY_FLAG_SKIP_CLOSEST_HIT_SHADER) can be included as a parameter to increase performance. Whims et al. [35] explain in full which ray flags exist and what they do. Lastly, TraceRay() takes a so-called ray payload as an in/out parameter. This is a developer-defined structure that follows the ray throughout shader executions. The executed shaders can write information such as color to the payload, and it is through the payload that the ray-generation shader gets information about the results from the other shaders [14]. The ray-generation shader could then, for example, write the results to a texture using an unordered access view.

Miss

If a ray fails to hit anything, the miss shader is executed for that specific ray. There are several use cases for the miss shader, like returning a background color through the payload or returning a boolean that indicates whether the shader was triggered. This information can then be used to determine if the position should be in shadow or not.

Hit Group

The hit group is a collection of three types of shaders. These shaders are (as explained in Section 2.2.1) the intersection shader, the any-hit shader, and the closest-hit shader. The intersection shader is optional and can be used if the developer wants to test the rays against primitives other than triangles. If no intersection shader is provided in the hit group, the default intersection test will be used, which tests the rays against triangles. The any-hit shader will trigger as soon as a hit is identified and can trigger multiple times for each ray. This shader stage is also optional and is often used for transparent geometry. Finally, the closest-hit shader is triggered once the closest hit has been identified. This shader stage can be skipped using the RAY_FLAG_SKIP_CLOSEST_HIT_SHADER flag to increase performance. Generally, this is the shader where the color of a pixel is calculated, similar to a pixel shader in the rasterization pipeline [12, p. 24].
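The interplay between the any-hit, closest-hit, and miss shaders described above can be sketched as a simplified CPU-side model in C++. This is not DXR code: the names Candidate, Payload, HitGroup, and traceRayModel are hypothetical, the candidate hits are assumed to be given (in DXR they come from traversal and the intersection stage), and the assumption that the any-hit callback only runs for non-opaque candidates is a simplification of how geometry opacity works in DXR.

```cpp
#include <cassert>
#include <functional>
#include <limits>
#include <vector>

// Hypothetical candidate hit as a traversal might report it.
struct Candidate {
    float t;      // distance along the ray
    bool opaque;  // opaque candidates bypass the any-hit callback
};

// Hypothetical developer-defined payload that follows the ray.
struct Payload {
    bool hit = false;
    float t = 0.0f;
    int anyHitCalls = 0;
};

struct HitGroup {
    // Returns true to accept the candidate, false to ignore it.
    std::function<bool(const Candidate&, Payload&)> anyHit;
    std::function<void(const Candidate&, Payload&)> closestHit;
};

// Simplified control flow: any-hit may run several times per ray,
// then exactly one of closest-hit or miss runs.
inline void traceRayModel(const std::vector<Candidate>& candidates,
                          float tMin, float tMax, const HitGroup& group,
                          const std::function<void(Payload&)>& miss,
                          bool skipClosestHit, Payload& payload) {
    bool found = false;
    Candidate closest{std::numeric_limits<float>::max(), true};
    for (const Candidate& c : candidates) {
        // Ignore candidates outside the ray interval or behind the
        // closest hit accepted so far.
        if (c.t <= tMin || c.t >= tMax || c.t >= closest.t) continue;
        if (!c.opaque && group.anyHit && !group.anyHit(c, payload)) continue;
        closest = c;
        found = true;
    }
    if (!found) {
        miss(payload);  // no accepted hit: the miss shader runs
        return;
    }
    if (!skipClosestHit && group.closestHit) {
        group.closestHit(closest, payload);
    }
}
```

For a shadow ray, a typical use of such a model would be to pass skipClosestHit as true and let only the miss callback mark the point as lit, which mirrors the role of RAY_FLAG_SKIP_CLOSEST_HIT_SHADER mentioned above.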

2.2.2 Acceleration Structures

To efficiently reduce the number of ray-triangle intersection tests, DXR requires a two-level acceleration structure of the scene. The geometry is put inside a Bottom-Level Acceleration Structure (BLAS), and the actual object to be drawn is represented by an instance in the Top-Level Acceleration Structure (TLAS), which has a pointer to the BLAS with the geometry information. Both structures need to reside in GPU memory when the ray traversal begins [12, p. 34]. Both the TLAS and the BLAS are built with the BuildRaytracingAccelerationStructure [17] command.

Bottom-Level Acceleration Structure

The BLAS holds the vertex information for the actual geometry of an object. This means that if a triangle is to be ray traced, there is a vertex buffer with three vertices in a BLAS. A BLAS can contain several vertex buffers, which can be useful for models divided into several parts. For some applications, there is no need to update the BLAS. If there are skinned objects in the scene, however, the BLAS needs to be updated with the new geometry. Even then, if the BLAS geometry is not used for the actual drawing of the object, but only for rendering techniques such as reflections and/or shadows, it might not be necessary to update the BLAS every frame. Sjöholm [32] presented that the BLAS should be rebuilt on every Nth frame for skinned meshes, meaning that the need for updating the BLAS depends on the game or application. Petersson [26] found that the game Metro Exodus (published by Deep Silver and Koch Media in 2019) does not update the BLAS for skinned geometry every frame, probably because it is hardly noticeable in the reflections. However, in a scene with large mirrors, an optimization like this might be visible and obvious.

Top-Level Acceleration Structure

The TLAS can be seen as an array with one or more instances, where each instance has a pointer to the corresponding geometry in the BLAS alongside a world matrix. If an object moves in the scene, the corresponding world matrix needs to be adjusted, and therefore the TLAS also needs to be updated. Sjöholm [32] mentions that the TLAS should simply be rebuilt every frame. For increased usability, each instance can specify an instance mask, which can be used in the TraceRay() call to ignore certain instances. For example, this could be used to skip instances that should not cast shadows [12, p. 35], which increases performance.
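The two-level layout and the instance mask can be illustrated with a small CPU-side data-structure sketch in C++. The structs below are hypothetical stand-ins and not the D3D12 types; the only DXR behaviour assumed is that an instance is considered for a ray when the bitwise AND of its 8-bit instance mask and the mask passed to TraceRay() is non-zero.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vertex {
    float x, y, z;
};

// Bottom level: geometry only. One BLAS may hold several vertex
// buffers, for example one per part of a model.
struct Blas {
    std::vector<std::vector<Vertex>> vertexBuffers;
};

// One TLAS entry: places a BLAS in the world.
struct Instance {
    std::array<float, 12> worldMatrix;  // 3x4 row-major transform
    std::size_t blasIndex;              // stands in for the GPU pointer to a BLAS
    std::uint8_t instanceMask;          // 8-bit visibility mask
};

// Top level: an array of instances.
struct Tlas {
    std::vector<Instance> instances;
};

// Returns the indices of all instances a ray with the given mask
// would consider during traversal.
inline std::vector<std::size_t> visibleInstances(const Tlas& tlas,
                                                 std::uint8_t rayMask) {
    std::vector<std::size_t> result;
    for (std::size_t i = 0; i < tlas.instances.size(); ++i) {
        if ((tlas.instances[i].instanceMask & rayMask) != 0) {
            result.push_back(i);
        }
    }
    return result;
}
```

Two instances can share the same blasIndex, so moving an object only changes its worldMatrix in the TLAS while the BLAS geometry stays untouched.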

2.2.3 Dynamic Ray Tracing

In the rasterization pipeline, the vertex and pixel shaders need to be set before issuing the draw call for the specified geometry. In DXR, however, the shaders for every material in the scene must be bound before calling DispatchRays() [18], which is made possible with dynamic shading. To achieve this, shader tables were introduced to make all shaders available at once [14]. Dynamic shading is needed because a ray can intersect any geometry at any time and consequently trigger any shader during ray traversal.

2.2.4 Shader Tables A shader table is an array containing shaders and shader data such as descriptors and/or root constants. They are used to trigger different shaders for different objects and/or materials. In a more raw view of shader tables, they just contain blocks of 64-bit aligned GPU memory. In the shader table, something called shader records are stored. A shader record contains a 32-bit pointer to the unique shader identifier following with the resource bindings for the shader [12, p. 41]. To choose which shader record to run when an intersection has occurred, several properties are consid- ered, one of them being the user-input arguments into the TraceRay() [21] function. To decide which miss shader to execute, an index that correlates to the order of the miss shader records in the shader table is required as a parameter. So if the index is one, the second miss shader will be evoked if the ray misses all geometry in the scene. Finally, for the hit groups (a combination of intersection, closest hit, and miss-shaders), two indices in the TraceRay function are required. The first index RayContributionToHitGroupIndex is provided as an offset in shader records. It can be used to, for example, choose the ray type of the traversal, such as shadow rays or view rays (see Figure 2.3 to see the difference between these ray types). The second index MultiplierForGeometryContributionToHitGroupIndex is multiplied by the in- dex of the geometry in a BLAS. This is usually set to zero since the release of DXR Tier 1.1 when a new intrinsic called GeometryIndex was introduced. If the second index is set to zero, all geometries in a bottom-level structure share the same shader record [24], and the geometry could be accessed using the GeometryIndex instead. 
For example, if a mesh is to be rendered three times, where each instance should have a unique color, the three shader records for the hit groups could contain the same 32-byte shader identifier, while each record holds a different root constant containing its unique color. See Figure 2.2 for how the shader table could be structured to manage this specific scene.
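As a rough illustration of the indexing described in this section, the following Python sketch computes which hit-group record a hit would select and how large each record becomes. The index formula follows the addressing rule in the DXR specification, and the two sizes mirror the D3D12 constants D3D12_SHADER_IDENTIFIER_SIZE_IN_BYTES and D3D12_RAYTRACING_SHADER_RECORD_BYTE_ALIGNMENT. The function names, the 16-byte color constant, and the one-record-per-instance layout are hypothetical. The per-instance offset (InstanceContributionToHitGroupIndex) is not discussed in the text; it is included because it is what would separate the three records in the example.

```python
SHADER_IDENTIFIER_SIZE = 32   # bytes, size of a DXR shader identifier
SHADER_RECORD_ALIGNMENT = 32  # bytes, required alignment of each shader record


def align_up(size, alignment):
    """Round size up to the next multiple of alignment."""
    return (size + alignment - 1) // alignment * alignment


def hit_group_record_index(ray_contribution, geometry_multiplier,
                           geometry_index, instance_contribution):
    """Index of the hit-group record selected for a hit (DXR addressing rule)."""
    return (ray_contribution
            + geometry_multiplier * geometry_index
            + instance_contribution)


def miss_record_index(miss_shader_index):
    """Miss records are selected directly by the MissShaderIndex argument."""
    return miss_shader_index


# Hypothetical layout: one record per instance, each holding the shader
# identifier followed by a 16-byte color (four 32-bit floats).
record_stride = align_up(SHADER_IDENTIFIER_SIZE + 16, SHADER_RECORD_ALIGNMENT)

# Three instances of the same mesh; the multiplier is zero, so every geometry
# in a BLAS maps to the same record regardless of its geometry index.
records = [
    hit_group_record_index(ray_contribution=0, geometry_multiplier=0,
                           geometry_index=5, instance_contribution=i)
    for i in range(3)
]
byte_offsets = [index * record_stride for index in records]
```

With the multiplier at zero, the geometry index drops out of the sum, which is why all geometries of an instance end up on one record.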

2.2.5 Inline Ray Tracing

After the release of DXR Tier 1.0, developers who used DXR wanted more flexibility in the ray-tracing pipeline, which DXR Tier 1.1 provided. This release introduced inline ray tracing. With this feature, rays can be generated without shader tables or the ray-tracing pipeline introduced in DXR Tier 1.0. The feature is available in any shader stage, including compute shaders and pixel shaders. Inline ray tracing uses the same acceleration structures as DRT. It allows simple ray-tracing tasks such as hard shadows to be performed without restructuring and/or adding a new pipeline, which would have been necessary with DRT. Inline ray tracing is said to be faster than DRT for jobs with little computation, such as hard shadows, but it is assumed to run slower when the shader complexity is high [24].

Figure 2.2: A visualisation of a shader table with three unique objects with varying constants. Adapted from [12, p. 41]. [Diagram: a shader table with ray-generation, miss, and hit-group records; each hit-group record holds a shader identifier (any-hit, closest-hit, and intersection shaders), constants, and padding.]

2.3 Hard Shadows

Hard shadows are, in general, an unrealistic way of rendering shadows. They only come from point lights, which rarely exist in real life [5]. However, they can still give decent graphical results. Generating hard shadows with ray tracing is a simple task. The idea is to determine whether a point is in shadow by casting a ray from the point in the world towards the light. If the ray hits anything on the way to the light source, the point is occluded by another object and should be shadowed. See Figure 2.3 for how this might look in a 3D scene. In DXR, this means that if the miss shader is executed for a particular ray, the ray missed all geometry on the way to the light, and the point it originated from should therefore not be in shadow. See the left image in Figure 2.4 for a scene with hard shadows.

Figure 2.3: A visualisation of how shadow rays work. Adapted from [1].

Figure 2.4: A visualisation of the differences of hard- and soft shadows. Left: Hard Shadows, Right: Soft Shadows.

2.4 Soft Shadows

Soft shadows are a more realistic way of rendering shadows. In the right image in Figure 2.4, the shadows get softer the further out the shadow goes. More specifically, the shadows are hard in the umbra region and soft in the penumbra region, as shown in Figure 2.5. Soft shadows emerge from area lights, which emit light from an area instead of from a single point as point lights do. When area lights are used, surface points in the penumbra can be occluded from only part of the light area, leaving the surface point partially in shadow. This means that the further out in the penumbra the point on the surface is, the softer the shadow becomes. In ray tracing, it works similarly to what is explained in Section 2.3, but instead of shooting the ray towards the same point on the light every frame, which would be the case with point lights, the point on the light is randomized within the area of the light. The direction will therefore be different every time the shadow-ray direction is computed. When the point on the surface is in the penumbra region, the shadow ray might thus sometimes hit objects on the way to the light, and sometimes not. Changing the shadow-ray direction every time the shadow ray is computed produces a noisy result, as shown in Figure 2.6. To solve this, one could cast more shadow rays per pixel (rpp), add their visibility results together, and divide the sum by the number of rays. When using more than one rpp, the visibility value of the pixel can take values other than zero and one. For example, if rpp was set to two for a specific pixel and only one of the rays reached the light source, the visibility value would be 0.5, which means that the pixel should only be partially in shadow. However, only a few rays per pixel can be computed while maintaining performance suitable for real-time rendering.
In real-time graphics applications, hybrid rendering approaches are often used to solve this problem by adding spatiotemporal filters after the ray-tracing pass to get smoother results [12, p. 286].
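The averaging over several shadow rays per pixel can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the shader used in this work: the light is reduced to a one-dimensional segment, the occlusion test is a hypothetical stand-in for an actual ray cast, and all function names are made up for the illustration.

```python
import random


def estimate_visibility(is_occluded, sample_light_point, rpp, rng):
    """Average the binary results of rpp shadow rays for one surface point."""
    unoccluded = 0
    for _ in range(rpp):
        target = sample_light_point(rng)  # random point on the area light
        if not is_occluded(target):
            unoccluded += 1
    return unoccluded / rpp


# Toy setup (hypothetical): a 1D light spanning x in [-1, 1]; seen from the
# shaded point, an occluder hides the half of the light where x < 0.
def sample_light_point(rng):
    return rng.uniform(-1.0, 1.0)


def is_occluded(light_x):
    return light_x < 0.0


rng = random.Random(1)
one_rpp = estimate_visibility(is_occluded, sample_light_point, 1, rng)
two_rpp = estimate_visibility(is_occluded, sample_light_point, 2, rng)
many_rpp = estimate_visibility(is_occluded, sample_light_point, 10000, rng)
```

With one ray the result can only be zero or one, which is the source of the noise; with many rays the estimate approaches the fraction of the light that is actually visible, here one half.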

Figure 2.5: A visualisation of the shadows in the umbra and penumbra region.

Figure 2.6: Noisy shadows generated with 1 rpp.

Chapter 3 Related Work

This chapter will present relevant usages of ray tracing in games and relevant research related to this thesis.

3.1 Inline Ray Tracing

At the time of writing, inline ray tracing is still a recent addition to DirectX 12, and no related work with benchmarks of it was found. However, since this thesis explores the difference between dynamic and the inline forms of ray tracing, it is sufficient to compare with dynamic ray tracing in other works.

3.2 Ray Tracing in Games

When Battlefield V (published by Electronic Arts (EA) in 2018) was released, enabling ray tracing cost approximately half of the game's frame rate on low settings, and more than 2/3 of the performance when using ultra settings [31]. However, after some updates and bug fixes to DXR [7], the ray-tracing execution time decreased. In Call of Duty: Modern Warfare (CoD: MW) (published by Activision in 2019), the ray-tracing part took approximately 1/3 of the execution time per frame [8]. The Medium, released in January 2021, is another game that uses DXR. When the game runs on ultra settings, the fps increases by roughly a factor of four when ray tracing is disabled compared to when it is enabled [29].

3.3 Soft Shadows using DXR

One problem when generating soft shadows with ray tracing is that the resulting image will be noisy. The act of removing this noise is called denoising, whose purpose is to reduce the artifacts and produce a smoother texture. Poulsen [28] mentions that it would be too expensive to solve the noisy textures by shooting more rays and that one should instead look into other ways to soften the shadows, such as blur filters. Many existing denoising techniques for ray-traced soft shadows are similar. Olejnik and Kozłowski [23] describe how they implemented soft shadows in CoD: MW using DXR. First, they denoised everything in half resolution to save performance, then they applied temporal and spatial filters, and finally they added a temporal anti-aliasing (TAA) post-processing effect. They also used a motion-vector texture from a previous frame so that if the camera had moved since the last frame, the amount moved could be taken into consideration when using data from that frame. EA's research team (SEED) approached the problem similarly in their soft-shadow implementation, where they used a spatiotemporal variance-guided filter (SVGF) [12, p. 443] similar to CoD: MW. For more information about the SVGF, see the work by Schied et al. [30]. In short, they blurred each visibility texture per light, and afterwards, when the temporal textures were summed up, they added a final blur. Finally, Stachowiak mentions in his presentation [4] that they also added a TAA filter so that the soft shadows would be more reactive in dynamic conditions.

Chapter 4 Method

This chapter presents the implementation specifics of the application and briefly presents the state-of-the-art soft-shadow technique chosen. The scientific method is based on an experimental approach, using both implementation and performance tests. In this work, the experimental approach includes implementing several pipelines in DirectX 12 specifically for the tests, which are then benchmarked to get timing results. Another scientific method that could be considered for this work is a systematic literature study. However, since little research on the performance of inline ray tracing has been published so far, it would be hard to reach a conclusion about its performance that way.

4.1 Soft-Shadow Selection

The task of softening hard shadows mostly resides in the denoiser stage. However, since this experiment focuses on, among other things, how the ray-tracing methods perform with subsequent and/or preceding API interactions, the choice of denoiser is not the primary focus. Hence, a state-of-the-art denoiser, namely the SVGF-based denoiser used by graphics programmers at EA (SEED) in their renderer Pica Pica, was chosen as a template. SEED approached the problem with hybrid rendering, meaning that both rasterization and ray tracing were combined to finalize the soft shadows [12, p. 440]. To further ensure that the technique is state of the art, more research in the area was conducted. This led to the finding that the game CoD: MW uses a similar approach to SEED's (see Section 3.3 for more information). Due to the credibility of these state-of-the-art methods, a denoiser similar to their implementations was implemented, and it is further described in Section 4.2.

4.2 Implementation

A brief overview of the pipeline can be seen in Figure 4.1, where each circular part in the figure is a separate shader with the output specified underneath for each light. The rendering pipeline begins by storing the depth and normal information, followed by the ray-tracing stage that writes shadow information to a visibility buffer per light. At this stage, the result is noisy, so the denoising follows next. The result from the visibility buffer gets blurred, followed by a temporal accumulation from previous frames, and is finally blurred one last time. After this, a deferred shader shades the pixels using per-light visibility buffer information and the normals from the geometry buffer in the second stage. After the shading has been performed, temporal anti-aliasing is applied before presenting the frame.

Figure 4.1: The rendering pipeline with previews of the different stages.

4.2.1 Depth Pre-Pass

The first stage of the pipeline is the depth pre-pass, where each object is drawn using rasterization to fill the depth buffer with depth information. A depth pre-pass is generally used to reduce overdraw [3]. In the case of ray tracing, however, a depth buffer can be used for another purpose, namely for starting the ray tracing from the position in the world instead of having to first trace a ray to the world position and then to the light. This saves a considerable amount of computation since it reduces the rpp by one ray.

4.2.2 Geometry Buffer

The geometry buffer (g-buffer) is used to save screen-space information from the current frame. In this experiment, only the normals in world space were stored because they were needed to do simple diffuse lighting in the deferred-shading stage. However, in a more sophisticated renderer, other screen-space information could be saved here, such as roughness maps, metallic maps, and albedo maps. This stage benefits from the depth pre-pass through reduced overdraw.

4.2.3 Ray Tracing

This stage of the pipeline is where the ray tracing occurs, illustrated by the orange color in Figure 4.1. The ray-tracing stage changes between the different ray-tracing methods (DRT, IRTC, and IRTP) in the experiment. The output from these pipelines is the visibility buffer for each light. If there are two shadow-casting lights in the scene, the ray-tracing stage outputs two visibility textures. All three methods read the depth from the depth buffer and translate it into a position in world space, which is used as the origin of the ray in the RayDesc structure. To get the ray direction, all three methods use a random number generator (which can be seen in Listing B.1) to randomize a point inside the light at which the ray is directed. Finally, the three methods also use the same flags for the ray-tracing call, namely RAY_FLAG_ACCEPT_FIRST_HIT_AND_END_SEARCH and RAY_FLAG_SKIP_CLOSEST_HIT_SHADER. The resolution is set to 2560x1440 since it is a resolution that does not require the newest and most expensive hardware, which most consumers do not have. Another reason for choosing this resolution is that it is commonly tested by graphics card manufacturers such as AMD [2].

Figure 4.2: A visualisation of the shader table used in this experiment. Adapted from [12, p. 41]. [Diagram: a shader table with a ray-generation record and a miss record, each holding a shader identifier.]
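The ray setup shared by the three methods in Section 4.2.3 can be approximated with the Python sketch below. It is only an illustration: the actual random number generator is the one in Listing B.1, whereas here Python's random module is used, the light is assumed to be an axis-aligned square, and limiting TMax to the distance to the sampled point is an assumption rather than something stated in this work. The dictionary keys mirror the fields of the HLSL RayDesc structure.

```python
import math
import random


def sample_point_on_square_light(center, half_size, rng):
    """Pick a uniformly random point on an axis-aligned square light in the XZ plane."""
    cx, cy, cz = center
    return (cx + rng.uniform(-half_size, half_size),
            cy,
            cz + rng.uniform(-half_size, half_size))


def make_shadow_ray(origin, light_point, t_min=0.001):
    """Build a RayDesc-like dictionary for a shadow ray from origin to light_point."""
    delta = [l - o for l, o in zip(light_point, origin)]
    distance = math.sqrt(sum(d * d for d in delta))
    direction = tuple(d / distance for d in delta)
    # Assumption: TMax stops the ray at the light so geometry behind it is ignored.
    return {"Origin": origin, "TMin": t_min,
            "Direction": direction, "TMax": distance}


rng = random.Random(7)
light_point = sample_point_on_square_light((0.0, 5.0, 0.0), 0.5, rng)
ray = make_shadow_ray((0.0, 0.0, 0.0), light_point)
```

Because the sampled point changes with every call, the direction changes as well, which is what turns the hard shadow into a noisy soft shadow before denoising.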

Using the Ray-Generation Shader (DRT)

The pipeline that uses the dynamic form of ray tracing introduced in DXR Tier 1.0 uses a shader table with only two shader records, namely a ray-generation-shader record and a miss-shader record. No hit groups were needed in this pipeline since the scene did not use any object-specific material such as textures. See Figure 4.2 for a visualization of how the shader table was structured. The width and the height of the DispatchRays() call were set to 2560 and 1440, respectively. See Listing B.2 for pseudocode of how the ray-generation shader was implemented.

Using the Pixel Shader (IRTP)

As explained earlier, using DXR in the pixel shader became possible with the introduction of DXR Tier 1.1, which allowed for inline ray tracing. To trigger the pixel shader for each pixel, a full-screen quad was drawn with the viewport set to 2560x1440. In the pixel shader, a RayQuery was instantiated, followed by a call to TraceRayInline() with parameters similar to those of TraceRay(). Like the ray-generation shader, this shader outputs one visibility texture per light. See Listing B.6 for pseudocode of how inline ray tracing was implemented in the pixel shader.

Using the Compute Shader (IRTC)

This pipeline is launched by calling Dispatch() with 10x1440 thread groups, where the number of threads in each thread group was set to 256x1x1, so that each of the 2560x1440 pixels is processed exactly once. The output of this shader is the same as in the other methods, namely the visibility textures for each light. The same RayQuery and TraceRayInline() calls are made in this shader as in IRTP. See Listing B.5 for pseudocode of how inline ray tracing was implemented in the compute shader.
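The dispatch dimensions follow from simple arithmetic, which the Python sketch below reproduces. The helper names are hypothetical, and the thread-to-pixel mapping is an assumption about how the shader indexes pixels (in HLSL it would typically come from SV_DispatchThreadID).

```python
def thread_group_count(pixels, threads_per_group):
    """Number of thread groups needed to cover pixels (ceiling division)."""
    return (pixels + threads_per_group - 1) // threads_per_group


def pixel_for_thread(group_id, thread_id, threads_per_group_x):
    """Map a (group, thread) pair to the pixel it processes for an Nx1x1 group."""
    group_x, group_y = group_id
    return (group_x * threads_per_group_x + thread_id, group_y)


WIDTH, HEIGHT, GROUP_SIZE_X = 2560, 1440, 256
groups_x = thread_group_count(WIDTH, GROUP_SIZE_X)   # thread groups per pixel row
groups_y = thread_group_count(HEIGHT, 1)             # one pixel row per group in Y
total_threads = groups_x * groups_y * GROUP_SIZE_X
```

Since 2560 is an exact multiple of 256, no thread falls outside the image, so no bounds check would be needed for this particular resolution.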

4.2.4 Spatial Accumulation

The spatial accumulation is the first step in denoising the visibility values in the shadow textures. This pass makes the shadows smoother by blurring each light's visibility buffer with a Gaussian filter with a 9x9 kernel. It is executed once for every light's visibility texture generated in the current frame.
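A Gaussian blur of this kind can be sketched as follows. The standard deviation is not stated in this work, so sigma = 2.0 is an assumed value, and the blur is written as two 9-tap passes, which gives the same result as a full 9x9 Gaussian kernel but is not necessarily how the shader was implemented. Clamping at the texture edges is likewise an assumption.

```python
import math


def gaussian_kernel_1d(size, sigma):
    """Normalized 1D Gaussian weights; size must be odd."""
    radius = size // 2
    weights = [math.exp(-(x * x) / (2.0 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(image, kernel):
    """Blur a 2D list of visibility values with a separable kernel (edges clamped)."""
    height, width = len(image), len(image[0])
    radius = len(kernel) // 2

    def clamp(value, upper):
        return max(0, min(value, upper))

    # A horizontal pass followed by a vertical pass equals the full 2D Gaussian.
    horizontal = [[sum(kernel[k] * image[y][clamp(x + k - radius, width - 1)]
                       for k in range(len(kernel)))
                   for x in range(width)]
                  for y in range(height)]
    return [[sum(kernel[k] * horizontal[clamp(y + k - radius, height - 1)][x]
                 for k in range(len(kernel)))
             for x in range(width)]
            for y in range(height)]


kernel = gaussian_kernel_1d(size=9, sigma=2.0)  # sigma is an assumed value
# A hard shadow edge: left half in shadow (0), right half lit (1).
hard_edge = [[0.0] * 8 + [1.0] * 8 for _ in range(4)]
soft_edge = blur(hard_edge, kernel)
```

After the blur, texels near the edge take intermediate visibility values, while texels more than four texels away from it keep their original value.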

4.2.5 Temporal Accumulation

Temporal accumulation is used to accumulate the result over time, which means that the visibility textures from past frames are saved and accumulated into a final texture used as the current frame's visibility texture. This is similar to increasing the rpp but without tracing multiple rays per frame. The visibility samples obtained from the previous frames' visibility textures are added up and divided by the number of temporal buffers, which after some testing was set to four in this experiment. To achieve this, each light was given four visibility buffers, to which the ray-tracing pass wrote visibility values in a cyclic manner. The resulting visibility value was computed as the average of the temporal buffers. This was implemented in a pixel shader by invoking each pixel once using a full-screen quad. In Equation (4.1), v is the visibility value, i is the index of a temporal buffer, and n is the total number of temporal buffers for each light.

$$ v = \frac{1}{n} \cdot \sum_{i=0}^{n-1} v_i \tag{4.1} $$
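The cyclic writes and the averaging of Equation (4.1) can be illustrated for a single pixel of a single light with the Python sketch below. The class and its method names are hypothetical, and the initial content of the buffers (fully lit) is an assumption, since this work does not state how they are initialized.

```python
class TemporalVisibility:
    """Ring of n visibility values for one pixel of one light (n = 4 in this work)."""

    def __init__(self, buffer_count=4):
        # Assumption: buffers start fully lit until real samples arrive.
        self.buffers = [1.0] * buffer_count
        self.frame = 0

    def write(self, visibility):
        """Store this frame's ray-traced visibility, overwriting the oldest slot."""
        self.buffers[self.frame % len(self.buffers)] = visibility
        self.frame += 1

    def resolve(self):
        """Average of all temporal buffers, as in Equation (4.1)."""
        return sum(self.buffers) / len(self.buffers)


pixel = TemporalVisibility()
history = []
for sample in [1.0, 0.0, 1.0, 0.0, 0.0]:  # noisy 1-rpp results over five frames
    pixel.write(sample)
    history.append(pixel.resolve())
```

In this run the resolved value moves in steps of 0.25, which matches what four rays per pixel would give, at the cost of reacting to changes over four frames.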

4.2.6 Final Blur

After each visibility texture had been accumulated, a final blur was performed using the same Gaussian filter explained in Section 4.2.4. This was done for each light's visibility texture to further denoise them.

4.2.7 Deferred Shading

The resulting visibility values from the previous passes are used in the per-light calculations as a factor within [0, 1] of how much a pixel is in shadow. This was implemented with a pixel shader by invoking each pixel once using a full-screen quad. Simple diffuse lighting was used, calculated by taking the dot product of the normal from the g-buffer and the direction of the light [16].
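How the per-light visibility factor enters the diffuse term can be sketched as below. This is a minimal sketch, not the shader used in this work: light color and attenuation are left out, the function names are hypothetical, and clamping the dot product at zero is an assumption that follows common practice.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def shade_diffuse(normal, lights):
    """Sum the diffuse term of each light, scaled by that light's visibility.

    normal: unit surface normal from the g-buffer.
    lights: list of (unit direction towards the light, visibility in [0, 1]).
    """
    intensity = 0.0
    for to_light, visibility in lights:
        # Clamping at zero (an assumption) keeps back-facing lights from
        # subtracting light from the pixel.
        intensity += max(dot(normal, to_light), 0.0) * visibility
    return intensity


up = (0.0, 1.0, 0.0)
lights = [((0.0, 1.0, 0.0), 0.5),    # overhead light, pixel half in shadow
          ((0.0, -1.0, 0.0), 1.0)]   # light below the surface, contributes nothing
result = shade_diffuse(up, lights)
```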

4.2.8 Temporal Anti-Aliasing

Temporal anti-aliasing (TAA) reduces aliasing by reusing samples from past frames and integrating them in the current frame [25]. Since the decision not to use motion vectors was taken, the implementation does not re-project the older samples using motion vectors as needed when the scene view is changing. This is because of the delimitation of only using static cameras, and therefore the old samples can just be reused without re-projecting them first. Each pixel is calculated by an exponentially weighted moving average implemented in a pixel shader by invoking each pixel once using a full-screen quad. In the formula (4.2), n is the index for the current frame, and k is a constant.

$$ x_n = x_n \cdot (1 - k) + x_{n-1} \cdot k \tag{4.2} $$
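The exponentially weighted moving average of Equation (4.2) can be traced for one pixel with the short Python sketch below. The value of k is not given in this work, so k = 0.5 is an assumed value chosen to keep the numbers easy to follow, and the starting history of zero is likewise an assumption.

```python
def temporal_blend(current, history, k):
    """Exponentially weighted moving average of the current frame and the history."""
    return current * (1.0 - k) + history * k


K = 0.5            # assumed blend constant; its actual value is not stated
history = 0.0      # previously accumulated value of one pixel
values = []
for current in [1.0, 1.0, 1.0]:  # a static camera sees the same pixel value each frame
    history = temporal_blend(current, history, K)
    values.append(history)
```

Each frame halves the remaining distance to the new value, so a larger k gives a smoother but slower-reacting result.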

4.3 Benchmarking

The following section lists the system used in the benchmarks, the scene-specific information, how the mean squared error was calculated, and ends with explaining how the benchmarks were conducted.

4.3.1 System under Test

To get comparable results for each test using different GPUs and to minimize the number of unknown factors during testing, only the GPU was changed for each test. All measurements were taken with the following hardware, on which the GPU was changed for each test:

CPU      Intel Core i7 6700k @ 4100 MHz
RAM      Corsair 16 GB (2x8 GB) DDR4 2133 MHz, DRAM freq: 1066 MHz
GPU      RTX 3080, RTX 2070, Radeon 6900XT, Radeon 6700XT
Drivers  Nvidia's GPUs (465.89) & AMD's GPUs (21.3.2)

Table 4.1: Benchmark Setup

Figure 4.3: Screenshot from how the camera was positioned during the benchmark of the Sponza scene.

4.3.2 Scenes

To get results that are comparable with other research [12, p. 176], the number of lights in the scenes was set to one, two, four, and eight. To include both different scene complexity and different model complexity, both Sponza and the Stanford Dragon were chosen as test models.

Sponza

Sponza is a scene that is commonly used for benchmarking purposes, which is one of the reasons why it was included in this work. Another reason is that it is a scene with high complexity, meaning that it consists of many different shapes, forms, corners, and occluders, which is well suited for testing shadow generation. The Sponza version used in this experiment had 145 185 vertices and 262 205 triangles. The lights were evenly distributed around the scene to match a scenario that could occur in a game or a 3D application. However, since the entire scene could not be covered inside the camera frustum with a static camera during the benchmarking, a position and direction for the camera were manually chosen, which can be seen in Figure 4.3.

Dragon

The Stanford Dragon is an object that is also commonly used for benchmarking purposes, which is one of the reasons why it was included in this work. Another reason is that it is an object with high complexity, meaning that the triangle count is considered high. The version used in this experiment had 437 645 vertices and 871 414 triangles. This object got a dedicated scene, where only the Stanford Dragon and a plane were rendered alongside the varying number of lights. The lights were evenly distributed around the dragon, as in the Sponza scene, to generate shadows in as many positions as possible. See Figure 4.4 for how the camera was placed during the benchmarks.

Figure 4.4: Screenshot from how the camera was positioned during the benchmark of the Dragon scene.

Sponza and Four Stanford Dragons

To include a scene that could be considered heavy on performance, Sponza and four Stanford Dragons were combined into a single scene with a total of 3 747 861 triangles. This scene was constructed to stress test the ray-tracing API with both a highly complex scene and four highly complex objects together. The camera was positioned in the same place as in the Sponza scene. See Figure 4.5 for a screenshot of how the camera was positioned during the benchmarks.

Figure 4.5: Screenshot from how the camera was positioned during the benchmark of the Sponza4Dragons scene.

4.3.3 Mean Squared Error

The mean squared error (MSE) is the average of the squares of the errors, where an error is the difference between an estimated value and the actual value. The MSE can be used to compare textures to each other in order to see how different they are. If the MSE is low, the difference between the compared values is small and the two textures are close to being equal. The MSE was calculated using an external tool called ImageMagick, with the graphical outputs from the three different ray-tracing methods (illustrated by the orange color in Figure 4.1) as inputs to the tool. The ray-direction seed was set to a static number when calculating the MSE so that the visibility buffers were expected to be equal. The MSE was also calculated between Nvidia's RTX 3080 card and AMD's RADEON 6700XT card to ensure that the MSE stays the same even with GPUs from different manufacturers.
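For reference, the MSE between two textures can be written out as in the Python sketch below. This is a stand-in for the definition only; the measurements in this work were made with ImageMagick, whose channel handling and normalization may differ, and the texel values shown are hypothetical.

```python
def mean_squared_error(texture_a, texture_b):
    """MSE between two equally sized lists of texel values."""
    if len(texture_a) != len(texture_b):
        raise ValueError("textures must have the same number of texels")
    squared = [(a - b) ** 2 for a, b in zip(texture_a, texture_b)]
    return sum(squared) / len(squared)


# Hypothetical visibility buffers flattened to one value per texel.
drt = [0.0, 1.0, 1.0, 0.0]
irtc = [0.0, 1.0, 1.0, 0.0]
different = [1.0, 1.0, 1.0, 0.0]

identical_mse = mean_squared_error(drt, irtc)
changed_mse = mean_squared_error(drt, different)
```

An MSE of exactly zero is only possible when every texel matches, which is why a static ray-direction seed is needed for the comparison.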

4.3.4 Performance Measurement

To determine the favorable ray-tracing method in terms of performance, impartial and comparable results are essential. To achieve this, the measurements are acquired using timestamp queries (D3D12_QUERY_TYPE_TIMESTAMP) stored in a QueryHeap. The QueryHeap is used to get the number of clock ticks that passed between two timestamps on the GPU so that the time elapsed between them can be calculated. To convert these clock ticks into milliseconds, the ticks per second of the GPU have to be acquired, which can be done with the function GetTimestampFrequency(). With this information, the delta time between two timestamps can be converted to milliseconds with Equation (4.3). To ensure that the GPU clock runs at the same frequency, SetStablePowerState(true) is used [20].

$$ \text{elapsed time (ms)} = \frac{\text{ticks}}{\text{ticks/second}} \cdot 1000 \tag{4.3} $$

The results gathered during a frame consist of two time measurements. The first measurement, M1, encapsulates the ray-tracing pass (DRT, IRTP, and IRTC), illustrated by the orange color in Figure 4.1. All API calls for the ray-tracing pass are measured to keep the comparison fair between the ray-tracing methods. This means that not just the execution of ray tracing is measured but also the time it takes for the GPU to prepare for ray tracing. The most notable differences are that IRTP executes a draw call for a full-screen quad, IRTC calls Dispatch(), and DRT calls DispatchRays(). The second measurement, M2, includes the entire rendering pipeline, illustrated in Figure 4.1. To determine whether the choice of ray-tracing method influences the execution time of rendering soft shadows in any other way than through the performance of the ray-tracing method itself, the execution time of the pipeline with ray tracing excluded is investigated by calculating the difference between M2 and M1. The application calls ExecuteCommandLists() once per frame to observe any optimizations that the driver may do on subsequent and/or preceding API calls depending on the ray-tracing method [19].

There are a total of 36 tests performed for each GPU: one test for each combination of ray-tracing method (DRT, IRTP, IRTC), scene (Dragon, Sponza, Sponza4Dragons), and number of lights (1, 2, 4, 8). In all 36 tests, the average frame time is calculated and used to present the data. The application is restarted for each test to minimize the impact hardware has on the result, as there could be hardware and driver optimizations that developers are unaware of [27, p. 568]. The first 500 frames of each test are skipped for the same reason. The tests were conducted at the Blekinge Institute of Technology (BTH) to have access to the GPUs, which limited the time available for the testing.
Before finalizing the number of frames to include in the benchmarks, testing was conducted to monitor how the variance in the execution times changed when increasing the number of frames in a test. Doubling the number of frames from 3600 to 7200 on the Sponza4Dragons scene with eight lights, which is the scene where the most variance was expected because of its complexity, gave a difference in variance of 0.02 ms. Rendering applications are currently often presented at around 60-144 frames per second (fps) [4, 8, 29, 31]. Even with the highest value of 144 fps (≈ 6.94 ms per frame) and the difference of 0.02 ms mentioned above, the ratio between the average execution times becomes $\frac{6.96}{6.94} \approx 1.0029$. Based on these calculations, a decision was made that 3600 frames are enough to ensure reliable results.
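The conversion in Equation (4.3) and the derivation of the pipeline time without ray tracing can be traced with the Python sketch below. The timestamp values and the frequency are hypothetical and chosen only to give round numbers; in the application they would come from the resolved timestamp queries and from GetTimestampFrequency().

```python
def ticks_to_ms(begin_ticks, end_ticks, ticks_per_second):
    """Convert the distance between two GPU timestamps to milliseconds."""
    return (end_ticks - begin_ticks) * 1000.0 / ticks_per_second


# Hypothetical timestamp values and frequency, for illustration only.
FREQUENCY = 100_000_000                             # ticks per second
m1 = ticks_to_ms(1_000_000, 1_250_000, FREQUENCY)   # M1: ray-tracing pass
m2 = ticks_to_ms(900_000, 1_600_000, FREQUENCY)     # M2: entire pipeline
rest_of_pipeline = m2 - m1                          # pipeline with ray tracing excluded
```

Here M1 is 2.5 ms and M2 is 7.0 ms, so the remaining passes account for 4.5 ms; it is this difference that is compared between the ray-tracing methods.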

Chapter 5 Results and Analysis

This chapter will present the results gathered during benchmarking, followed by an analysis of the results.

5.1 Performance Results

The figures in this section are grouped per scene and per number of lights. In each figure, the ray-tracing methods are compared across the different GPUs. The left subfigure covers only the ray-tracing part (M1), whilst the right subfigure presents the execution time of the entire pipeline (M2). Besides presenting the average execution time of each test as clustered bars, whisker plots are included to show the sample distribution. The whisker box corresponds to 50% of the samples and is not visible in all figures because the values are almost identical to each other in some cases.

5.1.1 Sponza

Figure 5.1: The Sponza scene with one light.


Figure 5.2: The Sponza scene with two lights.

Figure 5.3: The Sponza scene with four lights.

Figure 5.4: The Sponza scene with eight lights.

5.1.2 Dragon

Figure 5.5: The Dragon scene with one light.

Figure 5.6: The Dragon scene with two lights.

Figure 5.7: The Dragon scene with four lights.

Figure 5.8: The Dragon scene with eight lights.

5.1.3 Sponza and four Stanford Dragons

Figure 5.9: The Sponza4Dragons scene with one light.

Figure 5.10: The Sponza4Dragons scene with two lights.

Figure 5.11: The Sponza4Dragons scene with four lights.

Figure 5.12: The Sponza4Dragons scene with eight lights.

5.2 Analysis

In this section, the results from the performance tests (Figures 5.1-5.12) are analyzed. For a more detailed view of the data, see Appendix A.1-A.6. To make the ray-tracing methods easier to compare, an average derived from all ray-tracing methods is provided. The % column shows the execution time for a specific method divided by the average of all the methods.
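The % column can be reproduced as in the Python sketch below. The execution times are hypothetical and not taken from the appendix; the function name is made up for the illustration.

```python
def relative_to_average(times_ms):
    """Return each method's time as a percentage of the mean over all methods."""
    average = sum(times_ms.values()) / len(times_ms)
    return {method: time / average * 100.0 for method, time in times_ms.items()}


# Hypothetical execution times in milliseconds, not measured values.
example = {"DRT": 2.0, "IRTC": 4.0, "IRTP": 3.0}
percentages = relative_to_average(example)
```

With these numbers the average is 3.0 ms, so DRT lands at about 66.7%, IRTP at 100%, and IRTC at about 133.3%; values below 100% mark methods that are faster than the average.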

5.2.1 Ray-Tracing Performance

AMD's GPUs have the longest execution time using IRTC when compared to DRT and IRTP. Across all scenes, AMD's GPUs perform best with DRT, with an execution time that is regularly 20%-40% faster than with IRTC. IRTC is observed to have the slowest execution times except for two cases where it is faster than IRTP, as can be seen in Figures 5.4 and 5.5. The results gathered from Nvidia's GPUs tend to fluctuate less between the ray-tracing methods. On Nvidia's GPUs, IRTC and IRTP have similar execution times, and DRT is the slowest. The execution times using DRT are usually 10%-30% slower than with IRTC and/or IRTP, which is the opposite of what was observed on AMD's GPUs. Comparing the results from the RX 6900 XT and the RTX 3080 on the Dragon scene with eight lights, which took ≈ 2.7 ms, with those on the Sponza4Dragons scene with four lights, which took about the same time, shows that the less complex Dragon scene gives a wider spread of the ray-tracing execution times. Furthermore, DRT is observed to perform better relative to IRTC and IRTP in this comparison. Observing all the scenes with an increasing number of lights gives no indication that the number of lights affects the relative performance of the ray-tracing methods.

5.2.2 Whole Pipeline Performance

The M2-M1 results in Appendix A.1-A.6 show that the fastest and slowest ray-tracing methods are within 0-7% of each other when the ray-tracing measurement is excluded. Differences higher than 5% are observed exclusively on the RX 6900 XT, often in the Sponza scene. On the RX 6900 XT specifically, the denoising passes consistently execute faster with DRT than with IRTC. However, 70% of the measurements show a difference of only 0-2%, and no patterns between the different methods can be found. As such, the optimizations the driver can do appear to have low to no impact on the relative performance between the ray-tracing methods.

5.2.3 Mean Squared Error Results

The MSE was zero in all tests, which means that the visibility buffers were identical between the three ray-tracing methods. Furthermore, the results were also zero between Nvidia's RTX 3080 card and AMD's RADEON 6700XT card.

Chapter 6 Discussion

This chapter discusses the experiment results and some validity threats that exist. As described in Section 1.2.1, the research questions were about evaluating the difference in performance of the three ray-tracing methods (DRT, IRTC, and IRTP). RQ1 is about comparing the performance of the ray-tracing methods with different scene complexity, whilst RQ2 is focused on how the ray-tracing methods perform with different GPUs. These questions will be discussed in the following section.

6.1 Ray Tracing in DXR

Even though DRT seems to be faster than inline ray tracing on AMD's GPUs, this might not be the case on other GPUs and/or drivers by AMD. The same goes for Nvidia: dynamic ray tracing might be faster on other GPUs and/or drivers of theirs. However, the results suggest that the choice of ray-tracing method should not be taken for granted, because it can decrease the ray-tracing execution time by up to almost 50% in some situations, according to the benchmarks of the Radeon RX 6900XT in Figure 5.6. Patel (Engineer on the team) said the following: "Perhaps the developer knows their scenario is simple enough that the overhead of dynamic shader scheduling is not worthwhile. For example a well constrained way of calculating shadows." [24]. According to the results in this experiment, this is not the case for AMD's GPUs, where inline ray tracing is slower for generating the visibility textures. Patel also said the following: "The basic assumption is that scenarios with many complex shaders will run better with dynamic-shader-based raytracing. As opposed to using massive inline raytracing uber-shaders. And scenarios that would use a very minimal shading complexity and/or very few shaders might run better with inline raytracing." [24]. Read against this hypothesis, the results gathered in the performance tests in this work suggest that generating visibility textures with up to eight lights behaves like a low-complexity workload on Nvidia's GPUs, where inline ray tracing is faster, but like a high-complexity workload on AMD's GPUs, where DRT is faster. The subsequent and/or preceding API interactions used for generating soft shadows were affected little or not at all by the choice of ray-tracing method in this experiment. However, the difference might be larger in other situations, such as when the shaders are more complex or when the application runs more pipelines instead of only generating soft shadows as in this experiment.
One reason why the impact was low to none might be that the GPU has to finish filling the visibility buffers for each light before executing the subsequent pipeline passes. There are a few minor setup steps for the subsequent and/or preceding passes that the driver might be able to perform differently depending on the ray-tracing method, but as the results show, this had little to no impact on the final results.

Another interesting observation is that the BLASes are larger on AMD's GPUs than on Nvidia's GPUs. This means that more memory has to be fetched for each triangle during ray traversal, which can result in slower performance. The average size per triangle for the Stanford Dragon was 145.6 bytes on a Radeon card, whilst it was only 65.6 bytes on an Nvidia card. This data was gathered using the D3D12_RAYTRACING_ACCELERATION_STRUCTURE_POSTBUILD_INFO_DESC structure when building the acceleration structures. See Microsoft's documentation for more information [17].
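As a rough illustration of how a per-triangle figure like the ones above can be derived, the sketch below divides an acceleration-structure size in bytes by the triangle count of the mesh. It assumes that the size has already been read back from a post-build info query; the function name is hypothetical, and the numbers in the usage note are made up for illustration rather than taken from the measurements in this work.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: average BLAS footprint per triangle, in bytes.
// blasSizeBytes is assumed to come from a post-build info readback.
double averageBytesPerTriangle(std::uint64_t blasSizeBytes, std::uint64_t triangleCount)
{
    assert(triangleCount > 0 && "the mesh must contain at least one triangle");
    return static_cast<double>(blasSizeBytes) / static_cast<double>(triangleCount);
}
```

For example, a BLAS of 1 456 000 bytes built over 10 000 triangles would give roughly 145.6 bytes per triangle.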

6.2 Validity Threats

This section lists several threats to the validity of this work and how they were dealt with.

6.2.1 Identical Visibility Buffers

The mean squared error (MSE) was computed over all pixels of the visibility buffers produced by the three ray-tracing methods, and also between the Nvidia RTX 3080 and AMD's Radeon RX 6700XT, to make sure that the difference stays at zero even when changing the graphics card, which it did. It would have been harder to decide which ray-tracing method was better if both the performance and the visual output had differed. Since the graphical outputs are identical and the only difference between the methods is the execution time, it becomes easier to decide which method is most appropriate for the experiments conducted in this thesis. With this information, the visual output of the ray-tracing methods is no threat to the validity of this work.
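The comparison described above can be sketched as follows. This is a minimal CPU-side version that assumes the two visibility buffers have been read back into equally sized arrays of per-pixel values; the function name and the buffer layout are assumptions and not the tooling used in this work.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mean squared error between two visibility buffers of equal size.
// Each element is assumed to be a per-pixel visibility value in [0, 1].
double meanSquaredError(const std::vector<float>& a, const std::vector<float>& b)
{
    assert(a.size() == b.size() && !a.empty());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        const double diff = static_cast<double>(a[i]) - static_cast<double>(b[i]);
        sum += diff * diff;
    }
    return sum / static_cast<double>(a.size());
}
```

An MSE of exactly zero means that every pixel matches, which is the condition reported above for the three methods and the two graphics cards.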

6.2.2 Method

The pipelines were implemented solely for generating soft shadows. In real-time graphics applications such as games, this scenario is unrealistic because the pipeline would be more complex, combining shadows with several other rendering techniques to create the final image. With this in mind, the results gathered in this work might differ when other work is scheduled on the GPU at the same time. However, this will remain an issue for a long time, and developers will need to run their own benchmarks, specifically targeting their application, to find the most appropriate soft-shadow method for their situation.

Another issue with this implementation is that it is limited to static scenes and a static camera, which is unacceptable for games and other real-time graphics applications. However, since the purpose of this thesis is to compare the performance of the ray-tracing methods, we believe that the method was sufficient. Also, adding dynamic scenes and a moving camera to the experiment would probably impact all three ray-tracing methods equally, so the relative performance between the methods would stay nearly unchanged unless the driver acts differently depending on the specific API calls involved, which is unlikely.

6.2.3 Spikes in the Tests

Some issues regarding the execution time occurred during the benchmarks. In some tests (for undetermined reasons), the execution time was significantly higher for ≈ 5% of the frames. This can be seen in the box plots in the results, where some values are much higher than expected. This was investigated with timestamp queries, which showed that the spikes occur at different locations within the command list, indicating that the spikes might be caused by the GPU waiting on something. Memory fetching is a bottleneck that can often cause the GPU to wait and is our main suspect for the spikes. Another possibility is resource management within the GPU, which can cause waits because some resource states can be in a compressed format. There is some randomness in the ray-tracing part of the pipeline, but it was ruled out as the reason for the spikes since they still occurred when the ray-tracing part was disabled. While we think it is our code that causes the spikes, the drivers may also be responsible, since Nvidia's GPUs seem to have larger spikes. Since ≈ 5% of the frames have spikes, the average value was increased by at most ≈ 0.5 ms on rare occasions. Since the cause is unknown and the frequency of the spikes is low, the spikes were included in the results because they are considered to have a limited effect on them.
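To make the effect of such spikes on an average concrete, the sketch below classifies frames as spikes and compares the mean with and without them. The spike criterion (a frame time above twice the median) is an arbitrary assumption chosen for illustration; this work does not define a threshold, and the names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct SpikeStats
{
    double spikeFraction;     // share of frames classified as spikes
    double meanAll;           // mean over all frames (ms)
    double meanWithoutSpikes; // mean over the non-spike frames (ms)
};

// A frame counts as a spike when it exceeds factor * median.
// The default factor of 2 is an assumption, not a value from the thesis.
SpikeStats analyzeSpikes(const std::vector<double>& frameTimesMs, double factor = 2.0)
{
    assert(!frameTimesMs.empty() && factor >= 1.0);
    std::vector<double> sorted = frameTimesMs;
    std::sort(sorted.begin(), sorted.end());
    const double threshold = factor * sorted[sorted.size() / 2];

    double sumAll = 0.0;
    double sumNormal = 0.0;
    std::size_t normalCount = 0;
    for (double t : frameTimesMs)
    {
        sumAll += t;
        if (t <= threshold)
        {
            sumNormal += t;
            ++normalCount;
        }
    }
    const double total = static_cast<double>(frameTimesMs.size());
    const double normal = static_cast<double>(normalCount);
    return {1.0 - normal / total, sumAll / total, sumNormal / normal};
}
```

With 19 frames at 1.0 ms and a single frame at 11.0 ms (made-up values), 5% of the frames are spikes and they raise the mean from 1.0 ms to 1.5 ms.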

6.2.4 Scene Setup

For the setup of the scenes in this work, several state-of-the-art scenes and objects were considered, and the final choices were Sponza (for scene complexity) and the Stanford Dragon (for object complexity). Since developers do not have access to the source code of the acceleration-structure traversal in DXR, there is no way of knowing how the bounding boxes are generated for each object. With that in mind, performance might differ when rendering other objects with different shapes, since the drivers may optimize them differently.

6.3 Similarities with Related Benchmarks

To further ensure that the results gathered from the benchmarks in this experiment were reliable, other similar benchmarks were sought out. Walton [34] presents a GPU hierarchy based on the benchmarks (scored in percent) that Tom's Hardware ran in 2021, which is similar to what the results of this experiment show. See Table 6.1 for the rankings of the GPUs used in their benchmarks. In the results of this work, the Radeon 6900XT is close to the RTX 3080 in performance, followed by the Radeon 6700XT, with the RTX 2070 at the bottom. Since the RTX 2070 is from an older generation, this was expected. Furthermore, Moass [22] found that in the game Cyberpunk 2077, the RTX 3080 was around 50% faster than the Radeon 6900XT with ray tracing enabled, whereas with ray tracing disabled, the Radeon 6900XT was around 6% faster. This is consistent with the results in this work, where the RTX 3080 performed better in most tests, probably because around 50% of the work was ray tracing in most of the benchmarks that were conducted.

| Graphics Processing Unit | Score |
|---|---|
| AMD Radeon RX 6900XT | 97.0% |
| Nvidia GeForce RTX 3080 | 93.1% |
| AMD Radeon RX 6700XT | 73.3% |
| Nvidia GeForce RTX 2070 | 53.1% |

Table 6.1: GPU performance hierarchy according to Tom's Hardware [34].

6.4 Contribution and Recommendations

The findings in this work contribute to making applications that use DXR perform better. The work also explains some key concepts regarding ray tracing and DXR in general, which can serve as an additional source of information for learning activities. Finally, to the authors' knowledge, there is limited public knowledge about the performance of inline ray tracing, which this work helps to increase.

With the results gathered in this work in mind, the suggestion to developers is to implement at least two versions of ray tracing in their applications (dynamic ray tracing and one of the inline ray-tracing methods) so that the method can be switched depending on the user's GPU: if the user has a GPU from Nvidia, use inline ray tracing, and if the user has a GPU from AMD, use dynamic ray tracing.

Chapter 7 Conclusions and Future Work

This chapter presents the conclusions for the research questions and gives some ideas for future work in this area.

7.1 Conclusions

The following sections answer the research questions of this thesis.

How do DRT, IRTC, and IRTP compare in ray-tracing performance with and without generating soft shadows?
RQ1: How do DRT, IRTC, and IRTP compare in performance with different scene complexity (objects and lights)?
RQ2: How do DRT, IRTC, and IRTP compare in performance with different GPUs (RTX2070, RTX3080, RX6700XT, RX6900XT)?

When generating soft shadows, the optimizations that the driver could perform had little to no impact on performance. With that in mind, the following subsections describe how the different ray-tracing methods performed against each other without taking soft shadows into account.

7.1.1 Scene and Object Complexity (RQ1)

With an increasing number of lights in each experiment, no significant change occurred in how the ray-tracing methods compared with each other, other than each test taking longer. Similar results can be seen in the scene- and object-complexity tests, where there was also no significant change in how the three ray-tracing methods compared.

7.1.2 Graphics Processing Units (RQ2)

The results gathered in this experiment suggest that dynamic ray tracing is faster on AMD's GPUs, whilst inline ray tracing is faster on Nvidia's GPUs.
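A vendor-dependent result like this one could be acted on at start-up with a small selection function, sketched below. The PCI vendor IDs are the standard ones for AMD and Nvidia (as reported, for example, in DXGI_ADAPTER_DESC1::VendorId); the enum, the function name, and the fallback for other vendors are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

enum class RayTracingMethod { Dynamic, Inline };

// Standard PCI vendor IDs.
constexpr std::uint32_t kVendorAmd = 0x1002;
constexpr std::uint32_t kVendorNvidia = 0x10DE;

// Picks the method that was faster for each vendor in the benchmarks of this work.
// The fallback for any other vendor is an arbitrary assumption.
RayTracingMethod chooseRayTracingMethod(std::uint32_t vendorId)
{
    switch (vendorId)
    {
    case kVendorAmd:
        return RayTracingMethod::Dynamic;
    case kVendorNvidia:
        return RayTracingMethod::Inline;
    default:
        return RayTracingMethod::Dynamic;
    }
}
```

Since the ranking might change with other GPUs and driver versions, a benchmark run on the user's machine would be a more robust basis for the choice than the vendor ID alone.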

7.2 Future Work

This section provides several ideas that could be considered for future work.


7.2.1 Benchmarking

To further improve on the results, a scene with many objects could be added to see how the three methods compare when the number of TLAS instances is high. Currently, the test scenes focus only on high scene and model complexity (Sponza and the Stanford Dragon), and no test focuses on a large number of objects. For instance, it would be interesting to see how the three methods compare when rendering 10 000 triangles as separate instances in the TLAS.

7.2.2 Ray-Tracing Complexity

Since the ray-tracing part of this work only focused on generating shadows, the shader tables excluded hit groups because no per-instance resource data such as textures was needed. However, in applications such as games, hit groups might be needed in the shader table to get instance-level information about each triangle hit. It would be interesting to see how DRT performs when the shader-table complexity increases with such an addition.

7.2.3 Pipeline Complexity

It would be interesting to see how a more complex pipeline affects the ray-tracing methods. Games usually have several rendering techniques active at the same time to achieve high-quality visuals, which could give the driver even more possibilities to optimize the performance when given several tasks at once. With that in mind, it would be interesting to see how the ray-tracing methods perform with reflections, several compute shaders running asynchronously, and/or a more complex denoiser such as the one Baumeister [6] presents.

7.2.4 Vulkan

It would be interesting to see how the dynamic and the inline forms of ray tracing would perform when using the Vulkan API instead of DirectX 12. Since Vulkan is a cross-platform API, it would also be interesting to benchmark on various platforms such as Linux and/or Android.

7.2.5 Cast Fewer Rays

To reduce the number of rays to trace, there are two optimizations that would be interesting for future work. These are explained in the following sections.

Variable Rate Shading

It would be interesting to see if one can achieve similar graphical results using Variable Rate Shading (VRS). VRS is a technique that batches certain pixels together. This batching could be used to trace fewer rays on certain parts of the screen.

Half Resolution

To further reduce the number of rays, one could ray trace and denoise at half resolution and then upsample afterwards. It would be interesting to see how this optimization would impact both the performance and the visual quality.
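The potential saving can be estimated with simple arithmetic, sketched below. The sketch assumes that every pixel traces a fixed number of rays towards each light, as in the shaders in Appendix B, and that "half resolution" means halving both the width and the height; the function name and the 1920x1080 example resolution are assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Shadow rays traced per frame when every pixel traces raysPerPixel rays per light.
std::uint64_t shadowRaysPerFrame(std::uint32_t width, std::uint32_t height,
                                 std::uint32_t lights, std::uint32_t raysPerPixel)
{
    return static_cast<std::uint64_t>(width) * height * lights * raysPerPixel;
}
```

Under these assumptions, eight lights at one ray per pixel give 16 588 800 rays at 1920x1080 and 4 147 200 rays at 960x540, which is a quarter of the ray count.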

7.3 Final Words

In the future, ray tracing will probably be used in most games that aim for high-quality graphics. Since ray tracing is expensive in terms of performance, further optimizations in both software and hardware will be needed to satisfy the ever-increasing demand for realism in games.

Bibliography

[1] N. Alarcon, “RT Essentials: Basics of Ray Tracing,” Jan. 2020. [Online]. Available: https://developer.nvidia.com/blog/ray-tracing-essentials-part-1-basics-of-ray-tracing/
[2] AMD, “Graphics Gaming Benchmarks,” 2021. [Online]. Available: https://www.amd.com/en/gaming/graphics-gaming-benchmarks
[3] ARM, “ARM-Graphics: Depth pre-pass.” [Online]. Available: https://developer.arm.com/documentation/100140/0302/optimization-lists/gpu-optimizations/use-depth-pre-pass
[4] E. Arts, “PicaPica - Raytracing in Hybrid Real-Time Rendering,” Jul. 2018. [Online]. Available: https://www.ea.com/seed/news/seed-dd18-presentation-slides-raytracing
[5] U. Assarsson and T. Akenine-Moller, “A Geometry-based Soft Shadow Volume Algorithm using Graphics Hardware,” ACM Transactions on Graphics, vol. 22, no. 3, p. 10, Jul. 2003.
[6] D. Baumeister, “Microsoft: Game Stack Live: Denoising Raytraced Soft Shadows with FidelityFX,” 2021. [Online]. Available: https://gpuopen.com/videos/gsl-denoising-soft-shadows/
[7] A. Burnes, “Battlefield V Update and GeForce,” Mar. 2018. [Online]. Available: https://www.nvidia.com/en-us/geforce/news/battlefield-v-december-4-dxr-update/
[8] M. Campbell, “Call of Duty: Modern Warfare RTX,” Oct. 2019. [Online]. Available: https://www.overclock3d.net/reviews/software/call_of_duty_modern_warfare_rtx_raytracing_pc_analysis/1
[9] M. D3D Team, “Announcing DXR,” Mar. 2018. [Online]. Available: https://devblogs.microsoft.com/directx/announcing-microsoft-directx-raytracing/
[10] R. Daws, “Cross-platform graphics API Vulkan is now ‘ray-tracing ready’,” Dec. 2020. [Online]. Available: https://developer-tech.com/news/2020/dec/16/cross-platform-graphics-api-vulkan-ray-tracing-ready/
[11] E. Eisemann, U. Assarsson, M. Schwarz, and M. Wimmer, “Shadow algorithms for real-time rendering,” Eurographics 2010 - Tutorials, pp. 13–13, 2010.
[12] E. Haines and T. Akenine-Möller, Eds., Ray Tracing Gems: High-Quality and Real-Time Rendering with DXR and Other APIs. Berkeley, CA: Apress, 2019. [Online]. Available: http://link.springer.com/10.1007/978-1-4842-4427-2
[13] M.-K. Lefrancois and P. Gautron, “DX12 Raytracing tutorial - Part 1,” Aug. 2018. [Online]. Available: https://developer.nvidia.com/rtx/raytracing/dxr/DX12-Raytracing-tutorial-Part-1
[14] M.-K. Lefrancois and P. Gautron, “DX12 Raytracing tutorial - Part 2,” Aug. 2018. [Online]. Available: https://developer.nvidia.com/rtx/raytracing/dxr/dx12-raytracing-tutorial-part-2
[15] I. Llamas and E. Liu, “Coffee Break Series: Ray Tracing in Games with NVIDIA RTX,” Jun. 2018. [Online]. Available: https://developer.nvidia.com/blog/ray-tracing-games-nvidia-rtx/
[16] F. Luna, Introduction to 3D Game Programming with DirectX 12. Dulles, VA, USA: Mercury Learning & Information, 2016.
[17] Microsoft, “Build Ray tracing Acceleration Structure,” Dec. 2018. [Online]. Available: https://docs.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12graphicscommandlist4-buildraytracingaccelerationstructure
[18] Microsoft, “Dispatch Rays,” May 2018. [Online]. Available: https://docs.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12graphicscommandlist4-dispatchrays
[19] Microsoft, “ExecuteCommandLists,” 2018. [Online]. Available: https://docs.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12commandqueue-executecommandlists
[20] Microsoft, “SetStablePowerState,” 2018. [Online]. Available: https://docs.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12device-setstablepowerstate
[21] Microsoft, “TraceRay,” May 2018. [Online]. Available: https://docs.microsoft.com/en-us/windows/win32/direct3d12/traceray-function
[22] D. Moass, “Cyberpunk 2077: Ray Tracing Benchmarks,” 2021. [Online]. Available: https://www.kitguru.net/gaming/dominic-moass/cyberpunk-2077-ray-tracing-on-amd-gpus-benchmarked/
[23] M. Olejnik and P. Kozłowski, “Raytraced Shadows in Call of Duty: Modern Warfare,” 2020.
[24] A. Patel, “DirectX Raytracing (DXR) Tier 1.1,” Nov. 2019. [Online]. Available: https://devblogs.microsoft.com/directx/dxr-1-1/
[25] L. Pedersen, “Temporal Reprojection Anti-Aliasing in INSIDE,” 2016. [Online]. Available: https://www.gdcvault.com/play/1022970/Temporal-Reprojection-Anti-Aliasing-in
[26] S. Petersson, “Triangelplockaren, Metro Exodus,” Feb. 2019. [Online]. Available: https://www.youtube.com/watch?v=vYgonNKkUus&ab_channel=StefanPetersson
[27] M. Pharr and F. Randima, GPU Gems 2. Addison-Wesley Professional, 2005.
[28] H. Poulsen, “POTENTIAL OF GPU BASED HYBRID RAY TRACING FOR REAL-TIME GAMES,” Ph.D. dissertation, Blekinge Institute of Technology, Ronneby, 2009.
[29] J. Rodriguez, “The Medium - Tech Review,” Jan. 2021. [Online]. Available: https://www.pcinvasion.com/the-medium-technical-review-pc/
[30] C. Schied, A. Kaplanyan, C. Wyman, A. Patney, C. R. A. Chaitanya, J. Burgess, S. Liu, C. Dachsbacher, A. Lefohn, and M. Salvi, “Spatiotemporal variance-guided filtering: real-time reconstruction for path-traced global illumination,” in Proceedings of High Performance Graphics. Los Angeles California: ACM, Jul. 2017, pp. 1–12. [Online]. Available: https://dl.acm.org/doi/10.1145/3105762.3105770
[31] T. Schiesser, “Battlefield V DXR,” Nov. 2018. [Online]. Available: https://www.techspot.com/review/1749-battlefield-ray-tracing-benchmarks/
[32] J. Sjöholm, “NVIDIA RTX in Remedy Northlight,” Helsinki, 2018. [Online]. Available: https://on-demand.gputechconf.com/gtc-eu/2018/pdf/e8530-nvidia-rtx-in-remedy-northlight-engine.pdf
[33] J. Story, “Hybrid Ray-Traced Shadows,” San Francisco, 2015. [Online]. Available: http://developer.download.nvidia.com/assets/events/GDC15/hybrid_ray_traced_GDC_2015.pdf
[34] J. Walton, “Graphics Cards Ranked,” 2021. [Online]. Available: https://www.tomshardware.com/reviews/gpu-hierarchy,4388.html
[35] S. Whims, V. Kents, R. Garmsen, D. Batgit, and M. Satranjr, “RAY_flag enum,” May 2018. [Online]. Available: https://docs.microsoft.com/en-us/windows/win32/direct3d12/ray_flag

Appendix A Detailed Results

The tables in this appendix present detailed results of the benchmarks conducted in this experiment. The leftmost column lists the number of lights (L). The average of the three ray-tracing methods is calculated and presented in the avg rows. The (%) columns show the performance increase of a method compared to the average of all the methods. To recap, M1 measures the ray tracing and M2 measures the whole pipeline execution. M2 - M1 shows how the different methods affect the pipeline without the ray tracing accounted for.
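The percentage columns appear to follow the relation (average - time) / average, rounded to whole percent. This formula is inferred from the table values and is not stated explicitly in the thesis, so the sketch below should be read under that assumption; the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Relative gain of a method compared to the average of all methods, in percent.
// Positive values mean the method was faster (shorter time) than the average.
double gainVersusAverage(double timeMs, double averageMs)
{
    assert(averageMs > 0.0);
    return (averageMs - timeMs) / averageMs * 100.0;
}

// The tables show whole percentages, so round to the nearest integer.
long roundedGain(double timeMs, double averageMs)
{
    return std::lround(gainVersusAverage(timeMs, averageMs));
}
```

Applied to the first group of Table A.1 (M1 on the Radeon RX 6900 XT with one light), the times 0.863619, 0.646131, and 0.552861 ms against their average reproduce the listed -26%, 6%, and 20%.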


A.1 Sponza

| L | Graphics Card | Method | M1 (ms) | % | M2 - M1 (ms) | % | M2 (ms) | % |
|---|---|---|---|---|---|---|---|---|
| 1 | AMD Radeon RX 6900 XT | IRTC | 0.863619 | -26% | 1.183426 | -3% | 2.047045 | -12% |
| | | IRTP | 0.646131 | 6% | 1.135783 | 1% | 1.781914 | 3% |
| | | DRT | 0.552861 | 20% | 1.111995 | 3% | 1.664856 | 9% |
| | | avg | 0.687537 | | 1.143735 | | 1.831272 | |
| 1 | AMD Radeon RX 6700 XT | IRTC | 1.244042 | -16% | 1.590819 | 0% | 2.834861 | -7% |
| | | IRTP | 1.024717 | 5% | 1.579883 | 0% | 2.604600 | 2% |
| | | DRT | 0.951092 | 11% | 1.581119 | 0% | 2.532211 | 5% |
| | | avg | 1.073284 | | 1.583940 | | 2.657224 | |
| 1 | NVIDIA GeForce RTX 2070 | IRTC | 1.351892 | 2% | 1.965469 | 3% | 3.317361 | 3% |
| | | IRTP | 1.401342 | -1% | 2.065936 | -2% | 3.467278 | -2% |
| | | DRT | 1.394225 | -1% | 2.068053 | -2% | 3.462278 | -1% |
| | | avg | 1.382486 | | 2.033153 | | 3.415639 | |
| 1 | NVIDIA GeForce RTX 3080 | IRTC | 0.553631 | 5% | 0.840497 | 0% | 1.394128 | 2% |
| | | IRTP | 0.560203 | 4% | 0.842005 | 0% | 1.402208 | 2% |
| | | DRT | 0.636978 | -9% | 0.838289 | 0% | 1.475267 | -4% |
| | | avg | 0.583604 | | 0.840264 | | 1.423868 | |
| 2 | AMD Radeon RX 6900 XT | IRTC | 1.637294 | -24% | 2.092845 | -4% | 3.730139 | -12% |
| | | IRTP | 1.254150 | 5% | 2.006545 | 0% | 3.260695 | 2% |
| | | DRT | 1.057736 | 20% | 1.950486 | 3% | 3.008222 | 10% |
| | | avg | 1.316393 | | 2.016625 | | 3.333019 | |
| 2 | AMD Radeon RX 6700 XT | IRTC | 2.364578 | -15% | 2.815561 | -1% | 5.180139 | -7% |
| | | IRTP | 1.962267 | 4% | 2.785872 | 0% | 4.748139 | 2% |
| | | DRT | 1.830733 | 11% | 2.790378 | 0% | 4.621111 | 5% |
| | | avg | 2.052526 | | 2.797270 | | 4.849796 | |
| 2 | NVIDIA GeForce RTX 2070 | IRTC | 2.799750 | 1% | 3.347250 | 0% | 6.147000 | 1% |
| | | IRTP | 2.851583 | -1% | 3.344528 | 0% | 6.196111 | 0% |
| | | DRT | 2.849250 | -1% | 3.345250 | 0% | 6.194500 | 0% |
| | | avg | 2.833528 | | 3.345676 | | 6.179204 | |
| 2 | NVIDIA GeForce RTX 3080 | IRTC | 1.143475 | 4% | 1.297008 | 0% | 2.440483 | 2% |
| | | IRTP | 1.151439 | 3% | 1.299880 | 0% | 2.451319 | 1% |
| | | DRT | 1.275139 | -7% | 1.297444 | 0% | 2.572583 | -3% |
| | | avg | 1.190018 | | 1.298111 | | 2.488128 | |

Table A.1: A detailed performance comparison of the different methods in the scene Sponza with 1 & 2 lights.

| L | Graphics Card | Method | M1 (ms) | % | M2 - M1 (ms) | % | M2 (ms) | % |
|---|---|---|---|---|---|---|---|---|
| 4 | AMD Radeon RX 6900 XT | IRTC | 3.330694 | -22% | 4.192584 | -3% | 7.523278 | -11% |
| | | IRTP | 2.681600 | 2% | 4.060900 | 0% | 6.742500 | 1% |
| | | DRT | 2.189800 | 20% | 3.942811 | 3% | 6.132611 | 10% |
| | | avg | 2.734031 | | 4.065432 | | 6.799463 | |
| 4 | AMD Radeon RX 6700 XT | IRTC | 4.759056 | -13% | 5.453221 | 0% | 10.212277 | -6% |
| | | IRTP | 4.103861 | 3% | 5.443639 | 0% | 9.547500 | 1% |
| | | DRT | 3.800611 | 10% | 5.462250 | 0% | 9.262861 | 4% |
| | | avg | 4.221176 | | 5.453037 | | 9.674213 | |
| 4 | NVIDIA GeForce RTX 2070 | IRTC | 6.066444 | 0% | 5.571528 | 1% | 11.637972 | 0% |
| | | IRTP | 5.988584 | 1% | 5.745027 | -2% | 11.733611 | -1% |
| | | DRT | 6.076250 | -1% | 5.526028 | 2% | 11.602278 | 0% |
| | | avg | 6.043759 | | 5.614194 | | 11.657954 | |
| 4 | NVIDIA GeForce RTX 3080 | IRTC | 2.388367 | 4% | 2.182411 | 0% | 4.570778 | 2% |
| | | IRTP | 2.395119 | 3% | 2.182048 | 0% | 4.577167 | 2% |
| | | DRT | 2.642083 | -7% | 2.187056 | 0% | 4.829139 | -4% |
| | | avg | 2.475190 | | 2.183838 | | 4.659028 | |
| 8 | AMD Radeon RX 6900 XT | IRTC | 5.349972 | -6% | 7.508694 | 2% | 12.858666 | -1% |
| | | IRTP | 5.436306 | -7% | 7.798862 | -2% | 13.235168 | -4% |
| | | DRT | 4.425389 | 13% | 7.565972 | 1% | 11.991361 | 6% |
| | | avg | 5.070556 | | 7.624509 | | 12.695065 | |
| 8 | AMD Radeon RX 6700 XT | IRTC | 9.226167 | -12% | 10.260165 | 0% | 19.486332 | -5% |
| | | IRTP | 8.004861 | 3% | 10.245862 | 0% | 18.250723 | 1% |
| | | DRT | 7.411195 | 10% | 10.281277 | 0% | 17.692472 | 4% |
| | | avg | 8.214074 | | 10.262435 | | 18.476509 | |
| 8 | NVIDIA GeForce RTX 2070 | IRTC | 12.114305 | 0% | 10.217251 | 0% | 22.331556 | 0% |
| | | IRTP | 12.225862 | -1% | 10.251887 | 0% | 22.477749 | -1% |
| | | DRT | 12.028889 | 1% | 10.191667 | 0% | 22.220556 | 1% |
| | | avg | 12.123019 | | 10.220268 | | 22.343287 | |
| 8 | NVIDIA GeForce RTX 3080 | IRTC | 4.886305 | 3% | 4.026556 | 0% | 8.912861 | 2% |
| | | IRTP | 4.888083 | 3% | 4.025167 | 0% | 8.913250 | 2% |
| | | DRT | 5.300472 | -5% | 4.024556 | 0% | 9.325028 | -3% |
| | | avg | 5.024953 | | 4.025426 | | 9.050380 | |

Table A.2: A detailed performance comparison of the different methods in the scene Sponza with 4 & 8 lights.

A.2 Dragon

| L | Graphics Card | Method | M1 (ms) | % | M2 - M1 (ms) | % | M2 (ms) | % |
|---|---|---|---|---|---|---|---|---|
| 1 | AMD Radeon RX 6900 XT | IRTC | 0.409106 | -10% | 1.603308 | -1% | 2.012414 | -3% |
| | | IRTP | 0.424686 | -14% | 1.587261 | 0% | 2.011947 | -3% |
| | | DRT | 0.283942 | 24% | 1.576308 | 1% | 1.860250 | 5% |
| | | avg | 0.372578 | | 1.588959 | | 1.961537 | |
| 1 | AMD Radeon RX 6700 XT | IRTC | 0.681956 | -12% | 2.290905 | 0% | 2.972861 | -3% |
| | | IRTP | 0.643086 | -6% | 2.284886 | 0% | 2.927972 | -1% |
| | | DRT | 0.497119 | 18% | 2.271381 | 0% | 2.768500 | 4% |
| | | avg | 0.607387 | | 2.282391 | | 2.889778 | |
| 1 | NVIDIA GeForce RTX 2070 | IRTC | 0.855619 | 1% | 2.339686 | 0% | 3.195305 | 0% |
| | | IRTP | 0.875164 | -1% | 2.344030 | 0% | 3.219194 | 0% |
| | | DRT | 0.869217 | 0% | 2.343755 | 0% | 3.212972 | 0% |
| | | avg | 0.866667 | | 2.342490 | | 3.209157 | |
| 1 | NVIDIA GeForce RTX 3080 | IRTC | 0.296836 | 13% | 1.080158 | 0% | 1.376994 | 3% |
| | | IRTP | 0.299081 | 12% | 1.078947 | 0% | 1.378028 | 3% |
| | | DRT | 0.428303 | -25% | 1.079386 | 0% | 1.507689 | -6% |
| | | avg | 0.341407 | | 1.079497 | | 1.420904 | |
| 2 | AMD Radeon RX 6900 XT | IRTC | 1.008070 | -19% | 2.555819 | -1% | 3.563889 | -5% |
| | | IRTP | 0.920083 | -8% | 2.547945 | 0% | 3.468028 | -2% |
| | | DRT | 0.620497 | 27% | 2.523586 | 1% | 3.144083 | 7% |
| | | avg | 0.849550 | | 2.542450 | | 3.392000 | |
| 2 | AMD Radeon RX 6700 XT | IRTC | 1.635922 | -20% | 3.566161 | 0% | 5.202083 | -6% |
| | | IRTP | 1.402592 | -3% | 3.557380 | 0% | 4.959972 | -1% |
| | | DRT | 1.058533 | 22% | 3.552217 | 0% | 4.610750 | 6% |
| | | avg | 1.365682 | | 3.558586 | | 4.924268 | |
| 2 | NVIDIA GeForce RTX 2070 | IRTC | 1.792164 | 1% | 3.424169 | 0% | 5.216333 | 0% |
| | | IRTP | 1.814767 | -1% | 3.427289 | 0% | 5.242056 | 0% |
| | | DRT | 1.799103 | 0% | 3.420341 | 0% | 5.219444 | 0% |
| | | avg | 1.802011 | | 3.423933 | | 5.225944 | |
| 2 | NVIDIA GeForce RTX 3080 | IRTC | 0.645978 | 8% | 1.558905 | 0% | 2.204883 | 3% |
| | | IRTP | 0.622769 | 12% | 1.566928 | 0% | 2.189697 | 3% |
| | | DRT | 0.848286 | -20% | 1.556586 | 0% | 2.404872 | -6% |
| | | avg | 0.705678 | | 1.560806 | | 2.266484 | |

Table A.3: A detailed performance comparison of the different methods in the scene Dragon with 1 & 2 lights.

| L | Graphics Card | Method | M1 (ms) | % | M2 - M1 (ms) | % | M2 (ms) | % |
|---|---|---|---|---|---|---|---|---|
| 4 | AMD Radeon RX 6900 XT | IRTC | 1.681161 | -19% | 4.528867 | -1% | 6.210028 | -5% |
| | | IRTP | 1.518364 | -7% | 4.478108 | 0% | 5.996472 | -2% |
| | | DRT | 1.049533 | 26% | 4.461995 | 1% | 5.511528 | 7% |
| | | avg | 1.416353 | | 4.489657 | | 5.906009 | |
| 4 | AMD Radeon RX 6700 XT | IRTC | 2.715561 | -21% | 6.135356 | 0% | 8.850917 | -5% |
| | | IRTP | 2.249553 | 0% | 6.130586 | 0% | 8.380139 | 0% |
| | | DRT | 1.791797 | 20% | 6.151453 | 0% | 7.943250 | 5% |
| | | avg | 2.252304 | | 6.139132 | | 8.391435 | |
| 4 | NVIDIA GeForce RTX 2070 | IRTC | 3.741917 | -2% | 6.087333 | 0% | 9.829250 | 0% |
| | | IRTP | 3.688889 | -1% | 6.126111 | 0% | 9.815000 | 0% |
| | | DRT | 3.565222 | 3% | 6.131917 | 0% | 9.697139 | 1% |
| | | avg | 3.665343 | | 6.115120 | | 9.780463 | |
| 4 | NVIDIA GeForce RTX 3080 | IRTC | 1.228219 | 9% | 2.524503 | 0% | 3.752722 | 3% |
| | | IRTP | 1.216578 | 10% | 2.524589 | 0% | 3.741167 | 3% |
| | | DRT | 1.611911 | -19% | 2.523978 | 0% | 4.135889 | -7% |
| | | avg | 1.352236 | | 2.524357 | | 3.876593 | |
| 8 | AMD Radeon RX 6900 XT | IRTC | 3.324389 | -19% | 8.213722 | -1% | 11.538111 | -6% |
| | | IRTP | 2.938333 | -5% | 8.119361 | 0% | 11.057694 | -1% |
| | | DRT | 2.119047 | 24% | 8.028953 | 1% | 10.148000 | 7% |
| | | avg | 2.793923 | | 8.120679 | | 10.914602 | |
| 8 | AMD Radeon RX 6700 XT | IRTC | 5.132611 | -20% | 10.939889 | 0% | 16.072500 | -5% |
| | | IRTP | 4.288139 | 0% | 10.924917 | 0% | 15.213056 | 0% |
| | | DRT | 3.462833 | 19% | 10.968472 | 0% | 14.431305 | 5% |
| | | avg | 4.294528 | | 10.944426 | | 15.238954 | |
| 8 | NVIDIA GeForce RTX 2070 | IRTC | 7.821556 | -4% | 10.570500 | 0% | 18.392056 | -2% |
| | | IRTP | 7.515194 | 0% | 10.609306 | 0% | 18.124500 | 0% |
| | | DRT | 7.190639 | 4% | 10.588528 | 0% | 17.779167 | 2% |
| | | avg | 7.509130 | | 10.589445 | | 18.098574 | |
| 8 | NVIDIA GeForce RTX 3080 | IRTC | 2.455508 | 9% | 4.454492 | 0% | 6.910000 | 3% |
| | | IRTP | 2.427778 | 10% | 4.455972 | 0% | 6.883750 | 4% |
| | | DRT | 3.186694 | -18% | 4.463390 | 0% | 7.650084 | -7% |
| | | avg | 2.689993 | | 4.457951 | | 7.147945 | |

Table A.4: A detailed performance comparison of the different methods in the scene Dragon with 4 & 8 lights.

A.3 Sponza4Dragons

| L | Graphics Card | Method | M1 (ms) | % | M2 - M1 (ms) | % | M2 (ms) | % |
|---|---|---|---|---|---|---|---|---|
| 1 | AMD Radeon RX 6900 XT | IRTC | 0.914022 | -13% | 2.397561 | 0% | 3.311583 | -4% |
| | | IRTP | 0.825422 | -2% | 2.389856 | 0% | 3.215278 | -1% |
| | | DRT | 0.685639 | 15% | 2.383278 | 0% | 3.068917 | 4% |
| | | avg | 0.808361 | | 2.390232 | | 3.198593 | |
| 1 | AMD Radeon RX 6700 XT | IRTC | 1.396508 | -15% | 4.320020 | -1% | 5.716528 | -4% |
| | | IRTP | 1.181839 | 2% | 4.283967 | 0% | 5.465806 | 1% |
| | | DRT | 1.056236 | 13% | 4.284959 | 0% | 5.341195 | 3% |
| | | avg | 1.211528 | | 4.296315 | | 5.507843 | |
| 1 | NVIDIA GeForce RTX 2070 | IRTC | 1.478100 | 1% | 4.520761 | 0% | 5.998861 | 0% |
| | | IRTP | 1.522117 | -1% | 4.522244 | 0% | 6.044361 | 0% |
| | | DRT | 1.499525 | 0% | 4.524114 | 0% | 6.023639 | 0% |
| | | avg | 1.499914 | | 4.522373 | | 6.022287 | |
| 1 | NVIDIA GeForce RTX 3080 | IRTC | 0.571936 | 5% | 2.158447 | 0% | 2.730383 | 1% |
| | | IRTP | 0.578244 | 4% | 2.158262 | 0% | 2.736506 | 1% |
| | | DRT | 0.661547 | -10% | 2.151842 | 0% | 2.813389 | -2% |
| | | avg | 0.603909 | | 2.156184 | | 2.760093 | |
| 2 | AMD Radeon RX 6900 XT | IRTC | 1.742853 | -14% | 3.365647 | 0% | 5.108500 | -5% |
| | | IRTP | 1.517433 | 1% | 3.356206 | 0% | 4.873639 | 0% |
| | | DRT | 1.329011 | 13% | 3.348017 | 0% | 4.677028 | 4% |
| | | avg | 1.529766 | | 3.356623 | | 4.886389 | |
| 2 | AMD Radeon RX 6700 XT | IRTC | 2.668533 | -15% | 5.595217 | 0% | 8.263750 | -5% |
| | | IRTP | 2.270586 | 3% | 5.540775 | 1% | 7.811361 | 1% |
| | | DRT | 2.049608 | 12% | 5.577475 | 0% | 7.627083 | 3% |
| | | avg | 2.329576 | | 5.571156 | | 7.900731 | |
| 2 | NVIDIA GeForce RTX 2070 | IRTC | 3.177695 | 1% | 5.656194 | 0% | 8.833889 | 0% |
| | | IRTP | 3.266778 | -2% | 5.644195 | 0% | 8.910973 | -1% |
| | | DRT | 3.193917 | 1% | 5.654805 | 0% | 8.848722 | 0% |
| | | avg | 3.212797 | | 5.651731 | | 8.864528 | |
| 2 | NVIDIA GeForce RTX 3080 | IRTC | 1.192936 | 4% | 2.622620 | 0% | 3.815556 | 1% |
| | | IRTP | 1.200300 | 3% | 2.626255 | 0% | 3.826555 | 1% |
| | | DRT | 1.323195 | -7% | 2.614333 | 0% | 3.937528 | -2% |
| | | avg | 1.238810 | | 2.621069 | | 3.859880 | |

Table A.5: A detailed performance comparison of the different methods in the scene Sponza4Dragons with 1 & 2 lights.

| L | Graphics Card | Method | M1 (ms) | % | M2 - M1 (ms) | % | M2 (ms) | % |
|---|---|---|---|---|---|---|---|---|
| 4 | AMD Radeon RX 6900 XT | IRTC | 3.462694 | -14% | 5.401861 | 0% | 8.864555 | -5% |
| | | IRTP | 3.001111 | 2% | 5.373889 | 0% | 8.375000 | 1% |
| | | DRT | 2.679897 | 12% | 5.389242 | 0% | 8.069139 | 4% |
| | | avg | 3.047901 | | 5.388331 | | 8.436231 | |
| 4 | AMD Radeon RX 6700 XT | IRTC | 5.115389 | -13% | 8.086083 | 0% | 13.201472 | -5% |
| | | IRTP | 4.439972 | 2% | 8.075556 | 0% | 12.515528 | 1% |
| | | DRT | 4.039583 | 11% | 8.092694 | 0% | 12.132277 | 4% |
| | | avg | 4.531648 | | 8.084778 | | 12.616426 | |
| 4 | NVIDIA GeForce RTX 2070 | IRTC | 6.292056 | 0% | 7.897027 | 0% | 14.189083 | 0% |
| | | IRTP | 6.408695 | -1% | 7.892778 | 0% | 14.301473 | -1% |
| | | DRT | 6.267528 | 1% | 7.897722 | 0% | 14.165250 | 0% |
| | | avg | 6.322760 | | 7.895842 | | 14.218602 | |
| 4 | NVIDIA GeForce RTX 3080 | IRTC | 2.435472 | 3% | 3.507778 | 0% | 5.943250 | 1% |
| | | IRTP | 2.443336 | 3% | 3.507775 | 0% | 5.951111 | 1% |
| | | DRT | 2.688875 | -7% | 3.515959 | 0% | 6.204834 | -3% |
| | | avg | 2.522561 | | 3.510504 | | 6.033065 | |
| 8 | AMD Radeon RX 6900 XT | IRTC | 6.989695 | -14% | 9.187721 | -1% | 16.177416 | -6% |
| | | IRTP | 6.042806 | 2% | 9.071360 | 1% | 15.114166 | 1% |
| | | DRT | 5.434250 | 12% | 9.119139 | 0% | 14.553389 | 5% |
| | | avg | 6.155584 | | 9.126073 | | 15.281657 | |
| 8 | AMD Radeon RX 6700 XT | IRTC | 9.847584 | -12% | 12.919611 | 0% | 22.767195 | -5% |
| | | IRTP | 8.579611 | 2% | 12.873890 | 0% | 21.453501 | 1% |
| | | DRT | 7.838083 | 10% | 12.925499 | 0% | 20.763582 | 4% |
| | | avg | 8.755093 | | 12.906333 | | 21.661426 | |
| 8 | NVIDIA GeForce RTX 2070 | IRTC | 12.451612 | 0% | 12.657666 | 0% | 25.109278 | 0% |
| | | IRTP | 12.595944 | -1% | 12.638334 | 0% | 25.234278 | -1% |
| | | DRT | 12.287556 | 1% | 12.647861 | 0% | 24.935417 | 1% |
| | | avg | 12.445037 | | 12.647954 | | 25.092991 | |
| 8 | NVIDIA GeForce RTX 3080 | IRTC | 4.990917 | 3% | 5.339054 | 0% | 10.329971 | 1% |
| | | IRTP | 4.988667 | 3% | 5.346528 | 0% | 10.335195 | 1% |
| | | DRT | 5.407889 | -5% | 5.335638 | 0% | 10.743527 | -3% |
| | | avg | 5.129158 | | 5.340407 | | 10.469564 | |

Table A.6: A detailed performance comparison of the different methods in the scene Sponza4Dragons with 4 & 8 lights.

Appendix B Code Snippets

The following code snippets are from the shaders used for the different ray-tracing methods. The snippets are partly pseudocode and partly real code, so that they are easier to understand.

B.1 Random number generation

    // Generates a seed for a random number generator from 2 inputs
    uint initRand(uint val0, uint val1, uint backoff = 16)
    {
        uint v0 = val0, v1 = val1, s0 = 0;

        [unroll]
        for (uint n = 0; n < backoff; n++)
        {
            s0 += 0x9e3779b9;
            v0 += ((v1 << 4) + 0xa341316c) ^ (v1 + s0) ^ ((v1 >> 5) + 0xc8013ea4);
            v1 += ((v0 << 4) + 0xad90777d) ^ (v0 + s0) ^ ((v0 >> 5) + 0x7e95761e);
        }
        return v0;
    }

    // Advances the seed and returns a pseudorandom float in [0, 1)
    float nextRand(inout uint s)
    {
        s = (1664525u * s + 1013904223u);
        return float(s & 0x00FFFFFF) / float(0x01000000);
    }

Listing B.1: Random Number Generation
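For checking the generator outside a shader, the two functions in Listing B.1 can be ported to the CPU. The sketch below is such a port under the assumption that HLSL's 32-bit unsigned wrap-around arithmetic is what the shader relies on; it is not part of the renderer described in this work.

```cpp
#include <cassert>
#include <cstdint>

// CPU port of initRand: mixes two inputs into one seed over `backoff` rounds.
std::uint32_t initRand(std::uint32_t val0, std::uint32_t val1, std::uint32_t backoff = 16)
{
    std::uint32_t v0 = val0, v1 = val1, s0 = 0;
    for (std::uint32_t n = 0; n < backoff; ++n)
    {
        s0 += 0x9e3779b9u;
        v0 += ((v1 << 4) + 0xa341316cu) ^ (v1 + s0) ^ ((v1 >> 5) + 0xc8013ea4u);
        v1 += ((v0 << 4) + 0xad90777du) ^ (v0 + s0) ^ ((v0 >> 5) + 0x7e95761eu);
    }
    return v0;
}

// CPU port of nextRand: advances a linear congruential generator and maps
// the low 24 bits of the new state to a float in [0, 1).
float nextRand(std::uint32_t& s)
{
    s = 1664525u * s + 1013904223u;
    return static_cast<float>(s & 0x00FFFFFFu) / static_cast<float>(0x01000000u);
}
```

Because only the low 24 bits are used and they are divided by 2^24, the returned value can reach 0 but never 1.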


B.2 Ray-Generation Shader

    void RayGen()
    {
        // Get the pixel location within the dispatched 2D grid
        uint2 launchIndex = DispatchRaysIndex().xy;
        float2 dims = float2(DispatchRaysDimensions().xy);
        float2 uv = launchIndex / dims;

        float depth = textures[dIndex].SampleLevel(samp, uv, 0).r;
        float3 worldPos = WorldPosFromDepth(depth, uv);

        uint seed = initRand(frameSeed * uv.x, frameSeed * uv.y);
        for (int i = 0; i < numLights; i++)
        {
            PointLight pl = lights[i];
            float3 lightDir = normalize(pl.position.xyz - worldPos);

            // Vector perpendicular to lightDir. The zero check is done before
            // normalizing, so that lightDir = up falls back to (1,0,0)
            float3 perpL = cross(lightDir, float3(0.0f, 1.0f, 0.0f));
            if (all(perpL == 0.0f))
                perpL.x = 1.0f;
            perpL = normalize(perpL);

            // Cone that covers the light sphere as seen from worldPos
            float3 toEdge = normalize((pl.position.xyz + perpL * pl.lightRadius) - worldPos);
            float coneAngle = acos(dot(lightDir, toEdge)) * 2;

            float sumFactor = 0;
            for (int j = 0; j < rpp; j++)
            {
                float3 randDir = getConeSample(seed, lightDir, coneAngle);

                RayDesc ray;
                ray.Origin = worldPos;
                ray.Direction = normalize(randDir);
                ray.TMin = 1.0;
                ray.TMax = distance(pl.position.xyz, worldPos);

                // Assume a hit; the miss shader (not shown) is expected to clear the flag
                ShadowHitInfo shadowPayload;
                shadowPayload.isHit = true;

                TraceRay(
                    SceneBVH,
                    RAY_FLAG_SKIP_CLOSEST_HIT_SHADER |
                    RAY_FLAG_ACCEPT_FIRST_HIT_AND_END_SEARCH,
                    0xFF,
                    0, 0, 0,
                    ray,
                    shadowPayload);
                sumFactor += shadowPayload.isHit ? 0.0 : 1.0;
            }
            sumFactor /= rpp;

            // Write visibility value for each light
            uav[i][launchIndex] = min(sumFactor, 1.0);
        }
    }

Listing B.2: Ray-Generation Shader

B.3 IRT Function Part1

    float IRT_ShadowFactorSoft(float3 worldPos, float3 lightPos, float2 uv, float3 lightDir, inout uint seed)
    {
        float lightRadius = 1.0;

        // Vector perpendicular to lightDir. The zero check is done before normalizing.
        // Handle case where L = up -> perpL should then be (1,0,0)
        float3 perpL = cross(lightDir, float3(0.0f, 1.0f, 0.0f));
        if (all(perpL == 0.0f))
        {
            perpL.x = 1.0;
        }
        perpL = normalize(perpL);

        // Use perpL to get a vector from worldPosition to the edge of the light sphere
        float3 toLightEdge = normalize((lightPos + perpL * lightRadius) - worldPos);

        // Twice the angle between L and toLightEdge. Used as the cone angle when sampling shadow rays
        float coneAngle = acos(dot(lightDir, toLightEdge)) * 2;

        float sumFactor = 0;
        for (int i = 0; i < rpp; i++)
        {
            float3 randDir = getConeSample(seed, lightDir, coneAngle);
            sumFactor += RT_ShadowFactor(worldPos, 1.0f, distance(lightPos, worldPos), randDir);
        }

        sumFactor /= rpp;

        return sumFactor;
    }

Listing B.3: Inline Ray Tracing Part1
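The cone-angle computation used in the listing above can be checked on the CPU with a small worked example. The sketch below mirrors the same steps with a hand-written vector type; it tests for a zero cross product before normalizing, and all names are hypothetical rather than taken from the renderer.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };

Vec3 add(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
Vec3 scale(Vec3 a, double s) { return {a.x * s, a.y * s, a.z * s}; }
double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
Vec3 cross(Vec3 a, Vec3 b)
{
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}
Vec3 normalize(Vec3 a) { return scale(a, 1.0 / std::sqrt(dot(a, a))); }

// Full cone angle (radians) that covers a spherical light as seen from worldPos.
double coneAngle(Vec3 worldPos, Vec3 lightPos, double lightRadius)
{
    const Vec3 lightDir = normalize(sub(lightPos, worldPos));
    Vec3 perp = cross(lightDir, Vec3{0.0, 1.0, 0.0});
    if (dot(perp, perp) == 0.0)
    {
        perp = Vec3{1.0, 0.0, 0.0}; // lightDir is parallel to the up vector
    }
    perp = normalize(perp);
    const Vec3 toEdge = normalize(sub(add(lightPos, scale(perp, lightRadius)), worldPos));
    // The clamp guards acos against rounding slightly above 1.
    return 2.0 * std::acos(std::fmin(1.0, dot(lightDir, toEdge)));
}
```

For a light of radius 1 at distance 1, the vector to the light edge is 45 degrees off the light direction, so the full cone angle is 90 degrees (π/2), which equals 2·atan(radius / distance).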

B.4 IRT Function Part2

    float RT_ShadowFactor(float3 worldPos, float tMin, float tMax, float3 rayDir)
    {
        // RayQuery takes its compile-time ray flags as a template argument
        RayQuery<RAY_FLAG_NONE> q;

        uint rayFlags = 0;
        uint instanceMask = 0xff;

        float shadowFactor = 1.0f;

        RayDesc ray = (RayDesc)0;
        ray.TMin = tMin;
        ray.TMax = tMax;

        ray.Direction = normalize(rayDir);
        ray.Origin = worldPos;

        q.TraceRayInline(
            SceneBVH,
            rayFlags,
            instanceMask,
            ray);

        // A single Proceed() call completes the traversal, assuming all geometry is opaque
        q.Proceed();

        if (q.CommittedStatus() == COMMITTED_TRIANGLE_HIT)
        {
            shadowFactor = 0.0f;
        }

        return shadowFactor;
    }

Listing B.4: Inline Ray Tracing Part2

B.5 IRT using the Compute Shader

    [numthreads(256, 1, 1)]
    void CS_main(uint3 dispatchThreadID : SV_DispatchThreadID, int3 groupThreadID : SV_GroupThreadID)
    {
        float2 uv = dispatchThreadID.xy / screenSize;

        float depth = textures[dIndex].SampleLevel(samp, uv, 0).r;
        float3 worldPos = WorldPosFromDepth(depth, uv);

        uint seed = initRand(frameSeed * uv.x, frameSeed * uv.y);

        for (int i = 0; i < numLights; i++)
        {
            PointLight pl = lights[i];

            float3 lightDir = normalize(pl.position.xyz - worldPos.xyz);
            float shadowFactor = IRT_ShadowFactorSoft(
                worldPos.xyz,
                pl.position.xyz,
                uv,
                lightDir,
                seed);

            shadowFactor = min(shadowFactor, 1);
            light_uav[i * 2 + 1][dispatchThreadID.xy] = shadowFactor;
        }
    }

Listing B.5: Compute Shader

B.6 IRT using the Pixel Shader

    void PS_main(VS_OUT input)
    {
        // Pixel index: SV_Position is at the pixel center, so subtract the half-pixel offset
        float2 d = input.pos.xy - float2(0.5f, 0.5f);
        float2 uv = d / screenSize;

        float depth = textures[dIndex].SampleLevel(samp, uv, 0).r;
        float3 worldPos = WorldPosFromDepth(depth, uv);

        uint seed = initRand(frameSeed * uv.x, frameSeed * uv.y);

        for (int i = 0; i < numLights; i++)
        {
            PointLight pl = lights[i];

            float3 lightDir = normalize(pl.position.xyz - worldPos.xyz);
            float shadowFactor = IRT_ShadowFactorSoft(
                worldPos.xyz,
                pl.position.xyz,
                uv,
                lightDir,
                seed);

            shadowFactor = min(shadowFactor, 1);
            // UAVs are indexed with integer pixel coordinates
            light_uav[i * 2 + 1][uint2(d)] = shadowFactor;
        }
    }

Listing B.6: Pixel Shader
