CHAPTER 17
USING BINDLESS RESOURCES WITH DIRECTX RAYTRACING
Matt Pettineo
Ready At Dawn Studios

ABSTRACT

Resource binding in Direct3D 12 can be complex and difficult to implement correctly, particularly when used in conjunction with DirectX Raytracing. This chapter will explain how to use bindless techniques to provide shaders with global access to all resources, which can simplify application and shader code while also enabling new techniques.

17.1 INTRODUCTION

Prior to the introduction of Direct3D 12, GPU texture and buffer resources were accessed using a simple CPU-driven binding model. The GPU’s resource access capabilities were typically exposed as a fixed set of “slots” that were tied to a particular stage of the logical GPU pipeline, and API functions were provided that allowed the application to “bind” a resource view to one of the exposed slots. This sort of binding model was a natural fit for earlier GPUs, which typically featured a fixed set of hardware descriptor registers that were used by the shader cores to access resources.

Figure 17-1. The Sun Temple scene [5] rendered with an open source DXR path tracer [9] that uses bindless resources.

© 2021 A. Marrs, P. Shirley, I. Wald (eds.), Ray Tracing Gems II, https://doi.org/10.1007/978-1-4842-7185-8_17 257 RAY TRACING GEMS II

While this old style of binding was relatively simple and well understood, it came with many limitations. The limited number of binding slots meant that programs could typically only bind the exact set of resources that would be accessed by a particular shader program, which often had to be done before every draw or dispatch. The CPU-driven nature of binding demanded that a shader’s required resources be statically known after compilation, which in turn placed inherent restrictions on the complexity of a shader program.

As ray tracing on the GPU started to gain traction, the classic binding model reached its breaking point. Ray tracing tends to be an inherently global process: one shader program might launch rays that could potentially interact with every material in the scene. This is largely incompatible with the notion of having the CPU bind a fixed set of resources prior to dispatch. Techniques such as atlasing or Sparse Virtual Texturing [1] can be viable as a means of emulating global resource access, but may also require adding significant complexity to a renderer.

Fortunately, newer GPUs and APIs no longer suffer from the same limitations. Most recent GPU architectures have shifted to a model where resource descriptors can be loaded from memory instead of from registers, and in some cases they can also access resources directly from a memory address. This removes the prior restrictions on the number of resources that can be accessed by a particular shader program, and also opens the door for those shader programs to dynamically choose which resource is actually accessed.

This newfound flexibility is directly reflected in the binding model of Direct3D 12, which has been completely revamped compared to previous versions of the API. In particular, it supports features that collectively enable a technique commonly known as bindless resources [2]. When implemented, bindless techniques effectively provide shader programs with global access to the full set of textures and buffers that are present on the GPU. Instead of requiring the CPU to bind a view for each individual resource, shaders can access an individual resource using a simple 32-bit index that can be freely embedded in user-defined data structures. While this level of flexibility can be incredibly useful in more traditional rasterization scenarios [6], it is borderline essential when using DirectX Raytracing (DXR).

The remainder of this chapter will cover the details of how to enable bindless resource access using Direct3D 12 (D3D12), and will also cover the basics of how to use bindless techniques in a DXR ray tracer. Basic familiarity with both D3D12 and DXR is assumed, and we refer the reader to an introductory chapter from the first volume of Ray Tracing Gems [11].

17.2 TRADITIONAL BINDING WITH DXR

Like the rest of D3D12, DXR utilizes root signatures to specify how resources should be made available to shader programs. These root signatures specify collections of root descriptors, descriptor tables, and 32-bit constants and map those to ranges of HLSL binding registers. When using DXR, we actually deal with two different types of root signatures: a global root signature and a local root signature. The global root signature applies to the ray generation shader as well as to all executed miss, any-hit, closest-hit, and intersection shaders. A local root signature, in contrast, applies only to the shaders it is associated with (typically a particular hit group), and its arguments are supplied by the corresponding shader record. Used together, global and local root signatures can implement a fairly traditional binding model where the CPU code “pushes” descriptors for all resources that are needed by the shader program. In a typical rendering scenario, this would likely involve having the local root signature provide a descriptor table containing all textures and constant buffers required by the particular material assigned to the mesh in the hit group. An example of this traditional model of resource binding is shown in Figure 17-2.

[Figure 17-2 diagram: a root signature (SRV Table A, SRV Table B, Root CBV) beside a descriptor heap (Descriptors 17 to 19 and 23 to 25) and a block of constants.]

Figure 17-2. An example of traditional resource binding in D3D12. A root signature contains two entries for Shader Resource View (SRV) descriptor tables, each of which points to a range of contiguous SRV descriptors within a global descriptor heap, as well as a root Constant Buffer View (CBV).

Though this approach can be workable, there are several problems that make it less than ideal. First, the mechanics of the local root signature are somewhat inconsistent with how root signatures normally work within D3D12. Standard root signatures require using command list APIs to specify which constants, root descriptors, and descriptor tables should be bound to corresponding entries in the root signature. Local root signatures do not work this way because there can be many different root signatures contained within a single state object. Instead, the parameters for the root signature entries must be placed inline within a shader record in the hit group shader table, immediately following the shader identifier. This setup is further complicated by the fact that both shader records and root signature parameters have specific alignment requirements that must be observed. Shader identifiers consume 32 bytes (D3D12_SHADER_IDENTIFIER_SIZE_IN_BYTES) and must also be located at an offset that is aligned to 32 bytes (D3D12_RAYTRACING_SHADER_RECORD_BYTE_ALIGNMENT). Root signature parameters that are 8 bytes in size (such as root descriptors and descriptor table handles) must also be placed at offsets that are aligned to 8 bytes. Thus, carefully written packing code or helper types have to be used in order to fulfill these specific rules when generating the shader table. The following code shows an example of what shader record helper structs might look like:

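```cpp
#include <d3d12.h>
#include <cstdint>
#include <cstring>

struct ShaderIdentifier
{
    uint8_t Data[D3D12_SHADER_IDENTIFIER_SIZE_IN_BYTES] = { };

    ShaderIdentifier() = default;
    explicit ShaderIdentifier(const void* idPointer)
    {
        // idPointer is the value returned by
        // ID3D12StateObjectProperties::GetShaderIdentifier().
        memcpy(Data, idPointer, D3D12_SHADER_IDENTIFIER_SIZE_IN_BYTES);
    }
};

struct HitGroupRecord
{
    ShaderIdentifier ID;                          // Offset 0 (32 bytes)
    D3D12_GPU_DESCRIPTOR_HANDLE SRVTableA = { };  // Offset 32
    D3D12_GPU_DESCRIPTOR_HANDLE SRVTableB = { };  // Offset 40
    uint64_t CBV = 0;                             // Offset 48, already 8-byte aligned
    uint8_t Padding[8] = { };                     // Pads the record to 64 bytes so the
                                                  // next shader ID stays 32-byte aligned
};

static_assert(sizeof(HitGroupRecord) % D3D12_RAYTRACING_SHADER_RECORD_BYTE_ALIGNMENT == 0,
              "Shader records must be a multiple of 32 bytes");
```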
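The packing rules above can also be applied programmatically instead of by hand. The following is a minimal sketch, not a D3D12 API: the helper names are hypothetical, the constants restate the documented values of D3D12_SHADER_IDENTIFIER_SIZE_IN_BYTES and D3D12_RAYTRACING_SHADER_RECORD_BYTE_ALIGNMENT (both 32), and it assumes 4-byte root constants and 8-byte descriptor table handles and root descriptors. It places each local root argument at the next offset aligned to its size and rounds the record up to a multiple of 32 bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed to match d3d12.h: D3D12_SHADER_IDENTIFIER_SIZE_IN_BYTES and
// D3D12_RAYTRACING_SHADER_RECORD_BYTE_ALIGNMENT.
constexpr uint32_t kShaderIdentifierSize = 32;
constexpr uint32_t kShaderRecordAlignment = 32;

// Hypothetical classification of local root arguments.
enum class RootArgKind { Constant32, DescriptorTable, RootDescriptor };

// Rounds value up to a multiple of alignment (which must be a power of two).
constexpr uint32_t AlignUp(uint32_t value, uint32_t alignment)
{
    return (value + alignment - 1) & ~(alignment - 1);
}

// 32-bit constants take 4 bytes; GPU descriptor handles and GPU virtual
// addresses take 8 bytes and must be 8-byte aligned.
constexpr uint32_t ArgSize(RootArgKind kind)
{
    return kind == RootArgKind::Constant32 ? 4u : 8u;
}

struct RecordLayout
{
    std::vector<uint32_t> argOffsets;  // Byte offset of each argument in the record
    uint32_t recordSize = 0;           // Padded size, usable as the table stride
};

RecordLayout LayoutShaderRecord(const std::vector<RootArgKind>& args)
{
    RecordLayout layout;
    uint32_t offset = kShaderIdentifierSize;  // Arguments follow the shader identifier
    for (RootArgKind kind : args)
    {
        const uint32_t size = ArgSize(kind);
        offset = AlignUp(offset, size);  // Natural alignment for this argument
        layout.argOffsets.push_back(offset);
        offset += size;
    }
    // Keep the next record's shader identifier 32-byte aligned.
    layout.recordSize = AlignUp(offset, kShaderRecordAlignment);
    return layout;
}
```

For example, two descriptor tables followed by a root CBV land at offsets 32, 40, and 48 and produce a 64-byte record, while a single root constant followed by a root CBV forces 4 bytes of padding so that the CBV lands at offset 40.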


For more complex scenarios with many meshes and materials, the local root signature approach can quickly become unwieldy and error-prone. Generating the local root signature arguments on the CPU as part of filling the hit shader table is the most straightforward approach, but this can consume precious CPU cycles if it needs to be done frequently in order to support dynamic arguments. It also subjects us to some of the same limitations on our shader programs that we had in previous versions of D3D: the set of resources required for a hit shader must be known in advance, and the shader itself must be written so that it uses a static set of resource bindings. Generating the shader table on the GPU using compute shaders is a viable option that allows for more GPU-driven rendering approaches; however, this can be considerably more difficult to write, validate, and debug than the equivalent CPU implementation. It also does not remove the limitations regarding shader programs and dynamically selecting which resources to access. Both approaches are fundamentally incompatible with the new inline raytracing functionality that was added for DXR 1.1, since inline ray queries do not use shader tables and therefore cannot receive local root signature arguments. This means that we must consider a less restrictive binding solution in order to make use of the new RayQuery APIs for general rendering scenarios.

17.3 BINDLESS RESOURCES IN D3D12

As mentioned earlier, bindless techniques allow us to effectively provide our shader programs with global access to all currently loaded resources instead of being restricted to a small subset. D3D12 supports bindless access to every resource type that utilizes shader-visible descriptor heaps: Shader Resource Views (SRVs), Constant Buffer Views (CBVs), Unordered Access Views (UAVs), and Samplers. However, these types each have different limitations that can prevent their use in bindless scenarios, most of which are dictated by the value of D3D12_RESOURCE_BINDING_TIER that is exposed by the device. We will cover some of these limitations in more detail in Section 17.5, but for now we will primarily focus on using bindless techniques for SRVs because they typically form the bulk of resources accessed by shader programs. However, the concepts described here can generally be extended to other resource views with little effort.

The key to enabling bindless resources with D3D12 is setting up our root signature in a way that effectively exposes an entire descriptor heap through a single root parameter. The most straightforward way to do this is to add a parameter with a type of D3D12_ROOT_PARAMETER_TYPE_DESCRIPTOR_TABLE, with a single unbounded descriptor range:

```cpp
// Unbounded range of SRV descriptors to expose the entire heap
D3D12_DESCRIPTOR_RANGE1 srvRanges[1] = {};
srvRanges[0].RangeType = D3D12_DESCRIPTOR_RANGE_TYPE_SRV;
srvRanges[0].NumDescriptors = UINT_MAX;  // UINT_MAX marks the range as unbounded
srvRanges[0].BaseShaderRegister = 0;
srvRanges[0].RegisterSpace = 0;
srvRanges[0].OffsetInDescriptorsFromTableStart = 0;
srvRanges[0].Flags = D3D12_DESCRIPTOR_RANGE_FLAG_DESCRIPTORS_VOLATILE;

D3D12_ROOT_PARAMETER1 params[1] = {};

// Descriptor table root parameter
params[0].ParameterType = D3D12_ROOT_PARAMETER_TYPE_DESCRIPTOR_TABLE;
params[0].ShaderVisibility = D3D12_SHADER_VISIBILITY_ALL;
params[0].DescriptorTable.pDescriptorRanges = srvRanges;
params[0].DescriptorTable.NumDescriptorRanges = 1;
```

When building our command lists for rendering, we can then pass the handle returned by ID3D12DescriptorHeap::GetGPUDescriptorHandleForHeapStart() to ID3D12GraphicsCommandList::SetGraphicsRootDescriptorTable() or ID3D12GraphicsCommandList::SetComputeRootDescriptorTable() in order to make the entire contents of that heap available to the shader. This allows us to place our created descriptors anywhere in the heap without needing to partition it in any way.
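Since descriptors can live anywhere in the heap, the CPU side only needs to hand out free slots and remember their indices. The class below is a minimal, hypothetical sketch (not part of D3D12) that assumes a single fixed-size heap and a simple free list; in a real renderer, freed slots would also have to wait until the GPU is done with every frame that might still reference them. The index-to-address conversion mirrors what an application would do with the heap's start handle and the increment size returned by ID3D12Device::GetDescriptorHandleIncrementSize().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical allocator for slots in one shader-visible descriptor heap.
// The 32-bit index it returns is what shaders use to find the descriptor.
class DescriptorIndexAllocator
{
public:
    static constexpr uint32_t InvalidIndex = UINT32_MAX;

    explicit DescriptorIndexAllocator(uint32_t capacity) : capacity_(capacity) {}

    // Returns a free slot, or InvalidIndex if the heap is full.
    uint32_t Allocate()
    {
        if (!freeList_.empty())
        {
            const uint32_t index = freeList_.back();  // Reuse a released slot
            freeList_.pop_back();
            return index;
        }
        if (nextUnused_ < capacity_)
            return nextUnused_++;
        return InvalidIndex;
    }

    // Returns a slot to the pool. A real renderer must defer this until the
    // GPU has finished every frame that may still read the descriptor.
    void Free(uint32_t index)
    {
        assert(index < nextUnused_);
        freeList_.push_back(index);
    }

    // Address of a slot, given the heap start (for example
    // GetCPUDescriptorHandleForHeapStart().ptr) and the increment size.
    static std::size_t HandleForIndex(std::size_t heapStart, uint32_t incrementSize, uint32_t index)
    {
        return heapStart + std::size_t(index) * incrementSize;
    }

private:
    uint32_t capacity_;
    uint32_t nextUnused_ = 0;
    std::vector<uint32_t> freeList_;
};
```

Because the heap is never partitioned per shader, the index returned by Allocate() can be stored directly in constant buffers or material data and stays valid for the lifetime of the descriptor.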

To access a particular resource’s descriptor in our shader program, we can use a technique known as descriptor indexing. In HLSL, this technique first requires us to declare an array of a particular shader resource type (such as Texture2D). To access a particular resource, we merely need to index into the array using an ordinary integer. A simple and straightforward way to do this is to use a constant buffer to pass descriptor indices to a shader, as demonstrated in Figure 17-3.

```hlsl
Texture2D GlobalTextureArray[] : register(t0);
SamplerState MySampler : register(s0);

struct MyConstants
{
    uint TexDescriptorIndex;
};
ConstantBuffer<MyConstants> MyConstantBuffer : register(b0);

float4 MyPixelShader(in float2 uv : UV) : SV_Target0
{
    uint texDescriptorIndex = MyConstantBuffer.TexDescriptorIndex;
    Texture2D myTexture = GlobalTextureArray[texDescriptorIndex];
    return myTexture.Sample(MySampler, uv);
}
```

[Figure 17-3 diagram: a root signature whose root CBV points to a block of constants (ColorMapIdx, NormalMapIdx, MetallicMapIdx, EmissiveMapIdx, LightBufferIdx) that index descriptors scattered through a descriptor heap (Descriptors 17, 19, 21, 22, and 24).]

Figure 17-3. An example of using bindless techniques to access SRV descriptors. A root signature contains a root Constant Buffer View, which points to a block of constants containing 32-bit indices of descriptors within a global descriptor heap. With a bindless setup, the descriptors needed by a shader do not need to be contiguous within the descriptor heap and do not need to be ordered with regard to how the shader accesses or declares the resources.

For this example to work, we simply need to ensure that our root signature’s descriptor table parameter is mapped to the t0 register used by the Texture2D array in our shader. If this is done properly, the shader effectively has full access to the entire global descriptor heap, or at least to all of the descriptors that can be mapped to the Texture2D HLSL type. One limitation of the current version of HLSL is that we need to declare a separate array for each HLSL resource type that we would like to access in our shader, and each one must have a separate, non-overlapping register mapping. A simple way to ensure that their assignments don’t overlap is to use a different register space for each resource array. This allows us to continue using unbounded arrays instead of requiring an array size to be compiled into the shader.

```hlsl
Texture2D Tex2DTable[] : register(t0, space0);
Texture2D<uint> Tex2DUintTable[] : register(t0, space1);
Texture2DArray Tex2DArrayTable[] : register(t0, space2);
TextureCube TexCubeTable[] : register(t0, space3);
Texture3D Tex3DTable[] : register(t0, space4);
Texture2DMS<float4> Tex2DMSTable[] : register(t0, space5);
ByteAddressBuffer RawBufferTable[] : register(t0, space6);
Buffer<uint> BufferUintTable[] : register(t0, space7);
// ... and so on
```

Note that we not only need separate arrays for different resource types like Texture2D versus Texture3D, but we may also require separate arrays for different return types (expressed using the C++ template syntax) of the same HLSL resource type. This is evident in the previous example, which has arrays of both Texture2D and Texture2D<uint> (a texture resource with no return type has an implicit return type of float4). Having textures with various return types is an unfortunate necessity for supporting all possible DirectX Graphics Infrastructure (DXGI) texture formats, because certain formats require the shader to declare the HLSL resource with a specific return type. The following table lists the appropriate return type for each of the format modifiers available in the DXGI_FORMAT enumeration:

| DXGI format modifier | HLSL return type |
| --- | --- |
| UNORM | float |
| SNORM | float |
| FLOAT | float |
| UINT | uint |
| SINT | int |
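As an illustration of how the table might be applied when creating views, here is a hypothetical helper that maps a DXGI_FORMAT enumerator name to the HLSL return type from the table. It is a minimal sketch that works on the name as a string; it treats the _UNORM_SRGB suffix as UNORM and returns an empty result for other suffixes such as _TYPELESS, whose return type depends on the format chosen for the view.

```cpp
#include <cassert>
#include <string_view>

// True if text ends with suffix (std::string_view::ends_with needs C++20).
bool EndsWith(std::string_view text, std::string_view suffix)
{
    return text.size() >= suffix.size() &&
           text.substr(text.size() - suffix.size()) == suffix;
}

// Hypothetical helper: the HLSL return type to declare for a DXGI format,
// following the table above. _UNORM_SRGB formats are read as float.
// Other suffixes (such as _TYPELESS) yield an empty result because the
// type depends on the format chosen when creating the view.
std::string_view HlslReturnType(std::string_view dxgiFormatName)
{
    if (EndsWith(dxgiFormatName, "_UNORM") || EndsWith(dxgiFormatName, "_UNORM_SRGB") ||
        EndsWith(dxgiFormatName, "_SNORM") || EndsWith(dxgiFormatName, "_FLOAT"))
        return "float";
    if (EndsWith(dxgiFormatName, "_UINT"))
        return "uint";
    if (EndsWith(dxgiFormatName, "_SINT"))
        return "int";
    return {};
}
```

A renderer could use a mapping like this to decide which resource array (for example Tex2DTable or Tex2DUintTable) a texture's descriptor index should be read through.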

One consequence of having separate HLSL resource arrays and register space bindings is that we must have a corresponding descriptor range in our root signature for each declared array. Luckily for us, it is possible to stack multiple unbounded descriptor ranges in a single root parameter. This ultimately means that we only need to bind our global descriptor heap once for each root signature:

```cpp
// One unbounded SRV range per HLSL resource array, each in its own register
// space. NumDescriptorRanges must match the number of arrays declared in HLSL.
D3D12_DESCRIPTOR_RANGE1 ranges[NumDescriptorRanges] = {};
for(uint32_t i = 0; i < NumDescriptorRanges; ++i)
{
    ranges[i].RangeType = D3D12_DESCRIPTOR_RANGE_TYPE_SRV;
    ranges[i].NumDescriptors = UINT_MAX;
    ranges[i].BaseShaderRegister = 0;
    ranges[i].RegisterSpace = i;
    ranges[i].OffsetInDescriptorsFromTableStart = 0;  // Every range overlays the whole heap
    ranges[i].Flags = D3D12_DESCRIPTOR_RANGE_FLAG_DESCRIPTORS_VOLATILE;
}

D3D12_ROOT_PARAMETER1 params[1] = {};

// Descriptor table root parameter
params[0].ParameterType = D3D12_ROOT_PARAMETER_TYPE_DESCRIPTOR_TABLE;
params[0].ShaderVisibility = D3D12_SHADER_VISIBILITY_ALL;
params[0].DescriptorTable.pDescriptorRanges = ranges;
params[0].DescriptorTable.NumDescriptorRanges = NumDescriptorRanges;
```

With this approach, we can expand our support for bindless access to many types of buffer and texture resources. We can even declare our set of HLSL resource arrays in a single header file and then include that file in any shader code that needs to access resources. But what about types such as StructuredBuffer or ConstantBuffer, which are typically templated on a user-defined struct type? There are effectively unlimited permutations of these types, which precludes us from predefining them in a shared header file. One possible approach for handling these types is to reserve some additional descriptor table ranges in the root signature that can be utilized by any individual shader program. As an example, we can define our root signature with eight additional descriptor ranges using register spaces 100 through 107. If we then write a shader program that needs to access a StructuredBuffer with a custom structure type, we simply declare an array bound to one of these reserved register spaces:

```hlsl
// Example user-defined element type (a placeholder for any custom struct).
struct MyStruct
{
    float4 Value;
};

// Define an array of our custom buffer type assigned to
// one of our reserved register spaces.
StructuredBuffer<MyStruct> MyBufferArray[] : register(t0, space100);

MyStruct AccessMyBuffer(uint descriptorIndex, uint bufferIndex)
{
    StructuredBuffer<MyStruct> myBuffer = MyBufferArray[descriptorIndex];
    return myBuffer[bufferIndex];
}
```

In our original descriptor indexing example, we pulled our descriptor index from a constant buffer. This is a perfectly straightforward way for our CPU code to pass these indices to a shader program, and by filling out the constant buffer just before a draw or dispatch, we can even use this approach to emulate a more traditional binding setup. However, we are by no means limited to only using constant buffers for passing around descriptor indices. Because they are just simple integers, we can now pack these almost anywhere. For instance, we could have a StructuredBuffer containing a set of descriptor indices for every loaded material in the scene, and a shader program could use a material index to fetch the appropriate set of indices. The indices could even be written into a UINT-formatted (unsigned integer) render target texture if that were useful! We can actually start to think of these indices as handles to our resources, or even as pointers that grant us access to a resource’s contents.

When writing shader code that uses descriptor indexing, we must always be careful to evaluate whether or not a particular index is uniform. In this context, a descriptor index is uniform if it has the same value for all threads within a particular draw, dispatch, or hit, miss, any-hit, or intersection shader invocation.¹ In our original descriptor indexing example, the index came from a constant buffer, which means that all of the pixel shader threads used the same descriptor index. We would consider the index to be uniform in this case. However, let us now consider a more complex scenario. What if, instead of coming from a constant buffer, the index was passed from the vertex shader as an interpolant? This would allow the value of that index to vary across pixel shader threads. In this case, the index would be considered nonuniform. Nonuniform descriptor indices are allowed in D3D12; however, they do require special consideration. In order to ensure correct results, the index must be passed to the NonUniformResourceIndex intrinsic before being used to index into the resource array. This notifies the driver that the index may vary, which may require it to insert additional instructions in order to properly handle the varying descriptor within a SIMD execution environment. On most existing GPU architectures, the additional cost of these instructions is proportional to the amount of divergence within the architecture-specific thread grouping (often referred to as a warp or wavefront). For these reasons, it’s important to be judicious about when and where to make use of nonuniform indexing. It’s also important to be aware that the results of nonuniform indexing are undefined when NonUniformResourceIndex is omitted, which means that forgetting to use it may produce correct results on one architecture while producing graphical artifacts on others.

¹Note that this definition is distinct from its meaning when used within the context of wave-level shader programming, where uniform means that the value is the same within a particular warp or wavefront.
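One common way a compiler can honor NonUniformResourceIndex on hardware that requires a wave-uniform descriptor is a “waterfall” loop, which also explains why the cost tracks divergence. The function below is a CPU simulation of that idea under stated assumptions, not actual compiler or driver output: a wave is modeled as a vector of per-lane descriptor indices, and each iteration takes the first remaining lane's index (as WaveReadLaneFirst would), services every lane that shares it, and retires those lanes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simulates a waterfall loop over one wave and returns how many iterations
// it needs, which equals the number of distinct descriptor indices in the wave.
std::size_t WaterfallIterations(const std::vector<uint32_t>& laneIndices)
{
    std::vector<bool> active(laneIndices.size(), true);
    std::size_t iterations = 0;
    for (std::size_t first = 0; first < laneIndices.size(); ++first)
    {
        if (!active[first])
            continue;  // Already serviced by an earlier iteration
        // Read the index from the first active lane, making it uniform.
        const uint32_t uniformIndex = laneIndices[first];
        ++iterations;
        // Every lane with the same index performs its access now and retires.
        for (std::size_t lane = first; lane < laneIndices.size(); ++lane)
        {
            if (active[lane] && laneIndices[lane] == uniformIndex)
                active[lane] = false;
        }
    }
    return iterations;
}
```

A fully uniform wave needs one iteration, while a wave in which every lane uses a different index needs one iteration per lane.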

Though the additional flexibility afforded to our shaders can be a primary motivator for adopting bindless-style descriptor indexing, these techniques can also allow for simplification of application code when deployed throughout a rendering engine. Accessing a descriptor by indexing inherently grants us sparse access to all descriptors in a heap, which frees us from having to keep a particular shader’s set of descriptors contiguous within that heap. Maintaining contiguous descriptor tables can often require complex (and expensive) management to be performed by our CPU code, which is especially true in cases where descriptors need to be updated in response to changing data.

A very common example is a CPU-updated buffer, often referred to as a dynamic buffer in earlier versions of D3D. Because the CPU cannot safely write to a buffer while the GPU is reading from it, techniques such as double-buffering or ring-buffering are often deployed to ensure that there is no concurrent access to the same resource memory. However, the act of “swapping” to a new internal buffer also requires swapping descriptors (if not using root descriptors), which causes any existing descriptor tables to become invalid. Dealing with this might normally require versioning entire descriptor tables, or spending CPU cycles updating every table in which a particular buffer is referenced. With bindless techniques we no longer need to worry about contiguous descriptor tables, which means that a particular buffer only needs one descriptor to be updated within a heap whenever the buffer’s contents change. We still need to be careful not to update a descriptor heap that’s currently being referenced by executing GPU commands; however, we can handle this at a global level by swapping through N global descriptor heaps (where N is the maximum number of frames in flight allowed by the renderer). As long as the descriptors for a buffer’s “versions” are always placed at the same offset within the global heap, the shaders can continue to access the buffer using the same persistent descriptor index. For an example of how these patterns can be used in practice, consult the implementation of StructuredBuffer::Map() in the DXRPathTracer repository on GitHub [9].
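The sketch below models this scheme on the CPU under a few assumptions: N = 2 frames in flight, each copy of the global heap is a plain array, and a “descriptor” is just an integer naming which internal version of the buffer it points at. All names are hypothetical, and this is not the DXRPathTracer implementation. A write lands in the current frame's heap immediately and is replayed into each other heap once the GPU is finished with it, so the persistent index stays valid everywhere.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

constexpr uint32_t kFramesInFlight = 2;  // N (assumed)
static_assert(kFramesInFlight >= 2, "Model assumes at least two heaps");

// Models N copies of the global shader-visible descriptor heap. Each
// "descriptor" is a uint32_t naming the buffer version it points at.
class FrameHeaps
{
public:
    explicit FrameHeaps(uint32_t numDescriptors)
    {
        for (std::vector<uint32_t>& heap : heaps_)
            heap.assign(numDescriptors, 0);
    }

    // Called at the start of each frame, once the GPU has finished the frame
    // that last used heap (frameNumber % N). Replays writes this heap missed.
    void BeginFrame(uint64_t frameNumber)
    {
        current_ = uint32_t(frameNumber % kFramesInFlight);
        std::vector<PendingWrite> stillPending;
        for (PendingWrite write : pending_)
        {
            heaps_[current_][write.index] = write.value;
            if (--write.heapsRemaining > 0)
                stillPending.push_back(write);
        }
        pending_ = std::move(stillPending);
    }

    // Updates the descriptor at a fixed index: immediately in the current
    // heap, and in each of the other N - 1 heaps when it next becomes current.
    void Write(uint32_t index, uint32_t value)
    {
        heaps_[current_][index] = value;
        pending_.push_back({ index, value, kFramesInFlight - 1 });
    }

    uint32_t Read(uint32_t heap, uint32_t index) const { return heaps_[heap][index]; }

private:
    struct PendingWrite { uint32_t index; uint32_t value; uint32_t heapsRemaining; };

    std::array<std::vector<uint32_t>, kFramesInFlight> heaps_;
    std::vector<PendingWrite> pending_;
    uint32_t current_ = 0;
};

// Hypothetical CPU-updated buffer with one internal version per frame in
// flight. Shaders always reach it through the same descriptorIndex.
struct DynamicBuffer
{
    uint32_t descriptorIndex = 0;
    uint32_t version = 0;

    // Switch to a version the GPU is not reading before writing new contents,
    // then point the persistent descriptor at it.
    void Map(FrameHeaps& heaps)
    {
        version = (version + 1) % kFramesInFlight;
        heaps.Write(descriptorIndex, version);
    }
};
```

In D3D12, a replayed write would typically be a call such as ID3D12Device::CreateShaderResourceView() or ID3D12Device::CopyDescriptorsSimple() targeting the same index in the newly current heap.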

17.4 BINDLESS RESOURCES WITH DXR

Though bindless techniques are very useful in general, their benefits really become apparent when used in conjunction with ray tracing. Having global access to all texture and buffer resources is a great fit when tracing rays that can potentially intersect any geometry in the entire scene, and it is a de facto requirement when working with the new inline tracing functionality introduced with DXR 1.1. In this section, we will walk through an example implementation of a simple path tracer that utilizes bindless techniques to access geometry buffers as well as per-material textures. The source code and Visual Studio project for the complete implementation are available to view and download on GitHub [7].

Our simple path tracer will work as follows:

> For every frame, DispatchRays is called to launch one thread per pixel on the screen.

> Every dispatched thread traces a single camera ray into the scene.

> In the closest-hit shader, the surface properties are determined from geometry buffers and material textures.

> The hit shader computes direct lighting from the sun and local light sources, casting a shadow ray to determine visibility.

> The hit shader recursively traces another ray into the scene to gather indirect lighting from other surfaces.


> A miss shader is used to sample sky lighting from a procedural sky model.

> For materials using alpha-testing, an any-hit shader is run to sample an opacity texture.

> Once the original ray generation thread finishes, it progressively updates the radiance stored in a floating-point accumulation texture.
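The accumulation step in the last item amounts to a running average. Here is a minimal CPU sketch, assuming the accumulation texture stores the mean radiance of all samples taken so far and that the number of previously accumulated samples is tracked separately (for example in a constant buffer); the names are hypothetical, and the sample project may organize this differently.

```cpp
#include <cassert>
#include <cstdint>

struct Float3 { float x, y, z; };

// Folds the mean of this frame's newSamples radiance samples (frameMean) into
// the running mean of previousSamples earlier samples. This equals
// lerp(accumulated, frameMean, newSamples / (previousSamples + newSamples)).
Float3 Accumulate(Float3 accumulated, Float3 frameMean,
                  uint32_t previousSamples, uint32_t newSamples)
{
    const float t = float(newSamples) / float(previousSamples + newSamples);
    return { accumulated.x + (frameMean.x - accumulated.x) * t,
             accumulated.y + (frameMean.y - accumulated.y) * t,
             accumulated.z + (frameMean.z - accumulated.z) * t };
}
```

With one sample per frame, frame k (counting from zero) receives a weight of 1 / (k + 1), so the texture always holds the plain average of every sample so far.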

Because we will be using bindless techniques to access our resources, we can completely forgo local root signatures in favor of a single global root signature used for the entire call to DispatchRays. This global root signature will contain the following:

> An SRV descriptor table containing our entire shader-visible descriptor heap (with overlapping ranges for each resource type).

> A root SRV descriptor for our scene’s acceleration structure.

> A UAV descriptor table containing a descriptor for our accumulation texture.²

> A root CBV descriptor containing global constants filled out by the CPU just before the call to DispatchRays.

> Another root CBV containing application settings whose values come from runtime UI controls.

Because we do not need any local root signatures, we can completely skip adding any D3D12_LOCAL_ROOT_SIGNATURE subobjects to our state object, as well as the D3D12_SUBOBJECT_TO_EXPORTS_ASSOCIATION subobjects for associating a local root signature with an export. This also means that each shader record in our shader tables only needs to contain the shader identifier, as we have no parameters for a local root signature.
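With no local root arguments, every shader record is exactly one 32-byte identifier, so laying out the tables is simple arithmetic. The sketch below assumes that the ray generation, miss, and hit group tables are packed into one buffer; the constants restate the documented values of D3D12_SHADER_IDENTIFIER_SIZE_IN_BYTES, D3D12_RAYTRACING_SHADER_RECORD_BYTE_ALIGNMENT, and D3D12_RAYTRACING_SHADER_TABLE_BYTE_ALIGNMENT, and the structs and function are hypothetical. The resulting offsets, sizes, and strides correspond to the StartAddress, SizeInBytes, and StrideInBytes fields of D3D12_DISPATCH_RAYS_DESC (relative to the buffer's GPU address).

```cpp
#include <cassert>
#include <cstdint>

// Assumed to match d3d12.h.
constexpr uint64_t kShaderIdentifierSize = 32;  // D3D12_SHADER_IDENTIFIER_SIZE_IN_BYTES
constexpr uint64_t kRecordAlignment = 32;       // D3D12_RAYTRACING_SHADER_RECORD_BYTE_ALIGNMENT
constexpr uint64_t kTableAlignment = 64;        // D3D12_RAYTRACING_SHADER_TABLE_BYTE_ALIGNMENT

constexpr uint64_t AlignUp(uint64_t value, uint64_t alignment)
{
    return (value + alignment - 1) / alignment * alignment;
}

// Offset (from the start of the buffer), size, and stride of one table.
struct TableRange { uint64_t offset; uint64_t size; uint64_t stride; };

struct ShaderTableLayout
{
    TableRange rayGen;  // Stride unused: DispatchRays takes a single ray generation record
    TableRange miss;
    TableRange hitGroup;
    uint64_t totalSize;
};

// Every record holds only a shader identifier, so each stride is 32 bytes
// and each table starts at the next 64-byte boundary.
ShaderTableLayout ComputeShaderTableLayout(uint64_t numMissShaders, uint64_t numHitGroups)
{
    const uint64_t stride = AlignUp(kShaderIdentifierSize, kRecordAlignment);
    ShaderTableLayout layout{};
    uint64_t offset = 0;

    layout.rayGen = { offset, stride, stride };
    offset = AlignUp(offset + layout.rayGen.size, kTableAlignment);

    layout.miss = { offset, numMissShaders * stride, stride };
    offset = AlignUp(offset + layout.miss.size, kTableAlignment);

    layout.hitGroup = { offset, numHitGroups * stride, stride };
    layout.totalSize = offset + layout.hitGroup.size;
    return layout;
}
```

Compared with the HitGroupRecord helper shown earlier, there are no per-record arguments to pack, so the only remaining rules are the record and table alignments.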

²Bindless UAVs can also be used on a device that supports D3D12_RESOURCE_BINDING_TIER_3, which is currently the case for all DXR-capable GPUs. However, the example path tracer referenced by this chapter only utilizes bindless techniques for SRVs.

When we described the basic functionality of our path tracer, we mentioned how it needs to compute lighting response every time a path intersects with geometry in the scene. To do this, we need to properly reconstruct the surface attributes at the point of intersection. For attributes such as normals and UV coordinates that are derived from per-vertex attributes, our hit shaders will need to access the vertex and index buffers in order to find the three relevant triangle vertices and interpolate their data. To facilitate this, we will create a StructuredBuffer whose elements are defined by the following struct:

```hlsl
struct GeometryInfo
{
    uint VertexOffset;   // Can alternatively be descriptor indices
                         // to unique vertex/index buffer SRVs
    uint IndexOffset;
    uint MaterialIndex;  // Assumes a single global buffer, could also
                         // have a descriptor index for the buffer
                         // and then another index into that buffer
    uint PadTo16Bytes;   // For performance reasons, not required
};
```
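On the CPU, the contents of this buffer can be produced while concatenating all meshes into shared vertex and index buffers. The following minimal sketch assumes one GeometryInfo per mesh, emitted in the same order as the D3D12_RAYTRACING_GEOMETRY_DESC array so that GeometryIndex() lines up, and mesh-local index values; MeshData and BuildGeometryInfos are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// CPU mirror of the HLSL GeometryInfo struct; must match its 16-byte layout.
struct GeometryInfo
{
    uint32_t VertexOffset;
    uint32_t IndexOffset;
    uint32_t MaterialIndex;
    uint32_t PadTo16Bytes;
};
static_assert(sizeof(GeometryInfo) == 16, "Must match the HLSL struct");

// Hypothetical per-mesh source data.
struct MeshData
{
    uint32_t numVertices;
    uint32_t numIndices;
    uint32_t materialIndex;
};

// Records where each mesh starts within the shared global buffers.
// Element i must correspond to geometry desc i so GeometryIndex() finds it.
std::vector<GeometryInfo> BuildGeometryInfos(const std::vector<MeshData>& meshes)
{
    std::vector<GeometryInfo> infos;
    infos.reserve(meshes.size());
    uint32_t vertexOffset = 0;
    uint32_t indexOffset = 0;
    for (const MeshData& mesh : meshes)
    {
        infos.push_back({ vertexOffset, indexOffset, mesh.materialIndex, 0 });
        vertexOffset += mesh.numVertices;  // Next mesh's vertices follow this one
        indexOffset += mesh.numIndices;    // Likewise for indices
    }
    return infos;
}
```

Because the stored indices stay mesh-local, the hit shader adds VertexOffset after reading each index, which is exactly what the GetHitSurface() function in this section does.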

Our buffer will be created with one of these elements for every D3D12_RAYTRACING_GEOMETRY_DESC that was provided when building the scene acceleration structure, which in our case corresponds to a single mesh that was present in the original source scene. By doing it this way, we can conveniently index into this buffer using the GeometryIndex() intrinsic that is available to our hit shaders.³

To complement our GeometryInfo buffer, we will also build another structured buffer containing one entry for every unique material in the scene. The elements of this buffer will be defined by the following struct:

```hlsl
struct Material
{
    uint Albedo;
    uint Normal;
    uint Roughness;
    uint Metallic;
    uint Opacity;
    uint Emissive;
};
```

Each uint in this struct is the index of a descriptor in our global descriptor heap, which allows hit shaders to access those textures through the SRV descriptor table in the global root signature. Note that this assumes a rather uniform material model, where all materials use the same set of textures and don’t require any additional parameters. However, it would be trivial to add additional values to this struct if necessary, provided that those values are common to all materials. It would also be straightforward for a material to indicate that it did not need to sample a particular texture by providing a known invalid index (such as uint32_t(-1)). The hit shaders could then check for this and branch over the texture sample if necessary. A more complex approach might involve having the Material struct instead provide the index of a CBV descriptor, whose layout would be interpreted by the individual hit shaders. Alternatively, a set of heterogeneous data could be packed into a single ByteAddressBuffer, where again it would be up to the hit shader to interpret and load the data appropriately based on the material’s requirements.

³GeometryIndex() is a new intrinsic that was added for DXR 1.1.
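The invalid-index convention could look like this on the CPU side. This is a hypothetical sketch: kInvalidDescriptorIndex, HasTexture, and NeedsAnyHitShader are illustrative names, and the last function merely shows one way a renderer might decide, per material, whether the alpha-testing any-hit shader is needed. A hit shader would perform the same comparison before indexing Tex2DTable.

```cpp
#include <cassert>
#include <cstdint>

// Marks a texture slot the material does not use (equal to uint32_t(-1)).
constexpr uint32_t kInvalidDescriptorIndex = UINT32_MAX;

// CPU mirror of the HLSL Material struct: each member is a descriptor index
// into the global heap, defaulting to "no texture".
struct Material
{
    uint32_t Albedo = kInvalidDescriptorIndex;
    uint32_t Normal = kInvalidDescriptorIndex;
    uint32_t Roughness = kInvalidDescriptorIndex;
    uint32_t Metallic = kInvalidDescriptorIndex;
    uint32_t Opacity = kInvalidDescriptorIndex;
    uint32_t Emissive = kInvalidDescriptorIndex;
};
static_assert(sizeof(Material) == 6 * sizeof(uint32_t), "Must match the HLSL struct");

// The check a hit shader would make before sampling a texture.
constexpr bool HasTexture(uint32_t descriptorIndex)
{
    return descriptorIndex != kInvalidDescriptorIndex;
}

// Example policy: only materials with an opacity map need alpha testing.
constexpr bool NeedsAnyHitShader(const Material& material)
{
    return HasTexture(material.Opacity);
}
```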

With our shader tables, state object, and geometry/material buffers built, we can now call DispatchRays() to launch many threads of our ray generation program. As we mentioned earlier, our simple path tracer will work by launching one thread for every pixel on the screen. Each of these threads then starts out by using its associated pixel coordinate to compute a camera ray according to a standard perspective projection, which effectively serves as a simple pinhole camera model. Once we’ve computed the appropriate ray direction, we can then call TraceRay() to trace a ray from the camera’s position into our scene. This ray’s payload contains a float3 that will be set to the computed radiance being reflected or emitted toward the camera, which can then be written into an output texture through a RWTexture2D. Our path tracer is progressive, which means that it will compute N radiance samples per pixel every frame (with N defaulting to 1) and update the output accumulation texture with these samples. This allows the path tracer to work its way toward a final converged image by doing a portion of the work every frame, thus remaining interactive.
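The pinhole camera ray described above can be sketched as follows. This minimal CPU version assumes a camera that looks down +Z with +Y up, a vertical field of view in radians, and rays through pixel centers; it returns a normalized camera-space direction that a ray generation shader would then rotate into world space. The names and conventions are assumptions, not taken from the sample.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct Float3 { float x, y, z; };

// Camera-space direction of the primary ray through the center of pixel
// (px, py) for a pinhole camera looking down +Z with +Y up (assumed).
Float3 PinholeRayDirection(uint32_t px, uint32_t py, uint32_t width, uint32_t height, float fovY)
{
    // Pixel center to normalized device coordinates in [-1, 1]; screen Y grows downward.
    const float ndcX = 2.0f * (float(px) + 0.5f) / float(width) - 1.0f;
    const float ndcY = 1.0f - 2.0f * (float(py) + 0.5f) / float(height);

    const float tanHalfFov = std::tan(fovY * 0.5f);
    const float aspect = float(width) / float(height);

    // Point on the image plane at distance 1, then normalize.
    const Float3 dir = { ndcX * tanHalfFov * aspect, ndcY * tanHalfFov, 1.0f };
    const float invLen = 1.0f / std::sqrt(dir.x * dir.x + dir.y * dir.y + dir.z * dir.z);
    return { dir.x * invLen, dir.y * invLen, dir.z * invLen };
}
```

In the ray generation shader, px and py would come from DispatchRaysIndex() and width and height from DispatchRaysDimensions(), with one thread per pixel as described above.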

In order to compute the outgoing radiance at each hit point (or vertex in path tracer terminology), our closest-hit shader needs to compute the surface attributes at the hit point and then compute the appropriate lighting response. Our first step is to use our GeometryInfo buffer to fetch the relevant data for the particular mesh that was intersected by the ray:

```hlsl
const uint geoInfoBufferIndex = GlobalCB.GeoInfoBufferIndex;
StructuredBuffer<GeometryInfo> geoInfoBuffer = GeometryInfoBuffers[geoInfoBufferIndex];
const GeometryInfo geoInfo = geoInfoBuffer[GeometryIndex()];
```

Once we have the relevant data for the mesh that was hit, we can then begin to reconstruct the surface attributes by interpolating the per-vertex attributes according to the barycentric coordinates of the ray/triangle intersection:

```hlsl
MeshVertex GetHitSurface(in HitAttributes attr, in GeometryInfo geoInfo)
{
    // DXR provides the weights of vertices 1 and 2; vertex 0 gets the remainder.
    float3 barycentrics;
    barycentrics.x = 1 - attr.barycentrics.x - attr.barycentrics.y;
    barycentrics.y = attr.barycentrics.x;
    barycentrics.z = attr.barycentrics.y;

    StructuredBuffer<MeshVertex> vertexBuffer = VertexBuffers[GlobalCB.VertexBufferIndex];
    Buffer<uint> indexBuffer = BufferUintTable[GlobalCB.IndexBufferIndex];

    // Fetch the hit triangle's three indices from this mesh's index range.
    const uint primId = PrimitiveIndex();
    const uint id0 = indexBuffer[primId * 3 + geoInfo.IndexOffset + 0];
    const uint id1 = indexBuffer[primId * 3 + geoInfo.IndexOffset + 1];
    const uint id2 = indexBuffer[primId * 3 + geoInfo.IndexOffset + 2];

    // Indices are relative to the mesh, so offset them into the shared vertex buffer.
    const MeshVertex vtx0 = vertexBuffer[id0 + geoInfo.VertexOffset];
    const MeshVertex vtx1 = vertexBuffer[id1 + geoInfo.VertexOffset];
    const MeshVertex vtx2 = vertexBuffer[id2 + geoInfo.VertexOffset];

    return BarycentricLerp(vtx0, vtx1, vtx2, barycentrics);
}
```
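The chapter does not show BarycentricLerp. As a hedged illustration of what such a helper plausibly does, here is a C++ sketch that expands the two barycentric weights reported by DXR into three and computes the weighted sum of each vertex attribute. The Vertex struct is a simplified, hypothetical stand-in for MeshVertex that carries only a position and a UV.

```cpp
#include <cassert>
#include <cmath>

struct Float2 { float x, y; };
struct Float3 { float x, y, z; };

// Simplified, hypothetical stand-in for MeshVertex.
struct Vertex {
    Float3 Position;
    Float2 UV;
};

// DXR's built-in triangle attributes hold the weights of vertices 1 and 2;
// vertex 0 receives whatever remains, as in GetHitSurface.
Float3 ExpandBarycentrics(Float2 attrib)
{
    return { 1.0f - attrib.x - attrib.y, attrib.x, attrib.y };
}

Float3 Lerp3(Float3 a, Float3 b, Float3 c, Float3 w)
{
    return { a.x * w.x + b.x * w.y + c.x * w.z,
             a.y * w.x + b.y * w.y + c.y * w.z,
             a.z * w.x + b.z * w.y + c.z * w.z };
}

Float2 Lerp2(Float2 a, Float2 b, Float2 c, Float3 w)
{
    return { a.x * w.x + b.x * w.y + c.x * w.z,
             a.y * w.x + b.y * w.y + c.y * w.z };
}

// Interpolates every vertex attribute with the same three weights.
Vertex BarycentricLerp(const Vertex& v0, const Vertex& v1, const Vertex& v2, Float3 w)
{
    assert(std::fabs(w.x + w.y + w.z - 1.0f) < 1e-4f);  // weights must sum to 1
    return { Lerp3(v0.Position, v1.Position, v2.Position, w),
             Lerp2(v0.UV, v1.UV, v2.UV, w) };
}
```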

For surface attributes defined by textures, we must fetch the appropriate material definition and use it to sample our standard set of textures.

```hlsl
StructuredBuffer<Material> materialBuffer =
    MaterialBuffers[GlobalCB.MaterialBufferIndex];
const Material material = materialBuffer[geoInfo.MaterialIndex];

Texture2D albedoMap = Tex2DTable[material.Albedo];
Texture2D normalMap = Tex2DTable[material.Normal];
Texture2D roughnessMap = Tex2DTable[material.Roughness];
Texture2D metallicMap = Tex2DTable[material.Metallic];
Texture2D emissiveMap = Tex2DTable[material.Emissive];

// Sample textures and compute final surface attributes.
```

Note how we are able to effectively sample from any arbitrary set of textures here while using only a single hit shader and no local root signatures! This use case perfectly demonstrates the power and flexibility of bindless techniques: all resources are at our disposal from within a shader, and we are able to store handles to those resources in any arbitrary data structure that we would like to use.
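Because a material is just a set of descriptor indices, the CPU side only has to remember which heap slot each texture’s SRV was written to. Below is a minimal C++ sketch, assuming a simple linear allocator over the shader-visible CBV/SRV/UAV heap. DescriptorIndexAllocator and this Material layout are hypothetical; the fields mirror those read by the shaders in this section and would have to match the HLSL struct exactly.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical CPU-side mirror of the HLSL Material struct. Every field is a
// descriptor index into the shader-visible heap, which the shader feeds
// straight into Tex2DTable[...].
struct Material {
    std::uint32_t Albedo;
    std::uint32_t Normal;
    std::uint32_t Roughness;
    std::uint32_t Metallic;
    std::uint32_t Emissive;
    std::uint32_t Opacity;
};
static_assert(sizeof(Material) == 6 * sizeof(std::uint32_t),
              "layout must match the HLSL struct read through StructuredBuffer");

// Hypothetical linear allocator for slots in the shader-visible heap. A real
// renderer would write each texture's SRV into the returned slot with
// ID3D12Device::CreateShaderResourceView and would also recycle freed slots.
class DescriptorIndexAllocator {
public:
    explicit DescriptorIndexAllocator(std::uint32_t capacity) : capacity_(capacity) {}

    std::uint32_t Allocate()
    {
        assert(next_ < capacity_ && "shader-visible heap is full");
        return next_++;
    }

private:
    std::uint32_t capacity_;
    std::uint32_t next_ = 0;
};
```

Two materials that share a texture simply store the same index, and adding a material never requires touching a root signature or a shader table.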

With the surface attributes and material properties for the hit point loaded into local variables, we can finally run our path tracing algorithm to compute an incremental portion of the radiance that will either be emitted from the surface or reflected off the surface in the direction of the incoming ray. The full algorithm is as follows:

> Get the set of material textures and sample them using vertex UV coordinates to build surface and shading parameters.

> Account for emission from the surface.

> Sample direct lighting from the sun and cast a shadow ray.


> Sample direct lighting from local light sources and cast a shadow ray for each.

> Choose to sample a diffuse or specular BRDF.

– If diffuse, choose a random cosine-weighted sample on the hemisphere surrounding the surface normal.

– If specular, choose a random sample from the distribution of visible microfacet normals and reflect the incoming ray about that sampled normal.

> Recursively evaluate the incoming radiance in the sample direction.

> Terminate when the desired max path length is reached.

– Alternatively, use Russian roulette to terminate rays with low throughput.
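Two of the steps above, cosine-weighted hemisphere sampling and Russian roulette, are compact enough to sketch directly. The C++ below is a minimal illustration rather than the path tracer’s actual code. Directions are produced in tangent space with +z along the surface normal using Malley’s method (sample the unit disk uniformly, then project onto the hemisphere), and the roulette survival probability is assumed to be the largest throughput component clamped to 1. All function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Float3 { float x, y, z; };

constexpr float Pi = 3.14159265f;

// Cosine-weighted direction in tangent space (+z is the surface normal),
// via Malley's method: pick a point uniformly on the unit disk, then project
// it up onto the hemisphere. u1 and u2 are uniform random numbers in [0, 1).
Float3 SampleCosineHemisphere(float u1, float u2)
{
    assert(u1 >= 0.0f && u1 < 1.0f && u2 >= 0.0f && u2 < 1.0f);
    const float r = std::sqrt(u1);
    const float phi = 2.0f * Pi * u2;
    return { r * std::cos(phi), r * std::sin(phi), std::sqrt(std::max(0.0f, 1.0f - u1)) };
}

// Solid-angle PDF of the sample above: cos(theta) / pi.
float CosineHemispherePdf(float cosTheta)
{
    return cosTheta / Pi;
}

// Russian roulette: keep the path with probability p and divide its
// throughput by p so the expected contribution is unchanged.
// Returns false when the path should be terminated. u is uniform in [0, 1).
bool RussianRoulette(Float3& throughput, float u)
{
    const float p = std::min(1.0f, std::max({ throughput.x, throughput.y, throughput.z }));
    if (u >= p)
        return false;  // also covers p == 0, so there is no division by zero
    throughput = { throughput.x / p, throughput.y / p, throughput.z / p };
    return true;
}
```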

As for our miss shaders, we merely need to sample our procedural sky model in order to account for the radiance that it provides the scene. We can do this by sampling a cube map texture that contains a cache of radiance values for each world-space direction, which we can obtain by passing its descriptor index to the shader through a global constant buffer. One exception is for primary rays that were cast directly from the camera: for this case we also want to sample the emitted radiance from our procedural sun, which allows the sun to be visible in our rendered images. We skip this for secondary rays cast from surfaces, as we directly importance-sample the sun to compute its direct lighting contribution.

```hlsl
[shader("miss")]
void MissShader(inout PrimaryPayload payload)
{
    const float3 rayDir = WorldRayDirection();

    TextureCube skyTexture = TexCubeTable[RayTraceCB.SkyTextureIdx];
    float3 radiance = 0.0f;
    if(AppSettings.EnableSky)
        radiance = skyTexture.SampleLevel(SkySampler, rayDir, 0).xyz;

    if(payload.PathLength == 1)
    {
        float cosSunAngle = dot(rayDir, RayTraceCB.SunDirectionWS);
        if(cosSunAngle >= RayTraceCB.CosSunAngularRadius)
            radiance = RayTraceCB.SunRenderColor;
    }

    payload.Radiance = radiance;
}
```
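The sun test in the miss shader compares the cosine of the angle between the ray and the sun direction against a precomputed cosine of the sun’s angular radius. Because cosine decreases over [0, π], cosSunAngle >= CosSunAngularRadius is equivalent to the angle lying within the radius, and it avoids an acos per ray. A minimal C++ sketch of the same test, assuming unit-length direction vectors; Float3, Dot, and HitsSunDisk are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

struct Float3 { float x, y, z; };

float Dot(Float3 a, Float3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// True if the unit-length ray direction points into the sun's disk.
// cosSunAngularRadius = cos(angular radius) is computed once on the CPU,
// the role RayTraceCB.CosSunAngularRadius plays in the miss shader.
bool HitsSunDisk(Float3 rayDir, Float3 sunDir, float cosSunAngularRadius)
{
    assert(std::fabs(Dot(rayDir, rayDir) - 1.0f) < 1e-3f);  // expects unit vectors
    // A larger cosine means a smaller angle between the two directions.
    return Dot(rayDir, sunDir) >= cosSunAngularRadius;
}
```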


To support foliage that utilizes alpha testing, we provide an any-hit shader that samples an opacity map to determine if the ray intersection should be discarded. We obtain this opacity map from the material data using the same method that we utilized for the hit shader:

```hlsl
[shader("anyhit")]
void AnyHitShader(inout PrimaryPayload payload, in HitAttributes attr)
{
    const uint geoInfoBufferIndex = GlobalCB.GeoInfoBufferIndex;
    StructuredBuffer<GeometryInfo> geoInfoBuffer =
        GeometryInfoBuffers[geoInfoBufferIndex];

    const GeometryInfo geoInfo = geoInfoBuffer[GeometryIndex()];
    const MeshVertex hitSurface = GetHitSurface(attr, geoInfo);

    StructuredBuffer<Material> materialBuffer =
        MaterialBuffers[GlobalCB.MaterialBufferIndex];
    const Material material = materialBuffer[geoInfo.MaterialIndex];

    // Standard alpha testing
    Texture2D opacityMap = Tex2DTable[material.Opacity];
    if(opacityMap.SampleLevel(MeshSampler, hitSurface.UV, 0).x < 0.35f)
        IgnoreHit();
}
```

17.5 PRACTICAL IMPLICATIONS OF USING BINDLESS TECHNIQUES

As this chapter has demonstrated, there are many tangible benefits to utilizing bindless techniques in a D3D12 renderer. This is true regardless of whether a renderer makes use of ray tracing through DXR or sticks to more traditional rasterization-based techniques. However, there are also several downsides and practical implications of which one should be aware before broadly adopting a bindless approach.

17.5.1 MINIMUM HARDWARE REQUIREMENTS

Before using bindless techniques at all, it’s important to know the minimum hardware requirements for using them. As of the time this chapter was written, all DXR-capable hardware has support for D3D12_RESOURCE_BINDING_TIER_3, which is capable of utilizing bindless techniques for all shader-accessible resource types. This means that bindless can be utilized with DXR without concern that the underlying hardware or driver will not have support for dynamic indexing into an unbounded SRV, UAV, or sampler table. However, it may be important to consider what functionality is supported on older hardware if writing a general D3D12 rendering engine that supports older generations of video cards.


While the descriptor heap and table abstractions exposed by D3D12 suggest that they only run on the sort of hardware that can read descriptors from memory (and are therefore capable of using arbitrary descriptors in a shader program), in reality the API was carefully designed to allow for older D3D11-era hardware to be supported through compliant drivers. This class of hardware falls under the Tier 1 resource binding tier, where the device reports a value of D3D12_RESOURCE_BINDING_TIER_1 for the ResourceBindingTier member of the D3D12_FEATURE_DATA_D3D12_OPTIONS structure. Devices with Tier 1 resource binding limit the size of bound SRV descriptor tables to the maximum number of SRVs supported by D3D11 (128), which is far less than the number of textures and/or buffers that most engines will have loaded at any given time. Therefore, we would generally consider Tier 1 hardware to be incapable of using bindless techniques, at least outside of certain special-case scenarios. As of the time this chapter was written, GPUs based on NVIDIA’s Fermi architecture (GTX 400 and 500 series) as well as Intel Gen 7.5 (Haswell) and Gen 8 (Broadwell) architecture will report Tier 1 for resource binding [10].

Hardware that falls under the category of Tier 2 resource binding is largely free of restrictions when it comes to accessing SRV descriptors: these devices allow SRV descriptor tables to span the maximum heap size, which is one million. This limit is sufficient for including all shader-readable resources in typical engines, so we would consider this class of hardware to be bindless-capable for SRVs. Tier 2 hardware can also bind a full heap of sampler descriptors (which is capped at 2048), which we would also consider to be sufficient for bindless sampler access in typical scenarios. However, there are still limitations for this tier. Tier 2 has a maximum size of 14 for CBV descriptor tables and a maximum size of 64 for UAV descriptor tables. In addition, this tier imposes restrictions that require that all bound CBV/UAV descriptor tables contain valid, initialized descriptors. As of the time this chapter was written, only GPUs based on NVIDIA’s Kepler architecture (GTX 600 and 700 series) will report Tier 2 for resource binding.⁴

The most recent graphics hardware will report Tier 3 for resource binding, which removes the Tier 2 limitations on the number of UAV and CBV descriptors that can be bound simultaneously in a descriptor table. Tier 3 allows for the full heap to be bound for CBV and UAV tables and also removes the requirement that these tables contain only valid, initialized descriptors. With these restrictions removed, we can say that Tier 3 is bindless-capable for all shader-accessible resources: Shader Resource Views, Unordered Access Views, Constant Buffer Views, and Sampler States. This provides us with maximum freedom and flexibility in terms of providing our shaders with access to our resources. As of the time this chapter was written, GPUs based on NVIDIA’s Maxwell (GTX 900 series), Pascal (GTX 1000 series), Volta (Titan V), Turing (RTX 2000 series and GTX 1600 series), and Ampere (RTX 3000 series) architectures as well as Intel Gen 9 (Skylake), Gen 9.5 (Kaby Lake), Gen 11 (Ice Lake), and Gen 12 (Tiger Lake) report Tier 3 capabilities for resource binding. Tier 3 is also reported by all D3D12-capable AMD hardware, which includes GCN 1 through 5 as well as newer RDNA-based GPUs.

⁴ NVIDIA hardware based on the Maxwell (GTX 900 series) and Pascal (GTX 1000 series) architectures initially reported Tier 2 for resource binding in their earliest D3D12 drivers, but later drivers added full support for Tier 3.
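The tier classification above can be summarized as a small lookup. In a real application the tier comes from ID3D12Device::CheckFeatureSupport with D3D12_FEATURE_D3D12_OPTIONS, by reading the ResourceBindingTier member. The C++ sketch below avoids the D3D12 headers by using its own hypothetical BindingTier and DescriptorType enums, and it encodes only this chapter’s judgment of which descriptor types are practical to access bindlessly at each tier.

```cpp
#include <cassert>

// In a D3D12 application the tier would be queried with:
//   D3D12_FEATURE_DATA_D3D12_OPTIONS options = {};
//   device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS, &options, sizeof(options));
//   ... options.ResourceBindingTier ...
// These enums are hypothetical stand-ins so the sketch compiles on its own.
enum class BindingTier { Tier1, Tier2, Tier3 };
enum class DescriptorType { SRV, UAV, CBV, Sampler };

bool IsBindlessCapable(BindingTier tier, DescriptorType type)
{
    switch (tier) {
    case BindingTier::Tier1:
        // SRV tables are capped at 128 descriptors (the D3D11 limit).
        return false;
    case BindingTier::Tier2:
        // SRV tables may span the full heap and sampler tables the full
        // 2048-entry sampler heap, but CBV tables are limited to 14
        // descriptors and UAV tables to 64.
        return type == DescriptorType::SRV || type == DescriptorType::Sampler;
    case BindingTier::Tier3:
        // Full heap for every descriptor type.
        return true;
    }
    assert(false && "unknown binding tier");
    return false;
}
```

A renderer that must also run on Tier 2 hardware could, for example, use bindless SRVs and samplers everywhere while keeping conventional descriptor tables for UAVs and CBVs.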

17.5.2 VALIDATION AND DEBUGGING TOOLS

D3D12 includes an optional debugging and validation layer that can be invaluable for development. When enabled, this layer can report issues relating to incorrect API usage and invalid descriptors used by shaders, as well as incorrect or missing resource transition barriers. Unfortunately, these kinds of validations become much more difficult to perform for an application that uses bindless techniques for accessing resources. With a more traditional binding setup where all required descriptors are provided in a contiguous descriptor table, a validation layer can track CPU-side API calls in order to determine the set of descriptors that will be accessed within a particular draw or dispatch call. This information can then be used to ensure that these descriptors all point to valid resources and that these resources have been previously transitioned to the appropriate state.

When an application deploys bindless techniques, the set of descriptors accessed by a draw or dispatch is no longer visible through CPU-side API calls. Rather, they can only be determined by following the exact flow of execution that happens on the GPU in order to compute the descriptor index that is used to ultimately access a particular descriptor. To address these scenarios, D3D12 also has a special GPU-based validation layer (abbreviated as GBV). When GBV is enabled, which is done by calling ID3D12Debug1::SetEnableGPUBasedValidation(), the validation layer will patch the compiled shader binaries in order to insert instructions that log the accessed descriptors to a hidden buffer. The contents of this buffer can later be inspected, and the information contained can be used to validate descriptor contents and resource states. This functionality makes it a bona fide requirement for development of bindless renderers. Unfortunately, the shader patching process can add considerable time to the pipeline state object (PSO) creation process, and the patched shaders also cause additional performance overhead on the GPU. Therefore, it is generally recommended to only use it when necessary.

GPU debugging tools such as RenderDoc, PIX, and NVIDIA Nsight Graphics can also be invaluable for solving bugs during development. Unfortunately, they suffer from the same issues that validation layers experience when dealing with bindless applications: it is no longer possible to determine the set of accessed descriptors/resources for a draw or dispatch purely through interception of CPU-side API calls. Without that information, a tool might report all of the resources that were bound through the global descriptor tables, which would likely be the entire descriptor heap for a bindless application. Fortunately, PIX has a solution to this problem: when running analysis on a capture, it can utilize techniques similar to those used by the GPU-based validation layer to determine the set of accessed descriptors for a draw or dispatch. This allows the tool to display those resources appropriately in the State view, which can provide a similar debugging experience to traditional binding. However, it is important to keep in mind that this set of accessed resources can potentially grow very large for cases where descriptor indices are nonuniform within a draw or dispatch.

17.5.3 CRASHES AND UNDEFINED BEHAVIOR

While bindless techniques offer great freedom and flexibility, they also include the potential for new categories of bugs. Though accessing an invalid descriptor is certainly possible with traditional binding methods, it’s generally easier to avoid because descriptor tables are populated on the CPU. With bindless resources we can instead expose an entire descriptor heap to our shaders, which provides more opportunities for bugs to cause an invalid descriptor to be accessed. In a sense these descriptor indices are very similar to using pointers in C or C++ code on the CPU, as they are both quite flexible but also provide ample opportunities to access something invalid.

When an invalid descriptor is accessed by a shader, the results are undefined. The shader program might end up reading garbage resource data that ultimately manifests as graphical artifacts, or it might cause the D3D12 device to enter a removed state that occurs after a fatal error is detected. Because the behavior is unpredictable, it is recommended to leverage both the D3D validation layers and in-engine validation as much as possible. Similar to the concept of a NULL pointer, it is possible to reserve a known value such as uint32_t(-1) to be used as an “invalid” descriptor index. This value can then be used when initializing structures containing descriptor indices, and debugging code can be written to assert that the index is valid before passing the data off to the GPU. It is also possible to write GPU-side shader code that validates descriptor indices before usage, although it can be much more complicated to report the results of that validation to the programmer.
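A minimal C++ sketch of the reserved-index idea, assuming uint32_t(-1) as the sentinel and a simple bounds check against the heap size. The Material fields, IsValidDescriptorIndex, and ValidateMaterial are hypothetical. Default-initializing every index to the sentinel means that a field the loader forgot to fill in trips the assert instead of silently reading whatever descriptor lives in slot 0.

```cpp
#include <cassert>
#include <cstdint>

// Reserved "null" descriptor index, analogous to a NULL pointer.
constexpr std::uint32_t InvalidDescriptorIndex = std::uint32_t(-1);

// Hypothetical material record whose indices all start out invalid.
struct Material {
    std::uint32_t Albedo = InvalidDescriptorIndex;
    std::uint32_t Normal = InvalidDescriptorIndex;
    std::uint32_t Opacity = InvalidDescriptorIndex;
};

// True when an index refers to an actual slot in a heap of the given size.
bool IsValidDescriptorIndex(std::uint32_t index, std::uint32_t heapSize)
{
    return index != InvalidDescriptorIndex && index < heapSize;
}

// Debug-build check to run before copying materials into an upload buffer.
void ValidateMaterial(const Material& m, std::uint32_t heapSize)
{
    assert(IsValidDescriptorIndex(m.Albedo, heapSize));
    assert(IsValidDescriptorIndex(m.Normal, heapSize));
    assert(IsValidDescriptorIndex(m.Opacity, heapSize));
}
```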

To aid in debugging situations where a device removal occurs, it is also recommended to implement a GPU crash detection and reporting system within a renderer. Ideally such a system can tell you which particular command the GPU was executing when it crashed or hung, and potentially provide some additional information as to the specific reason for device removal. D3D12 has built-in functionality known as Device Removed Extended Data (DRED) [3] that can facilitate the gathering of this information. When enabled, DRED will automatically insert “breadcrumbs” into command lists that can be read back following a crash in order to determine which commands actually completed on the GPU. The low-level mechanisms utilized by DRED can also be triggered manually for applications that want to implement their own custom system for tracking GPU progress. In addition, NVIDIA offers their own proprietary Aftermath library [4] that can provide additional details if the crash occurs on an NVIDIA GPU. In particular, it can provide source-level information about the shader code that was executing when device removal occurred, provided the shaders were compiled with appropriate debug information.
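To illustrate a custom GPU progress tracker built on the same breadcrumb idea, the C++ sketch below models only the CPU-side bookkeeping and the post-crash analysis. It assumes a hypothetical scheme in which the GPU writes a running count of completed commands into a buffer that the CPU can still read after device removal (ID3D12GraphicsCommandList2::WriteBufferImmediate is one API that can write such values from a command list). BreadcrumbTrail and its methods are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical CPU-side record of the commands submitted in one frame.
// Sequence number i corresponds to labels_[i].
class BreadcrumbTrail {
public:
    // Called while recording; the GPU would be told to write
    // (returned sequence number + 1) once this command has completed.
    std::uint32_t Record(const std::string& label)
    {
        labels_.push_back(label);
        return static_cast<std::uint32_t>(labels_.size() - 1);
    }

    // completedCount is the value read back after device removal. The
    // breadcrumb buffer starts at 0 and the GPU writes (sequence number + 1)
    // after each command, so it equals the number of commands that finished.
    // Returns the label of the first command that did not finish, or an
    // empty string if every recorded command completed.
    std::string FirstIncomplete(std::uint32_t completedCount) const
    {
        assert(completedCount <= labels_.size());
        return completedCount < labels_.size() ? labels_[completedCount] : std::string();
    }

private:
    std::vector<std::string> labels_;
};
```

Since the GPU may overlap consecutive commands, the reported command is where to start looking rather than a guaranteed culprit.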

17.6 UPCOMING D3D12 FEATURES

In Section 17.3, which discussed implementing bindless resources in D3D12, we described the specific steps required for adding the overlapping descriptor ranges and unbounded arrays of HLSL resource types that are needed for allowing our shaders to have global access to all resources. This process is rather clunky and limiting because it adds quite a bit of boilerplate and doesn’t scale well with the effectively unlimited permutations of the StructuredBuffer and ConstantBuffer template types. Though the current setup is workable, it certainly makes one wish for future changes that could simplify the process of implementing bindless resources.

Fortunately, Microsoft has released a set of HLSL additions as part of Shader Model 6.6 [8]. These new features include a dramatically simplified syntax for globally accessing the descriptors within the bound descriptor heaps via the new ResourceDescriptorHeap and SamplerDescriptorHeap objects:

```hlsl
StructuredBuffer<Material> materialBuffer =
    ResourceDescriptorHeap[GlobalCB.MaterialBufferIndex];
const Material material = materialBuffer[geoInfo.MaterialIndex];

Texture2D albedoMap = ResourceDescriptorHeap[material.Albedo];
Texture2D normalMap = ResourceDescriptorHeap[material.Normal];
Texture2D roughnessMap = ResourceDescriptorHeap[material.Roughness];
Texture2D metallicMap = ResourceDescriptorHeap[material.Metallic];
Texture2D emissiveMap = ResourceDescriptorHeap[material.Emissive];

SamplerState matSampler = SamplerDescriptorHeap[material.SamplerIdx];
```

This new syntax completely removes the need for declaring unbounded arrays of HLSL resource types and also makes it unnecessary to add any descriptor table ranges and root parameters to the root signature. The only requirement is that the root signature be created with the new D3D12_ROOT_SIGNATURE_FLAG_CBV_SRV_UAV_HEAP_DIRECTLY_INDEXED flag (and, for indexing SamplerDescriptorHeap, the D3D12_ROOT_SIGNATURE_FLAG_SAMPLER_HEAP_DIRECTLY_INDEXED flag), both of which are available in the latest version of the D3D12 headers.

Taken together, this new functionality greatly reduces the boilerplate and friction encountered when using bindless techniques with earlier shader models. Consequently, we expect the new syntax to become the dominant approach once the new shader model is widely available to developers and end users, and the older syntax to eventually fall out of use. However, we do not feel that this fundamentally changes the overall benefits and drawbacks of using bindless techniques; it will therefore still be important to consider the trade-offs presented earlier in the chapter before deciding whether to adopt it.

17.7 CONCLUSION

In this chapter we demonstrated how bindless techniques can be used in the context of DirectX Raytracing and discussed the various benefits and practical implications of utilizing bindless resources for this purpose. We hope that the reasons for choosing bindless are quite clear, especially when it comes to future rendering techniques that make use of DXR 1.1 inline tracing. Bindless techniques are here to stay and are likely to be a core feature of future APIs and graphics hardware. For current development, programmers are free to determine whether to adopt them completely throughout their renderers or to selectively make use of them for specific scenarios.

For a complete example of a simple DXR path tracer that makes full use of bindless techniques for accessing read-only resources, please consult the DXRPathTracer GitHub repository [9]. Full source code is included for both the application and shader programs, and the project is set up to be easily compiled and run on a DXR-capable Windows 10 PC with Visual Studio 2019 installed.


REFERENCES

[1] Barrett, S. Sparse Virtual Textures. https://silverspaceship.com/src/svt/, 2008. Accessed September 6, 2020.

[2] Bolz, J. OpenGL bindless extensions. https://developer.download.nvidia.com/opengl/tutorials/bindless_graphics.pdf, 2009. Accessed May 16, 2020.

[3] Kristiansen, B. New in D3D12—DRED helps developers diagnose GPU faults. https://devblogs.microsoft.com/directx/dred/, January 24, 2019. Accessed August 23, 2020.

[4] NVIDIA. NVIDIA Nsight Aftermath SDK. https://developer.nvidia.com/nsight-aftermath, 2020. Accessed February 2, 2021.

[5] NVIDIA. Unreal engine sun temple scene. https://developer.nvidia.com/ue4-sun-temple. Accessed June 10, 2021.

[6] Pettineo, M. Bindless texturing for deferred rendering and decals. https://therealmjp.github.io/posts/bindless-texturing-for-deferred-rendering-and-decals/, March 25, 2016. Accessed May 16, 2020.

[7] Pettineo, M. DXR Path Tracer. https://github.com/TheRealMJP/DXRPathTracer, 2018. Accessed May 2, 2020.

[8] Roth, G. Announcing HLSL Shader Model 6.6. https://devblogs.microsoft.com/directx/hlsl-shader-model-6-6/, April 20, 2021. Accessed April 24, 2021.

[9] TheRealMJP. DXRPathTracer. https://github.com/TheRealMJP/DXRPathTracer. Accessed June 10, 2021.

[10] Wikipedia. Feature levels in Direct3D. https://en.wikipedia.org/wiki/Feature_levels_in_Direct3D, 2015. Accessed August 5, 2020.

[11] Wyman, C. and Marrs, A. Introduction to DirectX Raytracing. In E. Haines and T. Akenine-Möller, editors, Ray Tracing Gems, pages 21–47. Apress, 2019.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (http://creativecommons.org/licenses/by-nc-nd/4.0/), which permits any noncommercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if you modified the licensed material. You do not have permission under this license to share adapted material derived from this chapter or parts of it. The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
