Master of Science in Engineering: Game and Software Engineering 2017

Real-time generation of kd-trees for ray tracing using DirectX 11

Martin Säll and Fredrik Cronqvist

Dept. Computer Science & Engineering
Blekinge Institute of Technology
SE–371 79 Karlskrona, Sweden

This thesis is submitted to the Department of Computer Science & Engineering at Blekinge Institute of Technology in partial fulfillment of the requirements for the degree of Master of Science in Computer Science. The thesis is equivalent to 20 weeks of full-time studies.

Contact Information:

Authors: Martin Säll
E-mail: [email protected]

Fredrik Cronqvist E-mail: [email protected]

University advisor: Prof. Lars Lundberg
Dept. Computer Science & Engineering
Mail: [email protected]

Examiner: Veronica Sundstedt
E-mail: [email protected]

Dept. Computer Science & Engineering
Blekinge Institute of Technology
SE–371 79 Karlskrona, Sweden
Internet: www.bth.se
Phone: +46 455 38 50 00
Fax: +46 455 38 50 57

Abstract

Context. Ray tracing has always been a simple but effective way to create a photorealistic scene, but at a cost that grows quickly as the scene expands. Recent improvements in GPU and CPU hardware have made ray tracing faster, making more complex scenes possible in the same amount of processing time. Despite these hardware improvements, ray tracing is still rarely run at interactive speed.

Objectives. The aim of this experiment was to implement a new kd-tree generation algorithm using DirectX 11 compute shaders.

Methods. The implementation created during the experiment was tested on two platforms and five scenarios, where the generation time for the kd-tree was measured in milliseconds. The results were compared to a sequential implementation running on the CPU.

Results. In the end, the kd-tree generation algorithm implemented did not run within our definition of real-time. Comparing the generation times of the two implementations shows a speedup for the GPU implementation over our CPU implementation, as well as linear scaling of the generation time as the number of triangles in the scene increases.

Conclusions. A noticeable limitation encountered during the experiment was that compute shaders' handling of dynamic structures and sorting of arrays is restricted, which forced us to use less memory-efficient solutions.

Keywords: Ray tracing, Kd-tree, acceleration, DirectX 11, Compute shader.

Contents

Abstract i

1 Introduction 1
   1.1 Ray tracing ...... 1
   1.2 Kd-tree ...... 2
       1.2.1 Rebuild per frame ...... 3
   1.3 Our contribution ...... 4

2 Related Work 5
   2.1 Uniform grid ...... 5
   2.2 Bounding volume hierarchies ...... 5
   2.3 Kd-tree ...... 6
       2.3.1 Hardware ...... 6
       2.3.2 Traversing ...... 7
       2.3.3 Split selection ...... 8

3 Method 9
   3.1 The experiment ...... 10
       3.1.1 CPU generation ...... 10
       3.1.2 GPU generation ...... 14
   3.2 Scenarios ...... 20
   3.3 Platforms ...... 21
   3.4 Evaluation method ...... 22

4 Results 23
   4.1 CPU creation ...... 23
   4.2 GPU creation ...... 25
   4.3 Average summary ...... 27
   4.4 Tables with exact values ...... 28
       4.4.1 Speedup ...... 29

5 Discussion 30

6 Conclusions 32

7 Future Work 34

Appendices 37

A Code and raw data 38

List of Algorithms

1 CPU Generation ...... 13
2 GPU Generation ...... 17
3 GPU Generation continuation ...... 18
4 GPU Generation continuation ...... 19

List of Figures

1.1 Kd-tree representations...... 3

3.1 GPU work flow ...... 15
3.2 The scenarios used in the testing, in order from lowest triangle amount to highest ...... 21

4.1 Laptop CPU Graph ...... 24
4.2 Desktop CPU Graph ...... 24
4.3 Laptop GPU Graph ...... 26
4.4 Desktop GPU Graph ...... 26
4.5 Comparison Graph ...... 27

List of Tables

4.1 Laptop generation times ...... 28
4.2 Desktop generation times ...... 28
4.3 Speedup ...... 29

Chapter 1 Introduction

Realistically rendered scenes are the goal of many applications, but are often unobtainable. Ray tracing is a way to get this realism without investing in a multitude of systems and techniques. While ray tracing offers this simplicity, it also comes with a very high performance cost. One way to get around some of the performance problems is to pregenerate the scene. Pregenerated scenes are built at startup, take a long time to generate, and are therefore static: they cannot be altered while the application is running.

Ray tracing's main performance cost comes from the ray-versus-triangle tests that need to be done for every pixel in the window displaying the ray-traced environment, as well as from the number of lights every ray has to trace. When making an application that uses ray tracing there are many performance choices to consider: should the calculations be done on the GPU or on the CPU, which language should be used, and what kinds of extra methods and data structures should be used? All of these choices have their benefits and drawbacks, both alone and in combination.

Real-time is defined differently from application to application, based on what the developers consider necessary to achieve the specific goal of the system. In our case, real-time is defined as having no pregenerated scenes and no structures built at the startup moment of the application. The system shall keep working while the camera moves around in the scene, with no major changes in the quality of the system.

1.1 Ray tracing

The first ray tracing implementation, now called ray casting, was introduced by Arthur Appel in 1968 [1], when he presented a technique that sent rays from the eye of the viewer, one per pixel, that then collided with the closest object in the scene and, with some lighting effects, gave the pixel its color. The next step was presented by Turner Whitted in 1979 [23], when he presented a method which not only traced the rays to the first hit but from that point generated new rays that gave additional information to use for the lighting of the pixel. Ray tracing was introduced to the wider public by Eric Graham in 1987 [9], when he created an animated scene that ran in real-time on the Amiga. The technique has been around for a while, but because of the heavy computations it requires, it has never reached real-time speed for applications with complex scenery. This limitation has stopped the technique from being used for interactive applications like games. In recent years, however, new techniques using the graphics card's ability for massively parallel computations have allowed for drastic improvements in the execution times of ray tracing applications. This has led to new research being done in the area. The main reason ray tracing is slow to compute is the intersection tests between rays and triangles, since every ray needs to be tested against every triangle to find which triangle is closest to the ray origin. The key to achieving real-time performance for a ray tracer in a complex scene is effective and optimized acceleration structures containing the scene geometry to reduce the number of ray-versus-triangle tests.

1.2 Kd-tree

A k-dimensional tree (kd-tree) is a space-partitioning data structure used to speed up various kinds of tests. In this project the kd-tree is used to speed up the intersection tests between rays and triangles by reducing the number of possible intersections. The technique takes the triangles in the scene as input and creates a tree structure. The triangles are split according to a predefined condition. This split is commonly made using the median triangle as the splitting point to make the tree as balanced as possible. The split is made on the list of triangles that the function takes as an input parameter. The function then recursively splits the input list into two halves and calls itself with the new lists as inputs. This continues until an end criterion is met. The end criterion is often a threshold on the minimum number of triangles left to split, or a threshold on the volume that contains them. Once all the splits reach one of these end conditions the tree is created and the function terminates. During this splitting process, nodes have been created and linked into a tree structure, creating the kd-tree with the triangles stored in the leaf nodes once the end condition has been met.

Kd-trees are fast to traverse but slow to build. A standard kd-tree cannot be generated each frame and is usually built at startup. This means that kd-trees cannot handle dynamic objects in scenes, only the static parts. A binary kd-tree was used in this project; the layout of a binary tree is shown in Figure 1.1a, with the root node at the top and left and right child nodes building up the tree. Figure 1.1a also shows which axis each node splits along. Figure 1.1b, found at Wikipedia [14], shows a graphical representation of a scene split by a kd-tree. The entire box is stored in the root node, which is then split, that split being illustrated by the red border. Each of those halves is stored in one of the root node's child nodes; they are then split again, represented by the green border, and stored in their child nodes.
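The recursive median-split build described above can be sketched in a few lines of C++. This is a minimal illustration, not the thesis implementation: `Tri` is reduced to a centroid, the split axis simply cycles with depth, and the end criteria use the leaf threshold of 6 triangles and maximum depth of 20 that the method chapter later mentions.

```cpp
// Minimal sketch of recursive kd-tree construction by median split.
// All names (Tri, Node, buildNode) are illustrative, not from the thesis code.
#include <algorithm>
#include <memory>
#include <vector>

struct Tri { float cx, cy, cz; };           // triangle centroid (stand-in for full geometry)

struct Node {
    int axis = -1;                          // split axis: 0=x, 1=y, 2=z; -1 for a leaf
    std::unique_ptr<Node> left, right;
    std::vector<Tri> tris;                  // only populated in leaf nodes
};

std::unique_ptr<Node> buildNode(std::vector<Tri> tris, int depth,
                                std::size_t maxTris = 6, int maxDepth = 20) {
    auto node = std::make_unique<Node>();
    // End criteria: few enough triangles, or maximum depth reached.
    if (tris.size() <= maxTris || depth >= maxDepth) {
        node->tris = std::move(tris);
        return node;
    }
    int axis = depth % 3;                   // cycle axes (spatial heuristic omitted)
    node->axis = axis;
    auto key = [axis](const Tri& t) {
        return axis == 0 ? t.cx : axis == 1 ? t.cy : t.cz;
    };
    // Median split: partition around the middle element to keep the tree balanced.
    std::size_t mid = tris.size() / 2;
    std::nth_element(tris.begin(), tris.begin() + mid, tris.end(),
                     [&](const Tri& a, const Tri& b) { return key(a) < key(b); });
    node->left  = buildNode({tris.begin(), tris.begin() + mid}, depth + 1, maxTris, maxDepth);
    node->right = buildNode({tris.begin() + mid, tris.end()},   depth + 1, maxTris, maxDepth);
    return node;
}
```

Using `std::nth_element` rather than a full sort is the usual choice here, since only the median is needed, not a fully ordered list.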

(a) A basic layout of a kd-tree.
(b) A graphical representation of how a kd-tree splits a scene. [14]

Figure 1.1: Kd-tree representations.

1.2.1 Rebuild per frame

Building kd-trees for static scenes is fine, since the kd-tree structure does not need to be updated and can hence be created at build time, but if the scene contains dynamic objects, problems occur [22]. Since most applications today contain some form of dynamic objects, the kd-tree will need to be rebuilt each frame.

Creating kd-trees per frame is a costly operation. Zhou et al. [24] have made an implementation using Compute Unified Device Architecture (CUDA) that generates a kd-tree per frame. Their implementation shows great promise for future implementations of per-frame generated kd-trees. Using their implementation ideas as a base, we implemented a per-frame generated kd-tree on the GPU using DirectX 11.

The reason for the slow build time of kd-trees is the massive amount of data that needs to be split, sorted and shuffled around. The creation of kd-trees is highly parallelizable, and to achieve per-frame generation this parallelism needs to be exploited as efficiently as possible.

1.3 Our contribution

In the article written by Zhou et al. [24], they implemented a kd-tree that is created at runtime using CUDA. After reading about this, a pre-study of the area showed that there were few implementations of kd-tree algorithms using DirectX 11. The focus of this experiment was to test one way of implementing a kd-tree algorithm using the DirectX 11 compute shader. The experiment was made in three steps. The first was to implement the kd-tree generation algorithm using the DirectX 11 compute shader, along with a sequential version. The second step was to compare the performance of the DirectX version to the sequential generation of a kd-tree. The third step was to see if it fit our definition of real-time. Our research questions are:

• 1: How can an implementation for a kd-tree be done using DirectX 11 compute shader?

• 2: How well will it perform compared to a sequential generation of a kd-tree?

• 3: How will the performance of the implementation compare to our definition of real-time?

Chapter 2 Related Work

Many different methods exist today to speed up ray tracing. Most of these focus on static scenes and cannot handle dynamic scenes. While progress is being made towards true real-time performance, we are not there yet. The key to real-time ray tracing lies in the construction of, and traversal through, an optimization structure. Three main approaches exist here.

2.1 Uniform grid

One such method is using grids. This method splits the scene into a predefined grid into which the scene data is then sorted. One such method is presented by Guntury and Narayanan [10]. Their system uses multiple structures optimized to give high coherence to the rays that will use them: the view frustum is used, as well as separate structures for each light. These areas are split into grids and the triangles are sorted into the correct parts of the grids. This system is very effective since the coherence for the ray testing is high, and using a front-to-back search a minimal number of ray-vs-triangle tests needs to be done. While this technique is efficient and easily rebuilt per frame, it contains empty grid sections, and highly detailed objects might place a large amount of data in a single grid section and cause a massive decrease in speed [20].
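To make the sorting step concrete, here is a hedged sketch of mapping a point (for example a triangle centroid) to a flat grid-cell index. `GridDesc` and `cellIndex` are our own illustrative names, not taken from [10]:

```cpp
// Illustrative sketch: sorting points into a fixed-resolution uniform grid.
#include <algorithm>
#include <array>
#include <cmath>

struct GridDesc {
    std::array<float, 3> minCorner, cellSize;   // scene bounds origin and per-axis cell extent
    std::array<int, 3>   res;                   // grid resolution per axis
};

// Map a point to a flat cell index, clamping to the grid bounds.
int cellIndex(const GridDesc& g, float x, float y, float z) {
    auto clampAxis = [&](float v, int a) {
        int i = int(std::floor((v - g.minCorner[a]) / g.cellSize[a]));
        return std::max(0, std::min(g.res[a] - 1, i));
    };
    int ix = clampAxis(x, 0), iy = clampAxis(y, 1), iz = clampAxis(z, 2);
    return ix + g.res[0] * (iy + g.res[1] * iz);   // x-fastest flattening
}
```

Per-cell triangle lists can then be built by bucketing triangles on this index, which is what makes grids cheap to rebuild every frame.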

2.2 Bounding volume hierarchies

Another optimization structure commonly used is the bounding volume hierarchy (BVH). This method creates a tree-like structure where full objects are stored in the leaf nodes. Building a BVH is fast, since no really fine-grained testing is needed while building it, unless objects intersect each other, in which case some more exact tests are needed. This method serves the purpose of limiting which objects to test the rays against. In [6], Eisemann et al. present two methods. The first, called Dynamic Goldsmith and Salmon, uses spatial coherence to update the BVH quickly. In the second, called Loose Bounding Volume Hierarchy, a prebuilt BVH is used to assign objects. This creates an unbalanced structure, which is resolved by refitting the structure: it is rebalanced, keeping larger objects at higher levels and skipping empty nodes to avoid unnecessary calculations. While a BVH limits the number of tests needed to find objects, those objects can still contain a large number of triangles and might need their own optimization structure.

2.3 Kd-tree

The final method, and the one we focus on in this experiment, is the kd-tree. As described in Section 1.2, a kd-tree creates a tree-like structure that splits the scene and stores data in the leaf nodes. When implementing a kd-tree, there are several things to consider for the generation: are you using the CPU or the GPU, which traversal method do you use, and which split selection is best?

2.3.1 Hardware

The hardware choice decides in which direction the implementation goes. The CPU is a powerful choice since it is good at general-purpose computations, and most applications have been using the CPU for a long time, so a lot of support is available for CPU programming. Kd-tree implementations have been made using the CPU [12][18][19]. Most of this work has been done to speed up the subsequent ray tracing, not the actual generation of the kd-tree. Havran and Bittner [12] present such improvements: two methods of creating a better kd-tree, one that automatically determines an exit criterion for a node and one that fits the AABB around triangles that exist in multiple splits. These methods are interesting, but they come at the cost of a higher generation time for the kd-tree, so they are not further examined. Shevtsov et al. [19] present a report with detailed implementation steps for ray tracing and go on to discuss memory usage, splitting methods and clustering algorithms. They then implement a ray tracer with two kd-trees, one holding the static parts of the scene and one, updated per frame, holding the dynamic parts of the scene. This drastically improves the generation time of the dynamic kd-tree, since the data actually handled by the dynamic tree is much smaller. While this way of using two kd-trees is interesting, our implementation was developed for fully dynamic scenes, so for now it is not further expanded on. In "Experiences with streaming construction of SAH KD-trees" [18], Popov et al. present a method where they generate the kd-tree in breadth-first order to a certain depth, from where they change to depth-first. This allows for better parallelizability at the start of the generation, to split the workload over multiple CPU cores. This method is interesting and, while not what we later use in our experiment, it is an inspiration for it.

While the CPU is good at general-purpose calculations it is not as proficient at parallel calculations; this is where the GPU excels. The GPU excels at massively parallel single-instruction, multiple-data calculations. Because of this design the GPU can take lists of data as input, efficiently split the work over its threads, run the calculations and return the computed data. Using the GPU does put some additional demands on the code that is executed. Zhou et al. [24] implement a kd-tree on the GPU, created in a way that exploits the parallel environment of the GPU. By working in breadth-first order they utilize the GPU to its maximal potential for most parts of the tree. This method only suffers in the upper nodes of the kd-tree, where there are too many threads for the low number of nodes. They solve this by parallelizing over the triangles in the upper nodes of the kd-tree to utilize the GPU optimally. This work by Zhou et al. is the main inspiration for our experiment, and their parallelization over triangles is the main approach we take in our implementation.

2.3.2 Traversing

When constructing a kd-tree, the choice of traversal algorithm is crucial, since it defines the logic the ray tracer has to use later. Different approaches have been proposed for the traversal of kd-trees [11][7][17][4]. The common way to implement kd-tree traversal is recursive ray traversal [11]. This method evaluates the child nodes of the current node and stores them according to priority in a traversal list, from which it then fetches the next node to visit, repeating the process. This method is good in an environment with support for lists or stacks, which a CPU environment has. A GPU, on the other hand, usually does not have access to a stack, in which case you would have to create one manually, which is often not optimal. Foley and Sugerman [7] present two traversal methods which they call kd-restart and kd-backtrack. Kd-restart traverses the kd-tree one way, and if the search misses the triangles in a leaf node it restarts from the root node. To avoid visiting the same leaf node again, the method stores the ray's exit value from that leaf node and uses it in the next traversal to find a leaf node further from the ray origin than the one just tested. The second method, kd-backtrack, is a variant of the first where, instead of restarting, the nodes also store a link to their parents. This allows for backwards traversal in the tree instead of restarting from the top; otherwise it functions like kd-restart. While these two algorithms do not use a stack, they need pointers in the nodes of the tree, and we tried to avoid this in our implementation, so these methods were not further examined. Popov et al. [17] present two stackless traversal algorithms for GPUs in their paper "Stackless KD-Tree Traversal for High Performance GPU Ray Tracing".

They use a post-processing step to add links directly to neighbouring nodes. This makes it possible to traverse to adjacent nodes without having to backtrack up the tree. While this method achieves good speed, it adds a post-processing step and additional data that needs to be stored in the nodes. The method used as inspiration in our experiment is presented by Barringer and Akenine-Möller [4] in their paper "Dynamic Stackless Binary Tree Traversal". They present a method where nodes are stored in sequence in memory so that the tree can be traversed by calculations rather than links. The method comes with pros and cons: less data needs to be stored in the nodes of the kd-tree, but the structure of the tree might contain empty nodes. Performance-wise it is comparable to a stack-based implementation, if slightly slower. While performance suffers a little from this approach, it enforces a strict hierarchy for the kd-tree, which was preferred for our purposes in the implementation step.
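The idea of traversing by calculation instead of links can be illustrated with the classic level-by-level array layout of a complete binary tree, where children and parents are reached by index arithmetic alone. This is the generic textbook scheme, offered only as an illustration; it is not necessarily the exact layout of [4]:

```cpp
// Sketch of pointer-free tree addressing: a binary tree stored level by level
// in one array, so children and parents are found by index arithmetic alone.
#include <cstddef>

inline std::size_t leftChild (std::size_t i) { return 2 * i + 1; }
inline std::size_t rightChild(std::size_t i) { return 2 * i + 2; }
inline std::size_t parent    (std::size_t i) { return (i - 1) / 2; }

// Number of nodes in a complete binary tree of the given depth (root = depth 0).
inline std::size_t nodeCount(unsigned depth) { return (std::size_t(1) << (depth + 1)) - 1; }
```

Such a layout implies a fixed array of 2^21 − 1 node slots for a depth-20 tree, some of which may stay empty, which matches the trade-off described above.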

2.3.3 Split selection

The splitting technique used in the construction of the kd-tree is a major factor in the final quality and balance of the kd-tree. A hybrid method between uniform grid and kd-tree is presented by Baks [3]: he builds a uniform grid, sorts the triangles into it, and then creates a kd-tree from the precalculated grid. This method is effective since the creation of the uniform grid is fast, and using the uniform grid to create the kd-tree speeds up the creation process. His method is faster and more parallelizable than standard kd-tree generation, but the quality of the kd-tree is lower. Macdonald and Booth [15] describe and test five ways to split a kd-tree. The works they present are interesting, but because we have chosen to generate our entire kd-tree structure before populating it with data, to maximize parallelism, we are forced to use a static split selection.

Chapter 3 Method

When looking for an area to work in, a gap in coverage was found when it came to ray tracing and kd-trees combined as an improvement in a DirectX 11 environment. Such work had been done for the CPU and the GPU before, but only one such work was found for DirectX 11 [3]. DirectX 11 introduced the compute shader, which supports massively parallel calculations, making it well suited for generations and calculations where a lot of data is handled with the same equations repetitively. This experiment gives an alternative approach to Baks' [3] solution.

When the area was defined, the workload needed to be evaluated and the time estimated. Scenarios were used to define the input for the testing. Every scenario contained a scene with static models and a set number of triangles. Scenarios are a major part of the testing, since the differences between them indicate what impact some parts have on the performance. Multiple platforms were also put together to give an indication of the impact, mainly from the different GPU cards.

A baseline implementation of a ray tracer was constructed, and additional subsystems were later added to this baseline. The implementation of the project was made in iterations. This was to make sure that every step of the implementation was as optimized as possible; by not trying to implement every part of the program at the same time we got fewer errors, and problem areas were highlighted during development in a more isolated state, often a few at a time. In the end there were two prototypes, one using the CPU and one using the GPU. Both prototypes use the same splitting and sorting algorithms to create the kd-trees, to allow more accurate results on the impact of using the GPU instead of the CPU. The scenes and meshes are in obj format and are loaded into memory at startup to prevent any impact on the generation of the kd-tree. While the prototypes ran, no new triangles or meshes were added to the scene.


3.1 The experiment

The experiment tests our kd-tree generation algorithm using the DirectX 11 compute shader, measuring the performance and generation time of the kd-tree in a ray-traced environment. The results are compared with a CPU implementation of the algorithm as well as with the results found in Baks' thesis [3]. Doing this gives an indication of the usability of our algorithm in future applications, as well as information on which parts of the implementation need further work. All the test scenarios are static so that they can be used in both the CPU version and the GPU version. To test the dynamic parts in the GPU version, a spin transforming the positions of the triangles in the scene is added. When the triangles are transformed to new positions, they are sorted into new places in the kd-tree, moving around in it as if the scene had dynamic meshes.

To speed up calculations, axis-aligned bounding boxes (AABBs) are used to surround the triangles in the scene and optimize the ray-versus-triangle tests. AABBs are also used in the nodes of the kd-tree to surround the triangles in the nodes, in order to be able to traverse the kd-tree.
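The reason an AABB pays off is that a ray-versus-box test is cheap compared to testing every triangle inside the box. A hedged sketch of the standard slab test (names such as `rayHitsAabb` are ours; the thesis does not show its exact test):

```cpp
// Hedged sketch of the ray-vs-AABB "slab" test used to skip triangle tests.
#include <algorithm>

struct Aabb { float min[3], max[3]; };
struct Ray  { float origin[3], invDir[3]; };   // invDir = 1 / direction, precomputed

bool rayHitsAabb(const Ray& r, const Aabb& b) {
    float tNear = 0.0f, tFar = 1e30f;
    for (int a = 0; a < 3; ++a) {
        // Entry/exit distances of the ray through the slab on this axis.
        float t0 = (b.min[a] - r.origin[a]) * r.invDir[a];
        float t1 = (b.max[a] - r.origin[a]) * r.invDir[a];
        if (t0 > t1) std::swap(t0, t1);
        tNear = std::max(tNear, t0);
        tFar  = std::min(tFar, t1);
        if (tNear > tFar) return false;        // the ray misses this box
    }
    return true;
}
```

Only when this test passes does the traversal descend into the node (or, at a leaf, run the expensive per-triangle tests).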

3.1.1 CPU generation

The CPU generation is a sequential implementation using recursive methods to generate a kd-tree. The kd-tree is generated using spatial splitting instead of median-triangle splitting, since spatial splitting is faster to compute; the drawback is that the kd-tree will not be as evenly split as with a median split. The kd-tree is split until one of two end conditions is met. The first condition is that the number of triangles in the current node is equal to or lower than a predefined threshold value, in this implementation 6. This makes sure that the kd-tree does not become needlessly deep or consume too much memory. The second end condition is a depth condition, set to depth 20: if a node is at depth 20 it is assigned as a leaf node. The depth condition exists because of the memory management and tree traversal method we chose to use, the one presented in "Dynamic Stackless Binary Tree Traversal" [4], which allows for traversal in memory using calculations rather than pointers and therefore works in a stackless environment.

The four functions described below are shown in Algorithm 1.

CreateKdTree

This function creates the base for the kd-tree creation. As input it takes a scene and the root node of the kd-tree. First a list is created to hold all the AABBs that will be computed. Then, for each triangle in the scene, an AABB surrounding that triangle is created. The last computation is the creation of an AABB surrounding all the triangles; this AABB is assigned to the root node and later used for the spatial splitting. Finally CreateKdNodeSplit is called.

CreateKdNodeSplit

This function's purpose is to be the origin point for each recursive iteration of the kd-tree generation; it is called to initiate each iteration of the recursion. It takes a node and a list of AABBs as input. From here the two functions that handle the splitting of the kd-tree are called: first NodeAABBSplit, which handles splitting the AABB in the node, and then SplitAABBList, which splits the triangle AABBs into the child nodes.

NodeAABBSplit

NodeAABBSplit handles the spatial splitting of the AABB in a node. It takes a node as input and initializes the child nodes of that node. After initializing the child nodes, the bounding box of the input node is split using spatial median splitting along the longest axis. The two child nodes then each get one of the created AABBs assigned to them.
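A minimal sketch of such a spatial median split, halving the box at the midpoint of its longest axis; `Box` and `splitLongestAxis` are illustrative names, not the thesis code:

```cpp
// Spatial median split: halve a bounding box along its longest axis.
struct Box { float min[3], max[3]; };

// Returns the chosen split axis; writes the two halves to left/right.
int splitLongestAxis(const Box& b, Box& left, Box& right) {
    int axis = 0;
    float best = b.max[0] - b.min[0];
    for (int a = 1; a < 3; ++a) {
        float len = b.max[a] - b.min[a];
        if (len > best) { best = len; axis = a; }
    }
    float mid = 0.5f * (b.min[axis] + b.max[axis]);   // spatial median, not triangle median
    left = b;   left.max[axis]  = mid;
    right = b;  right.min[axis] = mid;
    return axis;
}
```

Unlike the median-triangle split, this ignores the triangle distribution entirely, which is exactly why it is cheap and why the resulting tree can be unevenly filled.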

SplitAABBList

This function splits a list of AABBs into two new lists using the AABBs of the current node's children. As input, SplitAABBList takes a node and a list of AABBs. It creates two new AABB lists, one for each child of the node, and then, for each AABB in the input list, tests whether that AABB lies inside the bounds of the AABB of each child. If an AABB from the input list is inside the AABB of a child node, it is assigned to the list created for that child node.

With all the AABBs in the input list handled and assigned to the correct child list, some checks are done to determine the next action. The first test is whether either of the child AABB lists is empty. If so, the AABB in the node is shrunk to the size of the AABB of the child whose list is not empty, the child nodes of the node are removed, and CreateKdNodeSplit is called with the current node and the AABB list of the non-empty child, after which the function returns.

Otherwise, the function tests whether the length of either child list is smaller than or equal to maxTriangles. If this is true for a child list, a leaf node is created and assigned at the node->child corresponding to that list, and that node gets the triangle indexes from the child AABB list assigned. This test is done separately for each child list. Lastly, CreateKdNodeSplit is called with the node of node->child and the AABB list of the corresponding child. This call is only made for children that failed both of the above tests; if both children failed, it is called twice, once for each child with its respective node->child and AABB child list.
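The per-AABB test inside the loop can be sketched as a standard per-axis overlap check. This is one common way to implement the "inside the bounds" test; the names are illustrative, and note that under an overlap test a triangle AABB straddling the split plane intersects both children, so a triangle can end up in both child lists:

```cpp
// Per-axis AABB overlap check (separating-axis test for boxes).
struct Aabb { float min[3], max[3]; };

bool overlaps(const Aabb& a, const Aabb& b) {
    for (int axis = 0; axis < 3; ++axis)
        if (a.max[axis] < b.min[axis] || b.max[axis] < a.min[axis])
            return false;   // a separating axis exists, so the boxes are disjoint
    return true;
}
```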

Algorithm 1 CPU Generation

 1: procedure CreateKdTree(mesh: Mesh, rootNode: Node)
 2:     AABBList ← new List
 3:     for each triangle t in mesh
 4:         compute AABB for triangle t
 5:         AABBList.append(AABB)
 6:     compute rootNode bounding box
 7:     CreateKdNodeSplit(rootNode, AABBList)

 8: procedure CreateKdNodeSplit(node: Node, AABBList: List)
 9:     NodeAABBSplit(node)
10:     SplitAABBList(node, AABBList)

11: procedure NodeAABBSplit(node: Node)
12:     node->leftChild ← new Node
13:     node->rightChild ← new Node
14:     split node bounding box along the longest axis
15:     assign each half to the corresponding child

16: procedure SplitAABBList(node: Node, AABBList: List)
17:     AABBListLeftChild ← new List
18:     AABBListRightChild ← new List
19:     for each AABB aabb in AABBList
20:         if aabb intersects node.child
21:             assign aabb to the corresponding child list
22:     if AABBListLeftChild is empty or AABBListRightChild is empty
23:         shrink the AABB in node
24:         remove child nodes in node
25:         CreateKdNodeSplit(node, AABBList)
26:         return
27:     if child list length <= maxTriangles
28:         create leaf node in node.child
29:     else
30:         CreateKdNodeSplit(node.child, AABBListChild)

3.1.2 GPU generation

The GPU generation is split up into dispatches, each doing part of the computations needed for the real-time generation of the kd-tree. Each part is designed for the highly parallel environment of the GPU. The first step of the creation of the kd-tree is to create AABBs around every triangle in the scene. Simultaneously, one single AABB surrounding all the triangles is created; this is the AABB that will be stored in the root node and later used in the spatial splitting. Once the AABBs for the triangles are created and the AABB containing all the triangles is stored in the root node, a new dispatch is issued, creating the AABBs for all the nodes in the kd-tree. The node AABBs are calculated using spatial splitting, splitting the AABBs along the longest axis. This fills the nodes of the kd-tree with the node AABBs that are used to determine which node each triangle belongs in.

Following the creation of the kd-tree AABBs, the triangles are sorted into the correct nodes and the corresponding nodes are assigned as leaf nodes. This is done using the semi-dynamic append/consume buffer structures available to the compute shader, which are used to shuffle the triangle AABB data down the kd-tree to find the nodes that are to be assigned as leaf nodes. The method takes all the triangle AABBs created previously and assigns them to a consume buffer. This buffer is sent as input to the next dispatch. Each thread in this dispatch fetches a triangle AABB from the consume buffer and uses it to calculate the next split the triangle belongs in. After the split is determined, the triangle AABB is appended to an append buffer with the new split values stored.
If the number of triangle AABBs in a single split is lower than a predetermined amount, or the defined maximal depth of the kd-tree is reached, the triangle AABB is appended to a separate structure containing the triangle AABBs that are at their correct locations. This dispatch is then repeatedly called, using the triangle AABB append buffer from the previous dispatch as input, and continues this loop until all the triangle AABBs have been assigned to the structure of correctly placed AABBs. When the triangle AABBs are sorted into the correct nodes and placed into the separate structure, a new dispatch is issued using the buffer of processed triangles as input. This dispatch sorts the list of processed triangles, placing them in an order the render dispatch can use.
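The repeated dispatch can be pictured on the CPU as a ping-pong loop over two work lists: one list is consumed, the other appended to, and finished items land in a third "done" list. This is a hedged CPU analogue of the control flow only, not the shader code; the leaf criterion here is reduced to depth alone, whereas the real criterion also counts triangles per split.

```cpp
#include <utility>
#include <vector>

// One in-flight work item: which node a triangle AABB currently sits in.
struct Item { int node; int depth; };

// CPU analogue of the GPU append/consume loop. Each while-iteration corresponds
// to one dispatch: items are consumed from `in`, pushed one level deeper, and
// either appended to `out` (still in flight) or to `done` (leaf reached).
std::vector<Item> sortToLeaves(std::vector<Item> in, int maxDepth) {
    std::vector<Item> done, out;
    while (!in.empty()) {
        out.clear();
        for (const Item& it : in) {
            Item next{2 * it.node + 1, it.depth + 1};         // descend (here always left)
            if (next.depth >= maxDepth) done.push_back(next); // placed: keep it
            else out.push_back(next);                         // not yet: next dispatch
        }
        std::swap(in, out);               // ping-pong the two buffers
    }
    return done;
}
```

The swap mirrors how the append buffer of one dispatch becomes the consume buffer of the next.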

Create AABB list

The first dispatch, shown in algorithm 2, is responsible for the creation of AABBs around all the triangles in the scene. Using the graphics card, the work is split over the threads and calculated in parallel. The operation fills a structured buffer with AABBs corresponding to the structured buffer containing the triangles. Furthermore, the dispatch appends the AABBs to an append structure that is used extensively in later dispatches. Finally, this dispatch creates an AABB surrounding all the triangles; this AABB is placed in the root node of the kd-tree.
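The logic of this dispatch can be sketched sequentially as follows; this is a minimal Python sketch, and the names (`aabb_of_triangle`, `build_aabb_list`) are illustrative rather than taken from the actual implementation:

```python
def aabb_of_triangle(tri):
    # tri is a sequence of three (x, y, z) vertices.
    xs, ys, zs = zip(*tri)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))

def merge_aabbs(a, b):
    # Smallest AABB enclosing both input AABBs.
    (amin, amax), (bmin, bmax) = a, b
    return (tuple(map(min, amin, bmin)), tuple(map(max, amax, bmax)))

def build_aabb_list(triangles):
    # One "thread" per triangle on the GPU; a plain loop here.
    aabb_list = [aabb_of_triangle(t) for t in triangles]
    root = aabb_list[0]
    for box in aabb_list[1:]:
        root = merge_aabbs(root, box)  # root-node AABB bounding the whole scene
    return aabb_list, root
```

On the GPU the root AABB is reduced in parallel rather than in a serial loop, but the result is the same.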

Create kd-tree

This dispatch splits the AABB that was placed in the root node in the previous dispatch and assigns each split to a child node. This is done in iterations, using the maximum depth of 20 as the end condition. Each iteration distributes the nodes at the current depth over the threads and calculates each node split separately. The split is calculated using the longest axis as the split axis. The split results are then stored in the child nodes of the node that the thread is currently calculating. This dispatch is shown in algorithm 2.

Figure 3.1: GPU work flow
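The spatial splitting can be sketched as below. This is a hypothetical Python rendering of the logic (the thesis implementation runs one depth level per iteration on the GPU); it builds an implicit binary tree of node AABBs, halving the longest axis at each level:

```python
def split_longest_axis(aabb):
    # Spatial median split: halve the box along its longest axis.
    mn, mx = aabb
    axis = max(range(3), key=lambda i: mx[i] - mn[i])
    mid = 0.5 * (mn[axis] + mx[axis])
    left_max = list(mx); left_max[axis] = mid
    right_min = list(mn); right_min[axis] = mid
    return (mn, tuple(left_max)), (tuple(right_min), mx)

def create_kdtree(root_aabb, max_depth):
    # Implicit binary tree in an array: children of node i are 2i+1 and 2i+2.
    tree = [None] * (2 ** (max_depth + 1) - 1)
    tree[0] = root_aabb
    for i in range(2 ** max_depth - 1):  # every non-leaf node index
        left, right = split_longest_axis(tree[i])
        tree[2 * i + 1], tree[2 * i + 2] = left, right
    return tree
```

With max_depth = 20, as in the thesis, this implicit layout is what makes the memory footprint grow exponentially with depth.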

Create kd-tree append

The third dispatch, shown in algorithm 2, sorts the AABBs into the correct kd-tree nodes. This dispatch is a repetitive function that is called multiple times until either all the AABBs have been placed in nodes, or the maximum depth of 20 for the kd-tree is reached, at which point the AABBs are automatically placed in the nodes they are currently in. When dispatch three is finished there are no AABBs left in SwapBufferOut or SwapBufferIn; all the AABBs are indexed using indices in IndiceList.
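The repeated append/consume dispatch can be modelled as a loop over two swap lists. This Python sketch uses illustrative names and a simple AABB-overlap test standing in for the thesis' split condition; `swap_in` plays the role of the consume buffer and `swap_out` the append buffer:

```python
def overlaps(a, b):
    # Axis-aligned overlap test between two AABBs.
    (amin, amax), (bmin, bmax) = a, b
    return all(amin[i] <= bmax[i] and bmin[i] <= amax[i] for i in range(3))

def sort_into_leaves(tri_boxes, tree, max_depth):
    swap_in = [(0, box) for box in tri_boxes]   # every AABB starts at the root
    placed = []                                 # (leaf node id, triangle AABB)
    first_leaf = 2 ** max_depth - 1
    while swap_in:
        swap_out = []
        for node, box in swap_in:               # one GPU thread per entry
            if node >= first_leaf:              # leaf reached: final position
                placed.append((node, box))
                continue
            for child in (2 * node + 1, 2 * node + 2):
                if overlaps(tree[child], box):  # may land in both children
                    swap_out.append((child, box))
        swap_in = swap_out                      # ping-pong for the next dispatch
    return placed
```

Each iteration of the while-loop corresponds to one dispatch, with the output append buffer becoming the next dispatch's consume buffer.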

Fetch indices

Using the IndiceList from the previous dispatch as a ConsumeStructuredBuffer input, the dispatch consumes the indices in IndiceList and writes them to unique positions in IndiceBuffer. This dispatch is shown in algorithm 3.

Sorting

Shown in algorithm 3, this dispatch is divided into four parts that sort and compact the data. The first part isolates unique leaf node ids and stores them in SortingBuffer. A prefix sum [5] operation is then run over SortingBuffer to calculate compacted positions for the entries. Finally, using the positions calculated by the prefix sum operation, the indices are stored in consecutive order in SortingBuffer.
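The compaction of unique leaf node ids via a prefix sum can be sketched sequentially as follows; the GPU version performs the scan in parallel passes, and the function names here are illustrative:

```python
def prefix_sum(flags):
    # Inclusive scan; the GPU version does this in log2(n) passes.
    out, total = [], 0
    for f in flags:
        total += f
        out.append(total)
    return out

def compact_unique(leaf_ids):
    # Mark the first occurrence of each leaf id, scan the marks, then
    # scatter the marked ids to their compacted positions.
    seen, flags = set(), []
    for nid in leaf_ids:
        flags.append(0 if nid in seen else 1)
        seen.add(nid)
    positions = prefix_sum(flags)          # 1-based compacted position
    compacted = [None] * positions[-1] if leaf_ids else []
    for nid, f, pos in zip(leaf_ids, flags, positions):
        if f:
            compacted[pos - 1] = nid
    return compacted
```

This is the standard scan-then-scatter stream-compaction pattern the dispatch relies on.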

Calculate indice start position

Shown in algorithm 4, the dispatch takes the indices stored in SortingBuffer and stores the number of AABBs contained in each leaf node beside the corresponding index. The data from SortingBuffer is then transferred to ParallelScan, which is used in a prefix sum operation to calculate the index start positions for the list of sorted indices. Finally, using the result from the prefix sum operation, index start positions are assigned to the kd-tree leaf nodes.
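The start-position computation amounts to an exclusive prefix sum over the per-leaf triangle counts; a minimal sketch with a hypothetical function name:

```python
def indice_start_positions(counts):
    # counts[i] = number of triangle indices in compacted leaf i.
    # An exclusive prefix sum gives each leaf's start offset in the
    # final, consecutively packed index list.
    starts, running = [], 0
    for c in counts:
        starts.append(running)
        running += c
    return starts, running  # running == total number of indices
```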

Place indices

Lastly, shown in algorithm 4, the unsorted triangles from IndiceBuffer are sorted into IndiceFinal, which is the final index structure containing the indices that the kd-tree leaf nodes use in the draw call.
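The final scatter can be sketched as: each leaf writes its triangle indices starting at the offset computed in the previous dispatch. This is illustrative Python; the real dispatch does this with one thread per index:

```python
def place_indices(entries, starts):
    # entries: list of (compacted leaf id, triangle index) pairs.
    # starts: per-leaf start offsets from the prefix sum dispatch.
    final = [None] * len(entries)
    cursor = list(starts)            # next free slot per leaf
    for leaf, tri in entries:
        final[cursor[leaf]] = tri
        cursor[leaf] += 1
    return final
```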

Algorithm 2 GPU Generation

 1: procedure createAABBList
 2:   TriangleList ← input List
 3:   AABBList ← new List
 4:   KDTree ← new Indexed Buffer
 5:   SwapBuffer ← new Append/Consume Buffer
 6:   for each triangle t in TriangleList in parallel
 7:     Compute AABB for triangle t
 8:     AABBList.appendAtIndex(AABB(t))
 9:     Append the index of the AABB to SwapBuffer
10:     SwapBuffer.append(index(AABBList))
11:   Find the AABB that bounds all triangles and assign it to the root node
12:   KDTree[0].aabb ← AABB(TriangleList)

13: procedure createKDTree
14:   KDTree ← input Indexed Buffer
15:   while (depth < MAXDEPTH)
16:     for each AABB aabb in KDTree at depth depth in parallel
17:       Split the aabb along the longest axis
18:       LeftSplit ← splitAABB(aabb).left
19:       RightSplit ← splitAABB(aabb).right
20:       KDTree.at(aabb).leftnode ← LeftSplit
21:       KDTree.at(aabb).rightnode ← RightSplit

22: procedure createKDTreeAppend
23:   KDTree ← input Indexed Buffer
24:   IndiceList ← new Append/Consume Buffer
25:   SwapBufferIn ← input Append/Consume Buffer
26:   SwapBufferOut ← new Append/Consume Buffer
27:   for each index index in SwapBufferIn in parallel
28:     if leafNodeCondition(index)
29:       IndiceList.append(index)
30:       KDTree.at(index).nrOfTriangles ← nrOfTriangles(index)
31:     else
32:       if leftSplit(index)
33:         SwapBufferOut.append(leftSplit(index))
34:       if rightSplit(index)
35:         SwapBufferOut.append(rightSplit(index))

Algorithm 3 GPU Generation continuation

 1: procedure fetchIndices
 2:   IndiceList ← input Append/Consume Buffer
 3:   IndiceBuffer ← new Indexed Buffer
 4:   for each index index in IndiceList in parallel
 5:     IndiceBuffer[uniqueId] = index

 6: procedure sorting
 7:   IndiceBuffer ← input Indexed Buffer
 8:   SortingBuffer ← new Indexed Buffer
 9:   ParallelScan[2] ← new Indexed Buffer
10:   from ← 0
11:   to ← 1
12:   bitOffset ← 1
13:   for each index index in IndiceBuffer in parallel
14:     SortingBuffer[index] = (index, 1)
15:   for each index index in SortingBuffer in parallel
16:     ParallelScan[from] = SortingBuffer[index]
17:   while bitOffset != pow(2, 20) in parallel
18:     ParallelScan[to] = ParallelScan[from]
19:     for each index index in ParallelScan in parallel
20:       ParallelScan[to][index] = ParallelScan[from][index] +
21:                                 ParallelScan[from][index - bitOffset]
22:     bitOffset ← bitOffset << 1
23:     swap(to, from)
24:   for each index index in SortingBuffer in parallel
25:     SortingBuffer[ParallelScan[from][index]] = SortingBuffer[index]
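The while-loops in algorithms 3 and 4 implement a Hillis–Steele style inclusive scan over two ping-ponged buffers. The following Python transcription of that loop is a sketch, with `max_log2` standing in for the hard-coded pow(2, 20) bound in the pseudocode:

```python
def hillis_steele_scan(values, max_log2):
    # Two buffers are ping-ponged, doubling the stride (bitOffset) each
    # pass, mirroring the while-loop in the sorting pseudocode.
    scan = [list(values), list(values)]
    frm, to = 0, 1
    bit_offset = 1
    while bit_offset != 2 ** max_log2:
        for i in range(len(values)):       # one thread per element on the GPU
            scan[to][i] = scan[frm][i]
            if i - bit_offset >= 0:
                scan[to][i] += scan[frm][i - bit_offset]
        bit_offset <<= 1
        frm, to = to, frm
    return scan[frm]
```

Once bit_offset reaches the array length the remaining passes are plain copies, so the fixed pow(2, 20) bound wastes passes on small inputs but does not change the result.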

Algorithm 4 GPU Generation continuation

 1: procedure calculateIndiceStartPosition
 2:   KDTree ← input Indexed Buffer
 3:   IndiceBuffer ← input Indexed Buffer
 4:   SortingBuffer ← input Indexed Buffer
 5:   ParallelScan[2] ← input Indexed Buffer
 6:   from ← 0
 7:   to ← 1
 8:   bitOffset ← 1
 9:   for each index index in SortingBuffer in parallel
10:     SortingBuffer[index].y = nrOfTriangles(index)
11:   ParallelScan[from] ← SortingBuffer
12:   while bitOffset != pow(2, 20) in parallel
13:     ParallelScan[to] = ParallelScan[from]
14:     for each index index in ParallelScan in parallel
15:       ParallelScan[to][index] = ParallelScan[from][index] +
16:                                 ParallelScan[from][index - bitOffset]
17:     bitOffset ← bitOffset << 1
18:     swap(to, from)
19:   for each index index in ParallelScan in parallel
20:     KDTree[SortingBuffer[index]].index = ParallelScan[from][index]
21:     SortingBuffer[index].y = ParallelScan[from][index]

22: procedure placeIndices
23:   IndiceFinal ← new Indexed Buffer
24:   IndiceBuffer ← input Indexed Buffer
25:   for each index index in IndiceBuffer
26:     IndiceFinal[nrOfTriangles(index)] = IndiceBuffer(index)

3.2 Scenarios

Multiple scenarios were used to test the implementations. All of the scenarios are scenes with different meshes and sizes. Having multiple types of scenarios gives a better understanding of what has an impact on the performance. The scenarios range from 36 triangles up to about 70 thousand triangles. The smallest scenario measures the overhead of the system itself rather than the impact of the amount of triangles, while the largest scenario measures the impact of adding more triangles for the system to handle. The scenarios were put through the same tests on all the platforms.

• Scenario 1: Cornell box (36): The Cornell box 3.2a is a standard testing scene found on the web page MCGuire Graphics Data [16]. Guedis Cardenas, Morgan McGuire and Michael Mara created eight variations of this scene that add to or change what is in the main box, because it is such a commonly used test scene. This scenario was used as a baseline to see the impact of the system itself because of the low triangle count in the scene. With only 36 triangles the impact on the system is minimal, which gave a good view of the base cost of the system.

• Scenario 2: Utah teapot (16k): The teapot 3.2b was modeled in 1975 by Martin Newell and first appeared in his Ph.D. dissertation. It is a standard testing model and even a default model in 3D Studio Max. The version used in this scenario was found on the web page MCGuire Graphics Data [16]. The teapot is composed of 16 thousand triangles.

• Scenario 3: Mini spaceship (44k): Mini spaceship 3.2c is a model that was found on the web page cgtrader [2] but has since been removed. cgtrader is a web page where individuals can upload, share and even sell meshes. Mini spaceship is composed of 44 thousand triangles.

• Scenario 4: Mitsuba (61k): Mitsuba 3.2d is a standard material test object. Wenzel Jakob [13] created the obj conversion and it was found on the web page MCGuire Graphics Data [16]. Mitsuba is composed of 61 thousand triangles.

• Scenario 5: Stanford bunny (70k): The Stanford bunny 3.2e was made by Greg Turk and Marc Levoy [21] in 1994 and has since been commonly used in tests for nearly anything: compression, surface smoothing, polygonal simplification and many other graphics algorithms. The Stanford bunny is composed of 70 thousand triangles.

(a) Cornell box (b) Utah teapot (c) Mini spaceship

(d) Mitsuba (e) Stanford bunny

Figure 3.2: The scenarios used in the testing, in order from lowest triangle count to highest.

3.3 Platforms

Using multiple scenarios alone will not show the whole picture. Neither will the range of platforms tested here, but together they give a more complete picture. Multiple platforms were used to test the prototypes; the platforms were set up with the same software to ensure equivalent behavior, and the same scenarios were tested on both platforms.

• Laptop: The first platform is a laptop with an Intel Core i7 processor at 2.2 GHz and an Nvidia GeForce GTX 560M graphics card.

• Desktop: The second platform is a desktop computer with an Intel Core i5 processor at 3.4 GHz and an Nvidia GeForce GTX 670 graphics card.

3.4 Evaluation method

All the scenarios were tested on both platforms in order to give a broader overview of the performance and hardware impact. To time the generation of the kd-tree, ID3D11Query was used within the code to get as exact a time as possible. The part of the program that we are interested in timing is the generation of the kd-tree. The CPU version is only timed once every time the system starts, because it generates the kd-tree at startup. Since the GPU part of the system recreates the kd-tree every frame, there is no need to restart the system for every test. To get a more accurate timing, the graphics pipeline is flushed at every start so it must start fresh and cannot use any stored values from previous startups to make it faster.

Chapter 4 Results

This section will present and evaluate the results obtained from the tests. It will show the generation times obtained from the tests and compare them to each other, starting with the CPU generation, continuing with the GPU generation and ending with an average summary of the results. All test scenarios were run at least 200 times each on every platform to gain an accurate representation of the average generation time of the kd-tree.

4.1 CPU creation

The CPU creation of the kd-tree behaved as expected, with slightly less than linear time scaling. Generating the structure at the start of the application is the common way to do it, as it takes a long time to create kd-trees, so this is preferably done before the structure is needed. As shown in Figure 4.1 and Figure 4.2, the time it takes to generate the kd-tree increases quite rapidly as the amount of triangles increases, and the approach is therefore not suited to run in real-time applications. The assumption has been made that the generation time between the scenarios is nearly linear.


Figure 4.1: Test results from all the scenarios done on the laptop's CPU only.

Figure 4.2: Test results from all the scenarios done on the desktop's CPU only.

4.2 GPU creation

The generation times of the per-frame generated kd-tree, Figure 4.3 and Figure 4.4, show a stable and slowly increasing generation time as the number of triangles increases. While the slow increase in generation time as the amount of triangles grows shows a possible future for our approach, there are limiting factors to consider. The overhead from setting up the method and the data traffic between passes consumes a lot of time, which is what causes the still rather high execution time. Another problem is the dynamic inflexibility of the DirectX 11 compute shader and the method by which this solution is implemented: it sets a hard cap on the maximum number of triangles that can be processed. While this limit can be increased, the limit will still be there, and the amount of memory needed to push the limit scales exponentially. So while the scaling of the method is good, the constraints from other parts of the system limit performance and set hard limits that cannot be surpassed with the current implementation method.

Figure 4.3: Test results from all the scenarios done on the laptop's GPU only.

Figure 4.4: Test results from all the scenarios done on the desktop's GPU only.

4.3 Average summary

Lastly, a summary of the average results is combined into a single overview graph. This shows the potential of the method from this experiment. In Figure 4.5 there is a clear indication of scalability on all platforms when using kd-trees on the GPU. The results are not constant, as indicated in Figure 4.3 and Figure 4.4.

Figure 4.5: Average results from all the testing and platforms, summarised for easier comparison.

4.4 Tables with exact values

Shown in Table 4.1 and Table 4.2 are the results obtained in the testing, displayed in milliseconds. Here we can see the minimum and maximum generation times of the scenarios as well as the average time the kd-tree generation took.

Table 4.1: Laptop generation times

CPU laptop   Cornell box   Teapot    Mini spaceship   Mitsuba   Stanford bunny
Max              8.86      1635.73       3361.26      4194.48      4544.27
Average          6.28      1489.05       3213.43      4070.09      4331.10
Min              4.42      1436.76       3125.71      4001.68      4123.67

GPU laptop   Cornell box   Teapot    Mini spaceship   Mitsuba   Stanford bunny
Max            631.81       634.33        643.57       648.23       647.92
Average        624.53       628.31        633.59       636.32       638.61
Min            621.95       626.53        628.76       633.86       636.96

Shows the generation times for the tests run on the laptop in milliseconds.

Table 4.2: Desktop generation times

CPU desktop   Cornell box   Teapot    Mini spaceship   Mitsuba   Stanford bunny
Max               7.15      1226.86       2583.57      3258.80      3384.32
Average           4.45      1149.09       2475.89      3160.23      3264.11
Min               2.14      1118.74       2380.35      3075.31      3120.75

GPU desktop   Cornell box   Teapot    Mini spaceship   Mitsuba   Stanford bunny
Max             508.15       509.23        494.02       496.46       498.54
Average         490.01       490.77        488.68       491.95       493.71
Min             480.05       482.76        485.97       487.13       488.66

Shows the generation times for the tests run on the desktop in milliseconds.

4.4.1 Speedup

Shown in Table 4.3 is the speedup from using the GPU version instead of the sequential CPU version. The values show how many times the GPU version can generate the kd-tree in the time it takes the CPU version to generate it once.
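The speedup values in Table 4.3 are simply the ratio of the average CPU time to the average GPU time from Tables 4.1 and 4.2; for example, for the Stanford bunny on the laptop:

```python
def speedup(cpu_avg_ms, gpu_avg_ms):
    # Number of GPU generations that fit in one CPU generation.
    return cpu_avg_ms / gpu_avg_ms

# Average times from Table 4.1 (laptop): CPU 4331.10 ms, GPU 638.61 ms.
print(round(speedup(4331.10, 638.61), 3))  # 6.782, matching Table 4.3
```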

Table 4.3: Speedup

Speedup    Cornell box   Teapot   Mini spaceship   Mitsuba   Stanford bunny
Laptop        0.010       2.370        5.072        6.396        6.782
Desktop       0.009       2.341        5.066        6.424        6.611

Shows how many times the GPU version can be generated in the same time the CPU version is generated once.

Chapter 5 Discussion

To test the implementation created from this experiment, multiple platforms were used, running multiple times using multiple scenarios with the same settings. The platforms used the same software versions on all systems to ensure the results did not vary because of differences in versions.

ID3D11Query was used for the time measurements. This DirectX 11 integrated measuring tool was chosen because it is integrated with the graphics library and can therefore get precise measurements. One problem was encountered with the time measurement during the experiment, and that was when the graphics pipeline was not flushed. When the graphics pipeline had residual data remaining between dispatches it interfered with the time measurement and strange results were obtained. This was easily solved by flushing the graphics pipeline between each dispatch.

As seen in Figure 4.4 and Table 4.2 there is a min, average and max indicated. These values were obtained from our tests and are presented to give an indication of the time fluctuations caused by background processes running on the system and the operating system itself, as well as to display the generation time of the prototypes. The average was calculated from 200 runs for every scenario on every platform. Some might only care about the max because of requirements they have on their system, and some only need the average because it will run over a long period of time. The end result from both the CPU and GPU versions gives the same scene when given the same input and configurations.

A similar system can be implemented in CUDA, with its own challenges to overcome. Zhou et al. [24] make an implementation using CUDA; our system uses a similar approach to theirs, but we work within the limitations of DirectX 11.

A deficiency was found in our experiment during the final stages of development, caused by our memory management and some other issues we encountered in those final stages.
This limits the scalability of the application, since it sets a limit at startup on the size of the structures used in the calculations, a limit that cannot be increased as needed at runtime. While this is a problem, it does not impact our scenarios, since we set our limit to account for them. The version of DirectX did not support certain functionality that exists in similar GPU libraries, such as CUDA. The functionality needed was some

form of dynamic listing and management of such lists on the GPU. Newer versions of DirectX might support this and could therefore enhance this technique further. These listings were used to handle the AABBs and the triangle indices. While DirectX has some support for this in its append/consume buffers, these came with the limitations named before.

When trying to compare our results to those of others, we encountered some problems. The first problem is that measuring the generation time of the acceleration structure alone is uncommon. The usual measurement is the total generation time, which includes the generation of the acceleration structure and the following ray tracing. The other common measurement found was data throughput, meaning the amount of triangles handled per second. Some reports where results could be compared were found. From [19], assuming that the scaling in generation time is linear, we get that their generation time for the Stanford bunny was around 16 milliseconds, which is roughly twenty-five times faster than our generation time. The reason for this difference in generation time is likely our static kd-tree generation. Since we generate our entire kd-tree before populating it with data every frame, we have a high base generation time. This is likely the major reason for their twenty-five times faster generation, but scaling up our approach should limit the impact from this static generation; alternatively, if the maximum bounds of a scene are known, the static kd-tree could be generated only once and then be populated with data every frame. Popov et al. [18] present results where, with four CPUs, they get a generation time for the Stanford bunny that is comparable to ours, just slightly faster. For the same reason as mentioned above, our algorithm runs at a speed comparable to a CPU implementation.
While these two might not show the entire picture, we can see that our implementation is comparable to an optimized CPU implementation but not necessarily comparable to other GPU generation algorithms.

Chapter 6 Conclusions

RQ1: How can an implementation of a kd-tree be done using the DirectX 11 compute shader?

Answer: By breaking down the implementation of the kd-tree generation into steps, we could use the parallel capabilities of the compute shader. This breakdown resulted in seven steps for the GPU generation; those steps are detailed in GPU Generation 3.1.2.

RQ2: How well will it perform compared to a sequential generation of a kd-tree?

Answer: The results obtained indicate a stable generation time for our GPU version when compared to our CPU version. When scaling up from a low amount of triangles, the GPU version barely increases in generation time, while the CPU version's generation time quickly increases. For a comparison of the generation times see the comparison graph in Figure 4.5; for a speedup comparison see Table 4.3.

RQ3: Will it run within our definition of real-time?

Answer: When looking at the data from the experiment, we found that the GPU version has a base time for setting up and transferring the data to and from the GPU. This can be seen in Figure 4.3 and Figure 4.4. Even if the generation time is stable, it and the following ray tracing will be too high to fit inside our definition of real-time, since the total time makes interaction with the program at run time unbearable: the frame updates come at too irregular intervals.

The major reason for the unexpectedly high generation time was that the DirectX 11 compute shader's handling of dynamic structures and sorting of arrays is limited and restricting. As we could not find solutions for these parts, the performance of the generation was greatly reduced.

The DirectX 11 compute shader excels at massively parallel computations but is limited in step synchronization and atomic functions. This makes sorting massive arrays of data and other similar operations difficult but not impossible to perform

using DirectX 11 compute shaders. [8] describes a sorting algorithm using DirectX 11 compute shaders; they implemented a radix sort algorithm which showed good results, using multiple passes and prefix sum operations to sort massive data arrays. While our implementation would likely not benefit from this solution, an implementation using a median triangle split would probably benefit greatly from their algorithm.

So while we had some limited success and some arguably positive results, our final conclusion is that it is not suitable to use the DirectX 11 compute shader to generate kd-trees in real-time using our implementation. That said, with improvements to our implementation, and with later versions of the DirectX compute shader that solve some of the dynamic problems, our implementation or similar ones might be used for real-time generation.

Chapter 7 Future Work

When the results are shown, the tests are done and the problems are found, new questions can surface, and the search for answers will move everything forward. The results show a positive future for a technique like this when it comes to stable performance in a ray traced environment. Running the tests gave great insight into the capabilities of an implementation like this one. Also, running tests on the iterations building up to the final build for the experiment gave insight into bottlenecks and possibilities for the system. Finding the problems on which new ideas and future work can be based is the best result for any experiment taking a new direction in an already known topic.

Bibliography

[1] Arthur Appel and William R. Mark. “Some techniques for shading machine renderings of solids”. In: AFIPS 1968 Spring Joint Comptr. Conf. (1968), pp. 37–45.
[2] M. Baharoon. spaceship Free VR / AR / low-poly 3D model. url: https://www.cgtrader.com/free-3d-models/vehicle/sci-fi/spaceship--19.
[3] Pawel Bak. “Real time ray tracing”. MA thesis. Technical University of Denmark, DTU, DK-2800 Kgs. Lyngby, Denmark, 2010.
[4] Rasmus Barringer and Tomas Akenine-Möller. “Dynamic Stackless Binary Tree Traversal”. In: Journal of Computer Graphics Techniques 2.1 (2013), pp. 38–49.
[5] Guy E. Blelloch. “Prefix sums and their applications”. Technical Report CMU-CS-90-190, School of Computer Science, Carnegie Mellon University (1990).
[6] Martin Eisemann et al. “Automatic Creation of Object Hierarchies for Ray Tracing Dynamic Scenes”. In: WSCG Short Papers Proceedings. 2007.
[7] Tim Foley and Jeremy Sugerman. “KD-tree acceleration structures for a GPU raytracer”. In: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware. ACM. 2005, pp. 15–22.
[8] Arturo García et al. “Fast Data Parallel Radix Sort Implementation in DirectX 11 Compute Shader to Accelerate Ray Tracing Algorithms”. In: EURASIA GRAPHICS: International Conference on Computer Graphics, Animation and Gaming Technologies. 2014, pp. 27–36.
[9] Eric Graham. “Graphic Scene Simulations”. In: Amiga World (1987), pp. 19–95.
[10] Sashidhar Guntury and PJ Narayanan. “Raytracing dynamic scenes on the GPU using grids”. In: IEEE Transactions on Visualization and Computer Graphics 18.1 (2012), pp. 5–16.
[11] Vlastimil Havran. “Heuristic ray shooting algorithms”. PhD thesis. Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University in Prague, 2000.


[12] Vlastimil Havran and Jiri Bittner. “On Improving KD-Trees for Ray Shooting”. In: Proc. of WSCG 2002 Conference. 2002, pp. 209–217.
[13] Wenzel Jakob. Mitsuba renderer. http://www.mitsuba-renderer.org. 2010.
[14] k-d tree. url: https://commons.wikimedia.org/wiki/File:3dtree.png.
[15] J David MacDonald and Kellogg S Booth. “Heuristics for ray tracing using space subdivision”. In: The Visual Computer 6.3 (1990), pp. 153–166.
[16] Morgan McGuire. MCGuire Graphics Data. May 2015. url: http://graphics.cs.williams.edu/data/meshes.xml.
[17] Stefan Popov et al. “Stackless KD-Tree Traversal for High Performance GPU Ray Tracing”. In: Computer Graphics Forum. Vol. 26. 3. Wiley Online Library. 2007, pp. 415–424.
[18] S. Popov et al. “Experiences with Streaming Construction of SAH KD-Trees”. In: 2006 IEEE Symposium on Interactive Ray Tracing. Sept. 2006, pp. 89–94. doi: 10.1109/RT.2006.280219.
[19] Maxim Shevtsov, Alexei Soupikov, and Alexander Kapustin. “Highly Parallel Fast KD-tree Construction for Interactive Ray Tracing of Dynamic Scenes”. In: Computer Graphics Forum 26.3 (2007), pp. 395–404. issn: 1467-8659. doi: 10.1111/j.1467-8659.2007.01062.x. url: http://dx.doi.org/10.1111/j.1467-8659.2007.01062.x.
[20] Niels Thrane and Lars Ole Simonsen. “A Comparison of Acceleration Structures for GPU Assisted Ray Tracing”. MA thesis. University of Aarhus, 2005.
[21] Greg Turk and Marc Levoy. Stanford bunny. May 2015. url: http://graphics.stanford.edu/~mdfisher/Data/Meshes/bunny.obj.
[22] Ingo Wald. “Fast kd-tree construction with an adaptive error-bounded heuristic”. In: IEEE Symposium on Interactive Ray Tracing (2007), pp. 33–40.
[23] Turner Whitted. “An Improved Illumination Model for Shaded Display”. In: Proceedings of the 6th annual conference on Computer graphics and interactive techniques (1979).
[24] Kun Zhou et al. “Real-Time KD-Tree Construction on Graphics Hardware”. In: ACM Transactions on Graphics (TOG) - Proceedings of ACM SIGGRAPH Asia 2008 27.5 (Dec. 2008).

Appendices

Appendix A Code and raw data

https://github.com/BuffBuff/Frustum-KD-tree/
