Parallel BVH Construction for Real-Time Ray Tracing

Aaron Lemmon Division of Science and Mathematics University of Minnesota, Morris Morris, Minnesota, USA 56267 [email protected]

ABSTRACT The rise in popularity of interactive graphical applications, such as video games, has motivated innovations in render- ing 3D scenes in real time. The ray tracing technique is well-suited for generating realistic images of scenes that fea- ture shadows, reflections, and refractions. Historically, ray tracing has been too slow for real-time applications since it is computationally intensive. However, ray tracing perfor- mance can be greatly improved by using the bounding vol- ume hierarchy (BVH) acceleration to store scene information for each frame. Researchers have strived to minimize the combined time of both constructing and us- ing BVHs for ray tracing. This paper provides an overview of ray tracing with BVHs and presents a recently developed Figure 1: A simple scene containing a dolphin con- method for constructing them in parallel on a GPU. structed from triangle primitives [7]. Keywords computer graphics, ray tracing, parallel computing, bound- ing volume hierarchies, Morton codes through the pixel and into the 3D scene. When a ray in- tersects with the first primitive in its path it recursively 1. BACKGROUND generates more rays in directions that will contribute to an In 3D computer graphics, objects are made up of a collec- appearance of reflections, refrations, or shadows [6]. While tion of primitives, which are usually simple geometric shapes ray tracing can create highly realistic images, the process of like triangles. A 3D scene consists of all the primitives that tracing a large number of rays for a 3D scene with many construct it. Figure 1 depicts a scene with a dolphin that primitives on a high resolution display can take a long time. clearly shows the component triangles. In order to depict a Ray tracing a single frame of a 3D scene can involve test- scene on a display, the pixels of the display must be colored ing for intersections of billions of rays against millions of to create an image. A technique called ray tracing can be primitives [2, 5]. To speed up this process, acceleration data used to color the pixels in a way that can accurately por- structures can be used to organize the primitives by their tray shadows, reflections, and refractions in a scene [4]. Ray location so that only the primitives near a given ray need tracing achieves this by determining how light travels in a to be checked for intersection with that ray. Acceleration scene from the light sources, reflecting or refracting off ob- data structures usually take the form of a : the top jects, and meeting the viewer [6]. Figure 2 shows the high node represents the entire 3D volume of the scene and the degree of photorealism that ray tracing can achieve. children of every node divide up the volume of their par- Since many rays from a light source may not ultimately ent node into subsections. Although these acceleration data reach the viewer, it is more practical to start from the viewer structures speed up ray intersection testing, the time it takes and trace paths of light backward. Figure 3 shows that for to build these trees can negatively impact performance [1]. every pixel on a display, a ray is traced from the viewpoint This is especially apparent in scenes with moving objects since the acceleration data structures need to be rebuilt or updated to accurately reflect the new locations of objects [5]. The research discussed here addresses methods for build- ing and maintaining acceleration data structures in a way that minimizes the combined time spent constructing the This work is licensed under the Creative Commons Attribution- data structure and using it to test for ray intersections. Noncommercial-Share Alike 3.0 United States License. To view a copy In particular, parallel computing on the graphics processing of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/us/ or unit (GPU) can decrease the time spent building an acceler- send a letter to Creative Commons, 171 Second Street, Suite 300, San Fran- ation data structure. Efficient acceleration data structures cisco, California, 94105, USA. UMM CSci Senior Seminar Conference, April 2016 Morris, MN. allow for ray tracing scenes with motion in real time. A A B C B C

A A Figure 2: A ray traced sceneB featuring shadows on the wall and beneath the glasses, reflectionsC on B C the glossy surfaces of the glasses, and refractions through the glass stems and ice cube [9].

Image Camera Light Source

View Ray Shadow Ray

Figure 4: Top: A 2D scene with rectangles as bound- ing boxes. Bottom: One possible BVH configuration for the 2D scene. Note that each internal node has Scene Object exactly two children and that objects that are close in the scene are also close in the BVH tree [8].

Bounding volume hierarchies (BVHs) extend the idea of AABBs to a data structure as shown in Fig- Figure 3: For each pixel of the image plane, a ray is ure 4. The root node of a BVH is an AABB that surrounds cast through and tested for intersection with objects all the primitives in the scene. Every parent node has two in the scene. When intersections occur, the angle child nodes, each of which is an AABB that surrounds its of reflection is calculated and a new ray is sent out. share of the primitives surrounded by the parent node. By Eventually, if a ray intersects with a light source, the continually separating volumes into smaller volumes, it be- color information propagates back to the pixel [2, 9]. comes possible to group close primitives together [5]. The lower nodes on the tree give more precise location infor- mation than the higher nodes. The leaves of the tree are 2. ACCELERATION DATA STRUCTURES AABBs that surround a single primitive [8]. As opposed to If primitives of a scene are stored in a data structure that other types of acceleration data structures, BVHs define the identifies their location in the scene, then a ray only needs volumes by the objects they contain rather than splitting to check for intersections with objects located in the parts of volumes and then determining which objects should go in the scene the ray is passing through. This can drastically im- each node [5]. prove the performance of intersection testing for each ray [9]. Searching for a ray intersection with objects in a BVH A way to make intersection tests easier to calculate is to occurs in a top-down manner. First, the ray is tested against surround primitives with bounding boxes. A bounding box the root of the tree to check if it intersects with the scene. is a box that completely contains its contents as tightly as If it does, then both of the root’s children are checked for possible. A common approach is to use axis-aligned bounding intersection. If the ray misses a child node, then the entire boxes (AABBs), which are aligned with the axes of the scene subtree rooted at that node can be eliminated from the rest as a whole. It is much simpler to test for ray intersections of the search, since the objects contained within the node with AABBs than with the items they contain. If a ray will also not intersect. However, in the case that the ray does not intersect with an object’s AABB, then it cannot intersects both children, the search must continue down both intersect with the object itself. However, if it does intersect subtrees [2]. In the best-case scenario, where that does not with the AABB, then a more costly check must be made happen, the number of checks will be approximately the to test for intersection with the contained object. Overall, log of the number of total primitives. Ray tracing with a using AABBs can reduce the cost of testing for intersections BVH structure can therefore greatly reduce the number of since a ray misses many more objects than it hits. intersection checks performed per ray [8]. x: 3. BVH CONSTRUCTION 0 1 2 3 4 5 6 7 000 001 010 011 100 101 110 111 In order to speed up the construction of BVHs for real- time ray tracing, Tero Karras [3] has developed a method y: 0 000000 000001 000100 000101 010000 010001 010100 010101 based upon earlier work by Garanzha et al. [1] for construct- 000 ing an entire BVH tree in parallel on a GPU. The method 1 000010 000011 000110 000111 010010 010011 010110 010111 follows a series of four main steps. The method first as- 001 signs a value called a Morton code to each primitive based 2 001000 001001 001100 001101 011000 011001 011100 011101 upon its location in the scene. The Morton codes are then 010 sorted. Next, a binary radix tree (defined in Section 3.2.1) is 3 constructed in parallel, which arranges close primitives near 001010 001011 001110 001111 011010 011011 011110 011111 each other in the tree. The last step fits an AABB around 011 the contents of each node in the binary radix tree in parallel 4 100000 100001 100100 100101 110000 110001 110100 110101 to form the final BVH [3]. The next sections will cover each 100 of these steps in more detail. 5 100010 100011 100110 100111 110010 110011 110110 110111 3.1 Morton Codes 101 6 The location of each primitive in the scene can be rep- 101000 101001 101100 101101 111000 111001 111100 111101 resented by the x, y, and z coordinates of the center of its 110

AABB [4, 5]. The Morton code of a primitive combines 7 101010 101011 101110 101111 111010 111011 111110 111111 the coordinates into a single value by interleaving the bi- 111 nary representations of the x, y, and z coordinates [1]. A Morton code has the form X0Y0Z0X1Y1Z1 ... where the x Figure 5: Morton codes for each location in a 2D coordinate is represented as the binary digits X0X1X2 ..., scene. Notice that each Morton code is made by and similarly for the y and z coordinates [3]. interleaving the bits of the x (blue) and y (red) co- Figure 5 shows a two dimensional scene area with the ordinates for its location [10]. Morton codes of each coordinate location. The lowest valued Morton code appears in the upper-left corner of the scene where both coordinates are zero. The zigzag pattern on the image shows the sequence of increasing Morton codes, which and the right child has the same common prefix but followed ultimately ends with the highest value in the lower-right by a 1 bit [3]. As an example, the left child of the root in corner. Every primitive in the scene is assigned a Morton Figure 6 has the prefix “00”, thus every key under its left code based on its location, and each code (paired with a child begins with “000” and every key under its right child reference to its primitive) is added to an array. Primitives begins with “001”. Since every internal node has exactly two that share a common location will be assigned the same children, a binary radix tree with n leaf nodes will always have exactly n 1 internal nodes [5]. Morton code specifying that location, while Morton codes − for empty areas of the scene will be unused [3, 4]. Every key in a binary radix tree must be unique, but du- Note that in Figure 5 all codes that start with a 0 bit plicate Morton codes could exist if the primitives are ex- are located in the upper half of the scene, and within that tremely close. To ensure uniqueness, the binary represen- section all codes that have a 0 as the second bit are located tation of each key’s index in the array can be concatenated on the left half of that section [1]. This property of Mor- onto the end of the key. These concatenations do not need to ton codes allows primitives that are near each other to have be stored, but can be performed as needed when comparing long common prefixes between their Morton codes, which is identical keys [3]. For example, if two keys both have the important for the binary radix tree construction. The array Morton code “010” and they are stored at indices (in binary) of Morton codes is sorted in order to group the codes by 000 and 001, then the index concatenation will result in the common prefixes [4]. Note that the actual primitives in the unique values “010000” and “010001”. scene are not moved. 3.2.2 Binary Radix Tree Properties 3.2 Binary Radix Tree Construction In order to create every internal node of the binary radix tree construction in parallel, it must be possible to deter- 3.2.1 Binary Radix Tree Fundamentals mine three properties about a node. These properties in- The array of sorted Morton codes can be thought of as clude the range of keys that a node covers, the length of the a of n bit string keys k0, . . . , kn 1 where n represents longest common prefix of those keys, and what the node’s the number of primitives in the scene.− A binary radix tree children are. Additionally, these must be determined with- organizes the keys under a tree structure where the leaves out depending on the work done in any other internal nodes are the keys and the internal nodes represent common bi- since all the nodes will be working in parallel. This section nary prefixes of the keys. Figure 6 shows an example of the will cover the three important properties of an internal node binary radix tree for a set of eight binary keys, which are and Sections 3.2.3 and 3.2.4 will discuss the binary radix tree the Morton codes of primitives in a scene. Each key is a leaf construction. node, and each internal node represents the longest common The keys covered by an internal node can be represented prefix shared by all the keys under it. These prefixes are al- as a linear range [i, j]. Using Figure 6 as an example, the ways shorter than the full Morton codes, and longer prefixes root node covers keys 0 through 7, so it can be represented represent a narrower set of locations in the scene. Every as the range [0, 7]. Its left child can be represented as [0, 3] internal node always has exactly two children: the left child and its right child as [4, 7]. has the common prefix of the its parent followed by a 0 bit, 34 Karras / Maximizing Parallelism in the Construction of BVHs, , and k-d Trees Karras / Maximizing Parallelism in the Construction of BVHs, Octrees, and k-d Trees 35

1: for each internal node with index i [0,n 2] in parallel 2 2: // Determine direction of the range (+1 or -1) 3: d sign(d(i,i + 1) d(i,i 1)) 4: // Compute upper bound for the length of the range 5: d d(i,i d) min 6: l 2 max 7: while d(i,i + lmax d) > d do · min 8: l l 2 max max · 9: // Find the other end using binary search 10: l 0

Figure 1: Performance of BVH hierarchy generation for 11: for t lmax/2,lmax/4,...,1 do { } Stanford Dragon (871K triangles). The y-axis corresponds 12: if d(i,i +(l +t) d) > dmin then Figure 2: Ordered binary radix tree. Leaf nodes, numbered · Figure 6: An ordered binary radix tree. There are 13: l l +t to millions of primitives per second and the x-axis to the 0–7,eight store leaf a nodes set of 5-bitcontaining keys in 5-bit lexicographical keys which order, appear and Figure 3: Our node layout for the tree of Figure 2. Each in- Figure 7: This figure shows the storage locations of 14: j i + l d number of parallel cores relative to GTX 480. The solid red in lexicographical order. Each internal node covers ternalnodes node for hasthe binary been assigned radix tree an displayed index between in Figure 0–6, 6. and · the internal nodes represent their common prefixes. Each in- 15: // Find the split position using binary search line indicates the fastest existing method by Garanzha et a linear range of keys with a common prefix dis- alignedThe internal horizontally nodes with have a leaf been node assigned of the same indices index. from The ternalplayed node directly covers beneath a linear range the node. of keys, The which range it partitionsof keys 16: d d(i, j) al. [GPM11], which is dominated by the top levels of the range0 to of 6 keys and covered are lined by eachup with node leaf is indicated nodes with by a the hori- node intocovered two subranges by an internal according node to their is partitioned first differing into bit. two same indices. The range of keys covered by each 17: s 0 tree where the amount of parallelism is very limited. Our subranges according to their first differing bit; these zontalinternal bar, nodeand the is shown split position, as a horizontal corresponding bar, and to a the tiny first 18: for t l/2 , l/4 ,...,1 do method, indicated by the dotted green line, parallelizes over two subranges are represented by the two children {d e d e } bitred that circle differs lies between immediately the keys, afteris indicated each node’s by a red split circle. 19: if d(i,i +(s +t) d) > d then Inof contrast the internal to a prefix node tree, [3]. which contains one internal node · node the entire tree and scales linearly with the number of cores. position [3]. 20: s s +t for every common prefix, a radix tree is compact in the sense 21: g i + s d + min(d,0) that it omits nodes with only one child. Therefore, every bi- edge of which keys they cover before having processed their · where the x coordinate of the point is represented as The length of the longest common prefix between two 22: // Output child pointers narykeys, radixk and treek with, isn denotedleaf nodes by δ contains(i, j). For exactly example,n the1 in- ancestors.have a longer Our key common insight prefix in enabling than the range parallel since construction they all 0.X0X1X2 ..., and similarly for y and z coordinates. The Mor- i j 23: if min(i, j)=g then left Lg else left Ig ternallongest nodes. common Duplicate prefix between keys require key 0 special and key attention—this 3 in Figure 6 is tohave establish 0 in the a differing connection bit position. betweenAdjacent node indices keys andto the keys ton code of an arbitrary 3D primitive can be defined in terms right of the split position are analogous, but with all having 24: if max(i, j)=g + 1 then right Lg+1 else right Ig+1 isis discussed “00”, which in Section has a length4. of two digits. So δ(0, 3) is 2. through a specific tree layout. The idea is to assign indices of the centroid of its axis-aligned bounding box (AABB). In a 1 in the differing bit position. 25: Ii (left,right) The prefix for each internal node can be determined by for the internal nodes in a way that enables finding their chil- practice, the Morton codes can be limited to 30 or 63 bits in solelyWe will inspecting only consider the first keyordered and last trees, key where in the thenode’s children key 26: end for dren without depending on earlier results. order to store them as 32-bit or 64-bit integers, respectively. ofrange. each node—and This is because consequently all keys betweenthe leaf nodes—are the first key in and lexi- 3.2.3 Setup for Binary Radix Tree Construction last key will also share the same prefix since all keys are in It is possible to construct a binary radix tree by starting Figure 4: Pseudocode for constructing a binary radix tree. cographical order. This is equivalent to requiring that the se- Let us assume that the leaf nodes and internal nodes are The algorithm for generating BVH node hierarchy was lexicographical order [3]. As an example, the left child of the with the root node, finding the first differing bit in the keys, For simplicity, we define that d(i, j)= 1 when j [0,n 1]. quence of keys is sorted, which enables representing the keys subsequently improved by Pantaleoni and Luebke [PL10] root in Figure 6 covers keys 0 through 3. By only looking storedcreating in two the separate child nodes, arrays, thenL and handlingI, respectively. each child We recur- define 62 coveredat keys by 0 and each 3, node it can as be a determined linear range that[i, theyj]. Using shared the(i, j) sively. However, this is not parallel because a node cannot and Garanzha et al. [GPM11]. Garanzha et al. generate one our node layout so that the root is located at I0, and the in- toprefix denote “00”. the lengthAll keys of between the longest 0 and common 3 will also prefix start between with be created until all of its ancestors have been created. Par- Now, the keys belonging to I share a common prefix that level of nodes at a time, starting from the root. They process “00”, otherwise they would not fall in that range. dicesallel of construction its children—as can be well achieved as the by children creating ofa connection any internal i keys ki and k j, the ordering implies that d(i0, j0) d(i, j) must be different from the one in the sibling by definition. the nodes on a given level in parallel, and use binary search An internal node partitions its range of keys into two sub- node—arebetween internalassigned node according indices to and its keys respective through split a particu- position. forranges any i0 for, j0 its[ childreni, j]. We according can thusto determine the first the differing prefix bit cor- lar tree setup. This is done by assigning indices to internal to partition the primitives contained within each node. They 2 The left child is located at Ig if it covers more than one key, This implies that a lower bound for the length of the prefix respondingamong the to keys. a given For example, node by the comparing left child its of firstthe root and in last nodes in a way that enables finding their children without then enumerate the resulting child nodes using an atomic or at Lg if it is a leaf. Similarly, the right child is located at is given by dmin = d(i,i d), so that d(i, j) > dmin for any k j key—theFigure 6 other covers keys keys are 0 throughguaranteed 3 with to share a common the same prefix prefix. of depending on that node’s ancestors being finished [3]. counter, and subsequently process them on the next round. “00”. This node will divide its keys among its children based Ig+1 Figureor Lg+ 71. shows The layout how the is nodes illustrated of the binary in Figure radix3 tree. from belonging to Ii. We can satisfy this condition by comparing onIn the effect, value each of the internal third bit node of thepartitions keys, since its keys the third according bit Figure 6 will be stored. The leaf nodes and internal nodes d(i,i 1) with d(i,i + 1), and choosing d so that d(i,i + d) Since these methods are targeted for real-time ray tracing, comes directly after the common prefix. All keys in the Anare stored important in two property separate of arrays, this particularL and I, respectively. layout is that The the to their first differing bit, i.e. the one following d(i, j). This corresponds to the larger one (line 3). they are each accompanied with a high-quality construction range with 0 as the third bit will belong under the left child indexroot of will every always internal be located node coincides at position withI0. either The children its first or bit will be zero for a certain number of keys starting from k , algorithm to allow different quality vs. speed tradeoffs. The and the keys with 1 as the third bit will belong under the i itsof last internal key. This nodes follows are placed by construction—the at an index according root to is theirlocated We use the same reasoning to find the other end of the andright one child. for the The remaining index of the ones last until keyk where. We the call differing the index bit of parent’s split position. The left child is located at Lγ if it is idea is to use the high-quality algorithm for a relatively small j at the beginning of its range [0,n 1], the left child of any range by searching for the largest l that satisfies d(i,i+ld) > theis last 0 is key called where the split the bit position is zerofor a thesplit node, position and, is denoted denoted by a leaf, or at Iγ if it covers more than one key. Similarly, the number of nodes near the root, and the fast algorithm for the by γ. If the range of keys covered by a node is [i, j], then internalright child node is is located located at atLγ the+1 or endIγ+1 of[3]. its range For example,[i,g], and the the dmin. We first determine a power-of-two upper bound lmax > g [i, j 1]. Since the bit is zero for kg and one for k , the rest of the tree. While we do not explicitly consider such 2the split position can be anything from i to j 1. Theg+ split1 rightsplit child position is located of the rootat the in beginning Figure 7 is of at its index range 3, so[g its+ left1, j]. l by starting from 2 and increasing the value exponentially split position must satisfy d(g,g + 1)=d(i, j)−. The resulting hybrid methods in this paper, we believe that our approach position cannot occur at j, since the differing bit must be a child is stored in I3 and its right child is stored in I4. until it no longer satisfies the inequality (lines 6–8). Once subranges1 for key j are. The given split by position[i,g] and for the[g + left1, childj], and of the are root further in Algorithm.It is importantIn order to note to construct that the index a binary of every radix internal tree, we is general enough to be combined with any appropriate high- we have the upper bound, we find l using binary search in partitionedFigure 6 is by 1 because the left andkey 1 right is the child last node,key in respectively. the range [0, 3] neednode to is determine equal to the the index range of ofeither keys its covered first or last by key. each The inter- quality algorithm in the same fashion. that starts with “000”. The next key must start with “001”. index of an internal node will be equal to the index of its the range [0,lmax 1]. The idea is to consider each bit of l in nal node, as well as its children. The above property readily TheIn the subrange figure, [thei, γ] root represents corresponds the range to the of full keys range covered of keys, by first key if the node is a right child of its parent. Figure 7 turn, starting from the highest one, and set it to one unless the Zhou et al. [ZGHG11] also applied the idea of using Mor- gives us one end of the range, and we will show how the [0,the7]. left Since childk andand thek subrangediffer at their [γ +1 first, j] represents bit, the range the range is split shows that internal node 4 is a right child, that its index co- new value would fail to satisfy the inequality (lines 10–13). ton codes to construct octrees in the context of surface recon- covered by the3 right4 child [3]. otherincides end with can itbe first found key efficiently (key 4), and by that looking its range at the extends nearby at g = 3, resulting in subranges [0,3] and [4,7]. The left child The other end of the range is then given by j = i + ld. struction. Instead of generating the hierarchy in a top-down For an internal node with range [i, j], the equality δ(γ, γ + keys.rightward. The children Conversely, can then internal be identified node 3 is a by left finding child, its the in- split further1) = δ splits(i, j) always[0,3] at holdsg = true1 based and ison important the third for bit, finding and the dex coincides with its last key (key 3), and its range extends fashion, they start with the leaf nodes and perform a series of position, by virtue of our node layout. d(i, j) tells us the length of the prefix corresponding to Ii, rightthe child the split splits position.[4,7] at Ing fact,= 4 basedkγ and onkγ the+1 is second the only bit. pair leftward. More generally, if a node covers the range [i, j], parallel compaction operations to determine their ancestors, of adjacent keys in the range where the prefix length of the then its left child will be located at γ, which is the end of its which we shall denote by d . We can, in turn, use this to Pseudocode for the algorithm is given in Figure 4. We pro- node one level at a time. two keys is equal to the prefix length of the entire range. range [i, γ]. The right child will be located at γ + 1, which find the split position g by performing a similar binary search Pairs of adjacent keys to the left of the split position will cessis theeach beginning internal of node its rangeI in parallel, [γ + 1, j] and[3]. first determine the 3. Parallel Construction of Binary Radix Trees i for largest s [0,l 1] satisfying d(i,i + sd) > d (lines Binary radix trees. Given a set of n keys k0,...,kn 1 “direction” of its range by looking at the neighboring keys 2 node 17–20). If d =+1, g is then given by i + sd, as this is the represented as bit strings, a binary radix tree (also called a A naïve algorithm for constructing a binary radix tree would ki 1, ki, ki+1. We denote the direction by d, so that d =+1 highest index belonging to the left child. If d = 1, we have Patricia tree) is a hierarchical representation of their com- start from the root, find the first differing bit, create the child indicates a range beginning at i and d = 1 a range ending to decrement the value to account for the inverted indexing. mon prefixes. The keys are represented by the leaf nodes, nodes, and process each child recursively. This approach is at i. Since every internal node covers at least two keys, we and each internal node corresponds to the longest common inherently sequential—even though we know there are go- know that ki and ki+d must belong to Ii. We also know that Given i, j, and g, the children of Ii cover the ranges prefix shared by the keys in its respective subtree (Figure 2). ing to be n 1 internal nodes in the end, we have no knowl- ki d belongs to a sibling node Ii d, since siblings are always [min(i, j),g] and [g + 1,max(i, j)]. For each child, we com- located next to each other in our layout. pare the beginning and end of its range to see whether it is a c The Eurographics Association 2012. c The Eurographics Association 2012. Karras / Maximizing Parallelism in the Construction of BVHs, Octrees, and k-d Trees 35

1: for each internal node with index i [0,n 2] in parallel by the parent of the node currently under consideration. 2 2: // Determine direction of the range (+1 or -1) This fact is used to determine the index of j. For any km 3: d sign(d(i,i + 1) d(i,i 1)) belonging to Ii, the inequality δ(i, m) > δmin always holds 4: // Compute upper bound for the length of the range true. Therefore, δmin gives a lower bound for the length of 5: d d(i,i d) the prefix for the section of keys covered by Ii. Thus, the min 6: l 2 index j marking the other end of the range can be found by max 7: while d(i,i + lmax d) > dmin do searching for the largest integer l that satisfies δ(i, i+l d) > · · 8: lmax lmax 2 δ . Then j will be the index i + l d [3]. · min 9: // Find the other end using binary search Finding j is achieved by first finding· an exclusive power- 10: l 0 of-two upper bound for l, denoted as lmax (done in lines 6-8 11: for t lmax/2,lmax/4,...,1 do { } of the pseudocode). As shown in Figure 8, lmax begins at 2 12: if d(i,i +(l +t) d) > d then · min and is doubled until it becomes the upper bound for l. For 13: l l +t Figure 3: Our node layout for the tree of Figure 2. Each in- example, for I5, lmax is doubled until it reaches 4. We can 14: j i + l d ternal node has been assigned an index between 0–6, and · see from Figure 7 that l is 2 for node 5 since it covers the 15: // Find the split position using binary search aligned horizontally with a leaf node of the same index. The keys [5, 5 + 2]. Therefore, lmax = 4 is the correct exclusive 16: d d(i, j) node power-of-two upper bound for node 5 [3]. range of keys covered by each node is indicated by a hori- 17: s 0 Once the upper bound l is determined, l can be found zontal bar, and the split position, corresponding to the first 18: for t l/2 , l/4 ,...,1 do max {d e d e } bit that differs between the keys, is indicated by a red circle. 19: if d(i,i +(s +t) d) > d then by using binary search in the range [0, lmax 1] (performed · node − 20: s s +t in lines 10-13). For example, for node 5, with δmin = 1

21: g i + s d + min(d,0) and lmax = 4, the loop goes through the range 2, 1 . l edge of which keys they cover before having processed their · { } 22: // Output child pointers starts at 0 and is incremented by 2 in the first iteration, but ancestors. Our key insight in enabling parallel construction 23: if min(i, j)=g then left L else left I not incremented by 1 in the second iteration because the is to establish a connection between node indices and keys g g 24: if max(i, j)=g + 1 then right L else right I condition to increment does not hold. This gives a final l g+1 g+1 through a specific tree layout. The idea is to assign indices 25: I (left,right) value of 2. Now j can be determined by the formula j = i for the internal nodes in a way that enables finding their chil- 26: end for i + l d, which is 7 for node 5. This is correct since node 5 dren without depending on earlier results. covers· the range [5, 7]. Figure 4: Pseudocode for constructing a binary radix tree. Let us assume that the leaf nodes and internal nodes are Now that the range of keys for the node has been de- FigureFor simplicity, 8: Pseudocode we define that ford(i, binaryj)= 1 when radix j [ tree0,n 1 con-]. stored in two separate arrays, L and I, respectively. We define struction. Note that if an out-of-bounds 62 argument termined, the length of the common prefix of those keys, our node layout so that the root is located at I , and the in- is passed to δ(), it is defined to return -1 [3]. denoted by δnode, is calculated as δ(i, j) on line 16 of Fig- 0 ure 8. The next step of the algorithm is to determine where dices of its children—as well as the children of any internal Now, the keys belonging to I share a common prefix that i the range of keys should split for the left and right chil- node—are assigned according to its respective split position. must be different from the one in the sibling by definition. dren. Recall from the last paragraph of Section 3.2.2 that The left child is located at I if it covers more than one key, This implies that a lower bound for the length of the prefix g 3.2.4 Binary Radix Tree Construction Algorithm δ(γ, γ+1) = δ(i, j), which is the value of δ . This equality or at L if it is a leaf. Similarly, the right child is located at is given by d = d(i,i d), so that d(i, j) > d for any k node g The goal of themin binary radix tree constructionmin algorithmj occurs since the first differing bit in the range of keys toggles I or L . The layout is illustrated in Figure 3. belonging to I . We can satisfy this condition by comparing g+1 g+1 is to determine thei range [i, j] of keys covered by each in- from 0 to 1 exactly between γ and γ + 1. To find this split d(i,i 1) with d(i,i + 1), and choosing d so that d(i,i + d) An important property of this particular layout is that the ternal node, as well as the indices of its children. One end location, the next goal is to determine the largest integer s corresponds to the larger one (line 3). index of every internal node coincides with either its first or of the range is given by the internal node’s index, and the in the range [0, l 1] that satisfies δ(i, i + s d) > δnode. The − · its last key. This follows by construction—the root is located other endWe use of the the rangesame reasoning can be found to find by the examining other endof the the sur- index i + s d will be the furthest index from i in the node’s · at the beginning of its range [0,n 1], the left child of any roundingrange by keys. searching The indices for the largest of thel that children satisfies cand( bei,i+ foundld) > by key range where the first differing bit is the same as that of locating the split position. The previous setup allows> each key i. If the range extends rightward from i then i + s d internal node is located at the end of its range [i,g], and the dmin. We first determine a power-of-two upper bound lmax · right child is located at the beginning of its range [g + 1, j]. internall by startingnode to from be processed 2 and increasing independently the valueexponentially and in parallel will be γ. Conversely, if the range extends to the left from i withuntil the it other no longer internal satisfies nodes the [3]. inequality This section (lines 6–8). will Onceexplain then i + s d will be γ + 1. The value of s can be found using Algorithm. In order to construct a binary radix tree, we · thewe construction have the upper process bound, using we find thel pseudocodeusing binary in search Figure in 8 the binary search shown on lines 17-20 of the pseudocode, need to determine the range of keys covered by each inter- as a reference. which works similarly to the previous binary search [3]. the range [0,lmax 1]. The idea is to consider each bit of l in nal node, as well as its children. The above property readily Now that s is known, γ is determined by i+s d+min(d, 0), Considerturn, starting processing from the an highest internal one, node and setIi, it which to one unless is in the theith gives us one end of the range, and we will show how the which appears on line 21 of the pseudocode.· The addition positionnew valueof the would array failI. First,to satisfy the the direction inequality the (lines range 10–13). extends other end can be found efficiently by looking at the nearby by min(d, 0) serves to place γ to the left of the split if the (rightwardThe other or end leftward) of the range is calculated is then given in line by 3j = ofi Figure+ ld. 8. If d keys. The children can then be identified by finding the split becomes +1 then the range extends rightward; if it becomes range was extending to the left. In that case, i + s d yields · position, by virtue of our node layout. 1 thend(i the, j) rangetells us extends the length leftward of the prefix [3]. For corresponding example consider to Ii, γ + 1 instead of γ, since the search was scanning from right − which we shall denote by d . We can, in turn, use this to to left [3]. Consider the full formula for I3, where s = 1 and Pseudocode for the algorithm is given in Figure 4. We pro- I2, which covers keys 2 andnode 3 as shown in Figure 7. In thisfind example, the splitδ position(2, 3) =g 4by while performingδ(2, 1) a similar = 2, so binary the searchrange of d = 1, indicating that the range extends left. The formula cess each internal node Ii in parallel, and first determine the − internalfor largest nodes 2 extends[0,l 1] tosatisfying the rightd( sincei,i + sd it) shared> d a(lines longer gives γ = 3 + (1 1) + ( 1) = 1, which matches the index “direction” of its range by looking at the neighboring keys 2 node · − − common17–20). prefix If d with=+1, itsg is right then neighbor. given by i + sd, as this is the of the left child for I3 in Figure 7. ki 1, ki, ki+1. We denote the direction by d, so that d =+1 Thehighest next index task belonging is to determine to the leftj child.(handled If d = in lines1, we 5-14have in Lastly, the node stores the indices of its children as shown indicates a range beginning at i and d = 1 a range ending the pseudocode),to decrement the which value represents to account forthe the index inverted of the indexing. other end on lines 23-25 of Figure 8. These assignment statements en- at i. Since every internal node covers at least two keys, we of the range [i, j]. For I2, j is 3 since the range of the internal sure that children covering only one key are represented as know that ki and ki+d must belong to Ii. We also know that node atGiven indexi, 2j, covers and g,keys the children 2 and3. of I Everyi cover internal the ranges node leaf nodes, while children covering multiple keys are repre- k belongs to a sibling node I , since siblings are always [min(i, j),g] and [g + 1,max(i, j)]. For each child, we com- i d i d covers at least two keys, therefore the two keys ki and ki+d sented as internal nodes. located next to each other in our layout. pare the beginning and end of its range to see whether it is a must belong to Ii. The other neighboring key ki d must Overall, when running this algorithm in parallel on the − belong to Ii d. The two keys ki and ki+d share a common GPU, each thread is responsible for only one internal node. c The Eurographics Association 2012. − prefix that is different and longer than the prefix between ki The balance of the tree depends solely on the locations of and ki d. Let δmin represent the value of δ(i, i d). For I2, primitives in the scene. Scenes with an even distribution of − − δmin equals 2, with the common prefix being “00”. The value primitives throughout the space will generate well-balanced of δmin is equivalent to the length of the prefix represented trees. 34 Karras / Maximizing Parallelism in the Construction of BVHs, Octrees, and k-d Trees

on modern hardware continues to increase, this algorithm will become even more impactful. While this algorithm is a significant improvement over New method previous work, it is not without flaws. Scenes with poorly distributed primitives will generate BVHs that are not well- balanced. This is because the algorithm uses absolute lo-

Garanzha et al. cations of the primitives in the scene, rather than their po- sitions relative to each other. An additional drawback to this algorithm is that it creates a new BVH for each frame without using any information from the previous frame. In Figure 1: Performance of BVH hierarchy generation for sequences of frames where primitives move very little, it Figure 9: A comparison of two algorithms. The y would be beneficial to simply adjust the previous frame’s Stanford Dragon (871K triangles). The y-axis corresponds axis represents millions of primitives per second and FigureBVH, 2:ratherOrdered than binary constructing radix tree. an Leaf entirely nodes, new numbered one. Fur- theto millionsx axis is of the primitives number per of second parallel and cores. the x-axis [3]. to the 0–7,ther store research a set is of needed 5-bit keys to overcome in lexicographical these drawbacks. order, and Over- number of parallel cores relative to GTX 480. The solid red theall, internal this research nodes representpresents remarkable their common progress prefixes. in Each the field in- of line indicates the fastest existing method by Garanzha et ray tracing. 3.2.5 Time Complexity ternal node covers a linear range of keys, which it partitions al. [GPM11], which is dominated by the top levels of the into two subranges according to their first differing bit. treeFor wherean internal the amount node that of parallelism covers q keys, is very each limited. of thethree Our loops above executes at most log q iterations. Since q n Acknowledgments method, indicated by the dottedd green2 e line, parallelizes over≤ for all n 1 internal nodes, the worst-case time complexity InThanks contrast to to Nic a prefix McPhee, tree, Elena which Machkasova, contains one internal K.K. Lamberty, node the entire− tree and scales linearly with the number of cores. Max Magnuson, and Emma Sax for their time, feedback, for constructing the entire tree is (n log n). The worst case for every common prefix, a radix tree is compact in the sense O and constructive comments. occurs when the height of the tree grows proportional to n, that it omits nodes with only one child. Therefore, every bi- butwhere this isthe unlikelyx coordinate since the of height the point of the is tree represented is limited as to nary radix tree with n leaf nodes contains exactly n 1 in- the0.X lengthX X of... the, and keys similarly [3]. for y and z coordinates. The Mor- 0 1 2 ternal5. REFERENCESnodes. Duplicate keys require special attention—this ton code of an arbitrary 3D primitive can be defined in terms [1] K. Garanzha, J. Pantaleoni, and D. McAllister. is discussed in Section 4. 3.3of the Fitting centroid ofAABBs its axis-aligned bounding box (AABB). In Simpler and faster HLBVH with work queues. In practice,To turn the the Morton binary codes radix can tree be into limited a BVH, to 30 orthe 63 contents bits in WeProceedings will only consider of the ACM ordered SIGGRAPH trees, where Symposium the children on oforder each to node store are them surrounded as 32-bit withor 64-bit an AABB. integers, The respectively. leaves are of eachHigh node—and Performance consequently Graphics the, HPGleaf nodes—are ’11, pages in 59–64, lexi- New York, NY, USA, 2011. ACM. already done since each primitive has its own AABB as dis- cographical order. This is equivalent to requiring that the se- cussedThe in algorithm Section 2. for Each generating internal BVH node node can hierarchy construct was an [2] C. Gribble, J. Fisher, D. Eby, E. Quigley, and quence of keys is sorted, which enables representing the keys AABBsubsequently by surrounding improved its by two Pantaleoni children’s and AABBs Luebke in [PL10 a box] G. Ludwig. Ray tracing visualization toolkit. In covered by each node as a linear range [i, j]. Using d(i, j) inand a manner Garanzha similar et al. [ toGPM11 Figure]. Garanzha 4. Since et the al. dimensions generate one of Proceedings of the ACM SIGGRAPH Symposium on to denote the length of the longest common prefix between eachlevel internal of nodes node’s at a time, AABB starting depends from the on root. thedimensions They process of Interactive 3D Graphics and Games, I3D ’12, pages each child’s AABB, it must be done in a bottom-up fash- keys k71–78,i and k Newj, the York, ordering NY, implies USA, 2012. that d ACM.(i0, j0) d(i, j) the nodes on a given level in parallel, and use binary search ion. To do this in parallel, each thread should start from for[3] any T.i0 Karras., j0 [i, Maximizingj]. We can thus parallelism determine in the the prefix construction cor- to partition the primitives contained within each node. They 2 a leaf node and travel up the tree. However, each internal respondingof BVHs, to a octrees, given node and by k-d comparing trees. In Proceedings its first and lastof the then enumerate the resulting child nodes using an atomic node has two children and it should not be processed twice. key—theFourth other ACM keys SIGGRAPH are guaranteed / to Eurographics share the same Conference prefix. Therefore,counter, and each subsequently internal node process keeps them an atomic on the next count round. of how on High-Performance Graphics, EGGH-HPG’12, In effect, each internal node partitions its keys according manySince threads these have methods visited are it.targeted The firstfor real-time thread that ray tracing, reaches pages 33–37, Aire-la-Ville, Switzerland, Switzerland, an internal node terminates immediately, while the second to their first differing bit, i.e. the one following d(i, j). This they are each accompanied with a high-quality construction 2012. Eurographics Association. thread gets to process the node [3]. bit will be zero for a certain number of keys starting from k , algorithm to allow different quality vs. speed tradeoffs. The [4] T. Viitanen, M. Koskela, P. J¨a¨askel¨ainen, H. Kultala,i and oneand for J. the Takala. remaining Mergetree: ones until A HLBVHk . We call constructor the index of for idea is to use the high-quality algorithm for a relatively small j 3.4 Results the lastmobile key where systems. the bit In isSIGGRAPH zero a split position Asia 2015, denoted Technical by number of nodes near the root, and the fast algorithm for the Briefs, SA ’15, pages 12:1–12:4, New York, NY, USA, As shown in Figure 9, the performance of the algorithm g [i, j 1]. Since the bit is zero for kg and one for k , the rest of the tree. While we do not explicitly consider such 2 g+1 scales incredibly well as the number of cores increases. The split position2015. ACM. must satisfy d(g,g + 1)=d(i, j). The resulting executionhybrid methods time is in inversely this paper, proportional we believethat to the our number approach of subranges[5] I. Wald. aregiven On fast by construction[i,g] and [g + of1, SAH-basedj], and are bounding further cores.is general This enough is an excellent to be combined quality with because any appropriate the performance high- volume hierarchies. In Proceedings of the 2007 IEEE partitioned by the left and right child node, respectively. canquality be roughly algorithm doubled in the same by doubling fashion. the number of cores Symposium on Interactive Ray Tracing, RT ’07, pages assigned to the task [3]. Figure 9 also shows a comparison In the33–40, figure, Washington, the root corresponds DC, USA, to the2007. full IEEE range Computer of keys, Zhou et al. [ZGHG11] also applied the idea of using Mor- between this algorithm and the best previously known algo- [0,7].Society. Since k and k differ at their first bit, the range is split rithmton codes by Garanzha to construct et octrees al. Asthe in the number context of of cores surface increases, recon- 3 4 at[6]g = T.3, resulting Whitted. in An subranges improved[0, illumination3] and [4,7]. The model left for child thestruction. new algorithm Instead of greatly generating surpasses the hierarchy the performance in a top-down of the furthershaded splits [ display.0,3] at gCommun.= 1 based ACM on the, 23(6):343–349, third bit, and theJune comparisonfashion, they method, start with even the leaf though nodes the and comparison perform a series method of 1980. right child splits [4,7] at g = 4 based on the second bit. hasparallel elements compaction of parallelism operations [1]. to determine their ancestors, [7] Wikipedia. Triangle mesh — Wikipedia, The Free one level at a time. Encyclopedia, 2015. [Online; accessed 10-April-2016]. 3.[8] Parallel Wikipedia. Construction Bounding of volume BinaryRadix hierarchy Trees — Wikipedia, 4.Binary CONCLUSION radix trees. Given a set of n keys k0,...,kn 1 The ray tracing technique is exceptional at producing high The Free Encyclopedia, 2016. [Online; accessed represented as bit strings, a binary radix tree (also called a A naïve10-April-2016]. algorithm for constructing a binary radix tree would qualityPatricia images tree) is of a 3D hierarchical scenes. The representation simulation of actualtheir com- light start from the root, find the first differing bit, create the child propagation results in the shadows, reflections, and refrac- [9] Wikipedia. Ray tracing (graphics) — Wikipedia, The tionsmon which prefixes. make The images keys are appear represented photorealistic. by the leaf By nodes, using nodes,Free and Encyclopedia, process each child 2016. recursively. [Online; accessed This approach is bothand the each parallel internal BVH node construction corresponds to algorithm the longest presented common in inherently10-April-2016]. sequential—even though we know there are go- prefix shared by the keys in its respective subtree (Figure 2). ing[10] to Wikipedia. be n 1 internal Z-order nodes curve in the —end, Wikipedia, we have The no knowl- Free this paper and hardware with multiple cores, real-time ray tracing becomes significantly faster. As the number of cores Encyclopedia, 2016. [Online; accessed 10-April-2016]. c The Eurographics Association 2012.