Design and implementation of virtualized sparse octree ray casting

Master Thesis Computer Science Lund University, Faculty of Engineering

by Markus Arvidsson

December 2010

ABSTRACT

Sparse voxel octree ray casting is an alternative rendering technique to rasterization for static geometry. The hierarchical structure of the octree enables some nice properties for the ray casting method: the traversal of the octree data has a logarithmic time complexity and level of detail is handled automatically by terminating the traversal at a lower depth level. Furthermore, the octree data structure is well suited for dynamic streaming and virtualization since both the geometry and the material data can be stored in one data structure. It is possible that in the future highly detailed scenes with complex geometry and unique texturing will perform better with sparse voxel octree ray casting than with rasterization of indexed triangle meshes. On the negative side, the memory requirements are generally higher and the image quality is currently not as good as rasterization with triangle meshes.

In this thesis, we design and implement a method for virtualized sparse voxel octree ray casting. We compare our method to other published methods and discuss the differences, as well as present some ideas for future work. A novel octree traversal algorithm is also described (but not performance evaluated) for a SIMD type of instruction set.

ACKNOWLEDGEMENTS

I would like to thank my supervisor, Tomas Akenine-Möller, whose guidance helped me from the initial to the final version of this thesis. I would also like to thank my family and friends for giving me support.

TABLE OF CONTENTS

Acknowledgements
Table of Contents

1 Background
  1.1 Real-Time Rendering
    1.1.1 Rasterization
    1.1.2 Ray Tracing
    1.1.3 Ray Casting and Hybrid Rendering

2 The Octree Data Structure
  2.1 CPU Traversal
  2.2 Multi-Core Parallel Traversal
  2.3 GPU Traversal
    2.3.1 GPU Traversal: Implementation of a Benchmark Algorithm

3 Sparse Voxel Octrees as Static Geometry
  3.1 Sparse Voxel Octree Geometry Format
    3.1.1 Comparison with a Triangle Mesh
    3.1.2 Voxelization of a Triangle Mesh
  3.2 Virtualization of the Static Geometry
    3.2.1 Virtualization System Overview
    3.2.2 Virtualized Sparse Voxel Octree
    3.2.3 Run-Time Loading

4 Sparse Voxel Octree Rendering
  4.1 Integration With a Traditional Pipeline
  4.2 Sparse Voxel Octree Ray Casting

5 Implementation and Results
  5.1 Voxelization and Virtualization
  5.2 Run-Time Environment
  5.3 Results and Discussion

6 Future Work
  6.1 Image Quality Improvements
    6.1.1 Voxelization Filtering
    6.1.2 Spherical Voxel Traversal
    6.1.3 Screen Space Post-Processing
  6.2 Performance Improvements
    6.2.1 Faster Dynamic Streaming
    6.2.2 Data Compression

7 Conclusion

A Appendix
  A.1 Pseudocode

Bibliography

1 BACKGROUND

1.1 Real-Time Rendering

Image rendering is the process of synthesizing images from geometric models. The result of the rendering process for one image is commonly called a frame. Realistic rendering techniques attempt to approximately solve the rendering equation [1]. In real-time rendering, the time to render a frame needs to be short (usually around 15-30 ms) to give the impression that the rendering is instantaneous, or in real time. An introduction and reference to the subject of real-time rendering is given in [2]. The above description of real-time rendering is somewhat vague; in practice, real-time rendering is often used in applications that have some kind of user input that influences the rendering. Video games are currently the largest application field for real-time rendering, but its use in other areas is growing [3].

In this section, we will discuss the different fundamental rendering techniques currently used in real-time rendering as well as some techniques that could potentially become useful when the hardware has developed further. We also discuss the advantages and disadvantages of each method.

1.1.1 Rasterization

Rasterization is by far the most frequently used technique for real-time rendering. The rasterization pipeline transforms geometric primitives by a projection transformation from camera space to screen space. The primitives are stored in a vector graphics format and are, after the projection, converted to a raster image format, hence the name rasterization. Indexed triangle meshes are used as the vector graphics format for the primitives in almost all implementations.

Rasterization renderers can be categorized into two main rendering types: forward rendering and deferred rendering. They differ mainly in how they calculate lights and shadows in the scene; another term for the light and shadow calculations is shading. A forward renderer generally calculates the shading one object at a time. Multiple rendering passes might be needed if many lights simultaneously affect the object. A deferred renderer works differently: instead of calculating the shading for the object it stores the normal and other lighting and shadow properties to a render target buffer. This is done only for the pixels that pass the Z-buffer test. Those properties are then used in the shading stage, when the final color for each pixel is calculated. In the shading stage, all lights in the scene are iterated over and we calculate and add each light's contribution to the final pixel color. Each light's contribution is calculated by taking the stored properties into account. There are advantages and disadvantages with each type of renderer; the strengths of deferred shading are generally maximized in scenes with many lights influencing a large number of objects. On the other hand, an outdoor scene with the sun as the only light will most likely perform better with forward shading.
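The per-pixel light accumulation in the shading stage can be sketched as follows. This is a minimal illustration in Python, not the renderer discussed later in this thesis; the Lambertian diffuse term and the flat G-buffer layout are simplifying assumptions of ours:

```python
def shade_pixel(normal, lights):
    """Accumulate diffuse (Lambertian) contributions from all lights.

    normal: stored G-buffer normal for the pixel (unit 3-vector).
    lights: list of (light_direction, intensity) pairs.
    """
    color = 0.0
    for light_dir, intensity in lights:
        # Dot product of unit vectors; clamp back-facing lights to zero.
        n_dot_l = sum(n * l for n, l in zip(normal, light_dir))
        color += intensity * max(n_dot_l, 0.0)
    return color

def deferred_shade(gbuffer_normals, lights):
    # One shading loop over all pixels, after geometry has been rasterized
    # and only the Z-buffer survivors were written to the G-buffer.
    return [shade_pixel(n, lights) for n in gbuffer_normals]
```

A forward renderer would instead run this loop per object during rasterization, potentially once per light and rendering pass.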

Modern graphics processors run the rasterization pipeline [4] in parallel on multiple execution units. Each triangle is processed independently from the other triangles and as a consequence each primitive is processed independently from the other primitives. What this means in practice is that rasterization is very cache friendly; the reason being that the vertices for an indexed triangle mesh are stored linearly and each mesh can be processed sequentially without the need of any data from another mesh. However, because of the same independence between the primitives, the rasterization algorithms can not use any data that is stored within another triangle or within another primitive; if such data is needed it has to be calculated before the rendering of the frame is started and then made available by storing the data in separate buffers.

Another issue with the current rasterization pipeline implementations is that they use normal maps to enhance geometric detail. While this technique improves the image quality significantly, it still needs directional lights and the right view angle to give the desired effect. At some point in the future this illusory illumination effect will be replaced with real geometry. Displacement maps [5] in combination with micropolygon tessellation [6] are currently the dominant technique to increase real geometric detail. However, indexed triangle meshes are not a good choice if the triangle size is smaller than a pixel, since the rasterization interpolation does not work well in that case. Furthermore, indexed triangle meshes are not the most memory efficient format for small triangles since, no matter the size of the triangle, three position vectors and three integer indices have to be stored. Another issue is that displacement maps are not well suited for future ray tracing [7] techniques since the displacement maps have to be evaluated in order to produce correct results. These drawbacks and restrictions have spawned some research into alternative geometry formats. We will later discuss the octree geometry format, which partly addresses these problems.

Despite these restrictions, the performance and image quality that the rasterization hardware and algorithms have been able to produce have improved significantly every year for the last twenty years. While this development is expected to continue for the foreseeable future, there may at some point be a need for real-time rendering algorithms and visual effects that will work better if implemented with a technique that does not have the independent triangles restriction. For instance, some reflection and refraction effects are hard to calculate without access to other primitives. Ray tracing, a technique that we will discuss in the next section, has access to the whole scene and is often utilized in off-line rendering for similar optical effects.

1.1.2 Ray Tracing

Ray tracing [7] is a rendering technique that generates the resulting image by simulating light paths. The light path can start from the camera or from the light source (see Figure 1.1).

Figure 1.1: Ray tracing. A view ray is traced from the camera and a shadow ray is traced toward the light source.

Simulating a light path requires in most cases access to multiple surfaces, since some of the light energy will reflect and sometimes also refract when the light reaches a surface. Note that this is different from rasterization, where each surface can be rendered independently. Because of the surface reflections, ray tracing can achieve a higher image quality than rasterization for scenes with indirect illumination. Furthermore, ray tracing can render optical effects such as reflection, refraction and caustics. It is currently used in off-line rendering to render special optical effects, as well as serving as a building block in popular algorithms for global illumination [8]. Entire development APIs [9] and rendering engines [10] have been developed to promote ray tracing as the primary technique for real-time rendering. While we believe ray tracing will have a place in the real-time rendering field, we are not convinced that it will be used to do the heavy work. This prediction is based on the fact that frequent cache misses will occur in the tracing phase when we access the surfaces. Unfortunately, these cache misses will have a severe effect on the performance and we do not think this performance problem will be possible to solve in the near future. However, we believe ray tracing can be used to simulate optical effects such as those mentioned above when it is possible to run ray tracing shaders on the graphics processor.

1.1.3 Ray Casting and Hybrid Rendering

The purpose of this thesis is to design and implement a method for static geometry rendering based on sparse voxel octree ray casting. The method was first introduced to the author by a presentation of Olick [11]. The basic idea is an extension of virtual texturing [12]. By storing texture data and geometry data in the same data structure one can extend the virtualization to include the geometry in addition to the textures. This data structure is also designed for ray casting and dynamic streaming. The technical report [13] by Laine and Karras is the most comprehensive publication on sparse voxel octree ray casting to date. The algorithms and the framework presented in [13] are of high quality and we therefore recommend the interested reader to study the report in detail. Another interesting article that implements voxel octree ray casting is [14]. My work was done independently and differs fundamentally from both [13] and [14].

In this thesis, we will also discuss whether voxel ray casting combined with traditional techniques might be an alternative to the traditional rasterization pipeline. We call this new pipeline hybrid rendering. Hybrid rendering in this case means that instead of using rasterization for all rendering tasks, the graphics pipeline will use ray tracing for some optical effects, voxel ray casting for static geometry and rasterization for everything else. The data structure used for the static geometry is based on a sparse voxel octree with the splitting planes in the center of the nodes. A description of the octree data structure is given in the next section.

2 THE OCTREE DATA STRUCTURE

The octree is a tree data structure used to recursively subdivide space into eight subspaces (see Figure 2.1). It is the three-dimensional analog of the quadtree. The length of the path from the root of the tree to a node is defined as the depth of the node. A tree node N in the octree is itself an octree and represents a finite bounded subspace in the shape of an axis-aligned bounding box B. N is split into 8 subspaces by three splitting planes: Πx=d, Πy=d, Πz=d. The three splitting planes intersect at the center point c of N. This implies that each of the node N’s children have the node N’s center point c as one corner point. If we define an orthogonal coordinate system with the origin mapped to the root node’s center point, then the coordinates for N’s center point c are the respective distances from the origin to the corresponding splitting planes Πx=d, Πy=d, Πz=d. Note that this implies that the root node’s center point is at the origin.
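As an illustration of the subdivision rule, a child node's center point can be computed from its parent's center and edge length. This Python sketch (our own helper, not from the thesis) assumes the child index convention used later in section 2.1, where a set bit means the negative side of the corresponding splitting plane:

```python
def child_center(center, edge, index):
    """Center point of child `index` of a node with the given center and
    edge length. Bit mapping: x -> bit value 4, y -> 2, z -> 1; a set bit
    means the child lies on the negative side of that splitting plane."""
    offset = edge / 4.0  # child centers sit a quarter edge from the parent center
    return tuple(
        c - offset if (index >> shift) & 1 else c + offset
        for c, shift in zip(center, (2, 1, 0))
    )
```

Note that every child shares the parent's center point c as one of its corner points, as stated above.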

Figure 2.1: The octree data structure.

2.1 CPU Traversal

Our novel octree traversal algorithm (see appendix A.1.1 for pseudocode) is partly inspired by an algorithm presented in a paper written by Amanatides and Woo [15]. It is not evaluated against other octree traversal algorithms and we do not expect it to perform better than the current top performing ones. We also designed a multi-core parallel version (see section 2.2 for a description and appendix A.1.3 for pseudocode). The parallel version was designed for a fully programmable graphics processor [16]. Unfortunately, we could not evaluate the performance of this algorithm since no such graphics processor is on the market at the time of writing.

A key insight we used in the development of the algorithm is that the number of intersections between the line segment and the splitting planes inside B, the order of the plane intersections and the child index for the start point and for the end point uniquely determine which children to traverse and their relative traversal order (see Figure 2.2).

Figure 2.2: The number of intersections between the line segment and the split- ting planes, the order of the plane intersections and the child index for the start point and for the end point uniquely determine the traversal order.

The input to the algorithm is the ray r and the traversal depth d; the output is the first node with depth d that the ray r intersects, or NIL.

For each ray r(t) = o + td that intersects the root node N's bounding box B, we find a parameter range tB = [t0, t1], where a = o + t0d and b = o + t1d are the first and second intersection points respectively. We define the local coordinate system of the octree as a right-handed coordinate system with the origin at the center of N (see Figure 2.3).
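Finding the parameter range tB = [t0, t1] can be done with a standard slab test against the axis-aligned bounding box. The sketch below is a generic textbook formulation, not the thesis implementation:

```python
def ray_box_range(o, d, box_min, box_max):
    """Return the parameter range (t0, t1) where the ray o + t*d is inside
    the axis-aligned box, or None if the ray misses the box (slab method).
    Only t >= 0 is considered, so rays starting inside the box get t0 = 0."""
    t0, t1 = 0.0, float("inf")
    for i in range(3):
        if d[i] == 0.0:
            # Ray parallel to this slab: must already be inside it.
            if not (box_min[i] <= o[i] <= box_max[i]):
                return None
            continue
        ta = (box_min[i] - o[i]) / d[i]
        tb = (box_max[i] - o[i]) / d[i]
        t0, t1 = max(t0, min(ta, tb)), min(t1, max(ta, tb))
    return (t0, t1) if t0 <= t1 else None
```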

Figure 2.3: Octree coordinate system.

We transform the points a and b to the local coordinates of the octree:

M = | 2/E  0    0    −cx |
    | 0    2/E  0    −cy |
    | 0    0    2/E  −cz |
    | 0    0    0    1   |

aˆ = Ma, bˆ = Mb

where E is the length of the edges of B, c is the center point coordinates of N in world space and aˆ and bˆ are the transformed intersection points.
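Applied to a point in homogeneous coordinates, the matrix M reduces to a componentwise scale and offset. A small sketch (our own helper, assuming the matrix entries exactly as given in the text):

```python
def to_local(p, E, c):
    """Transform a world-space point p to the octree's local coordinates:
    each component is scaled by 2/E and then offset by -c, which is what
    the 4x4 matrix M does to a homogeneous point (p, 1)."""
    return tuple(2.0 / E * p[i] - c[i] for i in range(3))
```

Since M is the same for both intersection points, `to_local` would be applied once to a and once to b to obtain aˆ and bˆ.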

Define the points pˆstart = aˆ and pˆend = bˆ and the parameter values tstart = 0 and tend = 1 corresponding to pˆstart and pˆend. Define the line segment between the points aˆ and bˆ:

L = aˆ + t(bˆ − aˆ), t ∈ [0, 1]

The algorithm will recursively visit the nodes in a depth-first order, starting with the root node and the above parameters as the first input. In each recursive step, the two points pˆin and pˆout are created along with their respective line parameter values tin and tout. The line parameter values tin and tout are copied from the splitting plane intersection parameters tx, ty and tz for each visited child node. Explicit stacking instead of recursive function calls is used to avoid the function call overhead. A new stack packet is popped from the stack at the beginning of each recursive step. Thereafter, the stack packets for the child nodes are created and pushed on the stack in the correct depth-first order. The data for each stack packet are the node index, the node depth and the two intersection parameters tin and tout. The recursion stops when the desired depth d is found or when the stack is empty.
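The explicit-stack depth-first loop can be outlined as follows. This is a schematic sketch only: the node representation and the children_in_order helper, which must yield the child packets in the front-to-back order derived in the rest of this section, are placeholders of ours:

```python
def traverse(root, target_depth, children_in_order):
    """Depth-first octree traversal with an explicit stack instead of
    recursive function calls. A stack packet is (node, depth, t_in, t_out).
    `children_in_order(packet)` must yield the child packets in the ray's
    traversal order. Returns the first node found at `target_depth`,
    or None when the stack runs empty."""
    stack = [(root, 0, 0.0, 1.0)]
    while stack:
        node, depth, t_in, t_out = stack.pop()
        if depth == target_depth:
            return node
        # Push children in reverse so the nearest child is popped first.
        for child in reversed(list(children_in_order((node, depth, t_in, t_out)))):
            stack.append(child)
    return None
```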

Define the bitfield A = Xin Yin Zin, where for example the bit Xin is set if the signed distance nx · pˆin + d from the point pˆin to the plane Πx=d is negative. Bitfield B = Xout Yout Zout for point pˆout is defined analogously. By inspection (see Figure 2.4), it is clear that the node child index for point pˆin is the integer value of A and the node child index for pˆout is the integer value of B.

Figure 2.4: Child indices.

This simple calculation completely determines the traversal order for the no intersection case and for the one intersection case. For example, voxel coordinates that are inside the bounds of the node with child index 7 are all on the negative side of each splitting plane and will therefore have all bits set. Note that this requires a voxel coordinate system that is similar to the one illustrated in Figure 2.3.
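The child index calculation amounts to three sign tests packed into a bitfield. A sketch, assuming the bit mapping of Figure 2.4 with x as the most significant bit:

```python
def child_index(p, c=(0.0, 0.0, 0.0)):
    """Bitfield A (or B) from the text: a bit is set when the point p lies
    on the negative side of the corresponding splitting plane through the
    node center c (x -> bit 3, y -> bit 2, z -> bit 1). The integer value
    of the bitfield is the child index."""
    x = 1 if p[0] < c[0] else 0
    y = 1 if p[1] < c[1] else 0
    z = 1 if p[2] < c[2] else 0
    return (x << 2) | (y << 1) | z
```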

The intersection parameter between the line segment L and, for example, the plane Πx=d is given by equation [17]

nx · (aˆ + tx(bˆ − aˆ)) = d  ⇒  tx = (d − nx · aˆ) / (nx · (bˆ − aˆ))

The above equation simplifies since the plane is axis-aligned, nx = (1, 0, 0). The only change to the equation for each recursive step is the plane offset d. The points aˆ and bˆ will be reused; the expensive floating point division is therefore only evaluated once before the first recursive step. Note that we do not have to do an intersection test between the bounding box B and the line segment L; this follows from the first intersection test with the ray and the root node in world space. It is therefore enough to do only line intersection tests with the splitting planes.

Bitfield A and bitfield B are also indirectly used in the calculation of the line-plane intersection parameters tx, ty and tz. First, set each line-plane intersection parameter to FLT_MAX, where FLT_MAX is the system's largest floating point value. Then, define bitfield C = X Y Z = A XOR B. Each bit in C will be set if and only if the line segment L intersects the corresponding plane. Bitfield C can then be used as a write mask for the line-plane intersection parameters tx, ty and tz: the intersection parameters will be overwritten if and only if the corresponding bit in C is set. The planes with no intersection with L will keep FLT_MAX as their intersection parameter value; those planes will be sorted last and their intersection points will not be used. This trick handles the collinear case without any special treatment.
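The FLT_MAX write mask trick can be expressed as follows. In this sketch the mask C = A XOR B is evaluated as a per-axis side comparison, and the division uses the simplified axis-aligned form of the intersection equation; the function is our own illustration:

```python
FLT_MAX = 3.402823466e38  # sentinel for planes the segment does not cross

def plane_params(a_hat, b_hat, d=(0.0, 0.0, 0.0)):
    """Compute the line-plane parameters (tx, ty, tz) for the segment
    a_hat -> b_hat against the three axis-aligned splitting planes at
    offsets d. A parameter keeps FLT_MAX when the corresponding bit in
    C = A XOR B is clear, i.e. when both endpoints lie on the same side
    of that plane, so uncrossed planes sort last."""
    t = [FLT_MAX, FLT_MAX, FLT_MAX]
    for i in range(3):
        a_neg = a_hat[i] < d[i]
        b_neg = b_hat[i] < d[i]
        if a_neg != b_neg:  # bit i of C is set: the segment crosses the plane
            t[i] = (d[i] - a_hat[i]) / (b_hat[i] - a_hat[i])
    return tuple(t)
```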

As mentioned above, the child indices for the no intersection case and for the one intersection case can be determined with A and B alone. We also need to determine the traversal order for the two intersection case and for the three intersection case. Those indices can be determined by sorting the splitting plane intersections and performing some bit operations on A and B:

Define the bitfield Pfirst = minx miny minz, where bit 1 is set if min(tx, ty, tz) = tz, bit 2 is set if min(tx, ty, tz) = ty and bit 3 is set if min(tx, ty, tz) = tx. Analogously, the bitfield Plast = maxx maxy maxz is defined, where bit 1 is set if max(tx, ty, tz) = tz, bit 2 is set if max(tx, ty, tz) = ty and bit 3 is set if max(tx, ty, tz) = tx. The second child index is then given by A XOR Pfirst in the two and three intersection cases, and the third child index is given by B XOR Plast in the three intersection case.
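An equivalent way to state these rules: start at the entry child index A and flip one child-index bit per crossed splitting plane, in order of increasing intersection parameter. The first flip reproduces A XOR Pfirst and the final index equals B. A sketch under that reformulation, with the bit mapping x = 4, y = 2, z = 1 and FLT_MAX marking uncrossed planes:

```python
FLT_MAX = 3.402823466e38

def traversal_order(A, t):
    """Child visit order from the entry child index A and the plane
    parameters t = (tx, ty, tz); planes with t == FLT_MAX are not crossed.
    Each crossing flips the corresponding child-index bit, so with k
    crossed planes the segment visits k + 1 children."""
    planes = sorted(
        (ti, bit) for ti, bit in zip(t, (4, 2, 1)) if ti != FLT_MAX
    )
    order = [A]
    idx = A
    for _ti, bit in planes:
        idx ^= bit
        order.append(idx)
    return order
```

For example, a segment entering child 0 and crossing the x, y and z planes in that order visits children 0, 4, 6, 7, matching A, A XOR Pfirst, B XOR Plast, B.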

Below is an illustration (Figure 2.5) of a case where all three splitting planes are intersected. The intersection order is stated, as well as the bitfields used to determine the traversal order.

Figure 2.5: Node traversal order. The table to the right shows the node traversal order and the plane intersection order for this case. It also shows the variables involved and their respective values.

2.2 Multi-Core Parallel Traversal

The algorithm is designed with parallelization in mind (see appendix A.1.3 for pseudocode). We assume a SIMD type of instruction set, with read mask/write mask bit operands and N-wide vector primitive types [16]. This means that some instructions can have a bitfield P = bitN ... bit1 as an optional operand, where bit i is set if the instruction should read/write to the scalar with index i in the vector. We also assume vector instructions for comparison operations, bit operations and math operations. The result of a comparison operation is stored in a bitfield bitN ... bit1, where bit i is set if the result of the comparison is true for the scalar with index i. The algorithm is a packet based traversal algorithm that can be split into multiple threads, where each thread will handle N rays at a time.

Define the bitfield R = bitN ... bit1, where bit i in R is set if the ray ri(ti) is valid for the given stack packet. The bitfield R will be used extensively as the read mask/write mask for the vector instructions. The N rays will all be set to valid before the first step. We then calculate the intersection points ai and bi between the rays ri and the root node N's bounding box B. We clear bit i in R if there is no intersection between the ray ri and B. We then transform the points ai and bi for each valid ray to the corresponding voxel coordinate points aˆi and bˆi.

These transformed points will be used for computing intersection values ti for each line segment Li with the splitting planes; the N points aˆi and bˆ i will remain constant during each recursive step as in the non parallel case.

The initial values for the N points pˆ start[i] and pˆ end[i] are set to aˆi and bˆ i respec- tively. The Li parameter values tstart[i] = 0 and tend[i] = 1 are also initialized before the first recursive step. Again, this is no different from the non parallel case.

The main difference so far is that we need N values for aˆ, bˆ, pˆstart and pˆend. We also need the bitfield R to keep track of which rays are valid for the given stack packet. The data for each stack packet in the parallel case are the node index, the node depth, the N intersection parameters tin[i], tout[i] and the bitfield R. We also store an additional ray mask S for each set of N rays. This ray mask will keep track of the rays that have not yet hit the desired depth d. This is necessary in order to handle the case where only some of the valid rays hit a node with depth d and the others should go on to the next node. The first step in each recursive step is then to AND the ray mask R with S, effectively ignoring those rays that have already hit a node with depth d. The next step is to create the N points pˆin[i] and pˆout[i] from the N intersection parameters tin[i] and tout[i]. After this calculation we need to calculate the splitting plane intersection order.

Define the bitfields:

P1 = Xin[N] ... Xin[1]
P2 = Yin[N] ... Yin[1]
P3 = Zin[N] ... Zin[1]
P4 = Xout[N] ... Xout[1]
P5 = Yout[N] ... Yout[1]
P6 = Zout[N] ... Zout[1]

where, for example, the bit Xin[i] is set if the signed distance nx · pˆin[i] + d from the point pˆin[i] to the plane Πx=d is negative. Define the bitfields Ix = P1 XOR P4, Iy = P2 XOR P5 and Iz = P3 XOR P6. The bit with index i in Ix, Iy and Iz will then be set if and only if there is an intersection between the line segment Li and the corresponding splitting plane. Then for each splitting plane: set each scalar in the intersection parameter vector ti to FLT_MAX, use the bitfield Ii as the write mask and calculate the intersection parameter vector ti (this calculation can be done with a single MADD instruction).

Define the vectors:

tmax = tmax[N] ... tmax[1]
tmin = tmin[N] ... tmin[1]

Set tmax[i] = max(tx[i], ty[i], tz[i]) and tmin[i] = min(tx[i], ty[i], tz[i]). Define the integer vectors Pfirst[N] and Plast[N]. These integer vectors will be used to sort the splitting plane intersections; the scalars in each vector have the same functionality as the Pfirst and Plast bitfields in the non parallel case. Compare tmax with tx for equality. Use the result of the comparison as the write mask and set bit 3 in each integer scalar in the Plast vector. Do the same for ty and tz, but set bit 2 respectively bit 1 instead of bit 3. Repeat the procedure, but this time compare the three parameters tx, ty and tz with tmin for equality and write to the vector Pfirst.

All these operations are vector versions of the corresponding single ray operations and we now have the data structures we need to determine the traversal order. Comparing rays for the same traversal order is trivial: compare the bitfields P1 ... P6 and the integer vectors Pfirst and Plast for equality. This is done by logically ANDing the resulting comparison masks together, starting with the valid ray mask R. The rays for the bits that are still set after the eight comparisons have the same traversal order.

To create a ray traversal packet we also need to find the ray that we will compare with all the other valid rays. The ray is found by bit scanning the valid vector R. We stop when we encounter the first set bit i in R. We then create the eight comparison vectors C1 ... C6, Cfirst and Clast. We set all N scalars in each vector to the same value as the scalar with index i in the corresponding vectors P1 ... P6, Pfirst and Plast. We then compare the C vectors with the P vectors for equality. The first valid ray combined with the rays that pass the comparison tests will be the valid ray mask (this is the same as the resulting comparison bit vector) for the next stack packet. We then XOR the resulting comparison bit vector with the valid ray mask for the current stack packet. This clears the bits for those rays that were handled and we resume the bit scan on this new bit vector, making another stack packet. The algorithm is done when the stack is empty or when all bits in the bitfield S are cleared.
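The packet creation loop can be sketched with ordinary integer bit operations. Here order_keys is a stand-in of ours for the per-ray comparison data (P1 ... P6, Pfirst, Plast); the real algorithm compares these with masked vector instructions rather than per-ray scalar code:

```python
def make_packets(valid_mask, order_keys):
    """Group rays with identical traversal order into stack packet masks.
    `valid_mask` plays the role of the bitfield R. Repeatedly bit-scan for
    the first valid ray, gather all valid rays whose key equals that ray's
    key into one packet mask, then XOR the packet out of the remaining mask."""
    packets = []
    remaining = valid_mask
    while remaining:
        i = (remaining & -remaining).bit_length() - 1  # first set bit
        packet = 0
        probe = remaining
        while probe:
            j = (probe & -probe).bit_length() - 1
            if order_keys[j] == order_keys[i]:
                packet |= 1 << j
            probe &= probe - 1  # clear the bit just examined
        packets.append(packet)
        remaining ^= packet
    return packets
```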

2.3 GPU Traversal

In order for the algorithm to be really useful it needs to run on graphics hardware. However, so far we have only presented a multi-core parallel traversal algorithm for the CPU, where we assumed an instruction set that would fit a multi-core SIMD architecture. This algorithm was originally designed for a fully programmable graphics processor. The graphics processors currently available on the market do not fit our target processor and our target instruction set as well as we intended. Despite this mismatch in the target instruction set we got better than expected performance from our algorithm on the GPU. Our GPU implementation of the algorithm is more or less the same as the single core traversal, but it runs in parallel with a 64 ray kernel. Even though the performance is decent, it would of course be better if there were graphics hardware on the market that worked as we predicted when we started the development.

2.3.1 GPU Traversal: Implementation of a Benchmark Algorithm
For benchmark purposes we also implemented a traversal algorithm very similar to the one presented in [13]. That algorithm is developed and optimized for the same type of graphics processor that we used in the benchmark process. Both traversal solutions use the same content management and the same virtualization system; it is only in the traversal step that the solutions differ.

3 SPARSE VOXEL OCTREES AS STATIC GEOMETRY

The voxel octree geometry format that we present in this section is well suited for static geometry but can not be used for dynamic geometry. Static geometry is geometry that is not translated, rotated or deformed. For dynamic geometry we have to resort to other methods such as rasterization of indexed triangle meshes. The voxel octree geometry format is developed to be used with hybrid rendering (see section 1.1.3), so this is not as limiting as it may sound. Static geometry is a significant part of many scenes so the advantages of the voxel octree approach can still be utilized in those scenes.

3.1 Sparse Voxel Octree Geometry Format

Surface geometry and volume geometry can be represented in an octree geometry format. In this format, the whole space is filled up by the root node. Each node subdivides the space it represents into eight subspaces and represents an approximation of the geometry that is inside the node. The octree nodes are commonly called voxels. In each subdivision step, the approximation error decreases and we consequently get a better representation of the actual geometry (see Figure 3.1).

Figure 3.1: Geometry subdivision. Smaller voxels imply a better representation of the original geometry.

The disk storage data for each voxel in the voxel octree has the following structure:

DiskStorage
    color      (int)
    normal     (3 x float)
    childmask  (byte)

It is easy to extend this data structure for more advanced BRDF models. For instance, we can add an ambient factor and a specularity factor to implement the Phong shading model [18]. The data stored at each node will be the average of the data from the node's children. Only the leaf nodes contain data from the real geometry; the other nodes are automatically created from the nodes one depth level below in the tree. The data can be heavily compressed because of the small difference between the parent and the child nodes. Compression techniques and how they might be used on the voxel octree data are discussed briefly in section 6.2.2.
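Building an interior node by averaging its children can be sketched as follows. This is our own helper for illustration; renormalization of the averaged normal is omitted for brevity:

```python
def average_children(children):
    """Build a parent voxel's data as the componentwise average of its
    children's data. Colors are RGB tuples and normals are 3-vectors;
    a production version would renormalize the averaged normal."""
    n = len(children)
    avg_color = tuple(sum(c["color"][i] for c in children) / n for i in range(3))
    avg_normal = tuple(sum(c["normal"][i] for c in children) / n for i in range(3))
    return {"color": avg_color, "normal": avg_normal}
```

Applied bottom-up from the leaves, this produces the automatically created interior nodes described above.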

The run-time traversal data for each node has the following structure:

RunTime
    planescenter  (3 x float)
    childindices  (8 x int)

The run-time traversal data is used in the traversal of the octree. The result of the traversal is a buffer with node indices. Those indices are later used as indices into the storage data and can be mixed with data from other rendering passes (for instance the rasterization pass) in the final rendering pass. The run-time traversal data is created on the fly from the disk storage data. The float array that stores the plane center offsets is not strictly necessary; the regularity of the octree makes it possible to calculate them in the traversal process. However, this would complicate the traversal algorithm, so we chose to precalculate them.

3.1.1 Comparison with a Triangle Mesh

In the indexed triangle mesh format (see Figure 3.2), a position vector and two texture coordinates are stored for each vertex point. Each triangle also stores three integers as vertex indices. This data size is constant regardless of the size of the triangle. At a certain (small) triangle size, this format becomes less memory efficient than a triangle in the voxel octree format (see Figure 3.2). For larger triangles, the indexed triangle format is more memory efficient.

Each triangle: 3x3 float (pos), 3 int (indices), 3x2 float (uv). Each octree node: 1 byte.

Figure 3.2: Triangle memory compared to voxel octree memory.
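With the commonly assumed 4-byte floats and integers, the per-triangle cost of the layout in Figure 3.2 works out as follows (counting the vertex data per triangle, i.e. without vertex sharing, which is the pessimistic case):

```python
FLOAT, INT = 4, 4  # assumed sizes in bytes

pos_bytes = 3 * 3 * FLOAT  # three vertex positions, 3 floats each
idx_bytes = 3 * INT        # three vertex indices
uv_bytes = 3 * 2 * FLOAT   # two texture coordinates per vertex

bytes_per_triangle = pos_bytes + idx_bytes + uv_bytes
print(bytes_per_triangle)  # prints 72
```

So a fully unshared triangle costs on the order of 72 bytes regardless of its screen size, which is the constant cost the comparison above refers to.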

The time complexity for rendering triangles with rasterization is linear. This means that we will pay roughly a linear time penalty for increased geometric detail. The same is not true with a voxel format; the recursive nature of the traversal implies that the time complexity in this case is logarithmic.

Another major difference between the voxel octree geometry format and indexed triangle meshes is that both the geometry and the material parameters are stored in the octree nodes, whereas for triangle meshes the material parameters and the geometry normals are normally stored in separate textures. One implication of this is that texture seams and border problems disappear. It is also easier to dynamically stream the data because it comes from one stream instead of multiple streams. This storage model is also promising from a content creation perspective because of its uniform nature. The idea is that content creation and tools development become easier when only one major data structure is edited at a time.

3.1.2 Voxelization of a Triangle Mesh

The purpose of the voxelization process is to incrementally build a voxel octree hierarchy and then write that hierarchy to disk storage. The disk storage data will differ from the build hierarchy data because of performance and memory requirements. The disk storage layout and the run-time loading of the voxel octree are explained in section 3.2.

The discrete voxel space used in the voxelization process is a cube with a fixed number of voxels per dimension (see Figure 3.3).

Figure 3.3: Voxelization cube (gridsize = 8).

The number of voxels per dimension (the variable gridsize) is one of the input parameters to the voxelization process. This variable is a power-of-two integer. A higher voxel density results in a smaller surface approximation error for each triangle and higher memory requirements. The algorithm iterates through all the triangles of the mesh and transforms each triangle's vertices to discrete voxel space coordinates with the following matrix transformation:

        | 2/E   0    0   -c_x |   | gridsize/2      0           0       gridsize/2 |
        |  0   2/E   0   -c_y |   |     0       gridsize/2      0       gridsize/2 |
M =     |  0    0   2/E  -c_z | * |     0           0       gridsize/2  gridsize/2 |
        |  0    0    0     1  |   |     0           0           0           1      |

The scalars in the M matrix are the same as in the transformation matrix for the ray intersection points (see section 2.1).
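Applied to a point, the matrix amounts to two steps: normalize the point relative to the voxelization cube (scale by 2/E, translate by -c, where E is the cube's edge length), then remap the normalized range [-1, 1] to [0, gridsize]. The sketch below follows the matrix scalars literally; the composition order (normalization first) and the interpretation of c as the translation of the normalization step are assumptions of ours:

```cpp
#include <array>

// Map a world-space point p into discrete voxel space, following the
// scalars of the matrix M: n = (2/E) * p - c, then gridsize/2 * n + gridsize/2.
// E is the edge length of the voxelization cube; c is the translation
// of the normalization step (our reading of the matrix).
std::array<float, 3> worldToVoxel(const float p[3], const float c[3],
                                  float E, int gridsize) {
    std::array<float, 3> out;
    float half = gridsize * 0.5f;
    for (int i = 0; i < 3; ++i) {
        float n = (2.0f / E) * p[i] - c[i];  // normalize into [-1, 1]
        out[i] = half * n + half;            // remap to [0, gridsize]
    }
    return out;
}
```

With a unit cube centered at the origin (E = 2, c = 0) and gridsize = 8, the corners map to 0 and 8 and the center to 4.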

Each triangle is subdivided along the shortest geometry median if the vertices are not neighbors in discrete voxel space (see Figure 3.4). The triangle subdivision ensures that the voxelization process does not create any holes in the voxel octree. The subdivided triangles are then recursively processed. Three new octree nodes - one for each vertex - are created if the voxel space coordinates are neighbors and no further subdivision should occur. The three octree nodes get their color from the diffuse texture map assigned to the triangle mesh. The normal for each node can be fetched from the normal map or calculated from the triangle geometry. The texture coordinate lookup is done with the vertex UV-coordinate, so in most cases the data will be sampled off center from the voxel (see Figure 3.5). This is a sampling error that we choose not to address since the emphasis of this thesis is not on rendered image quality. Finally, the three octree nodes are added to the voxel octree. We also need to update the parent's color and normal for each new node. This is done by taking the new node's color and normal into account when calculating the average color and normal for the parent. The update of the parent node is a recursive process that continues until we have updated the root node.

Figure 3.4: The triangle is split along the shortest geometry median. Figure 3.5: Texture data for each voxel is fetched from the vertex UV-coordinate.
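The recursive subdivision step can be sketched as follows. The sketch operates on vertices already transformed to (non-negative) voxel space; "neighbors" means the integer cells differ by at most one per axis. Emitting a cell stands in for creating an octree node (the color/normal lookups are omitted), and all names are ours:

```cpp
#include <cstdlib>
#include <vector>

struct FV3 { float x, y, z; };  // vertex in continuous voxel space
struct IV3 { int x, y, z; };    // discrete voxel cell

static IV3 cell(FV3 p) { return { (int)p.x, (int)p.y, (int)p.z }; }
static FV3 mid(FV3 a, FV3 b) {
    return { (a.x + b.x) / 2, (a.y + b.y) / 2, (a.z + b.z) / 2 };
}
static bool neighbors(IV3 a, IV3 b) {
    return std::abs(a.x - b.x) <= 1 && std::abs(a.y - b.y) <= 1 &&
           std::abs(a.z - b.z) <= 1;
}
static float len2(FV3 a, FV3 b) {
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Recursively subdivide until all three vertices land in neighboring
// voxel cells, then emit one cell per vertex.
void voxelizeTriangle(FV3 v0, FV3 v1, FV3 v2, std::vector<IV3>& out) {
    IV3 c0 = cell(v0), c1 = cell(v1), c2 = cell(v2);
    if (neighbors(c0, c1) && neighbors(c1, c2) && neighbors(c0, c2)) {
        out.push_back(c0); out.push_back(c1); out.push_back(c2);
        return;  // one node per vertex; no holes between neighbors
    }
    // Median i runs from vertex i to the midpoint of the opposite edge;
    // split along the shortest one and recurse on the two halves.
    FV3 v[3] = { v0, v1, v2 };
    FV3 m[3] = { mid(v1, v2), mid(v0, v2), mid(v0, v1) };
    int best = 0;
    for (int i = 1; i < 3; ++i)
        if (len2(v[i], m[i]) < len2(v[best], m[best])) best = i;
    voxelizeTriangle(v[best], v[(best + 1) % 3], m[best], out);
    voxelizeTriangle(v[best], m[best], v[(best + 2) % 3], out);
}
```

Each split halves the triangle, so the recursion terminates once all sub-triangles fit within neighboring cells.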

There are other - and in many cases superior - methods available for voxelization. The most promising methods are based on volume rasterization [19]. Those methods make it possible to sample nearby voxels and thereby use filtering algorithms to improve the image quality. They are probably also a better fit for many-core parallelization. The voxelization algorithm currently uses the triangle subdivision technique because of its simplicity. Our intention is to experiment with other methods in the near future.

3.2 Virtualization of the Static Geometry

The uniform nature of the voxel octree geometry format makes it theoretically possible to represent all static geometry in the scene as one large octree. This requires a large amount of internal memory, graphics card memory and storage memory such as hard disks and optical media. It also requires a virtualization system; it is not possible to have the whole tree loaded into main memory and graphics card memory at once. The basic idea is to give each octree node in the world a unique index and then let the virtualization system map those indices to actual memory addresses. The CPU will load subsections of the storage data structure from storage memory, create the traversal data structure and finally update the virtual system addresses. The CPU will also update the virtual system and unload sections of the tree. Loading and unloading of the octree sections is a function of the camera location and the rendering resolution.

3.2.1 Virtualization System Overview

The virtualization system is based on virtual pages. This is similar to how a standard virtual memory system works on a modern computer. The node count for each page is fixed and is one of the input parameters to the voxelization process. Every page except the last page holds the same number of nodes (see Figure 3.6). This requirement minimizes the run-time checks when mapping the node index to the memory address.


Figure 3.6: Virtualization pages. Each page except the last one holds the same number of nodes.

The virtual page table is basically an array of integers where each item is a page offset from the start of the memory. Finding the page index for a node is done by a lookup in the virtual page table. The node offset within each page can be found by a simple mod operation. The whole operation requires only simple arithmetic and is very fast:

pageindex = pagetable[nodeindex / pagesize]
nodepointer = basepointer + (pageindex * pagesize) + (nodeindex mod pagesize)

Here nodeindex is the unique global index of the node, pagesize is the number of nodes per page, basepointer is the start of the virtual node memory segment and nodepointer is the resolved pointer to the address where the node with index nodeindex is found.
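The lookup translates directly into a few lines of C++. The Node placeholder type and the use of a std::vector for the page table are illustrative assumptions:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Placeholder for the per-node storage data (1 byte in our format).
struct Node { uint8_t data; };

// Resolve a global node index to its in-memory address: one table
// lookup, one division and one mod, exactly as in the formulas above.
Node* resolve(Node* basepointer, const std::vector<int>& pagetable,
              int nodeindex, int pagesize) {
    int pageindex = pagetable[nodeindex / pagesize];
    return basepointer + (size_t)pageindex * pagesize + nodeindex % pagesize;
}
```

For example, with pagesize 8 and pagetable[0] = 2, node index 3 resolves to basepointer + 2*8 + 3.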

The graphics card and the main memory will hold different working sets of the voxel octree because of the larger capacity of the main memory. This implies that they will use different virtual page tables.

As discussed in section 3.1, the voxel octree data is split into the storage data and the traversal data. The disk storage data and the traversal data will use the same virtual page table, but their basepointer will be different since the data will be loaded at different addresses.

3.2.2 Virtualized Sparse Voxel Octree

The structure of the virtualized voxel octree is designed to minimize cache misses, to enable sequential reads from disk storage and to enable incremental writes of the octree to disk storage without having the whole octree loaded into memory. Incremental writes are important because of the massive amount of source data that will be needed. These requirements prevent us from using a trivial breadth-first or depth-first layout. A breadth-first layout would space the children too far apart from their parents in memory. A depth-first layout suffers from a similar problem, but in that case it is the siblings that end up too far apart. Both layouts would cause too many cache misses in the traversal process. The final virtualized voxel octree layout is therefore partly breadth-first and partly depth-first. How this is achieved is explained below.

The octree node hierarchy we use in the virtualization process is created in the voxelization process (see section 3.1.2). After we have built the node hierarchy, we start the process that writes the virtualized octree structures to disk storage. Only the storage data for each node is written to disk. The storage data for the virtualized octree is stored as one big array referred to as the nodearray.

The virtualization process is based on the queue data structure and is incremental in nature. The octree nodes are added to a queue in breadth-first order until the page's maximum capacity is reached. The nodes are then written to disk storage, starting with the first node in the queue. Each queue also has a special subspace node associated with it. The subspace node for each queue is a representation of the octree subspace where all nodes in that queue are located. It is used when a page's maximum capacity is reached. The overflowing nodes are then sorted into one to eight new child queues depending on which subspace each overflowing node is located in relative to the subspace node (see Figure 3.7). The child queues are then pushed on a stack, starting with the subspace with the highest number. The next queue from the top of the stack is popped when a page is full or when a queue is empty. We then continue and process the nodes in that queue. The queues will therefore be processed in depth-first order while the nodes in each queue are processed in breadth-first order (see Figure 3.7). This has the implication that the storage order is sorted by octant subspace; the nodes in the first octant will in general be stored before the nodes in the second octant. The only exception is the breadth-first processing

before the page overflow. The first queue starts with the root node as its first item. The root node is also the subspace node for the first queue.

Figure 3.7: Octree node storage order. A queue is flushed and written to a page when full. Nodes in each page are therefore stored in breadth-first order. Overflowing nodes that did not fit into the page are sorted into eight new queues depending on octant location. The queues are then processed in depth-first order.
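The write order can be sketched schematically as follows. Nodes are written breadth-first from the current queue until the page fills; leftover nodes are then sorted into up to eight child queues by octant, and the queues are processed depth-first via a stack. The node type, the precomputed octant field (standing in for the child subspace of the queue's subspace node that contains the node) and writing ids into an array instead of pages on disk are all simplifications of ours:

```cpp
#include <deque>
#include <stack>
#include <vector>

struct TNode {
    TNode* children[8] = {};  // null when the child is absent
    int octant = 0;           // octant relative to the queue's subspace node
    int id = 0;               // stands in for the node's storage data
};

void writePages(TNode* root, int pagesize, std::vector<int>& order) {
    std::stack<std::deque<TNode*>> st;
    st.push(std::deque<TNode*>{root});
    int inPage = 0;  // nodes written to the current page so far
    while (!st.empty()) {
        std::deque<TNode*> q = std::move(st.top());
        st.pop();
        while (!q.empty()) {
            TNode* n = q.front();
            q.pop_front();
            order.push_back(n->id);            // "write" the node to the page
            for (TNode* c : n->children)
                if (c) q.push_back(c);         // breadth-first within the queue
            if (++inPage == pagesize) {        // page full: start a new page
                inPage = 0;
                std::deque<TNode*> sub[8];     // sort the overflow by octant
                for (TNode* m : q) sub[m->octant].push_back(m);
                for (int i = 7; i >= 0; --i)   // highest octant pushed first,
                    if (!sub[i].empty())       // so the lowest ends up on top
                        st.push(std::move(sub[i]));
                break;                         // continue with the top queue
            }
        }
        // A queue that empties before the page fills lets the next queue
        // from the stack continue filling the same page (inPage is kept).
    }
}
```

Note that inPage is deliberately not reset when a queue empties: as described in the text, the next queue then keeps filling the same page.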

Each queue writes a structure to disk called a SubspaceLink:

SubspaceLink
    subspacedepth (int)
    subspaceplanes (3 x float)
    nodecount (int)
    startnode (int)
    startnodeparent (int)
    startnodeparentchildindex (int)
    startnodedepth (int)
    parent (int)
    children (8 x int)

The SubspaceLink is a representation of an octree subspace. The SubspaceLink structure is created when the queue is popped from the stack. The SubspaceLinks for the whole tree are therefore stored in an array in depth-first order. The variables subspacedepth and subspaceplanes in the SubspaceLink structure are the data for the queue's subspace node. The subspace node - as previously discussed - represents a subspace. The variable nodecount is the number of nodes written from the queue. The value of this variable is at most the page capacity. In some cases it will be lower: if the number of nodes that remain to be processed and are contained in this queue's subspace is less than the maximum page capacity, or if the page is partly filled when we start to process this queue. The four variables following nodecount are all related to the first node in the queue. The variable startnode is the index of the first node, startnodeparent is the index of the parent of the startnode, startnodeparentchildindex is which child index (0-7) the startnode has relative to startnodeparent, and startnodedepth is the depth of the startnode. The depth and the child index are important when we create planes for the traversal data structure from the storage data structure (see section 3.1). The two last variables, parent and the array children, are used for the SubspaceLink hierarchy. The children array holds the indices in the SubspaceLink array where the SubspaceLinks for the queues created when the page overflows are stored (see Figure 3.8). The parent is the index of the parent SubspaceLink (see Figure 3.8).
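Transcribed as a C++ struct, the record looks as follows; the field names follow the text, while the concrete integer type and layout are our assumption:

```cpp
#include <cstdint>

// On-disk SubspaceLink record (field names from the text).
struct SubspaceLink {
    int32_t subspacedepth;              // depth of the queue's subspace node
    float   subspaceplanes[3];          // center planes of the subspace node
    int32_t nodecount;                  // nodes written from this queue (<= page capacity)
    int32_t startnode;                  // index of the first node in the queue
    int32_t startnodeparent;            // index of the first node's parent
    int32_t startnodeparentchildindex;  // which child (0-7) the first node is
    int32_t startnodedepth;             // depth of the first node
    int32_t parent;                     // index of the parent SubspaceLink
    int32_t children[8];                // SubspaceLinks created at page overflow
};
```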

Figure 3.8: SubspaceLink hierarchy.

If a queue is empty before the page is full, then the next queue from the stack will be added to the same page. The nodes in a page therefore have no strict order between themselves. Nodes with big differences in subspace location and depth can be added to the same page. This is a side effect of our requirement that each page should hold the maximum number of nodes. There is no additional data for the pages; each page is just a fixed chunk of the nodearray. The benefit of this is that the page index lookup for a node can be done with a simple division and mod operation (see section 3.2). The only disk storage data for the virtualized octree is the SubspaceLink array and the nodearray.

3.2.3 Run-Time Loading

The run-time loading of the voxel octree is a multistage process. The first stage is to load the disk storage node data from disk, the next stage is to create the traversal node data from the disk storage node data (see section 3.1) and the final stage is to upload the data to video card memory and update the virtual tables. The loading process needs to run in a separate execution thread so the rendering is only stalled when all the data is available and ready. The rendering process will use the new data as soon as the loading thread has uploaded the data to video card memory.

Each run-time version of SubspaceLink is extended with a boolean loaded in addition to the SubspaceLink structure on disk storage (see section 3.2.2). Each page is reference counted since several SubspaceLinks can be stored in one page. The reference count for a page is increased when a SubspaceLink that is stored in the page is loaded from disk and decreased when a SubspaceLink is

unloaded. We can safely unload the page when the reference count reaches zero. This means that all SubspaceLinks stored in that page will have their loaded variable set to false.
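The per-page reference counting can be sketched in a few lines. The class and method names are ours; acquire returns true when the page must be fetched from disk, and release returns true when the page can be unloaded:

```cpp
#include <vector>

// Minimal per-page reference counting, as described above: a page is
// loaded on its first reference and may be unloaded on its last.
class PageRefCounts {
public:
    explicit PageRefCounts(int pagecount) : refs_(pagecount, 0) {}
    bool acquire(int page) { return refs_[page]++ == 0; }  // load on 0 -> 1
    bool release(int page) { return --refs_[page] == 0; }  // unload on 1 -> 0
private:
    std::vector<int> refs_;
};
```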

The SubspaceLink data structure and the nodearray, in combination with the virtualized page system, make it easy to dynamically load and unload the tree depending on the world camera location. The SubspaceLink sections that are located outside the camera frustum do not need to be loaded at all. The desired depth of the nodes for the sections that are inside the frustum depends on how far the surface is from the camera and the desired rendering resolution. The target depth is set so that we have roughly one octree node for each pixel (see Figure 3.9). How this is done is explained further in the rendering section (see section 4).

Figure 3.9: Target depth: one octree node for each pixel.

The octree load algorithm loads all SubspaceLinks that are non-loaded, are inside a given bounding box and have a startdepth less than or equal to a desired tree depth. Therefore, the input parameters to the load algorithm are an axis-aligned bounding box and a desired integer tree depth. First, the algorithm needs to find the pages that it should load from disk to memory. This is done by iterating in depth-first order through all the SubspaceLinks that are inside the bounding box and looking up the page for each non-loaded SubspaceLink. References to the non-loaded SubspaceLinks are cached in an array for later use. For each SubspaceLink, the reference count for the page is increased. If the page reference count is zero before the increment, the page needs to be loaded from disk. The next step is to read in the pages from disk and store them contiguously in a memory buffer (see Figure 3.10). The loading of the pages from

disk is done in sequential order since the pages are in the same order as the iterated SubspaceLinks. We also need to create a buffer (see Figure 3.10) for the traversal data pages. The traversal data pages are created later in the process from the disk pages.

Figure 3.10: Data streaming order. The traversal data is created from the disk data.

We also need to find the traversal pages that are already in memory but will be written to by the load algorithm. Updating already loaded traversal data pages happens in two cases. The first case is when a node already in memory has a child node in a SubspaceLink that will change status from loaded to non-loaded; the parent node in this case needs to write to one of the items in its children array (see Figure 3.11). The second case is when multiple SubspaceLinks have written their nodes to the page, at least one of them is loaded, and at least one of the others will change status from non-loaded to loaded (see Figure 3.11). In this case the traversal page will change since one SubspaceLink will change from non-loaded to loaded and will write data into memory. The pages that will be updated are copied from memory to the same memory buffer we created for the traversal data pages, after we have read the storage data pages from disk. This can be done without stalling the rendering thread since the operation is read-only. Temporary virtual page tables are created for the pages in the storage node buffer and the traversal node buffer since we need to be able to read data from the pages.

Figure 3.11: SubspaceLink copy.

We are ready to create the traversal data structures (see section 3.1) when the storage data has been loaded from disk and the temporary virtual tables are set up. This is done by iterating through the non-loaded SubspaceLinks. Work is saved in the iteration by reusing the array of non-loaded SubspaceLinks that we created before we read the pages from disk. Each non-loaded SubspaceLink creates parent-child links and center planes from the parent nodes in loaded SubspaceLinks to its child nodes (see Figure 3.12). Fortunately, the traversal data for the child nodes is easy to create with the startnodeparent and startnodeparentchildindex variables from the SubspaceLinks.

Figure 3.12: Creation of traversal data.

Starting with startnodeparent and startnodeparentchildindex, we iterate in sequential order over the parent nodes in the loaded SubspaceLink. A data structure named ParentNodeLink is added to a queue for each of the parent node's children that have their subspace in the non-loaded SubspaceLink.

ParentNodeLink
    depth (int)
    planescenter (3 x float)
    parentindex (int)
    parentchildindex (int)

The planescenter and depth are created from the parent node's planescenter and depth. Both variables can be generated from the parent node because all our nodes have their planes in the center of the node. The hierarchy variables

parentindex and parentchildindex are also easy to generate. This parent node iteration continues until the size of the queue has reached nodecount. Then we start to process the ParentNodeLinks in the queue by linking them up with the nodes in the non-loaded SubspaceLink and writing the planescenter and depth to the corresponding traversal structure. Note that the linkup starts with the node with index startnode in the SubspaceLink.
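The derivation of a child's planescenter and depth from its parent can be sketched as follows. Since every node splits at its center, the child's center planes sit a quarter of the parent's edge length away from the parent's planes. The octant bit convention (bit 0 = x, bit 1 = y, bit 2 = z) and rootsize, the edge length of the root cube, are assumptions of ours:

```cpp
// Derive a child's traversal data from its parent. Parent edge length
// is rootsize / 2^parentdepth, so the child's center is offset from the
// parent's center by a quarter of that edge along each axis.
void childPlanes(const float parentplanes[3], int parentdepth,
                 int childindex, float rootsize,
                 float childplanes[3], int* childdepth) {
    float offset = rootsize / (float)(1 << (parentdepth + 2));
    for (int axis = 0; axis < 3; ++axis) {
        float sign = ((childindex >> axis) & 1) ? 1.0f : -1.0f;
        childplanes[axis] = parentplanes[axis] + sign * offset;
    }
    *childdepth = parentdepth + 1;
}
```

For a root cube of edge 8 centered at the origin, child 7 of the root gets its planes at (2, 2, 2) and depth 1.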

As previously mentioned, both the traversal data pages and the storage data pages need to be updated before we can upload the data to video card memory. We also need to update the virtual page table entries for the updated pages. In order to maximize performance, we only synchronize the rendering thread with the new data when it has been uploaded to the video card memory. After synchronization the rendering thread will use the new data.

4 SPARSE VOXEL OCTREE RENDERING

4.1 Integration With a Traditional Pipeline

As discussed in section 1.1.3, our intention is to describe how to use the static octree geometry in the rendering process and how to combine it with other - more conventional - rendering methods. We call this rendering pipeline hybrid rendering. Note that we have not done a complete implementation of this pipeline, but rather present some ideas for how it can be done. The demo program only implements the sparse voxel octree ray casting. This technique might have a use in the not-so-distant future because of its rendering speed for very detailed models. For those models, the logarithmic time complexity of the octree traversal process is significantly faster than the linear time complexity of the indexed triangle mesh format. However, there are some disadvantages with this method. The two main concerns for us are the higher memory requirements and the final rendered image quality. Both of these problems, as well as some ideas for future work, are discussed in section 6.1 and section 6.2.

4.2 Sparse Voxel Octree Ray Casting

Static geometry will - as previously discussed - be rendered in a separate pass by ray casting into the voxel octree. For each pixel, a ray is shot from the camera into the octree (see Figure 4.1).

Figure 4.1: Ray casting.

Each ray is recursively traversed until we have reached a node with the desired depth. The index and the camera space Z-buffer depth for the node are then written to two separate render target buffers. Both buffers are needed when we integrate our static geometry ray casting method with the other rendering methods.

Hidden surface removal can be done by comparing the Z-buffer depth with the Z-buffer depth filled in by the other rendering passes. The Z-buffer depth results from the traversal are also used to determine when to load and unload subspace sections of the octree (see section 3.2.3). In its current form, the load algorithm first checks the node size in world space units for the node that is closest to the camera and then calculates the desired node depth level. The node depth level should be set so the node density in the target buffer is around one node per pixel. After the desired depth level is calculated, we call the function described in section 3.2.3 to load in the missing nodes.
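The target-depth heuristic can be sketched as follows. With a pinhole camera, a cube of edge s at distance z covers roughly s * f / z pixels, where f is the focal length in pixels; solving for one pixel gives the desired node edge z / f. The camera model, the names and the clamping parameter are assumptions of ours, not the exact formula used in the implementation:

```cpp
#include <algorithm>
#include <cmath>

// Pick the octree depth at which a node's projected size is about one
// pixel: desired edge = nearestZ / focalPixels, and halving the edge
// per level gives depth = ceil(log2(rootsize / desiredEdge)).
int desiredDepth(float rootsize, float nearestZ, float focalPixels,
                 int maxdepth) {
    float targetEdge = nearestZ / focalPixels;  // ~1 pixel at nearestZ
    int depth = (int)std::ceil(std::log2(rootsize / targetEdge));
    return std::max(0, std::min(depth, maxdepth));
}
```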

In the final shading pass, all light and material properties needed for a pixel can be accessed through the node data storage (see section 3.1) for the corresponding node index stored in the render target buffer. This is similar to how a deferred renderer (see section 1.1.1) does its shading. This implies that it would require less work to combine the ray cast pass with the rasterization pass if one used a deferred renderer for rasterization. In order to get the shading parameters from the data storage, a virtual page table lookup (see section 3.2) is needed for each node. In fact, this operation might be done multiple times per pixel if the shading model requires several shading passes. A virtual page table lookup is also done several times per ray in the traversal process. Hence, this operation needs to be very fast.

There are still problems left to solve for the ray casting method in order to make it support all the features one would expect from a modern static geometry renderer. Transparency and alpha blending, for example, are not supported in our current implementation. Another issue is that the current generation of graphics processors is primarily built for rasterization; the vertex shader and pixel shader pipelines in those processors are hard to integrate with ray casting. Fortunately, the development is progressing towards more programming flexibility for the graphics processor [20].

5 IMPLEMENTATION AND RESULTS

5.1 Voxelization and Virtualization

The voxelization of the triangle mesh, the virtualization of the voxel octree and the writing of the voxel octree to disk storage are done sequentially. This functionality is implemented in the modeling program Autodesk Maya, which has support for adding custom commands via plugins. Our plugin registers a new command voxelize that can be used within Maya. The parameters to the command voxelize are the filename, the voxelization gridsize and the node count per virtual page (pagesize). When the command is invoked, all triangle meshes in the scene that are inside a predefined cube are processed and their generated voxel octree is written to disk (see Figure 5.1 and Figure 5.2). This cube will also serve as the root node's subspace. Geometry normals for all nodes are calculated from the triangle geometry and the node colors are fetched from the texture map. Each mesh is completely processed and written to disk before processing of the next one starts. This makes it feasible to generate a much larger octree since we only need to have the nodes for one mesh in memory at a time.

Figure 5.1: Voxelization setup. Figure 5.2: Voxelization setup, wireframe view.

5.2 Run-Time Environment

We started early on to build a demo application of the sparse voxel octree rendering to test the algorithms and the theories presented earlier. Figure 5.3 shows the user interface of the demo application.

Figure 5.3: Demo application.

Our implementation is done in C++ in a Windows environment, with Direct3D, Microsoft Foundation Classes (MFC) and the Microsoft Windows API as external libraries. Because of the amount of memory used, the application requires a CPU that supports the x86-64 instruction set. Traversal of the voxel octree was initially done on the CPU in order to simplify debugging. We also did a complete simulation of the SIMD multi-core CPU algorithm to verify its correctness. GPU traversal was implemented shortly after we got the CPU traversal running. For GPU traversal we used the OpenCL language and an NVIDIA card based on the FERMI architecture [20]. The desired tree traversal depth can be adjusted with keyboard buttons. Figure 5.4 and Figure 5.5 show the rendering for traversal depths 6 and 8 respectively.

Figure 5.4: Traversal depth 6. Figure 5.5: Traversal depth 8.

Furthermore, the application has support for rendering out the Z-buffer values. Figure 5.6 and Figure 5.7 show this functionality.

Figure 5.6: Depth buffer near. Figure 5.7: Depth buffer far.

The dynamic streaming of the virtual pages and the run-time building of the traversal data are relatively slow tasks. Doing all this work in a separate thread is necessary to prevent the rendering from stalling. Unfortunately, this does not solve the problem with the noticeable rendering changes that follow when the streaming is finished. This is similar to the LOD popping that occurs when using discrete LOD triangle meshes. Another problem with voxel ray casting is the cubic look (see Figure 5.8) that is a consequence of the shape of the voxels. Some possible approaches to minimize both problems are discussed in section 6.1.

Figure 5.8: Voxel image quality. The individual cubes can be seen inside the red square.

5.3 Results and Discussion

As a performance benchmark we compared the frame rate of our traversal algorithm with an algorithm similar to the one (see section 2.3) explained in [13]. We used two different models for the benchmark: RHINO (see Figure 5.9) and HAIRBALL1 (see Figure 5.10). The HAIRBALL model was chosen because of its shape; the volumetric geometry is quite different from the surface geometry of the RHINO model. One problem with the HAIRBALL model was the lack of vertex texture coordinates; we solved this problem by using a modified version of projective mapping and then applying a color ramp texture.

Figure 5.9: RHINO model. Figure 5.10: HAIRBALL model with color ramp texture.

1Model courtesy of Samuli Laine, http://www.tml.tkk.fi/~samuli/publications/hairball.zip

Another crucial factor to evaluate is the memory requirements of the octree data structures. We calculated the disk memory as well as the run-time memory for both models. Naturally, the HAIRBALL model has much higher memory requirements because of its volumetric geometry. In order for both models to fit into our graphics card memory we used a maximum octree node depth of 9.

The following table shows the measured frame rates as well as the disk and run-time memory sizes:

Table 5.1: Benchmark table for our octree models.

OCTREE MODEL                                   RHINO      HAIRBALL
disk memory size                               1.27 MB    28.71 MB
run-time memory size                           17.56 MB   396.31 MB
frames/sec, our algorithm, 1024x768            19.6 FPS   15.3 FPS
frames/sec, benchmark algorithm, 1024x768      33.4 FPS   21.0 FPS
frames/sec, our algorithm, 1280x720            17.9 FPS   10.8 FPS
frames/sec, benchmark algorithm, 1280x720      27.3 FPS   16.4 FPS

As expected, our traversal algorithm is noticeably slower on our test system. It is our belief that the difference in performance is partly explained by the fact that the benchmark algorithm was developed for the same type of graphics processor as the one we used in the benchmark process. Our algorithm - on the other hand - is designed for a fully programmable graphics processor with a different instruction set. We do not think this is the only reason though; the benchmark algorithm is very effective and will probably perform better than our packet traversal algorithm for this type of application in any case. It would be interesting to implement and test the octree traversal algorithm on a graphics processor with an instruction set similar to the one we assumed when we started the development. Unfortunately, no such graphics processor was available when it was time to implement the algorithm, so we could not do a full performance evaluation.

Another concern is the memory requirements; one fundamental difference between our approach and the one explained in [13] is how the octree is stored on disk. Laine and Karras store the tree hierarchy via offset pointers whereas we store a bytemask. We therefore have to recreate the parent-child integer indices on the fly after reading them from disk. The integer indices make the run-time structures rather big, which is bad because graphics card memory is generally smaller than primary memory. In addition, they lower the data locality in the traversal process and hence lower the performance. The on-the-fly creation of the indices also takes too long for a real-time application. However, offset pointers have a bigger memory footprint than a bytemask. In order to save disk memory, a bytemask can therefore be used in applications where the tree is loaded once and not streamed.

The octree traversal algorithm could also be used in applications other than sparse voxel octree ray casting. In fact, the traversal algorithm is probably at least as good a fit for other applications, since the parallelization is utilized the most in the first tree levels. Sparse voxel octree ray casting traverses very deep, so in the last levels the parallelization naturally has poor utilization; this is not necessarily true for other applications.

6 FUTURE WORK

6.1 Image Quality Improvements

The image quality of the final rendered image is not good enough in our current implementation; it is significantly worse than the output quality of traditional real-time rendering. However, there are some approaches we can try to hopefully bring it up to an acceptable level. We have not done any concrete work on improvements, but we have some ideas on possible approaches. Below is a brief explanation of three techniques that could improve the image quality significantly if done right.

6.1.1 Voxelization Filtering

The basic idea of voxelization filtering is to sample neighbor voxels and do signal processing/filtering on the samples at voxelization time. Ideally, we want to smooth out the BRDF data and the normals between neighbor nodes to reduce the sampling artifacts. A number of different geometry conditions need to be taken into account since the problem is three-dimensional volume sampling rather than the more common two-dimensional image sampling. Fortunately, we can allow this process to be extensive and time consuming since it is done offline as part of the voxelization process.

6.1.2 Spherical Voxel Traversal

Spherical voxel traversal is the idea of replacing the cubes with spheres in the final intersection test, once we have reached the target node depth of the octree traversal. Using spheres instead of cubes will reduce the blocky look that is intimately associated with voxels. However, this might have other drawbacks, and the image quality could very well end up worse than with cubes. The shift from a cube to a sphere can be done entirely as a calculation on the graphics processor, so we do not need to change the tree data structure for this to work. Each sphere will have a radius at least as long as the edge of the voxel. In some cases it might be better to increase the radius so that flat surfaces appear more even. In fact, it should be possible to have a variable sphere radius and store the radius as part of the node disk storage structure (see section 3.1). If the stored radius is over a certain threshold we can treat the sphere as a cube, making the sphere intersection test optional. The radius to use for each sphere can be calculated offline in the voxelization process.
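The leaf test itself is the standard ray/sphere intersection; a sketch follows, assuming the leaf node supplies a center and radius (the function and parameter names are ours):

```python
import math

def ray_sphere_hit(origin, direction, center, radius):
    """Ray/sphere test for the final leaf intersection.

    Solves |O + tD - C|^2 = r^2 for the smallest t >= 0 and returns it,
    or None on a miss.  The direction need not be normalized.
    """
    ox, oy, oz = (origin[i] - center[i] for i in range(3))
    dx, dy, dz = direction
    a = dx * dx + dy * dy + dz * dz
    b = 2.0 * (ox * dx + oy * dy + oz * dz)
    c = ox * ox + oy * oy + oz * oz - radius * radius
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return None                       # ray misses the sphere
    t = (-b - math.sqrt(disc)) / (2.0 * a)
    if t < 0.0:                           # entry point behind the origin
        t = (-b + math.sqrt(disc)) / (2.0 * a)
    return t if t >= 0.0 else None
```

Since this needs only the node's center, a radius, and a handful of multiply-adds, it maps well onto a GPU shader without touching the tree data structure, as argued above.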

This idea resembles in some ways the concept of splatting [21]. There are some differences though, such as splatting’s use of bounding spheres instead of octree nodes.

The technical report [13] introduced a new concept called contours: essentially parallel planes stored in a compressed format for each voxel. The contour concept is similar to the sphere concept, but it has the big advantage of being a much better fit to the original geometry.

6.1.3 Screen Space Post-Processing

Anti-aliasing techniques are frequently used in computer graphics to reduce sampling artifacts. Different types of signal filters are applied to the sampled signal to remove high-frequency components and to improve the reconstruction of the original signal. Similar filters can be applied to our render target buffer when we calculate the final shading. Experimentation in this area will be needed to find the optimal algorithms given computing time and memory restrictions. Our estimate is that the algorithms and filters will be similar to those used in other areas of computer graphics, such as texture sampling filters and blur filters. Given that the arithmetic instruction throughput of graphics processors will increase with time, more and more computing time can be dedicated to this type of filtering. This performance increase, in combination with better algorithms, will hopefully lead to vast improvements in image quality.
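As a toy example of such a screen-space filter, here is a separable box blur on a grayscale buffer (the function and its parameters are our own illustration, not a filter we propose to ship):

```python
def box_blur(buffer, width, height, radius=1):
    """Separable box blur over a grayscale render target.

    buffer is a flat row-major list of floats.  A horizontal pass
    followed by a vertical pass gives the same result as a full 2D box
    filter at a fraction of the cost -- the same separability trick
    applies to the wider blur/reconstruction filters discussed above.
    """
    def pass_1d(src, outer, inner, stride_o, stride_i):
        dst = [0.0] * len(src)
        for o in range(outer):
            for i in range(inner):
                acc = n = 0
                for k in range(-radius, radius + 1):
                    j = i + k
                    if 0 <= j < inner:      # clamp at the image border
                        acc += src[o * stride_o + j * stride_i]
                        n += 1
                dst[o * stride_o + i * stride_i] = acc / n
        return dst

    tmp = pass_1d(buffer, height, width, width, 1)   # horizontal pass
    return pass_1d(tmp, width, height, 1, width)     # vertical pass
```

On a GPU the two passes become two shader dispatches, and the per-pixel loops vanish into the thread grid; the cost then grows linearly with the radius instead of quadratically.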

6.2 Performance Improvements

The current performance of the latest graphics processors is theoretically high enough to allow a 1280x720 render target buffer to be rendered at 60 frames per second, where each ray recurses down to a target node depth of 16. If we assume a 2x or 4x increase in arithmetic instruction throughput for future graphics processors, then it is our estimate that a future hybrid rendering pipeline will be able to render in high resolution at 60 frames per second. Furthermore, the logarithmic time complexity of the recursive traversal means that we will be able to get an increase in geometric detail with a relatively low performance penalty.

Unfortunately, the performance bottleneck lies elsewhere. The biggest hurdles to overcome in order to make ray casting a viable alternative to rasterization for static geometry are the steep memory requirements and the fact that, in our current implementation, the data streaming and the on-the-fly creation of the traversal structures take a relatively long time. Both of these problems and some possible improvements are discussed in this section.

6.2.1 Faster Dynamic Streaming

The streaming of the disk storage data and the subsequent building of the traversal data do not perform as well as we expected. These operations need to complete in a relatively short time frame to avoid sudden large differences in target node depth for the octree traversal. Note that the disk storage streaming and the traversal data building are triggered by minimum Z-depth changes: we use the minimum Z-depth to set the target node depth so that the pixel closest to the camera has a world-space size about the same as one world-space octree node unit. In order to meet this requirement we have to dynamically stream and build the octree nodes. So, if the target node depth changes faster than we can process it, we get a large jump in the traversal target depth when the streaming finishes, which causes visual popping. Ideally, we want small changes so that we can use different filtering techniques to minimize the popping problem. While we believe the algorithms we use can be improved to increase performance, a significant part of the time is spent on streaming the pages from disk storage. Compression of the disk storage data and future storage hardware, such as solid-state drives, will hopefully improve this.
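A back-of-the-envelope sketch of how the target node depth could be derived from the minimum Z-depth. The exact formula and all parameter names here are our assumptions; the text above only states the matching criterion (pixel world size approximately equals node world size):

```python
import math

def target_node_depth(min_z, fov_y, screen_height, octree_world_size, max_depth):
    """Pick the octree traversal depth from the minimum Z-depth.

    The world-space height of one pixel at distance min_z under a
    perspective projection is roughly
        2 * min_z * tan(fov_y / 2) / screen_height,
    and a node at depth d has world-space edge
        octree_world_size / 2**d.
    We descend until the node edge is about one pixel, then clamp to
    the deepest level that has been streamed in.
    """
    pixel_world_size = 2.0 * min_z * math.tan(fov_y / 2.0) / screen_height
    depth = math.ceil(math.log2(octree_world_size / pixel_world_size))
    return max(0, min(depth, max_depth))
```

The formula makes the popping problem visible: because the depth is a ceiling of a logarithm, a smoothly shrinking min_z still produces discrete jumps, and each jump demands that a whole new level of pages has been streamed in beforehand.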

Our current algorithm for building the traversal data (see section 3.2.3) is quite simple; it does its job while requiring only a small amount of data (the SubspaceLink structure). However, its performance is not good enough, and it contributes to the system's low dynamic streaming performance.

Laine and Karras describe in [13] a data layout that has the same hierarchy both on disk and in memory and hence avoids the expensive generation of the in-memory data structures. Another advantage of their method is that they do not need to store the integer indices for the generated hierarchy, which saves graphics card memory and increases the data locality.

6.2.2 Data Compression

Our small test scenes currently require too much memory (see Table 5.1). While they are in range for most high-end systems, the requirements are very big for such small scenes, and consequently it will be hard to scale to larger scenes. Data compression is one way to get the memory requirements down. For the geometry normals we can use fixed-point arithmetic and store 16-bit integer variables instead of 32-bit floating point variables. Another interesting normal compression scheme is introduced in [13]. For the BRDF data, such as diffuse color and specularity, we can use delta encoding and then apply entropy encoding (for instance Huffman encoding) to the delta values. Using delta encoding for the voxel octree [11] results in a tree structure where the root node is the first node (see Figure 6.1). Because of the regularity of the data and the generally small differences between the nodes, the data can probably be compressed at a high compression ratio.
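The two compression ideas can be sketched as follows (the function names are ours, and the entropy-coding stage on the deltas is omitted):

```python
def quantize_normal(n):
    """Pack a unit normal into three 16-bit signed fixed-point values.

    Each component in [-1, 1] maps to [-32767, 32767]; this halves the
    storage of a 3x32-bit float normal at a precision of about 3e-5.
    """
    return tuple(int(round(c * 32767)) for c in n)

def dequantize_normal(q):
    return tuple(c / 32767.0 for c in q)

def delta_encode(values):
    """Delta-encode one color channel along the tree traversal order:
    the first value is stored raw, the rest as differences.  The small,
    regular deltas are a good input for an entropy coder such as
    Huffman coding.
    """
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out
```

For the Figure 6.1 example, the blue channel [0, 2, 16] delta-encodes to [0, 2, 14], and the round trip is exact; only the quantization step is lossy.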

Figure 6.1: Delta compression. The root node stores RGB (255, 255, 0); its two children store the delta values 2 and 16, which decode to RGB (255, 255, 2) and RGB (255, 255, 16).

Unfortunately, even a 10:1 compression ratio will not be enough for voxel ray casting to become a viable alternative to current real-time rendering techniques for anything but small scenes. Large scenes will require several orders of magnitude more data than our current scene, and we think it is unlikely that disk storage capacity, main memory and video card memory will grow as fast as needed in the foreseeable future.

7 CONCLUSION

Virtualized sparse voxel octree ray casting is an interesting technique for achieving detailed, unique geometry. It is clear that the technique has some merits, but it is also clear that it has a long way to go before it can be considered a viable option for static geometry rendering. Memory capacity - both primary and secondary - has to increase by several orders of magnitude, and graphics processor performance has to increase as well. While the ray casting is close to achieving real-time performance, it does so by not using anti-aliasing techniques or other types of filtering. The inherent blockiness of the octree nodes is also a problem that needs to be addressed. Laine and Karras reduce this problem with their use of contours, but the result is still not as smooth as detailed triangles or surface patches. In order for sparse voxel octree rendering to be considered over rasterization, it has to achieve at least the same level of image quality, which is currently not the case.

Perhaps the biggest hurdle to overcome is the distribution of content that uses sparse voxel octrees. Blu-ray is currently the optical format with the highest storage capacity; it stores 50 GB of data on one disc, which will probably not be enough for a game that uses sparse voxel octrees. Internet distribution might be an alternative, but for that to happen the bandwidth to end consumers has to increase.

Despite this negative prediction, we believe voxel ray casting can be a valid technique for some applications. Its biggest advantage is the real-time rendering speed for high-detail geometry, a property that follows from the logarithmic time complexity of the octree traversal. It is hard to predict what use this will have in the near future, but we think a useful application for voxel ray casting is likely to emerge within a five-year time span.

There are numerous ways one can improve our current implementation. One important improvement is to remove the expensive creation of the large run- time structures from the disk structures. Another possible improvement is to introduce a compression scheme. Both of these changes will make the memory footprint smaller and will decrease the number of cache misses in the traversal process.

To summarize, we achieved our goal of a working solution that gave us insight into the challenges and possibilities of the technique.

A APPENDIX

A.1 Pseudocode

Below is pseudocode for the single ray octree traversal algorithm (see section 2.1).

Algorithm A.1.1: Single Ray Octree Traversal, Part 1

 1: TRACERAY(Ray = (O, D), stopdepth)
 2:   node ← root                                      ▷ the root node
 3:   if ¬RAY-AABB-INTERSECT(Ray, node, t0, t1) then
 4:     return NIL
 5:   else
 6:     a ← O + t0 D
 7:     b ← O + t1 D                                   ▷ intersection points
 8:     â ← M a
 9:     b̂ ← M b                                        ▷ transform to voxel space
10:     STACK-PUSH(stack, node, depth = 0, tstart = 0, tend = 1)
11:     while ¬STACK-EMPTY(stack) do
12:       (node, depth, tin, tout) ← STACK-POP(stack)
13:       if depth ≥ stopdepth then
14:         return node
15:       else
16:         p̂in ← (1 − tin) â + tin b̂
17:         p̂out ← (1 − tout) â + tout b̂
18:         A ← SIGN-MASK(n · p̂in + d)
19:         B ← SIGN-MASK(n · p̂out + d)
20:         C ← A XOR B
21:         tx ← ty ← tz ← FLT_MAX
22:         isc ← 0                                    ▷ plane intersection counter
23:         if INTERSECTION-BIT-SET(Πx=d, C) then
24:           tx ← LINE-PLANE-INTERSECT(Πx=d, p̂in, p̂out)
25:           isc ← isc + 1
26:         if INTERSECTION-BIT-SET(Πy=d, C) then
27:           ty ← LINE-PLANE-INTERSECT(Πy=d, p̂in, p̂out)
28:           isc ← isc + 1
29:         if INTERSECTION-BIT-SET(Πz=d, C) then
30:           tz ← LINE-PLANE-INTERSECT(Πz=d, p̂in, p̂out)
31:           isc ← isc + 1
32:         WRITE-SORT-BITS(Pmin, Plast)
33:         PUSH-CHILD-NODES(node, A, B, Pmin, Plast, depth, isc)
34:   return NIL

Algorithm A.1.2: Single Ray Octree Traversal, Part 2

 1: PUSH-CHILD-NODES(node, A, B, Pmin, Plast, depth, isc)
 2:   if isc = 3 then
 3:     child ← GETCHILD(node, B)
 4:     if child ≠ NIL then
 5:       STACK-PUSH(stack, child, depth + 1, tthird, tout)
 6:     child ← GETCHILD(node, B XOR Plast)
 7:     if child ≠ NIL then
 8:       STACK-PUSH(stack, child, depth + 1, tsecond, tthird)
 9:     child ← GETCHILD(node, A XOR Pfirst)
10:     if child ≠ NIL then
11:       STACK-PUSH(stack, child, depth + 1, tfirst, tsecond)
12:     child ← GETCHILD(node, A)
13:     if child ≠ NIL then
14:       STACK-PUSH(stack, child, depth + 1, tin, tfirst)
15:   else if isc = 2 then
16:     child ← GETCHILD(node, B)
17:     if child ≠ NIL then
18:       STACK-PUSH(stack, child, depth + 1, tsecond, tout)
19:     child ← GETCHILD(node, A XOR Pfirst)
20:     if child ≠ NIL then
21:       STACK-PUSH(stack, child, depth + 1, tfirst, tsecond)
22:     child ← GETCHILD(node, A)
23:     if child ≠ NIL then
24:       STACK-PUSH(stack, child, depth + 1, tin, tfirst)
25:   else if isc = 1 then
26:     child ← GETCHILD(node, B)
27:     if child ≠ NIL then
28:       STACK-PUSH(stack, child, depth + 1, tfirst, tout)
29:     child ← GETCHILD(node, A)
30:     if child ≠ NIL then
31:       STACK-PUSH(stack, child, depth + 1, tin, tfirst)
32:   else
33:     child ← GETCHILD(node, A)
34:     if child ≠ NIL then
35:       STACK-PUSH(stack, child, depth + 1, tin, tout)
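For reference, the RAY-AABB-INTERSECT helper used in Algorithm A.1.1 can be written as a standard slab test [17]. This is our own sketch, not necessarily the implementation used in the thesis:

```python
def ray_aabb_intersect(origin, direction, box_min, box_max):
    """Slab test: clip the ray O + tD against the three axis-aligned
    slab pairs of the box.

    Returns (t0, t1) with the entry and exit parameters if the ray
    hits the box for some t in [0, inf), otherwise None.
    """
    t0, t1 = 0.0, float("inf")
    for axis in range(3):
        if direction[axis] != 0.0:
            inv = 1.0 / direction[axis]
            near = (box_min[axis] - origin[axis]) * inv
            far = (box_max[axis] - origin[axis]) * inv
            if near > far:
                near, far = far, near
            t0, t1 = max(t0, near), min(t1, far)
            if t0 > t1:
                return None               # slab intervals do not overlap
        elif not (box_min[axis] <= origin[axis] <= box_max[axis]):
            return None                   # ray parallel to and outside the slab
    return (t0, t1)
```

The returned (t0, t1) pair corresponds directly to the entry and exit points a and b computed in steps 6-7 of Algorithm A.1.1.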

Below is pseudocode for the parallel ray octree traversal algorithm (see section 2.2). The operations whose names start with VEC- use SIMD-type instructions, as explained in section 2.2.

Algorithm A.1.3: N Rays Parallel Octree Traversal, Part 1

 1: TRACE-N-RAYS(Rays[n] = (O, D[n]), stopdepth)
 2:   node ← root                                      ▷ the root node
 3:   S ← 0xFFFFFFFF
 4:   if ¬VEC-RAY-AABB-INTERSECT(Rays[n], node, t0[n], t1[n]) then
 5:     S[i] ← 0,  i = 1, ..., n
 6:   if S = 0 then
 7:     return
 8:   else
 9:     a[n] ← O + t0[n] D[n],  b[n] ← O + t1[n] D[n]  ▷ intersection points
10:     â[n] ← M a[n],  b̂[n] ← M b[n]                  ▷ transform to voxel space
11:     STACK-PUSH(stack, node, depth = 0, tstart[n] = 0, tend[n] = 1, S)
12:     while ¬STACK-EMPTY(stack) do
13:       (node, depth, tin[n], tout[n], R) ← STACK-POP(stack)
14:       R ← R AND S
15:       if depth ≥ stopdepth then
16:         VEC-UPDATE-RESULT(node, R)
17:         S ← S XOR R
18:         if S = 0 then
19:           return
20:         continue
21:       p̂in[n] ← (1 − tin[n]) â[n] + tin[n] b̂[n]
22:       p̂out[n] ← (1 − tout[n]) â[n] + tout[n] b̂[n]
23:       P1 ← VEC-SIGN-MASK(nx · p̂in[n] + d)
24:       P2 ← VEC-SIGN-MASK(ny · p̂in[n] + d)
25:       P3 ← VEC-SIGN-MASK(nz · p̂in[n] + d)
26:       P4 ← VEC-SIGN-MASK(nx · p̂out[n] + d)
27:       P5 ← VEC-SIGN-MASK(ny · p̂out[n] + d)
28:       P6 ← VEC-SIGN-MASK(nz · p̂out[n] + d)
29:       Ix ← P1 XOR P4
30:       Iy ← P2 XOR P5
31:       Iz ← P3 XOR P6
32:       tx[n] ← ty[n] ← tz[n] ← FLT_MAX
33:       tx[n] ← VEC-LINE-PLANE-INTERSECT(Πx=d, p̂in[n], p̂out[n], Ix)
34:       ty[n] ← VEC-LINE-PLANE-INTERSECT(Πy=d, p̂in[n], p̂out[n], Iy)
35:       tz[n] ← VEC-LINE-PLANE-INTERSECT(Πz=d, p̂in[n], p̂out[n], Iz)
36:       COMPARE-RAYS()
37:   return

Algorithm A.1.4: N Rays Parallel Octree Traversal, Part 2

 1: COMPARE-RAYS()
 2:   tmin[n] ← min(tx[n], ty[n], tz[n])
 3:   tmax[n] ← max(tx[n], ty[n], tz[n])
 4:   Pfirst[n] ← 4
 5:   Plast[n] ← 1
 6:   Q ← VEC-COMPARE-EQUAL(tmax[n], ty[n])
 7:   VEC-WRITE-INT(2, Plast[n], Q)
 8:   Q ← VEC-COMPARE-EQUAL(tmax[n], tx[n])
 9:   VEC-WRITE-INT(4, Plast[n], Q)
10:   Q ← VEC-COMPARE-EQUAL(tmin[n], ty[n])
11:   VEC-WRITE-INT(2, Pfirst[n], Q)
12:   Q ← VEC-COMPARE-EQUAL(tmin[n], tz[n])
13:   VEC-WRITE-INT(1, Pfirst[n], Q)
14:   while R ≠ 0 do
15:     i ← BITSCAN(R)
16:     C1[n] ← P1[i]
17:     C2[n] ← P2[i]
18:     C3[n] ← P3[i]
19:     C4[n] ← P4[i]
20:     C5[n] ← P5[i]
21:     C6[n] ← P6[i]
22:     Cfirst[n] ← Pfirst[i]
23:     Clast[n] ← Plast[i]
24:     Q ← R
25:     Q ← VEC-COMPARE-EQUAL-AND(C1[n], P1[n], Q)
26:     Q ← VEC-COMPARE-EQUAL-AND(C2[n], P2[n], Q)
27:     Q ← VEC-COMPARE-EQUAL-AND(C3[n], P3[n], Q)
28:     Q ← VEC-COMPARE-EQUAL-AND(C4[n], P4[n], Q)
29:     Q ← VEC-COMPARE-EQUAL-AND(C5[n], P5[n], Q)
30:     Q ← VEC-COMPARE-EQUAL-AND(C6[n], P6[n], Q)
31:     Q ← VEC-COMPARE-EQUAL-AND(Cfirst[n], Pfirst[n], Q)
32:     Q ← VEC-COMPARE-EQUAL-AND(Clast[n], Plast[n], Q)
33:     R ← R XOR Q
34:     PUSH-CHILD-NODES(i, Q)

Algorithm A.1.5: N Rays Parallel Octree Traversal, Part 3

 1: PUSH-CHILD-NODES(i, Q)
 2:   intersections[n] ← 0
 3:   VEC-ADD-INT(1, intersections[n], Ix)
 4:   VEC-ADD-INT(1, intersections[n], Iy)
 5:   VEC-ADD-INT(1, intersections[n], Iz)
 6:   ▷ build A from P1[i] ... P3[i] and B from P4[i] ... P6[i]
 7:   if intersections[i] = 3 then
 8:     child ← GETCHILD(node, B)
 9:     if child ≠ NIL then
10:       STACK-PUSH(stack, child, depth + 1, tthird[n], tout[n], Q)
11:     child ← GETCHILD(node, B XOR Plast[i])
12:     if child ≠ NIL then
13:       STACK-PUSH(stack, child, depth + 1, tsecond[n], tthird[n], Q)
14:     child ← GETCHILD(node, A XOR Pfirst[i])
15:     if child ≠ NIL then
16:       STACK-PUSH(stack, child, depth + 1, tfirst[n], tsecond[n], Q)
17:     child ← GETCHILD(node, A)
18:     if child ≠ NIL then
19:       STACK-PUSH(stack, child, depth + 1, tin[n], tfirst[n], Q)
20:   else if intersections[i] = 2 then
21:     child ← GETCHILD(node, B)
22:     if child ≠ NIL then
23:       STACK-PUSH(stack, child, depth + 1, tsecond[n], tout[n], Q)
24:     child ← GETCHILD(node, A XOR Pfirst[i])
25:     if child ≠ NIL then
26:       STACK-PUSH(stack, child, depth + 1, tfirst[n], tsecond[n], Q)
27:     child ← GETCHILD(node, A)
28:     if child ≠ NIL then
29:       STACK-PUSH(stack, child, depth + 1, tin[n], tfirst[n], Q)
30:   else if intersections[i] = 1 then
31:     child ← GETCHILD(node, B)
32:     if child ≠ NIL then
33:       STACK-PUSH(stack, child, depth + 1, tfirst[n], tend[n], Q)
34:     child ← GETCHILD(node, A)
35:     if child ≠ NIL then
36:       STACK-PUSH(stack, child, depth + 1, tin[n], tfirst[n], Q)
37:   else
38:     child ← GETCHILD(node, A)
39:     if child ≠ NIL then
40:       STACK-PUSH(stack, child, depth + 1, tin[n], tout[n], Q)
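To make the VEC- notation concrete, here is a scalar emulation of two of the packed operations (our own illustration; a real implementation would map each function to a short SIMD instruction sequence, e.g. a packed compare followed by a move-mask):

```python
def vec_sign_mask(values):
    """Scalar stand-in for VEC-SIGN-MASK: bit i of the result is set
    iff values[i] is negative, i.e. which side of a plane each lane's
    point lies on.  One SIMD move-mask instruction does this for all
    lanes at once.
    """
    mask = 0
    for i, v in enumerate(values):
        if v < 0.0:
            mask |= 1 << i
    return mask

def vec_compare_equal(a, b):
    """Scalar stand-in for VEC-COMPARE-EQUAL: bit i of the result is
    set iff lane i of a equals lane i of b.
    """
    mask = 0
    for i in range(len(a)):
        if a[i] == b[i]:
            mask |= 1 << i
    return mask
```

With the per-lane results packed into an integer mask like this, the R, S and Q bookkeeping in Algorithms A.1.3-A.1.5 reduces to plain bitwise AND/XOR operations on scalars.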

BIBLIOGRAPHY

[1] J. T. Kajiya, “The rendering equation,” SIGGRAPH Comput. Graph., vol. 20, no. 4, pp. 143–150, 1986.

[2] T. Akenine-Möller, E. Haines, and N. Hoffman, Real-Time Rendering, 3rd Edition. Natick, MA, USA: A. K. Peters, Ltd., 2008.

[3] AMD, “Medical visualization on the GPU in real-time.” http://developer.amd.com/samples/demos/pages/MedicalVisualization.aspx.

[4] D. Shreiner, M. Woo, J. Neider, and T. Davis, OpenGL(R) Programming Guide: The Official Guide to Learning OpenGL(R), Version 2 (5th Edition), pp. 10–14. Addison-Wesley Professional, 2005.

[5] L. Szirmay-Kalos and T. Umenhoffer, “Displacement mapping on the GPU - State of the Art,” Computer Graphics Forum, vol. 27, no. 1, pp. 1567–1592, 2008.

[6] R. L. Cook, L. Carpenter, and E. Catmull, “The Reyes image rendering architecture,” SIGGRAPH Comput. Graph., vol. 21, no. 4, pp. 95–102, 1987.

[7] A. S. Glassner, ed., An introduction to ray tracing. London, UK: Academic Press Ltd., 1989.

[8] H. W. Jensen, Realistic Image Synthesis Using Photon Mapping. Natick, MA, USA: A. K. Peters, Ltd., 2009.

[9] NVIDIA, “NVIDIA OptiX ray tracing engine.” http://developer.nvidia.com/object/optix-home.html.

[10] D. Pohl, “Quake Wars gets ray traced.” http://software.intel.com/en-us/articles/quake-wars-gets-ray-traced/.

[11] J. Olick, “Current generation parallelism in games.” http://s08.idav.ucdavis.edu/olick-current-and-next-generation-parallelism-in-games.pdf.

[12] J. van Waveren, “id Tech 5 challenges: From texture virtualization to massive parallelization.” http://s09.idav.ucdavis.edu/talks/05-JP_id_Tech_5_Challenges.pdf.

[13] S. Laine and T. Karras, “Efficient sparse voxel octrees – analysis, extensions, and implementation,” NVIDIA Technical Report NVR-2010-001, NVIDIA Corporation, Feb. 2010.

[14] C. Crassin, F. Neyret, S. Lefebvre, and E. Eisemann, “GigaVoxels: ray-guided streaming for efficient and detailed voxel rendering,” in Proceedings of the 2009 Symposium on Interactive 3D Graphics and Games, I3D ’09, (New York, NY, USA), pp. 15–22, ACM, 2009.

[15] J. Amanatides and A. Woo, “A fast voxel traversal algorithm for ray tracing,” in Eurographics ’87, pp. 3–10, 1987.

[16] M. Abrash, “A first look at the Larrabee New Instructions (LRBni).” http://software.intel.com/sites/billboard/archive/larrabee-new-instructions.php.

[17] C. Ericson, Real-Time Collision Detection (The Morgan Kaufmann Series in Interactive 3-D Technology), pp. 175–177. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2004.

[18] B. T. Phong, “Illumination for computer generated pictures,” Commun. ACM, vol. 18, pp. 311–317, June 1975.

[19] E. Eisemann and X. Décoret, “Fast scene voxelization and applications,” in ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games, pp. 71–78, ACM SIGGRAPH, 2006.

[20] NVIDIA, “Fermi compute architecture.” http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf.

[21] S. Rusinkiewicz and M. Levoy, “QSplat: a multiresolution point rendering system for large meshes,” in Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’00, (New York, NY, USA), pp. 343–352, ACM Press/Addison-Wesley Publishing Co., 2000.
