
University of Calgary PRISM: University of Calgary's Digital Repository

Graduate Studies The Vault: Electronic Theses and Dissertations

2018-09-21 Spatial Partitioning for Distributed Path-Tracing Workloads

Hornbeck, Haysn

Hornbeck, H. (2018). Spatial Partitioning for Distributed Path-Tracing Workloads (Unpublished master's thesis). University of Calgary, Calgary, AB. doi:10.11575/PRISM/33077 http://hdl.handle.net/1880/108724 master thesis

University of Calgary graduate students retain copyright ownership and moral rights for their thesis. You may use this material in any way that is permitted by the Copyright Act or through licensing that has been assigned to the document. For uses that are not allowable under copyright legislation or licensing, you are required to seek permission. Downloaded from PRISM: https://prism.ucalgary.ca

UNIVERSITY OF CALGARY

Spatial Partitioning for Distributed Path-Tracing Workloads

by

Haysn Hornbeck

A THESIS

SUBMITTED TO THE FACULTY OF GRADUATE STUDIES

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE

DEGREE OF MASTER OF SCIENCE

GRADUATE PROGRAM IN

CALGARY, ALBERTA

SEPTEMBER, 2018

© Haysn Hornbeck 2018

Abstract

The literature on path tracing has rarely explored distributing workload using distinct spatial partitions. This thesis corrects that by describing seven algorithms which use Voronoi cells to partition scene data. They were tested by simulating their performance with real-world data, and fitting the results to a model of how such partitions should behave. Analysis shows that image-centric partitioning outperforms the other algorithms, with a few exceptions, and that restricting Voronoi centroid movement leads to more efficient algorithms. The restricted algorithms also demonstrate excellent scaling properties. Potential refinements are discussed, such as voxelization and locality, and the tested algorithms are worth further exploration. The details of an implementation are outlined, as well.

Acknowledgements

U. R. Alim, for extensive advice on testing and this thesis.

Keynan Pratt, for introducing me to the Vivaldi algorithm. While I was working on spatial partitioning of path tracing before that moment, I lacked an algorithm for load balancing and was unsure of how to draw the bounds between nodes in the system. He didn’t intend to solve either problem while presenting Vivaldi, but I quickly realized its potential. It was later outperformed by other algorithms that I developed on my own, but his action still saved me substantial time and effort.

Mea Wang, for allowing me to outline an implementation of this system as homework in one of her courses.

Table of Contents

Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Symbols
1 Introduction
1.1 Motivation
1.2 Methodology Overview
1.3 Summary of Contributions
1.4 Thesis Organization
2 Related Work
2.1 Projection and Rasterization
2.2 Ray Tracing and Radiosity
2.3 Path Tracing
2.4 Backwards Ray Tracing and Photon Mapping
2.5 Acceleration Structures
2.6 Load Balancing
3 Spatial Partitioning
3.1 System Overview
3.1.1 The Vivaldi Algorithm
3.2 Voronoi Cell Partitions
3.2.1 Voronoi Diagrams
3.2.2 Adjusting Partitions
3.3 Challenges to Spatial Partitioning
3.3.1 Unstable Partitions
3.3.2 Network Bandwidth
3.3.3 Unbiased Pixel Sampling
3.3.4 Algorithm Tuning
3.3.5 Alternatives to Vivaldi
3.3.6 Initial Node Placement
3.3.7 Node Movement Restrictions
3.3.8 Damping
3.3.9 Bundling Ray Data
4 Simulation and Results
4.1 Simulation Methodology
4.2 Behavioural Metrics
4.2.1 Statistical Modelling
4.2.2 Ray Collisions
4.2.3 Network Transmissions
4.2.4 Node Movement
4.2.5 Memory Cost

4.3 Key Metrics
4.3.1 Generating the Posterior Distribution
4.4 Results
4.4.1 Collisions
4.4.2 Test Scenes
4.4.3 Network Transmission
4.4.4 Node Position
4.5 Comparisons to Other Techniques
4.5.1 Visual Partitioning
5 Future Work and Conclusion
5.1 Summary
5.2 Local Algorithms
5.3 Additional Algorithms
5.4 Voxelization
5.5 Conclusions
Bibliography
A Additional Figures
B Implementation Overview
B.1 Requirements
B.2 Operating Environment
B.2.1 Master-Client vs. Peer-to-Peer
B.2.2 Network Overview
B.2.3 Nodes
B.3 Scenes and Spaces
B.3.1 Scene Log
B.3.2 Data Management
B.3.3 Security
B.3.4 Backup
B.3.5 Edit and Render Modes
B.3.6 Consensus
B.4 Rendering
B.4.1 Render Pools
B.4.2 Consolidating Renders
C Network Packets
C.1 Status Messages
C.2 Public Keys
C.3 Scene Logs
C.4 Data
C.5 Render Pools
C.6 Rendering

List of Tables

3.1 Calculating certainties for Figure 3.5. Numbers have been rounded for presentation.
3.2 Repositioning Node 1, from Figure 3.5.
3.3 Bandwidth accounting for path segments.

4.1 Fitness to Model 2, collisions, by algorithm. Based on a subsample; some values are rounded for presentation.
4.2 Fitness to Model 3, network transmissions, by algorithm. The camera axis algorithm data is from a subsample, and the “free” algorithms are initialized with radial camera axis. Values are rounded for presentation.
4.3 The median, 16th and 84th percentiles from Model 5’s posterior, by algorithm.

A.1 Statistics for each of the seven scenes used in this paper.
A.2 The median, 16th and 84th percentiles from Model 2’s posterior, by algorithm. Based on a random subsample.
A.3 Select median, 16th and 84th percentiles from Model 3’s posterior, by algorithm.

List of Figures and Illustrations

1.1 The correlation between supercomputer performance and number of cores, for the 500 fastest supercomputers between June 1995 and November 2017.
1.2 The progress of computing speed, in terms of floating point operations per second per computing core, for the fastest supercomputers between June 1995 and November 2017. Purple dots represent individual supercomputers. Updates are done twice a year, each update lists 500 supercomputers, and supercomputers likely exist in multiple updates. Random noise has been introduced into the time axis for illustrative purposes. Note the use of logarithmic scale on the y axis.

2.1 The projection and rasterization approach to generating images. See the text for details.
2.2 A demonstration of the effects of global illumination, or light which is not directly reflected or refracted into the camera. The left image lacks any global illumination. All surfaces are perfect Lambert radiators.
2.3 A comparison of path and ray tracing.
2.4 The three most common acceleration structures for path tracing. See the text for details.

3.1 A basic flow chart of how ray data may flow in a distributed spatial partitioning system.
3.2 A search example for a network established via the Vivaldi algorithm. The red server is the one initiating the data query, the gold servers are the peers it is aware of, the blue “X” the eventual destination, and the remainder are peers known to other servers but unknown to the originating one.
3.3 A failed search example, for when servers lack perfect knowledge of the system. Here, the red server mistakenly thinks it is the closest server to the target, because it is closer to the target than any of the peers it is aware of, again in gold. If it had perfect knowledge of all other servers, in black, it would have forwarded the request on to one of them and it would reach the closest server, again in blue.
3.4 Two examples of Voronoi diagrams. Both use the Euclidean norm.
3.5 An in-depth example of how modified Vivaldi works, using a set of thirteen servers. The large numbers next to each server are proportional to their workload. The current server is in pink, its peers are in gold, and all other servers are black.
3.6 The new position of the current server is a dark blue circle, and has been adjusted to fit within the convex hull. The current server is in pink, its peers are in gold, and all other servers are black. The unadjusted new position is the lighter blue circle.
3.7 A comparison of stable and unstable balanced partitions, where each is responsible for an equal number of collisions; spheres denote Voronoi cell centroids, and “balanced” implies all partitions have equal workloads. Small perturbations in server positions cause significant boundary changes in the unstable type and induce large swings in workload. This can prevent the system from finding a stable, persistent partition.

3.8 The density of ray segment collisions, as a function of camera location. The vertical axis represents the distance along the central axis of the camera, the horizontal the distance from this axis. Darker regions indicate more collisions, and gamma correction has been applied to make slight interactions more visible. The source is raw data from five of the scenes in Figure 4.1.
3.9 Three methods for spatially partitioning a unit cube into Voronoi cells. Clockwise from top: unrestricted or “free”, planar camera-axis, and radial camera-axis. Cell centroids are marked by red spheres.

4.1 The seven test scenes used in this paper. From top to bottom, left to right: the Blender 2.77 splash scene, by Pokedstudio; “Class room,” by Christophe Seux; a toy helicopter by “vklidu;” a benchmark scene from “Cosmos Laundromat,” by the Blender Institute; Mike Pan’s BMW benchmark scene; “Barcelona Pavillion,” by eMirage; and the Blender 2.74 splash scene, by Manu Jarvinen. All are available from the Blender Demo Scenes web page.
4.2 An overview of Model 2, and its key parameters. See the text for details.
4.3 An overview of Model 3, and its key parameters. See the text for details.
4.4 The performance of each of the unrestricted algorithms, as measured by efficiency (see text), across all four initial node placement algorithms. The middle tic is the median. This chart is based on a subsample of the full data set.
4.5 The posterior distributions of Model 2, for each algorithm, drawn from the data set. Magenta lines correspond to maxima, green lines correspond to minima, and grey areas represent the area between the 16th and 84th percentiles of both extremes. The number of nodes is fixed to three, and the corresponding ideal line of convergence is drawn in black. All charts use the same scale. See the text for analysis.
4.6 The performance of each algorithm, as measured by efficiency (see text), with the unrestricted algorithms using the radial camera axis initial placement. Based on a subsample of the original data set.
4.7 The performance of the two radial algorithms, Vivaldi, and No-backtracking, as measured by ray collision efficiency (see text), when the number of nodes in the system is fixed. The radial algorithm data is from a subsample.
4.8 The performance of the two radial algorithms, Vivaldi, and No-backtracking, as measured by efficiency (see text), when the damping amount is fixed. The radial algorithm data is from a subsample.
4.9 The performance of the two radial algorithms, Vivaldi, and No-backtracking, as measured by efficiency (see text), when the size of the ray bin is fixed. The radial algorithm data is from a subsample.
4.10 The performance of the two radial algorithms, as measured by efficiency (see text), when the number of bins to track collisions is fixed. The radial algorithm data is from a subsample.
4.11 The performance of all algorithms, as measured by efficiency (see text), for each of the test scenes. The camera axis algorithm data is from a subsample.
4.12 The evolution of node positions for 300 runs each of the two radial camera axis algorithms, with the number of nodes fixed at five. Time is represented as a linear scaling factor away from the origin.

4.13 The evolution of node positions for many runs each of the four unrestricted algorithms, with the radial initial conditions and number of nodes fixed at five. The view is orthographic, with the viewing plane perpendicular to the camera axis, and each uses the same scale. The scene used to generate the data was “bmw27,” and the number of samples used varied; both Swarming and No Backtracking used 300 simulation runs randomly drawn from the full data set, while Vivaldi and No Certainty only had 34 and 58 suitable runs in total, respectively.
4.14 A comparison of the six algorithm variants for visual partitioning. See the text for details.

A.1 The main patch used to capture ray data from the Cycles rendering engine.
A.2 The performance of each of the unrestricted algorithms, as measured by efficiency and three variables from Model 2 (half-span, offset, and error, see text for details), across all four initial node placement algorithms.
A.3 A corner plot of the posterior for Model 2, with ray collisions as the metric under consideration.
A.4 The performance of Swarming, as measured by efficiency (see text), when the number of nodes in the system is fixed but the initial configuration of nodes is varied.
A.5 The performance of the two radial algorithms, Vivaldi, and No-backtracking, as measured by efficiency (see text), when the number of nodes is fixed at 3 and 13, respectively, for a range of damping values. The radial algorithms used a random subsample.
A.6 The posterior distributions of Model 3, for each algorithm, drawn from the data set. Magenta lines correspond to the waterline, green lines correspond to the 16th and 84th percentiles of both extremes, and grey areas represent posterior density (with darker being more certain). The number of nodes is fixed to three. All charts use the same scale, but the y axis is logarithmic. See the text for analysis.

B.1 Space filling curves can be used to map multi-dimensional spaces onto a one-dimensional space, and by extension to partition them in a somewhat spatially-compact manner.
B.2 A block diagram of the rendering process. See the text for details.

List of Symbols, Abbreviations and Nomenclature

albedo: The proportion of outgoing to incoming light of a surface, integrated across all possible viewing angles.

API: Application Programming Interface

BRDF: Bidirectional Reflectance Distribution Function

BSDF: Bidirectional Scattering Distribution Function

BTDF: Bidirectional Transmittance Distribution Function

BVH: Bounding Volume Hierarchy

CPU: Central Processing Unit

GPU: Graphics Processing Unit

IBM: International Business Machines

irradiance: The illumination which is incident on a surface, per unit of surface area.

norm: A function which returns a positive scalar when handed a non-zero vector, and zero for a zero vector. It must also obey the triangle inequality (f(~a + ~b) ≤ f(~a) + f(~b), for any norm f) and be absolutely scalable (f(s·~a) = |s|·f(~a)).

OpenGL: Open Graphics Library

radiance: The illumination which is reflected outwards from a surface, per unit of surface area.

TFLOPS: Trillions of Floating Point Operations per Second

U of C: University of Calgary

USB: Universal Serial Bus

Chapter 1

Introduction

In the last quarter of the 19th century, the Harvard Observatory held a large data set of astronomical observations left unprocessed by Joseph Winlock's untimely death, but lacked the funds to process it. Winlock's eldest daughter, Anna, was tasked with providing for her four siblings and mother. She approached the Observatory with a proposition: she would process the data for Harvard, at a rate cheaper than any man would accept[Gri13].

This small historical note marked the approximate beginning of a revolution in scientific computing, one that occurred before computers were machines. Photography was revolutionizing astronomy; the Harvard Observatory alone would generate half a million glass plates in the span of sixty years[Nel08]. Harvard responded to the data processing challenge by hiring an entire department of women to process astronomical data, who in turn would use that data set to make discoveries of their own. Annie Jump Cannon, for instance, used her experience looking at a third of a million stars to generate a stellar classification system still used by astronomers today[Cam41].

Henrietta Swan Leavitt increased the number of known variable stars in the Small Magellanic Cloud by a factor of forty, and observed that there was a strong correlation between their luminosity and how quickly they went from bright to dim and back again[Mit76]. She was able to refine her measurements of these “Cepheid variable” stars and turn them into a measuring stick for cosmological distances; Harlow Shapley would rely on her work to show the universe is much larger than previously thought, while Edwin Hubble used her work to show that the universe was expanding. Similar computational revolutions were happening in the biological sciences, thanks to the influence of Francis Galton and Karl Pearson[Gri13].

Another revolution occurred around the end of the 20th century. The high demand for personal computing created enormous pressure for the major players to innovate, and as a result personal computer performance grew ten-thousand fold in a mere two decades[Cou11]. In 1994, Jim Gray and Chris Nyberg noted that personal computers now offered more performance per dollar than traditional mainframes, for tasks which could be effectively distributed across multiple computers[GN94]. This led to a new model of computing, where cheap components could be added or removed to follow demand. Storage space was no longer restricted to what could fit in a single computer, leading to an explosion of large data sets. In the span of a decade, for instance, Google went from indexing one million web pages to indexing one trillion[FB13]. It’s been estimated that at least thirty zettabytes of information are generated annually[RGR17]. To put that in perspective: if a large USB key stores sixty-four gigabytes of information, and each key is roughly half a centimetre thick, then storing thirty zettabytes of information would require six stacks of USB keys that reach from the Earth to the Moon.
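That comparison can be sanity-checked with a few lines of arithmetic. The sketch below uses the same round figures as the text, plus an assumed average Earth-Moon distance of roughly 384,400 km:

```python
# Back-of-the-envelope check of the USB-key comparison above.
# All inputs are the approximate round figures quoted in the text.
data_volume = 30 * 10**21      # thirty zettabytes, in bytes
key_capacity = 64 * 10**9      # one 64 GB USB key, in bytes
key_thickness = 0.005          # half a centimetre, in metres
earth_moon = 384_400_000       # average Earth-Moon distance, in metres

keys_needed = data_volume / key_capacity   # about 4.7e11 keys
stack_height = keys_needed * key_thickness # about 2.3 million km
print(stack_height / earth_moon)           # roughly 6.1 Earth-Moon stacks
```

The result lands just above six, matching the figure quoted in the text.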

Visualizing data sets this large is tricky enough, but the increased capacity in computation has also led to a parallel demand for increased realism. Video games have been quick to seize on this, and developers are increasingly using real-world physics to generate their graphics[HMJ+16]. Hollywood has done the same, pushing realism to the point that many visual effects don’t appear to be effects at all[Rob16]. This has shifted the emphasis towards more realistic but more expensive image generation techniques[KFF+15].

This also extends to the scale of scenes being rendered. The movie Avatar, released in 2009, was an ambitious expansion of what visual effects could accomplish. Researchers from Weta Digital and NVIDIA developed a new system for rendering large scenes called PantaRay, which worked together with Pixar’s existing RenderMan software[PFHA10]. It allowed scenes containing over one billion polygons to be efficiently rendered. Four years later, the movie Elysium would feature a few scenes with three trillion polygons[F13].

It took three years for that record to be broken, however, via a test render for the movie with four trillion polygons[Lus16]. One reason may be that CPU computing performance is reaching a plateau; evidence for this can be found in the Top 500 list of supercomputers[noac]. If processors can be made arbitrarily powerful, then there would be no correlation between the number of cores in a supercomputer and the number of floating-point operations it can handle per second; if processors have reached maximum performance, with no room for improvement, then the correlation between computer cores and performance should be perfect. Reality lies between those two extremes, but if the yearly trend is towards greater correlation, we have reason to believe there is less flexibility in processor design. Figure 1.1 shows that this is generally the case, with two notable exceptions: between the years 2002 and 2004, and in November 2017.1

Figure 1.1: The correlation between supercomputer performance and number of cores, for the 500 fastest supercomputers between June 1995 and November 2017.

The general trend has been towards parallelism, even on the scale of a single computer. CPU manufacturers have found the easiest way to maintain performance gains is to divide their chips into multiple processing cores[Gee05]. Researchers have discovered that graphics processing units, or GPUs, can perform some non-graphical calculations more efficiently than CPUs[CBM+08]. Their architecture is even more parallel; NVIDIA’s GV100 processor consists of 84 “streaming multiprocessors,” with 104 computing cores apiece, while the AMD Threadripper 2990WX holds the off-the-shelf CPU record with 32 cores[Smi17][Lea18]. Due to their good performance and improved efficiency, GPUs are increasingly showing up in supercomputers[KT11].

1 While the exact reasons for the exceptions are unclear, Figure 1.2 suggests the former is due to a significant number of high FLOPS-per-core supercomputers coming online while the majority of supercomputers used much lower FLOPS-per-core designs, and the latter is due to an uptick of low FLOPS-per-core supercomputers with very high core counts.

Figure 1.2: The progress of computing speed, in terms of floating point operations per second per computing core, for the fastest supercomputers between June 1995 and November 2017. Purple dots represent individual supercomputers. Updates are done twice a year, each update lists 500 supercomputers, and supercomputers likely exist in multiple updates. Random noise has been introduced into the time axis for illustrative purposes. Note the use of logarithmic scale on the y axis.

This has maintained an exponential growth in computation capacity per core, as Figure 1.2 demonstrates. A careful look at the graph, however, shows that slower processors are disappearing more rapidly than faster processors have been created; the chart shows a peak in performance-per-core with IBM’s POWER5+ 2C processor, introduced in 2006, which has yet to be matched a decade later[noae]. The past few years have seen the reintroduction of low FLOPS-per-core supercomputers that compensate by deploying more cores. The system with the greatest number of cores on the November 2017 list, “Gyoukou” at the Japan Agency for Marine-Earth Science and Technology, has more cores than the next seven supercomputers combined, yet is only the fourth-fastest on the list[noaa]. In June 1995, it was possible for a single-core computer to be a top-200 supercomputer;[noad] in November 2017, the smallest number of cores in the top 200 was 12,800[noab]. Multi-computer parallelism already plays an important role in computer graphics,[AB15] and these trends imply this will only intensify in the future.

1.1 Motivation

The graphics community is not immune to these trends; on the contrary, the heavy computational workload of generating graphics has almost made it necessary to distribute the workload across several computers. In 1984, for instance, the short film The Adventures of André & Wally B. was rendered on a frame-by-frame basis across sixteen separate computers[CCC87]. That corresponds to the first of three distinct strategies for this problem.

1. Temporal partitioning: each frame within a sequence of animated images can be uniquely assigned to one of several computers.

2. Visual partitioning: the pixels of an image can be uniquely assigned to one of several computers.

3. Spatial partitioning: the geometry of the virtual scene can be uniquely assigned to one of several computers.

Temporal partitioning appears to have arisen as a natural solution to creating graphics; while the director of that short film is quick to credit other people’s work, he does not provide any citations to the idea of partitioning the work temporally, and treats the concept as trivial without need of explanation[Smi84].

Visual partitioning is an extension of existing techniques. In 1989, several researchers at the University of North Carolina published the details of a graphics rendering system known as “Pixel-Planes 5.”[FPE+89] It divided the output image into 128x128 pixel tiles, and if requested would assign one of sixteen “Renderers” to handle each tile in a round-robin fashion. These were distinct hardware modules within a single computer. The first attempt at employing multiple computers appears to be from Wald et al. in 2001[WSB01]. This partition type has also been used outside of academia, for instance in the Mitsuba renderer[Jak10].

Spatial partitioning has rarely been researched in the distributed context. The only major research the author could find was Kilauea, as detailed by Toshi Kato in 2003[Kat03]. Kilauea relied on common techniques at the time, but has become outdated by more recent advances and shifts in the computer graphics field.2 A re-examination of distributed spatial partitioning is overdue.

If this technique is successful, it could be easier and more efficient to visualize very large data sets, ones much larger than a single computer can fit into memory. The state-of-the-art in temporal and visual partitioning is out-of-core rendering, where geometry is streamed into memory from high-capacity storage; spatial partitioning would keep more of the geometry in core, potentially speeding rendering.

This exploration of distributed spatial partitioning forms the research problem of this thesis. Besides the obvious goal of exploration for its own sake, this form of partitioning offers an alternative to more common partitioning algorithms. In addition to the benefits outlined previously, it may also offer some relief from the pressure to upgrade computational hardware, as older computers can still contribute to some portion of the render even as the requirements for geometry and texture storage increase. If the template in Appendix B is followed, distributed spatial partitioning also offers a system that requires minimal maintenance while offering strong redundancy and abundant storage.

Defining a rigorous hypothesis a priori is impossible, due to the abductive nature of this thesis, so instead the data will be used to generate hypotheses after the fact. Within a Bayesian statistical model (see section 4.2.1), this procedure is acceptable provided no evidence is withheld.

2More detail on Kilauea will be presented in the next chapter, as some understanding of ray tracing is required to properly discuss it.

1.2 Methodology Overview

There are significant challenges with distributed spatial partitioning, as compared to other techniques. Temporal partitioning can be thought of as a one-dimensional problem: a linear collection of work units must be completed by a fixed pool of computers, each unit demanding an uncertain amount of processing, as quickly as possible. This can be simplified to the 0/1 knapsack problem, which has been extensively studied[Sah75]. Visual partitioning is a two-dimensional variation; however, the assigning of work also affects the amount of scene geometry that must be stored on each computer in complicated ways, as ray-based graphical techniques suffer from decoherence[SDB85]. Implementations such as Wald’s[WSB01] and DeMarle’s[DGP04] compensate for this by the use of a work-stealing algorithm which is aware of memory usage, but the extra dimension offers more algorithm choices and increases the difficulty via the “curse of dimensionality.”[Rus97]
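To make the one-dimensional case concrete, here is a minimal sketch of temporal work assignment. The cost estimates and the greedy least-loaded heuristic are illustrative choices only, not an algorithm drawn from the literature cited above:

```python
import heapq

def assign_frames(frame_costs, n_computers):
    """Greedily assign each frame to the currently least-loaded computer.

    frame_costs: estimated render time per frame (only estimates in practice).
    Returns one list of frame indices per computer.
    """
    # Place the heaviest frames first (longest-processing-time heuristic).
    order = sorted(range(len(frame_costs)), key=lambda i: -frame_costs[i])
    heap = [(0.0, c) for c in range(n_computers)]  # (current load, computer id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_computers)]
    for i in order:
        load, c = heapq.heappop(heap)
        assignment[c].append(i)
        heapq.heappush(heap, (load + frame_costs[i], c))
    return assignment

frames = [5.0, 3.0, 3.0, 2.0, 2.0, 1.0]  # hypothetical seconds per frame
assignment = assign_frames(frames, 2)
```

With perfectly known costs this becomes a knapsack-style packing problem; since per-frame costs are only estimates, real render farms often hand out frames dynamically instead.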

Distributed spatial partitioning is a three-dimensional variation, so the curse is greater still. Any analysis must be aware of this, and attempt to be as broad yet thorough as possible. There are two main approaches: build a working renderer, or simulate one. In theory building a renderer is relatively easy, as there are high-quality libraries and reference code available[WWB+14]. In practice there are a lot of other non-trivial problems that need to be solved as well: what level of physical realism is appropriate, how to structure the scene data, where that data will come from, the protocols necessary for exchanging data across the network, and so on. All of these need to be handled in a manner flexible enough to accommodate a significant number of potential algorithms, while still being grounded in how existing renderers are manufactured to allow for comparisons.

A simulation, in contrast, is not required to do ray-geometry collisions, handle textures, or deal directly with the complexities of networking. It can hook into an existing renderer and extract ray or geometry data directly, both drastically reducing the amount of code required to write while allowing the use of real-world rather than synthetic data. Figure A.1 contains the eighteen lines of code used to capture ray data from the Cycles rendering engine within Blender,3 which has support for hair and particle systems, volumetric rendering and subsurface scattering, and offers a large collection of sample scenes released under a Creative Commons license[Foub]. The effort saved could be used to experiment with more algorithms instead.

Nonetheless, it’s important to recognize that simulations establish plausibility at best; the true gold standard for any algorithm is a real-world implementation.

The large parameter space to examine demands a rigorous analysis process. We can treat it as a uniform sample space within a hypercube, and one simulation run as equivalent to a sample from that space. A series of random samples would permit inferences about which parameters are best, without having to examine every point within that space. Rather than use a simplistic metric like median run time, we can model how any given algorithm behaves; this allows for more fine-grained analysis of algorithm behaviour.
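The sampling procedure can be sketched in a few lines. The parameter names and ranges below are purely illustrative stand-ins, not the actual seven-dimensional space examined later:

```python
import random

# Treat the tunable parameters as axes of a hypercube, and draw each
# simulation run as an independent uniform sample from that cube.
# These names and ranges are illustrative only.
PARAMETER_SPACE = {
    "node_count": (2, 16),     # integers: how many servers to simulate
    "damping":    (0.0, 1.0),  # continuous: movement damping factor
    "bin_size":   (64, 4096),  # integers: rays bundled per transmission
}

def sample_run(rng):
    """Draw one point (one simulation run) from the parameter hypercube."""
    point = {}
    for name, (low, high) in PARAMETER_SPACE.items():
        if isinstance(low, int):
            point[name] = rng.randint(low, high)
        else:
            point[name] = rng.uniform(low, high)
    return point

rng = random.Random(2018)  # fixed seed for reproducibility
runs = [sample_run(rng) for _ in range(1000)]
```

Because each run is an independent draw, summary statistics over the runs estimate behaviour across the whole space without exhaustively gridding it, which keeps the curse of dimensionality manageable.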

Section 3.1 will cover the advantages of spatial partitioning in detail, as it requires some understanding of how path tracing works in a distributed context.

1.3 Summary of Contributions

The primary contribution of this thesis is the development and exploration of a variety of spatial partitioning algorithms. Two broad classes of algorithms have been covered: those which restrict how servers can balance themselves, and those which do not. Of the former kind, three that organized around the camera axis were considered; of the latter, two used a mass-spring system while two used a hill-climbing algorithm. All told, a seven-dimensional parameter space was sampled, then analysed via behavioural models. In general, the camera axis algorithms had the best performance and show a surprising level of scalability as the number of servers increases. They also need little tuning via parameters. The free-ranging algorithms may still prove useful in specific contexts, such as when few servers are involved or the scene to be rendered has a particular structure.

3Blender is an open-source rendering program with an unusually large feature set;[Foua] Cycles is the path tracing engine built into Blender. Path tracing itself will be covered in more depth in the next chapter.

A secondary contribution is the approach to analysis, which uses statistical sampling techniques and non-trivial models to analyze algorithm performance; while commonly used in other fields,[GCSR95] the author is unaware of similar examples in the Computer Science literature, so this may provide a useful template for empirical algorithm analysis.4

1.4 Thesis Organization

This thesis is divided into six sections. The current and the next section will describe the context that spatial partitioning exists in, from the algorithms behind it to similar solutions that other authors have created. The third will give an overview of several algorithms which can be used for spatial partitioning, outlining their assumptions, strengths, and weaknesses. The fourth section details the methodology used for evaluation, as well as important findings gathered from that methodology. The fifth will look forward to other variations of the discussed algorithms, and suggest directions for future researchers. Finally, supplemental data and a design outline will be presented.

4See section 4.2.1 for a discussion of this methodology.

Chapter 2

Related Work

2.1 Projection and Rasterization

The earliest attempts at computer graphics relied on the properties of Euclidean spaces and affine

transformations.1 The choice of origin is arbitrary in such spaces, which allows multiple coordinate

systems to be used. An object within a space can have its own “object coordinates,” with the origin

tied to a fixed point relative to the object. That coordinate system can then be translated into “world

coordinates” via a series of affine transformations, which properly situate it in the virtual space.

The reverse transformation is also possible, which is useful when the object in question is a virtual

camera used to view the scene. This allows any rendering algorithm to reorient the virtual world to

a position convenient for the camera; for instance, all objects in the scene can be translated so the

camera’s origin becomes the world’s origin, with the X axis coinciding with the horizontal portion

of the image plane, and the Y axis matching the vertical. If the camera has a focal point, one

final non-affine transformation “projects” every object relative to the image plane, such that distant

objects shrink and near objects enlarge. From there, it is a simple matter to map the X and Y axes

of the camera coordinate system to individual pixels on screen.

The next step, “rasterization,” depended on the type of object to be rendered. Objects defined entirely by a mathematical formula can be evaluated at each point and mapped back

to pixel locations; as an example, Ed Catmull outlined an algorithm which recursively subdivides

bicubic patches until each component corresponds to a single pixel, which would come to be

known as “fragments.”[Cat74] For objects defined by a series of polygons in three-dimensional

space, the projection process converts them into two-dimensional polygons mapped to a range

1An affine transformation is one which preserves lines and parallelism; it can be written as a linear map followed by a translation. Such transformations are bijective (each point is mapped uniquely to another point, and vice-versa), and their composition is associative: if multiple transformations are involved, they can be consolidated, since a ∘ (b ∘ c) = (a ∘ b) ∘ c.

Figure 2.1: The projection and rasterization approach to generating computer graphics. See the text for details.

of pixel values, which in turn are subdivided to pixel- or scanline-sized fragments. Bouknight

et al. demonstrated the visualization of polygonal objects via this process[BK70]. Figure 2.1

summarizes the projection and rasterization approach.

These abstract mappings can then be converted to actual pixels, by applying a “shading” function of some sort. The earliest of these dates back to 1760, when Johann Lambert asserted that

some materials scatter light equally in all directions[Lam60]. An observer looking at any place

on the surface of that object thus receives the same number of photons per second, no matter their

viewing angle. The observed illumination depends only on the angle between the surface and the

direction of the light source, the proportion of light intrinsically reflected by the surface or the

“albedo,” and the incoming illumination at the surface or “irradiance.” This can be stated more

compactly as

$$I_o = A \cdot \sum_l I_l \cdot (\vec{L}_l \bullet \vec{N}) \tag{2.1}$$

where $I_o$ is the outgoing illumination at the surface or “radiance,” $I_l$ is the irradiance due to light source $l$, $A$ the surface albedo, $\vec{L}_l$ a vector representing the direction of the illumination of light source $l$, $\vec{N}$ the surface “normal,” and $\bullet$ the dot product. If we approximate the area in question as a flat plane, the surface normal is the vector perpendicular to that plane and pointing outside of the object. This equation assumes a point light source; for non-point sources, it should be integrated over all possible $\vec{L}_l$ vectors.
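Equation 2.1 translates almost directly into code. The sketch below is a minimal illustration; the max() clamp that drops back-facing lights is a common convention the equation as written leaves implicit.

```python
def lambert_radiance(albedo, lights, normal):
    """Equation 2.1: I_o = A * sum_l I_l * (L_l . N).
    `lights` is a list of (irradiance, direction) pairs, with each
    direction a unit vector pointing from the surface toward the light.
    The max() clamp drops back-facing lights, a common convention that
    the equation as written leaves implicit."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return albedo * sum(i_l * max(0.0, dot(l_dir, normal))
                        for i_l, l_dir in lights)

# A single light directly above an upward-facing surface:
io = lambert_radiance(0.5, [(2.0, (0.0, 0.0, 1.0))], (0.0, 0.0, 1.0))
assert abs(io - 1.0) < 1e-9
```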

2.2 Ray Tracing and Radiosity

While this style of rendering continues to be popular, it has difficulty recreating certain visual situations. For instance, each fragment is intended to be self-contained, without access to others.

While this permits a remarkable level of parallelization, it also prevents checking whether the incoming light is occluded by another object, or reflecting the contents of another fragment. Techniques have been developed to work around this, by making a number of simplifying assumptions that increase performance[BN76][Wil78]. These techniques are still evolving, in step with computing performance[ECMM16].

The earliest competitor to projection and rasterization, “ray-tracing,” leverages the fact that photons exhibit a large degree of directional invariance: the odds of a photon arriving at a specific angle to a surface and leaving at another specific angle are the same as if it arrived at the second angle and departed at the first. This allows for pixels to be projected onto geometry instead of the other way around. There is no longer an advantage to shifting to a camera-oriented coordinate system, and this process trivializes projecting onto geometry from arbitrary points. That makes it more feasible to trace backwards along the ray of light which arrived at that pixel, and in turn to simulate more realistic photon-surface interactions. Whitted was the first to publish this process, and his paper demonstrates accurate shadows, reflection, and refraction[Whi79]. At each ray-geometry intersection, it would evaluate the following shading function:

$$I_o = I_a + k_s \cdot I_s + A \cdot \Big[ k_t \cdot I_t + \sum_l I_l \cdot (\vec{L}_l \bullet \vec{N}) \Big] \tag{2.2}$$

where Ia is an approximation of the indirect or “ambient” illumination received at that point, Is

is the irradiance due to reflections, It the irradiance due to refraction, and ks and kt are weighting

Figure 2.2: A demonstration of the effects of global illumination, or light which is not directly reflected or refracted into the camera. The left image lacks any global illumination. All surfaces are perfect Lambert radiators.

terms. Each illumination term, excluding Ia, would be generated by casting a new ray in the appropriate direction and tracing it into the scene; if it collided with geometry, the entire shading calculation would be repeated at the new collision point, generating further rays. If it did not, the illumination would be propagated back towards the camera and contribute to the illumination at that pixel.
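The recursion described above can be sketched as follows. The parameters and the trace_* callables are hypothetical stand-ins for a real ray caster; a practical implementation would cast actual rays at each collision point.

```python
def whitted_shade(ambient, albedo, k_s, k_t, direct_light,
                  trace_reflection, trace_refraction, depth=0, max_depth=4):
    """Equation 2.2: I_o = I_a + k_s*I_s + A*[k_t*I_t + sum_l I_l*(L_l . N)].
    The trace_* callables stand in for casting the reflection and
    refraction rays; each would re-enter this function at the next
    collision point, hence the recursion depth cap."""
    if depth > max_depth:
        return 0.0                         # terminate runaway recursion
    i_s = trace_reflection(depth + 1)      # irradiance due to reflection
    i_t = trace_refraction(depth + 1)      # irradiance due to refraction
    return ambient + k_s * i_s + albedo * (k_t * i_t + direct_light)

# Both secondary rays escape to an unlit environment:
black = lambda depth: 0.0
assert abs(whitted_shade(0.1, 0.5, 0.3, 0.0, 1.0, black, black) - 0.6) < 1e-9
```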

Cook et al. improved this technique by introducing statistical sampling.2 As had been noted

by earlier researchers such as Torrance et al. [TS67], most physical surfaces appear to consist of

microscopic planar facets, of varying sizes and randomized orientations. Analytic representations

of this are difficult to compute, as it involves calculating intersections with statistically-defined

three-dimensional volumes instead of one-dimensional rays. Whitted’s paper noted this could be

approximated by tracing multiple one-dimensional rays, and Cook et al. demonstrated this was practical. They also demonstrated the use of statistical sampling to simulate semi-glossy surfaces, translucent surfaces, motion blur, and depth of field. Soft shadows could be generated by tracing multiple rays towards a light source, sampling its visibility at that point. The increase in realism left an impact on the field, to the point that an image from Cook et al. would be fondly remembered twenty-five years later by graphics professionals[Mik09].

2Confusingly, they also described this as “distributed” ray tracing, as in sampling from a statistical distribution, as opposed to “distributed” as in spread across multiple computers.

Nonetheless, there was still room for improvement. By focusing entirely on reflected and refracted light rays, ray tracing was missing subtle but important lighting effects. If surfaces do

scatter light equally in all directions, then the light falling on one surface would provide illumination for any non-occluded other. As this light can come from all objects and all directions,

it has become known as “global illumination.” Figure 2.2 demonstrates this; notice that global

illumination greatly brightens the ceiling and creates an extra-bright area due to indirect light

from the top of the tall rectangular prism, adds colour to the shadowed side of both prisms, and

even creates a pseudo-Mach band effect along the roof edge and at the base of the foreground

prism.3 Researchers such as Cindy Goral and Michael Cohen tried to solve this by adapting an

algorithm from thermal engineers intended to model heat transfer, creating a technique known as

“radiosity.”[GTGB84][CCWG88] This model of indirect diffuse illumination allowed sufficient realism that human observers were unable to distinguish a physical scene from a computer-generated recreation[MRC+86]. It could also be combined with ray tracing, by replacing the $I_a$ term.

2.3 Path Tracing

Ray tracing’s primary strength is also its weakness, however. By modelling the different ways light could interact as distinct modules, naïve ray tracing algorithms would cast at least one ray per module. If direct diffuse illumination, reflection, and refraction were being modelled with one light source, for instance, then one cast ray would generate three more, which each would generate three more and so on. Because ray tracing traverses depth-first, all of these rays need to remain in memory to evaluate one sample of one pixel. There is also a conceptual problem with the modular approach: by treating different light paths as distinct, ray tracers added artificial complexity to the ray casting process. Light-generating objects and geometry are separate entities in ray tracing, for instance, when in reality all light is emitted from physical objects. Radiosity has

3This is due to object self-occlusion; geometry within a right-angle joint between two planes can receive light from a smaller solid angle than can geometry in the middle of a flat plane. If ambient light radiates equally from all directions, this leads to a physical difference in illumination that is distinct from Mach banding’s perceptual difference but has superficial similarities.

Figure 2.3: A comparison of path and ray tracing.

to be calculated before any rays are cast, and does not take into account any reflected light. It could

be approximated by casting rays to sample the occlusion at each collision point, but this approach

has difficulty simulating colour bleeding and incorporating scene lighting.

James Kajiya attempted to solve these issues by unifying all modules[Kaj86]. Every possible

path that light could travel can be explained by a single equation:

$$I(x,x') = g(x,x')\Big[ e(x,x') + \int_S \rho(x,x',x'')\, I(x',x'')\, dx'' \Big] \tag{2.3}$$

$$g(x,x') = \begin{cases} 0, & \text{if the two points are occluded by other geometry,} \\ \dfrac{1}{(x - x')^2}, & \text{otherwise,} \end{cases} \tag{2.4}$$

where $S$ is the set of all surfaces in the scene, $x$ is the last collision point of a ray, $x'$ the current collision point, $g(x,x')$ describes geometry occlusion, $e(x,x')$ the light emitted by the surface in the direction of the last collision point, and $\rho(x,x',x'')$ the attenuation of incoming light from future collision point $x''$. Kajiya demonstrated that both ray tracing and radiosity are approximations of this full equation; that the branching problem of ray tracing could be solved by only allowing one ray to spawn zero or one rays; and demonstrated the use of statistical sampling via Markov chains to evaluate the integral. “Path tracing,” as he dubbed it, placed no strong constraints on the

direction the next ray would be cast. Instead, a Bidirectional Reflectance Distribution Function or

“BRDF” determined the likelihood of where that next ray would reflect, given various parameters about the surface and the incoming ray. This statistical distribution could be sampled from by the algorithm. If more flexibility was needed, a Bidirectional Scattering Distribution Function or

“BSDF” could be substituted so that both reflection and refraction were handled.
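Because each collision spawns at most one new ray, a path can be evaluated iteratively rather than recursively. The sketch below illustrates this; the hypothetical `spawn` closure stands in for BSDF sampling and ray casting, which a real renderer would perform against actual scene geometry.

```python
import random

def trace_path(spawn, max_bounces=8, seed=1):
    """Kajiya-style path tracing: each collision spawns zero or one new
    rays, so a path is a loop rather than a tree of recursive casts.
    `spawn` is a hypothetical closure standing in for BSDF sampling; it
    returns (emitted, attenuation, alive), with `alive` False once the
    ray escapes to the environment."""
    rng = random.Random(seed)
    radiance, throughput = 0.0, 1.0
    for _ in range(max_bounces):
        emitted, attenuation, alive = spawn(rng)
        radiance += throughput * emitted    # light emitted at this hit
        if not alive:
            break
        throughput *= attenuation           # the rho(x, x', x'') term
    return radiance

# First hit emits nothing and attenuates by half; the second emits 2.0
# and terminates, so the camera sees 0.5 * 2.0 = 1.0.
events = iter([(0.0, 0.5, True), (2.0, 0.0, False)])
assert abs(trace_path(lambda rng: next(events)) - 1.0) < 1e-9
```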

Path tracing is more intuitive to implement than ray tracing or radiosity, as Figure 2.3 suggests, and is more in sync with the research on light-surface interactions. Shading models are usually defined in terms of BRDFs or BSDFs, and sometimes come with guidance on how to sample them. Noteworthy examples include the Beckmann model,[Bec67] Cook-Torrance,[CT81] Oren-Nayar,[ON94] and GGX,[WMLT07] though this is still an active research area[MBT+18].

2.4 Backwards Ray Tracing and Photon Mapping

Tracing backwards along the path of an incoming photon works well in general, but certain lighting conditions are poorly captured by it. Glass or water will refract and focus the primary illumination, leading to a bright area known as a “caustic.” If rays are only cast from the camera, however, the odds of exactly tracing this path back to the light source are incredibly small in path tracing, and smaller still in ray tracing.

This led James Arvo to develop the confusingly-named “backwards ray tracing.”[Arv86] Before performing the traditional ray tracing algorithm, rays would first be cast from light sources into the scene, depositing energy into an “illumination map.” This map would then be substituted for

$I_a + A \cdot \sum_l I_l \cdot (\vec{L}_l \bullet \vec{N})$ in a conventional ray tracing algorithm, with any missing data filled in via interpolation. Arvo was able to demonstrate efficient caustic rendering using this technique; however, it only worked on polygonal surfaces. Jensen et al. would later extend this; their “photon maps” used the same energy-deposit approach, but could handle arbitrary geometry[JC95].

Figure 2.4: The three most common acceleration structures for path tracing. See the text for details.

2.5 Acceleration Structures

As scene size has increased over time, researchers have experimented with data structures that can

accelerate finding ray-geometry collisions, typically by splitting the virtual scene into partitions.

The similarity to the subject matter of this thesis justifies a brief overview.

The earliest acceleration structure, binary space partitions, pre-dates ray tracing by over a

decade[SBGS69]. These work by recursively dividing the scene by a number of planes, such

that each half of the partition contains an equal number of objects.4 Determining which partition a point falls into requires repeated evaluations of

$$t = (\vec{P} - \vec{O}) \bullet \vec{N} \tag{2.5}$$

where $\vec{P}$ is the point to be evaluated, $\vec{O}$ is a point on the plane, and $\vec{N}$ is the plane normal. The sign of $t$ determines which side of the partition $\vec{P}$ belongs to. The corner case of $t = 0$ is handled by asserting that neither partition contains the point; this implies that the planes must be constructed so they never intersect with geometry.

4This is distinct from distributed spatial partitioning, which is agnostic as to how the space is partitioned but demands each be on a different computer. In contrast, BSPs are agnostic about where the partitions are stored.

Jon Bentley proposed a simple modification of BSPs in 1975;[Ben75] rather than use planes

with arbitrary orientations, “k-d trees” force the planes to be perpendicular to a specific axis. The

same formula is evaluated to determine partition ownership, however the computations can be

greatly simplified by noticing that every component of $\vec{N}$ is 0 except along one axis, with the plane typically normalized so that the exception has value 1. In the three-dimensional case, this reduces the two additions, three subtractions, and three multiplications of the BSP case to one subtraction in the k-d tree case, a significant time savings.
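The difference between the two evaluations can be made concrete. This sketch compares the general BSP test of equation 2.5 against the collapsed k-d tree form; the function names are illustrative, not drawn from any particular implementation.

```python
def bsp_side(p, o, n):
    """Equation 2.5 for a general BSP plane: t = (P - O) . N.
    Three subtractions, three multiplications, and two additions."""
    t = sum((pi - oi) * ni for pi, oi, ni in zip(p, o, n))
    return t  # the sign of t selects the partition; t == 0 must not occur

def kd_side(p, axis, split):
    """The k-d tree special case: N is zero on all but one axis, so the
    dot product collapses to a single subtraction."""
    return p[axis] - split

# The two tests agree whenever the BSP plane happens to be axis-aligned:
p = (1.0, 2.0, 3.0)
assert bsp_side(p, (0.0, 0.0, 2.5), (0.0, 0.0, 1.0)) == kd_side(p, 2, 2.5)
```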

Both structures take a divide-and-conquer approach, recursively splitting the data set to form a binary tree. In the ideal case each partition is surjective,5 and each divides the number of objects in half; planes which do not divide by half produce unbalanced trees and have the same impact on performance. In practice, a planar division that exactly partitions a space into two equal halves is quite rare, as the clustering of objects in a scene makes it unlikely that the appropriate non-intersecting plane exists. As a result, most algorithms will break apart or duplicate objects which cross a partition boundary. Whatever the case, the result is a structure which allows O(log n) searching. While k-d trees are not as common in contemporary path tracers as the next two techniques, they have nonetheless proven useful[PGSS07][WZL11].

In 1980, Donald Meagher invented another tree-like acceleration structure, the octree[Mea80].

The scene is divided into eight rectangular prisms of equal size and dimension, and as necessary those partitions are recursively divided by the same scheme. Octrees have fixed partition bounds, effectively guaranteeing some partitions will be divided more than others and increasing the odds of object-partition collisions. This reduces their efficiency, relative to k-d trees.

One key advantage of octrees over k-d trees is that each partition has a unique address, which can be translated into a spatial address via Morton codes, or a space-filling curve via Z-order curves. These were invented in 1966 by Guy Morton to deal with two-dimensional maps of Canada, but can be extended to arbitrary dimensions[Mor66]. As Figure 2.4 demonstrates, each partition in an octree can be assigned a three-bit number, where each bit determines its location along one

5Each partition may contain multiple objects, though one object must be uniquely associated with one partition.

18 of the three axes. This number can be appended to, as each partition is recursively divided, such

that each partition is encoded with a unique address. If these partitions are traversed in numeric

order, the result is a Z-order curve that traverses every partition once. Since the partitions also have

fixed locations, it’s possible to convert an arbitrary coordinate within the bounds of the octree into

a Morton code and rapidly recover the octree partition which contains it.

These advantages make octrees a common structure in computer graphics[LGF04][ZGHG11][CNS+11].
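The bit-interleaving scheme just described can be sketched in a few lines; this is a minimal illustration of Morton encoding for octree partitions, with the `bits` parameter standing in for a hypothetical subdivision depth.

```python
def morton3(x, y, z, bits=10):
    """Interleave the bits of three partition coordinates into a Morton
    code, as used to address octree partitions; each recursion level of
    the octree contributes one bit per axis."""
    code = 0
    for i in range(bits):
        code |= (((x >> i) & 1) << (3 * i)) \
              | (((y >> i) & 1) << (3 * i + 1)) \
              | (((z >> i) & 1) << (3 * i + 2))
    return code

# The eight children of the root get the three-bit addresses 0 through 7:
assert [morton3(x, y, z, bits=1)
        for z in (0, 1) for y in (0, 1) for x in (0, 1)] == list(range(8))
```

Traversing partitions in increasing Morton order produces the Z-order curve described in the text.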

In the same year octrees were published, Turner Whitted and Steven Rubin detailed another

structure called bounding volume hierarchies[RW80]. A typical scene consists mostly of empty

space, with only small volumes occupied by complex geometry. Rather than test the geometry

directly, it is first enclosed by a simple bounding volume, such as a sphere or box; if a path ray

fails to collide with the bounding volume, it cannot collide with any of the geometry contained

within the volume. This scheme can be improved by using hierarchies of bounding volumes, which

achieve the same O(logn) performance as k-d trees or octrees, and can be sped up by restricting the

sides of box volumes to be axis-aligned6. Bounding volumes can be arbitrarily placed and sized, which makes it substantially easier to represent animated scenes, but this also allows volumes to overlap and thus loses the binary-traversal guarantee that both k-d trees and octrees provide. On the whole, bounding volume hierarchies are currently the most popular acceleration structure for path tracing and remain an active area of research[MB17].
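For axis-aligned boxes, the bounding-volume test is commonly implemented as a "slab" test. The sketch below shows that standard technique, not Rubin and Whitted's original formulation; precomputing the reciprocal ray direction is a common trick that trades divisions for multiplications.

```python
def ray_hits_box(origin, inv_dir, lo, hi):
    """The standard "slab" test for an axis-aligned bounding box: if a
    ray misses the box, it cannot hit any geometry inside. `inv_dir`
    holds precomputed reciprocals of the ray direction components."""
    t_min, t_max = 0.0, float("inf")
    for o, inv, l, h in zip(origin, inv_dir, lo, hi):
        t0, t1 = (l - o) * inv, (h - o) * inv
        if t0 > t1:
            t0, t1 = t1, t0                  # order the slab intersections
        t_min, t_max = max(t_min, t0), min(t_max, t1)
    return t_min <= t_max                    # overlap means a possible hit

# A ray travelling along +X passes through the unit box...
assert ray_hits_box((-1.0, 0.5, 0.5), (1.0, float("inf"), float("inf")),
                    (0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
# ...but one offset above it does not.
assert not ray_hits_box((-1.0, 2.5, 0.5), (1.0, float("inf"), float("inf")),
                        (0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
```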

2.6 Load Balancing

Pharr et al.’s use of spatial ray queues to enforce spatial coherence is worth explicitly mentioning,

[PKGH97] as there are some similarities between their method and a practical implementation of what will be presented here. In their system, all geometry is partitioned into the “geometry grid,” a set of voxels each containing roughly a few thousand triangles. Overlaid on that is a “scheduling grid,” another set of voxels which contains all rays necessary to render a scene, arranged in queues.

6The reasoning behind this is identical to that of k-d trees.

19 The rendering algorithm only works on one voxel at a time, determined by a scheduling algorithm that weighs the costs and benefits of any given voxel. By grouping rays into spatially-delimited queues, memory accesses are kept coherent and cache efficiency increases.

Distributing rendering load across multiple servers is an old solution, as discussed in section

1.1.

As for distributing geometry across multiple computers, Reinhard et al. developed one of the

first such systems[RCJ99]. It used an octree partition in which each node held the structure of the full tree, but the leaf contents were equally distributed across all servers. Missing leaves were queried from other servers on demand, and a caching system reduced data transfer. Rays were collected in “bundles,” which began as pixel tiles and were not shared across servers, so this is a visual partitioning algorithm. Efficiency ranged from 55% to 85% of an ideal parallel algorithm with 32 processors. DeMarle et al. instead used virtual paging to access scene geometry[DGP04].

By pre-processing to increase spatial coherence, their system was able to maintain useful render times when rendering out-of-core. Efficiency dropped linearly as more servers were included, despite constant communication overhead, such that with 31 servers the efficiency was 69%. They also used visual partitioning, but allowed work stealing to increase cache hit rates.

The closest analogy to the system outlined in this thesis is Kilauea, developed by Toshi Kato

[Kat03]. Using a master-slave framework, Kilauea divides the scene up into small geometry clus- ters which are scattered randomly across a minimum number of slaves. The master sends a list of rays to be cast to all slaves, who return potential collisions. These are sorted, discarding occluded collisions, and new rays are generated via standard ray tracing. Photon maps could be generated by spreading the calculation of the map across multiple slaves, then consolidating the results so that the minimum number of slaves contain the map. Lookups are sent to those slaves, and oc- cluded collisions are again discarded. Even with commodity hardware and 100Base-T network connections, the system still managed 60% efficiency of an ideal parallel algorithm.

Other authors have explored distributing workload via high-speed InfiniBand connections. Ize

et al. distributed scene geometry across multiple servers by sharing leaves of a Bounding Volume

Hierarchy (BVH), based on DeMarle et al.’s approach of virtual memory paging[IBH11]. On a scene with 16 Gigabytes of data, they could maintain 1.5fps if they limited each server to only 2

Gigabytes of scene data, as compared to 2.3fps if each server had their own copy. Ray data was never transferred, and their implementation did not consider textures.

Chapter 3

Spatial Partitioning

This chapter will cover the details of distributed spatial partitioning. It will give an overview of how such systems could work, the downsides of using them over more traditional partitioning schemes, and several variations which may offer better performance than the algorithm that inspired this line of research.

3.1 System Overview

With spatial partitioning, servers are assigned responsibility for a volume of geometry within the virtual space occupied by the scene. A path ray entering the scene will start on the server responsible for its point of origin. The typical collision search is done for all geometry within that space. If no collisions are found, but the ray intersects with the domain of responsibility for another server, it will be passed to that server to check for collisions. This process will repeat until the ray terminates, either by colliding with geometry or intersecting with the environment at infinity. Unlike temporal or visual partitioning, there is no simple mapping for which server will generate which pixel; path rays will likely be passed between multiple servers before they terminate on a different server than the one which created them.

The primary traffic between servers is path rays, not geometry. Figure 3.1 presents a thumbnail of how distributed spatial partitioning algorithms could transfer ray data.1 This traffic also scales better than what’s found in temporal or visual partitioning, as the number of rays cast through a scene primarily depends on the algorithm used and the device the resulting image will be displayed on, not on the geometry in the scene. A review article of noise-removal techniques used

65,536 samples per pixel as the “ground truth” number of samples necessary to fool the human eye,

1Compare this with Figure B.2, which is more comprehensive and intended to form the basis of an implementation.

[Flow chart nodes: START; ready to render? generate new rays; for each ray in the queue, did it collide with geometry? If so, shade it and, if possible, generate a replacement ray. Otherwise, if a peer is responsible for it, transmit it to that peer’s queue; if not, it went to infinity, so shade it appropriately. Once the queue is empty, no more rays can be generated, and no other peers are still active: FINISHED.]

Figure 3.1: A basic flow chart of how ray data may flow in a distributed spatial partitioning system.

though many of the algorithms presented could achieve similar results with much less[ZJL+15].

The highest-quality television signals have a resolution of 7680 by 4320 pixels, four times the resolution necessary to match traditional film formats[SM13].2 A distributed spatial partitioning renderer will thus have to deal with two trillion rays at maximum, given current hardware, a number which does not change as more geometry is added. This number is unlikely to grow significantly, either; human vision has a maximal resolution of one arcminute and one eye can see approximately

120 degrees horizontally and vertically, assuming typical eyesight,[CR07][FMHR87] so slightly more than three of those television signals could provide a sufficient pixel count to match human physiology.
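The two-trillion figure follows from direct arithmetic on the numbers quoted above:

```python
# 7680 x 4320 pixels at the 65,536 samples-per-pixel "ground truth":
samples_per_pixel = 65_536
pixels = 7_680 * 4_320
rays = pixels * samples_per_pixel
assert rays == 2_174_327_193_600   # roughly 2.2 trillion rays, at maximum
```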

If these volumes of responsibility are convex sets,3 without exception, we can guarantee a number of useful properties. If a line segment begins outside a convex set, then exactly zero or one continuous intervals along it will be contained by the set. We can prove this by contradiction: assume there exists a line which has two or more separate continuous intervals contained by a convex set. That can only be true if there is one continuous interval which is not part of the set, but lies between two intervals which are. If we choose one point from each of the two contained intervals, plus one from the interval outside the set, such that all three lie on the line, we have constructed a line segment between two points of a convex set which passes through a third point not in the set. This contradicts the definition of a convex set, thus proving the original assertion that only zero or one intervals can be contained.

As a corollary, if a line segment begins within a convex set, then it must be part of the one continuous interval. As rays are directional lines with a finite bound, these lemmas demonstrate that a ray will never intersect the hull of a volume of responsibility at more than two places, and even if it does intersect twice no point in between those intersections will leave the volume of responsibility. Thus, a server never has to worry that geometry on another server may occlude any

2Most theatres in North America have switched to digital projectors, and their resolution is either 2048 by 1080 pixels or 4096 by 2160[UDH16]. The change in technology was driven by business interests, not the desires of the general public[Koz18].
3Convex sets are defined as a collection of points, where any point that’s a linear interpolation between two arbitrary points within the set will also be contained by the convex set.

path ray segment it is responsible for, and once that segment has been processed it will never return

to that server.

This implies a ray will “ratchet” through a scene, never returning to a server, and if it is partitioned n ways then at worst a ray may be processed by n servers. More likely, a three-dimensional finitely-bounded scene of n servers will have approximately $\sqrt[3]{n}$ servers per dimension, and a one-dimensional path spanning the diagonal will encounter $O(\sqrt[3]{n})$ servers.
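The $O(\sqrt[3]{n})$ claim can be illustrated by walking the diagonal of a voxelized scene. This sketch assumes a uniform k × k × k grid of servers, a simplification of the Voronoi partitions discussed later, and samples densely along the diagonal as a crude stand-in for exact voxel traversal.

```python
def cells_crossed(k, steps=10_000):
    """Walk the main diagonal of a k x k x k grid of servers and count
    the distinct cells visited."""
    visited = set()
    for i in range(steps):
        t = (i + 0.5) / steps                       # stay strictly inside
        cell = tuple(min(int(t * k), k - 1) for _ in range(3))
        visited.add(cell)
    return len(visited)

# With n = 64 servers in a 4 x 4 x 4 grid, the diagonal touches only 4:
assert cells_crossed(4) == 4
```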

This subset of spatial partitioning also lends itself to peer-to-peer implementations. Rays only travel between servers which share a partition boundary, so in theory a server only needs to know about the bounds of its immediate peers, and even then it only needs to know enough to assign responsibility to a ray exiting the hull of its volume of responsibility. The server does not need to receive network traffic from non-adjacent servers, dramatically reducing the overall amount of bandwidth consumed. Eliminating the central server would also remove a potential bandwidth bottleneck.

As a side-effect of its peer-to-peer nature, spatial partitioning offers excellent fault tolerance.

If a server goes offline, neighbouring servers would detect this quickly thanks to the rapid passing of path rays, and remove that server from virtual scene space. The neighbours of the offline server would immediately carve up the space it occupied and take responsibility for a share. Other servers do not need to be notified, as from their perspective nothing has changed. Lost geometry can be compensated for by adding redundancy to the system, and lost image data can be re-rendered by generating more path rays.

Path tracing benefits greatly from coherent rays, and spatial partitioning naturally helps with coherence. As it would be inefficient to transfer rays from neighbour to neighbour one-by-one, any implementation would bundle them together and send them as a batch. Any recipient of a bundle will know those rays came from roughly the same direction and origin. Sorting these bundles would increase their compressibility, and as a side effect further increase coherence. Pharr et al. have demonstrated this with their own system [PKGH97], and the technique was a core part of

25 Eisenacher et al.’s renderer [ENSB13].

3.1.1 The Vivaldi Algorithm

Both temporal and visual partitioning provide obvious partition algorithms, in part because there

are limited options for carving up one- and two-dimensional space. The three dimensions of spatial

partitioning offer more possibilities.

One method comes via the networking literature. Ng et al. [NZ02] proposed a system of

“Global Network Positioning,” where the latencies between networked servers are mapped to distances between coordinates in a Lebesgue space.4 Routing between servers would thus be reduced to finding the shortest path in a graph. Their algorithm required the use of “landmark” servers to bootstrap the location of other servers, and was allowed to pick the dimensionality of the space to suit the data. It performed best with five dimensions or higher.

Dabek et al. [DCKM04] used a similar approach, but implemented a peer-to-peer algorithm they dubbed “Vivaldi” and restricted the dimension of their Euclidean space. As latency information came in, servers would use equations inspired by mass-spring systems to adjust their position within this space, so their distance metric would better resemble the measured network latency between the destination server and the transmitting one. To help encourage convergence, the uncertainty of each server’s position was used to attenuate the step size taken towards other servers.

They demonstrated that despite having only local knowledge, a stable global consensus could be reached.

These metric spaces can be mapped to a data space, such that each atom of data occupies a unique location within this network. This leads to an efficient method for querying data, as shown in Figure 3.2.

Each server within the space is responsible for the data closer to it than to other servers. To query data, servers check if they are the closest to the requested datum. If another server is closer, the request is forwarded on to a closer server. If we assume server positions are static and all immediate neighbours5 to a server agree where it is located, this algorithm is guaranteed to terminate.

4These are defined as any real vector space with a norm function, such as the Euclidean or Taxicab norms.

Figure 3.2: A search example for a network established via the Vivaldi algorithm. The red server is the one initiating the data query, the gold servers are the peers it is aware of, the blue “X” is the eventual destination, and the remainder are peers known to other servers but unknown to the originating one.

Assume instead it does not; this would require at least two servers to assess that another server is closer to the request than itself, in a cyclic fashion. However, we can generate a simplex around the location of the request by using server locations as vertices. Each of these servers is a peer of the others, as the convex hulls of their responsibility intersect, and thus they agree on their locations within the Euclidean space. As distance is a well-ordered one-dimensional metric, this collection of servers either agrees that one of their neighbours is closer to the request, or agrees that two or more servers are equally close. These servers cannot be part of a cycle.

5An “immediate neighbour” is any other server which shares a partition boundary with the server in question. Sharing a vertex is insufficient, as the odds of a ray passing through that vertex are approximately zero, and for the extremely rare exceptions the ray can be forwarded to immediate neighbours.

Now form a convex hull around the simplex with server locations as vertices. Each of the servers in this hull is a peer to at least one server in the simplex. They thus have a mutual understanding of where they are located, and so a server on the hull agrees that the server on the simplex is closer to the requested data. Because of this, the servers on the convex hull cannot be part of the cycle. If we again form a new convex hull around the original and repeat the reasoning, we find yet more servers which cannot be part of a cycle. We can iterate this process until every server in the system is accounted for, demonstrating that no server within the system can be part of a cycle. This contradicts the assertion that such a cycle exists, and therefore proves the algorithm will terminate.
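The forwarding rule covered by this proof can be sketched in a few lines. The following is a hypothetical illustration, not an implementation from the thesis; the server layout, field names, and the `route` function are all invented for the example:

```python
import math

def route(request, servers, current):
    """Greedy forwarding: repeatedly hand the query to whichever known peer
    is closer to the requested location, stopping when no peer is closer.
    servers: {id: {"pos": (x, y), "peers": [ids]}} -- a hypothetical layout."""
    while True:
        candidates = servers[current]["peers"] + [current]
        best = min(candidates,
                   key=lambda s: math.dist(servers[s]["pos"], request))
        if best == current:
            return current  # this server is responsible for the datum
        current = best

# A chain of three servers; a query near the far end hops down the chain.
net = {0: {"pos": (0.0, 0.0), "peers": [1]},
       1: {"pos": (1.0, 0.0), "peers": [0, 2]},
       2: {"pos": (2.0, 0.0), "peers": [1]}}
```

Starting `route((2.1, 0.0), net, 0)` forwards the query from server 0 to 1 to 2, terminating at the server closest to the request.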

If we allow servers to move, one of the core assertions behind the above proof is false. The basic algorithm can be repaired in several ways. If a server retains geometry it is no longer responsible for in some form of large but slow long-term storage, and all servers retain a log of peers, a failed fetch can be handled by querying all peers which were closer at some point for the missing data. If servers must offload geometry they are no longer responsible for, and queries can be delayed, a failed fetch must be due to data which has not been fully transferred yet, and a delay will give it time to arrive. These methods can be combined; for instance, the delay in the latter algorithm can be reduced by consulting a log and explicitly asking a server for the missing geometry. All of these techniques force a degradation of some sort by introducing latency, increased storage requirements, or decreased reliability. They only need to be employed while servers move, however; if we can guarantee this does not occur after some point in time, the proof applies from that point onward. This allows the use of Vivaldi or a similar iterative algorithm to optimize server locations, provided it can be shown to converge quickly relative to the lifespan of the entire system.

If we allow servers to be ignorant of their true peers, another issue arises. In outlining their distributed hash-table system, Ratnasamy et al. point out that a lack of knowledge can lead to situations where servers mistakenly think of themselves as the closest server to the search destination [RKY+02]. Figure 3.3 illustrates one such situation. There are multiple ways to resolve this situation as well, such as broadcasts that communicate global knowledge about server positions, or periodic requests for peers to share the peers they are aware of. As meta-data about peers requires very little storage, there is little cost to maintaining a list of potential servers to query.

Figure 3.3: A failed search example, for when servers lack perfect knowledge of the system. Here, the red server mistakenly thinks it is the closest server to the target, because it is closer to the target than any of the peers it is aware of, again in gold. If it had perfect knowledge of all other servers, in black, it would have forwarded the request on to one of them and it would reach the closest server, again in blue.

3.2 Voronoi Cell Partitions

Path tracing can be recast as an information retrieval problem: which geometry atom6 does a given ray collide with? This problem occupies a volume that could form a metric space for the Vivaldi algorithm. As each server’s influence is described by its closeness in comparison to its peers, their domains of responsibility form a Voronoi diagram with each server as the centroid of a cell.

3.2.1 Voronoi Diagrams

There is some disagreement in the literature on the definition of Voronoi diagrams. Franz Aurenhammer begins by defining a set $C$, which contains $n$ distinct generator points drawn from $\mathbb{R}^2$ [Aur91]. They define a subset of $\mathbb{R}^2$,

\[ \mathrm{dom}(p,q) = \{\, x \in \mathbb{R}^2 \mid \sigma(x,p) \le \sigma(x,q) \,\}, \quad p,q \in C \tag{3.1} \]

where $\sigma(a,b)$ is the Euclidean norm. They then define another subset for each point in $C$,

\[ \mathrm{reg}(p) = \bigcap_{q \in C \setminus \{p\}} \mathrm{dom}(p,q) \tag{3.2} \]

which they label “regions,” and which collectively form a polygonal partition of the plane; that collective is what Aurenhammer defines as the Voronoi diagram. From the formal definition it is clear that

\[ x \in \{\, x \in \mathbb{R}^2 \mid \sigma(x,p) = \sigma(x,q) \,\} \iff x \in \mathrm{reg}(p) \wedge x \in \mathrm{reg}(q) \tag{3.3} \]

and thus some points are contained by multiple regions. However, Preparata et al. define

Voronoi diagrams by first defining a half-plane as “the set of points closer to $p_i$ than to $p_j$” [PS12], where $p_i$ and $p_j$ correspond to the generator points of Aurenhammer, and thus exclude points which are contained by multiple regions. Aggarwal et al. define regions as a “locus of points closer to [the generator point] than to any other site” [AGSS89], also excluding multi-region points. Stojmenovic et al. do the same, stating that regions “associated with node A consist of all the points in the plane which are closer to A than to any other node” [SRL06]. Tran et al. also claim that all points in a region “are closer to the generator point of that polygon than any other generator point in the Voronoi diagram in Euclidean plane” [HKW09].

6Examples of geometry atoms include triangles, quadrilaterals, spheres, bicubic patches, and isosurfaces.

For this thesis, we will use the following definitions. A Voronoi diagram consists of one or more “Voronoi cells”7 and an equal number of “Voronoi hulls” in a Lebesgue space. Formally, a Voronoi cell will be defined as

\[ \mathrm{cell}(p) = \{\, x \in \mathbb{R}^d \mid N(x,p) < N(x,q) \,\}, \quad p,q \in C,\ q \neq p \tag{3.4} \]

As per Aurenhammer, $C$ is a set of $n$ distinct generator points, although these points are instead drawn from $\mathbb{R}^d$ where $d \in \mathbb{N}$ and $d > 0$. Generator points will be called “centroids.” $N(a,b)$ is the norm associated with the Lebesgue space. We will also define a Voronoi hull to be

\[ \mathrm{hull}(p) = \{\, x \in \mathbb{R}^d \mid N(x,p) = N(x,q) \,\}, \quad p,q \in C,\ q \neq p \tag{3.5} \]

Thus we can relate this thesis’ definitions to Aurenhammer’s definitions.

\[ d = 2 \wedge N(x,p) = \sigma(x,p) \iff \mathrm{cell}(p) \cup \mathrm{hull}(p) = \mathrm{reg}(p) \tag{3.6} \]

This relation can be generalized, allowing for an easy translation between inclusive and exclusive definitions of Voronoi diagrams. This thesis uses the exclusive definitions as they simplify assigning responsibility for each point on the diagram. Assigning responsibility on a Voronoi hull can be done via a hash function that maps to a single Voronoi cell, instead of allowing multiple responsibilities; this reduces storage needs and simplifies updates, but complicates the ray casting portion by forcing rays through a server which may not have responsibility along the ray.

7The term “cells” was chosen in favour of “polygons,” the term employed by Preparata et al., as polygons need not be manifold and therefore may lack a well-defined interior [GTLH98].

Figure 3.4: Two examples of Voronoi diagrams. Both use the Euclidean norm.

Voronoi diagrams can be associated with any finite-dimensional coordinate space, and can use any distance-measuring norm; Figure 3.4 demonstrates two- and three-dimensional variations. For simplicity, we will only consider the Euclidean norm. In this subset of Voronoi diagrams, every point in the space is either contained by one cell’s set of points or on the boundary between multiple cells. From that, it also follows that Voronoi cells are convex sets, and thus have all the benefits mentioned in section 3.1.8 As Voronoi diagrams work for arbitrary dimensions, they have no problem with the three-dimensional space typical of path tracing.
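In code, the exclusive definition reduces to a nearest-centroid query. This sketch uses the Euclidean norm and breaks ties on hull points by lowest index, a stand-in for the hash function mentioned above; the name `owner` and its arguments are invented for illustration:

```python
import math

def owner(x, centroids):
    """Index of the Voronoi cell containing point x: the nearest centroid
    under the Euclidean norm. Points on a Voronoi hull are equidistant from
    two or more centroids; min() resolves them to the lowest index here."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(x, centroids[i]))

# Three hypothetical server centroids in the scene's coordinate space.
cells = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
```

A point such as `(2.0, 0.0, 0.0)` lies on the hull between the first two cells; the tie-break assigns it a single owner, matching the exclusive definition.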

3.2.2 Adjusting Partitions

The original Vivaldi algorithm can be modified for path tracing. If each server has a queue of path rays to trace, “network latency” can be recast as a metric of workload. Rather than increase the distance to a high-workload server, neighbours would instead decrease it. As there is a monotonic relation between the geometry a server contains and how busy it is, making a server responsible for more geometry will either have no effect or increase how many rays are sent to it. By moving closer to a busy server, neighbouring servers will decrease the geometry it is responsible for, and thus likely decrease the flow of incoming ray queries. If no decrease happens, the process is repeated until it does. This balances load between servers, without the need for global consensus.

8Proofs of these properties can be found in Aurenhammer [Aur91].

There are several possible definitions of workload. If all the servers in the system are homogeneous, then it can be as simple as the number of path rays in a server’s work queue. If they are not, it is more useful to estimate how long it would take for a server to empty its queue, assuming no more work is given to it, and use that as the measure of workload. We can also define it as the total number of path rays received in a unit of time, which is useful for simulation. Unless explicitly stated otherwise, this thesis uses the last definition.

The original Vivaldi algorithm only considers latency between two servers, as it was intended to execute as each packet arrives. This is undesirable in a path-tracing context, since a server may receive hundreds of path ray packets per second, each of which could require substantial processing. Instead, it is better to modify the algorithm to run after either a fixed number of path rays are cast and the new rays grouped into an output queue, or a timeout value is reached; this ensures the scene is better sampled, while encouraging outliers to search for more work.

The certainty term also needs to be changed. The original Vivaldi algorithm took advantage of the fact that servers which had not reached a stable location would find significant discrepancies between their calculated and actual peer-to-peer latencies, while servers at stable locations would find none. This can be treated as a metric of how certain each server is about its current position, and encoded as a “certainty term” which modulates the forces on a server; if the other server has a greater certainty metric than the current one, it “tugs” on the latter more strongly than if both had been equally weighted. Again, this is calculated on a per-packet basis in Vivaldi, and needs modification.

A server which has found an equitable workload will be surrounded by servers that have the same workload, so the standard deviation of their workloads is 0; a server which has not will have a non-zero standard deviation for peer workloads, and the more unbalanced the workload the higher the deviation will be. This metric is an excellent measure of uncertainty, though care must be taken to invert it into a measure of certainty. Put into practice, balanced servers would be “heavy” and resist the attraction of others, while unbalanced ones would be freer to move; this is desired behaviour, as it favours convergence and prevents new servers from disrupting established networks.

Algorithm 1: Load balancing within the metric space, via a modified Vivaldi algorithm.
input : A list of peers and workloads.
output: A new location for the current server.

1  for each peer of the current server do
2      Calculate the standard deviation of their workloads;
3  end
4  Invert the deviation to arrive at a certainty value;
5  Share those certainty values with nearby peers;
6  if there is some workload variance above a threshold then
7      for each peer of the current server do
8          Generate a normalized direction to that peer;
9          Multiply this by the peer's workload;
10         Weight by the proportion of this server's certainty to the sum of this server's certainty and the peer's certainty;
11         Add the resulting vector to a rolling sum;
12     end
13     Add the weighted sum to the current server's position in the metric space, weighted by a damping value;
14     Adjust the new location so that it remains within the convex hull formed by its peers;
15 else
16     Do not adjust our position;
17 end
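A minimal sketch of one update, assuming homogeneous servers and following the weighting described in line 10; the convex-hull clamp of line 14 and the message exchange with peers are omitted, and all names here (`certainty`, `vivaldi_step`) are invented for illustration:

```python
import math
import statistics

def certainty(peer_workloads):
    """Lines 1-4: certainty is the inverse of the standard deviation of
    the workloads seen among a server's peers."""
    sd = statistics.pstdev(peer_workloads)
    return 1.0 / sd if sd > 0 else float("inf")

def vivaldi_step(pos, my_cert, peers, damping=0.5, threshold=0.0):
    """Lines 6-13: peers is a list of (position, workload, certainty).
    Returns the proposed position, before the convex-hull adjustment."""
    workloads = [w for _, w, _ in peers]
    if statistics.pstdev(workloads) <= threshold:
        return pos  # line 16: workloads are already balanced
    dim = len(pos)
    weights, dirs = [], []
    for p, w, c in peers:
        d = [pi - qi for pi, qi in zip(p, pos)]          # line 8
        n = math.sqrt(sum(x * x for x in d))
        dirs.append([x / n for x in d])
        weights.append(w * my_cert / (my_cert + c))      # lines 9-10
    total = sum(weights)
    step = [sum(d[i] * (w / total) for d, w in zip(dirs, weights))
            for i in range(dim)]                         # line 11
    return tuple(p + s * (1.0 - damping)                 # line 13
                 for p, s in zip(pos, step))
```

With two equally certain peers pulling in opposite directions, the busier peer dominates and the server drifts toward it, shrinking that peer's cell.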

Algorithm 1 outlines the modified Vivaldi algorithm, henceforth called Vivaldi. A simple illustration may clarify how it works: Figure 3.5 outlines a simple server layout in a two-dimensional space, for ease of understanding. The server currently attempting to reposition itself is the central one, labelled “1.”

Lines 1 to 4 start the algorithm by calculating each server’s certainty. As we need this value for each peer of the current server, we must calculate their certainties as well. Table 3.1 walks through the math behind those calculations. The certainty is the inverse of the standard deviation.

We next calculate a weighted average according to lines 8 through 11. Table 3.2 outlines that portion. “norm’d dir” refers to the normalized vector formed by using server 1 as the origin;


Figure 3.5: An in-depth example of how modified Vivaldi works, using a set of thirteen servers. The large numbers next to each server are proportional to their workload. The current server is in pink, its peers are in gold, and all other servers are black.

“weighting” is the peer’s workload times the current server’s certainty, divided by the sum of the current and peer server certainties; “norm’d weight” is the weighting values rescaled so that their sum is one; “combined” is simply the normalized direction multiplied by the normalized weight; “sum” is the sum of all “combined” entries; “damping” is used to moderate the step size we take during repositioning; and the “old pos.” and “sum·(1−damp)” values are the old position for server 1 and the new position after adding the “sum” value with damping applied. Since this new position (as shown in Figure 3.6) is outside the convex hull formed by server 1’s peers, it is adjusted via linear interpolation until it remains inside the hull.

Table 3.1: Calculating certainties for Figure 3.5. Numbers have been rounded for presentation.

server           1       2       3       4       5       6
peer workloads   2500    2000    500     3000    4500    2000
                 3000    4500    2000    4500    3500    2500
                 4500    3000    2500    5000    4000    3500
                 5000    4500    4500    4500    4000    3500
                 4500    4500    5000    4500    4500    4500
mean             3700    3167    3083    4250    4250    3583
std.dev          1037    1258    1686    866     418     1158
certainty        1/1037  1/1258  1/1686  1/866   1/418   1/1158

Table 3.2: Repositioning node 1, from Figure 3.5.

server         2                  3                  4                  5                  6
norm’d dir     <0.0068, 0.9999>   <0.6626, 0.7490>   <0.8400, −0.5426>  <0.2362, −0.9717>  <−0.9166, −0.3998>
weighting      2500·(1/1258) /    3000·(1/1686) /    4500·(1/866) /     5000·(1/418) /     3500·(1/1158) /
               ((1/1037)+(1/1258)) ((1/1037)+(1/1686)) ((1/1037)+(1/866)) ((1/1037)+(1/418)) ((1/1037)+(1/1158))
norm’d weight  0.1136             0.1149             0.2467             0.3584             0.1663
combined       <0.0007, 0.1136>   <0.0761, 0.0861>   <0.2072, −0.1338>  <0.0846, −0.3483>  <−0.1525, −0.0665>
sum            <0.2163, −0.3489>
damping        0.5
old pos.       <0.495, 0.428>
old pos. + sum·(1−damp)   <0.6032, 0.2535>

The convex hull rule on line 14 is a heuristic which attempts to better preserve peer relationships, and minimize the discrepancy with any distant server’s cache of server locations. It also serves a similar function as the damping term, by placing limits on how far a server can move through this virtual space.


Figure 3.6: The new position of the current server is a dark blue circle, and has been adjusted to fit within the convex hull. The current server is in pink, its peers are in gold, and all other servers are black. The unadjusted new position is the lighter blue circle.

Figure 3.7: A comparison of stable and unstable balanced partitions, where each is responsible for an equal number of collisions; spheres denote Voronoi cell centroids, and “balanced” implies all partitions have equal workloads. Small perturbations in server positions cause significant boundary changes in the unstable type and induce large swings in workload. This can prevent the system from finding a stable, persistent partition.

3.3 Challenges to Spatial Partitioning

While it offers advantages over other partitioning methods, spatial partitioning also has weaknesses.

3.3.1 Unstable Partitions

Vivaldi can be attracted to unstable partitions. Figure 3.7 illustrates a scene with nearly all geometry collisions evenly occupying a small convex hull. A person tasked with placing Voronoi centroids will position them in a ball or lattice within the bounds of the convex hull. However, preliminary simulations revealed another balanced arrangement: placing server centroids along a line tangent to the hull but offset at some distance, so that they evenly slice the shape. The boundaries of the resulting Voronoi cells are quite unstable, as small position shifts off this line will be magnified as server centroids move perpendicular to it. This will cause those slices to dramatically shrink or even disappear at the convex hull, triggering server movement and thus preventing the system from stabilizing. Unfortunately, empirical testing revealed there are more unstable but fair partitions than stable ones.

Table 3.3: Bandwidth accounting for path segments.

type               bytes (naïve)   bytes (packed)
ray origin         4 * 3           4 * 3
ray direction      4 * 3           4 * 2
luminance          2 * 3           4 * 1
attenuation        4 * 1           2 * 1
pixel              4 * 1           4 * 1
age                1 * 1           (pixel)
total              39              30
rays/s, 100base-T  ≈304,000        ≈396,000

3.3.2 Network Bandwidth

All spatial partitioning methods continually share ray data between servers and occasionally share geometry. If current networking technologies cannot provide sufficient bandwidth, these methods will have sub-par performance relative to more traditional partitioning schemes. If the algorithm is working properly, network traffic will be dominated by ray transfers.

Table 3.3 accounts for ray bandwidth, for the six minimal values a path ray needs to track.9

A naïve approach to representing those values requires 39 bytes per ray. If we reduce the number of bytes necessary for the direction by using octahedral coordinates [CDE+14], pack luminance into a log Y’CbCr encoding of 8:4:4 bits, switch to a 16-bit log representation of path attenuation, and steal five bits from the pixel index to represent ray age, we can reduce this by approximately 25%. By using all of a 64 kilobyte IPv4 packet on a 100baseT network, this translates to anywhere from 300,000 to 400,000 rays transferred in one direction per second. There is room for further improvement, both on the algorithmic side via compression and on the networking side via jumbo packets and gigabit Ethernet.

9“Origin” and “direction” are necessary to form a ray. “Luminance” and “attenuation” are needed to track the characteristics of the photon that ray represents. “Pixel” tracks which pixel the ray maps to, and “age” is used to limit the number of path segments to prevent infinite loops.
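The direction savings in Table 3.3 come from octahedral coordinates, which fold a unit vector onto two values by projecting the sphere onto an octahedron. A sketch of the mapping follows; the function names are invented, and this is not the thesis’ packing code:

```python
import math

def oct_encode(v):
    """Map a normalized 3D direction to two values in [-1, 1]."""
    x, y, z = v
    s = abs(x) + abs(y) + abs(z)   # project onto the octahedron |x|+|y|+|z|=1
    x, y, z = x / s, y / s, z / s
    if z < 0.0:                    # fold the lower pyramid over the upper one
        x, y = ((1.0 - abs(y)) * math.copysign(1.0, x),
                (1.0 - abs(x)) * math.copysign(1.0, y))
    return x, y

def oct_decode(ox, oy):
    """Recover a normalized 3D direction from octahedral coordinates."""
    z = 1.0 - abs(ox) - abs(oy)
    if z < 0.0:                    # unfold the lower pyramid
        ox, oy = ((1.0 - abs(oy)) * math.copysign(1.0, ox),
                  (1.0 - abs(ox)) * math.copysign(1.0, oy))
    n = math.sqrt(ox * ox + oy * oy + z * z)
    return ox / n, oy / n, z / n
```

Quantizing the two encoded values is where the byte savings come from; the round trip above is only exact at full float precision.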

3.3.3 Unbiased Pixel Sampling

Vivaldi assumes the rays cast into a scene are an unbiased sample of the image. If instead the rays cast into the scene originated from a clump or tile of pixels, Vivaldi will partition the scene optimally for those pixels. It would then be forced to re-partition for the next clump, causing servers to exchange geometry with one another. Thus batching pixels together into image-based tiles induces bias in the sample, leading these algorithms to constantly adjust server positions as active tiles cover different parts of the scene. This consumes network bandwidth, as servers swap scene geometry between them to maintain their responsibilities. An unbiased sample would avoid the use of tiles, instead using a random or low-discrepancy sequence to choose pixels.10 That can be inconvenient for artists, however, who commonly focus on a specific part of the scene in order to perfect it. That can be compensated for if the preview area is treated as a final render, without tiling or bias.

However, it is also common to place restrictions on the number of samples for an individual pixel. It is almost guaranteed that some pixels will contain more complicated geometry or surface reflectance than others, and thus require more time to render. Consequently, some pixels will stop casting rays into the scene before others, causing a bias in the sample. This may be desired, as the algorithm would rearrange servers to optimize rendering those final pixels, but it means an increased amount of geometry shared across the network as the image nears completion, as well as idle servers. The latter can be deputized to consolidate other servers’ individual pixels and generate a final result, or to act as caches for geometry, though these solutions are less than ideal.

10This excludes pixels which have cast all the rays they need for the final image.

3.3.4 Algorithm Tuning

Another challenge is in tuning spatial partition algorithms. Temporal partitions only divide along one axis, time, which greatly limits the choices of an algorithm. Visual partitioning’s two dimensions offer much more flexibility: renderers can choose the pixels to render at random, or use an ordered pattern such as growing from the centre outward or following a space-filling curve, or batch pixels together in tiles, or any number of other combinations. As the dimensionality grows, there are more algorithms and more parameters to choose from.

The three dimensions of spatial partitioning continue this pattern. Vivaldi features an uncertainty term, for instance, which is intended to penalize servers which may shift position in the future. The algorithm works perfectly fine without this term, though, and it may converge faster without those penalties in place. This variant will be known as No certainty.

3.3.5 Alternatives to Vivaldi

Vivaldi is a mass-spring system, and such systems are prone to fall into harmonic oscillations or chaotic behaviour. These slow or stop convergence on an ideal partition, and the additional motion will cause unnecessary geometry transfers. There are ways to compensate for this, and Vivaldi’s certainty term already helps smooth out this chaotic behaviour. Rather than accept the new position Vivaldi has calculated, we can also linearly interpolate between that new position and the original one. This “damping” term will reduce the odds of the algorithm passing by an equilibrium state and entering an oscillation, but at the cost of slower convergence.

There are also other approaches to updating our position which do not rely on mass-spring systems.

Ray-geometry collisions tend to cluster in a single area of the scene, so searching for and clustering around that area should avoid the problem of unstable partitions. The ideal case assigns a fair workload to every server, so minimizing the relative differences in workload is sufficient to reach fairness.

Algorithm 2, or Swarming, incorporates these observations. The convex hull check again helps preserve locality by limiting a server to travelling within its external bounds. By avoiding the mass-spring approach this balancing algorithm may afford more stability.

Algorithm 2: Load balancing within the metric space, via Swarming.
input : A list of peers, and their ray queue length.
output: A new location for the current server.

1  Calculate a weighted average of server positions, with the amount of time taken to exhaust their path ray queue as the weight;
2  Calculate the average queue time of all peers, ourselves included;
3  Subtract our queue time from that, and divide the result by the average;
4  Use that value as the weighting for a linear interpolation between our position and the weighted average, which will become our proposed new location;
5  if the proposed location falls outside the Voronoi cell we occupied then
6      Form a line between the proposed new location and our old one, and find the point along that line furthest from our original position which is within the old Voronoi cell;
7  end
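Lines 1 to 4 of Algorithm 2 amount to a weighted centroid and a linear interpolation. A hypothetical sketch, with the Voronoi-cell clamp of lines 5 to 7 omitted and `swarm_step` an invented name:

```python
def swarm_step(pos, my_queue_time, peers):
    """peers: list of (position, queue_time). Returns the proposed position."""
    total_t = sum(t for _, t in peers)
    dim = len(pos)
    # Line 1: average of peer positions, weighted by time-to-empty-queue.
    centre = [sum(p[i] * t for p, t in peers) / total_t for i in range(dim)]
    # Line 2: average queue time of all peers, ourselves included.
    avg = (total_t + my_queue_time) / (len(peers) + 1)
    # Line 3: our relative deficit of work; negative when we are overloaded.
    w = (avg - my_queue_time) / avg
    # Line 4: lerp from our position toward the weighted centre.
    return tuple(p + (c - p) * w for p, c in zip(pos, centre))
```

Clamping `w` to `max(w, 0.0)` before the interpolation yields the No backtracking variant discussed in the text.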

However, this algorithm allows a server with a workload above average to step “backwards,” allowing for oscillations similar to those we were trying to avoid. We might fix this by clipping the weight factor so it is always zero or positive; rather than have the current server back-track to a superior position, other servers would move to offload their geometry to the current one. The reliance on multiple servers may introduce enough inertia to prevent oscillations. This variant will be known as No backtracking.

3.3.6 Initial Node Placement

These algorithms focus on how servers move through the metric space, but say nothing about how they should be initially placed. A good heuristic would assign them close to a stable partition, reducing the number of geometry transfers and if possible preventing the system from evolving towards an unstable partition.

The most obvious such scheme is to simply scatter servers randomly throughout the metric space. This requires no coordination between servers, and copes well with new servers joining mid-render, though it does not take any scene information into account and has short-term problems with clumping.

Figure 3.8: The density of ray segment collisions, as a function of camera location. The vertical axis represents the distance along the central axis of the camera, the horizontal the distance from this axis. Darker regions indicate more collisions, and gamma correction has been applied to make slight interactions more visible. The source is raw data from five of the scenes in Figure 4.1.

The next most obvious is a low discrepancy sequence. This does require coordination between peers, but for a deterministic method such as additive recurrence that can be satisfied by assigning a unique sequential ID for each server. Depending on the choice of sequence, this can carry all the benefits of random placement without the problem of clumping. Adding new servers can be tricky, depending on the choice of algorithm, though falling back to random placement is an option. In the simulations, a randomized van der Corput algorithm was used as it has a lower discrepancy than additive recurrences.
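A base-2 van der Corput sequence mirrors an index’s digits around the radix point. The thesis does not specify which randomization was used, so the sketch below applies a Cranley–Patterson rotation as one plausible choice; all names are invented for illustration:

```python
import random

def van_der_corput(i, base=2):
    """The i-th element of the van der Corput sequence: reflect the
    base-b digits of i around the radix point."""
    f, r = 1.0, 0.0
    while i > 0:
        f /= base
        r += f * (i % base)
        i //= base
    return r

def randomized_vdc(i, shift):
    """Cranley-Patterson rotation (one assumed form of randomization):
    offset the sequence by a shared random shift, wrapping around [0, 1)."""
    return (van_der_corput(i) + shift) % 1.0

shift = random.random()  # one shared shift per render, assumed
```

Each server would evaluate the sequence at its own unique sequential ID, then map the resulting values into the metric space.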

Neither of the above two leverage scene information, however. In practice, ray collisions and camera location are correlated; geometry within view of the camera is much more likely to collide with path rays than geometry outside, and geometry closer to the camera is more likely to collide than distant geometry. Figure 3.8 visualizes this.

Suppose servers were evenly spaced along the camera axis, carving up responsibility as if the scene were filmed on a multi-plane camera. If ray collisions are concentrated in a bundle, this arrangement would have difficulty evolving towards an unstable partition. This is almost certainly unfair, however, as servers closer to or in front of the camera focal point would begin with a disproportionate workload. This can also lead to an unequal distribution of geometry among servers; a forest scene could be constructed by uniformly distributing tree geometry along a plane, clipped to the view frustum, resulting in distant servers containing substantial geometry but rarely being hit by rays. This defeats a key advantage of spatial partitioning, by demanding that some servers carry substantial amounts of geometry. These issues may be solved by the algorithm’s evolution, but this dilutes the value of the heuristic.

Another approach is to distribute the servers perpendicular to the camera axis, in a radial fashion. Now all servers initially need to deal with both near and distant geometry, equalizing memory requirements. This is technically an unstable partition, though as the centroids occupy a plane it is more stable than the linear variant.

3.3.7 Node Movement Restrictions

In the lead-up to this thesis, an alternate partitioning scheme based on space-filling curves was tested. While this approach was ultimately rejected, as the partitions were not convex sets, it was superior at converging on a fair workload. One possible explanation is that Voronoi centroids have too much freedom to move in the above algorithms.

Another explanation is the level of collision information available. As the space-filling curve is one-dimensional, it was easy to divide partitions of that line into bins and store collision counts.

This was invaluable when partitions were adjusted, as the new bounds could be finely tuned to where collisions occurred. In contrast, all the above algorithms assign the total collision count to the Voronoi cell’s centroid, a zero-dimensional point.

This suggests an algorithm that restricts the movement of cell centroids in a three-dimensional space to a single dimension may behave better than an unrestricted one. We already have two algorithms from the prior section which map centroids to one-dimensional spaces, so if we add an evolutionary component we can promote them to full competitors to the algorithms above.

Algorithm 3: Load balancing with bins in a one-dimensional space, given only local knowl- edge. input : The current server’s boundaries in the one-dimensional space. input : A histogram of collision locations within that space. input : The collision counts of this server and neighbouring servers. output: A new bound for the current server.

1 if This server’s collision count is above the average count then 2 for Each bin do 3 Find the bin with the greatest collision count; 4 end 5 Set the bounds of this server to match this bin; 6 if This bin’s collision count is less than the average count then 7 for Each bin, starting from the one with the greatest collision count and working left do 8 Extend the bounds until the sum of the bounded bins is just under half the average count; 9 end 10 for Each bin, starting from the one with the greatest collision count and working right do 11 Extend the bounds until the sum of the bounded bins is just under the average count; 12 end 13 Linearly extrapolate the bounds outward to approximate covering the average count; 14 end 15 Broadcast the change to all neighbouring servers; 16 end

Algorithm 3 details how this would be done using only local information. The global solution is even easier: starting from the leftmost bin and the first server, march until the average collision count would be crossed, then use linear interpolation to find the rightmost bound of the current server. Make this the leftmost bound of the next server, then repeat the algorithm until each server has been assigned bounds. The number of bins which produces optimal outcomes isn't clear, whichever scope is chosen.

Figure 3.9: Three methods for spatially partitioning a unit cube into Voronoi cells. Clockwise from top: unrestricted or "free", planar camera-axis, and radial camera-axis. Cell centroids are marked by red spheres.
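As a sketch of the global solution described above (hypothetical helper names, not code from the thesis's implementation), the march-and-interpolate procedure over a collision histogram might look like:

```python
def partition_bins(bins, num_servers, span=(0.0, 1.0)):
    """Global variant: march along the collision histogram, placing each
    server's right-hand bound where the running total crosses that
    server's share of the collisions, with linear interpolation inside
    the crossing bin.  `bins` holds collision counts per histogram bin."""
    total = float(sum(bins))
    width = (span[1] - span[0]) / len(bins)
    cumulative, running = [], 0.0
    for count in bins:
        running += count
        cumulative.append(running)
    bounds = [span[0]]
    i = 0
    for k in range(1, num_servers):
        goal = total * k / num_servers      # k servers' worth of collisions
        while cumulative[i] < goal:         # march to the crossing bin
            i += 1
        before = cumulative[i] - bins[i]    # collisions left of this bin
        frac = (goal - before) / bins[i] if bins[i] else 1.0
        bounds.append(span[0] + (i + frac) * width)
    bounds.append(span[1])
    return bounds
```

For a uniform histogram this reduces to equal-width partitions; a skewed histogram shifts the bounds toward the collision hot spots.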

The algorithm which uses servers constrained to the central camera axis will be known as Planar; see Figure 3.9 for an illustration. The radial variant creates some ambiguity, however, as its circular nature does not impose an absolute left-most or right-most bound. One solution is to map the circle to the interval [0:1] and define 0 and 1 as fixed bounds, though this may cause issues with server evolution. Another is to accommodate the circularity in the algorithms via modulus operations and deferred bound imposition, though this is tricky to implement. Both options will be tested; the type with fixed bounds will be Bounded Radial, while Radial by itself will refer to the modulus type.
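A brief sketch of the modulus approach (hypothetical helpers; the thesis does not give exact formulas): a position projected onto the camera plane is mapped to an angle about the central camera axis, normalized to [0:1], and distances wrap around rather than stopping at a fixed bound.

```python
import math

def radial_coordinate(x, y, axis_x=0.0, axis_y=0.0):
    """Map a position (projected onto the camera plane) to the circular
    parameter in [0, 1), measured around the central camera axis."""
    return (math.atan2(y - axis_y, x - axis_x) / (2.0 * math.pi)) % 1.0

def circular_distance(u, v):
    """Distance between two circular parameters, accommodating the
    wrap-around via a modulus operation."""
    d = abs(u - v) % 1.0
    return min(d, 1.0 - d)
```

Under this mapping, parameters 0.1 and 0.9 are close neighbours; a fixed-bound version would instead treat them as near-opposite extremes.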

This creates two distinct types of algorithms within the system: the three "restricted" ones, just introduced, which constrain how server positions can evolve, and "free" ones like Vivaldi that have no explicit constraints.

3.3.8 Damping

As mentioned, damping can prevent feedback loops in mass-spring systems. This concept can be generalized to all spatial partitioning algorithms, as a weighting applied to a linear interpolation between the previous and desired positions. It also introduces another parameter which needs to be tuned; too much damping will prevent oscillations but will also slow convergence.
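As a minimal sketch (hypothetical helper, assuming damping is expressed as a weight in [0, 1]), the weighting might be applied per coordinate:

```python
def damp(previous, desired, damping):
    """Linearly interpolate between the previous and desired centroid
    positions; damping = 0 jumps straight to the desired position,
    while damping = 1 never moves at all."""
    return [p + (1.0 - damping) * (d - p) for p, d in zip(previous, desired)]
```

For example, `damp([0, 0, 0], [1, 2, 3], 0.5)` moves the centroid halfway toward its desired position, trading convergence speed for stability.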

3.3.9 Bundling Ray Data

Running these evolutionary algorithms after every single ray segment is impractical, so rays are bundled together and processed as a unit. Bundling also permits batch sending of rays to other servers, which in turn allows for better data compression. Small bundle sizes permit more interactivity with the rendered image, which is useful for artists but may not provide a sufficient sample of the scene for the algorithm; large bundle sizes provide a much better sample, but increase ray travel latency and may decrease the rate of convergence.

Chapter 4

Simulation and Results

This chapter will detail how spatial partitions were tested. It will include information on the methodology used, the results obtained, and a comparison to visual partitioning.

4.1 Simulation Methodology

In addition to the tuning issues mentioned in the previous chapter, spatial partitioning algorithms can apply to a variable number of servers. Even if we pledge only to test the global variant of the above algorithms, this still results in a seven-dimensional parameter space to explore.

1. The scene to process.

2. The algorithm used to update server positions, one of Vivaldi, No Certainty, Swarming, No Backtracking, Planar, Radial, and Bounded Radial.

3. The initial server placement for the algorithm, such as random or low-discrepancy sequences.

4. The number of servers to use for rendering.

5. The damping value to use.

6. The number of rays to cast before evolving server positions.

7. The number of bins to collect collision counts in, if applicable.

As roaming this space with a path-tracer would take formidable computational resources, these algorithms will instead be simulated with real-world data. Each of the aforementioned algorithms will be tested, though in a global configuration where each server has total knowledge of the state of all others. The Cycles rendering engine was modified to output the path rays it followed during renders, and it was used to render seven separate scenes (see Figure 4.1 and Table A.1 for details)[Foua]. In the interest of saving time and disk space, for some scenes the sample count was reduced from the default. All of the scenes used an ordered image-based tiling to render, which meant the resulting rays were a biased sample of the scene. To correct for this, the path ray order was randomized before each sample run.

The simulation program would read in these rays, plus information about the camera position, and then execute one or more of the above algorithms with a randomly drawn parameter set. Algorithms with randomized initial positions or using the randomized van der Corput method were run several times, to assess their average performance. A single run of the simulation program generated multiple simulation results: the initial point within the parameter space was chosen randomly, and subsequent points were generated via additive recurrence.

4.2 Behavioural Metrics

Assessing the performance of these algorithms is particularly difficult. All of them are evolutionary, all of them are being applied to muddy real-world data, and some of them have stochastic initial states. Having a clearly defined performance metric is critical.

There are three candidates the author is aware of: the total distance travelled by all servers during operation, as the more any given server moves the more geometry needs to be transferred; the total rays transmitted across the network, as this should be the primary network traffic in the system; and the workload of each server, in terms of the total number of rays processed, as these map directly to the system output. Both travel and network transmission are dependent on network technology, however, and new hardware advancements may turn an infeasible design into a feasible one. This makes work balance the best metric to consider.

Figure 4.1: The seven test scenes used in this paper. From top to bottom, left to right: the Blender 2.77 splash scene, by Pokedstudio; "Class room," by Christophe Seux; a toy helicopter by "vklidu;" a benchmark scene from "Cosmos Laundromat," by the Blender Institute; Mike Pan's BMW benchmark scene; "Barcelona Pavillion," by eMirage; and the Blender 2.74 splash scene, by Manu Jarvinen. All are available from the Blender Demo Scenes web page.

4.2.1 Statistical Modelling

It is possible to simply run the simulations repeatedly, capture the work balance at the end of the run, and take the median of the results. All of the test scenes have different ray counts, however, thus by the end of their run they’ll have iterated over the algorithm differently. On a more practical level, using only the end point makes it difficult to incorporate results from partial runs.

This approach also fails to capture the variability of the underlying algorithm. Two algorithms may possess the same mean or median performance yet one may exhibit much more erratic behaviour. Deploying the erratic algorithm is infeasible in practice, yet the instances where it outperforms the more steady algorithm may provide insight. Two algorithms may also take different times to converge on an optimal value. This is difficult to determine by looking at a single value, and yet like variance it also factors into the real-world uses of these algorithms.

Employing a statistical model also allows us to detect algorithms which do not fit the model. This is an invaluable check on our expectations of how these algorithms will behave. Model fitting is a standard technique in other fields [ELYM78][Raf95][LW04][MC04][MN89]; as early as 1805, the astronomer Adrien-Marie Legendre was fitting a model to the orbits of planets in the Solar System, for instance[Leg05]. The first sentence of the first page of "Bayesian Data Analysis,"1 a common textbook for Bayesian statistical methods, is as follows[GCSR95].

    Bayesian inference is the process of fitting a probability model to a set of data and summarizing the result by a probability distribution on the parameters of the model and on unobserved quantities such as predictions for new observations.

Nonetheless, model fitting is uncommon when analysing algorithms in Computer Science. One possible explanation is that formal analysis is usually tractable, thus there is no need for an empirical approach. This also removes the desire for statistical techniques, which ensures they rarely get taught, so researchers are left without alternatives for those rare instances where formal analysis fails. The evolutionary nature of the workload balancing algorithms presented in this thesis, combined with the muddy real-world data and sheer size of the parameter space, conspire to make a model fit the best analysis option.

1Third Edition, printed in 2014.

Bayesian statistics were used, for several reasons. One is flexibility: frequentism is significantly more efficient when the data follows the Gaussian distribution, but it can fail when more complicated distributions characterize the data, like the exponential distribution.[JK76] Less care needs to be taken with model choice with Bayesian statistics, consequently. A more significant reason is the tolerance for noise and small data sets. The analysis requires examining thin slices of the vast parameter space, so it is likely they will only contain a few samples. By treating the parameters as varying and the data fixed, Bayesian statistics incorporates a wide range of plausible parameters and captures the uncertainty of those values. Frequentist statistics fixes the parameters and assumes instead that the data varies, so the results are easily skewed by noise and small data sets.

Gelman et al. define three steps in performing a model fit: defining the model to be fit, calculating a posterior distribution by conditioning on the data, and evaluating the quality of the fit[GCSR95]. These occur in sections 4.2.2, 4.3.1, and 4.4.

4.2.2 Ray Collisions

From the original data set, the maximum and minimum number of collisions handled by any server during a given iteration is extracted. If the algorithm and parameters in question are perfectly tuned, we would expect

$$\lim_{t \to \infty} \max(C_t) = \lim_{t \to \infty} \min(C_t) = \frac{\bar{C}}{N} \tag{4.1}$$

where $N$ is the total number of servers, $t$ is the current iteration, $C_t$ is the set of all collision counts at iteration $t$, and $\bar{C}$ the average number of collisions per round. In practice, no algorithm will start out at that point, but instead may converge after a number of iterations. The fitness of evolutionary algorithms tends to follow the exponential decay function,

$$e^{ax}, \quad a < 0 \tag{4.2}$$

where $a$ controls the speed of the decay. The servers with both the most and the least number of collisions in any given round should follow this curve.

In practice, the maximum and minimum collision counts will never follow this model precisely, due to the complexity of real-world data and ray randomization. We can model this via an error term ε, which is an offset drawn from a Gaussian distribution with a mean of 0 and a standard deviation of σ. To make comparisons between different ray bundle sizes easier, it is convenient to normalize the collision count to the range [0 : 1] by dropping the C¯ term.

$$\text{model}_0^{\max}(t) = \frac{N-1}{N} e^{at} + \frac{1}{N} + \varepsilon \tag{4.3}$$

$$\text{model}_0^{\min}(t) = \frac{1}{N}\left(1 - e^{at}\right) + \varepsilon \tag{4.4}$$

This model is somewhat naïve, as it assumes any algorithm will converge to the ideal value. In

reality, we should expect the maximum to converge to some offset above $\bar{C}/N$ and the minimum to converge to an offset below. It is difficult to know where those values will converge to and how quickly, however, but we can rely on Bayesian model fitting to marginalize away those details. The space between the values of convergence will span a distance $s$, though to make the computation slightly easier we will instead measure it by half that distance, $h$ or the half-span. The centre of this span will be offset from the ideal of $1/N$ by some amount, $o$. While there may seem to be substantial overlap between the $h$ and $\sigma$ parameters, the former's distribution is flat, which makes it superior for representing simulation runs where the algorithm oscillates between two extremes.

$$\text{model}_2^{\max}(t) = \left(1 - \left(\frac{1}{N} + o + h\right)\right)e^{at} + \frac{1}{N} + o + h + \varepsilon \tag{4.5}$$

$$\text{model}_2^{\min}(t) = \left(\frac{1}{N} + o - h\right)\left(1 - e^{at}\right) + \varepsilon \tag{4.6}$$
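Equations 4.5 and 4.6 translate directly into code; a sketch (hypothetical function names, with the error term ε omitted):

```python
import math

def model2_max(t, N, a, o, h):
    """Normalized maximum collision count: decays from 1 at t = 0
    toward the ceiling 1/N + o + h (requires a < 0)."""
    ceiling = 1.0 / N + o + h
    return (1.0 - ceiling) * math.exp(a * t) + ceiling

def model2_min(t, N, a, o, h):
    """Normalized minimum collision count: rises from 0 at t = 0
    toward the floor 1/N + o - h."""
    floor = 1.0 / N + o - h
    return floor * (1.0 - math.exp(a * t))
```

Both curves start at the extremes (1 and 0) and converge toward a band of width 2h centred at 1/N + o, matching Figure 4.2.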

Figure 4.2: An overview of Model 2, and its key parameters. See the text for details.

Figure 4.2 gives a visual representation of Model 2. A "Model 1" was designed as an intermediate between Model 0 and Model 2, but it provided no additional insight. Only Model 2's results are presented here, as it can perfectly duplicate Model 0.

A prior is necessary for a Bayesian fit, and the following was chosen for Model 2. It favours values closer to zero for all input variables.

$$\text{prior}_2(a,h,o,\sigma) = \begin{cases} 0, & a > 0 \\ 0, & h < 0 \\ 0, & h > \frac{1}{2} \\ 0, & o > \frac{1}{2} \\ 0, & o < -\frac{1}{2} \\ 0, & o + h > 1 \\ 0, & o - h < -1 \\ 0, & \sigma \leq 0 \\ 0, & \sigma > 5 \\ \frac{1}{-ah\sigma|o| + 0.0000001}, & \text{otherwise} \end{cases} \tag{4.7}$$

4.2.3 Network Transmissions

Network transmissions are more complicated, as unlike collisions we cannot normalize the range to [0:1]. Instead, our desired outcome is for the network traffic to equalize between all servers, as anything else could introduce latency or demand faster networking hardware. Put mathematically,

$$\lim_{t \to \infty} \text{model}_3^{\max}(t) = \lim_{t \to \infty} \text{model}_3^{\min}(t) = mb, \tag{4.8}$$

where $b$ is a measure of network transmissions, acting as a sort of "waterline" for the given algorithm and parameter set, and $m$ is a scaling term. As those transmissions depend on the number of ray segments cast, which differs from scene to scene, to make comparisons we must normalize by dividing by $C = \sum_t C_t$. Much like collisions, $b$ must depend on the number of servers in the scene, though unlike collisions we have no good theoretical description for how $b$ will change

as the number of servers changes. A power relationship of approximately $N^{-1}$ is a good guess, however, so we'll assert $m = N^f$ and include the server falloff $f$ as a parameter.

We do not have a similar guess for network usage evolution, however. If our heuristic assigns one Voronoi cell to contain the majority of geometry responsible for ray collisions, the system starts off with zero network traffic; if instead two servers split that geometry, the total number of transmissions could start high then settle down to a lower equilibrium. Rather than dictate precisely how the system will change, we will promote the error term to the core of the model. The standard deviation of the error term, rather than the maximum or minimum, will follow an exponential decay from $\sigma_i$ to $\sigma_b$.

$$\sigma(t) = (\sigma_i - \sigma_b)e^{at} + \sigma_b \tag{4.9}$$

Some experimentation suggests that the relation between server count and normalized transmissions is linear in log-log space, and that variance is proportional to the log of transmissions. This can be compensated for by declaring $\sigma_i$ and $\sigma_b$ to represent the log of the standard deviation. Model 3 thus consists only of the error term, a Gaussian distribution with parameters

$$\text{model}_3(t) = \log\left(bN^f\right) + \varepsilon \qquad \begin{cases} \mu = \log\frac{R}{C} \\ \sigma = \sigma(t) \end{cases} \tag{4.10}$$

where $R = \sum_t r_t$ is the sum of all transmissions that round. The prior used is similar to that of Model 2,

$$\text{prior}_3(b, f, a, \sigma_i, \sigma_b) = \begin{cases} 0, & b < 0 \\ 0, & |f| > 10 \\ 0, & a > 0 \\ 0, & \sigma_i \leq 0 \\ 0, & \sigma_b \leq 0 \\ 0, & \sigma_i < \sigma_b \\ \frac{1}{-a|f|\sigma_i\sigma_b + 0.0000001}, & \text{otherwise} \end{cases} \tag{4.11}$$

which is flat for $b$ but favours values near zero for all other variables. Figure 4.3 illustrates how this model behaves.
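MCMC packages typically accept the prior as a log-density; a sketch of Model 3's prior in that form (a hypothetical function, assuming the piecewise definition of Equation 4.11), returning negative infinity outside the supported region:

```python
import math

def log_prior3(b, f, a, sigma_i, sigma_b):
    """Log-density form of Model 3's prior: zero probability outside
    the supported region, otherwise inversely proportional to the
    (sign-adjusted) product of parameters, favouring values near zero."""
    if b < 0 or abs(f) > 10 or a > 0:
        return float("-inf")
    if sigma_i <= 0 or sigma_b <= 0 or sigma_i < sigma_b:
        return float("-inf")
    # a <= 0 here, so -a * |f| * sigma_i * sigma_b is non-negative;
    # the small constant guards against division by (log of) zero.
    return -math.log(-a * abs(f) * sigma_i * sigma_b + 0.0000001)
```

An ensemble sampler would add this to the log-likelihood of the data at each proposed parameter set.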


Figure 4.3: An overview of Model 3, and its key parameters. See the text for details.

4.2.4 Node Movement

As mentioned prior, assessing server placement is critical for understanding network traffic. The more a server moves, the more geometry must be exchanged and the lower the efficiency of the algorithm. For an ideal evolutionary algorithm, we'd expect

$$\lim_{t \to \infty} \sum_{n=0}^{N} |d_{n,t} - d_{n,t-1}| = 0 \tag{4.12}$$

where $d_{n,t}$ is the location of server $n$ at time $t$. As before, it will probably converge to a value greater than 0, it should follow an exponential decay function, and there should also be a Gaussian error term. However, the error term should be one-tailed; this offset above 0 will represent the minimal amount of motion, so the odds of one iteration having less than the minimum is zero. The model used to capture all of this, Model 4, will be

$$\text{model}_4(t) = bN^f + |\varepsilon| \qquad \begin{cases} \mu = 0 \\ \sigma = \sigma(t) \end{cases} \tag{4.13}$$

with Model 3's prior, as both have identical parameters.

4.2.5 Memory Cost

A key motivator for spatial partitioning is the distribution of scene geometry across multiple computers, so that no single one of them is required to contain the entire scene. This assumption should not be taken for granted, as the invention of the camera axis algorithms came from the observation that a small amount of geometry usually accounted for a disproportionate amount of workload when path tracing. Another model is worth developing.

It carries many of the same assumptions as Model 2. Most likely, any given server will start off with a small fraction of the total geometry; that fraction will grow as more path rays are cast; but it will decay towards an optimal amount of geometry; and there will likely be different amounts of geometry stored on each server. This suggests another exponential function with an adjustable slope, decaying upwards towards a ceiling value. Much like Model 2, it makes sense to track both the maximum and minimum memory usage. We know no single server can store more geometry than is in the entire scene, nor can it store less than zero geometry, nor can the maximum be less than the minimum, which places limits on the height of either waterline. It's also likely that the amount of geometry will be affected by the number of servers within the system, though we again have no idea how. Thus the math for Model 5 will be

$$\text{model}_5^{\max}(t) = b_{\max}N^f\left(1 - e^{at}\right) + \varepsilon \tag{4.14}$$

$$\text{model}_5^{\min}(t) = b_{\min}N^f\left(1 - e^{at}\right) + \varepsilon \tag{4.15}$$

where $b_{\max}$ and $b_{\min}$ represent the ceiling values of the maximum and minimum, respectively, and $N$, $f$, $a$, and $\varepsilon$ are unchanged from prior models. The prior for this model will be

$$\text{prior}_5(b_{\max}, b_{\min}, f, a, \sigma) = \begin{cases} 0, & b_{\max} < 0 \\ 0, & b_{\max} > 1 \\ 0, & b_{\min} < 0 \\ 0, & b_{\min} > 1 \\ 0, & b_{\max} < b_{\min} \\ 0, & |f| > 10 \\ 0, & a > 0 \\ 0, & \sigma \leq 0 \\ \frac{1}{-a|f|b_{\max}b_{\min}\sigma + 0.0000001}, & \text{otherwise} \end{cases} \tag{4.16}$$

As each scene consists of a different amount of geometry, we will standardize by dividing the per-server geometry count by the total geometry in the scene, hence why the waterline values range between zero and one.

4.3 Key Metrics

Our primary concern is with the efficiency of the system. Suppose we split the workload across $N$ servers, each with the same performance. The overall render time of the system is determined by the amount of collision calculations done by the server with the heaviest workload over the entire rendering process, as the render is not finished until every server has finished. Model 2 predicts this as

$$w_{\max} = \bar{C}N\left(\frac{1}{N} + o + h\right) \tag{4.17}$$

So the overall efficiency of the system, if there is negligible network latency and geometry transfers, is approximately

$$E = \frac{\bar{w}}{w_{\max}} \approx \frac{\frac{1}{N}}{\frac{1}{N} + o + h} = \frac{1}{N(o + h) + 1} \tag{4.18}$$

We need not rely on approximations, however. For each simulated round of ray processing we know the maximally-loaded server, as well as the total number of collisions processed and number of active servers. Thus we can calculate efficiency directly.

$$E = \frac{\sum_r \sum_i c_{ir}}{N \sum_r \max(C_r)} \tag{4.19}$$

where $c_{ir}$ is the collision count for server $i$ during round $r$, and $C_r$ is the set of collision counts for round $r$. This metric is also useful for network transmissions, as it evaluates how evenly network transmissions are distributed. Servers that send or receive large amounts of network data may become bottlenecked by network hardware, and less-trafficked servers can almost never loan bandwidth to compensate.
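The direct calculation of Equation 4.19 is straightforward; a sketch (hypothetical function name), taking a list of per-round, per-server collision counts:

```python
def efficiency(rounds, num_servers):
    """Equation 4.19: total collisions processed, divided by N times
    the summed per-round maxima.  `rounds` is a list of rounds, each a
    list of per-server collision counts; a perfectly balanced system
    scores 1, and any imbalance pulls the score below 1."""
    total = sum(sum(counts) for counts in rounds)
    bottleneck = sum(max(counts) for counts in rounds)
    return total / (num_servers * bottleneck)
```

For example, a perfectly even split such as `efficiency([[5, 5], [5, 5]], 2)` yields 1.0, while skewed rounds like `[[8, 2], [6, 4]]` pull the score down toward the share of work done by the busiest server.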

4.3.1 Generating the Posterior Distribution

Other measures provide additional insight into the fit. The MCMC integrator provides the log-likelihood of each parameter, so we'll use the maximal likelihood from the posterior as a rough assessment of how well the models fit the data. This is on top of the standard model parameters, which provide insight into the typical behaviour of the system.

The second step in performing a model fit is to calculate the posterior distribution by conditioning on the data. Emcee was used for this [FMHLG13]. 100 walkers were allowed 1,200 steps for burn-in, then an additional 180 steps were taken to form the posterior. It thus consisted of 18,000 samples, and error bars correspond to the 16th and 84th percentiles of those samples. The total data set consists of 463,534 simulation executions; as a full analysis takes days of computation, on most occasions a random subsample was drawn from this full data set instead. This was filtered such that no more than 1,024 simulations would be included for any specific combination of initial placement and evolutionary algorithm. To ensure these were representative of the full data set, multiple subsamples were extracted and re-analysed.

Other parameters, such as the scaling factors and standard deviation of errors, may prove useful.

It is dangerous to rely too heavily on numbers to assess the behaviour of a complex system, and so some qualitative metrics will be used as well. These consist of ad-hoc charts, such as posterior illustrations or corner diagrams, which will be introduced as needed.


Figure 4.4: The performance of each of the unrestricted algorithms, as measured by efficiency (see text), across all four initial node placement algorithms. The middle tic is the median. This chart is based on a subsample of the full data set.

4.4 Results

The third and most critical part of the model fitting process is the evaluation of the fit[GCSR95]. If the data is not well-described by the model, any patterns that emerge may be false positives due to the fitting process or the model.

Before we can compare the seven algorithms, we need to sort out the best initial node placement for the four unrestricted ones. Figure 4.4 charts the relative efficiency of those across the entire data set. The radial camera axis method does remarkably better than all others, and manages to be superior in two of three other variables from Model 2 (Figure A.2). The exception is the offset, where van der Corput sequences manage roughly the same performance. This bodes well for Radial and Bounded Radial, though it would be wise to double-check other initial placements.

The wide error bars indicate there are a lot of confounding factors that are introducing noise, which is to be expected: only two of the seven variables are fixed, so that chart marginalizes over every input scene, every number of nodes helping render, every damping value, and so on. A low signal-to-noise ratio should be the consequence.

Table 4.1: Fitness to Model 2, collisions, by algorithm. Based on a subsample; some values are rounded for presentation.

Category     Algorithm        Log Maximal Likelihood   Samples   LML / Samples
Camera Axis  Bounded Radial   226846                   69362     3.270
Camera Axis  Radial           240556                   78220     3.075
Free         No Backtracking  111930                   48340     2.315
Free         No Certainty     2837                     1678      1.691
Free         Swarming         49543                    45939     1.078
Free         Vivaldi          835                      832       1.004
Camera Axis  Planar           192857                   212380    0.908

4.4.1 Collisions

Table 4.1 lists how well Model 2 fit the ray collision output of the seven algorithms, with the "free" algorithms using the radial camera axis initial placement. Since the number of samples is unequal between algorithms yet the log likelihood is proportional to the sample count, we divide the former by the latter. Radial and Bounded Radial fit Model 2 better than all other algorithms, and of the free algorithms No Backtracking does best. Planar fails badly, and quite surprisingly the original "free" algorithms do worse than their modified versions.

The low sample counts of the Vivaldi algorithms are due to technical issues. One way to calculate the peers of any given node is to perform a Delaunay triangulation; as this is the dual of the Voronoi cells associated with the node positions, all edges correspond to peer relations. Most programming libraries which perform Delaunay triangulations make assumptions about the data which give the same results as exact methods but in less time, for instance that no subset of points is collinear. Unfortunately, Vivaldi's tendency towards unstable partitions means that collinear points occur regularly, and these libraries tend to respond via corrupt memory accesses. This is catastrophic in a multi-threaded environment, as all other algorithms must halt. To preserve their results, both Vivaldi variations must be run single-threaded; as a consequence, all other algorithms enjoy a performance advantage that is only magnified in a high-performance computing environment. The differing low sample counts show up in the graphs as jagged lines and unusually narrow credible intervals.

Figure 4.5 charts the posterior distributions of each model. It's immediately apparent that Planar is inferior to all others, due to slow convergence and very large error σ. Both Vivaldi and No Certainty show very rapid convergence combined with little uncertainty and a small σ. Swarming and No Backtracking appear next best, with significantly slower convergence but a moderate amount of uncertainty.

Radial and Bounded Radial would appear to be surprisingly weak, as their convergence is no faster than Swarming's and they suffer from higher uncertainty. The secret to their strong performance is their offset values; indeed, while all the "free" algorithms have noticeable positive offsets from the perfect partition line, Radial and Bounded Radial have small negative offsets. This gives them an edge over the long term, and implies they scale extremely well as more nodes are added.2

Table A.2 gives the raw numbers for Model 2. Figure A.3 provides a corner diagram of the same posterior. All of the posteriors follow Gaussian or exponential distributions and do not show strong correlations with one another, even Planar's, suggesting the model is doing a good job of describing the underlying data. Three of the "free" algorithms do have some significant outliers in the slope parameter, hence why they don't appear Gaussian.

The efficiency metric takes precedence over the model, and as Figure 4.6 demonstrates it has some surprises. Vivaldi does the worst of the "free" algorithms, while No Backtracking pulls off the best performance. Planar does better than its model suggests, equal to No Backtracking no less. While the camera axis algorithms again have the best median performance, the outliers suggest there are instances where the "free" algorithms outperform them.

Figure 4.7 reveals why performance varies so much: the number of nodes in the system plays a critical role.2

2On its face, a negative offset may seem to imply the algorithm is more efficient than a perfect workload partition, which is impossible; the positive half-span argues against this, and instead suggests the early termination of a few simulations were sufficiently strong outliers to pull down a minuscule offset.

Figure 4.5: The posterior distributions of Model 2, for each algorithm, drawn from the data set. Magenta lines correspond to maxima, green lines correspond to minima, and grey areas represent the area between the 16th and 84th percentiles of both extremes. The number of nodes is fixed to three, and the corresponding ideal line of convergence is drawn in black. All charts use the same scale. See the text for analysis.

Figure 4.6: The performance of each algorithm, as measured by efficiency (see text), with the unrestricted algorithms using the radial camera axis initial placement. Based on a subsample of the original data set.

Both Vivaldi and No Backtracking show a typical pattern for distributed algorithms, where their performance drops as the number of nodes increases. Surprisingly, Radial and

Bounded Radial do not exhibit the same pattern; when there are five or more servers in the system, efficiency stabilizes at approximately 70%. This may be due to the deep slices these algorithms make through the scene, which encourage cast rays to be handled internally for two of the three spatial dimensions. Despite this remarkable performance curve, No Backtracking manages to outperform the radial algorithms with five or fewer nodes in the system. This explains why its credible interval extends above that of the swarming algorithms. Figure A.4 double-checks whether varying the initial node placement could improve the performance of Swarming, and demonstrates Radial is still superior.

Figure 4.7: The performance of the two radial algorithms, Vivaldi, and No Backtracking, as measured by ray collision efficiency (see text), when the number of nodes in the system is fixed. The radial algorithm data is from a subsample.

Intuitively, damping should be a major factor in algorithm performance. In practice, as Figure 4.8 shows, damping has mixed effects. No Backtracking does quite a bit better as the damping is increased, Vivaldi is too noisy to come to any firm conclusion, but Radial and Bounded Radial show a small but noticeable decline in performance. The latter implies collision binning is very effective at guiding the algorithms that use it. The former may be more evidence that the radial camera axis division is a good heuristic, as higher damping values keep nodes closer to the original division. Substituting other initial placements into Swarming confirms that the radial placement outperforms all other methods. Fixing the node count does not significantly change the behaviour of damping, either (Figure A.5), though the flatter lines imply that the Vivaldi data set may contain a number of outliers.

The choice of size for ray bundles and the number of collision bins should have less of an effect on the system than damping. Figures 4.9 and 4.10 suggest that may not be the case. Radial and Bounded Radial show signs of excessive sampling, and actually perform best with a mere 131,072 ray segments cast between position updates; this is likely due to the effectiveness of collision binning, which adapts itself to the size of partitions and thus gives better data if updated quickly. The unrestricted algorithms generally show the opposite pattern, preferring bundle sizes of 4,194,304 ray segments, though this preference is slight for No Backtracking. Even the radial algorithms show complex behaviour, with performance improving again at the 4,194,304 mark.

Figure 4.8: The performance of the two radial algorithms, Vivaldi, and No Backtracking, as measured by efficiency (see text), when the damping amount is fixed. The radial algorithm data is from a subsample.

As for the radial camera axis algorithms, neither Radial nor Bounded Radial show any effect from varying the bin size. This supports the hypothesis that the entire parameter range offers more than enough bins for both algorithms.

4.4.2 Test Scenes

Figure 4.9: The performance of the two radial algorithms, Vivaldi, and No Backtracking, as measured by efficiency (see text), when the size of the ray bin is fixed. The radial algorithm data is from a subsample.

Figure 4.11 examines how each algorithm performs for each test scene (see Figure 4.1). Radial and Bounded Radial are superior for every scene, when comparing medians, and the latter consistently seems better. There also is no obvious pattern for when they fail; both "bmw27" and "scene-helicopter" have roughly the same geometry layout, yet are respectively the second-best and worst scenes for the radial algorithms. Those algorithms do not behave significantly better or worse when geometry is clustered together ("scene-helicopter") or spread over great distances ("benchmark").

Vivaldi serves as a wonderful illustration of the Yule-Simpson effect³, as it has a better median than the other “free” algorithms for every scene but “scene-helicopter.” It doesn’t show a clear pattern of behaviour from scene to scene. The other three “free” algorithms, in contrast, perform best with clustered geometry and worst with spread-out geometry. The certainty term in Vivaldi may be the reason for its unorthodox behaviour.

Planar does quite poorly on “fishy-cat.” The probable reason is a flat translucent plane between the camera and the scene geometry, parallel to the camera plane; this leads to an infinitely-thin but substantial cluster of ray collision points, which is difficult for this algorithm to balance. Otherwise, its median is consistently superior to the “free” algorithms.

³Breaking a statistic into categories can have unintuitive effects, to the point of reversing trends observed in aggregate data. This is also known as Simpson’s paradox [Sim51].

Figure 4.10: The performance of the two radial algorithms, as measured by efficiency (see text), when the number of bins to track collisions is fixed. The radial algorithm data is from a subsample. Efficiency is plotted against the number of bins per span with 16/84 confidence intervals; the best-fit line for Bounded Radial has m = -0.0109, b = 80.7480.

4.4.3 Network Transmission

Moving on to network transmissions, we find some surprises with the quality of fit to Model 3 listed in Table 4.2. Vivaldi comes out on top, while both Swarming and No Backtracking do quite poorly. Planar manages to beat both algorithms, as well.

The raw numbers of Table A.3 and the charted posteriors of Figure A.6 shed some light on this. Vivaldi’s waterline does have better than expected performance, but this can be due either to the algorithm being excellent at minimizing ray transfers, or to it tending to make large partitions which transfer few rays but create an unbalanced workload. Given its poor results from ray collisions, high σb and falloff, and low efficiency scores, the latter is much more likely. The waterline parameter is a poorer judge of performance than expected.

Figure 4.11: The performance of all algorithms, as measured by efficiency (see text), for each of the test scenes. The camera axis algorithm data is from a subsample.

4.4.4 Node Position

Figure 4.12 charts the position changes for the two radial algorithms, over a subset of the total data. The horizontal “scratches” indicate Radial has difficulties with node drift, while with a few exceptions Bounded Radial shows remarkably little movement. The overlaid black lines show that while the node locations do not stray far from their original positions, they do indeed move; the two nodes above the horizon tend to move downward, while the bottom-most swerves to either side to compensate for visual asymmetry. The “hole” in the upper-left of Radial’s graph may be an artifact of the global update algorithm; it begins from the 12 o’clock position and proceeds clockwise, so the position of the last node in the upper-left is determined more by those of surrounding nodes than its own collision bins. An error in the simulation code is also possible, though less likely.

Table 4.2: Fitness to Model 3, network transmissions, by algorithm. The camera axis algorithm data is from a subsample, and the “free” algorithms are initialized with radial camera axis. Values are rounded for presentation.

Category     Algorithm        Log Maximal Likelihood  Samples  LML / Samples
Free         Vivaldi                    746              1129      0.6622
Free         No Certainty             -19.0              2518     -0.0075
Camera Axis  Radial                   -6603             78806     -0.0838
Camera Axis  Bounded Radial           -8177             84887     -0.0963
Camera Axis  Planar                 -185547            237816     -0.7802
Free         Swarming                 -1322               644     -2.053
Free         No Backtracking          -1934               926     -2.089

Figure 4.12: The evolution of node positions for 300 runs each of the two radial camera axis algorithms, with the number of nodes fixed at five. Time is represented as a linear scaling factor away from the origin. Each panel plots position against time.

Figure 4.13 repeats this for the unrestricted algorithms. Vivaldi does surprisingly well, with the straight initial hops demonstrating rapid convergence to a consistent area. However, there is motion within that area, and the closeness of each node will magnify any shift into a significant boundary change. No Certainty is similar but shows even more node movement. Swarming is the worst of the unrestricted algorithms, with the fuzzy areas demonstrating oscillations that would cause a significant amount of geometry thrash. No Backtracking is much improved, but both algorithms demonstrate an odd leap from their starting positions. While not visible in the charts, this leap quickly converges to a specific spot and stays motionless. Interestingly, both algorithms have more space between nodes than either Vivaldi or No Certainty, thus their bounds are less sensitive to small perturbations.

Figure 4.13: The evolution of node positions for many runs each of the four unrestricted algorithms, with the radial initial conditions and number of nodes fixed at five. The view is orthographic, with the viewing plane perpendicular to the camera axis, and each panel uses the same scale. The scene used to generate the data was “bmw27,” and the number of samples used varied; both Swarming and No Backtracking used 300 simulation runs randomly drawn from the full data set, while Vivaldi and No Certainty only had 34 and 58 suitable runs in total, respectively.

4.5 Comparisons to Other Techniques

On the face of it, comparing spatial partitioning to either temporal or visual partitioning is an apples-to-oranges comparison. Each approach has a different philosophy towards hardware limitations, and chooses different trade-offs as a result. If the scene being rendered will fit in a server’s memory, spatial partitioning has no advantage over temporal partitioning, yet carries substantially more overhead. Conversely, if the scene is significantly larger than server memory then spatial partitioning may be the only feasible approach, even if that technique is less efficient than other partitioning approaches.

Nonetheless, there is value in verifying the trade-offs are as advertised. Other algorithms can be included in the simulation.

4.5.1 Visual Partitioning

As mentioned prior, visual partitioning involves dividing the responsibility by assigning specific pixels to specific servers. This implementation will be based on the work of DeMarle et al. [DGP04], with some modifications.

The visual area will be divided up into rectangular tiles, and these will be assigned to different nodes. While the options are not nearly as vast as in spatial partitioning, we still run into the problem of how to initially divide these tiles. The easiest option is to randomly scatter them across nodes, though this creates problems. There is a loose correlation between the visual and spatial placement of geometry, due to physical proximity and a high chance of similarly-oriented surface normals. A better approach assigns tiles in contiguous bands, and the earlier point that horizons tend to be horizontal suggests vertical bands would lead to a fairer distribution of geometry than the alternatives.
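The band-based initial assignment can be sketched as follows; this is an illustrative reconstruction rather than the simulation's code, and the tile-grid dimensions are assumptions:

```python
def assign_vertical_bands(tiles_x, tiles_y, node_count):
    """Assign each (x, y) tile to a node in contiguous vertical bands.

    Tiles are grouped by column, so each node receives a vertical strip
    of the image; horizontally adjacent strips go to adjacent nodes.
    """
    assignment = {}
    for x in range(tiles_x):
        node = x * node_count // tiles_x  # contiguous bands of columns
        for y in range(tiles_y):
            assignment[(x, y)] = node
    return assignment

# With 8 columns split across 4 nodes, each node owns two full columns.
bands = assign_vertical_bands(tiles_x=8, tiles_y=4, node_count=4)
```

Because each node's tiles share column boundaries, neighbouring tiles (and thus spatially nearby geometry) tend to land on the same node, which is the correlation the text relies on.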

DeMarle et al. used a work-stealing method to balance workloads between servers. A similar approach will be used here, where a greedy algorithm examines each tile according to the number of ray collisions underneath it and transfers them from one server to another, beginning from the one with the least collisions. While superior methods can be used to balance collisions, in practice there is enough variation in tile collision counts to ensure good performance. The above observation about visual continuity still applies, so it is wise to retain that during rebalancing.

Figure 4.14: A comparison of the six algorithm variants for visual partitioning. See the text for details. Efficiency is plotted with 16/84 confidence intervals for the Banded and Random placements under Work Sharing, Coherent Sharing, and No Sharing.

The remaining parameters are identical to the spatial partitioning algorithms. For that reason, the same methodology can be deployed. As there are a total of six variations of the underlying algorithm (random or vertical band initial placement, plus either no work sharing, work sharing via greedy algorithm, or visually coherent work sharing), each will be examined to determine the best.

Figure 4.14 presents the efficiency of each of the six variants. Contrary to the findings of DeMarle et al., fixing the tile assignments results in the greatest performance. One plausible explanation is that there is enough variation in the rays cast to create significant variance in collision counts, so the work-sharing algorithm constantly switches low-count blocks and reduces the overall efficiency. Another possibility is that DeMarle et al.’s result is a statistical fluke; this thesis tests more scenes, of a less synthetic nature, and employs a more rigorous methodology to separate out the effects of other parameters. A third is programmer error in the simulation, made more likely by the lack of a tangible rendered image.

Table 4.3: The median values from Model 5’s posterior, by algorithm.

Algorithm         Initialization   Median parameter values
Work-sharing      Vertical Bands   0.4246, 0.0272, 1.924·10⁻⁶, 0.2058, 0.3321
Coherent Sharing  Vertical Bands   0.2645, 0.0132, 6.6962·10⁻⁴, 0.3375, 0.2250
Fixed             Vertical Bands   0.4252, 0.0218, 3.1560·10⁻³, 0.2011, 0.3746
Work-sharing      Random           0.2926, 5.7698·10⁻³, 9.8987·10⁻⁴, 0.1886, 0.2470
Coherent Sharing  Random           0.3780, 9.6621·10⁻³, 2.5809·10⁻⁶, 0.1596, 0.3401
Fixed             Random           0.3732, 0.01656, 1.4192·10⁻⁵, 0.6546, 0.2737

Table 4.3 applies Model 5 to the amount of geometry contained in the system; these numbers are more in line with DeMarle et al. Work-sharing with visual coherency, combined with an initial placement of tiles in vertical bands, has the best maximum. Unsurprisingly, the same algorithm performs poorly when the tiles are randomly placed on start; the use of bands maximizes the number of tiles with a valid neighbour to join. The falloff variable appears to have little effect, with every algorithm clustering around the zero mark but with wide confidence intervals. A wide error term on the maximum and minimum suggests a number of confounding variables, however.

Chapter 5

Future Work and Conclusion

This chapter will summarize the prior results, consider other directions for exploration, and provide recommendations for implementation.

5.1 Summary

Of the seven algorithms checked, there was a clear preference for incorporating camera data. All of the “free” algorithms did best when they had the same initial placement as Radial and Bounded Radial, and those two algorithms do better than all others in general. They perform best without a damping parameter, and the choice of bin sizes has little effect on their performance. Of the two, Bounded Radial had the least server movement, which should minimize geometry transfers. Of the “free” algorithms, No Backtracking performed best, managing to outperform Bounded Radial when the system consisted of five or fewer servers. The “free” algorithms showed a preference for large damping values and bin sizes.

It is also worth mentioning that the statistical process used to analyze these results behaved well. The combination of Bayesian statistics and behavioural models made it possible to qualitatively assess algorithm convergence while also verifying model fitness. The posterior illustration in Figure 4.5, in particular, conveys a lot of information about aggregate behaviour in a single chart. Models also made it possible to incorporate results from sample runs that ended prematurely.

5.2 Local Algorithms

All of the algorithms simulated were global, in that they had full information about the state of the system and acted in a synchronous manner. Each of them could be converted into local algorithms, which do not possess full information nor synchronize; the latter in particular may cause issues when one server moves within the phase space and invalidates the bounds recorded on others.

There is good reason to believe local algorithms are well-approximated by global variants, however. Even if we have 128 nodes partitioning the geometry via an unrestricted algorithm, a complete description of their location and workload requires four float values each, or 2048 bytes. This occupies slightly more than 3% of a typical TCP/IP v4 packet. As servers are constantly swapping ray data back and forth, embedding this information in every packet sharing ray data costs minimal bandwidth yet traverses the entire network rapidly. Alternative schemes that only share subsets of this information as it changes reduce bandwidth further. This effectively makes global information available to each node.
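The bandwidth arithmetic can be checked directly; this sketch assumes one 32-bit float per coordinate plus one for workload, matching the four-floats-per-node figure, and compares against the 65,535-byte maximum size of an IPv4 packet:

```python
import struct

NODE_COUNT = 128

def pack_node_table(nodes):
    """Serialize (x, y, z, workload) tuples as little-endian 32-bit floats."""
    return b"".join(struct.pack("<4f", *n) for n in nodes)

nodes = [(0.0, 0.0, 0.0, 1.0)] * NODE_COUNT
blob = pack_node_table(nodes)
# 128 nodes * 4 floats * 4 bytes = 2048 bytes, about 3.1% of 65,535.
overhead_pct = len(blob) / 65535 * 100
```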

5.3 Additional Algorithms

One way to minimize geometry transfer is to alter the damping term to have an exponential decay towards increasing inertia. A few rounds of sampling should achieve a near-optimal partition of the scene, so long as the rays cast are properly randomized or scattered, thus additional imbalances are most likely caused by sampling error rather than genuine inequity. There are a number of ways to accomplish this, such as tying the decay to the number of rays cast, the amount of movement in nearby nodes, and even the number of nodes working on the scene.
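One possible realization of that decay, assuming it is tied to the number of update rounds; the half-life constant and function names are hypothetical:

```python
import math

def damped_step(position, proposed, round_index, half_life=8.0):
    """Move a node toward its proposed position, with the step size
    decaying exponentially so inertia grows as rounds accumulate."""
    scale = math.exp(-math.log(2.0) * round_index / half_life)
    return tuple(p + scale * (q - p) for p, q in zip(position, proposed))

# At round 0 the node moves the full proposed distance; after one
# half-life (round 8 here) it only moves half of it.
full = damped_step((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), round_index=0)
half = damped_step((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), round_index=8)
```

The same scale factor could instead be driven by rays cast or by neighbour movement, as the text suggests; only the exponent's argument changes.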

5.4 Voxelization

One reason why Radial and Bounded Radial perform so well is their use of collision binning. While this is much easier to do in a one-dimensional space, it is still possible in a three-dimensional one. Consider an algorithm that modifies Algorithm 2 to incorporate collision data captured from the voxelized Voronoi cell¹. This has the same basic behaviour, but the additional data allows a more precise retreat from the area causing excessive work. As voxels are not used directly, the convex-everywhere property is preserved.

¹See Algorithm 4 for a potential high-level description.

Other variants are possible, such as an algorithm which “grows” outward from one of the least occupied voxels to form a new Voronoi cell. Voxels need not be used, either; if the implementation breaks up the geometry into a Bounding Volume Hierarchy with network-friendly leaves, collisions can instead be tallied according to the leaf which contained the geometry. The growing process would leave behind specific geometry to be transferred to neighbours, without the need for a separate pass to handle it. As a bonus, leaves with an excess of collisions could be subdivided to allow for a fairer distribution of geometry across nodes, helping tremendously with scenes like “fishy-cat” that spread low-detail geometry over a significant proportion of screen space.
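Per-voxel collision tallying might look like the following sketch, which assumes collisions arrive as 3D points inside the node's cell; the grid resolution is arbitrary, and the "retreat" decision is reduced to locating the busiest voxel:

```python
from collections import Counter

def tally_voxels(collision_points, cell_min, cell_max, resolution=8):
    """Bin collision points into a voxel grid over the cell's bounding box."""
    counts = Counter()
    spans = [hi - lo for lo, hi in zip(cell_min, cell_max)]
    for p in collision_points:
        voxel = tuple(
            min(resolution - 1, int((c - lo) / span * resolution))
            for c, lo, span in zip(p, cell_min, spans)
        )
        counts[voxel] += 1
    return counts

# Two collisions cluster near one corner; that voxel is the one an
# overloaded node would retreat from.
points = [(0.1, 0.1, 0.1), (0.11, 0.12, 0.1), (0.9, 0.9, 0.9)]
counts = tally_voxels(points, (0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
busiest = max(counts, key=counts.get)
```

Swapping the voxel key for a BVH leaf identifier gives the leaf-tallying variant described above without changing the counting logic.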

5.5 Conclusions

On the whole, spatial partitioning has significant promise for rendering large scenes. Two algorithms stood out: one which used the camera’s location as a heuristic, and a simple swarming approach. The former demonstrated remarkable scalability when simulated, easily scaling to 127 servers with little loss of efficiency. Many of the tuning parameters had little effect on the algorithm’s performance, and the camera axis algorithms did not show obvious weaknesses relating to the scene layout.

80 Bibliography

[AB15] Ruby Annette and Aisha Banu. A Service Broker Model for Cloud based Render

Farm Selection. arXiv:1505.06542 [cs], May 2015. arXiv: 1505.06542.

[AGSS89] Alok Aggarwal, Leonidas J Guibas, James Saxe, and Peter W Shor. A linear-time

algorithm for computing the voronoi diagram of a convex polygon. Discrete &

Computational Geometry, 4(6):591–604, 1989.

[Arv86] James Arvo. Backward ray tracing. In Developments in Ray Tracing, Computer

Graphics, Proc. of ACM SIGGRAPH 86 Course Notes, pages 259–263, 1986.

[Aur91] Franz Aurenhammer. Voronoi Diagramsa Survey of a Fundamental Geometric Data

Structure. ACM Comput. Surv., 23(3):345–405, September 1991.

[Bec67] Petr Beckmann. Scattering of Light by Rough Surfaces. In E. Wolf, editor, Progress

in Optics, volume 6, pages 53–69. Elsevier, January 1967.

[Ben75] Jon Louis Bentley. Multidimensional binary search trees used for associative search-

ing. Communications of the ACM, 18(9):509–517, 1975.

[BK70] J. Bouknight and K. Kelley. An Algorithm for Producing Half-tone Computer

Graphics Presentations with Shadows and Movable Light Sources. In Proceedings

of the May 5-7, 1970, Spring Joint Computer Conference, AFIPS ’70 (Spring), pages

1–10, New York, NY, USA, 1970. ACM.

[BN76] James F. Blinn and Martin E. Newell. Texture and Reflection in Computer Generated

Images. Commun. ACM, 19(10):542–547, October 1976.

[Cam41] Leon Campbell. Annie Jump Cannon. Popular Astronomy, 49(7):345, August 1941.

81 [Cat74] Edwin Catmull. A Subdivision Algorithm for Computer Display of Curved Sur-

faces. Technical Report UTEC-CSC-74-133, UTAH UNIV SALT LAKE CITY

SCHOOL OF COMPUTING, UTAH UNIV SALT LAKE CITY SCHOOL OF

COMPUTING, December 1974.

[CBM+08] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, and

Kevin Skadron. A performance study of general-purpose applications on graph-

ics processors using CUDA. Journal of Parallel and Distributed Computing,

68(10):1370–1380, October 2008.

[CCC87] Robert L. Cook, , and Edwin Catmull. The Reyes Image Rendering

Architecture. In Proceedings of the 14th Annual Conference on Computer Graphics

and Interactive Techniques, SIGGRAPH ’87, pages 95–102, New York, NY, USA,

1987. ACM.

[CCWG88] Michael F. Cohen, Shenchang Eric Chen, John R. Wallace, and Donald P. Green-

berg. A Progressive Refinement Approach to Fast Radiosity Image Generation. In

Proceedings of the 15th Annual Conference on Computer Graphics and Interactive

Techniques, SIGGRAPH ’88, pages 75–84, New York, NY, USA, 1988. ACM.

[CDE+14] Zina H. Cigolle, Sam Donow, Daniel Evangelakos, Michael Mara, Morgan McGuire,

and Quirin Meyer. A survey of efficient representations for independent unit vectors.

Journal of Computer Graphics Techniques, 3(2), 2014.

[CNS+11] Cyril Crassin, Fabrice Neyret, Miguel Sainz, Simon Green, and Elmar Eisemann.

Interactive indirect illumination using voxel cone tracing. In Computer Graphics

Forum, volume 30, pages 1921–1930. Wiley Online Library, 2011. 00000.

[Cou11] National Research Council. The Future of Computing Performance: Game Over or

Next Level? The National Academies Press, Washington, DC, 2011.

82 [CR07] Ozan Cakmakci and Jannick Rolland. Design and fabrication of a dual-element off-

axis near-eye optical magnifier. Optics letters, 32(11):1363–1365, 2007.

[CT81] Robert L. Cook and Kenneth E. Torrance. A Reflectance Model for Computer

Graphics. In Proceedings of the 8th Annual Conference on Computer Graphics

and Interactive Techniques, SIGGRAPH ’81, pages 307–316, New York, NY, USA,

1981. ACM.

[DCKM04] Frank Dabek, Russ Cox, Frans Kaashoek, and Robert Morris. Vivaldi: A decen-

tralized network coordinate system. In ACM SIGCOMM Computer Communication

Review, volume 34, pages 15–26. ACM, 2004.

[DGP04] David E. DeMarle, Christiaan P. Gribble, and Steven G. Parker. Memory-Savvy

Distributed Interactive Ray Tracing. In EGPGV, pages 93–100. Citeseer, 2004.

[ECMM16] Sharif Elcott, Kay Chang, Masayoshi Miyamoto, and Napaporn Metaaphanon. Ren-

dering Techniques of Final Fantasy XV. In ACM SIGGRAPH 2016 Talks, SIG-

GRAPH ’16, pages 48:1–48:2, New York, NY, USA, 2016. ACM.

[ELYM78] Lindon J Eaves, Krystyna A Last, Phillip A Young, and Nick G Martin. Model-

fitting approaches to the analysis of human behaviour. Heredity, 41(3):249, 1978.

[ENSB13] Christian Eisenacher, Gregory Nichols, Andrew Selle, and Brent Burley. Sorted

deferred shading for production path tracing. In Computer Graphics Forum, vol-

ume 32, pages 125–132. Wiley Online Library, 2013.

[F13] Ian F. 3 Trillion Polygons Used To Make The Torus For Elysium, August 2013.

[FB13] Wei Fan and Albert Bifet. Mining big data: current status, and forecast to the future.

ACM sIGKDD Explorations Newsletter, 14(2):1–5, 2013.

83 [FMHLG13] Daniel Foreman-Mackey, David W. Hogg, Dustin Lang, and Jonathan Goodman.

emcee: The MCMC Hammer. Publications of the Astronomical Society of the Pa-

cific, 125(925):306–312, March 2013. arXiv: 1202.3665.

[FMHR87] Scott S Fisher, Micheal McGreevy, James Humphries, and Warren Robinett. Virtual

environment display system. In Proceedings of the 1986 workshop on Interactive

3D graphics, pages 77–87. ACM, 1987.

[Foua] Blender Foundation. Blender Foundation.

[Foub] Blender Foundation. Demo Files.

[FPE+89] Henry Fuchs, John Poulton, John Eyles, Trey Greer, Jack Goldfeather, David

Ellsworth, Steve Molnar, Greg Turk, Brice Tebbs, and Laura Israel. Pixel-planes

5: A Heterogeneous Multiprocessor Graphics System Using Processor-enhanced

Memories. In Proceedings of the 16th Annual Conference on Computer Graphics

and Interactive Techniques, SIGGRAPH ’89, pages 79–88, New York, NY, USA,

1989. ACM.

[GCSR95] Andrew Gelman, John B Carlin, Hal S Stern, and Donald B Rubin. Bayesian data

analysis. Chapman and Hall/CRC, 1995.

[Gee05] D. Geer. Chip makers turn to multicore processors. Computer, 38(5):11–13, May

2005.

[GN94] J. Gray and C. Nyberg. Desktop batch processing. pages 206–211. IEEE, 1994.

[Gri13] David Alan Grier. When computers were human. Princeton University Press, 2013.

[GTGB84] Cindy M. Goral, Kenneth E. Torrance, Donald P. Greenberg, and Bennett Battaile.

Modeling the Interaction of Light Between Diffuse Surfaces. In Proceedings of the

11th Annual Conference on Computer Graphics and Interactive Techniques, SIG-

GRAPH ’84, pages 213–222, New York, NY, USA, 1984. ACM.

84 [GTLH98] Andr Guziec, Gabriel Taubin, Francis Lazarus, and William Horn. Converting Sets

of Polygons to Manifold Surfaces by Cutting and Stitching. In Proceedings of the

Conference on Visualization ’98, VIS ’98, pages 383–390, Los Alamitos, CA, USA,

1998. IEEE Computer Society Press.

[HKW09] A. Hameurlain, J. Kung,¨ and R. Wagner. Transactions on Large-Scale Data- and

Knowledge-Centered Systems I. Lecture Notes in Computer Science. Springer

Berlin Heidelberg, 2009.

[HMJ+16] Stephen Hill, Stephen McAuley, Cyril Jover, Sbastien Lachambre, Angelo Pesce,

and Xian-Chun Wu. Physically Based Shading in Theory and Practice. In ACM

SIGGRAPH 2016 Courses, SIGGRAPH ’16, New York, NY, USA, 2016. ACM.

[IBH11] Thiago Ize, Carson Brownlee, and Charles D. Hansen. Real-time ray tracer for

visualizing massive models on a cluster. In EGPGV, pages 61–69, 2011.

[Jak10] Wenzel Jakob. Mitsuba renderer, 2010. http://www.mitsuba-renderer.org.

[JC95] Henrik Wann Jensen and Niels Jrgen Christensen. Photon maps in bidirectional

Monte Carlo ray tracing of complex objects. Computers & Graphics, 19(2):215–

224, March 1995.

[JK76] Edwin T Jaynes and Oscar Kempthorne. Confidence intervals vs bayesian intervals.

In Foundations of probability theory, statistical inference, and statistical theories of

science, pages 196–201. Springer, 1976.

[Kaj86] James T Kajiya. The rendering equation. In ACM Siggraph Computer Graphics,

volume 20, pages 143–150. ACM, 1986.

[Kat03] Toshi Kato. ”Kilauea” - parallel global illumination renderer. Parallel Computing,

29(3):289–310, March 2003. 00023.

85 [KFF+15] Alexander Keller, Luca Fascione, Marcos Fajardo, Iliyan Georgiev, Per H Chris-

tensen, Johannes Hanika, Christian Eisenacher, and Gregory Nichols. The path trac-

ing revolution in the movie industry. In SIGGRAPH Courses, pages 24–1, 2015.

[Koz18] Alicia Kozma. Downloading Soon to a Theater Near You: Digital Film, Local Exhi-

bition, and the Death of 35mm. The Projector; Bowling Green, 18(1):39–70, 2018.

[KT11] Volodymyr Kindratenko and Pedro Trancoso. Trends in High-Performance Com-

puting. Computing in Science & Engineering, 13(3):92–95, May 2011.

[Lam60] Johann Heinrich Lambert. Photometria Sive De Mensura Et Gradibus Luminis,

Colorum Et Umbrae. Klett, 1760. Google-Books-ID: fBlmAAAAcAAJ.

[Lea18] Antony Leather. AMD 32-Core Threadripper 2990wx And 16-Core 2950x Reviews:

Most Powerful Ever Desktop Processors?, August 2018.

[Leg05] A.M. Legendre. Nouvelles methodes´ pour la determination´ des orbites des

cometes` . Nineteenth Century Collections Online (NCCO): Science, Technology,

and Medicine: 1780-1925. F. Didot, 1805.

[LGF04] Frank Losasso, Fred´ eric´ Gibou, and Ron Fedkiw. Simulating water and smoke with

an octree data structure. In ACM Transactions on Graphics (TOG), volume 23, pages

457–462. ACM, 2004.

[Lus16] Germain Lussier. One Animal in Zootopia Has More Individual Hairs Than Every

Character in Combined, 2016.

[LW04] Hedibert Freitas Lopes and Mike West. Bayesian model assessment in factor analy-

sis. Statistica Sinica, pages 41–67, 2004.

[MA14] Luiz Monnerat and Claudio L. Amorim. An effective single-hop distributed hash

table with high lookup performance and low traffic overhead. arXiv:1408.7070 [cs],

August 2014. arXiv: 1408.7070.

86 [MB17] D. Meister and J. Bittner. Parallel Locally-Ordered Clustering for Bounding Vol-

ume Hierarchy Construction. IEEE Transactions on Visualization and Computer

Graphics, PP(99):1–1, 2017.

[MBT+18] D. Meneveaux, B. Bringier, E. Tauzia, M. Ribardire, and L. Simonot. Rendering

Rough Opaque Materials with Interfaced Lambertian Microfacets. IEEE Transac-

tions on Visualization and Computer Graphics, 24(3):1368–1380, March 2018.

[MC04] H. Motulsky and A. Christopoulos. Fitting Models to Biological Data Using Linear

and Nonlinear Regression: A Practical Guide to Curve Fitting. Oxford University

Press, USA, 2004.

[Mea80] Donald JR Meagher. Octree encoding: A new technique for the representation, ma-

nipulation and display of arbitrary 3-d objects by computer. Electrical and Systems

Engineering Department Rensseiaer Polytechnic Institute Image Processing Labora-

tory, 1980.

[Mik09] Mike Seymour. 1984 Pool Balls 25 Years Later, May 2009.

[Mit76] Helen Buss Mitchell. Henrietta swan leavitt and cepheid variables. The Physics

Teacher, 14(3):162–167, 1976.

[MN89] P. McCullagh and J.A. Nelder. Generalized Linear Models, Second Edition. Chap-

man & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis,

1989.

[Mor66] Guy M. Morton. A computer oriented geodetic data base and a new technique in file

sequencing. 1966.

[MRC+86] Gary W. Meyer, Holly E. Rushmeier, Michael F. Cohen, Donald P. Greenberg, and

Kenneth E. Torrance. An Experimental Evaluation of Computer Graphics Imagery.

ACM Trans. Graph., 5(1):30–50, January 1986.

87 [Nel08] Sue Nelson. Big data: the harvard computers. Nature, 455(7209):36, 2008.

[noaa] Gyoukou - ZettaScaler-2.2 HPC system, Xeon D-1571 16c 1.3ghz, Infiniband EDR,

PEZY-SC2 700mhz TOP500 Supercomputer Sites. |

[noab] Inspur SA5212h5, Xeon E5-2682v4 16c 2.5ghz, NVIDIA Tesla P100, 25g Ethernet

TOP500 Supercomputer Sites. |

[noac] November 2017 TOP500 Supercomputer Sites. |

[noad] S-3800/180 TOP500 Supercomputer Sites. |

[noae] SR11000-K2 TOP500 Supercomputer Sites. |

[NZ02] TS Eugene Ng and Hui Zhang. Predicting Internet network distance with

coordinates-based approaches. In INFOCOM 2002. Twenty-First Annual Joint Con-

ference of the IEEE Computer and Communications Societies. Proceedings. IEEE,

volume 1, pages 170–179. IEEE, 2002.

[ON94] Michael Oren and Shree K. Nayar. Seeing beyond Lambert’s law. In Jan-Olof

Eklundh, editor, Computer Vision ECCV ’94, volume 801, pages 269–280. Springer-

Verlag, Berlin/Heidelberg, 1994.

[PFHA10] Jacopo Pantaleoni, Luca Fascione, Martin Hill, and Timo Aila. PantaRay: Fast Ray-

traced Occlusion Caching of Massive Scenes. In ACM SIGGRAPH 2010 Papers,

SIGGRAPH ’10, pages 37:1–37:10, New York, NY, USA, 2010. ACM.

[PGSS07] Stefan Popov, Johannes Gunther,¨ Hans-Peter Seidel, and Philipp Slusallek. Stack-

less kd-tree traversal for high performance gpu ray tracing. In Computer Graphics

Forum, volume 26, pages 415–424. Wiley Online Library, 2007.

[PKGH97] Matt Pharr, Craig Kolb, Reid Gershbein, and . Rendering complex

scenes with memory-coherent ray tracing. In Proceedings of the 24th annual con-

88 ference on Computer graphics and interactive techniques, pages 101–108. ACM

Press/Addison-Wesley Publishing Co., 1997.

[PS12] F.P. Preparata and M.I. Shamos. Computational Geometry: An Introduction. Mono-

graphs in Computer Science. Springer New York, 2012.

[Raf95] Adrian E Raftery. Bayesian model selection in social research. Sociological

methodology, pages 111–163, 1995.

[RCJ99] Erik Reinhard, Alan Chalmers, and Frederik W. Jansen. Hybrid scheduling for paral-

lel rendering using coherent ray tasks. In Proceedings of the 1999 IEEE symposium

on Parallel visualization and graphics, pages 21–28. IEEE Computer Society, 1999.

[RGR17] David Reinsel, John Gantz, and John Rydning. Data Age 2025: The Evolution of

Data to Life-Critical. Dont Focus on Big Data, 2017.

[RKY+02] Sylvia Ratnasamy, Brad Karp, Li Yin, Fang Yu, Deborah Estrin, Ramesh Govindan,

and Scott Shenker. GHT: a geographic hash table for data-centric storage. In Pro-

ceedings of the 1st ACM international workshop on Wireless sensor networks and

applications, pages 78–87. ACM, 2002.

[Rob16] Robbie Collin. Why invisible effects are Hollywood’s best kept secret. The Tele-

graph, January 2016.

[Rus97] John Rust. Using Randomization to Break the Curse of Dimensionality. Economet-

rica, 65(3):487–516, 1997.

[RW80] Steven M. Rubin and Turner Whitted. A 3-dimensional representation for fast ren-

dering of complex scenes. In ACM SIGGRAPH Computer Graphics, volume 14,

pages 110–116. ACM, 1980.

[Sah75] Sartaj Sahni. Approximate Algorithms for the 0/1 Knapsack Problem. J. ACM,

22(1):115–124, January 1975.

[SBGS69] Robert A. Schumacker, Brigitta Brand, Maurice G. Gilliland, and Werner H. Sharp. Study for applying computer-generated images to visual simulation. Technical report, General Electric Co., Daytona Beach, FL, Apollo and Ground Systems, 1969.

[SDB85] L. Richard Speer, Tony D. DeRose, and Brian A. Barsky. A Theoretical and Empirical Analysis of Coherent Ray-Tracing. In Computer-Generated Images, pages 11–25. Springer, Tokyo, 1985.

[Sim51] Edward H Simpson. The interpretation of interaction in contingency tables. Journal

of the Royal Statistical Society. Series B (Methodological), pages 238–241, 1951.

[SM13] M. Sugawara and K. Masaoka. UHDTV Image Format for Better Visual Experience.

Proceedings of the IEEE, 101(1):8–17, January 2013.

[Smi84] Alvy Ray Smith. The Making of Andre & Wally B. Computer Graphics Department, Computer Division, Lucasfilm. Retrieved from http://alvyray.com/Memos/CG/Lucasfilm/Andre&WallyB_TheMakingOf.pdf, page 8, August 1984.

[Smi17] Ryan Smith. NVIDIA Volta Unveiled: GV100 GPU and Tesla V100 Accelerator

Announced, May 2017.

[SRL06] Ivan Stojmenovic, Anand Prakash Ruhil, and D.K. Lobiyal. Voronoi diagram and convex hull based geocasting and routing in wireless networks. Wireless communications and mobile computing, 6(2):247–258, 2006.

[TS67] K. E. Torrance and E. M. Sparrow. Theory for Off-Specular Reflection From Roughened Surfaces. JOSA, 57(9):1105–1114, September 1967.

[UDH16] Francisco Utray Delgado and Gerald Hooper. Production and delivery in Ultra HD

and 4k. 2016.

[Whi79] Turner Whitted. An improved illumination model for shaded display. In ACM SIGGRAPH Computer Graphics, volume 13, page 14. ACM, 1979.

[Wil78] Lance Williams. Casting Curved Shadows on Curved Surfaces. In Proceedings

of the 5th Annual Conference on Computer Graphics and Interactive Techniques,

SIGGRAPH ’78, pages 270–274, New York, NY, USA, 1978. ACM.

[WMLT07] Bruce Walter, Stephen R. Marschner, Hongsong Li, and Kenneth E. Torrance. Microfacet models for refraction through rough surfaces. In Proceedings of the 18th Eurographics conference on Rendering Techniques, pages 195–206. Eurographics Association, 2007.

[WSB01] Ingo Wald, Philipp Slusallek, and Carsten Benthin. Interactive distributed ray tracing of highly complex models. Springer, 2001.

[WWB+14] Ingo Wald, Sven Woop, Carsten Benthin, Gregory S. Johnson, and Manfred Ernst.

Embree: a kernel framework for efficient CPU ray tracing. ACM Transactions on

Graphics (TOG), 33(4):143, 2014.

[WZL11] Zhefeng Wu, Fukai Zhao, and Xinguo Liu. SAH KD-tree construction on GPU. In Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics, pages 71–78. ACM, 2011.

[ZGHG11] Kun Zhou, Minmin Gong, Xin Huang, and Baining Guo. Data-parallel octrees for surface reconstruction. IEEE Transactions on Visualization and Computer Graphics, 17(5):669–681, 2011.

[ZJL+15] M. Zwicker, W. Jarosz, J. Lehtinen, B. Moon, R. Ramamoorthi, F. Rousselle, P. Sen, C. Soler, and S.-E. Yoon. Recent Advances in Adaptive Sampling and Reconstruction for Monte Carlo Rendering. Computer Graphics Forum, 34(2):667–681, May 2015.

Appendix A

Additional Figures

Test Scene     Ray Segments     Rays           Dimension      Mean Rays per Pixel
BMW            33,781,572       9,863,228      960 x 540      19.03
Fishy Cat      49,855,072       13,659,871     1002 x 460     29.64
Helicopter     113,231,324      34,660,279     960 x 540      66.86
Pavillion      294,546,498      65,851,125     1280 x 720     71.45
Classroom      451,666,004      137,275,796    1920 x 1080    66.20
Benchmark      557,972,808      166,495,129    2048 x 858     94.75
Pokedstudio    1,122,660,076    285,833,041    1127 x 620     409.1

Table A.1: Statistics for each of the seven scenes used in this paper.
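The last column follows from the others: mean rays per pixel is the ray count divided by the pixel count. A quick check against three rows of Table A.1:

```python
# Verify the "Mean Rays per Pixel" column: rays / (width * height).
scenes = {
    "BMW":         (9_863_228,   960,  540),
    "Fishy Cat":   (13_659_871,  1002, 460),
    "Pokedstudio": (285_833_041, 1127, 620),
}

for name, (rays, w, h) in scenes.items():
    print(f"{name}: {rays / (w * h):.2f} rays per pixel")
```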

diff --git a/intern/cycles/kernel/kernel_path.h b/intern/cycles/kernel/kernel_path.h
index c2421c1ec18..c4b15618897 100644
--- a/intern/cycles/kernel/kernel_path.h
+++ b/intern/cycles/kernel/kernel_path.h
@@ -48,6 +48,13 @@
 #include "kernel/kernel_path_volume.h"
 #include "kernel/kernel_path_subsurface.h"

+/* INSERTION */
+#ifndef WRITELOCK
+#define WRITELOCK
+#include <mutex>
+static std::mutex writeLock;
+#endif
+
 CCL_NAMESPACE_BEGIN

 ccl_device_forceinline bool kernel_path_scene_intersect(
@@ -551,14 +558,22 @@ ccl_device_forceinline void kernel_path_integrate(
 	SubsurfaceIndirectRays ss_indirect;
 	kernel_path_subsurface_init_indirect(&ss_indirect);

+	/* INSERTION */
+	bool hit;
+	for(;;) {
 #endif  /* __SUBSURFACE__ */

 	/* path iteration */
 	for(;;) {
+
+		fprintf(stderr, "<%.12f,%.12f,%.12f>[%.12f,%.12f,%.12f]",
+		        (double)ray->P.x, (double)ray->P.y, (double)ray->P.z,
+		        (double)ray->D.x, (double)ray->D.y, (double)ray->D.z);
+
 		/* Find intersection with objects in scene. */
 		Intersection isect;
-		bool hit = kernel_path_scene_intersect(kg, state, ray, &isect, L);
+		hit = kernel_path_scene_intersect(kg, state, ray, &isect, L);

 		/* Find intersection with lamps and compute emission for MIS. */
 		kernel_path_lamp_emission(kg, state, ray, throughput, &isect, &sd, L);
@@ -678,6 +693,11 @@ ccl_device_forceinline void kernel_path_integrate(
 		}
 	}
 #endif  /* __SUBSURFACE__ */
+	if(hit)
+		fprintf(stderr, "h");
+	else
+		fprintf(stderr, "m");
+	}

 ccl_device void kernel_path_trace(KernelGlobals *kg,
@@ -712,6 +732,10 @@ ccl_device void kernel_path_trace(KernelGlobals *kg,
 	PathState state;
 	path_state_init(kg, emission_sd, &state, rng_hash, sample, &ray);

+	/* INSERTION */
+	writeLock.lock();
+	fprintf(stderr, "(%d,%d)", x, y);
+
 	/* Integrate. */
 	kernel_path_integrate(kg,
 	                      &state,
@@ -721,6 +745,9 @@ ccl_device void kernel_path_trace(KernelGlobals *kg,
 	                      buffer,
 	                      emission_sd);

+	fprintf(stderr, "\n");
+	fflush(stderr);
+	writeLock.unlock();
 	kernel_write_result(kg, buffer, sample, &L);
 }

Figure A.1: The main patch used to capture ray data from the Cycles rendering engine.

Category     Algorithm        Half-span                               Offset                                       Error
Free         Vivaldi          1.3495×10^-2 (+0.2129, −1.3377×10^-2)   4.8389×10^-2 (+3.5132×10^-3, −4.0701×10^-3)  0.1482 (+2.4495×10^-3, −2.8920×10^-3)
Free         No Certainty     1.3866×10^-2 (+0.2686, −1.3565×10^-2)   2.1704×10^-2 (+1.8375×10^-3, −2.0118×10^-3)  0.1042 (+1.3623×10^-3, −1.1719×10^-3)
Free         Swarming         3.0908×10^-2 (+0.2742, −2.9094×10^-2)   5.7573×10^-2 (+5.0296×10^-4, −5.0198×10^-4)  0.1411 (+3.6145×10^-4, −3.0345×10^-4)
Free         No Backtracking  0.1081 (+0.3706, −9.1059×10^-2)         1.2343×10^-2 (+2.4740×10^-4, −2.2993×10^-4)  7.6013×10^-2 (+1.7771×10^-4, −1.6026×10^-4)
Camera Axis  Planar           0.1665 (+0.4168, −0.1406)               0.0188 (+2.3721×10^-4, −2.4107×10^-4)        0.1537 (+1.6113×10^-4, −1.6877×10^-4)
Camera Axis  Radial           0.2806 (+0.4428, −0.2134)               1.5058×10^-3 (+1.3433×10^-4, −1.2875×10^-4)  0.0520 (+1.0013×10^-4, −9.3449×10^-5)
Camera Axis  Bounded Radial   0.2215 (+0.4280, −0.1768)               2.6288×10^-3 (+1.3063×10^-4, −1.2116×10^-4)  4.7170×10^-2 (+8.4685×10^-5, −8.9551×10^-5)

Table A.2: The median, 16th and 84th percentiles (expressed as +/− offsets around the median) from Model 2's posterior, by algorithm. Based on a random subsample.

Category     Algorithm        Waterline                                    Final Error                              Falloff
Free         Vivaldi          6.713×10^-2 (+5.4723×10^-3, −1.0702×10^-2)   0.1487 (+6.3056×10^-2, −0.1486)          0.9566 (+6.0960×10^-2, −2.4944×10^-2)
Free         No Certainty     0.1135 (+8.0326×10^-4, −9.1796×10^-4)        0.2280 (+2.5309×10^-3, −2.0618×10^-3)    0.7989 (+2.6181×10^-3, −2.4582×10^-3)
Free         Swarming         3.8548×10^-7 (+5.0782×10^-4, −4.6142×10^-5)  1.4693 (+8.8791×10^-4, −9.3839×10^-2)    2.9577 (+8.2246×10^-2, −0.1233)
Free         No Backtracking  3.2990×10^-7 (+1.2733×10^-7, −2.5775×10^-8)  0.9641 (+1.5198, −6.8747×10^-2)          3.0996 (+9.8997×10^-2, −4.0698×10^-2)
Camera Axis  Planar           0.0719 (+3.4941×10^-5, −3.6062×10^-5)        0.5343 (+1.5319×10^-4, −1.5394×10^-4)    1.3212 (+1.5955×10^-4, −1.6061×10^-4)
Camera Axis  Radial           0.0811 (+3.9034×10^-5, −3.9790×10^-5)        0.5165 (+1.4614×10^-4, −1.6268×10^-4)    0.9762 (+1.5612×10^-4, −1.4323×10^-4)
Camera Axis  Bounded Radial   0.1079 (+2.3774×10^-4, −2.3437×10^-4)        0.2665 (+6.8103×10^-4, −6.4071×10^-4)    0.9905 (+7.3389×10^-4, −7.9232×10^-4)

Table A.3: Select median, 16th and 84th percentiles from Model 3's posterior, by algorithm.

Algorithm 4: Load balancing within the metric space, via a voxelized swarming algorithm.
input : A list of peers and workloads.
input : The geometry this node is responsible for.
output: A new location for the current node.

1  for Each atom of geometry do
2      Update the bounding box of all geometry on this node;
3  end
4  Voxelize the calculated bounding box;
5  for Each ray which collides with geometry, up to a limit do
6      Increment the voxel corresponding to the collision point;
7  end
8  Calculate the average queue time of all peers, ourselves included;
9  if Our queue time is above-average then
10     for Each voxel do
11         if The voxel recorded at least one collision then
12             Add it to a weighted average of voxel centroids, with the inverse of the collision tally as the weight;
13         else
14             Add it to an average of voxel centroids;
15         end
16     end
17     Interpolate between the two averages to arrive at the new location;
18 else
19     Calculate a weighted average of node positions, with the amount of time taken to exhaust their path ray queue as the weight;
20     Subtract our queue time from that, and divide the result by the average;
21     Use that value as the weighting for a linear interpolation between our position and the weighted average, which will become our proposed new location;
22     if The proposed location falls outside the Voronoi cell we occupied then
23         Form a line between the proposed new location and our old one, and find the point along that line furthest from our original position which is within the old Voronoi cell;
24     end
25 end
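The relocation step for an overloaded node (lines 9–17) can be sketched as follows; the voxel layout and the interpolation parameter are illustrative, as the text does not fix them:

```python
def relocate_overloaded(voxels, t=0.5):
    """Sketch of Algorithm 4's move for an above-average queue time.
    voxels: list of (centroid, collision_tally) pairs. Returns a point
    between (a) the inverse-tally-weighted average of voxels that saw
    collisions and (b) the plain average of empty voxels."""
    hit_acc, hit_w = [0.0, 0.0, 0.0], 0.0
    miss_acc, miss_n = [0.0, 0.0, 0.0], 0
    for centroid, tally in voxels:
        if tally > 0:
            w = 1.0 / tally              # inverse of the collision tally
            for i in range(3):
                hit_acc[i] += w * centroid[i]
            hit_w += w
        else:
            for i in range(3):
                miss_acc[i] += centroid[i]
            miss_n += 1
    if hit_w == 0:
        return [c / miss_n for c in miss_acc]
    if miss_n == 0:
        return [c / hit_w for c in hit_acc]
    hit_avg = [c / hit_w for c in hit_acc]
    miss_avg = [c / miss_n for c in miss_acc]
    # Interpolate between the two averages (line 17 of Algorithm 4).
    return [(1 - t) * h + t * m for h, m in zip(hit_avg, miss_avg)]
```

With t = 0 the node stays over its busiest geometry; t = 1 pushes it toward empty space.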

Figure A.2: The performance of each of the unrestricted algorithms, as measured by efficiency and three variables from Model 2 (half-span, offset, and error, see text for details), across all four initial node placement algorithms.


[Figure A.3's corner-plot panels cover Camera Axis Planar, Camera Axis Radial, Camera Axis Bounded Radial, Unrestricted Vivaldi, Vivaldi without Certainty, Unrestricted Swarming, and Swarming without Backtracking, each plotting half-span, offset, slope, and error against one another.]

Figure A.3: A corner plot of the posterior for Model 2, with ray collisions as the metric under consideration.

Figure A.4: The performance of Swarming, as measured by efficiency (see text), when the number of nodes in the system is fixed but the initial configuration of nodes is varied.


[Figure A.5 comprises two charts, each titled “Effect of Damping (16/84 confidence intervals)”, plotting efficiency (higher is better) against damping amounts of 0.01, 0.1, and 1 for the Bounded Radial, No-backtracking, Radial, and Vivaldi algorithms.]

Figure A.5: The performance of the two radial algorithms, Vivaldi, and No-backtracking, as measured by efficiency (see text), when the number of nodes is fixed at 3 and 13, respectively, for a range of damping values. The radial algorithms used a random subsample.


[Figure A.6 comprises panels for Camera Axis Planar, Camera Axis Radial, Camera Axis Bounded Radial, Unrestricted Vivaldi, Vivaldi without Certainty, Unrestricted Swarming, and Swarming without Backtracking.]

Figure A.6: The posterior distributions of Model 3, for each algorithm, drawn from the data set. Magenta lines correspond to the waterline, green lines correspond to the 16th and 84th percentiles of both extremes, and grey areas represent posterior density (with darker being more certain). The number of nodes is fixed to three. All charts use the same scale, but the y axis is logarithmic. See the text for analysis.

Appendix B

Implementation Overview

Now that simulations have outlined the most effective combinations of algorithms and parameters, the next logical step is to implement the path tracer.¹ Building it is the only true proof that spatial partitioning works as promised. A full implementation is beyond the scope of a Master's thesis, but we can nonetheless outline one.²

B.1 Requirements

While the bulk of the simulation deals with geometry, path tracers must store a lot more than that. Each surface must have a shading model applied to it, which acts like a small program that transforms an incoming path ray into an outgoing one with more illumination information.

Storing the entirety of a model for each geometric primitive is wasteful; instead, said primitive contains a pointer to an appropriate program, which is generalized across multiple types of geometry. These programs frequently reference two-dimensional textures, and again these are not stored per-polygon nor even per-program, but as a link to an external resource.
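This indirection can be sketched with a few toy records; all type and field names here are invented for illustration, as the text does not prescribe a layout:

```python
from dataclasses import dataclass, field

@dataclass
class Texture:
    uri: str                    # stored once, linked by many shaders

@dataclass
class Shader:
    name: str                   # a "small program", shared across geometry
    textures: list = field(default_factory=list)   # links, not copies

@dataclass
class Triangle:
    vertices: tuple
    shader_id: int              # pointer into a shared table, not a full model

shaders = [Shader("diffuse", [Texture("//textures/wood.png")])]
tri = Triangle(((0, 0, 0), (1, 0, 0), (0, 1, 0)), shader_id=0)
print(shaders[tri.shader_id].name)   # -> diffuse
```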

Some scenes are animated. This is usually done by linking to a set of “bones,” which act to move the geometry according to a series of linear transformations. These bones are transformed according to the current time-stamp being rendered. In some cases, however, animation is done by translating, rotating, shearing, or scaling geometry directly according to the time-stamp. In a few cases, such as with fog or smoke volumes, a four-dimensional data set is used, consisting of particles or voxels.
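For example (a toy, single-bone rig; the transform and its time dependence are chosen arbitrarily), posing reduces to evaluating a time-dependent linear transformation:

```python
import math

def rotate_z(point, angle):
    """One of the linear transformations used to pose geometry."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def pose(vertex, timestamp):
    """A toy single-bone rig whose rotation is driven by the current
    render time-stamp."""
    return rotate_z(vertex, angle=timestamp * math.pi / 2)

p = pose((1.0, 0.0, 0.0), timestamp=1.0)   # a quarter turn
```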

Finally, one or more cameras need to be stored. These contain properties such as focal length

¹ For a justification of beginning with a simulation instead of jumping straight to an implementation, see “Methodology Overview” in the first chapter.
² See the Acknowledgements for additional information on the inspiration for these sections on implementation.

and pixel format, and sometimes describe post-processing done to the rendered image. All of this may be animated. Cameras are important sources of path rays, and if bidirectional tracing is done then so too is geometry with shading models that emit light.

All of the above affect geometry or the path rays in the scene, and thus a path tracer needs to be able to interact with them. Ideally, it must also be able to deal with multiple scenes, as setting up a render farm for each scene would be a waste of time.

There are additional wrinkles that come with using a distributed renderer. As it is designed to work with scenes bigger than any one computer can contain, it would be impractical to stage such scenes on a single computer and then transfer them to the distributed renderer. Instead, the renderer should double as a distributed filesystem, allowing the scene to be edited in-place. It would be unwise to render a scene while it is being edited, so the distributed system must be aware of which state it is in.

If this distributed path-tracer is to be deployed in a production environment, however, the ability to edit the scene implies users will have direct access to it. In a large environment working with multiple clients, it is likely that administrators would want to block employees working on one scene from being able to read another. The more people who have access to something, after all, the more likely it is someone will leak it outside the company. This path tracer must have some notion of security. At the same time, it would be infeasible to restrict the distributed filesystem to only hold scenes on servers that are cleared to work on them.

B.2 Operating Environment

This path-tracer is intended to be used on a private network with reasonably fast connections between each node. Security concerns rule out its operation over a public network, and if the system must span a public connection then technologies like Virtual Private Networking can maintain the required security. For nearly all use cases, however, this system would be deployed on a collection of computers dedicated solely to graphical work. Cloud deployment is feasible, provided some information about the server's execution environment is provided; otherwise, it is impossible to

guard against correlated failure.

B.2.1 Master-Client vs. Peer-to-Peer

All similar distributed rendering systems use a master-client framework, while this system is peer-to-peer. There are several justifications for the switch.

The most obvious is that the rendering algorithm works best in a peer-to-peer system. Nodes only pass data to their immediate peers, and they rearrange their location within the rendering metric space based on information from those peers. Routing that through a master server would only introduce delays. The algorithm makes no assumptions about non-peer nodes, so there is little advantage to having global information.

Peer-to-peer systems also have reduced administration costs. Master-client frameworks have at least two separate components, which complicates their setup and maintenance. Increasing the size of the system requires balancing the proportions of each type. Failure of a master server needs to be handled quite differently than that of a client, and there must be a plan for when both a master and a client fail at the same time. The conformity of nodes within peer-to-peer systems greatly simplifies all these concerns, and allows for a plug-and-play design. Theoretically, someone with no network administration experience could use this system to create their own render farm, simply by adding and removing hardware as necessary.

Finally, peer-to-peer networking eliminates a potential bandwidth bottleneck. If all geometry data is coordinated via a master server, all related information must pass through its network hardware. Rendering will be bottlenecked if that information involves actual geometry or texture data, and even meta-data can become a bottleneck if the master server is connected to enough clients.

Peer-to-peer designs may result in more data being transmitted overall, but that cost should be evenly spread across the network in a well-designed system.

B.2.2 Network Overview

Two separate IP ports are used by this system, 17349 and 17350. These are used for low-priority and high-priority information, respectively. Having two ports allows network administrators to set up priority queues without having to do deep packet inspection. UDP is used for data which can be lost with minimal impact to the system, such as periodic broadcasts or ray transfers, while TCP is used for data which must arrive, such as geometry transfers and consensus building.
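The split might be captured in a small dispatch table; only the port numbers and the UDP/TCP examples come from the text above, while the priority assigned to each message kind (and the kind names themselves) are assumptions:

```python
LOW_PRIORITY_PORT = 17349
HIGH_PRIORITY_PORT = 17350

# (transport, port) per message kind. Loss-tolerant traffic rides UDP,
# must-arrive traffic rides TCP.
ROUTES = {
    "periodic_broadcast": ("udp", LOW_PRIORITY_PORT),
    "ray_transfer":       ("udp", HIGH_PRIORITY_PORT),
    "geometry_transfer":  ("tcp", LOW_PRIORITY_PORT),
    "consensus":          ("tcp", HIGH_PRIORITY_PORT),
}

def route(kind):
    """Pick the socket type and destination port for a message."""
    return ROUTES[kind]
```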

B.2.3 Nodes

By listening for periodic broadcasts, peers can gather information about the bounds of other peers within the system. New peers/nodes can assign themselves a random location, or find an existing node storing more data than average and deliberately pick a location nearby.

To aid identification, each node will have a unique ID associated with it. This can be assigned automatically via a high-quality random source, or manually by an administrator. The network address of the node is a good choice for ID. A persistent, unchanging ID is useful to have, as it allows the system to incorporate network topology (see Backup for details). It can also be used as a tiebreaker in algorithms.

B.3 Scenes and Spaces

At the heart of this system is a “scene,” or a collection of data necessary to completely describe one or more renders. Each scene has a unique numeric ID associated with it, typically assigned by incrementing the ID of the highest-known scene. Scene zero is considered invalid, and is only used to flag that a peer isn't associated with any scene.

As per prior sections, scenes can contain:

• Meta-data, such as the location of one or more cameras, or settings related to the look of the render.

Figure B.1: Space-filling curves can be used to map multi-dimensional spaces onto a one-dimensional space, and by extension to partition them in a somewhat spatially-compact manner.

• Shaders, which describe how light interacts with geometry.

• Texture maps, which are used as input to shaders.

• Geometry, which represents most of the physical objects within a scene.

• Particles, used to simulate certain physical processes.

• Temporal data, used to describe how all the above change with time.

• Renders of the entire scene, both partial and complete, for each of the cameras.

As all of this data is finite with a fixed number of dimensions, we can pack them into a finite

“space.” By winding a space-filling curve through each space, we can map any location contained within to a location along a line, and vice-versa. That in turn allows us to partition those spaces and assign ownership of partitions to servers or “nodes” within the system; see Figure B.1.
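As a concrete sketch, a Morton (Z-order) curve is one possible choice of space-filling curve; the thesis does not commit to a particular curve, and the partition bounds below are invented:

```python
def interleave(x, y, z, bits=10):
    """Morton (Z-order) code: map a 3-D integer location to a single
    position along a line by interleaving coordinate bits."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

def owner(code, partitions):
    """partitions: sorted list of (upper_bound, node_id) slices of the
    line; return the node whose slice contains the code."""
    for upper, node in partitions:
        if code < upper:
            return node
    return partitions[-1][1]

# Two nodes split a 30-bit line in half:
partitions = [(1 << 29, "A"), (1 << 30, "B")]
print(owner(interleave(1, 2, 3), partitions))  # -> A
```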

This leads to an intuitive algorithm for equalizing storage across all nodes, Algorithm 5. The use of bounds creates a problem, as asynchronous updating makes it possible for the boundaries between two nodes to overlap or have a gap. The easiest solution is for each node to only control their upper or lower bound, rather than both.

One problem with Algorithm 5 is that it is vulnerable to race conditions: peers may swap the same packet between each other as they load balance with multiple neighbours. To help prevent this, incoming data is tagged with an arrival time. That data cannot be switched to another node

Algorithm 5: Storage balancing via partitions of a scene.
input : A list of known nodes. This may be blank initially.

1  while This node is active do
2      for A short period of time do
3          Listen for periodic broadcasts from other nodes;
4          Update the list of known nodes;
5          Accept requests to store items that fall within the current nodes' bounds;
6      end
7      Enumerate or update the locations of all locally-stored objects;
8      for Each node which shares a bound with us do
9          if Our proportion of used space is greater than theirs then
10             Calculate the number of items we'd have to transfer to them to equalize the proportions;
11             Transfer those items;
12             Update the boundary between our nodes;
13         end
14     end
15     if It is time for the current node to broadcast then
16         Broadcast our current state, including our bounds;
17     end
18 end

until sufficient time has elapsed from when it arrived, say on the scale of ten seconds. This should be sufficient time to allow node boundaries to settle down and become more reliable.
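Two pieces of Algorithm 5 and the cool-down rule can be sketched as follows (item sizes are simplified to one unit each, and all names are hypothetical):

```python
import time

COOLDOWN = 10.0   # seconds before a freshly-arrived item may move again

def items_to_transfer(our_count, our_capacity, their_count, their_capacity):
    """Items to hand a lighter neighbour so both nodes end up with the
    same proportion of used space (each item counted as one unit)."""
    # Solve (ours - n)/our_cap == (theirs + n)/their_cap for n.
    n = (our_count * their_capacity - their_count * our_capacity) \
        / (our_capacity + their_capacity)
    return max(0, int(n))

def movable(items, now=None):
    """Filter out items still inside the cool-down window, damping the
    packet ping-pong described above. items: list of (item, arrival)."""
    now = time.time() if now is None else now
    return [item for item, arrived in items if now - arrived >= COOLDOWN]
```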

B.3.1 Scene Log

Managing all major scene-related information is done via the “scene log,” an incremental list of major events that are associated with a scene. It notes the creation of spaces, how they are transformed, and any other macro-level changes. It also contains information which is not incrementally updated, such as the IDs of the associated keys, the ID of the node which created the scene, a Lamport timestamp, the key for the symmetric cypher associated with the public and private keys, and whether the scene is in edit or render mode. For security reasons, only the first three items of that list are unencrypted.

As additions to the scene log are disruptive, all nodes within the scene must come to a consensus on the change. Nodes should engage in distributed broadcasting to help prevent incomplete information from leading to a false consensus. This will be discussed further in Consensus.

B.3.2 Data Management

Some types of data contained within the system have a natural association to individual nodes, such as the partial renders necessary to add up to a finished one. These are nonetheless large and plentiful enough to justify being stored within a space, rather than via a separate mechanism.

This case can be handled by allowing two different management methods for spaces. A space with a fixed-location policy guarantees that data stored at a specific location will always remain at that location until deletion. A node-managed location policy, on the other hand, allows nodes to shuffle the location of any data contained within that space as they see fit. This allows a node to “slide” their incremental renders with them as they shift location within the scene, for instance.

This policy does not make it possible to directly link to data, unfortunately, but this can be somewhat resolved by allowing range queries. A peer looking for a specific incremental render, for instance, could search for all incremental renders on the target node which match certain criteria, and have a set of locations handed back to them. These can then be queried.

Data packets themselves contain the following information:

1. A checksum, to validate the data is intact.

2. The scene it belongs to.

3. The space within that scene it belongs to.

4. A Lamport timestamp (this is described in Security).

5. Raw data, preferably slightly smaller than a network packet.

Algorithm 6: The algorithm for editing or creating a data packet.
input : A packet of data to transmit.
input : The appropriate public and private key pair for a specific scene.

1  Encrypt the scene data to be saved with the symmetric key;
2  Generate a SHA-512 checksum of the encrypted data, and sign it with the private key;
3  Attempt to store the encrypted data on a server;
4  The receiving server uses the public key to decrypt the checksum and compares it against a checksum it calculates;
5  if The two checksums match then
6      The data is stored;
7  else
8      The data is rejected as corrupt or forged;
9  end

B.3.3 Security

Security of data is provided by use of public key cryptography. This turns encryption into a one-way function: the public key is used to encrypt data, and in theory the only way to decrypt is to have a copy of the private key. This allows us to protect data from being read by unauthorized people or computers, by restricting which servers have access to the associated private key for a scene. It does not protect data from being written over, however, as anyone with the public key could encrypt forged or even random data and submit it as legitimate. Fortunately, the public and private keys are mirror images of one another; anything encrypted using the private key can be decrypted by anyone with the public key. If the private key is used to encrypt a checksum of the data, every node can decrypt that with the matching public key and verify it, so it serves as a zero-knowledge proof that the encrypting peer has access to the private key. That node's storage requests can be trusted. This process is outlined by Algorithm 6.

Note that the data is actually encrypted with a symmetric key, which is contained in the scene log. Public key encryption is much slower than symmetric encryption schemes like AES-192, so to save processing time the public and private keys are used to encrypt and decrypt the AES-192 keys, and AES-192 is used to encrypt the data itself.
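The sign-and-verify flow of Algorithm 6 can be sketched with a deliberately toy RSA key; the tiny primes and byte-wise signing are stand-ins for a real public-key scheme, and the data is assumed to have already been symmetrically encrypted:

```python
import hashlib

# A deliberately tiny textbook RSA key (p=61, q=53): n=3233, e=17, d=2753.
# Real deployments would use a proper key and library; this only makes
# the checksum-signing flow concrete.
N, E, D = 3233, 17, 2753

def sign(encrypted_data: bytes) -> list:
    """Checksum the (already symmetrically-encrypted) data, then
    'encrypt' the digest with the private key, per Algorithm 6."""
    digest = hashlib.sha512(encrypted_data).digest()
    return [pow(b, D, N) for b in digest]        # byte-at-a-time toy RSA

def verify(encrypted_data: bytes, signature: list) -> bool:
    """Any node holding the public key can recover and check the digest."""
    digest = hashlib.sha512(encrypted_data).digest()
    return len(signature) == len(digest) and \
        all(pow(s, E, N) == b for s, b in zip(signature, digest))

packet = b"ciphertext of a scene packet"
assert verify(packet, sign(packet))                # stored
assert not verify(b"forged packet", sign(packet))  # rejected as forged
```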

This only works if the public key is widely shared, so the system needs some way to query public keys. This also creates a problem: certain scene log operations can change scene data. How is this updated, if the node containing the data cannot decrypt it? The solution is to associate a Scene Log Lamport timestamp with each data packet. A node which has the decryption key will be able to compare the data's timestamp with that of the log, and know if it hasn't been updated. It can then fast-forward through the transformations contained in the log and update the node containing the original packet. As that node has no idea whether the associated scene is in edit mode or not, it will have no grounds to refuse the update. This permits multiple nodes to simultaneously update a data packet, but the sheer amount of data makes this unlikely, and the worst failure case is that the data has only had one update from the Scene Log applied before another node reads the location.
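The fast-forward check itself is a simple timestamp comparison (a sketch; the log layout is invented):

```python
def pending_transforms(scene_log, packet_stamp):
    """Entries a stale packet still needs, in order. scene_log is a list
    of (lamport_stamp, transform) pairs; a packet tagged with stamp t
    has already seen every entry with stamp <= t."""
    return [xform for stamp, xform in scene_log if stamp > packet_stamp]

scene_log = [(1, "create"), (4, "scale"), (9, "rotate")]
print(pending_transforms(scene_log, packet_stamp=4))   # -> ['rotate']
```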

B.3.4 Backup

When dealing with data sets this large, great care must be taken to preserve the data contained within. Checksums only detect when data has become corrupted; they cannot recover missing data. The latter case is handled by the use of “shadow spaces,” which serve as duplicates of existing spaces. The location of each original data packet is passed through a hash function that maps it to a new location, so the location bounds shared across other spaces are also used here. As shadow spaces are included in storage calculations, this still allows for storage balancing. A good choice of “hash function” is a location offset added to the original location within a finite space.

To handle correlated failures, such as those typical of nodes which share the same server rack, a node blacklist can be used to avoid storing data on certain nodes. Nodes implicitly place themselves on their own blacklist, and periodically broadcast the list to all other nodes. To make the best use of blacklists, node IDs should be network addresses. Algorithm 7 details how nodes can translate a location into a shadow space location.

This design is unusual, as most large-scale distributed filesystems handle data integrity via error-correcting codes like Reed-Solomon. These also have the advantage of tuning the amount of extra space required for recovery; full-copy backups can only adjust the number of backup copies

Algorithm 7: The algorithm for finding which node is responsible for a shadow space data packet.
input : A data packet location for a specific scene and space.
output: The appropriate nodes to store or find shadow packets.

1  Retrieve the blacklist from the node responsible for storing the data packet;
2  Add that node to its blacklist;
3  Examine the Scene Log to learn which shadow spaces are associated with the input space;
4  Set aside space for a list of as many nodes as there were shadow spaces;
5  for Each shadow space do
6      Add the offset to determine the new location;
7      Find the closest node to that location;
8      while That node is listed in the blacklist do
9          Find the next-closest node to the new location;
10     end
11     Add that node to the list;
12 end
13 return The list of nodes;

they store. This system sticks with copies of data for the simple reason that access speed is a major concern. The extra latency introduced by coding will slow down recovery, and it prevents the use of multi-path data retrieval.
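The shadow-space “hash” and the blacklist-aware lookup of Algorithm 7 might look like this; the offset, space size, and node positions are all illustrative:

```python
OFFSET = 1 << 20    # illustrative shadow-space offset
SPACE = 1 << 30     # size of the finite one-dimensional space

def shadow_location(loc):
    """The 'hash function': an offset added to the original location,
    wrapping within the finite space."""
    return (loc + OFFSET) % SPACE

def shadow_node(loc, positions, blacklist):
    """Closest non-blacklisted node to the shadow location.
    positions maps node_id -> position along the line."""
    target = shadow_location(loc)
    eligible = {n: p for n, p in positions.items() if n not in blacklist}
    return min(eligible, key=lambda n: abs(eligible[n] - target))

positions = {"A": 0, "B": 1 << 20, "C": 1 << 21}
print(shadow_node(5, positions, blacklist={"B"}))  # -> C
```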

B.3.5 Edit and Render Modes

This rendering system is designed to work with scenes much larger than any one computer could contain, which creates a problem: the data must be read into the system, where it will likely be iteratively edited, yet rendering depends on a static scene. One solution is to permit scenes to be in one of two modes: editing, where the scene contents can be loaded or edited, and rendering, where the rendering algorithms are active and scene data is read-only.

It’s important to note that fixed-location spaces only fix the location of a data packet. During edit mode, a node is free to break apart a geometry packet into two separate packets which have different locations. These new packets have a fixed location within this space, so this does not violate the management policy. Care must be taken to respect links to specific packets, of course; a shader program cannot be broken up in this way, as its location is almost certainly referenced

Algorithm 8: Updating the Scene Log.
input : A list of nodes.
output: Consensus on whether to permit the log append or not.

1  if A request for a change is received then
2      if We are not a participant in this scene then
3          Forward this request to the next node;
4          return
5      end
6      if This node cannot permit the change, or a node which should have been included in the request is missing then
7          Send back a rejection of this change, along with a reason;
8          return
9      end
10     if We have a neighbour which has not received the broadcast then
11         Send it the request for a change, with our ID appended;
12         Listen for a response;
13         if The request is denied then
14             Add our ID to the rejection;
15             Send back the rejection of this change;
16             return
17         end
18     end
19     Note this change as “tentative” in a log;
20     Add our ID to the acceptance;
21     Send back the acceptance of this change;
22 end
23 if A system log is broadcast which contains this exact change, and no other then
24     Remove the “tentative” change and accept the changed log as canonical;
25 end

by multiple geometry data packets scattered across the system. Instead, it can be divided into subroutines, which are scattered through the space, while the shader entry point remains immobile.

As the mode of a scene is contained within the Scene Log, it can only be changed through consensus of all nodes. This prevents accidental editing during rendering, or vice-versa.

B.3.6 Consensus

Algorithm 8 details one method for developing consensus. Note that because each node is located along a line, we can restrict message passing to only travel along this line, creating an unambiguous transmission chain. This allows the algorithm to be modified to generate consensus on a more general level. This is useful for generating render pools, or consolidating incremental renders into finished ones.

Figure B.2: A block diagram of the rendering process. See the text for details.

This particular algorithm is slow, but it should also be invoked quite rarely in practice. As an alternative, the algorithm of Monnerat et al. could be adapted [MA14]. Since every node should have a complete routing table, missed nodes can be detected by using the current node’s routing table to determine where the original node should have sent Scene Log proposals. If the two do not match, a rejection is broadcast out using the same algorithm.
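The line-restricted message passing of Algorithm 8 can be sketched as a small simulation. All names here are illustrative, not part of the protocol; the proposal travels down the line, and the verdict travels back, collecting node IDs along the way:

```python
def line_consensus(nodes, can_permit):
    """Simulate Algorithm 8 along a line of nodes.

    `nodes` is the ordered list of node IDs on the line; `can_permit` maps
    each ID to whether that node would permit the change. Returns a verdict
    and the IDs appended on the return trip (rejector first, origin last).
    """
    for i, node in enumerate(nodes):
        if not can_permit[node]:
            # Rejection propagates back; each node appends its ID in turn.
            return "reject", list(reversed(nodes[:i + 1]))
    # The far end of the line accepts first, then the acceptance flows back.
    return "accept", list(reversed(nodes))
```

The reversed ID lists mirror how the acceptance and rejection packets accumulate IDs as they return toward the proposing node.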

B.4 Rendering

Space-filling curve partitions will be treated as a filesystem that the rendering portion relies on. The filesystem will be kept almost entirely on disk, while the rendering portion will be almost entirely stored in RAM. The renderer will use the node location within the filesystem as a starting point, but its location will inevitably drift over time. During the render process, it will fetch data it needs on demand from the appropriate node; as consensus is not necessary for this, queries can be made directly to the responsible node.

Figure B.2 is a block diagram of the low-level rendering process. It begins with the creation of new rays that are used to generate finished pixel values. These are fed into a queue used to store rays being cast into the scene. The routine used to find collisions between rays and geometry pulls them off one-by-one, where one of three things happens: the ray collides with some geometry we have stored in memory, it collides with geometry we haven’t loaded yet from another node, or it misses all our geometry. The ray will then either be passed to the shading routine, back to the casting ray queue, or placed in the outgoing ray queue, respectively. As necessary, the ray casting routine will draw from the geometry cache.
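The three-way dispatch described above can be sketched as follows; the queue and outcome names are illustrative, not taken from the design:

```python
from enum import Enum, auto
from collections import deque

class Outcome(Enum):
    HIT_LOCAL = auto()    # collided with geometry held in our cache
    HIT_REMOTE = auto()   # collided with geometry another node holds
    MISS = auto()         # missed all geometry we are responsible for

def route_ray(ray, outcome, shading_queue, casting_queue, outgoing_queue):
    """Send a cast ray to the queue matching its collision outcome."""
    if outcome is Outcome.HIT_LOCAL:
        shading_queue.append(ray)       # ready for surface interaction
    elif outcome is Outcome.HIT_REMOTE:
        casting_queue.append(ray)       # retried once the geometry arrives
    else:
        outgoing_queue.append(ray)      # handed to the outgoing-ray thread
```

The `HIT_REMOTE` branch pairs with the geometry management thread described below, which fetches the missing geometry before the ray is recast.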

If the ray reaches the shading routine, it will interact with the surface of the geometry and either generate a new ray or terminate. In the former case, the new ray is added to the casting ray queue; in the latter case, it will be sent to a routine which collects finished pixels for the final render and become incorporated into the final product. At minimum, that data will be used to tune the ray generation routine, so that it can focus ray generation on areas of the image which need more samples. This routine will also save the results to a designated space, for later consolidation.

If the ray is placed in the outgoing pool, a separate thread devoted to managing the queue will sort and consolidate those rays. Their destinations will be found, and a collection of them will be sent along with server metadata. This same thread will also take in rays passed to this node by others, sort them to increase coherency, and place them into the casting queue. Nodes will be permissive of the rays they take in, attempting to cast them even if the internal position data suggests they are better suited for another node, to help prevent ray cycling between nodes.

If the ray casting routine finds a ray might have collided with geometry it should contain, it will place a request for said geometry with a separate geometry management thread. This routine will then search the network to satisfy the request, and place the geometry in the cache. The same routine will also be responsible for handling geometry transfers during repositioning events.

Much of this code will be running in separate threads, with clean paths of communication between each. Lock-free structures will be preferred, in general, to maximize throughput.

B.4.1 Render Pools

While a scene may be too big for any one computer to handle, it may be small enough to be rendered by a subset of nodes within the system. Supporting rendering by subsets would decrease the network overhead, as fewer rays would be passed around the system. Nodes are thus permitted to self-assemble into “render pools.”

All nodes initially pick a random scene and frame to render and are considered a single-node pool; if the associated scene is not an animation, it may be broken up into a number of pseudo-frames so that multiple render pools can work on it simultaneously. If they calculate they do not have sufficient RAM to contain a scene, they will contact neighbouring pools and request all members of that pool join them. It is possible that the neighbouring pool is both larger and in need of additional help, in which case the smaller pool abandons their scene and frame and joins in. To prevent race conditions, nodes that have requested joining another pool will ignore all incoming pool join requests until either the original request is responded to, or a timeout occurs. In the latter case, a rejection notice is sent to the original recipient, and the most attractive offer is responded to with the same timeout as before.
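The race-avoidance rule can be sketched with a minimal pool structure. This is a hypothetical data structure, not part of the protocol; the key point is that a pool with an outstanding join request defers all incoming requests:

```python
class RenderPool:
    """Sketch of the join handshake: a pool that has asked to join another
    ignores incoming join requests until its own request resolves, avoiding
    the race where two pools simultaneously absorb each other."""

    def __init__(self, pool_id, members):
        self.pool_id = pool_id
        self.members = set(members)
        self.pending_request = None   # pool we asked to join, if any

    def request_join(self, other_pool_id):
        """Record that we have asked to join another pool."""
        self.pending_request = other_pool_id

    def handle_join_request(self, from_pool):
        """Accept and absorb the requester, unless our own request is pending."""
        if self.pending_request is not None:
            return "reject"           # deferred until our request resolves
        self.members |= from_pool.members
        return "accept"
```

A timeout (not modelled here) would clear `pending_request` so the pool can respond to the most attractive deferred offer.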

To save time, render pools can be created during edit mode and periodically updated to keep up with scene edits. To handle cases where all participating nodes greatly over-estimate the memory requirements of a scene, they are permitted to divide their pool in two. The smallest of the two portions is paired up with the smallest neighbouring pool. The aforementioned consensus algorithm will be used to guarantee all nodes in a pool agree to the changes. Small over-estimates are ignored, to prevent excessive divisions.

The distributed rendering algorithms mentioned earlier suffer to varying degrees from Pareto distribution issues; specifically, it is almost guaranteed that some geometry will generate more ray and path collisions than other geometry. This creates one or more nodes which store very little geometry in RAM but have a high workload, and typically one node which could contain more than half the geometry in the scene. The overlay design offers a partial solution, however: RAM not occupied with rendering geometry can be used as cache for other geometry. Taking inspiration from the flexible routing of most distributed hash-table implementations, we can state that any node which finds itself with extra RAM can become the target of storage requests from nodes with less, even if those storage requests are located on another node. This effectively turns those nodes into RAM caches for memory-starved nodes, speeding access relative to retrieval from a hard drive.

B.4.2 Consolidating Renders

Generating an image is quite complex. For an animated movie, hundreds of thousands of rendered frames may need to be stored. Each of those may be broken up into separate layers, such as the colour, distance or ID of objects. This is ideally done via a render space within the scene, with each image given a specific location determined by entries in a meta-data space. Nodes are responsible for consolidating the incremental renders of each final rendered image they contain.

Ideally, these incremental renders occupy their own node-managed space. Since all nodes broadcast the scene and frame they are working on, there is no need to query every node to determine relevant incremental renders. The consolidation process is Algorithm 8 but with data accumulation: nodes begin by querying the closest node or nodes with a matching incremental render, those nodes continue the query along the line until there is no more line, and incremental renders are consolidated together as they are returned. This removes a bandwidth bottleneck, as the node with the final render only receives two incremental renders at most. It does add significant lag between each update of the final image, but this is desirable as it leaves more bandwidth for the path tracing portion of the render process.
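The accumulate-on-return behaviour along one direction of the line can be sketched as follows, assuming incremental renders are equal-length lists of per-pixel sample sums (a simplifying assumption; real renders carry multiple channels):

```python
def consolidate(line_renders):
    """Accumulate incremental renders as the query returns along the line.

    The far end of the line replies first; each node adds its own incremental
    render to what its outer neighbour returned, so the querying node only
    ever receives one consolidated buffer per direction.
    """
    acc = None
    for render in reversed(line_renders):
        if acc is None:
            acc = list(render)
        else:
            acc = [a + b for a, b in zip(acc, render)]
    return acc
```

Running this once per direction gives the final node at most two buffers to merge, matching the bandwidth claim above.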

Appendix C

Network Packets

Based on the above information, we can begin constructing the core data packets exchanged between peers, in BNF.

C.1 Status Messages

PingBroadcast ::= ”1” StatusBroadcast

StatusBroadcast ::= Status | Status StatusBroadcast

Status ::= StatIdentify StatScene StatStorage StatRender StatBlacklist

StatIdentify ::= Lamport NodeId RightBound

StatScene ::= RenderPool SceneFrame

StatStorage ::= OfflineFraction OfflineBlocks OnlineFraction OnlineBlocks

StatRender ::= RaysPerSecond RayQueueLength

StatBlacklist ::= Size Blacklist

Blacklist ::= Id | Id Blacklist

Lamport ::= 16-bit number

NodeId ::= 32-bit number

RightBound ::= Location

RenderPool ::= 32-bit number

SceneFrame ::= 64-bit number

OfflineFraction ::= 32-bit number

OfflineBlocks ::= 64-bit number

OnlineFraction ::= 32-bit number

OnlineBlocks ::= 32-bit number

RaysPerSecond ::= 32-bit number

RayQueueLength ::= 32-bit number

Size ::= 16-bit number

Location ::= 160-bit number

PingBroadcast is a UDP packet sent out periodically by a peer to inform others of its status, help create the routing tables of new nodes, and detect when a node has failed. Lamport is a Lamport timestamp associated with this status update, Id is the peer’s ID, RightBound is the rightmost bound of its partition along the 1D line, RenderPool is as described before, SceneFrame is the concatenation of the scene and frame number associated with the aforementioned RenderPool, the Block values detail the amount of hard drive (“Offline”) and RAM (“Online”) storage on the current peer in 4 KB increments, while the Fraction values specify how much of that capacity is used. RaysPerSecond is the number of path rays this peer is processing per second, and RayQueueLength is the number of unprocessed rays sitting in this peer’s queue. In this case, Size details the number of entries in Blacklist.

PingBroadcast’s first Status entry is for the peer sending out the broadcast; all other blocks are other nodes contained in that peer’s routing table, enough to pad out the network packet. By sharing all this information, new nodes are given a wealth of data about the system. These packets are sent out once per second, with probability 1/N, where N is the number of nodes this peer is aware of. This throttles the bandwidth used by these packets, and also creates a probabilistic method for detecting node failure. The probability that a peer is still alive, if it has not been seen in B broadcasts, is

p(alive) = ((N − 1) / N)^B    (C.1)

So if this probability falls below a certain threshold, like 1/20, a peer is justified in removing that node from its routing table.
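Equation C.1 is straightforward to compute; here is a sketch, with the eviction threshold as an illustrative default rather than a value fixed by the design:

```python
def p_alive(N: int, B: int) -> float:
    """Probability a peer is still alive after being absent from B broadcasts,
    given each of N known peers broadcasts with probability 1/N per second
    (Equation C.1)."""
    return ((N - 1) / N) ** B

def should_evict(N: int, B: int, threshold: float = 0.05) -> bool:
    """A peer is justified in dropping a node from its routing table once
    the survival probability falls below the threshold."""
    return p_alive(N, B) < threshold
```

With N = 10 and a 0.05 threshold, a node absent from 60 broadcasts (survival probability roughly 0.0018) would be evicted, while one absent from only 5 would not.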

If a node’s Id is the same as its network address, the NodeId field in StatIdentify and the following fragment can be dropped.

MiniStatus ::= NodeId RightBound RenderPool O f f lineFraction OnlineFraction StatRender

To ensure the system is as up-to-date as possible, all requests and most replies include some data from PingBroadcast. This should increase the performance of the system, with minimal bandwidth costs.

RequestState ::= ”2” MiniStatus

RequestState is the peer-to-peer version of PingBroadcast, and is replied to with PingBroadcast. It primarily exists as a way to explicitly check whether a peer is alive, and to help new nodes fill their routing tables more quickly. Like all status broadcasts, it is sent via UDP.

C.2 Public Keys

RequestPublicKey ::= ”3” MiniStatus PubKeyId

PubKeyId ::= 32-bit number

Public keys are vital for the security of scene logs and data, so there must be a way for new peers to query them from existing peers. The system can afford to lose public key packets, though, so these are sent via UDP.

RespondPublicKey ::= ”4” (MiniStatus | MiniStatus PubKeyId PubKeyExponent PublicKey)

PubKeyExponent ::= 32-bit number

PublicKey ::= 3072-bit number

The public key portion is not fixed in stone, and may be upgraded or downgraded depending on security demands. If the specified public key does not exist, the reply is truncated. This too can be sent via UDP.

C.3 Scene Logs

RequestSceneLog ::= ”5” MiniStatus SceneId

SceneId ::= 32-bit number

RespondSceneLog ::= ”6” (MiniStatus | MiniStatus SceneLogPublic SceneLogPrivate)

SceneLogPublic ::= SceneId PubKeyId PrivateKeyId OriginId Signature

PrivateKeyId ::= PubKeyId

OriginId ::= NodeId

Signature ::= 512-bit number

SceneLogPrivate ::= Lamport Size Name SceneMode SymmetricKey LogEntries

Name ::= A UTF-8 String

SceneMode ::= 1-bit number

SymmetricKey ::= 192-bit number

LogEntries ::= LogEntry | LogEntry LogEntries

Requesting a scene log is not a vital operation, so it is sent via UDP. The response may be sent via UDP if the log is small enough; otherwise TCP is used. PrivateKeyId and OriginId are just aliases for other data types. The latter of those is used to track the peer which created the scene, in case of disputes. Signature is a SHA-512 hash of SceneLogPrivate encrypted with the private key. SceneMode is used to track whether the scene is in edit or render mode, and SymmetricKey is the aforementioned AES-192 key used to encrypt data. Size is the number of characters in Name, a human-friendly string associated with the scene.

LogEntry ::= CreateSpace | TranslateSpace | ScaleSpace

CreateSpace ::= ”1” Dimensions ManagementMode SpaceId ParentId HashValue Size Name

Dimensions ::= 3-bit number

ManagementMode ::= 1-bit number

SpaceId ::= 13-bit number

ParentId ::= SpaceId

HashValue ::= Location

TranslateSpace ::= ”2” SpaceId SignedLocations

SignedLocations ::= SignedLocation | SignedLocation SignedLocations

SignedLocation ::= Location

ScaleSpace ::= ”3” SpaceId Scalar

Scalar ::= 32-bit IEEE-754 float

There are three log entries defined. CreateSpace outlines the creation of a new space: Dimensions is the number of dimensions it occupies, minus one; ManagementMode is the management mode for the space, outlined earlier; SpaceId is a unique number associated with each space; if ParentId is not 0, then it marks this as a shadow space of another space, and HashValue declares the associated hash function; finally, Size and Name give the human-friendly name of this space.

TranslateSpace specifies how data locations should be translated within this space. SignedLocation is almost identical to Location, except that its most-significant bit is treated as the sign of the translation. The number of SignedLocation values necessary is determined by the number of dimensions in that space.

ScaleSpace specifies how data locations should be scaled within this space. Scaling is done uniformly across all dimensions, so that any stored normals do not need to be updated.
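A minimal sketch of applying these two log entries to a data location; the tuple representation of an entry is illustrative, not the wire format:

```python
def apply_log_entry(entry, location):
    """Apply a TranslateSpace or ScaleSpace entry to a data location,
    represented here as a tuple of per-dimension coordinates."""
    kind = entry[0]
    if kind == "translate":
        offsets = entry[1]          # one signed offset per dimension
        return tuple(c + o for c, o in zip(location, offsets))
    if kind == "scale":
        s = entry[1]                # uniform scalar across all dimensions,
        return tuple(c * s for c in location)  # so normals stay valid
    raise ValueError(f"unknown log entry kind: {kind}")
```

The uniform scale is the reason ScaleSpace carries a single Scalar: a non-uniform scale would require re-normalizing every stored normal.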

ProposeSceneLog ::= ”7” MiniStatus ProposalId SceneLogPublic SceneLogPrivate

ProposalId ::= Lamport

AcceptSceneProposal ::= ”8” MiniStatus ProposalId NodeList

NodeList ::= NodeId | NodeId NodeList

RejectSceneProposal ::= ”9” MiniStatus ProposalId RejectReason NodeList

RejectReason ::= 4-bit number

All requests related to scene log proposals must be sent via TCP, to guarantee delivery. The algorithm for dealing with scene log proposals is as outlined in the text: a node wishing to alter the scene log sends ProposeSceneLog to its neighbours. If a node receives a ProposeSceneLog packet, it either sends back RejectSceneProposal if it has reason to, forwards the packet to its other neighbour if that neighbour exists, or sends back AcceptSceneProposal with its NodeId in the NodeList. If a node receives either AcceptSceneProposal or RejectSceneProposal, and it did not send out the proposal, it adds its NodeId to the NodeList and passes the packet to its other neighbour.

RejectReason can be 1, for a rejection because a proposal with that Lamport already exists; 2, because the Lamport is less than the scene log’s current Lamport; or 3, because a peer was missed during the consensus protocol.

C.4 Data

ProposeData ::= ”10” MiniStatus SceneId SpaceId Location Signature EncryptedData

EncryptedData ::= Lamport (DataBlob | NullDataBlob)

AcceptData ::= ”11” MiniStatus Signature

RejectData ::= ”12” MiniStatus RejectReason

GetData ::= ”13” MiniStatus SceneId SpaceId Location

ReturnData ::= ”14” MiniStatus SceneId SpaceId Location EncryptedData

EmptyData ::= ”15” MiniStatus SceneId SpaceId Location

All requests to retrieve or store data should be sent via TCP. DataBlob is deliberately ill-defined, as its contents vary considerably across different spaces. NullDataBlob is a DataBlob which is used to erase a data packet stored at a specific location. For security reasons, it’s recommended that this looks as much like a DataBlob as possible and has multiple variations.

RejectReason can be 4, in which case the signature was invalid, or 5, in case of an ill-formed data blob.

QueryRange ::= ”16” MiniStatus SceneId SpaceId Location Location

ReturnRange ::= ”17” MiniStatus SceneId SpaceId Locations

Locations ::= Location | Location Locations

RejectRange ::= ”18” MiniStatus SceneId SpaceId Location RejectReason

Range queries are not considered important, and should never span more than one network packet, so these are handled via UDP. QueryRange takes two locations to describe the range.

RejectReason can be 6 if ReturnRange would have required more than a single network packet, or 7 if all of the range fell outside the responsibility of the queried node.

C.5 Render Pools

RequestLocalPoolJoin ::= ”19” MiniStatus SceneFrame

RequestRemotePoolJoin ::= ”20” MiniStatus RenderPool SceneFrame

AcceptPoolJoin ::= ”21” MiniStatus

RejectPoolJoin ::= ”22” MiniStatus RejectReason

Render pool packets are sent via UDP. RequestLocalPoolJoin suggests another peer join the current peer’s render pool, and RequestRemotePoolJoin requests joining a remote peer’s render pool. SceneFrame is sent along to ensure consistency with the value of RenderPool, as it may have changed since the last status update.

C.6 Rendering

SendRays ::= ”23” MiniStatus RenderLocation PathRays

RenderLocation ::= Scalar | Scalar RenderLocation

PathRays ::= PathRay | PathRay PathRays

PathRay ::= PixelID Age 3DOrigin 3DDirection Channels

Channels ::= ImageData | ImageData Channels

Finally, we need a data packet to pass around path rays. RenderLocation is either one or three Scalars which determine where the Voronoi cell centroid is located, depending on the algorithm. PixelID is the pixel associated with the given path ray; Age is the number of bounces the ray has taken; 3DOrigin and 3DDirection are 3D points which determine the origin and direction of the ray, respectively; Channels collects all the image data channels associated with the path ray, information such as luminance, luminance attenuation, distance, object IDs, and so on. As this depends heavily on the render settings, it is deliberately ill-defined.
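A sketch of packing the fixed-width head of a PathRay follows. The grammar leaves the widths of PixelID and Age unspecified, so 32-bit values are assumed here; the 3D points are packed as three 32-bit IEEE-754 floats each, matching the Scalar definition above:

```python
import struct

# PixelID (assumed 32-bit) + Age (assumed 32-bit) + 3 floats origin
# + 3 floats direction = 32 bytes; Channels data is appended raw,
# since its layout depends on the render settings.
PATHRAY_HEAD = struct.Struct("!IIffffff")

def pack_pathray(pixel_id, age, origin, direction, channel_bytes=b""):
    """Serialize one PathRay: fixed head, then opaque channel data."""
    return PATHRAY_HEAD.pack(pixel_id, age, *origin, *direction) + channel_bytes
```

Keeping Channels opaque at this layer mirrors the grammar’s deliberate ill-definition: only the shading code needs to interpret it.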
