DRPU: A Programmable Hardware Architecture for Real-time Ray Tracing of Coherent Dynamic Scenes

Sven Woop
Graphics Group, Saarland University
66123 Saarbrücken, Germany

Dissertation zur Erlangung des Grades des Doktors der Ingenieurwissenschaften (Dr.-Ing.) der Naturwissenschaftlich-Technischen Fakultät I der Universität des Saarlandes
(Dissertation for obtaining the degree of Doctor of Engineering (Dr.-Ing.) of the Faculty of Natural Sciences and Technology I of Saarland University)

Betreuender Hochschullehrer / Supervisor: Prof. Dr.-Ing. Philipp Slusallek, Universität des Saarlandes, Saarbrücken, Germany

Gutachter / Reviewers: Prof. Dr.-Ing. Philipp Slusallek, Universität des Saarlandes, Saarbrücken, Germany

Prof. Dr. Wolfgang J. Paul, Universität des Saarlandes, Saarbrücken, Germany

Prof. Erik Brunvand, University of Utah, Salt Lake City, USA

Dekan / Dean: Prof. Dr. Thorsten Herfet, Universität des Saarlandes, Saarbrücken, Germany

Eingereicht am / Thesis submitted on: 19. Dezember 2006 / December 19th, 2006

Datum des Kolloquiums / Date of defense: 18. Juni 2007 / 18th June 2007

Vorsitzender des Kolloquiums / Chairman of the examination committee: Prof. Gert Smolka

Wissenschaftlicher Beisitzer / Academic assessor: Rafal Mantiuk, Ph.D.

Sven Woop
Lehrstuhl für Computergraphik, Geb. E 1 1 / E08
Im Stadtwald, 66123 Saarbrücken, Germany
[email protected]

Version 1.1, June 29, 2007

Abstract

Ray tracing is a rendering technique capable of generating high quality photo-realistic images of three dimensional scenes. Rendering speed has been an issue for a long period, but recently high performance software implementations have made real-time ray tracing possible. However, reaching performance levels comparable to rasterization also requires dedicated hardware solutions.

This thesis proposes the DRPU architecture (Dynamic Ray Processing Unit) as the first programmable ray tracing hardware design for coherent dynamic scenes. For programmable shading it contains a Shading Processor that achieves a high level of efficiency due to SIMD processing of floating point vectors, massive multi-threading, and synchronous execution of packets of threads. A dedicated traversal and intersection unit allows for efficient ray casting even in highly dynamic scenes by using B-KD Trees, a kind of Bounding Volume Hierarchy, as spatial index structure. A Skinning Processor is used to compute dynamic scene changes and an Update Processor to update the B-KD Tree node bounds after these changes.

A working FPGA prototype implementation, specified in the developed hardware description language HWML, is presented, which achieves performance levels comparable to commodity CPUs even though it is clocked at a 50 times lower frequency of 66 MHz. The prototype is mapped to a 130nm CMOS ASIC process that allows precise post-layout performance estimates. These results are then extrapolated to a 90nm version with hardware complexity similar to current GPUs. It shows that with a similar amount of hardware resources, frame-rates of 80 to 280 frames per second would be possible even with complex shading at 1024x768 resolution. This would form a good basis for game play and other real-time applications.

Kurzfassung

Ray tracing is a rendering technique for generating high-quality, photo-realistic images of three-dimensional scenes. For a long time the low speed of the ray tracing algorithm was a major problem, but in recent years high-performance ray tracing software has made even real-time ray tracing possible. To reach the performance of rasterization hardware, however, dedicated hardware implementations of ray tracing are essential.

This thesis presents the DRPU hardware architecture (Dynamic Ray Processing Unit), the first fully programmable ray tracing hardware architecture that supports even highly dynamic scenes. For programmable shading the DRPU contains a Shading Processor that achieves high efficiency through SIMD processing of floating point vectors, massive multi-threading, and synchronous execution of packets of threads. The use of B-KD Trees (a kind of Bounding Volume Hierarchy) together with a dedicated traversal and intersection processor enables efficient ray casting even in highly dynamic scenes. Dynamic changes of the scene can be computed efficiently by a Skinning Processor; this requires recomputing the bounds of the index structure, which is done by an optimized Update Processor.

Furthermore, a working FPGA prototype implementation of the architecture is presented that reaches a performance similar to today's high-end CPUs despite its 50 times lower clock frequency of only 66 MHz. The prototype was implemented in the custom hardware description language HWML, which made it easy to port the design to a 130nm CMOS process. The place-and-route results are extrapolated to a 90nm version with hardware complexity similar to today's GPUs, yielding precise performance estimates. Even for complex shading, such a chip would achieve frame rates of 80 to 280 frames per second at a resolution of 1024x768 pixels. This forms a good basis for computer games and other real-time applications.

Zusammenfassung

In real-time computer graphics the rasterization algorithm is the dominant algorithm today, mainly because very efficient hardware implementations exist for it. The basic idea of rasterization is to render the scene by projecting each triangle individually onto the image plane and drawing (shading) every covered pixel. Since this approach treats triangles independently of each other, complex effects such as shadows or reflections cannot be computed directly, but only approximated by inefficient multi-pass techniques.

Ray tracing, in contrast, allows such shading effects to be computed in a physically correct way by tracing light backwards through the scene; for example, by shooting a reflection ray it can directly include the contribution of light from the reflection direction in the color computation. Because shooting rays through the scene is computationally expensive, real-time ray tracing was unthinkable for a long time. In recent years, however, efficient software implementations have made it possible.

This thesis describes the DRPU hardware architecture, which not only makes ray tracing real-time capable but brings it close to rasterization in both performance and functionality. The DRPU allows rays to be shot recursively during fully programmable shading, and it also supports highly dynamic scenes as long as a certain coherence of the dynamics is present.

Chapter 1 gives an introduction to rendering by describing and comparing the rasterization algorithm and the ray tracing algorithm. Chapter 2 gives an overview of the DRPU architecture, explains how the individual hardware units work together, and describes important concepts such as multi-threading and packet processing of threads.

Chapter 3 deals with dedicated hardware units for shooting rays through dynamic scenes. B-KD Trees, a kind of Bounding Volume Hierarchy with one-dimensional bounds, are described as a new data structure for ray tracing of dynamic scenes. Rays are traversed through this data structure by a Traversal Processor (TP) and intersected with triangles by a Geometry Processor (GP). These two units enable the efficient tracing of rays through the scene, one of the most expensive operations of ray tracing. To support dynamic scenes, parts of the scene's B-KD Trees must be updated, which is done by a dedicated Update Processor. It is demonstrated that an FPGA prototype using these ray casting units is faster than conventional software implementations.

A further important operation of ray tracing is shading, which computes a color at every hit point of a ray with the scene. Chapter 4 describes the dedicated Shading Processor (SP) of the DRPU, which has an instruction set similar to the fragment programs of today's GPUs. The Shading Processor is considerably more capable, however, since it supports hardware-assisted recursion as well as flexible memory accesses. The recursion in particular is needed for the implementation of recursive ray tracing, which makes ray tracing effects very easy to combine.
The Shading Processor could also be used to compute the dynamic changes of the scene, but it would not be as efficient as a dedicated hardware unit. Chapter 5 therefore describes a further special-purpose processor that implements a widely used skinning algorithm and can execute it very efficiently. Since this Skinning Processor reuses many hardware resources that are already present, it adds hardly any cost to the DRPU architecture.

Chapter 6 is devoted to HWML (Hardware Meta Language), a custom hardware description library for the functional programming language ML that greatly simplified the implementation of the DRPU architecture. HWML allows circuits to be described on an abstract structural level by supporting polymorphic buses, typed memories with multiple ports, automatic pipelining, a variety of polymorphic circuits (such as FIFOs), and recursive circuit descriptions. The polymorphic buses and circuits enable a particularly compact description and greater parameterizability of the DRPU implementation: the entire hardware was implemented in only 8000 lines of ML code. Moreover, floating point precision, the number of hardware threads, threads per packet, and many other parameters can easily be adjusted via a configuration file for design studies. Thanks to the memory abstraction provided by HWML, the hardware description can be implemented without difficulty on both an FPGA and an ASIC platform.

Chapter 7 describes the FPGA prototype implementation of the DRPU and presents statistics for several scenes. It is shown that the applied techniques work very well in practice and that the prototype already reaches real-time frame rates despite its low clock frequency of only 66 MHz. Its performance would, however, by far not suffice for today's computer games, for which frame rates of 40 frames per second at a resolution of 1024x768 pixels are the absolute minimum.

To show that even this would be possible with an ASIC implementation of the DRPU architecture, Chapter 8 describes a mapping of the prototype to a 130nm CMOS process from UMC. Measurements determine an optimal configuration of the ASIC, and its synthesis as well as place-and-route results are analyzed. Extrapolating the results to a 90nm process allows an estimate of the performance that would be possible with hardware resources comparable to today's rasterization graphics cards. These estimates show that the ray tracing performance of about 70 of today's CPUs could be integrated on a single chip, corresponding to a frame rate of more than 90 frames per second at a resolution of 1024x768 pixels. With today's chip technology it is therefore possible to build ray tracing hardware that would be suitable for use in computer games and would offer significant advantages for the computation of realistic images.

Acknowledgements

First of all, I would like to thank the supervisor of my thesis, Prof. Dr. Philipp Slusallek, for his support, many worthwhile discussions, and good ideas for pointing the thesis in the right direction. Secondly, I have to thank Dr. Jörg Schmittler, the founder of the SaarCOR hardware project, who established the basis for this thesis through detailed software simulations. I would also like to thank my reviewers, Prof. Dr. Philipp Slusallek, Prof. Dr. Wolfgang J. Paul, and Prof. Brunvand, for accepting the responsibility of reviewing my thesis. Thanks to Prof. Brunvand for his help in getting the ASIC tools running and for some interesting discussions on ASIC design. Furthermore, I would like to thank the Computer Graphics Group of Saarland University for three pleasant years and the good working atmosphere it created. Thanks also go to the Graduate School "Leistungsgarantien für Rechnersysteme" of Saarland University for funding and to Xilinx for providing two Virtex 4 FPGAs via their University Program. Finally, I would like to thank my family and my girlfriend for their support during the stressful period of writing this thesis.

Contents

1 Introduction 1
  1.1 Rasterization 3
  1.2 Ray Tracing 4
  1.3 Ray Tracing versus Rasterization 7
    1.3.1 Dynamic Scenes 8
    1.3.2 Scalability 9
    1.3.3 Shading 10
    1.3.4 Complex Scenes 11
  1.4 Previous Work 12
    1.4.1 Shared Memory Machines 12
    1.4.2 Commodity PCs 13
    1.4.3 GPUs 13
    1.4.4 Cell 14
    1.4.5 Custom Hardware 15
    1.4.6 SaarCOR Project 18
    1.4.7 OpenRT Project 19
  1.5 Outline of this Thesis 21

2 Overview of the DRPU Architecture 23
  2.1 Pipelining 27
  2.2 Multi-Threading 27
  2.3 Packets of Rays/Threads 28
  2.4 Memory Interface 29
  2.5 Performance Scalability 31
  2.6 Programming Model 32
  2.7 Hardware Description Language HWML 34
  2.8 FPGA Prototype Platform 35
    2.8.1 Floating Point Format 35

3 Ray Casting Hardware for Dynamic Scenes 37
  3.1 Introduction 37
  3.2 Bounded KD Trees (B-KD Trees) 40
    3.2.1 B-KD Tree Construction 44
    3.2.2 KD Tree Transformation 46
    3.2.3 B-KD Tree Update 47
    3.2.4 B-KD Tree Traversal 48
  3.3 Hardware Implementation of B-KD Trees 51
    3.3.1 Data Layout 51
    3.3.2 Update Processor 54
    3.3.3 Ray Casting Unit (RCU) 62
    3.3.4 Results 71
  3.4 Conclusions and Future Work 77

4 Programmable Shading Processor (SP) 79
  4.1 Introduction 79
  4.2 Instruction Set Architecture 80
    4.2.1 Shading Processor Registers 82
    4.2.2 Shading Processor Instruction Set 84
  4.3 Shading Processor Microarchitecture 92
    4.3.1 Design Decisions 92
    4.3.2 Hardware Implementation 100
  4.4 Implementation Results 104
  4.5 Conclusions and Future Work 108

5 Skinning Processor 109
  5.1 Introduction 109
  5.2 Hardware Architecture 111
  5.3 Prototype Implementation and Results 114
  5.4 Conclusions and Future Work 116

6 Hardware Description Language HWML 117
  6.1 Introduction 117
  6.2 Low Level Structural Library 119
    6.2.1 Circuit Creation 119
    6.2.2 Automatic Pipelining 121
    6.2.3 Simulation 121
  6.3 High Level Abstractions 122
    6.3.1 Components as Functions 122
    6.3.2 Abstract Wires 122
    6.3.3 Stream Abstraction 124
    6.3.4 Pipelines as Stream Elements 126
    6.3.5 Multi-ported Memory Abstraction 126
    6.3.6 Recursion and Higher Level Functions 127
    6.3.7 Hierarchy Tagging 128
  6.4 Conclusion 129

7 DRPU FPGA Implementation 131
  7.1 Test Scenes 132
  7.2 FPGA Configuration 135
    7.2.1 Latencies and Throughput 136
    7.2.2 Floating Point Performance 139
  7.3 Performance Evaluation 140
  7.4 Memory Interface 142
  7.5 Scalability with Number of FPGAs 146
  7.6 Scalability with Number of Triangles 147
  7.7 Conclusions 150

8 DRPU ASIC Implementation 151
  8.1 Synthesis of different DRPU Configurations 152
  8.2 Place and Route (P&R) 155
  8.3 Performance Evaluation 158
  8.4 Conclusions 161

9 Final Summary, Conclusions, and Future Work 165
  9.1 Future Work 166
  9.2 Final Conclusions 167

A Abbreviations 169

B Phong 171

C Mandelbrot Shader 175

List of Figures

1.1 Rasterization Algorithm 3
1.2 KD Tree: Spatial Subdivision 5
1.3 Whitted Ray Tracer 6

2.1 DRPU Data Flow Chart 24
2.2 DRPU Architecture 25
2.3 DRPU Prototype FPGA Platform 35

3.1 B-KD Tree Node 40
3.2 Implicit B-KD Tree Box 41
3.3 Two-Level B-KD Trees 42
3.4 B-KD Tree Update 47
3.5 B-KD Tree Traversal 49
3.6 B-KD Tree Inner-Node Layout 51
3.7 B-KD Tree Pointer Layout 52
3.8 B-KD Tree Leaf-Node Layout 52
3.9 Vertex Position Layout 52
3.10 Triangle Layout 53
3.11 Triangle Vertex Position Pointers 53
3.12 Update Instruction Set 55
3.13 Update Processor 57
3.14 Update Algorithm 59
3.15 Ray Casting Unit 62
3.16 Traversal Processor 64
3.17 Traversal Slice and Decision Unit 65
3.18 Derivation of Packet Traversal Decision 66
3.19 Geometry Processor 69
3.20 Geometry Preparation Pipeline (1st Configuration) 72
3.21 Geometry Preparation Pipeline (2nd Configuration) 72
3.22 Cost per Pixel for Coherent Motion 75
3.23 Skeleton Animation: Frame-rate 75
3.24 Cost per Pixel for Incoherent Motion 76
3.25 Linear Complexity of Ray Casting in Randomly Moving Scenes 76

4.1 Shading Processor Registers Diagram 83
4.2 Computing a Branch Condition from a 4D Vector 89
4.3 Shading Processor: Abstract View 92
4.4 Shading Processor: Register Stack 98
4.5 Shading Processor: Implementation View 101
4.6 Mandelbrot Set 106

5.1 Skeleton Subspace Deformation 109
5.2 Skinning Processor 112
5.3 Skinning Processor: Set Matrix Row Instruction 113
5.4 Skinning Processor: Transform and Accumulate Instruction 113
5.5 Skinning Processor: Linear Runtime 115

6.1 Register with Clock Enable 120

7.1 DRPU Prototype FPGA Platform 131
7.2 Usage Statistics 137
7.3 Usage Dependent on the Number of Packets 138
7.4 DRPU Prototype Scalability with Scene Size (but Constant Working Set) 149
7.5 DRPU Prototype Scalability with Working Set Size 149

8.1 Post Synthesis Area depending on Packet Size 153
8.2 Post Synthesis Area depending on Number of Packets 153
8.3 Efficiency of the DRPU depending on Packet Size 154
8.4 Efficiency of the DRPU depending on the Number of Packets 154
8.5 Size of different Parts of the DRPU ASIC 155
8.6 Plot of the DRPU ASIC 157

List of Tables

1.1 Comparison of Ray Tracing and Rasterization 12

3.1 KD Trees, B-KD Trees, and AABVH Comparison 43
3.2 Split in the Middle via SAH 45
3.3 Update Processor Results 61
3.4 Cost of a Single Traversal Step for KD Trees, B-KD Trees, and AABVHs 67
3.5 Complexity of Geometry Processor 71
3.6 Frame-rate Comparison of SaarCOR, RPU, and OpenRT 73

4.1 Shading Processor Registers 82
4.2 Arithmetic Instructions 85
4.3 Memory, Control, and Trace Instructions 86
4.4 Logic/Integer Instructions 87
4.5 Control Flow Tag 96
4.6 Shading Processor Performance Evaluation 105
4.7 Comparison: Shading Processor versus CPU 107

5.1 Skinning Processor Results 114

7.1 Latency and Throughput of the FPGA Prototype Implementation 136
7.2 Instruction Packeting Efficiency and Hardware Usage 139
7.3 Number of Floating Point Units 140
7.4 Frames per Second for DRPU FPGA 141
7.5 Memory Packing Efficiency in the Shading Processor 143
7.6 Cache Hit-Rates and Bandwidth Reduction 145
7.7 DRAM Memory Statistics 146
7.8 DRPU Prototype Scalability with Number of FPGAs 148

8.1 Standard Cell Counts After Synthesis of the DRPU Architecture 156
8.2 Post Synthesis Area Results of the DRPU Architecture 156
8.3 Post Synthesis Timing Results of the DRPU Architecture 156
8.4 Post Place and Route Cell Counts 158
8.5 Post Place and Route Area Results 158
8.6 Post Place and Route Timing Results 158
8.7 Comparison of Different Hardware Platforms 159
8.8 Performance Comparison of OpenRT, Cell, and Different DRPU ASIC Configurations 160
8.9 Estimated Performance of the DRPU ASIC Versions 162

Chapter 1

Introduction

In computer graphics, rendering is the process of generating a two-dimensional image of a three-dimensional virtual scene from a camera's perspective. On the one hand, the generation of highly realistic images that look like photographs of real-world scenarios is challenging, as it requires simulating highly complex physical light effects such as reflections, refractions, and indirect illumination. On the other hand, real-time rendering requires more than twenty frames per second to be computed at a constant rate to guarantee fluid animation. Unfortunately, combining the realism and performance aspects of graphics proves to be very difficult and is possible only with trade-offs. Because of the mass market, three-dimensional computer games are the most important application for real-time rendering and are pushing the development towards faster and more capable graphics hardware, found in commodity PCs today. Despite computer gamers' primary focus on high frame-rates (of about 60 fps), there is also a trend towards more realistic graphics in games. Recent 3D games, like The Elder Scrolls IV: Oblivion [LLC06], require an enormous throughput of geometry, texture, and fragment data in order to achieve high realism. They increasingly improve realism by using shadows, reflections, and further multi-pass lighting effects that are important for perceiving the spatial relationship of objects. However, these effects become increasingly difficult to implement due to fundamental limitations of the rasterization algorithm, as will be explained in the following sections. These difficulties raise the question of whether the performance advantages of rasterization still prevail over the algorithmic advantages of other rendering techniques such as ray tracing.

No matter which algorithm is used for computer graphics, its goal typically is to solve the light transport [Kaj86] or rendering equation:

L_o(x, \vec{\omega}) = L_e(x, \vec{\omega}) + \int_{\Omega} b(\vec{\omega}', x, \vec{\omega}) \cdot (\vec{\omega}' \cdot \vec{n}) \cdot L_i(x, \vec{\omega}') \, d\vec{\omega}'    (1.1)


This equation is a simplified model of the physical properties of light. It defines the light distribution in a scene consisting of surfaces by modeling the way light is emitted and reflected multiple times on the surfaces.

The radiance L_o(x, \vec{\omega}) that leaves a point x on a surface in direction \vec{\omega} is computed as the sum of the self-emitted radiance L_e(x, \vec{\omega}) from point x in direction \vec{\omega} and the integral over the radiance from all directions \vec{\omega}', weighted by the BRDF b(\vec{\omega}', x, \vec{\omega}). This BRDF (Bidirectional Reflectance Distribution Function) specifies the reflectance properties of the material of the surface at point x. The term \vec{\omega}' \cdot \vec{n} is the dot product between \vec{\omega}' and the surface normal \vec{n} at point x and measures the radiance perpendicular to the surface normal. The last term L_i(x, \vec{\omega}') in the integral yields the incoming radiance for each direction \vec{\omega}'. This incoming radiance L_i can be obtained by using the ray tracing operator RT(x, \vec{\omega}') to compute visibility. It returns the closest surface point seen from point x in direction \vec{\omega}', thus L_i(x, \vec{\omega}') = L_o(RT(x, \vec{\omega}'), -\vec{\omega}'). This shows that the unknown function L_o appears on both sides of the rendering equation (on the left side and inside the integral).
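The integral in Equation 1.1 generally has no closed-form solution, so renderers approximate it numerically. As an illustration, a standard Monte Carlo estimator samples N directions \vec{\omega}'_k from a probability density p and averages the weighted contributions; the notation follows Equation 1.1:

```latex
L_o(x, \vec{\omega}) \;\approx\; L_e(x, \vec{\omega}) \;+\;
\frac{1}{N} \sum_{k=1}^{N}
\frac{ b(\vec{\omega}'_k, x, \vec{\omega}) \,\bigl(\vec{\omega}'_k \cdot \vec{n}\bigr)\,
       L_o\bigl(RT(x, \vec{\omega}'_k),\, -\vec{\omega}'_k\bigr) }{ p(\vec{\omega}'_k) }
```

Every sample evaluates the ray tracing operator RT once, which is why fast ray casting is the central operation of any renderer built on this equation.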

This recursive definition of the rendering equation makes recursive approaches to solving the light transport possible. Recursion is not strictly necessary to solve the equation, but recursive approaches have proven to work best for effects like reflections and refractions. In any case, every algorithm that approximates the rendering equation needs to perform light transports, which always requires some visibility computation to determine the start or end point of the transport.

In general there are two classes of rendering algorithms used to approximate the rendering equation: rasterization algorithms and ray tracing algorithms. These algorithms implement different ways to compute visibility, which are in some sense dual to each other. Rasterization algorithms perform forward projections, where geometry is projected onto an image plane, and ray tracing algorithms perform backward projections, where a ray is projected onto the scene to find the closest hit geometry. Thus rasterization is a geometry-centric algorithm as it processes geometry, and ray tracing is a ray-centric algorithm as it processes rays. Rasterization is known for high rendering speed due to efficient hardware implementations but has difficulties computing many rendering effects accurately. In contrast, ray tracing is known for the best image quality due to its simulation-based approach but for poor rendering performance.

The next three sections describe and compare these two fundamental rendering algorithms in more detail, to show their advantages and weaknesses in different scenarios.

Figure 1.1: Rasterization: Each triangle is projected onto the image plane defined by a pinhole-camera model. Each pixel covered by the triangle is shaded and written to the frame-buffer. For visibility testing, not only the color but also a depth value is stored. A pixel gets overwritten only if this would decrease its depth, thus pixels (or fragments respectively) that are closer to the camera overwrite pixels that are farther away. After rasterizing each triangle, each pixel contains the shaded color of the closest triangle to the camera.

1.1 Rasterization

The current state of the art in real-time rendering is the rasterization algorithm [FvDFH97], mainly because low-cost and efficient hardware implementations in the form of highly parallel Graphics Processing Units (GPUs) [Cor06d, TI06] are available that achieve remarkable levels of performance. These hardware architectures are still developing at a very high rate, extending in particular their programmable shading capabilities and support for highly complex scenes using occlusion queries [GM05].

The basic operation of the rasterization algorithm conceptually is to independently project one triangle at a time onto the image plane and to shade the covered pixels (see Figure 1.1). To obtain an image of the complete scene, each triangle is processed this way by parallel hardware. Visibility is computed by storing not only a color per pixel, but also depth information (z-value). A pixel is overwritten by a new incoming pixel (or fragment) only if this new pixel is closer to the camera. This depth test guarantees that after rasterizing all triangles, the color and depth of the triangle closest to the camera is stored per pixel. However, this algorithm can be very inefficient, as a new triangle may overwrite many pixels that have been computed previously.

Implementations of the rasterization algorithm are typically limited to triangles as primitives and use a regular grid of screen pixels and a camera projection. While this grid structure greatly simplifies hardware implementations, it also limits the hardware to performing regular samplings of a planar perspective projection only. This causes many problems for advanced rendering effects such as shadows, as the set of shadow rays no longer obeys this restriction. Thus re-sampling is required, which is prone to errors and artifacts, especially for thin geometry and at shadow borders. Irregular rasterization would solve this problem, but little research has been performed in this direction [AL04, JMB04].

For efficient hardware implementation, rasterization assumes that triangles can be rendered independently, which is fine as long as only the geometry is considered. But when advanced shading is taken into account, the assumption that triangles can be processed independently is no longer correct. Obviously, photons hitting a triangle are either absorbed or reflected to a different one, thus the shading of triangles often depends on other triangles. Rasterization tries to repair this issue by using multi-pass techniques that are inaccurate because of sampling problems, and inefficient especially with respect to external memory bandwidth. Furthermore, these techniques require intensive manual coding using the procedural interface to the graphics card. This interface can either be OpenGL [Inc06b] developed by SGI [Inc06c] or DirectX [Cor06c] from Microsoft [Cor06b]. Both interfaces are specifically designed to be used for rasterization. In a global render state, a current vertex shader, pixel shader, textures, etc. can be set to be used for rendering through the API. Triangles that are specified with the API are directly rendered using this render state to a render target such as a texture or the frame-buffer for direct display. The interface is very low level, thus advanced effects such as shadows, reflections, etc. are not directly supported by the API. These need to be computed manually by rendering the geometry several times to textures using appropriate hand-written shaders.
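The per-triangle loop with depth test described above can be illustrated with a small software sketch (flat per-triangle colors and vertices already projected to screen space are assumed; this is only an illustration, not the actual GPU pipeline):

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

struct Vertex2D   { float x, y, z; };                  // screen-space position plus depth
struct Pixel      { float r, g, b; };
struct Triangle2D { Vertex2D a, b, c; Pixel color; };  // flat color per triangle

// Signed-area edge function used for the coverage test.
static float edge(const Vertex2D& a, const Vertex2D& b, float px, float py) {
  return (px - a.x) * (b.y - a.y) - (py - a.y) * (b.x - a.x);
}

void rasterize(const std::vector<Triangle2D>& scene, int width, int height,
               std::vector<Pixel>& color, std::vector<float>& depth) {
  color.assign(width * height, Pixel{0, 0, 0});
  depth.assign(width * height, std::numeric_limits<float>::infinity());

  for (const Triangle2D& t : scene) {                  // every triangle independently
    const Vertex2D &a = t.a, &b = t.b, &c = t.c;
    float area = edge(a, b, c.x, c.y);
    if (area == 0) continue;                           // degenerate triangle

    // Loop over the pixels of the triangle's clipped bounding box.
    int x0 = std::max(0, (int)std::floor(std::min({a.x, b.x, c.x})));
    int x1 = std::min(width  - 1, (int)std::ceil(std::max({a.x, b.x, c.x})));
    int y0 = std::max(0, (int)std::floor(std::min({a.y, b.y, c.y})));
    int y1 = std::min(height - 1, (int)std::ceil(std::max({a.y, b.y, c.y})));

    for (int y = y0; y <= y1; ++y)
      for (int x = x0; x <= x1; ++x) {
        float px = x + 0.5f, py = y + 0.5f;
        float w0 = edge(b, c, px, py) / area;          // barycentric coverage test
        float w1 = edge(c, a, px, py) / area;
        float w2 = edge(a, b, px, py) / area;
        if (w0 < 0 || w1 < 0 || w2 < 0) continue;      // pixel not covered
        float z = w0 * a.z + w1 * b.z + w2 * c.z;      // interpolated depth
        int i = y * width + x;
        if (z < depth[i]) {                            // depth test: closest fragment wins
          depth[i] = z;
          color[i] = t.color;                          // purely local, per-triangle shading
        }
      }
  }
}
```

The inner depth test is exactly the visibility mechanism described above: whichever fragment is closest to the camera survives, regardless of the order in which triangles are submitted.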

1.2 Ray Tracing

Ray tracing [App68] simulates the physics of light, directly based on the light transport equation. By accurately solving the rendering integral through stochastical integration, ray tracing allows for computing advanced render- ing effects like global illumination or reflections, thus it does not suffer from algorithmic shading limitations. Widely used methods include Monte Carlo integration to gather incoming light [Jen03] or photon mapping to distribute outgoing light from the light sources [JC95]. The results of these compu- tations are high quality, photo-realistic images often indistinguishable from photographs [IRT06]. The basic operation of the ray tracing algorithm directly resembles the ray tracing operator by computing the closest intersection of a ray with the scene, also known as ray casting. The naive ray casting algorithm determines the closest intersection by intersecting the ray sequentially with each primi- tive of the scene, which works only as long as the number of primitives stays very small. For non-trivial scenes fine grained spatial index structures are required, to exclude primitives that are far away from the ray. These spatial index structures subdivide the space into subspaces or cells (see Figure 1.2) that can be traversed along the ray in front to back order, thus enumerating 1.2. RAY TRACING 5

Figure 1.2: Spatial Index Structure: This figure shows a KD Tree as an example for a spatial index structure. A KD Tree recursively subdivides the 3D space into subspaces and stores for each subspace (or cell), the overlap- ping triangles. During the traversal of a ray, the cells pierced by the ray are enumerated front to back and the ray is intersected with each of the contained triangles. The figure shows 3 rays of neighboring screen pixels with their coherent paths through the data structure. the geometry along and close to the ray. Each cell contains a reference to the overlapping geometry that is tested for intersection during traversal. This way, only geometry along the ray is tested for intersection while geometry far away is never touched. Thus only space or geometry that is close to the ray needs to be accessed for rendering, which is called the output-sensitivity of the ray tracing algorithm. Figure 1.2 illustrates that rays of neighbored pixels typically traverse a similar region in 3D space, and consequently also similar cells of the spatial index structure. This is a basic property called coherence that also holds for many shading operations and can be taken ad- vantage of to optimize algorithms and hardware. With different intersection functions, ray tracing can support arbitrary types of primitive geometry, such as: triangles, spheres, spline surfaces, quadrics, etc. Several different spatial index structures have been proposed to speed- up ray tracing. A first class of spatial index structures perform a spatial partitioning of the scene. Grids [CWBV83, AW87b] are the simplest data structure that uniformly subdivide the scene in each dimension. As the cell size is the same for each part of the scene, Grids cannot adapt to un- evenly dense triangulations. Octrees [Gla84, Arv88] subdivide the space recursively, where each node splits the space into 8 subspaces that are ar- ranged as a regular 2x2x2 grid. This recursive scheme can adapt well to the structure of the scene and Octrees are typically much smaller than Grids. A further approach are Binary Space Partitionings (BSPs) [FKN80]. These BSPs recursively subdivide space by splitting the current space using an arbitrary splitting plane into two subspaces. Because the planes can be cho- sen optimally, the data structure adapts very well to the scene geometry. KD Trees [Jan86, SF90] are a special case of a BSP tree where splitting 6 CHAPTER 1. INTRODUCTION

Figure 1.3: Whitted Ray Tracer: For each pixel a primary ray P is shot to compute the amount of light along this ray. Therefore, at the intersection point with the scene a shader gathers the incoming illumination for instance by shooting a shadow ray S towards the light source and a reflection ray R in the reflection direction. The shading of these secondary rays can themselves invoke further rays to be generated recursively during shading, such as the transparency ray T. planes are limited to axis aligned planes to reduce the complexity of single traversal steps and the size of the index structure. A different class of spatial index structures are Bounding Volume Hi- erarchies (BVH) [RW80], because they do not partition space but partition geometry. BVHs define a tree (of arbitrary arity) over the geometry, where each node stores bounds of the contained geometry. These bounds can be of different type, such as: axis aligned bounding boxes, spheres, or slabs. The structure of the Bounding Volume Hierarchy does not necessarily de- pend on the geometry position, thus on scene changes the structure can be maintained and bounds be updated. Even if using spatial index structures ray casting has linear worst case complexity in the number of scene primitives. An example would be a ray hitting a point in space that belongs to each primitive of the scene (for instance for a large triangle fan). As the geometry cannot be separated by their spatial position, they all need to be intersected independent of the kind of spatial index structure. But, in general, spatial index structures greatly improve the average case complexity, which makes them unavoidable for fast ray tracing implementations. Many measurements of the sub-linear average case complexity of ray tracing shows a logarithmic relation between the number of traversal steps and scene size [Hav01]. When speaking about ray tracing one typically refers to the the recur- sive ray tracing algorithm, also known as Whitted-Style Ray Tracing [Whi80] (see Figure 1.3). There, for each pixel of the image a primary ray P is shot into the scene using a pinhole-camera model. For this primary ray, and any 1.3. RAY TRACING VERSUS RASTERIZATION 7 further secondary rays, the amount of light traveling along the ray is com- puted. This is performed by recursively following the most important paths of light backwards towards the light sources. E.g. at the intersection point A of the primary ray with the scene a further shadow ray S and reflection ray R can be used in a shading computation to compute the amount of light along the primary ray. The shadow ray contributes if it hits no objects in front of the light source, and the reflection ray invokes further shading op- erations that recursively spawn new secondary rays, like the transparency ray T . Each shading operation at the intersection point uses local proper- ties of this point (such as color, reflectivity, etc.) and has global access to the scene by shooting sample rays to gather incoming light. Thus shading performs the integration of the hemisphere of a point, as defined by the ren- dering equation. This requires an evaluation of the BRDF function at the surface point and a sampling strategy to be computed in the shader. Shad- ing effects such as shadows, reflections, and refractions from other points can be automatically combined by the recursively shoot rays, which is often referred to as “plug and play” shading. 
An application only provides the scene geometry plus shaders that describe the material properties, while the ray tracing algorithm automatically computes the image. This is possible without any special treatment by the application, which would be necessary with rasterization.

The recursive ray tracing algorithm operates in the opposite direction to the way light travels in physics, in order to compute only those light paths that contribute most to the final result. As photons are independent in physics, all shot rays can be handled independently of one another; thus ray tracing is a highly parallel algorithm, which makes fast hardware implementations possible. All hardware algorithms in this thesis are based on this very parallel and flexible recursive ray tracing algorithm.
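To make the recursive structure concrete, a heavily simplified Whitted-style shade function is sketched below; it reuses the Vec3, Ray, Triangle, Hit, and rayCast definitions from the earlier ray casting sketch, and constants stand in for a real material system:

```cpp
#include <cmath>
#include <vector>

struct Color { float r, g, b; };
static Color operator+(Color a, Color b) { return {a.r + b.r, a.g + b.g, a.b + b.b}; }
static Color operator*(Color a, float s) { return {a.r * s, a.g * s, a.b * s}; }
static Vec3  operator+(Vec3 a, Vec3 b)   { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static Vec3  operator*(Vec3 a, float s)  { return {a.x * s, a.y * s, a.z * s}; }
static Vec3  normalize(Vec3 v)           { float l = std::sqrt(dot(v, v)); return v * (1.0f / l); }

// Mirror reflection of direction d about the unit normal n: r = d - 2 (d . n) n.
static Vec3 reflect(Vec3 d, Vec3 n) { return d - n * (2.0f * dot(d, n)); }

// Shadow ray S: is the segment from p to the light position blocked?
static bool shadowed(const std::vector<Triangle>& tris, Vec3 p, Vec3 light) {
  Vec3 d = light - p;                  // unnormalized, so t in (0,1) means "before the light"
  Ray s = {p + d * 1e-4f, d};          // small offset avoids self-intersection
  return rayCast(s, tris).t < 1.0f;
}

// Whitted-style shading: a local term gated by a shadow ray plus a recursively
// traced reflection ray. A real shader would evaluate the BRDF and textures here.
Color shade(const std::vector<Triangle>& tris, Vec3 light, const Ray& ray, int depth) {
  Hit hit = rayCast(ray, tris);
  if (hit.prim < 0) return {0, 0, 0};                         // ray left the scene

  const Triangle& tri = tris[hit.prim];
  Vec3 p = ray.org + ray.dir * hit.t;                         // intersection point
  Vec3 n = normalize(cross(tri.b - tri.a, tri.c - tri.a));    // geometric normal
  if (dot(n, ray.dir) > 0) n = n * -1.0f;                     // face the normal towards the ray

  Color c = {0, 0, 0};
  if (!shadowed(tris, p + n * 1e-4f, light))                  // direct light only if unoccluded
    c = {0.7f, 0.7f, 0.7f};                                   // constant diffuse term for brevity

  const float kr = 0.3f;                                      // constant reflectivity for brevity
  if (depth < 4) {                                            // bound the recursion depth
    Ray r = {p + n * 1e-4f, reflect(ray.dir, n)};
    c = c + shade(tris, light, r, depth + 1) * kr;
  }
  return c;
}
```

Further effects (refraction, multiple lights, transparency rays like T in Figure 1.3) are added by spawning more rays in the same way, which is the "plug and play" combination of effects mentioned above.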

1.3 Ray Tracing versus Rasterization

Ray tracing and rasterization mainly differ in the way they perform visibility computations. Ray tracing directly implements the ray tracing operator by finding the closest geometry from a point in a given direction. Consequently it can perform single light transports, which can be used to stochastically solve the rendering equation by sampling optimal directions. However, these frequent ray queries require spatial index structures to be efficient.

The visibility computation in rasterization follows a different approach, where visibility is tested from a single point into several directions (one for each pixel). In principle, each triangle is sequentially tested against all of these directions, thus many rays (or directions) are processed in parallel. To speed up this process, rasterization poses some restrictions on the directions, which have to go through a regular grid of pixels. This regular structure is a kind of acceleration structure over the directions, which makes hardware implementations possible but causes limitations in practice. Consequently, rasterization can compute multiple light transports in parallel, however only for a restricted set of rays. Sampling the optimal direction, such as the reflection direction, is not possible with this approach. On the other hand, rasterization algorithms can directly render triangles without requiring any spatial index structure.

At first glance, packet-based ray tracing algorithms are very similar to rasterization as they also perform multiple light transports in parallel. The main difference is that most of these packet-based ray tracing approaches can process arbitrary packets of rays. This is a big difference as it makes the sampling of arbitrary directions possible without any algorithmic limitations. Reflections can still be computed accurately with packet-based ray tracing approaches as light can be followed back exactly in the reflected direction. However, some packet-based ray tracing approaches also obey some restrictions, such as a bounding frustum for the rays [RSH05].

The basic differences just described in the kinds of possible light transports and the required acceleration structures cause advantages and disadvantages for both algorithms. Ray tracing has advantages in shading and can easily be parallelized, but has difficulties with dynamic scenes. On the other hand, rasterization has difficulties with computing secondary effects. However, GPUs can also be parallelized and support arbitrary dynamic scenes very well. These differences are discussed in more detail in the next sections.

1.3.1 Dynamic Scenes

As no auxiliary spatial index structures are required for rasterization, the triangle locations can change freely and rasterization is perfectly suited for handling arbitrary types of dynamic scenes. Even random changes of geometry can directly be displayed in the next frame by sending the randomized triangle positions to the rasterization pipeline. Due to the lack of spatial index structures this rendering is only possible in linear time, as each triangle might be visible and thus needs to be projected onto the screen.

In contrast, ray tracing heavily relies on fine-grained (hierarchical) spatial index structures to make ray casting efficient. These spatial index structures are costly to compute [WBS03], as a spatial sorting with best-case O(n log n) complexity is required. This limits ray tracing in its support for dynamic scenes and would heavily restrict its application in 3D computer games. To become an alternative to rasterization, performance and even more so efficient support for dynamic scene changes are important.

Several different approaches exist for the support of dynamic scenes in a ray tracer. Separation of the scene into objects with piece-wise rigid motion and separate static spatial indices has been suggested by [LAM01] and has been implemented for real-time use on a cluster of PCs [WBS03]. While this technique is very useful for object instantiation and for few moving objects, it cannot handle deforming geometry. Grids allow for the fast insertion of primitives [RSH00] and even fast coherent rendering [WIK+06]. However, Grids typically consume a lot of memory and cause many problems with unevenly dense triangulations as they have a fixed cell size. Bounding Volume Hierarchies can easily be updated by modifying the node bounds while maintaining the tree topology. For this reason they are well suited for coherently changing geometry [TL03, WBS06a]. They can also be used to speed up collision detection of deformable objects [vdB97]. Even a later insertion or deletion of objects to or from the index structure is possible with these three approaches; however, sequentially inserting many objects into a Bounding Volume Hierarchy may result in a suboptimal tree.

None of these approaches allows for real-time ray tracing of arbitrary dynamic scenes: the first approach only supports few rigid-body objects, Grids cannot support animations with varying density of the triangles, and bounding volume hierarchies rely on coherent motion. Fortunately, in most applications only rigid and coherently deforming objects occur. Thus, these objects can easily be handled by the mentioned methods or combinations of them.
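The update (refit) of such a hierarchy after a coherent deformation can be done in a single bottom-up pass, roughly as sketched below for a binary BVH with axis-aligned boxes (a hypothetical node layout, not the B-KD Tree layout introduced in Chapter 3):

```cpp
#include <algorithm>
#include <vector>

struct AABB { float lo[3], hi[3]; };

static AABB merge(const AABB& a, const AABB& b) {
  AABB r;
  for (int k = 0; k < 3; ++k) {
    r.lo[k] = std::min(a.lo[k], b.lo[k]);
    r.hi[k] = std::max(a.hi[k], b.hi[k]);
  }
  return r;
}

// Hypothetical binary BVH node: a leaf references one primitive, an inner node two children.
struct BVHNode {
  AABB bounds;
  int  left = -1, right = -1;   // child node indices, -1 for a leaf
  int  prim = -1;               // primitive index for leaves
};

// Refit: the tree topology stays fixed; only the bounds are recomputed bottom-up
// from the (already moved) primitive bounds. Runs in O(n) per frame.
AABB refit(std::vector<BVHNode>& nodes, const std::vector<AABB>& primBounds, int i) {
  BVHNode& n = nodes[i];
  if (n.left < 0)                                   // leaf: take the primitive's new bounds
    n.bounds = primBounds[n.prim];
  else                                              // inner node: union of the refitted children
    n.bounds = merge(refit(nodes, primBounds, n.left),
                     refit(nodes, primBounds, n.right));
  return n.bounds;
}
```

This maintenance of a fixed topology while recomputing bounds is the operation that the DRPU delegates to its dedicated Update Processor for B-KD Trees, which store one-dimensional bounds per child instead of full boxes.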

1.3.2 Scalability

Many parts of the rasterization algorithm, such as vertex processing, rasterizing, and fragment processing, can efficiently be parallelized on-chip [Cor06d, TI06]. This is possible because triangles are processed completely independently by the rasterization algorithm, thus rasterization only needs to distribute triangles to several processing units. However, results need to be written back to a common frame-buffer, which requires high bandwidth. This bandwidth can be achieved by partitioning the frame-buffer over several external DRAM chips that are accessed in parallel. On-chip caches can further reduce the external memory accesses if geometry is sent locally to the graphics chip. The "scalable link interface" is a different technology that even enables the use of a few graphics cards in parallel. There the geometry is distributed to the graphics cards and the frame-buffer is combined after each chip has rendered its set of triangles.

Ray tracing is "embarrassingly" parallel and scalable due to the independence of rays (or pixels), which is a consequence of the physical properties of light. Thus ray tracing algorithms typically scale by distributing the pixels of the image to several rendering units. Unfortunately, a rendering unit that is operating on a small part of the image might need access to large parts of the scene (due to secondary rays, for instance). As a consequence, performance can only be scaled as long as the bandwidth to the complete scene database scales with the number of parallel rendering units. This is simple to achieve if each rendering unit stores the currently visible working set of the scene in its own local memory.

In a nutshell, both algorithms can be parallelized. Rasterization algorithms are parallel in the processing of triangles and ray tracing algorithms in the processing of rays (or screen pixels). In rasterization the parallel rendering units have their own part of the scene to work on but need to combine their results into a common frame-buffer. In ray tracing the parallel rendering units have their own part of the frame-buffer to compute but need access to, in principle, the complete scene.
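Distributing pixels (or blocks of pixels) over rendering units, as described above, can be sketched in software with one worker thread per interleaved set of image rows; the shade function from the earlier sketch is reused, and makePrimaryRay is a hypothetical pinhole-camera helper:

```cpp
#include <thread>
#include <vector>

Ray makePrimaryRay(int x, int y, int width, int height);   // hypothetical pinhole-camera helper

// Each worker owns a disjoint set of image rows (its part of the frame-buffer)
// but reads the shared, read-only scene: the scaling pattern described above.
void renderParallel(const std::vector<Triangle>& tris, Vec3 light,
                    int width, int height, std::vector<Color>& image,
                    int numWorkers = 4) {
  image.assign(width * height, Color{0, 0, 0});
  std::vector<std::thread> workers;

  for (int w = 0; w < numWorkers; ++w)
    workers.emplace_back([&, w] {
      for (int y = w; y < height; y += numWorkers)          // interleaved rows per worker
        for (int x = 0; x < width; ++x)
          image[y * width + x] = shade(tris, light, makePrimaryRay(x, y, width, height), 0);
    });

  for (std::thread& t : workers) t.join();
}
```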

1.3.3 Shading

As rasterization treats triangles individually, the shading operations are inherently limited to depend only on local data provided with each triangle and some limited global state. While this purely local computation allows for very efficient hardware implementations, all advanced rendering effects such as shadows, reflection, refraction, or indirect illumination cannot be directly computed this way. While these effects can be approximated using multi-pass rasterization techniques, this often results in sampling artifacts and is inefficient with respect to memory bandwidth. For shadow or reflection computations, typically each such render pass requires rasterizing the triangles of the scene several times to high resolution shadow or environment maps. These maps need to be of high resolution to reduce sampling artifacts due to the fixed sampling directions. These multi-pass techniques are often not obvious and are difficult to implement, especially if several effects need to work together, like shadows seen through a reflection. Nevertheless, the combination of several effects is possible with rasterization.

Simulation-based ray tracing techniques are necessary to easily and accurately support advanced rendering effects. In contrast to rasterization, ray tracing material shaders have access not only to local shading properties of the surface, but also to the incoming light from all directions. This accurate and direct radiance query makes writing shaders intuitive, without the need for multi-pass techniques. In contrast to rasterization, ray tracing supports declarative scene descriptions that are completely evaluated within the renderer; thus it allows for a hardware acceleration of the complete rendering process. This greatly simplifies content creation compared to the inherently procedural interface of rasterization, where the application must handle all global effects (e.g. through precomputed texture maps). Furthermore, ray tracing is not limited to a fixed grid of sample points but can choose the sampling positions freely by tracing arbitrary query rays. Even reflections covering only a few pixels can be computed efficiently, which is not possible with rasterization hardware as it is missing a 3D spatial index to quickly locate the relevant triangles seen through the tiny reflection.

1.3.4 Complex Scenes

To handle highly complex scenes of many millions of triangles, rasterization also relies on spatial indices to render only potentially visible geometry. In contrast to ray tracing, these spatial indices are typically coarse and managed by the application. The application traverses the spatial index in front-to-back order, while hardware-accelerated occlusion queries determine whether any geometry of a cell of the spatial index has been drawn because it passed the depth test. If no geometry of a cell has been drawn, and every pixel covered by the cell has already been painted with a color, then no geometry in a cell behind it can contribute to the image, as it is occluded. This technique allows the rendering of scenes with several hundred million triangles on current graphics cards as long as occlusion is high or level-of-detail algorithms are applied [GM05]. However, the processing of this spatial index structure requires control by the application and cannot be performed by the GPU alone.

For ray tracing, support for highly complex scenes comes for free [WDS04], as the ray casting operation is logarithmic in the scene complexity if spatial index structures are used and constant memory access time is assumed [Hav01]. As long as the visible set of triangles stays small enough, caching is effective and bandwidth to the scene database is not crucial. But for views of highly complex scenes where many triangles are visible per pixel, caching is no longer effective and memory access patterns get very incoherent, which causes the memory access time to increase. For such highly triangulated scenes, raw bandwidth to the large scene database is crucial for achieving high performance. For instance, the performance of the DRPU hardware implementation will drop in these cases, as its performance builds on efficient caching and fairly low external memory bandwidth (see Chapter 7).

Looking at the summary in Table 1.1, ray tracing has advantages in most areas. Thus, from the algorithmic aspect, ray tracing clearly dominates rasterization with its capabilities, such as intuitive shading, logarithmic complexity, and free sampling positions resulting in high image quality. Support for dynamic scenes is limited, but ray tracing can keep up with rasterization as long as the dynamics are coherent, which is the case in traditional 3D computer games. As a consequence, ray tracing fulfills the requirements to be used as the rendering algorithm in computer games, and it would simplify content creation and allow for much higher realism.

However, from the implementation aspect, ray tracing can only compete with GPUs with its own dedicated hardware solution to achieve frame-rates that are sufficient for game play. Thus this thesis presents the DRPU ray tracing hardware architecture, with similar capabilities as rasterization technology, that would even provide sufficient performance on a single chip if manufactured in a 90nm process (see Chapter 8).
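The application-driven occlusion culling loop described above might look roughly as sketched below, using standard OpenGL occlusion queries; the cell type, the front-to-back ordering, and the drawing helpers are hypothetical placeholders:

```cpp
#include <GL/gl.h>
#include <vector>

// Hypothetical coarse, application-managed spatial index cell.
struct Cell { /* bounding box, triangles, children, ... */ };

void drawBoundingBox(const Cell& cell);                 // hypothetical: draw only the cell's box
void drawGeometry(const Cell& cell);                    // hypothetical: draw the cell's triangles
std::vector<const Cell*> frontToBack(const Cell& root); // hypothetical: cells sorted for the camera

void renderWithOcclusionCulling(const Cell& root) {
  GLuint query;
  glGenQueries(1, &query);

  for (const Cell* cell : frontToBack(root)) {
    // Test the cell's box against the depth buffer written by the closer cells,
    // without touching the color or depth buffer.
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_FALSE);
    glBeginQuery(GL_SAMPLES_PASSED, query);
    drawBoundingBox(*cell);
    glEndQuery(GL_SAMPLES_PASSED);
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_TRUE);

    GLuint samples = 0;
    glGetQueryObjectuiv(query, GL_QUERY_RESULT, &samples);  // blocks; real renderers overlap this
    if (samples > 0)                                        // some pixels passed the depth test
      drawGeometry(*cell);                                  // cell is potentially visible
    // otherwise the cell is occluded and its geometry is skipped entirely
  }
  glDeleteQueries(1, &query);
}
```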

|                         | Rasterization                                                          | Ray Tracing                                                              |
|-------------------------|------------------------------------------------------------------------|--------------------------------------------------------------------------|
| Dynamic scenes          | Comes for free, as no spatial index is required.                        | Limited by the need for spatial indices.                                  |
| Complex scenes          | Possible with occlusion culling and level of detail.                    | Possible, as ray casting has logarithmic complexity.                      |
| Shading                 | Many approximations, local per-triangle shading, difficult to combine effects. | Accurate simulation, global shading, recursive combination of effects.    |
| Secondary effects       | Inaccurate and bandwidth consuming.                                     | Accurate.                                                                 |
| Small secondary effects | Inefficient due to large-grained rendering passes.                      | Efficient due to fine-grained ray casting.                                |
| Hardware acceleration   | Procedural control by the application.                                  | Acceleration of the complete rendering possible.                          |
| Performance             | High due to hardware acceleration.                                      | Low as long as hardware acceleration is still missing.                    |

Table 1.1: A comparison of the advantages and disadvantages of ray tracing and rasterization.

1.4 Previous Work

This section cannot list all previous work in real-time ray tracing, but shows some important papers. The section is structured into different hardware platforms that have been used (more or less successfully) for fast ray tracing implementations.

1.4.1 Shared Memory Machines

Interactive ray tracing has been mapped to many different kinds of supercomputers such as SGI shared memory machines [PSL+99, Muu95], the CRAY T3E [Neb97], the nCUBE2 [ASR98], the Hypercube iPSC/2 [BP90], or the KSR1 [KH95]. All these general purpose platforms provide much flexibility for high speed implementations. In particular, the scalability of the ray tracing algorithm was proven: by increasing the number of processors one can scale performance mostly linearly. This is possible by processing blocks of pixels in parallel on the rendering units without any need for communication except for displaying the result.

1.4.2 Commodity PCs

More recently, the OpenRT project [WSBW01, WPS+03] brought interactive performance also to clusters of standard PCs. A speed-up of a factor of 10 to 30 compared to standard ray tracing implementations was possible with many optimizations, such as data layout, optimized acceleration structures, assembly coding, and operating on packets of rays. This packeting approach in particular made high performance on standard CPUs possible, where four rays can easily be processed simultaneously using SIMD instructions.

By extending this packet approach to very large packets (up to 64 rays) and a frustum traversal algorithm, it was recently possible to increase performance on standard desktop PCs again by a factor of 10 [RSH05] compared to OpenRT. This frustum traversal algorithm directly traverses the frustum of the ray bundle without considering the single contained rays. Thus its traversal complexity depends only on the shape of the frustum and is independent of the actual number of contained rays. Obviously, ray/triangle intersections still need to be performed per ray.
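Such packets map naturally onto a structure-of-arrays layout in which each arithmetic operation is applied to four rays at once. A rough sketch is shown below as a plain loop that hand-written SSE code or a vectorizing compiler turns into 4-wide instructions; this is not the OpenRT code:

```cpp
#include <array>

// Four rays stored component-wise ("structure of arrays"): one 4-wide register
// can then hold the same component of all four rays.
struct RayPacket4 {
  std::array<float, 4> orgX, orgY, orgZ;
  std::array<float, 4> dirX, dirY, dirZ;
  std::array<float, 4> tHit;      // per-ray closest hit distance found so far
  std::array<bool,  4> active;    // mask: rays of a packet may become inactive individually
};

// Distance of all four rays to an axis-aligned plane, as needed during traversal.
// With SSE the loop body becomes a handful of 4-wide subtract/divide instructions.
inline std::array<float, 4> planeDistance4(const RayPacket4& p, int axis, float planePos) {
  const std::array<float, 4>* org[3] = {&p.orgX, &p.orgY, &p.orgZ};
  const std::array<float, 4>* dir[3] = {&p.dirX, &p.dirY, &p.dirZ};
  std::array<float, 4> t;
  for (int i = 0; i < 4; ++i)
    t[i] = (planePos - (*org[axis])[i]) / (*dir[axis])[i];
  return t;
}
```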

1.4.3 GPUs

Ray tracing has also been mapped to programmable GPUs using different kinds of spatial index structures and different methods to work around GPU limitations. In particular, the limited programming model of GPUs makes it difficult to achieve performance levels comparable to standard CPUs, despite GPUs having a much higher core floating point performance and memory bandwidth. It is for instance difficult to build stacks on the GPU and to hold them on-chip. This makes the traversal of the most efficient data structures, such as standard KD Trees, impossible. The first approaches that map ray tracing to the GPU concentrated on a speed-up of the ray/triangle intersection only [CHH02]. However, the lack of spatial index structures makes these approaches too slow for larger scenes. Grids can easily be used as spatial index structure on GPUs [Pur04] as they require no stack for cell traversal. Performance can be improved this way, but Grids impose many limitations. The cell size of the Grid is equal all over the scene, which causes problems if many small triangles are located in a small part of the scene. These triangles would all be in a single cell, making the intersection with this cell expensive. Furthermore, Grids can easily become very large because their size increases with O(n^3) in the number of cells per dimension. KD Trees do not have these limitations and achieve good performance in static scenes, but typical implementations require a stack. This stack can be emulated with multiple rendering passes, and different approaches recompute the stack (instead of storing it) to implement pop operations [FS05]. A further speed-up of a stack-free traversal can be achieved by storing extra pointers in the leaf nodes that point to neighboring inner nodes [Pop06].

However, despite the intensive research in mapping ray tracing to GPUs, none of the known approaches can efficiently exploit the available floating point performance of the GPUs. High speed software implementations mostly beat the GPU ray tracers because of the high flexibility of the CPUs, which allows for advanced implementations such as frustum traversal.

The situation could change with the new Shader Model 4.0, as there a maximum of 4,096 temporary registers can be used by a shader. This would allow holding a large computational state in the register file, such as a stack for instance. However, it is still unclear how this large register file will be internally managed by the hardware and whether the data would really stay on-chip or be written to external memory if some internal resources are exhausted.

1.4.4 Cell

A different platform of interest for real-time ray tracing is the Cell Broadband Engine [Zim03]. The Cell consists of 8 Synergistic Processing Elements (SPEs) and a Power PC Processing Element (PPE). The Power PC is typically used for communicating with the SPEs that perform the actual computation. Each SPE contains 128 SIMD registers with 4 components of 32 bits each and a local store of 256 KBytes. The SPEs cannot access main memory directly, but have to load memory blocks to the local store via DMA in order to operate on them. The Cell chip will be used as the main processor in the Sony PlayStation 3.

Ray tracing has recently been implemented on the Cell architecture [BWSF06]. However, the Cell makes the implementation of a high performance ray tracer challenging, especially because all memory access needs to be handled in software (including caching). Such a software managed cache can cause a high overhead, as it needs to be accessed frequently during traversal (and shading). Amortizing the memory lookup over many rays in a packet helps in the implementation, but the size of the local store can still easily become a limitation. Nevertheless, the paper shows the highest ray tracing performance achieved on a single chip so far, thus the Cell can be used for real-time ray tracing. However, dedicated ray tracing hardware can easily be faster than such general purpose approaches, which is shown in Chapter 8 by a performance comparison of the DRPU approach and the ray tracing implementation on the Cell.
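To make the overhead of software caching concrete, the following C++ sketch shows a direct-mapped software cache in front of a DMA transfer, roughly the kind of bookkeeping an SPE-based ray tracer has to perform for every node or triangle fetch. The cache parameters and dma_fetch() are placeholders chosen for illustration, not Cell SDK calls; dma_fetch() is stubbed so the sketch is self-contained.

    #include <cstdint>
    #include <cstring>

    constexpr int LINE_BYTES = 128;        // assumed line size of the software cache
    constexpr int NUM_LINES  = 256;        // lines that fit into the local-store budget

    struct SoftCache {
        uint32_t tag[NUM_LINES];                          // main-memory address of each line
        alignas(16) uint8_t data[NUM_LINES][LINE_BYTES];  // cached data in the local store
    };

    // Placeholder for the DMA transfer an SPE would issue (not a real SDK call);
    // stubbed here so the sketch compiles and runs on its own.
    void dma_fetch(void* dst, uint32_t mainMemAddr, int bytes)
    {
        (void)mainMemAddr;
        std::memset(dst, 0, (size_t)bytes);               // real code would copy from main memory
    }

    // Every node access during traversal pays for this lookup in software,
    // whereas a hardware cache is queried transparently.
    const uint8_t* cached_load(SoftCache& c, uint32_t addr)
    {
        uint32_t lineAddr = addr & ~(uint32_t)(LINE_BYTES - 1);
        uint32_t slot     = (lineAddr / LINE_BYTES) % NUM_LINES;  // direct-mapped slot
        if (c.tag[slot] != lineAddr) {                            // miss: fetch the line
            dma_fetch(c.data[slot], lineAddr, LINE_BYTES);
            c.tag[slot] = lineAddr;
        }
        return c.data[slot] + (addr - lineAddr);                  // pointer into the local store
    }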

1.4.5 Custom Hardware

Multiple hardware projects proposed different custom hardware architectures for the ray tracing algorithm. This section summarizes the most important projects for visualization of volume and surface models.

Many hardware architectures have been published for volume visualization, however only two projects are discussed here. The first is the VIZZARD II board [MKS98], an extension of the VIZZARD architecture [KS97], which implements complete volume visualization using ray casting of regular data sets. To improve image quality, oversampling can be performed in each dimension, and gradients are precomputed for fast gradient interpolation. To get high bandwidth to the volume data, the design contains caches and accesses 4 DRAM chips in an interleaved burst mode to fetch 2x2x2 blocks of the volume.

A different hardware project implemented the commercially available VolumePro visualization chip [PHK+99]. The VolumePro design is manufactured as an ASIC (the vg500 chip) and implements ray casting to render the volume. By applying the shear-warp algorithm, the hardware processes the volume slice by slice and is limited to parallel projections. Using this slice based approach, data can be reused optimally, as the hardware always stores the last 2 slices in internal buffers. This makes a fast interpolation of the gradient possible without any preprocessing. The hardware supports per sample Phong illumination, super-sampling, and cut planes. The vg500 chip can process up to 500 million interpolated voxels per second, which is sufficient to render 256³ volumes at 30 fps.

The company ART-VPS distributes the AR350 chip specifically designed for ray tracing of surface models [Hal01]. Their hardware solution never focused on the interactive market, but on accelerating existing rendering systems. The hardware design is no longer up-to-date, as it uses a 0.22 micron process and operates at a clock frequency of only 33 MHz. The PURE PCI-X board contains 16 AR350 chips and is capable of performing 1.1 billion ray/triangle intersections and shading operations per second. An AR350 chip contains two rendering cores, each capable of one ray/triangle intersection per clock cycle plus shading operations. Unfortunately, no detailed information about the shading processor is given. As their design lacks a ray traversal accelerator, real-time performance cannot be achieved with this technology.

A very unusual ray tracing hardware architecture has been simulated in the paper [KiSSO02]. It describes a ray tracing memory that stores triangles and contains additional circuitry for Grid traversal and ray/triangle intersection. This approach is not a typical Von Neumann architecture with a central processor and separate memory, as the 3DCGiRAM uses an object-space parallel processing mode: each memory cell stores and processes its part of the scene. In addition to its set of triangles, it also stores a Grid as spatial index structure and contains a 3DDDA line generator for Grid traversal and 16 ray/triangle intersection units (ICUs). As each ICU performs one ray/triangle intersection every 32 cycles, at peak one triangle intersection finishes every 2 cycles. A global Ray Distributor is responsible for transferring rays from one processing element to the next if necessary.
Software simulations show that the 3DCGiRAM would achieve about 1 fps for the Conference scene at a resolution of 512x512 pixels and a clock frequency of 200 MHz. It requires a huge bandwidth of 6.4 GB/s between the ICUs and the scene memory, however this bandwidth can easily be achieved on-chip. The Conference scene is rendered at 10.6 fps by the DRPU FPGA prototype, even though the prototype has only about a third of the ray/triangle intersection performance. This indicates that the Grid data structure (64x64x64 cells) used in the 3DCGiRAM is too coarse and causes many unnecessary ray/triangle intersections. Hierarchical spatial index structures would adapt better to the scene structure and result in fewer ray/triangle intersections. Furthermore, the object-space subdivision scheme does not scale very well, because ray data needs to be transferred between possibly many 3DCGiRAM cells. On chip this should be possible, but between different chips this could easily cause bandwidth problems and is difficult to scale; it is not clearly stated in the paper how this should work in practice. Additionally, the object-space partition achieves poor performance if the view is directed onto only a small part of the scene: then only one of the processing elements computes the image, which will probably be too slow. That is one of the main reasons why the DRPU approach does not use object-space subdivision, but performs an image-space subdivision of the computation based on the pixels. However, this requires a rendering unit to potentially process the entire scene database, a problem that can be solved using caches (and possibly a virtual memory management).

Processors with a reconfigurable instruction set are becoming more and more popular. One of these processors is the MorphoSys chip [Mor06], which consists of a simplified general purpose MIPS-like RISC processor, called TinyRISC, and 8x8 coarse grained reconfigurable cells organized as a SIMD architecture. MorphoSys is realized using 0.13µm technology and runs at up to 450 MHz. The paper [SEDT+03] describes a ray tracing architecture mapped to this processor. Shading and ray/triangle intersections are performed on the TinyRISC processor, but traversal through an Octree is sped up by leveraging the reconfigurable cells. In this setup the MorphoSys chip runs at 300 MHz and is connected to 32 MB of DRAM clocked at 133 MHz. Their results for rendering scenes with 69k triangles are about 12 fps at a resolution of 256x256 pixels. This is roughly a quarter of the performance the DRPU FPGA prototype achieves at a clock frequency of 66 MHz. One reason for the lower performance may be either the Octree based spatial index structure, which cannot optimally adapt to the scene structure, or the very limited floating point performance of only 0.3 GFlops of the RISC processor. However, the paper shows that reconfigurable processors like the MorphoSys chip can speed up ray tracing by implementing some parts of the algorithm in reconfigurable logic.

A hardware project at the University of Toronto focused on the development of a ray tracing hardware on an FPGA [FR03]. The paper describes a hardware architecture consisting of an intersection unit that is used for ray/triangle intersection and for traversal of a three level Bounding Volume Hierarchy. During traversal the intersection unit intersects the ray against the bounding planes that define the volumes of the node.
The hardware works on packets of 3 rays to balance memory speed with computation speed and performs multi-threading using two of these 3-ray packets. Their prototype platform consists of two Xilinx Virtex 2000E FPGAs clocked at 50 MHz, one of which is connected to external SRAM. Their rendering performance is between 3 and 7 frames per second at low resolution (no exact data is provided in the paper) for scenes of 1k to 51k triangles.

Another ray tracing hardware project at Jesus College uses an FPGA exclusively for intersection calculations of rays and spheres [Sri02]. Spheres were chosen as primitive because of their simple intersection function. The only scene considered in that thesis is the Sphere-flake scene, which is rendered including multi-level reflections and Phong shading. As no spatial index structure is used, the approach scales only linearly with the scene size, thus large scenes cannot be rendered efficiently. Furthermore, the limitation to spheres as primitives makes the system incompatible with the state-of-the-art approaches that use triangles.

A recent short paper [KZK+06] sketches a ray tracing hardware architecture that can efficiently cast primary rays through a grid of pixels. The approach builds planes for each of the M columns and each of the N rows of the screen and traverses these planes through a bounding sphere hierarchy. This results in M + N planes to be processed, rather than M · N rays for traditional methods. The planes get intersected with the bounding sphere hierarchy of the scene, and for each plane the intersected leaf cells are stored. In a second pass, this information can be used to efficiently determine the closest intersection of each primary ray. The paper contains some preliminary results that indicate frame-rates of 0.1 to 0.2 fps for a system operating at 100 MHz. Although no resolution is given for the images, the results indicate that the performance of this system will not be very high. One reason might be that the traversal of planes is difficult to terminate early and that the intersected leaf cells per plane need to be stored in a suitable data structure. To determine the intersection for the individual rays, this data structure needs to be queried, which causes additional overhead. However, the paper does not describe exactly how these algorithmic problems are solved.

Another ray tracing hardware project was started in Delhi [Pat03]. Its goal is to speed up RayLab [Gee06], an open source ray tracing system. The report describes computationally expensive parts of the RayLab code (such as ray/sphere intersection, ray/box intersection, etc.), which the project plans to accelerate using FPGA technology. Bandwidth to the FPGA might be a problem with this approach, but newer results for the project are not available.

1.4.6 SaarCOR Project

This thesis was written in the context of the SaarCOR project (Saarbrücken's Coherence Optimized Ray Tracer) at the Computer Graphics Lab [Slu06] of Saarland University [Uni06]. The project was founded by Jörg Schmittler in direct relation to the OpenRT project.

Jörg Schmittler designed the SaarCOR real-time ray tracing hardware architecture for static scenes and made performance estimations using software based simulations [Sch06]. The SaarCOR hardware design [SWS02] consists of a traversal unit and a ray/triangle intersection unit. The design uses large packets of 64 rays to reduce bandwidth during rendering. However, 16 of these 64-ray packets are required to hide the computation and memory latencies. The paper discusses hardware for ray casting, where the memory requirements of the resulting 1024 rays are still reasonable. However, most of the computational state is required during shading, thus a reduction of the number of rays per pipeline is necessary. The first FPGA implementation of SaarCOR [SWS02] therefore uses 32 smaller packets of 4 rays. The smaller granularity increases the fraction of active rays in a packet, thus a total of 128 rays is sufficient for achieving good usage rates on the FPGA. The FPGA prototype contains hardware for ray traversal, ray/triangle intersection, and a fixed function shader. It is directly connected to several low latency SRAMs. A minimal shading processor for the SaarCOR prototype was implemented [Dre05] but never connected.

To allow for programmable shading, the RPU design [WSS05] has been developed. The design contains a fixed function traversal unit and a fully programmable shading processor that is also used for ray/geometry intersection. As geometry processing is programmable, even higher order primitives such as spheres or spline surfaces are supported. A prototype of the RPU was implemented on an FPGA, configured to 32 packets of 4 rays. It shows that this configuration is sufficient to hide latencies in the RPU, even though it is connected to slow, higher latency external DRAM.

In addition, the SaarCOR project also included research on a virtual memory management system [SLS03]. There, the external DRAM is used as a large cache to store the current working set of the scene. It shows that a large cache (of several hundred MB) can easily hold the working set, such that a standard PCI bus would suffice to transfer missed data to the cache. However, no hardware implementation of the virtual memory management was performed. Problems with virtual memory could occur if the working set changes drastically; for instance, looking around a corner could result in a short stall of the image. Prefetching could solve this problem, but no research was performed in this direction.

1.4.7 OpenRT Project

The OpenRT software ray tracing project started in the year 2000 with the goal of developing a high performance software ray tracing system. On a single PC, a speed-up by a factor of 30 compared to competitive ray tracing implementations [Ray06, Kol] was achieved. It showed that, using a cluster of PCs connected by a network, rendering performance scales linearly with the number of processors, and even real-time global illumination [WKB+02] and photon mapping [WGS04] are possible with OpenRT. The OpenRT programming interface [DWBS03] for the ray tracer resembles the OpenGL interface where possible to make porting OpenGL code easy. This thesis benefited from the OpenRT project as it provided access to efficient algorithms for KD Tree construction as well as helpful discussions.

The OpenRT project focuses not only on triangular surfaces, but also on direct ray tracing of spline surfaces [BWS04]. These spline surfaces allow for a more accurate representation of curved surfaces and a massive reduction of the size of the scene representation.

Using a large shared memory machine (16 CPU cores, 64 GB memory) it is possible to ray trace highly complex outdoor scenes, such as a forest consisting of 1.5 billion (partially instantiated) triangles [DCDS05]. Highly complex models, such as a Boeing 777 model with 300 million individual triangles, can be rendered interactively even on a single PC with 6 GB of main memory using ray tracing and a non-blocking virtual memory management [WDS04].

Volume visualization of various kinds of volumes is also performed in the OpenRT project. A KD Tree whose nodes store the minimum and maximum of all contained cells can be used to speed up iso-surface visualization: by comparing the searched iso-value against the min/max interval in the nodes, large parts of the model can be skipped during traversal [MBFS06]. The paper [MS06] considers the rendering of unstructured volume data sets, such as tetrahedral and hexahedral meshes. The traversal approach is a new algorithm that performs Plücker tests with the edges of the tetrahedra (or hexahedra) to determine the correct exit surface of the ray. Using ray tracing for volume visualization makes it possible to mix different types of geometry, such as regular volumes, tetrahedral volumes, hexahedral volumes, and even triangles (including reflecting ones) [MFS05]. This property is one of the largest advantages of ray tracing based volume visualization tools. However, problems always occur if different types of geometry overlap, such as a regular volume and a tetrahedral volume.

The main contributions of this thesis to the area of computer graphics and hardware architecture are the following:

1. Ray Tracing of Dynamic Scenes: The thesis describes B-KD Trees, a kind of Bounding Volume Hierarchy that combines the advantages of KD Trees and AABVHs in a single data structure that can efficiently be used for hardware ray tracing of coherent dynamic scenes. Hardware units for all performance critical aspects of B-KD Trees are described, including: traversal, ray/triangle intersection, and updating of the data structure on dynamic scene changes.

2. Programmable Shading Processor: A programmable Shading Processor with a GPU-like instruction set is described, which achieves high efficiency due to massive multi-threading and synchronous execution of packets of threads. Efficient tracing of rays directly from a shader is possible by invoking the Ray Casting Unit via a special “trace” instruction, and hardware managed recursion makes recursive ray tracing possible.

3. HWML: A structural hardware description library has been implemented and used to describe the implementation of the DRPU efficiently at a high level of abstraction. HWML made it possible to implement the complete hardware design in only 8000 lines of code in a parameterized form that allows the number of threads, packet size, floating point accuracy, and many more design parameters to be easily adjusted.

4. FPGA prototype: A working FPGA prototype of the complete DRPU architecture proves the applicability of the concepts and the efficiency of the architecture. The prototype achieves performance levels comparable to commodity PCs, even when clocked at a 50 times lower frequency.

5. ASIC implementation: Post layout timing and area results from an ASIC implementation of the DRPU on a 130nm CMOS process allow accurate performance estimates. The results are extrapolated to a 90nm process to get the performance that would be achievable with up-to-date ASIC technology. It shows that performance rates sufficient for game play would be possible with the concepts described in this thesis.

1.5 Outline of this Thesis

In Chapter 2, an overview of the complete DRPU hardware architecture is given, while the remaining chapters describe architectural details. Chapter 3 describes the spatial index structure used for rendering and hardware units for ray traversal, ray/triangle intersection, and updating of the spatial index structure on geometry changes. Chapter 4 describes a programmable Shading Processor with an instruction set similar to current GPUs that can additionally trace rays using the ray casting units. Chapter 5 describes a Skinning Processor that implements a generalization of the widely used SSD skinning model. The hardware description library HWML used to implement the DRPU is presented in Chapter 6. An FPGA prototype of the architecture with detailed statistics is described in Chapter 7, while Chapter 8 analyses size, frequency, and performance aspects of different ASIC implementations of the architecture. Chapter 9 finally concludes this thesis.

Chapter 2

Overview of the DRPU Architecture

The DRPU is a highly efficient real-time ray tracing hardware architecture with capabilities comparable to the traditional rasterization pipeline, which makes it usable for 3D computer games. Its capabilities include full support for programmable material shaders that can recursively spawn secondary rays, and support for the most important kinds of dynamic geometry, including animated characters and hardware accelerated skinning. The DRPU does not include extra hardware for texture filtering (except for an addressing mode), but texturing units similar to those of GPUs could easily be included.

This chapter gives a short overview of the DRPU and describes architectural principles that are used in different parts of the design. An analysis of the ray tracing algorithm shows that it is a data parallel algorithm, as pixels can be computed independently of each other. Furthermore, ray tracing is also task parallel, as it can be subdivided into different kinds of computations: ray traversal, geometry intersection, and shading. The DRPU hardware takes advantage of both: of the data parallelism by operating on several pixels in parallel, and of the task parallelism by subdividing the ray tracer into several functional units. Applying task parallelism makes sense as highly optimized hardware units can be developed for the different tasks, and data parallelism can be exploited to increase the usage of the different hardware units by multi-threading. The scalability of the design is based on this data parallelism, as it allows several identical hardware units to operate on different parts of the image. Furthermore, the computation of neighboring rays is highly coherent for ray tracing, which makes the synchronous operation of packets of rays on replicated hardware possible. This packet based approach makes it possible to reduce hardware complexity and external memory bandwidth.

Programmable shading is the most expensive operation in a ray tracer, as it is floating point intensive, often requires complex recursive control flow, and needs to perform many unstructured memory fetches.


Figure 2.1: The DRPU architecture contains several dedicated hardware units, for skinning, updating of spatial index structures, shading, and ray casting including traversal and geometry intersection. On dynamic scene changes, first the Skinning Processor needs to recompute vertices, then the spatial index structure is updated, and later the scene can be rendered.

A Shading Processor (SP) optimized for recursion and floating point shading operations is used in the DRPU architecture to cope with these shading demands (see Chapter 4). The Shading Processor has some similarities with GPU fragment shaders (such as a similar instruction set), but differs in that it supports more flexible memory access and recursion; it currently contains no dedicated hardware for texture filtering.

As ray tracing is a ray centric algorithm, the basic ray casting operation needs special attention (similar to the rasterization operation in rasterization hardware). For game play, frame-rates of over 40 fps at resolutions of more than one million pixels are required, together with special effects that require several rays per pixel. This high number of rays is best handled by dedicated hardware units: ray traversal is performed by a Traversal Processor (TP) and ray/triangle intersection by a Geometry Processor (GP). These two units are closely coupled to perform efficient ray casting through dynamic scenes (see Chapter 3). The combination of both units will be called the Ray Casting Unit in the rest of this thesis.

For ray casting, ray tracing relies on fine grained spatial index structures, thus the choice of the right index structure needs special attention. On the one hand it should support dynamic scenes, and on the other hand it should allow for efficient traversal during rendering. In the DRPU, the support for dynamic scenes is achieved by using a Bounding Volume Hierarchy-like spatial index structure that is small and easy to update on scene changes. Using a Bounding Volume Hierarchy poses some limitations on the supported dynamic scenes, but most typical animations can be rendered this way.

Figure 2.2: A DRPU chip is connected to the host bus and external memory and may contain multiple independent Rendering Units. A Thread Scheduler coherently schedules pixels to them. A Rendering Unit contains a Shading Processor (SP) for programmable shading, a Traversal Processor (TP) for spatial index structure traversal, and a Geometry Processor (GP) for ray/triangle intersection. To reduce external memory bandwidth, these units are all connected to separate first level caches. A single Skinning Processor shares arithmetic units with one of the Geometry Processors. The Update Processor is responsible for updating the nodes of the spatial index structure on dynamic scene changes. Via the Host Bus Controller the software driver can control all units on the chip.

The DRPU contains a Skinning Processor to speed up the computation of dynamic scene changes (see Chapter 5) by recomputing the triangle vertex positions according to the Skeleton Subspace Deformation skinning model. After such geometry changes, parts of the spatial index structure need to be updated, which is performed by a dedicated Update Processor.

All these optimized hardware units are required to compute a single frame, as illustrated in Figure 2.1. Before rendering, the host PC needs to send the scene specification (vertices, update instructions, shading data, ...) to the main memory of the rendering hardware. For each frame, new vertex positions and normals are either uploaded or computed by the Skinning Processor for the relevant parts (objects) of the scene. In a second stage, the Update Processor uses the computed triangle vertices to update the dynamic spatial index structure required for later ray casting. Now the data structures for ray tracing are completely set up, and rendering is performed by a recursive ray tracing program running on the Shading Processor. Via a special trace instruction, this program calls the Ray Casting Unit to efficiently cast rays through the scene. There, the Traversal Processor traverses the packet of rays through the spatial index structure. This index structure is divided into two parts, a top-level part over instantiated objects and a bottom-level part per object. Each time the traversal of the top-level index structure reaches a leaf node, the Geometry Processor is invoked to transform the packet of rays into the coordinate space of the object. The Traversal Processor is then called again to traverse the transformed rays further through the object, and the Geometry Processor can be called to intersect with triangles. Once the closest intersection with the scene is found, the Ray Casting Unit sends the hit information to special input registers of the Shading Processor, which continues its operation on this hit data.

More implementation specific details about the DRPU architecture are given in Figure 2.2. It shows the organization and connections of the fixed function hardware units on a single chip. Several Rendering Units and a global Thread Generator are drawn; the Thread Generator coherently schedules screen pixels to the Rendering Units. For coherent scheduling, the Thread Generator contains a Hilbert curve generator [MWM01] to sample the screen pixels in optimal Hilbert order. Alternatively, initialization data for a thread can also come from external memory to perform general purpose computations.
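The transformation of rays into the coordinate space of an object, mentioned above as one task of the Geometry Processor, can be illustrated with a small self-contained C++ sketch. The 3x4 matrix layout and the field names are assumptions for illustration only, not the hardware data format.

    struct Ray { float org[3], dir[3], tmin, tmax; };

    // Assumed 3x4 affine world-to-object transform of an object instance.
    struct Instance { float m[3][4]; };

    // Transform a world-space ray into the object's local coordinate space.
    // The origin gets the full affine transform, the direction only the
    // linear part (no translation). tmin/tmax stay valid because the
    // parametric distance along the ray is preserved by an affine map as
    // long as the transformed direction is not re-normalized.
    Ray toObjectSpace(const Instance& inst, const Ray& r)
    {
        Ray local = r;
        for (int i = 0; i < 3; ++i) {
            local.org[i] = inst.m[i][0] * r.org[0] + inst.m[i][1] * r.org[1]
                         + inst.m[i][2] * r.org[2] + inst.m[i][3];
            local.dir[i] = inst.m[i][0] * r.dir[0] + inst.m[i][1] * r.dir[1]
                         + inst.m[i][2] * r.dir[2];
        }
        return local;
    }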
The Shading Processor, Traversal Processor, and Geometry Processor are all connected to the main memory via separate first level caches that store shading data, spatial index structure, and vertices, respectively. A single Skinning Processor shares arithmetic logic with one Geometry Processor to affinely transform vertices. This increases efficiency, as the Geometry Processor could otherwise not be used during skinning. The single Update Processor is located close to the memory interface for fast updating of the dynamic spatial index structure. A second Update or Skinning Processor would only make sense for very high external memory bandwidths, because both units require high speed memory connections. All hardware units are controlled via the Host Bus Controller, which has a connection to all Rendering Units and to the Skinning and Update Processors. The Host Bus Controller can be of any type (PCI, AGP, PCI-Express) and handles either single requests or DMA requests to achieve high bandwidth for large grained data packets.

The DRPU board is connected to a host PC, where a driver application performs control tasks and scene data management. It prepares the scene data by building spatial index structures (that are possibly precomputed), compiling shaders to byte-code, and managing materials. For dynamic objects, two instruction sequences need to be computed, one for the Skinning Processor and one for the Update Processor. The complete scene data and shaders can efficiently be uploaded via DMA before rendering. Camera settings are written to special “constant registers” of the Shading Processor before rendering. During rendering, the Skinning and Update Processors as well as the Rendering Units are scheduled optimally by the driver. For rendering on several Rendering Units, the screen is subdivided into small tiles that are scheduled to them. Frame buffer contents are finally downloaded via DMA in each frame for display by the standard graphics API. During rendering the driver performs no expensive computations, except for the reconstruction of the top-level spatial index structure, which is built over the object instances of the scene.

2.1 Pipelining

Pipelining is a widely used technique in hardware design [WH05] to increase the performance of a hardware unit. It subdivides a computation into partial computations by inserting register levels. This increases the overall execution time of one instruction (because of extra capacities and delays caused by the registers), but it also increases the operating frequency and consequently the number of instructions that can be executed per second. Pipelining does not always help to increase speed, because it also increases the cycle latency of an instruction: it takes more cycles until the instruction leaves the pipeline. Consequently, pipelining does not speed up the execution of dependent instructions. Independent instructions, however, can be executed very efficiently, as they can be processed in parallel in the pipeline. For these parallel computations the additional pipeline registers pay off.

As ray tracing is a highly parallel algorithm, pipelining is applied to each of the DRPU hardware units. Different techniques are used to provide enough independent instructions for the pipelines. The update operation of the tree-like spatial index structure is highly parallel if performed in breadth-first order: only the updates of consecutive levels of the tree depend on each other, thus it is possible to generate an instruction stream for the Update Processor whose instructions are mainly independent of each other and optimally scheduled (see Chapter 3 for more details). Similarly, the Skinning Processor can be fed with an instruction stream of mainly independent instructions, as the skinning computation of a vertex does not depend on neighboring vertices in the applied skinning model (see Chapter 5 for more details).
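As an illustration of this scheduling idea, the following C++ sketch generates a node order for a bottom-up, level-by-level update. The node representation is an assumption made for the example; the actual Update Processor instruction format is described in Chapter 3. The point is only that all nodes of one level can be issued back to back, because they do not depend on each other.

    #include <vector>
    #include <algorithm>

    struct TreeNode { int depth; };        // depth of the node below the root

    // Emit update instructions level by level from the leaves towards the
    // root, so all instructions within one level are independent of each
    // other and can overlap freely in the pipeline.
    std::vector<int> buildUpdateStream(const std::vector<TreeNode>& nodes)
    {
        int maxDepth = 0;
        for (const TreeNode& n : nodes) maxDepth = std::max(maxDepth, n.depth);

        std::vector<int> stream;                             // node indices in issue order
        for (int level = maxDepth; level >= 0; --level)      // bottom-up, one level at a time
            for (int i = 0; i < (int)nodes.size(); ++i)
                if (nodes[i].depth == level)
                    stream.push_back(i);                     // same-level updates are independent
        return stream;
    }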

2.2 Multi-Threading

The Rendering Units (Shading Processor, Traversal Processor, and Geometry Processor) use a different method, called multi-threading [Cor02] (similar to hyper-threading in current high end CPUs), to fill their deep pipelines with independent instructions. In multi-threading, several independent instruction streams, called threads, are executed and managed completely in hardware. If dependencies between the instructions of one thread would traditionally cause a pipeline stall, the hardware immediately continues executing a different thread, keeping the pipeline busy. Due to this massive multi-threading approach, writing efficient highly parallel code is trivial, and even bad code can achieve good processor usage.

As these threads operate in parallel, the execution state of each of them needs to be stored on-chip, which is a drawback of multi-threading and causes high on-chip memory requirements (see Chapter 8). However, today's high end CPUs also require an enormous amount of memory structures on-chip (mainly caches) to achieve high efficiency. Massive multi-threading is only possible for a limited range of algorithms that are easily parallelizable, which is the case for ray tracing. Some researchers even call ray tracing embarrassingly parallel, as each ray can be computed completely independently of other rays because of the independence of photons in physics. The DRPU design does not execute secondary rays in parallel, but parallelizes the computation of screen pixels, which turns out to be more general and more efficient as it keeps the computational state compact.
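The principle can be modeled in a few lines of C++; this is a software illustration, not the hardware scheduler. Each cycle an instruction is issued from some thread that is not waiting on a dependency, so the stall of one thread is hidden by the others.

    #include <vector>

    struct Thread { bool waiting; /* plus program counter, registers, ... */ };

    // Pick the next thread to issue from, round-robin, skipping threads whose
    // next instruction still waits on a dependency (e.g. a pending memory load).
    int pickNextThread(const std::vector<Thread>& threads, int lastIssued)
    {
        int n = (int)threads.size();
        for (int i = 1; i <= n; ++i) {
            int t = (lastIssued + i) % n;           // round-robin order
            if (!threads[t].waiting) return t;      // issue from the first ready thread
        }
        return -1;                                  // all threads blocked: pipeline bubble
    }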

2.3 Packets of Rays / Threads

The computations performed for adjacent pixels of the screen are typically highly coherent, as adjacent rays traverse similar space (and thus similar spatial index nodes) and often hit the same triangles. Even if different triangles are hit, they mostly carry the same material and thus the same shader. Ray tracing implementations can take advantage of this coherence by operating on packets of rays [Sch06], or packets of threads [WSS05] in general. In the DRPU, such a packet is executed synchronously, thus always the same shading instruction (in the Shading Processor), spatial index node (in the Traversal Processor), or triangle (in the Geometry Processor) is processed at a time. Synchronous execution of packets on the one hand reduces hardware complexity, as some logic (such as caches, instruction scheduling, etc.) can be shared per packet; on the other hand, memory requests can be performed once per packet (in the Traversal Processor and Geometry Processor), or can often be combined into a single packet request (in the Shading Processor). The synchronous execution can be performed in parallel by replicated hardware, as in the Shading Processor and Traversal Processor, or sequentially, as in the Geometry Processor, to balance the utilization of the hardware units via their throughput (see Chapter 7).

The Rendering Units contain a fixed number of packets, each identified by a unique ID. For the DRPU ASIC design, 32 packets are a good compromise (see Chapter 8). If a packet has to continue its execution in a different unit, its ID (mostly together with some data) is sent to an input FIFO of that unit to start the computation there. To prevent FIFO overflows, these packet FIFOs can hold as many packet IDs (plus data) as the maximal number of packets in the system.

The computational state of the threads of a packet is distributed over several small memories, each containing a part of the computational state of all threads. These possibly multi-ported state memories are addressed using the concatenation of the thread ID and the local item address that a thread wants to access, which separates the thread states from each other. For replicated hardware units, such as the parallel traversal of the 4-ray packets, the state memories are replicated as well.

In the Shading Processor and Traversal Processor, incoherent cases need to be handled where the control flow of the threads diverges because they take different branches or go to different children of the spatial index. Therefore, an activity mask specifies which threads execute the current instruction and which are inactive and will be executed later. In incoherent cases, synchronous execution can reduce performance due to few active threads per packet.

Fortunately, coherence is typically high for ray tracing [WSBW01], thus packeting pays off on average (see Chapter 7). To be efficient also in incoherent situations, such as highly triangulated scenes or diverging shaders, only a small number of 4 threads is synchronized in the DRPU architecture, which is considered a good trade-off. In the worst case, packeting could thus reduce performance to 1/4, however this never occurred in the test scenes, even for complex ones (see Chapter 7). Other researchers propose much larger packets of 64 rays [Sch06] that are processed sequentially in hardware. However, such large packets restrict the use of the hardware to coherent scenes.
Obviously, compared to approaches with smaller packets, fewer large packets can be stored on-chip. A packet can typically perform only one memory request after another. Thus, if the memory latency is higher than the number of available packets, there are not enough packets to effectively take advantage of the available memory bandwidth. A larger number of smaller packets can perform frequent fine grained memory requests for incoherent scenes and use the available memory interface to its limits to achieve good performance.
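The activity mask mentioned above can be illustrated with a short C++ sketch for a packet of 4 threads. The encoding and field names are made up for the example and do not reproduce the hardware representation; the sketch only shows how a divergent branch serializes the two paths while a coherent branch does not.

    #include <cstdint>

    struct Packet {
        uint8_t active;                   // one bit per thread, 4 threads per packet
    };

    // Split the current activity mask according to a per-thread branch
    // condition. 'condMask' has a bit set for every active thread whose
    // condition is true. Threads left in 'deferred' stay inactive until the
    // first path has finished and are then executed on the other path.
    void branch(Packet& p, uint8_t condMask, uint8_t& deferred)
    {
        uint8_t taken    = p.active &  condMask;    // threads taking the branch
        uint8_t notTaken = p.active & ~condMask;    // threads falling through
        if (taken && notTaken) {                    // divergence: serialize both paths
            p.active = taken;
            deferred = notTaken;
        } else {                                    // coherent case: no serialization
            p.active = taken ? taken : notTaken;
            deferred = 0;
        }
    }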

2.4 Memory Interface

The DRPU architecture is designed to be scalable in the number of on-chip Rendering Units, thus a memory interface that can deliver sufficient bandwidth to all of these units is important.

The Traversal Processor, Geometry Processor, and Shading Processor can each request one 128-bit data word from memory per clock cycle. The naive way of providing this huge memory bandwidth would be to connect each of the units to its own external DRAM chips. However, this kind of design would require an expensive chip package with many pins and a complex PCB. As the number of pins in particular is very limited (at most 1000 to 2000 pins), the design could not be scaled on-chip this way. The power consumption of such interfaces would also be huge, which is a second reason to optimize the required bandwidth.

As coherence is typically high for ray tracing, the working set of a small part of the image has a reasonable size and can efficiently be stored temporarily in an on-chip cache. Thus the Shading Processor, Traversal Processor, and Geometry Processor are all equipped with their own first level caches to reduce memory bandwidth by temporarily storing shading data, spatial index nodes, or vertices. Caching this data is efficient only as long as coherence is high. In the case of the vertex cache, for instance, triangles need to cover many screen pixels so that vertices can be reused frequently. If triangle depth is high (e.g. multiple triangles project onto each pixel), the primary rays all hit completely different triangles, and first-level geometry caches do not help any more. The situation is similar for the other caches, but the traversal cache mostly achieves good hit rates for the coherent higher level nodes. These worst case scenarios always require raw memory bandwidth and limit the scalability of the design (see Chapter 7). Level-of-detail techniques help in these situations, as they reduce the triangle depth of far-away geometry. The Update Processor contains no cache, as its statically compiled instruction streams store and re-use data optimally in local registers. The Skinning Processor would also not benefit from caching, as it is a pure stream processing unit (see Figure 2.2).

All hardware units, or their caches respectively, are connected to a shared memory bus. The bus arbiter gives control to a connected hardware unit for possibly several consecutive memory requests, which can improve the efficiency of the external DRAM, as it is likely that these consecutive memory requests from the same unit access the same DRAM memory page. For improved performance, the memory bus is split into two unidirectional busses for concurrent communication in both directions. Each memory bus has a data width of 128 bits plus some control bits and address lines, which allows four 32-bit pixels to be stored or one spatial index node or triangle vertex to be loaded per clock cycle.

A major advantage of ray tracing is its output sensitivity: only spatial index nodes or triangles that are pierced by query rays need to be accessed during rendering. Thus only geometry that contributes to the final image needs to be accessed - geometry behind an occluding wall would never be fetched. This makes an efficient Virtual Memory Management (VMM) possible, as analyzed by Schmittler et al. [SLS03]. Here, the scene description is contained in virtual memory, which may be distributed across
the host's main memory system or other rendering boards. A part of the local DRAM is used as a large second level cache to store the working set required to render a frame. As long as this working set fits into the local DRAM cache, the virtual memory management performs well, because only working set differences between frames need to be transferred, which typically remain small. As Virtual Memory Management for ray tracing hardware was already analyzed by previous researchers [Sch06], it is not considered in this thesis, but could easily be added. Instead, the DRPU is directly connected to external DRAM that stores the entire scene database. However, the driver could still perform some virtual memory management by modifying the higher nodes of the spatial index and loading only visible subtrees to the DRAM.

The external DRAM is controlled by a DRAM controller that maintains the order of the memory requests. An out-of-order controller could reduce the overhead of DRAM page accesses, as requests to an already opened bank row could pass requests to a still closed row of a different bank [Sch06]. Such an out-of-order DRAM controller would cause no problems for the Rendering Units, because memory requests from different packets can be executed in an arbitrary order, as packets are executed independently. Problems would occur in stream based units (such as the Skinning Processor or DMA controller), as they rely on an in-order memory interface and would consequently become more complicated due to the need for additional buffers to re-order the requests later.

2.5 Performance Scalability

Ray tracing is an “embarrassingly parallel” algorithm, due to the independence of pixels. This property should also apply to hardware implementations of the ray tracing algorithm, allowing higher performance when using more hardware resources.

The DRPU architecture is highly scalable, as it allows rendering performance to be scaled in several different ways: by using multiple Rendering Units per chip, multiple chips per board, multiple boards in a PC, or multiple PCs connected by a fast network. Multiple Rendering Units per chip are analyzed in the ASIC implementation in Chapter 8, while the connection of two chips is analyzed in the FPGA prototype implementation in Chapter 7. Although only two FPGAs could be used to test the scalability, the architecture should scale well, as has often been shown for software implementations of ray tracing [PSL+99, Muu95, Neb97, ASR98].

Scaling the rendering performance for static scenes is possible as long as the parallel rendering engines have enough bandwidth to the scene database. Thus, when scaling the number of Rendering Units per chip, the memory bandwidth to the external DRAM becomes the limiting factor.

Chapter 8 will show that, for typical test scenes, about two Rendering Units can be connected to the 128-bit wide DRAM memory interface. When using more Rendering Units on-chip or several chips in parallel, the scene data can be replicated to several DRAM chips; the bandwidth to the scene database can be scaled linearly this way, together with the rendering performance. The bandwidth required to read rendered pixels back for display might become a bottleneck, but it can be ignored for typical frame-rates of 40 fps when using a PCI Express interface.

Scaling the performance in the case of many dynamic scene changes is more challenging, as these dynamic changes need to be computed by, or sent to, the different parallel DRPU chips for rendering. An approach where each chip computes all scene changes would only allow scaling of the rendering, not of the setup time to compute the dynamics. A different approach would be to distribute the skinning and update computations over the parallel DRPU chips, followed by a final distribution of all updated scene parts to all other chips. However, this may result in a communication bottleneck, as skinning and updating together produce a very high output bandwidth. To solve this parallelization issue, skinning and updating could be performed on demand during rendering. The parallel DRPU chips then only compute the dynamic changes that are required for the pixels rendered by them. This approach could be used if the application provides a conservative bound of the objects that are updated on demand. Because this second technique is difficult to implement, as it requires much more control by the DRPU to switch between rendering, skinning, and updating, the two FPGA prototype boards described in this thesis recompute all changes of the dynamic parts. For the test scenes, this only slightly limits scalability, because skinning and updating together prove to be one to two orders of magnitude faster than rendering.

2.6 Programming Model

Similar to GPUs, the DRPU is a stream processor, but it operates directly on a coherent stream of pixel locations rather than on a stream of vertices and fragments. Unlike on GPUs, the stream kernels of the DRPU can be general purpose due to the support for arbitrary memory reads (and writes) and hardware support for recursion. Due to the small on-chip caches, however, the hardware is optimized for computations with high data coherence between the threads, such as ray tracing. As on GPUs, the threads of the Rendering Units cannot communicate with each other because of possible deadlocks due to the hardware managed activity mask of the packets (see Chapter 4). However, such communication between threads is not required to perform ray tracing efficiently, because pixels can be processed independently with the ray tracing algorithm.

The DRPU architecture has fundamental advantages over GPU programming because of an improved shading model. Traditional GPU fragment shaders are limited to performing local per triangle shading operations, while the DRPU can compute global shading effects by querying the scene with secondary rays directly in the shader program. This global access to the scene via query rays makes writing shaders intuitive and allows for an accurate simulation of physics.

For compatibility, most shading techniques used with GPUs, such as shadow maps or reflection maps, can also be performed the same way with the DRPU architecture. Using secondary rays, however, these effects can be computed much more simply, more accurately, and more efficiently, as no memory bandwidth consuming multi-pass techniques are required. Reflection maps thus translate to the use of reflection rays, and shadow maps to the use of shadow rays in a ray tracer. For instance, when rendering a reflective object on a GPU, one first needs to capture the environment of the object by rendering it onto the 6 faces of a cube. This cube map gets stored in a texture and can be used in the final render pass to look up the color of the environment in the reflected direction. This lookup is similar to shooting a ray in ray tracing, but inaccurate, as the lookup is always performed from the center of the object and the resolution of the texture can cause sampling artifacts. However, such environment maps could in principle also be computed and used by a ray tracer.

Full support for recursive function calls allows recursive ray tracing to be performed directly in the shader. This is an important advantage of the DRPU programming model, as it completely separates different material shaders from each other. This is possible as material shaders transfer information (such as radiance) over clearly defined interfaces on the recursively traced rays. Thus a material shader can gather light from different directions by recursively shooting rays that return light computed by possibly many different shaders. This allows fully declarative scene descriptions to be supported that can be evaluated completely by the Shading Processor.

Function calls can also be used to separate common functionality into additional shaders. One particularly interesting example is the separate computation of global illumination that can be called from any material shader [SS95, SPS95]. Instead of computing illumination itself by tracing rays directly, a material shader can call a global lighting shader that iteratively computes the incident light contribution at the point of interest. This allows the lighting scenario to be modified easily without having to change any material shaders. Another good example are BRDF shaders that encapsulate the evaluation of BRDFs and separate it from normal surface shaders that simply specify BRDF parameters [PH04].

With regard to vertex processing, the programming model of the DRPU is similar to that of GPUs. As the Shading Processor is a stream processor, it can also be used for skinning or vertex shading. In addition, a general skinning model is supported by a high performance Skinning Processor. A principal architectural difference to GPUs is that the DRPU architecture needs to write the skinned vertices and vertex colors to main memory for the later update of the spatial index structure and for rendering.
In contrast, the GPU pipeline can directly render the triangles from the vertices computed by the vertex pipeline.
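To make the shading model concrete, the following C++ sketch shows the structure of a mirror-like material shader as it could be written for a ray tracer with a trace call. It is only an illustration of the programming model, not actual DRPU shader code: all types, the constant reflectivity of 0.8, and the stubbed trace() and shade() functions are made up for the example (in the DRPU, trace corresponds to the “trace” instruction that invokes the Ray Casting Unit, and the recursive shade call runs on the Shading Processor with hardware managed recursion).

    struct Vec3  { float x, y, z; };
    struct Color { float r, g, b; };
    struct Ray   { Vec3 org, dir; };
    struct Hit   { bool valid; Vec3 position, normal; };

    // Stand-ins for the hardware functionality, stubbed so the sketch compiles.
    Hit trace(const Ray&) { return Hit{false, {}, {}}; }

    Color reflectiveShader(const Hit& hit, const Ray& ray, int depth);
    Color shade(const Hit& hit, const Ray& ray, int depth)
    {
        return reflectiveShader(hit, ray, depth);   // single-material scene for simplicity
    }

    // A mirror-like material: instead of looking up a precomputed reflection
    // map, it spawns a secondary ray in the mirrored direction and recurses.
    Color reflectiveShader(const Hit& hit, const Ray& ray, int depth)
    {
        Color local = {0.1f, 0.1f, 0.1f};           // some local (ambient) contribution
        if (depth == 0) return local;               // bound the recursion depth

        float d = 2.0f * (ray.dir.x * hit.normal.x +
                          ray.dir.y * hit.normal.y +
                          ray.dir.z * hit.normal.z);
        Ray refl;
        refl.org = hit.position;
        refl.dir = { ray.dir.x - d * hit.normal.x,  // mirror the incoming direction
                     ray.dir.y - d * hit.normal.y,
                     ray.dir.z - d * hit.normal.z };

        Hit rhit = trace(refl);                     // global query of the scene
        Color rc = rhit.valid ? shade(rhit, refl, depth - 1)    // recursive shading
                              : Color{0.0f, 0.0f, 0.0f};        // background
        return { local.r + 0.8f * rc.r,
                 local.g + 0.8f * rc.g,
                 local.b + 0.8f * rc.b };
    }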

2.7 Hardware Description Language HWML

Implementing a complex piece of hardware such as the DRPU architecture is non-trivial and requires significant effort for specifying its functionality. Various hardware description tools, like the commonly used VHDL [Lan06b] or Verilog [Lan06a] languages, are available, but they all operate at a low level of abstraction with little expressiveness. This requires extensive manual coding work for little return in functionality. To increase the productivity of implementing the DRPU, a specially designed high-level hardware description language, HWML (Hardware Meta Language) [WBS06b], was implemented and used in parallel to the DRPU development.

HWML is a functional structural hardware description language, implemented as an ML library [MTH90]. This makes it possible to take advantage of the expressiveness of ML to achieve highly compact hardware specifications. Circuits are described as ML functions, which allows circuits to be used as parameters and outputs of other circuits (functions). HWML also allows parameterizable abstract data types to be used: one can, for instance, build a general delay line or a FIFO that operates on values of arbitrary type. This allows the design to be fully parameterized at an abstract level. Special features of HWML are automatic pipelining of functional blocks for achieving high clock rates, recursive circuit descriptions using functional notation, and multi-ported typed memories.

All HWML descriptions are directly synthesizable, both for the Xilinx FPGA platform [Xil06b] and for the 130nm ASIC process from UMC [UMC06], through VHDL as an intermediate language. Special parameterized template circuits allow the existing Block RAMs and Block Multipliers on the Xilinx FPGAs to be taken advantage of. For the ASIC platform these multipliers are synthesized, and on-chip memories are automatically generated using an ASIC memory compiler from Virtual Silicon [Sil06].

With HWML it was possible to implement the complete DRPU as a highly parameterized design in only 8000 lines of code. The parameterized description allows custom variations to be created quickly by changing, for instance, the number of threads, the packet size, the floating point accuracy, or the latencies of automatically pipelined units. This flexibility of the hardware description made it possible to analyze different configurations of the DRPU to find the optimal configuration for an ASIC implementation (see Chapter 8). Furthermore, HWML made frequent design changes easy, because many abstract library units, such as FIFOs, can automatically handle arbitrary types of input data.

2.8 FPGA Prototype Platform

This thesis will present several measurements of prototype FPGA implementations. The prototype platform used (see Figure 2.3) consists of two Xilinx Virtex-4 LX 160 FPGAs [Xil06b], each hosted on an Alpha Data ADM-XRC-4 PCI board [Alp06]. Each FPGA has access to four 16-bit wide DDR-DRAM memory chips used in parallel to form a 64-bit wide memory interface that can deliver a peak bandwidth of 1.0 GB/s at 66 MHz (16 bytes per FPGA clock cycle). The FPGA is connected via a 64-bit wide PCI bus to the host PC. This configuration and a clock frequency of 66 MHz are used for all FPGA prototype designs in this thesis. Both FPGAs are used together in Chapter 7 to show the scalability of the design in the number of FPGAs.

Figure 2.3: DRPU Prototype Platform from Alpha-Data equipped with one Xilinx Virtex-4 LX 160 FPGA

2.8.1 Floating Point Format

Internally, the hardware does not operate on IEEE 754 compliant floating point numbers but on its own internal format. This format consists of a sign bit, a two's complement exponent, and a normalized mantissa (stored without the leading one). For 24-bit floating point numbers (FPGA prototype), the exponent is 7 bits wide and the mantissa 16 bits. For 32-bit floating point numbers (ASIC implementation), the exponent is 8 bits wide and the mantissa 23 bits. The floating point format has no special numbers such as NaN, and not even a zero is available. This makes the implementation simpler and removes the need for special treatment of NaN values or division by zero. Scene data is stored IEEE compliant to simplify the driver application, thus the DRPU needs to convert to its internal format on memory reads and to IEEE on writes.
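The read-side conversion can be sketched in a few lines of C++. The exact bit packing, the use of the unbiased exponent as the two's complement value, truncation instead of rounding, and the mapping of zero and denormals are assumptions made for this illustration; the thesis only specifies the field widths and that no zero, NaN, or infinity exists in the internal format.

    #include <cstdint>
    #include <cstring>

    // Convert an IEEE 754 single precision value to an assumed 24-bit layout:
    // 1 sign bit, 7-bit two's complement exponent, 16-bit mantissa without
    // the implicit leading one (sign in bit 23, exponent in bits 16..22).
    uint32_t toInternal24(float f)
    {
        uint32_t bits;  std::memcpy(&bits, &f, sizeof bits);
        uint32_t sign     = bits >> 31;
        int32_t  exponent = (int32_t)((bits >> 23) & 0xFF) - 127;   // remove IEEE bias
        uint32_t mantissa = (bits >> 7) & 0xFFFF;      // keep top 16 of 23 mantissa bits

        if (((bits >> 23) & 0xFF) == 0) {              // zero/denormal: not representable,
            exponent = -64; mantissa = 0;              // map to smallest magnitude (assumption)
        }
        if (exponent < -64) exponent = -64;            // clamp into the 7-bit
        if (exponent >  63) exponent =  63;            // two's complement range

        return (sign << 23)
             | ((uint32_t)(exponent & 0x7F) << 16)     // 7-bit two's complement exponent
             | mantissa;                               // implicit leading one dropped
    }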

While this chapter gave a short overview of the DRPU architecture, the following three chapters describe the Ray Casting Unit and Update Processor (see Chapter 3), the Shading Processor (see Chapter 4), and the Skinning Processor (see Chapter 5) in detail.

Chapter 3

Ray Casting Hardware for Dynamic Scenes

3.1 Introduction

Ray tracing algorithms cannot render the scene geometry directly, but rely on spatial index structures to efficiently cast rays through the scene. As these spatial index structures are costly to compute [WPS+03, WBS03], they limit the support of ray tracing for dynamic scenes. For a long period, however, ray tracing was known for its low rendering speed, requiring minutes to hours to render a single frame. As a consequence, there was no interest in data structures to efficiently handle dynamic scene changes. This situation has changed over the past few years, as great progress has been achieved in accelerating ray tracing to real-time performance levels, both in software [WSBW01, WBS06a, WIK+06, RSH05] and in hardware implementations [SWW+04, WSS05]. This development made research on spatial index structures for ray tracing of dynamic scenes interesting, because real-time ray tracing systems that only support static scenes would be strongly limited in their application. Games in particular, and many other real-time rendering applications, require frequent changes to the scene geometry and high rendering frame-rates.

A spatial index structure for handling dynamic scene changes has to cope with several different types of dynamics, or motion.

Static Motion: In static scenes, all geometry stays constant during rendering. Ray tracing of such scenes is well understood, and high quality spatial indices can be precomputed that never need to be modified during rendering. KD Trees have proven to perform best for static scenes, even if the geometry is not evenly distributed [Hav01].


Rigid-body Motion: Many animations contain motions of rigid-body parts that do not deform themselves but only change their position and orientation. A separation of the scene into such objects with piecewise rigid motion and separate static spatial indices has been suggested by [LAM01] and has been implemented for real-time use on a single PC or a cluster of PCs [WBS03]. As rigid-body parts of the scene do not change, a spatial index can be precomputed for them and never needs to be modified later. A dynamic spatial index is required only on top of these rigid parts, called the top-level spatial index. For many applications the number of rigid parts stays comparatively small, making brute force reconstruction of the top-level spatial index possible [WBS03].

Coherent Motion: Motion is coherent if nearby geometry moves in a similar way. In reality this kind of motion occurs frequently, because objects and materials such as rubber or skin are connected. Bounding Volume Hierarchies have proven to perform well in scenes with coherent motion [WBS06a, WMS06], especially if geometry in the deeper levels of the tree stays close together during the course of the changes. A good structure of the Bounding Volume Hierarchy can be determined by analyzing scene graphs and skinning models, or by brute force top-down [KK86, Smi98, WBS06a] or bottom-up [GS87] construction approaches. Bounding volume approaches even allow objects to be inserted after the hierarchy has been constructed. After scene changes, the structure or topology of the hierarchy can be maintained and the node bounds can be updated bottom-up, which is a parallel operation that can be implemented very efficiently (a small sketch of such a refit follows after this overview of motion types). Many different Bounding Volume Hierarchy approaches have been proposed that vary mainly in the number of child nodes and the type of node bounds [KHM+98, TL03]. This thesis proposes a new kind of Bounding Volume Hierarchy which is well suited for hardware implementation, as its traversal cost is reasonable.

Random Motion: Random motion is the worst case, where geometry can be at any new random position in the next time step. In practice, random motion occurs rarely, but could for instance appear if a fast motion is sampled at a very low rate. Using grids as spatial index structure, objects can be inserted very quickly and then rendered in real-time by exploiting coherence [WIK+06]. While this approach in principle supports random motion, grids cause many problems with uneven triangulations [Hav01], which makes them less appealing for many scenes.

This chapter presents hardware units to handle ray casting in most dynamic scenes by using a new spatial index structure called Bounded KD Tree (B-KD Tree). B-KD Trees are a kind of Bounding Volume Hierarchy and can be used for efficient hardware ray casting in scenes with coherent motion.

Via a two-level approach the hardware supports object instantiation, which can be used either to build large scenes or to support rigid-body motion. A top-level B-KD Tree over the object instances is recomputed in every frame by the driver application, while the bounds of the bottom-level B-KD Trees of the dynamic objects can be updated efficiently in hardware. By storing bounds only in a single dimension, B-KD Trees are much smaller than other bounding volume approaches and can even be used to encode optimized KD Trees for static scenes without much memory overhead. This makes B-KD Trees a very flexible approach for hardware implementation, because only one data structure is required to handle many different kinds of scenes. The described hardware units can efficiently be used for static motion, rigid-body motion, and coherent motion. Even random motion is supported for object instances, because the top-level B-KD Tree is reconstructed in each frame by the driver application, which is possible as long as the number of object instances is in the range of several hundred. Random motion within the objects cannot be handled well, because they typically consist of too many triangles for a per-frame reconstruction.

3.2 Bounded KD Trees (B-KD Trees)

The Bounded KD Tree (B-KD Tree) is a new hybrid spatial index structure that combines the advantages of Bounding Volume Hierarchies with those of KD Trees in a single homogeneous data structure. From bounding volumes it inherits the efficient support for coherent dynamic scenes, while maintaining the simplicity and efficiency of KD Tree traversal.

Definition: A Bounded KD Tree (B-KD Tree) is a binary tree, where each node recursively subdivides the geometry primitives of the scene into two (typically disjoint) subsets represented by its two children. Each node stores the geometric extent of both children in a single dimension. These bounds are stored in the form of two intervals, often also referred to as slabs (see Figure 3.1). Each leaf node stores a reference to a single primitive of the scene.

Figure 3.1: B-KD Trees: A B-KD Tree node divides the geometry into two subsets, represented by its two children. For each child, the node stores along a single dimension the extent of the geometry in the form of two bounding intervals, which may overlap. The geometry is recursively subdivided until there is only a single primitive per node.

If the geometry subdivision is performed disjointly, a B-KD Tree has exactly N leaf nodes that point to primitives (triangles), and N − 1 inner nodes that store child bounds. This makes the size of the spatial index structure predictable, in contrast to KD Trees. As a consequence of this compact representation, the number of traversal steps is sometimes higher compared to KD Trees if the bounds of the children overlap for many nodes. For such an overlapping region both children have to be traversed, which increases the number of traversal steps. Conventional KD Trees disjointly subdivide space, causing no overlap and thus sometimes higher traversal performance due to fewer traversal steps (see Table 3.1).

Figure 3.2: Implicit B-KD Tree Box: To each node of the B-KD Tree an implicit box can be assigned that is computed by recursively clipping against the bounding intervals. In the figure the node T10 and its corresponding implicit box are marked.

Each inner node of the B-KD Tree stores only one-dimensional bounds for its children, but each node can be assigned a closed implicit box. This implicit box is defined by the B-KD Tree and can be determined by recursively clipping top-down against the bounding intervals of the nodes (see Figure 3.2). Implicit boxes of nodes may overlap, which makes updates of the bounds possible on scene changes. The traversal algorithm traverses a B-KD Tree node only if its implicit box is pierced by the ray (and the ray is not occluded earlier), thus several nodes may be processed for a single region of space because of overlaps. Storing the bounding intervals of the children in their parent node allows a higher parallelization of the algorithm, because both child bounds can be tested for intersection in parallel. This reduces the number of nodes to process and prevents child nodes from being fetched if their implicit box is not hit by the ray.

For many highly detailed scenes, the ability to instantiate multiple copies of an "object" is important in order to minimize memory requirements. This is supported through a two-level B-KD Tree approach. A top-level B-KD Tree is built over object instances, each of which consists of a reference to an affine transformation matrix M and a bottom-level B-KD Tree representing the object (see Figure 3.3). By modifying the matrix one can transform the complete object, which may consist of thousands of triangles, while only changing parts of the top-level B-KD Tree. A modification of the object's geometry is not required, as the ray is transformed during rendering to the local coordinate frame of the object. This approach is well understood [LAM01, WBS03] and allows rigid-body motion to be handled efficiently.

Figure 3.3: Two-Level B-KD Trees: The scene consists of a top-level B-KD Tree over object instances. Each object instance consists of a reference to an object and an affine transformation matrix Mi to position it. Objects may be referenced several times and contain their own bottom-level B-KD Trees.
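To make the ray transformation step concrete, the following C sketch transforms a ray into an object's local coordinate frame before it is traversed through the object's bottom-level B-KD Tree. This is only an illustration, not the DRPU implementation (which is specified in HWML): the structure and function names are assumptions, and it is assumed that the world-to-object matrix (the inverse of the instance matrix Mi) is directly available as three rows of four floats.

typedef struct { float x, y, z; } Vec3;
typedef struct { Vec3 org, dir; } Ray;
/* Affine transform stored as three rows of four floats (rotation/scale + translation). */
typedef struct { float row[3][4]; } Mat3x4;

static Vec3 mat_mul_point(const Mat3x4 *m, Vec3 p) {
    Vec3 r = {
        m->row[0][0]*p.x + m->row[0][1]*p.y + m->row[0][2]*p.z + m->row[0][3],
        m->row[1][0]*p.x + m->row[1][1]*p.y + m->row[1][2]*p.z + m->row[1][3],
        m->row[2][0]*p.x + m->row[2][1]*p.y + m->row[2][2]*p.z + m->row[2][3],
    };
    return r;
}

static Vec3 mat_mul_vector(const Mat3x4 *m, Vec3 v) {
    /* Directions ignore the translation column. */
    Vec3 r = {
        m->row[0][0]*v.x + m->row[0][1]*v.y + m->row[0][2]*v.z,
        m->row[1][0]*v.x + m->row[1][1]*v.y + m->row[1][2]*v.z,
        m->row[2][0]*v.x + m->row[2][1]*v.y + m->row[2][2]*v.z,
    };
    return r;
}

/* Assumes 'world_to_object' is the inverse of the instance matrix Mi. */
static Ray transform_ray(const Mat3x4 *world_to_object, Ray world_ray) {
    Ray local;
    local.org = mat_mul_point(world_to_object, world_ray.org);
    local.dir = mat_mul_vector(world_to_object, world_ray.dir);
    return local;
}

Points are transformed by the full affine matrix, while directions ignore the translation column; in the DRPU, the equivalent operation is performed by the Geometry Processor (see Section 3.3.3).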

The Bounding Interval Hierarchy is a different approach that extends KD Trees with more general bounds. The difference is that the Bounding Interval Hierarchy stores only two bounding planes that bound the leftmost (or rightmost) position of the geometry contained in the right (or left) child. This approach has been used to implement ray tracing of dynamic scenes, but by reconstructing the complete hierarchy [WK06]. Updating as for B-KD Trees is not efficiently possible, because the stored bounds are too inaccurate. If the two subtrees swap their positions, for instance, they will overlap completely, as they are now bounded from a non-optimal direction (the right child that moved to the left is still bounded from the left, and vice versa).

Because B-KD Trees have properties of both KD Trees [Jan86, SF90], which split space along an axis-aligned splitting plane, and AABVHs (axis-aligned Bounding Volume Hierarchies) [WBS06a], which store an axis-aligned box for each node, a comparison of all three data structures is of interest.

Size: For the size comparison we assume a KD Tree node to be 8 bytes wide, a B-KD Tree node to be 16 bytes wide (see Section 3.3.1), and AABVH nodes to be 48 bytes wide. For B-KD Trees and AABVHs, the precision of the floating point bounds needs to be reduced slightly to fit into that size. Table 3.1 shows that the B-KD Tree is the smallest data structure, because it does not suffer from duplicated geometry in both subtrees and stores bounds in a single dimension only. AABVHs consume much more memory, as a complete axis-aligned box needs to be stored for each child node, which often causes unnecessary bounds to be stored. The volume represented by a KD Tree node is represented implicitly by splitting the space with a splitting plane into two disjoint subspaces. At first glance this seems to be a more compact representation, but KD Trees suffer from additional storage overhead for geometry lying in both sub-volumes, resulting in 7 to 10 times more nodes than there are triangles in the scene.

Scene        index type   index size   # trav. steps   trav. cost   # ray/tri intersections
Helix        KD Tree      n/a          n/a             n/a          n/a
(78k tris)   B-KD         2.5 MB       25.3            101.4        6.2
             AABVH        5.0 MB       13.7            164.5        4.9
Gael         KD Tree      2.8 MB       62.5            62.5         6.9
(52k tris)   B-KD         1.7 MB       96.7            386.8        15.9
             AABVH        3.4 MB       67.0            804.4        11.9
Office       KD Tree      1.4 MB       31.2            31.2         4.8
(34k tris)   B-KD         1.1 MB       29.1            116.4        6.8
             AABVH        2.2 MB       21.1            253.6        5.3

Table 3.1: Comparison between KD Trees, B-KD Trees, and AABVHs for the Helix scene (78k triangles), the Gael scene (52k triangles), and the Office scene (34k triangles). The table shows the index structure size in MB (excluding triangles), the average number of traversal steps per pixel, the traversal cost (average number of ray/plane distance computations per pixel), and the intersection cost per pixel (average number of ray/triangle intersections). The B-KD Tree is the smallest index for all test scenes. KD Trees require the fewest ray/plane intersections during traversal, while AABVHs require the most.

Number of Traversal Steps: In the number of traversal steps, the KD Tree outperforms both B-KD Trees and AABVHs because it does not suffer from overlapping child bounds, as in the Gael scene. However, the B-KD Tree and the AABVH can sometimes outperform the KD Tree because they process more clipping planes per traversal step. The AABVH always requires fewer traversal steps than the B-KD Tree, as its boxes are much more accurate.

Traversal Cost: Looking at the traversal cost, defined as the required number of ray/plane intersections for traversal, the KD Tree outperforms the other data structures by a factor of up to 5, because only a single ray/plane intersection is required per traversal step. The other data structures perform worse because of more, and often unnecessary, ray/plane intersections. The sometimes lower number of traversal steps cannot compensate for this effect. B-KD Trees outperform the AABVH because one B-KD Tree traversal step requires only 4 ray/plane intersections (see Section 3.2.4), while one AABVH traversal step requires 12. Again, the reduced number of traversal steps for AABVHs cannot compensate for this effect. As all the planes need to be fetched from memory during traversal, the bandwidth required for traversal is correlated with the traversal cost. Thus AABVHs require higher traversal bandwidth than B-KD Trees.

Dynamic Scenes: KD Trees are inflexible with regard to dynamic scenes because they are costly to compute [WBS03] and cannot efficiently be updated after scene changes. AABVHs are similarly costly to compute initially, but very easy to update during rendering by recomputing the node boxes while maintaining the tree structure. This makes AABVHs efficient for handling coherent dynamic scenes [WBS06a]. B-KD Trees can similarly be updated efficiently bottom-up and support coherent dynamic scenes (see Section 3.2.3). As a bonus, fewer bounding planes need to be updated for B-KD Trees, reducing the required update bandwidth. Compared to AABVHs, B-KD Trees have one issue regarding dynamic scenes. As their implicit boxes are less accurate, they overlap with a higher probability during the course of changes. This may result in a traversal of the wrong child first, causing traversal behind the closest point of intersection. Rotations of geometry in particular can become a problem, as bounds are then no longer oriented optimally. However, by leveraging the two-level B-KD Tree approach, one can sometimes map rotations to the transformation matrix of objects, which often solves the problem.
However, B-KD Trees have been chosen as the data structure because they have several advantages for hardware implementation. As B-KD Trees require fewer ray/plane intersections, less data needs to be fetched from the scene database, fewer costly floating point operations need to be performed, and less power is consumed. The lower data bandwidth is especially important, because even with B-KD Trees the traversal consumes the largest share of the Rendering Units' bandwidth (see Chapter 7). The ray/plane intersections are also quite expensive (especially on the FPGA), as each maps to a floating point subtraction followed by a floating point multiplication. As the B-KD Tree subdivides the geometry down to one primitive per leaf node, a special handling of lists of primitives as in conventional KD Trees is not required. This simplifies hardware implementation significantly, because no special hardware units are required to fetch such lists or to perform mailboxing [AW87a] to skip previously processed duplicate list entries.

3.2.1 B-KD Tree Construction

This section focuses on the initial construction of B-KD Trees for static and dynamic geometry. As the initial computation of the B-KD Tree is a complex task, it is performed completely in software, without any hardware acceleration. We have also considered performing the construction of the spatial index structure in hardware, but this would be roughly one order of magnitude slower. The reason is that construction algorithms for recursive spatial index structures typically have a runtime of O(n log n).

Scene     construction    traversal steps   intersections
Helix     center split    35.7              6.8
          SAH             25.3              6.2
Gael      center split    224.8             23.1
          SAH             96.7              15.9
Office    center split    42.8              10.1
          SAH             29.1              6.8

Table 3.2: Comparison of the traversal and intersection cost of a B-KD Tree created with a split-in-the-middle heuristic and with the Surface Area Heuristic. The Surface Area Heuristic outperforms the center split for each test scene.

A number of several thousand triangles quickly causes the logarithmic term to introduce an overhead factor of 10 to 20. However, with the updating approach the construction speed causes no problems, as construction only needs to be performed initially. The tree bounds can later be updated efficiently during rendering. For KD Trees, the difference between a good and a bad tree can cause a factor of 2 in rendering speed [RSH05]. The same holds for B-KD Trees (see Table 3.2), thus construction algorithms that generate good B-KD Trees are important.

Furthermore, it is assumed that dynamic objects are in a good initial pose for construction. A good pose is characterized by geometry that is far apart at some time step of the animation also being far apart in the initial pose. A character folding its hands is a bad initial pose, because the B-KD Tree construction would probably put both hands close together deep in the hierarchy. Moving the hands to different locations would then cause the hands to span many large bounds.

As a B-KD Tree is a kind of Bounding Volume Hierarchy, similar construction algorithms can be applied. Bounding Volume Hierarchies can be created in a top-down or bottom-up fashion. Goldsmith and Salmon [GS87] optimize the bottom-up construction using a cost model that minimizes an estimate of the traversal cost, called the Surface Area Heuristic. For top-down approaches, a median split of the geometry [KK86] and a middle split of the volume [Smi98] have been analyzed.

We use a top-down approach similar to [WBS06a] that uses the Surface Area Heuristic to select an optimal partitioning of the triangles into two disjoint sets, similar to top-down construction algorithms for KD Trees [Wal04]. Possible partitionings of the geometry are determined by looking at the geometry sorted in all three dimensions according to the primitive centers. These sorted lists define possible partitionings G = G0 ∪ G1 of the geometry into two disjoint sets (G0 ∩ G1 = ∅) by splitting one of these lists at a cost-optimal position. The list that is selected also determines the splitting dimension of the node. Sorting the geometry along the three dimensions needs to be performed only once initially, which improves the runtime of the algorithm [Wal04]. Compared to construction algorithms for KD Trees, no geometry overlaps the splitting position, which makes an in-place construction possible.

The Surface Area Heuristic is used to select the partitioning with the smallest cost estimate. It is computed by a probabilistic model assuming uniformly distributed rays, no ray occlusion, constant cost Ctrav for one traversal step, and constant cost Cint for one ray/triangle intersection. The cost of a partitioning is the atomic traversal cost plus the intersection cost of each child (approximated by the intersection cost of the contained primitives) multiplied by the conditional probability p(Ti|T) that a child Ti is traversed. These probabilities can be computed as the ratio between the surface area SA(Ti) of the child's bounds and the surface area SA(T) of the parent node [San76].
As the data structure being built is a B-KD Tree, the implicit boxes of the children need to be considered in the surface area computations to achieve the best results.

SAH(T) = Ctrav + SA(T0)/SA(T) · |G0| · Cint + SA(T1)/SA(T) · |G1| · Cint

A traversal cost of Ctrav = 14 and an intersection cost of Cint = 36 (both corresponding to the latency of the associated hardware units) proved to be a good choice for the hardware architecture described later. Table 3.2 shows a comparison of the surface area construction algorithm just described for B-KD Trees and a naive algorithm that performs a spatial median split along a round-robin axis. The Surface Area Heuristic performs better for each of the three test scenes. For the Helix character, which contains triangles of similar size, the improvement is modest. However, for static environments such as the Office or the even larger Gael scene, the Surface Area Heuristic has a strong influence on performance.
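As an illustration, the following C sketch evaluates the SAH cost of one candidate partitioning using the formula above. The cost constants mirror the values quoted in the text, but the helper and function names are illustrative assumptions, not the thesis code.

typedef struct { float lo[3], hi[3]; } Box;   /* axis-aligned (implicit) child box */

static float surface_area(Box b) {
    float dx = b.hi[0] - b.lo[0];
    float dy = b.hi[1] - b.lo[1];
    float dz = b.hi[2] - b.lo[2];
    return 2.0f * (dx * dy + dy * dz + dz * dx);
}

/* Cost constants from the text: latencies of the traversal and intersection units. */
#define C_TRAV 14.0f
#define C_INT  36.0f

/* SAH cost of splitting a node with (implicit) box 'parent' into children with
 * boxes 'left' and 'right' containing n_left and n_right primitives. */
static float sah_cost(Box parent, Box left, int n_left, Box right, int n_right) {
    float inv_sa = 1.0f / surface_area(parent);
    return C_TRAV
         + surface_area(left)  * inv_sa * (float)n_left  * C_INT
         + surface_area(right) * inv_sa * (float)n_right * C_INT;
}

During construction, this cost is evaluated for every candidate split position in the three sorted lists, and the partitioning with the minimum cost determines the node's splitting dimension and its two children.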

3.2.2 KD Tree Transformation

An advantage of KD Trees over B-KD Trees is the sometimes faster traversal due to fewer traversal steps because of non-overlapping child bounds. Thus the highly optimized KD Trees used in the OpenRT software ray tracer sometimes achieve higher rendering speed than the B-KD Trees constructed with the Surface Area Heuristic of the previous section (see the Gael scene in Table 3.1). To achieve the best rendering performance in the hardware architecture described later, the OpenRT KD Trees are used for static geometry by converting them to B-KD Trees if this improves rendering speed. This conversion is straightforward for inner nodes of the KD Tree: the same dimension as the KD Tree node is selected as the B-KD Tree node axis, and the bounding intervals ]−∞, d] and [d, +∞[ easily emulate the splitting position d of the KD Tree node. Supporting empty B-KD Tree leaf nodes is required to convert empty KD Tree leaves, and lists in KD Tree leaf nodes are compiled to small B-KD Trees using the algorithm from the previous section. Because our data layout for B-KD Trees is 16 bytes per node, while a KD Tree node is typically 8 bytes wide, twice as much memory is required to store the inner KD Tree nodes, plus some overhead for KD Tree lists, which are typically short (1 to 3 entries).
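A minimal sketch of this conversion for an inner node is shown below; the node structures are simplified assumptions and not the bit-exact 16-byte layout of Section 3.3.1 (where the reduced-precision bounds would represent the infinite limits by their largest representable values).

#include <math.h>

/* Simplified (not bit-exact) node views for illustration only. */
typedef struct { int axis; float split; } KdInnerNode;
typedef struct {
    int   axis;          /* bounding dimension of the B-KD Tree node */
    float min0, max0;    /* bounding interval of child 0             */
    float min1, max1;    /* bounding interval of child 1             */
} BkdInnerNode;

/* A KD Tree split plane at position d along 'axis' is emulated by the
 * intervals ]-inf, d] for the left child and [d, +inf[ for the right child. */
static BkdInnerNode convert_kd_inner_node(KdInnerNode kd) {
    BkdInnerNode n;
    n.axis = kd.axis;
    n.min0 = -INFINITY; n.max0 = kd.split;
    n.min1 = kd.split;  n.max1 = INFINITY;
    return n;
}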

3.2.3 B-KD Tree Update

While the previously described initial construction algorithm is applied only once, the updating of the B-KD Tree requires high performance, as it needs to be performed in each rendered frame if some geometry changes. This updating procedure merges the bounds of the nodes bottom-up through the tree and updates for each B-KD Tree node the extent of the two children along the axis of the node (see Figure 3.4). Although only one-dimensional bounding intervals are stored for the child nodes, the updating procedure needs to merge the full axis-aligned boxes of the nodes bottom-up through the tree. As a two-level B-KD Tree approach is used, each object of the scene can be updated separately if it changes, which makes the handling of large static environments that include dynamic objects possible.

Figure 3.4: B-KD Tree Update: The updating procedure merges full axis-aligned boxes bottom-up through the tree. For each node the bounding intervals of both children are updated.

The algorithm touches neither the axis of the bounds nor the structure of the B-KD Tree. Furthermore, this algorithm is linear in the size of the tree (and the number of triangles) and can be implemented in a highly parallel way, as the updates of all nodes within a single level of the tree are independent of each other. The computations per inner or leaf node are not expensive. Computing the box of a leaf node from 3 vertices requires only 3 minimum and 3 maximum operations per dimension, for a total of 18 simple floating point comparisons.

Merging the two axis-aligned boxes of the children requires only 1 minimum and 1 maximum operation per dimension, for a total of 6 comparisons. The update algorithm for AABVHs looks the same as for B-KD Trees, but a full axis-aligned box needs to be written for each child node. This causes up to 3 times higher memory traffic for box updates, making the updating of AABVHs significantly more expensive than that of B-KD Trees, especially as even the B-KD Tree update is bandwidth limited in its hardware implementation.

After such an update the B-KD Tree can be used to render the changed geometry efficiently. However, for best results the overall structure of a B-KD Tree should "match" the geometry and its dynamics. This means that geometry in a sub-tree should stay as close together as possible during the course of changes. A mismatch can result in significant overlap of the bounds of child nodes. This leads to redundant traversal and missed opportunities for early ray termination, as both child nodes must be traversed if a ray enters an overlap region. As a consequence, only dynamic scenes that show some coherent motion can be handled efficiently with B-KD Trees. Many typical motions, like skinned meshes, obey this restriction, as will be shown in the results section. However, even a random movement of triangles can in principle be handled correctly with the data structure. Obviously, it would not perform fast, because the significant overlap of B-KD Tree nodes would result in a traversal of most parts of the tree.

To optimize this random motion case, a non-optimal structure of the tree could in principle be detected by looking at the overlap of bounds. In the case of a mismatch it might be necessary to re-compute the structure of parts of the tree. Such computations can be performed by a software driver, as they are usually rare and their cost can ideally be amortized over many frames, thus dedicated hardware is not required for this operation. In this thesis a complete reconstruction of the B-KD Tree per frame is only performed for the top-level nodes over the instantiated objects, while the B-KD Trees of the objects are updated on scene changes.
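In hardware this update is performed by the Update Processor of Section 3.3.2, driven by a precompiled instruction stream. A recursive software sketch of the same computation, with simplified node and vertex structures as assumptions, looks as follows.

#include <math.h>

typedef struct { float x, y, z; } Vec3;
typedef struct { float lo[3], hi[3]; } Box;

typedef struct Node {
    int          axis;            /* dimension of the stored bounding intervals     */
    float        min0, max0;      /* bounding interval of child 0 along 'axis'      */
    float        min1, max1;      /* bounding interval of child 1 along 'axis'      */
    struct Node *child0, *child1; /* NULL for leaf nodes                            */
    const Vec3  *v[3];            /* the leaf's triangle vertices (shared, indexed) */
} Node;

static float min3(float a, float b, float c) { return fminf(a, fminf(b, c)); }
static float max3(float a, float b, float c) { return fmaxf(a, fmaxf(b, c)); }

/* Box of a single triangle: 3 min and 3 max operations per dimension. */
static Box triangle_box(const Vec3 *a, const Vec3 *b, const Vec3 *c) {
    Box box;
    box.lo[0] = min3(a->x, b->x, c->x); box.hi[0] = max3(a->x, b->x, c->x);
    box.lo[1] = min3(a->y, b->y, c->y); box.hi[1] = max3(a->y, b->y, c->y);
    box.lo[2] = min3(a->z, b->z, c->z); box.hi[2] = max3(a->z, b->z, c->z);
    return box;
}

/* Recomputes the box of the subtree rooted at 'n' and refreshes the
 * one-dimensional bounding intervals stored for its two children.
 * The structure of the tree is left untouched. */
static Box update_node(Node *n) {
    if (n->child0 == NULL)                      /* leaf node */
        return triangle_box(n->v[0], n->v[1], n->v[2]);

    Box b0 = update_node(n->child0);
    Box b1 = update_node(n->child1);
    n->min0 = b0.lo[n->axis]; n->max0 = b0.hi[n->axis];
    n->min1 = b1.lo[n->axis]; n->max1 = b1.hi[n->axis];

    Box merged;                                 /* 1 min and 1 max per dimension */
    for (int d = 0; d < 3; ++d) {
        merged.lo[d] = fminf(b0.lo[d], b1.lo[d]);
        merged.hi[d] = fmaxf(b0.hi[d], b1.hi[d]);
    }
    return merged;
}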

3.2.4 B-KD Tree Traversal

For rendering, the B-KD Tree needs to be traversed by enumerating all nodes whose implicit box is pierced by the ray. As B-KD Trees are a recursive data structure, the traversal algorithm is a recursive algorithm on the binary tree and requires a traversal stack. The traversal algorithm is similar to that of standard KD Trees [SF90], where it is also required to recurse with a traversal interval representing the entry and exit point of the ray with the implicitly defined box of the current node. The recursive traversal function traverses the scene along this traversal interval I = [near, far].

Figure 3.5: B-KD Tree Traversal: The ray is intersected with the bounding planes of both children and clipped against the traversal interval [near, far], giving two intersection intervals I0 and I1 along the ray. A child is traversed only if its intersection interval is not empty. The closer child is always traversed first to improve performance through early ray termination.

First, the early ray termination test determines whether the current closest hit distance lies before the near value; in that case no closer intersection can be found and the recursive call returns. In this case, or if the previous node was a leaf node, the traversal continues with the next entry on the stack. If the stack is empty, the traversal operation is finished.

If not terminated early, the 4 intersection distances of the ray with the bounding planes defined by the bounding intervals of the two children are computed. This gives an interval for each child, which is clipped against the current traversal interval, resulting in two intersection intervals I0 and I1 for the two children (see Figure 3.5). If such an intersection interval is not empty, the corresponding child is traversed. If both children need to be traversed, the child whose intersection interval starts before the other child's interval is traversed first, to improve performance through early ray termination.

The traversal is partially ordered along the ray to improve performance, but the traversal order does not influence the correctness of the algorithm, because all data required to traverse a subtree is put onto the stack. Thus the closer subtree could still be traversed correctly if the farther one had been selected first. This order independence is important for extending the algorithm to packets of rays, where the complete packet needs to traverse the tree nodes in a common order that might differ from the optimal order for some of the rays in the packet. This makes an easy ray-packet based implementation of the algorithm possible by performing exactly the same computations on each ray. However, the decision for the next subtree to traverse is common for all rays of the packet: the packet traverses the left child first if there exists a ray that traverses the left child first, and behaves in a similar manner for the other child (see Section 3.3.3.1 for details).
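The following single-ray C sketch illustrates this traversal loop. It reuses the simplified node layout from the update sketch above (re-declared here for completeness), assumes precomputed reciprocal ray directions, and leaves the triangle intersection abstract; the stack size and names are illustrative assumptions, not the hardware implementation.

#include <math.h>

typedef struct { float x, y, z; } Vec3;
typedef struct Node {
    int          axis;            /* dimension of the stored bounding intervals */
    float        min0, max0;      /* interval of child 0 along 'axis'           */
    float        min1, max1;      /* interval of child 1 along 'axis'           */
    struct Node *child0, *child1; /* NULL for leaf nodes                        */
    const Vec3  *v[3];            /* triangle vertices for leaf nodes           */
} Node;

typedef struct { float org[3], rcp_dir[3], hit_dist; } RayState;
typedef struct { const Node *node; float near, far; } StackEntry;

/* Assumed helper: intersects the leaf's triangle and may shorten ray->hit_dist. */
void intersect_triangle(const Node *leaf, RayState *ray);

static void traverse(const Node *root, RayState *ray, float near, float far) {
    StackEntry stack[64];
    int sp = 0;
    const Node *node = root;
    for (;;) {
        if (near <= ray->hit_dist) {                /* early ray termination test */
            if (node->child0 == NULL) {
                intersect_triangle(node, ray);      /* leaf node reached */
            } else {
                const int k = node->axis;
                /* distances of the ray to the four bounding planes */
                float d0a = (node->min0 - ray->org[k]) * ray->rcp_dir[k];
                float d0b = (node->max0 - ray->org[k]) * ray->rcp_dir[k];
                float d1a = (node->min1 - ray->org[k]) * ray->rcp_dir[k];
                float d1b = (node->max1 - ray->org[k]) * ray->rcp_dir[k];
                float min0 = fminf(d0a, d0b), max0 = fmaxf(d0a, d0b);
                float min1 = fminf(d1a, d1b), max1 = fmaxf(d1a, d1b);
                /* clip both intervals against the traversal interval [near,far] */
                float n0 = fmaxf(near, min0), f0 = fminf(far, max0);
                float n1 = fmaxf(near, min1), f1 = fminf(far, max1);
                int hit0 = (n0 <= f0), hit1 = (n1 <= f1);
                if (hit0 && hit1) {
                    int left_first = (min0 <= min1);     /* traverse closer child first */
                    StackEntry farther = { left_first ? node->child1 : node->child0,
                                           left_first ? n1 : n0,
                                           left_first ? f1 : f0 };
                    stack[sp++] = farther;
                    node = left_first ? node->child0 : node->child1;
                    near = left_first ? n0 : n1;
                    far  = left_first ? f0 : f1;
                    continue;
                } else if (hit0) { node = node->child0; near = n0; far = f0; continue; }
                else   if (hit1) { node = node->child1; near = n1; far = f1; continue; }
            }
        }
        if (sp == 0) return;                        /* stack empty: traversal done */
        StackEntry e = stack[--sp];
        node = e.node; near = e.near; far = e.far;
    }
}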

3.3 Hardware Implementation of B-KD Trees

While software implementations of B-KD Trees should be easy to realize and should achieve good performance, they suffer from the comparatively low computational power of CPUs and require difficult low-level programming to reach high performance [RSH05]. In contrast, hardware implementations are highly parallel and can provide very high peak floating point performance. Because of their simple traversal and update procedures, B-KD Trees are well suited for hardware implementation. Such a dedicated hardware implementation is described in this section and achieves high efficiency and performance. Described are an Update Processor to update the bounding intervals of the B-KD Tree nodes on scene changes, and a Ray Casting Unit, consisting of a Traversal Processor and a Geometry Processor, to traverse rays through B-KD Trees.

3.3.1 Data Layout

All hardware units described in the next sections operate on a fixed B-KD Tree data layout, where inner nodes (which store bounding intervals) and leaf nodes (which point to geometry) are 128 bits (16 bytes) wide.

Figure 3.6: B-KD Tree Inner-Node Layout

An inner B-KD Tree node (see Figure 3.6) stores the bounds of the child nodes in a single specified dimension at a reduced floating point accuracy of 23 bits, in the form of two min/max intervals. Only one 32 bit wide pointer is stored, pointing to the pair of children (see Figure 3.7). While this technique simplifies the data layout by removing one pointer, it requires some additional attention in the instructions of the Update Processor, which update pairs of child nodes (see Section 3.3.2). The chosen data layout for inner nodes is important for the update of the nodes by the Update Processor. As the tree structure does not change during an update, the addresses of the child nodes must not be modified during the update. With this data layout this is easily achieved by disabling the write-back of the first word (painted red in Figure 3.6), such that only the bounding intervals (plus the leaf and dimension bits) are stored. This prevents the pointer to the children from being overwritten, which otherwise would have to be fetched from memory first.

Figure 3.7: Pointer Layout: Each inner B-KD Tree node stores a pointer to its pair of children. This saves the pointer to the second child (drawn dotted) as it is implicitly stored.

Figure 3.8: B-KD Tree Leaf-Node Layout

The leaf nodes contain a pointer to the shading data of the triangle and 3 pointers to the 3 vertex positions the triangle is built from (see Figure 3.8). The vertex pointers are later used by the Geometry Processor to fetch the vertices of the triangle that needs to be intersected with the ray.
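For illustration, a C sketch of the 128-bit leaf node and of the vertex record described next (Figure 3.9) is given below. Field names and the exact packing of the flag bits are assumptions read off Figures 3.8 and 3.9; the inner node of Figure 3.6 additionally packs four reduced-precision bounds, the child-pair pointer, and the leaf/dimension bits into its 16 bytes, and therefore does not map onto plain C floats in the same way.

#include <stdint.h>

/* Leaf node (Figure 3.8): exactly four 32-bit words. */
typedef struct {
    uint32_t shading_data;   /* pointer to the per-triangle shading record (Figure 3.10);
                                the node's "is leaf" flag is assumed to be packed into
                                this first word (see the discussion in Section 3.3.1) */
    uint32_t vertex_addr[3]; /* pointers to the three vertex positions */
} LeafNode;                  /* 4 x 32 bit = 128 bit */

/* Vertex position (Figure 3.9): three IEEE floats padded to 128 bit so that
 * one vertex can be fetched with a single 128-bit memory request. */
typedef struct {
    float    x, y, z;
    uint32_t pad;            /* 32 unused bits for alignment */
} VertexRecord;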

Figure 3.9: Vertex Position Layout

The layout of a vertex position (see Figure 3.9) is aligned to 128 bits because of the 128 bit wide memory interface of the prototype board. This wastes 32 unused bits but allows one vertex to be fetched in a single memory request. All three components of the vertex are stored as IEEE floating point numbers.

The hardware supports a top-level B-KD Tree, which stores no triangles in its leaf nodes, but object instances that consist of a matrix and a pointer to the root node of the object. The data layout for these transformation nodes is the same as for normal leaf nodes, except that the "pointer to shading data" points to the object's root node and the "vertex 0 addr" points to the 3 matrix rows of the object instance.

The shading data pointer of a leaf node points to a per-triangle data structure that is only used during shading. For completeness, the most important aspects of this data structure are described here. The triangle data structure contains many pointers in order to efficiently store indexed face sets.

Figure 3.10: Triangle Layout

Vertex colors and vertex normals, for instance, can often be shared between several vertices when using pointers. The first 128 bits of the triangle data structure (see Figure 3.10) always contain 4 pointers, for the triangle vertices, triangle normals, vertex colors, and per-vertex texture coordinates of the triangle. Each of these pointers points to a memory location that stores an array of 3 pointers (one for each triangle vertex), which in turn point to the vertex positions, normals, colors, or texture coordinates (see Figure 3.11 for the case of vertex positions).

Figure 3.11: Triangle Vertex Position Pointers

In addition, the triangle data structure stores an arbitrary number of 128 bit data words (painted blue in Figure 3.10) for user-defined shading data per triangle. Here the user can store material properties like textures or pointers to shared material data. The pointers to the triangle vertex positions are stored twice with the presented data layout: once in the B-KD Tree leaf nodes and a second time for shading (see Figure 3.11). This redundancy could easily be avoided if the shader could also access the pointers in the leaf node (see Figure 3.8). However, the "is leaf" bit would influence the first 32 bit address. To fix this, the shader could either reset this bit manually, or the semantics of the "is leaf" bit could be negated in the hardware, such that it would be 0 for leaf nodes and 1 for inner nodes.

3.3.2 Update Processor

Today's computer games are highly dynamic and animate many characters that change their position and shape in each frame. After each modification of a character's geometry, the B-KD Tree of the corresponding object needs to be updated in the DRPU framework. Because this updating has to be performed per character (each consisting of thousands of triangles) and typically in each frame (at frame rates of more than 30 fps), very high update performance is required. This high update performance can be achieved by the optimized Update Processor described in this section, which is capable of updating at a peak rate of one B-KD Tree node per clock cycle.

As described earlier, the update procedure merges the axis-aligned boxes of the nodes bottom-up through the tree and updates the bounding intervals of the B-KD Tree nodes. In principle this updating could be performed by recursively traversing the B-KD Tree top-down in a special hardware unit. However, this would mean coping with many data-dependent memory fetches of B-KD Tree nodes and the resulting latencies. It would also result in high memory traffic, as each node (128 bits) of the tree must be read to traverse it and written again to update it. Furthermore, a cache would be required for the often shared vertices, as well as a multi-threading approach for high usage of the hardware unit, at the cost of non-sequential memory requests. This would result in a complicated unit, while the Update Processor described in the following is simpler, programmable, and powerful. As it is programmable, all the complexity is shifted to a compilation step in the driver that generates the instruction stream and where more complex optimizations can be performed.

The Update Processor is a special processor with in-order execution optimized for the update procedure of B-KD Trees. Its instruction set contains instructions to read vertices and to compute the boxes of leaf nodes and inner nodes in a single cycle. In regular meshes a vertex position is typically shared by 6 triangles, which results in a very compact representation on which the Update Processor can operate directly. These shared vertices are stored in one of 64 internal vertex registers, where they can be used for computing the boxes of several triangles. All partial results, such as computed node boxes, are stored in one of 64 special box registers to limit the external memory requests to the required node updates, vertex fetches, and instruction fetches. The architecture needs no caches, as no temporary values are stored in memory and the vertices are usually read only once and reused optimally from the register file.

The order in which the B-KD Tree nodes are stored in memory is determined by the update instruction sequence. Thus, a compilation step (see Section 3.3.2.1) generates an optimized instruction stream for a B-KD Tree. The B-KD Tree nodes are stored in memory in the reverse order of how they are processed by this instruction stream: the first instruction of the instruction stream processes the last B-KD Tree node, and the last instruction processes the first B-KD Tree node. With this data layout, the pointer to the next node to update can be computed by simple decrement operations in the hardware. An alternative would be to decouple the order in which the nodes are updated from the order in which they are stored in memory.
But this would increase the size of the instruction stream, as the 32 bit address of the node to update would have to be stored in the update instructions. This would result in lower performance, because more instruction data would need to be fetched and the nodes would not be updated in order. The in-order processing of the presented approach can better exploit the peak bandwidth of DRAMs. A limitation of the proposed method is that it fixes the structure of the B-KD Tree: inserting objects later is not efficiently possible, as the instruction stream would need to be recompiled.

Figure 3.12: Update Instruction Set: The instruction set of the Update Processor includes instructions for loading vertices into vertex registers, computing the box of a leaf node, and merging two boxes into the box of an inner node while simultaneously updating its bounds.

Figure 3.12 shows the encoding of the three different kinds of instructions supported by the Update Processor. All of these instructions are 32 bits wide, thus 4 instructions can be fetched per clock cycle over the 128 bit wide memory interface (see Figure 3.13). The update operations for leaf nodes and inner nodes are completely pipelined and can thus be computed with a throughput of 1 and a small latency of only 6 cycles in the implementation.

A load vertex instruction loads one triangle vertex from memory into a destination vertex register. If vertices are shared between multiple triangles, these registers can be reused to reduce the required memory bandwidth. The maximal mesh size that can be handled with the 24 bit wide vertex pointer of the instruction is limited to 16 million vertices. This is no limitation in practice, as characters of current computer games typically consist of 5000 to 10000 triangles [Gam06]. However, to make use of the 32 bit address space of the DRPU, the 24 bit vertex address points to a memory location relative to a 32 bit base register in the hardware.

The hardware internally keeps track of the address of the current node being processed. Each time a process leaf or update node instruction has been executed, this current node pointer is decremented to obtain the next node to process.

The process leaf instruction computes the box of the triangle of a leaf node, represented by the three vertices in the specified vertex registers. The computed box is stored in the specified box target register. Inner nodes are updated by the update node instruction, which merges the boxes from two box registers and stores the merged box in a destination box register. In parallel, the bounding intervals of the current node are updated using the two fetched child boxes. The "dimension" bits of the instruction specify in which dimension (x, y, or z) the bounding intervals of the current node are updated. As explained earlier, the write-back of the first 32 bits of the B-KD Tree node is disabled to prevent overwriting the child pointer address.

Figure 3.13 shows the structure of the Update Processor design. All memory requests performed by the processor are 128 bits wide, thus the Instruction Fetch Unit fetches 4 instructions at a time from memory and schedules them sequentially to the pipeline. The Instruction Scheduler tests whether the arguments (vertices or boxes) of the currently processed instruction have been completely computed and written back to their register file. For this purpose it stores for each vertex and box register a single state bit that indicates whether a computation in the pipeline will modify the register. Each scheduled instruction sets the state bit of its destination register to true, while a finishing instruction resets this bit. An instruction is only scheduled if the state bits of all required arguments are false.

The "load vertex" instructions are processed by the Fetch Vertex Unit, which computes absolute vertex addresses and issues a 128 bit memory request for the vertex. When the data comes back from memory, the vertex is converted to the internal floating point format and written back to the correct register in the vertex register file. For "process leaf" instructions (for leaf nodes) and "update node" instructions (for inner nodes), the register file is accessed in the Register Access Unit to load the 3 vertex operands or 2 box operands, respectively.
This requires a vertex register file with 3 read ports and a box register file with 2 read ports to perform the access in a single cycle. The Merge Unit either merges the two boxes in the inner node case or computes the smallest box containing the 3 vertices in the leaf node case. The computed box is then written to the destination box register by the Box Writeback Unit. If an inner node is processed, the current node is additionally updated using both fetched child boxes.

Figure 3.13: Update Processor: The "Instruction Fetch Unit" fetches the three types of supported instructions. The "Instruction Scheduler" checks whether all arguments (such as vertices or boxes) have already been computed by the pipeline; otherwise the pipeline is stalled. Vertex fetch instructions are executed in the red units, while leaf nodes and inner nodes are processed in the remaining units. After accessing the register file, the node bounds are updated for inner nodes. The "Merge Unit" merges two child boxes or three vertices into a new box that is stored back to the register file.

The Node Update Unit updates the bounding intervals of the current node in the dimension encoded in the instruction. As the instruction stream processes the nodes of the tree in the reverse order of how they lie in memory, the Node Update Unit can compute the target address of the node by simple decrement operations.

A comparison between the number of instructions and memory requests performed per B-KD Tree node shows a good balance between peak computation power and peak memory bandwidth. If a B-KD Tree with N leaf nodes has to be updated, then the tree has N − 1 inner nodes that need to be updated. Consequently, exactly (N − 1) + N = 2N − 1 update instructions for inner nodes and leaf nodes need to be fetched. If we further assume a B-KD Tree over a regular mesh (where each vertex is shared by 6 triangles), the Update Processor executes about N · 3/6 = N/2 vertex fetch instructions and reads that many vertices. In total this makes

node instructions + vertex instructions = (2N − 1) + N/2 = 2.5N − 1

instructions (or cycles) for the computation, and

instructions/4 + vertices + node updates = (2.5N − 1)/4 + N/2 + (N − 1) = 2.125N − 1.25

128 bit wide memory requests to perform. These numbers show that the processor is slightly computation limited, as asymptotically 2.5 cycles per triangle are required for the computation, while only 2.125 memory requests (of 128 bits each) need to be performed. However, because the DRAM does not deliver its peak performance of one 128 bit word per clock cycle, the computation becomes bandwidth limited in practice (see Section 3.3.2.2).

The update instructions for inner and leaf nodes add 1/4 additional memory requirement on top of the storage for the B-KD Tree. Likewise, the vertex fetch instructions add about 1/4 of the memory required to store the vertices. This overhead of 25% introduced by the instruction stream is reasonable. The alternative approach of traversing the B-KD Tree for updating would not require any additional memory, but would produce higher bandwidth during updating, because not 32 bit instructions but 128 bit nodes would have to be fetched.

All data is arranged in memory in such a way that it is processed mainly sequentially by the hardware: the instruction stream obviously lies sequentially in memory, the nodes to update are processed sequentially (in reverse order), and the vertices are stored according to the order of their first access. All this together allows for an efficient usage of the external DRAM (see Section 3.3.2.2).
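To illustrate what such an instruction stream looks like, the following hypothetical, decoded representation updates a minimal object of two triangles sharing an edge (two leaf nodes plus one inner node). The mnemonics mirror the three instruction types of Figure 3.12, but the field names and this decoded form are assumptions; the real encoding packs each instruction into 32 bits.

#include <stdint.h>

/* Hypothetical decoded form of the 32-bit update instructions of Figure 3.12. */
typedef enum { LOAD_VERTEX, PROCESS_LEAF, UPDATE_NODE } Opcode;

typedef struct {
    Opcode   op;
    uint8_t  dst;        /* destination vertex or box register                  */
    uint8_t  src[3];     /* source vertex/box registers (unused entries are 0)  */
    uint8_t  dim;        /* bounding dimension for UPDATE_NODE (0=x, 1=y, 2=z)  */
    uint32_t vtx_addr;   /* 24-bit relative vertex address for LOAD_VERTEX      */
} UpdateInstr;

/* Stream for a two-triangle object; triangles share vertices v1 and v2. */
static const UpdateInstr stream[] = {
    { LOAD_VERTEX,  0, {0, 0, 0}, 0, 0 },  /* fetch vertex into v0                      */
    { LOAD_VERTEX,  1, {0, 0, 0}, 0, 1 },  /* fetch shared vertex into v1               */
    { LOAD_VERTEX,  2, {0, 0, 0}, 0, 2 },  /* fetch shared vertex into v2               */
    { LOAD_VERTEX,  3, {0, 0, 0}, 0, 3 },  /* fetch vertex into v3                      */
    { PROCESS_LEAF, 0, {0, 1, 2}, 0, 0 },  /* box of triangle A -> b0 (advances node ptr) */
    { PROCESS_LEAF, 1, {1, 2, 3}, 0, 0 },  /* box of triangle B -> b1 (advances node ptr) */
    { UPDATE_NODE,  2, {0, 1, 0}, 0, 0 },  /* merge b0,b1 -> b2 and write the root's
                                              bounding intervals along dimension x       */
};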

3.3.2.1 Compiling B-KD Trees

Compiling the instruction stream for the Update Processor is a non-trivial task that is performed by the software driver of the hardware. Many optimizations are possible to improve the speed of the update procedure without any changes to the hardware, by selecting an optimal data layout of the nodes and an optimal static scheduling of the instructions.

There are two main strategies to update the bounds of a B-KD Tree: depth-first or breadth-first. While depth-first approaches require the minimal number of temporary results, they are not very fast because they process the nodes in a long dependency chain. When the depth-first traversal reaches a node, it puts the address of one child onto a stack and fetches the other one to recursively continue there. If a leaf node is reached, the algorithm continues with the top item on the stack. However, the algorithm often has to wait until the next child node is fetched from memory, which results in poor performance. As one address is put onto the stack each time the algorithm descends the tree, the minimal required stack size is small and matches the maximal depth of the tree.

Breadth-first approaches generate enough independent instructions to achieve high usage rates of the hardware, but would require a large temporary state to be stored during the computation. This temporary state requirement is linear in the breadth of the tree, and thus either cannot be stored in a register file of reasonable size or causes additional bandwidth to store intermediate results in memory.

To cope with the limited temporary state that can be stored in the register file, and to still achieve high usage rates, the compiler generates an instruction stream that performs a kind of breadth-limited breadth-first processing of the tree. For a tree of depth d, at least d temporary box registers are required to update it (with depth-first processing). As the hardware has 64 such registers, it can process trees of up to depth 64. If the depth of the tree is smaller, the traversal can be performed with more parallelism (and thus faster), because enough box registers are available. In Section 3.3.2.2, B-KD Trees with a depth of 16 to 20 are used to evaluate the Update Processor and achieve good performance.

Figure 3.14: Update Algorithm: This figure shows the update order for a tree if only 5 boxes can be stored in temporary registers. The nodes are labeled in the order they are processed (and consequently the reverse order in which they are stored), and have the same color if they are processed in the same pass of the algorithm. Note that the two children of a node are always processed together.

The compilation iterates in several passes over the tree, in the same depth-first order, to generate update instructions for the nodes that are ready, i.e. whose children have already been processed. When a leaf node is reached, it can directly be processed if a box register is free. For such a leaf node, vertex fetch instructions are generated if the required vertices are not already available in vertex registers. These vertex fetch instructions load the vertices from memory into the vertex registers that have been least recently used; the vertex registers thus act like a statically managed fully associative cache.

The number of nodes that can be processed per pass depends on the number of available free box registers. However, each inner node that is ready for processing (both children have already been processed) can always be processed, because its two children bind 2 box registers that are freed, and one of them can be used to store the box of the parent node. The algorithm iterates over the tree until all nodes have been processed. The iterations over the tree can be optimized to operate only on the few nodes of the tree that might be processed next (the predecessors of nodes processed previously).

In detail, in each pass the compilation generates update instruction sequences for pairs of child nodes, not for single nodes as described in the previous paragraphs. Such a pair of child nodes is only processed if the children of both nodes have already been processed. After such an update, 2 of the 4 box registers required to store the boxes of the 4 children of the pair are free again and can be used to process new leaf nodes of the tree. The handling of pairs of child nodes during compilation guarantees that the Update Processor always processes the two child nodes of a parent node sequentially and stores them as a pair in memory. Figure 3.14 illustrates the order in which the nodes are processed for a simple example.

In the last step, the algorithm iterates over the generated instructions and moves the vertex fetch instructions backwards, so that they are executed as early as possible to hide latencies due to memory fetches. This simple optimization improves performance by a factor of up to 2.4 in our experiments, as pipeline stalls due to vertex fetches are reduced.

The order in which the nodes are handled by the algorithm (and later updated by the hardware) defines the reverse order in which they need to be stored in memory, such that the Update Processor can compute the next node address by a decrement operation. In the example of Figure 3.14 the nodes are stored in the order 18, 17, 16, ..., 1, 0 in memory. The order needs to be reversed in this way because only forward pointers are possible in the B-KD Tree node layout.

The fixed depth-first order in which the tree is processed influences the overall performance. Slightly better results are achieved if the recursion always follows the deepest subtree first, to collapse the deep parts as early as possible, which is a standard technique in compilation. An early handling of the shallow parts of the tree would bind many partial results to registers that could only be used much later, once the deeper parts of the tree have been computed.

3.3.2.2 Results

This section describes a prototype implementation of the Update Processor on the prototype FPGA board (see Section 2.8). The latency of the Update Processor instructions is 6 cycles in the implementation, with a peak throughput of one instruction per clock cycle. Despite this peak throughput, Table 3.3 shows that about 2 cycles are required per instruction, because the computation is bandwidth limited in the prototype. The reason for this is the external DRAM, which causes some overhead through opening and closing of memory rows and switching from read to write operation and vice versa.

The DRAM usage is defined as the percentage of cycles the DRAM is busy (cycles where commands are sent to the DRAM or the controller has to wait due to the DRAM timing specification). Thus if the DRAM usage is 100%, the DRAM is busy in all cycles and no additional memory requests can be handled. The DRAM efficiency is defined as the percentage of read (or write) operations executed relative to the number of cycles the DRAM is busy. Thus if the DRAM usage is 100% and the DRAM efficiency is 20%, data is read from (or written to) memory only in every 5th cycle. The remaining cycles are required to open DRAM rows, to refresh rows, or to wait due to the timing specification.

Scene      # tris   instructions   cycles   cycles per instruction   DRAM efficiency   DRAM usage
Hand       17k      45k            110k     2.4                      35.9%             99.9%
Skeleton   16k      43k            108k     2.5                      35.7%             99.5%
Helix      78k      227k           575k     2.5                      35.8%             99.9%
Pipe       1k       2.6k           6.1k     2.3                      37.1%             98.9%
Morph      1.2k     3.3k           7.9k     2.4                      36.7%             99.9%

Table 3.3: Update Processor Results: The table shows the number of update instructions that need to be executed for a number of dynamic test scenes of varying complexity. An instruction can be scheduled only every two cycles to the pipeline, because the computation is bandwidth limited, as the DRAM is used with an efficiency of only about 36%.

Thus, although the DRAM usage of the Update Processor is about 99%, data can be read or written in only 36% of the cycles (the DRAM efficiency). The resulting bandwidth is not sufficient to use the processor optimally; however, some improvements are possible. A more clever memory interface with large buffers that prefetches a long sequence of instructions and writes longer sequences of updated nodes to memory could improve this efficiency. However, this is not simply possible for vertices, because they cannot always be fetched sequentially.

Besides the bandwidth limitation, Table 3.3 further shows that the number of cycles required to execute an instruction is independent of the total number of instructions. Thus the number of internal registers is sufficient to hide computation latencies for the evaluated range of B-KD Tree sizes. Consequently, the update time is linear in the number of triangles of the scene (assuming a similar fraction of shared vertices).

With 110k cycles, the Hand model could be updated 600 times per second, and even the much larger Helix robot could still be updated 114 times per second at a clock frequency of 66 MHz. Comparing these numbers to the total cycles required for rendering (see Chapter 7), updating is one to two orders of magnitude faster than rendering and thus mostly not the limiting factor of the DRPU architecture.

3.3.3 Ray Casting Unit (RCU)

All ray tracing algorithms build on the basic operation of casting rays through a scene to determine the closest geometry hit by a ray. This casting operation needs to traverse the spatial index structure to determine potentially hit triangles and to intersect the ray with these triangles. In recursive ray tracing this operation needs to be performed for several rays per pixel, to compute reflections, refractions, pixel-accurate shadows, and anti-aliasing. These demands of real-time ray traced rendering make high ray casting performance necessary, which can be achieved with the highly parallel fixed-function Ray Casting Unit described in this section.

Figure 3.15: Ray Casting Unit: The Ray Casting Unit consists of a high performance Traversal Processor (TP) to traverse the B-KD Tree and a Geometry Processor (GP) to intersect rays with triangles or to transform them to the local coordinate space defined by an object instance.

The Ray Casting Unit, as shown in Figure 3.15, mainly consists of a dedicated Traversal Processor (TP) that handles the traversal of B-KD Tree nodes, and a Geometry Processor (GP) that transforms rays for object instance intersection in the top-level B-KD Tree and intersects rays with triangles in the bottom-level B-KD Tree. These subunits are explained in more detail in the next sections.

At a high level of abstraction these units operate as follows. A ray to be traced is sent via path a of Figure 3.15 to the Ray Casting Unit and stored locally in the Geometry and Traversal Processors for later use. The Traversal Processor starts traversing it through the top-level B-KD Tree until a leaf node (containing an object instance) is reached. As this traversal operation is recursive, it requires a stack to store nodes to be processed later. The address of the transformation matrix of the object instance is sent via path b to the Geometry Processor, which fetches the matrix rows from memory. The top-level ray is transformed to the object's coordinate frame by the Geometry Processor. The transformed ray is stored locally (for later triangle intersection) and sent via path c to the Traversal Processor to be traversed through the object. If this object traversal reaches a leaf node in the bottom-level B-KD Tree, the addresses of the triangle vertices are sent via path b to the Geometry Processor, which fetches the triangle vertices from memory. The ray is then intersected with the triangle, and the intersection is stored and sent to the Shading Processor via path e if it is closer than a previously found one. This process continues via path c until the Traversal Processor determines that no closer hit can be found and then terminates the operation via path d.

In order to hide memory and computation latencies due to deep pipelining, multiple rays are processed simultaneously using a wide multi-threading approach [SWS02]. Furthermore, packets of four rays are used to reduce the memory bandwidth, as the rays in a packet always perform the same memory request. These packets of four rays are traversed synchronously in parallel on four traversal slices (see next section) in the Traversal Processor, but are intersected and transformed sequentially in the Geometry Processor for optimal balancing between both units. This multi-threading and packet-based approach performs well because of the high coherence between adjacent rays. The memory bandwidth is reduced further by dedicated first level caches for B-KD Tree nodes and vertices (see Figure 3.15).

A major advantage over previous hardware architectures such as [SWW+04] is that neither a List Unit nor a Mailbox Unit is required, as each triangle is contained in exactly one B-KD Tree leaf node. In contrast to [SWS02, SWW+04, WSS05], the Intersection or Geometry Processor operates directly on shared vertices (e.g. indexed face sets) instead of precomputed triangle data. This greatly reduces the triangle working set.

3.3.3.1 Traversal Processor (TP)

The Traversal Processor is a fixed function unit optimized to perform ray traversal steps through B-KD Trees. The Traversal Processor is fully pipelined, thus it completes one packet traversal step per clock cycle (throughput 1) and has a latency of about 17 cycles in the FPGA implementation.

To improve performance the unit handles packets of 4 rays in parallel and is highly multi-threaded, allowing several such packets to be processed simultaneously in the pipeline. The applied algorithm is very similar to the algorithm from Section 3.2.4, but operates on packets of rays. The packet traversal algorithm always operates on the same B-KD Tree node per packet.

Figure 3.16: Traversal Processor: The Traversal Processor traverses packets of 4 rays in parallel through the B-KD Tree. Nodes are read from memory through a cache, the traversal decision for the 4 rays is computed in parallel by 4 Traversal Slices and later combined into a traversal decision for the complete packet in the Packet Decision Unit.

Figure 3.16 shows a detailed view of the Traversal Processor. The Stack Control Unit performs the high-level control for each packet to traverse. It manages the stack and terminates the computation when the stack becomes empty. The address of the next node to process is computed, which may either come from the stack or be the next node to traverse when descending the tree. The Memory Access Unit loads this next B-KD Tree node from memory through a cache. As the 4 rays in a packet are always synchronized to the same node of the B-KD Tree, memory requests are always performed per packet, reducing the memory bandwidth to the cache.

Figure 3.17: Traversal Slice and Decision Unit: A Traversal Slice needs to intersect its ray with the bounding planes at the positions T0min, T0max, T1min, and T1max corresponding to both children (child0 and child1). Therefore it reads the ray origin and direction in the node dimension k, computes the reciprocal direction, and computes the distances to the planes along the ray (requiring a subtraction and a multiplication each). For each child this gives two intersection distances with its planes, which need to be sorted. The sorted distances (min0, max0) and (min1, max1) are then sent to the Decision Unit that computes the traversal decision for that ray. The ray traverses a child if the current traversal interval (near/far interval) overlaps with the min/max interval of the child and the child can contribute to a closer hit (min ≤ hit distance). This computation gives bits l and r that indicate if the ray traverses the left child (child0) or right child (child1). The traversal order (l2r bit) is determined by comparing the min0 and min1 distances. If min0 ≤ min1 then child0 lies partially before child1, thus it is traversed first.

The node is broadcast to the 4 Slices that operate in parallel on the 4 rays of the packet. Each Slice first loads some local state of its ray, such as the ray itself or the current traversal interval, and computes the distance of the ray to the clipping planes defined by the bounding intervals of the node (see Figure 3.17). As an inner node stores a bounding interval for each child, 4 such ray/plane intersections are performed in parallel in each of the Slices. Each of these in total 16 ray/plane distance computations requires one floating point subtraction and one floating point multiplication (with the reciprocal). To compute the reciprocal of one of the components of the ray direction, one reciprocal computation needs to be performed in each slice during traversal. The 4 Decision Units intersect the current traversal interval of their ray against the intersection interval computed by the traversal slice. Each such interval intersection requires only a scalar minimum and maximum operation. One more floating point comparison is required to determine which of the two children is the closer one and is to be traversed first. This information is used to derive a traversal decision that consists of 3 bits (l, r, l2r) for each ray, indicating whether the ray traverses the left (l bit) and/or the right child (r bit) and in which order it traverses them. For left-to-right order the l2r bit is true, otherwise false. The Packet Decision Unit merges the four ray traversal decisions computed by the Decision Units into a joint traversal decision for the complete packet of rays as follows: the packet goes to the left child and/or right child if there exists a ray in the packet that wants to go there, and the packet goes from left to right if there is a ray that traverses both children in this order.

lpacket = l0 ∨ l1 ∨ l2 ∨ l3
rpacket = r0 ∨ r1 ∨ r2 ∨ r3
l2rpacket = (l0 ∧ r0 ∧ l2r0) ∨ (l1 ∧ r1 ∧ l2r1) ∨ (l2 ∧ r2 ∧ l2r2) ∨ (l3 ∧ r3 ∧ l2r3)

Figure 3.18: Derivation of Packet Traversal Decision
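The per-ray decision of Figure 3.17 and the packet merge of Figure 3.18 can also be written down in software form. The following C++ sketch is purely illustrative (names and data layout are assumptions, not the hardware interface); it mirrors the plane distance computation, the interval test against near/far and the hit distance, and the or-reductions of the Packet Decision Unit.

  #include <algorithm>

  struct RayState {
      float org_k, rcp_dir_k;   // ray origin and reciprocal direction in the node dimension k
      float near_t, far_t;      // current traversal interval
      float hit_dist;           // distance of the closest hit found so far
  };

  struct Decision { bool l, r, l2r; };

  // Per-ray decision of one Traversal Slice and its Decision Unit (Figure 3.17).
  Decision rayDecision(const RayState& ray, float t0min, float t0max, float t1min, float t1max) {
      // distance along the ray to the four bounding planes (one subtraction and one multiplication each)
      float d0a = (t0min - ray.org_k) * ray.rcp_dir_k;
      float d0b = (t0max - ray.org_k) * ray.rcp_dir_k;
      float d1a = (t1min - ray.org_k) * ray.rcp_dir_k;
      float d1b = (t1max - ray.org_k) * ray.rcp_dir_k;
      float min0 = std::min(d0a, d0b), max0 = std::max(d0a, d0b);   // sort the distances per child
      float min1 = std::min(d1a, d1b), max1 = std::max(d1a, d1b);
      Decision d;
      d.l   = std::max(ray.near_t, min0) <= std::min(ray.far_t, max0) && min0 <= ray.hit_dist;
      d.r   = std::max(ray.near_t, min1) <= std::min(ray.far_t, max1) && min1 <= ray.hit_dist;
      d.l2r = min0 <= min1;                                          // child0 lies partially before child1
      return d;
  }

  // Packet Decision Unit (Figure 3.18): or-reduction over the four rays of the packet.
  Decision packetDecision(const Decision d[4]) {
      Decision p{false, false, false};
      for (int i = 0; i < 4; ++i) {
          p.l   = p.l   || d[i].l;
          p.r   = p.r   || d[i].r;
          p.l2r = p.l2r || (d[i].l && d[i].r && d[i].l2r);
      }
      return p;
  }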

Rays may disagree on the traversal order, but this is irrelevant for the correctness of the algorithm and only influences its performance. Consequently, inconsistent rays cause no problems in the packet traversal of B-KD Trees. KD Tree based hardware architectures [SWS02], on the other hand, have problems with inconsistent rays, as they typically optimize the stack memory size by only storing a single far value on the stack instead of the complete near/far interval as done for B-KD Trees. For the same reason, no active vector as for KD Trees [WSS05, SWS02] is required to determine if a ray is active (by intersecting the current node). For B-KD Trees this active state is implicitly encoded in the near/far interval being empty (near > far) or not (near ≤ far). The traversal stack of the Traversal Processor has a depth of 32 entries and is used for the top-level and bottom-level traversal. This stack size is sufficient for scenes with several hundred million triangles. Each entry stores the address of the node to traverse later (1 pointer), the traversal interval for that node of each ray (8 floats), and the current stack pointer of a separate ray stack (a sketch of such an entry is given after Table 3.4). This ray stack is used for storing transformed rays to support top-level and bottom-level traversal, thus its depth is limited to 2. The ray stack pointer is used by the computation to always access the top ray on this stack. The ray stack pointer is increased when a leaf node in the top-level tree is reached, to continue traversal with the bottom-level ray transformed by the Geometry Unit. Through a pop operation of the traversal stack it is possible to continue traversal in the top-level tree again. Compared to a traversal unit for KD Trees more resources are required in terms of on-chip memory and arithmetic units, see Table 3.4. For B-KD Trees, stack memory requirements are about twice as high, as an interval instead of a single scalar needs to be stored per ray [SWW+04]. Also the logic complexity is four times higher, as four ray/plane distance computations are performed versus one for KD Trees. This higher cost pays off as the B-KD Tree is much more flexible and enables the handling of coherent dynamic scenes. For traversing an AABVH only an address stack is required, as a full box is always available to intersect the ray with. As shown earlier, storing full axis aligned boxes is a large memory overhead for the index structure. Furthermore, with 12 ray/plane intersections a single AABVH traversal step requires 3 times more floating point computations than a B-KD Tree traversal step.

              #fmul   #fadd   stack
KD Tree          1       1    1 pointer + 1 float
B-KD Tree        4       4    1 pointer + 2 floats
AABVH           12      12    1 pointer

Table 3.4: Comparison of the cost of one traversal step for KD Trees, B-KD Trees, and AABVHs. Compared to KD Trees, B-KD Trees require more floating point operations per traversal step and need to store a near/far interval on the stack. In contrast, AABVHs store full child boxes explicitly in the nodes that can directly be intersected with a ray, thus no near and far values need to be stored on the stack. As a consequence the traversal step of the AABVH is much more expensive.
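To make the stack costs of Table 3.4 concrete, one entry of the Traversal Processor stack described above can be pictured as the following record. This is only an illustration (field names and widths are assumptions, not the hardware encoding); it contrasts the B-KD Tree entry, which keeps a near/far interval per ray, with the single scalar kept for a KD Tree.

  #include <cstdint>

  // One traversal stack entry for a packet of 4 rays when traversing B-KD Trees.
  struct BKDStackEntry {
      uint32_t nodeAddress;   // node to traverse later (1 pointer)
      float    near_t[4];     // per-ray near value of the traversal interval
      float    far_t[4];      // per-ray far value (together: 2 floats per ray)
      uint8_t  rayStackPtr;   // selects the top ray on the 2-entry ray stack
  };

  // For comparison, a KD Tree entry only needs a single far value per ray.
  struct KDStackEntry {
      uint32_t nodeAddress;   // node to traverse later (1 pointer)
      float    far_t[4];      // per-ray far value (1 float per ray)
  };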


3.3.3.2 Geometry Processor (GP)

The previous section described the Traversal Processor for traversing the B-KD Tree index structure. When reaching a leaf node of the top-level B-KD Tree or a leaf node of the bottom-level B-KD Tree, rays must either be transformed to an object's coordinate frame or be intersected with a triangle. The Geometry Processor performs these tasks sequentially for the rays in a packet. Like the Traversal Processor, the Geometry Processor operates on packets of size 4 and is multi-threaded, supporting the same number of packets. The Geometry Processor is partially pipelined and can perform one ray/triangle intersection or ray transformation every two clock cycles. Thus for intersecting a packet of 4 rays the Geometry Processor requires 8 cycles, which results in a 1 to 8 balancing between Traversal Processor and Geometry Processor. This is optimal because on average a triangle needs to be intersected every 8 traversal steps [Sch06, SWS02]. Compared to approaches [WSS05] that re-use a programmable shading processor also for geometry intersection, the dedicated Geometry Processor allows for higher parallelization, giving a speed-up of 40 to 100% for the usual case of triangle geometry (see Chapter 7). Consequently, the Shading Processor described later can be used for more complex shading computations as it does not need to compute any geometry intersections. Higher order surfaces, such as spheres or splines, can still be ray traced by placing dummy geometry around them and refining the intersection point later during shading, with possible re-casting of rays if the geometry is missed. Per-triangle pre-computations are commonly used to reduce the cost and complexity of the ray/triangle intersection, e.g. for the Fast Ray Triangle Intersection [Wal04] or Arenberg's algorithm [Are88]. However, these pre-computations require difficult and expensive matrix inversions (that need to be computed in each frame for dynamic geometry) and blow up the size of the geometry because vertices can no longer be shared. In a regular mesh vertices are typically shared by about six triangles, thus only 1/6th of the memory is required to store the triangles compared to approaches that perform pre-computations [SWS02, SWW+04, WSS05]. Furthermore, the reduced triangle working set makes the geometry cache more efficient, because more triangles (vertices) can be stored. Because of these advantages the additional logic for the more expensive vertex based ray/triangle intersection pays off for the Geometry Processor and keeps the scene representation compact. However, for ray/triangle intersection this means solving a linear system of equations with three unknowns, which is computationally expensive.

Figure 3.19: Geometry Processor: In intersection mode, the Geometry Processor fetches the three vertices of a triangle from memory. The Geometry Preparation pipeline is configured to compute two partial results (see text) in consecutive cycles for each ray. These partial results are combined and the Geometry Projection pipeline computes the hit distance and (u,v)-coordinates. These are tested for a valid intersection, and the hit is stored in the Collect Hits Unit if it is closer than a previously found one.

Fortunately, by rearranging the terms, the computation can be transformed to a compact form requiring only 27 floating point multiplications, 23 floating point additions, and one reciprocal computation (see Möller-Trumbore [MT97]), which is feasible for a hardware implementation. In comparison, the Fast Ray Triangle Intersection requires only 8 additions, 7 multiplications, and one reciprocal computation, which is more than a factor of 3 less expensive because all values/terms that only depend on the triangle vertices are precomputed. The Geometry Processor as shown in Figure 3.19 is called from the Traversal Processor with three pointers addr(V0), addr(V1), and addr(V2) (plus some additional data like thread-id and shading address) that point to the triangle vertices or matrix rows, respectively. The Vertex Fetch Unit buffers this input data in a FIFO before accessing the memory. Three consecutive memory requests are sent to the cache, which may be answered out-of-order because of possible cache misses. A reorder buffer collects per ray/packet the three vertices (or matrix rows) and sends them synchronously to the next stage. As the memory requests are performed sequentially, every 3 cycles a triangle (or matrix) is read from memory. The remaining part of the pipeline requires 8 cycles to intersect (or transform) the 4 rays in a packet, thus the pipeline is compute limited. For intersecting rays with triangles and for transforming them, the remaining units operate in two different modes: the intersection mode and the transformation mode. For improved performance the pipeline can be in a different mode for each packet.

Intersection mode: In intersection mode, the Möller-Trumbore ray/triangle intersection [MT97] is performed, which is computed as follows. Let O be the ray origin, D the ray direction, and V0, V1, V2 the triangle vertices; then the intersection can be computed as:

(t, u, v)^T = 1/(P · E1) · (Q · E2, P · T, Q · D)^T

with E1 = V1 − V0, E2 = V2 − V0, T = O − V0, P = D × E2, and Q = T × E1. For optimal balancing of the DRPU architecture, it is sufficient if this computation takes 2 cycles, which can be used to reduce hardware cost by sharing hardware units between the computation of different terms. The intersection can be rewritten into the following 3 computations:

1. a = (D × E2) · E1, c = (D × E2) · T

2. b = (T × E1) · E2, d = (T × E1) · D

3. (t, u, v)^T = 1/a · (b, c, d)^T

The computations (1) and (2) are performed in the Geometry Preparation pipeline (see Figure 3.20 and Figure 3.21) in two consecutive cycles. The values a, b, c, and d are then adjusted to the same latency and sent to the Geometry Projection pipeline that performs computation (3) (see Figure 3.19). A division by zero cannot occur, because the internal floating point representation cannot represent a zero value (see Section 2.8.1). Now the hit distance t and the barycentric hit coordinates can be used in the Hit Test Unit to check if a valid hit occurred. If so, the new hit is stored in the Collect Hits Unit, provided it is closer than a previously found hit. In that case the new hit distance is sent to the Traversal Processor for early ray termination and all hit data (including the shading address of the triangle, barycentric coordinates, etc.) is sent to the Shading Processor for later use.
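For reference, the rearranged intersection test can be written out in a few lines of scalar code. The sketch below is purely illustrative (vector type and interface are assumptions, not the hardware data path); it follows computations (1) to (3) above and the subsequent hit test. In software a guard against a = 0 would normally be added; on the DRPU this is unnecessary since its floating point format cannot represent zero.

  struct Vec3 { float x, y, z; };

  static Vec3  sub  (Vec3 a, Vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
  static Vec3  cross(Vec3 a, Vec3 b) { return { a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x }; }
  static float dot  (Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

  // Rearranged Moeller-Trumbore test: returns true and fills (t,u,v) for a valid hit.
  bool intersect(Vec3 O, Vec3 D, Vec3 V0, Vec3 V1, Vec3 V2, float& t, float& u, float& v) {
      Vec3 E1 = sub(V1, V0), E2 = sub(V2, V0), T = sub(O, V0);
      Vec3  P = cross(D, E2);                 // computation (1): a = P·E1, c = P·T
      float a = dot(P, E1), c = dot(P, T);
      Vec3  Q = cross(T, E1);                 // computation (2): b = Q·E2, d = Q·D
      float b = dot(Q, E2), d = dot(Q, D);
      float rcp = 1.0f / a;                   // computation (3): one reciprocal, three scalings
      t = b * rcp; u = c * rcp; v = d * rcp;
      return u >= 0.0f && v >= 0.0f && u + v <= 1.0f && t > 0.0f;   // Hit Test (u+v costs one extra add)
  }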

Transformation mode: When using the Geometry Processor in transformation mode, the arithmetic units of the intersection computation are reused to transform the ray in two cycles. Some arithmetic units match perfectly, like the two dot product computations that need to be extended to 4-component dot products for ray origin transformations (and later Skinning transformations). The third dot product is realized by sharing some arithmetic units with the cross product and adding one additional adder. These three dot product units are used to first transform the ray origin and then the ray direction without requiring a separate transformation unit (an illustrative sketch of this transformation follows Table 3.5). The total number of required floating point units is listed in Table 3.5 for the Geometry Preparation and Geometry Projection Unit. Despite being subdivided into two steps, the geometry preparation step is still the most expensive one. The test for a valid hit point requires only one additional floating point adder to compute u + v. The Hit Test Unit and the Collect Hits Unit further perform only simple floating point comparisons.

                        #fmul   #fadd   #finv   Total
Geometry Preparation      14      19      0       33
Geometry Projection        3       1      1        5
Hit Test                   0       1      0        1
Collect Hits               0       0      0        0
Total                     17      21      1       39

Table 3.5: Floating point operations of Geometry Preparation Unit and Geometry Projection Unit.
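As an illustration of the transformation mode, the affine ray transformation evaluated by the shared dot product units can be sketched as follows (purely illustrative; the matrix layout and names are assumptions). The origin uses a 4-component dot product per row, where the homogeneous 1 selects the translation, while the direction only needs the 3x3 part.

  struct Vec3f { float x, y, z; };

  // Affine transformation as three rows of (m0, m1, m2, translation).
  struct Affine3x4 { float m[3][4]; };

  // Ray transformation as performed in transformation mode: one dot product per output component.
  void transformRay(const Affine3x4& M, const Vec3f& org, const Vec3f& dir,
                    Vec3f& orgOut, Vec3f& dirOut) {
      // origin: 4-component dot products (w = 1 selects the translation column)
      orgOut.x = M.m[0][0]*org.x + M.m[0][1]*org.y + M.m[0][2]*org.z + M.m[0][3];
      orgOut.y = M.m[1][0]*org.x + M.m[1][1]*org.y + M.m[1][2]*org.z + M.m[1][3];
      orgOut.z = M.m[2][0]*org.x + M.m[2][1]*org.y + M.m[2][2]*org.z + M.m[2][3];
      // direction: 3-component dot products (w = 0, directions are not translated)
      dirOut.x = M.m[0][0]*dir.x + M.m[0][1]*dir.y + M.m[0][2]*dir.z;
      dirOut.y = M.m[1][0]*dir.x + M.m[1][1]*dir.y + M.m[1][2]*dir.z;
      dirOut.z = M.m[2][0]*dir.x + M.m[2][1]*dir.y + M.m[2][2]*dir.z;
  }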

3.3.4 Results

A high performance prototype of the Ray Casting Unit has been implemented on our prototype FPGA platform. This section gives an overview of the most important results of this prototype, to compare the performance of the architecture just described with different approaches and to show its applicability and limitations for dynamic scenes. More scenes and detailed results, such as cache hit-rates, DRAM efficiency, influence of scene complexity, etc., are given in Chapter 7 where the complete DRPU system is evaluated.

3.3.4.1 Performance Comparison

To show the efficiency of the Ray Casting Unit for B-KD Trees, we compare the rendering performance against three different ray tracing engines: the fixed-function SaarCOR ray tracing prototype [SWW+04], the programmable RPU architecture [WSS05], and the OpenRT ray tracing system [WBS03].

Figure 3.20: The first configuration of the Geometry Preparation pipeline computes the values a = (D × E2) · E1 and c = (D × E2) · T. The 4-component dot product is used as a 3-component dot product in this computation. The 4-component dot product is required to affinely transform the ray origin, or for the homogeneous transformation of the Skinning Processor (see Chapter 5). The cross product is extended with one additional adder to make a further 4-component dot product for the affine and homogeneous transformations.

Figure 3.21: The second configuration of the Geometry Preparation pipeline computes the values b = (T × E1) · E2 and d = (T × E1) · D.

We use three static scenes of different complexity for the comparison: Scene6, Office, and the Gael level (see Table 3.6). To make the comparison as fair as possible, only one FPGA clocked at 66 MHz, flat triangle shading, and the same KD Tree are used. For the DRPU architecture the KD Tree has been converted to a B-KD Tree as described in Section 3.2.2 and an additional unit performs ray generation and flat triangle shading. Table 3.6 shows that the rendering performance of the DRPU architecture is slightly lower than that of the fixed-function SaarCOR prototype, which is mainly due to the lower triangle intersection performance of 8 cycles per packet/triangle intersection compared to 5 cycles for SaarCOR. The RPU architecture performs worse than the DRPU for the same reason: intersecting a triangle on the RPU requires 13 instructions (or cycles) per packet, which is much more than the 8 cycles for the Geometry Processor of the DRPU design. Furthermore, the RPU driver needs to perform pre-computations to speed up the ray/triangle intersection. Using a Möller-Trumbore intersector would reduce the RPU performance results even further.

Scene     triangles   DRPU   SaarCOR   RPU    OpenRT
Scene6       0.5k     40.1     44.6    23.5    12.9
Office        34k     26.1     35.9    17.4    10.4
Gael          52k     16.3     18.6     9.9     8.0

Table 3.6: Frame-rate comparison of different ray tracing architectures using flat triangle shading only.

Compared to OpenRT, the DRPU ray tracer is 2 to 3 times faster than the Pentium-4 used, which runs at 40 times the clock frequency (2.66 GHz). This shows the efficiency of the Ray Casting Unit compared to general purpose designs. Frustum based traversal can perform even an order of magnitude faster on the CPU [RSH05], but even then the FPGA would still operate at about 1/3 of the CPU speed. However, shading cannot be amortized over the packet but needs to be computed per ray, thus the performance of frustum-based ray tracers can break down dramatically if complex shading is enabled.

3.3.4.2 Dynamic Scenes

B-KD Trees have been used as the data structure to support coherent dynamic scenes. Most typical animated objects, such as character models, water surfaces, and even typical morphing sequences obey this coherence restriction (see results in Chapter 7). As an example, Figure 3.22 shows a typical character model rendered in three different poses using B-KD Trees. The B-KD Tree is constructed for the left pose, where it is optimal. However, even when deforming the mesh to a different pose, the quality of the tree stays high. In particular, the empty space beside the model is still traversed without much overhead. The intersection cost of parts of the model stays mostly constant during the course of the changes, which can for instance be seen from the similar color of the hands during the animation. This can also be seen in Figure 3.23 where the frame-rates stay almost constant during the animation. While most dynamic scenes obey the coherence restriction, it is very easy to build a scene where coherence is too low to be rendered efficiently with B-KD Trees (or any other Bounding Volume approach). Figure 3.24 shows a morphing sequence between a cube (initial state), a random triangle distribution, and a cylinder with a topology similar to the cube. In the initial state the number of traversal steps and the number of ray/triangle intersections are low, but when morphing towards the random distribution they increase strongly (see color of the center images of Figure 3.24). The reason is that the structure of the B-KD Tree no longer matches the structure of the geometry. This results in many node overlaps, such that large parts of the B-KD Tree need to be traversed for a ray. In the worst case, the bounds can overlap completely and all B-KD Tree nodes (and nearly all triangles) have to be traversed (or intersected) for a single ray. Figure 3.25 illustrates this linear complexity of ray casting for randomly moving geometry. The figure shows the rendering time of the morphing sequence, in the initial cube stage and the random stage, with a varying number of triangles. In the cube stage the rendering time is nearly independent of the number of triangles, while it grows clearly linearly for the random stage of the animation.

Figure 3.22: The quality of the spatial index structure stays close to optimal during the entire animation, which is illustrated by these images showing a skeleton model (16k triangles) in three different poses (from left to right). The rendered image as well as the number of traversal steps and ray/triangle intersections are visualized from top to bottom. The scene renders at a resolution of 512x384 pixels on one FPGA at about 25 fps.

Figure 3.23: Frame-rates of the complete skeleton animation. Despite the geometry being animated, the frame-rate varies little.

Figure 3.24: The quality of the spatial index structure breaks down if some geometry (2.2k triangles) moves randomly through the scene. The B-KD Tree is constructed for the first pose, where the number of traversal steps and ray/triangle intersections stays low, but morphing towards a random triangle distribution causes much higher traversal and intersection cost because the structure of the B-KD Tree no longer matches the structure of the geometry. The cost decreases again when morphing towards the cylinder, as it has the same topology as the cube. The center image renders at a resolution of 512x384 pixels on one FPGA with 0.34 fps.

Figure 3.25: This figure shows the linear complexity of ray casting in randomly moving scenes using the B-KD Tree approach. The same morphing sequence as in Figure 3.24 is rendered in the initial cube stage and the random stage. For the cube stage the rendering time is nearly independent of the number of triangles, while there is a linear relation for the random distribution.

3.4 Conclusions and Future Work

This chapter presented the novel B-KD Trees together with a hardware architecture that significantly closes the largest remaining gap between ray tracing and rasterization hardware: the handling of dynamic scenes. B-KD Trees combine the advantages of KD Trees and AABVHs into a single, simple to handle data structure. B-KD Trees fit well with a hardware implementation as they are more homogeneous than KD Trees, making the special treatment of lists or mail-boxing unnecessary. The B-KD Tree hardware design performs up to 80% faster than the RPU architecture, mainly because of higher intersection performance. Similar to AABVHs, B-KD Trees can be used to handle coherent dynamic scenes by maintaining the structure of the tree while updating the bounding intervals of each node when geometry changes. The chosen update approach uses a special processor that requires an instruction stream for each object it has to update. This instruction stream increases the scene size by only 25% but allows for an efficient updating of the tree. However, later insertion of objects is not easily possible, as this would require a recompilation of the instruction stream. The most important types of dynamic motion, such as characters, morphing sequences, or water surfaces, can efficiently be handled with the B-KD Tree approach. Fast random movements, however, are problematic as they result in large overlaps among the B-KD Tree nodes, which causes a traversal of large parts of the B-KD Tree. The hardware supports a top-level B-KD Tree over object instances and bottom-level B-KD Trees for each object. This enables the support for randomly moving object instances as long as the driver application can rebuild the top-level B-KD Tree fast enough. Fast reconstruction algorithms for KD Trees [Ben06, WBS03] could also be applied to B-KD Trees, making the support of several thousands of object instances possible. In the paper [WSS05] we proposed the RPU hardware design that also uses its shading processor for ray/geometry intersection. This has the advantage that arbitrary geometry (even higher order primitives) can be supported by implementing a small intersection function. This thesis described a different approach where a Geometry Processor optimized for ray/triangle intersections is used to speed up the intersection computation with triangles. This approach could easily be extended with programmable geometry if the Traversal Processor could call intersection shaders on the Shading Processor and get hit distances back for early ray termination. Such a mixed approach would allow for fast processing of the common case of triangles, but would also support higher order surfaces. However, despite B-KD Trees performing very well and being easy to implement in hardware, there are many alternatives with different algorithmic advantages. AABVHs [WBS06a] can be used to trace large packets of rays efficiently without the need for a near/far stack, because a closed box is known and can directly be intersected for each child node. However, the larger size of AABVHs was a major reason for choosing B-KD Trees as the spatial index structure, as was their lower computational cost. Choosing a closed box representation with lower accuracy could reduce the size and still eliminate parts of the traversal stack, which proves to be quite expensive in the ASIC implementation (see Chapter 8).
From the algorithmic side, the techniques presented in this chapter put ray tracing close to rasterization in the support for dynamic scenes. However, an important part is still missing: shading. Thus the next chapter describes a fully programmable Shading Processor that, together with the Ray Casting Unit described here, builds a powerful rendering engine with capabilities similar to those of GPUs.

Chapter 4

Programmable Shading Processor (SP)

4.1 Introduction

The previous chapter described hardware implementations of the Ray Casting Unit to perform visibility queries in a scene and of an Update Processor to handle dynamic scene changes. Besides these visibility queries, the second important operation in a ray tracer is shading, which is performed at each intersection point of a ray with the scene and is the process of computing the amount of light (or radiance, respectively) that travels back along this ray. Shading makes images realistic and typically tries to simulate physical effects as accurately as possible. Thus researchers have developed many different surface shading models, such as the empirical Phong and Blinn illumination models [Pho75], the Ward reflection model [War92], the physically inspired Cook-Torrance reflection model [CT82], and many more. Fixed function hardware cannot support this variety of shading models, thus a programmable Shading Processor is required. In this thesis, all shaders for this Shading Processor are handwritten in assembly code. However, assembly code is not the best interface to a processor as it does not provide high flexibility. Today's animation studios, like Pixar or Blue Sky, use RenderMan [AM90] to specify advanced shading effects. RenderMan is designed for offline rendering but uses a high level of abstraction which makes it platform independent. Thus most features of the language could easily be compiled for the Shading Processor of the DRPU. For interactive applications GLSL (OpenGL Shading Language) and Cg (C for graphics) have been developed. As these languages are specially designed to work with rasterization, they are not well suited to program a ray tracer. A project with the goal of implementing a RenderMan compiler for the DRPU has been started [Weg06], but some limitations still do not allow an evaluation on the DRPU.

In rasterization hardware, GPU pixel shaders perform the shading computations, but they are strongly limited to local shading operations that depend only on some local data per triangle. This forces GPU applications to perform multiple rendering passes to approximate reflection and lighting effects. These multi-pass techniques are on the one hand inaccurate (due to sampling artifacts) and inefficient (with respect to memory bandwidth), and on the other hand they complicate the shading process by splitting it between the application and the GPU processor. To cope with these problems, this chapter describes the hardware architecture of the Shading Processor (SP), which combines the flexibility of general purpose CPUs with the efficiency of current GPUs for data parallel computations. In terms of the instruction set the Shading Processor is similar to current GPUs, but it differs in that it supports recursive function calls via a hardware managed register stack and arbitrary memory reads (and writes), and provides global access to the scene via a trace instruction for ray casting. These capabilities allow the Shading Processor to compute shading effects such as reflections, refractions, and shadows directly in the shading process. Multiple effects from different material shaders can automatically be combined by a recursive ray shooting approach, e.g. a refraction seen through a reflection can easily be computed. A massive multi-threading approach and the synchronous execution on packets of threads make the architecture efficient with respect to hardware resources. This chapter focuses on the Shading Processor and its connection to the Ray Casting Unit (RCU) described in the previous chapter. A comparison of multi-GHz CPUs with an FPGA prototype of the Shading Processor shows its efficiency and high performance for shading computations. The evaluation of the complete DRPU system including the Shading Processor and Ray Casting Unit is done in Chapter 7. The next sections first describe the Shading Processor at the instruction set level and then at the microarchitectural level, covering implementation aspects such as multi-threading and packeting.

4.2 Instruction Set Architecture

Typically the first step of designing a processor is the definition of the instruction set and the types of registers to operate on. For general purpose processors, this instruction set must execute all different kinds of user applications efficiently, thus today's CPUs [Cor06a, Inc06a] provide a very rich instruction set. In contrast, a shading processor executes a special class of shading algorithms and needs to be optimized for this task, which poses different demands. Shading algorithms have a high density of floating point operations and mostly operate on colors, points, vectors, or normals. All these data types are three or four component floating point vectors, thus vector operations are frequently used and pay off for a shading processor. Similar to current GPUs, the Shading Processor uses 4-component single precision floating point vectors as its basic data type. The use of 4-vectors takes advantage of the available instruction level parallelism, results in fewer memory requests of larger size, and significantly reduces the size of shader programs compared to a scalar program. For many operations 3-component floating point vectors would be sufficient, but for homogeneous or affine transformations 4 components are advantageous. Furthermore, most shading models need to compute the amount of incoming light illuminating the point to shade from a specified direction. At this point the Shading Processor uses recursive ray tracing to compute this incoming light, which requires on the one hand visibility queries and on the other hand recursion. The visibility queries are made possible by a special trace instruction that accesses the Ray Casting Unit described in the previous chapter to shoot a ray through the scene. Recursion is supported by a hardware managed register stack to optimally handle the state of the recursion. This offers maximum flexibility to shader writers, who can recursively shoot rays from any location through the scene. Other approaches that only allow for tail recursion [Kaj86] are too limiting for practical use. There, a shader computes the direct influence of the surface on the pixel and updates it. To compute secondary effects, it further generates all secondary rays with some weight, puts them in a queue, and terminates. If there are free resources the ray tracing system takes the next ray from this queue, traces it, and executes a new shader at the intersection point. This shader again computes a color for the ray that gets combined with its weight and accumulated into the frame-buffer. The shader can again generate new secondary rays with weights that influence the pixel. While these tail recursive approaches are in principle sufficient to solve the rendering equation, they are very limited. Especially as the secondary rays are treated in parallel, the shading computation cannot depend on a previously shot ray. However, for adaptive oversampling techniques it can be advantageous to shoot sample rays as long as the variance of the radiance is above some threshold. With a recursive approach one can perform such advanced computations, as the secondary rays are shot sequentially and not in parallel. Thus the computation of the next secondary ray can of course depend on the previous ones. One further advantage of recursion, in contrast to a parallel tail execution of secondary rays, lies in the more compact computation state of the sequential execution.
The tail recursive approaches need to generate all secondary rays at once and need to store them somewhere. Recursive approaches only generate the next secondary ray and continue only after it has been computed completely.

Name         Ports    Description
R0 to R15    3r, 1w   General registers
S0 to S7     1r, 1w   Stack frame registers
C0 to C31    1r       Global constant registers
P            1r       Parameter register
S            1r, 1w   Special operation register
I0 to I3     1r       Memory input registers
H0           1r       Barycentric hit coordinates (u,v), hit distance t, and material shader
H1           1r       Hit triangle address register
H2           1r       Hit object address register
OUT          1w       Output register

Table 4.1: General registers of the Shading Processor with the available read (r) and write (w) ports. All registers are 4-component (floating point) vectors except for the hit registers H1 and H2, which contain a single address only.

4.2.1 Shading Processor Registers

Different types of registers are supported by the Shading Processor, both for internal use and for communicating with the Ray Casting Unit (see Table 4.1 and Figure 4.1). The Shading Processor contains 16 general 4-component registers R0 to R15 that are used as temporary registers during computation. This general register file has three read ports, thus it can be accessed at up to three different source locations in an instruction (e.g. the mad instruction: R0 := R1 · R2 + R3). Depending on the type of the executed instruction the register content is interpreted as a floating point or integer number. The 8 stack frame registers S0 to S7 give access to the 8 top registers on a hardware managed register stack. By writing to these stack registers the application can efficiently save some state to the stack. A later call instruction transfers the lower n registers of the stack frame onto the hardware managed stack. In the implementation this requires no data copies as an offset addressed stack memory is used (see Section 4.3.1.5). This technique allows recursive function calls to be performed in a single cycle, as no additional instructions are required to push data onto the stack. A value which is required after a recursion can be written to a free stack register in the same instruction it is computed in. To reduce the chip size of the stack register file, it has only a single read port, which is sufficient for its application.

Figure 4.1: Overview of the 4-component registers available to the Shading Processor. The general registers R0 to R15 are used for temporary results, the stack frame registers S0 to S7 provide access to the top part of a stack, the input registers I0 to I3 hold data read from memory, the constant registers C0 to C31 can be set by the application before rendering, the parameter register P is set by the Thread Generator to the pixel location, and hit data registers hold hit information after ray casting. The special register S is used as the target register for reciprocal and reciprocal square root computations.

The constant registers C0 to C31 are written by the application and contain global shading constants, such as the camera transformation or the address of the B-KD Tree of the scene. While these constant registers are constant over all threads, the parameter register P is a constant register that is only constant for a single thread (but may differ between threads). It contains the pixel location or other parameters for the thread and is initialized by the Thread Generator. Data read from main memory is always written to one or more of the four input registers I0 to I3. Writing such fetched data to the main register file would cause problems because it only has a single write port, which is already used by the arithmetic pipeline for write-back. Thus if the main register file were used again, the processor could schedule two instructions per clock (arithmetic and read) but could write only one result back to the register file. For a similar reason there is a separate register S that stores the result of a special operation (reciprocal or reciprocal square root), which can additionally be computed in parallel to an arbitrary instruction. Furthermore, after executing a trace instruction the special hit data registers H0, H1, and H2 contain the barycentric hit coordinates (u,v), hit distance t, material shader to execute, as well as the triangle and object instance address of the hit geometry (see Table 4.1). These special hit registers can be written only by the Ray Casting Unit, but can be accessed by the Shading Processor as standard source registers without any overhead. The output register OUT is used to hold the color value of the computed pixel. After termination of the thread this color is written into the frame-buffer.

4.2.2 Shading Processor Instruction Set

The instruction set of the Shading Processor is strongly based on that of current GPUs, which are highly optimized for shading operations and achieve a high floating point instruction density and throughput. Tables 4.2 to 4.4 show a summary of all instructions supported by the Shading Processor, including a short definition. Supported instructions are simple assignment (mov), per-component addition (add), multiplication (mul), multiply and add (mad), and computation of the fractional part (frac) (see Table 4.2). The MAD instruction is especially important for performing a weighted addition of values, which is often needed in shading computations. MAD capabilities further raise the peak floating point operations per clock cycle to 4 multiplications plus 4 additions, at the cost of three instead of two read ports at the general register file. Different types of dot products are required for performing cosine computations between two normalized vectors and for implementing the multiplication of a vector with a possibly homogeneous matrix. One of the source registers can be replaced by an immediate constant that is replicated to all 4 components. Thus the instruction mul R0,3.14,R1 multiplies each component of R1 with the scalar value 3.14 and stores the result to R0. Arbitrary memory reads and writes relative to a register plus a static offset are supported for flexible memory access (see Table 4.3). The write instruction should only be used to write computation results to the main memory; temporarily storing computational state in the main memory can easily become a performance bottleneck in real-time shading and should be avoided. Thus, all example shaders in this thesis hold their computation state exclusively in registers (or on the hardware managed stack). Even the external storage of a computational state of only two 128 bit vectors per pixel would result in an additional 12 MB of data transfer for a 512x384 pixel image (the stored state also needs to be read back again). The total amount of data transferred per frame is typically up to 17 MB for the example scenes used (compare with Table 7.6).

Syntax                  Semantics
mov Rd,Rs               Rd := Rs
frac Rd,Rs              Rd.x := frac(Rs.x), Rd.y := frac(Rs.y), Rd.z := frac(Rs.z), Rd.w := frac(Rs.w)
add Rd,Rs0,Rs1          Rd := Rs0 + Rs1
mul Rd,Rs0,Rs1          Rd := Rs0 · Rs1
mad Rd,Rs0,Rs1,Rs2      Rd := Rs0 · Rs1 + Rs2
dp3 Rd,Rs0,Rs1          v = Rs0.x · Rs1.x + Rs0.y · Rs1.y + Rs0.z · Rs1.z; Rd.x := Rd.y := Rd.z := Rd.w := v
dp2h Rd,Rs0,Rs1         v = Rs0.x · Rs1.x + Rs0.y · Rs1.y + Rs0.z; Rd.x := Rd.y := Rd.z := Rd.w := v
dp4 Rd,Rs0,Rs1          v = Rs0.x · Rs1.x + Rs0.y · Rs1.y + Rs0.z · Rs1.z + Rs0.w · Rs1.w; Rd.x := Rd.y := Rd.z := Rd.w := v
dp3h Rd,Rs0,Rs1         v = Rs0.x · Rs1.x + Rs0.y · Rs1.y + Rs0.z · Rs1.z + Rs0.w; Rd.x := Rd.y := Rd.z := Rd.w := v

Table 4.2: Arithmetic Instructions: Rs denotes an arbitrary register used as source register and Rd an arbitrary destination register (R0 to R15). Per component multiplication and addition is denoted by · and +.

Thus even the external storage of the computational state of only two floating point vectors would double the amount of data to be transferred per frame. Furthermore, many more of the expensive switches from read to write mode of the external DRAM would be introduced. A solution to this problem is to store the computation state on-chip, either on the hardware managed register stack or in an additional big cache. However, the large number of 128 threads in the system makes it very expensive to increase the available on-chip storage per thread. To improve the speed of texture accesses, a special texture addressing mode allows memory locations to be addressed as a 2D array indexed by two floating point coordinates. It can also load 4 adjacent texels for later bilinear interpolation in software. Besides these addressing modes, the Shading Processor provides no special hardware functionality for texture filtering, which leaves more space for functional Shading Processors on the chip. However, textures are often accessed incoherently and may pollute the cache of the Shading Processor, which would make a dedicated texturing unit for the frequent case of 32 bit RGBA textures useful. Modern GPUs implement such texturing units that are even capable of performing anisotropic filtering to improve the image quality. Such advanced filtering algorithms in particular could not efficiently be implemented in software on the Shading Processor. Bilinear interpolation, however, can be performed with only a few instructions and is enabled for all textures in this thesis.
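A software bilinear lookup of the kind enabled for all textures in this thesis only needs the four adjacent texels and the fractional texture coordinates. The following C++ sketch illustrates this filtering step (types and interface are assumptions); on the DRPU the four texels would come from a single texload4x, the fractional parts from a frac instruction, and the weighting from a few mul/mad instructions.

  struct RGBA { float r, g, b, a; };

  static RGBA lerp(const RGBA& a, const RGBA& b, float t) {
      return { a.r + (b.r - a.r) * t, a.g + (b.g - a.g) * t,
               a.b + (b.b - a.b) * t, a.a + (b.a - a.a) * t };
  }

  // Bilinear filtering from the four adjacent texels t00, t10, t01, t11,
  // where (fx, fy) are the fractional parts of the texture coordinates.
  RGBA bilinear(RGBA t00, RGBA t10, RGBA t01, RGBA t11, float fx, float fy) {
      RGBA bottom = lerp(t00, t10, fx);   // blend along x in the lower row
      RGBA top    = lerp(t01, t11, fx);   // blend along x in the upper row
      return lerp(bottom, top, fy);       // blend the two rows along y
  }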

Syntax                      Semantics
load Ii,Rs.j,n              Ii := memory[Rs.j + n], n ∈ {0, ..., 4095}
load4x Rs.j,n               I0 := memory[Rs.j + n + 0], I1 := memory[Rs.j + n + 1], I2 := memory[Rs.j + n + 2], I3 := memory[Rs.j + n + 3]
store Rs.j,n,Rs0            memory[Rs.j + n] := Rs0
texload Ii,Rs0,Rs1          Ii := texture(Rs0)[Rs1.x][Rs1.y]
texload4x Rs0,Rs1           I0 := texture(Rs0)[Rs1.x + 0][Rs1.y + 0], I1 := texture(Rs0)[Rs1.x + 1][Rs1.y + 0], I2 := texture(Rs0)[Rs1.x + 0][Rs1.y + 1], I3 := texture(Rs0)[Rs1.x + 1][Rs1.y + 1]
[if] jmp label              Jumps depending on the condition.
[if] call label [push n]    Calls a function relative to the current instruction pointer if the condition is fulfilled.
[if] call Rs [push n]       Calls a function (or shader) whose absolute address is stored in register Rs.
[if] return                 Returns from a function call if the condition is fulfilled.
trace Rs0,Rs1,Rs2           Traces the ray with origin Rs0 and direction Rs1 through the spatial index specified by register Rs2.

Table 4.3: Memory, Control, and Trace Instructions: Rs.j is the j-th component of the source register Rs, and Ii is one of the four input registers. The conditions are evaluated using the result of a paired instruction. The number of registers to be saved onto the stack is in the range n ∈ {0, ..., 8}. The modifiers fp16, fp32, fix8, and int32 can be used to set the load or store instruction to 16 or 32 bit floating point mode, 8 bit fixed point mode, or 32 bit integer mode to interpret data in memory differently.

Syntax                   Semantics
f2i Rd,Rs                Rd.w := (int)Rs.w
i2f Rd,Rs                Rd.w := (float)Rs.w
iadd Rd,Rs0,Rs1          Rd.w := Rs0.w + Rs1.w
imul Rd,Rs0,Rs1          Rd.w := Rs0.w · Rs1.w
not Rd,Rs                Rd.w := not Rs.w (bitwise)
and Rd,Rs0,Rs1           Rd.w := Rs0.w and Rs1.w (bitwise)
or Rd,Rs0,Rs1            Rd.w := Rs0.w or Rs1.w (bitwise)
lrs Rd,Rs0,shift         Rd.w := Rs0.w >> shift
lls Rd,Rs0,shift         Rd.w := Rs0.w << shift

Table 4.4: Logic/Integer Instructions: Converting from a floating point number to an integer (f2i instruction) always rounds to the next smaller integer value.

All memory accesses are executed asynchronously, which allows memory latency to be hidden by overlapping it with other instructions. Up to four independent memory requests can be outstanding per thread because four input registers are available. An auto-increment addressing mode allows a burst access of four vectors into consecutive input registers to be performed using a single load4x instruction. This can significantly reduce the instruction count when more than one vector needs to be read (e.g. a matrix). Memory loads support several memory formats: full and half precision floating point vectors as well as 8 bit fixed point data that is scaled to the interval [0, 1]. These memory formats allow the programmer to choose the right precision for the shading data in order to reduce the scene size. In shading, integer and logical computations are not frequently used, but they are sometimes required for implementing counters or performing address computations. None of the sample shaders of this thesis required integer arithmetic because many computations can also be performed with floating point operations. To avoid spending too much logic on rarely used integer and logic operations, they are only possible on the fourth component of the 4-component vectors (see Table 4.4). Similar to GPUs, powerful source modifiers are a key feature of the Shading Processor instruction set. The swizzling modifier, which allows the four components of a source vector to be reordered arbitrarily, is especially important as otherwise the 4 vector components would be strongly separated. Swizzling can be used to implement a vector multiplication with a scalar value, or to compute a cross product (between vectors R1 and R2) with only two instructions:

mul R0,R1.yzx,R2.zxy
mad R0,-R1.zxy,R2.yzx,R0

Further source modifiers are negation and multiplication by a power of 2. These modifiers allow for a multiplication with the frequently used constants ±0.5, ±1, ±2, and ±4 for free, which can speed up the code for some computations (e.g. to compute the reflected direction D − 2 · (D · N) · N). Using the instruction modifier sat, the result of an operation can be clamped to the interval [0, 1] to limit its range. Thus the instruction mov sat R0,2.0 writes the value 1.0 to each component of R0. As floating point comparisons to 0 or 1 are very simple, this saturation operation can easily be implemented in hardware. A write mask specifies which components are written to the destination, thus not every instruction needs to operate on all 4 vector components. For instance mov R0.xz,R1 only moves the x and z components of register R1 to register R0, while the other components of R0 are not modified.

4.2.2.1 Paired Instructions

In order to improve the usage of their hardware units, current CPUs provide super-scalar dynamic instruction scheduling that submits instructions from a large instruction window to several arithmetic units in parallel. As such dynamic scheduling is complex and requires significant hardware resources, the Shading Processor provides only static scheduling with two slots per instruction. An instruction is divided into a primary and a secondary part, making the Shading Processor a kind of large instruction word (LIW) machine [Fis83]. A few instructions, such as trace or store, require both the primary and the secondary slot, while others are restricted to the secondary slot only, like the load, call, return, and jmp instructions. The main register file has only 3 read ports, which causes some limitations in pairing instructions. A primary instruction that already uses all three ports (such as mad) cannot be paired with a secondary instruction that also accesses the register file (such as a load instruction). Furthermore, an instruction that uses an immediate constant cannot be paired with any other instruction, because the immediate constant is encoded in the secondary part of the instruction. As vector SIMD units are utilized inefficiently for operations on scalars or on too short vectors, the Shading Processor supports splitting each vector into 2/2 or 3/1 components. The two instruction slots can then be used to execute arbitrary arithmetic operations on each sub-vector. An example of a valid paired instruction that splits the vector in a 3/1 ratio is:

dp3 R0.xz,R1,R2 + mov R4.w,R5.w.

This instruction writes the 3-component dot product of registers R1 and R2 to the x and z components of R0 and moves the w component of R5 to the w component of R4.

Figure 4.2: Computing a Branch Condition from a 4D Vector.

Optimizing compilers can take advantage of this capability and exploit instruction level parallelism in shaders for improved static scheduling, even for scalar code.

4.2.2.2 Control Instructions

The Shading Processor allows for flexible control flow to make general computations possible. The control instructions (call, return, jmp) must always be paired with an arithmetic instruction and are executed under a control condition that is evaluated on the result of this paired instruction. In contrast to other SIMD designs, all 4 components of the arithmetic result can contribute to this condition, which is evaluated as follows. First, each component is tested for being smaller than 0 (or ≥ 0) and smaller than 1 (or ≥ 1). Each component then forwards either one of these two results or combines them with and or or. A mask selects a subset of all component results and a final and or or reduction is performed on these selected components to derive the final condition (see Figure 4.2). For integer comparisons the < 0 test is evaluated by interpreting the register as two's complement, and the < 1 test is replaced by an equal-to-zero test for integers. These flexible branch conditions can improve the performance of the architecture by combining several conditionals into a single one. For instance, combined with a mad instruction the branch condition allows a 4D point to be compared against an arbitrary 4D hypercube with little overhead, which traditionally requires 8 comparisons. One algorithm we tried that could take advantage of a combined branch condition was a ray/triangle intersection, where one can test the (u,v) coordinates against a unit bounding square (u < 0 ∨ u > 1 ∨ v < 0 ∨ v > 1) for a weak in-triangle test that is refined later.
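The condition logic of Figure 4.2 can be mirrored in a few lines of scalar code. The sketch below is illustrative only (the encoding of the per-component selection and of the mask is an assumption, and only part of the possible per-component selections is modeled); it shows the two threshold tests per component, the per-component combine, and the final masked and/or reduction.

  // Per-component selection: forward one of the two tests or combine them.
  enum class Select { LessZero, GreaterEqualZero, LessOne, GreaterEqualOne, And, Or };

  static bool componentResult(float v, Select s) {
      bool lt0 = v < 0.0f, lt1 = v < 1.0f;        // the two threshold tests per component
      switch (s) {
          case Select::LessZero:         return lt0;
          case Select::GreaterEqualZero: return !lt0;
          case Select::LessOne:          return lt1;
          case Select::GreaterEqualOne:  return !lt1;
          case Select::And:              return lt0 && lt1;
          case Select::Or:               return lt0 || lt1;
      }
      return false;
  }

  // mask selects which components contribute; reduceWithAnd selects the final and/or reduction.
  bool branchCondition(const float result[4], const Select sel[4],
                       const bool mask[4], bool reduceWithAnd) {
      bool condition = reduceWithAnd;              // neutral element of the chosen reduction
      for (int i = 0; i < 4; ++i) {
          if (!mask[i]) continue;
          bool c = componentResult(result[i], sel[i]);
          condition = reduceWithAnd ? (condition && c) : (condition || c);
      }
      return condition;
  }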

4.2.2.3 Special Instructions

Besides standard arithmetic instructions, such as additions and multiplications, shading occasionally requires some complex operations such as sine and cosine (to generate regular patterns), power (for arbitrary specular exponents), and reciprocal and reciprocal square root (for vector normalizations). The sine, cosine, and power functions are not supported by the Shading Processor directly, but can be approximated by texture lookups, as often done on GPUs. The reciprocal and reciprocal square root are supported by the Shading Processor as special instruction modifiers rcp and rsq. When attaching such an instruction modifier to an arithmetic instruction, its result is first computed normally and written to its destination. In parallel, the special operation is applied to one specified component of the result and written back to the same component of the special operation destination register S. These uncommon special instruction semantics minimize the number of register file accesses and instructions, as special operations can mostly be attached to a previous instruction, keeping the operand in the pipeline. This can for instance be used for the frequent case of vector normalization, taking only two instructions:

dp3 rsq w R3.w,R3,R3
mul R3,S.w,R3

The first instruction computes the squared length of R3 and stores it to the w-component of R3. In parallel, the reciprocal square root computation is applied to the w-component of R3 by the rsq w modifier and written to the w-component of S. The second instruction multiplies R3 with its reciprocal length in S.w, which gives the normalized vector in R3.xyz and the length of the original vector in R3.w.

4.2.2.4 Function Calls

Single cycle recursive function (and shader) calls are supported for increased efficiency and flexibility. The recursion is completely managed by the hardware using a special register stack (see Section 4.3.1.5). Each call instruction specifies the number of stack registers S0 to S7 to push onto the stack, while after executing a return instruction these registers are restored from the stack. For instance, when executing the instruction call function push 3, the three registers S0, S1, and S2 are pushed onto the stack and fresh registers S0 to S7 are available again. Now the function is called and the general registers can be used to pass or return arguments to/from the called function. When the called function executes a return instruction, the previous values are restored to S0, S1, and S2 again, while S3 to S7 are undefined.
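As an illustration, a caller might save its live values to the stack frame registers and then issue a single recursive call. The fragment below is a hypothetical sketch in the documented syntax; the register assignment, the label, the meaning of the values, and the explanatory comments are assumptions, not code from this thesis.

mov S0,R3               ; keep the accumulated color across the recursion
mov S1,R4               ; keep the current ray weight
call shade push 2       ; push S0 and S1, give the callee a fresh stack frame
mad R3,R0,S1,S0         ; after return: weight the result passed back in R0 and accumulate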

4.2.2.5 Trace Instruction

The only ray tracing specific instruction of the Shading Processor is the trace instruction for accessing the Ray Casting Unit. After its execution the intersection result can be accessed via the hit registers that have been written by the Ray Casting Unit. For instance, the execution of the instruction trace R0,R1,R2 traces the ray with origin R0 and ray direction R1 through the scene specified in R2 = (root, near, far, shader). Besides the root node of the top-level B-KD Tree of the scene, R2 also contains an interval [near, far] to traverse on the ray and a ray loss shader address. After execution of the trace instruction the hit registers H0, H1, and H2 contain the hit information, such as barycentric coordinates, hit distance, shader to execute, as well as the triangle address and object instance address of the hit geometry (see Table 4.1). During ray casting, the Shading Processor can perform further shading computations of the same thread as long as it does not access the hit data, which would cause it to wait for the Ray Casting Unit to finish its operation. This parallelism can be exploited by material shaders that directly trace a shadow ray while performing all other shading computations in parallel. At the end of the shader the shadow test can be performed and the final color combined. For scenes where the shading takes roughly the same time as casting the shadow ray (e.g. Shirley6), this technique allows shadows to be cast nearly for free (see Chapter 7).
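The shadow-ray overlap described above can be pictured with a short hypothetical shader fragment in the documented syntax (the register roles, the surrounding shading code, and the comments are assumptions): the trace is issued early, independent shading work follows, and only the first access to the hit data synchronizes with the Ray Casting Unit.

trace R4,R5,R2          ; shadow ray: origin R4 (hit point), direction R5 (towards the light),
                        ; R2 = (root, near, far, ray loss shader) set up beforehand
mul R6,R7,R8            ; independent diffuse/texture shading overlaps with the ray cast
mad R6,R9,R10,R6
mov R11,H0              ; first access to the hit data; only here the thread may wait for the RCU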

4.3 Shading Processor Microarchitecture

This section describes a possible hardware implementation for the instruction set of the Shading Processor. All design decisions are first motivated and described, followed by a detailed description of the implementation. The design is optimized for FPGA technology as the target platform, where memory is inexpensive but logic quite expensive. Thus massive multi-threading is feasible on the FPGA, whereas logic consuming techniques to reduce latency (such as forwarding) have been omitted. Similar to most modern processors, the design is a Harvard architecture, separating the instruction cache and the shading cache for optimal performance.

4.3.1 Design Decisions

Ray tracing is a compute intensive, recursive, and highly parallel algorithm with complex control flow requirements. The raw algorithm would perform a large number of mostly unstructured memory accesses, which can be greatly reduced by exploiting the coherence between rays. Most operations in the ray tracing algorithm are floating point vector operations, especially for shading. These properties of ray tracing result in the following basic design decisions for the Shading Processor.

Figure 4.3: The Shading Processor consists of 4 SIMD ALUs that execute packets of 4 threads synchronously. The execution in the 4-component ALUs and between the ALUs is performed in SIMD fashion. Multiple packets are executed asynchronously on the multi-threaded hardware. A custom Ray Casting Unit synchronously traverses packets of rays through the spatial index and operates independently of the Shading Processor (see previous chapter).

4.3.1.1 Vector Operations

The vector instruction set is mapped to parallel hardware that is capable of executing one vector operation per clock cycle. This style of implementation improves the shading performance compared to a sequential execution of the vector operations. Furthermore, each vector instruction, even the dot product computation, is fully pipelined, making the execution of one arbitrary vector instruction per clock cycle possible. Thus all arithmetic instructions can be performed with a throughput of one and constant latency. Only instructions with special modifiers and the load and trace instructions have longer and varying latency. Similar to GPUs, the Shading Processor implements dual-issue instructions that can operate on entire 4-vectors or split them into 2/2 or 3/1 elements as required.

4.3.1.2 Threads

The Shading Processor takes advantage of the data parallelism in ray tracing (or shading) through a multi-threaded hardware design. For every pixel a new independent thread is started that performs all of its computations. The state of multiple of these threads is maintained in hardware, and the execution in the Shading Processor switches between threads as required. Multi-threading increases hardware utilization by filling instruction slots that would otherwise go unused due to instruction dependencies or memory latency.

4.3.1.3 Packets

The raw bandwidth requirement for the unmodified ray tracing algorithm is huge [SWS02] if no optimizations are applied. It can be reduced considerably by exploiting the high coherence between adjacent rays. To this end, packets of four threads are created in the hardware, where all threads in a packet are executed synchronously in SIMD mode in parallel by four SIMD ALUs (see Figure 4.3). Because all threads always execute the same instruction, identical memory requests are highly likely for coherent threads and can be combined later. Using SIMD mode, much of the infrastructure (e.g. instruction scheduling, caches, etc.) can be shared among the four SIMD units, which greatly reduces the hardware complexity. It is not guaranteed that the logical control flow in all threads of a packet is always identical, even though this would be required by traditional SIMD hardware. Rays hitting different geometry may execute different shaders, or threads may branch to different instructions. Whenever threads in a packet are about to execute different instructions, the packet is split into sub-packets. Each sub-packet maintains a mask of active threads, which determines whether a thread of the packet is actively executing instructions or simply sleeps. Only one sub-packet is then executed, while the remaining sub-packets are put onto a control stack to be executed later when the control flow returns. The SIMD design assumes that there is enough coherence between threads that on average sub-packets still maintain a high number of active threads. In the worst case instruction packeting efficiency and performance could drop to 25%, in which case threads would effectively be executed sequentially. However, for ray tracing the coherence in a packet is generally very high (see Section 4.4 and Chapter 7). Standard communication schemes cannot be applied between the threads, though. The reason is the hardware managed packeting, which might introduce deadlocks if a thread from packet A waits for an inactive thread of packet B while at the same time a thread of packet B waits for an inactive thread of packet A. Such situations are difficult to prevent manually, as the activity state of a thread in a packet depends not only on its own computation, but also on the computation of the other threads in the packet.
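As a behavioral illustration of this packet splitting, the following C sketch uses assumed names and a simplified encoding; it is not the hardware implementation, and the control stack itself is described in Section 4.3.1.5.

#include <stdint.h>

#define PACKET_SIZE 4

typedef struct {
    uint8_t  mask;   /* one bit per thread: active in this sub-packet     */
    uint32_t pc;     /* instruction pointer of the sub-packet             */
} subpacket;

typedef struct {
    subpacket stack[32];   /* control stack holding deferred sub-packets  */
    int       top;
} control_stack;

/* taken[i] is the branch condition evaluated by thread i of the packet.  */
static subpacket split_on_branch(subpacket cur, const int taken[PACKET_SIZE],
                                 uint32_t branch_target, control_stack *cs)
{
    uint8_t taken_mask = 0;
    for (int i = 0; i < PACKET_SIZE; i++)
        if (((cur.mask >> i) & 1) && taken[i])
            taken_mask |= (uint8_t)(1 << i);

    if (taken_mask != 0 && taken_mask != cur.mask) {
        /* Threads disagree: defer the branching sub-packet on the stack
         * and keep executing the fall-through sub-packet (its pc simply
         * advances to the next sequential instruction, not shown here).  */
        cs->stack[cs->top++] = (subpacket){ taken_mask, branch_target };
        cur.mask &= (uint8_t)~taken_mask;
    } else if (taken_mask == cur.mask) {
        cur.pc = branch_target;        /* all active threads branch       */
    }
    return cur;                        /* otherwise nobody branches       */
}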

4.3.1.4 Instruction Scheduling

The Shading Processor performs no out-of-order execution, no forwarding, no branch speculation, and no advanced dependency analysis. Instead, instructions are statically paired by the compiler, scheduled in-order to the pipeline, and registers are classified to simplify dependency analysis. The possibility of pairing two instructions makes the execution of two instructions per clock cycle possible without expensive additional logic for parallel scheduling. However, this technique increases the size of the instruction word to 122 bits (aligned to 128 bits), causing an overhead for simple unpaired instructions. To simplify the dependency analysis, an additional control flow tag per instruction stores an abstraction of the computation flow. These additional bits specify on which previously scheduled instructions the following instruction depends. When an instruction is fetched, the control flow tag can be used to determine when all dependencies for the next instruction are resolved, i.e. when all required operands have been written to the register file. Thus if an instruction is fetched from instruction memory, it is always guaranteed that it can be executed directly. This property simplifies the hardware design, as no expensive instruction prefetch buffers are required to hold fetched instructions that cannot be executed directly because of dependencies. To abstract the control flow, the instructions are classified into four classes that each execute at a fixed latency and write to different types of registers in the processor:

1. trace instructions,

2. memory access instructions,

3. arithmetic instructions,

4. and special instructions.

The memory store instruction is not considered here, as a memory read after a memory write to the same address always returns the correct previously written data as a property of the memory interface. For each thread the hardware contains four small dependency counters that count the number of in-flight instructions of the four instruction classes, and a fifth dependency counter that indicates whether a control instruction is in the pipeline. The counters of a thread are incremented when an instruction of the counter's type is fetched and decremented when an instruction of the counter's type finishes its execution. For the trace, memory fetch, and special instruction classes, the dependency tag encodes the dependencies as a single boolean value per instruction class. If set to true, the hardware only continues scheduling instructions once no instruction of this class is still being processed in the pipeline, i.e. the corresponding counter is zero. If any condition is violated, it continues with a different thread and schedules the current thread again only when all dependencies are resolved. For improved performance, the dependencies of the arithmetic instructions are encoded more generally in the control flow tag as a 3-bit number. This number specifies the number of immediately preceding arithmetic instructions that the next instruction does not depend on. To check for dependencies, the hardware simply compares this 3-bit number against the counter for in-flight arithmetic instructions. If the number of in-flight arithmetic instructions is smaller than or equal to the 3-bit number of the control flow tag, scheduling can continue, because the instructions that caused the dependency have terminated. This algorithm only works because the arithmetic instructions are scheduled and terminated in-order. If a control instruction (branch, call, or return) is scheduled, a thread switch is always initiated and, in addition to the other dependency checks, no further instructions of the previous thread are scheduled to the pipeline until its new instruction pointer is determined. A different handling at this point would require some form of branch prediction, which is not performed by the Shading Processor. Table 4.5 shows an instruction sequence to clarify the use of the control flow tags. The columns T, M, S, and A show the different parts of the control flow tag for the instructions: dependency on traversal (T), memory fetch (M), special (S), and arithmetic (A) instructions. The control flow tag of the instructions can be computed automatically by a compiler or assembler.

     T       M       S       A    Instruction
1.   false   false   false   7    ld I0,[R0.x+13]
2.   false   false   false   7    add R0,R1,R2
3.   false   false   false   7    add R3,R4,R5
4.   false   false   false   2    add R6,R7,R8
5.   false   true    false   1    mul R9,R1,R0
6.   false   true    false   0    mad R10,R6,R3,I0
7.   false   false   false   0    mad R11,R10,R3,I0

Table 4.5: Control Flow Tag: A simple instruction sequence including the control flow tag denoting dependencies of the next instruction on traversal (T), memory (M), special (S), and arithmetic (A) instructions. Highlighted is the control flow tag of instruction 5 that specifies the dependencies of the next instruction (instruction 6). Instruction 6 depends on the computation of R6, R3, and I0, thus on instructions 4, 3, and 1. Before it can be scheduled the memory dependencies must be resolved (M=true) and only instruction number 5 may still be in the pipeline (A=1). There is no access to the special register or any hit result register, thus S=false and T=false.

However, in the case of branches or shader calls the compiler must be careful when setting the control flow tags, as the instructions no longer execute in sequence. If a shader or function is called from a register, it is necessary to set the control flow tag of the call instruction to (true,true,true,0) such that all in-flight instructions in the pipeline are retired before the first instruction of the function is executed. This is necessary as the first instruction of the function is unknown to the static analysis of the compiler and may access any register. In the case of static branches the compiler knows the two instruction sequences that might follow and can set the dependency tag of the current instruction conservatively.

The scheduling scheme using control flow tags per instruction is quite simple and achieves high efficiency. For the arithmetic instructions it is close to optimal for in-order scheduling, but it is conservative for memory fetch instructions and special operations. Especially for several in-flight memory fetches it is not possible to wait for a specific one, because this would require 4 additional classes to be introduced, one for each possible input register. Furthermore, waiting for a maximal number of in-flight memory read instructions, as for the arithmetic pipeline, is not possible because the memory read instructions may terminate out of order. The main reason why this kind of dependency analysis was chosen is that it simplifies the implementation, because only one small counter per abstracted class is required per thread. Otherwise, each register component (of each thread) would require a flip-flop to store whether it will be modified by an instruction in the pipeline or not. This would be a much more general approach, but it requires instructions to be fetched into a prefetch buffer, where their dependencies are checked depending on the state of each register. The available logic resources on an FPGA make such general approaches very expensive, especially for massively multi-threaded designs. A design optimized for an ASIC, however, would rather reduce the number of required threads by applying advanced forwarding and scheduling schemes to use the pipeline optimally. This would increase the area occupied by random logic, but greatly reduce the area required for register memories, which is large as shown in Chapter 8. Furthermore, the applied abstraction requires only 6 additional bits to store the control flow tag, which is small compared to the total of 122 bits required to encode the instructions. Together with the multi-threading approach the pipeline is used very efficiently (see Section 4.4).
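Expressed as a small C sketch with assumed names (not the actual hardware signals), the scheduling decision based on the control flow tag and the five dependency counters could look as follows.

#include <stdbool.h>
#include <stdint.h>

typedef struct {            /* per-thread in-flight instruction counters   */
    int trace, mem, special, arith, control;
} counters;

typedef struct {            /* control flow tag attached to an instruction */
    bool    t, m, s;        /* must wait for trace / memory / special?     */
    uint8_t a;              /* max. number of in-flight arithmetic instrs  */
} cf_tag;

/* The next instruction of a thread may only be scheduled when this
 * returns true; otherwise the scheduler switches to another thread.       */
static bool may_schedule(const cf_tag *tag, const counters *c)
{
    if (c->control > 0)            return false;  /* branch/call pending   */
    if (tag->t && c->trace   > 0)  return false;
    if (tag->m && c->mem     > 0)  return false;
    if (tag->s && c->special > 0)  return false;
    return c->arith <= tag->a;     /* in-order arithmetic pipeline         */
}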

4.3.1.5 Recursive Function Calls

As specified in the instruction set, recursive function calls need to be handled transparently by the hardware. This requires, on the one hand, handling of the register stack accessible through the stack registers S0 to S7 and, on the other hand, a separate control stack. These two stacks are kept synchronous by the hardware. The stack registers can be accessed by the application and are simply mapped to a frame in a dedicated stack register file that holds the top part of the register stack (see Figure 4.4). This frame is accessed by an addressing mode relative to a frame pointer of the register stack. The execution of a call instruction only modifies this frame pointer, and no data needs to be copied. A return instruction restores the frame pointer again. The control stack is not visible to the application and stores the return PC, the activity mask of the packet, and a return frame pointer of the register stack. On a function call the return address, the current activity mask of the packet, and the current frame pointer are put onto the control stack. After the execution of a return statement, the original activity mask, frame pointer, and the correct instruction pointer can be restored again by a pop operation. The control stack has a maximal size of 32 entries, thus the recursion depth is limited. Besides these push and pop operations of the control stack, the application has no direct access to the instruction pointer, activity mask, or frame pointer. As the current activity mask is restored after a return statement, all threads that were active before the function call are implicitly synchronized again, no matter what kind of potentially incoherent computations are performed in the function. This property of function calls can be used by programmers to explicitly synchronize threads after a certain incoherent code segment. For instance, a compiler might deliberately not inline a function that computes the absolute value of a number in order to increase coherence.

Figure 4.4: Register Stack: The top part of the register stack is stored in on-chip memory. The stack registers (white part) are mapped to the top of this on-chip memory using a relative addressing mode. They can be used to quickly push or pop data from the stack without explicitly copying data. If the on-chip memory is not sufficient, data is transparently written to main memory by the hardware and restored again.

The control stack is also used to execute incoherent conditional branches, where the threads in the packet disagree on the branch condition. There the hardware splits the packet into a sub-packet that takes the branch and one that does not. For the branching sub-packet the target address of the branch and the activity mask of the branching threads are put onto the control stack. The sub-packet that has not taken the branch is executed further. If a return statement is reached, the control stack is popped and the sub-packet that has taken the branch is executed. As the stack register file acts as a kind of dedicated stack cache, its size should be larger than the 8 entries that are directly accessible by the shader. A size of 24 entries is sufficient for ray tracing scenes where the recursion depth is around 4 and not more than 4 stack registers are required per recursion. As the recursion depth might get much deeper for a few parts of the image, the hardware automatically spills the lower parts of the stack register file to main memory if required by a function call. The registers are restored again when the function returns. This spilling to main memory can be implemented quite efficiently, based on large blocks. To hide latency, the stack can be restored if the number of items stored on-chip falls below some threshold. The number of 8 registers S0 to S7 that are accessible on the register stack might seem small for advanced shaders. However, on the one hand the registers are all 4-component vector registers, thus up to 32 values can be put onto the stack. On the other hand, a shader that needs to store more than 8 registers onto the stack stores more than half of the 16 general registers. This is very unusual and indicates that the shader is too complex to be executed efficiently within the 16 available general registers. Such code would require spilling of registers to main memory (via store instructions), which causes high additional memory bandwidth and results in low rendering speed. However, in principle even very large shaders could be executed on the Shading Processor by using the external memory as a software managed stack and by spilling out registers. Unlike with CPUs, data that cannot be kept in the register file cannot be buffered in a cache, because the Shading Processor lacks a large second level cache. Instead it contains on-chip memories for the stack register file that can be used to store the computational state during recursion. In contrast to a cache, this special stack management strictly separates the state of the threads from each other. Storing the local state of the computation in a large cache would require a more complicated read-write cache and would cause cache conflicts between several threads. This would result in a non-optimal handling of the stack semantics of the computational state and additional instructions to access the memory stack contents. Furthermore, the techniques used allow a single cycle function call to be implemented, as no additional instructions are required to push or pop data from the stack.
Data that has to be stored on the stack can instead be written to a free stack register by the instruction that computes it. Recursive ray tracing can thus be handled very efficiently.
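The following C sketch models these call/return mechanics under assumed sizes and field names; spilling of the stack register file to main memory is omitted for brevity.

#include <stdint.h>

typedef struct { float v[4]; } sreg;        /* one 4-component register       */

typedef struct {                            /* one control stack entry        */
    uint32_t return_pc;
    uint8_t  activity_mask;                 /* restored to resync the packet  */
    uint16_t return_fp;                     /* register stack frame pointer   */
} control_entry;

static sreg          reg_stack[24];         /* on-chip stack register file    */
static uint16_t      fp = 0;                /* S0..S7 map to reg_stack[fp..]  */
static control_entry cstack[32];            /* 32-entry control stack         */
static int           ctop = 0;

static sreg *S(int x) { return &reg_stack[fp + x]; }   /* stack register Sx  */

/* "call target push n": only pointers move, no register data is copied.     */
static uint32_t do_call(uint32_t pc, uint32_t target, uint8_t mask, int n)
{
    cstack[ctop++] = (control_entry){ pc + 1, mask, fp };
    fp += (uint16_t)n;       /* spilling to main memory on overflow omitted  */
    return target;           /* new instruction pointer                      */
}

static uint32_t do_return(uint8_t *mask)
{
    control_entry e = cstack[--ctop];
    fp    = e.return_fp;     /* previously pushed S registers visible again  */
    *mask = e.activity_mask;
    return e.return_pc;
}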

4.3.1.6 Dedicated Ray Casting Unit

The casting of rays through a spatial index causes a high latency, as typically between 50 to 100 traversal steps and 2 to 8 ray/triangle intersections or transformations are required per ray. Using a fully programmable unit for these operations wastes precious cycles, since every step and intersection would correspond to several instructions and cause a high latency and computational overhead. On the other hand, spending extra logic on speeding up the ray casting operation by using the highly optimized Ray Casting Unit from the previous chapter increases the cache efficiency due to distributed small caches and keeps the hardware efficient, because traversal and intersection are highly optimized. Thus the Traversal Processor can perform one packet B-KD Tree traversal step per clock cycle with optimal latency, while the Geometry Processor performs one packet/triangle intersection every 8 clock cycles, much faster than the Shading Processor could, especially when operating on triangle vertices. The overhead caused by a software implementation of parts of the ray casting operation can be seen by comparing the performance of the Ray Casting Unit to the RPU design, which operates at nearly half the speed for pure ray casting (see Section 3.3.4.1). Some analysis of the additional chip area caused by the Ray Casting Unit can be found in the ASIC implementation in Chapter 8.

4.3.1.7 Memory Requests

Memory requests are a key problem for multi-core designs. The Shading Processor avoids most memory writes by using the register stack and optimizes memory read operations through packets and caches. The synchronous execution of packets leads to many identical memory requests that can be combined, thus reducing bandwidth. Incoherent packets are nevertheless allowed and cause no overhead, but may not see improvements either. All memory read accesses go through small dedicated caches in order to further reduce external bandwidth. Coherence, and consequently cache hit rates, are generally very high (see Chapter 7). To keep the caches small, memory writes are not cached, as they would be incoherent due to the many threads and should be avoided if possible. To read previously written data from main memory again, the cache needs to be disabled, which is possible using a special tag of the read instruction.
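As an illustration of the request combining mentioned above, the following C sketch deduplicates the addresses of one packet request; the interface is invented for this example and does not reflect the actual request format.

#include <stdint.h>

#define PACKET_SIZE 4

/* Returns the number of distinct memory read requests (1..4) and fills
 * 'unique' with the distinct addresses; slot[i] tells thread i which
 * combined request serves it (-1 for inactive threads).                   */
static int combine_requests(const uint32_t addr[PACKET_SIZE],
                            uint8_t active_mask,
                            uint32_t unique[PACKET_SIZE],
                            int slot[PACKET_SIZE])
{
    int n = 0;
    for (int i = 0; i < PACKET_SIZE; i++) {
        if (!((active_mask >> i) & 1)) { slot[i] = -1; continue; }
        int j = 0;
        while (j < n && unique[j] != addr[i]) j++;
        if (j == n) unique[n++] = addr[i];   /* new distinct address       */
        slot[i] = j;                         /* reuse combined request j   */
    }
    return n;
}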

4.3.2 Hardware Implementation

This section describes implementation details of the Shading Processor. The hardware units shown in Figure 4.5 are separated into 6 classes: scheduling (white), computation setup (blue), memory request (red), arithmetic pipeline (green), special operation (yellow), and control flow (grey). Each of these units, as well as their communication, is explained in the following paragraphs. The Instruction Scheduler controls the thread scheduling and the instruction scheduling using the control flow tags of the instructions. If a thread is active, each time an instruction of this thread is fetched the scheduler checks whether the dependencies for the next instruction are all resolved, by comparing the control flow tag components of the instruction against the five dependency counters as explained in the previous section. If the dependency check succeeds, the next instruction is fetched by the Instruction Fetch Unit from the Instruction Cache and executed. Using the new dependency tag, the dependency check is applied again for the following instruction. This only requires storing the dependency tag in a buffer, not the complete instruction. If the dependency check fails, the next instruction cannot be scheduled to the pipeline. The Instruction Scheduler then switches to another thread whose dependencies are resolved, and its next instruction is fetched. The previous thread is put into a waiting loop and activated only once the dependencies of its next instruction have all been resolved.

Figure 4.5: Shading Processor: The blue colored units set up the computation by fetching and decoding instructions, accessing the register file, and applying source modifiers. At this point the control either goes to the red units for memory access, to the Ray Casting Unit for ray traversal, or to the main green arithmetic pipeline for arithmetic computations. At the end of the arithmetic pipeline the control flow is computed and the special operations are applied to the result.

While instruction scheduling and fetching are shared between the threads in a packet, the following units are replicated for the 4 threads that are executed synchronously. In the next stage of the pipeline the register files are accessed to read up to 3 operands of the instruction. The register file is separated into different memories. These memories, with the number of ports accessible by the Shading Processor, are: the general register file (3 read, 1 write), the stack register file (1 read, 1 write), the constant register file (1 read), the special register file (1 read, 1 write), and the hit register file (1 read). As the Shading Processor can only access one read port of some of these memories, there are some limitations on using the different types of registers in instructions. The stack register file is accessed using a stack offset and is connected to the Control Instruction Unit to store register contents to main memory if all on-chip stack registers are already in use. Conversely, the execution of a return instruction might restore some part of the stack registers again. Once all operands are fetched, the source modifiers (swizzle, negation, multiply by a power of 2) are applied to the source operands. The swizzling in particular is wiring intensive, while the logic complexity of the modifiers is fairly low. The preparation of the instruction is now finished, and the actual execution of the instruction starts either in the memory access units (red), the arithmetic pipeline (green), or in the external Ray Casting Unit. As the memory request is executed in the second instruction slot, the instruction can continue its computation in the arithmetic and memory access pipelines simultaneously. Memory Access: For memory access, the Memory Request Generator first adds the static offset to the address and generates a packet memory request. This packet request contains one request for each active thread in the packet. These requests are packed such that threads that fetch data from the same memory location only generate one combined memory request. This results in 1 to 4 packed memory read requests, depending on the coherence and the number of active threads. The data is then fetched via the Shading Cache from main memory and decoded in the Decode Memory Contents unit. This read procedure is performed 4 times if the load4x instruction is used. Decoding is necessary to either simply copy the data to the input registers or to interpret it according to the used memory access mode. If high accuracy data is required, the Shading Processor addresses the memory using a 128 bit addressing mode and reads the full 128 bit data into the register file. If low accuracy floating point data is sufficient, the Shading Processor can access the memory in a 64 bit addressing mode, divide the 64 bit data into 4 floating point values of 16 bit each, and convert them to the internal floating point format. Textures are typically stored as RGBA data with 8 bits per component.
Therefore, a third addressing mode allows the memory to be addressed in a 32 bit mode that converts the RGBA color to a floating point vector with each component in the range of 0.0 to 1.0. In each of the addressing modes the hardware reads 128 bit data from memory, but selects the right 64 or 32 bits for further processing. For memory write requests, data can be encoded the reverse way and stored in main memory. Thus a floating point vector (with its components in the range of 0 to 1) can be stored in RGBA format to memory. Similarly, data can be stored with reduced 64 bit precision. As explained earlier, memory cannot efficiently be used as storage for computational state, because the additional memory bandwidth could easily slow down the computation to non-interactive rates. The Shading Processor is not equipped with a large second level cache to hold this data on-chip, especially because the large number of threads would make such a second level cache very large to be effective. Thus memory writes are typically only used by the threads to return data such as the computed pixel color. Storing this return data in the first level cache would pollute it unnecessarily, thus written data does not get stored to the first level cache.

However, this would strongly limit the use of the external memory for offline rendering, because written data could not safely be read back again. Thus a special flag in the load instructions can be used to disable the cache and fetch data directly from memory.
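As an illustration of the 32 bit addressing mode described above, the following C sketch converts one RGBA texel into a floating point vector; the byte order and function names are assumptions of this sketch, not the documented memory layout.

#include <stdint.h>

typedef struct { float x, y, z, w; } vec4;

/* 32 bit mode: one RGBA8 texel becomes a float vector in [0.0, 1.0].
 * The byte order (R in the lowest byte) is an assumption.                */
static vec4 decode_rgba8(uint32_t texel)
{
    vec4 c;
    c.x = (float)((texel >>  0) & 0xFF) / 255.0f;
    c.y = (float)((texel >>  8) & 0xFF) / 255.0f;
    c.z = (float)((texel >> 16) & 0xFF) / 255.0f;
    c.w = (float)((texel >> 24) & 0xFF) / 255.0f;
    return c;
}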

Arithmetic Pipeline: The main part of the arithmetic pipeline consists of a large ALU that performs the different instructions such as addition, fractional part, multiplication, multiply-and-add, and the dot products. The ALU is completely pipelined, and the 4th component can be switched to integer and logic mode to perform the integer and logical operations required for memory address computations. If the saturation destination modifier is applied, the Destination Modifiers Unit clamps the result to the interval [0, 1]. Finally the Write Back Unit writes the specified components of the result back to the register file.

Control Instructions: If a control instruction is attached to an arithmetic instruction, the Condition Unit computes the condition under which the control instruction is executed. If the condition evaluates to true, the Control Instruction Unit computes a new instruction pointer. In the case of a call or return instruction this might involve pushing or popping the instruction pointer to or from the control stack and possibly storing or restoring parts of the stack register file.

Special Operation: A special operation can be applied to each arithmetic instruction. This special operation (reciprocal or reciprocal square root) is computed in the yellow units, using the result of one of the components of the arithmetic pipeline as input. The special operation computation is again fully pipelined, and its result is written back to the same component of the special register S. For 24-bit floating point values, the special operation is implemented using a lookup memory of 512 entries containing value and first derivative to perform a linear interpolation. The accuracy of this computation is 15 to 16 bits, which is sufficient for the 16-bit mantissa of the 24-bit floating point representation used in the FPGA prototype. To achieve higher accuracy for 32-bit floating point configurations, a quadratic interpolation can be used by additionally storing the 2nd derivative of the function at the 512 sample points.
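The following C sketch illustrates the idea of such a table-based evaluation for the reciprocal square root; the argument reduction to the interval [1,4) and the table construction are assumptions made for this example and do not reproduce the exact hardware format.

#include <math.h>

#define LUT_SIZE 512

static float lut_val[LUT_SIZE];   /* sampled function value                 */
static float lut_der[LUT_SIZE];   /* sampled first derivative               */

/* Build a reciprocal square root table over [1,4); covering two octaves
 * keeps the exponent handling of the floating point input trivial.         */
static void build_rsq_table(void)
{
    const float lo = 1.0f, step = 3.0f / LUT_SIZE;
    for (int i = 0; i < LUT_SIZE; i++) {
        float x = lo + i * step;
        lut_val[i] = 1.0f / sqrtf(x);
        lut_der[i] = -0.5f / (x * sqrtf(x));   /* d/dx of x^(-1/2)          */
    }
}

/* Linear interpolation: value plus derivative times the offset in the cell. */
static float rsq_lookup(float x)               /* expects x in [1,4)        */
{
    const float lo = 1.0f, step = 3.0f / LUT_SIZE;
    int i = (int)((x - lo) / step);
    if (i >= LUT_SIZE) i = LUT_SIZE - 1;
    float dx = x - (lo + i * step);
    return lut_val[i] + lut_der[i] * dx;
}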

Trace Instruction: The execution of a trace instruction starts directly after the 3 operands are fetched and the source modifiers are applied. The ray origin, ray direction, and scene specification are then written to the Ray Casting Unit, which starts the processing of the rays. Once the intersection of the ray with the scene is determined, the trace instruction terminates and the hit data can be accessed by the shader using the dedicated hit registers.

4.4 Implementation Results

Using FPGA technology, a prototype of the Shading Processor has been implemented to evaluate its shading performance. The prototype platform is the same as used for the performance evaluation of the Ray Casting Unit: a Xilinx Virtex-4 LX 160 FPGA [Xil06b], hosted on the Alpha Data ADM-XRC-4 PCI-board [Alp06] and connected to a DRAM memory interface capable of delivering 1.0 GB/s peak performance at 66 MHz. See Section 2.8 for more information on the prototype platform. This section evaluates the raw shading performance of the Shading Processor without using the connected Ray Casting Unit. An evaluation of the complete DRPU system, including shading and ray casting, can be found in Chapter 7. The complete DRPU system will be heavily multi-threaded with 32 packets of four threads each. Because many of these threads will later be operating in the Ray Casting Unit, the Shading Processor is configured with 16 packets of four threads in this chapter to show that even this lower number of threads is sufficient for achieving very high usage rates of the arithmetic pipeline. Furthermore, each arithmetic unit of the prototype FPGA computes with 24 bit floating point accuracy, and the shading cache is direct mapped and stores 512 items of 128 bits each. The Shading Processor prototype can perform a peak number of 2.3 GFlops (mad rcp instruction: (4 + 4 + 1) · 4 · 66 MHz) and execute 528 million instructions per second (2 · 4 · 66 MHz), due to instruction pairing and packeting. The throughput of all main pipelines is 1 and their latencies in the implementation are as follows:

Pipeline                         Latency
Instruction Preparation (blue)   4 cycles (on cache hit)
Arithmetic Pipeline (green)      12 cycles
Memory Fetch (red)               8 cycles (on cache hit)
Special Operation (yellow)       6 cycles

The instruction preparation requires one cycle for the scheduler, one to access the instruction cache, and two cycles for register file access including swizzling. With a latency of 12 cycles the arithmetic pipeline is comparatively deep, which shows that this is a major point for further optimizations. However, even if only 8 threads are executing on the Shading Processor, each having two instructions in flight, the pipeline can be used optimally. The memory fetch units require a total latency of 8 cycles in the case of a cache hit. Some latency is required by an input FIFO, the address computation, the cache access, and the decoding of the fetched data. With a latency of only 6 cycles and a throughput of 1, the special operation can be executed even faster than standard arithmetic instructions.

This makes frequent vector normalizations possible, which are important in shading computations.

The shading performance of the Shading Processor is evaluated using three different shader programs:

1. Phong Shader: A classical Phong shading model using a single light source that casts no shadows (see Appendix B for assembly code). The shader consists of 53 instructions (paired ones counted twice) and performs vertex normal interpolation (plus transformation from object space to world space), light fall-off, ambient term, specular term with an exponent of 4 (including reflection vector computation), and the linear combination of the final intensity. This shader is very compute intensive, as it performs few coherent memory requests but many shading operations.

2. Texture Shader: A simple shader that performs bilinear texture lookups. Only a few computations are performed, but the many incoherent texture fetches make this an example of a bandwidth limited algorithm.

3. Mandelbrot Set: The shader computes the Mandelbrot set (using at most 100 iterations) by painting a point that is in the set black and all other points white (see Figure 4.6 for an image and Appendix C for the assembly code). The Mandelbrot algorithm is a branch intensive algorithm that performs a small loop for each pixel. This inner loop consists of 12 instructions, which can be reduced to 8 when using paired instructions (an illustrative C version of this loop is sketched after the list). The threads' control flow diverges only at the fractal border of the Mandelbrot set, as some threads in the packet belong to the set and others do not.
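For illustration, a plain C version of such a per-pixel Mandelbrot loop is given below; the actual DRPU assembly is listed in Appendix C, and this sketch is not a transcription of it.

/* Returns 1 (black) if the point c = cr + i*ci is assumed to be in the
 * Mandelbrot set after at most 100 iterations, 0 (white) otherwise.      */
static int mandelbrot_black(float cr, float ci)
{
    float zr = 0.0f, zi = 0.0f;
    for (int i = 0; i < 100; i++) {
        float zr2 = zr * zr, zi2 = zi * zi;
        if (zr2 + zi2 > 4.0f)
            return 0;                          /* escaped: paint white    */
        zi = 2.0f * zr * zi + ci;              /* z = z*z + c             */
        zr = zr2 - zi2 + cr;
    }
    return 1;                                  /* assumed in set: black   */
}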

The shaders are evaluated for each pixel at a resolution of 512x386. Neither ray casting nor any other computations are performed; the evaluation consists only of iterating over the pixels, shading, and writing back the result to an RGB frame buffer.

Shader       SP usage   instr. packeting eff.   cache hit rates   DRAM eff.   DRAM usage   frame-rate
Phong        99.5%      100%                    87.4%             33.8%       18.3%        31.0 fps
Texturing    45.7%      100%                    38.6%             34.7%       94.1%        61.1 fps
Mandelbrot   99.7%      98.1%                   n/a               35.9%       2.4%         11.6 fps

Table 4.6: Shading performance evaluation of the Shading Processor FPGA prototype (66 MHz, 16 packets with 4 threads) at 512x384 resolution.

Table 4.6 shows some statistics for the Shading Processor executing the three different shaders. The Phong shader contains many independent instructions and fetches only little data from memory.

Figure 4.6: The Mandelbrot set computed with the Shading Processor at a resolution of 512x386.

Thus the usage of the Shading Processor pipeline is very high, and in nearly every cycle an instruction (possibly a paired one) is scheduled to the pipeline. The instruction packeting efficiency (the fraction of active threads in each active clock cycle) is optimal, because no incoherent branching is performed in the Phong shader. Thus for each executed instruction all 4 threads in the packet are active. The texturing shader performs mainly texture lookups and only few instructions for the bilinear interpolation, thus the memory interface is used intensively instead of the arithmetic pipeline. The DRAM usage is 94.1%, thus the DRAM is accessed in nearly every cycle. Unfortunately, the DRAM is used with an efficiency of only 34%, thus 66% of the active DRAM cycles are either used for row management or refresh cycles, or the DRAM controller has to wait due to the DRAM timing specification. However, the performance of the texturing shader compares very well to general purpose CPUs (see below). The computation of the Mandelbrot set is very branch intensive. Although no branch speculation is performed on the Shading Processor, the number of threads is sufficient to achieve very high processor usage rates of over 99%. The control flow of the Mandelbrot computation diverges only at the one-dimensional fractal border of the Mandelbrot set, thus the instruction packeting efficiency is still very high at over 98%. This shows that despite the Mandelbrot computation being branch intensive, the control flow is still very coherent. In the following, the performance of the Shading Processor is compared against three different CPUs. The first CPU used for comparison is an AMD Athlon 1700 with 256 KB of on-chip cache memory. It is clocked at 1.46 GHz and can perform a peak number of 5.84 billion floating point operations per second in SIMD mode. A memory bandwidth of 2.9 GB/s can be achieved for sequential reads. The second CPU is an AMD Opteron 880 processor with 2.4 GHz clock speed, 1 MB on-chip cache, a peak floating point performance of 9.6 GFlops, and a peak memory bandwidth of 4.8 GB/s for sequential reads. The last CPU used for comparison is an Intel Xeon 2.2 GHz CPU with 512 KB on-chip cache and 5.7 GB/s memory bandwidth for sequential reads. To perform exactly the same computations on both the Shading Processor and the CPUs, the assembly code of the Shading Processor shaders is directly converted to C code, where the registers map to C variables and the operations to inlined C functions. The code was compiled with the GCC compiler V. 3.3.1 with all optimizations turned on. The C code does not take advantage of SIMD extensions.

Shader       SP         AMD Athlon   AMD Opteron   Intel Xeon
Phong        31 fps     6.6 fps      12.5 fps      6.6 fps
Texturing    58 fps     15 fps       43 fps        26.8 fps
Mandelbrot   11.6 fps   19.6 fps     33.0 fps      18.8 fps

Table 4.7: Comparison of the shading performance of the Shading Processor FPGA prototype (66 MHz, 16 packets with 4 threads) with an AMD Athlon 1700+ (1.46 GHz), an AMD Opteron 660 (2.4 GHz), and an Intel Xeon 2.2 GHz CPU.

As Table 4.7 shows, the Shading Processor performs a factor of 2 to 5 faster than the multi-GHz CPUs for the Phong shader code. The reason is the efficient usage of the SIMD ALUs, the packeting, and the high usage of the pipeline due to multi-threading. Surprisingly, even the bilinear texturing performs faster on the Shading Processor, even though the algorithm proved to be bandwidth limited on the Shading Processor and the memory bandwidth of the CPUs is three to five times higher. For the computation of the Mandelbrot set, the Shading Processor performs worse because its instruction throughput for scalar code is too low. The assembly code of the Mandelbrot shader shows that only few scalar instructions could be packed into vector instructions in the inner loop. Thus for this program about 264 million scalar instructions can be executed per second, while current CPUs can perform several billion instructions per second. Furthermore, branch speculation and forwarding increase the utilization of the CPU, resulting in a 2 to 4 times faster computation compared to the Shading Processor.

4.5 Conclusions and Future Work

The Shading Processor presented in this chapter combines features of CPUs and GPUs in order to implement a fully programmable, parallel processor that performs very efficiently for shading. It can also be used to perform many other highly parallel (and highly coherent) computations, as was shown by the computation of the Mandelbrot set. The Shading Processor's programming model closely resembles that of current GPUs but extends it significantly towards general purpose computing. This includes full support for recursion, branching, and many more general memory operations. These features are implemented in an efficient multi-level SIMD design, where SIMD is used both at the instruction level (short 4-vectors) and at the thread level, where packets of 4 threads are executed synchronously to increase memory and hardware efficiency. In addition, for highly parallel algorithms writing efficient code for the Shading Processor is trivial, because all thread management is performed by the hardware itself, achieving high usage and performance rates. As a consequence the prototype implementation of the Shading Processor achieves performance levels similar to high-performance multi-GHz processors that are clocked at a 22 to 36 times higher frequency. Multi-threading is sometimes perceived to be inefficient due to the need to maintain the state of all threads in costly on-chip memory. The Shading Processor's memory requirements are reasonable, but they still consume most of the design space (see Chapter 8). The latency of the computation, and thus the number of required threads, could be reduced using techniques such as forwarding or speculative branching. Forwarding in particular could use the result of some low latency computations (such as an addition or multiplication) directly once it has been computed. In the current Shading Processor the latencies of all arithmetic instructions are equal to the 4+12 cycles required for the deep dot product computation.

Chapter 5

Skinning Processor

5.1 Introduction

Skinning is the process of attaching a renderable skin, typically a triangle mesh, to an underlying skeleton model that is used to animate the skin. Different skinning approaches with varying degrees of realism and complexity have been developed, some of which even consider skin deformations caused by muscles. However, for real-time applications the Skeleton Subspace Deformation (SSD) [MTLT88, MTT91] is mostly used, as it is fast, parallel, and very effective. In the literature, the SSD algorithm is also referred to as Smooth Skinning Algorithm, Multi-matrix Skinning, or Linear Blend Skinning. The SSD algorithm attaches each vertex of a triangle mesh with a weight to one or more bones of the skeleton it is influenced by. Vertices near the joints of a character are typically influenced by, and thus connected to, two bones, while vertices in the center of a bone are mostly connected to only one bone. Typically, most vertices are connected to one bone only, several to two bones, and very few to three or more.

Figure 5.1: Skeleton Subspace Deformation: The vertices are connected via weights to one or more bones of the skeleton model.


To compute the vertex position after the skeleton has been posed, the weighted average of the vertex position transformed by each of the attached bone transformations is computed. For a vertex v, connected with weights wj to M bones with bone transformations Aj, the vertex position V is defined by:

V = \sum_{j=0}^{M-1} w_j A_j v

This definition shows that the computation of one vertex is not influenced by other vertices, but only by some weights and a few affine transformations of the bones. This property makes a fast implementation of the SSD skinning model using streaming approaches possible. To ensure that undesired scaling does not occur, the weights for a vertex should add up to exactly 1:

\sum_{j=0}^{M-1} w_j = 1

Many shading models also require a vertex normal per vertex of the triangle mesh. These normals can also be computed by the SSD skinning model. If the bone transformations Aj are orthonormal affine transformations (translations and rotations only), the normals can be treated the same way as the vertices. But as normals are vectors, their homogeneous representation is zero in the last component, thus they are not translated. These orthonormal affine transformations maintain the property of the normal being perpendicular to its surface. However, this is not the case for arbitrary bone transformations Aj. To still maintain the normal property in this case, the normals must be transformed with the matrices (A_j^{-1})^T, the transpose of the inverse of the bone transformation. Furthermore, the length of a skinned normal is not necessarily 1, thus a normalization in the shader is required before applying most shading models. The SSD algorithm is simple and has found widespread use throughout computer animation, but it has some limitations worth noting. For small rotations of bones the algorithm works fine, while the accuracy decreases as the rotation angle increases. This is visible as a noticeable shrinking of the volume at the bend. Furthermore, influences of muscles cannot directly be computed by SSD, even though they are important for increasing the realism of the skin. One way to cope with this limitation is to bind additional transformations to the vertices that are only responsible for these skin deformations caused by muscles. A higher level muscle model can be used to modify these additional matrices properly to achieve good visual results. In a similar way additional transformations can be used for facial expressions and simple clothing.

5.2 Hardware Architecture

Applying the SSD skinning model to all vertices of a mesh in every frame is an expensive and time critical task that should not be performed by the application. The application should rather compute the motion at a higher level of abstraction, such as the skeleton bones, thereby minimizing the computation cost and the communication overhead with the rendering engine. To support this abstraction, the DRPU contains a flexible Skinning Processor that implements a more general form of the SSD skinning model that even supports morphing. The Skinning Processor computes the position of a vertex as the sum of 4-component vectors multiplied by some 3x4 matrices. Let v_0, v_1, ..., v_{M-1} be M virtual vertices in R^4 and A_0, A_1, ..., A_{M-1} be M bone transformations in R^{3x4}; then a vertex position V in R^3 is defined as:

V = \sum_{j=0}^{M-1} A_j v_j    (5.1)

Each vertex of a mesh is defined by such a linear combination. Only the few bone transformation matrices Aj are used to modify the mesh, while the virtual vertices vj stay constant. The difference to the previous definition of the SSD skinning model is that here the vertices and their weights are combined into single virtual vertices. Setting vj = wj · v transforms this formula into the definition of the SSD model from the previous section. On the one hand this modification simplifies the hardware implementation, as no special handling of the weights is required in hardware (such as multiplications with the vertices). On the other hand this is a more expressive model, because the virtual vertices vj can be chosen freely, not only some weights. This even allows for morphing between two characters with the same mesh topology if the virtual vertex v0 corresponds to the first character and v1 to the second one. The connected matrices can then be used to morph between both characters. The modification of the SSD formula has a disadvantage for vertices that are influenced by many bones, because now several virtual vertices vj rather than several scalar weights need to be stored per vertex. Fortunately, a vertex is mostly influenced by only a few bones, making this a small overhead in practice. To implement Equation 5.1, the Skinning Processor needs to store the bone transformations (because they are re-used for many vertices), compute vector/matrix transformations, and sum up the results. To reduce the additional hardware cost, the Skinning Processor reuses some memory blocks of a vertex synchronization buffer of the Vertex Fetch Unit (see Section 3.3.3.2) to store up to 32 transformation matrices.
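As an illustration of Equation 5.1, the following C sketch skins a single vertex; the data layout and names are chosen for this example and do not reflect the Skinning Processor's memory format.

typedef struct { float x, y, z, w; } vec4;
typedef struct { float m[3][4]; }    mat34;   /* 3x4 bone transformation   */

/* V = sum_j A_j * v_j  (Equation 5.1); the result has 3 components.       */
static void skin_vertex(const mat34 *A, const vec4 *v, int M, float V[3])
{
    V[0] = V[1] = V[2] = 0.0f;
    for (int j = 0; j < M; j++) {
        const float in[4] = { v[j].x, v[j].y, v[j].z, v[j].w };
        for (int r = 0; r < 3; r++)
            for (int c = 0; c < 4; c++)
                V[r] += A[j].m[r][c] * in[c];
    }
}
/* Setting v_j = w_j * v recovers the standard SSD model of Section 5.1.   */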

Figure 5.2: Skinning Processor: Skinning instructions are fetched from memory. These instructions include matrix rows to store to a matrix register file, and vertices to transform with these matrices. The matrix transformation is performed by the Geometry Processor, which is also used during rendering. Several transformed vertices can be accumulated in the Accumulate Unit and then be written to a memory location.

For the expensive vector/matrix transformation the Geometry Processor is also used, because it implements exactly this functionality for ray transformations. Only the final accumulation is performed by a new Accumulate Unit that needs additional hardware resources. The Skinning Processor (see Figure 5.2) implements a simple matrix instruction set and operates with 24-bit floating point accuracy. The instructions form a simple instruction stream that is fetched in blocks into an instruction buffer and decoded by the Fetch Instruction Unit. The 128-bit wide memory bus can transfer one such instruction per clock cycle to the Skinning Processor. The supported instructions include instructions to load the rows of a 3x4 matrix into one of the 32 matrix registers, and others to multiply 4-component vectors with a matrix from a matrix register and to accumulate them. Accumulated results can then be stored to a vertex array in memory, indexed by an immediate destination constant of the instruction. All vectors and matrix rows are encoded as immediate values in a 128-bit wide instruction format with 24-bit floating point accuracy (see Figures 5.3 and 5.4). As the accumulation has a small latency of 4 cycles, 4 vertices should be computed simultaneously in the Skinning Processor, each using a different accumulation slot in the Accumulate Unit, to prevent pipeline stalls due to data dependencies. The accumulation slot to use is specified in the instruction. Although this would in principle allow one instruction to be executed per clock cycle, the memory interface is not capable of delivering one instruction per clock cycle in addition to the write-back of the results.

Figure 5.3: Set Matrix Row Instruction: The immediate vector is stored to the matrix row of the matrix register specified in the instruction.

Figure 5.4: Transform and Accumulate Instruction: The immediate vector is transformed with the matrix in the matrix register using the accumulation slot specified in the instruction. The accumulation slot can be initialized and the result can be stored to the destination location in memory.

This approach stores and encodes all the rules needed to compute all vertices of a mesh directly in a single sequential instruction stream. Only this single sequential instruction stream is read, while a single sequential vertex stream is stored to memory. This simplifies the implementation and optimizes an already bandwidth limited computation by avoiding random memory accesses.
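To make the two instruction types more concrete, the following C sketch interprets such a stream in software; the field names and flags are assumptions, since the real 128-bit encodings of Figures 5.3 and 5.4 are not reproduced here.

#include <stdbool.h>

typedef struct { float v[4]; } vec4;

enum op { SET_MATRIX_ROW, TRANSFORM_ACC };

typedef struct {
    enum op op;
    int     matrix, row;   /* matrix register (0..31) and row (0..2)       */
    int     slot;          /* accumulation slot (0..3)                     */
    bool    init, store;   /* initialize slot / write result to memory     */
    int     dest;          /* destination index into the vertex array      */
    vec4    imm;           /* immediate vector (matrix row or vertex)      */
} skin_instr;

static float matreg[32][3][4];   /* 32 matrix registers                    */
static float acc[4][3];          /* 4 accumulation slots                    */

static void execute(const skin_instr *I, float (*vertices)[3])
{
    if (I->op == SET_MATRIX_ROW) {
        for (int c = 0; c < 4; c++)
            matreg[I->matrix][I->row][c] = I->imm.v[c];
        return;
    }
    if (I->init)                           /* start a new vertex            */
        acc[I->slot][0] = acc[I->slot][1] = acc[I->slot][2] = 0.0f;
    for (int r = 0; r < 3; r++)            /* accumulate A_j * v_j          */
        for (int c = 0; c < 4; c++)
            acc[I->slot][r] += matreg[I->matrix][r][c] * I->imm.v[c];
    if (I->store)                          /* write back the skinned vertex */
        for (int r = 0; r < 3; r++)
            vertices[I->dest][r] = acc[I->slot][r];
}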

5.3 Prototype Implementation and Results

For demonstration purposes, the prototype of the Ray Casting Unit has been extended with the Skinning Processor described in this chapter. As expected, only slightly more FPGA resources were required, for instruction buffers, some small modifications to the Geometry Processor, the additional Accumulation Unit, and a write FIFO for the computed vertex stream. The extended prototype is again clocked at 66 MHz. Some modifications to the driver application were required to support skinned meshes in addition to standard meshes. Via the API one can initially generate skinned vertices, virtual vertices, and skinning matrices. The skinned vertices can be connected to several products of virtual vertices and skinning matrices according to Equation 5.1. During rendering the structure of the skinning model cannot be modified anymore, but the matrices can be changed via the API. This causes the driver to update a few skinning instructions (those that specify these matrices) on the FPGA in each frame. Table 5.1 presents performance results of the Skinning Processor for skinned meshes of different complexity. The table shows the number of vertices (plus normals) to be skinned, the average number of matrices they are influenced by, and the resulting number of skinning instructions. This number matches the number of vertices times the average number of matrices, plus a few instructions to load the matrices. The remaining columns show the number of clock cycles required for skinning and the cycles per instruction. The number of cycles per instruction is independent of the number of mesh vertices, which demonstrates the linear scalability of the Skinning Processor with mesh complexity (see Figure 5.5). When operating at 66 MHz, the time required for skinning is very low, as it would be possible to update the Morphing sequence consisting of 5.4k triangles more than 800 times per second. Skinning is mostly even faster than the update of the B-KD Tree.

Scene       vertices   matrices   instructions   cycles    cycles/instr.   DRAM eff.   DRAM usage
Pipe        0.5k       2          1.1k           3.6k      3.2             51.0%       90.8%
Morph700    0.7k       3          4.3k           10.5k     2.4             58.3%       95.2%
Morph2400   2.4k       3          14.4k          34.5k     2.4             57.1%       96.9%
Morph5400   5.4k       3          32.0k          76.1k     2.3             57.6%       99.0%
Random      30.0k      3          181.0k         429.0k    2.4             57.3%       98.5%

Table 5.1: Skinning Processor Results: The table shows the number of skinning instructions that need to be executed for a number of dynamic test scenes of varying complexity. Pictures of the scenes can be found in Chapter 7 (some have a different number of triangles). Only every second to third cycle can an instruction be scheduled to the pipeline, because the computation is bandwidth limited.

Furthermore, the number of cycles per instruction is larger than one, even though the pipeline could in principle execute one skinning instruction per clock cycle. This theoretical peak performance is not reached because the processor is bandwidth limited and accesses the DRAM in 90 to 99% of the cycles.

Figure 5.5: The number of cycles required for skinning the morph scene with different numbers of vertices shows a linear relationship between the number of vertices and the skinning time.

However, the peak bandwidth of the memory interface is not achieved, as only in 57% of the cycles (the DRAM efficiency) data can be read from or written to DRAM. This is caused by additional cycles required for DRAM row management, switching from read to write mode, and DRAM refresh cycles. As the Skinning Processor only reads a single instruction stream and sequentially writes the skinned vertices, in principle a much higher DRAM usage rate could be achieved by improving the on-chip buffer management. A prefetch buffer could read long sequences of skinning instructions from memory, while a write buffer could collect vertices and write them in large sequences to memory. However, even though the current implementation is not perfect, the DRAM efficiency is already twice as high as during ray casting (see Chapter 3).

5.4 Conclusions and Future Work

This chapter described a high performance Skinning Processor that implements a generalization of the Skeleton Subspace Deformation (SSD) skinning model. The processor is a pure stream processor: it fetches only one instruction stream from memory and stores one vertex stream to memory. The instruction stream contains all data required for the computation as immediate constants. The Skinning Processor shares most resources with the Geometry Processor, thus it causes little additional cost in the DRPU architecture. As a consequence skinning and rendering cannot be performed at the same time, but this would not perform well anyway, as the Skinning Processor uses the available memory bandwidth extensively. Besides the Skinning Processor, the Shading Processor could also be used to implement the SSD skinning model. However, in the inner loop this would require at least 1 instruction to fetch the vertex, 1 to fetch the address of the matrix to transform with, 3 instructions to fetch the three matrix rows, another 3 instructions to transform the vertex, one for the accumulation, one loop increment, plus one branch instruction to possibly continue the loop. Because of the packet size of 4, these 11 instructions would perform the same computation as 4 skinning instructions, resulting in (at least) 2.75 times lower peak performance. This simple calculation shows that even though the programmable Shading Processor is available, using the Skinning Processor makes sense, especially if it were equipped with better buffering to speed up the memory interface. On the other hand, in addition to the Skinning Processor, the full programmability of the Shading Processor allows complex skinning models to be implemented that might even perform texture fetches or advanced muscle computations. However, for the frequent case of the simple Skeleton Subspace Deformation, the Skinning Processor can be used to achieve the best performance. This chapter closes the description of the DRPU hardware architecture, while the following chapters focus on the prototype hardware implementation. The next chapter describes a new hardware description language HWML that was used to efficiently and compactly implement the described hardware architecture.

Chapter 6

Chapter 6

Hardware Description Language HWML

6.1 Introduction

Choosing the right hardware description language was a major question for the implementation of the DRPU architecture. Several languages were explored, but none of them fulfilled our requirements:

1. The hardware description should be in a single language.

2. The code should be compact and easy to read.

3. For design studies, different parameterizations should be possible, such as varying the floating point accuracy, packet size, or number of packets.

4. Changes to the code should have easy-to-understand effects on the net-list.

The most readable code can be achieved with behavioral hardware description languages such as Mitrion-C [Mit06], which allow hardware to be described in a C-like style. However, such behavioral descriptions may not result in the best performing system with minimal area, and the designer strongly depends on the quality of the high-level compiler. As a consequence it is very difficult to reason about the structure of the generated circuit, as the behavioral specification does not describe any details of its implementation, which might even change with a new compiler version. Furthermore, for optimizations of the system, changes to the high-level design description should have understandable effects on the implementation and its clock speed. This argues for at least some control over the structure of the circuit in the high-level description, which can be provided by a purely structural

hardware description language. Especially for high-performance systems, designers often use structural hardware description tools (such as VHDL or Verilog) that provide transparent control of the resulting implementation. However, most existing tools operate on a very low level of abstraction to describe circuit structure (schematics, for example) and are not well suited for describing a large and complex system. In particular, the parameterizations required for design studies are a big problem with most hardware description languages. One reason is that they have to be used in conjunction with other tools (such as memory compilers or core generators) that are not seamlessly built into the system. Thus, changes to the accuracy of the floating point computation would require intensive manual work to generate new memory instances, change bus widths, and exchange many floating point cores in the design.

In order to cope with these limitations of existing systems for structural hardware description, the new HWML system (Hardware Meta Language) based on functional programming techniques has been developed and used to implement the DRPU prototype. HWML was designed in parallel to the DRPU architecture, thus it fits its needs well and fulfills the posed requirements. It allows very compact code to be written that is easy to read due to many abstractions such as data streams, automatic pipelining, recursive structural definitions, and hierarchy tagging. An earlier version of the DRPU system (Shading Processor plus Traversal Processor) was implemented in only 4500 lines of HWML code in about 6 man-months, concurrently with the development of the HWML system itself. The complete DRPU system as described in this thesis consists of only 8000 lines of code, including debugging code.

The hardware description of HWML is highly parameterizable due to the support for abstract polymorphic wires and typed multi-ported memories. Here especially the automatic pipelining makes the parameterizations and abstractions easily possible. As HWML is a library for structural hardware description, changes to the code map directly to the circuit net-list. The library implements all features required for implementing the DRPU system, thus a single language describes the complete hardware (including floating point operations such as addition and reciprocal).

The HWML description can be simulated and debugged in the ML environment. The result of compiling the HWML description is an RTL/structural VHDL description that is easily mapped to a target technology such as an FPGA or ASIC using standard synthesis tools. Portability was especially important: no changes to the HWML code need to be performed when going to a different platform, as in particular the different kinds of atomic memories are automatically handled by the library. HWML is implemented in the Moscow ML dialect, but mappings to different ML dialects should be trivial. Some basic knowledge of ML programming is advantageous for understanding some of the following sections.

For a brief introduction to ML see [MTH90]. Of course, leveraging an existing programming language to describe hardware systems is not a new idea. Several hardware description libraries for existing programming languages have been developed; some examples include SystemC [Sys06a] for C, JHDL [Bri06] for Java, and Lava [Per98] for Haskell. The capabilities of the host language strongly determine the style of describing hardware with these libraries. There are also some interesting extensions to the industry standards VHDL and Verilog, such as Bluespec System Verilog [Blu06]. Functional approaches to structural hardware description have been explored by many researchers [Tom06, MS00], but examples of using these languages for large complex systems are few. An advanced language for functional structural hardware description is Lava [Per98]. The main difference to the HWML approach is that HWML additionally supports automatic pipelining of combinational non-cyclic circuits and typed multi-ported memories. The HWML system consists of two parts: a low level structural library and the high level abstractions that build on it. Both are described in the next sections.

6.2 Low Level Structural Library

In order to use ML as a hardware description language, a basic library that supports the description of structural circuits needs to be implemented. This structural library implements basic primitive boolean functions such as And, Or, Not, Mux, and others to describe the structure of combinational circuits. A delay function Reg introduces the possibility of creating synchronous circuits using a register clocked by a global clock signal. The introduction of registers also requires the ability to build cyclic circuits. To make this possible in the ML framework, one can create a free wire and later assign a different wire to it, thus closing the loop. Multi-ported memories can also be instantiated and can later be mapped to the selected target platform.

6.2.1 Circuit Creation

In order to understand how an ML program using the library maps to a circuit, the following lines of code give a simple example.

fun reg_en en inp =
  let
    val out = Wire ()
    val _   = Assign (out, Reg (Mux (en, out, inp)))
  in
    out
  end

This ML code defines a register function reg_en with clock enable, which gets a clock enable signal en as its first argument and an input wire inp as its second.


Figure 6.1: A register with clock enable built from a standard register using a feedback loop. If clock enable is true, the register clocks the input inp to its output, otherwise it holds its output.

If the function is called, it constructs a register with clock enable out of a standard register by using a feedback loop. The function first creates a fresh wire out that will be the output. Assigned to this wire is a register that clocks the input to the output if en is true, or holds the output if en is false. From these simple lines of ML code the library creates a graph representation whose nodes are labeled with the names of the logical functions (not, and, or, mux, reg, ...). This is a very straightforward mapping from the ML description to the hardware structure. The ML program is simply executed by the ML system, and the library calls directly generate graph nodes with the corresponding label, as seen in Figure 6.1.

This graph representation can be written out to a VHDL file containing behavioral statements for each combinational library function, and instantiated components for some special black boxes like memories. The resulting VHDL code for reg_en is:

architecture reg_en of reg_en is
  signal w0,w1,w2,w3,w4: std_logic;
begin
  w3 <= port1;
  w4 <= port0;
  w2 <= ((not w3) and w0) or (w3 and w4);
  reg_w1 : process (lclk)
  begin
    if lclk'event and lclk = '1' then
      w1 <= w2;
    end if;
  end process reg_w1;
  w0 <= w1;
  port2 <= w0;
end reg_en;

Fine grained technology mapping into a specific target library, such as that for an FPGA or an ASIC standard cell library, is done later by standard synthesis tools. Mapping for memory structures is done by HWML and results in FPGA memory instantiations or specifications for an ASIC memory compiler to generate black box memories.

6.2.2 Automatic Pipelining

The low-level HWML library can introduce pipelining into the circuit by taking a non-cyclic combinational circuit and creating a balanced pipeline that is adjusted to a specified delay per pipeline stage or a specified latency. The pipelining algorithm is separated into three steps:

1. A simple constant propagation algorithm is applied which replaces subgraphs that contain a constant wire by their reduced form, such as replacing And(x, 1) by x. This optimization can reduce the depth of the circuit and helps prevent the algorithm from inserting too many pipeline stages.

2. The depth of each cell in the circuit is computed. All atomic cells (individual Boolean gates, for example) have a specified delay. The depth of a node in the graph representation is defined as its own delay plus the maximum of the depths of its input nodes. By applying this definition recursively to the circuit, one can assign a depth to each graph node.

3. The algorithm now walks the graph and inserts pipeline stages at depths 0 · step, 1 · step, 2 · step, ... until it reaches the maximum depth of the circuit. The stepping size is set to the specified delay per pipeline stage.

The automatic pipelining is an important function that enables the following abstractions to work more easily. Because the high level abstractions view the circuit at the data-flow level, circuits may have varying latencies or may be described recursively. Manually inserting register stages into a recursively defined multiplier, for example, would be quite complicated, as this type of modification does not map well to the recursive definition scheme. Instead, the functions are described using whatever HWML structures make sense, and the pipeline registers are inserted automatically.
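The depth computation of step 2 and the boundary test of step 3 can be sketched in a few lines of ML. The node type and the helper functions below are hypothetical stand-ins for illustration only, not the actual HWML graph representation:

datatype node = Node of { gate : string, delay : int, inputs : node list }

fun delayOf  (Node {delay, ...})  = delay
fun inputsOf (Node {inputs, ...}) = inputs

(* Depth of a node: its own delay plus the maximum depth of its inputs
   (memoization is omitted for brevity). *)
fun depth node =
  delayOf node + foldl Int.max 0 (map depth (inputsOf node))

(* The wires that need a pipeline register in front of 'node': all inputs
   whose depth lies in an earlier multiple of the stage delay 'step'. *)
fun edgesNeedingRegister step node =
  List.filter (fn input => depth node div step > depth input div step)
              (inputsOf node)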

6.2.3 Simulation

The HWML library can perform a cycle accurate simulation of the generated graph representation. To feed the simulation with data, an ML test bench applies stimuli to the input wires in each cycle of the simulation. The library propagates the circuit values in each cycle for the primitive gates and memories. In the DRPU implementation the test bench also emulates the behavior of the external DRAM by observing the address and command bus of the DRAM controller and applying stimuli values to the data bus.

6.3 High Level Abstractions

Based on the low level functionality, HWML uses the ML framework to define higher level abstractions that simplify the hardware description process. Besides the automatic pipelining, memory abstraction, and hierarchy tagging, the techniques described in the following would also be possible on top of Lava by using the functional capabilities of Haskell.

6.3.1 Components as Functions

The first abstraction is to represent hardware components by ML functions. Compared to standard structural components this allows higher level semantics. A function can be polymorphic, which means it can implement different structural behavior depending on the type of its input wires, as explained in the next section. The function arguments and result do not necessarily correspond to input wires and output wires. By passing a fresh wire to a function, the function can return a result to this argument wire by assigning a different wire to it. Similarly, one can use a result wire as an input by returning a fresh wire and already using it in the computations.
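The reg_en example from Section 6.2.1 can be rewritten in this style to illustrate the convention; this is only a sketch reusing the Wire, Assign, Reg, and Mux primitives, and the name latch_into is made up for the example:

(* The caller passes a fresh result wire; the component drives it via Assign
   instead of returning a wire. *)
fun latch_into out (en, inp) = Assign (out, Reg (Mux (en, out, inp)))

val en = Wire ()
val d  = Wire ()
val q  = Wire ()               (* fresh result wire created by the caller *)
val _  = latch_into q (en, d)  (* the component assigns the register output to q *)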

6.3.2 Abstract Wires

The low-level library contains a wire type bit whose usage is quite limited. Assume one wants to multiplex a tuple of two wires; then one has to build a special function mux2 to select between the two tuples.

fun mux2 sel ((a0,b0),(a1,b1)) = (Mux sel (a0,a1), Mux sel (b0,b1))

For each type of wire one would need a separate multiplexer function. A more elegant method is to have only a single function that automatically chooses a different multiplexer based on the type of the wire, which is called polymorphism. Such a function should be usable on all wire types supported in HWML: bits, integer wires, floating point wires, or combinations of them. To make this possible HWML uses a data-type construct of the ML language and defines its own wire data-type as follows:

datatype Wire = B of bit                          (* boolean wire *)
              | I of bit list                     (* integer wire *)
              | F of bit * bit list * bit list    (* floating point wire *)
              | L of Wire list                    (* list of wires *)

This data-type specification can easily be extended for special needs such as more general fixed point arithmetic. It defines a new ML type called Wire that can either be a simple boolean wire B, an integer wire I represented by a bit list, a floating point wire F consisting of sign, exponent, and mantissa, or a list L of wires. The constructors B, I, F, and L can be used to build abstract wires and to decompose them again using ML pattern matching. Internally these abstract wires are handled similarly to VHDL bit vectors, but for the hardware designer they contain additional information about the structure and type of the contained bits. Furthermore, they allow polymorphic functions to be written over this abstract wire data-type. For example, the following multiplexer can multiplex arbitrary inputs of the same type.

fun mux (B sel) (B a, B b) = B (Mux sel (a,b))
  | mux (B sel) (I a, I b) = I (map2 (Mux sel) (a,b))
  | mux (B sel) (F (s0,e0,m0), F (s1,e1,m1)) =
      F (Mux sel (s0,s1), map2 (Mux sel) (e0,e1), map2 (Mux sel) (m0,m1))
  | mux sel (L a, L b) = L (map2 (mux sel) (a,b))
  | mux _ _ = raise Error " ... "

This multiplexer is a polymorphic function over wires that can select between tuples of arbitrary wires of the same structure. There are four cases in the declaration: three for the atomic cases of boolean, integer, and floating point wires, and one for the list of wires, where the wire structure is simply followed. Here the map2 function call maps the multiplexer recursively onto pairs of elements from two lists of wires. If there is a structure mismatch between the two wires, the function fails with a runtime type error.

The ML language allows the programmer to define new polymorphic operators or functions, which increases the readability of the code. In the following, the infix operators A && B will stand for logical and, A || B for logical or, A ++ B for addition, and A <- B for the assignment of wire B to the fresh wire A. HWML defines a number of other polymorphic operators, including multiplication, but those just described are sufficient for explaining the concepts in the rest of this chapter. All the operators are defined to be polymorphic over the supported data-types in a similar way as the described multiplexer. These wire abstractions reduce the size of the DRPU code considerably, because multiple wires can often be bundled together and handled as a bundle by polymorphic functions, rather than writing code for each single wire component, which would be necessary in VHDL for instance. Thus large parts of the ALU of the Shading Processor are described as if they would operate on the basic data-type float. To execute on packets of rays, four-component floating point vectors (lists using the L constructor) are passed to the ALU, and it automatically generates the desired circuit that operates on packets of four threads, without any extra code.
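As a small sketch of this bundling (the wire names are placeholders for floating point wires and are not taken from the DRPU code), the very same polymorphic ++ describes a scalar adder or a packet-of-four adder, depending only on the type of its operands:

val scalar_sum = a ++ b                                    (* two F wires              *)
val packet_sum = L [a0, a1, a2, a3] ++ L [b0, b1, b2, b3]  (* a packet of four threads *)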

6.3.3 Stream Abstraction

A data stream is a basic concept of hardware architecture, describing data flowing along some path in a data flow graph. Communication schemes are often implemented using data streams. For example, a function call to a component could consist of an input data stream for the argument and an output data stream for the result of that function call. A data stream that can hold arbitrary data can be abstracted by a triple of three wires: a valid signal that indicates if there is currently data on the stream, the data itself, and a busy control signal that tells the previous stream element to stop if the current stage is busy. The data on the stream can of course be of any complex type, while the valid and busy signals must be of boolean type. Note that the busy signal indicates that the previous unit or stage in the stream should stop, thus its direction is opposite to the valid and data signals. Using this stream abstraction, one can define functions that operate on streams. For example, one can define a stream statement to create a stream, an assign statement to close a cyclic data-flow, stream multiplexers, a demultiplexer that directs the data stream to one of two output streams, FIFOs, a stop statement to stop a data stream, a delay statement to delay the stream element by one cycle, a semaphore that only allows n stream elements to be present in a region, a merge statement that merges two streams, a split statement that splits a stream into two streams (the inverse of the merge operation), and many more. Note that each of these operations can work on streams that hold arbitrary data.

The following example shows the implementation of the merge operation that merges two streams into one.

fun merge (valid0,data0,busy0) (valid1,data1,busy1) =
  let
    val busy = Wire TyB
    val _ = busy0 <- valid0 && not valid1 || busy
    val _ = busy1 <- valid1 && not valid0 || busy
  in
    (valid0 && valid1, L [data0,data1], busy)
  end

This function takes a data element from stream 0 and stream 1 and creates a new stream with these two data elements combined into a tuple. The output data is valid only if there is data on both input streams (valid0 && valid1). Furthermore, one input stream is stopped if it contains data while there is no data in the other stream, or if the output stream is stopped. By combining the high-level stream modifiers, HWML can generate each type of control flow at a very high level without dealing with low-level control automata. Such a stream merge operation can be used, for instance, if the two arguments for an addition are computed in an unknown cycle. Then one can synchronize the input stream of the first argument and the stream of the second argument with a merge function and feed the output stream to the adder.

fun stream_add (valid, L [a,b], busy) = (valid, a ++ b, busy)

val sum_stream = stream_add (merge stream0 stream1)

Here one can see how well the function abstraction works together with the stream abstraction. The input streams (which consist of wires of different directions) are given as the argument to a function, and the resulting output stream can then be processed further. For debugging purposes one can add small watch statements to a stream, which makes debugging quite comfortable. These watch functions can interpret the values on a stream and print them to the standard output in a human readable form. This way one can easily follow the computation at arbitrary points in the circuit. The DRPU implementation used the stream abstraction intensively, which makes the hardware design very simple, because many different units use the same communication protocol. Furthermore, writing complex control automata is mostly unnecessary when stream abstraction functions, such as stream multiplexers, are used intensively.

6.3.4 Pipelines as Stream Elements

As values pass through a data-path they are often operated on by functions, like the addition in the previous example. In general these computations might be quite complex and thus not executable in a single cycle. But it is desirable to write these combinational computations in as abstract a form as possible, without spending time manually adjusting the circuit to an appropriate latency. Automatic pipelining transforms these combinational circuits into efficient pipelines by using the pipelining of the low-level library extended with some streaming control. The following example reads two vectors from the two input streams and performs a dot product computation that is pipelined to match a target frequency.

fun dot (L [L [x0,y0,z0], L [x1,y1,z1]]) =
  (x0 ** x1 ++ y0 ** y1) ++ (z0 ** z1)

val out_stream = pipeline dot (merge stream0 stream1)

The pipeline function of this example turns the pipelined function dot into a stream element by adding the valid and busy control signals to it. Thus pipelined circuits can be used seamlessly with the stream concept.

6.3.5 Multi-ported Memory Abstraction

Nearly every hardware design needs memories in which to store computation state. Often these memories need multiple ports, for instance to implement main register files that read multiple arguments per cycle. Such multi-ported memories are abstracted in the HWML framework. One can define a function that creates a typed memory with multiple ports and a specified address and data type. The ports of the memory are returned as functions that can be called to connect to them. To export a circuit containing memories, the library must also be able to synthesize the abstract memory to the atomic memories available in the target platform. Depending on the available memories this might mean that the memory needs to be split horizontally (split data width), vertically (reduce address width), or duplicated to emulate more read ports. The valid atomic memories are then created as black boxes in the low-level library with some circuitry around them to connect them. As this synthesis step is at a high level, it is efficient, and it is guaranteed that optimal memory blocks of the target platform are really used. One can export the same circuit description to a Xilinx FPGA, making efficient use of the memory hierarchy available there, to other FPGAs, or to an ASIC process by automatically building memories using an appropriate memory compiler.

The advantage of this memory abstraction over the manual use of structural atomic memory blocks of the target platform (such as Xilinx block memories) is that the HWML approach is platform independent, as the mapping is performed automatically. Furthermore, it allows polymorphic functions such as FIFOs to be written that can buffer arbitrary data by using memories of matching type. Because the memory maintains the type of the wire, design errors can be eliminated, as no composition and decomposition of wide bit-vectors is required. Some synthesis tools support mapping behavioral VHDL memory descriptions to atomic memory blocks (such as the Xilinx synthesis tools), but only if the hardware designer uses a strict coding style. Here portability is an issue, as a direct synthesis to an ASIC target platform (with its own memory compiler tools) or a move to a different FPGA vendor is not easily possible.

6.3.6 Recursion and Higher Level Functions

Recursive circuit descriptions play an important role when designing arithmetic units. For instance, a recursive parallel prefix function can be used to define a carry lookahead adder, or a recursive balanced reduction operator to compute the CRC checksum of a bit vector in logarithmic time. The recursion must of course terminate at a statically determined maximal depth, as otherwise an infinite circuit is specified. For example, the following recursive definition of a reduction operator that generates a balanced reduction tree over an arbitrary reduction function f terminates when the length of its input list reaches 1 after some recursive splits.

fun reduce f nil  = raise Error "..."
  | reduce f [x]  = x
  | reduce f lst  =
      let
        val left  = take (lst, length lst div 2)
        val right = drop (lst, length lst div 2)
      in
        f (reduce f left, reduce f right)
      end

This function first splits the input list into two parts of (nearly) equal length and applies the reduce operation recursively to these parts. Then the two sub-reductions are combined with the function f again. Note that the reduce function is a higher order function that gets an (associative) function f as input. If one sets this function to the XOR operator, the reduction computes the CRC checksum of its input vector: val CRC = reduce xor [B0(),B1(),B0(),B0()].

One can use the same reduction operation to sum a list of integer wires, for instance, by simply writing reduce add [a,b,c,d]. Obviously, such an adder tree might have a high delay, so writing such a line seems of limited use. But, as automatic pipelining is supported, one can simply create such a tree and pipeline it later if that is what the designer wants: pipeline (reduce add) (valid,[a,b,c,d],busy).

This recursive description style helped very much in implementing the arithmetic units of the DRPU architecture, especially the integer adders and multipliers. Based on these recursively defined arithmetic functions that operate on arbitrary bit widths, it was easy to build more complex units of arbitrary operand width such as floating point adders, multipliers, or reciprocal square root computations.

6.3.7 Hierarchy Tagging

For optimal automatic or manual placement on an ASIC or FPGA, a physical hierarchy of the design is very important. A good hierarchy, required for physical layout, does not always match the logical style in which a circuit is specified. For better readability of the code, a control automaton might be distributed over a larger unit, while the designer wants to place the control statements at a single location of the floor-plan. In VHDL the control statements have to be placed in the same box to place them as a single block. HWML goes further, as it can introduce a hierarchy into the structural circuit by using the opening down hierarchy_name and closing up() functions, which can also be applied recursively. All primitive components instantiated between these function calls are packed into the hierarchy hierarchy_name. By opening the same hierarchy again, one can always add additional gates to it regardless of the HWML program hierarchy. For instance, one could tag all control statements to put them into their own VHDL component on circuit export. Furthermore, it is possible to define a new reduction operator that packs operations at the same level into a hierarchy. Thus one can later place an arbitrary reduction level by level to get an optimal placement. In the DRPU design hierarchy tagging was used intensively to define a useful structure for the circuit, especially to hand-place the design for the ASIC implementations. If desired for placement, modifications to the circuit hierarchy are easily possible through some additional down and up statements without any time consuming manual connection of new boxes.
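A small sketch of this mechanism is given below; the hierarchy name and the wires between the two calls are only examples:

val _    = down "control"             (* open the hierarchy "control"       *)
val next = mux start (idle, running)  (* primitives created here are packed *)
val _    = state <- Reg next          (* into the "control" hierarchy       *)
val _    = up ()                      (* close the hierarchy again          *)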

6.4 Conclusion

This chapter described a hardware description library for ML called HWML, which enables a number of high-level abstractions and leverages a number of features of the ML functional programming environment. These powerful abstractions include automatic pipelining, abstract polymorphic wires, data streams, multi-ported memories, recursive structural definitions, and hierarchy tagging for the resulting RTL/structural description. The result of compiling the HWML description is a collection of RTL/structural VHDL files that can be mapped to a target technology using standard commercial synthesis tools.

All these features have been successfully used to implement the DRPU hardware architecture, and they allow compact, expressive and reusable code to be written. Without the development of HWML, it would have been very difficult to implement the DRPU at its level of parameterization and abstraction. By changing only a single configuration file one can change many design parameters such as the number of threads supported in the system, the number of threads per packet, the computational accuracy (the floating point data path width, which also modifies bus widths, register file size, etc.), the pipelining depth, and many others. These degrees of freedom allow design studies by trying different configurations and testing them for efficiency, or generating simple area estimates to find a suitable configuration as performed in Chapter 8.

The stream abstraction was used throughout the DRPU design and was essential to the fast development of the system. Rather than developing control automata directly, the stream abstraction generated the control automata as a consequence of the stream structures. Wire abstractions allowed much more compact and readable code, as complex wires are bundled together and polymorphic structural elements automatically adjust to the new wire types. Because the DRPU targets two very different technologies (FPGA and ASIC), the memory abstraction that generates different memory structures based on the target was also essential.

With the HWML hardware description language, an FPGA and ASIC implementation of the DRPU was easily possible without any special HWML code for a specific target platform. Only the HWML library needs to support the desired target platform to generate the right memory instances or to use available multipliers. The next chapter describes this FPGA implementation of the DRPU prototype including detailed performance evaluations.

Chapter 7

DRPU FPGA Implementation

Simulating the performance of the DRPU hardware architecture in software would be possible, but it is difficult to do cycle accurately, as key parameters such as frequency, chip area, and latencies need to be estimated. Thus, design issues that cause timing problems or high area consumption are difficult to find without implementing the design. However, for an experienced hardware designer cycle accurate simulations can give first rough speed estimates after some hours of simulation. FPGA technology has developed to a state where it can be used as a prototype platform for large hardware designs without too much financial effort. Today's largest FPGAs can implement up to 89,000 4-input LUTs, 0.774 MByte of on-chip memory, and run at peak frequencies of more than 500 MHz for optimized designs (see the Virtex-4 XC4VLX200 [Xil06a] for more information). This thesis took advantage of this development: a working FPGA prototype of the DRPU hardware architecture has been implemented.

Figure 7.1: DRPU Prototype Platform from Alpha-Data equipped with one Xilinx Virtex-4 LX 160 FPGA


It can render complex and dynamic scenes at more than 20 frames per second including fully programmable shading. Besides performance, the capabilities of the prototype are comparable with those of rasterization hardware, with some limitations for dynamic scenes but advantages for shading. The prototype is extended with some test circuits to easily identify design weaknesses and to perform detailed analysis of performance in different scenarios. Thus some special counters count the number of active cycles of the different hardware units, the total computation time, etc. This chapter presents this prototype and detailed live measurements for demo scenes with simple and advanced shading effects like reflections, refractions, and shadows.

The prototype platform used (see Section 2.8) consists of a Xilinx Virtex-4 LX 160 FPGA [Xil06b]. However, due to the limited size of the FPGA, not all features of the DRPU architecture could be enabled for the prototype: integer and logic operations are not included, which limits memory reads to offsets of precomputed addresses. Write support is limited to using the result register OUT (similar to GPUs). A fixed register stack of 16 entries is provided with no automatic spilling to memory.

The hardware description of the DRPU prototype is about 8000 lines of ML [MTH90] code using the HWML library for hardware description (see Chapter 6 and [WBS06b]). The specification is fully parameterizable, thus each of the design parameters, like packet size, number of threads, latencies, and caches, can be changed by adjusting a single configuration file. For the FPGA prototype the configuration was adjusted to achieve the best possible performance by using the FPGA to its limits.

7.1 Test Scenes

The following table briefly describes all test scenes used for measurements with the prototype. For the static parts of the scenes the highly optimized OpenRT KD Trees [WBS02] were converted to B-KD Trees, while all index structures for dynamic geometry are built using the techniques from Chapter 3.

Shirley6: The Shirley6 scene shows a very simple closed room with textures.

0.5k triangles, 1 object instance, 1.5 MB textures

Conference: A complex conference room including a table and many instantiated chairs.

282k triangles, 52 object instances, 0 MB textures

Office: A fully textured office consisting of 34k triangles.

34k triangles, 1 object instance, 4 MB textures

Working Room: A view towards the highly detailed desk of a working room, with high resolution textures on the walls.

15k triangles, 1 object instance, 10 MB textures

Spheres: The Working Room from a different perspective, showing some bouncing, reflecting, and refracting spheres to demonstrate ray tracing effects. 20k triangles, 6 object instances, 10 MB textures

Hand: A hand animation generated with Poser [Pos06]. Vertices are sent to the FPGA via DMA in each frame.

17k triangles, 2 object instances, 16 MB textures

Skeleton: A skeleton model that shows a typical game-like character animation. This mesh is again generated with Poser [Pos06], thus skinning is precomputed and vertices are sent via DMA in each frame. 16k triangles, 2 object instances, 0 MB textures

Helix: A complex character model showing the Helix robot from Poser [Pos06]. The model consists of 78k triangles, which is comparable to high resolution characters in current computer games. 78k triangles, 2 object instances, 0 MB textures

Pipe: A model of an elastic pipe to show the capabilities of rendering full animations including skinning. One skinning matrix is attached to each end of the pipe to skin vertices and normals. 1k triangles, 2 object instances, 2 skinning matrices, 1 MB textures

Morph: A morphing sequence between a cube, a sphere, and a cylinder. The demo shows the morphing capabilities of the Skinning Processor and large deformations of the mesh. Again not only the vertices but also the shading normals are skinned. 1.2k triangles, 2 object instances, 3 skinning matrices, 0 MB textures

Gael: A realistic game environment showing the fully textured Gael level from UT2004.

52k triangles, 1 object instance, 2.4 MB textures

DynGael: The static Gael level from UT2004 plus two skeletons and one Morph object to form a game-like scenario. 85k triangles, 4 object instances, 2.4 MB textures

SunCOR: A sunflower field with more than 5 thousand flowers (36k triangles each). The scene serves as an example of low coherence between adjacent rays. 187,145k triangles, 5,622 object instances, 0.76 MB textures

Random: This scene shows a bad case of dynamics: a morphing sequence between a cube, a random triangle distribution, and a cylinder. In the first and last stage the B-KD Tree fits the geometry well, but in the random state the structure of the tree no longer fits the structure of the underlying geometry, resulting in low rendering performance. The image shows the random state in the center of the animation. 4.3k triangles, 1 object instance, 3 skinning matrices, 0 MB textures

Beetle: This scene shows a car model with ray tracing shading effects such as shadows, a reflective car paint shader, and transparent windows. 680k triangles, 1 object instance, 0.064 MB textures

Theater: A complex theater with about 607k triangles, many of which are visible, resulting in a large working set and low coherence. 607k triangles, 1 object instance, 0.0 MB textures

7.2 FPGA Configuration

The FPGA prototype architecture is widely multi-threaded: there are 32 packets with four rays each in the system, making a total of 128 hardware threads. This number of threads is sufficient to achieve high usage rates of the arithmetic units (see Section 7.2.1). On the FPGA, the floating point computations are limited to 24 bits, because the 18 bit fixed point multipliers already available on the FPGA have been used to reduce the complexity of the design. This limited floating point accuracy is sufficient for most of the test scenes. To reduce the driver overhead, floating point numbers are stored in memory in the 32 bit IEEE 754 format and thus need to be converted on reads and writes by the FPGA. The traversal cache, vertex cache, shading cache, and instruction cache are all direct mapped and hold 512 B-KD Tree nodes, vertices, quad words (128 bits) for shading, or instructions respectively. All statistics in this chapter use a resolution of 512x386 pixels.

A DRAM memory interface for a DDR-DRAM was implemented that performs advanced bank management by leaving a DRAM row open if the next operation also accesses this row. The controller supports auto precharge, and the address translation splits an address from the FPGA into the row address, bank, and column address in this order to achieve the highest probability that a row can stay open. Thus localized memory requests often map to the same row of a bank (as row and bank address are high bits), and more random (but still local) memory requests often map to a different bank (as the bank bits are lower than the row bits). This way, sequential or local reads can be achieved with optimal efficiency. The controller supports bursts of length 2, which is length 1 in the FPGA clock domain. This reduces the efficiency for random memory requests due to the row management overhead, which could be hidden if larger burst requests (maybe of size 8) were performed.

The prototype architecture uses about 99% of the logic cells, 165 of the 288 block memories (57%), and 58 of the 96 18-bit multipliers (60%) of the FPGA chip. These numbers show that the FPGA is used to its limits, which sometimes caused problems with routing and over-mapping.
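The address translation described above can be illustrated by a small sketch; the field widths (10 column bits, 2 bank bits) are assumptions chosen for the example only, while the ordering row | bank | column follows the text:

fun split_address addr =
  let
    val column = addr mod 1024            (* low bits:    column address *)
    val bank   = (addr div 1024) mod 4    (* middle bits: bank           *)
    val row    = addr div (1024 * 4)      (* high bits:   row address    *)
  in
    (row, bank, column)
  end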

7.2.1 Latencies and Throughput

In hardware architecture, the latency of a unit is defined as the average number of cycles it requires from the beginning of a computation until it outputs the result. For the pipelined units used in the DRPU hardware architecture this latency is constant for an atomic operation like one traversal step or one ray/triangle intersection. In the case of many dependencies, large latencies often slow down hardware designs, as many cycles are required for the computations to finish. The throughput of a hardware unit is defined as the average number of results produced per clock cycle. A throughput of 4 means that four results (such as four single ray traversal computations) leave the hardware unit each cycle, while a throughput of 1/2 means that a result is computed only every second cycle (such as one ray/triangle intersection every two cycles).

A high latency or a low throughput does not necessarily slow down a hardware design. For instance, a high latency unit with high throughput can be used efficiently if enough parallel computations are available to execute on it. Conversely, a unit with a low throughput does not necessarily slow down a hardware design if only a few computations need to be performed on it. Table 7.1 summarizes the latencies and throughput of the main units of the DRPU prototype design. The Geometry Processor (GP) in particular has a high latency due to the many dependencies of the Möller-Trumbore ray/triangle intersection and the high latencies of the 24 bit floating point computations used. At its operating frequency of 66 MHz, the prototype can perform 66 million B-KD Tree node updates and the same number of vertex matrix multiplications in skinning mode. The Traversal Processor can finish 264 million ray traversal steps per second and the Shading Processor 528 million shading instructions per second.

Unit                  Latency   Throughput
Shading Processor     16        8
Traversal Processor   14        4
Geometry Processor    36        1/2
Skinning Processor    36 + 4    1
Update Processor      6         1

Table 7.1: Latency and throughput of the main units of the FPGA prototype. The Shading Processor computes 8 instructions per clock cycle due to instruction pairing and packeting.

Figure 7.2: Usage of the Rendering Units for three typical scenes. In the game-like Gael scene, the hardware usage is optimally balanced. For simple scenes (like the Morph scene) the usage of the Traversal Processor is low, because few traversal steps need to be performed. Scenes with a very high number of triangles (such as SunCOR) show low usage rates of all hardware units, because they are bandwidth limited.

The Geometry Processor is the slowest unit, with a throughput of 1/2; thus 33 million ray/triangle intersections can be performed per second. This is sufficient, as ray/triangle intersections need to be computed only every 8th traversal step on average. Building a unit with a high throughput is typically more expensive than building a slow one with low throughput. Because the slowest unit of a chain determines the operating speed, all hardware units that operate in sequence and exchange results should be balanced in throughput and workload such that they operate at the same speed and are equally used. The slowest unit in such a chain should always be an expensive one to get optimal efficiency. Figure 7.2 shows that this is the case for the DRPU architecture. For instance, for the Gael scene, which is a realistic gaming scenario, the usage rates of the units are perfectly balanced. In the Morph scene, the usage of the Traversal Processor is quite low, because few traversal steps are required to traverse a ray through this simple scene. However, as each ray needs to be shaded, the usage of the Shading Processor is higher than the usage of the Traversal Processor. Furthermore, each ray is typically intersected with a similar number of triangles, thus the usage of the Geometry Processor is also high. Scenes with many visible triangles (such as the SunCOR scene) show low hardware usage of all three rendering units, because the computation is bandwidth limited. Caches cannot help effectively to reduce memory bandwidth in these situations (see Table 7.6). Table 7.2 shows usage rates for more test scenes.
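These per-second rates follow directly from the 66 MHz clock and the throughputs listed in Table 7.1:

66 MHz · 1   =  66 million B-KD Tree node updates (or skinned vertices) per second
66 MHz · 4   = 264 million traversal steps per second
66 MHz · 8   = 528 million shading instructions per second
66 MHz · 1/2 =  33 million ray/triangle intersections per second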

Obviously one can always construct situations where the units are totally unbalanced. For instance, when standing in front of a wall the usage of the Traversal Processor will be poor, because all rays terminate immediately. Conversely, shooting hundreds of shadow rays will increase the usage of the Traversal and Geometry Processors, while only few shading operations are performed to accumulate the light from these directions.


Figure 7.3: Usage of the rendering units depending on the number of packets in the system, shown for the Morph and Gael scenes.

During rendering, massive multi-threading is used to cope with the high latencies of the hardware units. Figure 7.3 shows that multiple packets can very effectively increase the usage of the hardware (at the cost of more local memory). The figure shows the usage of the rendering units for the Morph and Gael scene depending on the number of 4-thread packets that are available in the pipeline. For few threads the curve is linear, thus increasing the number of threads increases the performance of the system linearly, while towards the right it becomes strongly sub-linear. Increasing the number of threads much beyond 32 does not seem to make much sense, which will also be shown by an efficiency analysis in Chapter 8.

The high throughputs of 4 and 8 of the Traversal Processor and Shading Processor are possible due to the synchronous execution of 4 threads in a packet by parallel hardware and due to instruction pairing. However, this synchronous execution of the threads works best as long as they execute the same instruction. If the control flow diverges, the packet needs to be split and the control flow of these split packets is then sequentialized. The Shading Processor uses an activity mask to perform this splitting of packets. The average number of threads that are active in a packet is called the instruction packeting efficiency. As the following table shows, this instruction packeting efficiency is mostly over 99% for the test scenes. The reason for this high instruction packeting efficiency is that a packet is split for incoherent branches in shaders (but branches are few) and incoherent shader calls, while larger areas of pixels typically carry the same shader code (even if assigned different shading parameters such as color). This causes packets to be split at only a few one-dimensional borders between different shaders or different branches (such as shadow borders).

Scene           Instr. Packeting    Usage
                Efficiency          TP        GP        SP
Shirley6        99.9%               69.3%     75.7%     92.0%
Conference      99.5%               58.5%     69.5%     21.9%
Office          99.8%               62.3%     62.4%     75.3%
Working Room    99.9%               70.2%     81.3%     50.6%
Spheres         99.0%               61.7%     78.0%     34.0%
Hand            99.0%               56.3%     80.1%     63.5%
Skeleton        99.3%               48.0%     80.4%     62.3%
Helix           98.9%               45.8%     59.8%     25.7%
Pipe            98.8%               49.5%     88.8%     83.8%
Morph           99.4%               51.8%     94.2%     69.8%
Gael            99.8%               68.3%     60.8%     32.3%
DynGael         99.2%               70.4%     64.7%     34.6%
SunCOR          78.0%               20.7%     30.1%     3.8%
Random          92.9%               31.8%     39.3%     0.8%
Beetle          92.8%               36.2%     42.1%     18.9%
Theater         99.2%               49.0%     40.5%     14.7%

Table 7.2: Instruction packeting efficiency and hardware usage: the instruction packeting efficiency (average fraction of active rays in a packet) is very high for most of the sample scenes. This is also true for the Random scene; however, the many distributed triangles cause more shadow borders, which reduces the instruction packeting efficiency to 92.0%. The SunCOR scene shows the worst instruction packeting efficiency, as it consists of many triangles and thus exhibits low coherence during rendering. The usage rates vary depending on the ratio between the average number of traversal steps, triangle intersections, and shading computations that need to be performed.

7.2.2 Floating Point Performance

The three main parts of the ray tracing algorithm (shading, traversal, and intersection) are floating point intensive, thus a high peak floating point performance is essential for high rendering performance. As Table 7.3 shows, the DRPU prototype contains a total of 113 floating point units, distributed over the Shading Processor, Traversal Processor, Geometry Processor, and Skinning Processor. Some floating point comparisons are used in the Update Processor, but they are ignored here because they are not expensive. At its operating frequency of 66 MHz a peak floating point performance of 7.5 GFlops can be reached with the prototype. This high number of floating point units is the main reason why the FPGA is nearly filled.
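This peak figure follows directly from the unit count and the clock frequency:

113 floating point units · 66 MHz ≈ 7.5 GFlops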

Unit                  fadd   fmul   reciprocal   Total
Shading Processor     16     16     4            36
Traversal Processor   16     16     4            36
Geometry Processor    20     18     1            39
Skinning Processor    4      -      -            4
Total                 56     50     9            113

Table 7.3: Number of floating point units.

The floating point adders, especially, cost a large amount of FPGA logic resources because of the initial alignment of both numbers and the later normalization. The floating point multiplication is quite cheap on the FPGA because the 18 bit block multipliers are used, which requires only an additional exponent addition and a simple normalization to be performed using FPGA logic. Fixed point arithmetic could reduce the size of the circuit but would be more difficult to handle and would limit shading, where high dynamic ranges are often required to represent principally unbounded light intensities. For traversal, fixed point arithmetic could in principle be used, however again with some limitations. The bounding box of the scene would need to be normalized (to the unit cube for instance) and the distances computed along the ray would be limited in their range. Thus viewing geometry (such as small objects) from far away would be a problem.

7.3 Performance Evaluation

Detailed performance evaluations of the FPGA prototype using all test scenes have been performed. Table 7.4 shows the number of cycles required to skin the mesh (if skinned by the Skinning Processor), to update the B-KD Tree, and to render the final image at 512x386 resolution. The shown frame-rates are directly derived from these cycle counts, while the real display frame-rate is slightly lower because of the frame-buffer read-back and some driver overhead. All shaders for the Shading Processor are written in assembly code with many hand optimizations.

The rendering is performed with three different shaders: Triangle-UV shading (which encodes the uv-coordinates of the hit-point into a color), Phong shading with one point light source (but no shadows or further secondary rays), and Phong shading with one point light source with shadows (and further secondary effects). The Phong shaders perform normal interpolation and transformation to world space, vertex color interpolation, texture coordinate interpolation, bilinear texture filtering, and the Phong formula with a specular exponent of 4. The applied Phong shader is quite complex, with a total of 71 instructions, 26% of which are paired. The Phong shader is different from the one printed in Appendix B, because here shadow rays and textures are also considered.

All parameters of the Phong model as well as the data per vertex need to be read from memory, resulting in 21 memory fetches (57% of which are paired with arithmetic instructions).

Scene           skinning   update    triid                Phong                shadowed Phong
                cycles     cycles    render      fps      render      fps      render      fps
Shirley6        -          -         1.67M       39.5     3.38M       19.7     3.67M       17.9
Conference      -          -         5.65M       11.6     6.19M       10.6     12.37M      5.3
Office          -          -         2.52M       26.1     4.41M       15.0     5.20M       12.7
Working Room    -          -         2.62M       25.1     4.16M       15.8     6.66M       9.9
Spheres         -          -         3.61M       18.3     5.54M       11.9     10.66M      6.2
Hand            -          118k      2.32M       27.1     3.31M       19.3     3.79M       16.9
Skeleton        -          113k      1.86M       33.4     2.53M       25.0     3.22M       19.7
Helix           -          602k      2.78M       19.5     3.31M       16.83    6.30M       9.6
Pipe            3.5k       6.1k      1.51M       43.4     2.21M       29.6     2.45M       26.7
Morph           10.5k      7.8k      1.26M       51.4     1.55M       42.1     1.76M       37.0
Gael            -          -         4.04M       16.3     5.18M       12.7     10.40M      6.3
DynGael         10.7k      121k      3.95M       16.1     4.68M       13.7     9.78M       6.7
SunCOR          -          -         33.36M      2.0      34.84M      1.9      -           -
Random          50.5k      17.3k     39.50M      1.7      39.90M      1.6      109.87M     0.6
Beetle          -          -         5.74M       11.6     6.56M       10.1     19.7M       3.4
Theater         -          -         8.85M       7.5      10.15M      6.5      19.0M       3.5

Table 7.4: Performance of several test scenes computed on one DRPU FPGA clocked at 66 MHz at a resolution of 512x386 pixels.
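The frame-rates in Table 7.4 are derived directly from the cycle counts at the 66 MHz clock; for example, for the Shirley6 scene with Triangle-UV shading:

66,000,000 cycles per second / 1.67 million cycles per frame ≈ 39.5 frames per second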

The number of cycles required for skinning and updating obviously does not depend on the type of shading, as these computations are performed independently. A comparison between the different scenes shows that rendering of simple scenes, such as the Shirley6 scene, reaches a high performance of up to 17.9 fps, while more complex scenes (such as the Gael scene) render at only 6.3 fps. As the shading complexity is essentially the same for these two scenes, the performance decrease comes mainly from a much higher ray casting cost. Of most interest is the DynGael scene, because on the one hand it consists of a static environment from a current computer game, and on the other hand it contains 3 dynamic objects to make a game-like scenario. Thus the scene shows that the DRPU could also be used to render current computer games. The most complex scene is the SunCOR scene, consisting of 33 million triangles. However, because of the incoherent view onto the triangles (see Section 7.1), performance drops to two frames per second. Irregular memory access patterns in particular cause this low performance due to low DRAM efficiency: a ray that hits a sunflower usually traverses a different path through the B-KD Tree and hits a different triangle than neighboring rays do. Neither packeting nor caches can help to speed up these incoherent computations.

A second worst case scenario is shown by the Random scene. Here some random motion is generated, which results in very poor rendering performance with the applied B-KD Trees, because the B-KD Tree structure cannot match the structure of the random motion. This results in many overlaps of B-KD Tree bounds and the consequent traversal of many nodes of the tree. The Beetle scene shows complex shading in a complex environment. The Beetle model consists of about 680k triangles and contains a car paint shader that performs Phong shading and accurate reflections. The glass plates contain a transparency shader, and each material casts pixel accurate shadows. Despite this complexity in shading, the scene still achieves 3.5 frames per second.

An interesting result is that for some scenes (such as the Shirley6 scene) enabling shadows is nearly for free (19.7 versus 17.9 fps). This is possible because the Ray Casting Units and the Shading Processor can operate in parallel. If shading complexity and traversal complexity are optimally balanced (such as in Shirley6), the shadow ray can be shot in nearly the same time as shading is performed. This results in an increased usage of the hardware units, as virtually more threads are available (some threads are executed in two units in parallel). Looking at the dynamic scenes, the number of skinning and updating cycles is between one and two orders of magnitude below the rendering cycles, showing that computing dynamic scene changes is usually not the bottleneck of the DRPU; ray casting and shading are.

7.4 Memory Interface

To achieve this high ray tracing performance, some optimizations of the memory interface were required. The Traversal Processor, Geometry Processor, and Shading Processor of the FPGA prototype are all connected to the memory interface via a 128 bit (16 byte) wide bus. Without any optimization they would require a peak bandwidth of 12.6 GB/s at 66 MHz during rendering, which is too much for real-time applications.

The Rendering Units use packeting to reduce hardware complexity and memory bandwidth. This is true for the Traversal Processor and Geometry Processor, which operate on spatial index nodes and vertices, and for the Shading Processor, which synchronously executes packets of threads. As it is programmable, one cannot guarantee that the four threads of a packet read data from the same address, thus four single memory requests would typically be required. However, because of the high coherence, these memory requests often access the same memory cell. For this reason, the Shading Processor can pack the memory requests to reduce memory bandwidth, such that only as many requests are performed per packet as required. This results in up to 4 packed memory requests for a packet.
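In the best case all four threads of a packet request the same address, so a single packed request replaces four individual requests:

1 - 1/4 = 75% bandwidth reduction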

Scene           memory packing    data per frame
                efficiency        before packing   after packing   reduction
Shirley6        77.7%             59.7 MB          19.2 MB         67.8%
Conference      84.0%             41.0 MB          12.2 MB         70.1%
Office          78.0%             78.3 MB          25.1 MB         68.0%
Working Room    64.2%             59.6 MB          23.2 MB         61.1%
Spheres         60.2%             61.7 MB          25.9 MB         58.0%
Hand            72.3%             38.7 MB          13.4 MB         65.4%
Skeleton        92.4%             27.0 MB          7.3 MB          72.9%
Helix           78.4%             19.8 MB          6.3 MB          68.2%
Pipe            83.4%             29.4 MB          8.8 MB          70.0%
Morph           92.7%             14.1 MB          3.8 MB          73.0%
Gael            64.2%             59.6 MB          23.2 MB         61.0%
DynGael         76.7%             59.5 MB          19.4 MB         67.4%
SunCOR          30.7%             10.6 MB          8.6 MB          18.8%
Random          80.2%             17.0 MB          5.3 MB          68.8%
Beetle          67.4%             48.8 MB          18.0 MB         27.5%
Theater         54.4%             40.9 MB          18.7 MB         54.3%

Table 7.5: Memory packing efficiency in the Shading Processor.

This memory request packing acts like a small cache that reduces bandwidth, as data is fetched once and then distributed to the threads of the packet. For the packet size of 4, this bandwidth reduction can be up to 75% if all threads read from the same address. The average number of threads that are active in a packed memory request is a measure of how well this bandwidth reduction works. If all memory requests can be packed and a packet never gets split, then this memory packing efficiency is 100%. This means that one packed memory request is performed and the data is distributed to all four active threads. The memory packing efficiency is typically smaller than the instruction packeting efficiency, since an optimal packing of memory requests results in the same activity mask for the memory request as for the instruction; incoherent memory accesses can only reduce this number of active threads. For example, at triangle borders the rays in a packet access different triangle data, which reduces only the memory packing efficiency. Table 7.5 shows packing efficiencies of mostly over 70%, even though texture lookups are also considered and are difficult to pack. These numbers show that coherence between rays is sufficiently high for these test scenes and a packet size of 4, even at a relatively low image resolution. As a special property, memory packing can reduce memory bandwidth even if the local working set of the rays is too large to fit well into the caches, as is the case for the Random scene. There are a few scenes with lower memory packing efficiency: SunCOR, Beetle, and Theater. These scenes contain many triangles and most of them are visible, thus it frequently occurs that neighboring rays do not access similar data. For such scenes packeting cannot effectively reduce the bandwidth.

A data reduction of at most 75% is still not enough to feed all the rendering units. One way to further optimize the memory bandwidth is to temporarily store fetched data in a cache and reuse it later if it is accessed again. This caching can also be applied efficiently during rendering, and even the small direct mapped caches of the FPGA with only 512 cache lines reduce the bandwidth to a high degree, as shown in Table 7.6. The hit-rates of the caches are mostly over 90%, except for the Shading Processor, because it performs texture lookups and textures cannot be cached well. The table shows that after caching most of the external bandwidth is consumed by the Traversal Processor for traversal and the Shading Processor for shading. The Geometry Processor consumes little bandwidth because triangle vertices are optimally shared between triangles. In incoherent situations, such as the SunCOR, Theater, or Beetle scenes, caching is no longer effective. For incoherent cases, even increasing the size of the cache would not result in higher hit-rates, because data is not reused very often by the computation. For a speed-up of scenes such as SunCOR, raw memory bandwidth to the external DRAM chips or some level-of-detail technique would be required to get better performance numbers.

Unfortunately, DRAM memory chips suffer from bad performance for non-sequential memory requests, because one needs to open and close memory rows to access their content. The Traversal Processor, Geometry Processor, and Shading Processor generate non-sequential memory requests that can be processed with a bandwidth of only 300 MB/s by the used DRAM.
The possible peak bandwidth would be 1 GB/s, thus an efficiency of about 20-30% is reached (see Table 7.7). For non-sequential requests this is comparably high, as each request needs to open the correct memory row, wait 2 cycles, perform its request, wait 2 cycles until data is available, close the row again, and wait 2 cycles before the next row can be opened. This sequence would require about 10 clock cycles for a single memory request and result in a poor DRAM usage of only 10%. As DRAM chips contain several independent memory banks, consecutive requests can often be interleaved, and rows can be left open if they are accessed again by the following memory request. Even higher efficiency is achievable by performing burst requests of 4 quad words (burst size 8) instead of the single quad word requests (burst size 2) the prototype uses. Unfortunately, increasing the burst size turned out to reduce performance on the FPGA, as much data is fetched but not used. Larger caches and a clever data layout would be required to make larger burst requests effective. For instance, if increasing the burst size allowed the DRAM to be accessed with an efficiency of 60% but only 50% of that data could be used, this would result in a real efficiency of only 30%, which is only a small improvement over the 25% achieved with requests of optimal size. For incoherent scenes, using larger burst sizes is never a good choice, as neighboring data is rarely used by the rendering units.

Table 7.6: Cache hit-rates (TP, GP, SP) and data per frame before and after the cache [MB] at a resolution of 512x386 pixels. Most scenes show high cache hit-rates; the Random scene has slightly lower traversal hit-rates because the direct mapped caches are too small to hold its working set, and the SunCOR scene contains so many visible triangles that traversal, intersection, and shading often access different data per ray.

                 Skinning Processor       Update Processor         Rendering
Scene            efficiency   usage       efficiency   usage       efficiency   usage
Shirley6         -            -           -            -           32.8%        21.9%
Conference       -            -           -            -           23.7%        65.8%
Office           -            -           -            -           27.7%        35.4%
Working Room     -            -           -            -           28.5%        61.8%
Spheres          -            -           -            -           26.1%        76.5%
Hand             -            -           36.5%        100.0%      21.7%        55.1%
Skeleton         -            -           36.1%        100.0%      25.8%        27.0%
Helix            -            -           36.6%        100.0%      22.8%        60.2%
Pipe             -            -           39.6%        100.0%      28.4%        24.5%
Morph            58.3%        95.4%       38.5%        100.0%      35.1%        11.3%
Gael             -            -           -            -           26.5%        67.2%
DynGael          57.6%        95.2%       36.2%        100.0%      24.8%        59.0%
SunCOR           -            -           -            -           20.4%        99.1%
Random           57.6%        98.1%       36.7%        100.0%      35.1%        57.8%
Beetle           -            -           -            -           20.9%        91.2%
Theater          -            -           -            -           21.6%        92.9%

Table 7.7: DRAM Memory Statistics: The table shows statistics of the DRAM usage during skinning, updating, and rendering. Skinning and updating are both bandwidth limited (DRAM usage near 100%). The efficiency with which the DRAM is accessed is highest for the Skinning Processor (as it operates purely on streams) and lower for the Update Processor. The rendering units have the lowest DRAM efficiency because they perform irregular memory requests. However, their DRAM usage is far from 100% for most scenes, thus rendering is computation limited. The situation is different for the incoherent SunCOR, Beetle, and Theater scenes: their DRAM usage is high (more than 90%), thus they are bandwidth limited.

In contrast to the rendering units, the Skinning Processor and Update Processor perform far more sequential memory requests, thus their DRAM efficiency is much higher, as shown in Table 7.7.
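To make the memory request packing discussed above more concrete, the following small sketch (written in C++ purely for illustration; it is not part of the DRPU hardware description, and all type and function names are made up) groups the word requests of the up to four active threads of a packet by address, so that each distinct address is fetched from memory only once and the result is distributed to all threads that requested it:

#include <cstdint>
#include <map>
#include <vector>

// One pending word request per active thread of a packet (packet size 4).
struct ThreadRequest {
    int      slot;      // thread slot within the packet (0..3)
    uint32_t address;   // requested word address
};

// A packed request: fetched once, distributed to all listed thread slots.
struct PackedRequest {
    uint32_t         address;
    std::vector<int> slots;
};

std::vector<PackedRequest> packRequests(const std::vector<ThreadRequest>& requests) {
    std::map<uint32_t, PackedRequest> byAddress;
    for (const ThreadRequest& r : requests) {
        PackedRequest& p = byAddress[r.address];   // group threads by address
        p.address = r.address;
        p.slots.push_back(r.slot);
    }
    std::vector<PackedRequest> packed;
    for (const auto& entry : byAddress)
        packed.push_back(entry.second);
    return packed;   // if all 4 threads share one address, a single external request remains
}

With this grouping, the memory packing efficiency of Table 7.5 corresponds to the average number of thread slots served per packed request, divided by the packet size of 4.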

7.5 Scalability with Number of FPGAs

Scalability is an important aspect for rendering, especially if a single rendering unit (such as our FPGA prototype) does not achieve high enough performance for an application. It has long been shown that real-time ray tracing in software scales extremely well to tens to hundreds of CPUs [WPS+03, PSL+99], even for complex shaders and scenes. This is possible because the rendering part of ray tracing is embarrassingly parallel if the bandwidth to the scene database is sufficient for each rendering unit. For the FPGA prototype we solve this issue by providing each FPGA with a copy of the complete scene. On the one hand this guarantees that the bandwidth to the scene scales with the number of rendering FPGAs, on the other hand the scheme is simple to implement.

With our hardware equipment we can scale the architecture to two FPGAs for rendering, where each is programmed with a copy of the same DRPU design (see Table 7.8). The load balancing is performed by the driver application, which subdivides the screen into 12 tiles that are scheduled to both FPGAs. As soon as one FPGA finishes its tile it gets assigned a new one (a small sketch of this tile scheduling is given at the end of this section). Using this approach the pure rendering time (excluding the setup time for dynamic scenes) scales mostly linearly.

Unfortunately, with the two FPGA cards we can measure the scalability only up to two FPGAs. However, using more FPGAs for rendering should cause no problems, as the scene bandwidth scales linearly with this approach. The only bottleneck might be the communication overhead for controlling the FPGAs by the driver, or for reading back the computed pixels for display. But even high resolution images of 1024x768 pixels displayed at 60 fps would require a bandwidth of only 188 MB/s, which is feasible with current PCI Express standards.

Scaling the performance in the case of many dynamic scene changes is more challenging, as these changes need to be computed by or sent to each of the parallel FPGAs. The FPGA prototype uses an approach where each FPGA skins and then updates all dynamic objects of the scene. As a limitation, this approach only scales the rendering, not the setup time needed to compute the dynamics. For instance, the total display time for the Helix model does not scale linearly, as shown in Table 7.8. The reason is the large setup time of 50ms, which is in the same range as the rendering time on one FPGA. Thus by using two FPGAs the render time reduces linearly (from 58ms to 30ms), but the total display time improves only by about 20% (from 105ms to 85ms).
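The tile scheduling just described can be illustrated with the following small sketch; the Fpga type and its renderTile() call are placeholders for the driver's actual FPGA interface, not the real driver code:

#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Placeholder FPGA handle; renderTile() is assumed to block until the FPGA
// has finished rendering the given screen tile.
struct Fpga {
    void renderTile(int /*tile*/) { /* stub: would trigger the hardware here */ }
};

// Dynamic load balancing: the screen is split into a fixed number of tiles
// (12 for the prototype) and each FPGA pulls a new tile as soon as it
// finishes its current one.
void renderFrame(std::vector<Fpga>& fpgas, int numTiles = 12) {
    std::queue<int> tiles;
    for (int t = 0; t < numTiles; ++t) tiles.push(t);
    std::mutex m;

    auto worker = [&](Fpga& fpga) {
        for (;;) {
            int tile;
            {
                std::lock_guard<std::mutex> lock(m);
                if (tiles.empty()) return;     // no work left for this FPGA
                tile = tiles.front();
                tiles.pop();
            }
            fpga.renderTile(tile);
        }
    };

    std::vector<std::thread> workers;
    for (auto& f : fpgas) workers.emplace_back(worker, std::ref(f));
    for (auto& w : workers) w.join();
}

Because an FPGA immediately pulls the next tile from the shared queue when it finishes, cheap and expensive tiles balance out and the pure rendering time scales roughly with the number of FPGAs.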

7.6 Scalability with Number of Triangles

To get higher accuracy, digital production companies often use scenes that are so highly triangulated that many triangles map to the same pixel. This results in scenes with multiple millions of triangles that are difficult to render. If hierarchical spatial index structures are used, ray tracing is logarithmic in the number of scene primitives, as analyzed by different researchers [Hav01, Wal04]. However, this logarithmic complexity property is only valid if the cost for accessing memory is constant.

Table 7.8: Scalability of the DRPU prototype with the number of FPGAs used for rendering at 512x386 resolution, using Phong shading without shadows. For one and two FPGAs the table shows the setup time (skinning plus update plus DMA upload), the render time, the total time, and the achieved display frame-rate.

Figure 7.4: Scalability with the number of scene triangles for a constant working set, achieved through a fixed view on one of multiple copies of the Gael scene.

For the DRPU hardware architecture this is often not the case, especially because of packeting, caching, and the limited memory bandwidth. Thus, for very incoherent scenes with many visible triangles the performance drops, because mostly few threads per packet are active, little data is reused through the cache, and the memory bandwidth can no longer feed the units.

However, if the computation stays coherent and the working set remains constant, then the performance of a ray tracing system is logarithmic in the scene complexity. To show this with the DRPU prototype, a view into the Gael scene has been fixed and rendered. Outside the visible part, different numbers of copies of the Gael level have been placed. These copies do not influence the rendered image, only the scene complexity. This kind of scene is typical for indoor scenes where occlusion is high. As Figure 7.4 shows, the performance for this set of scenes is essentially the same, overlaid with some noise introduced by the cache. The reason for the similar performance is that only a few additional traversal steps are required to find the right start cell; from there on, the visited cells remain exactly the same.

Figure 7.5: Rendering performance for different triangle densities (number of triangles per pixel) for a regularly triangulated wall.

A second measurement shows how the performance depends on the number of visible triangles, i.e. on coherence. To this end, a regularly triangulated wall was rendered at different distances. Figure 7.5 shows the number of triangles per pixel for the different views and the corresponding rendering speed of the FPGA. As the number of traversal steps and ray/triangle intersections is the same for each view, the frame-rate should stay equal. However, it drops with higher triangle density and lower coherence. The reason is shown in the right diagram, which plots the usage of the external DRAM memory chip: the usage increases as coherence drops, so the computation becomes bandwidth limited in this incoherent case. Once the DRAM usage reaches 100%, the frame-rate should stay constant, because from this point on memory requests again have constant (even if quite high) cost.

7.7 Conclusions

This chapter presented the fully working FPGA prototype of the DRPU hardware architecture, which implements a DRPU configuration with reduced complexity, such as 24 bit floating point accuracy. Performance statistics have been presented for many different test scenes, such as the Gael scene, which shows a level from a current computer game. For these test scenes, frame-rates from 10 to 40 fps have been achieved at 512x384 resolution, depending on scene and shading complexity. The rendering performance scales well to two FPGAs, while skinning and update are more difficult to parallelize.

The usage rates of the rendering units prove to be well balanced between Traversal Processor, Geometry Processor, and Shading Processor for typical scenes. Optimizations to the memory interface (such as memory request packing and caching) are very effective for coherent scenes, making the memory interface sufficiently fast. However, incoherent scenes do not benefit from these optimizations, as they tend to be bandwidth limited. This is shown by a sample scene in which a highly triangulated plane is viewed from different distances: views with high incoherence result in poor rendering performance, even though the computational work for each view is nearly the same.

While the FPGA allows for impressive demonstrations of the DRPU architecture and already achieves high performance, it is not fast enough to be used in current computer games, especially at 1024x768 resolution. However, it is very useful as a debugging platform and for accurately estimating the performance of ASIC implementations, which are analyzed in the next chapter.

Chapter 8

DRPU ASIC Implementation

This chapter describes an ASIC mapping of the DRPU prototype using a 130nm CMOS process from UMC [UMC06]. The ASIC platform is quite different from the FPGA platform used for prototyping: logic in particular is much more expensive on the FPGA, while on-chip memories are very expensive on the ASIC. These different costs would also result in a different DRPU implementation if it were optimized for the ASIC process. In particular, techniques that reduce the latencies of operations, such as forwarding, could reduce the on-chip memory because fewer threads would be required. However, these techniques are often logic consuming and complicated, and thus have not been implemented on the FPGA.

For simplicity, the FPGA prototype implementation was mapped to the ASIC process without architectural changes by selecting an optimal configuration. This optimal DRPU configuration is determined by looking at an efficiency estimate for speed per die area. For the most efficient configuration, precise area and frequency results are presented that allow the ray tracing performance achievable with this 130nm CMOS process to be estimated. In addition, an extrapolation of the results from the 130nm process to a 90nm process using constant field scaling is presented. It shows that with clock rates of 400 MHz and a 196mm2 die area, ray tracing performance levels of 80 to 280 fps can be achieved at 1024x768 resolution in the test scenes. This is about 70 times faster than achievable with the OpenRT software ray tracer on standard multi-GHz desktop CPUs.

For simplicity, the DRPU core was not connected to external DRAM or PCI Express in the ASIC implementation. Thus, this chapter concentrates on synthesis and on place and route of the DRPU core, which is sufficient for performance estimates. No simulation of the gate level net-list was performed, because some additional circuitry for initialization of the chip

would be required first. Low-level checks like cross talk and DFT have been omitted, thus the ASIC design presented here cannot be manufactured as it is, but it gives very precise performance estimates for an ASIC implementation of the DRPU architecture.

8.1 Synthesis of different DRPU Configurations

Logic synthesis is one of the most important steps of a design flow, as it transforms the HDL description of the circuit (in our case the VHDL description generated by HWML) into a technology specific gate level net-list. This transformation is performed under user constraints, such as maximal area, target operating frequency, and the cell libraries to use. The gate-level specification is used as input for the later place and route phase (see the next section). As the complete DRPU architecture is described at a high level of abstraction using HWML, it was easily possible to map several different configurations to the ASIC process. Besides generating a VHDL description for the combinational logic, this mapping also required generating memory instances using an SRAM memory compiler from Virtual Silicon [Sil06]. For each type of memory, a special SRAM component matching the number of ports, the number of words, and the word width was generated. This results in a total of about 36 different single and two-port memory configurations for a single configuration of the DRPU design.

The ASIC version does not suffer from space limitations, thus the floating point data path was increased to a full 32-bit single-precision width. Integer arithmetic has been included, as well as other features that had been disabled in the FPGA version. To make the performance extrapolation conservative, the ASIC version implements larger caches that are even four-way set associative: Shading Processor (16 KBytes), Traversal Processor (16 KBytes), and Geometry Processor (16 KBytes).

For synthesis the Design Compiler (DC) was used, a widely used design tool from Synopsys [Inc06d]. It optimizes the circuit for speed, area, and routability by applying a delay and wiring model. After synthesis, a report shows a frequency estimate and the minimal area the generated net-list would occupy on the die. This post-synthesis area is sufficiently precise to perform an efficiency estimation for different configurations of the DRPU. The most important parameters are the packet size, which is evaluated from 1 to 4, and the number of packets, which is tested from 1 to 32. The post-synthesis area for both parameterizations is shown in Figure 8.1 and Figure 8.2. The FPGA prototype was used to extrapolate the performance that could be reached with the different configurations at a fixed frequency of 266 MHz. This allows the design efficiency e to be computed as the quotient of the number of pixels per second #R and the ASIC area A of the configuration.

e = #R / A,        [e] = pixels / (mm2 · cycles)

For the Gael scene, this efficiency estimate is shown (in millions) in Figure 8.3 and Figure 8.4. Different scenes behave very similarly to the Gael scene, thus no additional figures for other scenes are provided. It turns out that the efficiency of the design increases up to about 25 packets and then decreases again. This is due to a saturation of the pipeline once there are more threads than required to hide the latencies. Increasing the packet size also increases the efficiency, because the computation is mostly highly coherent and some circuitry, such as the instruction cache, instruction scheduling, many smaller per-packet FIFOs, the control stack, etc., can be shared within a packet. This is also the reason why the area curve over the packet size does not pass through the origin.

With these results, the configuration of 32 packets and a packet size of 4 was chosen for the DRPU ASIC. No larger packet size was evaluated because we want to limit the worst case performance drop for very incoherent computations, such as highly complex scenes, to a maximum of 75%. Furthermore, a larger packet size would cause problems during routing, because some communication needs to be synchronized across the large packets. The Geometry Processor is optimized in computation speed for a packet size of 4, thus no high efficiency can be expected for larger packet sizes. For the number of packets, 16 and 32 are the best fitting power-of-2 configurations, with a higher efficiency for the 32 packet version.
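The configuration choice can be summarized by evaluating this efficiency metric for each candidate; the following sketch is only an illustration with placeholder numbers, the real values being taken from the FPGA extrapolation and the post-synthesis area reports (Figures 8.1 to 8.4):

#include <cstdio>
#include <vector>

struct Config {
    int    packetSize;      // threads per packet (1..4)
    int    numPackets;      // hardware packets   (1..32)
    double pixelsPerCycle;  // #R: extrapolated from the FPGA at a fixed frequency
    double areaMm2;         // A : post-synthesis area of this configuration
};

int main() {
    // Placeholder entries, not the measured values.
    std::vector<Config> configs = {
        {4, 16, 0.020, 30.0},
        {4, 32, 0.024, 34.0},
    };
    const Config* best = nullptr;
    double bestE = 0.0;
    for (const auto& c : configs) {
        double e = c.pixelsPerCycle / c.areaMm2;   // e = #R / A  [pixels / (mm2 * cycle)]
        if (e > bestE) { bestE = e; best = &c; }
        std::printf("packet size %d, %2d packets: e = %g\n", c.packetSize, c.numPackets, e);
    }
    if (best)
        std::printf("chosen configuration: packet size %d, %d packets\n",
                    best->packetSize, best->numPackets);
}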

Figure 8.1: Post Synthesis Area depending on Packet Size.
Figure 8.2: Post Synthesis Area depending on Number of Packets.

For the chosen configuration, the design flow was started again with greater effort during synthesis. First the VHDL design, consisting of 315 files, was loaded, as well as the typical case libraries for the UMC standard cells and the 36 different memory types. The maximum transition time was set to 3ns, which corresponds to a target frequency of 333 MHz. Using a machine with 3 GB of main memory it was possible to perform a flat synthesis of the complete design.

Figure 8.3: Efficiency of the DRPU depending on Packet Size.
Figure 8.4: Efficiency of the DRPU depending on the Number of Packets.

The timing constraints were nearly met, with a worst negative slack of 0.11ns and a resulting maximal operating frequency of 321 MHz for the typical case (1.25V, 25◦C). The worst case (1.08V, 125◦C) frequency is 161 MHz. The area for combinational logic is 3.9 mm2, while the area for non-combinational logic (registers and memories) is 33.9 mm2. The memories in particular take a large part of the overall chip area of 37.9 mm2, mainly because of multi-threading. The tool reports a power of 1.8W, which seems too optimistic for the 113 floating point units, but no more detailed power analysis was performed.

At over 58%, the Shading Processor consumes the largest part of the DRPU architecture, see Figure 8.5. This number will increase yet further after Place and Route and is mainly due to the very large register files, which total 1.2 MBits. These register files consume a lot of raw space on the chip and need to be placed sparsely to make routing possible. The Traversal Processor and Geometry Processor are required for efficient ray tracing and make up 29% of the chip. Here the largest area is spent on the traversal stacks, ray storage, and the connection of the Geometry Processor to memory (caches and temporary triangle storage). At 4.2%, the Update Processor required for B-KD Tree updates consumes a tiny part of the chip, while skinning needs hardly any additional resources as it is shared with the Geometry Processor. Some other units consume additional space, such as the DRAM memory interface, the DMA interface, the write buffer for pixel data, etc.

Today's high end rasterization graphics cards use a 90nm process for their implementation, such as the ATI R520 with a 288mm2 die area. We want to scale the DRPU design to use a comparable amount of hardware resources, but as we have no access to a 90nm process we cannot provide precise timing results using Cadence SOC Encounter. However, extrapolations using constant field scaling are reasonably accurate [WH05].

Unit        gates [M]   area [mm2]   relative area
SP          4.2         22.1         58.3%
TP          1.1          6.0         15.8%
GP          1.0          5.2         13.7%
Update P.   0.3          1.6          4.2%
Others      0.6          3.0          8.0%

Figure 8.5: Size of different Parts of the DRPU ASIC.

If one scales the dimensions of a process by s using constant field scaling, the frequency scales by a factor of 1/s. Similarly, the feature size decreases by s in each dimension, giving an s2 smaller total area. This principle was used to derive estimates for the 90nm process, which gives a post-synthesis typical case operating frequency of 465 MHz and a post-synthesis chip area of 18 mm2. Tables 8.1, 8.2, and 8.3 summarize the 130nm synthesis results and these 90nm estimates.
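The 90nm estimates follow directly from this scaling rule; the small sketch below merely reproduces the arithmetic from the 130nm post-synthesis results quoted above:

#include <cstdio>

int main() {
    // 130nm post-synthesis results from the text.
    double freq130MHz = 321.0;   // typical case operating frequency
    double area130mm2 = 37.9;    // total post-synthesis area

    // Constant field scaling from 130nm to 90nm: dimensions scale by s = 90/130.
    double s = 90.0 / 130.0;
    double freq90MHz = freq130MHz / s;        // ~464 MHz (quoted as ~465 MHz)
    double area90mm2 = area130mm2 * s * s;    // ~18.2 mm2 (quoted as ~18 mm2)

    std::printf("90nm estimate: %.0f MHz, %.1f mm2\n", freq90MHz, area90mm2);
}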

8.2 Place and Route (P&R)

While synthesis optimizes the circuit by taking only logical equations and abstract timing and wiring models into account, the cell net-list needs to be placed at physical locations to get accurate results. This placement and the later wiring is performed in the Place and Route (P&R) stage, which is the most important part of the ASIC design flow, and also the most difficult one. Badly placed designs can become unroutable if the routing congestion (wire density) gets too high at some locations. This high density can exhaust the routing resources or cause long parallel wires to induce noise in neighboring ones, causing malfunction of the chip.

Place and Route was performed using Cadence SOC Encounter [Sys06b]. Again there was enough memory to perform the preferable flat routing of the complete design, rather than a hierarchical one. The gate-level net-list from synthesis was imported into Encounter, together with the required cell and memory libraries from Virtual Silicon. These technology libraries include the .lib files, which specify the electrical characteristics of the standard cells, the .lef files for layout information such as pin placement and metal layer usage, and the Verilog .v files that contain the abstract interface for each cell. In Encounter, a die size of 7mm x 7mm was chosen, which means a cell density of about 50% for distributed logic. Choosing a smaller die can cause many problems with routing congestion and would leave no space for later optimization buffers and clock tree synthesis.

combinational   flip-flops   memories   total
429k            200k         280        629k

Table 8.1: Standard Cell Counts After Synthesis of the DRPU Architecture.

Process   combinational   non-combinational   total
130nm     3.9 mm2         33.9 mm2            37.9 mm2
90nm      1.9 mm2         16.1 mm2            18.0 mm2

Table 8.2: Area results of synthesis of the DRPU architecture using a 130nm process. The values for the 90nm process are estimated using constant field scaling.

          timing [ns]
Process   worst     typical
130nm     5.57 ns   3.11 ns
90nm      3.84 ns   2.15 ns

Table 8.3: Timing results of synthesis of the DRPU architecture using a 130nm process, for the typical case (1.25V, 25◦C) and the worst case (1.08V, 125◦C). The values for the 90nm process are estimated using constant field scaling.

Next, power rings and horizontal power stripes were added at a distance of 0.1 mm, and power routing was performed. However, the distance and thickness of the power stripes were not analyzed for sufficiency. Undersized power wires (especially at the external pin drivers) can cause induced voltages that lead to chip malfunctions.

Now the logic can be placed at physical locations on the die, which can in principle be done automatically by the tools. However, this automatic placement gave suboptimal results, as in particular the four SIMD ALUs of the Shading Processor had not been separated by the placement tool. Thus intensive hand placement of all main components of the design, and pre-placement of all memory blocks to good locations, was required. Low effort place and route showed routing congestion problems, so the hand placement was refined several times. The wiring of the source register access in the Shading Processor caused particular problems due to swizzling. After this hand-placement stage, a high effort timing driven placement was performed, which places and moves standard cells to match the timing constraints. In-place optimizations further improved the placement by modifying drive strengths.

Then the clock tree placement was performed automatically. In this stage P&R tries to optimally place and route the clock tree and additionally

Figure 8.6: Plot of the DRPU ASIC shown with only four of the six levels of metal wiring so that the memories are visible. The main units of the design (Shading Processor, Traversal Processor, Geometry Processor, and Update Processor) are highlighted. Smaller units are ignored and are located somewhere in-between. One sees that the Shading Processor consumes most area on the chip. inserts clock-tree buffers to minimize clock skew. In the following a post clock-tree optimization step was applied to the placed net-list. The final route phase, is the most time and resource consuming part of Place and Route. Besides, meeting setup and hold time of the registers, it inserts additional buffers and resized standard cells to match setup time, hold time, and the target frequency. A high effort final routing showed that the design is routable with a total wire length of 114m. Some post-routing optimizations to the design slightly improved the timing. The most important result of Place and Route is the frequncy and final chip area, which should be as small as possible to decrease manufac- turing costs and to improve yield. Because no pad placement was performed, the area numbers are without space for padding (additional 20µm on each side). The 7mm x 7mm core has a total size of 49mm2 and after place and route the design contains 187k flip-flops and 505k combinational standard cells. The area and cell counts are higher as after the synthesis step, due to wiring, clock tree, and optimization buffer insertion. The post layout timing estimates for the current version are 161 MHz (6.21ns) for worst case (1.08V, 125◦C) and 299 MHz (3.34ns) for typical case (1.25V, 25◦C). These precise speed estimates are lower than the post synthesis estimates and are approximately 70% of the maximum possible speed of the on-chip 158 CHAPTER 8. DRPU ASIC IMPLEMENTATION

combinational   flip-flops   memories   total
505k            187k         280        692k

Table 8.4: Post Place and Route Cell Counts.

Process   total
130nm     49.0 mm2
90nm      23.3 mm2

Table 8.5: Post place and route area results. The values for the 90nm process are estimated using constant field scaling.

          timing [ns]
Process   worst     typical
130nm     6.21 ns   3.34 ns
90nm      4.28 ns   2.30 ns

Table 8.6: Post place and route timing results of the DRPU architecture using a 130nm process, for the typical case (1.25V, 25◦C) and the worst case (1.08V, 125◦C). The values for the 90nm process are estimated using constant field scaling.

A clock-rate of 266 MHz (3.76ns) should easily be achievable if the chip were fabricated. At this frequency the DRPU ASIC would have a theoretical peak performance of 30.0 GFlops. Tables 8.4, 8.5, and 8.6 summarize the post Place and Route results for the 130nm version and extrapolate the results to a 90nm version using constant field scaling.

A plot of the DRPU layout is shown in Figure 8.6, with the memories visible as the large blocks and the main parts of the design highlighted. In total there are approximately 9 million non-memory transistors in the DRPU (692k standard cells, 187k of them flip-flops) and approximately 2.57 MBits of on-chip RAM in the caches (0.6 MBits), register files (1.2 MBits), and other memory structures (0.77 MBits), implemented in 280 generated memory blocks.
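The quoted peak floating point performance follows from the number of floating point units and the clock frequency, under the assumption of one floating point operation per unit and cycle; the following sketch reproduces the numbers that also appear in Table 8.7:

#include <cstdio>

int main() {
    // Peak GFlops = number of floating point units * clock frequency,
    // assuming one floating point operation per unit and cycle.
    struct { const char* name; int fpUnits; double freqMHz; } chips[] = {
        {"DRPU FPGA",  113,  66.0},
        {"DRPU ASIC",  113, 266.0},
        {"DRPU4 ASIC", 452, 266.0},
        {"DRPU8 ASIC", 904, 400.0},
    };
    for (const auto& c : chips)
        std::printf("%-10s %6.1f GFlops\n", c.name, c.fpUnits * c.freqMHz / 1000.0);
}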

8.3 Performance Evaluation

DRPU FPGA: The fully functional FPGA prototype, configured as described in Chapter 7, runs at 66 MHz with 1 GB/s peak memory bandwidth between the on-board DRAM and the on-chip caches. It turns out that half the peak memory bandwidth is sufficient for most of the test scenes (their DRAM usage is below 50%), thus the available bandwidth is scaled down to only 0.5 GB/s by a test circuit for all measurements.

The data in this chapter further differs from the numbers in Chapter 7 in that a resolution of 1024x768 pixels is used here. The higher coherence caused by this resolution often improves the rendering speed, thus the frame-rates from Chapter 7 cannot simply be divided by 4 for the FPGA version. The performance of the DRPU FPGA is measured directly on the running hardware by counting the number of cycles required to update the spatial index structures and to compute the image.
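Measuring on the hardware thus reduces to converting cycle counts into time; a minimal sketch, with assumed example counter values read back from the FPGA:

#include <cstdio>

int main() {
    const double clockHz = 66.0e6;          // DRPU FPGA clock
    // Assumed example counter values read back from the hardware.
    long long updateCycles = 2'000'000;     // B-KD Tree update for this frame
    long long renderCycles = 12'000'000;    // ray casting and shading

    double frameSeconds = (updateCycles + renderCycles) / clockHz;
    std::printf("frame time %.1f ms, %.1f fps\n",
                frameSeconds * 1000.0, 1.0 / frameSeconds);
}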

DRPU ASIC: The timing of the DRPU ASIC, with packet size 4 and 32 hardware threads (packets), is estimated from the post-layout timing analysis using Cadence SOC Encounter. Because of the four times higher frequency, performance numbers for this ASIC version are derived by scaling the FPGA frame-rates by four. This assumes the ASIC is connected to a 2.1 GB/s DRAM memory interface with 4 times the memory speed of the DRPU FPGA; because of the larger caches of the ASIC, the estimate is quite conservative.

DRPU4 ASIC: The DRPU4 ASIC gives performance estimates for a version that puts 4 copies of the basic DRPU ASIC on a single chip. This DRPU4 ASIC would fit on a 14mm x 14mm die (196mm2) at 130nm. Some additional area would be required for the interconnection of the 4 DRPU ASIC copies; however, as we did not build this network, it is not included in any numbers. If run at 266 MHz, the 452 floating point units of the DRPU4 ASIC would provide a peak floating point performance of 120.0 GFlops. The performance can again be scaled up linearly (by a factor of 4) if the chip were connected to a DRAM memory interface with 8.5 GB/s peak bandwidth, which could quite feasibly be implemented with two 64-bit wide DDR2 memory interfaces clocked at an effective 532 MHz. The results of these extrapolations are still quite accurate, because frequency (and area) are precise, and memory bandwidth is extrapolated linearly.

                    OpenRT        PS3-      DRPU   DRPU    DRPU4   DRPU8
                    P4/Northw.    Cell      FPGA   ASIC    ASIC    ASIC
Freq [MHz]          2,667.0       3,200.0   66.0   266.0   266.0   400.0
GFlops              10.6          256.0     7.5    30.0    120.2   361.6
process [nm]        130           90        90     130     130     90
die size [mm2]      145.0         221.0     -      49.0    196.0   186.6
bandwidth [GB/s]    8.5           25.0      0.5    2.1     8.5     25.6

Table 8.7: Comparison of the different hardware architectures: the OpenRT software implementation running on a Pentium 4, the Cell implementation, the DRPU FPGA implementation, the DRPU and DRPU4 ASIC implementations on a 130nm process, and the extrapolation of the DRPU8 ASIC to a 90nm process.

DRPU8 ASIC: Next, performance levels are extrapolated that could be achieved with the DRPU design when going from the 130nm process to a 90nm process, using the area and frequency values derived with constant field scaling. The typical case delay of 2.30ns for the 90nm process would allow an operating frequency of 400 MHz (2.5ns). With a die size of 23.3mm2 per copy, one can put eight instances of the standard DRPU ASIC on a single die, which makes a total die size of 186.4mm2. Again, neither the interconnect bus for the DRPU ASIC copies nor the area occupied by an external interface (like PCI Express) is included in any of the estimates.

To feed the chip with enough data one would need to connect this DRPU8 ASIC to a 25.6 GB/s memory interface. External memory interfaces at this speed are difficult to implement, but realistic when considering current high-end GPUs with external bandwidths of more than 40 GB/s. Compared to the DRPU4 ASIC, this DRPU8 ASIC would again give a speed-up of three. The 904 floating point units would provide a peak floating point performance of 362 GFlops, which is very close to the peak floating point performance of today's GPUs. Because of the high rendering performance, a high-speed PCI Express connection would be required to read back the rendered pixels for display. The results of the DRPU8 extrapolations are less accurate than the other numbers, because constant field scaling only gives approximate frequency numbers.
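All ASIC numbers in the following tables derive from the measured FPGA frame-rates by scaling with the clock frequency and the number of DRPU copies, assuming the memory interface is scaled along as described. A compact sketch of this extrapolation, using an assumed FPGA frame-rate as input:

#include <cstdio>

struct Variant { const char* name; double freqMHz; int copies; };

int main() {
    double fpgaFps = 5.0;                   // assumed measured FPGA frame-rate at 66 MHz
    const double fpgaFreqMHz = 66.0;

    Variant variants[] = {
        {"DRPU ASIC",  266.0, 1},
        {"DRPU4 ASIC", 266.0, 4},
        {"DRPU8 ASIC", 400.0, 8},
    };
    for (const auto& v : variants) {
        // Frame-rate scales with the clock frequency and, given sufficient
        // memory bandwidth, linearly with the number of DRPU copies.
        double fps = fpgaFps * (v.freqMHz / fpgaFreqMHz) * v.copies;
        std::printf("%-10s ~%.0f fps\n", v.name, fps);
    }
}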

Table 8.7 gives an overview of the frequency, peak floating point performance, and die characteristics of the FPGA prototype, the different ASIC design versions, and the Pentium 4 chip and Cell processor used for speed comparison. We use a subset of the benchmark scenes for the speed comparisons between the different architectures, see Table 8.8.

Scene        OpenRT    Cell        DRPU FPGA   DRPU ASIC   DRPU4 ASIC   DRPU8 ASIC
Shirley6     3.2 fps   180.0 fps   5.0 fps     20.0 fps    80.0 fps     240.0 fps
Office       2.6 fps   n/a         4.1 fps     16.4 fps    65.6 fps     196.8 fps
Conference   2.0 fps   60.0 fps    3.4 fps     13.6 fps    54.4 fps     163.2 fps
Gael         2.0 fps   n/a         3.8 fps     15.2 fps    60.8 fps     182.4 fps
Beetle       n/a       42.2 fps    3.4 fps     13.6 fps    54.4 fps     163.2 fps

Table 8.8: Performance comparison of the OpenRT software implementation running on a Pentium 4 at 2.66 GHz, the Cell ray tracer, the DRPU FPGA running at 66 MHz, post-layout estimates for the DRPU ASIC and the DRPU4 ASIC running at 266 MHz, and the DRPU8 ASIC running at 400 MHz. All performance numbers are for 1024x768 resolution with a Phong shader, without shadows or any further secondary rays.

A comparison of the performance of the FPGA prototype and the three ASIC versions against the OpenRT software ray tracer running on an Intel Pentium-4 at 2.66 GHz [DWBS03] and a Cell implementation of ray tracing [BWSF06] is performed. The results show that the FPGA version outperforms the software implementation by 40% to 70%, even though it is clocked at a 40 times lower frequency (see Table 8.7). The DRPU8 ASIC version would outperform the software ray tracer by a factor of up to 75. A comparison to the Cell implementation of ray tracing shows up to 2.5 times higher performance, despite the hardware complexity being similar (see Table 8.7) and the DRPU8 ASIC performing much more complex shading (including textures). This shows the high efficiency of the DRPU architecture compared to general purpose designs.

Table 8.9 shows performance numbers for some more test scenes, rendered with a realistically complex Phong shader of 71 instructions (the same shader as in Chapter 7) that performs bilinear texture lookup, a diffuse term, a specular term, light fall-off, vertex normal interpolation, vertex color interpolation, and shadows. For the test scenes, the estimated performance of the DRPU8 ASIC is between 80 and 280 frames per second, depending on the scene. Most important are typical game-like scenes, such as the Gael level, which renders at more than 90 frames per second at 1024x768 resolution even with two animated Skeleton instances. This performance is sufficient for game play and would leave a lot of room for improved image quality using adaptive oversampling techniques. Such adaptive oversampling techniques compute a first image of the scene, to which an edge detector is applied. The aliasing at the edges can then be removed by a second pass that shoots additional rays only at the edges.
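Such an adaptive oversampling pass could look roughly like the following sketch; it only illustrates the two-pass idea, and the shade() helper, the simple gradient edge detector, and the threshold are assumptions rather than part of the DRPU design:

#include <cmath>
#include <vector>

struct Color { float r, g, b; };

// Assumed helper: traces and shades one primary ray through pixel (x,y) at
// sub-pixel offset (dx,dy). Stubbed here so the sketch is self-contained.
static Color shade(int /*x*/, int /*y*/, float /*dx*/, float /*dy*/) { return {0.f, 0.f, 0.f}; }

static float luminance(const Color& c) { return 0.299f*c.r + 0.587f*c.g + 0.114f*c.b; }

void renderAdaptive(int w, int h, std::vector<Color>& img, float edgeThreshold = 0.1f) {
    img.assign(w * h, Color{0.f, 0.f, 0.f});

    // First pass: one primary ray per pixel.
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            img[y*w + x] = shade(x, y, 0.5f, 0.5f);

    // Second pass: run a simple gradient edge detector on the first-pass image
    // and shoot two extra rays only where it fires, averaging them with the
    // first sample to remove the aliasing at the edges.
    std::vector<Color> base = img;
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x) {
            float gx = luminance(base[y*w + x+1]) - luminance(base[y*w + x-1]);
            float gy = luminance(base[(y+1)*w + x]) - luminance(base[(y-1)*w + x]);
            if (std::sqrt(gx*gx + gy*gy) > edgeThreshold) {
                Color a = shade(x, y, 0.25f, 0.25f);
                Color b = shade(x, y, 0.75f, 0.75f);
                Color& p = img[y*w + x];
                p.r = (p.r + a.r + b.r) / 3.0f;
                p.g = (p.g + a.g + b.g) / 3.0f;
                p.b = (p.b + a.b + b.b) / 3.0f;
            }
        }
}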

8.4 Conclusions

This chapter presented the synthesis and place and route of the DRPU design, to obtain very precise timing and area results. The results have also been extrapolated from the 130nm process to a 90nm process, which is standard for today's rasterization graphics cards. The estimates show that performance levels sufficient for game play are achievable with ray tracing hardware, especially if high-end 90nm ASIC technology can be used. However, this thesis only deals with bandwidth, as we did not build or accurately simulate the memory interface required to connect the rendering units to memory. Designing a high-speed memory network could introduce additional latency; however, we believe that these effects would not be severe and could be hidden by the multi-threading approach.

A DRPU ASIC could offer much higher image quality and realism due to the use of ray tracing rather than rasterization. This would simplify content creation and make physically correct reflections, refractions, and accurate shadows possible in computer games. An ASIC implementation of the DRPU8 ASIC version would bring ray tracing close to a point where it could be a viable alternative to rasterization and be used in current computer games.

Table 8.9: Estimated performance of the DRPU versions with a Phong shader with shadows and textures for a number of benchmark scenes of varying complexity. The table lists the number of scene triangles and objects, the number of cycles required for updating the B-KD Tree and for rendering the image at 1024x768 resolution, and the resulting frame-rates for the FPGA, the DRPU ASIC, the DRPU4 ASIC, and the DRPU8 ASIC. Frames per second are computed directly from the number of cycles required for the computation. Compared to Chapter 7, some of these numbers are slightly higher than the Chapter 7 results divided by 4, because the 4 times higher resolution increases the coherence of the computation, even though the memory bandwidth of the FPGA was reduced.

Chapter 9

Final Summary, Conclusions, and Future Work

This chapter gives a final summary of the contributions of each chapter of this thesis, a brief description of possible future work, and finally the conclusions of the thesis.

Chapter 1 gives an overview of the rasterization and ray tracing algorithms. Both algorithms are compared with respect to different aspects such as shading, performance, dynamic scenes, and scalability. The comparison shows that ray tracing has many advantages for shading and scalability, but some issues in performance and in the support for dynamic scenes. This thesis shows that the performance issue can be fixed through the use of dedicated ray tracing hardware.

Chapter 2 gives a short overview of the DRPU hardware architecture. All its individual hardware units, the connections between them, and basic principles such as multi-threading and the packet approach have been discussed.

Chapter 3 describes in detail the hardware units used for handling ray casting in dynamic scenes. B-KD Trees are introduced as a new spatial index structure for handling dynamic scenes. For efficient traversal through this index structure a special Traversal Processor and Geometry Processor are used, and the bounds of the index structure are modified using an Update Processor. With this approach many kinds of dynamic scenes can be handled efficiently, with some performance limitations for random (or incoherent) motion.

Chapter 4 describes a special Shading Processor with an instruction set similar to the fragment processors of current GPUs. Additional features of this processor are hardware managed recursion, flexible memory access, and the implementation of a "trace" instruction to efficiently use the Traversal Processor and Geometry Processor for ray casting.


Chapter 5 describes a Skinning Processor that is optimized for the Skeleton Subspace Deformation skinning model. It can be added to the design with little overhead, as it shares the Geometry Processor for matrix transformations. Using the Skinning Processor, all the scene changes can be computed on the hardware with little communication with the host PC.

Chapter 6 describes the hardware description library HWML, developed to implement a prototype of the DRPU architecture. The functional language ML is used as the host language to implement a powerful structural hardware description library. HWML made it possible to implement the DRPU architecture in only 8000 lines of code.

Chapter 7 describes the prototype implementation of the DRPU architecture in detail, including many measurements for several example scenes. It shows that the FPGA achieves high efficiency, especially for scenes with high coherence, while highly triangulated scenes cause the FPGA to become bandwidth limited.

Chapter 8 describes an ASIC implementation of the DRPU architecture using a 130nm CMOS process. An optimal configuration of the ASIC was determined, and synthesis and place and route have been performed using standard tools. The results have been extrapolated to a 90nm process, which shows that with high end CMOS technology the ray tracing performance of about 70 single core CPUs could be put onto a single chip. Frame-rates of more than 90 fps at a resolution of 1024x768 pixels can easily be achieved, including complex shading.

9.1 Future Work

There are many areas of possible future work for ray tracing hardware. One of the most important would be to optimize the efficiency with which the external DRAM is accessed during rendering. Many cycles are wasted by the DRAM row management; however, by arranging the spatial index structure in small blocks, one could access larger blocks of memory more efficiently.

In many places the design was optimized for the FPGA platform used for prototyping. On an ASIC platform, memory is very expensive compared to distributed logic (on the FPGA it is the other way around), thus a major goal of an ASIC optimization of the design would be to reduce the required on-chip memory. This could best be achieved by reducing the number of threads through reduced latency: in the Shading Processor, for instance, forwarding could reduce the latency of many instructions (such as a simple add), as data could be used as soon as it is computed (a rough estimate of this relationship is sketched at the end of this section).

A different point not covered in this thesis is texturing. Texturing is supported in software, but high quality texture filtering (such as anisotropic filtering) is far too expensive on the Shading Processor. Connecting the design to a GPU-like texturing unit could help here.

Writing shaders in assembly code is very error prone and time consuming. Furthermore, handwritten code is often sub-optimal because humans tend to write code in a readable form, whereas for high speed assembly code the instructions of different computations often need to be merged. To overcome this limitation, a shading compiler for the DRPU that supports, for instance, RenderMan shaders would be very important.

Power consumption is becoming more and more important in hardware design. No power analysis or power optimizations were applied to the design, but some would be easy to implement, such as disabling pipeline stages that do not perform any computations.

More attention should also be given to the spatial index structure. Despite B-KD Trees being a good compromise between full AABVHs and KD Trees, their size could possibly be decreased further. Reducing the number of pointers or supporting traditional KD Tree nodes could help here.

Occlusion queries are used in rasterization hardware to determine if parts of the scene are visible. A similar approach could be added to the DRPU framework, as the Traversal Processor could watch whether certain B-KD Tree nodes are accessed during rendering. This information could be used by the application to determine the dynamic parts of the scene that need to be modified. Furthermore, virtual memory management could help to render large scenes by using the DRAM as a large cache that holds the working set of the current and possibly future frames.
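The connection between operation latency, thread count, and on-chip memory mentioned above can be made explicit with a rough estimate; the numbers below are assumptions for illustration, not DRPU measurements:

#include <cstdio>

int main() {
    // Assumed illustrative values, not DRPU measurements.
    int latencyCycles   = 24;    // worst-case latency an instruction result needs before reuse
    int regsPerThread   = 16;    // vector registers kept per hardware thread
    int bitsPerRegister = 128;   // 4 x 32 bit components

    // Without forwarding, roughly one independent thread per latency cycle is
    // needed to keep the pipeline busy, and every thread needs its own
    // register file slice, which dominates the on-chip memory.
    int threadsNeeded     = latencyCycles;
    long long regFileBits = 1LL * threadsNeeded * regsPerThread * bitsPerRegister;
    std::printf("~%d threads, ~%lld KBit of register file\n",
                threadsNeeded, regFileBits / 1024);

    // Forwarding that halves the effective latency would roughly halve both
    // numbers, which is the on-chip memory saving argued above.
}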

9.2 Final Conclusions

This thesis showed that special ray tracing hardware can achieve high performance, especially if it were implemented using a high end 90nm process. The DRPU uses B-KD Trees to efficiently handle coherent dynamic scene changes, which closes the largest remaining gap between rasterization and ray tracing.

Even though the DRPU design is still not perfect, this thesis provides the best available performance estimation for a dedicated ray tracing hardware solution. The performance estimates for a 90nm version indicate that the ray tracing performance of a single chip could be high enough for use in computer games.

It is difficult to predict how the field of computer graphics will change in the future, but ray tracing will definitely play a very important role. Whether dedicated hardware solutions for ray tracing can keep up with the rapidly increasing speed of general purpose processors is a difficult question. However, in a time where power consumption becomes more and more important, dedicated hardware solutions always have advantages over general purpose machines, because they are optimized for their application and consume less power.

Due to its advantages, I hope that fast, hardware-accelerated ray tracing will soon become widely available, as it provides a robust, easy-to-use, and powerful basis for advanced 3D graphics.

Appendix A

Abbreviations

Abbreviation   Full Name                                 Description
TP             Traversal Processor                       Traverses packets of 4 rays through the B-KD Tree.
GP             Geometry Processor                        Intersects packets of 4 rays with triangles or transforms them to the local coordinate space of an object instance.
RCU            Ray Casting Unit                          Performs ray casting and consists of TP and GP.
SP             Shading Processor                         Optimized processor that shades packets of 4 rays.
DRPU           Dynamic Ray Processing Unit               The hardware architecture described in this thesis.
GPU            Graphics Processing Unit                  Name for programmable rasterization graphics cards.
HWML           Hardware Meta Language                    A functional hardware description language developed and used to implement the DRPU architecture in this thesis.
SaarCOR        Saarbrücken's Coherence Optimized Ray Tracer   The hardware project at the Computer Graphics Lab at Saarland University where this thesis was written.
OpenRT         Open Ray Tracing Library                  A ray tracing application interface with open specification developed at Saarland University.

169 170 APPENDIX A. ABBREVIATIONS

Abbreviation   Full Name                                 Description
SIMD           Single Instruction Multiple Data          Computation concept of executing the same instruction component wise on vectors.
DRAM           Dynamic Random Access Memory              An external memory chip that stores the data in small capacitors.
BVH            Bounding Volume Hierarchy                 Tree like spatial index structure that stores a bounding representation for each node.
AABVH          Axis-aligned Bounding Volume Hierarchy    BVH with axis aligned boxes as bounds.
B-KD Tree      Bounded KD Tree                           Bounding Volume Hierarchy that stores bounds in a single dimension only.
DMA            Direct Memory Access                      Allows direct data transfers without the use of a CPU.
RTL            Register Transfer Logic                   Logic that uses registers and combinational circuits for the computations.
FIFO           First in first out buffer                 A hardware unit that acts like a buffer. The order of items is not modified and a limited number of items can be stored.
LUT            Lookup Table                              Typically a programmable 4 input boolean function that is used to implement distributed logic in an FPGA device.
FPGA           Field Programmable Gate Array             Hardware device that can implement arbitrary circuits.
ASIC           Application Specific Integrated Circuit   CMOS device that implements a user application.
CMOS           Complementary Metal Oxide Semiconductor   Hardware process that uses p- and n-channel MOSFET transistors to implement integrated circuits.
PCB            Printed Circuit Board                     A board that electrically connects electronic components.
SAH            Surface Area Heuristic                    Probabilistic heuristic to estimate the quality of a spatial index structure.

Appendix B

Phong Shader

; compute hit position
mad R0, R3.z, R1, R0            ; float3 H = ORG + t*DIR

; vertex 1 operations
add R15.x, R3.x, R3.y +         ; float upv = u+v

; get addresses
load_i32 I0, [R2.y]             ; (av0,av1,av2,amat) = *(int4*)[HITTRI]

; save loaded addresses
mov R4, I0

; compute shadow ray
add R5, C7, -R0 +               ; float3 Ds = C_L - H

; load vertex 1
load_i32 I0, [R4.y]             ; (ap1,ac1,an1,_) = *(int4*)[av1]
mad R0, C9.x, -R1, R0

; compute 1-(u+v)
add R15.y, C6.y, -R15.x         ; float uv = 1-upv

; save (ap1,ac1,an1,_)
mov R6, I0

; load normal of vertex 1
load_f32 I2, [R6.z]             ; float3 n1 = *(float3*)[an1]

; load vertex 2
load_i32 I3, [R4.z]             ; (ap2,ac2,an2,_) = *(int4*)[av2]

; compute light intensity
dp3_rcp_x R7, R5, R5            ; float3 I' = 1/dp3(Ds,Ds)
dp3_rsq_y R7, R5, R5            ; float divlength = 1/sqrt(dp3(Ds,Ds))

; save vertex 2
mov R6, I3

; interpolate vertex 1
mul R9, R3.x, I2                ; float3 lN = u*n1

; compute normalized shadow ray
mul R10, S.y, R5 +              ; float3 nDs = divlength*Ds
load_f32 I2, [R6.z]             ; float3 n2 = *(float3*)[an2]

; compute light intensity
mul R11, S.x, C8 +              ; float3 I = C_LI*I'
load_i32 I3, [R4.x]             ; (ap0,ac0,an0,_) = *(int4*)[av0]

; save vertex 0
mov R6, I3 +
load_f32 I0, [R2.x + 3]         ; Mx = *(float4*)[HITOBJ+0]

; interpolate vertex 2
mad R9, R3.y, I2, R9            ; float3 lN += v*n2

; load normal of vertex 0
load_f32 I2, [R6.z]             ; float3 n0 = *(float3*)[an0]

; interpolate vertex 0
mad R9, R15.y, I2, R9           ; float3 lN += uv*n0

; load object transformation
load_f32 I1, [R2.x + 4]         ; My = *(float4*)[HITOBJ+1]

; transform vertex normals
dp3 R14.x, I0, R9 +             ; float N'.x = dp3(Mx,lN)
load_f32 I2, [R2.x + 5]         ; Mz = *(float4*)[HITOBJ+2]
dp3 R14.y, I1, R9               ; float N'.y = dp3(My,lN)
dp3 R14.z, I2, R9               ; float N'.z = dp3(Mz,lN)

; compute inverse length of normal
dp3_rsq_z R8, R14, R14          ; float dlenN = 1/sqrt(dp3(N'.x,N'.y,N'.z))

; compute normalized normal
mul R14, S.z, R14               ; float3 N = dlenN*N'

; compute cosine between D and N
dp3_sat R8.x, R1, -R14          ; float dpDN = sat(dp3(D,-N))
dp3_sat R8.y, R10, R14          ; float cosaP = sat(dp3(nDs,N))
dp3_sat R8.z, R10, -R14         ; float cosaN = sat(dp3(nDs,-N))

; compute reflected direction
mad S0, 2*R8.x, R14, R1         ; float3 Dr = 2 * dpDN * N + D
add R8.y, R8.y, R8.z            ; float cosa = cosaP - cosaN
add R8.x, C6.y, R8.x +          ; float T7 = 1-dpDN

; load diffuse color
load_f32 I1, [R4.w + 3]         ; (Cd,refl) = *(float3*)[amat+3]

; compute cosine between normal and reflected direction
dp3_sat R2.z, S0, R10 +         ; float cosb = sat(dp3(Dr,nDs))

; load ambient color
load_f32 I0, [R4.w + 2]         ; (Ca,refr) = *(float3*)[amat+2]
mul R2.z, R2.z, R2.z +

; load specular color
load_f32 I2, [R4.w + 4]         ; (Cs,spec) = *(float4*)[amat+4]
mul R2.z, R2.z, R2.z            ; float T100 = T101 * T101

; combine colors
add R8.w, C6.y, R8.x            ; float omRf = 1 - Rf

; accumulate color
mov R7, I0
mul R12, R8.y, R11              ; T4 = cosa*I
mov R11, I1

; accumulate color
mad R7, R12, R11, R7            ; Cacc += T4*T5

shadow:
mul R11, R2.z, I2               ; T6 = Cs*cosb_pow_n
mul R15.w, I1.w, R8.w           ; T8 = refl * omRf

; accumulate color
mad R7, R11, R15.w, R7          ; Cacc += T6*T7
mov OUT, R7 + return

Appendix C

Mandelbrot Shader

; initialize current location with correct scaling
add R15, P, -C20.y
dp2h R0.x, C22, R15
dp2h R0.y, C23, R15
mov R2.xyzw, C20.x
mov R3.zw, C20.x
mov OUT, C20.x                  ; color of the mandelbrot-set: black

; cx = min_cx + x * dx;
mad R1.x, R0.x, -C21.z, -C21.x
; cy = min_cy + y * dy;
mad R1.y, R0.y, -C21.w, -C21.y

point_iteration:
add R15.x, C20.z, -R2.x + if x < 0 jmp select_color

; xt = x'*x' - y'*y' + cx;
mad R4.x, R3.w, R3.w, R1.x
mad R4.x, R3.z, R3.z, -R4.x

; yt = 2*x'*y' + cy;
mad R4.y, 2*R3.z, R3.w, R1.y

; x' = xt, y' = yt;
mov R3.zw, R4.xxxy

; max-iter will be reached if y < 0
add R15.y, C20.w, -R2.w + if y < 0 return

; iter = iter + 1;
mul R2.x, R3.z, R3.z + add R2.w, R2.w, C24.w

; abs = x'*x' + y'*y';
mad R2.x, R3.w, R3.w, R2.x + jmp point_iteration

select_color:
mov OUT, C20.y + return         ; white

Bibliography

[AL04] Timo Aila and Samuli Laine. Alias-free shadow maps. In Proceedings of EUROGRAPHICS Symposium on Rendering 2004, pages 161–166. Eurographics Association, 2004.

[Alp06] Alpha-Data. ADM-XRC-II. http://www.alphadata.uk.co, 2006.

[AM90] A. Apodaca and M. Mantle. Renderman: Pursuing the future of graphics. IEEE Computer Graphics & Applications, 10(4):44–49, July 1990.

[App68] Arthur Appel. Some Techniques for Shading Machine Renderings of Solids. SJCC, pages 27–45, 1968.

[Are88] Jeff Arenberg. Ray/Triangle Intersection with Barycentric Coordinates. http://www.acm.org/tog/resources/RTNews/html/rtnews5b.html, 1988.

[Arv88] J. Arvo. Linear-time voxel walking for octrees. Ray Tracing News (available at http://www.acm.org/tog/resources/RTNews/html/rtnews2d.html), 1(5), March 1988.

[ASR98] Paolo Montuschi Andrea Sanna and Massimo Rossi. A Flexible Algorithm for Multiprocessor Ray Tracing. Technical report, 1998.

[AW87a] Amanatides and Woo. A fast voxel traversal algorithm for ray tracing. In Proceedings of EUROGRAPHICS 1987, pages 3–10, 1987.

[AW87b] John Amanatides and Andrew Woo. A Fast Voxel Traversal Algorithm for Ray Tracing. In Proceedings of Eurographics, pages 3–10. Eurographics Association, 1987.

[Ben06] Carsten Benthin. Realtime Ray Tracing on current CPU Architectures. PhD thesis, Saarland University, 2006.

[Blu06] Bluespec. Bluespec Webpage, http://www.bluespec.com, 2006.


[BP90] Didier Badouel and Thierry Priol. An Efficient Parallel Ray Tracing Scheme for Highly Parallel Architectures. IRISA - Campus de Beaulieu - 35042 Rennes Cedex France, 1990.

[Bri06] Brigham Young University, USA. JHDL hardware description language. http://www.jhdl.org, 2006.

[BWS04] Carsten Benthin, Ingo Wald, and Philipp Slusallek. Interactive Ray Tracing of Free-Form Surfaces. In Proceedings of Afrigraph, pages 99–106, November 2004.

[BWSF06] Carsten Benthin, Ingo Wald, Michael Scherbaum, and Heiko Friedrich. Ray Tracing on the CELL Processor. In IEEE Symposium on Interactive Ray Tracing, 2006.

[CHH02] Nathan A. Carr, Jesse D. Hall, and John C. Hart. The ray engine. In Proceedings of Graphics Hardware, pages 37–46. Eurographics Association, 2002.

[Cor02] Intel Corporation. Introduction to Hyper-Threading Technology, http://developer.intel.com/technology/hyperthread, 2002.

[Cor06a] Intel Corporation. Official Webpage of Intel Corporation, www.intel.com, 2006.

[Cor06b] Microsoft Corporation. The Official Webpage of Microsoft Corporation, http://www.microsoft.com, 2006.

[Cor06c] Microsoft Corporation. Official DirectX Webpage from Microsoft, http://www.microsoft.com/directx, 2006.

[Cor06d] nVidia Corporation. Official nVidia Webpage, www.nvidia.com, 2006.

[CT82] R. L. Cook and K. E. Torrance. A reflectance model for computer graphics. ACM Trans. Graph., 1(1):7–24, 1982.

[CWBV83] John Cleary, Brian Wyvill, Graham Birtwistle, and Reddy Vatti. A Parallel Ray Tracing Computer. In Proceedings of the Association of Simula Users Conference, pages 77–80, 1983.

[DCDS05] Andreas Dietrich, Carsten Colditz, Oliver Deussen, and Philipp Slusallek. Realistic and Interactive Visualization of High-Density Plant Ecosystems. In Natural Phenomena 2005, Proceedings of the Eurographics Workshop on Natural Phenomena, pages 73–81, August 2005.

[Dre05] Patrick Drecker. Programmable Shading for the SaarCOR Architecture. Technical report, Computer Graphics Group, Saarland University, Germany, 2005.

[DWBS03] Andreas Dietrich, Ingo Wald, Carsten Benthin, and Philipp Slusallek. The OpenRT Application Programming Interface – Towards A Common API for Interactive Ray Tracing. In Proceedings of the 2003 OpenSG Symposium, pages 23–31, 2003.

[Fis83] Joseph A. Fisher. Very Long Instruction Word Architectures and the ELI-512. In Proceedings of the 10th Symposium on Computer Architectures, pages 140–150, 1983.

[FKN80] Henry Fuchs, Zvi M. Kedem, and Bruce F. Naylor. On visible surface generation by a priori tree structures. In SIGGRAPH ’80: Proceedings of the 7th annual conference on Computer graphics and interactive techniques, pages 124–133. ACM Press, 1980.

[FR03] Joshua Fender and Jonathan Rose. A High-Speed Ray Tracing Engine Built on a Field-Programmable System. In IEEE International Conf. On Field-Programmable Technology, pages 188–195, December 2003.

[FS05] Tim Foley and Jeremy Sugerman. KD-tree acceleration structures for a GPU raytracer. In HWWS ’05 Proceedings, pages 15–22. ACM Press, 2005.

[FvDFH97] Foley, van Dam, Feiner, and Hughes. Computer Graphics – Principles and Practice, second edition in C. Addison Wesley, 1997.

[Gam06] EPIC Games. Unreal 3 Engine, 2006.

[Gee06] Marcus Geelnard. The Official RayLab Homepage. http://www.etek.chalmers.se/e4geeln/raylab/index.html, 2006.

[Gla84] Andrew S. Glassner. Space subdivision for fast ray tracing. IEEE Computer Graphics and Applications, 4(10):15–22, 1984.

[GM05] Enrico Gobbetti and Fabio Marton. Far voxels: A multiresolution framework for interactive rendering of huge complex 3D models on commodity graphics platforms. ACM Trans. Graph., 24(3):878–885, 2005.

[GS87] Jeffrey Goldsmith and John Salmon. Automatic creation of object hierarchies for ray tracing. IEEE Computer Graphics and Applications, 7(5):14–20, May 1987.

[Hal01] D. Hall. The AR350: Today’s ray trace rendering processor. In Proceedings of the EUROGRAPHICS/SIGGRAPH workshop on Graphics Hardware - Hot 3D Session, 2001.

[Hav01] Vlastimil Havran. Heuristic Ray Shooting Algorithms. PhD thesis, Faculty of Electrical Engineering, Czech Technical University in Prague, 2001.

[Inc06a] Advanced Micro Devices Inc. The Official AMD Webpage, http://www.amd.com, 2006.

[Inc06b] Silicon Graphics Inc. OpenGL Webpage from Silicon Graphics, http://www.sgi.com/products/software/opengl/, 2006.

[Inc06c] Silicon Graphics Inc. SGI Webpage, http://www.sgi.com, 2006.

[Inc06d] Synopsys Inc. Synopsys Webpage, http://www.synopsys.com, 2006.

[IRT06] IRTC. Official Webpage of the Internet Ray Tracing Competi- tion, http://www.irtc.org/, 2006.

[Jan86] F. W. Jansen. Data structures for ray tracing. In Proceedings of the workshop on Data structures for Raster Graphics, pages 57–73, 1986.

[JC95] Henrik Wann Jensen and Niels Jorgen Christensen. Photon maps in bidirectional monte carlo ray tracing of complex objects. In Computers and Graphics vol. 19 (2), pages 215–224, March 1995.

[Jen03] Henrik Wann Jensen. SIGGRAPH course on Monte Carlo Ray Tracing, 2003.

[JMB04] Gregory S. Johnson, William R. Mark, and Christopher A. Burns. The Irregular Z-Buffer and its Application to Shadow Mapping. Technical report, The University of Texas at Austin, Department of Computer Sciences, 2004. Technical Report TR-04-09, April 15.

[Kaj86] James T. Kajiya. The rendering equation. In Computer Graphics (SIGGRAPH ’86 Proceedings), volume 20, pages 143–150, 1986.

[KH95] M. J. Keates and Roger J. Hubbold. Interactive ray tracing on a virtual shared-memory parallel computer. Computer Graphics Forum, 14(4):189–202, 1995.

[KHM+98] James T. Klosowski, Martin Held, Joseph S. B. Mitchell, Henry Sowizral, and Karel Zikan. Efficient Collision Detection Using Bounding Volume Hierarchies of k-DOPs. IEEE Transactions on Visualization and Computer Graphics, 4(1):21–36, 1998.

[KiSSO02] Hiroaki Kobayashi, Ken-ichi Suzuki, Kentaro Sano, and Nobuyuki Oba. Interactive Ray-Tracing on the 3DCGiRAM Architecture. In Proceedings of ACM/IEEE MICRO-35, 2002.

[KK86] Timothy L. Kay and James T. Kajiya. Ray Tracing Complex Scenes. Computer Graphics (Proceedings of ACM SIGGRAPH), 20(4):269–278, 1986.

[Kol] Craig Kolb. Rayshade Home-Page. http://graphics.stanford.edu/~cek/rayshade/rayshade.html.

[KS97] Günter Knittel and Wolfgang Straßer. VIZARD - Visualization Accelerator for Realtime Display. In HWWS ’97: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS workshop on Graphics hardware, pages 139–146, New York, NY, USA, 1997. ACM Press.

[KZK+06] Y. Kaeriyama, D. Zaitsu, K. Komatsu, K. Suzuki, N. Ohba, and T. Nakamura. Hardware for a Ray Tracing Technique Using Plane-Sphere Intersections. 2006.

[LAM01] Jonas Lext and Tomas Akenine-Möller. Towards Rapid Reconstruction for Animated Ray Tracing. In Eurographics 2001 – Short Presentations, pages 311–318, 2001.

[Lan06a] The Verilog Hardware Description Language. http://www.verilog.com, 2006.

[Lan06b] VHDL Hardware Description Language. VHDL Wikipedia Entry, http://de.wikipedia.org/wiki/VHDL, 2006.

[LLC06] Bethesda Softworks LLC. Official Webpage of Elderscrolls, http://www.elderscrolls.com, 2006.

[MBFS06] Gerd Marmitt, Roman Brauchle, Heiko Friedrich, and Philipp Slusallek. Accelerated and Extended Building of Implicit kd-Trees for Volume Ray Tracing. In Proceedings of 11th International Fall Workshop - Vision, Modeling, and Visualization (VMV) 2006, Aachen, Germany, November 2006. Akademische Verlagsgesellschaft Aka.

[MFS05] Gerd Marmitt, Heiko Friedrich, and Philipp Slusallek. Recent Advancements in Ray-Tracing based Volume Rendering Techniques. In Günther Greiner, Joachim Hornegger, Heinrich Niemann, and Marc Stamminger, editors, Proceedings of 10th International Fall Workshop - Vision, Modeling, and Visualization (VMV) 2005, pages 131–138, Erlangen, Germany, November 2005. Akademische Verlagsgesellschaft Aka.

[Mit06] Mitrion. The Mitrion-C Hardware Description Language, www.mitrion.com, 2006.

[MKS98] M. Meissner, U. Kanus, and W. Strasser. VIZARD II, A PCI-Card for Real-Time Volume Rendering. In EUROGRAPHICS/SIGGRAPH Workshop on Graphics Hardware, 1998.

[Mor06] MorphoSys. Webpage of the MorphoSys Project: http://www.eng.uci.edu/morphosys/, 2006.

[MS00] Alan Mycroft and Richard Sharp. A statically allocated parallel functional language. In Automata, Languages and Programming, pages 37–48, 2000.

[MS06] Gerd Marmitt and Philipp Slusallek. Fast Ray Traversal of Tetrahedral and Hexahedral Meshes for Direct Volume Rendering. In Thomas Ertl, Ken Joy, and Beatriz Santos, editors, Proceedings of Eurographics/IEEE-VGTC Symposium on Visualization (EuroVIS) 2006, Lisbon, Portugal, May 2006.

[MT97] Tomas Möller and Ben Trumbore. Fast, minimum storage ray triangle intersection. Journal of Graphics Tools, 2(1):21–28, 1997.

[MTH90] Robin Milner, Mads Tofte, and Robert Harper. The Definition of Standard ML, 1990.

[MTLT88] N. Magnenat-Thalmann, R. Laperriere, and D. Thalmann. Joint-dependent local deformations for hand animation and object grasping. In Proceedings on Graphics Interface ’88, pages 26–33, Toronto, Ont., Canada, 1988. Canadian Information Processing Society.

[MTT91] Nadia Magnenat-Thalmann and Daniel Thalmann. Human body deformations using joint-dependent local operators and finite-element theory. pages 243–262, 1991.

[Muu95] Michael J. Muuss. Towards real-time ray-tracing of combinatorial solid geometric models. In Proceedings of BRL-CAD Symposium ’95, June 1995.

[MWM01] M. McCool, C. Wales, and K. Moule. Incremental and hierarchical hilbert order edge equation polygon rasterization, 2001.

[Neb97] Jean-Christophe Nebel. A Mixed Dataflow Algorithm for Ray-Tracing on the CRAY T3E. In Third European CRAY-SGI MPP Workshop, September 1997.

[Pat03] Parikshit Patidar. Hardware Accelerator for Ray Tracing, M.Tech Project Part I Report. Technical report, 2003.

[Per98] Per Bjesse, Koen Claessen, Mary Sheeran, and Satnam Singh. Lava: Hardware Design in Haskell. In ICFP ’98 in Baltimore (Maryland, USA), 1998.

[PH04] Matt Pharr and Greg Humphreys. Physically Based Rendering: From Theory to Implementation. Morgan Kaufmann, 2004.

[PHK+99] Hanspeter Pfister, Jan Hardenbergh, Jim Knittel, Hugh Lauer, and Larry Seiler. The VolumePro real-time ray-casting system. Computer Graphics, 33, 1999.

[Pho75] Bui-Tuong Phong. Illumination for Computer Generated Pictures. CACM, 18(6):311–317, 1975.

[Pop06] Stefan Popov. Stackless KD-Tree Traversal for Ray Tracing on Graphics Hardware, Master Thesis, 2006.

[Pos06] Poser. Poser Web Page. http://www.e-frontier.com, 2006.

[PSL+99] Steven Parker, Peter Shirley, Yarden Livnat, Charles Hansen, and Peter Pike Sloan. Interactive ray tracing. In Interactive 3D Graphics (I3D), pages 119–126, April 1999.

[Pur04] Timothy J. Purcell. Ray Tracing on a Stream Processor. PhD thesis, Stanford University, 2004.

[Ray06] POV Ray. POV Ray Webpage: Persistence of Vision, http://www.povray.org/, 2006.

[RSH00] Erik Reinhard, Brian Smits, and Chuck Hansen. Dynamic Acceleration Structures for Interactive Ray Tracing. In Proceedings of the Eurographics Workshop on Rendering, pages 299–306, Brno, Czech Republic, June 2000.

[RSH05] Alexander Reshetov, Alexei Soupikov, and Jim Hurley. Multi-Level Ray Tracing Algorithm. ACM Transactions on Graphics, 24(3):1176–1185, 2005. (Proceedings of ACM SIGGRAPH).

[RW80] Steve M. Rubin and Turner Whitted. A three-dimensional representation for fast rendering of complex scenes. Computer Graphics, 14(3):110–116, July 1980.

[San76] L. A. Santaló. Integral geometry and geometric probability, volume 1 of Encyclopedia of mathematics and its applications; V. 1 Section, Probability. Addison-Wesley Pub. Co., 1976.

[Sch06] Jörg Schmittler. SaarCOR - A Hardware Architecture for Realtime Ray Tracing. PhD thesis, Saarland University, 2006.

[SEDT+03] Marcos Sanchez-Elez, Haitao Du, Nozar Tabrizi, Yun Long, Nader Bagherzadeh, and Milagros Fernández. Algorithm optimizations and mapping scheme for interactive ray tracing on a reconfigurable architecture. Computers and Graphics, 27(5):701–713, 2003.

[SF90] K. R. Subramanian and D. S. Fussel. A search structure based on k-d trees for efficient ray tracing. Technical Report PhD Dissertation, Tx 78712-1188, The University of Texas at Austin, December 1990.

[Sil06] Virtual Silicon. Webpage of Virtual Silicon: Your Source For IP, http://www.virtual-silicon.com/, 2006.

[SLS03] Jörg Schmittler, Alexander Leidinger, and Philipp Slusallek. A Virtual Memory Architecture for Real-Time Ray Tracing Hardware. Computers and Graphics, Volume 27, Graphics Hardware, pages 693–699, 2003.

[Slu06] Philipp Slusallek. Webpage of the Computer Graphics Lab at Saarland University, http://graphics.cs.uni-sb.de, 2006.

[Smi98] Brian Smits. Efficiency Issues for Ray Tracing. Journal of Graphics Tools, 3(2):1–14, 1998.

[SPS95] Philipp Slusallek, Thomas Pflaum, and Hans-Peter Seidel. Using procedural RenderMan shaders for global illumination. In Computer Graphics Forum (Proc. of Eurographics ’95), pages 311–324, 1995.

[Sri02] James Richard Srinivasan. Hardware Accelerated Ray Tracing, Part II Computer Science. PhD thesis, Jesus College, 2002.

[SS95] Ph. Slusallek and H.-P. Seidel. Vision: An Architecture for Global Illumination Calculations. In IEEE Transactions on Visualization and Computer Graphics, 1(1), pages 77–96, March 1995.

[SWS02] Jörg Schmittler, Ingo Wald, and Philipp Slusallek. SaarCOR – A Hardware Architecture for Ray Tracing. In Proceedings of the ACM SIGGRAPH/Eurographics Conference on Graphics Hardware, pages 27–36, 2002.

[SWW+04] Jörg Schmittler, Sven Woop, Daniel Wagner, Wolfgang J. Paul, and Philipp Slusallek. Realtime Ray Tracing of Dynamic Scenes on an FPGA Chip. In Proceedings of Graphics Hardware, 2004.

[Sys06a] SystemC Community. SystemC, www.systemc.org, 2006.

[Sys06b] Cadence Design Systems. Official Webpage of Cadence Design Systems, http://www.cadence.com, 2006.

[TI06] ATI Technologies Inc. Official ATI webpage, www.ati.com, 2006.

[TL03] Thomas Larsson and Tomas Akenine-Möller. Strategies for Bounding Volume Hierarchy Updates for Ray Tracing of Deformable Models. Technical report, February 2003.

[Tom06] Tom Hawkins. Confluence tutorial and reference manual. www.launchbird.com, 2006.

[UMC06] UMC. Webpage of UMC: The SoC Solution Foundry, http://www.umc.com, 2006.

[Uni06] Saarland University. Homepage of Saarland University, http://www.uni-sb.de, 2006.

[vdB97] G. van den Bergen. Efficient Collision Detection of Complex Deformable Models using AABB Trees. In Journal of Graphics Tools, 2(4), pages 1–14, 1997.

[Wal04] Ingo Wald. Realtime Ray Tracing and Interactive Global Illumination. PhD thesis, Computer Graphics Group, Saarland University, 2004.

[War92] G.J. Ward. Measuring and modeling anisotropic reflection. In Proceedings of SIGGRAPH, 1992.

[WBS02] Ingo Wald, Carsten Benthin, and Philipp Slusallek. OpenRT – A Flexible and Scalable Rendering Engine for Interactive 3D Graphics. Technical report, Computer Graphics Group, Saarland University, http://www.openrt.de, 2002.

[WBS03] Ingo Wald, Carsten Benthin, and Philipp Slusallek. Distributed Interactive Ray Tracing of Dynamic Scenes. In Proceedings of the IEEE Symposium on Parallel and Large-Data Visualization and Graphics (PVG), 2003.

[WBS06a] Ingo Wald, Solomon Boulos, and Peter Shirley. Ray Tracing Deformable Scenes using Dynamic Bounding Volume Hierarchies (revised version). Technical Report, SCI Institute, University of Utah, No UUSCI-2006-023, 2006.

[WBS06b] Sven Woop, Erik Brunvand, and Philipp Slusallek. HWML: RTL/Structural Hardware Description using ML. Technical report, Computer Graphics Lab, Saarland University, 2006.

[WDS04] Ingo Wald, Andreas Dietrich, and Philipp Slusallek. An Interactive Out-of-Core Rendering Framework for Visualizing Massively Complex Models. In Rendering Techniques 2004, Proceedings of the Eurographics Symposium on Rendering, pages 81–92, 2004.

[Weg06] Tomasz Wegrzanowski. RenderMan Compiler for the RPU Hardware Architecture, Master Thesis, 2006.

[WGS04] Ingo Wald, Johannes Günther, and Philipp Slusallek. Balancing Considered Harmful – Faster Photon Mapping using the Voxel Volume Heuristic. Computer Graphics Forum, 22(3):595–603, 2004. (Proceedings of Eurographics).

[WH05] Neil Weste and David Harris. CMOS VLSI Design: A Circuits and Systems Perspective. Addison Wesley, 2005.

[Whi80] Turner Whitted. An Improved Illumination Model for Shaded Display. CACM, 23(6):343–349, 1980.

[WIK+06] Ingo Wald, Thiago Ize, Andrew Kensler, Aaron Knoll, and Steven G Parker. Ray Tracing Animated Scenes using Coherent Grid Traversal. ACM SIGGRAPH 2006, 2006.

[WK06] Carsten Wächter and Alexander Keller. Instant Ray Tracing: The Bounding Interval Hierarchy. In Rendering Techniques 2006: EuroGraphics Symposium on Rendering, 2006.

[WKB+02] Ingo Wald, Thomas Kollig, Carsten Benthin, Alexander Keller, and Philipp Slusallek. Interactive Global Illumination using Fast Ray Tracing. In Paul Debevec and Simon Gibson, editors, Rendering Techniques 2002, pages 15–24, Pisa, Italy, June 2002. Eurographics Association, Eurographics. (Proceedings of the 13th Eurographics Workshop on Rendering).

[WMS06] Sven Woop, Gerd Marmitt, and Philipp Slusallek. B-KD Trees for Hardware Accelerated Ray Tracing of Dynamic Scenes. In Proceedings of Graphics Hardware, 2006.

[WPS+03] Ingo Wald, Timothy J. Purcell, Jörg Schmittler, Carsten Benthin, and Philipp Slusallek. Realtime Ray Tracing and its Use for Interactive Global Illumination. In Eurographics State of the Art Reports, pages 85–122, 2003.

[WSBW01] Ingo Wald, Philipp Slusallek, Carsten Benthin, and Markus Wagner. Interactive Rendering with Coherent Ray Tracing. Computer Graphics Forum, 20(3):153–164, 2001. Proceedings of Eurographics.

[WSS05] Sven Woop, Jörg Schmittler, and Philipp Slusallek. RPU: A Programmable Ray Processing Unit for Realtime Ray Tracing. In SIGGRAPH 2005 Conference Proceedings, pages 434–444, 2005.

[Xil06a] Xilinx. Virtex4 Homepage. http://www.xilinx.com/virtex4, 2006.

[Xil06b] Xilinx. Xilinx Homepage. http://www.xilinx.com, 2006.

[Zim03] Paul Zimmons. The Cell Architecture, 2003. http://www.ps3portal.com/downloads/cell.ppt.